Correcting typos in search queries

Sooner or later, almost any service that has a search function needs to learn to correct errors in user queries. Errare humanum est: users mistype and misspell constantly, search quality inevitably suffers from it, and the user experience suffers with it.

Moreover, every service has its own specifics, its own vocabulary that the typo corrector must be able to handle, which greatly complicates the use of off-the-shelf solutions. For example, here is the kind of query our corrector had to learn to handle:

 
[Screenshot: a query that looks like "vertical reality glasses" corrected to "virtual reality glasses"]

It may look as if we denied the user their dream of vertical reality, but in fact the letter К simply sits next to the letter У on the Russian keyboard.

In this article we will walk through one of the classic approaches to typo correction, from building the model to writing code in Python and Go. And as a bonus, a video of my Highload++ talk, "Vertical reality glasses: correcting typos in search queries", is included at the end.


Problem Statement

So, we received a mistyped query, and it needs to be corrected. The problem is usually stated mathematically as follows:

- we are given a word $s$ that was sent to us with errors;
- we have a dictionary of correct words;
- for every word $w$ in the dictionary there is a conditional probability $P(w \mid s)$ that the word $w$ was meant, given that we received the word $s$;
- we need to find the word $w$ from the dictionary with the maximum value of $P(w \mid s)$.

This formulation, the most elementary one, assumes that if we receive a query of several words, we correct each word separately. In reality, of course, we will want to correct the whole phrase, taking into account how well neighboring words fit together; I will get to that below, in the section "How to correct phrases".

There are two unclear points here: where to get the dictionary, and how to compute $P(w \mid s)$. The first question is considered the easy one. In 1990 [1] the dictionary was assembled from the word list of the spell utility and from electronic dictionaries; in 2009 Google [4] did it even more simply and just took the most popular words on the Internet (together with popular erroneous spellings). I took the same approach to build my corrector.

The second question is harder, if only because its solution usually begins with applying Bayes' formula!

$$P(w \mid s) = \frac{P(s \mid w) \cdot P(w)}{P(s)} \propto P(s \mid w) \cdot P(w)$$

(The denominator $P(s)$ does not depend on $w$, so it can be ignored when maximizing over $w$.) Now, instead of the original incomprehensible probability, we need to estimate two new, slightly more comprehensible ones: $P(s \mid w)$, the probability that while typing the word $w$ one can make a typo and end up with $s$, and $P(w)$, the probability of the user using the word $w$ at all.

How do we estimate $P(s \mid w)$? Clearly, a user is more likely to confuse А with О than, say, two letters from opposite corners of the keyboard. And if we are correcting text recognized from a scanned document, there is a high probability of confusing rn and m. One way or another, we need some model that describes errors and their probabilities.

Such a model is called a noisy channel model (in our case the noisy channel starts somewhere in the user's Broca's area and ends on the other side of their keyboard) or, more briefly, an error model. This model, which gets its own section below, will be responsible for accounting both for spelling mistakes and, indeed, for typos.

The probability of a word being used, $P(w)$, can be estimated in different ways. The simplest option is to take the frequency with which the word occurs in some large corpus of texts. For our corrector, which takes the context of the phrase into account, we will of course need something more complicated: yet another model. This model is called a language model.


Error model

The first error models estimated $P(s \mid w)$ by counting the probabilities of elementary substitutions in a training sample: how many times E was written instead of I, how many times one letter was typed instead of an adjacent one, and so on [1]. The result is a model with a small number of parameters that can learn some local effects (for example, that people often confuse E and I).

In our work we settled on a more advanced error model, proposed in 2000 by Brill and Moore [2] and reused later (for example, by Google engineers [4]). Imagine that users do not think in terms of individual characters (confusing E and I, hitting K instead of U, dropping a soft sign), but can turn arbitrary fragments of a word into arbitrary other fragments: replace one suffix with another, U with K, SS with S, and so on. The probability that the user, intending to type the fragment α, made a typo and typed the fragment β instead, we denote $P(\alpha \to \beta)$; these are the parameters of our model. If we can compute $P(\alpha \to \beta)$ for all possible fragments α, β, then the desired probability $P(s \mid w)$ of typing the word s while trying to type the word w can, in the Brill and Moore model, be obtained as follows: split both w and s into fragments in all possible ways so that the two words get the same number of fragments, compute for each such partition the product of the probabilities of every fragment of w turning into the corresponding fragment of s, and take the maximum over all partitions as the value of $P(s \mid w)$:

$$P(s \mid w) = \max_{\substack{w = \alpha_1 \alpha_2 \ldots \alpha_k \\ s = \beta_1 \beta_2 \ldots \beta_k}} \; \prod_{i=1}^{k} P(\alpha_i \to \beta_i)$$

Let's look at an example of the partitions that arise when computing the probability of typing "accesory" instead of "accessory". One possible partition looks like this:

$$P(\text{a} \to \text{ac}) \cdot P(\text{cce} \to \text{ce}) \cdot P(\text{ss} \to \text{so}) \cdot P(\text{ory} \to \text{ry})$$

As you probably noticed, this is an example of a not-so-good partition: the parts of the words did not line up under each other as well as they could have. If the values of $P(\text{a} \to \text{ac})$ and $P(\text{cce} \to \text{ce})$ are still tolerable, then $P(\text{ss} \to \text{so})$ and $P(\text{ory} \to \text{ry})$ will most likely make the final "score" of this partition thoroughly sad. A better partition looks something like this:

$$P(\text{acce} \to \text{acce}) \cdot P(\text{ss} \to \text{s}) \cdot P(\text{ory} \to \text{ory})$$

Here everything immediately falls into place, and it is clear that the final probability will be determined mainly by the value of $P(\text{ss} \to \text{s})$.

How to compute $P(s \mid w)$

Even though the number of possible partitions of two words grows exponentially with their length, $P(s \mid w)$ can be computed fairly quickly with dynamic programming, in $O(|w|^2 \cdot |s|^2)$ time. The algorithm itself strongly resembles the Wagner-Fischer algorithm for computing the Levenshtein distance.

We will build a rectangular table whose rows correspond to the letters of the correct word and whose columns to the letters of the mistyped one. By the end of the algorithm, the cell at the intersection of row i and column j will hold exactly the probability of getting s[:j] while trying to type w[:i]. To compute it, it suffices to take the values already computed in earlier rows and columns and run over them, multiplying each by the corresponding $P(\alpha \to \beta)$, where α and β are the fragments of w and s lying between that earlier cell and the current one; the maximum of these products becomes the value of the cell.

[Illustration: a partially filled table. To fill the gray cell in the fourth row and third column we take the maximum over products computed from the green cells; if fragments of length two are also allowed, the yellow cells have to be examined as well.]

The complexity of this algorithm is, as I mentioned above, $O(|w|^2 \cdot |s|^2)$: we fill a table of size $|w| \times |s|$, and filling the cell (i, j) takes $O(|w| \cdot |s|)$ operations. However, if we restrict ourselves to fragments of length at most $L$ (for example, at most two letters, as in [4]), the complexity drops to $O(|w| \cdot |s| \cdot L^2)$. In my experiments with Russian I used $L = 3$.

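To make the dynamic program concrete, here is a minimal Python sketch of the computation (the production version described later in this article is written in Go). The `log_probs` dictionary of fragment log-probabilities and the `UNSEEN` floor for unobserved fragment pairs are assumptions made purely for illustration.

```python
import math

UNSEEN = -20.0  # assumed log-probability of a fragment pair we never observed


def log_p_s_given_w(w, s, log_probs, max_l=2):
    """Brill-Moore style DP: the best log P(s | w) over all fragment partitions.

    log_probs maps (alpha, beta) fragment pairs to log P(alpha -> beta);
    fragments are limited to max_l characters, so the whole computation
    takes O(|w| * |s| * max_l^2) steps.
    """
    m, n = len(w), len(s)
    # best[i][j] = best log-probability of turning w[:i] into s[:j]
    best = [[-math.inf] * (n + 1) for _ in range(m + 1)]
    best[0][0] = 0.0
    for i in range(m + 1):
        for j in range(n + 1):
            if best[i][j] == -math.inf:
                continue
            # extend the partition by one more fragment pair
            for di in range(max_l + 1):
                for dj in range(max_l + 1):
                    if di == dj == 0 or i + di > m or j + dj > n:
                        continue
                    pair = (w[i:i + di], s[j:j + dj])
                    cand = best[i][j] + log_probs.get(pair, UNSEEN)
                    if cand > best[i + di][j + dj]:
                        best[i + di][j + dj] = cand
    return best[m][n]
```

For example, `log_p_s_given_w("accessory", "accesory", log_probs)` would pick the best partition, including the one built around the ss → s fragment from the example above.
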
How to maximize $P(s \mid w) \cdot P(w)$

We have learned to find $P(s \mid w)$ in polynomial time; that's good. But we need to learn to quickly find the best words in the entire dictionary, and the best not by $P(s \mid w)$ but by $P(s \mid w) \cdot P(w)$! In practice it is enough to get some reasonable top (say, the best 20) of the words by $P(s \mid w)$, which we will then pass to the language model to choose the most appropriate corrections (more on that below).

To learn to walk the entire dictionary quickly, note that the tables described above will have a lot in common for two words with a common prefix. Indeed, if, while correcting the word "accesory", we try to fill the table for the two dictionary words "accessory" and "accessories", we will notice that the rows corresponding to their common prefix do not differ at all! If we can organize the pass over the dictionary so that consecutive words have long common prefixes, we can save a lot of computation.

And we can. Let's take the dictionary words and build a trie out of them. Traversing it depth-first, we get the desired property: most steps are steps from a node down to its child, for which it suffices to fill in just the last few rows of the table.

This algorithm, with some additional optimizations, allows iterating over the dictionary of a typical European language of 50-100 thousand words within about a hundred milliseconds [2]. And caching the results makes the process even faster.

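Here is a rough Python sketch of that traversal; the trie layout, the row-by-row reuse and the handling of insertions at the very beginning of a word are simplified assumptions rather than the optimized implementation from [2].

```python
import math


class TrieNode:
    def __init__(self):
        self.children = {}  # letter -> TrieNode
        self.word = None    # the full dictionary word, if one ends here


def build_trie(words):
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.word = word
    return root


def best_corrections(s, root, log_probs, max_l=2, unseen=-20.0, top_n=20):
    """DFS over the dictionary trie, reusing DP rows for shared prefixes."""
    n = len(s)
    results = []  # (log P(s | w), w)

    def initial_row():
        # row for the empty prefix: only insertions of fragments of s
        row = [-math.inf] * (n + 1)
        row[0] = 0.0
        for j in range(1, n + 1):
            for dj in range(1, min(max_l, j) + 1):
                prev = row[j - dj]
                if prev > -math.inf:
                    cand = prev + log_probs.get(("", s[j - dj:j]), unseen)
                    row[j] = max(row[j], cand)
        return row

    def next_row(prefix, rows):
        # rows[i][j] = best log-prob of turning prefix[:i] into s[:j]
        i = len(prefix)
        row = [-math.inf] * (n + 1)
        for j in range(n + 1):
            for di in range(0, min(max_l, i) + 1):
                for dj in range(0, min(max_l, j) + 1):
                    if di == dj == 0:
                        continue
                    prev = (row if di == 0 else rows[i - di])[j - dj]
                    if prev == -math.inf:
                        continue
                    pair = (prefix[i - di:i], s[j - dj:j])
                    row[j] = max(row[j], prev + log_probs.get(pair, unseen))
        return row

    def dfs(node, prefix, rows):
        if node.word is not None:
            results.append((rows[-1][n], node.word))
        for ch, child in node.children.items():
            rows.append(next_row(prefix + ch, rows))
            dfs(child, prefix + ch, rows)
            rows.pop()  # only the rows of the current prefix stay allocated

    dfs(root, "", [initial_row()])
    results.sort(reverse=True)
    return results[:top_n]
```

A real implementation would presumably add pruning of hopeless branches on top of this, plus the caching mentioned above.
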
How to obtain $P(\alpha \to \beta)$

Computing $P(\alpha \to \beta)$ for all the fragments under consideration is the most interesting and non-trivial part of building the error model. Its quality depends precisely on these values.

The approach used in [2, 4] is relatively simple. Let's find a large set of pairs $(w, s)$, where $w$ is a correct dictionary word and $s$ is its mistyped version. (How exactly to find them is described just below.) Now from these pairs we need to extract the probabilities of specific typos, that is, of replacing some fragments with others.

For each pair we take its components $w$ and $s$ and build a correspondence between their letters that minimizes the Levenshtein distance:

    a c c e s s o r y
    a c c e s _ o r y

Now we immediately see the single-character replacements: a → a, c → c, s → s, s → empty string, and so on. We also see replacements of two or more characters: ac → ac, es → es, ss → s, so → o, sor → or, and so on and so forth. All these replacements must be counted, each one as many times as the word $s$ occurs in the corpus (if we took the words from a corpus, which is very likely).

Having gone over all the pairs $(w, s)$, we take, as the value of $P(\alpha \to \beta)$, the number of α → β replacements encountered in our pairs (weighted by the occurrences of the corresponding words), divided by the number of occurrences of the fragment α.

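In code, this estimation step might look like the minimal sketch below. The `extract_edits` callable is assumed to yield (α, β) fragment pairs for one (w, s) pair, for example the `generate_modifications` function shown in the implementation section; the denominator here is approximated by how often α shows up among the extracted alignments rather than by its total count in the corpus.

```python
import math
from collections import Counter


def estimate_error_model(pairs, extract_edits):
    """Turn (intended, misspelled, weight) triples into log P(alpha -> beta).

    pairs: iterable of (intended, misspelled, weight), where weight is how
    often the misspelled word occurs in the corpus.
    extract_edits(intended, misspelled): yields (alpha, beta) fragment pairs.
    """
    edit_counts = Counter()
    fragment_counts = Counter()
    for intended, misspelled, weight in pairs:
        for alpha, beta in extract_edits(intended, misspelled):
            edit_counts[(alpha, beta)] += weight
            fragment_counts[alpha] += weight
    return {
        (alpha, beta): math.log(count / fragment_counts[alpha])
        for (alpha, beta), count in edit_counts.items()
    }
```
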
How do we find the pairs $(w, s)$? In [4] the following approach is proposed. Take a large corpus of user-generated content (UGC). In Google's case these were simply the texts of hundreds of millions of web pages; in ours, millions of user search queries and reviews. It is assumed that a correct word usually occurs in the corpus more often than any of its misspelled variants. So, for each word, let's find the words of the corpus that are close to it by Levenshtein distance but much less popular (for example, ten times less popular). The popular one we take as $w$, the less popular one as $s$. This gives us a noisy but reasonably large set of pairs on which training is possible.

This pair-matching algorithm leaves plenty of room for improvement. In [4] only a frequency filter is proposed ($w$ is ten times more popular than $s$), because the authors of that paper are trying to build a corrector without any a priori knowledge of the language. If we only care about Russian, we can, for example, take a set of dictionaries of Russian word forms and keep only the pairs whose word $w$ occurs in such a dictionary (not the best idea, because the dictionary will most likely lack the vocabulary specific to the service) or, conversely, discard the pairs whose word $s$ occurs in the dictionary (that is, is almost guaranteed not to be a typo).

To improve the quality of the resulting pairs further, I wrote a simple function that determines whether users use two words interchangeably. The logic is simple: if the words $w$ and $s$ often occur surrounded by the same words, then they are probably synonyms, which, given their closeness by Levenshtein distance, means that the less popular word is most likely a misspelled version of the more popular one. For these computations I used the trigram (three-word phrase) statistics collected for the language model.

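A sketch of the pair mining described above; the Levenshtein threshold, the popularity ratio of ten and the naive pairwise comparison are simplifying assumptions (a real run over a hundred-thousand-word dictionary would need indexing by word length or prefix), and `Levenshtein.distance` comes from the third-party python-Levenshtein package.

```python
import Levenshtein  # pip install python-Levenshtein


def mine_pairs(word_freq, max_distance=2, popularity_ratio=10):
    """Guess (intended, misspelled, weight) triples from corpus word counts.

    A word is treated as a misspelling of another if it is close to it by
    Levenshtein distance and at least popularity_ratio times less frequent.
    """
    words = sorted(word_freq, key=word_freq.get, reverse=True)
    pairs = []
    for i, intended in enumerate(words):
        for misspelled in words[i + 1:]:
            if word_freq[intended] < popularity_ratio * word_freq[misspelled]:
                continue
            if abs(len(intended) - len(misspelled)) > max_distance:
                continue
            if Levenshtein.distance(intended, misspelled) <= max_distance:
                pairs.append((intended, misspelled, word_freq[misspelled]))
    return pairs
```

The context-similarity check from the previous paragraph would slot in as one more filter inside the inner loop.
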

Language model

So, for a given dictionary word $w$ we now need to compute $P(w)$, the probability of the user using it. The simplest solution is to take the word's frequency in some large corpus. In general, any language model probably starts with collecting a large corpus of texts and counting word occurrences in it. But we should not stop there: in fact, when estimating $P(w)$, we can also take into account the phrase in which we are trying to correct the word, and any other external context. The task becomes the task of computing $P(w_1 w_2 \ldots w_k)$, where one of the $w_i$ is the word we corrected the typo to and for which we are now computing $P(w)$, and the remaining $w_i$ are the words surrounding it in the user's query.

To learn to take them into account, it is worth walking over the corpus once more and collecting statistics on n-grams, that is, sequences of words. Usually sequences of bounded length are used; I limited myself to trigrams so as not to blow up the index, but it all depends on your willpower (and on the size of the corpus: on a small corpus even trigram statistics will be too noisy).

The traditional n-gram language model looks like this. For the phrase $w_1 w_2 \ldots w_k$ its probability is computed by the formula

$$P(w_1 w_2 \ldots w_k) = P(w_1) \cdot P(w_2 \mid w_1) \cdot P(w_3 \mid w_1 w_2) \cdots P(w_k \mid w_1 \ldots w_{k-1}),$$

where $P(w_1)$ is simply the frequency of the word, and $P(w_3 \mid w_1 w_2)$ is the probability of the word $w_3$ given that the words $w_1 w_2$ come before it, which is nothing but the ratio of the frequency of the trigram $w_1 w_2 w_3$ to the frequency of the bigram $w_1 w_2$. (Note that this formula is simply the result of applying Bayes' formula repeatedly.)

In other words, if we want to compute $P(w_k \mid w_1 \ldots w_{k-1})$, then, denoting the frequency of an arbitrary n-gram by $f$, we get the formula

$$P(w_k \mid w_1 \ldots w_{k-1}) = \frac{f(w_1 w_2 \ldots w_k)}{f(w_1 w_2 \ldots w_{k-1})}.$$

Logical? Logical. However, the difficulties begin when phrases get longer. What if the user types an impressively detailed ten-word query? We do not want to keep statistics on all 10-grams: that is expensive, and the data will most likely be noisy and unrepresentative. We want to make do with n-grams of some bounded length, for example the length of three already proposed above.

This is where the formula above comes in handy. Let's assume that the probability of a word appearing at the end of a phrase is significantly influenced only by the few words immediately preceding it, that is, that

$$P(w_k \mid w_1 \ldots w_{k-1}) \approx P(w_k \mid w_{k-n+1} \ldots w_{k-1}).$$

Setting $n = 3$, for a longer phrase we obtain

$$P(w_1 w_2 w_3 w_4 w_5) = P(w_1) \cdot P(w_2 \mid w_1) \cdot P(w_3 \mid w_1 w_2) \cdot P(w_4 \mid w_2 w_3) \cdot P(w_5 \mid w_3 w_4).$$

Note: the phrase consists of five words, but only n-grams of length at most three appear in the formula. This is exactly what we wanted.

Only one subtle point remains. What if the user enters a very strange phrase and the corresponding n-grams are simply absent from our statistics? For unfamiliar n-grams it would be easy to just set $f = 0$, were it not for the fact that we have to divide by this value. Here smoothing comes to the rescue, and it can be done in various ways; however, a detailed discussion of serious smoothing approaches like Kneser-Ney smoothing goes far beyond the scope of this article.

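Here is a minimal Python sketch of such a language model, with naive additive smoothing standing in for Kneser-Ney purely to keep the example short; the counters are assumed to be keyed by tuples of words, as produced by the n-gram counting sketch shown later in the implementation section.

```python
import math
from collections import Counter


class TrigramLM:
    """A tiny trigram language model with additive smoothing.

    unigrams, bigrams and trigrams are Counters keyed by tuples of words;
    alpha is an untuned smoothing constant, a placeholder for real smoothing.
    """

    def __init__(self, unigrams, bigrams, trigrams, alpha=0.1):
        self.unigrams, self.bigrams, self.trigrams = unigrams, bigrams, trigrams
        self.total = sum(unigrams.values())
        self.vocab = max(len(unigrams), 1)
        self.alpha = alpha

    def log_p_word(self, word, context):
        """log P(word | up to two preceding words)."""
        context = tuple(context)[-2:]
        if len(context) == 2:
            num = self.trigrams[context + (word,)] + self.alpha
            den = self.bigrams[context] + self.alpha * self.vocab
        elif len(context) == 1:
            num = self.bigrams[context + (word,)] + self.alpha
            den = self.unigrams[context] + self.alpha * self.vocab
        else:
            num = self.unigrams[(word,)] + self.alpha
            den = self.total + self.alpha * self.vocab
        return math.log(num / den)

    def log_p_phrase(self, words):
        """log P(w1 ... wk) under the Markov assumption with n = 3."""
        return sum(self.log_p_word(w, words[:i]) for i, w in enumerate(words))
```
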
How to correct phrases

Let's discuss one last subtle point before moving on to the implementation. The problem statement I described above implied that there is a single word that needs to be corrected. Then we refined it: this single word may stand in the middle of a phrase, among other words, and those also need to be taken into account when choosing the best correction. But in reality users simply send us phrases without indicating which word is misspelled; often several words need correction, or even all of them.

There can be many approaches here. One can, for example, consider only the left context of a word in the phrase. Then, walking over the words from left to right and correcting them as necessary, we will get a new phrase of some quality. The quality will be poor if, say, the first word turns out to resemble several popular words and we pick the wrong option. The rest of the phrase (possibly error-free to begin with) will then be adjusted by us to fit the wrong first word, and we may end up with text completely unrelated to the original.

One can also consider the words separately and use a classifier to decide whether a given word is mistyped or not, as proposed in [4]. The classifier is trained on the probabilities we already know how to compute, plus a number of other features. If the classifier says a correction is needed, we correct the word taking into account the existing context. Again, if several words are misspelled, the decision about the first of them has to rely on a context that itself contains errors, which can lead to quality problems.

In the implementation of our corrector we used the following approach. For each word $s_i$ in the phrase, we use the error model to find the top-N dictionary words that could have been meant, concatenate them into candidate phrases in every possible way and, for each of the $N^K$ resulting phrases, where $K$ is the number of words in the source phrase, honestly compute the value

$$\prod_{i=1}^{K} P(s_i \mid c_i) \cdot P(c_1 c_2 \ldots c_K)^{\lambda}.$$

Here $s_i$ are the words entered by the user, $c_i$ are the corrections chosen for them (which we are currently enumerating), and $\lambda$ is a coefficient determined by the comparative quality of the error model and the language model (a large coefficient means we trust the language model more, a small one means we trust the error model more), proposed in [4]. In other words, for each candidate phrase we multiply the probabilities of the individual words being typed as they were, given the chosen dictionary variants, and multiply that by the probability of the whole phrase in our language (raised to the power $\lambda$). The output of the algorithm is the phrase of dictionary words that maximizes this value.

Wait, what? Enumerating $N^K$ phrases?

Fortunately, because we limited the length of the n-grams, the maximum over all these phrases can be found much faster. Remember: above we simplified the formula for $P(c_1 c_2 \ldots c_K)$ so that it depends only on frequencies of n-grams of length at most three:

$$P(c_1 c_2 \ldots c_K) = P(c_1) \cdot P(c_2 \mid c_1) \cdot \prod_{i=3}^{K} P(c_i \mid c_{i-2} c_{i-1}).$$

If we multiply this value by $\prod_i P(s_i \mid c_i)$ and try to maximize it over all $c_1, \ldots, c_K$, we see that it is enough to enumerate all possible values of $c_{i-1}$ and $c_i$ at each position and solve the subproblem for them, that is, for the prefix phrases $c_1 \ldots c_i$. The whole problem is then solved by dynamic programming in $O(K \cdot N^3)$ time.

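A sketch of that dynamic program, keeping one best hypothesis per pair of the last two chosen corrections; the `candidates` lists are assumed to come from the error model (for example, from `best_corrections` above), `lm` from the `TrigramLM` sketch, and `lam` plays the role of $\lambda$.

```python
from itertools import product


def correct_phrase(candidates, lm, lam=1.0):
    """candidates: one list per input word of (correction, log P(s_i | c_i)).

    Returns the phrase maximizing
    sum_i log P(s_i | c_i) + lam * log P(c_1 ... c_K),
    using the fact that the trigram LM only looks two words back.
    """
    # state: last two chosen corrections -> (best score, corrections so far)
    states = {(): (0.0, [])}
    for options in candidates:
        new_states = {}
        for (prev, (score, path)), (cand, err_lp) in product(states.items(), options):
            lm_lp = lm.log_p_word(cand, list(prev))
            new_score = score + err_lp + lam * lm_lp
            key = (prev + (cand,))[-2:]
            if key not in new_states or new_score > new_states[key][0]:
                new_states[key] = (new_score, path + [cand])
        states = new_states
    score, phrase = max(states.values())
    return phrase, score
```

With trigrams and N candidates per word this is the $O(K \cdot N^3)$ dynamic program mentioned above.
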
Implementation

Assembling the corpus and counting n-grams

A caveat right away: I did not have so much data that some sophisticated MapReduce would be required. So I simply gathered all the texts of reviews, comments and search queries in Russian from our service into one text file (product descriptions, alas, come in English, and using machine-translated versions worsened rather than improved the results) and left a server overnight counting trigrams with a simple Python script.

As the dictionary I took the top words by frequency, about a hundred thousand of them. Words that were too long (more than 20 characters) or too short (fewer than three characters, except for a hard-coded list of Russian words) were excluded. Words matching the regex r"^[a-z0-9]{2}$" were spared separately, so that the two-character versions of iPhones and other interesting identifiers survive.

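A sketch of that filtering; the whitelist of short Russian words is a hypothetical stand-in for the hard-coded list mentioned above.

```python
import re
from collections import Counter
from itertools import islice

SHORT_ID = re.compile(r"^[a-z0-9]{2}$")  # keep things like "xs" or "5s"
SHORT_WHITELIST = {"да", "не", "он"}     # hypothetical hard-coded short words


def build_dictionary(word_counts: Counter, size: int = 100_000) -> set:
    """Top `size` corpus words by frequency, with the length filters above."""
    def ok(word: str) -> bool:
        if len(word) > 20:
            return False
        if len(word) < 3:
            return bool(SHORT_ID.match(word)) or word in SHORT_WHITELIST
        return True

    kept = (w for w, _ in word_counts.most_common() if ok(w))
    return set(islice(kept, size))
```
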
 
When counting bigrams and trigrams, a phrase may contain a word that is not in the dictionary. In that case I threw the word out and split the whole phrase into two parts (before and after that word), which were processed separately. So, for the phrase "Do you know what an abyrvalg is? It's Glavryba, colleague", with "abyrvalg" absent from the dictionary, the counted trigrams would be "do you know", "you know what", "know what an" and "it's glavryba colleague" (provided, of course, that "glavryba" makes it into the dictionary).

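A sketch of that counting, with a deliberately primitive tokenizer; the splitting at out-of-dictionary words follows the behaviour described above.

```python
import re
from collections import Counter

TOKEN = re.compile(r"[\w'-]+")


def count_ngrams(lines, dictionary, max_n=3):
    """Count 1..max_n-grams, splitting every phrase at out-of-dictionary words."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for line in lines:
        words = [w.lower() for w in TOKEN.findall(line)]
        run = []                  # current run of dictionary words
        for w in words + [None]:  # None flushes the final run
            if w is not None and w in dictionary:
                run.append(w)
                continue
            for n in range(1, max_n + 1):
                for i in range(len(run) - n + 1):
                    counts[n][tuple(run[i:i + n])] += 1
            run = []
    return counts
```
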
Training the error model

All further data processing I did in Jupyter. The n-gram statistics are loaded from JSON and post-processed so that words close to each other by Levenshtein distance can be found quickly; then, for each pair, a (rather cumbersome) function is called in a loop that aligns the words and extracts short edits like ss → s (shown below).


Python code:

```python
def generate_modifications(intended_word, misspelled_word, max_l=2):
    # Align the letters of the two words in the Levenshtein-optimal way and
    # extract modifications of bounded length. To be able to restore the
    # optimal alignment after computing the distance, we store in the table,
    # in addition to the distance, pointers to the previous cell:
    # memo[i][j] = (distance, prev i, prev j).
    # What follows is somewhat unusually scary Python code: that is what
    # happens when a language is used for purposes it was not meant for!
    m, n = len(intended_word), len(misspelled_word)
    memo = [[None] * (n + 1) for _ in range(m + 1)]
    memo[0] = [(j, (0 if j > 0 else -1), j - 1) for j in range(n + 1)]
    for i in range(m + 1):
        memo[i][0] = i, i - 1, (0 if i > 0 else -1)
    for j in range(1, n + 1):
        for i in range(1, m + 1):
            if intended_word[i - 1] == misspelled_word[j - 1]:
                memo[i][j] = memo[i - 1][j - 1][0], i - 1, j - 1
            else:
                best = min(
                    (memo[i - 1][j][0], i - 1, j),
                    (memo[i][j - 1][0], i, j - 1),
                    (memo[i - 1][j - 1][0], i - 1, j - 1),
                )
                # Separate handling for swapped adjacent letters
                # (a common error when typing).
                if (i > 1
                        and j > 1
                        and intended_word[i - 1] == misspelled_word[j - 2]
                        and intended_word[i - 2] == misspelled_word[j - 1]):
                    best = min(best, (memo[i - 2][j - 2][0], i - 2, j - 2))
                memo[i][j] = 1 + best[0], best[1], best[2]
    # By the end of the loop, memo[m][n][0] holds the Levenshtein distance
    # between the source words. Now we restore the optimal alignment.
    s, t = [], []
    i, j = m, n
    while i >= 1 or j >= 1:
        _, pi, pj = memo[i][j]
        di, dj = i - pi, j - pj
        if di == dj == 1:
            s.append(intended_word[i - 1])
            t.append(misspelled_word[j - 1])
        if di == dj == 2:
            s.append(intended_word[i - 1])
            s.append(intended_word[i - 2])
            t.append(misspelled_word[j - 1])
            t.append(misspelled_word[j - 2])
        if 1 == di > dj == 0:
            s.append(intended_word[i - 1])
            t.append("")
        if 1 == dj > di == 0:
            s.append("")
            t.append(misspelled_word[j - 1])
        i, j = pi, pj
    s.reverse()
    t.reverse()
    # Generate modifications of length not exceeding the limit.
    for i, _ in enumerate(s):
        ss = ts = ""
        while len(ss) < max_l and i < len(s):
            ss += s[i]
            ts += t[i]
            yield ss, ts
            i += 1
```

The counting of the edits itself then looks elementary, although it can take quite a while.

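Tying this together with the earlier sketches (mine_pairs, estimate_error_model, build_dictionary, count_ngrams and TrigramLM are the hypothetical helpers from those sketches, not names from the production code, and the corpus file name is made up), the training step might look roughly like this:

```python
# assumed inputs: a corpus file and a dictionary built as sketched above
lines = open("corpus.txt", encoding="utf-8")
ngrams = count_ngrams(lines, dictionary)
word_freq = {w: c for (w,), c in ngrams[1].items()}

pairs = mine_pairs(word_freq)
log_probs = estimate_error_model(pairs, generate_modifications)
lm = TrigramLM(ngrams[1], ngrams[2], ngrams[3])
```
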
Applying the error model

This part is implemented as a Go microservice that talks to the main backend over gRPC. It implements the algorithm described by Brill and Moore [2], with small optimizations. In the end it runs roughly twice as slowly as the authors claim; I do not presume to judge whether the problem is Go or me. But along the way, while profiling, I learned a few things about Go.

- Do not use `math.Max` to compute a maximum. It is about three times slower than `if a > b { b = a }`! Just take a look at the implementation of this function:

```go
// Max returns the larger of x or y.
//
// Special cases are:
//	Max(x, +Inf) = Max(+Inf, x) = +Inf
//	Max(x, NaN) = Max(NaN, x) = NaN
//	Max(+0, ±0) = Max(±0, +0) = +0
//	Max(-0, -0) = -0
func Max(x, y float64) float64

func max(x, y float64) float64 {
	// special cases
	switch {
	case IsInf(x, 1) || IsInf(y, 1):
		return Inf(1)
	case IsNaN(x) || IsNaN(y):
		return NaN()
	case x == 0 && x == y:
		if Signbit(x) {
			return y
		}
		return x
	}
	if x > y {
		return x
	}
	return y
}
```

Unless you suddenly need +0 to be greater than -0, do not use `math.Max`.

- Do not use a hash table where an array will do. This is, of course, fairly obvious advice. I had to renumber the Unicode characters at program startup so that they could be used as indices into the array of children of a trie node (such a lookup was a very frequent operation).

- Callbacks in Go are not cheap. During a refactoring at code review, some of my attempts at decoupling slowed the program down noticeably, even though the algorithm formally did not change. Since then I have held the view that the Go compiler's optimizer has plenty of room to grow.

Applying the language model

Here there were no particular surprises: the dynamic programming algorithm described in the section above was implemented as is. This component required the least work; the slowest part is applying the error model. Therefore, a cache of error-model results in Redis was additionally bolted on between these two layers.


Results

Based on the results of this work (which took about a month), we ran an A/B test of the corrector on our users. Instead of the 10% of empty search result pages among all queries that we had before introducing the corrector, we got 5%; most of the remaining queries are for goods that we simply do not have on the platform. The number of sessions without a second search query also grew (along with a few more UX-related metrics of that kind). The metrics related to money, however, did not change significantly; this was unexpected and prompted us to carefully analyze and double-check the other metrics.


Conclusion 3r33911. 3r3r6956.  
3r3r6956.  
Stephen Hawking was once told that every formula he included in his book would halve the number of readers. Well, this article has on the order of fifty of them, so congratulations: you are part of the roughly $2^{-50}$ share of readers who have made it this far!

Bonus

[Video: the Highload++ talk "Vertical reality glasses: correcting typos in search queries"]

References

[1] M. D. Kernighan, K. W. Church, W. A. Gale. A Spelling Correction Program Based on a Noisy Channel Model. Proceedings of the 13th Conference on Computational Linguistics, Volume 2, 1990.

[2] E. Brill, R. C. Moore. An Improved Error Model for Noisy Channel Spelling Correction. Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics, 2000.

[3] T. Brants, A. C. Popat, P. Xu, F. J. Och, J. Dean. Large Language Models in Machine Translation. Proceedings of the 2007 Conference on Empirical Methods in Natural Language Processing, 2007.

[4] C. Whitelaw, B. Hutchinson, G. Y. Chung, G. Ellis. Using the Web for Language Independent Spellchecking and Autocorrection. Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, 2009.