Correcting typos: a side view
We are going to talk about using the fashionable word embeddings somewhat off-label, namely to correct typos (strictly speaking, spelling mistakes as well, but let us assume that people are literate and merely mistype). There was a fairly similar article on Habr, but here the emphasis will be somewhat different.
Visualization of a Word2Vec model trained on "The Lord of the Rings". Clearly something in the Black Speech.
There is already an article that explains this well; so as not to repeat it, yet not to chase the reader through links, here is a short digression into the topic.
Word embedding is a vector representation of words, that is, each word is associated with a vector of fixed dimension, for example, for the notional word "house":
[0.1, 0.3, ..., 0.7]. An important observation: usually the dimension of the vector is much smaller than the size of the dictionary, otherwise it degenerates into one-hot encoding.
The number of elements of the vector depends on many things: the chosen method, the requirements of the task, the weather in Krasnoyarsk, and much more. Currently relevant implementations include Word2Vec, GloVe, WordRank and FastText.
Briefly, the idea is this: words whose contexts are similar most likely have similar meanings. From this it also follows how typos get corrected in the example cited at the beginning. Suppose we have two phrases: "find tours with adventures" and "find tours with advenures" (the second containing a typo). The contexts are similar, therefore the words "adventures" and "advenures" should be close in meaning (this is a rough approximation, but that is the gist).
This approach to correcting typos, besides its obvious advantages, has one important drawback: every error we want to be able to correct must occur in the text we train on. In other words, we cannot obtain a vector for a word we have never seen before.
All methods of obtaining word vectors that are modern (known to the author), except FastText, treat words as indivisible entities (the word is replaced by an integer index, and the index is then used). FastText (more precisely, its extension) adds an interesting proposal: let us compute vectors not for whole words, but for character n-grams. For example, the word "стол" (table), with the start-of-word and end-of-word symbols added, "<стол>", is converted into the following lists: 3-grams: <ст, сто, тол, ол>; 4-grams: <сто, стол, тол>. The authors suggest using n-grams from 3 to 6 characters inclusive, and we will not argue with them. The resulting word vector is then equal to the sum of the vectors of its constituent n-grams:
$v_w = \sum_{g \in G_w} z_g$, where $G_w$ is the set of all n-grams of the word, $z_g$ is the vector of the corresponding n-gram, and $v_w$ is the word vector.
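To make the decomposition concrete, here is a minimal sketch in Python of how a word can be split into character n-grams and its vector assembled from them (an illustration of the idea, not gensim's internal implementation; the ngram_vectors dictionary is assumed to hold already-learned n-gram vectors):

import numpy as np

def char_ngrams(word, min_n=3, max_n=6):
    # add start-of-word and end-of-word symbols, as FastText does
    w = "<" + word + ">"
    return [w[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(w) - n + 1)]

def word_vector(word, ngram_vectors, dim=30):
    # the word vector is the sum of the vectors of its n-grams
    vectors = [ngram_vectors[g] for g in char_ngrams(word) if g in ngram_vectors]
    return np.sum(vectors, axis=0) if vectors else np.zeros(dim)

print(char_ngrams("стол", 3, 4))
# ['<ст', 'сто', 'тол', 'ол>', '<сто', 'стол', 'тол>']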
What important changes does this approach promise us?
First, the authors introduced this so that the method would work better for languages with rich morphology (such as Russian). And indeed, morphological changes now have a smaller effect on the distance between words; as evidence, here is a table for different languages from the same article:
Correlation between human judgement and the method's similarity scores. SG and CBOW are, respectively, the skip-gram and continuous bag-of-words variants of Word2Vec; sisg- is the variant where unknown words are replaced by a zero vector, and sisg is the variant where unknown words are represented by the sum of their n-grams.
Secondly, this is a small step back: with Word2Vec we moved away from the letter-level representation of a word, trying to bring together words such as "tsar" and "king", or "mother" and "son"; now we return to "literal" (character-level) closeness, which may not be very good for semantic tasks, but for our variant (let me remind you: correcting typos) it is exactly what we need.
This concludes the theoretical part, let us proceed to practice.
Practical training ground
We introduce some preconditions:
- For the tests let us take a relatively small text, namely a work of fiction, for example "And Quiet Flows the Don" by Sholokhov. Why so? It will be easier for an interested reader to reproduce, and, knowing the context of the work, we can explain the behaviour of our method. In general, vector word representations are trained on large language corpora, such as Wikipedia dumps.
- Normalize the words before training, that is, bring them to their normal form (for nouns, for example, this is singular number and nominative case). This is a big simplification, done so as not to bother with endings and to increase word frequencies, giving more adequate vectors (this is also why such models are usually trained on large corpora: to get as many usages of each word as possible). A minimal sketch of such normalization is shown right after this list.
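For illustration, a minimal sketch of such tokenization and normalization using pymorphy2 (one possible morphological analyzer; the original script may do this differently, and the sample sentences are made up):

import re
import pymorphy2

morph = pymorphy2.MorphAnalyzer()

def normalize_sentence(sentence):
    # split into words and bring each word to its normal form (lemma)
    words = re.findall(r"\w+", sentence.lower())
    return [morph.parse(w)[0].normal_form for w in words]

# toy sentences for illustration; sentences is a list of lists of normalized
# words, as expected by gensim below
text = ["Казаки ехали по степи.", "Дон тихо нёс свои воды."]
sentences = [normalize_sentence(s) for s in text]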
The test code is quite simple (thanks, gensim); the entire script is here, and training the model itself is one line:
model = gensim.models.FastText(sentences, size=30, window=5, min_count=2, sg=1, iter=35)  # window value assumed (gensim default); the original value was not preserved
- sentences: a list of lists, each element is a sentence, each element of a sentence is a word;
- size: the size of the output vectors;
- window: the size of the window; words within the window are considered the context of the word in the center;
- min_count: consider only words that occur at least 2 times;
- sg: use the skip-gram variant, not CBOW;
- iter: the number of iterations.
In addition, there are two parameters that are left by default, their meaning was discussed above, but you can play with them:
- min_n and max_n: the lower and upper thresholds for which n-grams to take (by default from 3 to 6 characters).
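For completeness, a small sketch of how the trained model can then be queried (the calls below assume the gensim 3.x API that matches the parameter names above; the example words are arbitrary):

# given `model` trained as in the line above

# nearest neighbours of a word; this also works for out-of-vocabulary words,
# because FastText assembles their vectors from character n-grams
print(model.wv.most_similar("студент", topn=10))

# cosine similarity between two (possibly unseen) words
print(model.wv.similarity("студент", "стуент"))

# raw vector of a word that never occurred in the training text
vector = model.wv["стуент"]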
As a metric, let us take the measure of similarity between vectors that has become classical for this problem: cosine similarity, which takes values from 0 to 1, where 0 means the vectors are completely different and 1 means the vectors are identical:
$\mathrm{similarity} = \cos\theta = \frac{\sum_i A_i B_i}{\sqrt{\sum_i A_i^2}\,\sqrt{\sum_i B_i^2}}$, where $A_i$ and $B_i$ are the components of the corresponding vectors.
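The same metric in a few lines of numpy, just as a sanity check (a sketch; gensim computes it internally):

import numpy as np

def cosine_similarity(a, b):
    # sum(A_i * B_i) / (sqrt(sum(A_i ** 2)) * sqrt(sum(B_i ** 2)))
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity([0.1, 0.3, 0.7], [0.1, 0.3, 0.7]))  # 1.0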
So, we have sorted out what we have and what we want; just in case, once more:
The hypothesis is that we can correct misprints based on the vector representation of words.
We have a FastText model trained on just one literary work; the words in it are normalized, and we can also obtain a vector for unknown words.
The method of comparing words, or rather their vectors, is defined in the previous paragraph.
Now let us see what we can do. For the tests, take the following pairs, a word with a typo and the intended word in brackets (a sketch of how each pair can be checked follows the list):
man (person), stuent (student), student (student), chilovench (humanity), participate (participate), tactbut (tactics), in general (generally), simpotichny (pretty), create (make), watch (look), algorithm (algorithm), lay down (put).
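Here is a sketch of how the rank and metric value for one pair could be obtained (the helper function and example words are assumptions for illustration, not the author's original script):

def check_pair(model, typo, intended, topn=10):
    # rank of the intended word among the nearest neighbours of the typo
    neighbours = model.wv.most_similar(typo, topn=topn)
    for rank, (word, score) in enumerate(neighbours, start=1):
        if word == intended:
            return rank, score
    # intended word not in the top-N: report only its similarity to the typo
    return None, model.wv.similarity(typo, intended)

print(check_pair(model, "стуент", "студент"))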
Summary table with the results:
[Results table: one row per test pair, with columns for the word with a typo, the intended word, the intended word's rank in the list of nearest neighbours, and the value of the metric.]
If the rank is given in parentheses, it means the intended word is not in the dictionary, but by the metric it would have occupied that place.
For the pair "to lay down / to put", things are actually not so bad, because the words ranked above it are "folded", "put off", "laid out" and so on (see the spoiler).
Sometimes the top of the list of similar words contains words very different from the query (stool - driver); presumably this is due to a kind of vector collision, when roughly the same vectors are obtained for different sets of n-grams.
Top 10 nearest words for each query
the proximity
the boundedness
the significance
to be poor
to participate
to drink
to sympathize with
armed with
the hostel
to make
to make
to do
do not finish
to finish
to fix
to do
to do

to look at:
third
to weather
take a closer look at
to brighten up
to look at

to lay down:
to enclose
to lay
to impose
lay out
add up
to impose
Using a vector representation can certainly help in the task of correcting typos and errors, but it is dangerous to use it on its own, because sometimes (though rarely) it makes gross mistakes.
In essence, this is just another metric comparing two strings for similarity, but at a level higher than, for example, the Damerau-Levenshtein distance. Using FastText as an addition to other methods may well improve the quality of typo correction.
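As an illustration of such a combination, here is a rough sketch that re-ranks candidate corrections by blending a string-similarity score (difflib's SequenceMatcher is used as a simple stand-in for an edit-distance measure) with the embedding similarity; the weighting and the candidate list are assumptions:

import difflib

def best_correction(model, typo, candidates, alpha=0.5):
    # blend character-level similarity with cosine similarity of the vectors
    def score(candidate):
        string_sim = difflib.SequenceMatcher(None, typo, candidate).ratio()
        vector_sim = model.wv.similarity(typo, candidate)
        return alpha * string_sim + (1 - alpha) * vector_sim
    return max(candidates, key=score)

print(best_correction(model, "стуент", ["студент", "студень", "стул"]))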