Parsing Wikipedia for NLP tasks in 4 commands

The gist

It turns out that for this purpose it is enough to run just the following set of commands:
git clone https://github.com/attardi/wikiextractor.git
cd wikiextractor
wget http://dumps.wikimedia.org/ruwiki/latest/ruwiki-latest-pages-articles.xml.bz2
python3 WikiExtractor.py -o /data/wiki/ --no-templates --processes 8 /data/ruwiki-latest-pages-articles.xml.bz2
and then run a small post-processing script:

    python3 process_wikipedia.py

The result is a ready .csv file with your corpus.
It goes without saying that:

- http://dumps.wikimedia.org/ruwiki/latest/ruwiki-latest-pages-articles.xml.bz2 can be changed to the dump for the language you need, more details here;
- all the information about wikiextractor parameters can be found in the manual (it seems even the official documentation was not updated, unlike the manual);
The post-processing script converts the wiki files into a sentence-level table.
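For illustration, this is roughly how working with the resulting file looks; the path and the column names below are assumptions, not the exact schema produced by process_wikipedia.py:

    # Illustrative only: the CSV path and column names are assumptions
    # about the output of the post-processing script.
    import pandas as pd

    df = pd.read_csv('/data/wiki/wiki_sentences.csv')
    print(df.columns.tolist())  # e.g. ['article_uuid', 'sentence', 'processed_sentence']
    print(df.head())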

Why

Perhaps the development of ML tools has now reached a level [8] where a couple of days are literally enough to build a working NLP model / pipeline. Problems arise only in the absence of reliable datasets / ready embeddings / ready language models. The purpose of this article is to alleviate your pain a bit by showing that a couple of hours are enough to process the entire Wikipedia (probably the most popular corpus for training word embeddings in NLP). After all, if just a couple of days are enough to build the simplest model, why spend much more time obtaining the data for that model?

The principle of the script

WikiExtractor saves wiki articles as text, divided into <doc> blocks. The script is based on the following logic (a minimal sketch follows the list):

- Take the list of all output files;
- Split the files into articles;
- Remove all remaining HTML tags and special characters;
- Split into sentences using nltk.sent_tokenize;
- Assign a uuid to each article, so that the code does not grow to an enormous size and remains readable;
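A minimal sketch of this logic, assuming the standard WikiExtractor output layout (folders like AA/wiki_00 containing <doc> blocks); the output path and column names are illustrative, not the exact contents of process_wikipedia.py:

    import glob
    import re
    import uuid

    import pandas as pd
    from nltk import sent_tokenize  # requires nltk.download('punkt')

    rows = []
    # WikiExtractor writes plain-text files grouped into folders like AA/wiki_00
    for path in glob.glob('/data/wiki/*/wiki_*'):
        with open(path, encoding='utf-8') as f:
            text = f.read()
        # each article sits inside a <doc ...> ... </doc> block
        for article in re.findall(r'<doc.*?>(.*?)</doc>', text, flags=re.DOTALL):
            article_uuid = uuid.uuid4()  # one id per article
            article = re.sub(r'<[^>]+>', ' ', article)  # drop remaining HTML tags
            for sentence in sent_tokenize(article):
                rows.append((article_uuid, sentence.strip()))

    df = pd.DataFrame(rows, columns=['article_uuid', 'sentence'])
    df.to_csv('/data/wiki/wiki_sentences.csv', index=False)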
The text preprocessing itself is simple (you can easily adapt it to your needs; see the sketch after the list):

- Remove non-alphabetic characters;
- Remove stop words;
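A minimal version of such preprocessing might look like this (the regex and the nltk stop-word list are assumptions that you can easily swap out):

    import re
    from nltk.corpus import stopwords  # requires nltk.download('stopwords')

    STOP_WORDS = set(stopwords.words('russian'))

    def preprocess(sentence):
        # keep only alphabetic tokens (Cyrillic and Latin), lower-cased
        tokens = re.findall(r'[a-zа-яё]+', sentence.lower())
        # drop stop words
        return ' '.join(t for t in tokens if t not in STOP_WORDS)

    print(preprocess('Это простой пример предложения из Википедии.'))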
The dataset is ready, now what?

Main applications

Most often in NLP practice one faces the task of constructing embeddings.

To solve it, one of the following tools is usually used:
- Ready word vectors / embeddings [6];
- Internal states of CNNs trained on tasks such as fake sentence detection / language modeling / classification [7];
- A combination of the above methods;
In addition, it has already been shown many times [9] that simply averaged word vectors (with a couple of minor details, which we omit here) make a good baseline for sentence embeddings.
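As a rough illustration, a minimal averaged-vector baseline could look like the sketch below, assuming pre-trained fastText vectors for Russian saved as wiki.ru.vec (an assumed filename); the weighting tweaks mentioned above are omitted:

    import numpy as np
    from gensim.models import KeyedVectors

    # fastText .vec files are stored in word2vec text format
    wv = KeyedVectors.load_word2vec_format('wiki.ru.vec')

    def sentence_embedding(sentence):
        tokens = [t for t in sentence.lower().split() if t in wv]
        if not tokens:
            return np.zeros(wv.vector_size)
        # plain average of word vectors, no weighting
        return np.mean([wv[t] for t in tokens], axis=0)

    print(sentence_embedding('википедия свободная энциклопедия').shape)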
Other uses

- Use random wiki sentences as negative examples for a triplet loss (a sketch follows the list);
- Train sentence encoders via fake phrase detection [10];
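A minimal sketch of the first idea, assuming the sentence table from the earlier steps; the sentence encoder itself is only stubbed out with random tensors here:

    import pandas as pd
    import torch
    import torch.nn.functional as F

    df = pd.read_csv('/data/wiki/wiki_sentences.csv')

    def sample_negatives(n):
        # a random wiki sentence is almost certainly unrelated to the anchor,
        # which is exactly what a triplet loss needs as a negative example
        return df['sentence'].sample(n).tolist()

    # in practice these embeddings come from your sentence encoder
    anchor, positive, negative = torch.randn(3, 8, 300)
    loss = F.triplet_margin_loss(anchor, positive, negative, margin=1.0)
    print(sample_negatives(2), loss.item())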
Some charts for the Russian Wikipedia

[Figure: distribution of sentence lengths for the Russian Wikipedia, without logarithms (X values limited to 20)]

[Figure: the same distribution in decimal logarithms]
References

1. Fast-text word vectors trained on the wiki;
2. Fast-text and Word2Vec models for the Russian language;
3. An awesome wiki extractor library for Python;
4. The official page with links to wiki dumps;
5. Our post-processing script;
6. The main articles about word embeddings: Word2Vec, Fast-Text, tuning;
7. Several current SOTA approaches: InferSent; generative pre-training of a CNN; ULMFiT; contextual approaches to word representation (ELMo);
8. ImageNet moment in NLP?
9. Baselines for sentence embeddings: 1, 2, 3, 4;
10. Fake phrase detection for sentence encoders;
 
