Adult journalism: from Russia to the Kremlin

An analysis of 18 years of publications (from September 1999 to December 2017) using python, sklearn, scipy, XGBoost, pymorphy2, nltk, gensim, MongoDB, Keras and TensorFlow.
The study used data from the post “Analyze this - ” by user ildarchegg. The author kindly provided 3 gigabytes of articles in a convenient format, and I decided this was a great opportunity to test some text processing methods and, with luck, to learn something new about Russian journalism, society and things in general.

Contents:
- MongoDB to import JSON into Python
- Cleaning and normalizing text
- Tag Cloud
- Topic modeling based on LDA
- Popularity Prediction: XGBClassifier, LogisticRegression, Embedding & LSTM
- Exploring entities using Word2Vec

MongoDB to import JSON into Python
Unfortunately, the JSON file with the texts was slightly broken. That was not critical for my purposes, but Python refused to work with the file. So I first imported it into MongoDB, and only then loaded the array through MongoClient from the pymongo library and saved it to CSV piece by piece.
Notes: 1. I had to start the database with the command sudo service mongod start; there are other options, but they did not work for me. 2. mongoimport is a separate application: it cannot be run from the mongo console, only from the terminal.
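The export step described above can be sketched roughly as follows. This is a minimal sketch, not the original script: the database name `lenta`, the collection name `articles`, the field names, the output file pattern and the chunk size are all assumptions made for illustration.

```python
import csv
from itertools import islice


def write_in_chunks(records, path_template, fields, chunk_size=50_000):
    """Write an iterable of dicts to numbered CSV files, chunk_size rows each.

    Returns the list of file paths that were written.
    """
    iterator = iter(records)
    paths = []
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            break
        path = path_template.format(len(paths))
        with open(path, "w", newline="", encoding="utf-8") as handle:
            # Keys outside `fields` are ignored; missing keys become empty cells.
            writer = csv.DictWriter(handle, fieldnames=fields, extrasaction="ignore")
            writer.writeheader()
            writer.writerows(chunk)
        paths.append(path)
    return paths


def export_from_mongo(uri="mongodb://localhost:27017"):
    """Stream the collection out of MongoDB; needs pymongo and a running mongod."""
    from pymongo import MongoClient  # lazy import: the helper above works without it

    client = MongoClient(uri)
    # Hypothetical database/collection/field names, chosen for illustration only.
    cursor = client["lenta"]["articles"].find({}, {"_id": 0})
    return write_in_chunks(cursor, "lenta_{:03d}.csv", ["url", "date", "title", "text"])
```

Iterating over the cursor keeps memory use flat, because only one chunk of documents is held in Python at a time.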
Data gaps are evenly distributed over the years. I do not plan to work with periods shorter than a year, so I hope the gaps will not affect the correctness of the conclusions.
Cleaning and normalizing text
Before analyzing the array itself, it has to be brought to a standard form: remove special characters, convert the text to lower case (pandas string methods did a great job), remove stop words (stopwords.words('russian') from nltk.corpus), and reduce words to their normal form by lemmatization (pymorphy2.MorphAnalyzer).
It was not without flaws: for example, Dmitry Peskov turned into “dmitry” and “sand”. On the whole, though, I was satisfied with the result.
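The normalization steps listed above might look like the sketch below. It uses plain `re` instead of the pandas string methods mentioned in the text, and the regular expression (which also drops digits) is my assumption about what counts as a "special character". The stop-word set and the lemmatizer are passed in, so the core function runs without nltk or pymorphy2 installed.

```python
import re

# After lower-casing, anything that is not a Latin/Cyrillic letter or whitespace
# is treated as a special character (an assumption; digits are dropped too).
NON_LETTERS = re.compile(r"[^a-zа-яё\s]")


def normalize(text, stop_words, lemmatize):
    """Lower-case, strip special characters, drop stop words, lemmatize the rest."""
    cleaned = NON_LETTERS.sub(" ", text.lower())
    tokens = [token for token in cleaned.split() if token not in stop_words]
    return [lemmatize(token) for token in tokens]


def make_russian_tools():
    """Build the stop-word set and lemmatizer named in the article.

    Requires nltk (with the 'stopwords' corpus downloaded) and pymorphy2.
    """
    from nltk.corpus import stopwords
    import pymorphy2

    morph = pymorphy2.MorphAnalyzer()
    stop_words = set(stopwords.words("russian"))
    # parse() returns candidate analyses; the first one is taken here.
    return stop_words, lambda word: morph.parse(word)[0].normal_form
```

Taking the first parse is the likely source of errors such as the "Peskov" to "sand" case: a surname that coincides with a form of a common noun gets the common noun's lemma.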
Tag Cloud
As a warm-up, let's see what the publications look like in the most general form. Let's display the 50 most frequent words that Lenta.ru journalists used from 1999 to 2017 as a tag cloud.
“RIA Novosti” (the most popular source), “billion dollar” and “million dollar” (financial topics), “present” (a figure of speech typical of all news sites), “law enforcement agency” and “criminal case” (crime news), “Prime Minister” and “Vladimir Putin” (politics): quite the expected style and themes for a news portal.
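Since most of the tags above are two-word phrases, the frequencies behind such a cloud can be counted over n-grams. The helper below is a sketch under that assumption; the function name and defaults are mine, and the input is assumed to be lists of already normalized tokens.

```python
from collections import Counter


def top_ngrams(token_lists, n=2, k=50):
    """Return the k most frequent n-grams across already normalized articles."""
    counts = Counter()
    for tokens in token_lists:
        # Zipping n shifted copies of the token list yields consecutive n-grams.
        grams = zip(*(tokens[i:] for i in range(n)))
        counts.update(" ".join(gram) for gram in grams)
    return counts.most_common(k)
```

The resulting (phrase, count) pairs can then be handed to any tag-cloud renderer; which one was used for the original picture is not stated in the text.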
Topic modeling based on LDA
We compute the most popular topics for each year using LDA from gensim. LDA (latent Dirichlet allocation, a topic modeling method) automatically uncovers hidden topics (sets of words that frequently occur together) from the observed word frequencies in the articles.
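A per-year LDA run with gensim could be sketched as below. The number of topics, the number of passes and the helper names are assumptions; the text does not say which settings were used. The gensim import is done lazily so the grouping helper works on its own.

```python
from collections import defaultdict


def group_by_year(articles):
    """Group (year, tokens) pairs into {year: [tokens, ...]}."""
    by_year = defaultdict(list)
    for year, tokens in articles:
        by_year[year].append(tokens)
    return dict(by_year)


def topics_for_year(token_lists, num_topics=5, num_words=5):
    """Fit LDA on one year's articles and return the top words of every topic."""
    from gensim import corpora
    from gensim.models import LdaModel

    dictionary = corpora.Dictionary(token_lists)                     # token <-> id map
    corpus = [dictionary.doc2bow(tokens) for tokens in token_lists]  # bag-of-words rows
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics,
                   passes=5, random_state=0)                         # assumed settings
    # show_topics(formatted=False) yields (topic_id, [(word, weight), ...]) pairs.
    shown = lda.show_topics(num_topics=num_topics, num_words=num_words, formatted=False)
    return [[word for word, _ in words] for _, words in shown]
```

Calling `topics_for_year` on each value of `group_by_year(...)` gives one list of topic keywords per year, which is the kind of output the following paragraphs interpret.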
The cornerstone of domestic journalism turned out to be Russia, Putin and the USA.
In some years this theme was diluted by the Chechen war (1999 to 2000), by September 11 (in 2001) and by Iraq (2002 to 2004). From 2008 to 2009 the economy took first place: interest, company, dollar, ruble, billion, million. In 201? they often wrote about Gaddafi.
From 2014 to 2017 the "years of Ukraine" began in Russia and are still going on. The peak came in 2015; after that the trend began to decline, but it still remains high.
Interesting, of course, but nothing I did not already know or could not have guessed.
Let's change the approach a bit: select the top topics over the whole period and see how their proportions changed from year to year, that is, study the evolution of the topics.
The most interpretable option turned out to be the top 5:
- Crime (man, police, occur, detain, policeman);
- Politics (Russia, Ukraine, president, USA, head);
- Culture (spinner, purulent, instagram, ramming: yes, this is our culture, although this topic turned out to be quite mixed);
- Sports (match, team, game, club, athlete, championship);
- Science (scientist, space, satellite, planet, cell).

Next, we take each article and look at the probability with which it belongs to each topic; as a result, all materials are divided into five groups.
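One simple way to do this split is to assign every article to its most probable topic and then count shares per year. The sketch below assumes topic probabilities arrive as (topic_id, probability) pairs, which is the shape gensim returns for a document; the function names are hypothetical.

```python
from collections import Counter, defaultdict


def dominant_topic(topic_probs):
    """Pick the topic id with the highest probability from (topic_id, prob) pairs."""
    return max(topic_probs, key=lambda pair: pair[1])[0]


def yearly_topic_shares(articles):
    """articles: iterable of (year, [(topic_id, prob), ...]).

    Returns {year: {topic_id: share of that year's articles}}.
    """
    counts = defaultdict(Counter)
    for year, topic_probs in articles:
        counts[year][dominant_topic(topic_probs)] += 1
    return {
        year: {topic: n / sum(per_topic.values()) for topic, n in per_topic.items()}
        for year, per_topic in counts.items()
    }
```

Hard assignment discards the information that an article can mix several topics; summing probabilities instead of counting winners would be an alternative design.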
Politics turned out to be the most popular topic: close to 80% of all publications. However, the peak of popularity of political materials was passed in 201?; now their share is declining, while the contribution of Crime and Sports to the news agenda is growing.
Let's check the adequacy of the topic models against the subcategories assigned by the editors. The top subcategories have been marked more or less correctly since 2013.
There were no particular contradictions: Politics stagnates in 201?, Football and Accidents are growing, and Ukraine is still trending, with a peak in 2015.
Popularity Prediction: XGBClassifier, LogisticRegression, Embedding & LSTM
Let's try to find out whether the popularity of an article on Lenta.ru can be predicted from its text, and what this popularity depends on in general. As the target variable I took the number of Facebook reposts for 2017.
3 thousand articles from 2017 had no reposts on Facebook at all; they were assigned the class “unpopular”. The 3 thousand materials with the largest number of reposts received the label “most popular”.
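The labeling rule above can be written down as a small function. This is a sketch: the input shape (pairs of article id and repost count) and the numeric labels 0 and 1 are assumptions made for illustration.

```python
def label_by_reposts(articles, top_n=3000):
    """articles: list of (article_id, repost_count).

    Returns {article_id: label}, where 0 means "unpopular" (zero reposts)
    and 1 means "most popular" (the top_n articles by repost count).
    Articles in between are left out of the data set.
    """
    labels = {article_id: 0 for article_id, reposts in articles if reposts == 0}
    shared = sorted((a for a in articles if a[1] > 0), key=lambda a: a[1], reverse=True)
    labels.update({article_id: 1 for article_id, _ in shared[:top_n]})
    return labels
```

Dropping the middle of the distribution makes the two classes easier to separate than they would be on the full set, which is worth keeping in mind when reading the accuracy figures.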
The text (6 thousand publications from 2017) was split into unigrams and bigrams (tokens, both single words and two-word phrases), and a matrix was built in which the columns are tokens, the rows are articles, and each cell holds the relative frequency of the token in the article. I used CountVectorizer and TfidfTransformer from sklearn.
The prepared data was fed to XGBClassifier (a gradient boosting classifier from the xgboost library), which, after 13 minutes of hyperparameter search (GridSearchCV with cv = 3), produced an accuracy of 76% on the test set.
Then I used ordinary logistic regression (sklearn.linear_model.LogisticRegression) and after 17 seconds got an accuracy of 81%.
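For comparison, a logistic regression baseline of this kind fits in a few lines. The sketch below is not the author's code: the split ratio, `max_iter` and the use of `TfidfVectorizer` (which the scikit-learn documentation describes as equivalent to CountVectorizer followed by TfidfTransformer) are my assumptions.

```python
def train_text_classifier(texts, labels, test_size=0.25, seed=0):
    """Fit TF-IDF + logistic regression and return (model, held-out accuracy)."""
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline

    # Stratify so both classes keep the same proportion in train and test.
    x_train, x_test, y_train, y_test = train_test_split(
        texts, labels, test_size=test_size, random_state=seed, stratify=labels)
    model = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2)),   # unigram + bigram TF-IDF features
        LogisticRegression(max_iter=1000),     # linear classifier on top
    )
    model.fit(x_train, y_train)
    return model, model.score(x_test, y_test)  # score() is accuracy for classifiers
```

Putting the vectorizer inside the pipeline means it is fitted on the training part only, so the held-out accuracy is not inflated by vocabulary leaking from the test set.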
Once again I am convinced that linear methods are the best choice for text classification, provided the data is carefully prepared.
As a tribute to fashion, I also tried neural networks a little. I converted words to numbers using one_hot from keras, brought all articles to the same length (the pad_sequences function from keras) and fed them through an Embedding layer (to reduce dimensionality and speed up processing) into an LSTM (a recurrent neural network, running on the TensorFlow backend).
The network trained for 2 minutes and showed a test accuracy of 70%. That is far from the limit, but there is not much point in pushing further in this case.
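The pipeline just described (hash words to integers, pad to a fixed length, Embedding, LSTM) can be sketched as follows. To keep the sketch independent of a particular Keras version, `hash_encode` and `pad` are hand-written stand-ins for one_hot and pad_sequences rather than the Keras functions themselves; the layer sizes are arbitrary assumptions.

```python
import zlib


def hash_encode(tokens, vocab_size):
    """Map each token to an integer in [1, vocab_size - 1]; 0 is kept for padding.

    A deterministic stand-in for the hashing trick, not Keras' own one_hot.
    """
    return [zlib.crc32(token.encode("utf-8")) % (vocab_size - 1) + 1 for token in tokens]


def pad(sequence, length):
    """Left-pad with zeros, or keep only the last `length` ids (length must be > 0)."""
    trimmed = sequence[-length:]
    return [0] * (length - len(trimmed)) + trimmed


def build_lstm(vocab_size, embedding_dim=64, lstm_units=64):
    """Embedding -> LSTM -> sigmoid output for the two popularity classes."""
    from tensorflow import keras  # lazy import; requires TensorFlow

    model = keras.Sequential([
        keras.layers.Embedding(input_dim=vocab_size, output_dim=embedding_dim),
        keras.layers.LSTM(lstm_units),
        keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model
```

Hashing means different words can collide on the same integer; a larger `vocab_size` reduces collisions at the cost of a bigger embedding table.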
In general, all methods gave relatively low accuracy. Experience shows that classification algorithms work well when styles vary, in other words on authored materials. The site has such materials, but very few: less than 2%.
The bulk of the corpus is written in neutral news vocabulary. And the popularity of a news item is determined not by the text itself, or even by the topic as such, but by whether it belongs to a rising news trend.
For example, quite a lot of the popular articles cover events in Ukraine, while the least popular ones hardly touch on this topic.
Exploring entities using Word2Vec
To conclude, I wanted to run a sentiment analysis: to understand how journalists feel about the most popular entities they mention in their articles, and whether that attitude changes over time.
But I have no labeled data, and lookups in sentiment thesauri are unlikely to work correctly, since news vocabulary is rather neutral and sparing with emotion. So I decided to focus on the context in which the entities are mentioned.
As a test I took Ukraine (2015 vs 2017) and Putin (2000 vs 2017). I selected the articles that mention them, mapped the text into a multidimensional vector space (Word2Vec from gensim.models) and projected it onto two dimensions using principal component analysis (PCA).
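The procedure above can be sketched as: filter the articles by a term, train Word2Vec on them, and project the word vectors to two dimensions. The parameter names follow gensim 4.x; the vector size, window, `min_count` and the use of scikit-learn's PCA are assumptions, since the text does not state them.

```python
def mentioning(token_lists, term):
    """Keep only the articles whose tokens include the given term."""
    return [tokens for tokens in token_lists if term in tokens]


def embed_2d(token_lists, min_count=5):
    """Train Word2Vec on the articles and project every word vector to 2-D with PCA.

    Returns {word: (x, y)}. Requires gensim 4.x and scikit-learn.
    """
    from gensim.models import Word2Vec
    from sklearn.decomposition import PCA

    model = Word2Vec(sentences=token_lists, vector_size=100, window=5,
                     min_count=min_count, seed=0)
    words = model.wv.index_to_key                          # vocabulary, most frequent first
    points = PCA(n_components=2).fit_transform(model.wv[words])
    return {word: (float(x), float(y)) for word, (x, y) in zip(words, points)}
```

Running `embed_2d(mentioning(articles, term))` separately for each year gives one scatter of points per year; words that land near the term of interest indicate the context it was written about in.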
After rendering, the pictures turned out epic, no less than the Bayeux Tapestry. I cut out the relevant clusters as best I could to make them easier to read; sorry for the "jackals" (compression artifacts).
What I noticed.
The Putin of 2000 always appeared in the context of Russia and delivered his addresses personally. In 2017 the President of the Russian Federation has turned into a leader (whatever that means) and distanced himself from the country; now he is, judging by the context, a representative of the Kremlin who communicates with the world through his press secretary.
Ukraine-2015 in the Russian media is war, battles and explosions, and it is mentioned impersonally (Kiev declared, Kiev began). Ukraine-2017 appears mainly in the context of negotiations between officials, and these people have specific names.
One could go on interpreting the results for a long time, but I think that would be off-topic on this site. Those interested can look for themselves. The code and data are attached.
Link to the script
Link to the data