Google News and Leo Tolstoy: Visualizing Vector Representations of Words with t-SNE

Each of us perceives texts in our own way, whether it is news on the Internet, poetry, or classic novels. The same applies to machine learning algorithms and methods, which, as a rule, perceive texts in mathematical form, as a multidimensional vector space.
This article is devoted to visualizing multidimensional vector representations of words, computed with Word2Vec, using t-SNE. Visualization makes it easier to understand how Word2Vec works and how to interpret the relationships between word vectors before feeding them into neural networks and other machine learning algorithms. The article focuses on visualization; further research and data analysis are out of scope. We use Google News articles and the classic novels of Leo Tolstoy as data sources. The code is written in Python and run in a Jupyter Notebook.
The t-SNE algorithm has already been described on Habré. Its basic idea is to map multidimensional data into a lower-dimensional space while preserving the neighborhood structure of the points: points that are close to each other in the original space remain close to each other in the projection.
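To get a feel for this mapping before applying it to word vectors, here is a minimal toy sketch with scikit-learn; the matrix size and perplexity value are illustrative only and are not the settings used for the figures below.

import numpy as np
from sklearn.manifold import TSNE

# 100 toy "objects" in a 50-dimensional space
points = np.random.RandomState(42).rand(100, 50)

# t-SNE maps them to 2-D while trying to preserve the neighborhood structure
points_2d = TSNE(n_components=2, perplexity=10, random_state=42).fit_transform(points)
print(points_2d.shape)  # (100, 2)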

Vector representations of words and Word2Vec


First of all, we need to represent the words in vector form. For this task I chose Word2Vec, a distributional-semantics tool designed to map the semantic meaning of words into a vector space. Word2Vec finds relationships between words based on the assumption that semantically close words occur in similar contexts. More information about Word2Vec can be found in the original paper [2], as well as here and here.
As input we take articles from Google News and the novels of Leo Tolstoy. In the first case we use vectors pre-trained on the Google News dataset (about 100 billion words), published by Google on the project page.
import gensim

model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
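As a quick sanity check of the loaded vectors, we can query the nearest neighbors of a word; the exact output depends on the model, and 'Tolstoy' is used below simply because it reappears in the examples later in the article.

# Nearest neighbors by cosine similarity in the pre-trained Google News space
print(model.most_similar('Tolstoy', topn=5))

# The classic analogy test: king - man + woman ≈ queen
print(model.most_similar(positive=['king', 'woman'], negative=['man'], topn=1))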

In addition to the pre-trained vectors, we will train another model with the Gensim library [3] on the texts of Leo Tolstoy. Since Word2Vec takes an array of sentences as input, we use the pre-trained Punkt Sentence Tokenizer from the NLTK package to split the text into sentences automatically. The model for the Russian language can be downloaded from here.
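If the Punkt models are not installed yet, they can be fetched once through the NLTK downloader (shown here as an optional setup step):

import nltk

# Downloads the pre-trained Punkt sentence tokenizer models (including the Russian one)
nltk.download('punkt')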
import re
import codecs
import nltk


def preprocess_text(text):
    # Keep only Latin and Cyrillic letters and digits, then collapse repeated spaces
    text = re.sub('[^a-zA-Zа-яА-Я1-9]+', ' ', text)
    text = re.sub(' +', ' ', text)
    return text.strip()


def prepare_for_w2v(filename_from, filename_to, lang):
    raw_text = codecs.open(filename_from, 'r', encoding='windows-1251').read()
    with open(filename_to, 'w', encoding='utf-8') as f:
        for sentence in nltk.sent_tokenize(raw_text, lang):
            print(preprocess_text(sentence.lower()), file=f)

Then, using the Gensim library, we train a Word2Vec model with the following parameters:

- size = 200 - the dimensionality of the feature space;
- window = 5 - the number of context words the algorithm analyzes;
- min_count = 5 - a word must occur at least five times for the model to take it into account.
import multiprocessing
from gensim.models import Word2Vec


def train_word2vec(filename):
    data = gensim.models.word2vec.LineSentence(filename)
    return Word2Vec(data, size=200, window=5, min_count=5,
                    workers=multiprocessing.cpu_count())

Visualizing vector representations of words with t-SNE


t-SNE is extremely useful for visualizing similarities between objects in a multidimensional space. As the amount of data grows, it becomes harder and harder to build a readable plot, so in practice similar words are grouped into clusters for further visualization. Let us take, for example, a few words from the vocabulary of the pre-trained Google News Word2Vec model.
keys = ['Paris', 'Python', 'Sunday', 'Tolstoy', 'Twitter', 'bachelor', 'delivery', 'election', 'expensive',
        'experience', 'financial', 'food', 'iOS', 'peace', 'release', 'war']

embedding_clusters = []
word_clusters = []
for word in keys:
    embeddings = []
    words = []
    for similar_word, _ in model.most_similar(word, topn=30):
        words.append(similar_word)
        embeddings.append(model[similar_word])
    embedding_clusters.append(embeddings)
    word_clusters.append(words)
Figure 1. Groups of similar words from Google News with different perplexity values.
Next we come to the most interesting part of the article: the configuration of t-SNE. Here, the following hyperparameters deserve attention first:

- n_components - the number of components, i.e. the dimensionality of the target space;
- perplexity - a value that in t-SNE can be thought of as the effective number of neighbors; it is related to the number of nearest neighbors used in other manifold-learning methods (see the figure above). Its recommended range is 5 to 50 [1];
- init - the type of initialization of the vectors.
tsne_model_en_2d = TSNE(perplexity=15, n_components=2, init='pca', n_iter=3500, random_state=32)
embedding_clusters = np.array(embedding_clusters)
n, m, k = embedding_clusters.shape
embeddings_en_2d = np.array(tsne_model_en_2d.fit_transform(embedding_clusters.reshape(n * m, k))).reshape(n, m, 2)
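Figure 1 above compares the same clusters under several perplexity values. A minimal sketch of such a sweep, reusing the embedding_clusters array built above (the specific perplexity values are illustrative, and the numpy/TSNE imports from the plotting code below are assumed), might look like this:

# Run t-SNE several times with different perplexity values to compare layouts (cf. Figure 1)
projections = {}
for perplexity in [5, 15, 30, 50]:
    tsne = TSNE(perplexity=perplexity, n_components=2, init='pca',
                n_iter=3500, random_state=32)
    projections[perplexity] = np.array(
        tsne.fit_transform(embedding_clusters.reshape(n * m, k))
    ).reshape(n, m, 2)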
Below is a script for drawing two-dimensional plots with Matplotlib, one of the most popular libraries for data visualization in Python.
Figure 2. Groups of similar words from Google News (perplexity = 15).
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import numpy as np
%matplotlib inline


def tsne_plot_similar_words(labels, embedding_clusters, word_clusters, a=0.7):
    plt.figure(figsize=(16, 9))
    colors = cm.rainbow(np.linspace(0, 1, len(labels)))
    for label, embeddings, words, color in zip(labels, embedding_clusters, word_clusters, colors):
        x = embeddings[:, 0]
        y = embeddings[:, 1]
        plt.scatter(x, y, c=color, alpha=a, label=label)
        for i, word in enumerate(words):
            plt.annotate(word, alpha=0.5, xy=(x[i], y[i]), xytext=(5, 2),
                         textcoords='offset points', ha='right', va='bottom', size=8)
    plt.legend(loc=4)
    plt.grid(True)
    plt.savefig('similar_words.png', format='png', dpi=150, bbox_inches='tight')
    plt.show()


tsne_plot_similar_words(keys, embeddings_en_2d, word_clusters)
Sometimes it is useful to plot not separate clusters of words but the entire vocabulary. For this purpose, let us analyze Anna Karenina, the great story of passion, betrayal, tragedy, and redemption.
prepare_for_w2v('data/Anna Karenina by Leo Tolstoy (ru).txt', 'train_anna_karenina_ru.txt', 'russian')
model_ak = train_word2vec('train_anna_karenina_ru.txt')

words = []
embeddings = []
for word in list(model_ak.wv.vocab):
    embeddings.append(model_ak.wv[word])
    words.append(word)

tsne_ak_2d = TSNE(n_components=2, init='pca', n_iter=3500, random_state=32)
embeddings_ak_2d = tsne_ak_2d.fit_transform(embeddings)

def tsne_plot_2d(label, embeddings, words=[], a=1):
    plt.figure(figsize=(16, 9))
    colors = cm.rainbow(np.linspace(0, 1, 1))
    x = embeddings[:, 0]
    y = embeddings[:, 1]
    plt.scatter(x, y, c=colors, alpha=a, label=label)
    for i, word in enumerate(words):
        plt.annotate(word, alpha=0.3, xy=(x[i], y[i]), xytext=(5, 2),
                     textcoords='offset points', ha='right', va='bottom', size=10)
    plt.legend(loc=4)
    plt.grid(True)
    plt.savefig('hhh.png', format='png', dpi=150, bbox_inches='tight')
    plt.show()


tsne_plot_2d('Anna Karenina by Leo Tolstoy', embeddings_ak_2d, a=0.1)
Figure 3. Visualization of the vocabulary of the Word2Vec model trained on the novel Anna Karenina.
The picture can become even more informative if we use three-dimensional space. Let us take a look at War and Peace, one of the greatest novels in world literature.
prepare_for_w2v('data/War and Peace by Leo Tolstoy (ru).txt', 'train_war_and_peace_ru.txt', 'russian')
model_wp = train_word2vec('train_war_and_peace_ru.txt')

words_wp = []
embeddings_wp = []
for word in list(model_wp.wv.vocab):
    embeddings_wp.append(model_wp.wv[word])
    words_wp.append(word)

tsne_wp_3d = TSNE(perplexity=30, n_components=3, init='pca', n_iter=3500, random_state=12)
embeddings_wp_3d = tsne_wp_3d.fit_transform(embeddings_wp)

from mpl_toolkits.mplot3d import Axes3D  # registers the 3-D projection


def tsne_plot_3d(title, label, embeddings, a=1):
    fig = plt.figure()
    ax = fig.add_subplot(projection='3d')
    colors = cm.rainbow(np.linspace(0, 1, 1))
    ax.scatter(embeddings[:, 0], embeddings[:, 1], embeddings[:, 2], c=colors, alpha=a, label=label)
    ax.legend(loc=4)
    ax.set_title(title)
    plt.show()


tsne_plot_3d('Visualizing Embeddings using t-SNE', 'War and Peace', embeddings_wp_3d, a=0.1)
Figure 4. Visualization of the vocabulary of the Word2Vec model trained on the novel War and Peace.

Source code
The code is available on GitHub. There you can also find the code for rendering the animations.
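The animation code itself is not reproduced in this article, but a rotating 3-D plot like the one above can be turned into a GIF with matplotlib.animation. The sketch below only illustrates the idea; the helper name save_rotation_gif and all parameter values are assumptions, not the script from the repository.

import matplotlib.pyplot as plt
import matplotlib.animation as animation


def save_rotation_gif(embeddings_3d, filename='rotation.gif'):
    fig = plt.figure()
    ax = fig.add_subplot(projection='3d')
    ax.scatter(embeddings_3d[:, 0], embeddings_3d[:, 1], embeddings_3d[:, 2], alpha=0.1)

    def rotate(angle):
        # Change only the azimuth between frames to spin the point cloud
        ax.view_init(elev=10, azim=angle)

    anim = animation.FuncAnimation(fig, rotate, frames=range(0, 360, 2), interval=50)
    anim.save(filename, writer='pillow', dpi=80)


save_rotation_gif(embeddings_wp_3d)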
Sources


1. Maaten L., Hinton G. Visualizing Data using t-SNE // Journal of Machine Learning Research. - 2008. - Vol. 9. - pp. 2579-2605.
2. Mikolov T., Sutskever I., Chen K., Corrado G., Dean J. Distributed Representations of Words and Phrases and their Compositionality // Advances in Neural Information Processing Systems. - 2013. - pp. 3111-3119.
3. Rehurek R., Sojka P. Software Framework for Topic Modelling with Large Corpora // Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. - 2010.