# How I realized that I eat a lot of sweets, or classifying goods from receipts in an app

## Task

In this article we want to describe how we built a solution for classifying product names from receipts in an app that tracks expenses by receipt and acts as a purchase assistant. We wanted to give users statistics on their purchases, collected automatically from scanned receipts; that is, to distribute everything the user bought into categories, because forcing the user to group products by hand is so last century. There are several approaches to this problem: clustering algorithms with various vector representations of words, or classical classification algorithms. We did not invent anything new; in this article we simply want to share a small guide to one possible solution, examples of how not to do it, an analysis of why other methods did not work, and the problems you may run into along the way.

## Clustering

Can you guess what was bought under the name "UTRUSTA krnsht" in one of the Russian stores? True connoisseurs of Swedish design will of course answer right away: a bracket for the UTRUSTA oven. But keeping such specialists on staff is rather expensive. In addition, we had no ready-made labeled dataset suitable for our data on which we could train a model. So we first describe how, in the absence of labeled data, we applied clustering algorithms, and why we did not like the result.

Such algorithms are based on measuring distances between objects, which requires either a vector representation of the names or a string similarity metric (for example, Levenshtein distance). The difficulty at this step lies in building a meaningful vector representation of the names: it is hard to extract features from a name that fully and comprehensively describe the product and its relationship to other products.
The simplest option is Tf-Idf, but in this case the dimensionality of the vector space is quite large, and the space itself is sparse. Moreover, this approach extracts no additional information from the names. As a result, one cluster can contain products from many different categories united by a common word such as "potato" or "salad".

We also cannot control which clusters are formed. The only thing we can specify is the number of clusters (for algorithms that are not based on density peaks in the space). But if this number is too small, one huge cluster appears that swallows every name that did not fit anywhere else; if it is large enough, then after the algorithm finishes we have to look through hundreds of clusters and merge them into semantic categories by hand.

The tables below describe the clusters produced by KMeans with a Tf-Idf vector representation. They show that the distances between cluster centers are smaller than the average distance between objects and the centers of the clusters they belong to. This can be explained by the absence of clear density peaks in the vector space: the cluster centers lie roughly on a circle, while most objects lie outside its boundary. In addition, one cluster forms that contains most of the vectors; most likely it collects names with words that occur more frequently than others across products from different categories.

Table 1. Distances between cluster centers C1–C9 (the numeric values are unreadable in the source).

Table 2. Summary of the clusters (the distance values are unreadable in the source).

| Cluster | Number of objects | Average distance | Minimum distance | Maximum distance |
|---|---|---|---|---|
| C1 | 62530 | ??? | ??? | ??? |
| C2 | 2159 | ??? | ??? | ??? |
| C3 | 1099 | ??? | ??? | ??? |
| C4 | 1292 | ??? | ??? | ??? |
| C5 | 746 | ??? | ??? | ??? |
| C6 | 2451 | ??? | ??? | ??? |
| C7 | 1133 | ??? | ??? | ??? |
| C8 | 876 | ??? | ??? | ??? |
| C9 | 1879 | ??? | ??? | ??? |

Still, in places the clusters turn out quite decent, as in the image below, where almost all the products are cat food.
Doc2Vec is another algorithm that represents texts in vector form. With this approach, each name is described by a vector of much lower dimensionality than with Tf-Idf, and in the resulting vector space similar texts lie close to each other while dissimilar ones lie far apart.

This approach can solve the problems of high dimensionality and sparsity inherent in Tf-Idf. We used the simplest tokenization: we split each name into separate words and took their initial forms. The model was trained on the data as follows:

```python
max_epochs = 100
vec_size = 20
alpha = 0.025

model = doc2vec.Doc2Vec(vector_size=vec_size,
                        alpha=alpha,
                        min_alpha=0.00025,
                        min_count=1,
                        dm=1)

model.build_vocab(train_corpus)

for epoch in range(max_epochs):
    print('iteration {0}'.format(epoch))
    model.train(train_corpus,
                total_examples=model.corpus_count,
                epochs=model.iter)
    # decrease the learning rate
    model.alpha -= 0.0002
    # fix the learning rate, no decay
    model.min_alpha = model.alpha
```
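The simple tokenization mentioned above (split the name into words and take their initial forms) might look like the following sketch. Since a real lemmatizer (e.g. pymorphy2) is not shown here, lowercasing stands in for reducing words to their base form:

```python
import re

def tokenize(name: str) -> list:
    # Extract word tokens (letters only, digits and punctuation dropped);
    # a real pipeline would also lemmatize each token.
    return [t.lower() for t in re.findall(r"[^\W\d_]+", name)]

print(tokenize("Хлеб Бородинский РМ 400г"))
```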

But with this approach we obtained vectors that carry no information about the name; random values could have been used with the same success. Here is one example of the algorithm at work: the image shows products that, in the algorithm's opinion, are similar to "Borodino breadRM n pack ???k".

Perhaps the problem lies in the length and context of the names: the gap in a name like "__ club. Banana 200ml" could equally well be yogurt, juice, or a large can of cream. A better result might be achievable with a different approach to tokenizing the names. We had no experience with this method, and by the time the first attempts had failed we had already found a couple of labeled datasets with product names, so we decided to set this method aside for a while and move on to classification algorithms.

## Classification

### Preprocessing

Product names from receipts arrive in a form that is not always clean: Latin and Cyrillic characters are mixed within words. For example, the Cyrillic letter "а" may be replaced by the Latin "a", which inflates the number of unique names: "молоко" spelled with a Latin "o" and the all-Cyrillic "молоко" would be counted as different words. The names also contain many other typos and abbreviations.

We studied our database and found the most common errors in the names. At this stage we resorted to regular expressions, with which we cleaned the names up and brought them to a common form. This preprocessing improved the result by roughly 7%. Combined with a simple SGDClassifier using the modified Huber loss and tuned parameters, we reached 81% F1 (averaged over all product categories).

```python
sgd_model = SGDClassifier()
parameters_sgd = {
    'max_iter': [100],
    'loss': ['modified_huber'],
    'class_weight': ['balanced'],
    'penalty': ['l2'],
    'alpha': [0.0001]
}
sgd_cv = GridSearchCV(sgd_model, parameters_sgd, n_jobs=-1)
sgd_cv.fit(tf_idf_data, prod_cat)
sgd_cv.best_score_, sgd_cv.best_params_
```
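The regex clean-up step described above can be sketched as follows. The substitution rules here are hypothetical examples, not the expressions actually used:

```python
import re

# Hypothetical normalization: map look-alike Latin letters to Cyrillic,
# drop stray punctuation, collapse whitespace, lowercase.
LATIN_TO_CYR = str.maketrans("aecopxAECOPX", "аесорхАЕСОРХ")

def normalize(name: str) -> str:
    name = name.translate(LATIN_TO_CYR)      # unify look-alike letters
    name = re.sub(r"[^\w\s%.,]", " ", name)  # drop stray punctuation
    name = re.sub(r"\s+", " ", name)         # collapse repeated spaces
    return name.strip().lower()

print(normalize("Молоко 3,2%  ПРОСТОКВАШИНО"))
```

Note that blindly mapping Latin letters to Cyrillic would break genuinely Latin brand names; a real rule set has to be more careful than this sketch.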

Also, keep in mind that people buy some categories more often than others: "Tea and sweets" and "Vegetables and fruits", for example, are far more popular than "Services" and "Cosmetics". With such a data distribution it is better to use algorithms that let you assign a weight (degree of importance) to each class. The weight of a class can be set inversely proportional to the ratio of the number of products in that class to the total number of products. In practice you do not even have to think about this, because these implementations can determine the category weights automatically.

## Getting new data for training

For our application we needed somewhat different categories from those used in the competition, and the product names in our database differed noticeably from those in the contest. So we had to label the goods from our own receipts. We tried to do it ourselves, but realized that even with our whole team involved it would take a very long time. So we decided to use Yandex.Toloka.

There we used the following task form:

- in each cell, a product whose category had to be determined;
- its presumptive category, as predicted by one of our previous models;
- a field for the answer (if the proposed category was incorrect).

We wrote detailed instructions with examples explaining the specifics of each category, and we applied quality-control methods: a set of reference answers shown alongside the ordinary tasks (we produced the reference answers ourselves by labeling several hundred products). Based on the answers to these control tasks, users who labeled the data incorrectly were excluded; over the whole project, however, we banned only three users out of 600+.

With the new data we obtained a model that fit our data better, and accuracy grew by about 11 percentage points, reaching 92%.

## The final model
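The final pipeline, preliminary category probabilities from logistic regression combined with receipt meta-features and fed to a boosting model, can be sketched roughly as follows. The data is illustrative, and GradientBoostingClassifier stands in for the XGBoost used in the article:

```python
# Hypothetical sketch of the stacked setup: a linear model on Tf-Idf name
# features produces category probabilities, which are then combined with
# numeric meta-features (e.g. price) and fed to a boosting classifier.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier

names = ["milk 1l", "dark chocolate", "milk chocolate bar", "kefir 1l"] * 10
prices = np.array([60, 90, 85, 55] * 10, dtype=float).reshape(-1, 1)
labels = ["dairy", "sweets", "sweets", "dairy"] * 10

name_vecs = TfidfVectorizer().fit_transform(names)

# Stage 1: preliminary probabilities from the product name alone.
# (In practice, stage-1 probabilities should come from held-out
# predictions to avoid leaking the training labels into stage 2.)
logreg = LogisticRegression(max_iter=1000).fit(name_vecs, labels)
name_probs = logreg.predict_proba(name_vecs)

# Stage 2: boosting over the probabilities plus meta-information.
features = np.hstack([name_probs, prices])
booster = GradientBoostingClassifier().fit(features, labels)
print(booster.predict(features[:4]))
```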

We began the classification process with a combination of data from several Kaggle datasets (74%), then improved the preprocessing (81%), collected a new dataset (92%), and finally improved the classification process itself. First, logistic regression over the product names yields preliminary probabilities of a product belonging to each category (SGD gave higher accuracy in percentage terms, but still produced large loss values, which hurt the results of the final classifier). Next, we combine these probabilities with other data about the product (its price, the store where it was bought, statistics on the store and the receipt, and other meta-information), and XGBoost is trained on this whole data volume, which gave an accuracy of 98% (another 6% gain). As it turned out, the greatest contribution came from the quality of the training sample.

## Running on the server

To speed up deployment, we set up a simple Flask server in Docker. It had a single method that received the goods to be categorized and returned them with their categories assigned. This let us integrate easily into the existing system, built around Tomcat, without changing the architecture: we simply added one more block to it.

## Release

A few weeks ago we published the release with categorization on Google Play (it will appear on the App Store some time later). It turned out like this:

In the next releases we plan to add the ability to correct categories, which will let us quickly collect categorization errors and retrain the categorization model (for now we do that ourselves).

The Kaggle competitions mentioned:

- www.kaggle.com/c/receipt-categorisation
- www.kaggle.com/c/market-basket-analysis
- www.kaggle.com/c/prod-price-prediction

Author: weber
Published: 18-11-2018, 02:24
Category: Data Mining / Python