How I realized that I eat a lot of sweets, or classifying goods from receipts in an app

Task

In this article we describe how we built a solution for classifying product names from receipts in an application that tracks expenses by receipt and assists with purchases. We wanted to give users purchase statistics collected automatically from scanned receipts, that is, to sort everything the user bought into categories. Making users group products by hand is a thing of the past. There are several approaches to this problem: clustering algorithms with different vector representations of words, or classical classification algorithms. We did not invent anything new. This article is a short guide to one possible solution, with examples of how not to do it, an analysis of why other methods did not work, and the problems you may run into along the way.

What is the “UTRRUSTA krnsht” that was bought in one of the Russian stores? True connoisseurs of Swedish design will of course answer right away: a bracket for the Utrusta oven. Keeping such specialists on staff, however, is quite expensive. In addition, we did not have a ready-made labeled sample suited to our data on which we could train a model. So we first describe how, in the absence of labeled data, we applied clustering algorithms and why we did not like the result.

Clustering algorithms rely on distances between objects, which requires either a vector representation of the objects or a metric for measuring the similarity of words (for example, Levenshtein distance). The difficulty at this step is finding a meaningful vector representation of the names: it is hard to extract features from a name that fully describe the product and its relationship with other products.
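As an illustration of the word-similarity metric mentioned above, here is a minimal sketch of Levenshtein distance in plain Python. The function name and the sample strings are assumptions made for this example, not taken from the application.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character insertions,
    deletions and substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


# Hypothetical receipt names: an abbreviated name stays close to its full form.
print(levenshtein("milk 3.2%", "mlk 3.2%"))  # 1
print(levenshtein("kitten", "sitting"))      # 3
```

A metric like this tolerates typos and dropped letters, but it says nothing about meaning: two short names from different categories can be closer to each other than two spellings of the same product.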
The simplest option is Tf-Idf, but then the vector space has a very high dimension and the space itself is sparse. In addition, this approach does not extract any additional information from the names. As a result, one cluster can contain many products from different categories that merely share a common word such as “potato” or “salad”.

We also cannot control which clusters will be formed. The only thing we can set is the number of clusters (for algorithms that are not based on density peaks in the space). If the number is too small, one huge cluster absorbs all the names that did not fit anywhere else. If it is large enough, then after the algorithm finishes we have to look through hundreds of clusters and merge them into meaningful categories by hand.

The tables below describe the clusters obtained with KMeans on Tf-Idf vectors. They show that the distances between cluster centers are smaller than the average distance between objects and the center of the cluster they belong to. This can be explained by the absence of obvious density peaks in the vector space: the cluster centers lie around a circle, and most objects lie outside its boundary. In addition, one cluster forms that contains most of the vectors. Most likely it collects names containing words that occur more often than others across products of different categories.

Table 1. Distances between clusters.
[Table 1 body: pairwise distances between clusters C1 to C9; the values are not recoverable.]

Table 2. Summary of the clusters.

Cluster   Number of objects
C1        62530
C2        2159
C3        1099
C4        1292
C5        746
C6        2451
C7        1133
C8        876
C9        1879

[The average, minimum and maximum distance columns of Table 2 are not recoverable.]

In some places, however, the clusters turn out quite decent: in one of them, for example, almost all products are cat food.
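To make the shared-word problem concrete, here is a small sketch that builds Tf-Idf vectors by hand and compares them with cosine similarity. It uses only the standard library. The product names, the whitespace tokenizer and the smoothed idf formula are assumptions chosen for illustration, not the setup used in the application.

```python
import math
from collections import Counter

# Hypothetical product names, not taken from the real receipts.
names = [
    "potato chips salted",
    "potato salad with egg",
    "potato young washed",
    "cat food chicken",
]

docs = [name.split() for name in names]
vocab = sorted({word for doc in docs for word in doc})

# Document frequency: in how many names each word occurs.
df = Counter(word for doc in docs for word in set(doc))
n_docs = len(docs)
# Smoothed inverse document frequency (one common variant).
idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}


def tfidf(doc):
    """Return an L2-normalized Tf-Idf vector over the whole vocabulary."""
    tf = Counter(doc)
    vec = [tf[w] * idf[w] for w in vocab]
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine(u, v):
    """Cosine similarity of two already normalized vectors."""
    return sum(x * y for x, y in zip(u, v))


vectors = [tfidf(doc) for doc in docs]

# Share of zero components: the space is sparse even for four names.
zeros = sum(x == 0 for vec in vectors for x in vec)
print(f"vocabulary size: {len(vocab)}, zero share: {zeros / (len(vocab) * n_docs):.2f}")

# Snacks and salad look related only because both contain "potato".
print(f"chips vs salad:    {cosine(vectors[0], vectors[1]):.2f}")
print(f"chips vs cat food: {cosine(vectors[0], vectors[3]):.2f}")
```

Every new word adds a dimension, so on tens of thousands of real names the vectors are almost entirely zeros, and a single frequent word is enough to pull unrelated products together.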
Doc2Vec is another algorithm that represents texts as vectors. With this approach each name is described by a vector of lower dimension than with Tf-Idf. In the resulting vector space similar texts end up close to each other and dissimilar ones far apart.

This approach can solve the problems of high dimensionality and sparsity that come with Tf-Idf. For this algorithm we used the simplest tokenization: we split the name into separate words and took their base forms. The model was trained on the data as follows:

    from gensim.models import doc2vec

    max_epochs = 100
    vec_size = 20
    alpha = ...  # initial learning rate; the value is missing in the source

    model = doc2vec.Doc2Vec(vector_size=vec_size,
                            alpha=alpha,
                            min_alpha=...,  # missing in the source
                            min_count=...,  # missing in the source
                            dm=1)           # dm=1 selects distributed memory (PV-DM)

    # train_corpus: an iterable of TaggedDocument objects built from the tokenized names
    model.build_vocab(train_corpus)

    for epoch in range(max_epochs):
        print('iteration {0}'.format(epoch))
        model.train(train_corpus,
                    total_examples=model.corpus_count,
                    epochs=model.epochs)
        # decrease the learning rate
        model.alpha -= ...  # step missing in the source
        # fix the learning rate, no decay
        model.min_alpha = model.alpha

But with this approach we obtained vectors that carry no information about the name: random values would work just as well. As one example, we looked at the goods the algorithm considered similar to “Borodino breadRM n pack ???k.”.

Perhaps the problem lies in the length and context of the names: the blank in the name “__ club. Banana 200ml” could be yogurt, juice, or a big jar of cream. A better result might be achieved with a different approach to tokenizing the names. We had no experience with this method, and by the time the first attempts failed we had already found a couple of labeled datasets with product names, so we decided to set this method aside for a while and move on to classification algorithms.

Classification

Preprocessing

Product names from receipts do not always arrive in a clean form: Latin and Cyrillic letters are mixed within words. For example, the Cyrillic letter “а” can be replaced by the Latin “a”, which increases the number of unique names: two visually identical spellings of “milk” will be treated as different words. The names also contain many other typos and abbreviations.

We studied our database and found the typical errors in the names. At this stage we got by with regular expressions, which we used to clean the names and bring them to a common form. This improved the result by approximately 7%. Together with a simple SGD classifier based on the modified Huber loss function with tuned parameters, we obtained an F1 score of 81% (averaged over all product categories).

    from sklearn.linear_model import SGDClassifier
    from sklearn.model_selection import GridSearchCV

    sgd_model = SGDClassifier()
    # Every parameter has a single candidate value, so the grid search
    # only cross-validates this one configuration.
    parameters_sgd = {
        'max_iter': [100],
        'loss': ['modified_huber'],
        'class_weight': ['balanced'],
        'penalty': ['l2'],
        'alpha': [0.0001],
    }
    sgd_cv = GridSearchCV(sgd_model, parameters_sgd, n_jobs=-1)
    # tf_idf_data: Tf-Idf matrix of the cleaned names; prod_cat: category labels
    sgd_cv.fit(tf_idf_data, prod_cat)
    sgd_cv.best_score_, sgd_cv.best_params_
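
The clean-up step described above might look roughly like the sketch below. It lowercases the name, removes stray symbols with regular expressions, and maps Latin look-alike letters to Cyrillic inside words that are mostly Cyrillic. The character table, the "mostly Cyrillic" heuristic, the function names and the sample strings are all assumptions made for illustration. The actual rules used in the application are not given in the article.

```python
import re

# Lowercase Latin letters that look identical to Cyrillic ones
# (a partial, assumed table; Cyrillic is written as escapes to avoid ambiguity).
LATIN_TO_CYRILLIC = str.maketrans({
    "a": "\u0430", "c": "\u0441", "e": "\u0435", "o": "\u043e",
    "p": "\u0440", "x": "\u0445", "y": "\u0443",
})

CYRILLIC = re.compile("[\u0430-\u044f\u0451\u0410-\u042f\u0401]")
LATIN = re.compile("[a-zA-Z]")


def fix_mixed_word(word: str) -> str:
    """Replace Latin look-alikes only in words dominated by Cyrillic letters,
    so that genuine Latin brand names are left untouched."""
    n_cyr = len(CYRILLIC.findall(word))
    n_lat = len(LATIN.findall(word))
    if n_cyr > 0 and n_lat > 0 and n_cyr >= n_lat:
        return word.translate(LATIN_TO_CYRILLIC)
    return word


def normalize_name(name: str) -> str:
    """Bring a raw receipt line to a common form."""
    name = name.lower()
    name = re.sub(r"[^\w\s%.,]", " ", name)   # drop stray symbols
    name = re.sub(r"\s+", " ", name).strip()  # collapse whitespace
    return " ".join(fix_mixed_word(w) for w in name.split(" "))


# The Russian word for "milk" in pure Cyrillic, followed by a fat percentage.
clean = "\u043c\u043e\u043b\u043e\u043a\u043e 3.2%"
# The same name with every Cyrillic "o" replaced by a Latin "o", plus noise.
mixed = clean.replace("\u043e", "o") + "  *"

print(mixed == clean)                                  # False: different code points
print(normalize_name(mixed) == normalize_name(clean))  # True after clean-up
```

Restricting the replacement to words dominated by Cyrillic letters is a design choice of this sketch: it keeps names written entirely in Latin, such as imported brands, from being corrupted.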

Also, keep in mind that people buy goods from some categories more often than from others: for example, “Tea and sweets” and “Vegetables and fruits” are much more popular than “Services” and “Cosmetics”. With such a data distribution it is better to use algorithms that let you set a weight (degree of importance) for each class. A class weight can be set inversely proportional to the share of the class, that is, to the ratio of the number of products in the class to the total number of products. You do not have to compute it yourself, though, because the implementations of these algorithms can determine category weights automatically.

Getting new data for training

For our application we needed slightly different categories than those used in the competition, and the product names in our database differed significantly from those in the contest. So we needed to label the goods from our own receipts. We tried to do it ourselves, but realized that even with the whole team involved it would take a very long time. Therefore we decided to use Yandex.Toloka.

There we used the following task form:

  • each cell contained a product whose category had to be determined
  • its presumed category, as predicted by one of our previous models
  • a field for the answer (if the proposed category was incorrect)

We wrote detailed instructions with examples that explained the specifics of each category, and also used quality control: a set of tasks with reference answers that were shown alongside the regular tasks (we produced the reference answers ourselves by labeling several hundred products). Users who labeled the data incorrectly were filtered out based on their answers to these tasks. Over the whole project, however, we banned only three users out of 600+.

With the new data we got a model that fit our data better, and the accuracy increased further (by ~11%) to 92%.

The final model

We started the classification work with a combination of several Kaggle datasets (74%), then improved the preprocessing (81%), collected a new dataset (92%), and finally improved the classification process itself. First, logistic regression gives preliminary probabilities of a product belonging to each category based on its name. SGD gave a higher accuracy in percent, but still had large values of the loss function, which hurt the results of the final classifier. Next, we combine these probabilities with other data about the product (its price, the store where it was bought, store statistics, the receipt and other meta-information), and XGBoost is trained on all of this, which gave an accuracy of 98% (another 6% gain). As it turned out, the quality of the training sample made the greatest contribution.

Run on server

To speed up deployment, we set up a simple Flask server in Docker. It had a single method that received the goods to be categorized from the main server and returned them with categories. This made it easy to integrate into the existing system, which is centered on Tomcat, and we did not have to change the architecture: we simply added one more block.

Release

A few weeks ago we published a release with categorization on Google Play (it will appear on the App Store after some time).

In upcoming releases we plan to let users correct categories, which will allow us to quickly collect categorization errors and retrain the categorization model (for now we do this ourselves).

Competitions on Kaggle mentioned above:

www.kaggle.com/c/receipt-categorisation
www.kaggle.com/c/market-basket-analysis
www.kaggle.com/c/prod-price-prediction

Author: weber
Publication date: 18-11-2018, 02:24
Category: Data Mining / Python