VotingClassifier in scikit-learn: building and optimizing an ensemble of classification models

While working on a large sentiment analysis task (analysis of reviews), I decided to spend some time on one of its components: using VotingClassifier from sklearn.ensemble to build an ensemble of classification models and improve the final quality of predictions. Why does this matter, and what are the nuances?
When solving an applied data analysis problem, it is often not immediately obvious (or not obvious at all) which learning model fits best. One option is to pick the most popular or most intuitively suitable model given the nature of the data, tune its parameters (for example, with GridSearchCV), and use it. Another option is an ensemble of models, where the outputs of several models jointly produce the final result. The purpose of this article is not to describe the advantages of ensembles or the principles of building them (these are well covered elsewhere), but to walk through one applied approach and its nuances on a specific example.

The overall problem statement is as follows: we are given a test sample of only 100 reviews of mobile phones, and we need a pre-trained model that shows the best result on these 100 reviews, that is, correctly determines whether each review is positive or negative. An additional difficulty, which follows from the problem statement, is the absence of a training sample. To get around it, 1?000 mobile phone reviews and their ratings were scraped from a Russian website using the Beautiful Soup library.

Skipping the parsing, preprocessing, and exploration of the raw data, we move on to the point where we have:

  • a training sample of 1?000 phone reviews, each labeled as binary (positive or negative). Reviews with ratings 1-3 are labeled negative, and reviews with ratings 4-5 are labeled positive.
  • the data converted with CountVectorizer into a form suitable for training classifier models.
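A minimal sketch of the two preparation steps listed above, assuming a handful of made-up reviews in place of the scraped data. The names `reviews`, `ratings`, `labels`, and `features` are hypothetical; only the 1-3 / 4-5 labeling rule and the use of CountVectorizer come from the article.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data standing in for the scraped reviews and their 1-5 ratings.
reviews = [
    "great phone, great battery",
    "terrible screen and weak battery",
    "good camera, good price",
    "stopped working after a week",
]
ratings = [5, 2, 4, 1]

# Labeling rule from the article: ratings 1-3 are negative (0), 4-5 are positive (1).
labels = [1 if rating >= 4 else 0 for rating in ratings]

# Bag-of-words representation: one row per review, one column per vocabulary word.
vectorizer = CountVectorizer()
features = vectorizer.fit_transform(reviews)

print(labels)
print(features.shape)
# Count of the word "great" in the first review.
print(features[0, vectorizer.vocabulary_["great"]])
```

The resulting sparse matrix and label list play the same role as `data_messages_vectorized` and `df_texts['Binary_Rate']` in the article's code.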
How do we decide which model will work best? We cannot simply try models one by one by hand: a test sample of only 100 reviews creates a large risk that some model just happens to fit this particular test sample, while on an additional sample hidden from us, or in production, its result will be below average.

To address this, scikit-learn provides the VotingClassifier class, a convenient tool for combining several dissimilar machine learning models into a single classifier. This reduces the risk of overfitting, as well as the risk of misreading the results of any single model. VotingClassifier is imported with the following command:

    from sklearn.ensemble import VotingClassifier

Practical details of working with this class:

1) The first and most important point is how the combined classifier arrives at a single prediction once each of its constituent models has made its own. VotingClassifier has a voting parameter with two possible values: 'hard' and 'soft'.

1.1) With 'hard', the final answer of the combined classifier is the "opinion" of the majority of its members. For example, suppose your combined classifier uses three different models. For a given observation, two of them predict "positive review" and the third predicts "negative review". The final prediction for this observation is "positive review", since there are 2 votes for and 1 against.

1.2) With 'soft', the predictions are weighted and a full-fledged "vote" is held for each class: the final answer of the combined classifier is the argmax of the sum of predicted probabilities. IMPORTANT! To use this voting method, each classifier in your ensemble must support the predict_proba() method, which returns a quantitative estimate of the probability of belonging to each class. Note that not all classifier models support this method, so not all of them can be used in VotingClassifier with weighted probabilities (soft voting).

Consider an example: there are three classifiers and two review classes, positive and negative.
Through predict_proba, each classifier $i$ returns a probability $p_{i,1}$ that a given observation belongs to class 1 and, accordingly, a probability $p_{i,2} = 1 - p_{i,1}$ that it belongs to class 2. Having received an answer from each model, the combined classifier weights these estimates and returns the class with the larger weighted sum:

$$ \hat{y} = \arg\max_{c \in \{1, 2\}} \left( w_1 p_{1,c} + w_2 p_{2,c} + w_3 p_{3,c} \right) $$

where $w_1, w_2, w_3$ are the weights of the classifiers in the ensemble (equal by default), and $p_{i,c}$ is the probability that classifier $i$ assigns to class $c$. The weights used in soft voting can be changed with the weights parameter, so the call looks like this:

    eclf = VotingClassifier(estimators=[('lr', clf1), ('nb', clf2), ('sgd', clf3)], voting='soft', weights=[*, *, *])

where the asterisks stand for the required weight of each model.

2) VotingClassifier and GridSearchCV can be used simultaneously to optimize the hyperparameters of each classifier in the ensemble.

When you plan to use an ensemble and want the models in it to be tuned, you can run GridSearchCV directly on the combined classifier. The code below shows how to work with the constituent models (logistic regression, naive Bayes, stochastic gradient descent) while staying within the combined classifier (VotingClassifier):

    from sklearn.ensemble import VotingClassifier
    from sklearn.linear_model import LogisticRegression, SGDClassifier
    from sklearn.model_selection import GridSearchCV
    from sklearn.naive_bayes import MultinomialNB

    clf1 = LogisticRegression()
    clf2 = MultinomialNB()
    # max_iter is garbled in the source ("100?"); 1000 is an assumed value.
    # 'log_loss' was spelled 'log' before scikit-learn 1.1, as in the original article.
    clf3 = SGDClassifier(max_iter=1000, loss='log_loss')
    # Majority (hard) voting, as described in point 1.1.
    eclf = VotingClassifier(estimators=[('lr', clf1), ('nb', clf2), ('sgd', clf3)], voting='hard')

    # Parameter grid to search; the '<name>__<parameter>' syntax selects the model to tune.
    params = {'lr__C': [0.5, 1, 1.5], 'lr__class_weight': [None, 'balanced'],
              'nb__alpha': [0.1, 1, 2],
              'sgd__penalty': ['l2', 'l1'], 'sgd__alpha': [0.0001, 0.001, 0.01]}

    # Train and optimize with 5-fold cross-validation on the collected training set.
    grid = GridSearchCV(estimator=eclf, param_grid=params, cv=5, scoring='accuracy', n_jobs=-1)
    grid = grid.fit(data_messages_vectorized, df_texts['Binary_Rate'])

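The keys of the params dictionary follow scikit-learn's `<estimator name>__<parameter>` convention, where the name is the label given to each model in the estimators list. The sketch below, which assumes the same three model types as in the article's grid search, lists the keys that the ensemble exposes through get_params(); the variable `tunable` is hypothetical.

```python
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import MultinomialNB

eclf = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression()),
        ("nb", MultinomialNB()),
        ("sgd", SGDClassifier()),
    ],
    voting="hard",
)

# get_params() exposes every nested parameter as "<name>__<parameter>".
tunable = sorted(key for key in eclf.get_params() if "__" in key)

# Each key used in the parameter grid should appear in this list.
for key in ("lr__C", "lr__class_weight", "nb__alpha", "sgd__penalty", "sgd__alpha"):
    print(key, key in tunable)
```

Printing `tunable` in full is a quick way to check a grid for misspelled keys before starting a long search.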
Thus, the params dictionary must be defined so that GridSearchCV can tell which model in the ensemble each parameter belongs to: every key is the estimator's name from the estimators list, followed by a double underscore and the parameter name (for example, 'lr__C').

That is all you need to know to make full use of VotingClassifier as a way to build and optimize an ensemble of models. Let's look at the results:

    print(grid.best_params_)
    {'lr__class_weight': 'balanced', 'sgd__penalty': 'l1', 'nb__alpha': ? 'lr__C': ? 'sgd__alpha': ???}

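As a side note on the two voting rules from point 1, here is a small pure-Python sketch that works them out by hand for a single review. It is an illustration under stated assumptions, not scikit-learn's implementation: the functions `hard_vote` and `soft_vote` and the example numbers are hypothetical, and tie-breaking is not meant to mirror the library.

```python
from collections import Counter


def hard_vote(labels):
    """Return the label predicted by the most classifiers."""
    return Counter(labels).most_common(1)[0][0]


def soft_vote(probabilities, weights=None):
    """Return the index of the class with the largest weighted probability sum.

    probabilities[i][c] is classifier i's probability for class c.
    """
    if weights is None:
        weights = [1.0] * len(probabilities)  # equal weights by default
    n_classes = len(probabilities[0])
    totals = [
        sum(w * p[c] for w, p in zip(weights, probabilities))
        for c in range(n_classes)
    ]
    return max(range(n_classes), key=totals.__getitem__)


# Hypothetical outputs of three classifiers for one review
# (class 0 = negative, class 1 = positive).
predicted_labels = [1, 1, 0]
predicted_probabilities = [[0.45, 0.55], [0.40, 0.60], [0.90, 0.10]]

print(hard_vote(predicted_labels))
print(soft_vote(predicted_probabilities))
print(soft_vote(predicted_probabilities, weights=[3, 3, 1]))
```

With these numbers the two rules disagree: two weakly confident "positive" votes win the hard vote, while one very confident "negative" prediction dominates the equal-weight soft vote. Raising the weights of the first two classifiers flips the soft vote back to "positive".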
The optimal parameter values have been found. It remains to cross-validate on the training set and compare the individual models, each with its optimal parameters, against the ensemble built from them:

    from sklearn.model_selection import cross_val_score

    for clf, label in zip([clf1, clf2, clf3, eclf], ['Logistic Regression', 'Naive Bayes', 'SGD', 'Ensemble_HardVoting']):
        # cv is garbled in the source; 5 folds are assumed, matching the grid search.
        scores = cross_val_score(clf, data_messages_vectorized, df_texts['Binary_Rate'], cv=5, scoring='accuracy')
        print("Accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), label))

The final result:

    Accuracy: ??? (± ???) [Logistic Regression]
    Accuracy: ??? (± ???) [Naive Bayes]
    Accuracy: ??? (± ???) [SGD]
    Accuracy: ??? (± ???) [Ensemble_HardVoting]

As you can see, the models performed somewhat differently on the training sample (with default parameters the difference was more noticeable). The ensemble's overall score (in terms of accuracy) does not have to exceed the best score among its constituent models: the ensemble is rather a more stable model that can show roughly similar results on a test sample and in production, and thus reduces the risk of overfitting, of fitting to the training sample, and of other related classifier problems. Good luck solving applied problems, and thank you for your attention!

P.S. Given the specifics and rules of sandbox publication, I cannot link to GitHub and the source code for the analysis in this article, or to Kaggle, where the InClass competition provided the test set and the tools for checking models on it. I can only say that this ensemble significantly beat the baseline and took a respectable place on the leaderboard after evaluation on the test set. I hope to share more in future publications.


Author: weber
Publication date: 18-11-2018, 17:15
Category: Python / Algorithms / Machine learning