VotingClassifier in scikit-learn: building and optimizing an ensemble of classification models

As part of a large Sentiment Analysis task (analysis of reviews), I decided to devote some time to a closer study of one of its separate elements: using VotingClassifier from sklearn.ensemble as a tool for building an ensemble of classification models and improving the final quality of predictions. Why is this important, and what are the nuances?

It often happens that, in the course of solving an applied data-analysis problem, it is not immediately obvious (or not obvious at all) which learning model is best suited. One solution is to choose the most popular and/or intuitively suitable model based on the nature of the available data; in that case the parameters of the selected model are optimized (for example, via GridSearchCV) and it is used in the work. Another approach is to use an ensemble of models, where the results of several of them contribute to the final answer. Let me say right away that the purpose of the article is not to describe the advantages of an ensemble of models or the principles of its construction, but rather to walk through one particular applied approach to solving the problem, with its nuances, using a specific example.

**The global problem statement is the following**: a total of **100** reviews of mobile phones are given as a test sample, and we need a pre-trained model that will show the best result on these 100 reviews, namely, determine whether each review is positive or negative. An additional complexity, as follows from the conditions of the problem, is the absence of a training sample. To overcome this difficulty, 10,000 reviews of mobile phones, together with their ratings, were successfully scraped from one of the Russian sites with the help of the Beautiful Soup library.

Skipping the steps of parsing, preprocessing the data and studying their original structure, we arrive at the point where we have:

- a training sample of 10,000 phone reviews, each labelled binary (positive or negative); the labels were obtained by treating reviews with grades 1-3 as negative and grades 4-5 as positive;
- the data, transformed with CountVectorizer into a form suitable for training classifier models.

**How do we decide which model will work best?** We do not have the option of manually iterating over models: a test sample of only 100 reviews creates a huge risk that some model will simply fit this particular test sample better, while on an additional sample hidden from us, or "in battle", its result would be below average.

To solve this problem:

**the Scikit-learn library provides the VotingClassifier module**, an excellent tool for using several machine learning models that are not similar to each other and combining them into one classifier. This reduces the risk of overfitting, as well as the risk of misinterpreting the results of any one single model. The VotingClassifier module is imported with the following command:

```python
from sklearn.ensemble import VotingClassifier
```

Practical details of working with this module:

1) The first and most important question is how the combined classifier arrives at a single prediction after receiving the predictions of each of its constituent models. Among the parameters of VotingClassifier there is the parameter

*voting* with two possible values: 'hard' and 'soft'.

1.1) In the first case, the final answer of the combined classifier corresponds to the "opinion" of the majority of its members. For example, suppose your combined classifier uses data from three different models. On a specific observation, two of them predict "positive feedback" and the third "negative feedback". The final prediction for this observation will then be "positive feedback", since we have 2 votes "for" and 1 "against".

1.2) In the second case, i.e. when using the 'soft' value of the *voting* parameter, a full-fledged "vote" takes place, with weighting of the model predictions for **each** class: the final answer of the combined classifier is the argmax of the sum of the predicted probabilities.
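Both voting modes can be illustrated with a small NumPy sketch; the predictions and probabilities below are hypothetical, not taken from the article's dataset:

```python
import numpy as np

# Hypothetical outputs of three classifiers for one review.
# Hard voting: each model casts one class label as its vote.
votes = np.array([1, 1, 0])            # 1 = positive, 0 = negative
hard_answer = np.bincount(votes).argmax()
print(hard_answer)                     # 1: "positive" wins 2 votes to 1

# Soft voting: each model returns class probabilities (via predict_proba);
# rows = classifiers, columns = [class 0, class 1].
probas = np.array([[0.1, 0.9],
                   [0.4, 0.6],
                   [0.7, 0.3]])
weights = np.array([1.0, 1.0, 1.0])    # equal weights, as by default
weighted = weights @ probas            # per-class weighted sums
print(weighted)                        # [1.2 1.8]
print(weighted.argmax())               # 1: class 1 wins the weighted vote
```

Here both modes agree, but with other probability values hard and soft voting can disagree, which is exactly why the choice of the *voting* parameter matters.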

**IMPORTANT!** To be able to use soft voting, **each** classifier in your ensemble must support the **predict_proba()** method, which returns a quantitative estimate of the probability of belonging to each of the classes. Note that not all classifier models support this method, and accordingly not all of them can be used inside VotingClassifier with the weighted-probability method (soft voting).

Let's work through an example: there are three classifiers and two classes of feedback, positive and negative. Each classifier, via predict_proba, returns some probability p with which it assigns a given observation to class 1 and, accordingly, the probability (1 - p) of class 2. The combined classifier, having received an answer from each model, weights the obtained estimates and produces the final result as

$$\arg\max\left(w_1 p_1^{(1)} + w_2 p_1^{(2)} + w_3 p_1^{(3)},\; w_1 p_2^{(1)} + w_2 p_2^{(2)} + w_3 p_2^{(3)}\right)$$

where $w_1, w_2, w_3$ are the weights of the classifiers in the ensemble (equal by default), and $p_1^{(i)}, p_2^{(i)}$ are the $i$-th classifier's scores for belonging to class 1 and class 2 respectively. Note also that the classifier weights under soft voting can be changed using the weights parameter, so the module call should look like this:

```python
eclf = VotingClassifier(estimators=[('', clf1), ('', clf2), ('', clf3)], voting='soft', weights=[*, *, *])
```

where the required weight of each model is specified in place of the asterisks.

2) It is possible to use the VotingClassifier module and GridSearch **simultaneously** to optimize the hyperparameters of each classifier in the ensemble.

When you plan to use an ensemble and want the models included in it to be optimized, GridSearch can be applied to the combined classifier itself. The code below shows how to work with the models included in it (logistic regression, naive Bayes, stochastic gradient descent) while staying within the combined classifier (VotingClassifier):

```python
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

clf1 = LogisticRegression()
clf2 = MultinomialNB()
clf3 = SGDClassifier(max_iter=1000, loss='log')  # in newer scikit-learn versions this loss is named 'log_loss'

# set the voting method to majority vote (hard voting), see section 1.1
eclf = VotingClassifier(estimators=[('lr', clf1), ('nb', clf2), ('sgd', clf3)], voting='hard')

# the grid of parameters to search over and compare; the '<name>__<param>'
# syntax is important, it routes each parameter to the right model
params = {'lr__C': [0.5, 1, 1.5],
          'lr__class_weight': [None, 'balanced'],
          'nb__alpha': [0.1, 1, 2],
          'sgd__penalty': ['l2', 'l1'],
          'sgd__alpha': [0.0001, 0.001, 0.01]}

grid = GridSearchCV(estimator=eclf, param_grid=params, cv=5, scoring='accuracy', n_jobs=-1)

# with all the conditions set, train and optimize with 5-fold CV on the collected training set
grid = grid.fit(data_messages_vectorized, df_texts['Binary_Rate'])
```
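The `'lr__C'`-style keys work because VotingClassifier exposes the parameters of its member models under the estimator's name followed by a double underscore. This can be checked directly on a small ensemble (a sketch; the estimator names here are illustrative):

```python
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

eclf = VotingClassifier(estimators=[('lr', LogisticRegression()),
                                    ('nb', MultinomialNB())], voting='hard')

# get_params() lists every nested parameter reachable by GridSearchCV
param_names = [p for p in eclf.get_params() if '__' in p]
print('lr__C' in param_names)      # True: tunable through the grid
print('nb__alpha' in param_names)  # True
```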

Thus, the params dictionary must be defined in such a way that, when GridSearch accesses it, it can determine which model in the ensemble each parameter to be optimized belongs to.

That is everything you need to know to make full use of the VotingClassifier tool as a way of building an ensemble of models and optimizing it. Let's look at the results:


```python
print(grid.best_params_)
# {'lr__class_weight': 'balanced', 'sgd__penalty': 'l1', 'nb__alpha': ?, 'lr__C': ?, 'sgd__alpha': ???}
```
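Since GridSearchCV refits on the full training set by default, grid.best_estimator_ already holds the VotingClassifier with these best parameters, ready for prediction. A minimal self-contained sketch on toy data (the dataset and the two-model ensemble here are illustrative, not the article's):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# toy binary classification data standing in for the vectorized reviews
X, y = make_classification(n_samples=200, random_state=0)

eclf = VotingClassifier(estimators=[('lr', LogisticRegression(max_iter=1000)),
                                    ('dt', DecisionTreeClassifier(random_state=0))],
                        voting='hard')
grid = GridSearchCV(eclf, {'lr__C': [0.5, 1.0], 'dt__max_depth': [2, 4]},
                    cv=3, scoring='accuracy')
grid.fit(X, y)

best_eclf = grid.best_estimator_        # VotingClassifier with the tuned parameters
predictions = best_eclf.predict(X[:5])  # in the article this would be the 100 test reviews
print(sorted(grid.best_params_))        # the tuned parameter names
```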

The optimal values of the parameters have been found; it remains to cross-validate on the training set and compare the individual models with their optimal parameters against the ensemble of classifiers (VotingClassifier) built from them:

```python
from sklearn.model_selection import cross_val_score

for clf, label in zip([clf1, clf2, clf3, eclf],
                      ['Logistic Regression', 'Naive Bayes', 'SGD', 'Ensemble_HardVoting']):
    scores = cross_val_score(clf, data_messages_vectorized, df_texts['Binary_Rate'],
                             cv=5, scoring='accuracy')
    print("Accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), label))
```

The final result:

```
Accuracy: ??? (± ???) [Logistic Regression]
Accuracy: ??? (± ???) [Naive Bayes]
Accuracy: ??? (± ???) [SGD]
Accuracy: ??? (± ???) [Ensemble_HardVoting]
```

As can be seen, the models behaved somewhat differently on the training sample (with default parameters this difference was more noticeable). Note that the aggregate score (in terms of the accuracy metric) of the ensemble does not have to exceed the best score among its member models: rather, the ensemble is a more stable model which can show roughly similar results on the test sample and "in battle", and thereby reduce the risk of overfitting, of fitting to the training sample, and of other related classifier problems. Good luck in solving applied problems, and thank you for your attention!

P.S. Given the specifics and rules of publication in the sandbox, I cannot provide a link to GitHub with the source code for the analysis given in this article, nor references to Kaggle and the InClass competition that provided the test set and the tools for evaluating models on it. I can only say that this ensemble significantly beat the baseline and took a worthy place on the leaderboard after evaluation on the test set. I hope to share more in future publications.

Author: weber (18-11-2018, 17:15)

Category: Python / Algorithms / Machine learning