Building a machine learning project in Python. Part 3

*Translation of A Complete Machine Learning Walk-Through in Python: Part Three.*

Many people dislike the fact that machine learning models are black boxes: we put data in and, without any explanation, get answers out, often very accurate answers. In this article we will try to figure out how the model we have built makes its predictions and what it can tell us about the problem we are solving. And we will conclude with the most important part of a machine learning project: documenting what has been done and presenting the results.

In the first part we covered data cleaning, exploratory analysis, and feature engineering and selection. In the second part we covered imputing missing data, implementing and comparing machine learning models, hyperparameter tuning with random search cross-validation, and, finally, evaluating the resulting model.

All of the project code is on GitHub, and the third Jupyter Notebook, which corresponds to this article, is there as well. You can use it for your own projects!

So, we are solving a problem with machine learning, more precisely, with supervised regression. Using building energy data from New York, we built a model that predicts a building's Energy Star Score. The final model, a gradient boosting regressor, predicts the score on the test data to within 9.1 points (on a scale from 1 to 100).
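The 9.1-point figure is the model's mean absolute error (MAE). As a quick reminder of how that metric is computed, here is a toy sketch with made-up numbers (not the project's data):

```python
import numpy as np

# Hypothetical true scores and model predictions (illustrative only)
y_true = np.array([60.0, 75.0, 100.0, 40.0])
y_pred = np.array([55.0, 80.0, 90.0, 42.0])

# Mean absolute error: the average size of the prediction miss
mae = np.mean(np.abs(y_pred - y_true))
print(mae)  # 5.5
```

An MAE of 9.1 therefore means that, on average, the model's prediction is 9.1 points away from the true score.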

#### Model interpretation

Gradient boosting regression sits roughly in the middle of the model interpretability scale: the model as a whole is complex, but it consists of hundreds of fairly simple decision trees. There are three ways to understand how our model works:

1. Evaluate the feature importances.
2. Visualize one of the decision trees.
3. Apply LIME (Local Interpretable Model-Agnostic Explanations).

The first two methods are specific to ensembles of trees, while the third, as its name suggests, can be applied to any machine learning model. LIME is a relatively new approach and a noticeable step forward in the attempt to explain the workings of machine learning.

#### Feature importances

Feature importance shows how relevant each feature is to the prediction. The technical details of the method are complex (importance is measured by the mean decrease in impurity, or by the reduction in error from including the feature), but we can use the relative values to understand which features matter most. In Scikit-Learn, feature importances can be extracted from any tree-based ensemble of learners.

In the code below, `model` is our trained model, and `model.feature_importances_` gives the feature importances. We then put them into a Pandas dataframe and display the 10 most important features:

```python
import pandas as pd

# model is the trained model
importances = model.feature_importances_

# train_features is the dataframe of training features
feature_list = list(train_features.columns)

# Extract the feature importances into a dataframe
feature_results = pd.DataFrame({'feature': feature_list,
                                'importance': importances})

# Show the top 10 most important
feature_results = feature_results.sort_values('importance',
                                              ascending=False).reset_index(drop=True)

feature_results.head(10)
```
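Since `model` and `train_features` come from the earlier parts of the project, here is a self-contained sketch of the same idea on synthetic data (all names and numbers here are illustrative, not the project's):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.RandomState(42)

# Synthetic data: the target depends mostly on the first feature
X = rng.rand(200, 4)
y = 10 * X[:, 0] + X[:, 1] + 0.1 * rng.rand(200)

model = GradientBoostingRegressor(random_state=42).fit(X, y)

# Importances always sum to 1, so the values are directly comparable
feature_results = pd.DataFrame({'feature': ['f0', 'f1', 'f2', 'f3'],
                                'importance': model.feature_importances_})
feature_results = feature_results.sort_values('importance',
                                              ascending=False).reset_index(drop=True)
print(feature_results)
# f0 dominates here, just as Site EUI dominates in the project
```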

The most important features are `Site EUI` (energy use intensity) and `Weather Normalized Site Electricity Intensity`, which together account for more than 66% of the total importance. After the third feature, importance drops off dramatically, which suggests that we may not need all 64 features to achieve high prediction accuracy (in the Jupyter notebook this theory is tested using only the 10 most important features, and the resulting model was not quite accurate enough).

Based on these results, we can finally answer one of our initial questions: the most important indicators of the Energy Star Score are Site EUI and Weather Normalized Site Electricity Intensity. We will not go too deep into the weeds of feature importances here; suffice it to say that they are a good starting point for understanding how the model makes its predictions.

#### Visualizing a single decision tree

Grasping an entire gradient boosting regression model is hard, but an individual decision tree is not. Any tree can be visualized with the Scikit-Learn function `export_graphviz`. First we extract a tree from the ensemble and save it as a dot file:

```python
from sklearn import tree

# Extract a single tree (number 105)
single_tree = model.estimators_[105][0]

# Save the tree to a dot file
tree.export_graphviz(single_tree, out_file='images/tree.dot',
                     feature_names=feature_list)
```
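Note that for a fitted `GradientBoostingRegressor`, `model.estimators_` is a 2-D array of trees with shape `(n_estimators, 1)` for single-output regression, which is why the `[105][0]` indexing appears above. A quick check on synthetic data (a sketch, not the project's model):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.rand(100, 3)
y = X[:, 0] * 10 + rng.rand(100)

model = GradientBoostingRegressor(n_estimators=10, random_state=0).fit(X, y)
print(model.estimators_.shape)         # (10, 1)

single_tree = model.estimators_[5][0]  # one of the ten underlying trees
print(type(single_tree).__name__)      # DecisionTreeRegressor
```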

Using the Graphviz visualizer, we convert the dot file to png on the command line:

```
dot -Tpng images/tree.dot -o images/tree.png
```

The result is the complete decision tree:

*(image: the full decision tree)*

A bit cumbersome! Although this tree is only 6 layers deep, all the transitions are hard to follow. Let's change the `export_graphviz` call and limit the tree to two layers, for example by passing `max_depth=2`:

*(image: the reduced decision tree)*

Each node (except the leaves) contains four lines:

1. A question about the value of one of the features, which determines whether we move right or left from the node.
2. The mean squared error, a measure of the error in the node.
3. `Samples` — the number of data samples (observations) in the node.
4. `Value` — the target estimate for all the samples in the node.

(Leaves contain only items 2–4, because they represent the final estimate and have no children.)

Forming a prediction for a given observation starts at the top node, the root, and then moves down the tree. At each node, the question asked must be answered "yes" or "no". For example, the illustration above asks: "Is the building's Site EUI less than or equal to ????" If yes, the algorithm moves to the right child node; if not, to the left.

This procedure is repeated on each layer of the tree until the algorithm reaches a leaf node on the last layer (these nodes are not shown in the illustration of the reduced tree). The prediction for any observation in a leaf is `value`. If several observations end up in the same leaf (`samples`), each of them receives the same prediction. As the tree depth increases, the error on the training data decreases, because there are more leaves and the samples are divided more finely. However, a tree that is too deep overfits the training data and will not generalize to the test data.

In the second article we tuned the model hyperparameters that control each tree, such as the maximum depth of a tree and the minimum number of samples required in a leaf. These two parameters strongly influence the balance between over- and underfitting, and visualizing a decision tree lets us see how these settings work.

Although we cannot study every tree in the model, analyzing one of them helps us understand how each individual learner makes a prediction. This flowchart-like method closely resembles how a person makes decisions. Ensembles of decision trees combine the predictions of many individual trees, which produces more accurate models with less variance. Such ensembles are both very accurate and easy to explain.

#### Local Interpretable Model-Agnostic Explanations (LIME)

LIME is the last tool we will use to try to figure out how our model "thinks". LIME makes it possible to explain how a single prediction of any machine learning model is formed: a simplified local model, based on a simple model such as linear regression, is built around a particular observation (the details are described in this paper: https://arxiv.org/pdf/???.pdf).

We will use LIME to study the most erroneous prediction of our model and understand why it is wrong.

First, we find this incorrect prediction.
To do this, we train the model, generate predictions, and select the observation with the largest error:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Create a model with the best hyperparameters
model = GradientBoostingRegressor(loss='lad', max_depth=?, max_features=None,
                                  min_samples_leaf=?, min_samples_split=?,
                                  n_estimators=80?, random_state=42)

# Fit and test on the features
model.fit(X, y)
model_pred = model.predict(X_test)

# Find the residuals
residuals = abs(model_pred - y_test)

# Extract the most wrong prediction
wrong = X_test[np.argmax(residuals), :]

print('Prediction: %0.4f' % model_pred[np.argmax(residuals)])
print('Actual Value: %0.4f' % y_test[np.argmax(residuals)])
```

```
Prediction: ???
Actual Value: ???
```

Next we create an explainer, passing it the training data, the mode, the training labels, and the feature names. Now we can hand the explainer the observation and the prediction function and ask it to explain the reason for the erroneous prediction:

```python
import lime

# Create a lime explainer object
explainer = lime.lime_tabular.LimeTabularExplainer(training_data=X,
                                                   mode='regression',
                                                   training_labels=y,
                                                   feature_names=feature_list)

# Explanation for wrong prediction
exp = explainer.explain_instance(data_row=wrong, predict_fn=model.predict)

# Plot the prediction explanation
exp.as_pyplot_figure();
```

The prediction explanation chart:

*(image: LIME explanation plot)*

How to interpret the chart: each entry along the Y axis represents one variable value, and the red and green bars show the influence of that value on the prediction. For example, according to the top entry, `Site EUI` being greater than 95.9? subtracted about 40 points from the prediction. According to the second entry, `Weather Normalized Site Electricity Intensity` being less than 3.8? added about 10 points to the prediction. The final prediction is the sum of the intercept and the effects of each of the listed values.

Let's look at this from another angle by calling the `.show_in_notebook()` method:

```python
# Show the explanation in the Jupyter Notebook
exp.show_in_notebook()
```

`Site EUI` was relatively high, so a low Energy Star Score could be expected (because the score is strongly influenced by the EUI), which is exactly what our model predicted. In this case, though, that logic turned out to be wrong: the building actually received the highest possible Energy Star Score, 100.

Model errors can be frustrating, but such explanations help you understand why the model was wrong. Moreover, thanks to the explanations, you can start digging into why the building received the highest score despite its high Site EUI. Perhaps we will learn something new about our problem that would have escaped our attention if we had not analyzed the model's errors. Tools like this are not perfect, but they can greatly ease the understanding of a model and thereby lead to better decisions.
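The core idea behind LIME can be sketched without the library itself: perturb the data around one observation, weight the perturbed points by their proximity to it, and fit a simple weighted linear model whose coefficients then serve as the local explanation. A rough illustration on a synthetic black-box function (everything here is illustrative, not the lime package's actual algorithm):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(42)

# A nonlinear "black box" of two features, standing in for a real model
def black_box(X):
    return X[:, 0] ** 2 + 3 * X[:, 1]

# The observation whose prediction we want to explain
instance = np.array([2.0, 1.0])

# Perturb around the instance and weight samples by proximity
perturbed = instance + rng.normal(scale=0.5, size=(500, 2))
distances = np.linalg.norm(perturbed - instance, axis=1)
weights = np.exp(-distances ** 2)

# Fit a weighted local surrogate: its coefficients are the explanation
surrogate = LinearRegression()
surrogate.fit(perturbed, black_box(perturbed), sample_weight=weights)
print(surrogate.coef_)  # roughly [4, 3]: locally, d/dx0 of x0^2 at x0=2 is 4
```

The surrogate says that, near this particular observation, the first feature pushes the prediction up about 4 units per unit of input and the second about 3 — a local, human-readable summary of a nonlinear model, which is exactly what the bars in the LIME chart above express.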

#### Documenting the work and presenting the results

Many projects pay little attention to documentation and reporting. You can do the best analysis in the world, but if you do not present the results properly, they will not matter!

When documenting a data analysis project, we package all the versions of the data and code so that other people can reproduce or build on the project. Remember that code is read more often than it is written, so our work should be clear both to other people and to ourselves if we come back to it a few months later. Therefore, put useful comments in the code and explain your decisions. Jupyter Notebooks are a great documentation tool: they let you explain a solution first and then show the code.

Jupyter Notebook is also a good platform for interacting with other specialists. Using notebook extensions, you can hide the code from the final report, because, hard as it may be to believe, not everyone wants to see a pile of code in a document!

Perhaps you would rather not condense anything and show all the details. However, it is important to understand your audience when you present your project, and to prepare the report accordingly. Here is an example of a brief summary of the essence of our project:

- Using data on the energy consumption of buildings in New York, it is possible to build a model that predicts the Energy Star Score with an error of 9.1 points.
- Site EUI and Weather Normalized Electricity Intensity are the main factors affecting the prediction.

We wrote the detailed description and conclusions in a Jupyter Notebook, but instead of exporting straight to PDF we converted it to a LaTeX .tex file, which we then edited in TeXstudio, and converted the resulting version to PDF. The default PDF export from Jupyter looks fairly decent, but it can be significantly improved with just a few minutes of editing. Besides, LaTeX is a powerful document preparation system that is useful to know.

Ultimately, the value of our work is determined by the decisions it helps make, so it is very important to be able to present it well. By documenting properly, we help other people reproduce our results and give us feedback, which makes us more experienced and lets future work build on these results.

#### Conclusions

In this series of publications we have walked through a teaching machine learning project from start to finish. We started with data cleaning, then built a model, and finally learned to interpret it. Recall the overall structure of a machine learning project:

1. Data cleaning and formatting.
2. Exploratory data analysis.
3. Feature engineering and selection.
4. Comparing the metrics of several machine learning models.
5. Hyperparameter tuning of the best model.
6. Evaluating the best model on the test data set.
7. Interpreting the model results.
8. Conclusions and a well-documented report.
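Several of the modeling steps above (filling in missing data, preparing features, fitting, evaluating) can be chained in code. A minimal sketch with scikit-learn's `Pipeline` on synthetic data — the steps and settings here are illustrative, not the project's exact setup:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(42)
X = rng.rand(300, 5)
y = 5 * X[:, 0] + rng.rand(300)
X[rng.rand(300, 5) < 0.1] = np.nan   # simulate missing values

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

pipeline = Pipeline([
    ('impute', SimpleImputer(strategy='median')),           # fill missing data
    ('scale', StandardScaler()),                            # prepare features
    ('model', GradientBoostingRegressor(random_state=42)),  # the model itself
])

pipeline.fit(X_train, y_train)
mae = np.mean(np.abs(pipeline.predict(X_test) - y_test))    # evaluate on test data
print('Test MAE: %0.2f' % mae)
```

Packaging the steps this way makes the whole procedure reproducible with a single `fit` call, which also helps with the documentation goals discussed above.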

The exact set of steps varies by project, and machine learning is often an iterative rather than a linear process, so this guide should serve you well in the future. We hope you can now confidently tackle your own projects, but remember: nobody works alone! If you need help, there are many very useful communities where you can get advice.

These sources may help you:

- Hands-On Machine Learning with Scikit-Learn and TensorFlow (the Jupyter Notebooks for this book are available for free).
- An Introduction to Statistical Learning.
- Kaggle: machine learning competitions and datasets.
- Datacamp: good guided practice for data analysis programming.
- Coursera: free and paid courses on many topics.
- Udacity: paid courses on programming and data analysis.


Author: weber
Published: 18-10-2018, 11:30
Category: Machine learning / Algorithms
