Identifying fraud using the Enron dataset. Part 2: the search for the optimal model

I present to you the second part of the article on searching for suspected fraud based on data from the Enron Dataset. If you have not read the first part, you can read it here. Now we will talk about the process of building, optimizing and selecting a model that will answer the question: is it worth suspecting a person of fraud?

Earlier, we analyzed one of the open datasets, which provides information about the suspects in the Enron fraud case. The skew in the original data was corrected, the gaps (NaN) were filled, after which the data were normalized and the features were selected. The result will be familiar to many:

- X_train and y_train: the sample used for training (111 records);
- X_test and y_test: the sample on which the correctness of our models' predictions will be checked (28 records).
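The split itself is not shown in the article. As a sketch only, a 111/28 split like this could be produced with sklearn's train_test_split; the feature matrix below is a random stand-in for the prepared Enron features, not the real data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Random stand-in for the 139 prepared Enron records (values are illustrative)
rng = np.random.RandomState(42)
X = rng.rand(139, 5)
y = (rng.rand(139) < 0.15).astype(int)  # ~15% labeled as "suspects"

# Hold out 28 records for testing, keeping the class proportions similar
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=28, random_state=42, stratify=y)

print(len(X_train), len(X_test))  # 111 28
```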

Speaking of models: in order to correctly predict whether a person should be suspected based on certain features characterizing their activity, we will use classification. The main types of models used for problems in this area can be taken from Sklearn:

- Naive Bayes (naive Bayes classifier);
- SVM (support vector machine);
- K-nearest neighbors;
- Random Forest;
- Neural Network (multilayer perceptron).

There is also a picture that illustrates their applicability quite well:

Among them is the familiar Decision Tree, but there is perhaps no point in using this method together with Random Forest, which is an ensemble of decision trees applied to a single task. Therefore, we replace it with Logistic Regression, which can act as a classifier and output one of the expected options (0 or 1).
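As a quick illustration on made-up one-dimensional data (not the Enron features), LogisticRegression's predict does return one of the class labels, 0 or 1:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up 1-D data: the label is 1 when the feature is large
X = np.array([[0.1], [0.2], [0.3], [0.7], [0.8], [0.9]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression(random_state=42)
clf.fit(X, y)
print(clf.predict([[0.05], [0.95]]))  # [0 1]
```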

#### Start

We initialize all of the classifiers mentioned above with their default values:

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier

random_state = 42

gnb = GaussianNB()
svc = SVC()
knn = KNeighborsClassifier()
log = LogisticRegression(random_state=random_state)
rfc = RandomForestClassifier(random_state=random_state)
mlp = MLPClassifier(random_state=random_state)
```

We also group them so that it is more convenient to work with them as a collection rather than writing code for each one individually. For example, we can train them all at once:

```python
classifiers = [gnb, svc, knn, log, rfc, mlp]
for clf in classifiers:
    clf.fit(X_train, y_train)
```

After the models have been trained, it is time to check their prediction quality for the first time. Additionally, we will visualize the results using Seaborn:

```python
from sklearn.metrics import accuracy_score

# pd, plt, sns and the cmap palette are assumed to be imported
# and defined in the first part of the article
def calculate_accuracy(X, y):
    result = pd.DataFrame(columns=['classifier', 'accuracy'])
    for clf in classifiers:
        predicted = clf.predict(X)
        accuracy = round(100.0 * accuracy_score(y, predicted), 2)
        classifier = clf.__class__.__name__
        classifier = classifier.replace('Classifier', '')
        result = result.append({'classifier': classifier, 'accuracy': accuracy},
                               ignore_index=True)
        print('Accuracy is {accuracy}% for {classifier_name}'.format(
            accuracy=accuracy, classifier_name=classifier))

    result = result.sort_values(['classifier'], ascending=True)
    plt.subplots(figsize=(10, 7))
    sns.barplot(x='classifier', y='accuracy', palette=cmap, data=result)
```

Let's look at the overall picture of the classifiers' accuracy:

```python
calculate_accuracy(X_test, y_test)
```

At first glance it looks quite good: the accuracy of the predictions on the test sample hovers around 90%. It seems the task has been completed brilliantly!

In fact, not everything is so rosy. High accuracy is no guarantee of correct predictions. Our test sample contains 28 records, 4 of which relate to suspects and 24 to people who are beyond suspicion. Imagine that we created some kind of algorithm:

```python
def QuaziAlgo(features):
    return 0
```

After that, we fed it our test sample and got the verdict that all 28 people were innocent. What will the accuracy of this algorithm be?

Accuracy = 24 / 28 ≈ 85.71%

Interestingly, KNeighbors has exactly the same prediction accuracy.
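The arithmetic is easy to verify with accuracy_score itself; the labels below simply mimic the 24/4 composition of the test sample:

```python
from sklearn.metrics import accuracy_score

# 24 people beyond suspicion (0) and 4 suspects (1), as in the test sample
y_test_toy = [0] * 24 + [1] * 4

# QuaziAlgo's output: everyone is declared innocent
predicted = [0] * 28

print(round(100.0 * accuracy_score(y_test_toy, predicted), 2))  # 85.71
```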

But still, before flattering ourselves, let's build a confusion matrix for the prediction results:

```python
from sklearn.metrics import confusion_matrix

def make_confussion_matrices(X, y):
    matrices = {}
    for clf in classifiers:
        classifier = clf.__class__.__name__
        classifier = classifier.replace('Classifier', '')
        predicted = clf.predict(X)
        print(f'{predicted} - {classifier}')
        matrix = confusion_matrix(y, predicted, labels=[1, 0])
        matrices[classifier] = matrix.T
    return matrices
```

Let's calculate the confusion matrices for each classifier and at the same time see what they predicted:

```python
matrices = make_confussion_matrices(X_test, y_test)
```

Two questions arise:

- What is the reason for this behavior of the KNeighbors classifier?
- Why did we build confusion matrices if we do not use them but simply look at the prediction results?

#### A deeper look

Let's start with the second question and visualize our confusion matrices, presenting the data in graphical form to see where the classification errors occur:

```python
import itertools
from collections.abc import Iterable

def draw_confussion_matrices(row, col, matrices, figsize=(12, 12)):
    fig, axes = plt.subplots(row, col, sharex='col', sharey='row', figsize=figsize)
    if any(isinstance(i, Iterable) for i in axes):
        axes = list(itertools.chain.from_iterable(axes))

    idx = 0
    for name, matrix in matrices.items():
        df_cm = pd.DataFrame(matrix,
                             index=['True', 'False'],
                             columns=['True', 'False'])
        ax = axes[idx]
        fig.subplots_adjust(wspace=0.1)
        sns.heatmap(df_cm, annot=True, cmap=cmap, cbar=False, fmt='d',
                    ax=ax, linewidths=1)
        ax.set_title(name)
        idx += 1
```

Let's display them in 2 rows and 3 columns:

```python
draw_confussion_matrices(2, 3, matrices)
```

Before continuing, some explanation is in order. The label True to the left of a particular classifier's confusion matrix means that the classifier considered the person a suspect, while False means the person was considered beyond suspicion. Similarly, True and False at the bottom of the image give the real state of affairs, which may not coincide with the classifier's decision.

For example, we can see that the decisions of KNeighbors, with its prediction accuracy of 85.71%, coincided with the real state of affairs for the 24 people beyond suspicion, whom the classifier placed on the same list. But 4 people from the list of suspects also ended up on that list. If this classifier were making the decisions, perhaps someone would have avoided trial.

Thus, the confusion matrix is a very good tool for understanding what went wrong in a classification task. Its main advantage is clarity, which is why we turn to it.

#### Metrics

In general terms, this can be illustrated by the following picture:

**And what are TP, TN, FP and FN in this case?**

- TP (true positive): the classifier labeled the person a suspect, and the person really is one;
- TN (true negative): the classifier considered the person beyond suspicion, and they really are;
- FP (false positive): the classifier labeled the person a suspect, although in reality they are not;
- FN (false negative): the classifier considered the person beyond suspicion, although in reality they are a suspect.

In other words, we strive for the classifier's answers to coincide with the real state of affairs, that is, for all numbers to be distributed between the TP and TN cells (true decisions) and not fall into FN and FP (false decisions).
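A small sketch with made-up labels shows how these four cells appear in sklearn's confusion_matrix; note that with labels=[1, 0] the suspect class comes first, and the article's code additionally transposes the matrix so that columns correspond to the true state:

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 1 = suspect, 0 = beyond suspicion
y_true = [1, 1, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0]

# Rows are the true class, columns the predicted class:
# [[TP, FN],
#  [FP, TN]]
matrix = confusion_matrix(y_true, y_pred, labels=[1, 0])
print(matrix)    # [[2 0]
                 #  [1 2]]

# Transposed, as in the article: rows = prediction, columns = truth
print(matrix.T)  # [[2 1]
                 #  [0 2]]
```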

**However, not everything is always so dramatic and unambiguous.** For example, in the canonical case of cancer diagnosis, FP is preferable to FN: given a false cancer verdict, the patient will be prescribed medication and treated. Yes, this will affect their health and wallet, but it is still considered less dangerous than an FN and a missed window in which the cancer could have been defeated by modest means. And what about the suspects in our case? Probably, FN is not as bad as FP. But more on that later.

And since we are talking about abbreviations, it is time to recall the precision and recall metrics.

Stepping away from formal notation, **Precision** can be expressed as:

    Precision = TP / (TP + FP)

In other words, it counts how many of the positive answers received from the classifier are correct. The higher the precision, the fewer false positives there were (precision equals 1 if there were no FPs at all).

**Recall** in general form is presented as:

    Recall = TP / (TP + FN)

Recall characterizes the classifier's ability to "guess" as many of the expected positive answers as possible. The higher the recall, the fewer FNs there were.

Usually one tries to balance the two, but in our case priority will be given entirely to Precision. The reason is a more humanistic approach: the desire to minimize the number of false positives and, as a result, to prevent suspicion from falling on the innocent.

We calculate Precision for our classifiers:

```python
from sklearn.metrics import precision_score

def calculate_precision(X, y):
    result = pd.DataFrame(columns=['classifier', 'precision'])
    for clf in classifiers:
        predicted = clf.predict(X)
        precision = precision_score(y, predicted, average='macro')
        classifier = clf.__class__.__name__
        classifier = classifier.replace('Classifier', '')
        result = result.append({'classifier': classifier, 'precision': precision},
                               ignore_index=True)
        print('Precision is {precision} for {classifier_name}'.format(
            precision=round(precision, 2), classifier_name=classifier))

    result = result.sort_values(['classifier'], ascending=True)
    plt.subplots(figsize=(10, 7))
    sns.barplot(x='classifier', y='precision', palette=cmap, data=result)

calculate_precision(X_test, y_test)
```

As the figure shows, the result is quite expected: the precision of KNeighbors turned out to be the lowest of all, because its TP value is the lowest.

Incidentally, there is a good article about metrics on Habr, and those who want to dive deeper into this topic should read it.

#### Selecting hyperparameters

After we have settled on the metric that best fits the chosen conditions (minimizing the number of FPs), we can return to the first question: what is the reason for such behavior of the KNeighbors classifier?

The reason lies in the default parameters with which this model was created. At this point, many will probably exclaim: why train with the default parameters at all? There are special tools for parameter selection, for example the frequently used GridSearchCV. Yes, there are, and it is time to resort to them.

But before that, we remove the Bayes classifier from our list. It allows one FP, and at the same time this algorithm does not take any tunable parameters, so the result would not change:

```python
classifiers.remove(gnb)
```
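Here is a minimal sketch of the GridSearchCV workflow on synthetic data, before applying it to our classifiers; the data and the grid are illustrative only:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data standing in for the real features
X, y = make_classification(n_samples=100, n_features=5, random_state=42)

# Try every combination in the grid with 5-fold cross-validation
gs = GridSearchCV(KNeighborsClassifier(),
                  {'n_neighbors': [3, 5, 7],
                   'algorithm': ('ball_tree', 'kd_tree')},
                  cv=5)
gs.fit(X, y)

print(gs.best_params_)  # the winning combination
print(gs.best_score_)   # its mean cross-validated accuracy
```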

#### Tuning

We set a parameter grid for each classifier:

```python
parameters = {
    'SVC': {'kernel': ('linear', 'rbf', 'poly'),
            'C': [i for i in range(1, 11)],
            'random_state': (random_state,)},
    'KNeighbors': {'algorithm': ('ball_tree', 'kd_tree'),
                   'n_neighbors': [i for i in range(2, 20)]},
    'LogisticRegression': {'penalty': ('l1', 'l2'),
                           'C': [i for i in range(1, 11)],
                           'random_state': (random_state,)},
    'RandomForest': {'n_estimators': [i for i in range(10, 101, 10)],
                     'random_state': (random_state,)},
    'MLP': {'activation': ('relu', 'logistic'),
            'solver': ('sgd', 'lbfgs'),
            'max_iter': (500, 1000),
            'hidden_layer_sizes': [(7,), (7, 7)],
            'random_state': (random_state,)},
}
```

Additionally, I want to draw attention to the number of layers and neurons in the MLP. It was decided to set them not by enumerating all possible values, but based on a formula (given as an image in the original article).

I would like to say right away that training and cross-validation will be carried out only on the training set. I admit there is an opinion that this can be done on all the data, as in the example with the Iris Dataset. But, in my opinion, such an approach is not entirely justified, since the results of testing on the test sample could then no longer be trusted.

We optimize our classifiers and replace them with their improved versions:

```python
import warnings
from sklearn.model_selection import GridSearchCV

warnings.filterwarnings('ignore')

for idx, clf in enumerate(classifiers):
    classifier = clf.__class__.__name__
    classifier = classifier.replace('Classifier', '')
    params = parameters.get(classifier)
    if not params:
        continue

    new_clf = clf.__class__()
    gs = GridSearchCV(new_clf, params, cv=5)
    result = gs.fit(X_train, y_train)
    print(f'The best params for {classifier} are {result.best_params_}')
    classifiers[idx] = result.best_estimator_
```

After we have selected the metric for evaluation and completed GridSearchCV, we are ready to draw the final line.

#### Summing up

##### The confusion matrix v.2

```python
matrices = make_confussion_matrices(X_test, y_test)

# first_row and second_row were not defined in the original listing;
# a straightforward 3 + 2 split of the five matrices is assumed here
items = list(matrices.items())
first_row = dict(items[:3])
second_row = dict(items[3:])

draw_confussion_matrices(1, 3, first_row, figsize=(10.5, 6))
draw_confussion_matrices(1, 2, second_row, figsize=(16, 6))
```

As can be seen from the matrices, the MLP degraded and decided that there were no suspects in the test sample at all. Random Forest gained precision and improved its False Negative and True Positive counts. KNeighbors also showed an improvement in its predictions. The forecasts of the others did not change.

##### Accuracy v.2

Now none of our current classifiers makes False Positive errors, which is good news. But if we express everything in numbers, we get the following picture:

```python
calculate_precision(X_test, y_test)
```

Three classifiers with the highest precision have been identified, and, judging by the confusion matrices, they have the same values. Which classifier should we choose?

##### Who is better?

It seems to me that this is a rather difficult question to which there is no universal answer. However, my point of view in this case would look something like this:

1. The classifier should be as simple as possible in its technical implementation. Then it runs less risk of overfitting (which is probably what happened with the MLP). Therefore it is not Random Forest, since this algorithm is an ensemble of 30 trees and, as a result, depends on them. This is in tune with one of the ideas of the Zen of Python: simple is better than complex.

2. It is good when the algorithm is intuitive. KNeighbors is perceived as simpler than SVM with its potentially multidimensional space. This in turn echoes another statement: explicit is better than implicit.

Therefore KNeighbors with 3 neighbors is, in my opinion, the best candidate.

This concludes the second part describing the use of the Enron Dataset as an example of a classification task in machine learning. It is based on materials from the Introduction to Machine Learning course at Udacity. There is also a python notebook reflecting the entire sequence of actions described.


Author: weber, 8-10-2018, 05:14
Category: Machine learning / Python