Identification of fraud using the Enron dataset. Part 1: data preparation and feature selection
Enron Corporation was one of the most famous names in American business in the 2000s. This was due not to its line of work (electricity and contracts for its supply), but to the scandal caused by fraud inside the company. For 15 years the corporation's revenues grew rapidly, and a job there promised a good salary. But it all ended just as quickly: over 2000-2001 the share price fell from about $90 to almost zero because of fraudulent misreporting of declared income. Since then, the word "Enron" has become a byword and serves as a label for companies that operate under a similar scheme.
During the trial, 18 people were convicted, including the biggest defendants in the case: Andrew Fastow, Jeff Skilling and Kenneth Lay.
Around the same time, an archive of the company's internal e-mail, better known as the Enron Email Dataset, was made public, along with insider information about the incomes of its employees.
This article looks at the sources of this data and builds a model on top of them to determine whether a person is suspected of fraud. Sounds interesting? Then welcome under the cut.
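The rest of this part works with a dataframe called source_df and with lists of payment, stock and e-mail features that are defined before this excerpt. For readability, here is a minimal sketch of what that setup might look like; the pickle file name and the exact contents and order of the feature lists are my assumptions based on the Udacity project data, with each group's total kept as the last element of its list:
import pickle
import pandas as pd
# Assumed setup: load the Udacity project dump into a dataframe indexed by person name.
with open('final_project_dataset.pkl', 'rb') as f:
    source_df = pd.DataFrame.from_dict(pickle.load(f), orient='index')
# Feature groups (my reconstruction; the group total is the last element of each list).
payments_features = ['salary', 'bonus', 'long_term_incentive', 'deferred_income',
                     'deferral_payments', 'loan_advances', 'other', 'expenses',
                     'director_fees', 'total_payments']
stock_features = ['exercised_stock_options', 'restricted_stock',
                  'restricted_stock_deferred', 'total_stock_value']
email_features = ['to_messages', 'from_poi_to_this_person', 'from_messages',
                  'from_this_person_to_poi', 'shared_receipt_with_poi']
target_field = 'poi'
# Payment data, with missing values ('NaN' strings in the dump) treated as 0.
payments = source_df[payments_features].replace('NaN', 0)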
When the data was compared with the source PDF on which the dataset is based, it turned out that the data is slightly distorted: not for every row in the payments dataframe is the total_payments field equal to the sum of all of that person's financial transactions. You can check this as follows:
errors = payments[payments[payments_features[:-1]].sum(axis='columns') != payments['total_payments']]
errors.head()
We see that BELFER ROBERT and BHATNAGAR SANJAY have incorrect payment amounts.
You can correct this error by shifting the data in the affected rows to the left or right and recomputing the sum of all payments:
import numpy as np
shifted_values = payments.loc['BELFER ROBERT', payments_features[1:]].values
expected_payments = shifted_values.sum()
shifted_values = np.append(shifted_values, expected_payments)
payments.loc['BELFER ROBERT', payments_features] = shifted_values
shifted_values = payments.loc['BHATNAGAR SANJAY', payments_features[:-1]].values
payments.loc['BHATNAGAR SANJAY', payments_features] = np.insert(shifted_values, 0, 0)
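As a quick sanity check (my addition, not in the original text), the same consistency test from above should now come back empty:
# Re-run the check: every row's components should again sum to total_payments.
errors = payments[payments[payments_features[:-1]].sum(axis='columns') != payments['total_payments']]
print(errors.empty)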
Stock data
stocks = source_df[stock_features]
stocks = stocks.replace('NaN', 0)
We perform the correctness check in this case as well:
errors = stocks[stocks[stock_features[:-1]].sum(axis='columns') != stocks['total_stock_value']]
errors.head()
We correct the same error in the stock data:
shifted_values = stocks.loc['BELFER ROBERT', stock_features[1:]].values
expected_payments = shifted_values.sum()
shifted_values = np.append(shifted_values, expected_payments)
stocks.loc['BELFER ROBERT', stock_features] = shifted_values
shifted_values = stocks.loc['BHATNAGAR SANJAY', stock_features[:-1]].values
stocks.loc['BHATNAGAR SANJAY', stock_features] = np.insert(shifted_values, 0, shifted_values[-1])
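The same sanity check can be repeated for the stock data (again, my addition):
# After the fix, the stock components should sum to total_stock_value for every row.
errors = stocks[stocks[stock_features[:-1]].sum(axis='columns') != stocks['total_stock_value']]
print(errors.empty)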
Summary data on electronic correspondence
While treating NaN as 0 was acceptable for the payment and stock data (it is consistent with the totals of each of those groups), for the e-mail data it is more reasonable to replace NaN with some default value. You can use SimpleImputer for this:
from sklearn.impute import SimpleImputer
imp = SimpleImputer()
The default value will be computed separately for each category (whether or not the person is suspected of fraud):
target = source_df[target_field]
email_data = source_df[email_features]
email_data = pd.concat([email_data, target], axis=1)
email_data_poi = email_data[email_data[target_field]][email_features]
email_data_nonpoi = email_data[email_data[target_field] == False][email_features]
email_data_poi[email_features] = imp.fit_transform(email_data_poi)
email_data_nonpoi[email_features] = imp.fit_transform(email_data_nonpoi)
email_data = pd.concat([email_data_poi, email_data_nonpoi])
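A note on the imputer: SimpleImputer() with no arguments replaces missing values with the column mean, computed here separately within each group. Spelling the strategy out, or switching to the median, which is less sensitive to extreme values, is a one-line change; a sketch:
# Explicit form of the default imputer; strategy='median' would be a more robust alternative.
imp = SimpleImputer(missing_values=np.nan, strategy='mean')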
The final data after the correction:
df = payments.join(stocks)
df = df.join(email_data)
df = df.astype(float)
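A quick sanity check on the combined frame (my addition): after joining the three groups on the person index there should be no missing values left.
# Every person appears once, and all NaNs have been either zeroed or imputed.
print(df.shape)
print(df.isnull().values.any())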
Outliers
At the final step of this stage we remove all outliers that might distort training. A question always arises here: how much data can we remove from the sample without hurting the model we are going to train? I followed the advice of one of the lecturers of the ML (machine learning) course on Udacity: "Remove 10 items and check for outliers again."
first_quartile = df.quantile(q=0.25)
third_quartile = df.quantile(q=0.75)
IQR = third_quartile - first_quartile
outliers = df[(df > (third_quartile + 1.5 * IQR)) | (df < (first_quartile - 1.5 * IQR))].count(axis=1)
outliers.sort_values(axis=0, ascending=False, inplace=True)
outliers = outliers.head(10)
outliers
At the same time, we will not delete records that are outliers but belong to people suspected of fraud. The reason is that there are only 18 rows with such data, and we cannot sacrifice them, since that could leave too few examples for training. As a consequence, we remove only those who are not suspected of fraud but have a large number of features with outlier values:
target_for_outliers = target.loc[outliers.index]
outliers = pd.concat([outliers, target_for_outliers], axis=1)
non_poi_outliers = outliers[np.logical_not(outliers.poi)]
df.drop(non_poi_outliers.index, inplace=True)
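To see how much data this step actually costs (my addition), you can compare the number of removed rows with what remains:
# Only non-POI outliers are dropped, so all 18 suspects are still present.
print(len(non_poi_outliers), 'rows removed,', len(df), 'rows remain')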
Bringing the data to its final form
We normalize our data:
from sklearn.preprocessing import scale
df[df.columns] = scale(df)
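To confirm the scaling did what we expect (my addition), each column should now have a mean of roughly zero and unit variance:
# sklearn's scale() standardizes with ddof=0, so use the same convention here.
print(df.mean().abs().max())
print(df.std(ddof=0).mean())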
Let's convert the target variable to a compatible type:
target.drop(non_poi_outliers.index, inplace=True)
target = target.map({True: 1, False: 0})
target.value_counts()
As a result, there are 18 suspects and 121 people who are not under suspicion.
Feature selection
Perhaps one of the most important steps before training any model is selecting the most significant features.
Multicollinearity check
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set(style="whitegrid")
corr = df.corr() * 100
# Select the upper triangle of the correlation matrix
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True
# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 11))
# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10)
# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, center=0,
            linewidths=.5, cbar_kws={"shrink": .7}, annot=True, fmt=".2f")
As can be seen from the image, there is a pronounced correlation between 'loan_advances' and 'total_payments', and also between 'total_stock_value' and 'restricted_stock'. As mentioned earlier, 'total_payments' and 'total_stock_value' are just the sums of all the indicators in their respective groups, so they can be deleted:
df.drop(columns=['total_payments', 'total_stock_value'], inplace=True)
Creating new features
There is also an assumption that the suspects wrote to their accomplices more often than to employees who were not involved, and as a consequence the share of such messages should be larger than the share of messages to ordinary employees. Based on this assumption, we can create new features that reflect the fraction of incoming/outgoing messages connected with the suspects:
df['ratio_of_poi_mail'] = df['from_poi_to_this_person'] / df['to_messages']
df['ratio_of_mail_to_poi'] = df['from_this_person_to_poi'] / df['from_messages']
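Since the e-mail counts were already scaled, these divisions can in principle produce infinities or NaN when a denominator is zero; a defensive cleanup (my addition, under that assumption) keeps the new columns usable by downstream estimators:
# Replace any inf/NaN produced by the division with 0.
ratio_columns = ['ratio_of_poi_mail', 'ratio_of_mail_to_poi']
df[ratio_columns] = df[ratio_columns].replace([np.inf, -np.inf], np.nan).fillna(0)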
Filtering out redundant features
The toolbox of anyone doing ML contains many excellent tools for selecting the most significant features (SelectKBest, SelectPercentile, VarianceThreshold, etc.). Here RFECV will be used, since it includes cross-validation, which lets it compute the most important features and check them on all subsets of the sample:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df, target, test_size=0.3, random_state=42)
from sklearn.feature_selection import RFECV
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(random_state=42)
rfecv = RFECV(estimator=forest, cv=5, scoring='accuracy')
rfecv = rfecv.fit(X_train, y_train)
plt.figure()
plt.xlabel("Number of features selected")
plt.ylabel("Cross validation score of number of selected features")
plt.plot(range(1, len(rfecv.grid_scores_) + 1), rfecv.grid_scores_, '--o')
indices = rfecv.get_support()
columns = X_train.columns[indices]
print('The most important columns are {}'.format(', '.join(columns)))
As you can see, RFECV with a RandomForestClassifier determined that only 7 of the 18 features matter; using the rest leads to a decrease in the model's accuracy.
The most important columns are bonus, deferred_income, other, exercised_stock_options, shared_receipt_with_poi, ratio_of_poi_mail, ratio_of_mail_to_poi
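For comparison with the tools mentioned above, the same kind of screening could be sketched with SelectKBest; this is only an illustration, not part of the article's pipeline, and the choice of k = 7 simply mirrors the RFECV result:
from sklearn.feature_selection import SelectKBest, f_classif
# Rank features by ANOVA F-score on the training set and keep the top 7.
selector = SelectKBest(score_func=f_classif, k=7)
selector.fit(X_train, y_train)
print(X_train.columns[selector.get_support()].tolist())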
These 7 features will be used from here on, to simplify the model and reduce the risk of overfitting:
- bonus
- deferred_income
- other
- exercised_stock_options
- shared_receipt_with_poi
- ratio_of_poi_mail
- ratio_of_mail_to_poi
Let's restrict the training and test samples to these features for the upcoming model training:
X_train = X_train[columns]
X_test = X_test[columns]
This concludes the first part describing the use of the Enron Dataset as an example of a classification problem in ML. It is based on materials from the Introduction to Machine Learning course on Udacity. There is also a Python notebook reflecting the entire sequence of steps.