Identification of fraud using the Enron dataset. Part 1: data preparation and feature selection

Enron Corporation is one of the most famous names in American business of the 2000s. This fame came not from its sphere of activity (electricity and contracts for its supply) but from the resonance caused by the fraud inside it. For 15 years the corporation's revenues grew rapidly, and a job there promised a good salary. But it all ended just as quickly: over 2000-2001 the share price fell from about $90 to almost zero because of the uncovered fraud with declared income. Since then the word "Enron" has become a household name, a label for companies operating under a similar scheme.
 
During the trial, 18 people were convicted, including the largest defendants in the case: Andrew Fastow, Jeff Skilling, and Kenneth Lay.
 
 
At the same time, an archive of the electronic correspondence between the company's employees, better known as the Enron Email Dataset, was published, along with insider information about the incomes of the company's employees.
 
The article will consider the sources of these data, and a model will be built on top of them to determine whether a person is suspected of fraud. Sounds interesting? Then welcome under the cut.

Comparing the payments data against the source PDF on which the dataset is based, it turned out that the data is slightly distorted: not for every row of the payments dataframe is the total_payments field equal to the sum of all of that person's financial transactions. You can check this as follows:
 
    errors = payments[payments[payments_features[:-1]].sum(axis='columns') != payments['total_payments']]
    errors.head()

 


 
We see that BELFER ROBERT and BHATNAGAR SANJAY have incorrect payment amounts.


 

You can correct this error by shifting the data in the broken rows one cell to the left or right and then recomputing the sum of all payments:


 
    import numpy as np

    shifted_values = payments.loc['BELFER ROBERT', payments_features[1:]].values
    expected_payments = shifted_values.sum()
    shifted_values = np.append(shifted_values, expected_payments)
    payments.loc['BELFER ROBERT', payments_features] = shifted_values

    shifted_values = payments.loc['BHATNAGAR SANJAY', payments_features[:-1]].values
    payments.loc['BHATNAGAR SANJAY', payments_features] = np.insert(shifted_values, 0, 0)
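
To confirm the fix, the same consistency check as above can be re-run; if the shifts are correct, it should come back empty (the assert is my addition, not from the original notebook):

    # Re-run the payments consistency check; no mismatching rows should remain.
    errors = payments[payments[payments_features[:-1]].sum(axis='columns') != payments['total_payments']]
    assert errors.empty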

 

Stock data


 
    stocks = source_df[stock_features]
    stocks = stocks.replace('NaN', 0)

 

We perform the same correctness check in this case as well:


 
    errors = stocks[stocks[stock_features[:-1]].sum(axis='columns') != stocks['total_stock_value']]
    errors.head()

 


 

We correct the same error in the stock data:


 
    shifted_values = stocks.loc['BELFER ROBERT', stock_features[1:]].values
    expected_payments = shifted_values.sum()
    shifted_values = np.append(shifted_values, expected_payments)
    stocks.loc['BELFER ROBERT', stock_features] = shifted_values

    shifted_values = stocks.loc['BHATNAGAR SANJAY', stock_features[:-1]].values
    stocks.loc['BHATNAGAR SANJAY', stock_features] = np.insert(shifted_values, 0, shifted_values[-1])

 

Summary data on electronic correspondence

If treating NaN as 0 fit the financial and stock data, since it is consistent with the totals for each of those groups, in the case of the email data it is more reasonable to replace NaN with some default value. SimpleImputer can be used for this:


 
    from sklearn.impute import SimpleImputer
    imp = SimpleImputer()
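
By default SimpleImputer uses strategy='mean', i.e. each missing value is replaced by the mean of its column. A minimal sketch with made-up numbers:

    # Toy matrix (made-up numbers): column 0 gets (1 + 3) / 2 = 2,
    # column 1 gets (2 + 4) / 2 = 3 in place of NaN.
    demo = np.array([[1.0, 2.0],
                     [np.nan, 4.0],
                     [3.0, np.nan]])
    print(SimpleImputer().fit_transform(demo))
    # [[1. 2.]
    #  [2. 4.]
    #  [3. 3.]]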

 

At the same time, we compute the default value for each category (suspected of fraud or not) separately:


 
    import pandas as pd

    target = source_df[target_field]
    email_data = source_df[email_features]
    email_data = pd.concat([email_data, target], axis=1)

    # Impute the default value separately for each class (POI / non-POI).
    email_data_poi = email_data[email_data[target_field]][email_features]
    email_data_nonpoi = email_data[~email_data[target_field]][email_features]
    email_data_poi[email_features] = imp.fit_transform(email_data_poi)
    email_data_nonpoi[email_features] = imp.fit_transform(email_data_nonpoi)
    email_data = pd.concat([email_data_poi, email_data_nonpoi])

 

The final data after the correction:


 
    df = payments.join(stocks)
    df = df.join(email_data)
    df = df.astype(float)
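
A quick sanity check, my addition rather than part of the original notebook: after the replacements and per-category imputation above, the merged dataframe should contain no missing values at all:

    # Every 'NaN' string was replaced with 0 and the email features were
    # imputed, so nothing in the merged frame should be missing.
    assert not df.isnull().values.any()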

 

Outliers


 

At the final step of this stage we remove the outliers that could distort training, using the standard interquartile-range rule: a value counts as an outlier if it lies more than 1.5 IQR above the third quartile or below the first quartile. At the same time there is always the question of how much data can be removed from the sample without hurting the model to be trained. I followed the advice of one of the lecturers of the Udacity ML (machine learning) course: "Remove the top 10 and check for outliers again."


 
    first_quartile = df.quantile(q=0.25)
    third_quartile = df.quantile(q=0.75)
    IQR = third_quartile - first_quartile
    outliers = df[(df > (third_quartile + 1.5 * IQR)) | (df < (first_quartile - 1.5 * IQR))].count(axis=1)
    outliers.sort_values(axis=0, ascending=False, inplace=True)
    outliers = outliers.head(10)
    outliers

 

At the same time, we will not delete records that are outliers but belong to persons suspected of fraud. The reason is that there are only 18 such rows, and we cannot sacrifice them, since that could leave too few positive examples for training. Consequently, we remove only those who are not suspected of fraud but have a large number of features with outlying values:


 
    target_for_outliers = target.loc[outliers.index]
    outliers = pd.concat([outliers, target_for_outliers], axis=1)
    non_poi_outliers = outliers[np.logical_not(outliers.poi)]
    df.drop(non_poi_outliers.index, inplace=True)

 

Bringing the data to its final form


 

We standardize our data (sklearn's scale gives each column zero mean and unit variance):


 
    from sklearn.preprocessing import scale
    df[df.columns] = scale(df)
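
For intuition, a toy illustration of what scale does, with made-up numbers:

    # The column [1, 2, 3] has mean 2 and population std ~0.816,
    # so scaling maps it to approximately [-1.225, 0, 1.225].
    demo = np.array([[1.0], [2.0], [3.0]])
    print(scale(demo).ravel())  # [-1.2247...  0.  1.2247...]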

 

Let's convert the target to a compatible numeric type:


 
    target.drop(non_poi_outliers.index, inplace=True)
    target = target.map({True: 1, False: 0})
    target.value_counts()

 


 
As a result, there are 18 suspects and 121 people who did not come under suspicion.


 

Feature selection


 

Perhaps one of the most important steps before training any model is selecting the most significant features.


 

Multicollinearity check


 
    import matplotlib.pyplot as plt
    import seaborn as sns
    %matplotlib inline

    sns.set(style="whitegrid")
    # Correlations in percent, for more readable annotations on the heatmap.
    corr = df.corr() * 100

    # Mask the upper triangle of the correlation matrix.
    mask = np.zeros_like(corr, dtype=bool)
    mask[np.triu_indices_from(mask)] = True

    # Set up the matplotlib figure.
    f, ax = plt.subplots(figsize=(14, 11))

    # Generate a custom diverging colormap.
    cmap = sns.diverging_palette(220, 10)

    # Draw the heatmap with the mask and correct aspect ratio.
    sns.heatmap(corr, mask=mask, cmap=cmap, center=0,
                linewidths=1, cbar_kws={"shrink": .7}, annot=True, fmt=".2f")

 


 
As can be seen from the heatmap, there is a pronounced correlation between 'loan_advances' and 'total_payments', and also between 'total_stock_value' and 'restricted_stock'. As mentioned earlier, 'total_payments' and 'total_stock_value' are just the sums of all the indicators in their respective groups, so they can be deleted:


 
    df.drop(columns=['total_payments', 'total_stock_value'], inplace=True)

 

Creating new features


 

There is also an assumption that the suspects wrote to accomplices more often than to employees who were not involved, and as a consequence the share of such messages should be higher than the share of messages to ordinary employees. Based on this assumption, we can create new features reflecting the fraction of incoming/outgoing messages connected with the suspects:


 
    df['ratio_of_poi_mail'] = df['from_poi_to_this_person'] / df['to_messages']
    df['ratio_of_mail_to_poi'] = df['from_this_person_to_poi'] / df['from_messages']
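
One caveat, an assumption of mine that the original write-up does not address: if someone's to_messages or from_messages count is zero, these divisions yield inf or NaN. A defensive cleanup could look like this:

    # Replace any inf/NaN ratios produced by zero message counts.
    ratio_columns = ['ratio_of_poi_mail', 'ratio_of_mail_to_poi']
    df[ratio_columns] = df[ratio_columns].replace([np.inf, -np.inf], np.nan).fillna(0)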

 

Filtering out redundant features


 

In the toolbox of people working with ML there are many excellent tools for selecting the most significant features (SelectKBest, SelectPercentile, VarianceThreshold, etc.). Here RFECV will be used, since it includes cross-validation, which makes it possible to find the most important features and check them on all subsets of the sample:


 
    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(df, target, test_size=0.3, random_state=42)

 
    from sklearn.feature_selection import RFECV
    from sklearn.ensemble import RandomForestClassifier

    forest = RandomForestClassifier(random_state=42)
    rfecv = RFECV(estimator=forest, cv=5, scoring='accuracy')
    rfecv = rfecv.fit(X_train, y_train)

    plt.figure()
    plt.xlabel("Number of features selected")
    plt.ylabel("Cross validation score of number of selected features")
    plt.plot(range(1, len(rfecv.grid_scores_) + 1), rfecv.grid_scores_, '--o')

    indices = rfecv.get_support()
    columns = X_train.columns[indices]
    print('The most important columns are {}'.format(', '.join(columns)))

 


 
As you can see, RFECV with a RandomForestClassifier determined that only 7 of the 18 features matter; using the rest decreases the model's accuracy.


 
    The most important columns are bonus, deferred_income, other, exercised_stock_options, shared_receipt_with_poi, ratio_of_poi_mail, ratio_of_mail_to_poi    
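
The same count of selected features can also be read directly off the fitted selector:

    print(rfecv.n_features_)  # 7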

 

These 7 features will be used from now on, in order to simplify the model and reduce the risk of overfitting:

  • bonus  
  • deferred_income  
  • other  
  • exercised_stock_options  
  • shared_receipt_with_poi  
  • ratio_of_poi_mail  
  • ratio_of_mail_to_poi  

 

Let's restrict the training and test samples to these columns for the upcoming model training:


 
    X_train = X_train[columns]
    X_test = X_test[columns]

 
This is the end of the first part describing the use of the Enron dataset as an example of a classification problem in ML. It is based on materials from the Introduction to Machine Learning course on Udacity. There is also a Python notebook reflecting the entire sequence of actions.

Author: weber
Published: 30-09-2018, 13:17
Category: Machine learning / Python