The Gini coefficient: from economics to machine learning

An interesting fact: in 1912 the Italian statistician and demographer Corrado Gini wrote his famous work "Variability and Mutability", and in the same year the Titanic sank in the waters of the Atlantic. What could these two events have in common? It is simple: the consequences of both have found wide application in machine learning. The Titanic dataset needs no introduction, so we will talk in more detail about one remarkable statistic first published in the Italian scientist's work. I want to note right away that this article has nothing to do with Gini impurity, which is used in decision trees as a split quality criterion in classification problems. The two are in no way related, and they have about as much in common as a tractor in the Bryansk region and a lawnmower in Oklahoma.
 
 
The Gini coefficient is a quality metric that is often used to evaluate predictive models in binary classification problems with strongly imbalanced classes of the target variable. It is widely used in bank lending, insurance and targeted marketing. To fully understand this metric, we first need to dive into economics and figure out what it is used for there.
 
There are several ways to calculate this coefficient; we will focus on Brown's formula (a variational series must be built first, that is, the population must be ranked by income):

$$G = 1 - \sum_{k=1}^{n} (X_k - X_{k-1})(Y_k + Y_{k-1})$$

where

$n$ - the number of inhabitants,

$X_k$ - the cumulative share of the population,

$Y_k$ - the cumulative share of income for $X_k$ (with $X_0 = Y_0 = 0$).
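Brown's formula can be sketched as a small function. This is only an illustration: the name `gini_brown` is hypothetical (it does not come from the article), and the sketch assumes non-negative incomes with a positive total.

```python
import numpy as np

def gini_brown(incomes):
    """Gini coefficient via Brown's formula: 1 - sum((X_k - X_{k-1}) * (Y_k + Y_{k-1}))."""
    income = np.sort(np.asarray(incomes, dtype=float))  # variational series (ascending)
    n = income.size
    x = np.arange(1, n + 1) / n                          # cumulative share of population, X_k
    y = np.cumsum(income) / income.sum()                 # cumulative share of income, Y_k
    x_prev = np.concatenate(([0.0], x[:-1]))             # X_{k-1}, with X_0 = 0
    y_prev = np.concatenate(([0.0], y[:-1]))             # Y_{k-1}, with Y_0 = 0
    return 1.0 - np.sum((x - x_prev) * (y + y_prev))

# Equal incomes give no inequality; one person holding everything gives 1 - 1/n
print(gini_brown([10, 10, 10, 10]))
print(gini_brown([0, 0, 0, 10]))
```

Sorting inside the function means the caller does not have to rank the population beforehand.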

 
 
Let's work through a toy example to get an intuitive feel for this statistic.

Suppose there are three villages with 10 inhabitants each. In each village the total annual income of the population is 100 rubles. In the first village everyone earns the same, 10 rubles a year. In the second village the income distribution is different: 3 people earn 5 rubles, 4 people earn 10 rubles and 3 people earn 15 rubles each. In the third village 7 people earn 1 ruble a year, 1 person earns 10 rubles, 1 person earns 33 rubles and 1 person earns 50 rubles. For each village we calculate the Gini coefficient and plot the Lorenz curve.

Let's present the initial data for the villages as a table and immediately calculate $X_k$ and $Y_k$ for clarity:
 
 
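Python code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Jupyter magic; drop the next line outside a notebook
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

# Incomes of the ten inhabitants of each village, already sorted in ascending order
village = pd.DataFrame({'Person': ['Person_{}'.format(i) for i in range(1, 11)],
                        'Income_Village_1': [10] * 10,
                        'Income_Village_2': [5, 5, 5, 10, 10, 10, 10, 15, 15, 15],
                        'Income_Village_3': [1, 1, 1, 1, 1, 1, 1, 10, 33, 50]})
# X_k: cumulative share of the population; Y_k: cumulative share of income per village
village['Cum_population'] = np.cumsum(np.ones(10) / 10)
village['Cum_Income_Village_1'] = np.cumsum(village['Income_Village_1'] / 100)
village['Cum_Income_Village_2'] = np.cumsum(village['Income_Village_2'] / 100)
village['Cum_Income_Village_3'] = np.cumsum(village['Income_Village_3'] / 100)
# Reorder columns by name (positional indices depend on the pandas version)
village = village[['Person', 'Cum_population',
                   'Income_Village_1', 'Cum_Income_Village_1',
                   'Income_Village_2', 'Cum_Income_Village_2',
                   'Income_Village_3', 'Cum_Income_Village_3']]
village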

 
 
 
 
 
Python code:
plt.figure(figsize=(12, 8))  # figure size assumed; the original value was lost
Gini = []
for i in range(1, 4):
    X_k = village['Cum_population'].values
    X_k_1 = village['Cum_population'].shift().fillna(0).values
    Y_k = village['Cum_Income_Village_{}'.format(i)].values
    Y_k_1 = village['Cum_Income_Village_{}'.format(i)].shift().fillna(0).values
    # Brown's formula
    Gini.append(1 - np.sum((X_k - X_k_1) * (Y_k + Y_k_1)))
    # Prepend the point (0, 0) so that every Lorenz curve starts at the origin
    plt.plot(np.insert(X_k, 0, 0), np.insert(Y_k, 0, 0),
             label='Village {} (Gini = {:0.2f})'.format(i, Gini[i - 1]))
plt.title('Gini Coefficient')
plt.xlabel('Cumulative share of the population')
plt.ylabel('Cumulative share of income')
plt.legend(loc="upper left")
plt.xlim(0, 1)
plt.ylim(0, 1)
plt.show()

 
 
 
 
 
It can be seen that the Lorenz curve of the first village coincides exactly with the diagonal (the "line of absolute equality"), and the greater the income stratification of the population, the larger the area of the figure formed by the Lorenz curve and the diagonal. Using the third village as an example, let's show that the ratio of the area of this figure to the area of the triangle under the line of absolute equality is exactly equal to the Gini coefficient:
 
 
Python code:
# Lorenz curve of the third village, starting from the point (0, 0)
x_lorenz = np.insert(village['Cum_population'].values, 0, 0)
y_lorenz = np.insert(village['Cum_Income_Village_3'].values, 0, 0)
# Area under the curve by the trapezoid rule (np.trapz is named np.trapezoid in NumPy 2.0+)
curve_area = np.trapz(y_lorenz, x_lorenz)
# Share of the triangle under the diagonal (area 0.5) that lies above the Lorenz curve
S = (0.5 - curve_area) / 0.5
# Figure size, line widths and text position are assumed; the original values were lost
plt.figure(figsize=(12, 8))
plt.plot([0, 1], [0, 1], linestyle='-', lw=2, color='black')
plt.plot(x_lorenz, y_lorenz,
         label='Village {} (Gini = {:0.2f})'.format(3, Gini[2]), lw=2, color='green')
plt.fill_between(x_lorenz, x_lorenz, y2=y_lorenz, alpha=0.5)
plt.text(0.4, 0.2, 'S = {:0.2f}'.format(S), fontsize=28)
plt.title('Gini Coefficient')
plt.xlabel('Cumulative share of the population')
plt.ylabel('Cumulative share of income')
plt.legend(loc="upper left")
plt.xlim(0, 1)
plt.ylim(0, 1)
plt.show()

 
 
 
 
 
We have shown that, along with algebraic methods, the Gini coefficient can also be calculated geometrically, as the share of the area between the Lorenz curve and the line of absolute equality in the total area under the line of absolute equality.

Another important point. Let's mentally fix the ends of the curve at the points $(0, 0)$ and $(1, 1)$ and start changing its shape. The curve can be reshaped so that the area of the figure does not change; by doing so we move members of society from the "middle class" into the poor or the rich without changing the ratio of income between the classes. Take, for example, ten people with the following income:

Now let's apply Sharikov's method, "Take everything and divide it up!", to the person with an income of "20", redistributing his income proportionally among the rest of society. In this case the Gini coefficient will not change and will remain at ????; we have just dragged the "fixed" Lorenz curve towards the abscissa and changed its shape:

Let's dwell on one more important point: when calculating the Gini coefficient we do not classify people as poor or rich in any way; it does not depend on whom we consider a pauper or an oligarch. But suppose we did face such a task. Then, depending on what we want to obtain and what our goals are, we would need to set an income threshold that clearly divides people into poor and rich. If you see in this an analogy with the threshold from binary classification problems, then it is time for us to move on to machine learning.
 
 

Machine learning


 

1. General understanding


 
It is worth noting right away that, having arrived in machine learning, the Gini coefficient changed a lot: it is calculated differently and has a different meaning. Numerically, the coefficient is equal to the area of the figure formed by the line of absolute equality and the Lorenz curve. It still shares some features with its relative from economics: we still need to build the Lorenz curve and calculate the areas of figures, and, most importantly, the algorithm for plotting the curve has not changed. The Lorenz curve has also changed: it is now called the Lift Curve and is a mirror image of the Lorenz curve about the line of absolute equality (because the probabilities are ranked in descending rather than ascending order). We will work through all of this in another toy example. To minimize the error in calculating the areas of the figures, we will use the scipy functions interp1d (interpolation of a one-dimensional function) and quad (calculation of a definite integral).
 
 
Suppose we are solving a binary classification problem for 15 objects, 6 of class "1" and 9 of class "0" (the list actual in the code below).

Our trained algorithm predicts the probabilities of belonging to class "1" for these objects (the list predict in the code below).
 
 
 
 
We calculate the Gini coefficient for two models: our trained algorithm and an ideal model that predicts the classes exactly, with a probability of 100%. The idea is this: instead of ranking the population by income level, we rank the model's predicted probabilities in descending order and substitute into the formula the cumulative share of the true values of the target variable corresponding to those probabilities. In other words, we sort the table by the "Predict" row and take the cumulative share of true classes instead of the cumulative share of income.
 
 
 
 
Python code:
from scipy.interpolate import interp1d
from scipy.integrate import quad

actual = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
predict = [0.9, 0.3, 0.8, 0.75, 0.65, 0.6, 0.78, 0.7, 0.05, 0.4, 0.4, 0.05, 0.5, 0.1, 0.1]

# Rank the true labels by predicted probability, highest first
data = zip(actual, predict)
sorted_data = sorted(data, key=lambda d: d[1], reverse=True)
sorted_actual = [d[0] for d in sorted_data]

# Lift Curve coordinates for the trained model and for the ideal model
cumulative_actual = np.cumsum(sorted_actual) / sum(actual)
cumulative_index = np.arange(1, len(cumulative_actual) + 1) / len(predict)
cumulative_actual_perfect = np.cumsum(sorted(actual, reverse=True)) / sum(actual)

x_values = [0] + list(cumulative_index)
y_values = [0] + list(cumulative_actual)
y_values_perfect = [0] + list(cumulative_actual_perfect)

# Area under each curve minus the 0.5 that lies under the diagonal
f1, f2 = interp1d(x_values, y_values), interp1d(x_values, y_values_perfect)
S_pred = quad(f1, 0, 1, points=x_values)[0] - 0.5
S_actual = quad(f2, 0, 1, points=x_values)[0] - 0.5

# Figure size, line widths and text positions are assumed; the original values were lost
fig, ax = plt.subplots(nrows=1, ncols=2, sharey=True, figsize=(14, 7))
ax[0].plot(x_values, y_values, lw=2, color='blue', marker='x')
ax[0].fill_between(x_values, x_values, y_values, color='blue', alpha=0.1)
ax[0].text(0.4, 0.2, 'S = {:0.4f}'.format(S_pred), fontsize=28)
ax[1].plot(x_values, y_values_perfect, lw=2, color='green', marker='x')
ax[1].fill_between(x_values, x_values, y_values_perfect, color='green', alpha=0.1)
ax[1].text(0.4, 0.2, 'S = {:0.4f}'.format(S_actual), fontsize=28)
for i in range(2):
    # The original called ax.plot here, which fails on an array of axes
    ax[i].plot([0, 1], [0, 1], linestyle='-', lw=2, color='black')
    ax[i].set(title='Gini coefficient', xlabel='Cumulative fraction of objects',
              ylabel='Cumulative fraction of true classes', xlim=(0, 1), ylim=(0, 1))
plt.show()

 
 
 
 
 
The Gini coefficient of the trained model is 0.1889. Is that a little or a lot? How accurate is the algorithm? Without knowing the exact value of the coefficient for the ideal algorithm, we cannot say anything about our model. Therefore the quality metric in machine learning is the normalized Gini coefficient, which is equal to the ratio of the coefficient of the trained model to the coefficient of the ideal model. From here on, the term "Gini coefficient" will mean exactly this.

$$Gini_{normalized} = \frac{Gini_{model}}{Gini_{perfect}} \quad (1)$$
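The normalization can be sketched compactly without interpolation, since the Lift Curve is piecewise linear and the trapezoid rule is exact for it. The function names `lift_area` and `normalized_gini` are illustrative, and tied scores are ordered by their original position (as in a stable sort), which is an assumption of this sketch.

```python
import numpy as np

def lift_area(actual, scores):
    """Area between the Lift Curve and the diagonal (unnormalized Gini)."""
    actual = np.asarray(actual, dtype=float)
    # Descending order of scores; a stable sort keeps tied objects in input order
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    y = np.concatenate(([0.0], np.cumsum(actual[order]) / actual.sum()))
    x = np.arange(len(actual) + 1) / len(actual)
    area_under_curve = np.sum(np.diff(x) * (y[1:] + y[:-1]) / 2.0)  # trapezoid rule
    return area_under_curve - 0.5

def normalized_gini(actual, scores):
    """Model Gini divided by the Gini of the ideal ranking (scores equal to the labels)."""
    return lift_area(actual, scores) / lift_area(actual, actual)

actual = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
predict = [0.9, 0.3, 0.8, 0.75, 0.65, 0.6, 0.78, 0.7, 0.05, 0.4, 0.4, 0.05, 0.5, 0.1, 0.1]
print(lift_area(actual, predict), lift_area(actual, actual), normalized_gini(actual, predict))
```

Passing the labels themselves as scores is a convenient way to obtain the ideal ranking without a separate code path.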
 

 
Looking at these two graphs, we can draw the following conclusions:
 
 
 
The Gini coefficient of the ideal algorithm is the maximum Gini coefficient for the current data set and depends only on the true distribution of classes in the problem.

The area of the figure for the ideal algorithm is:

$$S = \frac{n_0}{2n}$$

where $n_0$ is the number of objects of class "0" and $n$ is the total number of objects.

The Gini coefficient of a trained model cannot exceed the coefficient of the ideal algorithm.

With a uniform distribution of classes of the target variable, the Gini coefficient of the ideal algorithm will always be equal to 0.25.

For an ideal algorithm, the shape of the figure formed by the Lift Curve and the line of absolute equality will always be a triangle.

The Gini coefficient of the random algorithm is 0, and its Lift Curve coincides with the line of absolute equality.

The Gini coefficient of the trained algorithm will always be less than the coefficient of the ideal algorithm.

The values of the normalized Gini coefficient for the trained algorithm lie in the range $[0, 1]$ (for a model that ranks no worse than random).

The normalized Gini coefficient is a quality metric that needs to be maximized.
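Because the figure for the ideal algorithm is a triangle with base $n_0 / n$ and height 1, its area should equal $n_0 / (2n)$, where $n_0$ is the number of class "0" objects and $n$ the sample size. The sketch below checks this closed form against a direct geometric computation for a few class ratios; `ideal_lift_area` is an illustrative name and the label lists are made up for the check.

```python
import numpy as np

def ideal_lift_area(actual):
    """Area between the ideal Lift Curve and the diagonal, by the trapezoid rule."""
    actual = np.asarray(actual, dtype=float)
    ranked = np.sort(actual)[::-1]                        # all class "1" objects first
    y = np.concatenate(([0.0], np.cumsum(ranked) / actual.sum()))
    x = np.arange(len(actual) + 1) / len(actual)
    return np.sum(np.diff(x) * (y[1:] + y[:-1]) / 2.0) - 0.5

# The toy example (6 vs 9), balanced classes, and a strongly imbalanced case
for labels in ([1] * 6 + [0] * 9, [1] * 5 + [0] * 5, [1] * 3 + [0] * 97):
    n, n0 = len(labels), labels.count(0)
    print(n0 / (2 * n), ideal_lift_area(labels))
```

The rarer class "1" is, the closer the ideal area gets to 0.5, which is why the unnormalized value alone says little about model quality.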
 
 
 

2. Algebraic representation. Proof of a linear relationship with AUC ROC.


 
We have come to perhaps the most interesting point: the algebraic representation of the Gini coefficient. How is this metric calculated? It is not equal to its relative from economics. It is known that the coefficient can be calculated by the following formula:

$$Gini_{normalized} = 2 \cdot AUC_{ROC} - 1 \quad (2)$$

I honestly tried to find a derivation of this formula on the Internet and found nothing, not even in foreign books and scientific articles. On some dubious statistics websites, however, I came across the phrase: "It's so obvious that there's nothing to discuss. It is enough to compare the Lift Curve and the ROC Curve for everything to become clear immediately." A little later, when I derived the formula connecting these two metrics myself, I realized that this phrase is an excellent indicator: if you hear or read it, the only obvious thing is that its author has no understanding of the Gini coefficient. Let's take a look at the Lift Curve and the ROC Curve for our example:
 
 
Python code:
from sklearn.metrics import roc_curve, roc_auc_score

aucroc = roc_auc_score(actual, predict)
gini = 2 * roc_auc_score(actual, predict) - 1
fpr, tpr, t = roc_curve(actual, predict)

# Figure size, font size, line widths and text positions are assumed; the originals were lost
fig, ax = plt.subplots(nrows=1, ncols=3, sharey=True, figsize=(15, 5))
fig.suptitle('Gini = 2 * AUCROC - 1 = {:0.2f}\n\n'.format(gini), fontsize=18, fontweight='bold')
# Left panel: ROC curve and the area under it
ax[0].plot([0] + fpr.tolist(), [0] + tpr.tolist(), lw=2, color='red')
ax[0].fill_between([0] + fpr.tolist(), [0] + tpr.tolist(), color='red', alpha=0.1)
ax[0].text(0.4, 0.2, 'S = {:0.2f}'.format(aucroc), fontsize=28)
# Middle panel: Lift Curve of the trained model
ax[1].plot(x_values, y_values, lw=2, color='blue')
ax[1].fill_between(x_values, x_values, y_values, color='blue', alpha=0.1)
ax[1].text(0.4, 0.2, 'S = {:0.2f}'.format(S_pred), fontsize=28)
# Right panel: Lift Curve of the ideal model
ax[2].plot(x_values, y_values_perfect, lw=2, color='green')
ax[2].fill_between(x_values, x_values, y_values_perfect, color='green', alpha=0.1)
ax[2].text(0.4, 0.2, 'S = {:0.2f}'.format(S_actual), fontsize=28)
ax[0].set(title='ROC-AUC', xlabel='False Positive Rate',
          ylabel='True Positive Rate', xlim=(0, 1), ylim=(0, 1))
for i in range(1, 3):
    ax[i].plot([0, 1], [0, 1], linestyle='-', lw=2, color='black')
    ax[i].set(title='Gini coefficient', xlabel='Cumulative fraction of objects',
              ylabel='Cumulative fraction of true classes', xlim=(0, 1), ylim=(0, 1))
plt.show()

 
 
 
 
 
It is perfectly clear that the connection cannot be grasped from the graphical representation of the metrics, so we will prove the equality algebraically. I managed to do this in two ways: parametrically (with integrals) and nonparametrically (via the Wilcoxon-Mann-Whitney statistic). The second way is much simpler and avoids multi-storey fractions with double integrals, so we will dwell on it in detail. Before turning to the proofs, let's settle the terminology: the cumulative fraction of true classes is nothing other than the True Positive Rate. The cumulative fraction of objects is, in turn, the number of objects in the ranked series (or, when scaled to the interval $[0, 1]$, the proportion of objects).

To understand the proof you need a basic understanding of the ROC-AUC metric: what it is, how the chart is plotted and in which axes. I recommend the article "AUC ROC (area under the error curve)" from Alexander Dyakonov's blog.
 
 
We introduce the following notation:

$n$ - the number of objects in the sample

$n_0$ - the number of objects of class "0"

$n_1$ - the number of objects of class "1"

$TP$ - True Positive (correct answer of the model on the true class "1" at the given threshold)

$FP$ - False Positive (wrong answer of the model on the true class "0" at the given threshold)

$TPR$ - True Positive Rate (the ratio of $TP$ to $n_1$)

$FPR$ - False Positive Rate (the ratio of $FP$ to $n_0$)

$i, j$ - the current index of an element.
 
 
 

Parametric method


 
 
The parametric equation for the ROC curve can be written in the following form:

$$AUC = \int_{0}^{1} TPR \, dFPR = \int_{0}^{1} \frac{TP}{n_1} \, d\frac{FP}{n_0} = \frac{1}{n_1 n_0} \int_{0}^{n_0} TP \, dFP \quad (3)$$

When plotting the Lift Curve, along the $X$ axis we plot the share of objects (their number) pre-sorted in descending order. Thus, the parametric equation for the Gini coefficient will be as follows:

$$Gini_{model} = \int_{0}^{1} TPR \, d\frac{TP + FP}{n} - 0.5 = \frac{1}{n_1 n} \int_{0}^{n} TP \, d(TP + FP) - 0.5 \quad (4)$$

Substituting expression (4) into expression (1) for both models and transforming it, we see that expression (3) can be substituted into one of the parts, which finally gives us the beautiful formula for the normalized Gini (2).
 
 

Nonparametric method


 
 
In the proof I relied on the elementary postulates of probability theory. It is known that the numerical value of AUC ROC is equal to the Wilcoxon-Mann-Whitney statistic:

$$AUC = \frac{\sum_{i=1}^{n_1} \sum_{j=1}^{n_0} S(x_i, x_j)}{n_1 n_0} \quad (5)$$

$$S(x_i, x_j) = \begin{cases} 1, & x_i > x_j \\ \frac{1}{2}, & x_i = x_j \\ 0, & x_i < x_j \end{cases}$$

where $x_i$ is the algorithm's response on the i-th object from the distribution "1" and $x_j$ is the algorithm's response on the j-th object from the distribution "0".

A proof of this formula can be found, for example, here.

The interpretation is very intuitive: if we randomly draw a pair of objects, the first from the distribution "1" and the second from the distribution "0", then the probability that the first object has a higher predicted value than the second (with ties counted as one half) is equal to the AUC ROC value. Combinatorially it is easy to calculate that the number of such pairs is $n_1 n_0$.
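The pairwise definition translates almost literally into code. The sketch below (the function name `auc_wmw` is illustrative) compares every class "1" response with every class "0" response, so it needs memory proportional to the number of pairs and is meant for small samples only.

```python
import numpy as np

def auc_wmw(actual, scores):
    """AUC as the Wilcoxon-Mann-Whitney statistic over all (class 1, class 0) pairs."""
    actual = np.asarray(actual)
    scores = np.asarray(scores, dtype=float)
    pos = scores[actual == 1]             # responses on class "1" objects
    neg = scores[actual == 0]             # responses on class "0" objects
    diff = pos[:, None] - neg[None, :]    # every pairwise difference
    wins = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)   # ties count as one half
    return wins / (pos.size * neg.size)

actual = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
predict = [0.9, 0.3, 0.8, 0.75, 0.65, 0.6, 0.78, 0.7, 0.05, 0.4, 0.4, 0.05, 0.5, 0.1, 0.1]
print(auc_wmw(actual, predict))
```

For the toy example 44 of the 54 pairs are ordered correctly, which gives the same value as the area under the ROC curve.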
 
 
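Let the model predict $k$ possible values from the set $S = \{s_1, \dots, s_k\}$, where $s_1 > \dots > s_k$ and $S$ is some probability distribution whose elements take values in the interval $[0, 1]$.

Let $S_{n_1}$ be the set of values taken by the objects of class "1", with $S_{n_1} \subseteq S$. Let $S_{n_0}$ be the set of values taken by the objects of class "0", with $S_{n_0} \subseteq S$. Obviously, the sets $S_{n_1}$ and $S_{n_0}$ can intersect.

Denote by $p_{n_0}^{i}$ the probability that an object of class "0" takes the value $s_i$, and by $p_{n_1}^{i}$ the probability that an object of class "1" takes the value $s_i$. Then $\sum_{i=1}^{k} p_{n_0}^{i} = 1$ and $\sum_{i=1}^{k} p_{n_1}^{i} = 1$.

Having the prior probability $\pi = n_1 / n$ that a sample object belongs to class "1", we can write a formula for the probability that an object takes the value $s_i$:

$$p_{n}^{i} = \pi p_{n_1}^{i} + (1 - \pi) p_{n_0}^{i}$$

We define three distribution functions:

$CDF_{n_1}^{i}$ - for objects of class "1"

$CDF_{n_0}^{i}$ - for objects of class "0"

$CDF_{n}^{i}$ - for all sample objects

$$CDF_{n_1}^{i} = P(S_{n_1} \geq s_i) = \sum_{j=1}^{i} p_{n_1}^{j}$$

$$CDF_{n_0}^{i} = P(S_{n_0} \geq s_i) = \sum_{j=1}^{i} p_{n_0}^{j}$$

$$CDF_{n}^{i} = P(S \geq s_i) = \sum_{j=1}^{i} p_{n}^{j}$$

for $i = 1, \dots, k$, with $CDF^{0} = 0$ in all three cases.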
 
 
An example of how the distribution functions for the two classes in the credit scoring problem may look:
 
 
 
 
The figure also shows the Kolmogorov-Smirnov statistic, which is likewise used to evaluate models.
 
 
We write the Wilcoxon formula in probabilistic form and transform it (the values are taken in descending order, and $P(S_{n_1} \geq s_0) = 0$):

$$AUC = P(S_{n_1} > S_{n_0}) + \frac{1}{2} P(S_{n_1} = S_{n_0}) = \sum_{i=1}^{k} P(S_{n_1} \geq s_{i-1}) P(S_{n_0} = s_i) + \frac{1}{2} \sum_{i=1}^{k} P(S_{n_1} = s_i) P(S_{n_0} = s_i) =$$

$$= \sum_{i=1}^{k} \big( P(S_{n_1} \geq s_{i-1}) + \frac{1}{2} P(S_{n_1} = s_i) \big) P(S_{n_0} = s_i) = \sum_{i=1}^{k} \frac{1}{2} \big( P(S_{n_1} \geq s_i) + P(S_{n_1} \geq s_{i-1}) \big) P(S_{n_0} = s_i) =$$

$$= \sum_{i=1}^{k} \frac{1}{2} (CDF_{n_1}^{i} + CDF_{n_1}^{i-1})(CDF_{n_0}^{i} - CDF_{n_0}^{i-1}) \quad (6)$$
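The last expression in (6) is the trapezoid rule applied to the two class-wise distribution functions. As a rough check, the sketch below evaluates that sum directly; the name `auc_from_cdfs` is illustrative, and the distinct score values are taken in descending order, which is the convention assumed here.

```python
import numpy as np

def auc_from_cdfs(actual, scores):
    """AUC as sum of 0.5 * (CDF1_i + CDF1_{i-1}) * (CDF0_i - CDF0_{i-1})."""
    actual = np.asarray(actual)
    scores = np.asarray(scores, dtype=float)
    levels = np.unique(scores)[::-1]                    # distinct values, highest first
    pos, neg = scores[actual == 1], scores[actual == 0]
    # Share of each class with a score at or above every level; index 0 is the empty start
    cdf1 = np.array([0.0] + [np.mean(pos >= s) for s in levels])
    cdf0 = np.array([0.0] + [np.mean(neg >= s) for s in levels])
    return np.sum(0.5 * (cdf1[1:] + cdf1[:-1]) * (cdf0[1:] - cdf0[:-1]))

actual = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
predict = [0.9, 0.3, 0.8, 0.75, 0.65, 0.6, 0.78, 0.7, 0.05, 0.4, 0.4, 0.05, 0.5, 0.1, 0.1]
print(auc_from_cdfs(actual, predict))
```

Working with distinct score levels rather than individual objects is what makes tied predictions contribute exactly one half.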
 
 
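An analogous formula can be written for the area under the Lift Curve (remember that it is the sum of two areas, one of which is always 0.5):

$$Gini_{model} + 0.5 = \sum_{i=1}^{k} \frac{1}{2} (CDF_{n_1}^{i} + CDF_{n_1}^{i-1})(CDF_{n}^{i} - CDF_{n}^{i-1}) \quad (7)$$

And now we transform it. Since $CDF_{n}^{i} = \pi \, CDF_{n_1}^{i} + (1 - \pi) \, CDF_{n_0}^{i}$, where $\pi = n_1 / n$ is the prior probability of class "1", the sum splits into two parts. The first telescopes to $\frac{\pi}{2}$, and the second is $(1 - \pi)$ times expression (6):

$$Gini_{model} = \frac{\pi}{2} \sum_{i=1}^{k} \big( (CDF_{n_1}^{i})^2 - (CDF_{n_1}^{i-1})^2 \big) + (1 - \pi) AUC - 0.5 = \frac{\pi}{2} + (1 - \pi) AUC - 0.5 = \frac{(1 - \pi)(2 AUC - 1)}{2} \quad (8)$$

For an ideal model, the formula is simple ($AUC = 1$):

$$Gini_{perfect} = \frac{1 - \pi}{2} \quad (9)$$

Therefore, from (8) and (9), we obtain:

$$Gini_{normalized} = \frac{Gini_{model}}{Gini_{perfect}} = 2 AUC - 1 \quad (10)$$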
 
 
As we used to say at school: which was to be proved.
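The identity between the normalized Gini coefficient and 2 * AUC - 1 can also be checked numerically. The sketch below is only a sanity check on synthetic data: the scores are made up, the helper names are illustrative, and the AUC is computed from the rank-sum form of the Mann-Whitney statistic via `scipy.stats.rankdata`, a different route from the Lift Curve area.

```python
import numpy as np
from scipy.stats import rankdata

def auc_rank(actual, scores):
    """AUC from the rank sum of class "1" objects (ties receive average ranks)."""
    actual = np.asarray(actual)
    ranks = rankdata(scores)                      # ascending ranks starting at 1
    n1 = int(np.sum(actual == 1))
    n0 = actual.size - n1
    return (ranks[actual == 1].sum() - n1 * (n1 + 1) / 2.0) / (n1 * n0)

def lift_gini(actual, scores):
    """Area between the Lift Curve and the diagonal; assumes no ties across classes."""
    actual = np.asarray(actual, dtype=float)
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    y = np.concatenate(([0.0], np.cumsum(actual[order]) / actual.sum()))
    x = np.arange(actual.size + 1) / actual.size
    return np.sum(np.diff(x) * (y[1:] + y[:-1]) / 2.0) - 0.5

rng = np.random.default_rng(42)
actual = rng.integers(0, 2, size=500)             # synthetic labels
scores = 0.5 * rng.random(500) + 0.2 * actual     # synthetic, continuous scores
auc = auc_rank(actual, scores)
gini_normalized = lift_gini(actual, scores) / lift_gini(actual, actual)
print(auc, gini_normalized, 2 * auc - 1)
```

The two printed Gini values should agree up to floating-point error, whatever the class balance of the synthetic sample.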
 
 

3. Practical application.


 
 
As mentioned at the beginning of the article, the Gini coefficient is used to evaluate models in many areas, including bank lending, insurance and targeted marketing, and there is a perfectly reasonable explanation for this. This article does not aim to cover the practical application of the statistic in any particular field in detail. Many books have been written on the subject; we will only touch on it briefly.
 
 

Credit scoring


 
 
All over the world, banks receive thousands of loan applications every day. Naturally, they need some way to assess the risk that a client simply will not repay the loan. For this purpose predictive models are developed that use the feature space to estimate the probability that the client will not repay. These models must first be evaluated somehow, and if a model is successful, an optimal probability threshold is chosen. The choice of the optimal threshold is determined by the bank's policy. The task of the analysis when selecting the threshold is to minimize the risk of lost profit associated with refusing to issue a loan. But to choose a threshold, one needs a good-quality model. The main quality metrics in the banking sector are:
 
 
 
The Gini coefficient

The Kolmogorov-Smirnov statistic (calculated as the maximum difference between the cumulative distribution functions of "bad" and "good" borrowers; the figure with the distributions and this statistic was shown above)

The divergence coefficient (an estimate of the difference between the mathematical expectations of the scoring distributions for "bad" and "good" borrowers, normalized by the variances of these distributions; the larger the divergence coefficient, the better the quality of the model)
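The other two metrics in this list can be sketched in a few lines. The Kolmogorov-Smirnov part follows the description above directly. The divergence formula used here, the squared difference of means divided by the average of the two variances, is an assumption: it is one common form of the coefficient, and the article does not spell out which one it means. The function names are illustrative.

```python
import numpy as np

def ks_statistic(actual, scores):
    """Maximum gap between the empirical CDFs of the scores of the two classes."""
    actual = np.asarray(actual)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[actual == 1], scores[actual == 0]
    levels = np.unique(scores)
    cdf_pos = np.array([np.mean(pos <= s) for s in levels])
    cdf_neg = np.array([np.mean(neg <= s) for s in levels])
    return np.max(np.abs(cdf_pos - cdf_neg))

def divergence(actual, scores):
    """Assumed form: (mean_1 - mean_0)^2 / ((var_1 + var_0) / 2), population variances."""
    actual = np.asarray(actual)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[actual == 1], scores[actual == 0]
    return (pos.mean() - neg.mean()) ** 2 / (0.5 * (pos.var() + neg.var()))

actual = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
predict = [0.9, 0.3, 0.8, 0.75, 0.65, 0.6, 0.78, 0.7, 0.05, 0.4, 0.4, 0.05, 0.5, 0.1, 0.1]
print(ks_statistic(actual, predict), divergence(actual, predict))
```

Unlike the Gini coefficient, the Kolmogorov-Smirnov statistic looks at a single threshold, the one where the two distributions are furthest apart.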
 
 
 
I do not know how things stand in Russia, although I live here, but in Europe the Gini coefficient is the most widely used, and in North America the Kolmogorov-Smirnov statistic.
 
 

Insurance


 
In this area everything is similar to the banking sector, with the only difference that we need to divide customers into those who will file an insurance claim and those who will not. Let's consider a practical example from this area in which one feature of the Lift Curve is clearly visible: for strongly unbalanced classes in the target variable, the curve almost perfectly coincides with the ROC curve.
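Why the two curves nearly coincide can be seen from their horizontal axes: at any cut-off, the share of objects equals pi * TPR + (1 - pi) * FPR, where pi is the share of class "1", so it differs from FPR by at most pi. The sketch below illustrates this on synthetic data; the sample size, the roughly 3% positive rate and the score distribution are all made-up assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, pos_rate = 20000, 0.03                          # illustrative size and class balance
actual = (rng.random(n) < pos_rate).astype(int)
scores = rng.normal(loc=actual * 1.0, scale=1.0)   # synthetic scores, higher for class "1"

order = np.argsort(-scores)                        # rank objects by descending score
sorted_actual = actual[order]
tp = np.cumsum(sorted_actual)                      # true positives among the top-k objects
fp = np.cumsum(1 - sorted_actual)                  # false positives among the top-k objects

share = (tp + fp) / n                              # x-axis of the Lift Curve
fpr = fp / fp[-1]                                  # x-axis of the ROC Curve
tpr = tp / tp[-1]                                  # shared y-axis
pi = tp[-1] / n                                    # observed share of class "1"

max_gap = np.max(np.abs(share - fpr))
print(pi, max_gap)
```

Both curves share the same vertical axis, so a horizontal gap bounded by the share of the rare class is all that separates them.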
 
 
A few months ago Kaggle hosted the "Porto Seguro's Safe Driver Prediction" competition, in which the task was precisely to predict an "Insurance Claim", the filing of an insurance claim. It is also the competition in which I missed a silver medal through my own stupidity by choosing the wrong submission.

It was a very strange and at the same time incredibly instructive competition, with a record number of participants: 5169. The winner, Michael Jahrer, wrote his code only in C++/CUDA, which commands admiration and respect.

Porto Seguro is a Brazilian company specializing in car insurance.

The dataset consisted of 595207 rows in train, 892816 rows in test and 53 anonymized features. The ratio of classes in the target is 3% to 97%. We will write a simple baseline, which fortunately takes just a couple of lines, and plot the charts. Note that the curves almost perfectly coincide; the difference between the areas under the Lift Curve and the ROC Curve is ???.
 
 
The Python code is [/b]
from sklearn.model_selection import train_test_split
import xgboost as xgb
from scipy.interpolate import interp1d
from scipy.integrate import quad
df = pd.read_csv ('train.csv', index_col = 'id')
unwanted = df.columns[df.columns.str.startswith('ps_calc_')]
df.drop (unwanted, inplace = True, axis = 1)
df.fillna (-99? inplace = True)
# NOTE: values marked "assumed" were garbled in the source and are placeholders.
train, test = train_test_split(df, stratify=df.target,
                               test_size=0.2,  # source reads "0.2?"; assumed
                               random_state=1)
estimator = xgb.XGBClassifier(random_state=0, n_jobs=-1)  # seed value assumed
estimator.fit(train.drop('target', axis=1), train.target)
pred = estimator.predict_proba(test.drop('target', axis=1))[:, 1]
test['predict'] = pred

actual = test.target.values
predict = test.predict.values

# Rank the clients by predicted probability, highest first.
data = zip(actual, predict)
sorted_data = sorted(data, key=lambda d: d[1], reverse=True)
sorted_actual = [d[0] for d in sorted_data]

# Cumulative share of positives for the model and for a perfect ranking.
cumulative_actual = np.cumsum(sorted_actual) / sum(actual)
cumulative_index = np.arange(1, len(cumulative_actual) + 1) / len(predict)
cumulative_actual_perfect = np.cumsum(sorted(actual, reverse=True)) / sum(actual)

aucroc = roc_auc_score(actual, predict)
gini = 2 * aucroc - 1
fpr, tpr, t = roc_curve(actual, predict)

x_values = [0] + list(cumulative_index)
y_values = [0] + list(cumulative_actual)
y_values_perfect = [0] + list(cumulative_actual_perfect)

# Three panels: ROC curve, model gains curve, perfect gains curve.
fig, ax = plt.subplots(nrows=1, ncols=3, sharey=True, figsize=(18, 6))  # width assumed
fig.suptitle('Gini = {:0.3f}\n\n'.format(gini), fontsize=26, fontweight='bold')  # fontsize assumed

# Line widths (lw) and text coordinates below are assumed.
ax[0].plot([0] + fpr.tolist(), [0] + tpr.tolist(), lw=2, color='red')
ax[0].fill_between([0] + fpr.tolist(), [0] + tpr.tolist(), color='red', alpha=0.1)
ax[0].text(0.4, 0.2, 'S = {:0.3f}'.format(aucroc), fontsize=28)

# S_pred and S_actual (areas between each curve and the diagonal)
# are not computed in this fragment.
ax[1].plot(x_values, y_values, lw=2, color='blue')
ax[1].fill_between(x_values, x_values, y_values, color='blue', alpha=0.1)
ax[1].text(0.4, 0.2, 'S = {:0.3f}'.format(S_pred), fontsize=28)

ax[2].plot(x_values, y_values_perfect, lw=2, color='green')
ax[2].fill_between(x_values, x_values, y_values_perfect, color='green', alpha=0.1)
ax[2].text(0.4, 0.2, 'S = {:0.3f}'.format(S_actual), fontsize=28)

ax[0].set(title='ROC-AUC XGBoost Baseline', xlabel='False Positive Rate',
          ylabel='True Positive Rate', xlim=(0, 1), ylim=(0, 1))
ax[1].set(title='Gini XGBoost Baseline')
ax[2].set(title='Gini Perfect')

# Diagonal reference line and axis labels for the two gains panels.
for i in range(1, 3):
    ax[i].plot([0, 1], [0, 1], linestyle='-', lw=2, color='black')
    ax[i].set(xlabel='Share of clients', ylabel='True Positive Rate',
              xlim=(0, 1), ylim=(0, 1))
plt.show()
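The plot above labels two panels with `S_pred` and `S_actual`, which this fragment does not compute. A minimal sketch of how such areas could be obtained, assuming they denote the area between each cumulative curve and the diagonal. The helper names `gains_curve` and `area_above_diagonal` are illustrative, and the data is the small toy sample used in the bubble sort section rather than the competition data:

```python
def gains_curve(actual, scores):
    """Cumulative share of positives found when walking down the ranking."""
    order = sorted(range(len(actual)), key=lambda i: scores[i], reverse=True)
    total_pos = sum(actual)
    xs, ys, found = [0.0], [0.0], 0
    for rank, i in enumerate(order, start=1):
        found += actual[i]
        xs.append(rank / len(actual))
        ys.append(found / total_pos)
    return xs, ys


def area_above_diagonal(xs, ys):
    """Trapezoidal area under the curve minus the 0.5 under the diagonal."""
    area = 0.0
    for k in range(1, len(xs)):
        area += (ys[k] + ys[k - 1]) / 2 * (xs[k] - xs[k - 1])
    return area - 0.5


actual = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
predict = [0.9, 0.3, 0.8, 0.75, 0.65, 0.6, 0.78, 0.7,
           0.05, 0.4, 0.4, 0.05, 0.5, 0.1, 0.1]

S_pred = area_above_diagonal(*gains_curve(actual, predict))
# Ranking by the true labels themselves gives the perfect curve.
S_actual = area_above_diagonal(*gains_curve(actual, actual))
gini_normalized = S_pred / S_actual

print("S_pred = {:0.3f}, S_actual = {:0.3f}, Gini = {:0.3f}".format(
    S_pred, S_actual, gini_normalized))
```

When no positive shares a score with a negative, the ratio `S_pred / S_actual` agrees with `2 * AUC - 1`, which is how `gini` is computed in the plotting code.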

 
 
 
 
 
Gini coefficient of the winning model: ???.  
For me, it is still a mystery what the organizers wanted to achieve by anonymizing the features and preprocessing the data so heavily. This is one of the reasons why all the models, including the winning ones, turned out to be essentially garbage. It was probably just PR: previously no one in the world except Brazilians knew about Porto Seguro, and now many people do.
 
 

Target marketing


 
 
This is the area where the true meaning of the Gini coefficient and the Lift Curve is easiest to understand. For some reason, almost all books and articles give examples based on mail marketing campaigns, which in my opinion is an anachronism. Let us instead create an artificial business problem from the world of free2play games. We have a database of users who once played our game and for some reason dropped out. We want to bring them back, and for each user we have a feature space (time in the project, how much they spent, what level they reached, etc.) on which we build a model. We evaluate the model by the Gini coefficient and build the Lift Curve:
 
 
 
 
Suppose that within the marketing campaign we contact the user in one way or another (email, social network), and that contacting one user costs 2 rubles. We know that the Lifetime Value of a returning user is 5 rubles. We need to optimize the effectiveness of the marketing campaign. Suppose there are 100 users in the sample, of whom 30 will return. If we contact 100% of the users, we spend 200 rubles on the campaign and receive an income of 150 rubles (30 × 5), a loss of 50 rubles. The campaign fails. Now consider the Lift Curve plot. It shows that by contacting 50% of the users we reach 90% of the users who will return: the campaign costs 100 rubles (50 × 2) and brings in 135 rubles (27 × 5), a profit of 35 rubles. We are in positive territory. Thus, the Lift Curve allows us to optimize our marketing campaign in the best way.
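The campaign arithmetic can be reproduced with a short Python sketch. The function name `campaign_profit` and its parameter names are illustrative and not from the article. The two inputs are the only Lift Curve points given in the text: contacting 100% of users reaches 100% of returners, and contacting 50% reaches 90%.

```python
def campaign_profit(share_contacted, share_returners_reached,
                    n_users=100, n_returners=30,
                    contact_cost=2.0, lifetime_value=5.0):
    """Return (cost, income, profit) in rubles for one point of the Lift Curve."""
    cost = share_contacted * n_users * contact_cost
    income = share_returners_reached * n_returners * lifetime_value
    return cost, income, income - cost


# Two points taken from the text: contact everyone, or only the top-ranked half.
full = campaign_profit(1.0, 1.0)
half = campaign_profit(0.5, 0.9)

for name, (cost, income, profit) in [("100% of users", full),
                                     ("top 50% of users", half)]:
    print(f"{name}: cost {cost:.0f}, income {income:.0f}, profit {profit:.0f}")
```

With more points read off a real Lift Curve, the same function could be evaluated at each of them to find the share of users that maximizes profit.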
 
 

4. Bubble sort


 
 
The Gini coefficient has a rather amusing but very useful interpretation, which also gives an easy way to calculate it. It turns out that it is numerically equal to:

$$Gini = \frac{Swaps_{random} - Swaps_{sorted}}{Swaps_{random}}$$

where:

$Swaps_{sorted}$ is the number of permutations (swaps of adjacent elements) that need to be made in the ranked list in order to get the true list of the target variable,

$Swaps_{random}$ is the number of permutations for the predictions of a random algorithm. Let us write an elementary bubble sort and demonstrate this:
 
 

 
 
Python code:
actual = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
predict = [0.9, 0.3, 0.8, 0.75, 0.65, 0.6, 0.78, 0.7, 0.05, 0.4, 0.4, 0.05, 0.5, 0.1, 0.1]

# Rank the true labels by predicted score, lowest first.
data = zip(actual, predict)
sorted_data = sorted(data, key=lambda d: d[1], reverse=False)
sorted_actual = [d[0] for d in sorted_data]

# Bubble sort the ranked labels, counting every swap of adjacent elements.
swaps = 0
n = len(sorted_actual)
array = sorted_actual
for i in range(1, n):
    flag = 0
    for j in range(n - i):
        if array[j] > array[j + 1]:
            array[j], array[j + 1] = array[j + 1], array[j]
            flag = 1
            swaps += 1
    if flag == 0:  # no swaps in this pass: the list is already sorted
        break
print("Number of permutations:", swaps)

 
 
 
Number of permutations: 10
 
 
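Combinatorially, it is not difficult to calculate the number of permutations for a random algorithm. The sample contains 6 ones and 9 zeros, so there are 6 · 9 pairs (one, zero), and a random ranking puts on average half of them in the wrong order:

$$Swaps_{random} = \frac{6 \cdot 9}{2} = 27$$

Thus, with the 10 permutations counted for our ranking:

$$Gini = \frac{27 - 10}{27} \approx 0.63$$

We see that we obtained the same value of the coefficient as in the toy example considered above.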
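A minimal sketch that turns the swap counts into a Gini value and cross-checks it against 2 · AUC - 1. It assumes the coefficient equals (swaps for a random ranking - swaps for the model) / swaps for a random ranking. The function names are illustrative. Instead of running a full bubble sort, the swap count uses the fact that, for 0/1 labels, it equals the number of (1 before 0) pairs.

```python
def count_adjacent_swaps(labels):
    """Adjacent swaps a bubble sort needs to sort 0/1 labels in ascending order."""
    swaps, ones_seen = 0, 0
    for label in labels:
        if label == 1:
            ones_seen += 1
        else:
            swaps += ones_seen  # this 0 has to pass every 1 that precedes it
    return swaps


def gini_from_swaps(actual, scores):
    # Rank the true labels by predicted score, lowest first.
    ranked = [a for a, _ in sorted(zip(actual, scores), key=lambda d: d[1])]
    swaps_sorted = count_adjacent_swaps(ranked)
    n_pos = sum(actual)
    n_neg = len(actual) - n_pos
    swaps_random = n_pos * n_neg / 2  # expected swaps for a random ranking
    return (swaps_random - swaps_sorted) / swaps_random


def gini_from_auc(actual, scores):
    """2 * AUC - 1, with AUC computed by brute-force pair counting."""
    pos = [s for a, s in zip(actual, scores) if a == 1]
    neg = [s for a, s in zip(actual, scores) if a == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return 2 * wins / (len(pos) * len(neg)) - 1


actual = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
predict = [0.9, 0.3, 0.8, 0.75, 0.65, 0.6, 0.78, 0.7,
           0.05, 0.4, 0.4, 0.05, 0.5, 0.1, 0.1]

print(round(gini_from_swaps(actual, predict), 3))
print(round(gini_from_auc(actual, predict), 3))
```

Both lines print 0.63 for this sample. The two approaches agree here because no positive shares a score with a negative. With such ties the swap count would depend on how the tied elements happen to be ordered.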
 
 
I hope the article was useful and dispelled some myths regarding this quality metric.

Author: weber
Publication date: 6-03-2018, 14:18
Category: Machine learning / Mathematics / Python / Data Mining