# The Gini coefficient: from economics to machine learning

An interesting fact: in 1912 the Italian statistician and demographer Corrado Gini wrote his famous work "Variabilità e mutabilità" ("Variability and Mutability"), and in the same year the Titanic sank in the waters of the Atlantic. What could these two events possibly have in common? It is simple: the consequences of both have found wide application in machine learning. The Titanic dataset needs no introduction, so we will talk in more detail about one remarkable statistic first published in the work of the Italian scientist. Note right away that this article has nothing to do with Gini impurity, which is used in decision trees as a split quality criterion in classification problems. The two coefficients are not related in any way, and they have about as much in common as a tractor in the Bryansk region and a lawnmower in Oklahoma.

We introduce the following notation:

- $n$ - the number of objects in the sample
- $n_0$ - the number of objects of class "0"
- $n_1$ - the number of objects of class "1"
- $TP$ - True Positive (a correct answer of the model on the true class "1" at the given threshold)
- $FP$ - False Positive (a wrong answer of the model on the true class "0" at the given threshold)
- $TPR$ - True Positive Rate (the ratio of $TP$ to $n_1$)
- $FPR$ - False Positive Rate (the ratio of $FP$ to $n_0$)
- $i, j$ - the current index of an element.
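To make the notation concrete, here is a minimal sketch that counts these quantities at a single threshold. It uses the 15-object toy labels and predictions from the article's example; the helper name `rates_at_threshold` and the threshold value 0.5 are assumptions made purely for illustration.

```python
import numpy as np

# Toy labels and predicted probabilities from the article's 15-object example.
actual = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
predict = np.array([0.9, 0.3, 0.8, 0.75, 0.65, 0.6, 0.78, 0.7,
                    0.05, 0.4, 0.4, 0.05, 0.5, 0.1, 0.1])


def rates_at_threshold(actual, predict, threshold):
    """Return (TP, FP, TPR, FPR) when scores >= threshold are labelled "1"."""
    predicted_positive = predict >= threshold
    n1 = int(actual.sum())          # number of class "1" objects
    n0 = int(len(actual) - n1)      # number of class "0" objects
    tp = int(np.sum(predicted_positive & (actual == 1)))
    fp = int(np.sum(predicted_positive & (actual == 0)))
    return tp, fp, tp / n1, fp / n0


tp, fp, tpr, fpr = rates_at_threshold(actual, predict, threshold=0.5)
print(tp, fp, round(tpr, 3), round(fpr, 3))
```

Sweeping the threshold over all predicted values and collecting the (FPR, TPR) pairs gives the points of the ROC curve.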

The parametric equation for the ROC curve can be written in the following form:

$$AUC_{ROC} = \int_0^1 TPR \, dFPR = \int_0^1 \frac{TP}{n_1} \, d\frac{FP}{n_0} = \frac{1}{n_1 n_0} \int_0^{n_0} TP \, dFP \hspace{15pt} (3)$$

When plotting the Lift Curve, along the $X$ axis we lay off the share of objects (their number), sorted in descending order of the prediction. Thus, the parametric equation for the Gini coefficient is:

$$Gini = \int_0^1 TPR \, d\frac{TP + FP}{n} - 0.5 \hspace{15pt} (4)$$

Substituting expression (4) into expression (1) for both models and transforming it, we see that expression (3) can be substituted into one of the parts, which finally gives us the beautiful formula (2) for the normalized Gini.

In the proof I relied on elementary postulates of probability theory. It is known that the numerical value of AUC ROC equals the Wilcoxon-Mann-Whitney statistic:

$$AUC_{ROC} = \frac{\sum_{i=1}^{n_1} \sum_{j=1}^{n_0} S(x_i, x_j)}{n_1 n_0}, \qquad S(x_i, x_j) = \begin{cases} 1, & x_i > x_j \\ \frac{1}{2}, & x_i = x_j \\ 0, & x_i < x_j \end{cases} \hspace{15pt} (5)$$

where $x_i$ is the response of the algorithm on the $i$-th object from the distribution "1", and $x_j$ is the response of the algorithm on the $j$-th object from the distribution "0".

A proof of this formula can be found in the literature on the Wilcoxon-Mann-Whitney statistic.

The interpretation is very intuitive: if we randomly draw a pair of objects, the first from the distribution "1" and the second from the distribution "0", then the probability that the first object has a larger predicted value than the second (with ties counted as one half) equals the AUC ROC value. Combinatorially, it is easy to calculate that the number of such pairs is $n_1 n_0$.
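The pairwise reading of AUC ROC can be checked directly by brute force. The sketch below enumerates all pairs of one class "1" object and one class "0" object on the article's toy data; the function name `auc_by_pairs` is an assumption for illustration, and this quadratic loop is only meant for small samples.

```python
from itertools import product

# Toy labels and predicted probabilities from the article's 15-object example.
actual = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
predict = [0.9, 0.3, 0.8, 0.75, 0.65, 0.6, 0.78, 0.7,
           0.05, 0.4, 0.4, 0.05, 0.5, 0.1, 0.1]


def auc_by_pairs(actual, predict):
    """Wilcoxon-Mann-Whitney estimate of AUC ROC by enumerating all pairs."""
    pos = [p for a, p in zip(actual, predict) if a == 1]
    neg = [p for a, p in zip(actual, predict) if a == 0]
    score = 0.0
    for x_i, x_j in product(pos, neg):
        if x_i > x_j:
            score += 1.0      # correctly ordered pair
        elif x_i == x_j:
            score += 0.5      # tie counts as one half
    return score / (len(pos) * len(neg))


auc = auc_by_pairs(actual, predict)
print("pairs:", 6 * 9, "AUC:", round(auc, 4), "2*AUC - 1:", round(2 * auc - 1, 4))
```

On this data 44 of the 54 pairs are ordered correctly and there are no ties between the classes.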

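Let the model predict $k$ possible values from the set $S = \{s_1, \dots, s_k\}$, where $s_1 > \dots > s_k$ (the values are ranked in descending order, as for the Lift Curve) and $S$ is some probability distribution whose elements take values on the interval $[0, 1]$.

Let $S_{n_1}$ be the set of values that the objects of class "1" take, with $S_{n_1} \subseteq S$. Let $S_{n_0}$ be the set of values that the objects of class "0" take, with $S_{n_0} \subseteq S$. Obviously, the sets $S_{n_1}$ and $S_{n_0}$ can intersect.

Denote by $p_{n_0}^{i}$ the probability that an object of class "0" takes the value $s_i$, and by $p_{n_1}^{i}$ the probability that an object of class "1" takes the value $s_i$. Then $p_{n_0}^{i} = P(S_{n_0} = s_i)$ and $p_{n_1}^{i} = P(S_{n_1} = s_i)$.

Having the a priori class probabilities $\frac{n_1}{n}$ and $\frac{n_0}{n}$ for each sample object, we can write a formula for the probability that an object takes the value $s_i$:

$$p_{n}^{i} = \frac{n_1}{n} p_{n_1}^{i} + \frac{n_0}{n} p_{n_0}^{i}$$

We define three distribution functions (with $CDF^{0} = 0$ for each of them):

- $CDF_{n_1}^{i} = P(S_{n_1} \geq s_i) = \sum_{j=1}^{i} p_{n_1}^{j}$ - for objects of class "1"
- $CDF_{n_0}^{i} = P(S_{n_0} \geq s_i) = \sum_{j=1}^{i} p_{n_0}^{j}$ - for objects of class "0"
- $CDF_{n}^{i} = \sum_{j=1}^{i} p_{n}^{j} = \frac{n_1}{n} CDF_{n_1}^{i} + \frac{n_0}{n} CDF_{n_0}^{i}$ - for all sample objects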

Here is an example of how the distribution functions of the two classes may look in a credit scoring problem:

The figure also shows the Kolmogorov-Smirnov statistic, which is used to evaluate models as well.
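As a small illustration of such distribution functions, the sketch below builds empirical CDFs for the two classes and for the whole sample on the article's 15-object toy data, and takes the Kolmogorov-Smirnov statistic as the largest gap between the class-wise CDFs. The CDFs here are taken in the usual ascending sense, $P(S \leq s)$ (the size of the largest gap does not depend on the direction); the names `empirical_cdf` and `grid` are assumptions for illustration.

```python
import numpy as np

# Toy labels and predicted probabilities from the article's 15-object example.
actual = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
predict = np.array([0.9, 0.3, 0.8, 0.75, 0.65, 0.6, 0.78, 0.7,
                    0.05, 0.4, 0.4, 0.05, 0.5, 0.1, 0.1])


def empirical_cdf(scores, grid):
    """Share of `scores` that are <= each value of `grid`."""
    scores = np.sort(scores)
    return np.searchsorted(scores, grid, side="right") / len(scores)


grid = np.unique(predict)                           # distinct predicted values, ascending
cdf_1 = empirical_cdf(predict[actual == 1], grid)   # class "1"
cdf_0 = empirical_cdf(predict[actual == 0], grid)   # class "0"
cdf_all = empirical_cdf(predict, grid)              # all sample objects

# Kolmogorov-Smirnov statistic: maximum distance between the class-wise CDFs.
ks = float(np.max(np.abs(cdf_1 - cdf_0)))
print("KS statistic:", round(ks, 4))
```

The CDF of the whole sample is the mixture of the class-wise CDFs weighted by the class shares, which is the fact the proof relies on.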

We write the Wilcoxon formula in probabilistic form and transform it:

$$\begin{aligned} AUC_{ROC} &= P(S_{n_1} > S_{n_0}) + \frac{1}{2} P(S_{n_1} = S_{n_0}) \\ &= \sum_{i=1}^{k} P(S_{n_1} \geq s_{i-1}) P(S_{n_0} = s_{i}) + \frac{1}{2} \sum_{i=1}^{k} P(S_{n_1} = s_{i}) P(S_{n_0} = s_{i}) \\ &= \sum_{i=1}^{k} \Big( P(S_{n_1} \geq s_{i-1}) + \frac{1}{2} P(S_{n_1} = s_{i}) \Big) P(S_{n_0} = s_{i}) \\ &= \sum_{i=1}^{k} \frac{1}{2} \Big( P(S_{n_1} \geq s_{i}) + P(S_{n_1} \geq s_{i-1}) \Big) P(S_{n_0} = s_{i}) \\ &= \sum_{i=1}^{k} \frac{1}{2} \big( CDF_{n_1}^{i} + CDF_{n_1}^{i-1} \big) \big( CDF_{n_0}^{i} - CDF_{n_0}^{i-1} \big) \hspace{15pt} (6) \end{aligned}$$

Here the values are ordered so that $s_{i-1} > s_i$, so $P(S_{n_1} \geq s_{i-1})$ is the probability of a value strictly larger than $s_i$; for $i = 1$ it is zero.

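An analogous formula can be written for the area under the Lift Curve (remember that it consists of the sum of two areas, one of which is always 0.5):

$$AUC_{Lift} = Gini_{model} + 0.5 = \sum_{i=1}^{k} \frac{1}{2} \big( CDF_{n_1}^{i} + CDF_{n_1}^{i-1} \big) \big( CDF_{n}^{i} - CDF_{n}^{i-1} \big) \hspace{15pt} (7)$$

And now we transform it, using $CDF_{n}^{i} = \frac{n_1}{n} CDF_{n_1}^{i} + \frac{n_0}{n} CDF_{n_0}^{i}$ for the distribution function of all sample objects:

$$\begin{aligned} Gini_{model} + 0.5 &= \frac{n_1}{n} \sum_{i=1}^{k} \frac{1}{2} \big( CDF_{n_1}^{i} + CDF_{n_1}^{i-1} \big) \big( CDF_{n_1}^{i} - CDF_{n_1}^{i-1} \big) + \frac{n_0}{n} \sum_{i=1}^{k} \frac{1}{2} \big( CDF_{n_1}^{i} + CDF_{n_1}^{i-1} \big) \big( CDF_{n_0}^{i} - CDF_{n_0}^{i-1} \big) \\ &= \frac{n_1}{2n} \sum_{i=1}^{k} \Big( \big( CDF_{n_1}^{i} \big)^2 - \big( CDF_{n_1}^{i-1} \big)^2 \Big) + \frac{n_0}{n} AUC_{ROC} \\ &= \frac{n_1}{2n} + \frac{n_0}{n} AUC_{ROC} \hspace{15pt} (8) \end{aligned}$$

For an ideal model the formula is simple, since $AUC_{ROC} = 1$:

$$Gini_{perfect} + 0.5 = \frac{n_1}{2n} + \frac{n_0}{n} \hspace{15pt} (9)$$

Therefore, from (8) and (9), we obtain:

$$Gini_{normalized} = \frac{Gini_{model}}{Gini_{perfect}} = \frac{\frac{n_1}{2n} + \frac{n_0}{n} AUC_{ROC} - \frac{1}{2}}{\frac{n_1}{2n} + \frac{n_0}{n} - \frac{1}{2}} = \frac{\frac{n_0}{n} \big( AUC_{ROC} - \frac{1}{2} \big)}{\frac{n_0}{2n}} = 2 \cdot AUC_{ROC} - 1$$

Which was to be proved, as they used to say at school.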
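The result of the proof can be checked numerically on the article's 15-object toy data. The sketch below is only a sanity check under stated assumptions: the helper names `lift_area` and `auc_roc` are chosen for illustration, the Lift Curve area is computed with the trapezoid rule over the ranked objects, and the data contain no ties between the two classes.

```python
import numpy as np

# Toy labels and predicted probabilities from the article's 15-object example.
actual = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
predict = np.array([0.9, 0.3, 0.8, 0.75, 0.65, 0.6, 0.78, 0.7,
                    0.05, 0.4, 0.4, 0.05, 0.5, 0.1, 0.1])

n = len(actual)
n1 = int(actual.sum())
n0 = n - n1


def lift_area(actual, scores):
    """Area under the Lift Curve: cumulative share of class "1" vs share of objects."""
    order = np.argsort(-scores, kind="stable")      # rank by descending score
    y = np.concatenate(([0.0], np.cumsum(actual[order]) / actual.sum()))
    return float(np.sum((y[1:] + y[:-1]) / 2) / len(actual))  # trapezoids of width 1/n


def auc_roc(actual, scores):
    """AUC ROC as the Wilcoxon-Mann-Whitney statistic (ties count as 1/2)."""
    pos, neg = scores[actual == 1], scores[actual == 0]
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())


gini_model = lift_area(actual, predict) - 0.5
gini_perfect = lift_area(actual, actual.astype(float)) - 0.5   # ideal model ranks all "1" first
auc = auc_roc(actual, predict)

print("Gini model:     ", round(gini_model, 4))
print("Gini perfect:   ", round(gini_perfect, 4))
print("Gini normalized:", round(gini_model / gini_perfect, 4))
print("2 * AUC - 1:    ", round(2 * auc - 1, 4))
```

The last two printed values coincide, and the unnormalized coefficients match $\frac{n_0}{n}\big(AUC_{ROC} - \frac{1}{2}\big)$ and $\frac{n_0}{2n}$ respectively.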

As mentioned at the beginning of the article, the Gini coefficient is used to evaluate models in many areas, including bank lending, insurance and targeted marketing, and there is a perfectly reasonable explanation for this. This article does not aim to cover the practical application of the statistic in any particular field in detail. Many books have been written on the subject; we will only go over it briefly.

Banks around the world receive thousands of loan applications every day. Naturally, the risk that a client will simply not repay the loan has to be assessed somehow, so predictive models are developed that estimate, from the feature space, the probability that the client will not repay. These models first have to be evaluated, and if a model is good, the optimal probability threshold has to be chosen. The choice of the optimal threshold is determined by the policy of the bank. The task of the analysis when selecting the threshold is to minimize the risk of lost profit associated with refusing to issue a loan. But to choose a threshold, one must first have a good model. The main quality metrics in the banking sector are:

- the Gini coefficient
- the Kolmogorov-Smirnov statistic (calculated as the maximum difference between the cumulative distribution functions of "bad" and "good" borrowers; the figure with the distributions and this statistic was given above)
- the divergence coefficient (an estimate of the difference between the mathematical expectations of the scoring-score distributions for "bad" and "good" borrowers, normalized by the variance of these distributions; the larger the divergence coefficient, the better the model)

I do not know how things stand in Russia, although I live here, but in Europe the Gini coefficient is the most widely used, and in North America it is the Kolmogorov-Smirnov statistic.

In insurance everything is similar to the banking sector, with the only difference that we need to divide customers into those who will file an insurance claim and those who will not. Let's consider a practical example from this area in which one feature of the Lift Curve is clearly visible: for strongly imbalanced classes of the target variable, the curve almost perfectly coincides with the ROC curve.

A few months ago Kaggle hosted the "Porto Seguro's Safe Driver Prediction" competition, in which the task was precisely to predict an "Insurance Claim", that is, the filing of an insurance claim. It is also the competition in which I missed a silver medal through my own stupidity by choosing the wrong submission.

It was a very strange and at the same time incredibly instructive competition, with a record number of participants: 5169. The winner of the competition, Michael Jahrer, wrote his code only in C++/CUDA, which inspires admiration and respect.

Porto Seguro is a Brazilian company specializing in car insurance.

The dataset consisted of 595207 rows in the train set, 892816 rows in the test set and 53 anonymized features. The class ratio in the target is 3% to 97%. We'll write a simple baseline, fortunately this takes only a couple of lines, and plot the curves. Note that the curves almost perfectly coincide; the difference between the areas under the Lift Curve and the ROC curve is ???.
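The near-coincidence of the two curves under strong imbalance can be illustrated without the competition data. The sketch below is not the baseline from the competition: it uses a synthetic sample with roughly 3% of class "1" and hypothetical Gaussian scores (both are assumptions made only to mirror the class ratio), and compares the areas under the ROC curve and the Lift Curve.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for an imbalanced target: roughly 3% of class "1".
n = 20000
actual = (rng.random(n) < 0.03).astype(int)
# Hypothetical model scores: class "1" objects are shifted upwards on average.
scores = rng.normal(size=n) + 1.0 * actual

# Rank the objects by descending score.
order = np.argsort(-scores, kind="stable")
ranked = actual[order]
n1 = int(ranked.sum())
n0 = n - n1

# Cumulative shares along the ranking, each starting from 0.
tpr = np.concatenate(([0.0], np.cumsum(ranked) / n1))       # Y axis of both curves
fpr = np.concatenate(([0.0], np.cumsum(1 - ranked) / n0))   # X axis of the ROC curve
share = np.arange(n + 1) / n                                # X axis of the Lift Curve


def trapezoid_area(x, y):
    """Area under a piecewise-linear curve given by the points (x, y)."""
    return float(np.sum((x[1:] - x[:-1]) * (y[1:] + y[:-1]) / 2))


roc_area = trapezoid_area(fpr, tpr)
lift_area = trapezoid_area(share, tpr)
print("ROC area:  ", round(roc_area, 4))
print("Lift area: ", round(lift_area, 4))
print("Difference:", round(roc_area - lift_area, 4))
```

When there are no tied scores, the gap between the two areas equals the share of class "1" times (AUC ROC - 0.5), so the rarer the positive class, the closer the two curves are.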

The Gini coefficient of the winning model is ???.

For me it is still a mystery what the organizers wanted to achieve by anonymizing the features and preprocessing the data so heavily. This is one of the reasons why all the models, including the winning ones, actually turned out to be garbage. It was probably just PR: no one in the world except Brazilians knew about Porto Seguro, and now many people do.

Targeted marketing is the area where the true meaning of the Gini coefficient and the Lift Curve is easiest to understand. For some reason almost all books and articles give examples with mail marketing campaigns, which in my opinion is an anachronism. Let's create an artificial business problem from the world of free2play games. We have a database of users who once played our game and for some reason dropped out. We want to bring them back to our game project; for each user we have a certain feature space (time in the project, how much the user spent, what level the user reached, etc.) on the basis of which we build a model. We evaluate the model with the Gini coefficient and build the Lift Curve:

Suppose that within the marketing campaign we contact the user in one way or another (email, social network), and the price of contacting one user is 2 rubles. We know that the Lifetime Value is 5 rubles. We need to optimize the effectiveness of the marketing campaign. Suppose there are 100 users in the sample, of which 30 will return. Thus, if we contact 100% of the users, we will spend 200 rubles on the campaign and receive an income of 150 rubles. The campaign fails. Now look at the Lift Curve plot. It shows that by contacting 50% of the users we reach 90% of the users who will return. The cost of the campaign is 100 rubles and the income is 135 rubles: we are in the black. Thus, the Lift Curve allows us to optimize our marketing campaign in the best way.
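The arithmetic of the campaign can be written down as a tiny function. The numbers are the ones from the example above; the function name `campaign_profit` and its two arguments (the share of users contacted and the share of returning users reached, as read off the Lift Curve) are assumptions for illustration.

```python
CONTACT_COST = 2      # rubles per contacted user
LIFETIME_VALUE = 5    # rubles earned from each returning user
N_USERS = 100         # users in the sample
N_RETURNING = 30      # users who will return if contacted


def campaign_profit(share_contacted, share_of_returners_reached):
    """Income minus cost for a point (x, y) taken from the Lift Curve."""
    cost = CONTACT_COST * N_USERS * share_contacted
    income = LIFETIME_VALUE * N_RETURNING * share_of_returners_reached
    return income - cost


full = campaign_profit(1.0, 1.0)     # contact everyone
half = campaign_profit(0.5, 0.9)     # contact the top 50% by model score
print("contact 100%:", full)
print("contact 50%: ", half)
```

Evaluating this function along the whole Lift Curve, rather than at two points, would show which share of users maximizes the profit.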

The Gini coefficient has a rather amusing but very useful interpretation, which also gives us an easy way to calculate it. It turns out that it is numerically equal to:

$$Gini_{normalized} = \frac{Swaps_{random} - Swaps_{sorted}}{Swaps_{random}}$$

where $Swaps_{sorted}$ is the number of permutations that need to be made in the ranked list in order to get the true list of the target variable, and $Swaps_{random}$ is the number of permutations for the predictions of a random algorithm. Let's write an elementary bubble sort and show this:

Number of permutations: 10

Combinatorially, it is not difficult to calculate the number of permutations for a random algorithm:

$$Swaps_{random} = \frac{n_1 n_0}{2} = \frac{6 \cdot 9}{2} = 27$$

Thus:

$$Gini_{normalized} = \frac{27 - 10}{27} \approx 0.63$$

We see that we obtained the same value of the coefficient as in the toy example considered above.
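The permutation counting can be sketched with an elementary bubble sort. This is an illustrative reconstruction rather than the article's own listing: the names `bubble_sort_swaps`, `swaps_sorted` and `swaps_random` are assumptions, and the data are the 15-object toy labels and predictions.

```python
# Toy labels and predicted probabilities from the article's 15-object example.
actual = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
predict = [0.9, 0.3, 0.8, 0.75, 0.65, 0.6, 0.78, 0.7,
           0.05, 0.4, 0.4, 0.05, 0.5, 0.1, 0.1]


def bubble_sort_swaps(values):
    """Count the adjacent swaps bubble sort makes to put `values` in descending order."""
    values = list(values)
    swaps = 0
    for end in range(len(values) - 1, 0, -1):
        for k in range(end):
            if values[k] < values[k + 1]:
                values[k], values[k + 1] = values[k + 1], values[k]
                swaps += 1
    return swaps


# True labels ranked by descending prediction.
ranked = [a for a, _ in sorted(zip(actual, predict), key=lambda d: d[1], reverse=True)]

swaps_sorted = bubble_sort_swaps(ranked)      # swaps needed to reach the ideal ordering
n1 = sum(actual)
n0 = len(actual) - n1
swaps_random = n1 * n0 / 2                    # expected swaps for a random ranking
gini_normalized = (swaps_random - swaps_sorted) / swaps_random

print("Number of permutations:", swaps_sorted)
print("Normalized Gini:", round(gini_normalized, 4))
```

Each swap fixes exactly one pair in which a class "0" object is ranked above a class "1" object, which is why the count ties in with the pairwise interpretation of AUC ROC.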

I hope the article was useful and dispelled some myths regarding this quality metric.

The Gini coefficient is a quality metric that is often used to evaluate predictive models in binary classification problems with strongly imbalanced classes of the target variable. It is widely used in bank lending, insurance and targeted marketing. To fully understand this metric, we first need to turn to economics and figure out what it is used for there.

## Economics

In economics there are several ways to calculate this coefficient; we will focus on Brown's formula (first we need to build a variational series, that is, rank the population by income):

$$G = 1 - \sum_{k=1}^{n} (X_k - X_{k-1})(Y_k + Y_{k-1})$$

where

- $n$ - the number of inhabitants,
- $X_k$ - the cumulative share of the population,
- $Y_k$ - the cumulative share of income for $X_k$,

and $X_0 = Y_0 = 0$.

Let's work through the above on a toy example to get an intuitive feel for the meaning of this statistic.

Suppose there are three villages with 10 inhabitants each. In each village the total annual income of the population is 100 rubles. In the first village all residents earn the same income, 10 rubles a year. In the second village the income distribution is different: 3 people earn 5 rubles each, 4 people earn 10 rubles each and 3 people earn 15 rubles each. And in the third village 7 people receive 1 ruble a year, 1 person receives 10 rubles, 1 person 33 rubles and 1 person 50 rubles. For each village we calculate the Gini coefficient and plot the Lorenz curve.

Let's present the initial data for the villages as a table and immediately calculate $X_k$ and $Y_k$ for clarity:

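```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# %matplotlib inline  # uncomment when running in a Jupyter notebook

import warnings
warnings.filterwarnings('ignore')

# Incomes of the 10 inhabitants of each village (100 rubles in total per village)
village = pd.DataFrame({'Person': ['Person_{}'.format(i) for i in range(1, 11)],
                        'Income_Village_1': [10] * 10,
                        'Income_Village_2': [5, 5, 5, 10, 10, 10, 10, 15, 15, 15],
                        'Income_Village_3': [1, 1, 1, 1, 1, 1, 1, 10, 33, 50]})

# Cumulative share of the population (X_k) and of income (Y_k)
village['Cum_population'] = np.cumsum(np.ones(10) / 10)
village['Cum_Income_Village_1'] = np.cumsum(village['Income_Village_1'] / 100)
village['Cum_Income_Village_2'] = np.cumsum(village['Income_Village_2'] / 100)
village['Cum_Income_Village_3'] = np.cumsum(village['Income_Village_3'] / 100)

# Put each income column next to its cumulative share (selected by name, not by position)
village = village[['Person', 'Cum_population',
                   'Income_Village_1', 'Cum_Income_Village_1',
                   'Income_Village_2', 'Cum_Income_Village_2',
                   'Income_Village_3', 'Cum_Income_Village_3']]
village
```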

```python
plt.figure(figsize=(8, 8))
Gini = []

for i in range(1, 4):
    # Cumulative shares X_k, Y_k and the same series shifted by one step (X_{k-1}, Y_{k-1})
    X_k = village['Cum_population'].values
    X_k_1 = village['Cum_population'].shift().fillna(0).values
    Y_k = village['Cum_Income_Village_{}'.format(i)].values
    Y_k_1 = village['Cum_Income_Village_{}'.format(i)].shift().fillna(0).values
    # Brown's formula
    Gini.append(1 - np.sum((X_k - X_k_1) * (Y_k + Y_k_1)))
    # Lorenz curve, starting from the point (0, 0)
    plt.plot(np.insert(X_k, 0, 0), np.insert(Y_k, 0, 0),
             label='Village {} (Gini = {:0.2f})'.format(i, Gini[i - 1]))

plt.title('Gini Coefficient')
plt.xlabel('Cumulative share of the population')
plt.ylabel('Cumulative share of income')
plt.legend(loc="upper left")
plt.xlim(0, 1)
plt.ylim(0, 1)
plt.show()
```

It can be seen that the Lorenz curve in the first village completely coincides with the diagonal (the "line of absolute equality"), and the greater the income stratification of the population, the larger the area of the figure formed by the Lorenz curve and the diagonal. Let's use the third village to show that the ratio of the area of this figure to the area of the triangle under the line of absolute equality is exactly equal to the Gini coefficient:

```python
# np.trapz was renamed to np.trapezoid in NumPy 2.0; use whichever is available
trapezoid = getattr(np, 'trapezoid', None) or np.trapz

# Lorenz curve of the third village, starting from the point (0, 0)
x3 = np.insert(village['Cum_population'].values, 0, 0)
y3 = np.insert(village['Cum_Income_Village_3'].values, 0, 0)

curve_area = trapezoid(y3, x3)    # area under the Lorenz curve
S = (0.5 - curve_area) / 0.5      # share of the triangle under the diagonal

plt.figure(figsize=(8, 8))
plt.plot([0, 1], [0, 1], linestyle='--', lw=2, color='black')
plt.plot(x3, y3, label='Village {} (Gini = {:0.2f})'.format(3, Gini[2]), lw=2, color='green')
plt.fill_between(x3, x3, y2=y3, alpha=0.5)
plt.text(0.45, 0.27, 'S = {:0.2f}'.format(S), fontsize=28)
plt.title('Gini Coefficient')
plt.xlabel('Cumulative share of the population')
plt.ylabel('Cumulative share of income')
plt.legend(loc="upper left")
plt.xlim(0, 1)
plt.ylim(0, 1)
plt.show()
```

We have shown that, along with algebraic methods, the Gini coefficient can be calculated geometrically: as the share of the area between the Lorenz curve and the line of absolute equality in the total area under the line of absolute equality.

Another important point. Let's mentally fix the ends of the curve at the points $(0, 0)$ and $(1, 1)$ and start changing its shape. Clearly, the shape can be changed so that the area of the figure stays the same; in doing so we move members of society from the "middle class" to the poor or the rich without changing the ratio of incomes between the classes. Take, for example, ten people with the following incomes:

Now let's apply Sharikov's method, "Take everything and divide it up!", to the person with an income of "20" and redistribute that income proportionally among the rest of society. In this case the Gini coefficient will not change and will remain at ???; we have simply dragged the "fixed" Lorenz curve towards the abscissa axis and changed its shape:
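That differently shaped Lorenz curves can enclose the same area is easy to check with a minimal sketch. The incomes below are hypothetical and are not the ones from the example above; the function name `gini_brown` is likewise an assumption, and it simply applies Brown's formula to a list of raw incomes.

```python
import numpy as np


def gini_brown(incomes):
    """Gini coefficient by Brown's formula for a list of individual incomes."""
    incomes = np.sort(np.asarray(incomes, dtype=float))   # variational series
    n = len(incomes)
    x = np.arange(n + 1) / n                               # cumulative share of population
    y = np.concatenate(([0.0], np.cumsum(incomes) / incomes.sum()))  # cumulative share of income
    return float(1 - np.sum((x[1:] - x[:-1]) * (y[1:] + y[:-1])))


# Two hypothetical societies with the same total income but different Lorenz curves.
society_a = [10, 30, 60]
society_b = [5, 40, 55]

gini_a = gini_brown(society_a)
gini_b = gini_brown(society_b)
print(round(gini_a, 4), round(gini_b, 4))
```

The two Lorenz curves cross (society B has a poorer bottom third but a richer middle), yet the coefficient is the same, which is why the Gini coefficient alone does not say how income is distributed between the classes.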

Let's dwell on one more important point: when calculating the Gini coefficient we do not classify people as poor or rich in any way; it does not depend on whom we consider a pauper or an oligarch. But suppose we did face such a task. Then, depending on what we want to obtain and what our goals are, we would need to set an income threshold that clearly separates people into poor and rich. If you see in this an analogy with the threshold from binary classification problems, then it is time for us to move on to machine learning.

## Machine learning

### 1. General understanding

It should be noted right away that, having come to machine learning, the Gini coefficient has changed a lot: it is calculated differently and has a different meaning. Numerically, the coefficient is equal to the area of the figure formed by the line of absolute equality and the Lorenz curve. Some features in common with its relative from economics remain: for example, we still need to build a Lorenz curve and calculate the areas of figures. And most importantly, the algorithm for plotting the curve has not changed. The Lorenz curve has also undergone changes: it is now called the Lift Curve and is a mirror image of the Lorenz curve with respect to the line of absolute equality (because the probabilities are ranked in descending rather than ascending order). We will go through all this on another toy example. To minimize the error in calculating the areas of the figures, we will use the `scipy` functions `interp1d` (interpolation of a one-dimensional function) and `quad` (calculation of a definite integral).

Suppose we solve a binary classification problem for 15 objects and have the following class distribution:

$$actual = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]$$

Our trained algorithm predicts the following probabilities of belonging to class "1" for these objects:

$$predict = [0.9, 0.3, 0.8, 0.75, 0.65, 0.6, 0.78, 0.7, 0.05, 0.4, 0.4, 0.05, 0.5, 0.1, 0.1]$$

We calculate the Gini coefficient for two models: our trained algorithm and the ideal model, which predicts the classes exactly, with a probability of 100%. The idea is this: instead of ranking the population by income level, we rank the predicted probabilities of the model in descending order and substitute into the formula the cumulative share of the true values of the target variable corresponding to the predicted probabilities. In other words, we sort the table by the "Predict" row and calculate the cumulative share of classes instead of the cumulative share of income.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import interp1d
from scipy.integrate import quad

actual = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
predict = [0.9, 0.3, 0.8, 0.75, 0.65, 0.6, 0.78, 0.7, 0.05, 0.4, 0.4, 0.05, 0.5, 0.1, 0.1]

# Rank the true labels by descending predicted probability
data = zip(actual, predict)
sorted_data = sorted(data, key=lambda d: d[1], reverse=True)
sorted_actual = [d[0] for d in sorted_data]

# Cumulative share of true classes (Y) and of objects (X) for the model and the ideal model
cumulative_actual = np.cumsum(sorted_actual) / sum(actual)
cumulative_index = np.arange(1, len(cumulative_actual) + 1) / len(predict)
cumulative_actual_perfect = np.cumsum(sorted(actual, reverse=True)) / sum(actual)

x_values = [0] + list(cumulative_index)
y_values = [0] + list(cumulative_actual)
y_values_perfect = [0] + list(cumulative_actual_perfect)

# Area between each Lift Curve and the line of absolute equality
f1, f2 = interp1d(x_values, y_values), interp1d(x_values, y_values_perfect)
S_pred = quad(f1, 0, 1, points=x_values)[0] - 0.5
S_actual = quad(f2, 0, 1, points=x_values)[0] - 0.5

fig, ax = plt.subplots(nrows=1, ncols=2, sharey=True, figsize=(14, 7))
ax[0].plot(x_values, y_values, lw=2, color='blue', marker='x')
ax[0].fill_between(x_values, x_values, y_values, color='blue', alpha=0.1)
ax[0].text(0.4, 0.2, 'S = {:0.4f}'.format(S_pred), fontsize=28)
ax[1].plot(x_values, y_values_perfect, lw=2, color='green', marker='x')
ax[1].fill_between(x_values, x_values, y_values_perfect, color='green', alpha=0.1)
ax[1].text(0.4, 0.2, 'S = {:0.4f}'.format(S_actual), fontsize=28)

for i in range(2):
    ax[i].plot([0, 1], [0, 1], linestyle='--', lw=2, color='black')
    ax[i].set(title='Gini coefficient', xlabel='Cumulative fraction of objects',
              ylabel='Cumulative fraction of true classes', xlim=(0, 1), ylim=(0, 1))
plt.show()
```

The Gini coefficient of the trained model is about 0.1889. Is that a little or a lot? How accurate is the algorithm? Without knowing the exact value of the coefficient for the ideal algorithm, we cannot say anything about our model. Therefore, the quality metric in machine learning is the **normalized Gini coefficient**, which is equal to the ratio of the coefficient of the trained model to the coefficient of the ideal model:

$$Gini_{normalized} = \frac{Gini_{model}}{Gini_{perfect}} \hspace{15pt} (1)$$

From here on, the term "Gini coefficient" will mean exactly this.

Looking at these two graphs, we can draw the following conclusions:

- The prediction of the ideal algorithm gives the maximum Gini coefficient for the current dataset and depends only on the true distribution of classes in the problem.
- The area of the figure for an ideal algorithm is $Gini_{perfect} = \frac{1}{2} \cdot \frac{n_0}{n}$, where $n_0$ is the number of objects of class "0" and $n$ is the total number of objects.
- The coefficient of a trained model cannot exceed the coefficient of the ideal algorithm.
- With a uniform distribution of classes of the target variable, the Gini coefficient of the ideal algorithm is always 0.25.
- For an ideal algorithm, the figure formed by the Lift Curve and the line of absolute equality is always a triangle.
- The Gini coefficient of a random algorithm is 0, and its Lift Curve coincides with the line of absolute equality.
- The Gini coefficient of a trained algorithm is always less than the coefficient of the ideal algorithm.
- The values of the normalized Gini coefficient for a trained algorithm lie in the range $[0, 1]$ (for a model that is no worse than random).
- The normalized Gini coefficient is a quality metric that needs to be maximized.

### 2. Algebraic representation. Proof of a linear relationship with AUC ROC.

We have come to perhaps the most interesting point: the algebraic representation of the Gini coefficient. How is this metric calculated? It is not equal to its relative from economics. It is known that the coefficient can be calculated by the following formula:

$$Gini_{normalized} = 2 \cdot AUC_{ROC} - 1 \hspace{15pt} (2)$$

I honestly tried to find a derivation of this formula on the Internet, but found nothing, not even in foreign books and scientific articles. But on some dubious statistics websites I came across the phrase: *"It's so obvious that there's nothing to discuss. It is enough to compare the Lift Curve and the ROC Curve for everything to become clear immediately."* A little later, when I derived the formula connecting these two metrics myself, I realized that this phrase is an excellent indicator. If you hear or read it, the only obvious thing is that its author has no understanding of the Gini coefficient. Let's take a look at the Lift Curve and the ROC Curve for our example:

```python
from sklearn.metrics import roc_curve, roc_auc_score

aucroc = roc_auc_score(actual, predict)
gini = 2 * roc_auc_score(actual, predict) - 1
fpr, tpr, t = roc_curve(actual, predict)

fig, ax = plt.subplots(nrows=1, ncols=3, sharey=True, figsize=(15, 5))
fig.suptitle('Gini = 2 * AUCROC - 1 = {:0.2f}\n\n'.format(gini), fontsize=18, fontweight='bold')

# ROC curve of the trained model
ax[0].plot([0] + fpr.tolist(), [0] + tpr.tolist(), lw=2, color='red')
ax[0].fill_between([0] + fpr.tolist(), [0] + tpr.tolist(), color='red', alpha=0.1)
ax[0].text(0.4, 0.2, 'S = {:0.2f}'.format(aucroc), fontsize=28)
# Lift Curve of the trained model
ax[1].plot(x_values, y_values, lw=2, color='blue')
ax[1].fill_between(x_values, x_values, y_values, color='blue', alpha=0.1)
ax[1].text(0.4, 0.2, 'S = {:0.2f}'.format(S_pred), fontsize=28)
# Lift Curve of the ideal model
ax[2].plot(x_values, y_values_perfect, lw=2, color='green')
ax[2].fill_between(x_values, x_values, y_values_perfect, color='green', alpha=0.1)
ax[2].text(0.4, 0.2, 'S = {:0.2f}'.format(S_actual), fontsize=28)

ax[0].set(title='ROC-AUC', xlabel='False Positive Rate',
          ylabel='True Positive Rate', xlim=(0, 1), ylim=(0, 1))
for i in range(1, 3):
    ax[i].plot([0, 1], [0, 1], linestyle='--', lw=2, color='black')
    ax[i].set(title='Gini coefficient', xlabel='Cumulative fraction of objects',
              ylabel='Cumulative fraction of true classes', xlim=(0, 1), ylim=(0, 1))
plt.show()
```

It is perfectly clear that the connection cannot be grasped from the graphical representation of the metrics, so we will prove the equality algebraically. I managed to do this in two ways: parametrically (with integrals) and nonparametrically (via the Wilcoxon-Mann-Whitney statistic). The second method is much simpler and does without multi-storey fractions with double integrals, so we will dwell on it in detail. For the proofs that follow, let's fix the terminology: the cumulative fraction of true classes is nothing more than the True Positive Rate. The cumulative fraction of objects is, in turn, the number of objects in the ranked list (or, when scaled to the interval $[0, 1]$, the fraction of objects).

To understand the proof, you need a basic understanding of the ROC-AUC metric: what it is, how the chart is plotted and in which axes. I recommend the article "AUC ROC (area under the error curve)" from Alexander Dyakonov's blog.

We introduce the following notation:

$n$ - The number of objects in the sample

$n_0$ - The number of objects of class "0"

$n_1$ - The number of objects of class "1"

$TP$ - True Positive (correct answer of the model on the true class "1" at the given threshold)

$FP$ - False Positive (wrong answer of the model on the true class "0" at the given threshold)

$TPR$ - True Positive Rate (the ratio of $TP$ to $n_1$)

$FPR$ - False Positive Rate (the ratio of $FP$ to $n_0$)

$i, j$ - The current index of the element.

#### Parametric method

The parametric equation for the ROC curve can be written in the following form:

$$AUC_{ROC} = \int_{0}^{1} TPR \, dFPR = \int_{0}^{1} \frac{TP}{n_1} \, d\frac{FP}{n_0} \hspace{15pt} (3)$$

When plotting the Lift Curve, along the $x$ axis we lay off the fraction of objects (their number), pre-sorted in descending order of the prediction. Thus, the parametric equation for the Gini coefficient is as follows:

$$Gini_{model} = \int_{0}^{1} TPR \, d\frac{TP + FP}{n} - 0.5 \hspace{15pt} (4)$$

Substituting expression (4) into expression (1) for both models and transforming it, we see that expression (3) can be substituted into one of the parts, which finally gives us the beautiful formula of the normalized Gini (2).
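The intermediate steps of this substitution can be sketched as follows. This is a reconstruction under the notation introduced above ($n = n_0 + n_1$, $TPR = TP / n_1$, $FPR = FP / n_0$), not the author's original working. The fraction of objects above the threshold splits into two parts:

$$\frac{TP + FP}{n} = \frac{n_1}{n} TPR + \frac{n_0}{n} FPR$$

so the area between the Lift Curve and the line of absolute equality becomes

$$\int_{0}^{1} TPR \, d\frac{TP + FP}{n} - \frac{1}{2} = \frac{n_1}{n} \int_{0}^{1} TPR \, dTPR + \frac{n_0}{n} \int_{0}^{1} TPR \, dFPR - \frac{1}{2} = \frac{n_1}{2n} + \frac{n_0}{n} AUC_{ROC} - \frac{1}{2} = \frac{n_0}{2n} \left( 2 \cdot AUC_{ROC} - 1 \right)$$

For the ideal algorithm $AUC_{ROC} = 1$, which leaves $\frac{n_0}{2n}$. Dividing the model's area by the ideal one cancels this factor and leaves $2 \cdot AUC_{ROC} - 1$.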

#### Nonparametric method

In this proof I relied on elementary postulates of probability theory. It is known that the numerical value of AUC ROC is equal to the Wilcoxon-Mann-Whitney statistic:

$$AUC_{ROC} = \frac{\sum_{i=1}^{n_1} \sum_{j=1}^{n_0} S(x_i, x_j)}{n_1 n_0} \hspace{15pt} (5)$$

$$S(x_i, x_j) = \begin{cases} 1, \enspace x_i > x_j \\ \frac{1}{2}, \enspace x_i = x_j \\ 0, \enspace x_i < x_j \end{cases}$$

where

$x_i$ - the algorithm response on the i-th object from the distribution "1",

$x_j$ - the algorithm response on the j-th object from the distribution "0"

A proof of this formula can be found in the literature on ROC analysis.

This has a very intuitive interpretation: if we randomly draw a pair of objects, the first from the distribution "1" and the second from the distribution "0", then the probability that the first object has a larger predicted value than the second (with ties counted as one half) is equal to the AUC ROC value. Combinatorially, it is easy to calculate that the number of such pairs is $n_1 \cdot n_0$.
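The pairwise definition can be computed directly. Below is a minimal sketch in plain Python; the function name `auc_by_pairs` is hypothetical, and the data is the 15-object sample from the bubble sort section of this article. It is quadratic in the sample size, so it is meant only as an illustration.

```python
actual = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
predict = [0.9, 0.3, 0.8, 0.75, 0.65, 0.6, 0.78, 0.7, 0.05, 0.4, 0.4, 0.05, 0.5, 0.1, 0.1]


def auc_by_pairs(labels, scores):
    """AUC ROC as the Wilcoxon-Mann-Whitney statistic over all ("1", "0") pairs."""
    pos = [s for a, s in zip(labels, scores) if a == 1]
    neg = [s for a, s in zip(labels, scores) if a == 0]
    total = 0.0
    for x_i in pos:
        for x_j in neg:
            if x_i > x_j:
                total += 1.0      # correctly ordered pair
            elif x_i == x_j:
                total += 0.5      # tie counts as one half
    return total / (len(pos) * len(neg))


auc = auc_by_pairs(actual, predict)
print(round(auc, 4), round(2 * auc - 1, 4))
```

Of the 6 * 9 = 54 pairs in this sample, 44 are ordered correctly and none are tied, so the statistic equals 44/54.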

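Let the model predict $k$ possible values from the set $S = \{s_1, \dots, s_k\}$, where $s_1 > \dots > s_k$ and $S$ is some probability distribution whose elements take values on the interval $[0, 1]$.

Let $S_{n_1}$ be the set of values taken by the objects of class "1", with $S_{n_1} \subseteq S$. Let $S_{n_0}$ be the set of values taken by the objects of class "0", with $S_{n_0} \subseteq S$. It is obvious that the sets $S_{n_1}$ and $S_{n_0}$ can intersect.

Denote by $p_{n_0}^{i}$ the probability that an object of class "0" takes the value $s_i$, and by $p_{n_1}^{i}$ the probability that an object of class "1" takes the value $s_i$. Then $\sum_{i=1}^{k} p_{n_0}^{i} = 1$ and $\sum_{i=1}^{k} p_{n_1}^{i} = 1$.

Having the a priori probability $\pi = n_0 / n$ that a sample object belongs to class "0", we can write a formula that determines the probability that an object takes the value $s_i$:

$$p_{n}^{i} = \pi p_{n_0}^{i} + (1 - \pi) p_{n_1}^{i}$$

We define three distribution functions (with $CDF^{0} = 0$ for each of them):

$CDF_{n_1}^{i} = \sum_{j=1}^{i} p_{n_1}^{j}$ - for objects of class "1"

$CDF_{n_0}^{i} = \sum_{j=1}^{i} p_{n_0}^{j}$ - for objects of class "0"

$CDF_{n}^{i} = \sum_{j=1}^{i} p_{n}^{j}$ - for all sample objects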

An example of how the distribution functions for the two classes in the credit scoring problem may look:

The figure also shows the Kolmogorov-Smirnov statistic, which is also used to evaluate models.

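We write the Wilcoxon formula in probabilistic form and transform it (terms with index $0$ are equal to zero):

$$AUC_{ROC} = P(S_{n_1} > S_{n_0}) + \frac{1}{2} P(S_{n_1} = S_{n_0}) = \sum_{i=1}^{k} P(S_{n_1} \geq s_{i-1}) P(S_{n_0} = s_{i}) + \frac{1}{2} \sum_{i=1}^{k} P(S_{n_1} = s_{i}) P(S_{n_0} = s_{i}) = \sum_{i=1}^{k} \big( P(S_{n_1} \geq s_{i-1}) + \frac{1}{2} P(S_{n_1} = s_{i}) \big) P(S_{n_0} = s_{i}) = \sum_{i=1}^{k} \frac{1}{2} \big( P(S_{n_1} \geq s_{i}) + P(S_{n_1} \geq s_{i-1}) \big) P(S_{n_0} = s_{i}) = \sum_{i=1}^{k} \frac{1}{2} (CDF_{n_1}^{i} + CDF_{n_1}^{i-1}) (CDF_{n_0}^{i} - CDF_{n_0}^{i-1}) \hspace{15pt} (6)$$

An analogous formula can be written for the area under the Lift Curve (remember that it consists of the sum of two areas, one of which is always 0.5):

$$Gini_{model} + 0.5 = \sum_{i=1}^{k} \frac{1}{2} (CDF_{n_1}^{i} + CDF_{n_1}^{i-1}) (CDF_{n}^{i} - CDF_{n}^{i-1}) \hspace{15pt} (7)$$

And now we transform it, using $CDF_{n}^{i} = \pi CDF_{n_0}^{i} + (1 - \pi) CDF_{n_1}^{i}$, where $\pi = n_0 / n$ is the share of class "0" objects:

$$Gini_{model} = \pi \sum_{i=1}^{k} \frac{1}{2} (CDF_{n_1}^{i} + CDF_{n_1}^{i-1}) (CDF_{n_0}^{i} - CDF_{n_0}^{i-1}) + (1 - \pi) \sum_{i=1}^{k} \frac{1}{2} \big( (CDF_{n_1}^{i})^2 - (CDF_{n_1}^{i-1})^2 \big) - 0.5 = \pi \cdot AUC_{ROC} + \frac{1 - \pi}{2} - \frac{1}{2} = \frac{\pi}{2} (2 \cdot AUC_{ROC} - 1) \hspace{15pt} (8)$$

For an ideal model $AUC_{ROC} = 1$, so the formula is simple:

$$Gini_{perfect} = \frac{\pi}{2} \hspace{15pt} (9)$$

Therefore, from (8) and (9), we obtain:

$$Gini_{normalized} = \frac{Gini_{model}}{Gini_{perfect}} = 2 \cdot AUC_{ROC} - 1$$

As we used to say at school: which was to be proved.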
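The nonparametric argument can also be checked numerically. The sketch below is an illustration under stated assumptions rather than the author's code: it takes the distinct predicted values in descending order, builds the cumulative distribution functions for class "1", class "0" and the whole sample, and takes `pi` to be the share of class "0" objects. The function name `cdf_check` is hypothetical; the data is the 15-object sample from the bubble sort section.

```python
from collections import Counter

actual = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
predict = [0.9, 0.3, 0.8, 0.75, 0.65, 0.6, 0.78, 0.7, 0.05, 0.4, 0.4, 0.05, 0.5, 0.1, 0.1]


def cdf_check(labels, scores):
    values = sorted(set(scores), reverse=True)        # s_1 > s_2 > ... > s_k
    n, n1 = len(labels), sum(labels)
    n0 = n - n1
    pi = n0 / n                                       # assumed: prior of class "0"
    count1 = Counter(s for a, s in zip(labels, scores) if a == 1)
    count0 = Counter(s for a, s in zip(labels, scores) if a == 0)
    auc = lift = 0.0
    cdf1 = cdf0 = cdfn = 0.0                          # CDF values at index i-1
    for s in values:
        new1 = cdf1 + count1[s] / n1
        new0 = cdf0 + count0[s] / n0
        newn = pi * new0 + (1 - pi) * new1            # mixture of the two classes
        auc += 0.5 * (new1 + cdf1) * (new0 - cdf0)    # sum for the area under ROC
        lift += 0.5 * (new1 + cdf1) * (newn - cdfn)   # sum for the area under Lift
        cdf1, cdf0, cdfn = new1, new0, newn
    gini_model = lift - 0.5
    gini_perfect = pi / 2
    return auc, gini_model / gini_perfect


auc, gini_norm = cdf_check(actual, predict)
print(round(auc, 4), round(gini_norm, 4))
```

Both routes give the same number: the ratio of the two Gini areas coincides with twice the AUC minus one.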

### 3. Practical application.

As mentioned at the beginning of the article, the Gini coefficient is used to evaluate models in many areas, including bank lending, insurance and targeted marketing. And there is a perfectly reasonable explanation for this. This article does not aim to cover the practical application of the statistic in any particular field in detail. Many books have been written on the subject; we will only briefly go over it.

#### Credit scoring

Banks around the world receive thousands of loan applications every day. The risk that a client simply will not repay the loan has to be assessed somehow, so predictive models are developed that estimate, from the feature space, the probability that the client will not repay. These models first have to be evaluated, and if a model is successful, an optimal probability threshold has to be chosen. The choice of the optimal threshold is determined by the bank's policy. The goal of the analysis when selecting the threshold is to minimize the risk of lost profit associated with refusing to issue a loan. But to choose a threshold, one needs a good model. The main quality metrics in the banking sector are:

- The Gini coefficient

- The Kolmogorov-Smirnov statistic (calculated as the maximum difference between the cumulative distribution functions of "bad" and "good" borrowers; the figure with the distributions above shows this statistic)

- The divergence coefficient (an estimate of the difference between the mean scorecard scores of "bad" and "good" borrowers, normalized by the variance of these distributions; the larger the divergence coefficient, the better the model)
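As an illustration of the second metric in the list, here is a minimal sketch of the Kolmogorov-Smirnov statistic in plain Python. The function name `ks_statistic` is hypothetical, class "1" is arbitrarily treated as the "bad" borrowers, and the data is the 15-object sample from the bubble sort section rather than real scoring data.

```python
actual = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
predict = [0.9, 0.3, 0.8, 0.75, 0.65, 0.6, 0.78, 0.7, 0.05, 0.4, 0.4, 0.05, 0.5, 0.1, 0.1]


def ks_statistic(labels, scores):
    """Maximum gap between the empirical CDFs of the two classes."""
    bad = [s for a, s in zip(labels, scores) if a == 1]
    good = [s for a, s in zip(labels, scores) if a == 0]
    ks = 0.0
    for t in sorted(set(scores)):
        cdf_bad = sum(s <= t for s in bad) / len(bad)      # share of "bad" with score <= t
        cdf_good = sum(s <= t for s in good) / len(good)   # share of "good" with score <= t
        ks = max(ks, abs(cdf_bad - cdf_good))
    return ks


ks = ks_statistic(actual, predict)
print(round(ks, 4))
```

On this sample the largest gap is reached at the threshold 0.5, where 7 of the 9 class "0" objects but only 1 of the 6 class "1" objects lie at or below it.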

I do not know how things stand in Russia, although I live here, but in Europe the Gini coefficient is the most widely used, and in North America it is the Kolmogorov-Smirnov statistic.

#### Insurance

In this area everything is similar to the banking sector, the only difference being that we need to separate customers into those who will file an insurance claim and those who will not. Let's consider a practical example from this area in which one feature of the Lift Curve is clearly visible: for strongly unbalanced classes in the target variable, the curve almost perfectly coincides with the ROC curve.

A few months ago Kaggle hosted the "Porto Seguro's Safe Driver Prediction" competition, in which the task was precisely to predict an "Insurance Claim", that is, the filing of an insurance claim. It is also the competition in which I missed a silver medal through my own stupidity by choosing the wrong submission.

It was a very strange and at the same time incredibly instructive competition, with a record number of participants: 5169. The winner, Michael Jahrer, wrote his code only in C++/CUDA, which commands admiration and respect.

Porto Seguro is a Brazilian company specializing in car insurance.

The dataset consisted of 595207 rows in the training set, 892816 rows in the test set, and 53 anonymized features. The class ratio in the target is 3% to 97%. Let's write a simple baseline, which fortunately takes only a few lines, and plot the charts. Note that the curves almost perfectly coincide; the difference between the areas under the Lift Curve and the ROC Curve is ???.

Python code:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import xgboost as xgb
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

# NOTE: several numeric literals (fill value, test size, seed, figure and font
# sizes, line widths, text positions) were lost in this copy of the article;
# the values used below are assumed.
df = pd.read_csv('train.csv', index_col='id')
unwanted = df.columns[df.columns.str.startswith('ps_calc_')]
df.drop(unwanted, inplace=True, axis=1)
df.fillna(-999, inplace=True)

train, test = train_test_split(df, stratify=df.target, test_size=0.25, random_state=1)
estimator = xgb.XGBClassifier(random_state=1, n_jobs=-1)
estimator.fit(train.drop('target', axis=1), train.target)

actual = test.target.values
predict = estimator.predict_proba(test.drop('target', axis=1))[:, 1]

# Rank the true labels by descending prediction and build the Lift Curves
data = zip(actual, predict)
sorted_data = sorted(data, key=lambda d: d[1], reverse=True)
sorted_actual = [d[0] for d in sorted_data]
cumulative_actual = np.cumsum(sorted_actual) / sum(actual)
cumulative_index = np.arange(1, len(cumulative_actual) + 1) / len(predict)
cumulative_actual_perfect = np.cumsum(sorted(actual, reverse=True)) / sum(actual)

aucroc = roc_auc_score(actual, predict)
gini = 2 * aucroc - 1
fpr, tpr, t = roc_curve(actual, predict)

x_values = [0] + list(cumulative_index)
y_values = [0] + list(cumulative_actual)
y_values_perfect = [0] + list(cumulative_actual_perfect)


def area_above_diagonal(x, y):
    # The lines defining S_pred and S_actual are missing from this copy;
    # the trapezoidal rule here is an assumed stand-in.
    x, y = np.asarray(x), np.asarray(y)
    return np.sum(np.diff(x) * (y[1:] + y[:-1]) / 2) - 0.5


S_pred = area_above_diagonal(x_values, y_values)
S_actual = area_above_diagonal(x_values, y_values_perfect)

fig, ax = plt.subplots(nrows=1, ncols=3, sharey=True, figsize=(18, 6))
fig.suptitle('Gini = {:0.3f}\n\n'.format(gini), fontsize=20, fontweight='bold')

ax[0].plot([0] + fpr.tolist(), [0] + tpr.tolist(), lw=2, color='red')
ax[0].fill_between([0] + fpr.tolist(), [0] + tpr.tolist(), color='red', alpha=0.1)
ax[0].text(0.4, 0.2, 'S = {:0.3f}'.format(aucroc), fontsize=28)

ax[1].plot(x_values, y_values, lw=2, color='blue')
ax[1].fill_between(x_values, x_values, y_values, color='blue', alpha=0.1)
ax[1].text(0.4, 0.2, 'S = {:0.3f}'.format(S_pred), fontsize=28)

ax[2].plot(x_values, y_values_perfect, lw=2, color='green')
ax[2].fill_between(x_values, x_values, y_values_perfect, color='green', alpha=0.1)
ax[2].text(0.4, 0.2, 'S = {:0.3f}'.format(S_actual), fontsize=28)

ax[0].set(title='ROC-AUC XGBoost Baseline', xlabel='False Positive Rate',
          ylabel='True Positive Rate', xlim=(0, 1), ylim=(0, 1))
ax[1].set(title='Gini XGBoost Baseline')
ax[2].set(title='Gini Perfect')

for i in range(1, 3):
    # Line of absolute equality
    ax[i].plot([0, 1], [0, 1], linestyle='-', lw=2, color='black')
    ax[i].set(xlabel='Share of clients', ylabel='True Positive Rate', xlim=(0, 1), ylim=(0, 1))

plt.show()
```

The Gini coefficient of the winning model is ???.

It is still a mystery to me what the organizers wanted to achieve by anonymizing the features and preprocessing the data so heavily. This is one of the reasons why all the models, including the winning ones, essentially turned out to be garbage. Probably it was just PR: nobody in the world except Brazilians knew about Porto Seguro, and now many people do.

#### Target marketing

In this area the true meaning of the Gini coefficient and the Lift Curve is easiest to grasp. For some reason almost all books and articles give examples with mail marketing campaigns, which in my opinion is an anachronism. Let's create an artificial business problem from the world of free2play games. We have a database of users who once played our game and for some reason dropped out. We want to bring them back to our game project. For each user we have a feature space (time in the project, how much they spent, what level they reached, etc.) on which we build the model. We evaluate the model by the Gini coefficient and build the Lift Curve:

Suppose that within the marketing campaign we establish contact with the user in one way or another (email, social network), and the price of contacting one user is 2 rubles. We know that the Lifetime Value is 5 rubles. We need to optimize the effectiveness of the marketing campaign. Suppose there are 100 users in the sample, of whom 30 will return. Thus, if we contact 100% of the users, we spend 200 rubles on the campaign and earn 150 rubles. The campaign is a failure. Now consider the Lift Curve. It shows that by contacting 50% of the users we reach 90% of the users who will return. The cost of the campaign is 100 rubles and the income is 135 rubles. We are in positive territory. Thus, the Lift Curve lets us optimize our marketing campaign in the best way.
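The arithmetic of this example can be written down in a few lines. This is a minimal sketch using only the numbers from the paragraph above; the function name `campaign_profit` and the constant names are hypothetical.

```python
CONTACT_COST = 2    # rubles per contacted user
LTV = 5             # rubles earned from each returning user
N_USERS = 100       # users in the sample
N_RETURNING = 30    # users who will return if contacted


def campaign_profit(share_contacted, share_captured):
    """Profit when we contact a share of users and reach a share of the returners."""
    cost = CONTACT_COST * N_USERS * share_contacted
    income = LTV * N_RETURNING * share_captured
    return income - cost


profit_all = campaign_profit(1.0, 1.0)    # contact everyone
profit_half = campaign_profit(0.5, 0.9)   # point read off the Lift Curve
print(profit_all, profit_half)
```

Evaluating the same function at other points of the Lift Curve would show which contact share gives the largest profit for a given model.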

### 4. Bubble sort.

The Gini coefficient has a rather amusing but very useful interpretation, which also gives us an easy way to calculate it. It turns out that it is numerically equal to:

$$Gini_{normalized} = 1 - \frac{Swaps_{sorted}}{Swaps_{random}},$$

where $Swaps_{sorted}$ is the number of swaps that need to be made in the ranked list in order to get the true list of the target variable, and $Swaps_{random}$ is the number of swaps for the predictions of the random algorithm. Let's write an elementary bubble sort and show this:

Python code:

```python
actual = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
predict = [0.9, 0.3, 0.8, 0.75, 0.65, 0.6, 0.78, 0.7, 0.05, 0.4, 0.4, 0.05, 0.5, 0.1, 0.1]

# Rank the true labels by ascending prediction
data = zip(actual, predict)
sorted_data = sorted(data, key=lambda d: d[1], reverse=False)
sorted_actual = [d[0] for d in sorted_data]

swaps = 0
n = len(sorted_actual)
array = sorted_actual
for i in range(1, n):
    flag = 0
    for j in range(n - i):
        # A "1" standing before a "0" is out of order: swap the neighbours
        if array[j] > array[j + 1]:
            array[j], array[j + 1] = array[j + 1], array[j]
            flag = 1
            swaps += 1
    if flag == 0:
        break  # no swaps in this pass, the list is sorted

print("Number of permutations:", swaps)
# Number of permutations: 10
```

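Combinatorially, it is not difficult to calculate the number of swaps for a random algorithm; with 6 objects of class "1" and 9 objects of class "0" it is half the number of pairs:

$$Swaps_{random} = \frac{6 \cdot 9}{2} = 27$$

Thus:

$$Gini_{normalized} = 1 - \frac{10}{27} \approx 0.63$$

We see that we obtained the same value of the coefficient as in the toy example considered above.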
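The swap count does not have to come from an actual bubble sort: the number of adjacent swaps equals the number of ("1" before "0") pairs in the list ranked by ascending prediction. The sketch below counts those pairs in a single pass; the function name `gini_by_swaps` is hypothetical, and the number of swaps for the random algorithm is taken as half the number of ("1", "0") pairs, as in the calculation above.

```python
actual = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
predict = [0.9, 0.3, 0.8, 0.75, 0.65, 0.6, 0.78, 0.7, 0.05, 0.4, 0.4, 0.05, 0.5, 0.1, 0.1]


def gini_by_swaps(labels, scores):
    # Rank the true labels by ascending prediction (stable sort, as in the
    # bubble sort example).
    ranked = [a for a, _ in sorted(zip(labels, scores), key=lambda d: d[1])]
    swaps, ones_seen = 0, 0
    for a in ranked:
        if a == 1:
            ones_seen += 1
        else:
            swaps += ones_seen          # this "0" has to pass every "1" before it
    n1 = sum(labels)
    n0 = len(labels) - n1
    swaps_random = n1 * n0 / 2          # assumed: half of all ("1", "0") pairs
    return swaps, 1 - swaps / swaps_random


swaps, gini = gini_by_swaps(actual, predict)
print(swaps, round(gini, 4))
```

This runs in linear time after sorting, which matters on samples where a quadratic bubble sort would be too slow.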

I hope the article was useful and dispelled some myths regarding this quality metric.