34

The problem concerns building decision trees. According to Wikipedia, 'Gini coefficient' should not be confused with 'Gini impurity'. However, both measures can be used when building a decision tree: either can guide the choice of how to split a set of items.

1) 'Gini impurity' - the standard decision-tree splitting metric (see the link above);

2) 'Gini coefficient' - each split can be assessed using the AUC criterion. For each splitting scenario we can build a ROC curve and compute the AUC. According to Wikipedia, AUC = (GiniCoeff + 1)/2;

The question is: are these two measures equivalent? On the one hand, I am told that the Gini coefficient should not be confused with Gini impurity. On the other hand, both measures can be used for the same thing: assessing the quality of a decision tree split.

Damien

6 Answers

34

No, despite their names they are not equivalent or even that similar.

  • Gini impurity is a measure of misclassification, which applies in a multiclass classifier context.
  • Gini coefficient applies to binary classification and requires a classifier that can in some way rank examples according to the likelihood of being in a positive class.

Both could be applied in some cases, but they are different measures for different things. Impurity is what is commonly used in decision trees.
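To make the distinction concrete, here is a minimal Python sketch. The data, function names, and the pairwise AUC computation are illustrative assumptions, not taken from any particular library. It shows that the impurity needs only class labels (any number of classes), while the coefficient needs binary labels plus a score that ranks the examples, using the AUC = (GiniCoeff + 1)/2 relation quoted in the question.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def gini_coefficient_from_auc(y_true, scores):
    """Gini coefficient of a ranking classifier, from AUC = (Gini + 1) / 2."""
    return 2.0 * auc(y_true, scores) - 1.0


# Hypothetical data: the impurity only looks at labels, with any number of classes.
multiclass_labels = ["red", "red", "green", "blue"]
impurity = gini_impurity(multiclass_labels)

# The coefficient needs binary labels and one ranking score per example.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
coefficient = gini_coefficient_from_auc(y_true, scores)

print(impurity, coefficient)
```

Note that the impurity function never sees a score and the coefficient function never sees more than two classes, so the two quantities are not even defined on the same inputs.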

Sean Owen
6

I took an example with two people, A and B, with wealth of 1 unit and 3 units respectively. Gini impurity as per Wikipedia = 1 - [ (1/4)^2 + (3/4)^2 ] = 3/8

Gini coefficient as per Wikipedia would be the ratio of the area between the red and blue lines to the total area under the blue line in the following graph:

[Figure: graph for the two-person example, with a red line and a blue line]

Area under red line is 1/2 + 1 + 3/2 = 3

Total area under blue line = 4

So the area between the red and blue lines is 4 - 3 = 1, and the Gini coefficient = 1/4

Clearly the two numbers are different. I will check more cases to see if they are proportional or there is an exact relationship and edit the answer.

Edit: I checked other combinations as well, and the ratio is not constant. Below is a list of a few combinations I tried.

[Image: table of the combinations tried]
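A check of this kind can be reproduced with a short Python sketch. The combinations below are hypothetical (the original list is not reproduced here), and the Gini coefficient is assumed to be computed with the mean-absolute-difference formula, which is equivalent to the area ratio on the Lorenz curve. Wealth shares are treated as class probabilities for the impurity.

```python
def gini_impurity(wealth):
    """Treat each person's share of total wealth as a class probability."""
    total = sum(wealth)
    return 1.0 - sum((w / total) ** 2 for w in wealth)


def gini_coefficient(wealth):
    """Sum of absolute differences over all ordered pairs, over 2 * n^2 * mean."""
    n = len(wealth)
    mean = sum(wealth) / n
    abs_diff_sum = sum(abs(a - b) for a in wealth for b in wealth)
    return abs_diff_sum / (2 * n * n * mean)


# Hypothetical wealth combinations (not the ones from the original table).
combinations = [(1, 3), (1, 9), (2, 3), (1, 1, 2)]

ratios = []
for wealth in combinations:
    impurity = gini_impurity(wealth)
    coefficient = gini_coefficient(wealth)
    ratios.append(impurity / coefficient)
    print(wealth, round(impurity, 4), round(coefficient, 4), round(ratios[-1], 4))
```

If the two measures were proportional, every entry of `ratios` would be the same number.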

Gaurav Singhal
1

I believe they represent the same thing essentially, as the so-called:

"Gini Coefficient", mainly used in economics, measures the inequality of a numerical variable, such as income, which we can treat as a regression problem -- getting the "mean" of each group.

"Gini impurity", mainly used in decision tree learning, measures the impurity of a categorical variable, such as colour, sex, etc., which is a classification problem -- getting the "majority" of each group.

Sounds similar, right? "Inequality" and "impurity" are both measures of variation, which are intuitively the same concept. The difference is that "inequality" is for numerical variables and "impurity" is for categorical variables. And both of them can be named "Gini Index".


In Light, R. J., & Margolin, B. H. (1971), An analysis of variance for categorical data, it says that because the "mean" is an undefined concept for categorical data, Gini extended the "Gini Index" from numerical data to categorical data by using pairwise differences instead of deviations from the mean. TL;DR: this leads to the following variation for categorical responses: $$\frac{1}{2n}\Big[\sum_{i\neq j}n_in_j\Big] = \frac{n}{2} - \frac{1}{2n}\sum^I_{i=1}n_i^2$$ where $n_i$ is the number of responses in the $i$th category, $i = 1, \ldots, I$, and $n$ is the total number of responses. This is almost the same as today's "Gini Impurity", just multiplied by $\frac{n}{2}$: $$1 - \sum^{I}_{i=1} {p_i}^{2}$$
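To spell out the step behind that identity, here is a short derivation. It assumes $n = \sum_{i=1}^{I} n_i$ and $p_i = n_i / n$, which is my reading of the notation rather than something quoted from the paper:

$$\sum_{i \neq j} n_i n_j = \Big(\sum_{i=1}^{I} n_i\Big)^2 - \sum_{i=1}^{I} n_i^2 = n^2 - \sum_{i=1}^{I} n_i^2$$

$$\frac{1}{2n}\sum_{i \neq j} n_i n_j = \frac{n}{2} - \frac{1}{2n}\sum_{i=1}^{I} n_i^2 = \frac{n}{2}\Big(1 - \sum_{i=1}^{I} p_i^2\Big)$$

The last factor is the Gini impurity, so the two quantities differ only by the scale $\frac{n}{2}$.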


By the way, you said that ROC can be used (your method 2) to choose the split point when growing a decision tree. I don't follow that. Could you elaborate?

PS: I agree with Pasmod Turing's answer that Wikipedia can be modified by anyone, and the "Gini Impurity" entry seems like an incomplete item in the wiki.

I also saw the disputes in the comments under his answer. I must say that machine learning originated from statistics, and statistics is the fundamental analysis tool for scientific research, so many concepts are the same thing in statistics even though they have different names in different professional areas. The Gini index certainly shares the same name in decision trees and economics.

Jokerkeny
0

I think they both represent the same concept.

In classification trees, the Gini Index is used to compute the impurity of a data partition. Assume a data partition D consisting of 4 classes, each with equal probability. Then the Gini Index (Gini Impurity) will be: $Gini(D) = 1 - (0.25^2 + 0.25^2 + 0.25^2 + 0.25^2) = 0.75$

In CART we perform binary splits. So the Gini index of a split is computed as the weighted sum of the Gini indices of the resulting partitions, and we select the split with the smallest Gini index.
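The split selection described above can be sketched in a few lines of Python. This is a simplified illustration, not CART itself: the feature values, labels, function names, and the midpoint-threshold search are all assumptions made for the example.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def weighted_gini(left, right):
    """Impurity of a binary split: partition impurities weighted by partition size."""
    n = len(left) + len(right)
    return len(left) / n * gini_impurity(left) + len(right) / n * gini_impurity(right)


def best_threshold(values, labels):
    """Try midpoints between sorted distinct values; keep the smallest weighted Gini.

    Returns (threshold, weighted_gini), or None if there is only one distinct value.
    """
    distinct = sorted(set(values))
    best = None
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        score = weighted_gini(left, right)
        if best is None or score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical single numeric feature with three classes.
values = [1.0, 2.0, 3.0, 4.0]
labels = ["a", "a", "b", "c"]
threshold, score = best_threshold(values, labels)
print(threshold, score)
```

The labels here have three classes, which illustrates that nothing in the computation depends on the problem being binary; only the split is binary.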

So the use of Gini Impurity (Gini Index) is not limited to binary situations.

Another term for Gini Impurity is Gini Coefficient, which is normally used as a measure of income distribution.

Ethan
Pasmod Turing
0

Gini impurity is a special instance of the Gini coefficient.

This is the Gini coefficient's definition in Wikipedia:

In economics, the Gini coefficient (/ˈdʒiːni/ JEE-nee), also known as the Gini index or Gini ratio, is a measure of statistical dispersion intended to represent the income inequality or the wealth inequality within a nation or a social group.

In other words, it measures the inequality of the wealth of each person in a nation, with the constraint that the sum of their wealth is a constant.

Now replace the above bolded words with:

person -> category
wealth -> probability
nation -> probability distribution
constant -> 1

The above sentence becomes: it measures the inequality of the probability of each category in a probability distribution, with the constraint that the sum of their probabilities is 1.

That's exactly the definition of Gini impurity!

0

Gini index of 1 would represent wealth concentration to a single person. However, the Gini impurity in this case would be 0. So, they should move in opposite directions, right?
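One rough way to check that intuition is the Python sketch below. The four-person populations are hypothetical, wealth shares are treated as class probabilities for the impurity, and the coefficient is assumed to use the mean-absolute-difference formula.

```python
def gini_impurity(wealth):
    """Treat each person's share of total wealth as a class probability."""
    total = sum(wealth)
    return 1.0 - sum((w / total) ** 2 for w in wealth)


def gini_coefficient(wealth):
    """Sum of absolute differences over all ordered pairs, over 2 * n^2 * mean."""
    n = len(wealth)
    mean = sum(wealth) / n
    abs_diff_sum = sum(abs(a - b) for a in wealth for b in wealth)
    return abs_diff_sum / (2 * n * n * mean)


# Hypothetical populations, ordered from equal wealth to full concentration.
populations = [(25, 25, 25, 25), (10, 20, 30, 40), (1, 1, 1, 97), (0, 0, 0, 100)]

impurities = [gini_impurity(p) for p in populations]
coefficients = [gini_coefficient(p) for p in populations]

for p, imp, coef in zip(populations, impurities, coefficients):
    print(p, round(imp, 4), round(coef, 4))

# With this formula and a finite population of n people, full concentration
# gives (n - 1) / n rather than exactly 1.
```

On these examples the coefficient rises as wealth concentrates while the impurity falls, which is consistent with the two moving in opposite directions.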