
Splitting the Nodes

During training, we need to find the best split at each decision node. When we split the data into two nodes, we aim for different classes to be in separate nodes.

  • Best case scenario: all data points in a node belong to the same class;
  • Worst case scenario: an equal number of data points for each class.

Gini Impurity

To measure how good a split is, we can calculate the Gini impurity. It is the probability that if we randomly take two points from a node (with replacement), they will be of different classes. The lower this probability (impurity), the better the split.

You can calculate the Gini impurity for binary classification using the following formula:

\text{gini} = 1 - p_0^2 - p_1^2 = 1 - \left(\frac{m_0}{m}\right)^2 - \left(\frac{m_1}{m}\right)^2

Where

  • m_i - number of instances of class i in a node;
  • m - number of instances in a node;
  • p_i = \frac{m_i}{m} - probability of choosing class i.
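For example, a node with 6 instances of class 0 and 2 instances of class 1 gives \text{gini} = 1 - \left(\frac{6}{8}\right)^2 - \left(\frac{2}{8}\right)^2 = 1 - 0.5625 - 0.0625 = 0.375.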

And for multiclass classification, the formula is:

\text{gini} = 1 - \sum_{i=0}^{C-1} p_i^2 = 1 - \sum_{i=0}^{C-1}\left(\frac{m_i}{m}\right)^2

Where

  • C - number of classes.
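As a quick sketch of how the formula can be computed in Python (the gini helper below is illustrative, not code from this course):

```python
import numpy as np

def gini(labels):
    """Gini impurity of a node, computed from its class labels (any number of classes)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()        # p_i = m_i / m
    return 1 - np.sum(p ** 2)

print(gini([0, 0, 0, 0]))   # 0.0    -> pure node (best case)
print(gini([0, 1, 0, 1]))   # 0.5    -> perfectly mixed binary node (worst case)
print(gini([0, 1, 2]))      # ~0.667 -> perfectly mixed node with three classes
```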

We can measure how good a split is by taking the weighted sum of the Gini scores of the two resulting nodes, where each node's score is weighted by the fraction of samples it receives. This weighted sum is the value we want to minimize.
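A minimal sketch of that weighted score, reusing the gini helper defined above (the function name split_gini is ours):

```python
def split_gini(left_labels, right_labels):
    """Weighted Gini impurity of a split: each child's impurity is weighted by its share of samples."""
    n_left, n_right = len(left_labels), len(right_labels)
    n = n_left + n_right
    return n_left / n * gini(left_labels) + n_right / n * gini(right_labels)

print(split_gini([0, 0], [1, 1]))   # 0.0 -> perfect split, both children are pure
print(split_gini([0, 1], [0, 1]))   # 0.5 -> useless split, both children stay mixed
```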

To split a decision node, we need to choose a feature to split on and a threshold value for that feature.

At a decision node, the algorithm greedily finds the best threshold for each feature. Then it chooses the feature whose split has the lowest weighted Gini impurity (if there is a tie, it chooses randomly).
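A simplified sketch of this greedy search, building on the gini and split_gini helpers above (real implementations are far more optimized, and the tie-breaking here simply keeps the first candidate rather than choosing randomly):

```python
import numpy as np

def best_split(X, y):
    """Try every feature and every midpoint threshold; return the split with the lowest weighted Gini."""
    best_feature, best_threshold, best_score = None, None, float("inf")
    for feature in range(X.shape[1]):
        values = np.unique(X[:, feature])
        thresholds = (values[:-1] + values[1:]) / 2   # midpoints between consecutive unique values
        for t in thresholds:
            mask = X[:, feature] <= t
            score = split_gini(y[mask], y[~mask])
            if score < best_score:
                best_feature, best_threshold, best_score = feature, t, score
    return best_feature, best_threshold, best_score

X = np.array([[2.0, 1.0], [3.0, 1.5], [10.0, 1.2], [11.0, 0.9]])
y = np.array([0, 0, 1, 1])
print(best_split(X, y))   # (0, 6.5, 0.0): feature 0 with threshold 6.5 separates the classes perfectly
```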

Entropy

Entropy is another measure of impurity. For a binary classification problem, the entropy H of a node is calculated using the formula:

H(p) = -p \log_2(p) - (1 - p) \log_2(1 - p)

where:

  • p is the proportion of positive examples (class 1);
  • 1 - p is the proportion of negative examples (class 0).
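For example, a node where 75% of the examples are positive has H(0.75) = -0.75 \log_2(0.75) - 0.25 \log_2(0.25) \approx 0.81, while a perfectly balanced node has H(0.5) = 1.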

For a multiclass classification problem, the entropy H of a node is calculated using the formula:

H(p_1, p_2, \dots, p_k) = -\sum_{i=1}^{k} p_i \log_2(p_i)

where:

  • k is the number of classes;
  • p_i is the proportion of examples belonging to class i in the node.
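An illustrative Python sketch of the multiclass formula (again, the entropy helper is ours; terms with p_i = 0 never appear because only observed classes are counted):

```python
import numpy as np

def entropy(labels):
    """Entropy of a node, computed from its class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()        # p_i for each class present in the node
    return -np.sum(p * np.log2(p))

print(entropy([0, 0, 0, 0]))   # 0.0   -> pure node
print(entropy([0, 1, 0, 1]))   # 1.0   -> maximum for two classes
print(entropy([0, 0, 0, 1]))   # ~0.81
```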

Similarly to Gini impurity, we can measure how good a split is by calculating the weighted sum of entropy values for the child nodes obtained from the split. This is the value we want to minimize in order to maximize the information gain.
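A small sketch of that idea, reusing the entropy helper above (information_gain is an illustrative name, not course code):

```python
def information_gain(parent_labels, left_labels, right_labels):
    """Information gain = parent entropy minus the weighted entropy of the children."""
    n = len(parent_labels)
    weighted_children = (len(left_labels) / n * entropy(left_labels)
                         + len(right_labels) / n * entropy(right_labels))
    return entropy(parent_labels) - weighted_children

print(information_gain([0, 0, 1, 1], [0, 0], [1, 1]))   # 1.0 -> the split removes all uncertainty
```

In scikit-learn, you choose between the two measures with the criterion parameter of DecisionTreeClassifier, e.g. criterion="gini" (the default) or criterion="entropy"; in practice they usually lead to very similar trees.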

Note

The entropy is maximum when all classes are equally represented. It is minimum (0) when all examples belong to one class (pure node).
