Crazy Tree | Decision Tree Classification with Python

Crazy Tree

Before we finally jump into implementing a Decision Tree in Python, one more thing needs to be discussed: overfitting, the primary challenge associated with Decision Trees.

Here is an example of how the Decision Tree fits the dataset.

You can notice that the model perfectly fits the training set without misclassifying any instances.

The only problem is that the decision boundaries are too complex, so the test (or cross-validation) accuracy will be significantly lower than the training set's accuracy. The model overfits.
A Decision Tree will make as many splits as it needs to fit the training data perfectly.
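As a quick sketch of this behavior (using a toy `make_moons` dataset, a hypothetical choice for illustration), an unconstrained tree scores perfectly on the data it was trained on while doing noticeably worse on held-out data:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy two-class toy dataset (assumed here purely for illustration)
X, y = make_moons(n_samples=200, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# An unconstrained tree keeps splitting until every training sample is classified
tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)

print(tree.score(X_train, y_train))  # 1.0 — a perfect fit on the training set
print(tree.score(X_test, y_test))    # noticeably lower — the model overfits
```

The perfect training score is exactly the "crazy" fit described above: the tree memorizes the noise in the training set instead of the underlying pattern.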

Luckily, the Decision Tree is quite configurable. Let's see how we can constrain the Tree to reduce overfitting:

max_depth

The depth of a node is its vertical distance (the number of edges) from the root node.

We can constrain the maximum depth of a Decision Tree, making the tree smaller and less likely to overfit. To do so, we turn the Decision Nodes at the maximum depth into Leaf Nodes.

Here is also a GIF showing how the Decision Boundary changes with different max_depth values.
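A minimal sketch of the constraint in scikit-learn (again on a toy `make_moons` dataset, an assumed example): after fitting with `max_depth=3`, no node sits deeper than three edges from the root.

```python
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=200, noise=0.3, random_state=42)

# Limit the tree to 3 levels of splits; nodes at depth 3 become Leaf Nodes
shallow = DecisionTreeClassifier(max_depth=3, random_state=42)
shallow.fit(X, y)

print(shallow.get_depth())  # never exceeds 3
```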

min_samples_leaf

Another way to constrain the Tree is to set the minimum number of samples in the Leaf Nodes. It makes the model simpler and more robust to outliers.

Here is a GIF showing how min_samples_leaf affects the Decision Boundary.
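To sketch what this guarantee looks like in scikit-learn (the toy dataset is again an assumption for illustration): with `min_samples_leaf=10`, every leaf of the fitted tree holds at least 10 training samples. The leaf check below reads the fitted tree's internal `tree_` arrays, where a leaf is a node with no left child.

```python
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=200, noise=0.3, random_state=42)

# Require at least 10 training samples in every Leaf Node
tree = DecisionTreeClassifier(min_samples_leaf=10, random_state=42)
tree.fit(X, y)

# In sklearn's tree_ structure, leaves have children_left == -1
is_leaf = tree.tree_.children_left == -1
print(tree.tree_.n_node_samples[is_leaf].min())  # at least 10
```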

Both these parameters are available in Scikit-Learn as the Decision Tree's hyperparameters. By default, the tree is unconstrained: max_depth is set to None, meaning no restriction on depth, and min_samples_leaf is set to 1.
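You can confirm these defaults directly on a freshly created estimator:

```python
from sklearn.tree import DecisionTreeClassifier

params = DecisionTreeClassifier().get_params()
print(params["max_depth"])         # None — depth is unrestricted by default
print(params["min_samples_leaf"])  # 1
```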


Section 3. Chapter 3