Random Forest Summary
Let's look at Random Forest's peculiarities:
- Little data preparation is required.
Since Random Forest is an ensemble of Decision Trees, it needs the same preprocessing as a Decision Tree, which is very little. In particular, feature scaling is not required;
- Provides feature importances.
Just like the Decision Tree, Random Forest provides feature importances that you can access using the .feature_importances_ attribute;
- Random Forest is relatively slow.
Since Random Forest trains many Decision Trees (100 by default), training can become quite slow on large datasets. To make a prediction, a new instance must also run through all the trees, so predictions can become slow as well if many trees are used;
- Handles datasets with many features well.
Because each split considers only a random sample of the features, Random Forest's training time does not suffer much from a large number of features. The model can also easily ignore useless features, since a better feature will be chosen at each Decision Node. So useless features do not worsen the model unless there are too many of them;
- Suitable for complex tasks.
A Decision Tree can build complex decision boundaries, but they are not smooth and are very likely to overfit. In contrast, Random Forest produces smoother decision boundaries that generalize better, so it is much less likely to overfit. And unlike a single Decision Tree, Random Forest is stable, meaning it does not change drastically with minor changes to the dataset or hyperparameters.
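The points about preprocessing, the default number of trees, and feature importances can be illustrated with a minimal sketch. It assumes scikit-learn's RandomForestClassifier and a synthetic dataset from make_classification; the dataset sizes and random_state values are arbitrary choices for illustration, not part of the lesson.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (an assumption for illustration):
# 10 features, of which only 3 carry information about the class
X, y = make_classification(
    n_samples=500,
    n_features=10,
    n_informative=3,
    n_redundant=0,
    random_state=42,
)

# No scaling step: trees split on thresholds, so feature scale does not matter
forest = RandomForestClassifier(random_state=42)  # n_estimators defaults to 100
forest.fit(X, y)

# One fitted Decision Tree per estimator
print("number of trees:", len(forest.estimators_))

# One importance value per feature; the values sum to 1
importances = forest.feature_importances_
for i, importance in enumerate(importances):
    print(f"feature {i}: {importance:.3f}")
```

On data like this, the uninformative features would typically receive noticeably lower importances than the informative ones, although the exact values depend on the data and the random seed.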
And here is a little summary:
Advantages | Disadvantages |
---|---|
Less prone to overfitting | Slow |
Handles datasets with many features well | Not interpretable |
Stable | |
No feature scaling required | |
Provides feature importances | |
Usually robust to outliers | |
Suitable for complex tasks | |
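One way to check the overfitting and stability claims on your own data is to compare a single Decision Tree with a Random Forest using cross-validation. The sketch below assumes scikit-learn and a noisy synthetic dataset from make_moons; the dataset, noise level, and number of folds are illustrative assumptions, and the results will differ on other data.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Noisy two-class dataset (an assumption for illustration)
X, y = make_moons(n_samples=500, noise=0.3, random_state=0)

models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# 5-fold cross-validated accuracy for each model
scores = {}
for name, model in models.items():
    fold_scores = cross_val_score(model, X, y, cv=5)
    scores[name] = fold_scores
    print(f"{name}: mean accuracy {fold_scores.mean():.3f} "
          f"(std {fold_scores.std():.3f})")
```

A higher mean accuracy suggests better generalization, and a lower standard deviation across folds suggests a more stable model. On noisy data the forest often does better on both, but this is not guaranteed.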