Summary  
This chapter covers transforming categorical features into numerical form using ordinal encoding for ordered categories and one-hot encoding for nominal categories.

General domain of usage  
Machine learning data preprocessing

Clustering algorithms like **K-means** compute distances between points, so they need **numerical data**. Categorical features must therefore be converted to numerical form through encoding. This chapter covers two approaches: **ordinal** and **one-hot encoding**.

## Ordinal Encoding 

**Ordinal encoding** converts ordered categories to numerical values, preserving their **rank**. For example, ordinal encoding of the `'education_level'` column maps the values `"High School"`, `"Bachelor's"`, `"Master's"`, and `"PhD"` to `0`, `1`, `2`, and `3`.

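Because the codes are evenly spaced integers, a distance-based algorithm treats every step between adjacent categories as equal: `"High School"` to `"Bachelor's"` counts the same as `"Master's"` to `"PhD"`. In other words, the encoding assumes a **meaningful numerical difference** between encoded values, which may not always be accurate.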

```python
from sklearn.preprocessing import OrdinalEncoder

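# Categories listed from lowest to highest rank; this order defines the codes
education_levels = [['High School',
                     "Bachelor's",
                     "Master's",
                     "PhD"]]
encoder = OrdinalEncoder(categories=education_levels)

# df is assumed to be an existing DataFrame with an 'education_level' column
df[['education_encoded']] = encoder.fit_transform(df[['education_level']])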
```

Such encoding should only be used for **ordinal features** where category order matters.
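To see the result end to end, here is a minimal sketch on a small made-up DataFrame (the rows are illustrative assumptions, not course data). It also computes the gaps between encoded values, which is what a distance-based algorithm such as K-means would work with.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical sample data, for illustration only
df = pd.DataFrame({
    'education_level': ["Bachelor's", 'PhD', 'High School', "Master's"]
})

# Categories listed from lowest to highest rank
education_levels = [['High School', "Bachelor's", "Master's", 'PhD']]
encoder = OrdinalEncoder(categories=education_levels)

# fit_transform returns a 2D float array; ravel() flattens it into one column
df['education_encoded'] = encoder.fit_transform(df[['education_level']]).ravel()
print(df)

# Gaps implied by the encoding
codes = dict(zip(df['education_level'], df['education_encoded']))
gap_hs_to_phd = codes['PhD'] - codes['High School']
gap_hs_to_bachelor = codes["Bachelor's"] - codes['High School']
print(gap_hs_to_phd, gap_hs_to_bachelor)
```

The encoded column holds floats, and the gap from `'High School'` to `'PhD'` comes out three times as large as the gap to `"Bachelor's"`. Whether that ratio is meaningful depends on the feature.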

## One-Hot Encoding 

One-hot encoding converts **nominal** (unordered) categories into binary columns, where each category becomes a **new column**. For a feature with `n` categories, this typically creates `n` columns: in each row, the column matching that row's category is `1` and the others are `0`. However, only `n-1` columns are actually needed to represent the information **without redundancy**.

For example, a `'color'` column with values `'red'`, `'blue'`, and `'green'` can be encoded with just **two** columns: `'color_red'` and `'color_blue'`. If a row has `0` in both, it implies the color is `'green'`. By dropping one column, we avoid **redundancy**.

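In scikit-learn, the redundant column is removed with `drop='first'`, which drops the first category in sorted order. For the `'color'` example that is `'blue'`, so the remaining columns represent `'green'` and `'red'`, and a row with `0` in both is `'blue'`: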

```python
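from sklearn.preprocessing import OneHotEncoder

# sparse_output=False returns a dense NumPy array
# (scikit-learn >= 1.2; older versions used sparse=False)
encoder = OneHotEncoder(drop='first', sparse_output=False)

# df is assumed to be an existing DataFrame with a 'color' column
encoded = encoder.fit_transform(df[['color']])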
```



While one-hot encoding avoids imposing order and suits nominal features, it can increase **data dimensionality**.
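The growth in dimensionality is easy to check directly. The following sketch uses a made-up feature with 50 distinct values (the `city_*` names are hypothetical) and compares the encoded shapes with and without `drop='first'`.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical feature with 50 distinct categories, one row each
cities = np.array([f'city_{i}' for i in range(50)]).reshape(-1, 1)

# One column per category
full = OneHotEncoder().fit_transform(cities)

# One column fewer when the first category is dropped
reduced = OneHotEncoder(drop='first').fit_transform(cities)

print(full.shape)     # (50, 50)
print(reduced.shape)  # (50, 49)
```

A single column turns into dozens, which is why high-cardinality nominal features deserve some care before one-hot encoding.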

Which encoding method is best suited for a categorical feature like `'country'` with values such as `"USA"`, `"Canada"`, and `"Germany"`, where there is no natural order?
