Scaling and Distance-Based Models
Distance-based machine learning models, such as k-nearest neighbors (k-NN) and clustering algorithms like k-means, rely on mathematical measures of distance to compare data points. The way features are scaled has a direct and significant impact on how these distances are computed. If one feature has a much larger range or variance than others, it will dominate the distance calculation, causing models to become biased toward that feature and potentially reducing predictive accuracy. For example, if you have two features — height in centimeters (ranging from 150 to 200) and weight in kilograms (ranging from 50 to 100) — the difference in numerical scale can cause the model to consider height much more important than weight, even if both are equally relevant.
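To make the dominance effect concrete, here is a minimal sketch (with hypothetical points, using height in millimeters to exaggerate the scale mismatch) showing that the identity of the nearest neighbor can flip once features are standardized:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical points: [height in mm, weight in kg]
query = np.array([1700, 60])
candidates = np.array([[1702, 78],   # close in height, far in weight
                       [1740, 65]])  # farther in height, close in weight

# Nearest neighbor on raw features: height differences dominate
raw_d = np.linalg.norm(candidates - query, axis=1)
print("raw distances:", raw_d, "-> nearest:", raw_d.argmin())

# Standardize all points together, then recompute the distances
scaler = StandardScaler()
all_scaled = scaler.fit_transform(np.vstack([query, candidates]))
scaled_d = np.linalg.norm(all_scaled[1:] - all_scaled[0], axis=1)
print("scaled distances:", scaled_d, "-> nearest:", scaled_d.argmin())
```

On the raw features the first candidate wins purely because its height is closer; after standardization, both features contribute comparably and the second candidate becomes the nearest neighbor.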
Euclidean distance is the straight-line distance between two points in Euclidean space. It is calculated as the square root of the sum of squared differences between corresponding feature values. Euclidean distance is highly sensitive to the scale of each feature: features with larger ranges contribute more to the total distance, which can distort model behavior if features are not properly scaled.
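The definition above can be written directly as a small helper (a sketch, not part of any particular library) and checked against NumPy's built-in norm:

```python
import numpy as np

def euclidean(a, b):
    """Square root of the sum of squared differences between feature values."""
    a, b = np.asarray(a), np.asarray(b)
    return np.sqrt(np.sum((a - b) ** 2))

p, q = [170, 65], [180, 85]
d = euclidean(p, q)
print(d)  # sqrt(10**2 + 20**2) = sqrt(500) ≈ 22.36

# Matches NumPy's built-in Euclidean norm of the difference vector
assert np.isclose(d, np.linalg.norm(np.array(p) - np.array(q)))
```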
```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two data points with features on different scales
point1 = np.array([170, 65])  # [height in cm, weight in kg]
point2 = np.array([180, 85])

# Euclidean distance before scaling
distance_before = np.linalg.norm(point1 - point2)
print("Distance before scaling:", distance_before)

# Apply standard scaling (z-score normalization)
scaler = StandardScaler()
points_scaled = scaler.fit_transform(np.vstack([point1, point2]))

# Euclidean distance after scaling
distance_after = np.linalg.norm(points_scaled[0] - points_scaled[1])
print("Distance after scaling:", distance_after)
```