Lära Scaling and Distance-Based Models | Scaling and Model Performance

Distance-based machine learning models, such as k-nearest neighbors (k-NN) and clustering algorithms like k-means, rely on mathematical measures of distance to compare data points. The way features are scaled has a direct and significant impact on how these distances are computed. If one feature has a much larger range or variance than others, it will dominate the distance calculation, causing models to become biased toward that feature and potentially reducing predictive accuracy. For example, if you have two features — height in centimeters (ranging from 150 to 200) and weight in kilograms (ranging from 50 to 100) — the difference in numerical scale can cause the model to consider height much more important than weight, even if both are equally relevant.

Definition

Euclidean distance is the straight-line distance between two points in Euclidean space. It is calculated as the square root of the sum of squared differences between corresponding feature values. Euclidean distance is highly sensitive to the scale of each feature: features with larger ranges contribute more to the total distance, which can distort model behavior if features are not properly scaled.


              123456789101112131415161718
            
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two data points with features on different scales
point1 = np.array([170, 65])   # [height in cm, weight in kg]
point2 = np.array([180, 85])

# Euclidean distance before scaling
distance_before = np.linalg.norm(point1 - point2)
print("Distance before scaling:", distance_before)

# Apply standard scaling (z-score normalization)
scaler = StandardScaler()
points_scaled = scaler.fit_transform(np.vstack([point1, point2]))

# Euclidean distance after scaling
distance_after = np.linalg.norm(points_scaled[0] - points_scaled[1])
print("Distance after scaling:", distance_after)

Var allt tydligt?

Tack för dina kommentarer!

Avsnitt 4. Kapitel 2

Fråga AI

Fråga vad du vill eller prova någon av de föreslagna frågorna för att starta vårt samtal

Svep för att visa menyn

Definition


              123456789101112131415161718
            
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two data points with features on different scales
point1 = np.array([170, 65])   # [height in cm, weight in kg]
point2 = np.array([180, 85])

# Euclidean distance before scaling
distance_before = np.linalg.norm(point1 - point2)
print("Distance before scaling:", distance_before)

# Apply standard scaling (z-score normalization)
scaler = StandardScaler()
points_scaled = scaler.fit_transform(np.vstack([point1, point2]))

# Euclidean distance after scaling
distance_after = np.linalg.norm(points_scaled[0] - points_scaled[1])
print("Distance after scaling:", distance_after)

Var allt tydligt?

Tack för dina kommentarer!

Avsnitt 4. Kapitel 2