Scaling and Distance-Based Models
Distance-based machine learning models, such as k-nearest neighbors (k-NN) and clustering algorithms like k-means, compare data points by computing a distance between them. How features are scaled directly affects those distances. If one feature has a much larger range or variance than the others, it dominates the distance calculation, so the model is effectively biased toward that feature, which can reduce predictive accuracy. For example, take two features: height in centimeters (ranging from 150 to 200) and weight in kilograms (ranging from 50 to 100). Both happen to span 50 units here, but that balance is an accident of the units chosen: recording the same heights in millimeters (1500 to 2000) would make height differences ten times larger numerically, and the model would treat height as much more important than weight, even if both are equally relevant.
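To make the effect of units concrete, here is a minimal sketch in which the nearest neighbor of a query point changes when height is expressed in meters instead of centimeters. The three people and the helper `nearest_index` are hypothetical and chosen only for illustration; distances are plain Euclidean distances computed with NumPy.

```python
import numpy as np

# Hypothetical values: columns are [height in cm, weight in kg].
query = np.array([170.0, 65.0])
candidates = np.array([
    [172.0, 70.0],  # candidate A: close in height, farther in weight
    [176.0, 66.0],  # candidate B: farther in height, close in weight
])


def nearest_index(query, candidates):
    """Return the row index of the candidate closest to the query (Euclidean)."""
    distances = np.linalg.norm(candidates - query, axis=1)
    return int(np.argmin(distances))


# Height in centimeters: both features have comparable numerical size.
nearest_cm = nearest_index(query, candidates)

# Same people, height converted to meters: height differences shrink by 100x,
# so weight differences now dominate the distance.
unit_change = np.array([0.01, 1.0])
nearest_m = nearest_index(query * unit_change, candidates * unit_change)

print("Nearest with height in cm:", "AB"[nearest_cm])
print("Nearest with height in m: ", "AB"[nearest_m])
```

Nothing about the people changed between the two calls, only the unit of one feature, yet the answer to "who is closest?" differs. That is the behavior scaling is meant to remove.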
Euclidean distance is the straight-line distance between two points in Euclidean space. It is calculated as the square root of the sum of squared differences between corresponding feature values. Euclidean distance is highly sensitive to the scale of each feature: features with larger ranges contribute more to the total distance, which can distort model behavior if features are not properly scaled.
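The definition above can be sketched directly in code. The snippet below, which uses two hypothetical points with features [height in cm, weight in kg], computes the squared difference for each feature, the resulting Euclidean distance, and the share of the squared distance that each feature contributes.

```python
import numpy as np

# Hypothetical points: [height in cm, weight in kg].
point1 = np.array([170.0, 65.0])
point2 = np.array([180.0, 85.0])

# Squared difference per feature: the terms summed inside the square root.
squared_diffs = (point1 - point2) ** 2

# Euclidean distance: square root of the sum of squared differences.
distance = np.sqrt(squared_diffs.sum())

# Fraction of the squared distance attributable to each feature.
shares = squared_diffs / squared_diffs.sum()

print("Squared differences:", squared_diffs)
print("Euclidean distance:", distance)
print("Share per feature:", shares)
```

Because each difference is squared before summing, a feature whose raw difference is twice as large contributes four times as much to the total, which is why differences in scale are amplified rather than merely carried through.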
```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two data points with features on different scales
point1 = np.array([170, 65])  # [height in cm, weight in kg]
point2 = np.array([180, 85])

# Euclidean distance before scaling
distance_before = np.linalg.norm(point1 - point2)
print("Distance before scaling:", distance_before)

# Apply standard scaling (z-score normalization)
scaler = StandardScaler()
points_scaled = scaler.fit_transform(np.vstack([point1, point2]))

# Euclidean distance after scaling
distance_after = np.linalg.norm(points_scaled[0] - points_scaled[1])
print("Distance after scaling:", distance_after)
```
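To see what standard scaling does to these two points, the same z-score computation can be written out by hand with NumPy. This is a sketch for illustration rather than a replacement for `StandardScaler`; it assumes the population standard deviation (`ddof=0`), which is what `StandardScaler` uses.

```python
import numpy as np

# The same two points as above: [height in cm, weight in kg].
points = np.array([
    [170.0, 65.0],
    [180.0, 85.0],
])

# Per-feature mean and population standard deviation (ddof=0).
mean = points.mean(axis=0)
std = points.std(axis=0)

# z-score: subtract the mean, then divide by the standard deviation.
points_scaled = (points - mean) / std

distance_after = np.linalg.norm(points_scaled[0] - points_scaled[1])

print("Mean per feature:", mean)
print("Std per feature:", std)
print("Scaled points:\n", points_scaled)
print("Distance after scaling:", distance_after)
```

With only two samples, every feature is mapped to -1 and +1, so both features contribute equally to the scaled distance regardless of their original units. In practice the scaler would be fitted on the full training set, and the scaled values would then reflect how far each point lies from the mean in units of that feature's standard deviation.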