Scaling and Distance-Based Models
Distance-based machine learning models, such as k-nearest neighbors (k-NN) and clustering algorithms like k-means, rely on mathematical measures of distance to compare data points. The way features are scaled has a direct and significant impact on how these distances are computed. If one feature has a much larger range or variance than others, it will dominate the distance calculation, causing models to become biased toward that feature and potentially reducing predictive accuracy. For example, if you have two features — height in centimeters (ranging from 150 to 200) and weight in kilograms (ranging from 50 to 100) — the difference in numerical scale can cause the model to consider height much more important than weight, even if both are equally relevant.
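To make the dominance effect concrete, here is a minimal sketch (with hypothetical points, using height in millimeters to exaggerate the scale mismatch) showing that the identity of the nearest neighbor can flip once features are standardized:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical points: [height in mm, weight in kg]
query = np.array([1700, 60])
candidates = np.array([[1702, 78],   # close in height, far in weight
                       [1740, 65]])  # farther in height, close in weight

# Nearest neighbor on raw features: height differences dominate
raw_d = np.linalg.norm(candidates - query, axis=1)
print("raw distances:", raw_d, "-> nearest:", raw_d.argmin())

# Standardize all points together, then recompute the distances
scaler = StandardScaler()
all_scaled = scaler.fit_transform(np.vstack([query, candidates]))
scaled_d = np.linalg.norm(all_scaled[1:] - all_scaled[0], axis=1)
print("scaled distances:", scaled_d, "-> nearest:", scaled_d.argmin())
```

On the raw features the first candidate wins purely because its height is closer; after standardization, both features contribute comparably and the second candidate becomes the nearest neighbor.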
Euclidean distance is the straight-line distance between two points in Euclidean space. It is calculated as the square root of the sum of squared differences between corresponding feature values. Euclidean distance is highly sensitive to the scale of each feature: features with larger ranges contribute more to the total distance, which can distort model behavior if features are not properly scaled.
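The definition above can be written directly as a small helper (a sketch, not part of any particular library) and checked against NumPy's built-in norm:

```python
import numpy as np

def euclidean(a, b):
    """Square root of the sum of squared differences between feature values."""
    a, b = np.asarray(a), np.asarray(b)
    return np.sqrt(np.sum((a - b) ** 2))

p, q = [170, 65], [180, 85]
d = euclidean(p, q)
print(d)  # sqrt(10**2 + 20**2) = sqrt(500) ≈ 22.36

# Matches NumPy's built-in Euclidean norm of the difference vector
assert np.isclose(d, np.linalg.norm(np.array(p) - np.array(q)))
```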
```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two data points with features on different scales
point1 = np.array([170, 65])  # [height in cm, weight in kg]
point2 = np.array([180, 85])

# Euclidean distance before scaling
distance_before = np.linalg.norm(point1 - point2)
print("Distance before scaling:", distance_before)

# Apply standard scaling (z-score normalization)
scaler = StandardScaler()
points_scaled = scaler.fit_transform(np.vstack([point1, point2]))

# Euclidean distance after scaling
distance_after = np.linalg.norm(points_scaled[0] - points_scaled[1])
print("Distance after scaling:", distance_after)
```