Feature Scaling and Normalization Deep Dive

Scaling and Gradient Descent

When you use gradient descent to optimize a machine learning model, the shape of the loss surface is crucial for determining how quickly and effectively the algorithm converges to a minimum. If your features are not scaled, those with larger ranges will dominate the loss function, causing the contours of the loss surface to become elongated and skewed. This distortion leads to inefficient optimization paths, where gradient descent zig-zags or takes tiny steps in some directions and much larger steps in others. As a result, convergence becomes much slower, and the optimizer may even get stuck or fail to reach the true minimum. Feature scaling, such as standardization or normalization, transforms the data so that all features contribute equally. This produces a more spherical loss surface, allowing gradient descent to move efficiently and directly toward the minimum.
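Since the paragraph above mentions standardization and normalization, here is a minimal sketch of both transforms, assuming scikit-learn is available; the feature values are made up purely for illustration:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical data: one feature in the thousands, one in single digits
X = np.array([[1200.0, 3.1],
              [1500.0, 2.7],
              [ 900.0, 4.5],
              [2000.0, 1.9]])

# Standardization: zero mean, unit variance per feature
X_standardized = StandardScaler().fit_transform(X)

# Normalization (min-max): rescale each feature to the [0, 1] range
X_normalized = MinMaxScaler().fit_transform(X)

print("Standardized:\n", X_standardized)
print("Min-max normalized:\n", X_normalized)

After either transform, both features vary over comparable ranges, so neither one dominates the loss function.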

Note

Analogy: imagine hiking down a steep, narrow canyon (unscaled features) versus rolling down a smooth, round hill (scaled features). In the canyon, you must zig-zag and carefully pick your steps to avoid obstacles, making your journey slow and indirect. On the hill, you can move straight toward the bottom, reaching your goal much faster. Scaling features reshapes the optimization landscape from a canyon to a hill, making gradient descent more efficient.

import numpy as np
import matplotlib.pyplot as plt

# Create a synthetic loss surface for two features
def loss_surface(w1, w2, scale_x=1, scale_y=10):
    return (scale_x * w1)**2 + (scale_y * w2)**2

w1 = np.linspace(-2, 2, 100)
w2 = np.linspace(-2, 2, 100)
W1, W2 = np.meshgrid(w1, w2)

# Unscaled (features have different variances)
Z_unscaled = loss_surface(W1, W2, scale_x=1, scale_y=10)

# Scaled (features have same variance)
Z_scaled = loss_surface(W1, W2, scale_x=1, scale_y=1)

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

axes[0].contour(W1, W2, Z_unscaled, levels=20, cmap='viridis')
axes[0].set_title('Unscaled Features (Elongated Contours)')
axes[0].set_xlabel('Weight 1')
axes[0].set_ylabel('Weight 2')

axes[1].contour(W1, W2, Z_scaled, levels=20, cmap='viridis')
axes[1].set_title('Scaled Features (Circular Contours)')
axes[1].set_xlabel('Weight 1')
axes[1].set_ylabel('Weight 2')

plt.tight_layout()
plt.show()
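To make the convergence difference concrete, the sketch below runs plain gradient descent on the same two quadratic surfaces and counts the iterations each needs to approach the minimum. The learning rates, tolerance, and starting point are arbitrary illustrative choices: each run uses roughly the largest step size that stays stable given the steepest direction of its surface.

import numpy as np

def gradient(w, scale_x=1, scale_y=10):
    # Gradient of (scale_x * w1)**2 + (scale_y * w2)**2
    return np.array([2 * scale_x**2 * w[0], 2 * scale_y**2 * w[1]])

def run_gd(scale_x, scale_y, lr, tol=1e-6, max_iter=10000):
    w = np.array([2.0, 2.0])  # same starting point for both runs
    for i in range(max_iter):
        w = w - lr * gradient(w, scale_x, scale_y)
        if np.linalg.norm(w) < tol:  # the minimum is at (0, 0)
            return i + 1
    return max_iter

# The safe step size is limited by the steepest direction (curvature 2 * scale**2)
steps_unscaled = run_gd(scale_x=1, scale_y=10, lr=0.9 / (2 * 10**2))
steps_scaled = run_gd(scale_x=1, scale_y=1, lr=0.9 / (2 * 1**2))

print(f"Unscaled surface: {steps_unscaled} iterations")
print(f"Scaled surface:   {steps_scaled} iterations")

On the elongated surface the step size must stay small to avoid diverging along the steep direction, so progress along the flat direction takes thousands of iterations; on the circular surface the same rule allows a much larger step, and gradient descent reaches the minimum in a handful of iterations.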

Which of the following best describes the effect of feature scaling on gradient descent optimization?

Select the correct answer
