Stochastic Gradients: Expectation and Variance
To understand the mathematics behind stochastic gradients, begin with a standard supervised learning setting. Suppose you have a loss function L(w;x,y) parameterized by weights w and data samples (x,y). The full gradient of the empirical risk over a dataset of N samples is given by the average of the gradients over all data points:
$$\nabla_w J(w) = \frac{1}{N} \sum_{i=1}^{N} \nabla_w L(w; x_i, y_i)$$
In practice, computing this full gradient at each optimization step is computationally expensive for large datasets. Stochastic Gradient Descent (SGD) approximates the full gradient by randomly sampling a subset (mini-batch) of size $m$ from the data and computing the gradient estimate:
$$g_{\text{mini-batch}}(w) = \frac{1}{m} \sum_{j=1}^{m} \nabla_w L(w; x_{i_j}, y_{i_j})$$
where each index $i_j$ is randomly chosen from $\{1, 2, \dots, N\}$.
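To make the estimator concrete, here is a minimal sketch, assuming an illustrative linear model with squared-error loss $L(w; x, y) = \tfrac{1}{2}(w^\top x - y)^2$ (this particular loss, the synthetic data, and the helper `per_sample_gradients` are assumptions for the example, not part of the lesson). It computes the full gradient over all $N$ samples and a mini-batch estimate from $m$ randomly chosen samples:

```python
import numpy as np

# Illustrative setup (assumed for this sketch): linear model with squared-error loss
# L(w; x, y) = 0.5 * (w @ x - y)**2, whose per-sample gradient is (w @ x - y) * x.
rng = np.random.default_rng(0)
N, d = 1000, 5
X = rng.normal(size=(N, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=N)
w = np.zeros(d)

def per_sample_gradients(w, X, y):
    residuals = X @ w - y              # shape (N,)
    return residuals[:, None] * X      # shape (N, d): one gradient row per sample

# Full gradient: average over all N samples
full_grad = per_sample_gradients(w, X, y).mean(axis=0)

# Mini-batch estimate: average over m randomly chosen samples
m = 32
idx = rng.choice(N, size=m, replace=False)
minibatch_grad = per_sample_gradients(w, X[idx], y[idx]).mean(axis=0)

print("full gradient:      ", np.round(full_grad, 3))
print("mini-batch estimate:", np.round(minibatch_grad, 3))
```

The two printed vectors point in similar directions but do not match exactly; that gap is precisely the stochasticity quantified by the expectation and variance results below.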
The expected value of this mini-batch gradient is equal to the true gradient, assuming the samples are drawn uniformly at random:
$$\mathbb{E}\left[g_{\text{mini-batch}}(w)\right] = \nabla_w J(w)$$
However, the variance of the mini-batch gradient quantifies how much the stochastic estimate fluctuates around the true gradient. For independent sampling, the variance of the mini-batch gradient is:
$$\operatorname{Var}\left[g_{\text{mini-batch}}(w)\right] = \frac{1}{m} \operatorname{Var}\left[\nabla_w L(w; x, y)\right]$$
This shows that increasing the batch size $m$ reduces the variance of the gradient estimate, making the update direction more stable, while smaller batches lead to noisier updates.
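Both facts follow from basic properties of expectation and variance. A short derivation, under the simplifying assumption that the $m$ indices are drawn i.i.d. uniformly from the dataset (so each sampled gradient is an independent copy of $\nabla_w L(w; x, y)$):

$$\mathbb{E}\left[g_{\text{mini-batch}}(w)\right] = \frac{1}{m} \sum_{j=1}^{m} \mathbb{E}\left[\nabla_w L(w; x_{i_j}, y_{i_j})\right] = \frac{1}{m} \sum_{j=1}^{m} \nabla_w J(w) = \nabla_w J(w)$$

$$\operatorname{Var}\left[g_{\text{mini-batch}}(w)\right] = \frac{1}{m^{2}} \sum_{j=1}^{m} \operatorname{Var}\left[\nabla_w L(w; x_{i_j}, y_{i_j})\right] = \frac{m}{m^{2}} \operatorname{Var}\left[\nabla_w L(w; x, y)\right] = \frac{1}{m} \operatorname{Var}\left[\nabla_w L(w; x, y)\right]$$

The first chain uses linearity of expectation and the fact that a uniformly drawn sample has expected gradient $\nabla_w J(w)$; the second uses independence, so the variance of the average decomposes into the sum of per-sample variances divided by $m^2$. Sampling without replacement adds a finite-population correction factor $(N-m)/(N-1)$, but for $m \ll N$ the $1/m$ rule is an accurate approximation.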
Stochasticity in gradient estimates injects noise into the optimization process. This noise can help the optimizer escape shallow local minima and saddle points by providing enough randomness to avoid getting stuck, potentially leading to better solutions in non-convex landscapes.
```python
import numpy as np
import matplotlib.pyplot as plt

# Simulate a dataset: per-sample gradients of a simple quadratic loss with random noise
np.random.seed(0)
N = 1000                # total data points
true_gradient = 2.0     # true (full-batch) gradient shared by all points
noise_std = 1.0         # standard deviation of per-sample gradient noise
gradients = true_gradient + np.random.randn(N) * noise_std

batch_sizes = [1, 4, 16, 64, 256]
variances = []

for m in batch_sizes:
    batch_grads = []
    for _ in range(1000):
        # Draw a mini-batch without replacement and average its gradients
        batch = np.random.choice(gradients, size=m, replace=False)
        batch_grads.append(np.mean(batch))
    # Empirical variance of the mini-batch gradient estimate for this batch size
    variances.append(np.var(batch_grads))

plt.figure(figsize=(7, 4))
plt.plot(batch_sizes, variances, marker='o')
plt.xscale('log')
plt.yscale('log')
plt.xlabel('Mini-batch size (log scale)')
plt.ylabel('Gradient estimate variance (log scale)')
plt.title('Variance of Stochastic Gradient vs. Batch Size')
plt.grid(True, which="both", ls="--")
plt.show()
```
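If the simulation behaves as expected, the points on the log-log plot should lie roughly on a straight line of slope $-1$, consistent with $\operatorname{Var}[g] \approx \texttt{noise\_std}^2 / m = 1/m$. Because each mini-batch is drawn without replacement from a finite pool of 1000 gradients, the measured variance for the largest batch sizes will sit slightly below $1/m$, reflecting the finite-population correction noted above.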