Stochastic Gradients: Expectation and Variance
To understand the mathematics behind stochastic gradients, begin with a standard supervised learning setting. Suppose you have a loss function L(w;x,y) parameterized by weights w and data samples (x,y). The full gradient of the empirical risk over a dataset of N samples is given by the average of the gradients over all data points:
$$\nabla_w J(w) = \frac{1}{N} \sum_{i=1}^{N} \nabla_w L(w; x_i, y_i)$$
In practice, computing this full gradient at each optimization step is computationally expensive for large datasets. Stochastic Gradient Descent (SGD) approximates the full gradient by randomly sampling a subset (mini-batch) of size $m$ from the data and computing the gradient estimate:
$$g_{\text{mini-batch}}(w) = \frac{1}{m} \sum_{j=1}^{m} \nabla_w L(w; x_{i_j}, y_{i_j})$$
where each index $i_j$ is randomly chosen from $\{1, 2, \dots, N\}$.
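To make the estimator concrete, here is a minimal sketch, assuming an illustrative linear model with squared-error loss $L(w; x, y) = \tfrac{1}{2}(w^\top x - y)^2$ (this particular loss, the synthetic data, and the helper `per_sample_gradients` are assumptions for the example, not part of the lesson). It computes the full gradient over all $N$ samples and a mini-batch estimate from $m$ randomly chosen samples:

```python
import numpy as np

# Illustrative setup (assumed for this sketch): linear model with squared-error loss
# L(w; x, y) = 0.5 * (w @ x - y)**2, whose per-sample gradient is (w @ x - y) * x.
rng = np.random.default_rng(0)
N, d = 1000, 5
X = rng.normal(size=(N, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=N)
w = np.zeros(d)

def per_sample_gradients(w, X, y):
    residuals = X @ w - y              # shape (N,)
    return residuals[:, None] * X      # shape (N, d): one gradient row per sample

# Full gradient: average over all N samples
full_grad = per_sample_gradients(w, X, y).mean(axis=0)

# Mini-batch estimate: average over m randomly chosen samples
m = 32
idx = rng.choice(N, size=m, replace=False)
minibatch_grad = per_sample_gradients(w, X[idx], y[idx]).mean(axis=0)

print("full gradient:      ", np.round(full_grad, 3))
print("mini-batch estimate:", np.round(minibatch_grad, 3))
```

The two printed vectors point in similar directions but do not match exactly; that gap is precisely the stochasticity quantified by the expectation and variance results below.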
The expected value of this mini-batch gradient is equal to the true gradient, assuming the samples are drawn uniformly at random:
$$\mathbb{E}\left[g_{\text{mini-batch}}(w)\right] = \nabla_w J(w)$$
However, the variance of the mini-batch gradient quantifies how much the stochastic estimate fluctuates around the true gradient. For independent sampling, the variance of the mini-batch gradient is:
$$\operatorname{Var}\left[g_{\text{mini-batch}}(w)\right] = \frac{1}{m} \operatorname{Var}\left[\nabla_w L(w; x, y)\right]$$
This shows that increasing the batch size $m$ reduces the variance of the gradient estimate, making the update direction more stable, while smaller batches lead to noisier updates.
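Both facts follow from basic properties of expectation and variance. A short derivation, under the simplifying assumption that the $m$ indices are drawn i.i.d. uniformly from the dataset (so each sampled gradient is an independent copy of $\nabla_w L(w; x, y)$):

$$\mathbb{E}\left[g_{\text{mini-batch}}(w)\right] = \frac{1}{m} \sum_{j=1}^{m} \mathbb{E}\left[\nabla_w L(w; x_{i_j}, y_{i_j})\right] = \frac{1}{m} \sum_{j=1}^{m} \nabla_w J(w) = \nabla_w J(w)$$

$$\operatorname{Var}\left[g_{\text{mini-batch}}(w)\right] = \frac{1}{m^{2}} \sum_{j=1}^{m} \operatorname{Var}\left[\nabla_w L(w; x_{i_j}, y_{i_j})\right] = \frac{m}{m^{2}} \operatorname{Var}\left[\nabla_w L(w; x, y)\right] = \frac{1}{m} \operatorname{Var}\left[\nabla_w L(w; x, y)\right]$$

The first chain uses linearity of expectation and the fact that a uniformly drawn sample has expected gradient $\nabla_w J(w)$; the second uses independence, so the variance of the average decomposes into the sum of per-sample variances divided by $m^2$. Sampling without replacement adds a finite-population correction factor $(N-m)/(N-1)$, but for $m \ll N$ the $1/m$ rule is an accurate approximation.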
Stochasticity in gradient estimates injects noise into the optimization process. This noise can help the optimizer escape shallow local minima and saddle points by providing enough randomness to avoid getting stuck, potentially leading to better solutions in non-convex landscapes.
```python
import numpy as np
import matplotlib.pyplot as plt

# Simulate a dataset: per-sample gradients of a simple quadratic loss with random noise
np.random.seed(0)
N = 1000                # total data points
true_gradient = 2.0     # true (full-batch) gradient shared by all points
noise_std = 1.0         # standard deviation of per-sample gradient noise
gradients = true_gradient + np.random.randn(N) * noise_std

batch_sizes = [1, 4, 16, 64, 256]
variances = []

for m in batch_sizes:
    batch_grads = []
    for _ in range(1000):
        # Draw a mini-batch without replacement and average its gradients
        batch = np.random.choice(gradients, size=m, replace=False)
        batch_grads.append(np.mean(batch))
    # Empirical variance of the mini-batch gradient estimate for this batch size
    variances.append(np.var(batch_grads))

plt.figure(figsize=(7, 4))
plt.plot(batch_sizes, variances, marker='o')
plt.xscale('log')
plt.yscale('log')
plt.xlabel('Mini-batch size (log scale)')
plt.ylabel('Gradient estimate variance (log scale)')
plt.title('Variance of Stochastic Gradient vs. Batch Size')
plt.grid(True, which="both", ls="--")
plt.show()
```
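If the simulation behaves as expected, the points on the log-log plot should lie roughly on a straight line of slope $-1$, consistent with $\operatorname{Var}[g] \approx \texttt{noise\_std}^2 / m = 1/m$. Because each mini-batch is drawn without replacement from a finite pool of 1000 gradients, the measured variance for the largest batch sizes will sit slightly below $1/m$, reflecting the finite-population correction noted above.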