Stochastic Gradients: Expectation and Variance
To understand the mathematics behind stochastic gradients, begin with a standard supervised learning setting. Suppose you have a loss function L(w; x, y) that depends on the model weights w and a single data sample (x, y). The full gradient of the empirical risk over a dataset of N samples is the average of the gradients over all data points:
$$\nabla_w J(w) = \frac{1}{N}\sum_{i=1}^{N} \nabla_w L(w; x_i, y_i)$$

In practice, computing this full gradient at each optimization step is computationally expensive for large datasets. Stochastic Gradient Descent (SGD) approximates the full gradient by randomly sampling a subset (mini-batch) of size m from the data, and computing the gradient estimate:
$$g_{\text{mini-batch}}(w) = \frac{1}{m}\sum_{j=1}^{m} \nabla_w L(w; x_{i_j}, y_{i_j})$$

where each index i_j is randomly chosen from {1, 2, ..., N}.
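To make the two quantities concrete, here is a minimal sketch; the toy linear model with squared-error loss, the dataset, and the batch size m = 32 are illustrative assumptions, not part of the lesson. It computes the full gradient and one mini-batch estimate and prints both.

```python
import numpy as np

# Illustrative setup: linear model y ≈ X @ w with squared-error loss
# L(w; x, y) = 0.5 * (x @ w - y)**2, so ∇_w L = (x @ w - y) * x.
rng = np.random.default_rng(0)
N, d = 1000, 5
X = rng.normal(size=(N, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=N)
w = np.zeros(d)  # current parameters

def full_gradient(w):
    # Average of the per-sample gradients over all N points: ∇_w J(w)
    residuals = X @ w - y
    return X.T @ residuals / N

def minibatch_gradient(w, m):
    # Same average, but over m randomly chosen indices i_1, ..., i_m
    idx = rng.choice(N, size=m, replace=False)
    residuals = X[idx] @ w - y[idx]
    return X[idx].T @ residuals / m

print("full gradient      :", np.round(full_gradient(w), 3))
print("mini-batch (m = 32):", np.round(minibatch_gradient(w, m=32), 3))
```

Each call to minibatch_gradient returns a slightly different vector that fluctuates around the full gradient, which is exactly the behavior the next two results quantify.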
The expected value of this mini-batch gradient is equal to the true gradient, assuming the samples are drawn uniformly at random:
$$\mathbb{E}\big[g_{\text{mini-batch}}(w)\big] = \nabla_w J(w)$$
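This equality is a direct consequence of linearity of expectation: since each index i_j is drawn uniformly at random from {1, 2, ..., N},

$$\mathbb{E}\big[g_{\text{mini-batch}}(w)\big] = \frac{1}{m}\sum_{j=1}^{m}\mathbb{E}\big[\nabla_w L(w; x_{i_j}, y_{i_j})\big] = \frac{1}{m}\sum_{j=1}^{m}\frac{1}{N}\sum_{i=1}^{N}\nabla_w L(w; x_i, y_i) = \nabla_w J(w).$$

In other words, the mini-batch gradient is an unbiased estimator of the full gradient.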
However, the variance of the mini-batch gradient quantifies how much the stochastic estimate fluctuates around the true gradient. For independent sampling, the variance of the mini-batch gradient is:

$$\mathrm{Var}\big[g_{\text{mini-batch}}(w)\big] = \frac{1}{m}\,\mathrm{Var}\big[\nabla_w L(w; x, y)\big]$$

This shows that increasing the batch size m reduces the variance of the gradient estimate, making the update direction more stable, while smaller batches lead to noisier updates.
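For a concrete sense of scale, plug in the per-sample gradient variance σ² = 1 used in the simulation below:

$$\frac{\sigma^2}{m}\bigg|_{m=1} = 1,\qquad \frac{\sigma^2}{m}\bigg|_{m=16} \approx 0.063,\qquad \frac{\sigma^2}{m}\bigg|_{m=256} \approx 0.0039.$$

Note that the standard deviation of the estimate shrinks only as 1/√m, so quadrupling the batch size merely halves the typical size of the gradient noise.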
Stochasticity in gradient estimates injects noise into the optimization process. This noise can help the optimizer escape shallow local minima and saddle points rather than getting stuck in them, potentially leading to better solutions in non-convex landscapes; a toy demonstration follows the variance simulation below.
```python
import numpy as np
import matplotlib.pyplot as plt

# Simulate a dataset: per-sample gradients of a simple quadratic loss with random noise
np.random.seed(0)
N = 1000              # total data points
true_gradient = 2.0   # true gradient for all points
noise_std = 1.0       # per-sample noise, so Var[per-sample gradient] = 1
gradients = true_gradient + np.random.randn(N) * noise_std

batch_sizes = [1, 4, 16, 64, 256]
variances = []

for m in batch_sizes:
    # Draw 1000 mini-batches of size m and record the variance of their mean gradients
    batch_grads = []
    for _ in range(1000):
        # Independent draws (with replacement) match the Var = sigma^2 / m formula above
        batch = np.random.choice(gradients, size=m, replace=True)
        batch_grads.append(np.mean(batch))
    variances.append(np.var(batch_grads))

plt.figure(figsize=(7, 4))
plt.plot(batch_sizes, variances, marker='o')
plt.xscale('log')
plt.yscale('log')
plt.xlabel('Mini-batch size (log scale)')
plt.ylabel('Gradient estimate variance (log scale)')
plt.title('Variance of Stochastic Gradient vs. Batch Size')
plt.grid(True, which="both", ls="--")
plt.show()
```
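On the log-log plot this produces, the measured variances trace out the expected 1/m trend. To illustrate the earlier point about noise helping the optimizer escape shallow local minima, here is a second minimal sketch; the one-dimensional double-well objective f(x) = x⁴ − 3x² + x, the step size, the noise level, and the starting point are all illustrative assumptions. Plain gradient descent started near the shallow minimum stays there, while a large fraction of the noisy-gradient runs typically hop over the barrier into the deeper minimum.

```python
import numpy as np

# Illustrative (assumed) 1-D non-convex objective: a double well with a shallow
# minimum near x ≈ 1.13 (f ≈ -1.07) and a deeper one near x ≈ -1.30 (f ≈ -3.51).
def grad_f(x):
    return 4 * x**3 - 6 * x + 1   # derivative of f(x) = x**4 - 3*x**2 + x

rng = np.random.default_rng(1)
lr, steps, noise_std, n_runs = 0.05, 1000, 4.0, 200

# Deterministic gradient descent from x = 1.5: follows the exact gradient and
# settles into the nearby shallow minimum.
x_det = 1.5
for _ in range(steps):
    x_det -= lr * grad_f(x_det)

# Noisy ("stochastic-style") gradient descent: same start, but each step adds
# zero-mean Gaussian noise to the gradient; run many independent trajectories.
x_sgd = np.full(n_runs, 1.5)
for _ in range(steps):
    x_sgd -= lr * (grad_f(x_sgd) + noise_std * rng.normal(size=n_runs))

escaped = np.mean(np.abs(x_sgd + 1.30) < 0.5)  # fraction ending near the deep minimum
print(f"deterministic GD ends at x = {x_det:.2f} (the shallow minimum)")
print(f"noisy GD: {escaped:.0%} of {n_runs} runs end near the deeper minimum x ≈ -1.30")
```

The noise level here is deliberately large relative to the gradient; with a much smaller noise_std (the analogue of a larger batch), most noisy runs also remain trapped in the shallow minimum, mirroring the variance trade-off above.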