Batch Size and Convergence
The choice of batch size is crucial in stochastic optimization. When you train machine learning models with gradient-based methods, you must decide how much data to process at each optimization step. This decision affects both the statistical properties of the gradients and the computational efficiency of your algorithm.
Suppose you have a dataset with $N$ samples and a loss function $L(\theta)$ parameterized by model parameters $\theta$. The true gradient is the average of the per-sample gradients over all data points:
$$
\nabla L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \nabla l_i(\theta)
$$

where $l_i(\theta)$ is the loss for the $i$-th sample. In practice, computing this sum can be expensive for large datasets. Instead, you estimate the gradient using a mini-batch of $m$ samples:
$$
\hat{\nabla} L(\theta) = \frac{1}{m} \sum_{j=1}^{m} \nabla l_{i_j}(\theta)
$$

where the indices $i_j$ are drawn at random from the dataset. As you increase the batch size $m$, the variance of the gradient estimate shrinks roughly in proportion to $1/m$ (its standard deviation in proportion to $1/\sqrt{m}$), making the updates more stable and less noisy. However, larger batches cost more computation per step, and the variance reduction eventually brings diminishing returns.
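To see this scaling empirically, here is a minimal sketch (separate from the lesson's demo further below) that builds a toy linear-regression dataset, computes the full-dataset gradient of a squared-error loss with respect to a single parameter, and compares it with the spread of mini-batch estimates. The dataset, sizes, and names are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: y = 2x + noise; per-sample loss l_i(theta) = (theta * x_i - y_i)^2
N = 10_000
x = rng.normal(size=N)
y = 2.0 * x + rng.normal(scale=0.5, size=N)

theta = 0.0
grads = 2.0 * (theta * x - y) * x     # per-sample gradients d l_i / d theta, shape (N,)
full_grad = grads.mean()              # the "true" gradient over the whole dataset

for m in [1, 8, 32, 128, 512]:
    # Draw many independent mini-batches and measure how much their estimates vary
    estimates = [grads[rng.choice(N, size=m, replace=False)].mean() for _ in range(1000)]
    print(f"batch size {m:4d}: std of estimate = {np.std(estimates):.3f}, "
          f"full gradient = {full_grad:.3f}")
```

Quadrupling the batch size should roughly halve the printed standard deviation, matching the $1/\sqrt{m}$ behavior described above.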
The trade-off is as follows:
- Small batch sizes:
  - High variance in gradient estimates;
  - Faster, more frequent parameter updates;
  - Potential for better generalization due to gradient noise;
  - Less efficient use of hardware (e.g., GPUs).
- Large batch sizes:
  - Low variance in gradient estimates;
  - Fewer updates per epoch, each more expensive to compute;
  - May require higher learning rates to maintain progress;
  - Better hardware utilization, though with diminishing returns past a certain point.
In summary, the batch size controls the balance between the stochastic noise in your optimization trajectory and the computational cost per update.
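As a rough illustration of that balance, the sketch below reuses the noisy quadratic setup from the demo at the end of this lesson, but fixes a total budget of processed samples rather than a fixed number of steps, so a small batch gets many noisy updates and a large batch gets only a few clean ones. The $1/\sqrt{m}$ noise model and all constants are assumptions of the toy example, not properties of any real dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

def grad(theta):
    # Gradient of the toy quadratic loss L(theta) = (theta - 3)^2
    return 2.0 * (theta - 3.0)

def final_loss(batch_size, budget=256, lr=0.1, noise_scale=4.0):
    """Run SGD until `budget` samples have been processed, return the final loss."""
    theta = 0.0
    steps = budget // batch_size              # larger batches => fewer updates
    for _ in range(steps):
        noise = rng.normal() * noise_scale / np.sqrt(batch_size)
        theta -= lr * (grad(theta) + noise)
    return (theta - 3.0) ** 2

for bs in [1, 8, 32, 128]:
    avg = np.mean([final_loss(bs) for _ in range(200)])  # average over repeated runs
    print(f"batch size {bs:4d}: {256 // bs:4d} updates, mean final loss = {avg:.3f}")
```

With this particular budget, the smallest batch is limited by gradient noise and the largest by too few updates, so an intermediate batch size tends to land closest to the minimum; where the crossover sits depends on the budget, the learning rate, and the noise level.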
Practical considerations for choosing batch size:
- Start with a batch size that fits comfortably in your hardware's memory;
- Monitor training speed and convergence; if training is erratic, consider increasing the batch size;
- Very large batches can require learning rate adjustments (a common scaling heuristic is sketched after this list);
- There is often a "sweet spot" where increasing the batch size further yields little improvement;
- For very large datasets, mini-batch sizes of 32, 64, or 128 are commonly effective.
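One widely used heuristic for that learning rate adjustment is the linear scaling rule: if you multiply the batch size by some factor, multiply the learning rate by the same factor, often together with a short warmup period. The snippet below is only a sketch of that heuristic with made-up baseline values; whether the scaled rate actually works still has to be verified on your own model.

```python
def scaled_learning_rate(base_lr, base_batch_size, batch_size):
    """Linear scaling heuristic: the learning rate grows in proportion to the batch size."""
    return base_lr * batch_size / base_batch_size

# Hypothetical baseline: lr = 0.1 tuned at batch size 32
for bs in [32, 64, 128, 256, 1024]:
    print(f"batch size {bs:4d} -> suggested lr = {scaled_learning_rate(0.1, 32, bs):.2f}")
```

The demo below ties these points together: it runs SGD on a toy quadratic loss whose gradient noise shrinks like $1/\sqrt{m}$, then plots the optimization path and the loss curve for several batch sizes.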
```python
import numpy as np
import matplotlib.pyplot as plt

# Quadratic loss: L(theta) = (theta - 3)^2
def loss(theta):
    return (theta - 3) ** 2

def grad(theta):
    return 2 * (theta - 3)

# Simulate stochastic gradients with noise depending on batch size
def noisy_grad(theta, batch_size, noise_scale=4.0):
    noise = np.random.randn() * (noise_scale / np.sqrt(batch_size))
    return grad(theta) + noise

def run_optimization(batch_size, steps=30, lr=0.1):
    thetas = [0.0]
    for _ in range(steps):
        g = noisy_grad(thetas[-1], batch_size)
        thetas.append(thetas[-1] - lr * g)
    return np.array(thetas)

# Simulation
np.random.seed(42)
batch_sizes = [1, 8, 32, 128]
trajectories = [run_optimization(bs) for bs in batch_sizes]
colors = plt.cm.viridis(np.linspace(0.1, 0.9, len(batch_sizes)))

# Plot
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
theta_vals = np.linspace(-1, 6, 200)
target = 3.0

# --- Left: trajectories on the loss curve ---
axes[0].plot(theta_vals, loss(theta_vals), 'k--', alpha=0.7, label="Loss surface")
for thetas, bs, c in zip(trajectories, batch_sizes, colors):
    axes[0].plot(thetas, loss(thetas), marker='o', color=c, label=f"Batch size {bs}")
axes[0].axvline(target, color='gray', linestyle=':', label="True minimum θ=3")
axes[0].set_xlabel("Theta")
axes[0].set_ylabel("Loss")
axes[0].set_title("Optimization Path on Loss Surface")
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# --- Right: convergence of loss over iterations ---
for thetas, bs, c in zip(trajectories, batch_sizes, colors):
    losses = loss(thetas)
    axes[1].plot(losses, marker='o', color=c, label=f"Batch size {bs}")
axes[1].set_xlabel("Iteration")
axes[1].set_ylabel("Loss")
axes[1].set_title("Convergence of Loss over Time")
axes[1].set_yscale("log")  # log scale shows convergence more clearly
axes[1].grid(True, alpha=0.3)

fig.suptitle("Stochastic Gradient Descent with Different Batch Sizes", fontsize=14)
fig.tight_layout()
plt.show()
```