Batch Size and Convergence
The choice of batch size is crucial in stochastic optimization. When you train machine learning models with gradient-based methods, you must decide how much data to process at each optimization step. This decision affects both the statistical properties of the gradients and the computational efficiency of your algorithm.
Suppose you have a dataset with N samples and a loss function L(θ) parameterized by model parameters θ. The true gradient is the average over all data points:
$$
\nabla L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \nabla l_i(\theta)
$$

where $l_i(\theta)$ is the loss for the $i$-th sample. In practice, computing this sum can be expensive for large datasets. Instead, you estimate the gradient using a mini-batch of $m$ samples:
$$
\hat{\nabla} L(\theta) = \frac{1}{m} \sum_{j=1}^{m} \nabla l_{i_j}(\theta)
$$

where the indices $i_j$ are drawn at random from the dataset. As you increase the batch size $m$, the variance of the gradient estimate decreases (for independently drawn samples it shrinks roughly in proportion to $1/m$, so its standard deviation falls as $1/\sqrt{m}$), making the updates more stable and less noisy. However, larger batches mean more computation per step, and the variance reduction shows diminishing returns.
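To see this behavior concretely, the short sketch below (a toy example, not tied to any particular framework) compares mini-batch gradient estimates against the full-batch gradient on a small synthetic regression problem. The data, the evaluation point, and the `per_sample_grads` helper are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data: y = 2x + 1 + noise, squared-error loss per sample
N = 10_000
x = rng.normal(size=N)
y = 2.0 * x + 1.0 + 0.5 * rng.normal(size=N)

def per_sample_grads(theta, idx):
    """Per-sample gradients of 0.5 * (w*x + b - y)^2 with respect to (w, b)."""
    w, b = theta
    resid = w * x[idx] + b - y[idx]
    return np.stack([resid * x[idx], resid], axis=1)  # shape (len(idx), 2)

theta = np.array([0.0, 0.0])  # arbitrary point at which to evaluate the gradient
full_grad = per_sample_grads(theta, np.arange(N)).mean(axis=0)

# Repeat the mini-batch estimate many times and measure its bias and spread
for m in [1, 8, 32, 128]:
    estimates = np.array([
        per_sample_grads(theta, rng.choice(N, size=m, replace=False)).mean(axis=0)
        for _ in range(500)
    ])
    bias = np.linalg.norm(estimates.mean(axis=0) - full_grad)
    spread = estimates.std(axis=0).mean()
    print(f"m={m:4d}  bias ≈ {bias:.3f}  std of estimate ≈ {spread:.3f}")
```

The bias term stays small for every batch size (the estimator is unbiased in expectation), while the printed standard deviation shrinks roughly in proportion to $1/\sqrt{m}$.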
The trade-off is as follows:
- Small batch sizes:
  - High variance in gradient estimates;
  - Faster, more frequent parameter updates;
  - Potential for better generalization due to noise;
  - Less efficient use of hardware (e.g., GPUs).
- Large batch sizes:
  - Low variance in gradient estimates;
  - Fewer, more expensive updates per pass over the data;
  - May require higher learning rates to maintain progress (a common scaling heuristic is sketched after this list);
  - Better hardware utilization, but diminishing returns after a certain point.
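One widely used heuristic for that learning-rate adjustment is the linear scaling rule: when the batch size grows by a factor of $k$, scale the learning rate by roughly the same factor. The snippet below is a minimal sketch of that rule; the base values are arbitrary assumptions, and the scaled rate is only a starting point that still needs validation (and often a warmup phase) on your own model.

```python
def scaled_learning_rate(base_lr, base_batch_size, batch_size):
    """Linear scaling heuristic: grow the learning rate in proportion to the batch size."""
    return base_lr * (batch_size / base_batch_size)

# Example: a configuration originally tuned at batch size 32 with lr = 0.01
for bs in [32, 64, 128, 256]:
    print(f"batch size {bs:4d} -> suggested lr {scaled_learning_rate(0.01, 32, bs):.3f}")
```

The rule is only a heuristic; for very large batches it tends to break down, consistent with the diminishing returns noted above.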
In summary, the batch size controls the balance between the stochastic noise in your optimization trajectory and the computational cost per update.
Practical considerations for choosing batch size:
- Start with a batch size that fits comfortably in your hardware's memory;
- Monitor training speed and convergence; if training is erratic, consider increasing the batch size;
- Very large batches can require learning rate adjustments;
- There is often a "sweet spot" where increasing the batch size further yields little improvement;
- For very large datasets, mini-batch sizes of 32, 64, or 128 are commonly effective.
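The simulation below ties these ideas together. It minimizes a one-dimensional quadratic loss with gradient descent and models the effect of batch size purely through additive gradient noise whose scale shrinks as $1/\sqrt{m}$ (the `noise_scale` constant is an arbitrary choice for the demonstration). The extra compute cost of larger batches is not modeled, so the comparison is per iteration rather than per unit of compute.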
```python
import numpy as np
import matplotlib.pyplot as plt

# Quadratic loss: L(theta) = (theta - 3)^2
def loss(theta):
    return (theta - 3) ** 2

def grad(theta):
    return 2 * (theta - 3)

# Simulate stochastic gradients with noise depending on batch size
def noisy_grad(theta, batch_size, noise_scale=4.0):
    noise = np.random.randn() * (noise_scale / np.sqrt(batch_size))
    return grad(theta) + noise

def run_optimization(batch_size, steps=30, lr=0.1):
    thetas = [0.0]
    for _ in range(steps):
        g = noisy_grad(thetas[-1], batch_size)
        thetas.append(thetas[-1] - lr * g)
    return np.array(thetas)

# Simulation
np.random.seed(42)
batch_sizes = [1, 8, 32, 128]
trajectories = [run_optimization(bs) for bs in batch_sizes]
colors = plt.cm.viridis(np.linspace(0.1, 0.9, len(batch_sizes)))

# Plot
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
theta_vals = np.linspace(-1, 6, 200)
target = 3.0

# --- Left: trajectories on the loss curve ---
axes[0].plot(theta_vals, loss(theta_vals), 'k--', alpha=0.7, label="Loss surface")
for thetas, bs, c in zip(trajectories, batch_sizes, colors):
    axes[0].plot(thetas, loss(thetas), marker='o', color=c, label=f"Batch size {bs}")
axes[0].axvline(target, color='gray', linestyle=':', label="True minimum θ=3")
axes[0].set_xlabel("Theta")
axes[0].set_ylabel("Loss")
axes[0].set_title("Optimization Path on Loss Surface")
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# --- Right: convergence of loss over iterations ---
for thetas, bs, c in zip(trajectories, batch_sizes, colors):
    losses = loss(thetas)
    axes[1].plot(losses, marker='o', color=c, label=f"Batch size {bs}")
axes[1].set_xlabel("Iteration")
axes[1].set_ylabel("Loss")
axes[1].set_title("Convergence of Loss over Time")
axes[1].set_yscale("log")  # show convergence more clearly
axes[1].grid(True, alpha=0.3)

fig.suptitle("Stochastic Gradient Descent with Different Batch Sizes", fontsize=14)
fig.tight_layout()
plt.show()
```