Batch Size and Convergence
The choice of batch size is crucial in stochastic optimization. When you train machine learning models with gradient-based methods, you must decide how much data to process at each optimization step. This decision affects both the statistical properties of the gradients and the computational efficiency of your algorithm.
Suppose you have a dataset with $N$ samples and a loss function $L(\theta)$ parameterized by model parameters $\theta$. The true gradient is the average of the per-sample gradients over all data points:
$$
\nabla L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \nabla l_i(\theta)
$$

where $l_i(\theta)$ is the loss for the $i$-th sample. In practice, computing this sum can be expensive for large datasets. Instead, you estimate the gradient using a mini-batch of $m$ samples:
$$
\hat{\nabla} L(\theta) = \frac{1}{m} \sum_{j=1}^{m} \nabla l_{i_j}(\theta)
$$

where the indices $i_j$ are drawn at random from the dataset. As you increase the batch size $m$, the variance of the gradient estimate shrinks roughly in proportion to $1/m$ (its standard deviation in proportion to $1/\sqrt{m}$), making the updates more stable and less noisy. However, larger batches cost more computation per step, and the variance reduction eventually brings diminishing returns.
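To see this scaling empirically, here is a minimal sketch (separate from the lesson's demo further below) that builds a toy linear-regression dataset, computes the full-dataset gradient of a squared-error loss with respect to a single parameter, and compares it with the spread of mini-batch estimates. The dataset, sizes, and names are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: y = 2x + noise; per-sample loss l_i(theta) = (theta * x_i - y_i)^2
N = 10_000
x = rng.normal(size=N)
y = 2.0 * x + rng.normal(scale=0.5, size=N)

theta = 0.0
grads = 2.0 * (theta * x - y) * x     # per-sample gradients d l_i / d theta, shape (N,)
full_grad = grads.mean()              # the "true" gradient over the whole dataset

for m in [1, 8, 32, 128, 512]:
    # Draw many independent mini-batches and measure how much their estimates vary
    estimates = [grads[rng.choice(N, size=m, replace=False)].mean() for _ in range(1000)]
    print(f"batch size {m:4d}: std of estimate = {np.std(estimates):.3f}, "
          f"full gradient = {full_grad:.3f}")
```

Quadrupling the batch size should roughly halve the printed standard deviation, matching the $1/\sqrt{m}$ behavior described above.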
The trade-off is as follows:
- Small batch sizes:
  - High variance in gradient estimates;
  - Faster, more frequent parameter updates;
  - Potential for better generalization due to gradient noise;
  - Less efficient use of hardware (e.g., GPUs).
- Large batch sizes:
  - Low variance in gradient estimates;
  - Fewer updates per epoch, each more expensive to compute;
  - May require higher learning rates to maintain progress;
  - Better hardware utilization, though with diminishing returns past a certain point.
In summary, the batch size controls the balance between the stochastic noise in your optimization trajectory and the computational cost per update.
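As a rough illustration of that balance, the sketch below reuses the noisy quadratic setup from the demo at the end of this lesson, but fixes a total budget of processed samples rather than a fixed number of steps, so a small batch gets many noisy updates and a large batch gets only a few clean ones. The $1/\sqrt{m}$ noise model and all constants are assumptions of the toy example, not properties of any real dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

def grad(theta):
    # Gradient of the toy quadratic loss L(theta) = (theta - 3)^2
    return 2.0 * (theta - 3.0)

def final_loss(batch_size, budget=256, lr=0.1, noise_scale=4.0):
    """Run SGD until `budget` samples have been processed, return the final loss."""
    theta = 0.0
    steps = budget // batch_size              # larger batches => fewer updates
    for _ in range(steps):
        noise = rng.normal() * noise_scale / np.sqrt(batch_size)
        theta -= lr * (grad(theta) + noise)
    return (theta - 3.0) ** 2

for bs in [1, 8, 32, 128]:
    avg = np.mean([final_loss(bs) for _ in range(200)])  # average over repeated runs
    print(f"batch size {bs:4d}: {256 // bs:4d} updates, mean final loss = {avg:.3f}")
```

With this particular budget, the smallest batch is limited by gradient noise and the largest by too few updates, so an intermediate batch size tends to land closest to the minimum; where the crossover sits depends on the budget, the learning rate, and the noise level.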
Practical considerations for choosing batch size:
- Start with a batch size that fits comfortably in your hardware's memory;
- Monitor training speed and convergence; if training is erratic, consider increasing the batch size;
- Very large batches can require learning rate adjustments (a common scaling heuristic is sketched after this list);
- There is often a "sweet spot" where increasing the batch size further yields little improvement;
- For very large datasets, mini-batch sizes of 32, 64, or 128 are commonly effective.
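One widely used heuristic for that learning rate adjustment is the linear scaling rule: if you multiply the batch size by some factor, multiply the learning rate by the same factor, often together with a short warmup period. The snippet below is only a sketch of that heuristic with made-up baseline values; whether the scaled rate actually works still has to be verified on your own model.

```python
def scaled_learning_rate(base_lr, base_batch_size, batch_size):
    """Linear scaling heuristic: the learning rate grows in proportion to the batch size."""
    return base_lr * batch_size / base_batch_size

# Hypothetical baseline: lr = 0.1 tuned at batch size 32
for bs in [32, 64, 128, 256, 1024]:
    print(f"batch size {bs:4d} -> suggested lr = {scaled_learning_rate(0.1, 32, bs):.2f}")
```

The demo below ties these points together: it runs SGD on a toy quadratic loss whose gradient noise shrinks like $1/\sqrt{m}$, then plots the optimization path and the loss curve for several batch sizes.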
```python
import numpy as np
import matplotlib.pyplot as plt

# Quadratic loss: L(theta) = (theta - 3)^2
def loss(theta):
    return (theta - 3) ** 2

def grad(theta):
    return 2 * (theta - 3)

# Simulate stochastic gradients with noise depending on batch size
def noisy_grad(theta, batch_size, noise_scale=4.0):
    noise = np.random.randn() * (noise_scale / np.sqrt(batch_size))
    return grad(theta) + noise

def run_optimization(batch_size, steps=30, lr=0.1):
    thetas = [0.0]
    for _ in range(steps):
        g = noisy_grad(thetas[-1], batch_size)
        thetas.append(thetas[-1] - lr * g)
    return np.array(thetas)

# Simulation
np.random.seed(42)
batch_sizes = [1, 8, 32, 128]
trajectories = [run_optimization(bs) for bs in batch_sizes]
colors = plt.cm.viridis(np.linspace(0.1, 0.9, len(batch_sizes)))

# Plot
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
theta_vals = np.linspace(-1, 6, 200)
target = 3.0

# --- Left: trajectories on the loss curve ---
axes[0].plot(theta_vals, loss(theta_vals), 'k--', alpha=0.7, label="Loss surface")
for thetas, bs, c in zip(trajectories, batch_sizes, colors):
    axes[0].plot(thetas, loss(thetas), marker='o', color=c, label=f"Batch size {bs}")
axes[0].axvline(target, color='gray', linestyle=':', label="True minimum θ=3")
axes[0].set_xlabel("Theta")
axes[0].set_ylabel("Loss")
axes[0].set_title("Optimization Path on Loss Surface")
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# --- Right: convergence of loss over iterations ---
for thetas, bs, c in zip(trajectories, batch_sizes, colors):
    losses = loss(thetas)
    axes[1].plot(losses, marker='o', color=c, label=f"Batch size {bs}")
axes[1].set_xlabel("Iteration")
axes[1].set_ylabel("Loss")
axes[1].set_title("Convergence of Loss over Time")
axes[1].set_yscale("log")  # log scale shows convergence more clearly
axes[1].grid(True, alpha=0.3)

fig.suptitle("Stochastic Gradient Descent with Different Batch Sizes", fontsize=14)
fig.tight_layout()
plt.show()
```