Batch Size and Convergence | Stochastic and Mini-Batch Methods
Mathematics of Optimization in ML

Batch Size and Convergence

The choice of batch size is crucial in stochastic optimization. When you train machine learning models with gradient-based methods, you must decide how much data to process at each optimization step. This decision affects both the statistical properties of the gradients and the computational efficiency of your algorithm.

Suppose you have a dataset with N samples and a loss function L(θ) parameterized by model parameters θ. The true gradient is the average over all data points:

\nabla L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \nabla l_i(\theta)

where l_i(θ) is the loss for the i-th sample. In practice, computing this sum can be expensive for large datasets. Instead, you estimate the gradient using a mini-batch of m samples:

\hat{\nabla} L(\theta) = \frac{1}{m} \sum_{j=1}^{m} \nabla l_{i_j}(\theta)

where i_j are randomly selected indices. As you increase the batch size m, the variance of your gradient estimate decreases, making the updates more stable and less noisy. However, larger batches mean more computation per step and can lead to diminishing returns in variance reduction.
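
To quantify this, a standard simplifying assumption (not stated in the lesson, used here only for illustration) is that the mini-batch indices are drawn independently and each per-sample gradient has variance \sigma^2 around the true gradient. Under that assumption, the variance of the estimator shrinks linearly with the batch size:

\mathrm{Var}\left[\hat{\nabla} L(\theta)\right] \approx \frac{\sigma^2}{m}

The typical noise magnitude therefore scales as \sigma / \sqrt{m}: quadrupling the batch size only halves the gradient noise, which is exactly the diminishing return mentioned above.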

The trade-off is as follows:

  • Small batch sizes:
    • High variance in gradient estimates;
    • Faster, more frequent parameter updates;
    • Potential for better generalization due to noise;
    • Less efficient use of hardware (e.g., GPUs).
  • Large batch sizes:
    • Low variance in gradient estimates;
    • Slower, less frequent updates;
    • May require higher learning rates to maintain progress;
    • Better hardware utilization, but diminishing returns after a certain point.

In summary, the batch size controls the balance between the stochastic noise in your optimization trajectory and the computational cost per update.
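
As a quick empirical check of the 1/\sqrt{m} noise scaling, the sketch below (an illustration of the idea, not part of this lesson's code; the synthetic least-squares setup and names such as per_sample_grads are assumptions) draws many independent mini-batches at a fixed parameter value and measures the spread of the resulting gradient estimates:

import numpy as np

# Synthetic 1-D least-squares data: y = 2 * x + noise.
np.random.seed(0)
N = 10_000
x = np.random.randn(N)
y = 2.0 * x + 0.5 * np.random.randn(N)

def per_sample_grads(theta, x, y):
    # Gradient of 0.5 * (theta * x_i - y_i)^2 with respect to theta, per sample.
    return (theta * x - y) * x

theta = 0.0                                   # evaluate all gradients at the same point
full_grad = per_sample_grads(theta, x, y).mean()

for m in [1, 8, 32, 128]:
    estimates = []
    for _ in range(2_000):                    # many independent mini-batches of size m
        idx = np.random.choice(N, size=m, replace=False)
        estimates.append(per_sample_grads(theta, x[idx], y[idx]).mean())
    estimates = np.array(estimates)
    print(f"m={m:4d}  mean={estimates.mean():7.3f}  std={estimates.std():.3f}  "
          f"(full gradient = {full_grad:.3f})")

The printed standard deviation should fall roughly in proportion to 1/\sqrt{m}, while the mean stays close to the full gradient: the mini-batch estimate is unbiased regardless of batch size; only its noise changes.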

Note

Practical considerations for choosing batch size:

  • Start with a batch size that fits comfortably in your hardware's memory;
  • Monitor training speed and convergence; if training is erratic, consider increasing the batch size;
  • Very large batches can require learning rate adjustments (a common scaling heuristic is sketched below);
  • There is often a "sweet spot" where increasing the batch size further yields little improvement;
  • For very large datasets, mini-batch sizes of 32, 64, or 128 are commonly effective.
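
On the learning-rate point above: a widely used heuristic for large-batch training is the linear scaling rule, which grows the learning rate in proportion to the batch size. The helper below is a hypothetical sketch of that rule, not a prescription; the right factor is problem-dependent, and very large batches often also benefit from a warmup phase:

def scaled_lr(base_lr, base_batch_size, new_batch_size):
    # Linear scaling heuristic: if the batch grows by a factor k, grow the learning rate by k.
    # Treat this as a starting point for tuning, not a guarantee.
    return base_lr * new_batch_size / base_batch_size

print(scaled_lr(0.1, 32, 128))  # a rate of 0.1 tuned for batch 32 suggests trying 0.4 for batch 128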
The simulation below runs gradient descent on a simple quadratic loss, injecting Gaussian gradient noise whose standard deviation shrinks in proportion to 1/\sqrt{m}, so you can compare the optimization paths and convergence curves for several batch sizes side by side.

import numpy as np
import matplotlib.pyplot as plt

# Quadratic loss: L(theta) = (theta - 3)^2
def loss(theta):
    return (theta - 3) ** 2

def grad(theta):
    return 2 * (theta - 3)

# Simulate stochastic gradients with noise depending on batch size
def noisy_grad(theta, batch_size, noise_scale=4.0):
    noise = np.random.randn() * (noise_scale / np.sqrt(batch_size))
    return grad(theta) + noise

def run_optimization(batch_size, steps=30, lr=0.1):
    thetas = [0.0]
    for _ in range(steps):
        g = noisy_grad(thetas[-1], batch_size)
        thetas.append(thetas[-1] - lr * g)
    return np.array(thetas)

# Simulation
np.random.seed(42)
batch_sizes = [1, 8, 32, 128]
trajectories = [run_optimization(bs) for bs in batch_sizes]
colors = plt.cm.viridis(np.linspace(0.1, 0.9, len(batch_sizes)))

# Plot
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
theta_vals = np.linspace(-1, 6, 200)
target = 3.0

# --- Left: trajectories on the loss curve ---
axes[0].plot(theta_vals, loss(theta_vals), 'k--', alpha=0.7, label="Loss surface")
for thetas, bs, c in zip(trajectories, batch_sizes, colors):
    axes[0].plot(thetas, loss(thetas), marker='o', color=c, label=f"Batch size {bs}")
axes[0].axvline(target, color='gray', linestyle=':', label="True minimum θ=3")
axes[0].set_xlabel("Theta")
axes[0].set_ylabel("Loss")
axes[0].set_title("Optimization Path on Loss Surface")
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# --- Right: convergence of loss over iterations ---
for thetas, bs, c in zip(trajectories, batch_sizes, colors):
    losses = loss(thetas)
    axes[1].plot(losses, marker='o', color=c, label=f"Batch size {bs}")
axes[1].set_xlabel("Iteration")
axes[1].set_ylabel("Loss")
axes[1].set_title("Convergence of Loss over Time")
axes[1].set_yscale("log")  # show convergence more clearly
axes[1].grid(True, alpha=0.3)

fig.suptitle("Stochastic Gradient Descent with Different Batch Sizes", fontsize=14)
fig.tight_layout()
plt.show()

