Learning Rate Scheduling Strategies
A fixed learning rate is rarely optimal for LLM training. Too high at the start causes unstable updates; too high at the end prevents convergence to a good minimum. Learning rate scheduling adjusts the rate dynamically throughout training.
Linear Warmup
Start with a near-zero learning rate and increase it linearly to the target value over the first warmup_steps steps. This gives the model time to settle into a reasonable parameter space before large gradient updates begin.
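The warmup factor can be sketched as a standalone function (the name `warmup_factor` is illustrative, not from any particular library): it is the multiplier applied to the target learning rate at a given step.

```python
def warmup_factor(step: int, warmup_steps: int) -> float:
    """Multiplier on the target learning rate during linear warmup."""
    if step >= warmup_steps:
        return 1.0  # warmup finished: use the full target rate
    return step / max(1, warmup_steps)

# With a target rate of 2e-4 and 100 warmup steps:
target_lr = 2e-4
print(target_lr * warmup_factor(0, 100))    # near-zero start: 0.0
print(target_lr * warmup_factor(50, 100))   # halfway through warmup: 1e-4
print(target_lr * warmup_factor(100, 100))  # target reached: 2e-4
```

The `max(1, warmup_steps)` guard simply avoids division by zero when warmup is disabled.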
Cosine Decay
After warmup, decay the learning rate following a cosine curve – large updates early, fine-grained adjustments later. The rate approaches zero by the end of training. This is the most widely used schedule for LLM pre-training.
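The decay multiplier follows 0.5 * (1 + cos(pi * progress)), where progress runs from 0 just after warmup to 1 at the final step. A minimal sketch (function name is illustrative):

```python
import math

def cosine_decay_factor(step: int, warmup_steps: int, total_steps: int) -> float:
    """Cosine multiplier after warmup: 1.0 right after warmup, ~0.0 at the end."""
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

print(cosine_decay_factor(100, 100, 5000))   # just after warmup: 1.0
print(cosine_decay_factor(2550, 100, 5000))  # midpoint of decay: 0.5
print(cosine_decay_factor(5000, 100, 5000))  # end of training: ~0.0
```

Because the cosine is flat near both ends, the rate changes slowly right after warmup and again near the end, with the steepest decay in the middle of training.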
Implementation
```python
import math

import torch
import torch.nn as nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

model = nn.Linear(10, 10)
optimizer = AdamW(model.parameters(), lr=2e-4)

warmup_steps = 100
total_steps = 5000

def cosine_with_warmup(step):
    if step < warmup_steps:
        # Linear warmup
        return step / max(1, warmup_steps)
    # Cosine decay
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = LambdaLR(optimizer, lr_lambda=cosine_with_warmup)

# Simulated training loop
for step in range(total_steps):
    optimizer.zero_grad()
    # Forward and backward pass would go here
    loss = model(torch.randn(4, 10)).sum()
    loss.backward()
    optimizer.step()
    scheduler.step()
    if step % 1000 == 0:
        current_lr = scheduler.get_last_lr()[0]
        print(f"Step {step:05d} - lr: {current_lr:.6f}")
```