Pre-training Large Language Models

Learning Rate Scheduling Strategies

A fixed learning rate is rarely optimal for LLM training. Too high at the start causes unstable updates; too high at the end prevents convergence to a good minimum. Learning rate scheduling adjusts the rate dynamically throughout training.

Linear Warmup

Start with a near-zero learning rate and increase it linearly to the target value over the first warmup_steps steps. This gives the model time to settle into a reasonable parameter space before large gradient updates begin.
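As a sketch, the warmup multiplier at a given step is just the ratio of elapsed steps to `warmup_steps` (the function and variable names here are illustrative, not from a library):

```python
def warmup_factor(step: int, warmup_steps: int) -> float:
    """Linear warmup: multiplier grows from 0 to 1 over the first warmup_steps."""
    if step >= warmup_steps:
        return 1.0  # warmup finished: use the full target learning rate
    return step / max(1, warmup_steps)

# The effective learning rate is the target rate times this multiplier
target_lr = 2e-4
for step in (0, 50, 100, 200):
    print(step, target_lr * warmup_factor(step, warmup_steps=100))
```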

Cosine Decay

After warmup, decay the learning rate following a cosine curve – large updates early, fine-grained adjustments later. The rate approaches zero by the end of training. This is the most widely used schedule for LLM pre-training.
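Concretely, at training progress t in [0, 1] the decay multiplier is 0.5 * (1 + cos(pi * t)), which starts at 1 and falls smoothly to 0. A minimal sketch:

```python
import math

def cosine_decay_factor(progress: float) -> float:
    """Cosine decay: multiplier is 1.0 at progress=0 and 0.0 at progress=1."""
    return 0.5 * (1.0 + math.cos(math.pi * progress))

# Decay is slow at the start and end, fastest in the middle
for p in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"{p:.2f} -> {cosine_decay_factor(p):.4f}")
```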

Implementation

```python
import math

import torch
import torch.nn as nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

model = nn.Linear(10, 10)
optimizer = AdamW(model.parameters(), lr=2e-4)

warmup_steps = 100
total_steps = 5000

def cosine_with_warmup(step):
    if step < warmup_steps:
        # Linear warmup
        return step / max(1, warmup_steps)
    # Cosine decay
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = LambdaLR(optimizer, lr_lambda=cosine_with_warmup)

# Simulating a training loop
for step in range(total_steps):
    optimizer.zero_grad()
    # Forward and backward pass would go here
    loss = model(torch.randn(4, 10)).sum()
    loss.backward()
    optimizer.step()
    scheduler.step()

    if step % 1000 == 0:
        current_lr = scheduler.get_last_lr()[0]
        print(f"Step {step:05d} - lr: {current_lr:.6f}")
```
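In practice, many LLM training recipes do not decay all the way to zero but hold the rate above a floor, often around 10% of the peak. A sketch of that variant, where `min_lr_ratio` is an illustrative parameter, not a library option:

```python
import math

warmup_steps = 100
total_steps = 5000
min_lr_ratio = 0.1  # illustrative floor: final multiplier is 10% of the peak

def cosine_with_warmup_and_floor(step):
    if step < warmup_steps:
        # Linear warmup, same as before
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    # Rescale so the multiplier ends at min_lr_ratio instead of 0
    return min_lr_ratio + (1.0 - min_lr_ratio) * cosine

# Multiplier right after warmup vs. at the very end of training
print(cosine_with_warmup_and_floor(warmup_steps))
print(cosine_with_warmup_and_floor(total_steps))
```

The same function can be passed to `LambdaLR` in place of `cosine_with_warmup` above.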

Which of the following statements about learning rate schedules is correct?

Select the correct answer
