Pre-training Large Language Models

What Are Scaling Laws?


Empirical scaling laws describe a consistent pattern observed across LLM research: model loss decreases predictably as you increase parameters, data, or compute. They are not theoretical guarantees – they are statistical regularities that hold within certain regimes.

The Core Observation

Within a wide range of scales, loss follows a power-law relationship with each resource axis. Doubling parameters, doubling data, or doubling compute each produces a steady – but diminishing – reduction in loss. The key word is diminishing: each additional increment yields less improvement than the previous one.
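The diminishing-returns pattern can be sketched with a toy power law. The constants below are illustrative assumptions (roughly in the range of published fits, but not taken from any specific paper); only the shape of the curve matters here:

```python
def loss(n_params, n_c=8.8e13, alpha=0.076):
    """Toy power-law loss L(N) = (N_c / N) ** alpha.

    n_c and alpha are illustrative constants, not fitted values.
    """
    return (n_c / n_params) ** alpha

# Each doubling of parameters reduces loss, but by less each time.
prev_loss = None
for n in [1e8, 2e8, 4e8, 8e8, 1.6e9]:
    current = loss(n)
    if prev_loss is not None:
        print(f"N={n:.1e}  loss={current:.3f}  reduction={prev_loss - current:.4f}")
    prev_loss = current
```

Running this shows the `reduction` column shrinking with every doubling, which is exactly the diminishing-returns behavior described above.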

This means that scaling any single axis indefinitely is inefficient. To get the best performance for a fixed compute budget, you need to scale parameters and data together.

The Chinchilla Result

The most influential practical finding from scaling law research is the Chinchilla result (Hoffmann et al., 2022): for a given compute budget, the optimal strategy is to train a smaller model on more data, rather than a larger model on less data. Prior to this, models like GPT-3 were significantly undertrained relative to their parameter count.
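The tradeoff can be made concrete with the parametric loss form used by Hoffmann et al., L(N, D) = E + A/N^α + B/D^β. The constants below are approximations of the paper's reported fits (treat the exact numbers as assumptions), and C ≈ 6·N·D is the standard rough estimate of training FLOPs:

```python
# Approximate fitted constants in the style of Hoffmann et al. (2022).
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28

def chinchilla_loss(n_params, n_tokens):
    """Parametric loss L(N, D) = E + A/N^alpha + B/D^beta."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA

# Fix a compute budget C ~ 6 * N * D, then compare two ways to spend it:
# a GPT-3-sized model on fewer tokens vs. a smaller model on more tokens.
C = 6 * 70e9 * 1.4e12                               # Chinchilla-scale budget
big = chinchilla_loss(175e9, C / (6 * 175e9))       # large, undertrained
small = chinchilla_loss(70e9, C / (6 * 70e9))       # smaller, more data
print(f"large model: {big:.3f}, small model: {small:.3f}")
```

Under this fit, the smaller model trained on more tokens reaches a lower loss at the same compute, which is the Chinchilla conclusion.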

The rough guideline: train on approximately 20 tokens per parameter. A 7B parameter model should see around 140B tokens of training data to be compute-optimal.
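A minimal sizing helper makes the arithmetic explicit. The 6·N·D FLOPs figure is a commonly used approximation for dense-transformer training cost, not something stated in the text above:

```python
def chinchilla_budget(n_params, tokens_per_param=20):
    """Return (token budget, rough training FLOPs) for a compute-optimal run.

    Uses the ~20 tokens/parameter guideline and the common ~6 FLOPs
    per parameter per token training-cost approximation.
    """
    tokens = n_params * tokens_per_param
    flops = 6 * n_params * tokens
    return tokens, flops

tokens, flops = chinchilla_budget(7e9)
print(f"tokens: {tokens:.2e}")        # 1.40e+11, i.e. the ~140B from the text
print(f"train FLOPs: {flops:.2e}")    # 5.88e+21
```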

Limitations

Scaling laws are most reliable when the model is still underfitting – when more data or parameters would continue to reduce loss. They break down when:

  • high-quality data is exhausted;
  • architectural changes shift the loss curve;
  • the model has already saturated the training distribution.

They also say nothing about downstream task performance, safety, or qualitative capabilities – only about next-token prediction loss.


Section 1. Chapter 9
