What Are Scaling Laws?
Empirical scaling laws describe a consistent pattern observed across LLM research: model loss decreases predictably as you increase parameters, data, or compute. They are not theoretical guarantees – they are statistical regularities that hold within certain regimes.
The Core Observation
Within a wide range of scales, loss follows a power-law relationship with each resource axis. Doubling parameters, doubling data, or doubling compute each produces a steady – but diminishing – reduction in loss. The key word is diminishing: each additional increment yields less improvement than the previous one.
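A minimal sketch of this diminishing-returns behavior, using a power law of the form L(N) = (N_c / N)^alpha. The constants below are illustrative (roughly in the spirit of published fits), not authoritative values:

```python
def loss(n_params: float, n_c: float = 8.8e13, alpha: float = 0.076) -> float:
    """Power-law loss as a function of parameter count (illustrative constants)."""
    return (n_c / n_params) ** alpha

# Each doubling of parameters reduces loss, but by less each time.
for n in [1e9, 2e9, 4e9, 8e9]:
    print(f"{n:.0e} params -> loss {loss(n):.4f}")
```

Running this shows the gap between successive doublings shrinking: the curve keeps falling, just more slowly.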
This means that scaling any single axis indefinitely is inefficient. To get the best performance for a fixed compute budget, you need to scale parameters and data together.
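One way to make "scale together" concrete is the common approximation that training compute is roughly C ≈ 6·N·D FLOPs (N parameters, D tokens). Combined with a tokens-per-parameter target such as D ≈ 20·N (an assumption here, taken from the Chinchilla guideline discussed below), a fixed budget pins down both axes:

```python
import math

def optimal_allocation(compute_flops: float) -> tuple[float, float]:
    """Split a FLOP budget into (params, tokens), assuming C ≈ 6*N*D and D ≈ 20*N.

    Substituting D = 20*N into C = 6*N*D gives C = 120*N**2,
    so N = sqrt(C / 120) and D = 20*N.
    """
    n_params = math.sqrt(compute_flops / 120.0)
    n_tokens = 20.0 * n_params
    return n_params, n_tokens
```

For example, a budget of about 5.88e23 FLOPs yields roughly 70B parameters and 1.4T tokens under these assumptions.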
The Chinchilla Result
The most influential practical finding from scaling law research is the Chinchilla result (Hoffmann et al., 2022): for a given compute budget, the optimal strategy is to train a smaller model on more data, rather than a larger model on less data. Prior to this, models like GPT-3 were significantly undertrained relative to their parameter count.
The rough guideline: train on approximately 20 tokens per parameter. A 7B parameter model should see around 140B tokens of training data to be compute-optimal.
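The guideline reduces to a one-line calculation; a small helper makes the 7B example explicit:

```python
TOKENS_PER_PARAM = 20  # rough Chinchilla ratio (Hoffmann et al., 2022)

def chinchilla_tokens(n_params: float) -> float:
    """Approximate compute-optimal training token count for a model size."""
    return TOKENS_PER_PARAM * n_params

# 7e9 parameters -> 1.4e11 (140B) tokens
print(f"{chinchilla_tokens(7e9):.2e}")
```

Note this is a rule of thumb, not an exact constant; the fitted ratio varies with the details of the scaling-law fit.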
Limitations
Scaling laws are most reliable when the model is still underfitting – when more data or parameters would continue to reduce loss. They break down when:
- high-quality data is exhausted;
- architectural changes shift the loss curve;
- the model has already saturated the training distribution.
They also say nothing about downstream task performance, safety, or qualitative capabilities – only about next-token prediction loss.