Pre-training Large Language Models

What Are Scaling Laws?


Empirical scaling laws describe a consistent pattern observed across LLM research: model loss decreases predictably as you increase parameters, data, or compute. They are not theoretical guarantees – they are statistical regularities that hold within certain regimes.

The Core Observation

Within a wide range of scales, loss follows a power-law relationship with each resource axis. Doubling parameters, doubling data, or doubling compute each produces a steady – but diminishing – reduction in loss. The key word is diminishing: each additional increment yields less improvement than the previous one.
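To make the shape of this relationship concrete, here is a minimal Python sketch using the parametric loss form popularized by Hoffmann et al. (2022), L(N, D) = E + A/N^alpha + B/D^beta. The constants below are roughly the magnitudes reported in that paper and are used only to illustrate the curve, not as exact fitted values:

```python
# Minimal sketch of a power-law loss curve, assuming the parametric form
# L(N, D) = E + A / N**alpha + B / D**beta from Hoffmann et al. (2022).
# Constants are illustrative approximations, not authoritative fits.

def loss(params: float, tokens: float,
         E: float = 1.69, A: float = 406.4, B: float = 410.7,
         alpha: float = 0.34, beta: float = 0.28) -> float:
    """Predicted next-token loss for a model with `params` parameters
    trained on `tokens` tokens."""
    return E + A / params**alpha + B / tokens**beta

# Each doubling of parameters (data held fixed) still lowers the loss,
# but by a smaller amount than the previous doubling.
for n in [1e9, 2e9, 4e9, 8e9]:
    print(f"{n:.0e} params -> loss {loss(n, 1e11):.3f}")
```

Running the loop shows the loss falling with every doubling of parameters, but by a shrinking margin each time, which is exactly the diminishing-returns behavior described above.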

This means that scaling any single axis indefinitely is inefficient. To get the best performance for a fixed compute budget, you need to scale parameters and data together.

The Chinchilla Result

The most influential practical finding from scaling law research is the Chinchilla result (Hoffmann et al., 2022): for a given compute budget, the optimal strategy is to train a smaller model on more data, rather than a larger model on less data. Prior to this, models like GPT-3 were significantly undertrained relative to their parameter count.

The rough guideline: train on approximately 20 tokens per parameter. A 7B parameter model should see around 140B tokens of training data to be compute-optimal.
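As a rough sanity check, the helper below applies the 20-tokens-per-parameter guideline together with the common C ≈ 6·N·D approximation for training FLOPs. Both are coarse rules of thumb, not exact laws:

```python
# Apply the rough Chinchilla guideline (~20 training tokens per parameter)
# and the common C ≈ 6 * N * D estimate of training compute.
# Both numbers are approximations used for planning, not exact constants.

def chinchilla_budget(params: float, tokens_per_param: float = 20.0):
    """Return (compute-optimal training tokens, approximate training FLOPs)."""
    tokens = tokens_per_param * params
    flops = 6 * params * tokens  # standard forward+backward FLOP estimate
    return tokens, flops

tokens, flops = chinchilla_budget(7e9)
print(f"7B params -> ~{tokens:.2e} tokens, ~{flops:.2e} training FLOPs")
# Prints roughly 1.4e11 tokens (140B), matching the guideline above.
```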

Limitations

Scaling laws are most reliable when the model is still underfitting – when more data or parameters would continue to reduce loss. They break down when:

  • high-quality data is exhausted;
  • architectural changes shift the loss curve;
  • the model has already saturated the training distribution.

They also say nothing about downstream task performance, safety, or qualitative capabilities – only about next-token prediction loss.

