Pre-training Large Language Models

Evaluating Language Models with Perplexity


Perplexity is the standard metric for evaluating language model quality during pre-training. It measures how uncertain the model is when predicting the next token – a lower value means the model assigns higher probability to the correct tokens.

The Formula

For a sequence of $N$ tokens with predicted probabilities $p_1, p_2, \ldots, p_N$ for the correct token at each step:

$$PP = \exp\left(\frac{1}{N} \sum_{i=1}^{N} -\log p_i\right)$$

This is simply the exponential of the average cross-entropy loss. If your training loop already computes the mean cross-entropy loss, perplexity is just torch.exp(loss).
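The formula can be checked by hand. A minimal sketch with three made-up probabilities (0.5, 0.25, and 0.125 are illustrative values, not from any real model):

```python
import math

# Hypothetical probabilities the model assigned to the correct token at each step
probs = [0.5, 0.25, 0.125]

# Average negative log-probability (cross-entropy), then exponentiate
avg_nll = -sum(math.log(p) for p in probs) / len(probs)
perplexity = math.exp(avg_nll)

print(perplexity)  # 4.0
```

The result, 4.0, is the inverse geometric mean of the assigned probabilities: the model is "as confused" here as if it were picking uniformly among 4 equally likely tokens.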

Computing Perplexity in PyTorch

import torch
import torch.nn.functional as F

# Simulating model logits and targets for a single batch
vocab_size = 1000
seq_len = 20
batch_size = 4

logits = torch.rand(batch_size, seq_len, vocab_size)
targets = torch.randint(0, vocab_size, (batch_size, seq_len))

# Cross-entropy loss (mean over all tokens)
loss = F.cross_entropy(logits.view(-1, vocab_size), targets.view(-1))

# Perplexity
perplexity = torch.exp(loss)

print(f"Loss: {loss.item():.4f}")
print(f"Perplexity: {perplexity.item():.2f}")

A randomly initialized model over a vocabulary of 1000 tokens should produce a perplexity close to 1000 – it is essentially guessing uniformly. As training progresses, perplexity drops.

Run this locally and try replacing the random logits with a uniform distribution to verify the relationship between vocabulary size and initial perplexity.
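One way to sketch that check: identical logits at every position make the softmax exactly uniform, so perplexity should land on the vocabulary size itself (torch.zeros stands in for the "uniform distribution" here):

```python
import torch
import torch.nn.functional as F

vocab_size = 1000

# Identical logits at every position -> exactly uniform softmax over the vocabulary
logits = torch.zeros(2, 10, vocab_size)
targets = torch.randint(0, vocab_size, (2, 10))

loss = F.cross_entropy(logits.view(-1, vocab_size), targets.view(-1))
perplexity = torch.exp(loss)

print(perplexity.item())  # ~1000.0, i.e. the vocabulary size
```

With a uniform distribution the loss is exactly log(vocab_size), so exp(loss) recovers vocab_size up to float precision.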

Strengths and Limitations

Perplexity is easy to compute, interpretable, and consistent across training runs on the same dataset. However, it has real limitations:

  • it is sensitive to tokenization – comparing perplexity across models with different vocabularies is not meaningful;
  • it does not capture output quality for generation tasks – a model can have low perplexity while producing repetitive or incoherent text;
  • it only reflects next-token prediction accuracy, not downstream task performance.
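The tokenization sensitivity in the first point is easy to see numerically: the same text carries roughly the same total negative log-likelihood, but a finer-grained tokenizer spreads it over more tokens, which lowers the per-token perplexity. A toy sketch (the 20-nat total is an invented figure for illustration):

```python
import math

total_nll = 20.0  # hypothetical total negative log-likelihood of one sentence

# Same sentence, same total NLL, two tokenizers that split it into
# different numbers of tokens -> different per-token perplexities
for n_tokens in (10, 20):
    print(n_tokens, round(math.exp(total_nll / n_tokens), 2))
```

The coarser segmentation (10 tokens) reports a perplexity of about 7.39, the finer one (20 tokens) about 2.72, even though the underlying model quality is identical.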

Section 1, Chapter 10
