Pre-training Large Language Models

Why Tokenization Matters for Language Models


Tokenization is the process of converting raw text into a sequence of tokens – the atomic units a language model operates on. Before any training happens, every string of text must pass through a tokenizer.

Three Approaches

The choice of tokenization strategy directly shapes vocabulary size, sequence length, and ultimately model performance:

  • Word-level: each word is one token. Simple, but produces a huge vocabulary – rare words and morphological variants each need their own entry;
  • Character-level: each character is one token. Tiny vocabulary, but sequences become very long, making training slow and context harder to capture;
  • Subword: words are split into frequent subword units. Balances vocabulary size and sequence length, and handles unknown or rare words by decomposing them into known pieces.

Modern LLMs – GPT, BERT, LLaMA – all use subword tokenization.
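The trade-off between the three approaches is easy to see by splitting the same sentence each way. This is a minimal stdlib-only sketch; the subword split shown is illustrative, not the output of a trained tokenizer:

```python
text = "unbelievably long sequences"

# Word-level: one token per whitespace-separated word
word_tokens = text.split()

# Character-level: one token per character (spaces included)
char_tokens = list(text)

# Subword: hand-picked illustrative split into frequent-looking pieces
subword_tokens = ["un", "believ", "ably", " long", " sequences"]

print(len(word_tokens), len(char_tokens), len(subword_tokens))
```

Word-level yields the shortest sequence but needs a vocabulary entry for every surface form; character-level needs almost no vocabulary but the sequence is an order of magnitude longer; subword sits between the two.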

Why It Matters

The tokenizer defines the only view of the text the model ever gets. A poor tokenization scheme can:

  • split words in unnatural ways, making it harder to learn meaning;
  • produce a vocabulary too large to fit efficiently in memory;
  • generate unnecessarily long sequences, increasing compute costs;
  • fail on rare words or morphologically rich languages.

A well-designed tokenizer keeps vocabulary size manageable, produces compact sequences, and generalizes cleanly to text not seen during training.
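How a subword tokenizer generalizes to unseen words can be sketched with a toy BPE-style segmenter: it greedily applies a learned list of merges, so a word never seen whole still decomposes into known pieces. The `bpe_segment` helper and the tiny `merges` list here are hypothetical, chosen only to illustrate the idea:

```python
def bpe_segment(word, merges):
    """Greedily apply learned merge rules to split a word into subwords."""
    tokens = list(word)  # start from individual characters
    changed = True
    while changed:
        changed = False
        for a, b in merges:
            i = 0
            while i < len(tokens) - 1:
                if tokens[i] == a and tokens[i + 1] == b:
                    tokens[i:i + 2] = [a + b]  # merge the adjacent pair
                    changed = True
                else:
                    i += 1
    return tokens

# Toy merge table, as if learned from a corpus rich in "low"/"lower"
merges = [("l", "o"), ("lo", "w"), ("e", "r"), ("low", "er")]

print(bpe_segment("lower", merges))   # ['lower'] – fully merged, in vocabulary
print(bpe_segment("lowest", merges))  # ['low', 'e', 's', 't'] – unseen word still splits into known pieces
```

Real BPE implementations rank merges by learned priority rather than scanning a list, but the fallback behavior is the same: rare words degrade gracefully into smaller known units instead of becoming unknown tokens.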

from transformers import GPT2Tokenizer

# Load the pretrained GPT-2 byte-pair-encoding tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

text = "Tokenization shapes everything downstream."
tokens = tokenizer.tokenize(text)  # split into subword token strings
ids = tokenizer.encode(text)       # map the text to integer token IDs

print(tokens)
print(ids)

Run this locally to see how GPT-2 splits the sentence into subword tokens and maps them to integer IDs.

