Why Tokenization Matters for Language Models
Tokenization is the process of converting raw text into a sequence of tokens – the atomic units a language model operates on. Before any training happens, every string of text must pass through a tokenizer.
Three Approaches
The choice of tokenization strategy directly shapes vocabulary size, sequence length, and ultimately model performance:
- Word-level: each word is one token. Simple, but produces a huge vocabulary – rare words and morphological variants each need their own entry;
- Character-level: each character is one token. Tiny vocabulary, but sequences become very long, making training slow and context harder to capture;
- Subword: words are split into frequent subword units. Balances vocabulary size and sequence length, and handles unknown or rare words by decomposing them into known pieces.
Modern LLMs – GPT, BERT, LLaMA – all use subword tokenization.
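The trade-offs above can be sketched in plain Python (a toy illustration: the subword split shown is hand-picked, not produced by a trained tokenizer):

```python
text = "unbelievable results"

# Word-level: one token per whitespace-separated word.
word_tokens = text.split()

# Character-level: one token per character (including the space).
char_tokens = list(text)

# Subword: frequent pieces; this split is hand-picked for illustration.
subword_tokens = ["un", "believ", "able", " results"]

print(len(word_tokens), word_tokens)   # 2 tokens, but "unbelievable" needs its own vocab entry
print(len(char_tokens))                # 20 tokens: tiny vocabulary, long sequence
print(len(subword_tokens), subword_tokens)
```

Note how the subword pieces concatenate back to the original string: the vocabulary stays small, yet no word is ever "unknown" as long as it can be decomposed into known pieces.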
Why It Matters
The tokenizer determines everything the model ever gets to see. A poor tokenization scheme can:
- split words in unnatural ways, making it harder to learn meaning;
- produce a vocabulary too large to fit efficiently in memory;
- generate unnecessarily long sequences, increasing compute costs;
- fail on rare words or morphologically rich languages.
A well-designed tokenizer keeps vocabulary size manageable, produces compact sequences, and generalizes cleanly to text not seen during training.
from transformers import GPT2Tokenizer

# Load the pretrained GPT-2 byte-pair-encoding tokenizer.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

text = "Tokenization shapes everything downstream."

# Split the text into subword strings, then map them to vocabulary IDs.
tokens = tokenizer.tokenize(text)
ids = tokenizer.encode(text)

print(tokens)
print(ids)
Run this locally to see how GPT-2 splits the sentence into subword tokens and maps them to integer IDs.