Why Tokenization Matters for Language Models
Tokenization is the process of converting raw text into a sequence of tokens – the atomic units a language model operates on. Before any training happens, every string of text must pass through a tokenizer.
Three Approaches
The choice of tokenization strategy directly shapes vocabulary size, sequence length, and ultimately model performance:
- Word-level: each word is one token. Simple, but produces a huge vocabulary – rare words and morphological variants each need their own entry;
- Character-level: each character is one token. Tiny vocabulary, but sequences become very long, making training slow and context harder to capture;
- Subword: words are split into frequent subword units. Balances vocabulary size and sequence length, and handles unknown or rare words by decomposing them into known pieces.
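The trade-off between the three strategies can be seen on a single sentence. The sketch below is purely illustrative: the subword split shown for "tokenization" is hand-picked, not produced by a real tokenizer.

```python
text = "tokenization matters"

# Word-level: split on whitespace -> one token per word, but every
# morphological variant ("tokenize", "tokenizer", ...) needs its own entry
word_tokens = text.split()

# Character-level: one token per character -> tiny vocabulary, long sequences
char_tokens = list(text)

# Subword-level: frequent pieces reused across many words
# (hand-crafted here for illustration)
subword_tokens = ["token", "ization", "matter", "s"]

print(word_tokens)     # 2 tokens
print(char_tokens)     # 20 tokens for the same string
print(subword_tokens)  # 4 tokens, all reusable pieces
```

Note how the subword split reuses "token" and "ization", pieces that also appear in many other English words.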
Modern LLMs – GPT, BERT, LLaMA – all use subword tokenization.
Why It Matters
The tokenizer determines everything the model ever sees. A poor tokenization scheme can:
- split words in unnatural ways, making it harder to learn meaning;
- produce a vocabulary too large to fit efficiently in memory;
- generate unnecessarily long sequences, increasing compute costs;
- fail on rare words or morphologically rich languages.
A well-designed tokenizer keeps vocabulary size manageable, produces compact sequences, and generalizes cleanly to text not seen during training.
For example, using Hugging Face's transformers library:

from transformers import GPT2Tokenizer

# Load the pretrained GPT-2 byte-level BPE tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

text = "Tokenization shapes everything downstream."

# Split the text into subword strings, then map it to integer token IDs
tokens = tokenizer.tokenize(text)
ids = tokenizer.encode(text)

print(tokens)
print(ids)
Run this locally to see how GPT-2 splits the sentence into subword tokens and maps them to integer IDs.
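The claim that subword tokenizers handle unknown words by decomposing them into known pieces can be made concrete with a minimal sketch. The greedy longest-match loop below is in the spirit of WordPiece-style tokenizers; the vocabulary is a hand-picked toy, not one learned from data.

```python
# Toy subword vocabulary (hand-picked for illustration)
vocab = {"un", "token", "iz", "able", "ing", "s"}

def subword_split(word, vocab):
    """Split `word` into the longest matching vocabulary pieces, left to right."""
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest remaining substring first
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # No piece matched: fall back to an unknown marker
            pieces.append("<unk>")
            i += 1
    return pieces

print(subword_split("untokenizable", vocab))
# -> ['un', 'token', 'iz', 'able']
```

Even though "untokenizable" never appears in the vocabulary, it decomposes cleanly into four known pieces, which is exactly why subword schemes generalize to rare words where word-level schemes fail.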