Pre-training Large Language Models

Why Tokenization Matters for Language Models


Tokenization is the process of converting raw text into a sequence of tokens – the atomic units a language model operates on. Before any training happens, every string of text must pass through a tokenizer.

Three Approaches

The choice of tokenization strategy directly shapes vocabulary size, sequence length, and ultimately model performance:

  • Word-level: each word is one token. Simple, but produces a huge vocabulary – rare words and morphological variants each need their own entry;
  • Character-level: each character is one token. Tiny vocabulary, but sequences become very long, making training slow and context harder to capture;
  • Subword: words are split into frequent subword units. Balances vocabulary size and sequence length, and handles unknown or rare words by decomposing them into known pieces.
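The trade-off between the first two approaches is easy to see in plain Python. This minimal sketch (the tiny corpus is invented for illustration) counts vocabulary size and sequence length for word-level and character-level tokenization of the same string:

```python
# Contrast word-level and character-level tokenization on a toy corpus.
# "Vocabulary" here is just the set of unique tokens in this one string.
corpus = "the cat sat on the mat"

word_tokens = corpus.split()   # word-level: one token per whitespace-separated word
char_tokens = list(corpus)     # character-level: one token per character

print(len(set(word_tokens)), len(word_tokens))  # 5 unique words, sequence of 6
print(len(set(char_tokens)), len(char_tokens))  # 10 unique chars, sequence of 22
```

Even on six words, the character-level sequence is almost four times longer; on a real corpus the word-level vocabulary explodes instead, which is exactly the tension subword tokenization resolves.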

Modern LLMs – GPT, BERT, LLaMA – all use subword tokenization.
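To make "decomposing unknown words into known pieces" concrete, here is a toy greedy longest-match subword tokenizer. This is a simplification of how WordPiece-style tokenizers segment unseen words, and the vocabulary below is invented for the example:

```python
# Toy subword vocabulary (hypothetical; real vocabularies hold tens of thousands
# of entries learned from data).
vocab = {"token", "ization", "un", "break", "able"}

def subword_tokenize(word, vocab):
    """Split `word` into the longest matching vocabulary pieces, left to right."""
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest possible substring first, shrinking until one matches.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append("[UNK]")  # no vocabulary piece covers this character
            i += 1
    return pieces

print(subword_tokenize("tokenization", vocab))  # ['token', 'ization']
print(subword_tokenize("unbreakable", vocab))   # ['un', 'break', 'able']
```

Neither "tokenization" nor "unbreakable" is in the vocabulary, yet both are covered by known pieces, so the model never has to fall back to an unknown-token placeholder.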

Why It Matters

The tokenizer determines everything the model ever sees. A poor tokenization scheme can:

  • split words in unnatural ways, making it harder to learn meaning;
  • produce a vocabulary too large to fit efficiently in memory;
  • generate unnecessarily long sequences, increasing compute costs;
  • fail on rare words or morphologically rich languages.

A well-designed tokenizer keeps vocabulary size manageable, produces compact sequences, and generalizes cleanly to text not seen during training.

from transformers import GPT2Tokenizer

# Load the pretrained byte-pair-encoding tokenizer that ships with GPT-2
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

text = "Tokenization shapes everything downstream."
tokens = tokenizer.tokenize(text)  # subword token strings
ids = tokenizer.encode(text)       # corresponding integer vocabulary IDs

print(tokens)
print(ids)

Run this locally to see how GPT-2 splits the sentence into subword tokens and maps them to integer IDs.
