Pre-training Large Language Models

Why Tokenization Matters for Language Models


Tokenization is the process of converting raw text into a sequence of tokens – the atomic units a language model operates on. Before any training happens, every string of text must pass through a tokenizer; the model itself never sees raw characters, only token IDs.

Three Approaches

The choice of tokenization strategy directly shapes vocabulary size, sequence length, and ultimately model performance:

  • Word-level: each word is one token. Simple, but produces a huge vocabulary – rare words and morphological variants each need their own entry;
  • Character-level: each character is one token. Tiny vocabulary, but sequences become very long, making training slow and context harder to capture;
  • Subword: words are split into frequent subword units. Balances vocabulary size and sequence length, and handles unknown or rare words by decomposing them into known pieces.
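The trade-offs above can be seen on a toy example. The subword split below is hand-picked for illustration only; real subword tokenizers (BPE, WordPiece) learn their pieces from a training corpus:

```python
# Illustrative comparison of the three tokenization granularities.
# The subword split is hypothetical, chosen by hand for this example.

text = "unhappiness is learnable"

# Word-level: one token per whitespace-separated word.
word_tokens = text.split()

# Character-level: one token per character (including spaces).
char_tokens = list(text)

# Subword-level (toy): rare words decompose into frequent, reusable pieces.
subword_tokens = ["un", "happi", "ness", "is", "learn", "able"]

print(word_tokens)     # 3 tokens, but "unhappiness" needs its own vocab entry
print(char_tokens)     # tiny vocabulary, yet 24 tokens for a short phrase
print(subword_tokens)  # middle ground: 6 tokens built from shared pieces
```

Notice the tension: word-level gives the shortest sequence but the largest vocabulary, character-level the opposite, and subword sits between the two.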

Modern LLMs – GPT, BERT, LLaMA – all use subword tokenization.

Why It Matters

The tokenizer determines everything the model will ever see. A poor tokenization scheme can:

  • split words in unnatural ways, making it harder to learn meaning;
  • produce a vocabulary too large to fit efficiently in memory;
  • generate unnecessarily long sequences, increasing compute costs;
  • fail on rare words or morphologically rich languages.

A well-designed tokenizer keeps vocabulary size manageable, produces compact sequences, and generalizes cleanly to text not seen during training.

from transformers import GPT2Tokenizer

# Download (or load from cache) the pretrained GPT-2 tokenizer.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

text = "Tokenization shapes everything downstream."

# tokenize() returns the subword strings; encode() maps the text to integer IDs.
tokens = tokenizer.tokenize(text)
ids = tokenizer.encode(text)

print(tokens)
print(ids)

Run this locally to see how GPT-2 splits the sentence into subword tokens and maps them to integer IDs.
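To see why subword tokenization generalizes to words never seen during training, here is a minimal greedy longest-match tokenizer, a WordPiece-style sketch. The vocabulary is hand-picked for illustration; real tokenizers learn theirs from data:

```python
# Minimal greedy longest-match subword tokenizer (WordPiece-style sketch).
# VOCAB is hypothetical and tiny; learned vocabularies have tens of
# thousands of entries.

VOCAB = {"token", "ization", "ize", "un", "seen"}

def tokenize(word: str, vocab=VOCAB) -> list[str]:
    """Split a word into the longest matching vocabulary pieces, left to right."""
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest remaining substring first, shrinking until a piece matches.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # No piece matches this character: emit an unknown marker and move on.
            pieces.append("[UNK]")
            i += 1
    return pieces

print(tokenize("tokenization"))  # ['token', 'ization']
print(tokenize("unseen"))        # ['un', 'seen']
```

Even though "unseen" never appears in the vocabulary as a whole word, it decomposes cleanly into known pieces, which is exactly the property that lets subword models handle rare words and morphological variants.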
