Pre-training Large Language Models

Why Tokenization Matters for Language Models

Tokenization is the process of converting raw text into a sequence of tokens – the atomic units a language model operates on. Before any training happens, every string of text must first pass through a tokenizer.

Three Approaches

The choice of tokenization strategy directly shapes vocabulary size, sequence length, and ultimately model performance:

  • Word-level: each word is one token. Simple, but produces a huge vocabulary – rare words and morphological variants each need their own entry;
  • Character-level: each character is one token. Tiny vocabulary, but sequences become very long, making training slow and context harder to capture;
  • Subword: words are split into frequent subword units. Balances vocabulary size and sequence length, and handles unknown or rare words by decomposing them into known pieces.

Modern LLMs – GPT, BERT, LLaMA – all use subword tokenization.
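The trade-offs between the three strategies can be sketched with a toy example. This is not a real tokenizer – the subword split below is done by hand purely for illustration – but it shows how each approach changes sequence length for the same text:

```python
# Toy illustration (not a real tokenizer): compare how the three
# strategies trade off vocabulary size against sequence length.
text = "unbelievable unhappiness unlocked"

# Word-level: one token per word -> short sequences, huge vocabulary
# (every rare word and morphological variant needs its own entry).
word_tokens = text.split()

# Character-level: one token per character -> tiny vocabulary,
# but sequences grow much longer.
char_tokens = list(text)

# Subword (hand-split for illustration): shared pieces like "un"
# let rare words reuse entries from a moderate vocabulary.
subword_tokens = ["un", "believ", "able", "un", "happi", "ness", "un", "lock", "ed"]

print(len(word_tokens))     # 3
print(len(char_tokens))     # 33
print(len(subword_tokens))  # 9
```

Note how the subword sequence sits between the two extremes: longer than word-level, far shorter than character-level, while its pieces could still be shared across many other words.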

Why It Matters

The tokenizer determines everything the model ever sees. A poor tokenization scheme can:

  • split words in unnatural ways, making it harder to learn meaning;
  • produce a vocabulary too large to fit efficiently in memory;
  • generate unnecessarily long sequences, increasing compute costs;
  • fail on rare words or morphologically rich languages.

A well-designed tokenizer keeps vocabulary size manageable, produces compact sequences, and generalizes cleanly to text not seen during training.

from transformers import GPT2Tokenizer

# Load the pretrained GPT-2 byte-pair-encoding tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

text = "Tokenization shapes everything downstream."

# Split the text into subword token strings
tokens = tokenizer.tokenize(text)
# Map the same text to integer vocabulary IDs
ids = tokenizer.encode(text)

print(tokens)
print(ids)

Run this locally to see how GPT-2 splits the sentence into subword tokens and maps them to integer IDs.
