Tokenization and Information Theory

What Tokenization Really Does

Tokenization is the process of converting text into a sequence of discrete symbols, enabling computers to process and analyze language efficiently. At its core, tokenization serves as a mapping function: it takes raw text and transforms it into a sequence of tokens drawn from a fixed vocabulary. This mapping is crucial for computational models, which operate on well-defined, finite sets of symbols rather than free-form text. There are several strategies for tokenization, each with different granularities and trade-offs. The main approaches are character-level, word-level, subword-level, and byte-level tokenization.
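
To make this mapping concrete, here is a minimal Python sketch in which tokens are looked up in a small, hand-picked vocabulary and anything unseen falls back to an '<unk>' placeholder. The vocabulary, the placeholder token, and the encode helper are illustrative assumptions for this example, not the API of any particular tokenizer.

# A minimal sketch of tokenization as a mapping into a fixed vocabulary.
# The vocabulary and the "<unk>" fallback token are assumptions made for illustration.
vocab = {"<unk>": 0, "Token": 1, "ization": 2, "is": 3, "power": 4, "ful": 5}

def encode(tokens, vocab):
    # Map each token string to its integer ID; unseen tokens fall back to "<unk>".
    return [vocab.get(tok, vocab["<unk>"]) for tok in tokens]

print(encode(["Token", "ization", "is", "power", "ful"], vocab))  # [1, 2, 3, 4, 5]
print(encode(["Token", "izer"], vocab))                           # [1, 0] -- "izer" is unknown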

Character-level tokenization breaks text into its smallest possible units, treating each character as an individual token. This method ensures that every possible input can be represented, as all text is composed of characters. However, it often leads to long sequences and may lose some semantic information present at higher levels.

Word-level tokenization, in contrast, treats each word as a token. This approach aligns well with human intuition about language, as words are natural units of meaning. However, word-level tokenization faces significant scalability challenges. The number of unique words in a language is vast and continuously growing, especially when considering variations, misspellings, and new terms. Maintaining a vocabulary that covers all possible words is impractical, leading to issues with out-of-vocabulary words and inefficient use of storage.

Subword-level tokenization offers a compromise by splitting words into smaller, more manageable units. These subwords can be prefixes, suffixes, or frequently occurring segments within words. By combining subwords, it is possible to represent both common and rare words efficiently, reducing the vocabulary size while maintaining the ability to reconstruct the original text.

Byte-level tokenization goes even lower, treating each byte as a token. This method is highly robust and language-agnostic, but often results in even longer token sequences and can obscure linguistic structure.
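
As a rough illustration, the sketch below encodes an example sentence as UTF-8 and treats each byte value as a token; the variable names are chosen only for this example.

# Byte-level tokenization: encode the text as UTF-8 and treat each byte value as a token.
sentence = "Tokenization is powerful!"
byte_tokens = list(sentence.encode("utf-8"))
print("Byte tokens:", byte_tokens)           # integers in the range 0-255
print("Sequence length:", len(byte_tokens))  # 25 bytes for this ASCII-only sentence

# Non-ASCII characters expand to multiple bytes, so sequences grow even longer.
print(list("café".encode("utf-8")))          # [99, 97, 102, 195, 169]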

The scalability problem of word-level tokenization becomes clear when dealing with large, diverse corpora. As the vocabulary grows, memory requirements, lookup times, and the risk of encountering unknown words all increase. Mapping text to discrete symbols using finer granularity, such as subwords or characters, helps address these issues by ensuring every possible input can be tokenized with a manageable vocabulary size.
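
The sketch below illustrates this with a deliberately tiny, made-up word vocabulary: two of the three input words are out of vocabulary and collapse to an '<unk>' placeholder, while splitting the same words into characters always succeeds.

# Illustrative sketch of the out-of-vocabulary (OOV) problem.
# The tiny word vocabulary below is an assumption made purely for this example.
word_vocab = {"tokenization", "is", "powerful"}

def word_level(words, vocab):
    # Any word missing from the fixed vocabulary is replaced by an "<unk>" placeholder.
    return [w if w in vocab else "<unk>" for w in words]

text = "tokenizers are powerful".split()
print(word_level(text, word_vocab))   # ['<unk>', '<unk>', 'powerful']

# Character-level tokenization never hits OOV: every word decomposes into known characters.
print([list(w) for w in text])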

Note: Definitions
  • Character tokenization: splits text into individual characters, such as ['T', 'o', 'k', 'e', 'n'];
  • Word tokenization: splits text into words, such as ['Tokenization', 'is', 'powerful'];
  • Subword tokenization: splits words into smaller units, such as ['Token', 'ization', 'is', 'power', 'ful'].
# Example sentence
sentence = "Tokenization is powerful!"

# Character-level tokenization
char_tokens = list(sentence)
print("Character tokens:", char_tokens)

# Word-level tokenization (simple whitespace-based)
word_tokens = sentence.replace("!", "").split()
print("Word tokens:", word_tokens)

# Simple subword-level tokenization
def simple_subword_tokenize(text):
    suffixes = ['ization', 'ing', 'ful', 'ly']
    tokens = []
    for word in text.replace("!", "").split():
        matched = False
        for suf in suffixes:
            if word.endswith(suf) and len(word) > len(suf):
                base = word[:-len(suf)]
                tokens.append(base)
                tokens.append(suf)
                matched = True
                break
        if not matched:
            tokens.append(word)
    return tokens

subword_tokens = simple_subword_tokenize(sentence)
print("Subword tokens:", subword_tokens)
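
When run, the script prints every character of the sentence, the word tokens ['Tokenization', 'is', 'powerful'], and the subword tokens ['Token', 'ization', 'is', 'power', 'ful']. The hard-coded suffix list is only a toy heuristic; practical subword tokenizers such as byte-pair encoding learn their segments from a training corpus.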

Why does word-level tokenization not scale well for large language models?


