Tokenization and Information Theory

What Tokenization Really Does

Tokenization is the process of converting text into a sequence of discrete symbols, enabling computers to process and analyze language efficiently. At its core, tokenization serves as a mapping function: it takes raw text and transforms it into a sequence of tokens drawn from a fixed vocabulary. This mapping is crucial for computational models, which operate on well-defined, finite sets of symbols rather than free-form text. There are several strategies for tokenization, each with a different granularity and different trade-offs. The main approaches are character-level, word-level, subword-level, and byte-level tokenization.
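
To make the "mapping function" idea concrete, here is a minimal sketch that encodes a sentence against a tiny hand-built vocabulary. The vocabulary, the token IDs, and the "<unk>" fallback are illustrative assumptions for this example, not part of any particular tokenizer.

# A minimal sketch of tokenization as a mapping function.
# The toy vocabulary and the "<unk>" fallback are illustrative assumptions.
vocab = {"<unk>": 0, "Tokenization": 1, "is": 2, "powerful": 3}

def encode(text):
    # Strip the trailing punctuation, split on whitespace, and map each token
    # to its ID, falling back to the unknown-token ID when it is missing.
    return [vocab.get(tok, vocab["<unk>"]) for tok in text.replace("!", "").split()]

print(encode("Tokenization is powerful!"))  # [1, 2, 3]
print(encode("Tokenization is flexible!"))  # [1, 2, 0] -> "flexible" is out of vocabulary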

Character-level tokenization breaks text into its smallest possible units, treating each character as an individual token. This method ensures that every possible input can be represented, as all text is composed of characters. However, it often leads to long sequences and may lose some semantic information present at higher levels.

Word-level tokenization, in contrast, treats each word as a token. This approach aligns well with human intuition about language, as words are natural units of meaning. However, word-level tokenization faces significant scalability challenges. The number of unique words in a language is vast and continuously growing, especially when considering variations, misspellings, and new terms. Maintaining a vocabulary that covers all possible words is impractical, leading to issues with out-of-vocabulary words and inefficient use of storage.
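
As a quick illustration of how word-level vocabularies balloon, the snippet below builds a vocabulary from a few seen forms and then meets new text containing capitalization changes, new derivations, and a typo; the word lists are made up for the example.

# Illustrative only: every distinct surface form becomes its own entry
# in a word-level vocabulary, so the vocabulary keeps growing.
seen = ["run", "runs", "running"]
vocab = set(seen)

new_text = "Run reruns rerunning runnning"  # capitalization, new forms, a typo
unknown = [w for w in new_text.split() if w not in vocab]
print("Vocabulary size:", len(vocab))
print("Out-of-vocabulary words:", unknown)  # every new form is unknown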

Subword-level tokenization offers a compromise by splitting words into smaller, more manageable units. These subwords can be prefixes, suffixes, or frequently occurring segments within words. By combining subwords, it is possible to represent both common and rare words efficiently, reducing the vocabulary size while maintaining the ability to reconstruct the original text.
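
In practice these "frequently occurring segments" are usually learned from data rather than written by hand. The sketch below shows the core idea behind byte-pair-encoding-style subword learning: repeatedly count adjacent symbol pairs and merge the most frequent one. It is a toy illustration with made-up word frequencies, not a faithful reimplementation of any production tokenizer.

from collections import Counter

def most_frequent_pair(words):
    # Count adjacent symbol pairs, weighted by word frequency.
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(words, pair):
    # Replace every occurrence of the chosen pair with a single merged symbol.
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Toy word counts: start from characters and apply a few merges.
words = {tuple("lower"): 5, tuple("lowest"): 2, tuple("newer"): 6, tuple("wider"): 3}
for _ in range(4):
    pair = most_frequent_pair(words)
    if pair is None:
        break
    print("Merging:", pair)
    words = merge_pair(words, pair)
print(words)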

Byte-level tokenization operates at an even finer granularity, treating each byte of the text's encoding (typically UTF-8) as a token. This makes it highly robust and language-agnostic, since any string maps onto at most 256 byte values, but it produces even longer token sequences and can obscure linguistic structure.
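
A minimal sketch of byte-level tokenization using Python's built-in UTF-8 encoding; the example sentence with a non-ASCII character is made up to show that a single character can span several byte tokens.

# Byte-level tokenization: encode the text and treat every byte as a token.
text = "Tokenización is powerful!"           # note the non-ASCII "ó"
byte_tokens = list(text.encode("utf-8"))     # integers in the range 0-255
print("Byte tokens:", byte_tokens)
print("Characters:", len(text), "| Bytes:", len(byte_tokens))
# The byte sequence is longer than the character sequence because "ó"
# occupies two bytes, yet the vocabulary never needs more than 256 symbols.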

The scalability problem of word-level tokenization becomes clear when dealing with large, diverse corpora. As the vocabulary grows, memory requirements, lookup times, and the risk of encountering unknown words all increase. Mapping text to discrete symbols using finer granularity, such as subwords or characters, helps address these issues by ensuring every possible input can be tokenized with a manageable vocabulary size.
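
The following sketch makes that contrast concrete on a tiny made-up corpus: the word-level vocabulary grows with every new sentence, while the character-level vocabulary stays almost flat.

# Compare vocabulary growth at word and character granularity on a toy corpus.
corpus = [
    "Tokenization is powerful",
    "Tokenizers map text to discrete symbols",
    "Subword tokenization keeps the vocabulary manageable",
]

word_vocab, char_vocab = set(), set()
for i, line in enumerate(corpus, start=1):
    word_vocab.update(line.lower().split())
    char_vocab.update(line.lower())
    print(f"After sentence {i}: word vocab = {len(word_vocab)}, char vocab = {len(char_vocab)}")
# On real corpora the word vocabulary keeps growing with the data,
# while the character (or byte) vocabulary stays essentially fixed.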

Note
Definition
  • Character tokenization: splits text into individual characters, such as ['T', 'o', 'k', 'e', 'n'];
  • Word tokenization: splits text into words, such as ['Tokenization', 'is', 'powerful'];
  • Subword tokenization: splits words into smaller units, such as ['Token', 'ization', 'is', 'power', 'ful'].
# Example sentence
sentence = "Tokenization is powerful!"

# Character-level tokenization
char_tokens = list(sentence)
print("Character tokens:", char_tokens)

# Word-level tokenization (simple whitespace-based)
word_tokens = sentence.replace("!", "").split()
print("Word tokens:", word_tokens)

# Simple subword-level tokenization
def simple_subword_tokenize(text):
    suffixes = ['ization', 'ing', 'ful', 'ly']
    tokens = []
    for word in text.replace("!", "").split():
        matched = False
        for suf in suffixes:
            if word.endswith(suf) and len(word) > len(suf):
                base = word[:-len(suf)]
                tokens.append(base)
                tokens.append(suf)
                matched = True
                break
        if not matched:
            tokens.append(word)
    return tokens

subword_tokens = simple_subword_tokenize(sentence)
print("Subword tokens:", subword_tokens)

Why does word-level tokenization not scale well for large language models?
