Tokenization and Information Theory

What Tokenization Really Does

Tokenization is the process of converting text into a sequence of discrete symbols, enabling computers to process and analyze language efficiently. At its core, tokenization serves as a mapping function: it takes raw text and transforms it into a sequence of tokens drawn from a fixed vocabulary. This mapping is crucial for computational models, which operate on well-defined, finite sets of symbols rather than free-form text. There are several strategies for tokenization, each with different granularities and trade-offs. The main approaches are character-level, word-level, subword-level, and byte-level tokenization.
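
To make this mapping concrete, here is a minimal Python sketch in which tokens are looked up in a small, hand-picked vocabulary and anything unseen falls back to an '<unk>' placeholder. The vocabulary, the placeholder token, and the encode helper are illustrative assumptions for this example, not the API of any particular tokenizer.

# A minimal sketch of tokenization as a mapping into a fixed vocabulary.
# The vocabulary and the "<unk>" fallback token are assumptions made for illustration.
vocab = {"<unk>": 0, "Token": 1, "ization": 2, "is": 3, "power": 4, "ful": 5}

def encode(tokens, vocab):
    # Map each token string to its integer ID; unseen tokens fall back to "<unk>".
    return [vocab.get(tok, vocab["<unk>"]) for tok in tokens]

print(encode(["Token", "ization", "is", "power", "ful"], vocab))  # [1, 2, 3, 4, 5]
print(encode(["Token", "izer"], vocab))                           # [1, 0] -- "izer" is unknown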

Character-level tokenization breaks text into its smallest possible units, treating each character as an individual token. This method ensures that every possible input can be represented, as all text is composed of characters. However, it often leads to long sequences and may lose some semantic information present at higher levels.

Word-level tokenization, in contrast, treats each word as a token. This approach aligns well with human intuition about language, as words are natural units of meaning. However, word-level tokenization faces significant scalability challenges. The number of unique words in a language is vast and continuously growing, especially when considering variations, misspellings, and new terms. Maintaining a vocabulary that covers all possible words is impractical, leading to issues with out-of-vocabulary words and inefficient use of storage.

Subword-level tokenization offers a compromise by splitting words into smaller, more manageable units. These subwords can be prefixes, suffixes, or frequently occurring segments within words. By combining subwords, it is possible to represent both common and rare words efficiently, reducing the vocabulary size while maintaining the ability to reconstruct the original text.

Byte-level tokenization goes even lower, treating each byte as a token. This method is highly robust and language-agnostic, but often results in even longer token sequences and can obscure linguistic structure.
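
As a rough illustration, the sketch below encodes an example sentence as UTF-8 and treats each byte value as a token; the variable names are chosen only for this example.

# Byte-level tokenization: encode the text as UTF-8 and treat each byte value as a token.
sentence = "Tokenization is powerful!"
byte_tokens = list(sentence.encode("utf-8"))
print("Byte tokens:", byte_tokens)           # integers in the range 0-255
print("Sequence length:", len(byte_tokens))  # 25 bytes for this ASCII-only sentence

# Non-ASCII characters expand to multiple bytes, so sequences grow even longer.
print(list("café".encode("utf-8")))          # [99, 97, 102, 195, 169]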

The scalability problem of word-level tokenization becomes clear when dealing with large, diverse corpora. As the vocabulary grows, memory requirements, lookup times, and the risk of encountering unknown words all increase. Mapping text to discrete symbols using finer granularity, such as subwords or characters, helps address these issues by ensuring every possible input can be tokenized with a manageable vocabulary size.
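
The sketch below illustrates this with a deliberately tiny, made-up word vocabulary: two of the three input words are out of vocabulary and collapse to an '<unk>' placeholder, while splitting the same words into characters always succeeds.

# Illustrative sketch of the out-of-vocabulary (OOV) problem.
# The tiny word vocabulary below is an assumption made purely for this example.
word_vocab = {"tokenization", "is", "powerful"}

def word_level(words, vocab):
    # Any word missing from the fixed vocabulary is replaced by an "<unk>" placeholder.
    return [w if w in vocab else "<unk>" for w in words]

text = "tokenizers are powerful".split()
print(word_level(text, word_vocab))   # ['<unk>', '<unk>', 'powerful']

# Character-level tokenization never hits OOV: every word decomposes into known characters.
print([list(w) for w in text])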

Note: Definitions
  • Character tokenization: splits text into individual characters, such as ['T', 'o', 'k', 'e', 'n'];
  • Word tokenization: splits text into words, such as ['Tokenization', 'is', 'powerful'];
  • Subword tokenization: splits words into smaller units, such as ['Token', 'ization', 'is', 'power', 'ful'].
# Example sentence
sentence = "Tokenization is powerful!"

# Character-level tokenization
char_tokens = list(sentence)
print("Character tokens:", char_tokens)

# Word-level tokenization (simple whitespace-based)
word_tokens = sentence.replace("!", "").split()
print("Word tokens:", word_tokens)

# Simple subword-level tokenization
def simple_subword_tokenize(text):
    suffixes = ['ization', 'ing', 'ful', 'ly']
    tokens = []
    for word in text.replace("!", "").split():
        matched = False
        for suf in suffixes:
            if word.endswith(suf) and len(word) > len(suf):
                base = word[:-len(suf)]
                tokens.append(base)
                tokens.append(suf)
                matched = True
                break
        if not matched:
            tokens.append(word)
    return tokens

subword_tokens = simple_subword_tokenize(sentence)
print("Subword tokens:", subword_tokens)
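
When run, the script prints every character of the sentence, the word tokens ['Tokenization', 'is', 'powerful'], and the subword tokens ['Token', 'ization', 'is', 'power', 'ful']. The hard-coded suffix list is only a toy heuristic; practical subword tokenizers such as byte-pair encoding learn their segments from a training corpus.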

Why does word-level tokenization not scale well for large language models?


