Vocabulary Size Trade-Offs
When designing a tokenization system, you must decide how many unique tokens, or vocabulary items, to use. This decision is not trivial: the size of your vocabulary directly affects how text is represented as sequences of tokens, and has far-reaching consequences for model efficiency, sparsity, and performance metrics like perplexity.
A small vocabulary means that each token covers a smaller chunk of text: entire words and phrases are broken down into subword units or characters. This leads to longer token sequences for the same sentence, because more tokens are needed to cover the same content. However, small vocabularies reduce the number of parameters in the model's embedding layer, which can help with generalization and reduce memory requirements. On the other hand, longer sequences slow down processing and increase the risk of information loss, especially if the sequence length exceeds the model's context limit.
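To make the memory side of this trade-off concrete, the sketch below compares the embedding-table size (vocabulary size times embedding dimension) and the resulting sequence length for two hypothetical vocabularies. The embedding dimension and the average tokens-per-word figures are illustrative assumptions, not measurements from a real tokenizer.

```python
# Rough, illustrative estimates for two hypothetical vocabulary configurations.
embedding_dim = 512          # assumed embedding dimension
text_length_words = 1_000    # length of an example text, in words

configs = {
    "small vocab (character-level)": {"vocab_size": 100, "tokens_per_word": 6.0},
    "large vocab (word-level)": {"vocab_size": 100_000, "tokens_per_word": 1.1},
}

for name, cfg in configs.items():
    # Embedding table size grows linearly with vocabulary size.
    embedding_params = cfg["vocab_size"] * embedding_dim
    # Sequence length grows with how many tokens each word is split into.
    sequence_length = int(text_length_words * cfg["tokens_per_word"])
    print(f"{name}: ~{embedding_params:,} embedding parameters, "
          f"~{sequence_length:,} tokens for a {text_length_words:,}-word text")
```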
A large vocabulary, in contrast, allows more text to be represented by fewer tokens. This shortens input sequences, which speeds up processing and reduces the number of steps the model must take over the text. However, large vocabularies increase the risk of data sparsity: many rare words or subwords appear only a handful of times in the training data, making it hard for the model to learn good representations for them. A large vocabulary also means a much larger embedding matrix, increasing memory usage and the risk of overfitting.
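The sparsity problem is easy to see even on a toy corpus: with a word-level vocabulary, many entries occur only once, so their embeddings would receive very few training updates. The corpus below is an invented example used purely for illustration.

```python
from collections import Counter

# Toy corpus: a handful of common words plus a few rare ones.
corpus = (
    "the cat sat on the mat the dog sat on the rug "
    "a serendipitous onomatopoeia perplexed the lexicographer"
).split()

# Count how often each word-level token appears.
word_counts = Counter(corpus)
singletons = [w for w, c in word_counts.items() if c == 1]

print("Vocabulary size (word-level):", len(word_counts))
print("Tokens seen only once:", len(singletons))
print("Examples of rare tokens:", singletons[:5])
```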
The trade-off between vocabulary size and sequence length also impacts perplexity, a measure of how well the model predicts a sequence. If the vocabulary is too small, the model may struggle to represent complex words or phrases, increasing perplexity. If the vocabulary is too large, the model may not have enough data to learn rare tokens well, again increasing perplexity. Thus, finding the right balance is crucial for efficient and effective language modeling.
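As a reminder of how perplexity is computed, the sketch below takes the exponential of the average negative log-probability assigned to each token. The probability values are made up for illustration; they show how a single poorly learned rare token can inflate perplexity.

```python
import math

def perplexity(token_probs):
    # Perplexity = exp(mean negative log-probability over the sequence).
    avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log_prob)

# Illustrative per-token probabilities from a hypothetical model.
well_covered = [0.4, 0.5, 0.3, 0.45]     # all tokens reasonably well learned
with_rare_token = [0.4, 0.5, 0.3, 0.01]  # one rare, poorly learned token

print("Perplexity (well-covered tokens):", round(perplexity(well_covered), 2))
print("Perplexity (with a rare token):", round(perplexity(with_rare_token), 2))
```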
Advantages of a small vocabulary:

- Reduces embedding table size, saving memory;
- Handles unseen or rare words better by breaking them into known subwords or characters;
- Simplifies handling of out-of-vocabulary (OOV) words;
- Improves generalization by forcing the model to learn patterns at the subword or character level.

Disadvantages of a small vocabulary:

- Increases sequence length, which can slow down processing and require more computational steps;
- May lose semantic information by over-fragmenting meaningful words;
- Can make it harder for the model to capture long-range dependencies.

Advantages of a large vocabulary:

- Shortens sequence length, speeding up model processing;
- Captures more semantic meaning in single tokens, improving representation;
- Reduces the need for token recombination to form words.

Disadvantages of a large vocabulary:

- Increases embedding table size, using more memory;
- Leads to data sparsity, making it harder to learn good representations for rare tokens;
- Increases the risk of overfitting and may require more training data.
- English with a character-level vocabulary: "unbelievable" -> ['u', 'n', 'b', 'e', 'l', 'i', 'e', 'v', 'a', 'b', 'l', 'e'] (12 tokens);
- English with a word-level vocabulary: "unbelievable" -> ['unbelievable'] (1 token);
- English with a subword vocabulary: "unbelievable" -> ['un', 'believ', 'able'] (3 tokens).
The greedy longest-match tokenizer below reproduces these three segmentations:

```python
def tokenize(sentence, vocab):
    """Greedy longest-match tokenizer: at each position, take the longest
    vocabulary entry that matches, falling back to a single character."""
    tokens = []
    i = 0
    while i < len(sentence):
        matched = False
        # Try to match the longest token in vocab at position i
        for j in range(len(sentence), i, -1):
            sub = sentence[i:j]
            if sub in vocab:
                tokens.append(sub)
                i = j
                matched = True
                break
        if not matched:
            # Fallback: single character
            tokens.append(sentence[i])
            i += 1
    return tokens

sentence = "unbelievable"
char_vocab = set("abcdefghijklmnopqrstuvwxyz")
word_vocab = {"unbelievable"}
subword_vocab = {"un", "believ", "able"}

char_tokens = tokenize(sentence, char_vocab)
word_tokens = tokenize(sentence, word_vocab)
subword_tokens = tokenize(sentence, subword_vocab)

print("Character-level tokens:", char_tokens)
print("Word-level tokens:", word_tokens)
print("Subword-level tokens:", subword_tokens)
print("Number of tokens (char-level):", len(char_tokens))
print("Number of tokens (word-level):", len(word_tokens))
print("Number of tokens (subword-level):", len(subword_tokens))
```
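Running this sketch prints 12 character-level tokens, a single word-level token, and the 3 subword tokens ['un', 'believ', 'able'] for the same word, which makes the sequence-length side of the trade-off easy to see.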