Compression Bounds in Tokenization
Understanding the theoretical boundaries of tokenization is crucial for anyone seeking to design efficient text encoding systems. Information theory tells you that the entropy of a text sets a lower bound on how much you can compress it: you cannot encode information in fewer bits than its entropy. In practice, this means that perfect compression (encoding every possible message at the absolute minimum length) is unattainable. Every tokenization method faces a fundamental trade-off between compressing text efficiently and generalizing to new, unseen sequences. If you focus solely on optimizing for the most frequent patterns, your tokenizer may achieve impressive compression rates on familiar data but will struggle with out-of-vocabulary (OOV) words or unexpected constructions. Conversely, if you design for maximum generalization, you must sacrifice some compression efficiency, allowing longer encodings to handle rare or novel input. These trade-offs are not just theoretical; they shape the very limits of what tokenizers can accomplish.
The entropy of a source text defines the minimum average number of bits per symbol needed to encode it losslessly. No tokenizer or compressor can consistently represent text in fewer bits per symbol than this entropy. Any attempt to compress below this limit will result in information loss or ambiguity.
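For reference, this bound is Shannon's source coding theorem: for a source that emits symbol i with probability p_i, the entropy in bits per symbol is

    H = -\sum_{i} p_i \log_2 p_i

and no lossless code can achieve an average code length below H bits per symbol. This is the standard information-theoretic statement, not anything specific to a particular tokenizer.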
Tokenizers that try to cover every possible sequence with compact codes must allocate space for rare or unseen combinations. This leads to longer encodings for less frequent patterns, ensuring that new or OOV words can still be represented, but at the cost of increased average encoding length.
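To make that cost concrete, here is a minimal sketch using hypothetical token probabilities and an assumed count of rare tokens (not figures from any real tokenizer). It compares a near-optimal variable-length code, which charges roughly -log2(p) bits per token, with a fixed-length code that must reserve a codeword for every rare token:

import math

# Hypothetical token probabilities: a few frequent tokens plus a long tail of rare ones.
frequent = {"the": 0.20, "and": 0.10, "of": 0.08, "to": 0.07}
tail_mass = 1.0 - sum(frequent.values())   # probability mass spread over rare/OOV tokens
num_rare = 50_000                          # assumed size of the rare-token space
tail_prob = tail_mass / num_rare           # each rare token is individually very unlikely

# Near-optimal variable-length coding: each token costs about -log2(p) bits.
avg_bits = sum(p * -math.log2(p) for p in frequent.values())
avg_bits += tail_mass * -math.log2(tail_prob)

# Fixed-length coding that must also reserve a code for every rare token.
fixed_bits = math.ceil(math.log2(len(frequent) + num_rare))

print(f"Average bits per token (variable-length): {avg_bits:.2f}")
print(f"Bits per token (fixed-length, full coverage): {fixed_bits}")

The frequent tokens receive short codes, but covering the long tail of rare tokens is exactly what stretches the average encoding length and forces a fixed-length code to be wide.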
In real-world applications, tokenizers are always a compromise. They must balance the need for efficient encoding of common patterns with the flexibility to handle novel input. This is why tokenization vocabularies are never perfect and why OOV tokens and suboptimal compression are unavoidable in practice.
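As an illustration of that compromise, here is a minimal sketch of subword fallback, using a tiny hypothetical vocabulary and a simple greedy longest-match tokenizer (not the algorithm of any particular library): known words come out as single tokens, while a novel word falls back to shorter pieces and therefore gets a longer encoding.

# A minimal sketch of subword fallback, assuming a tiny hypothetical vocabulary.
vocab = {"the", "quick", "brown", "fox", "qu", "ick", "br", "own",
         "f", "o", "x", "z", "e", "n"}

def greedy_tokenize(word, vocab):
    """Greedily match the longest known subword; fall back to single characters."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            tokens.append(word[i])  # unknown character: emit as-is (OOV fallback)
            i += 1
    return tokens

print(greedy_tokenize("fox", vocab))     # in-vocabulary word: a single token
print(greedy_tokenize("frozen", vocab))  # novel word: falls back to many short pieces

Real subword tokenizers such as BPE or WordPiece follow the same principle: common strings become single tokens, and anything outside the vocabulary decomposes into more, shorter pieces.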
import numpy as np
from scipy.stats import entropy

# Sample text and naive character-level tokenization
text = "the quick brown fox jumps over the lazy dog"
tokens = list(text.replace(" ", ""))  # Remove spaces for simplicity

# Count the frequency of each token
unique_tokens, counts = np.unique(tokens, return_counts=True)
probabilities = counts / counts.sum()

# Calculate entropy (in bits per character) from the empirical character frequencies
entropy_bits = entropy(probabilities, base=2)
num_tokens = len(tokens)  # total number of characters being encoded

print(f"Estimated entropy per character: {entropy_bits:.3f} bits")
print(f"Minimum achievable encoding length for text: {entropy_bits * num_tokens:.2f} bits")