Subword Entropy and Token Distributions
When you examine the vocabulary generated by subword tokenization methods, you often find that the distribution of token frequencies is far from uniform. Instead, it follows a heavy-tailed pattern, where a small number of tokens appear very frequently, but there is a long tail of rare tokens that occur only a few times or even just once in the corpus. This is a direct consequence of the way language is structured and how subword tokenizers operate. The presence of these rare, long-tail tokens can have important implications for language models: they affect how models allocate capacity, how well they generalize to new data, and how efficiently they can learn representations for infrequent words or subword units.
In natural language, the frequency of a word or token is roughly inversely proportional to its rank in the frequency table, a pattern commonly known as Zipf's law. This means the most common token appears much more often than the second, which in turn appears more often than the third, and so on, producing a heavy-tailed, power-law distribution.
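This rank-frequency relationship is easy to observe directly. The sketch below counts tokens in a tiny made-up corpus and compares each token's frequency with the top frequency divided by its rank; the corpus is word-level and purely illustrative, standing in for a subword-tokenized text.

from collections import Counter

# A tiny, made-up corpus for illustration only.
corpus = (
    "the quick brown fox jumps over the lazy dog "
    "the fox and the dog play in the sun"
).split()

counts = Counter(corpus)

# Sort tokens by frequency (rank 1 = most frequent).
ranked = counts.most_common()
top_freq = ranked[0][1]

print(f"{'rank':>4} {'token':>6} {'freq':>5} {'top_freq/rank':>14}")
for rank, (token, freq) in enumerate(ranked, start=1):
    # Under an idealized Zipf-style relationship, freq is roughly top_freq / rank.
    print(f"{rank:>4} {token:>6} {freq:>5} {top_freq / rank:>14.1f}")

Even on such a small sample, the most frequent token dominates and the counts fall off quickly with rank; on a real corpus the drop-off follows the power-law shape much more closely.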
The long tail refers to the large number of tokens that occur very infrequently. In subword vocabularies, these are often rare combinations or fragments that appear only in specialized contexts or infrequent words. Despite their rarity, they collectively account for a significant portion of the vocabulary.
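To see how much of the vocabulary the tail occupies, you can count how many entries fall below a small frequency threshold and how few of the total token occurrences they cover. The snippet below is a minimal sketch using the same synthetic frequencies as the plotting example later in this section; the cutoff of 10 occurrences is an arbitrary choice for illustration.

import numpy as np

# Synthetic token frequencies, sorted from most to least common (illustrative only).
token_frequencies = np.array([10000, 5000, 2500, 1200, 600, 300, 150, 75, 35, 10, 5, 2, 1, 1, 1])

threshold = 10  # arbitrary cutoff for "rare" tokens in this sketch
rare_mask = token_frequencies <= threshold

vocab_share = rare_mask.mean()                                        # share of vocabulary entries that are rare
occurrence_share = token_frequencies[rare_mask].sum() / token_frequencies.sum()  # share of all token occurrences

print(f"Rare tokens (freq <= {threshold}): {rare_mask.sum()} of {len(token_frequencies)} vocabulary entries")
print(f"Share of vocabulary entries: {vocab_share:.1%}")
print(f"Share of all token occurrences: {occurrence_share:.2%}")

The pattern this illustrates is typical: rare tokens make up a large slice of the vocabulary while contributing only a small fraction of the total token occurrences in the corpus.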
Heavy-tailed distributions mean that models see some tokens far more often than others during training, allowing them to learn robust representations for those frequent tokens. Rare tokens, by contrast, are seen so infrequently that the model may struggle to learn effective representations for them, potentially leading to poorer performance on rare or out-of-distribution words.
import numpy as np
import matplotlib.pyplot as plt

# Example subword token frequencies (synthetic data for demonstration)
token_frequencies = np.array([10000, 5000, 2500, 1200, 600, 300, 150, 75, 35, 10, 5, 2, 1, 1, 1])
token_ranks = np.arange(1, len(token_frequencies) + 1)

plt.figure(figsize=(8, 5))
plt.plot(token_ranks, token_frequencies, marker='o')
plt.yscale('log')
plt.xscale('log')
plt.title('Heavy-Tailed Distribution of Subword Token Frequencies')
plt.xlabel('Token Rank')
plt.ylabel('Frequency (log scale)')
plt.grid(True, which="both", ls="-")
plt.show()
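One way to quantify how concentrated such a distribution is, in the spirit of the "entropy" in this section's title, is to compute its Shannon entropy and compare it with the entropy of a uniform distribution over the same vocabulary; a heavy-tailed distribution always has lower entropy. The sketch below reuses the synthetic frequencies from the plot above and is intended only as an illustration.

import numpy as np

token_frequencies = np.array([10000, 5000, 2500, 1200, 600, 300, 150, 75, 35, 10, 5, 2, 1, 1, 1], dtype=float)

# Convert counts to a probability distribution over the vocabulary.
probs = token_frequencies / token_frequencies.sum()

# Shannon entropy in bits: H = -sum(p * log2(p)).
entropy = -np.sum(probs * np.log2(probs))

# A uniform distribution over the same vocabulary gives the maximum possible entropy.
max_entropy = np.log2(len(probs))

print(f"Entropy of heavy-tailed distribution: {entropy:.3f} bits")
print(f"Entropy of uniform distribution:      {max_entropy:.3f} bits")

The gap between the two values reflects how much probability mass is concentrated in a handful of frequent tokens rather than spread evenly across the vocabulary.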