Tokenization and Information Theory

Subword Entropy and Token Distributions

When you examine the vocabulary generated by subword tokenization methods, you often find that the distribution of token frequencies is far from uniform. Instead, it follows a heavy-tailed pattern, where a small number of tokens appear very frequently, but there is a long tail of rare tokens that occur only a few times or even just once in the corpus. This is a direct consequence of the way language is structured and how subword tokenizers operate. The presence of these rare, long-tail tokens can have important implications for language models: they affect how models allocate capacity, how well they generalize to new data, and how efficiently they can learn representations for infrequent words or subword units.
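
One way to quantify how uneven such a distribution is would be to compute the Shannon entropy of the token frequencies: the closer the entropy is to its maximum value, log2 of the vocabulary size, the more evenly tokens are used, while heavy-tailed distributions fall well below that maximum. The snippet below is a minimal sketch using a small, hypothetical dictionary of token counts; the tokens and values are invented purely for illustration.

import numpy as np

# Hypothetical token counts for a tiny corpus (illustrative values, not from a real tokenizer)
token_counts = {"the": 120, "token": 45, "##iza": 30, "sub": 18, "##word": 12, "##qx": 1}

counts = np.array(list(token_counts.values()), dtype=float)
probabilities = counts / counts.sum()

# Shannon entropy in bits; a uniform distribution over the same vocabulary would reach log2(len(counts))
entropy = -np.sum(probabilities * np.log2(probabilities))
print(f"Entropy: {entropy:.3f} bits (maximum for {len(counts)} tokens: {np.log2(len(counts)):.3f} bits)")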

Zipf's Law

In natural language, the frequency of any word or token is approximately inversely proportional to its rank in the frequency table. This means the most common token appears far more often than the second most common, which in turn appears more often than the third, and so on, producing a heavy-tailed, power-law distribution.
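
Written as a formula, Zipf's law says the frequency of the token at rank r is roughly f(r) ≈ f(1) / r. The short sketch below generates this ideal curve for an assumed top frequency of 10,000; real subword vocabularies only follow it approximately.

# Ideal Zipfian frequencies: f(rank) = f(1) / rank, with an assumed top frequency
top_frequency = 10000

for rank in range(1, 11):
    print(f"rank {rank:2d}: expected frequency ~ {top_frequency / rank:8.1f}")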

Long-Tail Tokens

The long tail refers to the large number of tokens that occur very infrequently. In subword vocabularies, these are often rare combinations or fragments that appear only in specialized contexts or infrequent words. Despite their rarity, they collectively account for a significant portion of the vocabulary.
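
To make "a significant portion of the vocabulary" concrete, you can count how many vocabulary entries fall at or below a frequency threshold. The sketch below reuses the synthetic frequencies from the plotting example further down and an assumed cutoff of 10 occurrences; the exact share depends on the corpus and vocabulary size.

import numpy as np

# Synthetic token frequencies, sorted by rank (same illustrative values as the plot below)
token_frequencies = np.array([10000, 5000, 2500, 1200, 600, 300, 150, 75, 35, 10, 5, 2, 1, 1, 1])

rare_threshold = 10  # assumed cutoff: tokens occurring this many times or fewer count as long tail
rare_mask = token_frequencies <= rare_threshold
print(f"{rare_mask.sum()} of {len(token_frequencies)} vocabulary entries "
      f"({rare_mask.mean():.0%}) occur {rare_threshold} times or fewer")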

Influence on Model Learning

Heavy-tailed distributions mean that models see some tokens far more often than others during training, allowing them to learn robust representations for those frequent tokens. Rare tokens, however, are seen so infrequently that the model may struggle to learn effective representations for them, which can lead to poorer performance on rare or out-of-distribution words.
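
The skew in training exposure can be illustrated by computing how much of the total token count the most frequent entries cover. The sketch below again uses the synthetic frequencies from the plotting example; with these assumed values, a handful of tokens account for almost all occurrences, while the rest are each seen only a few times.

import numpy as np

# Synthetic token frequencies, sorted by rank (same illustrative values as the plot below)
token_frequencies = np.array([10000, 5000, 2500, 1200, 600, 300, 150, 75, 35, 10, 5, 2, 1, 1, 1])

# Cumulative share of all token occurrences covered by the top-ranked tokens
coverage = np.cumsum(token_frequencies) / token_frequencies.sum()
top_k = 5
print(f"Top {top_k} tokens cover {coverage[top_k - 1]:.1%} of all occurrences")
print(f"The remaining {len(token_frequencies) - top_k} tokens cover {1 - coverage[top_k - 1]:.1%}")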

import numpy as np
import matplotlib.pyplot as plt

# Example subword token frequencies (synthetic data for demonstration)
token_frequencies = np.array([10000, 5000, 2500, 1200, 600, 300, 150, 75, 35, 10, 5, 2, 1, 1, 1])
token_ranks = np.arange(1, len(token_frequencies) + 1)

plt.figure(figsize=(8, 5))
plt.plot(token_ranks, token_frequencies, marker='o')
plt.yscale('log')
plt.xscale('log')
plt.title('Heavy-Tailed Distribution of Subword Token Frequencies')
plt.xlabel('Token Rank')
plt.ylabel('Frequency (log scale)')
plt.grid(True, which="both", ls="-")
plt.show()

Which of the following statements best explains the significance of heavy-tailed token frequency distributions in subword vocabularies?

