Subword Entropy and Token Distributions
When you examine the vocabulary generated by subword tokenization methods, you often find that the distribution of token frequencies is far from uniform. Instead, it follows a heavy-tailed pattern, where a small number of tokens appear very frequently, but there is a long tail of rare tokens that occur only a few times or even just once in the corpus. This is a direct consequence of the way language is structured and how subword tokenizers operate. The presence of these rare, long-tail tokens can have important implications for language models: they affect how models allocate capacity, how well they generalize to new data, and how efficiently they can learn representations for infrequent words or subword units.
In natural language, the frequency of a word or token is roughly inversely proportional to its rank in the frequency table, a pattern commonly known as Zipf's law. This means the most common token appears much more often than the second, which in turn appears more often than the third, and so on, producing a heavy-tailed, power-law distribution.
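This rank-frequency relationship is easy to observe directly. The sketch below counts tokens in a tiny made-up corpus and compares each token's frequency with the top frequency divided by its rank; the corpus is word-level and purely illustrative, standing in for a subword-tokenized text.

from collections import Counter

# A tiny, made-up corpus for illustration only.
corpus = (
    "the quick brown fox jumps over the lazy dog "
    "the fox and the dog play in the sun"
).split()

counts = Counter(corpus)

# Sort tokens by frequency (rank 1 = most frequent).
ranked = counts.most_common()
top_freq = ranked[0][1]

print(f"{'rank':>4} {'token':>6} {'freq':>5} {'top_freq/rank':>14}")
for rank, (token, freq) in enumerate(ranked, start=1):
    # Under an idealized Zipf-style relationship, freq is roughly top_freq / rank.
    print(f"{rank:>4} {token:>6} {freq:>5} {top_freq / rank:>14.1f}")

Even on such a small sample, the most frequent token dominates and the counts fall off quickly with rank; on a real corpus the drop-off follows the power-law shape much more closely.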
The long tail refers to the large number of tokens that occur very infrequently. In subword vocabularies, these are often rare combinations or fragments that appear only in specialized contexts or infrequent words. Despite their rarity, they collectively account for a significant portion of the vocabulary.
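To see how much of the vocabulary the tail occupies, you can count how many entries fall below a small frequency threshold and how few of the total token occurrences they cover. The snippet below is a minimal sketch using the same synthetic frequencies as the plotting example later in this section; the cutoff of 10 occurrences is an arbitrary choice for illustration.

import numpy as np

# Synthetic token frequencies, sorted from most to least common (illustrative only).
token_frequencies = np.array([10000, 5000, 2500, 1200, 600, 300, 150, 75, 35, 10, 5, 2, 1, 1, 1])

threshold = 10  # arbitrary cutoff for "rare" tokens in this sketch
rare_mask = token_frequencies <= threshold

vocab_share = rare_mask.mean()                                        # share of vocabulary entries that are rare
occurrence_share = token_frequencies[rare_mask].sum() / token_frequencies.sum()  # share of all token occurrences

print(f"Rare tokens (freq <= {threshold}): {rare_mask.sum()} of {len(token_frequencies)} vocabulary entries")
print(f"Share of vocabulary entries: {vocab_share:.1%}")
print(f"Share of all token occurrences: {occurrence_share:.2%}")

The pattern this illustrates is typical: rare tokens make up a large slice of the vocabulary while contributing only a small fraction of the total token occurrences in the corpus.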
Heavy-tailed distributions mean that models see some tokens far more often than others during training, allowing them to learn robust representations for those frequent tokens. Rare tokens, by contrast, are seen so infrequently that the model may struggle to learn effective representations for them, potentially leading to poorer performance on rare or out-of-distribution words.
import numpy as np
import matplotlib.pyplot as plt

# Example subword token frequencies (synthetic data for demonstration)
token_frequencies = np.array([10000, 5000, 2500, 1200, 600, 300, 150, 75, 35, 10, 5, 2, 1, 1, 1])
token_ranks = np.arange(1, len(token_frequencies) + 1)

plt.figure(figsize=(8, 5))
plt.plot(token_ranks, token_frequencies, marker='o')
plt.yscale('log')
plt.xscale('log')
plt.title('Heavy-Tailed Distribution of Subword Token Frequencies')
plt.xlabel('Token Rank')
plt.ylabel('Frequency (log scale)')
plt.grid(True, which="both", ls="-")
plt.show()
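One way to quantify how concentrated such a distribution is, in the spirit of the "entropy" in this section's title, is to compute its Shannon entropy and compare it with the entropy of a uniform distribution over the same vocabulary; a heavy-tailed distribution always has lower entropy. The sketch below reuses the synthetic frequencies from the plot above and is intended only as an illustration.

import numpy as np

token_frequencies = np.array([10000, 5000, 2500, 1200, 600, 300, 150, 75, 35, 10, 5, 2, 1, 1, 1], dtype=float)

# Convert counts to a probability distribution over the vocabulary.
probs = token_frequencies / token_frequencies.sum()

# Shannon entropy in bits: H = -sum(p * log2(p)).
entropy = -np.sum(probs * np.log2(probs))

# A uniform distribution over the same vocabulary gives the maximum possible entropy.
max_entropy = np.log2(len(probs))

print(f"Entropy of heavy-tailed distribution: {entropy:.3f} bits")
print(f"Entropy of uniform distribution:      {max_entropy:.3f} bits")

The gap between the two values reflects how much probability mass is concentrated in a handful of frequent tokens rather than spread evenly across the vocabulary.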