Subword Entropy and Token Distributions

When you examine the vocabulary generated by subword tokenization methods, you often find that the distribution of token frequencies is far from uniform. Instead, it follows a heavy-tailed pattern, where a small number of tokens appear very frequently, but there is a long tail of rare tokens that occur only a few times or even just once in the corpus. This is a direct consequence of the way language is structured and how subword tokenizers operate. The presence of these rare, long-tail tokens can have important implications for language models: they affect how models allocate capacity, how well they generalize to new data, and how efficiently they can learn representations for infrequent words or subword units.
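
To see this pattern yourself, you can simply count how often each token appears in a corpus. The sketch below uses Python's built-in collections.Counter on a tiny toy corpus, with whitespace-separated words standing in for subword tokens; with a real trained subword tokenizer the counting logic would be the same, only applied to its output.

from collections import Counter

# Toy corpus; in practice you would run a trained subword tokenizer
# (e.g. BPE or unigram) over a much larger text collection.
corpus = [
    "the model learns subword units from data",
    "the tokenizer splits rare words into subword units",
    "frequent words usually stay as single tokens",
]

# Whitespace-separated words stand in for subword tokens in this sketch.
tokens = [tok for sentence in corpus for tok in sentence.split()]
counts = Counter(tokens)

# Sort by frequency: a few tokens repeat, most appear exactly once.
for token, freq in counts.most_common():
    print(f"{token!r}: {freq}")

Even on this toy corpus, a handful of tokens appear more than once while most occur exactly once, which is the long-tail shape in miniature.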

Zipf's Law

In natural language, the frequency of any word or token is approximately inversely proportional to its rank in the frequency table. This means the most common token appears far more often than the second most common, which in turn appears more often than the third, and so on, producing a heavy-tailed, power-law distribution.
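
In its idealized form, Zipf's law says the token at rank r occurs with frequency proportional to 1/r: the rank-2 token appears about half as often as the rank-1 token, the rank-3 token about a third as often, and so on. The short sketch below compares some synthetic, roughly Zipf-like counts (chosen only for illustration) with the prediction f(r) ≈ f(1) / r.

import numpy as np

# Synthetic token counts, roughly Zipf-like (illustration only).
observed = np.array([10000, 5200, 3300, 2600, 1900, 1700, 1400, 1300])
ranks = np.arange(1, len(observed) + 1)

# Idealized Zipf prediction: the frequency at rank r is f(1) / r.
predicted = observed[0] / ranks

for r, obs, pred in zip(ranks, observed, predicted):
    print(f"rank {r}: observed {obs:>6}, Zipf prediction {pred:>8.1f}")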

Long-Tail Tokens

The long tail refers to the large number of tokens that occur very infrequently. In subword vocabularies, these are often rare combinations or fragments that appear only in specialized contexts or infrequent words. Despite their rarity, they collectively account for a significant portion of the vocabulary.
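
You can make "a significant portion of the vocabulary" concrete by checking how many vocabulary entries fall below a frequency threshold and how small a share of all token occurrences they cover. The sketch below does this for synthetic counts; the cutoff of 5 occurrences is an arbitrary choice for illustration.

import numpy as np

# Synthetic token frequencies (illustration only).
frequencies = np.array([10000, 5000, 2500, 1200, 600, 300, 150, 75, 35, 10, 5, 2, 1, 1, 1])

threshold = 5  # arbitrary cutoff for "rare" tokens in this sketch
tail_mask = frequencies <= threshold

# Share of vocabulary entries in the tail vs. share of total token occurrences.
tail_vocab_share = tail_mask.mean()
tail_occurrence_share = frequencies[tail_mask].sum() / frequencies.sum()

print(f"Tail tokens make up {tail_vocab_share:.0%} of the vocabulary")
print(f"but only {tail_occurrence_share:.2%} of all token occurrences")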

Influence on Model Learning

Heavy-tailed distributions mean that models see some tokens far more often than others during training, allowing them to learn robust representations for those frequent tokens. Rare tokens, however, are seen so infrequently that the model may struggle to learn effective representations for them, which can lead to poorer performance on rare or out-of-distribution words.

import numpy as np
import matplotlib.pyplot as plt

# Example subword token frequencies (synthetic data for demonstration)
token_frequencies = np.array([10000, 5000, 2500, 1200, 600, 300, 150, 75, 35, 10, 5, 2, 1, 1, 1])
token_ranks = np.arange(1, len(token_frequencies) + 1)

plt.figure(figsize=(8, 5))
plt.plot(token_ranks, token_frequencies, marker='o')
plt.yscale('log')
plt.xscale('log')
plt.title('Heavy-Tailed Distribution of Subword Token Frequencies')
plt.xlabel('Token Rank')
plt.ylabel('Frequency (log scale)')
plt.grid(True, which="both", ls="-")
plt.show()
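
The skew that the plot shows can also be summarized with the Shannon entropy of the token distribution, which is where this chapter's title comes from: the more a few tokens dominate, the further the entropy falls below the uniform maximum of log2(V) bits for a vocabulary of size V. The sketch below computes both values for the same synthetic frequencies used in the plot.

import numpy as np

# Synthetic token frequencies (illustration only), as in the plot above.
frequencies = np.array([10000, 5000, 2500, 1200, 600, 300, 150, 75, 35, 10, 5, 2, 1, 1, 1])

# Normalize counts into a probability distribution over the vocabulary.
probs = frequencies / frequencies.sum()

# Shannon entropy in bits; lower than log2(V) because the distribution is skewed.
entropy = -np.sum(probs * np.log2(probs))
max_entropy = np.log2(len(probs))  # entropy of a uniform distribution over the same vocabulary

print(f"Entropy of token distribution: {entropy:.2f} bits")
print(f"Maximum possible (uniform):    {max_entropy:.2f} bits")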

Which of the following statements best explains the significance of heavy-tailed token frequency distributions in subword vocabularies?


