Entropy and Information Density of Text

When you look at natural language text, you might wonder how much information is actually carried by each token. This is where the concept of entropy comes in. In information theory, entropy measures the average amount of information produced by a data source, in this case a stream of text tokens. The average bits per token tells you how many binary digits are needed, on average, to represent each token when tokens are encoded as efficiently as possible. A highly predictable token carries little information and can be represented with few bits; a less predictable token requires more. The information density of text, the amount of information packed into each token, therefore depends on how surprising or uncertain each token is. The number of bits needed to encode a token is directly tied to its predictability: the less predictable the token, the longer its encoding.
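To make the link between predictability and code length concrete, here is a minimal sketch. The two probabilities are assumed purely for illustration, not measured from any corpus; the key idea is that a token with probability p has an ideal code length of -log2(p) bits.

import math

# Illustrative probabilities (assumed, not measured from a real corpus):
# a very common, highly predictable token vs. a rare, surprising one.
p_common = 0.20    # e.g. a frequent word like "the"
p_rare = 0.0005    # e.g. an unusual technical term

# Self-information: the ideal code length for a token with probability p
bits_common = -math.log2(p_common)
bits_rare = -math.log2(p_rare)

print(f"Predictable token:   {bits_common:.2f} bits")   # ~2.32 bits
print(f"Unpredictable token: {bits_rare:.2f} bits")     # ~10.97 bits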

Note
Definition

In the context of text, entropy quantifies the unpredictability or randomness of token occurrence. High entropy means token appearances are less predictable, while low entropy means they are more regular. Entropy is significant for tokenization because it sets a lower bound on how efficiently text can be compressed: you cannot, on average, encode tokens using fewer bits than the entropy of their distribution.
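To see what this lower bound means in practice, here is a small sketch using an assumed toy distribution (not part of the lesson's code). It compares the entropy of the distribution with the cost of a naive fixed-length code that ignores token frequencies.

import math

# Assumed toy token distribution over a 4-token vocabulary
probs = {"the": 0.5, "cat": 0.25, "sat": 0.125, "mat": 0.125}

# Entropy: the theoretical minimum average bits per token
entropy = -sum(p * math.log2(p) for p in probs.values())

# A naive fixed-length code assigns the same number of bits to every token
fixed_bits = math.ceil(math.log2(len(probs)))

print(f"Entropy (lower bound): {entropy:.3f} bits per token")  # 1.750
print(f"Fixed-length code:     {fixed_bits} bits per token")    # 2

For this skewed distribution the entropy works out to 1.75 bits per token, so a frequency-aware variable-length code (Huffman coding, for example) can average as little as 1.75 bits here, while the fixed-length code spends a full 2 bits on every token regardless of how common it is.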

import numpy as np

# Sample text
text = "the cat sat on the mat and the cat ate the rat"

# Tokenize by splitting on spaces
tokens = text.split()

# Count frequencies
token_counts = {}
for token in tokens:
    token_counts[token] = token_counts.get(token, 0) + 1

# Compute probabilities
total_tokens = len(tokens)
probs = np.array([count / total_tokens for count in token_counts.values()])

# Compute empirical entropy (in bits)
entropy = -np.sum(probs * np.log2(probs))
print(f"Empirical entropy: {entropy:.3f} bits per token")
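Running this script on the sample sentence should report roughly 2.75 bits per token. For comparison, the sentence contains 8 distinct tokens, so a fixed-length code would spend 3 bits on each one; frequent tokens like "the" pull the average information content below that bound.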

Which of the following statements are true about entropy and tokenization?

