Transformers Theory Essentials

Next-Token Prediction and Probability Distributions

When a transformer generates text, it predicts the next token in a sequence by evaluating all possible tokens and assigning a score to each. These scores, called logits, indicate the model's confidence in each token being the correct next choice, but they are not probabilities themselves. The logits are the raw outputs from the final layer of the model, and each value corresponds to a token in the vocabulary. The higher the logit, the more likely the model thinks that token should come next. However, since logits can be any real number, they need to be converted into a probability distribution to make a final prediction.
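As a minimal illustration (the four-word vocabulary and the logit values here are invented, not from any real model), the final layer emits one raw score per token in the vocabulary; the scores can be any real numbers, and greedy decoding simply picks the largest:

```python
# Toy example: raw logits over a tiny invented vocabulary.
# Real models emit one logit per token in a vocabulary of tens of thousands.
vocab = ["cat", "dog", "the", "ran"]
logits = [2.1, -0.3, 4.7, 1.0]  # arbitrary real numbers, not probabilities

# Greedy decoding: pick the token with the highest logit.
next_token = vocab[max(range(len(logits)), key=lambda i: logits[i])]
print(next_token)  # -> "the"
```

Note that the logits include a negative value and do not sum to 1, which is why a separate normalization step is needed before they can be treated as probabilities.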

Definition

The softmax function transforms a vector of logits into a probability distribution. It does this by exponentiating each logit and then dividing by the sum of all exponentiated logits. This ensures that all output values are between 0 and 1, and that they sum to 1. The formula for softmax for a logit vector $z$ is:

$$\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}$$

where $z_i$ is the logit for token $i$, and the sum is over all tokens in the vocabulary.
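The definition above can be sketched directly in a few lines of Python. The max-subtraction step is a standard numerical-stability trick, not part of the mathematical definition; it leaves the result unchanged because softmax is invariant to shifting all logits by a constant:

```python
import math

def softmax(logits):
    """Convert a vector of logits into a probability distribution.

    Subtracting the maximum logit before exponentiating avoids overflow
    for large logits and does not change the result.
    """
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs)                 # each value lies in (0, 1)
print(round(sum(probs), 6))  # -> 1.0
```

The ordering of the logits is preserved: the largest logit always maps to the largest probability, so softmax changes the scale of the scores, not their ranking.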

Once the logits have been transformed into probabilities using softmax, the transformer can make a prediction about the next token. However, not all predictions are equally confident. Entropy is a measure of uncertainty in a probability distribution. In the context of language modeling, low entropy means the model is very confident—one token has a much higher probability than the others. High entropy means the model is less certain, spreading probability more evenly across many tokens. This uncertainty can influence how creative or deterministic the generated text appears.
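To make the low- vs. high-entropy contrast concrete, here is a short sketch using Shannon entropy in bits (the two example distributions are invented for illustration):

```python
import math

def entropy(probs):
    """Shannon entropy in bits; terms with p = 0 are treated as 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

confident = [0.97, 0.01, 0.01, 0.01]  # one token dominates -> low entropy
uncertain = [0.25, 0.25, 0.25, 0.25]  # evenly spread -> maximum entropy

print(round(entropy(confident), 3))
print(round(entropy(uncertain), 3))  # -> 2.0, i.e. log2 of 4 equal options
```

A uniform distribution over $n$ tokens has the maximum possible entropy, $\log_2 n$ bits, while a distribution that puts all its mass on one token has entropy 0.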


Which statement best describes the role of the softmax function in transformer models?


