Next-Token Prediction and Probability Distributions
When a transformer generates text, it predicts the next token in a sequence by evaluating all possible tokens and assigning a score to each. These scores, called logits, indicate the model's confidence in each token being the correct next choice, but they are not probabilities themselves. The logits are the raw outputs from the final layer of the model, and each value corresponds to a token in the vocabulary. The higher the logit, the more likely the model thinks that token should come next. However, since logits can be any real number, they need to be converted into a probability distribution to make a final prediction.
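To make this concrete, the sketch below pulls the next-token logits out of a small causal language model using the Hugging Face transformers library; the gpt2 checkpoint and the example prompt are chosen purely for illustration, not prescribed by the text above.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Small causal language model, used here only as an illustrative example.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Logits for the position after the last input token:
# one raw, unnormalized score per entry in the vocabulary.
next_token_logits = outputs.logits[0, -1]
print(next_token_logits.shape)  # torch.Size([50257]) for GPT-2's vocabulary
```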
The softmax function transforms a vector of logits into a probability distribution. It does this by exponentiating each logit and then dividing by the sum of all exponentiated logits. This ensures that all output values are between 0 and 1, and that they sum to 1. The formula for softmax for a logit vector z is:
\[
\mathrm{softmax}(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}
\]

where $z_i$ is the logit for token $i$, and the sum is over all tokens in the vocabulary.
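As a rough sketch of that formula, the NumPy function below exponentiates a small, invented logit vector and normalizes it. The max-subtraction step is a standard numerical-stability trick, not part of the definition: softmax is shift-invariant, so subtracting a constant leaves the result unchanged while keeping exp() from overflowing.

```python
import numpy as np

def softmax(logits):
    # Subtracting the maximum logit does not change the result
    # (softmax is shift-invariant) but avoids overflow in exp().
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / exps.sum()

logits = np.array([2.0, 1.0, 0.1, -1.5])  # hypothetical logits for four tokens
probs = softmax(logits)
print(probs)        # roughly [0.646 0.238 0.097 0.020]
print(probs.sum())  # 1.0
```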
Once the logits have been transformed into probabilities using softmax, the transformer can make a prediction about the next token. However, not all predictions are equally confident. Entropy is a measure of uncertainty in a probability distribution. In the context of language modeling, low entropy means the model is very confident—one token has a much higher probability than the others. High entropy means the model is less certain, spreading probability more evenly across many tokens. This uncertainty can influence how creative or deterministic the generated text appears.
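As a small illustration (with made-up distributions), the snippet below computes Shannon entropy for a confident and an uncertain next-token distribution over a four-token vocabulary.

```python
import numpy as np

def entropy(probs):
    # Shannon entropy in bits; zero-probability entries contribute nothing.
    probs = probs[probs > 0]
    return -np.sum(probs * np.log2(probs))

confident = np.array([0.97, 0.01, 0.01, 0.01])  # one token dominates
uncertain = np.array([0.25, 0.25, 0.25, 0.25])  # probability spread evenly

print(entropy(confident))  # about 0.24 bits: low entropy, high confidence
print(entropy(uncertain))  # 2.0 bits: maximum entropy for four tokens
```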