Next-Token Prediction and Probability Distributions
When a transformer generates text, it predicts the next token in a sequence by evaluating all possible tokens and assigning a score to each. These scores, called logits, indicate the model's confidence in each token being the correct next choice, but they are not probabilities themselves. The logits are the raw outputs from the final layer of the model, and each value corresponds to a token in the vocabulary. The higher the logit, the more likely the model thinks that token should come next. However, since logits can be any real number, they need to be converted into a probability distribution to make a final prediction.
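To make this concrete, the sketch below pulls the next-token logits out of a small causal language model using the Hugging Face transformers library; the gpt2 checkpoint and the example prompt are chosen purely for illustration, not prescribed by the text above.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Small causal language model, used here only as an illustrative example.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Logits for the position after the last input token:
# one raw, unnormalized score per entry in the vocabulary.
next_token_logits = outputs.logits[0, -1]
print(next_token_logits.shape)  # torch.Size([50257]) for GPT-2's vocabulary
```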
The softmax function transforms a vector of logits into a probability distribution. It does this by exponentiating each logit and then dividing by the sum of all exponentiated logits. This ensures that all output values are between 0 and 1, and that they sum to 1. The formula for softmax for a logit vector z is:
\[
\mathrm{softmax}(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}
\]

where $z_i$ is the logit for token $i$, and the sum is over all tokens in the vocabulary.
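As a rough sketch of that formula, the NumPy function below exponentiates a small, invented logit vector and normalizes it. The max-subtraction step is a standard numerical-stability trick, not part of the definition: softmax is shift-invariant, so subtracting a constant leaves the result unchanged while keeping exp() from overflowing.

```python
import numpy as np

def softmax(logits):
    # Subtracting the maximum logit does not change the result
    # (softmax is shift-invariant) but avoids overflow in exp().
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / exps.sum()

logits = np.array([2.0, 1.0, 0.1, -1.5])  # hypothetical logits for four tokens
probs = softmax(logits)
print(probs)        # roughly [0.646 0.238 0.097 0.020]
print(probs.sum())  # 1.0
```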
Once the logits have been transformed into probabilities using softmax, the transformer can make a prediction about the next token. However, not all predictions are equally confident. Entropy is a measure of uncertainty in a probability distribution. In the context of language modeling, low entropy means the model is very confident—one token has a much higher probability than the others. High entropy means the model is less certain, spreading probability more evenly across many tokens. This uncertainty can influence how creative or deterministic the generated text appears.
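As a small illustration (with made-up distributions), the snippet below computes Shannon entropy for a confident and an uncertain next-token distribution over a four-token vocabulary.

```python
import numpy as np

def entropy(probs):
    # Shannon entropy in bits; zero-probability entries contribute nothing.
    probs = probs[probs > 0]
    return -np.sum(probs * np.log2(probs))

confident = np.array([0.97, 0.01, 0.01, 0.01])  # one token dominates
uncertain = np.array([0.25, 0.25, 0.25, 0.25])  # probability spread evenly

print(entropy(confident))  # about 0.24 bits: low entropy, high confidence
print(entropy(uncertain))  # 2.0 bits: maximum entropy for four tokens
```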