Sampling Strategies
Transformers generate text by predicting the next token in a sequence, using probability distributions learned from data. At each step, the model outputs a probability for every possible token in its vocabulary, conditioned on the sequence so far. Mathematically, given a sequence of tokens $x_1, x_2, \dots, x_{t-1}$, the model computes the probability of the next token $x_t$ as:

$$P(x_t \mid x_1, x_2, \dots, x_{t-1})$$

These probabilities are calculated using the model's output logits (raw scores) for each token, which are then transformed into probabilities using the softmax function:

$$P(x_t = w \mid x_1, \dots, x_{t-1}) = \frac{e^{z_w}}{\sum_{v \in V} e^{z_v}}$$

where:
- $z_w$ is the logit (score) for token $w$;
- $V$ is the vocabulary;
- $e$ is the base of the exponential function.
The softmax function ensures that all probabilities are positive and sum to 1, forming a valid probability distribution over possible next tokens.
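As a concrete illustration, here is a minimal softmax sketch in Python with NumPy, using hypothetical logit values:

```python
import numpy as np

# Hypothetical logits for a tiny 4-token vocabulary.
logits = np.array([2.0, 1.0, 0.1, -1.0])

def softmax(z):
    z = z - np.max(z)        # subtract the max for numerical stability (result is unchanged)
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

probs = softmax(logits)
print(np.round(probs, 3))  # roughly [0.638 0.235 0.095 0.032] -- all positive
print(probs.sum())         # 1.0 -- a valid distribution over the vocabulary
```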
Sampling strategies determine how you select the next token from this distribution:
- Greedy decoding always picks the token with the highest probability (argmax), making the output deterministic and often repetitive.
- Random sampling draws a token according to its probability, introducing variability but risking incoherence.
- Top-k sampling restricts selection to the k most probable tokens, then samples among them, mathematically zeroing out probabilities for all other tokens before re-normalizing.
- Top-p (nucleus) sampling selects the smallest set of tokens whose cumulative probability exceeds a threshold p, then samples within this set.
- Temperature scaling modifies the distribution by dividing the logits by a temperature parameter $T$ before applying the softmax:

$$P(x_t = w \mid x_1, \dots, x_{t-1}) = \frac{e^{z_w / T}}{\sum_{v \in V} e^{z_v / T}}$$

Lower $T$ sharpens the distribution (more deterministic), while higher $T$ flattens it (more random).
Your choice of sampling strategy alters the mathematical process for selecting each token, directly influencing the diversity, creativity, and coherence of the generated text.
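For instance, here is a minimal sketch (toy vocabulary and probabilities, NumPy assumed) contrasting greedy decoding with plain random sampling over the same distribution:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical next-token distribution over a tiny vocabulary.
vocab = ["the", "cat", "sat", "on"]
probs = np.array([0.50, 0.30, 0.15, 0.05])

# Greedy decoding: always take the argmax -- deterministic.
greedy_token = vocab[int(np.argmax(probs))]

# Random sampling: draw a token in proportion to its probability -- varies run to run.
sampled_token = vocab[int(rng.choice(len(vocab), p=probs))]

print(greedy_token)   # always "the"
print(sampled_token)  # usually "the", but sometimes "cat", "sat", or "on"
```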
Beam Search
- Intuition: keeps track of several of the most promising candidate sequences at each step, not just the single most likely one (see the sketch after this list).
- Strengths: produces coherent and high-quality text; reduces the chance of missing good continuations.
- Weaknesses: can be computationally expensive; may still lack diversity and become repetitive if beams converge.
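As a rough illustration, here is a minimal beam search sketch. The `next_token_probs` function is a hypothetical stand-in for a real model's next-token distribution; the point is only the beam bookkeeping:

```python
import math

def next_token_probs(sequence):
    # Hypothetical stand-in for a model: returns {token: probability}.
    # A real implementation would run the transformer on `sequence`.
    return {"a": 0.5, "b": 0.3, "<eos>": 0.2}

def beam_search(start, beam_width=2, max_steps=5):
    # Each beam is (sequence, cumulative log-probability).
    beams = [(list(start), 0.0)]
    for _ in range(max_steps):
        candidates = []
        for seq, score in beams:
            if seq[-1] == "<eos>":
                candidates.append((seq, score))  # finished beams are carried over unchanged
                continue
            for token, p in next_token_probs(seq).items():
                candidates.append((seq + [token], score + math.log(p)))
        # Keep only the beam_width highest-scoring candidates.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams

for seq, score in beam_search(["<bos>"]):
    print(" ".join(seq), round(score, 3))
```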
Top-k Sampling
- Intuition: limits choices to the top k most probable tokens, then samples randomly among them (see the sketch after this list).
- Strengths: filters out unlikely, low-quality tokens; increases diversity compared to greedy decoding.
- Weaknesses: the value of k is fixed, so it may include too many or too few options depending on the distribution.
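A minimal sketch of top-k sampling over a toy distribution (hypothetical probabilities, NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(0)

def top_k_sample(probs, k):
    probs = np.asarray(probs, dtype=float)
    top_indices = np.argsort(probs)[-k:]        # indices of the k largest probabilities
    filtered = np.zeros_like(probs)
    filtered[top_indices] = probs[top_indices]  # zero out everything outside the top k
    filtered /= filtered.sum()                  # re-normalize so the survivors sum to 1
    return int(rng.choice(len(probs), p=filtered))

probs = [0.40, 0.25, 0.20, 0.10, 0.05]
print(top_k_sample(probs, k=3))  # only index 0, 1, or 2 can ever be returned
```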
Top-p (Nucleus) Sampling
- Intuition: chooses from the smallest set of tokens whose cumulative probability exceeds a threshold p (see the sketch after this list).
- Strengths: adapts to the probability distribution; balances diversity and quality dynamically.
- Weaknesses: can behave unpredictably when the distribution is very flat or very peaked; requires tuning p for best results.
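A minimal sketch of top-p (nucleus) sampling under the same assumptions (hypothetical probabilities, NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(0)

def top_p_sample(probs, p=0.9):
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]                    # tokens from most to least probable
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1   # smallest prefix whose mass exceeds p
    nucleus = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[nucleus] = probs[nucleus]                 # keep only the nucleus
    filtered /= filtered.sum()                         # re-normalize
    return int(rng.choice(len(probs), p=filtered))

probs = [0.40, 0.25, 0.20, 0.10, 0.05]
print(top_p_sample(probs, p=0.8))  # nucleus is {0, 1, 2}: 0.40 + 0.25 + 0.20 = 0.85 > 0.8
```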
Temperature Scaling
- Intuition: adjusts the probability distribution before sampling, making it sharper or flatter (see the sketch below).
- Strengths: simple way to control randomness; can make outputs more creative or more predictable.
- Weaknesses: too high or too low temperature can harm coherence or diversity; not a standalone strategy.
Temperature directly affects how random or deterministic the sampling is. Lower temperatures (closer to 0) make the model act more deterministically, favoring the highest-probability tokens and producing more predictable text. Higher temperatures increase randomness, allowing more surprising and diverse outputs, but can also lead to less coherent sentences.
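Here is a minimal sketch showing how dividing hypothetical logits by different temperatures reshapes the softmax distribution:

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)                 # subtract the max for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

logits = np.array([2.0, 1.0, 0.1, -1.0])  # hypothetical logits

for T in (0.5, 1.0, 2.0):
    # Dividing the logits by T before the softmax reshapes the distribution.
    print(f"T={T}:", np.round(softmax(logits / T), 3))

# T=0.5 concentrates probability on the top token (sharper, more deterministic),
# T=1.0 leaves the distribution unchanged, and T=2.0 flattens it (more random).
```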
Use the following decision guide when choosing a sampling strategy:
- If you want the most likely, highly coherent output and can afford more computation: consider beam search.
- If you want creative, diverse text but want to avoid unlikely words: use top-k or top-p (nucleus) sampling.
- If you want to finely tune the balance between randomness and predictability: adjust temperature (often in combination with top-k or top-p).
- If you need both quality and diversity: combine nucleus sampling with a moderate temperature (see the combined sketch below).
Your choice depends on your application's need for creativity, coherence, and computational resources.
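To tie the guide together, here is a minimal sketch (hypothetical logits, NumPy assumed) that combines temperature scaling with nucleus sampling, a common default for balancing quality and diversity:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_next_token(logits, temperature=0.8, top_p=0.9):
    logits = np.asarray(logits, dtype=float)
    # 1. Temperature: divide the logits by T, then apply the softmax.
    scaled = (logits - logits.max()) / temperature
    probs = np.exp(scaled) / np.exp(scaled).sum()
    # 2. Nucleus: keep the smallest set of tokens whose cumulative probability exceeds top_p.
    order = np.argsort(probs)[::-1]
    cutoff = int(np.searchsorted(np.cumsum(probs[order]), top_p)) + 1
    nucleus = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[nucleus] = probs[nucleus]
    filtered /= filtered.sum()
    # 3. Sample among the surviving tokens.
    return int(rng.choice(len(probs), p=filtered))

logits = [2.0, 1.0, 0.1, -1.0, -2.0]  # hypothetical logits
print(sample_next_token(logits))
```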