Sampling Strategies

Transformers generate text by predicting the next token in a sequence, using probability distributions learned from data. At each step, the model outputs a probability for every possible token in its vocabulary, conditioned on the sequence so far. Mathematically, given a sequence of tokens x_1, x_2, ..., x_{t-1}, the model computes the probability of the next token x_t as:

P(x_t \mid x_1, x_2, ..., x_{t-1})

These probabilities are calculated using the model's output logits (raw scores) for each token, which are then transformed into probabilities using the softmax function:

P(x_t = w \mid x_1, ..., x_{t-1}) = \frac{e^{z_w}}{\sum_{v \in V} e^{z_v}}

where:

  • z_w is the logit (score) for token w;
  • V is the vocabulary;
  • e is the base of the natural exponential.

The softmax function ensures that all probabilities are positive and sum to 1, forming a valid probability distribution over possible next tokens.
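
To make this concrete, here is a minimal sketch (assuming NumPy, with made-up logit values for a tiny five-token vocabulary) of how raw logits become a probability distribution via softmax:

    import numpy as np

    # Hypothetical logits for a tiny 5-token vocabulary (values are made up).
    logits = np.array([2.0, 1.0, 0.5, -1.0, -2.0])

    def softmax(z):
        # Subtracting the max is for numerical stability; it does not change the result.
        z = z - np.max(z)
        exp_z = np.exp(z)
        return exp_z / exp_z.sum()

    probs = softmax(logits)
    print(probs)        # every entry is positive
    print(probs.sum())  # sums to 1.0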

Sampling strategies determine how you select the next token from this distribution:

  • Greedy decoding always picks the token with the highest probability (\arg\max), making the output deterministic and often repetitive.
  • Random sampling draws a token according to its probability, introducing variability but risking incoherence.
  • Top-k sampling restricts selection to the k most probable tokens, then samples among them, zeroing out the probabilities of all other tokens before re-normalizing.
  • Top-p (nucleus) sampling selects the smallest set of tokens whose cumulative probability exceeds a threshold p, then samples within this set.
  • Temperature scaling modifies the distribution by dividing the logits by a temperature parameter T:
P_T(x_t = w) = \frac{e^{\frac{z_w}{T}}}{\sum_{v \in V} e^{\frac{z_v}{T}}}

Lower T sharpens the distribution (more deterministic), while higher T flattens it (more random).

Your choice of sampling strategy alters the mathematical process for selecting each token, directly influencing the diversity, creativity, and coherence of the generated text.
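
As a small illustration of the first two strategies, the sketch below contrasts greedy decoding with plain random sampling over a made-up five-token distribution (assuming NumPy; the vocabulary and probabilities are purely illustrative):

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy next-token distribution over a 5-token vocabulary (values are made up).
    vocab = ["the", "cat", "sat", "on", "mat"]
    probs = np.array([0.50, 0.25, 0.15, 0.07, 0.03])

    # Greedy decoding: always take the argmax -> deterministic.
    greedy_token = vocab[int(np.argmax(probs))]

    # Random sampling: draw a token in proportion to its probability.
    sampled_token = vocab[rng.choice(len(vocab), p=probs)]

    print(greedy_token)   # always "the"
    print(sampled_token)  # varies across runs (fixed here only by the seed)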

Beam Search
  • Intuition: keeps track of several best possible sequences at each step, not just the single most likely one.
  • Strengths: produces coherent and high-quality text; reduces the chance of missing good continuations.
  • Weaknesses: can be computationally expensive; may still lack diversity and become repetitive if beams converge.
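
Below is a minimal beam-search sketch over a toy next-token model; the vocabulary and transition probabilities are invented purely to show the bookkeeping of keeping the num_beams highest-scoring partial sequences by total log-probability:

    import math

    # Toy next-token model: maps the last token to P(next token | last token).
    # These numbers are invented for illustration only.
    NEXT = {
        "<s>": {"the": 0.6, "a": 0.4},
        "the": {"cat": 0.5, "dog": 0.3, "mat": 0.2},
        "a":   {"cat": 0.4, "dog": 0.4, "mat": 0.2},
        "cat": {"sat": 0.7, "ran": 0.3},
        "dog": {"sat": 0.4, "ran": 0.6},
        "mat": {"sat": 0.5, "ran": 0.5},
        "sat": {"<eos>": 1.0},
        "ran": {"<eos>": 1.0},
    }

    def beam_search(num_beams=2, max_len=4):
        # Each beam is (token sequence, total log-probability).
        beams = [(["<s>"], 0.0)]
        for _ in range(max_len):
            candidates = []
            for tokens, score in beams:
                last = tokens[-1]
                if last == "<eos>":
                    candidates.append((tokens, score))  # finished beam carries over
                    continue
                for tok, p in NEXT[last].items():
                    candidates.append((tokens + [tok], score + math.log(p)))
            # Keep only the num_beams highest-scoring candidates.
            beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:num_beams]
        return beams

    for tokens, score in beam_search():
        print(" ".join(tokens), round(score, 3))
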
Top-k Sampling
  • Intuition: limits choices to the top k most probable tokens, then samples randomly among them.
  • Strengths: prevents unlikely, low-quality tokens; increases diversity compared to greedy search.
  • Weaknesses: the value of k is fixed, so it may include too many or too few options depending on the distribution.
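
A minimal top-k sketch on made-up logits (assuming NumPy): everything outside the k most probable tokens is masked out before re-normalizing and sampling:

    import numpy as np

    rng = np.random.default_rng(0)

    def top_k_sample(logits, k):
        logits = np.asarray(logits, dtype=float)
        top_indices = np.argsort(logits)[-k:]       # indices of the k largest logits
        masked = np.full_like(logits, -np.inf)      # -inf logits get probability 0
        masked[top_indices] = logits[top_indices]
        exp = np.exp(masked - masked[top_indices].max())
        probs = exp / exp.sum()                     # softmax over the surviving tokens
        return int(rng.choice(len(logits), p=probs))

    # Hypothetical logits for a 6-token vocabulary.
    logits = [3.0, 2.5, 1.0, 0.2, -1.0, -3.0]
    print(top_k_sample(logits, k=3))  # only ever returns index 0, 1, or 2
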
Top-p (Nucleus) Sampling
  • Intuition: chooses from the smallest set of tokens whose cumulative probability exceeds a threshold p.
  • Strengths: adapts to the probability distribution; balances diversity and quality dynamically.
  • Weaknesses: can be unpredictable if the distribution is flat or peaky; requires tuning p for best results.
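
A minimal nucleus-sampling sketch on a made-up distribution (assuming NumPy): sort tokens by probability, keep the smallest prefix whose cumulative mass reaches p, re-normalize, and sample:

    import numpy as np

    rng = np.random.default_rng(0)

    def top_p_sample(probs, p):
        probs = np.asarray(probs, dtype=float)
        order = np.argsort(probs)[::-1]                   # most probable first
        cumulative = np.cumsum(probs[order])
        cutoff = int(np.searchsorted(cumulative, p)) + 1  # smallest prefix reaching p
        nucleus = order[:cutoff]
        nucleus_probs = probs[nucleus] / probs[nucleus].sum()  # re-normalize
        return int(rng.choice(nucleus, p=nucleus_probs))

    # Hypothetical next-token distribution over a 6-token vocabulary.
    probs = [0.45, 0.25, 0.15, 0.10, 0.04, 0.01]
    print(top_p_sample(probs, p=0.9))  # samples only from indices {0, 1, 2, 3}
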
Temperature Sampling
  • Intuition: adjusts the probability distribution before sampling, making it more or less sharp.
  • Strengths: simple way to control randomness; can make outputs more creative or more predictable.
  • Weaknesses: too high or too low temperature can harm coherence or diversity; not a standalone strategy.
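
A minimal temperature-scaling sketch on made-up logits (assuming NumPy): dividing the logits by T before the softmax sharpens the distribution when T < 1 and flattens it when T > 1:

    import numpy as np

    def softmax_with_temperature(logits, T):
        z = np.asarray(logits, dtype=float) / T
        z = z - z.max()  # numerical stability
        exp_z = np.exp(z)
        return exp_z / exp_z.sum()

    # Hypothetical logits for a 4-token vocabulary.
    logits = [2.0, 1.0, 0.0, -1.0]

    print(softmax_with_temperature(logits, T=0.5))  # sharper: mass piles onto the top token
    print(softmax_with_temperature(logits, T=1.0))  # plain softmax
    print(softmax_with_temperature(logits, T=2.0))  # flatter: closer to uniform
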
Note

Temperature directly affects how random or deterministic the sampling is. Lower temperatures (closer to 0) make the model act more deterministically, favoring the highest-probability tokens and producing more predictable text. Higher temperatures increase randomness, allowing more surprising and diverse outputs, but can also lead to less coherent sentences.

Here is a quick decision guide for choosing a sampling strategy:

  • If you want the most likely, highly coherent output and can afford more computation: consider beam search.
  • If you want creative, diverse text but want to avoid unlikely words: use top-k or top-p (nucleus) sampling.
  • If you want to finely tune the balance between randomness and predictability: adjust temperature (often in combination with top-k or top-p).
  • If you need both quality and diversity: combine nucleus sampling with a moderate temperature.

Your choice depends on your application's need for creativity, coherence, and computational resources.
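
For instance, with the Hugging Face transformers library (assuming it and a small model such as gpt2 are installed; the parameter values below are illustrative, not recommendations), nucleus sampling and temperature can be combined in a single generate call:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    inputs = tokenizer("Once upon a time", return_tensors="pt")
    output = model.generate(
        **inputs,
        do_sample=True,        # sample instead of greedy or beam decoding
        top_p=0.9,             # nucleus sampling threshold
        temperature=0.8,       # mildly sharpened distribution
        max_new_tokens=40,
        pad_token_id=tokenizer.eos_token_id,  # avoids a padding warning for GPT-2
    )
    print(tokenizer.decode(output[0], skip_special_tokens=True))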
