Sampling Strategies
Transformers generate text by predicting the next token in a sequence, using probability distributions learned from data. At each step, the model outputs a probability for every possible token in its vocabulary, conditioned on the sequence so far. Mathematically, given a sequence of tokens $x_1, x_2, \dots, x_{t-1}$, the model computes the probability of the next token $x_t$ as:

$$P(x_t \mid x_1, x_2, \dots, x_{t-1})$$

These probabilities are calculated using the model's output logits (raw scores) for each token, which are then transformed into probabilities using the softmax function:

$$P(x_t = w \mid x_1, \dots, x_{t-1}) = \frac{e^{z_w}}{\sum_{v \in V} e^{z_v}}$$

where:
- $z_w$ is the logit (score) for token $w$;
- $V$ is the vocabulary;
- $e$ is the base of the exponential function.
The softmax function ensures that all probabilities are positive and sum to 1, forming a valid probability distribution over possible next tokens.
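As a concrete illustration, here is a minimal softmax sketch in Python with NumPy, using hypothetical logit values:

```python
import numpy as np

# Hypothetical logits for a tiny 4-token vocabulary.
logits = np.array([2.0, 1.0, 0.1, -1.0])

def softmax(z):
    z = z - np.max(z)        # subtract the max for numerical stability (result is unchanged)
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

probs = softmax(logits)
print(np.round(probs, 3))  # roughly [0.638 0.235 0.095 0.032] -- all positive
print(probs.sum())         # 1.0 -- a valid distribution over the vocabulary
```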
Sampling strategies determine how you select the next token from this distribution:
- Greedy decoding always picks the token with the highest probability (argmax), making the output deterministic and often repetitive.
- Random sampling draws a token according to its probability, introducing variability but risking incoherence.
- Top-k sampling restricts selection to the k most probable tokens, then samples among them, mathematically zeroing out probabilities for all other tokens before re-normalizing.
- Top-p (nucleus) sampling selects the smallest set of tokens whose cumulative probability exceeds a threshold p, then samples within this set.
- Temperature scaling modifies the distribution by dividing the logits by a temperature parameter $T$ before applying the softmax:

$$P(x_t = w \mid x_1, \dots, x_{t-1}) = \frac{e^{z_w / T}}{\sum_{v \in V} e^{z_v / T}}$$

Lower $T$ sharpens the distribution (more deterministic), while higher $T$ flattens it (more random).
Your choice of sampling strategy alters the mathematical process for selecting each token, directly influencing the diversity, creativity, and coherence of the generated text.
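For instance, here is a minimal sketch (toy vocabulary and probabilities, NumPy assumed) contrasting greedy decoding with plain random sampling over the same distribution:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical next-token distribution over a tiny vocabulary.
vocab = ["the", "cat", "sat", "on"]
probs = np.array([0.50, 0.30, 0.15, 0.05])

# Greedy decoding: always take the argmax -- deterministic.
greedy_token = vocab[int(np.argmax(probs))]

# Random sampling: draw a token in proportion to its probability -- varies run to run.
sampled_token = vocab[int(rng.choice(len(vocab), p=probs))]

print(greedy_token)   # always "the"
print(sampled_token)  # usually "the", but sometimes "cat", "sat", or "on"
```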
Beam Search
- Intuition: keeps track of several of the most promising candidate sequences at each step, not just the single most likely one (see the sketch after this list).
- Strengths: produces coherent and high-quality text; reduces the chance of missing good continuations.
- Weaknesses: can be computationally expensive; may still lack diversity and become repetitive if beams converge.
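As a rough illustration, here is a minimal beam search sketch. The `next_token_probs` function is a hypothetical stand-in for a real model's next-token distribution; the point is only the beam bookkeeping:

```python
import math

def next_token_probs(sequence):
    # Hypothetical stand-in for a model: returns {token: probability}.
    # A real implementation would run the transformer on `sequence`.
    return {"a": 0.5, "b": 0.3, "<eos>": 0.2}

def beam_search(start, beam_width=2, max_steps=5):
    # Each beam is (sequence, cumulative log-probability).
    beams = [(list(start), 0.0)]
    for _ in range(max_steps):
        candidates = []
        for seq, score in beams:
            if seq[-1] == "<eos>":
                candidates.append((seq, score))  # finished beams are carried over unchanged
                continue
            for token, p in next_token_probs(seq).items():
                candidates.append((seq + [token], score + math.log(p)))
        # Keep only the beam_width highest-scoring candidates.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams

for seq, score in beam_search(["<bos>"]):
    print(" ".join(seq), round(score, 3))
```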
Top-k Sampling
- Intuition: limits choices to the top k most probable tokens, then samples randomly among them (see the sketch after this list).
- Strengths: filters out unlikely, low-quality tokens; increases diversity compared to greedy decoding.
- Weaknesses: the value of k is fixed, so it may include too many or too few options depending on the distribution.
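A minimal sketch of top-k sampling over a toy distribution (hypothetical probabilities, NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(0)

def top_k_sample(probs, k):
    probs = np.asarray(probs, dtype=float)
    top_indices = np.argsort(probs)[-k:]        # indices of the k largest probabilities
    filtered = np.zeros_like(probs)
    filtered[top_indices] = probs[top_indices]  # zero out everything outside the top k
    filtered /= filtered.sum()                  # re-normalize so the survivors sum to 1
    return int(rng.choice(len(probs), p=filtered))

probs = [0.40, 0.25, 0.20, 0.10, 0.05]
print(top_k_sample(probs, k=3))  # only index 0, 1, or 2 can ever be returned
```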
Top-p (Nucleus) Sampling
- Intuition: chooses from the smallest set of tokens whose cumulative probability exceeds a threshold p (see the sketch after this list).
- Strengths: adapts to the probability distribution; balances diversity and quality dynamically.
- Weaknesses: can behave unpredictably when the distribution is very flat or very peaked; requires tuning p for best results.
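A minimal sketch of top-p (nucleus) sampling under the same assumptions (hypothetical probabilities, NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(0)

def top_p_sample(probs, p=0.9):
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]                    # tokens from most to least probable
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1   # smallest prefix whose mass exceeds p
    nucleus = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[nucleus] = probs[nucleus]                 # keep only the nucleus
    filtered /= filtered.sum()                         # re-normalize
    return int(rng.choice(len(probs), p=filtered))

probs = [0.40, 0.25, 0.20, 0.10, 0.05]
print(top_p_sample(probs, p=0.8))  # nucleus is {0, 1, 2}: 0.40 + 0.25 + 0.20 = 0.85 > 0.8
```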
Temperature Scaling
- Intuition: adjusts the probability distribution before sampling, making it sharper or flatter (see the sketch below).
- Strengths: simple way to control randomness; can make outputs more creative or more predictable.
- Weaknesses: too high or too low temperature can harm coherence or diversity; not a standalone strategy.
Temperature directly affects how random or deterministic the sampling is. Lower temperatures (closer to 0) make the model act more deterministically, favoring the highest-probability tokens and producing more predictable text. Higher temperatures increase randomness, allowing more surprising and diverse outputs, but can also lead to less coherent sentences.
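Here is a minimal sketch showing how dividing hypothetical logits by different temperatures reshapes the softmax distribution:

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)                 # subtract the max for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

logits = np.array([2.0, 1.0, 0.1, -1.0])  # hypothetical logits

for T in (0.5, 1.0, 2.0):
    # Dividing the logits by T before the softmax reshapes the distribution.
    print(f"T={T}:", np.round(softmax(logits / T), 3))

# T=0.5 concentrates probability on the top token (sharper, more deterministic),
# T=1.0 leaves the distribution unchanged, and T=2.0 flattens it (more random).
```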
Use the following decision guide when choosing a sampling strategy:
- If you want the most likely, highly coherent output and can afford more computation: consider beam search.
- If you want creative, diverse text but want to avoid unlikely words: use top-k or top-p (nucleus) sampling.
- If you want to finely tune the balance between randomness and predictability: adjust temperature (often in combination with top-k or top-p).
- If you need both quality and diversity: combine nucleus sampling with a moderate temperature (see the combined sketch below).
Your choice depends on your application's need for creativity, coherence, and computational resources.
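To tie the guide together, here is a minimal sketch (hypothetical logits, NumPy assumed) that combines temperature scaling with nucleus sampling, a common default for balancing quality and diversity:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_next_token(logits, temperature=0.8, top_p=0.9):
    logits = np.asarray(logits, dtype=float)
    # 1. Temperature: divide the logits by T, then apply the softmax.
    scaled = (logits - logits.max()) / temperature
    probs = np.exp(scaled) / np.exp(scaled).sum()
    # 2. Nucleus: keep the smallest set of tokens whose cumulative probability exceeds top_p.
    order = np.argsort(probs)[::-1]
    cutoff = int(np.searchsorted(np.cumsum(probs[order]), top_p)) + 1
    nucleus = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[nucleus] = probs[nucleus]
    filtered /= filtered.sum()
    # 3. Sample among the surviving tokens.
    return int(rng.choice(len(probs), p=filtered))

logits = [2.0, 1.0, 0.1, -1.0, -2.0]  # hypothetical logits
print(sample_next_token(logits))
```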