Sampling Strategies
Transformers generate text by predicting the next token in a sequence, using probability distributions learned from data. At each step, the model outputs a probability for every possible token in its vocabulary, conditioned on the sequence so far. Mathematically, given a sequence of tokens $x_1, x_2, \dots, x_{t-1}$, the model computes the probability of the next token $x_t$ as:

$$P(x_t \mid x_1, x_2, \dots, x_{t-1})$$

These probabilities are calculated from the model's output logits (raw scores) for each token, which are transformed into probabilities using the softmax function:

$$P(x_t = w \mid x_1, \dots, x_{t-1}) = \frac{e^{z_w}}{\sum_{v \in V} e^{z_v}}$$

where:
- $z_w$ is the logit (score) for token $w$;
- $V$ is the vocabulary;
- $e$ is the base of the natural exponential function.
The softmax function ensures that all probabilities are positive and sum to 1, forming a valid probability distribution over possible next tokens.
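To make the softmax step concrete, here is a minimal sketch in Python using NumPy. The tiny vocabulary and logit values are made up for illustration; a real model would produce logits over tens of thousands of tokens.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Convert raw logits into a probability distribution.

    Subtracting the max logit first is a standard numerical-stability
    trick; it does not change the resulting probabilities.
    """
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()

# Hypothetical logits for a tiny 4-token vocabulary.
vocab = ["the", "cat", "sat", "mat"]
logits = np.array([2.0, 1.0, 0.5, -1.0])

probs = softmax(logits)
for token, p in zip(vocab, probs):
    print(f"{token}: {p:.3f}")

print("sum of probabilities:", probs.sum())  # ~1.0 by construction
```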
Sampling strategies determine how you select the next token from this distribution:
- Greedy decoding always picks the token with the highest probability (argmax), making the output deterministic and often repetitive.
- Random sampling draws a token according to its probability, introducing variability but risking incoherence.
- Top-k sampling restricts selection to the k most probable tokens, then samples among them, mathematically zeroing out probabilities for all other tokens before re-normalizing.
- Top-p (nucleus) sampling selects the smallest set of tokens whose cumulative probability exceeds a threshold p, then samples within this set.
- Temperature scaling modifies the distribution by dividing the logits by a temperature parameter $T$ before applying softmax:

$$P(x_t = w \mid x_1, \dots, x_{t-1}) = \frac{e^{z_w / T}}{\sum_{v \in V} e^{z_v / T}}$$

Lower $T$ sharpens the distribution (more deterministic), while higher $T$ flattens it (more random).
Your choice of sampling strategy alters the mathematical process for selecting each token, directly influencing the diversity, creativity, and coherence of the generated text.
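As a rough sketch of how the first two strategies in the list differ, the snippet below contrasts greedy decoding (argmax) with random sampling over the same hypothetical distribution; the vocabulary and probabilities are illustrative, not output from a real model.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical next-token distribution, already passed through softmax.
vocab = np.array(["the", "cat", "sat", "mat"])
probs = np.array([0.55, 0.25, 0.15, 0.05])

# Greedy decoding: always take the single most probable token.
greedy_token = vocab[np.argmax(probs)]

# Random sampling: draw tokens in proportion to their probabilities.
sampled_tokens = rng.choice(vocab, size=5, p=probs)

print("greedy pick:", greedy_token)        # deterministic: always "the"
print("random samples:", sampled_tokens)   # varies run to run (seeded here)
```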
Beam search
- Intuition: keeps track of several of the best candidate sequences at each step, not just the single most likely one.
- Strengths: produces coherent and high-quality text; reduces the chance of missing good continuations.
- Weaknesses: can be computationally expensive; may still lack diversity and become repetitive if beams converge.
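Below is a simplified beam search sketch. It assumes a hypothetical `next_token_log_probs` function that returns (token, log-probability) pairs for a given prefix; in practice, that call would run the model. The toy model at the bottom exists only to make the example runnable.

```python
import math
from typing import Callable, List, Tuple

def beam_search(
    next_token_log_probs: Callable[[List[str]], List[Tuple[str, float]]],
    start: List[str],
    beam_width: int = 3,
    max_steps: int = 10,
    eos: str = "<eos>",
) -> List[str]:
    """Keep the `beam_width` highest-scoring partial sequences at each step."""
    beams = [(start, 0.0)]  # each beam is (tokens_so_far, cumulative_log_prob)

    for _ in range(max_steps):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == eos:              # finished beams carry over unchanged
                candidates.append((tokens, score))
                continue
            for token, logp in next_token_log_probs(tokens):
                candidates.append((tokens + [token], score + logp))
        # Keep only the beam_width best candidates by total log-probability.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]

    return beams[0][0]  # tokens of the best-scoring sequence

# Toy "model": slightly prefers "a" over "b", and ends after two "a" tokens.
def toy_log_probs(tokens):
    if tokens[-2:] == ["a", "a"]:
        return [("<eos>", math.log(0.9)), ("b", math.log(0.1))]
    return [("a", math.log(0.6)), ("b", math.log(0.4))]

print(beam_search(toy_log_probs, start=["<bos>"], beam_width=2))
```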
Top-k sampling
- Intuition: limits choices to the top k most probable tokens, then samples randomly among them.
- Strengths: prevents unlikely, low-quality tokens; increases diversity compared to greedy search.
- Weaknesses: the value of k is fixed, so it may include too many or too few options depending on the distribution.
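Here is a minimal top-k sampling sketch in NumPy, assuming you already have softmax probabilities for the full vocabulary; keeping only the k largest entries and re-normalizing mirrors the description above. The example distribution is made up.

```python
import numpy as np

def top_k_sample(probs: np.ndarray, k: int, rng: np.random.Generator) -> int:
    """Sample a token index from the k most probable tokens only."""
    # Indices of the k largest probabilities (order among them does not matter).
    top_indices = np.argpartition(probs, -k)[-k:]
    top_probs = probs[top_indices]
    # All other tokens are effectively zeroed out; re-normalize the kept mass.
    top_probs = top_probs / top_probs.sum()
    return int(rng.choice(top_indices, p=top_probs))

rng = np.random.default_rng(seed=0)
probs = np.array([0.50, 0.20, 0.15, 0.10, 0.05])  # hypothetical distribution
print(top_k_sample(probs, k=3, rng=rng))          # index drawn only from {0, 1, 2}
```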
Top-p (nucleus) sampling
- Intuition: chooses from the smallest set of tokens whose cumulative probability exceeds a threshold p.
- Strengths: adapts to the probability distribution; balances diversity and quality dynamically.
- Weaknesses: can be unpredictable if the distribution is flat or peaky; requires tuning p for best results.
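A matching nucleus (top-p) sketch, again over a hypothetical pre-computed distribution: tokens are sorted by probability and kept until their cumulative mass first exceeds p, then sampling happens within that nucleus.

```python
import numpy as np

def top_p_sample(probs: np.ndarray, p: float, rng: np.random.Generator) -> int:
    """Sample from the smallest set of tokens whose cumulative probability exceeds p."""
    order = np.argsort(probs)[::-1]              # most probable first
    cumulative = np.cumsum(probs[order])
    # Keep tokens up to and including the first one that pushes the mass past p.
    cutoff = int(np.searchsorted(cumulative, p)) + 1
    nucleus = order[:cutoff]
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()
    return int(rng.choice(nucleus, p=nucleus_probs))

rng = np.random.default_rng(seed=0)
probs = np.array([0.50, 0.20, 0.15, 0.10, 0.05])  # hypothetical distribution
print(top_p_sample(probs, p=0.8, rng=rng))        # nucleus here is {0, 1, 2}
```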
Temperature scaling
- Intuition: adjusts the probability distribution before sampling, making it sharper or flatter.
- Strengths: simple way to control randomness; can make outputs more creative or more predictable.
- Weaknesses: too high or too low temperature can harm coherence or diversity; not a standalone strategy.
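Finally, a short sketch of temperature scaling applied to the same hypothetical logits used earlier: dividing the logits by T before softmax sharpens or flattens the resulting distribution.

```python
import numpy as np

def softmax_with_temperature(logits: np.ndarray, temperature: float) -> np.ndarray:
    """Apply softmax to logits scaled by 1 / temperature."""
    scaled = logits / temperature
    scaled -= scaled.max()                  # numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

logits = np.array([2.0, 1.0, 0.5, -1.0])    # hypothetical logits
for t in (0.5, 1.0, 2.0):
    print(f"T={t}: {np.round(softmax_with_temperature(logits, t), 3)}")
# Low T concentrates probability mass on the top token; high T spreads it out.
```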
Temperature directly affects how random or deterministic the sampling is. Lower temperatures (closer to 0) make the model act more deterministically, favoring the highest-probability tokens and producing more predictable text. Higher temperatures increase randomness, allowing more surprising and diverse outputs, but can also lead to less coherent sentences.
Here is a quick decision guide for choosing a sampling strategy:
- If you want the most likely, highly coherent output and can afford more computation: consider beam search.
- If you want creative, diverse text but want to avoid unlikely words: use top-k or top-p (nucleus) sampling.
- If you want to finely tune the balance between randomness and predictability: adjust temperature (often in combination with top-k or top-p).
- If you need both quality and diversity, combine nucleus sampling with a moderate temperature.
Your choice depends on your application's need for creativity, coherence, and computational resources.
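As one practical way to apply these choices, the Hugging Face transformers `generate()` API exposes them as parameters. The sketch below combines nucleus sampling with a moderate temperature, as suggested above, and includes a beam search call for comparison; the model name and parameter values are illustrative, not recommendations.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # any causal LM checkpoint works here; gpt2 is just an example
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("Once upon a time", return_tensors="pt")

# Nucleus sampling with a moderate temperature, per the guidance above.
sampled_ids = model.generate(
    **inputs,
    do_sample=True,       # enable sampling instead of greedy decoding
    top_p=0.9,            # nucleus sampling threshold
    temperature=0.8,      # moderate temperature
    max_new_tokens=50,
)

# For comparison: beam search is deterministic and favors high-probability text.
beam_ids = model.generate(**inputs, num_beams=4, do_sample=False, max_new_tokens=50)

print(tokenizer.decode(sampled_ids[0], skip_special_tokens=True))
print(tokenizer.decode(beam_ids[0], skip_special_tokens=True))
```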