Failure Modes and Generalization Limits
When working with large language models (LLMs), you may expect them to generalize well to new tasks with zero or few examples. However, there are well-recognized theoretical and practical boundaries to this capability. Understanding why and where zero-shot and few-shot generalization break down is crucial for using LLMs effectively.
Ambiguity arises when a prompt or task description is open to multiple interpretations. LLMs rely on statistical associations from their training data, so when faced with ambiguous instructions, the model may choose an unintended interpretation, leading to unpredictable or incorrect outputs.
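In practice, the remedy is to make every implicit decision explicit in the prompt. The snippet below is a minimal sketch of that contrast; the report scenario and the placeholder text are hypothetical and not tied to any particular model or API:

```python
# The same request, written ambiguously and then with the interpretation
# pinned down. The first version forces the model to guess the source
# material, the length, and the audience; the second does not.

ambiguous_prompt = "Summarize the report."

disambiguated_prompt = (
    "Summarize the Q3 sales report pasted below in exactly three bullet "
    "points for a non-technical executive audience. Only use figures that "
    "appear in the report.\n\n"
    "Report:\n{report_text}"
)

# Hypothetical usage: the report text would be supplied by your application.
print(disambiguated_prompt.format(report_text="<paste report here>"))
```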
Hallucinations refer to the phenomenon where LLMs generate plausible-sounding but false or unsupported information. This occurs because LLMs have no built-in mechanism to verify facts; they only generate text that statistically fits the prompt and context. As a result, in tasks requiring factual accuracy or external validation, LLMs may confidently produce incorrect statements.
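Because the model itself cannot verify its claims, a common mitigation is to treat its output as a draft and check any verifiable statements against a trusted source afterwards. The sketch below assumes a toy fact table and a hypothetical "key: value" output format; it only illustrates the shape of such a post-hoc check, not a production fact-checker:

```python
# A minimal post-hoc verification sketch. The fact table and the
# "key: value" output format are illustrative placeholders; a real
# pipeline would check against a curated knowledge base or retrieval.

facts = {
    "capital of australia": "Canberra",
    "boiling point of water at sea level": "100 °C",
}

def extract_claims(model_output: str) -> list[tuple[str, str]]:
    """Parse 'key: value' lines from the model's draft output."""
    claims = []
    for line in model_output.splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            claims.append((key.strip().lower(), value.strip()))
    return claims

def unsupported_claims(model_output: str) -> list[str]:
    """Return claims that are missing from, or contradict, the fact table."""
    flagged = []
    for key, value in extract_claims(model_output):
        if facts.get(key) != value:
            flagged.append(f"{key}: {value}")
    return flagged

draft = (
    "capital of australia: Sydney\n"
    "boiling point of water at sea level: 100 °C"
)
print(unsupported_claims(draft))  # ['capital of australia: Sydney']
```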
Context saturation happens when the prompt or input context is too long or complex for the model to process effectively. LLMs have a finite context window, and when this limit is exceeded, important information may be truncated or ignored. This can lead to degraded performance, especially in tasks that require integrating information across a lengthy context.
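A practical first step is to count the tokens in your prompt against the model's documented context window before sending it. The sketch below uses the tiktoken library's cl100k_base encoding as one example tokenizer; the 8,192-token window and the response budget are assumed values, not properties of any particular model:

```python
# Check prompt length against an assumed context window before sending.
# Requires: pip install tiktoken
import tiktoken

CONTEXT_WINDOW = 8192     # assumed limit; use your model's documented value
RESPONSE_BUDGET = 1024    # tokens to leave free for the model's answer

def fits_in_context(prompt: str) -> bool:
    """Return True if the prompt plus a response budget fits in the window."""
    enc = tiktoken.get_encoding("cl100k_base")
    n_tokens = len(enc.encode(prompt))
    print(f"Prompt uses {n_tokens} of {CONTEXT_WINDOW} tokens")
    return n_tokens + RESPONSE_BUDGET <= CONTEXT_WINDOW

long_prompt = "Background document: " + "lorem ipsum " * 5000
if not fits_in_context(long_prompt):
    print("Prompt risks truncation: shorten, chunk, or summarize the context.")
```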
Domain shift is another major failure mode. LLMs are trained on a wide variety of data, but when presented with tasks or data distributions that are significantly different from their training set, their performance can drop sharply. This is because the statistical patterns learned during training may not apply to the new domain, resulting in poor generalization.
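One way to catch domain shift early, rather than discovering it in production, is to evaluate the same prompting setup on a small in-domain test set and a small out-of-domain test set and compare the scores. The sketch below only shows the structure of that comparison; the ask_model function and the example items are placeholders you would replace with your own model call and data:

```python
# Compare accuracy on in-domain vs. out-of-domain examples.
# `ask_model` is a placeholder for however you call your LLM.

def ask_model(question: str) -> str:
    raise NotImplementedError("plug in your model call here")

in_domain = [             # resembles typical, broadly available data
    ("What is the capital of France?", "Paris"),
]
out_of_domain = [         # e.g. niche jargon or proprietary formats
    ("Decode the internal ticket code ZX-77-Q.", "billing escalation"),
]

def accuracy(dataset: list[tuple[str, str]]) -> float:
    correct = 0
    for question, expected in dataset:
        answer = ask_model(question)
        correct += int(expected.lower() in answer.lower())
    return correct / len(dataset)

# A large gap between the two scores suggests prompt-based generalization
# is not covering the new domain.
# print("in-domain:", accuracy(in_domain))
# print("out-of-domain:", accuracy(out_of_domain))
```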
The performance of LLMs in zero-shot and few-shot settings is fundamentally limited by the statistical properties of their training data and model architecture. If a task requires reasoning patterns or knowledge not present in the training data, the model cannot invent solutions beyond its learned distributions. Additionally, the finite capacity of the model means that it cannot store or retrieve every possible combination of facts or rules.
Even with carefully engineered prompts, LLMs cannot perform tasks that require explicit new knowledge, logical inference outside their training scope, or reasoning about truly novel concepts. Prompt-based generalization is constrained to what the model has implicitly learned; it cannot extrapolate beyond its conceptual boundaries without additional training or external tools.
Not all tasks are equally generalizable. Tasks that require memorization, precise calculation, or access to up-to-date information may need explicit training or integration of new knowledge sources, rather than relying solely on prompt-based generalization.
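Precise calculation is a good example: exact arithmetic is usually better delegated to ordinary code than generated token by token. Below is a minimal sketch of that pattern using Python's ast module to evaluate only plain arithmetic expressions; the routing logic around it is assumed, not part of any specific framework:

```python
# Delegate exact arithmetic to the interpreter instead of the model.
import ast
import operator

_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.Pow: operator.pow, ast.USub: operator.neg,
}

def safe_eval(expr: str) -> float:
    """Evaluate a plain arithmetic expression; reject anything else."""
    def _eval(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.operand))
        raise ValueError("unsupported expression")
    return _eval(ast.parse(expr, mode="eval").body)

# Instead of asking the LLM "What is 1234 * 5678?", compute it directly
# and let the model handle only the surrounding language.
print(safe_eval("1234 * 5678"))  # 7006652
```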