Theoretical Limits of Alignment in RLHF
To understand the theoretical limits of alignment in reinforcement learning from human feedback (RLHF), you need to begin with clear formal definitions. In the RLHF setting, alignment refers to the degree to which an agent's learned policy achieves outcomes consistent with human values or intentions, as interpreted through human-provided feedback. More formally, suppose there is a true reward function R* that perfectly encodes human values, but the agent only has access to a learned reward model R̂ constructed from human feedback, which may be noisy or incomplete. The agent's policy π is said to be aligned if, for all relevant states and actions, optimizing R̂ leads to high expected return under R*.
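To make the definition tangible, here is a minimal numerical sketch (not part of the original lesson): a hypothetical tabular reward stands in for R*, a noise-corrupted copy stands in for R̂, and the alignment gap is measured as the return lost under R* when the agent greedily optimizes R̂. The state and action counts, the noise level, and the uniform state distribution are all illustrative assumptions.

```python
import numpy as np

# Hypothetical toy setup: 3 states, 2 actions, tabular rewards.
# r_true plays the role of R* (the reward that encodes human values);
# r_hat plays the role of R̂ (the model learned from human feedback).
rng = np.random.default_rng(0)
n_states, n_actions = 3, 2
r_true = rng.uniform(0, 1, size=(n_states, n_actions))
r_hat = r_true + rng.normal(0, 0.3, size=(n_states, n_actions))  # noisy proxy

# Greedy policies: in each state, pick the action that maximizes each reward.
pi_hat = r_hat.argmax(axis=1)    # policy obtained by optimizing R̂
pi_star = r_true.argmax(axis=1)  # policy that is optimal under R*

def expected_return(policy, reward):
    """Expected one-step return under a uniform state distribution."""
    return reward[np.arange(n_states), policy].mean()

gap = expected_return(pi_star, r_true) - expected_return(pi_hat, r_true)
print(f"alignment gap (regret under R*): {gap:.3f}")
```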
Value learning in RLHF is the process by which the agent infers or approximates R* using human feedback, such as preference comparisons or demonstrations. The goal is to minimize the gap between R̂ and R* so that the agent's behavior reflects the true underlying human values as closely as possible. However, perfect alignment is rarely achievable in practice due to the indirectness, ambiguity, and possible inconsistency of human feedback.
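As a concrete illustration of value learning from preference comparisons, the sketch below fits a linear reward model with a Bradley-Terry preference likelihood, a common modeling choice for RLHF reward learning. The hidden weight vector, the synthetic feature vectors, and the optimizer settings are placeholder assumptions introduced only for this example.

```python
import torch

torch.manual_seed(0)
n_pairs, n_features = 256, 8

# Hypothetical hidden "true" reward R*(x) = w_true·x, used only to label preferences.
w_true = torch.randn(n_features)
x_a = torch.randn(n_pairs, n_features)
x_b = torch.randn(n_pairs, n_features)
a_preferred = (x_a @ w_true) > (x_b @ w_true)          # noiseless labels for simplicity
preferred = torch.where(a_preferred.unsqueeze(1), x_a, x_b)
rejected = torch.where(a_preferred.unsqueeze(1), x_b, x_a)

# Learned reward model R̂(x) = w·x, fit with the Bradley-Terry likelihood:
# P(preferred ≻ rejected) = sigmoid(R̂(preferred) − R̂(rejected)).
w = torch.zeros(n_features, requires_grad=True)
opt = torch.optim.Adam([w], lr=0.05)
for step in range(300):
    margin = preferred @ w - rejected @ w
    loss = -torch.nn.functional.logsigmoid(margin).mean()  # negative log-likelihood
    opt.zero_grad()
    loss.backward()
    opt.step()

# Cosine similarity shows R̂ recovers the direction of R* (its scale is unidentified,
# one of the ways preference feedback underdetermines the true reward).
cos = torch.nn.functional.cosine_similarity(w.detach(), w_true, dim=0).item()
print(f"preference loss: {loss.item():.3f}, cosine(w, w_true): {cos:.3f}")
```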
The limits of alignment in RLHF are shaped by several theoretical results that establish impossibility theorems and lower bounds. One key impossibility result is that, under realistic assumptions about the noisiness and incompleteness of human feedback, there always exists some irreducible gap between the agent's learned objective and the true human objective. This is sometimes formalized as a lower bound on the regret (the difference in expected reward between the agent's policy and the optimal policy under R*) that cannot be eliminated even with unlimited data and computation if the feedback is fundamentally ambiguous or misspecified. Theoretical assumptions underlying these results include the expressiveness of the reward model class, the nature of human feedback (e.g., pairwise comparisons vs. scalar rewards), and the possibility of distributional shift between training and deployment environments. These impossibility results highlight that, unless human feedback perfectly identifies R* in all relevant contexts, there will always be scenarios where the agent's actions diverge from human intent.
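The flavor of such lower-bound arguments can be captured in a tiny constructed example: two candidate true reward functions agree on every state that human feedback covers but disagree on one state it never covers, so whatever the learned policy does there, it suffers positive worst-case regret regardless of how much feedback is collected. The two-state, two-action reward tables below are hypothetical and chosen only to make the argument concrete.

```python
import numpy as np

# Two candidate "true" reward tables over states {s0, s1} and actions {a0, a1}.
# They agree on s0 (the only state human feedback ever covers) but disagree on s1.
R_A = np.array([[1.0, 0.0],   # s0: a0 preferred
                [1.0, 0.0]])  # s1: a0 preferred
R_B = np.array([[1.0, 0.0],   # s0: identical preferences to R_A
                [0.0, 1.0]])  # s1: a1 preferred instead

# Any reward model fit only to s0 comparisons is consistent with both R_A and R_B,
# so the learned policy must commit to some action in s1 without evidence.
for choice_in_s1, label in [(0, "pick a0 in s1"), (1, "pick a1 in s1")]:
    regret_A = R_A[1].max() - R_A[1, choice_in_s1]
    regret_B = R_B[1].max() - R_B[1, choice_in_s1]
    worst_case = max(regret_A, regret_B)
    print(f"{label}: regret under R_A={regret_A}, under R_B={regret_B}, "
          f"worst case={worst_case}")
# Whatever the policy does in s1, worst-case regret is 1.0: a lower bound that
# no amount of s0-only feedback can remove.
```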
To make these concepts more concrete, consider a diagram that illustrates the gap between the learned and intended objectives. The diagram below shows the true reward function R*, the learned reward model R̂, and the resulting agent policy π. The shaded region represents the potential misalignment: the set of states and actions where optimizing R̂ does not yield optimal outcomes under R*. This gap persists even as the number of feedback samples grows, reflecting the theoretical limits discussed above.
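Since the diagram itself cannot be shown here, the short simulation below (an illustrative sketch, not taken from the lesson) makes the same point numerically: when comparisons only ever cover one state, the regret incurred in the uncovered state stays fixed no matter how many feedback samples are gathered. The reward tables, the 90% labeler accuracy, and the simple win-rate estimator are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
# True reward R*: state s1 strongly prefers a1, but feedback only covers s0.
r_true = np.array([[1.0, 0.0],   # s0
                   [0.0, 1.0]])  # s1

for n_feedback in [10, 100, 1000, 10000]:
    # Simulated comparisons in s0: the better action (a0) wins 90% of the time.
    wins_a0 = rng.binomial(n_feedback, 0.9)
    # Estimated reward R̂: empirical win rates in s0, an uninformed 0.5 prior in s1.
    r_hat = np.array([[wins_a0 / n_feedback, 1 - wins_a0 / n_feedback],
                      [0.5, 0.5]])
    policy = r_hat.argmax(axis=1)  # greedy w.r.t. R̂ (ties resolved toward a0)
    regret = (r_true.max(axis=1) - r_true[np.arange(2), policy]).mean()
    print(f"N={n_feedback:>5}: regret under R* = {regret:.2f}")
# The regret contributed by s1 never shrinks: more samples about s0 say nothing about s1.
```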