Distribution Shift and Generalization Failures
When you train reinforcement learning from human feedback (RLHF) agents, the data they see during training rarely matches the real-world situations they will face after deployment. This mismatch is called distribution shift. In the RLHF context, distribution shift is especially important because agents are optimized not just for an external environment, but also for a learned reward model that is itself based on human feedback. There are two key types of distribution shift to consider:
- Covariate shift: The distribution over states or observations encountered by the agent changes between training and deployment;
- Reward shift: The mapping from state-action pairs to rewards differs between the training reward model and the true reward function intended by humans.
Covariate shift often arises because the agent, once optimized, explores parts of the environment that were rarely seen during human preference collection. Reward shift, on the other hand, can happen if the reward model generalizes poorly to novel situations or if human feedback is inconsistent or incomplete in unexplored regions of the state space.
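To get a feel for covariate shift, the hedged sketch below compares the state-visitation distribution seen while collecting human preferences with the one induced by the optimized policy, using a simple histogram-based KL divergence. The function name, the Gaussian toy distributions, and the smoothing constant are all assumptions made for illustration, not part of any standard RLHF tooling.

```python
import numpy as np

def empirical_kl(train_states, deploy_states, n_bins=20, eps=1e-8):
    """Estimate KL(deploy || train) between two empirical state distributions.

    Both inputs are 1-D arrays of a scalar state feature (hypothetical);
    a large value signals that the deployed policy visits regions that
    were rare during preference collection.
    """
    bins = np.histogram_bin_edges(
        np.concatenate([train_states, deploy_states]), bins=n_bins
    )
    p_train, _ = np.histogram(train_states, bins=bins)
    p_deploy, _ = np.histogram(deploy_states, bins=bins)

    # Normalize counts to probabilities, smoothing so empty bins do not blow up.
    p_train = (p_train + eps) / (p_train + eps).sum()
    p_deploy = (p_deploy + eps) / (p_deploy + eps).sum()

    return float(np.sum(p_deploy * np.log(p_deploy / p_train)))

# Hypothetical example: after optimization, the policy drifts toward states
# near 3.0 that were rarely visited while collecting human preferences.
rng = np.random.default_rng(0)
train_states = rng.normal(loc=0.0, scale=1.0, size=5_000)
deploy_states = rng.normal(loc=3.0, scale=1.0, size=5_000)
print(f"KL(deploy || train) ≈ {empirical_kl(train_states, deploy_states):.2f}")
```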
To understand how well a reward model trained from human feedback will perform in new situations, you need to analyze its generalization error. Theoretical frameworks for generalization error in RLHF often draw on concepts from statistical learning theory. In this setting, the generalization error of the reward model can be expressed as the difference between the expected reward assigned by the model and the true, but unknown, human reward across all possible states and actions the agent might encounter.
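One way to make this precise (the notation below is ours, not taken from the text above) is to write $\hat{r}$ for the learned reward model, $r^{*}$ for the true human reward, and $d^{\pi}$ for the distribution of state-action pairs the deployed policy $\pi$ actually visits. The generalization error is then the expected gap between the two rewards under that deployment distribution:

```latex
\mathrm{GenErr}(\hat{r})
  \;=\;
  \mathbb{E}_{(s,a)\sim d^{\pi}}
  \Bigl[\, \bigl|\, \hat{r}(s,a) - r^{*}(s,a) \,\bigr| \,\Bigr]
```

The important detail is that the expectation is taken under the deployment distribution $d^{\pi}$, while $\hat{r}$ was fit on samples from the preference-collection distribution; the larger the mismatch between those two distributions, the weaker the usual statistical-learning guarantees become.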
Key factors that influence generalization error include:
- The complexity of the reward model relative to the amount of feedback data;
- The diversity of the feedback data, especially whether it covers the regions of the environment the agent actually visits during deployment;
- The presence of feedback loops, where the agent's behavior shifts the distribution of encountered states, potentially moving it further from the training distribution.
When the agent encounters states outside the support of the training data, the reward model may extrapolate incorrectly, leading to large generalization errors and, ultimately, misaligned behavior, as the sketch below illustrates.
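The hedged sketch below makes this concrete with a toy setup: a small reward model is fit only on states where feedback was collected, then evaluated both inside and outside that support, where its error grows sharply. The sinusoidal "true" reward, the MLP architecture, and the state ranges are all assumptions chosen purely for illustration.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Hypothetical "true" human reward over a 1-D state feature.
def true_reward(s):
    return np.sin(s)

# Feedback was only collected on states in [-2, 2] (the training support).
train_s = rng.uniform(-2.0, 2.0, size=2_000).reshape(-1, 1)
train_r = true_reward(train_s).ravel() + rng.normal(0.0, 0.05, size=2_000)

# A small neural reward model, fit only on the covered region.
reward_model = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2_000, random_state=0)
reward_model.fit(train_s, train_r)

# Evaluate inside vs. outside the training support.
in_dist = np.linspace(-2.0, 2.0, 200).reshape(-1, 1)
out_dist = np.linspace(4.0, 8.0, 200).reshape(-1, 1)

err_in = np.mean(np.abs(reward_model.predict(in_dist) - true_reward(in_dist).ravel()))
err_out = np.mean(np.abs(reward_model.predict(out_dist) - true_reward(out_dist).ravel()))

print(f"mean |error| inside training support:  {err_in:.3f}")
print(f"mean |error| outside training support: {err_out:.3f}")
```

In a real RLHF pipeline the same effect appears in high-dimensional state spaces, where it is far harder to detect which queries fall outside the support of the collected feedback.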