Optimization with Learned Rewards: Instability and Feedback Loops
The optimization process in Reinforcement Learning from Human Feedback (RLHF) centers on training an agent, or policy, to maximize expected reward as judged by a learned reward model. In classical reinforcement learning, agents optimize for a true, often hand-crafted, reward function directly tied to the intended task. RLHF, by contrast, relies on a reward model trained on human preference data, which serves as a proxy for the true reward. This learned reward function introduces a critical distinction: the agent is not optimizing for the true underlying objective, but for the output of a model that approximates human intent.
Formally, let $\pi$ denote the agent's policy, $r_{\text{true}}(s,a)$ the true reward for state-action pair $(s,a)$, and $r_{\text{learned}}(s,a;\theta)$ the reward predicted by the learned reward model with parameters $\theta$. RLHF proceeds in two phases: first, the reward model is trained using supervised learning on human preference data, and then the agent is optimized to maximize the expected cumulative learned reward:

$$\mathbb{E}_{\pi}\left[\sum_{t} r_{\text{learned}}(s_t, a_t; \theta)\right]$$

Since the reward model is only an approximation, the optimization process is susceptible to mismatches between $r_{\text{true}}$ and $r_{\text{learned}}$, particularly in regions of the state-action space insufficiently covered by human feedback.
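To make the two-phase structure concrete, here is a minimal Python sketch. It is an illustrative assumption rather than a reference implementation: the `RewardModel`, the synthetic preference pairs, and the REINFORCE-style update are all stand-ins. Phase 1 fits $r_{\text{learned}}(s,a;\theta)$ to pairwise preferences with a Bradley-Terry style loss; phase 2 nudges the policy toward trajectories that the learned model, not the true reward, scores highly.

```python
# Illustrative two-phase RLHF sketch; all names and shapes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RewardModel(nn.Module):
    """r_learned(s, a; theta): maps a state-action feature vector to a scalar score."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x):
        return self.net(x).squeeze(-1)


def train_reward_model(reward_model, preference_pairs, epochs=10, lr=1e-3):
    """Phase 1: supervised learning on human preference data.

    preference_pairs holds (preferred_features, rejected_features) tensor pairs,
    each of shape (timesteps, dim). A Bradley-Terry style loss pushes the model
    to score the preferred trajectory higher than the rejected one.
    """
    opt = torch.optim.Adam(reward_model.parameters(), lr=lr)
    for _ in range(epochs):
        for preferred, rejected in preference_pairs:
            margin = reward_model(preferred).sum() - reward_model(rejected).sum()
            loss = -F.logsigmoid(margin)  # maximize P(preferred beats rejected)
            opt.zero_grad()
            loss.backward()
            opt.step()


def policy_gradient_step(log_probs, features, reward_model, policy_opt):
    """Phase 2: a crude REINFORCE-style update maximizing the expected
    cumulative learned reward for one sampled trajectory.

    log_probs: per-step log pi(a_t | s_t), still attached to the policy graph.
    features:  per-step state-action features, shape (timesteps, dim).
    """
    with torch.no_grad():
        trajectory_return = reward_model(features).sum()  # learned, not true, reward
    loss = -trajectory_return * log_probs.sum()
    policy_opt.zero_grad()
    loss.backward()
    policy_opt.step()


if __name__ == "__main__":
    dim = 8
    rm = RewardModel(dim)
    # Synthetic tensors standing in for human-labelled preference pairs.
    pairs = [(torch.randn(5, dim), torch.randn(5, dim)) for _ in range(32)]
    train_reward_model(rm, pairs, epochs=2)
```

Note that the policy update in phase 2 never sees $r_{\text{true}}$; everything downstream of phase 1 depends on how well the reward model generalizes.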
A central challenge in this setup is instability during optimization, which can arise from several sources. One major source is reward model overfitting: if the reward model is trained too closely on a limited or biased set of human preferences, it may assign high rewards to behaviors that exploit idiosyncrasies of the training data rather than genuinely reflecting human intent. This overfitting can lead the policy to discover and exploit loopholes or artifacts in the learned reward, resulting in so-called "reward hacking".
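One hedged way to illustrate this failure mode in code is a simple divergence check: periodically score the same rollouts with both the learned reward and whatever ground-truth signal is available (for example, fresh human spot checks), and flag runs where the proxy keeps improving while the true signal does not. The function below is a hypothetical diagnostic, not part of any standard RLHF pipeline.

```python
def reward_hacking_alarm(learned_scores, true_scores, window=5, tol=0.0):
    """learned_scores / true_scores: mean rewards from periodic evaluations, oldest first.

    Returns True when the learned reward improved over the last `window`
    evaluations while the true signal stalled or declined, which is the
    signature of a policy exploiting artifacts of the reward model.
    """
    if len(learned_scores) < window or len(true_scores) < window:
        return False
    learned_trend = learned_scores[-1] - learned_scores[-window]
    true_trend = true_scores[-1] - true_scores[-window]
    return learned_trend > tol and true_trend <= tol


# Example: the learned reward climbs while the true reward plateaus, then degrades.
learned_log = [0.1, 0.3, 0.5, 0.8, 1.2, 1.7]
true_log    = [0.1, 0.3, 0.4, 0.4, 0.3, 0.2]
print(reward_hacking_alarm(learned_log, true_log))  # True: likely gaming the proxy
```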
Another key factor is non-stationarity. As the policy improves and explores new parts of the environment, it may encounter states and actions far from those seen during reward model training. Since the learned reward is only reliable within the distribution of its training data, the agent's exploration can cause the effective reward landscape to shift, leading to unpredictable feedback loops. In extreme cases, the reward model may provide misleading or uninformative gradients, destabilizing the optimization.
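The sketch below shows one way, under assumed feature representations, to operationalize the idea that the learned reward is only reliable near its training distribution: measure how far the policy's new state-action features fall from the reward model's training set, and treat $r_{\text{learned}}$ as untrustworthy beyond some distance. The nearest-neighbour distance and the threshold are illustrative choices, not prescriptions from the text.

```python
import numpy as np


def ood_distance(x, reward_training_features):
    """Distance from feature vector x to its nearest neighbour in the reward
    model's training data; larger values mean x is further off-distribution."""
    return np.linalg.norm(reward_training_features - x, axis=1).min()


def trust_learned_reward(x, reward_training_features, threshold=3.0):
    """Only trust r_learned(x) when x lies close to data the reward model saw."""
    return ood_distance(x, reward_training_features) < threshold


rng = np.random.default_rng(0)
train_feats = rng.normal(size=(500, 8))   # features seen during reward model training
on_dist = rng.normal(size=8)              # resembles the training distribution
off_dist = rng.normal(loc=6.0, size=8)    # far from anything the reward model saw

print(trust_learned_reward(on_dist, train_feats))   # typically True
print(trust_learned_reward(off_dist, train_feats))  # typically False
```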
To visualize these dynamics, consider the following diagrams. The first illustrates the relationship between the true reward landscape and the learned reward model. The second highlights how overfitting and non-stationarity can create feedback loops during policy optimization.
[Diagram: the true reward landscape compared with the learned reward model's approximation]
[Diagram: the feedback loop by which overfitting and non-stationarity destabilize policy optimization]