Optimization with Learned Rewards: Instability and Feedback Loops
The optimization process in Reinforcement Learning from Human Feedback (RLHF) centers on training an agent, or policy, to maximize expected reward as judged by a learned reward model. In classical reinforcement learning, agents optimize for a true, often hand-crafted, reward function directly tied to the intended task. RLHF, by contrast, relies on a reward model trained on human preference data, which serves as a proxy for the true reward. This learned reward function introduces a critical distinction: the agent is not optimizing for the true underlying objective, but for the output of a model that approximates human intent.
Formally, let $\pi$ denote the agent's policy, $r_{\text{true}}(s,a)$ the true reward for state-action pair $(s,a)$, and $r_{\text{learned}}(s,a;\theta)$ the reward predicted by the learned reward model with parameters $\theta$. RLHF proceeds in two phases: first, the reward model is trained using supervised learning on human preference data, and then the agent is optimized to maximize the expected cumulative learned reward:

$$\mathbb{E}_{\pi}\left[\sum_{t} r_{\text{learned}}(s_t, a_t; \theta)\right]$$

Since the reward model is only an approximation, the optimization process is susceptible to mismatches between $r_{\text{true}}$ and $r_{\text{learned}}$, particularly in regions of the state-action space insufficiently covered by human feedback.
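To make the two-phase structure concrete, here is a minimal Python sketch. It is an illustrative assumption rather than a reference implementation: the `RewardModel`, the synthetic preference pairs, and the REINFORCE-style update are all stand-ins. Phase 1 fits $r_{\text{learned}}(s,a;\theta)$ to pairwise preferences with a Bradley-Terry style loss; phase 2 nudges the policy toward trajectories that the learned model, not the true reward, scores highly.

```python
# Illustrative two-phase RLHF sketch; all names and shapes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RewardModel(nn.Module):
    """r_learned(s, a; theta): maps a state-action feature vector to a scalar score."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x):
        return self.net(x).squeeze(-1)


def train_reward_model(reward_model, preference_pairs, epochs=10, lr=1e-3):
    """Phase 1: supervised learning on human preference data.

    preference_pairs holds (preferred_features, rejected_features) tensor pairs,
    each of shape (timesteps, dim). A Bradley-Terry style loss pushes the model
    to score the preferred trajectory higher than the rejected one.
    """
    opt = torch.optim.Adam(reward_model.parameters(), lr=lr)
    for _ in range(epochs):
        for preferred, rejected in preference_pairs:
            margin = reward_model(preferred).sum() - reward_model(rejected).sum()
            loss = -F.logsigmoid(margin)  # maximize P(preferred beats rejected)
            opt.zero_grad()
            loss.backward()
            opt.step()


def policy_gradient_step(log_probs, features, reward_model, policy_opt):
    """Phase 2: a crude REINFORCE-style update maximizing the expected
    cumulative learned reward for one sampled trajectory.

    log_probs: per-step log pi(a_t | s_t), still attached to the policy graph.
    features:  per-step state-action features, shape (timesteps, dim).
    """
    with torch.no_grad():
        trajectory_return = reward_model(features).sum()  # learned, not true, reward
    loss = -trajectory_return * log_probs.sum()
    policy_opt.zero_grad()
    loss.backward()
    policy_opt.step()


if __name__ == "__main__":
    dim = 8
    rm = RewardModel(dim)
    # Synthetic tensors standing in for human-labelled preference pairs.
    pairs = [(torch.randn(5, dim), torch.randn(5, dim)) for _ in range(32)]
    train_reward_model(rm, pairs, epochs=2)
```

Note that the policy update in phase 2 never sees $r_{\text{true}}$; everything downstream of phase 1 depends on how well the reward model generalizes.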
A central challenge in this setup is instability during optimization, which can arise from several sources. One major source is reward model overfitting: if the reward model is trained too closely on a limited or biased set of human preferences, it may assign high rewards to behaviors that exploit idiosyncrasies of the training data rather than genuinely reflecting human intent. This overfitting can lead the policy to discover and exploit loopholes or artifacts in the learned reward, resulting in so-called "reward hacking".
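One hedged way to illustrate this failure mode in code is a simple divergence check: periodically score the same rollouts with both the learned reward and whatever ground-truth signal is available (for example, fresh human spot checks), and flag runs where the proxy keeps improving while the true signal does not. The function below is a hypothetical diagnostic, not part of any standard RLHF pipeline.

```python
def reward_hacking_alarm(learned_scores, true_scores, window=5, tol=0.0):
    """learned_scores / true_scores: mean rewards from periodic evaluations, oldest first.

    Returns True when the learned reward improved over the last `window`
    evaluations while the true signal stalled or declined, which is the
    signature of a policy exploiting artifacts of the reward model.
    """
    if len(learned_scores) < window or len(true_scores) < window:
        return False
    learned_trend = learned_scores[-1] - learned_scores[-window]
    true_trend = true_scores[-1] - true_scores[-window]
    return learned_trend > tol and true_trend <= tol


# Example: the learned reward climbs while the true reward plateaus, then degrades.
learned_log = [0.1, 0.3, 0.5, 0.8, 1.2, 1.7]
true_log    = [0.1, 0.3, 0.4, 0.4, 0.3, 0.2]
print(reward_hacking_alarm(learned_log, true_log))  # True: likely gaming the proxy
```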
Another key factor is non-stationarity. As the policy improves and explores new parts of the environment, it may encounter states and actions far from those seen during reward model training. Since the learned reward is only reliable within the distribution of its training data, the agent's exploration can cause the effective reward landscape to shift, leading to unpredictable feedback loops. In extreme cases, the reward model may provide misleading or uninformative gradients, destabilizing the optimization.
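The sketch below shows one way, under assumed feature representations, to operationalize the idea that the learned reward is only reliable near its training distribution: measure how far the policy's new state-action features fall from the reward model's training set, and treat $r_{\text{learned}}$ as untrustworthy beyond some distance. The nearest-neighbour distance and the threshold are illustrative choices, not prescriptions from the text.

```python
import numpy as np


def ood_distance(x, reward_training_features):
    """Distance from feature vector x to its nearest neighbour in the reward
    model's training data; larger values mean x is further off-distribution."""
    return np.linalg.norm(reward_training_features - x, axis=1).min()


def trust_learned_reward(x, reward_training_features, threshold=3.0):
    """Only trust r_learned(x) when x lies close to data the reward model saw."""
    return ood_distance(x, reward_training_features) < threshold


rng = np.random.default_rng(0)
train_feats = rng.normal(size=(500, 8))   # features seen during reward model training
on_dist = rng.normal(size=8)              # resembles the training distribution
off_dist = rng.normal(loc=6.0, size=8)    # far from anything the reward model saw

print(trust_learned_reward(on_dist, train_feats))   # typically True
print(trust_learned_reward(off_dist, train_feats))  # typically False
```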
To visualize these dynamics, consider the following diagrams. The first illustrates the relationship between the true reward landscape and the learned reward model. The second highlights how overfitting and non-stationarity can create feedback loops during policy optimization.
[Diagram: the true reward landscape compared with the learned reward model's approximation]
[Diagram: the feedback loop by which overfitting and non-stationarity destabilize policy optimization]