
Optimization with Learned Rewards: Instability and Feedback Loops

The optimization process in Reinforcement Learning from Human Feedback (RLHF) centers on training an agent, or policy, to maximize expected reward as judged by a learned reward model. In classical reinforcement learning, agents optimize for a true, often hand-crafted, reward function directly tied to the intended task. RLHF, by contrast, relies on a reward model trained on human preference data, which serves as a proxy for the true reward. This learned reward function introduces a critical distinction: the agent is not optimizing for the true underlying objective, but for the output of a model that approximates human intent.

Formally, let $\pi$ denote the agent's policy, $r_{true}(s, a)$ the true reward for state-action pair $(s, a)$, and $r_{learned}(s, a; \theta)$ the reward predicted by the learned reward model with parameters $\theta$. RLHF proceeds in two phases: first, the reward model is trained using supervised learning on human preference data, and then the agent is optimized to maximize the expected cumulative learned reward:

$$E_{\pi}\left[\sum_{t} r_{learned}(s_t, a_t; \theta)\right]$$

Since the reward model is only an approximation, the optimization process is susceptible to mismatches between $r_{true}$ and $r_{learned}$, particularly in regions of the state-action space insufficiently covered by human feedback.
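
To make this mismatch concrete, here is a minimal sketch, not an RLHF implementation: the reward functions, the one-dimensional state, and the quadratic fit are all invented for illustration. It fits a proxy reward on a narrow band of states and compares it with the true reward both inside and outside that band.

```python
import numpy as np

rng = np.random.default_rng(0)

def r_true(s):
    # Hypothetical true reward: peaks at s = 0.
    return np.exp(-s ** 2)

# Pretend human feedback only covered states in [-1, 1]; the learned reward
# model (here just a quadratic fit to noisy labels) extrapolates badly
# outside that range.
train_s = rng.uniform(-1.0, 1.0, size=50)
train_r = r_true(train_s) + 0.05 * rng.normal(size=train_s.shape)
coeffs = np.polyfit(train_s, train_r, deg=2)   # stand-in for the reward model

def r_learned(s):
    return np.polyval(coeffs, s)

# Inside the feedback distribution the proxy tracks the true reward...
print("in-distribution gap    :", abs(r_true(0.5) - r_learned(0.5)))
# ...but in states the preference data never covered, the two diverge sharply.
print("out-of-distribution gap:", abs(r_true(3.0) - r_learned(3.0)))
```

The printed gaps differ by orders of magnitude, which is exactly the situation the objective above is blind to: the expectation is taken under the learned reward, not the true one.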

A central challenge in this setup is instability during optimization, which can arise from several sources. One major source is reward model overfitting: if the reward model is trained too closely on a limited or biased set of human preferences, it may assign high rewards to behaviors that exploit idiosyncrasies of the training data rather than genuinely reflecting human intent. This overfitting can lead the policy to discover and exploit loopholes or artifacts in the learned reward, resulting in so-called "reward hacking".
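
The sketch below is a deliberately contrived illustration of this failure mode; the reward shapes and the spurious bump are assumptions made for the example. When an optimizer searches hard against an overfit proxy, it lands on the artifact rather than on the behavior the true reward actually values.

```python
import numpy as np

rng = np.random.default_rng(0)

def r_true(s):
    # True objective: behavior near s = 0 is what humans actually want.
    return np.exp(-s ** 2)

def r_learned(s):
    # Overfit proxy: matches the true reward near 0, but also carries a
    # spurious high-reward bump around s = 3 -- an artifact of limited,
    # biased preference data rather than genuine human intent.
    return np.exp(-s ** 2) + 2.0 * np.exp(-(s - 3.0) ** 2)

# Aggressive optimization against the proxy: score many candidate actions
# under the learned reward and keep the best one.
candidates = rng.uniform(-5.0, 5.0, size=10_000)
best = candidates[np.argmax(r_learned(candidates))]

print(f"action chosen by the policy: {best:.2f}")
print(f"learned reward: {r_learned(best):.3f}  true reward: {r_true(best):.3f}")
# The proxy score is high while the true reward is near zero: reward hacking.
```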

Another key factor is non-stationarity. As the policy improves and explores new parts of the environment, it may encounter states and actions far from those seen during reward model training. Since the learned reward is only reliable within the distribution of its training data, the agent's exploration can cause the effective reward landscape to shift, leading to unpredictable feedback loops. In extreme cases, the reward model may provide misleading or uninformative gradients, destabilizing the optimization.
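
The following sketch, again a toy with assumed dynamics that reuses the quadratic proxy idea from the first sketch, illustrates this drift: as the policy's state distribution moves away from the region where preference data existed, the proxy's error grows, so the very signal driving the optimization degrades over time even though the reward model itself has not changed.

```python
import numpy as np

rng = np.random.default_rng(0)

def r_true(s):
    return np.exp(-s ** 2)

# Reward model: quadratic fit to noisy "preference" labels gathered on [-1, 1].
train_s = rng.uniform(-1.0, 1.0, size=50)
train_r = r_true(train_s) + 0.05 * rng.normal(size=train_s.shape)
coeffs = np.polyfit(train_s, train_r, deg=2)

def r_learned(s):
    return np.polyval(coeffs, s)

# Stand-in for policy training: at each checkpoint the policy's state
# distribution has drifted a bit further from where the feedback was gathered.
for mean in [0.0, 0.5, 1.0, 2.0, 3.0]:
    states = rng.normal(loc=mean, scale=0.3, size=1_000)
    inside = np.mean((states >= -1.0) & (states <= 1.0))
    error = np.mean(np.abs(r_true(states) - r_learned(states)))
    print(f"policy mean {mean:.1f}: {inside:5.1%} of states in coverage, "
          f"mean |r_true - r_learned| = {error:.3f}")
```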

These dynamics are often visualized with two diagrams: one illustrating the relationship between the true reward landscape and the learned reward model, and another showing how overfitting and non-stationarity can create feedback loops during policy optimization.


