Reward Modeling as an Inverse Problem
Reward modeling in reinforcement learning from human feedback (RLHF) is fundamentally an inverse problem. Rather than specifying a reward function directly, you are tasked with inferring a latent reward function from observed human preferences or feedback. This means you observe data such as "trajectory A is preferred over trajectory B" and aim to deduce the underlying reward function that would make these preferences rational. Formally, let $\mathcal{T}$ denote the space of trajectories, and suppose you observe a dataset $D = \{(\tau_i, \tau_j, y_{ij})\}$, where $y_{ij}$ indicates whether a human prefers trajectory $\tau_i$ over $\tau_j$. The inverse problem is to find a reward function $r : \mathcal{T} \to \mathbb{R}$ such that, for all observed preferences, the following holds:
$$
y_{ij} = \begin{cases} 1 & \text{if } r(\tau_i) > r(\tau_j) \\ 0 & \text{otherwise} \end{cases}
$$

In practice, you often model the probability of preference using a stochastic function, such as the Bradley-Terry (logistic) model:
$$
P(y_{ij} = 1) = \frac{\exp(r(\tau_i))}{\exp(r(\tau_i)) + \exp(r(\tau_j))}
$$

This approach frames reward modeling as inferring the function $r$ that best explains the observed preference data.
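To make this concrete, here is a minimal sketch of fitting such a model, assuming each trajectory is summarized by a feature vector and the reward is linear in those features. The toy data, feature dimension, and optimizer settings are illustrative assumptions, not a prescribed recipe.

```python
# A minimal sketch of reward-model fitting from pairwise preferences under
# the Bradley-Terry model. The feature representation, linear reward form,
# toy data, and optimizer settings are all illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

# Assume each trajectory tau is summarized by a feature vector phi(tau).
n_trajectories, n_features = 50, 4
phi = rng.normal(size=(n_trajectories, n_features))

# Hidden "true" reward weights, used only to simulate preference labels.
w_true = rng.normal(size=n_features)

# Sample preference pairs (i, j) and labels y_ij ~ Bernoulli(P(y_ij = 1)).
pairs = rng.integers(0, n_trajectories, size=(200, 2))
true_logits = (phi[pairs[:, 0]] - phi[pairs[:, 1]]) @ w_true
y = (rng.random(len(pairs)) < 1.0 / (1.0 + np.exp(-true_logits))).astype(float)

# Fit r(tau) = w . phi(tau) by gradient descent on the Bradley-Terry
# negative log-likelihood of the observed preferences.
w = np.zeros(n_features)
learning_rate = 0.1
for _ in range(500):
    logits = (phi[pairs[:, 0]] - phi[pairs[:, 1]]) @ w
    p = 1.0 / (1.0 + np.exp(-logits))                 # model's P(y_ij = 1)
    grad = (phi[pairs[:, 0]] - phi[pairs[:, 1]]).T @ (p - y) / len(y)
    w -= learning_rate * grad

# The learned weights should roughly align with the hidden direction.
print("recovered direction:", np.round(w / np.linalg.norm(w), 2))
print("true direction:     ", np.round(w_true / np.linalg.norm(w_true), 2))
```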
A central challenge in this inverse problem is the identifiability of the true reward function. Identifiability asks: under what conditions can you uniquely recover the original reward function from observed preferences? In many cases, identifiability cannot be guaranteed. One source of ambiguity is that reward functions are only identifiable up to a strictly monotonic transformation when using only preference data. That is, if $r$ is a reward function consistent with the preferences, then so is any strictly increasing function of $r$. For instance, if humans always prefer higher cumulative reward, both $r$ and $2r + 3$ will induce identical preference orderings over trajectories, making them indistinguishable from preference data alone.
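The point is easy to check numerically. The short sketch below uses assumed toy reward values and verifies that strictly increasing transforms of $r$ preserve every pairwise comparison, and hence the induced preference ordering.

```python
# A small numerical check, using assumed toy reward values, that strictly
# increasing transforms of r preserve every pairwise comparison and hence
# the entire preference ordering over trajectories.
import numpy as np

r = np.array([0.3, -1.2, 2.5, 0.0, 1.1])  # assumed rewards for five trajectories
transforms = {
    "2r + 3": 2 * r + 3,        # strictly increasing affine map
    "exp(r)": np.exp(r),        # strictly increasing nonlinear map
}

for name, r_transformed in transforms.items():
    same_ordering = np.array_equal(np.argsort(r), np.argsort(r_transformed))
    print(name, "preserves the preference ordering:", same_ordering)  # True
```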
Ambiguities also arise when the observed preferences do not fully cover the space of possible trajectory comparisons. If your dataset only includes a subset of all possible pairs, there may be many reward functions consistent with the limited information, and you cannot distinguish between them without further assumptions or data.
To visualize the space of possible reward functions consistent with observed preferences, consider the following. Suppose you have three trajectories, $\tau_1$, $\tau_2$, and $\tau_3$, and you observe that humans prefer $\tau_1$ over $\tau_2$, and $\tau_2$ over $\tau_3$. The set of reward functions $r$ that satisfy these preferences must obey:
$$
r(\tau_1) > r(\tau_2) > r(\tau_3)
$$

This defines a region in the space of all possible reward functions. If you plot $r(\tau_1)$, $r(\tau_2)$, and $r(\tau_3)$ on axes, the valid region is the set of points where these inequalities hold.
As you observe more preferences, the valid region shrinks, but unless your data are complete and noise-free, many distinct reward functions remain consistent with everything you have observed. This illustrates that preference data typically constrains the reward function to a subset of the space, but does not uniquely specify it.
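One way to get a feel for this is to sample candidate reward vectors $(r(\tau_1), r(\tau_2), r(\tau_3))$ and measure how much of the space each set of constraints rules out. The sketch below does this by uniform sampling; the bounds and sample count are arbitrary illustrative choices.

```python
# A rough illustration of how preference constraints carve out a region in
# (r(tau_1), r(tau_2), r(tau_3)) space. The sampling bounds and sample count
# are arbitrary assumptions made for the sake of the estimate.
import numpy as np

rng = np.random.default_rng(0)
samples = rng.uniform(-1.0, 1.0, size=(100_000, 3))  # candidate (r1, r2, r3) points

# One observed preference: tau_1 preferred over tau_2.
one_pref = samples[:, 0] > samples[:, 1]
# Two observed preferences: tau_1 over tau_2, and tau_2 over tau_3.
two_prefs = one_pref & (samples[:, 1] > samples[:, 2])

print("fraction consistent with one preference :", one_pref.mean())   # about 1/2
print("fraction consistent with two preferences:", two_prefs.mean())  # about 1/6
```

Each non-redundant preference cuts the consistent region further, but the region remains a full-dimensional set of reward assignments rather than collapsing to a single point.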
Inverse reward modeling has several important limitations. The most fundamental is unidentifiability: many reward functions may explain the same set of observed preferences, especially when those preferences are sparse or noisy. This means you cannot, in general, guarantee that the reward you infer matches the true underlying human objective.
Another limitation is the reliance on inductive biases. To select among the many possible reward functions consistent with the data, you must impose additional assumptions or regularization. These biases might include preferring simpler reward functions, restricting the form of r, or using prior knowledge about the environment or human values. While necessary for practical learning, inductive biases can introduce their own risks: if your biases do not match the true human preferences, the inferred reward may systematically diverge from the intended objective.
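As a concrete example of such a bias, the sketch below adds an L2 penalty to the Bradley-Terry negative log-likelihood so that, among weight vectors that explain the preferences comparably well, smaller-norm (simpler) ones are preferred. The function signature, the argument names (which follow the earlier sketch), and the penalty weight lam are illustrative assumptions.

```python
# A minimal sketch of one inductive bias: an L2 penalty added to the
# Bradley-Terry negative log-likelihood, so that among weight vectors that
# explain the preferences comparably well, smaller-norm ones are preferred.
# The signature and the penalty weight lam are illustrative assumptions.
import numpy as np

def regularized_preference_loss(w, phi, pairs, y, lam=0.01):
    """Bradley-Terry negative log-likelihood plus an L2 penalty on w."""
    logits = (phi[pairs[:, 0]] - phi[pairs[:, 1]]) @ w
    # Numerically stable -log sigmoid(x) = log(1 + exp(-x)) via logaddexp.
    nll = np.mean(y * np.logaddexp(0.0, -logits) + (1 - y) * np.logaddexp(0.0, logits))
    return nll + lam * np.dot(w, w)
```

Setting lam to zero recovers the unregularized preference loss; larger values trade fidelity to the observed preferences for simpler reward parameters, which is exactly the kind of assumption that helps or hurts depending on whether it matches the true human objective.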
These limitations highlight the importance of careful reward design and critical evaluation of inferred models in RLHF. In practice, you must balance the expressiveness of your reward model, the quality and coverage of your preference data, and the inductive biases you impose to achieve robust alignment with human intent.