Fine-tuning and Adapting LLMs

Reward Modeling for LLMs


The reward model is the bridge between human feedback and the reinforcement learning loop. It takes a prompt-response pair and outputs a scalar score – a proxy for how much a human would prefer that response. Once trained, it replaces the need for a human annotator at every training step, making RLHF scalable.

How Reward Data Is Collected

Rather than asking annotators to assign absolute scores, the standard approach is pairwise comparison: given the same prompt, show an annotator two responses and ask which is better. Pairwise comparisons are faster, more consistent, and less noisy than absolute ratings.

For a customer support model, an annotation pair might look like:

  • Prompt: "How do I reset my password?"
  • Response A: "Click 'Forgot Password' on the login page."
  • Response B: "To reset your password, click 'Forgot Password' on the login page and follow the email instructions. Let me know if you need more help."
  • Human preference: B

Collecting thousands of such pairs produces a dataset of human preferences.
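A preference dataset like this can be represented as a simple list of records. The field names below ("prompt", "chosen", "rejected") follow a common convention in RLHF tooling but are an illustrative assumption, not a fixed standard:

```python
# A minimal, illustrative preference dataset: each record stores the prompt,
# both candidate responses, and which one the annotator preferred.
# Field names are an assumed convention, not a fixed standard.
preference_data = [
    {
        "prompt": "How do I reset my password?",
        "chosen": (
            "To reset your password, click 'Forgot Password' on the login "
            "page and follow the email instructions. Let me know if you "
            "need more help."
        ),
        "rejected": "Click 'Forgot Password' on the login page.",
    },
    # ... thousands more pairs collected the same way
]

for record in preference_data:
    # Sanity check: the two responses in a pair must differ
    assert record["chosen"] != record["rejected"]
    print(f"Prompt: {record['prompt']!r} -> preferred: {record['chosen'][:40]}...")
```

Storing the annotator's choice as a ("chosen", "rejected") pair, rather than a numeric score, keeps the data format aligned with the pairwise ranking loss used below.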

Training the Reward Model

The reward model is typically initialized from the supervised fine-tuned (SFT) model, with a linear head added on top to output a scalar score. It is trained to assign a higher score to the preferred response in each pair using a ranking loss:

import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, base_model, d_model):
        super().__init__()
        self.base = base_model  # Pretrained LM backbone (frozen or partially frozen)
        self.head = nn.Linear(d_model, 1)  # Scalar reward head

    def forward(self, input_ids, attention_mask):
        outputs = self.base(input_ids=input_ids, attention_mask=attention_mask)
        last_hidden = outputs.last_hidden_state[:, -1, :]  # Last token representation
        return self.head(last_hidden).squeeze(-1)

def reward_loss(score_chosen, score_rejected):
    # Preferred response should score higher than rejected
    return -torch.log(torch.sigmoid(score_chosen - score_rejected)).mean()

# Simulating a training step with dummy scores
score_chosen = torch.tensor([2.1, 3.4, 1.8])
score_rejected = torch.tensor([0.5, 1.2, 0.9])

loss = reward_loss(score_chosen, score_rejected)
print(f"Reward loss: {loss.item():.4f}")

Run this locally to verify that the loss decreases as the gap between score_chosen and score_rejected increases.
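As a quick check of that claim, the sketch below (with the same ranking loss redefined so it runs standalone) sweeps the score gap from 0 to 4 and prints the resulting loss, which shrinks monotonically:

```python
import torch

def reward_loss(score_chosen, score_rejected):
    # Same pairwise ranking loss as above: -log sigmoid(chosen - rejected)
    return -torch.log(torch.sigmoid(score_chosen - score_rejected)).mean()

# Loss shrinks as the preferred response pulls further ahead.
# At gap = 0 the model is indifferent, so the loss is ln 2 ≈ 0.6931.
for gap in [0.0, 1.0, 2.0, 4.0]:
    chosen = torch.tensor([gap])
    rejected = torch.tensor([0.0])
    print(f"gap={gap:.1f} -> loss={reward_loss(chosen, rejected).item():.4f}")
```

This is the Bradley–Terry formulation: the sigmoid of the score gap is the model's probability that the chosen response beats the rejected one, and the loss is the negative log of that probability.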


Section 1. Chapter 8
