Fine-tuning and Adapting LLMs

Reward Modeling for LLMs


The reward model is the bridge between human feedback and the reinforcement learning loop. It takes a prompt-response pair and outputs a scalar score – a proxy for how much a human would prefer that response. Once trained, it replaces the need for a human annotator at every training step, making RLHF scalable.

How Reward Data Is Collected

Rather than asking annotators to assign absolute scores, the standard approach is pairwise comparison: given the same prompt, show an annotator two responses and ask which is better. Pairwise comparisons are faster, more consistent, and less noisy than absolute ratings.

For a customer support model, an annotation pair might look like:

  • Prompt: "How do I reset my password?"
  • Response A: "Click 'Forgot Password' on the login page."
  • Response B: "To reset your password, click 'Forgot Password' on the login page and follow the email instructions. Let me know if you need more help."
  • Human preference: B

Collecting thousands of such pairs produces a dataset of human preferences.
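A preference dataset like this can be represented as a simple list of records. The field names below ("prompt", "chosen", "rejected") follow a common convention in RLHF tooling but are an illustrative assumption, not a fixed standard:

```python
# A minimal, illustrative preference dataset: each record stores the prompt,
# both candidate responses, and which one the annotator preferred.
# Field names are an assumed convention, not a fixed standard.
preference_data = [
    {
        "prompt": "How do I reset my password?",
        "chosen": (
            "To reset your password, click 'Forgot Password' on the login "
            "page and follow the email instructions. Let me know if you "
            "need more help."
        ),
        "rejected": "Click 'Forgot Password' on the login page.",
    },
    # ... thousands more pairs collected the same way
]

for record in preference_data:
    # Sanity check: the two responses in a pair must differ
    assert record["chosen"] != record["rejected"]
    print(f"Prompt: {record['prompt']!r} -> preferred: {record['chosen'][:40]}...")
```

Storing the annotator's choice as a ("chosen", "rejected") pair, rather than a numeric score, keeps the data format aligned with the pairwise ranking loss used below.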

Training the Reward Model

The reward model is typically initialized from the supervised fine-tuned (SFT) model, with a linear head added on top to output a scalar score. It is trained to assign a higher score to the preferred response in each pair using a ranking loss:

import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, base_model, d_model):
        super().__init__()
        self.base = base_model  # Pretrained LM backbone (frozen or partially frozen)
        self.head = nn.Linear(d_model, 1)  # Scalar reward head

    def forward(self, input_ids, attention_mask):
        outputs = self.base(input_ids=input_ids, attention_mask=attention_mask)
        last_hidden = outputs.last_hidden_state[:, -1, :]  # Last token representation
        return self.head(last_hidden).squeeze(-1)

def reward_loss(score_chosen, score_rejected):
    # Preferred response should score higher than rejected
    return -torch.log(torch.sigmoid(score_chosen - score_rejected)).mean()

# Simulating a training step with dummy scores
score_chosen = torch.tensor([2.1, 3.4, 1.8])
score_rejected = torch.tensor([0.5, 1.2, 0.9])

loss = reward_loss(score_chosen, score_rejected)
print(f"Reward loss: {loss.item():.4f}")

Run this locally to verify that the loss decreases as the gap between score_chosen and score_rejected increases.
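As a quick check of that claim, the sketch below (with the same ranking loss redefined so it runs standalone) sweeps the score gap from 0 to 4 and prints the resulting loss, which shrinks monotonically:

```python
import torch

def reward_loss(score_chosen, score_rejected):
    # Same pairwise ranking loss as above: -log sigmoid(chosen - rejected)
    return -torch.log(torch.sigmoid(score_chosen - score_rejected)).mean()

# Loss shrinks as the preferred response pulls further ahead.
# At gap = 0 the model is indifferent, so the loss is ln 2 ≈ 0.6931.
for gap in [0.0, 1.0, 2.0, 4.0]:
    chosen = torch.tensor([gap])
    rejected = torch.tensor([0.0])
    print(f"gap={gap:.1f} -> loss={reward_loss(chosen, rejected).item():.4f}")
```

This is the Bradley–Terry formulation: the sigmoid of the score gap is the model's probability that the chosen response beats the rejected one, and the loss is the negative log of that probability.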


Section 1. Chapter 8
