Fine-tuning and Adapting LLMs

RLHF Overview


Supervised fine-tuning teaches a model to follow instructions. But it cannot capture everything humans care about – tone, safety, avoiding harmful outputs, or knowing when to say "I don't know." Reinforcement Learning from Human Feedback (RLHF) addresses this by incorporating human judgment directly into the training loop.

The Three Stages

1. Supervised fine-tuning (SFT)

Start with a pre-trained model and fine-tune it on high-quality prompt-response pairs. This gives the model a baseline ability to follow instructions before any reinforcement learning is applied.
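Conceptually, SFT is ordinary next-token cross-entropy, usually computed only on the response tokens while the prompt tokens are masked out. A minimal sketch of that masked loss, with illustrative per-token probabilities standing in for real model outputs:

```python
import math

def sft_loss(token_probs, is_response):
    """Mean negative log-likelihood over response tokens only;
    prompt tokens are masked out of the loss."""
    losses = [-math.log(p) for p, resp in zip(token_probs, is_response) if resp]
    return sum(losses) / len(losses)

# Two prompt tokens (masked) followed by three response tokens.
probs = [0.2, 0.3, 0.9, 0.8, 0.7]
mask = [False, False, True, True, True]
loss = sft_loss(probs, mask)
```

Only the last three probabilities contribute; assigning higher probability to the correct response tokens drives the loss down.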

2. Reward modeling

Human annotators compare multiple model responses to the same prompt and rank them by preference. These rankings are used to train a reward model – a separate neural network that takes a prompt-response pair as input and outputs a scalar score representing how much a human would prefer that response.
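The rankings are typically reduced to pairs and turned into a Bradley-Terry-style loss: for each prompt, the reward model should score the human-preferred response above the rejected one. A minimal sketch, assuming we already have the two scalar scores for a pair:

```python
import math

def pairwise_reward_loss(score_chosen: float, score_rejected: float) -> float:
    """Bradley-Terry loss: -log sigmoid(r_chosen - r_rejected).
    Small when the chosen response outscores the rejected one."""
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Correctly ranked pair: small loss.
low = pairwise_reward_loss(2.0, -1.0)
# Mis-ranked pair: large loss, pushing the scores to swap order.
high = pairwise_reward_loss(-1.0, 2.0)
```

Minimizing this loss over many annotated pairs is what trains the reward model's scalar score to track human preference.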

3. Policy optimization

The fine-tuned LLM (the "policy") generates responses. The reward model scores them. A reinforcement learning algorithm – typically PPO (Proximal Policy Optimization) – updates the policy to maximize the reward score. This cycle repeats, gradually steering the model toward outputs humans prefer.

A KL divergence penalty is added to prevent the policy from drifting too far from the SFT model – without it, the model can learn to exploit the reward model in ways that look high-scoring but produce nonsensical outputs.
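The shaped reward the policy actually optimizes can be sketched as the reward-model score minus a scaled log-probability ratio between the policy and the SFT reference. A minimal per-token sketch; the `beta` coefficient and the probabilities are illustrative:

```python
import math

def shaped_reward(rm_score: float,
                  logprob_policy: float,
                  logprob_sft: float,
                  beta: float = 0.1) -> float:
    """Reward used by the RL step: RM score minus a KL-style penalty
    that grows as the policy drifts from the SFT reference."""
    kl_term = logprob_policy - logprob_sft  # per-token log ratio
    return rm_score - beta * kl_term

# No drift: the penalty vanishes and the RM score passes through.
r_same = shaped_reward(1.5, math.log(0.4), math.log(0.4))
# Policy assigns far more probability than SFT did: reward is reduced.
r_drift = shaped_reward(1.5, math.log(0.9), math.log(0.1))
```

The penalty is what stops reward hacking: a token sequence can only raise its shaped reward so far before the drift from the SFT model cancels the gain.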

Why RLHF Matters

Consider two responses to "My order hasn't arrived. What should I do?":

  • SFT model: "Check your tracking number."
  • RLHF model: "I'm sorry to hear that. Please check your tracking number, and if you need further help, I'm here."

The SFT response is technically correct. The RLHF response is what a human annotator would prefer – empathetic, complete, and actionable. That preference signal cannot easily be encoded in a labeled dataset, but it can be learned from human rankings.
