Fine-tuning and Adapting LLMs

RLHF Overview


Supervised fine-tuning teaches a model to follow instructions. But it cannot capture everything humans care about – tone, safety, avoiding harmful outputs, or knowing when to say "I don't know." Reinforcement Learning from Human Feedback (RLHF) addresses this by incorporating human judgment directly into the training loop.

The Three Stages

1. Supervised fine-tuning (SFT)

Start with a pre-trained model and fine-tune it on high-quality prompt-response pairs. This gives the model a baseline ability to follow instructions before any reinforcement learning is applied.
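In practice, SFT minimizes the cross-entropy of the response tokens while masking out the prompt tokens, so the model is trained to produce the response, not to reproduce the prompt. A minimal sketch of that masked loss (the token log-probabilities here are made-up toy values, not real model outputs):

```python
def sft_loss(token_logprobs, loss_mask):
    """Mean negative log-likelihood over response tokens only.
    Prompt tokens carry mask = 0 and are excluded from the loss."""
    masked = [-lp for lp, m in zip(token_logprobs, loss_mask) if m]
    return sum(masked) / len(masked)

# Toy example: 5 tokens, the first 2 belong to the prompt (mask = 0).
logprobs = [-0.1, -0.2, -1.0, -0.5, -0.3]   # log p(token | context)
mask     = [0, 0, 1, 1, 1]
print(round(sft_loss(logprobs, mask), 3))   # 0.6
```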

2. Reward modeling

Human annotators compare multiple model responses to the same prompt and rank them by preference. These rankings are used to train a reward model – a separate neural network that takes a prompt-response pair as input and outputs a scalar score representing how much a human would prefer that response.
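The rankings are typically turned into a pairwise (Bradley-Terry style) training objective: for each preferred/rejected pair, the loss is low when the reward model already scores the preferred response higher, and high when the ranking is inverted. A sketch with toy scalar scores:

```python
import math

def preference_loss(r_chosen, r_rejected):
    """Pairwise ranking loss: -log(sigmoid(r_chosen - r_rejected)).
    Minimizing it pushes the preferred response's score above the other's."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

print(round(preference_loss(2.0, 0.0), 4))  # small loss: ranking already correct
print(round(preference_loss(0.0, 2.0), 4))  # large loss: ranking inverted
```

Only the score *difference* matters, which is why the reward model can be trained from comparisons without any absolute quality labels.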

3. Policy optimization

The fine-tuned LLM (the "policy") generates responses. The reward model scores them. A reinforcement learning algorithm – typically PPO (Proximal Policy Optimization) – updates the policy to maximize the reward score. This cycle repeats, gradually steering the model toward outputs humans prefer.
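The distinctive part of PPO is its clipped surrogate objective: the probability ratio between the new and old policy is clipped so that a single update cannot move the policy too far, even when the advantage estimate is large. A minimal sketch of that per-token objective (`eps = 0.2` is the commonly used default, assumed here):

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    """PPO's clipped surrogate: caps the incentive once the probability
    ratio new_pi / old_pi leaves the interval [1 - eps, 1 + eps]."""
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)
    return min(ratio * advantage, clipped * advantage)

# With a positive advantage, the objective is capped at ratio 1 + eps:
print(ppo_clip_objective(1.5, advantage=1.0))   # 1.2, not 1.5
```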

A KL divergence penalty is added to prevent the policy from drifting too far from the SFT model – without it, the model can learn to exploit the reward model in ways that look high-scoring but produce nonsensical outputs.
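Concretely, the score the policy is optimized against is the reward-model output minus a KL-style penalty for drifting from the SFT model. A sketch using per-token log-probabilities and an assumed penalty coefficient `beta = 0.1` (real systems tune or adapt this value):

```python
def shaped_reward(rm_score, logp_policy, logp_sft, beta=0.1):
    """Reward-model score minus a KL penalty: sum over tokens of
    log pi_policy(token) - log pi_sft(token), scaled by beta."""
    kl_estimate = sum(p - q for p, q in zip(logp_policy, logp_sft))
    return rm_score - beta * kl_estimate

# The same RM score is worth less once the policy drifts far from SFT:
close = shaped_reward(1.0, [-0.5, -0.6], [-0.5, -0.6])  # no drift
far   = shaped_reward(1.0, [-0.1, -0.1], [-2.0, -2.0])  # large drift
print(close, far)
```

This is what blocks reward hacking: an output the reward model loves but the SFT model finds extremely unlikely pays a large KL penalty, so the net reward stays low.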

Why RLHF Matters

Consider two responses to "My order hasn't arrived. What should I do?":

  • SFT model: "Check your tracking number."
  • RLHF model: "I'm sorry to hear that. Please check your tracking number, and if you need further help, I'm here."

The SFT response is technically correct. The RLHF response is what a human annotator would prefer – empathetic, complete, and actionable. That preference signal cannot easily be encoded in a labeled dataset, but it can be learned from human rankings.

