RLHF Overview
Supervised fine-tuning teaches a model to follow instructions. But it cannot capture everything humans care about – tone, safety, avoiding harmful outputs, or knowing when to say "I don't know." Reinforcement Learning from Human Feedback (RLHF) addresses this by incorporating human judgment directly into the training loop.
The Three Stages
1. Supervised fine-tuning (SFT)
Start with a pre-trained model and fine-tune it on high-quality prompt-response pairs. This gives the model a baseline ability to follow instructions before any reinforcement learning is applied.
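The SFT objective is ordinary next-token cross-entropy, usually computed only over the response tokens so the model learns to produce answers rather than to reproduce prompts. A minimal sketch (the log-probability values and the helper name `sft_loss` are illustrative, not from any particular library):

```python
def sft_loss(token_logprobs, loss_mask):
    """Average negative log-likelihood over response tokens only.

    token_logprobs: log-probabilities the model assigned to the
                    reference tokens (hypothetical values below).
    loss_mask:      1 for response tokens, 0 for prompt tokens --
                    prompt tokens are excluded from the loss.
    """
    total = sum(lp * m for lp, m in zip(token_logprobs, loss_mask))
    count = sum(loss_mask)
    return -total / count

# Two prompt tokens (masked out) followed by three response tokens.
logprobs = [-0.1, -0.2, -1.5, -0.7, -0.3]
mask     = [0,    0,    1,    1,    1]
print(sft_loss(logprobs, mask))  # mean NLL over the 3 response tokens
```

In practice the same loss is computed over batches of tokenized prompt-response pairs with a framework such as PyTorch; the masking idea is the part that matters here.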
2. Reward modeling
Human annotators compare multiple model responses to the same prompt and rank them by preference. These rankings are used to train a reward model – a separate neural network that takes a prompt-response pair as input and outputs a scalar score representing how much a human would prefer that response.
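A common way to train on these rankings is the Bradley-Terry pairwise loss: for each annotated pair, the reward model should score the preferred ("chosen") response above the dispreferred ("rejected") one. A sketch, assuming scalar scores have already been produced by the reward model (the numbers below are made up):

```python
import math

def preference_loss(r_chosen, r_rejected):
    """Bradley-Terry pairwise loss: -log(sigmoid(r_chosen - r_rejected)).

    Minimizing this pushes the chosen response's score above the
    rejected one's; the loss is near zero when the margin is large
    and correct, and grows when the ranking is inverted.
    """
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Hypothetical scalar scores from the reward model.
print(preference_loss(2.0, 0.5))  # ranking already correct: small loss
print(preference_loss(0.5, 2.0))  # ranking inverted: large loss
```

Rankings of more than two responses are typically decomposed into all chosen/rejected pairs and trained with this same loss.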
3. Policy optimization
The fine-tuned LLM (the "policy") generates responses. The reward model scores them. A reinforcement learning algorithm – typically PPO (Proximal Policy Optimization) – updates the policy to maximize the reward score. This cycle repeats, gradually steering the model toward outputs humans prefer.
A KL divergence penalty is added to prevent the policy from drifting too far from the SFT model – without it, the model can learn to exploit the reward model in ways that look high-scoring but produce nonsensical outputs.
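The KL penalty is usually folded directly into the reward signal: the score the policy is optimized against is the reward-model score minus a scaled estimate of how far the policy's token log-probabilities have drifted from the SFT reference model's. A minimal sketch, assuming per-token log-probabilities from both models are available (the values and the coefficient `beta` are illustrative):

```python
def penalized_reward(rm_score, policy_logprobs, ref_logprobs, beta=0.1):
    """Reward used in the RL update: reward-model score minus a KL penalty.

    The per-token difference log pi(token) - log pi_ref(token), summed
    over the response, estimates the KL divergence from the SFT model;
    beta controls how strongly drift is punished.
    """
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return rm_score - beta * kl_estimate

# Hypothetical per-token log-probs; the policy has drifted from the reference.
policy = [-0.2, -0.5, -0.1]
ref    = [-0.8, -0.9, -0.6]
print(penalized_reward(1.0, policy, ref, beta=0.1))  # less than the raw score 1.0
```

If the policy matches the reference exactly, the penalty vanishes and the raw reward-model score is used unchanged; as the policy drifts, an ever larger slice of the reward is taxed away, which is what blocks reward-model exploitation.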
Why RLHF Matters
Consider two responses to "My order hasn't arrived. What should I do?":
- SFT model: "Check your tracking number."
- RLHF model: "I'm sorry to hear that. Please check your tracking number, and if you need further help, I'm here."
The SFT response is technically correct. The RLHF response is what a human annotator would prefer – empathetic, complete, and actionable. That preference signal cannot easily be encoded in a labeled dataset, but it can be learned from human rankings.