Fine-tuning and Adapting LLMs


PPO Basics for LLM Fine-tuning

Once you have a reward model, you need an algorithm to update the LLM using its scores. Proximal Policy Optimization (PPO) is the standard choice. It maximizes reward while preventing updates large enough to destabilize the model or collapse its language generation ability.

The Core Idea

In PPO, the LLM is the policy — it maps a prompt to a response. At each step:

  1. The policy generates a response;
  2. The reward model scores it;
  3. PPO updates the policy to increase the probability of high-reward responses.
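The loop above can be sketched with a toy numeric example (all numbers are made up): a softmax policy over three candidate responses, plus one policy-gradient ascent step that shifts probability mass toward the response the reward model scored highest.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical setup: a policy over 3 candidate responses, and the
# reward model's score for each one.
logits = np.zeros(3)
rewards = np.array([0.1, 1.8, 0.3])   # response 1 scores highest

p_before = softmax(logits)

# One gradient-ascent step on expected reward J = sum_a p_a * R_a;
# for a softmax policy, dJ/dlogit_k = p_k * (R_k - E[R]).
baseline = p_before @ rewards
logits += 0.5 * p_before * (rewards - baseline)

p_after = softmax(logits)
print(p_before[1], "->", p_after[1])  # probability of the top-scoring response rises
```

Real PPO operates on per-token probabilities of a transformer, not a three-way softmax, but the direction of the update is the same: responses with above-baseline reward become more likely.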

The key constraint is that updates are clipped: the ratio between the new policy's probability and the old policy's probability is bounded to the range $[1 - \epsilon,\ 1 + \epsilon]$, typically with $\epsilon = 0.2$. This prevents any single update from changing the model's behavior too dramatically.

$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t \left[ \min\left( r_t(\theta)\, \hat{A}_t,\ \text{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right) \hat{A}_t \right) \right]$$

where $r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$ is the probability ratio and $\hat{A}_t$ is the advantage, i.e. how much better the response was than expected.
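A single-token numeric sketch (with made-up probabilities and advantage) shows what the clip does: when the ratio exceeds $1+\epsilon$, the min picks the clipped term, capping the incentive to push the probability further.

```python
import numpy as np

eps = 0.2
p_new, p_old = 0.9, 0.5    # pi_theta(a|s) and pi_theta_old(a|s), made-up values
advantage = 2.0            # made-up positive advantage

ratio = p_new / p_old                          # r_t = 1.8, outside [0.8, 1.2]
clipped = np.clip(ratio, 1 - eps, 1 + eps)     # capped at 1.2
objective = min(ratio * advantage, clipped * advantage)
# The min selects the clipped term (1.2 * 2.0 = 2.4) over the
# unclipped one (1.8 * 2.0 = 3.6), limiting the size of the update.
```

Because the gradient of the clipped term with respect to the policy is zero, samples whose ratio has already moved past the bound contribute no further push in that direction.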

KL Penalty in Practice

In addition to clipping, LLM fine-tuning with PPO adds a KL divergence penalty between the current policy and the SFT model:

$$r_{\text{final}} = r_{\text{reward}} - \beta \cdot \mathrm{KL}\!\left(\pi_\theta \,\|\, \pi_{\text{SFT}}\right)$$

This prevents the model from drifting so far toward high reward that it loses fluency or generates degenerate outputs — a failure mode known as reward hacking.
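The penalized reward can be sketched numerically (all log-probabilities here are made up). A common per-sequence KL estimate in RLHF implementations is the sum over generated tokens of $\log \pi_\theta - \log \pi_{\text{SFT}}$:

```python
import numpy as np

beta = 0.2
r_reward = 1.8                                 # reward-model score (made-up)
logp_policy = np.array([-1.0, -0.5, -2.0])     # per-token log pi_theta (made-up)
logp_sft    = np.array([-1.2, -0.9, -1.0])     # per-token log pi_SFT (made-up)

# Per-sequence KL estimate: sum of per-token log-prob differences.
kl_estimate = np.sum(logp_policy - logp_sft)

# The penalty subtracts beta * KL from the reward-model score.
r_final = r_reward - beta * kl_estimate
```

With these numbers the policy is actually closer to the SFT model than expected on the last token, so the sampled KL estimate is slightly negative and the penalty nudges the final reward up rather than down; on average over many samples the estimate is non-negative.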

Using TRL for PPO

The trl library implements PPO for LLMs out of the box:

from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead
from transformers import AutoTokenizer
import torch

model = AutoModelForCausalLMWithValueHead.from_pretrained("bigscience/bloom-560m")
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")

ppo_config = PPOConfig(
    model_name="bloom-560m",
    learning_rate=1.5e-5,
    batch_size=1,          # must match the number of queries passed to step()
    mini_batch_size=1,
    kl_penalty="kl",
    init_kl_coef=0.2       # beta: controls how strongly the KL penalty is applied
)

ppo_trainer = PPOTrainer(ppo_config, model, tokenizer=tokenizer)

# A single PPO step (reward_tensors come from the reward model)
query_tensors = [tokenizer.encode("How do I reset my password?", return_tensors="pt")[0]]
response_tensors = ppo_trainer.generate(query_tensors)
reward_tensors = [torch.tensor(1.8)]  # Score from reward model

stats = ppo_trainer.step(query_tensors, response_tensors, reward_tensors)
print(stats)

Run this locally to see the PPO training statistics — pay attention to kl and mean_reward to monitor whether the policy is staying close to the SFT baseline.
