Evaluating Fine-tuned LLMs
Fine-tuning improves a model, but you need a principled way to measure how much and in what direction. Evaluation for instruction-following LLMs combines automatic metrics and human evaluation – each captures something the other misses.
Automatic Metrics
Automatic metrics are fast, reproducible, and cheap to compute at scale:
- Perplexity: measures how confidently the model predicts held-out text. Useful for tracking training progress but does not reflect output quality directly;
- BLEU / ROUGE: measure n-gram overlap between the model's output and a reference answer. Useful for tasks with a single correct answer (e.g. translation), but poor proxies for open-ended generation where many valid responses exist;
- Accuracy: for classification or multiple-choice tasks, the fraction of correct answers. Straightforward but only applicable to structured outputs.
None of these fully capture whether a response is helpful, safe, or aligned with user expectations.
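Perplexity in particular follows directly from the model's per-token log-probabilities: it is the exponential of the mean negative log-likelihood. A minimal sketch, assuming you already have per-token log-probs from your model (the values below are made up for illustration):

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the mean negative log-likelihood per token.
    Lower is better: the model assigns higher probability to the text."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Hypothetical per-token log-probs on a held-out sentence
logprobs = [-0.5, -1.2, -0.3, -2.0]
ppl = perplexity(logprobs)  # mean NLL = 1.0, so perplexity = e^1 ≈ 2.72
```

A perplexity of ~2.7 here means the model is, on average, about as uncertain as a uniform choice among ~2.7 tokens at each step.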
Human Evaluation
Human evaluation fills the gap. Common approaches:
- Preference ranking: show annotators two responses to the same prompt and ask which is better. This is the same signal used to train reward models in RLHF;
- Likert scale rating: annotators score responses on dimensions like helpfulness, correctness, and tone (e.g. 1–5);
- Win rate: the percentage of prompts where your fine-tuned model is preferred over a baseline.
Human evaluation is expensive and slow, but it is the ground truth for alignment-focused tasks.
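Win rate reduces to a simple count over pairwise verdicts. A sketch, assuming each annotation is one of `"model"`, `"baseline"`, or `"tie"`, with ties counted as half a win (one common convention):

```python
def win_rate(verdicts):
    """Fraction of prompts where the fine-tuned model beats the baseline.
    Ties contribute half a win each."""
    wins = sum(1.0 for v in verdicts if v == "model")
    ties = sum(0.5 for v in verdicts if v == "tie")
    return (wins + ties) / len(verdicts)

# Hypothetical annotator verdicts over four prompts
verdicts = ["model", "baseline", "model", "tie"]
rate = win_rate(verdicts)  # (2 + 0.5) / 4 = 0.625
```

A win rate above 0.5 means the fine-tuned model is preferred more often than the baseline; report it alongside the number of prompts, since small samples make the estimate noisy.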
LLM-as-Judge
A practical middle ground is using a strong LLM (e.g. GPT-4) to rate responses automatically. The judge model receives the prompt, the response, and a scoring rubric, and outputs a score. This scales better than human evaluation while capturing nuance that BLEU cannot.
# Pseudocode – replace `your_llm_api` with your API client
def llm_judge(prompt, response, rubric):
    judge_prompt = f"""
Prompt: {prompt}
Response: {response}
Rubric: {rubric}
Score the response from 1 to 5. Return only the integer.
"""
    # score = your_llm_api(judge_prompt)
    # return int(score)

rubric = "Is the response helpful, accurate, and polite?"
# llm_judge("How do I reset my password?", "Click Forgot Password.", rubric)