Evaluating Fine-tuned LLMs
Fine-tuning improves a model, but you need a principled way to measure how much and in what direction. Evaluation for instruction-following LLMs combines automatic metrics and human evaluation – each captures something the other misses.
Automatic Metrics
Automatic metrics are fast, reproducible, and cheap to compute at scale:
- Perplexity: measures how confidently the model predicts held-out text. Useful for tracking training progress but does not reflect output quality directly;
- BLEU / ROUGE: measure n-gram overlap between the model's output and a reference answer. Useful for tasks with a single correct answer (e.g. translation), but poor proxies for open-ended generation where many valid responses exist;
- Accuracy: for classification or multiple-choice tasks, the fraction of correct answers. Straightforward but only applicable to structured outputs.
None of these fully capture whether a response is helpful, safe, or aligned with user expectations.
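Perplexity in particular follows directly from the model's per-token log-probabilities: it is the exponential of the mean negative log-likelihood. A minimal sketch, assuming you already have per-token log-probs from your model (the values below are made up for illustration):

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the mean negative log-likelihood per token.
    Lower is better: the model assigns higher probability to the text."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Hypothetical per-token log-probs on a held-out sentence
logprobs = [-0.5, -1.2, -0.3, -2.0]
ppl = perplexity(logprobs)  # mean NLL = 1.0, so perplexity = e^1 ≈ 2.72
```

A perplexity of ~2.7 here means the model is, on average, about as uncertain as a uniform choice among ~2.7 tokens at each step.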
Human Evaluation
Human evaluation fills the gap. Common approaches:
- Preference ranking: show annotators two responses to the same prompt and ask which is better. This is the same signal used to train reward models in RLHF;
- Likert scale rating: annotators score responses on dimensions like helpfulness, correctness, and tone (e.g. 1–5);
- Win rate: the percentage of prompts where your fine-tuned model is preferred over a baseline.
Human evaluation is expensive and slow, but it is the ground truth for alignment-focused tasks.
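Win rate reduces to a simple count over pairwise verdicts. A sketch, assuming each annotation is one of `"model"`, `"baseline"`, or `"tie"`, with ties counted as half a win (one common convention):

```python
def win_rate(verdicts):
    """Fraction of prompts where the fine-tuned model beats the baseline.
    Ties contribute half a win each."""
    wins = sum(1.0 for v in verdicts if v == "model")
    ties = sum(0.5 for v in verdicts if v == "tie")
    return (wins + ties) / len(verdicts)

# Hypothetical annotator verdicts over four prompts
verdicts = ["model", "baseline", "model", "tie"]
rate = win_rate(verdicts)  # (2 + 0.5) / 4 = 0.625
```

A win rate above 0.5 means the fine-tuned model is preferred more often than the baseline; report it alongside the number of prompts, since small samples make the estimate noisy.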
LLM-as-Judge
A practical middle ground is using a strong LLM (e.g. GPT-4) to rate responses automatically. The judge model receives the prompt, the response, and a scoring rubric, and outputs a score. This scales better than human evaluation while capturing nuance that BLEU cannot.
# Pseudocode – replace `your_llm_api` with your API client
def llm_judge(prompt, response, rubric):
    judge_prompt = f"""
Prompt: {prompt}
Response: {response}
Rubric: {rubric}
Score the response from 1 to 5. Return only the integer.
"""
    # score = your_llm_api(judge_prompt)
    # return int(score)

rubric = "Is the response helpful, accurate, and polite?"
# llm_judge("How do I reset my password?", "Click Forgot Password.", rubric)