Fine-tuning and Adapting LLMs

Evaluating Fine-tuned LLMs


Fine-tuning improves a model, but you need a principled way to measure how much and in what direction. Evaluation for instruction-following LLMs combines automatic metrics and human evaluation – each captures something the other misses.

Automatic Metrics

Automatic metrics are fast, reproducible, and cheap to compute at scale:

  • Perplexity: measures how confidently the model predicts held-out text. Useful for tracking training progress but does not reflect output quality directly;
  • BLEU / ROUGE: measure n-gram overlap between the model's output and a reference answer. Useful for tasks with a single correct answer (e.g. translation), but poor proxies for open-ended generation where many valid responses exist;
  • Accuracy: for classification or multiple-choice tasks, the fraction of correct answers. Straightforward but only applicable to structured outputs.

None of these fully capture whether a response is helpful, safe, or aligned with user expectations.
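To make the first metric concrete: perplexity is the exponential of the average per-token cross-entropy loss on held-out text. A minimal sketch, using made-up loss values rather than output from a real model:

```python
import math

def perplexity(token_losses):
    """Perplexity = exp(mean per-token cross-entropy loss)."""
    return math.exp(sum(token_losses) / len(token_losses))

# Hypothetical per-token negative log-likelihoods from a held-out set
losses = [2.1, 1.8, 2.5, 1.9]
print(perplexity(losses))  # roughly 8: the model is about as uncertain
                           # as picking uniformly among ~8 tokens
```

Lower perplexity after fine-tuning means the model assigns higher probability to the held-out text, but as noted above, it says nothing about helpfulness or safety.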

Human Evaluation

Human evaluation fills the gap. Common approaches:

  • Preference ranking: show annotators two responses to the same prompt and ask which is better. This is the same signal used to train reward models in RLHF;
  • Likert scale rating: annotators score responses on dimensions like helpfulness, correctness, and tone (e.g. 1–5);
  • Win rate: the percentage of prompts where your fine-tuned model is preferred over a baseline.

Human evaluation is expensive and slow, but it is the ground truth for alignment-focused tasks.
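Win rate is simple to compute once preference labels are collected. A minimal sketch, with hypothetical annotator labels; counting ties as half a win is a common convention, not a universal rule:

```python
def win_rate(preferences):
    """Fraction of prompts where the fine-tuned model was preferred.

    `preferences` holds one label per prompt:
    'fine_tuned', 'baseline', or 'tie'. Ties count as half a win.
    """
    wins = sum(1.0 for p in preferences if p == "fine_tuned")
    ties = sum(0.5 for p in preferences if p == "tie")
    return (wins + ties) / len(preferences)

# Hypothetical labels for 8 prompts
labels = ["fine_tuned", "baseline", "fine_tuned", "tie",
          "fine_tuned", "fine_tuned", "baseline", "tie"]
print(win_rate(labels))  # 0.625
```

A win rate above 0.5 against the baseline suggests the fine-tuning helped on the evaluated prompt set.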

LLM-as-Judge

A practical middle ground is using a strong LLM (e.g. GPT-4) to rate responses automatically. The judge model receives the prompt, the response, and a scoring rubric, and outputs a score. This scales better than human evaluation while capturing nuance that BLEU cannot.

# Pseudocode – replace the commented call with your actual API client
def llm_judge(prompt, response, rubric):
    judge_prompt = (
        f"Prompt: {prompt}\n"
        f"Response: {response}\n"
        f"Rubric: {rubric}\n"
        "Score the response from 1 to 5. Return only the integer."
    )
    # score = your_llm_api(judge_prompt)  # placeholder for a real API call
    # return int(score.strip())          # parse the judge's reply into an int

rubric = "Is the response helpful, accurate, and polite?"
# llm_judge("How do I reset my password?", "Click Forgot Password.", rubric)

