Fine-tuning and Adapting LLMs

Evaluating Fine-tuned LLMs


Fine-tuning improves a model, but you need a principled way to measure how much and in what direction. Evaluation for instruction-following LLMs combines automatic metrics and human evaluation – each captures something the other misses.

Automatic Metrics

Automatic metrics are fast, reproducible, and cheap to compute at scale:

  • Perplexity: measures how confidently the model predicts held-out text. Useful for tracking training progress but does not reflect output quality directly;
  • BLEU / ROUGE: measure n-gram overlap between the model's output and a reference answer. Useful for tasks with a single correct answer (e.g. translation), but poor proxies for open-ended generation where many valid responses exist;
  • Accuracy: for classification or multiple-choice tasks, the fraction of correct answers. Straightforward but only applicable to structured outputs.
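Two of these metrics can be sketched in a few lines. The snippet below is an illustrative implementation (the function names `accuracy` and `perplexity` are chosen here, not from a specific library): accuracy as the fraction of exact matches, and perplexity as the exponential of the average negative log-likelihood over held-out tokens.

```python
import math

def accuracy(predictions, labels):
    # Fraction of exact matches -- only meaningful for structured outputs.
    correct = sum(p == l for p, l in zip(predictions, labels))
    return correct / len(labels)

def perplexity(token_log_probs):
    # exp of the average negative log-likelihood of the held-out tokens.
    # A perfectly confident model (log-prob 0 everywhere) scores 1.0.
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)

accuracy(["A", "C", "B"], ["A", "B", "B"])  # 2 of 3 correct
perplexity([-0.1, -0.5, -2.3])              # higher loss -> higher perplexity
```

In practice you would take the per-token log-probabilities from your evaluation loop rather than hard-coding them.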

None of these fully capture whether a response is helpful, safe, or aligned with user expectations.

Human Evaluation

Human evaluation fills the gap. Common approaches:

  • Preference ranking: show annotators two responses to the same prompt and ask which is better. This is the same signal used to train reward models in RLHF;
  • Likert scale rating: annotators score responses on dimensions like helpfulness, correctness, and tone (e.g. 1–5);
  • Win rate: the percentage of prompts where your fine-tuned model is preferred over a baseline.

Human evaluation is expensive and slow, but it is the ground truth for alignment-focused tasks.
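Win rate is simple to compute once pairwise judgments are collected. A minimal sketch, assuming each judgment is labeled "win", "loss", or "tie" versus the baseline (counting ties as half a win is a common convention, not a universal rule):

```python
def win_rate(judgments):
    # judgments: one "win" / "loss" / "tie" label per evaluation prompt,
    # recorded for the fine-tuned model versus the baseline.
    wins = judgments.count("win") + 0.5 * judgments.count("tie")
    return wins / len(judgments)

win_rate(["win", "win", "loss", "tie"])  # 0.625
```

A win rate meaningfully above 0.5 indicates the fine-tuned model is preferred over the baseline; with few prompts, the estimate is noisy, so report the sample size alongside it.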

LLM-as-Judge

A practical middle ground is using a strong LLM (e.g. GPT-4) to rate responses automatically. The judge model receives the prompt, the response, and a scoring rubric, and outputs a score. This scales better than human evaluation while capturing nuance that BLEU cannot.

# Sketch of an LLM-as-judge scorer. `call_llm` is a placeholder for your
# API client (e.g. an OpenAI or Anthropic SDK call) -- swap in your own.
def llm_judge(prompt, response, rubric, call_llm):
    judge_prompt = f"""
    Prompt: {prompt}
    Response: {response}
    Rubric: {rubric}
    Score the response from 1 to 5. Return only the integer.
    """
    raw = call_llm(judge_prompt)  # returns the judge model's text output
    return int(raw.strip())      # fails loudly if the judge ignores the format

rubric = "Is the response helpful, accurate, and polite?"
# llm_judge("How do I reset my password?", "Click Forgot Password.", rubric, call_llm)


Section 1. Chapter 10
