Fine-tuning and Adapting LLMs

Evaluating Fine-tuned LLMs


Fine-tuning improves a model, but you need a principled way to measure how much and in what direction. Evaluation for instruction-following LLMs combines automatic metrics and human evaluation: each captures something the other misses.

Automatic Metrics

Automatic metrics are fast, reproducible, and cheap to compute at scale:

  • Perplexity: measures how confidently the model predicts held-out text. Useful for tracking training progress but does not reflect output quality directly;
  • BLEU / ROUGE: measure n-gram overlap between the model's output and a reference answer. Useful for tasks with a single correct answer (e.g. translation), but poor proxies for open-ended generation where many valid responses exist;
  • Accuracy: for classification or multiple-choice tasks, the fraction of correct answers. Straightforward but only applicable to structured outputs.

None of these fully capture whether a response is helpful, safe, or aligned with user expectations.
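Of the metrics above, perplexity is the simplest to compute yourself: it is the exponential of the average negative log-likelihood the model assigns to held-out tokens. A minimal sketch (the token log-probabilities here are made-up illustrative values):

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the average negative log-likelihood per token."""
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

# Hypothetical per-token log-probs two models assign to the same held-out text.
# Model A is more confident, so its perplexity is lower:
model_a = [-0.5, -1.2, -0.3, -0.8]
model_b = [-2.1, -1.9, -2.4, -2.0]
print(perplexity(model_a) < perplexity(model_b))  # True
```

A model that predicted every token with probability 1 would reach the floor of perplexity 1; higher values mean more uncertainty.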

Human Evaluation

Human evaluation fills the gap. Common approaches:

  • Preference ranking: show annotators two responses to the same prompt and ask which is better. This is the same signal used to train reward models in RLHF;
  • Likert scale rating: annotators score responses on dimensions like helpfulness, correctness, and tone (e.g. 1–5);
  • Win rate: the percentage of prompts where your fine-tuned model is preferred over a baseline.

Human evaluation is expensive and slow, but it is the ground truth for alignment-focused tasks.
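Win rate is easy to compute once annotator verdicts are collected. A small sketch, assuming each verdict is one of "fine_tuned", "baseline", or "tie", with ties counted as half a win (one common convention, not the only one):

```python
from collections import Counter

def win_rate(verdicts):
    """Fraction of prompts where the fine-tuned model was preferred.
    Ties count as half a win."""
    counts = Counter(verdicts)
    wins = counts["fine_tuned"] + 0.5 * counts["tie"]
    return wins / len(verdicts)

# Hypothetical annotator verdicts for 8 prompts:
verdicts = ["fine_tuned", "baseline", "fine_tuned", "tie",
            "fine_tuned", "fine_tuned", "baseline", "tie"]
print(win_rate(verdicts))  # 0.625
```

A win rate meaningfully above 0.5 indicates the fine-tuned model is preferred over the baseline; values near 0.5 suggest no clear improvement.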

LLM-as-Judge

A practical middle ground is using a strong LLM (e.g. GPT-4) to rate responses automatically. The judge model receives the prompt, the response, and a scoring rubric, and outputs a score. This scales better than human evaluation while capturing nuance that BLEU cannot.

# Sketch – pass your own LLM API client in as `call_llm`,
# a callable that takes a prompt string and returns the model's reply.
def llm_judge(prompt, response, rubric, call_llm):
    judge_prompt = (
        f"Prompt: {prompt}\n"
        f"Response: {response}\n"
        f"Rubric: {rubric}\n"
        "Score the response from 1 to 5. Return only the integer."
    )
    # Parse the judge's reply into an integer score
    return int(call_llm(judge_prompt).strip())

rubric = "Is the response helpful, accurate, and polite?"
# llm_judge("How do I reset my password?", "Click Forgot Password.", rubric, call_llm)
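In practice you run the judge over a whole evaluation set and report the average score. A self-contained sketch, using a stub judge with made-up scores in place of a real LLM call:

```python
import statistics

def evaluate_set(eval_set, judge):
    """Run a judge over (prompt, response) pairs and average the scores.
    `judge` is any callable returning an integer score, e.g. an LLM-backed judge."""
    scores = [judge(prompt, response) for prompt, response in eval_set]
    return statistics.mean(scores)

# Stub judge standing in for a real LLM call (hypothetical scores):
stub_scores = {"Click Forgot Password.": 4, "I don't know.": 2}
def stub_judge(prompt, response):
    return stub_scores[response]

eval_set = [("How do I reset my password?", "Click Forgot Password."),
            ("Where is my invoice?", "I don't know.")]
print(evaluate_set(eval_set, stub_judge))  # 3
```

Comparing this average against the same run for a baseline model gives a scalable stand-in for human preference ranking, though judge models have known biases (e.g. favoring longer responses), so spot-checking against human ratings is still advisable.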
