Preparing Instruction Datasets
The quality of your instruction dataset directly determines the quality of your fine-tuned model. A poorly curated dataset – inconsistent formats, noisy responses, sensitive data left in – will produce a model that is unreliable regardless of how well the training loop is implemented.
From Chat Logs to Instruction-Response Pairs
Raw customer support logs need to be converted into clean prompt-response pairs. For each exchange, extract the customer query as the instruction and the agent reply as the response. Strip greetings, small talk, and off-topic content:
# Example raw chat log entry
log = {
    "customer": "Hi there! I can't reset my password. Can you help?",
    "agent": "Of course! Click 'Forgot Password' on the login page and follow the steps. Let me know if you need more help."
}

# Clean the exchange into a structured instruction-response pair,
# stripping the greeting from both the customer and agent sides
def extract_pair(log):
    instruction = log["customer"].replace("Hi there! ", "").strip()
    response = log["agent"].replace("Of course! ", "").strip()
    return {"instruction": instruction, "response": response}

pair = extract_pair(log)
print(pair)
In a real pipeline you would apply this across thousands of logs, then store the result as JSONL – one JSON object per line – which is the standard format for SFT datasets.
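The batch step can be sketched as follows. This is a minimal illustration, assuming `logs` is a list of dicts shaped like the example above and using the same `extract_pair` cleaning; the filename `support_pairs.jsonl` is arbitrary:

```python
import json

# Hypothetical batch of raw logs, shaped like the example above
logs = [
    {"customer": "Hi there! I can't reset my password. Can you help?",
     "agent": "Of course! Click 'Forgot Password' on the login page."},
    {"customer": "Hi there! My order is late. Can you help?",
     "agent": "Of course! Please share your order number."},
]

def extract_pair(log):
    # Same cleaning as above: strip greetings from both sides
    instruction = log["customer"].replace("Hi there! ", "").strip()
    response = log["agent"].replace("Of course! ", "").strip()
    return {"instruction": instruction, "response": response}

# Write one JSON object per line: the standard JSONL layout for SFT datasets
with open("support_pairs.jsonl", "w") as f:
    for log in logs:
        f.write(json.dumps(extract_pair(log)) + "\n")
```

Each line of the output file is an independent JSON object, so downstream tools can stream the dataset without loading it all into memory.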
Dataset Quality Checklist
Before training, validate your dataset:
- Remove duplicates: repeated pairs bias the model toward overrepresented responses;
- Anonymize: strip names, emails, account numbers, and any other PII;
- Correct typos and grammar: instructions and responses should read professionally;
- Check factual accuracy: responses must reflect current policies and correct information;
- Remove incomplete pairs: discard exchanges where the agent did not resolve the issue or the instruction is ambiguous;
- Balance issue types: if 80% of your pairs cover password resets, the model will underperform on everything else;
- Spot-check: sample randomly and review for tone, accuracy, and format consistency;
- Get expert review: have subject matter experts check a subset for accuracy and tone.
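Two of these checks, deduplication and PII anonymization, are easy to automate. Below is a minimal sketch; the regex patterns for emails and account numbers are illustrative assumptions, not production-grade detectors:

```python
import re

# Illustrative PII patterns (assumption: account numbers are 8+ digit runs)
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
ACCOUNT = re.compile(r"\b\d{8,}\b")

def anonymize(text):
    # Replace detected PII with placeholder tokens
    text = EMAIL.sub("[EMAIL]", text)
    return ACCOUNT.sub("[ACCOUNT]", text)

def clean_dataset(pairs):
    seen = set()
    cleaned = []
    for p in pairs:
        key = (p["instruction"], p["response"])
        if key in seen:  # drop exact duplicate pairs
            continue
        seen.add(key)
        cleaned.append({
            "instruction": anonymize(p["instruction"]),
            "response": anonymize(p["response"]),
        })
    return cleaned

pairs = [
    {"instruction": "Reset my password for jane@example.com", "response": "Use 'Forgot Password'."},
    {"instruction": "Reset my password for jane@example.com", "response": "Use 'Forgot Password'."},
]
print(clean_dataset(pairs))  # one pair remains, with the email replaced by [EMAIL]
```

For real data you would extend this with name detection (e.g. a named-entity tagger) and near-duplicate matching, since paraphrased repeats slip past exact-match deduplication.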
Dataset Format
A standard JSONL format for SFT:
import json

pairs = [
    {
        "instruction": "I can't reset my password. Can you help?",
        "response": "Click 'Forgot Password' on the login page and follow the steps."
    },
    {
        "instruction": "My order hasn't arrived yet. What should I do?",
        "response": "Please share your order number and I'll look into the status right away."
    }
]

# One JSON object per line
with open("support_dataset.jsonl", "w") as f:
    for pair in pairs:
        f.write(json.dumps(pair) + "\n")
Run this locally to create a small sample dataset, then open the file and verify the format before feeding it into your training pipeline.
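The verification step can also be done programmatically. A quick read-back check, sketched here against the sample file written above (field names assumed to match that example):

```python
import json

# Recreate the small sample file, then read it back to validate the format
pairs = [
    {"instruction": "I can't reset my password. Can you help?",
     "response": "Click 'Forgot Password' on the login page and follow the steps."},
]
with open("support_dataset.jsonl", "w") as f:
    for pair in pairs:
        f.write(json.dumps(pair) + "\n")

# Sanity check: every line must parse as JSON and carry both fields
with open("support_dataset.jsonl") as f:
    for i, line in enumerate(f, start=1):
        record = json.loads(line)  # raises ValueError if a line is not valid JSON
        assert "instruction" in record and "response" in record, f"line {i} is missing a field"
print("format OK")
```

Running a check like this before training catches malformed lines early, when they are cheap to fix.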