Dataset quality often matters more than fine-tuning technique. The model learns from the data — garbage in, garbage out.
Minimum sizes:
- PEFT (LoRA/QLoRA) narrow task — 500–5,000 high-quality examples usually enough.
- Full fine-tune / SFT — 10K–100K examples; below this overfits easily.
- General instruction tuning — 50K–1M diverse examples.
- Preference data (DPO/RLHF) — 5K–50K preference pairs on top of a good SFT model.
Rule of thumb: doubling the data typically buys only ~10–20% on metrics; investing in quality beats investing in quantity.
Standard format (usually OpenAI chat format or ShareGPT; the example below is OpenAI chat format):
{
  "messages": [
    {"role": "system", "content": "You are an assistant..."},
    {"role": "user", "content": "..."},
    {"role": "assistant", "content": "..."}
  ]
}
Multi-turn: store the full conversation. For function calling: add a tools schema + tool_calls in assistant messages. Check provider-specific formats (OpenAI fine-tune, Together AI, Hugging Face trainer).
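As a rough illustration of keeping one consistent schema, the sketch below converts a ShareGPT-style record into the chat format shown above and runs a few sanity checks before writing a JSONL line. The helper names (`sharegpt_to_messages`, `validate_example`) and the specific checks are assumptions made for this example, not part of any provider's tooling; real providers may enforce different or stricter rules.

```python
import json

# Speaker names used by ShareGPT-style records, mapped to chat-format roles.
SHAREGPT_ROLES = {"system": "system", "human": "user", "gpt": "assistant"}
ALLOWED_ROLES = {"system", "user", "assistant", "tool"}


def sharegpt_to_messages(record):
    """Convert {"conversations": [{"from", "value"}, ...]} to {"messages": [...]}."""
    messages = []
    for turn in record["conversations"]:
        role = SHAREGPT_ROLES[turn["from"]]  # raises KeyError on an unknown speaker
        messages.append({"role": role, "content": turn["value"]})
    return {"messages": messages}


def validate_example(example):
    """Return a list of problems found in one chat-format example (empty if OK)."""
    problems = []
    messages = example.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]
    for i, msg in enumerate(messages):
        if msg.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {i}: unknown role {msg.get('role')!r}")
        has_content = isinstance(msg.get("content"), str) and msg["content"].strip()
        # An assistant turn may carry tool_calls instead of text content.
        if not has_content and not msg.get("tool_calls"):
            problems.append(f"message {i}: empty content")
    # Assumption for this sketch: every example ends on the turn to be learned.
    if messages[-1].get("role") != "assistant":
        problems.append("last message should be the assistant turn to learn from")
    return problems


sharegpt_record = {
    "conversations": [
        {"from": "human", "value": "What is LoRA?"},
        {"from": "gpt", "value": "A parameter-efficient fine-tuning method."},
    ]
}
example = sharegpt_to_messages(sharegpt_record)
assert validate_example(example) == []
jsonl_line = json.dumps(example, ensure_ascii=False)  # one example per line
```

Running a validator like this over the whole file before uploading tends to catch schema drift early; the provider's own format documentation remains the authority on what is accepted.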
Data sources:
1. Human-written — highest quality; use for the golden set.
2. Production logs — log user ↔ prior-LLM conversations; filter good ones + edit into ideal responses.
3. Synthetic generation — strong LLMs (GPT-4, Claude) via Self-Instruct / Evol-Instruct; downsides: teacher bias, legal concerns.
4. Public datasets: OpenHermes, UltraChat, SlimOrca, Tulu-3.
Quality checklist (most important):
- Diversity + dedup — cover edge cases; remove embedding-similar pairs (> 0.95 cosine).
- Correctness — factually, logically, code-wise accurate; errors in data → errors learned.
- Format consistency — same schema/style; inconsistency teaches chaos.
- No leakage — no PII, secrets, or test-set data.
- Split — 5–10% val (early stop), separate test set.
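The dedup and split items in the checklist can be sketched in a few lines. This is a minimal pure-Python illustration, assuming embeddings have already been computed by some model; the toy 2-D vectors, the greedy O(n²) strategy, the function names, and the 10% test fraction are all assumptions for the example (the notes only specify the 0.95 cosine threshold and a 5–10% validation share).

```python
import math
import random


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def dedup(embeddings, threshold=0.95):
    """Greedy dedup: keep an item only if it is not too similar to any kept item."""
    kept = []
    for i, vec in enumerate(embeddings):
        if all(cosine(vec, embeddings[j]) <= threshold for j in kept):
            kept.append(i)
    return kept


def split(indices, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle once, then carve out disjoint train, val, and test index lists."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


# Toy 2-D vectors standing in for real sentence embeddings (hypothetical values).
toy_embeddings = [
    [1.0, 0.0],
    [0.99, 0.05],  # near-duplicate of item 0
    [0.0, 1.0],
    [0.7, 0.7],
]
kept = dedup(toy_embeddings)          # indices of examples that survive dedup
train, val, test = split(range(20))   # split 20 example indices
```

Deduplicating before splitting matters: otherwise near-identical examples can land on both sides of the train/test boundary and inflate the test score. For datasets beyond a few thousand examples, the pairwise loop would likely need to be replaced by an approximate nearest-neighbor index.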
Tools: Argilla, Label Studio, Cleanlab. Workflow: 100–500 hand-written → LoRA eval → synthetic + 10–20% human review → iterate on failure patterns.