Chuẩn bị dataset cho fine-tune LLM: format, kích thước, chất lượng?

Question

Luyện Phỏng Vấn IT · Accepted Answer

Chất lượng dataset thường quan trọng hơn kỹ thuật fine-tune. Model sẽ học từ dữ liệu — garbage in, garbage out. Kích thước tối thiểu: - PEFT (LoRA/QLoRA) task hẹp — 500-5,000 ví dụ chất lượng cao thường đủ. - Full fine-tune / SFT — 10K-100K ví dụ; dưới mức này dễ overfit. - Instruction tuning chung — 50K-1M ví dụ đa dạng task. - Preference data (DPO/RLHF) — 5K-50K pair preference đã có SFT model tốt. Quy luật: gấp đôi data chỉ giúp ~10-20% metric; mua chất lượng tăng > mua số lượng. Format chuẩn (thường OpenAI chat format hoặc ShareGPT): Multi-turn: lưu full conversation. Với function calling: thêm tools schema + toolcalls trong assistant message. Kiểm tra provider-specific format (OpenAI fine-tune, Together AI, Hugging Face trainer). Nguồn data: 1. Human-written — chất lượng cao nhất, dùng cho golden dataset. 2. Production logs — log user ↔ LLM cũ; filter conversation tốt + edit thành ideal response. 3. Synthetic generation — LLM mạnh (GPT-4, Claude) sinh data qua Self-Instruct/Evol-Instruct; Nhược: bias teacher model, legal concern. 4. Public datasets: OpenHermes, UltraChat, SlimOrca, Tulu-3. Checklist chất lượng (quan trọng nhất): - Diversity + dedup — cover edge case, dedup bằng embedding similarity (>0.95 cosine = duplicate). - Correctness — fact, logic, code đúng; sai ở data → model học sai. - Format consistency — cùng schema/style; inconsistency dạy model chaos. - No leakage — không có PII, secret, test-set data. - Split — 5-10% val (early stop), test set độc lập. Tools: Argilla, Label Studio, Cleanlab. Workflow: 100-500 hand-written → LoRA eval → synthetic + human-review 10-20% → iterate theo failure pattern.