Prompts are the primary artifact of an LLM app: a one-line change can noticeably degrade output quality. Manage them like code:
Prompt versioning:
1. Store prompts outside application code: don't hardcode them as string literals. Options:
- PromptLayer, Langfuse, Braintrust, LangSmith: versioned prompt stores with a UI for editing, diffing, and deploying.
- Self-hosted database (Postgres table: id, name, version, content, model, params, created_at, deployed_at).
- Git: prompts live in YAML/MD files, with a tag per version.
2. Version scheme: semver (v1.0.0, bump major on a breaking change) or timestamp. Each version is immutable; any edit creates a new version.
3. Associate with evaluation: every version carries its eval metrics. Don't deploy a version that hasn't passed the eval suite.
4. Metadata: store the model, temperature, max_tokens, and tool schema alongside the prompt text. Together they form the "prompt contract".
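The four rules above can be sketched as a small in-memory registry. This is a minimal illustration, not any particular tool's API: `PromptVersion`, `PromptRegistry`, and the model id are hypothetical names chosen for the example, and a real store would persist to Postgres or a hosted service.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class PromptVersion:
    """One immutable prompt version plus the settings it was evaluated with."""
    name: str
    version: str   # semver string, e.g. "1.0.0"
    content: str
    model: str     # hypothetical model id; pin an exact version in practice
    params: dict   # temperature, max_tokens, tool schema, ...
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

    @property
    def checksum(self) -> str:
        # Short content hash, handy for spotting accidental edits.
        return hashlib.sha256(self.content.encode("utf-8")).hexdigest()[:12]


class PromptRegistry:
    """In-memory stand-in for an external prompt store."""

    def __init__(self) -> None:
        self._versions: dict[tuple[str, str], PromptVersion] = {}
        self._eval_passed: set[tuple[str, str]] = set()
        self._deployed: dict[str, str] = {}

    def register(self, pv: PromptVersion) -> None:
        key = (pv.name, pv.version)
        if key in self._versions:
            # Versions are immutable: a change must get a new version string.
            raise ValueError(f"{pv.name}@{pv.version} already exists")
        self._versions[key] = pv

    def record_eval(self, name: str, version: str, passed: bool) -> None:
        key = (name, version)
        if key not in self._versions:
            raise KeyError(f"unknown prompt version {name}@{version}")
        if passed:
            self._eval_passed.add(key)
        else:
            self._eval_passed.discard(key)

    def deploy(self, name: str, version: str) -> None:
        # Refuse to deploy anything that has not passed the eval suite.
        if (name, version) not in self._eval_passed:
            raise RuntimeError(f"{name}@{version} has not passed eval")
        self._deployed[name] = version

    def current(self, name: str) -> PromptVersion:
        return self._versions[(name, self._deployed[name])]
```

Keeping the model and parameters inside the same frozen record means a temperature change is forced through the same version-and-eval path as a wording change.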
A/B testing prompts:
1. Offline A/B (do this first)
- Run prompt A and B on the same golden dataset → compare metrics (RAGAS, LLM-judge, human eval).
- Cheap, fast, no production risk.
- Use it to catch clear quality differences before spending any production traffic.
2. Online A/B (shadow)
- Send each real request to both A and B in parallel; only A's response reaches the user, while B's is logged → compare metrics offline.
- No user-facing risk from B.
- Cost: 2x LLM calls during the shadow period.
3. Online A/B (split traffic)
- Split traffic: 50% A, 50% B (or 5% canary B).
- Measure real metrics: CSAT, task completion, retention, conversion.
- Needs feature-flag infrastructure (LaunchDarkly, Flagsmith, Unleash, GrowthBook).
- Statistical rigor: fix the sample size in advance, then report the p-value and confidence interval. Don't stop early because B "looks better"; repeated peeking inflates false positives.
- Use for product-level metrics offline eval can't capture.
4. Bandit test: a bandit algorithm (e.g. Thompson sampling) automatically shifts traffic toward the better-performing variants. Useful when there are many variants and optimization is ongoing.
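To make the bandit option concrete, here is a minimal Thompson sampling sketch for a binary reward (for example, "task completed" or not). It assumes a Beta(1, 1) prior per variant; `ThompsonSampler` and `simulate` are illustrative names, and the success rates in the simulation are made up rather than measured.

```python
import random


class ThompsonSampler:
    """Beta-Bernoulli Thompson sampling over prompt variants."""

    def __init__(self, variants, seed=None):
        self._rng = random.Random(seed)
        # [alpha, beta] per variant, starting from a uniform Beta(1, 1) prior.
        self.stats = {v: [1, 1] for v in variants}

    def choose(self) -> str:
        # Draw one plausible success rate per variant, serve the highest draw.
        draws = {
            v: self._rng.betavariate(a, b) for v, (a, b) in self.stats.items()
        }
        return max(draws, key=draws.get)

    def update(self, variant: str, success: bool) -> None:
        if success:
            self.stats[variant][0] += 1
        else:
            self.stats[variant][1] += 1


def simulate(true_rates, rounds, seed=0):
    """Replay the sampler against made-up success rates; return pull counts."""
    outcome_rng = random.Random(seed + 1)
    sampler = ThompsonSampler(list(true_rates), seed=seed)
    pulls = {v: 0 for v in true_rates}
    for _ in range(rounds):
        variant = sampler.choose()
        pulls[variant] += 1
        sampler.update(variant, outcome_rng.random() < true_rates[variant])
    return pulls
```

Calling `simulate({"A": 0.3, "B": 0.8}, rounds=3000)` should show most traffic drifting to B. The trade-off against a fixed split is that unequal allocation makes a classic significance test harder to interpret.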
Metrics to track during A/B:
- Quality: eval score, faithfulness, user rating.
- Engagement: CSAT, resolution rate, retry rate, abandon rate.
- Operational: latency p50/p95, cost/request, error rate.
- Guardrails: refusal rate, safety violation rate.
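For a binary metric such as task completion, the statistical rigor asked of a split-traffic test can be sketched with the standard library alone. This assumes a two-sided two-proportion z-test and the usual normal approximation; the function names are illustrative, and non-binary metrics (latency, cost) need a different test.

```python
from __future__ import annotations

from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_arm(
    p_base: float, min_effect: float, alpha: float = 0.05, power: float = 0.8
) -> int:
    """Approximate users needed per arm to detect p_base -> p_base + min_effect."""
    p_new = p_base + min_effect
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_base * (1 - p_base) + p_new * (1 - p_new)
    return ceil((z_alpha + z_beta) ** 2 * variance / min_effect ** 2)


def two_proportion_z_test(
    success_a: int, n_a: int, success_b: int, n_b: int
) -> tuple[float, float]:
    """Return (z, two-sided p-value) for the difference in success rates B - A."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Both arms are all-success or all-failure: no measurable difference.
        return 0.0, 1.0
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value
```

Compute `sample_size_per_arm` before the test starts and evaluate `two_proportion_z_test` once that many users per arm have been collected, rather than checking repeatedly along the way.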
Rollback strategy:
1. Blue-green deployment for prompts: keep both the old and new versions deployed and let a router pick between them, so rollback is an instant switch.
2. Config-driven: the prompt version is an env var or config entry. Deploy = update config, no rebuild. Rollback = revert config.
3. Auto-rollback on regression: alert when a core metric drops by more than X%, then revert automatically. Requires dashboards and alerting.
4. Kill switch: a feature flag such as "disable prompt V2, fall back to V1" gives one-click rollback.
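The rollback mechanisms above can be combined in one small router. The sketch below assumes the config lives in memory; in practice it would come from a feature-flag service or config store. `RolloutConfig`, `pick_version`, and `maybe_auto_rollback` are hypothetical names, not part of any flag SDK.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass


@dataclass
class RolloutConfig:
    stable: str = "v1"
    candidate: str | None = "v2"
    candidate_percent: int = 5   # canary share of traffic, 0-100
    kill_switch: bool = False    # True forces everyone back to stable


def bucket(user_id: str) -> int:
    """Map a user to a stable bucket in 0-99 so assignment is sticky."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def pick_version(cfg: RolloutConfig, user_id: str) -> str:
    if cfg.kill_switch or cfg.candidate is None:
        return cfg.stable
    if bucket(user_id) < cfg.candidate_percent:
        return cfg.candidate
    return cfg.stable


def maybe_auto_rollback(
    cfg: RolloutConfig,
    baseline_score: float,
    candidate_score: float,
    max_drop_pct: float,
) -> bool:
    """Flip the kill switch if the candidate is worse by more than max_drop_pct."""
    if baseline_score <= 0:
        raise ValueError("baseline_score must be positive")
    drop_pct = (baseline_score - candidate_score) / baseline_score * 100
    if drop_pct > max_drop_pct:
        cfg.kill_switch = True
        return True
    return False
```

Because both versions stay available and the router only reads config, rolling back is a config write rather than a rebuild.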
Model version management is similar:
- OpenAI/Anthropic ship new model versions regularly: pin an exact version in production instead of auto-upgrading.
- Test new models on shadow traffic first.
- Roll back on regression.
- Many production bugs appear when a provider silently deprecates a model and forces an upgrade.
Regression suite (required in CI/CD):
1. Maintain an eval set of 100–500 queries, each with an expected output or target metric.
2. Run it on every prompt or model change.
3. Fail the build if the score drops beyond a set threshold.
4. Block merging of prompt-change PRs that regress.
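A regression gate along these lines might look like the following sketch. The exact-match scorer, the `max_drop` default, and all function names are assumptions for illustration; a real suite would plug in its own generation call and metric (LLM-judge, RAGAS, and so on).

```python
from __future__ import annotations

from typing import Callable


def exact_match(output: str, expected: str) -> float:
    """Toy scorer: 1.0 on a case-insensitive exact match, else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_eval(
    cases: list[dict],
    generate: Callable[[str], str],
    score: Callable[[str, str], float] = exact_match,
) -> float:
    """Average score of `generate` over cases shaped like {"query", "expected"}."""
    if not cases:
        raise ValueError("eval set is empty")
    scores = [score(generate(c["query"]), c["expected"]) for c in cases]
    return sum(scores) / len(scores)


def regression_gate(
    candidate_score: float, baseline_score: float, max_drop: float = 0.02
) -> bool:
    """True if the candidate is within max_drop of the baseline."""
    return candidate_score >= baseline_score - max_drop


def ci_exit_code(
    candidate_score: float, baseline_score: float, max_drop: float = 0.02
) -> int:
    """0 lets the build pass; 1 fails it and so blocks the merge."""
    if regression_gate(candidate_score, baseline_score, max_drop):
        return 0
    return 1
```

In a CI job, the script would end with `sys.exit(ci_exit_code(...))`, so a regression fails the pipeline and, with branch protection enabled, blocks the PR.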
Anti-patterns to avoid:
- Hardcoding prompts so that every change needs a rebuild to deploy.
- Shipping prompts straight to production without eval.
- Not logging the prompt version in traces → can't tell which version caused a bug.
- A/B testing without statistical rigor → wrong conclusions.
- Rolling back via Git revert → logs from the version that already ran in production become hard to untangle.
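One way to avoid the missing-version anti-pattern is to stamp every trace with the prompt name, version, and model. The sketch below uses only the standard library; the field names and logger name are illustrative, and an observability tool would normally supply its own trace schema.

```python
import json
import logging
from datetime import datetime, timezone

logger = logging.getLogger("llm.trace")


def build_trace(
    request_id: str,
    prompt_name: str,
    prompt_version: str,
    model: str,
    latency_ms: float,
) -> dict:
    """Collect the fields needed to tie a response back to a prompt version."""
    return {
        "request_id": request_id,
        "prompt_name": prompt_name,
        "prompt_version": prompt_version,
        "model": model,
        "latency_ms": latency_ms,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }


def log_trace(trace: dict) -> str:
    """Emit the trace as one JSON line and return the line."""
    line = json.dumps(trace, sort_keys=True)
    logger.info(line)
    return line
```

With the version in every log line, a bug report can be filtered down to the exact prompt and model that produced it.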