How would you design an AI code review system that scales to an organization of 1000+ engineers?

Question

Luyện Phỏng Vấn IT · Accepted Answer

The system has AI review PRs automatically and deliver feedback before the human reviewer does, reducing review time and the number of bugs that reach production.

Requirements:
- 1000 engineers × ~5 PRs/week = 5K PRs/week ≈ 700 PRs/day.
- The average PR changes 200 LOC; some PRs exceed 5000 LOC.
- Feedback needs [...] core > experimental. Deduplicate when a PR is updated several times in quick succession (rebase).

3. Context Builder: prepares the full context for the LLM:
- PR diff: unified diff with 3 lines of context on each side.
- PR metadata: title, description, linked issues, author history.
- Full content of touched files (for small files).
- Symbol dependencies: where the changed functions are called (using LSP/tree-sitter).
- Related files: test files for the changed code, related config.
- Codebase conventions: coding standard, style guide, previous review patterns.
- Past similar PRs: from memory/RAG.

4. Code indexer: pre-builds an index of the codebase:
- Symbol graph (function/class/import) built with tree-sitter, ctags, or a language server.
- Embeddings of functions/files, used to retrieve similar code.
- Updated incrementally per commit.
- Use Sourcegraph, the Aider repo map, or build your own.

5. Parallel review agents: each agent is specialized:
- Security agent: SQL injection, XSS, secret leaks, auth bypass. System prompt focused on security and OWASP.
- Bug agent: null references, off-by-one, race conditions, resource leaks. Focuses on logic errors.
- Performance agent: N+1 queries, inefficient algorithms, memory leaks.
- Style agent: convention violations, naming, documentation. Rule-based linter first, LLM for nuance.
- Test coverage agent: whether the new change has tests, and whether edge cases are covered.
- Architecture agent: separation of concerns, SOLID, dependency violations.
- Documentation agent: missing docstrings, changelog.
Agents run in parallel, then their outputs are merged.

6. LLM strategy:
- Small PR ([...] 1000 LOC): chunk by file, review each file, aggregate.
- Critical repos (payment, security): always use the strongest available model (for example o1 or Claude 3.5 Sonnet).

7. Memory/RAG layer:
- Past review patterns: when a human reviewer has approved or rejected a similar issue, store it in memory.
- Repo-specific conventions: learned automatically from the codebase.
- The team's common bugs (from incident history and the bug tracker).

8. Report composer: formats output for GitHub/GitLab:
- Inline comments on the specific lines that have an issue.
- PR summary covering the top concerns.
- Severity labels: 🔴 must-fix, 🟡 suggestion, 🟢 nit.
- Confidence: "I'm 90% sure this is a bug" vs "Consider whether...".
- Citations: links to similar past PRs and docs.
- Auto-suggested fix when confident (PR suggestion block).

9. Feedback loop: critical for quality:
- Track feedback: human reviewers give 👍/👎 on AI comments; authors dismiss or apply suggestions.
- Aggregate: false positive rate, useful-to-noise ratio.
- Fine-tune or adjust prompts based on the feedback.
- Blacklist comment types with a high false positive rate.

10. Privacy & security:
- Code must not leave the organization: self-host the LLM for sensitive repos (Llama 3.3 70B, Qwen 2.5 Coder).
- Enterprise agreement with the provider (zero data retention, ZDR, with Anthropic or OpenAI).
- Scan for secrets before sending the prompt (remove API keys and password patterns).
- Audit-log every LLM call.

Scale considerations:
- Throughput: 700 PRs/day × 6 parallel agents = ~4200 LLM calls/day. With a 70% prompt cache hit rate: ~1200 non-cached calls. Feasible with multiple providers.
- Latency: target p95 < 5 min. Parallelize agents, chunk large PRs, pipeline the steps.
- Cost: estimated $0.5-3 per PR. 700 PRs × $2 = $1400/day. Compared with the cost of human review ($50-200 of human time per PR), the ROI is clear.
- Storage: PR context, review history, metrics. Postgres + S3.

Rollout strategy:
1. Shadow mode: the AI reviews but does not post; compare against human reviews for 2 weeks.
2. Opt-in beta: a few teams try it.
3. Default on, easy opt-out: developers can disable it if they do not want it.
4. Gradual trust: initially suggestions only; once accuracy is proven, auto-request-changes for critical issues.

Anti-patterns:
- Reviewing every PR with the same model and depth: wastes cost.
- Too many comments: noise that developers learn to ignore.
- No feedback loop: quality never improves.
- Blocking merges on AI comments: frustrates developers.
- Ignoring codebase context: comments end up generic.

Real-world benchmarks:
- GitHub Copilot Pull Request (Copilot Workspace), CodeRabbit, Codium PR-Agent, Sweep AI, Greptile, and Vercel Agent all implement a similar pattern.
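
The throughput and cost figures under "Scale considerations" are simple arithmetic, and it can help to keep them in a small function so the assumptions are easy to change. The sketch below is only a back-of-the-envelope calculator: the function and field names are made up for illustration, and the defaults are the numbers quoted in the answer (700 PRs/day, 6 agents, 70% cache hit rate, $2 per PR).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScaleEstimate:
    llm_calls_per_day: float
    uncached_calls_per_day: float
    cost_per_day_usd: float


def estimate_scale(
    prs_per_day: float = 700,      # 1000 engineers x ~5 PRs/week, rounded
    agents_per_pr: int = 6,        # parallel review agents, as in the answer
    cache_hit_rate: float = 0.70,  # assumed prompt-cache hit rate
    cost_per_pr_usd: float = 2.0,  # mid-range of the $0.5-3 estimate
) -> ScaleEstimate:
    """Return daily LLM call volume and cost for the given assumptions."""
    calls = prs_per_day * agents_per_pr
    uncached = calls * (1 - cache_hit_rate)
    return ScaleEstimate(calls, uncached, prs_per_day * cost_per_pr_usd)


if __name__ == "__main__":
    est = estimate_scale()
    print(f"LLM calls/day:      {est.llm_calls_per_day:.0f}")
    print(f"Non-cached calls:   {est.uncached_calls_per_day:.0f}")
    print(f"Cost/day (USD):     {est.cost_per_day_usd:.0f}")
```

With the defaults this gives 4200 calls/day, about 1260 non-cached calls (the answer rounds this to ~1200), and $1400/day.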
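
The "secret scan before sending the prompt" step from the privacy section could look roughly like the following. This is a minimal sketch, not a complete scanner: the two regex rules and the `redact_secrets` name are assumptions chosen for illustration, and a production system would more likely rely on a dedicated secret-scanning tool with a much larger rule set.

```python
import re
from typing import List, Tuple

# Illustrative rules only (hypothetical, far from exhaustive).
_RULES: List[Tuple["re.Pattern[str]", str]] = [
    # key = value pairs where the key name suggests a credential;
    # the key and any quotes are kept, only the value is replaced.
    (
        re.compile(
            r"(password|passwd|secret|api[_-]?key|token)(\s*[:=]\s*)(['\"]?)([^\s'\"]+)\3",
            re.IGNORECASE,
        ),
        r"\1\2\3[REDACTED]\3",
    ),
    # Strings shaped like an AWS access key ID (AKIA + 16 characters).
    (re.compile(r"\bAKIA[0-9A-Z]{16}\b"), "[REDACTED]"),
]


def redact_secrets(text: str) -> Tuple[str, int]:
    """Return (redacted_text, number_of_redactions) for a prompt fragment."""
    total = 0
    for pattern, replacement in _RULES:
        text, count = pattern.subn(replacement, text)
        total += count
    return text, total


if __name__ == "__main__":
    diff = 'API_KEY = "sk-abc123"\nkey_id: AKIAIOSFODNN7EXAMPLE\n'
    clean, n = redact_secrets(diff)
    print(clean)
    print("redactions:", n)
```

Returning the redaction count makes it easy to record in the audit log that a prompt was modified before it left the organization.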
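
The feedback loop described in step 9 (track reactions, aggregate a false positive rate, suppress noisy comment types) can be sketched as below. The data shape, the function names, and the thresholds (`max_fp_rate`, `min_samples`) are all hypothetical; the answer does not specify them, and real values would have to be tuned from shadow-mode data.

```python
from collections import Counter
from typing import Dict, Iterable, Set, Tuple

# One feedback event: (comment_type, accepted), where accepted is True for a
# thumbs-up or applied suggestion and False for a thumbs-down or dismissal.
Feedback = Tuple[str, bool]


def false_positive_rates(feedback: Iterable[Feedback]) -> Dict[str, Tuple[float, int]]:
    """Map each comment type to (false_positive_rate, sample_count)."""
    total: Counter = Counter()
    rejected: Counter = Counter()
    for comment_type, accepted in feedback:
        total[comment_type] += 1
        if not accepted:
            rejected[comment_type] += 1
    return {t: (rejected[t] / n, n) for t, n in total.items()}


def suppressed_types(
    rates: Dict[str, Tuple[float, int]],
    max_fp_rate: float = 0.5,  # assumed threshold
    min_samples: int = 20,     # avoid suppressing on too little data
) -> Set[str]:
    """Comment types whose false positive rate is too high to keep posting."""
    return {
        t for t, (rate, n) in rates.items()
        if n >= min_samples and rate > max_fp_rate
    }


if __name__ == "__main__":
    events = [("style-nit", False)] * 30 + [("style-nit", True)] * 10
    events += [("sql-injection", True)] * 25
    rates = false_positive_rates(events)
    print(rates)
    print(suppressed_types(rates))
```

The `min_samples` guard is a design choice: without it, a single dismissed comment would be enough to silence an entire category.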