A system that uses AI to review PRs automatically and post feedback before a human reviewer looks, cutting review time and the number of bugs that reach production.
Requirements:
- 1000 engineers × ~5 PRs/week = 5K PRs/week ≈ 700 PRs/day averaged over seven days (closer to 1,000 on a working day).
- Average PR changes ~200 LOC; some exceed 5,000 LOC.
- Feedback in < 5 minutes, so developers are not left blocked.
- Supports multiple languages (TypeScript, Python, Go, Java).
- Integrates with GitHub / GitLab.
- Respects code privacy (no code leaks to outside parties).
High-level architecture:
                  ┌────────────────────────────────┐
                  │    GitHub / GitLab Webhook     │
                  │     (on PR opened/updated)     │
                  └──────────────┬─────────────────┘
                                 │
                                 ▼
                  ┌────────────────────────────────┐
                  │   Intake Queue (SQS/BullMQ)    │
                  │   - Dedup, priority            │
                  └──────────────┬─────────────────┘
                                 │
                                 ▼
┌────────────────────────────────────────────────────────────────┐
│                          Orchestrator                          │
│  ┌────────────────┐   ┌────────────────┐   ┌────────────────┐  │
│  │ Context Builder│ → │ Parallel Agents│ → │ Report Composer│  │
│  └────────────────┘   └────────────────┘   └────────────────┘  │
└────────────┬───────────────────┬───────────────────┬───────────┘
             │                   │                   │
             ▼                   ▼                   ▼
     ┌──────────────┐    ┌──────────────┐    ┌──────────────┐
     │ Code Indexer │    │ LLM Gateway  │    │ Memory/RAG   │
     │ (symbols,    │    │ (Claude 3.5, │    │ (past PRs,   │
     │  ts-server)  │    │  GPT-4o)     │    │  patterns)   │
     └──────────────┘    └──────────────┘    └──────────────┘
             │                   │                   │
             └───────────────────┴───────────────────┘
                                 │
                                 ▼
                  ┌────────────────────────────────┐
                  │       Post Review to PR        │
                  │   (inline comments, summary)   │
                  └────────────────────────────────┘
Component details:
1. Webhook handler — lightweight service that receives PR events and enqueues review tasks. Enforces rate limits and verifies webhook signatures.
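Signature verification is the part of the handler most worth getting exactly right. GitHub signs each delivery with HMAC-SHA256 over the raw request body and sends the result in the `X-Hub-Signature-256` header as `sha256=<hex digest>`. A minimal sketch of the check follows; the function name is hypothetical, and GitLab needs a different check (it sends the shared secret in `X-Gitlab-Token`).

```python
import hashlib
import hmac


def verify_github_signature(secret: bytes, payload: bytes, signature_header: str) -> bool:
    """Return True if the header equals 'sha256=' + HMAC-SHA256(secret, payload)."""
    if not signature_header.startswith("sha256="):
        return False
    expected = "sha256=" + hmac.new(secret, payload, hashlib.sha256).hexdigest()
    # compare_digest runs in constant time, so it does not leak where the strings differ.
    return hmac.compare_digest(expected.encode(), signature_header.encode())
```

The check must run on the raw bytes of the body, before any JSON parsing or re-serialization, otherwise the digest will not match.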
2. Intake queue — decouples the webhook from processing. Priority order: security-critical repos > core > experimental. Deduplicates when a PR is updated several times in quick succession (e.g. rebases).
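The dedup and priority behaviour can be illustrated with an in-memory stand-in for the real queue. This is only a sketch: `IntakeQueue`, the tier names, and the "newest head SHA wins" rule are assumptions, and a production setup would rely on the features of SQS or BullMQ instead.

```python
import heapq
import itertools
from typing import Optional, Tuple

PRIORITY = {"security-critical": 0, "core": 1, "experimental": 2}  # lower runs sooner


class IntakeQueue:
    def __init__(self) -> None:
        self._heap = []                    # entries: (priority, arrival order, pr_id)
        self._latest_sha = {}              # pr_id -> newest head SHA seen so far
        self._counter = itertools.count()  # FIFO tie-break within a priority tier

    def push(self, pr_id: str, head_sha: str, tier: str) -> None:
        first_time = pr_id not in self._latest_sha
        self._latest_sha[pr_id] = head_sha  # a later update replaces the pending one
        if first_time:
            heapq.heappush(self._heap, (PRIORITY[tier], next(self._counter), pr_id))

    def pop(self) -> Optional[Tuple[str, str]]:
        """Return (pr_id, newest head SHA) for the most urgent PR, or None if empty."""
        if not self._heap:
            return None
        _, _, pr_id = heapq.heappop(self._heap)
        return pr_id, self._latest_sha.pop(pr_id)
```

Three pushes in a row for the same PR produce one review of its latest commit, and a payment-repo PR that arrives later is still dequeued before an experimental one.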
3. Context Builder — prepares full context for the LLM:
- PR diff — unified diff with 3 lines of context on each side of every change.
- PR metadata — title, description, linked issues, author history.
- Full contents of touched files (for small files).
- Symbol dependencies — where modified functions are called (via LSP/tree-sitter).
- Related files — tests for changed code, relevant configs.
- Codebase conventions — coding standards, style guides, prior review patterns.
- Past similar PRs — from memory/RAG.
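These sources rarely all fit in one prompt, so the builder needs some rule for what to drop. One possible rule is greedy packing by priority under a token budget. In the sketch below, `ContextPart`, the priority numbers, and the 4-characters-per-token estimate are all illustrative assumptions; a real tokenizer should replace the estimate.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ContextPart:
    name: str
    text: str
    priority: int  # lower = more important (assumed ordering)


def estimate_tokens(text: str) -> int:
    # Crude assumption: about 4 characters per token.
    return max(1, len(text) // 4)


def build_context(parts: List[ContextPart], budget_tokens: int) -> List[ContextPart]:
    """Keep parts in priority order, skipping any part that would exceed the budget."""
    chosen, used = [], 0
    for part in sorted(parts, key=lambda p: p.priority):
        cost = estimate_tokens(part.text)
        if used + cost <= budget_tokens:
            chosen.append(part)
            used += cost
    return chosen
```

With the diff at priority 0 and full file contents at a lower priority, a large file is dropped before the diff or the PR metadata is.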
4. Code indexer — pre-built codebase index:
- Symbol graph (function/class/import) via tree-sitter, ctags, or language servers.
- Embeddings of functions/files → similar-code retrieval.
- Incrementally updated per commit.
- Use Sourcegraph, Aider repo map, or build your own.
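To make the symbol graph concrete, here is a toy caller index. It uses Python's standard `ast` module rather than tree-sitter or a language server, so it only handles Python sources and resolves calls by bare name; the function name and the `file:function` key format are assumptions for illustration.

```python
import ast
from collections import defaultdict
from typing import Dict, Set


def build_caller_index(sources: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each called name to the set of 'file:function' sites that call it."""
    callers: Dict[str, Set[str]] = defaultdict(set)
    for path, code in sources.items():
        for func in ast.walk(ast.parse(code)):
            if not isinstance(func, (ast.FunctionDef, ast.AsyncFunctionDef)):
                continue
            for node in ast.walk(func):
                if isinstance(node, ast.Call):
                    target = node.func
                    # Covers foo() (ast.Name) and obj.foo() (ast.Attribute).
                    name = getattr(target, "id", None) or getattr(target, "attr", None)
                    if name:
                        callers[name].add(f"{path}:{func.name}")
    return callers
```

Given a diff that modifies `charge`, looking up `callers["charge"]` yields the call sites whose code the Context Builder may want to include. Name-only resolution will over-match when two modules define the same function name, which is the gap a real language server closes.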
5. Parallel review agents — each specialized:
- Security agent — SQL injection, XSS, secret leaks, auth bypass. Deep security + OWASP system prompt.
- Bug agent — null references, off-by-ones, race conditions, resource leaks. Logic-focused.
- Performance agent — N+1 queries, inefficient algorithms, memory leaks.
- Style agent — convention violations, naming, documentation. Rule-based linter first, LLM for nuance.
- Test coverage agent — checks that changes come with tests and that edge cases are covered.
- Architecture agent — separation of concerns, SOLID, dependency violations.
- Documentation agent — missing docstrings, changelogs.
Agents run in parallel; their outputs are then merged.
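The fan-out and merge step might look like the following `asyncio` sketch. The `Finding` type, the per-agent timeout, and the merge rule (one finding per file and line, keeping the most severe) are assumptions; each agent here is any async callable that takes the diff and would, in practice, call the LLM gateway.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Dict, List

SEVERITY_RANK = {"must-fix": 0, "suggestion": 1, "nit": 2}


@dataclass(frozen=True)
class Finding:
    agent: str
    path: str
    line: int
    severity: str
    message: str


Agent = Callable[[str], Awaitable[List[Finding]]]


async def review(diff: str, agents: Dict[str, Agent], timeout_s: float = 60.0) -> List[Finding]:
    async def guarded(agent: Agent) -> List[Finding]:
        try:
            return await asyncio.wait_for(agent(diff), timeout_s)
        except Exception:
            return []  # a failed or slow agent should not sink the whole review

    batches = await asyncio.gather(*(guarded(a) for a in agents.values()))
    merged = {}
    for finding in (f for batch in batches for f in batch):
        key = (finding.path, finding.line)
        best = merged.get(key)
        if best is None or SEVERITY_RANK[finding.severity] < SEVERITY_RANK[best.severity]:
            merged[key] = finding
    return sorted(merged.values(), key=lambda f: (SEVERITY_RANK[f.severity], f.path, f.line))
```

Keeping only one finding per line is a deliberate noise cap in this sketch; a gentler merge could keep distinct issues on the same line and only collapse near-duplicates.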
6. LLM strategy:
- Small PR (< 200 LOC): single-pass GPT-4o-mini.
- Medium (200–1000 LOC): Claude 3.5 Sonnet with full context.
- Large (> 1000 LOC): chunk by file, review each, aggregate.
- Critical repos (payments, security): always the strongest reasoning model available (e.g. o1, or Claude 3.7 Sonnet with extended thinking).
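The tiers above reduce to a small routing function. In this sketch the model strings are plain labels rather than exact provider model IDs, and the choice of model for the large tier is an assumption, since the tiers only say that large PRs are chunked by file.

```python
from typing import Dict, Union

Plan = Dict[str, Union[str, bool]]


def choose_review_plan(loc_changed: int, critical_repo: bool) -> Plan:
    """Pick a model label and chunking mode from PR size and repo criticality."""
    if critical_repo:
        # Critical repos always get the strongest model, whatever the PR size.
        return {"model": "strongest-available", "chunk_by_file": loc_changed > 1000}
    if loc_changed < 200:
        return {"model": "gpt-4o-mini", "chunk_by_file": False}
    if loc_changed <= 1000:
        return {"model": "claude-3.5-sonnet", "chunk_by_file": False}
    # Assumption: large PRs reuse the medium-tier model, one call per file.
    return {"model": "claude-3.5-sonnet", "chunk_by_file": True}
```

Keeping the routing in one pure function makes the cost policy easy to unit-test and to change when model pricing shifts.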
7. Memory/RAG layer:
- Past review patterns — store how human reviewers ruled on similar issues (accepted or rejected) and retrieve those rulings later.
- Repo-specific conventions — auto-learned from the codebase.
- Bugs the team commonly hits (from incident history, bug tracker).
8. Report composer — formats output for GitHub/GitLab:
- Inline comments on specific problematic lines.
- PR summary aggregating top concerns.
- Severity labels — 🔴 must-fix, 🟡 suggestion, 🟢 nit.
- Confidence — "90% sure this is a bug" vs "Consider whether...".
- Citations — links to similar past PRs, docs.
- Auto-suggest fixes when confident (PR suggestion blocks).
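A sketch of how one inline comment could be rendered from these rules. The 0.8 and 0.9 thresholds, the wording, and the function name are assumptions. The suggestion block itself is a real GitHub feature: a fenced block tagged `suggestion` inside a review comment is rendered as a change the author can apply with one click.

```python
from typing import Optional

FENCE = "`" * 3  # built at runtime so this example does not nest code fences
SEVERITY_ICON = {"must-fix": "🔴", "suggestion": "🟡", "nit": "🟢"}


def format_comment(
    severity: str,
    message: str,
    confidence: float,
    fix: Optional[str] = None,
    min_fix_confidence: float = 0.9,
) -> str:
    icon = SEVERITY_ICON[severity]
    if confidence >= 0.8:
        body = f"{icon} **{severity}** ({confidence:.0%} confident): {message}"
    else:
        body = f"{icon} **{severity}** (low confidence, please double-check): {message}"
    # Only attach an auto-applicable fix when the model is confident enough.
    if fix is not None and confidence >= min_fix_confidence:
        body += f"\n\n{FENCE}suggestion\n{fix}\n{FENCE}"
    return body
```

Gating the suggestion block separately from the comment lets a medium-confidence finding still be raised without offering a one-click change that might be wrong.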
9. Feedback loop — critical for quality:
- Track: human reviewers' 👍/👎 on AI comments; whether authors dismiss or apply suggestions.
- Aggregate: false-positive rate, useful-to-noise ratio.
- Fine-tune / adjust prompts from feedback.
- Blacklist comment types with high false positives.
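The aggregation and suppression steps can be sketched in a few lines. The event shape (a comment type paired with a useful or not-useful verdict), the 50% threshold, and the minimum sample size are assumptions chosen for illustration.

```python
from collections import Counter
from typing import Iterable, Set, Tuple


def noisy_comment_types(
    events: Iterable[Tuple[str, bool]],
    max_fp_rate: float = 0.5,
    min_samples: int = 20,
) -> Set[str]:
    """Return comment types whose false-positive rate exceeds max_fp_rate.

    Each event is (comment_type, was_useful), derived from reviewer reactions
    and from whether the author applied or dismissed the comment.
    """
    total, rejected = Counter(), Counter()
    for comment_type, was_useful in events:
        total[comment_type] += 1
        if not was_useful:
            rejected[comment_type] += 1
    return {
        t for t, n in total.items()
        if n >= min_samples and rejected[t] / n > max_fp_rate
    }
```

The minimum sample size keeps a rarely seen comment type from being suppressed after only a handful of dismissals.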
10. Privacy & security:
- Code never leaves your infrastructure: self-host LLMs for sensitive repos (Llama 3.3 70B, Qwen 2.5 Coder).
- Enterprise agreements with providers (zero data retention, ZDR, with Anthropic or OpenAI).
- Secret scan before sending prompts (strip API keys, password-like patterns).
- Audit log every LLM call.
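A pre-send secret scan can start as a handful of regular expressions. The sketch below is illustrative only: the pattern list is far from complete, the helper name is hypothetical, and a dedicated secret scanner would be the safer choice in production.

```python
import re
from typing import Tuple

# Well-known token shapes: AWS access key IDs and classic GitHub personal access tokens.
TOKEN_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),
    re.compile(r"ghp_[A-Za-z0-9]{36}"),
]
# key = value style assignments whose key name ends in a sensitive word.
ASSIGNMENT = re.compile(r"(?i)(password|passwd|secret|api[_-]?key)\b(\s*[:=]\s*)(\S+)")


def redact(text: str) -> Tuple[str, int]:
    """Return (redacted text, number of substitutions made)."""
    count = 0
    for pattern in TOKEN_PATTERNS:
        text, n = pattern.subn("[REDACTED]", text)
        count += n
    # Keep the key name and separator so the LLM still sees the code's structure.
    text, n = ASSIGNMENT.subn(r"\1\2[REDACTED]", text)
    return text, count + n
```

Logging the substitution count alongside each LLM call in the audit log gives a cheap signal for repos that keep committing credentials.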
Scale considerations:
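- Throughput: 700 PRs/day × 7 parallel agents ≈ 4,900 LLM calls/day, before any per-file chunking of large PRs. With a 70% prompt-cache hit rate, about 1,500 calls pay full input-token price; cached calls are cheaper and faster but still count toward provider rate limits. Feasible, especially across multiple providers.
- Latency: target p95 < 5 minutes. Parallelize agents, chunk large PRs, pipeline the steps.
- Cost: roughly $0.50–3 per PR. 700 PRs × $2 = $1,400/day. Against $50–200 of engineer time per human-reviewed PR, the ROI is clear.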
- Storage: PR context, review history, metrics. Postgres + S3.
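The back-of-the-envelope numbers are easy to recompute as assumptions change. In this sketch the agent count is a parameter, because the component list names seven agents; the function and key names are hypothetical, and the call count ignores the extra calls that per-file chunking of large PRs would add.

```python
from typing import Dict


def daily_load(
    prs_per_day: int,
    agents: int,
    cache_hit_rate: float,
    cost_per_pr_usd: float,
) -> Dict[str, float]:
    """Rough daily totals for LLM calls and spend."""
    calls = prs_per_day * agents
    return {
        "llm_calls": calls,
        "uncached_calls": round(calls * (1 - cache_hit_rate)),
        "avg_calls_per_minute": calls / (24 * 60),
        "cost_usd": prs_per_day * cost_per_pr_usd,
    }
```

Calling `daily_load(700, 7, 0.7, 2.0)` reproduces the daily cost estimate and shows that the average rate is only a few calls per minute. PRs cluster in working hours, so peak load will be noticeably higher than that average.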
Rollout strategy:
1. Shadow mode — the AI reviews without posting; its output is compared with human reviews for 2 weeks.
2. Opt-in beta — some teams try it.
3. Default on, easy opt-out — let developers disable if unwanted.
4. Gradual trust — initially suggest only; once accuracy is proven → auto-request-changes for critical issues.
Anti-patterns:
- Review all PRs with the same model/depth → waste.
- Too many comments → noise, devs ignore.
- No feedback loop → quality doesn't improve.
- Block merges on AI comments → frustrated devs.
- Ignore codebase context → generic comments.
Real-world examples:
- GitHub Copilot code review, CodeRabbit, Qodo (formerly Codium) PR-Agent, Sweep AI, Greptile, and Vercel Agent are products in this space built around similar patterns.