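A production RAG pipeline has two major phases: indexing, which builds the knowledge base offline, and querying, which retrieves and generates online.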
A. Indexing (offline):
1. Data loading — read documents (PDF, HTML, Markdown, DB, API).
2. Parsing/cleaning — extract text, handle tables/images, strip boilerplate.
3. Chunking — split into chunks: fixed-size, recursive, semantic, or parent-child.
4. Embedding — convert each chunk into a vector using an embedding model (text-embedding-3-small, BGE, E5...).
5. Storage — store vectors + metadata in a vector DB (Pinecone, Qdrant, Weaviate, pgvector), plus an optional BM25 index for hybrid search.
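The chunking, embedding, and storage steps above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: it uses only fixed-size chunking over whitespace-separated words, and the names `Chunk`, `chunk_words`, and `index_document` are hypothetical. The `embed` callable and the list-based `store` are stand-ins for a real embedding model and vector DB.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    doc_id: str
    index: int
    text: str
    metadata: dict = field(default_factory=dict)


def chunk_words(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Fixed-size chunking over whitespace-separated words, with overlap."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks


def index_document(doc_id, text, embed, store, size=200, overlap=40):
    """Chunk one cleaned document, embed each chunk, append records to store.

    `embed` is an assumed callable (str -> vector); `store` is a plain list
    standing in for a vector DB collection.
    """
    for i, piece in enumerate(chunk_words(text, size, overlap)):
        chunk = Chunk(doc_id, i, piece, {"source": doc_id})
        store.append((embed(piece), chunk))
    return store
```

The overlap keeps a sentence that straddles a boundary visible in both neighbouring chunks. Recursive, semantic, or parent-child chunking would replace `chunk_words` while leaving the rest of the loop unchanged.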
B. Querying (online):
1. Query preprocessing — query rewriting, expansion, HyDE, decomposition for complex questions.
2. Retrieval — embed query → top-K vector search + optional BM25 → hybrid fusion (RRF).
3. Re-ranking — a cross-encoder (Cohere Rerank, BGE reranker) rescores the top-K candidates → keep the top-N highest-scoring.
4. Context assembly — order chunks, attach metadata/source, trim to context budget.
5. Generation — prompt the LLM with context + question, require citations.
6. Post-processing — citation extraction, faithfulness check, guardrails.
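Two of the querying steps above are simple enough to sketch without any external service: hybrid fusion with Reciprocal Rank Fusion (RRF) and context assembly under a budget. The function names `rrf_fuse` and `assemble_context` are hypothetical. The constant `k=60` is a commonly used default rather than a requirement, and the word-count token estimate is an assumption that a real tokenizer would replace.

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over rankings of 1 / (k + rank).

    `rankings` is a list of ranked doc-id lists (e.g. one from vector search,
    one from BM25). Returns doc ids sorted by fused score, best first.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Ties are broken by doc id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


def assemble_context(ranked_chunks, budget, count_tokens=lambda t: len(t.split())):
    """Concatenate (source, text) chunks in rank order until the budget is hit.

    Each chunk is prefixed with a numeric marker and its source so the LLM
    can cite it. `count_tokens` defaults to a crude word count.
    """
    parts, used = [], 0
    for n, (source, text) in enumerate(ranked_chunks, start=1):
        block = f"[{n}] ({source}) {text}"
        cost = count_tokens(block)
        if used + cost > budget:
            break  # stop at the first chunk that would overflow the budget
        parts.append(block)
        used += cost
    return "\n".join(parts)
```

RRF only needs ranks, not raw scores, which is why it can combine cosine similarities and BM25 scores without normalizing them to a common scale.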
Observability: log the query, retrieved docs, scores, latency, and cost for debugging and evaluation (RAGAS, TruLens).
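As a rough sketch of what such a log record might contain, the wrapper below writes one JSON line per query. The name `traced_query` and the record fields are hypothetical, and `retrieve` and `generate` are assumed callables standing in for the real retrieval and LLM steps. Cost is left out because it depends on token counts reported by the LLM provider.

```python
import json
import time


def traced_query(query, retrieve, generate, log):
    """Run retrieve -> generate and append one JSON line describing the call.

    `retrieve(query)` is assumed to return a list of (doc_id, score) pairs;
    `generate(query, hits)` is assumed to return the answer string;
    `log` is any object with a write() method (file, StringIO, ...).
    """
    t0 = time.perf_counter()
    hits = retrieve(query)
    t1 = time.perf_counter()
    answer = generate(query, hits)
    t2 = time.perf_counter()
    record = {
        "query": query,
        "retrieved": [{"doc_id": d, "score": s} for d, s in hits],
        "retrieval_ms": round((t1 - t0) * 1000, 3),
        "generation_ms": round((t2 - t1) * 1000, 3),
        "answer_chars": len(answer),
    }
    log.write(json.dumps(record, ensure_ascii=False) + "\n")
    return answer
```

Records in this shape (query, retrieved contexts, answer) are also roughly the inputs that evaluation tools expect, so the same log can serve both debugging and offline evaluation.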