Multi-modal RAG: how do you handle PDFs with tables, images, and charts in a knowledge base?

Question

Luyện Phỏng Vấn IT · Accepted Answer

Real-world knowledge bases (policy docs, financial reports, product specs) are not text only: they contain tables, images, charts, and diagrams. Text-only RAG misses this information.

Strategies:

1. Layout-aware parsing before indexing
- Use OCR plus a layout model: Unstructured.io, LlamaParse, PyMuPDF + layout, Azure Document Intelligence, Mistral OCR, Amazon Textract.
- Extract: text blocks + tables (as markdown/HTML) + images (with captions) + hierarchy (heading levels).
- Output: a structured representation, not flat text.
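To make "structured representation" concrete, here is a minimal sketch that attaches the heading hierarchy to each parsed element. The input format (a list of dicts with `kind`, `content`, `page`, and `level` for headings) is a hypothetical stand-in for parser output, not the actual schema of any tool listed above.

```python
from dataclasses import dataclass


@dataclass
class Element:
    kind: str            # "text", "table", or "image"
    content: str         # plain text, markdown table, or image caption
    page: int
    heading_path: tuple  # headings above this element, outermost first


def attach_hierarchy(blocks):
    """Turn a flat, reading-order block list into elements that carry their heading path.

    `blocks` is an assumed format: dicts with "kind", "content", "page",
    and, for headings only, an integer "level" (1 = top level).
    """
    path = []
    elements = []
    for block in blocks:
        if block["kind"] == "heading":
            # Drop headings at the same or a deeper level, then push this one.
            del path[block["level"] - 1:]
            path.append(block["content"])
            continue
        elements.append(
            Element(block["kind"], block["content"], block["page"], tuple(path))
        )
    return elements
```

Each element keeps enough context (type, page, section path) to be chunked and cited later without re-reading the PDF.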

2. Table handling
- Serialize the table to markdown/HTML and embed it together with its context (heading, caption). Current models (GPT-4o, Claude 3.5) read markdown tables well.
- Table summary: an LLM generates a summary describing the table (columns, key insights) → embed the summary for retrieval; on a hit, send both the summary and the raw table into the context.
- Text-to-SQL for large tables: convert the user query to SQL, run it against the table data, and return the result along with an LLM-generated explanation.
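The first and third options can be sketched with the standard library alone. The table name, columns, and the SQL string below are illustrative assumptions; in a real system the SQL would come from an LLM and should be validated before execution.

```python
import sqlite3


def table_to_markdown(header, rows):
    """Serialize a table to a markdown string suitable for embedding or prompting."""
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(str(cell) for cell in row) + " |")
    return "\n".join(lines)


def embedding_text(heading, caption, markdown):
    """Prefix the table with its heading and caption so the embedding sees the context."""
    return f"{heading}\n{caption}\n\n{markdown}"


def load_table(conn, name, header, rows):
    """Load an extracted table into SQLite so large tables can be queried, not embedded."""
    cols = ", ".join(f'"{h}"' for h in header)
    marks = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{name}" ({cols})')
    conn.executemany(f'INSERT INTO "{name}" VALUES ({marks})', rows)


header = ["region", "revenue"]
rows = [("EU", 120), ("US", 200)]

conn = sqlite3.connect(":memory:")
load_table(conn, "revenue_by_region", header, rows)

# Stand-in for LLM-generated SQL answering "which region has the highest revenue?"
sql = "SELECT region FROM revenue_by_region ORDER BY revenue DESC LIMIT 1"
top_region = conn.execute(sql).fetchone()[0]
```

Small tables go through `table_to_markdown` and `embedding_text`; tables too large for the context window go through `load_table` and are queried instead.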

3. Image handling
- Caption generation: a VLM (GPT-4V, Claude 3.5 Sonnet, Gemini) generates a detailed caption describing the image → embed the caption text and index it.
- Multi-modal embedding: CLIP or SigLIP embeds both images and text into a shared space → cross-modal search.
- OCR text extraction if the image is a screenshot or a diagram containing text.
- Store a reference to the original image; when it is retrieved, send the image itself into the VLM context as well.
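The shared-space idea can be illustrated with a toy index. The three-dimensional vectors below are made-up stand-ins for what an encoder such as CLIP or SigLIP would produce, and the record fields (`modality`, `ref`, `vector`) are hypothetical; only the ranking logic is the point.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query_vec, index, k=2):
    """Rank all records, image or text, against one query vector."""
    ranked = sorted(index, key=lambda rec: cosine(query_vec, rec["vector"]), reverse=True)
    return ranked[:k]


# Toy vectors; real ones would come from the image and text encoders.
index = [
    {"modality": "image", "ref": "img/p3_fig1.png", "vector": [0.8, 0.2, 0.1]},
    {"modality": "text", "ref": "chunk-17", "vector": [0.1, 0.9, 0.0]},
    {"modality": "image", "ref": "img/p9_fig2.png", "vector": [0.0, 0.1, 0.9]},
]

query_vec = [0.9, 0.1, 0.0]  # assumed output of the text encoder for the user query
hits = search(query_vec, index)
```

Because each record keeps a `ref` to the original asset, an image hit can be loaded and passed to the VLM rather than relying on its caption alone.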

4. Chart/diagram
- A VLM analyzes the chart → generates a textual description (axes, trend, key values) → index it like a caption.
- Extract the data points when precision is required (some charts have a backing table).

5. Hierarchical chunking
- Preserve the document structure: section > subsection > paragraph > sentence.
- Retrieve the granular chunk, then generate with its parent context.
- Avoid cutting through the middle of a table or separating an image from its caption.
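A minimal sketch of "retrieve granular, generate with parent context", assuming chunks are stored in a dict keyed by id and each carries a `parent_id` (None at the root). The ids and contents are illustrative.

```python
def ancestors(chunk_id, chunks):
    """Return the chain of parent chunks, outermost first."""
    chain = []
    current = chunks[chunk_id]
    while current["parent_id"] is not None:
        current = chunks[current["parent_id"]]
        chain.append(current)
    return list(reversed(chain))


def expand_hit(chunk_id, chunks):
    """Build the generation context: section titles first, then the retrieved chunk."""
    parts = [c["content"] for c in ancestors(chunk_id, chunks)]
    parts.append(chunks[chunk_id]["content"])
    return "\n".join(parts)


chunks = {
    "sec-2": {"content": "2. Refund policy", "parent_id": None},
    "sub-2.1": {"content": "2.1 Eligibility", "parent_id": "sec-2"},
    "para-7": {"content": "Refunds are available within 30 days.", "parent_id": "sub-2.1"},
}

context = expand_hit("para-7", chunks)
```

Only the small paragraph chunk is embedded and matched; the section and subsection headings are added back at generation time so the model sees where the paragraph sits.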

End-to-end pipeline:

PDF → Parser (LlamaParse/Unstructured) → structured elements:
  - Text blocks
  - Tables (markdown)
  - Images (with auto-caption from VLM)
  - Charts (with description)

→ Chunk with metadata:
  {content, type: text|table|image|chart, page, section, 
   image_url (if any), parent_id}

→ Embed text representation (text/caption/description)
→ Index in vector DB with metadata

Query:
  1. Embed query → vector search
  2. Retrieve top-K chunks
  3. Assemble context: text inline, table as markdown, image as URL/base64
  4. Send to VLM (Claude 3.5, GPT-4o) as multi-modal input
  5. Generate answer with citations
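Step 3 of the query flow (assembling context) might look like the sketch below. The chunk fields follow the metadata shape in the pipeline above; the output "parts" format is a neutral, hypothetical structure that would still need mapping onto whichever VLM API is used.

```python
def assemble_context(chunks):
    """Turn retrieved chunks into an ordered list of text and image parts.

    Each part is numbered so the model can cite it as [1], [2], ...
    Image and chart chunks contribute their caption as text plus the image itself.
    """
    parts = []
    for i, chunk in enumerate(chunks, start=1):
        cite = f"[{i}] (page {chunk['page']}, {chunk['section']})"
        if chunk["type"] in ("image", "chart") and chunk.get("image_url"):
            parts.append({"kind": "text", "text": f"{cite} {chunk['content']}"})
            parts.append({"kind": "image", "url": chunk["image_url"]})
        else:
            # Text and markdown tables go inline as-is.
            parts.append({"kind": "text", "text": f"{cite}\n{chunk['content']}"})
    return parts


retrieved = [
    {"type": "table", "content": "| q | revenue |\n| --- | --- |\n| Q1 | 10 |",
     "page": 4, "section": "Financials"},
    {"type": "chart", "content": "Line chart: revenue rising from Q1 to Q4.",
     "page": 5, "section": "Financials", "image_url": "img/p5_chart.png"},
]

parts = assemble_context(retrieved)
```

Keeping the citation label next to each part makes it straightforward to ask the model for answers that reference `[1]`, `[2]`, and so on.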

Frameworks/tools:
- LlamaIndex MultiModal: built-in multi-modal RAG.
- LangChain multi-vector retriever: parent document + summary pattern.
- ColPali / ColQwen2: vision-based retrieval directly on PDF page images (skips OCR).
- Vertex AI Grounding (Google): managed service.

Common failure modes:
- OCR misses rotated text and low-quality scans → test with real samples.
- Tables spanning multiple pages → join the fragments before parsing.
- Chart has no underlying data → the VLM description is not precise enough.
- Images that are too large → wasted tokens in the VLM context.
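For the multi-page table case, one possible heuristic is to merge fragments that share an identical header and sit on consecutive pages. This is an assumption that fails when the continuation page does not repeat the header; the fragment format (`page`, `header`, `rows`) is hypothetical.

```python
def merge_split_tables(fragments):
    """Merge table fragments that continue across consecutive pages.

    `fragments` must be in reading order. Two fragments are joined when the
    header matches exactly and the second starts on the page right after
    the first one ends.
    """
    merged = []
    for frag in fragments:
        prev = merged[-1] if merged else None
        if (
            prev is not None
            and frag["header"] == prev["header"]
            and frag["page"] == prev["last_page"] + 1
        ):
            prev["rows"].extend(frag["rows"])
            prev["last_page"] = frag["page"]
        else:
            merged.append({
                "header": frag["header"],
                "rows": list(frag["rows"]),
                "first_page": frag["page"],
                "last_page": frag["page"],
            })
    return merged


fragments = [
    {"page": 3, "header": ["item", "cost"], "rows": [["A", 1], ["B", 2]]},
    {"page": 4, "header": ["item", "cost"], "rows": [["C", 3]]},
    {"page": 7, "header": ["item", "cost"], "rows": [["D", 4]]},
]

tables = merge_split_tables(fragments)
```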

Cost consideration: multi-modal RAG costs roughly 3-10x more than text-only because image inputs to a VLM consume many tokens. Consider caching captions and retrieving images only when they are actually needed.
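Caption caching can be sketched as a lookup keyed by a hash of the image bytes, so the same figure appearing in several documents or re-index runs is captioned once. `caption_fn` is a hypothetical callable wrapping whatever VLM is used; the in-memory dict stands in for a persistent store.

```python
import hashlib


class CaptionCache:
    """Cache VLM captions by SHA-256 of the image bytes."""

    def __init__(self, caption_fn):
        self._caption_fn = caption_fn  # assumed: bytes -> caption string
        self._store = {}
        self.misses = 0                # number of actual VLM calls made

    def get(self, image_bytes):
        key = hashlib.sha256(image_bytes).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._caption_fn(image_bytes)
        return self._store[key]


def fake_caption(image_bytes):
    """Stand-in for a real VLM call, used only to exercise the cache."""
    return f"image of {len(image_bytes)} bytes"


cache = CaptionCache(fake_caption)
first = cache.get(b"\x89PNG-example")
second = cache.get(b"\x89PNG-example")  # served from the cache, no second call
```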