Real-world knowledge bases (policy docs, financial reports, product specs) aren't just text — they contain tables, images, charts, diagrams. Text-only RAG misses this information.
Strategies:
1. Layout-aware parsing before indexing
- Use OCR + layout models: Unstructured.io, LlamaParse, PyMuPDF + layout, Azure Document Intelligence, Mistral OCR, Amazon Textract.
- Extract: text blocks + tables (markdown/HTML) + images (with captions) + hierarchy (heading levels).
- Output: structured representation, not flat text.
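As a rough illustration of what "structured representation" could mean, the sketch below annotates a flat list of parsed elements with the chain of headings each one sits under. The `Element` class, its fields, and `attach_hierarchy` are hypothetical names chosen for this example; real parsers such as those listed above return their own element types.

```python
from dataclasses import dataclass, field
from typing import Literal

ElementType = Literal["heading", "text", "table", "image"]

@dataclass
class Element:
    """Hypothetical parsed element; not the schema of any specific parser."""
    type: ElementType
    content: str                 # text, markdown table, or image caption
    page: int
    level: int = 0               # heading level (1 = top); 0 for non-headings
    section_path: list = field(default_factory=list)

def attach_hierarchy(elements):
    """Annotate every element with the headings it sits under."""
    stack = []  # (level, title) pairs of currently open headings
    for el in elements:
        if el.type == "heading":
            # Close headings at the same or a deeper level first.
            while stack and stack[-1][0] >= el.level:
                stack.pop()
            stack.append((el.level, el.content))
        el.section_path = [title for _, title in stack]
    return elements

# Made-up sample document.
parsed = [
    Element("heading", "Annual Report", page=1, level=1),
    Element("heading", "Revenue", page=2, level=2),
    Element("table", "| Quarter | Revenue |\n|---|---|\n| Q1 | 10 |", page=2),
    Element("heading", "Risks", page=3, level=2),
    Element("text", "Currency exposure increased.", page=3),
]
for el in attach_hierarchy(parsed):
    print(el.type, el.page, " > ".join(el.section_path))
```

Keeping the section path on every element means a table or image can later be embedded together with its heading context instead of as an isolated fragment.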
2. Table handling
- Serialize tables as markdown/HTML and embed alongside context (heading, caption). Modern models (GPT-4o, Claude 3.5) read markdown tables well.
- Table summary — LLM generates a description (columns, key insights) → embed the summary for retrieval; on hit, include both summary + raw table in context.
- Text-to-SQL for large tables — convert user query to SQL, run on table data, return result alongside an LLM-generated explanation.
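The two ends of the table options above can be sketched with the standard library only: serializing a small table as markdown with its heading and caption, and loading the same rows into SQLite so that a generated query can run on them. The table data is made up, and the SQL string is hard-coded here as a stand-in for what an LLM would produce.

```python
import sqlite3

def to_markdown(headers, rows, heading="", caption=""):
    """Serialize a table as markdown, prefixed with heading and caption."""
    lines = []
    if heading:
        lines.append(f"Section: {heading}")
    if caption:
        lines.append(f"Caption: {caption}")
    lines.append("| " + " | ".join(headers) + " |")
    lines.append("| " + " | ".join("---" for _ in headers) + " |")
    for row in rows:
        lines.append("| " + " | ".join(str(c) for c in row) + " |")
    return "\n".join(lines)

def load_into_sqlite(conn, table_name, headers, rows):
    """Load the table into SQLite so SQL can be executed against it."""
    cols = ", ".join(f'"{h}"' for h in headers)
    marks = ", ".join("?" for _ in headers)
    conn.execute(f'CREATE TABLE "{table_name}" ({cols})')
    conn.executemany(f'INSERT INTO "{table_name}" VALUES ({marks})', rows)

# Illustrative data, not taken from any real report.
headers = ["quarter", "revenue_musd"]
rows = [("Q1", 10.5), ("Q2", 12.0), ("Q3", 11.2)]

markdown_chunk = to_markdown(headers, rows, heading="Revenue",
                             caption="Quarterly revenue")

conn = sqlite3.connect(":memory:")
load_into_sqlite(conn, "revenue", headers, rows)
# Assumption: in a real system this string would be generated by an LLM.
generated_sql = ("SELECT quarter, revenue_musd FROM revenue "
                 "ORDER BY revenue_musd DESC LIMIT 1")
best = conn.execute(generated_sql).fetchone()

print(markdown_chunk)
print(best)
```

Markdown serialization suits tables small enough to fit in the prompt; the SQL route keeps large tables out of the context and returns only the rows the question needs.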
3. Image handling
- Caption generation: a VLM (GPT-4V, Claude 3.5 Sonnet, Gemini) writes a detailed caption → embed and index the caption text.
- Multi-modal embedding: CLIP or SigLIP embeds image and text into a shared space → cross-modal search.
- OCR text extraction for screenshots or diagrams with text.
- Store a reference to the original image; when its caption is retrieved, pass the image itself into the VLM context.
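A minimal sketch of the "index a text representation, keep a reference to the original" idea follows. It assumes the vectors were already produced by some embedding model (a caption embedding or a shared-space image embedding); the three-dimensional vectors, file paths, and the `ImageRecord` class are invented for illustration.

```python
import math
from dataclasses import dataclass

@dataclass
class ImageRecord:
    image_path: str   # reference to the original image (hypothetical path)
    caption: str      # VLM-generated caption, hard-coded in this sketch
    vector: list      # precomputed embedding (toy values here)

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def search_images(query_vector, records, k=1):
    """Return the k records whose vectors are closest to the query."""
    ranked = sorted(records, key=lambda r: cosine(query_vector, r.vector),
                    reverse=True)
    return ranked[:k]

records = [
    ImageRecord("img/arch.png",
                "Architecture diagram of the ingestion service",
                [0.9, 0.1, 0.0]),
    ImageRecord("img/sales.png",
                "Bar chart of sales by region",
                [0.1, 0.9, 0.2]),
]

# Assumption: the query was embedded into the same space as the records.
query_vector = [0.2, 0.8, 0.1]
top = search_images(query_vector, records, k=1)[0]
print(top.image_path, "-", top.caption)
```

Because each record carries `image_path`, the generation step can load the original image for the VLM rather than relying on the caption alone.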
4. Charts/diagrams
- VLM analyzes the chart → generate a textual description (axes, trends, key values) → index like a caption.
- Extract data points where precision is required (some charts have backing tables).
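When a chart does have a backing table, a description with exact values can be built deterministically rather than asking a VLM to read numbers off the image. The sketch below is one possible way to do that; the function name, the wording of the description, and the sample points are assumptions made for this example.

```python
def describe_series(title, x_label, y_label, points):
    """Build an indexable description from the (x, y) points behind a chart."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    first, last = ys[0], ys[-1]
    if last > first:
        trend = "increases"
    elif last < first:
        trend = "decreases"
    else:
        trend = "is flat"
    peak_x = xs[ys.index(max(ys))]
    low_x = xs[ys.index(min(ys))]
    return (
        f"{title}. X axis: {x_label}; Y axis: {y_label}. "
        f"{y_label} {trend} from {first} ({xs[0]}) to {last} ({xs[-1]}). "
        f"Peak: {max(ys)} at {peak_x}; low: {min(ys)} at {low_x}."
    )

# Made-up data points standing in for a chart's backing table.
points = [("2021", 40), ("2022", 55), ("2023", 52), ("2024", 70)]
description = describe_series("Active users by year", "Year",
                              "Users (thousands)", points)
print(description)
```

The trend here only compares the first and last values, which is a deliberate simplification; a VLM-written description can still be stored alongside it for qualitative detail.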
5. Hierarchical chunking
- Preserve document structure: section > subsection > paragraph > sentence.
- Retrieve granular chunks, generate with parent context.
- Don't cut mid-table or separate image from caption.
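The "retrieve granular, generate with parent" step can be sketched as a lookup from each matched chunk to its parent, with duplicates removed when several hits share a parent. The `Node` class and the sample policy text are hypothetical; a single parent level is assumed for brevity.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    id: str
    text: str
    parent_id: Optional[str] = None

def expand_hits(hit_ids, nodes):
    """Swap each retrieved chunk for its parent's text, without repeats."""
    by_id = {n.id: n for n in nodes}
    seen, contexts = set(), []
    for hit_id in hit_ids:
        node = by_id[hit_id]
        # Fall back to the chunk itself when it has no parent.
        target = by_id[node.parent_id] if node.parent_id else node
        if target.id not in seen:
            seen.add(target.id)
            contexts.append(target.text)
    return contexts

# Made-up section with two paragraph-level children.
nodes = [
    Node("sec-3.1",
         "3.1 Eligibility\nRefunds are available within 30 days of purchase.\n"
         "Items must be unused."),
    Node("p1", "Refunds are available within 30 days of purchase.",
         parent_id="sec-3.1"),
    Node("p2", "Items must be unused.", parent_id="sec-3.1"),
]

# Both small chunks matched the query; the shared parent is sent only once.
contexts = expand_hits(["p1", "p2"], nodes)
print(len(contexts))
print(contexts[0])
```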
End-to-end pipeline:
PDF → Parser (LlamaParse/Unstructured) → structured elements:
- Text blocks
- Tables (markdown)
- Images (with auto-caption from VLM)
- Charts (with description)
→ Chunk with metadata:
{content, type: text|table|image|chart, page, section, image_url (if any), parent_id}
→ Embed text representation (text/caption/description)
→ Index in vector DB with metadata
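The chunk metadata listed in the pipeline maps naturally onto a small typed record. The sketch below mirrors those fields and builds a generic index record from them; `to_index_record`, `toy_embed`, the record layout, and the sample values are assumptions for illustration, not the API of any particular vector database.

```python
from dataclasses import asdict, dataclass
from typing import Callable, Literal, Optional

@dataclass
class Chunk:
    content: str   # text, markdown table, image caption, or chart description
    type: Literal["text", "table", "image", "chart"]
    page: int
    section: str
    image_url: Optional[str] = None
    parent_id: Optional[str] = None

def to_index_record(chunk_id: str, chunk: Chunk,
                    embed: Callable[[str], list]) -> dict:
    """Embed the text representation and keep all other fields as metadata."""
    if chunk.type in ("image", "chart") and not chunk.image_url:
        raise ValueError(f"{chunk.type} chunk {chunk_id} needs image_url")
    return {
        "id": chunk_id,
        "vector": embed(chunk.content),
        "metadata": asdict(chunk),
    }

def toy_embed(text: str) -> list:
    """Placeholder embedding; a real pipeline would call an embedding model."""
    return [float(len(text)), float(text.count(" "))]

record = to_index_record(
    "c-17",
    Chunk(
        content="Bar chart of sales by region; EMEA is highest.",
        type="chart",
        page=12,
        section="4.2 Regional sales",
        image_url="images/p12-fig3.png",  # hypothetical path
        parent_id="sec-4.2",
    ),
    toy_embed,
)
print(record["metadata"]["type"], record["metadata"]["image_url"])
```

Rejecting image and chart chunks that lack `image_url` at indexing time avoids discovering at query time that the original image cannot be attached.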
Query:
1. Embed query → vector search
2. Retrieve top-K chunks
3. Assemble context: text inline, table as markdown, image as URL/base64
4. Send to VLM (Claude 3.5, GPT-4o) with multi-modal input
5. Generate answer with citations
Frameworks / tools:
- LlamaIndex MultiModal — built-in multi-modal RAG.
- LangChain multi-vector retriever — parent document + summary pattern.
- ColPali / ColQwen2 — vision-based retrieval directly on PDF page images (skip OCR).
- Vertex AI Grounding (Google), a managed option.
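Whichever framework is used, the query-time assembly step described earlier (text inline, tables as markdown, images as base64, answer with citations) looks roughly like the sketch below. The `{"kind": ...}` part format is a neutral, made-up structure; each VLM provider defines its own message schema, so a real integration would map these parts onto that schema. `fake_load_image` stands in for reading the stored image.

```python
import base64

def assemble_context(question, chunks, load_image):
    """Turn retrieved chunks into an ordered list of text and image parts."""
    parts = []
    for i, ch in enumerate(chunks, start=1):
        cite = f"[{i}] (page {ch['page']}, {ch['section']})"
        parts.append({"kind": "text", "text": f"{cite}\n{ch['content']}"})
        if ch["type"] in ("image", "chart"):
            # Attach the original image right after its caption.
            raw = load_image(ch["image_url"])
            parts.append({
                "kind": "image",
                "base64": base64.b64encode(raw).decode("ascii"),
            })
    parts.append({
        "kind": "text",
        "text": f"Question: {question}\nCite sources as [n].",
    })
    return parts

def fake_load_image(url):
    """Stand-in for fetching image bytes from storage."""
    return b"fake-png-bytes"

# Made-up retrieval results.
chunks = [
    {"type": "table", "page": 2, "section": "Revenue",
     "content": "| Quarter | Revenue |\n|---|---|\n| Q1 | 10 |"},
    {"type": "chart", "page": 3, "section": "Growth",
     "content": "Line chart: revenue rises each quarter.",
     "image_url": "images/p3-fig1.png"},
]
parts = assemble_context("Which quarter had the highest revenue?",
                         chunks, fake_load_image)
print([p["kind"] for p in parts])
```

Numbering every chunk with its page and section gives the model something concrete to cite, which supports the final "answer with citations" step.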
Common failure modes:
- OCR misses rotated text, low-quality scans → test with real samples.
- Tables crossing pages → merge before parsing.
- Charts without underlying data → VLM descriptions may lack precision.
- Oversized images → bloated VLM token usage.
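For the multi-page table case, one simple heuristic is to merge fragments that sit on consecutive pages and repeat the same header row. The sketch below assumes the parser has already produced one fragment per page in the hypothetical `{"page", "header", "rows"}` shape; tables whose continuation pages omit the header would need a different rule.

```python
def merge_table_fragments(fragments):
    """Merge fragments with identical headers on consecutive pages."""
    merged = []
    for frag in sorted(fragments, key=lambda f: f["page"]):
        prev = merged[-1] if merged else None
        continues = (
            prev is not None
            and frag["header"] == prev["header"]
            and frag["page"] == prev["last_page"] + 1
        )
        if continues:
            prev["rows"].extend(frag["rows"])
            prev["last_page"] = frag["page"]
        else:
            merged.append({
                "header": list(frag["header"]),
                "rows": list(frag["rows"]),
                "first_page": frag["page"],
                "last_page": frag["page"],
            })
    return merged

# Made-up fragments: one table split over pages 4 and 5, another on page 9.
fragments = [
    {"page": 5, "header": ["SKU", "Price"], "rows": [["C3", 30]]},
    {"page": 4, "header": ["SKU", "Price"], "rows": [["A1", 10], ["B2", 20]]},
    {"page": 9, "header": ["Region", "Units"], "rows": [["EMEA", 7]]},
]
tables = merge_table_fragments(fragments)
for t in tables:
    print(t["first_page"], t["last_page"], len(t["rows"]))
```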
Cost consideration: multi-modal RAG can cost roughly 3–10x more than text-only RAG, because every image passed to the VLM consumes input tokens. To contain this, cache generated captions and send images to the VLM only when the query actually needs them.
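Caption caching can be as simple as keying the caption by a hash of the image bytes, so the same image is never captioned twice across re-indexing runs. The sketch below keeps the cache in memory; `CaptionCache` and `fake_vlm_caption` are hypothetical names, and a real setup would likely persist the store and call an actual VLM.

```python
import hashlib

class CaptionCache:
    """Cache captions by content hash so identical images are captioned once."""

    def __init__(self, caption_fn):
        self._caption_fn = caption_fn
        self._store = {}
        self.misses = 0  # number of times the captioning function was called

    def get(self, image_bytes):
        key = hashlib.sha256(image_bytes).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._caption_fn(image_bytes)
        return self._store[key]

def fake_vlm_caption(image_bytes):
    """Stand-in for an expensive VLM captioning call."""
    return f"caption for {len(image_bytes)}-byte image"

cache = CaptionCache(fake_vlm_caption)
first = cache.get(b"chart-bytes")
second = cache.get(b"chart-bytes")   # served from the cache
other = cache.get(b"other-image")
print(cache.misses)
```

Hashing the bytes rather than the file name means a renamed or re-uploaded copy of the same image still hits the cache.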