Document AI = extracting structured data (fields, tables, relations) from documents such as forms, invoices, contracts, and financial reports. It combines OCR, layout analysis, and LLMs.
Techniques by generation:
Gen 1 (pre-LLM) — rule-based OCR
- Tesseract + regex → only works with fixed templates.
- Brittle, fails on layout variation.
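To make the brittleness concrete, here is a minimal sketch of the Gen 1 approach. It assumes the OCR step has already produced plain text (in practice this would come from an engine such as Tesseract); the sample text, field names, and patterns are hypothetical and tied to one made-up template.

```python
import re

# Hypothetical OCR output for one fixed invoice template (not from the source).
OCR_TEXT = """ACME Corp
Invoice No: INV-2024-0042
Date: 2024-03-15
Total: 1,250.00
"""

# Each pattern is tied to the exact label this one template prints.
PATTERNS = {
    "invoice_number": re.compile(r"Invoice No:\s*(\S+)"),
    "issue_date": re.compile(r"Date:\s*(\d{4}-\d{2}-\d{2})"),
    "total": re.compile(r"Total:\s*([\d,]+\.\d{2})"),
}


def extract_with_regex(text: str) -> dict:
    """Return the first match per field, or None when the label is absent."""
    result = {}
    for field, pattern in PATTERNS.items():
        match = pattern.search(text)
        result[field] = match.group(1) if match else None
    return result


fields = extract_with_regex(OCR_TEXT)

# A vendor that prints "Invoice #" instead of "Invoice No:" silently yields None.
variant = extract_with_regex(OCR_TEXT.replace("Invoice No:", "Invoice #"))
```

The second call shows the failure mode: a small label change leaves `invoice_number` empty while the other fields still parse, so errors can go unnoticed unless missing fields are checked explicitly.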
Gen 2 — layout-aware transformers (2020–2023)
- LayoutLM v1/v2/v3 (Microsoft) — pre-trained on text plus bounding boxes; v2 and v3 also add image features during pre-training.
- Donut (NAVER) — OCR-free, encoder-decoder, image → structured text.
- Pix2Struct (Google) — image-to-text for documents.
- UDOP (universal), StructuralLM.
- Limits: per-task fine-tuning, needs annotated data.
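Layout-aware models of this generation take token text together with bounding boxes. As one illustration, the Hugging Face LayoutLM documentation expects boxes scaled to a 0 to 1000 range relative to the page size. The sketch below shows only that preparation step; the sample words, boxes, and page size are made up.

```python
def normalize_bbox(bbox, page_width, page_height):
    """Scale a pixel-space (x0, y0, x1, y1) box to integers in [0, 1000]."""
    x0, y0, x1, y1 = bbox
    scaled = [
        1000 * x0 / page_width,
        1000 * y0 / page_height,
        1000 * x1 / page_width,
        1000 * y1 / page_height,
    ]
    # Clamp so boxes that slightly overshoot the page stay in range.
    return [max(0, min(1000, int(v))) for v in scaled]


# Hypothetical OCR words with pixel boxes on a 1700 x 2200 pixel page.
ocr_words = [
    ("Invoice", (170, 220, 850, 440)),
    ("Total", (1360, 1980, 1700, 2200)),
]

# (word, normalized box) pairs, ready to be aligned with tokenizer output.
model_inputs = [(word, normalize_bbox(box, 1700, 2200)) for word, box in ocr_words]
```

Producing these word-level boxes and the matching labels for every document type is part of why this generation needs annotated data per task.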
Gen 3 — VLMs (2024–2025) — state-of-the-art
- GPT-4o, Claude 3.5 Sonnet, Gemini 2.0 Flash — read PDFs/images directly, extract to schemas with structured output.
- Advantages: zero-shot / few-shot, no fine-tuning, handle layout variation.
- Cost: image input consumes many tokens, but prices are dropping fast.
Modern real-world pipeline (2025):
       Input PDF/Image
              │
              ▼
┌───────────────────────────┐
│ 1. PRE-PROCESS            │
│    - PDF → page images    │
│    - Quality enhance      │
│    - Deskew, denoise      │
└───────────────────────────┘
              │
              ▼
┌───────────────────────────┐
│ 2. PARSE (2 options)      │
│                           │
│ A. Layout-aware (fast)    │
│    LlamaParse /           │
│    Unstructured.io /      │
│    Azure DocIntelligence  │
│    → text + table + bbox  │
│                           │
│ B. VLM direct (flexible)  │
│    GPT-4o / Claude 3.5    │
│    → structured output    │
│    JSON via Pydantic      │
└───────────────────────────┘
              │
              ▼
┌───────────────────────────┐
│ 3. VALIDATE               │
│    - Schema check (Zod/   │
│      Pydantic)            │
│    - Business rules       │
│    - Confidence score     │
└───────────────────────────┘
              │
              ▼
┌───────────────────────────┐
│ 4. HUMAN-IN-LOOP          │
│    - Low confidence →     │
│      route to reviewer    │
│    - Edit + approve       │
└───────────────────────────┘
              │
              ▼
Structured data ready for DB/API
Implementation with VLM + structured output:
import base64

from openai import OpenAI
from pydantic import BaseModel, Field

client = OpenAI()  # reads OPENAI_API_KEY from the environment


class LineItem(BaseModel):
    description: str
    quantity: int
    unit_price: float
    total: float


class Invoice(BaseModel):
    invoice_number: str
    issue_date: str
    due_date: str | None
    vendor_name: str
    vendor_address: str
    customer_name: str
    line_items: list[LineItem]
    subtotal: float
    tax: float
    total: float
    confidence_notes: str = Field(
        description="Any fields extracted with low confidence"
    )


def extract_invoice(pdf_page_image_path: str) -> Invoice:
    # Encode the page image (PNG assumed) as base64 for a data URL.
    with open(pdf_page_image_path, "rb") as f:
        img_b64 = base64.b64encode(f.read()).decode()

    # The SDK converts the Pydantic model into a JSON schema and parses the reply into it.
    response = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Extract invoice data. If any field is unclear or missing, note it in confidence_notes. Never fabricate values."},
            {"role": "user", "content": [
                {"type": "text", "text": "Extract all fields from this invoice:"},
                {"type": "image_url", "image_url": {
                    "url": f"data:image/png;base64,{img_b64}"
                }},
            ]},
        ],
        response_format=Invoice,
        temperature=0,
    )

    invoice = response.choices[0].message.parsed
    if invoice is None:
        # parsed is None when the model refuses instead of returning data.
        raise ValueError("Model returned no parsed invoice")
    return invoice
Common use cases:
- Invoice extraction — POs, line items, tax.
- Contract analysis — clauses, parties, dates, key terms.
- Resume parsing — work history, skills, education.
- Financial statements — balance sheets, income, cash flow.
- Medical records — diagnoses, medications, vitals.
- Forms processing — tax, insurance, KYC.
- Research papers — abstracts, figures, references.
Real-world challenges:
1. Scanned / low-quality images — low OCR confidence; preprocessing is critical.
2. Handwriting — generic models are weak; a specialized service (Google Cloud Vision, Azure) is needed.
3. Multi-page — tables spanning pages need merging.
4. Multiple languages, mixed scripts — ensure VLM supports them.
5. Checkboxes, signatures — VLMs may miss them.
6. Nested structures — tables within tables, footnotes.
7. Cross-doc consistency — same format but fields vary slightly (amount with/without currency).
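For the cross-doc consistency case (amounts with or without a currency), one option is to normalize every extracted amount into a number plus an optional currency code before validation. The following is a minimal sketch under stated assumptions: US-style formatting (comma thousands separator, dot decimal) and a small, hypothetical symbol table. European formats such as "1.234,50" are not handled.

```python
import re
from decimal import Decimal

# Hypothetical symbol table; extend for the currencies your documents use.
_SYMBOLS = {"$": "USD", "€": "EUR", "£": "GBP"}


def parse_amount(raw, default_currency=None):
    """Split an amount string into (Decimal value, currency code or None)."""
    text = raw.strip()
    currency = default_currency

    # Currency given as a symbol, e.g. "$1,234.50".
    for symbol, code in _SYMBOLS.items():
        if symbol in text:
            currency = code
            text = text.replace(symbol, "")

    # Currency given as a three-letter code, e.g. "1234.5 USD".
    code_match = re.search(r"\b([A-Z]{3})\b", text)
    if code_match:
        currency = code_match.group(1)
        text = text.replace(code_match.group(1), "")

    # Assumes commas are thousands separators; raises InvalidOperation on garbage.
    number = text.replace(",", "").strip()
    return Decimal(number), currency


with_symbol = parse_amount("$1,234.50")
with_code = parse_amount("1234.5 USD")
bare = parse_amount("980.00")
```

Amounts that come back with `None` as the currency are a natural signal to fall back to a document-level default or to flag the field for review.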
Accuracy optimizations:
- Few-shot with same-type example docs.
- Detailed schema — field descriptions, format hints ("date as YYYY-MM-DD").
- Logical validation (e.g. subtotal = sum of line-item totals), regenerating the extraction when a check fails.
- Multi-model ensemble — extract with 2 models, compare.
- Specialized models for specific doc types (Donut fine-tuned for invoices).
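Two of the optimizations above, logical validation with regeneration and a two-model comparison, can be sketched without any model calls. This is an illustration, not a definitive implementation: the dataclasses are a simplified stand-in for the invoice schema, `TOLERANCE`, `check_invoice`, `extract_with_retry`, and `disagreements` are hypothetical names, and the extractor is a fake that returns canned results.

```python
from dataclasses import dataclass

TOLERANCE = 0.01  # assumed rounding tolerance, in currency units


@dataclass
class LineItem:
    description: str
    quantity: int
    unit_price: float
    total: float


@dataclass
class Invoice:
    line_items: list
    subtotal: float
    tax: float
    total: float


def check_invoice(inv):
    """Return the list of violated business rules (empty means consistent)."""
    errors = []
    for i, item in enumerate(inv.line_items):
        if abs(item.quantity * item.unit_price - item.total) > TOLERANCE:
            errors.append(f"line_items[{i}].total != quantity * unit_price")
    if abs(sum(item.total for item in inv.line_items) - inv.subtotal) > TOLERANCE:
        errors.append("subtotal != sum(line_items.total)")
    if abs(inv.subtotal + inv.tax - inv.total) > TOLERANCE:
        errors.append("total != subtotal + tax")
    return errors


def extract_with_retry(extract, max_attempts=3):
    """Call extract(previous_errors) until the checks pass or attempts run out.

    Assumes max_attempts >= 1. Non-empty errors on return mean: route to a reviewer.
    """
    errors = []
    invoice = None
    for _ in range(max_attempts):
        invoice = extract(errors)  # previous errors can be fed back into the prompt
        errors = check_invoice(invoice)
        if not errors:
            break
    return invoice, errors


def disagreements(a, b):
    """Names of summary fields on which two extractions differ."""
    return [
        name for name in ("subtotal", "tax", "total")
        if abs(getattr(a, name) - getattr(b, name)) > TOLERANCE
    ]


# Fake extractor: first attempt has a wrong subtotal, second attempt is consistent.
bad = Invoice([LineItem("Widget", 2, 10.0, 20.0)], subtotal=25.0, tax=2.0, total=27.0)
good = Invoice([LineItem("Widget", 2, 10.0, 20.0)], subtotal=20.0, tax=2.0, total=22.0)
attempts = iter([bad, good])

invoice, errors = extract_with_retry(lambda previous_errors: next(attempts))
differing = disagreements(bad, good)
```

In a real pipeline the lambda would wrap a model call, and any fields listed by `disagreements` would be the ones worth sending to the human-in-the-loop step.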
Cost/latency:
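- VLMs are comparatively expensive: one PDF page is roughly 1,500 image tokens plus 500 text tokens, which works out to about $0.01–0.05 per document depending on model pricing and page count.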
- Layout-aware parser + smaller LLM reformulation is often cheaper at scale.
- Cache results for identical docs.
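The token arithmetic and the caching idea can be sketched as follows. The per-page token counts come from the estimate above; the price of 5 USD per million tokens is a placeholder, not a quoted rate, and a single blended price for input and output tokens is a simplification. `estimate_cost_usd` and `ExtractionCache` are hypothetical names.

```python
import hashlib


def estimate_cost_usd(pages, image_tokens_per_page=1500,
                      text_tokens_per_page=500, usd_per_million_tokens=5.0):
    """Rough cost estimate; the default price is a placeholder assumption."""
    tokens = pages * (image_tokens_per_page + text_tokens_per_page)
    return tokens * usd_per_million_tokens / 1_000_000


class ExtractionCache:
    """In-memory cache keyed by the SHA-256 of the raw document bytes."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def key(doc_bytes):
        return hashlib.sha256(doc_bytes).hexdigest()

    def get_or_extract(self, doc_bytes, extract):
        k = self.key(doc_bytes)
        if k not in self._store:
            # Only byte-identical documents hit the cache.
            self._store[k] = extract(doc_bytes)
        return self._store[k]


# Fake extractor that records how often it is actually invoked.
calls = []


def fake_extract(doc_bytes):
    calls.append(len(doc_bytes))
    return {"invoice_number": "INV-1"}


cache = ExtractionCache()
first = cache.get_or_extract(b"%PDF-1.7 sample", fake_extract)
second = cache.get_or_extract(b"%PDF-1.7 sample", fake_extract)
one_page_cost = estimate_cost_usd(1)
```

Hashing raw bytes only catches exact re-uploads; a re-scanned copy of the same invoice produces different bytes and would need a fuzzier key, which this sketch does not attempt.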
Managed tools/services:
- Azure AI Document Intelligence — prebuilt invoice/receipt/ID card + custom.
- Amazon Textract — extraction with tables/forms.
- Google Document AI — prebuilt parsers.
- LlamaParse — PDF → markdown preserving structure.
- Reducto, Nanonets, Affinda — commercial document AI.
- Mistral OCR (2025) — API-based OCR model positioned as state-of-the-art at launch; the model weights are not openly released.