Document AI extracts structured data (fields, tables, relations) from documents such as forms, invoices, contracts, and financial reports. It combines OCR, layout analysis, and LLMs.
Techniques by generation:
Gen 1 (pre-LLM): rule-based OCR
- Tesseract + regex: only works with fixed templates.
- Brittle; fails as soon as the layout varies.
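To make the brittleness concrete, here is a minimal sketch of the regex approach. The two OCR strings are invented for illustration, and the patterns are assumed to have been written against the first template only; in a real pipeline the text would come from an OCR engine such as Tesseract.

```python
import re
from typing import Optional

# Hypothetical OCR output for the same invoice data in two vendor layouts.
TEMPLATE_A = "Invoice No: INV-1042\nTotal: 1,250.00 USD"
TEMPLATE_B = "Invoice #\nINV-1042\nAmount due 1.250,00 EUR"

# Patterns tuned to template A only (labels and number format hard-coded).
INVOICE_NO = re.compile(r"Invoice No:\s*(\S+)")
TOTAL = re.compile(r"Total:\s*([\d,]+\.\d{2})")


def extract_fixed_template(ocr_text: str) -> dict[str, Optional[str]]:
    """Return each field the template-A patterns find, or None if absent."""
    number = INVOICE_NO.search(ocr_text)
    total = TOTAL.search(ocr_text)
    return {
        "invoice_number": number.group(1) if number else None,
        "total": total.group(1) if total else None,
    }


result_a = extract_fixed_template(TEMPLATE_A)  # both fields found
result_b = extract_fixed_template(TEMPLATE_B)  # both fields missing
```

The second layout carries the same information, but a changed label and number format is enough for every pattern to miss.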
Gen 2: layout-aware transformers (2020-2023)
- LayoutLM v1/v2/v3 (Microsoft): pre-trained on (text, bbox, image) triples.
- Donut (NAVER): OCR-free encoder-decoder that maps an input image directly to structured text.
- Pix2Struct (Google): image-to-text model for documents.
- UDOP (Universal Document Processing), StructuralLM.
- Limitation: each task needs its own fine-tuning and annotated data.
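One practical detail of the (text, bbox, image) input: LayoutLM-style models expect each word's bounding box on a 0-1000 scale relative to the page, not in raw pixels. The helper below is a small sketch of that normalization; the function name and the sample word boxes are made up for illustration.

```python
def normalize_bbox(bbox, page_width, page_height):
    """Scale a pixel box (x0, y0, x1, y1) to the 0-1000 grid used by LayoutLM."""
    x0, y0, x1, y1 = bbox
    return [
        int(1000 * x0 / page_width),
        int(1000 * y0 / page_height),
        int(1000 * x1 / page_width),
        int(1000 * y1 / page_height),
    ]


# Hypothetical OCR words with pixel boxes on a 1000 x 2000 px page image.
words = ["Invoice", "INV-1042"]
pixel_boxes = [(100, 200, 300, 250), (320, 200, 560, 250)]
model_boxes = [normalize_bbox(b, 1000, 2000) for b in pixel_boxes]
```

The resulting `words` and `model_boxes` lists are what a layout-aware tokenizer would typically consume alongside the page image.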
Gen 3: VLMs (2024-2025), the current state of the art
- GPT-4o, Claude 3.5 Sonnet, Gemini 2.0 Flash: read PDFs/images directly and extract according to a schema via structured output.
- Advantages: zero-shot / few-shot, no fine-tuning required, handles layout variation.
- Cost: image input consumes many tokens, but prices are dropping quickly.
A practical modern pipeline (2025):
Input PDF/Image
               │
               ▼
┌──────────────────────────────┐
│ 1. PRE-PROCESS               │
│    - PDF → page images       │
│    - Quality enhancement     │
│    - Deskew, denoise         │
└──────────────────────────────┘
               │
               ▼
┌──────────────────────────────┐
│ 2. PARSE (2 options)         │
│                              │
│    A. Layout-aware (fast)    │
│       LlamaParse /           │
│       Unstructured.io /      │
│       Azure DocIntelligence  │
│       → text + table + bbox  │
│                              │
│    B. VLM direct (flexible)  │
│       GPT-4o / Claude 3.5    │
│       → structured output    │
│         JSON per Pydantic    │
└──────────────────────────────┘
               │
               ▼
┌──────────────────────────────┐
│ 3. VALIDATE                  │
│    - Schema check (Zod/      │
│      Pydantic)               │
│    - Business rules          │
│    - Confidence score        │
└──────────────────────────────┘
               │
               ▼
┌──────────────────────────────┐
│ 4. HUMAN-IN-LOOP             │
│    - Low confidence →        │
│      route to reviewer       │
│    - Edit + approve          │
└──────────────────────────────┘
               │
               ▼
Structured data ready for DB/API
Implementation with VLM + structured output:
import base64

from openai import OpenAI
from pydantic import BaseModel, Field

client = OpenAI()  # reads OPENAI_API_KEY from the environment


class LineItem(BaseModel):
    description: str
    quantity: int
    unit_price: float
    total: float


class Invoice(BaseModel):
    invoice_number: str
    issue_date: str
    due_date: str | None  # required field, but the value may be null
    vendor_name: str
    vendor_address: str
    customer_name: str
    line_items: list[LineItem]
    subtotal: float
    tax: float
    total: float
    confidence_notes: str = Field(
        description="Any fields extracted with low confidence"
    )


def extract_invoice(pdf_page_image_path: str) -> Invoice:
    # Encode the page image as base64 for a data URL (the file is assumed to be PNG).
    with open(pdf_page_image_path, "rb") as f:
        img_b64 = base64.b64encode(f.read()).decode()
    # parse() sends the Pydantic model as the response schema and validates the reply.
    response = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Extract invoice data. If any field is unclear or missing, note it in confidence_notes. Never fabricate values."},
            {"role": "user", "content": [
                {"type": "text", "text": "Extract all fields from this invoice:"},
                {"type": "image_url", "image_url": {
                    "url": f"data:image/png;base64,{img_b64}"
                }},
            ]},
        ],
        response_format=Invoice,
        temperature=0,
    )
    invoice = response.choices[0].message.parsed
    if invoice is None:  # e.g. the model refused instead of answering
        raise ValueError("Model returned no parsed invoice")
    return invoice

Common use cases:
- Invoice extraction: PO, line items, tax.
- Contract analysis: clauses, parties, dates, key terms.
- Resume parsing: work history, skills, education.
- Financial statements: balance sheet, income statement, cash flow.
- Medical records: diagnoses, medications, vitals.
- Forms processing: tax, insurance, KYC.
- Research papers: abstract, figures, references.
Practical challenges:
1. Scanned / low-quality images: OCR confidence is low, so preprocessing is critical.
2. Handwriting: generic models are weak; a specialized service is needed (Google Cloud Vision, Azure).
3. Multi-page documents: tables span several pages and must be merged.
4. Multiple languages and mixed scripts: make sure the VLM supports them.
5. Checkboxes and signatures: VLMs can miss them.
6. Nested structure: tables inside tables, footnotes.
7. Consistency across documents: same format but slightly different fields (amounts with or without a currency).
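For challenge 7, one common mitigation is to normalize amounts after extraction so that downstream code always sees a number plus an optional currency. The sketch below is a heuristic under stated assumptions: `parse_amount` and the symbol table are hypothetical, and the rule "a comma followed by exactly two trailing digits is a decimal mark" will not fit every locale.

```python
import re
from decimal import Decimal
from typing import Optional

# Assumed mapping for illustration; extend as needed.
CURRENCY_SYMBOLS = {"$": "USD", "€": "EUR", "£": "GBP"}


def parse_amount(raw: str, default_currency: Optional[str] = None):
    """Split a raw amount string into (Decimal value, currency code or None)."""
    text = raw.strip()
    currency = default_currency
    code = re.search(r"\b([A-Z]{3})\b", text)  # e.g. "USD", "EUR"
    if code:
        currency = code.group(1)
    else:
        for symbol, iso in CURRENCY_SYMBOLS.items():
            if symbol in text:
                currency = iso
                break
    digits = re.sub(r"[^\d.,-]", "", text)  # drop letters, symbols, spaces
    if re.search(r",\d{2}$", digits):
        # European style "1.250,00": dots group thousands, comma is decimal.
        digits = digits.replace(".", "").replace(",", ".")
    else:
        # "1,250.00" or "1250": commas only group thousands.
        digits = digits.replace(",", "")
    return Decimal(digits), currency  # raises InvalidOperation if no digits


samples = ["1,250.00 USD", "$1,250.00", "1.250,00 EUR", "1250"]
parsed = [parse_amount(s) for s in samples]
```

All four sample strings end up with the same numeric value, and a missing currency stays `None` rather than being guessed.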
Accuracy optimization:
- Few-shot prompting with example documents of the same type.
- Detailed schema: field descriptions and format hints ("date as YYYY-MM-DD").
- Logical validation (subtotal = sum(line_items)), regenerating when a check fails.
- Multi-model ensemble: extract with two models and compare the results.
- Specialized models for specific document types (e.g. Donut fine-tuned on invoices).
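Cost/latency:
- VLMs are expensive: one PDF page is roughly 1,500 image tokens plus 500 text tokens, which works out to about $0.01-0.05 per document.
- A layout-aware parser followed by a smaller LLM that restructures the output is usually cheaper at high volume.
- Cache results for documents that repeat.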
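The token arithmetic and the caching idea can be sketched together. The per-page token counts are the rough figures quoted above; the price per million tokens is a placeholder assumption rather than any provider's actual rate, and `ExtractionCache` is a hypothetical in-memory helper.

```python
import hashlib
from typing import Callable


def estimate_cost_usd(
    pages: int,
    image_tokens_per_page: int = 1500,
    text_tokens_per_page: int = 500,
    usd_per_million_tokens: float = 2.50,  # placeholder price, check your provider
) -> float:
    """Rough cost: total tokens times the per-token price."""
    total_tokens = pages * (image_tokens_per_page + text_tokens_per_page)
    return total_tokens * usd_per_million_tokens / 1_000_000


class ExtractionCache:
    """In-memory cache keyed by the SHA-256 of the raw document bytes."""

    def __init__(self) -> None:
        self._store: dict[str, dict] = {}

    def get_or_extract(self, doc_bytes: bytes, extract: Callable[[bytes], dict]) -> dict:
        key = hashlib.sha256(doc_bytes).hexdigest()
        if key not in self._store:  # only call the model for unseen content
            self._store[key] = extract(doc_bytes)
        return self._store[key]
```

A content hash only catches byte-identical files; re-scans of the same paper document would need a fuzzier key, which this sketch does not attempt.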
Managed tools/services:
- Azure AI Document Intelligence: prebuilt models for invoices, receipts, and ID cards, plus custom models.
- Amazon Textract: text extraction with table and form support.
- Google Document AI: prebuilt parsers.
- LlamaParse: PDF → markdown, preserving structure.
- Reducto, Nanonets, Affinda: commercial document AI platforms.
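- Mistral OCR (2025): API-hosted OCR model with vendor-reported state-of-the-art results (served as an API, not released as open weights).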