Document understanding: how do you extract structured data from PDFs with complex layouts?

Document AI extracts structured data (fields, tables, relations) from documents such as forms, invoices, contracts, and financial reports. It combines OCR, layout analysis, and LLMs.

Techniques by generation:

Gen 1 (pre-LLM): rule-based OCR
- Tesseract + regex: only works with fixed templates.
- Brittle; fails as soon as the layout varies.
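
To make the brittleness concrete, here is a minimal sketch of the rule-based approach. It assumes the OCR step (e.g. Tesseract) has already produced plain text; the two sample texts, the vendor names, and the regex patterns are all hypothetical and only serve to show how rules written for one template miss a second one.

```python
import re
from typing import Optional

# Hypothetical OCR output for one vendor's fixed invoice template.
OCR_TEXT_TEMPLATE_A = """ACME Corp
Invoice No: INV-1042
Date: 2025-03-14
Total: 1,250.00 USD
"""

# Same kind of information, different labels and layout (a second vendor).
OCR_TEXT_TEMPLATE_B = """Globex Ltd
Invoice # 77-A   issued 14/03/2025
Amount due ........ USD 1,250.00
"""

# Patterns written against template A only.
PATTERNS = {
    "invoice_number": re.compile(r"Invoice No:\s*(\S+)"),
    "issue_date": re.compile(r"Date:\s*(\d{4}-\d{2}-\d{2})"),
    "total": re.compile(r"Total:\s*([\d,]+\.\d{2})"),
}

def extract_with_rules(ocr_text: str) -> dict[str, Optional[str]]:
    """Return each field's first regex match, or None when the pattern misses."""
    result: dict[str, Optional[str]] = {}
    for field_name, pattern in PATTERNS.items():
        match = pattern.search(ocr_text)
        result[field_name] = match.group(1) if match else None
    return result

print(extract_with_rules(OCR_TEXT_TEMPLATE_A))  # every field is found
print(extract_with_rules(OCR_TEXT_TEMPLATE_B))  # every field comes back None
```

Every new vendor layout needs its own set of patterns, which is the maintenance cost the later generations try to remove.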

Gen 2: layout-aware transformers (2020-2023)
- LayoutLM v1/v2/v3 (Microsoft): pre-trained on (text, bbox, image) triples.
- Donut (NAVER): OCR-free encoder-decoder, input image → output structured text.
- Pix2Struct (Google): image-to-text for documents.
- UDOP (universal), StructuralLM.
- Limitation: each task needs fine-tuning, which requires annotated data.
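
As an illustration of the (text, bbox) part of that input, the sketch below prepares OCR words for a LayoutLM-style model. LayoutLM expects each box as (x0, y0, x1, y1) scaled to a 0-1000 grid regardless of page size; the `OcrWord` type and the sample page dimensions are assumptions made for this example.

```python
from dataclasses import dataclass

@dataclass
class OcrWord:
    text: str
    box: tuple[int, int, int, int]  # (x0, y0, x1, y1) in page pixels

def normalize_box(box: tuple[int, int, int, int],
                  page_width: int, page_height: int) -> tuple[int, int, int, int]:
    """Scale a pixel box to the 0-1000 coordinate grid used by LayoutLM-style models."""
    x0, y0, x1, y1 = box
    return (
        1000 * x0 // page_width,
        1000 * y0 // page_height,
        1000 * x1 // page_width,
        1000 * y1 // page_height,
    )

# Hypothetical OCR result for a 1700 x 2200 pixel page.
words = [
    OcrWord("Invoice", (170, 220, 340, 264)),
    OcrWord("INV-1042", (360, 220, 560, 264)),
]
model_input = [(w.text, normalize_box(w.box, 1700, 2200)) for w in words]
print(model_input)
```

The normalized boxes are what lets the model learn that, say, a number in the top-right corner is likely an invoice number, independent of scan resolution.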

Gen 3: VLMs (2024-2025), the current state of the art
- GPT-4o, Claude 3.5 Sonnet, Gemini 2.0 Flash: read PDFs/images directly and extract to a schema with structured output.
- Advantages: zero-shot / few-shot, no fine-tuning needed, handles layout variation.
- Cost: image input consumes many tokens, but prices are dropping quickly.

A practical modern pipeline (2025):

Input PDF/Image
    │
    ▼
┌───────────────────────────┐
│ 1. PRE-PROCESS           │
│   - PDF → page images    │
│   - Quality enhance      │
│   - Deskew, denoise      │
└───────────────────────────┘
    │
    ▼
┌───────────────────────────┐
│ 2. PARSE (2 options)     │
│                          │
│ A. Layout-aware (fast)   │
│    LlamaParse /          │
│    Unstructured.io /     │
│    Azure DocIntelligence │
│    → text + table + bbox │
│                          │
│ B. VLM direct (flexible) │
│    GPT-4o / Claude 3.5   │
│    → structured output   │
│    JSON per Pydantic     │
└───────────────────────────┘
    │
    ▼
┌───────────────────────────┐
│ 3. VALIDATE              │
│   - Schema check (Zod/   │
│     Pydantic)            │
│   - Business rules       │
│   - Confidence score     │
└───────────────────────────┘
    │
    ▼
┌───────────────────────────┐
│ 4. HUMAN-IN-LOOP         │
│   - Low confidence →     │
│     route to reviewer    │
│   - Edit + approve       │
└───────────────────────────┘
    │
    ▼
Structured data ready for DB/API
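
Stages 3 and 4 of the pipeline can be sketched as a small routing function. This is only an illustration: the `ExtractedField` type, the 0.85 threshold, and the idea that every field arrives with a numeric confidence are assumptions. Layout-aware parsers often report such scores, while a VLM usually needs a proxy (for example agreement between two runs).

```python
from dataclasses import dataclass
from typing import Optional

REVIEW_THRESHOLD = 0.85  # assumed cut-off; tune it on labelled documents

@dataclass
class ExtractedField:
    name: str
    value: Optional[str]
    confidence: float  # 0.0 to 1.0, as reported by whatever parser is used

def route(fields: list[ExtractedField], required: list[str]) -> tuple[str, list[str]]:
    """Return ("auto_approve", []) or ("human_review", reasons)."""
    by_name = {f.name: f for f in fields}
    reasons = []
    for name in required:
        f = by_name.get(name)
        if f is None or f.value in (None, ""):
            reasons.append(f"{name}: missing")
        elif f.confidence < REVIEW_THRESHOLD:
            reasons.append(f"{name}: low confidence ({f.confidence:.2f})")
    return ("human_review" if reasons else "auto_approve", reasons)

fields = [
    ExtractedField("invoice_number", "INV-1042", 0.99),
    ExtractedField("total", "1250.00", 0.60),
]
print(route(fields, ["invoice_number", "total", "issue_date"]))
```

Returning the reasons alongside the decision lets the review UI highlight exactly which fields the reviewer should check.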

Implementation with a VLM + structured output:

python
from openai import OpenAI
from pydantic import BaseModel, Field
import base64

client = OpenAI()

class LineItem(BaseModel):
    description: str
    quantity: int
    unit_price: float
    total: float

class Invoice(BaseModel):
    invoice_number: str
    issue_date: str
    due_date: str | None
    vendor_name: str
    vendor_address: str
    customer_name: str
    line_items: list[LineItem]
    subtotal: float
    tax: float
    total: float
    confidence_notes: str = Field(
        description="Any fields extracted with low confidence"
    )

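def extract_invoice(pdf_page_image_path: str) -> Invoice:
    # Read the rendered page (assumed to be a PNG) and base64-encode it for a data URL.
    with open(pdf_page_image_path, "rb") as f:
        img_b64 = base64.b64encode(f.read()).decode()

    # parse() sends the Pydantic model as a JSON schema and validates the reply against it.
    response = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Extract invoice data. If any field is unclear or missing, note it in confidence_notes. Never fabricate values."},
            {"role": "user", "content": [
                {"type": "text", "text": "Extract all fields from this invoice:"},
                {"type": "image_url", "image_url": {
                    "url": f"data:image/png;base64,{img_b64}"
                }}
            ]}
        ],
        response_format=Invoice,
        temperature=0,
    )
    message = response.choices[0].message
    # parsed is None when the model refuses, so fail loudly instead of returning None.
    if message.parsed is None:
        raise ValueError(f"Extraction failed: {message.refusal}")
    return message.parsed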

Common use cases:
- Invoice extraction: PO, line items, tax.
- Contract analysis: clauses, parties, dates, key terms.
- Resume parsing: work history, skills, education.
- Financial statements: balance sheet, income, cash flow.
- Medical records: diagnosis, medication, vitals.
- Forms processing: tax, insurance, KYC.
- Research papers: abstract, figures, references.

Practical challenges:

1. Scanned / low-quality images: OCR confidence is low, so preprocessing is critical.
2. Handwriting: generic models are weak; use a specialized service (Google Cloud Vision, Azure).
3. Multi-page documents: tables span several pages and have to be merged.
4. Multiple languages and mixed scripts: confirm the VLM supports them.
5. Checkboxes and signatures: VLMs can miss them.
6. Nested structure: tables within tables, footnotes.
7. Consistency across documents: same format, but fields differ slightly (amounts with or without a currency).
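
For the last challenge, a post-processing step can normalize amounts before they reach the database. The sketch below is a minimal version under stated assumptions: the symbol table is a small hypothetical sample, and "," is treated as a thousands separator with "." as the decimal point, which is wrong for formats such as "1.250,00".

```python
import re
from decimal import Decimal
from typing import Optional

# Hypothetical, deliberately small symbol table.
CURRENCY_SYMBOLS = {"$": "USD", "€": "EUR", "£": "GBP", "₫": "VND"}

def parse_amount(raw: str, default_currency: Optional[str] = None) -> tuple[Decimal, Optional[str]]:
    """Split a raw amount string into (value, currency code or the default)."""
    text = raw.strip()
    currency = default_currency
    # A three-letter uppercase token is treated as an ISO currency code.
    code = re.search(r"\b([A-Z]{3})\b", text)
    if code:
        currency = code.group(1)
        text = text.replace(code.group(1), "")
    for symbol, iso in CURRENCY_SYMBOLS.items():
        if symbol in text:
            currency = iso
            text = text.replace(symbol, "")
    # Assumption: "," is a thousands separator and "." is the decimal point.
    digits = text.replace(",", "").strip()
    return Decimal(digits), currency  # raises decimal.InvalidOperation on junk input

for raw in ["1,250.00", "$1,250.00", "1250 USD", "USD 1,250.00"]:
    print(raw, "->", parse_amount(raw))
```

Using `Decimal` rather than `float` avoids rounding drift when the amounts are later summed and compared.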

Accuracy optimization:
- Few-shot prompting with example documents of the same type.
- Detailed schema: field descriptions and format hints ("date as YYYY-MM-DD").
- Logical validation (subtotal = sum of line-item totals), regenerating when it fails.
- Multi-model ensemble: extract with two models and compare the results.
- Specialized models for specific document types (e.g. Donut fine-tuned for invoices).
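
The validate-and-regenerate idea from the list above can be sketched as a retry loop. The invoice is a plain dict with the same field names as the Invoice model; the `extract` callable, the one-cent tolerance, and the three-attempt limit are assumptions for illustration. In practice `extract` would wrap the VLM call and add the feedback to the prompt.

```python
from decimal import Decimal
from typing import Callable

TOLERANCE = Decimal("0.01")  # assumed: allow one cent of rounding drift

def check_totals(invoice: dict) -> list[str]:
    """Return a list of arithmetic inconsistencies (empty when the invoice adds up)."""
    errors = []
    line_sum = sum((Decimal(str(item["total"])) for item in invoice["line_items"]),
                   Decimal("0"))
    subtotal = Decimal(str(invoice["subtotal"]))
    tax = Decimal(str(invoice["tax"]))
    total = Decimal(str(invoice["total"]))
    if abs(line_sum - subtotal) > TOLERANCE:
        errors.append(f"subtotal {subtotal} != sum of line items {line_sum}")
    if abs(subtotal + tax - total) > TOLERANCE:
        errors.append(f"total {total} != subtotal + tax {subtotal + tax}")
    return errors

def extract_with_retry(extract: Callable[[list[str]], dict], max_attempts: int = 3) -> dict:
    """Call extract(feedback) until the result passes check_totals or attempts run out."""
    feedback: list[str] = []
    for _ in range(max_attempts):
        invoice = extract(feedback)  # feedback is empty on the first attempt
        feedback = check_totals(invoice)
        if not feedback:
            return invoice
    raise ValueError(f"validation still failing after {max_attempts} attempts: {feedback}")
```

When the retries are exhausted, raising keeps a silently wrong invoice out of the database; a caller could catch the error and send the document to human review instead.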

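Cost/latency:
- VLMs are expensive: one PDF page is roughly 1,500 image tokens + 500 text tokens, or about $0.01-0.05 per document.
- A layout-aware parser followed by a smaller LLM that restructures the output is usually cheaper at high volume.
- Cache results so identical documents are not processed twice.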
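
The arithmetic and the caching idea can be sketched as follows. The per-million-token prices and the 500 output tokens are placeholders chosen for the example, not quotes from any provider, and the in-memory dict stands in for whatever store a real system would use.

```python
import hashlib
from typing import Callable

def estimate_cost_usd(input_tokens: int, output_tokens: int,
                      usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Token counts times per-million-token prices."""
    return (input_tokens * usd_per_m_input + output_tokens * usd_per_m_output) / 1_000_000

# One page: ~1500 image tokens + ~500 text tokens in, an assumed ~500 tokens out.
# The 2.50 / 10.00 USD prices are placeholders; substitute your provider's rates.
per_page = estimate_cost_usd(1500 + 500, 500, usd_per_m_input=2.50, usd_per_m_output=10.00)
print(f"${per_page:.4f} per page, ${per_page * 100_000:,.0f} per 100k pages")

_cache: dict[str, dict] = {}  # in-memory stand-in for Redis, a DB table, etc.

def cached_extract(doc_bytes: bytes, extract: Callable[[bytes], dict]) -> dict:
    """Run extract only for documents whose exact bytes have not been seen before."""
    key = hashlib.sha256(doc_bytes).hexdigest()
    if key not in _cache:
        _cache[key] = extract(doc_bytes)
    return _cache[key]
```

A content hash only catches byte-identical uploads; re-scans of the same paper document produce different bytes and would still be processed again.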

Managed tools/services:
- Azure AI Document Intelligence: prebuilt invoice/receipt/ID card models plus custom models.
- Amazon Textract: text extraction with table and form support.
- Google Document AI: prebuilt parsers.
- LlamaParse: PDF → markdown, preserving structure.
- Reducto, Nanonets, Affinda: commercial document AI.
- Mistral OCR (2025): OCR model offered through Mistral's API, reported as state of the art at release.
