Streaming responses in LLMs: how are they implemented, and what are the benefits?

Question

Luyện Phỏng Vấn IT · Accepted Answer

LLMs generate tokens one at a time (autoregressively). Streaming means sending each token or chunk to the client as soon as it is generated instead of waiting for the full completion, which gives a low time to first token (TTFT) and a smoother UX.

Benefits:
- Perceived latency: the user sees output almost immediately (< 1s), so a response with 10s of total latency still "feels" fast.
- Cancellable: a user who sees the answer going wrong can stop early, which saves cost.
- Effectively required for chat UIs, code editors, and voice assistants.

Server-side mechanism (OpenAI/Anthropic): the provider returns chunks over HTTP chunked transfer / SSE.

Transport protocols:
1. Server-Sent Events (SSE): the most common choice for LLM streaming. A long-lived HTTP/1.1 connection over which the server pushes data: {...} events. Generally works fine behind CDNs and proxies. The format OpenAI uses:

    data: {"choices":[...]}

    data: [DONE]

2. WebSocket: full-duplex. Use it when you need bidirectional traffic (voice, interrupting mid-stream, the user typing while the model is still talking). More complex than SSE, and usually needs sticky sessions.
3. HTTP/2 streams / gRPC streaming: rarer, used mainly between microservices. (HTTP/2 Server Push proper is a resource-preloading feature that major browsers have dropped, so it is not a practical token-streaming transport.)
4. HTTP chunked + JSONL: the simplest option. Transfer-Encoding: chunked, with one JSON line per chunk.

Implementation in Next.js 15 App Router: the frontend uses the Vercel AI SDK (useChat, useCompletion) to handle SSE parsing and UI state.

Production challenges:
1. Parsing structured output while streaming: JSON mode and tool calling arrive one fragment at a time, so you need a partial JSON parser (partial-json, streamText in the AI SDK).
2. Caching is hard: a stream cannot be cached as simply as a fixed response. One solution is to cache by metadata (input hash mapped to the full response) and replay the stored response on the next hit.
3. Mid-stream errors: part of the response has already been sent when the error occurs. You need an in-stream error protocol (event: error) and a client that handles it gracefully.
4. Cancellation propagation: when the user closes the tab, abort the upstream API call so you are not charged for tokens nobody will read. Pass an AbortController signal down to the SDK.
5. Load balancers / CDNs: Cloudflare and nginx can buffer SSE, which breaks streaming. Set the X-Accel-Buffering: no response header, or disable the Cloudflare proxy.
6. Edge runtime limitations: Vercel Edge has a 30s timeout by default, Fluid Compute 300s. Lambda cold starts hurt TTFT.
7. Metrics: track TTFT (more important than total latency), inter-token latency, and dropped connection rate.
8. Token counting while streaming: each chunk carries only a partial result, so tally usage from the final event (the one with finish_reason) or estimate it with a client-side tokenizer.

Best practices:
- Show a cursor/typing indicator before the first token arrives.
- Smooth rendering (no flicker): accumulate chunks and render about every 16ms.
- Keep the scroll anchored to the end of the message as new tokens arrive.
- Guardrail checks may require buffering: hold back a few tokens before emitting so PII/toxicity can be checked (a latency trade-off).
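To make the SSE format above more concrete, here is a minimal sketch of a client-side parser for data: lines ending with the [DONE] sentinel. The names SseLineParser and extractText are hypothetical, not part of any SDK. The sketch assumes the OpenAI Chat Completions chunk shape (choices[0].delta.content) and one data: line per event. In practice the Vercel AI SDK already does this for you.

```typescript
// Hypothetical helper (not from any SDK): incremental parser for OpenAI-style SSE text.
type SseResult = { payloads: string[]; done: boolean };

class SseLineParser {
  // Holds the tail of the stream that has not yet been terminated by a newline.
  private buffer = "";

  // Feed one decoded network chunk; returns the complete `data:` payloads it finished.
  push(chunk: string): SseResult {
    this.buffer += chunk;
    const payloads: string[] = [];
    let done = false;
    let newline = this.buffer.indexOf("\n");
    while (newline !== -1) {
      // Take one full line and tolerate CRLF line endings.
      const line = this.buffer.slice(0, newline).replace(/\r$/, "");
      this.buffer = this.buffer.slice(newline + 1);
      if (line.startsWith("data:")) {
        // Simplification: strips all leading whitespace after the colon.
        const payload = line.slice(5).trimStart();
        if (payload === "[DONE]") done = true;
        else payloads.push(payload);
      }
      // Blank lines (event separators) and other fields are ignored in this sketch.
      newline = this.buffer.indexOf("\n");
    }
    return { payloads, done };
  }
}

// Assumption: the payload follows the OpenAI Chat Completions streaming chunk shape.
function extractText(payload: string): string {
  const parsed = JSON.parse(payload) as {
    choices?: { delta?: { content?: string } }[];
  };
  return parsed.choices?.[0]?.delta?.content ?? "";
}
```

The buffer is the important part: network chunks do not line up with event boundaries, so a JSON payload can be split across two reads and must only be parsed once its newline has arrived.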
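The "accumulate chunks, render about every 16ms" practice can be sketched as a small batcher. ChunkBatcher is a hypothetical name, and the clock is passed in explicitly (nowMs) so the logic stays independent of any UI framework. A real component would more likely drive this from requestAnimationFrame or a timer.

```typescript
// Hypothetical helper: coalesces streamed chunks so the UI re-renders at most once per interval.
class ChunkBatcher {
  private pending = "";
  private lastFlush = Number.NEGATIVE_INFINITY;
  private intervalMs: number;

  constructor(intervalMs: number = 16) {
    this.intervalMs = intervalMs;
  }

  // Add a chunk; returns the text to render now, or null if the last render was too recent.
  push(chunk: string, nowMs: number): string | null {
    this.pending += chunk;
    if (nowMs - this.lastFlush < this.intervalMs) return null;
    return this.flush(nowMs);
  }

  // Force out whatever is buffered (call this when the stream ends).
  flush(nowMs: number): string | null {
    if (this.pending === "") return null;
    const out = this.pending;
    this.pending = "";
    this.lastFlush = nowMs;
    return out;
  }
}
```

Usage would look like calling push(chunk, performance.now()) for each incoming chunk, appending any non-null result to the message state, and calling flush once the stream closes so the last few tokens are not left in the buffer.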