LLMs generate tokens sequentially (autoregressive). Streaming pushes each token/chunk to the client as it's produced instead of waiting for completion → low time to first token (TTFT), smooth UX.
Benefits:
- Perceived latency — user sees output almost immediately (< 1s) even if total latency is 10s.
- Cancellable — user spots a wrong result and can stop early → saves cost.
- Mandatory for chat UIs, code editors, voice assistants.
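Server-side mechanics (OpenAI/Anthropic):

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def stream_text(messages):
        """Yield text deltas as the model produces them."""
        stream = client.chat.completions.create(
            model="gpt-4o", messages=messages, stream=True
        )
        for chunk in stream:
            delta = chunk.choices[0].delta.content if chunk.choices else None
            if delta:
                yield delta

Providers return chunks via HTTP chunked transfer / SSE. The example uses the OpenAI SDK; Anthropic's SDK exposes an equivalent streaming mode.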
Transport protocols:
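1. Server-Sent Events (SSE): most popular for LLM streaming. A long-lived HTTP response (HTTP/1.1 or HTTP/2) with Content-Type: text/event-stream, on which the server pushes data: {...}\n\n events. Passes through most CDNs/proxies as long as they do not buffer the response. Format OpenAI uses: data: {"choices":[...]}\n\ndata: [DONE]\n\n.
2. WebSocket: full-duplex. Use when you need bidirectional traffic (voice, interrupting mid-stream, user typing while the model speaks). More complex than SSE: the connection is stateful, so load balancers must support long-lived upgraded connections, and sticky sessions are needed if a reconnect has to reach the same backend.
3. HTTP/2 streams / gRPC server streaming: rarer, mostly used between microservices. (This is not HTTP/2 Server Push, which preloads assets and does not stream a response body.)
4. HTTP chunked + JSONL: simplest option. Transfer-Encoding: chunked with one JSON object per line (newline-delimited JSON).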
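To make the SSE framing from the list above concrete, here is a minimal sketch of an encoder and a decoder. It assumes a simplified payload of the form {"delta": "..."} (a hypothetical shape chosen for illustration, not OpenAI's actual chunk schema) and handles only data: fields. The point it shows is that network chunks do not line up with event boundaries, so the reader has to buffer until the blank-line terminator arrives.

```python
import json
from typing import Iterable, Iterator


def sse_encode(deltas: Iterable[str]) -> Iterator[str]:
    """Frame each text delta as one SSE event, then send the [DONE] sentinel."""
    for delta in deltas:
        yield f"data: {json.dumps({'delta': delta})}\n\n"
    yield "data: [DONE]\n\n"


def sse_decode(network_chunks: Iterable[str]) -> Iterator[str]:
    """Reassemble events from arbitrarily split chunks and yield the deltas."""
    buffer = ""
    for chunk in network_chunks:
        buffer += chunk
        # An event is complete only once its blank-line terminator has arrived.
        while "\n\n" in buffer:
            event, buffer = buffer.split("\n\n", 1)
            for line in event.split("\n"):
                if not line.startswith("data:"):
                    continue  # this sketch ignores comments and other fields
                payload = line[len("data:"):].strip()
                if payload == "[DONE]":
                    return
                yield json.loads(payload)["delta"]


wire = "".join(sse_encode(["Hel", "lo", " world"]))
# Simulate a transport that splits the stream at awkward boundaries.
pieces = [wire[i:i + 7] for i in range(0, len(wire), 7)]
text = "".join(sse_decode(pieces))
```

A production parser would also need to handle event:, id: and retry: fields, multi-line data, and byte-level decoding; the sketch leaves those out on purpose.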
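Next.js 15 App Router implementation:

    import OpenAI from "openai";

    const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

    export async function POST(req: Request) {
      // Assumes the client posts { messages: [...] }; model name taken from the Python example.
      const { messages } = await req.json();
      const stream = await openai.chat.completions.create({ model: "gpt-4o", messages, stream: true });
      const encoder = new TextEncoder();
      const readable = new ReadableStream({
        async start(ctrl) {
          for await (const chunk of stream) {
            const delta = chunk.choices[0]?.delta?.content;
            // Frame each delta as an SSE event so the body matches the Content-Type.
            if (delta) ctrl.enqueue(encoder.encode(`data: ${JSON.stringify({ delta })}\n\n`));
          }
          ctrl.enqueue(encoder.encode("data: [DONE]\n\n"));
          ctrl.close();
        },
      });
      return new Response(readable, {
        headers: { "Content-Type": "text/event-stream", "Cache-Control": "no-cache" },
      });
    }

On the frontend, the Vercel AI SDK (useChat, useCompletion) handles stream parsing and UI state. These hooks expect the stream format produced by the SDK's own server helpers, so a hand-rolled stream like the one above needs a matching format or manual parsing on the client.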
Production challenges:
1. Parsing structured output while streaming: JSON mode and tool calls arrive as fragments that are not valid JSON until the stream ends, so you need a partial JSON parser (e.g. the partial-json package, or streamObject in the AI SDK).
2. Caching is hard: a stream can't be cached like a fixed response. Solution: accumulate the full response while streaming, store it keyed by a hash of the input, and replay it (optionally re-chunked) on the next hit.
3. Mid-stream errors: by the time the failure occurs, part of the response has been sent and the HTTP status can no longer change. The protocol must signal errors in-stream (event: error) and the client must handle them gracefully.
4. Cancellation propagation: when the user closes the tab, abort the upstream API call to stop the meter. Pass an AbortController signal through to the SDK request.
5. Load balancers / CDNs: Cloudflare, nginx may buffer SSE, which breaks streaming. Send X-Accel-Buffering: no (honored by nginx), or bypass the Cloudflare proxy for the streaming route.
6. Edge runtime limits: Vercel Edge default 30s timeout, Fluid Compute 300s (platform limits change, so confirm the current values). Lambda cold starts hurt TTFT.
7. Metrics: track TTFT (more important than total latency), inter-token latency, dropped-connection rate.
8. Token counting while streaming: content chunks carry only partial text and no usage. Read usage from the final chunk (with OpenAI, set stream_options: {"include_usage": true} and usage arrives in one extra final chunk whose choices list is empty) or estimate with a client-side tokenizer.
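Challenges 3 and 7 can be handled in one wrapper around the token iterator. The sketch below is an illustration under stated assumptions, not a reference implementation: stream_with_error_event, flaky_upstream, the metrics dict keys, and the {"delta": ...} / {"message": ...} payload shapes are all hypothetical names invented for this example. It shows that once streaming has begun, a failure has to travel as an event: error frame, and that TTFT can be recorded at the moment the first token is seen.

```python
import json
import time
from typing import Callable, Iterable, Iterator


def stream_with_error_event(
    tokens: Iterable[str],
    metrics: dict,
    clock: Callable[[], float] = time.monotonic,
) -> Iterator[str]:
    """Emit SSE frames, record TTFT, and report upstream failures in-stream."""
    started = clock()
    metrics["tokens"] = 0
    try:
        for token in tokens:
            if metrics["tokens"] == 0:
                metrics["ttft_s"] = clock() - started  # time to first token
            metrics["tokens"] += 1
            yield f"data: {json.dumps({'delta': token})}\n\n"
    except Exception as exc:
        # Headers are already sent, so the error must be delivered as an event.
        yield f"event: error\ndata: {json.dumps({'message': str(exc)})}\n\n"
    else:
        yield "data: [DONE]\n\n"
    finally:
        # Also runs when the consumer closes the generator early (client disconnect).
        metrics["total_s"] = clock() - started


def flaky_upstream() -> Iterator[str]:
    """Hypothetical upstream that fails after two tokens."""
    yield "Hel"
    yield "lo"
    raise RuntimeError("upstream timeout")


metrics: dict = {}
frames = list(stream_with_error_event(flaky_upstream(), metrics))
```

Because the finally clause also runs when the generator is closed early, the same wrapper is a natural place to count dropped connections, though that bookkeeping is not shown here.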
Best practices:
- Show a cursor/typing indicator before the first token arrives.
- Smooth rendering (no flicker): accumulate chunks and flush them to the DOM at most once per frame (about every 16ms at 60 Hz, e.g. via requestAnimationFrame).
- Scroll-anchor the end of message as new tokens arrive.
- Guardrails may buffer → delay a few tokens before emission to run PII/toxicity checks (latency trade-off).
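The guardrail trade-off in the last bullet can be sketched as a hold-back buffer. This is a minimal illustration with assumptions: guarded_stream, the holdback parameter, the "[redacted]" marker, and the contains_secret check are hypothetical names made up for the example, and a real system would call an actual PII/toxicity classifier. Each token is withheld until a few later tokens have arrived, so the check sees text before the user does.

```python
from collections import deque
from typing import Callable, Iterable, Iterator


def guarded_stream(
    tokens: Iterable[str],
    is_blocked: Callable[[str], bool],
    holdback: int = 3,
) -> Iterator[str]:
    """Delay emission by `holdback` tokens so the check runs on unsent text."""
    pending = deque()
    for token in tokens:
        pending.append(token)
        # Only the withheld window is checked; anything already emitted is gone.
        if is_blocked("".join(pending)):
            yield "[redacted]"
            return
        if len(pending) > holdback:
            yield pending.popleft()
    # Stream ended cleanly: flush whatever is still being held back.
    yield from pending


def contains_secret(text: str) -> bool:
    """Hypothetical check standing in for a real PII/toxicity classifier."""
    return "SECRET" in text


safe = list(guarded_stream(["Hi", " ", "there"], contains_secret))
blocked = list(
    guarded_stream(["The", " key", " is", " SEC", "RET", "!"], contains_secret)
)
```

A larger holdback catches patterns that span more tokens but adds that many tokens of latency before anything is shown; a pattern longer than the window would slip through this sketch.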