Tensor parallelism vs pipeline parallelism vs data parallelism: when to use which?

Question

Luyện Phỏng Vấn IT · Accepted Answer

When a model does not fit on a single GPU, or when training needs to go faster, there are three basic kinds of parallelism. They can be combined, which is usually called 3D parallelism.

1. Data Parallelism (DP): the simplest kind
- Each GPU holds a full copy of the model; the data batch is split into shards and each GPU processes its own shard.
- Forward passes run independently → backward → all-reduce of gradients across GPUs → weight update.
- Pros: simple, and throughput scales well as GPUs are added.
- Cons: every GPU needs enough VRAM for the model, the gradients, and the optimizer state, so it does not work for models too large for one GPU.
- Use when: the model fits on one GPU and you want higher training throughput.
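
The data-parallel step above can be sketched in plain Python, without real GPUs. This is only an illustration under assumptions: the model is a hypothetical one-parameter y = w·x with a squared-error loss, and `all_reduce_mean` and `shard` are made-up helpers, not library calls. It shows that, with equal-sized shards, averaging the per-worker gradients gives the same gradient as the full batch.

```python
# Simulated data parallelism: every "GPU" holds the same weight w,
# computes a gradient on its own shard, then the gradients are averaged.

def grad_mse(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def all_reduce_mean(values):
    """Stand-in for an all-reduce: every worker ends up with the mean."""
    mean = sum(values) / len(values)
    return [mean] * len(values)

def shard(seq, num_workers):
    """Split seq into num_workers equal contiguous shards."""
    size = len(seq) // num_workers
    return [seq[i * size:(i + 1) * size] for i in range(num_workers)]

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.5
num_workers = 2

# Each worker computes a gradient on its own shard only.
local_grads = [
    grad_mse(w, x_shard, y_shard)
    for x_shard, y_shard in zip(shard(xs, num_workers), shard(ys, num_workers))
]
# After the all-reduce, every worker holds the same averaged gradient.
synced_grads = all_reduce_mean(local_grads)
full_batch_grad = grad_mse(w, xs, ys)
```

In a real setup the averaging is done by the framework's collective communication, and every replica then applies the same update so the copies stay identical.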

2. Tensor Parallelism (TP): split inside a layer
- Split each layer's weight matrices (attention, FFN) by row or by column across GPUs.
- Example for Y = X·W with a column split W = [W1, W2]: GPU1 computes X·W1, GPU2 computes X·W2, and the results are concatenated. With a row split, each GPU produces a partial product and the partial products are summed (an all-reduce).
- Every forward and backward pass needs all-reduce communication for every layer, so it needs high-bandwidth links (NVLink, NVSwitch) between GPUs.
- Pros: lower memory and lower latency per layer.
- Cons: large communication overhead; scales well only within one node (≤ 8 GPUs on the same NVLink domain).
- Use when: the model exceeds single-GPU memory; single-node multi-GPU serving.
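
To make the Y = X·W example concrete, here is a small sketch in plain Python that simulates two "GPUs" with lists. The matrices and the `matmul` helper are made up for illustration. It checks that a column split followed by concatenation, and a row split followed by an element-wise sum, both reproduce the unsplit product.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

X = [[1, 2, 3, 4],
     [5, 6, 7, 8]]            # 2 x 4 activations
W = [[1, 0, 2, 1],
     [0, 1, 1, 2],
     [3, 1, 0, 0],
     [1, 2, 1, 3]]            # 4 x 4 weights

reference = matmul(X, W)

# Column split: each "GPU" holds half of W's columns and the full X.
W_col = [[row[:2] for row in W], [row[2:] for row in W]]
partial_cols = [matmul(X, w_part) for w_part in W_col]
# Concatenating the column blocks (an all-gather) rebuilds Y.
y_col = [left + right for left, right in zip(*partial_cols)]

# Row split: each "GPU" holds half of W's rows and the matching columns of X.
W_row = [W[:2], W[2:]]
X_col = [[row[:2] for row in X], [row[2:] for row in X]]
partial_sums = [matmul(x_part, w_part) for x_part, w_part in zip(X_col, W_row)]
# Summing the partial products element-wise (an all-reduce) rebuilds Y.
y_row = [[a + b for a, b in zip(r0, r1)] for r0, r1 in zip(*partial_sums)]
```

The two splits compose well: the column-split output of one matmul is already laid out as the input a following row-split matmul needs, so only the final sum requires communication.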

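3. Pipeline Parallelism (PP): split between layers
- Split the model's layers into stages; each GPU holds one stage.
- Data flow: GPU1 (layers 1-10) → GPU2 (layers 11-20) → ... → output.
- The forward pass moves through the pipeline; the backward pass moves through it in reverse.
- Naive PP has a "bubble" (GPUs sit idle while data has not reached them yet). Micro-batching splits the batch into smaller pieces to fill the pipeline and shrink the bubble. 1F1B (one-forward-one-backward) scheduling alternates forward and backward micro-batches, which mainly caps how many activations are held in memory; its interleaved variant shrinks the bubble further.
- Pros: less communication (only activations are sent between adjacent stages, with no all-reduce).
- Pros: scales well across nodes because communication volume is small.
- Cons: the bubble gives sub-linear speedup; per-sample latency grows with the number of stages.
- Use when: the model is too large for one node; cross-datacenter training.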
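
The size of the bubble can be estimated with a short calculation. The sketch below uses the idle-time formula commonly quoted for synchronous schedules (GPipe-style and non-interleaved 1F1B), (p - 1) / (m + p - 1) for p stages and m micro-batches. It assumes every stage takes the same time and ignores communication; the function name is made up for illustration.

```python
from fractions import Fraction

def bubble_fraction(num_stages, num_microbatches):
    """Idle share of a synchronous pipeline schedule with equal-cost stages."""
    p, m = num_stages, num_microbatches
    return Fraction(p - 1, m + p - 1)

# 4 stages, whole batch sent as one piece: most of the time is idle.
no_microbatching = bubble_fraction(4, 1)     # Fraction(3, 4)
# Same 4 stages with 16 micro-batches: the bubble shrinks.
with_microbatches = bubble_fraction(4, 16)   # Fraction(3, 19)
```

Under these assumptions, more micro-batches per step always shrink the bubble, at the cost of smaller per-micro-batch work.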

4. FSDP / ZeRO (hybrid, state of the art): a replacement for DP
- FSDP (PyTorch Fully Sharded Data Parallel) and DeepSpeed ZeRO shard weights, gradients, and optimizer state across GPUs, and gather them on demand when they are needed for compute.
- Pros: per-GPU memory drops sharply; scales to 1T-parameter models on multi-node clusters.
- Now close to the default for large-scale training, replacing pure DP.
- ZeRO stages: ZeRO-1 shards optimizer state; ZeRO-2 also shards gradients; ZeRO-3 also shards weights (similar to FSDP).
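
A rough memory estimate shows what each ZeRO stage buys. The sketch below assumes mixed-precision Adam with 2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and 12 for optimizer state (fp32 master weights plus two Adam moments). These byte counts are an assumption that depends on the optimizer and precision; activations and buffers are not counted. The function name is hypothetical.

```python
def model_state_bytes_per_gpu(num_params, num_gpus, stage):
    """Approximate model-state bytes per GPU; activations are not counted."""
    weights = 2 * num_params      # assumed fp16 weights
    grads = 2 * num_params        # assumed fp16 gradients
    optim = 12 * num_params       # assumed fp32 master weights + Adam moments
    if stage >= 1:                # ZeRO-1: shard optimizer state
        optim /= num_gpus
    if stage >= 2:                # ZeRO-2: also shard gradients
        grads /= num_gpus
    if stage >= 3:                # ZeRO-3: also shard weights
        weights /= num_gpus
    return weights + grads + optim

GB = 1e9
# Stage 0 is plain data parallelism, where everything is replicated.
for stage in range(4):
    gb = model_state_bytes_per_gpu(7.5e9, 64, stage) / GB
    print(f"stage {stage}: {gb:.1f} GB per GPU")
```

For the 7.5B-parameter, 64-GPU case used here, the estimate falls from 120 GB with plain DP to under 2 GB per GPU at stage 3.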

5. Expert Parallelism (EP): for MoE models only. Experts are distributed across GPUs; the router sends each token to the GPU that hosts its expert.
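
The routing step can be sketched as follows. This is a simplified illustration, assuming top-1 routing and experts placed on GPUs in contiguous groups; `route_tokens` and the score values are made up, and real MoE layers add capacity limits and load balancing that are left out here.

```python
def route_tokens(router_scores, experts_per_gpu):
    """Group token indices by the GPU that hosts each token's top-1 expert."""
    dispatch = {}
    for token_idx, scores in enumerate(router_scores):
        # Top-1 routing: pick the expert with the highest router score.
        expert = max(range(len(scores)), key=scores.__getitem__)
        # Assumed contiguous placement: experts 0..k-1 on GPU 0, and so on.
        gpu = expert // experts_per_gpu
        dispatch.setdefault(gpu, []).append((token_idx, expert))
    return dispatch

# 4 tokens, 4 experts, 2 experts per GPU: experts 0-1 on GPU 0, 2-3 on GPU 1.
scores = [
    [0.1, 0.7, 0.1, 0.1],
    [0.2, 0.1, 0.6, 0.1],
    [0.9, 0.0, 0.05, 0.05],
    [0.1, 0.1, 0.1, 0.7],
]
dispatch = route_tokens(scores, experts_per_gpu=2)
# dispatch == {0: [(0, 1), (2, 0)], 1: [(1, 2), (3, 3)]}
```

Each entry of `dispatch` is the list of (token, expert) pairs one GPU would receive; sending tokens there and bringing the outputs back is the all-to-all exchange that dominates EP communication.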

6. Sequence Parallelism (SP): split a long sequence across GPUs. Useful for training with 1M+ token contexts. Ring Attention is one technique of this kind.
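
The arithmetic behind ring-style attention can be simulated in one process. The sketch below is an illustration, not the Ring Attention implementation: it assumes unmasked, unscaled softmax attention, simulates devices with list blocks, and uses made-up helper names. Each "device" keeps its own query block while key/value blocks arrive one ring step at a time, and a running softmax (running max, denominator, weighted sum) lets it combine them without ever holding the full sequence.

```python
import math

def attention(qs, ks, vs):
    """Reference softmax attention for lists of vectors (no scaling, no mask)."""
    out = []
    for q in qs:
        scores = [sum(a * b for a, b in zip(q, k)) for k in ks]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, vs)) / total
                    for j in range(len(vs[0]))])
    return out

def ring_attention(q_blocks, k_blocks, v_blocks):
    """Device i keeps q_blocks[i]; K/V blocks visit it one ring step at a time."""
    n = len(q_blocks)
    outputs = []
    for i in range(n):
        block_out = []
        for q in q_blocks[i]:
            run_max, denom = -math.inf, 0.0
            acc = [0.0] * len(v_blocks[0][0])
            for step in range(n):
                src = (i - step) % n              # K/V block arriving at this step
                for k, v in zip(k_blocks[src], v_blocks[src]):
                    s = sum(a * b for a, b in zip(q, k))
                    new_max = max(run_max, s)
                    scale = math.exp(run_max - new_max)   # rescale earlier sums
                    w = math.exp(s - new_max)
                    denom = denom * scale + w
                    acc = [a * scale + w * x for a, x in zip(acc, v)]
                    run_max = new_max
            block_out.append([a / denom for a in acc])
        outputs.append(block_out)
    return outputs

def blocks(rows, size):
    """Split a list of rows into consecutive blocks of the given size."""
    return [rows[i:i + size] for i in range(0, len(rows), size)]

qs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
ks = [[1.0, 2.0], [0.0, 1.0], [-1.0, 0.5], [2.0, 0.0]]
vs = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0]]

ring_out = ring_attention(blocks(qs, 2), blocks(ks, 2), blocks(vs, 2))
flat_ring_out = [row for block in ring_out for row in block]
reference_out = attention(qs, ks, vs)   # matches flat_ring_out up to rounding
```

On real hardware the K/V transfer for the next step overlaps with the compute for the current one, which this single-process simulation does not model.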

Combining them (3D / 4D parallelism):
- Training LLaMA 405B: TP=8 (within a node) × PP=16 (across nodes) × DP=? × SP=?
- Megatron-LM and DeepSpeed orchestrate these combinations.
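
The degrees multiply to the total GPU count, so once TP and PP are fixed the DP degree follows from the cluster size. The sketch below treats SP as its own axis, as the line above does; that is an assumption, since some frameworks share the TP group for sequence parallelism. The cluster size of 1024 GPUs is hypothetical and is not a figure from any real training run.

```python
def data_parallel_size(total_gpus, tp, pp, sp=1):
    """DP degree left over after TP, PP and SP take their share of the GPUs."""
    model_parallel = tp * pp * sp
    if total_gpus % model_parallel:
        raise ValueError("total_gpus must be a multiple of tp * pp * sp")
    return total_gpus // model_parallel

# Hypothetical cluster: 1024 GPUs with TP=8 and PP=16 leaves 8 model replicas.
dp = data_parallel_size(total_gpus=1024, tp=8, pp=16)
```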

Deciding for inference (not training):

A single GPU is enough: no parallelism.
The model fits across several GPUs in a single node: Tensor Parallel (vLLM --tensor-parallel-size 8).
The model is larger than a single node: Pipeline Parallel + Tensor Parallel.
Multi-tenant serving (many small models): assign one model per GPU (no parallelism) and scale by adding replicas.
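
The first three rules above can be written as a small decision helper. This is a rough sketch only: `pick_inference_strategy` is a made-up function, and it compares raw model size with GPU memory while ignoring KV cache, activations, and runtime overhead, all of which matter in practice.

```python
def pick_inference_strategy(model_gb, gpu_gb, gpus_per_node):
    """Map the rules above to a strategy name; memory figures are rough inputs."""
    if model_gb <= gpu_gb:
        return "single GPU, no parallelism"
    if model_gb <= gpu_gb * gpus_per_node:
        return "tensor parallel within one node"
    return "pipeline parallel across nodes + tensor parallel within each node"

# Illustrative sizes only, not measurements of any specific model.
small = pick_inference_strategy(model_gb=16, gpu_gb=80, gpus_per_node=8)
medium = pick_inference_strategy(model_gb=140, gpu_gb=80, gpus_per_node=8)
large = pick_inference_strategy(model_gb=810, gpu_gb=80, gpus_per_node=8)
```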

Tools:
- Training: PyTorch FSDP, DeepSpeed ZeRO, Megatron-LM, NVIDIA NeMo, ColossalAI.
- Inference: vLLM (TP + PP), TensorRT-LLM (TP + PP), SGLang (TP), DeepSpeed-Inference.