When a model doesn't fit on one GPU, or training needs to go faster, work can be split across GPUs. The first three types below (data, tensor, pipeline) are the basic ones, and they can be combined (3D parallelism).
1. Data Parallelism (DP) — simplest
- Each GPU holds a full copy of the model; batch is split into shards per GPU.
- Independent forward → backward → all-reduce gradients across GPUs → update.
- Pros: simple, scales with compute.
- Cons: each GPU must hold model + grad + optimizer state → fails on large models.
- Use: the model fits on one GPU, training throughput is the goal.
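The DP steps above can be sketched with a toy one-parameter model. This is a pure-Python simulation, not a real multi-GPU setup: the "GPUs" are list entries, and `all_reduce_mean` and `data_parallel_step` are illustrative names, not library calls. It shows that averaging per-shard gradients gives the same update as one pass over the full batch, assuming equal shard sizes.

```python
# Toy data-parallel step for the model y = w * x with squared-error loss.

def grad(w, shard):
    """Mean gradient of (w*x - y)^2 with respect to w over one shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def all_reduce_mean(values):
    """Stand-in for an all-reduce: every worker ends up with the same mean."""
    mean = sum(values) / len(values)
    return [mean] * len(values)

def data_parallel_step(w, batch, n_gpus, lr):
    shard_size = len(batch) // n_gpus           # assumes the batch divides evenly
    shards = [batch[i * shard_size:(i + 1) * shard_size] for i in range(n_gpus)]
    local_grads = [grad(w, s) for s in shards]  # independent forward/backward
    synced = all_reduce_mean(local_grads)       # gradients averaged across workers
    return [w - lr * g for g in synced]         # every replica applies the same update

batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
replicas = data_parallel_step(w=0.0, batch=batch, n_gpus=2, lr=0.01)
single = 0.0 - 0.01 * grad(0.0, batch)          # same step computed on one device
print(replicas, single)
```

Because every replica receives the same averaged gradient, the weights stay identical across GPUs without ever being sent explicitly.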
2. Tensor Parallelism (TP) — split within a layer
- Shard each layer's weight matrices (attention, FFN) by row/column across GPUs.
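- e.g. Y = X·W: split W column-wise into [W1, W2]; GPU1 computes X·W1, GPU2 computes X·W2, and the results are concatenated. With a row-wise split, each GPU produces a partial sum and the partial sums are added instead.
- Each forward/backward pass needs collective communication (all-reduce or all-gather) in every layer → needs high bandwidth between GPUs (NVLink, NVSwitch).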
- Pros: reduces per-layer memory and latency.
- Cons: heavy communication; in practice scales well only within a node (typically ≤ 8 GPUs on NVLink).
- Use: model > single-GPU memory; single-node multi-GPU serving.
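The Y = X·W example can be checked numerically. The sketch below uses plain Python lists instead of GPU tensors, and the helpers (`matmul`, `split_columns`, `split_rows`) are written here for illustration. It compares a column-wise split of W (outputs concatenated) and a row-wise split (partial products summed, which is what an all-reduce would do) against the unsplit product.

```python
def matmul(a, b):
    """Plain-Python product of a (n x k) and b (k x m)."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def split_columns(m, parts):
    """Split a matrix column-wise into `parts` equal blocks."""
    step = len(m[0]) // parts
    return [[row[p * step:(p + 1) * step] for row in m] for p in range(parts)]

def split_rows(m, parts):
    """Split a matrix row-wise into `parts` equal blocks."""
    step = len(m) // parts
    return [m[p * step:(p + 1) * step] for p in range(parts)]

X = [[1, 2, 3, 4],
     [5, 6, 7, 8]]
W = [[1, 0, 2, 1],
     [0, 1, 1, 3],
     [2, 1, 0, 1],
     [1, 1, 1, 0]]

# Column parallel: each "GPU" holds some columns of W; outputs are concatenated.
col_parts = [matmul(X, w_i) for w_i in split_columns(W, 2)]
y_col = [col_parts[0][r] + col_parts[1][r] for r in range(len(X))]

# Row parallel: W is split by rows and X by columns; partial products are summed.
row_parts = [matmul(x_i, w_i) for x_i, w_i in zip(split_columns(X, 2), split_rows(W, 2))]
y_row = [[row_parts[0][r][c] + row_parts[1][r][c] for c in range(len(W[0]))]
         for r in range(len(X))]

y_ref = matmul(X, W)
print(y_col == y_ref, y_row == y_ref)
```

Pairing a column-parallel layer with a row-parallel one lets the intermediate activation stay sharded, so only the final sum has to cross GPUs.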
3. Pipeline Parallelism (PP) — split between layers
- Split model layers into stages, one per GPU.
- Data flow: GPU1 (layers 1–10) → GPU2 (layers 11–20) → ... → output.
- Forward walks through the pipeline; backward comes back.
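- Naïve PP has "bubbles" (GPUs idle while waiting for data). Micro-batching splits the batch into smaller pieces so stages overlap and the bubble shrinks. 1F1B (one-forward-one-backward) scheduling limits how many micro-batches' activations are held at once, and its interleaved variant shrinks the bubble further.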
- Pros: less communication (only activations pass between adjacent stages, no all-reduce), so it scales well across nodes.
- Cons: bubbles → sub-linear speedup; per-sample latency grows with stages.
- Use: very large models > 1 node; cross-datacenter training.
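A rough way to quantify the bubble, assuming every stage takes the same time per micro-batch and micro-batches enter the pipeline back to back: stage s works on micro-batch j at tick s + j, so a pass over m micro-batches through p stages takes m + p - 1 ticks and each stage idles for p - 1 of them. The sketch below builds that schedule and counts idle slots; the function names are illustrative, and only one direction of the pass is modeled.

```python
from fractions import Fraction

def bubble_fraction(stages, micro_batches):
    """Idle share of the schedule under the equal-time assumption: (p - 1) / (m + p - 1)."""
    return Fraction(stages - 1, micro_batches + stages - 1)

def forward_schedule(stages, micro_batches):
    """grid[s][t] is the micro-batch stage s handles at tick t, or None if idle."""
    ticks = micro_batches + stages - 1
    grid = [[None] * ticks for _ in range(stages)]
    for s in range(stages):
        for j in range(micro_batches):
            grid[s][s + j] = j
    return grid

def idle_fraction(grid):
    """Count idle slots in a schedule grid."""
    slots = sum(len(row) for row in grid)
    idle = sum(cell is None for row in grid for cell in row)
    return Fraction(idle, slots)

for m in (1, 4, 16, 64):
    print(f"stages=4 micro_batches={m:2d} bubble={float(bubble_fraction(4, m)):.3f}")
```

With one micro-batch and four stages, three quarters of the schedule is idle; more micro-batches amortize the same p - 1 idle ticks over a longer pass.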
4. FSDP / ZeRO (hybrid, state-of-the-art) — replaces DP
- FSDP (PyTorch Fully Sharded Data Parallel) and DeepSpeed ZeRO shard weights/grad/optimizer across GPUs; gather on demand.
- Pros: big drop in per-GPU memory, scales to 1T-param models across nodes.
- Close to the default for large-scale training today, replacing pure DP.
- ZeRO stages: ZeRO-1 shards optimizer state; ZeRO-2 also shards gradients; ZeRO-3 also shards weights (like FSDP).
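The memory effect of each ZeRO stage can be estimated with simple arithmetic. The sketch below assumes mixed-precision Adam accounting: 2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and 12 for optimizer state (fp32 master weights, momentum, variance). It ignores activations, buffers, and fragmentation. The 7.5B-parameter, 64-GPU figures are illustrative inputs, not measurements.

```python
def model_state_bytes_per_gpu(params, n_gpus, stage):
    """Approximate per-GPU bytes for weights + gradients + optimizer state.

    stage 0 = plain DP, 1/2/3 = ZeRO-1/2/3 (stage 3 is comparable to full FSDP sharding).
    """
    weights = 2 * params   # fp16 weights (assumption)
    grads = 2 * params     # fp16 gradients (assumption)
    optim = 12 * params    # fp32 master weights + Adam momentum + variance (assumption)
    if stage >= 1:
        optim /= n_gpus    # ZeRO-1: optimizer state sharded
    if stage >= 2:
        grads /= n_gpus    # ZeRO-2: gradients sharded as well
    if stage >= 3:
        weights /= n_gpus  # ZeRO-3: weights sharded as well
    return weights + grads + optim

PARAMS = 7_500_000_000     # illustrative model size
GPUS = 64                  # illustrative data-parallel degree
for stage in range(4):
    gb = model_state_bytes_per_gpu(PARAMS, GPUS, stage) / 1e9
    print(f"stage {stage}: {gb:.1f} GB per GPU")
```

Under these assumptions the model state drops from about 120 GB per GPU with plain DP to under 2 GB with everything sharded, at the cost of extra gather traffic.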
5. Expert Parallelism (EP) — MoE-only. Distribute experts across GPUs; router routes tokens to the expert's GPU.
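Token routing under EP can be sketched as a grouping step. This is a toy top-1 router in plain Python: the scores are made up, experts are assumed to be placed contiguously (`experts_per_gpu` per device), and `dispatch` is an illustrative helper rather than any framework's API. In a real system the resulting exchange is usually an all-to-all between GPUs.

```python
from collections import defaultdict

def top1_expert(scores):
    """Index of the highest router score (ties go to the lowest index)."""
    return max(range(len(scores)), key=lambda e: scores[e])

def dispatch(router_scores, experts_per_gpu):
    """Group (token, expert) pairs by the GPU that hosts the chosen expert."""
    per_gpu = defaultdict(list)
    for token, scores in enumerate(router_scores):
        expert = top1_expert(scores)
        gpu = expert // experts_per_gpu  # contiguous expert placement (assumption)
        per_gpu[gpu].append((token, expert))
    return dict(per_gpu)

# One row of router scores per token, one column per expert (4 experts, 2 per GPU).
router_scores = [
    [0.1, 0.7, 0.1, 0.1],
    [0.6, 0.2, 0.1, 0.1],
    [0.1, 0.1, 0.2, 0.6],
    [0.2, 0.1, 0.5, 0.2],
    [0.3, 0.4, 0.2, 0.1],
]
plan = dispatch(router_scores, experts_per_gpu=2)
print(plan)
```

Here GPU 0 receives three tokens and GPU 1 receives two, which hints at why load balancing across experts matters for EP.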
6. Sequence Parallelism (SP) — split long sequences across GPUs. Useful for training on very long contexts (1M+ tokens). Ring Attention is one such technique.
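The idea behind Ring Attention can be sketched in plain Python: each device keeps its own chunk of queries, while key/value chunks travel around a ring, and partial results are merged with a running (online) softmax. The version below is a simplified simulation under stated assumptions: single head, no causal mask, no 1/sqrt(d) scaling, and devices simulated by a sequential loop rather than running concurrently. It checks the result against ordinary attention on one device.

```python
import math

def attention(q_rows, k_rows, v_rows):
    """Reference softmax attention computed in one place."""
    out = []
    for q in q_rows:
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_rows]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[d] for w, v in zip(weights, v_rows)) / total
                    for d in range(len(v_rows[0]))])
    return out

def ring_attention(q_chunks, k_chunks, v_chunks):
    """Each device keeps its Q chunk; every K/V chunk visits it once."""
    n = len(q_chunks)
    dim = len(v_chunks[0][0])
    outputs = []
    for dev in range(n):
        # Per local query: running max, softmax denominator, weighted-value numerator.
        state = [(-math.inf, 0.0, [0.0] * dim) for _ in q_chunks[dev]]
        for step in range(n):
            src = (dev - step) % n  # K/V chunk present on this device at this ring step
            for i, q in enumerate(q_chunks[dev]):
                run_max, denom, numer = state[i]
                for k, v in zip(k_chunks[src], v_chunks[src]):
                    s = sum(a * b for a, b in zip(q, k))
                    new_max = max(run_max, s)
                    scale = math.exp(run_max - new_max)  # rescale old sums to the new max
                    w = math.exp(s - new_max)
                    denom = denom * scale + w
                    numer = [x * scale + w * vd for x, vd in zip(numer, v)]
                    run_max = new_max
                state[i] = (run_max, denom, numer)
        outputs.append([[x / denom for x in numer] for _, denom, numer in state])
    return outputs

def chunk(rows, parts):
    """Split a sequence of rows into `parts` equal contiguous chunks."""
    size = len(rows) // parts
    return [rows[i * size:(i + 1) * size] for i in range(parts)]

Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
K = [[1.0, 0.5], [0.0, 1.0], [-1.0, 0.5], [0.5, 0.5]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

ring = ring_attention(chunk(Q, 2), chunk(K, 2), chunk(V, 2))
flat = [row for dev_rows in ring for row in dev_rows]
ref = attention(Q, K, V)
```

No device ever holds the full sequence of keys and values at once, which is what makes very long contexts fit in memory.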
Combining (3D / 4D parallelism):
- Training LLaMA 405B: TP=8 (intra-node) × PP=16 (across nodes) × DP=? × SP=?
- Orchestrated by Megatron-LM, DeepSpeed.
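How the degrees multiply out can be shown with a small rank-mapping sketch. The numbers (1024 GPUs, TP=8, PP=4) are hypothetical and are not the LLaMA configuration above. The layout, with TP varying fastest so that a TP group sits on consecutive ranks, is an assumption made for illustration; real frameworks define their own rank ordering.

```python
def data_parallel_size(world_size, tp, pp):
    """DP degree left over once TP and PP are fixed: world = tp * pp * dp."""
    if world_size % (tp * pp) != 0:
        raise ValueError("world size must be a multiple of tp * pp")
    return world_size // (tp * pp)

def rank_to_coords(rank, tp, pp):
    """Map a global rank to (dp, pp, tp) indices, with TP varying fastest."""
    tp_idx = rank % tp
    pp_idx = (rank // tp) % pp
    dp_idx = rank // (tp * pp)
    return dp_idx, pp_idx, tp_idx

WORLD, TP, PP = 1024, 8, 4  # hypothetical cluster and degrees
dp = data_parallel_size(WORLD, TP, PP)
print("dp =", dp)
for rank in (0, 7, 8, 32, 1023):
    print(rank, rank_to_coords(rank, TP, PP))
```

With 8 GPUs per node, this layout keeps each TP group inside one node, so the heaviest communication stays on the fastest links.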
Inference (not training) decision guide:
- Model fits on one GPU: no parallelism.
- Model fits across GPUs in one node: Tensor Parallel (vLLM --tensor-parallel-size 8).
- Model spans nodes: Pipeline Parallel + Tensor Parallel.
- Multi-tenant serving (many small models): one model per GPU (no parallelism), scale replicas.
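The first three rules of the guide can be written as a small helper. This is a sketch under stated assumptions: `model_gb` is taken to already include headroom for KV cache and activations, TP degrees are rounded up to a power of two, and the function name and return shape are made up for illustration. It does not cover the multi-tenant case.

```python
import math

def choose_inference_parallelism(model_gb, gpu_gb, gpus_per_node):
    """Pick TP/PP degrees following the guide: none, TP in one node, or TP + PP."""
    if model_gb <= gpu_gb:
        return {"tp": 1, "pp": 1}                 # fits on one GPU
    if model_gb <= gpu_gb * gpus_per_node:
        tp = 1
        while tp * gpu_gb < model_gb:
            tp *= 2                               # power-of-two TP degree (assumption)
        return {"tp": min(tp, gpus_per_node), "pp": 1}
    nodes = math.ceil(model_gb / (gpu_gb * gpus_per_node))
    return {"tp": gpus_per_node, "pp": nodes}     # TP inside each node, PP across nodes

for size in (14, 140, 400, 810):                  # illustrative model footprints in GB
    print(size, choose_inference_parallelism(size, gpu_gb=80, gpus_per_node=8))
```

The resulting TP degree is the kind of value that would be passed to a flag such as vLLM's --tensor-parallel-size.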
Tools:
- Training: PyTorch FSDP, DeepSpeed ZeRO, Megatron-LM, NVIDIA NeMo, ColossalAI.
- Inference: vLLM (TP + PP), TensorRT-LLM (TP + PP), SGLang (TP), DeepSpeed-Inference.