Distributed Training
Training GPT-3 would take an estimated 355 GPU-years on a single GPU, yet it was trained in weeks by running thousands of GPUs in parallel. Distributed training is how you scale deep learning beyond what a single GPU can handle.
Why Distribute Training?
GPT-3 (175B params) needs ~700GB at FP32 or ~350GB at FP16 for weights alone — far beyond any single GPU without sharding, quantisation, or offloading
Use 8 GPUs → roughly an 8× speedup (minus communication overhead)
Aggregate gradients across machines for more stable training
Strategy 1: Data Parallelism (DDP)
The simplest and most common approach. Each GPU gets a complete copy of the model and a different mini-batch of data. After the backward pass, gradients are averaged across all GPUs (all-reduce), then each GPU updates its copy.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

# Initialise the process group (nccl = NVIDIA's collective comms library)
dist.init_process_group(backend='nccl')
local_rank = int(os.environ['LOCAL_RANK'])
torch.cuda.set_device(local_rank)

# Wrap the model in DDP
model = MyModel().to(local_rank)
model = DDP(model, device_ids=[local_rank])
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# DistributedSampler ensures each GPU sees a different shard of the data
sampler = DistributedSampler(dataset)
dataloader = DataLoader(dataset, sampler=sampler, batch_size=32)

# Training loop: gradients are averaged across GPUs automatically
for batch in dataloader:
    batch = batch.to(local_rank)
    optimizer.zero_grad()
    loss = model(batch).mean()
    loss.backward()   # all-reduce happens here
    optimizer.step()
# Launch: torchrun --nproc_per_node=4 train.py

Strategy 2: Model Parallelism (Tensor Parallelism)
When the model is too large for a single GPU, split the model itself across GPUs. Tensor parallelism shards individual weight matrices — each GPU holds a slice of each layer and they communicate after each matrix multiply.
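A minimal forward-pass sketch of the idea, assuming a process group is already initialised. The class name ColumnParallelLinear is illustrative, not a library API: each rank owns a column slice of the weight and the partial outputs are gathered after the matmul.

import torch
import torch.distributed as dist

class ColumnParallelLinear(torch.nn.Module):
    """Each rank owns out_features // world_size output columns."""
    def __init__(self, in_features, out_features):
        super().__init__()
        world_size = dist.get_world_size()
        assert out_features % world_size == 0
        self.weight = torch.nn.Parameter(
            torch.empty(out_features // world_size, in_features).normal_(std=0.02))

    def forward(self, x):
        local_out = x @ self.weight.t()                   # partial output on this rank
        parts = [torch.empty_like(local_out) for _ in range(dist.get_world_size())]
        dist.all_gather(parts, local_out)                 # communicate after the matmul
        return torch.cat(parts, dim=-1)                   # every rank gets the full output

Note that dist.all_gather does not backpropagate through the gathered tensors; real implementations such as Megatron-LM use autograd-aware collectives for training.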
Naive layer split: Layer 1 on GPU 0, Layer 2 on GPU 1. Simple, but only one GPU is active at a time. GPU utilisation: ~25%.

Tensor parallelism: Each weight matrix is split column- or row-wise. All GPUs are active simultaneously. GPU utilisation: ~85%+.

Strategy 3: Pipeline Parallelism
Split the model layer-wise (Layer 1–4 on GPU0, Layer 5–8 on GPU1, etc.). Use micro-batches so multiple batches are in flight simultaneously — like a CPU pipeline.
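A toy, single-device illustration of micro-batching (real pipeline parallelism places the stages on different GPUs and overlaps their work; the layer sizes here are arbitrary):

import torch
import torch.nn as nn

stage1 = nn.Sequential(nn.Linear(512, 512), nn.ReLU())   # layers assigned to "GPU 0"
stage2 = nn.Sequential(nn.Linear(512, 512), nn.ReLU())   # layers assigned to "GPU 1"

batch = torch.randn(32, 512)
micro_batches = batch.chunk(4)   # 4 micro-batches of 8 keep both stages busy

# In a real pipeline, stage 1 starts micro-batch k+1 while stage 2 finishes micro-batch k
activations = [stage1(mb) for mb in micro_batches]
outputs = [stage2(act) for act in activations]
out = torch.cat(outputs)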
ZeRO — DeepSpeed's Memory Miracle
ZeRO (Zero Redundancy Optimizer) eliminates the memory redundancy in data parallel training. Instead of each GPU holding a full copy of weights + gradients + optimizer states, ZeRO shards them across GPUs:
Stage 1: Shard optimizer states (Adam momentum and variance) across GPUs. 8× memory reduction.
Stage 2: Also shard gradients. 16× memory reduction.
Stage 3: Also shard the model parameters themselves. Memory per GPU now falls linearly as you add GPUs, enabling 100B+ models on commodity hardware.
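Back-of-envelope arithmetic, using the ZeRO paper's accounting of 16 bytes per parameter for mixed-precision Adam (the 7B model size is just an example):

params = 7e9                    # hypothetical 7B-parameter model
bytes_per_param = 2 + 2 + 12    # fp16 weights + fp16 grads + fp32 Adam states
total_gb = params * bytes_per_param / 1e9
print(f"without ZeRO: {total_gb:.0f} GB on every GPU")       # ~112 GB
print(f"ZeRO-3 on 8 GPUs: {total_gb / 8:.0f} GB per GPU")    # ~14 GB

A DeepSpeed config enabling Stage 3 with CPU offloading looks like this: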
{
"zero_optimization": {
"stage": 3,
"offload_optimizer": { "device": "cpu" },
"offload_param": { "device": "cpu" }
},
"bf16": { "enabled": true },
"train_batch_size": 256,
"gradient_accumulation_steps": 4
}
# Launch:
# deepspeed --num_gpus=8 train.py --deepspeed ds_config.json

Communication: NVLink & InfiniBand
Within a machine, GPUs synchronise over NVLink; across machines, over InfiniBand (see the FAQ below for the overhead numbers). The interconnect determines how much communication you can afford, which in turn drives the choice of strategy:
DDP: The model fits on a single GPU and you just want to train faster. Most common for <7B models.
Tensor Parallel: Large models (7B–70B) that need to be split within a node.
Pipeline + Tensor + Data: "3D parallelism" for 100B+ models across many nodes.
Frequently Asked Questions
How much does gradient communication overhead cost?
With NVLink between GPUs in the same machine, overhead is 5–15%. Across machines over InfiniBand, it's 15–30%. That's why GPU clusters use InfiniBand rather than regular Ethernet — the bandwidth difference (200 Gb/s vs 25 Gb/s) directly impacts training speed.
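As a rough sanity check, assuming a 7B-parameter model with fp16 gradients and the ~2× data movement of ring all-reduce (the model size is just an example):

grad_bytes = 7e9 * 2   # 7B params, fp16 gradients = 14 GB
for name, gbit_per_s in [("InfiniBand 200 Gb/s", 200), ("Ethernet 25 Gb/s", 25)]:
    seconds = 2 * grad_bytes / (gbit_per_s / 8 * 1e9)   # ring all-reduce moves ~2x the data
    print(f"{name}: ~{seconds:.1f} s per gradient sync")
# InfiniBand 200 Gb/s: ~1.1 s; Ethernet 25 Gb/s: ~9.0 s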
What is gradient accumulation?
If your target batch size doesn't fit in GPU memory, accumulate gradients over several forward/backward passes before calling optimizer.step(). With 4 accumulation steps, a per-GPU batch size of 32 behaves like an effective batch of 128. No extra memory is needed; each optimizer step just takes longer.
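A minimal sketch of the loop structure (the model, loss, and data here are stand-ins for illustration):

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Linear(16, 1)   # toy model
optimizer = torch.optim.Adam(model.parameters())
dataloader = DataLoader(TensorDataset(torch.randn(256, 16)), batch_size=32)

accum_steps = 4            # 32 per pass x 4 = effective batch of 128
optimizer.zero_grad()
for i, (x,) in enumerate(dataloader):
    loss = model(x).pow(2).mean() / accum_steps   # scale so the sum averages out
    loss.backward()                               # gradients accumulate in .grad
    if (i + 1) % accum_steps == 0:
        optimizer.step()                          # one update per 4 micro-batches
        optimizer.zero_grad()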
Do I need distributed training as a beginner?
No. Start with a single GPU (Google Colab or Kaggle). DDP is needed when training speed or model size become bottlenecks. Fine-tuning 7B models with LoRA and QLoRA often fits on a single 24GB GPU — no distribution needed.