Distributed Training

Training GPT-3 would have taken an estimated 355 GPU-years on a single GPU; in practice it was trained in a matter of weeks by spreading the work across thousands of GPUs. Distributed training is how you scale deep learning beyond what a single GPU can handle.

📖 Covers: Data Parallelism · Model Parallelism · Pipeline Parallelism · ZeRO · NVLink · PyTorch DDP

Why Distribute Training?

💾
Model too large for one GPU

GPT-3 (175B params) needs ~700GB at FP32 or ~350GB at FP16 for weights alone — far beyond any single GPU without sharding, quantisation, or offloading

⚡
Speed up training

Use 8 GPUs → roughly 8× faster (minus some communication overhead)

📊
Larger effective batch size

Aggregate gradients across machines for more stable training

Strategy 1: Data Parallelism (DDP)

The simplest and most common approach. Each GPU gets a complete copy of the model and a different mini-batch of data. After the backward pass, gradients are averaged across all GPUs (all-reduce), then each GPU updates its copy.

Python · PyTorch DDP (Distributed Data Parallel)
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

# Initialise process group
dist.init_process_group(backend='nccl')  # nccl = NVIDIA's collective comms library
local_rank = int(os.environ['LOCAL_RANK'])
torch.cuda.set_device(local_rank)

# Wrap model in DDP
model = MyModel().to(local_rank)
model = DDP(model, device_ids=[local_rank])
optimizer = torch.optim.AdamW(model.parameters())  # any optimizer works here

# DistributedSampler ensures each GPU gets different data
sampler = DistributedSampler(dataset)
dataloader = DataLoader(dataset, sampler=sampler, batch_size=32)

# Training loop — gradients automatically averaged!
for batch in dataloader:
    optimizer.zero_grad()
    loss = model(batch).mean()
    loss.backward()  # All-reduce happens here
    optimizer.step()

# Launch: torchrun --nproc_per_node=4 train.py
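
Under the hood, DDP's synchronisation is an all-reduce: every GPU ends up with the mean of all GPUs' gradients. Here is a minimal sketch of the same operation done by hand with torch.distributed; DDP also buckets gradients and overlaps communication with the backward pass, which this sketch does not.

Python · Manual gradient all-reduce (what DDP automates)
import torch.distributed as dist

def average_gradients(model):
    # Sum each gradient tensor across all ranks, then divide by world size
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size

# Usage: call between loss.backward() and optimizer.step()
# when training without the DDP wrapper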

Strategy 2: Model Parallelism (Tensor Parallelism)

When the model is too large for a single GPU, split the model itself across GPUs. Tensor parallelism shards individual weight matrices — each GPU holds a slice of each layer and they communicate after each matrix multiply.

Naive Model Parallel

Layer 1 on GPU 0, Layer 2 on GPU 1. Simple but only one GPU active at a time.

GPU utilisation: ~50% with 2 GPUs, ~25% with 4 (only one GPU active at a time)
Tensor Parallel (Megatron-LM)

Each weight matrix is split column/row-wise. All GPUs active simultaneously.

GPU utilisation: ~85%+
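
To make the column-wise split concrete, here is a minimal, forward-only sketch of a column-parallel linear layer in plain PyTorch. It is illustrative only; Megatron-LM's real layers use autograd-aware collectives so that gradients flow back through the communication.

Python · Column-parallel linear layer (forward-only sketch)
import torch
import torch.distributed as dist
import torch.nn.functional as F

class ColumnParallelLinear(torch.nn.Module):
    """Each rank stores a column slice of the full weight matrix."""
    def __init__(self, in_features, out_features):
        super().__init__()
        world_size = dist.get_world_size()
        assert out_features % world_size == 0
        # This rank's slice: (out_features / world_size) x in_features
        self.weight = torch.nn.Parameter(
            torch.randn(out_features // world_size, in_features) * 0.02
        )

    def forward(self, x):
        local_out = F.linear(x, self.weight)  # partial output features
        # Collect every rank's slice and stitch the full output back together
        slices = [torch.empty_like(local_out) for _ in range(dist.get_world_size())]
        dist.all_gather(slices, local_out)
        return torch.cat(slices, dim=-1)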

Strategy 3: Pipeline Parallelism

Split the model layer-wise (layers 1–4 on GPU 0, layers 5–8 on GPU 1, etc.). Use micro-batches so multiple batches are in flight simultaneously, like a CPU pipeline.

Pipeline schedule sketch (2 GPUs, 2 micro-batches): each GPU alternates forward (F1, F2) and backward (B2, B1) passes on different micro-batches, so both GPUs stay busy instead of idling while the other works. F = forward pass, B = backward pass.
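
Below is a minimal sketch of the layer-wise split plus micro-batching in plain PyTorch, with a hypothetical two-stage model. This naive loop runs the stages one after another; a real pipeline engine (GPipe, DeepSpeed's pipeline module) overlaps them as in the schedule above.

Python · Layer-wise split with micro-batches (naive sketch)
import torch
import torch.nn as nn

# Hypothetical two-stage split across two GPUs
stage0 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to('cuda:0')
stage1 = nn.Sequential(nn.Linear(4096, 1024)).to('cuda:1')

def pipelined_forward(x, num_microbatches=4):
    outputs = []
    for micro in x.chunk(num_microbatches, dim=0):
        h = stage0(micro.to('cuda:0'))   # first half of the layers on GPU 0
        out = stage1(h.to('cuda:1'))     # activations shipped to GPU 1
        outputs.append(out)
    return torch.cat(outputs, dim=0)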

ZeRO — DeepSpeed's Memory Miracle

ZeRO (Zero Redundancy Optimizer) eliminates the memory redundancy in data parallel training. Instead of each GPU holding a full copy of weights + gradients + optimizer states, ZeRO shards them across GPUs:

ZeRO-1 · Optimizer State Sharding

Up to 4× memory reduction. Shard the optimizer states (Adam momentum and variance) across GPUs.

ZeRO-2 · + Gradient Sharding

Up to 8× memory reduction. Also shard the gradients.

ZeRO-3 · + Parameter Sharding

Memory per GPU shrinks roughly in proportion to the number of GPUs. Train 100B+ models on commodity hardware.

JSON · DeepSpeed ZeRO-3 config (ds_config.json)
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu" },
    "offload_param": { "device": "cpu" }
  },
  "bf16": { "enabled": true },
  "train_batch_size": 256,
  "gradient_accumulation_steps": 4
}

# Launch:
# deepspeed --num_gpus=8 train.py --deepspeed ds_config.json
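
On the script side, the config file is handed to deepspeed.initialize, which returns an engine that owns sharding, offload, mixed precision, and gradient accumulation. A minimal sketch follows; model and dataset are placeholders defined elsewhere.

Python · train.py using the ZeRO-3 config (sketch)
import deepspeed

model_engine, optimizer, dataloader, _ = deepspeed.initialize(
    model=model,                          # your nn.Module
    model_parameters=model.parameters(),
    training_data=dataset,                # DeepSpeed builds the distributed dataloader
    config='ds_config.json',
)

for batch in dataloader:
    loss = model_engine(batch).mean()
    model_engine.backward(loss)   # ZeRO-aware backward pass
    model_engine.step()           # optimizer step + accumulation handled by the engine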

Communication: NVLink & InfiniBand

Technology | Bandwidth | Use Case
PCIe 4.0 | 32 GB/s | CPU↔GPU (bottleneck)
NVLink 4.0 | 900 GB/s | GPU↔GPU within a node (H100)
NVSwitch | 3.6 TB/s | All-to-all within an 8-GPU node
InfiniBand HDR | 200 Gb/s | Node-to-node (multi-machine)
💡 When to Use Each Strategy

DDP: Model fits in one GPU — just want to train faster. Most common for <7B models.
Tensor Parallel: Large models (7B–70B) that need to be split within a node.
Pipeline + Tensor + Data: "3D parallelism" for 100B+ models across many nodes.

Frequently Asked Questions

How much does gradient communication overhead cost?

With NVLink between GPUs in the same machine, overhead is 5–15%. Across machines over InfiniBand, it's 15–30%. That's why GPU clusters use InfiniBand rather than regular Ethernet — the bandwidth difference (200 Gb/s vs 25 Gb/s) directly impacts training speed.
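
A rough back-of-envelope calculation makes the gap concrete. It assumes FP16 gradients for a 7B-parameter model and that a ring all-reduce moves roughly 2× the gradient bytes; all numbers are illustrative.

Python · Back-of-envelope all-reduce time (illustrative)
grad_bytes = 7e9 * 2      # 7B params × 2 bytes (FP16) ≈ 14 GB of gradients
traffic = grad_bytes * 2  # ring all-reduce sends/receives ~2× the data

for link, gb_per_s in [("NVLink 4.0", 900), ("InfiniBand HDR (200 Gb/s)", 25), ("25 GbE", 3.1)]:
    print(f"{link}: ~{traffic / (gb_per_s * 1e9):.2f} s per gradient sync")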

What is gradient accumulation?

If the batch size you want is too large for GPU memory, accumulate gradients over multiple forward/backward passes before calling optimizer.step(). With 4 accumulation steps, a per-GPU batch size of 32 behaves like 128. No extra memory is needed; each effective step simply takes longer. A minimal loop is sketched below.
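
Here is a minimal sketch of that loop; dividing the loss by the number of accumulation steps keeps the accumulated gradient an average rather than a sum (model, dataloader, and optimizer are assumed to exist).

Python · Gradient accumulation (sketch)
accumulation_steps = 4

optimizer.zero_grad()
for step, batch in enumerate(dataloader):
    loss = model(batch).mean() / accumulation_steps  # scale so gradients average
    loss.backward()                                   # grads accumulate in .grad
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()

With DDP, wrapping the non-update iterations in model.no_sync() also skips the all-reduce until the real optimizer step.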

Do I need distributed training as a beginner?

No. Start with a single GPU (Google Colab or Kaggle). DDP is needed when training speed or model size become bottlenecks. Fine-tuning 7B models with LoRA and QLoRA often fits on a single 24GB GPU — no distribution needed.
