Distributed Training

Training GPT-3 would have taken an estimated 355 GPU-years on a single GPU; in practice it was trained in a matter of weeks by spreading the work across thousands of GPUs. Distributed training is how you scale deep learning beyond what a single GPU can handle.

📖 Covers: Data Parallelism · Model Parallelism · Pipeline Parallelism · ZeRO · NVLink · PyTorch DDP

Why Distribute Training?

💾
Model too large for one GPU

GPT-3 (175B params) needs ~700GB at FP32 or ~350GB at FP16 for weights alone — far beyond any single GPU without sharding, quantisation, or offloading

⚡
Speed up training

Use 8 GPUs → roughly 8× faster (minus some communication overhead)

📊
Larger effective batch size

Aggregate gradients across machines for more stable training

Strategy 1: Data Parallelism (DDP)

The simplest and most common approach. Each GPU gets a complete copy of the model and a different mini-batch of data. After the backward pass, gradients are averaged across all GPUs (all-reduce), then each GPU updates its copy.

Python · PyTorch DDP (Distributed Data Parallel)
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

# Initialise process group
dist.init_process_group(backend='nccl')  # nccl = NVIDIA's collective comms library
local_rank = int(os.environ['LOCAL_RANK'])
torch.cuda.set_device(local_rank)

# Wrap model in DDP
model = MyModel().to(local_rank)
model = DDP(model, device_ids=[local_rank])
optimizer = torch.optim.AdamW(model.parameters())  # any optimizer works here

# DistributedSampler ensures each GPU gets different data
sampler = DistributedSampler(dataset)
dataloader = DataLoader(dataset, sampler=sampler, batch_size=32)

# Training loop — gradients automatically averaged!
for batch in dataloader:
    optimizer.zero_grad()
    loss = model(batch).mean()
    loss.backward()  # All-reduce happens here
    optimizer.step()

# Launch: torchrun --nproc_per_node=4 train.py
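
Under the hood, DDP's synchronisation is an all-reduce: every GPU ends up with the mean of all GPUs' gradients. Here is a minimal sketch of the same operation done by hand with torch.distributed; DDP also buckets gradients and overlaps communication with the backward pass, which this sketch does not.

Python · Manual gradient all-reduce (what DDP automates)
import torch.distributed as dist

def average_gradients(model):
    # Sum each gradient tensor across all ranks, then divide by world size
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size

# Usage: call between loss.backward() and optimizer.step()
# when training without the DDP wrapper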

Strategy 2: Model Parallelism (Tensor Parallelism)

When the model is too large for a single GPU, split the model itself across GPUs. Tensor parallelism shards individual weight matrices — each GPU holds a slice of each layer and they communicate after each matrix multiply.

Naive Model Parallel

Layer 1 on GPU 0, Layer 2 on GPU 1. Simple but only one GPU active at a time.

GPU utilisation: ~50% with 2 GPUs, ~25% with 4 (only one GPU active at a time)
Tensor Parallel (Megatron-LM)

Each weight matrix is split column/row-wise. All GPUs active simultaneously.

GPU utilisation: ~85%+
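
To make the column-wise split concrete, here is a minimal, forward-only sketch of a column-parallel linear layer in plain PyTorch. It is illustrative only; Megatron-LM's real layers use autograd-aware collectives so that gradients flow back through the communication.

Python · Column-parallel linear layer (forward-only sketch)
import torch
import torch.distributed as dist
import torch.nn.functional as F

class ColumnParallelLinear(torch.nn.Module):
    """Each rank stores a column slice of the full weight matrix."""
    def __init__(self, in_features, out_features):
        super().__init__()
        world_size = dist.get_world_size()
        assert out_features % world_size == 0
        # This rank's slice: (out_features / world_size) x in_features
        self.weight = torch.nn.Parameter(
            torch.randn(out_features // world_size, in_features) * 0.02
        )

    def forward(self, x):
        local_out = F.linear(x, self.weight)  # partial output features
        # Collect every rank's slice and stitch the full output back together
        slices = [torch.empty_like(local_out) for _ in range(dist.get_world_size())]
        dist.all_gather(slices, local_out)
        return torch.cat(slices, dim=-1)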

Strategy 3: Pipeline Parallelism

Split the model layer-wise (layers 1–4 on GPU 0, layers 5–8 on GPU 1, etc.). Use micro-batches so multiple batches are in flight simultaneously, like a CPU pipeline.

Pipeline schedule sketch (2 GPUs, 2 micro-batches): each GPU alternates forward (F1, F2) and backward (B2, B1) passes on different micro-batches, so both GPUs stay busy instead of idling while the other works. F = forward pass, B = backward pass.
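
Below is a minimal sketch of the layer-wise split plus micro-batching in plain PyTorch, with a hypothetical two-stage model. This naive loop runs the stages one after another; a real pipeline engine (GPipe, DeepSpeed's pipeline module) overlaps them as in the schedule above.

Python · Layer-wise split with micro-batches (naive sketch)
import torch
import torch.nn as nn

# Hypothetical two-stage split across two GPUs
stage0 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to('cuda:0')
stage1 = nn.Sequential(nn.Linear(4096, 1024)).to('cuda:1')

def pipelined_forward(x, num_microbatches=4):
    outputs = []
    for micro in x.chunk(num_microbatches, dim=0):
        h = stage0(micro.to('cuda:0'))   # first half of the layers on GPU 0
        out = stage1(h.to('cuda:1'))     # activations shipped to GPU 1
        outputs.append(out)
    return torch.cat(outputs, dim=0)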

ZeRO — DeepSpeed's Memory Miracle

ZeRO (Zero Redundancy Optimizer) eliminates the memory redundancy in data parallel training. Instead of each GPU holding a full copy of weights + gradients + optimizer states, ZeRO shards them across GPUs:

ZeRO-1 · Optimizer State Sharding

Up to 4× memory reduction. Shard the optimizer states (Adam momentum and variance) across GPUs.

ZeRO-2 · + Gradient Sharding

Up to 8× memory reduction. Also shard the gradients.

ZeRO-3 · + Parameter Sharding

Memory per GPU shrinks roughly in proportion to the number of GPUs. Train 100B+ models on commodity hardware.

JSON · DeepSpeed ZeRO-3 config (ds_config.json)
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu" },
    "offload_param": { "device": "cpu" }
  },
  "bf16": { "enabled": true },
  "train_batch_size": 256,
  "gradient_accumulation_steps": 4
}

# Launch:
# deepspeed --num_gpus=8 train.py --deepspeed ds_config.json
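
On the script side, the config file is handed to deepspeed.initialize, which returns an engine that owns sharding, offload, mixed precision, and gradient accumulation. A minimal sketch follows; model and dataset are placeholders defined elsewhere.

Python · train.py using the ZeRO-3 config (sketch)
import deepspeed

model_engine, optimizer, dataloader, _ = deepspeed.initialize(
    model=model,                          # your nn.Module
    model_parameters=model.parameters(),
    training_data=dataset,                # DeepSpeed builds the distributed dataloader
    config='ds_config.json',
)

for batch in dataloader:
    loss = model_engine(batch).mean()
    model_engine.backward(loss)   # ZeRO-aware backward pass
    model_engine.step()           # optimizer step + accumulation handled by the engine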

Communication: NVLink & InfiniBand

Technology | Bandwidth | Use Case
PCIe 4.0 | 32 GB/s | CPU↔GPU (bottleneck)
NVLink 4.0 | 900 GB/s | GPU↔GPU within a node (H100)
NVSwitch | 3.6 TB/s | All-to-all within an 8-GPU node
InfiniBand HDR | 200 Gb/s | Node-to-node (multi-machine)
💡 When to Use Each Strategy

DDP: Model fits in one GPU — just want to train faster. Most common for <7B models.
Tensor Parallel: Large models (7B–70B) that need to be split within a node.
Pipeline + Tensor + Data: "3D parallelism" for 100B+ models across many nodes.

Frequently Asked Questions

How much does gradient communication overhead cost?

With NVLink between GPUs in the same machine, overhead is 5–15%. Across machines over InfiniBand, it's 15–30%. That's why GPU clusters use InfiniBand rather than regular Ethernet — the bandwidth difference (200 Gb/s vs 25 Gb/s) directly impacts training speed.
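
A rough back-of-envelope calculation makes the gap concrete. It assumes FP16 gradients for a 7B-parameter model and that a ring all-reduce moves roughly 2× the gradient bytes; all numbers are illustrative.

Python · Back-of-envelope all-reduce time (illustrative)
grad_bytes = 7e9 * 2      # 7B params × 2 bytes (FP16) ≈ 14 GB of gradients
traffic = grad_bytes * 2  # ring all-reduce sends/receives ~2× the data

for link, gb_per_s in [("NVLink 4.0", 900), ("InfiniBand HDR (200 Gb/s)", 25), ("25 GbE", 3.1)]:
    print(f"{link}: ~{traffic / (gb_per_s * 1e9):.2f} s per gradient sync")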

What is gradient accumulation?

If the batch size you want is too large for GPU memory, accumulate gradients over multiple forward/backward passes before calling optimizer.step(). With 4 accumulation steps, a per-GPU batch size of 32 behaves like 128. No extra memory is needed; each effective step simply takes longer. A minimal loop is sketched below.
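
Here is a minimal sketch of that loop; dividing the loss by the number of accumulation steps keeps the accumulated gradient an average rather than a sum (model, dataloader, and optimizer are assumed to exist).

Python · Gradient accumulation (sketch)
accumulation_steps = 4

optimizer.zero_grad()
for step, batch in enumerate(dataloader):
    loss = model(batch).mean() / accumulation_steps  # scale so gradients average
    loss.backward()                                   # grads accumulate in .grad
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()

With DDP, wrapping the non-update iterations in model.no_sync() also skips the all-reduce until the real optimizer step.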

Do I need distributed training as a beginner?

No. Start with a single GPU (Google Colab or Kaggle). DDP is needed when training speed or model size become bottlenecks. Fine-tuning 7B models with LoRA and QLoRA often fits on a single 24GB GPU — no distribution needed.
