Distributed Computing for AI
A trillion-parameter model doesn't fit in one GPU. Even if it did, training it would take years. Distributed computing for AI is the art of splitting both the model and the data across hundreds or thousands of GPUs so that they all cooperate efficiently — and that's harder than it sounds.
Why Distributed Training is Necessary
Consider GPT-3 with 175 billion parameters. In BF16 precision, that's 350GB of model weights — before optimizer states. An H100 has 80GB of GPU memory. The model doesn't fit on one GPU. And even if it did, you'd need years of training time. Distributing the work across thousands of GPUs is not optional — it's the only way large models exist.
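The arithmetic is worth making concrete. The sketch below assumes a common mixed-precision Adam recipe (BF16 weights and gradients, FP32 master weights and optimizer moments); exact numbers vary by training setup:

```python
# Back-of-the-envelope memory for a 175B-parameter model.
# Assumes BF16 weights/gradients plus a standard FP32 Adam state (an assumption; recipes vary).
params = 175e9

weights_bf16 = params * 2              # 2 bytes per BF16 weight
grads_bf16 = params * 2                # gradients, same precision
master_weights_fp32 = params * 4       # FP32 master copy of the weights
adam_states_fp32 = params * 4 * 2      # FP32 momentum + variance

total = weights_bf16 + grads_bf16 + master_weights_fp32 + adam_states_fp32
print(f"weights alone:  {weights_bf16 / 1e9:.0f} GB")   # ~350 GB
print(f"training state: {total / 1e9:.0f} GB")          # ~2,800 GB, vs. 80 GB per H100
```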
Data Parallelism
Data parallelism is the simplest and most common form of distributed training. Each GPU has a complete copy of the model. Different GPUs process different batches of training data simultaneously. After each batch, they synchronize their gradients — averaging them across all GPUs — before updating the weights.
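In PyTorch this pattern is a thin wrapper around an ordinary training loop. A minimal sketch, assuming a placeholder model and dataloader (`MyModel` and `dataloader` are not real names from any library):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Typically launched with `torchrun --nproc_per_node=8 train.py`.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = MyModel().cuda()                    # placeholder: every GPU holds a full copy
model = DDP(model, device_ids=[local_rank]) # averages gradients across GPUs during backward()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

for batch in dataloader:                    # placeholder: a DistributedSampler feeds each rank different data
    loss = model(batch.cuda()).mean()
    loss.backward()                         # gradient all-reduce happens here
    optimizer.step()
    optimizer.zero_grad()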
How Gradient Synchronization Works
After each backward pass, each GPU has computed gradients for its batch. These must be averaged across all GPUs so that every GPU updates its weights identically. This is done via an All-Reduce collective operation — each GPU sends its gradients to all others and receives the average back. NCCL (NVIDIA Collective Communications Library) handles this efficiently over InfiniBand or NVLink.
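DDP does this bucketing and averaging for you, but a hand-rolled version makes the operation concrete. A minimal sketch of averaging gradients with a single collective per parameter:

```python
import torch.distributed as dist

def average_gradients(model):
    """Average each parameter's gradient across all data-parallel ranks."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            # Sum the gradient tensors across all GPUs, then divide by the GPU count.
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size
```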
Limitations of Data Parallelism
Data parallelism requires the full model to fit in each GPU's memory. If your model is 350GB and your GPU has 80GB, data parallelism alone doesn't work. You need model parallelism.
Model Parallelism
Model parallelism splits the model itself across multiple GPUs. Different GPUs hold different layers or parts of the model. Data flows sequentially through the GPUs as it passes through the model layers.
Tensor Parallelism
Tensor parallelism splits individual matrix operations (like attention heads and feed-forward layers) across multiple GPUs. GPU 0 computes columns 0–511 of a weight matrix; GPU 1 computes columns 512–1023. The partial results are then combined with a collective operation (an all-gather for a column split, an all-reduce for a row split). This requires very fast interconnects (NVLink) because data moves between GPUs at every layer. Megatron-LM (NVIDIA's framework) uses this approach to split transformer layers.
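The sketch below illustrates the idea with a toy column-parallel linear layer on two GPUs (shapes are made up; Megatron-LM's real implementation fuses this with the surrounding attention and MLP math):

```python
import torch
import torch.distributed as dist

# Full layer computes y = x @ W, with W of shape (4096, 1024).
# With 2 tensor-parallel ranks, each GPU holds half of W's columns: (4096, 512).
world_size = dist.get_world_size()
W_shard = torch.randn(4096, 1024 // world_size, device="cuda")  # this rank's slice of W

def column_parallel_linear(x):
    y_local = x @ W_shard                                # partial output: (batch, 512)
    # Gather every rank's slice and concatenate to rebuild the full (batch, 1024) output.
    pieces = [torch.empty_like(y_local) for _ in range(world_size)]
    dist.all_gather(pieces, y_local)
    return torch.cat(pieces, dim=-1)
```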
Pipeline Parallelism
Pipeline parallelism assigns different transformer layers to different GPUs. GPU 0 runs layers 1–8, GPU 1 runs layers 9–16, and so on. Data flows through them like an assembly line. The challenge is the pipeline bubble: with only one batch in flight, a single stage works at a time, so while GPU 1 processes a batch, GPU 0 sits idle. Micro-batching splits each mini-batch into smaller micro-batches and feeds them through back to back, keeping every stage busy. GPipe and PipeDream implement this.
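A toy two-stage forward pass with micro-batching might look like the sketch below (the stage modules are placeholders; real schedulers like GPipe interleave forward and backward passes to actually keep both GPUs busy):

```python
import torch

def pipeline_forward(stage0, stage1, minibatch, n_micro=4):
    """Naive two-stage pipelined forward pass: split the mini-batch into micro-batches."""
    outputs = []
    for mb in minibatch.chunk(n_micro):
        h = stage0(mb.to("cuda:0"))      # e.g. layers 1-8 on GPU 0
        h = h.to("cuda:1")               # hand the activations to the next stage
        outputs.append(stage1(h))        # e.g. layers 9-16 on GPU 1
    return torch.cat(outputs)
```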
3D Parallelism: Combining All Three
State-of-the-art training uses all three parallelism strategies simultaneously: tensor parallelism across the GPUs within a server (where NVLink bandwidth is highest), pipeline parallelism across groups of servers, and data parallelism replicating that entire arrangement to process more data per step. These three axes are the "3D" in 3D parallelism.
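One way to picture the arrangement is with PyTorch's device mesh, which names the three axes explicitly. A sketch, assuming a recent PyTorch version and a made-up 2 × 2 × 2 layout over 8 GPUs:

```python
from torch.distributed.device_mesh import init_device_mesh

# 8 GPUs arranged as 2 data-parallel replicas x 2 pipeline stages x 2 tensor-parallel shards.
mesh = init_device_mesh("cuda", (2, 2, 2), mesh_dim_names=("dp", "pp", "tp"))

dp_group = mesh["dp"].get_group()   # ranks holding replicas of the same model shard
pp_group = mesh["pp"].get_group()   # ranks holding consecutive groups of layers
tp_group = mesh["tp"].get_group()   # ranks splitting individual matrix operations
```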
FSDP: Fully Sharded Data Parallel
FSDP (PyTorch's Fully Sharded Data Parallel) is a modern approach that shards model parameters, optimizer states, and gradients across GPUs. Each GPU only holds 1/N of all these tensors — dramatically reducing per-GPU memory. Parameters are gathered from other GPUs just-in-time for computation, then discarded. This achieves memory efficiency close to model parallelism while maintaining the simpler programming model of data parallelism.
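Wrapping a model in FSDP is close to a one-liner. A minimal sketch, assuming a placeholder model (in practice the auto-wrap policy, mixed-precision settings, and sharding strategy matter a lot):

```python
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

model = MyModel().cuda()    # placeholder model
# Parameters, gradients, and optimizer state are sharded across all ranks;
# full parameters are all-gathered layer by layer only while they are needed.
model = FSDP(model)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
```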
Frequently Asked Questions
How do GPUs communicate during distributed training?
Through collective communications libraries. NCCL (NVIDIA Collective Communications Library) handles All-Reduce, All-Gather, Reduce-Scatter, and Broadcast operations between GPUs. Within a single server, this runs over NVLink (900 GB/s). Between servers, it runs over InfiniBand (400 Gb/s+) or high-speed Ethernet (RoCEv2). Communication speed is often the bottleneck — this is why AI networking is its own specialty.
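In PyTorch these collectives are exposed through torch.distributed once the NCCL backend is initialized. A small sketch with arbitrary tensor shapes:

```python
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")     # NCCL routes over NVLink / InfiniBand as available
t = torch.ones(4, device="cuda") * dist.get_rank()

dist.all_reduce(t, op=dist.ReduceOp.SUM)    # every rank ends up with the element-wise sum
dist.broadcast(t, src=0)                    # every rank receives rank 0's tensor
gathered = [torch.empty_like(t) for _ in range(dist.get_world_size())]
dist.all_gather(gathered, t)                # every rank collects every rank's tensor
```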
What is gradient accumulation and why is it useful?
Gradient accumulation lets you simulate a larger batch size than your GPU memory allows. Instead of updating weights after each mini-batch, you accumulate gradients over N mini-batches before taking an optimizer step. If your GPU can handle a batch of 8 examples, you can accumulate for 16 steps to simulate a batch of 128. This is useful when you can't afford enough GPUs for the batch size your training recipe requires.
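A sketch of the pattern, assuming a placeholder model, optimizer, and dataloader yielding batches of 8; the key detail is dividing the loss so the accumulated gradient matches the large-batch average:

```python
accumulation_steps = 16                     # 8 examples/step x 16 steps = effective batch of 128

optimizer.zero_grad()
for step, batch in enumerate(dataloader):   # placeholder dataloader
    loss = model(batch).mean()
    (loss / accumulation_steps).backward()  # scale so summed gradients equal the large-batch average
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                    # one optimizer update per 128 examples
        optimizer.zero_grad()
```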
What is ZeRO optimization?
ZeRO (Zero Redundancy Optimizer) from Microsoft DeepSpeed is a memory optimization strategy that shards optimizer states, gradients, and parameters across data-parallel workers — eliminating the memory redundancy in standard data parallelism. ZeRO Stage 1 shards optimizer states; Stage 2 also shards gradients; Stage 3 shards all of the above plus model parameters. ZeRO-3 is essentially equivalent to FSDP and is the basis for training the largest models at Microsoft Research.
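With DeepSpeed, the ZeRO stage is a configuration choice. A sketch, with most config keys omitted and the model as a placeholder:

```python
import deepspeed

ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 3},   # 1: optimizer states, 2: + gradients, 3: + parameters
}

# deepspeed.initialize wires up the ZeRO sharding and returns a wrapped engine.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,                          # placeholder nn.Module
    model_parameters=model.parameters(),
    config=ds_config,
)
```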