Distributed Computing for AI
A trillion-parameter model doesn't fit in one GPU. Even if it did, training it would take years. Distributed computing for AI is the art of splitting both the model and the data across hundreds or thousands of GPUs so that they all cooperate efficiently — and that's harder than it sounds.
Why Distributed Training is Necessary
Consider GPT-3 with 175 billion parameters. In BF16 precision, that's 350GB of model weights — before optimizer states. An H100 has 80GB of GPU memory. The model doesn't fit on one GPU. And even if it did, you'd need years of training time. Distributing the work across thousands of GPUs is not optional — it's the only way large models exist.
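The arithmetic is worth making concrete. The sketch below assumes a common mixed-precision Adam recipe (BF16 weights and gradients, FP32 master weights and optimizer moments); exact numbers vary by training setup:

```python
# Back-of-the-envelope memory for a 175B-parameter model.
# Assumes BF16 weights/gradients plus a standard FP32 Adam state (an assumption; recipes vary).
params = 175e9

weights_bf16 = params * 2              # 2 bytes per BF16 weight
grads_bf16 = params * 2                # gradients, same precision
master_weights_fp32 = params * 4       # FP32 master copy of the weights
adam_states_fp32 = params * 4 * 2      # FP32 momentum + variance

total = weights_bf16 + grads_bf16 + master_weights_fp32 + adam_states_fp32
print(f"weights alone:  {weights_bf16 / 1e9:.0f} GB")   # ~350 GB
print(f"training state: {total / 1e9:.0f} GB")          # ~2,800 GB, vs. 80 GB per H100
```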
Data Parallelism
Data parallelism is the simplest and most common form of distributed training. Each GPU has a complete copy of the model. Different GPUs process different batches of training data simultaneously. After each batch, they synchronize their gradients — averaging them across all GPUs — before updating the weights.
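In PyTorch this pattern is a thin wrapper around an ordinary training loop. A minimal sketch, assuming a placeholder model and dataloader (`MyModel` and `dataloader` are not real names from any library):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Typically launched with `torchrun --nproc_per_node=8 train.py`.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = MyModel().cuda()                    # placeholder: every GPU holds a full copy
model = DDP(model, device_ids=[local_rank]) # averages gradients across GPUs during backward()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

for batch in dataloader:                    # placeholder: a DistributedSampler feeds each rank different data
    loss = model(batch.cuda()).mean()
    loss.backward()                         # gradient all-reduce happens here
    optimizer.step()
    optimizer.zero_grad()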
How Gradient Synchronization Works
After each backward pass, each GPU has computed gradients for its batch. These must be averaged across all GPUs so that every GPU updates its weights identically. This is done via an All-Reduce collective operation — each GPU sends its gradients to all others and receives the average back. NCCL (NVIDIA Collective Communications Library) handles this efficiently over InfiniBand or NVLink.
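DDP does this bucketing and averaging for you, but a hand-rolled version makes the operation concrete. A minimal sketch of averaging gradients with a single collective per parameter:

```python
import torch.distributed as dist

def average_gradients(model):
    """Average each parameter's gradient across all data-parallel ranks."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            # Sum the gradient tensors across all GPUs, then divide by the GPU count.
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size
```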
Limitations of Data Parallelism
Data parallelism requires the full model to fit in each GPU's memory. If your model is 350GB and your GPU has 80GB, data parallelism alone doesn't work. You need model parallelism.
Model Parallelism
Model parallelism splits the model itself across multiple GPUs. Different GPUs hold different layers or parts of the model. Data flows sequentially through the GPUs as it passes through the model layers.
Tensor Parallelism
Tensor parallelism splits individual matrix operations (like attention heads and feed-forward layers) across multiple GPUs. GPU 0 computes columns 0–511 of a weight matrix; GPU 1 computes columns 512–1023. The partial results are then combined with a collective operation (an all-gather for a column split, an all-reduce for a row split). This requires very fast interconnects (NVLink) because data moves between GPUs at every layer. Megatron-LM (NVIDIA's framework) uses this approach to split transformer layers.
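The sketch below illustrates the idea with a toy column-parallel linear layer on two GPUs (shapes are made up; Megatron-LM's real implementation fuses this with the surrounding attention and MLP math):

```python
import torch
import torch.distributed as dist

# Full layer computes y = x @ W, with W of shape (4096, 1024).
# With 2 tensor-parallel ranks, each GPU holds half of W's columns: (4096, 512).
world_size = dist.get_world_size()
W_shard = torch.randn(4096, 1024 // world_size, device="cuda")  # this rank's slice of W

def column_parallel_linear(x):
    y_local = x @ W_shard                                # partial output: (batch, 512)
    # Gather every rank's slice and concatenate to rebuild the full (batch, 1024) output.
    pieces = [torch.empty_like(y_local) for _ in range(world_size)]
    dist.all_gather(pieces, y_local)
    return torch.cat(pieces, dim=-1)
```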
Pipeline Parallelism
Pipeline parallelism assigns different transformer layers to different GPUs. GPU 0 runs layers 1–8, GPU 1 runs layers 9–16, and so on. Data flows through them like an assembly line. The challenge is the pipeline bubble: with only one batch in flight, a single stage works at a time, so while GPU 1 processes a batch, GPU 0 sits idle. Micro-batching splits each mini-batch into smaller micro-batches and feeds them through back to back, keeping every stage busy. GPipe and PipeDream implement this.
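A toy two-stage forward pass with micro-batching might look like the sketch below (the stage modules are placeholders; real schedulers like GPipe interleave forward and backward passes to actually keep both GPUs busy):

```python
import torch

def pipeline_forward(stage0, stage1, minibatch, n_micro=4):
    """Naive two-stage pipelined forward pass: split the mini-batch into micro-batches."""
    outputs = []
    for mb in minibatch.chunk(n_micro):
        h = stage0(mb.to("cuda:0"))      # e.g. layers 1-8 on GPU 0
        h = h.to("cuda:1")               # hand the activations to the next stage
        outputs.append(stage1(h))        # e.g. layers 9-16 on GPU 1
    return torch.cat(outputs)
```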
3D Parallelism: Combining All Three
State-of-the-art training uses all three parallelism strategies simultaneously: tensor parallelism across the GPUs within a server (where NVLink bandwidth is highest), pipeline parallelism across groups of servers, and data parallelism replicating that entire arrangement to process more data per step. These three axes are the "3D" in 3D parallelism.
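One way to picture the arrangement is with PyTorch's device mesh, which names the three axes explicitly. A sketch, assuming a recent PyTorch version and a made-up 2 × 2 × 2 layout over 8 GPUs:

```python
from torch.distributed.device_mesh import init_device_mesh

# 8 GPUs arranged as 2 data-parallel replicas x 2 pipeline stages x 2 tensor-parallel shards.
mesh = init_device_mesh("cuda", (2, 2, 2), mesh_dim_names=("dp", "pp", "tp"))

dp_group = mesh["dp"].get_group()   # ranks holding replicas of the same model shard
pp_group = mesh["pp"].get_group()   # ranks holding consecutive groups of layers
tp_group = mesh["tp"].get_group()   # ranks splitting individual matrix operations
```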
FSDP: Fully Sharded Data Parallel
FSDP (PyTorch's Fully Sharded Data Parallel) is a modern approach that shards model parameters, optimizer states, and gradients across GPUs. Each GPU only holds 1/N of all these tensors — dramatically reducing per-GPU memory. Parameters are gathered from other GPUs just-in-time for computation, then discarded. This achieves memory efficiency close to model parallelism while maintaining the simpler programming model of data parallelism.
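Wrapping a model in FSDP is close to a one-liner. A minimal sketch, assuming a placeholder model (in practice the auto-wrap policy, mixed-precision settings, and sharding strategy matter a lot):

```python
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

model = MyModel().cuda()    # placeholder model
# Parameters, gradients, and optimizer state are sharded across all ranks;
# full parameters are all-gathered layer by layer only while they are needed.
model = FSDP(model)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
```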
Frequently Asked Questions
How do GPUs communicate during distributed training?
Through collective communications libraries. NCCL (NVIDIA Collective Communications Library) handles All-Reduce, All-Gather, Reduce-Scatter, and Broadcast operations between GPUs. Within a single server, this runs over NVLink (900 GB/s). Between servers, it runs over InfiniBand (400 Gb/s+) or high-speed Ethernet (RoCEv2). Communication speed is often the bottleneck — this is why AI networking is its own specialty.
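In PyTorch these collectives are exposed through torch.distributed once the NCCL backend is initialized. A small sketch with arbitrary tensor shapes:

```python
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")     # NCCL routes over NVLink / InfiniBand as available
t = torch.ones(4, device="cuda") * dist.get_rank()

dist.all_reduce(t, op=dist.ReduceOp.SUM)    # every rank ends up with the element-wise sum
dist.broadcast(t, src=0)                    # every rank receives rank 0's tensor
gathered = [torch.empty_like(t) for _ in range(dist.get_world_size())]
dist.all_gather(gathered, t)                # every rank collects every rank's tensor
```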
What is gradient accumulation and why is it useful?
Gradient accumulation lets you simulate a larger batch size than your GPU memory allows. Instead of updating weights after each mini-batch, you accumulate gradients over N mini-batches before taking an optimizer step. If your GPU can handle a batch of 8 examples, you can accumulate for 16 steps to simulate a batch of 128. This is useful when you can't afford enough GPUs for the batch size your training recipe requires.
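A sketch of the pattern, assuming a placeholder model, optimizer, and dataloader yielding batches of 8; the key detail is dividing the loss so the accumulated gradient matches the large-batch average:

```python
accumulation_steps = 16                     # 8 examples/step x 16 steps = effective batch of 128

optimizer.zero_grad()
for step, batch in enumerate(dataloader):   # placeholder dataloader
    loss = model(batch).mean()
    (loss / accumulation_steps).backward()  # scale so summed gradients equal the large-batch average
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                    # one optimizer update per 128 examples
        optimizer.zero_grad()
```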
What is ZeRO optimization?
ZeRO (Zero Redundancy Optimizer) from Microsoft DeepSpeed is a memory optimization strategy that shards optimizer states, gradients, and parameters across data-parallel workers — eliminating the memory redundancy in standard data parallelism. ZeRO Stage 1 shards optimizer states; Stage 2 also shards gradients; Stage 3 shards all of the above plus model parameters. ZeRO-3 is essentially equivalent to FSDP and is the basis for training the largest models at Microsoft Research.
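With DeepSpeed, the ZeRO stage is a configuration choice. A sketch, with most config keys omitted and the model as a placeholder:

```python
import deepspeed

ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 3},   # 1: optimizer states, 2: + gradients, 3: + parameters
}

# deepspeed.initialize wires up the ZeRO sharding and returns a wrapped engine.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,                          # placeholder nn.Module
    model_parameters=model.parameters(),
    config=ds_config,
)
```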