AI-Optimized Networking

When 1,000 GPUs need to synchronize gradients every few milliseconds, regular Ethernet is the equivalent of using a garden hose to fill a swimming pool. AI training at scale requires specialized networking that moves data between GPUs faster than most data centers move data between entire racks.

Why Standard Networking Isn't Enough

Typical data center Ethernet tops out at 25–100 Gbps per server. During distributed AI training, each All-Reduce operation synchronizes gradients across all GPUs. For a 70B parameter model with 16-bit gradients, that's roughly 140 GB (70 billion parameters × 2 bytes) that must be exchanged among all nodes after every training step. At 100 Gbps, that transfer alone would take over 11 seconds per step, and training would be dominated by communication, not computation.

The bandwidth gap: An H100's internal memory bandwidth is 3.35 TB/s. Standard 100GbE networking provides 12.5 GB/s. The GPU is 268x faster internally than the network connecting it to other GPUs. Closing that gap is the entire goal of AI networking.
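These numbers are easy to verify. A quick back-of-envelope check in Python, assuming 16-bit gradients (2 bytes per parameter) and ignoring protocol overhead and All-Reduce algorithm details:

```python
# Back-of-envelope check of the claims above.
# Assumes fp16 gradients (2 bytes/parameter); ignores protocol overhead.
params = 70e9                            # 70B-parameter model
grad_bytes = params * 2                  # ~140 GB of gradients per step

link_bytes_per_s = 100e9 / 8             # 100 Gbps Ethernet = 12.5 GB/s
hbm_bytes_per_s = 3.35e12                # H100 HBM3 bandwidth

print(f"gradients/step: {grad_bytes / 1e9:.0f} GB")
print(f"time at 100GbE: {grad_bytes / link_bytes_per_s:.1f} s")      # ~11.2 s
print(f"HBM vs network: {hbm_bytes_per_s / link_bytes_per_s:.0f}x")  # 268x
```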

InfiniBand — The Gold Standard

InfiniBand is a high-speed networking technology originally developed for supercomputers and HPC clusters. It dominates AI training infrastructure because it delivers both very high bandwidth and very low latency.

HDR and NDR InfiniBand

HDR InfiniBand (200 Gbps per port) was the standard for A100 clusters. NDR InfiniBand (400 Gbps per port) is deployed with H100 clusters. Nodes carry multiple links: a DGX H100 pairs each of its eight GPUs with its own 400 Gbps NIC. NVIDIA's Quantum-2 InfiniBand switch provides 51.2 Tbps of aggregate bidirectional bandwidth, roughly 6.4 terabytes moved every second.

RDMA — Remote Direct Memory Access

InfiniBand uses RDMA — a technology that allows one computer to directly read from or write to the memory of another computer without involving the CPU of either. Data goes GPU → NIC → network → NIC → GPU memory, bypassing the host CPU entirely. This dramatically reduces latency (sub-microsecond) and CPU overhead during gradient synchronization.
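In practice you rarely program RDMA verbs by hand; training frameworks reach RDMA through NCCL. As a sketch, these are real NCCL environment variables you can set to confirm GPUDirect RDMA is actually in use (the values shown are illustrative, and the right GDR level depends on your PCIe topology):

```python
import os

# Must be set before NCCL initializes.
os.environ["NCCL_DEBUG"] = "INFO"          # logs the transport NCCL picks
os.environ["NCCL_NET_GDR_LEVEL"] = "PHB"   # allow GPU<->NIC DMA up to the
                                           # host bridge (illustrative value)

import torch.distributed as dist

# Launched via torchrun, so rank and world size come from the environment.
dist.init_process_group(backend="nccl")
# The NCCL INFO log now shows which transport each connection uses and
# whether GPUDirect RDMA was enabled.
```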

RoCE — InfiniBand Over Ethernet

RoCEv2 (RDMA over Converged Ethernet) brings RDMA semantics to standard Ethernet infrastructure. It's a cost compromise — you get RDMA's low CPU overhead at lower hardware cost than native InfiniBand, using 400GbE or 800GbE network equipment.

Who Uses RoCE

Many hyperscalers run RDMA-over-Ethernet fabrics instead of InfiniBand. Meta has built large RoCE-based H100 clusters for Llama training (alongside an InfiniBand variant of the same size), and AWS takes a related approach with EFA (Elastic Fabric Adapter), which runs its own RDMA-capable protocol over Ethernet rather than RoCE. EFA provides up to 3,200 Gbps of aggregate network bandwidth on p5 instances (H100 clusters). Neither is InfiniBand, but both achieve similar All-Reduce performance for most training workloads.

Network Topology for AI Clusters

How you connect thousands of GPUs together matters as much as the speed of individual links. The topology determines which GPUs can communicate efficiently with which others.

Fat-Tree Topology

The most common data center network topology: a tree of switches in which the aggregate uplink bandwidth into the top layer ("spine") matches the downlink bandwidth at the bottom ("leaves"), so every server can communicate with every other at full line rate. NVIDIA's DGX SuperPOD uses fat-tree InfiniBand topologies to provide all-to-all, full-bandwidth communication between thousands of GPUs.
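As a toy illustration of why fat-trees are built this way, here is the sizing arithmetic for a two-tier, non-blocking leaf-spine fabric made of 64-port switches (the numbers are hypothetical, not a SuperPOD reference design):

```python
import math

def nonblocking_leaf_spine(switch_ports: int, servers: int):
    """Size a two-tier fat-tree with full bisection bandwidth:
    each leaf uses half its ports down (servers), half up (spines)."""
    down = switch_ports // 2
    leaves = math.ceil(servers / down)
    spines = math.ceil(leaves * down / switch_ports)
    return leaves, spines

leaves, spines = nonblocking_leaf_spine(64, 1024)
print(f"1024 server ports -> {leaves} leaves, {spines} spines")
# 1024 servers -> 32 leaf switches, 16 spine switches
```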

3D Torus (Google TPU Pods)

Google's TPU pods use a 3D torus topology: each chip connects to six neighbors, one in each direction along three dimensions, with links that wrap around at the edges. A torus is far cheaper to scale than a full-bandwidth fat-tree (no enormous central switch tier) and maps well onto the ring- and nearest-neighbor-structured collectives used in large-scale training, though it offers less bisection bandwidth for unstructured all-to-all traffic.
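The torus structure is simple to express in code. A minimal sketch (dimensions chosen arbitrarily) showing that every chip, including those on the edges, has exactly six neighbors thanks to the wraparound links:

```python
def torus_neighbors(coord, dims):
    """The six neighbors of a chip in a 3D torus: one step in each
    direction along each of three axes, wrapping at the edges."""
    neighbors = []
    for axis in range(3):
        for step in (-1, 1):
            n = list(coord)
            n[axis] = (n[axis] + step) % dims[axis]  # wraparound link
            neighbors.append(tuple(n))
    return neighbors

# A corner chip in a 4x4x4 torus still has six neighbors.
print(torus_neighbors((0, 0, 0), (4, 4, 4)))
# [(3, 0, 0), (1, 0, 0), (0, 3, 0), (0, 1, 0), (0, 0, 3), (0, 0, 1)]
```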

Frequently Asked Questions

Does networking matter for AI inference (not just training)?

Yes, increasingly so. Serving large models (70B+) requires tensor parallelism across multiple GPUs — which requires the same low-latency interconnects as training. For smaller models served on a single GPU, standard networking is fine. But as "frontier model inference" becomes a standard service, the networking requirements for inference are approaching those of training. vLLM and TensorRT-LLM both have optimizations for multi-GPU inference that require fast interconnects.
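For example, vLLM exposes tensor-parallel serving through a single argument (the model name below is illustrative); the shards then communicate over NVLink or the network fabric on every forward pass:

```python
from vllm import LLM, SamplingParams

# Shard one large model across 4 GPUs. Every generated token requires
# collective communication between the shards, so interconnect latency
# directly affects tokens per second.
llm = LLM(model="meta-llama/Llama-3.1-70B-Instruct",  # illustrative model
          tensor_parallel_size=4)
outputs = llm.generate(["Why does inference need fast interconnects?"],
                       SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```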

What is AWS EFA and how does it compare to InfiniBand?

AWS Elastic Fabric Adapter (EFA) is Amazon's custom low-latency network interface for HPC and AI workloads. It provides RDMA-like semantics (using a custom protocol called SRD — Scalable Reliable Datagram) over a custom-built Amazon network fabric. On p5 instances (H100 nodes), EFA provides 3,200 Gbps aggregate bandwidth per instance across 32 network interfaces. It's not native InfiniBand, but performance in practice is competitive for most distributed training workloads.

How does NCCL use the network?

NCCL (NVIDIA Collective Communications Library) is the software layer between PyTorch/TensorFlow and the network hardware. NCCL implements collective operations (All-Reduce, All-Gather, Reduce-Scatter, Broadcast) and automatically detects whether to use NVLink (within a server), InfiniBand/RoCE (between servers), or PCIe. It selects the optimal communication algorithm (ring, tree, or direct-connect) based on the topology. NCCL is why you can write torch.distributed code that runs on both a single workstation and a 1,000-GPU cluster without changes.
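A minimal sketch of what that looks like from PyTorch: the same All-Reduce call works on one node or a thousand, and NCCL chooses the transport underneath (launch with torchrun so rank and world size come from the environment):

```python
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank % torch.cuda.device_count())

# Stand-in for a gradient bucket; each rank contributes its own values.
grad = torch.full((1024,), float(rank), device="cuda")

# NCCL routes this over NVLink, InfiniBand/RoCE, or PCIe as appropriate.
dist.all_reduce(grad, op=dist.ReduceOp.SUM)

print(f"rank {rank}: element value = {grad[0].item()}")
dist.destroy_process_group()
```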
