AI-Optimized Networking

When 1,000 GPUs need to synchronize gradients every few milliseconds, regular Ethernet is the equivalent of using a garden hose to fill a swimming pool. AI training at scale requires specialized networking that moves data between GPUs faster than most data centers move data between entire racks.

Why Standard Networking Isn't Enough

Typical data center Ethernet tops out at 25–100 Gbps per server. During distributed AI training, each All-Reduce operation synchronizes gradients across all GPUs. For a 70B parameter model with 16-bit gradients, that's roughly 140 GB (70 billion parameters × 2 bytes) that must be exchanged among all nodes after every training step. At 100 Gbps, that transfer alone would take over 11 seconds per step, and training would be dominated by communication, not computation.

The bandwidth gap: An H100's internal memory bandwidth is 3.35 TB/s. Standard 100GbE networking provides 12.5 GB/s. The GPU is 268x faster internally than the network connecting it to other GPUs. Closing that gap is the entire goal of AI networking.
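These numbers are easy to verify. A quick back-of-envelope check in Python, assuming 16-bit gradients (2 bytes per parameter) and ignoring protocol overhead and All-Reduce algorithm details:

```python
# Back-of-envelope check of the claims above.
# Assumes fp16 gradients (2 bytes/parameter); ignores protocol overhead.
params = 70e9                            # 70B-parameter model
grad_bytes = params * 2                  # ~140 GB of gradients per step

link_bytes_per_s = 100e9 / 8             # 100 Gbps Ethernet = 12.5 GB/s
hbm_bytes_per_s = 3.35e12                # H100 HBM3 bandwidth

print(f"gradients/step: {grad_bytes / 1e9:.0f} GB")
print(f"time at 100GbE: {grad_bytes / link_bytes_per_s:.1f} s")      # ~11.2 s
print(f"HBM vs network: {hbm_bytes_per_s / link_bytes_per_s:.0f}x")  # 268x
```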

InfiniBand — The Gold Standard

InfiniBand is a high-speed networking technology originally developed for supercomputers and HPC clusters. It dominates AI training infrastructure because it delivers both very high bandwidth and very low latency.

HDR and NDR InfiniBand

HDR InfiniBand (200 Gbps per port) was the standard for A100 clusters. NDR InfiniBand (400 Gbps per port) is deployed with H100 clusters. Nodes carry multiple links: a DGX H100 pairs each of its eight GPUs with its own 400 Gbps NIC. NVIDIA's Quantum-2 InfiniBand switch provides 51.2 Tbps of aggregate bidirectional bandwidth, roughly 6.4 terabytes moved every second.

RDMA — Remote Direct Memory Access

InfiniBand uses RDMA — a technology that allows one computer to directly read from or write to the memory of another computer without involving the CPU of either. Data goes GPU → NIC → network → NIC → GPU memory, bypassing the host CPU entirely. This dramatically reduces latency (sub-microsecond) and CPU overhead during gradient synchronization.
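In practice you rarely program RDMA verbs by hand; training frameworks reach RDMA through NCCL. As a sketch, these are real NCCL environment variables you can set to confirm GPUDirect RDMA is actually in use (the values shown are illustrative, and the right GDR level depends on your PCIe topology):

```python
import os

# Must be set before NCCL initializes.
os.environ["NCCL_DEBUG"] = "INFO"          # logs the transport NCCL picks
os.environ["NCCL_NET_GDR_LEVEL"] = "PHB"   # allow GPU<->NIC DMA up to the
                                           # host bridge (illustrative value)

import torch.distributed as dist

# Launched via torchrun, so rank and world size come from the environment.
dist.init_process_group(backend="nccl")
# The NCCL INFO log now shows which transport each connection uses and
# whether GPUDirect RDMA was enabled.
```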

RoCE — InfiniBand Over Ethernet

RoCEv2 (RDMA over Converged Ethernet) brings RDMA semantics to standard Ethernet infrastructure. It's a cost compromise — you get RDMA's low CPU overhead at lower hardware cost than native InfiniBand, using 400GbE or 800GbE network equipment.

Who Uses RoCE

Many hyperscalers run RDMA-over-Ethernet fabrics instead of InfiniBand. Meta has built large RoCE-based H100 clusters for Llama training (alongside an InfiniBand variant of the same size), and AWS takes a related approach with EFA (Elastic Fabric Adapter), which runs its own RDMA-capable protocol over Ethernet rather than RoCE. EFA provides up to 3,200 Gbps of aggregate network bandwidth on p5 instances (H100 clusters). Neither is InfiniBand, but both achieve similar All-Reduce performance for most training workloads.

Network Topology for AI Clusters

How you connect thousands of GPUs together matters as much as the speed of individual links. The topology determines which GPUs can communicate efficiently with which others.

Fat-Tree Topology

The most common data center network topology: a tree of switches in which the aggregate uplink bandwidth into the top layer ("spine") matches the downlink bandwidth at the bottom ("leaves"), so every server can communicate with every other at full line rate. NVIDIA's DGX SuperPOD uses fat-tree InfiniBand topologies to provide all-to-all, full-bandwidth communication between thousands of GPUs.
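As a toy illustration of why fat-trees are built this way, here is the sizing arithmetic for a two-tier, non-blocking leaf-spine fabric made of 64-port switches (the numbers are hypothetical, not a SuperPOD reference design):

```python
import math

def nonblocking_leaf_spine(switch_ports: int, servers: int):
    """Size a two-tier fat-tree with full bisection bandwidth:
    each leaf uses half its ports down (servers), half up (spines)."""
    down = switch_ports // 2
    leaves = math.ceil(servers / down)
    spines = math.ceil(leaves * down / switch_ports)
    return leaves, spines

leaves, spines = nonblocking_leaf_spine(64, 1024)
print(f"1024 server ports -> {leaves} leaves, {spines} spines")
# 1024 servers -> 32 leaf switches, 16 spine switches
```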

3D Torus (Google TPU Pods)

Google's TPU pods use a 3D torus topology: each chip connects to six neighbors, one in each direction along three dimensions, with links that wrap around at the edges. A torus is far cheaper to scale than a full-bandwidth fat-tree (no enormous central switch tier) and maps well onto the ring- and nearest-neighbor-structured collectives used in large-scale training, though it offers less bisection bandwidth for unstructured all-to-all traffic.
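The torus structure is simple to express in code. A minimal sketch (dimensions chosen arbitrarily) showing that every chip, including those on the edges, has exactly six neighbors thanks to the wraparound links:

```python
def torus_neighbors(coord, dims):
    """The six neighbors of a chip in a 3D torus: one step in each
    direction along each of three axes, wrapping at the edges."""
    neighbors = []
    for axis in range(3):
        for step in (-1, 1):
            n = list(coord)
            n[axis] = (n[axis] + step) % dims[axis]  # wraparound link
            neighbors.append(tuple(n))
    return neighbors

# A corner chip in a 4x4x4 torus still has six neighbors.
print(torus_neighbors((0, 0, 0), (4, 4, 4)))
# [(3, 0, 0), (1, 0, 0), (0, 3, 0), (0, 1, 0), (0, 0, 3), (0, 0, 1)]
```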

Frequently Asked Questions

Does networking matter for AI inference (not just training)?

Yes, increasingly so. Serving large models (70B+) requires tensor parallelism across multiple GPUs — which requires the same low-latency interconnects as training. For smaller models served on a single GPU, standard networking is fine. But as "frontier model inference" becomes a standard service, the networking requirements for inference are approaching those of training. vLLM and TensorRT-LLM both have optimizations for multi-GPU inference that require fast interconnects.
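For example, vLLM exposes tensor-parallel serving through a single argument (the model name below is illustrative); the shards then communicate over NVLink or the network fabric on every forward pass:

```python
from vllm import LLM, SamplingParams

# Shard one large model across 4 GPUs. Every generated token requires
# collective communication between the shards, so interconnect latency
# directly affects tokens per second.
llm = LLM(model="meta-llama/Llama-3.1-70B-Instruct",  # illustrative model
          tensor_parallel_size=4)
outputs = llm.generate(["Why does inference need fast interconnects?"],
                       SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```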

What is AWS EFA and how does it compare to InfiniBand?

AWS Elastic Fabric Adapter (EFA) is Amazon's custom low-latency network interface for HPC and AI workloads. It provides RDMA-like semantics (using a custom protocol called SRD — Scalable Reliable Datagram) over a custom-built Amazon network fabric. On p5 instances (H100 nodes), EFA provides 3,200 Gbps aggregate bandwidth per instance across 32 network interfaces. It's not native InfiniBand, but performance in practice is competitive for most distributed training workloads.

How does NCCL use the network?

NCCL (NVIDIA Collective Communications Library) is the software layer between PyTorch/TensorFlow and the network hardware. NCCL implements collective operations (All-Reduce, All-Gather, Reduce-Scatter, Broadcast) and automatically detects whether to use NVLink (within a server), InfiniBand/RoCE (between servers), or PCIe. It selects the optimal communication algorithm (ring, tree, or direct-connect) based on the topology. NCCL is why you can write torch.distributed code that runs on both a single workstation and a 1,000-GPU cluster without changes.
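A minimal sketch of what that looks like from PyTorch: the same All-Reduce call works on one node or a thousand, and NCCL chooses the transport underneath (launch with torchrun so rank and world size come from the environment):

```python
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank % torch.cuda.device_count())

# Stand-in for a gradient bucket; each rank contributes its own values.
grad = torch.full((1024,), float(rank), device="cuda")

# NCCL routes this over NVLink, InfiniBand/RoCE, or PCIe as appropriate.
dist.all_reduce(grad, op=dist.ReduceOp.SUM)

print(f"rank {rank}: element value = {grad[0].item()}")
dist.destroy_process_group()
```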
