GPU Clusters & AI Accelerators
Training GPT-4 reportedly used around 25,000 NVIDIA A100 GPUs for months. Training Llama 3 took thousands of H100s. Modern AI doesn't run on regular computers — it runs on specialized hardware designed specifically for the massive parallel math that deep learning requires. Let's understand what that hardware is and why it matters.
Why GPUs for AI?
A CPU (Central Processing Unit) is designed for sequential tasks — it has a small number of very powerful cores (typically 4–64) optimized for complex, branchy logic. A GPU (Graphics Processing Unit) is designed for parallel tasks — it has thousands of smaller cores that execute simple operations simultaneously.
The Matrix Math Connection
Deep learning is fundamentally matrix multiplication. Training a neural network means multiplying billions of numbers together, billions of times. GPUs were originally built to render 3D graphics — which is also matrix multiplication (transforming 3D coordinates to 2D pixels). Deep learning researchers noticed in 2012 that GPUs trained neural networks 10–100x faster than CPUs, triggering the AI hardware revolution.
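To make "deep learning is matrix multiplication" concrete, here is a minimal NumPy sketch of a single dense layer's forward pass. The sizes are purely illustrative; real models run this operation billions of times with far larger matrices, which is exactly the workload a GPU's thousands of cores parallelize.

```python
import numpy as np

# A dense (fully connected) layer is just: activation(inputs @ weights + bias)
rng = np.random.default_rng(0)

batch = 4          # 4 input examples
in_features = 8    # each example is an 8-dimensional vector
out_features = 3   # the layer produces 3 outputs per example

x = rng.standard_normal((batch, in_features))
w = rng.standard_normal((in_features, out_features))
b = rng.standard_normal(out_features)

y = x @ w + b            # the matrix multiply a GPU accelerates
y = np.maximum(y, 0.0)   # ReLU activation

print(y.shape)  # (4, 3)
```

Every row of `x` is multiplied against every column of `w` independently, which is why the work splits so cleanly across parallel cores.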
NVIDIA GPU Generations for AI
NVIDIA dominates the AI accelerator market. Knowing the recent generations helps you decode what's available in the cloud today: V100 (Volta, 2017), A100 (Ampere, 2020, 40GB or 80GB), H100 (Hopper, 2022, 80GB HBM3), and the Blackwell generation (B100/B200, announced 2024).
NVLink: Connecting GPUs Together
NVLink is NVIDIA's high-speed interconnect that links multiple GPUs within a single server at speeds up to 900 GB/s (NVLink 4.0). An NVIDIA DGX H100 server contains 8 H100 GPUs all connected via NVLink — they act as a single unified unit with 640GB of combined GPU memory. This is orders of magnitude faster than connecting GPUs via PCIe.
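A back-of-envelope calculation shows why the interconnect matters. The bandwidth figures below are approximate peak rates (PCIe Gen5 x16 at roughly 64 GB/s per direction, NVLink 4.0 at the 900 GB/s aggregate cited above), not measured throughput:

```python
# Rough time to move one H100's full 80 GB of HBM contents between GPUs.
# Bandwidths are approximate peaks; real transfers achieve somewhat less.
payload_gb = 80           # one H100's GPU memory
pcie_gen5_x16 = 64        # ~GB/s per direction, approximate peak
nvlink4 = 900             # ~GB/s aggregate per H100 via NVLink 4.0

pcie_seconds = payload_gb / pcie_gen5_x16
nvlink_seconds = payload_gb / nvlink4

print(f"PCIe Gen5 x16: {pcie_seconds:.2f} s")   # 1.25 s
print(f"NVLink 4.0:    {nvlink_seconds:.3f} s")
```

When 8 GPUs exchange gradients thousands of times per training run, that order-of-magnitude gap is the difference between GPUs waiting on the network and GPUs doing math.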
Google TPUs: An Alternative Architecture
Google's Tensor Processing Units (TPUs) are ASICs (Application-Specific Integrated Circuits) designed specifically for matrix operations in TensorFlow and JAX. Unlike GPUs which are general-purpose accelerators repurposed for AI, TPUs are built from the ground up for one thing: tensor math.
TPU v4 and v5e
TPU v4 pods can scale to 4,096 chips connected via a 3D torus interconnect, minimizing network bottlenecks between chips. Google trains Gemini on TPU pods. TPU v5e (2023) targets inference workloads at lower cost. If you're training in JAX or TensorFlow, TPUs on GCP can be competitive with or faster than GPU equivalents at large scale.
When to Choose TPUs vs. GPUs
Choose GPUs (NVIDIA) for: PyTorch models, maximum framework compatibility, research flexibility. Choose TPUs for: JAX/TensorFlow workflows, very large training runs where TPU pod topology helps, and when you want to avoid NVIDIA's pricing power. Most of the AI ecosystem (research, open source) is GPU-native, so TPUs have a steeper on-ramp.
Renting GPUs in the Cloud
You don't need to own a $30,000 H100 GPU. Cloud providers rent them by the hour:
| Instance | Provider | GPU | GPUs | ~Price/hr |
|---|---|---|---|---|
| p4d.24xlarge | AWS | A100 | 8 | $32.77 |
| p5.48xlarge | AWS | H100 | 8 | $98.32 |
| a2-megagpu-16g | GCP | A100 | 16 | $22.32 |
| a3-megagpu-8g | GCP | H100 | 8 | $33.19 |
| Standard_ND96asr_v4 | Azure | A100 | 8 | ~$27.20 |
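Before committing to a training run, it's worth doing the cost arithmetic up front. A minimal sketch using the on-demand p5.48xlarge price from the table above (prices change; always check current rates, and note that spot or reserved pricing can be substantially cheaper):

```python
# On-demand cost for a single cloud GPU instance over a training run.
def training_cost(price_per_hour: float, hours: float) -> float:
    """Dollar cost for one instance at on-demand rates."""
    return price_per_hour * hours

# Example: one week on an 8x H100 p5.48xlarge at $98.32/hr
week_hours = 7 * 24
cost = training_cost(98.32, week_hours)
print(f"${cost:,.2f}")  # $16,517.76
```

Multiply by the number of instances for multi-node jobs, and the case for renting rather than buying (or for spot capacity) becomes easy to evaluate.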
AMD as an Alternative to NVIDIA
AMD's MI300X (2024) is a serious competitor to the H100: its 192GB of HBM3 memory (vs. the H100's 80GB) makes it excellent for serving large models that must fit entirely in GPU memory. ROCm, AMD's CUDA equivalent, has matured significantly. Microsoft Azure and Oracle Cloud both offer MI300X-based instances. For inference-focused workloads, AMD is increasingly competitive.
Frequently Asked Questions
Do I need a GPU to learn machine learning?
No — for learning and small experiments, a CPU is fine. Run small models locally, use Google Colab (free GPU access), or use AWS SageMaker Studio Lab. When you're training models that take hours or days, or running foundation models for inference, that's when you need GPUs. Don't invest in hardware until you've validated what you're building is worth the cost.
What is GPU memory and why does it matter so much for AI?
GPU memory (VRAM) is the fast memory on the GPU card itself (HBM — High Bandwidth Memory). Your model parameters, activations, and gradients must fit in GPU memory during training. A 7 billion parameter model in 16-bit precision requires ~14GB of VRAM just for weights — plus more for optimizer states and activations. An H100 with 80GB can fit models that an A100 with 40GB can't. This is why memory capacity often constrains what models you can train or run.
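The ~14GB figure above is just weights. A common rule of thumb for full training memory, assuming 16-bit weights and gradients plus 32-bit Adam optimizer states (the per-parameter byte counts are approximations, and activation memory, which depends on batch size and sequence length, is excluded):

```python
# Rule-of-thumb VRAM estimate for training with Adam.
# Per-parameter bytes: 2 (fp16 weights) + 2 (fp16 grads) + 8 (two fp32 moments).
def training_vram_gb(n_params: float) -> float:
    weights = 2 * n_params        # fp16/bf16 weights
    grads = 2 * n_params          # fp16/bf16 gradients
    adam_states = 8 * n_params    # two fp32 Adam moment tensors (4 + 4 bytes)
    return (weights + grads + adam_states) / 1e9

print(f"{2 * 7e9 / 1e9:.0f} GB just for 7B weights in fp16")   # 14 GB
print(f"{training_vram_gb(7e9):.0f} GB to train 7B with Adam") # 84 GB
```

This is why a model that runs inference comfortably on one GPU can still require several GPUs, or memory-saving techniques like ZeRO sharding, to train.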
What is CUDA and why is it important?
CUDA (Compute Unified Device Architecture) is NVIDIA's programming platform for GPUs — the software layer that lets PyTorch, TensorFlow, and other frameworks talk to NVIDIA hardware. NVIDIA's dominance is partly technical (GPU performance) and partly ecosystem: CUDA has a decade of optimization, libraries (cuDNN, cuBLAS, NCCL), and developer tooling that competitors lack. Most AI frameworks are CUDA-native first, everything else second.