GPU Clusters & AI Accelerators

Training GPT-4 reportedly used around 25,000 NVIDIA A100 GPUs for months. Meta reported training Llama 3 on up to 16,000 H100s. Modern AI doesn't run on regular computers; it runs on specialized hardware designed for the massive parallel math that deep learning requires. Let's understand what that hardware is and why it matters.

Why GPUs for AI?

A CPU (Central Processing Unit) is designed for sequential tasks — it has a small number of very powerful cores (typically 4–64) optimized for complex, branchy logic. A GPU (Graphics Processing Unit) is designed for parallel tasks — it has thousands of smaller cores that execute simple operations simultaneously.

The Matrix Math Connection

Deep learning is fundamentally matrix multiplication: training a neural network means multiplying billions of numbers together, billions of times. GPUs were originally built to render 3D graphics, which is also matrix multiplication (transforming 3D coordinates into 2D pixels). Around 2012, most famously with AlexNet, researchers showed that GPUs trained neural networks 10–100x faster than CPUs, triggering the AI hardware revolution.

Simple analogy: A CPU is like a team of 8 expert surgeons — great for complex tasks but limited in throughput. A GPU is like an assembly line with 10,000 workers — perfect for doing the same simple operation millions of times in parallel.
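
To make the matrix-math point concrete, here's a minimal PyTorch timing sketch. The exact speedup depends on your hardware and matrix size, and the GPU branch assumes a CUDA-capable device is present:

```python
import time
import torch

def time_matmul(device: str, n: int = 4096) -> float:
    """Time one n x n matrix multiplication on the given device."""
    a = torch.randn(n, n, device=device)
    b = torch.randn(n, n, device=device)
    _ = a @ b  # warm-up: the first CUDA kernel launch includes one-time setup
    if device == "cuda":
        torch.cuda.synchronize()  # GPU work is asynchronous; settle before timing
    start = time.perf_counter()
    _ = a @ b
    if device == "cuda":
        torch.cuda.synchronize()  # wait for the kernel to actually finish
    return time.perf_counter() - start

print(f"CPU: {time_matmul('cpu'):.4f}s")
if torch.cuda.is_available():
    print(f"GPU: {time_matmul('cuda'):.4f}s")  # typically 10-100x faster
```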

NVIDIA GPU Generations for AI

NVIDIA dominates the AI accelerator market. Understanding their product generations helps you understand what's available in the cloud today:

| GPU | Architecture | BF16 TFLOPS | Memory | Notes |
|---|---|---|---|---|
| A100 | Ampere (2020) | 312 | 80GB HBM2e | The workhorse of 2022–2024 AI training. AWS p4d, GCP A2. |
| H100 | Hopper (2022) | 989 | 80GB HBM3 | ~3x A100 performance. AWS p5, GCP A3, Azure ND H100. |
| H200 | Hopper (2024) | 989 | 141GB HBM3e | Same compute as the H100; the extra memory is crucial for fitting larger models. Limited availability. |
| B200 | Blackwell (2025) | ~2x H100 | 192GB HBM3e | The hardware frontier for 2025–2026 model training. |

NVLink: Connecting GPUs Together

NVLink is NVIDIA's high-speed interconnect that links multiple GPUs within a single server at speeds up to 900 GB/s per GPU (NVLink 4.0). An NVIDIA DGX H100 server contains 8 H100 GPUs all connected via NVLink; they act as a single unified unit with 640GB of combined GPU memory. That's roughly an order of magnitude more bandwidth than PCIe 5.0, which tops out around 64 GB/s per direction for an x16 slot.
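
To inspect the interconnect on a multi-GPU machine, `nvidia-smi topo -m` prints the link matrix. From PyTorch, a quick sketch (assuming at least two visible GPUs) shows whether devices can read each other's memory directly:

```python
import torch

# Peer-to-peer access lets one GPU read another's memory directly
# (over NVLink when present, otherwise PCIe) without staging in host RAM.
if torch.cuda.device_count() >= 2:
    print(torch.cuda.can_device_access_peer(0, 1))  # True on NVLink/P2P systems
else:
    print("Fewer than two GPUs visible")
```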

Google TPUs: An Alternative Architecture

Google's Tensor Processing Units (TPUs) are ASICs (Application-Specific Integrated Circuits) built specifically for the matrix operations at the heart of neural networks, programmed through TensorFlow and JAX. Unlike GPUs, which are general-purpose accelerators repurposed for AI, TPUs are designed from the ground up for one thing: tensor math.

TPU v4 and v5e

TPU v4 pods can scale to 4,096 chips connected via a 3D torus mesh, which keeps chip-to-chip traffic on dedicated links rather than the shared data-center network. Google trains Gemini on TPU pods. TPU v5e (2023) targets inference workloads at lower cost. If you're training in JAX or TensorFlow, TPUs on GCP can be competitive with or faster than GPU equivalents at large scale.
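
A practical upside of the XLA stack: the same JAX code runs unchanged on CPU, GPU, or TPU, since XLA compiles for whatever backend is present. A minimal sketch:

```python
import jax
import jax.numpy as jnp

@jax.jit  # XLA compiles this for the available backend (CPU, GPU, or TPU)
def matmul(a: jnp.ndarray, b: jnp.ndarray) -> jnp.ndarray:
    return a @ b

k1, k2 = jax.random.split(jax.random.PRNGKey(0))
a = jax.random.normal(k1, (1024, 1024))
b = jax.random.normal(k2, (1024, 1024))

print(jax.devices())       # e.g. [TpuDevice(...)] on a Cloud TPU VM
print(matmul(a, b).shape)  # (1024, 1024), computed on the accelerator
```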

When to Choose TPUs vs. GPUs

Choose GPUs (NVIDIA) for: PyTorch models, maximum framework compatibility, research flexibility. Choose TPUs for: JAX/TensorFlow workflows, very large training runs where TPU pod topology helps, and when you want to avoid NVIDIA's pricing power. Most of the AI ecosystem (research, open source) is GPU-native, so TPUs have a steeper on-ramp.

Renting GPUs in the Cloud

You don't need to own a $30,000 H100 GPU. Cloud providers rent them by the hour:

| Instance | Provider | GPU | GPUs | ~Price/hr |
|---|---|---|---|---|
| p4d.24xlarge | AWS | A100 | 8 | $32.77 |
| p5.48xlarge | AWS | H100 | 8 | $98.32 |
| a2-megagpu-16g | GCP | A100 | 16 | $22.32 |
| a3-megagpu-8g | GCP | H100 | 8 | $33.19 |
| Standard_ND96asr_v4 | Azure | A100 | 8 | ~$27.20 |

On-demand vs. spot pricing: On-demand GPU instances are expensive. Spot/preemptible instances offer 60–80% discounts but can be interrupted at short notice. Most large training runs use spot instances with checkpoint-resume logic to tolerate interruptions and dramatically cut costs.
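
Here's a minimal sketch of that checkpoint-resume pattern in PyTorch; the model, file path, and save interval are illustrative, not any provider's API:

```python
import os
import torch

CKPT = "checkpoint.pt"  # in practice, write to durable storage (e.g. an object store)

def save_checkpoint(model, optimizer, step):
    torch.save({"model": model.state_dict(),
                "optim": optimizer.state_dict(),
                "step": step}, CKPT)

def load_checkpoint(model, optimizer):
    if not os.path.exists(CKPT):
        return 0  # fresh start
    state = torch.load(CKPT)
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optim"])
    return state["step"]

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
start_step = load_checkpoint(model, optimizer)  # resume after an interruption

for step in range(start_step, 1000):
    loss = model(torch.randn(32, 10)).pow(2).mean()  # dummy objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % 100 == 0:
        save_checkpoint(model, optimizer, step)  # survive a spot interruption
```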

AMD as an Alternative to NVIDIA

AMD's MI300X (2024) is a serious competitor to the H100: its 192GB of HBM3 memory (vs. the H100's 80GB) makes it excellent for inference of large models that need to fit in GPU memory. ROCm, AMD's CUDA equivalent, has matured significantly. Microsoft Azure offers MI300X-based VMs (the ND MI300X v5 series), and other clouds are adding them. For inference-focused workloads, AMD is increasingly competitive.

Frequently Asked Questions

Do I need a GPU to learn machine learning?

No. For learning and small experiments, a CPU is fine. Run small models locally, use Google Colab (free GPU access), or use AWS SageMaker Studio Lab. When you're training models that take hours or days, or running foundation models for inference, that's when you need GPUs. Don't invest in hardware until you've validated that what you're building is worth the cost.

What is GPU memory and why does it matter so much for AI?

GPU memory (VRAM) is the fast memory on the GPU card itself (HBM, High Bandwidth Memory). Your model parameters, activations, and gradients must fit in GPU memory during training. A 7-billion-parameter model in 16-bit precision requires ~14GB of VRAM just for weights, plus more for optimizer states and activations. An 80GB H100 can fit models that a 40GB A100 can't (A100s ship in 40GB and 80GB variants). This is why memory capacity often constrains what models you can train or run.
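
A back-of-the-envelope calculator for the weights alone (training needs substantially more for gradients, optimizer states, and activations):

```python
def weight_memory_gb(params_billions: float, bytes_per_param: int = 2) -> float:
    """Memory for model weights only.
    bytes_per_param: 2 for FP16/BF16, 4 for FP32, 1 for INT8."""
    return params_billions * bytes_per_param  # billions of params * bytes = GB

print(weight_memory_gb(7))    # 14.0 GB: a 7B model in 16-bit fits on one 80GB GPU
print(weight_memory_gb(70))   # 140.0 GB: a 70B model exceeds a single 80GB H100
```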

What is CUDA and why is it important?

CUDA (Compute Unified Device Architecture) is NVIDIA's programming platform for GPUs — the software layer that lets PyTorch, TensorFlow, and other frameworks talk to NVIDIA hardware. NVIDIA's dominance is partly technical (GPU performance) and partly ecosystem: CUDA has a decade of optimization, libraries (cuDNN, cuBLAS, NCCL), and developer tooling that competitors lack. Most AI frameworks are CUDA-native first, everything else second.
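
From Python you can verify that the CUDA stack is visible to your framework; a small PyTorch sketch:

```python
import torch

print(torch.cuda.is_available())            # CUDA driver + GPU detected?
print(torch.backends.cudnn.is_available())  # cuDNN kernels usable?
if torch.cuda.is_available():
    print(torch.version.cuda)               # CUDA version PyTorch was built against
    print(torch.cuda.get_device_name(0))    # e.g. "NVIDIA A100-SXM4-80GB"
```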
