GPU Clusters & AI Accelerators
Training GPT-4 reportedly used around 25,000 NVIDIA A100 GPUs for months. Training Llama 3 took thousands of H100s. Modern AI doesn't run on regular computers — it runs on specialized hardware designed specifically for the massive parallel math that deep learning requires. Let's understand what that hardware is and why it matters.
Why GPUs for AI?
A CPU (Central Processing Unit) is designed for sequential tasks — it has a small number of very powerful cores (typically 4–64) optimized for complex, branchy logic. A GPU (Graphics Processing Unit) is designed for parallel tasks — it has thousands of smaller cores that execute simple operations simultaneously.
The Matrix Math Connection
Deep learning is fundamentally matrix multiplication. Training a neural network means multiplying billions of numbers together, billions of times. GPUs were originally built to render 3D graphics — which is also matrix multiplication (transforming 3D coordinates to 2D pixels). Deep learning researchers noticed in 2012 that GPUs trained neural networks 10–100x faster than CPUs, triggering the AI hardware revolution.
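To make "deep learning is matrix multiplication" concrete, here is a minimal NumPy sketch of a single dense layer's forward pass. The sizes are purely illustrative; real models run this operation billions of times with far larger matrices, which is exactly the workload a GPU's thousands of cores parallelize.

```python
import numpy as np

# A dense (fully connected) layer is just: activation(inputs @ weights + bias)
rng = np.random.default_rng(0)

batch = 4          # 4 input examples
in_features = 8    # each example is an 8-dimensional vector
out_features = 3   # the layer produces 3 outputs per example

x = rng.standard_normal((batch, in_features))
w = rng.standard_normal((in_features, out_features))
b = rng.standard_normal(out_features)

y = x @ w + b            # the matrix multiply a GPU accelerates
y = np.maximum(y, 0.0)   # ReLU activation

print(y.shape)  # (4, 3)
```

Every row of `x` is multiplied against every column of `w` independently, which is why the work splits so cleanly across parallel cores.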
NVIDIA GPU Generations for AI
NVIDIA dominates the AI accelerator market. Knowing the recent generations helps you decode what's available in the cloud today: V100 (Volta, 2017), A100 (Ampere, 2020, 40GB or 80GB), H100 (Hopper, 2022, 80GB HBM3), and the Blackwell generation (B100/B200, announced 2024).
NVLink: Connecting GPUs Together
NVLink is NVIDIA's high-speed interconnect that links multiple GPUs within a single server at speeds up to 900 GB/s (NVLink 4.0). An NVIDIA DGX H100 server contains 8 H100 GPUs all connected via NVLink — they act as a single unified unit with 640GB of combined GPU memory. This is orders of magnitude faster than connecting GPUs via PCIe.
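A back-of-envelope calculation shows why the interconnect matters. The bandwidth figures below are approximate peak rates (PCIe Gen5 x16 at roughly 64 GB/s per direction, NVLink 4.0 at the 900 GB/s aggregate cited above), not measured throughput:

```python
# Rough time to move one H100's full 80 GB of HBM contents between GPUs.
# Bandwidths are approximate peaks; real transfers achieve somewhat less.
payload_gb = 80           # one H100's GPU memory
pcie_gen5_x16 = 64        # ~GB/s per direction, approximate peak
nvlink4 = 900             # ~GB/s aggregate per H100 via NVLink 4.0

pcie_seconds = payload_gb / pcie_gen5_x16
nvlink_seconds = payload_gb / nvlink4

print(f"PCIe Gen5 x16: {pcie_seconds:.2f} s")   # 1.25 s
print(f"NVLink 4.0:    {nvlink_seconds:.3f} s")
```

When 8 GPUs exchange gradients thousands of times per training run, that order-of-magnitude gap is the difference between GPUs waiting on the network and GPUs doing math.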
Google TPUs: An Alternative Architecture
Google's Tensor Processing Units (TPUs) are ASICs (Application-Specific Integrated Circuits) designed specifically for matrix operations in TensorFlow and JAX. Unlike GPUs which are general-purpose accelerators repurposed for AI, TPUs are built from the ground up for one thing: tensor math.
TPU v4 and v5e
TPU v4 pods can scale to 4,096 chips connected via a 3D torus interconnect, minimizing network bottlenecks between chips. Google trains Gemini on TPU pods. TPU v5e (2023) targets inference workloads at lower cost. If you're training in JAX or TensorFlow, TPUs on GCP can be competitive with or faster than GPU equivalents at large scale.
When to Choose TPUs vs. GPUs
Choose GPUs (NVIDIA) for: PyTorch models, maximum framework compatibility, research flexibility. Choose TPUs for: JAX/TensorFlow workflows, very large training runs where TPU pod topology helps, and when you want to avoid NVIDIA's pricing power. Most of the AI ecosystem (research, open source) is GPU-native, so TPUs have a steeper on-ramp.
Renting GPUs in the Cloud
You don't need to own a $30,000 H100 GPU. Cloud providers rent them by the hour:
| Instance | Provider | GPU | GPUs | ~Price/hr |
|---|---|---|---|---|
| p4d.24xlarge | AWS | A100 | 8 | $32.77 |
| p5.48xlarge | AWS | H100 | 8 | $98.32 |
| a2-megagpu-16g | GCP | A100 | 16 | $22.32 |
| a3-megagpu-8g | GCP | H100 | 8 | $33.19 |
| Standard_ND96asr_v4 | Azure | A100 | 8 | ~$27.20 |
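Before committing to a training run, it's worth doing the cost arithmetic up front. A minimal sketch using the on-demand p5.48xlarge price from the table above (prices change; always check current rates, and note that spot or reserved pricing can be substantially cheaper):

```python
# On-demand cost for a single cloud GPU instance over a training run.
def training_cost(price_per_hour: float, hours: float) -> float:
    """Dollar cost for one instance at on-demand rates."""
    return price_per_hour * hours

# Example: one week on an 8x H100 p5.48xlarge at $98.32/hr
week_hours = 7 * 24
cost = training_cost(98.32, week_hours)
print(f"${cost:,.2f}")  # $16,517.76
```

Multiply by the number of instances for multi-node jobs, and the case for renting rather than buying (or for spot capacity) becomes easy to evaluate.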
AMD as an Alternative to NVIDIA
AMD's MI300X (2024) is a serious competitor to the H100: its 192GB of HBM3 memory (vs. the H100's 80GB) makes it excellent for serving large models that must fit entirely in GPU memory. ROCm, AMD's CUDA equivalent, has matured significantly. Microsoft Azure and Oracle Cloud both offer MI300X-based instances. For inference-focused workloads, AMD is increasingly competitive.
Frequently Asked Questions
Do I need a GPU to learn machine learning?
No — for learning and small experiments, a CPU is fine. Run small models locally, use Google Colab (free GPU access), or use AWS SageMaker Studio Lab. When you're training models that take hours or days, or running foundation models for inference, that's when you need GPUs. Don't invest in hardware until you've validated what you're building is worth the cost.
What is GPU memory and why does it matter so much for AI?
GPU memory (VRAM) is the fast memory on the GPU card itself (HBM — High Bandwidth Memory). Your model parameters, activations, and gradients must fit in GPU memory during training. A 7 billion parameter model in 16-bit precision requires ~14GB of VRAM just for weights — plus more for optimizer states and activations. An H100 with 80GB can fit models that an A100 with 40GB can't. This is why memory capacity often constrains what models you can train or run.
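The ~14GB figure above is just weights. A common rule of thumb for full training memory, assuming 16-bit weights and gradients plus 32-bit Adam optimizer states (the per-parameter byte counts are approximations, and activation memory, which depends on batch size and sequence length, is excluded):

```python
# Rule-of-thumb VRAM estimate for training with Adam.
# Per-parameter bytes: 2 (fp16 weights) + 2 (fp16 grads) + 8 (two fp32 moments).
def training_vram_gb(n_params: float) -> float:
    weights = 2 * n_params        # fp16/bf16 weights
    grads = 2 * n_params          # fp16/bf16 gradients
    adam_states = 8 * n_params    # two fp32 Adam moment tensors (4 + 4 bytes)
    return (weights + grads + adam_states) / 1e9

print(f"{2 * 7e9 / 1e9:.0f} GB just for 7B weights in fp16")   # 14 GB
print(f"{training_vram_gb(7e9):.0f} GB to train 7B with Adam") # 84 GB
```

This is why a model that runs inference comfortably on one GPU can still require several GPUs, or memory-saving techniques like ZeRO sharding, to train.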
What is CUDA and why is it important?
CUDA (Compute Unified Device Architecture) is NVIDIA's programming platform for GPUs — the software layer that lets PyTorch, TensorFlow, and other frameworks talk to NVIDIA hardware. NVIDIA's dominance is partly technical (GPU performance) and partly ecosystem: CUDA has a decade of optimization, libraries (cuDNN, cuBLAS, NCCL), and developer tooling that competitors lack. Most AI frameworks are CUDA-native first, everything else second.