Phase 5: GPU & Hardware for AI
Training a neural network involves doing the same mathematical operations — matrix multiplications — billions of times. GPUs can do this 100× faster than CPUs because they have thousands of cores designed for parallel computation. Understanding GPU hardware is essential for serious AI work.
Goal: Understand GPU architecture and use hardware efficiently
Duration: 4–6 weeks
Tools: CUDA, PyTorch CUDA, nvidia-smi, nvtop
Why GPUs? CPU vs GPU
🖥️ CPU (Central Processing Unit)
- 4–64 powerful cores
- Optimised for sequential tasks
- Large cache, complex branch prediction
- Clock: 3–5 GHz
- Great for: OS, web servers, databases
🎮 GPU (Graphics Processing Unit)
- Thousands of smaller cores (CUDA cores)
- Optimised for parallel matrix math
- High memory bandwidth
- A100: 6,912 CUDA + 432 Tensor Cores
- Great for: Training neural networks
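You can measure this gap yourself with a single large matrix multiplication. A minimal sketch in PyTorch (the exact speedup depends on your CPU, GPU, and matrix size):

import time
import torch

# A 4096x4096 matmul is ~137 billion floating-point operations (2 * n^3)
n = 4096
a = torch.randn(n, n)
b = torch.randn(n, n)

# Time it on the CPU
start = time.perf_counter()
c = a @ b
cpu_time = time.perf_counter() - start
print(f"CPU: {cpu_time:.3f}s")

# Time it on the GPU, if one is available
if torch.cuda.is_available():
    a_gpu, b_gpu = a.cuda(), b.cuda()
    _ = a_gpu @ b_gpu            # warm-up: triggers CUDA init and kernel load
    torch.cuda.synchronize()     # CUDA calls are asynchronous, so wait first
    start = time.perf_counter()
    c_gpu = a_gpu @ b_gpu
    torch.cuda.synchronize()
    gpu_time = time.perf_counter() - start
    print(f"GPU: {gpu_time*1000:.1f} ms ({cpu_time / gpu_time:.0f}x faster)")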
Key GPU Hardware Concepts
GPU Architecture & CUDA
Streaming multiprocessors (SMs), CUDA cores, warp execution, thread blocks, grids. How GPUs actually run code.
GPU Memory Management
VRAM, HBM2e, memory bandwidth, batch size tradeoffs, gradient checkpointing.
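Gradient checkpointing, mentioned in the list above, trades compute for memory: activations are discarded during the forward pass and recomputed during backward instead of being stored. A minimal sketch using PyTorch's built-in torch.utils.checkpoint (the Block module here is a hypothetical stand-in for any expensive layer):

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class Block(nn.Module):
    """Hypothetical expensive layer whose activations we don't want to store."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        return self.net(x)

block = Block(1024)
x = torch.randn(8, 1024, requires_grad=True)

# Normal forward: activations inside the block are kept for backward
y = block(x)

# Checkpointed forward: activations are recomputed during backward,
# cutting activation memory at the cost of one extra forward pass per block
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()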
Distributed Training
Data parallelism, model parallelism, pipeline parallelism, NVLink, InfiniBand.
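Data parallelism is the most common of these: every GPU holds a full copy of the model, processes a different slice of each batch, and gradients are averaged across GPUs after each backward pass. A minimal single-node sketch using PyTorch's DistributedDataParallel (launched with torchrun; the model and data are placeholders):

# Launch with: torchrun --nproc_per_node=NUM_GPUS train.py
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model; DDP synchronises gradients across all GPUs
    model = nn.Linear(1024, 10).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for step in range(100):
        # Each rank would normally get a different shard via DistributedSampler
        x = torch.randn(32, 1024, device=local_rank)
        y = torch.randint(0, 10, (32,), device=local_rank)
        loss = nn.functional.cross_entropy(model(x), y)
        optimizer.zero_grad()
        loss.backward()   # gradients are all-reduced across GPUs here
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()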
GPU Hardware Landscape (2024)
- Learning & small projects: Google Colab (free T4) or Kaggle (free P100).
- Fine-tuning 7B models: RTX 3090/4090 (24GB VRAM), ~$800–1,500.
- Training from scratch: rent A100s or H100s on AWS/GCP/Lambda Labs.
Tensor Cores — The AI Accelerator
Modern NVIDIA GPUs have specialised Tensor Cores that perform fused multiply-add on 4×4 matrices in a single clock cycle. These are 8–16× faster than standard CUDA cores for the matrix operations used in deep learning.
To use Tensor Cores, use mixed precision training: store weights in FP32 but do forward/backward pass in FP16. PyTorch makes this trivial:
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()
for inputs, targets in dataloader:
    optimizer.zero_grad()
    with autocast():  # Forward pass runs in FP16 on Tensor Cores
        output = model(inputs)
        loss = criterion(output, targets)
    scaler.scale(loss).backward()  # Scale gradients to prevent FP16 underflow
    scaler.step(optimizer)
    scaler.update()

Checking GPU Usage
# Check GPU memory and utilisation
nvidia-smi
# Watch live (updates every 1 second)
watch -n 1 nvidia-smi
# In Python
import torch
print(torch.cuda.is_available()) # True
print(torch.cuda.device_count()) # 1
print(torch.cuda.get_device_name(0)) # NVIDIA A100
print(f"Memory: {torch.cuda.memory_allocated()/1e9:.2f}GB used") Frequently Asked Questions
Can I train models without a GPU?
Yes, but slowly. Small models (<100K parameters) train fine on CPU. For anything larger, use Google Colab (free T4 GPU), Kaggle (free P100), or rent cloud GPUs. Training a BERT model that takes 2 minutes on a T4 would take ~3 hours on CPU.
What's the difference between VRAM and RAM?
VRAM (Video RAM) is memory on the GPU itself — fast, but limited (8–80GB). System RAM is on the CPU side — larger (16–512GB) but much slower to access from GPU. Model weights, activations, and gradients must all fit in VRAM during training.
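A useful back-of-the-envelope check before training: with the Adam optimiser in FP32, a common rule of thumb is roughly 16 bytes per parameter (4 for the weight, 4 for the gradient, 8 for Adam's two moment buffers), before counting activations. A rough sketch:

def training_memory_gb(num_params, bytes_per_param=16):
    """Rough VRAM estimate for weights + gradients + Adam state in FP32.
    Excludes activations, which depend on batch size and architecture."""
    return num_params * bytes_per_param / 1e9

# A 7B-parameter model: ~112 GB just for weights, gradients, and optimiser
# state, which is why full FP32 fine-tuning of a 7B model cannot fit on a
# 24GB card without memory-saving techniques.
print(f"{training_memory_gb(7e9):.0f} GB")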
What is FLOP/s and why does it matter?
FLOP/s = Floating Point Operations Per Second. It measures computational throughput. An H100 does ~1,000 TFLOP/s at FP16. More FLOP/s = faster training/inference. But memory bandwidth often bottlenecks before compute — moving data is frequently slower than computing.
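The memory-bandwidth point is easy to verify with arithmetic. A rough sketch for batch-1 inference through one 4096×4096 layer on an H100 (the peak figures are assumptions taken from published spec sheets; real kernels reach only a fraction of peak):

# FLOPs for an (M x K) @ (K x N) matmul: 2 * M * N * K
# Batch-1 inference through one 4096x4096 layer:
M, K, N = 1, 4096, 4096
flops = 2 * M * N * K                       # ~33.6 MFLOPs

peak_flops = 1e15                           # H100: ~1,000 TFLOP/s at FP16
compute_time = flops / peak_flops           # ~0.03 microseconds

# The weight matrix must stream in from VRAM (FP16 = 2 bytes per element):
bytes_moved = (M*K + K*N + M*N) * 2         # ~34 MB, dominated by the weights
peak_bandwidth = 3.35e12                    # H100 SXM: ~3.35 TB/s HBM3
memory_time = bytes_moved / peak_bandwidth  # ~10 microseconds

# Moving the weights takes ~300x longer than the arithmetic: memory-bound
print(f"compute: {compute_time*1e9:.0f} ns, memory: {memory_time*1e9:.0f} ns")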