GPU Architecture & CUDA — How GPUs Run AI Code
GPUs accelerate AI training by one to two orders of magnitude compared to CPUs. Understanding why requires knowing how GPUs work — the thread hierarchy, the memory system, and the CUDA programming model. This knowledge helps you write faster PyTorch code and debug performance bottlenecks.
GPU vs CPU — Fundamentally Different
🖥️ CPU (e.g., AMD Ryzen 9)
- 8–32 powerful cores
- Optimised for sequential tasks
- Large cache (32MB+), complex branch prediction
- Low latency per operation
- Good at control flow and general-purpose logic
🎮 GPU (e.g., NVIDIA A100)
- 6,912 CUDA cores (A100)
- Optimised for massively parallel tasks
- Small cache per core, simpler execution
- High throughput (hides latency with parallelism)
- Perfect for matrix multiplications, convolutions
Multiplying two 4096×4096 matrices requires ~69 billion multiply-add operations (4096³, or ~137 GFLOPs). A CPU works through them largely sequentially over seconds; a GPU spreads them across thousands of cores and finishes in milliseconds. Nearly every layer's forward and backward pass reduces to matrix multiplies.
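The multiply-add count for an N×N matmul is N³: each of the N² output elements is a dot product of length N. A quick sanity check of that arithmetic (the function name is just illustrative):

```python
def matmul_mac_count(n: int) -> int:
    """Multiply-accumulate (MAC) count for an n x n by n x n matmul.

    Each of the n*n output elements is a length-n dot product,
    so the total is n * n * n.
    """
    return n ** 3

macs = matmul_mac_count(4096)
print(macs)            # 68719476736, i.e. ~69 billion MACs
print(2 * macs / 1e9)  # ~137 GFLOPs, counting a MAC as 2 floating-point ops
```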
GPU Thread Hierarchy
GPUs organize computation in a three-level hierarchy:
- Thread: the smallest unit of execution; each thread runs the same kernel code on its own data element. Hardware executes threads in lockstep groups of 32 called warps.
- Thread block: a group of up to 1,024 threads scheduled onto one streaming multiprocessor (SM); threads in a block can cooperate through shared memory and barrier synchronization.
- Grid: all the blocks launched for a single kernel; blocks execute independently, which lets the same code scale across GPUs with different SM counts.
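In CUDA, each thread derives which element it owns from its block and thread indices. A minimal Python sketch of that indexing scheme (the names `block_idx`, `block_dim`, `thread_idx` mirror CUDA's `blockIdx.x`, `blockDim.x`, `threadIdx.x`):

```python
def global_thread_index(block_idx: int, block_dim: int, thread_idx: int) -> int:
    """CUDA-style flat index: blockIdx.x * blockDim.x + threadIdx.x."""
    return block_idx * block_dim + thread_idx

# A grid of 4 blocks with 256 threads each covers 1024 elements,
# each element owned by exactly one thread:
block_dim = 256
indices = [global_thread_index(b, block_dim, t)
           for b in range(4) for t in range(block_dim)]
assert indices == list(range(1024))
```

This is why kernels include a bounds check (`if idx < n`): the grid is usually rounded up to a whole number of blocks, so the last block may own indices past the end of the data.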
GPU Memory Hierarchy
Registers
~1 cycle latency
Private to each thread. Fastest but limited. If a kernel uses too many registers, it "spills" to slower local memory.
Shared Memory (SRAM)
~5 cycle latency · 192KB/SM
Shared within a thread block. Programmable L1 cache. Key for optimised CUDA kernels — load data once from DRAM, reuse from shared memory.
L2 Cache
~50 cycle latency · 40MB (A100)
Shared across all SMs. Caches frequently accessed global memory. PyTorch's memory manager tries to reuse allocations to benefit from L2 hits.
Global Memory (HBM)
~500 cycle latency · 80GB (A100)
Your model weights, activations, and gradients live here. HBM2e provides ~2 TB/s of bandwidth. Many operations in AI training (elementwise ops, normalization, optimizer updates) are memory-bandwidth-bound rather than compute-bound.
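Whether an operation is bandwidth-bound or compute-bound follows from its arithmetic intensity (FLOPs per byte of DRAM traffic) compared to the hardware's ratio. A back-of-envelope sketch using the A100 figures above (312 TFLOPS FP16 Tensor Core peak, 2 TB/s HBM; treat these as nominal numbers):

```python
def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs performed per byte of DRAM traffic."""
    return flops / bytes_moved

PEAK_FLOPS = 312e12  # nominal A100 FP16 Tensor Core peak, FLOP/s
PEAK_BW = 2e12       # nominal HBM2e bandwidth, bytes/s
ridge = PEAK_FLOPS / PEAK_BW  # ~156 FLOPs/byte; below this, bandwidth-bound

# Elementwise FP16 add: 1 FLOP per element, 6 bytes of traffic
# (read a, read b, write out; 2 bytes each) -> badly bandwidth-bound.
add_intensity = arithmetic_intensity(1, 6)

# 4096x4096 FP16 matmul: 2*N^3 FLOPs, 3*N^2*2 bytes if each matrix
# moves through DRAM once -> well above the ridge, compute-bound.
n = 4096
mm_intensity = arithmetic_intensity(2 * n**3, 3 * n * n * 2)
print(ridge, add_intensity, mm_intensity)
```

This is the intuition behind kernel fusion: merging several elementwise ops into one kernel raises FLOPs per byte by touching DRAM once instead of once per op.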
PyTorch CUDA API
import torch
# Device management
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(torch.cuda.device_count()) # Number of GPUs
print(torch.cuda.get_device_name(0)) # "NVIDIA A100 80GB PCIe"
print(torch.cuda.get_device_properties(0)) # Full specs
# Memory management
print(torch.cuda.memory_allocated(0)) # Bytes in use
print(torch.cuda.memory_reserved(0)) # Bytes allocated (including cache)
torch.cuda.empty_cache() # Release unused cached memory so other GPU applications can use it
# Multi-GPU: select specific GPU
with torch.cuda.device(1): # Use GPU 1
    x = torch.randn(1000, 1000, device='cuda')
# CUDA Streams — run operations concurrently
stream1 = torch.cuda.Stream()
stream2 = torch.cuda.Stream()
with torch.cuda.stream(stream1):
    out1 = model_a(data_a) # Runs on stream1
with torch.cuda.stream(stream2):
    out2 = model_b(data_b) # Runs on stream2, overlapping with stream1
torch.cuda.synchronize() # Wait for all streams to finish
# Mixed precision — ~2× faster with FP16 tensors
from torch.cuda.amp import autocast, GradScaler
scaler = GradScaler()
with autocast(): # Forward pass in FP16
    output = model(inputs)
    loss = criterion(output, targets)
scaler.scale(loss).backward() # Scale to avoid FP16 underflow
scaler.step(optimizer)
scaler.update()
Profiling with PyTorch Profiler
from torch.profiler import profile, record_function, ProfilerActivity
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
    profile_memory=True,
) as prof:
    with record_function("model_forward"):
        output = model(inputs)
    with record_function("model_backward"):
        loss.backward()
# Print top 20 most expensive operations
print(prof.key_averages().table(
    sort_by="cuda_time_total",
    row_limit=20,
))
# Export a Chrome trace for timeline visualization
prof.export_chrome_trace("trace.json")
# Open in chrome://tracing or the Perfetto UI
Frequently Asked Questions
Why is my GPU utilization low even though my code runs on CUDA?
Low GPU utilization usually means a CPU–GPU data-transfer bottleneck or slow data loading. Check: use num_workers>0 in your DataLoader, enable pin_memory=True, prefetch data, and make sure your batch size is large enough to saturate the GPU's compute. Also confirm with nvidia-smi that memory is actually allocated on the device. Use the PyTorch Profiler to pinpoint the bottleneck.
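The payoff from overlapping data loading with GPU compute can be estimated with a toy pipeline model (the per-batch times below are made-up illustrative numbers, not measurements):

```python
def epoch_time(load_s: float, compute_s: float, n_batches: int,
               overlapped: bool) -> float:
    """Toy model of one training epoch.

    Serial: the GPU idles while each batch is loaded, so each step
    costs load_s + compute_s.
    Overlapped (DataLoader workers prefetching the next batch): loading
    hides behind compute, so each step costs max(load_s, compute_s)
    in steady state.
    """
    if overlapped:
        return n_batches * max(load_s, compute_s)
    return n_batches * (load_s + compute_s)

# Hypothetical: 30 ms to load a batch, 50 ms to run it, 1000 batches.
serial = epoch_time(0.03, 0.05, 1000, overlapped=False)  # ~80 s, GPU idle 37% of the time
piped = epoch_time(0.03, 0.05, 1000, overlapped=True)    # ~50 s, GPU never waits
print(serial, piped)
```

Once loading is fully hidden (load_s ≤ compute_s), adding more DataLoader workers buys nothing; the profiler will then show compute, not input, as the bottleneck.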
What is CUDA out of memory (OOM) and how do I fix it?
OOM happens when your model + activations + gradients exceed available VRAM. Fix strategies: reduce batch size, use gradient checkpointing (trade compute for memory), use mixed precision (FP16 halves activation memory), accumulate gradients over multiple small batches, use model sharding. For debugging, call torch.cuda.memory_summary() to see what's consuming memory.
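Gradient accumulation deserves a sketch because it raises the effective batch size without raising peak memory. A framework-free illustration of the bookkeeping (in PyTorch you would call loss.backward() per micro-batch and optimizer.step() once per accumulation cycle; gradients here are stand-in scalars):

```python
def accumulate_gradients(micro_batch_grads, accum_steps):
    """Average per-micro-batch gradients, yielding one value per optimizer step.

    Memory holds only one micro-batch of activations at a time, but each
    optimizer step sees the averaged gradient of the full effective batch.
    """
    steps = []
    for i in range(0, len(micro_batch_grads), accum_steps):
        window = micro_batch_grads[i:i + accum_steps]
        steps.append(sum(window) / len(window))  # one optimizer step
    return steps

# 8 micro-batches with accum_steps=4 behave like 2 large-batch steps:
grads = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
print(accumulate_gradients(grads, 4))  # [2.5, 6.5]
```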
What is the difference between CUDA cores and Tensor Cores?
CUDA cores perform general FP32 operations (one multiply-add per clock). Tensor Cores (introduced in Volta/V100) perform a 4×4×4 FP16 matrix multiply-accumulate in a single instruction, giving roughly 8× the FP32 throughput on V100. PyTorch's torch.mm and nn.Linear automatically use Tensor Cores (via cuBLAS) when inputs are FP16 or BF16. This is why mixed precision training is so much faster on modern GPUs.
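To see why the per-instruction tile matters, count how a large matmul decomposes into 4×4×4 tiles (a simplified sketch; real kernels tile at several larger granularities on top of this):

```python
def tensor_core_ops(n: int, tile: int = 4) -> int:
    """Number of tile x tile x tile multiply-accumulate instructions for an
    n x n matmul, assuming n is a multiple of the tile size.

    Each instruction covers tile**3 multiply-adds, so the instruction
    count is (n // tile)**3 instead of n**3 scalar FMAs.
    """
    assert n % tile == 0
    return (n // tile) ** 3

n = 4096
scalar_fmas = n ** 3           # one CUDA-core FMA per multiply-add
tile_ops = tensor_core_ops(n)  # far fewer instructions issued
print(scalar_fmas // tile_ops) # 64 multiply-adds retired per instruction
```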