GPU Architecture & CUDA — How GPUs Run AI Code

GPUs can accelerate AI training by one to two orders of magnitude compared to CPUs. Understanding why requires knowing how GPUs work — the hierarchy of threads, the memory system, and the CUDA programming model. This knowledge helps you write faster PyTorch code and debug performance bottlenecks.

Covers: GPU vs CPU · SM · Warp · Thread Block · Grid · CUDA Model · Memory Hierarchy · PyTorch CUDA API · Profiling

GPU vs CPU — Fundamentally Different

🖥️ CPU (e.g., AMD Ryzen 9)

  • 8–32 powerful cores
  • Optimised for sequential tasks
  • Large cache (32MB+), complex branch prediction
  • Low latency per operation
  • Good at control flow and general-purpose logic

🎮 GPU (e.g., NVIDIA A100)

  • 6,912 CUDA cores (A100)
  • Optimised for massively parallel tasks
  • Small cache per core, simpler execution
  • High throughput (hides latency with parallelism)
  • Perfect for matrix multiplications, convolutions

Multiplying two 4096×4096 matrices takes about 68.7 billion multiply-add operations (4096³, roughly 137 GFLOPs). A CPU, with at most a few dozen cores, works through them largely sequentially and takes on the order of seconds; a GPU spreads the work across thousands of cores and finishes in milliseconds. Nearly every layer's forward and backward pass is dominated by exactly this kind of matrix multiply.
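
To see the gap yourself, here is a minimal benchmark sketch in PyTorch. Absolute timings depend entirely on your hardware, so treat the numbers as illustrative only.

Python · CPU vs GPU Matrix Multiply (sketch)
import time
import torch

a = torch.randn(4096, 4096)
b = torch.randn(4096, 4096)

t0 = time.perf_counter()
c_cpu = a @ b                              # runs on the CPU
cpu_time = time.perf_counter() - t0

if torch.cuda.is_available():
    a_gpu, b_gpu = a.cuda(), b.cuda()
    _ = a_gpu @ b_gpu                      # warm-up: first call pays one-time cuBLAS setup cost
    torch.cuda.synchronize()               # wait for transfers and warm-up to finish
    t0 = time.perf_counter()
    c_gpu = a_gpu @ b_gpu                  # runs on the GPU
    torch.cuda.synchronize()               # kernels launch asynchronously; wait before stopping the clock
    gpu_time = time.perf_counter() - t0
    print(f"CPU: {cpu_time:.3f} s   GPU: {gpu_time*1000:.1f} ms")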

GPU Thread Hierarchy

GPUs organize computation in a hierarchy: individual threads are grouped into warps, warps into thread blocks, and blocks into a grid. Each thread runs the same kernel code with a unique thread ID, and threads within a block can share memory and synchronize.

The Thread Hierarchy Explained

🧵
Thread — The smallest execution unit. Each thread runs the kernel function independently with its unique (blockIdx, threadIdx). On A100: up to 2048 threads can be active per SM at once.
🌀
Warp — 32 threads that execute the same instruction simultaneously (SIMT — Single Instruction, Multiple Thread). If threads in a warp take different branches (if/else), the warp executes both paths serially — avoid divergence!
📦
Thread Block — A group of threads (up to 1024) that run on the same Streaming Multiprocessor (SM). Threads in a block share fast shared memory (SRAM) and can synchronize with __syncthreads(). Choose block sizes that are multiples of 32 (warp size).
🌐
Grid — All the thread blocks launched for one kernel. Blocks are assigned to SMs dynamically — an A100 has 108 SMs, and each SM can host several resident blocks, so many blocks execute concurrently. Grids can be 1D, 2D, or 3D to match your data shape (see the sketch below).
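
To make the hierarchy concrete, here is a hedged sketch using Numba's CUDA JIT (an assumption — the rest of this page uses PyTorch, and numba must be installed separately). Each thread handles one array element, the block size is a multiple of the warp size, and the grid is sized to cover the whole array.

Python · Thread / Block / Grid in a Numba CUDA Kernel (sketch)
from numba import cuda
import numpy as np

@cuda.jit
def add_kernel(a, b, c):
    i = cuda.grid(1)                  # global index = blockIdx.x * blockDim.x + threadIdx.x
    if i < c.size:                    # guard: the last block may be only partially filled
        c[i] = a[i] + b[i]

n = 1_000_000
a = np.random.rand(n).astype(np.float32)
b = np.random.rand(n).astype(np.float32)
c = np.zeros_like(a)

threads_per_block = 256                                              # a multiple of the warp size (32)
blocks_per_grid = (n + threads_per_block - 1) // threads_per_block   # ceil-divide so every element is covered
add_kernel[blocks_per_grid, threads_per_block](a, b, c)              # launch: a grid of blocks of threads

The same launch in CUDA C would compute the global index as blockIdx.x * blockDim.x + threadIdx.x inside the kernel.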

GPU Memory Hierarchy

Registers

~1 cycle latency

Private to each thread. Fastest but limited. If a kernel uses too many registers, it "spills" to slower local memory.

Shared Memory (SRAM)

~5 cycle latency · 192KB/SM

Shared within a thread block. Programmable L1 cache. Key for optimised CUDA kernels — load data once from DRAM, reuse from shared memory.
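
As a concrete illustration of that load-once, reuse-many-times pattern, here is a hedged sketch of a tiled matrix multiply with Numba's CUDA JIT (an assumption; not shown elsewhere on this page). Each block stages one tile of A and one tile of B in shared memory, synchronizes, then reuses each staged value many times.

Python · Shared-Memory Tiling (sketch)
from numba import cuda, float32
import numpy as np

TILE = 32   # tile width, chosen equal to the warp size

@cuda.jit
def matmul_tiled(A, B, C):
    sA = cuda.shared.array((TILE, TILE), dtype=float32)   # per-block shared memory (SRAM)
    sB = cuda.shared.array((TILE, TILE), dtype=float32)

    tx, ty = cuda.threadIdx.x, cuda.threadIdx.y
    row = cuda.blockIdx.y * TILE + ty
    col = cuda.blockIdx.x * TILE + tx

    acc = 0.0
    for t in range(A.shape[1] // TILE):                    # assumes dimensions are multiples of TILE
        sA[ty, tx] = A[row, t * TILE + tx]                 # one global-memory load per element...
        sB[ty, tx] = B[t * TILE + ty, col]
        cuda.syncthreads()                                 # wait until the whole tile is staged
        for k in range(TILE):
            acc += sA[ty, k] * sB[k, tx]                   # ...then TILE reuses from fast shared memory
        cuda.syncthreads()
    C[row, col] = acc

N = 1024
A = np.random.rand(N, N).astype(np.float32)
B = np.random.rand(N, N).astype(np.float32)
C = np.zeros((N, N), dtype=np.float32)
matmul_tiled[(N // TILE, N // TILE), (TILE, TILE)](A, B, C)   # 2D grid of 2D blocks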

L2 Cache

~50 cycle latency · 40MB (A100)

Shared across all SMs. Caches frequently accessed global memory. PyTorch's memory manager tries to reuse allocations to benefit from L2 hits.

Global Memory (HBM)

~500 cycle latency · 80GB (A100)

Your model weights, activations, and gradients live here. HBM2e provides roughly 2 TB/s of bandwidth. Many operations in AI training are memory-bandwidth-bound rather than compute-bound, as the estimate below illustrates.
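
A quick back-of-envelope check of that claim for an elementwise addition, using the figures quoted above (the ~19.5 TFLOP/s FP32 peak is NVIDIA's published A100 number, not from this page):

Python · Is an Elementwise Op Bandwidth-Bound? (sketch)
n = 1_000_000_000                       # one billion FP32 elements
bytes_moved = n * 4 * 3                 # read x, read y, write out (4 bytes each)
flops = n                               # one add per element

time_memory = bytes_moved / 2e12        # seconds if limited only by ~2 TB/s HBM bandwidth
time_compute = flops / 19.5e12          # seconds if limited only by FP32 throughput
print(f"memory-limited: {time_memory*1e3:.2f} ms, compute-limited: {time_compute*1e3:.3f} ms")
# ≈ 6.00 ms vs ≈ 0.051 ms: the operation is overwhelmingly bandwidth-bound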

PyTorch CUDA API

Python · Essential PyTorch CUDA Operations
import torch

# Device management
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(torch.cuda.device_count())           # Number of GPUs
print(torch.cuda.get_device_name(0))       # "NVIDIA A100 80GB PCIe"
print(torch.cuda.get_device_properties(0)) # Full specs

# Memory management
print(torch.cuda.memory_allocated(0))      # Bytes in use
print(torch.cuda.memory_reserved(0))       # Bytes allocated (including cache)
torch.cuda.empty_cache()                   # Release unused cached blocks (frees memory as seen by nvidia-smi)

# Multi-GPU: select specific GPU
with torch.cuda.device(1):                 # Use GPU 1
    x = torch.randn(1000, 1000, device='cuda')

# CUDA Streams — run operations concurrently
stream1 = torch.cuda.Stream()
stream2 = torch.cuda.Stream()

with torch.cuda.stream(stream1):
    out1 = model_a(data_a)   # Runs on stream1

with torch.cuda.stream(stream2):
    out2 = model_b(data_b)   # Runs on stream2, overlapping with stream1

torch.cuda.synchronize()   # Wait for all streams to finish

# Mixed precision — ~2× faster with FP16 tensors
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()
with autocast():            # Forward pass in FP16
    output = model(inputs)
    loss = criterion(output, targets)

scaler.scale(loss).backward()   # Scale to avoid FP16 underflow
scaler.step(optimizer)
scaler.update()
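
One pattern the listing above does not show is overlapping host-to-device copies with compute. A hedged sketch, assuming your batches start life in CPU memory: pin the host tensor and pass non_blocking=True so the copy can run on its own stream.

Python · Asynchronous Host-to-Device Copies (sketch)
import torch

device = torch.device("cuda")
batch = torch.randn(256, 3, 224, 224).pin_memory()      # page-locked host memory enables async copies

copy_stream = torch.cuda.Stream()
with torch.cuda.stream(copy_stream):
    batch_gpu = batch.to(device, non_blocking=True)     # H2D copy can overlap work on the default stream

torch.cuda.current_stream().wait_stream(copy_stream)    # default stream waits before touching batch_gpu
output = batch_gpu.mean()                               # placeholder compute on the transferred batch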

Profiling with PyTorch Profiler

Python · Profile GPU Kernel Performance
from torch.profiler import profile, record_function, ProfilerActivity

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
    profile_memory=True,
) as prof:
    with record_function("model_forward"):
        output = model(inputs)
    with record_function("model_backward"):
        loss.backward()

# Print top 20 most expensive operations
print(prof.key_averages().table(
    sort_by="cuda_time_total",
    row_limit=20
))

# Export a timeline trace for visualization
prof.export_chrome_trace("trace.json")
# Open in chrome://tracing or Perfetto UI
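
Because kernels launch asynchronously, timing a single operation with time.time() on the CPU can be misleading. For spot checks outside the profiler, CUDA events record timestamps on the GPU itself; a small sketch (assumes model and inputs are already on the GPU):

Python · Timing a Forward Pass with CUDA Events (sketch)
import torch

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()                              # timestamp recorded on the GPU's stream, not the CPU
output = model(inputs)
end.record()

torch.cuda.synchronize()                    # wait until both events have actually occurred
print(f"forward pass: {start.elapsed_time(end):.2f} ms")   # elapsed_time returns milliseconds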

Frequently Asked Questions

Why is my GPU utilization low even though my code runs on CUDA?

Low GPU utilization usually means a CPU–GPU data transfer bottleneck or data loading that is too slow. Check: use num_workers>0 in your DataLoader, enable pin_memory=True, prefetch data, and make sure your batch size is large enough to saturate the GPU's compute. Also confirm with nvidia-smi that memory is actually allocated on the device, and use the PyTorch Profiler to pinpoint the bottleneck. A sketch of these DataLoader settings follows below.
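
A hedged sketch of those DataLoader settings (train_dataset is a placeholder for any map-style dataset):

Python · DataLoader Settings for Keeping the GPU Fed (sketch)
from torch.utils.data import DataLoader

loader = DataLoader(
    train_dataset,            # placeholder: any map-style Dataset
    batch_size=256,
    shuffle=True,
    num_workers=4,            # prepare batches in background worker processes
    pin_memory=True,          # page-locked host buffers enable fast, async H2D copies
    prefetch_factor=2,        # each worker keeps 2 batches ready ahead of time
    persistent_workers=True,  # keep workers alive across epochs
)

for inputs, targets in loader:
    inputs = inputs.to("cuda", non_blocking=True)    # overlaps with the next batch being prepared
    targets = targets.to("cuda", non_blocking=True)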

What is CUDA out of memory (OOM) and how do I fix it?

OOM happens when your model weights, activations, gradients, and optimizer states together exceed available VRAM. Fix strategies: reduce the batch size, use gradient checkpointing (trading compute for memory), use mixed precision (FP16/BF16 roughly halves activation memory), accumulate gradients over several smaller micro-batches, or shard the model across GPUs. For debugging, call torch.cuda.memory_summary() to see what is consuming memory.
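
For the common case where a large batch does not fit, a minimal gradient-accumulation sketch (model, criterion, optimizer, and loader are placeholders): the effective batch size is accum_steps times the micro-batch size, but peak memory only ever holds one micro-batch of activations.

Python · Gradient Accumulation (sketch)
accum_steps = 4                                    # 4 micro-batches act as one larger virtual batch
optimizer.zero_grad()

for step, (inputs, targets) in enumerate(loader):
    inputs, targets = inputs.cuda(), targets.cuda()
    loss = criterion(model(inputs), targets)
    (loss / accum_steps).backward()                # gradients accumulate in .grad across micro-batches
    if (step + 1) % accum_steps == 0:
        optimizer.step()                           # one optimizer update per accum_steps micro-batches
        optimizer.zero_grad()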

What is the difference between CUDA cores and Tensor Cores?

CUDA cores perform general FP32 operations (one multiply-add per clock per core). Tensor Cores (introduced with Volta/V100) perform a small matrix multiply-accumulate, such as a 4×4×4 tile in FP16, in a single instruction, giving roughly 8× the FP32 throughput on V100. PyTorch's torch.mm and nn.Linear use Tensor Cores automatically when inputs are FP16 or BF16 (and, on Ampere and newer, via TF32 for FP32 matmuls when enabled). This is why mixed precision training is so much faster on modern GPUs.
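
Two practical knobs related to this, shown as a hedged sketch: on Ampere and newer GPUs, FP32 matmuls can be routed through Tensor Cores via TF32, and FP16/BF16 matmuls hit the fastest Tensor Core kernels when the matrix dimensions are multiples of 8.

Python · Enabling Tensor Core Paths (sketch)
import torch

# Allow FP32 matmuls/convolutions to use TF32 Tensor Core paths on Ampere+ (a small precision trade-off)
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

x = torch.randn(4096, 8192, device="cuda", dtype=torch.float16)
w = torch.randn(8192, 1024, device="cuda", dtype=torch.float16)
y = x @ w        # FP16 matmul with dimensions that are multiples of 8 runs on Tensor Cores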
