Phase 5: GPU & Hardware for AI
Training a neural network involves doing the same mathematical operations — matrix multiplications — billions of times. GPUs can do this 100× faster than CPUs because they have thousands of cores designed for parallel computation. Understanding GPU hardware is essential for serious AI work.
Goal: Understand GPU architecture and use hardware efficiently
Duration: 4–6 weeks
Tools: CUDA, PyTorch CUDA, nvidia-smi, nvtop
Why GPUs? CPU vs GPU
🖥️ CPU (Central Processing Unit)
- 4–64 powerful cores
- Optimised for sequential tasks
- Large cache, complex branch prediction
- Clock: 3–5 GHz
- Great for: OS, web servers, databases
🎮 GPU (Graphics Processing Unit)
- Thousands of smaller cores (CUDA cores)
- Optimised for parallel matrix math
- High memory bandwidth
- A100: 6,912 CUDA + 432 Tensor Cores
- Great for: Training neural networks
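You can measure this gap yourself with a single large matrix multiplication. A minimal sketch in PyTorch (the exact speedup depends on your CPU, GPU, and matrix size):

import time
import torch

# A 4096x4096 matmul is ~137 billion floating-point operations (2 * n^3)
n = 4096
a = torch.randn(n, n)
b = torch.randn(n, n)

# Time it on the CPU
start = time.perf_counter()
c = a @ b
cpu_time = time.perf_counter() - start
print(f"CPU: {cpu_time:.3f}s")

# Time it on the GPU, if one is available
if torch.cuda.is_available():
    a_gpu, b_gpu = a.cuda(), b.cuda()
    _ = a_gpu @ b_gpu            # warm-up: triggers CUDA init and kernel load
    torch.cuda.synchronize()     # CUDA calls are asynchronous, so wait first
    start = time.perf_counter()
    c_gpu = a_gpu @ b_gpu
    torch.cuda.synchronize()
    gpu_time = time.perf_counter() - start
    print(f"GPU: {gpu_time*1000:.1f} ms ({cpu_time / gpu_time:.0f}x faster)")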
Key GPU Hardware Concepts
GPU Architecture & CUDA
Streaming multiprocessors (SMs), CUDA cores, warp execution, thread blocks, grids. How GPUs actually run code.
GPU Memory Management
VRAM, HBM2e, memory bandwidth, batch size tradeoffs, gradient checkpointing.
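Gradient checkpointing, mentioned in the list above, trades compute for memory: activations are discarded during the forward pass and recomputed during backward instead of being stored. A minimal sketch using PyTorch's built-in torch.utils.checkpoint (the Block module here is a hypothetical stand-in for any expensive layer):

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class Block(nn.Module):
    """Hypothetical expensive layer whose activations we don't want to store."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        return self.net(x)

block = Block(1024)
x = torch.randn(8, 1024, requires_grad=True)

# Normal forward: activations inside the block are kept for backward
y = block(x)

# Checkpointed forward: activations are recomputed during backward,
# cutting activation memory at the cost of one extra forward pass per block
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()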
Distributed Training
Data parallelism, model parallelism, pipeline parallelism, NVLink, InfiniBand.
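Data parallelism is the most common of these: every GPU holds a full copy of the model, processes a different slice of each batch, and gradients are averaged across GPUs after each backward pass. A minimal single-node sketch using PyTorch's DistributedDataParallel (launched with torchrun; the model and data are placeholders):

# Launch with: torchrun --nproc_per_node=NUM_GPUS train.py
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model; DDP synchronises gradients across all GPUs
    model = nn.Linear(1024, 10).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for step in range(100):
        # Each rank would normally get a different shard via DistributedSampler
        x = torch.randn(32, 1024, device=local_rank)
        y = torch.randint(0, 10, (32,), device=local_rank)
        loss = nn.functional.cross_entropy(model(x), y)
        optimizer.zero_grad()
        loss.backward()   # gradients are all-reduced across GPUs here
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()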
GPU Hardware Landscape (2024)
- Learning & small projects: Google Colab (free T4) or Kaggle (free P100).
- Fine-tuning 7B models: RTX 3090/4090 (24GB VRAM), ~$800–1,500.
- Training from scratch: rent A100s or H100s on AWS/GCP/Lambda Labs.
Tensor Cores — The AI Accelerator
Modern NVIDIA GPUs have specialised Tensor Cores that perform fused multiply-add on 4×4 matrices in a single clock cycle. These are 8–16× faster than standard CUDA cores for the matrix operations used in deep learning.
To use Tensor Cores, use mixed precision training: store weights in FP32 but do forward/backward pass in FP16. PyTorch makes this trivial:
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()
for inputs, targets in dataloader:
    optimizer.zero_grad()
    with autocast():  # Forward pass runs in FP16 on Tensor Cores
        output = model(inputs)
        loss = criterion(output, targets)
    scaler.scale(loss).backward()  # Scale gradients to prevent FP16 underflow
    scaler.step(optimizer)
    scaler.update()

Checking GPU Usage
# Check GPU memory and utilisation
nvidia-smi
# Watch live (updates every 1 second)
watch -n 1 nvidia-smi
# In Python
import torch
print(torch.cuda.is_available()) # True
print(torch.cuda.device_count()) # 1
print(torch.cuda.get_device_name(0)) # NVIDIA A100
print(f"Memory: {torch.cuda.memory_allocated()/1e9:.2f}GB used") Frequently Asked Questions
Can I train models without a GPU?
Yes, but slowly. Small models (<100K parameters) train fine on CPU. For anything larger, use Google Colab (free T4 GPU), Kaggle (free P100), or rent cloud GPUs. Training a BERT model that takes 2 minutes on a T4 would take ~3 hours on CPU.
What's the difference between VRAM and RAM?
VRAM (Video RAM) is memory on the GPU itself — fast, but limited (8–80GB). System RAM is on the CPU side — larger (16–512GB) but much slower to access from GPU. Model weights, activations, and gradients must all fit in VRAM during training.
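A useful back-of-the-envelope check before training: with the Adam optimiser in FP32, a common rule of thumb is roughly 16 bytes per parameter (4 for the weight, 4 for the gradient, 8 for Adam's two moment buffers), before counting activations. A rough sketch:

def training_memory_gb(num_params, bytes_per_param=16):
    """Rough VRAM estimate for weights + gradients + Adam state in FP32.
    Excludes activations, which depend on batch size and architecture."""
    return num_params * bytes_per_param / 1e9

# A 7B-parameter model: ~112 GB just for weights, gradients, and optimiser
# state, which is why full FP32 fine-tuning of a 7B model cannot fit on a
# 24GB card without memory-saving techniques.
print(f"{training_memory_gb(7e9):.0f} GB")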
What is FLOP/s and why does it matter?
FLOP/s = Floating Point Operations Per Second. It measures computational throughput. An H100 does ~1,000 TFLOP/s at FP16. More FLOP/s = faster training/inference. But memory bandwidth often bottlenecks before compute — moving data is frequently slower than computing.
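The memory-bandwidth point is easy to verify with arithmetic. A rough sketch for batch-1 inference through one 4096×4096 layer on an H100 (the peak figures are assumptions taken from published spec sheets; real kernels reach only a fraction of peak):

# FLOPs for an (M x K) @ (K x N) matmul: 2 * M * N * K
# Batch-1 inference through one 4096x4096 layer:
M, K, N = 1, 4096, 4096
flops = 2 * M * N * K                       # ~33.6 MFLOPs

peak_flops = 1e15                           # H100: ~1,000 TFLOP/s at FP16
compute_time = flops / peak_flops           # ~0.03 microseconds

# The weight matrix must stream in from VRAM (FP16 = 2 bytes per element):
bytes_moved = (M*K + K*N + M*N) * 2         # ~34 MB, dominated by the weights
peak_bandwidth = 3.35e12                    # H100 SXM: ~3.35 TB/s HBM3
memory_time = bytes_moved / peak_bandwidth  # ~10 microseconds

# Moving the weights takes ~300x longer than the arithmetic: memory-bound
print(f"compute: {compute_time*1e9:.0f} ns, memory: {memory_time*1e9:.0f} ns")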