GPU Memory Management for Deep Learning
"CUDA out of memory" is the most common error in deep learning. Understanding how GPU memory is allocated — and how to reduce it — lets you train larger models, run bigger batches, and avoid expensive hardware upgrades.
VRAM Types — What's Inside Your GPU
HBM (High Bandwidth Memory)
Used in data center GPUs: A100 (80GB, HBM2e), H100 (80GB, HBM3). Stacked memory dies on the same package as the GPU deliver extremely high bandwidth (2–3+ TB/s). Critical for LLM training.
A100 · H100 · MI300X
GDDR6X
Used in consumer GPUs: RTX 4090 (24GB), RTX 3090 (24GB). Cheaper than HBM but less bandwidth (~1 TB/s). Excellent for fine-tuning and inference on consumer budgets.
RTX 4090 · RTX 3090
Unified Memory (Apple Silicon)
Apple M-series chips share RAM between the CPU and GPU, so the full system RAM is accessible to the GPU (up to 192GB on an M2 Ultra). Bandwidth is lower than HBM, but the capacity is revolutionary for accessibility.
M2 Ultra · M3 Max
Where Does Memory Go During Training?
Model Weights: The actual parameter values. A 7B FP16 model is ~14GB. Frozen during inference, updated during training.
Optimizer States: Adam stores 2 momentum tensors per parameter (m, v). In FP32, this is 8 bytes per parameter — 56GB for a 7B model. The hidden cost of Adam.
Activations: Intermediate layer outputs stored for backprop. They grow linearly with batch size × sequence length. The main knob for memory reduction.
Gradients: Same shape as the weights, temporarily stored during the backward pass. Can be freed immediately after the optimizer step.
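To make these numbers concrete, here is a rough back-of-the-envelope estimate for a 7B-parameter model trained in BF16 with Adam. This is a sketch: the activation figure is illustrative only and depends heavily on architecture, batch size, and sequence length.

# Back-of-the-envelope VRAM estimate for a 7B-parameter model (BF16 + Adam)
params = 7e9

weights = params * 2        # BF16 weights: 2 bytes per parameter   -> 14 GB
grads = params * 2          # BF16 gradients: 2 bytes per parameter -> 14 GB
optimizer = params * 12     # FP32 master weights + m + v: 12 bytes -> 84 GB
activations = 20e9          # illustrative; grows with batch size x sequence length

total_gb = (weights + grads + optimizer + activations) / 1e9
print(f"total ~{total_gb:.0f} GB")   # ~132 GB before any memory-saving techniques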
Memory Reduction Techniques
1. Mixed Precision Training (BF16/FP16)
Store and compute activations in 16-bit instead of 32-bit. Halves activation memory, doubles throughput on Tensor Cores. Keep master weights in FP32 for numerical stability. BF16 is preferred over FP16 for LLM training (larger dynamic range, no loss scaling needed).
# PyTorch 2.0+ — simplest approach
import torch
from torch.cuda.amp import autocast, GradScaler

# Option 1: cast the whole model to BF16 (preferred on A100/H100, no scaling needed)
model = model.to(torch.bfloat16)

# Option 2: autocast per forward pass (keeps master weights in FP32)
scaler = GradScaler()  # Only needed for FP16, not BF16
with autocast(dtype=torch.bfloat16):
    output = model(inputs)
    loss = criterion(output, targets)
loss.backward()
optimizer.step()

2. Gradient Checkpointing
Instead of storing all activations for backprop, recompute them on the fly during the backward pass. Reduces activation memory by ~10× at the cost of ~30% more compute (the forward pass is effectively run twice). Essential for training large models on limited VRAM.
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class TransformerBlock(nn.Module):
    def forward(self, x):
        # Normally: all intermediate activations are stored
        # (attention and norm1 are assumed to be defined in __init__, omitted for brevity)
        return self.attention(self.norm1(x)) + x

class TransformerWithCheckpointing(nn.Module):
    def __init__(self):
        super().__init__()
        self.transformer_block = TransformerBlock()

    def forward(self, x):
        # With checkpointing: activations are recomputed during backward,
        # saving ~10x activation memory at ~30% compute cost
        return checkpoint(self.transformer_block, x, use_reentrant=False)

# Or enable it globally on Hugging Face models
model.gradient_checkpointing_enable()
# This call alone can reduce activation memory by 8-10× for LLMs

3. Gradient Accumulation — Simulate Large Batches
Accumulate gradients over several small micro-batches and step the optimizer only every few iterations, so a GPU that only fits a small batch updates the weights as if it had trained on a much larger one.
accumulation_steps = 8  # Effective batch = actual_batch × 8

optimizer.zero_grad()
for i, (inputs, targets) in enumerate(dataloader):
    outputs = model(inputs)
    loss = criterion(outputs, targets) / accumulation_steps  # Scale loss
    loss.backward()  # Gradients accumulate
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()

# Effect: train with batch=4 on the GPU, but update as if batch=32
# Memory usage stays at the small-batch level
# Gradient quality ≈ training with the large batch

4. CPU Offloading with DeepSpeed
DeepSpeed ZeRO-3 shards model states across GPUs and can offload optimizer states and parameters to CPU RAM, trading transfer latency for the ability to fit models that would otherwise not fit in VRAM.
{
"zero_optimization": {
"stage": 3,
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
},
"offload_param": {
"device": "cpu",
"pin_memory": true
},
"overlap_comm": true
},
"fp16": { "enabled": true },
"train_micro_batch_size_per_gpu": 1
}
// ZeRO-3 + offload: can train a 70B model on 4× A100 80GB GPUs
// Without offload: would require 8× A100s
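To wire this config into a training script, a minimal sketch might look like the following. The filename ds_config.json is an assumption, and model, dataloader, and criterion are placeholders for your own objects.

# Minimal sketch: wrap an existing model with DeepSpeed using the config above
import deepspeed

model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config="ds_config.json",       # assumed path to the JSON config shown above
)

for inputs, targets in dataloader:
    outputs = model_engine(inputs)
    loss = criterion(outputs, targets)
    model_engine.backward(loss)    # DeepSpeed handles loss scaling and ZeRO partitioning
    model_engine.step()            # optimizer step + gradient zeroing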
Quick Memory Rules of Thumb
Model Weights
FP32: 4 bytes × params
FP16/BF16: 2 bytes × params
INT4: 0.5 bytes × params
Adam Optimizer
Always stored in FP32:
12 bytes × params
(weight + m + v, each 4 bytes)
Training Total
Rule of thumb for FP16:
~16–20 bytes × params
(weights + grads + optimizer + activations)
Inference Only
No optimizer states or gradients. INT8 quantized:
~1 byte × params
7B model ≈ 7GB (INT4 roughly halves this again)
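As a quick sanity check on these rules, a tiny helper can apply the bytes-per-parameter figures above. This is a sketch of weight memory only; real deployments also need room for the KV cache and framework overhead.

# Weight memory in GB ≈ parameters (in billions) × bytes per parameter
def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    return params_billion * bytes_per_param

for name, bpp in [("FP32", 4), ("FP16/BF16", 2), ("INT8", 1), ("INT4", 0.5)]:
    print(f"7B  @ {name:>9}: ~{weight_memory_gb(7, bpp):5.1f} GB   "
          f"70B @ {name:>9}: ~{weight_memory_gb(70, bpp):6.1f} GB")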
Frequently Asked Questions
How do I find what's taking up my VRAM?
Use torch.cuda.memory_summary(device=0, abbreviated=False) for a detailed breakdown. For profiling, the PyTorch Memory Profiler (torch.profiler with profile_memory=True) shows peak memory usage per operation. Also: run with batch_size=1 to measure baseline model+optimizer memory, then increase batch size to measure activation growth.
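A minimal sketch of both approaches is below; model, inputs, and the device index are placeholders for your own setup.

import torch
from torch.profiler import profile, ProfilerActivity

# One-off snapshot of the CUDA caching allocator
print(torch.cuda.memory_summary(device=0, abbreviated=False))
print(f"peak allocated: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")

# Per-operation memory usage with the PyTorch profiler
with profile(activities=[ProfilerActivity.CUDA], profile_memory=True) as prof:
    loss = model(inputs).sum()
    loss.backward()
print(prof.key_averages().table(sort_by="self_cuda_memory_usage", row_limit=10))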
Is it better to use one large batch or many small accumulated batches?
Mathematically equivalent for the gradient update (gradient accumulation averages gradients over the micro-batches). However, a true large batch typically runs faster in wall-clock time and pairs naturally with the linear scaling rule (scale the learning rate proportionally to the batch size). Small accumulated batches use less VRAM. For most fine-tuning scenarios, gradient accumulation up to an effective batch size of 64–128 is sufficient.
Can I use system RAM (CPU) as GPU memory overflow?
Yes, via CPU offloading in DeepSpeed ZeRO-Offload or Accelerate. CPU↔GPU transfer bandwidth is the bottleneck (~32 GB/s over PCIe 4.0 x16 versus ~2 TB/s for HBM2e), so offloaded tensors are only fetched to the GPU when needed. This adds latency but allows training models that would otherwise OOM. With NVLink (600 GB/s GPU-to-GPU) and faster CPU-GPU interconnects, offloading is increasingly practical.