GPU Memory Management for Deep Learning
"CUDA out of memory" is the most common error in deep learning. Understanding how GPU memory is allocated — and how to reduce it — lets you train larger models, run bigger batches, and avoid expensive hardware upgrades.
VRAM Types — What's Inside Your GPU
HBM (High Bandwidth Memory)
Used in data center GPUs: A100 (80GB, HBM2e), H100 (80GB, HBM3). Stacked memory dies on the same package as the GPU deliver extremely high bandwidth (2–3+ TB/s). Critical for LLM training.
A100 · H100 · MI300X
GDDR6X
Used in consumer GPUs: RTX 4090 (24GB), RTX 3090 (24GB). Cheaper than HBM but less bandwidth (~1 TB/s). Excellent for fine-tuning and inference on consumer budgets.
RTX 4090 · RTX 3090
Unified Memory (Apple Silicon)
Apple M-series chips share RAM between the CPU and GPU, so the full system RAM is accessible to the GPU (up to 192GB on an M2 Ultra). Bandwidth is lower than HBM, but the capacity is revolutionary for accessibility.
M2 Ultra · M3 Max
Where Does Memory Go During Training?
Model Weights: The actual parameter values. A 7B FP16 model is ~14GB. Frozen during inference, updated during training.
Optimizer States: Adam stores 2 momentum tensors per parameter (m, v). In FP32, this is 8 bytes per parameter — 56GB for a 7B model. The hidden cost of Adam.
Activations: Intermediate layer outputs stored for backprop. They grow linearly with batch size × sequence length. The main knob for memory reduction.
Gradients: Same shape as the weights, temporarily stored during the backward pass. Can be freed immediately after the optimizer step.
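To make these numbers concrete, here is a rough back-of-the-envelope estimate for a 7B-parameter model trained in BF16 with Adam. This is a sketch: the activation figure is illustrative only and depends heavily on architecture, batch size, and sequence length.

# Back-of-the-envelope VRAM estimate for a 7B-parameter model (BF16 + Adam)
params = 7e9

weights = params * 2        # BF16 weights: 2 bytes per parameter   -> 14 GB
grads = params * 2          # BF16 gradients: 2 bytes per parameter -> 14 GB
optimizer = params * 12     # FP32 master weights + m + v: 12 bytes -> 84 GB
activations = 20e9          # illustrative; grows with batch size x sequence length

total_gb = (weights + grads + optimizer + activations) / 1e9
print(f"total ~{total_gb:.0f} GB")   # ~132 GB before any memory-saving techniques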
Memory Reduction Techniques
1. Mixed Precision Training (BF16/FP16)
Store and compute activations in 16-bit instead of 32-bit. Halves activation memory, doubles throughput on Tensor Cores. Keep master weights in FP32 for numerical stability. BF16 is preferred over FP16 for LLM training (larger dynamic range, no loss scaling needed).
# PyTorch 2.0+ — simplest approach
import torch
from torch.cuda.amp import autocast, GradScaler

# Option 1: cast the whole model to BF16 (preferred on A100/H100, no scaling needed)
model = model.to(torch.bfloat16)

# Option 2: autocast per forward pass (keeps master weights in FP32)
scaler = GradScaler()  # Only needed for FP16, not BF16
with autocast(dtype=torch.bfloat16):
    output = model(inputs)
    loss = criterion(output, targets)
loss.backward()
optimizer.step()

2. Gradient Checkpointing
Instead of storing all activations for backprop, recompute them on the fly during the backward pass. Reduces activation memory by ~10× at the cost of ~30% more compute (the forward pass is effectively run twice). Essential for training large models on limited VRAM.
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class TransformerBlock(nn.Module):
    def forward(self, x):
        # Normally: all intermediate activations are stored
        # (attention and norm1 are assumed to be defined in __init__, omitted for brevity)
        return self.attention(self.norm1(x)) + x

class TransformerWithCheckpointing(nn.Module):
    def __init__(self):
        super().__init__()
        self.transformer_block = TransformerBlock()

    def forward(self, x):
        # With checkpointing: activations are recomputed during backward,
        # saving ~10x activation memory at ~30% compute cost
        return checkpoint(self.transformer_block, x, use_reentrant=False)

# Or enable it globally on Hugging Face models
model.gradient_checkpointing_enable()
# This call alone can reduce activation memory by 8-10× for LLMs

3. Gradient Accumulation — Simulate Large Batches
Accumulate gradients over several small micro-batches and step the optimizer only every few iterations, so a GPU that only fits a small batch updates the weights as if it had trained on a much larger one.
accumulation_steps = 8  # Effective batch = actual_batch × 8

optimizer.zero_grad()
for i, (inputs, targets) in enumerate(dataloader):
    outputs = model(inputs)
    loss = criterion(outputs, targets) / accumulation_steps  # Scale loss
    loss.backward()  # Gradients accumulate
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()

# Effect: train with batch=4 on the GPU, but update as if batch=32
# Memory usage stays at the small-batch level
# Gradient quality ≈ training with the large batch

4. CPU Offloading with DeepSpeed
DeepSpeed ZeRO-3 shards model states across GPUs and can offload optimizer states and parameters to CPU RAM, trading transfer latency for the ability to fit models that would otherwise not fit in VRAM.
{
"zero_optimization": {
"stage": 3,
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
},
"offload_param": {
"device": "cpu",
"pin_memory": true
},
"overlap_comm": true
},
"fp16": { "enabled": true },
"train_micro_batch_size_per_gpu": 1
}
// ZeRO-3 + offload: can train a 70B model on 4× A100 80GB GPUs
// Without offload: would require 8× A100s
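To wire this config into a training script, a minimal sketch might look like the following. The filename ds_config.json is an assumption, and model, dataloader, and criterion are placeholders for your own objects.

# Minimal sketch: wrap an existing model with DeepSpeed using the config above
import deepspeed

model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config="ds_config.json",       # assumed path to the JSON config shown above
)

for inputs, targets in dataloader:
    outputs = model_engine(inputs)
    loss = criterion(outputs, targets)
    model_engine.backward(loss)    # DeepSpeed handles loss scaling and ZeRO partitioning
    model_engine.step()            # optimizer step + gradient zeroing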
Quick Memory Rules of Thumb
Model Weights
FP32: 4 bytes × params
FP16/BF16: 2 bytes × params
INT4: 0.5 bytes × params
Adam Optimizer
Always stored in FP32:
12 bytes × params
(weight + m + v, each 4 bytes)
Training Total
Rule of thumb for FP16:
~16–20 bytes × params
(weights + grads + optimizer + activations)
Inference Only
No optimizer states or gradients. INT8 quantized:
~1 byte × params
7B model ≈ 7GB (INT4 roughly halves this again)
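As a quick sanity check on these rules, a tiny helper can apply the bytes-per-parameter figures above. This is a sketch of weight memory only; real deployments also need room for the KV cache and framework overhead.

# Weight memory in GB ≈ parameters (in billions) × bytes per parameter
def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    return params_billion * bytes_per_param

for name, bpp in [("FP32", 4), ("FP16/BF16", 2), ("INT8", 1), ("INT4", 0.5)]:
    print(f"7B  @ {name:>9}: ~{weight_memory_gb(7, bpp):5.1f} GB   "
          f"70B @ {name:>9}: ~{weight_memory_gb(70, bpp):6.1f} GB")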
Frequently Asked Questions
How do I find what's taking up my VRAM?
Use torch.cuda.memory_summary(device=0, abbreviated=False) for a detailed breakdown. For profiling, the PyTorch Memory Profiler (torch.profiler with profile_memory=True) shows peak memory usage per operation. Also: run with batch_size=1 to measure baseline model+optimizer memory, then increase batch size to measure activation growth.
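A minimal sketch of both approaches is below; model, inputs, and the device index are placeholders for your own setup.

import torch
from torch.profiler import profile, ProfilerActivity

# One-off snapshot of the CUDA caching allocator
print(torch.cuda.memory_summary(device=0, abbreviated=False))
print(f"peak allocated: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")

# Per-operation memory usage with the PyTorch profiler
with profile(activities=[ProfilerActivity.CUDA], profile_memory=True) as prof:
    loss = model(inputs).sum()
    loss.backward()
print(prof.key_averages().table(sort_by="self_cuda_memory_usage", row_limit=10))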
Is it better to use one large batch or many small accumulated batches?
Mathematically equivalent for the gradient update (gradient accumulation averages gradients over the micro-batches). However, a true large batch typically runs faster in wall-clock time and pairs naturally with the linear scaling rule (scale the learning rate proportionally to the batch size). Small accumulated batches use less VRAM. For most fine-tuning scenarios, gradient accumulation up to an effective batch size of 64–128 is sufficient.
Can I use system RAM (CPU) as GPU memory overflow?
Yes, via CPU offloading in DeepSpeed ZeRO-Offload or Accelerate. CPU↔GPU transfer bandwidth is the bottleneck (~32 GB/s over PCIe 4.0 x16 versus ~2 TB/s for HBM2e), so offloaded tensors are only fetched to the GPU when needed. This adds latency but allows training models that would otherwise OOM. With NVLink (600 GB/s GPU-to-GPU) and faster CPU-GPU interconnects, offloading is increasingly practical.