AI Training Infrastructure

Training a large language model is one of the most complex distributed computing challenges humans have ever attempted. It's not just plugging in GPUs — it requires carefully orchestrated pipelines, fault-tolerant distributed systems, and infrastructure optimizations that make the difference between a training run that finishes and one that fails halfway through.

What Does an AI Training Run Actually Look Like?

At a high level, training a large model involves: collecting data → processing data → loading data into GPUs → running forward and backward passes → saving checkpoints → repeating for billions or trillions of tokens. In practice, each of these steps involves significant infrastructure.

The Data Pipeline

Before training starts, raw data (web crawls, books, code) must be deduplicated, tokenized, shuffled, and stored in an efficient format (usually packed sequences of fixed length). For a 1T-token training run, this data pipeline processes petabytes of text. It typically runs on CPU-based distributed processing (Spark, Ray, or Dask) and stores outputs in a format like WebDataset or MosaicML StreamingDataset that GPUs can read efficiently at training time.
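
To make the last step concrete, here is a minimal sketch of sequence packing, the step that turns tokenized documents into fixed-length training examples. The sequence length and end-of-document token id are illustrative assumptions, and a real pipeline would run this across many workers.

```python
from typing import Iterable, Iterator

SEQ_LEN = 4096        # packed sequence length used at training time (assumed)
EOS_TOKEN_ID = 0      # assumed end-of-document token id

def pack_sequences(token_streams: Iterable[list[int]]) -> Iterator[list[int]]:
    """Concatenate tokenized documents and yield fixed-length training examples."""
    buffer: list[int] = []
    for doc_tokens in token_streams:
        buffer.extend(doc_tokens)
        buffer.append(EOS_TOKEN_ID)          # mark the document boundary
        while len(buffer) >= SEQ_LEN:
            yield buffer[:SEQ_LEN]           # one packed training example
            buffer = buffer[SEQ_LEN:]        # carry the remainder into the next block
```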

The Training Loop

Training is an iterative process: load a batch of data → run it through the model (forward pass) → calculate the loss → compute gradients (backward pass) → update model weights. This repeats for hundreds of thousands to millions of steps. Each step takes anywhere from milliseconds to a few seconds on a GPU cluster, but multiplied across the whole run, training takes days to months.
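
As a concrete illustration, here is a minimal single-GPU sketch of that loop in PyTorch. The model, batch size, and synthetic data are placeholders; real runs wrap this in distributed data and model parallelism.

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()              # stand-in for a real model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

for step in range(1000):                                 # real runs: ~1e5 to 1e6 steps
    inputs = torch.randn(32, 1024, device="cuda")        # stand-in for a data batch
    targets = torch.randn(32, 1024, device="cuda")
    outputs = model(inputs)                              # forward pass
    loss = torch.nn.functional.mse_loss(outputs, targets)  # calculate the loss
    loss.backward()                                      # backward pass: compute gradients
    optimizer.step()                                     # update model weights
    optimizer.zero_grad(set_to_none=True)                # clear gradients for the next step
```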

Checkpointing: Surviving Failures

In a cluster of 1,000 GPUs, hardware failure is not a rare event — it's expected. A training run that doesn't save progress regularly will lose hours or days of compute when a GPU fails.

What a Checkpoint Contains

A model checkpoint saves: model weights (billions of floating-point numbers), optimizer state (Adam's first- and second-moment buffers, roughly 2x the size of the weights), and the current position in the training data. A checkpoint for a 70B-parameter model in FP16 is about 140GB for the weights alone; with optimizer state included, it is often 3-4x larger.
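
In plain PyTorch, a checkpoint is just a dictionary of this state written to disk. Large runs use sharded or distributed checkpoint formats instead; the model, optimizer, and step below are the placeholders from the training-loop sketch above.

```python
import torch

checkpoint = {
    "model": model.state_dict(),          # the weights
    "optimizer": optimizer.state_dict(),  # Adam's first- and second-moment buffers
    "step": step,                         # position in the training run
    "rng_state": torch.get_rng_state(),   # so data order and dropout are reproducible
}
torch.save(checkpoint, f"checkpoint_step_{step}.pt")
```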

Checkpoint Strategy

Modern training frameworks checkpoint every N steps (e.g., every 1,000 steps) and save to fast object storage (S3, GCS). On failure, training resumes from the last checkpoint. The tradeoff: checkpointing interrupts training (writing hundreds of GB to storage takes time), so checkpointing frequency is tuned based on failure rate and checkpoint speed.
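
Resuming is the mirror image: load the most recent checkpoint and continue from the saved step. The filename below is illustrative and matches the save sketch above.

```python
import torch

checkpoint = torch.load("checkpoint_step_12000.pt", map_location="cpu")
model.load_state_dict(checkpoint["model"])           # model and optimizer assumed to exist
optimizer.load_state_dict(checkpoint["optimizer"])
start_step = checkpoint["step"] + 1                  # continue from the next step
```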

Async checkpointing: Modern frameworks (PyTorch FSDP, Megatron-LM) support async checkpointing — writing checkpoints in the background while training continues on GPU, reducing checkpoint overhead from minutes to seconds.
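
The core idea can be sketched with a background thread: snapshot the state to CPU memory, then write it to storage while the GPUs keep training. This only illustrates the overlap; real frameworks handle sharding, consistency, and storage uploads.

```python
import threading
import torch

def save_async(model, optimizer, step):
    # Copy weights to CPU first so the GPUs can keep training while we write.
    # (A real implementation would snapshot the optimizer state the same way.)
    snapshot = {
        "model": {k: v.detach().cpu().clone() for k, v in model.state_dict().items()},
        "optimizer": optimizer.state_dict(),
        "step": step,
    }
    writer = threading.Thread(
        target=torch.save, args=(snapshot, f"checkpoint_step_{step}.pt")
    )
    writer.start()
    return writer   # join() later if you need to block on completion
```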

Training Job Orchestration

Running a training job on 1,000 GPUs requires orchestration — something that coordinates starting all the processes, handling failures, routing network traffic, and monitoring progress.

Slurm — The HPC Standard

Slurm (Simple Linux Utility for Resource Management) is the dominant scheduler in HPC and many AI training clusters. You submit a job script describing what you need (nodes, GPUs, time), and Slurm queues it and launches it when resources are available. Meta, Google DeepMind, and many others use Slurm-like systems for training.

Kubernetes for Training

Kubernetes (with operators like PyTorchJob from Kubeflow) provides a cloud-native alternative to Slurm. You define a distributed training job as a Kubernetes custom resource, and the operator handles launching the pods, setting up process group communication, and restarting failed workers. Used heavily in cloud-native AI platforms.

Ray Train

Ray (created at UC Berkeley and now developed by Anyscale) provides a Python-native distributed computing framework with built-in support for distributed ML training. Ray Train abstracts away the complexity of process groups and distributed coordination, making it easier to scale PyTorch training from a single GPU to thousands.
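
Here is a minimal sketch of how a Ray Train job is typically structured, assuming Ray 2.x's TorchTrainer API; the model and synthetic data are placeholders.

```python
import torch
import ray.train.torch
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_loop_per_worker(config):
    # Ray has already set up the torch.distributed process group;
    # prepare_model() moves the module to the right device and wraps it
    # for data-parallel training.
    device = ray.train.torch.get_device()
    model = ray.train.torch.prepare_model(torch.nn.Linear(1024, 1024))
    optimizer = torch.optim.AdamW(model.parameters(), lr=config["lr"])
    for _ in range(config["steps"]):
        batch = torch.randn(32, 1024, device=device)   # dummy batch
        loss = model(batch).square().mean()
        loss.backward()
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)

trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={"lr": 3e-4, "steps": 100},
    scaling_config=ScalingConfig(num_workers=8, use_gpu=True),  # 8 GPU workers
)
result = trainer.fit()
```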

Monitoring a Training Run

How do you know if a training run is going well when it takes weeks and involves thousands of GPUs?

Loss Curves

Training loss should decrease smoothly. Loss spikes often indicate bad batches, learning rate issues, or numerical instability in mixed-precision training.

GPU Utilization

Low GPU utilization (under roughly 80%) usually means the data pipeline or inter-GPU communication is the bottleneck, not compute. Every idle GPU second is wasted money.

GPU Temperature

High temperatures can trigger thermal throttling, reducing performance. Data center cooling is a real constraint at large scale.

MFU (Model FLOP Utilization)

MFU measures how much of the theoretical GPU compute you're actually using. Top training teams target 40–60% MFU; the rest is lost to data loading, communication, and memory-bandwidth constraints.
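
As a back-of-the-envelope example, MFU can be estimated from training throughput using the common approximation of roughly 6 FLOPs per parameter per token. All numbers below are illustrative assumptions.

```python
params = 70e9                 # 70B-parameter model
tokens_per_second = 1.0e6     # measured cluster-wide throughput (assumed)
num_gpus = 1024
peak_flops_per_gpu = 989e12   # e.g. H100 dense BF16 peak, ~989 TFLOP/s

achieved_flops = 6 * params * tokens_per_second   # FLOP/s actually delivered
peak_flops = num_gpus * peak_flops_per_gpu        # theoretical cluster peak
mfu = achieved_flops / peak_flops
print(f"MFU = {mfu:.1%}")                         # ~41% with these assumed numbers
```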

Frequently Asked Questions

How long does it take to train a large language model?

It varies enormously by model size and compute budget. GPT-3 (175B parameters) has been estimated to require roughly 34 days on about 1,000 A100 GPUs. Llama 2 (70B) took around 1.7 million A100 GPU-hours. Smaller models (7B) can be trained in weeks on around a hundred GPUs. Modern models train faster thanks to better hardware (H100), better algorithms (FlashAttention, selective recomputation), and more efficient training frameworks.

What is mixed-precision training?

Mixed-precision training uses 16-bit (FP16 or BF16) numbers instead of 32-bit (FP32) for most computations. This roughly halves the memory needed for activations, roughly doubles compute throughput on modern tensor cores, and halves memory-bandwidth requirements. The "master copy" of the weights is kept in FP32 for numerical stability, but forward/backward passes run in FP16/BF16. Almost all large-model training uses BF16 today (it handles a wider range of values than FP16, reducing loss spikes).
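
A minimal sketch of this in PyTorch uses autocast to run the forward pass in BF16 while the parameters (and the optimizer update) stay in FP32; the model and data are placeholders.

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()              # parameters stay in FP32
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

inputs = torch.randn(32, 1024, device="cuda")           # stand-in for a data batch
targets = torch.randn(32, 1024, device="cuda")

with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    outputs = model(inputs)                             # forward pass runs in BF16
    loss = torch.nn.functional.mse_loss(outputs, targets)
loss.backward()                                         # gradients land on the FP32 parameters
optimizer.step()
optimizer.zero_grad(set_to_none=True)
```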

What happens when a node fails mid-training?

With proper checkpointing, training resumes from the last checkpoint — losing only the work since the last save. Without checkpointing, you might lose hours of compute. Modern infrastructure teams set up automatic node replacement: when a node fails, the orchestration system provisions a new one, the training job restarts from the last checkpoint, and training continues. Large runs treat node failures as routine events to engineer around, not exceptional situations.

What is the difference between pre-training and fine-tuning in terms of infrastructure?

Pre-training (training a model from scratch on vast amounts of data) requires massive infrastructure — thousands of GPUs, weeks of runtime, petabytes of data. It's done by a handful of well-resourced labs. Fine-tuning (adapting a pre-trained model to a specific task) requires far less — often a single GPU or a small cluster for hours to days. Techniques like LoRA (Low-Rank Adaptation) make fine-tuning even more efficient, requiring only a fraction of GPU memory.
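
As an illustration, here is a minimal LoRA setup sketch using Hugging Face's peft library; the model id, target modules, and rank are illustrative assumptions, not a recommendation.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # assumed model id
lora_config = LoraConfig(
    r=16,                                  # low-rank dimension
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],   # common choice: attention projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()         # typically well under 1% of the base model
```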
