PyTorch Fundamentals — Complete Beginner Guide

PyTorch is the leading deep learning framework for research and increasingly for production. Its Pythonic design, dynamic computation graphs, and excellent GPU support make it the go-to choice for everything from quick experiments to training billion-parameter LLMs.

🔥 Covers: Tensors · Autograd · nn.Module · DataLoader · Training Loop · Saving Models · GPU · Image Classifier

Installation

Shell · Install PyTorch
# CUDA 12.1 (recommended for most NVIDIA GPUs)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# CPU-only / Apple Silicon (default wheels; include MPS support on Mac M-series)
pip install torch torchvision torchaudio

# Check installation
python -c "import torch; print(torch.__version__); print(torch.cuda.is_available())"

1. Tensors — PyTorch's Core Data Structure

A tensor is a multi-dimensional array — like NumPy arrays but with GPU support and automatic differentiation. Everything in PyTorch is a tensor: weights, activations, gradients, data.

Python · Tensor Basics
import torch

# Creating tensors
x = torch.tensor([[1.0, 2.0], [3.0, 4.0]])  # From Python list
y = torch.zeros(3, 4)                          # 3×4 zeros
z = torch.randn(2, 3, 4)                       # Random normal, shape (2,3,4)

print(x.shape)    # torch.Size([2, 2])
print(x.dtype)    # torch.float32
print(x.device)   # cpu

# Tensor operations (same API as NumPy)
a = torch.ones(3, 3)
b = torch.eye(3)
print(a + b)             # Element-wise add
print(a @ b)             # Matrix multiply
print(a.mean(), a.std()) # Statistics

# Move to GPU
if torch.cuda.is_available():
    x_gpu = x.to("cuda")       # or x.cuda()
    print(x_gpu.device)        # cuda:0

# Reshape / view
flat = x.view(-1)    # [1, 2, 3, 4] — shared memory
flat = x.reshape(-1) # Safer: may copy

# NumPy interop
import numpy as np
arr = x.numpy()        # Tensor → NumPy (CPU only, shared memory)
t = torch.from_numpy(arr)  # NumPy → Tensor

2. Autograd — Automatic Differentiation

PyTorch records every operation performed on tensors with requires_grad=True and builds a computation graph. Calling .backward() then walks that graph and fills in .grad for every leaf tensor that requires gradients.

Python · Autograd and Gradient Computation
x = torch.tensor(3.0, requires_grad=True)
y = torch.tensor(2.0, requires_grad=True)

# Define a computation
z = x**2 + 3*y + 5
# z = 3² + 3×2 + 5 = 20

z.backward()   # Compute gradients: dz/dx, dz/dy

print(x.grad)  # dz/dx = 2x = 6.0
print(y.grad)  # dz/dy = 3.0

# In a training loop, gradients accumulate — always zero them first
x.grad.zero_()

# Disable grad tracking (for inference — saves memory and speed)
with torch.no_grad():
    pred = model(input_data)  # No graph is built (model / input_data: any module and input batch)

3. nn.Module — Building Neural Networks

Python · Custom nn.Module
import torch.nn as nn
import torch.nn.functional as F

class MLP(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super().__init__()
        # Define layers as attributes — PyTorch tracks their parameters
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.bn1 = nn.BatchNorm1d(hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, hidden_dim)
        self.fc3 = nn.Linear(hidden_dim, output_dim)
        self.dropout = nn.Dropout(0.3)

    def forward(self, x):
        # Define the forward pass
        x = F.relu(self.bn1(self.fc1(x)))
        x = self.dropout(x)
        x = F.relu(self.fc2(x))
        x = self.dropout(x)
        return self.fc3(x)   # Raw logits (no softmax — CrossEntropyLoss does it)

model = MLP(784, 512, 10)
print(model)

# Inspect parameters
for name, param in model.named_parameters():
    print(f"{name}: {param.shape}")

total_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total_params:,}")

4. DataLoader — Efficient Data Pipelines

Python · Custom Dataset and DataLoader
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms
import PIL.Image

class ImageDataset(Dataset):
    def __init__(self, image_paths, labels, transform=None):
        self.image_paths = image_paths
        self.labels = labels
        self.transform = transform

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        img = PIL.Image.open(self.image_paths[idx]).convert('RGB')
        if self.transform:
            img = self.transform(img)
        label = self.labels[idx]
        return img, label

# Define augmentation pipeline
train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomResizedCrop(224),   # Random crop + resize to 224×224 (plain RandomCrop fails on smaller images)
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),  # ImageNet stats
])

# train_paths / train_labels: your own lists of image file paths and integer class labels
train_dataset = ImageDataset(train_paths, train_labels, train_transform)
train_loader = DataLoader(
    train_dataset,
    batch_size=32,
    shuffle=True,
    num_workers=4,     # Parallel data loading
    pin_memory=True,   # Faster CPU→GPU transfers
    drop_last=True,    # Drop final incomplete batch
)

5. The Training Loop

Python · Complete Training + Validation Loop
import torch.optim as optim
from torch.optim.lr_scheduler import CosineAnnealingLR

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

num_epochs = 20   # Total training epochs (pick what suits your dataset)
optimizer = optim.AdamW(model.parameters(), lr=3e-4, weight_decay=1e-4)
scheduler = CosineAnnealingLR(optimizer, T_max=num_epochs)
criterion = nn.CrossEntropyLoss()

def train_epoch(model, loader, optimizer, criterion, device):
    model.train()  # Enables dropout, batch norm in train mode
    total_loss, correct = 0, 0
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)

        optimizer.zero_grad()           # 1. Clear previous gradients
        outputs = model(images)         # 2. Forward pass
        loss = criterion(outputs, labels)  # 3. Compute loss
        loss.backward()                 # 4. Backprop (compute gradients)
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # Clip gradients
        optimizer.step()                # 5. Update weights

        total_loss += loss.item()
        correct += (outputs.argmax(1) == labels).sum().item()
    return total_loss / len(loader), correct / len(loader.dataset)

def eval_epoch(model, loader, criterion, device):
    model.eval()  # Disables dropout, uses running stats for batch norm
    total_loss, correct = 0, 0
    with torch.no_grad():
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            loss = criterion(outputs, labels)
            total_loss += loss.item()
            correct += (outputs.argmax(1) == labels).sum().item()
    return total_loss / len(loader), correct / len(loader.dataset)

# Main training loop
for epoch in range(num_epochs):
    train_loss, train_acc = train_epoch(model, train_loader, optimizer, criterion, device)
    val_loss, val_acc = eval_epoch(model, val_loader, criterion, device)
    scheduler.step()
    print(f"Epoch {epoch+1}: Train Loss={train_loss:.4f} Acc={train_acc:.3f} | Val Loss={val_loss:.4f} Acc={val_acc:.3f}")

6. Saving and Loading Models

Python · Save / Load Checkpoints
# Save full checkpoint (recommended)
torch.save({
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'val_loss': val_loss,
}, 'checkpoint.pth')

# Load checkpoint to resume training
checkpoint = torch.load('checkpoint.pth', map_location=device)
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
start_epoch = checkpoint['epoch'] + 1   # Resume from the epoch after the saved one

# Save model weights only (for deployment)
torch.save(model.state_dict(), 'model_weights.pth')

# Load for inference
model = MLP(784, 512, 10)
model.load_state_dict(torch.load('model_weights.pth', map_location='cpu'))
model.eval()

Common PyTorch Patterns at a Glance

🎯 Multi-GPU (DataParallel)

model = nn.DataParallel(model)

Quick single-process multi-GPU: wraps the model and splits each batch across all visible GPUs. It is easy but scales poorly; for serious training, use DistributedDataParallel (DDP) instead.
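A minimal DDP sketch, under the assumption that you reuse the MLP and train_dataset defined earlier and launch one process per GPU with torchrun (e.g. torchrun --nproc_per_node=2 train.py):

Python · DistributedDataParallel Sketch
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

dist.init_process_group(backend="nccl")            # One process per GPU, started by torchrun
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = MLP(784, 512, 10).to(local_rank)
model = DDP(model, device_ids=[local_rank])        # Gradients are averaged across processes

sampler = DistributedSampler(train_dataset)        # Each process gets a distinct shard of the data
train_loader = DataLoader(train_dataset, batch_size=32, sampler=sampler)
# Run the usual training loop (section 5), calling sampler.set_epoch(epoch) each epoch, then clean up:
dist.destroy_process_group()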

⚡ Mixed Precision

scaler = torch.cuda.amp.GradScaler()

Train in FP16/BF16 for up to roughly 2× speed and lower memory use. Wrap the forward pass in the autocast() context manager and use GradScaler to scale the loss before calling backward().
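A minimal sketch of one mixed-precision training step, reusing the model, optimizer, criterion, and train_loader from section 5:

Python · Mixed-Precision Training Step
scaler = torch.cuda.amp.GradScaler()

for images, labels in train_loader:
    images, labels = images.to(device), labels.to(device)
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():        # Forward pass runs in reduced precision where safe
        outputs = model(images)
        loss = criterion(outputs, labels)
    scaler.scale(loss).backward()          # Scale the loss so small FP16 gradients do not underflow
    scaler.step(optimizer)                 # Unscales gradients, then calls optimizer.step()
    scaler.update()                        # Adjusts the scale factor for the next step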

📊 TensorBoard

from torch.utils.tensorboard import SummaryWriter

Log metrics, model graphs, images. Run tensorboard --logdir=runs to view dashboards in browser.
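A minimal logging sketch, assuming the train_epoch / eval_epoch helpers from section 5 and a made-up run name; it requires the tensorboard package to be installed:

Python · Logging Metrics to TensorBoard
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="runs/mlp_baseline")   # "mlp_baseline" is just an example run name

for epoch in range(num_epochs):
    train_loss, train_acc = train_epoch(model, train_loader, optimizer, criterion, device)
    val_loss, val_acc = eval_epoch(model, val_loader, criterion, device)
    writer.add_scalar("Loss/train", train_loss, epoch)
    writer.add_scalar("Loss/val", val_loss, epoch)
    writer.add_scalar("Accuracy/val", val_acc, epoch)

writer.close()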

🔍 torch.compile

model = torch.compile(model)

PyTorch 2.0+: JIT-compile your model with a single line for speedups often in the 1.5–3× range, depending on the model. On GPU it generates Triton kernels under the hood.
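A small usage sketch, reusing the MLP from section 3; the first forward call is slow because it triggers compilation, and later calls reuse the generated kernels:

Python · Using torch.compile
model = MLP(784, 512, 10).to(device)
model = torch.compile(model)               # Requires PyTorch 2.0+

x = torch.randn(32, 784, device=device)
out = model(x)                             # First call compiles (slow); subsequent calls are fast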

Frequently Asked Questions

When should I use model.train() vs model.eval()?

Call model.train() at the start of each training loop — it enables dropout (randomly zeros neurons) and uses batch statistics for BatchNorm. Call model.eval() before evaluation and inference — it disables dropout and uses running mean/variance for BatchNorm. Forgetting model.eval() is one of the most common PyTorch bugs, leading to inconsistent predictions.
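A quick way to see the difference with a standalone Dropout layer (the same toggling applies to a whole model):

Python · train() vs eval() with Dropout
layer = nn.Dropout(p=0.5)
x = torch.ones(1, 4)

layer.train()
print(layer(x))   # Random elements zeroed, survivors scaled by 1/(1-p) = 2; changes every call
print(layer(x))

layer.eval()
print(layer(x))   # Identity: tensor([[1., 1., 1., 1.]]); deterministic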

Why do I need optimizer.zero_grad() every step?

PyTorch accumulates (adds) gradients by default rather than replacing them. This is useful for gradient accumulation tricks (simulating larger batches) but means you must explicitly zero them before each backward pass. If you forget, gradients from the previous step add to the current step's gradients, corrupting training.
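A minimal sketch of that gradient-accumulation trick, reusing the model, optimizer, criterion, and train_loader from section 5; accum_steps is just an illustrative value:

Python · Gradient Accumulation
accum_steps = 4   # Effective batch size = batch_size × accum_steps (illustrative value)

optimizer.zero_grad()
for step, (images, labels) in enumerate(train_loader):
    images, labels = images.to(device), labels.to(device)
    loss = criterion(model(images), labels) / accum_steps   # Scale so the summed gradient matches one big batch
    loss.backward()                                         # Gradients accumulate in .grad across micro-batches
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()                               # Only now clear the accumulated gradients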

What is the difference between .detach(), .no_grad(), and .data?

.detach() creates a new tensor that shares data but is detached from the computation graph — useful for stopping gradients in specific paths. torch.no_grad() context manager disables grad tracking for all operations within it — use for inference. .data gives raw access to the underlying tensor without grad overhead — avoid unless you know what you're doing, as it can corrupt autograd.
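A small example of .detach() cutting gradients along one path:

Python · detach() in Action
x = torch.tensor(2.0, requires_grad=True)
y = x ** 2            # Tracked: part of the computation graph
z = y.detach() * 3    # Same value as 3*y, but cut off from the graph

(y + z).backward()
print(x.grad)         # tensor(4.): only the y branch contributes (dy/dx = 2x); the detached branch adds nothing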

PyTorch vs TensorFlow — which should I learn?

PyTorch is the clear choice for learning deep learning today. It dominates research (the large majority of recent papers with released code use it), is increasingly used in production (Meta, Tesla, many startups), and has a more intuitive Pythonic API with eager execution. TensorFlow/Keras is still used in production at Google and has TFLite for mobile. But for learning and most new projects: PyTorch.
