CNNs — Convolutional Neural Networks for Computer Vision
Convolutional Neural Networks are the reason machines can recognize faces, detect tumors in X-rays, and drive cars. They are specially designed to process grid-like data — images, videos, spectrograms — by learning spatial patterns through local connections and weight sharing.
Why Not Just Use a Fully Connected Network?
A 224×224 RGB image has 150,528 inputs. One hidden layer of 1000 neurons = 150M parameters just for layer 1. No spatial structure is preserved — a pixel in the top-left has no special relationship to its neighbours.
Convolutions fix both problems. A 3×3 filter has just 9 weights (+ bias), shared across the entire image, and spatial relationships are preserved. Stacked layers learn hierarchical features: edge → texture → object part → full object.
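To make the contrast concrete, here is a minimal sketch comparing the two layer types in PyTorch (the layer sizes are chosen to match the numbers above):

import torch.nn as nn

# Fully connected: every one of the 150,528 inputs connects to every neuron
fc = nn.Linear(224 * 224 * 3, 1000)
print(sum(p.numel() for p in fc.parameters()))    # 150,529,000

# Convolutional: one 3×3 filter shared across the whole image
conv = nn.Conv2d(in_channels=3, out_channels=1, kernel_size=3)
print(sum(p.numel() for p in conv.parameters()))  # 28 (3×3×3 weights + 1 bias)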
The Convolution Operation
A filter (kernel) is a small matrix of learnable weights. It slides across the input image, computing an element-wise product and sum at each position. The result is a feature map that highlights where that pattern appears in the image.
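As a concrete example, here is a minimal sketch that applies a hand-crafted vertical-edge filter (a Sobel kernel, chosen for illustration rather than learned) using PyTorch's F.conv2d:

import torch
import torch.nn.functional as F

# Sobel filter for vertical edges, shaped (out_channels, in_channels, H, W)
sobel_x = torch.tensor([[-1., 0., 1.],
                        [-2., 0., 2.],
                        [-1., 0., 1.]]).reshape(1, 1, 3, 3)

# Synthetic 8×8 grayscale image: dark left half, bright right half
img = torch.cat([torch.zeros(8, 4), torch.ones(8, 4)], dim=1).reshape(1, 1, 8, 8)

out = F.conv2d(img, sobel_x)  # feature map of shape (1, 1, 6, 6)
print(out.squeeze())          # strong response at the vertical edge, zero elsewhere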
Key CNN Concepts
🔢 Stride
How many pixels the filter moves each step. Stride 1 = dense output. Stride 2 = halves the spatial dimensions. Larger stride = downsampling without pooling.
🔲 Padding
Valid: no padding (output shrinks). Same: zero-pad edges so output = input size. Preserves spatial resolution through layers.
🏊 Pooling
Max pooling: take the maximum value in a region. Provides translation invariance and downsamples. Average pooling is smoother and typically appears in later layers.
📦 Channels
RGB image = 3 channels. Each conv layer produces N feature maps (output channels), one per filter. These capture different patterns at the same spatial location.
🔄 ReLU
Applied after each convolution: max(0, x). Introduces non-linearity. Without it, stacking linear conv layers is just one big linear transform.
📐 Feature Maps
The output of applying a filter to the input. Deeper layers have smaller spatial size but more channels — trading resolution for semantic richness.
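The shape arithmetic behind these concepts is easy to verify directly; a minimal sketch:

import torch
import torch.nn as nn

x = torch.randn(1, 3, 224, 224)  # a batch of one RGB image

print(nn.Conv2d(3, 32, 3, stride=1, padding=1)(x).shape)  # [1, 32, 224, 224] ("same" padding)
print(nn.Conv2d(3, 32, 3, stride=2, padding=1)(x).shape)  # [1, 32, 112, 112] (stride-2 downsampling)
print(nn.Conv2d(3, 32, 3, stride=1, padding=0)(x).shape)  # [1, 32, 222, 222] ("valid" shrinks)
print(nn.MaxPool2d(2, 2)(x).shape)                        # [1, 3, 112, 112] (pooling halves H and W)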
Computing Output Dimensions
Output size = ⌊(Input + 2×Padding − Kernel) / Stride⌋ + 1
Example: Input=224, Kernel=3, Padding=1, Stride=1 → (224 + 2 - 3) / 1 + 1 = 224 (same padding)
Example: Input=224, Kernel=3, Padding=0, Stride=2 → (224 + 0 - 3) / 2 + 1 = 111
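The same formula as a small helper function, handy for sanity checks (conv_out_size is our own name for this sketch):

def conv_out_size(n_in, kernel, padding=0, stride=1):
    """floor((n_in + 2*padding - kernel) / stride) + 1"""
    return (n_in + 2 * padding - kernel) // stride + 1

print(conv_out_size(224, kernel=3, padding=1, stride=1))  # 224 ("same" padding)
print(conv_out_size(224, kernel=3, padding=0, stride=2))  # 111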
Famous Architectures
VGG (2014): Uses only 3×3 convolutions stacked deeply (16–19 layers). Simple, uniform design. Still used as a feature extractor. Weakness: 138M parameters — very large.
ResNet (2015): Introduced residual (skip) connections: add the input directly to the output (x + F(x)). Enables training of 50, 101, even 152+ layer networks without vanishing gradients. ImageNet 2015 winner; see the sketch after this list.
EfficientNet (2019): Scales width, depth, and resolution together via a compound coefficient. EfficientNet-B7 reached state-of-the-art ImageNet accuracy with 8.4× fewer parameters than the best previous ConvNet. Best accuracy/compute tradeoff.
MobileNet (2017): Uses depthwise separable convolutions to dramatically reduce computation. Designed for mobile and edge devices. 8–9× less computation than standard convolutions with minimal accuracy loss.
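To make the ResNet idea concrete, here is a minimal sketch of a basic residual block with an identity shortcut (real ResNets also use a projection shortcut when the channel count or spatial size changes):

import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + x)  # the skip connection: x + F(x)

block = ResidualBlock(64)
print(block(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 64, 56, 56])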
PyTorch CNN from Scratch
import torch
import torch.nn as nn
import torch.nn.functional as F
class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        # Block 1: 3→32 channels, 224→112
        self.conv1 = nn.Conv2d(3, 32, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(32)
        self.pool = nn.MaxPool2d(2, 2)  # Halves spatial size
        # Block 2: 32→64 channels, 112→56
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(64)
        # Block 3: 64→128 channels, 56→28
        self.conv3 = nn.Conv2d(64, 128, kernel_size=3, padding=1)
        self.bn3 = nn.BatchNorm2d(128)
        # Global average pooling — removes spatial dimensions
        self.gap = nn.AdaptiveAvgPool2d(1)
        # Classifier head
        self.fc1 = nn.Linear(128, 256)
        self.dropout = nn.Dropout(0.5)
        self.fc2 = nn.Linear(256, num_classes)

    def forward(self, x):
        x = self.pool(F.relu(self.bn1(self.conv1(x))))  # 112×112×32
        x = self.pool(F.relu(self.bn2(self.conv2(x))))  # 56×56×64
        x = self.pool(F.relu(self.bn3(self.conv3(x))))  # 28×28×128
        x = self.gap(x).flatten(1)  # 128
        x = F.relu(self.fc1(x))
        x = self.dropout(x)
        return self.fc2(x)
model = SimpleCNN(num_classes=10)
print(model)
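# Sanity check: forward a dummy batch (assumes a 224×224 RGB input,
# matching the shape comments above)
dummy = torch.randn(1, 3, 224, 224)
print(model(dummy).shape)  # torch.Size([1, 10])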
# Count parameters
total = sum(p.numel() for p in model.parameters())
print(f"Parameters: {total:,}") # ~300K — very small! Transfer Learning — The Right Way
import torchvision.models as models
import torch.optim as optim
# Load pretrained ResNet50 (weights from ImageNet)
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
# Freeze all backbone layers — only train the head
for param in model.parameters():
    param.requires_grad = False
# Replace final layer for your number of classes
num_classes = 5 # e.g., flower species
model.fc = nn.Linear(model.fc.in_features, num_classes)
# Only the new head has requires_grad=True
optimizer = optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
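# A minimal head-training loop sketch. `train_loader` is assumed to yield
# (images, labels) batches; it is not defined in this snippet.
model.train()
for images, labels in train_loader:
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()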
# After initial training, unfreeze later layers for fine-tuning
for name, param in model.named_parameters():
if "layer4" in name or "fc" in name:
param.requires_grad = True
optimizer2 = optim.Adam(
    filter(lambda p: p.requires_grad, model.parameters()),
    lr=1e-5  # Much smaller LR for pretrained layers
)

Frequently Asked Questions
When should I use CNNs vs Vision Transformers (ViT)?
CNNs are still better for small datasets (<100K images) due to their inductive biases (local connectivity, translation invariance). Vision Transformers (ViT) outperform CNNs at large scale (>1M images) but require more data to learn good representations. Hybrid models like ConvNeXt and EfficientViT combine the strengths of both.
Why does batch normalisation help so much?
BatchNorm normalises activations within a mini-batch, reducing internal covariate shift (the distribution of layer inputs changing as weights update). This lets you use higher learning rates, acts as mild regularisation, and makes training much more stable. Apply it after the convolution, before the activation function.
What data augmentation should I use for CNNs?
Standard: random horizontal flip, random crop, colour jitter, normalise to ImageNet mean/std. Advanced: MixUp (blend two images), CutMix (cut patch from one image into another), RandAugment (automatically selects augmentations). For medical images: avoid flips that change label (e.g., handedness in brain scans).
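A minimal sketch of the standard pipeline with torchvision (the normalisation values are the usual ImageNet statistics):

from torchvision import transforms

train_tf = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],  # ImageNet mean/std
                         std=[0.229, 0.224, 0.225]),
])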