CNNs — Convolutional Neural Networks for Computer Vision
Convolutional Neural Networks are the reason machines can recognize faces, detect tumors in X-rays, and drive cars. They are specially designed to process grid-like data — images, videos, spectrograms — by learning spatial patterns through local connections and weight sharing.
Why Not Just Use a Fully Connected Network?
A 224×224 RGB image has 150,528 inputs. One hidden layer of 1000 neurons = 150M parameters just for layer 1. No spatial structure is preserved — a pixel in the top-left has no special relationship to its neighbours.
Convolutions fix both problems. A 3×3 filter has just 9 weights (+ bias), shared across the entire image, and spatial relationships are preserved. Stacked layers learn hierarchical features: edge → texture → object part → full object.
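To make the contrast concrete, here is a minimal sketch comparing the two layer types in PyTorch (the layer sizes are chosen to match the numbers above):

import torch.nn as nn

# Fully connected: every one of the 150,528 inputs connects to every neuron
fc = nn.Linear(224 * 224 * 3, 1000)
print(sum(p.numel() for p in fc.parameters()))    # 150,529,000

# Convolutional: one 3×3 filter shared across the whole image
conv = nn.Conv2d(in_channels=3, out_channels=1, kernel_size=3)
print(sum(p.numel() for p in conv.parameters()))  # 28 (3×3×3 weights + 1 bias)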
The Convolution Operation
A filter (kernel) is a small matrix of learnable weights. It slides across the input image, computing an element-wise product and sum at each position. The result is a feature map that highlights where that pattern appears in the image.
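As a concrete example, here is a minimal sketch that applies a hand-crafted vertical-edge filter (a Sobel kernel, chosen for illustration rather than learned) using PyTorch's F.conv2d:

import torch
import torch.nn.functional as F

# Sobel filter for vertical edges, shaped (out_channels, in_channels, H, W)
sobel_x = torch.tensor([[-1., 0., 1.],
                        [-2., 0., 2.],
                        [-1., 0., 1.]]).reshape(1, 1, 3, 3)

# Synthetic 8×8 grayscale image: dark left half, bright right half
img = torch.cat([torch.zeros(8, 4), torch.ones(8, 4)], dim=1).reshape(1, 1, 8, 8)

out = F.conv2d(img, sobel_x)  # feature map of shape (1, 1, 6, 6)
print(out.squeeze())          # strong response at the vertical edge, zero elsewhere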
Key CNN Concepts
🔢 Stride
How many pixels the filter moves each step. Stride 1 = dense output. Stride 2 = halves the spatial dimensions. Larger stride = downsampling without pooling.
🔲 Padding
Valid: no padding (output shrinks). Same: zero-pad edges so output = input size. Preserves spatial resolution through layers.
🏊 Pooling
Max pooling: take the maximum value in a region. Provides translation invariance and downsamples. Average pooling is smoother and typically appears in later layers.
📦 Channels
RGB image = 3 channels. Each conv layer produces N feature maps (output channels), one per filter. These capture different patterns at the same spatial location.
🔄 ReLU
Applied after each convolution: max(0, x). Introduces non-linearity. Without it, stacking linear conv layers is just one big linear transform.
📐 Feature Maps
The output of applying a filter to the input. Deeper layers have smaller spatial size but more channels — trading resolution for semantic richness.
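The shape arithmetic behind these concepts is easy to verify directly; a minimal sketch:

import torch
import torch.nn as nn

x = torch.randn(1, 3, 224, 224)  # a batch of one RGB image

print(nn.Conv2d(3, 32, 3, stride=1, padding=1)(x).shape)  # [1, 32, 224, 224] ("same" padding)
print(nn.Conv2d(3, 32, 3, stride=2, padding=1)(x).shape)  # [1, 32, 112, 112] (stride-2 downsampling)
print(nn.Conv2d(3, 32, 3, stride=1, padding=0)(x).shape)  # [1, 32, 222, 222] ("valid" shrinks)
print(nn.MaxPool2d(2, 2)(x).shape)                        # [1, 3, 112, 112] (pooling halves H and W)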
Computing Output Dimensions
Output size = ⌊(Input + 2×Padding − Kernel) / Stride⌋ + 1
Example: Input=224, Kernel=3, Padding=1, Stride=1 → (224 + 2 - 3) / 1 + 1 = 224 (same padding)
Example: Input=224, Kernel=3, Padding=0, Stride=2 → (224 + 0 - 3) / 2 + 1 = 111
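The same formula as a small helper function, handy for sanity checks (conv_out_size is our own name for this sketch):

def conv_out_size(n_in, kernel, padding=0, stride=1):
    """floor((n_in + 2*padding - kernel) / stride) + 1"""
    return (n_in + 2 * padding - kernel) // stride + 1

print(conv_out_size(224, kernel=3, padding=1, stride=1))  # 224 ("same" padding)
print(conv_out_size(224, kernel=3, padding=0, stride=2))  # 111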
Famous Architectures
VGG (2014): Uses only 3×3 convolutions stacked deeply (16–19 layers). Simple, uniform design. Still used as a feature extractor. Weakness: 138M parameters — very large.
ResNet (2015): Introduced residual (skip) connections: add the input directly to the output (x + F(x)). Enables training of 50, 101, even 152+ layer networks without vanishing gradients. ImageNet 2015 winner; see the sketch after this list.
EfficientNet (2019): Scales width, depth, and resolution together via a compound coefficient. EfficientNet-B7 reached state-of-the-art ImageNet accuracy with 8.4× fewer parameters than the best previous ConvNet. Best accuracy/compute tradeoff.
MobileNet (2017): Uses depthwise separable convolutions to dramatically reduce computation. Designed for mobile and edge devices. 8–9× less computation than standard convolutions with minimal accuracy loss.
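To make the ResNet idea concrete, here is a minimal sketch of a basic residual block with an identity shortcut (real ResNets also use a projection shortcut when the channel count or spatial size changes):

import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + x)  # the skip connection: x + F(x)

block = ResidualBlock(64)
print(block(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 64, 56, 56])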
PyTorch CNN from Scratch
import torch
import torch.nn as nn
import torch.nn.functional as F
class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        # Block 1: 3→32 channels, 224→112
        self.conv1 = nn.Conv2d(3, 32, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(32)
        self.pool = nn.MaxPool2d(2, 2)  # Halves spatial size
        # Block 2: 32→64 channels, 112→56
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(64)
        # Block 3: 64→128 channels, 56→28
        self.conv3 = nn.Conv2d(64, 128, kernel_size=3, padding=1)
        self.bn3 = nn.BatchNorm2d(128)
        # Global average pooling — removes spatial dimensions
        self.gap = nn.AdaptiveAvgPool2d(1)
        # Classifier head
        self.fc1 = nn.Linear(128, 256)
        self.dropout = nn.Dropout(0.5)
        self.fc2 = nn.Linear(256, num_classes)

    def forward(self, x):
        x = self.pool(F.relu(self.bn1(self.conv1(x))))  # 112×112×32
        x = self.pool(F.relu(self.bn2(self.conv2(x))))  # 56×56×64
        x = self.pool(F.relu(self.bn3(self.conv3(x))))  # 28×28×128
        x = self.gap(x).flatten(1)  # 128
        x = F.relu(self.fc1(x))
        x = self.dropout(x)
        return self.fc2(x)
model = SimpleCNN(num_classes=10)
print(model)
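# Sanity check: forward a dummy batch (assumes a 224×224 RGB input,
# matching the shape comments above)
dummy = torch.randn(1, 3, 224, 224)
print(model(dummy).shape)  # torch.Size([1, 10])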
# Count parameters
total = sum(p.numel() for p in model.parameters())
print(f"Parameters: {total:,}") # ~300K — very small! Transfer Learning — The Right Way
import torchvision.models as models
import torch.optim as optim
# Load pretrained ResNet50 (weights from ImageNet)
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
# Freeze all backbone layers — only train the head
for param in model.parameters():
    param.requires_grad = False
# Replace final layer for your number of classes
num_classes = 5 # e.g., flower species
model.fc = nn.Linear(model.fc.in_features, num_classes)
# Only the new head has requires_grad=True
optimizer = optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
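# A minimal head-training loop sketch. `train_loader` is assumed to yield
# (images, labels) batches; it is not defined in this snippet.
model.train()
for images, labels in train_loader:
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()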
# After initial training, unfreeze later layers for fine-tuning
for name, param in model.named_parameters():
if "layer4" in name or "fc" in name:
param.requires_grad = True
optimizer2 = optim.Adam(
    filter(lambda p: p.requires_grad, model.parameters()),
    lr=1e-5  # Much smaller LR for pretrained layers
)

Frequently Asked Questions
When should I use CNNs vs Vision Transformers (ViT)?
CNNs are still better for small datasets (<100K images) due to their inductive biases (local connectivity, translation invariance). Vision Transformers (ViT) outperform CNNs at large scale (>1M images) but require more data to learn good representations. Hybrid models like ConvNeXt and EfficientViT combine the strengths of both.
Why does batch normalisation help so much?
BatchNorm normalises activations within a mini-batch, reducing internal covariate shift (the distribution of layer inputs changing as weights update). This lets you use higher learning rates, acts as mild regularisation, and makes training much more stable. Apply it after the convolution, before the activation function.
What data augmentation should I use for CNNs?
Standard: random horizontal flip, random crop, colour jitter, normalise to ImageNet mean/std. Advanced: MixUp (blend two images), CutMix (cut patch from one image into another), RandAugment (automatically selects augmentations). For medical images: avoid flips that change label (e.g., handedness in brain scans).
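A minimal sketch of the standard pipeline with torchvision (the normalisation values are the usual ImageNet statistics):

from torchvision import transforms

train_tf = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],  # ImageNet mean/std
                         std=[0.229, 0.224, 0.225]),
])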