Unsupervised Learning

What if you have data but no labels? Unsupervised learning finds hidden structure in data on its own — grouping similar items together, reducing noise, and discovering patterns that humans couldn't spot manually.

📖 This page covers: K-Means · Hierarchical Clustering · DBSCAN · PCA · t-SNE · Anomaly Detection

Why No Labels?

Labelling data is expensive and slow. Imagine having 1 million customer purchase records. Hiring humans to label each customer's "type" would be impractical. Unsupervised learning can automatically segment those customers into groups based on purchasing behaviour — no labels needed.

🛒 Customer Segmentation: Group users by behaviour for targeted marketing
🔍 Anomaly Detection: Find fraud or broken sensors
📉 Dimensionality Reduction: Compress 1000 features to 2 for visualisation
🧬 Gene Clustering: Group genes with similar expression patterns

K-Means Clustering

K-Means is the most popular clustering algorithm. It splits data into K groups by minimising the squared distance between each point and its cluster centre (centroid).

1. Choose K (how many clusters)
2. Randomly place K centroids
3. Assign each point to the nearest centroid
4. Move each centroid to the mean of its points
5. Repeat steps 3–4 until the centroids stop moving

Python · Scikit-Learn
from sklearn.cluster import KMeans
import numpy as np

# Generate sample data
X = np.random.randn(300, 2)
X[:100] += [3, 3]   # cluster 1
X[100:200] += [-3, 3]  # cluster 2
# cluster 3 stays at origin

kmeans = KMeans(n_clusters=3, random_state=42)
labels = kmeans.fit_predict(X)
centroids = kmeans.cluster_centers_

print(f"Inertia (within-cluster variance): {kmeans.inertia_:.2f}")

Choosing K — The Elbow Method

The hardest part of K-Means is choosing K. The elbow method fits K-Means for a range of K values and plots inertia (the within-cluster sum of squared distances) against K. The "elbow", where the curve stops dropping sharply, is usually a good choice for K.

For the three-cluster example above, the curve bends at K=3, suggesting three clusters.
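
A minimal sketch of the method, reusing the synthetic X from the K-Means example above (matplotlib is assumed to be installed):

Python · Elbow Method
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Fit K-Means for K = 1..9 and record the inertia of each fit
ks = range(1, 10)
inertias = [KMeans(n_clusters=k, random_state=42, n_init=10).fit(X).inertia_ for k in ks]

plt.plot(ks, inertias, marker="o")
plt.xlabel("K (number of clusters)")
plt.ylabel("Inertia")
plt.title("Elbow method")
plt.show()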

Hierarchical Clustering

Instead of specifying K upfront, hierarchical clustering builds a dendrogram (a tree of merges). You cut the tree at any level to get any number of clusters.

Agglomerative (Bottom-Up)

Start with every point as its own cluster. Repeatedly merge the two closest clusters. Most common approach.

Divisive (Top-Down)

Start with one big cluster. Repeatedly split it. Computationally expensive, rarely used.
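
A minimal agglomerative sketch, assuming the same 2-D array X as in the K-Means example; the dendrogram is drawn with SciPy, which is assumed to be available:

Python · Agglomerative Clustering
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

# Bottom-up clustering, cutting the merge tree at 3 clusters
agg = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = agg.fit_predict(X)

# SciPy builds the full merge tree for the dendrogram
dendrogram(linkage(X, method="ward"))
plt.title("Dendrogram (Ward linkage)")
plt.show()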

DBSCAN — Density-Based Clustering

DBSCAN groups points that are closely packed together and marks isolated points as noise/outliers. It can find clusters of arbitrary shape and does not require specifying K in advance.

Property            | K-Means                         | DBSCAN
Need to specify K?  | ✅ Yes                           | ❌ No
Handles outliers?   | ❌ No (pulls them into a cluster) | ✅ Yes (marks them as noise)
Cluster shapes      | Spherical only                  | Any shape
Speed on large data | Fast                            | Slower
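
A minimal DBSCAN sketch on the same X; the eps and min_samples values are illustrative and would need tuning for real data:

Python · DBSCAN
from sklearn.cluster import DBSCAN

# eps: neighbourhood radius, min_samples: points needed to form a dense region
db = DBSCAN(eps=0.5, min_samples=5)
labels = db.fit_predict(X)   # outliers are labelled -1

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(f"Clusters found: {n_clusters}, noise points: {(labels == -1).sum()}")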

Principal Component Analysis (PCA)

PCA is not a clustering algorithm — it's a dimensionality reduction technique. It finds new axes (principal components) that capture the maximum variance in data, allowing you to represent high-dimensional data in 2D or 3D without losing too much information.

Common use cases: Visualising 50-dimensional customer data in 2D, compressing image features before training a classifier, removing correlated features.

Python · PCA to 2D
from sklearn.decomposition import PCA
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)  # 1797 images, 64 features each

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

# Explained variance tells you info retained
print(f"Variance explained: {pca.explained_variance_ratio_.sum():.1%}")
# → Variance explained: 28.5%  (use more components for more info)

t-SNE — Visualising High-Dimensional Data

t-SNE (t-distributed Stochastic Neighbour Embedding) is designed purely for visualisation. It preserves local neighbourhood structure, making clusters visually obvious when plotted in 2D. Unlike PCA, it's non-linear and not suitable for preprocessing before training.
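
A minimal t-SNE sketch on the same digits dataset used in the PCA example; perplexity=30 is an illustrative default worth tuning:

Python · t-SNE to 2D
from sklearn.manifold import TSNE
from sklearn.datasets import load_digits
import matplotlib.pyplot as plt

X, y = load_digits(return_X_y=True)

# Non-linear embedding of the 64-dimensional digits into 2D
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=42)
X_2d = tsne.fit_transform(X)

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap="tab10", s=5)
plt.title("t-SNE of the digits dataset")
plt.show()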

💡 When to Use Each

Use PCA when you want to reduce dimensions before feeding into another model. Use t-SNE when you want to visualise high-dimensional data. Never use t-SNE features as inputs to a classifier.

Anomaly Detection

Anomaly detection finds rare events that look different from normal data. Examples: credit card fraud, network intrusion, manufacturing defects.

Isolation Forest — Randomly partitions data; anomalies are isolated faster (fewer splits needed; see the sketch below)
One-Class SVM — Learns a boundary around normal data; anything outside is anomalous
Autoencoders — Neural network that reconstructs normal data well but fails on anomalies (large reconstruction error = anomaly)
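
A minimal Isolation Forest sketch on made-up data; the contamination value is an assumption about how rare anomalies are:

Python · Isolation Forest
from sklearn.ensemble import IsolationForest
import numpy as np

rng = np.random.RandomState(42)
X_normal = rng.randn(500, 2)                   # bulk of "normal" points
X_outliers = rng.uniform(-6, 6, size=(10, 2))  # a few extreme points
X_all = np.vstack([X_normal, X_outliers])

iso = IsolationForest(contamination=0.02, random_state=42)
pred = iso.fit_predict(X_all)   # +1 = normal, -1 = anomaly

print(f"Flagged anomalies: {(pred == -1).sum()}")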

Frequently Asked Questions

How do I evaluate clustering without labels?

Use the Silhouette Score (how similar points are to their own cluster vs others, range -1 to 1). Also use domain knowledge — do the clusters make business sense?
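
A minimal sketch, assuming X and labels from the K-Means example above:

Python · Silhouette Score
from sklearn.metrics import silhouette_score

# Closer to 1 = points sit well inside their own cluster; near 0 = overlapping clusters
score = silhouette_score(X, labels)
print(f"Silhouette score: {score:.2f}")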

K-Means gives different results each run. Why?

K-Means initialises centroids randomly, so different runs can converge to different local optima. Set random_state=42 for reproducibility. Scikit-learn's default init='k-means++' already picks smarter starting centroids, and the n_init parameter runs the algorithm several times and keeps the best result.

Can I use unsupervised learning before supervised learning?

Yes! This is called semi-supervised learning. You cluster unlabelled data, then label just a few representatives per cluster. This can dramatically reduce labelling effort.
