Unsupervised Learning

What if you have data but no labels? Unsupervised learning finds hidden structure in data on its own — grouping similar items together, reducing noise, and discovering patterns that humans couldn't spot manually.

📖 This page covers: K-Means · Hierarchical Clustering · DBSCAN · PCA · t-SNE · Anomaly Detection

Why No Labels?

Labelling data is expensive and slow. Imagine having 1 million customer purchase records. Hiring humans to label each customer's "type" would be impractical. Unsupervised learning can automatically segment those customers into groups based on purchasing behaviour — no labels needed.

🛒 Customer Segmentation: Group users by behaviour for targeted marketing
🔍 Anomaly Detection: Find fraud or broken sensors
📉 Dimensionality Reduction: Compress 1000 features to 2 for visualisation
🧬 Gene Clustering: Group genes with similar expression patterns

K-Means Clustering

K-Means is the most popular clustering algorithm. It splits data into K groups by minimising the squared distance between each point and its cluster centre (centroid).

1. Choose K (how many clusters)
2. Randomly place K centroids
3. Assign each point to the nearest centroid
4. Move each centroid to the mean of its points
5. Repeat steps 3–4 until the centroids stop moving

Python · Scikit-Learn
from sklearn.cluster import KMeans
import numpy as np

# Generate sample data
X = np.random.randn(300, 2)
X[:100] += [3, 3]   # cluster 1
X[100:200] += [-3, 3]  # cluster 2
# cluster 3 stays at origin

kmeans = KMeans(n_clusters=3, random_state=42)
labels = kmeans.fit_predict(X)
centroids = kmeans.cluster_centers_

print(f"Inertia (within-cluster variance): {kmeans.inertia_:.2f}")

Choosing K — The Elbow Method

The hardest part of K-Means is choosing K. The elbow method fits K-Means for a range of K values and plots inertia (the within-cluster sum of squared distances) against K. The "elbow", where the curve stops dropping sharply, is usually a good choice for K.

For the three-cluster example above, the curve bends at K=3, suggesting three clusters.
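
A minimal sketch of the method, reusing the synthetic X from the K-Means example above (matplotlib is assumed to be installed):

Python · Elbow Method
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Fit K-Means for K = 1..9 and record the inertia of each fit
ks = range(1, 10)
inertias = [KMeans(n_clusters=k, random_state=42, n_init=10).fit(X).inertia_ for k in ks]

plt.plot(ks, inertias, marker="o")
plt.xlabel("K (number of clusters)")
plt.ylabel("Inertia")
plt.title("Elbow method")
plt.show()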

Hierarchical Clustering

Instead of specifying K upfront, hierarchical clustering builds a dendrogram (a tree of merges). You cut the tree at any level to get any number of clusters.

Agglomerative (Bottom-Up)

Start with every point as its own cluster. Repeatedly merge the two closest clusters. Most common approach.

Divisive (Top-Down)

Start with one big cluster. Repeatedly split it. Computationally expensive, rarely used.
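
A minimal agglomerative sketch, assuming the same 2-D array X as in the K-Means example; the dendrogram is drawn with SciPy, which is assumed to be available:

Python · Agglomerative Clustering
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

# Bottom-up clustering, cutting the merge tree at 3 clusters
agg = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = agg.fit_predict(X)

# SciPy builds the full merge tree for the dendrogram
dendrogram(linkage(X, method="ward"))
plt.title("Dendrogram (Ward linkage)")
plt.show()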

DBSCAN — Density-Based Clustering

DBSCAN groups points that are closely packed together and marks isolated points as noise/outliers. It can find clusters of arbitrary shape and does not require specifying K in advance.

Property            | K-Means                         | DBSCAN
Need to specify K?  | ✅ Yes                           | ❌ No
Handles outliers?   | ❌ No (pulls them into a cluster) | ✅ Yes (marks them as noise)
Cluster shapes      | Spherical only                  | Any shape
Speed on large data | Fast                            | Slower
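
A minimal DBSCAN sketch on the same X; the eps and min_samples values are illustrative and would need tuning for real data:

Python · DBSCAN
from sklearn.cluster import DBSCAN

# eps: neighbourhood radius, min_samples: points needed to form a dense region
db = DBSCAN(eps=0.5, min_samples=5)
labels = db.fit_predict(X)   # outliers are labelled -1

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(f"Clusters found: {n_clusters}, noise points: {(labels == -1).sum()}")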

Principal Component Analysis (PCA)

PCA is not a clustering algorithm — it's a dimensionality reduction technique. It finds new axes (principal components) that capture the maximum variance in data, allowing you to represent high-dimensional data in 2D or 3D without losing too much information.

Common use cases: Visualising 50-dimensional customer data in 2D, compressing image features before training a classifier, removing correlated features.

Python · PCA to 2D
from sklearn.decomposition import PCA
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)  # 1797 images, 64 features each

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

# Explained variance tells you info retained
print(f"Variance explained: {pca.explained_variance_ratio_.sum():.1%}")
# → Variance explained: 28.5%  (use more components for more info)

t-SNE — Visualising High-Dimensional Data

t-SNE (t-distributed Stochastic Neighbour Embedding) is designed purely for visualisation. It preserves local neighbourhood structure, making clusters visually obvious when plotted in 2D. Unlike PCA, it's non-linear and not suitable for preprocessing before training.
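
A minimal t-SNE sketch on the same digits dataset used in the PCA example; perplexity=30 is an illustrative default worth tuning:

Python · t-SNE to 2D
from sklearn.manifold import TSNE
from sklearn.datasets import load_digits
import matplotlib.pyplot as plt

X, y = load_digits(return_X_y=True)

# Non-linear embedding of the 64-dimensional digits into 2D
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=42)
X_2d = tsne.fit_transform(X)

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap="tab10", s=5)
plt.title("t-SNE of the digits dataset")
plt.show()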

💡 When to Use Each

Use PCA when you want to reduce dimensions before feeding into another model. Use t-SNE when you want to visualise high-dimensional data. Never use t-SNE features as inputs to a classifier.

Anomaly Detection

Anomaly detection finds rare events that look different from normal data. Examples: credit card fraud, network intrusion, manufacturing defects.

Isolation Forest — Randomly partitions data; anomalies are isolated faster (fewer splits needed; see the sketch below)
One-Class SVM — Learns a boundary around normal data; anything outside is anomalous
Autoencoders — Neural network that reconstructs normal data well but fails on anomalies (large reconstruction error = anomaly)
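
A minimal Isolation Forest sketch on made-up data; the contamination value is an assumption about how rare anomalies are:

Python · Isolation Forest
from sklearn.ensemble import IsolationForest
import numpy as np

rng = np.random.RandomState(42)
X_normal = rng.randn(500, 2)                   # bulk of "normal" points
X_outliers = rng.uniform(-6, 6, size=(10, 2))  # a few extreme points
X_all = np.vstack([X_normal, X_outliers])

iso = IsolationForest(contamination=0.02, random_state=42)
pred = iso.fit_predict(X_all)   # +1 = normal, -1 = anomaly

print(f"Flagged anomalies: {(pred == -1).sum()}")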

Frequently Asked Questions

How do I evaluate clustering without labels?

Use the Silhouette Score (how similar points are to their own cluster vs others, range -1 to 1). Also use domain knowledge — do the clusters make business sense?
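
A minimal sketch, assuming X and labels from the K-Means example above:

Python · Silhouette Score
from sklearn.metrics import silhouette_score

# Closer to 1 = points sit well inside their own cluster; near 0 = overlapping clusters
score = silhouette_score(X, labels)
print(f"Silhouette score: {score:.2f}")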

K-Means gives different results each run. Why?

K-Means initialises centroids randomly, so different runs can converge to different local optima. Set random_state=42 for reproducibility. Scikit-learn's default init='k-means++' already picks smarter starting centroids, and the n_init parameter runs the algorithm several times and keeps the best result.

Can I use unsupervised learning before supervised learning?

Yes! This is called semi-supervised learning. You cluster unlabelled data, then label just a few representatives per cluster. This can dramatically reduce labelling effort.
