Unsupervised Learning
What if you have data but no labels? Unsupervised learning finds hidden structure in data on its own — grouping similar items together, reducing noise, and discovering patterns that humans couldn't spot manually.
Why No Labels?
Labelling data is expensive and slow. Imagine having 1 million customer purchase records. Hiring humans to label each customer's "type" would be impractical. Unsupervised learning can automatically segment those customers into groups based on purchasing behaviour — no labels needed.
Group users by behaviour for targeted marketing
Find fraud or broken sensors
Compress 1000 features to 2 for visualisation
Group genes with similar expression patterns
K-Means Clustering
K-Means is the most popular clustering algorithm. It splits data into K groups by minimising the distance between each point and its cluster centre (centroid).
from sklearn.cluster import KMeans
import numpy as np
# Generate sample data
X = np.random.randn(300, 2)
X[:100] += [3, 3] # cluster 1
X[100:200] += [-3, 3] # cluster 2
# cluster 3 stays at origin
kmeans = KMeans(n_clusters=3, random_state=42)
labels = kmeans.fit_predict(X)
centroids = kmeans.cluster_centers_
print(f"Inertia (within-cluster variance): {kmeans.inertia_:.2f}")
Choosing K — The Elbow Method
The hardest part of K-Means is choosing K. The elbow method plots inertia (within-cluster variance) vs K. The "elbow" — where the curve bends — is the optimal K.
In a typical elbow plot for data like this, inertia drops sharply up to K=3 and then flattens; that elbow at K=3 suggests 3 is the optimal number of clusters.
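The elbow method can be sketched in a few lines. This is a minimal example on synthetic data (three well-separated blobs from `make_blobs`, an illustrative choice, not data from the text): inertia always decreases as K grows, but the drop shrinks sharply after the true number of clusters.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 well-separated blobs, so the elbow should land at K=3
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

inertias = []
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    km.fit(X)
    inertias.append(km.inertia_)

for k, inertia in zip(range(1, 8), inertias):
    print(f"K={k}: inertia={inertia:.0f}")
```

Plotting `inertias` against K and looking for the bend gives the elbow; here the drop from K=2 to K=3 is much larger than the drop from K=3 to K=4.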
Hierarchical Clustering
Instead of specifying K upfront, hierarchical clustering builds a dendrogram (a tree of merges). You cut the tree at any level to get any number of clusters.
Agglomerative (Bottom-Up)
Start with every point as its own cluster. Repeatedly merge the two closest clusters. Most common approach.
Divisive (Top-Down)
Start with one big cluster. Repeatedly split it. Computationally expensive, rarely used.
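A short sketch of the agglomerative approach, using two made-up Gaussian groups. scikit-learn's `AgglomerativeClustering` cuts the merge tree at a fixed number of clusters, while SciPy's `linkage`/`fcluster` expose the full dendrogram so you can cut it at any level:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import AgglomerativeClustering

# Two synthetic groups of points (illustrative data)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)),
               rng.normal(5, 0.5, (50, 2))])

# scikit-learn: bottom-up merging, tree cut at 2 clusters
agg = AgglomerativeClustering(n_clusters=2, linkage="ward")
labels = agg.fit_predict(X)

# SciPy: build the full merge tree, then cut it wherever you like
Z = linkage(X, method="ward")
labels_scipy = fcluster(Z, t=2, criterion="maxclust")

print(len(set(labels)), "clusters")
```

Passing `Z` to `scipy.cluster.hierarchy.dendrogram` draws the tree itself.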
DBSCAN — Density-Based Clustering
DBSCAN groups points that are closely packed together and marks isolated points as noise/outliers. It can find clusters of arbitrary shape and does not require specifying K in advance.
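A minimal DBSCAN sketch on the classic two-moons shape (synthetic data, chosen because K-Means cannot separate non-convex clusters like these). The `eps` and `min_samples` values here are illustrative and usually need tuning per dataset; points labelled -1 are noise.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons: arbitrary-shaped clusters K-Means would split badly
X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

# eps = neighbourhood radius, min_samples = density threshold (tune for your data)
db = DBSCAN(eps=0.3, min_samples=5)
labels = db.fit_predict(X)

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = list(labels).count(-1)
print(f"Clusters found: {n_clusters}, noise points: {n_noise}")
```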
Principal Component Analysis (PCA)
PCA is not a clustering algorithm — it's a dimensionality reduction technique. It finds new axes (principal components) that capture the maximum variance in data, allowing you to represent high-dimensional data in 2D or 3D without losing too much information.
Common use cases: Visualising 50-dimensional customer data in 2D, compressing image features before training a classifier, removing correlated features.
from sklearn.decomposition import PCA
from sklearn.datasets import load_digits
X, y = load_digits(return_X_y=True) # 1797 images, 64 features each
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
# Explained variance tells you info retained
print(f"Variance explained: {pca.explained_variance_ratio_.sum():.1%}")
# → Variance explained: 28.5% (use more components for more info)
t-SNE — Visualising High-Dimensional Data
t-SNE (t-distributed Stochastic Neighbour Embedding) is designed purely for visualisation. It preserves local neighbourhood structure, making clusters visually obvious when plotted in 2D. Unlike PCA, it's non-linear and not suitable for preprocessing before training.
Use PCA when you want to reduce dimensions before feeding into another model. Use t-SNE when you want to visualise high-dimensional data. Never use t-SNE features as inputs to a classifier.
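A minimal t-SNE sketch on the same digits dataset used in the PCA example above. A subsample is taken purely to keep the example fast; `perplexity` roughly controls the neighbourhood size t-SNE tries to preserve:

```python
from sklearn.manifold import TSNE
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)
X, y = X[:500], y[:500]  # subsample only to keep the example quick

# perplexity ~ size of the local neighbourhood preserved (common values: 5-50)
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_2d = tsne.fit_transform(X)

print(X_2d.shape)  # one 2-D point per image, ready to scatter-plot coloured by y
```

Note that `TSNE` has no separate `transform` for new data, which is one more reason its output is for plots, not for feeding a downstream model.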
Anomaly Detection
Anomaly detection finds rare events that look different from normal data. Examples: credit card fraud, network intrusion, manufacturing defects.
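One common unsupervised approach is scikit-learn's Isolation Forest, shown here as a sketch on made-up data: a cloud of "normal" points plus a few extreme ones. The `contamination` parameter is an assumed fraction of anomalies, not something the model learns:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data: 200 normal points and 5 obvious outliers (illustrative only)
rng = np.random.default_rng(42)
normal = rng.normal(0, 1, (200, 2))
outliers = rng.uniform(6, 8, (5, 2))
X = np.vstack([normal, outliers])

# contamination = assumed share of anomalies in the data
iso = IsolationForest(contamination=0.025, random_state=42)
pred = iso.fit_predict(X)  # +1 = normal, -1 = anomaly

print(f"Flagged {np.sum(pred == -1)} anomalies")
```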
Frequently Asked Questions
How do I evaluate clustering without labels?
Use the Silhouette Score (how similar points are to their own cluster vs others, range -1 to 1). Also use domain knowledge — do the clusters make business sense?
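Computing the silhouette score takes one call. A sketch on synthetic three-blob data (an illustrative choice): scoring several candidate values of K lets you pick the one with the best separation.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 3 well-separated blobs
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

scores = {}
for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
    print(f"K={k}: silhouette={scores[k]:.3f}")
```

On data like this the score should peak near the true number of clusters; values close to 1 mean tight, well-separated clusters.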
K-Means gives different results each run. Why?
K-Means initialises centroids randomly, so different starting positions can converge to different local optima. Set random_state=42 for reproducibility. Note that scikit-learn already uses the smarter k-means++ initialisation by default (init='k-means++'), and its n_init parameter runs the algorithm several times and keeps the best result.
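A quick sketch of the fix, on made-up two-cluster data: with the seed pinned and multiple restarts, two runs produce identical centroids.

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative data: two shifted Gaussian groups
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
X[:100] += 5

# n_init restarts the algorithm and keeps the best run;
# random_state pins the seed so results are reproducible
a = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=42).fit(X)
b = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=42).fit(X)

print(np.allclose(a.cluster_centers_, b.cluster_centers_))
```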
Can I use unsupervised learning before supervised learning?
Yes! This is called semi-supervised learning. You cluster unlabelled data, then label just a few representatives per cluster. This can dramatically reduce labelling effort.
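A minimal sketch of the "label a few representatives" idea on the digits dataset: cluster everything, then pick the point nearest each centroid as the single sample a human would label for that cluster. The representative-selection rule here is one simple choice among several.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)  # pretend y is unknown

# Cluster the "unlabelled" data into 10 groups
km = KMeans(n_clusters=10, n_init=10, random_state=42).fit(X)

# Representative per cluster: the member closest to the cluster centroid
reps = []
for c in range(10):
    members = np.where(km.labels_ == c)[0]
    dists = np.linalg.norm(X[members] - km.cluster_centers_[c], axis=1)
    reps.append(members[np.argmin(dists)])

print(f"Label just {len(reps)} images instead of {len(X)}")
```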