Model Evaluation

A model that looks great on training data can fail completely in the real world. Model evaluation is how you tell whether your model actually works — and how to make it better when it doesn't.

📖 This page covers: Accuracy · Precision · Recall · F1 · ROC-AUC · Confusion Matrix · Cross-Validation · Overfitting

The Train / Validation / Test Split

Never evaluate your model on the same data it trained on — it will look artificially perfect because it has memorised the answers. Instead, split your data into three sets:

- 🔵 Training set (60–70%): the model learns from this.
- 🟡 Validation set (15–20%): used to tune hyperparameters.
- 🔴 Test set (15–20%): the final, honest score. Use it once!
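A split like this can be sketched with scikit-learn's `train_test_split` applied twice (the data and sizes below are illustrative):

```python
from sklearn.model_selection import train_test_split
import numpy as np

# Toy data: 1000 samples, 5 features (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)

# First carve off the test set, then split the remainder into train/validation.
# Absolute counts (150 = 15% of 1000) avoid floating-point rounding surprises.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=150, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=150, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 700 150 150
```

The test set stays untouched until the very end; everything during development uses only the train and validation sets.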

Classification Metrics

Accuracy alone is misleading. If 99% of transactions are legitimate, a model that always predicts "not fraud" has 99% accuracy — but it misses every single fraud case.

Precision — TP / (TP + FP)
Of all the positive predictions I made, how many were actually positive?
Use when: false positives are costly (e.g. a spam filter removing real emails).

Recall (Sensitivity) — TP / (TP + FN)
Of all actual positives, how many did I find?
Use when: missing positives is costly (e.g. cancer screening).

F1 Score — 2 × P × R / (P + R)
The harmonic mean of precision and recall; a balanced score when both matter.
Use when: classes are imbalanced.

ROC-AUC — area under the ROC curve
Measures the model's ability to rank positives above negatives. 0.5 = random, 1.0 = perfect.
Use when: you need a threshold-independent metric.
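scikit-learn implements all four metrics; here is a small hand-checkable sketch (the labels and scores are made up):

```python
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

y_true  = [1, 1, 1, 0, 0, 0]              # actual labels
y_pred  = [1, 1, 0, 1, 0, 0]              # hard predictions: TP=2, FP=1, FN=1, TN=2
y_score = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2]  # predicted probabilities, for ROC-AUC

print(precision_score(y_true, y_pred))  # 2 / (2 + 1) ≈ 0.667
print(recall_score(y_true, y_pred))     # 2 / (2 + 1) ≈ 0.667
print(f1_score(y_true, y_pred))         # ≈ 0.667
print(roc_auc_score(y_true, y_score))   # 8 of 9 positive/negative pairs ranked correctly ≈ 0.889
```

Note that ROC-AUC takes the continuous scores, not the thresholded predictions — that is what makes it threshold-independent.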

Confusion Matrix: Worked Example

|                 | Predicted Positive | Predicted Negative |
|-----------------|--------------------|--------------------|
| Actual Positive | TP: 80             | FN: 20             |
| Actual Negative | FP: 10             | TN: 90             |

With these counts: Accuracy = (80 + 90) / 200 = 0.85, Precision = 80 / 90 ≈ 0.889, Recall = 80 / 100 = 0.80, F1 ≈ 0.842.
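All four metrics can be computed directly from the matrix counts. A plain-Python sketch using the values above:

```python
# Counts from the confusion matrix: TP=80, FN=20, FP=10, TN=90
tp, fn, fp, tn = 80, 20, 10, 90

accuracy  = (tp + tn) / (tp + tn + fp + fn)  # correct predictions / all predictions
precision = tp / (tp + fp)                   # of predicted positives, how many were right
recall    = tp / (tp + fn)                   # of actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"Accuracy:  {accuracy:.3f}")   # 0.850
print(f"Precision: {precision:.3f}")  # 0.889
print(f"Recall:    {recall:.3f}")     # 0.800
print(f"F1 Score:  {f1:.3f}")         # 0.842
```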

Regression Metrics

MAE — Mean Absolute Error — mean(|y − ŷ|)
Easy to interpret; in the same units as the target.

RMSE — Root Mean Squared Error — √mean((y − ŷ)²)
Penalises large errors more; sensitive to outliers.

R² — Coefficient of Determination — 1 − SS_res / SS_tot
0 = constant model, 1 = perfect. Negative = worse than always predicting the mean.
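A hand-checkable sketch with scikit-learn (the numbers are made up):

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]  # errors: -1, 0, +2

mae  = mean_absolute_error(y_true, y_pred)        # (1 + 0 + 2) / 3 = 1.0
rmse = mean_squared_error(y_true, y_pred) ** 0.5  # sqrt(5/3) ≈ 1.291
r2   = r2_score(y_true, y_pred)                   # 1 - SS_res/SS_tot = 1 - 5/8 = 0.375

print(mae, rmse, r2)
```

Notice how the single error of 2 dominates RMSE but not MAE — that is the "penalises large errors" property in action.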

Overfitting vs Underfitting

This is the most important concept in model evaluation. Every model sits on a spectrum:

- Underfitting: too simple, high bias
- Just right: good generalisation
- Overfitting: too complex, high variance
🔍 How to Detect Overfitting

Plot training accuracy vs validation accuracy over epochs/iterations. If training accuracy keeps climbing while validation accuracy plateaus or drops — you're overfitting.

| Problem      | Symptoms                           | Fixes                                                    |
|--------------|------------------------------------|----------------------------------------------------------|
| Underfitting | Both train & val accuracy are low  | Use a more complex model, add features, train longer     |
| Overfitting  | Train accuracy high, val accuracy low | More data, regularisation (L1/L2), dropout, simpler model |
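One way to see these symptoms directly is to compare training and validation accuracy. A sketch using a deliberately overfit decision tree on synthetic data (a hypothetical setup for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification data with a little label noise
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)

# An unconstrained tree can memorise the training set perfectly
tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
train_acc = tree.score(X_train, y_train)
val_acc = tree.score(X_val, y_val)

print(f"Train accuracy: {train_acc:.3f}")  # 1.000 — suspiciously perfect
print(f"Val accuracy:   {val_acc:.3f}")    # noticeably lower: the overfitting gap
```

The large train/validation gap is the classic overfitting signature; limiting `max_depth` or adding more data would shrink it.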

Cross-Validation

A single train/val split can be misleading if you got lucky (or unlucky) with which samples ended up where. K-Fold Cross-Validation splits data into K folds, trains K models, and averages the scores for a more reliable estimate.

Example with K = 5: each of the five folds takes a turn as the validation set, producing five scores (e.g. 0.92 on one fold) that average to something like 0.91 ± 0.02.
Python · Cross-Validation
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

# X, y = your feature matrix and label vector
model = RandomForestClassifier(n_estimators=100, random_state=42)
scores = cross_val_score(model, X, y, cv=5, scoring='f1')

print(f"F1 scores: {scores}")
print(f"Mean F1: {scores.mean():.3f} ± {scores.std():.3f}")

Frequently Asked Questions

Should I always use accuracy as my metric?

No. Accuracy is only meaningful when classes are balanced. If 95% of your data is class A and 5% is class B, a dumb model that always predicts A gets 95% accuracy. Use F1 or ROC-AUC for imbalanced datasets.

What's the difference between validation set and test set?

The validation set is used during development to tune hyperparameters. The test set is touched exactly once at the end to report final performance. Using the test set for tuning "leaks" information and inflates your reported accuracy.

How many folds should I use in cross-validation?

5 or 10 folds are standard. With small datasets (<1000 rows), use Leave-One-Out CV (LOOCV). With large datasets, a single train/val split is often sufficient and much faster.
