Model Evaluation
A model that looks great on training data can fail completely in the real world. Model evaluation is how you tell whether your model actually works — and how to make it better when it doesn't.
The Train / Validation / Test Split
Never evaluate your model on the same data it trained on — it will look artificially perfect because it has memorised the answers. Instead, split your data into three sets:
Training set: the data the model learns from.
Validation set: used during development to compare models and tune hyperparameters.
Test set: held back and touched exactly once at the end to report final performance.
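To make the split concrete, here is a minimal sketch using scikit-learn's train_test_split applied twice; the 60/20/20 proportions and the synthetic make_classification data are illustrative choices, not part of the lesson.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Illustrative data; substitute your own feature matrix X and labels y
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# First hold out the test set (20%), then carve a validation set out of
# the remainder (0.25 of 80% = 20% of the total), leaving 60% for training
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp)

print(len(X_train), len(X_val), len(X_test))  # 600 200 200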
Classification Metrics
Accuracy alone is misleading. If 99% of transactions are legitimate, a model that always predicts "not fraud" has 99% accuracy — but it misses every single fraud case.
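You can reproduce this accuracy trap in a few lines; the synthetic 99/1 "fraud" dataset and the DummyClassifier baseline below are illustrative assumptions, not a real fraud model.

from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

# Roughly 99% legitimate (class 0) vs 1% fraud (class 1), generated for illustration
X, y = make_classification(n_samples=10000, weights=[0.99, 0.01], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# A "model" that always predicts the majority class: not fraud
dummy = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
pred = dummy.predict(X_test)

print(f"Accuracy: {accuracy_score(y_test, pred):.3f}")  # ~0.99
print(f"Recall:   {recall_score(y_test, pred):.3f}")    # 0.0, misses every fraud case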
Precision
Of all the positive predictions I made, how many were actually positive? Precision = TP / (TP + FP).
Use when: False positives are costly (e.g. spam filter removing real emails)
Recall (Sensitivity)
Of all actual positives, how many did I find? Recall = TP / (TP + FN).
Use when: Missing positives is costly (e.g. cancer screening)
F1 Score
Harmonic mean of precision and recall: F1 = 2 × (precision × recall) / (precision + recall). A balanced score when both matter.
Use when: Classes are imbalanced
ROC-AUC
Measures model's ability to rank positives above negatives. 0.5 = random, 1.0 = perfect.
Use when: You need a threshold-independent metric
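As a concrete reference, the sketch below computes all four metrics with scikit-learn; the imbalanced make_classification data and RandomForestClassifier are stand-ins for your own dataset and model.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Illustrative imbalanced dataset (about 90/10) and model
X, y = make_classification(n_samples=5000, weights=[0.9, 0.1], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=1)

model = RandomForestClassifier(n_estimators=100, random_state=1)
model.fit(X_train, y_train)
pred = model.predict(X_test)
proba = model.predict_proba(X_test)[:, 1]  # scores needed for ROC-AUC

print(confusion_matrix(y_test, pred))                      # [[TN, FP], [FN, TP]]
print(f"Precision: {precision_score(y_test, pred):.3f}")   # TP / (TP + FP)
print(f"Recall:    {recall_score(y_test, pred):.3f}")      # TP / (TP + FN)
print(f"F1:        {f1_score(y_test, pred):.3f}")
print(f"ROC-AUC:   {roc_auc_score(y_test, proba):.3f}")    # uses scores, not hard labels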
Regression Metrics
Mean Absolute Error
MAE = mean(|y - ŷ|). Easy to interpret; in the same units as the target.
Root Mean Squared Error
RMSE = √mean((y - ŷ)²). Penalises large errors more. Sensitive to outliers.
Coefficient of Determination
R² = 1 - SS_res / SS_tot. 0 = constant (mean) model, 1 = perfect. Negative = worse than predicting the average.
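A minimal sketch of the three regression metrics with scikit-learn; the make_regression data and LinearRegression model are illustrative stand-ins.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Illustrative data and model; substitute your own
X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

mae = mean_absolute_error(y_test, pred)           # mean(|y - ŷ|)
rmse = np.sqrt(mean_squared_error(y_test, pred))  # √mean((y - ŷ)²)
r2 = r2_score(y_test, pred)                       # 1 - SS_res / SS_tot

print(f"MAE:  {mae:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"R²:   {r2:.3f}")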
Overfitting vs Underfitting
This is the most important concept in model evaluation. Every model sits on a spectrum:
Underfitting: too simple, high bias.
Just right: good generalisation.
Overfitting: too complex, high variance.
Plot training accuracy vs validation accuracy over epochs/iterations. If training accuracy keeps climbing while validation accuracy plateaus or drops — you're overfitting.
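One way to produce such a curve in code is to score the model on both sets after each training iteration; the sketch below uses GradientBoostingClassifier and its staged_predict method as an illustrative stand-in for any iterative model you track per epoch.

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Illustrative data and model
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y, random_state=0)

model = GradientBoostingClassifier(n_estimators=300, random_state=0)
model.fit(X_train, y_train)

# staged_predict yields predictions after each boosting iteration,
# so we can trace training vs validation accuracy over iterations
train_acc = [accuracy_score(y_train, p) for p in model.staged_predict(X_train)]
val_acc = [accuracy_score(y_val, p) for p in model.staged_predict(X_val)]

# Plot train_acc and val_acc against iteration number; if train_acc keeps
# rising while val_acc flattens or drops, the model is overfitting
best = max(range(len(val_acc)), key=lambda i: val_acc[i])
print(f"Best validation accuracy {val_acc[best]:.3f} at iteration {best + 1}")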
Cross-Validation
A single train/val split can be misleading if you got lucky (or unlucky) with which samples ended up where. K-Fold Cross-Validation splits data into K folds, trains K models, and averages the scores for a more reliable estimate.
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

# Assumes X (feature matrix) and y (labels) are already defined
model = RandomForestClassifier(n_estimators=100)

# 5-fold CV: train 5 models, score each held-out fold with F1, then average
scores = cross_val_score(model, X, y, cv=5, scoring='f1')
print(f"F1 scores: {scores}")
print(f"Mean F1: {scores.mean():.3f} ± {scores.std():.3f}")

Frequently Asked Questions
Should I always use accuracy as my metric?
No. Accuracy is only meaningful when classes are balanced. If 95% of your data is class A and 5% is class B, a dumb model that always predicts A gets 95% accuracy. Use F1 or ROC-AUC for imbalanced datasets.
What's the difference between validation set and test set?
The validation set is used during development to tune hyperparameters. The test set is touched exactly once at the end to report final performance. Using the test set for tuning "leaks" information and inflates your reported accuracy.
How many folds should I use in cross-validation?
5 or 10 folds are standard. With small datasets (<1000 rows), use Leave-One-Out CV (LOOCV). With large datasets, a single train/val split is often sufficient and much faster.
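For reference, a minimal LOOCV sketch with scikit-learn; the iris dataset and LogisticRegression are illustrative placeholders for a small dataset and model.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)  # small illustrative dataset (150 rows)
model = LogisticRegression(max_iter=1000)

# One fold per row: 150 models, each validated on a single held-out sample
scores = cross_val_score(model, X, y, cv=LeaveOneOut())
print(f"LOOCV accuracy: {scores.mean():.3f}")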