Experiment Tracking
A machine learning project can involve hundreds of experiments: different hyperparameters, architectures, datasets, and preprocessing steps. Without systematic tracking, you'll find yourself unable to reproduce your best result, unsure which model is in production, or running the same experiment twice. Experiment tracking tools make every run recorded, searchable, and reproducible.
What to Track
Parameters
Learning rate, batch size, model architecture, regularisation, random seed. Everything that affects the outcome.
Metrics
Training loss, validation accuracy, F1, BLEU, latency. Tracked per epoch/step for learning curves.
Artefacts
Model checkpoints, datasets, confusion matrices, prediction samples. Linked to the exact run that produced them.
Environment
Git commit hash, Python version, library versions, hardware (GPU model, VRAM). Guarantees reproducibility.
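Before reaching for any particular tool, the four categories above can be captured in a plain run record. This sketch uses only the standard library; the field names and values are illustrative, not any library's API:

```python
import platform
import sys

# A minimal, tool-agnostic record of one run, covering the four
# categories above: parameters, metrics, artefacts, environment.
run_record = {
    "params": {"learning_rate": 3e-4, "batch_size": 64, "seed": 42},
    "metrics": {"train_loss": [], "val_accuracy": []},   # appended per epoch/step
    "artefacts": ["checkpoints/model_epoch_10.pt"],      # paths to outputs
    "environment": {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    },
}
print(sorted(run_record))
```

Tracking tools like MLflow and W&B store essentially this structure per run, plus a UI for searching and comparing runs.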
MLflow: Open-Source Experiment Tracking
MLflow is the most widely-used open-source ML tracking library. Self-hostable, integrates with every ML framework, and includes a model registry for managing deployment.
Core Concepts
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

mlflow.set_experiment("fraud-detection")

with mlflow.start_run(run_name="rf-100-trees"):
    # Log hyperparameters
    params = {"n_estimators": 100, "max_depth": 10, "random_state": 42}
    mlflow.log_params(params)

    # Train model
    model = RandomForestClassifier(**params)
    model.fit(X_train, y_train)

    # Log metrics
    acc = accuracy_score(y_val, model.predict(X_val))
    mlflow.log_metric("val_accuracy", acc)

    # Log model
    mlflow.sklearn.log_model(model, "model")
# View results: mlflow ui → http://localhost:5000
Weights & Biases (W&B)
W&B is the industry standard for deep learning experiment tracking. Richer visualisations than MLflow, excellent team collaboration features, and purpose-built for long neural network training runs.
import wandb
import torch

wandb.init(
    project="image-classifier",
    config={
        "learning_rate": 3e-4,
        "epochs": 50,
        "batch_size": 64,
        "architecture": "ResNet50",
    },
)
config = wandb.config

best_acc = 0.0
for epoch in range(config.epochs):
    train_loss = train_one_epoch(model, train_loader)
    val_acc = evaluate(model, val_loader)
    wandb.log({
        "epoch": epoch,
        "train_loss": train_loss,
        "val_accuracy": val_acc,
    })

    # Log model checkpoint as artefact
    if val_acc > best_acc:
        best_acc = val_acc
        wandb.save("model_best.pt")

wandb.finish()
MLflow vs W&B: Which to Use?
MLflow
- Free, open-source, self-hostable
- No data leaves your infrastructure
- Model registry with stage transitions
- Native integrations: Spark, sklearn, PyTorch, HF
- Best for: regulated industries, on-prem teams
- Weaker UI / collaboration than W&B
Weights & Biases
- SaaS (free tier generous); enterprise pricing
- Best-in-class visualisations and dashboards
- W&B Sweeps: built-in hyperparameter search
- W&B Artifacts: dataset + model versioning
- Best for: research teams, collaborative projects
- Data stored on W&B servers (privacy concern)
Model Registry Workflow
1. After a training run, register the model in the registry with a name and version.
2. Transition the model to "Staging" for QA: run integration tests, check latency, validate on held-out data.
3. Promote it to "Production". Serving infrastructure pulls the latest Production model on restart; the previous version is archived for rollback.
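In MLflow terms, this workflow is a sketch along these lines (the model name and run ID are placeholders, and recent MLflow versions favour model aliases over fixed stages):

```python
import mlflow
from mlflow.tracking import MlflowClient

client = MlflowClient()

# 1. Register the logged model from a finished run ("<run_id>" is a placeholder)
result = mlflow.register_model("runs:/<run_id>/model", "fraud-detector")

# 2. Move the new version to Staging for QA
client.transition_model_version_stage(
    name="fraud-detector", version=result.version, stage="Staging"
)

# 3. After QA passes, promote to Production and archive older versions
client.transition_model_version_stage(
    name="fraud-detector",
    version=result.version,
    stage="Production",
    archive_existing_versions=True,
)
```

Running this requires an MLflow tracking server with the registry enabled; the serving side then resolves `models:/fraud-detector/Production` to the current version.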
Log every single run, including failed ones. Knowing what doesn't work is as valuable as knowing what does. Add a "notes" tag to each run explaining what you were testing. You'll thank yourself in two weeks when you can't remember why you tried learning rate 1e-2.
Frequently Asked Questions
Can I use MLflow with Hugging Face Transformers?
Yes. Hugging Face Trainer has built-in MLflow integration — set MLFLOW_EXPERIMENT_NAME env var and pass report_to="mlflow" to TrainingArguments. All training metrics are automatically logged. You can also use mlflow.transformers.log_model() to log the full model to the registry with its tokeniser.
How do I reproduce an old experiment exactly?
Good tracking practice: log the git commit hash (mlflow.set_tag("git_commit", subprocess.check_output(["git", "rev-parse", "HEAD"]).decode())), log all environment versions (mlflow.log_artifact("requirements.txt")), and log the exact dataset version (DVC hash or dataset artifact). With these three, you can reconstruct the code, environment, and data for any run (bit-exact results may still vary with GPU nondeterminism).
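The commit-hash step is worth wrapping in a small helper so it fails gracefully outside a git checkout; the function name here is illustrative:

```python
import subprocess
from typing import Optional


def current_git_commit() -> Optional[str]:
    """Return the current commit hash, or None outside a git checkout."""
    try:
        return (
            subprocess.check_output(
                ["git", "rev-parse", "HEAD"], stderr=subprocess.DEVNULL
            )
            .decode()
            .strip()
        )
    except (subprocess.CalledProcessError, OSError):
        # Not a git repo, or git not installed
        return None


commit = current_git_commit()
# Inside an active MLflow run, this would record provenance:
#   mlflow.set_tag("git_commit", commit)
#   mlflow.log_artifact("requirements.txt")
```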
What is a hyperparameter sweep?
A systematic search over the hyperparameter space to find the optimal configuration. W&B Sweeps supports grid search, random search, and Bayesian optimisation. MLflow integrates with Optuna and Ray Tune for the same purpose. For most tasks, random search over 20–50 runs outperforms manual tuning and is more efficient than grid search.
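With W&B Sweeps, the search space is usually declared in a YAML file. A hypothetical sweep.yaml for the image classifier above might look like this (field values are illustrative):

```yaml
# Hypothetical sweep.yaml for W&B Sweeps
program: train.py
method: bayes            # grid | random | bayes
metric:
  name: val_accuracy
  goal: maximize
parameters:
  learning_rate:
    distribution: log_uniform_values
    min: 1.0e-5
    max: 1.0e-2
  batch_size:
    values: [32, 64, 128]
```

You then launch with `wandb sweep sweep.yaml` and start one or more workers with `wandb agent <sweep-id>`; each agent run receives its sampled configuration via wandb.config.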