Experiment Tracking

A machine learning project involves hundreds of experiments: different hyperparameters, architectures, datasets, and preprocessing steps. Without systematic tracking, you'll find yourself unable to reproduce your best result, unsure which model is in production, or re-running an experiment you've already done. Experiment tracking tools exist to solve exactly this problem.

What to Track

⚙️ Parameters
Learning rate, batch size, model architecture, regularisation, random seed. Everything that affects the outcome.

📈 Metrics
Training loss, validation accuracy, F1, BLEU, latency. Tracked per epoch/step for learning curves.

📦 Artefacts
Model checkpoints, datasets, confusion matrices, prediction samples. Linked to the exact run that produced them.

🌿 Environment
Git commit hash, Python version, library versions, hardware (GPU model, VRAM). Essential for reproducibility.

MLflow: Open-Source Experiment Tracking

MLflow is the most widely used open-source ML tracking library. It is self-hostable, integrates with all major ML frameworks, and includes a model registry for managing deployment.

Core Concepts

Concept                 | Description
Experiment              | A named collection of related runs (e.g., "fraud-model-v2")
Run                     | A single training execution with its own params, metrics, and artefacts
Model Registry          | Versioned model store with staging/production lifecycle management
MLflow Tracking Server  | Backend that stores all run data (local files, S3, or SQL database)
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# X_train, y_train, X_val, y_val are assumed to be prepared elsewhere
mlflow.set_experiment("fraud-detection")

with mlflow.start_run(run_name="rf-100-trees"):
    # Log hyperparameters
    params = {"n_estimators": 100, "max_depth": 10, "random_state": 42}
    mlflow.log_params(params)

    # Train model
    model = RandomForestClassifier(**params)
    model.fit(X_train, y_train)

    # Log metrics
    acc = accuracy_score(y_val, model.predict(X_val))
    mlflow.log_metric("val_accuracy", acc)

    # Log model
    mlflow.sklearn.log_model(model, "model")

# View results: mlflow ui  →  http://localhost:5000
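To reload a logged model later, for evaluation or serving, reference it by run ID. A minimal sketch; RUN_ID is a placeholder you would copy from the MLflow UI:

import mlflow.sklearn

# Load the model back from the run that logged it ("runs:/<run_id>/<artifact_path>")
loaded = mlflow.sklearn.load_model("runs:/RUN_ID/model")
print(loaded.predict(X_val[:5]))  # X_val as in the training snippet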

Weights & Biases (W&B)

W&B is the de facto industry standard for deep learning experiment tracking. It offers richer visualisations than MLflow and strong team collaboration features, and it is purpose-built for long neural network training runs.

import wandb
import torch

run = wandb.init(
    project="image-classifier",
    config={
        "learning_rate": 3e-4,
        "epochs": 50,
        "batch_size": 64,
        "architecture": "ResNet50",
    }
)
config = run.config  # read hyperparameters back from the run config

best_acc = 0.0
# model, train_loader, val_loader and the train/eval helpers are defined elsewhere
for epoch in range(config.epochs):
    train_loss = train_one_epoch(model, train_loader)
    val_acc = evaluate(model, val_loader)

    wandb.log({
        "epoch": epoch,
        "train_loss": train_loss,
        "val_accuracy": val_acc,
    })

    # Save and log the best checkpoint
    if val_acc > best_acc:
        best_acc = val_acc
        torch.save(model.state_dict(), "model_best.pt")
        wandb.save("model_best.pt")

wandb.finish()
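Inside an active run, the checkpoint can also be logged as a versioned W&B Artifact rather than a plain file. A sketch of the pattern; the artifact name is illustrative:

# Version the checkpoint as a W&B Artifact (name and type are illustrative)
artifact = wandb.Artifact("resnet50-classifier", type="model")
artifact.add_file("model_best.pt")
wandb.log_artifact(artifact)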

MLflow vs W&B: Which to Use?

MLflow

  • Free, open-source, self-hostable
  • No data leaves your infrastructure
  • Model registry with stage transitions
  • Native integrations: Spark, sklearn, PyTorch, HF
  • Best for: regulated industries, on-prem teams
  • Weaker UI / collaboration than W&B

Weights & Biases

  • SaaS with a generous free tier; paid enterprise plans
  • Best-in-class visualisations and dashboards
  • W&B Sweeps: built-in hyperparameter search
  • W&B Artifacts: dataset + model versioning
  • Best for: research teams, collaborative projects
  • Data stored on W&B servers (privacy concern)

Model Registry Workflow

1. Register Model
After a training run, register the model to the registry with a name and version.

2. Staging
Transition model to "Staging" for QA: run integration tests, check latency, validate on held-out data.

3. Production
Promote to "Production". Serving infrastructure pulls the latest Production model on restart. Previous version archived for rollback.
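In MLflow, these steps map onto a few client calls. A minimal sketch; RUN_ID and the model name are placeholders, and note that recent MLflow versions favour model version aliases over fixed stages:

import mlflow
from mlflow.tracking import MlflowClient

# 1. Register the model logged by a training run
result = mlflow.register_model("runs:/RUN_ID/model", "fraud-detector")

# 2. Move the new version into Staging for QA
client = MlflowClient()
client.transition_model_version_stage(
    name="fraud-detector", version=result.version, stage="Staging"
)

# 3. After QA passes, promote it and archive the old Production version
client.transition_model_version_stage(
    name="fraud-detector",
    version=result.version,
    stage="Production",
    archive_existing_versions=True,
)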

💡 Track Everything, Even "Bad" Experiments

Log every single run, including failed ones. Knowing what doesn't work is as valuable as knowing what does. Add a "notes" tag to each run explaining what you were testing. You'll thank yourself in two weeks when you can't remember why you tried learning rate 1e-2.
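In MLflow, such a note can be attached as a run tag; a one-line sketch:

import mlflow

with mlflow.start_run(run_name="lr-1e-2-test"):
    # Free-text note explaining what this run was testing
    mlflow.set_tag("notes", "trying lr=1e-2 to see if training diverges")

W&B offers the same idea built in, via the notes= argument to wandb.init().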

Frequently Asked Questions

Can I use MLflow with Hugging Face Transformers?

Yes. Hugging Face Trainer has built-in MLflow integration — set MLFLOW_EXPERIMENT_NAME env var and pass report_to="mlflow" to TrainingArguments. All training metrics are automatically logged. You can also use mlflow.transformers.log_model() to log the full model to the registry with its tokeniser.
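A sketch of that wiring; model and train_ds stand in for your own model and dataset:

import os
os.environ["MLFLOW_EXPERIMENT_NAME"] = "fraud-detection"

from transformers import Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="out",
    report_to="mlflow",   # send Trainer metrics to MLflow
    num_train_epochs=3,
)
trainer = Trainer(model=model, args=args, train_dataset=train_ds)
trainer.train()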

How do I reproduce an old experiment exactly?

Good tracking practice: log the git commit hash (mlflow.set_tag("git_commit", subprocess.check_output(["git", "rev-parse", "HEAD"]).decode())), log all environment versions (mlflow.log_artifact("requirements.txt")), and log the exact dataset version (DVC hash or dataset artifact). With these three, you can recreate the code, environment, and data behind any run.
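Put together, the three logging calls might look like this; the dataset-hash value is an illustrative assumption about how you version data:

import subprocess
import mlflow

with mlflow.start_run():
    # 1. Exact code version
    commit = subprocess.check_output(["git", "rev-parse", "HEAD"]).decode().strip()
    mlflow.set_tag("git_commit", commit)

    # 2. Exact environment
    mlflow.log_artifact("requirements.txt")

    # 3. Exact data version (DVC hash shown as an example convention)
    mlflow.set_tag("dataset_version", "dvc:3f2a91c")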

What is a hyperparameter sweep?

A systematic search over the hyperparameter space to find the optimal configuration. W&B Sweeps supports grid search, random search, and Bayesian optimisation. MLflow integrates with Optuna and Ray Tune for the same purpose. For most tasks, random search over 20–50 runs outperforms manual tuning and is more efficient than grid search.
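A minimal W&B Sweeps sketch, assuming a train() function defined elsewhere that reads wandb.config and logs val_accuracy:

import wandb

sweep_config = {
    "method": "random",   # or "grid" / "bayes"
    "metric": {"name": "val_accuracy", "goal": "maximize"},
    "parameters": {
        "learning_rate": {"min": 1e-5, "max": 1e-2},
        "batch_size": {"values": [32, 64, 128]},
    },
}

sweep_id = wandb.sweep(sweep_config, project="image-classifier")
wandb.agent(sweep_id, function=train, count=30)  # run 30 trials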
