Experiment Tracking

A machine learning project involves hundreds of experiments: different hyperparameters, architectures, datasets, and preprocessing steps. Without systematic tracking, you'll find yourself unable to reproduce your best result, unsure which model is in production, or re-running an experiment you've already done. Experiment tracking tools exist to solve exactly this problem.

What to Track

⚙️ Parameters
Learning rate, batch size, model architecture, regularisation, random seed. Everything that affects the outcome.

📈 Metrics
Training loss, validation accuracy, F1, BLEU, latency. Tracked per epoch/step for learning curves.

📦 Artefacts
Model checkpoints, datasets, confusion matrices, prediction samples. Linked to the exact run that produced them.

🌿 Environment
Git commit hash, Python version, library versions, hardware (GPU model, VRAM). Essential for reproducibility.

MLflow: Open-Source Experiment Tracking

MLflow is the most widely used open-source ML tracking library. It is self-hostable, integrates with all major ML frameworks, and includes a model registry for managing deployment.

Core Concepts

Concept                 | Description
Experiment              | A named collection of related runs (e.g., "fraud-model-v2")
Run                     | A single training execution with its own params, metrics, and artefacts
Model Registry          | Versioned model store with staging/production lifecycle management
MLflow Tracking Server  | Backend that stores all run data (local files, S3, or SQL database)
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# X_train, y_train, X_val, y_val are assumed to be prepared elsewhere
mlflow.set_experiment("fraud-detection")

with mlflow.start_run(run_name="rf-100-trees"):
    # Log hyperparameters
    params = {"n_estimators": 100, "max_depth": 10, "random_state": 42}
    mlflow.log_params(params)

    # Train model
    model = RandomForestClassifier(**params)
    model.fit(X_train, y_train)

    # Log metrics
    acc = accuracy_score(y_val, model.predict(X_val))
    mlflow.log_metric("val_accuracy", acc)

    # Log model
    mlflow.sklearn.log_model(model, "model")

# View results: mlflow ui  →  http://localhost:5000
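To reload a logged model later, for evaluation or serving, reference it by run ID. A minimal sketch; RUN_ID is a placeholder you would copy from the MLflow UI:

import mlflow.sklearn

# Load the model back from the run that logged it ("runs:/<run_id>/<artifact_path>")
loaded = mlflow.sklearn.load_model("runs:/RUN_ID/model")
print(loaded.predict(X_val[:5]))  # X_val as in the training snippet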

Weights & Biases (W&B)

W&B is the de facto industry standard for deep learning experiment tracking. It offers richer visualisations than MLflow and strong team collaboration features, and it is purpose-built for long neural network training runs.

import wandb
import torch

run = wandb.init(
    project="image-classifier",
    config={
        "learning_rate": 3e-4,
        "epochs": 50,
        "batch_size": 64,
        "architecture": "ResNet50",
    }
)
config = run.config  # read hyperparameters back from the run config

best_acc = 0.0
# model, train_loader, val_loader and the train/eval helpers are defined elsewhere
for epoch in range(config.epochs):
    train_loss = train_one_epoch(model, train_loader)
    val_acc = evaluate(model, val_loader)

    wandb.log({
        "epoch": epoch,
        "train_loss": train_loss,
        "val_accuracy": val_acc,
    })

    # Save and log the best checkpoint
    if val_acc > best_acc:
        best_acc = val_acc
        torch.save(model.state_dict(), "model_best.pt")
        wandb.save("model_best.pt")

wandb.finish()
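Inside an active run, the checkpoint can also be logged as a versioned W&B Artifact rather than a plain file. A sketch of the pattern; the artifact name is illustrative:

# Version the checkpoint as a W&B Artifact (name and type are illustrative)
artifact = wandb.Artifact("resnet50-classifier", type="model")
artifact.add_file("model_best.pt")
wandb.log_artifact(artifact)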

MLflow vs W&B: Which to Use?

MLflow

  • Free, open-source, self-hostable
  • No data leaves your infrastructure
  • Model registry with stage transitions
  • Native integrations: Spark, sklearn, PyTorch, HF
  • Best for: regulated industries, on-prem teams
  • Weaker UI / collaboration than W&B

Weights & Biases

  • SaaS with a generous free tier; paid enterprise plans
  • Best-in-class visualisations and dashboards
  • W&B Sweeps: built-in hyperparameter search
  • W&B Artifacts: dataset + model versioning
  • Best for: research teams, collaborative projects
  • Data stored on W&B servers (privacy concern)

Model Registry Workflow

1. Register Model
After a training run, register the model to the registry with a name and version.

2. Staging
Transition model to "Staging" for QA: run integration tests, check latency, validate on held-out data.

3. Production
Promote to "Production". Serving infrastructure pulls the latest Production model on restart. Previous version archived for rollback.
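In MLflow, these steps map onto a few client calls. A minimal sketch; RUN_ID and the model name are placeholders, and note that recent MLflow versions favour model version aliases over fixed stages:

import mlflow
from mlflow.tracking import MlflowClient

# 1. Register the model logged by a training run
result = mlflow.register_model("runs:/RUN_ID/model", "fraud-detector")

# 2. Move the new version into Staging for QA
client = MlflowClient()
client.transition_model_version_stage(
    name="fraud-detector", version=result.version, stage="Staging"
)

# 3. After QA passes, promote it and archive the old Production version
client.transition_model_version_stage(
    name="fraud-detector",
    version=result.version,
    stage="Production",
    archive_existing_versions=True,
)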

💡 Track Everything, Even "Bad" Experiments

Log every single run, including failed ones. Knowing what doesn't work is as valuable as knowing what does. Add a "notes" tag to each run explaining what you were testing. You'll thank yourself in two weeks when you can't remember why you tried learning rate 1e-2.
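In MLflow, such a note can be attached as a run tag; a one-line sketch:

import mlflow

with mlflow.start_run(run_name="lr-1e-2-test"):
    # Free-text note explaining what this run was testing
    mlflow.set_tag("notes", "trying lr=1e-2 to see if training diverges")

W&B offers the same idea built in, via the notes= argument to wandb.init().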

Frequently Asked Questions

Can I use MLflow with Hugging Face Transformers?

Yes. Hugging Face Trainer has built-in MLflow integration — set MLFLOW_EXPERIMENT_NAME env var and pass report_to="mlflow" to TrainingArguments. All training metrics are automatically logged. You can also use mlflow.transformers.log_model() to log the full model to the registry with its tokeniser.
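A sketch of that wiring; model and train_ds stand in for your own model and dataset:

import os
os.environ["MLFLOW_EXPERIMENT_NAME"] = "fraud-detection"

from transformers import Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="out",
    report_to="mlflow",   # send Trainer metrics to MLflow
    num_train_epochs=3,
)
trainer = Trainer(model=model, args=args, train_dataset=train_ds)
trainer.train()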

How do I reproduce an old experiment exactly?

Good tracking practice: log the git commit hash (mlflow.set_tag("git_commit", subprocess.check_output(["git", "rev-parse", "HEAD"]).decode())), log all environment versions (mlflow.log_artifact("requirements.txt")), and log the exact dataset version (DVC hash or dataset artifact). With these three, you can recreate the code, environment, and data behind any run.
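Put together, the three logging calls might look like this; the dataset-hash value is an illustrative assumption about how you version data:

import subprocess
import mlflow

with mlflow.start_run():
    # 1. Exact code version
    commit = subprocess.check_output(["git", "rev-parse", "HEAD"]).decode().strip()
    mlflow.set_tag("git_commit", commit)

    # 2. Exact environment
    mlflow.log_artifact("requirements.txt")

    # 3. Exact data version (DVC hash shown as an example convention)
    mlflow.set_tag("dataset_version", "dvc:3f2a91c")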

What is a hyperparameter sweep?

A systematic search over the hyperparameter space to find the optimal configuration. W&B Sweeps supports grid search, random search, and Bayesian optimisation. MLflow integrates with Optuna and Ray Tune for the same purpose. For most tasks, random search over 20–50 runs outperforms manual tuning and is more efficient than grid search.
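A minimal W&B Sweeps sketch, assuming a train() function defined elsewhere that reads wandb.config and logs val_accuracy:

import wandb

sweep_config = {
    "method": "random",   # or "grid" / "bayes"
    "metric": {"name": "val_accuracy", "goal": "maximize"},
    "parameters": {
        "learning_rate": {"min": 1e-5, "max": 1e-2},
        "batch_size": {"values": [32, 64, 128]},
    },
}

sweep_id = wandb.sweep(sweep_config, project="image-classifier")
wandb.agent(sweep_id, function=train, count=30)  # run 30 trials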
