Model Monitoring
A model that's 94% accurate today may be 78% accurate in 6 months — without anyone changing a line of code. Input data changes, user behaviour shifts, the world evolves. Model monitoring detects these changes before they impact users, enabling proactive retraining and maintenance of production AI systems.
Types of Model Degradation
Data Drift
The distribution of input features changes. Example: an e-commerce model trained before a major sale sees completely different price ranges during the sale.
Concept Drift
The relationship between features and labels changes. Example: words associated with "spam" in 2020 differ from spam patterns in 2024.
Data Quality Issues
Missing values, unexpected nulls, schema changes, upstream pipeline failures. The most common production failure mode.
Performance Degradation
Model accuracy, F1, or business metrics declining over time, detectable via holdout sets, shadow mode, or A/B test comparisons.
Monitoring Without Ground Truth
The hard part: in production, you often don't have ground truth labels immediately. A fraud model's predictions are labelled "fraud" or "not fraud" only after investigation, days later. You therefore need proxy signals, such as input drift, prediction-distribution shifts, and data quality checks, to detect problems early.
Drift Detection with Evidently AI
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset, DataQualityPreset
import pandas as pd
# reference = training data, current = recent production data
reference_data = pd.read_parquet("train_features.parquet")
current_data = pd.read_parquet("prod_features_last_week.parquet")
report = Report(metrics=[
DataDriftPreset(),
DataQualityPreset(),
])
report.run(reference_data=reference_data, current_data=current_data)
report.save_html("drift_report.html")
# Programmatic access
results = report.as_dict()
drift_detected = results["metrics"][0]["result"]["dataset_drift"]
print(f"Data drift detected: {drift_detected}")
Statistical Tests for Drift Detection
Numerical Features
KS Test (Kolmogorov-Smirnov): Compares CDFs of two distributions. Sensitive to shape changes.
PSI (Population Stability Index): Measures distributional shift. PSI < 0.1 = stable, 0.1-0.2 = moderate shift, > 0.2 = significant shift.
Wasserstein Distance: Earth-mover's distance between distributions.
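The two most common numerical checks can be sketched in a few lines. The `psi` helper below is a hypothetical implementation for illustration (production code should use quantile bins and handle out-of-range values more carefully); the KS test comes straight from SciPy.

```python
import numpy as np
from scipy import stats

def psi(reference: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index over equal-width bins of the reference data.

    Illustrative sketch; real pipelines typically use quantile bins.
    """
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_pct = np.histogram(current, bins=edges)[0] / len(current)
    # Floor the proportions to avoid division by zero and log(0)
    ref_pct = np.clip(ref_pct, 1e-6, None)
    cur_pct = np.clip(cur_pct, 1e-6, None)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))

rng = np.random.default_rng(0)
train_prices = rng.normal(50, 10, 10_000)  # reference: training distribution
sale_prices = rng.normal(35, 12, 10_000)   # shifted production distribution

print(f"PSI: {psi(train_prices, sale_prices):.3f}")
ks_stat, p_value = stats.ks_2samp(train_prices, sale_prices)
print(f"KS statistic: {ks_stat:.3f}, p-value: {p_value:.2e}")
```

On this synthetic shift the PSI lands well above the 0.2 alarm threshold, and the KS test rejects the null hypothesis that both samples come from the same distribution.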
Categorical Features
Chi-squared Test: Tests whether category frequency distributions differ significantly.
Jensen-Shannon Divergence: Symmetric version of KL divergence. JSD = 0 means identical distributions.
New Category Rate: % of categories not seen in training — catches vocabulary drift.
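The three categorical checks can be combined into one helper. This is a hypothetical sketch (the function name and add-one smoothing are assumptions, not a library API); note that SciPy's `jensenshannon` returns the JS distance, i.e. the square root of the divergence.

```python
import numpy as np
from collections import Counter
from scipy.spatial.distance import jensenshannon
from scipy.stats import chi2_contingency

def categorical_drift(reference: list, current: list) -> dict:
    """Compare category frequency distributions; illustrative sketch."""
    cats = sorted(set(reference) | set(current))
    ref_counts = Counter(reference)
    cur_counts = Counter(current)
    ref = np.array([ref_counts[c] for c in cats], dtype=float)
    cur = np.array([cur_counts[c] for c in cats], dtype=float)
    # chi2_contingency needs non-zero expected counts; add-one smoothing
    chi2, p, _, _ = chi2_contingency(np.array([ref + 1, cur + 1]))
    # jensenshannon returns the JS *distance* (sqrt of the divergence)
    jsd = jensenshannon(ref / ref.sum(), cur / cur.sum(), base=2)
    new_rate = sum(1 for x in current if x not in ref_counts) / len(current)
    return {"chi2_p": p, "jsd": jsd, "new_category_rate": new_rate}

ref = ["electronics"] * 700 + ["books"] * 300
cur = ["electronics"] * 400 + ["books"] * 300 + ["toys"] * 300  # "toys" is new
metrics = categorical_drift(ref, cur)
print(metrics)
```

Here the new-category rate alone (30% of production values unseen in training) would be enough to trigger an investigation.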
LLM-Specific Monitoring
LLM applications need additional monitoring beyond traditional ML:
Hallucination Rate
Track factual accuracy using NLI models or retrieval-grounded evaluation. Alert when the hallucination rate exceeds a threshold.

Latency & Cost
P50/P95/P99 latency, tokens per request, cost per user session. Critical for LLM economics.
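Percentile latency and per-request cost are straightforward to compute from logged request data. The numbers and the per-token price below are made up for illustration; substitute your provider's actual rates.

```python
import numpy as np

# Hypothetical per-request logs: latency in ms and tokens consumed
latencies_ms = np.array([120, 180, 95, 2400, 210, 150, 175, 160, 3100, 140])
tokens = np.array([350, 420, 280, 5100, 500, 390, 410, 370, 6200, 330])

p50, p95, p99 = np.percentile(latencies_ms, [50, 95, 99])
# Assumed pricing of $0.002 per 1K tokens -- check your provider's rate card
cost_per_request = tokens / 1000 * 0.002

print(f"P50={p50:.0f}ms P95={p95:.0f}ms P99={p99:.0f}ms")
print(f"mean cost/request=${cost_per_request.mean():.5f}")
```

Note how the two slow outliers dominate P95/P99 while barely moving P50, which is exactly why tail percentiles, not averages, drive LLM latency alerts.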
Safety & Toxicity
Monitor outputs for policy violations, harmful content, prompt injection attempts. Tools: LlamaGuard, Perspective API, custom classifiers.
User Feedback
Thumbs up/down signals, regeneration rate, session abandonment. Weak labels but very scalable. Feed back into RLHF data collection.
Setting Up Alerts
PSI > 0.2 for key features, prediction distribution shift >10%, accuracy drop >2% on held-out set. Set conservative thresholds to avoid alert fatigue.
Run drift checks daily (Airflow/Prefect cron). Compute metrics over rolling 7-day windows vs last 30-day reference period.
Send Slack/PagerDuty alerts on threshold breach. Runbook: investigate cause → decide: retrain, rollback, or accept drift.
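The threshold rules above can be wired into a small checker that a daily Airflow/Prefect job calls before notifying Slack or PagerDuty. The function and dataclass names here are hypothetical; the default thresholds mirror the rules stated above.

```python
from dataclasses import dataclass, field

@dataclass
class DriftThresholds:
    # Defaults match the example rules; tune for your alert-fatigue budget
    psi_max: float = 0.2
    prediction_shift_max: float = 0.10
    accuracy_drop_max: float = 0.02

def check_alerts(feature_psi: dict, prediction_shift: float,
                 accuracy_drop: float,
                 t: DriftThresholds = DriftThresholds()) -> list:
    """Return human-readable alert messages; hypothetical helper that a
    scheduler would call before posting to Slack/PagerDuty."""
    alerts = [f"PSI {v:.2f} > {t.psi_max} for feature '{k}'"
              for k, v in feature_psi.items() if v > t.psi_max]
    if prediction_shift > t.prediction_shift_max:
        alerts.append(f"Prediction distribution shifted {prediction_shift:.0%}")
    if accuracy_drop > t.accuracy_drop_max:
        alerts.append(f"Holdout accuracy dropped {accuracy_drop:.1%}")
    return alerts

alerts = check_alerts({"price": 0.31, "qty": 0.05},
                      prediction_shift=0.04, accuracy_drop=0.03)
print(alerts)
```

In this example the `price` PSI and the accuracy drop each fire an alert, while the 4% prediction shift stays under its 10% threshold.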
The most important MLOps practice: log every prediction with its input features, timestamp, model version, and (asynchronously) ground truth label when available. This data is your monitoring foundation. Without it, you're flying blind. Store in a column store (BigQuery, Snowflake) for efficient drift queries.
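A minimal sketch of such a prediction log record, assuming a generic line-oriented sink (an in-memory buffer stands in for a BigQuery/Snowflake writer; the function and field names are illustrative):

```python
import io
import json
import time
import uuid

def log_prediction(features: dict, prediction, model_version: str, sink) -> str:
    """Append one prediction record as a JSON line; illustrative sketch."""
    record = {
        "prediction_id": str(uuid.uuid4()),  # join key for late-arriving labels
        "ts": time.time(),
        "model_version": model_version,
        "features": features,
        "prediction": prediction,
        "label": None,  # back-filled asynchronously once ground truth arrives
    }
    sink.write(json.dumps(record) + "\n")
    return record["prediction_id"]

# Demo with an in-memory sink
buf = io.StringIO()
pid = log_prediction({"amount": 42.0}, prediction=0.87,
                     model_version="fraud-v3", sink=buf)
print(buf.getvalue())
```

The `prediction_id` is the crucial design choice: it lets the asynchronous labelling job update exactly the right row when ground truth arrives days later.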
Frequently Asked Questions
How often should I retrain my model?
It depends on how fast your data changes. Fraud models may need weekly retraining. Product recommendation models might be retrained daily. Content moderation models may need continuous learning. Start by monitoring drift and retraining when PSI > 0.2 on important features or when performance on a holdout set drops below a threshold. Automate retraining triggers rather than using fixed schedules.
What tools are available for LLM monitoring?
LangSmith (LangChain) — trace and evaluate LLM calls with latency and token usage. Arize AI — ML observability platform with LLM features. Braintrust — LLM evaluation and monitoring. Evidently AI — open-source, now includes LLM metrics. For simple setups, structured logging to a database + a Grafana dashboard covers most monitoring needs without additional tooling.
What is shadow mode deployment?
Shadow mode (or dark launch) runs a new model alongside the production model without affecting users. Both models score every request; only the production model's output is used. You collect real predictions from the new model and compare its distribution and (when ground truth arrives) performance against the production model. It's the safest way to validate a new model before promoting it to production.
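The request-handling pattern described above can be sketched in a few lines. Everything here (the handler signature, the stand-in scoring functions) is hypothetical; the key invariants are that only the production output is returned and a shadow failure never breaks user traffic.

```python
def handle_request(features, prod_model, shadow_model, shadow_log):
    """Serve the production prediction; score and log the shadow model
    without affecting the response. Illustrative sketch."""
    prod_pred = prod_model(features)
    try:
        shadow_pred = shadow_model(features)  # never blocks the user path
        shadow_log.append({"features": features,
                           "prod": prod_pred,
                           "shadow": shadow_pred})
    except Exception:
        pass  # a shadow-model failure must not break production traffic
    return prod_pred

# Demo with stand-in scoring functions
prod_model = lambda f: 0.0 if f["amount"] < 100 else 1.0
shadow_model = lambda f: 1.0 if f["amount"] > 40 else 0.0
shadow_log = []
response = handle_request({"amount": 50}, prod_model, shadow_model, shadow_log)
print(response, shadow_log)
```

The logged prod/shadow pairs are what you later compare, first as prediction distributions, then as head-to-head performance once ground truth arrives.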