Model Monitoring
A model that's 94% accurate today may be 78% accurate in 6 months — without anyone changing a line of code. Input data changes, user behaviour shifts, the world evolves. Model monitoring detects these changes before they impact users, enabling proactive retraining and maintenance of production AI systems.
Types of Model Degradation
Data Drift
The distribution of input features changes. Example: an e-commerce model trained before a major sale sees completely different price ranges during the sale.
Concept Drift
The relationship between features and labels changes. Example: words associated with "spam" in 2020 differ from spam patterns in 2024.
Data Quality Issues
Missing values, unexpected nulls, schema changes, upstream pipeline failures. The most common production failure mode.
Performance Degradation
Model accuracy, F1, or business metrics declining over time, detectable via holdout sets, shadow mode, or A/B test comparisons.
Monitoring Without Ground Truth
The hard part: in production, you often don't have ground truth labels immediately. A fraud model's predictions are labelled "fraud" or "not fraud" only after investigation, days later. You therefore need proxy signals, such as input drift, prediction-distribution shifts, and data quality checks, to detect problems early.
Drift Detection with Evidently AI
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset, DataQualityPreset
import pandas as pd
# reference = training data, current = recent production data
reference_data = pd.read_parquet("train_features.parquet")
current_data = pd.read_parquet("prod_features_last_week.parquet")
report = Report(metrics=[
DataDriftPreset(),
DataQualityPreset(),
])
report.run(reference_data=reference_data, current_data=current_data)
report.save_html("drift_report.html")
# Programmatic access
results = report.as_dict()
drift_detected = results["metrics"][0]["result"]["dataset_drift"]
print(f"Data drift detected: {drift_detected}")
Statistical Tests for Drift Detection
Numerical Features
KS Test (Kolmogorov-Smirnov): Compares CDFs of two distributions. Sensitive to shape changes.
PSI (Population Stability Index): Measures distributional shift. PSI < 0.1 = stable, 0.1-0.2 = moderate shift, > 0.2 = significant shift.
Wasserstein Distance: Earth-mover's distance between distributions.
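The two most common numerical checks can be sketched in a few lines. The `psi` helper below is a hypothetical implementation for illustration (production code should use quantile bins and handle out-of-range values more carefully); the KS test comes straight from SciPy.

```python
import numpy as np
from scipy import stats

def psi(reference: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index over equal-width bins of the reference data.

    Illustrative sketch; real pipelines typically use quantile bins.
    """
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_pct = np.histogram(current, bins=edges)[0] / len(current)
    # Floor the proportions to avoid division by zero and log(0)
    ref_pct = np.clip(ref_pct, 1e-6, None)
    cur_pct = np.clip(cur_pct, 1e-6, None)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))

rng = np.random.default_rng(0)
train_prices = rng.normal(50, 10, 10_000)  # reference: training distribution
sale_prices = rng.normal(35, 12, 10_000)   # shifted production distribution

print(f"PSI: {psi(train_prices, sale_prices):.3f}")
ks_stat, p_value = stats.ks_2samp(train_prices, sale_prices)
print(f"KS statistic: {ks_stat:.3f}, p-value: {p_value:.2e}")
```

On this synthetic shift the PSI lands well above the 0.2 alarm threshold, and the KS test rejects the null hypothesis that both samples come from the same distribution.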
Categorical Features
Chi-squared Test: Tests whether category frequency distributions differ significantly.
Jensen-Shannon Divergence: Symmetric version of KL divergence. JSD = 0 means identical distributions.
New Category Rate: % of categories not seen in training — catches vocabulary drift.
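The three categorical checks can be combined into one helper. This is a hypothetical sketch (the function name and add-one smoothing are assumptions, not a library API); note that SciPy's `jensenshannon` returns the JS distance, i.e. the square root of the divergence.

```python
import numpy as np
from collections import Counter
from scipy.spatial.distance import jensenshannon
from scipy.stats import chi2_contingency

def categorical_drift(reference: list, current: list) -> dict:
    """Compare category frequency distributions; illustrative sketch."""
    cats = sorted(set(reference) | set(current))
    ref_counts = Counter(reference)
    cur_counts = Counter(current)
    ref = np.array([ref_counts[c] for c in cats], dtype=float)
    cur = np.array([cur_counts[c] for c in cats], dtype=float)
    # chi2_contingency needs non-zero expected counts; add-one smoothing
    chi2, p, _, _ = chi2_contingency(np.array([ref + 1, cur + 1]))
    # jensenshannon returns the JS *distance* (sqrt of the divergence)
    jsd = jensenshannon(ref / ref.sum(), cur / cur.sum(), base=2)
    new_rate = sum(1 for x in current if x not in ref_counts) / len(current)
    return {"chi2_p": p, "jsd": jsd, "new_category_rate": new_rate}

ref = ["electronics"] * 700 + ["books"] * 300
cur = ["electronics"] * 400 + ["books"] * 300 + ["toys"] * 300  # "toys" is new
metrics = categorical_drift(ref, cur)
print(metrics)
```

Here the new-category rate alone (30% of production values unseen in training) would be enough to trigger an investigation.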
LLM-Specific Monitoring
LLM applications need additional monitoring beyond traditional ML:
Hallucination Rate
Track factual accuracy using NLI models or retrieval-grounded evaluation. Alert when the hallucination rate exceeds a threshold.

Latency & Cost
P50/P95/P99 latency, tokens per request, cost per user session. Critical for LLM economics.
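Percentile latency and per-request cost are straightforward to compute from logged request data. The numbers and the per-token price below are made up for illustration; substitute your provider's actual rates.

```python
import numpy as np

# Hypothetical per-request logs: latency in ms and tokens consumed
latencies_ms = np.array([120, 180, 95, 2400, 210, 150, 175, 160, 3100, 140])
tokens = np.array([350, 420, 280, 5100, 500, 390, 410, 370, 6200, 330])

p50, p95, p99 = np.percentile(latencies_ms, [50, 95, 99])
# Assumed pricing of $0.002 per 1K tokens -- check your provider's rate card
cost_per_request = tokens / 1000 * 0.002

print(f"P50={p50:.0f}ms P95={p95:.0f}ms P99={p99:.0f}ms")
print(f"mean cost/request=${cost_per_request.mean():.5f}")
```

Note how the two slow outliers dominate P95/P99 while barely moving P50, which is exactly why tail percentiles, not averages, drive LLM latency alerts.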
Safety & Toxicity
Monitor outputs for policy violations, harmful content, prompt injection attempts. Tools: LlamaGuard, Perspective API, custom classifiers.
User Feedback
Thumbs up/down signals, regeneration rate, session abandonment. Weak labels but very scalable. Feed back into RLHF data collection.
Setting Up Alerts
PSI > 0.2 for key features, prediction distribution shift >10%, accuracy drop >2% on held-out set. Set conservative thresholds to avoid alert fatigue.
Run drift checks daily (Airflow/Prefect cron). Compute metrics over rolling 7-day windows vs last 30-day reference period.
Send Slack/PagerDuty alerts on threshold breach. Runbook: investigate cause → decide: retrain, rollback, or accept drift.
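The threshold rules above can be wired into a small checker that a daily Airflow/Prefect job calls before notifying Slack or PagerDuty. The function and dataclass names here are hypothetical; the default thresholds mirror the rules stated above.

```python
from dataclasses import dataclass, field

@dataclass
class DriftThresholds:
    # Defaults match the example rules; tune for your alert-fatigue budget
    psi_max: float = 0.2
    prediction_shift_max: float = 0.10
    accuracy_drop_max: float = 0.02

def check_alerts(feature_psi: dict, prediction_shift: float,
                 accuracy_drop: float,
                 t: DriftThresholds = DriftThresholds()) -> list:
    """Return human-readable alert messages; hypothetical helper that a
    scheduler would call before posting to Slack/PagerDuty."""
    alerts = [f"PSI {v:.2f} > {t.psi_max} for feature '{k}'"
              for k, v in feature_psi.items() if v > t.psi_max]
    if prediction_shift > t.prediction_shift_max:
        alerts.append(f"Prediction distribution shifted {prediction_shift:.0%}")
    if accuracy_drop > t.accuracy_drop_max:
        alerts.append(f"Holdout accuracy dropped {accuracy_drop:.1%}")
    return alerts

alerts = check_alerts({"price": 0.31, "qty": 0.05},
                      prediction_shift=0.04, accuracy_drop=0.03)
print(alerts)
```

In this example the `price` PSI and the accuracy drop each fire an alert, while the 4% prediction shift stays under its 10% threshold.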
The most important MLOps practice: log every prediction with its input features, timestamp, model version, and (asynchronously) ground truth label when available. This data is your monitoring foundation. Without it, you're flying blind. Store in a column store (BigQuery, Snowflake) for efficient drift queries.
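A minimal sketch of such a prediction log record, assuming a generic line-oriented sink (an in-memory buffer stands in for a BigQuery/Snowflake writer; the function and field names are illustrative):

```python
import io
import json
import time
import uuid

def log_prediction(features: dict, prediction, model_version: str, sink) -> str:
    """Append one prediction record as a JSON line; illustrative sketch."""
    record = {
        "prediction_id": str(uuid.uuid4()),  # join key for late-arriving labels
        "ts": time.time(),
        "model_version": model_version,
        "features": features,
        "prediction": prediction,
        "label": None,  # back-filled asynchronously once ground truth arrives
    }
    sink.write(json.dumps(record) + "\n")
    return record["prediction_id"]

# Demo with an in-memory sink
buf = io.StringIO()
pid = log_prediction({"amount": 42.0}, prediction=0.87,
                     model_version="fraud-v3", sink=buf)
print(buf.getvalue())
```

The `prediction_id` is the crucial design choice: it lets the asynchronous labelling job update exactly the right row when ground truth arrives days later.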
Frequently Asked Questions
How often should I retrain my model?
It depends on how fast your data changes. Fraud models may need weekly retraining. Product recommendation models might be retrained daily. Content moderation models may need continuous learning. Start by monitoring drift and retraining when PSI > 0.2 on important features or when performance on a holdout set drops below a threshold. Automate retraining triggers rather than using fixed schedules.
What tools are available for LLM monitoring?
LangSmith (LangChain) — trace and evaluate LLM calls with latency and token usage. Arize AI — ML observability platform with LLM features. Braintrust — LLM evaluation and monitoring. Evidently AI — open-source, now includes LLM metrics. For simple setups, structured logging to a database + a Grafana dashboard covers most monitoring needs without additional tooling.
What is shadow mode deployment?
Shadow mode (or dark launch) runs a new model alongside the production model without affecting users. Both models score every request; only the production model's output is used. You collect real predictions from the new model and compare its distribution and (when ground truth arrives) performance against the production model. It's the safest way to validate a new model before promoting it to production.
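The request-handling pattern described above can be sketched in a few lines. Everything here (the handler signature, the stand-in scoring functions) is hypothetical; the key invariants are that only the production output is returned and a shadow failure never breaks user traffic.

```python
def handle_request(features, prod_model, shadow_model, shadow_log):
    """Serve the production prediction; score and log the shadow model
    without affecting the response. Illustrative sketch."""
    prod_pred = prod_model(features)
    try:
        shadow_pred = shadow_model(features)  # never blocks the user path
        shadow_log.append({"features": features,
                           "prod": prod_pred,
                           "shadow": shadow_pred})
    except Exception:
        pass  # a shadow-model failure must not break production traffic
    return prod_pred

# Demo with stand-in scoring functions
prod_model = lambda f: 0.0 if f["amount"] < 100 else 1.0
shadow_model = lambda f: 1.0 if f["amount"] > 40 else 0.0
shadow_log = []
response = handle_request({"amount": 50}, prod_model, shadow_model, shadow_log)
print(response, shadow_log)
```

The logged prod/shadow pairs are what you later compare, first as prediction distributions, then as head-to-head performance once ground truth arrives.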