Phase 11: MLOps & Production AI

Building a model is 10% of the work. Getting it to production, keeping it accurate over time, and iterating quickly is the other 90%. MLOps (Machine Learning Operations) is the discipline of running ML systems reliably at scale. This phase covers the tools and practices that separate research prototypes from production AI systems.

🎯 Goal: Deploy, monitor, and iterate ML systems reliably in production

⏱️ Time: 4–8 weeks

🛠️ Tools: MLflow, W&B, Feast, Evidently, Airflow, DVC

The ML Production Gap

Most ML failures in production aren't model quality failures — they're operational failures:

📉 Data Drift

Input data distribution shifts over time. A fraud model trained in 2022 may miss 2024 fraud patterns. Without monitoring, you won't know until business metrics decline.

🔀 Reproducibility

"Which version of the model is in production?" Without experiment tracking, teams can't reproduce results, roll back safely, or explain why a model changed behaviour.

🏗️ Feature Inconsistency

Training uses historical batch features; serving computes features in real time. Subtle differences (different aggregation windows, null handling) cause training-serving skew.

🔁 Slow Retraining

Without CI/CD pipelines, retraining a model takes days of manual steps, so teams can't respond to drift quickly; automated pipelines close that loop.
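To make drift detection concrete: a basic check compares a feature's distribution in a reference (training) window against a live window. The sketch below computes the Population Stability Index (PSI) in pure Python. The bucket count and the 0.1/0.2 thresholds are common rules of thumb, not fixed standards, and tools like Evidently automate this per feature — this is just to show there is no magic involved.

```python
import math
import random

def psi(reference, current, n_buckets=10):
    """Population Stability Index between two samples of one feature.

    Bucket edges come from the reference sample's quantiles, so the
    reference contributes roughly equal mass to each bucket.
    """
    ref = sorted(reference)
    edges = [ref[int(i * (len(ref) - 1) / n_buckets)] for i in range(1, n_buckets)]

    def bucket_fractions(sample):
        counts = [0] * n_buckets
        for x in sample:
            i = sum(1 for e in edges if x > e)  # index of the bucket x falls into
            counts[i] += 1
        # Small floor avoids log(0) for empty buckets.
        return [max(c / len(sample), 1e-6) for c in counts]

    p, q = bucket_fractions(reference), bucket_fractions(current)
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))

if __name__ == "__main__":
    rng = random.Random(42)
    train = [rng.gauss(0.0, 1.0) for _ in range(5000)]    # training-time data
    stable = [rng.gauss(0.0, 1.0) for _ in range(5000)]   # same distribution
    shifted = [rng.gauss(0.8, 1.0) for _ in range(5000)]  # drifted mean
    print(f"stable PSI:  {psi(train, stable):.3f}")   # near 0: no drift
    print(f"shifted PSI: {psi(train, shifted):.3f}")  # above 0.2: alert
```

A common reading is PSI < 0.1 for "no meaningful change" and PSI > 0.2 for "investigate"; in production you would run this per feature on a schedule and alert on breaches.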

Topics in This Phase

The MLOps Maturity Model

Level  Name          What This Looks Like
0      Manual        Notebooks → manual deployment → no monitoring
1      ML Pipeline   Automated retraining pipeline; model registry; basic monitoring
2      CI/CD for ML  Automated testing + deployment; A/B testing; feature store
3      ML Platform   Self-serve platform; continuous training; automated drift response

Most companies are at Level 0–1. Getting to Level 2 is high-impact and achievable with the tools in this phase.

Core MLOps Stack

Open Source

  • MLflow — Experiment tracking + model registry
  • DVC — Data versioning + pipeline management
  • Feast — Open-source feature store
  • Evidently — Data/model drift detection
  • Airflow / Prefect — Pipeline orchestration
  • BentoML — Model serving framework

Managed / Cloud

  • Weights & Biases — Best-in-class experiment tracking
  • Tecton — Enterprise feature store
  • AWS SageMaker — End-to-end ML platform
  • Vertex AI — Google's ML platform
  • Azure ML — Microsoft's ML platform
  • Databricks — Unified data + ML platform

💡 Start with MLflow + DVC

MLflow is free, self-hostable, and integrates with every ML framework. DVC adds data versioning with git-like semantics. These two tools alone take a project from "notebooks in Dropbox" to reproducible, versioned ML workflows. Add Evidently for monitoring once you have a deployed model.
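To see what these tools buy you, here is a deliberately tiny, pure-Python sketch of what MLflow and DVC record per run: the parameters, the resulting metrics, and a hash of the exact training data, so any result can be traced back to its inputs. The names here (`track_run`, `data_fingerprint`) are illustrative, not MLflow's API — with MLflow you would call `mlflow.log_param` and `mlflow.log_metric` inside `mlflow.start_run()`, and DVC would hash the real data files for you.

```python
import hashlib
import json
import time

def data_fingerprint(rows):
    """Hash the training data so a run is tied to an exact dataset version
    (the role DVC plays for real files)."""
    payload = json.dumps(rows, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()[:12]

def track_run(params, train_fn, rows):
    """Run training and record everything needed to reproduce the result.
    Hypothetical helper; MLflow's equivalent is start_run/log_param/log_metric."""
    metrics = train_fn(rows, **params)
    return {
        "params": params,
        "metrics": metrics,
        "data_version": data_fingerprint(rows),
        "timestamp": time.time(),
    }

def train_fn(rows, lr, epochs):
    # Stand-in for a real training loop; returns dummy metrics.
    return {"accuracy": 0.9, "loss": 0.1 * lr * epochs}

run = track_run({"lr": 0.01, "epochs": 5}, train_fn, [[1, 2], [3, 4]])
print(run["data_version"], run["metrics"])
```

The point of the fingerprint: two runs with identical params but different `data_version` values immediately explain a metric change, which is exactly the question "why did the model change behaviour?" from the reproducibility failure mode above.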

Frequently Asked Questions

What is the difference between MLOps and DevOps?

DevOps manages software deployments — code is deterministic, and a deployment either works or it doesn't. MLOps adds the complexity that ML models are statistical — they degrade gradually, not catastrophically. You need to track model performance over time, detect distribution shifts in data, version datasets (not just code), and manage the experiment → training → deployment lifecycle. MLOps borrows DevOps practices (CI/CD, infrastructure-as-code) and extends them for ML-specific concerns.

Do LLM-based applications need MLOps?

Yes, but differently. For LLM apps (RAG, agents), MLOps concerns include: prompt versioning, evaluation dataset management, latency/cost monitoring, hallucination rate tracking, and A/B testing of prompt changes. Tools like LangSmith, Braintrust, and Arize AI are emerging for LLM observability specifically. The principles are the same: track, version, monitor, automate.

When should I set up a feature store?

When: (1) multiple models use the same features and you're recomputing them separately, (2) you have training-serving skew bugs, or (3) feature computation takes more than a few minutes. A feature store is significant infrastructure — premature adoption adds complexity. For early-stage projects, consistent Pandas/SQL pipelines with good documentation are often sufficient.
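One lightweight alternative the answer above alludes to: define each feature exactly once and call that same function from both the batch training pipeline and the online serving path, which removes the most common source of skew. A minimal sketch, with all names illustrative:

```python
def account_age_days(signup_ts: float, now_ts: float) -> float:
    """Single definition of the feature, shared by training and serving.
    Passing now_ts explicitly (rather than calling time.time() inside)
    keeps the function deterministic and backfillable for training."""
    return (now_ts - signup_ts) / 86400.0

def build_training_rows(events):
    # Batch path: replay historical events at their original timestamps.
    return [
        {"account_age_days": account_age_days(e["signup_ts"], e["event_ts"])}
        for e in events
    ]

def serve_features(signup_ts, request_ts):
    # Online path: the same function, with the live request timestamp.
    return {"account_age_days": account_age_days(signup_ts, request_ts)}

events = [{"signup_ts": 0.0, "event_ts": 86400.0 * 3}]
offline = build_training_rows(events)[0]
online = serve_features(0.0, 86400.0 * 3)
assert offline == online  # no training-serving skew, by construction
print(offline)
```

A feature store generalises exactly this idea — one registered feature definition, materialised to both an offline store for training and an online store for serving.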
