MLOps on Cloud

Getting a model to work in a notebook is the easy part. Getting it to reliably serve predictions to millions of users, retrain itself when data drifts, version itself so you can roll back, and alert you when something goes wrong — that's MLOps. It's the DevOps of machine learning, and cloud platforms have made it dramatically more accessible.

What is MLOps?

MLOps (Machine Learning Operations) is the set of practices, tools, and processes that bridge the gap between ML experimentation and production deployment. It applies DevOps principles (automation, CI/CD, monitoring, versioning) to the ML lifecycle.

Why ML Needs Its Own Ops Discipline

Traditional software bugs are largely deterministic — the same input produces the same output, so failures reproduce reliably. ML model failures are stochastic and often invisible. A model can be "working" (returning predictions) but be silently degrading in quality because its training data distribution is drifting away from production data. Traditional software monitoring (uptime, error rates) doesn't detect this. MLOps adds ML-specific monitoring: data drift detection, model performance tracking, and automated retraining triggers.

The research-production gap: industry surveys consistently report that the large majority of ML models (a figure of over 85% is often quoted) never make it to production. Of those that do, many degrade within their first year due to data drift, model decay, or infrastructure issues. MLOps practices exist to close this gap.

Experiment Tracking

Before you can deploy the best model, you need to know which experiment produced it. Experiment tracking records every training run — hyperparameters, metrics, code version, dataset version — so you can compare runs and reproduce results.

Tools: MLflow, W&B, and Cloud-Native Options

MLflow (open source, runs anywhere) tracks parameters, metrics, and artifacts. Weights & Biases (W&B) is the go-to for research teams — rich visualizations, collaborative, strong community. SageMaker Experiments, Vertex AI Experiments, and Azure ML Experiments are the cloud-native options — tightly integrated with their respective training infrastructure but less portable.
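As an illustration, here is a minimal MLflow tracking sketch; the experiment name, model, and hyperparameters are made up for the example, and the same pattern applies to W&B or the cloud-native trackers.

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Toy data so the example is self-contained
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Group runs under a named experiment (the name here is arbitrary)
mlflow.set_experiment("churn-model")

with mlflow.start_run():
    params = {"n_estimators": 200, "max_depth": 8}
    model = RandomForestClassifier(**params).fit(X_train, y_train)

    # Record hyperparameters, metrics, and the model artifact for this run
    mlflow.log_params(params)
    mlflow.log_metric("val_accuracy", accuracy_score(y_val, model.predict(X_val)))
    mlflow.sklearn.log_model(model, artifact_path="model")
```

Every run logged this way can later be compared in the tracking UI and traced back to its exact parameters and artifact.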

Model Registry

A model registry is a central catalog of all your trained models — with their versions, metadata, evaluation metrics, and deployment status. It's the bridge between training and production.

The Model Lifecycle in a Registry

Models flow through stages: Staging (just trained, under evaluation) → Production (serving live traffic) → Archived (retired). Promoting a model requires passing evaluation thresholds. This prevents bad models from accidentally going to production and ensures every production model has a documented lineage.

Model Metadata That Matters

A good registry entry includes: training dataset version (which data?), training code commit hash (which code?), hyperparameters, evaluation metrics on held-out test sets, training environment (Docker image, framework versions), and who approved the production promotion. This metadata is essential for debugging — when something goes wrong in production, you can trace exactly what changed.
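A sketch of what this looks like with the MLflow Model Registry (cloud-native registries expose similar operations); the model name, run ID, tags, and promotion threshold are illustrative, and newer MLflow versions favor model aliases over the stage API shown here.

```python
import mlflow
from mlflow import MlflowClient

client = MlflowClient()

# Register the model artifact from a finished training run under a versioned name
run_id = "0123456789abcdef"  # placeholder: the run that produced the candidate model
result = mlflow.register_model(f"runs:/{run_id}/model", name="churn-model")

# Attach the lineage metadata that matters when something breaks in production
client.set_model_version_tag("churn-model", result.version, "dataset_version", "2024-06-01")
client.set_model_version_tag("churn-model", result.version, "git_commit", "abc1234")
client.set_model_version_tag("churn-model", result.version, "approved_by", "jane@example.com")

# Promote only if the candidate clears the evaluation gate
candidate_accuracy = 0.93  # placeholder: from the automated evaluation step
if candidate_accuracy >= 0.90:
    client.transition_model_version_stage("churn-model", result.version, stage="Production")
```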

ML Pipelines: Automating the Workflow

ML pipelines automate the end-to-end workflow from raw data to deployed model — triggered by new data, a schedule, or manual approval. The typical stages are listed below, followed by a minimal orchestration sketch.

📥 Data Ingestion

Pull new data from sources, validate schema, check for drift, and register the new dataset version.

🔧 Feature Engineering

Transform raw data into model-ready features. Feature stores (Feast, Tecton, SageMaker Feature Store) cache these for reuse.

🏋️ Training

Distributed training job on GPU cluster. Log metrics to experiment tracker. Save model artifact to registry.

Evaluation & Approval

Automated evaluation against test set. Comparison to current production model. Human approval gate before promotion.

🚀 Deployment

Shadow deployment → canary (1% traffic) → A/B test → full rollout. Automatic rollback on performance regression.

📊 Monitoring

Track prediction quality, input data distribution, feature drift, and business metrics. Trigger retraining when drift detected.
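The orchestration layer differs by platform (SageMaker Pipelines, Vertex AI Pipelines, Kubeflow, Airflow), but the control flow is similar everywhere. Below is a minimal Kubeflow Pipelines (kfp v2) sketch with the step bodies stubbed out; the component contents, names, and accuracy threshold are purely illustrative.

```python
from kfp import dsl, compiler

# Each lightweight component runs as its own containerized step.
@dsl.component(base_image="python:3.11")
def ingest_data() -> str:
    # Pull and validate new data; return a dataset version identifier (stubbed here)
    return "dataset-2024-06-01"

@dsl.component(base_image="python:3.11")
def train_model(dataset_version: str) -> float:
    # Train, log to the experiment tracker, and return the validation metric (stubbed)
    return 0.93

@dsl.component(base_image="python:3.11")
def deploy(dataset_version: str):
    # Register and promote the model; canary/rollout logic would live downstream
    print(f"deploying model trained on {dataset_version}")

@dsl.pipeline(name="training-pipeline")
def training_pipeline():
    data = ingest_data()
    metric = train_model(dataset_version=data.output)
    # Approval gate: deploy only if the candidate clears the threshold
    with dsl.Condition(metric.output >= 0.90):
        deploy(dataset_version=data.output)

if __name__ == "__main__":
    compiler.Compiler().compile(training_pipeline, "training_pipeline.yaml")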

Model Monitoring in Production

Data Drift

The statistical properties of production input data diverge from training data over time. A fraud detection model trained on 2022 transactions degrades as fraud patterns evolve in 2025. Data drift monitoring tracks the distribution of each input feature and alerts when it shifts beyond a threshold.
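A common implementation is a per-feature two-sample test between a training reference window and recent production data. Here is a minimal sketch using the Kolmogorov–Smirnov test; the feature names, sample sizes, and significance threshold are arbitrary choices.

```python
import numpy as np
from scipy.stats import ks_2samp

def detect_drift(reference: np.ndarray, production: np.ndarray,
                 feature_names: list[str], alpha: float = 0.01) -> list[str]:
    """Return the features whose production distribution differs from training."""
    drifted = []
    for i, name in enumerate(feature_names):
        # Two-sample KS test: a small p-value suggests the distributions differ
        _, p_value = ks_2samp(reference[:, i], production[:, i])
        if p_value < alpha:
            drifted.append(name)
    return drifted

# Toy illustration: the second feature has shifted in "production"
rng = np.random.default_rng(0)
ref = rng.normal(size=(5000, 2))
prod = np.column_stack([rng.normal(size=5000), rng.normal(loc=0.5, size=5000)])
print(detect_drift(ref, prod, ["amount", "transaction_hour"]))
```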

Concept Drift

Even with stable input data, the relationship between inputs and correct outputs can change. A house price prediction model trained pre-2020 made wrong assumptions when COVID changed the housing market. Ground truth labels (when available) let you track actual model accuracy in production and detect concept drift.
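One simple approach, assuming predictions are logged at serving time and labels arrive later, is to join the two and track accuracy over a rolling window; the column names and window size below are illustrative.

```python
import pandas as pd

# Hypothetical logs: predictions captured at serving time, labels arriving later
preds = pd.DataFrame({
    "request_id": [1, 2, 3, 4],
    "timestamp": pd.to_datetime(["2025-01-01", "2025-01-08", "2025-01-15", "2025-01-22"]),
    "prediction": [1, 0, 1, 1],
})
labels = pd.DataFrame({"request_id": [1, 2, 3, 4], "label": [1, 0, 0, 0]})

joined = preds.merge(labels, on="request_id").sort_values("timestamp")
joined["correct"] = (joined["prediction"] == joined["label"]).astype(int)

# Rolling accuracy over a 14-day window; a sustained drop suggests concept drift
rolling_acc = joined.set_index("timestamp")["correct"].rolling("14D").mean()
print(rolling_acc)
```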

Infrastructure Monitoring

Beyond model quality: latency percentiles (p50, p95, p99), throughput, error rates, memory utilization, and GPU utilization for inference endpoints. SageMaker, Vertex AI, and Azure ML all integrate with CloudWatch/Cloud Monitoring/Azure Monitor for this layer.
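Managed endpoints emit most of these metrics automatically, but custom metrics can be published alongside them. A sketch of pushing per-request latency to CloudWatch with boto3; the namespace, metric name, and endpoint name are assumptions.

```python
import time
import boto3

cloudwatch = boto3.client("cloudwatch")  # assumes AWS credentials are configured

start = time.perf_counter()
# ... call the model endpoint here ...
latency_ms = (time.perf_counter() - start) * 1000

# Publish a custom latency metric; percentiles (p50/p95/p99) are computed over these samples
cloudwatch.put_metric_data(
    Namespace="MLInference",  # illustrative namespace
    MetricData=[{
        "MetricName": "PredictionLatency",
        "Dimensions": [{"Name": "Endpoint", "Value": "churn-model-prod"}],
        "Value": latency_ms,
        "Unit": "Milliseconds",
    }],
)
```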

Frequently Asked Questions

When should I invest in MLOps tooling?

Start simple: experiment tracking from day one (it's cheap and pays off immediately), a model registry when you have more than one model in production, and monitoring when you have real users depending on predictions. Don't build a full MLOps platform before you have a production model. The "you aren't gonna need it" principle applies — add tooling when the pain of not having it exceeds the cost of building it.

What is a feature store and do I need one?

A feature store is a system that computes, stores, and serves ML features — bridging training and serving. Training uses historical feature values (point-in-time correct); serving uses real-time feature values from the same definitions. Without a feature store, you often end up with training-serving skew: the features computed during training differ from those computed during inference, degrading model accuracy. Use a feature store (Feast, Tecton, SageMaker Feature Store) when you have complex, reusable features computed from multiple data sources and shared across multiple models.
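To make the training/serving symmetry concrete, here is a Feast sketch; it assumes a configured feature repository, and the entity, feature view, and feature names are hypothetical.

```python
import pandas as pd
from feast import FeatureStore

store = FeatureStore(repo_path=".")  # assumes `feast apply` has been run in this repo

features = [
    "user_stats:avg_transaction_amount",  # hypothetical feature_view:feature names
    "user_stats:transactions_last_7d",
]

# Training: point-in-time-correct historical values joined to labeled events
entity_df = pd.DataFrame({
    "user_id": [1001, 1002],
    "event_timestamp": pd.to_datetime(["2025-03-01", "2025-03-02"]),
})
training_df = store.get_historical_features(entity_df=entity_df, features=features).to_df()

# Serving: latest values for the same feature definitions, fetched at request time
online = store.get_online_features(features=features, entity_rows=[{"user_id": 1001}]).to_dict()
```

Because both calls resolve the same feature definitions, the training and serving code paths cannot silently diverge.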

How do LLM-based applications fit into MLOps?

LLM applications (RAG systems, fine-tuned models, agent systems) need an adapted MLOps approach. Instead of monitoring feature drift, you monitor output quality (using LLM-as-judge evaluations), prompt performance (which prompt versions perform best), retrieval quality (are the right documents being retrieved?), and latency (time-to-first-token, total generation time). Tools like LangSmith, Weights & Biases Prompts, and Arize AI Phoenix have adapted to this new paradigm.
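As a minimal illustration of LLM-as-judge monitoring, the sketch below scores a sampled RAG response for faithfulness to its retrieved context; the judge model, rubric, and prompt are illustrative, and dedicated tools wrap this pattern with tracing and dashboards.

```python
from openai import OpenAI  # any chat-completion client works; OpenAI is used as an example

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = (
    "You are grading a RAG answer. Given the question, retrieved context, and answer, "
    "reply with a single integer 1-5 for faithfulness of the answer to the context."
)

def judge_answer(question: str, context: str, answer: str) -> int:
    """Score one sampled production response with an LLM-as-judge rubric (illustrative)."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user", "content": f"Question: {question}\nContext: {context}\nAnswer: {answer}"},
        ],
    )
    return int(response.choices[0].message.content.strip())

# Scores like this, collected over sampled traffic, are tracked over time like any other metric
score = judge_answer("What is our refund window?", "Refunds accepted within 30 days.", "30 days.")
```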
