High Availability & Disaster Recovery
Everything fails. Networks partition. Availability zones go down. Databases get corrupted. Disks fail. The question isn't whether your system will encounter failure; it's whether you designed it to survive. High availability (HA) and disaster recovery (DR) are the engineering disciplines that turn "when things break" into "users never notice."
Understanding Availability: The Nines
Availability is measured as the percentage of time a system is operational. The difference between 99% and 99.99% sounds small, but it isn't: 99% ("two nines") allows roughly 3.65 days of downtime per year, 99.9% about 8.8 hours, 99.99% about 53 minutes, and 99.999% ("five nines") about 5 minutes per year.
SLAs, SLOs, and SLIs
An SLA (Service Level Agreement) is the contractual commitment — "we guarantee 99.9% uptime or you get a refund." An SLO (Service Level Objective) is the internal target you aim for — typically stricter than the SLA, so you have headroom before breaching your commitment. An SLI (Service Level Indicator) is the actual measurement — request success rate, latency percentiles, error rates. You define SLIs to measure SLOs to meet SLAs. For AI systems: SLIs typically include inference latency (p50/p99), model serving availability, and prediction quality metrics.
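As a concrete illustration, here is a minimal sketch of how an SLI and an error budget fall out of raw request counts. The numbers are placeholders standing in for whatever your metrics system (Prometheus, CloudWatch, etc.) actually reports.

```python
# Availability SLI and error budget from request counts.
# All figures below are illustrative placeholders.

total_requests = 1_250_000
failed_requests = 860

sli = 1 - failed_requests / total_requests  # measured availability
slo = 0.999                                 # internal target
sla = 0.995                                 # contractual commitment

# Error budget: the fraction of requests the SLO allows you to fail.
error_budget = 1 - slo
budget_consumed = (failed_requests / total_requests) / error_budget

print(f"SLI: {sli:.5f}")
print(f"Error budget consumed: {budget_consumed:.1%}")
print(f"SLO met: {sli >= slo}, SLA met: {sli >= sla}")
```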
RTO and RPO
Recovery Time Objective (RTO): how long can the system be down before it causes unacceptable business impact? An RTO of 1 hour means your system must be restored within 1 hour of failure. A payment processing AI needs an RTO of seconds; a batch recommendation system might tolerate hours. Recovery Point Objective (RPO): how much data can you afford to lose? An RPO of 15 minutes means you can tolerate losing at most 15 minutes of data. These two numbers define your DR architecture: tighter requirements demand more expensive solutions.
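A quick arithmetic sketch of the RPO side: worst-case data loss is one full backup interval (failure strikes just before the next backup runs), so checking a backup cadence against an RPO target is a single comparison. The function name and figures are illustrative.

```python
# Does a backup cadence satisfy an RPO target? Worst-case loss equals
# one backup interval. Figures below are illustrative, not prescriptive.

def meets_rpo(backup_interval_minutes: float, rpo_minutes: float) -> bool:
    """Worst-case data loss is one full backup interval."""
    return backup_interval_minutes <= rpo_minutes

print(meets_rpo(backup_interval_minutes=60, rpo_minutes=15))  # False: hourly backups can't meet a 15-minute RPO
print(meets_rpo(backup_interval_minutes=5, rpo_minutes=15))   # True
```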
High Availability Patterns
Multi-AZ Deployment
The foundation of cloud HA. Deploy your application across at least two Availability Zones — physically separate data centers within a region, connected by low-latency links. A load balancer distributes traffic across AZs. If one AZ fails (power outage, network issue), the others continue serving traffic. AWS RDS Multi-AZ, EKS with nodes across AZs, and stateless application replicas in each AZ are standard patterns. Multi-AZ protects against AZ-level failures (the most common kind) but not regional failures.
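As a sketch of how you might audit this, the snippet below uses boto3 (assuming AWS credentials and a default region are already configured) to flag RDS instances that don't have Multi-AZ enabled. Pagination is omitted for brevity.

```python
# Minimal Multi-AZ audit sketch: list RDS instances that are single-AZ.
# Assumes AWS credentials and a default region are configured.
import boto3

rds = boto3.client("rds")

for db in rds.describe_db_instances()["DBInstances"]:
    if not db["MultiAZ"]:
        print(f"WARNING: {db['DBInstanceIdentifier']} is single-AZ")
```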
Multi-Region for AI Serving
For global AI services (inference endpoints serving users worldwide), multi-region deployment reduces latency and provides regional failover. DNS-based routing (Route 53 latency routing, Cloudflare Load Balancing) directs users to the nearest healthy region. Challenges: model artifacts must be replicated to each region, feature stores must be globally consistent or partitioned by region, and state synchronization across regions is complex. Many teams use multi-region for serving but single-region for training.
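One piece of this that can live in application code is client-side failover between regional endpoints. The sketch below tries the nearest region first and falls back on error; the endpoint URLs are hypothetical, and in practice DNS-based routing usually handles this before the client is involved.

```python
# Client-side regional failover sketch. Endpoint URLs are hypothetical.
import requests

REGION_ENDPOINTS = [
    "https://inference.us-east-1.example.com/predict",  # nearest region
    "https://inference.eu-west-1.example.com/predict",  # fallback region
]

def predict(payload: dict) -> dict:
    last_error = None
    for url in REGION_ENDPOINTS:
        try:
            resp = requests.post(url, json=payload, timeout=2)
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException as err:
            last_error = err  # try the next region
    raise RuntimeError("All regions failed") from last_error
```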
Stateless Design
The most important principle for HA: make services stateless. A stateless service holds no session state locally — all state lives in an external store (Redis, DynamoDB). Any instance can handle any request. This means you can replace, scale, or fail over any instance without losing data, and your load balancer can send requests to any healthy instance. Stateless model serving is straightforward (models are read-only). Stateful components (feature stores, model registries) need their own HA strategy.
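A minimal sketch of what "all state lives in an external store" looks like in code, using Redis. The key scheme and TTL are illustrative assumptions; the point is that no replica holds session state in process memory, so any replica can serve any request.

```python
# Stateless service sketch: session state lives in Redis, not in the
# process, so instances are interchangeable. Keys and TTL are illustrative.
import json
import redis

store = redis.Redis(host="redis.internal", port=6379)

def save_session(session_id: str, state: dict) -> None:
    store.setex(f"session:{session_id}", 3600, json.dumps(state))

def load_session(session_id: str) -> dict:
    raw = store.get(f"session:{session_id}")
    return json.loads(raw) if raw else {}
```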
Circuit Breakers and Bulkheads
A circuit breaker automatically stops calling a failing dependency after a threshold of errors, "opening" the circuit. This prevents a slow downstream service from tying up all your threads and cascading one failure into a system-wide outage. A bulkhead isolates components so a failure in one doesn't consume all resources and starve the others. For AI systems: if the feature store is slow, the circuit breaker returns cached or default features rather than queuing requests until the system falls over.
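Production systems typically get this from a library (resilience4j, Polly, pybreaker), but a minimal circuit breaker is small enough to sketch. Everything below, including the feature-store stub and the default features, is illustrative.

```python
# Minimal circuit breaker sketch around a feature-store lookup. After
# `threshold` consecutive failures the circuit opens and calls fail fast
# to the fallback; after `reset_seconds` one probe call is let through.
import time

class CircuitBreaker:
    def __init__(self, threshold: int = 5, reset_seconds: float = 30.0):
        self.threshold = threshold
        self.reset_seconds = reset_seconds
        self.failures = 0
        self.opened_at = None

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_seconds:
                return fallback()      # circuit open: fail fast
            self.opened_at = None      # half-open: probe the dependency
        try:
            result = fn()
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            return fallback()

def fetch_features(user_id: str) -> dict:
    # Stand-in for a real feature-store client call; simulates an outage.
    raise TimeoutError("feature store is slow")

breaker = CircuitBreaker()
features = breaker.call(
    fn=lambda: fetch_features("user-123"),
    fallback=lambda: {"ctr_7d": 0.0},  # cached or default features
)
```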
Disaster Recovery Strategies
Backup & Restore
Lowest cost. Longest RTO (hours). Restore from S3 backups. Good for non-critical workloads. RPO = backup frequency.
Pilot Light
Core infrastructure running but scaled to minimum. Scale up during DR event. RTO: 10–30 minutes. Medium cost.
Warm Standby
Scaled-down but fully functional copy running. Failover in minutes. Higher cost. Most common for production AI.
Active-Active
Full capacity in two regions simultaneously. Instant failover — users don't notice. Highest cost. For critical AI services.
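As a rough rule-of-thumb sketch, the mapping from RTO/RPO targets to one of these four strategies can be expressed as a simple decision function. The thresholds below are illustrative, not prescriptive; cost and operational complexity rise as you move down the list above.

```python
# Rough RTO/RPO-to-strategy mapping. Thresholds are illustrative
# rules of thumb, not hard cutoffs.

def suggest_dr_strategy(rto_minutes: float, rpo_minutes: float) -> str:
    if rto_minutes < 1 and rpo_minutes < 1:
        return "active-active"
    if rto_minutes <= 15:
        return "warm standby"
    if rto_minutes <= 60:
        return "pilot light"
    return "backup & restore"

print(suggest_dr_strategy(rto_minutes=240, rpo_minutes=60))  # backup & restore
print(suggest_dr_strategy(rto_minutes=0.5, rpo_minutes=0))   # active-active
```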
Chaos Engineering: Testing Your Resilience
What is Chaos Engineering?
Chaos Engineering is the practice of deliberately injecting failures into your system to find weaknesses before they cause real outages. Netflix famously built Chaos Monkey, a tool that randomly terminates production instances. The theory: if your system can't survive random instance termination, it's better to discover that in a controlled experiment than during an actual failure. It sounds reckless but is actually the opposite; it forces you to build genuine resilience rather than relying on systems "not failing."
Chaos Engineering for AI Systems
AI systems have unique failure modes to test: model serving endpoint outage (does the system fall back gracefully?), feature store latency spike (does the circuit breaker engage?), stale model serving (what happens if model deployment fails silently?), training job interruption (does checkpoint recovery work?), and data pipeline failure (does the system degrade gracefully or fail catastrophically?). Tools like Gremlin, AWS Fault Injection Simulator, and LitmusChaos support structured chaos experiments with defined blast radius controls.
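A chaos experiment doesn't have to start at the infrastructure level. You can sketch the idea in a unit test by wrapping a dependency with an injected fault and asserting the service degrades rather than blows its latency budget. All names and budgets below are illustrative; tools like Gremlin and AWS FIS do the same thing at the infrastructure level.

```python
# Tiny chaos experiment sketch: inject latency into a dependency and
# assert the service degrades gracefully. All names are illustrative.
import time

def with_latency(fn, delay_seconds: float):
    def wrapper(*args, **kwargs):
        time.sleep(delay_seconds)  # injected fault
        return fn(*args, **kwargs)
    return wrapper

def get_features(user_id: str) -> dict:
    return {"ctr_7d": 0.12}  # stand-in for a feature-store call

def serve(user_id: str, feature_fn, budget_seconds: float = 0.05) -> dict:
    start = time.monotonic()
    try:
        feats = feature_fn(user_id)
        if time.monotonic() - start > budget_seconds:
            raise TimeoutError
        return feats
    except TimeoutError:
        return {"ctr_7d": 0.0}  # degraded but available

# Experiment: with 200 ms of injected latency, the service must fall
# back rather than exceed its latency budget.
slow = with_latency(get_features, delay_seconds=0.2)
assert serve("user-123", slow) == {"ctr_7d": 0.0}
```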
Frequently Asked Questions
How do I decide between multi-AZ and multi-region?
Multi-AZ is the baseline; use it for any production workload. It protects against the most common failures (single-AZ issues) at relatively low cost and complexity. Multi-region is for when you need protection against regional failures (rare but catastrophic), lower global latency (serving users on multiple continents), or regulatory requirements for geographic distribution. Multi-region is significantly more complex and expensive: data replication, cross-region state synchronization, and operational overhead all multiply. Most applications need multi-AZ, not multi-region. Upgrade to multi-region only when you have a clear business requirement.
What does HA mean for ML model serving specifically?
For ML serving HA: (1) Deploy model replicas across multiple pods/instances — at least 2, ideally 3+ — so no single instance failure takes down serving. (2) Use a load balancer (Kubernetes Service, ALB) to distribute traffic. (3) Implement readiness probes so traffic only routes to instances that have loaded the model (a large LLM can take 30–60 seconds to load). (4) Use rolling deployments for model updates so you always have serving capacity during deploys. (5) Implement health checks that verify the model can actually run inference, not just that the container is running.
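A sketch of points (3) and (5) using FastAPI: readiness passes only once the model is in memory, and the health check runs a real (tiny) inference. The registry loader and model here are stand-ins, not a real API.

```python
# Readiness and health endpoints for a model server, sketched with
# FastAPI. Loader, model, and input shapes are illustrative stand-ins.
from fastapi import FastAPI, Response

app = FastAPI()
model = None  # populated at startup; a large LLM can take 30-60s to load

def load_from_registry(name: str, version: str):
    """Stand-in for pulling weights from a model registry / object store."""
    class StubModel:
        def predict(self, batch):
            return [0.0 for _ in batch]
    return StubModel()

@app.on_event("startup")
def startup():
    global model
    model = load_from_registry("churn-model", version="42")  # hypothetical

@app.get("/ready")
def ready():
    # Readiness probe: route traffic only once the model is in memory.
    return Response(status_code=200 if model is not None else 503)

@app.get("/healthz")
def healthz():
    # Health check: verify the model can actually run inference,
    # not just that the container is up.
    try:
        model.predict([[0.0] * 16])  # tiny synthetic input
        return Response(status_code=200)
    except Exception:
        return Response(status_code=500)
```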
How do I back up and restore ML models?
Model artifacts (weights, configs) should be stored in S3 with versioning enabled — every training run produces a uniquely versioned artifact. The model registry (MLflow, SageMaker Model Registry) records metadata about each artifact. For DR: enable S3 Cross-Region Replication to automatically copy artifacts to a backup region. For restore: your deployment pipeline should reference the model registry, which points to S3 — restoring means spinning up the serving infrastructure and pointing it at the artifact. Test this process: "can I restore full serving capability from scratch in under 30 minutes?" should have a yes answer before you call your DR plan complete.
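A sketch of verifying that DR plumbing with boto3: confirm the artifact bucket has versioning enabled and an active replication rule. The bucket name is an assumption, and AWS credentials are required.

```python
# DR sanity-check sketch: is the model artifact bucket versioned and
# replicated cross-region? Bucket name is a hypothetical assumption.
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
bucket = "ml-model-artifacts"  # hypothetical bucket name

versioning = s3.get_bucket_versioning(Bucket=bucket)
assert versioning.get("Status") == "Enabled", "Bucket versioning is off"

try:
    rules = s3.get_bucket_replication(Bucket=bucket)[
        "ReplicationConfiguration"]["Rules"]
    enabled = any(r["Status"] == "Enabled" for r in rules)
    print("Cross-region replication enabled:", enabled)
except ClientError:
    print("No replication configuration; RPO depends on manual copies")
```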