Kubernetes Orchestration
Running one Docker container on your laptop is easy. Running 500 containers across 50 servers — making sure they're healthy, restarting crashed ones, scaling up under load, and routing traffic to the right ones — is a different problem entirely. That's the problem Kubernetes solves.
What is Kubernetes?
Kubernetes (often abbreviated as K8s — the 8 letters between "K" and "s") is an open-source system for automating deployment, scaling, and management of containerized applications. It was originally designed by Google engineers who had been running containers at massive scale internally for years, and open-sourced in 2014.
The Problem It Solves
Imagine your AI inference API suddenly goes from 100 requests/second to 10,000 — maybe your app went viral. Without Kubernetes: you'd need to manually spin up more servers, install Docker, start more containers, and update your load balancer. With Kubernetes: it detects the load spike via metrics and automatically starts more container replicas within seconds. When load drops, it scales back down to save cost.
Core Kubernetes Concepts
Pods — The Smallest Unit
A pod is one or more containers that always run together on the same node (server) and share the same network and storage. Think of a pod as a logical host for your containers. Usually one container per pod, but sometimes you'll see sidecar containers (like a logging agent) paired alongside the main container.
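A minimal pod manifest might look like the following sketch, with a main container and a logging sidecar. The image names and port are illustrative, not real services:

```yaml
# Hypothetical pod: a main app container plus a logging sidecar.
# Both containers share the pod's network namespace and volumes.
apiVersion: v1
kind: Pod
metadata:
  name: inference-api
spec:
  containers:
    - name: app                               # main container
      image: my-registry/inference-api:1.0    # illustrative image name
      ports:
        - containerPort: 8000
    - name: log-agent                         # sidecar container
      image: fluent/fluent-bit:2.2            # example logging agent
```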
Nodes — The Workers
A node is a worker machine (usually a VM) in your Kubernetes cluster. Pods run on nodes. A typical production cluster has 5–100+ nodes. For AI workloads, nodes are often GPU instances (for example, an AWS EC2 p4d instance with 8 NVIDIA A100 GPUs).

Deployments — Managing Pod Replicas
A Deployment tells Kubernetes: "I want 5 replicas of this pod running at all times." Kubernetes makes it so — and if a pod crashes or a node fails, it automatically starts a replacement. Deployments also handle rolling updates: deploy new versions without downtime by replacing pods one at a time.
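A Deployment expressing "5 replicas, updated with a rolling strategy" could be sketched like this (names and image are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-api
spec:
  replicas: 5                        # "I want 5 replicas running at all times"
  selector:
    matchLabels:
      app: inference-api
  strategy:
    type: RollingUpdate              # replace pods gradually on updates
    rollingUpdate:
      maxUnavailable: 1              # at most 1 pod down during a rollout
  template:
    metadata:
      labels:
        app: inference-api
    spec:
      containers:
        - name: app
          image: my-registry/inference-api:1.1   # illustrative image
```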
Services — Stable Network Endpoints
Pods are temporary — they get created and destroyed. A Service provides a stable IP address and DNS name that always points to the current live pods, no matter which specific pods are running. Your load balancer points to a Service, not individual pods.
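A Service that routes to the pods of a hypothetical `inference-api` Deployment might look like this; it selects pods by label, so it keeps working as individual pods come and go:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: inference-api
spec:
  selector:
    app: inference-api    # routes to whichever live pods carry this label
  ports:
    - port: 80            # stable port clients connect to
      targetPort: 8000    # container port on the pods
```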
Namespaces — Logical Isolation
Namespaces partition a single Kubernetes cluster into logically isolated segments. You might have a "production" namespace and a "staging" namespace in the same cluster, with different resource limits and access controls for each.
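A "staging" namespace with a resource quota attached could be declared roughly like this (the quota numbers are illustrative):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: staging
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: staging-quota
  namespace: staging
spec:
  hard:
    requests.cpu: "20"       # staging pods may request at most 20 CPU cores total
    requests.memory: 64Gi    # and at most 64 GiB of memory total
```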
Autoscaling in Kubernetes
Kubernetes has three levels of autoscaling:
HPA
Horizontal Pod Autoscaler — adds/removes pod replicas based on CPU, memory, or custom metrics (requests/sec, queue depth).
VPA
Vertical Pod Autoscaler — adjusts the CPU/memory requests and limits of pods based on observed usage. Good for long-running training jobs.
Cluster Autoscaler
Adds/removes nodes when pods can't be scheduled (not enough capacity) or nodes are underutilized. Integrates with cloud provider APIs to spin up VMs.
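As a sketch of the first level, an HPA that scales a hypothetical `inference-api` Deployment between 2 and 50 replicas on CPU utilization might look like:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-api-hpa
spec:
  scaleTargetRef:              # the Deployment this HPA controls
    apiVersion: apps/v1
    kind: Deployment
    name: inference-api
  minReplicas: 2
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # add replicas when avg CPU exceeds 70%
```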
Kubernetes for AI & ML Workloads
Serving ML Models at Scale
Kubernetes is the standard platform for deploying ML inference APIs. You package your model server (vLLM, Triton, TorchServe) in a Docker image, define a Kubernetes Deployment with GPU resource requests, and let Kubernetes handle scheduling, health checks, and autoscaling.
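A Deployment for a model server with a GPU resource request could be sketched as follows. The vLLM image tag and health-check path are assumptions for illustration:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-server
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llm-server
  template:
    metadata:
      labels:
        app: llm-server
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest   # assumed model-server image
          resources:
            limits:
              nvidia.com/gpu: 1            # one GPU per replica
          readinessProbe:                  # health check before routing traffic
            httpGet:
              path: /health                # assumed health endpoint
              port: 8000
```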
ML Training on Kubernetes
The Kubeflow project extends Kubernetes with ML-specific primitives — distributed training operators (for PyTorch, TensorFlow), pipeline management, and experiment tracking. Many ML platforms (Vertex AI Pipelines, AWS SageMaker) use Kubernetes under the hood.
GPU Scheduling
Kubernetes schedules GPU resources using NVIDIA's device plugin. You request GPUs in your pod spec: nvidia.com/gpu: 4. Kubernetes places that pod on a node with 4 available GPUs and prevents other pods from using those GPUs simultaneously.
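In a pod spec, that GPU request is a fragment of the container's resources block (this assumes the NVIDIA device plugin is installed on the cluster):

```yaml
# Fragment of a container spec requesting 4 GPUs.
resources:
  limits:
    nvidia.com/gpu: 4    # schedule only on a node with 4 free GPUs
```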
Managed Kubernetes Services
Running Kubernetes yourself is complex. The "control plane" (scheduler, API server, etcd database) is notoriously tricky to operate. All major cloud providers offer managed Kubernetes, where they run the control plane for you: Amazon EKS (Elastic Kubernetes Service), Google GKE (Google Kubernetes Engine), and Microsoft AKS (Azure Kubernetes Service).
Frequently Asked Questions
Is Kubernetes too complex for a small team?
It can be. Kubernetes has significant operational complexity. For small teams or simple applications, managed alternatives like Google Cloud Run, AWS App Runner, or Heroku may be better. Start with Kubernetes when you have multiple services, need GPU scheduling, or require fine-grained control over how your containers are deployed. If you use a managed service like GKE Autopilot, much of the complexity is abstracted away.
What is Helm in the Kubernetes ecosystem?
Helm is the package manager for Kubernetes. Instead of writing raw Kubernetes YAML files for every service, Helm packages them into reusable "charts" with configurable values. Deploying Prometheus monitoring to your cluster? helm install prometheus. Deploying an NVIDIA GPU operator? helm install gpu-operator. Helm dramatically simplifies deploying complex applications to Kubernetes.
What's the difference between Docker Compose and Kubernetes?
Docker Compose is for running multiple containers on a single machine, typically for local development. Kubernetes is for running containers across a cluster of machines in production, with autoscaling, self-healing, and advanced networking. Think of Compose as the local development tool and Kubernetes as the production system.
How much does Kubernetes cost?
Kubernetes itself is open source and free. You pay for the underlying infrastructure (VM nodes). With managed services: EKS charges $0.10/hour for the control plane plus EC2 node costs; GKE Autopilot charges per pod resource rather than per node; AKS has a free control plane but charges for nodes. For AI workloads, GPU node costs dominate — a single H100 GPU node can run $30–60/hour.