AI System Design

Building a prototype AI app is easy. Building one that handles 10,000 users, stays within budget, recovers from failures, and improves over time — that is AI system design. This page covers the architecture patterns and tradeoffs that separate demos from production systems.

📖 Covers: Architecture Patterns · RAG Pipelines · Agent Orchestration · Scalability · Observability · Cost Control

Core Design Dimensions

⚡ Latency

Time to first token and end-to-end response time. Critical for interactive apps.

💰 Cost

Input/output tokens, compute, storage. Scales with usage — design for it.

📈 Throughput

Requests per second. Batching, caching, and async patterns matter here.

🎯 Quality

Accuracy, relevance, helpfulness. Often trades off against speed and cost.

🔒 Reliability

Uptime, graceful degradation, retry logic, circuit breakers.

🔍 Observability

Logging, tracing, evaluation. How do you know when it breaks?

Common AI System Architectures

1. Simple LLM Gateway

The baseline: your app calls an LLM API directly. Good for prototypes and low-traffic applications.

Flow: Client → Your App → LLM API (OpenAI / Claude / Gemini)
✅ Pros
  • Simple to build
  • No infra to manage
  • Fast to iterate
❌ Cons
  • No caching
  • No rate limiting
  • No fallback on API outage
  • Hard to switch providers

2. RAG Pipeline

Retrieval-Augmented Generation grounds the LLM in your own data. The retriever fetches relevant context; the LLM generates with that context in the prompt. This is the most common production pattern for knowledge-intensive apps.

Flow: User Query → Embed Query → Vector Search (top-k chunks) → Re-rank / Filter → Build Prompt (query + context) → LLM Generate → Grounded Answer
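
A minimal sketch of this pipeline, assuming an Anthropic-style client; embed, vector_search, and rerank are hypothetical stand-ins for your embedding model, vector store, and re-ranker:

Python · RAG Pipeline Sketch
from anthropic import Anthropic

client = Anthropic()

def answer(query: str, top_k: int = 20, keep: int = 5) -> str:
    query_vec = embed(query)                  # 1. embed the query (hypothetical helper)
    chunks = vector_search(query_vec, top_k)  # 2. fetch top-k candidate chunks
    context = rerank(query, chunks)[:keep]    # 3. re-rank, keep only the best
    prompt = (                                # 4. build a grounded prompt
        "Answer using only the context below. Cite chunk ids.\n\n"
        + "\n\n".join(f"[{c.id}] {c.text}" for c in context)
        + f"\n\nQuestion: {query}"
    )
    response = client.messages.create(        # 5. generate with context in the prompt
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text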
🔗 Deep dive: RAG Systems · Vector Databases

3. Agentic Pipeline

When the task requires multiple steps, tool calls, or decisions, a simple LLM call is not enough. An agent loop handles this — but adds latency and cost. Design for bounded iterations and human-in-the-loop checkpoints.

Flow: User Request → Orchestrator (plan + dispatch) → Tools (Search · Code · DB) → LLM (reason + synthesise) → Response
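
A sketch of a bounded loop using the Anthropic tool-use message shape; TOOLS and run_tool are hypothetical stand-ins for your tool schemas and dispatcher. The hard iteration cap is the point:

Python · Bounded Agent Loop Sketch
from anthropic import Anthropic

client = Anthropic()
MAX_ITERATIONS = 5  # hard cap: prevents runaway token burn

def run_agent(request: str) -> str:
    messages = [{"role": "user", "content": request}]
    for _ in range(MAX_ITERATIONS):
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=1024,
            tools=TOOLS,                       # hypothetical tool schemas
            messages=messages,
        )
        if response.stop_reason != "tool_use":
            return response.content[0].text    # no more tool calls: done
        messages.append({"role": "assistant", "content": response.content})
        tool_results = [
            {"type": "tool_result", "tool_use_id": block.id,
             "content": run_tool(block.name, block.input)}  # hypothetical dispatcher
            for block in response.content if block.type == "tool_use"
        ]
        messages.append({"role": "user", "content": tool_results})
    raise RuntimeError("Agent exceeded its iteration budget")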

Scalability Patterns

Caching

LLM calls are expensive and slow. Cache aggressively at every layer.

Python · Semantic Cache with Redis
import hashlib
import redis
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer('all-MiniLM-L6-v2')
cache = redis.Redis()

def semantic_cache_get(query: str, threshold: float = 0.92) -> str | None:
    query_vec = encoder.encode(query).astype(np.float32)
    # Linear scan over cached queries by cosine similarity. Fine for small
    # caches; at scale, use SCAN or a Redis vector index, since KEYS is
    # O(N) and blocks the server.
    for key in cache.keys("cache:*"):
        cached = cache.hgetall(key)
        cached_vec = np.frombuffer(cached[b"embedding"], dtype=np.float32)
        similarity = np.dot(query_vec, cached_vec) / (
            np.linalg.norm(query_vec) * np.linalg.norm(cached_vec)
        )
        if similarity > threshold:
            return cached[b"response"].decode()
    return None  # cache miss

def semantic_cache_set(query: str, response: str):
    # Stable key across processes (Python's built-in hash() is salted per process)
    key = f"cache:{hashlib.sha256(query.encode()).hexdigest()}"
    embedding = encoder.encode(query).astype(np.float32).tobytes()
    cache.hset(key, mapping={"response": response, "embedding": embedding})
    cache.expire(key, 3600)  # 1-hour TTL
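
Usage wraps the LLM call, with call_llm as a hypothetical stand-in for your provider request:

Python · Cache Usage Sketch
def cached_answer(query: str) -> str:
    # Check the cache before paying for a generation
    if (hit := semantic_cache_get(query)) is not None:
        return hit
    response = call_llm(query)  # hypothetical provider call
    semantic_cache_set(query, response)
    return response

The 0.92 threshold is a starting point, not a constant: too low and you return stale answers to genuinely different questions; tune it against labelled query pairs.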

Async & Batching

Async I/O

Use asyncio + httpx or async SDK clients. Never block on LLM calls in a web handler.
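
A minimal sketch with the async Anthropic SDK client (either option above works); asyncio.gather fans the calls out concurrently instead of serialising them:

Python · Concurrent Async Calls Sketch
import asyncio
from anthropic import AsyncAnthropic

client = AsyncAnthropic()

async def ask(prompt: str) -> str:
    response = await client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=256,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text

async def handle_batch(prompts: list[str]) -> list[str]:
    # Fan out concurrently: wall time ≈ the slowest call, not the sum
    return await asyncio.gather(*(ask(p) for p in prompts))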

Request Batching

Group multiple short requests into one API call (where the API supports it). Cuts overhead and often reduces cost.
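
One form of batching that needs no special API support is packing several short inputs into a single prompt. A sketch, assuming a sentiment-classification workload:

Python · App-Level Batching Sketch
from anthropic import Anthropic

client = Anthropic()

def batch_classify(items: list[str]) -> list[str]:
    # One prompt, many inputs; one label comes back per line, parsed by position
    numbered = "\n".join(f"{i + 1}. {text}" for i, text in enumerate(items))
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=256,
        messages=[{"role": "user", "content":
                   "Classify each numbered item as POSITIVE or NEGATIVE. "
                   "Reply with exactly one label per line, in order.\n\n" + numbered}],
    )
    return response.content[0].text.strip().splitlines()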

Streaming

Stream tokens to the user as they are generated. Dramatically improves perceived latency even if total time is the same.
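
A sketch using the Anthropic streaming helper; each text delta is forwarded as soon as it arrives:

Python · Streaming Sketch
from anthropic import Anthropic

client = Anthropic()

def stream_answer(prompt: str):
    # Yield text deltas as they arrive, rather than waiting for the full reply
    with client.messages.stream(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    ) as stream:
        for chunk in stream.text_stream:
            yield chunk  # forward to the client, e.g. over SSE or WebSocket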

Queue + Workers

For batch jobs (bulk embeddings, nightly summarisation), use a job queue (Celery, Redis Queue, SQS) with worker processes.
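
A Celery sketch, assuming a Redis broker and a hypothetical embed_and_store helper:

Python · Queue + Worker Sketch
from celery import Celery

app = Celery("ai_jobs", broker="redis://localhost:6379/0")

@app.task(autoretry_for=(Exception,), retry_backoff=True, max_retries=3)
def embed_document(doc_id: str):
    # Runs in a worker process, off the request path; retries back off exponentially
    embed_and_store(doc_id)  # hypothetical bulk-embedding helper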

Model Selection Strategy

Simple query?

Use a fast, cheap model — GPT-4o mini, Claude Haiku. Handles most intent classification, short Q&A.

↓ if complex
Reasoning required?

Route to a mid-tier model — GPT-4o, Claude Sonnet. Good for multi-step tasks and analysis.

↓ if very complex
Maximum quality?

Route to a flagship model — o1, Claude Opus, Gemini Ultra. Reserve for high-value, low-volume tasks.
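
A sketch of tiered routing; the model ids are placeholders for your provider's, and the heuristic is deliberately crude:

Python · Tiered Routing Sketch
MODELS = {
    "fast": "claude-haiku",       # placeholder ids: substitute your provider's
    "mid": "claude-sonnet",
    "flagship": "claude-opus",
}

def route(query: str) -> str:
    # Crude length heuristic; a cheap classifier call (using the fast
    # tier itself) usually makes a better router
    if len(query) < 200:
        return MODELS["fast"]       # intent classification, short Q&A
    if len(query) < 2000:
        return MODELS["mid"]        # multi-step tasks, analysis
    return MODELS["flagship"]       # reserve for high-value requests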

Observability Stack

You cannot improve what you cannot measure. A production AI system needs structured logging and evaluation at every step.

📊 Metrics
  • Request latency (p50, p95, p99)
  • Token usage (input / output)
  • Cost per request
  • Cache hit rate
  • Error rate by model / endpoint
🔍 Tracing
  • Trace each step: retrieval, prompt build, LLM call
  • Log input/output at every stage
  • Use LangSmith, Langfuse, or Helicone
  • Correlate traces with user feedback
✅ Evaluation
  • Automated evals: faithfulness, relevance, correctness
  • Human labelling pipeline for golden datasets
  • Regression tests on every model change
  • A/B test prompts and models
Python · Structured Logging for LLM Calls
import time
import uuid
import structlog
from anthropic import AsyncAnthropic

log = structlog.get_logger()
client = AsyncAnthropic()

async def llm_call_with_observability(
    prompt: str,
    model: str = "claude-sonnet-4-6",
    trace_id: str | None = None
) -> str:
    trace_id = trace_id or str(uuid.uuid4())
    start = time.monotonic()

    log.info("llm_call_start", trace_id=trace_id, model=model,
             prompt_tokens=len(prompt.split()))

    try:
        response = await client.messages.create(
            model=model,
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}]
        )
        latency_ms = (time.monotonic() - start) * 1000

        log.info("llm_call_success",
            trace_id=trace_id,
            model=model,
            latency_ms=round(latency_ms, 1),
            input_tokens=response.usage.input_tokens,
            output_tokens=response.usage.output_tokens,
            stop_reason=response.stop_reason
        )
        return response.content[0].text

    except Exception as e:
        log.error("llm_call_error", trace_id=trace_id, model=model,
                  error=str(e), latency_ms=round((time.monotonic()-start)*1000,1))
        raise

Failure Modes & Mitigations

Failure | Symptom | Mitigation
Hallucination | Model confidently states false facts | RAG with citations · Retrieval verification · Output validation
Prompt Injection | User input overrides system instructions | Input sanitisation · Separate system/user contexts · Output parsing
Context Overflow | Context window exceeded → truncation | Chunk size limits · Sliding window · Summarisation
Runaway Agents | Agent loops indefinitely, burning tokens | Max iteration caps · Budget limits · Timeout hard stops
API Rate Limits | 429 errors under load | Exponential backoff · Request queuing · Multiple API keys
Model Regression | New model version changes behaviour | Pin model versions · Automated regression evals · Canary deploys
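
A sketch of the backoff mitigation from the table, with jitter to avoid thundering herds:

Python · Exponential Backoff Sketch
import random
import time
from anthropic import RateLimitError

def call_with_backoff(fn, max_attempts: int = 5):
    # Retry 429s with exponential backoff plus random jitter
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise                                        # budget exhausted: surface it
            time.sleep(2 ** attempt + random.uniform(0, 1))  # 1s, 2s, 4s, 8s + jitter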

Production Checklist

  • Cache in front of the LLM (semantic or exact-match), with TTLs
  • Tiered model routing: cheap models for simple queries
  • Retries with exponential backoff; a fallback plan for API outages
  • Hard iteration caps and token budgets on agent loops
  • Structured logs with trace IDs, token counts, and latency on every call
  • Pinned model versions, with regression evals before any change
  • Streaming on interactive endpoints; queues and workers for batch jobs

Frequently Asked Questions

When should I use RAG vs fine-tuning?

Use RAG when your data changes frequently or you need source citations — it's faster to update and easier to debug. Use fine-tuning when you need to change the model's style, format, or specialised reasoning on a fixed domain. Most production systems start with RAG; fine-tuning is added when RAG quality plateaus. See Fine-Tuning & LoRA and RAG Systems for details.

How do I reduce LLM costs in production?

The biggest wins: (1) route simple queries to cheaper models, (2) add a semantic cache, (3) compress prompts — remove filler text and redundant instructions, (4) use streaming to avoid timeout retries, (5) batch offline jobs. Typically you can cut costs 60–80% without sacrificing quality by applying all five.

What's the best way to evaluate an AI system?

Layer multiple signal types: automated metrics (RAGAS for RAG, exact match for structured output), LLM-as-judge for free-form quality, and human review for high-stakes flows. Build a golden dataset of 50–200 hand-labelled examples and run it on every change. Track metrics over time to detect regressions.
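
A minimal regression-test sketch over such a golden dataset; answer_query (your system's entry point) and grade (an LLM-as-judge scoring 0 to 1) are hypothetical:

Python · Golden Dataset Regression Sketch
import json

def run_golden_evals(path: str = "golden.jsonl", threshold: float = 0.85):
    # One JSON object per line: {"query": ..., "expected": ...}
    with open(path) as f:
        cases = [json.loads(line) for line in f]
    scores = [grade(answer_query(c["query"]), c["expected"]) for c in cases]
    mean = sum(scores) / len(scores)
    assert mean >= threshold, f"Regression: mean score {mean:.2f} < {threshold}"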

How do I handle prompt injection attacks?

Sanitise user input before inserting it into prompts. Use separate system and user message roles — never concatenate user content into the system prompt. Validate and parse structured outputs rather than trusting the model's free-form text. For agentic systems, add a tool-use approval layer for irreversible actions. See AI Ethics & Governance for more on safety.
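
A sketch of the role-separation point using the Anthropic message shape; get_untrusted_input is a hypothetical stand-in for raw request text:

Python · Role Separation Sketch
from anthropic import Anthropic

client = Anthropic()
user_input = get_untrusted_input()  # hypothetical: untrusted text from the request

# Instructions live in the system prompt; user content stays in the user
# role. Never interpolate untrusted text into the system string.
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=512,
    system="You are a support assistant. Answer only from the provided docs.",
    messages=[{"role": "user", "content": user_input}],
)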

Use the navigation below to continue to the next lesson or explore related topics.