AI System Design
Building a prototype AI app is easy. Building one that handles 10,000 users, stays within budget, recovers from failures, and improves over time — that is AI system design. This page covers the architecture patterns and tradeoffs that separate demos from production systems.
Core Design Dimensions
- Latency: time to first token and end-to-end response time. Critical for interactive apps.
- Cost: input/output tokens, compute, storage. Scales with usage; design for it.
- Throughput: requests per second. Batching, caching, and async patterns matter here.
- Quality: accuracy, relevance, helpfulness. Often trades off against speed and cost.
- Reliability: uptime, graceful degradation, retry logic, circuit breakers.
- Observability: logging, tracing, evaluation. How do you know when it breaks?
Common AI System Architectures
1. Simple LLM Gateway
The baseline: your app calls an LLM API directly. Good for prototypes and low-traffic applications.
App → LLM API (OpenAI / Claude / Gemini) → Response

Pros:
- Simple to build
- No infra to manage
- Fast to iterate

Cons:
- No caching
- No rate limiting
- No fallback on API outage
- Hard to switch providers
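For reference, the entire pattern fits in a few lines. A minimal sketch, assuming the Anthropic Python SDK and an ANTHROPIC_API_KEY in the environment (any provider SDK looks much the same):

```python
# Simple LLM gateway: the app calls the provider API directly.
# The model name is illustrative.
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def ask(prompt: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text
```

Most of the rest of this page is about what this snippet lacks: caching, routing, retries, and observability.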
2. RAG Pipeline
Retrieval-Augmented Generation grounds the LLM in your own data. The retriever fetches relevant context; the LLM generates with that context in the prompt. This is the most common production pattern for knowledge-intensive apps.
Query → Retriever (top-k chunks) → LLM (query + context) → Response
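A compressed sketch of the pipeline. The `vector_store` retriever here is a hypothetical stand-in for whatever index the system actually uses; chunking, embedding, and indexing are covered in RAG Systems:

```python
# RAG in miniature: retrieve top-k chunks, pack them into the prompt,
# and generate with that context. `vector_store` is hypothetical.
from anthropic import Anthropic

client = Anthropic()

def rag_answer(query: str, vector_store, k: int = 5) -> str:
    chunks = vector_store.search(query, top_k=k)           # top-k chunks
    context = "\n\n".join(chunk.text for chunk in chunks)
    prompt = (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text
```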
3. Agentic Pipeline
When the task requires multiple steps, tool calls, or decisions, a simple LLM call is not enough. An agent loop handles this — but adds latency and cost. Design for bounded iterations and human-in-the-loop checkpoints.
Task → Agent (plan + dispatch) → Tools → Agent (reason + synthesise) → Response
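One way to realise a bounded loop is with the Anthropic tool-use API; the weather tool below is a stand-in for real tools, and `max_steps` enforces the bounded-iteration rule:

```python
# Bounded agent loop: the model plans and requests tool calls, the app
# executes them and feeds results back, until the model finishes or the
# step budget runs out. The get_weather tool is illustrative.
from anthropic import Anthropic

client = Anthropic()

TOOLS = [{
    "name": "get_weather",
    "description": "Return current weather for a city.",
    "input_schema": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]

def run_tool(name: str, args: dict) -> str:
    if name == "get_weather":
        return f"Sunny in {args['city']}"  # stub: call a real API here
    return f"unknown tool: {name}"

def run_agent(task: str, max_steps: int = 5) -> str:
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):  # bounded iterations keep cost predictable
        response = client.messages.create(
            model="claude-sonnet-4-6", max_tokens=1024,
            tools=TOOLS, messages=messages,
        )
        if response.stop_reason != "tool_use":
            return response.content[0].text  # final synthesis
        # Execute each requested tool call and feed the results back
        messages.append({"role": "assistant", "content": response.content})
        results = [{
            "type": "tool_result",
            "tool_use_id": block.id,
            "content": run_tool(block.name, block.input),
        } for block in response.content if block.type == "tool_use"]
        messages.append({"role": "user", "content": results})
    return "Step budget exhausted"  # degrade gracefully, or escalate to a human
```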
Scalability Patterns
Caching
LLM calls are expensive and slow. Cache aggressively at every layer. The snippet below sketches a semantic cache: it returns a stored response whenever a new query embeds close enough to a previously answered one.
```python
import redis
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer('all-MiniLM-L6-v2')
cache = redis.Redis()

def semantic_cache_get(query: str, threshold: float = 0.92) -> str | None:
    query_vec = encoder.encode(query)
    # Search cached queries by vector similarity. A linear scan is fine
    # for a sketch; use a proper vector index at scale.
    for key in cache.keys("cache:*"):
        cached = cache.hgetall(key)
        cached_vec = np.frombuffer(cached[b"embedding"], dtype=np.float32)
        similarity = np.dot(query_vec, cached_vec) / (
            np.linalg.norm(query_vec) * np.linalg.norm(cached_vec)
        )
        if similarity > threshold:
            return cached[b"response"].decode()
    return None  # cache miss

def semantic_cache_set(query: str, response: str) -> None:
    key = f"cache:{hash(query)}"
    embedding = encoder.encode(query).astype(np.float32).tobytes()
    cache.hset(key, mapping={"response": response, "embedding": embedding})
    cache.expire(key, 3600)  # 1-hour TTL
```
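Wiring the cache around a generation is then a two-line check; `call_llm` below is a hypothetical gateway function:

```python
# Check the semantic cache before paying for a generation.
# `call_llm` stands in for whatever gateway function the app uses.
def cached_llm_call(query: str) -> str:
    if (hit := semantic_cache_get(query)) is not None:
        return hit                      # served from cache: no API cost
    response = call_llm(query)          # cache miss: pay for the call
    semantic_cache_set(query, response)
    return response
```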
Async & Batching

- Async I/O: use asyncio + httpx or async SDK clients; never block on LLM calls in a web handler (see the sketch after this list).
- Batching: group multiple short requests into one API call (where the API supports it). Cuts overhead and often reduces cost.
- Streaming: stream tokens to the user as they are generated. Dramatically improves perceived latency even if total time is the same.
- Job queues: for batch jobs (bulk embeddings, nightly summarisation), use a job queue (Celery, Redis Queue, SQS) with worker processes.
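A minimal async streaming call, assuming the Anthropic async client (the same shape works with any SDK that exposes an async stream):

```python
# Async + streaming: never block the event loop, and emit tokens
# as they arrive rather than waiting for the full response.
import asyncio
from anthropic import AsyncAnthropic

client = AsyncAnthropic()

async def stream_answer(prompt: str) -> str:
    parts = []
    async with client.messages.stream(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    ) as stream:
        async for text in stream.text_stream:  # tokens as they are generated
            print(text, end="", flush=True)    # forward to the user immediately
            parts.append(text)
    return "".join(parts)

# asyncio.run(stream_answer("Summarise the design dimensions above."))
```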
Model Selection Strategy
- Simple queries: use a fast, cheap model (GPT-4o mini, Claude Haiku). Handles most intent classification and short Q&A; see the routing sketch after this list.
- Moderate complexity: route to a mid-tier model (GPT-4o, Claude Sonnet). Good for multi-step tasks and analysis.
- Hard problems: route to a flagship model (o1, Claude Opus, Gemini Ultra). Reserve for high-value, low-volume tasks.
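A routing sketch. The heuristic classifier is deliberately crude and purely illustrative; production routers often use a small classifier model or explicit rules, and the model names and thresholds here are assumptions:

```python
# Tiered model routing: send each request to the cheapest model
# likely to handle it. Model names are illustrative; the classifier
# is a toy placeholder for a real intent model or rule set.
TIERS = {
    "simple": "claude-haiku-4-5",
    "moderate": "claude-sonnet-4-6",
    "complex": "claude-opus-4-1",
}

def classify(prompt: str) -> str:
    if len(prompt) < 200:
        return "simple"        # intent classification, short Q&A
    if len(prompt) > 4000:
        return "complex"       # rare, high-value requests
    return "moderate"          # multi-step tasks and analysis

def route(prompt: str) -> str:
    return TIERS[classify(prompt)]
```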
Observability Stack
You cannot improve what you cannot measure. A production AI system needs structured logging and evaluation at every step.
Metrics to track:
- Request latency (p50, p95, p99)
- Token usage (input / output)
- Cost per request
- Cache hit rate
- Error rate by model / endpoint

Tracing:
- Trace each step: retrieval, prompt build, LLM call
- Log input/output at every stage
- Use LangSmith, Langfuse, or Helicone
- Correlate traces with user feedback

Evaluation:
- Automated evals: faithfulness, relevance, correctness
- Human labelling pipeline for golden datasets
- Regression tests on every model change
- A/B test prompts and models
```python
import time
import uuid
import structlog
from anthropic import AsyncAnthropic

log = structlog.get_logger()
client = AsyncAnthropic()

async def llm_call_with_observability(
    prompt: str,
    model: str = "claude-sonnet-4-6",
    trace_id: str | None = None,
) -> str:
    trace_id = trace_id or str(uuid.uuid4())
    start = time.monotonic()
    log.info("llm_call_start", trace_id=trace_id, model=model,
             prompt_tokens=len(prompt.split()))  # rough pre-call estimate
    try:
        response = await client.messages.create(
            model=model,
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
        latency_ms = (time.monotonic() - start) * 1000
        log.info("llm_call_success",
                 trace_id=trace_id,
                 model=model,
                 latency_ms=round(latency_ms, 1),
                 input_tokens=response.usage.input_tokens,
                 output_tokens=response.usage.output_tokens,
                 stop_reason=response.stop_reason)
        return response.content[0].text
    except Exception as e:
        log.error("llm_call_error", trace_id=trace_id, model=model,
                  error=str(e),
                  latency_ms=round((time.monotonic() - start) * 1000, 1))
        raise
```

Failure Modes & Mitigations
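Two mitigations worth wiring in early are retry with exponential backoff for transient errors and rate limits, and fallback to a second model or provider for outages. A minimal sketch, with hypothetical `call_primary` and `call_fallback` gateway functions:

```python
# Retry with exponential backoff, then fall back to a second provider.
# `call_primary` and `call_fallback` are hypothetical gateway functions.
import asyncio
import random

async def resilient_call(prompt: str, max_retries: int = 3) -> str:
    for attempt in range(max_retries):
        try:
            return await call_primary(prompt)
        except Exception:
            # Backoff with jitter: ~1s, ~2s, ~4s between attempts
            await asyncio.sleep(2 ** attempt + random.random())
    return await call_fallback(prompt)  # degrade gracefully on persistent failure
```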
Production Checklist
Frequently Asked Questions
When should I use RAG vs fine-tuning?
Use RAG when your data changes frequently or you need source citations — it's faster to update and easier to debug. Use fine-tuning when you need to change the model's style, format, or specialised reasoning on a fixed domain. Most production systems start with RAG; fine-tuning is added when RAG quality plateaus. See Fine-Tuning & LoRA and RAG Systems for details.
How do I reduce LLM costs in production?
The biggest wins: (1) route simple queries to cheaper models, (2) add a semantic cache, (3) compress prompts — remove filler text and redundant instructions, (4) use streaming to avoid timeout retries, (5) batch offline jobs. Typically you can cut costs 60–80% without sacrificing quality by applying all five.
What's the best way to evaluate an AI system?
Layer multiple signal types: automated metrics (RAGAS for RAG, exact match for structured output), LLM-as-judge for free-form quality, and human review for high-stakes flows. Build a golden dataset of 50–200 hand-labelled examples and run it on every change. Track metrics over time to detect regressions.
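A regression harness over a golden set can be very small. A sketch assuming a `golden.jsonl` file of query/expected pairs and a `system_answer` function for the system under test (both are assumptions about your setup):

```python
# Exact-match regression run over a hand-labelled golden set.
import json

def run_golden_set(path: str = "golden.jsonl", baseline: float = 0.85) -> float:
    with open(path) as f:
        examples = [json.loads(line) for line in f]
    correct = sum(
        system_answer(ex["query"]).strip() == ex["expected"].strip()
        for ex in examples
    )
    accuracy = correct / len(examples)
    print(f"golden-set accuracy: {accuracy:.1%} on {len(examples)} examples")
    assert accuracy >= baseline, f"regression: {accuracy:.1%} < {baseline:.0%}"
    return accuracy
```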
How do I handle prompt injection attacks?
Sanitise user input before inserting it into prompts. Use separate system and user message roles — never concatenate user content into the system prompt. Validate and parse structured outputs rather than trusting the model's free-form text. For agentic systems, add a tool-use approval layer for irreversible actions. See AI Ethics & Governance for more on safety.
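As one illustration of the last two points, a sketch that keeps instructions in the system role, keeps user text in the user message, and validates the structured reply against a whitelist (the routing schema and intents are assumptions):

```python
# Keep user content out of the system prompt, and validate the model's
# structured output instead of trusting free-form text.
import json
from anthropic import Anthropic

client = Anthropic()
ALLOWED_INTENTS = {"billing", "bug", "question"}

SYSTEM = (
    "You are a support router. Reply only with JSON of the form "
    '{"intent": "billing|bug|question", "urgent": true|false}.'
)

def route_ticket(user_text: str) -> dict:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=256,
        system=SYSTEM,  # instructions never mix with user text
        messages=[{"role": "user", "content": user_text}],
    )
    data = json.loads(response.content[0].text)    # malformed output raises
    if data.get("intent") not in ALLOWED_INTENTS:  # whitelist, don't trust
        raise ValueError(f"unexpected intent: {data.get('intent')!r}")
    return {"intent": data["intent"], "urgent": bool(data.get("urgent"))}
```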