AI System Design
Building a prototype AI app is easy. Building one that handles 10,000 users, stays within budget, recovers from failures, and improves over time — that is AI system design. This page covers the architecture patterns and tradeoffs that separate demos from production systems.
Core Design Dimensions
- Latency: time to first token and end-to-end response time. Critical for interactive apps.
- Cost: input/output tokens, compute, storage. Scales with usage; design for it.
- Throughput: requests per second. Batching, caching, and async patterns matter here.
- Quality: accuracy, relevance, helpfulness. Often trades off against speed and cost.
- Reliability: uptime, graceful degradation, retry logic, circuit breakers.
- Observability: logging, tracing, evaluation. How do you know when it breaks?
Common AI System Architectures
1. Simple LLM Gateway
The baseline: your app calls an LLM API directly. Good for prototypes and low-traffic applications.
App → LLM API (OpenAI / Claude / Gemini) → Response

Pros:
- Simple to build
- No infra to manage
- Fast to iterate

Cons:
- No caching
- No rate limiting
- No fallback on API outage
- Hard to switch providers
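For reference, the entire pattern fits in a few lines. A minimal sketch, assuming the Anthropic Python SDK and an ANTHROPIC_API_KEY in the environment (any provider SDK looks much the same):

```python
# Simple LLM gateway: the app calls the provider API directly.
# The model name is illustrative.
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def ask(prompt: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text
```

Most of the rest of this page is about what this snippet lacks: caching, routing, retries, and observability.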
2. RAG Pipeline
Retrieval-Augmented Generation grounds the LLM in your own data. The retriever fetches relevant context; the LLM generates with that context in the prompt. This is the most common production pattern for knowledge-intensive apps.
Query → Retriever (top-k chunks) → LLM (query + context) → Response
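A compressed sketch of the pipeline. The `vector_store` retriever here is a hypothetical stand-in for whatever index the system actually uses; chunking, embedding, and indexing are covered in RAG Systems:

```python
# RAG in miniature: retrieve top-k chunks, pack them into the prompt,
# and generate with that context. `vector_store` is hypothetical.
from anthropic import Anthropic

client = Anthropic()

def rag_answer(query: str, vector_store, k: int = 5) -> str:
    chunks = vector_store.search(query, top_k=k)           # top-k chunks
    context = "\n\n".join(chunk.text for chunk in chunks)
    prompt = (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text
```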
3. Agentic Pipeline
When the task requires multiple steps, tool calls, or decisions, a simple LLM call is not enough. An agent loop handles this — but adds latency and cost. Design for bounded iterations and human-in-the-loop checkpoints.
Task → Agent (plan + dispatch) → Tools → Agent (reason + synthesise) → Response
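One way to realise a bounded loop is with the Anthropic tool-use API; the weather tool below is a stand-in for real tools, and `max_steps` enforces the bounded-iteration rule:

```python
# Bounded agent loop: the model plans and requests tool calls, the app
# executes them and feeds results back, until the model finishes or the
# step budget runs out. The get_weather tool is illustrative.
from anthropic import Anthropic

client = Anthropic()

TOOLS = [{
    "name": "get_weather",
    "description": "Return current weather for a city.",
    "input_schema": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]

def run_tool(name: str, args: dict) -> str:
    if name == "get_weather":
        return f"Sunny in {args['city']}"  # stub: call a real API here
    return f"unknown tool: {name}"

def run_agent(task: str, max_steps: int = 5) -> str:
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):  # bounded iterations keep cost predictable
        response = client.messages.create(
            model="claude-sonnet-4-6", max_tokens=1024,
            tools=TOOLS, messages=messages,
        )
        if response.stop_reason != "tool_use":
            return response.content[0].text  # final synthesis
        # Execute each requested tool call and feed the results back
        messages.append({"role": "assistant", "content": response.content})
        results = [{
            "type": "tool_result",
            "tool_use_id": block.id,
            "content": run_tool(block.name, block.input),
        } for block in response.content if block.type == "tool_use"]
        messages.append({"role": "user", "content": results})
    return "Step budget exhausted"  # degrade gracefully, or escalate to a human
```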
Scalability Patterns
Caching
LLM calls are expensive and slow. Cache aggressively at every layer. The snippet below sketches a semantic cache: it returns a stored response whenever a new query embeds close enough to a previously answered one.
```python
import redis
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer('all-MiniLM-L6-v2')
cache = redis.Redis()

def semantic_cache_get(query: str, threshold: float = 0.92) -> str | None:
    query_vec = encoder.encode(query)
    # Search cached queries by vector similarity. A linear scan is fine
    # for a sketch; use a proper vector index at scale.
    for key in cache.keys("cache:*"):
        cached = cache.hgetall(key)
        cached_vec = np.frombuffer(cached[b"embedding"], dtype=np.float32)
        similarity = np.dot(query_vec, cached_vec) / (
            np.linalg.norm(query_vec) * np.linalg.norm(cached_vec)
        )
        if similarity > threshold:
            return cached[b"response"].decode()
    return None  # cache miss

def semantic_cache_set(query: str, response: str) -> None:
    key = f"cache:{hash(query)}"
    embedding = encoder.encode(query).astype(np.float32).tobytes()
    cache.hset(key, mapping={"response": response, "embedding": embedding})
    cache.expire(key, 3600)  # 1-hour TTL
```
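Wiring the cache around a generation is then a two-line check; `call_llm` below is a hypothetical gateway function:

```python
# Check the semantic cache before paying for a generation.
# `call_llm` stands in for whatever gateway function the app uses.
def cached_llm_call(query: str) -> str:
    if (hit := semantic_cache_get(query)) is not None:
        return hit                      # served from cache: no API cost
    response = call_llm(query)          # cache miss: pay for the call
    semantic_cache_set(query, response)
    return response
```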
Async & Batching

- Async I/O: use asyncio + httpx or async SDK clients; never block on LLM calls in a web handler (see the sketch after this list).
- Batching: group multiple short requests into one API call (where the API supports it). Cuts overhead and often reduces cost.
- Streaming: stream tokens to the user as they are generated. Dramatically improves perceived latency even if total time is the same.
- Job queues: for batch jobs (bulk embeddings, nightly summarisation), use a job queue (Celery, Redis Queue, SQS) with worker processes.
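A minimal async streaming call, assuming the Anthropic async client (the same shape works with any SDK that exposes an async stream):

```python
# Async + streaming: never block the event loop, and emit tokens
# as they arrive rather than waiting for the full response.
import asyncio
from anthropic import AsyncAnthropic

client = AsyncAnthropic()

async def stream_answer(prompt: str) -> str:
    parts = []
    async with client.messages.stream(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    ) as stream:
        async for text in stream.text_stream:  # tokens as they are generated
            print(text, end="", flush=True)    # forward to the user immediately
            parts.append(text)
    return "".join(parts)

# asyncio.run(stream_answer("Summarise the design dimensions above."))
```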
Model Selection Strategy
- Simple queries: use a fast, cheap model (GPT-4o mini, Claude Haiku). Handles most intent classification and short Q&A; see the routing sketch after this list.
- Moderate complexity: route to a mid-tier model (GPT-4o, Claude Sonnet). Good for multi-step tasks and analysis.
- Hard problems: route to a flagship model (o1, Claude Opus, Gemini Ultra). Reserve for high-value, low-volume tasks.
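A routing sketch. The heuristic classifier is deliberately crude and purely illustrative; production routers often use a small classifier model or explicit rules, and the model names and thresholds here are assumptions:

```python
# Tiered model routing: send each request to the cheapest model
# likely to handle it. Model names are illustrative; the classifier
# is a toy placeholder for a real intent model or rule set.
TIERS = {
    "simple": "claude-haiku-4-5",
    "moderate": "claude-sonnet-4-6",
    "complex": "claude-opus-4-1",
}

def classify(prompt: str) -> str:
    if len(prompt) < 200:
        return "simple"        # intent classification, short Q&A
    if len(prompt) > 4000:
        return "complex"       # rare, high-value requests
    return "moderate"          # multi-step tasks and analysis

def route(prompt: str) -> str:
    return TIERS[classify(prompt)]
```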
Observability Stack
You cannot improve what you cannot measure. A production AI system needs structured logging and evaluation at every step.
Metrics to track:
- Request latency (p50, p95, p99)
- Token usage (input / output)
- Cost per request
- Cache hit rate
- Error rate by model / endpoint

Tracing:
- Trace each step: retrieval, prompt build, LLM call
- Log input/output at every stage
- Use LangSmith, Langfuse, or Helicone
- Correlate traces with user feedback

Evaluation:
- Automated evals: faithfulness, relevance, correctness
- Human labelling pipeline for golden datasets
- Regression tests on every model change
- A/B test prompts and models
```python
import time
import uuid
import structlog
from anthropic import AsyncAnthropic

log = structlog.get_logger()
client = AsyncAnthropic()

async def llm_call_with_observability(
    prompt: str,
    model: str = "claude-sonnet-4-6",
    trace_id: str | None = None,
) -> str:
    trace_id = trace_id or str(uuid.uuid4())
    start = time.monotonic()
    log.info("llm_call_start", trace_id=trace_id, model=model,
             prompt_tokens=len(prompt.split()))  # rough pre-call estimate
    try:
        response = await client.messages.create(
            model=model,
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
        latency_ms = (time.monotonic() - start) * 1000
        log.info("llm_call_success",
                 trace_id=trace_id,
                 model=model,
                 latency_ms=round(latency_ms, 1),
                 input_tokens=response.usage.input_tokens,
                 output_tokens=response.usage.output_tokens,
                 stop_reason=response.stop_reason)
        return response.content[0].text
    except Exception as e:
        log.error("llm_call_error", trace_id=trace_id, model=model,
                  error=str(e),
                  latency_ms=round((time.monotonic() - start) * 1000, 1))
        raise
```

Failure Modes & Mitigations
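Two mitigations worth wiring in early are retry with exponential backoff for transient errors and rate limits, and fallback to a second model or provider for outages. A minimal sketch, with hypothetical `call_primary` and `call_fallback` gateway functions:

```python
# Retry with exponential backoff, then fall back to a second provider.
# `call_primary` and `call_fallback` are hypothetical gateway functions.
import asyncio
import random

async def resilient_call(prompt: str, max_retries: int = 3) -> str:
    for attempt in range(max_retries):
        try:
            return await call_primary(prompt)
        except Exception:
            # Backoff with jitter: ~1s, ~2s, ~4s between attempts
            await asyncio.sleep(2 ** attempt + random.random())
    return await call_fallback(prompt)  # degrade gracefully on persistent failure
```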
Production Checklist
Frequently Asked Questions
When should I use RAG vs fine-tuning?
Use RAG when your data changes frequently or you need source citations — it's faster to update and easier to debug. Use fine-tuning when you need to change the model's style, format, or specialised reasoning on a fixed domain. Most production systems start with RAG; fine-tuning is added when RAG quality plateaus. See Fine-Tuning & LoRA and RAG Systems for details.
How do I reduce LLM costs in production?
The biggest wins: (1) route simple queries to cheaper models, (2) add a semantic cache, (3) compress prompts — remove filler text and redundant instructions, (4) use streaming to avoid timeout retries, (5) batch offline jobs. Typically you can cut costs 60–80% without sacrificing quality by applying all five.
What's the best way to evaluate an AI system?
Layer multiple signal types: automated metrics (RAGAS for RAG, exact match for structured output), LLM-as-judge for free-form quality, and human review for high-stakes flows. Build a golden dataset of 50–200 hand-labelled examples and run it on every change. Track metrics over time to detect regressions.
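A regression harness over a golden set can be very small. A sketch assuming a `golden.jsonl` file of query/expected pairs and a `system_answer` function for the system under test (both are assumptions about your setup):

```python
# Exact-match regression run over a hand-labelled golden set.
import json

def run_golden_set(path: str = "golden.jsonl", baseline: float = 0.85) -> float:
    with open(path) as f:
        examples = [json.loads(line) for line in f]
    correct = sum(
        system_answer(ex["query"]).strip() == ex["expected"].strip()
        for ex in examples
    )
    accuracy = correct / len(examples)
    print(f"golden-set accuracy: {accuracy:.1%} on {len(examples)} examples")
    assert accuracy >= baseline, f"regression: {accuracy:.1%} < {baseline:.0%}"
    return accuracy
```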
How do I handle prompt injection attacks?
Sanitise user input before inserting it into prompts. Use separate system and user message roles — never concatenate user content into the system prompt. Validate and parse structured outputs rather than trusting the model's free-form text. For agentic systems, add a tool-use approval layer for irreversible actions. See AI Ethics & Governance for more on safety.
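As one illustration of the last two points, a sketch that keeps instructions in the system role, keeps user text in the user message, and validates the structured reply against a whitelist (the routing schema and intents are assumptions):

```python
# Keep user content out of the system prompt, and validate the model's
# structured output instead of trusting free-form text.
import json
from anthropic import Anthropic

client = Anthropic()
ALLOWED_INTENTS = {"billing", "bug", "question"}

SYSTEM = (
    "You are a support router. Reply only with JSON of the form "
    '{"intent": "billing|bug|question", "urgent": true|false}.'
)

def route_ticket(user_text: str) -> dict:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=256,
        system=SYSTEM,  # instructions never mix with user text
        messages=[{"role": "user", "content": user_text}],
    )
    data = json.loads(response.content[0].text)    # malformed output raises
    if data.get("intent") not in ALLOWED_INTENTS:  # whitelist, don't trust
        raise ValueError(f"unexpected intent: {data.get('intent')!r}")
    return {"intent": data["intent"], "urgent": bool(data.get("urgent"))}
```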