AI APIs & Foundation Model Services

You don't have to train a model to build an AI application. Foundation model APIs let you call the world's most capable AI models with a single HTTP request — paying per token, with no infrastructure to manage. This is the fastest path from idea to working AI application.

What is a Foundation Model API?

A foundation model API is an HTTP endpoint that accepts a prompt and returns a model-generated response. You send text in, you get text (or images, or structured data) back. The provider hosts, scales, and maintains the model — you just call it.
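
As a concrete sketch, here is what that single HTTP request can look like against OpenAI's Chat Completions endpoint (the request is illustrative; other providers differ in details but follow the same pattern):

```python
import os
import requests

# Minimal call to a hosted foundation model: POST a prompt, get text back.
# Assumes OPENAI_API_KEY is set in the environment.
response = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
    json={
        "model": "gpt-4o",
        "messages": [{"role": "user", "content": "Explain tokens in one sentence."}],
    },
    timeout=30,
)
print(response.json()["choices"][0]["message"]["content"])
```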

Token-Based Pricing

APIs charge per token — roughly 3/4 of a word. You pay separately for input tokens (your prompt) and output tokens (the model's response). As of 2025, GPT-4o costs ~$2.50 per million input tokens and ~$10 per million output tokens. Claude 3.5 Sonnet costs $3/$15 per million input/output tokens. For most applications, API costs are surprisingly manageable: a $100/month budget covers tens of thousands of flagship-model interactions, and far more on cost-optimized variants like Gemini Flash.

Cost sense-check: A typical chatbot interaction is ~500 input + 300 output tokens. At GPT-4o pricing that's (500 × $2.50 + 300 × $10) / 1M ≈ $0.004 per interaction (less than half a cent). 10,000 such interactions/day ≈ $43/day. Factor this into your product economics.
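
A quick sketch of that arithmetic, using the GPT-4o rates quoted above:

```python
# Per-million-token rates quoted above (2025 GPT-4o pricing).
INPUT_PER_M, OUTPUT_PER_M = 2.50, 10.00  # USD

def interaction_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one API call at the rates above."""
    return input_tokens / 1e6 * INPUT_PER_M + output_tokens / 1e6 * OUTPUT_PER_M

per_chat = interaction_cost(500, 300)
print(f"${per_chat:.5f} per chat")          # $0.00425
print(f"${per_chat * 10_000:.2f} per day")  # $42.50 at 10k chats/day
```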

The Major Foundation Model Providers

OpenAI API (GPT-4o, o3)

The original and still the most widely used. GPT-4o is the flagship multimodal model — accepts text, images, and audio. The o-series (o3, o1) are reasoning models that "think before answering" — better for math, coding, and complex reasoning. Available via api.openai.com or through Azure OpenAI Service (enterprise SLA, private endpoints, EU data residency). Extensive documentation, large developer ecosystem, hundreds of client libraries.
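
With the official openai Python library, a basic call looks like this (the model name and prompts are illustrative):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
completion = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "What does a reasoning model do differently?"},
    ],
)
print(completion.choices[0].message.content)
```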

Anthropic API (Claude 3.5 Sonnet, Claude 3 Opus)

Anthropic's Claude models are known for long context windows (200K tokens, enough for a small codebase or a few hundred pages of text), strong instruction following, and thoughtful safety properties. Claude 3.5 Sonnet is a sweet spot of capability and cost. Available via api.anthropic.com or AWS Bedrock. Preferred by many enterprises for its reliability and constitutional AI approach.
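
The anthropic Python library follows a similar shape; note that max_tokens is required (the model ID below is an example, so check Anthropic's docs for current IDs):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
message = client.messages.create(
    model="claude-3-5-sonnet-20241022",  # example model ID; check current docs
    max_tokens=1024,                     # required cap on output tokens
    messages=[{"role": "user", "content": "What is a context window?"}],
)
print(message.content[0].text)
```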

Google AI / Gemini API

Google's Gemini models are natively multimodal (trained on text, images, video, audio simultaneously). Gemini 1.5 Pro has a 1 million token context window. Available via Google AI Studio (free tier) and Vertex AI (enterprise). Gemini Flash is the cost-optimized variant for high-throughput applications.
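
A minimal sketch with the google-generativeai library (the free-tier key comes from Google AI Studio; the key and model name here are placeholders):

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_GOOGLE_AI_STUDIO_KEY")  # placeholder key
model = genai.GenerativeModel("gemini-1.5-flash")     # cost-optimized variant
response = model.generate_content("Summarize multimodal training in two sentences.")
print(response.text)
```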

Cloud Provider Aggregators

AWS Bedrock provides a single API for accessing models from Anthropic (Claude), Meta (Llama), Mistral, Cohere, and Amazon's own Titan — with AWS-native security, IAM, and data privacy guarantees. Azure AI Foundry (formerly Azure OpenAI + Azure AI) aggregates OpenAI models plus open-source models with enterprise SLAs. These aggregators are popular in enterprises that need contractual data privacy guarantees.
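
Through an aggregator, the call goes via the cloud SDK rather than the model vendor's own API. A sketch with boto3 and Bedrock (the model ID is an example, and request/response shapes vary by model family):

```python
import json
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")
response = client.invoke_model(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # example Bedrock model ID
    body=json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 512,
        "messages": [{"role": "user", "content": "Hello from Bedrock"}],
    }),
)
result = json.loads(response["body"].read())
print(result["content"][0]["text"])
```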

Open-Weight Models You Can Self-Host

Not all foundation models require API access. Meta's Llama series, Mistral, Qwen, and DeepSeek are open-weight models you can download and run yourself — on your own GPU or a rented cloud instance.

🦙 Llama 3.3 (70B)

Meta's open-weight model. Competitive with GPT-4-class models on many benchmarks. Free to use, self-host, and fine-tune, subject to Meta's community license.

🌊 Mistral / Mixtral

French AI startup. Efficient models, many released under the Apache 2.0 license — truly open source. Mixtral uses a Mixture-of-Experts architecture for excellent performance per compute dollar.

🔮 DeepSeek R1

Chinese lab's open-weight reasoning model. Competitive with o1 on reasoning benchmarks. Released under MIT license — one of the most capable truly open models.

⚙️ Qwen 2.5 (72B)

Alibaba's open-weight model. Strong multilingual capabilities. Available in sizes from 0.5B to 72B. Popular in Asia and for multilingual applications.
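
Self-hosting any of these can be as simple as loading the weights with the transformers library. A sketch using a small Qwen checkpoint for illustration (assumes transformers and accelerate are installed and a GPU is available; swap in any open-weight model you have access to):

```python
from transformers import pipeline

# Same chat-style interface as the APIs above, but the weights run locally.
pipe = pipeline(
    "text-generation",
    model="Qwen/Qwen2.5-0.5B-Instruct",  # small model for illustration
    device_map="auto",
)
messages = [{"role": "user", "content": "Why self-host a model?"}]
out = pipe(messages, max_new_tokens=128)
print(out[0]["generated_text"][-1]["content"])  # the assistant's reply
```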

API vs. Fine-Tuning vs. Self-Hosting: When to Use Each

Use the API (No Training Needed)

For 80% of use cases — chatbots, summarization, code generation, Q&A, classification — the raw API with a good system prompt is sufficient. Start here. It's the fastest path to a working product, with zero infrastructure cost and state-of-the-art capability.

Fine-Tune When You Need Style or Domain Specialization

Fine-tuning a base model (or using parameter-efficient methods like LoRA on an open-weight model) makes sense when: you need consistent tone/format the API doesn't reliably produce, you have proprietary domain knowledge to encode, or you need sub-100ms latency that you can't achieve with large API models.
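
For the open-weight route, parameter-efficient fine-tuning is typically set up with the peft library. A minimal LoRA sketch (the model name and hyperparameters are illustrative, not a recommendation):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
config = LoraConfig(
    r=16,                                 # rank of the low-rank update matrices
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # attach adapters to attention projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically well under 1% of all weights
```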

Self-Host for Privacy, Cost at Scale, or Customization

Self-hosting open-weight models on your own infrastructure makes sense when: data privacy requirements prevent sending data to third-party APIs, you're at a scale where API costs exceed hosting costs (usually millions of requests/day), or you need complete control over the model behavior.

Frequently Asked Questions

Is my data safe when I use a foundation model API?

It depends on the provider and your contract. By default, OpenAI and Anthropic do not train on data sent through their APIs (in contrast with some consumer products, where conversations may be used for training). Enterprise agreements with AWS Bedrock or Azure OpenAI add contractual data privacy guarantees and ensure data doesn't leave your chosen region. If you're handling PII, financial data, or health data, always review the provider's data processing agreements before sending data through the API.

What is a context window and why does it matter?

The context window is the maximum amount of text (in tokens) a model can process in a single request — including both your prompt and the model's response. GPT-4o has a 128K token context; Claude 3.5 Sonnet has 200K; Gemini 1.5 Pro has 1M. Larger context windows let you: include entire codebases in prompts, summarize long documents, maintain longer conversations, and do multi-document analysis without chunking. Context window size is a major factor in choosing a model for document-heavy applications.
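
You can count tokens before sending a request to check whether a prompt fits. A sketch with the tiktoken library, which covers OpenAI's tokenizers (other providers expose their own counting utilities; the file path is a placeholder):

```python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")  # resolves to the o200k_base tokenizer
prompt = open("long_document.txt").read()    # placeholder input document
n_tokens = len(enc.encode(prompt))
print(f"{n_tokens} tokens; fits in GPT-4o's 128K window: {n_tokens < 128_000}")
```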

What is RAG and how does it relate to foundation model APIs?

RAG (Retrieval-Augmented Generation) is a pattern where you retrieve relevant documents from a knowledge base (using vector search) and include them in your API prompt — giving the model access to information not in its training data. Instead of fine-tuning a model on your company's documentation (expensive), you retrieve the relevant docs at query time and include them in the context. It's the most common pattern for building knowledge-base chatbots and enterprise Q&A systems. See the vector databases article for the retrieval side of RAG.
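
The pattern reduces to a few lines once retrieval exists. In this sketch, vector_search() is a hypothetical stand-in for your vector database query:

```python
from openai import OpenAI

client = OpenAI()

def answer_with_rag(question: str) -> str:
    # vector_search() is hypothetical: it returns the top-k relevant text chunks.
    docs = vector_search(question, top_k=3)
    context = "\n\n".join(docs)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```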

What is function calling / tool use in AI APIs?

Function calling (also called "tool use") allows a model to request that your application call a function and return the result. You define available functions (like search_database(query) or send_email(to, subject, body)) in your API call. The model decides when to use them, formats the arguments, and incorporates the result into its response. This is how AI agents work — the model orchestrates calls to external tools to accomplish tasks that require live data or actions in the world.
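
Here is a minimal sketch of the flow using OpenAI's tools parameter (the search_database function mirrors the example above and is hypothetical):

```python
from openai import OpenAI

client = OpenAI()

# Declare the functions the model is allowed to request.
tools = [{
    "type": "function",
    "function": {
        "name": "search_database",
        "description": "Search the product database for matching items.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Find waterproof hiking boots"}],
    tools=tools,
)

message = response.choices[0].message
if message.tool_calls:  # the model chose to call a tool
    call = message.tool_calls[0]
    print(call.function.name, call.function.arguments)  # JSON-encoded arguments
    # Your app runs the function, appends the result as a "tool" message,
    # and calls the API again so the model can produce a final answer.
```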
