Multimodal AI
Multimodal AI systems understand and generate content across multiple data types — text, images, audio, and video. CLIP showed that images and text can share a unified embedding space. GPT-4V, Gemini, and LLaVA demonstrated that language models can reason about visual content with near-human capability on many tasks. Multimodal capability is now the frontier of AI products.
What is Multimodal Learning?
Traditional AI models are unimodal — a language model only processes text, a CNN only processes images. Multimodal models learn shared representations across modalities, enabling cross-modal retrieval, generation, and reasoning.
Contrastive Learning
Train image and text encoders so matching pairs (image + caption) are nearby in embedding space. CLIP is the canonical example.
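A minimal sketch of that objective, assuming image_emb and text_emb are already-computed batches of paired embeddings (a CLIP-style symmetric contrastive loss, not the original implementation):

import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # image_emb, text_emb: (N, D) batches where row i of each is a matching pair
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.T / temperature            # (N, N) cosine similarities
    targets = torch.arange(len(logits), device=logits.device)  # matching pairs sit on the diagonal
    loss_i = F.cross_entropy(logits, targets)                # image -> text direction
    loss_t = F.cross_entropy(logits.T, targets)              # text -> image direction
    return (loss_i + loss_t) / 2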
Vision-Language Models
LLMs with a visual encoder attached. The model "sees" images by converting them to patch embeddings fed into the language model.
Audio & Video
Whisper (speech → text), AudioCraft (text → audio), Sora (text → video). Each modality needs a specialised encoder/tokeniser.
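For speech-to-text specifically, a minimal sketch using the Hugging Face ASR pipeline with a Whisper checkpoint; the model size and audio path are placeholders:

from transformers import pipeline

# Whisper via the automatic-speech-recognition pipeline (checkpoint and path are illustrative)
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
result = asr("meeting_recording.wav")   # local audio file
print(result["text"])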
CLIP: Contrastive Language-Image Pre-training
OpenAI's CLIP (2021) is the foundational multimodal model. It was trained on 400 million image-text pairs scraped from the internet using a simple but powerful objective:
- A Vision Transformer (ViT) encodes images; a Transformer encodes text. Both produce 512-dim embeddings.
- Given a batch of N image-text pairs, maximise cosine similarity for the N matching pairs and minimise it for the N² − N non-matching pairs.
- To classify an image, compute similarity with text prompts like "a photo of a {class}". No task-specific fine-tuning needed.
import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("dog.jpg")  # placeholder path to the image being classified

inputs = processor(
    text=["a dog", "a cat", "a car"],
    images=image,
    return_tensors="pt",
    padding=True
)
with torch.no_grad():
    outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=1)
# → tensor([[0.92, 0.07, 0.01]])  # it's a dog
Vision-Language Models (VLMs)
VLMs extend language models with the ability to process image inputs. The architecture connects a visual encoder to an LLM via a projection layer:
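A rough sketch of that connector, loosely following the LLaVA-style MLP projector; the class name and dimensions here are illustrative rather than any specific model's implementation:

import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Maps visual patch embeddings into the LLM's token-embedding space."""
    def __init__(self, vision_dim=1024, llm_dim=4096):   # dims are illustrative
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_embeddings):                  # (batch, num_patches, vision_dim)
        return self.proj(patch_embeddings)                # (batch, num_patches, llm_dim)

# The projected patches are concatenated with the text token embeddings
# and fed through the language model as ordinary "tokens".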
Key Models Compared
Open Source
- LLaVA — CLIP + LLaMA; visual instruction tuning
- LLaVA-1.6 / NeXT — Higher resolution, better OCR
- InternVL2 — Strong multilingual VLM
- Qwen-VL — Alibaba's multimodal model
- Phi-3.5-Vision — Small but capable (4B params)
- Molmo — Allen Institute for AI (Ai2); state-of-the-art open VLM (2024)
Proprietary
- GPT-4V / GPT-4o — OpenAI; vision + voice + text
- Claude 3.5 Sonnet — Anthropic; strong on documents
- Gemini 1.5 Pro — Google; native video understanding
- Gemini Flash — Fast, cheap multimodal inference
- Claude 3.5 Haiku — Fast, cheap vision inference
Multimodal Applications
Document Understanding
Extract structured data from invoices, forms, PDFs. VLMs outperform OCR+NLP pipelines by understanding layout and context jointly.
Visual Search
CLIP embeddings power multimodal search: search by text, by image, or by combined text+image query. Used in e-commerce, stock photos, medical imaging.
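A minimal text-to-image search sketch using CLIP embeddings; it assumes a list of PIL images is already loaded, and in practice the image embeddings would be precomputed and stored in a vector index:

import torch
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_images(images):
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        emb = model.get_image_features(**inputs)
    return emb / emb.norm(dim=-1, keepdim=True)           # L2-normalise for cosine similarity

def search(query, image_embeddings, top_k=5):
    inputs = processor(text=[query], return_tensors="pt", padding=True)
    with torch.no_grad():
        q = model.get_text_features(**inputs)
    q = q / q.norm(dim=-1, keepdim=True)
    scores = (q @ image_embeddings.T).squeeze(0)          # one similarity score per image
    return scores.topk(min(top_k, len(scores))).indices   # indices of the best matches

# usage: hits = search("red sneakers", embed_images(images))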
Embodied AI
Robots and autonomous agents that perceive the visual world and reason about it in language. VLMs serve as the "brain" in robotics pipelines.
Accessibility
Automatic alt-text, real-time scene description for the visually impaired, audio description of video content. High-impact, low-competition application space.
For most product applications, start with the Claude or GPT-4o API — both handle images via base64 or URL. For on-prem or high-volume use, LLaVA-1.6 or InternVL2 can run on a single A10G GPU and match GPT-4V on many tasks. CLIP embeddings are free and excellent for semantic image search without needing a full VLM.
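A rough sketch of the API route with the OpenAI Python SDK, passing a base64-encoded image in a chat message; the file path and prompt are illustrative, so check the provider's current documentation for exact parameters:

import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("invoice.png", "rb") as f:          # placeholder path
    b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Extract the invoice number, date, and total as JSON."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)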
Frequently Asked Questions
What's the difference between CLIP and a VLM like GPT-4V?
CLIP produces fixed-size embeddings for images and text — it can measure similarity but cannot generate text descriptions or answer questions. GPT-4V and similar VLMs are generative: they can describe images, answer questions about them, extract text, reason about relationships between objects, and produce structured outputs. CLIP is efficient for retrieval; VLMs are for understanding and generation.
Can VLMs understand video?
Some do natively (Gemini 1.5 Pro supports up to 1 hour of video input). Most process video as sampled frames (e.g., 1 frame/second) passed as a sequence of images. This works well for slow-moving content but misses rapid temporal dynamics. Dedicated video models like VideoLLaMA add temporal attention for better video understanding.
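A minimal frame-sampling sketch with OpenCV along the lines of the 1 frame/second approach; the video path is a placeholder:

import cv2

def sample_frames(path, fps_target=1):
    """Grab roughly one frame per second from a video file, as a list of BGR arrays."""
    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30     # fall back if metadata is missing
    step = max(int(round(fps / fps_target)), 1)
    frames, i = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if i % step == 0:
            frames.append(frame)
        i += 1
    cap.release()
    return frames

frames = sample_frames("clip.mp4")            # frames are then passed to the VLM as a sequence of images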
How do I choose image resolution for VLM inputs?
Higher resolution improves OCR and fine detail recognition but increases token count (cost and latency). Most VLMs tile high-res images into patches. For documents and charts, use the highest resolution supported. For general scene understanding, 512×512 or 768×768 is usually sufficient. Check the model's native resolution — feeding an 8MP photo to a model trained on 336×336 patches wastes tokens.
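A small sketch of capping resolution with Pillow before sending an image to a VLM; the 768-pixel target is an assumption, so substitute the model's documented native resolution:

from PIL import Image

def cap_resolution(path, max_side=768):       # 768 is an illustrative target
    img = Image.open(path)
    img.thumbnail((max_side, max_side))       # resizes in place, preserving aspect ratio
    return img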