Multimodal AI
Multimodal AI systems understand and generate content across multiple data types — text, images, audio, and video. CLIP showed that images and text can share a unified embedding space. GPT-4V, Gemini, and LLaVA demonstrated that language models can reason about visual content with near-human capability on many tasks. Multimodal capability is now the frontier of AI products.
What is Multimodal Learning?
Traditional AI models are unimodal — a language model only processes text, a CNN only processes images. Multimodal models learn shared representations across modalities, enabling cross-modal retrieval, generation, and reasoning.
Contrastive Learning
Train image and text encoders so matching pairs (image + caption) are nearby in embedding space. CLIP is the canonical example.
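A minimal sketch of that objective, assuming image_emb and text_emb are already-computed batches of paired embeddings (a CLIP-style symmetric contrastive loss, not the original implementation):

import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # image_emb, text_emb: (N, D) batches where row i of each is a matching pair
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.T / temperature            # (N, N) cosine similarities
    targets = torch.arange(len(logits), device=logits.device)  # matching pairs sit on the diagonal
    loss_i = F.cross_entropy(logits, targets)                # image -> text direction
    loss_t = F.cross_entropy(logits.T, targets)              # text -> image direction
    return (loss_i + loss_t) / 2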
Vision-Language Models
LLMs with a visual encoder attached. The model "sees" images by converting them to patch embeddings fed into the language model.
Audio & Video
Whisper (speech → text), AudioCraft (text → audio), Sora (text → video). Each modality needs a specialised encoder/tokeniser.
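For speech-to-text specifically, a minimal sketch using the Hugging Face ASR pipeline with a Whisper checkpoint; the model size and audio path are placeholders:

from transformers import pipeline

# Whisper via the automatic-speech-recognition pipeline (checkpoint and path are illustrative)
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
result = asr("meeting_recording.wav")   # local audio file
print(result["text"])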
CLIP: Contrastive Language-Image Pre-training
OpenAI's CLIP (2021) is the foundational multimodal model. It was trained on 400 million image-text pairs scraped from the internet using a simple but powerful objective:
- A Vision Transformer (ViT) encodes images; a Transformer encodes text. Both produce 512-dim embeddings.
- Given a batch of N image-text pairs, maximise cosine similarity for the N matching pairs and minimise it for the N² − N non-matching pairs.
- To classify an image, compute similarity with text prompts like "a photo of a {class}". No task-specific fine-tuning needed.
import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("dog.jpg")  # placeholder path to the image being classified

inputs = processor(
    text=["a dog", "a cat", "a car"],
    images=image,
    return_tensors="pt",
    padding=True
)
with torch.no_grad():
    outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=1)
# → tensor([[0.92, 0.07, 0.01]])  # it's a dog
Vision-Language Models (VLMs)
VLMs extend language models with the ability to process image inputs. The architecture connects a visual encoder to an LLM via a projection layer:
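A rough sketch of that connector, loosely following the LLaVA-style MLP projector; the class name and dimensions here are illustrative rather than any specific model's implementation:

import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Maps visual patch embeddings into the LLM's token-embedding space."""
    def __init__(self, vision_dim=1024, llm_dim=4096):   # dims are illustrative
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_embeddings):                  # (batch, num_patches, vision_dim)
        return self.proj(patch_embeddings)                # (batch, num_patches, llm_dim)

# The projected patches are concatenated with the text token embeddings
# and fed through the language model as ordinary "tokens".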
Key Models Compared
Open Source
- LLaVA — CLIP + LLaMA; visual instruction tuning
- LLaVA-1.6 / NeXT — Higher resolution, better OCR
- InternVL2 — Strong multilingual VLM
- Qwen-VL — Alibaba's multimodal model
- Phi-3.5-Vision — Small but capable (4B params)
- Molmo — Allen Institute for AI (Ai2); state-of-the-art open VLM (2024)
Proprietary
- GPT-4V / GPT-4o — OpenAI; vision + voice + text
- Claude 3.5 Sonnet — Anthropic; strong on documents
- Gemini 1.5 Pro — Google; native video understanding
- Gemini Flash — Fast, cheap multimodal inference
- Claude 3.5 Haiku — Fast, cheap vision inference
Multimodal Applications
Document Understanding
Extract structured data from invoices, forms, PDFs. VLMs outperform OCR+NLP pipelines by understanding layout and context jointly.
Visual Search
CLIP embeddings power multimodal search: search by text, by image, or by combined text+image query. Used in e-commerce, stock photos, medical imaging.
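A minimal text-to-image search sketch using CLIP embeddings; it assumes a list of PIL images is already loaded, and in practice the image embeddings would be precomputed and stored in a vector index:

import torch
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_images(images):
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        emb = model.get_image_features(**inputs)
    return emb / emb.norm(dim=-1, keepdim=True)           # L2-normalise for cosine similarity

def search(query, image_embeddings, top_k=5):
    inputs = processor(text=[query], return_tensors="pt", padding=True)
    with torch.no_grad():
        q = model.get_text_features(**inputs)
    q = q / q.norm(dim=-1, keepdim=True)
    scores = (q @ image_embeddings.T).squeeze(0)          # one similarity score per image
    return scores.topk(min(top_k, len(scores))).indices   # indices of the best matches

# usage: hits = search("red sneakers", embed_images(images))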
Embodied AI
Robots and autonomous agents that perceive the visual world and reason about it in language. VLMs serve as the "brain" in robotics pipelines.
Accessibility
Automatic alt-text, real-time scene description for the visually impaired, audio description of video content. High-impact, low-competition application space.
For most product applications, start with the Claude or GPT-4o API — both handle images via base64 or URL. For on-prem or high-volume use, LLaVA-1.6 or InternVL2 can run on a single A10G GPU and match GPT-4V on many tasks. CLIP embeddings are free and excellent for semantic image search without needing a full VLM.
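A rough sketch of the API route with the OpenAI Python SDK, passing a base64-encoded image in a chat message; the file path and prompt are illustrative, so check the provider's current documentation for exact parameters:

import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("invoice.png", "rb") as f:          # placeholder path
    b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Extract the invoice number, date, and total as JSON."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)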
Frequently Asked Questions
What's the difference between CLIP and a VLM like GPT-4V?
CLIP produces fixed-size embeddings for images and text — it can measure similarity but cannot generate text descriptions or answer questions. GPT-4V and similar VLMs are generative: they can describe images, answer questions about them, extract text, reason about relationships between objects, and produce structured outputs. CLIP is efficient for retrieval; VLMs are for understanding and generation.
Can VLMs understand video?
Some do natively (Gemini 1.5 Pro supports up to 1 hour of video input). Most process video as sampled frames (e.g., 1 frame/second) passed as a sequence of images. This works well for slow-moving content but misses rapid temporal dynamics. Dedicated video models like VideoLLaMA add temporal attention for better video understanding.
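A minimal frame-sampling sketch with OpenCV along the lines of the 1 frame/second approach; the video path is a placeholder:

import cv2

def sample_frames(path, fps_target=1):
    """Grab roughly one frame per second from a video file, as a list of BGR arrays."""
    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30     # fall back if metadata is missing
    step = max(int(round(fps / fps_target)), 1)
    frames, i = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if i % step == 0:
            frames.append(frame)
        i += 1
    cap.release()
    return frames

frames = sample_frames("clip.mp4")            # frames are then passed to the VLM as a sequence of images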
How do I choose image resolution for VLM inputs?
Higher resolution improves OCR and fine detail recognition but increases token count (cost and latency). Most VLMs tile high-res images into patches. For documents and charts, use the highest resolution supported. For general scene understanding, 512×512 or 768×768 is usually sufficient. Check the model's native resolution — feeding an 8MP photo to a model trained on 336×336 patches wastes tokens.
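A small sketch of capping resolution with Pillow before sending an image to a VLM; the 768-pixel target is an assumption, so substitute the model's documented native resolution:

from PIL import Image

def cap_resolution(path, max_side=768):       # 768 is an illustrative target
    img = Image.open(path)
    img.thumbnail((max_side, max_side))       # resizes in place, preserving aspect ratio
    return img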