Embodied AI & Foundation Models
What if a robot could understand natural language instructions? What if it could figure out how to complete a task it's never been trained on, by drawing on its knowledge of how the world works? That's the promise of Embodied AI — giving robots the common sense and generalization of large foundation models, grounded in a physical body that can act in the real world. It's the most exciting frontier in robotics today.
What is Embodied AI?
Embodied AI is the study of AI agents that learn and act through physical interaction with an environment — an "embodied" agent has a body and perceives the world through sensors, unlike a language model that only processes text. The goal is to create AI that understands the physical world the way humans do: through years of embodied experience, not just text descriptions of the world.
Why embodiment matters for intelligence
Researchers like Rodney Brooks and Yann LeCun have long argued that intelligence cannot be separated from physical interaction with the world. A language model that has read every physics textbook still doesn't truly understand "heavy" the way a child who has dropped things does. Embodied AI tries to close this gap — giving AI systems grounded, physical understanding.
The key challenge: generalization
Traditional robot learning is narrow — a robot trained to pick apples can't pick oranges without retraining. Foundation models trained on internet-scale data have broad world knowledge. Embodied AI asks: can we give robots this broad generalization? Can a robot that has never seen a specific object figure out how to handle it by reasoning about what it is?
Foundation Models for Robots
RT-2 (Robotics Transformer 2) — Google DeepMind
RT-2 takes a vision-language model (VLM) pretrained on internet-scale text and image data, and fine-tunes it on robot trajectory data. The result: a single model that can follow natural language instructions ("put the apple in the blue bowl"), generalize to novel objects it's never grasped before, and exhibit emergent reasoning ("pick the object that can be used to cool a drink"). RT-2 represents robot actions as text tokens — the same architecture used for language generates motor commands.
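To make the "actions as text tokens" idea concrete, here is a minimal sketch. It is not RT-2's actual tokenizer; the bin count, action dimensions, and normalization range are assumptions for illustration. Each dimension of a continuous end-effector action is discretized into integer bins that a VLM can emit as ordinary text.

```python
import numpy as np

# Illustrative only: RT-2's real action space and tokenizer differ. Assumptions:
# a 7-DoF end-effector action (xyz delta, rpy delta, gripper) normalized to
# [-1, 1] and discretized into 256 bins per dimension.
N_BINS = 256
LOW, HIGH = -1.0, 1.0

def action_to_tokens(action: np.ndarray) -> str:
    """Discretize each action dimension and emit it as space-separated integer tokens."""
    bins = np.clip(np.round((action - LOW) / (HIGH - LOW) * (N_BINS - 1)), 0, N_BINS - 1)
    return " ".join(str(int(b)) for b in bins)

def tokens_to_action(text: str) -> np.ndarray:
    """Parse the model's text output back into a continuous action vector."""
    bins = np.array([int(t) for t in text.split()], dtype=np.float64)
    return bins / (N_BINS - 1) * (HIGH - LOW) + LOW

# Round trip: a reach-and-close action becomes a short "sentence" the VLM can emit.
a = np.array([0.10, -0.05, 0.20, 0.0, 0.0, 0.3, 1.0])
print(action_to_tokens(a))                      # "140 121 153 128 128 166 255"
print(tokens_to_action(action_to_tokens(a)))    # close to the original action
```

Because the actions live in the same token vocabulary as language, one model head can produce both answers to questions and motor commands.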
Gato — DeepMind's Generalist Agent
Gato is a single transformer model trained across 604 tasks: playing Atari games, captioning images, chatting in natural language, and controlling robot arms. All tasks are represented as sequences of tokens. This demonstrated that a single neural network architecture can be both a conversational AI and a robot controller — they're not fundamentally different problems, just different data distributions.
PaLM-E — Google's Embodied Language Model
PaLM-E integrates sensor observations (camera images, proprioception, point clouds) directly into a large language model. Instead of just asking "what do you see?", you can ask "plan the steps to pack this box for shipping" while showing it a camera view of a cluttered table. PaLM-E reasons about physical tasks using language, then passes plans to lower-level controllers for execution.
π0 (Physical Intelligence)
Physical Intelligence, founded in 2023 with $70M in funding, is building a foundation model for physical tasks — a generalist robot "brain" trained on diverse manipulation tasks, much as GPT is trained on diverse text. Their first model, π0 (pi-zero), demonstrated dexterous manipulation across multiple robot morphologies. Many in the field see this as the long-anticipated OpenAI moment for robotics.
Using VLMs as Robot Brains Today
You don't need to train RT-2 from scratch. Existing vision-language models can be used as robot task planners right now.
LLMs for task decomposition
Give a language model a task description ("clean up the kitchen") and a list of available robot skills (navigate_to, pick_up, place_on, open_drawer). The LLM generates a sequence of skill calls in Python-like pseudocode, and lower-level controllers execute each skill. This division of labor — LLM as high-level planner, specialist controllers as executors — is the language-model take on task and motion planning (TAMP): the LLM handles the task level, while the skill controllers handle the motion level.
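Here is a minimal sketch of that pattern, assuming a four-skill library and a placeholder LLM call; the skill names, prompt, and canned response are illustrative, not from any particular system.

```python
# Hedged sketch of "LLM as planner, controllers as executors".
SKILLS = ["navigate_to(location)", "pick_up(object)",
          "place_on(object, surface)", "open_drawer(drawer)"]

PROMPT_TEMPLATE = """You control a robot with these skills:
{skills}

Task: {task}
Respond with one skill call per line, nothing else."""

def call_llm(prompt: str) -> str:
    # Placeholder for a real LLM API call; canned output so the sketch runs.
    return "navigate_to(counter)\npick_up(mug)\nplace_on(mug, dish_rack)"

def plan(task: str) -> list:
    prompt = PROMPT_TEMPLATE.format(skills="\n".join(SKILLS), task=task)
    return [line.strip() for line in call_llm(prompt).splitlines() if line.strip()]

# Each returned line is dispatched to the matching low-level controller.
for step in plan("clean up the kitchen"):
    print(step)
```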
SayCan (Google)
SayCan (2022) was a landmark paper combining LLM planning with robot affordances. The LLM generates candidate next actions; an "affordance function" (a value function from RL) scores how feasible each action is given the current robot state. The robot executes the candidate with the highest combined score (the LLM's likelihood for the action multiplied by its affordance value), so a step that sounds sensible but is currently impossible gets filtered out. This grounds LLM planning in physical reality.
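The selection rule itself fits in a few lines. This toy sketch uses made-up scores rather than real LLM likelihoods or a trained value function, but it shows why the combination matters.

```python
# Toy sketch of SayCan-style action selection: combine what the LLM "wants" to do
# (language likelihood) with what the robot "can" do right now (affordance value).
# The dictionary scores below are invented for illustration.
def select_action(instruction, state, candidates, llm_score, affordance_score):
    """Pick the candidate maximizing LLM likelihood x affordance value."""
    return max(candidates,
               key=lambda a: llm_score(instruction, a) * affordance_score(state, a))

candidates = ["pick up the can", "go to the fridge", "open the drawer"]
llm = lambda instr, a: {"pick up the can": 0.6, "go to the fridge": 0.3,
                        "open the drawer": 0.1}[a]           # what sounds right
afford = lambda state, a: {"pick up the can": 0.05, "go to the fridge": 0.9,
                           "open the drawer": 0.4}[a]        # what is feasible now

# "Pick up the can" sounds best to the LLM, but no can is in reach,
# so the feasible "go to the fridge" wins.
print(select_action("bring me a drink", {}, candidates, llm, afford))
```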
Claude / GPT-4V + ROS
A practical pattern: capture a camera frame, send it to a multimodal LLM API (Claude or GPT-4V) with a prompt describing the robot's capabilities, parse the natural language response into ROS action goals. Simple, works now, doesn't require GPU training. Great for prototyping and demonstrations. Latency (1–3 seconds per LLM call) limits real-time applications.
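A minimal sketch of that loop, assuming the Anthropic Python SDK and OpenCV. The model name, prompt, skill names, and execute_skill() stub are illustrative placeholders, and the ROS dispatch is left as a comment.

```python
import base64
import cv2          # pip install opencv-python
import anthropic    # pip install anthropic; reads ANTHROPIC_API_KEY from the environment

SYSTEM = ("You control a mobile manipulator with skills: navigate_to(x), "
          "pick_up(x), place_on(x, y). Reply with one skill call per line.")

def grab_frame_jpeg_b64(camera_index: int = 0) -> str:
    """Capture one camera frame and return it as base64-encoded JPEG."""
    ok, frame = cv2.VideoCapture(camera_index).read()
    assert ok, "camera capture failed"
    return base64.b64encode(cv2.imencode(".jpg", frame)[1]).decode()

def ask_planner(task: str) -> list:
    client = anthropic.Anthropic()
    msg = client.messages.create(
        model="claude-3-5-sonnet-latest",   # assumed model name; use whatever you have access to
        max_tokens=512,
        system=SYSTEM,
        messages=[{"role": "user", "content": [
            {"type": "image", "source": {"type": "base64", "media_type": "image/jpeg",
                                         "data": grab_frame_jpeg_b64()}},
            {"type": "text", "text": f"Task: {task}"},
        ]}],
    )
    return [ln.strip() for ln in msg.content[0].text.splitlines() if ln.strip()]

def execute_skill(call: str) -> None:
    # Placeholder: a ROS system would translate this into an action goal
    # (actionlib in ROS 1, rclpy action clients in ROS 2) and wait for the result.
    print("executing:", call)

for step in ask_planner("put the apple in the blue bowl"):
    execute_skill(step)
```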
Frequently Asked Questions
What is the difference between LLM robots and traditional robots?
Traditional robots are programmed with explicit rules and capabilities for specific tasks. LLM-powered robots can understand natural language, reason about novel situations, and compose existing skills in flexible ways to handle unanticipated tasks. The trade-off: it is harder to guarantee correct behavior from LLM-driven robots in safety-critical applications, and the plans an LLM produces can sometimes be physically infeasible.
Can a robot truly understand instructions today?
Sort of. Models like RT-2 can follow simple instructions about novel objects — "pick up the green thing that would squirt in your eye" — by reasoning about visual features and language meaning. But current systems still fail on complex multi-step reasoning, rare edge cases, and fine motor tasks requiring dexterity. We're in the early stages of genuinely language-grounded robot intelligence.
What is Open-X Embodiment?
A 2023 collaboration between 33 research labs that pooled 1 million+ robot trajectories across 22 robot embodiments into a single open dataset. The resulting RT-X model, trained on this data, generalized better across tasks and robots than models trained on single-robot data. It's the ImageNet moment for robot learning data — a large, diverse, open dataset enabling pre-training.
How far away are truly general-purpose household robots?
Most experts estimate 5–15 years to robots that can handle most household tasks reliably in arbitrary homes. The bottlenecks are dexterity (manipulating deformable objects, fine motor tasks), robustness (thousands of hours of failure-free operation), and generalization (working in any home, not just the one it was tested in). The progress since 2022 has been faster than most predicted — the timeline is shrinking.