Deep Learning for Robotics
Traditional robot programming is rigid: every scenario must be manually coded. A robot programmed to grasp a red cup fails when it encounters a blue one. Deep learning changes this entirely. Instead of writing rules, you show the robot examples — thousands of grasps, thousands of scenarios — and a neural network learns the underlying patterns. The robot becomes flexible, generalizing to situations it's never explicitly seen.
Why deep learning changed robotics
Before deep learning, robotic manipulation required exquisitely engineered perception pipelines: hand-tuned color filters, manually specified grasp poses, brittle feature detectors. Any change in lighting, object placement, or object type could break the whole system. Deep learning replaced these hand-engineered components with learned ones that generalize far more robustly.
The key insight
A convolutional neural network (CNN) trained on millions of images can extract features — edges, textures, shapes, object identities — that are robust to lighting, viewpoint, and appearance changes. These features are far richer than anything hand-engineered. When you plug this learned perception into a robot, everything downstream improves.
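As a concrete illustration, the snippet below uses a pretrained ResNet-18 from torchvision as a frozen feature extractor. The specific backbone and the 512-dimensional output are assumptions for the sketch, not a requirement of the approach.

```python
# Minimal sketch: a pretrained CNN as a frozen visual feature extractor.
# Assumes torchvision is installed; the 512-dim output is specific to ResNet-18.
import torch
import torch.nn as nn
import torchvision.models as models

backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = nn.Identity()   # drop the ImageNet classifier head, keep the trunk
backbone.eval()               # freeze: we only want the learned features

with torch.no_grad():
    image = torch.randn(1, 3, 224, 224)   # stand-in for a normalized camera frame
    features = backbone(image)            # shape (1, 512): a compact visual descriptor
```

Those features can then feed a grasp detector, a policy network, or a servoing controller downstream.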
1. CNNs for Visual Grasping
The grasping problem: given a camera view of an object on a table, compute the 6-DoF pose (position + orientation) for the gripper to grasp it successfully.
GraspNet and grasp pose estimation
CNN-based grasp detection networks take an RGB-D image as input and output a set of candidate grasp poses, each scored by estimated success probability. The robot picks the highest-scored grasp, solves inverse kinematics (IK), and executes. These networks are trained on datasets of thousands of human-labeled or simulation-generated grasps.
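The selection step itself is simple once the network produces scored candidates. The sketch below assumes hypothetical `grasp_net`, `solve_ik`, and `robot.execute` interfaces; it shows the shape of the pipeline, not a specific library's API.

```python
# Sketch of the grasp-detection pipeline described above. `grasp_net`, `solve_ik`,
# and `robot.execute` are hypothetical placeholders, not a real library API.
import numpy as np

def pick_best_grasp(rgbd_image, grasp_net):
    """Run the grasp network and return the highest-scored candidate pose."""
    candidates = grasp_net(rgbd_image)            # list of (pose, score) pairs
    poses = [pose for pose, score in candidates]
    scores = np.array([score for pose, score in candidates])
    return poses[int(np.argmax(scores))]          # 6-DoF pose with best predicted success

# best_pose = pick_best_grasp(rgbd, grasp_net)
# joint_targets = solve_ik(best_pose)             # convert gripper pose to joint angles
# robot.execute(joint_targets)
```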
6-DoF pose estimation
Networks like PoseCNN or FoundPose estimate the full 3D pose of a known object in a camera frame. Given a reference 3D model and a camera image, they output the object's position and orientation — enabling precise, repeatable grasping of specific parts (think: a robot assembly line picking screws with exact orientation).
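Once the object pose is estimated, grasping reduces to chaining transforms. The sketch below assumes a fixed, per-object grasp offset `T_obj_grasp` and a calibrated camera extrinsic `T_base_cam`; neither comes from the pose network itself.

```python
# Sketch: turning an estimated object pose into a gripper target. All matrices are
# 4x4 homogeneous transforms. T_obj_grasp is a grasp offset you define per object
# (an assumption here), not something the pose network outputs.
import numpy as np

def grasp_pose_in_base(T_base_cam, T_cam_obj, T_obj_grasp):
    """Chain transforms: base <- camera <- object <- grasp point."""
    return T_base_cam @ T_cam_obj @ T_obj_grasp

# T_cam_obj comes from the pose network (e.g., a PoseCNN-style estimate),
# T_base_cam from camera extrinsic calibration.
```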
Category-level grasping
More powerful: train a network to grasp all mugs, not just one specific mug. Category-level methods generalize to novel instances within a category. The network learns the concept of "handle of a mug" rather than "this particular red mug." This is the direction robotics is heading.
2. Imitation Learning — Learning from Humans
Instead of writing control code, a human demonstrates the task by physically guiding the robot arm. A neural network learns a policy — a mapping from sensor observations to actions — that reproduces the demonstrated behavior.
Behavioral cloning
The simplest form: treat imitation as supervised learning. Inputs: camera images + proprioception (joint angles, end-effector position). Outputs: the actions the human took. Train a neural network to predict actions given observations. Then deploy: the robot takes camera images and computes what action to take next, just like the human did. Modern methods such as ACT (Action Chunking with Transformers) and Diffusion Policy build on this basic recipe with more expressive architectures.
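A minimal behavioral-cloning policy in PyTorch might look like the sketch below. The backbone, feature size, and action dimension are placeholder assumptions; real systems add observation history, action chunking, and careful normalization.

```python
# Minimal behavioral-cloning sketch in PyTorch. Dataset, encoder, and action
# dimension are placeholders, not a specific published system.
import torch
import torch.nn as nn

class BCPolicy(nn.Module):
    def __init__(self, vision_backbone, proprio_dim=7, action_dim=7):
        super().__init__()
        self.backbone = vision_backbone            # e.g. a pretrained ResNet trunk (512-dim output assumed)
        self.head = nn.Sequential(
            nn.Linear(512 + proprio_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim),            # predicted action (e.g. joint deltas)
        )

    def forward(self, image, proprio):
        feat = self.backbone(image)                # (B, 512) visual features
        return self.head(torch.cat([feat, proprio], dim=-1))

# Training is plain supervised regression onto the demonstrated actions:
# for image, proprio, action in demo_loader:
#     loss = nn.functional.mse_loss(policy(image, proprio), action)
#     loss.backward(); optimizer.step(); optimizer.zero_grad()
```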
Diffusion Policy
A recent breakthrough from Columbia University. Instead of directly predicting the next action, a diffusion model learns the distribution of actions across all demonstrations. At inference, it "denoises" a random action to a high-quality action for the current observation. Significantly more robust than simple behavioral cloning, especially for multi-modal tasks (tasks where there are multiple valid ways to do something).
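The sketch below is a deliberately simplified, conceptual denoising loop, not the actual Diffusion Policy implementation; the `denoiser` network, noise schedule, and update rule are all placeholder assumptions meant only to convey the idea of refining noise into an action.

```python
# Conceptual sketch of diffusion-style action sampling (illustrative only).
# `denoiser` is a trained network that predicts the noise to remove at each step,
# conditioned on the current observation; the schedule here is heavily simplified.
import torch

def sample_action(denoiser, obs_features, action_dim=7, num_steps=50):
    action = torch.randn(1, action_dim)            # start from pure noise
    for t in reversed(range(num_steps)):
        timestep = torch.tensor([t])
        predicted_noise = denoiser(action, timestep, obs_features)
        action = action - predicted_noise / num_steps   # crude denoising update (illustrative)
    return action                                  # refined action for this observation
```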
Data collection hardware
Getting demonstration data is hard. Teleoperation (a human controls the robot via joystick) is common but slow. More recently, bilateral teleoperation systems like ALOHA, the setup used to collect ACT's demonstrations, let a human move a lightweight "leader" arm while the robot "follower" arm mirrors the motion, yielding 50+ demonstrations per hour. Companies like Physical Intelligence (π) are building large-scale data collection infrastructure.
3. Visual Servoing — Closing the Loop on Vision
Visual servoing uses camera feedback directly in the control loop — the robot continuously adjusts its motion to keep a visual feature at a desired location in the image.
Image-based visual servoing (IBVS)
Define a set of visual features (e.g., the pixel coordinates of the object's center) and a desired feature value (e.g., the center at pixel [320, 240], the middle of a 640×480 image). Compute the error in feature space and use the image Jacobian (interaction matrix) to convert it to an end-effector velocity. The robot moves until the features reach their desired values. No explicit 3D pose estimation is required; the controller runs on pure image feedback.
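In code, the classical IBVS control law is only a few lines. The sketch below assumes the image (interaction) Jacobian has already been built, which in practice requires a depth estimate for each point feature.

```python
# Sketch of the IBVS control law: v = -gain * pinv(L) @ e, where L is the image
# (interaction) Jacobian and e is the feature error. Building L is assumed done elsewhere.
import numpy as np

def ibvs_velocity(features, desired_features, image_jacobian, gain=0.5):
    """Map pixel-space feature error to a 6-DoF end-effector velocity command."""
    error = features - desired_features            # e.g. [u - 320, v - 240] per feature
    return -gain * np.linalg.pinv(image_jacobian) @ error

# The robot applies this velocity, grabs a new image, recomputes the features,
# and repeats until the error is (near) zero.
```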
Deep visual servoing
Replace hand-crafted features with CNN-extracted features. Train an end-to-end network that takes the current image and goal image as input and outputs the velocity command to reduce the visual difference. More robust to appearance changes than classical IBVS. Used in insertion tasks (plugging a cable into a socket) where sub-millimeter precision is needed.
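A goal-conditioned servoing network can be sketched as a shared image encoder plus a small regression head. The architecture below is an illustrative assumption, not a specific published model.

```python
# Hedged sketch of a goal-conditioned deep visual servoing network: a shared CNN encodes
# the current and goal images, and a small head regresses the velocity command.
import torch
import torch.nn as nn

class DeepServo(nn.Module):
    def __init__(self, encoder, feat_dim=512):
        super().__init__()
        self.encoder = encoder                     # shared CNN for current and goal images
        self.head = nn.Sequential(
            nn.Linear(2 * feat_dim, 256), nn.ReLU(),
            nn.Linear(256, 6),                     # 6-DoF end-effector velocity
        )

    def forward(self, current_img, goal_img):
        f_cur, f_goal = self.encoder(current_img), self.encoder(goal_img)
        return self.head(torch.cat([f_cur, f_goal], dim=-1))
```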
Frequently Asked Questions
How much data does a robot learning system need?
It varies enormously. A behavioral cloning policy for a single task (pick and place one object) might need 50–200 demonstrations. A foundation model like RT-2 was trained on internet-scale vision-language data plus hundreds of thousands of robot demonstrations. The field is actively working on data efficiency — how to learn more from fewer demonstrations.
What is end-to-end learning for robots?
End-to-end learning trains a single neural network that takes raw sensor input (camera images) and outputs raw motor commands — no hand-engineered intermediate representations. In theory, the network can discover the optimal internal representation for the task. In practice, end-to-end systems are harder to debug but often more robust than modular pipelines with many hand-engineered components.
What's the difference between imitation learning and reinforcement learning?
Imitation learning (IL) learns by copying demonstrations — a human shows what to do. Reinforcement learning (RL) learns by trial and error with reward signals — no human demonstrations needed, just a success criterion. IL is more data-efficient if demonstrations are available; RL can exceed human performance by discovering novel strategies. Many modern systems combine both: RL fine-tunes an IL-initialized policy.
Which PyTorch architecture is best for robotics?
It depends on the task. ResNet/ConvNeXt work well for pure visual feature extraction. Transformer-based architectures like ACT are suited to predicting action sequences from multi-step observations, and Diffusion Policy (with a UNet or transformer backbone) models full action distributions. Diffusion Policy is currently among the strongest approaches for dexterous manipulation. For real-time applications on a Jetson, quantized MobileNet or EfficientDet variants are preferred for their speed.