Computer Vision for Robots

Vision is the richest sensing modality a robot has: a single camera frame can carry more information than all of its other sensors combined. But raw pixels are just numbers. Computer vision is the science of turning those numbers into meaning: "there's a door at 2 meters, a person at 5 meters, and the floor is here." This article walks you through how robots extract that meaning from images.

1. What a Robot Sees — Image Fundamentals

A digital image is a 3D array of numbers: height × width × 3 (for RGB). Each pixel has a Red, Green, and Blue value between 0 and 255. That's all. Everything a robot "understands" about visual input is derived from this grid of numbers.
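To make that concrete, here is a minimal sketch (assuming OpenCV is installed and "frame.jpg" is a placeholder image on disk) that loads a frame and inspects the raw numbers:

    import cv2

    # OpenCV returns a NumPy array in BGR channel order
    frame = cv2.imread("frame.jpg")    # placeholder filename
    print(frame.shape)                 # e.g. (480, 640, 3): height x width x channels
    print(frame.dtype)                 # uint8: every value is 0-255
    b, g, r = frame[100, 200]          # the pixel at row 100, column 200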

Color spaces

RGB is intuitive but awkward for many vision tasks, because all three channels change together with lighting. For color-based detection, HSV (Hue, Saturation, Value) works far better: the hue channel captures color largely independently of illumination. Convert with cv2.cvtColor(frame, cv2.COLOR_BGR2HSV) (OpenCV stores images in BGR channel order, hence the BGR prefix). For edge detection and intensity work, use grayscale. For many deep learning models, pixel values are normalized to [0, 1] or [-1, 1].
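As an illustration, here is a minimal HSV color-thresholding sketch; the hue/saturation/value bounds are illustrative guesses for a reddish object, not calibrated values:

    import cv2
    import numpy as np

    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    lower = np.array([0, 120, 70])      # illustrative bounds; tune for your object and lighting
    upper = np.array([10, 255, 255])
    mask = cv2.inRange(hsv, lower, upper)               # 255 where the pixel is in range, 0 elsewhere
    detected = cv2.bitwise_and(frame, frame, mask=mask)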

Camera calibration

Every camera has distortion — lenses bend light, making straight lines appear curved at the edges. Camera calibration computes the intrinsic parameters (focal length, principal point) and distortion coefficients that let you "undistort" images to an ideal pinhole camera model. This is essential for any quantitative measurements from a camera — like computing how far away an object is based on its apparent size.
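A sketch of the undistortion step, assuming calibration (for example with OpenCV's chessboard routine) has already produced a camera matrix and distortion coefficients; the numbers below are placeholders:

    import cv2
    import numpy as np

    # Placeholder intrinsics and distortion coefficients from a prior calibration
    K = np.array([[615.0,   0.0, 320.0],
                  [  0.0, 615.0, 240.0],
                  [  0.0,   0.0,   1.0]])
    dist = np.array([0.10, -0.25, 0.001, 0.0005, 0.10])   # k1, k2, p1, p2, k3

    undistorted = cv2.undistort(frame, K, dist)           # straight lines stay straight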

2. Classical Computer Vision Techniques

Before deep learning, these algorithms solved real robotics problems. They're still used today, often in combination with neural networks.

Edge detection (Canny)

Edges are where pixel intensity changes sharply; they correspond to object boundaries, changes in surface orientation, and depth discontinuities. The Canny edge detector applies a Gaussian blur (to reduce noise), computes the gradient magnitude and direction at each pixel, thins the result with non-maximum suppression, and keeps only edges that pass a pair of hysteresis thresholds. The result is a thin line at every strong edge in the image. One line: edges = cv2.Canny(gray, 50, 150), where 50 and 150 are the low and high thresholds.
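A minimal end-to-end sketch (the 50/150 thresholds are common starting values, not universal constants):

    import cv2

    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    blurred = cv2.GaussianBlur(gray, (5, 5), 0)   # suppress noise before taking gradients
    edges = cv2.Canny(blurred, 50, 150)           # low and high hysteresis thresholds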

Contour detection and blob analysis

Find the boundaries of objects in a binary mask: contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE). From each contour, compute area, centroid, bounding box, and orientation. Used for tracking colored objects, measuring shapes, and guiding robot grasps based on simple geometry.
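For example, a sketch that takes a binary mask (such as the one from the HSV sketch above), keeps the largest contour, and computes its bounding box and centroid:

    import cv2

    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if contours:
        largest = max(contours, key=cv2.contourArea)
        x, y, w, h = cv2.boundingRect(largest)    # axis-aligned bounding box
        m = cv2.moments(largest)
        if m["m00"] > 0:
            cx = m["m10"] / m["m00"]              # centroid, in pixels
            cy = m["m01"] / m["m00"]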

Optical flow

Tracks how pixels move between consecutive frames. Sparse optical flow (Lucas-Kanade) tracks specific corner points. Dense optical flow tracks every pixel. Used for ego-motion estimation (calculating how the camera itself moved), obstacle avoidance in drones (detect things rushing toward you), and action recognition.
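A minimal sparse (Lucas-Kanade) sketch, assuming prev_gray and gray are two consecutive grayscale frames:

    import cv2

    # Pick trackable corners in the previous frame, then follow them into the current one
    prev_pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=200, qualityLevel=0.01, minDistance=7)
    next_pts, status, err = cv2.calcOpticalFlowPyrLK(prev_gray, gray, prev_pts, None)
    ok = status.flatten() == 1
    flow = next_pts[ok] - prev_pts[ok]            # per-point motion vectors, in pixels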

Feature matching (ORB, SIFT)

Detect distinctive "keypoints" in an image (corners, blobs) and describe them with a compact vector. Match these descriptors between two images to find corresponding points — used in SLAM, visual odometry, and augmented reality. ORB is fast enough for real-time use on a Raspberry Pi; SIFT is more accurate but slower.
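A minimal ORB matching sketch, assuming img1 and img2 are two grayscale images of roughly the same scene:

    import cv2

    orb = cv2.ORB_create(nfeatures=1000)
    kp1, des1 = orb.detectAndCompute(img1, None)
    kp2, des2 = orb.detectAndCompute(img2, None)

    # ORB descriptors are binary strings, so match them with Hamming distance
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)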

3. Depth Sensing — The Third Dimension

A 2D image tells you where things are in angle, but not how far away they are. Depth sensing adds that missing Z dimension.

Structured light (Intel RealSense)

Projects a known infrared pattern onto the scene; one or more cameras observe how the pattern lands on surfaces, and the deformation encodes depth. Works well indoors, up to roughly 10 meters. The Intel RealSense D435 is one of the most popular depth cameras in robotics research (strictly speaking it uses active infrared stereo, a close cousin of structured light): it delivers a 640×480 depth map at up to 90 fps and has excellent ROS 2 driver support.
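Reading depth in Python looks roughly like the sketch below, using the pyrealsense2 wrapper; the stream settings are illustrative:

    import pyrealsense2 as rs

    pipeline = rs.pipeline()
    config = rs.config()
    config.enable_stream(rs.stream.depth, 640, 480, rs.format.z16, 30)
    pipeline.start(config)

    frames = pipeline.wait_for_frames()
    depth_frame = frames.get_depth_frame()
    print(depth_frame.get_distance(320, 240))   # depth in meters at the image center
    pipeline.stop()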

Point clouds

By combining depth with the camera's intrinsic parameters, you can "unproject" each pixel into a 3D point: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy. The result is a point cloud: a set of 3D coordinates, one per pixel. The ROS message type is sensor_msgs/PointCloud2, and the Point Cloud Library (PCL) provides algorithms for filtering, segmenting, and extracting geometry from point clouds.
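A vectorized NumPy sketch of that unprojection, assuming depth is an h×w array of depths in meters and the intrinsics below are placeholders:

    import numpy as np

    fx, fy, cx, cy = 615.0, 615.0, 320.0, 240.0       # placeholder intrinsics
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))    # pixel coordinates for every cell

    Z = depth
    X = (u - cx) * Z / fx
    Y = (v - cy) * Z / fy
    points = np.stack([X, Y, Z], axis=-1).reshape(-1, 3)   # one 3D point per pixel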

Monocular depth estimation

Deep learning models (MiDaS, Depth Anything) can estimate depth from a single RGB image, with no depth sensor needed. The output is typically relative depth (correct up to an unknown scale) and less precise than a dedicated depth sensor, but it works with any standard camera. Useful for robots where weight, cost, or size constraints rule out adding a dedicated depth camera.
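A hedged sketch of running MiDaS through PyTorch Hub; the repository, model, and transform names follow the intel-isl/MiDaS project, but treat them as assumptions and check the repo for the current names:

    import cv2
    import torch

    midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small")
    midas.eval()
    midas_transforms = torch.hub.load("intel-isl/MiDaS", "transforms")

    img = cv2.cvtColor(cv2.imread("frame.jpg"), cv2.COLOR_BGR2RGB)   # placeholder filename
    batch = midas_transforms.small_transform(img)
    with torch.no_grad():
        prediction = midas(batch)    # relative (not metric) inverse depth, one value per pixel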

Frequently Asked Questions

What resolution camera do robots typically use?

It depends on the task. A mobile robot navigating indoors might use 640×480 at 30fps — enough for SLAM and obstacle avoidance. A robotic arm doing precise manipulation might use a high-resolution 4K camera with a narrow field of view. In general, use the lowest resolution that gives you the information you need — higher resolution means more computation.

OpenCV vs. deep learning — when to use each?

OpenCV (classical vision) is fast, interpretable, and doesn't require training data. Use it for color-based detection, edge/line finding, and blob tracking. Deep learning is more powerful and generalizes better to complex scenes — use it for object detection, semantic segmentation, and pose estimation. In practice, most real systems use both.

What is the "pinhole camera model"?

The mathematical model that maps 3D world points to 2D image pixels. A 3D point [X, Y, Z] in the camera frame projects to image pixel [u, v] = [fx·X/Z + cx, fy·Y/Z + cy], where fx, fy are the focal lengths and (cx, cy) is the principal point (roughly the image center). Camera calibration finds these numbers, along with the distortion coefficients, for your specific camera and lens.
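As a quick worked example with placeholder intrinsics, a point 0.2 m to the right of and 1 m in front of the camera lands here:

    # Placeholder intrinsics (pixels) and a hypothetical 3D point in meters
    fx, fy, cx, cy = 600.0, 600.0, 320.0, 240.0
    X, Y, Z = 0.2, 0.0, 1.0
    u = fx * X / Z + cx    # 440.0
    v = fy * Y / Z + cy    # 240.0 (on the horizontal center line)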

Can a robot navigate using only cameras (no LiDAR)?

Yes, this is called visual-only navigation or visual SLAM. Systems like ORB-SLAM3 and RTAB-Map work with a stereo or RGB-D camera (ORB-SLAM3 also supports a single monocular camera). Tesla's Autopilot is famously camera-only. The trade-off: camera-based systems are more sensitive to lighting changes and featureless environments than LiDAR-based approaches.
