Object Detection & Tracking for Robots

A robot that can see but can't identify what it sees isn't very useful. Object detection is the ability to find specific things in a camera frame — "there's a person at [320, 240] that is 0.8m wide and 1.6m tall." Object tracking follows that person across subsequent frames. Together, they're the perceptual foundation of autonomous systems.

What is object detection?

Object detection is a computer vision task that answers two questions simultaneously: What is in the image (classification) and Where is it (localization as a bounding box). The output is a list of detections, each with a class label, a confidence score, and a bounding box [x, y, width, height].
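The detection output described above can be modeled as a small data structure. A minimal sketch (the field names and example values are illustrative, not from any particular library):

```python
from dataclasses import dataclass

# One detection: what it is, how sure we are, and where it is.
@dataclass
class Detection:
    label: str         # class name, e.g. "person"
    confidence: float  # score in 0.0–1.0
    x: float           # bounding box top-left corner, pixels
    y: float
    width: float       # bounding box size, pixels
    height: float

# A detector would return a list of these per frame.
d = Detection("person", 0.8, 320, 240, 80, 160)
```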

Detection vs. classification vs. segmentation

Classification: "This image contains a cat." One label for the whole image.
Detection: "There's a cat at [100, 150, 80, 80] and a dog at [300, 200, 90, 100]." Multiple objects, each with location.
Instance segmentation: Same as detection, but gives you the exact pixel mask of each object instead of just a bounding box. Useful for robot grasping where exact shape matters.

1. YOLO — Real-Time Detection

YOLO (You Only Look Once) is the dominant object detection model for real-time robotics applications. It processes the entire image in a single neural network pass, making it fast enough for 30–60fps inference on an NVIDIA Jetson.

How YOLO works

YOLO divides the image into a grid. For each grid cell, it simultaneously predicts bounding box coordinates, confidence scores, and class probabilities. Because it's one forward pass (not multiple proposal stages like older detectors), it's extremely fast. Recent versions such as YOLO11 achieve state-of-the-art accuracy at real-time speeds.

Using YOLO in a robot (Ultralytics)

The Ultralytics library makes YOLO trivially easy: from ultralytics import YOLO; model = YOLO('yolov8n.pt'); results = model(frame). The results object contains all detections with bounding boxes, classes, and confidence scores. Integrate with ROS 2 by publishing detections as custom messages or using one of the community packages that bridge Ultralytics YOLO into ROS topics.
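The one-liner above can be expanded into a working detection loop. A minimal sketch, assuming ultralytics and opencv-python are installed; the camera index and weights file name are illustrative:

```python
from ultralytics import YOLO
import cv2

model = YOLO("yolov8n.pt")  # downloads pretrained weights on first use

cap = cv2.VideoCapture(0)   # default camera; replace with your robot's stream
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    results = model(frame, verbose=False)
    for box in results[0].boxes:
        x1, y1, x2, y2 = (int(v) for v in box.xyxy[0].tolist())
        label = model.names[int(box.cls)]
        conf = float(box.conf)
        # Draw the box and "label confidence" text on the frame
        cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
        cv2.putText(frame, f"{label} {conf:.2f}", (x1, y1 - 5),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)
    cv2.imshow("detections", frame)
    if cv2.waitKey(1) == ord("q"):
        break
cap.release()
cv2.destroyAllWindows()
```

In a ROS 2 node, the imshow/waitKey calls would be replaced by publishing the detections on a topic.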

YOLO model sizes

YOLO comes in multiple sizes: nano (n), small (s), medium (m), large (l), extra-large (x). Nano is fastest but least accurate; x is most accurate but slowest. As rough guides, YOLOv8n can reach around 30fps on a Jetson Nano (with TensorRT optimization), and YOLOv8l around 60fps on a Jetson AGX Orin. Choose based on your hardware and accuracy requirements.

2. Other Detection Architectures

SSD (Single Shot Detector)

Similar philosophy to YOLO — single-pass detection. SSD uses multiple feature map scales from a backbone CNN to detect objects of different sizes. Historically popular for edge devices; now largely superseded by modern YOLO variants in terms of speed/accuracy trade-off.

Faster R-CNN

A two-stage detector: first a Region Proposal Network (RPN) generates candidate object locations, then a second network classifies and refines each proposal. Slower than YOLO but historically more accurate on small objects. Used in research and applications where accuracy matters more than real-time speed.

Transformer-based detection (DETR)

Detection Transformer (DETR) applies attention mechanisms directly to object detection, treating it as a set prediction problem. Eliminates anchor boxes and NMS (non-maximum suppression) — the architectural complexity of YOLO. Newer variants like RT-DETR achieve real-time speeds with excellent accuracy.

3. Object Tracking — Following Across Frames

Detection answers "what and where?" for a single frame. Tracking maintains consistent identities across frames: "Person #1 is still Person #1 in the next frame, even though they moved." This is essential for robots that need to follow, avoid, or interact with moving objects.

DeepSORT — The standard tracker

DeepSORT combines two signals: a Kalman filter (predicts where each tracked object will be in the next frame based on velocity) and a deep appearance model (re-identifies the same object even after it was briefly occluded). Many YOLO toolchains integrate DeepSORT or similar trackers through companion tracking libraries or dedicated tracking classes.
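The Kalman filter's prediction step is simple at its core. A minimal constant-velocity sketch (a real tracker like DeepSORT also maintains a covariance matrix and runs an update step against each matched detection):

```python
import numpy as np

def predict(state, dt=1.0):
    """Constant-velocity prediction: state is [x, y, vx, vy]."""
    F = np.array([[1, 0, dt, 0],   # x  += vx * dt
                  [0, 1, 0, dt],   # y  += vy * dt
                  [0, 0, 1,  0],   # vx unchanged
                  [0, 0, 0,  1]], dtype=float)
    return F @ state

# Object at (100, 200) px moving 5 px/frame right, 3 px/frame up:
state = np.array([100.0, 200.0, 5.0, -3.0])
state = predict(state)  # position moves to (105.0, 197.0)
```

This predicted position is what the tracker matches against the next frame's detections before the appearance model breaks any ties.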

ByteTrack

A newer, faster tracker that outperforms DeepSORT in crowded scenes. Instead of discarding low-confidence detections (as DeepSORT does), ByteTrack also associates them with existing tracks, which helps maintain identities during brief occlusions. It is one of the built-in trackers in Ultralytics YOLO (which defaults to the closely related BoT-SORT).
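Tracking with persistent IDs is a small change to the detection loop. A sketch using the Ultralytics track API (camera index and weights file are illustrative; persist=True carries track state between frames):

```python
from ultralytics import YOLO
import cv2

model = YOLO("yolov8n.pt")
cap = cv2.VideoCapture(0)
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    # tracker="bytetrack.yaml" selects ByteTrack explicitly
    results = model.track(frame, persist=True, tracker="bytetrack.yaml",
                          verbose=False)
    boxes = results[0].boxes
    if boxes.id is not None:  # id is None when nothing is being tracked
        for xyxy, tid in zip(boxes.xyxy, boxes.id.int().tolist()):
            x1, y1, x2, y2 = (int(v) for v in xyxy.tolist())
            cv2.rectangle(frame, (x1, y1), (x2, y2), (255, 0, 0), 2)
            cv2.putText(frame, f"ID {tid}", (x1, y1 - 5),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.5, (255, 0, 0), 1)
    cv2.imshow("tracking", frame)
    if cv2.waitKey(1) == ord("q"):
        break
cap.release()
```

The integer ID stays stable across frames, which is exactly what a follow-the-person or approach-the-bin behavior keys on.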

When tracking matters for robots

A warehouse robot needs to track which bin it's targeting as it approaches (avoids latching onto a different bin). A security robot needs to follow a specific person. A collaborative robot needs to keep track of the human worker's hand position. In all these cases, consistent IDs across frames are critical.

Frequently Asked Questions

How do I train YOLO on custom objects for my robot?

Collect images of your object (500–2000 images), label them with a tool like LabelImg or Roboflow (draw bounding boxes and assign class names), export in YOLO format, and fine-tune a pretrained YOLOv8 model: model.train(data='custom.yaml', epochs=50, imgsz=640). Training takes 30–60 minutes on a GPU and a few hours on CPU.
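The fine-tuning step looks like this in code. A sketch, assuming ultralytics is installed; the dataset paths and class names are illustrative:

```python
from ultralytics import YOLO

# custom.yaml is the dataset config Ultralytics expects, roughly:
#   path: datasets/my_robot
#   train: images/train
#   val: images/val
#   names:
#     0: bin
#     1: pallet

model = YOLO("yolov8n.pt")  # start from pretrained COCO weights
model.train(data="custom.yaml", epochs=50, imgsz=640)
metrics = model.val()       # mAP on the validation split
model("test_image.jpg")     # sanity-check inference on one image
```

Starting from pretrained weights (rather than training from scratch) is what makes 500–2000 images enough.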

What is NMS (Non-Maximum Suppression)?

Object detectors often produce multiple overlapping bounding boxes for the same object. NMS keeps only the highest-confidence box and removes lower-confidence boxes that overlap significantly (above an IoU threshold). It's a post-processing step applied to the raw detection output. YOLO handles this automatically.
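The algorithm fits in a few lines. A minimal sketch of greedy NMS with axis-aligned boxes in (x1, y1, x2, y2) form:

```python
def iou(a, b):
    """Intersection-over-union of two boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Keep the highest-scoring box, drop overlapping lower-scoring ones."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Discard every remaining box that overlaps the winner too much
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)  # the second box overlaps the first and is dropped
```

Production detectors use vectorized versions of the same idea, applied per class.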

How do I get 3D position from a 2D detection?

If you have a depth camera, look up the depth at the center of the bounding box and use the camera's intrinsic matrix to unproject to 3D: [X, Y, Z] = depth * K_inv * [u, v, 1]. Alternatively, use the known physical size of the object and its apparent size in the image to estimate distance (this also requires camera calibration).
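The unprojection formula above, as code. A sketch with illustrative intrinsics (fx = fy = 600, principal point at the image center of a 640x480 sensor); use your camera's calibrated values in practice:

```python
import numpy as np

def unproject(u, v, depth, K):
    """Back-project pixel (u, v) with depth in meters into 3D camera coords."""
    K_inv = np.linalg.inv(K)
    ray = K_inv @ np.array([u, v, 1.0])  # direction through the pixel
    return depth * ray                   # scale the ray to the measured depth

K = np.array([[600.0,   0.0, 320.0],
              [  0.0, 600.0, 240.0],
              [  0.0,   0.0,   1.0]])

# A detection centered on the principal point, 2 m away, sits on the
# optical axis: X = Y = 0, Z = 2.
point = unproject(320, 240, 2.0, K)
```

The result is in the camera frame; a robot would then transform it into the base or world frame via TF.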

Can YOLO run on a Raspberry Pi?

Yes, but slowly. YOLOv8n (nano) runs at about 5–8fps on a Raspberry Pi 4 without acceleration. With a Coral TPU USB accelerator running a TFLite-exported YOLO model, you can reach 20–25fps. For smooth 30fps+ detection, use a Jetson Nano or Jetson Orin Nano.
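Exporting for the Pi or a Coral TPU is one call in Ultralytics. A sketch, assuming the export extras are installed; int8 quantization is what the Edge TPU toolchain expects, and the calibration dataset named here is illustrative:

```python
from ultralytics import YOLO

model = YOLO("yolov8n.pt")
# Produce a quantized TFLite model; `data` supplies calibration images.
model.export(format="tflite", int8=True, data="coco8.yaml")
```

The exported .tflite file is then compiled with the Edge TPU compiler before deployment on the Coral accelerator.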
