Tracking is the process of locating a moving object or multiple objects over time in a video stream. Tracking an object is not the same as object detection. Object detection is the process of locating an object of interest in a single frame. Tracking associates detections of an object across multiple frames.
Tracking multiple objects requires detection, prediction, and data association.
Detection: Detect objects of interest in a video frame.
Prediction: Predict the object locations in the next frame.
Data association: Use the predicted locations to associate detections across frames to form tracks.
Selecting the right approach for detecting objects of interest depends on what you want to track and whether the camera is stationary.
To detect objects in motion with a stationary camera, you can perform background subtraction using the vision.ForegroundDetector System object. The background subtraction approach works efficiently but requires the camera to be stationary.
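The idea behind background subtraction can be illustrated with a minimal sketch. It is written in plain Python and is not the vision.ForegroundDetector implementation: that object uses Gaussian mixture models, whereas this sketch swaps in a much simpler running-average background. The function names, the learning rate, and the threshold are all assumptions chosen for illustration.

```python
def update_background(background, frame, learning_rate=0.05):
    """Blend the new frame into a running-average background model."""
    return [
        [(1 - learning_rate) * b + learning_rate * f for b, f in zip(b_row, f_row)]
        for b_row, f_row in zip(background, frame)
    ]


def foreground_mask(background, frame, threshold=30):
    """Mark pixels that differ from the background by more than the threshold."""
    return [
        [abs(f - b) > threshold for b, f in zip(b_row, f_row)]
        for b_row, f_row in zip(background, frame)
    ]


# Background model of a static 3-by-4 scene with uniform intensity 10.
background = [[10.0] * 4 for _ in range(3)]

# New frame: a bright object covers two pixels in the middle row.
frame = [
    [10, 10, 10, 10],
    [10, 200, 200, 10],
    [10, 10, 10, 10],
]

mask = foreground_mask(background, frame)          # True where something moved
background = update_background(background, frame)  # slowly absorb the new frame
```

Because the model is built pixel by pixel from past frames, any camera motion shifts every pixel and the whole image is flagged as foreground, which is why this approach needs a stationary camera.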
To detect objects in motion with a moving camera, you can use a sliding-window detection approach. This approach typically works more slowly than the background subtraction approach. To detect and track a specific category of object, use the System objects or functions described in the table.
Type of Object to Track | Camera | Functionality |
---|---|---|
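Anything that moves | Stationary | vision.ForegroundDetector System object |
Faces, eyes, nose, mouth, upper body | Stationary, Moving | vision.CascadeObjectDetector System object |
Pedestrians | Stationary, Moving | vision.PeopleDetector System object |
Custom object category | Stationary, Moving | trainCascadeObjectDetector function, or a custom sliding-window detector built with extractHOGFeatures and selectStrongestBbox |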
To track an object over time means that you must predict its location in the next frame. The simplest method of prediction is to assume that the object will be near its last known location. In other words, the previous detection serves as the next prediction. This method is especially effective for high frame rates. However, using this prediction method can fail when objects move at varying speeds, or when the frame rate is low relative to the speed of the object in motion.
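A short worked example shows why frame rate matters for this simple method. It assumes an object moving at constant speed, so the error of using the previous detection as the prediction equals the distance traveled in one frame. The function name and the numbers are hypothetical.

```python
def last_location_error(speed_px_per_s, frame_rate_hz):
    """Pixels the object moves in one frame, which is the error made when
    the previous detection is reused as the prediction."""
    return speed_px_per_s / frame_rate_hz


speed = 300.0  # object speed in pixels per second (illustrative value)

error_at_60_fps = last_location_error(speed, 60.0)  # 5 pixels per frame
error_at_5_fps = last_location_error(speed, 5.0)    # 60 pixels per frame
```

For an object about 40 pixels wide, a 5-pixel error still leaves the prediction on top of the object, while a 60-pixel error places it entirely outside the object's previous bounding box.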
A more sophisticated method of prediction is to use the previously observed motion of the object. The Kalman filter (vision.KalmanFilter) predicts the next location of an object, assuming that it moves according to a motion model, such as constant velocity or constant acceleration. The Kalman filter also takes into account process noise and measurement noise. Process noise is the deviation of the actual motion of the object from the motion model. Measurement noise is the detection error.
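The predict and correct cycle can be sketched for the simplest case: one spatial dimension with a constant-velocity motion model. This is a plain-Python illustration under stated assumptions, not the vision.KalmanFilter API. The state is [position, velocity], only the position is measured, and the noise values and function names are chosen for illustration.

```python
def kalman_predict(state, cov, dt=1.0, process_noise=0.01):
    """Advance [position, velocity] by one step of constant-velocity motion.

    Computes x' = F x and P' = F P F^T + Q with F = [[1, dt], [0, 1]]
    and Q = process_noise * identity (an assumed, simplified noise model).
    """
    pos, vel = state
    (p00, p01), (p10, p11) = cov
    new_state = [pos + dt * vel, vel]
    new_cov = [
        [p00 + dt * (p01 + p10) + dt * dt * p11 + process_noise, p01 + dt * p11],
        [p10 + dt * p11, p11 + process_noise],
    ]
    return new_state, new_cov


def kalman_correct(state, cov, measured_pos, measurement_noise=1.0):
    """Blend a position measurement into the state (measurement matrix H = [1, 0])."""
    pos, vel = state
    innovation_var = cov[0][0] + measurement_noise
    gain_pos = cov[0][0] / innovation_var
    gain_vel = cov[1][0] / innovation_var
    residual = measured_pos - pos
    new_state = [pos + gain_pos * residual, vel + gain_vel * residual]
    new_cov = [
        [(1 - gain_pos) * cov[0][0], (1 - gain_pos) * cov[0][1]],
        [cov[1][0] - gain_vel * cov[0][0], cov[1][1] - gain_vel * cov[0][1]],
    ]
    return new_state, new_cov


# Detections of an object that moves 2 units per frame.
measurements = [0.0, 2.0, 4.0, 6.0, 8.0]

state = [measurements[0], 0.0]       # start at the first detection, velocity unknown
cov = [[100.0, 0.0], [0.0, 100.0]]   # large initial uncertainty

for z in measurements[1:]:
    state, cov = kalman_predict(state, cov)
    state, cov = kalman_correct(state, cov, z)

# Prediction for the next frame, before any detection is available.
predicted_state, _ = kalman_predict(state, cov)
predicted_position = predicted_state[0]
```

After a few frames the estimated velocity approaches 2 units per frame, so the predicted position for the next frame lands close to 10 rather than at the last detection, 8.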
To make configuring a Kalman filter easier, use configureKalmanFilter. This function sets up the filter for tracking a physical object moving with constant velocity or constant acceleration within a Cartesian coordinate system. It assumes that the statistics are the same along all dimensions. If you need to configure a Kalman filter with different assumptions, construct the vision.KalmanFilter object directly.
Data association is the process of associating detections corresponding to the same physical object across frames. The temporal history of a particular object consists of multiple detections, and is called a track. A track representation can include the entire history of the previous locations of the object. Alternatively, it can consist only of the object's last known location and its current velocity.
To match a detection to a track, you must establish criteria for evaluating the matches. Typically, you establish these criteria by defining a cost function. The higher the cost of matching a detection to a track, the less likely that the detection belongs to the track. A simple cost function can be based on the degree of overlap between the bounding boxes of the predicted and detected objects, with greater overlap giving a lower cost. The Tracking Pedestrians from a Moving Car example implements this cost function using the bboxOverlapRatio function. You can implement a more sophisticated cost function, one that accounts for the uncertainty of the prediction, using the distance function of the vision.KalmanFilter object. You can also implement a custom cost function that can incorporate information about the object size and appearance.
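An overlap-based cost can be sketched as follows. This plain-Python version is an illustration, not the bboxOverlapRatio function: it assumes boxes in [x, y, width, height] form, computes intersection over union, and takes the cost as one minus that ratio so that greater overlap gives a lower cost. The function names are hypothetical.

```python
def overlap_ratio(box_a, box_b):
    """Intersection over union of two [x, y, width, height] boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    intersection = inter_w * inter_h
    union = aw * ah + bw * bh - intersection
    return intersection / union if union > 0 else 0.0


def overlap_cost_matrix(predicted_boxes, detected_boxes):
    """Rows correspond to tracks (predictions), columns to detections."""
    return [
        [1.0 - overlap_ratio(p, d) for d in detected_boxes]
        for p in predicted_boxes
    ]


predicted = [[0, 0, 10, 10], [50, 50, 10, 10]]
detected = [[52, 50, 10, 10], [0, 0, 10, 10], [200, 200, 5, 5]]

cost = overlap_cost_matrix(predicted, detected)
# cost[0][1] is 0 (identical boxes); cost[1][2] is 1 (no overlap at all).
```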
Gating is a method of eliminating highly unlikely matches from consideration, such as by imposing a threshold on the cost function. An observation cannot be matched to a track if the cost exceeds a certain threshold value. Using this threshold method effectively results in a circular gating region around each prediction, where a matching detection can be found. An alternative gating technique is to make the gating region large enough to include the k-nearest neighbors of the prediction.
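Both gating techniques can be sketched in a few lines. The example assumes that the cost is the Euclidean distance between a predicted centroid and each detected centroid, which is what makes the threshold gate circular. The function names and values are hypothetical, and `math.dist` requires Python 3.8 or later.

```python
import math


def gate_by_threshold(prediction, detections, max_distance):
    """Indices of detections inside a circular gate around the prediction."""
    return [
        i for i, det in enumerate(detections)
        if math.dist(prediction, det) <= max_distance
    ]


def gate_by_k_nearest(prediction, detections, k):
    """Indices of the k detections closest to the prediction."""
    by_distance = sorted(
        range(len(detections)),
        key=lambda i: math.dist(prediction, detections[i]),
    )
    return sorted(by_distance[:k])


prediction = (10.0, 10.0)
detections = [(11.0, 12.0), (40.0, 40.0), (9.0, 10.0), (18.0, 10.0)]

inside_gate = gate_by_threshold(prediction, detections, max_distance=5.0)
nearest_three = gate_by_k_nearest(prediction, detections, k=3)
```

The threshold gate keeps a fixed radius regardless of how many detections fall inside it, while the k-nearest gate always keeps k candidates, even when some of them are far from the prediction.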
Data association reduces to a minimum-weight bipartite matching problem, which is a well-studied area of graph theory. A bipartite graph represents tracks and detections as vertices. It also represents the cost of matching a detection and a track as a weighted edge between the corresponding vertices.
The assignDetectionsToTracks function implements the Munkres variant of the Hungarian bipartite matching algorithm. Its input is the cost matrix, where the rows correspond to tracks and the columns correspond to detections. Each entry contains the cost of assigning a particular detection to a particular track. You can implement gating by setting the cost of impossible matches to infinity.
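The structure of the assignment problem can be illustrated with a small sketch. It is not the assignDetectionsToTracks implementation and does not use the Munkres algorithm: it swaps in an exhaustive search that is only practical for a handful of tracks and detections. It also assumes that every unassigned track and every unassigned detection incurs a fixed cost, named `cost_of_non_assignment` here, and that infinite entries mark gated-out pairs.

```python
import math


def assign_by_exhaustive_search(cost, cost_of_non_assignment):
    """Return (pairs, unassigned_tracks, unassigned_detections).

    cost[t][d] is the cost of matching track t to detection d;
    math.inf marks a pair that gating has ruled out.
    """
    n_tracks = len(cost)
    n_dets = len(cost[0]) if cost else 0
    best = {"total": math.inf, "pairs": []}

    def search(track, used, pairs, total):
        if track == n_tracks:
            # Detections that no track claimed also pay the non-assignment cost.
            total += cost_of_non_assignment * (n_dets - len(used))
            if total < best["total"]:
                best["total"], best["pairs"] = total, list(pairs)
            return
        # Option 1: leave this track unassigned.
        search(track + 1, used, pairs, total + cost_of_non_assignment)
        # Option 2: match this track to any free, non-gated detection.
        for det in range(n_dets):
            if det not in used and math.isfinite(cost[track][det]):
                used.add(det)
                pairs.append((track, det))
                search(track + 1, used, pairs, total + cost[track][det])
                pairs.pop()
                used.remove(det)

    search(0, set(), [], 0.0)
    pairs = best["pairs"]
    matched_tracks = {t for t, _ in pairs}
    matched_dets = {d for _, d in pairs}
    unassigned_tracks = [t for t in range(n_tracks) if t not in matched_tracks]
    unassigned_dets = [d for d in range(n_dets) if d not in matched_dets]
    return pairs, unassigned_tracks, unassigned_dets


inf = math.inf
cost_matrix = [
    [1.0, 10.0, inf],   # track 0: close to detection 0
    [8.0, 2.0, inf],    # track 1: close to detection 1
    [inf, inf, inf],    # track 2: every detection is gated out
]

pairs, lost_tracks, new_detections = assign_by_exhaustive_search(cost_matrix, 5.0)
```

In this example, tracks 0 and 1 are matched to detections 0 and 1, track 2 is left unassigned, and detection 2 is left unassigned. The non-assignment cost controls how expensive a match may be before leaving both the track and the detection unmatched becomes the cheaper option.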
Data association must take into account the fact that new objects can appear in the field of view, or that an object being tracked can leave the field of view. In other words, in any given frame, some number of new tracks might need to be created, and some number of existing tracks might need to be discarded. The assignDetectionsToTracks function returns the indices of unassigned tracks and unassigned detections in addition to the matched pairs.
One way of handling unmatched detections is to create a new track from each of them. Alternatively, you can create new tracks from unmatched detections greater than a certain size, or from detections that have certain locations or appearance. For example, if the scene has a single entry point, such as a doorway, then you can specify that only unmatched detections located near the entry point can begin new tracks, and that all other detections are considered noise.
One way of handling unmatched tracks is to delete any track that remains unmatched for a certain number of frames. Alternatively, you can delete an unmatched track when its last known location is near an exit point.
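The simplest of these policies can be sketched together: every unmatched detection starts a new track, and a track is deleted once it has gone unmatched for too many consecutive frames. The track representation (a dictionary with an identifier, last known location, and a counter), the function name, and the `max_invisible` parameter are all hypothetical.

```python
def update_tracks(tracks, detections, assignments, unassigned_tracks,
                  unassigned_detections, next_id, max_invisible=3):
    """Apply one frame of association results to the list of tracks."""
    # Matched tracks take the detected location and reset their counter.
    for track_idx, det_idx in assignments:
        tracks[track_idx]["location"] = detections[det_idx]
        tracks[track_idx]["invisible_count"] = 0

    # Unmatched tracks age by one frame.
    for track_idx in unassigned_tracks:
        tracks[track_idx]["invisible_count"] += 1

    # Delete tracks that have been unmatched for too long.
    survivors = [t for t in tracks if t["invisible_count"] <= max_invisible]

    # Each unmatched detection begins a new track.
    for det_idx in unassigned_detections:
        survivors.append(
            {"id": next_id, "location": detections[det_idx], "invisible_count": 0}
        )
        next_id += 1

    return survivors, next_id


tracks = [
    {"id": 1, "location": (0, 0), "invisible_count": 0},
    {"id": 2, "location": (5, 5), "invisible_count": 3},  # already unmatched for 3 frames
]
detections = [(1, 0), (9, 9)]

tracks, next_id = update_tracks(
    tracks,
    detections,
    assignments=[(0, 0)],        # track index 0 matched detection index 0
    unassigned_tracks=[1],       # track index 1 found no detection
    unassigned_detections=[1],   # detection index 1 matched no track
    next_id=3,
)
```

Here the track with identifier 2 exceeds the limit and is removed, and the unmatched detection at (9, 9) becomes a new track with identifier 3. A rule based on size, entry points, or exit points would replace the unconditional create and delete steps with a test on each detection or track.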
See also: assignDetectionsToTracks | bboxOverlapRatio | configureKalmanFilter | extractHOGFeatures | selectStrongestBbox | trainCascadeObjectDetector | vision.CascadeObjectDetector | vision.ForegroundDetector | vision.KalmanFilter | vision.PeopleDetector | vision.PointTracker