Getting Started with SSD Multibox Detection

The single shot multibox detector (SSD) uses a single-stage object detection network that merges detections predicted from multiscale features. The SSD is faster than two-stage detectors, such as the Faster R-CNN detector, and can localize objects more accurately than single-scale feature detectors, such as the YOLO v2 detector.

The SSD runs a deep learning CNN on an input image to produce network predictions from multiple feature maps. The object detector gathers and decodes predictions to generate bounding boxes.
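To give a sense of how many predictions a multiscale design gathers, the following Python sketch counts anchor boxes across several feature maps. The feature map sizes and anchors-per-location values are those of the SSD300 configuration described in the SSD paper [1]. They serve only as an example and do not describe any particular toolbox network. The function name is hypothetical.

```python
# Illustrative only: count the anchor box predictions gathered from
# several feature maps. Sizes follow the SSD300 configuration in [1].

def count_anchor_predictions(feature_maps):
    """feature_maps: list of (height, width, anchors_per_location)."""
    return sum(h * w * a for h, w, a in feature_maps)

ssd300_maps = [
    (38, 38, 4),
    (19, 19, 6),
    (10, 10, 6),
    (5, 5, 6),
    (3, 3, 4),
    (1, 1, 4),
]

total = count_anchor_predictions(ssd300_maps)
print(total)  # 8732
```

Most of the predictions come from the largest feature map, which is why fine-resolution maps help with small objects while coarse maps cover large ones.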

Predict Objects in the Image

SSD uses anchor boxes to detect classes of objects in an image. For more details, see Anchor Boxes for Object Detection. The SSD predicts these two attributes for each anchor box.

Anchor box offsets — Refine the anchor box position.
Class probability — Predict the class label assigned to each anchor box.

This figure shows predefined anchor boxes (the dotted lines) at each location in a feature map and the refined location after offsets are applied. Matched boxes with a class are in blue and orange.
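The two predicted attributes can be sketched in a few lines of Python. This is a minimal illustration, not the toolbox implementation: it assumes the center-size offset encoding commonly used with SSD (center shifts scaled by the anchor size, exponential scaling of width and height), omits any variance scaling, and assumes the background class sits at index 0. The function names are hypothetical.

```python
import math

def decode_box(anchor, offsets):
    """Apply predicted offsets to one anchor box.

    anchor:  (cx, cy, w, h) center and size of the anchor box
    offsets: (dx, dy, dw, dh) predicted by the regression branch
    Assumed encoding: center shifts scale with the anchor size,
    and width/height change by an exponential factor.
    """
    cx, cy, w, h = anchor
    dx, dy, dw, dh = offsets
    return (cx + dx * w, cy + dy * h, w * math.exp(dw), h * math.exp(dh))

def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

anchor = (50.0, 50.0, 20.0, 40.0)
refined = decode_box(anchor, (0.5, -0.25, 0.0, 0.0))
print(refined)  # (60.0, 40.0, 20.0, 40.0)

probs = softmax([0.0, 2.0, 0.0])  # assumed layout: index 0 is background
label = probs.index(max(probs))
print(label)  # 1
```

Here the offsets move the anchor center by half its width to the right and a quarter of its height upward while leaving its size unchanged, and the highest class probability selects label 1.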

Transfer Learning

With transfer learning, you can use a pretrained CNN as the feature extractor in an SSD detection network. Use the ssdLayers function to create an SSD detection network from any pretrained CNN, such as MobileNet v2. For a list of pretrained CNNs, see Pretrained Deep Neural Networks (Deep Learning Toolbox).

You can also design a custom model based on a pretrained image classification CNN. For more details, see Design an SSD Detection Network.

Design an SSD Detection Network

You can design a custom SSD model programmatically or use the Deep Network Designer (Deep Learning Toolbox) app to manually create a network. The app incorporates Computer Vision Toolbox™ SSD features.

To design an SSD Multibox detection network, follow these steps.

Start the model with a feature extractor network, which can be initialized from a pretrained CNN or trained from scratch.
Select prediction layers from the feature extraction network. Any layer from the feature extraction network can be used as a prediction layer. However, to leverage the benefits of using multiscale features for object detection, choose feature maps of different sizes.
Specify anchor boxes for the prediction layers by attaching an anchorBoxLayer object to each of them.
Connect the outputs of the anchorBoxLayer objects to a classification branch and to a regression branch. The classification branch has at least one convolution layer that predicts the class for each tiled anchor box. The regression branch has at least one convolution layer that predicts anchor box offsets. You can add more layers to the classification and regression branches. However, the final convolution layer of each branch (before the merge layer) must have the number of filters given in this table.

Branch	Number of Filters
Classification	Number of anchor boxes times (number of classes + 1), where the extra class is background
Regression	Four times the number of anchor boxes

For all prediction layers, combine the outputs of the classification branches by using the ssdMergeLayer object. Connect the ssdMergeLayer object to a softmaxLayer (Deep Learning Toolbox) object, followed by a focalLossLayer object. Gather all outputs of the regression branches by using the ssdMergeLayer object again. Connect the ssdMergeLayer output to an rcnnBoxRegressionLayer object.

For more details on creating this type of network, see Create SSD Object Detection Network.
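The filter counts and the effect of merging can be checked with a short Python sketch. It assumes the standard SSD formulation, in which each anchor box receives one score per object class plus one for background, and four offsets. The merged shapes are a conceptual view (one row per anchor box) and are not meant to describe the exact output layout of ssdMergeLayer. All names and the example network are hypothetical.

```python
def branch_filter_counts(num_anchors, num_classes):
    """Return (classification_filters, regression_filters) for one prediction layer.

    Assumes one score per object class plus one for background per anchor,
    and four offsets per anchor.
    """
    classification = num_anchors * (num_classes + 1)
    regression = 4 * num_anchors
    return classification, regression

def merged_output_shapes(feature_maps, num_classes):
    """Conceptual shapes after merging all prediction layers.

    feature_maps: list of (height, width, num_anchors)
    Returns ((rows, num_classes + 1), (rows, 4)), one row per anchor box.
    """
    rows = sum(h * w * a for h, w, a in feature_maps)
    return (rows, num_classes + 1), (rows, 4)

# Hypothetical network: two prediction layers, two object classes.
layers = [(19, 19, 6), (10, 10, 4)]
for h, w, a in layers:
    print((h, w), branch_filter_counts(a, 2))
# (19, 19) (18, 24)
# (10, 10) (12, 16)

print(merged_output_shapes(layers, 2))  # ((2566, 3), (2566, 4))
```

Because every prediction layer contributes rows with the same number of columns, the merged classification output can feed a single softmax and loss, and the merged regression output can feed a single box regression loss.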

Train an Object Detector and Detect Objects with an SSD Model

To learn how to train an object detector by using the SSD deep learning technique, see the Object Detection Using SSD Deep Learning example.

Code Generation

To learn how to generate CUDA® code using the SSD object detector (created using the ssdObjectDetector object), see Code Generation for Object Detection by Using Single Shot Multibox Detector.

Label Training Data for Deep Learning

You can use the Image Labeler, Video Labeler, or Ground Truth Labeler (Automated Driving Toolbox) apps to interactively label data and export the labels for training. The apps can label rectangular regions of interest (ROIs) for object detection, scene labels for image classification, and pixels for semantic segmentation.

References

[1] Liu, Wei, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg. "SSD: Single Shot MultiBox Detector." In Computer Vision – ECCV 2016, edited by Bastian Leibe, Jiri Matas, Nicu Sebe, and Max Welling, 9905:21-37. Cham: Springer International Publishing, 2016. https://doi.org/10.1007/978-3-319-46448-0_2.
