This example shows how to modify a pretrained MobileNet v2 network to create a YOLO v2 object detection network.
The procedure to convert a pretrained network into a YOLO v2 network is similar to the transfer learning procedure for image classification:
1. Load the pretrained network.
2. Select a layer from the pretrained network to use for feature extraction.
3. Remove all layers after the feature extraction layer.
4. Add new layers to support the object detection task.
Load a pretrained MobileNet v2 network using mobilenetv2. This requires the Deep Learning Toolbox Model for MobileNet v2 Network™ support package. If this support package is not installed, then the function provides a download link. After you load the network, convert the network into a layerGraph object so that you can manipulate the layers.
net = mobilenetv2();
lgraph = layerGraph(net);
Update the network input size to meet the training data requirements. For example, assume the training data are 300-by-300 RGB images. Set the input size.
imageInputSize = [300 300 3];
Next, create a new image input layer with the same name as the original layer.
imgLayer = imageInputLayer(imageInputSize,"Name","input_1")
imgLayer = 
  ImageInputLayer with properties:

                      Name: 'input_1'
                 InputSize: [300 300 3]

   Hyperparameters
          DataAugmentation: 'none'
             Normalization: 'zerocenter'
    NormalizationDimension: 'auto'
                      Mean: []
Replace the old image input layer with the new image input layer.
lgraph = replaceLayer(lgraph,"input_1",imgLayer);
A YOLO v2 feature extraction layer is most effective when the output feature width and height are between 8 and 16 times smaller than the input image. This amount of downsampling is a trade-off between spatial resolution and output-feature quality. You can use the analyzeNetwork function or the Deep Network Designer app to determine the output sizes of layers within a network. Note that selecting an optimal feature extraction layer requires empirical evaluation.
Set the feature extraction layer to "block_12_add".
The output size of this layer is about 16 times smaller than the input image size of 300-by-300.
featureExtractionLayer = "block_12_add";
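To make the downsampling trade-off concrete, the following minimal sketch estimates the feature-map size at a few candidate downsampling factors. It is an illustration only: the variable names are hypothetical, and it assumes that every stride-2 stage rounds the spatial size up, which may not hold for every network. Use analyzeNetwork to confirm the actual layer output sizes.

```matlab
% Sketch: estimate the feature-map size for several downsampling factors.
% Assumption (not from the example): each stride-2 stage halves the
% spatial size and rounds up.
inputSize = [300 300];   % [height width] of the training images
factors   = [8 16 32];   % candidate downsampling factors (powers of 2)

featureSizes = zeros(numel(factors),2);
for i = 1:numel(factors)
    sz = inputSize;
    for k = 1:log2(factors(i))   % one halving per stride-2 stage
        sz = ceil(sz/2);
    end
    featureSizes(i,:) = sz;
    fprintf("%2dx downsampling: %d-by-%d feature map\n",factors(i),sz(1),sz(2));
end
```

Under this assumption, a factor of 16 gives a 19-by-19 feature map for a 300-by-300 input, while a factor of 32 leaves only a 10-by-10 grid on which to localize objects.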
Next, remove the layers after the feature extraction layer. You can do so by importing the network into the Deep Network Designer app, manually removing the layers, and exporting the modified network to your workspace.
For this example, load the modified network, which has been added to this example as a supporting file.
modified = load("mobilenetv2Block12Add.mat");
lgraph = modified.mobilenetv2Block12Add;
The detection subnetwork consists of groups of serially connected convolution, ReLU, and batch normalization layers. These layers are followed by a yolov2TransformLayer and a yolov2OutputLayer.
First, create two groups of serially connected convolution, ReLU, and batch normalization layers. Set the convolution layer filter size to 3-by-3 and the number of filters to match the number of channels in the feature extraction layer output. Specify "same" padding in the convolution layer to preserve the input size.
filterSize = [3 3];
numFilters = 96;

detectionLayers = [
    convolution2dLayer(filterSize,numFilters,"Name","yolov2Conv1","Padding", "same", "WeightsInitializer",@(sz)randn(sz)*0.01)
    batchNormalizationLayer("Name","yolov2Batch1")
    reluLayer("Name","yolov2Relu1")
    convolution2dLayer(filterSize,numFilters,"Name","yolov2Conv2","Padding", "same", "WeightsInitializer",@(sz)randn(sz)*0.01)
    batchNormalizationLayer("Name","yolov2Batch2")
    reluLayer("Name","yolov2Relu2")
    ]
detectionLayers = 
  6x1 Layer array with layers:

     1   'yolov2Conv1'    Convolution           96 3x3 convolutions with stride [1 1] and padding 'same'
     2   'yolov2Batch1'   Batch Normalization   Batch normalization
     3   'yolov2Relu1'    ReLU                  ReLU
     4   'yolov2Conv2'    Convolution           96 3x3 convolutions with stride [1 1] and padding 'same'
     5   'yolov2Batch2'   Batch Normalization   Batch normalization
     6   'yolov2Relu2'    ReLU                  ReLU
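As a quick check on why "same" padding preserves the spatial size, the sketch below applies the standard convolution output-size formula to a 3-by-3 filter at stride 1. The 19-by-19 input size and the helper name convOutSize are assumptions chosen for illustration, not values taken from the network.

```matlab
% Sketch: spatial output size of a convolution, using
%   out = floor((in + 2*pad - filt)/stride) + 1
% convOutSize is a hypothetical helper, not a toolbox function.
convOutSize = @(in,filt,pad,stride) floor((in + 2*pad - filt)./stride) + 1;

inSize = [19 19];   % assumed feature-map size, for illustration only
filt   = [3 3];     % filter size used by the detection layers
stride = [1 1];

% For an odd filter size at stride 1, "same" padding corresponds to
% (filt-1)/2 pixels on each side.
samePad = (filt - 1)/2;

outSame  = convOutSize(inSize,filt,samePad,stride);   % size is preserved
outNoPad = convOutSize(inSize,filt,[0 0],stride);     % size shrinks by filt-1
```

With these values, outSame equals the input size, whereas the unpadded convolution loses two rows and two columns. Stacking several unpadded layers would shrink the detection grid further with each layer.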
Next, create the final portion of the detection subnetwork, which has a convolution layer followed by a yolov2TransformLayer and a yolov2OutputLayer. The output of the convolution layer predicts the following for each anchor box:
The object class probabilities.
The x and y location offset.
The width and height offset.
Specify the anchor boxes and number of classes and compute the number of filters for the convolution layer.
numClasses = 5;

anchorBoxes = [
    16 16
    32 16
    ];

numAnchors = size(anchorBoxes,1);
numPredictionsPerAnchor = 5;
numFiltersInLastConvLayer = numAnchors*(numClasses+numPredictionsPerAnchor);
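The filter count depends on the shape of the anchor box matrix, so a short worked example may help. The sketch below assumes that each row of the matrix defines one anchor box, and that the five per-anchor predictions are the x, y, width, and height offsets plus one confidence score. That breakdown is an interpretation of numPredictionsPerAnchor, not something stated in this example. All names prefixed with "example" are hypothetical.

```matlab
% Sketch: how the final convolution filter count follows from the anchors.
% Assumption: 5 predictions per anchor = 4 box offsets + 1 confidence score.
exampleNumClasses = 5;
exampleAnchors = [16 16; 32 16];            % one anchor box per row (M-by-2)

exampleNumAnchors = size(exampleAnchors,1); % counts rows, so M = 2
exampleNumFilters = exampleNumAnchors*(exampleNumClasses + 5);

% Pitfall: a single row of four numbers is read as ONE row, so the
% anchor count (and therefore the filter count) comes out wrong.
flatAnchors = [16 16 32 16];
flatNumAnchors = size(flatAnchors,1);
flatNumFilters = flatNumAnchors*(exampleNumClasses + 5);

fprintf("M-by-2 anchors: %d anchors, %d filters\n",exampleNumAnchors,exampleNumFilters);
fprintf("Flat row:       %d anchor,  %d filters\n",flatNumAnchors,flatNumFilters);
```

The M-by-2 form yields 2 anchors and 20 filters, which matches the 20 1x1 convolutions reported for the final convolution layer in this example.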
Add the convolution2dLayer, yolov2TransformLayer, and yolov2OutputLayer to the detection subnetwork.
detectionLayers = [
    detectionLayers
    convolution2dLayer(1,numFiltersInLastConvLayer,"Name","yolov2ClassConv",...
    "WeightsInitializer", @(sz)randn(sz)*0.01)
    yolov2TransformLayer(numAnchors,"Name","yolov2Transform")
    yolov2OutputLayer(anchorBoxes,"Name","yolov2OutputLayer")
    ]
detectionLayers = 
  9x1 Layer array with layers:

     1   'yolov2Conv1'         Convolution               96 3x3 convolutions with stride [1 1] and padding 'same'
     2   'yolov2Batch1'        Batch Normalization       Batch normalization
     3   'yolov2Relu1'         ReLU                      ReLU
     4   'yolov2Conv2'         Convolution               96 3x3 convolutions with stride [1 1] and padding 'same'
     5   'yolov2Batch2'        Batch Normalization       Batch normalization
     6   'yolov2Relu2'         ReLU                      ReLU
     7   'yolov2ClassConv'     Convolution               20 1x1 convolutions with stride [1 1] and padding [0 0 0 0]
     8   'yolov2Transform'     YOLO v2 Transform Layer.  YOLO v2 Transform Layer with 2 anchors.
     9   'yolov2OutputLayer'   YOLO v2 Output            YOLO v2 Output with 2 anchors.
Attach the detection subnetwork to the feature extraction network.
lgraph = addLayers(lgraph,detectionLayers);
lgraph = connectLayers(lgraph,featureExtractionLayer,"yolov2Conv1");
Use analyzeNetwork(lgraph) to check the network and then train a YOLO v2 object detector using the trainYOLOv2ObjectDetector function.