Code Generation For Object Detection Using YOLO v3 Deep Learning

This example shows how to generate CUDA® MEX for a you only look once (YOLO) v3 object detector with custom layers. The example uses YOLO v3 object detection to illustrate:

  • CUDA code generation for a deep learning network with custom layers.

  • Conversion of a deep learning dlnetwork object into a DAGNetwork object for code generation.

YOLO v3 improves upon YOLO v2 by adding detection at multiple scales to help detect smaller objects. Moreover, the loss function used for training is separated into mean squared error for bounding box regression and binary cross-entropy for object classification to help improve detection accuracy. This example uses the YOLO v3 network trained in the Object Detection Using YOLO v3 Deep Learning example from the Computer Vision Toolbox™. For more information, see Object Detection Using YOLO v3 Deep Learning (Computer Vision Toolbox).

Third-Party Prerequisites

Required

  • CUDA®-enabled NVIDIA® GPU and compatible driver.

Optional

For non-MEX builds such as static or dynamic libraries and executables, this example has the following additional requirements.

  • CUDA toolkit.

  • NVIDIA cuDNN library.

  • Environment variables for the compilers and libraries. For more information, see Setting Up the Prerequisite Products (GPU Coder).

Verify GPU Environment

To verify that the compilers and libraries for running this example are set up correctly, use the coder.checkGpuInstall (GPU Coder) function.

envCfg = coder.gpuEnvConfig('host');
envCfg.DeepLibTarget = 'cudnn';
envCfg.DeepCodegen = 1;
envCfg.Quiet = 1;
coder.checkGpuInstall(envCfg);

YOLO v3 Network

The YOLO v3 network in this example is based on squeezenet, and uses the feature extraction network in SqueezeNet with the addition of two detection heads at the end. The second detection head is twice the size of the first detection head, so it is better able to detect small objects. Note that any number of detection heads of different sizes can be specified based on the size of the objects to be detected. The YOLO v3 network uses anchor boxes estimated using training data to have better initial priors corresponding to the type of data set and to help the network learn to predict the boxes accurately. For information about anchor boxes, see Anchor Boxes for Object Detection (Computer Vision Toolbox).
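The anchor boxes used here were estimated from the training data. As a minimal sketch of that estimation, assuming a labeled training datastore named trainingData that is not part of this example, the workflow looks like this:

% Hypothetical sketch: estimate six anchor boxes from a labeled training
% datastore (trainingData is not defined in this example).
numAnchors = 6;
[anchorBoxes, meanIoU] = estimateAnchorBoxes(trainingData, numAnchors);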

The YOLO v3 network in this example is illustrated in the following diagram.

Each detection head predicts the bounding box coordinates (x, y, width, height), object confidence, and class probabilities for the respective anchor box masks. Therefore, for each detection head, the number of output filters in the last convolution layer is the number of anchor box masks times the number of prediction elements per anchor box. The detection heads comprise the output layer of the network.
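For example, this network detects one class and uses three anchor box masks per detection head, so the last convolution layer of each head has 3 * (4 + 1 + 1) = 18 output filters. A quick check of that arithmetic:

numAnchorsPerHead = 3;                 % anchor box masks per detection head
numClasses = 1;                        % this example detects only 'vehicle'
numPredElements = 4 + 1 + numClasses;  % x, y, w, h, objectness, class scores
numFilters = numAnchorsPerHead * numPredElements   % 18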

Pretrained YOLO v3 Network

Download the YOLO v3 network trained in the Object Detection Using YOLO v3 Deep Learning example. To train the network yourself, see Object Detection Using YOLO v3 Deep Learning (Computer Vision Toolbox).

pretrained = load("yolov3SqueezeNetVehicleExample_20a.mat");
net = pretrained.net;

Prepare Pretrained Model for Code Generation

The pretrained YOLO v3 model is a dlnetwork object that is not supported for code generation. Extract the layer graph of the dlnetwork object by using the layerGraph function.

lGraph = layerGraph(net);

The layer graph output by the layerGraph function does not include output layers. Add a regression layer to the layer graph for each of its outputs by using the addLayers and connectLayers functions.

% Find the indices of the output layers of the dlnetwork.
outLayerIdx = 1:numel(lGraph.Layers);
isOutLayer = arrayfun(@(x) any(strcmp(x.Name, net.OutputNames)), lGraph.Layers);
outLayerIdx(~isOutLayer) = [];

% Connect a regression output layer to each detection head.
for iOut = 1:numel(outLayerIdx)
    outLayer = lGraph.Layers(outLayerIdx(iOut));
    newLayer = regressionLayer('Name', [outLayer.Name '_output_' num2str(iOut)]);
    lGraph = addLayers(lGraph, newLayer);
    lGraph = connectLayers(lGraph, outLayer.Name, newLayer.Name);
end

The number of output layers of the network is the same as the number of detection heads in the network.
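As an optional sanity check, which is not part of the original workflow, count the regression output layers in the modified layer graph:

isRegLayer = arrayfun(@(l) isa(l, 'nnet.cnn.layer.RegressionOutputLayer'), lGraph.Layers);
nnz(isRegLayer)   % returns 2, one per detection head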

Custom Upsample Layer

The YOLO v3 network uses a custom upsample layer to upsample the input image data. Here, we use a code generation compatible version of the upsampleLayer from the Object Detection Using YOLO v3 Deep Learning (Computer Vision Toolbox) example. The upsampleBy2Layer upsamples the input by a factor of 2, replicating the neighboring pixel values by using the repelem function. Because code generation for the repelem function is supported only for vector or 2-D matrix inputs, the upsampleBy2Layer implementation is parameterized by using the coder.target (MATLAB Coder) function. When running a simulation in MATLAB, a single call to repelem is sufficient because the input can be an N-D matrix. For code generation, the example uses nested for-loops so that the input to each repelem call in the generated code is a 2-D matrix. Furthermore, because the layer output size must be constant at code generation time, the UpSampleFactor property of the layer is declared Constant.

type('upsampleBy2Layer.m')
% Upsample by replicating neighbouring pixel values.

% Copyright 2020 The MathWorks, Inc.

classdef upsampleBy2Layer < nnet.layer.Layer
    properties (Constant)
        % factor to upsample the input.
        UpSampleFactor = 2
    end
    
    methods
        function layer = upsampleBy2Layer(name)
            
            % Set layer name.
            layer.Name = name;
            
            % Set layer description.
            layer.Description = "upSamplingLayer with factor " + layer.UpSampleFactor;
            
        end
        
        function Z = predict(layer, X)
            % Z = predict(layer, X) forwards the input data X through the
            % layer and outputs the result Z.
            if coder.target('MATLAB')
                Z = repelem(X,layer.UpSampleFactor,layer.UpSampleFactor);
            else
                numChannels = size(X, 3);
                numBatches = size(X, 4);
                Zsize = coder.const([size(X, 1) size(X, 2) numChannels numBatches] .* [layer.UpSampleFactor layer.UpSampleFactor 1 1]);
                Z = coder.nullcopy(zeros(Zsize, 'like', X));
                
                coder.gpu.kernel(-1, -1);
                for iBatch = 1:numBatches
                    for iChannel = 1:numChannels
                        Z(:, :, iChannel, iBatch) = repelem(X(:, :, iChannel, iBatch), layer.UpSampleFactor, layer.UpSampleFactor);
                    end
                end
            end
        end
    end
end
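To see what the layer computes per channel, note the effect of repelem on a small 2-D matrix:

A = [1 2; 3 4];
repelem(A, 2, 2)
ans =

     1     1     2     2
     1     1     2     2
     3     3     4     4
     3     3     4     4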

Replace the upsampleLayer present in the pretrained network with its code generation compatible version upsampleBy2Layer by using the replaceLayer function.

% Find the upsampleLayer in the layer graph and swap in the code
% generation compatible upsampleBy2Layer.
isUpsampleLayer = arrayfun(@(x) isa(x, 'upsampleLayer'), lGraph.Layers);
layerNameToReplace = lGraph.Layers(isUpsampleLayer).Name;
lGraph = replaceLayer(lGraph, layerNameToReplace, upsampleBy2Layer(layerNameToReplace));

Assemble layerGraph into a DAGNetwork

Assemble the layerGraph into a DAGNetwork object ready to use for code generation by using the assembleNetwork function.

dagNet = assembleNetwork(lGraph)
dagNet = 
  DAGNetwork with properties:

         Layers: [72×1 nnet.cnn.layer.Layer]
    Connections: [80×2 table]
     InputNames: {'data'}
    OutputNames: {'conv2Detection1_output_1'  'conv2Detection2_output_2'}

Save the network to a MAT-file.

matFile = 'yolov3DAGNetwork.mat';
save(matFile, 'dagNet');

The yolov3Detect Entry-Point Function

The yolov3Detect entry-point function takes an input image and passes it to a trained network for prediction through the yolov3Predict function. The yolov3Predict function loads the network object from the MAT-file into a persistent variable and reuses the persistent object for subsequent prediction calls. Specifically, the function uses the DAGNetwork representation of the network trained in the Object Detection Using YOLO v3 Deep Learning (Computer Vision Toolbox) example. The predictions from the yolov3Predict calls, which are in YOLO v3 grid cell coordinates, are then converted to bounding box coordinates by using the supporting functions generateTiledAnchors and applyAnchorBoxOffsets.

type('yolov3Detect.m')
function [bboxes,scores,labels] = yolov3Detect(matFile, im, networkInputSize, networkOutputs, confidenceThreshold, overlapThreshold, classes)
% The yolov3Detect function detects the bounding boxes, scores, and labels in an image.
coder.extrinsic('generateYOLOv3Detections');

%% Preprocess Data
% This example applies all the preprocessing transforms to the data set
% applied during training, except data augmentation. Because the example
% uses a pretrained YOLO v3 network, the input data must be representative
% of the original data and left unmodified for unbiased evaluation.

% Specifically the following preprocessing operations are applied to the
% input data. 
%     1. Resize the images to the network input size, as the images are bigger than networkInputSize. 
%     2. Scale the image pixels in the range [0 1].

im = preprocessData(im, networkInputSize);
imageSize = size(im,[1,2]);

%% Define Anchor Boxes
% Specify the anchor boxes estimated on the basis of the preprocessed
% training data used when training the YOLO v3 network. These anchor box
% values are same as mentioned in
% <docid:vision_ug#mw_47d9a223-5ec7-4d36-a020-4f9d147ecdec Object Detection
% Using YOLO v3 Deep Learning> example. For details on estimating anchor
% boxes, see <docid:vision_ug#mw_f9f22f48-0ad0-4f37-8bc1-22a2046637f2
% Anchor Boxes for Object Detection>.

anchors = [150   127;
    97    90;
    68    67;
    38    42;
    41    29;
    31    23];

% Specify anchorBoxMasks to select anchor boxes to use in both the
% detection heads of the YOLO v3 network. anchorBoxMasks is a cell array of
% size M-by-1, where M denotes the number of detection heads. Each
% detection head consists of a 1-by-N array of row index of anchors in
% anchorBoxes, where N is the number of anchor boxes to use. Select anchor
% boxes for each detection head based on size: use larger anchor boxes at
% lower scale and smaller anchor boxes at higher scale. To do so, sort the
% anchor boxes with the larger anchor boxes first and assign the first
% three to the first detection head and the next three to the second
% detection head.

area = anchors(:, 1).*anchors(:, 2);
[~, idx] = sort(area, 'descend');
anchors = anchors(idx, :);
anchorBoxMasks = {[1,2,3]
    [4,5,6]
    };

%% Predict on Yolov3
% Predict and filter the detections based on confidence threshold.
predictions = yolov3Predict(matFile,im,networkOutputs,anchorBoxMasks);

%% Generate Detections
anchorIndex = 2:5; % indices corresponding to x,y,w,h predictions for bounding boxes
tiledAnchors = generateTiledAnchors(predictions,anchors,anchorBoxMasks,anchorIndex);
predictions = applyAnchorBoxOffsets(tiledAnchors, predictions, networkInputSize, anchorIndex);
[bboxes,scores,labels] = generateYOLOv3Detections(predictions, confidenceThreshold, overlapThreshold, imageSize, classes);

% Apply suppression to the detections to filter out multiple overlapping
% detections.
if ~isempty(scores)
    [bboxes, scores, labels] = selectStrongestBboxMulticlass(bboxes, scores, labels ,...
        'RatioType', 'Union', 'OverlapThreshold', overlapThreshold);
end
end

function YPredCell = yolov3Predict(matFile,im,networkOutputs,anchorBoxMask)
% Predict the output of network and extract the confidence, x, y,
% width, height, and class.

% load the deep learning network for prediction
persistent net;

if isempty(net)
    net = coder.loadDeepLearningNetwork(matFile);
end

YPredictions = cell(size(networkOutputs));
[YPredictions{:}] = predict(net, im);
YPredCell = extractPredictions(YPredictions, anchorBoxMask);

% Apply activation to the predicted cell array.
YPredCell = applyActivations(YPredCell);
end
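Because the network object is cached in a persistent variable, it is loaded from the MAT-file only on the first call. If you regenerate the MAT-file, clear the function so that the next call reloads the network (a usage note, not part of the original example):

clear yolov3Detect        % clears the persistent network in MATLAB simulation
clear yolov3Detect_mex    % clears the persistent network in the generated MEX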

Evaluate the Entry-Point Function for Object Detection

Follow these steps to evaluate the entry-point function on an image from the test data.

  • Specify the confidence threshold as 0.5 to keep only detections with confidence scores above this value.

  • Specify the overlap threshold as 0.5 to remove overlapping detections.

  • Read an image from the input data.

  • Use the entry-point function yolov3Detect to get the predicted bounding boxes, confidence scores, and class labels.

  • Display the image with bounding boxes and confidence scores.

Define the desired thresholds.

confidenceThreshold = 0.5;
overlapThreshold = 0.5;

Get the network input size from the input layer of the trained network and the number of network outputs.

networkInputIdx = arrayfun(@(x)isa(x,'nnet.cnn.layer.ImageInputLayer'),net.Layers);
networkInputSize = net.Layers(networkInputIdx).InputSize;
networkOutputs = numel(dagNet.OutputNames);

Read the example image data obtained from the labeled data set from the Object Detection Using YOLO v3 Deep Learning (Computer Vision Toolbox) example. This image contains one instance of an object of type vehicle.

I = imread('vehicleImage.jpg');

Specify the class names.

classNames = {'vehicle'};

Call the yolov3Detect entry-point function on the example image and display the results.

[bboxes,scores,~] = yolov3Detect(matFile, I, networkInputSize, networkOutputs, confidenceThreshold, overlapThreshold, classNames);

% Display the detections on the image
IAnnotated = insertObjectAnnotation(I, 'rectangle', bboxes, scores);
figure
imshow(IAnnotated)

Generate CUDA MEX

To generate CUDA® code for the yolov3Detect entry-point function, create a GPU code configuration object for a MEX target and set the target language to C++. Use the coder.DeepLearningConfig (GPU Coder) function to create a CuDNN deep learning configuration object and assign it to the DeepLearningConfig property of the GPU code configuration object.

cfg = coder.gpuConfig('mex');
cfg.TargetLang = 'C++';
cfg.DeepLearningConfig = coder.DeepLearningConfig('cudnn');

args = {coder.Constant(matFile), I, coder.Constant(networkInputSize), networkOutputs, ...
    confidenceThreshold, overlapThreshold, classNames};

codegen -config cfg yolov3Detect -args args
Code generation successful.

To generate CUDA® code for the TensorRT target, create and use a TensorRT deep learning configuration object instead of the CuDNN configuration object. Similarly, to generate code for the MKLDNN target, create a CPU code configuration object and use the MKLDNN deep learning configuration object as its DeepLearningConfig property.
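A sketch of those alternative configurations, assuming the corresponding libraries are installed and set up:

% TensorRT deep learning target (GPU).
cfgTensorRT = coder.gpuConfig('mex');
cfgTensorRT.TargetLang = 'C++';
cfgTensorRT.DeepLearningConfig = coder.DeepLearningConfig('tensorrt');

% MKLDNN deep learning target (CPU).
cfgMkldnn = coder.config('mex');
cfgMkldnn.TargetLang = 'C++';
cfgMkldnn.DeepLearningConfig = coder.DeepLearningConfig('mkldnn');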

Run the Generated MEX

Call the generated CUDA MEX with the same image input I as before and display the results.

[bboxes, scores, labels] = yolov3Detect_mex(matFile, I, networkInputSize, networkOutputs, ...
    confidenceThreshold, overlapThreshold, classNames);
figure;
IAnnotated = insertObjectAnnotation(I, 'rectangle', bboxes, scores);
imshow(IAnnotated);
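Optionally, compare the MATLAB simulation and the generated MEX execution times by using timeit (a benchmarking sketch, not part of the original example):

fSim = @() yolov3Detect(matFile, I, networkInputSize, networkOutputs, ...
    confidenceThreshold, overlapThreshold, classNames);
fMex = @() yolov3Detect_mex(matFile, I, networkInputSize, networkOutputs, ...
    confidenceThreshold, overlapThreshold, classNames);
simTime = timeit(fSim)
mexTime = timeit(fMex)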

Utility Functions

The utility functions listed below are similar to the ones used in the Object Detection Using YOLO v3 Deep Learning (Computer Vision Toolbox) example, except for the preprocessData function. In this example, only the image data is preprocessed, unlike in the Object Detection Using YOLO v3 Deep Learning (Computer Vision Toolbox) example, where the bounding boxes are also processed.

type('applyActivations.m')
function YPredCell = applyActivations(YPredCell)
% Apply activations to the raw predictions for each detection head:
% sigmoid for confidence, x, and y; exponential for width and height;
% sigmoid for class probabilities.
for iCell = 1:size(YPredCell, 1)
    for idx = 1:3
        YPredCell{iCell, idx} = sigmoidActivation(YPredCell{iCell, idx});
    end
    for idx = 4:5
        YPredCell{iCell, idx} = exp(YPredCell{iCell, idx});
    end
    YPredCell{iCell, 6} = sigmoidActivation(YPredCell{iCell, 6});
end
end

function out = sigmoidActivation(x)
out = 1./(1+exp(-x));
end
type('extractPredictions.m')
function predictions = extractPredictions(YPredictions, anchorBoxMask)
predictions = cell(size(YPredictions, 1),6);
for ii = 1:size(YPredictions, 1)
    numAnchors = size(anchorBoxMask{ii},2);
    % Confidence scores.
    startIdx = 1;
    endIdx = numAnchors;
    predictions{ii,1} = YPredictions{ii}(:,:,startIdx:endIdx,:);
    
    % X positions.
    startIdx = startIdx + numAnchors;
    endIdx = endIdx+numAnchors;
    predictions{ii,2} = YPredictions{ii}(:,:,startIdx:endIdx,:);
    
    % Y positions.
    startIdx = startIdx + numAnchors;
    endIdx = endIdx+numAnchors;
    predictions{ii,3} = YPredictions{ii}(:,:,startIdx:endIdx,:);
    
    % Width.
    startIdx = startIdx + numAnchors;
    endIdx = endIdx+numAnchors;
    predictions{ii,4} = YPredictions{ii}(:,:,startIdx:endIdx,:);
    
    % Height.
    startIdx = startIdx + numAnchors;
    endIdx = endIdx+numAnchors;
    predictions{ii,5} = YPredictions{ii}(:,:,startIdx:endIdx,:);
    
    % Class probabilities.
    startIdx = startIdx + numAnchors;
    predictions{ii,6} = YPredictions{ii}(:,:,startIdx:end,:);
end
end
type('generateTiledAnchors.m')
function tiledAnchors = generateTiledAnchors(YPredCell,anchorBoxes,anchorBoxMask,anchorIndex)
% Generate tiled anchor offset for converting the predictions from the YOLO 
% v3 grid cell coordinates to bounding box coordinates

tiledAnchors = cell(size(anchorIndex));
for i=1:size(YPredCell,1)
    anchors = anchorBoxes(anchorBoxMask{i}, :);
    [h,w,~,n] = size(YPredCell{i,1});
    [tiledAnchors{i,2}, tiledAnchors{i,1}] = ndgrid(0:h-1,0:w-1,1:size(anchors,1),1:n);
    [~,~,tiledAnchors{i,3}] = ndgrid(0:h-1,0:w-1,anchors(:,2),1:n);
    [~,~,tiledAnchors{i,4}] = ndgrid(0:h-1,0:w-1,anchors(:,1),1:n);
end
end
type('applyAnchorBoxOffsets.m')
function YPredCell = applyAnchorBoxOffsets(tiledAnchors,YPredCell,inputImageSize,anchorIndex)
% Convert the predictions from the YOLO v3 grid cell coordinates to bounding box coordinates
for i=1:size(YPredCell,1)
    [h,w,~,~] = size(YPredCell{i,1});  
    YPredCell{i,anchorIndex(1)} = (tiledAnchors{i,1}+YPredCell{i,anchorIndex(1)})./w;
    YPredCell{i,anchorIndex(2)} = (tiledAnchors{i,2}+YPredCell{i,anchorIndex(2)})./h;
    YPredCell{i,anchorIndex(3)} = (tiledAnchors{i,3}.*YPredCell{i,anchorIndex(3)})./inputImageSize(2);
    YPredCell{i,anchorIndex(4)} = (tiledAnchors{i,4}.*YPredCell{i,anchorIndex(4)})./inputImageSize(1);
end
end
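For reference, assuming the box parameterization from Redmon and Farhadi [1] (applyActivations already applied the sigmoid to the x, y predictions and the exponential to the w, h predictions), these offsets implement:

$$b_x = \frac{c_x + \sigma(t_x)}{W_{grid}}, \quad b_y = \frac{c_y + \sigma(t_y)}{H_{grid}}, \quad b_w = \frac{p_w \, e^{t_w}}{W_{in}}, \quad b_h = \frac{p_h \, e^{t_h}}{H_{in}}$$

where (c_x, c_y) are the grid-cell offsets produced by generateTiledAnchors, (p_w, p_h) are the anchor box width and height, W_grid and H_grid are the feature map dimensions, and W_in and H_in are the network input dimensions.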
type('preprocessData.m')
function image = preprocessData(image, targetSize)
% Resize the images and scale the pixels to between 0 and 1.

imgSize = size(image);

% Convert an input image with single channel to 3 channels.
if numel(imgSize) < 3
    image = repmat(image,1,1,3);
end
image = im2single(imresize(image, coder.const(targetSize(1:2))));

end

References

1. Redmon, Joseph, and Ali Farhadi. “YOLOv3: An Incremental Improvement.” Preprint, submitted April 8, 2018. https://arxiv.org/abs/1804.02767.