This example shows how to generate CUDA® MEX for a you only look once (YOLO) v3 object detector with custom layers. The example uses YOLO v3 object detection to illustrate:
CUDA code generation for a deep learning network with custom layers.
Converting a deep learning dlnetwork object into a DAGNetwork object for code generation.
YOLO v3 improves upon YOLO v2 by adding detection at multiple scales to help detect smaller objects. Moreover, the loss function used for training is separated into mean squared error for bounding box regression and binary cross-entropy for object classification to help improve detection accuracy. This example uses the YOLO v3 network trained in the Object Detection Using YOLO v3 Deep Learning example from the Computer Vision Toolbox™. For more information, see Object Detection Using YOLO v3 Deep Learning (Computer Vision Toolbox).
CUDA®-enabled NVIDIA® GPU and compatible driver.
For non-MEX builds such as static libraries, dynamic libraries, or executables, this example has the following additional requirements.
NVIDIA CUDA toolkit.
NVIDIA cuDNN library.
Environment variables for the compilers and libraries. For more information, see Third-Party Hardware (GPU Coder) and Setting Up the Prerequisite Products (GPU Coder).
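For example, on a Linux® host you can set these variables from within MATLAB before generating code. This is a minimal sketch; the install paths are assumptions and depend on your machine, and NVIDIA_CUDNN is the environment variable GPU Coder uses to locate the cuDNN installation root.

setenv('PATH', [getenv('PATH') ':/usr/local/cuda/bin']);  % assumed location of nvcc
setenv('NVIDIA_CUDNN', '/usr/local/cudnn');               % assumed cuDNN install root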
To verify that the compilers and libraries for running this example are set up correctly, use the coder.checkGpuInstall (GPU Coder) function.
envCfg = coder.gpuEnvConfig('host');
envCfg.DeepLibTarget = 'cudnn';
envCfg.DeepCodegen = 1;
envCfg.Quiet = 1;
coder.checkGpuInstall(envCfg);
The YOLO v3 network in this example is based on squeezenet, and uses the feature extraction network in SqueezeNet with the addition of two detection heads at the end. The second detection head is twice the size of the first detection head, so it is better able to detect small objects. You can specify any number of detection heads of different sizes based on the size of the objects that you want to detect. The YOLO v3 network uses anchor boxes estimated from the training data so that the network has better initial priors corresponding to the type of data set and can learn to predict the boxes accurately. For information about anchor boxes, see Anchor Boxes for Object Detection (Computer Vision Toolbox).
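If you train the network on your own data, you can estimate a set of anchor boxes directly from the training data by using the estimateAnchorBoxes (Computer Vision Toolbox) function. A minimal sketch, assuming a datastore named trainingData that contains the ground truth boxes:

numAnchors = 6;  % total anchors, split across the detection heads
[anchorBoxes, meanIoU] = estimateAnchorBoxes(trainingData, numAnchors);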
The YOLO v3 network in this example is illustrated in the following diagram.
Each detection head predicts the bounding box coordinates (x, y, width, height), object confidence, and class probabilities for the respective anchor box masks. Therefore, for each detection head, the number of output filters in the last convolution layer is the number of anchor box masks times the number of prediction elements per anchor box. The detection heads comprise the output layer of the network.
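For example, this network uses three anchor box masks per detection head and detects one class, so each head predicts five box elements (x, y, width, height, and object confidence) plus one class probability per anchor. A quick check of the expected filter count:

numAnchorsPerHead = 3;             % anchor box masks assigned to each head
numClasses = 1;                    % this example detects only 'vehicle'
numPredElements = 5 + numClasses;  % x, y, w, h, confidence + class scores
numFilters = numAnchorsPerHead * numPredElements  % 18 filters per head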
Download the YOLO v3 network trained in the Object Detection Using YOLO v3 Deep Learning example. To train the network yourself, see Object Detection Using YOLO v3 Deep Learning (Computer Vision Toolbox).
pretrained = load("yolov3SqueezeNetVehicleExample_20a.mat");
net = pretrained.net;
The pretrained YOLO v3 model is a dlnetwork object, which is not supported for code generation. Extract the layer graph of the dlnetwork object by using the layerGraph function.
lGraph = layerGraph(net);
The layer graph output by the layerGraph function does not include output layers. Add a regression layer to the layer graph for each of its outputs by using the addLayers and connectLayers functions.
outLayerIdx = 1:numel(lGraph.Layers);
isOutLayer = arrayfun(@(x) any(strcmp(x.Name, net.OutputNames)), lGraph.Layers);
outLayerIdx(~isOutLayer) = [];
for iOut = 1:numel(outLayerIdx)
    outLayer = lGraph.Layers(outLayerIdx(iOut));
    newLayer = regressionLayer('Name', [outLayer.Name '_output_' num2str(iOut)]);
    lGraph = addLayers(lGraph, newLayer);
    lGraph = connectLayers(lGraph, outLayer.Name, newLayer.Name);
end
The number of output layers of the network is the same as the number of detection heads in the network.
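As a quick sanity check, you can list the regression output layers now present in the layer graph; regressionLayer creates layers of class nnet.cnn.layer.RegressionOutputLayer. A minimal sketch:

isRegOut = arrayfun(@(l) isa(l, 'nnet.cnn.layer.RegressionOutputLayer'), lGraph.Layers);
{lGraph.Layers(isRegOut).Name}  % expect one output layer per detection head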
The YOLO v3 network uses a custom upsample layer for upsampling the input image data. Here, we use a code generation compatible version of the upsampleLayer mentioned in the Object Detection Using YOLO v3 Deep Learning (Computer Vision Toolbox) example. The upsampleBy2Layer upsamples the input image by replicating the neighboring pixel values by a factor of 2 using the repelem function. Because code generation for the repelem function is supported only for vector or 2-D matrix inputs, the upsampleBy2Layer implementation must be parameterized by using the coder.target (MATLAB Coder) function. When running a simulation in MATLAB, a single call to repelem is sufficient because the input can be an N-D matrix. For code generation, the example uses nested for-loops so that the input to the repelem function is a 2-D matrix for each call to repelem in the generated code. Furthermore, the layer output size must be constant at code generation time, so the UpSampleFactor property of the layer is declared Constant.
type('upsampleBy2Layer.m')
% Upsample by replicating neighbouring pixel values.

% Copyright 2020 The MathWorks, Inc.

classdef upsampleBy2Layer < nnet.layer.Layer
    properties (Constant)
        % Factor to upsample the input.
        UpSampleFactor = 2
    end

    methods
        function layer = upsampleBy2Layer(name)
            % Set layer name.
            layer.Name = name;

            % Set layer description.
            layer.Description = "upSamplingLayer with factor " + layer.UpSampleFactor;
        end

        function Z = predict(layer, X)
            % Z = predict(layer, X) forwards the input data X through the
            % layer and outputs the result Z.
            if coder.target('MATLAB')
                Z = repelem(X, layer.UpSampleFactor, layer.UpSampleFactor);
            else
                numChannels = size(X, 3);
                numBatches = size(X, 4);
                Zsize = coder.const([size(X, 1) size(X, 2) numChannels numBatches] .* ...
                    [layer.UpSampleFactor layer.UpSampleFactor 1 1]);
                Z = coder.nullcopy(zeros(Zsize, 'like', X));

                coder.gpu.kernel(-1, -1);
                for iBatch = 1:numBatches
                    for iChannel = 1:numChannels
                        Z(:, :, iChannel, iBatch) = repelem(X(:, :, iChannel, iBatch), ...
                            layer.UpSampleFactor, layer.UpSampleFactor);
                    end
                end
            end
        end
    end
end
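Before generating code, you can sanity-check the layer in MATLAB by running a small array through its predict method and confirming that the spatial dimensions double. A minimal sketch; the layer name 'up1' and the input values are arbitrary:

layer = upsampleBy2Layer('up1');  % hypothetical layer name
X = rand(4, 4, 'single');
Z = predict(layer, X);
size(Z)  % expect [8 8]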
Replace the upsampleLayer present in the pretrained network with its code generation compatible version, upsampleBy2Layer, by using the replaceLayer function.
isUpsampleLayer = arrayfun(@(x) isa(x, 'upsampleLayer'), lGraph.Layers);
layerNameToReplace = lGraph.Layers(isUpsampleLayer).Name;
lGraph = replaceLayer(lGraph, layerNameToReplace, upsampleBy2Layer(layerNameToReplace));
Assemble the layer graph into a DAGNetwork object ready to use for code generation by using the assembleNetwork function.
dagNet = assembleNetwork(lGraph)
dagNet = 
  DAGNetwork with properties:

         Layers: [72×1 nnet.cnn.layer.Layer]
    Connections: [80×2 table]
     InputNames: {'data'}
    OutputNames: {'conv2Detection1_output_1'  'conv2Detection2_output_2'}
Save the network to a MAT-file.
matFile = 'yolov3DAGNetwork.mat';
save(matFile, 'dagNet');
The yolov3Detect Entry-Point Function

The yolov3Detect entry-point function takes an input image and passes it to a trained network for prediction through the yolov3Predict function. The yolov3Predict function loads the network object from the MAT-file into a persistent variable and reuses the persistent object for subsequent prediction calls. Specifically, the function uses the DAGNetwork representation of the network trained in the Object Detection Using YOLO v3 Deep Learning (Computer Vision Toolbox) example. The grid cell coordinate predictions obtained from the yolov3Predict calls are then converted to bounding box coordinates by using the supporting functions generateTiledAnchors and applyAnchorBoxOffsets.
type('yolov3Detect.m')
function [bboxes,scores,labels] = yolov3Detect(matFile, im, networkInputSize, networkOutputs, ...
    confidenceThreshold, overlapThreshold, classes)
% The yolov3Detect function detects the bounding boxes, scores, and labels
% in an image.

coder.extrinsic('generateYOLOv3Detections');

%% Preprocess Data
% This example applies all the preprocessing transforms to the data set
% applied during training, except data augmentation. Because the example
% uses a pretrained YOLO v3 network, the input data must be representative
% of the original data and left unmodified for unbiased evaluation.
% Specifically the following preprocessing operations are applied to the
% input data.
% 1. Resize the images to the network input size, as the images are bigger
%    than networkInputSize.
% 2. Scale the image pixels in the range [0 1].
im = preprocessData(im, networkInputSize);
imageSize = size(im,[1,2]);

%% Define Anchor Boxes
% Specify the anchor boxes estimated on the basis of the preprocessed
% training data used when training the YOLO v3 network. These anchor box
% values are the same as mentioned in
% <docid:vision_ug#mw_47d9a223-5ec7-4d36-a020-4f9d147ecdec Object Detection
% Using YOLO v3 Deep Learning> example. For details on estimating anchor
% boxes, see <docid:vision_ug#mw_f9f22f48-0ad0-4f37-8bc1-22a2046637f2
% Anchor Boxes for Object Detection>.
anchors = [150 127;
            97  90;
            68  67;
            38  42;
            41  29;
            31  23];

% Specify anchorBoxMasks to select anchor boxes to use in both the
% detection heads of the YOLO v3 network. anchorBoxMasks is a cell array of
% size M-by-1, where M denotes the number of detection heads. Each
% detection head consists of a 1-by-N array of row index of anchors in
% anchorBoxes, where N is the number of anchor boxes to use. Select anchor
% boxes for each detection head based on size: use larger anchor boxes at
% lower scale and smaller anchor boxes at higher scale. To do so, sort the
% anchor boxes with the larger anchor boxes first and assign the first
% three to the first detection head and the next three to the second
% detection head.
area = anchors(:, 1).*anchors(:, 2);
[~, idx] = sort(area, 'descend');
anchors = anchors(idx, :);
anchorBoxMasks = {[1,2,3], [4,5,6]};

%% Predict on YOLO v3
% Predict and filter the detections based on confidence threshold.
predictions = yolov3Predict(matFile,im,networkOutputs,anchorBoxMasks);

%% Generate Detections
anchorIndex = 2:5; % indices corresponding to x,y,w,h predictions for bounding boxes
tiledAnchors = generateTiledAnchors(predictions,anchors,anchorBoxMasks,anchorIndex);
predictions = applyAnchorBoxOffsets(tiledAnchors, predictions, networkInputSize, anchorIndex);
[bboxes,scores,labels] = generateYOLOv3Detections(predictions, confidenceThreshold, ...
    overlapThreshold, imageSize, classes);

% Apply suppression to the detections to filter out multiple overlapping
% detections.
if ~isempty(scores)
    [bboxes, scores, labels] = selectStrongestBboxMulticlass(bboxes, scores, labels, ...
        'RatioType', 'Union', 'OverlapThreshold', overlapThreshold);
end
end

function YPredCell = yolov3Predict(matFile,im,networkOutputs,anchorBoxMask)
% Predict the output of network and extract the confidence, x, y,
% width, height, and class.

% Load the deep learning network for prediction.
persistent net;
if isempty(net)
    net = coder.loadDeepLearningNetwork(matFile);
end

YPredictions = cell(networkOutputs, 1);
[YPredictions{:}] = predict(net, im);
YPredCell = extractPredictions(YPredictions, anchorBoxMask);

% Apply activation to the predicted cell array.
YPredCell = applyActivations(YPredCell);
end
Follow these steps to evaluate the entry-point function on an image from the test data.
Specify the confidence threshold as 0.5 to keep only detections with confidence scores above this value.
Specify the overlap threshold as 0.5 to remove overlapping detections.
Read an image from the input data.
Use the entry-point function yolov3Detect to get the predicted bounding boxes, confidence scores, and class labels.
Display the image with bounding boxes and confidence scores.
Define the desired thresholds.
confidenceThreshold = 0.5;
overlapThreshold = 0.5;
Get the network input size from the input layer of the trained network and the number of network outputs.
networkInputIdx = arrayfun(@(x)isa(x,'nnet.cnn.layer.ImageInputLayer'),net.Layers);
networkInputSize = net.Layers(networkInputIdx).InputSize;
networkOutputs = numel(dagNet.OutputNames);
Read the example image data obtained from the labeled data set from the Object Detection Using YOLO v3 Deep Learning (Computer Vision Toolbox) example. This image contains one instance of an object of type vehicle.
I = imread('vehicleImage.jpg');
Specify the class names.
classNames = {'vehicle'};
Invoke the yolov3Detect entry-point function on the image and display the results.
[bboxes,scores,~] = yolov3Detect(matFile, I, networkInputSize, networkOutputs, ...
    confidenceThreshold, overlapThreshold, classNames);

% Display the detections on the image.
IAnnotated = insertObjectAnnotation(I, 'rectangle', bboxes, scores);
figure
imshow(IAnnotated)
To generate CUDA® code for the yolov3Detect entry-point function, create a GPU code configuration object for a MEX target and set the target language to C++. Use the coder.DeepLearningConfig (GPU Coder) function to create a CuDNN deep learning configuration object and assign it to the DeepLearningConfig property of the GPU code configuration object.
cfg = coder.gpuConfig('mex');
cfg.TargetLang = 'C++';
cfg.DeepLearningConfig = coder.DeepLearningConfig('cudnn');
args = {coder.Constant(matFile), I, coder.Constant(networkInputSize), networkOutputs, ...
    confidenceThreshold, overlapThreshold, classNames};
codegen -config cfg yolov3Detect -args args
Code generation successful.
To generate CUDA® code for a TensorRT target, create and use a TensorRT deep learning configuration object instead of the CuDNN configuration object. Similarly, to generate code for an MKL-DNN target, create a CPU code configuration object and use an MKL-DNN deep learning configuration object as its DeepLearningConfig property.
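For example, the alternative targets can be configured as follows. This is a minimal sketch; the cfgTRT and cfgCPU names are illustrative, and the codegen arguments stay the same as for the CuDNN build.

% TensorRT target (requires the NVIDIA TensorRT library):
cfgTRT = coder.gpuConfig('mex');
cfgTRT.TargetLang = 'C++';
cfgTRT.DeepLearningConfig = coder.DeepLearningConfig('tensorrt');

% MKL-DNN (CPU) target:
cfgCPU = coder.config('mex');
cfgCPU.TargetLang = 'C++';
cfgCPU.DeepLearningConfig = coder.DeepLearningConfig('mkldnn');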
Call the generated CUDA MEX with the same image input I as before and display the results.
[bboxes, scores, labels] = yolov3Detect_mex(matFile, I, networkInputSize, networkOutputs, ...
    confidenceThreshold, overlapThreshold, classNames);

figure;
IAnnotated = insertObjectAnnotation(I, 'rectangle', bboxes, scores);
imshow(IAnnotated);
The utility functions listed below are similar to the ones used in the Object Detection Using YOLO v3 Deep Learning (Computer Vision Toolbox) example, except for the preprocessData function. In this example, only the image data is preprocessed, unlike in the Object Detection Using YOLO v3 Deep Learning (Computer Vision Toolbox) example, where the bounding boxes are also processed.
type('applyActivations.m')
function YPredCell = applyActivations(YPredCell)
for idx = 1:3
    YPredCell{:, idx} = sigmoidActivation(YPredCell{:,idx});
end
for idx = 4:5
    YPredCell{:, idx} = exp(YPredCell{:, idx});
end
YPredCell{:, 6} = sigmoidActivation(YPredCell{:, 6});
end

function out = sigmoidActivation(x)
out = 1./(1+exp(-x));
end
type('extractPredictions.m')
function predictions = extractPredictions(YPredictions, anchorBoxMask)
predictions = cell(size(YPredictions, 1),6);
for ii = 1:size(YPredictions, 1)
    numAnchors = size(anchorBoxMask{ii},2);

    % Confidence scores.
    startIdx = 1;
    endIdx = numAnchors;
    predictions{ii,1} = YPredictions{ii}(:,:,startIdx:endIdx,:);

    % X positions.
    startIdx = startIdx + numAnchors;
    endIdx = endIdx + numAnchors;
    predictions{ii,2} = YPredictions{ii}(:,:,startIdx:endIdx,:);

    % Y positions.
    startIdx = startIdx + numAnchors;
    endIdx = endIdx + numAnchors;
    predictions{ii,3} = YPredictions{ii}(:,:,startIdx:endIdx,:);

    % Width.
    startIdx = startIdx + numAnchors;
    endIdx = endIdx + numAnchors;
    predictions{ii,4} = YPredictions{ii}(:,:,startIdx:endIdx,:);

    % Height.
    startIdx = startIdx + numAnchors;
    endIdx = endIdx + numAnchors;
    predictions{ii,5} = YPredictions{ii}(:,:,startIdx:endIdx,:);

    % Class probabilities.
    startIdx = startIdx + numAnchors;
    predictions{ii,6} = YPredictions{ii}(:,:,startIdx:end,:);
end
end
type('generateTiledAnchors.m')
function tiledAnchors = generateTiledAnchors(YPredCell,anchorBoxes,anchorBoxMask,anchorIndex)
% Generate tiled anchor offsets for converting the predictions from the
% YOLO v3 grid cell coordinates to bounding box coordinates.
tiledAnchors = cell(size(anchorIndex));
for i = 1:size(YPredCell,1)
    anchors = anchorBoxes(anchorBoxMask{i}, :);
    [h,w,~,n] = size(YPredCell{i,1});
    [tiledAnchors{i,2}, tiledAnchors{i,1}] = ndgrid(0:h-1, 0:w-1, 1:size(anchors,1), 1:n);
    [~,~,tiledAnchors{i,3}] = ndgrid(0:h-1, 0:w-1, anchors(:,2), 1:n);
    [~,~,tiledAnchors{i,4}] = ndgrid(0:h-1, 0:w-1, anchors(:,1), 1:n);
end
end
type('applyAnchorBoxOffsets.m')
function YPredCell = applyAnchorBoxOffsets(tiledAnchors,YPredCell,inputImageSize,anchorIndex)
% Convert the predictions from the YOLO v3 grid cell coordinates to
% bounding box coordinates.
for i = 1:size(YPredCell,1)
    [h,w,~,~] = size(YPredCell{i,1});
    YPredCell{i,anchorIndex(1)} = (tiledAnchors{i,1} + YPredCell{i,anchorIndex(1)})./w;
    YPredCell{i,anchorIndex(2)} = (tiledAnchors{i,2} + YPredCell{i,anchorIndex(2)})./h;
    YPredCell{i,anchorIndex(3)} = (tiledAnchors{i,3}.*YPredCell{i,anchorIndex(3)})./inputImageSize(2);
    YPredCell{i,anchorIndex(4)} = (tiledAnchors{i,4}.*YPredCell{i,anchorIndex(4)})./inputImageSize(1);
end
end
type('preprocessData.m')
function image = preprocessData(image, targetSize)
% Resize the images and scale the pixels to between 0 and 1.
imgSize = size(image);

% Convert an input image with a single channel to 3 channels. A grayscale
% image has only two size dimensions (height and width).
if numel(imgSize) < 3
    image = repmat(image, 1, 1, 3);
end

image = im2single(imresize(image, coder.const(targetSize(1:2))));
end
1. Redmon, Joseph, and Ali Farhadi. “YOLOv3: An Incremental Improvement.” Preprint, submitted April 8, 2018. https://arxiv.org/abs/1804.02767.