This example shows how to train a YOLO v3 object detector.
Deep learning is a powerful machine learning technique that you can use to train robust object detectors. Several techniques for object detection exist, including Faster R-CNN, you only look once (YOLO) v2, and single shot detector (SSD). This example shows how to train a YOLO v3 object detector. YOLO v3 improves upon YOLO v2 by adding detection at multiple scales to help detect smaller objects. Moreover, the loss function used for training is separated into mean squared error for bounding box regression and binary cross-entropy for object classification to help improve detection accuracy.
Download a pretrained network to avoid having to wait for training to complete. If you want to train the network, set the doTraining variable to true.
doTraining = false;

if ~doTraining
    if ~exist('yolov3SqueezeNetVehicleExample_20a.mat','file')
        disp('Downloading pretrained detector (8.9 MB)...');
        pretrainedURL = 'https://www.mathworks.com/supportfiles/vision/data/yolov3SqueezeNetVehicleExample_20a.mat';
        websave('yolov3SqueezeNetVehicleExample_20a.mat', pretrainedURL);
    end
    pretrained = load("yolov3SqueezeNetVehicleExample_20a.mat");
    net = pretrained.net;
end
Downloading pretrained detector (8.9 MB)...
This example uses a small labeled data set that contains 295 images. Each image contains one or two labeled instances of a vehicle. A small data set is useful for exploring the YOLO v3 training procedure, but in practice, more labeled images are needed to train a robust network.
Unzip the vehicle images and load the vehicle ground truth data.
unzip vehicleDatasetImages.zip
data = load('vehicleDatasetGroundTruth.mat');
vehicleDataset = data.vehicleDataset;

% Add the full path to the local vehicle data folder.
vehicleDataset.imageFilename = fullfile(pwd,vehicleDataset.imageFilename);
Split the data set into a training set for training the network and a test set for evaluating the network. Use 60% of the data for the training set and the rest for the test set.
rng(0);
shuffledIndices = randperm(height(vehicleDataset));
idx = floor(0.6 * length(shuffledIndices));
trainingDataTbl = vehicleDataset(shuffledIndices(1:idx),:);
testDataTbl = vehicleDataset(shuffledIndices(idx+1:end),:);
Create an image datastore for loading the images.
imdsTrain = imageDatastore(trainingDataTbl.imageFilename);
imdsTest = imageDatastore(testDataTbl.imageFilename);
Create a datastore for the ground truth bounding boxes.
bldsTrain = boxLabelDatastore(trainingDataTbl(:, 2:end));
bldsTest = boxLabelDatastore(testDataTbl(:, 2:end));
Specify the mini-batch size. Set the ReadSize of the training image datastore and the box label datastore equal to the mini-batch size.
miniBatchSize = 8;
imdsTrain.ReadSize = miniBatchSize;
bldsTrain.ReadSize = miniBatchSize;
Combine the image and box label datastores.
trainingData = combine(imdsTrain, bldsTrain);
testData = combine(imdsTest, bldsTest);
Data augmentation is used to improve network accuracy by randomly transforming the original data during training. By using data augmentation, you can add more variety to the training data without actually having to increase the number of labeled training samples.
Use the transform function to apply custom data augmentations to the training data. The augmentData helper function, listed at the end of the example, applies the following augmentations to the input data.
Color jitter augmentation in HSV space
Random horizontal flip
Random scaling by 10 percent
augmentedTrainingData = transform(trainingData,@augmentData);
Read the same image four times and display the augmented training data.
% Visualize the augmented images.
augmentedData = cell(4,1);
for k = 1:4
    data = read(augmentedTrainingData);
    augmentedData{k} = insertShape(data{1,1},'Rectangle',data{1,2});
    reset(augmentedTrainingData);
end
figure
montage(augmentedData,'BorderSize',10)
Specify the network input size. When choosing the network input size, consider the minimum size required to run the network itself, the size of the training images, and the computational cost incurred by processing data at the selected size. When feasible, choose a network input size that is close to the size of the training image and larger than the input size required for the network. To reduce the computational cost of running the example, specify a network input size of [227 227 3].
networkInputSize = [227 227 3];
Preprocess the augmented training data to prepare for training. The preprocessData helper function, listed at the end of the example, applies the following preprocessing operations to the input data.
Resize the images to the network input size, as the images are bigger than 227-by-227.
Scale the image pixels in the range [0 1].
preprocessedTrainingData = transform(augmentedTrainingData, @(data)preprocessData(data, networkInputSize));
Read the preprocessed training data.
data = read(preprocessedTrainingData);
Display the image with the bounding boxes.
I = data{1,1};
bbox = data{1,2};
annotatedImage = insertShape(I,'Rectangle',bbox);
annotatedImage = imresize(annotatedImage,2);
figure
imshow(annotatedImage)
The YOLO v3 network in this example is based on SqueezeNet, and uses the feature extraction network in SqueezeNet with the addition of two detection heads at the end. The second detection head is twice the size of the first detection head, so it is better able to detect small objects. Note that you can specify any number of detection heads of different sizes based on the size of the objects that you want to detect. The YOLO v3 network uses anchor boxes estimated using training data to have better initial priors corresponding to the type of data set and to help the network learn to predict the boxes accurately. For information about anchor boxes, see Anchor Boxes for Object Detection.
The YOLO v3 network in this example is illustrated in the following diagram.
You can use Deep Network Designer to create the network shown in the diagram.
First, use transform to preprocess the training data for computing the anchor boxes, as the training images used in this example are bigger than 227-by-227 and vary in size. Specify the number of anchors as 6 to achieve a good tradeoff between the number of anchors and the mean IoU (the optional sweep shown after the estimated anchor boxes below illustrates this tradeoff). Use the estimateAnchorBoxes function to estimate the anchor boxes. For details on estimating anchor boxes, see Estimate Anchor Boxes From Training Data.
trainingDataForEstimation = transform(trainingData,@(data)preprocessData(data,networkInputSize));
numAnchors = 6;
[anchorBoxes, meanIoU] = estimateAnchorBoxes(trainingDataForEstimation, numAnchors)
anchorBoxes = 6×2
35 25
165 138
73 70
151 125
113 103
42 38
meanIoU = 0.8369
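If you want to see the tradeoff mentioned above, you can optionally sweep the number of anchors and plot the resulting mean IoU. This sketch is only an illustration and is not required for the rest of the example; the variable names maxNumAnchors and meanIoUs are introduced here for the sketch only.

% Optional: illustrate the tradeoff between the number of anchors and mean IoU.
maxNumAnchors = 9;
meanIoUs = zeros(maxNumAnchors,1);
for k = 1:maxNumAnchors
    [~, meanIoUs(k)] = estimateAnchorBoxes(trainingDataForEstimation, k);
end
figure
plot(1:maxNumAnchors, meanIoUs, '-o')
xlabel('Number of Anchors')
ylabel('Mean IoU')
title('Number of Anchors vs. Mean IoU')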
Specify anchorBoxMasks to select the anchor boxes to use in each of the detection heads. anchorBoxMasks is an [Mx1] cell array, where M denotes the number of detection heads. Each detection head consists of a [1xN] array of row indices of the anchors in anchorBoxes, where N is the number of anchor boxes to use. Select anchor boxes for each detection head based on size: use larger anchor boxes at the lower scale and smaller anchor boxes at the higher scale. To do so, sort the anchor boxes with the larger anchor boxes first and assign the first three to the first detection head and the next three to the second detection head.
area = anchorBoxes(:, 1).*anchorBoxes(:, 2);
[~, idx] = sort(area, 'descend');
anchorBoxes = anchorBoxes(idx, :);
anchorBoxMasks = {[1,2,3]
[4,5,6]
};
Load the SqueezeNet network pretrained on the ImageNet data set. You can also choose to load a different pretrained network, such as MobileNet-v2 or ResNet-18. YOLO v3 performs better and trains faster when you use a pretrained network.
Next, create the feature extraction network. Choosing the optimal feature extraction layer requires trial and error, and you can use analyzeNetwork to find the names of potential feature extraction layers within a network. For this example, use the squeezenetFeatureExtractor helper function, listed at the end of this example, to remove the layers after the feature extraction layer 'fire9-concat'. The layers after this layer are specific to classification tasks and do not help with object detection.
baseNetwork = squeezenet;
lgraph = squeezenetFeatureExtractor(baseNetwork, networkInputSize);
Then specify the class names, the number of object classes to detect, and the number of prediction elements per anchor box, and add the detection heads to the feature extraction network. Each detection head predicts the bounding box coordinates (x, y, width, height), object confidence, and class probabilities for its anchor box mask. Therefore, for each detection head, the number of output filters in the last convolution layer is the number of anchor boxes in its mask times the number of prediction elements per anchor box. In this example, each head uses 3 anchor boxes and there is 1 class, so each head has 3 × (5 + 1) = 18 output filters. Use the supporting functions addFirstDetectionHead and addSecondDetectionHead to add the detection heads to the feature extraction network.
classNames = trainingDataTbl.Properties.VariableNames(2:end);
numClasses = size(classNames, 2);
numPredictorsPerAnchor = 5 + numClasses;
lgraph = addFirstDetectionHead(lgraph, anchorBoxMasks{1}, numPredictorsPerAnchor);
lgraph = addSecondDetectionHead(lgraph, anchorBoxMasks{2}, numPredictorsPerAnchor);
Finally, connect the detection heads by connecting the first detection head to the feature extraction layer and the second detection head to the output of the first detection head. In addition, merge the upsampled features in the second detection head with the features from the 'fire5-concat' layer to get more meaningful semantic information in the second detection head.
lgraph = connectLayers(lgraph,'fire9-concat','conv1Detection1');
lgraph = connectLayers(lgraph,'relu1Detection1','upsample1Detection2');
lgraph = connectLayers(lgraph,'fire5-concat','depthConcat1Detection2/in2');
The detection heads comprise the output layers of the network. To extract output features, specify the names of the detection heads using an [Mx1] array, where M is the number of detection heads. Specify the names of the detection heads in the order in which they occur in the network.
networkOutputs = ["conv2Detection1" "conv2Detection2" ];
Specify these training options.
Set the number of iterations to 2000.
Set the learning rate to 0.001.
Set the warmup period to 1000 iterations. This parameter denotes the number of iterations over which to increase the learning rate exponentially, based on the formula learningRate × (iteration / warmupPeriod)^4. It helps to stabilize the gradients at higher learning rates.
Set the L2 regularization factor to 0.0005.
Specify the penalty threshold as 0.5. Detections that overlap less than 0.5 with the ground truth are penalized.
Initialize the velocity of the gradient as []. This is used by SGDM to store the velocity of the gradients.
numIterations = 2000;
learningRate = 0.001;
warmupPeriod = 1000;
l2Regularization = 0.0005;
penaltyThreshold = 0.5;
velocity = [];
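To see how the warmup and learning rate options interact, you can optionally plot the learning rate schedule produced by the piecewiseLearningRateWithWarmup supporting function (listed at the end of the example). This sketch is only an illustration and is not required for training; the variable lrSchedule is introduced here for the sketch only.

% Optional: visualize the learning rate schedule with warmup.
lrSchedule = arrayfun(@(it) piecewiseLearningRateWithWarmup(it, ...
    learningRate, warmupPeriod, numIterations), 1:numIterations);
figure
plot(1:numIterations, lrSchedule)
xlabel('Iteration')
ylabel('Learning Rate')
title('Piecewise Learning Rate Schedule with Warmup')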
Train on a GPU, if one is available. Using a GPU requires Parallel Computing Toolbox™ and a CUDA® enabled NVIDIA® GPU with compute capability 3.0 or higher. To automatically detect if you have a GPU available, set executionEnvironment to "auto". If you do not have a GPU, or do not want to use one for training, set executionEnvironment to "cpu". To ensure the use of a GPU for training, set executionEnvironment to "gpu".
executionEnvironment = "auto";
To train the network with a custom training loop and enable automatic differentiation, convert the layer graph to a dlnetwork object. Then create the training progress plotter using the supporting function configureTrainingProgressPlotter.
Finally, specify the custom training loop. For each iteration:
Read from preprocessedTrainingData and create a batch of images and ground truth boxes using the createBatchData supporting function.
Convert the batch of images to dlarray objects with underlying type single and specify the dimension labels 'SSCB' (spatial, spatial, channel, batch).
For GPU training, convert the data to gpuArray objects.
Evaluate the model gradients using dlfeval and the modelGradients function. The modelGradients function, listed as a supporting function, returns the gradients of the loss with respect to the learnable parameters in net, the corresponding mini-batch loss, and the state of the current batch.
Apply a weight decay factor to the gradients for regularization and more robust training.
Determine the learning rate based on the iteration number using the piecewiseLearningRateWithWarmup supporting function.
Update the network parameters using the sgdmupdate function.
Update the state parameters of net with the moving average.
Update the training progress plot.
if doTraining
    % Convert layer graph to dlnetwork.
    net = dlnetwork(lgraph);

    % Create subplots for the learning rate and mini-batch loss.
    fig = figure;
    [lossPlotter, learningRatePlotter] = configureTrainingProgressPlotter(fig);

    % Custom training loop.
    for iteration = 1:numIterations
        % Reset datastore.
        if ~hasdata(preprocessedTrainingData)
            reset(preprocessedTrainingData);
        end

        % Read batch of data and create batch of images and
        % ground truths.
        data = read(preprocessedTrainingData);
        [XTrain,YTrain] = createBatchData(data, classNames);

        % Convert mini-batch of data to dlarray.
        XTrain = dlarray(single(XTrain),'SSCB');

        % If training on a GPU, then convert data to gpuArray.
        if (executionEnvironment == "auto" && canUseGPU) || executionEnvironment == "gpu"
            XTrain = gpuArray(XTrain);
        end

        % Evaluate the model gradients and loss using dlfeval and the
        % modelGradients function.
        [gradients,loss,state] = dlfeval(@modelGradients, net, XTrain, YTrain, anchorBoxes, anchorBoxMasks, penaltyThreshold, networkOutputs);

        % Apply L2 regularization.
        gradients = dlupdate(@(g,w) g + l2Regularization*w, gradients, net.Learnables);

        % Determine the current learning rate value.
        currentLR = piecewiseLearningRateWithWarmup(iteration, learningRate, warmupPeriod, numIterations);

        % Update the network learnable parameters using the SGDM optimizer.
        [net, velocity] = sgdmupdate(net, gradients, velocity, currentLR);

        % Update the state parameters of dlnetwork.
        net.State = state;

        % Update training plot with new points.
        addpoints(lossPlotter, iteration, double(gather(extractdata(loss))));
        addpoints(learningRatePlotter, iteration, currentLR);
        drawnow
    end
end
Computer Vision Toolbox™ provides object detector evaluation functions to measure common metrics such as average precision (evaluateDetectionPrecision) and log-average miss rate (evaluateDetectionMissRate). This example uses the average precision metric. The average precision provides a single number that incorporates the ability of the detector to make correct classifications (precision) and the ability of the detector to find all relevant objects (recall).
Follow these steps to evaluate the trained dlnetwork object net on the test data.
Specify the confidence threshold as 0.5 to keep only detections with confidence scores above this value.
Specify the overlap threshold as 0.5 to remove overlapping detections.
Apply the same preprocessing transform to the test data as for the training data. Note that data augmentation is not applied to the test data. Test data must be representative of the original data and be left unmodified for unbiased evaluation.
Collect the detection results by running the detector on preprocessedTestData. Use the supporting function yolov3Detect to get the bounding boxes, object confidence scores, and class labels.
Call evaluateDetectionPrecision with the predicted results and preprocessedTestData as arguments.
confidenceThreshold = 0.5;
overlapThreshold = 0.5;

% Create the test datastore.
preprocessedTestData = transform(testData,@(data)preprocessData(data,networkInputSize));

% Create a table to hold the bounding boxes, scores, and labels returned by
% the detector.
numImages = size(testDataTbl,1);
results = table('Size',[numImages 3],...
    'VariableTypes',{'cell','cell','cell'},...
    'VariableNames',{'Boxes','Scores','Labels'});

% Run detector on each image in the test set and collect results.
for i = 1:numImages
    % Read the datastore and get the image.
    data = read(preprocessedTestData);
    I = data{1};

    % Convert to dlarray. If GPU is available, then convert data to gpuArray.
    XTest = dlarray(I,'SSCB');
    if (executionEnvironment == "auto" && canUseGPU) || executionEnvironment == "gpu"
        XTest = gpuArray(XTest);
    end

    % Run the detector.
    [bboxes, scores, labels] = yolov3Detect(net, XTest, networkOutputs, anchorBoxes, anchorBoxMasks, confidenceThreshold, overlapThreshold, classNames);

    % Collect the results.
    results.Boxes{i} = bboxes;
    results.Scores{i} = scores;
    results.Labels{i} = labels;
end

% Evaluate the object detector using Average Precision metric.
[ap, recall, precision] = evaluateDetectionPrecision(results, preprocessedTestData);
The precision-recall (PR) curve shows how precise a detector is at varying levels of recall. Ideally, the precision is 1 at all recall levels.
% Plot precision-recall curve.
figure
plot(recall, precision)
xlabel('Recall')
ylabel('Precision')
grid on
title(sprintf('Average Precision = %.2f', ap))
Use the network for object detection.
Read an image.
Convert the image to a dlarray and use a GPU if one is available.
Use the supporting function yolov3Detect to get the predicted bounding boxes, confidence scores, and class labels.
Display the image with bounding boxes and confidence scores.
% Read the datastore.
reset(preprocessedTestData)
data = read(preprocessedTestData);

% Get the image.
I = data{1};

% Convert to dlarray.
XTest = dlarray(I,'SSCB');

% If GPU is available, then convert data to gpuArray.
if (executionEnvironment == "auto" && canUseGPU) || executionEnvironment == "gpu"
    XTest = gpuArray(XTest);
end

[bboxes, scores, labels] = yolov3Detect(net, XTest, networkOutputs, anchorBoxes, anchorBoxMasks, confidenceThreshold, overlapThreshold, classNames);

% Display the detections on image.
if ~isempty(scores)
    I = insertObjectAnnotation(I, 'rectangle', bboxes, scores);
end
figure
imshow(I)
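As a variation, the following sketch runs the detector on an arbitrary image file rather than on an image from the test datastore. The file name 'testImage.jpg' is a placeholder for an RGB image of your own, and the variables Inew, XNew, bboxesNew, scoresNew, and labelsNew are introduced here for the sketch only; the manual resizing and im2single conversion mirror the preprocessData helper used earlier.

% Optional sketch: detect vehicles in your own RGB image. Replace
% 'testImage.jpg' with the name of an image file on your path.
Inew = imread('testImage.jpg');
Inew = im2single(imresize(Inew, networkInputSize(1:2)));
XNew = dlarray(Inew,'SSCB');
if (executionEnvironment == "auto" && canUseGPU) || executionEnvironment == "gpu"
    XNew = gpuArray(XNew);
end
[bboxesNew, scoresNew, labelsNew] = yolov3Detect(net, XNew, networkOutputs, anchorBoxes, anchorBoxMasks, confidenceThreshold, overlapThreshold, classNames);
if ~isempty(scoresNew)
    Inew = insertObjectAnnotation(Inew, 'rectangle', bboxesNew, scoresNew);
end
figure
imshow(Inew)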
The modelGradients function takes the dlnetwork object net, a mini-batch of input data XTrain with corresponding ground truth boxes YTrain, the anchor boxes, the anchor box masks, the specified penalty threshold, and the network output names as input arguments, and returns the gradients of the loss with respect to the learnable parameters in net, the corresponding mini-batch loss, and the state of the current batch.
The model gradients function computes the total loss and gradients by performing these operations.
Generate predictions from the input batch of images using the supporting function yolov3Forward.
Collect the predictions on the CPU for postprocessing.
Convert the predictions from the YOLO v3 grid cell coordinates to bounding box coordinates to allow easy comparison with the ground truth data by using the supporting functions generateTiledAnchors and applyAnchorBoxOffsets.
Generate targets for loss computation by using the converted predictions and the ground truth data. These targets are generated for bounding box positions (x, y, width, height), object confidence, and class probabilities. See the supporting function generateTargets.
Calculate the mean squared error of the predicted bounding box coordinates with the target boxes. See the supporting function bboxOffsetLoss.
Determine the binary cross-entropy of the predicted object confidence score with the target object confidence score. See the supporting function objectnessLoss.
Determine the binary cross-entropy of the predicted class of the object with the target. See the supporting function classConfidenceLoss.
Compute the total loss as the sum of all losses.
Compute the gradients of the learnables with respect to the total loss.
function [gradients, totalLoss, state] = modelGradients(net, XTrain, YTrain, anchors, mask, penaltyThreshold, networkOutputs)
inputImageSize = size(XTrain,1:2);

% Extract the predictions from the network.
[YPredCell, state] = yolov3Forward(net,XTrain,networkOutputs,mask);

% Gather the activations in the CPU for post processing and extract dlarray data.
gatheredPredictions = cellfun(@gather, YPredCell(:,1:6),'UniformOutput',false);
gatheredPredictions = cellfun(@extractdata, gatheredPredictions,'UniformOutput',false);

% Convert predictions from grid cell coordinates to box coordinates.
tiledAnchors = generateTiledAnchors(gatheredPredictions(:,2:5),anchors,mask);
gatheredPredictions(:,2:5) = applyAnchorBoxOffsets(tiledAnchors, gatheredPredictions(:,2:5), inputImageSize);

% Generate targets for predictions from the ground truth data.
[boxTarget, objectnessTarget, classTarget, objectMaskTarget, boxErrorScale] = generateTargets(gatheredPredictions, YTrain, inputImageSize, anchors, mask, penaltyThreshold);

% Compute the loss.
boxLoss = bboxOffsetLoss(YPredCell(:,[2 3 7 8]),boxTarget,objectMaskTarget,boxErrorScale);
objLoss = objectnessLoss(YPredCell(:,1),objectnessTarget,objectMaskTarget);
clsLoss = classConfidenceLoss(YPredCell(:,6),classTarget,objectMaskTarget);
totalLoss = boxLoss + objLoss + clsLoss;

% Compute gradients of learnables with regard to loss.
gradients = dlgradient(totalLoss, net.Learnables);
end

function [YPredCell, state] = yolov3Forward(net, XTrain, networkOutputs, anchorBoxMask)
% Predict the output of network and extract the confidence score, x, y,
% width, height, and class.
YPredictions = cell(size(networkOutputs));
[YPredictions{:}, state] = forward(net, XTrain, 'Outputs', networkOutputs);
YPredCell = extractPredictions(YPredictions, anchorBoxMask);

% Append predicted width and height to the end as they are required
% for computing the loss.
YPredCell(:,7:8) = YPredCell(:,4:5);

% Apply sigmoid and exponential activation.
YPredCell(:,1:6) = applyActivations(YPredCell(:,1:6));
end

function boxLoss = bboxOffsetLoss(boxPredCell, boxDeltaTarget, boxMaskTarget, boxErrorScaleTarget)
% Mean squared error for bounding box position.
lossX = sum(cellfun(@(a,b,c,d) mse(a.*c.*d,b.*c.*d),boxPredCell(:,1),boxDeltaTarget(:,1),boxMaskTarget(:,1),boxErrorScaleTarget));
lossY = sum(cellfun(@(a,b,c,d) mse(a.*c.*d,b.*c.*d),boxPredCell(:,2),boxDeltaTarget(:,2),boxMaskTarget(:,1),boxErrorScaleTarget));
lossW = sum(cellfun(@(a,b,c,d) mse(a.*c.*d,b.*c.*d),boxPredCell(:,3),boxDeltaTarget(:,3),boxMaskTarget(:,1),boxErrorScaleTarget));
lossH = sum(cellfun(@(a,b,c,d) mse(a.*c.*d,b.*c.*d),boxPredCell(:,4),boxDeltaTarget(:,4),boxMaskTarget(:,1),boxErrorScaleTarget));
boxLoss = lossX + lossY + lossW + lossH;
end

function objLoss = objectnessLoss(objectnessPredCell, objectnessDeltaTarget, boxMaskTarget)
% Binary cross-entropy loss for objectness score.
objLoss = sum(cellfun(@(a,b,c) crossentropy(a.*c,b.*c,'TargetCategories','independent'),objectnessPredCell,objectnessDeltaTarget,boxMaskTarget(:,2)));
end

function clsLoss = classConfidenceLoss(classPredCell, classTarget, boxMaskTarget)
% Binary cross-entropy loss for class confidence score.
clsLoss = sum(cellfun(@(a,b,c) crossentropy(a.*c,b.*c,'TargetCategories','independent'),classPredCell,classTarget,boxMaskTarget(:,3)));
end
function data = augmentData(A)
% Apply random horizontal flipping and random X/Y scaling. Boxes that get
% scaled outside the bounds are clipped if the overlap is above 0.25. Also,
% jitter image color.
data = cell(size(A));
for ii = 1:size(A,1)
    I = A{ii,1};
    bboxes = A{ii,2};
    labels = A{ii,3};
    sz = size(I);

    if numel(sz) == 3 && sz(3) == 3
        I = jitterColorHSV(I,...
            'Contrast',0.0,...
            'Hue',0.1,...
            'Saturation',0.2,...
            'Brightness',0.2);
    end

    % Randomly flip and scale image.
    tform = randomAffine2d('XReflection',true,'Scale',[1 1.1]);
    rout = affineOutputView(sz,tform,'BoundsStyle','centerOutput');
    I = imwarp(I,tform,'OutputView',rout);

    % Apply same transform to boxes.
    [bboxes,indices] = bboxwarp(bboxes,tform,rout,'OverlapThreshold',0.25);
    labels = labels(indices);

    % Return original data only when all boxes are removed by warping.
    if isempty(indices)
        data(ii,:) = A(ii,:);
    else
        data(ii,:) = {I, bboxes, labels};
    end
end
end

function data = preprocessData(data, targetSize)
% Resize the images and scale the pixels to between 0 and 1. Also scale the
% corresponding bounding boxes.
for ii = 1:size(data,1)
    I = data{ii,1};
    imgSize = size(I);

    % Convert an input image with single channel to 3 channels.
    if numel(imgSize) < 3
        I = repmat(I,1,1,3);
    end
    bboxes = data{ii,2};

    I = im2single(imresize(I,targetSize(1:2)));
    scale = targetSize(1:2)./imgSize(1:2);
    bboxes = bboxresize(bboxes,scale);

    data(ii,1:2) = {I, bboxes};
end
end

function [x,y] = createBatchData(data, classNames)
% The createBatchData function creates a batch of images and ground truths
% from input data, which is a [Nx3] cell array returned by the transformed
% datastore for YOLO v3. It creates two 4-D arrays by concatenating all the
% images and ground truth boxes along the batch dimension. The function
% performs these operations on the bounding boxes before concatenating
% along the fourth dimension:
% * Convert the class names to numeric class IDs based on their position in
%   the class names.
% * Combine the ground truth boxes and class IDs into a single cell array
%   of responses.
% * Pad with zeros to make the number of ground truths consistent across
%   a mini-batch.

% Concatenate images along the batch dimension.
x = cat(4,data{:,1});

% Get class IDs from the class names.
groundTruthClasses = data(:,3);
classNames = repmat({categorical(classNames')},size(groundTruthClasses));
[~,classIndices] = cellfun(@(a,b)ismember(a,b),groundTruthClasses,classNames,'UniformOutput',false);

% Append the label indexes to the scaled bounding boxes
% and create a single cell array of responses.
groundTruthBoxes = data(:,2);
combinedResponses = cellfun(@(bbox,classid)[bbox,classid],groundTruthBoxes,classIndices,'UniformOutput',false);
len = max(cellfun(@(x)size(x,1), combinedResponses));
paddedBBoxes = cellfun(@(v) padarray(v,[len-size(v,1),0],0,'post'), combinedResponses,'UniformOutput',false);
y = cat(4,paddedBBoxes{:,1});
end
function lgraph = squeezenetFeatureExtractor(net, imageInputSize)
% The squeezenetFeatureExtractor function removes the layers after 'fire9-concat'
% in SqueezeNet and also removes any data normalization used by the image input layer.

% Convert to layerGraph.
lgraph = layerGraph(net);

lgraph = removeLayers(lgraph, {'drop9' 'conv10' 'relu_conv10' 'pool10' 'prob' 'ClassificationLayer_predictions'});
inputLayer = imageInputLayer(imageInputSize,'Normalization','none','Name','data');
lgraph = replaceLayer(lgraph,'data',inputLayer);
end

function lgraph = addFirstDetectionHead(lgraph,anchorBoxMasks,numPredictorsPerAnchor)
% The addFirstDetectionHead function adds the first detection head.
numAnchorsScale1 = size(anchorBoxMasks, 2);

% Compute the number of filters for the last convolution layer.
numFilters = numAnchorsScale1*numPredictorsPerAnchor;
firstDetectionSubNetwork = [
    convolution2dLayer(3,256,'Padding','same','Name','conv1Detection1','WeightsInitializer','he')
    reluLayer('Name','relu1Detection1')
    convolution2dLayer(1,numFilters,'Padding','same','Name','conv2Detection1','WeightsInitializer','he')
    ];
lgraph = addLayers(lgraph,firstDetectionSubNetwork);
end

function lgraph = addSecondDetectionHead(lgraph,anchorBoxMasks,numPredictorsPerAnchor)
% The addSecondDetectionHead function adds the second detection head.
numAnchorsScale2 = size(anchorBoxMasks, 2);

% Compute the number of filters for the last convolution layer.
numFilters = numAnchorsScale2*numPredictorsPerAnchor;
secondDetectionSubNetwork = [
    upsampleLayer(2,'upsample1Detection2')
    depthConcatenationLayer(2,'Name','depthConcat1Detection2')
    convolution2dLayer(3,128,'Padding','same','Name','conv1Detection2','WeightsInitializer','he')
    reluLayer('Name','relu1Detection2')
    convolution2dLayer(1,numFilters,'Padding','same','Name','conv2Detection2','WeightsInitializer','he')
    ];
lgraph = addLayers(lgraph,secondDetectionSubNetwork);
end
function currentLR = piecewiseLearningRateWithWarmup(iteration, learningRate, warmupPeriod, numIterations)
% The piecewiseLearningRateWithWarmup function computes the current
% learning rate based on the iteration number.

if iteration <= warmupPeriod
    % Increase the learning rate over the number of iterations in the warmup period.
    currentLR = learningRate * ((iteration/warmupPeriod)^4);
elseif iteration >= warmupPeriod && iteration < warmupPeriod+floor(0.6*(numIterations-warmupPeriod))
    % After the warmup period, keep the learning rate constant for the first
    % 60 percent of the remaining iterations.
    currentLR = learningRate;
elseif iteration >= warmupPeriod+floor(0.6*(numIterations-warmupPeriod)) && iteration < warmupPeriod+floor(0.9*(numIterations-warmupPeriod))
    % Between 60 and 90 percent of the remaining iterations, multiply the
    % learning rate by 0.1.
    currentLR = learningRate*0.1;
else
    % For the last 10 percent of the iterations, multiply the learning
    % rate by 0.01.
    currentLR = learningRate*0.01;
end
end
function [bboxes,scores,labels] = yolov3Detect(net, XTest, networkOutputs, anchors, anchorBoxMask, confidenceThreshold, overlapThreshold, classes)
% The yolov3Detect function detects the bounding boxes, scores, and labels in an image.
imageSize = size(XTest,[1,2]);

% Find the input image layer and get the network input size.
networkInputIdx = arrayfun(@(x)isa(x,'nnet.cnn.layer.ImageInputLayer'), net.Layers);
networkInputSize = net.Layers(networkInputIdx).InputSize;

% Predict and filter the detections based on confidence threshold.
predictions = yolov3Predict(net,XTest,networkOutputs,anchorBoxMask);
predictions = cellfun(@gather, predictions,'UniformOutput',false);
predictions = cellfun(@extractdata, predictions,'UniformOutput',false);
tiledAnchors = generateTiledAnchors(predictions(:,2:5),anchors,anchorBoxMask);
predictions(:,2:5) = applyAnchorBoxOffsets(tiledAnchors, predictions(:,2:5), networkInputSize);
[bboxes,scores,labels] = generateYOLOv3Detections(predictions, confidenceThreshold, imageSize, classes);

% Apply suppression to the detections to filter out multiple overlapping
% detections.
if ~isempty(scores)
    [bboxes, scores, labels] = selectStrongestBboxMulticlass(bboxes, scores, labels,...
        'RatioType','Union','OverlapThreshold',overlapThreshold);
end
end

function YPredCell = yolov3Predict(net,XTrain,networkOutputs,anchorBoxMask)
% Predict the output of network and extract the confidence, x, y,
% width, height, and class.
YPredictions = cell(size(networkOutputs));
[YPredictions{:}] = predict(net, XTrain);
YPredCell = extractPredictions(YPredictions, anchorBoxMask);

% Apply activation to the predicted cell array.
YPredCell = applyActivations(YPredCell);
end
function YPredCell = applyActivations(YPredCell)
YPredCell(:,1:3) = cellfun(@sigmoid, YPredCell(:,1:3),'UniformOutput',false);
YPredCell(:,4:5) = cellfun(@exp, YPredCell(:,4:5),'UniformOutput',false);
YPredCell(:,6) = cellfun(@sigmoid, YPredCell(:,6),'UniformOutput',false);
end

function predictions = extractPredictions(YPredictions, anchorBoxMask)
predictions = cell(size(YPredictions, 1),6);
for ii = 1:size(YPredictions, 1)
    numAnchors = size(anchorBoxMask{ii},2);

    % Confidence scores.
    startIdx = 1;
    endIdx = numAnchors;
    predictions{ii,1} = YPredictions{ii}(:,:,startIdx:endIdx,:);

    % X positions.
    startIdx = startIdx + numAnchors;
    endIdx = endIdx + numAnchors;
    predictions{ii,2} = YPredictions{ii}(:,:,startIdx:endIdx,:);

    % Y positions.
    startIdx = startIdx + numAnchors;
    endIdx = endIdx + numAnchors;
    predictions{ii,3} = YPredictions{ii}(:,:,startIdx:endIdx,:);

    % Width.
    startIdx = startIdx + numAnchors;
    endIdx = endIdx + numAnchors;
    predictions{ii,4} = YPredictions{ii}(:,:,startIdx:endIdx,:);

    % Height.
    startIdx = startIdx + numAnchors;
    endIdx = endIdx + numAnchors;
    predictions{ii,5} = YPredictions{ii}(:,:,startIdx:endIdx,:);

    % Class probabilities.
    startIdx = startIdx + numAnchors;
    predictions{ii,6} = YPredictions{ii}(:,:,startIdx:end,:);
end
end

function tiledAnchors = generateTiledAnchors(YPredCell,anchorBoxes,anchorBoxMask)
% Generate tiled anchor offsets.
tiledAnchors = cell(size(YPredCell));
for i = 1:size(YPredCell,1)
    anchors = anchorBoxes(anchorBoxMask{i}, :);
    [h,w,~,n] = size(YPredCell{i,1});
    [tiledAnchors{i,2}, tiledAnchors{i,1}] = ndgrid(0:h-1,0:w-1,1:size(anchors,1),1:n);
    [~,~,tiledAnchors{i,3}] = ndgrid(0:h-1,0:w-1,anchors(:,2),1:n);
    [~,~,tiledAnchors{i,4}] = ndgrid(0:h-1,0:w-1,anchors(:,1),1:n);
end
end

function tiledAnchors = applyAnchorBoxOffsets(tiledAnchors,YPredCell,inputImageSize)
% Convert grid cell coordinates to box coordinates.
for i = 1:size(YPredCell,1)
    [h,w,~,~] = size(YPredCell{i,1});
    tiledAnchors{i,1} = (tiledAnchors{i,1}+YPredCell{i,1})./w;
    tiledAnchors{i,2} = (tiledAnchors{i,2}+YPredCell{i,2})./h;
    tiledAnchors{i,3} = (tiledAnchors{i,3}.*YPredCell{i,3})./inputImageSize(2);
    tiledAnchors{i,4} = (tiledAnchors{i,4}.*YPredCell{i,4})./inputImageSize(1);
end
end

function [lossPlotter, learningRatePlotter] = configureTrainingProgressPlotter(f)
% Create the subplots to display the loss and learning rate.
figure(f);
clf
subplot(2,1,1);
ylabel('Learning Rate');
xlabel('Iteration');
learningRatePlotter = animatedline;
subplot(2,1,2);
ylabel('Total Loss');
xlabel('Iteration');
lossPlotter = animatedline;
end
1. Redmon, Joseph, and Ali Farhadi. “YOLOv3: An Incremental Improvement.” Preprint, submitted April 8, 2018. https://arxiv.org/abs/1804.02767.