This example shows how to deploy a custom-trained series network to detect defects in objects such as hexagon nuts. The custom networks were trained by using transfer learning, a technique commonly used in deep learning applications: you take a pretrained network and use it as a starting point to learn a new task. Fine-tuning a network with transfer learning is usually much faster and easier than training a network from scratch with randomly initialized weights, and you can transfer learned features to a new task using fewer training images. This example uses two trained series networks, trainedDefNet.mat and trainedBlemDetNet.mat.
To run this example, you need the following:
Xilinx ZCU102 SoC development kit
Deep Learning HDL Toolbox™ Support Package for Xilinx FPGA and SoC
Deep Learning Toolbox™
Deep Learning HDL Toolbox™
To download and load the custom pretrained series networks trainedDefNet and trainedBlemDetNet, enter:
% Download trainedDefNet.mat only if it is not already in the current folder.
if ~isfile('trainedDefNet.mat')
    url = 'https://www.mathworks.com/supportfiles/dlhdl/trainedDefNet.mat';
    websave('trainedDefNet.mat',url);
end
net1 = load('trainedDefNet.mat');
snet_defnet = net1.custom_alexnet

snet_defnet =
  SeriesNetwork with properties:

         Layers: [25×1 nnet.cnn.layer.Layer]
     InputNames: {'data'}
    OutputNames: {'output'}
To analyze the layers of snet_defnet, enter:
analyzeNetwork(snet_defnet)
% Download trainedBlemDetNet.mat only if it is not already in the current folder.
if ~isfile('trainedBlemDetNet.mat')
    url = 'https://www.mathworks.com/supportfiles/dlhdl/trainedBlemDetNet.mat';
    websave('trainedBlemDetNet.mat',url);
end
net2 = load('trainedBlemDetNet.mat');
snet_blemdetnet = net2.convnet

snet_blemdetnet =
  SeriesNetwork with properties:

         Layers: [12×1 nnet.cnn.layer.Layer]
     InputNames: {'imageinput'}
    OutputNames: {'classoutput'}
analyzeNetwork(snet_blemdetnet)
Create a target object that has a custom name for your target device and an interface to connect your target device to the host computer. Interface options are JTAG and Ethernet. To use the JTAG connection, install the Xilinx™ Vivado™ Design Suite 2019.2.
To set the Xilinx Vivado toolpath, enter:
% hdlsetuptoolpath('ToolName', 'Xilinx Vivado', 'ToolPath', 'C:\Xilinx\Vivado\2019.2\bin\vivado.bat');
hT = dlhdl.Target('Xilinx','Interface','Ethernet')

hT =
  Target with properties:

       Vendor: 'Xilinx'
    Interface: Ethernet
    IPAddress: '10.10.10.15'
     Username: 'root'
         Port: 22
Create an object of the dlhdl.Workflow class. When you create the object, specify the network and the bitstream name. Specify the saved pretrained trainedDefNet as the network. Make sure that the bitstream name matches the data type and the FPGA board that you are targeting. In this example, the target FPGA board is the Xilinx ZCU102 SoC board and the bitstream uses the single data type.
hW = dlhdl.Workflow('Network',snet_defnet,'Bitstream','zcu102_single','Target',hT)
hW =
  Workflow with properties:

            Network: [1×1 SeriesNetwork]
          Bitstream: 'zcu102_single'
    ProcessorConfig: []
             Target: [1×1 dlhdl.Target]
To compile the trainedDefNet series network, run the compile function of the dlhdl.Workflow object.
hW.compile
          offset_name          offset_address     allocated_space
    _______________________    ______________    _________________

    "InputDataOffset"           "0x00000000"     "8.0 MB"
    "OutputResultOffset"        "0x00800000"     "4.0 MB"
    "SystemBufferOffset"        "0x00c00000"     "28.0 MB"
    "InstructionDataOffset"     "0x02800000"     "4.0 MB"
    "ConvWeightDataOffset"      "0x02c00000"     "12.0 MB"
    "FCWeightDataOffset"        "0x03800000"     "84.0 MB"
    "EndOffset"                 "0x08c00000"     "Total: 140.0 MB"
ans = struct with fields:
Operators: [1×1 struct]
LayerConfigs: [1×1 struct]
NetConfigs: [1×1 struct]
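The memory map printed by compile can be cross-checked by hand: each region appears to start where the previous one ends, so the offsets are the running sum of the allocated sizes. The following is a minimal sketch of that arithmetic, with the sizes copied from the table above. The variable names are illustrative and are not part of the dlhdl API.

```matlab
% Allocated sizes in MB, copied from the compile output for trainedDefNet.
% Variable names are illustrative; they are not part of the dlhdl API.
regionNames = {'InputDataOffset', 'OutputResultOffset', 'SystemBufferOffset', ...
               'InstructionDataOffset', 'ConvWeightDataOffset', ...
               'FCWeightDataOffset', 'EndOffset'};
sizesMB = [8 4 28 4 12 84];

% Each region starts where the previous one ends.
offsetsMB = [0 cumsum(sizesMB)];

% Convert MB to bytes and format as 8-digit lowercase hexadecimal addresses.
offsetsHex = strcat('0x', lower(cellstr(dec2hex(offsetsMB * 2^20, 8))));

for k = 1:numel(regionNames)
    fprintf('%-24s %s\n', regionNames{k}, offsetsHex{k});
end
fprintf('Total: %.1f MB\n', sum(sizesMB));
```

The computed addresses agree with the offset_address column, and the sum of the sizes agrees with the reported total.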
To deploy the network on the Xilinx ZCU102 SoC hardware, run the deploy function of the dlhdl.Workflow object. This function uses the output of the compile function to program the FPGA board by using the programming file. It also downloads the network weights and biases. The deploy function starts programming the FPGA device and displays progress messages and the time it takes to deploy the network.
hW.deploy
### FPGA bitstream programming has been skipped as the same bitstream is already loaded on the target FPGA.
### Deep learning network programming has been skipped as the same network is already loaded on the target FPGA.
Load an image from the attached testImages folder, resize the image to match the network image input layer dimensions, and run the predict function of the dlhdl.Workflow object to retrieve and display the defect prediction from the FPGA.
wi = uint32(320);
he = uint32(240);
ch = uint32(3);

filename = fullfile(pwd,'ng1.png');
img = imread(filename);
img = imresize(img, [he, wi]);
img = mat2ocv(img);

% Extract ROI for preprocessing
[Iori, imgPacked, num, bbox] = myNDNet_Preprocess(img);

% Row-major to column-major conversion
imgPacked2 = zeros([128,128,4],'uint8');
for c = 1:4
    for i = 1:128
        for j = 1:128
            imgPacked2(i,j,c) = imgPacked((i-1)*128 + (j-1) + (c-1)*128*128 + 1);
        end
    end
end

% Classify detected nuts by using CNN
scores = zeros(2,4);
for i = 1:num
    [scores(:,i), speed] = hW.predict(single(imgPacked2(:,:,i)),'Profile','on');
end
### Finished writing input activations.
### Running single input activations.

              Deep Learning Processor Profiler Performance Results

                   LastLayerLatency(cycles)   LastLayerLatency(seconds)       FramesNum      Total Latency     Frames/s
                         -------------             -------------              ---------        ---------       ---------
Network                   12199544                  0.05545                       1           12199586             18.0
    conv_module            3292478                  0.01497
        conv1               412777                  0.00188
        norm1               173433                  0.00079
        pool1                58705                  0.00027
        conv2               656607                  0.00298
        norm2               128094                  0.00058
        pool2                53221                  0.00024
        conv3               780491                  0.00355
        conv4               600179                  0.00273
        conv5               409095                  0.00186
        pool5                19991                  0.00009
    fc_module              8907066                  0.04049
        fc6                1759795                  0.00800
        fc7                7030223                  0.03196
        fc8                 117046                  0.00053
 * The clock frequency of the DL processor is: 220MHz
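The columns of the profiler table are related by the clock frequency reported in its last line. As a rough check, assuming the seconds column is the cycle count divided by the clock frequency and Frames/s is the clock frequency divided by the total latency per frame, the Network row can be reproduced as follows. The variable names are illustrative.

```matlab
% Values copied from the Network row of the profiler output above.
clockHz       = 220e6;      % DL processor clock frequency (220 MHz)
latencyCycles = 12199544;   % LastLayerLatency(cycles)
totalCycles   = 12199586;   % Total Latency
framesNum     = 1;          % FramesNum

% Assumed relationships between the columns (not stated in the output itself).
latencySeconds  = latencyCycles / clockHz;
framesPerSecond = framesNum * clockHz / totalCycles;

fprintf('Latency: %.5f s, throughput: %.1f frames/s\n', latencySeconds, framesPerSecond);
```

Both results agree with the table to the printed precision (0.05545 s and 18.0 frames/s).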
Iori = reshape(Iori, [1, he*wi*ch]);
bbox = reshape(bbox, [1,16]);
scores = reshape(scores, [1, 8]);
% Insert an annotation for postprocessing
out = myNDNet_Postprocess(Iori, num, bbox, scores, wi, he, ch);
sz = [he wi ch];
out = ocv2mat(out,sz);
imshow(out)
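The row-major to column-major conversion loop in the prediction code can also be written without loops. The sketch below uses a small stand-in buffer (4-by-4 pixels, 2 regions, made-up values) in place of the 128-by-128-by-4 output of myNDNet_Preprocess, and checks that reshape followed by permute gives the same result as the explicit loop. It is an alternative formulation, not the code used in the example.

```matlab
% Small stand-in for the packed buffer: n-by-n pixels and k regions, stored
% row-major (the column index varies fastest), as the loop indexing implies.
n = 4;
k = 2;
imgPacked = uint8(0:(n*n*k - 1));

% Reference: the explicit triple loop, with 128 and 4 replaced by n and k.
loopResult = zeros([n, n, k], 'uint8');
for c = 1:k
    for i = 1:n
        for j = 1:n
            loopResult(i,j,c) = imgPacked((i-1)*n + (j-1) + (c-1)*n*n + 1);
        end
    end
end

% Vectorized: reshape fills the first dimension fastest, so rows and columns
% come out swapped; permute swaps the first two dimensions back.
vecResult = permute(reshape(imgPacked, [n, n, k]), [2 1 3]);

disp(isequal(loopResult, vecResult))
```

For the sizes in this example, the equivalent call would be permute(reshape(imgPacked, [128, 128, 4]), [2 1 3]), assuming imgPacked holds exactly 128*128*4 elements.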
Create an object of the dlhdl.Workflow class. When you create the object, specify the network and the bitstream name. Specify the saved pretrained trainedBlemDetNet as the network. Make sure that the bitstream name matches the data type and the FPGA board that you are targeting. In this example, the target FPGA board is the Xilinx ZCU102 SoC board and the bitstream uses the single data type.
hW = dlhdl.Workflow('Network',snet_blemdetnet,'Bitstream','zcu102_single','Target',hT)
hW =
  Workflow with properties:

            Network: [1×1 SeriesNetwork]
          Bitstream: 'zcu102_single'
    ProcessorConfig: []
             Target: [1×1 dlhdl.Target]
To compile the trainedBlemDetNet series network, run the compile function of the dlhdl.Workflow object.
hW.compile
          offset_name          offset_address    allocated_space
    _______________________    ______________    ________________

    "InputDataOffset"           "0x00000000"     "8.0 MB"
    "OutputResultOffset"        "0x00800000"     "4.0 MB"
    "SystemBufferOffset"        "0x00c00000"     "28.0 MB"
    "InstructionDataOffset"     "0x02800000"     "4.0 MB"
    "ConvWeightDataOffset"      "0x02c00000"     "4.0 MB"
    "FCWeightDataOffset"        "0x03000000"     "36.0 MB"
    "EndOffset"                 "0x05400000"     "Total: 84.0 MB"
ans = struct with fields:
Operators: [1×1 struct]
LayerConfigs: [1×1 struct]
NetConfigs: [1×1 struct]
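Going the other way, the allocated sizes can be recovered from the offset addresses alone, because the size of each region is the distance to the next offset. The sketch below does this for the trainedBlemDetNet memory map, with the addresses copied from the compile output above. The variable names are illustrative.

```matlab
% Offset addresses copied from the compile output for trainedBlemDetNet.
offsetsHex = {'0x00000000', '0x00800000', '0x00c00000', '0x02800000', ...
              '0x02c00000', '0x03000000', '0x05400000'};

% Strip the 0x prefix and convert the hexadecimal text to byte addresses.
offsetBytes = hex2dec(strrep(offsetsHex, '0x', ''));

% The size of each region is the gap to the next offset, in MB.
sizesMB = diff(offsetBytes) / 2^20;
totalMB = offsetBytes(end) / 2^20;

fprintf('%.1f MB\n', sizesMB);
fprintf('Total: %.1f MB\n', totalMB);
```

The recovered sizes match the allocated_space column. The only regions that differ from the trainedDefNet map are the two weight regions, which shrink from 12 MB and 84 MB to 4 MB and 36 MB.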
To deploy the network on the Xilinx ZCU102 SoC hardware, run the deploy function of the dlhdl.Workflow object. This function uses the output of the compile function to program the FPGA board by using the programming file. It also downloads the network weights and biases. The deploy function starts programming the FPGA device and displays progress messages and the time it takes to deploy the network.
hW.deploy
### FPGA bitstream programming has been skipped as the same bitstream is already loaded on the target FPGA.
### Loading weights to FC Processor.
### 50% finished, current time is 28-Jun-2020 12:33:36.
### FC Weights loaded. Current time is 28-Jun-2020 12:33:37
Load an image from the attached testImages folder, resize the image to match the network image input layer dimensions, and run the predict function of the dlhdl.Workflow object to retrieve and display the defect prediction from the FPGA.
wi = uint32(320);
he = uint32(240);
ch = uint32(3);

filename = fullfile(pwd,'ok1.png');
img = imread(filename);
img = imresize(img, [he, wi]);
img = mat2ocv(img);

% Extract ROI for preprocessing
[Iori, imgPacked, num, bbox] = myNDNet_Preprocess(img);

% Row-major to column-major conversion
imgPacked2 = zeros([128,128,4],'uint8');
for c = 1:4
    for i = 1:128
        for j = 1:128
            imgPacked2(i,j,c) = imgPacked((i-1)*128 + (j-1) + (c-1)*128*128 + 1);
        end
    end
end

% Classify detected nuts by using CNN
scores = zeros(2,4);
for i = 1:num
    [scores(:,i), speed] = hW.predict(single(imgPacked2(:,:,i)),'Profile','on');
end
### Finished writing input activations.
### Running single input activations.

              Deep Learning Processor Profiler Performance Results

                   LastLayerLatency(cycles)   LastLayerLatency(seconds)       FramesNum      Total Latency     Frames/s
                         -------------             -------------              ---------        ---------       ---------
Network                    4886257                  0.02221                       1            4886299             45.0
    conv_module            1256664                  0.00571
        conv_1              467349                  0.00212
        maxpool_1           191204                  0.00087
        crossnorm           159553                  0.00073
        conv_2              397552                  0.00181
        maxpool_2            41066                  0.00019
    fc_module              3629593                  0.01650
        fc_1               3614829                  0.01643
        fc_2                 14763                  0.00007
 * The clock frequency of the DL processor is: 220MHz
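With profiler results for both networks available, it can be useful to compare where the cycles go. The sketch below uses the module-level cycle counts copied from the two profiler tables in this example to compute the share of latency spent in the fully connected module and the per-frame latency ratio between the networks. The variable names are illustrative.

```matlab
% Cycle counts copied from the two profiler tables in this example.
netNames      = {'trainedDefNet', 'trainedBlemDetNet'};
networkCycles = [12199544, 4886257];   % Network rows
convCycles    = [ 3292478, 1256664];   % conv_module rows
fcCycles      = [ 8907066, 3629593];   % fc_module rows

% Fraction of the per-frame latency spent in the fully connected module.
fcShare = fcCycles ./ networkCycles;

% Ratio of per-frame latencies (first network relative to the second).
latencyRatio = networkCycles(1) / networkCycles(2);

for k = 1:numel(netNames)
    fprintf('%s: FC module accounts for %.1f%% of latency\n', netNames{k}, 100 * fcShare(k));
end
fprintf('Latency ratio (trainedDefNet / trainedBlemDetNet): %.2f\n', latencyRatio);
```

In both tables the conv_module and fc_module counts add up to the Network count, and the fully connected module accounts for most of the latency.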
Iori = reshape(Iori, [1, he*wi*ch]);
bbox = reshape(bbox, [1,16]);
scores = reshape(scores, [1, 8]);
% Insert annotation for postprocessing
out = myNDNet_Postprocess(Iori, num, bbox, scores, wi, he, ch);
sz = [he wi ch];
out = ocv2mat(out,sz);
imshow(out)
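The postprocessing code flattens scores from 2-by-4 to 1-by-8 before passing it to myNDNet_Postprocess. Because reshape reads the matrix column by column, the two scores for each detected nut stay next to each other in the flattened vector. The sketch below illustrates this with made-up score values. Which row corresponds to which class is not stated in this example, so no class labels are assumed.

```matlab
% Made-up scores for illustration: one column per detected nut,
% one row per class (row meaning is not specified in this example).
scores = [0.9 0.2 0.6 0.0;
          0.1 0.8 0.4 0.0];

% Column-major flattening, as in the postprocessing code.
flat = reshape(scores, [1, 8]);

% The pair flat(2*i-1), flat(2*i) holds the two scores for nut i.
for i = 1:4
    pair = flat(2*i-1 : 2*i);
    fprintf('Nut %d: [%.1f %.1f]\n', i, pair(1), pair(2));
end
```

Columns beyond num remain zero, because scores is preallocated with zeros(2,4) and only the first num columns are filled by predict.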