Deep Learning in Simulink Using MATLAB Function Block

You can generate optimized code for prediction or detection with a variety of trained deep learning networks in your Simulink® models. The MATLAB Function (Simulink) blocks contain code that uses the coder.loadDeepLearningNetwork function to load a deep learning model and to construct and set up a CNN class. The code also contains calls to the predict or detect functions to compute the responses. The generated code implements the deep convolutional neural network (CNN) by using the architecture, the layers, and the parameters that you specify in the input SeriesNetwork (Deep Learning Toolbox) or DAGNetwork (Deep Learning Toolbox) object.

You can configure the code generator to take advantage of the NVIDIA® CUDA® deep neural network library (cuDNN) and TensorRT™ high performance inference libraries for NVIDIA GPUs.

You can configure the code generator to take advantage of the Intel® Math Kernel Library for Deep Neural Networks (MKL-DNN) when targeting Intel processors or the ARM® Compute Library for computer vision and machine learning when targeting ARM processors.

Example: Classify Images by Using GoogLeNet

GoogLeNet has been trained on over a million images and can classify images into 1000 object categories (such as keyboard, coffee mug, pencil, and animals). The network has learned rich feature representations for a wide range of images. The network takes an image as input, and then outputs a label for the object in the image together with the probabilities for each of the object categories. This example shows you how to perform simulation and generate CUDA code for the pretrained GoogLeNet deep convolutional neural network and classify an image. The pretrained models are available as support packages from the Deep Learning Toolbox™.

Load the pretrained GoogLeNet network. You can choose to load a different pretrained network for image classification. If you do not have the required support packages installed, install the software according to the instructions provided.
```
net = googlenet;
```
The object net contains the DAGNetwork object. Use the analyzeNetwork function to display an interactive visualization of the network architecture, to detect errors and issues in the network, and to display detailed information about the network layers. The layer information includes the sizes of layer activations and learnable parameters, the total number of learnable parameters, and the sizes of state parameters of recurrent layers.
```
analyzeNetwork(net);
```
The image that you want to classify must have the same size as the input size of the network. For GoogLeNet, the size of the imageInputLayer is 224-by-224-by-3. The Classes property of the output classificationLayer contains the names of the classes learned by the network. View 10 random class names out of the total of 1000.
```
classNames = net.Layers(end).Classes;
numClasses = numel(classNames);
disp(classNames(randperm(numClasses,10)))
```
```
    'speedboat'
    'window screen'
    'isopod'
    'wooden spoon'
    'lipstick'
    'drake'
    'hyena'
    'dumbbell'
    'strawberry'
    'custard apple'
```
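Before wiring the network into a model, it can help to confirm from the MATLAB command line that an image resized to the network input size classifies as expected. The following is a minimal sketch rather than a required step of this example. It reuses net from the step above and assumes that the imresize function (Image Processing Toolbox) is available.

```matlab
% Read the input size from the first (image input) layer.
inputSize = net.Layers(1).InputSize;      % [224 224 3] for GoogLeNet

% Resize the sample image to the spatial dimensions the network expects.
im = imread('peppers.png');
im = imresize(im,inputSize(1:2));

% Classify on the desktop and print the label with its probability.
[label,scores] = classify(net,im);
fprintf('%s (%.1f%%)\n',char(label),100*max(scores));
```

If the label here differs from what the Simulink model later reports, the resize settings in the model are a reasonable first place to look.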

Create GoogLeNet Model

Create a new Simulink model and insert a MATLAB Function block from the User-Defined Functions library.
Add an Image From File (Computer Vision Toolbox) block from the Computer Vision Toolbox™ library and set the File name parameter to peppers.png. Add a Resize (Computer Vision Toolbox) block from the Computer Vision Toolbox library to the model. Set the Specify parameter of the Resize block to Number of output rows and columns and enter [224 224] as the value for Number of output rows and columns. This block resizes the input image to the size of the input layer of the network.
Double-click the MATLAB Function block. A default function signature appears in the MATLAB Function Block Editor.
Define a function called googlenet_predict, which implements the prediction entry-point function. The function header declares in as an argument to the googlenet_predict function, with scores and indxTop as the return values. Save the Editor document to a file.
```
function [scores,indxTop] = googlenet_predict(in) %#codegen

persistent mynet;

if isempty(mynet)
    mynet = coder.loadDeepLearningNetwork('googlenet');
end

% pass in input   
predict_scores = predict(mynet,in);
[scores,indx] = sort(predict_scores, 'descend');
indxTop = indx(1:5);
```
A persistent object mynet loads the DAGNetwork object. At the first call to the entry-point function, the persistent object is constructed and set up. On subsequent calls to the function, the same object is reused to call predict on inputs, avoiding reconstructing and reloading the network object.
You can also use the activations (Deep Learning Toolbox) method to compute network activations for a specific layer. For example, the following line of code returns the network activations for the layer specified in layerIdx.
```
out = activations(mynet,in,layerIdx,'OutputAs','Channels');
```
You can also use the classify (Deep Learning Toolbox) method to predict class labels for the image data in in using the trained network, mynet.
```
[out,scores] = classify(mynet,in);
```
For LSTM networks, you can also use the predictAndUpdateState (Deep Learning Toolbox) and resetState (Deep Learning Toolbox) methods. For usage notes and limitations of these methods, see Supported Functions.
Right-click on the MATLAB Function block and select Block Parameters (Subsystem).
On the Code Generation tab, select Reusable function for Function packaging.
Connect these blocks as shown in the diagram. Save the model as googlenetModel.
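The ranking step at the end of googlenet_predict can be tried on its own in MATLAB. This sketch substitutes a short, made-up score vector (six values instead of 1000) for the network output, so the numbers are illustrative only.

```matlab
% Made-up scores standing in for the 1-by-1000 output of predict.
predict_scores = single([0.05 0.60 0.10 0.02 0.15 0.08]);

% Same two lines as in googlenet_predict.
[scores,indx] = sort(predict_scores,'descend');
indxTop = indx(1:5);

disp(indxTop)        % class indices 2 5 3 6 1
disp(scores(1:5))    % scores of those five classes, highest first
```

Note that scores holds the entire sorted vector, while indxTop holds only the first five indices. Code that consumes the block outputs therefore has to take the first five entries of scores to pair them with indxTop.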

Configure the Model for GPU Acceleration

Model configuration parameters determine the acceleration method used during simulation.

Open the Configuration Parameters dialog box, Solver pane. To compile your model for acceleration and generate CUDA code, you must configure the model to use a fixed-step solver. The following table shows the solver configuration for this example.

Parameter	Setting	Effect on Generated Code
Type	`Fixed-step`	Maintains a constant (fixed) step size, which is required for code generation
Solver	`discrete (no continuous states)`	Applies a fixed-step integration technique for computing the state derivative of the model
Fixed-step size	`auto`	Simulink chooses the step size

Snapshot of the configuration parameters dialog showing solver options for simulation.
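The solver settings in the table can also be applied from the command line. This is a sketch that assumes the model has been saved as googlenetModel and is on the MATLAB path. SolverType, Solver, and FixedStep are the model parameter names as we understand them; you can confirm the current values with get_param.

```matlab
model = 'googlenetModel';
load_system(model);

% Fixed-step discrete solver, with Simulink choosing the step size.
set_param(model,'SolverType','Fixed-step');
set_param(model,'Solver','FixedStepDiscrete');
set_param(model,'FixedStep','auto');

% Read one setting back to confirm it was applied.
disp(get_param(model,'SolverType'))
```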

Select the Simulation Target pane. Set the Language to C++.
Select GPU acceleration. GPU Coder™ specific options are now visible in the Simulation Target > GPU Acceleration pane. For the purposes of this example, you can use the default values of these parameters.
On the Simulation Target pane, set the Target Library in the Deep learning group to cuDNN. You can also select TensorRT.
Click OK to save and close the Configuration Parameters dialog box.
You can use set_param to configure the model parameters programmatically from the MATLAB® Command Window. For example,
```
set_param('googlenetModel','GPUAcceleration','on');
```
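The remaining Simulation Target choices from the steps above can be scripted in the same way. Treat this as a sketch: the parameter names SimTargetLang and SimDLTargetLibrary and the value 'cudnn' are what we believe recent releases use, and they may differ in your release, so check them with get_param before relying on them.

```matlab
model = 'googlenetModel';
load_system(model);

% Simulation target language (assumed parameter name: SimTargetLang).
set_param(model,'SimTargetLang','C++');

% Enable GPU acceleration for simulation.
set_param(model,'GPUAcceleration','on');

% Deep learning library for simulation (assumed name: SimDLTargetLibrary).
% 'tensorrt' is the assumed value for the TensorRT option.
set_param(model,'SimDLTargetLibrary','cudnn');
```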

Building the GPU Accelerated Model

To build the GPU accelerated model and simulate it, select Run on the Simulation tab or run the following command at the MATLAB prompt. Assigning the result to out makes the logged outputs available to the plotting code later in this section.
```
out = sim('googlenetModel');
```
The software first checks to see if CUDA/C++ code was previously compiled for your model. If code was created previously, the software runs the model. If code was not previously built, the software first generates and compiles the CUDA/C++ code, and then runs the model. The code generation tool places the generated code in a subfolder of the working folder called slprj/_slprj/googlenetModel.
Display the top five predicted labels and their associated probabilities as a histogram. Because the network classifies images into so many object categories, and many categories are similar, it is common to consider the top-five accuracy when evaluating networks. The network classifies the image as a bell pepper with a high probability.
```
im = imread('peppers.png');
classNamesTop = classNames(out.yout{2}.Values.Data(:,:,1))

h = figure;
h.Position(3) = 2*h.Position(3);
ax1 = subplot(1,2,1);
ax2 = subplot(1,2,2);

image(ax1,im);
barh(ax2,out.yout{1}.Values.Data(1,5:-1:1,1))
xlabel(ax2,'Probability')
yticklabels(ax2,classNamesTop(5:-1:1))
ax2.YAxisLocation = 'right';
sgtitle('Top 5 predictions using GoogLeNet')
```

Configure the Model for Code Generation

The model configuration parameters provide many options for the code generation and build process.

Select the Code Generation pane. Set the System target file to grt.tlc. You can also use the Embedded Coder® target file ert.tlc.
Set the Language to C++.
Select Generate GPU code. GPU Coder specific options are now visible in the Code Generation > GPU Code pane.
Select Generate code only.
Select the Toolchain. For Linux® platforms, select NVIDIA CUDA | gmake (64-bit Linux). For Windows® systems, select NVIDIA CUDA (w/Microsoft Visual C++ 20XX) | nmake (64-bit windows).
On the Code Generation > Report pane, select Create code generation report and Open report automatically.
On the Code Generation > Interface pane, set the Target Library in the Deep learning group to cuDNN. You can also select TensorRT.
For the purposes of this example, you can use the default values of the GPU-specific parameters in Code Generation > GPU Code pane.
Click OK to save and close the Configuration Parameters dialog box.
You can also use set_param to configure the model parameters programmatically from the MATLAB Command Window. For example,
```
set_param('googlenetModel','GenerateGPUCode','CUDA');
```
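The other code generation settings listed above can be collected into one script. This is a sketch under the assumption that the parameter names below match your release. The toolchain string is the Linux value from the steps above, and DLTargetLibrary with the value 'cudnn' is our best understanding of how the Target Library option is exposed.

```matlab
model = 'googlenetModel';
load_system(model);

% Target, language, and GPU code generation.
set_param(model,'SystemTargetFile','grt.tlc');
set_param(model,'TargetLang','C++');
set_param(model,'GenerateGPUCode','CUDA');
set_param(model,'GenCodeOnly','on');

% Toolchain for a Linux host; use the Windows entry on Windows systems.
set_param(model,'Toolchain','NVIDIA CUDA | gmake (64-bit Linux)');

% Create the code generation report and open it automatically.
set_param(model,'GenerateReport','on');
set_param(model,'LaunchReport','on');

% Deep learning library for generated code (assumed name: DLTargetLibrary).
set_param(model,'DLTargetLibrary','cudnn');
```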

Generate CUDA Code for the Model

In the Simulink Editor, open the Simulink Coder app.
Generate code.
Messages appear in the Diagnostics Viewer. The code generator produces CUDA source and header files, and an HTML code generation report. The code generator places the files in a build folder, a subfolder named googlenetModel_grt_rtw under your current working folder.
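Code generation can also be started from the command line with slbuild instead of the Simulink Coder app. The sketch below assumes the configuration from the previous section (grt.tlc with Generate code only selected) and then lists the CUDA source files in the build folder named above.

```matlab
model = 'googlenetModel';
load_system(model);

% Generate code using the model's current configuration parameters.
slbuild(model);

% List the generated CUDA source files in the build folder.
buildDir = fullfile(pwd,[model '_grt_rtw']);
cuFiles = dir(fullfile(buildDir,'*.cu'));
disp({cuFiles.name}')
```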

Example of the generated CUDA code.
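Several constants in the listing that follows can be traced back to the network input and output sizes. The arithmetic below is our reading of the listing, not documented behavior of the code generator: it reproduces the element counts, the kernel launch dimensions (294 and 2 blocks of 512 threads), and the cudaMalloc sizes that appear in goo_DeepLearningNetwork_predict.

```matlab
inputSize = [224 224 3];      % GoogLeNet image input
numClasses = 1000;            % length of the score vector
threadsPerBlock = 512;        % from __launch_bounds__(512, 1)

% Element counts and the grid sizes needed to cover them.
numInputElems = prod(inputSize);                      % 150528
planeStride = prod(inputSize(1:2));                   % 50176
inputBlocks = ceil(numInputElems/threadsPerBlock);    % 294
outputBlocks = ceil(numClasses/threadsPerBlock);      % 2

% Buffer sizes in bytes: uint8 input, single input, single scores.
bytesUint8Input = numInputElems*1;                    % 150528
bytesSingleInput = numInputElems*4;                   % 602112
bytesSingleScores = numClasses*4;                     % 4000
```

The value 50176 is the per-channel stride used in the kernel that converts the image to single precision. That kernel also swaps the row and column indices, which is consistent with converting the column-major MATLAB layout to the row-major layout the inference library expects.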

googlenetModel.cu

/*
 * googlenetModel.cu
 *
 * Prerelease License - for engineering feedback and testing purposes
 * only. Not for sale.
 *
 * Code generation for model "googlenetModel".
 *
 * Model version              : 1.6
 * Simulink Coder version : 9.4 (R2020b) 11-Apr-2020
 * C++ source code generated on : Fri Apr  15 11:32:42 2020
 *
 * Target selection: grt.tlc
 * Note: GRT includes extra infrastructure and instrumentation for prototyping
 * Embedded hardware selection: Intel->x86-64 (Linux 64)
 * Code generation objectives: Unspecified
 * Validation result: Not run
 */

#include "googlenetModel.h"
#include "googlenetModel_private.h"
#include "math_constants.h"

/* Forward declaration for local functions */
static __global__ __launch_bounds__(32, 1) void googlene_eML_blk_kernel_kernel1(
  const int32_T iidx[1000], real_T indxTop[5]);
static __global__ __launch_bounds__(512, 1) void DeepLearningNetwork_predict_ker
  (const uint8_T in[150528], uint8_T b_in[150528]);
static __global__ __launch_bounds__(512, 1) void DeepLearningNetwork_predict_k_e
  (const uint8_T b_in[150528], real32_T miniBatchT[150528]);
static __global__ __launch_bounds__(512, 1) void DeepLearningNetwork_predict_k_h
  (const cell_wrap_4_googlenetModel_T *outputsMiniBatch, real32_T varargout_1
   [1000]);
static __global__ __launch_bounds__(32, 1) void googlene_eML_blk_kernel_kernel1(
  const int32_T iidx[1000], real_T indxTop[5])
{
  uint64_T threadId;
  threadId = mwGetGlobalThreadIndex();
  if (static_cast<int32_T>(threadId) < 5) {
    indxTop[static_cast<int32_T>(threadId)] = static_cast<real_T>(iidx[
      static_cast<int32_T>(threadId)]);
  }
}

/* Function for MATLAB Function: '<Root>/GoogLeNet Predict' */
void googlenetModelModelClass::googl_DeepLearningNetwork_setup
  (googlenet0_googlenetModel0 *obj)
{
  obj->setup();
  obj->batchSize = 1;
}

static __global__ __launch_bounds__(512, 1) void DeepLearningNetwork_predict_ker
  (const uint8_T in[150528], uint8_T b_in[150528])
{
  uint64_T threadId;
  threadId = mwGetGlobalThreadIndex();
  if (static_cast<int32_T>(threadId) < 150528) {
    b_in[static_cast<int32_T>(threadId)] = in[static_cast<int32_T>(threadId)];
  }
}

static __global__ __launch_bounds__(512, 1) void DeepLearningNetwork_predict_k_e
  (const uint8_T b_in[150528], real32_T miniBatchT[150528])
{
  uint64_T threadId;
  int32_T tmp;
  int32_T tmp_0;
  threadId = mwGetGlobalThreadIndex();
  tmp_0 = static_cast<int32_T>(threadId % 224UL);
  threadId = (threadId - static_cast<uint64_T>(tmp_0)) / 224UL;
  tmp = static_cast<int32_T>(threadId % 224UL);
  threadId = (threadId - static_cast<uint64_T>(tmp)) / 224UL;
  if ((static_cast<int32_T>((static_cast<int32_T>(static_cast<int32_T>(threadId)
          < 3)) && (static_cast<int32_T>(tmp < 224)))) && (static_cast<int32_T>
       (tmp_0 < 224))) {
    miniBatchT[(tmp_0 + 224 * tmp) + 50176 * static_cast<int32_T>(threadId)] =
      static_cast<real32_T>(b_in[(224 * tmp_0 + tmp) + 50176 *
      static_cast<int32_T>(threadId)]);
  }
}

static __global__ __launch_bounds__(512, 1) void DeepLearningNetwork_predict_k_h
  (const cell_wrap_4_googlenetModel_T *outputsMiniBatch, real32_T varargout_1
   [1000])
{
  uint64_T threadId;
  threadId = mwGetGlobalThreadIndex();
  if (static_cast<int32_T>(threadId) < 1000) {
    varargout_1[static_cast<int32_T>(threadId)] = outputsMiniBatch->f1[
      static_cast<int32_T>(threadId)];
  }
}

/* Function for MATLAB Function: '<Root>/GoogLeNet Predict' */
void googlenetModelModelClass::goo_DeepLearningNetwork_predict
  (googlenet0_googlenetModel0 *obj, const uint8_T in[150528], real32_T
   varargout_1[1000])
{
  cell_wrap_4_googlenetModel_T *gpu_outputsMiniBatch;
  real32_T (*gpu_miniBatchT)[150528];
  real32_T (*gpu_varargout_1)[1000];
  uint8_T (*gpu_b_in)[150528];
  uint8_T (*gpu_in)[150528];
  cudaMalloc(&gpu_varargout_1, 4000UL);
  cudaMalloc(&gpu_outputsMiniBatch, 4000UL);
  cudaMalloc(&gpu_miniBatchT, 602112UL);
  cudaMalloc(&gpu_b_in, 150528UL);
  cudaMalloc(&gpu_in, 150528UL);
  cudaMemcpy(gpu_in, (void *)&in[0], 150528UL, cudaMemcpyHostToDevice);
  DeepLearningNetwork_predict_ker<<<dim3(294U, 1U, 1U), dim3(512U, 1U, 1U)>>>
    (*gpu_in, *gpu_b_in);
  DeepLearningNetwork_predict_k_e<<<dim3(294U, 1U, 1U), dim3(512U, 1U, 1U)>>>
    (*gpu_b_in, *gpu_miniBatchT);
  cudaMemcpy(obj->getInputDataPointer(), *gpu_miniBatchT, obj->layers[1]
             ->getOutputTensor(0)->getNumElements() * sizeof(real32_T),
             cudaMemcpyDeviceToDevice);
  obj->predict();
  cudaMemcpy(gpu_outputsMiniBatch->f1, obj->getLayerOutput(78, 0), obj->layers
             [78]->getOutputTensor(0)->getNumElements() * sizeof(real32_T),
             cudaMemcpyDeviceToDevice);
  DeepLearningNetwork_predict_k_h<<<dim3(2U, 1U, 1U), dim3(512U, 1U, 1U)>>>
    (gpu_outputsMiniBatch, *gpu_varargout_1);
  cudaMemcpy(&varargout_1[0], gpu_varargout_1, 4000UL, cudaMemcpyDeviceToHost);
  cudaFree(*gpu_in);
  cudaFree(*gpu_b_in);
  cudaFree(*gpu_miniBatchT);
  cudaFree(gpu_outputsMiniBatch);
  cudaFree(*gpu_varargout_1);
}

/* Function for MATLAB Function: '<Root>/GoogLeNet Predict' */
void googlenetModelModelClass::googlenetModel_sort(real32_T x[1000], int32_T
  idx[1000])
{
  int32_T bLen2;
  int32_T b_bLen;
  int32_T blockOffset;
  int32_T exitg1;
  int32_T i1;
  int32_T i2;
  int32_T i3;
  int32_T i4;
  int32_T i_j;
  int32_T ib;
  int32_T idx_tmp;
  int32_T nNaNs;
  int32_T nNonNaN;
  int32_T nPairs;
  int32_T p;
  int32_T q;
  real32_T xwork[1000];
  real32_T b_xwork[256];
  real32_T x4[4];
  real32_T tmp;
  real32_T tmp_0;
  int16_T iwork[1000];
  int16_T b_iwork[256];
  int16_T idx4[4];
  int8_T perm[4];
  int8_T perm_0;
  x4[0] = 0.0F;
  idx4[0] = 0;
  x4[1] = 0.0F;
  idx4[1] = 0;
  x4[2] = 0.0F;
  idx4[2] = 0;
  x4[3] = 0.0F;
  idx4[3] = 0;
  for (ib = 0; ib < 1000; ib++) {
    idx[ib] = 0;
    xwork[ib] = 0.0F;
  }

  nNaNs = 0;
  ib = 0;
  for (nNonNaN = 0; nNonNaN < 1000; nNonNaN++) {
    if (rtIsNaNF(x[nNonNaN])) {
      idx[999 - nNaNs] = nNonNaN + 1;
      xwork[999 - nNaNs] = x[nNonNaN];
      nNaNs++;
    } else {
      ib++;
      idx4[ib - 1] = static_cast<int16_T>(nNonNaN + 1);
      x4[ib - 1] = x[nNonNaN];
      if (ib == 4) {
        ib = nNonNaN - nNaNs;
        if (x4[0] >= x4[1]) {
          i1 = 1;
          i2 = 2;
        } else {
          i1 = 2;
          i2 = 1;
        }

        if (x4[2] >= x4[3]) {
          i3 = 3;
          i4 = 4;
        } else {
          i3 = 4;
          i4 = 3;
        }

        tmp = x4[i1 - 1];
        tmp_0 = x4[i3 - 1];
        if (tmp >= tmp_0) {
          tmp = x4[i2 - 1];
          if (tmp >= tmp_0) {
            perm[0] = static_cast<int8_T>(i1);
            perm[1] = static_cast<int8_T>(i2);
            perm[2] = static_cast<int8_T>(i3);
            perm[3] = static_cast<int8_T>(i4);
          } else if (tmp >= x4[i4 - 1]) {
            perm[0] = static_cast<int8_T>(i1);
            perm[1] = static_cast<int8_T>(i3);
            perm[2] = static_cast<int8_T>(i2);
            perm[3] = static_cast<int8_T>(i4);
          } else {
            perm[0] = static_cast<int8_T>(i1);
            perm[1] = static_cast<int8_T>(i3);
            perm[2] = static_cast<int8_T>(i4);
            perm[3] = static_cast<int8_T>(i2);
          }
        } else {
          tmp_0 = x4[i4 - 1];
          if (tmp >= tmp_0) {
            if (x4[i2 - 1] >= tmp_0) {
              perm[0] = static_cast<int8_T>(i3);
              perm[1] = static_cast<int8_T>(i1);
              perm[2] = static_cast<int8_T>(i2);
              perm[3] = static_cast<int8_T>(i4);
            } else {
              perm[0] = static_cast<int8_T>(i3);
              perm[1] = static_cast<int8_T>(i1);
              perm[2] = static_cast<int8_T>(i4);
              perm[3] = static_cast<int8_T>(i2);
            }
          } else {
            perm[0] = static_cast<int8_T>(i3);
            perm[1] = static_cast<int8_T>(i4);
            perm[2] = static_cast<int8_T>(i1);
            perm[3] = static_cast<int8_T>(i2);
          }
        }

        idx_tmp = perm[0] - 1;
        idx[ib - 3] = idx4[idx_tmp];
        i1 = perm[1] - 1;
        idx[ib - 2] = idx4[i1];
        i2 = perm[2] - 1;
        idx[ib - 1] = idx4[i2];
        i3 = perm[3] - 1;
        idx[ib] = idx4[i3];
        x[ib - 3] = x4[idx_tmp];
        x[ib - 2] = x4[i1];
        x[ib - 1] = x4[i2];
        x[ib] = x4[i3];
        ib = 0;
      }
    }
  }

  if (ib > 0) {
    perm[1] = 0;
    perm[2] = 0;
    perm[3] = 0;
    if (ib == 1) {
      perm[0] = 1;
    } else if (ib == 2) {
      if (x4[0] >= x4[1]) {
        perm[0] = 1;
        perm[1] = 2;
      } else {
        perm[0] = 2;
        perm[1] = 1;
      }
    } else if (x4[0] >= x4[1]) {
      if (x4[1] >= x4[2]) {
        perm[0] = 1;
        perm[1] = 2;
        perm[2] = 3;
      } else if (x4[0] >= x4[2]) {
        perm[0] = 1;
        perm[1] = 3;
        perm[2] = 2;
      } else {
        perm[0] = 3;
        perm[1] = 1;
        perm[2] = 2;
      }
    } else if (x4[0] >= x4[2]) {
      perm[0] = 2;
      perm[1] = 1;
      perm[2] = 3;
    } else if (x4[1] >= x4[2]) {
      perm[0] = 2;
      perm[1] = 3;
      perm[2] = 1;
    } else {
      perm[0] = 3;
      perm[1] = 2;
      perm[2] = 1;
    }

    for (nNonNaN = 0; nNonNaN < ib; nNonNaN++) {
      perm_0 = perm[nNonNaN];
      idx_tmp = ((nNonNaN - nNaNs) - ib) + 1000;
      idx[idx_tmp] = idx4[perm_0 - 1];
      x[idx_tmp] = x4[perm_0 - 1];
    }
  }

  ib = (nNaNs >> 1) + 1000;
  for (nNonNaN = 0; nNonNaN <= ib - 1001; nNonNaN++) {
    i2 = (nNonNaN - nNaNs) + 1000;
    i1 = idx[i2];
    idx[i2] = idx[999 - nNonNaN];
    idx[999 - nNonNaN] = i1;
    x[i2] = xwork[999 - nNonNaN];
    x[999 - nNonNaN] = xwork[i2];
  }

  if ((nNaNs & 1U) != 0U) {
    i2 = ib - nNaNs;
    x[i2] = xwork[i2];
  }

  std::memset(&iwork[0], 0, 1000U * sizeof(int16_T));
  nNonNaN = 999 - nNaNs;
  i2 = 2;
  if (1000 - nNaNs > 1) {
    ib = (1000 - nNaNs) >> 8;
    if (ib > 0) {
      for (i1 = 0; i1 < ib; i1++) {
        i4 = i1 << 8;
        for (i2 = 0; i2 < 6; i2++) {
          b_bLen = 1 << (i2 + 2);
          bLen2 = b_bLen << 1;
          nPairs = 256 >> (i2 + 3);
          for (i3 = 0; i3 < nPairs; i3++) {
            blockOffset = i3 * bLen2 + i4;
            for (p = 0; p < bLen2; p++) {
              idx_tmp = blockOffset + p;
              b_iwork[p] = static_cast<int16_T>(idx[idx_tmp]);
              b_xwork[p] = x[idx_tmp];
            }

            p = 0;
            q = b_bLen;
            blockOffset--;
            do {
              exitg1 = 0;
              blockOffset++;
              if (b_xwork[p] >= b_xwork[q]) {
                idx[blockOffset] = b_iwork[p];
                x[blockOffset] = b_xwork[p];
                if (p + 1 < b_bLen) {
                  p++;
                } else {
                  exitg1 = 1;
                }
              } else {
                idx[blockOffset] = b_iwork[q];
                x[blockOffset] = b_xwork[q];
                if (q + 1 < bLen2) {
                  q++;
                } else {
                  q = blockOffset - p;
                  for (blockOffset = 0; blockOffset < b_bLen - p; blockOffset++)
                  {
                    i_j = p + blockOffset;
                    idx_tmp = (q + i_j) + 1;
                    idx[idx_tmp] = b_iwork[i_j];
                    x[idx_tmp] = b_xwork[i_j];
                  }

                  exitg1 = 1;
                }
              }
            } while (exitg1 == 0);
          }
        }
      }

      i1 = ib << 8;
      i2 = 1000 - (nNaNs + i1);
      if (i2 > 0) {
        std::memset(&iwork[0], 0, 1000U * sizeof(int16_T));
        i3 = i2 >> 2;
        ib = 4;
        while (i3 > 1) {
          if ((i3 & 1U) != 0U) {
            i3--;
            b_bLen = ib * i3;
            i4 = i2 - b_bLen;
            if (i4 > ib) {
              b_bLen += i1;
              bLen2 = i4 - ib;
              if ((ib != 0) && (bLen2 != 0)) {
                nPairs = ib + bLen2;
                for (i4 = 0; i4 < nPairs; i4++) {
                  iwork[i4] = static_cast<int16_T>(idx[b_bLen + i4]);
                  xwork[i4] = x[b_bLen + i4];
                }

                i4 = 0;
                nPairs = ib;
                bLen2 += ib;
                b_bLen--;
                do {
                  exitg1 = 0;
                  b_bLen++;
                  if (xwork[i4] >= xwork[nPairs]) {
                    idx[b_bLen] = iwork[i4];
                    x[b_bLen] = xwork[i4];
                    if (i4 + 1 < ib) {
                      i4++;
                    } else {
                      exitg1 = 1;
                    }
                  } else {
                    idx[b_bLen] = iwork[nPairs];
                    x[b_bLen] = xwork[nPairs];
                    if (nPairs + 1 < bLen2) {
                      nPairs++;
                    } else {
                      bLen2 = b_bLen - i4;
                      for (b_bLen = 0; b_bLen < ib - i4; b_bLen++) {
                        nPairs = i4 + b_bLen;
                        idx_tmp = (bLen2 + nPairs) + 1;
                        idx[idx_tmp] = iwork[nPairs];
                        x[idx_tmp] = xwork[nPairs];
                      }

                      exitg1 = 1;
                    }
                  }
                } while (exitg1 == 0);
              }
            }
          }

          b_bLen = ib << 1;
          i3 >>= 1;
          for (i4 = 0; i4 < i3; i4++) {
            nPairs = i4 * b_bLen + i1;
            if (ib != 0) {
              idx_tmp = ib + ib;
              for (bLen2 = 0; bLen2 < idx_tmp; bLen2++) {
                iwork[bLen2] = static_cast<int16_T>(idx[nPairs + bLen2]);
                xwork[bLen2] = x[nPairs + bLen2];
              }

              bLen2 = 0;
              p = ib;
              nPairs--;
              do {
                exitg1 = 0;
                nPairs++;
                if (xwork[bLen2] >= xwork[p]) {
                  idx[nPairs] = iwork[bLen2];
                  x[nPairs] = xwork[bLen2];
                  if (bLen2 + 1 < ib) {
                    bLen2++;
                  } else {
                    exitg1 = 1;
                  }
                } else {
                  idx[nPairs] = iwork[p];
                  x[nPairs] = xwork[p];
                  if (p + 1 < idx_tmp) {
                    p++;
                  } else {
                    p = nPairs - bLen2;
                    for (nPairs = 0; nPairs < ib - bLen2; nPairs++) {
                      blockOffset = bLen2 + nPairs;
                      idx_tmp = (p + blockOffset) + 1;
                      idx[idx_tmp] = iwork[blockOffset];
                      x[idx_tmp] = xwork[blockOffset];
                    }

                    exitg1 = 1;
                  }
                }
              } while (exitg1 == 0);
            }
          }

          ib = b_bLen;
        }

        if (i2 > ib) {
          i3 = i2 - ib;
          if ((ib != 0) && (i3 != 0)) {
            i4 = ib + i3;
            for (i2 = 0; i2 < i4; i2++) {
              b_bLen = i1 + i2;
              iwork[i2] = static_cast<int16_T>(idx[b_bLen]);
              xwork[i2] = x[b_bLen];
            }

            i2 = 0;
            i4 = ib;
            i3 += ib;
            i1--;
            do {
              exitg1 = 0;
              i1++;
              if (xwork[i2] >= xwork[i4]) {
                idx[i1] = iwork[i2];
                x[i1] = xwork[i2];
                if (i2 + 1 < ib) {
                  i2++;
                } else {
                  exitg1 = 1;
                }
              } else {
                idx[i1] = iwork[i4];
                x[i1] = xwork[i4];
                if (i4 + 1 < i3) {
                  i4++;
                } else {
                  i3 = i1 - i2;
                  for (i1 = 0; i1 < ib - i2; i1++) {
                    i4 = i2 + i1;
                    idx_tmp = (i3 + i4) + 1;
                    idx[idx_tmp] = iwork[i4];
                    x[idx_tmp] = xwork[i4];
                  }

                  exitg1 = 1;
                }
              }
            } while (exitg1 == 0);
          }
        }
      }

      i2 = 8;
    }

    i1 = (1000 - nNaNs) >> i2;
    ib = 1 << i2;
    while (i1 > 1) {
      if ((i1 & 1U) != 0U) {
        i1--;
        i3 = ib * i1;
        i2 = 1000 - (nNaNs + i3);
        if (i2 > ib) {
          i4 = i2 - ib;
          if ((ib != 0) && (i4 != 0)) {
            b_bLen = ib + i4;
            for (i2 = 0; i2 < b_bLen; i2++) {
              iwork[i2] = static_cast<int16_T>(idx[i3 + i2]);
              xwork[i2] = x[i3 + i2];
            }

            i2 = 0;
            b_bLen = ib;
            i4 += ib;
            i3--;
            do {
              exitg1 = 0;
              i3++;
              if (xwork[i2] >= xwork[b_bLen]) {
                idx[i3] = iwork[i2];
                x[i3] = xwork[i2];
                if (i2 + 1 < ib) {
                  i2++;
                } else {
                  exitg1 = 1;
                }
              } else {
                idx[i3] = iwork[b_bLen];
                x[i3] = xwork[b_bLen];
                if (b_bLen + 1 < i4) {
                  b_bLen++;
                } else {
                  i4 = i3 - i2;
                  for (i3 = 0; i3 < ib - i2; i3++) {
                    b_bLen = i2 + i3;
                    idx[(i4 + b_bLen) + 1] = iwork[b_bLen];
                    x[(i4 + b_bLen) + 1] = xwork[b_bLen];
                  }

                  exitg1 = 1;
                }
              }
            } while (exitg1 == 0);
          }
        }
      }

      i3 = ib << 1;
      i1 >>= 1;
      for (i2 = 0; i2 < i1; i2++) {
        b_bLen = i2 * i3;
        if (ib != 0) {
          idx_tmp = ib + ib;
          for (i4 = 0; i4 < idx_tmp; i4++) {
            iwork[i4] = static_cast<int16_T>(idx[b_bLen + i4]);
            xwork[i4] = x[b_bLen + i4];
          }

          i4 = 0;
          bLen2 = ib;
          b_bLen--;
          do {
            exitg1 = 0;
            b_bLen++;
            if (xwork[i4] >= xwork[bLen2]) {
              idx[b_bLen] = iwork[i4];
              x[b_bLen] = xwork[i4];
              if (i4 + 1 < ib) {
                i4++;
              } else {
                exitg1 = 1;
              }
            } else {
              idx[b_bLen] = iwork[bLen2];
              x[b_bLen] = xwork[bLen2];
              if (bLen2 + 1 < idx_tmp) {
                bLen2++;
              } else {
                bLen2 = b_bLen - i4;
                for (b_bLen = 0; b_bLen < ib - i4; b_bLen++) {
                  nPairs = i4 + b_bLen;
                  idx_tmp = (bLen2 + nPairs) + 1;
                  idx[idx_tmp] = iwork[nPairs];
                  x[idx_tmp] = xwork[nPairs];
                }

                exitg1 = 1;
              }
            }
          } while (exitg1 == 0);
        }
      }

      ib = i3;
    }

    if (1000 - nNaNs > ib) {
      i2 = 1000 - (nNaNs + ib);
      if ((ib != 0) && (i2 != 0)) {
        i3 = ib + i2;
        for (i1 = 0; i1 < i3; i1++) {
          iwork[i1] = static_cast<int16_T>(idx[i1]);
          xwork[i1] = x[i1];
        }

        i1 = 0;
        i3 = ib;
        i2 += ib;
        i4 = -1;
        do {
          exitg1 = 0;
          i4++;
          if (xwork[i1] >= xwork[i3]) {
            idx[i4] = iwork[i1];
            x[i4] = xwork[i1];
            if (i1 + 1 < ib) {
              i1++;
            } else {
              exitg1 = 1;
            }
          } else {
            idx[i4] = iwork[i3];
            x[i4] = xwork[i3];
            if (i3 + 1 < i2) {
              i3++;
            } else {
              i3 = i4 - i1;
              for (i2 = 0; i2 < ib - i1; i2++) {
                i4 = i1 + i2;
                idx_tmp = (i3 + i4) + 1;
                idx[idx_tmp] = iwork[i4];
                x[idx_tmp] = xwork[i4];
              }

              exitg1 = 1;
            }
          }
        } while (exitg1 == 0);
      }
    }
  }

  if ((nNaNs > 0) && (1000 - nNaNs > 0)) {
    for (ib = 0; ib < nNaNs; ib++) {
      xwork[ib] = x[(ib - nNaNs) + 1000];
      iwork[ib] = static_cast<int16_T>(idx[(ib - nNaNs) + 1000]);
    }

    for (ib = 0; ib <= nNonNaN; ib++) {
      i1 = 999 - (nNaNs + ib);
      i2 = nNaNs + i1;
      x[i2] = x[i1];
      idx[i2] = idx[i1];
    }

    for (nNonNaN = 0; nNonNaN < nNaNs; nNonNaN++) {
      x[nNonNaN] = xwork[nNonNaN];
      idx[nNonNaN] = iwork[nNonNaN];
    }
  }
}

/* Function for MATLAB Function: '<Root>/GoogLeNet Predict' */
void googlenetModelModelClass::googlenetModel_eML_blk_kernel(const uint8_T in
  [150528], real32_T scores[1000], real_T indxTop[5],
  DW_GoogLeNetPredict_googlenet_T *localDW)
{
  real_T (*gpu_indxTop)[5];
  int32_T iidx[1000];
  int32_T (*gpu_iidx)[1000];
  cudaMalloc(&gpu_indxTop, 40UL);
  cudaMalloc(&gpu_iidx, 4000UL);
  if (!localDW->mynet_not_empty) {
    googl_DeepLearningNetwork_setup(&localDW->mynet);
    localDW->mynet_not_empty = true;
  }

  goo_DeepLearningNetwork_predict(&localDW->mynet, in, scores);
  googlenetModel_sort(scores, iidx);
  cudaMemcpy(gpu_iidx, &iidx[0], 4000UL, cudaMemcpyHostToDevice);
  googlene_eML_blk_kernel_kernel1<<<dim3(1U, 1U, 1U), dim3(32U, 1U, 1U)>>>
    (*gpu_iidx, *gpu_indxTop);
  cudaMemcpy(&indxTop[0], gpu_indxTop, 40UL, cudaMemcpyDeviceToHost);
  cudaFree(*gpu_iidx);
  cudaFree(*gpu_indxTop);
}

/* Output and update for atomic system: '<Root>/GoogLeNet Predict' */
void googlenetModelModelClass::googlenetModel_GoogLeNetPredict(const uint8_T
  rtu_in[150528], B_GoogLeNetPredict_googlenetM_T *localB,
  DW_GoogLeNetPredict_googlenet_T *localDW)
{
  real_T tmp_0[5];
  int32_T i;
  real32_T tmp[1000];
  googlenetModel_eML_blk_kernel(rtu_in, tmp, tmp_0, localDW);
  for (i = 0; i < 5; i++) {
    localB->indxTop[i] = tmp_0[i];
  }

  std::memcpy(&localB->scores[0], &tmp[0], 1000U * sizeof(real32_T));
}

void googlenetModelModelClass::googlenetMode_setupGpuResources(void)
{
}

void googlenetModelModelClass::goog_setupDeepLearningResources(void)
{
}

void googlenetModelModelClass::googlenetMo_cleanupGpuResources(void)
{
}

/* Model step function */
void googlenetModelModelClass::step()
{
  int32_T acc;
  int32_T chan;
  int32_T i;
  int32_T indxTblX;
  int32_T k;
  int32_T m;
  int32_T n;
  uint8_T Resize_LineBuffer[384];

  /* S-Function (svipresize): '<Root>/Resize' incorporates:
   *  Constant: '<S2>/Constant1'
   */
  /* use pre-computed weights and index table to perform interpolation */
  for (chan = 0; chan < 3; chan++) {
    i = chan * 196608;

    /* resize along X-axis direction */
    for (m = 0; m < 384; m++) {
      for (n = 0; n < 224; n++) {
        acc = 0;
        for (k = 0; k < 5; k++) {
          indxTblX = n + k * 224;
          indxTblX = (googlenetModel_P.Constant1_Value
                      [(googlenetModel_ConstP.Resize_Xindex[indxTblX] * 384 + m)
                      + i] * googlenetModel_ConstP.Resize_Xweights[indxTblX]) <<
            3;
          if ((acc < 0) && (indxTblX < MIN_int32_T - acc)) {
            acc = MIN_int32_T;
          } else if ((acc > 0) && (indxTblX > MAX_int32_T - acc)) {
            acc = MAX_int32_T;
          } else {
            acc += indxTblX;
          }
        }

        indxTblX = ((acc & 512U) != 0U) + (acc >> 10);
        if (indxTblX < 0) {
          indxTblX = 0;
        } else {
          if (indxTblX > 255) {
            indxTblX = 255;
          }
        }

        googlenetModel_DW.Resize_IntBuffer[m + n * 384] = static_cast<uint8_T>
          (indxTblX);
      }
    }

    /* resize along Y-axis direction */
    for (n = 0; n < 224; n++) {
      indxTblX = n * 384;
      for (m = 0; m < 384; m++) {
        Resize_LineBuffer[m] = googlenetModel_DW.Resize_IntBuffer[indxTblX + m];
      }

      for (m = 0; m < 224; m++) {
        acc = (Resize_LineBuffer[googlenetModel_ConstP.Resize_Yindex[m]] *
               googlenetModel_ConstP.Resize_Yweights[m]) << 3;
        indxTblX = (Resize_LineBuffer[googlenetModel_ConstP.Resize_Yindex[m +
                    224]] * googlenetModel_ConstP.Resize_Yweights[m + 224]) << 3;
        if ((acc < 0) && (indxTblX < MIN_int32_T - acc)) {
          i = MIN_int32_T;
        } else if ((acc > 0) && (indxTblX > MAX_int32_T - acc)) {
          i = MAX_int32_T;
        } else {
          i = acc + indxTblX;
        }

        indxTblX = (Resize_LineBuffer[googlenetModel_ConstP.Resize_Yindex[m +
                    448]] * googlenetModel_ConstP.Resize_Yweights[m + 448]) << 3;
        if ((i < 0) && (indxTblX < MIN_int32_T - i)) {
          i = MIN_int32_T;
        } else if ((i > 0) && (indxTblX > MAX_int32_T - i)) {
          i = MAX_int32_T;
        } else {
          i += indxTblX;
        }

        indxTblX = (Resize_LineBuffer[googlenetModel_ConstP.Resize_Yindex[m +
                    672]] * googlenetModel_ConstP.Resize_Yweights[m + 672]) << 3;
        if ((i < 0) && (indxTblX < MIN_int32_T - i)) {
          i = MIN_int32_T;
        } else if ((i > 0) && (indxTblX > MAX_int32_T - i)) {
          i = MAX_int32_T;
        } else {
          i += indxTblX;
        }

        indxTblX = ((i & 512U) != 0U) + (i >> 10);
        if (indxTblX < 0) {
          indxTblX = 0;
        } else {
          if (indxTblX > 255) {
            indxTblX = 255;
          }
        }

        googlenetModel_B.Resize[(m + n * 224) + chan * 50176] =
          static_cast<uint8_T>(indxTblX);
      }
    }
  }

  /* End of S-Function (svipresize): '<Root>/Resize' */

  /* MATLAB Function: '<Root>/GoogLeNet Predict' */
  googlenetModel_GoogLeNetPredict(googlenetModel_B.Resize,
    &googlenetModel_B.sf_GoogLeNetPredict,
    &googlenetModel_DW.sf_GoogLeNetPredict);

  /* Outport: '<Root>/PredictScores' */
  for (i = 0; i < 1000; i++) {
    googlenetModel_Y.PredictScores[i] =
      googlenetModel_B.sf_GoogLeNetPredict.scores[i];
  }

  /* End of Outport: '<Root>/PredictScores' */

  /* Outport: '<Root>/Index' */
  for (i = 0; i < 5; i++) {
    googlenetModel_Y.Index[i] = googlenetModel_B.sf_GoogLeNetPredict.indxTop[i];
  }

  /* End of Outport: '<Root>/Index' */
}

/* Model initialize function */
void googlenetModelModelClass::initialize()
{
  /* Registration code */

  /* initialize non-finites */
  rt_InitInfAndNaN(sizeof(real_T));
  googlenetMode_setupGpuResources();
  goog_setupDeepLearningResources();
}

/* Model terminate function */
void googlenetModelModelClass::terminate()
{
  googlenetMo_cleanupGpuResources();
}

/* Constructor */
googlenetModelModelClass::googlenetModelModelClass():
  googlenetModel_B()
  ,googlenetModel_DW()
  ,googlenetModel_Y()
  ,googlenetModel_M()
{
  /* Currently there is no constructor body generated.*/
}

/* Destructor */
googlenetModelModelClass::~googlenetModelModelClass()
{
  /* Currently there is no destructor body generated.*/
}

/* Real-Time Model get method */
RT_MODEL_googlenetModel_T * googlenetModelModelClass::getRTM()
{
  return (&googlenetModel_M);
}
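
The Resize code in the step function above accumulates weighted pixel values in 32-bit integers with saturation, then rounds the sum by 10 bits and clamps the result to the uint8 range. The following is a minimal sketch of that arithmetic in isolation, not the generated code itself. The names saturating_add, round_and_clamp, and self_check are illustrative and do not appear in the listing. The sketch assumes that right-shifting a negative integer is an arithmetic shift, which holds on common compilers and is guaranteed from C++20 onward.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Saturating 32-bit addition, mirroring the MIN_int32_T / MAX_int32_T
// checks in the listing (illustrative helper, not generated code).
static std::int32_t saturating_add(std::int32_t acc, std::int32_t term)
{
  const std::int32_t lo = std::numeric_limits<std::int32_t>::min();
  const std::int32_t hi = std::numeric_limits<std::int32_t>::max();
  if ((acc < 0) && (term < lo - acc)) {
    return lo;  // the sum would underflow
  }
  if ((acc > 0) && (term > hi - acc)) {
    return hi;  // the sum would overflow
  }
  return acc + term;
}

// Drop 10 fractional bits, rounding up when bit 9 is set, then clamp
// the result to the uint8 range [0, 255].
static std::uint8_t round_and_clamp(std::int32_t acc)
{
  std::int32_t v = (((acc & 512) != 0) ? 1 : 0) + (acc >> 10);
  if (v < 0) {
    v = 0;
  } else if (v > 255) {
    v = 255;
  }
  return static_cast<std::uint8_t>(v);
}

// Small self-check of the two helpers.
static void self_check()
{
  assert(saturating_add(5, 7) == 12);
  assert(round_and_clamp(1536) == 2);  // 1.5 in this format rounds up to 2
}
```

For example, round_and_clamp(saturating_add(1024, 512)) yields 2, and any accumulated value of 262144 or more is clamped to 255.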

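The generated class exposes initialize, step, and terminate methods, and a host application typically calls them in that order. The sketch below illustrates that call sequence under stated assumptions: ModelStub is a hypothetical stand-in that only records calls, and run_model is an illustrative helper. Neither is part of the generated code, and a real application would use the generated googlenetModelModelClass and its header instead.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for the generated model class. It exposes the
// same three entry points but only records the order in which they run.
class ModelStub {
 public:
  void initialize() { log_.push_back("initialize"); }
  void step() { log_.push_back("step"); }
  void terminate() { log_.push_back("terminate"); }
  const std::vector<std::string>& log() const { return log_; }

 private:
  std::vector<std::string> log_;
};

// Drive a model through one initialize call, num_steps step calls,
// and one terminate call (illustrative helper).
template <typename Model>
void run_model(Model& model, int num_steps)
{
  model.initialize();
  for (int i = 0; i < num_steps; ++i) {
    model.step();
  }
  model.terminate();
}
```

Because run_model is a template, the same driver could in principle be applied to any class that provides these three methods.
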
Limitations

Code generation for a deep learning network with custom layers is not supported in Simulink.
The Intel Math Kernel Library for Deep Neural Networks (MKL-DNN) requires the C++11 standard. When you set Target Library in the Deep learning group to MKL-DNN, the code generator automatically produces C++11 code.
Use of MATLAB Function blocks in Stateflow^® charts is not supported.
When GPU acceleration is enabled, the code generator does not support using Import custom code to import custom-authored CUDA source files (*.cu). Instead, use coder.ceval inside the MATLAB Function block.
The MATLAB Function block does not support all of the data types in the MATLAB language. For supported data types, refer to the block documentation.
