Generate Code Containing Single Instruction Multiple Data for Simulink Models

To improve code execution speed, use Single Instruction Multiple Data (SIMD), which enables a processor to execute a single instruction on multiple data points at once. SIMD code combines two kinds of instructions. Computational instructions perform operations, such as arithmetic, on data that is stored in vector registers. Data management instructions move data into and out of those registers and organize it.
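As a rough illustration of the two kinds of instructions, the following hand-written C sketch adds two single-precision arrays four elements at a time using Intel SSE intrinsics. The function name sketch_add_f32 and the SKETCH_HAVE_SSE macro are hypothetical and are not part of any generated code. The sketch assumes an array length that is a multiple of four and falls back to a scalar loop when SSE is not available at compile time.

```c
#include <assert.h>
#include <stddef.h>

/* Use SSE only when the compiler targets it; otherwise fall back to scalar code. */
#if defined(__SSE__) || defined(_M_X64) || defined(_M_AMD64)
#include <xmmintrin.h>   /* __m128, _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps */
#define SKETCH_HAVE_SSE 1
#else
#define SKETCH_HAVE_SSE 0
#endif

/* Hypothetical helper: out[i] = a[i] + b[i].
 * Assumption: n is a multiple of 4, which keeps the sketch short. */
void sketch_add_f32(const float *a, const float *b, float *out, size_t n)
{
    assert(n % 4 == 0);
#if SKETCH_HAVE_SSE
    for (size_t i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);   /* data management: load 4 floats */
        __m128 vb = _mm_loadu_ps(&b[i]);   /* data management: load 4 floats */
        __m128 vsum = _mm_add_ps(va, vb);  /* computational: 4 additions in one instruction */
        _mm_storeu_ps(&out[i], vsum);      /* data management: store 4 floats */
    }
#else
    for (size_t i = 0; i < n; i++) {
        out[i] = a[i] + b[i];              /* scalar fallback, one element per iteration */
    }
#endif
}
```

Calling sketch_add_f32(a, b, out, 8) on two eight-element arrays performs two loop iterations on an SSE target instead of eight.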

SIMD is available through instruction sets such as Intel SSE, Intel AVX, Intel AVX-512, and ARM NEON. To generate code that contains SIMD instructions, select the corresponding code replacement library. The supported data types are single, double, int32, and int64, depending on the operation and the library.

This table shows which Simulink™ blocks and data types support SIMD code generation for each code replacement library.

Arithmetic Operations	Simulink Blocks	Intel SSE	Intel AVX	Intel AVX-512	Inlined ARM NEON Intrinsics
Addition	Add	Supports `single`, `double`, `int32`, and `int64`	Supports `single`, `double`, `int32`, and `int64`	Supports `single` and `double`	Supports `single`
Subtraction	Add	Supports `single`, `double`, `int32`, and `int64`	Supports `single`, `double`, `int32`, and `int64`	Supports `single` and `double`	Supports `single`
Multiplication	Product, Gain	Supports `single`, `double`, and `int32`	Supports `single`, `double`, and `int32`	Supports `single` and `double`	Supports `single`
Division	Divide	Supports `single` and `double`	Supports `single` and `double`	Supports `single` and `double`	Not supported
Square Root	Sqrt	Supports `single` and `double`	Supports `single` and `double`	Supports `single` and `double`	Not supported
Rounding	Ceil and Floor	Supports `single` and `double`	Supports `single` and `double`	Not supported	Not supported

SIMD optimization is available for the For Each block and MATLAB Function blocks containing for-loops. SIMD code generation is also supported for some DSP System Toolbox blocks such as FIR Interpolation (DSP System Toolbox), FIR Decimation (DSP System Toolbox), LMS Filter (DSP System Toolbox), and Discrete FIR Filter. To identify other DSP System Toolbox blocks that support SIMD code generation, see the Extended Capability section of each block.

Enable SIMD in the Generated Code

In the Configuration Parameters dialog box, select the required Device vendor and Device type. To enable SIMD, on the Interface pane, set the Code replacement library parameter: click Select and add the required code replacement libraries to the Selected code replacement libraries - prioritized list pane. This table shows the code replacement libraries for each supported Device vendor and Device type.

Device Vendor	Device Type	Code Replacement Library
`Intel` or `AMD`	`x86-64(Windows 64)`	Intel SSE (Windows)
		Intel AVX (Windows)
		Intel AVX-512 (Windows)
	`x86-64(Linux 64)`	Intel SSE (Linux)
		Intel AVX (Linux)
		Intel AVX-512 (Linux)
`ARM Compatible`	`ARM Cortex-A`	Inlined ARM NEON Intrinsics

Alternatively, you can choose the library at the command line. For example, for the currently open model myExampleModel, set the 'CodeReplacementLibrary' parameter to a library such as 'Intel SSE (Windows)'.

set_param('myExampleModel','CodeReplacementLibrary','Intel SSE (Windows)')

Generate SIMD Code for Divide Blocks

Consider a model that has two Divide blocks with one block having an input data type of single and the other block having an input data type of double.

Generate code without adding a code replacement library to the Selected code replacement libraries - prioritized list pane. The generated code executes the loop one element at a time.

void mDiv_step(void)
{
  int32_T i;
  for (i = 0; i < 140; i++) {
    mDiv_Y.Out2[i] = mDiv_U.In1[i] / mDiv_U.In2[i];
    mDiv_Y.Out3[i] = mDiv_U.In5[i] / mDiv_U.In6[i];
  }
}

Generate code containing SIMD instructions by adding the appropriate code replacement library to the Selected code replacement libraries - prioritized list pane. This generated code is for the Intel SSE (Windows) code replacement library.

void mDiv_step(void)
{
  int32_T idx;
  for (idx = 0; idx <= 136; idx += 4) {
    _mm_storeu_ps(&mDiv_Y.Out2[idx],
                  _mm_div_ps(_mm_loadu_ps(&mDiv_U.In1[idx]),
                             _mm_loadu_ps(&mDiv_U.In2[idx])));
  }

  for (idx = 0; idx <= 138; idx += 2) {
    _mm_storeu_pd(&mDiv_Y.Out3[idx],
                  _mm_div_pd(_mm_loadu_pd(&mDiv_U.In5[idx]),
                             _mm_loadu_pd(&mDiv_U.In6[idx])));
  }
}

The SIMD code processes the two loops in increments of four and two. A 128-bit SSE register holds four single values or two double values, so the loop for the Divide block that has the data type single advances in increments of four, and the loop for the Divide block that has the data type double advances in increments of two. In each iteration, the computational instructions _mm_div_ps and _mm_div_pd divide all elements in the registers in parallel. The data management instructions _mm_loadu_ps and _mm_loadu_pd load data into the SIMD registers, and _mm_storeu_ps and _mm_storeu_pd store the results back to memory. Because each iteration handles several elements, the generated code can execute faster when deployed on the target hardware.
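The loop bounds in the generated code above can be tried outside of Simulink with a small hand-written sketch. The function names sketch_div_f32 and sketch_div_f64 and the SKETCH_USE_SSE2 macro are hypothetical. The sketch assumes array lengths that divide evenly by the step size, as the 140-element signals in this example do, and it falls back to scalar loops when the compiler does not target SSE2.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE2__) || defined(_M_X64) || defined(_M_AMD64)
#include <emmintrin.h>   /* SSE2 header; also provides the SSE float intrinsics */
#define SKETCH_USE_SSE2 1
#else
#define SKETCH_USE_SSE2 0
#endif

/* Hypothetical helper: out[i] = num[i] / den[i] for single precision.
 * Assumption: n is a multiple of 4. For n = 140, idx takes the values 0, 4, ..., 136. */
void sketch_div_f32(const float *num, const float *den, float *out, size_t n)
{
    assert(n % 4 == 0);
#if SKETCH_USE_SSE2
    for (size_t idx = 0; idx + 4 <= n; idx += 4) {
        _mm_storeu_ps(&out[idx],
                      _mm_div_ps(_mm_loadu_ps(&num[idx]),
                                 _mm_loadu_ps(&den[idx])));
    }
#else
    for (size_t idx = 0; idx < n; idx++) {
        out[idx] = num[idx] / den[idx];   /* scalar fallback */
    }
#endif
}

/* Hypothetical helper: out[i] = num[i] / den[i] for double precision.
 * Assumption: n is a multiple of 2. For n = 140, idx takes the values 0, 2, ..., 138. */
void sketch_div_f64(const double *num, const double *den, double *out, size_t n)
{
    assert(n % 2 == 0);
#if SKETCH_USE_SSE2
    for (size_t idx = 0; idx + 2 <= n; idx += 2) {
        _mm_storeu_pd(&out[idx],
                      _mm_div_pd(_mm_loadu_pd(&num[idx]),
                                 _mm_loadu_pd(&den[idx])));
    }
#else
    for (size_t idx = 0; idx < n; idx++) {
        out[idx] = num[idx] / den[idx];   /* scalar fallback */
    }
#endif
}
```

For 140 elements, the single-precision loop runs 35 iterations and the double-precision loop runs 70, compared with 140 iterations for the scalar loop.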

For a list of Intel intrinsic functions for supported Simulink blocks, see https://software.intel.com/sites/landingpage/IntrinsicsGuide/. For a list of Inlined ARM NEON Intrinsics functions, see https://developer.arm.com/architectures/instruction-sets/simd-isas/neon/intrinsics.

Limitations

The generated code is not optimized through SIMD if:

The code in a MATLAB Function block operates on scalar data outside the body of a loop. For instance, if a, b, and c are scalars, the generated code does not vectorize an operation such as c=a+b.
The code in a MATLAB Function block contains indirectly indexed arrays or matrices. For instance, if A, B, C, and D are vectors, the generated code is not vectorized for an operation such as D(A)=C(A)+B(A).
The Simulink model contains a reusable subsystem. The blocks within the reusable subsystem might not be optimized.
The code in a MATLAB Function block contains parallel for-loops (parfor). The parfor loop itself is not vectorized, but any loops within the body of the parfor loop can be vectorized.
The Partition Dimension parameter of a For Each subsystem is below the Loop unrolling threshold configuration parameter.
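
The indirect-indexing limitation can be pictured with a hand-written C analogue. This is only a sketch: the function names are hypothetical, the indices are zero-based rather than one-based as in MATLAB, and the code is not what the code generator produces. Packed loads such as _mm_loadu_ps read adjacent elements from memory, so the first loop maps naturally onto them, whereas the second loop reads and writes positions that are known only at run time.

```c
#include <assert.h>
#include <stddef.h>

/* Direct indexing: iteration i touches element i of every array,
 * so consecutive iterations access adjacent memory locations. */
void sketch_add_direct(const float *b, const float *c, float *d, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        d[i] = c[i] + b[i];
    }
}

/* Indirect indexing: zero-based analogue of D(A) = C(A) + B(A).
 * The positions come from the index array a, so consecutive iterations
 * can touch arbitrary, non-adjacent elements.
 * n is the length of b, c, and d; na is the length of a. */
void sketch_add_indirect(const size_t *a, size_t na,
                         const float *b, const float *c, float *d, size_t n)
{
    for (size_t k = 0; k < na; k++) {
        size_t j = a[k];
        assert(j < n);          /* guard against out-of-range indices */
        d[j] = c[j] + b[j];
    }
}
```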
