Generate Code Containing Single Instruction Multiple Data for MATLAB Code

To improve code execution speed, use Single Instruction Multiple Data (SIMD), which enables processors to execute a single instruction on multiple data points at once. This parallel computation relies on two kinds of instructions. Computational instructions perform operations, such as arithmetic, on data that is stored in vector registers. Data management instructions move data to and from the registers and organize it.
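As a minimal, hand-written sketch (not MATLAB Coder output), the following C code adds two four-element single-precision arrays with one SSE computational instruction and uses data management instructions to move the values into and out of vector registers. The function name add4 and the macro HAVE_SSE are hypothetical. The scalar fallback is an assumption added here so that the sketch also compiles on targets without SSE.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h> /* SSE intrinsics */
#define HAVE_SSE 1
#else
#define HAVE_SSE 0
#endif

/* Hypothetical helper: out[0..3] = a[0..3] + b[0..3]. */
static void add4(const float *a, const float *b, float *out)
{
    assert(a != NULL && b != NULL && out != NULL);
#if HAVE_SSE
    __m128 va = _mm_loadu_ps(a);    /* data management: load 4 floats  */
    __m128 vb = _mm_loadu_ps(b);    /* data management: load 4 floats  */
    __m128 vc = _mm_add_ps(va, vb); /* computation: 4 additions at once */
    _mm_storeu_ps(out, vc);         /* data management: store 4 floats */
#else
    /* Assumed fallback for targets without SSE: one element at a time. */
    for (int i = 0; i < 4; i++) {
        out[i] = a[i] + b[i];
    }
#endif
}
```

For example, calling add4 with the arrays {1, 2, 3, 4} and {10, 20, 30, 40} fills the output array with {11, 22, 33, 44}.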

SIMD is available on instruction sets such as Intel SSE, Intel AVX, and Inlined ARM NEON Intrinsics. To generate code that contains SIMD instructions, select the appropriate code replacement library. The supported data types are single, double, int8, int16, int32, and int64.

This table shows the MATLAB® functions and operations that support SIMD, and the supported data types for each code replacement library.

| Mathematical Functions | MATLAB Functions | Intel SSE | Intel AVX | Intel AVX-512 | Inlined ARM NEON Intrinsics |
|---|---|---|---|---|---|
| Addition | + | Supports single, double, int8, int16, int32, and int64 | Supports single, double, int8, int16, int32, and int64 | Supports single and double | Supports single |
| Subtraction | - | Supports single, double, int8, int16, int32, and int64 | Supports single, double, int8, int16, int32, and int64 | Supports single and double | Supports single |
| Multiplication | .* | Supports single, double, int16, and int32 | Supports single, double, int16, and int32 | Supports single and double | Supports single |
| Division | ./ | Supports single and double | Supports single and double | Supports single and double | Not supported |
| Square Root | sqrt | Supports single and double | Supports single and double | Supports single and double | Not supported |
| Rounding Up | ceil | Supports single and double | Supports single and double | Not supported | Not supported |
| Rounding Down | floor | Supports single and double | Supports single and double | Not supported | Not supported |

SIMD code generation is also supported for some System objects in DSP System Toolbox, such as dsp.FIRFilter, dsp.FIRDecimator, dsp.FIRInterpolator, and dsp.LMSFilter.

Enable SIMD in Generated Code

To enable SIMD replacement, in the MATLAB Coder app, after selecting the required Device vendor and Device type, on the Custom Code tab, set the Code replacement library parameter. This table shows the code replacement libraries for the supported Device vendor and Device type.

| Device Vendor | Device Type | Code Replacement Library |
|---|---|---|
| Intel or AMD | x86-64 (Windows 64) | Intel SSE (Windows), Intel AVX (Windows), Intel AVX-512 (Windows) |
| Intel or AMD | x86-64 (Linux 64) | Intel SSE (Linux), Intel AVX (Linux), Intel AVX-512 (Linux) |
| ARM Compatible | ARM Cortex-A | Inlined ARM NEON Intrinsics |

Alternatively, in a coder.EmbeddedCodeConfig configuration object, set the CodeReplacementLibrary parameter to a library such as 'Intel AVX (Windows)'.

cfg = coder.config('lib');
cfg.CodeReplacementLibrary = 'Intel AVX (Windows)';

Generate SIMD Code for Loops and Arrays

Consider the MATLAB function dynamic, which computes a sum-of-products expression on variables of data type single. The loop bound is not known at compile time. The code that follows compares two results. When no code replacement library is selected, the loop in the generated code executes one iteration at a time. When you select Intel SSE (Windows) as the code replacement library, the generated code contains SIMD instructions.

The SIMD instructions process the loop in increments of four by using computational functions such as _mm_add_ps and _mm_mul_ps. The data management instructions _mm_loadu_ps and _mm_storeu_ps load data into and store data from the SIMD registers. If the data type is double, the loop executes in increments of two. Elements that remain after the vectorized loop are processed one at a time in a second loop that starts at the index scalarLB.

MATLAB code:

function C = dynamic(A, B)
   assert(all(size(A) <= [100 100]));
   assert(all(size(B) <= [100 100]));
   assert(isa(A, 'single'));
   assert(isa(B, 'single'));

   C = zeros(size(A), 'like', A);
   for i = 1:numel(A)
       C(i) = (A(i) .* B(i)) + ...
              (A(i) .* B(i));
   end
end

Generated C code without SIMD:

loop_ub = A_size[0] * A_size[1];
for (i = 0; i < loop_ub; i++) {
  C_data_tmp = A_data[i] * B_data[i];
  C_data[i] = C_data_tmp + C_data_tmp;
}

SIMD optimized code:

loop_ub = A_size[0] * A_size[1];
scalarLB = loop_ub & -4;
vectorUB = scalarLB - 4;
for (i = 0; i <= vectorUB; i += 4) {
  r = _mm_mul_ps(_mm_loadu_ps(&A_data[i]), _mm_loadu_ps(&B_data[i]));
  _mm_storeu_ps(&C_data[i], _mm_add_ps(r, r));
}

for (i = scalarLB; i < loop_ub; i++) {
  C_data_tmp = A_data[i] * B_data[i];
  C_data[i] = C_data_tmp + C_data_tmp;
}
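The index arithmetic in the SIMD optimized code can be checked with a small, portable sketch. The helper below is hypothetical and is not MATLAB Coder output. It assumes that the number of elements per vector register (called lanes here) is a power of two, and it counts how many iterations each of the two loops runs. For lanes equal to 4, the expression loop_ub & ~(lanes - 1) gives the same value as loop_ub & -4 in the generated code. The case with lanes equal to 2 mirrors the description of the double data type and is an assumption by analogy.

```c
#include <assert.h>

/* Hypothetical record of how one loop is split into two. */
typedef struct {
    int scalarLB;         /* first index handled by the scalar loop      */
    int vectorUB;         /* last start index of a full vector iteration */
    int vectorIterations; /* iterations that process `lanes` elements    */
    int scalarIterations; /* iterations that process one element         */
} LoopSplit;

static LoopSplit split_loop(int loop_ub, int lanes)
{
    LoopSplit s;
    assert(loop_ub >= 0);
    assert(lanes > 0 && (lanes & (lanes - 1)) == 0); /* power of two */

    /* Round loop_ub down to a multiple of lanes. */
    s.scalarLB = loop_ub & ~(lanes - 1);
    s.vectorUB = s.scalarLB - lanes;
    s.vectorIterations = 0;
    s.scalarIterations = 0;

    /* Vectorized loop: would process elements i .. i + lanes - 1. */
    for (int i = 0; i <= s.vectorUB; i += lanes) {
        s.vectorIterations++;
    }
    /* Remainder loop: would process element i only. */
    for (int i = s.scalarLB; i < loop_ub; i++) {
        s.scalarIterations++;
    }
    return s;
}
```

With 10 elements and four lanes, scalarLB is 8 and vectorUB is 4, so the vectorized loop runs twice (starting at indices 0 and 4) and the remainder loop handles indices 8 and 9. With two lanes, the same 10 elements need five vector iterations and no remainder iterations.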

For a list of Intel intrinsic functions for supported MATLAB functions, see https://software.intel.com/sites/landingpage/IntrinsicsGuide/. For a list of Inlined ARM NEON Intrinsic functions, see https://developer.arm.com/architectures/instruction-sets/simd-isas/neon/intrinsics.

Limitations

The generated code is not optimized through SIMD when the MATLAB code contains:

  • Scalar operations outside a loop. For instance, if a, b, and c are scalars, the generated code does not optimize an operation such as c = a + b.

  • Indirectly indexed arrays or matrices. For instance, if A, B, C, and D are vectors, the generated code is not vectorized for an operation such as D(A) = C(A) + B(A).

  • Parallel for-Loops (parfor). The parfor loop is not optimized, but any loops within the body of the parfor loop might be vectorized.
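The indirect indexing limitation can be pictured with a hand-written C sketch. The function names are hypothetical, and the indices are zero-based, unlike MATLAB. The source does not state why indirect indexing prevents vectorization. A plausible reading, offered as an assumption, is that packed loads such as _mm_loadu_ps read adjacent elements, whereas the indirectly indexed loop reads elements whose positions depend on the contents of the index array.

```c
#include <assert.h>
#include <stddef.h>

/* Direct indexing: iteration i touches element i of every array, so
 * consecutive iterations touch adjacent memory. */
static void add_direct(const float *b, const float *c, float *d, size_t n)
{
    assert(b != NULL && c != NULL && d != NULL);
    for (size_t i = 0; i < n; i++) {
        d[i] = c[i] + b[i];
    }
}

/* Indirect indexing, like D(A) = C(A) + B(A) with zero-based indices in
 * idx: iteration i touches element idx[i], which can be anywhere. */
static void add_indirect(const float *b, const float *c, float *d,
                         const size_t *idx, size_t n_idx)
{
    assert(b != NULL && c != NULL && d != NULL && idx != NULL);
    for (size_t i = 0; i < n_idx; i++) {
        size_t k = idx[i]; /* position is known only at run time */
        d[k] = c[k] + b[k];
    }
}
```

In add_direct, four consecutive iterations read four neighboring values from each input, which matches what a single packed load provides. In add_indirect, the values for four consecutive iterations can be scattered across the arrays.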
