To improve code execution speed, use Single Instruction, Multiple Data (SIMD), which enables processors to execute a single instruction on multiple data points. SIMD code uses two kinds of instructions: computational instructions, which perform operations such as arithmetic on data stored in vector registers, and data management instructions, which move and organize data to and from those registers.
SIMD is available on instruction sets such as Intel SSE, Intel AVX, and ARM NEON. To generate code that contains SIMD instructions, select the appropriate code replacement library. The supported data types are single, double, int8, int16, int32, and int64.
The table shows SIMD availability for MATLAB® functions and operations, by code replacement library for the target hardware.
Mathematical Functions | MATLAB Functions | Intel SSE | Intel AVX | Intel AVX-512 | Inlined ARM NEON Intrinsics |
---|---|---|---|---|---|
Addition | + | Supports single, double, int8, int16, int32, and int64 | Supports single, double, int8, int16, int32, and int64 | Supports single and double | Supports single |
Subtraction | - | Supports single, double, int8, int16, int32, and int64 | Supports single, double, int8, int16, int32, and int64 | Supports single and double | Supports single |
Multiplication | .* | Supports single, double, int16, and int32 | Supports single, double, int16, and int32 | Supports single and double | Supports single |
Division | ./ | Supports single and double | Supports single and double | Supports single and double | Not supported |
Square Root | sqrt | Supports single and double | Supports single and double | Supports single and double | Not supported |
Rounding Up | ceil | Supports single and double | Supports single and double | Not supported | Not supported |
Rounding Down | floor | Supports single and double | Supports single and double | Not supported | Not supported |
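For instance, a loop that uses only supported operations on supported data types is a candidate for SIMD replacement. The following is a minimal sketch, assuming single-precision inputs; the function name and the bounds are illustrative.

function y = hypotLoop(a, b)
% Sketch: element-wise arithmetic on single-precision data.
% Multiplication, addition, and sqrt on single are listed as supported
% for the Intel SSE and Intel AVX code replacement libraries.
assert(isa(a, 'single'));
assert(isa(b, 'single'));
y = zeros(size(a), 'like', a);
for k = 1:numel(a)
    y(k) = sqrt(a(k) .* a(k) + b(k) .* b(k));
end
end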
SIMD code generation is also supported for some System objects in the DSP System Toolbox, such as dsp.FIRFilter, dsp.FIRDecimator, dsp.FIRInterpolator, and dsp.LMSFilter.
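For example, such a System object can be wrapped in a function that supports code generation. This is a minimal sketch, assuming the DSP System Toolbox is installed; the function name, coefficients, and persistent-object pattern shown here are illustrative.

function y = firFrame(x)
% Sketch: frame-based FIR filtering with a dsp.FIRFilter System object.
% The persistent object is created once and reused across calls.
persistent firFilt
if isempty(firFilt)
    firFilt = dsp.FIRFilter('Numerator', ones(1, 32)/32);  % simple moving-average coefficients
end
y = firFilt(x);
end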
To enable SIMD replacement in the MATLAB Coder app, select the required Device vendor and Device type, and then, on the Custom Code tab, set the Code replacement library parameter. This table shows the code replacement libraries for each supported Device vendor and Device type.
Device Vendor | Device Type | Code Replacement Library |
---|---|---|
Intel or AMD | x86-64 (Windows 64) | Intel SSE (Windows) |
 | | Intel AVX (Windows) |
 | | Intel AVX-512 (Windows) |
 | x86-64 (Linux 64) | Intel SSE (Linux) |
 | | Intel AVX (Linux) |
 | | Intel AVX-512 (Linux) |
ARM Compatible | ARM Cortex-A | Inlined ARM NEON Intrinsics |
Alternatively, in a coder.EmbeddedCodeConfig configuration object, set the CodeReplacementLibrary parameter to a library such as 'Intel AVX (Windows)'.
cfg = coder.config('lib');
cfg.CodeReplacementLibrary = 'Intel AVX (Windows)';
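If Embedded Coder is installed, you can request a coder.EmbeddedCodeConfig object explicitly. This is a minimal sketch of the same setting for a Linux target; the choice of library is illustrative.

cfg = coder.config('lib', 'ecoder', true);          % returns a coder.EmbeddedCodeConfig object
cfg.CodeReplacementLibrary = 'Intel AVX (Linux)';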
Consider the MATLAB function dynamic, which computes a sum of products on variables with data type single. The loop bound is not set at compile time. The table shows the generated code when no code replacement library is selected; the loop executes one iteration at a time. When you choose Intel SSE (Windows) as the code replacement library, the generated code contains SIMD instructions. The SIMD instructions process the loop in increments of four through computational functions such as _mm_add_ps and _mm_mul_ps. The data management instructions _mm_storeu_ps and _mm_loadu_ps store data to and load data from the SIMD registers. If the data type is double, the loop executes in increments of two. Iterations that cannot be vectorized are processed one element at a time in a remainder loop that starts at the index scalarLB.
MATLAB Code:

function C = dynamic(A, B)
assert(all(size(A) <= [100 100]));
assert(all(size(B) <= [100 100]));
assert(isa(A, 'single'));
assert(isa(B, 'single'));
C = zeros(size(A), 'like', A);
for i = 1:numel(A)
    C(i) = (A(i) .* B(i)) + (A(i) .* B(i));
end
end

Generated C Code Without SIMD:

loop_ub = A_size[0] * A_size[1];
for (i = 0; i < loop_ub; i++) {
  C_data_tmp = A_data[i] * B_data[i];
  C_data[i] = C_data_tmp + C_data_tmp;
}

SIMD Optimized Code:

loop_ub = A_size[0] * A_size[1];
scalarLB = loop_ub & -4;
vectorUB = scalarLB - 4;
for (i = 0; i <= vectorUB; i += 4) {
  r = _mm_mul_ps(_mm_loadu_ps(&A_data[i]), _mm_loadu_ps(&B_data[i]));
  _mm_storeu_ps(&C_data[i], _mm_add_ps(r, r));
}
for (i = scalarLB; i < loop_ub; i++) {
  C_data_tmp = A_data[i] * B_data[i];
  C_data[i] = C_data_tmp + C_data_tmp;
}
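For reference, code like the SIMD-optimized column can be generated from the command line. This is a minimal sketch; the coder.typeof bounds mirror the assertions in dynamic, and the -report flag is optional.

% Sketch: generate C code for dynamic with the Intel SSE (Windows)
% code replacement library.
cfg = coder.config('lib');
cfg.CodeReplacementLibrary = 'Intel SSE (Windows)';
argA = coder.typeof(single(0), [100 100], [true true]);  % variable-size single input, bounded at 100-by-100
codegen dynamic -config cfg -args {argA, argA} -report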
For a list of Intel intrinsic functions for supported MATLAB functions, see https://software.intel.com/sites/landingpage/IntrinsicsGuide/. For a list of ARM NEON intrinsic functions, see https://developer.arm.com/architectures/instruction-sets/simd-isas/neon/intrinsics.
The generated code is not optimized through SIMD when the MATLAB code contains:
- Scalar operations outside a loop. For instance, if a, b, and c are scalars, the generated code does not optimize an operation such as c = a + b.
- Indirectly indexed arrays or matrices. For instance, if A, B, C, and D are vectors, the generated code is not vectorized for an operation such as D(A) = C(A) + B(A); see the sketch after this list.
- Parallel for-loops (parfor). The parfor loop itself is not optimized, but any loops within the body of the parfor loop might be vectorized.
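As an illustration of the indirect-indexing restriction, consider this minimal sketch; the function name and the assumption that A contains valid indices into B, C, and D are illustrative.

function D = gatherAdd(A, B, C)
% Sketch: indirect (gather/scatter) indexing through the index vector A.
% A loop written this way generates scalar code rather than SIMD instructions.
assert(isa(B, 'single'));
assert(isa(C, 'single'));
D = zeros(size(C), 'like', C);
for k = 1:numel(A)
    D(A(k)) = C(A(k)) + B(A(k));
end
end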