To improve code execution speed, use Single Instruction, Multiple Data (SIMD), which enables processors to execute a single instruction on multiple data points. SIMD code uses two kinds of instructions: computational instructions, which perform operations such as arithmetic on data stored in vector registers, and data management instructions, which move and organize data to and from those registers.
SIMD is available on instruction sets such as Intel SSE, Intel AVX, and ARM NEON. To generate code that contains SIMD instructions, select the appropriate code replacement library. The supported data types are single, double, int8, int16, int32, and int64.
The table shows SIMD availability for MATLAB® functions and operations, by code replacement library for the target hardware.
Mathematical Functions | MATLAB Functions | Intel SSE | Intel AVX | Intel AVX-512 | Inlined ARM NEON Intrinsics |
---|---|---|---|---|---|
Addition | + | Supports single, double, int8, int16, int32, and int64 | Supports single, double, int8, int16, int32, and int64 | Supports single and double | Supports single |
Subtraction | - | Supports single, double, int8, int16, int32, and int64 | Supports single, double, int8, int16, int32, and int64 | Supports single and double | Supports single |
Multiplication | .* | Supports single, double, int16, and int32 | Supports single, double, int16, and int32 | Supports single and double | Supports single |
Division | ./ | Supports single and double | Supports single and double | Supports single and double | Not supported |
Square Root | sqrt | Supports single and double | Supports single and double | Supports single and double | Not supported |
Rounding Up | ceil | Supports single and double | Supports single and double | Not supported | Not supported |
Rounding Down | floor | Supports single and double | Supports single and double | Not supported | Not supported |
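For instance, a loop that uses only supported operations on supported data types is a candidate for SIMD replacement. The following is a minimal sketch, assuming single-precision inputs; the function name and the bounds are illustrative.

function y = hypotLoop(a, b)
% Sketch: element-wise arithmetic on single-precision data.
% Multiplication, addition, and sqrt on single are listed as supported
% for the Intel SSE and Intel AVX code replacement libraries.
assert(isa(a, 'single'));
assert(isa(b, 'single'));
y = zeros(size(a), 'like', a);
for k = 1:numel(a)
    y(k) = sqrt(a(k) .* a(k) + b(k) .* b(k));
end
end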
SIMD code generation is also supported for some System objects in the DSP System Toolbox, such as dsp.FIRFilter, dsp.FIRDecimator, dsp.FIRInterpolator, and dsp.LMSFilter.
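For example, such a System object can be wrapped in a function that supports code generation. This is a minimal sketch, assuming the DSP System Toolbox is installed; the function name, coefficients, and persistent-object pattern shown here are illustrative.

function y = firFrame(x)
% Sketch: frame-based FIR filtering with a dsp.FIRFilter System object.
% The persistent object is created once and reused across calls.
persistent firFilt
if isempty(firFilt)
    firFilt = dsp.FIRFilter('Numerator', ones(1, 32)/32);  % simple moving-average coefficients
end
y = firFilt(x);
end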
To enable SIMD replacement in the MATLAB Coder app, select the required Device vendor and Device type, and then, on the Custom Code tab, set the Code replacement library parameter. This table shows the code replacement libraries for each supported Device vendor and Device type.
Device Vendor | Device Type | Code Replacement Library |
---|---|---|
Intel or AMD | x86-64 (Windows 64) | Intel SSE (Windows) |
 | | Intel AVX (Windows) |
 | | Intel AVX-512 (Windows) |
 | x86-64 (Linux 64) | Intel SSE (Linux) |
 | | Intel AVX (Linux) |
 | | Intel AVX-512 (Linux) |
ARM Compatible | ARM Cortex-A | Inlined ARM NEON Intrinsics |
Alternatively, in a coder.EmbeddedCodeConfig configuration object, set the CodeReplacementLibrary parameter to a library such as 'Intel AVX (Windows)'.
cfg = coder.config('lib');
cfg.CodeReplacementLibrary = 'Intel AVX (Windows)';
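If Embedded Coder is installed, you can request a coder.EmbeddedCodeConfig object explicitly. This is a minimal sketch of the same setting for a Linux target; the choice of library is illustrative.

cfg = coder.config('lib', 'ecoder', true);          % returns a coder.EmbeddedCodeConfig object
cfg.CodeReplacementLibrary = 'Intel AVX (Linux)';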
Consider the MATLAB function dynamic, which computes a sum of products on variables with data type single. The loop bound is not set at compile time. The table shows the generated code when no code replacement library is selected; the loop executes one iteration at a time. When you choose Intel SSE (Windows) as the code replacement library, the generated code contains SIMD instructions. The SIMD instructions process the loop in increments of four through computational functions such as _mm_add_ps and _mm_mul_ps. The data management instructions _mm_storeu_ps and _mm_loadu_ps store data to and load data from the SIMD registers. If the data type is double, the loop executes in increments of two. Iterations that cannot be vectorized are processed one element at a time in a remainder loop that starts at the index scalarLB.
MATLAB Code:

function C = dynamic(A, B)
assert(all(size(A) <= [100 100]));
assert(all(size(B) <= [100 100]));
assert(isa(A, 'single'));
assert(isa(B, 'single'));
C = zeros(size(A), 'like', A);
for i = 1:numel(A)
    C(i) = (A(i) .* B(i)) + (A(i) .* B(i));
end
end

Generated C Code Without SIMD:

loop_ub = A_size[0] * A_size[1];
for (i = 0; i < loop_ub; i++) {
  C_data_tmp = A_data[i] * B_data[i];
  C_data[i] = C_data_tmp + C_data_tmp;
}

SIMD Optimized Code:

loop_ub = A_size[0] * A_size[1];
scalarLB = loop_ub & -4;
vectorUB = scalarLB - 4;
for (i = 0; i <= vectorUB; i += 4) {
  r = _mm_mul_ps(_mm_loadu_ps(&A_data[i]), _mm_loadu_ps(&B_data[i]));
  _mm_storeu_ps(&C_data[i], _mm_add_ps(r, r));
}
for (i = scalarLB; i < loop_ub; i++) {
  C_data_tmp = A_data[i] * B_data[i];
  C_data[i] = C_data_tmp + C_data_tmp;
}
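For reference, code like the SIMD-optimized column can be generated from the command line. This is a minimal sketch; the coder.typeof bounds mirror the assertions in dynamic, and the -report flag is optional.

% Sketch: generate C code for dynamic with the Intel SSE (Windows)
% code replacement library.
cfg = coder.config('lib');
cfg.CodeReplacementLibrary = 'Intel SSE (Windows)';
argA = coder.typeof(single(0), [100 100], [true true]);  % variable-size single input, bounded at 100-by-100
codegen dynamic -config cfg -args {argA, argA} -report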
For a list of Intel intrinsic functions for supported MATLAB functions, see https://software.intel.com/sites/landingpage/IntrinsicsGuide/. For a list of ARM NEON intrinsic functions, see https://developer.arm.com/architectures/instruction-sets/simd-isas/neon/intrinsics.
The generated code is not optimized through SIMD when the MATLAB code contains:
- Scalar operations outside a loop. For instance, if a, b, and c are scalars, the generated code does not optimize an operation such as c = a + b.
- Indirectly indexed arrays or matrices. For instance, if A, B, C, and D are vectors, the generated code is not vectorized for an operation such as D(A) = C(A) + B(A); see the sketch after this list.
- Parallel for-loops (parfor). The parfor loop itself is not optimized, but any loops within the body of the parfor loop might be vectorized.
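As an illustration of the indirect-indexing restriction, consider this minimal sketch; the function name and the assumption that A contains valid indices into B, C, and D are illustrative.

function D = gatherAdd(A, B, C)
% Sketch: indirect (gather/scatter) indexing through the index vector A.
% A loop written this way generates scalar code rather than SIMD instructions.
assert(isa(B, 'single'));
assert(isa(C, 'single'));
D = zeros(size(C), 'like', C);
for k = 1:numel(A)
    D(A(k)) = C(A(k)) + B(A(k));
end
end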