This example shows how to use the dspunfold function to generate a multithreaded MEX file from a MATLAB® function using unfolding technology. The MATLAB function can contain an algorithm that is either stateless (has no states) or stateful (has states).
NOTE: The following example assumes that the current host computer has at least two physical CPU cores. The presented screenshots, speedup, and latency values were collected using a host computer with eight physical CPU cores.
Required MathWorks® products:
DSP System Toolbox™
MATLAB Coder™
Using dspunfold with a MATLAB Function Containing a Stateless Algorithm
Consider the MATLAB function dspunfoldDCTExample. This function computes the DCT of an input signal and returns the value and index of the maximum energy point.
function [peakValue,peakIndex] = dspunfoldDCTExample(x)
% Stateless MATLAB function computing the dct of a signal (e.g. audio), and
% returns the value and index of the highest energy point
% Copyright 2015 The MathWorks, Inc.
X = dct(x);
[peakValue,peakIndex] = max(abs(X));
To accelerate the algorithm, a common approach is to generate a MEX file using the codegen function. This example shows how to do so when using an input of 4096 doubles. The generated MEX file, dspunfoldDCTExample_mex, is single-threaded.
codegen dspunfoldDCTExample -args {(1:4096)'}
To generate a multithreaded MEX file, use the dspunfold function. The argument -s 0 indicates that the algorithm in dspunfoldDCTExample is stateless.
dspunfold dspunfoldDCTExample -args {(1:4096)'} -s 0
This command generates these files:
Multithreaded MEX file dspunfoldDCTExample_mt
Single-threaded MEX file dspunfoldDCTExample_st, which is identical to the MEX file obtained using the codegen function
Self-diagnostic analyzer function dspunfoldDCTExample_analyzer
Three additional MATLAB files are also generated, containing the help for each of the above files.
To measure the speedup of the multithreaded MEX file relative to the single-threaded MEX file, see the example function dspunfoldBenchmarkDCTExample.
function dspunfoldBenchmarkDCTExample
% Function used to measure the speedup of the multi-threaded MEX file
% dspunfoldDCTExample_mt obtained using dspunfold vs the single-threaded MEX
% file dspunfoldDCTExample_st.
% Copyright 2015 The MathWorks, Inc.
clear dspunfoldDCTExample_mt; % for benchmark precision purpose
numFrames = 1e5;
inputFrame = (1:4096)';
% exclude first run from timing measurements
dspunfoldDCTExample_st(inputFrame);
tic; % measure execution time for the single-threaded MEX
for frame = 1:numFrames
    dspunfoldDCTExample_st(inputFrame);
end
timeSingleThreaded = toc;
% exclude first run from timing measurements
dspunfoldDCTExample_mt(inputFrame);
tic; % measure execution time for the multi-threaded MEX
for frame = 1:numFrames
    dspunfoldDCTExample_mt(inputFrame);
end
timeMultiThreaded = toc;
fprintf('Speedup = %.1fx\n',timeSingleThreaded/timeMultiThreaded);
dspunfoldBenchmarkDCTExample measures the execution time taken by dspunfoldDCTExample_st and dspunfoldDCTExample_mt to process numFrames frames. Finally, it prints the speedup, which is the ratio of the single-threaded MEX file execution time to the multithreaded MEX file execution time. Run the example.
dspunfoldBenchmarkDCTExample;
Speedup = 4.7x
To improve the speedup even more, increase the repetition value. To modify the repetition value, use the -r flag. For more information on the repetition value, see the dspunfold function reference page. For an example of how to specify the repetition value, see the section 'Using dspunfold with a MATLAB Function Containing a Stateful Algorithm'.
dspunfold generates a multithreaded MEX file, which buffers multiple signal frames and then processes these frames simultaneously, using multiple cores. This process introduces some deterministic output latency. Executing help dspunfoldDCTExample_mt displays more information about the multithreaded MEX file, including the value of the output latency. For this example, the output of the multithreaded MEX file has a latency of 16 frames relative to its input, which is not the case for the single-threaded MEX file.
Run the dspunfoldShowLatencyDCTExample example. The generated plot displays the outputs of the single-threaded and multithreaded MEX files. Notice that the output of the multithreaded MEX is delayed by 16 frames relative to that of the single-threaded MEX.
dspunfoldShowLatencyDCTExample;
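To make the effect of output latency concrete, the following is a minimal sketch in plain MATLAB. It does not call the generated MEX files. The FIFO buffer, the variable names, and the placeholder per-frame result are all assumptions made for illustration, not a description of how dspunfold is implemented. It only shows that a delayed output sequence lines up with the undelayed one once the first 16 outputs are discarded.

```matlab
% Toy model of output latency. The FIFO below is an illustrative assumption,
% not the internal mechanism of the generated multithreaded MEX file.
latency   = 16;                  % output latency in frames, as reported in the text
numFrames = 40;                  % hypothetical number of frames to process
stOut = zeros(numFrames,1);      % stands in for the single-threaded outputs
mtOut = zeros(numFrames,1);      % stands in for the multithreaded outputs
fifo  = zeros(latency,1);        % results held back before they are emitted

for k = 1:numFrames
    result   = k^2;              % placeholder per-frame result
    stOut(k) = result;           % single-threaded: available immediately
    mtOut(k) = fifo(1);          % multithreaded model: oldest buffered result
    fifo     = [fifo(2:end); result];
end

% After discarding the first 'latency' outputs, the two sequences line up.
isAligned = isequal(mtOut(latency+1:end), stOut(1:end-latency));
fprintf('Outputs match after compensating for latency: %d\n', isAligned);
```

When comparing real outputs of the two MEX files, the same index shift (latency+1:end against 1:end-latency) would apply to the collected output frames.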
Using dspunfold with a MATLAB Function Containing a Stateful Algorithm
The MATLAB function dspunfoldFIRExample executes two FIR filters.
type dspunfoldFIRExample
function y = dspunfoldFIRExample(u,c1,c2)
% Stateful MATLAB function executing two FIR filters
% Copyright 2015 The MathWorks, Inc.
persistent FIRSTFIR SECONDFIR
if isempty(FIRSTFIR)
    FIRSTFIR = dsp.FIRFilter('NumeratorSource','Input port');
    SECONDFIR = dsp.FIRFilter('NumeratorSource','Input port');
end
t = FIRSTFIR(u,c1);
y = SECONDFIR(t,c2);
To build the multithreaded MEX file, you must provide the state length corresponding to the two FIR filters. Specify -s 1 to indicate that the state length does not exceed 1 frame.
firCoeffs1 = fir1(127,0.8);
firCoeffs2 = fir1(256,0.2,'High');
dspunfold dspunfoldFIRExample -args {(1:2048)',firCoeffs1,firCoeffs2} -s 1
Executing this code generates:
Multithreaded MEX file dspunfoldFIRExample_mt
Single-threaded MEX file dspunfoldFIRExample_st
Self-diagnostic analyzer function dspunfoldFIRExample_analyzer
The corresponding MATLAB help files for these three files
The output latency of the multithreaded MEX file is 16 frames. To measure the speedup, execute dspunfoldBenchmarkFIRExample.
dspunfoldBenchmarkFIRExample;
Speedup = 3.9x
To improve the speedup of the multithreaded MEX file even more, specify the exact state length in samples. To do so, you must specify which input arguments to dspunfoldFIRExample are frames. In this example, the first input is a frame because the elements of this input are sequenced in time. Therefore, it can be further divided into subframes. The last two inputs are not frames because the FIR filter coefficients cannot be subdivided without changing the nature of the algorithm. The state length of the dspunfoldFIRExample MATLAB function is the sum of the state lengths of the two FIR filters (127 + 256 = 383). Using the -f argument, mark the first input argument as true (frame), and the last two input arguments as false (nonframes).
dspunfold dspunfoldFIRExample -args {(1:2048)',firCoeffs1,firCoeffs2} -s 383 -f [true,false,false]
Again, measure the speedup for the resulting multithreaded MEX using the dspunfoldBenchmarkFIRExample function. Notice that the speedup increased because the exact state length was specified in samples, and dspunfold was able to subdivide the frame inputs.
dspunfoldBenchmarkFIRExample;
Speedup = 6.3x
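The state length passed to -s above can be derived from the coefficient counts. The sketch below assumes the usual direct-form view, in which an FIR filter with N+1 coefficients keeps N past input samples as state. To stay free of toolbox dependencies it uses stand-in vectors of ones (hypothetical values) with the same lengths that fir1(127,...) and fir1(256,...) return.

```matlab
% Stand-in coefficient vectors (hypothetical values). Only their lengths
% matter here: a filter of order N has N+1 coefficients.
c1 = ones(1,128);   % same length as fir1(127,0.8)
c2 = ones(1,257);   % same length as fir1(256,0.2,'High')

% Each FIR filter stores one past sample per coefficient except the first,
% so its state length is numel(coefficients) - 1.
stateLength1 = numel(c1) - 1;
stateLength2 = numel(c2) - 1;

% The two filters run in cascade, so their state lengths add up.
stateLength = stateLength1 + stateLength2;
fprintf('State length to pass with -s: %d samples\n', stateLength);
```

With the real firCoeffs1 and firCoeffs2 from this example, numel(firCoeffs1) - 1 + numel(firCoeffs2) - 1 gives the same total.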
Often, the speedup can be increased even more by increasing the repetition value (-r) provided when invoking dspunfold. The default repetition value is 1. When you increase this value, the multithreaded MEX buffers more frames internally before the processing starts. Increasing the repetition factor increases the efficiency of the multithreading, but at the cost of a higher output latency.
dspunfold dspunfoldFIRExample -args {(1:2048)',firCoeffs1,firCoeffs2} ...
-s 383 -f [true,false,false] -r 5
Again, measure the speedup for the resulting multithreaded MEX, using the dspunfoldBenchmarkFIRExample function. Speedup increases again, but the output latency is now 80 frames. The general output latency formula is 2*Threads*Repetition frames. In these examples, the number of Threads is equal to the number of physical CPU cores.
dspunfoldBenchmarkFIRExample;
Speedup = 7.7x
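The latency formula can be checked against the values quoted in this example. The following is a small sketch; the function handle name outputLatency is an assumption introduced here, and the 4-thread case is a hypothetical host included only for comparison.

```matlab
% Output latency in frames, following the formula 2*Threads*Repetition.
% The handle name is illustrative and is not part of dspunfold.
outputLatency = @(threads, repetition) 2 * threads * repetition;

latencyDefault  = outputLatency(8, 1);   % 8 cores, default repetition of 1
latencyRepeat5  = outputLatency(8, 5);   % 8 cores, -r 5
latencyFourCore = outputLatency(4, 5);   % hypothetical 4-thread host, -r 5

fprintf('8 threads, r = 1: %d frames\n', latencyDefault);
fprintf('8 threads, r = 5: %d frames\n', latencyRepeat5);
fprintf('4 threads, r = 5: %d frames\n', latencyFourCore);
```

The first two values reproduce the 16-frame and 80-frame latencies reported above for the eight-core host.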
To request that dspunfold autodetect the state length, specify -s auto. This option generates an efficient multithreaded MEX file, but with a significant increase in the generation time, due to the extra analysis that it requires.
dspunfold dspunfoldFIRExample -args {(1:2048)',firCoeffs1,firCoeffs2} ...
-s auto -f [true,false,false] -r 5
State length: [autodetect] samples, Repetition: 5, Output latency: 40 frames, Threads: 4
Analyzing: dspunfoldFIRExample.m
Creating single-threaded MEX file: dspunfoldFIRExample_st.mexw64
Searching for minimal state length (this might take a while)
Checking stateless ... Insufficient
Checking 2048 samples ... Sufficient
Checking 1024 samples ... Sufficient
Checking 512 samples ... Sufficient
Checking 256 samples ... Insufficient
Checking 384 samples ... Sufficient
Checking 320 samples ... Insufficient
Checking 352 samples ... Insufficient
Checking 368 samples ... Insufficient
Checking 376 samples ... Insufficient
Checking 380 samples ... Insufficient
Checking 382 samples ... Insufficient
Checking 383 samples ... Sufficient
Minimal state length is 383 samples
Creating multi-threaded MEX file: dspunfoldFIRExample_mt.mexw64
Creating analyzer file: dspunfoldFIRExample_analyzer.p
dspunfold checks different state lengths, using as inputs the values provided with the -args option. The function aims to find the minimum state length for which the outputs of the multithreaded MEX and single-threaded MEX are the same. Notice that it found 383 as the minimal state length, which matches the value computed manually earlier.
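The sequence of state lengths in the log above is consistent with a bisection between an insufficient lower bound (stateless, 0 samples) and a sufficient upper bound (one frame, 2048 samples). The sketch below reproduces that sequence under this assumption. It is not the dspunfold implementation, and the predicate isSufficient is a hypothetical stand-in for the real check, which compares the outputs of the multithreaded and single-threaded MEX files.

```matlab
% Hypothetical stand-in for the real sufficiency check. The true state
% length is hard-coded here purely to illustrate the search.
trueStateLength = 383;
isSufficient = @(len) len >= trueStateLength;

lo = 0;        % stateless: reported as insufficient in the log
hi = 2048;     % one full frame: reported as sufficient in the log
checked = [];  % state lengths tried, in order

% Bisection: keep lo insufficient and hi sufficient until they are adjacent.
while hi - lo > 1
    mid = floor((lo + hi) / 2);
    checked(end+1) = mid; %#ok<SAGROW>
    if isSufficient(mid)
        hi = mid;
    else
        lo = mid;
    end
end

minimalStateLength = hi;
fprintf('Checked: %s\n', mat2str(checked));
fprintf('Minimal state length is %d samples\n', minimalStateLength);
```

Because each step halves the interval, a 2048-sample frame needs at most 11 checks after the two bounds are established, which matches the number of intermediate checks in the log.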
When creating a multithreaded MEX file using dspunfold, the single-threaded MEX file is also created along with an analyzer function. For the stateful example in the previous section, the name of the analyzer is dspunfoldFIRExample_analyzer.
The goal of the analyzer is to provide a quick way to measure the speedup of the multithreaded MEX relative to the single-threaded MEX, and also to check if the outputs of the multithreaded MEX and single-threaded MEX match. Outputs usually do not match when an incorrect state length value is specified.
Execute the analyzer for the multithreaded MEX file, dspunfoldFIRExample_mt, generated previously using the -s auto option.
firCoeffs1_1 = fir1(127,0.8);
firCoeffs1_2 = fir1(127,0.7);
firCoeffs1_3 = fir1(127,0.6);
firCoeffs2_1 = fir1(256,0.2,'High');
firCoeffs2_2 = fir1(256,0.1,'High');
firCoeffs2_3 = fir1(256,0.3,'High');
dspunfoldFIRExample_analyzer((1:2048*3)',[firCoeffs1_1;firCoeffs1_2;firCoeffs1_3],...
[firCoeffs2_1;firCoeffs2_2;firCoeffs2_3]);
Analyzing multi-threaded MEX file dspunfoldFIRExample_mt.mexw64 ...
Latency = 80 frames
Speedup = 7.8x
Each input to the analyzer corresponds to an input of the dspunfoldFIRExample_mt MEX file. Notice that the length (first dimension) of each input is greater than the expected length. For example, dspunfoldFIRExample_mt expects a frame of 2048 doubles for its first input, while 2048*3 samples were provided to dspunfoldFIRExample_analyzer. The analyzer interprets this input as 3 frames of 2048 samples. The analyzer alternates between these 3 input frames circularly while checking if the outputs of the multithreaded and single-threaded MEX files match.
The table shows the inputs used by the analyzer at each step of the numerical check. The total number of steps invoked by the analyzer is 240, or 3*latency, where latency is 80 in this case.
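Step | Input 1 | Input 2 | Input 3 |
---|---|---|---|
Step 1 | (1:2048)' | firCoeffs1_1 | firCoeffs2_1 |
Step 2 | (2049:4096)' | firCoeffs1_2 | firCoeffs2_2 |
Step 3 | (4097:6144)' | firCoeffs1_3 | firCoeffs2_3 |
Step 4 | (1:2048)' | firCoeffs1_1 | firCoeffs2_1 |
... | ... | ... | ... |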
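The circular alternation between the three supplied frames can be sketched with modular indexing. This is an illustration of the selection pattern described above, not the analyzer's actual code. The variable names are assumptions, and only the first input is shown; the coefficient inputs would be selected row by row with the same index.

```matlab
frameLength = 2048;   % frame size expected by dspunfoldFIRExample_mt
numFrames   = 3;      % frames supplied to the analyzer per input
latency     = 80;     % output latency in frames reported by the analyzer

% Split the long first input into one frame per column.
u      = (1:frameLength*numFrames)';
frames = reshape(u, frameLength, numFrames);

numSteps = 3 * latency;   % total steps of the numerical check

% Show which frame the first few steps would use.
for step = 1:4
    idx   = mod(step - 1, numFrames) + 1;   % cycles 1, 2, 3, 1, ...
    frame = frames(:, idx);
    fprintf('Step %d uses frame %d: samples %d to %d\n', ...
        step, idx, frame(1), frame(end));
end
```

Step 4 wraps around to the first frame again, and the same pattern continues for all numSteps steps.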
NOTE: For the analyzer to correctly check for the numerical match between the multithreaded MEX and single-threaded MEX, provide at least two frames with different values for each input. For inputs that represent parameters, such as filter coefficients, the frames can have the same values for each input. In this example, you could have specified a single set of coefficients for the second and third inputs.
See also: dspunfold | Generate a Multi-Threaded MEX File from a MATLAB Function using DSP Unfolding | How Is dspunfold Different from parfor? | Why Does the Analyzer Choose the Wrong State Length? | Workflow for Generating a Multithreaded MEX File using dspunfold