For GPU code generation, the primary mechanism for creating CUDA® kernels is by using for-loops. The way you write loops in your MATLAB® code has a significant impact on the number of kernels created as well as the performance of the generated code. When you generate GPU code, check the diagnostic report to see if your loop segment has Loop not parallelized notices. Calls to MATLAB functions in your code may also have for-loops that contain these notices. To get maximum performance, ensure that compute-intensive loop segments in your code are mapped to kernels and executed in parallel. The following recommendations help you achieve this goal and generate efficient CUDA kernels.
Consider a function that has nested for-loops.
    function y = foo(x)
        ...
        for i1 = 1:N1
            for i2 = 1:N2
                for i3 = 1:N3
                    for i4 = 1:N4
                        ...
                    end
                end
            end
        end
Assume that one of the intermediate loops, i3, is not parallelizable. When GPU Coder™ performs loop analysis to create kernels, it considers only the outermost parallel loops i1,i2 and creates a kernel with the outer loop dimensions N1,N2. The loops i3,i4 are within the kernel body and are executed sequentially. However, if the innermost loop i4 has a large number of iterations, then better performance may be achieved by creating kernels for the innermost loop.
There are three ways in which you can parallelize the innermost loop:

Rewrite the code so that the innermost code segment is not within a nested loop.

If the iteration size of the outer loop is small, then wrap the loop range in the coder.unroll function. This function unrolls the for-loop by making a copy of the loop body for each loop iteration. For more information, see coder.unroll.

    function y = foo(x)
        ...
        for i1 = coder.unroll(1:N1)
            ...
        end

Make the outer loop bound dynamic, for example by passing it in as a function argument. This way, parallel loop analysis fails on the outer loop, whereas it succeeds on the inner loops.

    function y = foo(x,N1)
        ...
        for i1 = 1:N1
            ...
        end
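
As a hypothetical illustration of the third option, the sketch below assumes a small outer loop over rows and a large inner loop over columns. The function name scaleRows and its arguments are invented for this example. Passing numRows as a run-time argument is what makes the outer bound dynamic. Whether the inner loop is actually mapped to a kernel depends on the rest of the code, so check the diagnostic report after code generation.

```matlab
function y = scaleRows(x, numRows)
    % Hypothetical example: numRows is a run-time input, so the outer
    % loop bound is not known when the code is generated.
    numCols = size(x, 2);
    y = zeros(size(x));
    for i1 = 1:numRows            % small outer loop with a dynamic bound
        for i2 = 1:numCols        % large inner loop, the kernel candidate
            y(i1, i2) = i1 * x(i1, i2);
        end
    end
end
```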
Loops with break are not supported.

    while (i < N)
        ...
        ...
        if (cond2)
            ...
            ...
            break;
        end
    end

Remove breaks by creating a guard variable and conditional.

    cond = true;
    while (i < N)
        if (cond)
            ...
            ...
            if (cond2)
                cond = false;
            end
        end
    end
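
A concrete version of the guard-variable pattern might look like the following sketch. The function sumUntilLimit and the variable active are hypothetical names chosen for this example. It shows only the control-flow rewrite: the running total is itself carried between iterations, so this particular loop is not a candidate for parallel execution.

```matlab
function total = sumUntilLimit(x, limit)
    % Hypothetical example: accumulate x until the running total exceeds
    % limit. The guard variable 'active' takes the place of a break.
    total = 0;
    active = true;
    i = 1;
    while (i <= numel(x))
        if (active)
            total = total + x(i);
            if (total > limit)
                active = false;   % a break statement would go here
            end
        end
        i = i + 1;                % the loop always runs to completion
    end
end
```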
Kernel extraction uses parallel loop dependence analysis. There are cases where loop dependence analysis cannot detect a parallel for-loop. The coder.gpu.kernel pragma allows GPU Coder to override dependence analysis and force kernel creation. The caveat is that you must be sure that the loop is a “for-all” loop with no inter-iteration dependencies.
Use the coder.gpu.kernel pragma explicitly on each of your for-loops.
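
A minimal sketch of the pragma in use is shown below, assuming GPU Coder is installed so that coder.gpu.kernel resolves. The function addVectors is a hypothetical example. Each iteration reads only a(i) and b(i) and writes only c(i), so the loop has no inter-iteration dependencies.

```matlab
function c = addVectors(a, b)
    % Hypothetical example: element-wise addition. No iteration reads a
    % value written by another iteration, so forcing a kernel is safe.
    c = zeros(size(a));
    coder.gpu.kernel();           % place immediately before the for-loop
    for i = 1:numel(a)
        c(i) = a(i) + b(i);
    end
end
```

Called with no arguments, the pragma leaves the choice of kernel dimensions to GPU Coder.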
GPU Coder may not create kernels when logical indexing is used for accessing array elements.

    i = (mag ~= 0);
    vx(i) = vx(i)./mag(i);
    vy(i) = vy(i)./mag(i);

Rewrite the code by using a loop body and guarding with an appropriate conditional.

    for i = 1:numel(mag)
        if (mag(i) ~= 0)
            vx(i) = vx(i)./mag(i);
            vy(i) = vy(i)./mag(i);
        end
    end
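
To confirm that the two forms agree, a quick check such as the following can be run in MATLAB before generating code. The sample values and the names idx, vxRef, vyRef, vxLoop, and vyLoop are assumptions made for this sketch.

```matlab
% Hypothetical sample data: the second magnitude is zero and must be skipped.
mag = [2 0 4];
vx  = [6 1 8];
vy  = [4 3 2];

% Logical-indexing form, kept as the reference result.
idx = (mag ~= 0);
vxRef = vx;  vxRef(idx) = vxRef(idx)./mag(idx);
vyRef = vy;  vyRef(idx) = vyRef(idx)./mag(idx);

% Loop form with a guarding conditional.
vxLoop = vx;
vyLoop = vy;
for i = 1:numel(mag)
    if (mag(i) ~= 0)
        vxLoop(i) = vxLoop(i)./mag(i);
        vyLoop(i) = vyLoop(i)./mag(i);
    end
end

% Both forms should produce identical arrays.
assert(isequal(vxLoop, vxRef) && isequal(vyLoop, vyRef));
```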
Using unsupported functions, coder pragmas, or toolbox functions inside a loop prevents the loop from becoming a kernel.
Try rewriting unsupported functions using pure MATLAB.
If the smaller loops in a loop nest are the outermost loops, then a kernel could be created with just a subset of the loops in the nest. If the algorithm allows it, always put the largest loops outermost.
Rewrite loop nesting with larger loops as outer loops.
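
The sketch below illustrates the reordering under the assumption that the input has few rows and many columns, for example 4-by-4096. The function addOffset and its arguments are hypothetical. Because each element is computed independently, swapping the loop order does not change the result.

```matlab
function out = addOffset(in, offset)
    % Hypothetical example: 'in' is assumed to have few rows and many
    % columns, so the column loop is the larger of the two.
    [nSmall, nLarge] = size(in);
    out = zeros(nSmall, nLarge);
    for j = 1:nLarge              % larger loop placed outermost
        for i = 1:nSmall          % smaller loop placed innermost
            out(i, j) = in(i, j) + offset;
        end
    end
end
```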