This example shows how to use the nokernel
pragma
in a function and prevent the code generator from generating CUDA kernels for the statements within the loop
In one file, write the entry-point function nestedLoop
that
accepts two vector inputs A,B
of size 32x512
. The
function has two nested for
-loops of different iteration lengths, one
for operating along the column and one for operating along the row. The first nested
loop computes the sum of the two vector inputs while the second nested loop scales the
sum by a factor of three.
Use the codegen
function to generate CUDA MEX
function.
GPU Coder creates two kernels: nestedLoop_kernel1
to perform the
computation G(i,j) = A(1,j) + B(1,j);
of the first nested loop and
nestedLoop_kernel2
kernel to perform the computation
C(i,j) = G(i,j) * 3;
of the second nested loop. The second kernel
is created for the inner loop of the second nested loop. The noKernel
pragma is applicable only to the loop that immediately follows the statement. Snippets
of the generated kernels are shown.
static __global__ __launch_bounds__(512, 1) void nestedLoop_kernel1(const real_T
B[512], const real_T A[512], real_T G[16384])
{
uint32_T threadId;
...
if (i < 32) {
G[i + (j << 5)] = A[j] + B[j];
}
}
static __global__ __launch_bounds__(512, 1) void nestedLoop_kernel2(real_T G
[16384], int32_T i, real_T C[16384])
{
uint32_T threadId;
...;
if (j < 512) {
C[i + (j << 5)] = G[i + (j << 5)] * 3.0;
}
A snippet of the main function shows that the code generator has fused the first
nested loop as indicated by the kernel launch parameters. As mentioned earlier, the
outer loop of the second nested loop is the one that is not mapped to a kernel. Hence
the code generator places a for-loop
statement just before the call
to the second CUDA kernel nestedLoop_kernel2
.
void nestedLoop(const real_T A[512], const real_T B[512], real_T C[16384])
{
int32_T i;
...
// These two loops will be fused
cudaMemcpy(gpu_B, (void *)&B[0], 4096UL, cudaMemcpyHostToDevice);
cudaMemcpy(gpu_A, (void *)&A[0], 4096UL, cudaMemcpyHostToDevice);
nestedLoop_kernel1<<<dim3(32U, 1U, 1U), dim3(512U, 1U, 1U)>>>(*gpu_B, *gpu_A, *
gpu_G);
for (i = 0; i < 32; i++) {
nestedLoop_kernel2<<<dim3(1U, 1U, 1U), dim3(512U, 1U, 1U)>>>(*gpu_G, i,
*gpu_C);
C_dirtyOnGpu = true;
}
...
cudaFree(*gpu_C);
}