This example shows how to use the kernelfun pragma in a function and generate CUDA® code.
In one file, write the entry-point function scalars that accepts two vector inputs x and y of size 1x4096 and one scalar input scale. The function has two for-loops of different iteration lengths, one for vector addition and one for finding the cumulative sum. Place the coder.gpu.kernelfun() pragma within the scalars function.
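The entry-point function might look like the following sketch. The function body is an assumption reconstructed from the kernels described later in this example (initializing sout1, a 1024-iteration addition loop, and a 4096-iteration scaled sum); the actual example code may differ in detail.

```matlab
% scalars.m - hedged sketch of the entry-point function; the loop bounds
% and body are assumptions inferred from the generated kernels below.
function [vout, sout1] = scalars(x, y, scale)
    coder.gpu.kernelfun();      % map the loops in this function to CUDA kernels
    vout = x;
    sout1 = 0;                  % mapped to scalars_kernel1
    for i = 1:1024              % vector addition, mapped to scalars_kernel2
        vout(i) = x(i) + y(i);
    end
    for i = 1:4096              % scaled cumulative sum, mapped to scalars_kernel3
        sout1 = sout1 + scale * x(i);
    end
end
```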
Use the codegen function to generate a CUDA MEX function.
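A typical codegen invocation for this example might look like the following; the -args sizes match the 1x4096 vector inputs and scalar input described above, but the exact command-line options used in the original example are an assumption.

```matlab
% Hedged sketch of the code generation command for a CUDA MEX target.
cfg = coder.gpuConfig('mex');                              % GPU MEX configuration
codegen -config cfg -args {ones(1,4096), ones(1,4096), 1} scalars
```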
GPU Coder creates three kernels: scalars_kernel1 for initializing sout1=0, scalars_kernel2 for vector addition, and scalars_kernel3, the reduction kernel for the cumulative sum.
scalars_kernel1<<<dim3(1U, 1U, 1U), dim3(32U, 1U, 1U)>>>(gpu_sout1);
cudaMemcpy(gpu_y, y, 32768U, cudaMemcpyHostToDevice);
cudaMemcpy(gpu_x, x, 32768U, cudaMemcpyHostToDevice);
scalars_kernel2<<<dim3(2U, 1U, 1U), dim3(512U, 1U, 1U)>>>(gpu_y, gpu_x, gpu_vout);
scalars_kernel3<<<dim3(8U, 1U, 1U), dim3(512U, 1U, 1U)>>>(scale, gpu_x, gpu_sout1);
cudaMemcpy(vout, gpu_vout, 32768U, cudaMemcpyDeviceToHost);
cudaMemcpy(sout1, gpu_sout1, 8U, cudaMemcpyDeviceToHost);
scalars_kernel2 has two blocks with 512 threads per block, for a total of 1024 threads, one for adding each element. Similarly, scalars_kernel3 has eight blocks with 512 threads per block, resulting in a total of 4096 threads. GPU Coder also performs an optimization that minimizes the number of cudaMemcpy function calls. In this example, a copy of the input x is already on the GPU, so no extra cudaMemcpy call is required between scalars_kernel2 and scalars_kernel3. In addition to this memory optimization, any sequential code between kernels is mapped to CUDA threads to keep the data on the GPU.
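The launch configurations above follow directly from the loop lengths and a block size of 512 threads. As a quick check of that arithmetic (not generated code):

```matlab
% Grid-size arithmetic behind the kernel launches shown above.
threadsPerBlock = 512;
blocksKernel2 = ceil(1024 / threadsPerBlock);  % 2 blocks  -> 1024 threads
blocksKernel3 = ceil(4096 / threadsPerBlock);  % 8 blocks  -> 4096 threads
```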