This example shows how to map an input to the constant memory space on the GPU by using the coder.gpu.constantMemory pragma.
Write an entry-point function myFun that accepts two inputs: a of size 256x256 and the constant k of size 1x3. The function has nested for-loops that add the constants to each element of a. To create a kernel, place the coder.gpu.kernel() pragma outside the nested for-loop, immediately before it. The coder.gpu.constantMemory(k) pragma places the read-only input k into the constant memory of the GPU.
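A minimal sketch of such an entry-point function is shown below. The output name b and the element-wise expression follow the generated kernel shown later in this example; the use of coder.nullcopy to preallocate the output and the loop order are assumptions made for illustration.

```matlab
function b = myFun(a, k) %#codegen
    % Map the read-only input k to GPU constant memory.
    coder.gpu.constantMemory(k);

    % Preallocate the output without initializing it (assumed here).
    b = coder.nullcopy(a);

    % Request a kernel for the loop nest that follows.
    coder.gpu.kernel();
    for j = 1:256
        for i = 1:256
            % Add all three constants to the current element of a.
            b(i, j) = a(i, j) + k(1) + k(2) + k(3);
        end
    end
end
```

Both pragmas affect only code generation, so running the function directly in MATLAB should give the same result as a + sum(k) for inputs where the additions are exact.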
Create a configuration object for MEX code generation.
Define a cell array, input, that declares the size and data type of the inputs a and k to the function myFun.
Generate a MEX function, myFun_mex, by running the codegen command with the -config, -args, and -report options to specify the configuration, provide the input arguments, and generate a code generation report.
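The configuration, input declaration, and code generation steps might look like the following sketch. The variable names cfg and inputs are assumptions (inputs is used instead of input to avoid shadowing the MATLAB input function), and the inputs are assumed to be of class double, which is consistent with the 8-byte elements in the generated code.

```matlab
% Configuration object for GPU MEX code generation.
cfg = coder.gpuConfig('mex');

% Example values that fix the size and type of each input:
% a is 256x256 double, k is 1x3 double.
inputs = {ones(256, 'double'), ones(1, 3, 'double')};

% Generate myFun_mex and a code generation report.
codegen -config cfg -args inputs -report myFun
```

After generation succeeds, calling myFun_mex(a, k) with inputs of the declared size and type and comparing the result with myFun(a, k) is a quick sanity check.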
In the report, on the C code tab, click myFun.cu.
The read-only variable k is declared as const_k by using the __constant__ qualifier, as shown in the code snippet.
/* Variable Definitions */
__constant__ real_T const_k[3];
The cudaMemcpyToSymbol call copies the value of k from the host to the device constant memory const_k.
cudaMemcpyToSymbol(const_k, k, 24U, 0U, cudaMemcpyHostToDevice);
cudaMemcpy(gpu_a, a, 524288U, cudaMemcpyHostToDevice);
myFun_kernel1<<<dim3(128U, 1U, 1U), dim3(512U, 1U, 1U)>>>(gpu_a, gpu_b);
cudaMemcpy(b, gpu_b, 524288U, cudaMemcpyDeviceToHost);
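The byte counts and launch dimensions in these calls can be checked with a little arithmetic. The MATLAB sketch below assumes 8-byte double elements; the variable names are illustrative only.

```matlab
bytesPerElement = 8;                     % size of a double (assumed element type)

kBytes = 3 * bytesPerElement;            % bytes copied to const_k
aBytes = 256 * 256 * bytesPerElement;    % bytes copied for a and for b
numThreads = 128 * 512;                  % blocks per grid * threads per block

assert(kBytes == 24);                    % matches 24U in cudaMemcpyToSymbol
assert(aBytes == 524288);                % matches 524288U in cudaMemcpy
assert(numThreads == 256 * 256);         % one thread per element of a
```

The launch therefore provides exactly one thread for each of the 65536 elements of a.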
The kernel body accesses the constant const_k and adds its three elements to each element of a.
static __global__ __launch_bounds__(512, 1) void myFun_kernel1(const real_T *a,
  real_T *b)
{
  int32_T i;
  int32_T j;
  int32_T threadIdX;
  threadIdX = (int32_T)(blockDim.x * blockIdx.x + threadIdx.x);
  i = threadIdX / 256;
  j = threadIdX - i * 256;
  if ((!(j >= 256)) && (!(i >= 256))) {
    b[i + (j << 8)] = ((a[i + (j << 8)] + const_k[0]) + const_k[1]) + const_k[2];
  }
}
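To see that the index arithmetic in the kernel assigns each thread a distinct element, the mapping can be emulated in MATLAB. This is only an illustrative sketch: t stands in for threadIdX over the full launch of 128 blocks of 512 threads, and j * 256 replaces the shift j << 8.

```matlab
t = 0:(128 * 512 - 1);     % global thread index, zero-based

i = floor(t / 256);        % integer division, as in threadIdX / 256
j = t - i * 256;           % remainder, as in threadIdX - i * 256
lin = i + j * 256;         % zero-based linear index, i + (j << 8)

% Every thread passes the bounds check in the kernel.
assert(all(i < 256) && all(j < 256));

% The linear indices cover 0..65535 exactly once.
assert(isequal(sort(lin), 0:65535));
```

Because the mapping is one-to-one, no two threads write to the same element of b.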