gpucoder.reduce

Optimized GPU implementation for reduction operations

Description

S = gpucoder.reduce(A,FUN) aggregates the values present in the input array A to a single value using the given function handle FUN. The output S is a scalar.

S = gpucoder.reduce(A,{@FUN1,@FUN2,...}) accepts an input array and a cell array of function handles. It aggregates the values in the input array to a single value for each function handle in the cell array. The size of the output is 1-by-N, where N is the number of function handles.

The code generator uses shuffle intrinsics to perform efficient reduction on the GPU. Multiple function handles are aggregated inside a single kernel on the GPU.
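
For the single function handle syntax, an entry-point function can be as simple as the following sketch (the name sumreduce is a placeholder; mysum mirrors the helper used in the Examples section):

function s = sumreduce(A)
  % Reduce all elements of A to a single scalar with one function handle.
  s = gpucoder.reduce(A, @mysum);
end

function c = mysum(a, b)
  c = a + b;
end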

Examples

This example generates CUDA® code to find the sum and the maximum of the elements of an array.

In one file, write an entry-point function multireduce that accepts a matrix input A. Use the gpucoder.reduce function to perform two types of reduction operations on the elements of A.

function s = multireduce(A)
  s = gpucoder.reduce(A, {@mysum, @mymax}); 
end

function c = mysum(a, b)
  c = a+b;
end

function c = mymax(a, b)
  c = max(a,b);
end

Use the codegen function to generate a CUDA MEX function.

codegen -config coder.gpuConfig('mex') -args {rand(1,1024,'double')} -report multireduce

The following is a snippet of the generated code.

...
cudaMalloc(&gpu_s, 16ULL);
cudaMalloc(&gpu_A, 8192ULL);
cudaMemcpy(gpu_A, (void *)&A[0], 8192ULL, cudaMemcpyHostToDevice);
multireduce_kernel1<<<dim3(1U, 1U, 1U), dim3(32U, 1U, 1U)>>>(*gpu_A, *gpu_s);
coder_reduce0<<<dim3(2U, 1U, 1U), dim3(512U, 1U, 1U)>>>(*gpu_A, *gpu_s);
cudaMemcpy(&s[0], gpu_s, 16ULL, cudaMemcpyDeviceToHost);
...
static __inline__ __device__ real_T shflDown2(real_T in1, uint32_T offset,
  uint32_T mask)
{
  int2 tmp;
  tmp = *(int2 *)&in1;
  tmp.x = __shfl_down_sync(mask, tmp.x, offset);
  tmp.y = __shfl_down_sync(mask, tmp.y, offset);
  return *(real_T *)&tmp;
}
...

Input Arguments

A is the input array on which to perform the reduction operation. For code generation, the input array must be of a numeric or logical data type.

FUN is a handle to a user-defined function; it can also be a cell array of function handles. Each function handle must refer to a binary function that satisfies the following requirements (illustrated in the sketch after this list):

  • Accept two inputs and return one output. The types of the inputs and the output must match the type of the input array A.

  • Be commutative and associative; otherwise, the behavior is undefined.
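
For instance (a hypothetical sketch, not from the shipped documentation), a multiplication-based handle meets both requirements, whereas a handle such as @minus does not, because subtraction is neither commutative nor associative:

function c = myprod(a, b)
  % Valid reduction function: scalar multiplication is commutative and
  % associative, and a, b, and c all share the type of the input array.
  c = a * b;
end

gpucoder.reduce(A, @myprod) is then well defined, while gpucoder.reduce(A, @minus) has undefined behavior.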

Output Arguments

S is the result of the reduction operation. During reduction, S is initialized to the value of one of the elements of the input array A. The reduction is then performed by repeatedly applying FUN to S and the elements of A.
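
Conceptually, the reduction corresponds to a sequential loop like the following sketch (one possible evaluation order, for illustration only; the generated code performs the reduction in parallel on the GPU, and the actual order is unspecified):

A   = 1:10;
fun = @(a, b) a + b;      % stands in for the user-defined FUN
s   = A(1);               % S starts from one element of A
for k = 2:numel(A)
    s = fun(s, A(k));     % FUN combines S with the remaining elements
end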

Limitations

  • gpucoder.reduce does not support input arrays of a complex data type.

  • The user-defined function must accept two inputs and return one output. The types of the inputs and the output must match the type of the input array A.

  • The user-defined function must be commutative and associative; otherwise, the behavior is undefined.

  • For some inputs of integer data types, the generated code can contain intermediate computations that reach saturation. In such cases, the results from the generated code might not match the simulation results from MATLAB®. The sketch after this list illustrates why.
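
Saturating integer arithmetic is not associative, so different evaluation orders can produce different results. The following sketch (plain MATLAB, not generated code) shows the effect for int8 addition:

a = int8(100); b = int8(100); c = int8(-100);
r1 = (a + b) + c    % 100 + 100 saturates to 127, then 127 - 100 = 27
r2 = a + (b + c)    % 100 + 0 = 100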

Introduced in R2019b