gpucoder.batchedMatrixMultiply

Optimized GPU implementation of batched matrix multiply operation

Syntax

[D1,D2] = gpucoder.batchedMatrixMultiply(A1,B1,A2,B2)

[D1,...,DN] = gpucoder.batchedMatrixMultiply(A1,B1,...,AN,BN)

___ = gpucoder.batchedMatrixMultiply(___,Name,Value)

Description

[D1,D2] = gpucoder.batchedMatrixMultiply(A1,B1,A2,B2) performs matrix-matrix multiplication on a batch of two matrix pairs, (A1,B1) and (A2,B2). For each pair, gpucoder.batchedMatrixMultiply computes a product of the form:

$D = α A B$

where $α$ is a scalar multiplication factor, and A, B, and D are matrices with dimensions m-by-k, k-by-n, and m-by-n respectively. A and B can optionally be transposed or Hermitian-conjugated. By default, $α$ is set to one and the matrices are not transposed. Use the Name,Value pair arguments to specify a different scalar multiplication factor and to specify transpose operations on the input matrices.

All the matrix pairs passed to the gpucoder.batchedMatrixMultiply function must be uniform. That is, every pair must have the same dimensions m, n, and k.
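The uniformity requirement can be checked in plain MATLAB before calling the function. The following is a minimal sketch, not part of the GPU Coder API: the variable names `pairs`, `dims`, and `isUniform` are hypothetical, and the check assumes the default case in which no operand is transposed.

```matlab
% Hypothetical pre-check (illustration only, not a gpucoder function).
% Each row of the cell array holds one {A_i, B_i} pair.
pairs = {rand(15,42), rand(42,30); ...
         rand(15,42), rand(42,30)};

numPairs = size(pairs,1);
dims = zeros(numPairs,3);             % one [m n k] row per pair
for i = 1:numPairs
    [m,k]  = size(pairs{i,1});        % A_i is m-by-k
    [k2,n] = size(pairs{i,2});        % B_i is k-by-n
    assert(k == k2, 'Pair %d: inner dimensions differ.', i);
    dims(i,:) = [m n k];
end

% Uniform means every row of dims matches the first row.
isUniform = isequal(dims, repmat(dims(1,:), numPairs, 1));
```

If `isUniform` is false, at least one pair differs in m, n, or k and the batch does not satisfy the requirement stated above.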

[D1,...,DN] = gpucoder.batchedMatrixMultiply(A1,B1,...,AN,BN) performs matrix-matrix multiplication of multiple A, B pairs of the form:

$D_{i} = α A_{i} B_{i}, \quad i = 1, \dots, N$
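As a point of reference, the formula corresponds to the following loop in plain MATLAB. This is only a sketch of the mathematical result under the default options; it does not use the GPU or code generation, and the cell arrays `A`, `B`, and `D` are illustrative names rather than part of the function's interface.

```matlab
% Reference semantics only: plain MATLAB, no GPU involved.
alpha = 0.3;                          % scalar factor shared by every pair
A = {rand(15,42), rand(15,42)};       % A_1, A_2: each m-by-k
B = {rand(42,30), rand(42,30)};       % B_1, B_2: each k-by-n
N = numel(A);

D = cell(1,N);
for i = 1:N
    D{i} = alpha * A{i} * B{i};       % D_i = alpha * A_i * B_i, m-by-n
end
```

Each `D{i}` plays the role of the output `Di` returned by the batched call.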

___ = gpucoder.batchedMatrixMultiply(___,Name,Value) performs the batched matrix multiply operation using the options specified by one or more Name,Value pair arguments.

Examples

Batched Matrix-Matrix Multiplication

This example performs a simple batched matrix-matrix multiplication and uses the gpucoder.batchedMatrixMultiply function to generate CUDA® code that calls the appropriate cublas<t>gemmBatched APIs.

In one file, write an entry-point function myBatchMatMul that accepts the matrix inputs A1, B1, A2, and B2 and the scalar multiplication factor alpha. The input matrices are not transposed, so use the 'nn' option.

function [D1,D2] = myBatchMatMul(A1,B1,A2,B2,alpha)

[D1,D2] = gpucoder.batchedMatrixMultiply(A1,B1,A2,B2, ...
    'alpha',alpha,'transpose','nn');

end

Use the coder.newtype function to create a type for a matrix of doubles for use in code generation.

A1 = coder.newtype('double',[15,42],[0 0]);
A2 = coder.newtype('double',[15,42],[0 0]);
B1 = coder.newtype('double',[42,30],[0 0]);
B2 = coder.newtype('double',[42,30],[0 0]);
alpha = 0.3;
inputs = {A1,B1,A2,B2,alpha};

Use the codegen function to generate a CUDA library.

cfg = coder.gpuConfig('lib');
cfg.GpuConfig.EnableCUBLAS = true;
cfg.GpuConfig.EnableCUSOLVER = true;
cfg.GenerateReport = true;
codegen -config cfg -args inputs myBatchMatMul

The generated CUDA code contains kernels named myBatchMatMul_kernelNN (such as myBatchMatMul_kernel1 through myBatchMatMul_kernel6 in the snippet below) that initialize the input and output matrices. It also contains a cublasDgemmBatched call into the cuBLAS library. The following is a snippet of the generated code.

//
// File: myBatchMatMul.cu
//
...
void myBatchMatMul(const double A1[630], const double B1[1260], const double A2
                   [630], const double B2[1260], double alpha, double D1[450],
                   double D2[450])
{
  double alpha1;
...

  myBatchMatMul_kernel1<<<dim3(2U, 1U, 1U), dim3(512U, 1U, 1U)>>>(*gpu_A2,
    *gpu_A1, *gpu_input_cell_f2, *gpu_input_cell_f1);
  cudaMemcpy(gpu_B2, (void *)&B2[0], 10080UL, cudaMemcpyHostToDevice);
  cudaMemcpy(gpu_B1, (void *)&B1[0], 10080UL, cudaMemcpyHostToDevice);
  myBatchMatMul_kernel2<<<dim3(3U, 1U, 1U), dim3(512U, 1U, 1U)>>>(*gpu_B2,
    *gpu_B1, *gpu_input_cell_f4, *gpu_input_cell_f3);
  myBatchMatMul_kernel3<<<dim3(1U, 1U, 1U), dim3(480U, 1U, 1U)>>>(gpu_r3, gpu_r2);
  myBatchMatMul_kernel4<<<dim3(1U, 1U, 1U), dim3(32U, 1U, 1U)>>>(gpu_r2,
    *gpu_out_cell);
  myBatchMatMul_kernel5<<<dim3(1U, 1U, 1U), dim3(32U, 1U, 1U)>>>(gpu_r3,
    *gpu_out_cell);
...

  cublasDgemmBatched(getCublasGlobalHandle(), CUBLAS_OP_N, CUBLAS_OP_N, 15, 30,
                     42, (double *)gpu_alpha1, (double **)gpu_Aarray, 15,
                     (double **)gpu_Barray, 42, (double *)gpu_beta1, (double **)
                     gpu_Carray, 15, 2);
  myBatchMatMul_kernel6<<<dim3(1U, 1U, 1U), dim3(480U, 1U, 1U)>>>(*gpu_D2,
    *gpu_out_cell, *gpu_D1);
...
}
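The array sizes and byte counts in the snippet follow from the dimensions passed to coder.newtype. The short calculation below is a sketch for cross-checking them; the variable names are illustrative, and the 8-byte element size assumes the double type used in this example.

```matlab
% Dimensions from the coder.newtype calls in this example.
m = 15; k = 42; n = 30;
batchCount = 2;                       % two (A,B) pairs

numelA = m*k;                         % 630  -> A1[630],  A2[630]
numelB = k*n;                         % 1260 -> B1[1260], B2[1260]
numelD = m*n;                         % 450  -> D1[450],  D2[450]

bytesPerDouble = 8;
bytesB = numelB * bytesPerDouble;     % 10080 -> the 10080UL in each cudaMemcpy of B

% With column-major storage and no transpose, the leading dimensions
% equal the row counts: 15 for A, 42 for B, and 15 for the output.
lda = m; ldb = k; ldc = m;
```

These values match the cublasDgemmBatched arguments in the snippet: 15, 30, and 42 for m, n, and k, leading dimensions 15, 42, and 15, and a batch count of 2.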

Input Arguments

`A`, `B` — Operands
vectors | matrices

Operands, specified as vectors or matrices. A and B must be 2-D arrays. The number of columns in A must be equal to the number of rows in B.

Name-Value Pair Arguments

Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside quotes. You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.

Example:

[D1,D2] = gpucoder.batchedMatrixMultiply(A1,B1,A2,B2,'alpha',0.3,'transpose','CC');

`'alpha'` — Scalar multiplication factor
1.0 (default) | scalar

Scalar $α$ that multiplies the matrix product A B, specified as a scalar. The default value is one.

`'transpose'` — Operation performed on input matrices
'NN' (default) | character vector | string

Character vector or string composed of two characters, indicating the operation performed on the matrices A and B prior to matrix multiplication. Possible values are normal ('N'), transposed ('T'), or complex conjugate transpose ('C').
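The effect of the three operation codes can be illustrated in plain MATLAB. This sketch rests on an assumption that the source does not state explicitly: that the first character applies to A and the second to B. The variables `ops` and `X` are hypothetical and exist only for illustration.

```matlab
% Plain MATLAB illustration of 'N', 'T', and 'C' (no GPU involved).
A = rand(42,15) + 1i*rand(42,15);     % stored k-by-m, so 'C' yields m-by-k
B = rand(42,30) + 1i*rand(42,30);     % k-by-n, used as-is with 'N'
ops = 'CN';                           % ASSUMED order: first char for A, second for B

X = {A, B};
for j = 1:2
    switch upper(ops(j))
        case 'T'
            X{j} = X{j}.';            % transpose
        case 'C'
            X{j} = X{j}';             % complex conjugate transpose
        otherwise
            % 'N': leave the operand unchanged
    end
end

D = X{1} * X{2};                      % (A') * B, a 15-by-30 result
```

For real-valued inputs, 'T' and 'C' give the same result, because conjugation has no effect on real numbers.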

Output Arguments

`D` — Product
scalar | vector | matrix

Product, returned as a scalar, vector, or matrix. Array D has the same number of rows as input A and the same number of columns as input B.

Introduced in R2020a