gpucoder.stridedMatrixMultiply

Optimized GPU implementation of strided and batched matrix multiply operation

Syntax

D = gpucoder.stridedMatrixMultiply(A,B)

___ = gpucoder.stridedMatrixMultiply(___,Name,Value)

Description

D = gpucoder.stridedMatrixMultiply(A,B) performs strided matrix-matrix multiplication of a batch of matrices. The input matrices A and B for each instance of the batch are located at fixed address offsets from their addresses in the previous instance. The gpucoder.stridedMatrixMultiply function performs matrix-matrix multiplication of the form:

$D = α A B$

where $α$ is a scalar multiplication factor and A, B, and D are matrices with dimensions m-by-k, k-by-n, and m-by-n, respectively. You can optionally transpose or Hermitian-conjugate A and B. By default, $α$ is set to one and the matrices are not transposed. To specify a different scalar multiplication factor or to perform transpose operations on the input matrices, use the Name,Value pair arguments.

All the batches passed to the gpucoder.stridedMatrixMultiply function must be uniform. That is, all instances must have the same dimensions m, n, and k.
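
As a rough reference for what the operation computes, the following plain MATLAB sketch multiplies each instance of the batch in a loop. The function name refStridedMultiply is hypothetical and is not part of GPU Coder. The sketch assumes that the third dimension indexes the batch and that A and B hold the same number of instances. It illustrates the arithmetic only, not how the GPU implementation works.

```matlab
function D = refStridedMultiply(A, B, alpha)
% Hypothetical reference (not part of GPU Coder): D(:,:,i) = alpha*A(:,:,i)*B(:,:,i).
% A is m-by-k-by-batch, B is k-by-n-by-batch, D is m-by-n-by-batch.
[m, k, batch] = size(A);
[kB, n, batchB] = size(B);

% Assumption of this sketch: uniform inner dimension and equal batch counts.
assert(k == kB && batch == batchB, 'Batches must be uniform.');

D = zeros(m, n, batch, 'like', A);
for i = 1:batch
    % One ordinary matrix product per instance of the batch.
    D(:,:,i) = alpha * (A(:,:,i) * B(:,:,i));
end
end
```

Called with a 5-by-4-by-100 array A and a 4-by-5-by-100 array B, this sketch returns a 5-by-5-by-100 array.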

___ = gpucoder.stridedMatrixMultiply(___,Name,Value) performs the strided batched matrix multiply operation by using the options specified by one or more Name,Value pair arguments.

Examples

Strided Batched Matrix-Matrix Multiplication

Perform a simple batched matrix-matrix multiplication and use the gpucoder.stridedMatrixMultiply function to generate CUDA® code that calls the appropriate cublas<t>gemmStridedBatched APIs.

In one file, write an entry-point function myStridedMatMul that accepts matrix inputs A and B and a scalar multiplication factor alpha. Because the input matrices are not transposed, use the 'nn' option.

function [D] = myStridedMatMul(A,B,alpha)

[D] = gpucoder.stridedMatrixMultiply(A,B,'alpha',alpha, ...
    'transpose','nn');

end

To create a type for a matrix of doubles for use in code generation, use the coder.newtype function.

A = coder.newtype('double',[5 4 100],[0 0]);
B = coder.newtype('double',[4 5 100],[0 0]);
alpha = 0.3;
inputs = {A,B,alpha};

To generate a CUDA library, use the codegen function.

cfg = coder.gpuConfig('lib');
cfg.GpuConfig.EnableCUBLAS = true;
cfg.GpuConfig.EnableCUSOLVER = true;
cfg.GenerateReport = true;
codegen -config cfg -args inputs myStridedMatMul

The generated CUDA code contains kernels myStridedMatMul_kernelNN for initializing the input and output matrices. The code also contains the cublasDgemmStridedBatched API calls to the cuBLAS library. The following code is a snippet of the generated code.

//
// File: myStridedMatMul.cu
//
...
void myStridedMatMul(const double A_data[], const int A_size[3], const double
                     B_data[], const int B_size[3], double alpha, double D_data[],
                     int D_size[3])
{
  double alpha1;
...
  beta1 = 0.0;
  cudaMemcpy(gpu_alpha1, &alpha1, 8ULL, cudaMemcpyHostToDevice);
  cudaMemcpy(gpu_A_data, (void *)A_data, A_size[0] * A_size[1] * A_size[2] *
             sizeof(double), cudaMemcpyHostToDevice);
  cudaMemcpy(gpu_B_data, (void *)B_data, B_size[0] * B_size[1] * B_size[2] *
             sizeof(double), cudaMemcpyHostToDevice);
  cudaMemcpy(gpu_beta1, &beta1, 8ULL, cudaMemcpyHostToDevice);
  if (D_data_dirtyOnCpu) {
    cudaMemcpy(gpu_D_data, &D_data[0], 25 * D_size[2] * sizeof(double),
               cudaMemcpyHostToDevice);
  }

  if (batchDimsA[2] >= batchDimsB[2]) {
    if (batchDimsA[2] >= 1) {
      ntilecols = batchDimsA[2];
    } else {
      ntilecols = 1;
    }
  } else {
    ntilecols = batchDimsB[2];
  }

  cublasDgemmStridedBatched(getCublasGlobalHandle(), CUBLAS_OP_N, CUBLAS_OP_N, 5,
    5, 4, (double *)gpu_alpha1, (double *)&gpu_A_data[0], 5, strideA, (double *)
    &gpu_B_data[0], 4, strideB, (double *)gpu_beta1, (double *)&gpu_D_data[0], 5,
    25, ntilecols);
  cudaMemcpy(&D_data[0], gpu_D_data, 25 * D_size[2] * sizeof(double),
             cudaMemcpyDeviceToHost);
...
}
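
In the cuBLAS call in the snippet, the stride passed for D is 25, which is the number of elements in one 5-by-5 output matrix. The values of strideA and strideB are elided in the snippet; the sketch below assumes they are likewise the per-instance element counts. Because MATLAB stores arrays in column-major order, each page of a 3-D array starts a fixed number of elements after the previous page. This can be checked in plain MATLAB. The variable names are illustrative.

```matlab
% Sketch: fixed page offsets in a column-major 3-D array (sizes from the example).
m = 5; k = 4; batch = 100;
A = reshape(1:m*k*batch, m, k, batch);   % each element's value equals its linear index

stride = m*k;                            % elements between consecutive pages (assumed strideA)
i = 3;                                   % any page index from 1 to batch
first = (i-1)*stride + 1;                % linear index of A(1,1,i)

% Rebuild page i from a contiguous run of stride elements.
pageFromStride = reshape(A(first:first+stride-1), m, k);
tf = isequal(pageFromStride, A(:,:,i));  % tf is true
```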

Input Arguments

`A`, `B` — Operands
vectors | matrices

Operands, specified as vectors or matrices. gpucoder.stridedMatrixMultiply multiplies along the first two dimensions. In the example on this page, the third dimension indexes the instances of the batch.

Name-Value Pair Arguments

Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside quotes. You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.

Example: D = gpucoder.stridedMatrixMultiply(A,B,'alpha',0.3,'transpose','CC');

`'alpha'` — Scalar multiplication factor
1.0 (default) | scalar

Value of the scalar multiplication factor $α$ applied to the product of A and B. The default value is one.

`'transpose'` — Operation performed on input matrices
'NN' (default) | character vector | string

Character vector or string composed of two characters, indicating the operation performed on the matrices A and B prior to matrix multiplication. Possible values are normal ('N'), transposed ('T'), or complex conjugate transpose ('C').
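
To make the two-character option concrete, the following plain MATLAB sketch applies the selected operation to each page before multiplying. The names refStridedMultiplyOp and applyOp are hypothetical and are not part of GPU Coder. Treating the option as case-insensitive is an assumption, made because this page uses both 'nn' and 'NN'.

```matlab
function D = refStridedMultiplyOp(A, B, alpha, opts)
% Hypothetical reference for the 'transpose' option; not part of GPU Coder.
% opts is a two-character option such as 'NN', 'TN', or 'CC'.
ops = upper(char(opts));                 % assumption: case-insensitive
batch = size(A, 3);

% Compute the first page to learn the output size and class.
P = alpha * (applyOp(A(:,:,1), ops(1)) * applyOp(B(:,:,1), ops(2)));
D = zeros([size(P) batch], 'like', P);
D(:,:,1) = P;

for i = 2:batch
    D(:,:,i) = alpha * (applyOp(A(:,:,i), ops(1)) * applyOp(B(:,:,i), ops(2)));
end
end

function X = applyOp(X, op)
% Apply normal ('N'), transpose ('T'), or complex conjugate transpose ('C').
switch op
    case 'N'
        % Leave X unchanged.
    case 'T'
        X = X.';
    case 'C'
        X = X';
    otherwise
        error('Unknown operation ''%s''.', op);
end
end
```

With 'TN', for instance, A must be k-by-m-by-batch so that each transposed page is m-by-k and conforms with the k-by-n pages of B.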

Output Arguments

`D` — Product
scalar | vector | matrix

Product, returned as a scalar, vector, or matrix. Array D has the same number of rows as input A and the same number of columns as input B.
