Simulation Acceleration Using GPUs

GPU-Based System objects

GPU-based System objects look and behave much like the other System objects in the Communications Toolbox™ product. The important difference is that the algorithm is executed on a Graphics Processing Unit (GPU) rather than on a CPU. Using the GPU can accelerate your simulation.

System objects for the Communications Toolbox product are located in the comm package and are constructed as:

H = comm.<object name>

For example, a Viterbi Decoder System object™ is constructed as:

H = comm.ViterbiDecoder

Where a corresponding GPU-based implementation of a System object exists, it is located in the comm.gpu package and is constructed as:

H = comm.gpu.<object name>

For example, a GPU-based Viterbi Decoder System object is constructed as:

H = comm.gpu.ViterbiDecoder

To see a list of available GPU-based implementations, enter help comm at the MATLAB® command line and click GPU Implementations.

General Guidelines for Using GPUs

Graphics Processing Units (GPUs) excel at processing large quantities of data and performing computations with high compute intensity. Processing large quantities of data is one way to maximize the throughput of your GPU in a simulation. The amount of data that the GPU processes at any one time depends on the size of the data passed to the input of a GPU System object. Therefore, one way to maximize this data size is to process multiple frames of data in each call.

You can use a single GPU System object to process multiple data frames in parallel. This differs from the way many of the standard, or non-GPU, System objects are implemented. For GPU System objects, the number of frames that the object processes in a single call is either implied by one of the object properties or stated explicitly using the NumFrames property of the object.
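The multiframe examples later in this section pass several frames to a GPU System object by stacking them into one column vector. As a minimal illustration of that data layout only, the following sketch uses plain MATLAB arrays; no GPU or toolbox object is involved, and the names frameLen, numFrames, frames, stacked, and recovered are placeholders chosen for this sketch.

```matlab
% Illustrative layout only: stack per-frame columns into one column vector.
frameLen  = 4;                         % samples per frame (placeholder value)
numFrames = 3;                         % frames handled in one call (placeholder value)

% One frame per column
frames = reshape(1:frameLen*numFrames, frameLen, numFrames);

% Column-major stacking: frame 1, then frame 2, then frame 3
stacked = frames(:);

% Splitting a stacked vector back into frames reverses the operation
recovered = reshape(stacked, frameLen, numFrames);
```

Because MATLAB stores arrays in column-major order, frames(:) places the frames back to back in frame order, which is the form the later examples pass to the GPU decoder objects.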

Transmit and decode using BPSK modulation and turbo coding

This example shows how to transmit turbo-encoded blocks of data over a BPSK-modulated AWGN channel. Then, it shows how to decode using an iterative turbo decoder and display errors.

Define the noise variance, set a frame length of 256, and create a seeded random stream so that the results are repeatable.

noiseVar = 4; frmLen = 256;
s = RandStream('mt19937ar', 'Seed', 11);
intrlvrIndices = randperm(s, frmLen);

Create a Turbo Encoder System object. The trellis structure for the constituent convolutional code is poly2trellis(4, [13 15 17], 13). The InterleaverIndices property specifies the mapping the object uses to permute the input bits at the encoder as a column vector of integers.

turboEnc = comm.TurboEncoder('TrellisStructure', poly2trellis(4, ...
      [13 15 17], 13), 'InterleaverIndices', intrlvrIndices);

Create a BPSK Modulator System object.

bpsk = comm.BPSKModulator;

Create an AWGN Channel System object.

channel = comm.AWGNChannel('NoiseMethod', 'Variance', 'Variance', ...
      noiseVar);

Create a GPU-based Turbo Decoder System object. The trellis structure for the constituent convolutional code is poly2trellis(4, [13 15 17], 13). The InterleaverIndices property specifies the mapping the object uses to permute the input bits at the encoder as a column vector of integers.

turboDec = comm.gpu.TurboDecoder('TrellisStructure', poly2trellis(4, ...
      [13 15 17], 13), 'InterleaverIndices', intrlvrIndices, ...
      'NumIterations', 4);

Create an Error Rate System object.

errorRate = comm.ErrorRate;

Run the simulation.

for frmIdx = 1:8
 data = randi(s, [0 1], frmLen, 1);
 encodedData = turboEnc(data);
 modSignal = bpsk(encodedData);
 receivedSignal = channel(modSignal);

Convert the received signal to log-likelihood ratios for decoding.

 receivedBits = turboDec((-2/(noiseVar/2))*real(receivedSignal));

Compare the original data to the received data and then calculate the error rate results.

errorStats = errorRate(data,receivedBits);
end
fprintf('Error rate = %f\nNumber of errors = %d\nTotal bits = %d\n', ...
errorStats(1), errorStats(2), errorStats(3))
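The scaling applied to the received signal before decoding can be motivated with a short derivation. This is a sketch under stated assumptions that the text above does not spell out: the BPSK modulator maps bit 0 to +1 and bit 1 to -1, the complex channel noise of variance noiseVar splits equally between the real and imaginary parts (so the real part has variance \sigma^2 = noiseVar/2), and the decoder input is taken as the log-ratio of the likelihood of a 1 to the likelihood of a 0. Here y denotes real(receivedSignal).

```latex
\begin{aligned}
L(y) &= \ln\frac{p(y \mid b = 1)}{p(y \mid b = 0)}
      = \frac{-(y+1)^2 + (y-1)^2}{2\sigma^2}
      = -\frac{2}{\sigma^2}\, y, \\
\sigma^2 &= \frac{\mathrm{noiseVar}}{2}
\quad\Longrightarrow\quad
L(y) = -\frac{2}{\mathrm{noiseVar}/2}\, y .
\end{aligned}
```

Under these assumptions, the result matches the factor -2/(noiseVar/2) that multiplies real(receivedSignal) in the simulation loop.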

Process Multiple Data Frames Using a GPU

This example shows how to simultaneously process two data frames using an LDPC Decoder System object. The ParityCheckMatrix property determines the frame size. The number of frames that the object processes is determined by the frame size and the input data vector length.

numframes = 2;
 
ldpcEnc = comm.LDPCEncoder;
ldpcGPUDec = comm.gpu.LDPCDecoder;
ldpcDec = comm.LDPCDecoder;
 
 
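msg = randi([0 1], 32400, numframes);
 
% Encode each frame separately; each column of encout holds one codeword
for ii = 1:numframes
    encout(:,ii) = ldpcEnc(msg(:,ii));
end
 
% Single-ended to bipolar (for LLRs): 0 -> +1, 1 -> -1
encout = 1-2*encout;
 
% Decode on the CPU, one frame per call
for ii = 1:numframes
    cout(:,ii) = ldpcDec(encout(:,ii));
end
 
% Multiframe decode on the GPU: both frames stacked into one column vector
gout = ldpcGPUDec(encout(:));
 
% Check that the GPU and CPU results are equal
isequal(gout,cout(:))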

Process Multiple Data Frames Using NumFrames Property

This example shows how to process multiple data frames using the NumFrames property of the GPU-based Viterbi Decoder System object. For a Viterbi Decoder, the frame size of your system cannot be inferred from an object property. Therefore, the NumFrames property defines the number of frames present in the input data.

numframes = 10;
 
convEncoder = comm.ConvolutionalEncoder('TerminationMethod', 'Terminated');
vitDecoder = comm.ViterbiDecoder('TerminationMethod', 'Terminated');
 
%Create a GPU Viterbi Decoder, using NumFrames property.
vitGPUDecoder = comm.gpu.ViterbiDecoder('TerminationMethod', 'Terminated', ...
                               'NumFrames', numframes );
 
msg = randi([0 1], 200, numframes);
 
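% Encode each frame separately and map bits to bipolar values (0 -> +1, 1 -> -1)
for ii = 1:numframes
    convEncOut(:,ii) = 1-2*convEncoder(msg(:,ii));
end
 
% Decode on the CPU, one frame per call
for ii = 1:numframes
    cVitOut(:,ii) = vitDecoder(convEncOut(:,ii));
end
 
% Decode on the GPU: all frames stacked into one column vector
gVitOut = vitGPUDecoder(convEncOut(:));
 
% Check that the GPU and CPU results are equal
isequal(gVitOut,cVitOut(:))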

gpuArray and Regular MATLAB Numerical Arrays

A GPU-based System object accepts typical MATLAB arrays or objects created using the gpuArray class. A GPU-based System object supports input signals with double- or single-precision data types. The output signal inherits its data type from the input signal.

If the input signal is a MATLAB array, the System object handles data transfer between the CPU and the GPU. The output signal is a MATLAB array.
If the input signal is a gpuArray, the data remains on the GPU. The output signal is a gpuArray. When the object is given a gpuArray, calculations take place entirely on the GPU, and no data transfer occurs. Passing gpuArray arguments provides increased performance by reducing simulation time. For more information, see Establish Arrays on a GPU (Parallel Computing Toolbox).

Passing MATLAB arrays to a GPU System object requires transferring the initial data from a CPU to the GPU. Then, the GPU System object performs calculations and transfers the output data back to the CPU. This process introduces latency. When data in the form of a gpuArray is passed to a GPU System object, the object does not incur the latency from data transfer. Therefore, a GPU System object runs faster when you supply a gpuArray as the input.

In general, you should try to minimize the amount of data transfer between the CPU and the GPU in your simulation.
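As a rough sketch of this guideline, the following chains three GPU System objects so that the data crosses to the GPU once at the start and returns once at the end. It assumes Parallel Computing Toolbox and a supported GPU are available, leaves every object at its default property values, and uses variable names chosen only for this illustration.

```matlab
% Sketch only: requires Parallel Computing Toolbox and a supported GPU.
pskMod   = comm.gpu.PSKModulator;     % default property values
gpuChan  = comm.gpu.AWGNChannel;      % default noise settings
pskDemod = comm.gpu.PSKDemodulator;   % default property values

x  = randi([0 7], 1000, 1);           % symbols created on the CPU
gx = gpuArray(x);                     % single CPU-to-GPU transfer

% Each object receives a gpuArray, so intermediate signals stay on the GPU
gtx = pskMod(gx);
grx = gpuChan(gtx);
gy  = pskDemod(grx);

y = gather(gy);                       % single GPU-to-CPU transfer
numSymErrs = nnz(y ~= x);             % compare on the CPU
```

Calling each object with MATLAB arrays instead would produce the same kind of result, but every call would incur its own transfer to and from the GPU.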

Pass gpuArray as an Input

This example shows how to pass a gpuArray to the input of the PSK modulator, reducing latency.

pskGPUModulator = comm.gpu.PSKModulator;
x = randi([0 7], 1000, 1, 'single');
gx = gpuArray(x);
 
o = pskGPUModulator(x);
class(o)
 
release(pskGPUModulator); %allow input types to change
 
go = pskGPUModulator(gx);
class(go)

System Block Support for GPU System Objects

GPU System Objects Supported in System Block

comm.gpu.AWGNChannel
comm.gpu.BlockDeinterleaver
comm.gpu.BlockInterleaver
comm.gpu.ConvolutionalDeinterleaver
comm.gpu.ConvolutionalEncoder
comm.gpu.ConvolutionalInterleaver
comm.gpu.PSKDemodulator
comm.gpu.PSKModulator
comm.gpu.TurboDecoder
comm.gpu.ViterbiDecoder

System Block Limitations for GPU System Objects

The GPU System objects must be simulated using Interpreted Execution. You must select this option explicitly on the block mask; the default value is Code generation.
