This example shows how to generate an execution profiling report for the generated CUDA® code by using the gpucoder.profile function. Fog rectification is used as an example to demonstrate this concept.
CUDA-enabled NVIDIA® GPU with compute capability 3.2 or higher.
NVIDIA CUDA toolkit and driver.
Environment variables for the compilers and libraries. For information on the supported versions of the compilers and libraries, see Third-Party Hardware. For setting up the environment variables, see Setting Up the Prerequisite Products.
The profiling workflow of this example depends on the nvprof tool from NVIDIA. From CUDA toolkit v10.1 onward, NVIDIA restricts access to performance counters to admin users only. To enable GPU performance counters for all users, see the instructions provided in https://developer.nvidia.com/nvidia-development-tools-solutions-ERR_NVGPUCTRPERM-permission-issue-performance-counters.
To verify that the compilers and libraries necessary for running this example are set up correctly, use the coder.checkGpuInstall function.
envCfg = coder.gpuEnvConfig('host');
envCfg.BasicCodegen = 1;
envCfg.Quiet = 1;
coder.checkGpuInstall(envCfg);
The fog_rectification.m function takes a foggy image as input and returns a defogged image. To generate CUDA code, create a GPU code configuration object with a dynamic library ('dll') build type. Because the gpucoder.profile function accepts only an Embedded Coder configuration object, a coder.EmbeddedCodeConfig configuration object is used even if the ecoder option is not explicitly selected.
inputImage = imread('foggyInput.png');
inputs = {inputImage};
designFileName = 'fog_rectification';
cfg = coder.gpuConfig('dll');
cfg.GpuConfig.MallocMode = 'discrete';
Run gpucoder.profile with a threshold value of 0.003 to see the SIL execution report. The threshold value of 0.003 is only a representative number. If the generated code makes many CUDA API or kernel calls, each call is likely to account for only a small proportion of the total time. Set a low threshold value (between 0.001 and 0.005) to generate a meaningful profiling report. Avoid setting the number of executions ('NumCalls') to a very low value (less than 5), because so few runs do not give an accurate representation of a typical execution profile.
gpucoder.profile(designFileName, inputs, 'CodegenConfig', cfg, 'Threshold', 0.003, 'NumCalls', 10);
### Starting SIL execution for 'fog_rectification'
    To terminate execution: clear fog_rectification_sil
    Execution profiling data is available for viewing. Open Simulation Data Inspector.
    Execution profiling report available after termination.
### Stopping SIL execution for 'fog_rectification'
Code execution profiling report for the fog_rectification function
The code execution profiling report provides metrics based on data collected from a SIL or PIL execution. Execution times are calculated from data recorded by instrumentation probes added to the SIL or PIL test harness or inside the code generated for each component. For more information, see View Execution Times (Embedded Coder). These numbers are representative. The actual values depend on your hardware setup. This profiling was done using MATLAB R2020a on a machine with a 6-core, 3.5 GHz Intel® Xeon® CPU and an NVIDIA TITAN XP GPU.
Section 3 shows the complete trace of GPU calls that have a run time higher than the threshold value. The 'Threshold' parameter is defined as a fraction of the maximum execution time for a run (excluding the first run). For example, out of 9 calls to the top-level fog_rectification function, if the third call took the maximum time ($t$ ms), then the maximum execution time is $t$ milliseconds. All GPU calls taking more than $\text{Threshold} \times t$ milliseconds are shown in this section. Placing your cursor over a call shows the run-time values of other relevant, non-timing information for that call. For example, placing your cursor over fog_rectification_kernel10 shows the block dimensions, grid dimensions, and the static shared memory size in KiB of that call. This trace corresponds to the run that took the maximum time.
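The threshold rule described above can be sketched with a few lines of MATLAB. This is only an illustration of the arithmetic, not how gpucoder.profile computes it internally. The per-run times in runTimesMs are invented for the example, and the variable names are hypothetical.

```matlab
% Hypothetical total execution times (ms) for 10 calls to the entry-point
% function. These values are made up for illustration only.
runTimesMs = [120 48 52 47 49 50 48 51 47 49];
threshold  = 0.003;                  % same value as passed to 'Threshold'

% Exclude the first run, which typically includes one-time start-up costs.
steadyRunsMs = runTimesMs(2:end);    % 9 remaining runs

% The maximum execution time among the remaining runs.
tMax = max(steadyRunsMs);

% GPU calls that take longer than this cutoff would appear in the trace.
cutoffMs = threshold * tMax;

fprintf('Maximum run time: %g ms\n', tMax);
fprintf('Reporting cutoff: %.3f ms\n', cutoffMs);
```

With these made-up numbers, the slowest of the 9 remaining runs takes 52 ms, so only GPU calls longer than 0.156 ms would be listed. Raising the threshold shortens the trace, and lowering it includes more short-running calls.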
Section 4 in the report shows a summary of the GPU calls that are shown in Section 3. In this report, cudaFree is called 17 times per run of fog_rectification, and the average time taken by those 17 calls over 9 runs of fog_rectification is 1.7154 milliseconds. The summary is sorted in descending order of time taken so that you can see which GPU call takes the most time.
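As a rough sketch of how such a summary can be read, the MATLAB snippet below averages per-run totals for a few call types and sorts them in descending order. It assumes that each reported figure is the total time spent in all calls of one type during a run, averaged over the runs. The call counts and timings are invented for illustration and are not taken from the report.

```matlab
% Hypothetical summary data. The names are CUDA runtime API calls, but the
% call counts and timings below are invented for illustration only.
callNames   = {'cudaMalloc', 'cudaFree', 'cudaMemcpy'};
callsPerRun = [17 17 4];

% Rows are runs (first run already excluded). Columns hold the total time
% in ms spent in each call type during that run.
totalMsPerRun = [0.9 1.8 0.4;
                 1.1 1.6 0.5;
                 1.0 1.7 0.6];

% Average over runs, then sort so the most expensive call type comes first.
avgMs = mean(totalMsPerRun, 1);
[sortedMs, order] = sort(avgMs, 'descend');

for k = 1:numel(order)
    fprintf('%-12s %3d calls/run  %.4f ms\n', ...
        callNames{order(k)}, callsPerRun(order(k)), sortedMs(k));
end
```

In this made-up data, cudaFree ends up first in the list, which is the kind of entry to examine first when looking for optimization opportunities.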