If the kernel does little work, the overhead of memcpy operations and kernel
launches can offset any performance gains. Consider working on a larger sample set
(thus increasing the loop size). To detect this condition, look at the nvvp report.
Do more work in the loop or increase the sample set size.
If the loop body uses too many local or temporary variables, it causes high
register pressure in the per-thread register file. You can detect this condition by
running in GPU safe-build mode or by checking the nvvp report.
Consider using different block sizes in the coder.gpu.kernel pragma.
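As a minimal sketch of what trying a different block size might look like, the function below places the coder.gpu.kernel pragma immediately before a loop and requests an explicit launch configuration. The function name scaleAndOffset, the 1-by-1024 input size, and the 8-by-128 split are assumptions made for illustration only. They are starting points to vary while watching the profiler, not recommended values.

```matlab
function out = scaleAndOffset(in) %#codegen
% Hypothetical entry-point function, used only to illustrate the pragma.
% Assumes "in" is a fixed-size 1-by-1024 array of doubles.
out = coder.nullcopy(zeros(1, 1024));

% Request 8 blocks of 128 threads each (8 * 128 = 1024 loop iterations).
% Changing the second argument, for example to [256,1,1] with [4,1,1]
% blocks, changes the block size that is used for this loop.
coder.gpu.kernel([8, 1, 1], [128, 1, 1]);
for i = 1:1024
    out(i) = 2 * in(i) + 1;
end
end
```

Regenerating the code with a few different thread counts and comparing the resulting profiles is one way to see whether block size, rather than the number of variables in the loop body, is the limiting factor.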