CUDA Notes¶
Some notes taken while writing the kernel in glcm_cupy.kernel.
Why conditional modules?¶
A major benefit of dynamically compiled CUDA is the ability to strip out unneeded code.
Thus, each selection of features produces a different compilation: only the code for the selected features is compiled into the kernel.
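As an illustration, feature selection can be expressed as compile-time defines, so unselected feature code never enters the compiled kernel. This is a minimal sketch under assumptions: the macro names (`F_ASM`, `F_CONTRAST`) and kernel signature are hypothetical, not glcm_cupy's actual kernel.

```cuda
// Hypothetical sketch, not the actual glcm_cupy kernel.
// Passing e.g. "-DF_ASM" as a compile option (NVRTC / cupy.RawKernel
// `options`) includes that feature's code; without the define, the
// branch is entirely absent from the compiled kernel.
extern "C" __global__
void features(const float* glcm, float* out, const int n) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    const float p = glcm[i];  // one probability cell of the GLCM
#ifdef F_ASM
    // Angular Second Moment: sum of p^2 over all GLCM cells.
    atomicAdd(&out[0], p * p);
#endif
#ifdef F_CONTRAST
    // Contrast: p * (i - j)^2, recovering (i, j) from the flat
    // index of a row-major 256 x 256 GLCM.
    const float d = (float)(i / 256 - i % 256);
    atomicAdd(&out[1], p * d * d);
#endif
}
```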
Why is the kernel split into 4?¶
The kernel is split into 4 sections:

1. GLCM Creation
2. Features (ASM, Contrast, Homogeneity, GLCM Mean I, GLCM Mean J, Dissimilarity)
3. Features (GLCM Variance I, GLCM Variance J)
4. Features (GLCM Correlation)

Each section depends on the previous one: (2) depends on (1), (3) on (2), and (4) on (3).
It is not possible for a single kernel launch to synchronize all threads across all blocks. Thus, the sections are separated into distinct kernel launches.

Some other solutions were tried:

__syncthreads()
will not work, as we need to synchronize all blocks, not just the threads within one block. We also cannot put all calculations into a single block due to the per-block thread limit (512, 1024, or 2048, depending on the device).
We require at minimum 256 * 256 threads to support a GLCM with a maximum value of 255.
Cooperative Groups
supports grid-wide synchronization, but imposed a 24-block limit in our case.
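The chosen approach can be sketched on the host side as sequential kernel launches. Kernels issued on the same stream run in order, so each launch acts as a grid-wide barrier: section N+1 starts only after every block of section N has finished. The kernel names below are hypothetical, not glcm_cupy's actual API.

```cuda
// Host-side sketch (kernel names are assumptions).
// Launches on the same (default) stream execute in order, giving the
// grid-wide synchronization that __syncthreads() cannot provide.
glcm_create<<<grid, block>>>(src_i, src_j, glcm);
stage2_features<<<grid, block>>>(glcm, feats);    // ASM, Contrast, Means, ...
stage3_variances<<<grid, block>>>(glcm, feats);   // needs the means from stage 2
stage4_correlation<<<grid, block>>>(glcm, feats); // needs the variances from stage 3
cudaDeviceSynchronize();  // wait for all four stages before reading results
```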
Atomic Add¶
Multiple threads cannot safely write to a single address in parallel: their writes overwrite each other and information is lost. This is a race condition.
To avoid it, we use Atomic Functions, which, per the CUDA documentation, are "… guaranteed to be performed without interference from other threads".
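For example, when building the GLCM histogram, many threads may land on the same (i, j) bin, so a plain `glcm[idx] += 1.0f` would race. `atomicAdd` makes the read-modify-write indivisible, so no increment is lost. A minimal sketch, with an assumed signature rather than glcm_cupy's actual kernel:

```cuda
// Hypothetical GLCM-creation sketch. wi/wj hold the paired pixel
// values of the two windows; glcm is a 256 x 256 row-major histogram.
extern "C" __global__
void glcm_create(const unsigned char* wi, const unsigned char* wj,
                 float* glcm, const int n) {
    const int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= n) return;
    const int idx = wi[t] * 256 + wj[t];  // flat index of bin (i, j)
    atomicAdd(&glcm[idx], 1.0f);          // race-free increment
}
```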