# CUDA Notes

Some notes I took down while creating the kernel in `glcm_cupy.kernel`

## Why conditional modules?

A huge benefit to dynamically compiled CUDA is the option to remove unneeded code.

Thus, when [selecting features](select_feature) the compilations are different.

## Why is the kernel split into 4?

The kernel is split into 4 sections

1) GLCM Creation
2) Features (ASM, Contrast, Homogeneity, GLCM Mean I, GLCM Mean J, Dissimilarity)
3) Features (GLCM Variance I, GLCM Variance J)
4) Features (GLCM Correlation)

- (2) is dependent on (1)
- (3) is dependent on (2)
- (4) is dependent on (3)

It's not possible for a single kernel to sync all threads. Thus, they are separated.

There are some other solutions tried

1) `__syncthreads()` will not work as we require to sync all blocks.
    1) We can't put all calculations in a block due to the thread limit of 512, 1024, 2048.
    2) We require 256 * 256 threads minimum to support a GLCM of max value 255.
2) **Cooperative Groups** imposes a 24 block limit.

## Atomic Add

Threads cannot write to a single pointer in parallel, information will be overwritten and lost. This is the **Race
Condition**.

In order to avoid this, we use [**Atomic
Functions**](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#atomic-functions).

> ... it is guaranteed to be performed without interference from other threads