Introduction to GPU Processing (High-Level Overview)

GPUs have become increasingly popular for machine learning and HPC workloads due to their superior throughput. While CPUs use fewer, more complex cores optimized for sequential processing and low latency, GPUs employ hundreds or thousands of simpler cores designed for parallel processing. This fundamental architectural difference, many simple cores versus few complex cores, enables GPUs to excel at data-parallel workloads where the same operation must be performed across large datasets simultaneously.

A single GPU consists of multiple Streaming Multiprocessors (SMs), each containing numerous cores. The types of cores vary by GPU architecture: pre-Volta architectures featured only CUDA cores, while Volta and later architectures (Ampere, Hopper, and Blackwell) added Tensor Cores, which are specifically optimized for the matrix-multiplication operations common in AI workloads.

CUDA programming uses both a thread hierarchy and a memory hierarchy to make efficient use of these computing resources. The thread hierarchy organizes execution into a three-level structure: grids contain thread blocks, which in turn contain threads; within a block, threads are grouped into warps. To support computation, the memory hierarchy provides data access through multiple levels: registers (fastest, per-thread), shared memory (per-block), L1/L2 caches, and global memory (largest, accessible by all threads).
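As a quick illustration (the kernel name and output layout are mine, not from any particular codebase), here is how a thread can compute its position at each level of the hierarchy:

```
__global__ void whereAmI(int *out) {
    int globalId = blockIdx.x * blockDim.x + threadIdx.x; // unique index across the grid
    int warpId   = threadIdx.x / 32;                      // which warp within the block
    int laneId   = threadIdx.x % 32;                      // position within the warp
    out[globalId] = warpId * 32 + laneId;                 // equals threadIdx.x for a 1-D block
}
```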

Kernel Execution

To perform computations on the GPU, developers write functions called kernels in CUDA terminology. When launching a kernel, developers specify the grid dimensions (number of thread blocks) and block dimensions (number of threads per block). During kernel execution, threads typically load data from global memory into shared memory to improve performance. This strategy reduces the frequency of global memory accesses, which have higher latency and lower bandwidth compared to shared memory. Once data resides in shared memory, individual threads can efficiently load values into their private registers for computation.
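Below is a minimal sketch of this staging pattern; the kernel name `scaleTile`, the tile size, and the doubling computation are illustrative assumptions, not a canonical implementation:

```
#include <cuda_runtime.h>

#define TILE 256  // illustrative threads-per-block / tile size

__global__ void scaleTile(const float *in, float *out, int n) {
    __shared__ float tile[TILE];                  // per-block shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tile[threadIdx.x] = in[i];         // global -> shared
    __syncthreads();                              // whole block sees the tile
    if (i < n) {
        float r = tile[threadIdx.x];              // shared -> per-thread register
        out[i] = r * 2.0f;                        // compute on the register value
    }
}

// Launch: <<<grid dimension, block dimension>>> = number of blocks, threads per block.
// scaleTile<<<(n + TILE - 1) / TILE, TILE>>>(d_in, d_out, n);
```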

When moving data from shared memory to registers, a thread accesses shared memory through banks. In architectures like Ampere, shared memory exposes 32 banks, and a warp comprises 32 threads, so in the conflict-free case each thread accesses exactly one bank. Each bank has a capacity of 32 bits = 4 Bytes per cycle.
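Under this model, the bank serving a given shared-memory address can be computed directly. The helper below (`bankOf` is a hypothetical name, not a CUDA API) just encodes that mapping:

```
// Bank mapping model for Ampere-style shared memory:
// consecutive 4-byte words cycle through the 32 banks.
__device__ int bankOf(size_t byteOffset) {
    return (int)((byteOffset / 4) % 32);  // word index modulo number of banks
}
```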

Warps and Registers

As mentioned earlier, each warp consists of 32 threads, and each thread block may contain multiple warps. For example, a thread block with 64 threads results in 2 warps. Each thread block has its own shared memory space that all warps within the block can access.

In architectures like Ampere, shared memory is organized into 32 banks. A bank conflict occurs when multiple threads within the same warp simultaneously access different addresses that fall in the same bank, forcing those accesses to be serialized and reducing effective memory bandwidth. To maintain optimal performance, we must design access patterns that minimize bank conflicts. A small note: these conflicts arise only between threads of the same warp; threads from different warps accessing the same bank do not conflict with each other.
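A classic way to see, and fix, a bank conflict is column-wise access to a square tile: padding each row by one element shifts successive rows into different banks. The demo below is illustrative and assumes one warp (32 threads) per block:

```
__global__ void conflictDemo(float *out) {
    __shared__ float bad[32][32];   // word index = row*32 + col, so bank = col % 32
    __shared__ float good[32][33];  // +1 padding: each row starts one bank later
    int t = threadIdx.x;            // assumes blockDim.x == 32 (one warp)

    bad[t][0]  = (float)t;
    good[t][0] = (float)t;
    __syncthreads();

    float a = bad[t][0];   // all 32 lanes hit bank 0 -> 32-way conflict, serialized
    float b = good[t][0];  // lane t hits bank t -> conflict-free
    out[t] = a + b;
}
```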

Example Scenarios for Matrix Loading:

  • Scenario-1: Loading an 8*8 matrix of 16-bit floating point (half precision) values from Shared Memory to Warp Registers
  • Scenario-2: Loading a 16*8 matrix of 16-bit floating point (half precision) values from Shared Memory to Warp Registers
  • Scenario-3: Loading a 16*16 matrix of 16-bit floating point (half precision) values from Shared Memory to Warp Registers

No Bank Conflicts

Scenario-1

  • 8*8 = 64 Numbers
  • Half Precision = 2 Bytes
  • 64 * 2 Bytes = 128 Bytes
  • Warp = 32 Threads
  • Load 128 Bytes from Shared Memory to Registers using 32 Threads
  • 128 Bytes / 32 Threads = 4 Bytes / 1 Thread
  • Each Thread loads 2 Numbers (4 Bytes)
  • Each Bank can only serve 4 Bytes per cycle
  • This scenario does not lead to a conflict: each thread accesses exactly one bank (see the sketch below)

8*8 Matrix

Bank Assignment Scenario-1
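Here is a sketch of the Scenario-1 access pattern, assuming the 8*8 half-precision tile is stored contiguously in shared memory (kernel and variable names are illustrative):

```
#include <cuda_fp16.h>

__global__ void loadTile8x8(half *out) {
    __shared__ half tile[64];                        // 8*8 halves = 128 Bytes = 32 words
    int lane = threadIdx.x % 32;

    // Fill the tile with illustrative values.
    tile[2 * lane]     = __int2half_rn(2 * lane);
    tile[2 * lane + 1] = __int2half_rn(2 * lane + 1);
    __syncthreads();

    // Lane i loads 32-bit word i (two halves); word i lives in bank i % 32,
    // so each lane touches exactly one distinct bank: no conflict.
    half2 v = reinterpret_cast<half2 *>(tile)[lane];
    out[2 * lane]     = __low2half(v);
    out[2 * lane + 1] = __high2half(v);
}
```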

Scenario-2

  • 16*8 = 128 Numbers
  • Half Precision = 2 Bytes
  • 128 * 2 Bytes = 256 Bytes
  • Warp = 32 Threads
  • Load 256 Bytes from Shared Memory to Registers using 32 Threads
  • 256 Bytes / 32 Threads = 8 Bytes / 1 Thread
  • Each Thread needs to load 4 Numbers (8 Bytes / 2 Bytes per Number)
  • Each Bank can only serve 4 Bytes per cycle
  • Each Thread accesses 2 Banks
  • Thread-0 => Thread-15 access Banks 0-31
  • Thread-16 => Thread-31 access Banks 0-31 again
  • This leads to a 2-way conflict, hence the loading is done over 2 cycles (sequentially); see the sketch below
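A sketch of the Scenario-2 addressing under the same assumptions (contiguous tile, illustrative names); each lane issues two 32-bit word loads:

```
#include <cuda_fp16.h>

__global__ void loadTile16x8(float *out) {
    __shared__ half tile[128];                       // 16*8 halves = 256 Bytes = 64 words
    int lane = threadIdx.x % 32;

    for (int k = lane; k < 128; k += 32)             // illustrative fill
        tile[k] = __int2half_rn(k);
    __syncthreads();

    // Lane i reads words 2i and 2i+1. Lanes 0-15 cover banks 0-31, and
    // lanes 16-31 wrap around to banks 0-31 again: a 2-way conflict.
    half2 *w = reinterpret_cast<half2 *>(tile);
    half2 a = w[2 * lane];
    half2 b = w[2 * lane + 1];
    out[lane] = __half2float(a.x) + __half2float(a.y)
              + __half2float(b.x) + __half2float(b.y);
}
```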

Scenario-3

  • 16*16 = 256 Numbers
  • Half Precision = 2 Bytes
  • 256 * 2 Bytes = 512 Bytes
  • Warp = 32 Threads
  • Load 512 Bytes from Shared Memory to Registers with 32 Threads
  • 512 Bytes / 32 Threads = 16 Bytes / 1 Thread
  • Each Thread needs to load 8 Numbers (16 Bytes / 2 Bytes per Number)
  • Each Thread accesses 4 Banks (4 Bytes per Bank, 4 Banks for 16 Bytes)
  • Thread-0 => Thread-7 access Banks 0-31
  • Thread-8 => Thread-15 access Banks 0-31
  • Thread-16 => Thread-23 access Banks 0-31
  • Thread-24 => Thread-31 access Banks 0-31
  • This leads to a 4-way conflict, hence the loading is done over 4 cycles (sequentially); see the sketch below
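And a sketch of Scenario-3 under the same assumptions, where each lane needs four 32-bit words:

```
#include <cuda_fp16.h>

__global__ void loadTile16x16(float *out) {
    __shared__ half tile[256];                       // 16*16 halves = 512 Bytes = 128 words
    int lane = threadIdx.x % 32;

    for (int k = lane; k < 256; k += 32)             // illustrative fill
        tile[k] = __int2half_rn(k);
    __syncthreads();

    // Lane i reads words 4i .. 4i+3, i.e. banks 4i%32 .. (4i+3)%32.
    // Every group of 8 lanes spans banks 0-31, and there are four such
    // groups in the warp: a 4-way conflict.
    half2 *w = reinterpret_cast<half2 *>(tile);
    float acc = 0.0f;
    for (int j = 0; j < 4; ++j) {
        half2 v = w[4 * lane + j];
        acc += __half2float(v.x) + __half2float(v.y);
    }
    out[lane] = acc;
}
```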
