Intro
Every time you ask ChatGPT a question, get a movie recommendation on Netflix, or watch your phone recognize faces in photos, billions of matrix multiplications are happening behind the scenes. This fundamental mathematical operation has become the computational backbone of modern artificial intelligence.
GEMM (General Matrix Multiply) is the technical term for these matrix multiplication operations, and it’s no exaggeration to say that GEMM is omnipresent in machine learning. When a neural network predicts house prices using linear regression, it’s performing matrix multiplications to combine features and weights. When a recurrent neural network analyzes product review sentiment, GEMM operations process sequential data through hidden states. When a convolutional neural network generates captions for your Instagram photos, countless matrix operations extract and combine visual features.
The AI revolution from classic machine learning (pre-2010) through deep learning (2010s) to today's generative AI era has been enabled by our ability to perform these matrix operations at unprecedented scale. Modern large language models like the ones powering ChatGPT, Gemini, and Claude require trillions of matrix operations for both training and inference. A single forward pass through GPT-3, for instance, involves matrix multiplications with dimensions in the thousands, repeated across 96 layers. While working on understanding Transformer architectures and implementing them in PyTorch, I initially took these operations for granted, calling torch.matmul() without considering what happens underneath. But when I began peeling back the layers of abstraction in frameworks like PyTorch and TensorFlow, I discovered a fascinating world of optimization. The journey from a simple matrix multiplication to GPU-accelerated GEMM operations involves sophisticated decisions about memory hierarchies, thread management, and hardware utilization.
This curiosity led me to explore CUDA architecture and the specialized libraries (cuBLAS, cuDNN, CUTLASS) that PyTorch calls behind the scenes. In this blog series, I’ll take you through that same journey—from naive matrix multiplication to highly optimized GPU implementations. We’ll explore why simple approaches fail at scale, how tiled GEMM leverages GPU memory hierarchies, how Tensor Cores accelerate modern AI workloads, and how CUDA libraries abstract away the complexity of optimal GPU utilization.
This is just the tip of the iceberg, and as they say, the devil is in the details. But understanding these fundamentals will give you a new appreciation for the computational machinery powering today’s AI revolution.
Matrix Multiplication
Matrix multiplication, as we know, has a time complexity of O(N^3). That is not acceptable at scale and definitely requires optimization. Consider the scale of modern AI: a single attention mechanism in a large language model involves large matrices, and numerous matrix multiplications are performed across the layers of the model. If these core operations aren't efficient, even the most performant inference engines like vLLM or Triton can't deliver better results, despite optimizations such as KV Cache, Prefill, Speculative Decoding, etc.
The Computational Challenge
Let’s examine a concrete example. Consider multiplying two matrices: A (256×256) and B (256×256), resulting in matrix C (256×256).
To calculate any element C(i,j), we need to compute the dot product of row i from matrix A with column j from matrix B. For C(0,0), we multiply the first row of A with the first column of B and sum the results:
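Written out, that first element is simply a 256-term sum of products:

C(0,0) = A(0,0)×B(0,0) + A(0,1)×B(1,0) + A(0,2)×B(2,0) + … + A(0,255)×B(255,0)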
This requires 256 multiply-add operations. Since our result matrix has 256×256 = 65,536 elements, and each requires 256 operations, we need a total of 16,777,216 FMA (Fused Multiply-Add) operations.
The Naive Sequential Approach
The most straightforward implementation uses three nested loops, shown below. This sequential approach calculates elements one by one: C(0,0) → C(0,1) → … → C(255,255). Depending on the language and implementation, this can take anywhere from milliseconds to many seconds on a modern CPU, and at the scale and frequency required for real-time inference it quickly becomes unacceptable.
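A minimal sketch of this sequential approach in plain C++ host code, assuming square N×N matrices stored row-major in flat arrays:

```cpp
// Naive O(N^3) matrix multiplication: C = A * B
// A, B, C are N x N matrices stored row-major in flat arrays.
void matmul_naive(const float* A, const float* B, float* C, int N) {
    for (int i = 0; i < N; ++i) {           // row of C
        for (int j = 0; j < N; ++j) {       // column of C
            float acc = 0.0f;
            for (int k = 0; k < N; ++k) {   // dot product of row i of A with column j of B
                acc += A[i * N + k] * B[k * N + j];
            }
            C[i * N + j] = acc;
        }
    }
}
```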
The Parallelization Opportunity
Here's the key observation: each element of the result matrix C can be calculated independently. Computing C(255,255) doesn't depend on any previous calculations of matrix C. This independence is the foundation for parallelization. Theoretically, we could spawn 65,536 threads, one for each result element. Thread-0 calculates C(0,0), Thread-1 calculates C(0,1), and so on. If we could execute all threads simultaneously, our computation time would be determined by the time to perform 256 FMA operations rather than nearly 17 million.
Analogy
Think of a manufacturing plant that produces cars: it requires skilled technicians to assemble the parts, and the technicians need parts and tools to put them together into a single unit (a car).
Though I mentioned that we can parallelize the whole process by creating one thread per element, this has some constraints. To understand them, we need to understand GPU architecture. In the next section we will cover the minimum terminology that lays the foundation for understanding GEMM and its optimizations at different levels.
GPU Architecture
To optimize matrix multiplication at the scale required for ML/AI workloads, we need a solid understanding of GPU architecture, specifically the processing and memory hierarchies, which are the two cornerstones of any GPU workload. Just as a manufacturing plant producing cars has critical resources like technicians, tools, raw materials, and a place to keep all the required parts, GPUs have critical resources too; some of them are highlighted below. While each NVIDIA GPU architecture has many such configurations, we only touch on the few that are relevant in our context; the full architecture details for Ampere can be found here.
Processing Units (The Skilled Technicians):
- Threads: Individual workers
- Warps: Small coordinated groups of 32 threads
- Thread Blocks: Teams of workers assigned to Complete a task
- CUDA Cores/Tensor Cores: Individual workstations in a Workshop
- Streaming Multiprocessors (SMs): The main production lines where work gets done, think of each SM as Workshop
Memory Hierarchy (storage for parts and materials):
- Registers: The immediate tools a technician keeps at hand (fastest, most limited)
- Shared Memory/L1 Cache: Workshop floor storage (fast but limited)
- L2 Cache: Regional storage - Shared across all SMs
- Global RAM: The main warehouse (large but distant)
Resource Constraints:
Just as our manufacturing plant has limited skilled technicians and shop floor space, GPUs have finite resources. For NVIDIA's A100 (Ampere architecture), we have to be aware of these constraints before writing CUDA functions or kernels to achieve our objectives; the snippet after the list shows how to query them at runtime:
- 108 Streaming Multiprocessors
- 2,048 maximum threads per SM
- 64 maximum warps per SM
- 1,024 maximum threads per block
- 64 FP32 cores + 32 FP64 cores + 4 Tensor Cores per SM
- Total threads = 2,048 threads per SM × 108 SMs = 221,184 threads per GPU
- Global (HBM) memory: 40GB
- L2 Cache: 40MB
- Shared memory + L1 (shared across a thread block): 192KB
- Maximum configurable shared memory: 164KB
- Registers: 64K 32-bit registers per SM, 255 registers per thread
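Many of these limits can also be read at runtime with cudaGetDeviceProperties; here is a minimal sketch (assuming device 0):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0

    printf("Device: %s\n", prop.name);
    printf("SMs: %d\n", prop.multiProcessorCount);
    printf("Max threads per SM: %d\n", prop.maxThreadsPerMultiProcessor);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Warp size: %d\n", prop.warpSize);
    printf("Global memory: %.1f GB\n", prop.totalGlobalMem / 1e9);
    printf("L2 cache: %.1f MB\n", prop.l2CacheSize / 1e6);
    printf("Shared memory per SM: %zu KB\n", prop.sharedMemPerMultiprocessor / 1024);
    printf("Registers per SM: %d\n", prop.regsPerMultiprocessor);
    return 0;
}
```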
CUDA Programming Terminology:
Kernel:
A kernel is a function that executes on the GPU to achieve our computational objective. In our manufacturing analogy, it’s the instruction manual that guides each technician on the steps to follow for their specific task.
For matrix multiplication, our kernel would contain the instructions for computing C[i][j] += A[i][k] * B[k][j], accumulated over k, for the matrix elements assigned to each thread.
Though the GPU performs the actual computation, the CPU orchestrates the entire process. The way I like to think about it is maritime logistics, where cargo is transferred to smaller vessels (lighters) before reaching the main ship: before a kernel is launched on the GPU, the CPU copies the data to the GPU's HBM (global memory). The typical flow looks like this, with a host-code sketch after the list:
- Declare Host Variables (Host = CPU)
- Declare Device Variables (Device = GPU)
- Allocate Device Memory (Memory to Store the Data Required in Global Memory HBM)
- Copy Host → Device (Transfer input data to GPU)
- Launch Kernel (Execute Main Function)
- Copy Device → Host (Retrieve results from GPU to CPU)
- Free Memory (Clean up allocated memory)
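A minimal host-side sketch of this flow for our matrix multiplication, assuming a hypothetical matmul_kernel and N×N float matrices:

```cpp
#include <cuda_runtime.h>

// Hypothetical kernel declaration; its body is discussed in later sections.
__global__ void matmul_kernel(const float* A, const float* B, float* C, int N);

void run_matmul(const float* h_A, const float* h_B, float* h_C, int N) {
    size_t bytes = (size_t)N * N * sizeof(float);

    // Declare and allocate device memory (global memory / HBM)
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, bytes);
    cudaMalloc(&d_B, bytes);
    cudaMalloc(&d_C, bytes);

    // Copy Host -> Device
    cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);

    // Launch kernel: one thread per output element (example configuration)
    dim3 block(16, 16);
    dim3 grid((N + block.x - 1) / block.x, (N + block.y - 1) / block.y);
    matmul_kernel<<<grid, block>>>(d_A, d_B, d_C, N);

    // Copy Device -> Host
    cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost);

    // Free allocated device memory
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
}
```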
Thread Hierarchy
To perform computations on the GPU, we write kernels. Whether we develop custom kernels or use existing ones, each kernel requires processing units (CUDA/Tensor Cores) and memory resources for execution. These resources are organized according to configurations specified during kernel launch.
The resource allocation involves a two-tier responsibility model:
- Developer-controlled: Grid dimensions, Block dimensions, Shared memory usage, Register usage
- Hardware/scheduler-controlled: Warp Scheduling, Streaming Multiprocessor assignment, Instruction scheduling
Developers don't specify individual threads directly. Instead, the number of required threads is expressed through the Grid and Thread Block configuration, which uses a hierarchical approach: Grid → Thread Blocks → Warps → Threads. Both grid and block configurations can be expressed in 1D, 2D, or 3D to address problem statements from different domains.
Warp
A critical aspect of CUDA's execution model is the warp, which is not controlled by developers. While we specify the grid and block dimensions, we don't specify the warp size: it is fixed by the hardware at 32 threads, and the warp scheduler issues instructions to all 32 threads of a warp together.
An important note about warps is that even if we specify just 1 thread per block, the warp scheduler still allocates a full warp of 32 threads, with 31 threads remaining unused. This significantly impacts SM utilization and should be considered when designing kernel launch configurations. The table below shows the impact of thread block size on warp utilization.
Thread Block Size | Warps Used | Warp Utilization |
---|---|---|
16 Threads | 1 Warp | 50% (16/32) |
32 Threads | 1 Warp | 100% (32/32) |
48 Threads | 2 Warps | 75% (48/64) |
64 Threads | 2 Warps | 100% (64/64) |
96 Threads | 3 Warps | 100% (96/96) |
SIMT (Single Instruction Multiple Thread)
SIMT is the execution model used by CUDA kernels. Consider multiplying matrix A of size (256,256) by a constant C. The instruction A[i][j] = C * A[i][j] remains the same for all iterations except for the index values. This nested loop can be converted to execute in parallel, where each thread calculates one element on the GPU.
```python
# Sequential version: the same operation applied to every element
for i in range(256):
    for j in range(256):
        A[i][j] = C * A[i][j]
```

This is achieved by issuing the same instruction to a group of 32 threads (a warp), as discussed earlier. Though threads are the fundamental building blocks for GPU processing, instructions are issued at the warp level. Hence, when an instruction is issued, it's executed by all active threads in the warp.

Note: In modern architectures, there are modifications where collections of threads work together on computations (like Thread Block Clusters), but this basic SIMT model covers the fundamental concept.
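A sketch of the same operation as a CUDA kernel, where each thread handles exactly one element (scale_kernel is a hypothetical name used for illustration):

```cpp
// Each thread scales exactly one element of the matrix.
// The same instruction is issued to all 32 threads of a warp (SIMT).
__global__ void scale_kernel(float* A, float c, int width, int height) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < height && col < width) {
        A[row * width + col] = c * A[row * width + col];
    }
}

// Example launch for a 256x256 matrix: one thread per element
// dim3 block(16, 16);
// dim3 grid(256 / 16, 256 / 16);
// scale_kernel<<<grid, block>>>(d_A, c, 256, 256);
```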
Grid Configuration
Grid Dimension (gridDim): The grid dimension specifies the number of blocks in the grid. It determines how many blocks are launched to execute the kernel.
Thread Block Configuration
Block Dimension (blockDim): The block dimension specifies the number of threads within a single block.
Kernel Launch
kernel<<<gridDim, blockDim>>>(parameters)
- Thread blocks per grid: 16 × 16 = 256 thread blocks
- Threads per block: 16 × 16 = 256 threads
- Warps per Block : 256 / 32 = 8 Warps per Block
- Total Warps : 8 * 256 = 2048 Warps
- Total threads: 256 × 256 = 65,536 threads
This is one possible configuration for processing a 256×256 matrix where each thread handles one matrix element (note that this is not the optimal configuration); the launch for it is sketched below.
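The configuration above expressed in launch syntax, using the hypothetical matmul_kernel and device pointers from the host-code sketch earlier:

```cpp
dim3 gridDim(16, 16);    // 16 x 16 = 256 thread blocks
dim3 blockDim(16, 16);   // 16 x 16 = 256 threads per block = 8 warps per block

// 256 blocks x 256 threads = 65,536 threads, one per element of the 256x256 matrix
matmul_kernel<<<gridDim, blockDim>>>(d_A, d_B, d_C, 256);
```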
Logical vs Physical Representation
The representation above is a logical one. Internally, the GPU operates on linear indexes, so the logical layout is converted to linear indexes, either explicitly by the developer or implicitly by the hardware. The 2D representation maps to:
- Linear thread IDs within each warp (32 consecutive threads)
- Linear memory addresses in global memory
- Sequential instruction execution within warps
Thread indexes are linear, hence while accessing matrix elements we can use either representation, 2D or linear indexing. Below is one example of how 2D indexes can be converted to a linear index.
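A small sketch of that conversion as a hypothetical device helper, for a row-major matrix of a given width:

```cpp
// Row-major 2D -> linear index for the element owned by the calling thread.
// width = number of columns in the matrix.
__device__ int global_linear_index(int width) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    return row * width + col;
}
```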
Memory Coalescing: Efficient Data Access Patterns
The way matrix elements are stored in memory significantly impacts kernel performance. Understanding how threads access memory is crucial for achieving optimal GPU performance in matrix operations, and for avoiding kernels that starve for data due to slow memory access.
GPU Memory Operations Overview
GPU memory operations can be categorized into three phases:
- Load - Transfer data from global memory to registers directly or via shared memory
- Compute - Execute the actual mathematical operations on data in registers
- Store operations - Write results back to global memory
The important point is that memory bandwidth, not computational power, often becomes the bottleneck in GPU kernels. This is why optimizing memory access patterns plays a significant role when writing custom CUDA kernels.
Memory coalescing occurs when threads within the same warp access consecutive memory addresses. When this happens, the GPU can combine multiple memory requests into fewer, wider memory transactions, dramatically improving bandwidth utilization.
Note: Coalescing is determined by the access pattern of the 32 threads within a single warp, not by the overall grid structure.
Memory Storage Layouts
By default, C++ stores matrices in row-major layout while Fortran uses column-major layout. This is analogous to how databases can be organized as row-oriented (traditional RDBMS) or column-oriented (analytical databases) based on storage patterns.
Whether we use the CPU or the GPU for computation, data first needs to be brought to the execution units (the ALU on a CPU, CUDA Cores on a GPU). This is accomplished with load instructions: on a load, data is moved from global memory to registers, either directly or through shared memory, and registers are the highest-speed memory. To keep it simple for our discussion, each load instruction can only access a certain number of bytes at a time. When all the threads in a warp access adjacent locations, one load can fetch all the inputs required for the warp. If threads of the same warp access elements that are far apart, multiple loads from global memory are needed, which is slow because transfers from global memory are constrained by bandwidth; the CUDA cores finish their current instructions and then sit idle while the loads are still fetching data. Hence it is of paramount importance to arrange data so that the threads in a warp access consecutive elements for their respective operations.
Analogy:
In our manufacturing analogy, if all the technicians are working on the same part of the car, they all require parts from the same bin or nearby bins, and one supply truck can make a single trip to load all the parts from consecutive bins. If instead each technician is working on a different part of the car, requiring parts from different, scattered bins, it becomes difficult to load everything in one trip; multiple trips are needed to bring the parts required by all technicians, which slows down the overall manufacturing process.
Let us go through some examples of coalesced and non-coalesced patterns to understand this better.
Coalesced Memory Access Pattern Example
Consider matrix multiplication for matrices A=(1024×1024), B=(1024×1024), C=(1024×1024) using a configuration that results in optimal coalesced memory access.
Configuration Parameters
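Assuming the one-thread-per-element mapping and the blockDim(32,1) choice used in the per-thread calculations below, the configuration works out to:

gridDim = (32,1024) => 32 Thread Blocks in X dimension, 1024 Thread Blocks in Y dimension
blockDim = (32,1) => 32 Threads in X dimension, 1 Thread in Y dimension

- Total Thread Blocks: 32 × 1024 = 32,768 blocks
- Total Threads per Block: 32 × 1 = 32 threads (exactly 1 warp)
- Total Threads: 32,768 × 32 = 1,048,576 threads (one per element of the 1024×1024 matrix)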
Thread Block Organization
With this grid, the blocks tile the matrix row by row: Block (0,0) … Block (31,0) cover matrix row 0, Block (0,1) … Block (31,1) cover matrix row 1, and so on down to Block (31,1023).
Thread Organization within Each Block
Each block is a single row of 32 threads, threadIdx(0,0) … threadIdx(31,0), which is exactly one warp.
Linear Index Calculation Formula
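The index mapping is the same one used throughout this example:

row = blockIdx.y * blockDim.y + threadIdx.y
col = blockIdx.x * blockDim.x + threadIdx.x
idx = row * matrix_width + col (where matrix_width = 1024)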
Coalesced Memory Access Analysis
Perfect Coalescing: Each Warp Accesses Consecutive Memory
With blockDim(32,1), each warp contains 32 threads that access 32 consecutive memory locations within the same matrix row.
Coalesced Visualizations
Thread Block (0,0) - Complete Warp Analysis:
Thread 0:
Parameters: blockIdx.x=0, blockIdx.y=0, threadIdx.x=0, threadIdx.y=0
Calculation:
row = 0 * 1 + 0 = 0
col = 0 * 32 + 0 = 0
idx = 0 * 1024 + 0 = 0
Thread 1:
Parameters: blockIdx.x=0, blockIdx.y=0, threadIdx.x=1, threadIdx.y=0
Calculation:
row = 0 * 1 + 0 = 0
col = 0 * 32 + 1 = 1
idx = 0 * 1024 + 1 = 1
Thread 2:
Parameters: blockIdx.x=0, blockIdx.y=0, threadIdx.x=2, threadIdx.y=0
Calculation:
row = 0 * 1 + 0 = 0
col = 0 * 32 + 2 = 2
idx = 0 * 1024 + 2 = 2
…continuing pattern…
Thread 31:
Parameters: blockIdx.x=0, blockIdx.y=0, threadIdx.x=31, threadIdx.y=0
Calculation:
row = 0 * 1 + 0 = 0
col = 0 * 32 + 31 = 31
idx = 0 * 1024 + 31 = 31
✅ Perfect Coalescing: All 32 threads access consecutive indices 0, 1, 2, …, 31
Thread Block (1,0) - Next Consecutive Block:
Thread 0:
Parameters: blockIdx.x=1, blockIdx.y=0, threadIdx.x=0, threadIdx.y=0
Calculation:
row = 0 * 1 + 0 = 0
col = 1 * 32 + 0 = 32
idx = 0 * 1024 + 32 = 32
Thread 1:
Parameters: blockIdx.x=1, blockIdx.y=0, threadIdx.x=1, threadIdx.y=0
Calculation:
row = 0 * 1 + 0 = 0
col = 1 * 32 + 1 = 33
idx = 0 * 1024 + 33 = 33
…continuing pattern…
Thread 31:
Parameters: blockIdx.x=1, blockIdx.y=0, threadIdx.x=31, threadIdx.y=0
Calculation:
row = 0 * 1 + 0 = 0
col = 1 * 32 + 31 = 63
idx = 0 * 1024 + 63 = 63
✅ Perfect Coalescing: All 32 threads access consecutive indices 32, 33, 34, …, 63
Thread Block (31,0) - Final Block in Row 0:
Thread 0:
Parameters: blockIdx.x=31, blockIdx.y=0, threadIdx.x=0, threadIdx.y=0
Calculation:
row = 0 * 1 + 0 = 0
col = 31 * 32 + 0 = 992
idx = 0 * 1024 + 992 = 992
Thread 1:
Parameters: blockIdx.x=31, blockIdx.y=0, threadIdx.x=1, threadIdx.y=0
Calculation:
row = 0 * 1 + 0 = 0
col = 31 * 32 + 1 = 993
idx = 0 * 1024 + 993 = 993
…continuing pattern…
Thread 31:
Parameters: blockIdx.x=31, blockIdx.y=0, threadIdx.x=31, threadIdx.y=0
Calculation:
row = 0 * 1 + 0 = 0
col = 31 * 32 + 31 = 1023
idx = 0 * 1024 + 1023 = 1023
✅ Perfect Coalescing: All 32 threads access consecutive indices 992, 993, 994, …, 1023
Summary: Coalesced Access Pattern
Thread Block | Memory Indices Accessed | Pattern | Gap Size |
---|---|---|---|
Block (0,0) | 0, 1, 2, …, 31 | ✅ Consecutive | 0 |
Block (1,0) | 32, 33, 34, …, 63 | ✅ Consecutive | 0 |
Block (2,0) | 64, 65, 66, …, 95 | ✅ Consecutive | 0 |
… | … | ✅ Consecutive | 0 |
Block (31,0) | 992, 993, 994, …, 1023 | ✅ Consecutive | 0 |
Key Advantages
- No Memory Gaps: Every thread in a warp accesses consecutive memory addresses
- Single Memory Transaction: Each warp requires only 1 memory transaction
- Optimal Bandwidth Utilization: 100% of memory bandwidth is effectively used
- Maximum Performance: Optimal memory access pattern for GPU architecture
Warp-Level Analysis
Since each thread block contains exactly 32 threads = 1 warp:
Thread Block | Warp | Threads | Memory Access Pattern |
---|---|---|---|
Block (0,0) | Warp 0 | 0-31 | Consecutive: 0-31 ✅ |
Block (1,0) | Warp 0 | 0-31 | Consecutive: 32-63 ✅ |
Block (2,0) | Warp 0 | 0-31 | Consecutive: 64-95 ✅ |
… | … | … | … |
Block (31,0) | Warp 0 | 0-31 | Consecutive: 992-1023 ✅ |
Why This Configuration Works
- blockDim.y = 1: Ensures all threads in a warp stay within the same matrix row
- blockDim.x = 32: Matches warp size, providing optimal thread-to-warp mapping
- Consecutive Access: ThreadIdx.x directly maps to consecutive column indices
- No Row Spanning: Warps never span multiple matrix rows, eliminating gaps
Key Insight: The fundamental success of blockDim(32,1) is that it aligns perfectly with the GPU's warp architecture, ensuring each warp accesses a single contiguous block of memory within the same matrix row.
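To make this concrete, here is a sketch that fills in the body of the hypothetical matmul_kernel used earlier, with this row/col mapping. With blockDim(32,1), consecutive threadIdx.x values in a warp read consecutive elements of B and write consecutive elements of C:

```cpp
// Naive GEMM: C = A * B for N x N row-major matrices, one thread per output element.
__global__ void matmul_kernel(const float* A, const float* B, float* C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= N || col >= N) return;

    float acc = 0.0f;
    for (int k = 0; k < N; ++k) {
        // A[row * N + k]: with blockDim(32,1), all threads of the warp read the same address (broadcast)
        // B[k * N + col]: consecutive threads read consecutive addresses (coalesced)
        acc += A[row * N + k] * B[k * N + col];
    }
    // Consecutive threads write consecutive addresses (coalesced store)
    C[row * N + col] = acc;
}

// Launch with the coalesced-friendly configuration from this section:
// dim3 block(32, 1);
// dim3 grid(N / 32, N);   // for N = 1024: gridDim = (32, 1024)
// matmul_kernel<<<grid, block>>>(d_A, d_B, d_C, N);
```

The same kernel launched with blockDim(16,16) would still be correct, but each warp would then span two rows of C, producing the non-coalesced pattern analyzed next.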
Non-Coalesced Memory Access Pattern Example
Consider matrix multiplication for matrices A=(1024×1024), B=(1024×1024), C=(1024×1024) using a configuration that results in non-coalesced memory access.
Configuration Parameters
gridDim = (64,64) => 64 Thread Blocks in X dimension, 64 Thread Blocks in Y dimension
blockDim = (16,16) => 16 Threads in X dimension, 16 Threads in Y dimension
gridDim.x = 64 blockDim.x = 16
gridDim.y = 64 blockDim.y = 16
threadIdx.x range: 0-15 blockIdx.x range: 0-63
threadIdx.y range: 0-15 blockIdx.y range: 0-63
Total Configuration:
- Total Thread Blocks: 64 × 64 = 4,096 blocks
- Total Threads per Block: 16 × 16 = 256 threads
- Total Warps per Block: 256 ÷ 32 = 8 warps
- Total Threads: 1,048,576 threads (perfect for 1024×1024 matrix)
Thread Block Organization
Thread Block Layout (64×64 grid):
Thread Block (0,0) Thread Block (1,0) Thread Block (2,0) ... Thread Block (63,0)
Thread Block (0,1) Thread Block (1,1) Thread Block (2,1) ... Thread Block (63,1)
Thread Block (0,2) Thread Block (1,2) Thread Block (2,2) ... Thread Block (63,2)
.
.
.
Thread Block (0,63) Thread Block (1,63) Thread Block (2,63) ... Thread Block (63,63)
Thread Organization within Each Block
Thread Layout (16×16 per block):
threadIdx(0,0) threadIdx(1,0) threadIdx(2,0) ... threadIdx(15,0)
threadIdx(0,1) threadIdx(1,1) threadIdx(2,1) ... threadIdx(15,1)
threadIdx(0,2) threadIdx(1,2) threadIdx(2,2) ... threadIdx(15,2)
.
.
.
threadIdx(0,15) threadIdx(1,15) threadIdx(2,15) ... threadIdx(15,15)
Linear Index Calculation Formula
row = blockIdx.y * blockDim.y + threadIdx.y
col = blockIdx.x * blockDim.x + threadIdx.x
idx = row * matrix_width + col (where matrix_width = 1024)
Non-Coalesced Memory Access Analysis
Non-Coalesced Visualizations
Problem: Warps Span Multiple Matrix Rows
With blockDim(16,16), each warp contains 32 consecutive threads that span 2 matrix rows, creating large memory gaps.
Warp 0 (Threads 0-31) in Thread Block (0,0):
First 16 threads (threadIdx.y = 0):
Parameters: blockIdx.x=0, blockIdx.y=0, threadIdx.x=0-15, threadIdx.y=0
Calculation:
row = 0 * 16 + 0 = 0
col = 0 * 16 + (0-15) = 0-15
idx = 0 * 1024 + (0-15) = 0-15
Memory indices accessed: 0, 1, 2, 3, ..., 14, 15
Next 16 threads (threadIdx.y = 1):
Parameters: blockIdx.x=0, blockIdx.y=0, threadIdx.x=0-15, threadIdx.y=1
Calculation:
row = 0 * 16 + 1 = 1
col = 0 * 16 + (0-15) = 0-15
idx = 1 * 1024 + (0-15) = 1024-1039
Memory indices accessed: 1024, 1025, 1026, ..., 1038, 1039
❌ Gap Analysis: 1009 elements between indices 15 and 1024!
Warp 1 (Threads 32-63) in Thread Block (0,0):
Threads 32-47 (threadIdx.y = 2):
Parameters: blockIdx.x=0, blockIdx.y=0, threadIdx.x=0-15, threadIdx.y=2
Calculation:
row = 0 * 16 + 2 = 2
col = 0 * 16 + (0-15) = 0-15
idx = 2 * 1024 + (0-15) = 2048-2063
Memory indices accessed: 2048, 2049, 2050, ..., 2062, 2063
Threads 48-63 (threadIdx.y = 3):
Parameters: blockIdx.x=0, blockIdx.y=0, threadIdx.x=0-15, threadIdx.y=3
Calculation:
row = 0 * 16 + 3 = 3
col = 0 * 16 + (0-15) = 0-15
idx = 3 * 1024 + (0-15) = 3072-3087
Memory indices accessed: 3072, 3073, 3074, ..., 3086, 3087
❌ Same Gap Problem: 1009 elements between indices 2063 and 3072!
Final Warp Example - Warp 7 (Threads 224-255):
Threads 224-239 (threadIdx.y = 14):
Parameters: blockIdx.x=0, blockIdx.y=0, threadIdx.x=0-15, threadIdx.y=14
Calculation:
row = 0 * 16 + 14 = 14
col = 0 * 16 + (0-15) = 0-15
idx = 14 * 1024 + (0-15) = 14336-14351
Threads 240-255 (threadIdx.y = 15):
Parameters: blockIdx.x=0, blockIdx.y=0, threadIdx.x=0-15, threadIdx.y=15
Calculation:
row = 0 * 16 + 15 = 15
col = 0 * 16 + (0-15) = 0-15
idx = 15 * 1024 + (0-15) = 15360-15375
Summary: Non-Coalesced Access Pattern
Warp | First 16 Threads Access | Next 16 Threads Access | Gap Size | Coalesced? |
---|---|---|---|---|
Warp 0 | 0-15 (row 0) | 1024-1039 (row 1) | 1009 elements | ❌ No |
Warp 1 | 2048-2063 (row 2) | 3072-3087 (row 3) | 1009 elements | ❌ No |
Warp 2 | 4096-4111 (row 4) | 5120-5135 (row 5) | 1009 elements | ❌ No |
… | … | … | 1009 elements | ❌ No |
Warp 7 | 14336-14351 (row 14) | 15360-15375 (row 15) | 1009 elements | ❌ No |
Critical Problems
- Large Memory Gaps: Every warp has a 1009-element gap between consecutive thread accesses
- Multiple Memory Transactions: Each warp requires 2-4 separate memory transactions instead of 1
- Poor Bandwidth Utilization: Only 25-50% of memory bandwidth is effectively used
- Performance Degradation: 2-4x slower compared to coalesced access patterns
Key Insight: The fundamental issue with blockDim(16,16) is that it forces warps to span multiple matrix rows, breaking the consecutive memory access pattern required for optimal coalescing.
Compute vs Memory Bound Kernels
GPU kernels can be classified as either memory bound or compute bound based on their primary performance bottleneck. Understanding this classification is crucial for choosing the right optimization approach.
Like every computing system, GPUs have two fundamental dimensions: execution capability and memory bandwidth. Every system has scaling limits, and to improve performance, we can optimize along either dimension until reaching the maximum limits.
Key Classification Metric: Arithmetic Intensity
The fundamental metric for classification is Arithmetic Intensity:
Arithmetic Intensity = FLOPs (Floating Point Operations) / Bytes Accessed

Peak Arithmetic Intensity for A100 = 19.5 TFLOPS ÷ 2.0 TB/s ≈ 9.75 FLOP/byte

- Above 9.75 FLOP/byte: compute-bound (can potentially use the full 19.5 TFLOPS)
- Below 9.75 FLOP/byte: memory-bound (limited by 2.0 TB/s bandwidth)
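A rough worked example for a 1024×1024×1024 FP32 GEMM (a back-of-the-envelope sketch that ignores caches, not a measured figure):

FLOPs = 2 × 1024³ ≈ 2.15 GFLOP (one multiply + one add per term)
Ideal bytes moved (each matrix touched once) = 3 × 1024² × 4 bytes ≈ 12.6 MB
Ideal Arithmetic Intensity ≈ 2.15 GFLOP ÷ 12.6 MB ≈ 170 FLOP/byte → well into compute-bound territory
Naive kernel with no data reuse: each output element loads 2 × 1024 values (8 KB) for 2 × 1024 FLOPs ≈ 0.25 FLOP/byte → memory-bound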
Kernel Classification Definitions
Compute Bound Kernels
Definition: A kernel is compute-bound when its performance is limited by the speed of its mathematical operations. The GPU’s compute units (CUDA Cores, Tensor Cores) are the primary bottleneck, and their utilization is the limiting factor.
Optimization Strategies:
To optimize compute-bound kernels, the focus is on maximizing the efficiency of mathematical operations. This involves:
- Ensuring High Occupancy: Launching enough threads to fully saturate the GPU’s Streaming Multiprocessors (SMs) and keep the compute units busy.
- Leveraging Specialized Hardware: Using specialized cores like Tensor Cores for matrix multiplication.
- Multi-GPU Scaling: For workloads that exceed a single GPU's capacity, distributing the work across multiple GPUs can increase the available compute
- Trade-off: Multi-GPU comes with data transfer overhead dependent on network bandwidth
Memory Bound Kernels
Definition: A kernel is memory-bound when its performance is limited by the speed at which data can be transferred to and from the GPU's memory. The primary bottleneck is memory bandwidth.
Key Indicators:
- Low compute utilization (cores are idle, waiting for data)
- Poor cache hit rates
- High memory access latency
- Memory access inefficiencies (poor coalescing, cache thrashing)
Optimization Strategies:
To optimize memory-bound kernels, the focus is on reducing the amount of data transferred from high-latency memory and improving access patterns. This is achieved by:
- Maximize data reuse: Keep frequently accessed data in faster memory levels
- Optimize access patterns: Ensuring memory accesses are coalesced to reduce the number of memory transactions
- Use memory hierarchy efficiently: Thread Registers → L1 Cache/Shared Memory → L2 Cache → Global Memory (HBM2)
- Use lower precision: Mixed precision or reduced precision where high accuracy isn’t critical
Upcoming: The performance of a naive GEMM kernel is limited by memory bandwidth. The upcoming sections will show how using shared memory with tiling increases data reuse and reduces expensive round trips to global memory. This allows data to be reused on-chip and makes the kernel’s performance dependent on the speed of its mathematical operations, effectively making it compute-bound.
Outro
References:
- https://siboehm.com/articles/22/CUDA-MMM
- https://developer.nvidia.com/blog/cutlass-linear-algebra-cuda/
- https://alvinwan.com/how-to-tile-matrix-multiplication/#how-to-sidestep-tiling-limits
- https://leimao.github.io/article/CUDA-Matrix-Multiplication-Optimization/
- https://huggingface.co/spaces/nanotron/ultrascale-playbook?section=high-level_overview
- https://www.youtube.com/watch?v=1E8GDR8QXKw&t=3033s
- https://developer.nvidia.com/blog/introducing-tile-based-programming-in-warp-1-5-0/
- https://developer.nvidia.com/blog/accelerating-transformers-with-nvidia-cudnn-9/
- https://developer.nvidia.com/blog/advanced-optimization-strategies-for-llm-training-on-nvidia-grace-hopper/
- https://developer.nvidia.com/blog/profiling-llm-training-workflows-on-nvidia-grace-hopper/