GPUs execute many parallel threads. These threads are organized into a three-level hierarchy:

  • Grids: The entire collection of blocks that execute a kernel.
  • Blocks: Subdivisions of the grid that contain groups of threads. Threads within a block can cooperate via shared memory.
  • Threads: The smallest unit of execution.

When launching a kernel (the function that performs computation), you specify this thread hierarchy using the following syntax:

kernel_name<<<gridDim, blockDim>>>(...);

Here, gridDim and blockDim can be configured with 1, 2, or 3 dimensions. Their possible combinations include:

  • gridDim = 1D & blockDim = 1D
  • gridDim = 1D & blockDim = 2D
  • gridDim = 1D & blockDim = 3D
  • gridDim = 2D & blockDim = 1D
  • gridDim = 2D & blockDim = 2D
  • gridDim = 2D & blockDim = 3D
  • gridDim = 3D & blockDim = 1D
  • gridDim = 3D & blockDim = 2D
  • griddIM = 3D & blockDim = 3D

Thread Indexing – 1D Grid and 1D Block

Example: Grid with 2 Blocks, Each with 5 Threads Grid configuration: (2 blocks, each with 5 threads), can be denoted as (2,5).

Local Thread Indexing: Within each block, threads are uniquely indexed from 0 to 4.

Local Thread Index

| BlockID   | ThreadID (Local)|
|---------  |-----------------|
|  Block-0  |      Thread-0   |
|  Block-0  |      Thread-1   |
|  Block-0  |      Thread-2   |
|  Block-0  |      Thread-3   |
|  Block-0  |      Thread-4   |
|  Block-1  |      Thread-0   |
|  Block-1  |      Thread-1   |
|  Block-1  |      Thread-2   |
|  Block-1  |      Thread-3   |
|  Block-1  |      Thread-4   |   

Global ThreadIndex

Although thread indices are unique within a block, they are not unique across blocks. To compute a global thread ID, we use this one-dimensional indexing formula:

we take help of predefined variables , in this context we only have one dimension hence these variables threadIdx.x, blockIdx.x, blockDim.x

Global threadId = blockIdx.x * blockDim.x + threadIdx.x

1. blockDim.x= 5
2. blockIdx.x = [0,1]
3. threadIdx.x = [0,1,2,3,4]

|   BlockID  |  Calculation |   Global ThreadID   |
|------------|--------------|---------------------|
|   Block-0  | 0 * 5 + 0    |           0         |
|   Block-0  | 0 * 5 + 1    |           1         |
|   Block-0  | 0 * 5 + 2    |           2         |
|   Block-0  | 0 * 5 + 3    |           3         |
|   Block-0  | 0 * 5 + 4    |           4         |
|   Block-0  | 1 * 5 + 0    |           5         |
|   Block-0  | 1 * 5 + 1    |           6         |   
|   Block-0  | 1 * 5 + 2    |           7         |
|   Block-0  | 1 * 5 + 3    |           8         |
|   Block-0  | 1 * 5 + 4    |           9         |

Thread Indexing – 2D Grid and 1D Block

When the grid of blocks is two-dimensional, additional parameters (blockIdx.x and blockIdx.y) define the block’s position in the grid. In this case, each block still uses 1D thread indexing.

Example: Grid with a 2D Arrangement (2 columns × 3 rows) of Blocks

Grid Layout :


Imagine your grid is arranged as 2 blocks in the x-dimension and 3 blocks in the y-dimension. The grid can be represented as:

[ 
  [Block (0,0), Block (0,1)],
  [Block (1,0), Block (1,1)],
  [Block (2,0), Block (2,1)] 
]

Total Blocks =2(columns)×3(rows)=6.

Global Block ID Calculation

A unique global block ID is computed as: blockId = blockIdx.y * gridDim.x + blockIdx.x

For this example (with gridDim.x = 2):

|   blockIdx.y  |  blockIdx.x      |      Calculation      |  Global BlockID  |
|---------------|------------------|-----------------------|------------------|
|   0           |     0            |         0 * 2 + 0     |       0          |
|   0           |     1            |         0 * 2 + 1     |       1          |
|   1           |     0            |         1 * 2 + 0     |       2          |
|   1           |     1            |         1 * 2 + 1     |       3          |
|   2           |     0            |         2 * 2 + 0     |       4          |
|   2           |     1            |         2 * 2 + 1     |       5          |

Local Thread Indexing:

Each block contains 5 threads with local thread indices ranging from 0 to 4.

Matrix Representation of 2D Blocks in the Grid

|       | x = 0        | x = 1        |
|-------|--------------|--------------|
| y=0   | Block (0,0)  | Block (0,1)  |
| y=1   | Block (1,0)  | Block (1,1)  |
| y=2   | Block (2,0)  | Block (2,1)  |


|       | x = 0        | x = 1        |
|-------|--------------|--------------|
| y=0   |       0      |    1         |
| y=1   |       2      |    3         |
| y=2   |       4      |    5         |

Global Thread Index Calculation

Once the unique global block ID is computed, the global thread ID is calculated by:

threadId = blockId * blockDim.x + threadIdx.x;

  • blockDim.x = 5

  • threadIdx.x = [0, 1, 2, 3, 4]

      |   GlobalBlockID   |  Calculation     |   GlobalThreadID      |
      |-------------------|------------------|-----------------------|
      |       0           |   0 * 5 + 0      |           0           |
      |       0           |   0 * 5 + 1      |           1           |
      |       0           |   0 * 5 + 2      |           2           |
      |       0           |   0 * 5 + 3      |           3           |
      |       0           |   0 * 5 + 4      |           4           |
      |       1           |   1 * 5 + 0      |           5           |
      |       1           |   1 * 5 + 1      |           6           |
      |       1           |   1 * 5 + 2      |           7           |
      |       1           |   1 * 5 + 3      |           8           |
      |       1           |   1 * 5 + 4      |           9           |
      |       2           |   2 * 5 + 0      |           10          |
      |       2           |   2 * 5 + 1      |           11          |
      |       2           |   2 * 5 + 2      |           12          |
      |       2           |   2 * 5 + 3      |           13          |
      |       2           |   2 * 5 + 4      |           14          |
      |       3           |   3 * 5 + 0      |           15          |
      |       3           |   3 * 5 + 1      |           16          |
      |       3           |   3 * 5 + 2      |           17          |
      |       3           |   3 * 5 + 3      |           18          |
      |       3           |   3 * 5 + 4      |           19          |
      |       4           |   4 * 5 + 0      |           20          |
      |       4           |   4 * 5 + 1      |           21          |
      |       4           |   4 * 5 + 2      |           22          |
      |       4           |   4 * 5 + 3      |           23          |
      |       4           |   4 * 5 + 4      |           24          |
      |       5           |   5 * 5 + 0      |           25          |
      |       5           |   5 * 5 + 1      |           26          |
      |       5           |   5 * 5 + 2      |           27          |
      |       5           |   5 * 5 + 3      |           28          |
      |       5           |   5 * 5 + 4      |           29          |
    

Thread Indexing – 3D Grid and 1D Block

When the grid of blocks is three-dimensional, we have predefined parameters (blockIdx.x and blockIdx.y, blockIdx.z) define the block’s position in the grid. In this case, each block still uses 1D thread indexing.

Example: Grid with a 3D Arrangement (2 × 3 * 4) of Blocks

(x,y,z) = (2,3,4)

Grid Layout:


Total Blocks =2(x)×3(y)*4(z) = 24.

grid = [

[  # z = 0
    [ Block(0,0,0), Block(0,0,1) ],  # y = 0
    [ Block(0,1,0), Block(0,1,1) ],  # y = 1
    [ Block(0,2,0), Block(0,2,1) ]   # y = 2
],
[  # z = 1
    [ Block(1,0,0), Block(1,0,1) ],  # y = 0
    [ Block(1,1,0), Block(1,1,1) ],  # y = 1
    [ Block(1,2,0), Block(1,2,1) ]   # y = 2
],
[  # z = 2
    [ Block(2,0,0), Block(2,0,1) ],  # y = 0
    [ Block(2,1,0), Block(2,1,1) ],  # y = 1
    [ Block(2,2,0), Block(2,2,1) ]   # y = 2
],
[  # z = 3
    [ Block(3,0,0), Block(3,0,1) ],  # y = 0 
    [ Block(3,1,0), Block(3,1,1) ],  # y = 1
    [ Block(3,2,0), Block(3,2,1) ]   # y = 2
]
]

Global Block ID Calculation

A unique global block ID is computed as: blockId = blockIdx.z * gridDim.y * gridDim.x + blockIdx.y * gridDim.x + blockIdx.x

For this example

gridDim.x = 2
gridDim.y = 3
gridDim.z = 4 


|    blockIdx.z  |   blockIdx.y  |  blockIdx.x      |      Calculation           |  Global BlockID   |
|----------------|---------------|------------------|--------------------------- | ------------------|
|       0        |     0         |     0            |     0 * 2 * 3 + 0 * 2 + 0  |         0         |
|       0        |     0         |     1            |     0 * 2 * 3 + 0 * 2 + 1  |         1         |
|       0        |     1         |     0            |     0 * 2 * 3 + 1 * 2 + 0  |         2         |
|       0        |     1         |     1            |     0 * 2 * 3 + 1 * 2 + 1  |         3         |
|       0        |     2         |     0            |     0 * 2 * 3 + 2 * 2 + 0  |         4         |
|       0        |     2         |     1            |     0 * 2 * 3 + 2 * 2 + 1  |         5         |
|       1        |     0         |     0            |     1 * 2 * 3 + 0 * 2 + 0  |         6         |
|       1        |     0         |     1            |     1 * 2 * 3 + 0 * 2 + 1  |         7         |
|       1        |     1         |     0            |     1 * 2 * 3 + 1 * 2 + 0  |         8         |
|       1        |     1         |     1            |     1 * 2 * 3 + 1 * 2 + 1  |         9         |
|       1        |     2         |     0            |     1 * 2 * 3 + 2 * 2 + 0  |         10        |
|       1        |     2         |     1            |     1 * 2 * 3 + 2 * 2 + 1  |         11        |
|       2        |     0         |     0            |     2 * 2 * 3 + 0 * 2 + 0  |         12        |
|       2        |     0         |     1            |     2 * 2 * 3 + 0 * 2 + 1  |         13        |
|       2        |     1         |     0            |     2 * 2 * 3 + 1 * 2 + 0  |         14        |
|       2        |     1         |     1            |     2 * 2 * 3 + 1 * 2 + 1  |         15        |
|       2        |     2         |     0            |     2 * 2 * 3 + 2 * 2 + 0  |         16        |
|       2        |     2         |     1            |     2 * 2 * 3 + 2 * 2 + 1  |         17        |
|       3        |     0         |     0            |     3 * 2 * 3 + 0 * 2 + 0  |         18        |
|       3        |     0         |     1            |     3 * 2 * 3 + 0 * 2 + 1  |         19        |
|       3        |     1         |     0            |     3 * 2 * 3 + 1 * 2 + 0  |         20        |
|       3        |     1         |     1            |     3 * 2 * 3 + 1 * 2 + 1  |         21        |
|       3        |     2         |     0            |     3 * 2 * 3 + 2 * 2 + 0  |         22        |
|       3        |     2         |     1            |     3 * 2 * 3 + 2 * 2 + 1  |         23        |

Local Thread Indexing:

Each block contains 5 threads with local thread indices from 0 to 4.

Global Thread Index Calculation

Once the unique global block ID is computed, the global thread ID is calculated by:

threadId = blockId * blockDim.x + threadIdx.x;

  • BlockDim.x = 5

  • threadIdx.x = [0, 1, 2, 3, 4]

      |   GlobalBlockID   |  Calculation     |   GlobalThreadID      |
      |-------------------|------------------|-----------------------|
      |       0           |   0 * 5 + 0      |           0           |
      |       0           |   0 * 5 + 1      |           1           |
      |       0           |   0 * 5 + 2      |           2           |
      |       0           |   0 * 5 + 3      |           3           |
      |       0           |   0 * 5 + 4      |           4           |
      |       1           |   1 * 5 + 0      |           5           |
      |       1           |   1 * 5 + 1      |           6           |
      |       1           |   1 * 5 + 2      |           7           |
      |       1           |   1 * 5 + 3      |           8           |
      |       1           |   1 * 5 + 4      |           9           |
      |       2           |   2 * 5 + 0      |           10          |
      |       2           |   2 * 5 + 1      |           11          |
      |       2           |   2 * 5 + 2      |           12          |
      |       2           |   2 * 5 + 3      |           13          |
      |       2           |   2 * 5 + 4      |           14          |
      |       3           |   3 * 5 + 0      |           15          |
      |       3           |   3 * 5 + 1      |           16          |
      |       3           |   3 * 5 + 2      |           17          |
      |       3           |   3 * 5 + 3      |           18          |
      |       3           |   3 * 5 + 4      |           19          |
      |       4           |   4 * 5 + 0      |           20          |
      |       4           |   4 * 5 + 1      |           21          |
      |       4           |   4 * 5 + 2      |           22          |
      |       4           |   4 * 5 + 3      |           23          |
      |       4           |   4 * 5 + 4      |           24          |
      |       5           |   5 * 5 + 0      |           25          |
      |       5           |   5 * 5 + 1      |           26          |
      |       5           |   5 * 5 + 2      |           27          |
      |       5           |   5 * 5 + 3      |           28          |
      |       5           |   5 * 5 + 4      |           29          |
      |       6           |   6 * 5 + 0      |           30          |
      |       6           |   6 * 5 + 1      |           31          |
      |       6           |   6 * 5 + 2      |           32          |
      |       6           |   6 * 5 + 3      |           33          |
      |       6           |   6 * 5 + 4      |           34          |
      |                   |                  |                       |
      |                   |                  |                       |
      |                   |                  |                       |
      |       23          |   23 * 5 + 0     |           115         |
      |       23          |   23 * 5 + 1     |           116         |
      |       23          |   23 * 5 + 2     |           117         |
      |       23          |   23 * 5 + 3     |           118         |
      |       23          |   23 * 5 + 4     |           119         |
    

Thread Indexing – 1D Grid and 2D Block

  • GridDim is 1D : gridDim.x = N (say gridDim.x = 1)
  • BlockDim is 2D : blockDim.x = 2, blockDim.y = 3

Since GridDim is 1D, we have only one block. If gridDim.x = 1, the only block is blockIdx.x = 0.

blockDim = (2, 3)
---------------------
| (0,0) | (0,1) |
| (1,0) | (1,1) |
| (2,0) | (2,1) |
---------------------

To calculate global thread indices in 2D, we use:

int global_x = threadIdx.x + blockIdx.x * blockDim.x; int global_y = threadIdx.y + blockIdx.y * blockDim.y;

Each thread in the single block Block(0) has a unique (threadIdx.x, threadIdx.y):

Global Thread Index (flattened row-wise):

---------------------
|   0   |   1   |
|   2   |   3   |
|   4   |   5   |
---------------------

we can derive Global Index using Formulae threadIdx.y * blockDim.x + threadIdx.x

Block(0) – Threads with blockDim(2,3)

blockDim.x = 2

| threadIdx.y | threadIdx.x | GlobalThreadID   |
|-------------|-------------|------------------|
|     0       |     0       |    0 * 2 + 0     |
|     0       |     1       |    0 * 2 + 1     |
|     1       |     0       |    1 * 2 + 0     |
|     1       |     1       |    1 * 2 + 1     |
|     2       |     0       |    2 * 2  + 0    |
|     2       |     1       |    2 * 2 + 1     |

Thread Indexing - 2D Grid and 2D Block

dim3 gridDim(3,2) = (x,y) dim3 blockDim(4,2) = (x,y)

gridDim.x = 3 gridDim.y = 2 blockDim.x = 4 blockDim.y = 2

Total Number of Blocks = 32 = 6 Blocks Threads per Blocks = 42 = 8 Threads Total Threads = 6 * 8 = 48 Threads

Grid Layout


gridDim = (3, 2)

|        | x = 0 | x = 1 |  x=2      |
|------- |-------|-------|-----------|
|  y = 0 | (0,0) | (0,1) |  (0,2)    |
|  y = 1 | (1,0) | (1,1) |  (1,2)    |
 ------------------------------------

|       |  x = 0  |  x = 1  |  x = 2  |
|-------|-------- |---------|---------|
| y=0   |    0    |    1    |   2     |
| y=1   |    3    |    4    |   5     |
|-------------------------------------|

A unique global block ID is computed as: blockId = blockIdx.y * gridDim.x + blockIdx.x

For this example (with gridDim.x = 2):

|   blockIdx.y  |  blockIdx.x      |      Calculation      |  Global BlockId  |
|---------------|------------------|-----------------------|------------------|
|   0           |     0            |         0 * 3 + 0     |       0          |
|   0           |     1            |         0 * 3 + 1     |       1          |
|   0           |     2            |         0 * 3 + 2     |       2          |
|   1           |     0            |         1 * 3 + 0     |       3          |
|   1           |     1            |         1 * 3 + 1     |       4          |
|   1           |     2            |         1 * 3 + 2     |       5          |

Local Thread Indexing:


blockDim = (4, 2)

|        | x = 0 | x = 1 |  x=2   | x=3  |
|------- |-------|-------|--------|------|
|  y = 0 | (0,0) | (0,1) |  (0,2) | (0,3)|
|  y = 1 | (1,0) | (1,1) |  (1,2) | (1,3)|
 -----------------------------------------

|       |  x = 0  |  x = 1  |  x = 2  | x = 3 |
|-------|-------- |---------|---------|-------|
| y=0   |    0    |    1    |   2     |   3   |
| y=1   |    4    |    5    |   6     |   7   |
|---------------------------------------------|

Local Thread Index - Linear Index can be calculated with formulae : local threadId = threadIdx.y * blockDim.x + threadIdx.x

|   threadIdx.y |  threadIdx.x     |      Calculation      |  Local ThreadId  |
|---------------|------------------|-----------------------|------------------|
|   0           |     0            |         0 * 4 + 0     |       0          |
|   0           |     1            |         0 * 4 + 1     |       1          |
|   0           |     2            |         0 * 4 + 2     |       2          |
|   0           |     3            |         0 * 4 + 3     |       3          |
|   1           |     0            |         1 * 4 + 0     |       4          |
|   1           |     1            |         1 * 4 + 1     |       5          |
|   1           |     2            |         1 * 4 + 2     |       6          |
|   1           |     3            |         1 * 4 + 3     |       7          |

Global Thread Index Calculation

Finally using Global BlockId & Local ThreadId of each Block we can Calculate Global Thread Id using below formulae:

(blockId * blockDim.x * blockDim.y) + local threadId

blockId = blockIdx.y * gridDim.x + blockIdx.x
local ThreadId = threadIdx.y * blockDim.x + threadIdx.x

    |   GlobalBlockID   |   LocalThreadId  |   GlobalThreadID      |
    |-------------------|------------------|-----------------------|
    |       0           |   0 * 8 + 0      |           0           |
    |       0           |   0 * 8 + 1      |           1           |
    |       0           |   0 * 8 + 2      |           2           |
    |       0           |   0 * 8 + 3      |           3           |
    |       0           |   0 * 8 + 4      |           4           |
    |       0           |   0 * 8 + 5      |           5           |
    |       0           |   0 * 8 + 6      |           6           |
    |       0           |   0 * 8 + 7      |           7           |

    |       1           |   1 * 8 + 0      |           8           |
    |       1           |   1 * 8 + 1      |           9           |
    |       1           |   1 * 8 + 2      |           10          |
    |       1           |   1 * 8 + 3      |           11          |
    |       1           |   1 * 8 + 4      |           12          |
    |       1           |   1 * 8 + 5      |           13          |
    |       1           |   1 * 8 + 6      |           14          |
    |       1           |   1 * 8 + 7      |           15          |

    |       2           |   2 * 8 + 0      |           16          |
    |       2           |   2 * 8 + 1      |           17          |
    |       2           |   2 * 8 + 2      |           18          |
    |       2           |   2 * 8 + 3      |           19          |
    |       2           |   2 * 8 + 4      |           20          |
    |       2           |   2 * 8 + 5      |           21          |
    |       2           |   2 * 8 + 6      |           22          |
    |       2           |   2 * 8 + 7      |           23          |

  


    |       5           |   5 * 8 + 0      |           40          |
    |       5           |   5 * 8 + 1      |           41          |
    |       5           |   5 * 8 + 2      |           42          |
    |       5           |   5 * 8 + 3      |           43          |
    |       5           |   5 * 8 + 4      |           44          |
    |       5           |   5 * 8 + 5      |           45          |
    |       5           |   5 * 8 + 6      |           46          |
    |       5           |   5 * 8 + 7      |           47          |

Thread Indexing - 3D Grid and 2D Block

dim3 gridDim(4,3,2) = (x,y,z) dim3 blockDim(4,2) = (x,y)

gridDim.x = 4 gridDim.y = 3 gridDim.z = 2

blockDim.x = 4 blockDim.y = 2

Number of Blocks = 24 Nubmer of Thread per Block = 8 Total Number of Threads = 24 * 8 = 192

Grid Layout


Grid Comprises of 2 matrices of (3*4) , which can be represented as below.

Total Number of Blocks = 4 * 3 * 2 = 24 Blocks

grid = [

[  # z = 0
    [ Block(0,0,0), Block(0,0,1), Block(0,0,2), Block(0,0,3) ],  # y = 0
    [ Block(0,1,0), Block(0,1,1), Block(0,1,2), Block(0,1,3) ],  # y = 1
    [ Block(0,2,0), Block(0,2,1), Block(0,2,2), Block(0,2,3) ]   # y = 2
],
[  # z = 1
    [ Block(1,0,0), Block(1,0,1), Block(1,0,2), Block(1,0,3)  ],  # y = 0
    [ Block(1,1,0), Block(1,1,1), Block(1,1,2), Block(1,1,3)  ],  # y = 1
    [ Block(1,2,0), Block(1,2,1), Block(1,2,2), Block(1,2,3)  ]  # y = 2
]

]

A unique global block ID is computed as: blockId = ( blockIdx.z * gridDim.y * gridDim.x ) + ( blockIdx.y * gridDim.x ) + blockIdx.x

    | blockIdx.z   |   blockIdx.y  |  blockIdx.x      |      Calculation                  |  Global BlockId  |
    |------------- |---------------|------------------|-----------------------------------|------------------|
    |      0       |   0           |     0            |    ( 0 * 3 * 4 ) + ( 0 * 4 ) + 0  |       0          |
    |      0       |   0           |     1            |    ( 0 * 3 * 4 ) + ( 0 * 4 ) + 1  |       1          |
    |      0       |   0           |     2            |    ( 0 * 3 * 4 ) + ( 0 * 4 ) + 2  |       2          |
    |      0       |   0           |     3            |    ( 0 * 3 * 4 ) + ( 0 * 4 ) + 3  |       3          |
    |      0       |   1           |     0            |    ( 0 * 3 * 4 ) + ( 1 * 4 ) + 0  |       4          |
    |      0       |   1           |     1            |    ( 0 * 3 * 4 ) + ( 1 * 4 ) + 1  |       5          |
    |      0       |   1           |     2            |    ( 0 * 3 * 4 ) + ( 1 * 4 ) + 2  |       6          |
    |      0       |   1           |     3            |    ( 0 * 3 * 4 ) + ( 1 * 4 ) + 3  |       7          |
    |      0       |   2           |     0            |    ( 0 * 3 * 4 ) + ( 2 * 4 ) + 0  |       8          |
    |      0       |   2           |     1            |    ( 0 * 3 * 4 ) + ( 2 * 4 ) + 1  |       9          |
    |      0       |   2           |     2            |    ( 0 * 3 * 4 ) + ( 2 * 4 ) + 2  |       10         |
    |      0       |   2           |     3            |    ( 0 * 3 * 4 ) + ( 2 * 4 ) + 3  |       11         |

    |      1       |   0           |     0            |    ( 1 * 3 * 4 ) + ( 0 * 4 ) + 0  |       12          |
    |      1       |   0           |     1            |    ( 1 * 3 * 4 ) + ( 0 * 4 ) + 1  |       13          |
    |      1       |   0           |     2            |    ( 1 * 3 * 4 ) + ( 0 * 4 ) + 2  |       14          |
    |      1       |   0           |     3            |    ( 1 * 3 * 4 ) + ( 0 * 4 ) + 3  |       15          |
    |      1       |   1           |     0            |    ( 1 * 3 * 4 ) + ( 1 * 4 ) + 0  |       16          |
    |      1       |   1           |     1            |    ( 1 * 3 * 4 ) + ( 1 * 4 ) + 1  |       17          |
    |      1       |   1           |     2            |    ( 1 * 3 * 4 ) + ( 1 * 4 ) + 2  |       18          |
    |      1       |   1           |     3            |    ( 1 * 3 * 4 ) + ( 1 * 4 ) + 3  |       19          |
    |      1       |   2           |     0            |    ( 1 * 3 * 4 ) + ( 2 * 4 ) + 0  |       20          |
    |      1       |   2           |     1            |    ( 1 * 3 * 4 ) + ( 2 * 4 ) + 1  |       21          |
    |      1       |   2           |     2            |    ( 1 * 3 * 4 ) + ( 2 * 4 ) + 2  |       22          |
    |      1       |   2           |     3            |    ( 1 * 3 * 4 ) + ( 2 * 4 ) + 3  |       23          |

Local Thread Indexing:


blockDim = (4, 2)

|        | x = 0 | x = 1 |  x=2   | x=3  |
|------- |-------|-------|--------|------|
|  y = 0 | (0,0) | (0,1) |  (0,2) | (0,3)|
|  y = 1 | (1,0) | (1,1) |  (1,2) | (1,3)|
 -----------------------------------------

|       |  x = 0  |  x = 1  |  x = 2  | x = 3 |
|-------|-------- |---------|---------|-------|
| y=0   |    0    |    1    |   2     |   3   |
| y=1   |    4    |    5    |   6     |   7   |
|---------------------------------------------|

Local Thread Index - Linear Index can be calculated with formulae : local threadId = threadIdx.y * blockDim.x + threadIdx.x

|   threadIdx.y |  threadIdx.x     |      Calculation      |  Local ThreadId  |
|---------------|------------------|-----------------------|------------------|
|   0           |     0            |         0 * 4 + 0     |       0          |
|   0           |     1            |         0 * 4 + 1     |       1          |
|   0           |     2            |         0 * 4 + 2     |       2          |
|   0           |     3            |         0 * 4 + 3     |       3          |
|   1           |     0            |         1 * 4 + 0     |       4          |
|   1           |     1            |         1 * 4 + 1     |       5          |
|   1           |     2            |         1 * 4 + 2     |       6          |
|   1           |     3            |         1 * 4 + 3     |       7          |

Global Thread Index Calculation:

Finally using Global BlockId & Local ThreadId of each Block we can Calculate Global Thread Id using below formulae:

(blockId * blockDim.x * blockDim.y) + local threadId

blockId = blockIdx.y * gridDim.x + blockIdx.x
local ThreadId = threadIdx.y * blockDim.x + threadIdx.x

    |   GlobalBlockID   |   LocalThreadId  |   GlobalThreadID      |
    |-------------------|------------------|-----------------------|
    |       0           |   0 * 8 + 0      |           0           |
    |       0           |   0 * 8 + 1      |           1           |
    |       0           |   0 * 8 + 2      |           2           |
    |       0           |   0 * 8 + 3      |           3           |
    |       0           |   0 * 8 + 4      |           4           |
    |       0           |   0 * 8 + 5      |           5           |
    |       0           |   0 * 8 + 6      |           6           |
    |       0           |   0 * 8 + 7      |           7           |
    

    |       1           |   1 * 8 + 0      |           8           |
    |       1           |   1 * 8 + 1      |           9           |
    |       1           |   1 * 8 + 2      |           10          |
    |       1           |   1 * 8 + 3      |           11          |
    |       1           |   1 * 8 + 4      |           12          |
    |       1           |   1 * 8 + 5      |           13          |
    |       1           |   1 * 8 + 6      |           14          |
    |       1           |   1 * 8 + 7      |           15          |

    |       2           |   2 * 8 + 0      |           16          |
    |       2           |   2 * 8 + 1      |           17          |
    |       2           |   2 * 8 + 2      |           18          |
    |       2           |   2 * 8 + 3      |           19          |
    |       2           |   2 * 8 + 4      |           20          |
    |       2           |   2 * 8 + 5      |           21          |
    |       2           |   2 * 8 + 6      |           22          |
    |       2           |   2 * 8 + 7      |           23          |

  


    |       23          |   23 * 8 + 0      |           184         |
    |       23          |   23 * 8 + 1      |           185         |
    |       23          |   23 * 8 + 2      |           186         |
    |       23          |   23 * 8 + 3      |           187         |
    |       23          |   23 * 8 + 4      |           188         |
    |       23          |   23 * 8 + 5      |           189         |
    |       23          |   23* 8 + 6       |           190         |
    |       23          |   23 * 8 + 7      |           191         |

Thread Indexing - 1D Grid and 3D Block

dim3 gridDim(1) dim3 blockDim(6,3,2)

(x,y,z) = (6,3,2)

gridDim.x = 1

blockDim.x = 6 blockDim.y = 3 blockDim.z = 2

Number of Threads per Block :- 6 * 3 * 2 = 36

Grid Layout


Here , as we only have one Block we don’t need to calculate GlobalID and directly we can start with calculating Local ThreadID inside Block, which also becomes GlobalThreadID due to single Block

Local Thread Indexing:


[

    [  # z = 0
        [ Thread(0,0,0), Thread(0,0,1), Thread(0,0,2), Thread(0,0,3), Thread(0,0,4), Thread(0,0,5)  ],  # y = 0
        [ Thread(0,1,0), Thread(0,1,1), Thread(0,1,2), Thread(0,1,3), Thread(0,1,4), Thread(0,1,5)  ],  # y = 1
        [ Thread(0,2,0), Thread(0,2,1), Thread(0,2,2), Thread(0,2,3), Thread(0,2,4), Thread(0,2,5)  ]   # y = 2
    ],
    [  # z = 1
        [ Thread(1,0,0), Thread(1,0,1), Thread(1,0,2), Thread(1,0,3), Thread(1,0,4), Thread(1,0,5)   ],  # y = 0
        [ Thread(1,1,0), Thread(1,1,1), Thread(1,1,2), Thread(1,1,3), Thread(1,1,4), Thread(1,1,5)   ],  # y = 1
        [ Thread(1,2,0), Thread(1,2,1), Thread(1,2,2), Thread(1,2,3), Thread(1,2,4), Thread(1,2,5)   ]  # y = 2
    ]

]

Local ThreadID is computed as: Local ThreadId = ( threadIdx.z * blockDim.y * blockDim.x ) + ( threadIdx.y * blockDim.x ) + threadIdx.x

    | threadIdx.z  |   threadIdx.y |  threadIdx.x     |      Calculation                  |  Local ThreadId  |
    |------------- |---------------|------------------|-----------------------------------|------------------|
    |      0       |   0           |     0            |    ( 0 * 6 * 3 ) + ( 0 * 6 ) + 0  |       0          |
    |      0       |   0           |     1            |    ( 0 * 6 * 3 ) + ( 0 * 6 ) + 1  |       1          |
    |      0       |   0           |     2            |    ( 0 * 6 * 3 ) + ( 0 * 6 ) + 2  |       2          |
    |      0       |   0           |     3            |    ( 0 * 6 * 3 ) + ( 0 * 6 ) + 3  |       3          |
    |      0       |   0           |     4            |    ( 0 * 6 * 3 ) + ( 0 * 6 ) + 4  |       4          |
    |      0       |   0           |     5            |    ( 0 * 6 * 3 ) + ( 0 * 6 ) + 5  |       5          |
    
    |      0       |   1           |     0            |    ( 0 * 6 * 3 ) + ( 1 * 6 ) + 0  |       6          |
    |      0       |   1           |     1            |    ( 0 * 6 * 3 ) + ( 1 * 6 ) + 1  |       7          |
    |      0       |   1           |     2            |    ( 0 * 6 * 3 ) + ( 1 * 6 ) + 2  |       8          |
    |      0       |   1           |     3            |    ( 0 * 6 * 3 ) + ( 1 * 6 ) + 3  |       9          |
    |      0       |   1           |     4            |    ( 0 * 6 * 3 ) + ( 1 * 6 ) + 4  |       10         |
    |      0       |   1           |     5            |    ( 0 * 6 * 3 ) + ( 1 * 6 ) + 5  |       11         |

    |      0       |   2           |     0            |    ( 0 * 6 * 3 ) + ( 2 * 6 ) + 0  |       12         |
    |      0       |   2           |     1            |    ( 0 * 6 * 3 ) + ( 2 * 6 ) + 1  |       13         |
    |      0       |   2           |     2            |    ( 0 * 6 * 3 ) + ( 2 * 6 ) + 2  |       14         |
    |      0       |   2           |     3            |    ( 0 * 6 * 3 ) + ( 2 * 6 ) + 3  |       15         |
    |      0       |   2           |     4            |    ( 0 * 6 * 3 ) + ( 2 * 6 ) + 4  |       16         |
    |      0       |   2           |     5            |    ( 0 * 6 * 3 ) + ( 2 * 6 ) + 5  |       17         |


    |      1       |   0           |     0            |    ( 1 * 6 * 3 ) + ( 0 * 6 ) + 0  |       18         |
    |      1       |   0           |     1            |    ( 1 * 6 * 3 ) + ( 0 * 6 ) + 1  |       19         |
    |      1       |   0           |     2            |    ( 1 * 6 * 3 ) + ( 0 * 6 ) + 2  |       20         |
    |      1       |   0           |     3            |    ( 1 * 6 * 3 ) + ( 0 * 6 ) + 3  |       21         |
    |      1       |   0           |     4            |    ( 1 * 6 * 3 ) + ( 0 * 6 ) + 4  |       22         |
    |      1       |   0           |     5            |    ( 1 * 6 * 3 ) + ( 0 * 6 ) + 5  |       23         |

    |      1       |   1           |     0            |    ( 1 * 6 * 3 ) + ( 1 * 6 ) + 0  |       24         |
    |      1       |   1           |     1            |    ( 1 * 6 * 3 ) + ( 1 * 6 ) + 1  |       25         |
    |      1       |   1           |     2            |    ( 1 * 6 * 3 ) + ( 1 * 6 ) + 2  |       26         |
    |      1       |   1           |     3            |    ( 1 * 6 * 3 ) + ( 1 * 6 ) + 3  |       27         |
    |      1       |   1           |     4            |    ( 1 * 6 * 3 ) + ( 1 * 6 ) + 4  |       28         |
    |      1       |   1           |     5            |    ( 1 * 6 * 3 ) + ( 1 * 6 ) + 5  |       29         |

    |      1       |   2           |     0            |    ( 1 * 6 * 3 ) + ( 2 * 6 ) + 0  |       30         |
    |      1       |   2           |     1            |    ( 1 * 6 * 3 ) + ( 2 * 6 ) + 1  |       31         |
    |      1       |   2           |     2            |    ( 1 * 6 * 3 ) + ( 2 * 6 ) + 2  |       32         |
    |      1       |   2           |     3            |    ( 1 * 6 * 3 ) + ( 2 * 6 ) + 3  |       33         |
    |      1       |   2           |     4            |    ( 1 * 6 * 3 ) + ( 2 * 6 ) + 4  |       34         |
    |      1       |   2           |     5            |    ( 1 * 6 * 3 ) + ( 2 * 6 ) + 5  |       35         |

Thread Indexing - 2D Grid and 3D Block

dim3 gridDim(3,2) = (x,y) dim3 blockDim(6,3,2) = (x,y,z)

gridDim.x = 3 gridDim.y = 2

blockDim.x = 6 blockDim.y = 3 blockDim.z = 2

Grid Layout


gridDim = (3, 2)

|        | x = 0 | x = 1 |  x=2      |
|------- |-------|-------|-----------|
|  y = 0 | (0,0) | (0,1) |  (0,2)    |
|  y = 1 | (1,0) | (1,1) |  (1,2)    |
 ------------------------------------

|       |  x = 0  |  x = 1  |  x = 2  |
|-------|-------- |---------|---------|
| y=0   |    0    |    1    |   2     |
| y=1   |    3    |    4    |   5     |
|-------------------------------------|

A unique global block ID is computed as: blockId = blockIdx.y * gridDim.x + blockIdx.x

For this example (with gridDim.x = 2):

|   blockIdx.y  |  blockIdx.x      |      Calculation      |  Global BlockId  |
|---------------|------------------|-----------------------|------------------|
|   0           |     0            |         0 * 3 + 0     |       0          |
|   0           |     1            |         0 * 3 + 1     |       1          |
|   0           |     2            |         0 * 3 + 2     |       2          |
|   1           |     0            |         1 * 3 + 0     |       3          |
|   1           |     1            |         1 * 3 + 1     |       4          |
|   1           |     2            |         1 * 3 + 2     |       5          |

Local Thread Indexing:


[

    [  # z = 0
        [ Thread(0,0,0), Thread(0,0,1), Thread(0,0,2), Thread(0,0,3), Thread(0,0,4), Thread(0,0,5)  ],  # y = 0
        [ Thread(0,1,0), Thread(0,1,1), Thread(0,1,2), Thread(0,1,3), Thread(0,1,4), Thread(0,1,5)  ],  # y = 1
        [ Thread(0,2,0), Thread(0,2,1), Thread(0,2,2), Thread(0,2,3), Thread(0,2,4), Thread(0,2,5)  ]   # y = 2
    ],
    [  # z = 1
        [ Thread(1,0,0), Thread(1,0,1), Thread(1,0,2), Thread(1,0,3), Thread(1,0,4), Thread(1,0,5)   ],  # y = 0
        [ Thread(1,1,0), Thread(1,1,1), Thread(1,1,2), Thread(1,1,3), Thread(1,1,4), Thread(1,1,5)   ],  # y = 1
        [ Thread(1,2,0), Thread(1,2,1), Thread(1,2,2), Thread(1,2,3), Thread(1,2,4), Thread(1,2,5)   ]  # y = 2
    ]

]

Local ThreadID is computed as: Local ThreadId = ( threadIdx.z * blockDim.y * blockDim.x ) + ( threadIdx.y * blockDim.x ) + threadIdx.x

    | threadIdx.z  |   threadIdx.y |  threadIdx.x     |      Calculation                  |  Local ThreadId  |
    |------------- |---------------|------------------|-----------------------------------|------------------|
    |      0       |   0           |     0            |    ( 0 * 6 * 3 ) + ( 0 * 6 ) + 0  |       0          |
    |      0       |   0           |     1            |    ( 0 * 6 * 3 ) + ( 0 * 6 ) + 1  |       1          |
    |      0       |   0           |     2            |    ( 0 * 6 * 3 ) + ( 0 * 6 ) + 2  |       2          |
    |      0       |   0           |     3            |    ( 0 * 6 * 3 ) + ( 0 * 6 ) + 3  |       3          |
    |      0       |   0           |     4            |    ( 0 * 6 * 3 ) + ( 0 * 6 ) + 4  |       4          |
    |      0       |   0           |     5            |    ( 0 * 6 * 3 ) + ( 0 * 6 ) + 5  |       5          |
    
    |      0       |   1           |     0            |    ( 0 * 6 * 3 ) + ( 1 * 6 ) + 0  |       6          |
    |      0       |   1           |     1            |    ( 0 * 6 * 3 ) + ( 1 * 6 ) + 1  |       7          |
    |      0       |   1           |     2            |    ( 0 * 6 * 3 ) + ( 1 * 6 ) + 2  |       8          |
    |      0       |   1           |     3            |    ( 0 * 6 * 3 ) + ( 1 * 6 ) + 3  |       9          |
    |      0       |   1           |     4            |    ( 0 * 6 * 3 ) + ( 1 * 6 ) + 4  |       10         |
    |      0       |   1           |     5            |    ( 0 * 6 * 3 ) + ( 1 * 6 ) + 5  |       11         |

    |      0       |   2           |     0            |    ( 0 * 6 * 3 ) + ( 2 * 6 ) + 0  |       12         |
    |      0       |   2           |     1            |    ( 0 * 6 * 3 ) + ( 2 * 6 ) + 1  |       13         |
    |      0       |   2           |     2            |    ( 0 * 6 * 3 ) + ( 2 * 6 ) + 2  |       14         |
    |      0       |   2           |     3            |    ( 0 * 6 * 3 ) + ( 2 * 6 ) + 3  |       15         |
    |      0       |   2           |     4            |    ( 0 * 6 * 3 ) + ( 2 * 6 ) + 4  |       16         |
    |      0       |   2           |     5            |    ( 0 * 6 * 3 ) + ( 2 * 6 ) + 5  |       17         |


    |      1       |   0           |     0            |    ( 1 * 6 * 3 ) + ( 0 * 6 ) + 0  |       18         |
    |      1       |   0           |     1            |    ( 1 * 6 * 3 ) + ( 0 * 6 ) + 1  |       19         |
    |      1       |   0           |     2            |    ( 1 * 6 * 3 ) + ( 0 * 6 ) + 2  |       20         |
    |      1       |   0           |     3            |    ( 1 * 6 * 3 ) + ( 0 * 6 ) + 3  |       21         |
    |      1       |   0           |     4            |    ( 1 * 6 * 3 ) + ( 0 * 6 ) + 4  |       22         |
    |      1       |   0           |     5            |    ( 1 * 6 * 3 ) + ( 0 * 6 ) + 5  |       23         |

    |      1       |   1           |     0            |    ( 1 * 6 * 3 ) + ( 1 * 6 ) + 0  |       24         |
    |      1       |   1           |     1            |    ( 1 * 6 * 3 ) + ( 1 * 6 ) + 1  |       25         |
    |      1       |   1           |     2            |    ( 1 * 6 * 3 ) + ( 1 * 6 ) + 2  |       26         |
    |      1       |   1           |     3            |    ( 1 * 6 * 3 ) + ( 1 * 6 ) + 3  |       27         |
    |      1       |   1           |     4            |    ( 1 * 6 * 3 ) + ( 1 * 6 ) + 4  |       28         |
    |      1       |   1           |     5            |    ( 1 * 6 * 3 ) + ( 1 * 6 ) + 5  |       29         |

    |      1       |   2           |     0            |    ( 1 * 6 * 3 ) + ( 2 * 6 ) + 0  |       30         |
    |      1       |   2           |     1            |    ( 1 * 6 * 3 ) + ( 2 * 6 ) + 1  |       31         |
    |      1       |   2           |     2            |    ( 1 * 6 * 3 ) + ( 2 * 6 ) + 2  |       32         |
    |      1       |   2           |     3            |    ( 1 * 6 * 3 ) + ( 2 * 6 ) + 3  |       33         |
    |      1       |   2           |     4            |    ( 1 * 6 * 3 ) + ( 2 * 6 ) + 4  |       34         |
    |      1       |   2           |     5            |    ( 1 * 6 * 3 ) + ( 2 * 6 ) + 5  |       35         |

Global ThreadID Calculation:


We use the Global BlockId & Local ThreadId to compute Global ThreadId

(blockId * blockDim.x * blockDim.y * blockDim.z) + local ThreadId

    |   GlobalBlockID   |   LocalThreadId  |   GlobalThreadID      |
    |-------------------|------------------|-----------------------|
    |       0           |   0 * 36 + 0      |           0           |
    |       0           |   0 * 36 + 1      |           1           |
    |       0           |   0 * 36 + 2      |           2           |
    |       0           |   0 * 36 + 3      |           3           |
    |       0           |   0 * 36 + 4      |           4           |
    |       0           |   0 * 36 + 5      |           5           |
    |       0           |   0 * 36 + 6      |           6           |
    |       0           |   0 * 36 + 7      |           7           |
    |       0           |   0 * 36 + 8      |           8           |
    |       0           |   0 * 36 + 9      |           9           |
    |       0           |   0 * 36 + 10     |           10          |
    |       0           |   0 * 36 + 11     |           11          |
    |       0           |   0 * 36 + 12     |           12          |
    |       0           |   0 * 36 + 13     |           13          |
    |       0           |   0 * 36 + 14     |           14          |
    |       0           |   0 * 36 + 15     |           15          |
    |       0           |   0 * 36 + 16     |           16          |
    |       0           |   0 * 36 + 17     |           17          |
    |       0           |   0 * 36 + 18     |           18          |
    |       0           |   0 * 36 + 19     |           19          |

    |       0           |   0 * 36 + 34     |           34          |
    |       0           |   0 * 36 + 35     |           35          |    

    |       1           |   1 * 36 + 0      |           36          |
    |       1           |   1 * 36 + 1      |           37          |
    |       1           |   1 * 36 + 2      |           38          |
    |       1           |   1 * 36 + 3      |           39          |
    |       1           |   1 * 36 + 4      |           40          |
    |       1           |   1 * 36 + 5      |           41          |
    |       1           |   1 * 36 + 6      |           42          |
    |       1           |   1 * 36 + 7      |           43          |
    |       1           |   1 * 36 + 8      |           44          |
    |       1           |   1 * 36 + 9      |           45          |
    |       1           |   1 * 36 + 10     |           46          |
    |       1           |   1 * 36 + 11     |           47          |
    |       1           |   1 * 36 + 12     |           48          |
    |       1           |   1 * 36 + 13     |           49          |
    |       1           |   1 * 36 + 14     |           50          |
    |       1           |   1 * 36 + 15     |           51          |
    |       1           |   1 * 36 + 16     |           52          |
    |       1           |   1 * 36 + 17     |           53          |
    |       1           |   1 * 36 + 18     |           54          |
    |       1           |   1 * 36 + 19     |           55          |

    |       0           |   1 * 36 + 34     |           70          |
    |       0           |   1 * 36 + 35     |           71          | 

     ---------------
   
    |       5           |   5 * 36 + 0      |           180         |
    |       5           |   5 * 36 + 1      |           181         |
    |       5           |   5 * 36 + 2      |           182         |
    |       5           |   5 * 36 + 3      |           183         |
    |       5           |   5 * 36 + 4      |           184         |
    |       5           |   5 * 36 + 5      |           185         |
    |       5           |   5 * 36 + 6      |           186         |
    |       5           |   5 * 36 + 7      |           187         |
    |       5           |   5 * 36 + 8      |           188         |
    |       5           |   5 * 36 + 9      |           189         |
    |       5           |   5 * 36 + 10     |           190         |
    |       5           |   5 * 36 + 11     |           191         |
    |       5           |   5 * 36 + 12     |           192         |
    |       5           |   5 * 36 + 13     |           193         |
    |       5           |   5 * 36 + 14     |           194         |
    |       5           |   5 * 36 + 15     |           195         |
    |       5           |   5 * 36 + 16     |           196         |
    |       5           |   5 * 36 + 17     |           197         |
    |       5           |   5 * 36 + 18     |           198         |
    |       5           |   5 * 36 + 19     |           199         |

    |       5           |   5 * 36 + 34     |           214         |
    |       5           |   5 * 36 + 35     |           215         | 

Thread Indexing - 3D Grid and 3D Block

dim3 gridDim(4,3,2) = (x,y,z) dim3 blockDim(6,3,2) = (x,y,z)

gridDim.x = 4 gridDim.y = 3 gridDim.z = 2

blockDim.x = 6 blockDim.y = 3 blockDim.z = 2

Number of Blocks = 4 * 3 * 2 = 24 Nubmer of Thread per Block = 6 * 3 * 2 = 36 Total Number of Threads = 24 * 36 = 864

Grid Layout


Grid Comprises of 2 matrices of (3*4) , which can be represented as below.

Total Number of Blocks = 4 * 3 * 2 = 24 Blocks

grid = [

[  # z = 0
    [ Block(0,0,0), Block(0,0,1), Block(0,0,2), Block(0,0,3) ],  # y = 0
    [ Block(0,1,0), Block(0,1,1), Block(0,1,2), Block(0,1,3) ],  # y = 1
    [ Block(0,2,0), Block(0,2,1), Block(0,2,2), Block(0,2,3) ]   # y = 2
],
[  # z = 1
    [ Block(1,0,0), Block(1,0,1), Block(1,0,2), Block(1,0,3)  ],  # y = 0
    [ Block(1,1,0), Block(1,1,1), Block(1,1,2), Block(1,1,3)  ],  # y = 1
    [ Block(1,2,0), Block(1,2,1), Block(1,2,2), Block(1,2,3)  ]  # y = 2
]

]

A unique global block ID is computed as: blockId = ( blockIdx.z * gridDim.y * gridDim.x ) + ( blockIdx.y * gridDim.x ) + blockIdx.x

    | blockIdx.z   |   blockIdx.y  |  blockIdx.x      |      Calculation                  |  Global BlockId  |
    |------------- |---------------|------------------|-----------------------------------|------------------|
    |      0       |   0           |     0            |    ( 0 * 3 * 4 ) + ( 0 * 4 ) + 0  |       0          |
    |      0       |   0           |     1            |    ( 0 * 3 * 4 ) + ( 0 * 4 ) + 1  |       1          |
    |      0       |   0           |     2            |    ( 0 * 3 * 4 ) + ( 0 * 4 ) + 2  |       2          |
    |      0       |   0           |     3            |    ( 0 * 3 * 4 ) + ( 0 * 4 ) + 3  |       3          |
    |      0       |   1           |     0            |    ( 0 * 3 * 4 ) + ( 1 * 4 ) + 0  |       4          |
    |      0       |   1           |     1            |    ( 0 * 3 * 4 ) + ( 1 * 4 ) + 1  |       5          |
    |      0       |   1           |     2            |    ( 0 * 3 * 4 ) + ( 1 * 4 ) + 2  |       6          |
    |      0       |   1           |     3            |    ( 0 * 3 * 4 ) + ( 1 * 4 ) + 3  |       7          |
    |      0       |   2           |     0            |    ( 0 * 3 * 4 ) + ( 2 * 4 ) + 0  |       8          |
    |      0       |   2           |     1            |    ( 0 * 3 * 4 ) + ( 2 * 4 ) + 1  |       9          |
    |      0       |   2           |     2            |    ( 0 * 3 * 4 ) + ( 2 * 4 ) + 2  |       10         |
    |      0       |   2           |     3            |    ( 0 * 3 * 4 ) + ( 2 * 4 ) + 3  |       11         |

    |      1       |   0           |     0            |    ( 1 * 3 * 4 ) + ( 0 * 4 ) + 0  |       12          |
    |      1       |   0           |     1            |    ( 1 * 3 * 4 ) + ( 0 * 4 ) + 1  |       13          |
    |      1       |   0           |     2            |    ( 1 * 3 * 4 ) + ( 0 * 4 ) + 2  |       14          |
    |      1       |   0           |     3            |    ( 1 * 3 * 4 ) + ( 0 * 4 ) + 3  |       15          |
    |      1       |   1           |     0            |    ( 1 * 3 * 4 ) + ( 1 * 4 ) + 0  |       16          |
    |      1       |   1           |     1            |    ( 1 * 3 * 4 ) + ( 1 * 4 ) + 1  |       17          |
    |      1       |   1           |     2            |    ( 1 * 3 * 4 ) + ( 1 * 4 ) + 2  |       18          |
    |      1       |   1           |     3            |    ( 1 * 3 * 4 ) + ( 1 * 4 ) + 3  |       19          |
    |      1       |   2           |     0            |    ( 1 * 3 * 4 ) + ( 2 * 4 ) + 0  |       20          |
    |      1       |   2           |     1            |    ( 1 * 3 * 4 ) + ( 2 * 4 ) + 1  |       21          |
    |      1       |   2           |     2            |    ( 1 * 3 * 4 ) + ( 2 * 4 ) + 2  |       22          |
    |      1       |   2           |     3            |    ( 1 * 3 * 4 ) + ( 2 * 4 ) + 3  |       23          |

Local Thread Indexing:


[

    [  # z = 0
        [ Thread(0,0,0), Thread(0,0,1), Thread(0,0,2), Thread(0,0,3), Thread(0,0,4), Thread(0,0,5)  ],  # y = 0
        [ Thread(0,1,0), Thread(0,1,1), Thread(0,1,2), Thread(0,1,3), Thread(0,1,4), Thread(0,1,5)  ],  # y = 1
        [ Thread(0,2,0), Thread(0,2,1), Thread(0,2,2), Thread(0,2,3), Thread(0,2,4), Thread(0,2,5)  ]   # y = 2
    ],
    [  # z = 1
        [ Thread(1,0,0), Thread(1,0,1), Thread(1,0,2), Thread(1,0,3), Thread(1,0,4), Thread(1,0,5)   ],  # y = 0
        [ Thread(1,1,0), Thread(1,1,1), Thread(1,1,2), Thread(1,1,3), Thread(1,1,4), Thread(1,1,5)   ],  # y = 1
        [ Thread(1,2,0), Thread(1,2,1), Thread(1,2,2), Thread(1,2,3), Thread(1,2,4), Thread(1,2,5)   ]  # y = 2
    ]

]

Local ThreadID is computed as: Local ThreadId = ( threadIdx.z * blockDim.y * blockDim.x ) + ( threadIdx.y * blockDim.x ) + threadIdx.x

    | threadIdx.z  |   threadIdx.y |  threadIdx.x     |      Calculation                  |  Local ThreadId  |
    |------------- |---------------|------------------|-----------------------------------|------------------|
    |      0       |   0           |     0            |    ( 0 * 6 * 3 ) + ( 0 * 6 ) + 0  |       0          |
    |      0       |   0           |     1            |    ( 0 * 6 * 3 ) + ( 0 * 6 ) + 1  |       1          |
    |      0       |   0           |     2            |    ( 0 * 6 * 3 ) + ( 0 * 6 ) + 2  |       2          |
    |      0       |   0           |     3            |    ( 0 * 6 * 3 ) + ( 0 * 6 ) + 3  |       3          |
    |      0       |   0           |     4            |    ( 0 * 6 * 3 ) + ( 0 * 6 ) + 4  |       4          |
    |      0       |   0           |     5            |    ( 0 * 6 * 3 ) + ( 0 * 6 ) + 5  |       5          |

    |      0       |   1           |     0            |    ( 0 * 6 * 3 ) + ( 1 * 6 ) + 0  |       6          |
    |      0       |   1           |     1            |    ( 0 * 6 * 3 ) + ( 1 * 6 ) + 1  |       7          |
    |      0       |   1           |     2            |    ( 0 * 6 * 3 ) + ( 1 * 6 ) + 2  |       8          |
    |      0       |   1           |     3            |    ( 0 * 6 * 3 ) + ( 1 * 6 ) + 3  |       9          |
    |      0       |   1           |     4            |    ( 0 * 6 * 3 ) + ( 1 * 6 ) + 4  |       10         |
    |      0       |   1           |     5            |    ( 0 * 6 * 3 ) + ( 1 * 6 ) + 5  |       11         |

    |      0       |   2           |     0            |    ( 0 * 6 * 3 ) + ( 2 * 6 ) + 0  |       12         |
    |      0       |   2           |     1            |    ( 0 * 6 * 3 ) + ( 2 * 6 ) + 1  |       13         |
    |      0       |   2           |     2            |    ( 0 * 6 * 3 ) + ( 2 * 6 ) + 2  |       14         |
    |      0       |   2           |     3            |    ( 0 * 6 * 3 ) + ( 2 * 6 ) + 3  |       15         |
    |      0       |   2           |     4            |    ( 0 * 6 * 3 ) + ( 2 * 6 ) + 4  |       16         |
    |      0       |   2           |     5            |    ( 0 * 6 * 3 ) + ( 2 * 6 ) + 5  |       17         |


    |      1       |   0           |     0            |    ( 1 * 6 * 3 ) + ( 0 * 6 ) + 0  |       18         |
    |      1       |   0           |     1            |    ( 1 * 6 * 3 ) + ( 0 * 6 ) + 1  |       19         |
    |      1       |   0           |     2            |    ( 1 * 6 * 3 ) + ( 0 * 6 ) + 2  |       20         |
    |      1       |   0           |     3            |    ( 1 * 6 * 3 ) + ( 0 * 6 ) + 3  |       21         |
    |      1       |   0           |     4            |    ( 1 * 6 * 3 ) + ( 0 * 6 ) + 4  |       22         |
    |      1       |   0           |     5            |    ( 1 * 6 * 3 ) + ( 0 * 6 ) + 5  |       23         |

    |      1       |   1           |     0            |    ( 1 * 6 * 3 ) + ( 1 * 6 ) + 0  |       24         |
    |      1       |   1           |     1            |    ( 1 * 6 * 3 ) + ( 1 * 6 ) + 1  |       25         |
    |      1       |   1           |     2            |    ( 1 * 6 * 3 ) + ( 1 * 6 ) + 2  |       26         |
    |      1       |   1           |     3            |    ( 1 * 6 * 3 ) + ( 1 * 6 ) + 3  |       27         |
    |      1       |   1           |     4            |    ( 1 * 6 * 3 ) + ( 1 * 6 ) + 4  |       28         |
    |      1       |   1           |     5            |    ( 1 * 6 * 3 ) + ( 1 * 6 ) + 5  |       29         |

    |      1       |   2           |     0            |    ( 1 * 6 * 3 ) + ( 2 * 6 ) + 0  |       30         |
    |      1       |   2           |     1            |    ( 1 * 6 * 3 ) + ( 2 * 6 ) + 1  |       31         |
    |      1       |   2           |     2            |    ( 1 * 6 * 3 ) + ( 2 * 6 ) + 2  |       32         |
    |      1       |   2           |     3            |    ( 1 * 6 * 3 ) + ( 2 * 6 ) + 3  |       33         |
    |      1       |   2           |     4            |    ( 1 * 6 * 3 ) + ( 2 * 6 ) + 4  |       34         |
    |      1       |   2           |     5            |    ( 1 * 6 * 3 ) + ( 2 * 6 ) + 5  |       35         |

Global ThreadID Calculation:


We use the Global BlockId & Local ThreadId to compute Global ThreadId

(blockId * blockDim.x * blockDim.y * blockDim.z) + local ThreadId

    |   GlobalBlockID   |   LocalThreadId  |   GlobalThreadID      |
    |-------------------|------------------|-----------------------|
    |       0           |   0 * 36 + 0      |           0           |
    |       0           |   0 * 36 + 1      |           1           |
    |       0           |   0 * 36 + 2      |           2           |
    |       0           |   0 * 36 + 3      |           3           |
    |       0           |   0 * 36 + 4      |           4           |
    |       0           |   0 * 36 + 5      |           5           |
    |       0           |   0 * 36 + 6      |           6           |
    |       0           |   0 * 36 + 7      |           7           |
    |       0           |   0 * 36 + 8      |           8           |
    |       0           |   0 * 36 + 9      |           9           |
    |       0           |   0 * 36 + 10     |           10          |
    |       0           |   0 * 36 + 11     |           11          |
    |       0           |   0 * 36 + 12     |           12          |
    |       0           |   0 * 36 + 13     |           13          |
    |       0           |   0 * 36 + 14     |           14          |
    |       0           |   0 * 36 + 15     |           15          |
    |       0           |   0 * 36 + 16     |           16          |
    |       0           |   0 * 36 + 17     |           17          |
    |       0           |   0 * 36 + 18     |           18          |
    |       0           |   0 * 36 + 19     |           19          |

    |       0           |   0 * 36 + 34     |           34          |
    |       0           |   0 * 36 + 35     |           35          |    

    |       1           |   1 * 36 + 0      |           36          |
    |       1           |   1 * 36 + 1      |           37          |
    |       1           |   1 * 36 + 2      |           38          |
    |       1           |   1 * 36 + 3      |           39          |
    |       1           |   1 * 36 + 4      |           40          |
    |       1           |   1 * 36 + 5      |           41          |
    |       1           |   1 * 36 + 6      |           42          |
    |       1           |   1 * 36 + 7      |           43          |
    |       1           |   1 * 36 + 8      |           44          |
    |       1           |   1 * 36 + 9      |           45          |
    |       1           |   1 * 36 + 10     |           46          |
    |       1           |   1 * 36 + 11     |           47          |
    |       1           |   1 * 36 + 12     |           48          |
    |       1           |   1 * 36 + 13     |           49          |
    |       1           |   1 * 36 + 14     |           50          |
    |       1           |   1 * 36 + 15     |           51          |
    |       1           |   1 * 36 + 16     |           52          |
    |       1           |   1 * 36 + 17     |           53          |
    |       1           |   1 * 36 + 18     |           54          |
    |       1           |   1 * 36 + 19     |           55          |

    |       0           |   1 * 36 + 34     |           70          |
    |       0           |   1 * 36 + 35     |           71          | 

     ---------------
   
    |       23          |   23 * 36 + 0      |           828         |
    |       23          |   23 * 36 + 1      |           829         |
    |       23          |   23 * 36 + 2      |           830         |
    |       23          |   23 * 36 + 3      |           831         |
    |       23          |   23 * 36 + 4      |           832         |
    |       23          |   23 * 36 + 5      |           833         |
    |       23          |   23 * 36 + 6      |           834         |
    |       23          |   23 * 36 + 7      |           835         |
    |       23          |   23 * 36 + 8      |           836         |
    |       23          |   23 * 36 + 9      |           837         |
    |       23          |   23 * 36 + 10     |           838         |
    |       23          |   23 * 36 + 11     |           839         |
    |       23          |   23 * 36 + 12     |           840         |
    |       23          |   23 * 36 + 13     |           841         |
    |       23          |   23 * 36 + 14     |           842         |
    |       23          |   23 * 36 + 15     |           843         |
    |       23          |   23 * 36 + 16     |           844         |
    |       23          |   23 * 36 + 17     |           845         |
    |       23          |   23 * 36 + 18     |           846         |
    |       23          |   23 * 36 + 19     |           847         |

    |       23          |   23 * 36 + 34     |           862         |
    |       23          |   23 * 36 + 35     |           863         |