- Deep Learning and Machine Learning with CUDA
- Understanding Data Flow in Deep Learning: CPU, GPU, RAM, VRAM, Cache, and Disk Storage
- The GPU Hierarchical Structure
- CPU and GPU Comparison
- Linear Regression Algorithm
- Matrix Addition
- Matrix Multiplication: Naive, Optimized, and CUDA Approaches
- Neural Network: Multi-layer Network
- Vector Addition
- CUDA Kernel for parallel reduction
- Cumulative Sum
- Advanced CUDA Features and Optimization Techniques
The GPU Hierarchical Structure
A GPU consists of several key components:
• Cores: The fundamental processing units capable of executing individual instructions.
• Streaming Multiprocessors (SMs): Groups of cores that work together to execute tasks in parallel.
• Memory Hierarchy: Different levels of memory, such as global memory, shared memory, and registers, optimize data access during computation.
Here is a simplified overview of the GPU processing pipeline:
- Data Input: Data is passed to the GPU, typically from the system’s main memory (RAM) to the GPU’s global memory.
- Kernel Execution: The GPU processes data through kernels, which are small programs that run in parallel across thousands of cores.
- Thread Management: Threads are assigned to the cores, where each thread processes a small portion of the overall data.
- Memory Access: Data is read from and written to the memory hierarchy, including shared memory and registers, for optimal performance.
- Data Output: After processing, the data is returned to the system memory or used in further GPU computations.
The GPU processing pipeline differs from the CPU pipeline mainly in how it handles parallelism.
While a CPU focuses on executing a few instructions quickly with a deep pipeline and high clock
speeds, the GPU prioritizes executing many instructions simultaneously with many cores working in
parallel.
Streaming Multiprocessors (SMs)
At the heart of modern GPUs are Streaming Multiprocessors (SMs). Each SM contains a group of
cores that can execute threads in parallel, allowing the GPU to handle massive parallelism efficiently.
An SM consists of:
- Cores: Individual processing units that execute threads.
- Warp Scheduler: Determines how threads are grouped into warps, which are sets of 32 threads that execute the same instruction simultaneously.
- Registers: Fast, low-latency memory used by individual threads for storing variables.
- Shared Memory: A small, user-managed cache that allows threads within a block to share data and communicate efficiently.
Each SM is designed to execute multiple threads simultaneously, maximizing parallelism and ensuring
that the GPU can handle tasks like matrix multiplications, convolution operations, and other
compute-heavy tasks in parallel.
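As a quick way to see these numbers for your own GPU, Numba exposes the device attributes; a small sketch (the attribute names follow Numba’s device-attribute naming, and the printed values depend on your hardware):
from numba import cuda

device = cuda.get_current_device()
print("Name:", device.name)
print("Compute capability:", device.compute_capability)
print("Streaming Multiprocessors:", device.MULTIPROCESSOR_COUNT)
print("Warp size:", device.WARP_SIZE)
print("Max threads per block:", device.MAX_THREADS_PER_BLOCK)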
Understanding Grid and Blocks in CUDA
In CUDA (Compute Unified Device Architecture), computation is organized into a hierarchy of grids
and blocks. This hierarchical structure allows the GPU to break down complex problems into smaller,
manageable pieces, enabling large-scale parallelism.
Defining the Grid
The grid is the highest-level abstraction in CUDA’s execution model. It represents the overall problem
space that is being solved on the GPU. A grid is composed of multiple blocks, which are further
subdivided into threads.
For example, if you are performing an image processing task on a 1024x1024 pixel image, the entire
image could be represented as a grid, where each pixel is processed by a different thread.
Blocks: The Subdivision of Grids
Each grid in CUDA is subdivided into blocks. A block is a collection of threads that can execute independently
on the GPU. Each block is assigned to an SM, which executes the threads of that block in
parallel.
A key feature of blocks is that the threads within a block can communicate with one another through shared memory, which is much faster to access than global memory. This lets the threads of a block collaborate on their subset of the overall computation; separate blocks run independently and do not share this memory.
import numpy as np
from numba import cuda

@cuda.jit  # Compiles this function to run on the GPU
def matrix_addition(a, b, result):
    # Get the index of the current thread in the grid
    i, j = cuda.grid(2)  # Each thread gets unique (i, j) coordinates
    # Perform the addition only if the index is within bounds
    if i < result.shape[0] and j < result.shape[1]:
        result[i, j] = a[i, j] + b[i, j]

# Initialize data
N = 1024
a = np.random.rand(N, N)       # input a with shape (1024, 1024)
b = np.random.rand(N, N)       # input b with shape (1024, 1024)
result = np.zeros((N, N))      # result initialized to zeros, shape (1024, 1024)

# Define the grid and block dimensions: how many blocks are needed to cover the entire matrix
threads_per_block = (16, 16)   # a block of 16x16 = 256 threads
blocks_per_grid_x = (a.shape[0] + threads_per_block[0] - 1) // threads_per_block[0]
blocks_per_grid_y = (a.shape[1] + threads_per_block[1] - 1) // threads_per_block[1]
blocks_per_grid = (blocks_per_grid_x, blocks_per_grid_y)  # grid of thread blocks

print(f"Grid: {blocks_per_grid} blocks × {threads_per_block} threads")
print(f"Total threads: {blocks_per_grid_x * blocks_per_grid_y * 16 * 16}")

# Launch the kernel
matrix_addition[blocks_per_grid, threads_per_block](a, b, result)
print(result)
Output (result has shape (1024, 1024)):
Grid: (64, 64) blocks × (16, 16) threads
Total threads: 1048576
[[1.36980796 1.14064583 1.85368593 ... 1.49270892 0.56310749 0.7460153 ]
 [1.71873084 0.59909694 0.75783558 ... 1.73687974 1.50361012 1.73527891]
 [0.57410286 0.96230128 1.13313017 ... 0.31919318 1.1129677  1.48177356]
 ...
 [0.36844851 0.58703067 1.67605846 ... 0.9899919  0.47099114 0.98850427]
 [1.14627379 1.26879155 1.74540994 ... 0.90718099 1.14108572 1.09423898]
 [0.48536405 1.39498137 1.12177176 ... 1.17229426 0.53833065 1.07447603]]
/home/san/miniconda/envs/mac-linux/lib/python3.13/site-packages/numba/cuda/cudadrv/devicearray.py:887: NumbaPerformanceWarning: Host array used in CUDA kernel will incur copy overhead to/from device.
  warn(NumbaPerformanceWarning(msg))
- The formula (n + block_size - 1) // block_size is a trick for ceiling division using integer arithmetic.
- Floor division: 1024 // 16 = 64 blocks
- 64 blocks × 16 threads = 1024 threads → a perfect fit
- So for 1024, floor and ceiling division give the same result; the trick only matters when n is not a multiple of the block size (see the sketch below)
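The difference shows up when the size is not an exact multiple of the block size; a tiny sketch (1030 is an illustrative size, not taken from the code above):
n = 1030          # illustrative size that is NOT a multiple of 16
block_size = 16
floor_blocks = n // block_size                    # 64 blocks -> only 1024 threads, 6 elements missed
ceil_blocks = (n + block_size - 1) // block_size  # 65 blocks -> 1040 threads, every element covered
print(floor_blocks, ceil_blocks)                  # 64 65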
1. The @cuda.jit Decorator
- What it does:
- Compiles the Python function to GPU machine code (PTX assembly)
- Creates a GPU kernel that can be launched with thousands of threads
- Handles memory transfers automatically between CPU and GPU
- Without @cuda.jit: this is a normal Python function that runs sequentially on the CPU.
- With @cuda.jit: this becomes a GPU kernel that runs in parallel on thousands of GPU cores.
2. cuda.grid(2) - Thread Coordinate System
- What's happening:
- GPU launches 1,048,576 threads (for 1024×1024 matrix)
- Each thread gets unique (i, j) coordinates based on its position in the grid
- Thread (0,0) processes element [0,0]
- Thread (0,1) processes element [0,1]
- Thread (15,15) processes element [15,15]
- Thread (16,0) processes element [16,0] (next block)
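Under the hood, cuda.grid(2) combines the block index and the thread index in each dimension. A minimal sketch of the equivalent computation (the kernel name grid_by_hand is made up for illustration):
from numba import cuda

@cuda.jit
def grid_by_hand(out):
    # Equivalent to: i, j = cuda.grid(2)
    i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
    j = cuda.blockIdx.y * cuda.blockDim.y + cuda.threadIdx.y
    if i < out.shape[0] and j < out.shape[1]:
        out[i, j] = i + j  # each thread touches exactly one element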
3. Bounds Checking - Critical Safety
- Why this is necessary: We launch more threads than needed to handle uneven divisions.
- Example: For a 1030×1030 matrix with 16×16 blocks:
- We need 65×65 = 4,225 blocks, i.e. 4,225 × 256 = 1,081,600 threads
- But only 1030×1030 = 1,060,900 elements exist
- The extra 20,700 threads hit this condition and do nothing
4. The Parallel Magic: result[i, j] = a[i, j] + b[i, j]
- All roughly 1 million additions are expressed as independent threads and run in parallel (the hardware schedules them as warps across the SMs):
- Thread(0,0): result[0][0] = a[0][0] + b[0][0]
- Thread(0,1): result[0][1] = a[0][1] + b[0][1]
- Thread(0,2): result[0][2] = a[0][2] + b[0][2]
- ...
- Thread(1023,1023): result[1023][1023] = a[1023][1023] + b[1023][1023]
- ↑ Conceptually, they all happen at the same time; physically, the GPU streams them through its SMs warp by warp.
5. Kernel Launch: [blocks_per_grid, threads_per_block]
- This launch syntax is special CUDA syntax (not normal Python):
- Normal Python function call:
function_name(arguments)
- CUDA kernel launch:
kernel_name[GRID_CONFIG](arguments)
- What happens during launch:
- Copy input arrays (a, b) from CPU → GPU
- Allocate the result array on the GPU
- Launch 64×64×16×16 = 1,048,576 threads
- All threads execute the SAME code on DIFFERENT data
- Copy the result array from GPU → CPU
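The NumbaPerformanceWarning in the output above is Numba pointing out that host (NumPy) arrays are copied implicitly at every launch. A hedged sketch of making those transfers explicit with cuda.to_device / copy_to_host, reusing the matrix_addition kernel defined earlier:
import numpy as np
from numba import cuda

N = 1024
a = np.random.rand(N, N)
b = np.random.rand(N, N)

# Copy the inputs to GPU global memory once, and allocate the result there
d_a = cuda.to_device(a)
d_b = cuda.to_device(b)
d_result = cuda.device_array((N, N))

threads_per_block = (16, 16)
blocks_per_grid = ((N + 15) // 16, (N + 15) // 16)

matrix_addition[blocks_per_grid, threads_per_block](d_a, d_b, d_result)
result = d_result.copy_to_host()  # single explicit copy back to CPU memory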
Threads and Warps
What is a Thread?
A thread is the smallest unit of computation in CUDA, the parallel computing platform from NVIDIA.
Each thread runs a specific portion of code, typically working on one element of data. While a single
thread can execute a small task, CUDA is designed to handle thousands of threads simultaneously,
allowing it to solve complex problems by breaking them down into smaller tasks.
For example, let’s say we want to add two arrays together, element by element. Each thread will
handle the addition of one element from each array.
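A minimal sketch of that example in Numba CUDA (the kernel name vector_add and the sizes are illustrative):
import numpy as np
from numba import cuda

@cuda.jit
def vector_add(x, y, out):
    idx = cuda.grid(1)          # this thread's global index
    if idx < out.shape[0]:      # guard the final, partially filled block
        out[idx] = x[idx] + y[idx]

n = 100_000
x = np.random.rand(n)
y = np.random.rand(n)
out = np.zeros(n)

threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block
vector_add[blocks, threads_per_block](x, y, out)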
Warps: Groups of Threads
In CUDA, threads are grouped in units called warps. A warp consists of 32 threads, which are executed
simultaneously by the GPU’s scheduler. Warps are crucial because the GPU processes threads in
groups, and efficient warp-level execution leads to optimal performance.
When a warp is scheduled, all 32 threads within it are executed in lockstep, meaning they perform
the same instruction at the same time. However, each thread can operate on different data. If all
threads in a warp execute the same code path without branching, the GPU can achieve maximum
efficiency.
For instance, in the example above, if all threads are simply adding elements of two arrays, the GPU
can schedule the warp to execute the additions simultaneously for all threads.
If threads within a warp start taking different branches (i.e., different execution paths), it can lead
to what is called thread divergence, which we will discuss next.
Managing Thread Divergence
Thread divergence occurs when threads within a warp follow different execution paths. For example, if some threads in a warp take one branch of an if statement while others take a different branch, the warp must execute both branches sequentially, reducing the GPU’s efficiency.
If some threads within the warp evaluate flags[index] == 1 as true while others evaluate it as false, the warp will need to execute both paths, which leads to thread divergence.
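For illustration, a minimal sketch of the kind of kernel being described; the flags array and the work done in each branch are assumptions made up for the example:
from numba import cuda

@cuda.jit
def divergent_kernel(flags, data, out):
    index = cuda.grid(1)
    if index < out.shape[0]:
        # Threads of one warp may disagree on this condition, so the warp
        # runs both branches one after the other (thread divergence)
        if flags[index] == 1:
            out[index] = data[index] * 2.0
        else:
            out[index] = data[index] + 1.0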
Strategies to minimize divergence:
- Avoid conditional branching whenever possible, especially within a warp.
- Try to structure your code so that threads in a warp follow the same execution path.
- When branching is unavoidable, try to group threads with similar execution paths into the same warp.
By minimizing thread divergence, you can ensure more efficient execution of warps and better
overall performance on the GPU.
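One way to apply the first strategy is to replace the branch with arithmetic, so that every thread in the warp executes the same instructions; a hedged sketch, reusing the hypothetical flags example from above:
from numba import cuda

@cuda.jit
def branch_free_kernel(flags, data, out):
    index = cuda.grid(1)
    if index < out.shape[0]:
        f = flags[index]  # assumed to be 0 or 1
        # Both results are computed and blended, so no divergent branch is taken
        out[index] = f * (data[index] * 2.0) + (1 - f) * (data[index] + 1.0)
Whether this is actually faster depends on how expensive each branch is; computing both results only pays off when they are cheap.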
Memory Hierarchy in GPUs
Global memory
Global memory is the main memory space on a GPU and has a large capacity. All threads can access global memory, but it has relatively high latency, meaning it takes longer for data to be retrieved from it compared to other types of memory. To make your code more efficient, try to minimize the number of accesses to global memory, as each access incurs a high latency cost.
Shared Memory
Shared memory is a fast, low-latency memory space that is shared among all threads within the same block. Unlike global memory, shared memory is much faster to access, and can be used for temporary storage during computations. Using shared memory allows threads within the same block to collaborate and exchange data efficiently.
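A hedged sketch of shared memory in Numba: each block stages its slice of the input in a cuda.shared.array, synchronizes, and then reduces it cooperatively (the kernel name block_sum and the block size TPB are illustrative):
from numba import cuda, float64

TPB = 256  # threads per block; must match the launch configuration

@cuda.jit
def block_sum(x, partial):
    # One buffer per block, visible to every thread of that block
    tile = cuda.shared.array(TPB, dtype=float64)
    tid = cuda.threadIdx.x
    idx = cuda.grid(1)

    if idx < x.shape[0]:
        tile[tid] = x[idx]
    else:
        tile[tid] = 0.0
    cuda.syncthreads()            # wait until the whole tile is loaded

    # Tree reduction entirely in fast shared memory
    stride = TPB // 2
    while stride > 0:
        if tid < stride:
            tile[tid] += tile[tid + stride]
        cuda.syncthreads()
        stride //= 2

    if tid == 0:
        partial[cuda.blockIdx.x] = tile[0]   # one partial sum per block
It would be launched as block_sum[num_blocks, TPB](x, partial), with partial holding one slot per block; the per-block results can then be summed on the CPU.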
Registers and Local Memory
Each thread has access to a small number of registers, which are the fastest type of memory on the GPU. Registers store variables used by a single thread, providing very fast access. However, the number of registers is limited, and when a thread needs more memory than its registers can hold, the compiler spills the excess into local memory. Despite the name, local memory is "local" only in scope: it is private to each thread but physically resides in device (global) memory, so it is much slower than registers and its accesses have latency comparable to global memory (although they are typically cached). It is used for temporary variables, such as per-thread arrays, that are too large to fit in registers.
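For completeness, a thread can also request a small per-thread scratch array with cuda.local.array; a hedged sketch (the size 8 and the computation are arbitrary, and whether the buffer stays in registers or spills to local memory is up to the compiler):
from numba import cuda, float64

@cuda.jit
def local_scratch(x, out):
    # Per-thread scratch space; small arrays may stay in registers,
    # larger ones spill to local memory (backed by device memory)
    scratch = cuda.local.array(8, dtype=float64)
    i = cuda.grid(1)
    if i < out.shape[0]:
        total = 0.0
        for k in range(8):
            scratch[k] = x[i] * (k + 1)
        for k in range(8):
            total += scratch[k]
        out[i] = total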