- Deep Learning and Machine Learning with CUDA
- Understanding Data Flow in Deep Learning: CPU, GPU, RAM, VRAM, Cache, and Disk Storage
- The GPU Hierarchical Structure
- CPU and GPU Comparison
- Linear Regression Algorithm
- Matrix Addition
- Matrix Multiplication: Naive, Optimized, and CUDA Approaches
- Neural Network: Multi-layer Network
- Vector Addition
- CUDA Kernel for parallel reduction
- Cumulative Sum
- Advanced CUDA Features and Optimization Techniques
The GPU Hierarchical Structure
A GPU consists of several key components:
• Cores: The fundamental processing units capable of executing individual instructions.
• Streaming Multiprocessors (SMs): Groups of cores that work together to execute tasks in parallel.
• Memory Hierarchy: Different levels of memory, such as global memory, shared memory, and registers, optimize data access during computation.
Here is a simplified overview of the GPU processing pipeline:
- Data Input: Data is passed to the GPU, typically from the system’s main memory (RAM) to the GPU’s global memory.
- Kernel Execution: The GPU processes data through kernels, which are small programs that run in parallel across thousands of cores.
- Thread Management: Threads are assigned to the cores, where each thread processes a small portion of the overall data.
- Memory Access: Data is read from and written to the memory hierarchy, including shared memory and registers, for optimal performance.
- Data Output: After processing, the data is returned to the system memory or used in further GPU computations.
The GPU processing pipeline differs from the CPU pipeline mainly in how it handles parallelism.
While a CPU focuses on executing a few instructions quickly with a deep pipeline and high clock
speeds, the GPU prioritizes executing many instructions simultaneously with many cores working in
parallel.
Streaming Multiprocessors (SMs)
At the heart of modern GPUs are Streaming Multiprocessors (SMs). Each SM contains a group of
cores that can execute threads in parallel, allowing the GPU to handle massive parallelism efficiently.
An SM consists of:
- Cores: Individual processing units that execute threads.
- Warp Scheduler: Determines how threads are grouped into warps, which are sets of 32 threads that execute the same instruction simultaneously.
- Registers: Fast, low-latency memory used by individual threads for storing variables.
- Shared Memory: A small, user-managed cache that allows threads within a block to share data and communicate efficiently.
Each SM is designed to execute multiple threads simultaneously, maximizing parallelism and ensuring
that the GPU can handle tasks like matrix multiplications, convolution operations, and other
compute-heavy tasks in parallel.
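As a quick way to see these numbers for your own GPU, Numba exposes the device attributes; a small sketch (the attribute names follow Numba’s device-attribute naming, and the printed values depend on your hardware):
from numba import cuda

device = cuda.get_current_device()
print("Name:", device.name)
print("Compute capability:", device.compute_capability)
print("Streaming Multiprocessors:", device.MULTIPROCESSOR_COUNT)
print("Warp size:", device.WARP_SIZE)
print("Max threads per block:", device.MAX_THREADS_PER_BLOCK)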
Understanding Grid and Blocks in CUDA
In CUDA (Compute Unified Device Architecture), computation is organized into a hierarchy of grids
and blocks. This hierarchical structure allows the GPU to break down complex problems into smaller,
manageable pieces, enabling large-scale parallelism.
Defining the Grid
The grid is the highest-level abstraction in CUDA’s execution model. It represents the overall problem
space that is being solved on the GPU. A grid is composed of multiple blocks, which are further
subdivided into threads.
For example, if you are performing an image processing task on a 1024x1024 pixel image, the entire
image could be represented as a grid, where each pixel is processed by a different thread.
Blocks: The Subdivision of Grids
Each grid in CUDA is subdivided into blocks. A block is a collection of threads that can execute independently
on the GPU. Each block is assigned to an SM, which executes the threads of that block in
parallel.
A key feature of blocks is that the threads within a block can communicate with one another through shared memory, which is much faster to access than global memory. This lets the threads of a block collaborate on their subset of the overall computation; separate blocks run independently and do not share this memory.
import numpy as np
from numba import cuda

@cuda.jit  # Compiles this function to run on the GPU
def matrix_addition(a, b, result):
    # Get the index of the current thread in the grid
    i, j = cuda.grid(2)  # Each thread gets unique (i, j) coordinates
    # Perform the addition only if the index is within bounds
    if i < result.shape[0] and j < result.shape[1]:
        result[i, j] = a[i, j] + b[i, j]

# Initialize data
N = 1024
a = np.random.rand(N, N)       # input a with shape (1024, 1024)
b = np.random.rand(N, N)       # input b with shape (1024, 1024)
result = np.zeros((N, N))      # result initialized to zeros, shape (1024, 1024)

# Define the grid and block dimensions: how many blocks are needed to cover the entire matrix
threads_per_block = (16, 16)   # a block of 16x16 = 256 threads
blocks_per_grid_x = (a.shape[0] + threads_per_block[0] - 1) // threads_per_block[0]
blocks_per_grid_y = (a.shape[1] + threads_per_block[1] - 1) // threads_per_block[1]
blocks_per_grid = (blocks_per_grid_x, blocks_per_grid_y)  # grid of thread blocks

print(f"Grid: {blocks_per_grid} blocks × {threads_per_block} threads")
print(f"Total threads: {blocks_per_grid_x * blocks_per_grid_y * 16 * 16}")

# Launch the kernel
matrix_addition[blocks_per_grid, threads_per_block](a, b, result)
print(result)
Output (result has shape (1024, 1024)):
Grid: (64, 64) blocks × (16, 16) threads
Total threads: 1048576
[[1.36980796 1.14064583 1.85368593 ... 1.49270892 0.56310749 0.7460153 ]
 [1.71873084 0.59909694 0.75783558 ... 1.73687974 1.50361012 1.73527891]
 [0.57410286 0.96230128 1.13313017 ... 0.31919318 1.1129677  1.48177356]
 ...
 [0.36844851 0.58703067 1.67605846 ... 0.9899919  0.47099114 0.98850427]
 [1.14627379 1.26879155 1.74540994 ... 0.90718099 1.14108572 1.09423898]
 [0.48536405 1.39498137 1.12177176 ... 1.17229426 0.53833065 1.07447603]]
/home/san/miniconda/envs/mac-linux/lib/python3.13/site-packages/numba/cuda/cudadrv/devicearray.py:887: NumbaPerformanceWarning: Host array used in CUDA kernel will incur copy overhead to/from device.
  warn(NumbaPerformanceWarning(msg))
- The formula (n + block_size - 1) // block_size is a trick for ceiling division using integer arithmetic.
- Floor division: 1024 // 16 = 64 blocks
- 64 blocks × 16 threads = 1024 threads → a perfect fit
- So for 1024, floor and ceiling division give the same result; the trick only matters when n is not a multiple of the block size (see the sketch below)
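The difference shows up when the size is not an exact multiple of the block size; a tiny sketch (1030 is an illustrative size, not taken from the code above):
n = 1030          # illustrative size that is NOT a multiple of 16
block_size = 16
floor_blocks = n // block_size                    # 64 blocks -> only 1024 threads, 6 elements missed
ceil_blocks = (n + block_size - 1) // block_size  # 65 blocks -> 1040 threads, every element covered
print(floor_blocks, ceil_blocks)                  # 64 65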
1. The @cuda.jit Decorator
- What it does:
- Compiles the Python function to GPU machine code (PTX assembly)
- Creates a GPU kernel that can be launched with thousands of threads
- Handles memory transfers automatically between CPU and GPU
- Without @cuda.jit: this is a normal Python function that runs sequentially on the CPU.
- With @cuda.jit: this becomes a GPU kernel that runs in parallel on thousands of GPU cores.
2. cuda.grid(2) - Thread Coordinate System
- What's happening:
- GPU launches 1,048,576 threads (for 1024×1024 matrix)
- Each thread gets unique (i, j) coordinates based on its position in the grid
- Thread (0,0) processes element [0,0]
- Thread (0,1) processes element [0,1]
- Thread (15,15) processes element [15,15]
- Thread (16,0) processes element [16,0] (next block)
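Under the hood, cuda.grid(2) combines the block index and the thread index in each dimension. A minimal sketch of the equivalent computation (the kernel name grid_by_hand is made up for illustration):
from numba import cuda

@cuda.jit
def grid_by_hand(out):
    # Equivalent to: i, j = cuda.grid(2)
    i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
    j = cuda.blockIdx.y * cuda.blockDim.y + cuda.threadIdx.y
    if i < out.shape[0] and j < out.shape[1]:
        out[i, j] = i + j  # each thread touches exactly one element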
3. Bounds Checking - Critical Safety
- Why this is necessary: We launch more threads than needed to handle uneven divisions.
- Example: For a 1030×1030 matrix with 16×16 blocks:
- We need 65×65 = 4,225 blocks, i.e. 4,225 × 256 = 1,081,600 threads
- But only 1030×1030 = 1,060,900 elements exist
- The extra 20,700 threads hit this condition and do nothing
4. The Parallel Magic: result[i, j] = a[i, j] + b[i, j]
- All roughly 1 million additions are expressed as independent threads and run in parallel (the hardware schedules them as warps across the SMs):
- Thread(0,0): result[0][0] = a[0][0] + b[0][0]
- Thread(0,1): result[0][1] = a[0][1] + b[0][1]
- Thread(0,2): result[0][2] = a[0][2] + b[0][2]
- ...
- Thread(1023,1023): result[1023][1023] = a[1023][1023] + b[1023][1023]
- ↑ Conceptually, they all happen at the same time; physically, the GPU streams them through its SMs warp by warp.
5. Kernel Launch: [blocks_per_grid, threads_per_block]
- This launch syntax is special CUDA syntax (not normal Python):
- Normal Python function call:
function_name(arguments)
- CUDA kernel launch:
kernel_name[GRID_CONFIG](arguments)
- What happens during launch:
- Copy input arrays (a, b) from CPU → GPU
- Allocate the result array on the GPU
- Launch 64×64×16×16 = 1,048,576 threads
- All threads execute the SAME code on DIFFERENT data
- Copy the result array from GPU → CPU
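The NumbaPerformanceWarning in the output above is Numba pointing out that host (NumPy) arrays are copied implicitly at every launch. A hedged sketch of making those transfers explicit with cuda.to_device / copy_to_host, reusing the matrix_addition kernel defined earlier:
import numpy as np
from numba import cuda

N = 1024
a = np.random.rand(N, N)
b = np.random.rand(N, N)

# Copy the inputs to GPU global memory once, and allocate the result there
d_a = cuda.to_device(a)
d_b = cuda.to_device(b)
d_result = cuda.device_array((N, N))

threads_per_block = (16, 16)
blocks_per_grid = ((N + 15) // 16, (N + 15) // 16)

matrix_addition[blocks_per_grid, threads_per_block](d_a, d_b, d_result)
result = d_result.copy_to_host()  # single explicit copy back to CPU memory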
Threads and Warps
What is a Thread?
A thread is the smallest unit of computation in CUDA, the parallel computing platform from NVIDIA.
Each thread runs a specific portion of code, typically working on one element of data. While a single
thread can execute a small task, CUDA is designed to handle thousands of threads simultaneously,
allowing it to solve complex problems by breaking them down into smaller tasks.
For example, let’s say we want to add two arrays together, element by element. Each thread will
handle the addition of one element from each array.
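A minimal sketch of that example in Numba CUDA (the kernel name vector_add and the sizes are illustrative):
import numpy as np
from numba import cuda

@cuda.jit
def vector_add(x, y, out):
    idx = cuda.grid(1)          # this thread's global index
    if idx < out.shape[0]:      # guard the final, partially filled block
        out[idx] = x[idx] + y[idx]

n = 100_000
x = np.random.rand(n)
y = np.random.rand(n)
out = np.zeros(n)

threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block
vector_add[blocks, threads_per_block](x, y, out)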
Warps: Groups of Threads
In CUDA, threads are grouped in units called warps. A warp consists of 32 threads, which are executed
simultaneously by the GPU’s scheduler. Warps are crucial because the GPU processes threads in
groups, and efficient warp-level execution leads to optimal performance.
When a warp is scheduled, all 32 threads within it are executed in lockstep, meaning they perform
the same instruction at the same time. However, each thread can operate on different data. If all
threads in a warp execute the same code path without branching, the GPU can achieve maximum
efficiency.
For instance, in the example above, if all threads are simply adding elements of two arrays, the GPU
can schedule the warp to execute the additions simultaneously for all threads.
If threads within a warp start taking different branches (i.e., different execution paths), it can lead
to what is called thread divergence, which we will discuss next.
Managing Thread Divergence
Thread divergence occurs when threads within a warp follow different execution paths. For example, if some threads in a warp take one branch of an if statement while others take a different branch, the warp must execute both branches sequentially, reducing the GPU’s efficiency.
If some threads within the warp evaluate flags[index] == 1 as true while others evaluate it as false, the warp will need to execute both paths, which leads to thread divergence.
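For illustration, a minimal sketch of the kind of kernel being described; the flags array and the work done in each branch are assumptions made up for the example:
from numba import cuda

@cuda.jit
def divergent_kernel(flags, data, out):
    index = cuda.grid(1)
    if index < out.shape[0]:
        # Threads of one warp may disagree on this condition, so the warp
        # runs both branches one after the other (thread divergence)
        if flags[index] == 1:
            out[index] = data[index] * 2.0
        else:
            out[index] = data[index] + 1.0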
Strategies to minimize divergence:
- Avoid conditional branching whenever possible, especially within a warp.
- Try to structure your code so that threads in a warp follow the same execution path.
- When branching is unavoidable, try to group threads with similar execution paths into the same warp.
By minimizing thread divergence, you can ensure more efficient execution of warps and better
overall performance on the GPU.
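One way to apply the first strategy is to replace the branch with arithmetic, so that every thread in the warp executes the same instructions; a hedged sketch, reusing the hypothetical flags example from above:
from numba import cuda

@cuda.jit
def branch_free_kernel(flags, data, out):
    index = cuda.grid(1)
    if index < out.shape[0]:
        f = flags[index]  # assumed to be 0 or 1
        # Both results are computed and blended, so no divergent branch is taken
        out[index] = f * (data[index] * 2.0) + (1 - f) * (data[index] + 1.0)
Whether this is actually faster depends on how expensive each branch is; computing both results only pays off when they are cheap.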
Memory Hierarchy in GPUs
Global memory
Global memory is the main memory space on a GPU and has a large capacity. All threads can access global memory, but it has relatively high latency, meaning it takes longer for data to be retrieved from it compared to other types of memory. To make your code more efficient, try to minimize the number of accesses to global memory, as each access incurs a high latency cost.
Shared Memory
Shared memory is a fast, low-latency memory space that is shared among all threads within the same block. Unlike global memory, shared memory is much faster to access, and can be used for temporary storage during computations. Using shared memory allows threads within the same block to collaborate and exchange data efficiently.
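A hedged sketch of shared memory in Numba: each block stages its slice of the input in a cuda.shared.array, synchronizes, and then reduces it cooperatively (the kernel name block_sum and the block size TPB are illustrative):
from numba import cuda, float64

TPB = 256  # threads per block; must match the launch configuration

@cuda.jit
def block_sum(x, partial):
    # One buffer per block, visible to every thread of that block
    tile = cuda.shared.array(TPB, dtype=float64)
    tid = cuda.threadIdx.x
    idx = cuda.grid(1)

    if idx < x.shape[0]:
        tile[tid] = x[idx]
    else:
        tile[tid] = 0.0
    cuda.syncthreads()            # wait until the whole tile is loaded

    # Tree reduction entirely in fast shared memory
    stride = TPB // 2
    while stride > 0:
        if tid < stride:
            tile[tid] += tile[tid + stride]
        cuda.syncthreads()
        stride //= 2

    if tid == 0:
        partial[cuda.blockIdx.x] = tile[0]   # one partial sum per block
It would be launched as block_sum[num_blocks, TPB](x, partial), with partial holding one slot per block; the per-block results can then be summed on the CPU.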
Registers and Local Memory
Each thread has access to a small number of registers, which are the fastest type of memory on the GPU. Registers store variables used by a single thread, providing very fast access. However, the number of registers is limited, and when a thread needs more memory than its registers can hold, the compiler spills the excess into local memory. Despite the name, local memory is "local" only in scope: it is private to each thread but physically resides in device (global) memory, so it is much slower than registers and its accesses have latency comparable to global memory (although they are typically cached). It is used for temporary variables, such as per-thread arrays, that are too large to fit in registers.
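For completeness, a thread can also request a small per-thread scratch array with cuda.local.array; a hedged sketch (the size 8 and the computation are arbitrary, and whether the buffer stays in registers or spills to local memory is up to the compiler):
from numba import cuda, float64

@cuda.jit
def local_scratch(x, out):
    # Per-thread scratch space; small arrays may stay in registers,
    # larger ones spill to local memory (backed by device memory)
    scratch = cuda.local.array(8, dtype=float64)
    i = cuda.grid(1)
    if i < out.shape[0]:
        total = 0.0
        for k in range(8):
            scratch[k] = x[i] * (k + 1)
        for k in range(8):
            total += scratch[k]
        out[i] = total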