- Deep Learning and Machine Learning with CUDA
- Understanding Data Flow in Deep Learning: CPU, GPU, RAM, VRAM, Cache, and Disk Storage
- The GPU Hierarchical Structure
- CPU and GPU Comparison
- Linear Regression Algorithm
- Matrix Addition
- Matrix Multiplication: Naive, Optimized, and CUDA Approaches
- Neural Network: Multi-layer Network
- Vector Addition
- CUDA Kernel for parallel reduction
- Cumulative Sum
- Advanced CUDA Features and Optimization Techniques
Matrix Addition
Matrix addition is an element-wise operation, which makes it highly parallelizable. Each element in
the resulting matrix can be computed independently. Using Python, we can parallelize this operation
using the concurrent.futures module, multiprocessing, or even GPU acceleration using CUDA.
Example Code for Sequential Matrix Addition: Before diving into parallelism, let's look at how matrix addition is implemented sequentially:
import numpy as np
# Define two matrices
A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
B = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
C = A + B
print(C)
[[ 2 4 6] [ 8 10 12] [14 16 18]]
Parallel Matrix Addition Using Threads
For parallelizing the matrix addition, we can divide the matrix into rows or blocks and assign each portion to a separate thread for computation. Here's how you can do it using the ThreadPoolExecutor from the concurrent.futures module:
import numpy as np
from concurrent.futures import ThreadPoolExecutor
# Function to add two matrices row by row
def add_rows(row_a, row_b):
    return row_a + row_b

# Matrix addition with threading
def parallel_matrix_addition(A, B):
    with ThreadPoolExecutor() as executor:
        result = list(executor.map(add_rows, A, B))
    return np.array(result)
A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
B = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
# Perform parallel matrix addition
C = parallel_matrix_addition(A, B)
print(C)
[[ 2 4 6] [ 8 10 12] [14 16 18]]
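The introduction above also mentions multiprocessing. As a minimal sketch (not part of the original example), the same row-wise decomposition can be run with ProcessPoolExecutor from concurrent.futures; for a matrix this small, process start-up and pickling overhead will dominate, so treat it as an illustration of the pattern rather than an optimization:
import numpy as np
from concurrent.futures import ProcessPoolExecutor

# Each worker process adds one pair of rows
def add_rows(row_a, row_b):
    return row_a + row_b

# Matrix addition with separate processes instead of threads
def multiprocess_matrix_addition(A, B):
    with ProcessPoolExecutor() as executor:
        result = list(executor.map(add_rows, A, B))
    return np.array(result)

if __name__ == "__main__":
    A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
    B = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
    print(multiprocess_matrix_addition(A, B))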
Matrix Addition using CUDA
We can further optimize matrix addition by leveraging CUDA for GPU
acceleration. CUDA enables the use of GPUs to perform parallel matrix operations on a massive scale,
making it much faster for large matrices. Below is an example of using CUDA for matrix addition:
import numpy as np
from numba import cuda
# Define the matrix size
N = 3

# CUDA kernel for matrix addition
@cuda.jit
def matrix_addition(A, B, C):
    i, j = cuda.grid(2)
    if i < C.shape[0] and j < C.shape[1]:
        C[i, j] = A[i, j] + B[i, j]

# Initialize matrices
A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
B = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
C = np.zeros((N, N), dtype=np.float32)
# Define grid and block sizes
threads_per_block = (16, 16)
blocks_per_grid_x = int(np.ceil(A.shape[0] / threads_per_block[0]))  # 1
blocks_per_grid_y = int(np.ceil(A.shape[1] / threads_per_block[1]))  # 1
blocks_per_grid = (blocks_per_grid_x, blocks_per_grid_y)  # (1, 1)

# Call the CUDA kernel
matrix_addition[blocks_per_grid, threads_per_block](A, B, C)
print(C)
[[ 2. 4. 6.] [ 8. 10. 12.] [14. 16. 18.]]
/home/san/miniconda/envs/mac-linux/lib/python3.13/site-packages/numba/cuda/dispatcher.py:536: NumbaPerformanceWarning: Grid size 1 will likely result in GPU under-utilization due to low occupancy. warn(NumbaPerformanceWarning(msg))
/home/san/miniconda/envs/mac-linux/lib/python3.13/site-packages/numba/cuda/cudadrv/devicearray.py:887: NumbaPerformanceWarning: Host array used in CUDA kernel will incur copy overhead to/from device. warn(NumbaPerformanceWarning(msg))
Thread Grid Visualization
We have:
- 1 block containing 16×16 threads (256 total)
- 3×3 matrix (9 elements)
Thread Block (16×16 threads) - Only top-left 3×3 are used:
Thread Coordinates (i,j) within the block:
(0,0) (0,1) (0,2) (0,3) ... (0,15)
(1,0) (1,1) (1,2) (1,3) ... (1,15)
(2,0) (2,1) (2,2) (2,3) ... (2,15)
(3,0) (3,1) (3,2) (3,3) ... (3,15)
... ... ... ... ... ...
(15,0)(15,1)(15,2)(15,3)... (15,15)
- Matrix Mapping: Each thread gets its global coordinates via cuda.grid(2). Since we have only one block, the mapping is direct (a manual-index version of the kernel is sketched after this list):
- i = threadIdx.x (0 to 15)
- j = threadIdx.y (0 to 15)
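For reference, here is a minimal sketch of the same kernel with the global indices computed by hand instead of via cuda.grid(2); with a single block, blockIdx is 0, so the global index reduces to the thread index within the block:
from numba import cuda

@cuda.jit
def matrix_addition_manual(A, B, C):
    # Global index = block offset + thread index within the block
    i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
    j = cuda.blockIdx.y * cuda.blockDim.y + cuda.threadIdx.y
    if i < C.shape[0] and j < C.shape[1]:
        C[i, j] = A[i, j] + B[i, j]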
Which Threads Actually Work?
- Working threads (9 threads):
Matrix C indices ←→ Thread coordinates
C[0,0] ← Thread (0,0) → 1+1=2
C[0,1] ← Thread (0,1) → 2+2=4
C[0,2] ← Thread (0,2) → 3+3=6
C[1,0] ← Thread (1,0) → 4+4=8
C[1,1] ← Thread (1,1) → 5+5=10
C[1,2] ← Thread (1,2) → 6+6=12
C[2,0] ← Thread (2,0) → 7+7=14
C[2,1] ← Thread (2,1) → 8+8=16
C[2,2] ← Thread (2,2) → 9+9=18
- Non-working threads (247 threads):
- Threads with i ≥ 3 OR j ≥ 3 skip the computation
- Examples: (0,3), (3,0), (15,15) - all hit the boundary check and exit
Visual Map of Active Threads
Active Threads/Matrix Elements:
🟩 🟩 🟩 ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫
🟩 🟩 🟩 ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫
🟩 🟩 🟩 ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫
▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫
... 192 more inactive threads (rows 4-15) ...
where 🟩 = Active thread (computes one matrix element), ▫ = Inactive thread (skips computation due to boundary check)
- Execution Pattern
- GPU launches 256 threads in parallel
- 9 threads find they have valid matrix indices (0-2, 0-2) and perform addition
- 247 threads immediately exit because their coordinates are outside the matrix bounds
- All threads complete simultaneously (GPU processes them in warps of 32)
If we had a 16×16 matrix instead, all 256 threads would be utilized: a perfect mapping of one thread per matrix element, with no wasted computation. A sketch of this case follows.
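As a minimal sketch of that case (reusing the matrix_addition kernel from above, with hypothetical input values), a 16×16 input fills the block exactly:
# Hypothetical 16x16 inputs: one thread per matrix element, no idle threads
N16 = 16
A16 = np.arange(N16 * N16, dtype=np.float32).reshape(N16, N16)
B16 = np.ones((N16, N16), dtype=np.float32)
C16 = np.zeros((N16, N16), dtype=np.float32)
blocks = (int(np.ceil(N16 / 16)), int(np.ceil(N16 / 16)))  # still (1, 1)
matrix_addition[blocks, (16, 16)](A16, B16, C16)  # all 256 threads pass the bounds check
print(C16[0, :4])  # [1. 2. 3. 4.]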
Performance Warnings
These are performance warnings, not errors! Your code will still run correctly, but Numba is giving you helpful suggestions to optimize GPU performance.
Warning 1: Low Occupancy
Grid size 1 will likely result in GPU under-utilization due to low occupancy.
- What it means: You're only using 1 block, but GPUs work best with multiple blocks to hide memory latency.
- Why it happens: For a 3×3 matrix, you get ceil(3/16) = 1 block in each dimension, so the total grid is 1×1 = 1 block (the sketch below shows how a larger matrix fills the grid).
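For comparison, a minimal sketch with a hypothetical larger matrix shows how the same formula produces a grid with many blocks, which is what the scheduler needs to keep the GPU busy:
import numpy as np

# Hypothetical 1024x1024 matrix with the same (16, 16) block shape
N_large = 1024
threads_per_block = (16, 16)
blocks_per_grid = (int(np.ceil(N_large / threads_per_block[0])),
                   int(np.ceil(N_large / threads_per_block[1])))
print(blocks_per_grid)  # (64, 64) -> 4096 blocks to spread across the SMs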
Warning 2: Host Array Copy Overhead
Host array used in CUDA kernel will incur copy overhead to/from device.
- What it means: Numba is automatically copying your CPU arrays to GPU memory, which takes time.
- Why it happens: You're using regular NumPy arrays instead of explicitly allocated GPU arrays.
- Solution: Pre-allocate arrays on the GPU:
from numba import cuda
import numpy as np
# Allocate memory directly on GPU
A_device = cuda.to_device(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=np.float32))
B_device = cuda.to_device(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=np.float32))
C_device = cuda.device_array((N, N), dtype=np.float32) # Allocate empty array on GPU
# Run kernel
matrix_addition[blocks_per_grid, threads_per_block](A_device, B_device, C_device)
# Copy result back to CPU
C = C_device.copy_to_host()
print(C)
[[ 2. 4. 6.] [ 8. 10. 12.] [14. 16. 18.]]
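To check whether the device-array version actually helps, a rough wall-clock timing sketch (reusing the kernel, launch configuration, and device arrays from above) looks like this; cuda.synchronize() makes sure the kernel has finished before the clock stops, and the first launch is excluded because it includes JIT compilation:
import time
from numba import cuda

# Warm-up launch: triggers JIT compilation of the kernel
matrix_addition[blocks_per_grid, threads_per_block](A_device, B_device, C_device)
cuda.synchronize()

start = time.perf_counter()
matrix_addition[blocks_per_grid, threads_per_block](A_device, B_device, C_device)
cuda.synchronize()  # wait for the kernel to finish
print(f"kernel time: {(time.perf_counter() - start) * 1e6:.1f} us")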
Original Configuration: (16,16) threads per block
GPU EXECUTION - 1 BLOCK with 16×16 THREADS (256 threads total)
BLOCK 0 (16×16 threads) - Only 9 threads work, 247 are idle:
Thread Grid in Block 0:
╔═══════════════════════════════════════════════════════════════════════════════╗
║ (0,0) (0,1) (0,2) (0,3) (0,4) ... (0,15) ║
║ (1,0) (1,1) (1,2) (1,3) (1,4) ... (1,15) ║
║ (2,0) (2,1) (2,2) (2,3) (2,4) ... (2,15) ║
║ (3,0) (3,1) (3,2) (3,3) (3,4) ... (3,15) ║
║ ... ... ... ... ... ... ... ║
║ (15,0)(15,1)(15,2)(15,3)(15,4)...(15,15) ║
╚═══════════════════════════════════════════════════════════════════════════════╝
Matrix Mapping (3×3):
╔═════════╗ Only these threads do work:
║ 🟢 🟢 🟢 ║ 🟢 = Thread computes matrix element
║ 🟢 🟢 🟢 ║ ⚪ = Thread skips (out of bounds)
║ 🟢 🟢 🟢 ║
╚═════════╝ Rest of block: 247 ⚪⚪⚪ threads do nothing
WARP EXECUTION (32 threads/warp):
Warp 0: Threads (0,0)-(0,15), (1,0)-(1,15) → Only 6 threads work, 26 idle
Warp 1: Threads (2,0)-(2,15), (3,0)-(3,15) → Only 3 threads work, 29 idle
Warps 2-7: All threads idle
Optimized Configuration: (4,4) threads per block
GPU EXECUTION - 1 BLOCK with 4×4 THREADS (16 threads total)
BLOCK 0 (4×4 threads) - 9 threads work, 7 are idle:
Thread Grid in Block 0:
╔═════════════════════╗
║ (0,0) (0,1) (0,2) (0,3) ║
║ (1,0) (1,1) (1,2) (1,3) ║
║ (2,0) (2,1) (2,2) (2,3) ║
║ (3,0) (3,1) (3,2) (3,3) ║
╚═════════════════════╝
Matrix Mapping (3×3):
╔═════════╗ Thread activity:
║ 🟢 🟢 🟢 ⚪ ║ 🟢 = Thread computes matrix element (i<3, j<3)
║ 🟢 🟢 🟢 ⚪ ║ ⚪ = Thread skips (j≥3 or i≥3)
║ 🟢 🟢 🟢 ⚪ ║
║ ⚪ ⚪ ⚪ ⚪ ║
╚═════════╝ Only 7 idle threads instead of 247!
WARP EXECUTION (32 threads/warp):
Warp 0: Threads (0,0)-(3,3) → 9 threads work, 7 idle (much better utilization!)
Execution Flow Comparison
ORIGINAL (16,16) threads/block:
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ BLOCK 0 │ │ Matrix 3×3 │ │ GPU Warps │
│ 256 threads │ │ 9 elements │ │ 8 warps needed │
│ │ → │ │ → │ │
│ Only 9 work │ │ Massive waste │ │ Poor occupancy │
│ 247 idle │ │ of resources │ │ (12.5% useful) │
└─────────────────┘ └─────────────────┘ └─────────────────┘
OPTIMIZED (4,4) threads/block:
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ BLOCK 0 │ │ Matrix 3×3 │ │ GPU Warps │
│ 16 threads │ │ 9 elements │ │ 1 warp needed │
│ │ → │ │ → │ │
│ 9 work, 7 idle │ │ Good fit │ │ Better occupancy│
│ │ │ │ │ (56% useful) │
└─────────────────┘ └─────────────────┘ └─────────────────┘
Key Differences
| Aspect | Original (16,16) | Optimized (4,4) |
|---|---|---|
| Threads per block | 256 | 16 |
| Working threads | 9 | 9 |
| Idle threads | 247 | 7 |
| Thread utilization | 3.5% | 56% |
| Warps used | 8 warps | 1 warp |
| Warp utilization | 12.5% | 56% |
| Memory usage | Higher | Lower |
| GPU occupancy | Low | Medium |
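In code, switching from the original configuration to the (4,4) configuration compared above only means changing the launch parameters; a minimal sketch reusing the matrix_addition kernel and the device arrays from the previous example:
# Smaller blocks: 16 threads per block instead of 256, so only 7 threads sit idle
threads_per_block = (4, 4)
blocks_per_grid = (int(np.ceil(N / threads_per_block[0])),
                   int(np.ceil(N / threads_per_block[1])))  # (1, 1) for N = 3
matrix_addition[blocks_per_grid, threads_per_block](A_device, B_device, C_device)
print(C_device.copy_to_host())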
Why Multiple Blocks Matter
- Better occupancy: GPU can schedule blocks across different Streaming Multiprocessors (SMs)
- Latency hiding: When one block waits for memory, other blocks can execute
- Resource utilization: More blocks = better use of GPU's parallel architecture
The Trade-off
- More blocks = better GPU utilization, but more "wasted" threads at boundaries
- Fewer blocks = less thread waste, but poor GPU occupancy
To understand this trade-off, let's look at what Streaming Multiprocessors (SMs) are and how block scheduling works.
A GPU is composed of multiple Streaming Multiprocessors (SMs):
GPU CHIP OVERVIEW:
╔═══════════════════════════════════════════════════╗
║ GPU Chip ║
║ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ║
║ │ SM 0 │ │ SM 1 │ │ SM 2 │ ║
║ │ │ │ │ │ │ ║
║ │ - Cores │ │ - Cores │ │ - Cores │ ║
║ │ - Registers │ │ - Registers │ │ - Registers │ ║
║ │ - Cache │ │ - Cache │ │ - Cache │ ║
║ └─────────────┘ └─────────────┘ └─────────────┘ ║
║ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ║
║ │ SM 3 │ │ SM 4 │ │ SM 5 │ ║
║ │ │ │ │ │ │ ║
║ └─────────────┘ └─────────────┘ └─────────────┘ ║
║ ... etc ... ║
╚═══════════════════════════════════════════════════╝
Each SM has:
- Multiple CUDA cores (e.g., 64-128 cores per SM)
- Limited registers and shared memory
- Scheduler that manages warps (groups of 32 threads); these hardware limits can be queried at runtime, as sketched below
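If you want to see these limits for your own GPU, Numba exposes them as attributes of the current device; a minimal sketch (the attribute names follow CUDA's device-attribute naming, and the exact set depends on your Numba version):
from numba import cuda

dev = cuda.get_current_device()
print(dev.name)                   # GPU model name
print(dev.MULTIPROCESSOR_COUNT)   # number of SMs
print(dev.WARP_SIZE)              # 32 on current NVIDIA GPUs
print(dev.MAX_THREADS_PER_BLOCK)  # typically 1024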
Single Block vs Multiple Blocks Execution
Case 1: 1 Block (Your Original Code)
GPU with 6 SMs:
SM0: [BLOCK 0 - 256 threads] ← Only this SM is working!
SM1: [IDLE]
SM2: [IDLE]
SM3: [IDLE]
SM4: [IDLE]
SM5: [IDLE]
Result: 1 SM at 100% load, 5 SMs completely idle → 16.7% GPU utilization
Case 2: 4 Blocks (Optimized Code, e.g., (2,2) threads per block giving a 2×2 grid of blocks)
GPU with 6 SMs:
SM0: [BLOCK (0,0) - 4 threads] ← Multiple SMs working!
SM1: [BLOCK (0,1) - 4 threads]
SM2: [BLOCK (1,0) - 4 threads]
SM3: [BLOCK (1,1) - 4 threads]
SM4: [IDLE] ← Some SMs still idle, but much better
SM5: [IDLE]
Result: 4 SMs working, 2 idle → 66.7% GPU utilization
What Happens Inside an SM
Each SM can handle multiple blocks simultaneously through warp scheduling:
- Inside SM0 with 1 block (256 threads):
- Warps: [Warp0][Warp1][Warp2]...[Warp7] ← 8 warps total
- Scheduler: Tries to keep all warps busy, but they all depend on the same resources
- Inside SM0 with several smaller blocks:
- Warps: [Block0-Warp0][Block1-Warp0][Block0-Warp1]... ← Mixed warps from different blocks
- Scheduler: More flexibility to hide memory latency
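As a quick arithmetic sketch of where those warp counts come from (assuming the usual warp size of 32):
import math

warp_size = 32
for threads_per_block in [(16, 16), (4, 4)]:
    threads = threads_per_block[0] * threads_per_block[1]
    warps = math.ceil(threads / warp_size)
    print(threads_per_block, "->", threads, "threads =", warps, "warp(s) per block")
# (16, 16) -> 256 threads = 8 warp(s) per block
# (4, 4) -> 16 threads = 1 warp(s) per block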