- Deep Learning and Machine Learning with CUDA
- Understanding Data Flow in Deep Learning: CPU, GPU, RAM, VRAM, Cache, and Disk Storage
- The GPU Hierarchical Structure
- CPU and GPU Comparison
- Linear Regression Algorithm
- Matrix Addition
- Matrix Multiplication: Naive, Optimized, and CUDA Approaches
- Neural Network: Multi-layer Network
- Vector Addition
- CUDA Kernel for parallel reduction
- Cumulative Sum
- Advanced CUDA Features and Optimization Techniques
Matrix Addition
Matrix addition is an element-wise operation, which makes it highly parallelizable. Each element in
the resulting matrix can be computed independently. Using Python, we can parallelize this operation
using the concurrent.futures module, multiprocessing, or even GPU acceleration using CUDA.
Example Code for Sequential Matrix Addition: Before diving into parallelism, let's look at how matrix addition is implemented sequentially:
import numpy as np
# Define two matrices
A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
B = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
C = A + B
print(C)
[[ 2 4 6] [ 8 10 12] [14 16 18]]
Parallel Matrix Addition Using Threads
For parallelizing the matrix addition, we can divide the matrix into rows or blocks and assign each portion to a separate thread for computation. Here's how you can do it using the ThreadPoolExecutor from the concurrent.futures module:
import numpy as np
from concurrent.futures import ThreadPoolExecutor
# Function to add two matrices row by row
def add_rows(row_a, row_b):
    return row_a + row_b

# Matrix addition with threading
def parallel_matrix_addition(A, B):
    with ThreadPoolExecutor() as executor:
        result = list(executor.map(add_rows, A, B))
    return np.array(result)
A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
B = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
# Perform parallel matrix addition
C = parallel_matrix_addition(A, B)
print(C)
[[ 2 4 6] [ 8 10 12] [14 16 18]]
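The introduction above also mentions multiprocessing. As a minimal sketch (not part of the original example), the same row-wise decomposition can be run with ProcessPoolExecutor from concurrent.futures; for a matrix this small, process start-up and pickling overhead will dominate, so treat it as an illustration of the pattern rather than an optimization:
import numpy as np
from concurrent.futures import ProcessPoolExecutor

# Each worker process adds one pair of rows
def add_rows(row_a, row_b):
    return row_a + row_b

# Matrix addition with separate processes instead of threads
def multiprocess_matrix_addition(A, B):
    with ProcessPoolExecutor() as executor:
        result = list(executor.map(add_rows, A, B))
    return np.array(result)

if __name__ == "__main__":
    A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
    B = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
    print(multiprocess_matrix_addition(A, B))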
Matrix Addition using CUDA
We can further optimize matrix addition by leveraging CUDA for GPU
acceleration. CUDA enables the use of GPUs to perform parallel matrix operations on a massive scale,
making it much faster for large matrices. Below is an example of using CUDA for matrix addition:
import numpy as np
from numba import cuda
# Define the matrix size
N = 3

# CUDA kernel for matrix addition
@cuda.jit
def matrix_addition(A, B, C):
    i, j = cuda.grid(2)
    if i < C.shape[0] and j < C.shape[1]:
        C[i, j] = A[i, j] + B[i, j]

# Initialize matrices
A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
B = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
C = np.zeros((N, N), dtype=np.float32)
# Define grid and block sizes
threads_per_block = (16, 16)
blocks_per_grid_x = int(np.ceil(A.shape[0] / threads_per_block[0]))  # 1
blocks_per_grid_y = int(np.ceil(A.shape[1] / threads_per_block[1]))  # 1
blocks_per_grid = (blocks_per_grid_x, blocks_per_grid_y)  # (1, 1)

# Call the CUDA kernel
matrix_addition[blocks_per_grid, threads_per_block](A, B, C)
print(C)
[[ 2. 4. 6.] [ 8. 10. 12.] [14. 16. 18.]]
/home/san/miniconda/envs/mac-linux/lib/python3.13/site-packages/numba/cuda/dispatcher.py:536: NumbaPerformanceWarning: Grid size 1 will likely result in GPU under-utilization due to low occupancy. warn(NumbaPerformanceWarning(msg))
/home/san/miniconda/envs/mac-linux/lib/python3.13/site-packages/numba/cuda/cudadrv/devicearray.py:887: NumbaPerformanceWarning: Host array used in CUDA kernel will incur copy overhead to/from device. warn(NumbaPerformanceWarning(msg))
Thread Grid Visualization
We have:
- 1 block containing 16×16 threads (256 total)
- 3×3 matrix (9 elements)
Thread Block (16×16 threads) - Only top-left 3×3 are used:
Thread Coordinates (i,j) within the block:
(0,0) (0,1) (0,2) (0,3) ... (0,15)
(1,0) (1,1) (1,2) (1,3) ... (1,15)
(2,0) (2,1) (2,2) (2,3) ... (2,15)
(3,0) (3,1) (3,2) (3,3) ... (3,15)
... ... ... ... ... ...
(15,0)(15,1)(15,2)(15,3)... (15,15)
- Matrix Mapping: Each thread gets its global coordinates via cuda.grid(2). Since we have only one block, the mapping is direct (a manual-index version of the kernel is sketched after this list):
- i = threadIdx.x (0 to 15)
- j = threadIdx.y (0 to 15)
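For reference, here is a minimal sketch of the same kernel with the global indices computed by hand instead of via cuda.grid(2); with a single block, blockIdx is 0, so the global index reduces to the thread index within the block:
from numba import cuda

@cuda.jit
def matrix_addition_manual(A, B, C):
    # Global index = block offset + thread index within the block
    i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
    j = cuda.blockIdx.y * cuda.blockDim.y + cuda.threadIdx.y
    if i < C.shape[0] and j < C.shape[1]:
        C[i, j] = A[i, j] + B[i, j]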
Which Threads Actually Work?
- Working threads (9 threads):
Matrix C indices ←→ Thread coordinates
C[0,0] ← Thread (0,0) → 1+1=2
C[0,1] ← Thread (0,1) → 2+2=4
C[0,2] ← Thread (0,2) → 3+3=6
C[1,0] ← Thread (1,0) → 4+4=8
C[1,1] ← Thread (1,1) → 5+5=10
C[1,2] ← Thread (1,2) → 6+6=12
C[2,0] ← Thread (2,0) → 7+7=14
C[2,1] ← Thread (2,1) → 8+8=16
C[2,2] ← Thread (2,2) → 9+9=18
- Non-working threads (247 threads):
- Threads with i ≥ 3 OR j ≥ 3 skip the computation
- Examples: (0,3), (3,0), (15,15) - all hit the boundary check and exit
Visual Map of Active Threads
Active Threads/Matrix Elements:
🟩 🟩 🟩 ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫
🟩 🟩 🟩 ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫
🟩 🟩 🟩 ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫
▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫
... 192 more inactive threads (rows 4-15) ...
where 🟩 = Active thread (computes one matrix element), ▫ = Inactive thread (skips computation due to boundary check)
- Execution Pattern
- GPU launches 256 threads in parallel
- 9 threads find they have valid matrix indices (0-2, 0-2) and perform addition
- 247 threads immediately exit because their coordinates are outside the matrix bounds
- All threads complete simultaneously (GPU processes them in warps of 32)
If we had a 16×16 matrix instead, all 256 threads would be utilized: a perfect mapping of one thread per matrix element, with no wasted computation. A sketch of this case follows.
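As a minimal sketch of that case (reusing the matrix_addition kernel from above, with hypothetical input values), a 16×16 input fills the block exactly:
# Hypothetical 16x16 inputs: one thread per matrix element, no idle threads
N16 = 16
A16 = np.arange(N16 * N16, dtype=np.float32).reshape(N16, N16)
B16 = np.ones((N16, N16), dtype=np.float32)
C16 = np.zeros((N16, N16), dtype=np.float32)
blocks = (int(np.ceil(N16 / 16)), int(np.ceil(N16 / 16)))  # still (1, 1)
matrix_addition[blocks, (16, 16)](A16, B16, C16)  # all 256 threads pass the bounds check
print(C16[0, :4])  # [1. 2. 3. 4.]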
Performance Warnings
These are performance warnings, not errors! Your code will still run correctly, but Numba is giving you helpful suggestions to optimize GPU performance.
Warning 1: Low Occupancy
Grid size 1 will likely result in GPU under-utilization due to low occupancy.
- What it means: You're only using 1 block, but GPUs work best with multiple blocks to hide memory latency.
- Why it happens: For a 3×3 matrix, you get ceil(3/16) = 1 block in each dimension, so the total grid is 1×1 = 1 block (the sketch below shows how a larger matrix fills the grid).
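For comparison, a minimal sketch with a hypothetical larger matrix shows how the same formula produces a grid with many blocks, which is what the scheduler needs to keep the GPU busy:
import numpy as np

# Hypothetical 1024x1024 matrix with the same (16, 16) block shape
N_large = 1024
threads_per_block = (16, 16)
blocks_per_grid = (int(np.ceil(N_large / threads_per_block[0])),
                   int(np.ceil(N_large / threads_per_block[1])))
print(blocks_per_grid)  # (64, 64) -> 4096 blocks to spread across the SMs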
Warning 2: Host Array Copy Overhead
Host array used in CUDA kernel will incur copy overhead to/from device.
- What it means: Numba is automatically copying your CPU arrays to GPU memory, which takes time.
- Why it happens: You're using regular NumPy arrays instead of explicitly allocated GPU arrays.
- Solution: Pre-allocate arrays on the GPU:
from numba import cuda
import numpy as np
# Allocate memory directly on GPU
A_device = cuda.to_device(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=np.float32))
B_device = cuda.to_device(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=np.float32))
C_device = cuda.device_array((N, N), dtype=np.float32) # Allocate empty array on GPU
# Run kernel
matrix_addition[blocks_per_grid, threads_per_block](A_device, B_device, C_device)
# Copy result back to CPU
C = C_device.copy_to_host()
print(C)
[[ 2. 4. 6.] [ 8. 10. 12.] [14. 16. 18.]]
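To check whether the device-array version actually helps, a rough wall-clock timing sketch (reusing the kernel, launch configuration, and device arrays from above) looks like this; cuda.synchronize() makes sure the kernel has finished before the clock stops, and the first launch is excluded because it includes JIT compilation:
import time
from numba import cuda

# Warm-up launch: triggers JIT compilation of the kernel
matrix_addition[blocks_per_grid, threads_per_block](A_device, B_device, C_device)
cuda.synchronize()

start = time.perf_counter()
matrix_addition[blocks_per_grid, threads_per_block](A_device, B_device, C_device)
cuda.synchronize()  # wait for the kernel to finish
print(f"kernel time: {(time.perf_counter() - start) * 1e6:.1f} us")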
Original Configuration: (16,16) threads per block
GPU EXECUTION - 1 BLOCK with 16×16 THREADS (256 threads total)
BLOCK 0 (16×16 threads) - Only 9 threads work, 247 are idle:
Thread Grid in Block 0:
╔═══════════════════════════════════════════════════════════════════════════════╗
║ (0,0) (0,1) (0,2) (0,3) (0,4) ... (0,15) ║
║ (1,0) (1,1) (1,2) (1,3) (1,4) ... (1,15) ║
║ (2,0) (2,1) (2,2) (2,3) (2,4) ... (2,15) ║
║ (3,0) (3,1) (3,2) (3,3) (3,4) ... (3,15) ║
║ ... ... ... ... ... ... ... ║
║ (15,0)(15,1)(15,2)(15,3)(15,4)...(15,15) ║
╚═══════════════════════════════════════════════════════════════════════════════╝
Matrix Mapping (3×3):
╔═════════╗ Only these threads do work:
║ 🟢 🟢 🟢 ║ 🟢 = Thread computes matrix element
║ 🟢 🟢 🟢 ║ ⚪ = Thread skips (out of bounds)
║ 🟢 🟢 🟢 ║
╚═════════╝ Rest of block: 247 ⚪⚪⚪ threads do nothing
WARP EXECUTION (32 threads/warp):
Warp 0: Threads (0,0)-(0,15), (1,0)-(1,15) → Only 6 threads work, 26 idle
Warp 1: Threads (2,0)-(2,15), (3,0)-(3,15) → Only 3 threads work, 29 idle
Warps 2-7: All threads idle
Optimized Configuration: (4,4) threads per block
GPU EXECUTION - 1 BLOCK with 4×4 THREADS (16 threads total)
BLOCK 0 (4×4 threads) - 9 threads work, 7 are idle:
Thread Grid in Block 0:
╔═════════════════════╗
║ (0,0) (0,1) (0,2) (0,3) ║
║ (1,0) (1,1) (1,2) (1,3) ║
║ (2,0) (2,1) (2,2) (2,3) ║
║ (3,0) (3,1) (3,2) (3,3) ║
╚═════════════════════╝
Matrix Mapping (3×3):
╔═════════╗ Thread activity:
║ 🟢 🟢 🟢 ⚪ ║ 🟢 = Thread computes matrix element (i<3, j<3)
║ 🟢 🟢 🟢 ⚪ ║ ⚪ = Thread skips (j≥3 or i≥3)
║ 🟢 🟢 🟢 ⚪ ║
║ ⚪ ⚪ ⚪ ⚪ ║
╚═════════╝ Only 7 idle threads instead of 247!
WARP EXECUTION (32 threads/warp):
Warp 0: Threads (0,0)-(3,3) → 9 threads work, 7 idle (much better utilization!)
Execution Flow Comparison
ORIGINAL (16,16) threads/block:
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ BLOCK 0 │ │ Matrix 3×3 │ │ GPU Warps │
│ 256 threads │ │ 9 elements │ │ 8 warps needed │
│ │ → │ │ → │ │
│ Only 9 work │ │ Massive waste │ │ Poor occupancy │
│ 247 idle │ │ of resources │ │ (12.5% useful) │
└─────────────────┘ └─────────────────┘ └─────────────────┘
OPTIMIZED (4,4) threads/block:
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ BLOCK 0 │ │ Matrix 3×3 │ │ GPU Warps │
│ 16 threads │ │ 9 elements │ │ 1 warp needed │
│ │ → │ │ → │ │
│ 9 work, 7 idle │ │ Good fit │ │ Better occupancy│
│ │ │ │ │ (56% useful) │
└─────────────────┘ └─────────────────┘ └─────────────────┘
Key Differences
| Aspect | Original (16,16) | Optimized (4,4) |
|---|---|---|
| Threads per block | 256 | 16 |
| Working threads | 9 | 9 |
| Idle threads | 247 | 7 |
| Thread utilization | 3.5% | 56% |
| Warps used | 8 warps | 1 warp |
| Warp utilization | 12.5% | 56% |
| Memory usage | Higher | Lower |
| GPU occupancy | Low | Medium |
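In code, switching from the original configuration to the (4,4) configuration compared above only means changing the launch parameters; a minimal sketch reusing the matrix_addition kernel and the device arrays from the previous example:
# Smaller blocks: 16 threads per block instead of 256, so only 7 threads sit idle
threads_per_block = (4, 4)
blocks_per_grid = (int(np.ceil(N / threads_per_block[0])),
                   int(np.ceil(N / threads_per_block[1])))  # (1, 1) for N = 3
matrix_addition[blocks_per_grid, threads_per_block](A_device, B_device, C_device)
print(C_device.copy_to_host())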
Why Multiple Blocks Matter
- Better occupancy: GPU can schedule blocks across different Streaming Multiprocessors (SMs)
- Latency hiding: When one block waits for memory, other blocks can execute
- Resource utilization: More blocks = better use of GPU's parallel architecture
The Trade-off
- More blocks = better GPU utilization, but more "wasted" threads at boundaries
- Fewer blocks = less thread waste, but poor GPU occupancy
To understand this trade-off, let's look at what Streaming Multiprocessors (SMs) are and how block scheduling works.
A GPU is composed of multiple Streaming Multiprocessors (SMs):
GPU CHIP OVERVIEW:
╔═══════════════════════════════════════════════════╗
║ GPU Chip ║
║ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ║
║ │ SM 0 │ │ SM 1 │ │ SM 2 │ ║
║ │ │ │ │ │ │ ║
║ │ - Cores │ │ - Cores │ │ - Cores │ ║
║ │ - Registers │ │ - Registers │ │ - Registers │ ║
║ │ - Cache │ │ - Cache │ │ - Cache │ ║
║ └─────────────┘ └─────────────┘ └─────────────┘ ║
║ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ║
║ │ SM 3 │ │ SM 4 │ │ SM 5 │ ║
║ │ │ │ │ │ │ ║
║ └─────────────┘ └─────────────┘ └─────────────┘ ║
║ ... etc ... ║
╚═══════════════════════════════════════════════════╝
Each SM has:
- Multiple CUDA cores (e.g., 64-128 cores per SM)
- Limited registers and shared memory
- Scheduler that manages warps (groups of 32 threads); these hardware limits can be queried at runtime, as sketched below
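If you want to see these limits for your own GPU, Numba exposes them as attributes of the current device; a minimal sketch (the attribute names follow CUDA's device-attribute naming, and the exact set depends on your Numba version):
from numba import cuda

dev = cuda.get_current_device()
print(dev.name)                   # GPU model name
print(dev.MULTIPROCESSOR_COUNT)   # number of SMs
print(dev.WARP_SIZE)              # 32 on current NVIDIA GPUs
print(dev.MAX_THREADS_PER_BLOCK)  # typically 1024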
Single Block vs Multiple Blocks Execution
Case 1: 1 Block (Your Original Code)
GPU with 6 SMs:
SM0: [BLOCK 0 - 256 threads] ← Only this SM is working!
SM1: [IDLE]
SM2: [IDLE]
SM3: [IDLE]
SM4: [IDLE]
SM5: [IDLE]
Result: 1 SM at 100% load, 5 SMs completely idle → 16.7% GPU utilization
Case 2: 4 Blocks (Optimized Code, e.g., (2,2) threads per block giving a 2×2 grid of blocks)
GPU with 6 SMs:
SM0: [BLOCK (0,0) - 4 threads] ← Multiple SMs working!
SM1: [BLOCK (0,1) - 4 threads]
SM2: [BLOCK (1,0) - 4 threads]
SM3: [BLOCK (1,1) - 4 threads]
SM4: [IDLE] ← Some SMs still idle, but much better
SM5: [IDLE]
Result: 4 SMs working, 2 idle → 66.7% GPU utilization
What Happens Inside an SM
Each SM can handle multiple blocks simultaneously through warp scheduling:
- Inside SM0 with 1 block (256 threads):
- Warps: [Warp0][Warp1][Warp2]...[Warp7] ← 8 warps total
- Scheduler: Tries to keep all warps busy, but they all depend on the same resources
- Inside SM0 with several smaller blocks:
- Warps: [Block0-Warp0][Block1-Warp0][Block0-Warp1]... ← Mixed warps from different blocks
- Scheduler: More flexibility to hide memory latency
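As a quick arithmetic sketch of where those warp counts come from (assuming the usual warp size of 32):
import math

warp_size = 32
for threads_per_block in [(16, 16), (4, 4)]:
    threads = threads_per_block[0] * threads_per_block[1]
    warps = math.ceil(threads / warp_size)
    print(threads_per_block, "->", threads, "threads =", warps, "warp(s) per block")
# (16, 16) -> 256 threads = 8 warp(s) per block
# (4, 4) -> 16 threads = 1 warp(s) per block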