Matrix Addition

Matrix addition is an element-wise operation, which makes it highly parallelizable: each element in the resulting matrix can be computed independently. In Python, we can parallelize this operation with the concurrent.futures module, with multiprocessing, or even with GPU acceleration via CUDA.

Example Code for Sequential Matrix Addition: Before diving into parallelism, let’s look at how matrix addition is implemented sequentially:
In [8]:
import numpy as np

# Define two matrices
A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
B = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

C = A + B
print(C)
Out [8]:
[[ 2  4  6]
 [ 8 10 12]
 [14 16 18]]

Parallel Matrix Addition Using Threads

To parallelize matrix addition, we can divide the matrix into rows or blocks and assign each portion to a separate thread for computation. Here’s how you can do it using the ThreadPoolExecutor from the concurrent.futures module:

In [9]:
import numpy as np
from concurrent.futures import ThreadPoolExecutor

# Function to add two matrices row by row
def add_rows(row_a, row_b):
    return row_a + row_b

# Matrix addition with threading
def parallel_matrix_addition(A, B):
    with ThreadPoolExecutor() as executor:
        result = list(executor.map(add_rows, A, B))
    return np.array(result)

A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
B = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Perform parallel matrix addition
C = parallel_matrix_addition(A, B)
print(C)
Out [9]:
[[ 2  4  6]
 [ 8 10 12]
 [14 16 18]]
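
The introduction also mentions multiprocessing as an option. The original notebook does not include an example, so here is a minimal, illustrative sketch of the same row-wise split using multiprocessing.Pool; it sidesteps the GIL at the cost of pickling each row between processes:

import numpy as np
from multiprocessing import Pool

# Add one pair of rows; takes a (row_a, row_b) tuple so it can be used with Pool.map
def add_row_pair(pair):
    row_a, row_b = pair
    return row_a + row_b

if __name__ == "__main__":
    A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
    B = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
    with Pool() as pool:
        C = np.array(pool.map(add_row_pair, zip(A, B)))
    print(C)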

Matrix Addition Using CUDA

We can further optimize matrix addition by leveraging CUDA for GPU acceleration. CUDA enables the use of GPUs to perform parallel matrix operations on a massive scale, making it much faster for large matrices. Below is an example of using CUDA for matrix addition:

In [10]:
import numpy as np
from numba import cuda

# Define the matrix size
N = 3

# CUDA kernel for matrix addition
@cuda.jit
def matrix_addition(A, B, C):
    i, j = cuda.grid(2)
    if i < C.shape[0] and j < C.shape[1]:
        C[i, j] = A[i, j] + B[i, j]

# Initialize matrices (float32 to match the output array C)
A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=np.float32)
B = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=np.float32)
C = np.zeros((N, N), dtype=np.float32)

# Define Grid and Block Sizes
threads_per_block = (16, 16)
blocks_per_grid_x = int(np.ceil(A.shape[0] / threads_per_block[0])) #1
blocks_per_grid_y = int(np.ceil(A.shape[1] / threads_per_block[1])) #1
blocks_per_grid = (blocks_per_grid_x, blocks_per_grid_y)    #(1,1)

# Call the CUDA kernel
matrix_addition[blocks_per_grid, threads_per_block](A, B, C)

print(C)
Out [10]:
[[ 2.  4.  6.]
 [ 8. 10. 12.]
 [14. 16. 18.]]
/home/san/miniconda/envs/mac-linux/lib/python3.13/site-packages/numba/cuda/dispatcher.py:536: NumbaPerformanceWarning: Grid size 1 will likely result in GPU under-utilization due to low occupancy.
  warn(NumbaPerformanceWarning(msg))
/home/san/miniconda/envs/mac-linux/lib/python3.13/site-packages/numba/cuda/cudadrv/devicearray.py:887: NumbaPerformanceWarning: Host array used in CUDA kernel will incur copy overhead to/from device.
  warn(NumbaPerformanceWarning(msg))

Thread Grid Visualization

We have:

  • 1 block containing 16×16 threads (256 total)
  • 3×3 matrix (9 elements)
  • Thread Block (16×16 threads) - Only top-left 3×3 are used:
    Thread Coordinates (i,j) within the block:
    (0,0) (0,1) (0,2) (0,3) ... (0,15)
    (1,0) (1,1) (1,2) (1,3) ... (1,15)
    (2,0) (2,1) (2,2) (2,3) ... (2,15)
    (3,0) (3,1) (3,2) (3,3) ... (3,15)
     ...   ...   ...   ...  ...   ...
    (15,0)(15,1)(15,2)(15,3)... (15,15)

    • Matrix Mapping: Each thread gets its global coordinates via cuda.grid(2). Since we have only 1 block, the mapping is direct:
      • i = thread_x (0 to 15)
      • j = thread_y (0 to 15)

    Which Threads Actually Work?

    • Working threads (9 threads):
    • Matrix C indices ←→ Thread coordinates
      C[0,0] ← Thread (0,0) → 1+1=2
      C[0,1] ← Thread (0,1) → 2+2=4
      C[0,2] ← Thread (0,2) → 3+3=6
      C[1,0] ← Thread (1,0) → 4+4=8
      C[1,1] ← Thread (1,1) → 5+5=10
      C[1,2] ← Thread (1,2) → 6+6=12
      C[2,0] ← Thread (2,0) → 7+7=14
      C[2,1] ← Thread (2,1) → 8+8=16
      C[2,2] ← Thread (2,2) → 9+9=18

      • Non-working threads (247 threads):
        • Threads with i ≥ 3 OR j ≥ 3 skip the computation
        • Examples: (0,3), (3,0), (15,15) - all hit the boundary check and exit

      Visual Map of Active Threads

      Active Threads/Matrix Elements:
      🟩 🟩 🟩 ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫
      🟩 🟩 🟩 ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫
      🟩 🟩 🟩 ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫
      ▫  ▫  ▫  ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫ ▫
      ... 12 more rows of inactive threads (192 threads) ...

      where 🟩 = Active thread (computes one matrix element), ▫ = Inactive thread (skips computation due to boundary check)

      • Execution Pattern
      • GPU launches 256 threads in parallel
        • 9 threads find they have valid matrix indices (0-2, 0-2) and perform addition
        • 247 threads immediately exit because their coordinates are outside the matrix bounds
        • All threads complete simultaneously (GPU processes them in warps of 32)

      If we had a 16×16 matrix instead, all 256 threads would be utilized: a perfect mapping of one thread per matrix element, with no wasted computation.
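
To make the active/inactive split concrete, here is a small illustrative kernel (not part of the original notebook, and assuming a CUDA-capable GPU) that simply marks which launched threads map onto the 3×3 matrix:

import numpy as np
from numba import cuda

@cuda.jit
def mark_active(mask):
    # Each thread computes its global (i, j) coordinates, exactly as in matrix_addition
    i, j = cuda.grid(2)
    if i < mask.shape[0] and j < mask.shape[1]:
        mask[i, j] = 1  # this thread would have computed one matrix element

mask = np.zeros((3, 3), dtype=np.int32)
mark_active[(1, 1), (16, 16)](mask)  # 1 block of 16x16 = 256 threads
print(int(mask.sum()), "of 256 launched threads did useful work")  # expect 9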

Performance Warnings

These are performance warnings, not errors! Your code will still run correctly, but Numba is giving you helpful suggestions to optimize GPU performance.

Warning 1: Low Occupancy

Grid size 1 will likely result in GPU under-utilization due to low occupancy.

  • What it means: You're only using 1 block, but GPUs work best with multiple blocks to hide memory latency.
  • Why it happens: For a 3×3 matrix, you get:
    • ceil(3/16) = 1 block in each dimension
    • Total: 1×1 = 1 block

  • Solution: Use a smaller thread block size that better matches your small matrix, for example:
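
A minimal sketch (not a separate cell in the original notebook), reusing the matrix_addition kernel and the A, B, C arrays defined above; the (4, 4) block size is just one reasonable choice for a 3×3 matrix:

threads_per_block = (4, 4)   # smaller block: 16 threads instead of 256
blocks_per_grid_x = int(np.ceil(A.shape[0] / threads_per_block[0]))  # 1
blocks_per_grid_y = int(np.ceil(A.shape[1] / threads_per_block[1]))  # 1
blocks_per_grid = (blocks_per_grid_x, blocks_per_grid_y)

matrix_addition[blocks_per_grid, threads_per_block](A, B, C)
print(C)
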
Warning 2: Host Array Copy Overhead

Host array used in CUDA kernel will incur copy overhead to/from device.

  • What it means: Numba is automatically copying your CPU arrays to GPU memory, which takes time.
  • Why it happens: You're using regular NumPy arrays instead of explicitly allocated GPU arrays.
  • Solution: Pre-allocate arrays on the GPU:
In [11]:
from numba import cuda
import numpy as np

# Allocate memory directly on the GPU
A_device = cuda.to_device(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=np.float32))
B_device = cuda.to_device(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=np.float32))
C_device = cuda.device_array((N, N), dtype=np.float32)  # Allocate an empty array on the GPU

# Run the kernel (reusing the grid/block configuration defined earlier)
matrix_addition[blocks_per_grid, threads_per_block](A_device, B_device, C_device)

# Copy the result back to the CPU
C = C_device.copy_to_host()
print(C)

Out [11]:
[[ 2.  4.  6.]
 [ 8. 10. 12.]
 [14. 16. 18.]]


    Original Configuration: (16,16) threads per block

    GPU EXECUTION - 1 BLOCK with 16×16 THREADS (256 threads total)
    
    BLOCK 0 (16×16 threads) - Only 9 threads work, 247 are idle:
    
    Thread Grid in Block 0:
    ╔═══════════════════════════════════════════════════════════════════════════════╗
    ║ (0,0) (0,1) (0,2) (0,3) (0,4) ... (0,15)                                     ║
    ║ (1,0) (1,1) (1,2) (1,3) (1,4) ... (1,15)                                     ║
    ║ (2,0) (2,1) (2,2) (2,3) (2,4) ... (2,15)                                     ║
    ║ (3,0) (3,1) (3,2) (3,3) (3,4) ... (3,15)                                     ║
    ║ ...   ...   ...   ...   ...   ... ...                                         ║
    ║ (15,0)(15,1)(15,2)(15,3)(15,4)...(15,15)                                     ║
    ╚═══════════════════════════════════════════════════════════════════════════════╝
    
    Matrix Mapping (3×3):
    ╔═════════╗     Only these threads do work:
    ║ 🟢 🟢 🟢 ║     🟢 = Thread computes matrix element
    ║ 🟢 🟢 🟢 ║     ⚪ = Thread skips (out of bounds)
    ║ 🟢 🟢 🟢 ║
    ╚═════════╝     Rest of block: 247 ⚪⚪⚪ threads do nothing
    
    WARP EXECUTION (32 threads/warp):
    Warp 0: Threads (0,0)-(0,15), (1,0)-(1,15) → Only 6 threads work, 26 idle
    Warp 1: Threads (2,0)-(2,15), (3,0)-(3,3)  → Only 3 threads work, 29 idle
    Warps 2-7: All threads idle

    Optimized Configuration: (4,4) threads per block

    GPU EXECUTION - 1 BLOCK with 4×4 THREADS (16 threads total)
    
    BLOCK 0 (4×4 threads) - 9 threads work, 7 are idle:
    
    Thread Grid in Block 0:
    ╔═════════════════════════╗
    ║ (0,0) (0,1) (0,2) (0,3) ║
    ║ (1,0) (1,1) (1,2) (1,3) ║
    ║ (2,0) (2,1) (2,2) (2,3) ║
    ║ (3,0) (3,1) (3,2) (3,3) ║
    ╚═════════════════════════╝
    
    Matrix Mapping (3×3):
    ╔═════════╗     Thread activity:
    ║ 🟢 🟢 🟢 ⚪ ║     🟢 = Thread computes matrix element (i<3, j<3)
    ║ 🟢 🟢 🟢 ⚪ ║     ⚪ = Thread skips (j≥3 or i≥3)
    ║ 🟢 🟢 🟢 ⚪ ║
    ║ ⚪ ⚪ ⚪ ⚪ ║
    ╚═════════╝     Only 7 idle threads instead of 247!
    
    WARP EXECUTION (32 threads/warp):
    Warp 0: Threads (0,0)-(3,3) → 9 threads work, 7 idle (much better utilization!)

    Execution Flow Comparison

    ORIGINAL (16,16) threads/block:
    ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
    │   BLOCK 0       │    │   Matrix 3×3    │    │   GPU Warps     │
    │ 256 threads     │    │   9 elements    │    │ 8 warps needed  │
    │                 │ →  │                 │ →  │                 │
    │ Only 9 work     │    │ Massive waste   │    │ Poor occupancy  │
    │ 247 idle        │    │ of resources    │    │ (12.5% useful)  │
    └─────────────────┘    └─────────────────┘    └─────────────────┘
    
    OPTIMIZED (4,4) threads/block:
    ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
    │   BLOCK 0       │    │   Matrix 3×3    │    │   GPU Warps     │
    │ 16 threads      │    │   9 elements    │    │ 1 warp needed   │
    │                 │ →  │                 │ →  │                 │
    │ 9 work, 7 idle  │    │ Good fit        │    │ Better occupancy│
    │                 │    │                 │    │ (56% useful)    │
    └─────────────────┘    └─────────────────┘    └─────────────────┘

    Key Differences

    Aspect               Original (16,16)   Optimized (4,4)
    Threads per block    256                16
    Working threads      9                  9
    Idle threads         247                7
    Thread utilization   3.5%               56%
    Warps used           8 warps            1 warp
    Warp utilization     12.5%              56%
    Memory usage         Higher             Lower
    GPU occupancy        Low                Medium
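
As a quick sanity check of the utilization figures above (a small snippet added for illustration, not part of the original notebook):

working = 9  # threads that map onto the 3x3 matrix
print(f"(16,16) block: {working}/256 = {working / 256:.1%} thread utilization")  # ~3.5%
print(f"(4,4)  block:  {working}/16  = {working / 16:.1%} thread utilization")   # ~56.2%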

    Why Multiple Blocks Matter

    • Better occupancy: GPU can schedule blocks across different Streaming Multiprocessors (SMs)
    • Latency hiding: When one block waits for memory, other blocks can execute
    • Resource utilization: More blocks = better use of GPU's parallel architecture
    • The trade-off (see the grid-size sketch after this list):
      • More blocks = better GPU utilization, but more "wasted" threads at boundaries
      • Fewer blocks = less thread waste, but poor GPU occupancy
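
For the tiny 3×3 example a single block is unavoidable, but for a larger matrix the same ceil-division naturally produces a multi-block grid that can be spread across SMs. An illustrative sketch (the 1024×1024 size is an assumption, not from the original):

import numpy as np

N_large = 1024                      # assumed size for illustration
threads_per_block = (16, 16)
blocks_per_grid = (int(np.ceil(N_large / threads_per_block[0])),
                   int(np.ceil(N_large / threads_per_block[1])))
print(blocks_per_grid)              # (64, 64) -> 4096 blocks of 256 threads each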

    Let's look at what Streaming Multiprocessors (SMs) are and how block scheduling works.

    A GPU is composed of multiple Streaming Multiprocessors (SMs):

    GPU CHIP OVERVIEW:

    ╔═══════════════════════════════════════════════════╗
    ║                 GPU Chip                          ║
    ║  ┌─────────────┐ ┌─────────────┐ ┌─────────────┐  ║
    ║  │    SM 0     │ │    SM 1     │ │    SM 2     │  ║
    ║  │             │ │             │ │             │  ║
    ║  │ - Cores     │ │ - Cores     │ │ - Cores     │  ║
    ║  │ - Registers │ │ - Registers │ │ - Registers │  ║
    ║  │ - Cache     │ │ - Cache     │ │ - Cache     │  ║
    ║  └─────────────┘ └─────────────┘ └─────────────┘  ║
    ║  ┌─────────────┐ ┌─────────────┐ ┌─────────────┐  ║
    ║  │    SM 3     │ │    SM 4     │ │    SM 5     │  ║
    ║  │             │ │             │ │             │  ║
    ║  └─────────────┘ └─────────────┘ └─────────────┘  ║
    ║                 ... etc ...                       ║
    ╚═══════════════════════════════════════════════════╝

    Each SM has:

    • Multiple CUDA cores (e.g., 64-128 cores per SM)
    • Limited registers and shared memory
    • Scheduler that manages warps (groups of 32 threads)
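
If you want to see these numbers for your own GPU, Numba exposes the CUDA device attributes; a small sketch (assuming a CUDA-capable device is available; treat the exact attribute names as an assumption to verify against the Numba docs):

from numba import cuda

# Query a few attributes of the current CUDA device
dev = cuda.get_current_device()
print("Name:", dev.name)
print("Streaming Multiprocessors:", dev.MULTIPROCESSOR_COUNT)
print("Max threads per block:", dev.MAX_THREADS_PER_BLOCK)
print("Warp size:", dev.WARP_SIZE)
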
    Single Block vs Multiple Blocks Execution

      Case 1: 1 Block (Your Original Code)

      GPU with 6 SMs:
      SM0: [BLOCK 0 - 256 threads] ← Only this SM is working!
      SM1: [IDLE]
      SM2: [IDLE]
      SM3: [IDLE]
      SM4: [IDLE]
      SM5: [IDLE]
      
      Result: 1 SM at 100% load, 5 SMs completely idle → 16.7% GPU utilization

      Case 2: 4 Blocks (e.g., with (2,2) threads per block)

      GPU with 6 SMs:
      SM0: [BLOCK (0,0) - 4 threads] ← Multiple SMs working!
      SM1: [BLOCK (0,1) - 4 threads]
      SM2: [BLOCK (1,0) - 4 threads]
      SM3: [BLOCK (1,1) - 4 threads]
      SM4: [IDLE] ← Some SMs still idle, but much better
      SM5: [IDLE]
      
      Result: 4 SMs working, 2 idle → 66.7% GPU utilization

      What Happens Inside an SM

      Each SM can handle multiple blocks simultaneously through warp scheduling:

      • Inside SM0 with 1 block (256 threads):
        • Warps: [Warp0][Warp1][Warp2]...[Warp7] ← 8 warps total
        • Scheduler: Tries to keep all warps busy, but they all depend on the same resources

      • Inside SM0 with multiple blocks:
        • Warps: [Block0-Warp0][Block1-Warp0][Block0-Warp1]... ← Mixed warps from different blocks
        • Scheduler: More flexibility to hide memory latency
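
As a quick check of the warp counts used above (a small helper added for illustration):

import math

# Number of 32-thread warps needed for a 2D thread block
def warps_per_block(block_dim):
    threads = block_dim[0] * block_dim[1]
    return math.ceil(threads / 32)

print(warps_per_block((16, 16)))  # 8 warps for the original configuration
print(warps_per_block((4, 4)))    # 1 warp for the optimized configuration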
