Matrix Multiplication: Naive, Optimized, and CUDA Approaches

Matrix multiplication is a more complex operation than matrix addition. Given two matrices A of dimensions m × n and B of dimensions n × p, their product C is an m × p matrix computed as:

C[i, j] = A[i, 1]*B[1, j] + A[i, 2]*B[2, j] + … + A[i, n]*B[n, j]

Naive Implementation of Matrix Multiplication

In [1]:
import numpy as np

A = np.array([[1, 2, 3], [4, 5, 6]])
B = np.array([[7, 8, 9], [10, 11, 12], [13, 14, 15]])

# Naive Matrix Multiplication

C = np.zeros((A.shape[0], B.shape[1]))

for i in range(A.shape[0]):
    for j in range(B.shape[1]):
        for k in range(A.shape[1]):
            C[i, j] += A[i, k] * B[k, j]  # accumulate the sum of products

print(C)

Out [1]:
[[ 66.  72.  78.]
 [156. 171. 186.]]

Naive Matrix Multiplication

The naive approach to matrix multiplication involves three nested loops: one for the rows of matrix A, one for the columns of matrix B, and one for summing the element-wise products. Here’s how you can implement this in Python:

In [3]:
import numpy as np

def naive_matrix_multiplication(A, B):
    m, n = A.shape
    n_b, p = B.shape
    assert n == n_b, "Inner dimensions must match"
    C = np.zeros((m, p))
    for i in range(m):
        for j in range(p):
            for k in range(n):
                C[i, j] += A[i, k] * B[k, j]
    return C

# Define two matrices
A = np.array([[1, 2], [4, 5]])
B = np.array([[7, 8], [10, 11]])

C = naive_matrix_multiplication(A, B)
print(C)
Out [3]:
[[27. 30.]
 [78. 87.]]

Optimized Matrix Multiplication

The naive matrix multiplication is inefficient because it repeatedly reads the same data from memory, causing memory latency issues. A more optimized approach involves using shared memory (covered later in the CUDA sections) and leveraging matrix libraries like NumPy, which are highly optimized and make use of BLAS (Basic Linear Algebra Subprograms).

Using NumPy for Optimized Multiplication

NumPy’s dot function is a highly optimized implementation of matrix multiplication.

In [4]:
import numpy as np

A = np.array([[1, 2], [4, 5]])
B = np.array([[7, 8], [10, 11]])

C = np.dot(A, B)
print(C)
Out [4]:
[[27 30]
 [78 87]]

NumPy uses highly optimized libraries under the hood (like OpenBLAS or Intel MKL) to perform matrix multiplication efficiently, taking advantage of low-level optimizations such as data pre-fetching and cache reuse.

Parallelizing Matrix Multiplication

To parallelize matrix multiplication manually, we can break the operation down by assigning different rows of matrix A and different columns of matrix B to different threads. However, using optimized libraries like NumPy is often the best approach, as they already include multi-threading and SIMD (Single Instruction, Multiple Data) optimizations.

Here’s an example of manually parallelizing matrix multiplication:

In [1]:
from concurrent.futures import ThreadPoolExecutor
import numpy as np

# Function to compute one row of matrix C
def multiply_row(A_row, B):
    return np.dot(A_row, B)

# Parallel Matrix Multiplication
def parallel_matrix_multiplication(A, B):
    m = A.shape[0]

    # Each worker thread computes one row of C; np.dot releases the GIL
    # during the underlying BLAS call, so the rows can run in parallel.
    with ThreadPoolExecutor() as executor:
        results = list(executor.map(multiply_row, A, [B] * m))

    return np.array(results)

A = np.array([[1, 2], [4, 5]])
B = np.array([[7, 8], [10, 11]])

C = parallel_matrix_multiplication(A, B)
print(C)
Out [1]:
[[27 30]
 [78 87]]

1. Work Division by Rows

  • Each thread processes one complete row of the result matrix
  • The thread takes a single row from A and multiplies it with the entire matrix B
  • This is an embarrassingly parallel problem - each row can be computed independently

2. Thread Pool Execution

  • ThreadPoolExecutor creates a pool of worker threads (sized by default from the CPU count)
  • executor.map() distributes the rows across available threads
  • [B]*m supplies the same matrix B to every call of multiply_row (a list of m references, not m copies)

Execution Visualization

A = [[1, 2],    B = [[7,  8],
     [4, 5]]         [10, 11]]

THREAD 1: Process row 0
A[0] = [1, 2] × B = [1*7 + 2*10, 1*8 + 2*11] = [27, 30]

THREAD 2: Process row 1
A[1] = [4, 5] × B = [4*7 + 5*10, 4*8 + 5*11] = [78, 87]

Final result: [[27, 30],
               [78, 87]]

Memory Coalescing and Alignment

Memory coalescing is a technique that ensures efficient memory access by aligning threads’ memory requests into fewer transactions. When threads in a warp (a group of 32 threads in CUDA) access consecutive memory addresses, the GPU can combine these accesses into fewer, larger memory transactions. This maximizes memory bandwidth utilization and reduces the overall memory latency.

How it works: In CUDA, global memory is accessed by all threads, but accessing it efficiently is key to performance. Without proper coalescing, each thread might make its own memory request, leading to multiple, inefficient transactions. With memory coalescing, these requests are combined into a single transaction when:

  • Threads in a warp access consecutive addresses.
  • The starting address is properly aligned.
Ensuring proper alignment: To achieve memory coalescing, we need to ensure proper alignment of memory. The memory addresses accessed by each thread should be aligned to the size of the data type. For example, for an array of float (4 bytes), the address should be aligned on a 4-byte boundary. This alignment allows the GPU to fetch data in a single coalesced transaction.

An example of memory coalescing with a properly aligned float array:

__global__ void coalescedMemoryAccess(float *input, float *output) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    output[tid] = input[tid];
}

In this case, assuming the input and output arrays are properly aligned, memory access will be coalesced.
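For contrast, a strided access pattern defeats coalescing: neighboring threads no longer touch neighboring addresses, so a warp’s requests cannot be merged. The kernel below is a minimal sketch of that anti-pattern; the name stridedMemoryAccess and the stride and n parameters are illustrative choices, not part of the original text.

__global__ void stridedMemoryAccess(float *input, float *output, int stride, int n) {
    // Neighboring threads access elements that are `stride` floats apart,
    // so a warp's 32 requests spread across many memory segments.
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    int idx = tid * stride;
    if (idx < n) {
        output[idx] = input[idx];   // poorly coalesced whenever stride > 1
    }
}

With stride = 1 this reduces to the coalesced kernel above; as the stride grows, the same copy requires more and more memory transactions.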

Shared Memory Optimization

Shared memory is a small, fast, on-chip memory that can be used to significantly speed up access times in CUDA programs. Shared memory is accessible by all threads within a block, making it useful for data that needs to be accessed multiple times by different threads.

Key advantages of shared memory:

  • Much faster than global memory.
  • Reduces redundant global memory accesses.
  • Allows efficient data sharing between threads within a block.

To use shared memory effectively, we need to:

  • Load frequently accessed data from global memory into shared memory.
  • Minimize bank conflicts (situations where multiple threads attempt to access the same memory bank simultaneously).

Bank conflicts: CUDA shared memory is divided into banks, and if multiple threads try to access data in the same bank at the same time, a bank conflict occurs, leading to serialization and performance loss. To avoid this, ensure that threads access different memory banks, ideally by organizing data such that consecutive threads access consecutive addresses.
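Putting these ideas together, here is a minimal sketch of a tiled matrix multiplication kernel that stages sub-blocks of A and B in shared memory. The kernel name, the TILE_SIZE value, and the assumption that the matrices are square with a side length that is a multiple of TILE_SIZE are simplifications made for this example, not something defined in the original text.

#define TILE_SIZE 16

// Sketch: C = A * B for n x n matrices, with n assumed to be a multiple
// of TILE_SIZE so that no boundary checks are needed.
__global__ void tiledMatrixMul(const float *A, const float *B, float *C, int n) {
    // Tiles of A and B staged in fast on-chip shared memory.
    __shared__ float tileA[TILE_SIZE][TILE_SIZE];
    __shared__ float tileB[TILE_SIZE][TILE_SIZE];

    int row = blockIdx.y * TILE_SIZE + threadIdx.y;
    int col = blockIdx.x * TILE_SIZE + threadIdx.x;
    float sum = 0.0f;

    // Walk across the shared dimension one tile at a time.
    for (int t = 0; t < n / TILE_SIZE; ++t) {
        // Each thread loads one element of each tile; consecutive threads
        // read consecutive global addresses, so these loads are coalesced.
        tileA[threadIdx.y][threadIdx.x] = A[row * n + t * TILE_SIZE + threadIdx.x];
        tileB[threadIdx.y][threadIdx.x] = B[(t * TILE_SIZE + threadIdx.y) * n + col];
        __syncthreads();   // wait until the whole tile is loaded

        // Every element of the tile is reused TILE_SIZE times from shared
        // memory instead of being re-read from global memory.
        for (int k = 0; k < TILE_SIZE; ++k) {
            sum += tileA[threadIdx.y][k] * tileB[k][threadIdx.x];
        }
        __syncthreads();   // wait before overwriting the tiles
    }

    C[row * n + col] = sum;
}

Each thread still computes a single element of C, but every global-memory value is now read once per tile and reused TILE_SIZE times from shared memory, which is exactly the reduction in redundant global accesses described above; consecutive threads also index consecutive shared-memory addresses in the inner loop, which keeps bank conflicts to a minimum.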

Reducing Warp Divergence for Performance

Warp divergence occurs when threads in a warp follow different execution paths due to conditional branching (e.g., if-else statements). When divergence occurs, the GPU must execute both paths serially, which reduces overall performance.

How to minimize warp divergence:

  1. Avoid branching whenever possible, or minimize its occurrence by structuring code carefully.
  2. Use predication (conditional assignments) instead of branching, which allows all threads to execute the same instruction with different outcomes based on conditions.
  3. Ensure branch conditions are uniform across threads in a warp, meaning all threads either take the same branch or none of them do.

Warp Divergence Explained

The Problem

__global__ void warpDivergenceExample(int *input, int *output) {
    int tid = threadIdx.x;
    if (input[tid] > 0) {
        output[tid] = input[tid] * 2;  // Path A
    } else {
        output[tid] = input[tid] / 2;  // Path B
    }
}

What happens when threads in a warp have different conditions:

Warp with 8 threads (simplified example):
Thread:   0   1   2   3   4   5   6   7
Input:   [5, -2,  8, -1,  3, -4,  7, -3]
Condition: T   F   T   F   T   F   T   F  (T = input[tid] > 0)

Execution:
  1. Threads 0,2,4,6 take TRUE branch → execute multiplication
  2. Threads 1,3,5,7 take FALSE branch → execute division
  3. GPU must execute BOTH paths SERIALLY for the same warp!
Performance Impact

Without divergence: 32 threads execute 1 instruction → 1 cycle
With divergence: 32 threads execute 2 paths → 2 cycles (100% slower!)

Visualizing Warp Divergence

Divergent Warp Execution

Warp Timeline (BAD - With Divergence):
Cycle 1: [A][ ][A][ ][A][ ][A][ ] ← Threads 0,2,4,6 execute the TRUE path (Path A)
         Threads 1,3,5,7: IDLE (masked out)

Cycle 2: [ ][B][ ][B][ ][B][ ][B] ← Threads 1,3,5,7 execute the FALSE path (Path B)
         Threads 0,2,4,6: IDLE (masked out)

Total: 2 cycles for 8 threads

Uniform Warp Execution (Ideal)

Warp Timeline (GOOD - No Divergence):
Cycle 1: [A][A][A][A][A][A][A][A] ← All threads execute the same path
Total: 1 cycle for 8 threads (2x faster!)

The Solution: Predication

Using Ternary Operator for Predication

__global__ void warpDivergenceAvoided(int *input, int *output) {
    int tid = threadIdx.x;
    int value = input[tid];
    output[tid] = (value > 0) ? value * 2 : value / 2;
}

How predication works:

  • All threads execute both the multiplication and division instructions
  • Conditional selection happens at the register level, not branch level
  • No branching = no serial execution of different paths
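The earlier list also recommended keeping branch conditions uniform across a warp. A minimal sketch of that idea follows; the kernel name uniformBranchExample and the per-block blockFlags array are illustrative assumptions, not something defined earlier in this post.

// Sketch: the branch condition depends only on blockIdx.x, so every thread
// in a warp (and in the whole block) evaluates it identically, and both
// paths are never executed serially within the same warp.
__global__ void uniformBranchExample(int *input, int *output, const int *blockFlags) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (blockFlags[blockIdx.x] > 0) {      // warp-uniform condition
        output[tid] = input[tid] * 2;      // all threads in the warp take Path A
    } else {
        output[tid] = input[tid] / 2;      // or all take Path B
    }
}

Because the condition is the same for every thread in the warp, the hardware executes only one of the two paths per warp, giving the single-cycle timeline shown in the uniform execution diagram above.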