- Deep Learning and Machine Learning with CUDA
- Understanding Data Flow in Deep Learning: CPU, GPU, RAM, VRAM, Cache, and Disk Storage
- The GPU Hierarchical Structure
- CPU and GPU Comparison
- Linear Regression Algorithm
- Matrix Addition
- Matrix Multiplication: Naive, Optimized, and CUDA Approaches
- Neural Network: Multi-layer Network
- Vector Addition
- CUDA Kernel for Parallel Reduction
- Cumulative Sum
- Advanced CUDA Features and Optimization Techniques
Matrix Multiplication: Naive, Optimized, and CUDA Approaches
Matrix multiplication is a more complex operation than matrix addition. Given two matrices A of dimensions m × n and B of dimensions n × p, their product C has dimensions m × p and is computed as:
C[i, j] = A[i, 1] * B[1, j] + A[i, 2] * B[2, j] + … + A[i, n] * B[n, j]
that is, entry (i, j) of C is the dot product of row i of A with column j of B.
Naive Implementation of Matrix Multiplication
import numpy as np
A = np.array([[1, 2, 3], [4, 5, 6]])
B = np.array([[7, 8, 9], [10, 11, 12], [13, 14, 15]])
# Naive Matrix Multiplication
C = np.zeros((A.shape[0], B.shape[1]))
for i in range(A.shape[0]):
    for j in range(B.shape[1]):
        for k in range(A.shape[1]):
            C[i, j] += A[i, k] * B[k, j]  # accumulate the products, rather than overwrite
print(C)
[[ 66.  72.  78.]
 [156. 171. 186.]]
Naive Matrix Multiplication
The naive approach to matrix multiplication involves three nested loops: one for rows of matrix A,
one for columns of matrix B, and one for summing the element-wise products. Here’s how you can
implement this in Python:
import numpy as np
def naive_matrix_multiplication(A, B):
    m, n = A.shape
    _, p = B.shape
    C = np.zeros((m, p))
    for i in range(m):
        for j in range(p):
            for k in range(n):
                C[i, j] += A[i, k] * B[k, j]  # accumulate the partial products
    return C

# Define two matrices
A = np.array([[1, 2], [4, 5]])
B = np.array([[7, 8], [10, 11]])
C = naive_matrix_multiplication(A, B)
print(C)
[[27. 30.]
 [78. 87.]]
Optimized Matrix Multiplication
The naive matrix multiplication is inefficient because it repeatedly reads the same data from memory, causing memory latency issues. A more optimized approach involves reusing data (for example, through blocking or, on the GPU, shared memory) and leveraging matrix libraries like NumPy, which are highly optimized and make use of BLAS (Basic Linear Algebra Subprograms).
Using NumPy for Optimized Multiplication
NumPy’s dot function is a highly optimized implementation of matrix multiplication.
import numpy as np
A = np.array([[1, 2], [4, 5]])
B = np.array([[7, 8], [10, 11]])
C = np.dot(A, B)
print(C)
[[27 30]
 [78 87]]
NumPy uses highly optimized libraries under the hood (like OpenBLAS or Intel MKL) to perform
matrix multiplication efficiently, taking advantage of low-level optimizations such as data pre-fetching
and cache reuse.
Parallelizing Matrix Multiplication
To parallelize matrix multiplication manually, we can break the operation down by assigning different rows of matrix A (each multiplied against all of matrix B) to different threads. However, using optimized libraries like NumPy is often the best approach, as they already include multi-threading and SIMD (Single Instruction, Multiple Data) optimizations. Python threads can help here because NumPy's dot typically releases the GIL while the underlying BLAS routine runs, so the row computations can genuinely overlap.
Here’s an example of manually parallelizing matrix multiplication:
from concurrent.futures import ThreadPoolExecutor
import numpy as np

# Function to compute one row of matrix C
def multiply_row(A_row, B):
    return np.dot(A_row, B)

# Parallel Matrix Multiplication
def parallel_matrix_multiplication(A, B):
    m = A.shape[0]
    with ThreadPoolExecutor() as executor:
        # one task per row of A; each task receives the full matrix B
        results = list(executor.map(multiply_row, A, [B] * m))
    return np.array(results)

A = np.array([[1, 2], [4, 5]])
B = np.array([[7, 8], [10, 11]])
C = parallel_matrix_multiplication(A, B)
print(C)
[[27 30]
 [78 87]]
1. Work Division by Rows
- Each thread processes one complete row of the result matrix
- The thread takes a single row from A and multiplies it with the entire matrix B
- This is an embarrassingly parallel problem - each row can be computed independently
2. Thread Pool Execution
- ThreadPoolExecutor creates a pool of worker threads (typically one per CPU core)
- executor.map() distributes the rows across the available threads
- [B]*m creates m references to matrix B (one for each row task)
Execution Visualization
A = [[1, 2],     B = [[7,  8],
     [4, 5]]          [10, 11]]
THREAD 1: Process row 0
A[0] = [1, 2] × B = [1*7 + 2*10, 1*8 + 2*11] = [27, 30]
THREAD 2: Process row 1
A[1] = [4, 5] × B = [4*7 + 5*10, 4*8 + 5*11] = [78, 87]
Final result: [[27, 30],
               [78, 87]]
Memory Coalescing and Alignment
Memory coalescing is a technique that ensures efficient memory access by combining threads’ memory
requests into fewer transactions. When threads in a warp (a group of 32 threads in CUDA) access consecutive memory addresses, the GPU can combine these accesses into fewer, larger memory transactions. This maximizes memory bandwidth utilization and reduces the overall memory latency.
How it works: In CUDA, global memory is accessed by all threads, but accessing it efficiently is
key to performance. Without proper coalescing, each thread might make its own memory request,
leading to multiple, inefficient transactions. With memory coalescing, these requests are combined
into a single transaction when:
- Threads in a warp access consecutive addresses.
- The starting address is properly aligned.
Ensuring proper alignment: To achieve memory coalescing, we also need to ensure proper alignment of memory. The memory addresses accessed by each thread should be aligned to the size of the data type. For example, for an array of float (4 bytes), the addresses should be aligned on a 4-byte boundary. This alignment allows the GPU to fetch data in a single coalesced transaction.
An example of memory coalescing with a properly aligned float array:
__global__ void coalescedMemoryAccess(float *input, float *output) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    // Consecutive threads touch consecutive elements, so a warp's accesses
    // coalesce into a few wide memory transactions.
    output[tid] = input[tid];
}
In this case, assuming the input and output arrays are properly aligned, memory access will be coalesced.
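For contrast, here is a minimal sketch (not from the original example) of an access pattern that typically prevents coalescing; the kernel name and the stride parameter are hypothetical and only serve to spread neighboring threads across distant addresses:
__global__ void stridedMemoryAccess(float *input, float *output, int stride) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    // Neighboring threads read elements that are `stride` floats apart, so a
    // warp's 32 requests fall into many separate memory segments and cannot be
    // combined into a single wide transaction. The writes remain coalesced.
    output[tid] = input[tid * stride];
}
With a stride of 1 this degenerates to the coalesced case above; as the stride grows, the number of memory transactions per warp grows with it.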
Shared Memory Optimization
Shared memory is a small, fast, on-chip memory that can be used to significantly speed up access
times in CUDA programs. Shared memory is accessible by all threads within a block, making it useful
for data that needs to be accessed multiple times by different threads.
Key advantages of shared memory:
- Much faster than global memory.
- Reduces redundant global memory accesses.
- Allows efficient data sharing between threads within a block.
To use shared memory effectively, we need to (see the sketch after this list):
- Load frequently accessed data from global memory into shared memory.
- Minimize bank conflicts (situations where multiple threads attempt to access the same memory bank simultaneously).
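To make this concrete, here is a minimal sketch of the classic tiled matrix multiplication kernel, which applies both points above. The kernel name, the TILE_WIDTH value, and the assumption of square N x N matrices with N a multiple of TILE_WIDTH are illustrative simplifications, not part of the original example:
#define TILE_WIDTH 16
__global__ void tiledMatMul(const float *A, const float *B, float *C, int N) {
    // One TILE_WIDTH x TILE_WIDTH tile of A and of B is staged in shared memory.
    __shared__ float As[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Bs[TILE_WIDTH][TILE_WIDTH];

    int row = blockIdx.y * TILE_WIDTH + threadIdx.y;
    int col = blockIdx.x * TILE_WIDTH + threadIdx.x;
    float sum = 0.0f;

    // Walk over the tiles that contribute to this block's output tile.
    for (int t = 0; t < N / TILE_WIDTH; ++t) {
        // Each thread loads one element of each tile from global memory.
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE_WIDTH + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE_WIDTH + threadIdx.y) * N + col];
        __syncthreads();  // wait until the whole tile is in shared memory

        // Partial dot product computed entirely from fast shared memory.
        for (int k = 0; k < TILE_WIDTH; ++k)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // make sure the tile is no longer needed before reloading
    }

    C[row * N + col] = sum;
}
Each value loaded from global memory is reused TILE_WIDTH times from shared memory, which is where the saving over one global load per multiply comes from. When an access pattern would map many threads of a warp to the same bank, a common refinement is to pad the shared arrays (e.g., As[TILE_WIDTH][TILE_WIDTH + 1]).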
Reducing Warp Divergence for Performance
Warp divergence occurs when threads in a warp follow different execution paths due to conditional branching (e.g., if-else statements). When divergence occurs, the GPU must execute both paths serially, which reduces overall performance.
How to minimize warp divergence:
- Avoid branching whenever possible, or minimize its occurrence by structuring code carefully.
- Use predication (conditional assignments) instead of branching, which allows all threads to execute the same instruction with different outcomes based on conditions.
- Ensure branch conditions are uniform across threads in a warp, meaning all threads either take the same branch or none of them do (see the sketch after this list).
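As a hedged illustration of the last point, consider the following sketch of a hypothetical kernel (the name uniformBranchExample and the blockThreshold parameter are illustrative, not from the original text). Its branch condition depends only on blockIdx.x, which is identical for every thread in a block, so an entire warp always takes the same path and no divergence occurs:
__global__ void uniformBranchExample(const int *input, int *output, int blockThreshold) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    // blockIdx.x is the same for all threads in the block (and in each warp),
    // so the whole warp takes one side of the branch together.
    if (blockIdx.x < blockThreshold) {
        output[tid] = input[tid] * 2;
    } else {
        output[tid] = input[tid] / 2;
    }
}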
Warp Divergence Explained
The Problem
__global__ void warpDivergenceExample(int *input, int *output) {
    int tid = threadIdx.x;
    if (input[tid] > 0) {
        output[tid] = input[tid] * 2;   // Path A
    } else {
        output[tid] = input[tid] / 2;   // Path B
    }
}
What happens when threads in a warp have different conditions:
Warp with 8 threads (simplified example):
Thread: 0 1 2 3 4 5 6 7
Input: [5, -2, 8, -1, 3, -4, 7, -3]
Condition: T F T F T F T F (T = input[tid] > 0)
Execution:
- Threads 0,2,4,6 take TRUE branch → execute multiplication
- Threads 1,3,5,7 take FALSE branch → execute division
- GPU must execute BOTH paths SERIALLY for the same warp!
Performance Impact
Without divergence: 32 threads execute 1 instruction → 1 cycle
With divergence: 32 threads execute 2 paths → 2 cycles (100% slower!)
Visualizing Warp Divergence
Divergent Warp Execution
Warp Timeline (BAD - With Divergence); here, for simplicity, threads 0-3 satisfy the condition and threads 4-7 do not ([X] = active, [-] = idle/masked):
Cycle 1: [X][X][X][X][-][-][-][-] ← Threads 0-3 execute the TRUE path; threads 4-7 idle (masked out)
Cycle 2: [-][-][-][-][X][X][X][X] ← Threads 4-7 execute the FALSE path; threads 0-3 idle (masked out)
Total: 2 cycles for 8 threads
Uniform Warp Execution (Ideal)
Warp Timeline (GOOD - No Divergence):
Cycle 1: [X][X][X][X][X][X][X][X] ← All threads execute the same path
Total: 1 cycle for 8 threads (2x faster!)
The Solution: Predication
Using Ternary Operator for Predication
__global__ void warpDivergenceAvoided(int *input, int *output) {
    int tid = threadIdx.x;
    int value = input[tid];
    output[tid] = (value > 0) ? value * 2 : value / 2;
}
How predication works:
- All threads execute both the multiplication and division instructions
- Conditional selection happens at the register level, not branch level
- No branching = no serial execution of different paths