- Deep Learning and Machine Learning with CUDA
- Understanding Data Flow in Deep Learning: CPU, GPU, RAM, VRAM, Cache, and Disk Storage
- The GPU Hierarchical Structure
- CPU and GPU Comparision
- Linear Regression Algorithm
- Matrix Addition
- Matrix Multiplication: Naive, Optimized, and CUDA Approaches
- Neural Network: Multi-layer Network
- Vector Addition
- CUDA Kernel for parallel reduction
- Cumulative Sum
- Advanced CUDA Features and Optimization Techniques
Cumulative Sum (Prefix Sum)
The prefix sum (or cumulative sum) is a widely used parallel algorithm, especially for applications
like sorting and graph traversal. The goal of the prefix sum is to produce an output array where each
element is the sum of all previous elements in the input array.
- Inclusive Scan: In an inclusive scan, the element at position i in the output array is the sum of all
- Exclusive Scan: In an exclusive scan, the element at position i in the output array is the sum of all
elements up to and including position i in the input array.
elements before position i in the input array.
Initial Array:
[8, 3, 5, 7, 2, 9, 1, 6, 4, 10, 12, 15, 11, 14, 13, 16]
Cumulative Sum Calculation:
Position 0 : 8
Position 1 : 8 + 3 = 11
Position 2 : 11 + 5 = 16
Position 3 : 16 + 7 = 23
Position 4 : 23 + 2 = 25
Position 5 : 25 + 9 = 34
Position 6 : 34 + 1 = 35
Position 7 : 35 + 6 = 41
Position 8 : 41 + 4 = 45
Position 9 : 45 + 10 = 55
Position 10 : 55 + 12 = 67
Position 11 : 67 + 15 = 82
Position 12 : 82 + 11 = 93
Position 13 : 93 + 14 = 107
Position 14 : 107 + 13 = 120
Position 15 : 120 + 16 = 136
Final Cumulative Sum Array:
[8, 11, 16, 23, 25, 34, 35, 41, 45, 55, 67, 82, 93, 107, 120, 136]
Now, we can represent the process in terms of GPU threads, where each column corresponds to
a thread. The table below shows how the values are updated at each step of the cumulative sum
process.
Parallel Calculation Using GPU Threads To illustrate how this process can be parallelized using GPU
threads, we can use a standard parallel prefix sum algorithm. One common method is the Hillis-Steele scan, which updates the array in logarithmic steps. Here’s how it works for our array of 16 elements
Algorithm Overview
- Step d = 0 (distance = 1):
- ``
For all positions i ≥1, A[i] = A[i] + A[i−1]
``
For all positions i ≥2, A[i] = A[i] + A[i−2]* Step d = 1 (distance = 2): *
For all positions i ≥4, A[i] = A[i] + A[i−4]* Step d = 2 (distance = 4): *
For all positions i ≥8, A[i] = A[i] + A[i−8]* Step d = 3 (distance = 8): *
Detailed Steps with GPU Threads
We will represent each element as being processed by a separate thread Ti, where i ranges from 1 to 16.
Thead | T1 | T2 | T3 | T4 | T5 | T6 | T7 | T8 | T9 | T10 | T11 | T12 | T13 | T14 | T15 | T16 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Value | 8 | 3 | 5 | 7 | 2 | 9 | 1 | 6 | 4 | 10 | 12 | 15 | 11 | 14 | 13 | 16 |
Step 1 (d = 0, distance = 1) Operation:
If i ≥2, Valuei = Valuei + Valuei−1
T1 : 8 (unchanged)
T2 : 3 + 8 = 11
T3 : 5 + 3 = 8
T4 : 7 + 5 = 12
T5 : 2 + 7 = 9
T6 : 9 + 2 = 11
T7 : 1 + 9 = 10
T8 : 6 + 1 = 7
T9 : 4 + 6 = 10
T10 : 10 + 4 = 14
T11 : 12 + 10 = 22
T12 : 15 + 12 = 27
T13 : 11 + 15 = 26
T14 : 14 + 11 = 25
T15 : 13 + 14 = 27
T16 : 16 + 13 = 29
Resulting Values:
``| Thread | 8 | 11 | 8 | 12 | 9 | 11 | 10 | 7 | 10 | 14 | 22 | 27 | 26 | 25 | 27 | 29 |
``
Step 2 (d = 1, distance = 2)
Operation: If i ≥3, Valuei = Valuei + Valuei−2
T1 : 8 (unchanged)
T2 : 11 (unchanged)
T3 : 8 + 8 = 16
T4 : 12 + 11 = 23
T5 : 9 + 8 = 17
T6 : 11 + 12 = 23
T7 : 10 + 9 = 19
T8 : 7 + 11 = 18
T9 : 10 + 10 = 20
T10 : 14 + 7 = 21
T11 : 22 + 10 = 32
T12 : 27 + 14 = 41
T13 : 26 + 22 = 48
T14 : 25 + 27 = 52
T15 : 27 + 26 = 53
T16 : 29 + 25 = 54
Resulting Values:
``| Thread | 8 | 11 | 16 | 23 | 17 | 23 | 19 | 18 | 20 | 21 | 32 | 41 | 48 | 52 | 53 | 54 |
``
Step 3 (d = 2, distance = 4)
Operation: If i ≥5, Valuei = Valuei + Valuei−4
Updated Values:
T1 : 8 (unchanged)
T2 : 11 (unchanged)
T3 : 16 (unchanged)
T4 : 23 (unchanged)
T5 : 17 + 8 = 25
T6 : 23 + 11 = 34
T7 : 19 + 16 = 35
T8 : 18 + 23 = 41
T9 : 20 + 17 = 37
T10 : 21 + 23 = 44
T11 : 32 + 19 = 51
T12 : 41 + 18 = 59
T13 : 48 + 20 = 68
T14 : 52 + 21 = 73
T15 : 53 + 32 = 85
T16 : 54 + 41 = 95
Resulting Values:
``| Thread | 8 | 11 | 16 | 23 | 25 | 34 | 35 | 41 | 37 | 44 | 51 | 59 | 68 | 73 | 85 | 95 |
``
Step 4 (d = 3, distance = 8)
Operation: If i ≥9, Valuei = Valuei + Valuei−8
Updated Values:
T1 : 8 (unchanged)
T2 : 11 (unchanged)
T3 : 16 (unchanged)
T4 : 23 (unchanged)
T5 : 25 (unchanged)
T6 : 34 (unchanged)
T7 : 35 (unchanged)
T8 : 41 (unchanged)
T9 : 37 + 8 = 45
T10 : 44 + 11 = 55
T11 : 51 + 16 = 67
T12 : 59 + 23 = 82
T13 : 68 + 25 = 93
T14 : 73 + 34 = 107
T15 : 85 + 35 = 120
T16 : 95 + 41 = 136
Resulting Values (Final Cumulative Sum):
``| Thread | 8 | 11 | 16 | 23 | 25 | 34 | 35 | 41 | 45 | 55 | 67 | 82 | 93 | 107 | 120 | 136 |
``