Cumulative Sum (Prefix Sum)

The prefix sum (or cumulative sum) is a widely used parallel algorithm, especially for applications

like sorting and graph traversal. The goal of the prefix sum is to produce an output array where each

element is the sum of all previous elements in the input array.

Inclusive Scan: In an inclusive scan, the element at position i in the output array is the sum of all

elements up to and including position i in the input array.

Exclusive Scan: In an exclusive scan, the element at position i in the output array is the sum of all

elements before position i in the input array.

Initial Array:

[8, 3, 5, 7, 2, 9, 1, 6, 4, 10, 12, 15, 11, 14, 13, 16]

Cumulative Sum Calculation:

Position 0 : 8
Position 1 : 8 + 3 = 11
Position 2 : 11 + 5 = 16
Position 3 : 16 + 7 = 23
Position 4 : 23 + 2 = 25
Position 5 : 25 + 9 = 34
Position 6 : 34 + 1 = 35
Position 7 : 35 + 6 = 41
Position 8 : 41 + 4 = 45
Position 9 : 45 + 10 = 55
Position 10 : 55 + 12 = 67
Position 11 : 67 + 15 = 82
Position 12 : 82 + 11 = 93
Position 13 : 93 + 14 = 107
Position 14 : 107 + 13 = 120
Position 15 : 120 + 16 = 136

Final Cumulative Sum Array:

[8, 11, 16, 23, 25, 34, 35, 41, 45, 55, 67, 82, 93, 107, 120, 136]

Now, we can represent the process in terms of GPU threads, where each column corresponds to

a thread. The table below shows how the values are updated at each step of the cumulative sum

process.

Parallel Calculation Using GPU Threads To illustrate how this process can be parallelized using GPU

threads, we can use a standard parallel prefix sum algorithm. One common method is the Hillis-Steele scan, which updates the array in logarithmic steps. Here’s how it works for our array of 16 elements

Algorithm Overview

Step d = 0 (distance = 1):

``For all positions i ≥1, A[i] = A[i] + A[i−1] * Step d = 1 (distance = 2): * For all positions i ≥2, A[i] = A[i] + A[i−2] * Step d = 2 (distance = 4): * For all positions i ≥4, A[i] = A[i] + A[i−4] * Step d = 3 (distance = 8): * For all positions i ≥8, A[i] = A[i] + A[i−8]``

Detailed Steps with GPU Threads

We will represent each element as being processed by a separate thread Ti, where i ranges from 1 to 16.

Thead	T1	T2	T3	T4	T5	T6	T7	T8	T9	T10	T11	T12	T13	T14	T15	T16
Value	8	3	5	7	2	9	1	6	4	10	12	15	11	14	13	16

Step 1 (d = 0, distance = 1) Operation:

If i ≥2, Valuei = Valuei + Valuei−1

T1 : 8 (unchanged)
T2 : 3 + 8 = 11
T3 : 5 + 3 = 8
T4 : 7 + 5 = 12
T5 : 2 + 7 = 9
T6 : 9 + 2 = 11
T7 : 1 + 9 = 10
T8 : 6 + 1 = 7
T9 : 4 + 6 = 10
T10 : 10 + 4 = 14
T11 : 12 + 10 = 22
T12 : 15 + 12 = 27
T13 : 11 + 15 = 26
T14 : 14 + 11 = 25
T15 : 13 + 14 = 27
T16 : 16 + 13 = 29

Resulting Values:

``| Thread | 8 | 11 | 8 | 12 | 9 | 11 | 10 | 7 | 10 | 14 | 22 | 27 | 26 | 25 | 27 | 29 |``

Step 2 (d = 1, distance = 2)

Operation: If i ≥3, Valuei = Valuei + Valuei−2

T1 : 8 (unchanged)
T2 : 11 (unchanged)
T3 : 8 + 8 = 16
T4 : 12 + 11 = 23
T5 : 9 + 8 = 17
T6 : 11 + 12 = 23
T7 : 10 + 9 = 19
T8 : 7 + 11 = 18
T9 : 10 + 10 = 20
T10 : 14 + 7 = 21
T11 : 22 + 10 = 32
T12 : 27 + 14 = 41
T13 : 26 + 22 = 48
T14 : 25 + 27 = 52
T15 : 27 + 26 = 53
T16 : 29 + 25 = 54

Resulting Values:

``| Thread | 8 | 11 | 16 | 23 | 17 | 23 | 19 | 18 | 20 | 21 | 32 | 41 | 48 | 52 | 53 | 54 |``

Step 3 (d = 2, distance = 4)

Operation: If i ≥5, Valuei = Valuei + Valuei−4

Updated Values:

T1 : 8 (unchanged)
T2 : 11 (unchanged)
T3 : 16 (unchanged)
T4 : 23 (unchanged)
T5 : 17 + 8 = 25
T6 : 23 + 11 = 34
T7 : 19 + 16 = 35
T8 : 18 + 23 = 41
T9 : 20 + 17 = 37
T10 : 21 + 23 = 44
T11 : 32 + 19 = 51
T12 : 41 + 18 = 59
T13 : 48 + 20 = 68
T14 : 52 + 21 = 73
T15 : 53 + 32 = 85
T16 : 54 + 41 = 95

Resulting Values:

``| Thread | 8 | 11 | 16 | 23 | 25 | 34 | 35 | 41 | 37 | 44 | 51 | 59 | 68 | 73 | 85 | 95 |``

Step 4 (d = 3, distance = 8)

Operation: If i ≥9, Valuei = Valuei + Valuei−8

Updated Values:

T1 : 8 (unchanged)
T2 : 11 (unchanged)
T3 : 16 (unchanged)
T4 : 23 (unchanged)
T5 : 25 (unchanged)
T6 : 34 (unchanged)
T7 : 35 (unchanged)
T8 : 41 (unchanged)
T9 : 37 + 8 = 45
T10 : 44 + 11 = 55
T11 : 51 + 16 = 67
T12 : 59 + 23 = 82
T13 : 68 + 25 = 93
T14 : 73 + 34 = 107
T15 : 85 + 35 = 120
T16 : 95 + 41 = 136

Resulting Values (Final Cumulative Sum):

``| Thread | 8 | 11 | 16 | 23 | 25 | 34 | 35 | 41 | 45 | 55 | 67 | 82 | 93 | 107 | 120 | 136 |``

In [ ]: