Cumulative Sum


Cumulative Sum (Prefix Sum)

The prefix sum (or cumulative sum) is a widely used parallel algorithm, especially for applications

like sorting and graph traversal. The goal of the prefix sum is to produce an output array where each

element is the sum of all previous elements in the input array.

  • Inclusive Scan: In an inclusive scan, the element at position i in the output array is the sum of all
  • elements up to and including position i in the input array.

    • Exclusive Scan: In an exclusive scan, the element at position i in the output array is the sum of all
    • elements before position i in the input array.

Initial Array:

[8, 3, 5, 7, 2, 9, 1, 6, 4, 10, 12, 15, 11, 14, 13, 16]

Cumulative Sum Calculation:

Position 0 : 8
Position 1 : 8 + 3 = 11
Position 2 : 11 + 5 = 16
Position 3 : 16 + 7 = 23
Position 4 : 23 + 2 = 25
Position 5 : 25 + 9 = 34
Position 6 : 34 + 1 = 35
Position 7 : 35 + 6 = 41
Position 8 : 41 + 4 = 45
Position 9 : 45 + 10 = 55
Position 10 : 55 + 12 = 67
Position 11 : 67 + 15 = 82
Position 12 : 82 + 11 = 93
Position 13 : 93 + 14 = 107
Position 14 : 107 + 13 = 120
Position 15 : 120 + 16 = 136

Final Cumulative Sum Array:

[8, 11, 16, 23, 25, 34, 35, 41, 45, 55, 67, 82, 93, 107, 120, 136]

Now, we can represent the process in terms of GPU threads, where each column corresponds to

a thread. The table below shows how the values are updated at each step of the cumulative sum

process.

Parallel Calculation Using GPU Threads To illustrate how this process can be parallelized using GPU

threads, we can use a standard parallel prefix sum algorithm. One common method is the Hillis-Steele scan, which updates the array in logarithmic steps. Here’s how it works for our array of 16 elements

Algorithm Overview

  • Step d = 0 (distance = 1):
    • `` For all positions i ≥1, A[i] = A[i] + A[i−1]
      * Step d = 1 (distance = 2):
          *
      For all positions i ≥2, A[i] = A[i] + A[i−2]
      * Step d = 2 (distance = 4):
          *
      For all positions i ≥4, A[i] = A[i] + A[i−4]
      * Step d = 3 (distance = 8):
          *
      For all positions i ≥8, A[i] = A[i] + A[i−8]
      ``

Detailed Steps with GPU Threads

We will represent each element as being processed by a separate thread Ti, where i ranges from 1 to 16.

TheadT1T2T3T4T5T6T7T8T9T10T11T12T13T14T15T16
Value83572916410121511141316

Step 1 (d = 0, distance = 1) Operation:

If i ≥2, Valuei = Valuei + Valuei−1

T1 : 8 (unchanged)
T2 : 3 + 8 = 11
T3 : 5 + 3 = 8
T4 : 7 + 5 = 12
T5 : 2 + 7 = 9
T6 : 9 + 2 = 11
T7 : 1 + 9 = 10
T8 : 6 + 1 = 7
T9 : 4 + 6 = 10
T10 : 10 + 4 = 14
T11 : 12 + 10 = 22
T12 : 15 + 12 = 27
T13 : 11 + 15 = 26
T14 : 14 + 11 = 25
T15 : 13 + 14 = 27
T16 : 16 + 13 = 29

Resulting Values:

``| Thread | 8 | 11 | 8 | 12 | 9 | 11 | 10 | 7 | 10 | 14 | 22 | 27 | 26 | 25 | 27 | 29 |``

Step 2 (d = 1, distance = 2)

Operation: If i ≥3, Valuei = Valuei + Valuei−2

T1 : 8 (unchanged)
T2 : 11 (unchanged)
T3 : 8 + 8 = 16
T4 : 12 + 11 = 23
T5 : 9 + 8 = 17
T6 : 11 + 12 = 23
T7 : 10 + 9 = 19
T8 : 7 + 11 = 18
T9 : 10 + 10 = 20
T10 : 14 + 7 = 21
T11 : 22 + 10 = 32
T12 : 27 + 14 = 41
T13 : 26 + 22 = 48
T14 : 25 + 27 = 52
T15 : 27 + 26 = 53
T16 : 29 + 25 = 54

Resulting Values:

``| Thread | 8 | 11 | 16 | 23 | 17 | 23 | 19 | 18 | 20 | 21 | 32 | 41 | 48 | 52 | 53 | 54 |``

Step 3 (d = 2, distance = 4)

Operation: If i ≥5, Valuei = Valuei + Valuei−4

Updated Values:

T1 : 8 (unchanged)
T2 : 11 (unchanged)
T3 : 16 (unchanged)
T4 : 23 (unchanged)
T5 : 17 + 8 = 25
T6 : 23 + 11 = 34
T7 : 19 + 16 = 35
T8 : 18 + 23 = 41
T9 : 20 + 17 = 37
T10 : 21 + 23 = 44
T11 : 32 + 19 = 51
T12 : 41 + 18 = 59
T13 : 48 + 20 = 68
T14 : 52 + 21 = 73
T15 : 53 + 32 = 85
T16 : 54 + 41 = 95

Resulting Values:

``| Thread | 8 | 11 | 16 | 23 | 25 | 34 | 35 | 41 | 37 | 44 | 51 | 59 | 68 | 73 | 85 | 95 |``

Step 4 (d = 3, distance = 8)

Operation: If i ≥9, Valuei = Valuei + Valuei−8

Updated Values:

T1 : 8 (unchanged)
T2 : 11 (unchanged)
T3 : 16 (unchanged)
T4 : 23 (unchanged)
T5 : 25 (unchanged)
T6 : 34 (unchanged)
T7 : 35 (unchanged)
T8 : 41 (unchanged)
T9 : 37 + 8 = 45
T10 : 44 + 11 = 55
T11 : 51 + 16 = 67
T12 : 59 + 23 = 82
T13 : 68 + 25 = 93
T14 : 73 + 34 = 107
T15 : 85 + 35 = 120
T16 : 95 + 41 = 136

Resulting Values (Final Cumulative Sum):

``| Thread | 8 | 11 | 16 | 23 | 25 | 34 | 35 | 41 | 45 | 55 | 67 | 82 | 93 | 107 | 120 | 136 |``

In [ ]:

All systems normal

© 2025 2023 Sanjeeb KC. All rights reserved.