# CUDA Kernel for Parallel Reduction

Summing a large array is another common problem that benefits from parallelization on a GPU. The basic idea behind parallel reduction is to split the array into smaller chunks and sum them in parallel. For example, given an array of N elements, you can have N/2 threads each sum two elements, reducing the problem size by half. This process repeats until there is only one element left: the sum of the entire array.
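The same halving idea can be sketched sequentially on the CPU first. This is only an illustrative sketch (not the CUDA kernel discussed below): each pass sums adjacent pairs in place, so one pass over N values leaves N/2 partial sums.

```
#include <stdio.h>

int main(void) {
    float a[16] = {8, 3, 5, 7, 2, 9, 1, 6, 4, 10, 12, 15, 11, 14, 13, 16};
    int n = 16;

    // Each pass sums adjacent pairs, halving the number of elements.
    // On a GPU, the additions inside one pass would run in parallel.
    while (n > 1) {
        for (int k = 0; k < n / 2; k++) {
            a[k] = a[2 * k] + a[2 * k + 1];
        }
        n /= 2;
    }

    printf("Sum = %.1f\n", a[0]);  // Prints 136.0 for this array
    return 0;
}
```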

## Summing Arrays: Parallel Reduction Example

Consider an array of 16 random numbers. The array is reduced by summing pairs of elements in parallel until a single sum remains. The process is as follows:

```
Array = [8, 3, 5, 7, 2, 9, 1, 6, 4, 10, 12, 15, 11, 14, 13, 16]
```

In this example, the array is reduced from 16 elements down to 8, then to 4, then to 2, and finally to 1, which is the sum of the entire array. At each step, pairs of elements are summed in parallel. The final sum of the array is 136.

In terms of actual GPU threads, each column corresponds to one thread, and the table below shows how the values change at each step of the reduction. Empty cells indicate that the corresponding thread has completed its task in that step and is no longer active.
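Tracing the reduction on the array above gives the following partial sums (each step halves the number of active values; blank cells are positions that are no longer updated):

```
Thread:    0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15
Start:     8    3    5    7    2    9    1    6    4   10   12   15   11   14   13   16
Step 1:   11        12        11         7        14        27        25        29
Step 2:   23                  18                  41                  54
Step 3:   41                                      95
Step 4:  136
```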

## Code Breakdown with Detailed Explanation

### Kernel Definition

```
__global__ void reduceSum(float *input, float *output, int N) {
```

* __global__: This function runs on the GPU but is called from the CPU
* input: Pointer to the large array we want to sum (in GPU memory)
* output: Pointer where each block will store its partial sum
* N: Total number of elements in the input array

### 1. Shared Memory Declaration

```
extern __shared__ float sdata[];
```
* What is Shared Memory?
    * Ultra-fast memory shared by all threads in the same block
    * ~100x faster than global GPU memory
    * Limited size (typically 16-48KB per block)
    * Used for thread communication within a block
* Why use shared memory here?
    * Each thread needs to access other threads' data during reduction
    * Global memory accesses would be too slow
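The extern keyword makes this a dynamically sized shared-memory array: its size is not fixed at compile time but is supplied as the third launch parameter (the sharedMem value used later in the main function). A quick sketch of the two declaration styles:

```
// Dynamic shared memory: size chosen at launch time via the third <<<...>>> argument
extern __shared__ float sdata[];

// Static alternative: size fixed at compile time, e.g. for a 256-thread block
// __shared__ float sdata_static[256];
```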
### 2. Thread Identification

```
int tid = threadIdx.x;                  // Thread ID within the block (0 to 255)
int i = blockIdx.x * blockDim.x + tid;  // Global thread ID
```
* Example: If we have 1024 elements and 256 threads per block:
    * Block 0: i = 0 to 255
    * Block 1: i = 256 to 511
    * Block 2: i = 512 to 767
    * Block 3: i = 768 to 1023
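As a quick check on the formula for a few individual threads in that same configuration (256 threads per block):

```
// i = blockIdx.x * blockDim.x + threadIdx.x
// blockIdx.x = 0, threadIdx.x = 0    ->  i = 0 * 256 + 0   = 0
// blockIdx.x = 1, threadIdx.x = 0    ->  i = 1 * 256 + 0   = 256
// blockIdx.x = 3, threadIdx.x = 255  ->  i = 3 * 256 + 255 = 1023
```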
### 3. Loading Data into Shared Memory

```
sdata[tid] = (i < N) ? input[i] : 0.0f;
__syncthreads();
```

* What happens:
    * Each thread loads one element from global memory into shared memory
    * Thread 0 loads input[0] into sdata[0]
    * Thread 1 loads input[1] into sdata[1]
    * ...
    * Thread 255 loads input[255] into sdata[255]
    * Threads whose global index falls past the end of the array (i >= N) store 0.0f instead
* __syncthreads() - CRITICAL!
    * All threads in the block wait here until everyone finishes loading
    * Ensures all data is available before starting the reduction
* After this step, shared memory looks like:

```
Thread:   0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15
sdata:   [8, 3, 5, 7, 2, 9, 1, 6, 4, 10,12,15,11,14,13,16]
```
### 4. Parallel Reduction Loop (The Core Algorithm - Adjacent-Pair Reduction)

```
for (int stride = 1; stride < blockDim.x; stride *= 2) {
    int index = 2 * stride * tid;                   // Slot this thread accumulates into
    if (index < blockDim.x) {
        if (index + stride < blockDim.x) {
            sdata[index] += sdata[index + stride];  // Add the neighbor 'stride' slots away
        }
    }
    __syncthreads();                                // Wait for all threads before the next round
}
```
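Putting the pieces together, the full kernel looks roughly like the sketch below. The breakdown above stops at the reduction loop, so the final write (thread 0 storing the block's partial sum into output) is an assumption based on the description of output as the per-block partial sum:

```
__global__ void reduceSum(float *input, float *output, int N) {
    extern __shared__ float sdata[];

    int tid = threadIdx.x;                  // Thread ID within the block
    int i = blockIdx.x * blockDim.x + tid;  // Global thread ID

    // Load one element per thread into shared memory (0 for out-of-range threads)
    sdata[tid] = (i < N) ? input[i] : 0.0f;
    __syncthreads();

    // Adjacent-pair reduction within the block
    for (int stride = 1; stride < blockDim.x; stride *= 2) {
        int index = 2 * stride * tid;
        if (index < blockDim.x) {
            if (index + stride < blockDim.x) {
                sdata[index] += sdata[index + stride];
            }
        }
        __syncthreads();
    }

    // Assumed final step: thread 0 writes this block's partial sum
    if (tid == 0) {
        output[blockIdx.x] = sdata[0];
    }
}
```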
## Code Breakdown - Main Function

### Step 1: Define the Input Data

```
const int N = 16;
float h_input[] = {8, 3, 5, 7, 2, 9, 1, 6, 4, 10, 12, 15, 11, 14, 13, 16};
float h_output[1];  // To store the final sum
```

* h_input: Our array in CPU memory ("h" for host)
* h_output: Will store the final result

### Step 2: Allocate GPU Memory

```
float *d_input, *d_output;
cudaMalloc(&d_input, N * sizeof(float));   // 16 floats on GPU
cudaMalloc(&d_output, sizeof(float));      // 1 float on GPU
```

* d_input: GPU copy of our array ("d" for device)
* d_output: GPU memory for the result

### Step 3: Copy Data to GPU

```
cudaMemcpy(d_input, h_input, N * sizeof(float), cudaMemcpyHostToDevice);
```

* Copies the entire array from CPU → GPU
* cudaMemcpyHostToDevice: Direction of the copy

### Step 4: Configure and Launch Kernel

```
int threads = 16;
int sharedMem = threads * sizeof(float);  // 16 × 4 bytes = 64 bytes

reduceSum<<<1, threads, sharedMem>>>(d_input, d_output, N);
```

* Kernel Launch Syntax: <<<blocks, threads, sharedMemBytes>>>
    * 1: one block (we only need one block for 16 elements)
    * threads: 16 threads in that block
    * sharedMem: 64 bytes of shared memory allocated for the block

### Step 5: Copy Result from GPU to CPU

```
cudaMemcpy(h_output, d_output, sizeof(float), cudaMemcpyDeviceToHost);
```

* cudaMemcpy(): CUDA function to copy memory between CPU and GPU
* h_output: Destination (CPU memory where we want the result)
* d_output: Source (GPU memory where the result was computed)
* sizeof(float): Copy exactly 4 bytes (the size of one float)
* cudaMemcpyDeviceToHost: Direction of the copy - from GPU → CPU

```
GPU VRAM (Device Memory)     CPU RAM (Host Memory)
------------------------     ---------------------
d_output[0] = 136.0     →    h_output[0] = 136.0
```

* The GPU computed the sum and stored it in d_output[0] (GPU memory)
* But we can't directly use GPU memory from CPU code
* We need to copy it back to CPU memory to print or use it
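Assembled from the steps above, the complete host side might look like the sketch below (assuming the reduceSum kernel shown earlier is in the same .cu file). The #includes, the printf, and the cudaFree cleanup at the end are additions not shown in the breakdown:

```
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const int N = 16;
    float h_input[] = {8, 3, 5, 7, 2, 9, 1, 6, 4, 10, 12, 15, 11, 14, 13, 16};
    float h_output[1];  // To store the final sum

    // Allocate GPU memory
    float *d_input, *d_output;
    cudaMalloc(&d_input, N * sizeof(float));
    cudaMalloc(&d_output, sizeof(float));

    // Copy the input array from CPU to GPU
    cudaMemcpy(d_input, h_input, N * sizeof(float), cudaMemcpyHostToDevice);

    // Launch one block of 16 threads with 64 bytes of dynamic shared memory
    int threads = 16;
    int sharedMem = threads * sizeof(float);
    reduceSum<<<1, threads, sharedMem>>>(d_input, d_output, N);

    // Copy the result back to the CPU and print it
    cudaMemcpy(h_output, d_output, sizeof(float), cudaMemcpyDeviceToHost);
    printf("Sum = %.1f\n", h_output[0]);  // Expected: 136.0

    // Free GPU memory
    cudaFree(d_input);
    cudaFree(d_output);
    return 0;
}
```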