- Deep Learning and Machine Learning with CUDA
- Understanding Data Flow in Deep Learning: CPU, GPU, RAM, VRAM, Cache, and Disk Storage
- The GPU Hierarchical Structure
- CPU and GPU Comparison
- Linear Regression Algorithm
- Matrix Addition
- Matrix Multiplication: Naive, Optimized, and CUDA Approaches
- Neural Network: Multi-layer Network
- Vector Addition
- CUDA Kernel for parallel reduction
- Cumulative Sum
- Advanced CUDA Features and Optimization Techniques
# CUDA Kernel for Parallel Reduction
Summing a large array is another common problem that benefits from parallelization on a GPU. The
basic idea behind parallel reduction is to split the array into smaller chunks and sum them in parallel.
For example, given an array of N elements, you can have N/2 threads each sum two elements,
reducing the problem size by half. This process repeats until there’s only one element left—the sum of
the entire array.
## Summing Arrays: Parallel Reduction Example
Consider an array of 16 random numbers. The array is reduced by summing pairs of elements in
parallel until a single sum remains. The process is as follows:
```
Array = [8, 3, 5, 7, 2, 9, 1, 6, 4, 10, 12, 15, 11, 14, 13, 16]
```
In this example, the array is reduced from 16 elements down to 8, then to 4, then to 2, and finally to
1, which is the sum of the entire array. At each step, pairs of elements are summed in parallel.
The final sum of the array is 136.
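To make the steps concrete, the values can be traced as follows. This trace is reconstructed to match the adjacent-pair scheme used by the kernel later in this section (each surviving partial sum stays in the left slot of its pair), so the exact placement of the intermediate values is illustrative:
```
Start:  [8,  3,  5,  7,  2,  9,  1,  6,  4, 10, 12, 15, 11, 14, 13, 16]   16 elements
Step 1: [11,     12,     11,      7,    14,     27,     25,     29    ]    8 partial sums
Step 2: [23,             18,            41,             54            ]    4 partial sums
Step 3: [41,                            95                            ]    2 partial sums
Step 4: [136                                                          ]    final sum
```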
On the GPU, each of the 16 starting positions corresponds to a thread, and each row of the trace shows how the values change at one step of the reduction. Blank positions indicate that the corresponding thread has completed its work at that step and is no longer active.
## Code Breakdown with Detailed Explanation
### Kernel Definition
```
__global__ void reduceSum(float *input, float *output, int N) {
```
* `__global__`: This function runs on the GPU but is called from the CPU
* `input`: Pointer to the large array we want to sum (in GPU memory)
* `output`: Pointer where each block will store its partial sum
* `N`: Total number of elements in the input array
### 1. Shared Memory Declaration
```
    extern __shared__ float sdata[];
```
* What is shared memory?
  * Ultra-fast memory shared by all threads in the same block
  * ~100x faster than global GPU memory
  * Limited size (typically 16-48 KB per block)
  * Used for thread communication within a block
* Why use shared memory here?
  * Each thread needs to access other threads' data during the reduction
  * Global memory accesses would be too slow
### 2. Thread Identification
```
    int tid = threadIdx.x;                    // Thread ID within block (0 to 255)
    int i = blockIdx.x * blockDim.x + tid;    // Global thread ID
```
* Example: If we have 1024 elements and 256 threads per block:
  * Block 0: i = 0 to 255
  * Block 1: i = 256 to 511
  * Block 2: i = 512 to 767
  * Block 3: i = 768 to 1023
### 3. Loading Data into Shared Memory
```
    sdata[tid] = (i < N) ? input[i] : 0.0f;
    __syncthreads();
```
* What happens:
  * Each thread loads one element from global memory into shared memory
  * Thread 0 loads input[0] into sdata[0]
  * Thread 1 loads input[1] into sdata[1]
  * ...
  * Thread 255 loads input[255] into sdata[255]
* `__syncthreads()` - CRITICAL!
  * All threads in the block wait here until everyone finishes loading
  * Ensures all data is available before starting the reduction
* After this step, shared memory mirrors `h_input`:
```
Thread: 0  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15
sdata: [8, 3, 5, 7, 2, 9, 1, 6, 4, 10, 12, 15, 11, 14, 13, 16]
```
### 4. Parallel Reduction Loop (The Core Algorithm - Adjacent-pair reduction)
```
    for (int stride = 1; stride < blockDim.x; stride *= 2) {
        int index = 2 * stride * tid;
        if (index < blockDim.x) {
            if (index + stride < blockDim.x) {
                sdata[index] += sdata[index + stride];
            }
        }
        __syncthreads();
    }
```
* Each iteration doubles the stride: at stride 1 every active thread adds its right-hand neighbour, at stride 2 it adds the element two slots away, and so on, halving the number of partial sums until sdata[0] holds the sum of the whole block.
* The `__syncthreads()` at the end of each iteration guarantees that all additions for one stride finish before the next stride begins.
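Putting the fragments above together, the kernel reads as one block. Note that the final write of the block's result to `output` is not shown in the fragments above; the `if (tid == 0)` step below is a reconstruction based on the description of `output` as the place where each block stores its partial sum, so treat it as a sketch rather than the author's exact code.
```
__global__ void reduceSum(float *input, float *output, int N) {
    extern __shared__ float sdata[];           // dynamically sized shared memory

    int tid = threadIdx.x;                     // thread ID within the block
    int i = blockIdx.x * blockDim.x + tid;     // global thread ID

    // Each thread loads one element (or 0 if past the end of the array)
    sdata[tid] = (i < N) ? input[i] : 0.0f;
    __syncthreads();

    // Adjacent-pair reduction: the number of partial sums halves each iteration
    for (int stride = 1; stride < blockDim.x; stride *= 2) {
        int index = 2 * stride * tid;
        if (index < blockDim.x) {
            if (index + stride < blockDim.x) {
                sdata[index] += sdata[index + stride];
            }
        }
        __syncthreads();
    }

    // Assumed final step (not shown in the fragments above):
    // thread 0 writes this block's partial sum to output
    if (tid == 0) {
        output[blockIdx.x] = sdata[0];
    }
}
```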
## Code Breakdown - Main Function
### Step 1: Define the Input Data
```
const int N = 16;
float h_input[] = {8, 3, 5, 7, 2, 9, 1, 6, 4, 10, 12, 15, 11, 14, 13, 16};
float h_output[1];   // To store the final sum
```
* `h_input`: Our array in CPU memory ("h" for host)
* `h_output`: Will store the final result
### Step 2: Allocate GPU Memory
```
float *d_input, *d_output;
cudaMalloc(&d_input, N * sizeof(float));   // 16 floats on GPU
cudaMalloc(&d_output, sizeof(float));      // 1 float on GPU
```
* `d_input`: GPU copy of our array ("d" for device)
* `d_output`: GPU memory for the result
### Step 3: Copy Data to GPU
```
cudaMemcpy(d_input, h_input, N * sizeof(float), cudaMemcpyHostToDevice);
```
* Copies the entire array from CPU → GPU
* `cudaMemcpyHostToDevice`: Direction of the copy
### Step 4: Configure and Launch Kernel
Kernel launch syntax:
```
int threads = 16;
int sharedMem = threads * sizeof(float);   // 16 × 4 bytes = 64 bytes
reduceSum<<<1, threads, sharedMem>>>(d_input, d_output, N);
```
* `1`: 1 block (we only need one block for 16 elements)
* `threads`: 16 threads in that block
* `sharedMem`: 64 bytes of shared memory allocated
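For larger inputs, one block is not enough. The original walkthrough stops at a single block, so the helper below is only a sketch of the usual pattern: launch one block per 256-element chunk and let each block write its own partial sum into a hypothetical buffer `d_partial`, which then still needs a second reduction pass.
```
// Hypothetical helper (not part of the original walkthrough):
// launch reduceSum over an array of arbitrary size n, one 256-thread block per chunk.
void launchReduceSum(float *d_in, float *d_partial, int n) {
    int threads   = 256;                          // threads per block
    int blocks    = (n + threads - 1) / threads;  // enough blocks to cover n elements
    int sharedMem = threads * sizeof(float);      // one float of shared memory per thread
    reduceSum<<<blocks, threads, sharedMem>>>(d_in, d_partial, n);
    // d_partial now holds 'blocks' partial sums, which still need to be reduced,
    // e.g. by a second reduceSum launch or by summing them on the CPU.
}
```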
### Step 5: Copy Result from GPU to CPU
```
cudaMemcpy(h_output, d_output, sizeof(float), cudaMemcpyDeviceToHost);
```
* `cudaMemcpy()`: CUDA function to copy memory between CPU and GPU
* `h_output`: Destination (CPU memory where we want the result)
* `d_output`: Source (GPU memory where the result was computed)
* `sizeof(float)`: Copy exactly 4 bytes (the size of one float)
* `cudaMemcpyDeviceToHost`: Direction of the copy - from GPU → CPU
```
GPU VRAM (Device Memory)        CPU RAM (Host Memory)
------------------------        ---------------------
d_output[0] = 136.0       →     h_output[0] = 136.0
```
- The GPU computed the sum and stored it in d_output[0] (GPU memory)
- But we can't directly use GPU memory from CPU code
- We need to copy it back to CPU memory to print or use it
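For reference, the five host-side steps can be assembled into a single `main` function, assuming the `reduceSum` kernel defined earlier is in the same file. This is a sketch: the `printf`, the `cudaFree` cleanup, and the lack of error checking are additions for completeness and are not shown in the original fragments.
```
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Step 1: Define the input data
    const int N = 16;
    float h_input[] = {8, 3, 5, 7, 2, 9, 1, 6, 4, 10, 12, 15, 11, 14, 13, 16};
    float h_output[1];                        // To store the final sum

    // Step 2: Allocate GPU memory
    float *d_input, *d_output;
    cudaMalloc(&d_input, N * sizeof(float));  // 16 floats on the GPU
    cudaMalloc(&d_output, sizeof(float));     // 1 float on the GPU

    // Step 3: Copy data to the GPU
    cudaMemcpy(d_input, h_input, N * sizeof(float), cudaMemcpyHostToDevice);

    // Step 4: Configure and launch the kernel
    int threads = 16;
    int sharedMem = threads * sizeof(float);  // 64 bytes of shared memory
    reduceSum<<<1, threads, sharedMem>>>(d_input, d_output, N);

    // Step 5: Copy the result back to the CPU and use it
    cudaMemcpy(h_output, d_output, sizeof(float), cudaMemcpyDeviceToHost);
    printf("Sum = %.1f\n", h_output[0]);      // expected: 136.0

    // Clean up GPU memory (assumed; not shown in the original fragments)
    cudaFree(d_input);
    cudaFree(d_output);
    return 0;
}
```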