- Deep Learning and Machine Learning with CUDA
- Understanding Data Flow in Deep Learning: CPU, GPU, RAM, VRAM, Cache, and Disk Storage
- The GPU Hierarchical Structure
- CPU and GPU Comparison
- Linear Regression Algorithm
- Matrix Addition
- Matrix Multiplication: Naive, Optimized, and CUDA Approaches
- Neural Network: Multi-layer Network
- Vector Addition
- CUDA Kernel for parallel reduction
- Cumulative Sum
- Advanced CUDA Features and Optimization Techniques
# CUDA Kernel for Parallel Reduction
Summing a large array is another common problem that benefits from parallelization on a GPU. The
basic idea behind parallel reduction is to split the array into smaller chunks and sum them in parallel.
For example, given an array of N elements, you can have N/2 threads each sum two elements,
reducing the problem size by half. This process repeats until there’s only one element left—the sum of
the entire array.
## Summing Arrays: Parallel Reduction Example
Consider an array of 16 random numbers. The array is reduced by summing pairs of elements in
parallel until a single sum remains. The process is as follows:
```
Array = [8, 3, 5, 7, 2, 9, 1, 6, 4, 10, 12, 15, 11, 14, 13, 16]
```
In this example, the array is reduced from 16 elements down to 8, then to 4, then to 2, and finally to
1, which is the sum of the entire array. At each step, pairs of elements are summed in parallel.
The final sum of the array is 136.
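To make the steps concrete, the values can be traced as follows. This trace is reconstructed to match the adjacent-pair scheme used by the kernel later in this section (each surviving partial sum stays in the left slot of its pair), so the exact placement of the intermediate values is illustrative:
```
Start:  [8,  3,  5,  7,  2,  9,  1,  6,  4, 10, 12, 15, 11, 14, 13, 16]   16 elements
Step 1: [11,     12,     11,      7,    14,     27,     25,     29    ]    8 partial sums
Step 2: [23,             18,            41,             54            ]    4 partial sums
Step 3: [41,                            95                            ]    2 partial sums
Step 4: [136                                                          ]    final sum
```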
On the GPU, each of the 16 starting positions corresponds to a thread, and each row of the trace shows how the values change at one step of the reduction. Blank positions indicate that the corresponding thread has completed its work at that step and is no longer active.
## Code Breakdown with Detailed Explanation
### Kernel Definition
```
__global__ void reduceSum(float *input, float *output, int N) {
```
* `__global__`: This function runs on the GPU but is called from the CPU
* `input`: Pointer to the large array we want to sum (in GPU memory)
* `output`: Pointer where each block will store its partial sum
* `N`: Total number of elements in the input array
### 1. Shared Memory Declaration
```
    extern __shared__ float sdata[];
```
* What is shared memory?
  * Ultra-fast memory shared by all threads in the same block
  * ~100x faster than global GPU memory
  * Limited size (typically 16-48 KB per block)
  * Used for thread communication within a block
* Why use shared memory here?
  * Each thread needs to access other threads' data during the reduction
  * Global memory accesses would be too slow
### 2. Thread Identification
```
    int tid = threadIdx.x;                    // Thread ID within block (0 to 255)
    int i = blockIdx.x * blockDim.x + tid;    // Global thread ID
```
* Example: If we have 1024 elements and 256 threads per block:
  * Block 0: i = 0 to 255
  * Block 1: i = 256 to 511
  * Block 2: i = 512 to 767
  * Block 3: i = 768 to 1023
### 3. Loading Data into Shared Memory
```
    sdata[tid] = (i < N) ? input[i] : 0.0f;
    __syncthreads();
```
* What happens:
  * Each thread loads one element from global memory into shared memory
  * Thread 0 loads input[0] into sdata[0]
  * Thread 1 loads input[1] into sdata[1]
  * ...
  * Thread 255 loads input[255] into sdata[255]
* `__syncthreads()` - CRITICAL!
  * All threads in the block wait here until everyone finishes loading
  * Ensures all data is available before starting the reduction
* After this step, shared memory mirrors `h_input`:
```
Thread: 0  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15
sdata: [8, 3, 5, 7, 2, 9, 1, 6, 4, 10, 12, 15, 11, 14, 13, 16]
```
### 4. Parallel Reduction Loop (The Core Algorithm - Adjacent-pair reduction)
```
    for (int stride = 1; stride < blockDim.x; stride *= 2) {
        int index = 2 * stride * tid;
        if (index < blockDim.x) {
            if (index + stride < blockDim.x) {
                sdata[index] += sdata[index + stride];
            }
        }
        __syncthreads();
    }
```
* Each iteration doubles the stride: at stride 1 every active thread adds its right-hand neighbour, at stride 2 it adds the element two slots away, and so on, halving the number of partial sums until sdata[0] holds the sum of the whole block.
* The `__syncthreads()` at the end of each iteration guarantees that all additions for one stride finish before the next stride begins.
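Putting the fragments above together, the kernel reads as one block. Note that the final write of the block's result to `output` is not shown in the fragments above; the `if (tid == 0)` step below is a reconstruction based on the description of `output` as the place where each block stores its partial sum, so treat it as a sketch rather than the author's exact code.
```
__global__ void reduceSum(float *input, float *output, int N) {
    extern __shared__ float sdata[];           // dynamically sized shared memory

    int tid = threadIdx.x;                     // thread ID within the block
    int i = blockIdx.x * blockDim.x + tid;     // global thread ID

    // Each thread loads one element (or 0 if past the end of the array)
    sdata[tid] = (i < N) ? input[i] : 0.0f;
    __syncthreads();

    // Adjacent-pair reduction: the number of partial sums halves each iteration
    for (int stride = 1; stride < blockDim.x; stride *= 2) {
        int index = 2 * stride * tid;
        if (index < blockDim.x) {
            if (index + stride < blockDim.x) {
                sdata[index] += sdata[index + stride];
            }
        }
        __syncthreads();
    }

    // Assumed final step (not shown in the fragments above):
    // thread 0 writes this block's partial sum to output
    if (tid == 0) {
        output[blockIdx.x] = sdata[0];
    }
}
```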
## Code Breakdown - Main Function
### Step 1: Define the Input Data
```
const int N = 16;
float h_input[] = {8, 3, 5, 7, 2, 9, 1, 6, 4, 10, 12, 15, 11, 14, 13, 16};
float h_output[1];   // To store the final sum
```
* `h_input`: Our array in CPU memory ("h" for host)
* `h_output`: Will store the final result
### Step 2: Allocate GPU Memory
```
float *d_input, *d_output;
cudaMalloc(&d_input, N * sizeof(float));   // 16 floats on GPU
cudaMalloc(&d_output, sizeof(float));      // 1 float on GPU
```
* `d_input`: GPU copy of our array ("d" for device)
* `d_output`: GPU memory for the result
### Step 3: Copy Data to GPU
```
cudaMemcpy(d_input, h_input, N * sizeof(float), cudaMemcpyHostToDevice);
```
* Copies the entire array from CPU → GPU
* `cudaMemcpyHostToDevice`: Direction of the copy
### Step 4: Configure and Launch Kernel
Kernel launch syntax:
```
int threads = 16;
int sharedMem = threads * sizeof(float);   // 16 × 4 bytes = 64 bytes
reduceSum<<<1, threads, sharedMem>>>(d_input, d_output, N);
```
* `1`: 1 block (we only need one block for 16 elements)
* `threads`: 16 threads in that block
* `sharedMem`: 64 bytes of shared memory allocated
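For larger inputs, one block is not enough. The original walkthrough stops at a single block, so the helper below is only a sketch of the usual pattern: launch one block per 256-element chunk and let each block write its own partial sum into a hypothetical buffer `d_partial`, which then still needs a second reduction pass.
```
// Hypothetical helper (not part of the original walkthrough):
// launch reduceSum over an array of arbitrary size n, one 256-thread block per chunk.
void launchReduceSum(float *d_in, float *d_partial, int n) {
    int threads   = 256;                          // threads per block
    int blocks    = (n + threads - 1) / threads;  // enough blocks to cover n elements
    int sharedMem = threads * sizeof(float);      // one float of shared memory per thread
    reduceSum<<<blocks, threads, sharedMem>>>(d_in, d_partial, n);
    // d_partial now holds 'blocks' partial sums, which still need to be reduced,
    // e.g. by a second reduceSum launch or by summing them on the CPU.
}
```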
### Step 5: Copy Result from GPU to CPU
```
cudaMemcpy(h_output, d_output, sizeof(float), cudaMemcpyDeviceToHost);
```
* `cudaMemcpy()`: CUDA function to copy memory between CPU and GPU
* `h_output`: Destination (CPU memory where we want the result)
* `d_output`: Source (GPU memory where the result was computed)
* `sizeof(float)`: Copy exactly 4 bytes (the size of one float)
* `cudaMemcpyDeviceToHost`: Direction of the copy - from GPU → CPU
```
GPU VRAM (Device Memory)        CPU RAM (Host Memory)
------------------------        ---------------------
d_output[0] = 136.0       →     h_output[0] = 136.0
```
- The GPU computed the sum and stored it in d_output[0] (GPU memory)
- But we can't directly use GPU memory from CPU code
- We need to copy it back to CPU memory to print or use it
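For reference, the five host-side steps can be assembled into a single `main` function, assuming the `reduceSum` kernel defined earlier is in the same file. This is a sketch: the `printf`, the `cudaFree` cleanup, and the lack of error checking are additions for completeness and are not shown in the original fragments.
```
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Step 1: Define the input data
    const int N = 16;
    float h_input[] = {8, 3, 5, 7, 2, 9, 1, 6, 4, 10, 12, 15, 11, 14, 13, 16};
    float h_output[1];                        // To store the final sum

    // Step 2: Allocate GPU memory
    float *d_input, *d_output;
    cudaMalloc(&d_input, N * sizeof(float));  // 16 floats on the GPU
    cudaMalloc(&d_output, sizeof(float));     // 1 float on the GPU

    // Step 3: Copy data to the GPU
    cudaMemcpy(d_input, h_input, N * sizeof(float), cudaMemcpyHostToDevice);

    // Step 4: Configure and launch the kernel
    int threads = 16;
    int sharedMem = threads * sizeof(float);  // 64 bytes of shared memory
    reduceSum<<<1, threads, sharedMem>>>(d_input, d_output, N);

    // Step 5: Copy the result back to the CPU and use it
    cudaMemcpy(h_output, d_output, sizeof(float), cudaMemcpyDeviceToHost);
    printf("Sum = %.1f\n", h_output[0]);      // expected: 136.0

    // Clean up GPU memory (assumed; not shown in the original fragments)
    cudaFree(d_input);
    cudaFree(d_output);
    return 0;
}
```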