- Deep Learning and Machine Learning with CUDA
- Understanding Data Flow in Deep Learning: CPU, GPU, RAM, VRAM, Cache, and Disk Storage
- The GPU Hierarchical Structure
- CPU and GPU Comparison
- Linear Regression Algorithm
- Matrix Addition
- Matrix Multiplication: Naive, Optimized, and CUDA Approaches
- Neural Network: Multi-layer Network
- Vector Addition
- CUDA Kernel for parallel reduction
- Cumulative Sum
- Advanced CUDA Features and Optimization Techniques
Vector Addition: The Fundamentals
One of the simplest operations you can perform on a GPU is vector addition. This involves adding two
arrays element-wise. The advantage of using a GPU is that you can perform multiple operations in
parallel, which can drastically improve performance.
In CUDA, we use threads to perform the addition of the vectors. Each thread handles the computation
for one element in the vector. The idea is to map each thread to one data element, allowing the
GPU to compute the sum of the entire array in parallel.
Let's assume we have two vectors A and B, each with N elements. The task is to add them together
to produce a third vector C, where C[i] = A[i] + B[i].
In this example, we initialize two vectors on the host, copy them to the GPU, execute the vector
addition in parallel, and then copy the result back to the host.
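For comparison, here is the plain sequential CPU version of the same operation (a minimal sketch, not from the original code) that the GPU kernel parallelizes:

// Sequential CPU reference: one loop iteration per element,
// executed one after another on a single core.
void vectorAddCPU(const float *A, const float *B, float *C, int N) {
    for (int i = 0; i < N; i++) {
        C[i] = A[i] + B[i];
    }
}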
Part 1: Understanding the CUDA Kernel
Kernel Definition:
__global__ void vectorAdd(float *A, float *B, float *C, int N) {
- __global__: Marks this as a kernel - a GPU function that can be launched from the CPU
- void vectorAdd: Function name - the code runs on the GPU but is called from the CPU
- Parameters: pointers to arrays A, B, C and the size N
Thread Index Calculation:
int i = threadIdx.x + blockIdx.x * blockDim.x;
Grid (multiple blocks)
┌───────────┬───────────┬───┐
│ Block 0 │ Block 1 │...│ ← blockIdx.x (0, 1, 2...)
│ │ │ │
│ Threads: │ Threads: │ │
│ 0,1,2... │ 0,1,2... │ │ ← threadIdx.x (0,1,2...)
└───────────┴───────────┴───┘
- This is the most important line in CUDA! Let's understand each part (a small diagnostic sketch follows the example below):
- threadIdx.x: Thread index within its block (0 to 255 if 256 threads/block)
- blockIdx.x: Which block this thread is in (0, 1, 2...)
- blockDim.x: How many threads per block (256 in our example)
- i: Global index across all threads
- Example: If you have 1024 elements and 256 threads/block:
- Block 0: threads 0-255 → i = 0-255
- Block 1: threads 0-255 → i = 256-511
- Block 2: threads 0-255 → i = 512-767
- Block 3: threads 0-255 → i = 768-1023
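To see this mapping in action, a tiny diagnostic kernel (a sketch, separate from the vector-addition kernel) can print each thread's block index, thread index, and the global index it computes:

#include <cstdio>

// Each thread prints its block index, thread index, and the global index i it derives.
__global__ void showIndex() {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    printf("block %d, thread %d -> i = %d\n", blockIdx.x, threadIdx.x, i);
}

// Example launch: showIndex<<<4, 256>>>(); followed by cudaDeviceSynchronize();
// to make sure the device-side output is flushed before the program exits.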
Boundary Check:
if (i < N) {
C[i] = A[i] + B[i];
}
- Why needed? We might launch more threads than array elements
- Example: For N=1024 with 256 threads/block, we need 4 blocks (1024 threads) - perfect!
- But if N=1000, we still use 4 blocks (1024 threads), but last 24 threads skip work
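Putting the signature, the index calculation, and the boundary check together, the complete kernel assembled from the fragments above looks like this:

__global__ void vectorAdd(float *A, float *B, float *C, int N) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;  // global element index
    if (i < N) {                                    // guard against extra threads
        C[i] = A[i] + B[i];
    }
}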
Part 2: Host (CPU) Code - Step by Step
Step 1: Define Problem Size
int N = 1024; // 1024 elements
size_t size = N * sizeof(float); // Total memory needed: 1024 × 4 bytes = 4096 bytes
- size_t is a special data type that represents sizes and counts in C/C++. It's an unsigned integer type guaranteed to be large enough to represent the size of any object in memory.
- sizeof(float) returns how many bytes a float occupies (4 on typical platforms).
- N * sizeof(float): N is the number of elements you want and sizeof(float) is bytes per element, so the multiplication gives the total bytes needed for the array.
- For the smaller N = 6 example used later: 6 × 4 = 24 bytes.
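A quick host-side check makes the arithmetic concrete (a small sketch, assuming a typical platform where float occupies 4 bytes):

#include <stdio.h>

int main(void) {
    int N = 6;                          // the small example above
    size_t size = N * sizeof(float);    // 6 * 4 = 24 bytes on typical platforms
    printf("sizeof(float) = %zu bytes, total = %zu bytes\n", sizeof(float), size);
    return 0;
}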
Step 2: Allocate CPU Memory
CPU RAM (Host) GPU VRAM (Device)
┌─────────────┐ ┌─────────────┐
│ h_A: 1,2,3..│ │ d_A: ? ? ? ?│ ← Allocated but empty
│ h_B:10,20,30│ │ d_B: ? ? ? ?│
│ h_C: ? ? ? ?│ │ d_C: ? ? ? ?│
└─────────────┘ └─────────────┘
float *h_A = (float *)malloc(size); // Allocate array A on CPU
float *h_B = (float *)malloc(size); // Allocate array B on CPU
float *h_C = (float *)malloc(size); // Allocate result array C on CPU
- The h_ prefix typically indicates that these arrays are stored in host memory (CPU RAM), as opposed to d_ which would indicate device memory (GPU memory). This is a common convention in CUDA programming.
- In a smaller illustrative run with N = 6 (the case shown in the sample output at the end), the arrays would look like:
- h_A: Array with 6 elements initialized to {1, 2, 3, 4, 5, 6}
  h_A: [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
- h_B: Array with 6 elements initialized to {10, 20, 30, 40, 50, 60}
  h_B: [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
- h_C: Array with 6 elements (uninitialized, so it contains garbage values)
  h_C: [?, ?, ?, ?, ?, ?]
- malloc(): Standard C function to allocate CPU memory
- h_ prefix: Convention for "host" (CPU) pointers
- h_A and h_B contain the input data on the CPU
- h_C is allocated to receive the results of the GPU computation
- The arrays are later copied to device memory for processing
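malloc() returns NULL when allocation fails, so a slightly more defensive version of this step (a sketch, not in the original) checks the pointers before continuing:

float *h_A = (float *)malloc(size);
float *h_B = (float *)malloc(size);
float *h_C = (float *)malloc(size);
if (h_A == NULL || h_B == NULL || h_C == NULL) {
    fprintf(stderr, "Host allocation failed\n");   // requires <stdio.h>
    exit(EXIT_FAILURE);                            // requires <stdlib.h>
}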
Step 3: Initialize Data on CPU
for (int i = 0; i < N; i++) {
h_A[i] = i; // A = [0, 1, 2, 3, ..., 1023]
h_B[i] = i * 2; // B = [0, 2, 4, 6, ..., 2046]
}
Step 4: Allocate GPU Memory
float *d_A, *d_B, *d_C; // Declare GPU pointers
cudaMalloc(&d_A, size); // Allocate array A on GPU
cudaMalloc(&d_B, size); // Allocate array B on GPU
cudaMalloc(&d_C, size); // Allocate result array C on GPU
- cudaMalloc(): CUDA function to allocate GPU memory
- d_ prefix: Convention for "device" (GPU) pointers
- GPU memory is separate from CPU memory!
Step 5: Copy Data from CPU → GPU
cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice); // CPU → GPU
cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice); // CPU → GPU
- cudaMemcpy(): Copies data between CPU and GPU
- cudaMemcpyHostToDevice: Direction - from CPU to GPU
- This is slow (PCIe bottleneck) - minimize these transfers! The timing sketch below shows one way to measure the cost.
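To see that transfer cost for yourself, the copy can be timed with CUDA events; a minimal sketch (not part of the original walkthrough):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
cudaEventRecord(stop);
cudaEventSynchronize(stop);               // wait until the copy has finished

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds
printf("Host-to-device copy took %f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);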
Step 6: Configure and Launch Kernel
int threadsPerBlock = 256;
int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);
- Kernel launch syntax: <<<blocks, threads>>>
- Calculation:
  blocksPerGrid = (1024 + 256 - 1) / 256 = 1279 / 256 = 4 (integer division truncates; adding threadsPerBlock - 1 rounds up so any remainder gets its own block)
- 4 blocks × 256 threads/block = 1024 threads total
- Each thread processes one array element
Step 7: Copy Results from GPU → CPU
cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost); // GPU → CPU
- cudaMemcpyDeviceToHost: Direction - from GPU to CPU
- Now h_C contains the results computed by the GPU
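One practical note (a sketch, not part of the original steps): kernel launches are asynchronous and do not return an error code directly, so it is common to check for a launch error and synchronize before copying results back:

vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);

cudaError_t launchErr = cudaGetLastError();          // errors from the launch itself
if (launchErr != cudaSuccess) {
    fprintf(stderr, "Kernel launch failed: %s\n", cudaGetErrorString(launchErr));
}
cudaDeviceSynchronize();                             // wait for the kernel to finish
cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);  // then copy the results back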
Step 8: Cleanup Memory
cudaFree(d_A); // Free GPU memory
cudaFree(d_B);
cudaFree(d_C);
free(h_A);       // Also free the CPU memory
free(h_B);
free(h_C);
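Putting all eight steps together, a complete, self-contained version of the program could look like the following sketch (assembled from the steps above; error checking omitted for brevity):

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

__global__ void vectorAdd(float *A, float *B, float *C, int N) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < N) {
        C[i] = A[i] + B[i];
    }
}

int main(void) {
    int N = 1024;
    size_t size = N * sizeof(float);

    // Steps 2-3: allocate and initialize host arrays
    float *h_A = (float *)malloc(size);
    float *h_B = (float *)malloc(size);
    float *h_C = (float *)malloc(size);
    for (int i = 0; i < N; i++) {
        h_A[i] = i;
        h_B[i] = i * 2;
    }

    // Step 4: allocate device arrays
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, size);
    cudaMalloc(&d_B, size);
    cudaMalloc(&d_C, size);

    // Step 5: copy inputs to the device
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // Step 6: launch the kernel
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);

    // Step 7: copy the result back
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    printf("C[0] = %f, C[N-1] = %f\n", h_C[0], h_C[N - 1]);

    // Step 8: cleanup
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);
    return 0;
}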
Part 3: Visual Execution Timeline
CPU Execution (Sequential):
Time: Operation
↓
t1: Allocate CPU memory (h_A, h_B, h_C)
t2: Initialize arrays on CPU
t3: Allocate GPU memory (d_A, d_B, d_C)
t4: Copy h_A → d_A, h_B → d_B (CPU→GPU)
t5: LAUNCH KERNEL (GPU takes over)
t6: Wait for GPU to finish
t7: Copy d_C → h_C (GPU→CPU)
t8: Free memory
GPU Execution (Parallel):
Time: All 1024 threads execute SIMULTANEOUSLY!
↓
t5: Thread 0: C[0] = A[0] + B[0]
Thread 1: C[1] = A[1] + B[1]
Thread 2: C[2] = A[2] + B[2]
...
Thread 1023: C[1023] = A[1023] + B[1023]
ALL HAPPEN AT THE SAME TIME!
(In practice the hardware schedules threads in warps of 32 across the streaming multiprocessors, but logically every element is processed in parallel.)
Sample output (from a smaller run using the 6-element example arrays, launched with 2 blocks of 4 threads and a per-thread diagnostic print):
Launching kernel with 2 blocks, 4 threads/block
Thread 0 (block 0, thread 0) processing element 0
Thread 1 (block 0, thread 1) processing element 1
Thread 2 (block 0, thread 2) processing element 2
Thread 3 (block 0, thread 3) processing element 3
Thread 4 (block 1, thread 0) processing element 4
Thread 5 (block 1, thread 1) processing element 5
Thread 6 (block 1, thread 2) processing element 6
Thread 7 (block 1, thread 3) processing element 7
Results:
C[0] = 1.0 + 10.0 = 11.0
C[1] = 2.0 + 20.0 = 22.0
C[2] = 3.0 + 30.0 = 33.0
C[3] = 4.0 + 40.0 = 44.0
C[4] = 5.0 + 50.0 = 55.0
C[5] = 6.0 + 60.0 = 66.0
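A simple way to confirm the GPU result is to re-check it on the host after the copy back; a hypothetical verification loop (not part of the original code):

int errors = 0;
for (int i = 0; i < N; i++) {
    float expected = h_A[i] + h_B[i];
    if (h_C[i] != expected) {   // exact comparison is fine for these small integer-valued sums
        printf("Mismatch at %d: got %f, expected %f\n", i, h_C[i], expected);
        errors++;
    }
}
printf("%s\n", errors == 0 ? "All results correct" : "Verification failed");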