Vector Addition


Vector Addition: The Fundamentals

One of the simplest operations you can perform on a GPU is vector addition: adding two arrays element-wise. The advantage of using a GPU is that it can perform many operations in parallel, which can drastically improve performance.

In CUDA, we use threads to perform the addition of the vectors. Each thread handles the computation for one element: by mapping each thread to one data element, the GPU can compute the sum of the entire array in parallel.

Let's assume we have two vectors A and B, each with N elements. The task is to add them together to produce a third vector C, where C[i] = A[i] + B[i].

In this example, we initialize two vectors on the host, copy them to the GPU, execute the vector addition in parallel, and then copy the result back to the host.
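For contrast, here is the same operation written as ordinary sequential C code (a reference sketch added for comparison; vectorAddCPU is a hypothetical name). The CUDA version below replaces this loop with one thread per element:

void vectorAddCPU(const float *A, const float *B, float *C, int N) {
    // Sequential: one loop iteration per element, executed one after another
    for (int i = 0; i < N; i++) {
        C[i] = A[i] + B[i];
    }
}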

CUDA Vector Addition Source Code

Part 1: Understanding the CUDA Kernel

Kernel Definition:

__global__ void vectorAdd(float *A, float *B, float *C, int N) {

  • __global__: Marks a GPU function that can be called from the CPU
  • void vectorAdd: Function name - runs on the GPU but is called from the CPU
  • Parameters: Pointers to arrays A, B, C and the size N

Thread Index Calculation:

int i = threadIdx.x + blockIdx.x * blockDim.x;

  • This is the most important line in CUDA! Let's understand each part:

Grid (multiple blocks)
┌───────────┬───────────┬───┐
│ Block 0   │ Block 1   │...│ ← blockIdx.x (0, 1, 2...)
│           │           │   │
│ Threads:  │ Threads:  │   │
│ 0,1,2...  │ 0,1,2...  │   │ ← threadIdx.x (0,1,2...)
└───────────┴───────────┴───┘

  • threadIdx.x: Thread index within its block (0 to 255 with 256 threads/block)
  • blockIdx.x: Which block this thread is in (0, 1, 2...)
  • blockDim.x: How many threads per block (256 in our example)
  • i: Global index across all threads

  • Example: If you have 1024 elements and 256 threads/block:
    • Block 0: threads 0-255 → i = 0-255
    • Block 1: threads 0-255 → i = 256-511
    • Block 2: threads 0-255 → i = 512-767
    • Block 3: threads 0-255 → i = 768-1023
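To see this mapping in action, you can launch a tiny kernel in which every thread prints its own coordinates (a debugging sketch; whoAmI is a hypothetical kernel name, and the cudaDeviceSynchronize call is what flushes device-side printf output):

#include <cstdio>

// Each thread reports its block, its index within the block, and its global index
__global__ void whoAmI() {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    printf("Thread %d (block %d, thread %d)\n", i, blockIdx.x, threadIdx.x);
}

int main() {
    whoAmI<<<2, 4>>>();       // 2 blocks x 4 threads -> global indices 0-7
    cudaDeviceSynchronize();  // wait for the GPU and flush printf output
    return 0;
}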

Boundary Check:

if (i < N) {
    C[i] = A[i] + B[i];
}

  • Why needed? We might launch more threads than array elements
  • Example: For N=1024 with 256 threads/block, we need 4 blocks (1024 threads) - perfect!
  • But if N=1000, we still use 4 blocks (1024 threads); the last 24 threads fail the check and skip the work (see the assembled kernel below)
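Putting the index calculation and the boundary check together, the complete kernel from this walkthrough reads:

__global__ void vectorAdd(float *A, float *B, float *C, int N) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;  // global element index
    if (i < N) {                                    // skip out-of-range threads
        C[i] = A[i] + B[i];                         // one addition per thread
    }
}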

Part 2: Host (CPU) Code - Step by Step

Step 1: Define Problem Size

int N = 1024;                    // 1024 elements
size_t size = N * sizeof(float); // Total memory needed: 1024 × 4 bytes = 4096 bytes

  • size_t is a special data type that represents sizes and counts in C/C++. It's an unsigned integer type that's guaranteed to be large enough to represent the size of any object in memory.
  • sizeof(float) returns how many bytes a float occupies
  • N * sizeof(float)
    • N is the number of elements you want
    • sizeof(float) is bytes per element
    • Multiplying the two gives the total bytes needed for the array: here 1024 × 4 = 4096 bytes (a 6-element array would need 6 × 4 = 24 bytes)

Step 2: Allocate CPU Memory

CPU RAM (Host)         GPU VRAM (Device)
┌─────────────┐        ┌─────────────┐
│ h_A: 1,2,3..│        │ d_A: ? ? ? ?│ ← Allocated but empty
│ h_B:10,20,30│        │ d_B: ? ? ? ?│
│ h_C: ? ? ? ?│        │ d_C: ? ? ? ?│
└─────────────┘        └─────────────┘

float *h_A = (float *)malloc(size);  // Allocate array A on CPU
float *h_B = (float *)malloc(size);  // Allocate array B on CPU
float *h_C = (float *)malloc(size);  // Allocate result array C on CPU

  • The h_ prefix typically indicates that these arrays are stored in host memory (CPU RAM), as opposed to d_, which indicates device memory (GPU memory). This is a common convention in CUDA programming.
    • h_A and h_B contain the input data on the CPU; in the small demo run shown at the end they hold [1.0, 2.0, 3.0, 4.0, 5.0, 6.0] and [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
    • h_C is allocated to receive the results; it is uninitialized and contains garbage values until the GPU output is copied back
  • malloc(): Standard C function to allocate CPU memory
  • This pattern is typical of GPU programming: the host arrays are initialized on the CPU and later copied to device memory for processing
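One detail the walkthrough leaves implicit: malloc returns NULL when it fails, so careful host code checks the pointer before using it (a defensive sketch, assuming <stdio.h> and <stdlib.h> are included):

float *h_A = (float *)malloc(size);
if (h_A == NULL) {  // malloc signals failure by returning NULL
    fprintf(stderr, "Host allocation failed\n");
    exit(EXIT_FAILURE);
}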

Step 3: Initialize Data on CPU

for (int i = 0; i < N; i++) {
    h_A[i] = i;      // A = [0, 1, 2, 3, ..., 1023]
    h_B[i] = i * 2;  // B = [0, 2, 4, 6, ..., 2046]
}

Step 4: Allocate GPU Memory

float *d_A, *d_B, *d_C;           // Declare GPU pointers
cudaMalloc(&d_A, size);           // Allocate array A on GPU
cudaMalloc(&d_B, size);           // Allocate array B on GPU
cudaMalloc(&d_C, size);           // Allocate result array C on GPU

  • cudaMalloc(): CUDA function to allocate GPU memory (its error handling is sketched below)
  • d_ prefix: Convention for "device" (GPU) pointers
  • GPU memory is separate from CPU memory!
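cudaMalloc reports failure differently from malloc: it returns a cudaError_t status code rather than a NULL pointer. A defensive sketch (again assuming <stdio.h> and <stdlib.h> are included):

cudaError_t err = cudaMalloc(&d_A, size);
if (err != cudaSuccess) {  // any value other than cudaSuccess is an error
    fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
    exit(EXIT_FAILURE);
}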
Step 5: Copy Data from CPU → GPU

cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);  // CPU → GPU
cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);  // CPU → GPU

  • cudaMemcpy(): Copies data between CPU and GPU
  • cudaMemcpyHostToDevice: Direction - from CPU to GPU
  • This is slow (the PCIe bus is the bottleneck) - minimize these transfers! (one mitigation is sketched below)
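One standard mitigation, not covered in the original walkthrough, is pinned (page-locked) host memory, which the GPU can transfer to and from more quickly; cudaMallocHost allocates it (a sketch):

float *h_A;
cudaMallocHost(&h_A, size);  // pinned host memory: faster host/device copies
// ... use h_A exactly like a malloc'd buffer ...
cudaFreeHost(h_A);           // pinned memory must be freed with cudaFreeHost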
Step 6: Configure and Launch Kernel

int threadsPerBlock = 256;
int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);

  • Kernel launch syntax: <<<blocks, threads>>>
  • Calculation:
    • blocksPerGrid = (1024 + 256 - 1) / 256 = 1279 / 256 = 4 (integer division truncates the 4.99...)
    • 4 blocks × 256 threads/block = 1024 threads total
    • Each thread processes one array element

Step 7: Copy Results from GPU → CPU

cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost); // GPU → CPU

  • cudaMemcpyDeviceToHost: Direction - from GPU to CPU
  • Now h_C contains the results computed by the GPU
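With h_A[i] = i and h_B[i] = i * 2 from Step 3, every element of the result should equal 3i, which makes a host-side sanity check easy (a sketch; assumes <stdio.h>):

for (int i = 0; i < N; i++) {
    if (h_C[i] != 3.0f * i) {  // expected: i + 2*i = 3*i (exact in float here)
        fprintf(stderr, "Mismatch at %d: %f\n", i, h_C[i]);
        break;
    }
}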

Step 8: Cleanup Memory

cudaFree(d_A);  // Free GPU memory
cudaFree(d_B);
cudaFree(d_C);
free(h_A);      // Don't forget the CPU memory either
free(h_B);
free(h_C);
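Assembled end to end, the eight steps fit in one short program (a sketch of the walkthrough above; error checking is omitted for brevity):

#include <cstdio>
#include <cstdlib>

__global__ void vectorAdd(float *A, float *B, float *C, int N) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;  // global element index
    if (i < N) C[i] = A[i] + B[i];                  // boundary check + add
}

int main(void) {
    int N = 1024;                                   // Step 1: problem size
    size_t size = N * sizeof(float);

    float *h_A = (float *)malloc(size);             // Step 2: host memory
    float *h_B = (float *)malloc(size);
    float *h_C = (float *)malloc(size);
    for (int i = 0; i < N; i++) {                   // Step 3: initialize
        h_A[i] = (float)i;
        h_B[i] = (float)(i * 2);
    }

    float *d_A, *d_B, *d_C;                         // Step 4: device memory
    cudaMalloc(&d_A, size);
    cudaMalloc(&d_B, size);
    cudaMalloc(&d_C, size);

    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);  // Step 5: CPU -> GPU
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    int threadsPerBlock = 256;                      // Step 6: configure + launch
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);

    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);  // Step 7: GPU -> CPU

    printf("C[0] = %.1f, C[%d] = %.1f\n", h_C[0], N - 1, h_C[N - 1]);

    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);    // Step 8: cleanup
    free(h_A); free(h_B); free(h_C);
    return 0;
}

Compiled with, e.g., nvcc vector_add.cu -o vector_add, this should print C[0] = 0.0, C[1023] = 3069.0.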

Part 3: Visual Execution Timeline

CPU Execution (Sequential):

Time:    Operation
↓
t1:      Allocate CPU memory (h_A, h_B, h_C)
t2:      Initialize arrays on CPU
t3:      Allocate GPU memory (d_A, d_B, d_C)
t4:      Copy h_A → d_A, h_B → d_B (CPU→GPU)
t5:      LAUNCH KERNEL (GPU takes over)
t6:      Wait for GPU to finish
t7:      Copy d_C → h_C (GPU→CPU)
t8:      Free memory
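The wait at t6 can be made explicit with cudaDeviceSynchronize(); in the program above it also happens implicitly, because the device-to-host cudaMemcpy at t7 does not begin until the kernel has finished (a sketch):

vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);  // async launch
cudaDeviceSynchronize();  // block the CPU until all queued GPU work completes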

GPU Execution (Parallel):

Time:    All 1024 threads execute SIMULTANEOUSLY!
↓
t5:      Thread 0:   C[0] = A[0] + B[0]
         Thread 1:   C[1] = A[1] + B[1]
         Thread 2:   C[2] = A[2] + B[2]
         ...
         Thread 1023: C[1023] = A[1023] + B[1023]
         ALL HAPPEN AT THE SAME TIME!

Output

The sample output below comes from a smaller demo run (2 blocks × 4 threads/block, using the six-element arrays shown in Step 2):

Launching kernel with 2 blocks, 4 threads/block
Thread 0 (block 0, thread 0) processing element 0
Thread 1 (block 0, thread 1) processing element 1
Thread 2 (block 0, thread 2) processing element 2
Thread 3 (block 0, thread 3) processing element 3
Thread 4 (block 1, thread 0) processing element 4
Thread 5 (block 1, thread 1) processing element 5
Thread 6 (block 1, thread 2) processing element 6
Thread 7 (block 1, thread 3) processing element 7

Results:
C[0] = 1.0 + 10.0 = 11.0
C[1] = 2.0 + 20.0 = 22.0
C[2] = 3.0 + 30.0 = 33.0
C[3] = 4.0 + 40.0 = 44.0
C[4] = 5.0 + 50.0 = 55.0
C[5] = 6.0 + 60.0 = 66.0

