- Deep Learning and Machine Learning with CUDA
- Understanding Data Flow in Deep Learning: CPU, GPU, RAM, VRAM, Cache, and Disk Storage
- The GPU Hierarchical Structure
- CPU and GPU Comparison
- Linear Regression Algorithm
- Matrix Addition
- Matrix Multiplication: Naive, Optimized, and CUDA Approaches
- Neural Network: Multi-layer Network
- Vector Addition
- CUDA Kernel for parallel reduction
- Cumulative Sum
- Advanced CUDA Features and Optimization Techniques
Vector Addition: The Fundamentals
One of the simplest operations you can perform on a GPU is vector addition. This involves adding two
arrays element-wise. The advantage of using a GPU is that you can perform multiple operations in
parallel, which can drastically improve performance.
In CUDA, we use threads to perform the addition of the vectors. Each thread handles the computation
for one element in the vector. The idea is to map each thread to one data element, allowing the
GPU to compute the sum of the entire array in parallel.
Let's assume we have two vectors A and B, each with N elements. The task is to add them together
to produce a third vector C, where C[i] = A[i] + B[i].
In this example, we initialize two vectors on the host, copy them to the GPU, execute the vector
addition in parallel, and then copy the result back to the host.
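For comparison, here is the plain sequential CPU version of the same operation (a minimal sketch, not from the original code) that the GPU kernel parallelizes:

// Sequential CPU reference: one loop iteration per element,
// executed one after another on a single core.
void vectorAddCPU(const float *A, const float *B, float *C, int N) {
    for (int i = 0; i < N; i++) {
        C[i] = A[i] + B[i];
    }
}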
Part 1: Understanding the CUDA Kernel
Kernel Definition:
__global__ void vectorAdd(float *A, float *B, float *C, int N) {
- __global__: Marks this as a kernel - a GPU function that can be launched from the CPU
- void vectorAdd: Function name - the code runs on the GPU but is called from the CPU
- Parameters: pointers to arrays A, B, C and the size N
Thread Index Calculation:
int i = threadIdx.x + blockIdx.x * blockDim.x;
Grid (multiple blocks)
┌───────────┬───────────┬───┐
│ Block 0 │ Block 1 │...│ ← blockIdx.x (0, 1, 2...)
│ │ │ │
│ Threads: │ Threads: │ │
│ 0,1,2... │ 0,1,2... │ │ ← threadIdx.x (0,1,2...)
└───────────┴───────────┴───┘
- This is the most important line in CUDA! Let's understand each part (a small diagnostic sketch follows the example below):
- threadIdx.x: Thread index within its block (0 to 255 if 256 threads/block)
- blockIdx.x: Which block this thread is in (0, 1, 2...)
- blockDim.x: How many threads per block (256 in our example)
- i: Global index across all threads
- Example: If you have 1024 elements and 256 threads/block:
- Block 0: threads 0-255 → i = 0-255
- Block 1: threads 0-255 → i = 256-511
- Block 2: threads 0-255 → i = 512-767
- Block 3: threads 0-255 → i = 768-1023
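To see this mapping in action, a tiny diagnostic kernel (a sketch, separate from the vector-addition kernel) can print each thread's block index, thread index, and the global index it computes:

#include <cstdio>

// Each thread prints its block index, thread index, and the global index i it derives.
__global__ void showIndex() {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    printf("block %d, thread %d -> i = %d\n", blockIdx.x, threadIdx.x, i);
}

// Example launch: showIndex<<<4, 256>>>(); followed by cudaDeviceSynchronize();
// to make sure the device-side output is flushed before the program exits.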
Boundary Check:
if (i < N) {
C[i] = A[i] + B[i];
}
- Why needed? We might launch more threads than array elements
- Example: For N=1024 with 256 threads/block, we need 4 blocks (1024 threads) - perfect!
- But if N=1000, we still use 4 blocks (1024 threads), but last 24 threads skip work
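Putting the signature, the index calculation, and the boundary check together, the complete kernel assembled from the fragments above looks like this:

__global__ void vectorAdd(float *A, float *B, float *C, int N) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;  // global element index
    if (i < N) {                                    // guard against extra threads
        C[i] = A[i] + B[i];
    }
}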
Part 2: Host (CPU) Code - Step by Step
Step 1: Define Problem Size
int N = 1024; // 1024 elements
size_t size = N * sizeof(float); // Total memory needed: 1024 × 4 bytes = 4096 bytes
- size_t is a special data type that represents sizes and counts in C/C++. It's an unsigned integer type guaranteed to be large enough to represent the size of any object in memory.
- sizeof(float) returns how many bytes a float occupies (4 on typical platforms).
- N * sizeof(float): N is the number of elements you want and sizeof(float) is bytes per element, so the multiplication gives the total bytes needed for the array.
- For the smaller N = 6 example used later: 6 × 4 = 24 bytes.
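A quick host-side check makes the arithmetic concrete (a small sketch, assuming a typical platform where float occupies 4 bytes):

#include <stdio.h>

int main(void) {
    int N = 6;                          // the small example above
    size_t size = N * sizeof(float);    // 6 * 4 = 24 bytes on typical platforms
    printf("sizeof(float) = %zu bytes, total = %zu bytes\n", sizeof(float), size);
    return 0;
}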
Step 2: Allocate CPU Memory
CPU RAM (Host) GPU VRAM (Device)
┌─────────────┐ ┌─────────────┐
│ h_A: 1,2,3..│ │ d_A: ? ? ? ?│ ← Allocated but empty
│ h_B:10,20,30│ │ d_B: ? ? ? ?│
│ h_C: ? ? ? ?│ │ d_C: ? ? ? ?│
└─────────────┘ └─────────────┘
float *h_A = (float *)malloc(size); // Allocate array A on CPU
float *h_B = (float *)malloc(size); // Allocate array B on CPU
float *h_C = (float *)malloc(size); // Allocate result array C on CPU
- The h_ prefix typically indicates that these arrays are stored in host memory (CPU RAM), as opposed to d_ which would indicate device memory (GPU memory). This is a common convention in CUDA programming.
- In a smaller illustrative run with N = 6 (the case shown in the sample output at the end), the arrays would look like:
- h_A: Array with 6 elements initialized to {1, 2, 3, 4, 5, 6}
  h_A: [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
- h_B: Array with 6 elements initialized to {10, 20, 30, 40, 50, 60}
  h_B: [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
- h_C: Array with 6 elements (uninitialized, so it contains garbage values)
  h_C: [?, ?, ?, ?, ?, ?]
- malloc(): Standard C function to allocate CPU memory
- h_ prefix: Convention for "host" (CPU) pointers
- h_A and h_B contain the input data on the CPU
- h_C is allocated to receive the results of the GPU computation
- The arrays are later copied to device memory for processing
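malloc() returns NULL when allocation fails, so a slightly more defensive version of this step (a sketch, not in the original) checks the pointers before continuing:

float *h_A = (float *)malloc(size);
float *h_B = (float *)malloc(size);
float *h_C = (float *)malloc(size);
if (h_A == NULL || h_B == NULL || h_C == NULL) {
    fprintf(stderr, "Host allocation failed\n");   // requires <stdio.h>
    exit(EXIT_FAILURE);                            // requires <stdlib.h>
}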
Step 3: Initialize Data on CPU
for (int i = 0; i < N; i++) {
h_A[i] = i; // A = [0, 1, 2, 3, ..., 1023]
h_B[i] = i * 2; // B = [0, 2, 4, 6, ..., 2046]
}
Step 4: Allocate GPU Memory
float *d_A, *d_B, *d_C; // Declare GPU pointers
cudaMalloc(&d_A, size); // Allocate array A on GPU
cudaMalloc(&d_B, size); // Allocate array B on GPU
cudaMalloc(&d_C, size); // Allocate result array C on GPU
- cudaMalloc(): CUDA function to allocate GPU memory
- d_ prefix: Convention for "device" (GPU) pointers
- GPU memory is separate from CPU memory!
Step 5: Copy Data from CPU → GPU
cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice); // CPU → GPU
cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice); // CPU → GPU
- cudaMemcpy(): Copies data between CPU and GPU
- cudaMemcpyHostToDevice: Direction - from CPU to GPU
- This is slow (PCIe bottleneck) - minimize these transfers! The timing sketch below shows one way to measure the cost.
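To see that transfer cost for yourself, the copy can be timed with CUDA events; a minimal sketch (not part of the original walkthrough):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
cudaEventRecord(stop);
cudaEventSynchronize(stop);               // wait until the copy has finished

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds
printf("Host-to-device copy took %f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);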
Step 6: Configure and Launch Kernel
int threadsPerBlock = 256;
int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);
- Kernel launch syntax: <<<blocks, threads>>>
- Calculation:
  blocksPerGrid = (1024 + 256 - 1) / 256 = 1279 / 256 = 4 (integer division truncates; adding threadsPerBlock - 1 rounds up so any remainder gets its own block)
- 4 blocks × 256 threads/block = 1024 threads total
- Each thread processes one array element
Step 7: Copy Results from GPU → CPU
cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost); // GPU → CPU
- cudaMemcpyDeviceToHost: Direction - from GPU to CPU
- Now h_C contains the results computed by the GPU
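One practical note (a sketch, not part of the original steps): kernel launches are asynchronous and do not return an error code directly, so it is common to check for a launch error and synchronize before copying results back:

vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);

cudaError_t launchErr = cudaGetLastError();          // errors from the launch itself
if (launchErr != cudaSuccess) {
    fprintf(stderr, "Kernel launch failed: %s\n", cudaGetErrorString(launchErr));
}
cudaDeviceSynchronize();                             // wait for the kernel to finish
cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);  // then copy the results back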
Step 8: Cleanup Memory
cudaFree(d_A); // Free GPU memory
cudaFree(d_B);
cudaFree(d_C);
free(h_A);       // Also free the CPU memory
free(h_B);
free(h_C);
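Putting all eight steps together, a complete, self-contained version of the program could look like the following sketch (assembled from the steps above; error checking omitted for brevity):

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

__global__ void vectorAdd(float *A, float *B, float *C, int N) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < N) {
        C[i] = A[i] + B[i];
    }
}

int main(void) {
    int N = 1024;
    size_t size = N * sizeof(float);

    // Steps 2-3: allocate and initialize host arrays
    float *h_A = (float *)malloc(size);
    float *h_B = (float *)malloc(size);
    float *h_C = (float *)malloc(size);
    for (int i = 0; i < N; i++) {
        h_A[i] = i;
        h_B[i] = i * 2;
    }

    // Step 4: allocate device arrays
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, size);
    cudaMalloc(&d_B, size);
    cudaMalloc(&d_C, size);

    // Step 5: copy inputs to the device
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // Step 6: launch the kernel
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);

    // Step 7: copy the result back
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    printf("C[0] = %f, C[N-1] = %f\n", h_C[0], h_C[N - 1]);

    // Step 8: cleanup
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);
    return 0;
}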
Part 3: Visual Execution Timeline
CPU Execution (Sequential):
Time: Operation
↓
t1: Allocate CPU memory (h_A, h_B, h_C)
t2: Initialize arrays on CPU
t3: Allocate GPU memory (d_A, d_B, d_C)
t4: Copy h_A → d_A, h_B → d_B (CPU→GPU)
t5: LAUNCH KERNEL (GPU takes over)
t6: Wait for GPU to finish
t7: Copy d_C → h_C (GPU→CPU)
t8: Free memory
GPU Execution (Parallel):
Time: All 1024 threads execute SIMULTANEOUSLY!
↓
t5: Thread 0: C[0] = A[0] + B[0]
Thread 1: C[1] = A[1] + B[1]
Thread 2: C[2] = A[2] + B[2]
...
Thread 1023: C[1023] = A[1023] + B[1023]
ALL HAPPEN AT THE SAME TIME!
(In practice the hardware schedules threads in warps of 32 across the streaming multiprocessors, but logically every element is processed in parallel.)
Sample output (from a smaller run using the 6-element example arrays, launched with 2 blocks of 4 threads and a per-thread diagnostic print):
Launching kernel with 2 blocks, 4 threads/block
Thread 0 (block 0, thread 0) processing element 0
Thread 1 (block 0, thread 1) processing element 1
Thread 2 (block 0, thread 2) processing element 2
Thread 3 (block 0, thread 3) processing element 3
Thread 4 (block 1, thread 0) processing element 4
Thread 5 (block 1, thread 1) processing element 5
Thread 6 (block 1, thread 2) processing element 6
Thread 7 (block 1, thread 3) processing element 7
Results:
C[0] = 1.0 + 10.0 = 11.0
C[1] = 2.0 + 20.0 = 22.0
C[2] = 3.0 + 30.0 = 33.0
C[3] = 4.0 + 40.0 = 44.0
C[4] = 5.0 + 50.0 = 55.0
C[5] = 6.0 + 60.0 = 66.0
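A simple way to confirm the GPU result is to re-check it on the host after the copy back; a hypothetical verification loop (not part of the original code):

int errors = 0;
for (int i = 0; i < N; i++) {
    float expected = h_A[i] + h_B[i];
    if (h_C[i] != expected) {   // exact comparison is fine for these small integer-valued sums
        printf("Mismatch at %d: got %f, expected %f\n", i, h_C[i], expected);
        errors++;
    }
}
printf("%s\n", errors == 0 ? "All results correct" : "Verification failed");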