- Deep Learning and Machine Learning with CUDA
- Understanding Data Flow in Deep Learning: CPU, GPU, RAM, VRAM, Cache, and Disk Storage
- The GPU Hierarchical Structure
- CPU and GPU Comparison
- Linear Regression Algorithm
- Matrix Addition
- Matrix Multiplication: Naive, Optimized, and CUDA Approaches
- Neural Network: Multi-layer Network
- Vector Addition
- CUDA Kernel for parallel reduction
- Cumulative Sum
- Advanced CUDA Features and Optimization Techniques
Advanced CUDA Features and Optimization Techniques
Streams and Concurrency
CUDA streams provide a mechanism to overlap computation and data transfer, allowing us to optimize the GPU’s utilization by performing multiple tasks in parallel. The basic idea is that instead of serializing operations on the GPU (like launching one kernel and waiting for its completion), we can split tasks across multiple streams and execute them concurrently.
Overlapping Computation and Data Transfer
By default, CUDA operates in a synchronous manner: memory transfers (from host to device or device to host) and kernel executions are serialized, meaning one must finish before the other begins. However, we can overlap memory transfers with kernel execution using streams, which allows for more efficient use of the GPU.
To demonstrate overlapping, let’s look at the following example.
import numpy as np
import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule
# Define the Kernel
mod = SourceModule("""
__global__ void kernel(float *a, float *b){
    // Compute this thread's global index from its block and thread coordinates
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    b[idx] = a[idx] * 2;
}
""")
kernel = mod.get_function("kernel")
# Initialize host data
N = 1024
h_a = np.random.randn(N).astype(np.float32)
h_b = np.empty_like(h_a)
# Allocate Device memory
d_a = cuda.mem_alloc(h_a.nbytes)
d_b = cuda.mem_alloc(h_b.nbytes)
# Create Streams
stream1 = cuda.Stream()
stream2 = cuda.Stream()
# Transfer data asynchronously in stream1
cuda.memcpy_htod_async(d_a, h_a, stream1)
# Launch the kernel in stream2 (ordering with the copy in stream1 is handled with events in the next section)
kernel(d_a, d_b, block=(256, 1, 1), grid=(N // 256, 1), stream=stream2)
# Transfer the result back to the host asynchronously in stream1
cuda.memcpy_dtoh_async(h_b, d_b, stream1)
# Synchronize the streams
stream1.synchronize()
stream2.synchronize()
# Output Result
print(h_b[:10])
The same overlap can also be expressed with PyTorch, which manages streams through torch.cuda.Stream:
import torch
import numpy as np
# Check if CUDA is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
N = 1024
h_a = np.random.randn(N).astype(np.float32)
# Create streams
stream1 = torch.cuda.Stream()
stream2 = torch.cuda.Stream()
# Use streams
with torch.cuda.stream(stream1):
    d_a = torch.from_numpy(h_a).to(device)
# Ensure stream2 waits for the transfer issued in stream1
stream2.wait_stream(stream1)
with torch.cuda.stream(stream2):
    d_b = d_a * 2  # Element-wise multiply runs on the GPU in stream2
# Synchronize and copy back
torch.cuda.synchronize()
h_b = d_b.cpu().numpy()
print(h_b[:10])
[ 0.24536662 2.7647512 -0.29380977 -2.7479107 1.5783794 3.2773077 0.75366443 3.4507585 -2.7262783 1.6730118 ]
In this example:
- Memory transfer (host to device and device to host) occurs in stream1, and kernel execution happens in stream2.
- By running these operations concurrently in different streams, we achieve better utilization of both the memory bandwidth and the computational power of the GPU.
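One caveat worth noting: asynchronous copies such as memcpy_htod_async only overlap with kernel execution when the host buffers are page-locked (pinned); with ordinary pageable NumPy arrays the driver may fall back to a blocking copy. The following is a minimal sketch of allocating pinned host buffers with PyCUDA's pagelocked_empty; the buffer names mirror the example above and are otherwise illustrative.
import numpy as np
import pycuda.driver as cuda
import pycuda.autoinit
N = 1024
# Allocate page-locked (pinned) host buffers instead of ordinary NumPy arrays
h_a = cuda.pagelocked_empty(N, dtype=np.float32)
h_b = cuda.pagelocked_empty(N, dtype=np.float32)
h_a[:] = np.random.randn(N)
d_a = cuda.mem_alloc(h_a.nbytes)
d_b = cuda.mem_alloc(h_b.nbytes)
stream1 = cuda.Stream()
# With pinned buffers these copies are truly asynchronous with respect to the host
cuda.memcpy_htod_async(d_a, h_a, stream1)
# ... kernels in other streams can run here ...
cuda.memcpy_dtoh_async(h_b, d_b, stream1)
stream1.synchronize()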
Managing Multiple Streams
Managing multiple streams becomes essential when optimizing more complex applications. CUDA streams are independent, and operations issued to different streams can be executed concurrently. However, there may be cases where we want to ensure the correct order of execution between streams.
To manage dependencies between streams, we can use cudaStreamWaitEvent (exposed in PyCUDA as Stream.wait_for_event) to synchronize streams based on certain events. This ensures that kernels in one stream only start after a specific event in another stream has occurred.
Here’s a simple example:
# Creating an event
event = cuda.Event()
cuda.memcpy_htod_async(d_a, h_a, stream1)
event.record(stream1)
# Make stream2 wait for stream1 to finish
stream2.wait_for_event(event)
# Now we can safely execute the kernel in stream2
kernel(d_a, d_b, block=(256, 1, 1), grid=(N // 256, 1), stream=stream2)
In this code, we used an event to ensure that stream2 only starts its kernel after the data transfer in stream1 is complete.
Dynamic Parallelism
Dynamic parallelism in CUDA allows a kernel to launch other kernels directly from the device. This is useful for algorithms where the problem size is not known in advance or is irregular, such as adaptive mesh refinement, graph traversal, or recursive algorithms.
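As an illustration of the idea, the sketch below defines a parent kernel that launches a child kernel from the device. Dynamic parallelism requires a GPU of compute capability 3.5 or higher and relocatable device code linked against the device runtime; recent PyCUDA versions expose this through DynamicSourceModule, which is assumed here, and the kernel names and sizes are illustrative.
import numpy as np
import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import DynamicSourceModule
mod = DynamicSourceModule("""
__global__ void child(float *data, int n){
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < n) data[idx] *= 2.0f;
}
__global__ void parent(float *data, int n){
    // A single parent thread chooses the child grid size at runtime
    if (threadIdx.x == 0 && blockIdx.x == 0){
        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        child<<<blocks, threads>>>(data, n);
    }
}
""")
parent = mod.get_function("parent")
N = 1024
h_data = np.random.randn(N).astype(np.float32)
d_data = cuda.mem_alloc(h_data.nbytes)
cuda.memcpy_htod(d_data, h_data)
# Only the parent is launched from the host; the child grid is configured on the device
parent(d_data, np.int32(N), block=(1, 1, 1), grid=(1, 1))
cuda.memcpy_dtoh(h_data, d_data)
print(h_data[:10])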
Using Profilers to Identify Bottlenecks
Profilers provide insights into execution time, memory usage, and other performance metrics. By analyzing the output of these tools, we can pinpoint areas where our application is underperforming.
A common scenario is identifying memory transfer bottlenecks. For example, using nvprof:
nvprof python cuda_program.py
The profiler will show detailed information about the kernel's execution time, memory transfer time, and any potential bottlenecks in the code. From a Jupyter notebook, the profiler can be invoked with a shell escape:
!nvprof python ./cpp-cuda/vector-addition
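Alongside an external profiler, elapsed times can also be measured from inside the program with CUDA events, which makes it easy to check whether memory transfers or kernels dominate. Below is a minimal, self-contained sketch using PyCUDA events; the array size is chosen arbitrarily for illustration.
import numpy as np
import pycuda.driver as cuda
import pycuda.autoinit
N = 1 << 20
h_a = np.random.randn(N).astype(np.float32)
d_a = cuda.mem_alloc(h_a.nbytes)
# Time a host-to-device transfer with a pair of CUDA events
start, end = cuda.Event(), cuda.Event()
start.record()
cuda.memcpy_htod(d_a, h_a)
end.record()
end.synchronize()
print("Host-to-device transfer took %.3f ms" % start.time_till(end))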