Advanced CUDA Features and Optimization Techniques

Streams and Concurrency

CUDA streams provide a mechanism to overlap computation and data transfer, allowing us to optimize the GPU’s utilization by performing multiple tasks in parallel. The basic idea is that instead of serializing operations on the GPU (like launching one kernel and waiting for its completion), we can split tasks across multiple streams and execute them concurrently.

Overlapping Computation and Data Transfer

By default, CUDA operates in a synchronous manner: memory transfers (from host to device or device to host) and kernel executions are serialized, meaning one must finish before the other begins. However, we can overlap memory transfers with kernel execution using streams, which allows for more efficient use of the GPU.

To demonstrate overlapping, let’s look at the following example.

In [ ]:
import numpy as np
import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule

# Define the Kernel
mod = SourceModule("""
__global__ void kernel(float *a, float *b){
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    b[idx] = a[idx] * 2;
}
""")
kernel = mod.get_function("kernel")

# Initialize host data
N = 1024
h_a = np.random.randn(N).astype(np.float32)
h_b = np.empty_like(h_a)

# Allocate Device memory
d_a = cuda.mem_alloc(h_a.nbytes)
d_b = cuda.mem_alloc(h_b.nbytes)

# Create Streams
stream1 = cuda.Stream()
stream2 = cuda.Stream()

# Transfer data asynchronously in stream1
cuda.memcpy_htod_async(d_a, h_a, stream1)

# Launch Kernel in stream2 (note: nothing yet forces this to wait for the copy in stream1; see the events example below)
kernel(d_a, d_b, block=(256, 1, 1), grid=(N // 256, 1), stream=stream2)

# Transfer result back (device to host) asynchronously in stream1
cuda.memcpy_dtoh_async(h_b, d_b, stream1)

# Synchronize the streams
stream1.synchronize()
stream2.synchronize()

# Output Result
print(h_b[:10])
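
One caveat the example above glosses over: asynchronous copies can only overlap with other work if the host arrays live in page-locked (pinned) memory; with ordinary NumPy arrays the transfer may not actually run asynchronously. A minimal sketch of allocating pinned buffers with PyCUDA (reusing the variable names from the example above):

In [ ]:
# Pinned (page-locked) host buffers allow memcpy_htod_async / memcpy_dtoh_async
# to truly overlap with kernel execution; pageable NumPy arrays do not.
h_a = cuda.pagelocked_empty(N, dtype=np.float32)
h_a[:] = np.random.randn(N)
h_b = cuda.pagelocked_empty_like(h_a)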
In [16]:
import torch
import numpy as np

# Check if CUDA is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

N = 1024
h_a = np.random.randn(N).astype(np.float32)

# Create streams
stream1 = torch.cuda.Stream()
stream2 = torch.cuda.Stream()

# Use streams
with torch.cuda.stream(stream1):
    d_a = torch.from_numpy(h_a).to(device)

with torch.cuda.stream(stream2):
    d_b = d_a * 2  # Automatic GPU operation

# Synchronize and copy back
torch.cuda.synchronize()
h_b = d_b.cpu().numpy()

print(h_b[:10])
Out [16]:
[ 0.24536662  2.7647512  -0.29380977 -2.7479107   1.5783794   3.2773077
  0.75366443  3.4507585  -2.7262783   1.6730118 ]
In this example:
- We created two streams: stream1 and stream2.
- Data transfer (host to device and device to host) occurs in stream1, and kernel execution happens in stream2.
- By running these operations concurrently in different streams, we achieve better utilization of both the memory bandwidth and the computational power of the GPU.

Note that, as written, nothing forces the kernel in stream2 to wait for the copy in stream1 to finish; the next section shows how to enforce that ordering with events.

Managing Multiple Streams

Managing multiple streams becomes essential when optimizing more complex applications. CUDA streams are independent, and operations issued to different streams can be executed concurrently. However, there may be cases where we want to ensure the correct order of execution between streams.

To manage dependencies between streams, we can use cudaStreamWaitEvent (exposed in PyCUDA as Stream.wait_for_event) to synchronize streams based on certain events. This ensures that kernels in one stream only start after a specific event in another stream has occurred.

Here’s a simple example:

In [ ]:
# Creating an event
event = cuda.Event()

cuda.memcpy_htod_async(d_a, h_a, stream1)
event.record(stream1)

# Make stream2 wait for stream1 to finish
stream2.wait_for_event(event)

# Now we can safely execute the kernel in stream2
kernel(d_a, d_b, block=(256, 1, 1), grid=(N // 256, 1), stream=stream2)

In this code, we used an event to ensure that stream2 only starts its kernel after the data transfer in stream1 is complete.
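
The same mechanism works in the other direction as well. Below is a minimal sketch of the full round trip, assuming the arrays, kernel, and streams defined earlier: a second event makes the device-to-host copy in stream1 wait for the kernel in stream2.

In [ ]:
# Events enforce ordering across the two streams for the full round trip:
# copy in (stream1) -> kernel (stream2) -> copy out (stream1).
copy_done = cuda.Event()
kernel_done = cuda.Event()

cuda.memcpy_htod_async(d_a, h_a, stream1)
copy_done.record(stream1)

# The kernel in stream2 starts only after the host-to-device copy has finished
stream2.wait_for_event(copy_done)
kernel(d_a, d_b, block=(256, 1, 1), grid=(N // 256, 1), stream=stream2)
kernel_done.record(stream2)

# The device-to-host copy in stream1 starts only after the kernel has finished
stream1.wait_for_event(kernel_done)
cuda.memcpy_dtoh_async(h_b, d_b, stream1)
stream1.synchronize()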

Dynamic Parallelism

Dynamic parallelism in CUDA allows a kernel to launch other kernels directly from the device. This is useful for algorithms where the problem size is not known in advance or is irregular, such as adaptive mesh refinement, graph traversal, or recursive algorithms.
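
Dynamic parallelism requires a GPU with compute capability 3.5 or higher, and the device code must be compiled as relocatable and linked against cudadevrt (nvcc -rdc=true ... -lcudadevrt). The sketch below assumes your PyCUDA build provides DynamicSourceModule, which takes care of that compilation step; the kernel and variable names are illustrative, not taken from the text above.

In [ ]:
import numpy as np
import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import DynamicSourceModule

mod = DynamicSourceModule("""
__global__ void child(float *data, int n)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < n)
        data[idx] *= 2.0f;
}

__global__ void parent(float *data, int n)
{
    // A single parent thread picks the child launch configuration at run
    // time and launches the child kernel directly from the device.
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        child<<<blocks, threads>>>(data, n);
    }
    // The parent grid is not considered complete until all of its
    // child grids have completed.
}
""")
parent = mod.get_function("parent")

N = 1024
h = np.random.randn(N).astype(np.float32)
d = cuda.mem_alloc(h.nbytes)
cuda.memcpy_htod(d, h)

# Only the parent is launched from the host; the child launch happens on the device
parent(d, np.int32(N), block=(1, 1, 1), grid=(1, 1))

result = np.empty_like(h)
cuda.memcpy_dtoh(result, d)
print(result[:10])  # expect 2 * h[:10]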

Using Profilers to Identify Bottlenecks

Profilers provide insights into execution time, memory usage, and other performance metrics. By analyzing the output of these tools, we can pinpoint areas where our application is underperforming.

A common scenario is identifying memory transfer bottlenecks. For example, using nvprof:

`nvprof python cuda_program.py`

The profiler will show detailed information about the kernel’s execution time, memory transfer time, and any potential bottlenecks in the code.

In [ ]:
! nvprof python  ./cpp-cuda/vector-addition
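
Note that on newer GPUs and CUDA toolkits nvprof has been superseded by Nsight Systems; if nvprof is unavailable, an equivalent command-line workflow is `nsys profile python cuda_program.py`, which writes a report that can be inspected with `nsys stats` or the Nsight Systems GUI.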