- Deep Learning and Machine Learning with CUDA
- Understanding Data Flow in Deep Learning: CPU, GPU, RAM, VRAM, Cache, and Disk Storage
- The GPU Hierarchical Structure
- CPU and GPU Comparison
- Linear Regression Algorithm
- Matrix Addition
- Matrix Multiplication: Naive, Optimized, and CUDA Approaches
- Neural Network: Multi-layer Network
- Vector Addition
- CUDA Kernel for Parallel Reduction
- Cumulative Sum
- Advanced CUDA Features and Optimization Techniques
Example 1: Sequential operations in Python
Consider the following Python code, which demonstrates how a CPU handles a series of sequential
operations, such as iterating through a list and performing a calculation on each item. Since
CPUs are optimized for fast single-threaded execution, this is a typical example of the kind of task
at which they excel.
numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9]
squared_numbers = []
for number in numbers:
    squared_numbers.append(number ** 2)
print(squared_numbers)
[1, 4, 9, 16, 25, 36, 49, 64, 81]
In this case, the CPU performs each iteration of the loop one after the other in a linear sequence,
quickly handling each task.
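To make the sequential nature of this loop more concrete, the short sketch below (an illustrative addition, using only the standard-library time module and an arbitrary one-million-element list) times how long a single CPU core takes to work through every iteration one after another.
import time

numbers = list(range(1_000_000))
squared_numbers = []
start = time.perf_counter()
for number in numbers:
    # Each iteration executes only after the previous one has finished.
    squared_numbers.append(number ** 2)
elapsed = time.perf_counter() - start
print(f"Squared {len(numbers):,} numbers sequentially in {elapsed:.3f} s")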
Example 2: Parallel operations in Python using TensorFlow
In this example, we will demonstrate how to use TensorFlow to perform parallel matrix operations on a GPU. TensorFlow automatically detects available GPUs and offloads operations to them.
! pip install tensorflow
import tensorflow as tf
# Create a large matrix
matrix = tf.random.uniform((1000,1000))
print(matrix)
# Perform a matrix multiplication (parallelized on the GPU)
result = tf.matmul(matrix, matrix)
tf.print(result)
tf.Tensor(
[[0.13004518 0.84748614 0.2270875  ... 0.7458832  0.99783456 0.5731169 ]
 [0.08732259 0.70059323 0.23242676 ... 0.7435374  0.53951645 0.8577293 ]
 [0.10347593 0.14106536 0.81996095 ... 0.8034626  0.00634098 0.1147083 ]
 ...
 [0.30464292 0.79907846 0.77350307 ... 0.95706356 0.30141973 0.77926624]
 [0.3281932  0.5348165  0.3558544  ... 0.88482213 0.19720232 0.6675515 ]
 [0.8592788  0.0202378  0.61015797 ... 0.7103808  0.74298215 0.2031815 ]], shape=(1000, 1000), dtype=float32)
[[237.677307 242.357559 241.551941 ... 255.651245 249.1474 241.301529]
 [249.864243 245.160522 251.838593 ... 268.313385 253.124878 247.757324]
 [248.079971 247.996582 248.186905 ... 263.282562 257.262878 247.419128]
 ...
 [246.074677 249.738266 244.21106 ... 261.871704 258.541809 239.883972]
 [245.866302 251.497 249.791916 ... 267.468079 257.267151 246.39621]
 [233.316971 238.451 239.813828 ... 253.885117 243.204727 239.814]]
In this case, TensorFlow automatically offloads the matrix multiplication to the GPU when one (and its CUDA libraries) is available; otherwise the same code falls back to the CPU.
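If you want to confirm where the computation actually ran, the minimal sketch below (an illustrative addition, not part of the original example) uses TensorFlow's device APIs and assumes at most one GPU exposed as '/GPU:0'.
import tensorflow as tf

# List the GPUs TensorFlow can see; an empty list means it will fall back to the CPU.
gpus = tf.config.list_physical_devices('GPU')
print("Visible GPUs:", gpus)

if gpus:
    # Pin the multiplication to the first GPU explicitly.
    with tf.device('/GPU:0'):
        matrix = tf.random.uniform((1000, 1000))
        result = tf.matmul(matrix, matrix)
    # The device string of the result confirms where the kernel executed.
    print(result.device)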
Example 3: Parallel operations in Python using PyTorch and GPU
This example demonstrates the use of a GPU to perform parallel operations with PyTorch, a popular deep learning framework that provides GPU acceleration. We select a CUDA device so that subsequent matrix operations can leverage the GPU.
import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device
device(type='cuda')
GPUs in Matrix Operations
For instance, in a machine learning context, GPUs are often used to train models that recognize images or understand natural language. Below is an example using the PyTorch library to demonstrate how GPUs can accelerate the tensor computations at the heart of such training:
import torch
# Check if GPU is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# Create a sample tensor and move it to the selected device
data = torch.randn(1000, 1000).to(device)
# Perform an element-wise tensor operation
result = data * data
print(result)
tensor([[1.3159e+00, 2.3972e+00, 7.3035e-01, ..., 2.3215e+00, 1.5452e+00, 3.2903e+00],
        [5.8138e-01, 1.8858e-01, 1.0537e-02, ..., 2.0386e-02, 1.2114e+00, 5.2845e-01],
        [1.1759e+00, 9.0369e-01, 2.4117e-03, ..., 2.0131e-01, 5.2694e+00, 2.9426e-01],
        ...,
        [2.2667e-01, 3.4197e+00, 2.1350e-02, ..., 1.5322e+00, 1.4658e+00, 6.8606e-02],
        [1.2529e-01, 1.9850e+00, 2.1910e+00, ..., 2.0396e+00, 3.6126e-03, 3.4626e-02],
        [2.7877e-02, 9.7797e-01, 2.4776e+00, ..., 3.7252e-04, 1.6705e+00, 1.4745e+00]],
       device='cuda:0')
In this example, if a GPU is available, the tensor operations will be performed on it, speeding up the
computation.
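To see the speed-up rather than take it on faith, the rough timing sketch below (an illustrative addition, not part of the original example) runs the same 1000x1000 matrix multiplication on the CPU and, if available, on the GPU. Because CUDA kernels launch asynchronously, torch.cuda.synchronize() is needed before reading the clock; the exact numbers depend entirely on your hardware.
import time
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Time the multiplication on the CPU.
a_cpu = torch.randn(1000, 1000)
start = time.perf_counter()
torch.matmul(a_cpu, a_cpu)
cpu_time = time.perf_counter() - start
print(f"CPU: {cpu_time:.4f} s")

if device.type == 'cuda':
    a_gpu = a_cpu.to(device)
    torch.matmul(a_gpu, a_gpu)      # warm-up call triggers CUDA initialization
    torch.cuda.synchronize()        # wait for the warm-up to finish
    start = time.perf_counter()
    torch.matmul(a_gpu, a_gpu)
    torch.cuda.synchronize()        # kernels launch asynchronously, so wait before stopping the clock
    gpu_time = time.perf_counter() - start
    print(f"GPU: {gpu_time:.4f} s")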
Architecture Comparison
The CPU and GPU architectures differ fundamentally in their design and purpose. While CPUs have
fewer cores, each core is highly sophisticated and capable of handling complex instructions. This
makes CPUs ideal for managing general-purpose tasks, with the control unit acting as a “leader”, coordinating
the system. On the other hand, GPUs are equipped with a large number of simple, lightweight
cores that excel at parallel processing. The GPU architecture is designed to handle large numbers of
simple, repetitive tasks, functioning more like the “workers” in a large team that efficiently execute
many tasks simultaneously.
In the CPU diagram, the control unit coordinates the smaller number of Arithmetic Logic Units
(ALUs) to perform general-purpose computation. The IO and cache systems support data transfer
and storage, enabling the CPU to handle a wide range of complex tasks.
In the GPU diagram, the architecture emphasizes a much larger number of simple cores. Each core
is optimized for performing specific, simple tasks in parallel, which is ideal for graphics rendering and
other highly parallel computations. This design trades off individual core power for sheer numbers,
focusing on throughput over latency.
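This “few sophisticated cores versus many simple cores” contrast can be inspected directly from Python. The sketch below (an illustrative addition, assuming a CUDA-capable PyTorch install) uses torch.cuda.get_device_properties to report the number of streaming multiprocessors on the GPU, each of which contains many CUDA cores, alongside the comparatively small number of logical CPU cores reported by os.cpu_count().
import os
import torch

print("CPU logical cores:", os.cpu_count())

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    # Each streaming multiprocessor (SM) contains many simple CUDA cores that
    # execute threads in parallel; this is where the GPU's throughput comes from.
    print("GPU:", props.name)
    print("Streaming multiprocessors:", props.multi_processor_count)
    print("Total VRAM (GB):", round(props.total_memory / 1024**3, 1))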