- Deep Learning and Machine Learning with CUDA
- Understanding Data Flow in Deep Learning: CPU, GPU, RAM, VRAM, Cache, and Disk Storage
- The GPU Hierarchical Structure
- CPU and GPU Comparison
- Linear Regression Algorithm
- Matrix Addition
- Matrix Multiplication: Naive, Optimized, and CUDA Approaches
- Neural Network: Multi-layer Network
- Vector Addition
- CUDA Kernel for Parallel Reduction
- Cumulative Sum
- Advanced CUDA Features and Optimization Techniques
Example 1: Sequential operations in Python
Consider the following Python code, which demonstrates how a CPU handles a series of sequential
operations, such as iterating through a list and performing a calculation on each item. Since
CPUs are optimized for fast single-threaded execution, this is a typical example of the kind of task
at which they excel.
numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9]
squared_numbers = []
for number in numbers:
    squared_numbers.append(number ** 2)
print(squared_numbers)
[1, 4, 9, 16, 25, 36, 49, 64, 81]
In this case, the CPU performs each iteration of the loop one after the other in a linear sequence,
quickly handling each task.
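To make the sequential nature of this loop more concrete, the short sketch below (an illustrative addition, using only the standard-library time module and an arbitrary one-million-element list) times how long a single CPU core takes to work through every iteration one after another.
import time

numbers = list(range(1_000_000))
squared_numbers = []
start = time.perf_counter()
for number in numbers:
    # Each iteration executes only after the previous one has finished.
    squared_numbers.append(number ** 2)
elapsed = time.perf_counter() - start
print(f"Squared {len(numbers):,} numbers sequentially in {elapsed:.3f} s")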
Example 2: Parallel operations in Python using TensorFlow
In this example, we will demonstrate how to use TensorFlow to perform parallel matrix operations on a GPU. TensorFlow automatically detects available GPUs and offloads operations to them.
! pip install tensorflow
import tensorflow as tf
# Create a large matrix
matrix = tf.random.uniform((1000,1000))
print(matrix)
# Perform a matrix multiplication (parallelized on the GPU)
result = tf.matmul(matrix, matrix)
tf.print(result)
tf.Tensor(
[[0.13004518 0.84748614 0.2270875  ... 0.7458832  0.99783456 0.5731169 ]
 [0.08732259 0.70059323 0.23242676 ... 0.7435374  0.53951645 0.8577293 ]
 [0.10347593 0.14106536 0.81996095 ... 0.8034626  0.00634098 0.1147083 ]
 ...
 [0.30464292 0.79907846 0.77350307 ... 0.95706356 0.30141973 0.77926624]
 [0.3281932  0.5348165  0.3558544  ... 0.88482213 0.19720232 0.6675515 ]
 [0.8592788  0.0202378  0.61015797 ... 0.7103808  0.74298215 0.2031815 ]], shape=(1000, 1000), dtype=float32)
[[237.677307 242.357559 241.551941 ... 255.651245 249.1474 241.301529]
 [249.864243 245.160522 251.838593 ... 268.313385 253.124878 247.757324]
 [248.079971 247.996582 248.186905 ... 263.282562 257.262878 247.419128]
 ...
 [246.074677 249.738266 244.21106 ... 261.871704 258.541809 239.883972]
 [245.866302 251.497 249.791916 ... 267.468079 257.267151 246.39621]
 [233.316971 238.451 239.813828 ... 253.885117 243.204727 239.814]]
In this case, TensorFlow automatically offloads the matrix multiplication to the GPU when one (and its CUDA libraries) is available; otherwise the same code falls back to the CPU.
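If you want to confirm where the computation actually ran, the minimal sketch below (an illustrative addition, not part of the original example) uses TensorFlow's device APIs and assumes at most one GPU exposed as '/GPU:0'.
import tensorflow as tf

# List the GPUs TensorFlow can see; an empty list means it will fall back to the CPU.
gpus = tf.config.list_physical_devices('GPU')
print("Visible GPUs:", gpus)

if gpus:
    # Pin the multiplication to the first GPU explicitly.
    with tf.device('/GPU:0'):
        matrix = tf.random.uniform((1000, 1000))
        result = tf.matmul(matrix, matrix)
    # The device string of the result confirms where the kernel executed.
    print(result.device)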
Example 3: Parallel operations in Python using PyTorch and GPU
This example demonstrates the use of a GPU to perform parallel operations with PyTorch, a popular deep learning framework that provides GPU acceleration. We select a CUDA device so that subsequent matrix operations can leverage the GPU.
import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device
device(type='cuda')
GPUs in Matrix Operations
For instance, in a machine learning context, GPUs are often used to train models that recognize images or understand natural language. Below is an example using the PyTorch library to demonstrate how GPUs can accelerate the tensor computations at the heart of such training:
import torch
# Check if GPU is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# Create a sample tensor and move it to the selected device
data = torch.randn(1000, 1000).to(device)
# Perform an element-wise tensor operation
result = data * data
print(result)
tensor([[1.3159e+00, 2.3972e+00, 7.3035e-01, ..., 2.3215e+00, 1.5452e+00, 3.2903e+00],
        [5.8138e-01, 1.8858e-01, 1.0537e-02, ..., 2.0386e-02, 1.2114e+00, 5.2845e-01],
        [1.1759e+00, 9.0369e-01, 2.4117e-03, ..., 2.0131e-01, 5.2694e+00, 2.9426e-01],
        ...,
        [2.2667e-01, 3.4197e+00, 2.1350e-02, ..., 1.5322e+00, 1.4658e+00, 6.8606e-02],
        [1.2529e-01, 1.9850e+00, 2.1910e+00, ..., 2.0396e+00, 3.6126e-03, 3.4626e-02],
        [2.7877e-02, 9.7797e-01, 2.4776e+00, ..., 3.7252e-04, 1.6705e+00, 1.4745e+00]],
       device='cuda:0')
In this example, if a GPU is available, the tensor operations will be performed on it, speeding up the
computation.
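To see the speed-up rather than take it on faith, the rough timing sketch below (an illustrative addition, not part of the original example) runs the same 1000x1000 matrix multiplication on the CPU and, if available, on the GPU. Because CUDA kernels launch asynchronously, torch.cuda.synchronize() is needed before reading the clock; the exact numbers depend entirely on your hardware.
import time
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Time the multiplication on the CPU.
a_cpu = torch.randn(1000, 1000)
start = time.perf_counter()
torch.matmul(a_cpu, a_cpu)
cpu_time = time.perf_counter() - start
print(f"CPU: {cpu_time:.4f} s")

if device.type == 'cuda':
    a_gpu = a_cpu.to(device)
    torch.matmul(a_gpu, a_gpu)      # warm-up call triggers CUDA initialization
    torch.cuda.synchronize()        # wait for the warm-up to finish
    start = time.perf_counter()
    torch.matmul(a_gpu, a_gpu)
    torch.cuda.synchronize()        # kernels launch asynchronously, so wait before stopping the clock
    gpu_time = time.perf_counter() - start
    print(f"GPU: {gpu_time:.4f} s")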
Architecture Comparison
The CPU and GPU architectures differ fundamentally in their design and purpose. While CPUs have
fewer cores, each core is highly sophisticated and capable of handling complex instructions. This
makes CPUs ideal for managing general-purpose tasks, with the control unit acting as a “leader”, coordinating
the system. On the other hand, GPUs are equipped with a large number of simple, lightweight
cores that excel at parallel processing. The GPU architecture is designed to handle large numbers of
simple, repetitive tasks, functioning more like the “workers” in a large team that efficiently execute
many tasks simultaneously.
In the CPU diagram, the control unit coordinates the smaller number of Arithmetic Logic Units
(ALUs) to perform general-purpose computation. The IO and cache systems support data transfer
and storage, enabling the CPU to handle a wide range of complex tasks.
In the GPU diagram, the architecture emphasizes a much larger number of simple cores. Each core
is optimized for performing specific, simple tasks in parallel, which is ideal for graphics rendering and
other highly parallel computations. This design trades off individual core power for sheer numbers,
focusing on throughput over latency.
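This “few sophisticated cores versus many simple cores” contrast can be inspected directly from Python. The sketch below (an illustrative addition, assuming a CUDA-capable PyTorch install) uses torch.cuda.get_device_properties to report the number of streaming multiprocessors on the GPU, each of which contains many CUDA cores, alongside the comparatively small number of logical CPU cores reported by os.cpu_count().
import os
import torch

print("CPU logical cores:", os.cpu_count())

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    # Each streaming multiprocessor (SM) contains many simple CUDA cores that
    # execute threads in parallel; this is where the GPU's throughput comes from.
    print("GPU:", props.name)
    print("Streaming multiprocessors:", props.multi_processor_count)
    print("Total VRAM (GB):", round(props.total_memory / 1024**3, 1))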