Understanding Parallel Computing: GPUs vs CPUs and the Role of CUDA, Explained Simply

Dec 25, 2024

Introduction

In 1996, NVIDIA entered the 3D accelerator market, initially behind the competition. Through constant learning and improvement, however, it achieved great success in 1999 with the introduction of the GeForce 256, recognized as the first graphics card marketed as a GPU. Initially designed for gaming, GPUs later found a plethora of business applications in math, science, and engineering.

In 2003, Ian Buck and his team introduced Brook, the first widely embraced programming model that extended C with data-parallel constructs. Buck later played a key role at NVIDIA, leading the 2006 launch of CUDA, the first commercially available solution for general-purpose computing on GPUs.

CUDA serves as the connecting bridge between NVIDIA GPUs and GPU-based applications, enabling popular deep learning libraries like TensorFlow and PyTorch to leverage GPU acceleration. This capability is crucial for optimizing deep learning tasks and underscores the importance of GPUs in the field. Today, CUDA is widely considered essential for AI development, and it is a software component of almost every AI development pipeline.

Prerequisites

  1. Basic Computer Architecture

    • Understand what CPUs and GPUs are and their primary functions.
    • Familiarity with cores, threads, and the general concept of computation.
  2. Introduction to Parallelism

    • Grasp the difference between serial and parallel processing.
    • Awareness of tasks that benefit from parallelism, such as matrix operations.
  3. Programming Fundamentals

    • Basic knowledge of programming languages like Python or C/C++.
    • Experience with loops, conditional statements, and functions.
  4. CUDA Overview

    • High-level understanding of CUDA as a framework for parallel computing on NVIDIA GPUs.
    • Recognize CUDA’s role in enabling developers to write programs that exploit GPU parallelism.

What is Parallel Computing?

In simpler terms, parallel computing is a way of solving a single problem by breaking it down into smaller chunks and solving each one simultaneously. Instead of having one powerful machine complete one complex process, parallel computing uses multiple computers or processors to work on different pieces of the problem at the same time. This cooperative approach speeds up the handling of large tasks. It is similar to a team of co-workers handling different assignments simultaneously in order to meet a shared goal. Together, the smaller workers deliver a dramatic increase in overall processing speed.
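
As a minimal sketch of this idea in code, the example below splits one large summation across several worker processes using only Python's standard library; the chunk size and worker count are arbitrary choices for illustration:

from multiprocessing import Pool

def partial_sum(bounds):
    lo, hi = bounds
    return sum(range(lo, hi))

if __name__ == "__main__":
    n = 10_000_000
    step = 2_500_000
    # break the problem into four chunks, one per worker
    chunks = [(i, min(i + step, n)) for i in range(0, n, step)]
    with Pool(processes=4) as pool:
        total = sum(pool.map(partial_sum, chunks))
    print(total)  # same result as sum(range(n)), computed in parallel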

CUDA in Simpler Terms

CUDA, or Compute Unified Device Architecture, created by NVIDIA, is a software platform for parallel computing. Since its popularization in the mid-2000s, it has been used on many industrial problems in fields like computer graphics, finance, data mining, machine learning, and scientific computing. CUDA enables accelerated computing through its specialized programming language and is compatible with most operating systems.

GPU vs CPU

A CPU, or central processing unit, serves as the primary computational unit in a server or machine; it is known for handling diverse computing tasks for the operating system and applications. The CPU is responsible for executing the mathematical and logical calculations in our computer. Its primary function is to run code, handling tasks such as copying files, deleting data, and processing user inputs. Moreover, the CPU acts as a mediator for communication between different computer peripherals, ensuring they don't interact directly but go through the CPU.

While it may seem that a CPU can multitask, each core of the CPU can only handle one task at a time. Each core operates as an independent processing unit, and the ability to multitask is determined by the number of cores in the hardware. Generally, 2 to 8 cores per CPU is enough for any task a layperson may need, and these CPUs perform so efficiently that we can't even notice that our tasks are being executed in a sequence rather than all at once. This is the case for almost everything we use CPUs for on a daily basis.
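
You can check how many cores your own CPU exposes with Python's standard library; note that os.cpu_count() reports logical cores, which may be double the physical count on hyper-threaded CPUs:

import os

# logical cores visible to the operating system
print("CPU cores:", os.cpu_count())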

A graphics processing unit (GPU), on the other hand, is a specialized hardware component that can efficiently handle parallel mathematical operations, surpassing the general-purpose capabilities of a CPU. Initially designed for graphics rendering in gaming and animation, GPUs have since evolved to perform a broader range of tasks beyond their original scope. Both, however, are computer hardware designed to handle certain kinds of tasks.

Let’s take a look at some raw numbers. While the most advanced consumer CPU systems generally come equipped with 16 cores, the most advanced consumer-grade GPU (the NVIDIA RTX 4090) has 16,384 CUDA cores. This difference is only magnified when looking at the H100, which has 18,432 CUDA cores. CUDA cores are generally less powerful than individual CPU cores, so we cannot make direct comparisons. However, the sheer volume of CUDA cores should show why GPUs are comparatively ideal for handling large amounts of computation in parallel.
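
PyTorch does not report CUDA cores directly, but it does expose the GPU's streaming multiprocessor (SM) count, from which the core count can be estimated. In the sketch below, the cores-per-SM factor is an assumption that varies by GPU architecture (128 is correct for Ada-generation cards such as the RTX 4090):

import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    cores_per_sm = 128  # assumption: architecture-dependent
    print(props.name, "| SMs:", props.multi_processor_count,
          "| ~CUDA cores:", props.multi_processor_count * cores_per_sm)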

When comparing CPUs and GPUs, it might seem like a good idea to rely solely on GPUs because of their parallel processing capabilities. However, the need for CPUs continues, because parallelism isn't always the most efficient approach. We also use CPUs for general computing that would be almost too simple for GPUs; in certain scenarios, executing tasks sequentially can be more time- and resource-effective than parallel processing. An advantage of CUDA lies in its ability to seamlessly switch between CPU and GPU processing for specific tasks. This flexibility allows programmers to strategically decide when to use which hardware component, providing finer control over the computer's operations.
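
In PyTorch, for example, that switching is just a matter of deciding where a tensor lives; a minimal sketch:

import torch

t = torch.randn(3, 3)              # created on the CPU
if torch.cuda.is_available():
    t = t.to('cuda')               # offload the heavy parallel work to the GPU
    t = (t @ t).to('cpu')          # bring the result back for light sequential work
print(t.device)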

CUDA’s Role in GPU Computing

You can look at the CUDA version and GPU info by typing nvidia-smi into your terminal. In a notebook cell, we can do this by adding a ! at the start of the line.

!nvidia-smi

Once we have confirmed that our machine has everything we need set up, we can import the Torch package. It also has a handy CUDA checker function we can use to ensure that Torch was properly installed and can detect CUDA and the GPU.

import torch
use_cuda = torch.cuda.is_available()

In this case, it will return True.

or,

if torch.cuda.is_available():
    device = torch.device('cuda')
else:
    device = torch.device('cpu')
print("using", device, "device")

With CUDA, programmers can design and implement parallel algorithms that take advantage of the thousands of cores present in modern GPUs. This parallelization is crucial for computationally intensive tasks such as scientific research, machine learning, video editing, and data processing. CUDA provides a programming model and a set of APIs that enable developers to write code that runs directly on the GPU, unlocking the potential for significant performance gains compared to traditional CPU-based computing. By offloading parallelizable workloads to the GPU, CUDA plays a key role in enhancing the computational capabilities of GPUs and driving advancements in high-performance computing applications.
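
To make "code that runs directly on the GPU" concrete from Python, here is a hypothetical sketch using the third-party Numba library (not used elsewhere in this post), which compiles the decorated function into a CUDA kernel; it assumes numba and a CUDA-capable GPU are available:

import numpy as np
from numba import cuda

@cuda.jit
def add_kernel(x, y, out):
    i = cuda.grid(1)        # this thread's global index
    if i < x.size:          # guard threads that fall past the end of the array
        out[i] = x[i] + y[i]

n = 1_000_000
x = np.ones(n, dtype=np.float32)
y = np.full(n, 2.0, dtype=np.float32)
out = np.zeros(n, dtype=np.float32)

threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block
add_kernel[blocks, threads_per_block](x, y, out)  # one thread per element
print(out[:3])  # [3. 3. 3.]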


Speed Test

Let us try to get some information about the CUDA version and the GPU:

if device:
    # note: torch.backends.cudnn.version() reports the cuDNN build (e.g. 8302 = 8.3.2), not the CUDA toolkit version
    print('__CUDA VERSION:', torch.backends.cudnn.version())
    print('__Number CUDA Devices:', torch.cuda.device_count())
    print('__CUDA Device Name:', torch.cuda.get_device_name(0))
    print('__CUDA Device Total Memory [GB]:', torch.cuda.get_device_properties(0).total_memory/1e9)

__CUDA VERSION: 8302
__Number CUDA Devices: 1
__CUDA Device Name: NVIDIA RTX A4000
__CUDA Device Total Memory [GB]: 16.89124864

We will conduct three speed tests to compare the performance of the CPU versus the GPU. Additionally, for the fourth test, we will generate a synthetic dataset using Stable Diffusion and measure the speed at which an A4000 GPU can complete the task.

For writing this demo, we chose to use an NVIDIA RTX A4000, but it should work on any GPU or CPU machine.

Matrix Division

The Python code below performs matrix division on both the CPU and GPU, and it measures the time the operation takes on each device.

The code creates random matrices, performs the operation on the CPU, transfers the matrices to the GPU, and then measures the time taken for the same operation on the GPU. The loop repeats the GPU measurement five times for more reliable timing results. The torch.cuda.synchronize() call ensures that the transfer to the GPU has completed before timing begins.

import time

matrix_size = 43*15

x = torch.randn(matrix_size, matrix_size)
y = torch.randn(matrix_size, matrix_size)

print("######## CPU SPEED ##########")
start = time.time()
result = torch.div(x, y)
print(time.time() - start)
print("verify device:", result.device)

x_gpu = x.to(device)
y_gpu = y.to(device)
torch.cuda.synchronize()  # wait for the transfer to finish before timing

for i in range(5):
    print("######## GPU SPEED ##########")
    start = time.time()
    result_gpu = torch.div(x_gpu, y_gpu)
    print(time.time() - start)
    print("verify device:", result_gpu.device)

(Output: CPU vs. GPU timings for the matrix division test)

As we can see, the computations were significantly faster on the GPU than on the CPU.
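
One caveat: GPU kernels launch asynchronously, so wall-clock timing like the above can measure little more than the kernel launch itself. A sketch of a more faithful measurement using CUDA events (reusing x_gpu and y_gpu from the code above):

start_evt = torch.cuda.Event(enable_timing=True)
end_evt = torch.cuda.Event(enable_timing=True)

start_evt.record()
result_gpu = torch.div(x_gpu, y_gpu)
end_evt.record()
torch.cuda.synchronize()  # wait until both events have actually happened
print(start_evt.elapsed_time(end_evt), "ms")  # elapsed GPU time in milliseconds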

Build an Artificial Neural Network

The Python code below builds a simple neural network model and trains it on both the CPU and GPU to demonstrate a basic speed test.

import tensorflow as tf
import time

data_size = 10000
input_data = tf.random.normal([data_size, data_size])

model = tf.keras.Sequential([
    tf.keras.layers.Dense(1000, activation='relu', input_shape=(data_size,)),
    tf.keras.layers.Dense(1000, activation='relu'),
    tf.keras.layers.Dense(1)
])
model.compile(optimizer='adam', loss='mse')

def speed_test(device):
    with tf.device(device):
        start_time = time.time()
        model.fit(input_data, tf.zeros(data_size), epochs=1, batch_size=32, verbose=0)
        end_time = time.time()
        return end_time - start_time

cpu_time = speed_test('/CPU:0')
print("Time taken on CPU: {:.2f} seconds".format(cpu_time))

gpu_time = speed_test('/GPU:0')
print("Time taken on GPU: {:.2f} seconds".format(gpu_time))

(Output: CPU vs. GPU training times for the ANN)

Build a Convolutional Neural Network (CNN)

The code below trains a Convolutional Neural Network (CNN) on the MNIST dataset using TensorFlow. The speed_test function measures the time taken for training on both the CPU and GPU, allowing us to compare their performance.

import tensorflow as tf
from tensorflow.keras import layers, models
import time

(train_images, train_labels), (test_images, test_labels) = tf.keras.datasets.mnist.load_data()
train_images = train_images.reshape((60000, 28, 28, 1)).astype('float32') / 255
test_images = test_images.reshape((10000, 28, 28, 1)).astype('float32') / 255

model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.Flatten())
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(10, activation='softmax'))

model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

def speed_test(device):
    with tf.device(device):
        start_time = time.time()
        model.fit(train_images, train_labels, epochs=5, batch_size=64,
                  validation_data=(test_images, test_labels), verbose=0)
        end_time = time.time()
        return end_time - start_time

cpu_time = speed_test('/CPU:0')
print("Time taken on CPU: {:.2f} seconds".format(cpu_time))

gpu_time = speed_test('/GPU:0')
print("Time taken on GPU: {:.2f} seconds".format(gpu_time))

(Output: CPU vs. GPU training times for the CNN)

Create a Synthetic Emotions Dataset with Stable Diffusion

Next, let us try creating a synthetic dataset with Stable Diffusion by generating 10 images for each of several emotions, such as angry, sad, lonely, and happy. Follow the steps below to recreate the dataset.

Please note that the code below will require a GPU.

First, we need to install the necessary libraries.

!pip install --upgrade diffusers transformers scipy
!pip install --quiet ipyplot

Please make sure to restart the kernel once the libraries above are installed, or this may not work.

Import the necessary packages, and specify the model id of the pre-trained model.

We will assign the string “cuda” to the variable device. This indicates that the code intends to use a CUDA-enabled GPU for computation.

import torch
from torch import autocast
from diffusers import StableDiffusionPipeline
import ipyplot
import random
import os
import time
import matplotlib.pyplot as plt

model_id = "CompVis/stable-diffusion-v1-5"
device = "cuda"
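
If you want the notebook to degrade gracefully on machines without a GPU, a common guarded alternative to the hard-coded string is the following (though the generation loop further below would be very slow on a CPU):

device = "cuda" if torch.cuda.is_available() else "cpu"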

Create an instance of the StableDiffusionPipeline class by loading the pre-trained model specified in the variable model_id. The from_pretrained method is commonly used in deep learning frameworks to instantiate a model and load pre-trained weights if available.

pipe = StableDiffusionPipeline.from_pretrained(model_id)
pipe = pipe.to(device)
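
On GPUs with limited memory, diffusers can also load the weights in half precision; a variant of the same call (fp16 roughly halves memory use, at a small cost in numerical precision):

pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe = pipe.to(device)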

Create the directories to store the images:

os.makedirs('/notebooks/happy', exist_ok=True)
os.makedirs('/notebooks/sad', exist_ok=True)
os.makedirs('/notebooks/angry', exist_ok=True)
os.makedirs('/notebooks/surprised', exist_ok=True)
os.makedirs('/notebooks/lonely', exist_ok=True)

The next lines of code will generate images using the StableDiffusionPipeline for different emotions and genders. They do so in a loop, creating 10 images for each emotion.

genders = ['male', 'female']

emotion_prompts = {'happy': 'smiling',
                   'surprised': 'surprised, opened mouth, raised eyebrows',
                   'sad': 'frowning, sad face expression, crying',
                   'angry': 'angry, fierce, irritated',
                   'lonely': 'lonely, alone, lonesome'}

print("######## GPU SPEED ##########")
start = time.time()

for j in range(10):
    for emotion in emotion_prompts.keys():
        emotion_prompt = emotion_prompts[emotion]
        gender = random.choice(genders)
        prompt = 'Medium-shot image of {}, {}, front view, looking at the camera, color photography, '.format(gender, emotion_prompt) + \
                 'photorealistic, hyperrealistic, realistic, incredibly detailed, sharp focus, digital art, depth of field, 50mm, 8k'
        negative_prompt = '3d, cartoon, anime, sketches, (worst quality:2), (low quality:2), (normal quality:2), lowres, normal quality, ((monochrome)), ' + \
                          '((grayscale)) Low Quality, Worst Quality, plastic, fake, disfigured, deformed, blurry, bad anatomy, blurred, watermark, grainy, signature'
        image = pipe(prompt=prompt, negative_prompt=negative_prompt).images[0]
        image.save('/notebooks/{}/{}.png'.format(emotion, str(j).zfill(4)))

print(time.time() - start)

Now, let’s run the code and look at how long the task takes at A4000 GPU speeds, and then use a slight change to compare it with CPU speeds.

(Output: total generation time on the A4000 GPU)

And then, to put our pipeline onto the CPU, simply use the following snippet before running the same code:

pipe.to('cpu')

This will get us our CPU times, shown below.

(Output: total generation time on the CPU)

As we can see, the CPU was significantly slower. This is because computers represent images as arrays of numbers, and performing that multitude of operations in parallel on a GPU is just much more efficient.

Results

Here is an overview of all of our analyses from this blog post. GPUs were consistently faster across all of these data processing and machine learning tasks.

Speed Test

Task                GPU (seconds)     CPU (seconds)
Matrix Operation    5.8e-05 (avg)     0.00846
ANN                 2.78              23.30
CNN                 48.31             167.68
Stable Diffusion    121.03            3153.04

Conclusion

The pairing of CUDA with NVIDIA GPUs holds a dominant position in various application domains, particularly in the field of deep learning. This combination serves as a cornerstone for powering some of the world's supercomputers.

CUDA and NVIDIA GPUs have successfully powered industries such as deep learning, data science and analytics, gaming, finance, research, and many more. For instance, deep learning relies heavily on accelerated computing, particularly GPUs and specialized hardware like TPUs.

The use of GPUs significantly accelerates the training process, reducing it from months to a week. Various deep learning frameworks, including TensorFlow, PyTorch, and others, depend on CUDA for GPU support and cuDNN for deep neural network computations. Performance gains are shared across frameworks when these underlying technologies improve, but frameworks differ in how well they scale to multiple GPUs and nodes.

In summary, when picking a GPU for deep learning or any other AI task, one of the things to keep in mind is that the GPU should support CUDA.

We hope you enjoyed reading the article.
