PyTorch 101: Memory Management and Using Multiple GPUs

Oct 10, 2024

Introduction

In this post, we will cover:

  1. How to use multiple GPUs for your network, using either data parallelism or model parallelism.
  2. How to automate the selection of a GPU when creating new tensors.
  3. How to diagnose and analyze memory issues should they arise.

So, let’s get started.

Before we begin, let me remind you of the other parts of our PyTorch series.

  1. Understanding Graphs, Automatic Differentiation and Autograd
  2. Building Your First Neural Network

You can get all the code in this post (and other posts as well) in the GitHub repo here.

Prerequisites

Before diving into PyTorch 101: Memory Management and Using Multiple GPUs, ensure you have the following:

  • Basic understanding of Python and PyTorch.
  • PyTorch installed on your system.
  • Access to a CUDA-enabled GPU or multiple GPUs for testing (optional but recommended).
  • Familiarity with GPU memory management concepts (optional but beneficial).
  • pip for installing any additional packages.

Moving tensors around CPUs / GPUs

Every Tensor in PyTorch has a to() member function. Its job is to place the tensor on which it is called onto a certain device, whether that be the CPU or a particular GPU. The input to the to function is a torch.device object, which can be initialised with either of the following inputs.

  1. cpu for CPU
  2. cuda:0 for putting it on GPU number 0. Similarly, cuda:n puts it on GPU number n.

Generally, whenever you initialise a Tensor, it is placed on the CPU. You can then move it to the GPU. You can check whether a GPU is available or not by invoking the torch.cuda.is_available function.

if torch.cuda.is_available():
    dev = "cuda:0"
else:
    dev = "cpu"

device = torch.device(dev)

a = torch.zeros(4,3)
a = a.to(device)

You can also move a tensor to a particular GPU by giving its index as the argument to the to function.
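For example, a minimal sketch (assuming the machine has at least two GPUs, so that index 1 is valid):

a = a.to(1)    # equivalent to a.to("cuda:1"); places the tensor on GPU 1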

Importantly, the above piece of code is device agnostic, that is, you don't have to change it separately for it to work on both the GPU and the CPU.

cuda() function

Another way to put tensors on GPUs is to call the cuda(n) function on them, where n is the index of the GPU. If you just call cuda, the tensor is placed on GPU 0.
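A quick sketch of both forms (the second line assumes a second GPU with index 1 is present):

a = torch.zeros(4,3).cuda()     # placed on GPU 0 by default
b = torch.ones(2,2).cuda(1)     # placed on GPU 1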

The torch.nn.Module class also has to and cuda functions, which put the entire network on a particular device. Unlike with Tensors, calling to on an nn.Module object is enough; there is no need to assign the returned value from the to function.

clf = myNetwork()
clf.to(torch.device("cuda:0"))

Automatic selection of GPU

While it's good to be able to decide explicitly which GPU a tensor goes on, we generally create a lot of tensors during our operations. We want them to be created automatically on a certain device, so as to reduce cross-device transfers, which can slow our code down. In this regard, PyTorch provides us with some functionality to accomplish this.

First is the get_device function of Tensors. It is only supported for GPU tensors. It returns the index of the GPU on which the tensor resides. We can use this function to determine the device of a tensor, so that we can automatically move a newly created tensor to the same device.

dev = t1.get_device()                 # index of the GPU on which t1 resides (t1 must be a GPU tensor)
b = torch.zeros(t1.shape).cuda(dev)   # b is created on the same device as t1

We can also call cuda(n) while creating new Tensors. By default, all tensors created by a cuda call are put on GPU 0, but this can be changed by the following statement.

torch.cuda.set_device(0)

If a tensor is created as a result of an operation between two operands which are on the same device, the resulting tensor will be on that device as well. If the operands are on different devices, it will lead to an error.
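A small sketch of this behaviour, assuming at least two GPUs are present:

a = torch.ones(3, device="cuda:0")
b = torch.ones(3, device="cuda:0")
c = a + b                           # fine: both operands are on cuda:0, so c is created on cuda:0

d = torch.ones(3, device="cuda:1")
# a + d                             # would raise a RuntimeError about the tensors being on different devices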

new_* functions

One can also make use of the bunch of new_ functions that made their way into PyTorch in version 1.0. When a function like new_ones is called on a Tensor, it returns a new tensor of the same data type, and on the same device, as the tensor on which the new_ones function was invoked.

ones = torch.ones((2,)).cuda(0)

# newOnes is created with the same data type and on the same device (GPU 0) as ones
newOnes = ones.new_ones((3,4))

# randTensor, in contrast, is created on the CPU
randTensor = torch.randn(2,4)

A detailed list of new_ functions can be found in the PyTorch docs, the link to which I have provided below.

Using Multiple GPUs

There are two ways in which we can make use of multiple GPUs.

  1. Data Parallelism, where we divide batches into smaller batches and process these smaller batches in parallel on multiple GPUs.
  2. Model Parallelism, where we break the neural network into smaller sub-networks and then execute these sub-networks on different GPUs.

Data Parallelism

Data Parallelism in PyTorch is achieved through the nn.DataParallel class. You initialise an nn.DataParallel object with an nn.Module object representing your network, and a list of GPU IDs across which the batches have to be parallelised.

parallel_net = nn.DataParallel(myNet, device_ids=[0, 1, 2])

Now, you can simply execute the nn.DataParallel object just like an nn.Module.

predictions = parallel_net(inputs)          # forward pass on multiple GPUs
loss = loss_function(predictions, labels)   # compute the loss
loss.mean().backward()                      # backward pass
optimizer.step()                            # optimiser step

However, there are a few things I want to shed light on. Despite the fact that our data has to be parallelised over multiple GPUs, we initially have to store it on a single GPU.

We also need to make sure the DataParallel object is on that particular GPU as well. The syntax remains similar to what we did earlier with nn.Module.

input = input.to(0)
parallel_net = parallel_net.to(0)

In effect, the following diagram describes how nn.DataParallel works.


Working of nn.DataParallel. Source

DataParallel takes the input, splits it into smaller batches, replicates the neural network across all the devices, executes the forward pass, and then collects the outputs back on the original GPU.

One issue with DataParallel is that it can put an asymmetrical load on one GPU (the main node). There are generally two ways to circumvent this problem.

  1. The first is to compute the loss during the forward pass. This makes sure that at least the loss calculation stage is parallelised (see the sketch after this list).
  2. Another way is to implement a parallel loss function layer. This is beyond the scope of this article. However, for those interested, I have given a link to a Medium article detailing the implementation of such a layer at the end of this article.
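A minimal sketch of the first approach, where the loss computation is moved inside the module's forward so that DataParallel parallelises it along with the rest of the network. The wrapper class LossInForward and the names myNet, inputs and labels are illustrative, not part of the PyTorch API:

import torch
import torch.nn as nn

class LossInForward(nn.Module):
    # wraps an existing network so that the per-sample loss is computed on each replica
    def __init__(self, net):
        super().__init__()
        self.net = net
        self.criterion = nn.CrossEntropyLoss(reduction="none")

    def forward(self, inputs, labels):
        outputs = self.net(inputs)
        return self.criterion(outputs, labels)   # one loss value per sample, gathered on the main GPU

# parallel_model = nn.DataParallel(LossInForward(myNet), device_ids=[0, 1, 2]).to(0)
# loss = parallel_model(inputs, labels).mean()
# loss.backward()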

Model Parallelism

Model parallelism means that you break your network into smaller sub-networks that you then put on different GPUs. The main motivation for doing such a thing is that your network might be too large to fit inside a single GPU.

Note that model parallelism is often slower than data parallelism, as splitting a single network across multiple GPUs introduces dependencies between GPUs, which prevents them from running in a truly parallel way. The advantage one derives from model parallelism is not speed, but the ability to run networks whose size is too large to fit on a single GPU.

As we see in figure b, Subnet 2 waits for Subnet 1 during the forward pass, while Subnet 1 waits for Subnet 2 during the backward pass.


Model Parallelism with Dependencies

Implementing model parallelism in PyTorch is pretty easy as long as you remember two things.

  1. The input and the network should always be on the same device.
  2. The to and cuda functions have autograd support, so your gradients can be copied from one GPU to another during the backward pass.

We will use the following piece of code to understand this better.

class model_parallel(nn.Module):
    def __init__(self):
        super().__init__()
        self.sub_network1 = ...
        self.sub_network2 = ...
        self.sub_network1.cuda(0)
        self.sub_network2.cuda(1)

    def forward(self, x):
        x = x.cuda(0)
        x = self.sub_network1(x)
        x = x.cuda(1)
        x = self.sub_network2(x)
        return x

In the init function we have put the sub-networks on GPUs 0 and 1 respectively.

Notice that in the forward function, we transfer the intermediate output from sub_network1 to GPU 1 before feeding it to sub_network2. Since cuda has autograd support, the loss backpropagated from sub_network2 will be copied to the buffers of sub_network1 for further backpropagation.
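A hedged usage sketch of such a model; the names criterion, inputs and labels are illustrative, and the labels are moved to GPU 1 because that is where the final output is produced:

net = model_parallel()
criterion = nn.CrossEntropyLoss()

out = net(inputs)                       # the final output lands on GPU 1
loss = criterion(out, labels.cuda(1))   # the labels must be on the same device as the output
loss.backward()                         # gradients flow back across both GPUs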

Troubleshooting Out of Memory Errors

In this section we will cover how to diagnose memory issues, and possible solutions if your network is using more memory than it needs.

While going out of memory may require reducing the batch size, one can do certain checks to ensure that memory usage is optimal.

Tracking Memory Usage with GPUtil

One way to track GPU usage is by monitoring memory usage in a console with the nvidia-smi command. The problem with this approach is that peak GPU usage and out-of-memory errors happen so fast that you can't quite pinpoint which part of your code is causing the memory overflow.

For this we will use an extension called GPUtil, which you can install with pip by running the following command.

pip install GPUtil

The usage is pretty simple too.

import GPUtil
GPUtil.showUtilization()

Just put the second line wherever you want to see the GPU utilisation. By placing this statement at different places in the code, you can figure out exactly which part is causing the network to go OOM.
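For instance, a minimal sketch of sprinkling these calls through a training step to narrow down where memory spikes; train_loader, net, criterion and optimizer are placeholder names assumed to be defined elsewhere:

import GPUtil

for inputs, labels in train_loader:
    inputs, labels = inputs.cuda(0), labels.cuda(0)
    GPUtil.showUtilization()        # usage after moving the batch to the GPU

    out = net(inputs)
    GPUtil.showUtilization()        # usage after the forward pass

    loss = criterion(out, labels)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    GPUtil.showUtilization()        # usage after the backward pass and the update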

Let us now talk about possible methods for remedying OOM errors.

Dealing with Memory Leaks using the del keyword

PyTorch has a pretty aggressive garbage collector. As soon as a variable goes out of scope, the garbage collector will free it.

It should be kept in mind that Python doesn't enforce scoping rules as strongly as other languages such as C/C++. A variable is only freed when there are no references to it. (This has to do with the fact that variables need not be declared in Python.)

As a result, memory occupied by tensors holding your input and output can remain un-freed even when you are out of the training loop. Consider the following chunk of code.

for x in range(10):
    i = x

print(i)   # i is still accessible outside the loop

Running the above snippet of code will print the value of i even when we are outside the loop where we initialised i. Similarly, tensors holding loss and output can live beyond the training loop. In order to truly free up the space held by these tensors, we use the del keyword.

del out, loss

In fact, as a general rule of thumb, if you are done with a tensor, you should del it, as it won't be garbage collected unless there is no reference to it left.

Using Python Data Types Instead Of 1-D Tensors

Often, we aggregate values in our training loop to compute some metrics. The biggest example of this is updating the running loss every iteration. However, if not done carefully in PyTorch, such a thing can lead to using more memory than required.

Consider the following snippet of code.

total_loss = 0

for x in range(10):
    iter_loss = torch.randn(3,4).mean()
    iter_loss.requires_grad = True     # losses are differentiable in a real training loop
    total_loss += iter_loss

We expect that in subsequent iterations, iter_loss is rebound to a new iter_loss tensor, and the object representing iter_loss from the earlier iteration is freed. But this doesn't happen. Why?

Since iter_loss is differentiable, the line total_loss += iter_loss creates a computation graph with an AddBackward function node. During subsequent iterations, AddBackward nodes are added to this graph, and no object holding the values of iter_loss is freed. Normally, the memory allocated to a computation graph is freed when backward is called upon it, but here there is no scope for calling backward.


The computation graph created when you keep adding the loss tensor to the variable loss

The solution to this is to add a Python data type, and not a tensor, to total_loss, which prevents the creation of any computation graph.

We simply replace the line total_loss += iter_loss with total_loss += iter_loss.item(). item returns the Python data type from a tensor containing a single value.
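The corrected version of the earlier snippet would then look something like this sketch:

import torch

total_loss = 0

for x in range(10):
    iter_loss = torch.randn(3,4).mean()
    iter_loss.requires_grad = True
    total_loss += iter_loss.item()     # a plain Python float; no computation graph is kept alive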

Emptying the CUDA Cache

While PyTorch aggressively frees up memory, a PyTorch process may not give the memory back to the OS even after you del your tensors. This memory is cached so that it can be quickly allocated to new tensors without requesting extra memory from the OS.

This can be a problem when you are using more than two processes in your workflow.

The first process can hold onto the GPU memory even if its work is done, causing OOM when the second process is launched. To remedy this, you can write the following command at the end of your code.

torch.cuda.empty_cache()

This will make sure that the space held by the process is released.

import torch
from GPUtil import showUtilization as gpu_usage

print("Initial GPU Usage")
gpu_usage()

tensorList = []
for x in range(10):
    tensorList.append(torch.randn(10000000,10).cuda())

print("GPU Usage after allocating a bunch of Tensors")
gpu_usage()

del tensorList

print("GPU Usage after deleting the Tensors")
gpu_usage()

print("GPU Usage after emptying the cache")
torch.cuda.empty_cache()
gpu_usage()

The following output is produced when this code is executed on a Tesla K80.

Initial GPU Usage
| ID | GPU | MEM |
------------------
|  0 |  0% |  5% |
GPU Usage after allocating a bunch of Tensors
| ID | GPU | MEM |
------------------
|  0 |  3% | 30% |
GPU Usage after deleting the Tensors
| ID | GPU | MEM |
------------------
|  0 |  3% | 30% |
GPU Usage after emptying the cache
| ID | GPU | MEM |
------------------
|  0 |  3% |  5% |

Using torch.no_grad() for Inference

PyTorch, by default, will create a computational graph during the forward pass. During the creation of this graph, it will allocate buffers to store gradients and intermediate values which are used for computing the gradient during the backward pass.

During the backward pass, all of these buffers, with the exception of those allocated for leaf variables, are freed.

However, during inference there is no backward pass, and these buffers are never freed, leading to memory piling up. Therefore, whenever you want to execute a piece of code that doesn't need to be backpropagated, put it inside a torch.no_grad() context manager.

with torch.no_grad():
    predictions = clf(inputs)   # no computation graph is built inside this block

Using CuDNN Backend

You can make use of the cuDNN benchmark instead of the vanilla benchmark. cuDNN can provide a lot of optimisations which can bring down your memory usage, especially when the input to your neural network is of a fixed size. Add the following lines at the top of your code to enable the cuDNN benchmark.

torch.backends.cudnn.benchmark = True
torch.backends.cudnn.enabled = True

Using 16-bit Floats

The newer RTX and Volta cards by NVIDIA support both 16-bit training and inference.

model = model.half()
input = input.half()

However, the 16-bit training options have to be taken with a pinch of salt.

While the use of 16-bit tensors can cut your GPU memory usage by almost half, there are a few issues with them.

  1. In PyTorch, batch-norm layers have convergence issues with half-precision floats. If that's the case for you, make sure that the batch norm layers are float32.

model.half()    # convert the whole model to half precision
for layer in model.modules():
    if isinstance(layer, nn.BatchNorm2d):
        layer.float()          # keep the batch norm layers in float32

Also, you need to make sure that when the output is passed through different layers in the forward function, the input to the batch norm layer is converted from float16 to float32, and then its output is converted back to float16, as in the sketch below.
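A minimal sketch of what such a forward might look like; the module HalfPrecisionBlock and its layers conv and bn are illustrative:

import torch.nn as nn

class HalfPrecisionBlock(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 16, 3, padding=1).half()
        self.bn = nn.BatchNorm2d(16).float()   # batch norm kept in float32

    def forward(self, x):                      # x is expected to be float16
        x = self.conv(x)
        x = self.bn(x.float())                 # convert to float32 before batch norm
        return x.half()                        # convert back to float16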

One can find a good discussion of 16-bit training in PyTorch here.

2. You can have overflow issues with 16-bit floats. Once, I remember, I had such an overflow while trying to store the union area of two bounding boxes (for computation of IoUs) in a float16. So make sure you have a realistic bound on the values you are trying to store in a float16 (see the sketch below).
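For a quick illustration of the limited range (float16 can represent values only up to about 65504):

import torch

big = torch.tensor(70000.0)
print(big.half())    # prints tensor(inf, dtype=torch.float16), since 70000 exceeds the float16 range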

NVIDIA has released a PyTorch extension called Apex, which facilitates numerically safe mixed-precision training in PyTorch. I have provided the link to it at the end of the article.

Conclusion

That concludes our discussion on memory management and the use of multiple GPUs in PyTorch. Following are the important links that you may want to follow up this article with.

Further Reading

  1. PyTorch new_ functions
  2. Parallelised Loss Layer: Training Neural Nets on Larger Batches: Practical Tips for 1-GPU, Multi-GPU & Distributed setups
  3. GPUtil GitHub page
  4. A discussion on half-precision training in PyTorch
  5. Nvidia Apex Github page
  6. Nvidia Apex tutorial