Introduction
Using containers for GPU workloads requires installing the Nvidia Container Toolkit and running Docker with additional flags. This tutorial explains how to set up the Nvidia Container Toolkit, run Docker for GPU workloads, and install Miniconda to manage Python environments. This guide focuses on PyTorch usage with GPU Droplets on DigitalOcean.
Prerequisites
To follow this tutorial, you will need:
- A DigitalOcean Cloud account.
- A GPU Droplet. GPU Droplets are currently in early availability, but you can request access here.
Why Use a GPU Droplet?
DigitalOcean’s GPU Droplets are NVIDIA H100s that you can spin up on demand. Try them out by spinning up a GPU Droplet today. Note that these are currently in early availability and will be released for everyone soon!
Step 1 — Set Up the GPU Droplet
Create a GPU Droplet
Log into your DigitalOcean account, create a new GPU Droplet with the OS Image set to “AI/ML Ready v1.0”, and choose a GPU plan. Once the GPU Droplet is created, log into its console.
Add a New User (Recommended)
Instead of using the root user for everything, it’s better to create a new user for security reasons:
- adduser do-shark
- usermod -aG sudo do-shark
- su do-shark
- cd ~/
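As an optional sanity check, you can confirm that the new user belongs to the sudo group; the output should include sudo:
- groups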
Step 2 — Install the Nvidia Container Toolkit
Using containers for GPU workloads requires installing the Nvidia Container Toolkit and running Docker with additional flags.
Install the Toolkit and Docker
The Nvidia Container Toolkit replaced the previous wrapper named nvidia-docker. You can install the toolkit and Docker with the following command:
- sudo apt-get install docker.io nvidia-container-toolkit
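Optionally, you can verify that the toolkit installed correctly by printing its version:
- nvidia-ctk --version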
Enable the Nvidia Container Runtime
Run the following command to enable the Nvidia container runtime:
- sudo nvidia-ctk runtime configure --runtime=docker
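This command registers the nvidia runtime in Docker's daemon configuration, typically /etc/docker/daemon.json. If you want to inspect the result, the file should contain an entry similar to the following (exact contents may vary by toolkit version):
- sudo cat /etc/docker/daemon.json
Output
{
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    }
}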
Restart Docker
After enabling the runtime, restart Docker to apply the changes:
- sudo systemctl restart docker
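To double-check that Docker picked up the new runtime, you can list the registered runtimes; nvidia should appear in the output:
- sudo docker info | grep -i runtimes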
Step 3 — Running a PyTorch Container (Single Node)
When running PyTorch in a container, Nvidia recommends using specific Docker flags to ensure adequate memory allocation.
--gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864
These flags are responsible for:
- --gpus all: Enables GPU access for the container.
- --ipc=host: Allows the container to use the host’s IPC namespace.
- --ulimit memlock=-1: Removes the limit on locked-in-memory address space.
- --ulimit stack=67108864: Sets the maximum stack size to 64MB.
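Before testing PyTorch itself, an optional smoke test is to run nvidia-smi inside the same container image used below (note that this pulls a large image on first run). If GPU access is configured correctly, this prints the familiar nvidia-smi table with your GPU listed:
- sudo docker run --rm --gpus all nvcr.io/nvidia/pytorch:24.08-py3 nvidia-smi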
To confirm that PyTorch is running correctly in a containerized environment, run the following command:
- sudo docker run --rm -it --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 nvcr.io/nvidia/pytorch:24.08-py3 python3 -c "import torch;print('CUDA available:', torch.cuda.is_available())"
The above docker invocation will confirm PyTorch is running correctly in a containerized environment. The final line from the execution should show “CUDA available: True”.
Output
=============
== PyTorch ==
=============

NVIDIA Release 24.08 (build 107063150)
PyTorch Version 2.5.0a0+872d972

Container image Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Copyright (c) 2014-2024 Facebook Inc.
Copyright (c) 2011-2014 Idiap Research Institute (Ronan Collobert)
Copyright (c) 2012-2014 Deepmind Technologies (Koray Kavukcuoglu)
Copyright (c) 2011-2012 NEC Laboratories America (Koray Kavukcuoglu)
Copyright (c) 2011-2013 NYU (Clement Farabet)
Copyright (c) 2006-2010 NEC Laboratories America (Ronan Collobert, Leon Bottou, Iain Melvin, Jason Weston)
Copyright (c) 2006 Idiap Research Institute (Samy Bengio)
Copyright (c) 2001-2004 Idiap Research Institute (Ronan Collobert, Samy Bengio, Johnny Mariethoz)
Copyright (c) 2015 Google Inc.
Copyright (c) 2015 Yangqing Jia
Copyright (c) 2013-2016 The Caffe contributors
All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

NOTE: CUDA Forward Compatibility mode ENABLED.
  Using CUDA 12.6 driver version 560.35.03 with kernel driver version 535.183.01.
  See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.

CUDA available: True
Step 4 — Running a PyTorch Container (Multi-Node)
Use the same base arguments for multi-node configurations as for the single-node setup, but include additional bind mounts to detect the GPU fabric network devices and NCCL topology.
--gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --network=host --volume /dev/infiniband:/dev/infiniband --volume /sys/class/infiniband/:/sys/class/infiniband/ --device /dev/infiniband/:/dev/infiniband/ -v /etc/nccl.conf:/etc/nccl.conf -v /etc/nccl:/etc/nccl
These flags are responsible for:
- --gpus all: Enables access to all available GPUs in the container.
- --ipc=host: Uses the host’s IPC namespace, allowing better inter-process communication.
- --ulimit memlock=-1: Removes the limit on locked-in-memory address space.
- --ulimit stack=67108864: Sets the maximum stack size to 64MB.
- --network=host: Uses the host’s network stack inside the container.
- --volume /dev/infiniband:/dev/infiniband: Mounts the InfiniBand devices into the container.
- --volume /sys/class/infiniband/:/sys/class/infiniband/: Mounts InfiniBand system information.
- --device /dev/infiniband/:/dev/infiniband/: Allows the container to access InfiniBand devices.
- -v /etc/nccl.conf:/etc/nccl.conf: Mounts the NCCL (NVIDIA Collective Communications Library) configuration file.
- -v /etc/nccl:/etc/nccl: Mounts the NCCL directory for additional configurations.
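As an optional host-side check (assuming your GPU Droplet plan exposes the InfiniBand fabric), you can confirm that the device files referenced above exist before mounting them:
- ls /dev/infiniband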
To confirm that PyTorch is functioning in a containerized multi-node environment, execute the following command:
- sudo docker run --rm -it --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --network=host --volume /dev/infiniband:/dev/infiniband --volume /sys/class/infiniband/:/sys/class/infiniband/ --device /dev/infiniband/:/dev/infiniband/ -v /etc/nccl.conf:/etc/nccl.conf -v /etc/nccl:/etc/nccl nvcr.io/nvidia/pytorch:24.08-py3 python3 -c "import torch;print('CUDA available:', torch.cuda.is_available())"
The above docker invocation will confirm that PyTorch is running correctly in a containerized multi-node environment. The final line from the execution should show “CUDA available: True”.
Output
=============
== PyTorch ==
=============

NVIDIA Release 24.08 (build 107063150)
PyTorch Version 2.5.0a0+872d972

Container image Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Copyright (c) 2014-2024 Facebook Inc.
Copyright (c) 2011-2014 Idiap Research Institute (Ronan Collobert)
Copyright (c) 2012-2014 Deepmind Technologies (Koray Kavukcuoglu)
Copyright (c) 2011-2012 NEC Laboratories America (Koray Kavukcuoglu)
Copyright (c) 2011-2013 NYU (Clement Farabet)
Copyright (c) 2006-2010 NEC Laboratories America (Ronan Collobert, Leon Bottou, Iain Melvin, Jason Weston)
Copyright (c) 2006 Idiap Research Institute (Samy Bengio)
Copyright (c) 2001-2004 Idiap Research Institute (Ronan Collobert, Samy Bengio, Johnny Mariethoz)
Copyright (c) 2015 Google Inc.
Copyright (c) 2015 Yangqing Jia
Copyright (c) 2013-2016 The Caffe contributors
All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

NOTE: CUDA Forward Compatibility mode ENABLED.
  Using CUDA 12.6 driver version 560.35.03 with kernel driver version 535.183.01.
  See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.

CUDA available: True
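As a further optional check, you can confirm how many GPUs the container can see; on a multi-GPU node this should match your hardware:
- sudo docker run --rm --gpus all --ipc=host nvcr.io/nvidia/pytorch:24.08-py3 python3 -c "import torch;print('GPUs visible:', torch.cuda.device_count())"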
Step 5 — Installing Miniconda
Miniconda is a lightweight version of Anaconda, providing an efficient way to manage Python environments. To install Miniconda, follow these steps:
Download and Install Miniconda
Use the following commands to download and install Miniconda.
- mkdir -p ~/miniconda3
- wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
- bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
- rm -rf ~/miniconda3/miniconda.sh
Initialize Miniconda
- ~/miniconda3/bin/conda init bash
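conda init appends initialization code to your shell profile (~/.bashrc for bash). Logging out and back in, as described next, reloads it; alternatively, you can source the file directly:
- source ~/.bashrc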
Exit and log back in to apply the changes.
- exit
Now log back in as the do-shark user.
- su do-shark
Verify the conda version.
- conda --version
Output
conda 24.7.1
Step 6 — Setting Up a PyTorch Environment with Miniconda
With Miniconda installed, you can set up a Python environment for PyTorch:
Create and Activate a New Environment
- conda create -n torch python=3.10
- conda activate torch
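Optionally, confirm that the environment exists and is active; conda marks the active environment with an asterisk:
- conda env list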
Install PyTorch
To install PyTorch with CUDA support, use the following command. CUDA, which stands for Compute Unified Device Architecture, is a parallel computing platform and programming model for general computing on graphical processing units (GPUs).
- conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
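Once the installation finishes, a quick check from inside the activated torch environment confirms that the conda build of PyTorch can see the GPU:
- python -c "import torch; print(torch.__version__); print('CUDA available:', torch.cuda.is_available())"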
Conclusion
You have successfully set up the Nvidia Container Toolkit and Miniconda on your DigitalOcean GPU Droplet. You are now ready to run containerized PyTorch workloads with GPU support. For further information, you can explore the official documentation for Nvidia’s Deep Learning Containers and PyTorch.