Introduction
If you’ve trained anything at scale you’ve probably met Slurm, the workload manager that runs most of the world’s HPC clusters and a fair chunk of its AI training too. It’s the thing that takes sbatch my-job.sh, finds the right nodes, lines up the GPUs, runs your script, and gets out of the way. It’s been around for two decades and it’s not going anywhere.
The catch is that operating Slurm has traditionally meant babysitting bare-metal nodes: imaging them, keeping packages in sync, managing daemons, restarting services after hardware events. None of that is what you signed up for if your goal is to train models.
Slinky is SchedMD’s answer to that: a Kubernetes operator that runs Slurm on Kubernetes. The Slurm controller (slurmctld), the worker daemons (slurmd), the login nodes, and the optional accounting database all run as pods. Kubernetes handles lifecycle, scheduling onto hardware, restarts, and rolling upgrades. Slurm handles the job scheduling that users actually interact with.
In this tutorial we’ll get a working Slinky cluster running on DigitalOcean Kubernetes (DOKS) with NVIDIA B300 GPU nodes, validate that the GPUs and the RDMA fabric work, and run a multi-node NCCL all-reduce across two nodes. By the end you’ll have a real cluster you can srun into.
Key takeaways
- Slinky is SchedMD’s Kubernetes operator for Slurm: slurmctld, slurmd, and login pods run on DOKS while Slurm still owns job scheduling (sbatch, srun).
- A split node pool (small CPU mgmt + GPU workers) plus managed NFS (ReadWriteMany) matches the usual HPC login-writes / worker-reads workflow.
- B300 nodes expose 8 GPUs and 16 fabric NICs; you attach RoCE with Multus NetworkAttachmentDefinition resources and request rdma/fabricN in the Slurm chart.
- The NCCL all-reduce job is the practical proof that multi-node NET/IB over mlx5_* is working, not just that pods schedule.
- For serving fine-tuned weights after training, most teams move to Dedicated Inference with BYOM or a Kubernetes inference stack, not long-lived Slurm partitions.
What you’ll build
Two node pools: a small mgmt pool of CPU nodes for the control plane, and a gpu pool of B300 droplets for real work. Shared NFS so login and worker pods see the same /shared for job scripts and outputs. Multus-attached RDMA fabric NICs on the workers so NCCL can do collective ops across nodes at full bandwidth.
Accounting (slurmdbd) and Prometheus metrics are both supported and we’ll point at where to turn them on, but we’ll keep the default install minimal.
Prerequisites
Five things to create before any helm install:
- A VPC in a region that supports managed NFS. ric1 works. Create a VPC.
- A DOKS cluster in that VPC with two node pools:
  - mgmt: 3 × CPU Optimized 4 vCPU / 8 GiB
  - gpu: 2+ × NVIDIA B300 droplets (8 GPUs + 16 fabric NICs per node)
  DOKS automatically taints GPU pools with nvidia.com/gpu:NoSchedule and labels them doks.digitalocean.com/gpu-brand=nvidia. That keeps non-GPU pods off the expensive nodes for you.
- Managed NFS in the same VPC. Note the Mount Source; you’ll need it for the PV. Create managed NFS.
- A DigitalOcean Container Registry (DOCR) for the custom slurmd and login images. The one-click DOKS integration wires the pull credentials into the cluster so you don’t have to manage Secrets. Create a registry if you don’t have one already.
- NFS performance tuning for GPU nodes. B300 nodes support jumbo frames (MTU 9000), but pods that mount NFS before the interface is tuned negotiate at MTU 1500 and never renegotiate, capping throughput for the lifetime of the mount. Follow Optimize NFS Performance on GPU Nodes before scheduling workloads on the GPU pool. We skip this step in this getting-started guide, but it is critical for high throughput with DigitalOcean’s managed NFS service.
Create the Slurm namespace once the cluster is up:
kubectl create namespace slurm
How to build and push the custom slurmd image
Workers need more than the upstream slurmd image ships with. They need the CUDA runtime, NCCL, the nccl-tests benchmarks (so we can validate the fabric), RDMA userspace, and MPI. This Dockerfile builds exactly that on top of the official Slinky base.
A few notes on the Dockerfile:
- CUDA 12.9 is the floor because that’s when NVIDIA added native codegen for Blackwell Ultra (sm_103 / B300).
- nccl-tests is compiled with MPI=1 so it can be launched via srun --mpi=pmix.
- RDMA userspace (libibverbs, rdma-core, ibverbs-utils, perftest) is installed so ibv_devices works inside the pod for fabric debugging.
Build and push:
doctl registry login
docker build \
  -t registry.digitalocean.com/<your-registry>/slurmd-cuda:25.11 \
  docker/slurmd-cuda/
docker push registry.digitalocean.com/<your-registry>/slurmd-cuda:25.11
How to build and push the custom login image
The login pod is where your users actually live. It’s what they kubectl exec into to write job scripts, pull code, and run sbatch. But the upstream login image ships only the Slurm client commands (sinfo, srun, sbatch, …). There’s no editor, no git, no python3, no sudo, no curl. So a user logs in, runs sinfo, and then… can’t clone their repo, can’t edit a file, can’t run a script, can’t install anything. The pod is technically a login node and practically a dead end.
The fix is the same pattern as the worker image: a thin layer of developer tooling on top of the official Slinky base. This Dockerfile installs vim, nano, git, python3 + pip, sudo, curl, wget, and less on top of ghcr.io/slinkyproject/login, enough to actually get work done. (The sudo grant is broad for convenience here; tighten it before you hand the cluster to real users.)
Build and push it the same way:
docker build \
  -t registry.digitalocean.com/<your-registry>/slurm-login:25.11 \
  docker/login/
docker push registry.digitalocean.com/<your-registry>/slurm-login:25.11
This image is optional. Leave it out and the LoginSet falls back to the upstream login image with just the Slurm clients. But for any cluster real people will use, it’s the difference between a login node and a login node you can do anything on.
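Once the images are built, a quick check from a shell inside a container can confirm that the tooling described above is actually on the PATH. The sketch below is only illustrative: missing_tools and the two tool lists are hypothetical names, and the lists are assembled from the packages this guide mentions, not read from the Dockerfiles themselves.

```shell
# Print each named command that is NOT found on PATH, one per line.
# missing_tools is a hypothetical helper for illustration.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Assumed tool lists, based on what this guide says each image contains.
LOGIN_TOOLS="vim nano git python3 sudo curl wget less sinfo srun sbatch"
WORKER_TOOLS="ibv_devices all_reduce_perf"
```

Pasted into a shell in the login container, `missing_tools $LOGIN_TOOLS` (unquoted on purpose, so the list splits into words) should print nothing; any name it does print is a tool the image lacks. The same call with `$WORKER_TOOLS` applies to the slurmd image.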
How to install cert-manager (required for the Slinky operator)
Slinky requires cert-manager for its admission webhook TLS, so install this one. kube-prometheus-stack is optional: if you install it, Slinky will publish a ServiceMonitor and a built-in Grafana dashboard so Slurm metrics show up automatically.
helm repo add jetstack https://charts.jetstack.io
helm repo update
helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager --create-namespace \
  --set crds.enabled=true
For monitoring, install kube-prometheus-stack into a prometheus namespace. If you do, set controller.metrics.enabled: true in the Slurm values below and Slinky will publish a ServiceMonitor for Prometheus to scrape.
How to wire managed NFS as a ReadWriteMany volume for Slurm
Slinky doesn’t manage shared storage for you, but a login-pod-writes / worker-pod-reads workflow is the whole point of an HPC-style cluster, so we wire up a ReadWriteMany NFS volume. The server and path come from the Mount Source of your managed NFS.
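NFS mount sources usually take the host:/path form. Assuming yours does too (check what the control panel shows, since this guide does not spell out the format), a small sketch can split it into the two values the PersistentVolume needs. split_mount_source is a hypothetical helper and the address in the usage note is made up.

```shell
# Split an NFS mount source of the ASSUMED form "<host>:/<path>" into the
# server and path fields. Hypothetical helper; IPv6 literals are not handled.
split_mount_source() {
  src=$1
  case "$src" in
    *:/*) ;;  # looks like host:/path, carry on
    *) echo "expected <host>:/<path>, got: $src" >&2; return 1 ;;
  esac
  printf 'server: %s\n' "${src%%:*}"   # everything before the first colon
  printf 'path: %s\n' "${src#*:}"      # everything after the first colon
}
```

For example, `split_mount_source 10.0.0.5:/123456/example` prints a server line and a path line; those two values fill the server and path fields of the PersistentVolume.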
# slurm-nfs-pv.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: slurm-nfs-pv
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteMany
  storageClassName: ""
  mountOptions:
    - vers=4.1
    - nconnect=8
  nfs:
    server: <nfs-private-ip>
    path: <nfs-mount-path>
---
# slurm-nfs-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: slurm-nfs-pvc
  namespace: slurm
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""
  volumeName: slurm-nfs-pv
  resources:
    requests:
      storage: 100Gi
kubectl apply -f slurm-nfs-pv.yaml
kubectl apply -f slurm-nfs-pvc.yaml
kubectl get pvc -n slurm slurm-nfs-pvc   # should be Bound
How to attach B300 RDMA fabric NICs with Multus
B300 nodes ship with 16 dedicated fabric NICs (fabric0–fabric15, 2 per GPU) for RoCE (RDMA over Converged Ethernet). These specialized network interfaces enable high-speed, low-latency data transfers between GPUs across different nodes, which is crucial for accelerating distributed AI, HPC, and machine learning workloads.
By default, Kubernetes pods don’t have access to these extra NICs, because each pod typically connects only to the cluster’s main network.
This is where Multus CNI comes in. Multus CNI is a network plugin for Kubernetes that allows pods to connect to multiple networks, not just the primary one.
In this setup, Multus lets you attach one or more of the B300’s RDMA NICs directly to selected pods by moving those network interfaces into the pod’s network namespace. As a result, pods that need ultra-fast networking, like those doing GPU-to-GPU communication, can take direct advantage of the hardware rather than sharing a single connection. This approach is essential for workloads that need maximum network performance, such as distributed training or MPI (Message Passing Interface) jobs.
Install Multus:
kubectl apply -f \
  https://raw.githubusercontent.com/k8snetworkplumbingwg/multus-cni/master/deployments/multus-daemonset-thick.yml
kubectl rollout status daemonset/kube-multus-ds -n kube-system --timeout=120s
Each NIC needs a NetworkAttachmentDefinition that uses the host-device CNI to move it into the pod. NADs are namespace-scoped; Multus resolves an unqualified NAD name in the pod’s own namespace, so the slurm worker pods need them in slurm. The pattern is identical for all 16, just swap the device:
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: roce-net-fabric0
spec:
  config: '{ "cniVersion": "0.3.1", "type": "host-device", "device": "fabric0" }'
A ready-made bundle of all 16 NADs (NetworkAttachmentDefinitions) can be retrieved here: fabric-nads.yaml. Note that this creates 16 NetworkAttachmentDefinition resources; older NVIDIA GPU systems only have 8 fabric NICs, so there you’d only create roce-net-fabric0 through roce-net-fabric7.
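If you would rather generate the manifests than download the bundle, the one-NAD-per-NIC pattern is easy to script. This is a minimal sketch, assuming the interfaces are named fabric0 through fabricN-1 as described above; gen_fabric_nads is a hypothetical helper, not something taken from the reference repo.

```shell
# Emit one host-device NetworkAttachmentDefinition per fabric NIC.
# Usage: gen_fabric_nads <nic-count>   (16 on B300, 8 on older shapes)
# Hypothetical helper for illustration only.
gen_fabric_nads() {
  count=$1
  i=0
  while [ "$i" -lt "$count" ]; do
    # Unquoted heredoc so $i expands; YAML documents are separated by ---.
    cat <<EOF
---
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: roce-net-fabric$i
spec:
  config: '{ "cniVersion": "0.3.1", "type": "host-device", "device": "fabric$i" }'
EOF
    i=$((i + 1))
  done
}
```

Redirecting the output to a file, for example `gen_fabric_nads 16 > manifests/fabric-nads.yaml` once the manifests directory exists, yields a bundle that can be applied in the next step. Pass 8 on node shapes with eight fabric NICs.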
kubectl apply -n slurm -f manifests/fabric-nads.yaml
kubectl get net-attach-def -n slurm
# Expect 16 NADs: roce-net-fabric0 through roce-net-fabric15
Install the SchedMD Slinky Slurm operator
helm install slurm-operator oci://ghcr.io/slinkyproject/charts/slurm-operator \
  --set 'crds.enabled=true' \
  --namespace slurm
You should end up with slurm-operator and slurm-operator-webhook pods Running on the mgmt nodes (the webhook is a separate deployment in chart 1.1.0). The operator registers a slurm-operator-webhook ValidatingWebhookConfiguration:
kubectl get pods -n slurm -l app.kubernetes.io/instance=slurm-operator
kubectl get validatingwebhookconfigurations | grep slurm
Deploy the Slurm cluster with Helm values
This is the values file that pulls it all together. Save as slurm-values.yaml:
# Controller (slurmctld). Uncomment the metrics block only if you installed
# kube-prometheus-stack, otherwise the ServiceMonitor CRD won't exist and
# `helm install` will fail.
controller:
  extraConfMap:
    ReturnToService: 2
  # metrics:
  #   enabled: true
  #   serviceMonitor:
  #     enabled: true
  #     labels:
  #       release: prometheus
# Login nodes: a single login pod with the NFS share mounted at /shared.
loginsets:
  slinky:
    enabled: true
    login:
      # The custom login image with dev tools built above. Drop the image block
      # to fall back to the upstream login image (Slurm clients only).
      image:
        repository: registry.digitalocean.com/<your-registry>/slurm-login
        tag: "25.11"
      volumeMounts:
        - name: shared-nfs
          mountPath: /shared
    podSpec:
      volumes:
        - name: shared-nfs
          persistentVolumeClaim:
            claimName: slurm-nfs-pvc
    service:
      spec:
        type: ClusterIP
# GPU device paths. Slurm can't autodetect these from inside a container,
# so it needs to be told explicitly. On B300, GPUs are always at
# /dev/nvidia0 through /dev/nvidia7.
configFiles:
  gres.conf: |
    Name=gpu File=/dev/nvidia[0,1,2,3,4,5,6,7]
# GPU worker nodes. 8 GPUs and 16 fabric NICs per pod on B300.
nodesets:
  slinky:
    replicas: 2  # Match your B300 node count
    slurmd:
      image:
        repository: registry.digitalocean.com/<your-registry>/slurmd-cuda
        tag: "25.11"
      resources:
        requests:
          nvidia.com/gpu: 8
          rdma/fabric0: 1
          rdma/fabric1: 1
          rdma/fabric2: 1
          rdma/fabric3: 1
          rdma/fabric4: 1
          rdma/fabric5: 1
          rdma/fabric6: 1
          rdma/fabric7: 1
          rdma/fabric8: 1
          rdma/fabric9: 1
          rdma/fabric10: 1
          rdma/fabric11: 1
          rdma/fabric12: 1
          rdma/fabric13: 1
          rdma/fabric14: 1
          rdma/fabric15: 1
        limits:
          nvidia.com/gpu: 8
          rdma/fabric0: 1
          rdma/fabric1: 1
          rdma/fabric2: 1
          rdma/fabric3: 1
          rdma/fabric4: 1
          rdma/fabric5: 1
          rdma/fabric6: 1
          rdma/fabric7: 1
          rdma/fabric8: 1
          rdma/fabric9: 1
          rdma/fabric10: 1
          rdma/fabric11: 1
          rdma/fabric12: 1
          rdma/fabric13: 1
          rdma/fabric14: 1
          rdma/fabric15: 1
      securityContext:
        capabilities:
          add:
            - IPC_LOCK
      volumeMounts:
        - name: shared-nfs
          mountPath: /shared
        - name: shm
          mountPath: /dev/shm
    extraConfMap:
      Gres: "gpu:8"
    partition:
      enabled: true  # Required, otherwise no `slinky` partition is created
      configMap:
        State: UP
        MaxTime: UNLIMITED
    # Multus annotation moves all 16 fabric NICs into the pod.
    metadata:
      annotations:
        k8s.v1.cni.cncf.io/networks: >-
          roce-net-fabric0@fabric0,
          roce-net-fabric1@fabric1,
          roce-net-fabric2@fabric2,
          roce-net-fabric3@fabric3,
          roce-net-fabric4@fabric4,
          roce-net-fabric5@fabric5,
          roce-net-fabric6@fabric6,
          roce-net-fabric7@fabric7,
          roce-net-fabric8@fabric8,
          roce-net-fabric9@fabric9,
          roce-net-fabric10@fabric10,
          roce-net-fabric11@fabric11,
          roce-net-fabric12@fabric12,
          roce-net-fabric13@fabric13,
          roce-net-fabric14@fabric14,
          roce-net-fabric15@fabric15
    podSpec:
      nodeSelector:
        doks.digitalocean.com/gpu-brand: nvidia
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      volumes:
        - name: shared-nfs
          persistentVolumeClaim:
            claimName: slurm-nfs-pvc
        # NCCL uses /dev/shm; give it room for large collectives.
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 64Gi
Install it:
helm install slurm oci://ghcr.io/slinkyproject/charts/slurm \
  --namespace slurm \
  --values slurm-values.yaml
A few moments later you should see the cluster come up:
kubectl get pods -n slurm
# slurm-controller-...      Running
# slurm-login-slinky-...    Running
# slurm-restapi-...         Running
# slurm-worker-slinky-0     Running
# slurm-worker-slinky-1     Running
Hop into the login pod and confirm Slurm sees the workers:
kubectl exec -it -n slurm deploy/slurm-login-slinky -- bash
sinfo -N -l
# All workers should be idle in the `slinky` partition (and the default `all` partition).
scontrol show node slinky-0 | grep -i gres
# Gres=gpu:8
Because we built the custom login image, this is also a pod you can actually work in: git clone your job repo, edit scripts with vim, pip install a helper, all without leaving the login node.
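The rdma/fabricN entries and the Multus annotation in slurm-values.yaml repeat one pattern per NIC, which is tedious to edit by hand when the NIC count differs (for example 8 instead of 16). The sketch below prints both lists; the function names are hypothetical, and the output still has to be pasted into the values file at the right indentation.

```shell
# Build the comma-separated Multus networks annotation for N fabric NICs,
# in the same "roce-net-fabricN@fabricN" form used in slurm-values.yaml.
gen_networks_annotation() {
  count=$1
  i=0
  out=""
  while [ "$i" -lt "$count" ]; do
    if [ -n "$out" ]; then
      out="$out, "
    fi
    out="${out}roce-net-fabric${i}@fabric${i}"
    i=$((i + 1))
  done
  printf '%s\n' "$out"
}

# Print the "rdma/fabricN: 1" lines that go under both requests and limits.
gen_rdma_resources() {
  count=$1
  i=0
  while [ "$i" -lt "$count" ]; do
    printf 'rdma/fabric%d: 1\n' "$i"
    i=$((i + 1))
  done
}
```

For a node shape with eight fabric NICs, `gen_networks_annotation 8` and `gen_rdma_resources 8` give the shorter lists to substitute for the 16-entry versions.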
Validate the fabric with a multi-node NCCL all-reduce
The single most useful smoke test for a GPU cluster is a multi-node NCCL (NVIDIA Collective Communications Library) all-reduce. If it runs at hundreds of GB/s of bus bandwidth and NCCL_DEBUG=INFO reports NET/IB transport over the mlx5_* devices, your fabric is correctly attached and RoCE is being used end-to-end. If it collapses to a few GB/s, NCCL fell back to TCP and you’ve got fabric work to do.
From the login pod, write the job script to NFS so the workers can read it:
mkdir -p /shared/jobs /shared/output
cat > /shared/jobs/nccl-allreduce-2node.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=nccl-allreduce-2node
#SBATCH --partition=slinky
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --gres=gpu:8
#SBATCH --output=/shared/output/allreduce-2node-%j.out
#SBATCH --error=/shared/output/allreduce-2node-%j.err
#SBATCH --time=01:00:00
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/lib/x86_64-linux-gnu:${LD_LIBRARY_PATH}
export NCCL_DEBUG=INFO
# Keep MPI control traffic off the RDMA NICs.
export OMPI_MCA_btl=self,tcp
export OMPI_MCA_btl_tcp_if_include=eth0
srun --mpi=pmix \
  /usr/local/bin/all_reduce_perf \
  -b 1G -e 16G -f 2 -g 1 -c 1 -n 100
EOF
sbatch /shared/jobs/nccl-allreduce-2node.sh
squeue
Watch for the output file once the job finishes:
cat /shared/output/allreduce-2node-*.out
What you’re looking for in the output:
- A bandwidth table from all_reduce_perf ramping from 1G to 16G.
- NCCL_DEBUG=INFO lines that mention NET/IB and list mlx5_0 through mlx5_15. That’s the fabric being used.
- Inter-node bus bandwidth in the hundreds of GB/s at large message sizes. Anything that tops out in the single digits of GB/s means NCCL fell back to TCP. Sanity-check the Multus annotation and the rdma/fabricN resource requests.
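Those checks can be scripted against the job's output file. The following is a rough sketch, not an official tool: it assumes the NCCL INFO lines contain NET/IB together with an mlx5_ device name (or NET/Socket on TCP fallback) and that nccl-tests prints an "Avg bus bandwidth" summary line. Exact log wording varies between NCCL versions, and both check_nccl_log and the MIN_BUSBW_GBS threshold are hypothetical.

```shell
# Illustrative threshold in GB/s; pick a value that suits your hardware.
MIN_BUSBW_GBS=${MIN_BUSBW_GBS:-100}

# Report which transport the log shows and the average bus bandwidth.
# Returns non-zero when the log suggests TCP fallback or low bandwidth.
check_nccl_log() {
  log=$1
  status=0
  if grep -Eq 'NET/IB.*mlx5_' "$log"; then
    echo "transport: NET/IB"
  elif grep -q 'NET/Socket' "$log"; then
    echo "transport: NET/Socket (TCP fallback)"
    status=1
  else
    echo "transport: unknown"
    status=1
  fi
  # Take the number after the colon on the nccl-tests summary line.
  bw=$(awk -F: '/Avg bus bandwidth/ { gsub(/ /, "", $2); print $2; exit }' "$log")
  if [ -z "$bw" ]; then
    echo "busbw: not found"
    status=1
  elif awk -v bw="$bw" -v min="$MIN_BUSBW_GBS" 'BEGIN { exit !(bw + 0 >= min + 0) }'; then
    echo "busbw: $bw GB/s (ok)"
  else
    echo "busbw: $bw GB/s (low)"
    status=1
  fi
  return "$status"
}
```

Point it at one finished job's output file under /shared/output, substituting the job ID for %j in the file name. A non-zero exit status is a cue to start the debugging steps that follow.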
If something doesn’t look right, the first place to look is inside a worker pod:
kubectl exec -n slurm slurm-worker-slinky-0 -- ip -br link | grep fabric
# fabric0..fabric15, all UP
kubectl exec -n slurm slurm-worker-slinky-0 -- ibv_devices
# mlx5_0..mlx5_15
kubectl exec -n slurm slurm-worker-slinky-0 -- ibv_devinfo
# Port state PORT_ACTIVE, link layer Ethernet (= RoCE)
FAQs
1. What is Slinky, and how is it different from Slurm on bare metal?
Slinky is SchedMD’s Slurm operator for Kubernetes. The controller, workers, and login components run as pods; Kubernetes handles restarts and node lifecycle while users still submit jobs with sbatch and srun. Bare-metal Slurm means you run the OS and daemons on each node yourself.
2. Why run Slurm on DOKS instead of only GPU Droplets?
DOKS gives you a managed control plane, node pools, and integrations (NFS, DOCR, GPU taints) in one place. GPU Droplets alone are simpler for one-off training. Slurm on Kubernetes pays off when you need multi-node scheduling, shared /shared storage, and repeatable job scripts across a fleet.
3. Do I need all 16 fabric NetworkAttachmentDefinitions?
On B300 nodes, yes: each GPU has two dedicated fabricN interfaces, and the Helm values request rdma/fabric0 through rdma/fabric15. Older NVIDIA GPU shapes often expose 8 fabric NICs; use only roce-net-fabric0 through roce-net-fabric7 in that case.
4. Why does the tutorial mention NFS MTU tuning if we skip it here?
B300 nodes support jumbo frames (MTU 9000). If a pod mounts NFS before the node interface is tuned, the mount can stay at MTU 1500 for its lifetime and cap throughput. For production training I/O, follow Optimize NFS Performance on GPU Nodes before heavy jobs.
5. My NCCL job shows low GB/s. What should I check first?
Inside a worker pod, confirm fabric0–fabric15 are UP, ibv_devices lists mlx5_0–mlx5_15, and the job output shows NET/IB in the NCCL_DEBUG=INFO lines. Single-digit GB/s usually means TCP fallback; re-check the Multus annotation and the rdma/fabricN resource requests in slurm-values.yaml.
6. Can I serve the model from this same cluster after training?
You can, but most production paths split training (Slurm/HPC) from inference (HTTP APIs). On DigitalOcean, teams often fine-tune on GPU Droplets or this Slurm cluster, then serve with Dedicated Inference and BYOM or Kubernetes tooling such as vLLM model loading on Kubernetes. See Serverless vs Dedicated vs Batch Inference for how serving modes compare.
Conclusion
You now have a Slurm cluster on Kubernetes that schedules jobs across B300 GPU nodes and does collective ops over RDMA. From here:
- Turn on accounting if you want sacct, sreport, or fair-share scheduling. Enable the accounting: block in slurm-values.yaml. The Slinky chart can either deploy an in-cluster MariaDB automatically (handy for dev/test) or talk to a managed MySQL instance in the same VPC (recommended for production).
- Turn on metrics by installing kube-prometheus-stack and setting controller.metrics.enabled: true in the values. Slinky publishes a Grafana dashboard out of the box.
- Submit real workloads. The login pod, the /shared volume, and the slinky partition are all that user-facing job scripts need to interact with.
Slinky lets you keep Kubernetes for what Kubernetes is good at (fleet management, hardware lifecycle, observability) while letting Slurm do what Slurm is good at: scheduling HPC jobs on big GPUs. On DOKS with B300 nodes you get that whole stack with a managed control plane, managed NFS, managed registry, and a fabric that actually delivers RoCE bandwidth.
What to read next
- Slinky on DOKS reference repo (Dockerfiles, fabric-nads.yaml, and sample manifests used in this guide)
- Use NFS storage on Kubernetes (managed NFS and GPU MTU tuning)
- Jupyter Notebooks with GPU Droplets (simpler single-node training before multi-node Slurm)
- DigitalOcean Inference Mode Comparison (choose a serving model after weights are ready)
- vLLM on Kubernetes: model loading and caching (if you serve from Kubernetes instead of Dedicated Inference)
- Kubernetes Gateway API with Cilium on DOKS (optional north-south traffic patterns for services on the same cluster)
Happy training.
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.