Run LLMs with Ollama on H100 GPUs for Maximum Efficiency

Sep 23, 2024

Introduction

This article is a guide to running Large Language Models using Ollama on H100 GPUs offered by DigitalOcean. DigitalOcean GPU Droplets provide a powerful, scalable solution for AI/ML training, inference, and other compute-intensive tasks such as deep learning, high-performance computing (HPC), data analytics, and graphics rendering. Designed to handle demanding workloads, GPU Droplets enable businesses to efficiently scale AI/ML operations on demand without incurring unnecessary management costs. Offering simplicity, flexibility, and affordability, DigitalOcean's GPU Droplets ensure quick deployment and ease of use, making them ideal for developers and data scientists.

Now, with support for NVIDIA H100 GPUs, users can accelerate AI/ML development and test, deploy, and optimize their applications seamlessly, without the extensive setup or maintenance typically associated with traditional platforms. Ollama is an open-source tool that provides access to a diverse library of pre-trained models, offers effortless installation and setup across different operating systems, and exposes a local API for seamless integration into applications and workflows. Users can customize and fine-tune LLMs, optimize performance with hardware acceleration, and benefit from interactive user interfaces for intuitive interactions.

Prerequisites

  • Access to H100 GPUs: Ensure you have access to NVIDIA H100 GPUs, either through on-premise hardware or through GPU Droplets from DigitalOcean.

  • Supported Frameworks: Familiarity with Python and Linux commands.

  • CUDA and cuDNN Installed: Ensure the NVIDIA CUDA and cuDNN libraries are installed for optimal GPU performance.

  • Sufficient Storage and Memory: Have ample storage and memory available to handle large model datasets and weights.

  • Basic Understanding of LLMs: A foundational understanding of large language models and their structure, in order to effectively manage and optimize them.

These prerequisites help ensure a smooth and efficient experience when running LLMs with Ollama on H100 GPUs.
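
Before going further, it is worth confirming that the CUDA toolkit and cuDNN are actually visible on the machine. A minimal sanity check, assuming the CUDA toolkit is on your PATH and the cuDNN header is in its usual location (paths may differ on your system):

# Confirm the CUDA compiler toolkit is installed
nvcc --version

# Confirm cuDNN is installed by reading its version header (location may vary)
grep -m1 CUDNN_MAJOR /usr/include/cudnn_version.h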

What is Ollama?

Ollama offers a way to download a large language model from its vast model library, which includes Llama 3.1, Mistral, Code Llama, Gemma, and much more. Ollama combines model weights, configuration, and data into one package, defined by a Modelfile. Ollama provides a flexible platform for creating, importing, and using custom or pre-existing language models, ideal for building chatbots, text summarization, and much more. It emphasizes privacy, integrates seamlessly with Windows, macOS, and Linux, and is free to use. Ollama also allows users to deploy models locally with ease. Further, the platform supports real-time interactions via a REST API, which makes it perfect for LLM-powered web apps and tools. It works much like Docker: with Docker, we can pull different images from a central hub and run them in containers. Likewise, Ollama allows us to customize models by creating a Modelfile. Below is the code to create a Modelfile:

FROM llama2
# Set the temperature
PARAMETER temperature 1
# Set the system prompt
SYSTEM """
You are a helpful teaching assistant created by DO. Answer questions based on Artificial Intelligence and Deep Learning.
"""

Next, create and run the custom model:

ollama create MLexp -f ./Modelfile
ollama run MLexp
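
Because Ollama exposes a local REST API (on port 11434 by default), the custom model can also be queried programmatically. A minimal sketch, assuming the Ollama service is running and the MLexp model was created as above; the prompt is just an example:

# Query the custom model through Ollama's local REST API (default port 11434)
curl http://localhost:11434/api/generate -d '{
  "model": "MLexp",
  "prompt": "What is backpropagation?",
  "stream": false
}'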

The Power of NVIDIA H100 GPUs

  • The H100 is NVIDIA's most powerful GPU, specially designed for artificial intelligence applications. With 80 billion transistors (up from 54 billion on the A100), it can process large data sets much faster than other GPUs on the market.
  • As we all know, AI applications are data hungry and computationally expensive. To manage this huge amount of work, the H100 is considered the best choice.
  • The H100 features fourth-generation Tensor Cores and a Transformer Engine with FP8 precision. The H100 triples the floating-point operations per second (FLOPS) compared to previous models, delivering 60 teraflops of double-precision (FP64) computing, which is crucial for precise calculations in HPC tasks. It can execute single-precision matrix-multiply operations at one petaflop of throughput using TF32 precision without requiring any changes to existing code, making it user-friendly for developers.
  • The H100 introduces DPX instructions that significantly boost performance for dynamic programming tasks, achieving 7X better performance than the A100 and 40X faster than CPUs for specific algorithms like DNA sequence alignment.
  • H100 GPUs provide the necessary computational power, offering 3 terabytes per second (TB/s) of memory bandwidth per GPU. This high bandwidth allows for efficient handling of large datasets.
  • The H100 supports scalability through technologies like NVLink and NVSwitch™, which allow multiple GPUs to work together effectively.

GPU Droplets

DigitalOcean GPU Droplets offer a simple, flexible, and cost-effective solution for your AI/ML workloads. These scalable machines are ideal for reliably running training and inference tasks on AI/ML models. Additionally, DigitalOcean GPU Droplets are well-suited for high-performance computing (HPC) tasks, making them a versatile choice for a range of use cases including simulation, data analysis, and scientific computing. Try GPU Droplets now by signing up for a DigitalOcean account.

Why Run LLMs with Ollama on H100 GPUs?

To run Ollama efficiently and hassle-free, an NVIDIA GPU is required; on a CPU alone, users can expect slow responses.

  • Thanks to its advanced architecture, the H100 offers exceptional computing power, which significantly speeds up running LLMs.
  • Ollama lets users customize and fine-tune LLMs to meet their specific needs, enabling prompt engineering, few-shot learning, and tailored fine-tuning to align models with desired outcomes (see the sketch after this list). Pairing Ollama with H100 GPUs improves model inference and training times for developers and researchers.
  • H100 GPUs have the capacity to handle models such as Falcon 180B, which makes them ideal for building and deploying Gen AI tools like chatbots or RAG applications.
  • H100 GPUs come with hardware optimizations like Tensor Cores, which significantly accelerate tasks involving LLMs, especially when dealing with matrix-heavy operations.
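
To make the customization point concrete, here is a minimal sketch of scripting a tuned model. The Modelfile parameters shown (temperature, num_ctx) are standard Ollama parameters, but the model name and values here are just example choices:

# Write a custom Modelfile that tunes sampling and context length (example values)
cat > Modelfile <<'EOF'
FROM llama3.1
PARAMETER temperature 0.3
PARAMETER num_ctx 8192
SYSTEM """
You answer concisely and factually about machine learning topics.
"""
EOF

# Build and run the tuned model
ollama create ml-tuned -f ./Modelfile
ollama run ml-tuned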

Setting Up Ollama with H100 GPUs

Ollama is fully compatible with Windows, macOS, and Linux. Here we are using Linux commands, as our GPU Droplets are based on a Linux OS.

Run the code below in your terminal to check the GPU specification.

nvidia-smi
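
Later, once a model is loaded, the same tool can be left running to watch GPU utilization and memory in real time:

# Refresh GPU utilization and memory statistics every second
watch -n 1 nvidia-smi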

Next, we will install Ollama using the same terminal.

curl -fsSL https://ollama.com/install.sh | sh

This will immediately start the Ollama installation.

Once the installation is done, we can pull any LLM and start working with a model such as Llama 3.1, Phi 3, Mistral, Gemma 2, or any other model.

To run and chat with a model, use the code below. Please feel free to change the model as per your requirements. Running a model with Ollama is quite straightforward, and since we are using the powerful H100, generating a response is fast and efficient.

ollama run example_model
ollama run qwen2:7b

In case of the error "could not connect to ollama app, is it running?", please use the code below to connect to Ollama:

sudo systemctl enable ollama
sudo systemctl start ollama
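
To confirm the service came up, and to inspect its logs if it did not, the standard systemd commands apply:

# Verify that the Ollama service is active
sudo systemctl status ollama

# Show the last 50 lines of the service log if something went wrong
journalctl -u ollama -n 50 --no-pager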

Ollama supports a wide list of models; here are some example models that can be downloaded and used.

Model          Parameters   Size     Download
Llama 3.1      8B           4.7GB    ollama run llama3.1
Llama 3.1      70B          40GB     ollama run llama3.1:70b
Llama 3.1      405B         231GB    ollama run llama3.1:405b
Phi 3 Mini     3.8B         2.3GB    ollama run phi3
Phi 3 Medium   14B          7.9GB    ollama run phi3:medium
Gemma 2        27B          16GB     ollama run gemma2:27b
Mistral        7B           4.1GB    ollama run mistral
Code Llama     7B           3.8GB    ollama run codellama

With Ollama, users can run LLMs conveniently without even needing an internet connection, as the model and its data are stored locally.

>>> Write a python code for a fibonacci series.

def fibonacci(n):
    """
    This function prints the first n numbers of the Fibonacci sequence.

    Parameters:
        n (int): The number of elements in the Fibonacci sequence to print.

    Returns:
        None
    """
    a, b = 0, 1
    for i in range(n):
        print(a)
        a, b = b, a + b

if __name__ == "__main__":
    fibonacci(10)

This Python code defines a simple `fibonacci` function that takes an integer argument and prints the first n numbers in the Fibonacci sequence. The Fibonacci sequence starts with 0 and 1, and each subsequent number is the sum of the previous two.

The `if __name__ == "__main__":` block at the end tests this function by calling it with a parameter value of 10, which prints out the first 10 numbers in the Fibonacci sequence.
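
The same interaction can also be scripted against the local REST API instead of the interactive prompt. A minimal sketch using Ollama's /api/chat endpoint; the model name here is just an example:

# Send the same question to a model through the local chat endpoint
curl http://localhost:11434/api/chat -d '{
  "model": "codellama",
  "messages": [
    {"role": "user", "content": "Write a python code for a fibonacci series."}
  ],
  "stream": false
}'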

Conclusion

Ollama is a new Gen-AI tool for working with large language models locally, offering enhanced privacy, customization, and offline accessibility. By making it simpler to explore and experiment with open-source LLMs directly on their own machines, Ollama promotes innovation and a deeper understanding of AI. To access a powerful GPU like the H100, consider using DigitalOcean's GPU Droplets, which are currently in Early Availability.

To get started with Python, we recommend checking out this beginner's guide to set up your system and prepare for working through introductory tutorials.

References

  • Ollama Official
  • GPU Droplets