LLM Inference Optimization 101

Jan 17, 2025

Fast inference makes the world go brrr

Large Language Models (LLMs) generate coherent, natural language responses, effectively automating a multitude of tasks that were previously exclusive to humans. As many key players in the field, such as Jensen Huang and Ilya Sutskever, have recently alluded to, we're in an era of agentic AI. This new paradigm seeks to revolutionize various aspects of our lives, from personalized medicine and education to intelligent assistants, and beyond.

However, it is important to be aware that while these models are getting increasingly powerful, widespread adoption is hindered by the massive cost of running them, frustrating wait times that render certain real-world applications impractical, as well as, of course, their growing carbon footprint. To reap the benefits of this technology while mitigating cost and power consumption, it is critical that we continue to optimize every aspect of LLM inference.

The goal of this article is to give readers an overview of current ways in which researchers and deep learning practitioners are optimizing LLM inference.

What is LLM inference?

Similar to how one uses what they learned to solve a new problem, inference is when a trained AI model uses patterns detected during training to infer and make predictions on new data. This inference process is what enables LLMs to perform tasks like text completion, translation, summarization, and conversation.

Text Generation Inference with 1-click Models

DigitalOcean has collaborated with Hugging Face to offer 1-click models. This allows for the integration of GPU Droplets with state-of-the-art open-source LLMs in Text Generation Inference (TGI)-optimized container applications. This means many of the inference optimizations covered in this article (e.g., tensor parallelism, quantization, FlashAttention, PagedAttention) are already taken care of and maintained by Hugging Face. For information on how to use these 1-click models, check out our article Getting Started with LLMs.

Prerequisites

While this article includes some introductory deep learning concepts, many of the topics discussed are relatively advanced. Those wishing to better understand inference optimization are encouraged to explore the links scattered throughout the article and in the references section.

It is advised that readers have an understanding of neural network fundamentals, the attention mechanism, the transformer, and data types before proceeding.

It would also help to be knowledgeable about the GPU memory hierarchy.

The article Introduction to GPU Performance Optimization provides context on how GPUs can be programmed to accelerate neural network training and inference. It also explains key terms such as latency and throughput.

The Two Phases of LLM Inference

LLM inference can be divided into two phases: prefill and decode. These stages are treated separately because of their different computational requirements. Prefill, a highly parallelized matrix-matrix operation that saturates GPU utilization, is compute-bound, whereas decode, a matrix-vector operation that underutilizes the GPU's compute capability, is memory-bound.

The prefill phase can be likened to reading an entire document at once and processing all of its words simultaneously to produce the first word of a response, whereas the decode phase can be compared to continuing to write that response word by word, where the choice of each word depends on what was written before.

Let’s explore why prefill is compute-bound and decode is memory-bound.

Prefill

In the prefill stage, the LLM processes the entire input prompt at once to generate the first response token. This involves performing a full forward pass through the transformer layers for every token in the prompt simultaneously. While memory access is needed during prefill, the computational work of processing the tokens in parallel dominates the performance profile.

Decode

In the decode stage, text is generated autoregressively, with the next token predicted one at a time given all previous tokens. The decoding process is memory-bound due to its need to repeatedly access historical context. For each new token generated, the model must load the attention cache (key/value states, also known as the KV cache) of all previous tokens, requiring frequent memory accesses that become more intensive as the sequence grows longer. Although the actual computation per token during decode is considerably lower than during prefill, the repeated retrieval of cached attention states makes memory bandwidth and redundant memory accesses the limiting factor of the decode phase.
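To make the two phases concrete, below is a minimal sketch of manual prefill and decode using the Hugging Face transformers library. The model name "gpt2" is only an example; any causal LM follows the same pattern.

```python
# Minimal prefill + decode sketch with an explicit KV cache (illustrative only).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids

with torch.no_grad():
    # Prefill: one parallel forward pass over the whole prompt builds the KV cache.
    out = model(prompt_ids, use_cache=True)
    past_key_values = out.past_key_values
    next_token = out.logits[:, -1].argmax(dim=-1, keepdim=True)

    generated = [next_token]
    for _ in range(20):
        # Decode: one token at a time, reusing and extending the cached K/V states.
        out = model(next_token, past_key_values=past_key_values, use_cache=True)
        past_key_values = out.past_key_values
        next_token = out.logits[:, -1].argmax(dim=-1, keepdim=True)
        generated.append(next_token)

print(tokenizer.decode(torch.cat(generated, dim=-1)[0]))
```

The single forward pass over the prompt is the prefill; each loop iteration is a decode step that reuses the KV cache instead of recomputing attention over the whole sequence.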

Metrics can be used to measure performance and identify potential bottlenecks during these two inference stages.

Metrics

| Metric | Definition | Why do we care? |
| --- | --- | --- |
| Time-to-First-Token (TTFT) | Time to process the prompt and generate the first token. | TTFT tells us how long prefill took. The longer the prompt, the longer the TTFT, as the attention mechanism needs the entire input sequence to compute the KV cache. Inference optimization seeks to minimize TTFT. |
| Inter-token Latency (ITL), also known as Time per Output Token | Average time between consecutive tokens. | ITL tells us the rate at which decoding (token generation) occurs. Consistent ITLs are ideal, as they are indicative of efficient memory management, high GPU memory bandwidth, and well-optimized attention computation. |
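As a rough illustration of where these two metrics are measured, here is a sketch that times TTFT and mean ITL around the manual prefill/decode loop from the previous section. The numbers are only illustrative; on a GPU you would also want torch.cuda.synchronize() around the timers for accurate readings.

```python
# Sketch of measuring TTFT and mean ITL; reuses a model/tokenizer loaded as above.
import time
import torch

def measure(model, tokenizer, prompt, max_new_tokens=32):
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        start = time.perf_counter()
        out = model(ids, use_cache=True)                          # prefill
        next_token = out.logits[:, -1].argmax(-1, keepdim=True)
        ttft = time.perf_counter() - start                        # time to first token

        past, gaps = out.past_key_values, []
        for _ in range(max_new_tokens - 1):
            t0 = time.perf_counter()
            out = model(next_token, past_key_values=past, use_cache=True)
            past = out.past_key_values
            next_token = out.logits[:, -1].argmax(-1, keepdim=True)
            gaps.append(time.perf_counter() - t0)                 # one decode step
    return ttft, sum(gaps) / len(gaps)                            # TTFT, mean ITL
```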

Optimizing Prefill and Decode

Speculative Decoding

Speculative decoding uses a smaller, faster model to generate multiple candidate tokens at once, which are then verified by the larger target model.
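One hedged way to try this is through the transformers library, which exposes speculative (assisted) decoding via the assistant_model argument of generate. The gpt2 / gpt2-xl pairing below is only illustrative; the draft model must share the target model's tokenizer.

```python
# Assisted (speculative) decoding sketch: a small drafter proposes, the target verifies.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2-xl")
target = AutoModelForCausalLM.from_pretrained("gpt2-xl")
draft = AutoModelForCausalLM.from_pretrained("gpt2")  # small, fast draft model

inputs = tokenizer("Speculative decoding works by", return_tensors="pt")
out = target.generate(**inputs, assistant_model=draft, max_new_tokens=40)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```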

Chunked Prefills and Decode-Maximal Batching

SARATHI shows how chunked prefills can divide large prefills into manageable chunks, which can then be batched with decode requests (decode-maximal batching) for efficient processing.
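Below is a simplified sketch of the chunking idea only: a long prompt is fed through the model in fixed-size chunks while the KV cache accumulates. Real systems such as SARATHI additionally pack decode requests into the same batch as these prefill chunks, which is not shown here.

```python
# Chunked prefill sketch with a Hugging Face causal LM (conceptual, single request).
import torch

def chunked_prefill(model, input_ids, chunk_size=256):
    past = None
    with torch.no_grad():
        for start in range(0, input_ids.size(1), chunk_size):
            chunk = input_ids[:, start : start + chunk_size]
            out = model(chunk, past_key_values=past, use_cache=True)
            past = out.past_key_values          # KV cache grows chunk by chunk
    return out.logits[:, -1], past              # next-token logits + full prompt KV cache
```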

Batching

Batching groups inference requests together, with larger batch sizes corresponding to higher throughput. However, batch sizes can only be increased up to a certain point due to limited GPU on-chip memory.

Batch Size

To achieve maximum utilization of the hardware, one can try to find the critical ratio where there is a balance between two key limiting factors:

  • The time needed to transfer weights between memory and compute units (limited by memory bandwidth)
  • The time required for the actual computational operations (limited by FLOPS)

As long as these two times are roughly equal, the batch size can be increased without incurring any performance penalty. Beyond that point, increasing the batch size creates a bottleneck in either memory transfer or computation. To find the optimal batch size, profiling is important.
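As a back-of-the-envelope example of this crossover, the sketch below estimates the critical decode batch size from a GPU's peak FLOPS and memory bandwidth, assuming FP16 weights that are read once per step and ignoring KV cache and activation traffic. The hardware numbers are illustrative (roughly A100-class); profiling on your actual hardware is still required.

```python
# Rough estimate of the batch size at which decode shifts from memory- to compute-bound.
peak_flops = 312e12        # FLOP/s (FP16 tensor cores, illustrative)
mem_bandwidth = 2.0e12     # bytes/s (HBM, illustrative)

bytes_per_param = 2            # FP16 weight storage
flops_per_param_per_token = 2  # one multiply-accumulate per weight per token

# compute time ~ flops_per_param_per_token * params * batch / peak_flops
# memory  time ~ bytes_per_param * params / mem_bandwidth   (weights read once per step)
# Setting them equal, the parameter count cancels out:
critical_batch = (peak_flops / mem_bandwidth) * (bytes_per_param / flops_per_param_per_token)
print(f"Decode becomes compute-bound around batch of ~{critical_batch:.0f} tokens per step")
```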

KV cache management plays a critical role in determining the maximum batch size and improving inference. Thus, the remainder of this article will focus on managing the KV cache.

KV Cache Management

When looking at how memory is allocated on the GPU during serving, the model weights remain fixed and the activations use only a small fraction of the GPU's memory resources compared to the KV cache. Therefore, freeing up space for the KV cache is critical. This can be achieved by reducing the model weights' memory footprint through quantization, reducing the KV cache's memory footprint with modified architectures and attention variants, and pooling memory from multiple GPUs with parallelism.

Quantization

Quantization reduces the number of bits needed to store the model's parameters (e.g., weights, activations, and gradients). This technique reduces inference latency by trading some accuracy for a smaller memory footprint.
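A hedged sketch of weight quantization in practice, using transformers with a bitsandbytes 4-bit configuration (this requires a CUDA GPU and the bitsandbytes package installed; the model name is just an example, substitute any causal LM you have access to):

```python
# Loading a causal LM with 4-bit quantized weights via bitsandbytes (illustrative).
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # store weights in 4-bit, compute in bf16
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",            # example model name
    quantization_config=bnb_config,
    device_map="auto",
)
print(model.get_memory_footprint() / 1e9, "GB")  # roughly 1/4 of the FP16 footprint
```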

Attention and its variants

Review of Queries, Keys, and Values:

  • Queries: represent the context or question.
  • Keys: represent the information being attended to.
  • Values: represent the information being retrieved.

Attention weights are computed by comparing queries with keys, and are then used to weight values, producing the final output representation.

Query (Prompt) → Attention Weights → Relevant Information (Values)
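A minimal scaled dot-product attention implementation, matching the query/key/value description above:

```python
# Single-head scaled dot-product attention (illustrative, no masking).
import math
import torch

def attention(q, k, v):
    # q, k, v: (batch, seq_len, head_dim)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))  # compare queries with keys
    weights = scores.softmax(dim=-1)                          # attention weights
    return weights @ v                                        # weighted sum of values

q = k = v = torch.randn(1, 8, 64)
out = attention(q, k, v)   # shape (1, 8, 64)
```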

Sliding Window Attention (SWA), also known as local attention, restricts attention to a fixed-size window that slides over the sequence. While SWA alone does not scale well to long inputs, Character AI found that speed and quality were not impacted on long sequences when interleaving SWA and global attention, with adjacent global attention layers sharing a KV cache (cross-layer attention).

Local Attention vs. Global Attention

Local and global attention mechanisms differ in key ways. Local attention uses less computation (O(n·w)) and memory by restricting each token to a window of neighbors, enabling faster inference, especially for long sequences, but may miss long-range dependencies. Global attention, while computationally more expensive (O(n²)) and memory-intensive due to processing all token pairs, better captures full context and long-range dependencies at the cost of slower inference.
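The toy sketch below contrasts a causal (global) attention mask with a sliding-window (local) mask, which is where the O(n·w) versus O(n²) difference comes from:

```python
# Causal (global) mask vs. sliding-window (local) mask, as boolean matrices.
import torch

def causal_mask(n):
    return torch.tril(torch.ones(n, n, dtype=torch.bool))

def sliding_window_mask(n, window):
    m = causal_mask(n)
    for i in range(n):
        m[i, : max(0, i - window + 1)] = False  # drop tokens outside the local window
    return m

print(causal_mask(6).int())
print(sliding_window_mask(6, window=3).int())   # each token attends to at most 3 tokens
```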

Paged Attention

Inspired by virtual memory allocation, PagedAttention proposed a framework for optimizing the KV cache that takes the variation in the number of tokens across requests into consideration.
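The following is a conceptual toy, not vLLM's actual implementation: the KV cache is stored in fixed-size blocks, and each request keeps a block table mapping logical token positions to whatever physical blocks happen to be free, so memory is not wasted on large pre-reserved contiguous slots.

```python
# Toy block-table allocation in the spirit of PagedAttention (illustrative only).
BLOCK_SIZE = 16

class BlockAllocator:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def allocate(self):
        return self.free.pop()              # grab any free physical block

class Request:
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []               # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        if self.num_tokens % BLOCK_SIZE == 0:   # current block is full (or first token)
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

allocator = BlockAllocator(num_blocks=64)
req = Request(allocator)
for _ in range(40):
    req.append_token()
print(req.block_table)   # 40 tokens only occupy three 16-token blocks
```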

FlashAttention

There are three versions of FlashAttention, with FlashAttention-3 being the latest release, optimized for Hopper GPUs. Each iteration of the algorithm takes a hardware-aware approach to make the attention computation as fast as possible. Past articles written on FlashAttention include Designing Hardware-Aware Algorithms: FlashAttention and FlashAttention-2.
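A hedged, minimal way to benefit from such fused kernels without depending on the flash-attn package directly: PyTorch's scaled_dot_product_attention dispatches to a fused FlashAttention-style kernel on supported GPUs and falls back to the math implementation elsewhere.

```python
# Fused attention via PyTorch SDPA; kernel choice depends on hardware and dtype.
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

# (batch, heads, seq_len, head_dim)
q = torch.randn(1, 8, 1024, 64, device=device, dtype=dtype)
k, v = torch.randn_like(q), torch.randn_like(q)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)  # may use a FlashAttention kernel
```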

Model Architectures: Dense Models vs. Mixture of Experts

Dense LLMs are the standard, where all parameters are actively engaged during inference.

Mixture of Experts (MoE) LLMs are composed of multiple specialized sub-networks with a routing mechanism. Because only the relevant experts are activated for each input, they often show improved parameter efficiency and faster inference than dense models.
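A toy top-2 MoE layer illustrating the routing idea (not any particular production architecture): the router selects two experts per token, so only a fraction of the layer's parameters run for a given input.

```python
# Toy top-2 Mixture-of-Experts layer (illustrative, loops kept for clarity).
import torch
import torch.nn as nn

class ToyMoE(nn.Module):
    def __init__(self, dim=64, num_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_experts))
        self.top_k = top_k

    def forward(self, x):                           # x: (tokens, dim)
        gate = self.router(x).softmax(dim=-1)       # routing probabilities
        weights, idx = gate.topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):              # run only the selected experts
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

layer = ToyMoE()
y = layer(torch.randn(10, 64))   # each token only touches 2 of the 8 experts
```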

Parallelism

Larger models often require multiple GPUs to run effectively. There are a number of different parallelization strategies that allow for multi-GPU inference.

| Parallelism Type | Partitions | Description | Purpose |
| --- | --- | --- | --- |
| Data | Data | Splits different batches of data across devices. | Distributes memory and computation for large datasets that wouldn't fit on a single device |
| Tensor | Weight tensors | Splits tensors across multiple devices, either row-wise or column-wise | Distributes memory and computation for large tensors that wouldn't fit on a single device |
| Pipeline | Model layers (vertically) | Splits different stages of the full model pipeline in parallel | Improves throughput by overlapping computation of different model stages |
| Context | Input sequences | Divides input sequences into segments across devices | Reduces the memory bottleneck for long sequence inputs |
| Expert | MoE models | Splits experts, where each expert is a smaller model, across devices | Allows for larger models with improved performance by distributing computation across multiple experts |
| Fully Sharded Data | Data, model, optimizer, and gradients | Shards components across devices, processes data in parallel, and synchronizes after each training step | Enables training of extremely large models that exceed the memory capacity of a single device by distributing both model parameters and activations |
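As a single-machine toy of the tensor parallelism row above, the sketch below splits a weight matrix column-wise, computes each shard's partial output separately (standing in for separate GPUs), and concatenates the results; real implementations do this across devices with collective communication.

```python
# Column-wise tensor parallelism, simulated on one device (illustrative only).
import torch

def column_parallel_linear(x, weight, num_shards=2):
    # weight: (in_features, out_features), split column-wise into shards
    shards = weight.chunk(num_shards, dim=1)
    partial_outputs = [x @ w for w in shards]   # each shard computed "on its own device"
    return torch.cat(partial_outputs, dim=-1)   # gather the partial results

x = torch.randn(4, 512)
w = torch.randn(512, 2048)
assert torch.allclose(column_parallel_linear(x, w), x @ w, atol=1e-5)
```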

Conclusion

It’s undeniable that inference is an exciting area of research and optimization. The field moves fast, and to keep up, inference needs to move faster. In addition to more agentic workflows, we’re seeing more dynamic inference strategies that allow models to “think longer” on harder problems. For example, OpenAI’s o1 model shows consistent performance improvements on challenging math and programming tasks when more computational resources are devoted during inference.

Well, thanks so much for reading! This article certainly does not cover everything there is in inference optimization. Stay tuned for more exciting articles on this topic and adjacent ones.

References and Other Excellent Resources

Blog posts:

Mastering LLM Techniques: Inference Optimization | NVIDIA Technical Blog

LLM Inference at Scale with TGI

Looking back at speculative decoding (Google Research)

LLM Inference Series: 4. KV caching, a deeper look | by Pierre Lienhart | Medium

A Visual Guide to Quantization - by Maarten Grootendorst

Optimizing AI Inference at Character.AI

Optimizing AI Inference at Character.AI (Part Deux)

Papers:

LLM-Inference-Bench: Inference Benchmarking of Large Language Models on AI Accelerators

Efficient Memory Management for Large Language Model Serving with PagedAttention

SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills

The Llama 3 Herd of Models (Section 6)

Talks:

Mastering LLM Inference Optimization From Theory to Cost Effective Deployment: Mark Moyou

NVIDIA CEO Jensen Huang Keynote at CES 2025

Building Machine Learning Systems for a Trillion Trillion Floating Point Operations :: Jane Street

Dylan Patel - Inference Math, Simulation, and AI Megaclusters - Stanford CS 229S - Autumn 2024

How does batching work on modern GPUs?

GitHub Links:

Sharan Chetlur (NVIDIA) - Presentation Slides - High Performance LLM Serving on NVIDIA GPUs

GitHub - huggingface/search-and-learn: Recipes to scale inference-time compute of open models
