Large-language-model deployment has evolved from research experiments into production-scale chatbots, AI-powered search engines, and enterprise-class assistants. While model accuracy continues to improve thanks to engineering advances in transformer architecture, business challenges revolve around cost, latency, and scale. GPU serving bottlenecks are rarely caused by insufficient FLOPs; most limitations come from memory bandwidth and the expanding memory footprint of attention states and growing context. Context windows are expanding quickly towards thousands or even millions of tokens, and naive inference pipelines struggle to keep up. A critical technique for mitigating redundant computation is key-value (KV) caching, which stores intermediate attention states rather than recomputing them for each token.
This article breaks down how KV caching works internally, why inference is memory-bound, and how modern engines use caching, paging, and batching to slash inference costs. We'll also contrast KV caching with prefix/prompt caching and discuss design trade-offs.
Key Takeaways
- KV caching reduces redundant computation during LLM decoding. Instead of recomputing key and value tensors for all previous tokens at each generation step, the model stores and reuses them.
- LLM inference is often limited by memory bandwidth, not only computation. Prefill is usually compute-heavy, while decoding is memory-heavy because each new token requires access to model weights and KV-cache blocks.
- KV cache memory grows linearly with sequence length and batch size. Long-context workloads and large batches can make KV cache memory rival or exceed model weight memory.
- Modern serving systems reduce KV-cache waste through paged attention, continuous batching, prefix caching, quantization, eviction, and offloading. These techniques improve GPU utilization and reduce infrastructure cost.
- KV caching is powerful but not always equally effective. Short sequences, small models, highly dynamic prompts, sliding-window attention, memory limits, and slow offloading tiers can reduce its benefits.
What Is KV Caching?
In transformer models, each output token can attend to all previously generated tokens through multi-head attention. At each decoding step, queries (Q), keys (K), and values (V) are computed for each token. Without caching, K and V for all previous tokens must be recomputed at every step, so generating N tokens repeats that work N times. As a result, the total work spent recomputing K and V grows quadratically with sequence length.

KV caching alleviates this by storing the K and V vectors for each token after they are computed and reusing those vectors at each subsequent decoding step. Rather than being recomputed during each iteration, K and V for earlier tokens are simply loaded from the cache. The model only needs to compute the query, key, and value vectors for the new token.
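To make the quadratic-versus-linear difference concrete, the following is a minimal counting sketch. The function name `kv_projections` is hypothetical, and the count ignores the prompt and everything except the number of per-layer K/V token projections.

```python
def kv_projections(num_new_tokens: int, use_cache: bool) -> int:
    """Count per-layer K/V token projections needed to generate the tokens."""
    total = 0
    for step in range(1, num_new_tokens + 1):
        # With a cache, only the newest token is projected. Without one,
        # every token seen so far (`step` of them) is projected again.
        total += 1 if use_cache else step
    return total


print(kv_projections(1000, use_cache=False))  # 500500
print(kv_projections(1000, use_cache=True))   # 1000
```

Under these assumptions, generating 1,000 tokens without a cache repeats roughly 500 times more K/V projection work than generating them with one.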
Why LLM Inference Gets Expensive at Scale
LLM inference becomes costly at scale because serving a model involves more than raw compute. It is also about memory bandwidth, cache growth, batching efficiency, and how many useful tokens each GPU can produce per second.
Memory‑Bound vs Compute‑Bound
Intuitively, you might expect that upgrading to a GPU with higher FLOP throughput would always make inference faster, but large portions of LLM decoding are memory-bound rather than compute-bound. While decoding, the model generates one token at a time. For each generated token, the model needs to read the model weights and the KV cache, yet it performs relatively little computation per token.

Therefore, arithmetic intensity (the amount of computation performed per byte read or written) becomes the dominant factor. Prefill generally has high arithmetic intensity, so GPUs are compute-bound for that stage; decode arithmetic intensity is much lower, so time is spent waiting on memory. Empirically, you'll often see this as high memory traffic and low compute utilization. Upgrading GPUs will only help so much if the bottleneck is really memory bandwidth.
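As a rough illustration of why the two phases behave differently, the sketch below estimates arithmetic intensity for a single square linear layer. It is a simplification under stated assumptions: the function name, the `d_model` value of 4096, and the byte accounting (weights read once, activations read and written once, FP16) are illustrative choices, not measurements from any particular GPU or model.

```python
def arithmetic_intensity(tokens: int, d_model: int, bytes_per_value: int = 2) -> float:
    """FLOPs per byte for one d_model x d_model linear layer applied to `tokens` tokens."""
    flops = 2 * tokens * d_model * d_model                      # one multiply and one add per weight per token
    weight_bytes = d_model * d_model * bytes_per_value          # weights are read once per pass
    activation_bytes = 2 * tokens * d_model * bytes_per_value   # read inputs, write outputs
    return flops / (weight_bytes + activation_bytes)


decode = arithmetic_intensity(tokens=1, d_model=4096)      # one new token per step
prefill = arithmetic_intensity(tokens=2048, d_model=4096)  # whole prompt in one pass
print(f"decode:  {decode:.2f} FLOPs/byte")
print(f"prefill: {prefill:.2f} FLOPs/byte")
```

In this toy model, decode performs just under 1 FLOP per byte moved while a 2,048-token prefill performs about 1,024, which is the gap the memory-bound versus compute-bound distinction describes.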
Growth of Cache and Context
The KV cache grows with context windows and can easily become the largest consumer of GPU memory. For each new token, cached key and value tensors are added to each layer of the model's attention mechanism.

The amount of cache used grows approximately linearly with the number of "active" tokens, which depends on both sequence length and batch size. For Llama-2-7B in half-precision, Pierre Lienhart estimates the KV cache requires about 0.5 MB per token. In this case, 28k total active tokens correspond to about 14 GB of KV-cache memory (28,000 tokens × 0.5 MB/token ≈ 14,000 MB ≈ 14 GB), which is on par with the model's FP16 weights. KV cache growth therefore becomes as important as raw model size when considering whether an LLM workload will fit in GPU memory and how efficiently it will run.
Serving Costs and Economics
Inference computation and memory usage determine inference cost. Prompt tokens mainly affect prefill compute and KV-cache creation cost. Generated tokens incur repeated decode compute and put repeated pressure on memory bandwidth, as they continue to use the KV cache.

Caching alleviates this repeated computation, reducing cost per request or cost per token. However, growth in KV-cache requirements will continue to pressure providers to buy larger GPUs, cluster more GPUs together, or build more sophisticated memory management systems.
How KV Caching Works
At a high level, KV caching works like this:

- Prefill: The model runs the prompt through once and computes per-layer, per-head key (K) and value (V) vectors for each token in the prompt; it stores those K/V tensors in a cache so it doesn't have to recompute them later.
- Decode: For each decoding step, the model computes the query (Q) for the new token and uses Q to attend over the cached keys and values (the cached Ks are used to compute attention weights, and the cached Vs provide the attended outputs). The model also computes the new token's own K and V.
- Cache update: The new token's K and V tensors are appended to the cache so that subsequent decoding steps can reuse them.
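The three steps above can be sketched with a toy single-head attention layer in pure Python. This is an illustrative sketch, not how any production engine is implemented: `KVCache`, `step_with_cache`, and `step_without_cache` are hypothetical names, the weights are random, and there is only one layer and one head. It checks that the cached path and the recompute-everything path give the same output.

```python
import math
import random


def matvec(x, W):
    """Row vector x (length d_in) times matrix W (d_in x d_out)."""
    return [sum(xi * wij for xi, wij in zip(x, col)) for col in zip(*W)]


def attend(q, keys, values):
    """Scaled dot-product attention of one query over all keys/values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    peak = max(scores)                                   # subtract max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]


class KVCache:
    """Per-head cache: one K and one V vector per token seen so far."""
    def __init__(self):
        self.keys, self.values = [], []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)


def step_with_cache(x_new, Wq, Wk, Wv, cache):
    # Project only the newest token, extend the cache, then attend over it.
    cache.append(matvec(x_new, Wk), matvec(x_new, Wv))
    return attend(matvec(x_new, Wq), cache.keys, cache.values)


def step_without_cache(xs, Wq, Wk, Wv):
    # Re-project K and V for every token in the sequence at every step.
    keys = [matvec(x, Wk) for x in xs]
    values = [matvec(x, Wv) for x in xs]
    return attend(matvec(xs[-1], Wq), keys, values)


rng = random.Random(0)
d = 4
Wq, Wk, Wv = ([[rng.uniform(-1, 1) for _ in range(d)] for _ in range(d)] for _ in range(3))
tokens = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(6)]

cache = KVCache()
for t, x in enumerate(tokens):
    cached = step_with_cache(x, Wq, Wk, Wv, cache)
    naive = step_without_cache(tokens[: t + 1], Wq, Wk, Wv)
    assert all(abs(a - b) < 1e-9 for a, b in zip(cached, naive))
print("cached and uncached outputs match; cache holds", len(cache.keys), "tokens")
```

Note that in this sketch the new token's K and V are appended before attention runs, so the token attends to itself as well as to the earlier tokens.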
Memory Footprint Calculation
The KV cache memory can be approximated as:

$$\text{KV cache bytes} \approx 2 \times B \times S \times L \times H_{kv} \times D \times \frac{Q}{8}$$

where $B$ is batch size, $S$ is sequence length, $L$ is the number of layers, $H_{kv}$ is the number of key/value heads, $D$ is the head dimension, and $Q$ is the number of bits per cache element. The factor of 2 accounts for the storage of both key and value tensors. From this formula, we can observe linear growth in batch size and sequence length.
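The approximation described here (2 × B × S × L × Hkv × D × Q/8 bytes) can be turned into a small calculator. The function name `kv_cache_bytes` is hypothetical; the Llama-2-7B shape used in the example (32 layers, 32 key/value heads, head dimension 128) is that model's published configuration, with 16-bit cache elements assumed.

```python
def kv_cache_bytes(batch, seq_len, layers, kv_heads, head_dim, bits=16):
    """Approximate KV-cache size in bytes: 2 * B * S * L * Hkv * D * (Q / 8)."""
    return 2 * batch * seq_len * layers * kv_heads * head_dim * bits // 8


# Llama-2-7B shape in FP16: 32 layers, 32 KV heads, head dimension 128.
per_token = kv_cache_bytes(1, 1, 32, 32, 128)
print(per_token / 2**20)        # 0.5  (MiB per token)

total = kv_cache_bytes(1, 28_000, 32, 32, 128)
print(round(total / 1e9, 2))    # 14.68  (decimal GB for 28,000 active tokens)
```

The per-token figure matches the roughly 0.5 MB per token quoted earlier; the total comes out at about 14.7 decimal GB (about 13.7 GiB), consistent with the approximate 14 GB estimate.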
Real Performance Gains from KV Caching
- 5.2× faster generation: A Hugging Face community benchmark run on a T4 GPU reported a 5.21× generation speedup: 11.7 seconds with KV caching versus 1 minute 1 second without.
- Up to 5× faster TTFT with early reuse: NVIDIA reports that early KV cache reuse in TensorRT-LLM can improve time to first token (TTFT) by up to 5× for workloads that allow shared system prompts. NVIDIA separately mentions up to 14× faster TTFT on H100 systems using CPU offload to reuse KV cache.
- Up to 23× throughput improvement: Anyscale reports throughput improvement of up to 23× with continuous batching and vLLM-specific memory optimizations.
- 2–4× throughput via PagedAttention: According to the vLLM/PagedAttention paper, there is a 2–4× throughput improvement over prior serving systems at similar latency. vLLM has <4% KV-cache memory waste, compared to much higher waste for conventional contiguous allocation systems.
- Up to 50% KV-cache memory reduction with NVFP4: NVIDIA states that NVFP4-accelerated KV cache quantization can reduce KV-cache memory footprint by up to 50% (versus FP8) on its Blackwell GPUs, with reported accuracy losses of <1% on selected benchmarks.
KV Caching vs Prompt Caching
KV caching is scoped to the decode phase of a single request. Without it, K and V would need to be recomputed for all previous tokens on each new token, leading to quadratic cost. In comparison, prompt caching (sometimes referred to as prefix caching) works across multiple requests. It stores the KV cache for a fixed prefix (e.g., system prompt + tool definitions + reference documents) so that future requests can skip prefill for that prefix and continue computing from the cached prefix onwards.

Prompt caching only works for identical prefixes; even a single-character difference results in a cache miss. When conditions are met, prompt caching can reduce compute and latency by an order of magnitude.
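The exact-match behaviour can be illustrated with a toy lookup table. This is a sketch only: `PrefixCache` and `run_prefill` are hypothetical names, the token IDs are made up, and the stored "KV state" is a placeholder dictionary rather than real tensors. Production engines typically hash fixed-size token blocks rather than whole prefixes.

```python
import hashlib


class PrefixCache:
    """Toy exact-match prefix cache keyed by a hash of the prefix token IDs."""
    def __init__(self):
        self._store = {}

    @staticmethod
    def _key(token_ids):
        return hashlib.sha256(repr(tuple(token_ids)).encode()).hexdigest()

    def put(self, prefix_ids, kv_state):
        self._store[self._key(prefix_ids)] = kv_state

    def get(self, prefix_ids):
        return self._store.get(self._key(prefix_ids))


def run_prefill(token_ids):
    # Stand-in for the real prefill pass; returns a placeholder "KV state".
    return {"num_tokens": len(token_ids)}


cache = PrefixCache()
system_prompt = [101, 7592, 2088, 102]                 # hypothetical token IDs
cache.put(system_prompt, run_prefill(system_prompt))

hit = cache.get([101, 7592, 2088, 102])    # identical prefix: cached state is reused
miss = cache.get([101, 7592, 2089, 102])   # one token differs: prefill must run again
print(hit, miss)
```

The second lookup returns nothing because a single changed token produces a different key, which is why prompts intended for reuse usually put the stable content first and the variable content last.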
Infrastructure Challenges and Trade‑Offs
Despite its benefits, KV caching introduces operational challenges:

- Memory footprint and fragmentation: KV cache sizes can grow large enough to exceed model weights, particularly on long-context and high-batch workloads. Contiguous allocation can lead to significant GPU memory waste due to fragmentation and over-reservation. PagedAttention limits memory waste by storing KV cache in fixed-size, non-contiguous blocks that are allocated on demand and released when sequences complete. vLLM claims memory waste under 4%, at the cost of more complex block tables, cache management, and scheduling.
- Cache eviction: The serving engine must decide which cached tokens to retain when memory is constrained. Eviction policies used by prior work include sliding-window attention, attention-score-based eviction, and layer-aware/entropy-guided allocation of cache budget. Methods like H2O and Scissorhands show that keeping only recent and important tokens can reduce KV-cache memory usage by about 2–5× on benchmarked tasks with little or no quality loss.
- Memory-bandwidth bottleneck: Memory bandwidth may also limit throughput, even when there is enough memory capacity. Each decoding step must read model weights and the corresponding blocks from the KV cache, so generation speed often depends more on memory I/O than raw compute. Disaggregated prefill/decode architectures alleviate this pressure by running compute-intensive prefill work and memory-bound decoding work on separate workers, transferring KV cache between them over fast RDMA connections.
- Idle sessions: Chat, RAG, and agent workflows let users take pauses between turns. Retaining inactive KV caches in GPU memory consumes precious capacity. KV cache offloading dynamically moves idle or reusable KV blocks to CPU memory, disk, or remote storage and reloads them on demand. LMCache enables multi-tier KV cache reuse across serving instances and reports 3×–10× delay savings in multi-round QA and RAG workloads.
- Granularity and reuse: Early reuse works best when many requests share a common prefix (think system prompt, tool definition, or document prefix). TensorRT-LLM's flexible block sizing splits the cache into smaller blocks, allowing for better reuse across partially overlapping prompts. NVIDIA reports up to 7% TTFT improvement on LLAMA70B when reducing the block size from 64 tokens to 8 tokens. However, smaller blocks increase metadata overhead and complexity.
vLLM, Paged Attention, and Continuous Batching
Modern LLM serving stacks optimize KV caching through three approaches: improved memory layout, improved batching, and improved hardware-aware kernels.
vLLM and Paged Attention
vLLM is an open-source inference engine for high-throughput LLM serving. Paged attention is a caching technique introduced by vLLM that treats the KV cache like an operating system's virtual memory.

Rather than reserving a contiguous memory region for each incoming sequence, vLLM partitions its cache into pages (e.g., blocks of 16 tokens). It maintains a block table that maps logical token positions to their physical location in GPU memory. This allows the inference engine to allocate pages on demand, free pages when sequences complete, and largely eliminate memory fragmentation, since any unused space is confined to the partially filled last block of each sequence.
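A block table of this kind can be sketched in a few lines. The class below is a hypothetical illustration of the idea, not vLLM's actual data structures or API: it tracks only block IDs and token counts, with no real tensors, sharing, or eviction.

```python
class PagedKVAllocator:
    """Toy block allocator: maps each sequence's logical blocks to physical blocks."""
    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(num_blocks))   # physical block IDs not in use
        self.block_tables = {}                # seq_id -> list of physical block IDs
        self.lengths = {}                     # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length == len(table) * self.block_size:   # last block is full, or none allocated yet
            if not self.free:
                raise MemoryError("no free KV blocks")
            table.append(self.free.pop())            # allocate one block on demand
        self.lengths[seq_id] = length + 1

    def locate(self, seq_id, position):
        """Return (physical block ID, offset within block) for a logical token position."""
        block = self.block_tables[seq_id][position // self.block_size]
        return block, position % self.block_size

    def release(self, seq_id):
        # Return all of a finished sequence's blocks to the free pool.
        self.free.extend(self.block_tables.pop(seq_id))
        self.lengths.pop(seq_id)


alloc = PagedKVAllocator(num_blocks=8, block_size=16)
for _ in range(20):
    alloc.append_token("a")    # 20 tokens occupy 2 blocks
for _ in range(5):
    alloc.append_token("b")    # 5 tokens occupy 1 block
print(alloc.block_tables, alloc.locate("a", 17))
```

Because blocks are handed out one at a time, a sequence never reserves more than one partially filled block, and the blocks it holds do not need to be adjacent in memory.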
Continuous Batching
Continuous batching is another important optimization used for LLM serving. You may have also heard it referred to as iteration-level scheduling. In this batching technique, the server updates the batch at each decoding step rather than waiting for an entire batch to finish processing.

Static batching, where a fixed batch of requests enters the GPU together, struggles with inputs that generate different numbers of tokens. If some requests in a batch finish early while others continue generating longer sequences, the server must wait for the longest request to finish before starting the next batch. This results in GPU capacity being wasted.
Continuous batching avoids this waste. When a request in a batch finishes, the serving engine can immediately insert a new request into the open slot. This keeps the GPU more fully utilized and reduces idle time.
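The difference between the two scheduling styles can be seen in a small simulation that counts decode steps. It is a simplified sketch: the function names and request lengths are hypothetical, every request costs one slot per step, and prefill and memory limits are ignored.

```python
from collections import deque


def static_batching_steps(lengths, slots):
    """Each batch runs until its longest request finishes."""
    return sum(max(lengths[i:i + slots]) for i in range(0, len(lengths), slots))


def continuous_batching_steps(lengths, slots):
    """Refill a slot as soon as its request finishes (iteration-level scheduling)."""
    waiting = deque(lengths)
    running = []
    steps = 0
    while waiting or running:
        while waiting and len(running) < slots:
            running.append(waiting.popleft())         # admit new requests into open slots
        steps += 1                                    # one decode step for the whole batch
        running = [r - 1 for r in running if r > 1]   # drop requests that just finished
    return steps


lengths = [2, 8, 2, 8, 2, 2, 2, 2]   # tokens to generate per request (made-up workload)
print(static_batching_steps(lengths, slots=2))       # 20
print(continuous_batching_steps(lengths, slots=2))   # 14
```

With two slots, static batching needs 20 steps because short requests sit idle next to long ones, while continuous batching finishes the same 28 tokens of work in 14 steps, the minimum possible for two slots.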
When KV Caching Is Less Effective
Despite its advantages, KV caching does not always produce the same performance gains.
- Short sequences or small models: When sequence length is short, batch size is small, or the model is lightweight, the amount of recomputation avoided may be limited. KV caching can still help, but the gain is often smaller than in long-context generation.
- Memory-constrained environments: For models served on GPUs with limited memory, the KV cache competes with model weights and other active requests for memory. Heavy eviction, recomputation, or offloading can reduce the performance benefit. Quantization and offloading can reduce memory usage, but they do not remove the need for a KV-cache memory budget.
- Highly dynamic prompts: This affects prefix caching rather than per-request KV caching. If early tokens change frequently, prefix reuse across requests is unlikely.
- Models with local or sliding-window attention: Models like Mistral-7B that use local/sliding-window attention patterns use rolling cache buffers. With this type of architecture, storing K/V tensors far outside the attention window may provide little direct benefit (though the older tokens themselves can still influence later layers indirectly).
- Slow storage tiers: KV-cache offloading only provides a benefit when fetching cached data from CPU RAM, disk, or remote storage is faster or cheaper than recomputing it. If the storage tier is slow or saturated, offloading can increase latency instead of reducing it.
FAQ
1. What is KV caching in LLM inference?
KV caching is a technique that stores the key and value tensors produced by transformer attention layers. During decoding, the model reuses these cached tensors instead of recomputing them for each previous token.
2. Why does KV caching reduce inference cost?
It reduces repeated computation during token generation. This can lower latency, increase throughput, and allow the same GPU to serve more requests. However, the cache itself consumes GPU memory, so efficient cache management is still necessary.
3. Why can KV cache become a memory problem?
Each new token adds key and value tensors across multiple layers. As context length and batch size increase, the cache grows linearly. In long-context workloads, the KV cache can become as large as, or larger than, the model weights.
4. What is the difference between KV caching and prompt caching?
KV caching works within a single request during decoding. Prompt caching, also called prefix caching, reuses cached KV states across multiple requests that share the exact same prefix, such as a system prompt or reference document.
5. Which techniques improve KV-cache efficiency?
Important techniques include paged attention, continuous batching, prefix caching, cache quantization, cache eviction, and cache offloading. Frameworks such as vLLM, TensorRT-LLM, and LMCache use these ideas to improve throughput and reduce memory waste.
Conclusion
LLM inference is quickly becoming a memory-engineering problem rather than solely a compute-intensive one. Longer context lengths and larger model sizes mean that cost-effective inference will require more than faster GPUs: it will need smarter memory usage. KV caching is a critical component of this new paradigm. By caching and reusing attention states, it can slash redundant computation and reduce latency. However, it introduces further constraints: caches grow linearly with sequence length and can exceed the size of the model weights. To balance computation and memory, cutting-edge engines combine several techniques. They use paged allocation to prevent memory fragmentation, continuous batching to interleave compute-bound and memory-bound work, and prompt caching to reuse common fixed prefixes. They also apply quantization to shrink the cache footprint, cache eviction to discard tokens with low importance, and offloading to move cache blocks to other tiers of memory.
These innovations enable us to serve long-context models efficiently, opening up new application scenarios and user experiences. In the future, hybrid memory architectures, adaptive caching strategies, KV-cache quantization, cache offloading, sparse attention, sliding-window attention, and hardware–software co-design will continue to push the boundaries of efficient LLM inference. Infrastructure teams building generative AI services should understand KV caching and its trade-offs to build scalable, cost-efficient systems.
DigitalOcean's related resources on Inference as a Service, What is AI Inference, GPU Inference Solutions, Monitoring GPU Utilization, and Finding the Optimal Batch Size explore these topics further and offer practical guidance for deploying LLMs at scale.
References
- LLM Inference Optimization 101
- LLM Inference Series: 4. KV caching, a deeper look
- KV Caching Explained: Optimizing Transformer Inference Efficiency
- KV Cache Optimization: Memory Efficiency for Production LLMs
- Why LLM Inference Is Memory-Bound (Not Compute-Bound)
- Optimizing Inference for Long Context and Large Batch Sizes with NVFP4 KV Cache
- How continuous batching enables 23x throughput in LLM inference while reducing p50 latency
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.