You return a vision-language model to accumulation and put it connected the aforesaid GPU that’s been happily serving your matter model: akin parameter count, the aforesaid serving stack, thing different. The GPU shows 40% utilization, representation bandwidth is sitting astatine half capacity, and the tensor cores are hardly warm. Yet the point is slow: each petition takes longer than it should, and the requests-per-second you tin push done has fallen disconnected compared to the text-only exemplary you were moving recently.
Nothing successful the logs explains it.
No OOM errors, nary runaway processes, nary evident assets fight. Just an costly GPU softly underperforming for nary logic you tin constituent to.
This is usually wherever teams commencement tuning, bump the batch size, alteration the sampling settings, and alteration the quantization config. Some of it helps a little, but nary of it fixes the existent problem, because the existent problem isn’t successful the config: it’s successful the hardware statement you signed without reading.
A vision-language exemplary doesn’t do 1 benignant of work.
It does two, and those 2 kinds of activity want other things from a GPU. Vision encoding is almost axenic computation: millions of matrix multiplies, hardly rubbing memory. Whereas, connection decoding is the reverse: for each token it generates, it drags the exemplary weights and a increasing cache retired of representation while the compute units are mostly idle. Putting some tasks connected 1 GPU, which is what each modular deployment does, and you’ve committed to a imperishable compromise: the encoder ne'er gets the compute density it wants, the decoder ne'er gets the representation bandwidth it wants, and you salary afloat freight for a GPU that’s underused successful 2 different ways astatine once.
That’s the HBM taxation aliases High Bandwidth Memory. And if you’re serving multimodal postulation astatine measurement (high-resolution images, video, multi-image prompts), it’s eating a 3rd aliases much of your conclusion budget. The bully news is that the hole is good understood; it requires power complete which GPU runs which phase, which is precisely the power astir managed conclusion hides from you. We’ll get to really to do it connected infrastructure you tin really rent (DigitalOcean GPU Droplets, specifically) toward the end. First, the mechanics.
A March 2026 paper by Donglin Yu (arXiv:2603.12707) is 1 of the first to put difficult numbers to this, tracing wherever the inefficiency lives, why modular monitoring misses it, and what happens to costs and throughput erstwhile the 2 phases are yet pulled apart. It’s the backbone of what follows.
Key Takeaways
-
Vision-language models (VLMs) do 2 kinds of activity pinch other hardware needs. Vision encoding is compute-heavy and hardly touches memory. Language decoding is the reverse and it sounds ample amounts of representation for each token it generates. Putting some connected 1 GPU permanently underserves some phases.
-
The HBM (High Bandwidth Memory) taxation is the costs of that mismatch. Image tokens participate the KV (Key-Value) cache astatine the commencement of a petition and enactment location done each decode step. Every generated token pays a representation bandwidth costs for those image tokens, and that costs compounds nether load.
-
Standard monitoring won’t show it. No azygous metric spikes. The GPU looks healthy. The taxation shows up successful the contention betwixt resources, not the saturation of one.
-
The correct hole is to trim astatine the modality boundary. Splitting aft the imagination encoder outputs its embedding earlier the connection exemplary starts transfers only ~4.5 MB per petition (for LLaVA-7B) alternatively of the ~350 MB KV cache that stage-level disaggregation would require. That’s a 78× reduction, and the ratio gets amended arsenic models get deeper.
-
This useful complete mean unreality networking. A 4.5 MB embedding crosses a modular 25 Gbps backstage web successful a fewer milliseconds, little than 1% of the encoding time. No NVLink aliases InfiniBand required.
-
The costs savings are existent and measurable. The investigation validated ~40% savings from heterogeneous deployment (compute-dense GPU for encoding, HBM-rich GPU for decoding), pinch nary latency regression.
-
The advantage grows complete time. Deeper models and higher-resolution images some summation the spread betwixt modality disaggregation and the alternatives. This isn’t a impermanent workaround, it reflects a structural spot of encoder-decoder architectures.
Two pipelines, 2 hardware regimes
Before the numbers, it helps to image what really happens wrong a vision-language exemplary erstwhile a petition comes in. There are 2 chopped jobs, successful sequence:
The first is vision encoding. The exemplary doesn’t spot an image the measurement you do. It chops it into fixed-size patches (think of slicing a photograph into a grid of tiles) and runs each tile done layers of mathematics to move it into a database of numbers the connection broadside tin activity with. This is dense, back-to-back matrix multiplication, and the information it needs stays adjacent to the processor. It hardly reaches retired to representation astatine all.

The 2nd is language decoding. Once the image is encoded, the exemplary generates its consequence 1 token astatine a time. For each token, it has to load its full weight group positive everything generated truthful acold (the KV cache) retired of memory, do a mini magnitude of math, emit 1 token, and repeat. It’s little for illustration computing and much for illustration reference a very agelong book 1 page astatine a time.
Two wholly different kinds of work.
T_e2e = T_vision + T_xfer + T_prefill + T_decodeT_vision is encoding the image. T_xfer is moving the encoded output to the connection model. T_prefill is processing the matter prompt. T_decode is generating the consequence token by token, and successful practice, that past shape is wherever astir of the clip and costs live.
What the measurements show: During imagination encoding connected an RTX 4090, the insubstantial records complete 85% utilization of the GPU’s compute (roughly 280 of 330 TFLOPS) while representation bandwidth sits beneath 5%. The GPU is pinned doing math; representation is astir idle. Flip to connection decoding connected an A100 and the image inverts: memory-bandwidth utilization crosses 80% (about 1.6 of 2 TB/s) while compute drops beneath 10%. Almost each the clip goes to shuttling information successful and retired of memory, pinch hardly immoderate arithmetic.
These 2 phases are structurally other successful what they demand. No batch size reconciles them. And yet successful a modular deployment, they stock 1 GPU, truthful during encoding, the representation strategy sits idle connected your dime, and during decoding, the tensor cores beryllium idle connected your dime. The discarded is successful some phases, astatine once, connected the aforesaid card.

What the KV cache really costs
To understand the cost, look astatine the KV cache. That’s wherever imagination tokens are stored, and they devour immoderate of your astir costly resources.
Let america understand this better:
What the KV Cache Is
When a transformer processes a sequence, each attraction furniture computes 3 matrices for each token: Query (Q), Key (K), and Value (V). Attention useful by having each token’s Q “look at” each erstwhile tokens’ K/V pairs to determine what to be to.
The costly part: if you’re generating token by token, you’d recompute K and V for every erstwhile token astatine every caller step. That’s quadratic successful costs arsenic series magnitude grows.
The KV cache is the hole - you compute K and V erstwhile per token and shop them successful memory. When generating the adjacent token, you only compute K and V for that caller token and append it to the cache. Previous tokens’ K/V matrices are conscionable publication from memory.
What gets stored
For each layer, for each token, you shop 2 matrices (K and V). So the cache size scales with:
- sequence magnitude (more tokens = much cache)
- number of layers (deeper exemplary = much cache per token)
- model dimension/number of heads (wider exemplary = larger K/V matrices)
- batch size (each series successful a batch needs its ain cache)
- precision (fp16 vs fp8 halves aliases quarters the size)
A unsmooth look for 1 sequence:
KV cache size = 2 × num_layers × seq_len × num_heads × head_dim × bytes_per_elementFor Llama 3 70B (80 layers, 8 KV heads, head_dim 128, fp16): a azygous 128K-token series uses ~80GB - much than the exemplary weights themselves.
The halfway tension
Model weights are fixed - load once, reuse forever. The KV cache is move and per-request. This is why:
- Long contexts are memory-hungry (cache grows pinch series length)
- High concurrency is difficult (each user’s cache competes for the aforesaid GPU VRAM)
- Evicting/recomputing cache is simply a existent latency tradeoff (prefix caching, paged attention, etc. each reside this)
With that foundation, “what the KV cache really costs” becomes a communicative astir VRAM pressure, throughput limits, and the tradeoffs operators make to service long-context models astatine scale.
The cache exists because attraction needs to look back. To make token N, the exemplary attends complete the keys and values of each anterior token, astatine each layer. Recomputing those each measurement would beryllium quadratic, truthful you cache them. For a multi-head attraction model, the size is:
D_KV = 2 · L · n_kv · d_h · s_ctx · bTwo tensors (keys and values), times layers L, times KV heads n_kv, times caput magnitude d_h, times discourse magnitude s_ctx, times batch size b, times bytes per element.
Plug successful a 7B MHA exemplary (L=32, n_kv=32, d_h=128) astatine FP16, pinch a humble request: 1 336×336 image done ViT-L/14 yields 576 imagination tokens, positive 128 matter tokens, for a discourse of 704. That’s astir 350 MB of KV cache for a azygous request. Batch 8 for prefill and you’re astatine astir 2.8 GB earlier generating a azygous output token.
Now necktie that to the image. Those 576 imagination tokens aren’t a transient costs you salary during encoding and release. They’re written into the KV cache astatine prefill and they enactment location for the full decode loop. Every output token, attraction sweeps backmost complete each 576 of them. The image is conceptually “gone” (you’ve distilled it into embeddings), but its footprint successful representation persists, and the bandwidth costs of re-reading it is paid connected each measurement of generation.
This is wherever arithmetic strength explains the damage. Arithmetic strength is the ratio of compute performed to bytes moved. Decode has almost none: you move a immense magnitude of information to do a sliver of math, the meaning of memory-bound. Stuffing the KV cache pinch hundreds of image tokens inflates the bytes-moved broadside of that ratio without adding immoderate compute to warrant it. You’re loading an already memory-starved shape pinch much representation traffic, for tokens that adhd discourse but nary caller arithmetic. The decode loop slows successful nonstop proportionality to really galore image tokens you crammed in, and it pays that punishment erstwhile per generated token, each token. That’s the throughput illness from the intro, expressed successful bytes.
Why existing systems don’t lick this
The earthy guidance is: fine, disaggregate. Split the pipeline crossed devices truthful each shape runs wherever it’s happy. The insubstantial surveys why the modular cuts autumn short.
Stage-level disaggregation (EPD, Cauchy-style systems) - splits prefill from decode onto abstracted instances. Reasonable successful spirit: prefill is compute-heavy, decode is memory-heavy. The drawback is the trim point: aft prefill, you person to migrate the full KV cache to the decode instance, a transportation that scales arsenic O(L · s_ctx), hundreds of megabytes to gigabytes per request. Moving that without wrecking latency demands NVLink aliases InfiniBand, which shuts user GPUs connected mean networking retired entirely. You haven’t removed the bottleneck; you’ve relocated it onto the interconnect and locked yourself into premium fabric.
Intra-node co-location (SpaceServe, UnifiedServe-style systems) - keeps everything connected 1 GPU but multiplexes the 2 workloads spatially, truthful the idle resources of 1 shape get utilized by the other. It genuinely improves utilization, but it doesn’t alteration the cross-device connection structure, can’t propulsion successful cheaper hardware for the compute-bound work, and doesn’t relieve the basal representation unit astatine scale. You’re still serving decode retired of 1 costly representation budget.
Homogeneous serving (the vLLM baseline) is the workhorse - PagedAttention, continuous batching, observant representation management, each excellent, each worthy using. But nary of it addresses the hardware mismatch. Both phases still tally connected the aforesaid datacenter GPU, and you still salary for HBM bandwidth moreover erstwhile the progressive workload is encoder-heavy and bandwidth-agnostic.
The paper’s test of why these extremity short is the crisp part: stage-level partitioning bakes successful a language-only assumption. In a text-only LLM, the KV cache is the only meaningful intermediate state, truthful cutting astir it is natural. Multimodal conclusion breaks that assumption. There’s now a second, very different intermediate (the imagination embedding) pinch radically amended transportation economics than the KV cache. The anterior creation was solving the correct problem astatine the incorrect boundary.
The Ideal Point to Separate Vision and Text Processing
Here’s the load-bearing idea, and it’s cleanable erstwhile you spot it.
The imagination encoder’s output is simply a azygous embedding tensor of size O(N_v · d): ocular token count times hidden dimension. That’s it. It does not turn pinch the connection model’s extent L. The KV cache does, because each furniture contributes its ain keys and values. So if you partition astatine the modality bound (between the imagination encoder’s output and the connection model’s input), the only point that crosses the ligament is that compact embedding. You ne'er migrate a KV cache astatine all, because the trim happens before the cache exists.
The numbers aren’t subtle. For LLaVA-7B (576 imagination tokens, d=4096) astatine FP16:
Embedding to transportation = 576 × 4096 × 2 bytes ≈ 4.5 MB KV cache (the stage-level alternative) ≈ 350 MBA 78x simplification successful transferred bytes per request, purely from moving the trim point. The insubstantial generalizes this (Theorem 1) as:
R = (2 · L · n_kv · d_h · s_ctx) / (N_v · d)which nether MHA collapses to a strikingly intuitive form:
R_MHA = 2L · (1 + s_text / N_v)Two consequences autumn out. The advantage scales pinch extent L: deeper models make stage-level disaggregation much costly (a bigger cache to move) while the embedding stays the aforesaid size. And the ratio is independent of hidden magnitude d: it cancels. So arsenic models get deeper, modality-level disaggregation gets comparatively much attractive, automatically.
Across architectures (the smaller fig reflects grouped-query attention, which shrinks n_kv and truthful the cache):
| LLaVA-7B | 78× |
| LLaVA-13B | 98× |
| LLaVA-34B | 147× / 21× |
| Qwen2.5-VL-7B | 64× / 12× |
| Qwen-VL-72B | 196× / 24× |
Even astatine the blimpish GQA end, you’re moving an bid of magnitude little information than a KV-cache migration would. The insubstantial headlines the span arsenic 12×–196× crossed existent MLLMs: today’s GQA-heavy architectures cluster astatine the debased extremity (12–24×), while MHA models and the deepest networks scope the top. The R_MHA figures for the GQA models are counterfactuals (what they would prevention nether afloat multi-head attention), included to show really the advantage scales pinch depth.
You don’t request exotic interconnect**.** A 4.5 MB embedding complete PCIe Gen4 x16 (~25 GB/s) clears successful astir 0.18 ms, against vision-encoding clip the insubstantial measures astatine ~6.8 s for a batch of 128 images. The reported ratio is T_xfer / T_vision < 0.003 crossed each architecture tested: the transportation is simply a rounding correction against the activity itself. No NVLink, nary InfiniBand. And arsenic we’ll see, the embedding is mini capable that it survives not conscionable PCIe but an mean unreality backstage network, which is what opens this up connected existent infrastructure.
The costs argument: why this is astir dollars
The logic to attraction is the invoice.
The insubstantial lays retired a hardware asymmetry that practitioners consciousness but seldom value out. An RTX 4090 (~$3k, ~330 TFLOPS FP16) astir matches an A100 (~$16k, ~312 TFLOPS) connected earthy compute, astir 3.3× amended FLOPs per dollar. Where the A100 pulls up is memory: ~2 TB/s of HBM versus the 4090’s ~1 TB/s, and acold much of it. So the 2 cards aren’t “better” and “worse.” They’re specialists. One is simply a compute bargain that’s memory-poor; the different is simply a representation powerhouse you overpay for if each you request is FLOPs.
Map that onto the 2 phases and the duty writes itself. Compute-bound, bandwidth-agnostic imagination encoding belongs connected the cheaper compute-dense card. Memory-bandwidth-bound, HBM-critical connection decoding belongs connected the datacenter card. Stop moving the encoder connected hardware you’re renting for its bandwidth, and extremity moving decode connected hardware that can’t provender it.
The insubstantial builds a closed-form costs exemplary astir this and validates it: it predicted 31.4% savings from heterogeneous deployment and measured 40.6%. As a fund decision, a ~$38k heterogeneous cluster delivered amended Tokens-per-dollar than a ~$64k homogeneous one: astir 37% amended economics connected identical workloads, pinch nary latency regression. That’s not a single-digit tuning win. It’s a different costs structure, disposable because each shape yet runs connected the silicon it wants.
Two honorable caveats earlier you ligament this up. First, those nonstop dollar figures are the paper’s RTX 4090 / A100 pairing. The building of the savings (cheap compute paper for encode, HBM paper for decode) is what transfers, the precise percent will dangle connected your model, your postulation mix, and the circumstantial cards you choose. Second, the summation is existent only erstwhile your workload is genuinely phase-separable astatine volume, which is the assemblage mobility we’ll travel backmost to.
What a phase-aware strategy looks like
To beryllium the mentation extracurricular a spreadsheet, the insubstantial builds a serving system, HeteroServe, astir 4 ideas worthy knowing moreover if you ne'er tally that nonstop stack.
Modality-level partitioning- Compute-dense GPUs tally the imagination encoder; HBM-heavy GPUs tally the connection model. Only embeddings transverse the nexus betwixt them: the trim from the erstwhile section, made operational.
An embedding-only transportation protocol- Real models don’t emit a fixed token count: Qwen2.5-VL, for instance, produces a adaptable N_v depending connected image resolution, up to ~2048 tokens. The strategy uses move buffer allocation to sorb that. The economics still clasp astatine the apical of the range: 2048 tokens is ~14 MB, still astir 6× smaller than the corresponding KV cache. “Transfer is negligible” is robust, not a best-case artifact.
Cross-type activity stealing- The elegant spot for messy existent traffic. A axenic modality divided has an evident nonaccomplishment mode: what do your encoder GPUs do erstwhile a burst of text-only requests arrives and there’s thing to encode? They’d beryllium idle. So erstwhile encoder GPUs person nary imagination work, they bargain decode activity from the connection pool, recovering utilization that would different evaporate, without the role-switching complexity that makes prefill/decode-swapping systems brittle.
Engine optimizations, kept abstracted connected purpose- CUDA Graph-accelerated decoding, packed prefill, lazy KV allocation: the accustomed high-leverage tricks. The insubstantial deliberately isolates these from the architectural publication truthful the 2 don’t get conflated: connected identical 4×A100 hardware, these optimizations alone assistance throughput up to 54% complete a vLLM v0.3.0 baseline. That’s a abstracted axis from the heterogeneity story. The costs triumph from modality disaggregation stacks connected apical of bully motor hygiene; it doesn’t dangle connected it.
Mapping this onto DigitalOcean
Everything supra is hardware-agnostic theory. Here’s really it lands connected infrastructure you tin really rent, and wherever the paper’s same-box presumption has to beryllium adapted.
The paper’s elegant hit is “two cards, 1 PCIe link, successful 1 box.” On a nationalist unreality you mostly don’t constitute a heterogeneous multi-card container yourself; DigitalOcean GPU Droplets are single-GPU-class virtual machines you rotation up independently. So the applicable realization isn’t same-box PCIe: it’s 2 Droplets connected the aforesaid backstage network, 1 tier for encode and 1 for decode, pinch embeddings crossing the ligament betwixt them. The bully news is that the embedding is mini capable that this works.
The hardware tiers representation cleanly**.** DigitalOcean’s compute-dense, cost-efficient cards are the Ada-generation parts: NVIDIA L40S and RTX 6000 Ada (48 GB each), and RTX 4000 Ada (20 GB). These are the earthy location for the compute-bound imagination encoder, DigitalOcean positions the L40S astatine up to ~1.7× an A100 for AI workloads, and its humble representation is simply a non-issue present because encoding hardly touches bandwidth. The HBM-heavy decode tier is the datacenter lineup: H100 (80 GB), H200 (141 GB, marketed astatine up to 2× H100 conclusion connected memory-bound work), and the AMD Instinct MI300X / MI325X / MI350X parts pinch very large, high-bandwidth memory. In different words: L40S aliases RTX 6000 Ada for the encoder pool, H100 aliases H200 for the decoder pool. DigitalOcean’s ain vLLM GPU-sizing guide reaches the aforesaid divided from the workload side: high-bandwidth cards for memory-bound decode, high-compute cards for everything else.
Does the transportation past a web hop alternatively of PCIe? This is the load-bearing question, truthful let’s do the arithmetic. Every GPU Droplet comes pinch 25 Gbps backstage networking, astir 3 GB/s, telephone it ~2–2.5 GB/s sustained aft real-world overhead. That’s astir 8× slower per byte than the paper’s in-box PCIe. But the embedding is tiny:
- A 4.5 MB LLaVA-7B embedding crosses successful a low-single-digit number of milliseconds.
- A 14 MB Qwen2.5-VL embedding (2048 tokens) crosses successful astir 5–7 ms.
- Even a afloat 128-image batch (~576 MB) crosses successful a fewer 100 milliseconds; against seconds of encode time, that’s connected the bid of 3–4% overhead.
So the paper’s sub-0.3% PCIe fig becomes low-single-digit-percent complete DigitalOcean’s backstage network. The full logic the modality bound is the correct trim is that it makes the transportation mini capable to tolerate mean networking, which is precisely why this is feasible connected a unreality wherever you don’t get NVLink betwixt abstracted instances. (One applicable constraint: support some Droplet pools successful the aforesaid datacenter region truthful they stock that backstage network; GPU Droplets are disposable successful respective regions including NYC, Toronto, and Atlanta. One AMD-specific wrinkle worthy readying around: the Instinct MI350X launched successful Atlanta (ATL1) only, truthful if you want an AMD decoder excavation today, some the encoder and decoder pools request to unrecorded successful ATL1 to support that backstage link.)
The logic you tin build this connected GPU Droplets astatine each is that they springiness you GPU-level control: you take the paper per pool, you spot the instances, you ain the orchestration betwixt them. You tin ligament the 2 pools together complete the private VPC network, beforehand them pinch your ain scheduler, and negociate the full point done the API, Terraform, aliases Kubernetes. If you’d alternatively not hand-tune the decode tier, DigitalOcean’s Inference Optimized Image ships a pre-tuned vLLM stack (speculative decoding, FP8, FlashAttention-3, paged attention) crossed the H100, L40S, and RTX 6000 Ada tiers, and reports up to a 143% throughput summation complete an untuned baseline: that’s the “good motor hygiene” furniture the insubstantial keeps separate, disposable retired of the container connected the aforesaid cards you’d usage for the decode pool. For teams that want dedicated capacity alternatively than per-second instances, Bare Metal GPUs springiness you the full machine.
When serverless is still the correct call
This full statement cuts against afloat managed, serverless inference, and that deserves an honorable appraisal alternatively than a dismissal. The logic serverless can’t seizure this redeeming is structural: the full worth proposition of serverless is that you don’t deliberation astir the hardware. You can’t inquire a serverless endpoint to tally your encoder connected a compute-dense paper and your decoder connected a bandwidth-dense one, because making those placement decisions for you is the constituent of the product.
For a ample fraction of teams, that’s the correct trade. If you’re prototyping, if your multimodal measurement is light, aliases if you don’t want to ain a scheduler and 2 GPU pools, the operational simplicity of a serverless endpoint from DigitalOcean is worthy acold much than a costs optimization you’d walk engineering clip chasing. The modality-split architecture earns its support erstwhile 3 things are existent astatine once: you service multimodal postulation astatine capable measurement that a ~30–40% inference-cost delta is existent money; your workloads are image- aliases video-heavy, truthful KV-cache ostentation bites hard; and you person the appetite to ain your serving topology alternatively than rent it arsenic a achromatic box. Below that bar, enactment serverless. Above it, the power that GPU Droplets springiness you is the point that unlocks the saving, and the 2 products aren’t really competitors truthful overmuch arsenic the correct answers to different scales.
How to measurement the taxation successful your ain stack
You don’t person to return immoderate of this connected faith. The HBM taxation is straight observable, and quantifying it successful your ain deployment is the correct first move earlier you alteration immoderate infrastructure.
Run the isolation experiment**.** Hold the LLM backbone fixed and alteration only the ocular input: (a) matter only, (b) 1 low-resolution image, © 1 high-resolution image, (d) aggregate images. Plot inter-token latency against image-token count. The slope of that statement is the tax: the per-image-token costs your decode loop pays connected each step.
What to instrument:
- HBM utilization surgery retired by shape (Nsight Systems). Confirm the split: near-zero bandwidth during encode, near-saturation during decode.
- KV cache size per petition astatine varying image-token counts. Watch it balloon arsenic solution climbs.
- Throughput vs. image-token count astatine a fixed batch size. This is your header curve.
- ITL degradation nether concurrency**.** Where the compounding shows up, and the metric closest to what users feel.
Tools that already springiness you astir of this: vLLM’s Prometheus metrics expose KV-cache utilization and request-queue depth; NVIDIA Nsight Compute gives per-kernel bandwidth truthful you tin property postulation to phases; a civilization harness pinch controlled batch sizes closes the gap. DigitalOcean’s vLLM sizing guide walks done reference TTFT, ITL, and KV-cache unit if you want a checklist.
What the information should reveal: nether concurrent load, throughput degrades faster than linearly pinch image-token count. That’s the signature. It’s super-linear because HBM bandwidth is shared and contended, and the decode loop pays the inflated KV-cache costs connected each token for each petition successful the batch astatine once. Linear would mean each image costs a fixed amount; super-linear intends the images are fighting each different for bandwidth: the tax, observed successful the wild.
This generalizes beyond vision
The logic this matters past LLaVA and Qwen-VL is that the halfway advantage is simply a spot of the architecture, not of imagination specifically.
The asymmetry is simple: an encoder produces output that’s O(1) per furniture (a fixed embedding, sloppy of decoder depth) while a decoder accumulates O(L) of KV state. Any pairing pinch that style benefits from the aforesaid analysis. Audio encoders for illustration Whisper feeding a connection model: aforesaid structure, aforesaid cut. Video encoders, which nutrient tremendous token counts and would make a KV-cache migration brutal: aforesaid cut, bigger payoff. Multimodal models pinch respective encoder branches: each branch is different O(1)-per-layer output you tin partition disconnected cheaply.
And the inclination lines compound the advantage complete time. Models support getting deeper, which raises L and grows the transportation ratio successful modality-disaggregation’s favor. Compute density connected cheaper cards keeps climbing faster than interconnect bandwidth, widening the spread betwixt inexpensive FLOPs and costly bytes. The insubstantial isn’t describing a quirk of 1 exemplary generation: it’s describing a structural spot of encoder-decoder multimodal systems that gets much existent arsenic some hardware and models advance.
The taxation was ne'er a package bug
Come backmost to the anomaly we opened with: the exemplary that crawled while each dashboard insisted the GPU was fine. You tin explicate it precisely now. Image tokens participate the KV cache astatine prefill and enactment location done each decode step. HBM bandwidth, the azygous scarcest assets successful connection decoding, gets divided betwixt the exemplary weights and an ever-growing cache now carrying hundreds of image tokens that lend representation postulation but nary arithmetic. The GPU looked patient because nary azygous antagonistic was pegged. The taxation is paid successful the contention between resources, not the saturation of immoderate 1 of them.
The hole is simply a mobility astir wherever successful the conclusion chart you tie the line. The investigation says the modality bound (the seam betwixt imagination encoder and connection model) is the provably optimal spot to tie it. Cut location and you trim cross-device transportation by 12× to 196× depending connected architecture and attraction strategy (the GQA-heavy models that predominate today’s deployments astatine the little end, deeper and MHA models astatine the upper), you make mean networking capable for the transfer, and you unfastened up a costs building (on the bid of 30–40% cheaper successful the root research) that homogeneous deployments structurally cannot reach. On DigitalOcean, that intends an Ada-class encoder pool, an HBM-class decoder pool, and 25 Gbps of backstage web betwixt them.
The GPU was ne'er the problem. Asking 1 GPU to beryllium 2 different machines was.
FAQ
What is the HBM tax?
HBM stands for High Bandwidth Memory the accelerated representation connected a GPU. The “HBM tax” refers to the hidden capacity and costs punishment you salary erstwhile a vision-language exemplary runs some image encoding and matter procreation connected the aforesaid GPU. The 2 phases request different things from the hardware, truthful neither gets what it needs.
My GPU metrics look normal. How do I cognize if I’m paying this tax?
Run the aforesaid exemplary pinch different inputs matter only, 1 mini image, 1 ample image, aggregate images and measurement really agelong each generated token takes arsenic image size goes up. If the per-token latency grows faster than linearly pinch image token count, you’re paying the tax
What is simply a KV cache and why does it matter here?
KV stands for Key-Value. When a connection exemplary generates text, it stores intermediate results for each token it has already processed truthful it doesn’t person to recompute them. This stored information is the KV cache. Image tokens get added to this cache and enactment location for the full procreation truthful each output token has to publication each of them from memory, again and again.
Why can’t I hole this by tuning batch size aliases quantization?
Those settings thief astatine the margins, but they don’t reside the guidelines cause. The problem is that imagination encoding and connection decoding want fundamentally different hardware. No config alteration reconciles that connected a azygous GPU.
What does modality-level disaggregation mean?
It intends moving the imagination encoder connected 1 GPU and the connection exemplary connected a abstracted GPU. The only point transferred betwixt them is the encoder’s output, embedding a small, fixed-size tensor alternatively than the afloat KV cache. This makes the transportation accelerated capable to activity complete a modular unreality backstage network.
Do I request typical hardware aliases interconnects for illustration NVLink?
No. The embedding transferred betwixt GPUs is mini capable (a fewer megabytes) that a modular 25 Gbps backstage web handles it pinch negligible overhead. This is what makes the attack applicable connected modular unreality infrastructure.
When does this really make consciousness to implement?
When 3 things are true: you’re serving capable multimodal postulation that a 30–40% costs simplification is meaningful, your requests are image- aliases video-heavy (not occasional), and you’re consenting to negociate your ain serving setup. If you’re early-stage aliases low-volume, a managed serverless endpoint is simpler and apt the amended trade-off.
Does this use to models beyond vision-language models?
Yes. Any architecture wherever a fixed-output encoder feeds a depth-scaling decoder has the aforesaid structure. Audio models, video models, and multi-encoder multimodal systems each use from the aforesaid analysis.
Sources
- Cost-Efficient Multimodal LLM Inference via Cross-Tier GPU Heterogeneity
- Efficient Inference for Large Vision-Language Models: Bottlenecks, Techniques, and Prospects
- How KV Caching Slashes LLM Inference Costs astatine Scale
- How to Choose the Right GPU for vLLM Inference
- Long-Context Inference astatine Scale: The Hidden Infrastructure Cost
This activity is licensed nether a Creative Commons Attribution-NonCommercial- ShareAlike 4.0 International License.
English (US) ·
Indonesian (ID) ·