Speculative Decoding on vLLM: A Configuration and Decision Framework

Jul 02, 2026 03:00 PM

You add --speculative-model to your vLLM config, run a benchmark, and see about 2× token throughput. You ship it. Three weeks later, your 99th percentile latency (P99, the slowest 1% of requests) has climbed, a subset of requests are hitting OOM errors, and your on-call engineer is staring at metrics that make no sense: the model is technically “faster,” but users are complaining the API feels slower.

Speculative decoding is a conditional optimization - it delivers real gains on the right workloads and quietly degrades everything else. Most teams find out which category they’re in the hard way.

This article is an operational framework for making the decision: which draft model to pick, what it costs your memory budget, how it interacts with the scheduler under real concurrency, and - most importantly - how to measure whether it’s actually helping in your environment.

A note on the numbers in this article: the performance figures we reference come from the vLLM team’s own published benchmarks (Llama-3-70B, 4× H100 SXM5, October 2024). This article’s core argument is that you should measure under your own production conditions. The vLLM data is the best published reference available; treat it as a calibration baseline, not a prediction for your deployment. The measurement section at the end tells you exactly what to run to get your own numbers.

TL;DR

Speculative decoding runs a small draft model to propose tokens, then verifies all of them in one target model pass - faster token generation without changing output quality.
It helps at low query rates on structured, low-temperature workloads: up to 2.8× speedup on summarization, 1.5× on chat (vLLM team, 4× H100).
It degrades at high query rates - the same vLLM benchmarks show 1.4–1.8× slowdowns when the GPU is saturated.
Pick a draft model at a 1:8–1:12 size ratio from the same model family. The draft model’s VRAM cost comes straight out of your KV cache budget.
Monitor spec_decode_draft_acceptance_rate in production. Below ~0.5, you’re adding latency, not removing it - turn it off.

Choosing the Right Draft Model for Speculative Decoding

The first decision - which draft model to use - is also the most important, and it’s rarely treated that way. The intuition is simple: use a smaller model from the same family, pick a reasonable --num-speculative-tokens, and let the math work out. Whether that actually works out depends almost entirely on one factor: how large the draft model is relative to the target model.

Why the ratio is the primary lever

Normally, generating k tokens requires running the large target model k times - one full forward pass per token. Speculative decoding short-circuits this: the small draft model generates k candidate tokens quickly, then the target model checks all k candidates in a single forward pass, because it can evaluate every position in parallel. One verification pass is much faster than k generation passes.

    Standard generation: 5 tokens = 5 sequential target model passes

      [70B] → t₁ → [70B] → t₂ → [70B] → t₃ → [70B] → t₄ → [70B] → t₅
      (pass 1)     (pass 2)     (pass 3)     (pass 4)     (pass 5)

    Speculative decoding: 5 proposed tokens = 1 draft pass + 1 verify pass

      [8B Draft]   → t₁ t₂ t₃ t₄ t₅        (one fast pass, all proposed)
           |
           ▼
      [70B Target] → ✓t₁ ✓t₂ ✓t₃ ✗t₄ -     (one parallel verify pass)
                     └─────────────┘
                     3 tokens accepted, 2 rejected
                     target model ran once instead of 5 times

That holds only when enough draft tokens are accepted. Acceptance rate is the key variable. The size ratio matters because a larger draft model is better at predicting what the target model would have said. A 1B model simply doesn’t have enough capacity to reliably mimic a 70B model’s outputs, so more of its candidates get rejected. A larger draft model predicts more accurately, which means more accepted tokens per step.

A 1B draft model against a 70B target (~1:70 ratio) generates tokens fast and cheaply, but it guesses the wrong token too often, so the target model rejects most of what the draft proposed - and you’ve paid the cost of running two models without getting enough accepted tokens to make it worthwhile. A 13B draft against the same 70B target (~1:5 ratio) predicts well but costs 13B parameters’ worth of VRAM and compute, which shifts the crossover point where you’d have been better off just running the target model.
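To make the link between acceptance and tokens per step concrete, here is a minimal sketch. It assumes every draft token is accepted independently with the same probability alpha, which is a simplification (real acceptance varies by position and prompt). Under that assumption the expected number of tokens appended per target pass is (1 - alpha^(k+1)) / (1 - alpha). The function name and the sample alpha values are illustrative, not measurements.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens appended per target verification pass.

    Assumes each draft token is accepted independently with probability
    `alpha` (an idealization). The count includes the one token the target
    model always contributes: a correction at the first rejection, or a
    bonus token when every draft token is accepted.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if alpha == 1.0:
        return float(k + 1)  # all k draft tokens accepted, plus the bonus token
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^k
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


# Hypothetical per-token acceptance probabilities, k = 5 draft tokens
for alpha in (0.3, 0.5, 0.8):
    tokens = expected_tokens_per_step(alpha, k=5)
    print(f"alpha={alpha}: {tokens:.2f} tokens per target pass")
```

Note that alpha here is a per-token probability, which is not the same quantity as the aggregate accepted/proposed ratio a server reports, so treat the output as intuition for the shape of the curve rather than a prediction.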

The mechanics of the algorithm point to a consistent sweet spot: a 1:8 to 1:12 size ratio using same-family, same-training-distribution models. The table below shows common pairings and the ratio math - treat these as a starting framework for deciding what to test, not as benchmarks to cite. The only acceptance rate that matters for your deployment is the one you measure on your own hardware, with your actual prompt distribution.

Draft Model	Target Model	Ratio	Why it matters
Llama-3.1-8B	Llama-3.1-70B	~1:9	Sweet spot - same family, good capacity match
Qwen2.5-7B	Qwen2.5-72B	~1:10	Same-family pairing, similar ratio
Llama-3.2-1B	Llama-3.1-70B	~1:70	Too small - draft diverges from target too often
Llama-3.1-8B	Llama-3.1-405B	~1:50	Same draft, but much weaker relative to the larger target

These pairings illustrate how the size ratio affects draft model quality. Acceptance rates vary significantly with hardware, quantization, and prompt distribution - measure on your own setup before drawing conclusions.

The 1B/70B pairing looks cheap but rarely pays off - the target model rejects too many of the draft’s tokens, so you end up paying for two models and getting the throughput of one.
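The ratio column is simple arithmetic on nominal parameter counts. The sketch below reproduces it and flags which pairings fall inside the 1:8 to 1:12 band; the helper names are hypothetical, and the parameter counts are the rounded figures from the model names rather than exact values.

```python
def size_ratio(draft_params_b: float, target_params_b: float) -> float:
    """Return how many times larger the target is than the draft."""
    if draft_params_b <= 0 or target_params_b <= 0:
        raise ValueError("parameter counts must be positive")
    return target_params_b / draft_params_b


def in_sweet_spot(ratio: float, low: float = 8.0, high: float = 12.0) -> bool:
    """True when the ratio falls inside the 1:8 to 1:12 band from the text."""
    return low <= ratio <= high


# Nominal parameter counts in billions, taken from the model names.
pairings = [
    ("Llama-3.1-8B", 8, "Llama-3.1-70B", 70),
    ("Qwen2.5-7B", 7, "Qwen2.5-72B", 72),
    ("Llama-3.2-1B", 1, "Llama-3.1-70B", 70),
    ("Llama-3.1-8B", 8, "Llama-3.1-405B", 405),
]
for draft, draft_b, target, target_b in pairings:
    ratio = size_ratio(draft_b, target_b)
    print(f"{draft} -> {target}: ~1:{ratio:.1f} (sweet spot: {in_sweet_spot(ratio)})")
```

A pairing that passes this check is only a candidate worth testing; the acceptance rate you measure still decides.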

One practical note: the vLLM team’s own published benchmarks used a 0.5B draft model (turboderp/Qwama-0.5B-Instruct) against Llama-3-70B - a 1:140 ratio, well outside the 1:8–1:12 sweet spot. They still saw 1.5× speedup at low query rates. At high query rates, they saw 1.4× slowdown. This reinforces the core point: query rate matters more than ratio. Even a suboptimal draft model can look like a win on an isolated benchmark. The failure mode occurs under production load.

Temperature destroys your benchmark numbers

Temperature is a setting that controls how predictable or random a model’s output is. At temperature=0, the model always picks the single most likely next token - this is called greedy decoding, and the output is fully deterministic. As you increase temperature, the model starts choosing from a wider range of possible tokens, making the output more varied and creative. At temperature=1.0 or above, the output can feel much more open-ended and less predictable.

This matters for speculative decoding because the draft model’s job is to guess what the target model will say next. At temperature=0, the target model is highly predictable - it always picks the top token - so the draft model guesses correctly most of the time. But at higher temperatures, the target model picks more surprising tokens, and the draft model’s guesses start missing more often. When the draft model misses, those candidate tokens get rejected and the work is wasted.

The problem is that most benchmarks are run at temperature=0, which is where speculative decoding looks best. In production, the temperature you actually use depends on the task - code generation and structured output tend to use low temperatures, while chat and creative writing typically use higher ones. If your actual workload runs at temperature=0.7 or above, your benchmark numbers will be significantly more optimistic than what you see in production.

The simplest way to see this directly is to run the same prompt at different temperatures and watch spec_decode_draft_acceptance_rate shift in real time:

    import requests

    VLLM_URL = "http://localhost:8000/v1/chat/completions"
    METRICS_URL = "http://localhost:8000/metrics"
    PROMPT = "Write a short story about a robot learning to paint."


    def get_acceptance_rate():
        """Return the current draft acceptance rate from /metrics, or None."""
        text = requests.get(METRICS_URL, timeout=10).text
        for line in text.split("\n"):
            if "spec_decode_draft_acceptance_rate" in line and not line.startswith("#"):
                return float(line.split()[-1])
        return None


    for temperature in [0.0, 0.4, 0.8, 1.0]:
        # Send 20 requests at this temperature
        for _ in range(20):
            requests.post(VLLM_URL, json={
                "model": "meta-llama/Llama-3.1-70B-Instruct",
                "messages": [{"role": "user", "content": PROMPT}],
                "temperature": temperature,
                "max_tokens": 200,
            }, timeout=120)
        # The metric is server-wide, so it also reflects any earlier traffic.
        rate = get_acceptance_rate()
        if rate is None:
            print(f"temperature={temperature} acceptance_rate=unavailable")
        else:
            print(f"temperature={temperature} acceptance_rate={rate:.2f}")

Expected output shape (your numbers will vary by model pair and prompt):

    temperature=0.0 acceptance_rate=0.81
    temperature=0.4 acceptance_rate=0.71
    temperature=0.8 acceptance_rate=0.52
    temperature=1.0 acceptance_rate=0.38

Find the line that matches your production temperature. If the acceptance rate is below 0.5, stop - speculative decoding is net-negative on this workload, and you should disable it. If it’s between 0.5 and 0.65, you’re near the break-even line; measure P99 latency under real load before deciding. Above 0.65, you’re likely seeing a genuine benefit - confirm with a baseline comparison and ship it.
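The three bands above are easy to encode as a small helper, sketched here. The function name and the returned labels are hypothetical; the 0.5 and 0.65 cut-offs are the ones from the paragraph above, and the sample rates are the illustrative values from the output shown earlier, not measurements.

```python
def classify_acceptance(rate: float) -> str:
    """Map an acceptance rate to the three bands described above."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be in [0, 1]")
    if rate < 0.5:
        return "disable"         # likely net-negative on this workload
    if rate <= 0.65:
        return "measure-p99"     # near break-even; load-test before deciding
    return "likely-benefit"      # confirm against a baseline, then ship


# Illustrative (temperature, acceptance rate) pairs from the sample output
samples = [(0.0, 0.81), (0.4, 0.71), (0.8, 0.52), (1.0, 0.38)]
for temperature, rate in samples:
    print(f"temperature={temperature} rate={rate:.2f} -> {classify_acceptance(rate)}")
```

The cut-offs are starting points; a deployment with tight tail-latency SLOs may reasonably choose stricter ones.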

Higher temperature → flatter probability distribution → draft model’s concentrated guesses miss more often → more rejections. Workload type and query rate move the result just as much, as the published figures below show. The pattern is more important than the specific multipliers:

Workload	Dataset	QPS	Measured result	Method
Summarization	CNN/DailyMail	Low (QPS=1)	2.8× speedup	N-gram speculative decoding
Chat / general	ShareGPT	Low (QPS=1)	1.5× speedup	Draft model (0.5B → 70B)
Chat / general	ShareGPT	High	1.4× slowdown	Draft model (0.5B → 70B)
Summarization	CNN/DailyMail	High	1.8× slowdown	N-gram speculative decoding

Source: vLLM Team, “How Speculative Decoding Boosts vLLM Performance by up to 2.8x”, October 2024. Benchmarked on Llama-3-70B with 4× H100.

The last two rows are not edge cases. At production query rates - where your GPU is compute-saturated - speculative decoding adds overhead instead of removing it. The extra compute required to propose and verify tokens compounds under load. The benefit evaporates, and the cost remains.

The intuition for the crossover point: an 8B draft model costs roughly 1/9th the compute of a 70B target model. If fewer than about half your draft tokens are accepted, the compute you spent on the draft model plus the wasted verification steps outweighs the tokens you gained. At high QPS, you’re already compute-saturated - there is no idle capacity to absorb that overhead.

The practical implication is straightforward: don’t assume speculative decoding is helping just because your benchmark looked good. If your users are running creative writing, open-ended chat, or anything at temperatures above 0.7, measure acceptance rates at those temperatures specifically. A benchmark run at temperature=0 tells you nothing about what happens at temperature=0.8 in production.

Memory Budget Reality

Running speculative decoding means running two models simultaneously. This is obvious in principle and surprisingly painful in practice once you work through the VRAM math.

Actual footprint numbers

We use Llama-3.1 as a concrete example. Weight sizes here are derived from parameter counts and precision (BF16 = 2 bytes/param, INT8 = 1 byte/param, INT4 = 0.5 bytes/param) - your actual usable VRAM per GPU will vary depending on your hardware configuration and host-level overhead.

Llama-3.1-70B target model:

BF16: ~140GB → too large for a single 80GB H100; requires 2× H100 (160GB total), leaving only ~20GB free before you add the draft model or KV cache
INT8: ~70GB → fits on a single H100 with ~10GB to spare, or on 2× H100 with ~90GB free for KV cache and the draft model
INT4 (AWQ/GPTQ): ~35GB → fits comfortably on a single H100 with ~45GB free; the most memory-efficient option, but at some cost to output quality

Llama-3.1-8B draft model:

BF16: ~16GB
INT8: ~8GB
INT4: ~4GB

On a 2× H100 SXM5 setup (160GB total), a common configuration for production 70B serving:

Target (70B INT8, ~70GB) + Draft (8B BF16, ~16GB) = ~86GB weights. That leaves about 74GB for KV cache, page tables, and activation memory across both GPUs.
Target (70B BF16, ~140GB) + Draft (8B BF16, ~16GB) = ~156GB weights. That leaves only ~4GB - effectively unusable for any meaningful batch size or context length.

The practical takeaway for 2× H100: if you want to run speculative decoding with a 70B target, you have to quantize the target to INT8. Running both models in full BF16 simply doesn’t leave enough room for the KV cache.

On DO’s H200 GPU Droplets (2× H200 SXM5, 282GB total), this constraint goes away. Running 70B BF16 + 8B BF16 uses ~156GB of weights and leaves ~126GB for KV cache - compared to 74GB on the H100 INT8 configuration, that’s about 1.7× the headroom. More importantly, you no longer need to quantize the target model at all to make the memory budget work. If you’re serving long-context requests (32K+ tokens), that difference is significant.
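The headroom figures above follow directly from parameter count times bytes per parameter. The sketch below reproduces them so you can substitute your own models and GPUs; the function names are hypothetical, 1 GB is taken as 1e9 bytes, and the result covers weights only, before any runtime overhead.

```python
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_gb(params_billion: float, precision: str) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes), weights only."""
    return params_billion * BYTES_PER_PARAM[precision]


def headroom_gb(total_vram_gb: float, target_gb: float, draft_gb: float) -> float:
    """VRAM left after loading both models, before any runtime overhead."""
    return total_vram_gb - target_gb - draft_gb


# (label, total VRAM in GB, target precision, draft precision)
configs = [
    ("2x H100, 70B INT8 + 8B BF16", 160, "int8", "bf16"),
    ("2x H100, 70B BF16 + 8B BF16", 160, "bf16", "bf16"),
    ("2x H200, 70B BF16 + 8B BF16", 282, "bf16", "bf16"),
]
for label, total, target_precision, draft_precision in configs:
    left = headroom_gb(
        total,
        weight_gb(70, target_precision),
        weight_gb(8, draft_precision),
    )
    print(f"{label}: ~{left:.0f} GB left after weights")
```

Quantized checkpoints usually carry some extra metadata (scales, zero points), so real files tend to be slightly larger than this estimate.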

How the draft model shrinks your KV cache

To understand this, you need to know what a KV cache is. Every time a model processes a token, it generates intermediate values called keys and values (K and V). These are stored in GPU memory so the model doesn’t have to recompute them on every step - that’s the KV cache. The more tokens a request has (longer conversations, longer documents), the more KV cache it needs. The more concurrent requests you serve, the more KV cache you need in total.

vLLM allocates KV cache at startup using a simple formula: (total VRAM × gpu_memory_utilization, default 0.9) − model weights = KV cache budget. Setting aside activation memory, whatever the model weights don’t use goes to KV cache. This means adding a 16GB draft model costs exactly 16GB of KV cache - it’s a direct, one-for-one trade. You now have 16GB less capacity to hold the context for in-flight requests.
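As a quick check of the one-for-one trade, the formula above can be written out directly. This is a simplified sketch, not vLLM's actual allocation code: the function name is hypothetical, and it ignores activation memory and other runtime overhead, so real budgets will be somewhat smaller.

```python
def kv_cache_budget_gb(total_vram_gb: float, weights_gb: float,
                       gpu_memory_utilization: float = 0.9) -> float:
    """Simplified KV cache budget: (total VRAM x utilization) - weights."""
    if not 0.0 < gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be in (0, 1]")
    return total_vram_gb * gpu_memory_utilization - weights_gb


# 2x H100 (160 GB total) with a 70B INT8 target (~70 GB of weights)
target_only = kv_cache_budget_gb(160, weights_gb=70)
# Same setup plus an 8B BF16 draft model (~16 GB of weights)
with_draft = kv_cache_budget_gb(160, weights_gb=70 + 16)

print(f"target only: {target_only:.0f} GB, with draft: {with_draft:.0f} GB")
print(f"KV cache given up to the draft model: {target_only - with_draft:.0f} GB")
```

Because the utilization factor applies to total VRAM rather than to the leftover, the draft model's weights come out of the KV cache budget at full size.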

In practice, this means one or both of the following:

Shorter maximum context length - requests with long inputs or long conversation histories will hit memory limits sooner
Smaller maximum batch size - you can serve fewer concurrent requests before running out of KV cache space

The latency gains from faster token generation can easily be wiped out by the latency increase from being forced to process fewer requests in parallel. A useful rule of thumb: if your target model already uses more than 70% of your GPU memory, adding a draft model will reduce how many requests you can serve at once - test under real load before enabling it.
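To put the 16GB in terms of context, you can estimate how many tokens of KV cache it would have held. The sketch below uses the standard per-token KV size for attention with grouped KV heads. The architecture values (80 layers, 8 KV heads, head dimension 128) are assumptions taken from the published Llama-3.1-70B configuration, so verify them against your model's config; the function names are hypothetical, and the draft model's own KV cache is not counted.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, bytes_per_value: int) -> int:
    """KV cache bytes needed to store one token of context."""
    # Factor of 2: one key vector plus one value vector per layer
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def tokens_displaced(weights_gb: float, bytes_per_token: int) -> int:
    """Tokens of target-model KV cache that `weights_gb` would have held."""
    return int(weights_gb * 1e9 // bytes_per_token)


# Assumed Llama-3.1-70B shape with a BF16 (2-byte) KV cache
per_token = kv_bytes_per_token(num_layers=80, num_kv_heads=8,
                               head_dim=128, bytes_per_value=2)
lost = tokens_displaced(16, per_token)

print(f"KV cache per token: {per_token / 1024:.0f} KiB")
print(f"a 16 GB draft model displaces roughly {lost:,} tokens of context")
```

Spread across concurrent requests, that total is what shows up as either a shorter maximum context or a smaller batch.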

    # vLLM config for speculative decoding on 2× H100
    # Note: --dtype bfloat16 loads both models unquantized (~156GB of weights
    # by the math above), which leaves almost no room for KV cache on 160GB.
    python -m vllm.entrypoints.openai.api_server \
      --model meta-llama/Llama-3.1-70B-Instruct \
      --speculative-model meta-llama/Llama-3.1-8B-Instruct \
      --num-speculative-tokens 5 \
      --speculative-draft-tensor-parallel-size 1 \
      --tensor-parallel-size 2 \
      --gpu-memory-utilization 0.92 \
      --dtype bfloat16

The --speculative-draft-tensor-parallel-size 1 flag is worth noting explicitly: the draft model typically runs on a single GPU while the target model spans both. This keeps the draft model fast (small model, single-GPU forward pass) but means you’re effectively dedicating 16GB on one of your H100s to a model whose candidate tokens you may reject more than half the time at high temperatures (the sample run earlier shows acceptance near 0.38 at temperature=1.0).

Quantization and Speculative Decoding

Running speculative decoding with a quantized target model is one of the most common production configurations - and one of the least understood in terms of what it actually does to acceptance rates.

Why quantization changes acceptance rates

The speculative decoding algorithm works by comparing the draft model’s proposed tokens against what the target model would have generated. When the target model is quantized, its output distribution shifts slightly. INT8 quantization introduces small numerical rounding errors across the weight matrices; INT4 (GPTQ, AWQ) introduces larger ones that accumulate across layers. These errors are systematic, not random - the quantized model consistently outputs slightly different probability distributions than the full-precision version.

The draft model’s proposals are calibrated to the full-precision target’s probability distribution, not the quantized variant’s, so they diverge from the quantized model’s preferences in proportion to the severity of quantization. This means tokens the draft model proposes with high confidence may be rejected by the quantized target - not because the proposal was semantically wrong, but because the quantized model’s probability mass for that token falls just below the acceptance threshold.

The magnitude of the acceptance rate penalty depends on your model architecture, quantization method, and prompt distribution - there is no universal number, and we haven’t measured it on DO hardware. What’s consistent across reported configurations is the direction and relative ordering: INT8 imposes a smaller penalty than INT4, and the penalty grows with quantization aggressiveness. The exact threshold that matters for your deployment is the one where acceptance rate drops below your break-even point - which is why the right approach is to measure spec_decode_draft_acceptance_rate before and after switching quantization levels on your actual workload, rather than relying on any general estimate. If the acceptance rate drops more than a few points on the same prompt distribution, that’s the signal to step back from INT4 to INT8 on the target.

The asymmetry between quantizing draft vs. target

The two models can be quantized independently, and the performance implications are not symmetric:

Quantizing the target model has the biggest effect because it is the model that decides whether each token proposed by the draft model is accepted. If quantization changes the target model’s predictions, fewer proposed tokens will be accepted, reducing the benefits of speculative decoding.

Quantizing the draft model is usually less risky. The draft model only suggests tokens, while the target model still verifies every suggestion. Even if quantization makes the draft model slightly less accurate, the verification process stays the same.

In practice, it’s generally safe to quantize the draft model more aggressively than the target model. For example, using an INT8 draft model with a BF16 target model usually has little effect on acceptance rates. However, using an INT4 target model, even with a BF16 draft model, can noticeably reduce acceptance rates, especially when generation uses sampling rather than deterministic decoding.

Target quantization	Draft quantization	VRAM (2× H100, 70B+8B)	Notes
BF16 (~140GB)	BF16 (~16GB)	~156GB	Doesn’t fit in practice; ~4GB left for KV cache
INT8 (~70GB)	BF16 (~16GB)	~86GB	Recommended baseline; ~74GB for KV cache
INT8 (~70GB)	INT8 (~8GB)	~78GB	Saves 8GB; minimal acceptance rate effect vs. BF16 draft
INT4 (~35GB)	BF16 (~16GB)	~51GB	Fits on a single H100; acceptance rate penalty is real
INT4 (~35GB)	INT4 (~4GB)	~39GB	Maximum memory efficiency; expect noticeably lower acceptance rates

Weights only. KV cache, activations, and page tables consume additional VRAM. Actual weights may differ slightly by quantization implementation.

The flags

Draft model quantization is handled separately from the target model via --speculative-model-quantization. The main --quantization flag applies only to the target:

    python -m vllm.entrypoints.openai.api_server \
      --model meta-llama/Llama-3.1-70B-Instruct \
      --quantization bitsandbytes \
      --load-format bitsandbytes \
      --speculative-model meta-llama/Llama-3.1-8B-Instruct \
      --speculative-model-quantization bitsandbytes \
      --num-speculative-tokens 5 \
      --speculative-draft-tensor-parallel-size 1 \
      --tensor-parallel-size 2 \
      --gpu-memory-utilization 0.92

There’s a second flag worth knowing about: --spec-decoding-acceptance-method. The default is rejection_sampler, which enforces strict token acceptance based on the probability ratio between draft and target distributions. The alternative, typical_acceptance_sampler, is configurable - it trades a small reduction in output quality for a higher acceptance rate. If you’re running a quantized target model and seeing acceptance rates that would otherwise push you below the break-even threshold, switching to typical_acceptance_sampler can recover some of that loss without changing your model configuration:

      --spec-decoding-acceptance-method typical_acceptance_sampler \
      --typical-acceptance-sampler-posterior-threshold 0.09 \
      --typical-acceptance-sampler-posterior-alpha 0.3

The defaults (threshold=0.09, alpha=0.3) are reasonable starting points. Lowering the threshold accepts more draft tokens; raising it enforces stricter quality. Test on your actual prompt distribution before adjusting.

One more useful flag for quantized deployments under variable load: --speculative-disable-by-batch-size. Set this to a batch size threshold, and the server will automatically disable speculative decoding for new requests when the queue exceeds that size. This gives you the low-QPS gains without manually toggling the configuration when traffic spikes.

What to watch after switching to a quantized configuration

After enabling quantization on either model, pull spec_decode_draft_acceptance_rate and compare it against your baseline (non-quantized) measurement on the same prompt distribution. A drop of more than 5 percentage points relative to baseline suggests the quantization penalty is significant enough to revisit your configuration choice - typically by moving the target model from INT4 to INT8, or by switching acceptance methods.

The right quantization configuration depends on your VRAM constraints, your context length requirements, and your acceptance rate tolerance. The general priority order: INT8 target over INT4 target, same-level quantization for draft and target over mixed, and measure acceptance rate before and after any quantization change.

Continuous Batching: Where the Scheduler Gets Complicated

The performance advantage of continuous batching comes from requests sharing GPU compute in the same forward pass, with new requests slotting into available capacity dynamically. PagedAttention manages the KV cache to enable this without padding or memory waste. Speculative decoding introduces structural assumptions that create friction with both of these mechanisms.

What the scheduler actually assumes

Under standard continuous batching, each iteration of the forward pass generates exactly one new token per request. This uniformity is what makes scheduling clean: you know the output length of each step, you can predict memory allocation, and you can pack requests efficiently.

Speculative decoding breaks this assumption. Each iteration consists of:

1. The draft model generating k candidate tokens per request
2. The target model verifying all k+1 positions (k draft tokens + 1 correction) in a single forward pass
3. Accepting some prefix of those tokens based on the acceptance criterion

The number of tokens actually appended to each request’s sequence after step 3 is variable - somewhere between 1 and k+1. This creates irregular batch shapes that complicate the scheduler.
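The variable step size is easy to see in a toy model of the accept step. This is an illustrative sketch, not vLLM's implementation: the function name is hypothetical, and the per-token accept/reject flags are supplied by hand instead of coming from a real sampler.

```python
def tokens_appended(accept_flags: list[bool]) -> int:
    """Tokens added to a sequence after one speculative step.

    `accept_flags[i]` says whether draft token i passed verification.
    Only the accepted prefix counts; the target model then contributes one
    more token (a correction at the first rejection, or a bonus token if
    every draft token was accepted).
    """
    accepted_prefix = 0
    for accepted in accept_flags:
        if not accepted:
            break  # everything after the first rejection is discarded
        accepted_prefix += 1
    return accepted_prefix + 1


# One batch, k = 5 draft tokens per request, hand-written outcomes
batch = {
    "req-a": [True, True, True, False, True],    # rejected at the 4th token
    "req-b": [False, True, True, True, True],    # rejected immediately
    "req-c": [True, True, True, True, True],     # everything accepted
}
for request_id, flags in batch.items():
    print(f"{request_id}: +{tokens_appended(flags)} tokens this step")
```

In this toy batch the three requests advance by 4, 1, and 6 tokens in the same step, which is the irregularity the scheduler has to absorb.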

What this means for mixed workloads

Under a homogeneous workload - all requests at similar temperatures, similar lengths, similar acceptance rates - the irregularity is predictable enough that the scheduler handles it gracefully. Under a mixed workload, the picture is messier.

Imagine a batch where half the requests are running structured JSON extraction at temperature=0.1 (acceptance rate ~80%) and half are running open-ended creative generation at temperature=1.0 (acceptance rate ~35%). The high-acceptance requests are completing their speculative steps efficiently. The low-acceptance requests are running the draft model, paying the memory bandwidth cost of a second forward pass, and rejecting most of what it produces - effectively adding overhead per token rather than removing it.

The scheduler can’t split these cleanly. They share the same verification pass, which means the requests that don’t benefit from speculation are still paying its cost. In practice, workload homogeneity matters more than most teams realize. Speculative decoding is well-suited for dedicated deployments - a code completion endpoint, a structured extraction pipeline, a RAG system with constrained output formats. It is a poor fit for general-purpose chat APIs where temperature and task type vary across requests in the same batch.

The signal to watch for is a gap between P50 and P99 latency that widens after enabling speculative decoding. Under a mixed workload, P50 often improves (the high-acceptance requests pulling the median down) while P99 gets worse (the low-acceptance requests adding tail latency that compounds under load). If your P50 looks like a win but your P99 is a regression, the scheduler interaction is likely the cause. To confirm: run the same load test against a homogeneous low-temperature workload and compare P99 behavior. If it tightens significantly, you have a mixed-workload problem, not a speculative decoding problem.
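A minimal way to check for that signature is to compare P50 and P99 from two load-test runs. The sketch below uses Python's standard statistics module; the function names are hypothetical and the latency lists are synthetic placeholders, not measurements.

```python
import statistics


def p50_p99(latencies_ms: list[float]) -> tuple[float, float]:
    """Return (P50, P99) using linear interpolation between samples."""
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return cuts[49], cuts[98]  # 50th and 99th percentile cut points


def median_won_tail_lost(baseline_ms: list[float], spec_ms: list[float]) -> bool:
    """True when P50 improved but P99 regressed after enabling speculation."""
    base_p50, base_p99 = p50_p99(baseline_ms)
    spec_p50, spec_p99 = p50_p99(spec_ms)
    return spec_p50 < base_p50 and spec_p99 > base_p99


# Synthetic example: most requests get faster, the slowest 10% get slower
baseline = [float(v) for v in range(100, 200)]
with_spec = [v - 20 for v in baseline[:90]] + [v + 80 for v in baseline[90:]]

print("baseline  P50/P99:", p50_p99(baseline))
print("with spec P50/P99:", p50_p99(with_spec))
print("mixed-workload signature:", median_won_tail_lost(baseline, with_spec))
```

Run both tests at the same concurrency and with the same prompt mix, otherwise the comparison mostly measures the difference in load.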

How to Measure This on Your Own Deployment

The published reference numbers give you a baseline; what actually matters for your deployment decision are the numbers you collect on your own hardware, with your own prompt distribution, at your own query rates. Here is exactly what to measure and how.

The most common operational mistake is measuring the wrong thing. A benchmark that shows 2× speedup on isolated single-request tests will not reliably predict behavior under concurrent production load.

What to actually measure

Acceptance rate, per request. The spec_decode_draft_acceptance_rate metric is available at the /metrics endpoint. Track it as a histogram, not an average - you want to see the distribution. If P10 of your acceptance rate distribution is below 0.5, you have a significant portion of your traffic where speculative decoding is net-negative.

TTFT vs. TPOT separately. Speculative decoding affects time-per-output-token (TPOT), not time-to-first-token (TTFT). TTFT may actually increase slightly - the draft model adds a prefill step before the first token is returned. If your SLO is primarily TTFT-bound (e.g., interactive chat where users care about responsiveness more than throughput), speculative decoding may not move the metric you care about.

P50 vs. P99 latency under load. This is where the scheduler interactions surface. A benchmark running 10 concurrent requests at load=0.3 will look very different from 100 concurrent requests at load=0.8. Run your load tests at the concurrency levels you actually see in production.

    # Pull speculative decoding metrics from the metrics endpoint
    curl -s http://localhost:8000/metrics | grep spec_decode

A healthy deployment looks like this - acceptance rate consistently above 0.7, accepted tokens close to the number of draft tokens:

    # HELP vllm:spec_decode_draft_acceptance_rate Speculative decoding draft acceptance rate
    vllm:spec_decode_draft_acceptance_rate{...} 0.76
    vllm:spec_decode_num_draft_tokens_total{...} 48200
    vllm:spec_decode_num_accepted_tokens_total{...} 36600   # ~76% accepted

An unhealthy deployment - acceptance rate below 0.5, most draft tokens rejected, overhead not paying off:

    vllm:spec_decode_draft_acceptance_rate{...} 0.38
    vllm:spec_decode_num_draft_tokens_total{...} 51000
    vllm:spec_decode_num_accepted_tokens_total{...} 19400   # ~38% accepted

At 38% acceptance with a 1:9 draft/target ratio, you are adding latency, not removing it. This is what a high-temperature workload or a mismatched-family draft model looks like in production. If your metrics look like the second block, turn off speculative decoding until you’ve addressed the root cause.
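The two token counters also let you compute an acceptance rate for a specific time window by differencing two scrapes. The sketch below parses the Prometheus text format in a deliberately simple way: it assumes each sample line ends with its value (no trailing timestamp), and the function names are hypothetical. The metric names are the ones shown in the sample output above.

```python
from typing import Optional

DRAFT = "vllm:spec_decode_num_draft_tokens_total"
ACCEPTED = "vllm:spec_decode_num_accepted_tokens_total"


def parse_counters(metrics_text: str) -> dict[str, float]:
    """Sum samples by metric name, ignoring labels and comment lines."""
    totals: dict[str, float] = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        try:
            totals[name] = totals.get(name, 0.0) + float(value)
        except ValueError:
            continue  # skip lines that do not end in a number
    return totals


def windowed_acceptance_rate(before: str, after: str) -> Optional[float]:
    """Acceptance rate for traffic between two /metrics scrapes."""
    first, second = parse_counters(before), parse_counters(after)
    drafted = second.get(DRAFT, 0.0) - first.get(DRAFT, 0.0)
    accepted = second.get(ACCEPTED, 0.0) - first.get(ACCEPTED, 0.0)
    return accepted / drafted if drafted > 0 else None


# Two hand-written scrapes standing in for real /metrics responses
scrape_1 = f'{DRAFT}{{model_name="m"}} 48200\n{ACCEPTED}{{model_name="m"}} 36600'
scrape_2 = f'{DRAFT}{{model_name="m"}} 58200\n{ACCEPTED}{{model_name="m"}} 40400'
print(windowed_acceptance_rate(scrape_1, scrape_2))
```

A windowed rate reflects only the traffic between the two scrapes, which makes it easier to line up with a specific load test or temperature bucket. In production, a Prometheus rate() query over the same two counters does the same job.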

The monitoring setup you need before trusting the flag:

Acceptance rate histogram (P10, P50, P90) segmented by temperature bucket
TTFT and TPOT at P50, P95, P99 - compared against a baseline without speculative decoding
GPU memory utilization and KV cache hit rate (to catch the squeeze)
Throughput (tokens/second) under your actual production concurrency, not synthetic single-request benchmarks

Why aggregate benchmarks lie

Consider a deployment where 60% of requests are low-temperature structured queries with 82% acceptance rates, and 40% are high-temperature creative requests with 38% acceptance rates. The weighted average acceptance rate is ~64%, which looks healthy. But the 40% of requests that are degrading performance are doing so in a way that adds tail latency to the entire batch. P50 looks like a win; P99 is a regression. Aggregate benchmarks will show the win and hide the regression.
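The arithmetic behind that example is worth writing down, because the same calculation on your own traffic segments shows how much an average can hide. The function name below is hypothetical; the shares and rates are the ones from the paragraph above.

```python
def weighted_acceptance(segments: list[tuple[float, float]]) -> float:
    """Traffic-weighted mean of (traffic_share, acceptance_rate) pairs."""
    total_share = sum(share for share, _ in segments)
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(share * rate for share, rate in segments)


segments = [
    (0.60, 0.82),  # low-temperature structured queries
    (0.40, 0.38),  # high-temperature creative requests
]
aggregate = weighted_acceptance(segments)
worst = min(rate for _, rate in segments)

print(f"aggregate acceptance rate: {aggregate:.3f}")
print(f"worst segment:             {worst:.2f}")
```

The aggregate lands above the 0.5 line while 40% of traffic sits well below it, which is why the monitoring list above asks for acceptance rate segmented by temperature bucket.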

Decision Framework

Speculative decoding delivers when these conditions hold:

Use it when:

Your workload is temperature-homogeneous and skews toward structured output or code - target temperature ≤ 0.5
You have VRAM headroom after target model weights (draft model should consume no more than ~15% of total available VRAM)
Your workload is batch-size-consistent - you’re not mixing request types with dramatically different acceptance rates in the same batch
You’ve validated acceptance rates at production temperatures, not just greedy benchmarks
Your primary SLO is TPOT/throughput, not TTFT

Leave it off when:

Temperature varies widely across your request mix, or your median temperature is above 0.7
You’re VRAM-constrained and serving long-context requests - the KV cache squeeze will cost you more than the token throughput gains
Your workload is TTFT-bound rather than throughput-bound
You haven’t set up acceptance rate monitoring - you can’t tell whether it’s helping

Draft model selection checklist:

Requirement	Why it matters
Same model family	Shared training distribution = higher acceptance rates
1:8–1:12 size ratio	Large enough to predict accurately, small enough to be cheap
Same tokenizer	Mismatched tokenizers require costly re-encoding between models
Quantize the draft before relaxing the family constraint	INT8 draft is fine; relaxing the family constraint hurts acceptance more than INT8 does

Speculative decoding is a genuine win for the right workloads. For everything else, the flag is not a universal accelerant. Treat it like any other performance optimization: measure first, under production conditions, then decide.

FAQ

Does speculative decoding change the model’s outputs?

No, not with the default rejection_sampler. Its acceptance criterion guarantees the final output distribution is identical to what the target model would have produced on its own. It is a pure latency optimization - it changes how fast tokens are generated, not what tokens are generated. You can enable it without touching your prompts, sampling parameters, or output validation. The optional typical_acceptance_sampler is the exception: it deliberately trades a small amount of output quality for a higher acceptance rate.

Does it improve time-to-first-token (TTFT) or time-per-output-token (TPOT)?

It improves TPOT, not TTFT. The draft model adds a small prefill step before the first token is returned, so TTFT may actually increase slightly. If your SLO is primarily TTFT-bound - interactive chat where users notice the first response delay more than the generation speed - speculative decoding may not move the metric that matters to you. It’s most valuable when your bottleneck is throughput or output speed, not initial responsiveness.

What’s the difference between draft model and n-gram speculative decoding?

Draft model speculation uses a separate smaller model to propose tokens - it works across any prompt type but costs VRAM and requires a compatible model family. N-gram speculation reuses repeated phrases from the input prompt itself, which makes it almost free on memory but only useful when the output closely echoes the input (summarization, RAG, document Q&A). For general chat or code generation, use a draft model. For summarization pipelines where the answer mostly paraphrases the source, n-gram is often the better choice and requires no additional model at all.