Speculative Decoding and Drafting
Abstract
Accelerate generation by drafting tokens with a faster surrogate model (or extra drafting heads) and verifying them in parallel with a higher-quality target model, yielding wall-clock speedups with little or no quality loss.
Motivation
- Reduce latency and cost for long generations
- Maintain quality vs. naive small-model-only pipelines
Architectures
- Draft-and-Verify: small draft model proposes tokens; main model verifies/accepts
- Multi-Head Drafting (e.g., Medusa): extra decoding heads on the base model propose multiple candidate continuations, verified by the base in a single forward pass
- Recurrent Drafting (e.g., ReDrafter): a lightweight recurrent draft head, conditioned on the base model's hidden state, proposes tokens that the base then verifies
- Serving: batched speculative decoding with paged KV caches
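The draft-and-verify pattern above can be sketched in a few lines. This is a minimal greedy variant with hypothetical stand-ins: `draft_next` and `target_next` are toy next-token functions, not real models, and a real system would verify all draft positions in one batched forward pass rather than sequential calls.

```python
def speculative_step(prefix, draft_next, target_next, k=4):
    """Propose k draft tokens, then accept the longest prefix the
    target model agrees with, plus one corrected/bonus target token."""
    # 1. Draft: the cheap model proposes k tokens autoregressively.
    draft = []
    ctx = list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        draft.append(t)
        ctx.append(t)
    # 2. Verify: check each draft position against the target model
    # (shown sequentially here; in practice this is one batched pass).
    accepted = []
    ctx = list(prefix)
    for t in draft:
        expect = target_next(ctx)
        if expect != t:          # first disagreement: stop and correct
            accepted.append(expect)
            return accepted
        accepted.append(t)
        ctx.append(t)
    # All drafts accepted: emit one bonus token from the target.
    accepted.append(target_next(ctx))
    return accepted
```

Each call emits between 1 token (immediate disagreement) and k+1 tokens (full agreement plus the bonus token), which is where the speedup comes from.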
Design Choices
- Draft model size and alignment with target domain
- Acceptance policy and window size
- Fallback behavior on low agreement
- Streaming vs. chunked verification
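One well-known acceptance policy is the lossless rule from speculative sampling: accept a draft token with probability min(1, p/q) and, on rejection, resample from the normalized residual max(p − q, 0), which preserves the target distribution exactly. A sketch over toy probability dicts (`rng` is injectable here only to make the example deterministic):

```python
import random

def accept_or_resample(token, p, q, rng=random.random):
    """Lossless acceptance rule: accept the draft token with
    probability min(1, p[token]/q[token]); on rejection, resample
    from the residual distribution norm(max(p - q, 0)).
    p and q are target/draft probability dicts over a toy vocab."""
    ratio = min(1.0, p.get(token, 0.0) / q[token])
    if rng() < ratio:
        return token, True
    # Residual distribution: mass where the target exceeds the draft.
    residual = {t: max(p.get(t, 0.0) - q.get(t, 0.0), 0.0) for t in p}
    z = sum(residual.values())
    r, acc = rng() * z, 0.0
    for t, w in residual.items():
        acc += w
        if r <= acc:
            return t, False
    return max(residual, key=residual.get), False  # numeric fallback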
Pros/Cons
- Pros: Significant speedups; lossless acceptance rules can preserve the target model's output distribution exactly
- Cons: Added complexity, extra memory, sensitive to draft quality
Evaluation Metrics
- Speedup (×); reduction in time-to-first-token (TTFT) and time-per-output-token (TPOT)
- Quality delta vs. baseline (task-specific metrics, human eval)
- Acceptance rate and rejection overhead
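These metrics are related: under a simplified model where each draft token is accepted independently with probability alpha, the expected tokens per verification pass is a geometric series, and dividing by the relative cost of a step gives a rough speedup estimate. A sketch (the i.i.d.-acceptance assumption is a simplification; real acceptance rates vary by position and prompt):

```python
def expected_tokens_per_step(alpha, k):
    """Expected tokens emitted per verification pass with window k,
    assuming each draft token is accepted independently with
    probability alpha: 1 + alpha + ... + alpha**k."""
    if alpha == 1.0:
        return k + 1.0
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def estimated_speedup(alpha, k, c):
    """Rough speedup vs. plain decoding: tokens per step divided by
    the relative cost of one step (k draft calls at cost c, relative
    to the target, plus 1 target call)."""
    return expected_tokens_per_step(alpha, k) / (k * c + 1.0)
```

Note that with a low acceptance rate the estimate drops below 1.0, i.e., drafting can be a net slowdown, which is why acceptance rate belongs on the dashboard.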
Vendor/Tooling
- Implementations shipping in inference servers (e.g., vLLM's speculative decoding support)
- Research baselines: Assisted Generation, Medusa, EAGLE, ReDrafter
Design Checklist
- Benchmark speedup vs. quality on target prompts/tasks
- Tune acceptance window and draft temperature
- Monitor failure modes (loops, format drift); add guardrails
- Validate under long-context workloads
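For the loop-monitoring item above, a cheap guardrail is a trailing n-gram repetition check on the emitted token stream. A minimal sketch (the thresholds `n` and `repeats` are illustrative, not tuned values):

```python
def looks_like_loop(tokens, n=4, repeats=3):
    """Flag output whose trailing n-gram repeats `repeats` times
    back-to-back, a cheap proxy for degenerate looping."""
    tail = tokens[-n * repeats:]
    if len(tail) < n * repeats:
        return False
    gram = tail[:n]
    return all(tail[i:i + n] == gram for i in range(0, n * repeats, n))
```

Run it on a sliding basis during streaming; on a hit, fall back to plain target-model decoding or raise the sampling temperature.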
References
- Title: Assisted Generation: a new direction toward low-latency text generation URL: https://huggingface.co/blog/assisted-generation Publisher/Vendor: Hugging Face Accessed: 2025-08-14 Version_or_release: 2023-05 (blog post)
- Title: Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads URL: https://arxiv.org/abs/2401.10774 Publisher/Vendor: arXiv Accessed: 2025-08-14 Version_or_release: 2024-01 (preprint)
- Title: EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty URL: https://arxiv.org/abs/2401.15077 Publisher/Vendor: arXiv Accessed: 2025-08-14 Version_or_release: 2024-01 (preprint)
- Title: Recurrent Drafter for Fast Speculative Decoding in Large Language Models (ReDrafter) URL: https://arxiv.org/abs/2403.09919 Publisher/Vendor: arXiv Accessed: 2025-08-14 Version_or_release: 2024-03 (preprint)
- Title: vLLM (PagedAttention and serving) URL: https://vllm.ai Publisher/Vendor: vLLM Project Accessed: 2025-08-14 Version_or_release: provider_reported