Public Preview

Speculative Decoding and Drafting

Abstract

Accelerate generation by drafting tokens with a faster surrogate model (or multiple drafting heads) and verifying them with the higher-quality target model, yielding wall-clock speedups with minimal quality loss.

Motivation

  • Reduce latency and cost for long generations
  • Maintain output quality relative to naive small-model-only pipelines

Architectures

  • Draft-and-Verify: a small draft model proposes a window of tokens; the main (target) model verifies them and accepts the longest valid prefix
  • Multi-Head Drafting (e.g., Medusa): extra decoding heads on the base model propose multiple candidate continuations, which the base model verifies in parallel
  • Recurrent Drafting (e.g., ReDrafter): a lightweight recurrent draft head conditioned on the target model's hidden state proposes tokens, refined across steps
  • Serving: batched speculative decoding combined with paged KV caches in inference servers
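The draft-and-verify loop above can be sketched in a few lines. This is a minimal greedy variant with toy stand-in models: `draft_next` and `target_next` are hypothetical callables (not any library's API) that return each model's greedy next token; real systems use probabilistic acceptance over logits rather than exact-match comparison.

```python
from typing import Callable, List

def speculative_decode(
    draft_next: Callable[[List[int]], int],   # hypothetical: draft model's greedy next token
    target_next: Callable[[List[int]], int],  # hypothetical: target model's greedy next token
    prompt: List[int],
    max_new_tokens: int = 16,
    window: int = 4,                          # draft tokens proposed per verification step
) -> List[int]:
    """Greedy draft-and-verify: the draft proposes `window` tokens; the
    target accepts the longest agreeing prefix, then emits one token itself
    on disagreement so progress is guaranteed every step."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1) Draft proposes a window of tokens autoregressively.
        draft = []
        for _ in range(window):
            draft.append(draft_next(tokens + draft))
        # 2) Target verifies: accept draft tokens while they match its own choice.
        accepted = 0
        for tok in draft:
            if target_next(tokens) == tok:
                tokens.append(tok)
                accepted += 1
            else:
                break
        # 3) On the first disagreement the target emits its own token,
        #    so even a useless draft degrades to plain target decoding.
        if accepted < window:
            tokens.append(target_next(tokens))
    return tokens[:len(prompt) + max_new_tokens]
```

With exact-match acceptance, the output is identical to target-only decoding regardless of draft quality; only the number of target calls changes.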

Design Choices

  • Draft model size and alignment with target domain
  • Acceptance policy and window size
  • Fallback behavior on low agreement
  • Streaming vs. chunked verification
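The design choices above can be grouped into a single configuration object. This sketch is purely illustrative: every field name and default is an assumption, not the schema of any particular serving framework.

```python
from dataclasses import dataclass

@dataclass
class SpecDecodeConfig:
    # All names and defaults are illustrative, not from any specific library.
    draft_model: str = "small-draft-1b"   # hypothetical draft model id
    window: int = 4                       # draft tokens proposed per verification step
    acceptance: str = "exact"             # "exact" match vs. probabilistic acceptance
    min_acceptance_rate: float = 0.3      # below this, fall back to target-only decoding
    streaming_verify: bool = True         # verify per token vs. in chunks

    def should_fallback(self, observed_rate: float) -> bool:
        """Disable speculation when draft/target agreement is too low,
        since rejected windows make decoding slower than the baseline."""
        return observed_rate < self.min_acceptance_rate
```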

Pros/Cons

  • Pros: Significant speedups, controllable quality
  • Cons: Added complexity, extra memory, sensitive to draft quality

Evaluation Metrics

  • Speedup (×); reduction in time-to-first-token (TTFT) and time-per-output-token (TPOT)
  • Quality delta vs. baseline (task-specific metrics, human eval)
  • Acceptance rate and rejection overhead
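These metrics reduce to simple ratios. The helpers below are a sketch; the expected-tokens formula assumes an i.i.d. per-token acceptance probability, a common simplification used to reason about speedups.

```python
def acceptance_rate(accepted: int, proposed: int) -> float:
    """Fraction of draft tokens the target model accepted."""
    return accepted / proposed if proposed else 0.0

def speedup(baseline_tpot_ms: float, spec_tpot_ms: float) -> float:
    """TPOT = time per output token; > 1.0 means speculation helped."""
    return baseline_tpot_ms / spec_tpot_ms

def expected_tokens_per_step(rate: float, window: int) -> float:
    """Expected tokens generated per verification step, assuming each of the
    `window` draft tokens is accepted independently with probability `rate`
    and counting the target's own bonus token: 1 + r + r^2 + ... + r^window."""
    return sum(rate ** k for k in range(window + 1))
```

For example, with a 0% acceptance rate each step still yields one token (the target's own), and with 100% acceptance a window of 4 yields 5 tokens per step.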

Vendor/Tooling

  • Implementations emerging in inference servers (e.g., vLLM's experimental speculative decoding support)
  • Research baselines: Assisted Generation, Medusa, EAGLE, ReDrafter

Design Checklist

  • Benchmark speedup vs. quality on target prompts/tasks
  • Tune acceptance window and draft temperature
  • Monitor failure modes (loops, format drift); add guardrails
  • Validate under long-context workloads
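For the guardrail item above, one simple failure-mode monitor is a trailing n-gram loop detector. This is a minimal sketch under my own assumptions (fixed n-gram size, exact repeats at the tail), not a production-grade detector.

```python
from typing import List

def has_ngram_loop(tokens: List[int], n: int = 4, repeats: int = 3) -> bool:
    """Guardrail sketch: flag degenerate repetition when the trailing
    n-gram occurs `repeats` times consecutively at the end of the sequence."""
    span = n * repeats
    if len(tokens) < span:
        return False
    tail = tokens[-span:]
    first = tail[:n]
    # True only if every n-sized chunk of the tail equals the first chunk.
    return all(tail[i * n:(i + 1) * n] == first for i in range(repeats))
```

A server might check this after each verification step and abort or resample when it fires.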

References

  • Title: Assisted Generation: a new direction toward low-latency text generation URL: https://huggingface.co/blog/assisted-generation Publisher/Vendor: Hugging Face Accessed: 2025-08-14 Version_or_release: 2023 (blog post)
  • Title: Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads URL: https://arxiv.org/abs/2401.10774 Publisher/Vendor: arXiv Accessed: 2025-08-14 Version_or_release: 2024-01 (preprint)
  • Title: EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty URL: https://arxiv.org/abs/2401.15077 Publisher/Vendor: arXiv Accessed: 2025-08-14 Version_or_release: 2024-01 (preprint)
  • Title: Recurrent Drafter for Fast Speculative Decoding in Large Language Models (ReDrafter) URL: https://arxiv.org/abs/2403.09919 Publisher/Vendor: arXiv Accessed: 2025-08-14 Version_or_release: 2024-03 (preprint)
  • Title: vLLM (PagedAttention and serving) URL: https://vllm.ai Publisher/Vendor: vLLM Project Accessed: 2025-08-14 Version_or_release: provider_reported