Public Preview

Speculative Decoding and Drafting

Abstract

Accelerate generation by drafting tokens with a faster surrogate model (or multiple drafting heads) and verifying them with the higher-quality target model, yielding wall-clock speedups with minimal quality loss.

Motivation

  • Reduce latency and cost for long generations
  • Maintain output quality relative to naive small-model-only pipelines

Architectures

  • Draft-and-Verify: a small draft model proposes a window of tokens; the main (target) model verifies them and accepts the longest valid prefix
  • Multi-Head Drafting (e.g., Medusa): extra decoding heads on the base model propose multiple candidate continuations, which the base model verifies in parallel
  • Recurrent Drafting (e.g., ReDrafter): a lightweight recurrent draft head conditioned on the target model's hidden state proposes tokens, refined across steps
  • Serving: batched speculative decoding combined with paged KV caches in inference servers
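The draft-and-verify loop above can be sketched in a few lines. This is a minimal greedy variant with toy stand-in models: `draft_next` and `target_next` are hypothetical callables (not any library's API) that return each model's greedy next token; real systems use probabilistic acceptance over logits rather than exact-match comparison.

```python
from typing import Callable, List

def speculative_decode(
    draft_next: Callable[[List[int]], int],   # hypothetical: draft model's greedy next token
    target_next: Callable[[List[int]], int],  # hypothetical: target model's greedy next token
    prompt: List[int],
    max_new_tokens: int = 16,
    window: int = 4,                          # draft tokens proposed per verification step
) -> List[int]:
    """Greedy draft-and-verify: the draft proposes `window` tokens; the
    target accepts the longest agreeing prefix, then emits one token itself
    on disagreement so progress is guaranteed every step."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1) Draft proposes a window of tokens autoregressively.
        draft = []
        for _ in range(window):
            draft.append(draft_next(tokens + draft))
        # 2) Target verifies: accept draft tokens while they match its own choice.
        accepted = 0
        for tok in draft:
            if target_next(tokens) == tok:
                tokens.append(tok)
                accepted += 1
            else:
                break
        # 3) On the first disagreement the target emits its own token,
        #    so even a useless draft degrades to plain target decoding.
        if accepted < window:
            tokens.append(target_next(tokens))
    return tokens[:len(prompt) + max_new_tokens]
```

With exact-match acceptance, the output is identical to target-only decoding regardless of draft quality; only the number of target calls changes.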

Design Choices

  • Draft model size and alignment with target domain
  • Acceptance policy and window size
  • Fallback behavior on low agreement
  • Streaming vs. chunked verification
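The design choices above can be grouped into a single configuration object. This sketch is purely illustrative: every field name and default is an assumption, not the schema of any particular serving framework.

```python
from dataclasses import dataclass

@dataclass
class SpecDecodeConfig:
    # All names and defaults are illustrative, not from any specific library.
    draft_model: str = "small-draft-1b"   # hypothetical draft model id
    window: int = 4                       # draft tokens proposed per verification step
    acceptance: str = "exact"             # "exact" match vs. probabilistic acceptance
    min_acceptance_rate: float = 0.3      # below this, fall back to target-only decoding
    streaming_verify: bool = True         # verify per token vs. in chunks

    def should_fallback(self, observed_rate: float) -> bool:
        """Disable speculation when draft/target agreement is too low,
        since rejected windows make decoding slower than the baseline."""
        return observed_rate < self.min_acceptance_rate
```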

Pros/Cons

  • Pros: Significant speedups, controllable quality
  • Cons: Added complexity, extra memory, sensitive to draft quality

Evaluation Metrics

  • Speedup (×); reduction in time-to-first-token (TTFT) and time-per-output-token (TPOT)
  • Quality delta vs. baseline (task-specific metrics, human eval)
  • Acceptance rate and rejection overhead
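These metrics reduce to simple ratios. The helpers below are a sketch; the expected-tokens formula assumes an i.i.d. per-token acceptance probability, a common simplification used to reason about speedups.

```python
def acceptance_rate(accepted: int, proposed: int) -> float:
    """Fraction of draft tokens the target model accepted."""
    return accepted / proposed if proposed else 0.0

def speedup(baseline_tpot_ms: float, spec_tpot_ms: float) -> float:
    """TPOT = time per output token; > 1.0 means speculation helped."""
    return baseline_tpot_ms / spec_tpot_ms

def expected_tokens_per_step(rate: float, window: int) -> float:
    """Expected tokens generated per verification step, assuming each of the
    `window` draft tokens is accepted independently with probability `rate`
    and counting the target's own bonus token: 1 + r + r^2 + ... + r^window."""
    return sum(rate ** k for k in range(window + 1))
```

For example, with a 0% acceptance rate each step still yields one token (the target's own), and with 100% acceptance a window of 4 yields 5 tokens per step.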

Vendor/Tooling

  • Implementations emerging in inference servers (e.g., vLLM's experimental speculative decoding support)
  • Research baselines: Assisted Generation, Medusa, EAGLE, ReDrafter

Design Checklist

  • Benchmark speedup vs. quality on target prompts/tasks
  • Tune acceptance window and draft temperature
  • Monitor failure modes (loops, format drift); add guardrails
  • Validate under long-context workloads
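For the guardrail item above, one simple failure-mode monitor is a trailing n-gram loop detector. This is a minimal sketch under my own assumptions (fixed n-gram size, exact repeats at the tail), not a production-grade detector.

```python
from typing import List

def has_ngram_loop(tokens: List[int], n: int = 4, repeats: int = 3) -> bool:
    """Guardrail sketch: flag degenerate repetition when the trailing
    n-gram occurs `repeats` times consecutively at the end of the sequence."""
    span = n * repeats
    if len(tokens) < span:
        return False
    tail = tokens[-span:]
    first = tail[:n]
    # True only if every n-sized chunk of the tail equals the first chunk.
    return all(tail[i * n:(i + 1) * n] == first for i in range(repeats))
```

A server might check this after each verification step and abort or resample when it fires.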

References

  • Title: Assisted Generation: a new direction toward low-latency text generation URL: https://huggingface.co/blog/assisted-generation Publisher/Vendor: Hugging Face Accessed: 2025-08-14 Version_or_release: 2023 (blog post)
  • Title: Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads URL: https://arxiv.org/abs/2401.10774 Publisher/Vendor: arXiv Accessed: 2025-08-14 Version_or_release: 2024-01 (preprint)
  • Title: EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty URL: https://arxiv.org/abs/2401.15077 Publisher/Vendor: arXiv Accessed: 2025-08-14 Version_or_release: 2024-01 (preprint)
  • Title: Recurrent Drafter for Fast Speculative Decoding in Large Language Models (ReDrafter) URL: https://arxiv.org/abs/2403.09919 Publisher/Vendor: arXiv Accessed: 2025-08-14 Version_or_release: 2024-03 (preprint)
  • Title: vLLM (PagedAttention and serving) URL: https://vllm.ai Publisher/Vendor: vLLM Project Accessed: 2025-08-14 Version_or_release: provider_reported