Public Preview


Inference and Decoding Strategies

Definition

Inference produces outputs from trained models. Decoding selects each next token from the model's output logits, balancing quality, diversity, and efficiency.
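The distinction can be made concrete with a minimal sketch: given a toy logit vector (no real model involved), greedy decoding picks the argmax deterministically, while temperature sampling rescales the logits before drawing a token.

```python
import math
import random

def softmax(logits):
    """Convert raw logits to a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(logits):
    """Deterministic: always pick the highest-logit token."""
    return max(range(len(logits)), key=lambda i: logits[i])

def sample(logits, temperature=1.0, rng=random):
    """Stochastic: divide logits by temperature, then sample.
    Low temperature sharpens toward greedy; high flattens toward uniform."""
    probs = softmax([x / temperature for x in logits])
    r = rng.random()
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

logits = [2.0, 1.0, 0.1]
print(greedy(logits))  # -> 0
```

At very low temperature, `sample` collapses to the greedy choice, which is the usual deterministic path for evals.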

Why It Matters

  • Controls style, determinism, and factuality
  • Impacts latency/cost via algorithmic complexity
  • Enables throughput gains (speculative/drafting) without retraining

2025 State of the Art

  • Core: greedy, temperature/top-k/top-p sampling, beam search, typical/contrastive decoding
  • Acceleration: speculative decoding (draft-and-verify), multi-branch drafting heads (e.g., Medusa), recurrent drafters (e.g., ReDrafter), and related draft-and-verify variants
  • Systems: KV cache optimizations and scheduler techniques (e.g., PagedAttention in vLLM)
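Two of the core sampling strategies above, top-k and top-p (nucleus) filtering, amount to masking logits before sampling. A minimal sketch over plain Python lists (real implementations operate on tensors):

```python
import math

def softmax(logits):
    """Logits to probabilities, treating -inf as a hard mask."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def top_k_filter(logits, k):
    """Keep the k highest logits; mask the rest to -inf.
    (Ties at the cutoff may keep slightly more than k tokens.)"""
    kth = sorted(logits, reverse=True)[k - 1]
    return [x if x >= kth else float("-inf") for x in logits]

def top_p_filter(logits, p):
    """Nucleus filtering: keep the smallest set of tokens whose
    cumulative probability reaches p; mask everything else."""
    probs = softmax(logits)
    order = sorted(range(len(logits)), key=lambda i: probs[i], reverse=True)
    keep, cum = set(), 0.0
    for i in order:
        keep.add(i)
        cum += probs[i]
        if cum >= p:
            break
    return [x if i in keep else float("-inf") for i, x in enumerate(logits)]
```

After filtering, sampling proceeds over the surviving tokens; the two filters are often composed, with top-k applied before top-p.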

Key Players

  • OpenAI, Google/DeepMind, Anthropic, Meta; open-source stacks like vLLM and Hugging Face Transformers

Challenges

  • Trade-offs: coherence vs. diversity; speed vs. quality
  • Maintaining valid JSON and other structured output while sampling
  • Keeping decoding stable over very long contexts

Reference Architectures

  • Assisted Generation / speculative draft model + verifier main model
  • Single-model multi-head drafting (Medusa-style)
  • Serving layers with paged KV caches and batching
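The first architecture, a draft model plus a verifier main model, can be sketched with stand-in functions for the two models. This simplified variant accepts a drafted token only when the target's greedy choice agrees exactly; real speculative decoding accepts probabilistically against the two models' distributions, and the target scores all drafted positions in a single forward pass rather than one call per token.

```python
def speculative_step(target_next, draft_next, context, k=4):
    """One draft-and-verify step (greedy agreement variant).

    target_next / draft_next map a token sequence to that model's
    next token (stand-ins for real model calls). The cheap draft
    proposes k tokens; the target accepts the longest agreeing
    prefix, then emits its own token at the first disagreement.
    """
    # Draft phase: propose k tokens autoregressively with the cheap model.
    drafted, ctx = [], list(context)
    for _ in range(k):
        t = draft_next(ctx)
        drafted.append(t)
        ctx.append(t)

    # Verify phase: accept drafted tokens while the target agrees.
    accepted, ctx = [], list(context)
    for t in drafted:
        if target_next(ctx) == t:   # agreement: token accepted "for free"
            accepted.append(t)
            ctx.append(t)
        else:                       # first disagreement: stop and correct
            break
    accepted.append(target_next(ctx))
    return accepted
```

Each step thus emits at least one target-quality token, and up to k+1 when the draft tracks the target well, which is where the throughput gain comes from.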

Opportunities

  • Adaptive decoding policies conditioned on prompt/task
  • Structured/constrained decoding with JSON Schema
  • Unified draft-verify across text/image/audio tokens
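Structured/constrained decoding reduces to the same masking idea as top-k/top-p: at each step, disallow every token the constraint forbids. A toy sketch with a hypothetical word-level vocabulary; real systems derive the allowed set from a JSON Schema or grammar state machine at each step rather than from a fixed predicate.

```python
def constrain(logits, vocab, allowed):
    """Mask logits so only tokens permitted by the constraint survive.
    vocab[i] is the string for token i; `allowed` is a predicate
    standing in for a schema- or grammar-driven token filter."""
    return [x if allowed(tok) else float("-inf")
            for x, tok in zip(logits, vocab)]

# Hypothetical vocabulary; suppose the schema requires a JSON boolean here.
vocab = ["true", "false", "null", "maybe"]
logits = [0.1, 0.2, 0.3, 5.0]
masked = constrain(logits, vocab, lambda t: t in {"true", "false"})
best = vocab[max(range(len(masked)), key=lambda i: masked[i])]
print(best)  # -> false
```

Even though "maybe" has the highest raw logit, the constraint guarantees the output remains schema-valid.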

Design Checklist & Acceptance Criteria

  • Expose decoding knobs (temperature, top-p, top-k, presence/frequency penalties)
  • Provide deterministic paths (greedy, low temperature) for evals
  • Support structured outputs via constrained decoding/JSON Schema
  • Measure speedups and quality of experience (QoE) under speculative settings on target hardware
  • Verify safety/format compliance in streaming mode
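The first two checklist items can be captured in a small config object. The names below are a hypothetical knob set (not any particular API), with temperature 0 conventionally mapping to the deterministic greedy path used for evals.

```python
from dataclasses import dataclass

@dataclass
class DecodingConfig:
    """Hypothetical decoding knob set mirroring the checklist above."""
    temperature: float = 1.0
    top_p: float = 1.0            # 1.0 = nucleus filtering disabled
    top_k: int = 0                # 0 = top-k filtering disabled
    presence_penalty: float = 0.0
    frequency_penalty: float = 0.0

    def is_deterministic(self) -> bool:
        """Greedy path for reproducible evals: temperature of exactly 0."""
        return self.temperature == 0.0

eval_config = DecodingConfig(temperature=0.0)
print(eval_config.is_deterministic())  # -> True
```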

References

  • Title: Assisted Generation: a new direction toward low-latency text generation URL: https://huggingface.co/blog/assisted-generation Publisher/Vendor: Hugging Face Accessed: 2025-08-14 Version_or_release: 2023-05 (blog)
  • Title: Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads URL: https://arxiv.org/abs/2401.10774 Publisher/Vendor: arXiv Accessed: 2025-08-14 Version_or_release: 2024-01 (preprint)
  • Title: EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty URL: https://arxiv.org/abs/2401.15077 Publisher/Vendor: arXiv Accessed: 2025-08-14 Version_or_release: 2024-01 (preprint)
  • Title: Recurrent Drafter for Fast Speculative Decoding in Large Language Models (ReDrafter) URL: https://arxiv.org/abs/2403.09919 Publisher/Vendor: arXiv Accessed: 2025-08-14 Version_or_release: 2024-03 (preprint)
  • Title: vLLM: PagedAttention and Efficient LLM Serving URL: https://vllm.ai/ Publisher/Vendor: vLLM Project Accessed: 2025-08-14 Version_or_release: provider_reported