Inference and Decoding Strategies
Definition
Inference produces outputs from trained models. Decoding selects the next token at each step given the model's logits, balancing quality, diversity, and efficiency.
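The core sampling knobs can be sketched in a few lines. This is a minimal, illustrative implementation (plain Python over a toy logit vector, not tied to any framework): temperature rescales logits before the softmax, greedy takes the argmax, and top-p (nucleus) filtering keeps the smallest set of tokens whose cumulative probability reaches `p`.

```python
import math

def softmax(logits, temperature=1.0):
    # Scale logits by temperature, then normalize to probabilities.
    # Lower temperature sharpens the distribution; higher flattens it.
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(logits):
    # Deterministic decoding: always pick the highest-logit token.
    return max(range(len(logits)), key=lambda i: logits[i])

def top_p_filter(probs, p=0.9):
    # Nucleus sampling support set: smallest set of tokens (by
    # descending probability) whose cumulative mass reaches p.
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= p:
            break
    return kept
```

Sampling then draws from the renormalized distribution over the kept set; top-k is the same idea with a fixed-size set instead of a probability-mass cutoff.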
Why It Matters
- Controls style, determinism, and factuality
- Impacts latency/cost via algorithmic complexity
- Enables throughput gains (speculative/drafting) without retraining
2025 State of the Art
- Core: greedy, temperature/top-k/top-p sampling, beam search, typical/contrastive decoding
- Acceleration: speculative decoding (draft-and-verify), single-model multi-head drafting (e.g., Medusa), recurrent drafters (e.g., ReDrafter), and related verify-in-parallel variants
- Systems: KV cache optimizations and scheduler techniques (e.g., PagedAttention in vLLM)
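The draft-and-verify idea can be shown with a toy loop. This sketch uses a simplified greedy-match acceptance rule (a drafted token is accepted if it equals the target model's greedy choice) rather than the probability-ratio acceptance test of full speculative sampling; `target` and `draft` are hypothetical next-token callables standing in for the large and small models.

```python
def speculative_step(target, draft, prefix, k=4):
    # Draft model proposes k tokens autoregressively (cheap).
    proposed = []
    ctx = list(prefix)
    for _ in range(k):
        t = draft(ctx)
        proposed.append(t)
        ctx.append(t)

    # Target verifies the k positions; in a real system this is a
    # single batched forward pass, not a Python loop.
    accepted = []
    ctx = list(prefix)
    for t in proposed:
        if target(ctx) == t:  # simplified greedy-match acceptance
            accepted.append(t)
            ctx.append(t)
        else:
            break

    # On mismatch (or full acceptance), emit one token from the
    # target so every step is guaranteed to make progress.
    accepted.append(target(ctx))
    return accepted
```

When the draft agrees often, each target pass yields up to k+1 tokens instead of one, which is where the speedup comes from; the output distribution matches pure target decoding under this greedy rule.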
Key Players
- OpenAI, Google/DeepMind, Anthropic, Meta; open-source stacks like vLLM and Hugging Face Transformers
Challenges
- Trade-offs: coherence vs. diversity; speed vs. quality
- Maintaining JSON/structure while sampling
- Maintaining decoding stability at very long context lengths
Reference Architectures
- Assisted generation: small speculative draft model + large verifier (target) model
- Single-model multi-head drafting (Medusa-style)
- Serving layers with paged KV caches and batching
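The paged-KV idea behind serving layers can be illustrated with a toy block allocator. This is a sketch in the spirit of vLLM's PagedAttention, not its actual implementation: each sequence maps logical token positions to fixed-size physical blocks allocated on demand, so memory is not pre-reserved for the maximum length, and freed blocks are immediately reusable by other sequences in the batch.

```python
BLOCK = 4  # tokens per KV block (toy size; real systems use e.g. 16)

class PagedKVCache:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))  # pool of physical blocks
        self.tables = {}   # seq_id -> list of physical block ids
        self.lengths = {}  # seq_id -> number of cached tokens

    def append(self, seq_id, kv):
        # Record one token's KV entry; allocate a new block only when
        # the current one is full (or on the first token).
        n = self.lengths.get(seq_id, 0)
        if n % BLOCK == 0:
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1
        block = self.tables[seq_id][n // BLOCK]
        return (block, n % BLOCK)  # physical slot for this token's KV

    def release(self, seq_id):
        # Sequence finished: return its blocks to the shared pool.
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

The block table is what attention kernels consult to gather a sequence's scattered KV entries; continuous batching then packs many sequences' decode steps into each forward pass.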
Opportunities
- Adaptive decoding policies conditioned on prompt/task
- Structured/constrained decoding with JSON Schema
- Unified draft-verify across text/image/audio tokens
Design Checklist & Acceptance Criteria
- Expose decoding knobs (temperature, top-p, top-k, presence/frequency penalties)
- Provide deterministic paths (greedy, low temperature) for evals
- Support structured outputs via constrained decoding/JSON Schema
- Measure speedups and quality of experience (QoE) under speculative settings on target hardware
- Verify safety/format compliance in streaming mode
References
- Title: Assisted Generation: a new direction toward low-latency text generation URL: https://huggingface.co/blog/assisted-generation Publisher/Vendor: Hugging Face Accessed: 2025-08-14 Version_or_release: 2023-05 (blog)
- Title: Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads URL: https://arxiv.org/abs/2401.10774 Publisher/Vendor: arXiv Accessed: 2025-08-14 Version_or_release: 2024-01 (preprint)
- Title: EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty URL: https://arxiv.org/abs/2401.15077 Publisher/Vendor: arXiv Accessed: 2025-08-14 Version_or_release: 2024-01 (preprint)
- Title: Recurrent Drafter for Fast Speculative Decoding in Large Language Models (ReDrafter) URL: https://arxiv.org/abs/2403.09919 Publisher/Vendor: arXiv Accessed: 2025-08-14 Version_or_release: 2024-03 (preprint)
- Title: vLLM: PagedAttention and Efficient LLM Serving URL: https://vllm.ai/ Publisher/Vendor: vLLM Project Accessed: 2025-08-14 Version_or_release: provider_reported