Public Preview


Inference and Decoding Strategies

Definition

Inference produces outputs from trained models. Decoding selects each next token from the model's output logits, balancing quality, diversity, and efficiency.
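The distinction can be made concrete with a minimal sketch: given a toy logit vector (no real model involved), greedy decoding picks the argmax deterministically, while temperature sampling rescales the logits before drawing a token.

```python
import math
import random

def softmax(logits):
    """Convert raw logits to a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(logits):
    """Deterministic: always pick the highest-logit token."""
    return max(range(len(logits)), key=lambda i: logits[i])

def sample(logits, temperature=1.0, rng=random):
    """Stochastic: divide logits by temperature, then sample.
    Low temperature sharpens toward greedy; high flattens toward uniform."""
    probs = softmax([x / temperature for x in logits])
    r = rng.random()
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

logits = [2.0, 1.0, 0.1]
print(greedy(logits))  # -> 0
```

At very low temperature, `sample` collapses to the greedy choice, which is the usual deterministic path for evals.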

Why It Matters

  • Controls style, determinism, and factuality
  • Impacts latency/cost via algorithmic complexity
  • Enables throughput gains (speculative/drafting) without retraining

2025 State of the Art

  • Core: greedy, temperature/top-k/top-p sampling, beam search, typical/contrastive decoding
  • Acceleration: speculative decoding (draft-and-verify), multi-branch drafting heads (e.g., Medusa), recurrent drafters (e.g., ReDrafter), and related draft-and-verify variants
  • Systems: KV cache optimizations and scheduler techniques (e.g., PagedAttention in vLLM)
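Two of the core sampling strategies above, top-k and top-p (nucleus) filtering, amount to masking logits before sampling. A minimal sketch over plain Python lists (real implementations operate on tensors):

```python
import math

def softmax(logits):
    """Logits to probabilities, treating -inf as a hard mask."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def top_k_filter(logits, k):
    """Keep the k highest logits; mask the rest to -inf.
    (Ties at the cutoff may keep slightly more than k tokens.)"""
    kth = sorted(logits, reverse=True)[k - 1]
    return [x if x >= kth else float("-inf") for x in logits]

def top_p_filter(logits, p):
    """Nucleus filtering: keep the smallest set of tokens whose
    cumulative probability reaches p; mask everything else."""
    probs = softmax(logits)
    order = sorted(range(len(logits)), key=lambda i: probs[i], reverse=True)
    keep, cum = set(), 0.0
    for i in order:
        keep.add(i)
        cum += probs[i]
        if cum >= p:
            break
    return [x if i in keep else float("-inf") for i, x in enumerate(logits)]
```

After filtering, sampling proceeds over the surviving tokens; the two filters are often composed, with top-k applied before top-p.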

Key Players

  • OpenAI, Google/DeepMind, Anthropic, Meta; open-source stacks like vLLM and Hugging Face Transformers

Challenges

  • Trade-offs: coherence vs. diversity; speed vs. quality
  • Maintaining valid JSON and other structured output while sampling
  • Keeping decoding stable over very long contexts

Reference Architectures

  • Assisted Generation / speculative draft model + verifier main model
  • Single-model multi-head drafting (Medusa-style)
  • Serving layers with paged KV caches and batching
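The first architecture, a draft model plus a verifier main model, can be sketched with stand-in functions for the two models. This simplified variant accepts a drafted token only when the target's greedy choice agrees exactly; real speculative decoding accepts probabilistically against the two models' distributions, and the target scores all drafted positions in a single forward pass rather than one call per token.

```python
def speculative_step(target_next, draft_next, context, k=4):
    """One draft-and-verify step (greedy agreement variant).

    target_next / draft_next map a token sequence to that model's
    next token (stand-ins for real model calls). The cheap draft
    proposes k tokens; the target accepts the longest agreeing
    prefix, then emits its own token at the first disagreement.
    """
    # Draft phase: propose k tokens autoregressively with the cheap model.
    drafted, ctx = [], list(context)
    for _ in range(k):
        t = draft_next(ctx)
        drafted.append(t)
        ctx.append(t)

    # Verify phase: accept drafted tokens while the target agrees.
    accepted, ctx = [], list(context)
    for t in drafted:
        if target_next(ctx) == t:   # agreement: token accepted "for free"
            accepted.append(t)
            ctx.append(t)
        else:                       # first disagreement: stop and correct
            break
    accepted.append(target_next(ctx))
    return accepted
```

Each step thus emits at least one target-quality token, and up to k+1 when the draft tracks the target well, which is where the throughput gain comes from.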

Opportunities

  • Adaptive decoding policies conditioned on prompt/task
  • Structured/constrained decoding with JSON Schema
  • Unified draft-verify across text/image/audio tokens
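Structured/constrained decoding reduces to the same masking idea as top-k/top-p: at each step, disallow every token the constraint forbids. A toy sketch with a hypothetical word-level vocabulary; real systems derive the allowed set from a JSON Schema or grammar state machine at each step rather than from a fixed predicate.

```python
def constrain(logits, vocab, allowed):
    """Mask logits so only tokens permitted by the constraint survive.
    vocab[i] is the string for token i; `allowed` is a predicate
    standing in for a schema- or grammar-driven token filter."""
    return [x if allowed(tok) else float("-inf")
            for x, tok in zip(logits, vocab)]

# Hypothetical vocabulary; suppose the schema requires a JSON boolean here.
vocab = ["true", "false", "null", "maybe"]
logits = [0.1, 0.2, 0.3, 5.0]
masked = constrain(logits, vocab, lambda t: t in {"true", "false"})
best = vocab[max(range(len(masked)), key=lambda i: masked[i])]
print(best)  # -> false
```

Even though "maybe" has the highest raw logit, the constraint guarantees the output remains schema-valid.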

Design Checklist & Acceptance Criteria

  • Expose decoding knobs (temperature, top-p, top-k, presence/frequency penalties)
  • Provide deterministic paths (greedy, low temperature) for evals
  • Support structured outputs via constrained decoding/JSON Schema
  • Measure speedups and quality of experience (QoE) under speculative settings on target hardware
  • Verify safety/format compliance in streaming mode
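The first two checklist items can be captured in a small config object. The names below are a hypothetical knob set (not any particular API), with temperature 0 conventionally mapping to the deterministic greedy path used for evals.

```python
from dataclasses import dataclass

@dataclass
class DecodingConfig:
    """Hypothetical decoding knob set mirroring the checklist above."""
    temperature: float = 1.0
    top_p: float = 1.0            # 1.0 = nucleus filtering disabled
    top_k: int = 0                # 0 = top-k filtering disabled
    presence_penalty: float = 0.0
    frequency_penalty: float = 0.0

    def is_deterministic(self) -> bool:
        """Greedy path for reproducible evals: temperature of exactly 0."""
        return self.temperature == 0.0

eval_config = DecodingConfig(temperature=0.0)
print(eval_config.is_deterministic())  # -> True
```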

References

  • Title: Assisted Generation: a new direction toward low-latency text generation URL: https://huggingface.co/blog/assisted-generation Publisher/Vendor: Hugging Face Accessed: 2025-08-14 Version_or_release: 2023-05 (blog)
  • Title: Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads URL: https://arxiv.org/abs/2401.10774 Publisher/Vendor: arXiv Accessed: 2025-08-14 Version_or_release: 2024-01 (preprint)
  • Title: EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty URL: https://arxiv.org/abs/2401.15077 Publisher/Vendor: arXiv Accessed: 2025-08-14 Version_or_release: 2024-01 (preprint)
  • Title: Recurrent Drafter for Fast Speculative Decoding in Large Language Models (ReDrafter) URL: https://arxiv.org/abs/2403.09919 Publisher/Vendor: arXiv Accessed: 2025-08-14 Version_or_release: 2024-03 (preprint)
  • Title: vLLM: PagedAttention and Efficient LLM Serving URL: https://vllm.ai/ Publisher/Vendor: vLLM Project Accessed: 2025-08-14 Version_or_release: provider_reported