Public Preview

Tokenization and Context

Definition

Tokenization converts raw inputs into model-consumable units. For text, tokens are subwords/bytes (e.g., BPE, Unigram); for code, byte-level schemes are common; for multimodal models, non-text inputs are serialized (e.g., patches, embeddings) alongside text tokens.
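The text-to-IDs step can be illustrated with a minimal sketch. The toy vocabulary and greedy longest-match strategy below are illustrative assumptions only; real BPE/Unigram tokenizers learn their vocabulary from data, but the mechanics of mapping text to integer IDs are similar.

```python
# Minimal sketch: greedy longest-match subword tokenization over a toy
# vocabulary (illustrative; not any vendor's actual tokenizer).
VOCAB = {"token": 1, "iz": 2, "ation": 3, "t": 4, "o": 5, "k": 6,
         "e": 7, "n": 8, "a": 9, "i": 10, "z": 11}

def tokenize(text: str) -> list[int]:
    ids, i = [], 0
    while i < len(text):
        # Take the longest vocabulary entry that matches at position i.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in VOCAB:
                ids.append(VOCAB[piece])
                i = j
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return ids

print(tokenize("tokenization"))  # "token" + "iz" + "ation" -> [1, 2, 3]
```

Production tokenizers avoid the OOV error path above via byte fallback, discussed below.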

Why It Matters

  • Token count and granularity drive throughput, cost, and output quality
  • Determines cross-lingual handling and robustness to out-of-vocabulary (OOV) strings
  • Enables very-long-context workflows (summarization, agents, retrieval-augmented generation)
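The cost point is easy to quantify with back-of-envelope arithmetic. The tokens-per-word ratio and price below are hypothetical placeholders, not any provider's quote.

```python
# Back-of-envelope: token count drives per-request cost.
# Both numbers below are illustrative assumptions.
words = 50_000                  # e.g., a long report
tokens = int(words * 1.3)       # rough English tokens-per-word ratio
price_per_1k_input = 0.003      # hypothetical $ per 1k input tokens
cost = tokens / 1000 * price_per_1k_input
print(f"{tokens} tokens -> ${cost:.3f} per pass")
```

The same ratio can be several times higher for under-represented languages or dense code, which is why per-language validation appears in the checklist below.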

2025 State of the Art

  • Subword and byte-level tokenizers dominate: SentencePiece (Unigram/BPE) and byte-level BPE with byte fallback (e.g., tiktoken).
  • Providers ship expanded context (≥200k tokens; some SKUs advertise up to 1M–2M). Practical limits vary by tier and runtime.
  • Multimodal tokenization mixes text with image/video/audio descriptors; vendors expose fewer public details.
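Byte fallback, mentioned above, is what keeps these tokenizers robust to arbitrary input. A minimal sketch, assuming a made-up vocab where IDs 0–255 are reserved for raw bytes:

```python
# Sketch of byte fallback: pieces missing from the subword vocab are
# encoded as their UTF-8 bytes (IDs 0-255), so no string is ever OOV.
# The vocab and ID layout are assumptions for illustration.
SUBWORDS = {"hello": 300, "world": 301}  # subword IDs start above 255

def encode_with_fallback(piece: str) -> list[int]:
    if piece in SUBWORDS:
        return [SUBWORDS[piece]]
    return list(piece.encode("utf-8"))

print(encode_with_fallback("hello"))  # [300]
print(encode_with_fallback("héllo"))  # raw UTF-8 bytes, incl. 0xC3 0xA9
```

This is why byte-fallback tokenizers never fail on emoji, rare scripts, or binary-ish strings, at the price of longer token sequences for such input.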

Key Players

  • OpenAI (tiktoken), Google (SentencePiece, Gemini 1.5 context), Anthropic (Claude 3.x/3.5/3.7 context), Meta (Llama 3.x tokenizer)

Challenges

  • Token count variance across languages and code
  • Latency/memory scale with sequence length; KV cache growth
  • Mismatch between advertised and effective usable context under latency/SLA constraints
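The KV-cache point can be made concrete: per token, each layer stores one key and one value vector per KV head, so memory grows linearly with sequence length. The model shape below is a hypothetical Llama-like configuration, not a published spec.

```python
# KV cache grows linearly with sequence length.
# Hypothetical model shape (assumption, not a published spec):
layers, kv_heads, head_dim, bytes_per = 32, 8, 128, 2  # fp16

def kv_cache_bytes(seq_len: int) -> int:
    # Factor 2 covers both the key and the value tensor.
    return 2 * layers * kv_heads * head_dim * bytes_per * seq_len

for n in (8_000, 128_000, 1_000_000):
    print(f"{n:>9} tokens -> {kv_cache_bytes(n) / 2**30:.1f} GiB")
```

At roughly 128 KiB per token in this configuration, a million-token context needs on the order of 100 GiB of KV cache, which is why advertised and effective usable context diverge.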

Reference Architectures

  • Transformer encoders/decoders with byte/subword tokenizers
  • Attention memory optimizations (e.g., PagedAttention) for long context
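The PagedAttention idea can be sketched in a few lines: the KV cache lives in fixed-size physical blocks, and each sequence keeps a block table of indices, so memory is allocated on demand instead of being reserved for the maximum length. Block size and class names below are illustrative, not vLLM's actual API.

```python
# Sketch of paged KV-cache allocation (illustrative, not vLLM's API).
BLOCK = 16  # tokens per physical block (assumed value)

class BlockTable:
    def __init__(self, free: list[int]):
        self.free, self.blocks, self.length = free, [], 0

    def append_token(self) -> None:
        if self.length % BLOCK == 0:             # current block is full
            self.blocks.append(self.free.pop())  # claim a physical block
        self.length += 1

free_blocks = list(range(100))   # pool shared by all sequences
seq = BlockTable(free_blocks)
for _ in range(40):
    seq.append_token()
print(seq.blocks)  # 40 tokens at 16 per block -> 3 physical blocks
```

Because blocks are claimed lazily from a shared pool, short sequences no longer strand memory that was reserved for a worst-case context length.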

Opportunities

  • Language-agnostic, compression-aware tokenizers
  • Mixed-modality packing for lower overhead
  • Context management policies (summarization, windowing, eviction)
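A simple context-management policy combining the three bullets above can be sketched as: keep the system prompt, slide a window over recent turns, and replace evicted turns with a summary stub. The summary string here is a stand-in for a model call; the function name and budget semantics are assumptions.

```python
# Sketch of a context-management policy (hypothetical helper):
# keep the system prompt, window the recent turns, summarize the rest.
def manage_context(turns: list[str], budget: int, keep_recent: int) -> list[str]:
    if len(turns) <= budget:
        return turns
    system, rest = turns[0], turns[1:]
    evicted, recent = rest[:-keep_recent], rest[-keep_recent:]
    # Stand-in for an actual summarization call on the evicted turns.
    summary = f"[summary of {len(evicted)} earlier turns]"
    return [system, summary, *recent]

history = ["SYSTEM", "t1", "t2", "t3", "t4", "t5", "t6"]
print(manage_context(history, budget=5, keep_recent=3))
```

In practice the budget would be measured in tokens rather than turns, using the tokenizer-counting approach from the checklist below.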

Design Checklist & Acceptance Criteria

  • Choose tokenizer with byte fallback and stable IDs across versions
  • Validate tokenization cost across target languages and code samples
  • Confirm effective context at target latency on chosen runtime
  • For multimodal, specify image/video/audio pre-tokenization steps
  • Document upgrade policy for tokenizer merges/vocab
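The per-language validation item can be automated with a small harness. `count_tokens` below is a crude whitespace-and-punctuation proxy for illustration only; in a real check you would swap in your actual tokenizer (e.g., a tiktoken or SentencePiece encode) and a larger parallel corpus.

```python
# Sketch of a token-cost validation harness across languages.
# count_tokens is a crude proxy; swap in the real tokenizer in practice.
import re

def count_tokens(text: str) -> int:
    return len(re.findall(r"\w+|[^\w\s]", text, re.UNICODE))

samples = {
    "en": "The quick brown fox.",
    "de": "Der schnelle braune Fuchs.",
}
baseline = count_tokens(samples["en"])
for lang, text in samples.items():
    ratio = count_tokens(text) / baseline
    print(f"{lang}: {count_tokens(text)} tokens ({ratio:.2f}x vs en)")
```

Flagging languages whose ratio exceeds an agreed threshold (say, 2x the English baseline) turns the checklist item into an acceptance criterion a CI job can enforce.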

References

  • tiktoken. OpenAI. https://github.com/openai/tiktoken (accessed 2025-08-14).
  • SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing. Google. https://github.com/google/sentencepiece (accessed 2025-08-14).
  • Meta Llama 3. Meta, April 2024 (blog). https://ai.meta.com/blog/meta-llama-3/ (accessed 2025-08-14).
  • Gemini models and context. Google. https://ai.google.dev/gemini-api/docs/models/gemini (accessed 2025-08-14).
  • Claude models overview (context and features). Anthropic. https://docs.anthropic.com/en/docs/about-claude/models (accessed 2025-08-14).