Tokenization and Context
Definition
Tokenization converts raw inputs into model-consumable units. For text, tokens are subwords/bytes (e.g., BPE, Unigram); for code, byte-level schemes are common; for multimodal models, non-text inputs are serialized (e.g., patches, embeddings) alongside text tokens.
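The byte-fallback idea can be illustrated with a toy tokenizer (a minimal sketch, not any vendor's actual scheme; the vocabulary and ID layout here are invented for illustration): any out-of-vocabulary string can always be emitted as its UTF-8 bytes, so no input is unrepresentable.

```python
def byte_fallback_tokens(text: str, vocab: dict) -> list:
    """Toy tokenizer: whole-word lookup, falling back to UTF-8 bytes.

    Byte IDs occupy 0-255; known words get IDs >= 256 (toy layout).
    """
    ids = []
    for word in text.split():
        if word in vocab:
            ids.append(vocab[word])
        else:
            # Unknown string: emit its raw UTF-8 bytes, one token each.
            ids.extend(word.encode("utf-8"))
    return ids

vocab = {"hello": 256, "world": 257}
print(byte_fallback_tokens("hello naïve", vocab))
# "naïve" is OOV, so it becomes six byte tokens (ï is two UTF-8 bytes).
```

Real byte-level BPE tokenizers additionally merge frequent byte pairs into larger units; the fallback path above is only the guarantee that nothing is unencodable.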
Why It Matters
- Affects throughput, cost, and quality via token count and granularity
- Determines cross-lingual handling and robustness to out-of-vocabulary (OOV) strings
- Enables very-long-context workflows (summarization, agents, RAG)
2025 State of the Art
- Subword and byte-level tokenizers dominate: SentencePiece (Unigram/BPE), byte-level BPE with byte fallback (as in tiktoken).
- Providers ship expanded context (≥200k tokens; some SKUs advertise up to 1M–2M). Practical limits vary by tier and runtime.
- Multimodal tokenization mixes text with image/video/audio descriptors; vendors expose fewer public details.
Key Players
- OpenAI (tiktoken), Google (SentencePiece, Gemini 1.5 context), Anthropic (Claude 3.x/3.5/3.7 context), Meta (Llama 3.x tokenizer)
Challenges
- Token count variance across languages and code
- Latency and memory scale with sequence length: attention cost grows with context, and the KV cache grows linearly per token
- Mismatch between advertised and effective usable context under latency/SLA constraints
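The KV-cache growth above can be estimated with simple arithmetic (a sketch; the layer/head counts below are an illustrative 70B-class shape with grouped-query attention, not any specific model's published configuration):

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    # K and V each store seq_len * n_kv_heads * head_dim elements per layer,
    # hence the factor of 2; bytes_per_elem=2 assumes fp16/bf16 storage.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Illustrative shape: 80 layers, 8 KV heads (GQA), head_dim 128, fp16.
gib = kv_cache_bytes(seq_len=128_000, n_layers=80, n_kv_heads=8, head_dim=128) / 2**30
print(f"{gib:.1f} GiB per sequence")  # ~39 GiB for a single 128k-token sequence
```

Numbers like this are why effective context under latency/SLA constraints often falls short of advertised limits: the cache must fit in accelerator memory alongside weights and activations.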
Reference Architectures
- Transformer encoders/decoders with byte/subword tokenizers
- Attention memory optimizations (PagedAttention) for long context
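The core idea behind PagedAttention can be sketched as a block table mapping a sequence's logical token positions to fixed-size physical KV blocks allocated on demand (a simplified illustration of the allocation scheme, not the vLLM implementation; the block size is illustrative):

```python
BLOCK_SIZE = 16  # tokens per physical KV block (illustrative)

class BlockTable:
    """Maps a sequence's logical blocks to physical block IDs on demand."""
    def __init__(self, free_blocks):
        self.free = list(free_blocks)   # pool of physical block IDs
        self.table = []                 # logical block index -> physical ID

    def append_token(self, position):
        # Allocate a new physical block only when a block boundary is crossed,
        # so memory is committed in BLOCK_SIZE-token increments, not per token.
        if position // BLOCK_SIZE >= len(self.table):
            self.table.append(self.free.pop())

seq = BlockTable(free_blocks=range(100))
for pos in range(40):            # 40 tokens need ceil(40/16) = 3 blocks
    seq.append_token(pos)
print(seq.table)
```

Because blocks need not be contiguous, fragmentation and over-reservation are avoided, which is what makes very long contexts practical to serve.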
Opportunities
- Language-agnostic, compression-aware tokenizers
- Mixed-modality packing for lower overhead
- Context management policies (summarization, windowing, eviction)
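A context-management policy of the kind listed above (keep a pinned prefix such as the system prompt, evict the oldest middle turns, retain the recent window) might be sketched as follows; `count_tokens` stands in for whatever tokenizer-specific counting function the deployment uses:

```python
def fit_context(messages, budget, count_tokens, pinned=1):
    """Drop oldest non-pinned messages until the token budget is met.

    `count_tokens` is a caller-supplied, tokenizer-specific function;
    the first `pinned` messages (e.g., the system prompt) are never evicted.
    """
    kept = list(messages)
    while sum(count_tokens(m) for m in kept) > budget and len(kept) > pinned + 1:
        del kept[pinned]  # evict the oldest message after the pinned prefix
    return kept

msgs = ["SYS", "turn1", "turn2", "turn3"]
print(fit_context(msgs, budget=3, count_tokens=lambda m: 1))
# keeps the system prompt and the most recent turns
```

Summarization-based policies replace the evicted span with a condensed message instead of dropping it outright; the budget arithmetic is the same.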
Design Checklist & Acceptance Criteria
- Choose tokenizer with byte fallback and stable IDs across versions
- Validate tokenization cost across target languages and code samples
- Confirm effective context at target latency on chosen runtime
- For multimodal, specify image/video/audio pre-tokenization steps
- Document upgrade policy for tokenizer merges/vocab
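The per-language cost validation in the checklist can be automated with a small harness; `tokenize` below stands in for whatever tokenizer is under evaluation, and the byte-level stand-in shows the worst case (pure byte fallback, no merges). Tokens per character is one common efficiency metric; bytes per token is another.

```python
def tokens_per_char(samples, tokenize):
    """Report mean tokens-per-character by language; higher means costlier."""
    return {lang: sum(len(tokenize(t)) for t in texts)
                  / sum(len(t) for t in texts)
            for lang, texts in samples.items()}

# Stand-in tokenizer: one token per UTF-8 byte (worst case, byte fallback only).
byte_tok = lambda s: list(s.encode("utf-8"))

report = tokens_per_char({"en": ["hello world"], "ja": ["こんにちは"]}, byte_tok)
print(report)
# Japanese costs ~3x per character under pure byte fallback, since each
# character is 3 UTF-8 bytes; a trained subword vocabulary narrows the gap.
```

Running this over representative corpora for each target language (and code) before committing to a tokenizer surfaces cost asymmetries early.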
References
- Title: tiktoken URL: https://github.com/openai/tiktoken Publisher/Vendor: OpenAI Accessed: 2025-08-14 Version_or_release: provider_reported
- Title: SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing URL: https://github.com/google/sentencepiece Publisher/Vendor: Google Accessed: 2025-08-14 Version_or_release: provider_reported
- Title: Meta Llama 3 URL: https://ai.meta.com/blog/meta-llama-3/ Publisher/Vendor: Meta Accessed: 2025-08-14 Version_or_release: 2024-04 (blog)
- Title: Gemini models and context URL: https://ai.google.dev/gemini-api/docs/models/gemini Publisher/Vendor: Google Accessed: 2025-08-14 Version_or_release: provider_reported
- Title: Claude models overview (context and features) URL: https://docs.anthropic.com/en/docs/about-claude/models Publisher/Vendor: Anthropic Accessed: 2025-08-14 Version_or_release: provider_reported