Tokenization and Context
Definition
Tokenization converts raw inputs into model-consumable units. For text, tokens are subwords/bytes (e.g., BPE, Unigram); for code, byte-level schemes are common; for multimodal models, non-text inputs are serialized (e.g., patches, embeddings) alongside text tokens.
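The byte-fallback idea can be illustrated with a toy tokenizer (a minimal sketch, not any vendor's actual scheme; the vocabulary and ID layout here are invented for illustration): any out-of-vocabulary string can always be emitted as its UTF-8 bytes, so no input is unrepresentable.

```python
def byte_fallback_tokens(text: str, vocab: dict) -> list:
    """Toy tokenizer: whole-word lookup, falling back to UTF-8 bytes.

    Byte IDs occupy 0-255; known words get IDs >= 256 (toy layout).
    """
    ids = []
    for word in text.split():
        if word in vocab:
            ids.append(vocab[word])
        else:
            # Unknown string: emit its raw UTF-8 bytes, one token each.
            ids.extend(word.encode("utf-8"))
    return ids

vocab = {"hello": 256, "world": 257}
print(byte_fallback_tokens("hello naïve", vocab))
# "naïve" is OOV, so it becomes six byte tokens (ï is two UTF-8 bytes).
```

Real byte-level BPE tokenizers additionally merge frequent byte pairs into larger units; the fallback path above is only the guarantee that nothing is unencodable.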
Why It Matters
- Affects throughput, cost, and quality via token count and granularity
- Determines cross-lingual handling and robustness to out-of-vocabulary (OOV) strings
- Enables very-long-context workflows (summarization, agents, RAG)
2025 State of the Art
- Subword and byte-level tokenizers dominate: SentencePiece (Unigram/BPE), byte-level BPE with byte fallback (as in tiktoken).
- Providers ship expanded context (≥200k tokens; some SKUs advertise up to 1M–2M). Practical limits vary by tier and runtime.
- Multimodal tokenization mixes text with image/video/audio descriptors; vendors expose fewer public details.
Key Players
- OpenAI (tiktoken), Google (SentencePiece, Gemini 1.5 context), Anthropic (Claude 3.x/3.5/3.7 context), Meta (Llama 3.x tokenizer)
Challenges
- Token count variance across languages and code
- Latency and memory scale with sequence length: attention cost grows with context, and the KV cache grows linearly per token
- Mismatch between advertised and effective usable context under latency/SLA constraints
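The KV-cache growth above can be estimated with simple arithmetic (a sketch; the layer/head counts below are an illustrative 70B-class shape with grouped-query attention, not any specific model's published configuration):

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    # K and V each store seq_len * n_kv_heads * head_dim elements per layer,
    # hence the factor of 2; bytes_per_elem=2 assumes fp16/bf16 storage.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Illustrative shape: 80 layers, 8 KV heads (GQA), head_dim 128, fp16.
gib = kv_cache_bytes(seq_len=128_000, n_layers=80, n_kv_heads=8, head_dim=128) / 2**30
print(f"{gib:.1f} GiB per sequence")  # ~39 GiB for a single 128k-token sequence
```

Numbers like this are why effective context under latency/SLA constraints often falls short of advertised limits: the cache must fit in accelerator memory alongside weights and activations.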
Reference Architectures
- Transformer encoders/decoders with byte/subword tokenizers
- Attention memory optimizations (PagedAttention) for long context
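The core idea behind PagedAttention can be sketched as a block table mapping a sequence's logical token positions to fixed-size physical KV blocks allocated on demand (a simplified illustration of the allocation scheme, not the vLLM implementation; the block size is illustrative):

```python
BLOCK_SIZE = 16  # tokens per physical KV block (illustrative)

class BlockTable:
    """Maps a sequence's logical blocks to physical block IDs on demand."""
    def __init__(self, free_blocks):
        self.free = list(free_blocks)   # pool of physical block IDs
        self.table = []                 # logical block index -> physical ID

    def append_token(self, position):
        # Allocate a new physical block only when a block boundary is crossed,
        # so memory is committed in BLOCK_SIZE-token increments, not per token.
        if position // BLOCK_SIZE >= len(self.table):
            self.table.append(self.free.pop())

seq = BlockTable(free_blocks=range(100))
for pos in range(40):            # 40 tokens need ceil(40/16) = 3 blocks
    seq.append_token(pos)
print(seq.table)
```

Because blocks need not be contiguous, fragmentation and over-reservation are avoided, which is what makes very long contexts practical to serve.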
Opportunities
- Language-agnostic, compression-aware tokenizers
- Mixed-modality packing for lower overhead
- Context management policies (summarization, windowing, eviction)
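A context-management policy of the kind listed above (keep a pinned prefix such as the system prompt, evict the oldest middle turns, retain the recent window) might be sketched as follows; `count_tokens` stands in for whatever tokenizer-specific counting function the deployment uses:

```python
def fit_context(messages, budget, count_tokens, pinned=1):
    """Drop oldest non-pinned messages until the token budget is met.

    `count_tokens` is a caller-supplied, tokenizer-specific function;
    the first `pinned` messages (e.g., the system prompt) are never evicted.
    """
    kept = list(messages)
    while sum(count_tokens(m) for m in kept) > budget and len(kept) > pinned + 1:
        del kept[pinned]  # evict the oldest message after the pinned prefix
    return kept

msgs = ["SYS", "turn1", "turn2", "turn3"]
print(fit_context(msgs, budget=3, count_tokens=lambda m: 1))
# keeps the system prompt and the most recent turns
```

Summarization-based policies replace the evicted span with a condensed message instead of dropping it outright; the budget arithmetic is the same.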
Design Checklist & Acceptance Criteria
- Choose tokenizer with byte fallback and stable IDs across versions
- Validate tokenization cost across target languages and code samples
- Confirm effective context at target latency on chosen runtime
- For multimodal, specify image/video/audio pre-tokenization steps
- Document upgrade policy for tokenizer merges/vocab
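The per-language cost validation in the checklist can be automated with a small harness; `tokenize` below stands in for whatever tokenizer is under evaluation, and the byte-level stand-in shows the worst case (pure byte fallback, no merges). Tokens per character is one common efficiency metric; bytes per token is another.

```python
def tokens_per_char(samples, tokenize):
    """Report mean tokens-per-character by language; higher means costlier."""
    return {lang: sum(len(tokenize(t)) for t in texts)
                  / sum(len(t) for t in texts)
            for lang, texts in samples.items()}

# Stand-in tokenizer: one token per UTF-8 byte (worst case, byte fallback only).
byte_tok = lambda s: list(s.encode("utf-8"))

report = tokens_per_char({"en": ["hello world"], "ja": ["こんにちは"]}, byte_tok)
print(report)
# Japanese costs ~3x per character under pure byte fallback, since each
# character is 3 UTF-8 bytes; a trained subword vocabulary narrows the gap.
```

Running this over representative corpora for each target language (and code) before committing to a tokenizer surfaces cost asymmetries early.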
References
- Title: tiktoken URL: https://github.com/openai/tiktoken Publisher/Vendor: OpenAI Accessed: 2025-08-14 Version_or_release: provider_reported
- Title: SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing URL: https://github.com/google/sentencepiece Publisher/Vendor: Google Accessed: 2025-08-14 Version_or_release: provider_reported
- Title: Meta Llama 3 URL: https://ai.meta.com/blog/meta-llama-3/ Publisher/Vendor: Meta Accessed: 2025-08-14 Version_or_release: 2024-04 (blog)
- Title: Gemini models and context URL: https://ai.google.dev/gemini-api/docs/models/gemini Publisher/Vendor: Google Accessed: 2025-08-14 Version_or_release: provider_reported
- Title: Claude models overview (context and features) URL: https://docs.anthropic.com/en/docs/about-claude/models Publisher/Vendor: Anthropic Accessed: 2025-08-14 Version_or_release: provider_reported