Safety Filtering and Post-Processing
Abstract
Apply automated safety checks and transformations to model outputs before delivery: moderation, redaction, normalization, and policy enforcement.
Motivation
- Reduce risk of harmful or disallowed content
- Enforce compliance requirements (PII handling, intellectual property, NSFW content)
- Improve robustness via normalization and guard transforms
Architectures
- Inline moderation API calls (text/image/video/audio)
- Rule engines and classifiers; allow/deny with reasons
- Redaction (PII), watermark detection where available
- Post-generation correction: regex fixes, JSON validation, policy-driven content strips
Design Choices
- Model-native moderation vs. external services
- Thresholds and actions (block, mask, review)
- False-positive/negative trade-offs; user appeal flows
- Region/data handling (no retention, logging governance)
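The threshold-and-action choice can be made explicit as a small policy table. The numeric thresholds and the per-category override below are placeholders; real values come from policy review and measured false-positive/negative trade-offs, not these defaults.

```python
# Illustrative global thresholds (hypothetical values).
THRESHOLDS = {"block": 0.90, "review": 0.60, "mask": 0.40}

# Per-category overrides (hypothetical): e.g. block self-harm at a lower score.
OVERRIDES = {"self-harm": {"block": 0.70}}

def decide_action(category: str, score: float) -> str:
    """Map a classifier score for a category to an enforcement action,
    checked in order of severity: block > review > mask > allow."""
    t = {**THRESHOLDS, **OVERRIDES.get(category, {})}
    if score >= t["block"]:
        return "block"
    if score >= t["review"]:
        return "review"
    if score >= t["mask"]:
        return "mask"
    return "allow"
```

Keeping the table as data rather than branching logic makes it easy to localize per region and to log the exact policy version alongside each decision.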
Pros/Cons
- Pros: Reduced risk, auditability, consistent policy
- Cons: Latency cost, possible overblocking, maintenance burden
Evaluation Metrics
- Precision/recall on labeled safety sets; violation rate
- Time-to-decision, reviewer workload
- Policy coverage by category (hate, violence, self-harm, sexual content, etc.)
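Precision, recall, and violation rate over a labeled safety set reduce to a few counts. A minimal sketch, assuming boolean flags where `True` means "violating/flagged":

```python
def safety_metrics(predictions, labels):
    """Precision/recall for the violation class over a labeled safety set.

    predictions, labels: equal-length sequences of booleans
    (True = flagged by the filter / labeled as a violation).
    """
    tp = sum(p and y for p, y in zip(predictions, labels))
    fp = sum(p and not y for p, y in zip(predictions, labels))
    fn = sum(not p and y for p, y in zip(predictions, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Base rate of violations in the eval set, for context when reading the above.
    violation_rate = sum(labels) / len(labels)
    return {"precision": precision, "recall": recall,
            "violation_rate": violation_rate}
```

In practice these would be computed per policy category, since coverage gaps hide inside aggregate numbers.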
Vendor/Tooling
- OpenAI: omni-moderation-latest for text/image
- Google: Gemini Safety and policy docs
- Microsoft: Azure AI Content Safety (text/image/video) with categories
- Anthropic: Safety spec and guidance; model guardrails
Design Checklist
- Define categories, thresholds, and escalation paths
- Log decisions with reasons; support appeals
- Localize policies for regions; document data handling
- Test with red-team prompts and regularly refresh sets
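The red-team testing item can be automated as a regression check that runs on every policy or model change. The `moderate` function below is a stub standing in for a real moderation backend, and the red-team cases are placeholders; the point is the harness shape, not the classifier.

```python
# Hypothetical red-team regression set: each case pairs a prompt with the
# decision the filter is expected to make.
RED_TEAM_SET = [
    {"prompt": "<known jailbreak>", "expected": "block"},
    {"prompt": "benign question", "expected": "allow"},
]

def moderate(prompt: str) -> str:
    """Stub classifier standing in for a real moderation backend."""
    return "block" if "jailbreak" in prompt else "allow"

def run_regression(cases):
    """Return every case where the current filter disagrees with the
    expected decision; an empty list means no regressions."""
    return [c for c in cases if moderate(c["prompt"]) != c["expected"]]
```

Wiring this into CI, and refreshing `RED_TEAM_SET` on a schedule, keeps the checklist item testable rather than aspirational.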
References
- Safety Spec and Moderation — OpenAI. https://platform.openai.com/docs/guides/safety-best-practices (accessed 2025-08-14, provider-reported)
- omni-moderation-latest — OpenAI. https://platform.openai.com/docs/models/omni-moderation-latest (accessed 2025-08-14, provider-reported)
- Safety guidance (Gemini API) — Google. https://ai.google.dev/gemini-api/docs/safety (accessed 2025-08-14, provider-reported)
- Azure AI Content Safety — Microsoft. https://learn.microsoft.com/azure/ai-services/content-safety/overview (accessed 2025-08-14, provider-reported)
- Safety overview — Anthropic. https://docs.anthropic.com/en/docs/build-with-claude/safety (accessed 2025-08-14, provider-reported)