Guardrails
Production safety filters
Runtime layers that inspect, control, and, when needed, block LLM outputs. They provide safety the model alone can't reach.
However well-aligned a model is, it won't produce 100% safe output in production. Guardrails are inspection layers wrapped around the model: the LLM generates, the guardrail checks, and problematic output gets blocked, rewritten, or regenerated.
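A minimal sketch of that generate → check → block/regenerate loop. Everything here is illustrative: `llm_generate` is a stub standing in for a real model call, and the banned-phrase check is a deliberately trivial guardrail.

```python
from dataclasses import dataclass

BANNED_PHRASES = ("internal use only", "confidential")  # toy policy

@dataclass
class Verdict:
    ok: bool
    reason: str = ""

def llm_generate(prompt: str) -> str:
    # Stub standing in for a real model call.
    return f"Draft answer to: {prompt}"

def check(text: str) -> Verdict:
    # Deliberately trivial guardrail: block outputs containing banned phrases.
    for phrase in BANNED_PHRASES:
        if phrase in text.lower():
            return Verdict(ok=False, reason=f"banned phrase: {phrase!r}")
    return Verdict(ok=True)

def guarded_generate(prompt: str, max_retries: int = 2) -> str:
    for _ in range(max_retries + 1):
        draft = llm_generate(prompt)
        verdict = check(draft)
        if verdict.ok:
            return draft
        # On failure a real system might block, rewrite, or regenerate;
        # here we regenerate with the rejection reason appended.
        prompt += f"\n(Rejected: {verdict.reason}. Try again.)"
    return "Sorry, I can't help with that."  # final fallback: block
```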
Common guardrail types:
- PII detection: leaks of personal data (SSNs, emails, phone numbers, card numbers)
- Toxicity filter: hate, harassment, abuse
- Topic boundary: refuse out-of-scope questions
- Hallucination check: verify factual claims against sources
- Output schema validation: enforce a structured output format (see the sketch below)
- Rate limiting: cap requests per user per minute
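As one concrete example, output schema validation is commonly done with Pydantic. A sketch, assuming the model was asked to return JSON; the `SupportReply` schema is hypothetical:

```python
from pydantic import BaseModel, ValidationError

class SupportReply(BaseModel):
    answer: str
    confidence: float
    sources: list[str]

def validate_output(raw_json: str) -> SupportReply | None:
    # Reject (or trigger regeneration) if the model's JSON doesn't match the schema.
    try:
        return SupportReply.model_validate_json(raw_json)
    except ValidationError:
        return None
```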
Tools: Llama Guard (Meta's open filter model), Guardrails AI, NeMo Guardrails (NVIDIA), Lakera Guard. OpenAI and Anthropic also offer their internal moderation tooling as products.
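For instance, the OpenAI Moderation endpoint can serve as an output filter. A sketch assuming the official `openai` Python SDK; the exact model name may differ from your account's available models:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def is_flagged(text: str) -> bool:
    # Ask the hosted moderation model to classify the candidate output.
    resp = client.moderations.create(
        model="omni-moderation-latest",
        input=text,
    )
    return resp.results[0].flagged
```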
Like the editor layer at a news outlet. The reporter (LLM) writes the story; the editor (guardrail) checks it before publishing: "this is libelous," "that word is sensitive," "cite your source." Instead of trusting the reporter alone, the process gains an extra quality layer.
A bank's customer-support bot. A customer asks "how do I update my SSN on file?" The model writes a long explanation — but accidentally uses another customer's SSN as an example (training-data leak).
A PII guardrail catches this (sketched in code after the steps):
1. Scan the output with an SSN-pattern regex → match found.
2. Mask: replace the match with [SSN].
3. Log the incident + raise an alert.
4. Send the sanitized output to the user.
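A minimal sketch of those four steps. The regex covers the common US SSN format, and the logger call stands in for real alerting:

```python
import logging
import re

SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")  # US SSN pattern
log = logging.getLogger("guardrails.pii")

def mask_ssn(output: str) -> str:
    if not SSN_RE.search(output):                # step 1: scan
        return output
    sanitized = SSN_RE.sub("[SSN]", output)      # step 2: mask
    log.warning("PII guardrail triggered: SSN masked in model output")  # step 3: log + alert
    return sanitized                             # step 4: send sanitized output
```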
Not one line of code but a whole architectural layer — a typical production system runs 3-5 guardrails in sequence. Latency goes up; safety is non-negotiable.
- Production AI products — guardrails aren't optional
- Regulated sectors (finance, health, legal) — compliance requires them
- Public-facing products — malicious usage is inevitable
- Multi-tenant systems — one user's output can't affect another
- Internal-only prototype, single user — overhead not worth it
- Believing one guardrail solves everything (think layered)
- Only filtering input (output must be filtered too)
False positives → broken UX
Overly aggressive filters block harmless answers. Block any mention of 'cancer' and your medical assistant is useless. You need context-aware filters and tuned thresholds.
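A sketch of threshold tuning with a context-aware exception. The `toxicity_score` classifier and the allowlist are hypothetical stand-ins:

```python
def toxicity_score(text: str) -> float:
    # Stub standing in for a real toxicity classifier (returns 0..1).
    return 0.9 if "hate" in text.lower() else 0.1

MEDICAL_TERMS = {"cancer", "tumor", "chemotherapy"}  # allowlisted clinical vocabulary

def should_block(text: str, domain: str, threshold: float = 0.8) -> bool:
    score = toxicity_score(text)
    # Context-aware exception: in a medical assistant, clinical vocabulary
    # alone shouldn't trip the filter, so raise the effective threshold.
    if domain == "medical" and any(t in text.lower() for t in MEDICAL_TERMS):
        threshold = 0.95
    return score >= threshold
```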
Latency adds up
5 guardrails × 100ms = 500ms extra. Streaming makes this harder (do you filter every token?). Plan for an async/parallel filter architecture, as in the sketch below.
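A sketch of running independent filters concurrently with asyncio, so total added latency is roughly the slowest check rather than the sum. The three checks are stubs standing in for real ~100ms filters:

```python
import asyncio

# Each stub returns True if the text passes that filter.
async def pii_check(text: str) -> bool:
    await asyncio.sleep(0.1)
    return True

async def toxicity_check(text: str) -> bool:
    await asyncio.sleep(0.1)
    return True

async def topic_check(text: str) -> bool:
    await asyncio.sleep(0.1)
    return True

async def run_guardrails(text: str) -> bool:
    # Run independent filters concurrently: total added latency is
    # roughly max(latency) (~100ms), not the sum (~300ms).
    results = await asyncio.gather(
        pii_check(text), toxicity_check(text), topic_check(text)
    )
    return all(results)

# asyncio.run(run_guardrails("candidate output"))
```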
Guardrails get jailbroken too
Llama Guard and the OpenAI Moderation API are themselves attackable. Don't rely on a single defense; layer them (defense in depth).