Industry | June 15, 2026 | 5 min read

You Can't Guardrail Your Way to Safety

Keel Editorial Team

Research on AI governance, budgets, and auditability

[Figure: An editorial diagram showing finite AI guardrail rules beside an open prompt space and a verifiable evidence record.]

Most AI safety spending today buys a taller wall: guardrails, filters, policies, a longer list of things the model must not do. The pitch is intuitive — block the bad stuff before it happens. The problem is that a NIST senior scientist has now published a peer-reviewed result showing the wall can never be complete. Not "today's wall is short." No finite wall can catch everything — provably.

The proof, in plain terms

Apostol Vassilev's paper, "Robust AI Security and Alignment: A Sisyphean Endeavor?", builds a formal, information-theoretic argument in the spirit of Gödel's 1931 incompleteness theorems, which showed that no consistent formal system rich enough to express arithmetic can prove every true statement it can express. The move: treat a guardrail as a finite "checker," and show, using the same diagonalization idea Gödel used, that for any such checker there exists an input it fails to catch. Intuitively, the rules are finite and the ways to phrase a prompt are unbounded, so some prompt always slips through. He states the results as theorems and proves that they hold regardless of the AI's architecture or the language used to prompt it.

Two qualifications the headlines tend to drop — and both matter for what you do next. First, it's a result about completeness, not futility: the proof shows no checker catches everything, but it pointedly "does not give any recipes to attackers for how to construct adversarial prompts." Guardrails still raise the cost of an attack; they just can't be airtight. Second, the work extends Gödel's idea rather than reducing to it — formal and peer-reviewed, but Gödel-inspired, not "Gödel's theorem proves your filter fails."

If you've ever watched a spam filter lose to "fr€e m0ney," or a tight contract lose to a clever reader, you already feel the shape of it: language is too slippery to fully pin down. The difference is that this time someone wrote it down as a theorem.
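The spam-filter intuition is easy to reproduce. The sketch below is a toy illustration, not the construction from the paper: the blocklist, the `checker_blocks` function, and the `obfuscate` helper are all hypothetical names invented for this example. It shows a finite phrase list catching the plain wording and missing a look-alike spelling of the same message.

```python
# Hypothetical finite rule set: the "wall".
BLOCKLIST = {"free money", "wire transfer"}


def checker_blocks(text: str) -> bool:
    """Return True if any blocklisted phrase appears in the lowercased text."""
    lowered = text.lower()
    return any(phrase in lowered for phrase in BLOCKLIST)


def obfuscate(text: str) -> str:
    """Swap a few letters for look-alike characters, in the style of "fr€e m0ney"."""
    return text.translate(str.maketrans({"e": "€", "o": "0"}))


original = "Claim your free money now"
variant = obfuscate(original)

caught_original = checker_blocks(original)  # the plain phrasing matches a rule
caught_variant = checker_blocks(variant)    # the look-alike phrasing matches none

print(f"{original!r}: blocked={caught_original}")
print(f"{variant!r}: blocked={caught_variant}")
```

Adding the new spelling to the blocklist closes this one gap and leaves the next substitution open, which is the pattern the theorem generalizes: each patch raises the cost of an attack without making the list complete.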

Why this should change how you spend

NIST itself drew the operational conclusion: move away from a "one and done" security model toward continuous monitoring, testing, and updating. If no wall can be complete, the questions that matter move downstream:

When something gets past the guardrail — and something will — do you find out?

Can you reconstruct exactly what the agent was allowed to do, and what it actually did?

Can you show that record to a regulator, an auditor, or a customer — and have them believe it?

Safety stops being a wall and becomes a discipline: continuous attention, and an honest record you can stand behind afterward. The usual advice gets as far as "monitor continuously, log your prompts." That's the right direction, but logs written to your own system count as evidence only if everyone already trusts that system. The moment the question is adversarial (a breach, a dispute, an audit), "trust our logs" is exactly what's in doubt.

The honest answer: prove it, don't promise it

Here's the line we won't cross: nothing makes guardrails complete, and we're not claiming to. Keel doesn't sell a taller wall, and it doesn't "solve" a problem a NIST scientist just showed is unsolvable in the limit.

What Keel does is the downstream half that the result points to, built to hold up when trust is the thing in question. For every governed action, Keel produces a record of what the agent was authorized to do and what it actually did, bound together and anchored so it can't be quietly rewritten later, and verifiable by a third party without trusting Keel, the model vendor, or the operator. It is not a claim that nothing bad happened. It is a record of what happened that survives the people who made it being wrong, motivated, or gone. The record is scope-faithful: it proves what was in scope, and doesn't pretend to prove the absence of everything else.
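One common way to make a record hard to rewrite quietly is a hash chain, where each entry commits to the one before it. The sketch below is a generic illustration of that idea under our own assumptions, not a description of Keel's implementation: the field names (`authorized`, `performed`, `prev_hash`) and the functions `append_record` and `verify_chain` are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first record


def record_hash(body: dict) -> str:
    """Hash a record body using a canonical (sorted, compact) JSON encoding."""
    encoded = json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def append_record(chain: list, authorized: list, performed: str) -> dict:
    """Bind what was authorized to what was done, and link to the previous record."""
    body = {
        "authorized": sorted(authorized),
        "performed": performed,
        "prev_hash": chain[-1]["hash"] if chain else GENESIS,
    }
    record = {**body, "hash": record_hash(body)}
    chain.append(record)
    return record


def verify_chain(chain: list) -> bool:
    """Recompute every hash and link using only the records themselves."""
    prev = GENESIS
    for record in chain:
        body = {k: v for k, v in record.items() if k != "hash"}
        if body["prev_hash"] != prev or record_hash(body) != record["hash"]:
            return False
        prev = record["hash"]
    return True


chain = []
append_record(chain, ["read:invoices"], "read:invoices")
append_record(chain, ["read:invoices"], "send:email")  # out of scope, still recorded

intact = verify_chain(chain)             # untouched chain verifies
chain[1]["performed"] = "read:invoices"  # quietly rewrite history
tampered_ok = verify_chain(chain)        # the edited record no longer matches its hash

print(f"intact chain verifies: {intact}")
print(f"tampered chain verifies: {tampered_ok}")
```

A chain like this only detects edits relative to a hash the verifier already holds. Someone who controls the whole log could rebuild every hash after a change, so the latest hash has to be anchored somewhere the operator cannot alter, which is where the third-party verifiability described above comes in.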

There's an irony worth sitting with. Selling prevention as the finish line implicitly promises completeness — the one thing the result says no one can deliver. The honest posture was never "our wall is complete." It was "we'll show you, provably, what the system did." A NIST theorem just made that the steadier place to stand.

You can't guardrail your way to complete safety. So decide what your AI can do — and prove what it did.
