AI Guardrails: Put Your Dog on a Leash

Jun 04, 2026

There is a conversation happening across the industry right now about AI safety, and a lot of it centers on guardrails. System prompts. Constitutional training. Refusal behaviors. Markdown files that tell an agent what tools it should and should not call. Skill definitions that scope what an LLM is allowed to do for a given task. The volume of work going into these mechanisms is substantial, and the people doing the work are smart.

I think a large portion of it is misclassified.

When I look at how AI guardrails are designed today, the framing that keeps coming back to me is the old triangle from foundational security: administrative, physical, and technical controls. Physical does not really apply to most of what we are talking about here, so set that one aside. Administrative and technical are the two we care about, and the relationship between them is what I think much of the current AI safety conversation gets wrong.

A Quick Refresher on Controls

Administrative controls are the rules. Policies, procedures, training, regulations, contracts. They tell people what they are supposed to do. They are words on paper.

Technical controls are the enforcement. They are the systems that mechanically prevent the rule from being broken, regardless of whether the person on the other end of the keyboard wanted to break it. Access control lists. Authentication. Authorization scopes. Network segmentation. Encryption. The actual mechanism that says no when no is the right answer.

The relationship between the two is the part that matters. Administrative controls produce intent. They tell people what good looks like. They make the expectation explicit so that everyone is operating from a shared understanding. They do not, on their own, stop a person from doing the wrong thing. Words on paper have exactly as much control over a person as that person is willing to give them. If the person decides not to follow the policy, the policy does not enforce itself.

Technical controls are what makes the administrative control real. They are the part of the system that produces the outcome when intent fails. Intent fails for a lot of reasons. People make honest mistakes. People take shortcuts. People are tired, distracted, or rushed. People act in bad faith. The technical control does not care about the reason. It just enforces the rule.

A well-run security program has both, and they reinforce each other. The policy tells you what the rule is. The technical control makes sure the rule actually holds.

Where AI Guardrails Sit

When I look at the current state of AI safety mechanisms, most of what we are calling “guardrails” lives on the administrative side of that line.

Model weights and the training that produced them are administrative. The model has been conditioned to behave a certain way, but the conditioning is probabilistic. It makes the model less likely to do the bad thing. It does not make the bad thing impossible. The history of jailbreaks and prompt injection attacks is the history of people demonstrating the gap.

System prompts are administrative. They are instructions in plain text that tell the model what role it is playing, what it should and should not do, and what tone it should take. They live in the context window alongside everything else. They can be ignored, contradicted, or overwritten by clever input. They have no enforcement mechanism of their own.

Skill files, agent instruction files, tool descriptions, the markdown scaffolding that has emerged around modern agent frameworks. All administrative. Useful, even necessary, for getting consistent behavior. Not enforcement.

There is a category of product that markets itself as “guardrails,” using external classifier models or rule-based filters that sit outside the main LLM and inspect inputs and outputs. Those are better than nothing. They sit a half-step closer to technical controls because they are a separate enforcement layer. But they are still probabilistic, still inside the same trust boundary as the model they are protecting in most deployments, and still bypassable. Closer to technical, still not there.

The clarifying point I want to make is this. An agent operating with a broad, all-access token and a stack of input and output filters in front of it is never going to be as secure as the same agent operating with a token scoped to the exact permissions its single function requires. The filter approach treats over-permissioning as a problem to be managed at runtime by inspecting requests as they come and go. The scoping approach treats over-permissioning as a problem to be eliminated up front by making the unwanted action structurally impossible. One of those is enforcement. The other is supervision. They are not the same thing, and supervision does not substitute for enforcement no matter how good the supervisor is.

The reason this matters in practice is that filters are probabilistic and scopes are not. A filter is a piece of software making a judgment call about whether a given input or output looks acceptable. It will be right most of the time. It will be wrong some of the time. A scope on a token is not making a judgment call. The action is either inside the scope or it is not, and if it is not, the system refuses without consulting anyone. That difference is the difference between defense in depth and false confidence.
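
A minimal Python sketch of that difference. The keyword list, the scope names, and both functions are hypothetical, and a real filter would usually be a classifier model rather than a word list. The word list only stands in for any check that judges how a request looks.

```python
# Hypothetical stand-in for a filter: it judges requests by their wording.
BLOCKED_WORDS = {"delete", "drop", "truncate"}


def filter_allows(request_text: str) -> bool:
    """Judgment call: does this request look acceptable?"""
    words = request_text.lower().split()
    return not any(word in BLOCKED_WORDS for word in words)


def scope_allows(token_scopes: frozenset, required_scope: str) -> bool:
    """No judgment: the scope is either on the token or it is not."""
    return required_scope in token_scopes


# Hypothetical read-only token for the agent.
agent_scopes = frozenset({"customers:read"})

# A rephrased destructive request gets past the wording check...
passed_filter = filter_allows("please wipe every row in customers")

# ...but the write it needs is outside the token's scope, however it is phrased.
passed_scope = scope_allows(agent_scopes, "customers:write")

print(passed_filter, passed_scope)  # prints: True False
```

The filter can be improved indefinitely and will still be guessing. The scope check has nothing to guess about.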

Filters belong in the stack. They are useful. They catch things the scoping cannot catch, particularly things that are technically inside the scope but still undesirable, like a message with sensitive content going out through an otherwise legitimate channel. But they belong on top of correct scoping, not in place of it. The scoping is the technical control. The filter is the administrative-flavored layer that helps the technical control do its job. Get the order right and the stack works. Get the order wrong and the filter becomes the thing standing between the model and an action it should never have been able to take in the first place, which is exactly the failure mode this post is about.

The actual technical control, the one that produces the outcome when the administrative control fails, is somewhere else entirely. It is the API permission scope on the token the agent is using. It is the network ACL that determines what hosts the agent can reach. It is the sandbox the code execution tool is running inside. It is the database account that is read-only because the agent should not need to write. It is the human in the loop on the action that cannot be undone. The technical control is the thing that says no when the model decides it wants to say yes.

The Token Problem

I want to spend a minute on the API token point, because I think it is the cleanest illustration of the problem.

If you build an agent and you give it a token that has write access to your customer database, the system prompt telling it to be careful with customer data is not going to stop it from writing to the customer database. The training that conditioned it to handle sensitive data responsibly is not going to stop it from writing to the customer database. The skill file that says “this agent should only read customer records” is not going to stop it from writing to the customer database. The model has the token. The token works. The action is permitted by the actual enforcement layer, which is the database authorization system, and the database authorization system was told that this token is allowed to write.

The only thing that actually prevents the agent from writing to the customer database is changing the token’s scope so that the database refuses the write at the technical layer. That is a technical control. Everything else is intent.
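
One way to see this concretely is with SQLite, which ships with Python and supports read-only connections through its URI syntax. This is a sketch under assumptions: the `customers` table, the file path, and the `agent_run` helper are all made up for illustration, and a production database would express the same idea through its own roles and grants.

```python
import os
import sqlite3
import tempfile

# Hypothetical customer table, created over a read-write connection.
# This stands in for the database administrator, not the agent.
db_path = os.path.join(tempfile.mkdtemp(), "customers.db")
admin = sqlite3.connect(db_path)
admin.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT)")
admin.execute("INSERT INTO customers (email) VALUES ('a@example.com')")
admin.commit()
admin.close()

# The agent's handle is opened read-only (mode=ro in the SQLite URI).
agent_db = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)


def agent_run(sql: str) -> list:
    """Run whatever SQL the agent produced; the connection decides what is possible."""
    return agent_db.execute(sql).fetchall()


rows = agent_run("SELECT email FROM customers")  # reads succeed

try:
    agent_run("UPDATE customers SET email = 'x@example.com'")
    write_blocked = False
except sqlite3.OperationalError:
    # The database refused the write. No prompt or policy was consulted.
    write_blocked = True
```

Nothing in `agent_run` inspects the SQL. The refusal comes from the connection itself, which is why it holds no matter what the model was told or how it was persuaded.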

This becomes more important, not less, as agents become more capable and as the trend in the industry moves toward giving them broader tool access. Every new credential you put in an agent’s hands is a new permission you are trusting the administrative layer to constrain. The administrative layer is the model’s behavior. The model’s behavior is probabilistic. The math gets worse with every new tool.
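
One hedged way to read "the math gets worse": suppose, purely for illustration, that each over-permissioned tool has the same small, independent chance of being misused in a given period. Both the independence and the 1% figure below are assumptions, not measurements.

```python
def p_any_failure(p_per_tool: float, n_tools: int) -> float:
    """Chance that at least one of n independent tools is misused."""
    return 1 - (1 - p_per_tool) ** n_tools


# Assumed 1% chance per tool; the number is made up to show the shape of the curve.
for n in (1, 5, 20, 50):
    print(n, round(p_any_failure(0.01, n), 3))
```

With the made-up 1% figure, twenty tools already give roughly an 18% chance that at least one is misused. A tool whose token cannot perform the unwanted action drops out of the product entirely.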

The Leash

The analogy I keep coming back to is the dog and the leash.

You can train your dog well. You can train them for years. You can have a dog whose recall is reliable, whose temperament is steady, who has never lunged at a stranger or chased a squirrel into traffic. You should still put a leash on them when you walk down a busy street.

The leash is not a statement about whether you trust the dog. The leash is a statement about consequences. If the dog is wrong about a squirrel, the consequence is a car, and the consequence is not recoverable. The training is administrative. The leash is technical. Both have their place. Neither replaces the other.

I know a leash is, strictly speaking, a physical control rather than a technical control, but the spirit of the analogy holds. The leash exists because behavioral conditioning, however good, is not the same as enforcement. The point of the leash is not to fix the dog. The point of the leash is to make the consequence of a behavioral failure something other than catastrophe.

That is the model I want people to use when they think about AI safety. The training and the system prompts and the skill files are the dog’s training. They matter. They produce a better-behaved agent. They make the technical controls less likely to be needed. They are not the leash. They are not enforcement. They are the part of the system that makes the agent want to do the right thing, which is not the same as the part of the system that makes the agent unable to do the wrong thing.

What Technical Controls Actually Look Like

I do not want to leave this post in diagnostic mode without answering the obvious next question: what should the technical control layer actually look like for AI systems?

Scoped credentials are the first move. Every token an agent uses should have the minimum permissions required to do the job it was built for. If the agent only needs to read, the token only allows reading. If the agent only needs to operate on a particular project, the token is scoped to that project. The agent never gets a token with broader access than the task requires, because the moment that broader token exists, the administrative layer is the only thing standing between the agent and the broader actions.
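
A sketch of what that could look like on the receiving side of an API call. `ScopedToken`, `authorize`, and the project names are hypothetical. The point is only that the check runs in the service that owns the resource, not in the agent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScopedToken:
    """Hypothetical token: a set of allowed actions, valid for one project."""
    actions: frozenset
    project: str


def authorize(token: ScopedToken, action: str, project: str) -> None:
    """Run by the service receiving the call, never by the agent itself."""
    if project != token.project:
        raise PermissionError(f"token is not valid for project {project!r}")
    if action not in token.actions:
        raise PermissionError(f"token does not allow {action!r}")


# Minted for one job: reading the hypothetical "billing-reports" project.
report_token = ScopedToken(actions=frozenset({"read"}), project="billing-reports")

authorize(report_token, "read", "billing-reports")  # inside scope, returns quietly


def is_refused(action: str, project: str) -> bool:
    """Helper for the examples: True if the service would reject the call."""
    try:
        authorize(report_token, action, project)
        return False
    except PermissionError:
        return True
```

Here `is_refused("write", "billing-reports")` and `is_refused("read", "payroll")` are both true. The token was never able to do either, so there is nothing for a prompt to hold back.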

Sandboxing is the second move. Code execution tools should run inside environments that constrain what they can reach. File system isolation. Network egress restrictions. Resource limits. Whatever the equivalent constraint is for the kind of execution the agent is doing. The point of the sandbox is the same as the point of the leash. The agent can try to do whatever it tries to do. The sandbox determines what is actually possible.
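
A deliberately small sketch of the resource-limit piece, assuming a Unix-like system (Python's `resource` module is not available on Windows). It is not a full sandbox: it caps CPU time, clears the environment so no credentials are inherited, and runs in a scratch directory, but it does not isolate the file system or the network. Those need containers, VMs, or OS-level policy. The `run_untrusted` helper and the specific limits are assumptions for illustration.

```python
import resource
import subprocess
import sys
import tempfile


def limit_resources() -> None:
    """Runs in the child process just before exec: cap CPU time at 2 seconds."""
    resource.setrlimit(resource.RLIMIT_CPU, (2, 2))


def run_untrusted(code: str) -> subprocess.CompletedProcess:
    """Run agent-written Python in a child process with hard limits applied."""
    with tempfile.TemporaryDirectory() as scratch:
        return subprocess.run(
            [sys.executable, "-I", "-c", code],  # -I: isolated mode
            cwd=scratch,                 # throwaway working directory
            env={},                      # no inherited secrets or tokens
            preexec_fn=limit_resources,  # the kernel enforces the CPU cap
            capture_output=True,
            text=True,
            timeout=10,                  # wall-clock backstop
        )


ok = run_untrusted("print('ok')")
runaway = run_untrusted("while True: pass")  # killed by the kernel, not by a prompt
```

The well-behaved snippet finishes normally. The runaway loop ends with a nonzero return code once the CPU cap is hit, regardless of what the code "intended".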

Human-in-the-loop on irreversible actions is the third move. Some actions cannot be undone. Sending money. Deleting data. Sending a message to a customer. Closing an account. For actions in that category, the right answer is often to require a human approval step in the middle of the workflow, because the cost of getting it wrong is higher than the cost of waiting for a person to look at it.
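
A minimal sketch of such an approval gate. The action names come from the examples above. `run_action`, `ApprovalDenied`, and `ask_human` are hypothetical, and a real approver would more likely be a ticket or a chat prompt than `input()`.

```python
from typing import Callable

# Hypothetical action names, taken from the examples in the text.
IRREVERSIBLE = {"send_money", "delete_data", "message_customer", "close_account"}


class ApprovalDenied(Exception):
    """Raised when a human declines an irreversible action."""


def run_action(
    name: str,
    perform: Callable[[], str],
    approve: Callable[[str], bool],
) -> str:
    """Run `perform`, pausing for human sign-off if the action cannot be undone."""
    if name in IRREVERSIBLE and not approve(name):
        raise ApprovalDenied(name)
    return perform()


def ask_human(name: str) -> bool:
    """Stand-in approver that asks on the terminal."""
    answer = input(f"Allow irreversible action {name!r}? [y/N] ")
    return answer.strip().lower() == "y"
```

A call such as `run_action("send_money", transfer, ask_human)`, with `transfer` being whatever function performs the payment, would block until someone answers. The gate only counts as a control if it runs in the service that executes the action, where the agent cannot skip it.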

Defense in depth ties all of the above together. No single layer is perfect. The administrative layer makes the agent want to behave. The classifier and filter layer adds a second probabilistic check. The technical control layer enforces. The human in the loop catches what makes it through. Each layer covers the failure modes of the layers around it.

The Closing Point

None of this is an argument against the administrative work. The administrative work matters. A model trained to be helpful, honest, and careful is a meaningfully different product from a model that was not. The system prompts, the skill files, the scaffolding around agent behavior, all of it is real engineering and all of it makes the resulting system better.

It is just not enforcement. It is intent. And as we put more agents in more places with more credentials, the distinction between intent and enforcement is going to get more important, not less.

Train the dog. Then put on the leash.
