
AI agent guardrails that hold

A field guide to AI agent guardrails for recursive agent chains: connector ownership, child receipts, and review evidence that survives the merge queue.

Landscape with a Castle, landscape painting by John Martin (1815).
Rogier Muller · May 10, 2026 · 5 min read

An AI agent guardrail is a written limit on what an agent may touch, paired with a receipt that proves the limit held. The guardrails worth keeping are the ones that survive a crunch week, when agent summaries shrink to bullet vibes exactly as the diffs get riskier. This is a field guide for teams running recursive agents across Cursor (Anysphere's AI code editor), Claude Code (Anthropic's coding agent), and Codex CLI (OpenAI's coding agent).

The thing that breaks first under load is not model quality. It is the reviewable story behind a merge. Smaller tasks feel safer, but they do not stop a connector from touching data nobody listed on the diagram. So aim for forks you can explain over forks that are merely clever, and make traceability cheap before you make generation easy.

Make Codex replay reproducible

Teams on Codex CLI tend to merge green builds that no reviewer ever traced, because the transcript lived in someone's terminal and then scrolled away. The fix is a small ritual: an intent line, then the command transcript, then a diff summary, all written into AGENTS.md before any PR opens.

That ordering turns review into something you can read instead of reenact. A reviewer who never sat behind the keyboard can still see what the agent meant to do, what it ran, and what changed. No standing over a shoulder required.
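The ordering above (intent, then commands, then diff summary) can be sketched as a small helper that refuses to render an incomplete record. This is a minimal illustration, not a Codex CLI feature: `ReplayEntry`, its field names, and the rendered layout are all hypothetical, and where the text ends up (for example, a section of AGENTS.md) is left to the team.

```python
from dataclasses import dataclass, field


@dataclass
class ReplayEntry:
    """Hypothetical replay record: intent, then commands, then diff summary."""

    intent: str
    commands: list[str] = field(default_factory=list)
    diff_summary: str = ""

    def render(self) -> str:
        # Refuse to render an incomplete record so gaps surface before the PR opens.
        if not self.intent.strip():
            raise ValueError("intent line is required")
        if not self.commands:
            raise ValueError("command transcript is empty")
        if not self.diff_summary.strip():
            raise ValueError("diff summary is required")

        # Fixed order: what the agent meant to do, what it ran, what changed.
        lines = [f"Intent: {self.intent.strip()}", "", "Commands:"]
        lines += [f"    $ {cmd}" for cmd in self.commands]
        lines += ["", f"Diff summary: {self.diff_summary.strip()}"]
        return "\n".join(lines)


if __name__ == "__main__":
    # Example values are made up for illustration.
    entry = ReplayEntry(
        intent="Raise the retry limit for the upload client",
        commands=["pytest tests/test_upload.py"],
        diff_summary="1 file changed in src/upload/",
    )
    print(entry.render())
```

Raising on a missing part is a deliberate choice in this sketch: an entry with no transcript is exactly the case the ritual is meant to catch.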

Give every MCP connector an owner

Connectors default to demo mode: lots of capability, no stated boundary. Least privilege needs an explicit trust boundary, and without one a connector quietly widens what an agent can reach. The OWASP Top 10 for LLM applications reads like a catalog of what follows.

Write one markdown card per MCP server: allowed actions, forbidden actions, owner, rollback. Incidents shrink because operators know what "off" looks like before anything ships, and the Model Context Protocol specification gives you the vocabulary the card pins down.
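As a rough sketch of what such a card could look like in code, the snippet below models the four fields named above and flags the gaps a reviewer would care about. `ConnectorCard`, the check wording, and the example server and action names are assumptions made for illustration; none of them come from the Model Context Protocol specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConnectorCard:
    """Hypothetical card for one MCP server: actions, owner, rollback."""

    server: str
    owner: str
    allowed: tuple[str, ...]
    forbidden: tuple[str, ...]
    rollback: str

    def problems(self) -> list[str]:
        # Collect every gap instead of stopping at the first one.
        found = []
        if not self.owner.strip():
            found.append("no owner named")
        if not self.allowed:
            found.append("no allowed actions listed")
        if not self.rollback.strip():
            found.append("no rollback step")
        overlap = sorted(set(self.allowed) & set(self.forbidden))
        if overlap:
            found.append("both allowed and forbidden: " + ", ".join(overlap))
        return found

    def to_markdown(self) -> str:
        # Render the card as the plain markdown file a reviewer would read.
        lines = [f"# Connector: {self.server}", "", f"Owner: {self.owner}", "", "Allowed:"]
        lines += [f"- {action}" for action in self.allowed]
        lines += ["", "Forbidden:"]
        lines += [f"- {action}" for action in self.forbidden]
        lines += ["", f"Rollback: {self.rollback}"]
        return "\n".join(lines)


if __name__ == "__main__":
    # Example server and action names are invented for this sketch.
    card = ConnectorCard(
        server="tickets",
        owner="",
        allowed=("read_issue", "delete_issue"),
        forbidden=("delete_issue",),
        rollback="",
    )
    for problem in card.problems():
        print(problem)
```

A check like this could run in CI so that a card with no owner or no rollback step blocks the connector from shipping, which is the point of writing the card first.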

Keep recursive agents reviewable

Chain agents together and the parent starts approving child summaries that quietly drop child-owned paths. That blur is where bad merges hide. Require every child to return a receipt: paths touched, commands run, and the tests that prove its regression guards held.
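One way a parent could check receipts against the actual diff is sketched below. The `ChildReceipt` shape and both helper functions are hypothetical; the only idea taken from the text is that every changed path should be claimed by some child, and every child should name the tests it ran.

```python
from dataclasses import dataclass


@dataclass
class ChildReceipt:
    """Hypothetical receipt a child agent returns to its parent."""

    child: str
    paths_touched: set[str]
    commands_run: list[str]
    tests: list[str]


def unexplained_paths(diff_paths: set[str], receipts: list[ChildReceipt]) -> set[str]:
    # Paths present in the merged diff that no child receipt claims.
    claimed: set[str] = set()
    for receipt in receipts:
        claimed |= receipt.paths_touched
    return diff_paths - claimed


def receipts_without_tests(receipts: list[ChildReceipt]) -> list[str]:
    # Children that changed something but named no test as proof.
    return [r.child for r in receipts if r.paths_touched and not r.tests]


if __name__ == "__main__":
    # Example paths and commands are invented for this sketch.
    receipts = [
        ChildReceipt("parser", {"src/parser.py"}, ["pytest tests/test_parser.py"], ["tests/test_parser.py"]),
        ChildReceipt("docs", {"docs/usage.md"}, ["make docs"], []),
    ]
    diff = {"src/parser.py", "docs/usage.md", "infra/deploy.yaml"}
    print(sorted(unexplained_paths(diff, receipts)))
    print(receipts_without_tests(receipts))
```

If either result is non-empty, the parent has a concrete list to send back to the child rather than a summary to take on trust.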

With the receipt, the parent inherits evidence instead of confidence, and the reviewer inherits both. Same idea on the review side: when CI is green and someone still asks "why this approach," the PR template should already answer it. Three lines do the job: constraints considered, rejected alternatives, verification proof. Here is a starting boundary snapshot you can adapt:

---
description: Delegation boundary snapshot (adapt globs to your repo)
globs:
  - "**/*"
alwaysApply: false
---

- Cursor: keep scopes explicit in `.mdc`; forbid undeclared MCP domains.
- Claude Code: cite `CLAUDE.md` precedence before expanding bash scope.
- Codex: ensure `AGENTS.md` carries replay-friendly verification notes for CLI runs.

The wider playbook lives on agentic coding governance, and the browser-control version of the same discipline is in our note on workflow guardrails for browser-driving agents. Treat automation like radio discipline: more radios with sloppy callsigns only add noise, and the callsign is the guardrail.

Build an evidence pack reviewers can scan

You do not need a heavy process to make merges explainable. You need a few questions every reviewer can answer from the PR alone, plus a short strip to tick off. Here is the pack.

| Gate | Question |
| --- | --- |
| Rules precedence | Which .mdc, SKILL.md, or CLAUDE.md governed behavior? |
| Connector truth | Which MCP servers fired, and were they expected? |
| Reviewer path | Can someone unfamiliar trace intent without chat replay? |
| Risk routing | Were red folders touched, and who approved? |

  • Red-folder paths received explicit human acknowledgement.
  • Scopes in the PR body match folders in the diff.
  • Primary-doc links were smoke-checked after publishing edits.
  • MCP connectors mentioned (if any) list owners.
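Two of the checks above lend themselves to automation: scopes versus folders in the diff, and red-folder acknowledgement. The sketch below assumes scopes are declared as top-level folder names and that the team keeps its own list of red folders; the function names and the example folder names are hypothetical.

```python
from pathlib import PurePosixPath


def top_folder(path: str) -> str:
    # Treat files at the repo root as belonging to ".".
    parts = PurePosixPath(path).parts
    return parts[0] if len(parts) > 1 else "."


def scope_mismatch(declared: set[str], diff_paths: list[str]) -> dict[str, list[str]]:
    # Compare folders named in the PR body with folders actually in the diff.
    touched = {top_folder(p) for p in diff_paths}
    return {
        "undeclared": sorted(touched - declared),
        "unused": sorted(declared - touched),
    }


def unacknowledged_red(
    diff_paths: list[str], red_folders: set[str], acknowledged: set[str]
) -> list[str]:
    # Red folders in the diff that no human has explicitly signed off on.
    touched = {top_folder(p) for p in diff_paths}
    return sorted((touched & red_folders) - acknowledged)


if __name__ == "__main__":
    # Folder names are invented for this sketch.
    diff = ["src/api/routes.py", "infra/main.tf", "README.md"]
    print(scope_mismatch({"src", "docs"}, diff))
    print(unacknowledged_red(diff, red_folders={"infra"}, acknowledged=set()))
```

The link and connector-owner checks are left out here because they depend on where a team stores its docs and connector cards.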

None of this replaces architecture judgement. Agents speed up execution, not ownership.

Common questions

  • What are AI agent guardrails?

    AI agent guardrails are written limits plus receipts: a replay ritual in AGENTS.md, a connector card per MCP server, a child receipt for delegated work, and a decision stub in the PR template. The shape repeats every time. Name the boundary, name the owner, and make the receipt cheap to check.

  • How do you keep recursive agents reviewable?

    Require every child agent to return a receipt block: paths touched, commands run, and the tests that prove its regression guards held. Handoff blur starts when summaries omit child-owned paths and parents approve mystery diffs. With the receipt, the parent inherits evidence instead of confidence, and the reviewer inherits both at once.

  • Who should own an MCP connector?

    A named human, written on a connector card beside the server entry: allowed actions, forbidden actions, owner, rollback. Connectors widen what an agent can reach precisely when nobody lists them on the diagram. So the card exists before the connector ships, not after the first incident teaches the lesson the hard way.

  • Does this slow teams down?

    A little upfront, much less over a quarter. The replay ritual, connector card, child receipt, and decision stub each cost a few lines. They pay back the first time a merge needs explaining at midnight, because the answer is already written down instead of trapped in a terminal scrollback nobody saved.

Where to start

Pick one named fix and turn it into a shared artifact this week: a connector card, a child receipt block, or a decision stub in your PR template. If you want the team version, our training walks engineers through all four on a real repo.
