Run Review Agents With Receipts

A practical Codex workflow for LLM code review, with AGENTS.md guardrails, MCP boundaries, and a copyable review receipt.

Rogier Muller · June 26, 2026 · 8 min read

The strongest code review setup is not “one magic model.” It is a small review system: clear repo rules, bounded tools, independent verification, and a receipt the human reviewer can trust.

AI code review works best when agents act like careful junior reviewers with excellent patience, not silent maintainers with commit access. Multi-agent orchestration is the practice of splitting work across specialized agents, tools, and checks so each part has a narrow job and an observable output.

For Codex users and engineering teams, that usually means one agent proposes, one agent reviews, and one verification loop proves what changed. This is the heart of practical AI coding governance: make the workflow faster without making ownership fuzzy.

Choose the reviewer by workflow, not leaderboard

The best LLM for code review is the model-plus-workflow that catches your real defects without creating review theater. A model that is great at broad reasoning may still be poor for your codebase if it cannot see the right files, run the right commands, or follow your team’s review policy.

Start with the failure modes you actually care about. For a payments service, ask the review agent to focus on idempotency, auth boundaries, logging, migrations, and rollback safety. For a frontend app, ask it to check accessibility, state transitions, performance, and test coverage around user-visible changes.

A useful pattern in Codex, OpenAI’s coding agent, is to run review as a constrained pass after the implementation pass. The implementation agent can touch code. The review agent should produce findings, evidence, and suggested fixes, but not quietly rewrite the PR unless a human asks.

The trap is comparing AI code review tools only by benchmark anecdotes. Benchmarks are useful signals, but teams ship through repos, tests, incidents, and habits. Pick the review setup that leaves the clearest trail for your maintainers.

If you want a deeper comparison frame, use Choose Code Review Agents by Receipts as a companion checklist.

Put the repo rules where agents will read them

Use AGENTS.md as the durable contract for Codex work in the repo. Put architecture rules, review expectations, forbidden shortcuts, and verification commands there. Keep it short enough that humans will edit it and agents will obey it.

A good root AGENTS.md might say: “Do not change database migrations without adding rollback notes. Run pnpm test -- --runInBand for backend changes. Prefer small PRs. Review auth, tenancy, and logging on every API route change.” That is boring in the best way.

Use nested files when different parts of the repo have different rules. A services/billing/AGENTS.md can be stricter than docs/AGENTS.md. Local scope beats one giant instruction file that tries to describe the whole company.
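To make the nested-file idea concrete, here is a minimal sketch of how a team script might work out which AGENTS.md files apply to a changed file. The function name, the sample paths, and the ordering rule (root first, most specific last) are assumptions for illustration; this is not a description of how Codex itself loads instructions.

```python
from pathlib import PurePosixPath


def applicable_agents_files(changed_file: str, existing: set[str]) -> list[str]:
    """Return the AGENTS.md paths that apply to changed_file, root first.

    `existing` holds the AGENTS.md paths present in the repo, relative to
    the repo root and written with forward slashes.
    """
    parts = PurePosixPath(changed_file).parent.parts
    candidates = ["AGENTS.md"]  # the root file always comes first
    for depth in range(1, len(parts) + 1):
        candidates.append(str(PurePosixPath(*parts[:depth], "AGENTS.md")))
    # Keep only the files that actually exist, preserving root-to-leaf order.
    return [path for path in candidates if path in existing]


# Hypothetical repo layout, matching the examples in the text.
repo_files = {"AGENTS.md", "services/billing/AGENTS.md", "docs/AGENTS.md"}
print(applicable_agents_files("services/billing/invoices/refund.py", repo_files))
```

Listing the files in root-to-leaf order makes it easy to show a reviewer exactly which rules were in scope, which is also what the receipt's "Repo instructions used" line asks for.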

The trap is turning AGENTS.md into a junk drawer. Do not paste task prompts, old incident notes, or speculative preferences into durable memory. If a rule would not survive the next sprint, put it in the task prompt instead.

Keep MCP boundaries explicit

MCP, the Model Context Protocol, is the integration layer agents use to reach external tools and data. In a code review workflow, that might include GitHub, issue trackers, docs, logs, databases, or internal knowledge bases.

Write down which MCP servers a review agent may use and which ones it may not use. For example: “Review agents may read GitHub PR diffs, linked tickets, and public docs. They may not query production databases, send Slack messages, or mutate Jira state.”

This matters because an AI code review workflow often fails at the boundary, not inside the model. The agent may be helpful, but the tool permission is too broad. A reviewer with write access to every system is not a reviewer anymore; it is an operator.

The trap is granting tools because setup is annoying. If the agent only needs evidence, give it read access. If it needs to run commands, prefer local test commands or CI checks before external mutation.
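One way to write the boundary down is as a deny-by-default allowlist that a wrapper checks before any tool call goes out. The sketch below is illustrative only: the server and action names are hypothetical, and it models the policy itself rather than any real MCP client API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolRequest:
    server: str  # hypothetical MCP server name, e.g. "github"
    action: str  # hypothetical action name, e.g. "read_pr_diff"


# Hypothetical allowlist mirroring the example policy: read-only evidence sources.
REVIEW_AGENT_ALLOWED = {
    ("github", "read_pr_diff"),
    ("tickets", "read_linked_ticket"),
    ("docs", "search_public_docs"),
}


def is_allowed(request: ToolRequest) -> bool:
    """Deny by default: only explicitly listed (server, action) pairs pass."""
    return (request.server, request.action) in REVIEW_AGENT_ALLOWED


requests = [
    ToolRequest("github", "read_pr_diff"),
    ToolRequest("jira", "transition_issue"),
    ToolRequest("prod_db", "query"),
]
for req in requests:
    verdict = "allow" if is_allowed(req) else "deny"
    print(f"{req.server}.{req.action}: {verdict}")
```

Because anything not listed is denied, adding a new tool becomes a deliberate, reviewable change instead of a side effect of setup.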

Split agents by responsibility

A simple personal or team setup can use three roles: implementer, reviewer, and verifier. The implementer changes code. The reviewer reads the diff against policy. The verifier runs the agreed commands and reports exact results.

This is not bureaucracy. It is separation of concerns. The same agent that wrote the patch is naturally biased toward explaining why it is fine. A second pass catches a different class of mistakes, especially when it starts from git diff and AGENTS.md instead of the original implementation plan.

In Codex CLI-style work, the loop can be plain: ask Codex to implement the issue, inspect the diff, ask for a review-only pass, run tests, then ask for a receipt. If the agent cannot run a command, it should say so instead of implying green status.
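The verifier role can be very small. Here is a minimal sketch, assuming the verification step is a list of shell commands, that reports each one as pass, fail, or not run, so a missing tool never reads as green. The `run_check` helper and the demo commands are hypothetical stand-ins for whatever your AGENTS.md lists.

```python
import shutil
import subprocess
import sys


def run_check(command: list[str], timeout_s: int = 600) -> dict:
    """Run one verification command and report pass, fail, or not run."""
    label = " ".join(command)
    if shutil.which(command[0]) is None:
        # Be explicit when the tool is missing instead of implying success.
        return {"command": label, "status": "not run",
                "detail": f"{command[0]} not found on PATH"}
    try:
        result = subprocess.run(command, capture_output=True, text=True,
                                timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return {"command": label, "status": "fail",
                "detail": f"timed out after {timeout_s}s"}
    status = "pass" if result.returncode == 0 else "fail"
    lines = (result.stdout + result.stderr).strip().splitlines()
    # Keep the last output line as a short, exact summary for the receipt.
    return {"command": label, "status": status,
            "detail": lines[-1] if lines else ""}


# Demo commands; in a real repo these would be the project's test commands.
checks = [
    [sys.executable, "-c", "print('2 passed')"],
    [sys.executable, "-c", "import sys; sys.exit(1)"],
    ["definitely-not-a-real-tool", "--version"],
]
for check in checks:
    report = run_check(check)
    print(f"- {report['command']} -> {report['status']} {report['detail']}")
```

The three statuses line up with the "Verification run" section of the receipt, so the output can be pasted in without rewording.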

The trap is parallelizing too early. Five agents reviewing the same PR can produce more noise than signal. Start with one narrow reviewer, one verification loop, and a receipt format your team can scan in under two minutes.

Paste this review receipt into your workflow

Use this as a pull request comment, review template, or final Codex handoff. The important part is not the wording. The important part is that every claim ties back to evidence.

## AI review receipt

PR: <link or branch>
Reviewer agent: <model/tool name>
Human owner: <name>
Scope reviewed: <files, directories, or diff range>
Repo instructions used: AGENTS.md, <nested AGENTS.md if relevant>
MCP/tools used: <GitHub read, local shell, docs search, none>

## Review focus
- [ ] Security/auth boundaries
- [ ] Data model or migration safety
- [ ] Error handling and observability
- [ ] Backward compatibility
- [ ] Tests cover changed behavior
- [ ] Performance-sensitive paths checked
- [ ] Product or accessibility impact checked

## Findings
| Severity | Finding | Evidence | Suggested next step |
| --- | --- | --- | --- |
| High/Med/Low | <specific issue or “none found”> | <file:line, test output, or command> | <fix, follow-up, or accept risk> |

## Verification run
Commands attempted:
- `<command>` → `<pass/fail/not run>`

Important output:
- `<exact failure, skipped test, or green summary>`

## Limits
- Files not reviewed: <list or “none”>
- Tools unavailable: <list or “none”>
- Assumptions made: <short bullets>

## Human decision
- [ ] Approved as-is
- [ ] Needs changes before merge
- [ ] Safe to merge with tracked follow-up: <ticket>

This artifact also works for engineering team training. In an AI coding workshop, have every participant run the same PR through their preferred agent and compare receipts, not vibes. The differences are usually obvious: one reviewer cites files and tests; another gives confident generalities.
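The rule that every claim ties back to evidence can also be checked mechanically before a human reads the receipt. The following is a rough sketch, assuming the findings table keeps the four-column layout of the template; the function name and the sample rows (including the file path) are hypothetical, and escaped pipes inside cells are not handled.

```python
def findings_without_evidence(receipt: str) -> list[str]:
    """Return the Finding text of every row whose Evidence cell is empty."""
    missing = []
    in_findings = False
    for line in receipt.splitlines():
        stripped = line.strip()
        if stripped.startswith("## "):
            # Only rows under the "## Findings" heading are inspected.
            in_findings = stripped == "## Findings"
            continue
        if not in_findings or not stripped.startswith("|"):
            continue
        inner = stripped[1:-1] if stripped.endswith("|") else stripped[1:]
        cells = [cell.strip() for cell in inner.split("|")]
        is_separator = all(cell and set(cell) <= set("-:") for cell in cells)
        if len(cells) != 4 or cells[0] == "Severity" or is_separator:
            continue  # skip the header, the separator, and malformed rows
        _severity, finding, evidence, _next_step = cells
        if finding.lower() != "none found" and not evidence:
            missing.append(finding)
    return missing


# Hypothetical receipt excerpt: the second finding cites no evidence.
sample = """## Findings
| Severity | Finding | Evidence | Suggested next step |
| --- | --- | --- | --- |
| High | Refund endpoint skips tenancy check | services/billing/refund.py:42 | Add tenant filter |
| Med | Error handling looks weak |  | Consider improving |

## Verification run
"""
print(findings_without_evidence(sample))
```

A check like this does not judge whether the evidence is good; it only catches the confident generality that cites nothing at all.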

Common questions

  • Which model should we use for code review?

    Use the model that produces specific, evidence-backed findings in your repo, not the one with the loudest reputation. For a fair test, give each candidate the same PR, the same AGENTS.md, and the same receipt template, then compare false positives, missed issues, command honesty, and reviewer time saved.

  • Should review agents be allowed to change code?

    Usually no, at least not during the review pass. Keep the reviewer in comment-and-receipt mode so humans can separate “what changed” from “what was found.” If you allow fixes, make the agent create a separate commit or patch and run the verification loop again.

  • How many agents do we actually need?

    Start with two roles: an implementation agent and a review agent. Add a verifier only when tests, builds, migrations, or security checks are often skipped or misreported. More agents can help on large changes, but they also add coordination cost and duplicate comments.

  • How do we stop LLM code review from becoming noisy?

    Make the review policy narrow and require evidence for every finding. A good receipt asks for severity, file-level evidence, and the next step, which filters out generic advice. Also tune AGENTS.md after each noisy review; bad instructions create bad reviews surprisingly quickly.

  • Where do Cursor and Claude Code fit in a Codex workflow?

    They can share the same governance shape. Cursor, Anysphere’s AI code editor; Claude Code, Anthropic’s coding agent; and Codex can all follow repo instructions, scoped tool access, and review receipts. The product surface changes; the team contract should stay recognizable.

Start with one reviewed PR

Pick one real pull request, add a small AGENTS.md, run one review-only agent pass, and require the receipt before merge. If the receipt helps the human reviewer make a better decision, keep the loop and tighten it next week.

One methodology lens

One useful way to read this through our methodology is the Plan step: delegate first-pass decomposition and dependency mapping, review the sequencing and assumptions, and keep ownership of scope and priorities. If that split is still fuzzy, the workflow usually is too.
