Codex Code Review for CLI Teams

A Codex CLI review workflow for AGENTS.md rules, MCP boundaries, verification loops, and review receipts.

Rogier Muller · June 16, 2026 · 6 min read

A Codex code review is a team workflow where OpenAI Codex, OpenAI's coding agent product, helps inspect or prepare code changes while humans still own the merge decision. The practical move for CLI teams is to make every Codex change ship with three things: a small diff, a verification record, and a receipt a reviewer can read in a minute. Codex CLI is the command-line interface for OpenAI Codex that runs coding work from your terminal, applies repository instructions, and keeps changes close to your tests and conventions.

This got more relevant after Codex CLI 0.140.0, listed in OpenAI's June 15, 2026 changelog under MCP and integrations. When an agent can reach more of your stack, your review needs to be clearer about what it touched, not more relaxed about it.

Give Codex its rules before it writes a line

Repository instructions are the cheapest review lever you have, and they work best when they live next to the code. A root AGENTS.md sets the general tone. Production repos usually need more than that.

A payments module, a data pipeline, or an auth layer deserves stricter rules than the rest of the app. Nested instruction files keep that local scope from getting flattened into one vague document the agent half-follows.
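To see which instruction files apply to a given change, a reviewer can list them from the repo root down to the changed file's directory. The sketch below is a hypothetical helper, not part of Codex CLI. It assumes that files deeper in the tree are the more specific ones; check the Codex documentation for how the tool itself discovers and merges instructions.

```python
from pathlib import Path


def instruction_files_for(repo_root: Path, changed_file: Path, name: str = "AGENTS.md") -> list[Path]:
    """List instruction files from the repo root down to the changed file's directory.

    Assumption: deeper files are more specific, so they come last in the result.
    """
    repo_root = repo_root.resolve()
    target_dir = (repo_root / changed_file).resolve().parent

    # Walk upward from the file's directory until we reach the repo root.
    dirs = []
    current = target_dir
    while True:
        dirs.append(current)
        if current == repo_root:
            break
        if current.parent == current:
            raise ValueError(f"{changed_file} is not inside {repo_root}")
        current = current.parent

    # Reverse so the general (root) file comes first and the most local one last.
    return [d / name for d in reversed(dirs) if (d / name).is_file()]
```

For a change under a payments directory that has its own AGENTS.md, this returns the root file followed by the payments file, which is the order a reviewer would want to read them in.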

The most useful rule to add first is a diff budget: one intent, one area of the codebase, one verification path. A Codex session will edit broadly if you ask broadly, so the budget is what keeps a single task from sprawling into a review nobody wants to open.
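A diff budget is easier to enforce when it is a check rather than a sentence. Here is a minimal sketch, assuming "area" means the first path segment and that ten files is a sensible ceiling; both are illustrative choices, and `check_diff_budget` is a hypothetical name, not a Codex feature.

```python
from pathlib import PurePosixPath


def check_diff_budget(changed_files: list[str], max_files: int = 10, max_areas: int = 1) -> list[str]:
    """Return budget violations for a list of changed paths; an empty list means the diff fits."""
    # Assumption: the first path segment is the "area"; root-level files share one area.
    areas = set()
    for f in changed_files:
        parts = PurePosixPath(f).parts
        areas.add(parts[0] if len(parts) > 1 else "(repo root)")

    problems = []
    if len(changed_files) > max_files:
        problems.append(f"{len(changed_files)} files changed, budget is {max_files}")
    if len(areas) > max_areas:
        problems.append(f"{len(areas)} areas touched ({', '.join(sorted(areas))}), budget is {max_areas}")
    return problems
```

Feed it the output of `git diff --name-only` split into lines. A change that touches both `payments/` and `auth/` comes back with one violation, which is the cue to split the task before review.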

Name the MCP boundary in plain words

MCP, the Model Context Protocol, is an integration protocol that connects an agent to external systems like tools, resources, prompts, and services. It gives teams a standard language for those connections, and the specification is the reference for how they behave.

The catch is that integration context is invisible to a reviewer unless someone writes it down. If Codex read GitHub issues, pulled from internal docs, or called a service through MCP, that belongs in the receipt.

State the boundary as a fact: what was read, what was changed, and what was deliberately left alone. A reviewer should never have to guess whether an agent reached into a system it had no business touching.
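One way to keep the boundary factual is to make "none" something the author has to write, not something implied by silence. The sketch below is an illustrative data structure under that assumption; the class and field names are hypothetical, not part of MCP or Codex.

```python
from dataclasses import dataclass, field


@dataclass
class BoundaryNote:
    """What the agent read, changed, and deliberately left alone through integrations."""
    read: list[str] = field(default_factory=list)
    changed: list[str] = field(default_factory=list)
    left_alone: list[str] = field(default_factory=list)

    def render(self) -> str:
        def line(label: str, items: list[str]) -> str:
            # An empty list is rendered as an explicit "none" so the reviewer never has to guess.
            return f"- {label}: {', '.join(items) if items else 'none'}"

        return "\n".join([
            line("Read", self.read),
            line("Changed", self.changed),
            line("Deliberately not accessed", self.left_alone),
        ])
```

Calling `BoundaryNote(read=["GitHub issue tracker"], left_alone=["production database"]).render()` yields three lines, with "Changed: none" stated outright.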

Make verification a command, not a claim

A summary that says "tests were considered" tells a reviewer nothing they can trust. The fix is to name exact commands the loop ran: unit tests, type checks, linters, migrations, smoke checks.

If a command was skipped, say so and say why. Reviewers are happy to accept known risk. They should not have to reverse-engineer it from the diff.
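A verification record can be produced by whatever runs the commands, so the receipt reports exit codes instead of intentions. This is a minimal sketch, not a Codex CLI feature: `run_checks` and `CheckResult` are hypothetical names, and which commands to run is your team's call.

```python
import subprocess
from dataclasses import dataclass
from typing import Optional


@dataclass
class CheckResult:
    command: str
    status: str  # "passed", "failed", or "skipped"
    detail: str = ""


def run_checks(commands: list[list[str]], skipped: Optional[dict[str, str]] = None) -> list[CheckResult]:
    """Run each command, record its exit code, then append skipped commands with their reasons."""
    results = []
    for argv in commands:
        command = " ".join(argv)
        try:
            proc = subprocess.run(argv, capture_output=True, text=True)
        except FileNotFoundError:
            # The executable is missing; record that as a failure rather than crashing.
            results.append(CheckResult(command, "failed", "command not found"))
            continue
        status = "passed" if proc.returncode == 0 else "failed"
        results.append(CheckResult(command, status, f"exit code {proc.returncode}"))

    # Skipped commands stay visible, each with the reason the author gave.
    for command, reason in (skipped or {}).items():
        results.append(CheckResult(command, "skipped", reason))
    return results
```

In a Node project this might be called with `[["npm", "test"], ["npm", "run", "typecheck"], ["npm", "run", "lint"]]`, with any skipped command passed in alongside its reason so the risk is written down, not inferred.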

This is also where skills earn their place. OpenAI's skills examples show the pattern: package reusable knowledge, scripts, and templates so a team can invoke a known workflow. A review skill should not promise correctness. It should produce a consistent receipt, rerun the known checks, and leave the call to the human.

Copy this review receipt into AGENTS.md

Here is a starter you can paste into AGENTS.md or fold into a team skill. The goal is to make Codex work reviewable without turning every reviewer into a prompt detective.

# AGENTS.md review instruction: Codex review receipt

When you change code with Codex CLI, finish with a review receipt.

## Scope
- Task intent:
- Files changed:
- Files intentionally not changed:
- Diff risk: low / medium / high

## MCP boundary
- MCP servers or external systems used:
- Data read through integrations:
- Actions taken through integrations:
- Anything deliberately not accessed:

## Verification loop
- Commands run:
  - npm test
  - npm run typecheck
  - npm run lint
- Commands skipped and reason:
- Manual checks performed:

## Reviewer notes
- Main behavior change:
- Backward-compatibility concern:
- Migration or deploy concern:
- Follow-up work that should not block this PR:

## Human handoff
- Recommended reviewer:
- Question for reviewer:
- Merge confidence: low / medium / high

A good receipt reads in about a minute. If it takes longer, the diff is probably too big or the risk is not yet understood, and both are worth fixing before the PR goes out.

Roll it out without a big project

You do not need a program for this. Add the receipt to one repository, require it only on Codex-authored or Codex-edited pull requests, and run it for two weeks. Small and measurable beats a policy nobody reads.

You will know it is working when reviewers can approve or reject from the receipt and the evidence alone, without replaying a long chat. The whole point comes down to a single test: can a fresh reviewer defend the merge from the artifact?

Common questions

  • How should a team start with Codex code review?

    Start with one visible rule, not a loose preference. Add a short repository convention, a review checklist, and one owner who can reject agent output when the evidence is missing. Keep the rollout to a single repo for two weeks so you can see whether reviewers actually use the receipt before you spread it wider.

  • Which Codex artifact should we standardize first?

    Standardize the smallest artifact reviewers already touch: an AGENTS.md instruction, an MCP boundary note, or a verification checklist. The win is not documentation volume. It is one shared place where scope, allowed tools, expected tests, and rollback notes are visible before generated code reaches review.

  • How do we know the convention is working?

    It works when reviewers can approve or reject from the artifact and its evidence alone. Track whether pull requests name the rule they followed, include the checks they promised, and stop forcing reviewers to replay long sessions just to understand what changed. If those three hold, the receipt is doing its job.

  • Does better MCP support make review automatic?

    No. Stronger integration support gives Codex more context and more reach, which can improve the first patch. It does not remove the need for a reviewer. More reach actually raises the bar for naming what the agent read, changed, and verified, so the boundary note matters more, not less.

Take the next step

Add the receipt to one repo this week, then send a teammate the diff and ask if they can defend the merge from the receipt alone. For more on building these habits into your stack, see Codex CLI workflows.
