
Stop Using CSS Selectors in E2E Tests

CSS selectors make agent-written E2E tests brittle. Use stable, user-facing hooks instead.

Rogier Muller · April 6, 2026 · 6 min read

CSS selectors are a convenient trap in end-to-end tests. They feel precise, but they are tied to presentation. When the layout changes, the test breaks even if the user flow still works. In agentic coding workflows, that coupling gets worse. An AI can write a test quickly, but it will also grab the easiest visible selector it can find. That often leads to brittle tests that fail on harmless UI changes.

The core problem is simple: CSS selectors describe how the page is built, not what the user is trying to do. Agent-written tests need a stable way to find elements across refactors. If the selector depends on class names, nested structure, or styling hooks, you are encoding implementation details into the test suite.
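To make that concrete, compare two ways of targeting the same button. The selector strings and the `encodesImplementationDetail` heuristic are illustrative, not a real linting rule:

```typescript
// Two ways to target the same "Save" button. The first encodes how the
// page is built; the second encodes what the user sees.
const brittle = "div.form-wrapper > div:nth-child(2) > button.btn.btn-primary";
const stable = 'role=button[name="Save"]'; // Playwright-style role selector

// A rough heuristic: class names, nth-child positions, or child combinators
// all tie the selector to layout rather than meaning.
function encodesImplementationDetail(selector: string): boolean {
  return /\.[\w-]+|:nth-child|>/.test(selector);
}

console.log(encodesImplementationDetail(brittle)); // true
console.log(encodesImplementationDetail(stable));  // false
```

Any refactor that renames a class or reorders the wrappers invalidates the first selector; the second survives as long as the button is still a button labeled "Save".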

That does not mean CSS selectors are always wrong. They can be fine for one-off debugging, visual checks, or very small internal tools. But for durable E2E coverage, they are usually the wrong default. The more your team uses agents to generate or maintain tests, the more important it is to give them a better target than the DOM structure.

What breaks first

The failure mode is usually not dramatic. A test passes for weeks, then someone changes a component library, renames a class, or wraps a button in another element. The user experience is unchanged, but the test fails. If an agent is maintaining the suite, it may fix the test by selecting a different CSS path that is just as fragile. You end up with a moving target.

This creates a hidden cost:

  • More false failures after routine UI work.
  • More time spent repairing tests instead of improving coverage.
  • Less trust in the suite, so people ignore failures.
  • More temptation to weaken assertions just to keep tests green.

The result is a test suite that looks active but does not protect much.

Better targets for agents

The better pattern is to expose stable, user-facing hooks. The exact mechanism depends on your stack, but the principle is consistent: select elements by meaning, not styling.

Common options include:

  • Accessible roles and names.
  • Stable data attributes meant for testing.
  • Text that is part of the product contract.
  • Explicit test IDs that do not change with layout.

Accessible queries are often the best first choice because they align with how users and assistive tech perceive the interface. If a button is truly a button, a role-based query is usually more durable than a CSS path. If the element has a meaningful label, that label is often a better anchor than a class name.
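That preference order can be expressed directly in code. A minimal sketch; the `ElementInfo` shape and `preferredLocator` helper are hypothetical conventions, not a real testing-library API:

```typescript
// "Select by meaning": given what we know about an element, build the most
// durable locator first and fall back only when we must.
interface ElementInfo {
  role?: string;           // ARIA role, e.g. "button"
  accessibleName?: string; // the label users and assistive tech perceive
  testId?: string;         // dedicated data-testid, if one exists
  cssPath?: string;        // structural fallback, last resort
}

function preferredLocator(el: ElementInfo): string {
  if (el.role && el.accessibleName) {
    return `role=${el.role}[name="${el.accessibleName}"]`;
  }
  if (el.testId) return `[data-testid="${el.testId}"]`;
  if (el.cssPath) return el.cssPath; // fragile: flag during review
  throw new Error("no usable hook; add one to the component");
}

console.log(preferredLocator({ role: "button", accessibleName: "Save" }));
// → role=button[name="Save"]
console.log(preferredLocator({ testId: "save-order" }));
// → [data-testid="save-order"]
```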

When the UI has repeated labels or dynamic content, a dedicated test attribute can be the safer option. The key is to make that attribute intentional and stable. Do not reuse styling classes for test selection. Do not ask agents to infer structure from nested divs when you can give them a direct hook.

A practical implementation pattern

A good team pattern is to separate three concerns:

  1. Styling hooks for CSS.
  2. Semantic hooks for accessibility.
  3. Test hooks for automation.

That separation reduces accidental coupling. It also gives agents a clearer rule: use semantic selectors first, fall back to test hooks when needed, and avoid styling selectors unless there is no better option.
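One way to keep the three hooks visibly separate on a single component. The `buttonAttrs` helper and its attribute names are illustrative conventions, not a specific framework's API:

```typescript
// Styling, semantics, and test access live in distinct attributes, so a
// change to one never drags the others along.
function buttonAttrs(label: string, testId: string) {
  return {
    className: "btn btn-primary", // 1. styling hook: for CSS only
    "aria-label": label,          // 2. semantic hook: the accessible name
    "data-testid": testId,        // 3. test hook: for automation only
  };
}

// A refactor can rename the classes freely; tests never touch them.
const attrs = buttonAttrs("Save order", "save-order");
console.log(attrs["data-testid"]); // save-order
```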

In practice, this means updating component conventions. For example, a design system can require that interactive elements expose stable labels and, where needed, a test attribute. Agents then have a predictable surface to work with. If your team uses generated tests, you can also add a review rule: reject any new E2E test that depends on a class selector unless there is a documented reason.
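A review rule like "no class selectors in new E2E tests" can start as a plain check over the test source. The regex is a rough heuristic, not a parser:

```typescript
// Flags quoted selectors that start with a class, e.g. locator('.btn-primary').
function findClassSelectors(testSource: string): string[] {
  return testSource.match(/['"`]\s*\.[\w-][^'"`]*['"`]/g) ?? [];
}

const ok = `await page.getByRole('button', { name: 'Save' }).click();`;
const bad = `await page.locator('.btn-primary > span').click();`;

console.log(findClassSelectors(ok).length);  // 0
console.log(findClassSelectors(bad).length); // 1
```

Wired into CI or a review bot, a check like this turns the policy from a convention into a gate.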

A useful workflow is:

  • Add stable hooks when building the component, not after tests fail.
  • Prefer queries that match user intent.
  • Keep selectors short and explicit.
  • Review selector choice during test review, not just test behavior.

That last point matters. A test can be logically correct and still be operationally fragile.

Tradeoffs and limits

This is not a universal ban on CSS selectors. There are cases where they are the least bad option, especially in legacy apps with limited control over markup. Some teams also use CSS selectors for quick smoke tests where the cost of occasional breakage is acceptable.

There is also a maintenance tradeoff. Adding stable hooks takes discipline. If every component gets a test attribute by default, the markup can become noisy. If the team is careless, test IDs can drift into another form of implementation clutter. So the goal is not “add more selectors.” The goal is “choose selectors that survive normal change.”

Another limitation: stable hooks do not fix bad test design. If the test asserts too much, waits too long, or covers the wrong user path, it will still be flaky or low value. Selector choice is one part of test quality, not the whole thing.

How to use this with agents

If you want agents to maintain E2E tests reliably, give them a simple policy:

  • Use accessible queries first.
  • Use dedicated test hooks when accessibility queries are ambiguous or unstable.
  • Avoid CSS selectors for anything expected to survive refactors.
  • Flag any selector that depends on layout, nesting, or generated class names.

This works best when the policy is encoded in examples and review checks. Agents follow patterns. If your repository contains a few clear, durable examples, new tests are more likely to match them. If the existing suite is full of brittle selectors, the agent will copy that pattern too.
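The policy above can also be made mechanical: classify every selector a test uses and flag anything fragile. The tiers and prefixes (Playwright-style `role=` / `text=` syntax) are illustrative:

```typescript
type SelectorTier = "accessible" | "test-hook" | "css-fragile";

function classifySelector(sel: string): SelectorTier {
  if (sel.startsWith("role=") || sel.startsWith("text=")) return "accessible";
  if (sel.includes("data-testid")) return "test-hook";
  return "css-fragile"; // depends on layout, nesting, or generated class names
}

console.log(classifySelector('role=button[name="Save"]'));   // accessible
console.log(classifySelector('[data-testid="save-order"]')); // test-hook
console.log(classifySelector("div > .btn_x9f2"));            // css-fragile
```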

A small methodology note

This is a Test step problem: the test surface should reflect user intent, not implementation detail. That is the kind of constraint we try to keep visible in our methodology.

Bottom line

If an agent is writing or maintaining E2E tests, CSS selectors are usually the wrong default. They are too close to presentation and too far from user behavior. Stable hooks, semantic queries, and explicit review rules give you tests that are easier for agents to maintain and harder for routine UI work to break.

The practical test is simple: if a harmless refactor can break the selector, the selector is doing the wrong job.
