# AI Automating Browsers

When AI automates a browser, the page should own the answer.

The runner should mostly:

- start a deterministic local server
- open one browser run for one scenario or one homogeneous batch
- navigate to `?scenario=...&requestId=...`
- wait for one machine-readable report
- read it back
- tear everything down

If the runner starts feeling like a second layout engine, it is already too complicated.

## Default Shape

This is the boring shape we want by default:

1. Start a deterministic local server.
2. Pick a loopback host that actually responds.
3. Open one browser target for one scenario or one homogeneous batch.
4. Pass the scenario inputs and a unique `requestId` in the URL.
5. Let the page run the scenario or batch.
6. Let the page publish a compact report with phase markers.
7. Wait for the matching `requestId` to reach `phase: 'ready'`.
8. Close the browser target and server.

Keep the scenario in the page. Keep transport in the runner.

## What The Page Owns

The page should:

- read `scenario` or another compact input like `widths=300,310,320`
- read `requestId`
- run the scripted flow
- compute the semantic result
- publish one compact report

Example:

```ts
type Report = {
  phase: 'loading' | 'running' | 'ready' | 'error'
  requestId: string
  scenario: string
  selectedText?: string
  visibleRowRange?: [number, number]
  orderedCardIds?: string[]
  lastDraggedId?: string | null
  message?: string
}
```

Good channels:

- `location.hash` for small live reports
- a local POST endpoint for large batched reports
- one in-page debug object, if a runner on a richer protocol such as CDP or Playwright wants to read page state directly

Keep one source of truth for the report object. It can have more than one transport, but don't make the transports disagree.
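
A minimal sketch of "one report, several transports", assuming a hypothetical `createPublisher` helper (not part of any library). The report is serialized once and the same string is handed to every transport, so the hash, a POST body, and a debug object cannot drift apart. The trimmed `Report` type here is only for illustration.

```ts
type Phase = 'loading' | 'running' | 'ready' | 'error'
type Report = { phase: Phase; requestId: string; scenario: string; message?: string }

// A transport receives the already-serialized report and ships it somewhere.
type Transport = (serialized: string) => void

// Hypothetical helper: holds the single current report and fans it out.
function createPublisher(transports: Transport[]) {
  let current: Report | null = null
  return {
    publish(report: Report): void {
      current = report
      // Serialize once so every transport sees identical content.
      const serialized = JSON.stringify(report)
      for (const send of transports) send(serialized)
    },
    // The one source of truth, e.g. for an in-page debug object.
    current(): Report | null {
      return current
    },
  }
}
```

In a page, one transport might write `location.hash` and another might POST the same string to the runner.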

## `requestId`

Always include a unique `requestId`.

Without it, the runner can easily read:

- a stale report from an earlier run
- another tab's state
- a page that reloaded halfway through

Good:

```ts
const requestId = `${Date.now()}-${Math.random().toString(36).slice(2)}`
const url = `${baseUrl}?scenario=dismiss-to-row&requestId=${encodeURIComponent(requestId)}`
```

Then ignore any report whose `requestId` does not match.
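
A sketch of that filter on the runner side, assuming the page publishes through a `#report=<url-encoded JSON>` hash. `readMatchingReport` is an illustrative name, not an existing API. Anything that fails to parse or carries a different `requestId` is treated as "no report yet".

```ts
type Report = { phase: string; requestId: string; [key: string]: unknown }

// Returns the report only when it belongs to the run we started.
function readMatchingReport(hash: string, expectedRequestId: string): Report | null {
  const prefix = '#report='
  if (!hash.startsWith(prefix)) return null
  try {
    const parsed: unknown = JSON.parse(decodeURIComponent(hash.slice(prefix.length)))
    if (typeof parsed !== 'object' || parsed === null) return null
    const report = parsed as Report
    // A stale run, another tab, or a mid-reload page will carry a different id.
    return report.requestId === expectedRequestId ? report : null
  } catch {
    // Malformed or half-written hash: not an error, just not a report.
    return null
  }
}
```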

## Phase Markers

If the scenario takes real time, publish coarse phases:

- `loading`
- `running`
- `ready`
- `error`

This makes hangs much easier to read:

- never loaded
- loaded but never started
- started but got stuck
- finished with an explicit failure

If the page has a real transport step, say posting a large report back to the runner, add one more specific phase such as `posting`.
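
One way a runner could turn the last phase it saw into a readable timeout message. This is a sketch: `describeHang` and the exact wording are assumptions, and `posting` is included only for pages that have that extra transport step.

```ts
type Phase = 'loading' | 'running' | 'posting' | 'ready' | 'error'

const PHASE_MEANING: Record<Phase, string> = {
  loading: 'loaded but never started the scenario',
  running: 'started but got stuck mid-scenario',
  posting: 'finished the scenario but the report transport stalled',
  ready: 'finished successfully',
  error: 'finished with an explicit failure',
}

// `lastPhase` is the last phase seen for the matching requestId, or null if
// no matching report ever arrived.
function describeHang(lastPhase: Phase | null): string {
  return lastPhase === null
    ? 'never loaded: no report with the matching requestId arrived'
    : PHASE_MEANING[lastPhase]
}
```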

## What The Runner Owns

The runner can do transport work such as:

- start the server
- open the browser
- poll `location.hash`
- wait for one POSTed report
- read one report node from the DOM
- take a screenshot for debugging

It should not:

- reimplement gesture policy
- guess at "done" from sleeps
- infer layout from pixels when the page can already report the semantic answer
- treat browser exit as truth when the page can already say `ready` or `error`

The runner should feel dumb on purpose.

## Temporary Preservation Diffs Are Narrower

For a one-off refactor question like "did any line or box move?", it is fine for the runner to dump rects directly.

Good:

- open the old code, record item, line, or token rects
- open the new code, record again
- diff the JSON on the same machine and browser
- delete the bulky harness afterward

Bad:

- promoting that one-off dump into the permanent browser architecture
- replacing a better page-owned semantic report with a DOM dump just because it was easy once

This is a temporary preservation oracle, not the default answer to browser automation.

## One Page Run, One Browser Run

Keep the page alive while the runner waits.

If you use `--dump-dom`, start one Chrome process for that scenario, wait for the final report, then kill it.

If you use the Chrome DevTools Protocol (CDP) or Playwright, the same idea applies:

- open one browser process
- open one target for that scenario or homogeneous batch
- wait for the final report
- close it

Don't relaunch Chrome inside a polling loop. Don't keep starting fresh tabs just to ask the page whether it is done yet.

If you need many cheap subcases, batch them in the page instead of in the runner.

Good:

- one corpus page load that measures widths `300..900 step=10`
- one gallery page load that reports visible item ids for a few fixed scroll positions

Bad:

- 61 separate browser navigations for 61 widths
- a runner loop that keeps refreshing the same page just to collect one more row
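
A sketch of in-page batching for the width sweep above. `expandWidths`, `runBatch`, and the `lineCount` field are illustrative names; `measure` stands in for whatever the page actually measures at a given width.

```ts
// Expands an inclusive range such as 300..900 step 10 into explicit widths.
function expandWidths(start: number, end: number, step: number): number[] {
  if (step <= 0) throw new RangeError('step must be positive')
  const widths: number[] = []
  for (let w = start; w <= end; w += step) widths.push(w)
  return widths
}

type Row = { width: number; lineCount: number }

// Runs every subcase inside one page load and returns one row per width.
function runBatch(widths: number[], measure: (width: number) => number): Row[] {
  return widths.map(width => ({ width, lineCount: measure(width) }))
}

const widths = expandWidths(300, 900, 10)
console.log(widths.length) // 61 subcases, one navigation
```

The runner then waits for a single report containing all rows instead of navigating 61 times.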

## If JS Owns The Clock, Let The Page Expose Frame Sampling

If the animation is stepped from JS time, don't make the runner guess with sleeps.

Good page-owned API:

- `setViewport(width, height)`
- `scrollToY(y)`
- `clickItem(id)`
- `flushTo(ms)`
- `sample()`

Then the runner can do:

1. set up one scenario
2. flush to `t=16ms`, `t=64ms`, `settled`
3. read one compact geometry report

This is great for resize, anchoring, and transition bugs. The runner stays dumb. The page stays the only layout engine.

Bad:

- `await sleep(16)` and hoping you caught the right frame
- screenshotting a transition and inferring geometry from pixels
- runner-side math that tries to reconstruct where the item "must have been"
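
A toy version of a page-owned clock, to show the shape of `flushTo` and `sample`. Everything else here is an assumption for illustration: `createSteppedAnimation`, the single `x` coordinate, and linear easing. The point is that time only advances when the caller says so, so every sample is reproducible.

```ts
type Sample = { t: number; x: number }

// The animation advances only when flushTo is called; no real timers involved.
function createSteppedAnimation(fromX: number, toX: number, durationMs: number) {
  let now = 0
  return {
    // Advance the page-owned clock to an absolute time. It never runs backwards.
    flushTo(ms: number): void {
      now = Math.max(now, ms)
    },
    // Report semantic geometry at the current clock value (linear easing assumed).
    sample(): Sample {
      const progress = Math.min(now / durationMs, 1)
      return { t: now, x: fromX + (toX - fromX) * progress }
    },
  }
}

const anim = createSteppedAnimation(0, 100, 256)
const samples = [16, 64, 256].map(t => {
  anim.flushTo(t)
  return anim.sample()
})
console.log(samples.map(s => s.x)) // [6.25, 25, 100]
```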

## Tiny CDP Poller Is Fine

A tiny Chrome CDP poller is a good default when you want no extra dependency.

That can be as small as:

- launch headless Chrome with remote debugging
- create a target for the scenario URL
- poll `location.hash` through `Runtime.evaluate`
- stop when the page reports the matching `requestId` and `phase: 'ready'`

This is still the same pattern. CDP is just transport.
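
A minimal sketch of that poll loop with the transport injected. `evaluate` is a stand-in for whatever wraps `Runtime.evaluate` in your runner and returns the expression's string value; `pollForReady` and `parseHashReport` are illustrative names, and the `#report=` hash format is an assumption borrowed from the page example later in this document.

```ts
type Report = { phase: string; requestId: string; message?: string }

function parseHashReport(hash: string): Report | null {
  const prefix = '#report='
  if (!hash.startsWith(prefix)) return null
  try {
    return JSON.parse(decodeURIComponent(hash.slice(prefix.length))) as Report
  } catch {
    return null
  }
}

async function pollForReady(
  evaluate: (expression: string) => Promise<string>,
  requestId: string,
  { intervalMs = 50, timeoutMs = 10_000 } = {},
): Promise<Report> {
  const deadline = Date.now() + timeoutMs
  let lastPhase = 'none'
  while (Date.now() < deadline) {
    const report = parseHashReport(await evaluate('location.hash'))
    // Reports from other runs are ignored, never treated as ours.
    if (report && report.requestId === requestId) {
      lastPhase = report.phase
      if (report.phase === 'ready') return report
      if (report.phase === 'error') throw new Error(`page reported error: ${report.message ?? 'no message'}`)
    }
    await new Promise(resolve => setTimeout(resolve, intervalMs))
  }
  // Fail loudly and say how far the page got.
  throw new Error(`timed out waiting for requestId ${requestId}; last phase: ${lastPhase}`)
}
```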

Playwright is fine when you really need richer browser control. It is not the default just because it is familiar.

## Headless Is A Different Environment Until Proven Otherwise

Don't assume headed and headless are equivalent just because they use the same browser binary.

If you want to switch a checked-in correctness sweep from headed Chrome to headless Chrome:

- pin the headless screen environment explicitly
- generate the full machine-readable report
- mechanically diff it against the trusted snapshot

If the diff is not empty, the answer is "not equivalent yet", not "close enough".

Good:

- `--screen-info={3024x1964 devicePixelRatio=2}` with a matching Playwright context
- a zero-diff compare against the trusted Chrome sweep JSON

Bad:

- "same Chrome version, so it must match"
- silently rerunning headed after headless disagrees
- updating the baseline before proving parity
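
A sketch of the mechanical diff, assuming both reports are plain JSON values already parsed in memory. `diffJson` and `assertParity` are illustrative names. Returning the differing paths keeps the failure readable without tempting anyone to eyeball "close enough".

```ts
const isObject = (v: unknown): v is Record<string, unknown> =>
  typeof v === 'object' && v !== null

// Returns the paths at which two JSON values differ; empty means identical.
function diffJson(trusted: unknown, candidate: unknown, path = '$'): string[] {
  if (trusted === candidate) return []
  if (
    !isObject(trusted) ||
    !isObject(candidate) ||
    Array.isArray(trusted) !== Array.isArray(candidate)
  ) {
    return [path]
  }
  const keys = Array.from(new Set([...Object.keys(trusted), ...Object.keys(candidate)]))
  const diffs: string[] = []
  for (const key of keys) {
    diffs.push(...diffJson(trusted[key], candidate[key], `${path}.${key}`))
  }
  return diffs
}

// Any difference at all means "not equivalent yet".
function assertParity(trusted: unknown, candidate: unknown): void {
  const diffs = diffJson(trusted, candidate)
  if (diffs.length > 0) {
    throw new Error(`not equivalent yet: ${diffs.length} differing path(s), first at ${diffs[0]}`)
  }
}
```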

## Deterministic Servers

Prefer a deterministic page server, e.g.:

```sh
bun --port 3210 --no-hmr pages/*.html
```

Don't run correctness automation against an HMR session if you can avoid it.

HMR preserves too much state:

- stale modules
- old page state
- confusing reload timing

## Probe Loopback Hosts

Don't assume one loopback hostname will work everywhere.

Probe:

- `http://127.0.0.1`
- `http://localhost`
- `http://[::1]`

and use the one that actually responds.
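
A sketch of the probe, assuming a runtime with global `fetch` and `AbortController` (recent Node, Bun, or a browser). `pickLoopbackBase` and `respondsOverHttp` are illustrative names, and the 500 ms timeout is an arbitrary choice. Any HTTP response counts as "responds", since the question is reachability, not page content.

```ts
const LOOPBACK_HOSTS = ['http://127.0.0.1', 'http://localhost', 'http://[::1]']

// Default probe: true if the server answers at all before the timeout.
async function respondsOverHttp(baseUrl: string, timeoutMs = 500): Promise<boolean> {
  const controller = new AbortController()
  const timer = setTimeout(() => controller.abort(), timeoutMs)
  try {
    await fetch(baseUrl, { signal: controller.signal })
    return true // even a 404 proves the host is reachable
  } catch {
    return false // refused, unresolved, or timed out
  } finally {
    clearTimeout(timer)
  }
}

// Tries each loopback host in order and returns the first base URL that answers.
async function pickLoopbackBase(
  port: number,
  responds: (baseUrl: string) => Promise<boolean> = respondsOverHttp,
): Promise<string> {
  for (const host of LOOPBACK_HOSTS) {
    const base = `${host}:${port}`
    if (await responds(base)) return base
  }
  throw new Error(`no loopback host responded on port ${port}`)
}
```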

## Keep Asset Timing Out Of It

If the thing under test is routing, selection, scroll position, z-order, or occlusion, don't let CDN timing decide pass or fail.

In report mode, it is fine to swap remote images for a tiny local placeholder if the assets themselves are not under test.

## Correctness vs Benchmarks

Correctness runs and benchmark runs are different.

Correctness runs:

- are mostly about whether the semantic report is right
- are fine with a minimal dumb runner
- are often great candidates for in-page batching

Benchmark runs:

- care about focus, throttling, and `requestAnimationFrame`
- should not trust numbers from a background tab
- should not casually switch browser driver or browser mode without a fresh benchmark baseline

Don't mix those goals by accident.

For correctness runs, it is often enough to drive a manual page-owned clock and read semantic geometry. That is a different job from measuring real frame cadence.

## Fail Loudly On Drift

Don't hide transport drift or environment drift with silent fallbacks.

Bad:

- if Playwright headless disagrees, quietly rerun headed
- if a POST report never arrives, pretend the hash report was good enough
- if Safari times out, mark the run green because Chrome matched

Good:

- fail if the `requestId` never comes back
- fail if the page never leaves `loading`
- fail if the page gets stuck at `posting`
- fail if the pinned headless width or DPR does not match what the page sees
- fail in compare mode if the new runner's JSON differs from the trusted snapshot
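
A sketch of the pinned-environment check from the list above. It assumes the page includes the width and device pixel ratio it observed in its report (for example from `window.devicePixelRatio`); `ScreenEnv` and `assertPinnedEnvironment` are illustrative names.

```ts
type ScreenEnv = { width: number; devicePixelRatio: number }

// Throws if the page did not see the environment the runner pinned.
function assertPinnedEnvironment(pinned: ScreenEnv, seenByPage: ScreenEnv): void {
  const keys: (keyof ScreenEnv)[] = ['width', 'devicePixelRatio']
  const mismatches = keys
    .filter(key => pinned[key] !== seenByPage[key])
    .map(key => `${key}: pinned ${pinned[key]}, page saw ${seenByPage[key]}`)
  if (mismatches.length > 0) {
    // No fallback and no retry in another mode: surface the drift.
    throw new Error(`environment drift: ${mismatches.join('; ')}`)
  }
}
```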

## Common Automation Artifacts

Some things look scary but are just automation noise:

- `ERR_CONNECTION_REFUSED` after a temporary server exits
- an early `loading` report before the final `ready` report
- a stale report with the wrong `requestId`
- the automation tab disappearing during teardown

Don't treat these as product failures by default.

## Example

Page:

```ts
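// `Report` is the type shown under "What The Page Owns".
const params = new URLSearchParams(location.search)
const scenario = params.get('scenario') ?? ''
const requestId = params.get('requestId') ?? ''

function publishReport(report: Report) {
  location.hash = `report=${encodeURIComponent(JSON.stringify(report))}`
}

publishReport({phase: 'loading', requestId, scenario})

try {
  // ...run the scenario and compute `selectedText`...

  publishReport({
    phase: 'ready',
    requestId,
    scenario,
    selectedText,
  })
} catch (error) {
  // Publish failures explicitly so the runner never has to infer them from a timeout.
  publishReport({phase: 'error', requestId, scenario, message: String(error)})
}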
```

Runner:

```ts
const requestId = crypto.randomUUID()
const url = `${baseUrl}?scenario=caption-selection&requestId=${encodeURIComponent(requestId)}`

await openChrome(url)

const report = await waitForMatchingReadyReport(requestId)
expect(report.selectedText).toBe(expectedPrompt)
```

Large batched report:

```ts
const requestId = crypto.randomUUID()
const reportServer = await startPostedReportServer(requestId)
const url =
  `${baseUrl}?widths=${encodeURIComponent('300,310,320')}` +
  `&requestId=${encodeURIComponent(requestId)}` +
  `&reportEndpoint=${encodeURIComponent(reportServer.endpoint)}`

await openChrome(url)

const report = await reportServer.waitForReport()
expect(report.rows).toHaveLength(3)
```