Skip to content

ADR-0047: Fake-adapter contract testing via cassette record/replay

  • Status: Accepted
  • Accepted on: 2026-05-12 — promoted after slice #5 audit confirmed live production usage.
  • Date: 2026-04-23
  • Supersedes: none
  • Superseded by: none
  • Related: ADR-0045 §4.2 initiated this work; ADR-0022 (integration test harness) — this ADR extends it with contract testing of fakes.
  • Enforced by: make trust-contracts; tests/trust/contracts/test_fake_*_contract.py; src/contract_refresh_loop.py (the weekly refresh loop that keeps cassettes in sync with reality).

Context

HydraFlow's scenario test harness (MockWorld, ADR-0022) uses fake adapters — FakeGitHub, FakeGit, FakeDocker, FakeLLM — to replace real external services during tests. The problem: a fake that silently diverges from the real service is worse than no fake, because green tests hide broken code. Historically this manifested as "tests pass, but the first production call blows up on a field the fake didn't return" or "the fake's stdout shape was never what gh actually prints." As the number of fakes grew, the divergence risk compounded.

We needed a way to: 1. Prove the fakes match the real services' observable behavior for the call shapes the code actually uses. 2. Detect drift automatically — if gh changes its output format, we must hear about it before production does. 3. Be cheap to maintain — the cassette corpus should update itself, not require a human to re-record every time an API changes.

Decision

Adopt a record/replay cassette pattern with three parts:

Part 1 — Cassette schema

Each cassette is a YAML file (or JSONL for streaming APIs) stored under tests/trust/contracts/cassettes/<adapter>/<scenario>.yaml with a strict pydantic model (tests/trust/contracts/_schema.py::Cassette). The schema includes: - adapter name (gated to the 4 known values). - input (command + args + stdin). - output (stdout + stderr + exit code). - normalizers (list of declarative field-normalizer names — e.g. pr_number, timestamps.ISO8601, sha:short) that strip volatile fields before compare. - recorded_at / recorder_sha for forensic tracing.

Part 2 — Per-adapter contract tests

One test_fake_<adapter>_contract.py per fake. Each loads committed cassettes and replays them against the fake via replay_cassette (tests/trust/contracts/_replay.py). Drift assertion: the normalized got output must equal the normalized expected output, byte-for-byte. Claude streams bypass the YAML schema and compare as normalized JSONL via _normalize_claude_stream in src/contract_diff.py.

Part 3 — Automated refresh (ContractRefreshLoop)

Weekly, ContractRefreshLoop (spec §4.2) re-records the corpus against the real services using recorders in src/contract_recording.py, normalizes via the declared normalizers, diffs against committed cassettes (src/contract_diff.py), and opens a refresh PR if drift is detected. After the PR opens, the loop re-runs make trust-contracts locally (the replay gate) — if it still fails with the new bytes, a fake-drift companion issue is filed for a human to triage because it means the fake's code (not just the cassette) needs updating.

Consequences

Positive: - Fakes now have a machine-checkable contract. A commit that changes FakeGitHub.create_pr's output shape breaks test_fake_github_contract.py until the cassette is re-recorded to match (which is the right signal — "you changed the fake, prove the contract still holds"). - Drift in the real API surfaces in a refresh PR, not in a production incident. A CI reviewer sees "contract-refresh: 2026-04-30 (github)" and auto-merges or triages. - New fake adapters follow the pattern mechanically: add a recorder to contract_recording.py, add cassettes, add a test_fake_<name>_contract.py, done.

Negative: - Cassette maintenance is a real cost. Every schema change to the fake requires re-recording. Mitigated by the automated refresh loop (weekly) and by the normalizer registry (volatile fields filtered automatically). - Live recording requires credentials (gh auth, Docker daemon, Claude CLI). The refresh loop gracefully no-ops on missing credentials — a tolerated partial state. - The 3-attempt escalation tracker in ContractRefreshLoop means a persistently-drifting adapter can take up to 3 weekly cycles to escalate to HITL. Acceptable tradeoff — most drift resolves in one refresh cycle.

Neutral: - Cassette YAMLs live in the repo and therefore participate in git diff / history. Large cassettes could bloat the repo. Mitigated by keeping cassettes small (one call per cassette) and preferring JSONL for streaming.

When to add a new cassette

  • A new call shape is added to a fake. The fake code must be backed by a cassette proving the new shape matches the real API.
  • An adapter's cassette directory is empty but the fake is wired into production code paths. Fix by recording a baseline cassette even if the corpus is small.

Auditor scope (right-sizing, 2026-06-11)

FakeCoverageAuditorLoop (spec §4.7) enforces this ADR by flagging un-cassetted fake methods. Its adapter-surface audit is scoped to the cassette-capable adapters only — those with a real recorder in src/contract_recording.py and a YAML cassette corpus: github, git, docker (claude is contract-tested via JSONL streams, not the YAML cassette-dir). This is the _FAKE_TO_CASSETTE_DIR registry.

Two rules keep the audit honest and noise-free:

  1. No recordable surface → not audited. Fakes with no real recorder/cassette corpus (FakeBeads, FakeSentry, FakeHTTP, FakeSubprocessRunner, FakeFS, FakeLLM) are out of scope for the adapter-surface audit — there is nothing to contract a cassette against. Adding one of these to the cassette system means first adding a recorder + cassette dir (per "When to add a new cassette"), then registering it. Auditing them before that infrastructure exists only produces false positives.
  2. Scaffolding is not adapter surface. A fake public method with no counterpart on its real production adapter (or that adapter's Port) is in-memory test scaffolding (add_issue, from_seed, seed_*, set_*) — there is no real API call to record, so it is excluded via _FAKE_REAL_SURFACE_SOURCES.

The genuine remaining github adapter-surface debt (real PRManager methods lacking cassettes, e.g. mutating gh pr create/gh pr merge) is blocked on an isolated recording sandbox (issue #9080); see "When to replace this pattern" and the negative consequence above re: live-recording credentials.

When to replace this pattern

If a service becomes so dynamic (e.g. non-idempotent responses, per-request IDs that can't be normalized) that cassettes can't faithfully capture it, the fake is no longer a valid test double. Options: - Move the service to a live-only test lane (e.g. nightly against a sandbox). - Narrow the fake's surface to deterministic operations and document the rest as untested. - Supersede this ADR with a different testing strategy (property-based? contract-by-protobuf?).

  • Cassette normalizers registry: tests/trust/contracts/_schema.py::NORMALIZERS
  • Recorders (capture real-service output): src/contract_recording.py
  • Diff/drift detection: src/contract_diff.py
  • Refresh loop: src/contract_refresh_loop.py
  • Fleet scenario: tests/scenarios/test_contract_refresh_scenario.py