ADR-0043: Prompt structure standard (XML tags, 8-criterion rubric, mechanical scoring)¶

Status: Proposed
Date: 2026-04-21
Supersedes: none (codifies an implicit convention that was never documented)
Superseded by: none

Context¶

HydraFlow ships ~26 distinct prompts to Claude across the triage / plan / implement / review / HITL loops and adjacent helpers (ADR review, diff sanity, test adequacy, PR unsticker, conflict resolver, expert council, diagnostic runner, spec match, arch compliance). Every one is hand-assembled with f-string interpolation and markdown headings (## Issue, ## Plan, ## Review Feedback), and every one has grown organically as each loop evolved.

The prompt audit landed in PR #8376 scored all 26 against an 8-criterion rubric derived from Anthropic's own published prompt-engineering guidance. The headline finding: 25 of 26 prompts score High severity, with the worst offenders being

Criterion #3 (XML tag structure) — zero content regions use named tags across the entire factory. Fail rate: 100% (26/26).
Criterion #8 (edge cases named) — prompts rarely tell the model what to do on empty / truncated / unclear input. Fail rate: 85% (22/26).
Criterion #1 (leads with the request) — the imperative ("return a JSON object…", "produce an implementation patch…") lands mid-prompt, after criteria / classification / context. Fail rate: 69% (18/26).

Worst offenders by fail count: diagnostic_runner and expert_council_vote each fail 6 of the 8 criteria, followed by reviewer_ci_fix and triage_decomposition at 5 each.

Known methodology limitation — #6 is under-reported. Criterion #6 (long-context placement) scored 0% fails against the audit fixtures, but that reflects fixture size, not production behavior. The fixtures hold hand-written ≤1KB issue bodies and truncated diffs so the PR diff stays reviewable. Under production inputs (real issue bodies, full diffs, multi-comment discussion threads, prior-failure traces), the implement prompt (agent.AgentRunner._build_prompt_with_stats) will routinely exceed the 10K threshold and, lacking any tagged content regions, is expected to fail #6. Sub-project 2's eval gate — which renders against realistic inputs captured from a canary repo — will be the authoritative measurement of #6 in production. The rubric itself is correct; the audit's fixtures are just too small to exercise it.

No architectural decision has ever codified what a prompt in this codebase should look like. The audit's findings are not anyone's fault — they're the expected outcome of 18 months of organic growth without a standard.

Decision¶

Adopt a three-part standard for every Claude-bound prompt in HydraFlow:

1. Standard XML tag vocabulary for content regions¶

Tag	Purpose
`<issue>`	GitHub issue title / body / labels / comments
`<plan>`	Planner-produced implementation plan
`<diff>`	Patch, PR diff, conflict markers
`<history>`	Prior comments, review feedback, attempt logs
`<constraints>`	Invariants the model must respect
`<manifest>`	File list / repo layout
`<prior_review>`	Last reviewer's feedback (distinct from `<history>`)
`<output_format>`	The output contract (what to produce, what to avoid)
`<example>`	Few-shot examples
`<thinking>`	CoT scaffold (output-side, not input)

Markdown headings remain acceptable for human-readable sub-structure inside a tagged region, but tags own the machine-critical boundaries.

2. Eight-criterion rubric for mechanical scoring¶

#	Criterion	Rule
1	Leads with the request	First non-whitespace sentence (pre-tag) contains an imperative from `{produce, return, generate, classify, review, decide, output, propose, write, summarize}`.
2	Specific	Rendered prompt contains 3/3 of: output-artifact noun; named fields or schema; success criteria phrasing.
3	XML tags	≥3 distinct `<content>...</content>` pairs (excluding `<thinking>` / `<scratchpad>`).
4	Examples where applicable	If structured output cues present, `<example>` or `Example:` required.
5	Output contract	≥1 of: `respond with`, `do not`, `no prose`, `return only`, `output format`, `the output must`.
6	Placement of long context	For rendered prompts ≥10K chars, largest tagged block must end before the last imperative.
7	CoT scaffolded	Decision verbs present → require `<thinking>` / `<scratchpad>` / `think step by step`.
8	Edge cases named	≥1 of: `if empty/missing/truncated/unclear/no …`, `when the … is not/cannot/fails`, `otherwise,`, `in case of`, `fallback`, `do not assume`.

Severity: - High — 2+ Fails, or any Fail on #1 or #6. - Medium — 1 Fail or 3+ Partials. - Low — 0 Fails, ≤2 Partials.

3. Enforcement path — staged across sub-projects 2–4¶

The standard lands in four sub-projects tracked in docs/superpowers/specs/2026-04-20-prompt-audit-design.md:

Sub-project 1 (this PR, #8376) — audit tool + committed report + fixture corpus. No prompt rewrites.
Sub-project 2 — eval gate. Wire the audit's fixture set + rendered-snapshot baseline into the staging→main promotion gate (ADR-0042). Every prompt change must pass rubric parity.
Sub-project 3 — shared template. A thin src/prompt_template.py utility codifies section order and the tag vocabulary. Optional for builders to adopt; reduces drift.
Sub-project 4 — normalization PRs. One PR per loop migrates its prompts to the standard. Each PR is reviewed against the eval gate established in sub-project 2. Priority order (ranked by fail count in the audit report): diagnostic_runner (6 fails), expert_council_vote (6), reviewer_ci_fix (5), triage_decomposition (5), then the agent.* variants (4 each). The implement-prompt variants should be re-prioritized once sub-project 2's canary produces production-scale renderings and #6 becomes measurable.

Rationale — why mechanical scoring, not LLM-as-judge¶

Three alternatives were considered:

LLM-as-judge: ask Claude to score each prompt. Most flexible, but non-deterministic (same prompt scores differently across runs), expensive at CI scale (hundreds of judging calls per gate-run), and hard to debug when a legitimate structural change trips the judge.
Manual review: humans score each prompt. Catches semantic issues automation misses, but not reproducible and doesn't scale to an eval gate that runs on every staging→main promotion.
Mechanical (chosen): regex/keyword rules over the rendered prompt string. Reproducible (same input → same score, always), cheap (sub-second per prompt, runs in CI without LLM budget), and easy to debug (failing criterion points at a specific keyword).

Mechanical scoring has known false-positive risk: a prompt that happens to contain the right keywords can pass while still being structurally weak. This is accepted — the rubric's job is to catch the worst violations reproducibly, not to judge semantic quality. Humans review PRs with the rendered excerpts in the report; semantic quality lives there, not in the gate.

Consequences¶

Positive - Prompt quality becomes a tracked, enforced property of the codebase, not an implicit convention. - Sub-projects 2–4 have an unambiguous spec to implement against. - Prompt-related PR reviews get a mechanical floor — reviewers can focus on semantic correctness instead of re-arguing "should we use tags?" - Future prompts added to new loops automatically inherit the standard via the eval gate (sub-project 2).

Negative - Normalization is a 5-loop migration with behavior-sensitive surface area. Each sub-project-4 PR carries real regression risk until it clears the eval gate. Mitigation: per-loop rollout with canary runs. - The rubric's mechanical rules will produce occasional false positives — reviewers must be able to override with justification. The eval gate accepts a documented override; it does not accept silent drift. - Every new Claude-bound builder added to the codebase must register with PROMPT_REGISTRY or CI fails. Low ongoing cost but a new contribution friction.

Neutral - The existing PromptBuilder truncation infrastructure (src/prompt_builder.py) is unaffected — it operates on text content inside sections and is orthogonal to the section-structure standard.

Alternatives considered¶

Full normalization in one PR. Rejected: 26 prompts × behavior-sensitive rewrites = regression surface larger than one review can hold. Split into five per-loop PRs + one adjacent-helpers PR, each passing the gate.
Markdown-heading convention instead of XML tags. Rejected: Anthropic's published guidance explicitly recommends XML tags for content regions, and the audit's fail rate on #3 is the biggest single structural lever.
Per-loop ad-hoc structure. Rejected: that's the current state. The audit quantified the cost — 25/26 High severity.
Build the eval gate without the audit. Rejected: without a baseline, the gate has nothing to enforce against. Sub-project 1's fixture + snapshot corpus is the gate's input.

Links¶

PR #8376 — sub-project 1, audit tool + report + fixture corpus.
docs/prompt-audit-2026-04-20.md — the generated report.
docs/superpowers/specs/2026-04-20-prompt-audit-design.md — full design spec.
docs/superpowers/plans/2026-04-20-prompt-audit.md — implementation plan.
ADR-0042 — two-tier branch / staging→main promotion (gate target).