ADR-0043: Prompt structure standard (XML tags, 8-criterion rubric, mechanical scoring)¶
- Status: Proposed
- Date: 2026-04-21
- Supersedes: none (codifies an implicit convention that was never documented)
- Superseded by: none
Context¶
HydraFlow ships ~26 distinct prompts to Claude across the triage / plan /
implement / review / HITL loops and adjacent helpers (ADR review, diff sanity,
test adequacy, PR unsticker, conflict resolver, expert council, diagnostic
runner, spec match, arch compliance). Every one is hand-assembled with f-string
interpolation and markdown headings (## Issue, ## Plan, ## Review
Feedback), and every one has grown organically as each loop evolved.
The prompt audit landed in PR #8376 scored all 26 against an 8-criterion rubric derived from Anthropic's own published prompt-engineering guidance. The headline finding: 25 of 26 prompts score High severity, with the worst offenders being
- Criterion #3 (XML tag structure) — zero content regions use named tags across the entire factory. Fail rate: 100% (26/26).
- Criterion #8 (edge cases named) — prompts rarely tell the model what to do on empty / truncated / unclear input. Fail rate: 85% (22/26).
- Criterion #1 (leads with the request) — the imperative ("return a JSON object…", "produce an implementation patch…") lands mid-prompt, after criteria / classification / context. Fail rate: 69% (18/26).
Worst offenders by fail count: diagnostic_runner and expert_council_vote
each fail 6 of the 8 criteria, followed by reviewer_ci_fix and
triage_decomposition at 5 each.
Known methodology limitation — #6 is under-reported. Criterion #6
(long-context placement) scored 0% fails against the audit fixtures, but
that reflects fixture size, not production behavior. The fixtures hold
hand-written ≤1KB issue bodies and truncated diffs so the PR diff stays
reviewable. Under production inputs (real issue bodies, full diffs, multi-comment
discussion threads, prior-failure traces), the implement prompt
(agent.AgentRunner._build_prompt_with_stats) will routinely exceed the 10K
threshold and, lacking any tagged content regions, is expected to fail #6.
Sub-project 2's eval gate — which renders against realistic inputs captured
from a canary repo — will be the authoritative measurement of #6 in
production. The rubric itself is correct; the audit's fixtures are just too
small to exercise it.
No architectural decision has ever codified what a prompt in this codebase should look like. The audit's findings are not anyone's fault — they're the expected outcome of 18 months of organic growth without a standard.
Decision¶
Adopt a three-part standard for every Claude-bound prompt in HydraFlow:
1. Standard XML tag vocabulary for content regions¶
| Tag | Purpose |
|---|---|
<issue> |
GitHub issue title / body / labels / comments |
<plan> |
Planner-produced implementation plan |
<diff> |
Patch, PR diff, conflict markers |
<history> |
Prior comments, review feedback, attempt logs |
<constraints> |
Invariants the model must respect |
<manifest> |
File list / repo layout |
<prior_review> |
Last reviewer's feedback (distinct from <history>) |
<output_format> |
The output contract (what to produce, what to avoid) |
<example> |
Few-shot examples |
<thinking> |
CoT scaffold (output-side, not input) |
Markdown headings remain acceptable for human-readable sub-structure inside a tagged region, but tags own the machine-critical boundaries.
2. Eight-criterion rubric for mechanical scoring¶
| # | Criterion | Rule |
|---|---|---|
| 1 | Leads with the request | First non-whitespace sentence (pre-tag) contains an imperative from {produce, return, generate, classify, review, decide, output, propose, write, summarize}. |
| 2 | Specific | Rendered prompt contains 3/3 of: output-artifact noun; named fields or schema; success criteria phrasing. |
| 3 | XML tags | ≥3 distinct <content>...</content> pairs (excluding <thinking> / <scratchpad>). |
| 4 | Examples where applicable | If structured output cues present, <example> or Example: required. |
| 5 | Output contract | ≥1 of: respond with, do not, no prose, return only, output format, the output must. |
| 6 | Placement of long context | For rendered prompts ≥10K chars, largest tagged block must end before the last imperative. |
| 7 | CoT scaffolded | Decision verbs present → require <thinking> / <scratchpad> / think step by step. |
| 8 | Edge cases named | ≥1 of: if empty/missing/truncated/unclear/no …, when the … is not/cannot/fails, otherwise,, in case of, fallback, do not assume. |
Severity: - High — 2+ Fails, or any Fail on #1 or #6. - Medium — 1 Fail or 3+ Partials. - Low — 0 Fails, ≤2 Partials.
3. Enforcement path — staged across sub-projects 2–4¶
The standard lands in four sub-projects tracked in
docs/superpowers/specs/2026-04-20-prompt-audit-design.md:
- Sub-project 1 (this PR, #8376) — audit tool + committed report + fixture corpus. No prompt rewrites.
- Sub-project 2 — eval gate. Wire the audit's fixture set + rendered-snapshot baseline into the staging→main promotion gate (ADR-0042). Every prompt change must pass rubric parity.
- Sub-project 3 — shared template. A thin
src/prompt_template.pyutility codifies section order and the tag vocabulary. Optional for builders to adopt; reduces drift. - Sub-project 4 — normalization PRs. One PR per loop migrates its
prompts to the standard. Each PR is reviewed against the eval gate
established in sub-project 2. Priority order (ranked by fail count in
the audit report):
diagnostic_runner(6 fails),expert_council_vote(6),reviewer_ci_fix(5),triage_decomposition(5), then theagent.*variants (4 each). The implement-prompt variants should be re-prioritized once sub-project 2's canary produces production-scale renderings and #6 becomes measurable.
Rationale — why mechanical scoring, not LLM-as-judge¶
Three alternatives were considered:
- LLM-as-judge: ask Claude to score each prompt. Most flexible, but non-deterministic (same prompt scores differently across runs), expensive at CI scale (hundreds of judging calls per gate-run), and hard to debug when a legitimate structural change trips the judge.
- Manual review: humans score each prompt. Catches semantic issues automation misses, but not reproducible and doesn't scale to an eval gate that runs on every staging→main promotion.
- Mechanical (chosen): regex/keyword rules over the rendered prompt string. Reproducible (same input → same score, always), cheap (sub-second per prompt, runs in CI without LLM budget), and easy to debug (failing criterion points at a specific keyword).
Mechanical scoring has known false-positive risk: a prompt that happens to contain the right keywords can pass while still being structurally weak. This is accepted — the rubric's job is to catch the worst violations reproducibly, not to judge semantic quality. Humans review PRs with the rendered excerpts in the report; semantic quality lives there, not in the gate.
Consequences¶
Positive - Prompt quality becomes a tracked, enforced property of the codebase, not an implicit convention. - Sub-projects 2–4 have an unambiguous spec to implement against. - Prompt-related PR reviews get a mechanical floor — reviewers can focus on semantic correctness instead of re-arguing "should we use tags?" - Future prompts added to new loops automatically inherit the standard via the eval gate (sub-project 2).
Negative
- Normalization is a 5-loop migration with behavior-sensitive surface area.
Each sub-project-4 PR carries real regression risk until it clears the
eval gate. Mitigation: per-loop rollout with canary runs.
- The rubric's mechanical rules will produce occasional false positives —
reviewers must be able to override with justification. The eval gate
accepts a documented override; it does not accept silent drift.
- Every new Claude-bound builder added to the codebase must register with
PROMPT_REGISTRY or CI fails. Low ongoing cost but a new contribution
friction.
Neutral
- The existing PromptBuilder truncation infrastructure (src/prompt_builder.py)
is unaffected — it operates on text content inside sections and is
orthogonal to the section-structure standard.
Alternatives considered¶
- Full normalization in one PR. Rejected: 26 prompts × behavior-sensitive rewrites = regression surface larger than one review can hold. Split into five per-loop PRs + one adjacent-helpers PR, each passing the gate.
- Markdown-heading convention instead of XML tags. Rejected: Anthropic's published guidance explicitly recommends XML tags for content regions, and the audit's fail rate on #3 is the biggest single structural lever.
- Per-loop ad-hoc structure. Rejected: that's the current state. The audit quantified the cost — 25/26 High severity.
- Build the eval gate without the audit. Rejected: without a baseline, the gate has nothing to enforce against. Sub-project 1's fixture + snapshot corpus is the gate's input.
Links¶
- PR #8376 — sub-project 1, audit tool + report + fixture corpus.
docs/prompt-audit-2026-04-20.md— the generated report.docs/superpowers/specs/2026-04-20-prompt-audit-design.md— full design spec.docs/superpowers/plans/2026-04-20-prompt-audit.md— implementation plan.- ADR-0042 — two-tier branch / staging→main promotion (gate target).