Files
hyperguild/docs/superpowers/specs/2026-05-03-pass-rate-logging-design.md
Mathias Bergqvist af52f501fe docs(specs): pass-rate logging design — Plan 5 of hyperguild migration
Skill instrumentation pattern, brain /pass-rate HTTP endpoint, and
optional hyperguild CLI subcommand for shell access. Pilot with tdd
SKILL.md, then roll out to 6 binary-outcome skills (code-review,
debug, feature-spec, session-retrospective, trainer, spec-driven-dev).

Decisions: SKILL.md as source of truth for the logging contract;
on-demand aggregation from JSONL (no materialized counters until
proven necessary); pass|fail|skip vocabulary forward, with
ok|error|skipped accepted by the read-side aggregator for backwards
compat.

Seven success criteria, ten out-of-scope items, six risks.
2026-05-03 22:23:28 +02:00

10 KiB
Raw Blame History

Spec: Pass-rate logging

Plan 5 of 7 — Hyperguild Skill Migration. Loaded after feature-spec skill.

Problem Statement

Plan 6 (Mode 2 routing pod) needs a per-skill signal to decide whether to route a call to the local model or keep it on Claude. The natural signal is recent pass rate: a skill that succeeds 95% of the time on local is safe to route; a skill that succeeds 60% is not. Today there is no such signal — the session_log MCP exists (shipped in Plan 1) but skills don't reliably call it, and no endpoint computes pass rate from the resulting logs.

Two consequences:

  1. Plan 6 cannot be trusted without baseline data. Routing decisions made on guesses will produce regressions that erode confidence in Mode 2 entirely.
  2. The skill library has no observability. When a skill regresses (model swap, prompt drift, environment change), there's no way to notice until a downstream task explicitly fails.

Why now: Plans 14 are merged. Plan 5 instruments the discipline that Plan 6 will consume. Several weeks of usage data between Plan 5 merge and Plan 6 deploy will mean Plan 6 lands on real numbers, not synthetic.

Success Criteria

  • After Plan 5 merges, every invocation of tdd (pilot skill) calls session_log at the end of each phase (red, green, refactor) with final_status ∈ {pass, fail, skip}.
  • At least 6 of the remaining "binary-outcome" skills get the same treatment: code-review, debug, feature-spec, session-retrospective, trainer, spec-driven-dev. (Skills with no clear pass/fail — clean-code, cognitive-load, solid, refactoring, test-design, problem-analysis, user-stories, planning, atdd, gitea-ci — are out of scope.)
  • A new HTTP REST endpoint GET /pass-rate?skill=X&window=7d on the brain pod returns valid JSON {skill, window, pass, fail, skip, total, pass_rate} for any skill name. Skills with no logged invocations return zeros (not 404, not error). Pass rate is pass / (pass + fail); if pass + fail == 0, returns pass_rate: null.
  • The endpoint's aggregator normalizes legacy values: passok, failerror, skipskipped. No data loss when scanning historical logs.
  • An optional CLI subcommand hyperguild brain pass-rate <skill> [--window 7d] [--json] calls the endpoint and prints either human-readable (tdd: 47 / 50 = 94% (window: 7d)) or JSON.
  • task check passes (lint + test + vet + drift + govulncheck) on each task and on the merged branch.
  • One week post-merge, GET /pass-rate?skill=tdd&window=7d returns non-zero counts and a real pass_rate.

Constraints

  • Stdlib + existing deps only. The endpoint adds to the existing ingestion pod's HTTP handler (Go, net/http). No new service, no new pod, no new persistence layer.
  • No auth on /pass-rate. Same model as the rest of the brain HTTP REST API: Tailscale-only network, no token.
  • Schema: the SKILL.md template uses pass | fail | skip for final_status. The aggregator treats pass and ok as equivalent, fail and error as equivalent, skip and skipped as equivalent. New writes from skills MUST use the new vocabulary; the aggregator handles both for read-back.
  • Storage: continues to use the existing JSONL files at <pod>/brain/sessions/*.jsonl. No format change. No materialized aggregates. If on-demand scans become slow (>500ms p99), revisit in a follow-up; not now.
  • Backwards compatibility: the existing session_log MCP tool's signature does not change. Its docstring should be updated to reflect the new vocabulary, but argument types stay the same.
  • Pilot-before-rollout: the first SKILL.md instrumentation (tdd) must dogfood successfully — at least one real tdd invocation post-instrumentation produces a session log entry — before the other six skills get their updates.

Out of Scope

  • Plan 6 routing pod itself (the consumer of /pass-rate).
  • Materialized rolling counters (compute on-demand for now).
  • Auth, rate limiting, or per-user filtering on /pass-rate.
  • Dashboards or visualization (hyperguild brain pass-rate text/JSON is the only UI).
  • Real-time streaming or push notifications (/pass-rate is poll-only).
  • Skills with no clear binary outcome (the 10 skills listed in Success Criteria).
  • Per-model or per-mode breakdown (session_log already records model_used; the endpoint aggregates across all models for now). Plan 6 may want sharper aggregation; we'll add fields when it lands.
  • Migration of the one historical entry in 2026-04-17-validate-hyperguild.jsonl from pass (which is the new vocabulary, by accident) — no migration needed.

Technical Approach

Component A — SKILL.md instrumentation pattern

Each instrumented skill gets a standardized "Logging" subsection under its existing "Brain MCP Integration" section. The subsection names the required session_log fields with explicit copy-paste examples:

**At each phase end:** call `session_log` with:
- `skill`: "<this-skill-name>"
- `phase`: "<the-phase>"
- `final_status`: "pass" | "fail" | "skip"
- `message`: "<one-line summary>"
- `duration_ms`: <wall clock>
- `project_root`: "<absolute path to the project under work>"

The pilot SKILL.md (~/dev/.skills/tdd/SKILL.md) gets instrumented first. The implementation defines the contract; the rollout commits replicate the pattern across the other six SKILL.md files.

Rationale: SKILL.md as the source of truth means the contract is visible to every agent that loads the skill — no hidden middleware. Mode-agnostic: the agent calls session_log whether it's Claude (Mode 1), Claude+routing (Mode 2), or Crush (Mode 3). The pattern is uniform; only the skill name + phase set differ.

Component B — /pass-rate HTTP endpoint

New handler at the existing ingestion pod, peer to /query, /write, /ingest, etc.

GET /pass-rate?skill=<name>&window=<duration>
→ 200 { "skill": "tdd", "window": "7d", "pass": 47, "fail": 3, "skip": 0, "total": 50, "pass_rate": 0.94 }

Algorithm:

  1. Parse skill (required) and window (default 7d, accept Go-style 1h, 12h, 7d, 30d).
  2. Walk brain/sessions/*.jsonl in the pod's volume. For each line: parse JSON, filter by skill == query.skill and timestamp >= now - window.
  3. Tally pass (counts both pass and ok), fail (fail and error), skip (skip and skipped).
  4. Compute pass_rate = pass / (pass + fail); if pass + fail == 0, return pass_rate: null.
  5. Return JSON.

Rationale for on-demand: the JSONL files are append-only and small (one entry per skill phase, kilobytes per session at most). For the first months of Plan 5 usage, scanning all sessions for a single query is fast enough. If it ever isn't, a materialized index is a follow-up — the endpoint shape doesn't change.

Component C — Optional CLI subcommand

hyperguild brain pass-rate <skill> [--window 7d] [--json]. Adds a third nested verb under brain (sibling to query and write). Calls GET /pass-rate?skill=<>&window=<> via the existing brainClient infrastructure. Default human output: tdd: 47 / 50 = 94% (window: 7d). --json passes through the response envelope.

Rationale: shell access to pass-rate without curl + jq. Optional in the strict sense — Plan 6's routing pod will call the endpoint directly, not via the CLI — but cheap to add (one new method on brainClient, one new dispatch case in runBrain).

Schema and normalization

session_log JSONL line shape (unchanged today, codified by this plan):

{
  "session_id": "<id>",
  "timestamp": "2026-05-03T20:30:00Z",
  "skill": "tdd",
  "phase": "red",
  "project_root": "/abs/path",
  "final_status": "pass",
  "duration_ms": 12345,
  "message": "Test written, function undefined, red confirmed."
}

final_status values:

  • New writes (this plan onward): pass | fail | skip
  • Read aggregator accepts both new and legacy: pass/ok → pass, fail/error → fail, skip/skipped → skip
  • Anything else → counted as skip for safety (don't pollute pass/fail with malformed entries)

Tests

  • Endpoint: table-driven tests with a temp brain/sessions/ directory containing JSONL files spanning multiple skills, multiple statuses (both vocabularies), edge cases (empty file, malformed line, timestamp outside window, future timestamp). Tests run via httptest.NewServer against the real handler.
  • CLI: tests for runBrainPassRate against httptest.Server fake of /pass-rate. Human and --json output paths.
  • Pilot dogfood: after instrumenting tdd/SKILL.md, one real TDD task in this plan exercises the logging path. The corresponding session log entry verifies end-to-end.
  • task check per task.

Risks

  • Skills that don't reliably log produce missing data. The aggregator returns zero counts for those, which Plan 6 may misread as "this skill always passes" or "this skill is broken". Mitigation: the endpoint returns pass_rate: null when pass + fail == 0, signalling "no data" distinct from "always passes". Plan 6 must check for null.
  • Agents may forget to call session_log mid-skill. No way to enforce in cloud Mode 1 — Claude may skip the call if instructions are unclear. Mitigation: SKILL.md template makes the call literal and copy-pasteable. After 1 week, if instrumentation rate is < 80% of expected calls, escalate; consider a wrapper at the routing-pod layer in Plan 6 as belt-and-suspenders.
  • Schema drift between legacy ok and new pass. Mitigation: the aggregator's normalization rule. Documented in the endpoint's response and in the session_log tool docstring update.
  • /pass-rate walks all session files for each request. With ~1 file per session and tens of sessions per week, this is microseconds today. At hundreds of files per day, may need a date-bounded directory layout. Mitigation: monitor; if scan time > 100ms p99, revisit. Not in this plan.
  • The pilot may fail on the first dogfood. If tdd instrumentation doesn't produce a log entry (e.g. agent didn't call session_log, JSON shape wrong, file permissions), the rollout to the other six skills is blocked until the pilot succeeds. Mitigation: explicit "pilot validates end-to-end" gate as the last step of Component A.
  • Adding a third verb under brain slightly stretches the inline-router pattern. Three verbs in a switch is still simple; if it grows to five, the CLI may want a per-verb registration map. Mitigation: deferred — three is fine.