docs(specs): pass-rate logging design — Plan 5 of hyperguild migration

Skill instrumentation pattern, brain /pass-rate HTTP endpoint, and optional hyperguild CLI subcommand for shell access. Pilot with tdd SKILL.md, then roll out to 6 binary-outcome skills (code-review, debug, feature-spec, session-retrospective, trainer, spec-driven-dev). Decisions: SKILL.md as source of truth for the logging contract; on-demand aggregation from JSONL (no materialized counters until proven necessary); pass|fail|skip vocabulary forward, with ok|error|skipped accepted by the read-side aggregator for backwards compat. Seven success criteria, ten out-of-scope items, six risks.
2026-05-03 22:23:28 +02:00
parent b3b1fde825
commit af52f501fe
1 changed files with 125 additions and 0 deletions
--- a/docs/superpowers/specs/2026-05-03-pass-rate-logging-design.md
+++ b/docs/superpowers/specs/2026-05-03-pass-rate-logging-design.md
@@ -0,0 +1,125 @@
+# Spec: Pass-rate logging
+
+> Plan 5 of 7 — Hyperguild Skill Migration. Loaded after `feature-spec` skill.
+
+## Problem Statement
+
+Plan 6 (Mode 2 routing pod) needs a per-skill signal to decide whether to route a call to the local model or keep it on Claude. The natural signal is recent pass rate: a skill that succeeds 95% of the time on local is safe to route; a skill that succeeds 60% is not. Today there is no such signal — the `session_log` MCP exists (shipped in Plan 1) but skills don't reliably call it, and no endpoint computes pass rate from the resulting logs.
+
+Two consequences:
+1. **Plan 6 cannot be trusted without baseline data.** Routing decisions made on guesses will produce regressions that erode confidence in Mode 2 entirely.
+2. **The skill library has no observability.** When a skill regresses (model swap, prompt drift, environment change), there's no way to notice until a downstream task explicitly fails.
+
+**Why now:** Plans 1–4 are merged. Plan 5 instruments the discipline that Plan 6 will consume. Several weeks of usage data between Plan 5 merge and Plan 6 deploy will mean Plan 6 lands on real numbers, not synthetic.
+
+## Success Criteria
+
+- [ ] After Plan 5 merges, every invocation of `tdd` (pilot skill) calls `session_log` at the end of each phase (red, green, refactor) with `final_status` ∈ {pass, fail, skip}.
+- [ ] At least 6 of the remaining "binary-outcome" skills get the same treatment: `code-review`, `debug`, `feature-spec`, `session-retrospective`, `trainer`, `spec-driven-dev`. (Skills with no clear pass/fail — `clean-code`, `cognitive-load`, `solid`, `refactoring`, `test-design`, `problem-analysis`, `user-stories`, `planning`, `atdd`, `gitea-ci` — are out of scope.)
+- [ ] A new HTTP REST endpoint `GET /pass-rate?skill=X&window=7d` on the brain pod returns valid JSON `{skill, window, pass, fail, skip, total, pass_rate}` for any skill name. Skills with no logged invocations return zeros (not 404, not error). Pass rate is `pass / (pass + fail)`; if `pass + fail == 0`, returns `pass_rate: null`.
+- [ ] The endpoint's aggregator normalizes legacy values: `pass` ≡ `ok`, `fail` ≡ `error`, `skip` ≡ `skipped`. No data loss when scanning historical logs.
+- [ ] An optional CLI subcommand `hyperguild brain pass-rate <skill> [--window 7d] [--json]` calls the endpoint and prints either human-readable (`tdd: 47 / 50 = 94% (window: 7d)`) or JSON.
+- [ ] `task check` passes (lint + test + vet + drift + govulncheck) on each task and on the merged branch.
+- [ ] One week post-merge, `GET /pass-rate?skill=tdd&window=7d` returns non-zero counts and a real `pass_rate`.
+
+## Constraints
+
+- **Stdlib + existing deps only.** The endpoint adds to the existing ingestion pod's HTTP handler (Go, `net/http`). No new service, no new pod, no new persistence layer.
+- **No auth on `/pass-rate`.** Same model as the rest of the brain HTTP REST API: Tailscale-only network, no token.
+- **Schema:** the SKILL.md template uses `pass | fail | skip` for `final_status`. The aggregator treats `pass` and `ok` as equivalent, `fail` and `error` as equivalent, `skip` and `skipped` as equivalent. New writes from skills MUST use the new vocabulary; the aggregator handles both for read-back.
+- **Storage:** continues to use the existing JSONL files at `<pod>/brain/sessions/*.jsonl`. No format change. No materialized aggregates. If on-demand scans become slow (>500ms p99), revisit in a follow-up; not now.
+- **Backwards compatibility:** the existing `session_log` MCP tool's signature does not change. Its docstring should be updated to reflect the new vocabulary, but argument types stay the same.
+- **Pilot-before-rollout:** the first SKILL.md instrumentation (`tdd`) must dogfood successfully — at least one real `tdd` invocation post-instrumentation produces a session log entry — before the other six skills get their updates.
+
+## Out of Scope
+
+- Plan 6 routing pod itself (the consumer of `/pass-rate`).
+- Materialized rolling counters (compute on-demand for now).
+- Auth, rate limiting, or per-user filtering on `/pass-rate`.
+- Dashboards or visualization (`hyperguild brain pass-rate` text/JSON is the only UI).
+- Real-time streaming or push notifications (`/pass-rate` is poll-only).
+- Skills with no clear binary outcome (the 10 skills listed in Success Criteria).
+- Per-model or per-mode breakdown (`session_log` already records `model_used`; the endpoint aggregates across all models for now). Plan 6 may want sharper aggregation; we'll add fields when it lands.
+- Migration of the one historical entry in `2026-04-17-validate-hyperguild.jsonl` from `pass` (which is the new vocabulary, by accident) — no migration needed.
+
+## Technical Approach
+
+### Component A — SKILL.md instrumentation pattern
+
+Each instrumented skill gets a standardized "Logging" subsection under its existing "Brain MCP Integration" section. The subsection names the required `session_log` fields with explicit copy-paste examples:
+
+```
+**At each phase end:** call `session_log` with:
+- `skill`: "<this-skill-name>"
+- `phase`: "<the-phase>"
+- `final_status`: "pass" | "fail" | "skip"
+- `message`: "<one-line summary>"
+- `duration_ms`: <wall clock>
+- `project_root`: "<absolute path to the project under work>"
+```
+
+The pilot SKILL.md (`~/dev/.skills/tdd/SKILL.md`) gets instrumented first. The implementation defines the contract; the rollout commits replicate the pattern across the other six SKILL.md files.
+
+Rationale: SKILL.md as the source of truth means the contract is visible to every agent that loads the skill — no hidden middleware. Mode-agnostic: the agent calls `session_log` whether it's Claude (Mode 1), Claude+routing (Mode 2), or Crush (Mode 3). The pattern is uniform; only the skill name + phase set differ.
+
+### Component B — `/pass-rate` HTTP endpoint
+
+New handler at the existing ingestion pod, peer to `/query`, `/write`, `/ingest`, etc.
+
+```
+GET /pass-rate?skill=<name>&window=<duration>
+→ 200 { "skill": "tdd", "window": "7d", "pass": 47, "fail": 3, "skip": 0, "total": 50, "pass_rate": 0.94 }
+```
+
+Algorithm:
+1. Parse `skill` (required) and `window` (default `7d`, accept Go-style `1h`, `12h`, `7d`, `30d`).
+2. Walk `brain/sessions/*.jsonl` in the pod's volume. For each line: parse JSON, filter by `skill == query.skill` and `timestamp >= now - window`.
+3. Tally `pass` (counts both `pass` and `ok`), `fail` (`fail` and `error`), `skip` (`skip` and `skipped`).
+4. Compute `pass_rate = pass / (pass + fail)`; if `pass + fail == 0`, return `pass_rate: null`.
+5. Return JSON.
+
+Rationale for on-demand: the JSONL files are append-only and small (one entry per skill phase, kilobytes per session at most). For the first months of Plan 5 usage, scanning all sessions for a single query is fast enough. If it ever isn't, a materialized index is a follow-up — the endpoint shape doesn't change.
+
+### Component C — Optional CLI subcommand
+
+`hyperguild brain pass-rate <skill> [--window 7d] [--json]`. Adds a third nested verb under `brain` (sibling to `query` and `write`). Calls `GET /pass-rate?skill=<>&window=<>` via the existing `brainClient` infrastructure. Default human output: `tdd: 47 / 50 = 94% (window: 7d)`. `--json` passes through the response envelope.
+
+Rationale: shell access to pass-rate without curl + jq. Optional in the strict sense — Plan 6's routing pod will call the endpoint directly, not via the CLI — but cheap to add (one new method on `brainClient`, one new dispatch case in `runBrain`).
+
+### Schema and normalization
+
+`session_log` JSONL line shape (unchanged today, codified by this plan):
+
+```json
+{
+  "session_id": "<id>",
+  "timestamp": "2026-05-03T20:30:00Z",
+  "skill": "tdd",
+  "phase": "red",
+  "project_root": "/abs/path",
+  "final_status": "pass",
+  "duration_ms": 12345,
+  "message": "Test written, function undefined, red confirmed."
+}
+```
+
+`final_status` values:
+- New writes (this plan onward): `pass | fail | skip`
+- Read aggregator accepts both new and legacy: `pass`/`ok` → pass, `fail`/`error` → fail, `skip`/`skipped` → skip
+- Anything else → counted as `skip` for safety (don't pollute pass/fail with malformed entries)
+
+### Tests
+
+- Endpoint: table-driven tests with a temp `brain/sessions/` directory containing JSONL files spanning multiple skills, multiple statuses (both vocabularies), edge cases (empty file, malformed line, timestamp outside window, future timestamp). Tests run via `httptest.NewServer` against the real handler.
+- CLI: tests for `runBrainPassRate` against `httptest.Server` fake of `/pass-rate`. Human and `--json` output paths.
+- Pilot dogfood: after instrumenting `tdd/SKILL.md`, one real TDD task in this plan exercises the logging path. The corresponding session log entry verifies end-to-end.
+- `task check` per task.
+
+## Risks
+
+- **Skills that don't reliably log produce missing data.** The aggregator returns zero counts for those, which Plan 6 may misread as "this skill always passes" or "this skill is broken". Mitigation: the endpoint returns `pass_rate: null` when `pass + fail == 0`, signalling "no data" distinct from "always passes". Plan 6 must check for null.
+- **Agents may forget to call `session_log` mid-skill.** No way to enforce in cloud Mode 1 — Claude may skip the call if instructions are unclear. Mitigation: SKILL.md template makes the call literal and copy-pasteable. After 1 week, if instrumentation rate is < 80% of expected calls, escalate; consider a wrapper at the routing-pod layer in Plan 6 as belt-and-suspenders.
+- **Schema drift between legacy `ok` and new `pass`.** Mitigation: the aggregator's normalization rule. Documented in the endpoint's response and in the `session_log` tool docstring update.
+- **`/pass-rate` walks all session files for each request.** With ~1 file per session and tens of sessions per week, this is microseconds today. At hundreds of files per day, may need a date-bounded directory layout. Mitigation: monitor; if scan time > 100ms p99, revisit. Not in this plan.
+- **The pilot may fail on the first dogfood.** If `tdd` instrumentation doesn't produce a log entry (e.g. agent didn't call `session_log`, JSON shape wrong, file permissions), the rollout to the other six skills is blocked until the pilot succeeds. Mitigation: explicit "pilot validates end-to-end" gate as the last step of Component A.
+- **Adding a third verb under `brain` slightly stretches the inline-router pattern.** Three verbs in a switch is still simple; if it grows to five, the CLI may want a per-verb registration map. Mitigation: deferred — three is fine.