Skill instrumentation pattern, brain /pass-rate HTTP endpoint, and optional hyperguild CLI subcommand for shell access. Pilot with tdd SKILL.md, then roll out to 6 binary-outcome skills (code-review, debug, feature-spec, session-retrospective, trainer, spec-driven-dev). Decisions: SKILL.md as source of truth for the logging contract; on-demand aggregation from JSONL (no materialized counters until proven necessary); pass|fail|skip vocabulary forward, with ok|error|skipped accepted by the read-side aggregator for backwards compat. Seven success criteria, ten out-of-scope items, six risks.
10 KiB
Spec: Pass-rate logging
Plan 5 of 7 — Hyperguild Skill Migration. Loaded after
feature-specskill.
Problem Statement
Plan 6 (Mode 2 routing pod) needs a per-skill signal to decide whether to route a call to the local model or keep it on Claude. The natural signal is recent pass rate: a skill that succeeds 95% of the time on local is safe to route; a skill that succeeds 60% is not. Today there is no such signal — the session_log MCP exists (shipped in Plan 1) but skills don't reliably call it, and no endpoint computes pass rate from the resulting logs.
Two consequences:
- Plan 6 cannot be trusted without baseline data. Routing decisions made on guesses will produce regressions that erode confidence in Mode 2 entirely.
- The skill library has no observability. When a skill regresses (model swap, prompt drift, environment change), there's no way to notice until a downstream task explicitly fails.
Why now: Plans 1–4 are merged. Plan 5 instruments the discipline that Plan 6 will consume. Several weeks of usage data between Plan 5 merge and Plan 6 deploy will mean Plan 6 lands on real numbers, not synthetic.
Success Criteria
- After Plan 5 merges, every invocation of
tdd(pilot skill) callssession_logat the end of each phase (red, green, refactor) withfinal_status∈ {pass, fail, skip}. - At least 6 of the remaining "binary-outcome" skills get the same treatment:
code-review,debug,feature-spec,session-retrospective,trainer,spec-driven-dev. (Skills with no clear pass/fail —clean-code,cognitive-load,solid,refactoring,test-design,problem-analysis,user-stories,planning,atdd,gitea-ci— are out of scope.) - A new HTTP REST endpoint
GET /pass-rate?skill=X&window=7don the brain pod returns valid JSON{skill, window, pass, fail, skip, total, pass_rate}for any skill name. Skills with no logged invocations return zeros (not 404, not error). Pass rate ispass / (pass + fail); ifpass + fail == 0, returnspass_rate: null. - The endpoint's aggregator normalizes legacy values:
pass≡ok,fail≡error,skip≡skipped. No data loss when scanning historical logs. - An optional CLI subcommand
hyperguild brain pass-rate <skill> [--window 7d] [--json]calls the endpoint and prints either human-readable (tdd: 47 / 50 = 94% (window: 7d)) or JSON. task checkpasses (lint + test + vet + drift + govulncheck) on each task and on the merged branch.- One week post-merge,
GET /pass-rate?skill=tdd&window=7dreturns non-zero counts and a realpass_rate.
Constraints
- Stdlib + existing deps only. The endpoint adds to the existing ingestion pod's HTTP handler (Go,
net/http). No new service, no new pod, no new persistence layer. - No auth on
/pass-rate. Same model as the rest of the brain HTTP REST API: Tailscale-only network, no token. - Schema: the SKILL.md template uses
pass | fail | skipforfinal_status. The aggregator treatspassandokas equivalent,failanderroras equivalent,skipandskippedas equivalent. New writes from skills MUST use the new vocabulary; the aggregator handles both for read-back. - Storage: continues to use the existing JSONL files at
<pod>/brain/sessions/*.jsonl. No format change. No materialized aggregates. If on-demand scans become slow (>500ms p99), revisit in a follow-up; not now. - Backwards compatibility: the existing
session_logMCP tool's signature does not change. Its docstring should be updated to reflect the new vocabulary, but argument types stay the same. - Pilot-before-rollout: the first SKILL.md instrumentation (
tdd) must dogfood successfully — at least one realtddinvocation post-instrumentation produces a session log entry — before the other six skills get their updates.
Out of Scope
- Plan 6 routing pod itself (the consumer of
/pass-rate). - Materialized rolling counters (compute on-demand for now).
- Auth, rate limiting, or per-user filtering on
/pass-rate. - Dashboards or visualization (
hyperguild brain pass-ratetext/JSON is the only UI). - Real-time streaming or push notifications (
/pass-rateis poll-only). - Skills with no clear binary outcome (the 10 skills listed in Success Criteria).
- Per-model or per-mode breakdown (
session_logalready recordsmodel_used; the endpoint aggregates across all models for now). Plan 6 may want sharper aggregation; we'll add fields when it lands. - Migration of the one historical entry in
2026-04-17-validate-hyperguild.jsonlfrompass(which is the new vocabulary, by accident) — no migration needed.
Technical Approach
Component A — SKILL.md instrumentation pattern
Each instrumented skill gets a standardized "Logging" subsection under its existing "Brain MCP Integration" section. The subsection names the required session_log fields with explicit copy-paste examples:
**At each phase end:** call `session_log` with:
- `skill`: "<this-skill-name>"
- `phase`: "<the-phase>"
- `final_status`: "pass" | "fail" | "skip"
- `message`: "<one-line summary>"
- `duration_ms`: <wall clock>
- `project_root`: "<absolute path to the project under work>"
The pilot SKILL.md (~/dev/.skills/tdd/SKILL.md) gets instrumented first. The implementation defines the contract; the rollout commits replicate the pattern across the other six SKILL.md files.
Rationale: SKILL.md as the source of truth means the contract is visible to every agent that loads the skill — no hidden middleware. Mode-agnostic: the agent calls session_log whether it's Claude (Mode 1), Claude+routing (Mode 2), or Crush (Mode 3). The pattern is uniform; only the skill name + phase set differ.
Component B — /pass-rate HTTP endpoint
New handler at the existing ingestion pod, peer to /query, /write, /ingest, etc.
GET /pass-rate?skill=<name>&window=<duration>
→ 200 { "skill": "tdd", "window": "7d", "pass": 47, "fail": 3, "skip": 0, "total": 50, "pass_rate": 0.94 }
Algorithm:
- Parse
skill(required) andwindow(default7d, accept Go-style1h,12h,7d,30d). - Walk
brain/sessions/*.jsonlin the pod's volume. For each line: parse JSON, filter byskill == query.skillandtimestamp >= now - window. - Tally
pass(counts bothpassandok),fail(failanderror),skip(skipandskipped). - Compute
pass_rate = pass / (pass + fail); ifpass + fail == 0, returnpass_rate: null. - Return JSON.
Rationale for on-demand: the JSONL files are append-only and small (one entry per skill phase, kilobytes per session at most). For the first months of Plan 5 usage, scanning all sessions for a single query is fast enough. If it ever isn't, a materialized index is a follow-up — the endpoint shape doesn't change.
Component C — Optional CLI subcommand
hyperguild brain pass-rate <skill> [--window 7d] [--json]. Adds a third nested verb under brain (sibling to query and write). Calls GET /pass-rate?skill=<>&window=<> via the existing brainClient infrastructure. Default human output: tdd: 47 / 50 = 94% (window: 7d). --json passes through the response envelope.
Rationale: shell access to pass-rate without curl + jq. Optional in the strict sense — Plan 6's routing pod will call the endpoint directly, not via the CLI — but cheap to add (one new method on brainClient, one new dispatch case in runBrain).
Schema and normalization
session_log JSONL line shape (unchanged today, codified by this plan):
{
"session_id": "<id>",
"timestamp": "2026-05-03T20:30:00Z",
"skill": "tdd",
"phase": "red",
"project_root": "/abs/path",
"final_status": "pass",
"duration_ms": 12345,
"message": "Test written, function undefined, red confirmed."
}
final_status values:
- New writes (this plan onward):
pass | fail | skip - Read aggregator accepts both new and legacy:
pass/ok→ pass,fail/error→ fail,skip/skipped→ skip - Anything else → counted as
skipfor safety (don't pollute pass/fail with malformed entries)
Tests
- Endpoint: table-driven tests with a temp
brain/sessions/directory containing JSONL files spanning multiple skills, multiple statuses (both vocabularies), edge cases (empty file, malformed line, timestamp outside window, future timestamp). Tests run viahttptest.NewServeragainst the real handler. - CLI: tests for
runBrainPassRateagainsthttptest.Serverfake of/pass-rate. Human and--jsonoutput paths. - Pilot dogfood: after instrumenting
tdd/SKILL.md, one real TDD task in this plan exercises the logging path. The corresponding session log entry verifies end-to-end. task checkper task.
Risks
- Skills that don't reliably log produce missing data. The aggregator returns zero counts for those, which Plan 6 may misread as "this skill always passes" or "this skill is broken". Mitigation: the endpoint returns
pass_rate: nullwhenpass + fail == 0, signalling "no data" distinct from "always passes". Plan 6 must check for null. - Agents may forget to call
session_logmid-skill. No way to enforce in cloud Mode 1 — Claude may skip the call if instructions are unclear. Mitigation: SKILL.md template makes the call literal and copy-pasteable. After 1 week, if instrumentation rate is < 80% of expected calls, escalate; consider a wrapper at the routing-pod layer in Plan 6 as belt-and-suspenders. - Schema drift between legacy
okand newpass. Mitigation: the aggregator's normalization rule. Documented in the endpoint's response and in thesession_logtool docstring update. /pass-ratewalks all session files for each request. With ~1 file per session and tens of sessions per week, this is microseconds today. At hundreds of files per day, may need a date-bounded directory layout. Mitigation: monitor; if scan time > 100ms p99, revisit. Not in this plan.- The pilot may fail on the first dogfood. If
tddinstrumentation doesn't produce a log entry (e.g. agent didn't callsession_log, JSON shape wrong, file permissions), the rollout to the other six skills is blocked until the pilot succeeds. Mitigation: explicit "pilot validates end-to-end" gate as the last step of Component A. - Adding a third verb under
brainslightly stretches the inline-router pattern. Three verbs in a switch is still simple; if it grows to five, the CLI may want a per-verb registration map. Mitigation: deferred — three is fine.