Files

Mathias Bergqvist af52f501fe docs(specs): pass-rate logging design — Plan 5 of hyperguild migration

Skill instrumentation pattern, brain /pass-rate HTTP endpoint, and
optional hyperguild CLI subcommand for shell access. Pilot with tdd
SKILL.md, then roll out to 6 binary-outcome skills (code-review,
debug, feature-spec, session-retrospective, trainer, spec-driven-dev).

Decisions: SKILL.md as source of truth for the logging contract;
on-demand aggregation from JSONL (no materialized counters until
proven necessary); pass|fail|skip vocabulary forward, with
ok|error|skipped accepted by the read-side aggregator for backwards
compat.

Seven success criteria, ten out-of-scope items, six risks.

2026-05-03 22:23:28 +02:00

10 KiB

Raw Blame History

Spec: Pass-rate logging

Plan 5 of 7 — Hyperguild Skill Migration. Loaded after feature-spec skill.

Problem Statement

Plan 6 (Mode 2 routing pod) needs a per-skill signal to decide whether to route a call to the local model or keep it on Claude. The natural signal is recent pass rate: a skill that succeeds 95% of the time on local is safe to route; a skill that succeeds 60% is not. Today there is no such signal — the session_log MCP exists (shipped in Plan 1) but skills don't reliably call it, and no endpoint computes pass rate from the resulting logs.

Two consequences:

Plan 6 cannot be trusted without baseline data. Routing decisions made on guesses will produce regressions that erode confidence in Mode 2 entirely.
The skill library has no observability. When a skill regresses (model swap, prompt drift, environment change), there's no way to notice until a downstream task explicitly fails.

Why now: Plans 1–4 are merged. Plan 5 instruments the discipline that Plan 6 will consume. Several weeks of usage data between Plan 5 merge and Plan 6 deploy will mean Plan 6 lands on real numbers, not synthetic.

Success Criteria

After Plan 5 merges, every invocation of tdd (pilot skill) calls session_log at the end of each phase (red, green, refactor) with final_status ∈ {pass, fail, skip}.
At least 6 of the remaining "binary-outcome" skills get the same treatment: code-review, debug, feature-spec, session-retrospective, trainer, spec-driven-dev. (Skills with no clear pass/fail — clean-code, cognitive-load, solid, refactoring, test-design, problem-analysis, user-stories, planning, atdd, gitea-ci — are out of scope.)
A new HTTP REST endpoint GET /pass-rate?skill=X&window=7d on the brain pod returns valid JSON {skill, window, pass, fail, skip, total, pass_rate} for any skill name. Skills with no logged invocations return zeros (not 404, not error). Pass rate is pass / (pass + fail); if pass + fail == 0, returns pass_rate: null.
The endpoint's aggregator normalizes legacy values: pass ≡ ok, fail ≡ error, skip ≡ skipped. No data loss when scanning historical logs.
An optional CLI subcommand hyperguild brain pass-rate <skill> [--window 7d] [--json] calls the endpoint and prints either human-readable (tdd: 47 / 50 = 94% (window: 7d)) or JSON.
task check passes (lint + test + vet + drift + govulncheck) on each task and on the merged branch.
One week post-merge, GET /pass-rate?skill=tdd&window=7d returns non-zero counts and a real pass_rate.

Constraints

Stdlib + existing deps only. The endpoint adds to the existing ingestion pod's HTTP handler (Go, net/http). No new service, no new pod, no new persistence layer.
No auth on /pass-rate. Same model as the rest of the brain HTTP REST API: Tailscale-only network, no token.
Schema: the SKILL.md template uses pass | fail | skip for final_status. The aggregator treats pass and ok as equivalent, fail and error as equivalent, skip and skipped as equivalent. New writes from skills MUST use the new vocabulary; the aggregator handles both for read-back.
Storage: continues to use the existing JSONL files at <pod>/brain/sessions/*.jsonl. No format change. No materialized aggregates. If on-demand scans become slow (>500ms p99), revisit in a follow-up; not now.
Backwards compatibility: the existing session_log MCP tool's signature does not change. Its docstring should be updated to reflect the new vocabulary, but argument types stay the same.
Pilot-before-rollout: the first SKILL.md instrumentation (tdd) must dogfood successfully — at least one real tdd invocation post-instrumentation produces a session log entry — before the other six skills get their updates.

Out of Scope

Plan 6 routing pod itself (the consumer of /pass-rate).
Materialized rolling counters (compute on-demand for now).
Auth, rate limiting, or per-user filtering on /pass-rate.
Dashboards or visualization (hyperguild brain pass-rate text/JSON is the only UI).
Real-time streaming or push notifications (/pass-rate is poll-only).
Skills with no clear binary outcome (the 10 skills listed in Success Criteria).
Per-model or per-mode breakdown (session_log already records model_used; the endpoint aggregates across all models for now). Plan 6 may want sharper aggregation; we'll add fields when it lands.
Migration of the one historical entry in 2026-04-17-validate-hyperguild.jsonl from pass (which is the new vocabulary, by accident) — no migration needed.

Technical Approach

Component A — SKILL.md instrumentation pattern

Each instrumented skill gets a standardized "Logging" subsection under its existing "Brain MCP Integration" section. The subsection names the required session_log fields with explicit copy-paste examples:

**At each phase end:** call `session_log` with:
- `skill`: "<this-skill-name>"
- `phase`: "<the-phase>"
- `final_status`: "pass" | "fail" | "skip"
- `message`: "<one-line summary>"
- `duration_ms`: <wall clock>
- `project_root`: "<absolute path to the project under work>"

The pilot SKILL.md (~/dev/.skills/tdd/SKILL.md) gets instrumented first. The implementation defines the contract; the rollout commits replicate the pattern across the other six SKILL.md files.

Rationale: SKILL.md as the source of truth means the contract is visible to every agent that loads the skill — no hidden middleware. Mode-agnostic: the agent calls session_log whether it's Claude (Mode 1), Claude+routing (Mode 2), or Crush (Mode 3). The pattern is uniform; only the skill name + phase set differ.

Component B — `/pass-rate` HTTP endpoint

New handler at the existing ingestion pod, peer to /query, /write, /ingest, etc.

GET /pass-rate?skill=<name>&window=<duration>
→ 200 { "skill": "tdd", "window": "7d", "pass": 47, "fail": 3, "skip": 0, "total": 50, "pass_rate": 0.94 }

Algorithm:

Parse skill (required) and window (default 7d, accept Go-style 1h, 12h, 7d, 30d).
Walk brain/sessions/*.jsonl in the pod's volume. For each line: parse JSON, filter by skill == query.skill and timestamp >= now - window.
Tally pass (counts both pass and ok), fail (fail and error), skip (skip and skipped).
Compute pass_rate = pass / (pass + fail); if pass + fail == 0, return pass_rate: null.
Return JSON.

Rationale for on-demand: the JSONL files are append-only and small (one entry per skill phase, kilobytes per session at most). For the first months of Plan 5 usage, scanning all sessions for a single query is fast enough. If it ever isn't, a materialized index is a follow-up — the endpoint shape doesn't change.

Component C — Optional CLI subcommand

hyperguild brain pass-rate <skill> [--window 7d] [--json]. Adds a third nested verb under brain (sibling to query and write). Calls GET /pass-rate?skill=<>&window=<> via the existing brainClient infrastructure. Default human output: tdd: 47 / 50 = 94% (window: 7d). --json passes through the response envelope.

Rationale: shell access to pass-rate without curl + jq. Optional in the strict sense — Plan 6's routing pod will call the endpoint directly, not via the CLI — but cheap to add (one new method on brainClient, one new dispatch case in runBrain).

Schema and normalization

session_log JSONL line shape (unchanged today, codified by this plan):

{
  "session_id": "<id>",
  "timestamp": "2026-05-03T20:30:00Z",
  "skill": "tdd",
  "phase": "red",
  "project_root": "/abs/path",
  "final_status": "pass",
  "duration_ms": 12345,
  "message": "Test written, function undefined, red confirmed."
}

final_status values:

New writes (this plan onward): pass | fail | skip
Read aggregator accepts both new and legacy: pass/ok → pass, fail/error → fail, skip/skipped → skip
Anything else → counted as skip for safety (don't pollute pass/fail with malformed entries)

Tests

Endpoint: table-driven tests with a temp brain/sessions/ directory containing JSONL files spanning multiple skills, multiple statuses (both vocabularies), edge cases (empty file, malformed line, timestamp outside window, future timestamp). Tests run via httptest.NewServer against the real handler.
CLI: tests for runBrainPassRate against httptest.Server fake of /pass-rate. Human and --json output paths.
Pilot dogfood: after instrumenting tdd/SKILL.md, one real TDD task in this plan exercises the logging path. The corresponding session log entry verifies end-to-end.
task check per task.

Risks

Skills that don't reliably log produce missing data. The aggregator returns zero counts for those, which Plan 6 may misread as "this skill always passes" or "this skill is broken". Mitigation: the endpoint returns pass_rate: null when pass + fail == 0, signalling "no data" distinct from "always passes". Plan 6 must check for null.
Agents may forget to call session_log mid-skill. No way to enforce in cloud Mode 1 — Claude may skip the call if instructions are unclear. Mitigation: SKILL.md template makes the call literal and copy-pasteable. After 1 week, if instrumentation rate is < 80% of expected calls, escalate; consider a wrapper at the routing-pod layer in Plan 6 as belt-and-suspenders.
Schema drift between legacy ok and new pass. Mitigation: the aggregator's normalization rule. Documented in the endpoint's response and in the session_log tool docstring update.
/pass-rate walks all session files for each request. With ~1 file per session and tens of sessions per week, this is microseconds today. At hundreds of files per day, may need a date-bounded directory layout. Mitigation: monitor; if scan time > 100ms p99, revisit. Not in this plan.
The pilot may fail on the first dogfood. If tdd instrumentation doesn't produce a log entry (e.g. agent didn't call session_log, JSON shape wrong, file permissions), the rollout to the other six skills is blocked until the pilot succeeds. Mitigation: explicit "pilot validates end-to-end" gate as the last step of Component A.
Adding a third verb under brain slightly stretches the inline-router pattern. Three verbs in a switch is still simple; if it grows to five, the CLI may want a per-verb registration map. Mitigation: deferred — three is fine.

10 KiB Raw Blame History Unescape Escape