hyperguild/docs/superpowers/specs/2026-05-04-mode-2-routing-pod-design.md
Mathias Bergqvist 51e01233a4
docs(plan6): spec for Mode 2 routing pod
Drafted via superpowers:feature-spec. Plan 6 of 7 in the skill
migration. Surface frozen at 4 cost-routable skills (code_review,
debug, retrospective, trainer); LiteLLM proxies model choice; pass-
rate drives the route decision with default-to-local plus an env
kill switch for the empty-data window. Plan 7 amendment baked in:
internal/skills/{review,debug,retrospective,trainer} survive Plan 6.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 14:29:26 +02:00

Spec: Mode 2 routing pod

Plan 6 of 7 — Hyperguild Skill Migration. Loaded after feature-spec skill.

Problem Statement

Mode 2 (client-local) is the cost-and-sovereignty mode for paid client work: keep skill calls inside Tailscale, save tokens, and stay reliable. Plans 1-5 produced everything Mode 2 needs except the consumer: the brain MCP at :30330 is live, four skills are instrumented to log pass | fail | skip, and GET /pass-rate?skill=X&window=Y returns honest numbers (or null when there is no data). What is still missing is the policy layer that reads pass-rate and acts on it.

The supervisor pod (:30320) historically hosted full skill workers (tdd_red/green/refactor, code_review, debug, spec, retrospective, trainer, tier) but with no routing — every call ran local regardless of skill quality, and Claude Code in client-local mode silently lost access to Claude-quality work even when local was wrong. That's the regression Plan 6 fixes.

Why now: the supervisor pod is scheduled for retirement (Plan 7) and the data plumbing for routing decisions exists but has no consumer. Without Plan 6, Plan 7 cannot land.

Success Criteria

  • A new pod routing is deployed via Flux at NodePort :30310, alongside (not replacing) the supervisor and ingestion pods. Image built by gitea CI, deployment manifest under infra/k3s/apps/routing/. kubectl -n routing get deployment shows 1/1 Ready.
  • POST http://koala:30310/mcp responds to tools/list with exactly four tools: code_review, debug, retrospective, trainer. Each tool's name + JSON schema is byte-identical to the supervisor's current advertisement (verified by snapshot test).
  • Bearer-token auth via env var ROUTING_MCP_TOKEN (same opt-in pattern as SUPERVISOR_MCP_TOKEN shipped in f49850d). Empty token = no auth; populated token = Authorization: Bearer <token> required, otherwise HTTP 401 + JSON-RPC -32001.
  • On every tool call, the pod queries ${BRAIN_URL}/pass-rate?skill=<tool>&window=7d and applies a configurable policy:
    • pass_rate == null → route to local (default-to-local)
    • pass_rate ≥ HYPERGUILD_ROUTE_LOCAL_FLOOR (default 0.90) → route to local
    • HYPERGUILD_ROUTE_LOCAL_CEIL ≤ pass_rate < FLOOR (CEIL default 0.70) → 50/50 deterministic sample (hash of canonical request body)
    • pass_rate < CEIL → route to Claude
  • Both routes resolve to a LiteLLM call: local route uses HYPERGUILD_LOCAL_MODEL (default qwen35), Claude route uses HYPERGUILD_CLAUDE_MODEL (default claude-sonnet-4-6). LiteLLM at ${LITELLM_BASE_URL} (default http://piguard:4000) handles provider routing. The routing pod has no direct Anthropic SDK.
  • Every routing decision is logged via session_log to the brain pod with {skill: "_routing", phase: "decide", final_status: "skip", message: "<tool>: <decision>", duration_ms, project_root}. final_status: "skip" keeps these entries out of any skill's pass-rate aggregation.
  • LiteLLM unreachable → fail open to a Claude decision and log final_status: "fail" for _routing. The pod must still serve requests even if LiteLLM is down for hours.
  • cmd/hyperguild/mode.go updated: mode client-local writes the routing entry with "headers": {"X-Hyperguild-Mode": "client-local"} and the _routing_pending placeholder field is removed. The pod accepts but does not branch on the header (forward-compat only).
  • task check (lint + test + vet + drift + govulncheck) passes on each task and on the merged branch. The CI gate that bit Plan 1 must not bite Plan 6 (per feedback_per_task_verification memory).
  • A new task smoke:routing target boots the binary against the live LiteLLM at piguard:4000 and the live /pass-rate at koala:30330, calls each of the four advertised tools once, and verifies a _routing entry appears in the brain via GET /pass-rate?skill=_routing&window=1h. This is the live-contract test (per 2026-05-03-fake-tests-vs-real-contract brain entry); fake-server unit tests verify policy logic, the smoke step verifies the contract.
  • Mode 1 (cloud) and Mode 3 (sovereign) are byte-identically unchanged. Verified by git diff showing no changes to mode.go's modeCloud or modeSovereign functions.
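
The bearer-token criterion above can be sketched as a small net/http middleware. This is a minimal sketch, not the code shipped in f49850d: the helper names withBearer and statusFor, the error message text, and the constant-time comparison are assumptions. Only the empty-token bypass, the HTTP 401 status, and the -32001 code come from the criteria.

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// withBearer wraps next with opt-in bearer auth (hypothetical helper name).
// An empty token disables auth; a populated token requires
// "Authorization: Bearer <token>" and otherwise answers HTTP 401 with a
// JSON-RPC error object carrying code -32001.
func withBearer(token string, next http.Handler) http.Handler {
	if token == "" {
		return next // opt-in: no token configured means no auth
	}
	want := []byte("Bearer " + token)
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		got := []byte(r.Header.Get("Authorization"))
		if subtle.ConstantTimeCompare(got, want) != 1 {
			w.Header().Set("Content-Type", "application/json")
			w.WriteHeader(http.StatusUnauthorized)
			// The message text is an assumption; only the code is from the spec.
			fmt.Fprint(w, `{"jsonrpc":"2.0","id":null,"error":{"code":-32001,"message":"unauthorized"}}`)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// statusFor sends one request through the middleware and returns the HTTP status.
func statusFor(token, authHeader string) int {
	ok := http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	req := httptest.NewRequest(http.MethodPost, "/mcp", nil)
	if authHeader != "" {
		req.Header.Set("Authorization", authHeader)
	}
	rec := httptest.NewRecorder()
	withBearer(token, ok).ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	fmt.Println(statusFor("", ""))                    // 200: auth disabled
	fmt.Println(statusFor("s3cret", ""))              // 401: header missing
	fmt.Println(statusFor("s3cret", "Bearer s3cret")) // 200: header valid
}
```

Returning next unchanged when the token is empty keeps the no-auth path free of any per-request work, which matches the opt-in wording of the criterion.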

Constraints

  • Stdlib + existing deps only. The routing pod reuses internal/exec/litellm.go (NewLiteLLM, Complete), internal/registry, and internal/skills/{review,debug,retrospective,trainer}/. No new third-party dependency. Auth code may be duplicated from internal/mcp/server.go or extracted to a shared helper — implementer's call.
  • No new persistence. Pass-rate data lives in the brain pod's session JSONL files (Plan 5). Routing-decision logs land in the same place via session_log. Routing pod has no DB, no cache, no on-disk state beyond an optional in-memory pass-rate cache (TTL = 60 seconds — protects the brain from per-call hammering during an active session).
  • MCP wire format identical to supervisor's. Tools have the same names and JSON schemas as today. A consumer switches modes by changing only the URL in .mcp.json — no schema-level differences. Snapshot tests pin this.
  • Pod must start and serve degraded. If LiteLLM is down at startup, the pod still binds to :3210, advertises tools, and serves requests with the fail-open-to-Claude behavior described in success criteria.
  • internal/skills/{review,debug,retrospective,trainer}/ survives Plan 6. Plan 7's note about deleting them is amended: those four packages are reused by the routing pod and must NOT be deleted in Plan 7. Plan 7 deletes only internal/skills/{tdd,spec}/, the supervisor binary, the supervisor manifests, and frees NodePort :30320. This spec calls out the change so Plan 7's author doesn't delete needed code (per 2026-05-03-implicit-cleanup-third-category brain entry).
  • No retries beyond fail-open. A LiteLLM call that errors becomes a Claude decision and a final_status: "fail" log. No exponential backoff, no circuit breaker — that's policy for a future plan once the failure shape is observed.
  • Determinism in sampling. When pass-rate is in the sample band (CEIL ≤ pr < FLOOR), the local-vs-Claude choice for a given request is reproducible: hash a canonical JSON of the request body and let the low bit pick the route (0 → local, 1 → Claude). Same input → same decision. This avoids per-call variance confusing the operator during a debugging session.
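
The determinism constraint could be met along the following lines. This is a sketch under stated assumptions: the spec asks only for a deterministic hash of a canonical JSON form, so the choice of FNV-1a, the [system, user] array as the canonical encoding, and the helper name routeForBand are illustrative, not the repository's.

```go
package main

import (
	"encoding/json"
	"fmt"
	"hash/fnv"
)

// hashCanonical returns a 64-bit hash of a canonical encoding of the request.
// Assumptions: FNV-1a as the hash and a JSON array [system, user] as the
// canonical form. The array framing keeps ("ab", "c") and ("a", "bc") from
// encoding to the same bytes.
func hashCanonical(system, user string) uint64 {
	b, _ := json.Marshal([]string{system, user}) // marshalling []string cannot fail
	h := fnv.New64a()
	h.Write(b) // hash.Hash.Write never returns an error
	return h.Sum64()
}

// routeForBand maps the low bit of the hash to a route name. It is only
// meaningful when the pass-rate is inside the sample band.
func routeForBand(hash uint64) string {
	if hash&1 == 0 {
		return "local"
	}
	return "claude"
}

func main() {
	a := hashCanonical("you are a reviewer", "diff --git a/x b/x")
	b := hashCanonical("you are a reviewer", "diff --git a/x b/x")
	fmt.Println(a == b, routeForBand(a) == routeForBand(b)) // true true
}
```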

Out of Scope

  • Plan 7 (supervisor retirement). Separate plan, executed after Plan 6 stabilizes. Plan 6 leaves the supervisor pod running; nothing about supervisor changes in this plan.
  • Routing for tdd_red/green/refactor, spec, tier. Per project_per_skill_routing.md, these are SKILL.md or CLI, not routing-pod tools. They never appear in the routing pod's tools/list. If a future plan changes that decision, it adds them then.
  • Routing for brain_ingest. Already routed at the brain pod (Plan 1). No change.
  • Per-mode policy branching. The pod accepts X-Hyperguild-Mode for forward-compat but treats absent or unknown values as client-local. No code path differs on the header value yet.
  • OAuth, IP allowlisting, rate limiting, audit logging. Bearer-token only; same risk model as the supervisor MCP after f49850d.
  • Decision-log read endpoints. Routing decisions land in the brain via session_log. Reads happen via the existing GET /pass-rate endpoint and JSONL inspection. No new read API.
  • Materialized routing-decision aggregates. Out of scope for the same reason Plan 5 deferred materialized counters: on-demand scans are fast enough at current data volumes.
  • Tunable per-skill thresholds. FLOOR and CEIL are global. If the operator decides debug needs a different floor than code_review, that's a follow-up plan with real data behind the choice.
  • Sampling beyond a 50/50 hash split. No epsilon-decay schedules, no Thompson sampling, no per-skill exploration policies. Add when data justifies.
  • Migration of any existing supervisor-skill .mcp.json registrations. Consumers update their .mcp.json (via hyperguild mode client-local) when they want Mode 2 behavior. No silent redirect.
  • Routing-pod-side prompt customization. The four skill packages already own their prompts; the routing pod just calls into them via the existing Skill interface. Prompt edits remain a SKILL.md or internal/skills/<x>/ concern.

Technical Approach

A. Binary layout: cmd/routing/

A new Go binary at cmd/routing/main.go. Stdlib + internal/*. Wires:

  1. Config from env (typed struct in internal/config/routing.go — peer to Config for the supervisor; deliberately a separate type because the surfaces are different and merging would force every routing-pod field onto the supervisor and vice versa).
  2. internal/exec/litellm.NewLiteLLM(...) — same client the supervisor uses.
  3. internal/skills/{review,debug,retrospective,trainer}.New(...) constructors, each receiving a CompleteFunc that wraps the routing decision (see C below).
  4. internal/registry.New() populated with the four skills.
  5. internal/mcp.NewServer(reg, cfg.MCPAuthToken) — reuse the existing handler with bearer auth from f49850d. The handler is generic; nothing in it is supervisor-specific.

Rationale: the supervisor's runtime is already 80% of what the routing pod needs. Reusing it saves the routing pod from re-implementing skill dispatch, MCP protocol handling, and bearer auth. The only new code is the routing decision itself (C below) and the deployment manifests (G).

B. Configuration via env

Typed struct, parsed at startup. New env vars introduced by Plan 6:

| Env var | Default | Purpose |
|---|---|---|
| ROUTING_PORT | 3210 | Pod's HTTP port (NodePort :30310 maps to this) |
| ROUTING_MCP_TOKEN | | Bearer token, opt-in (empty = no auth) |
| LITELLM_BASE_URL | http://piguard:4000 | LiteLLM proxy (reused) |
| LITELLM_API_KEY | | Reused, sourced from routing-secrets Secret |
| BRAIN_URL | http://ingestion.supervisor:3300 | In-cluster brain pod for /pass-rate and session_log |
| HYPERGUILD_LOCAL_MODEL | qwen35 | Model name passed to LiteLLM for the local decision |
| HYPERGUILD_CLAUDE_MODEL | claude-sonnet-4-6 | Model name for the Claude decision |
| HYPERGUILD_ROUTE_LOCAL_FLOOR | 0.90 | At/above this pass-rate, always local |
| HYPERGUILD_ROUTE_LOCAL_CEIL | 0.70 | Below this, always Claude. Between CEIL and FLOOR is the sample band. |
| HYPERGUILD_PASS_RATE_TTL_SECONDS | 60 | Per-skill in-memory cache TTL |

Rationale: every value an operator might want to tune is an env var, not a hardcoded constant. Defaults are the recommendations from the kickoff and the per-skill-routing memory; sensible cluster values flow in via the Flux-managed Secret. No config file to manage.

C. Decision policy (internal/routing/policy.go)

Pure function, no I/O:

type Decision int
const (
    DecideLocal Decision = iota
    DecideClaude
)

type Policy struct{ Floor, Ceil float64 }

// Decide returns the routing decision. passRate may be nil when the brain has no data.
// requestHash is a deterministic 64-bit hash of the canonical request body — used only
// when passRate is in the sample band; same hash → same decision.
func (p Policy) Decide(passRate *float64, requestHash uint64) Decision { ... }

Rules (in order):

  1. passRate == nil → DecideLocal (default-to-local)
  2. *passRate >= p.Floor → DecideLocal
  3. *passRate < p.Ceil → DecideClaude
  4. Otherwise (sample band) → requestHash & 1 picks local on 0, claude on 1

Rationale: no I/O in the policy means the function is trivially testable (table-driven, no fixtures, no servers). Network calls happen in a wrapping layer that calls Decide — same separation as internal/skills/*/skill.go keeps prompt strings separate from Complete calls. Default-to-local rule is justified in project_per_skill_routing.md: the four advertised skills are exactly the skills marked "MCP→local" in that target architecture.
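
One way the elided body of Decide could be filled in, following the four rules in order, is sketched below with a small table of cases. The declarations repeat the ones above so the block compiles on its own. The ptr helper and the case table are illustrative additions.

```go
package main

import "fmt"

type Decision int

const (
	DecideLocal Decision = iota
	DecideClaude
)

type Policy struct{ Floor, Ceil float64 }

// Decide applies the four rules in order. The switch evaluates cases top to
// bottom, so passRate is only dereferenced after the nil check.
func (p Policy) Decide(passRate *float64, requestHash uint64) Decision {
	switch {
	case passRate == nil: // rule 1: no data, default to local
		return DecideLocal
	case *passRate >= p.Floor: // rule 2: local is good enough
		return DecideLocal
	case *passRate < p.Ceil: // rule 3: local is too unreliable
		return DecideClaude
	case requestHash&1 == 0: // rule 4: sample band, low bit 0 picks local
		return DecideLocal
	default: // rule 4: sample band, low bit 1 picks Claude
		return DecideClaude
	}
}

// ptr is a small helper for building *float64 test inputs.
func ptr(f float64) *float64 { return &f }

func main() {
	p := Policy{Floor: 0.90, Ceil: 0.70}
	cases := []struct {
		name string
		pr   *float64
		hash uint64
	}{
		{"no data", nil, 0},
		{"at floor", ptr(0.90), 1},
		{"below ceil", ptr(0.69), 0},
		{"band, even hash", ptr(0.80), 2},
		{"band, odd hash", ptr(0.80), 3},
	}
	for _, c := range cases {
		fmt.Printf("%-16s -> %d\n", c.name, p.Decide(c.pr, c.hash))
	}
}
```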

D. Pass-rate fetcher (internal/routing/passrate.go)

type Fetcher struct {
    BaseURL    string
    HTTPClient *http.Client // 1s timeout
    Cache      *ttlCache    // map[string]*float64 with 60s TTL, struct-internal
}

func (f *Fetcher) Get(ctx context.Context, skill string) (*float64, error)

Calls GET ${BaseURL}/pass-rate?skill=<skill>&window=7d. On success, caches the parsed pass_rate (which may be null) for HYPERGUILD_PASS_RATE_TTL_SECONDS. On error, returns (nil, err); the dispatch wrapper discards the error and treats the result as passRate == nil, so the call routes to local (the default-to-local fallback also covers brain-pod-down).

Rationale: GET is correct REST per 2026-05-03-rest-semantics-vs-precedent (this is a pure read with query params; it shouldn't follow the legacy POST-everywhere precedent). Cache TTL of 60s prevents per-call hammering during a tight Claude Code loop while staying fresh enough that a flapping pass-rate visibly affects routing within a minute. No persistence — restart loses cache, that's fine.
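
A simplified fetcher with an inlined TTL cache might look like this. It is a sketch under assumptions: the response is taken to be a JSON object with a pass_rate field (number or null), the cache is a mutex-guarded map rather than the ttlCache type named above, and the demo function with its fake brain server exists only to make the example runnable.

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/url"
	"sync"
	"sync/atomic"
	"time"
)

type cacheEntry struct {
	rate    *float64 // nil is a valid cached value: "no data"
	expires time.Time
}

// Fetcher is a simplified stand-in for the struct above (cache inlined).
type Fetcher struct {
	BaseURL    string
	HTTPClient *http.Client
	TTL        time.Duration

	mu    sync.Mutex
	cache map[string]cacheEntry
}

func (f *Fetcher) Get(ctx context.Context, skill string) (*float64, error) {
	f.mu.Lock()
	e, ok := f.cache[skill]
	f.mu.Unlock()
	if ok && time.Now().Before(e.expires) {
		return e.rate, nil // fresh cache hit, no network call
	}
	u := f.BaseURL + "/pass-rate?skill=" + url.QueryEscape(skill) + "&window=7d"
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, u, nil)
	if err != nil {
		return nil, err
	}
	resp, err := f.HTTPClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("pass-rate: unexpected status %d", resp.StatusCode)
	}
	var body struct {
		PassRate *float64 `json:"pass_rate"` // assumed field name
	}
	if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
		return nil, err
	}
	f.mu.Lock()
	if f.cache == nil {
		f.cache = make(map[string]cacheEntry)
	}
	f.cache[skill] = cacheEntry{rate: body.PassRate, expires: time.Now().Add(f.TTL)}
	f.mu.Unlock()
	return body.PassRate, nil
}

// demo performs two lookups against a fake brain and reports the rate seen
// on the second lookup plus the number of upstream requests.
func demo() (float64, int) {
	var hits int32
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		atomic.AddInt32(&hits, 1)
		fmt.Fprint(w, `{"pass_rate":0.93}`)
	}))
	defer srv.Close()
	f := &Fetcher{BaseURL: srv.URL, HTTPClient: &http.Client{Timeout: time.Second}, TTL: time.Minute}
	f.Get(context.Background(), "debug")
	pr, err := f.Get(context.Background(), "debug")
	if err != nil || pr == nil {
		return 0, int(atomic.LoadInt32(&hits))
	}
	return *pr, int(atomic.LoadInt32(&hits))
}

func main() {
	rate, hits := demo()
	fmt.Println(rate, hits) // 0.93 1
}
```

The second lookup is served from the cache, so the fake brain sees a single request. A cached nil is kept distinct from a cache miss, so an empty-data skill does not trigger a fetch on every call.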

E. Dispatch wrapper

The four skills are constructed with their existing CompleteFunc signature ((ctx, model, system, user) (string, int64, error)). The routing pod wraps it:

func (r *Router) Complete(ctx context.Context, skill, model, system, user string) (string, int64, error) {
    pr, _ := r.fetcher.Get(ctx, skill)
    decision := r.policy.Decide(pr, hashCanonical(system, user))
    chosenModel := r.cfg.ClaudeModel
    if decision == DecideLocal {
        chosenModel = r.cfg.LocalModel
    }
    out, ms, err := r.litellm.Complete(ctx, chosenModel, system, user)
    r.logDecision(skill, decision, err, ms)
    if err != nil {
        // fail open: try Claude once if we routed local; if Claude also fails, return error.
        if decision == DecideLocal {
            chosenModel = r.cfg.ClaudeModel
            out, ms, err = r.litellm.Complete(ctx, chosenModel, system, user)
            r.logDecision(skill, DecideClaude, err, ms) // second log entry, marked fail if still erroring
        }
        return out, ms, err
    }
    return out, ms, nil
}

The skill packages don't know about routing — they receive a CompleteFunc and call it. The wrapper substitutes routing logic at construction time.

Rationale: keeps the skill packages oblivious to mode. Same internal/skills/review/ works under the supervisor (no routing) and under the routing pod (routed) without any conditional logic in the skill itself. Plan 7's deletion of the supervisor leaves the skills' shape intact for the routing pod.

F. Decision logging (internal/routing/log.go)

After every decision, POST a session log entry to ${BRAIN_URL}/write (the brain pod's existing endpoint, which appends to brain/sessions/<session>.jsonl). Entry shape:

{
  "skill": "_routing",
  "phase": "decide",
  "final_status": "skip",
  "message": "<original_skill>: <decision> (pass_rate=<value or 'null'>, model=<chosen>)",
  "duration_ms": <litellm_round_trip>,
  "project_root": "<path from request, or 'unknown'>",
  "timestamp": "<RFC3339>",
  "session_id": "<from request, or generated>"
}

final_status: "skip" keeps these entries out of any real skill's pass-rate aggregation (Plan 5's aggregator counts only pass/fail). Operators can still query GET /pass-rate?skill=_routing&window=7d for routing-failure visibility (when LiteLLM down → final_status: "fail" in the second log entry).

Rationale: closes the observability loop without adding a new endpoint or schema. _routing namespaces routing entries away from skill names. skip is the only honest classification — routing isn't itself a pass/fail event in the skill sense.

G. Deployment

New manifest directory infra/k3s/apps/routing/ mirroring infra/k3s/apps/supervisor/'s shape:

  • namespace.yaml — namespace routing (peer to supervisor)
  • deployment.yaml — single replica, nodeSelector koala, image from gitea registry, envFrom: secretRef: routing-secrets
  • service.yaml — ClusterIP on port 3210
  • nodeport.yaml — NodePort 30310 → service 3210
  • secrets.enc.yaml — SOPS-encrypted, contains LITELLM_API_KEY and (optionally) ROUTING_MCP_TOKEN
  • kustomization.yaml — bundles the above

The supervisor pod's CI image build pattern (gitea Actions → gitea.d-ma.be/mathias/supervisor:<sha>) is replicated for gitea.d-ma.be/mathias/routing:<sha>. Flux's existing image-automation will bump the manifest's image tag on each push.

Rationale: copying the supervisor pod's manifest shape (rather than designing from scratch) is the YAGNI move. Flux + image automation already proven on supervisor; same pattern, same operator mental model. Mode 2 setup is now a Flux change, not a one-off kubectl ritual.

H. Live smoke test

task smoke:routing (in the project Taskfile) does:

  1. Boot the binary locally with LITELLM_BASE_URL=http://piguard:4000 and BRAIN_URL=http://koala:30330. Bind to a random localhost port (so it doesn't conflict with anything else).
  2. Send tools/list and assert four tool names.
  3. For each tool, send a minimal valid tools/call. Don't assert on response content — assert response shape (no error, has content).
  4. After all four calls, query GET http://koala:30330/pass-rate?skill=_routing&window=1h and assert total >= 4.
  5. Tear down.

Skipped automatically when LiteLLM is unreachable or when run outside Tailscale (tier 3) — emits a SKIP line and exits 0. task check does NOT include task smoke:routing (CI runner doesn't have Tailscale); operator runs it manually before bumping production.

Rationale: unit tests with httptest.Server fakes verify the policy and the dispatch wrapper logic. The smoke test is the only thing that will catch a contract drift between the routing pod's Complete calls and the actual LiteLLM API, or a schema drift between /pass-rate and what the fetcher expects (per 2026-05-03-fake-tests-vs-real-contract).
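
Step 2 of the smoke test, asserting the advertised tool names, could be checked with a helper like the one below. It is a sketch: checkToolsList is a hypothetical name, and the response is assumed to follow the usual MCP JSON-RPC shape with the tools under result.tools. The canned response in main stands in for the live pod.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// wantTools is the frozen surface from the success criteria, kept sorted.
var wantTools = []string{"code_review", "debug", "retrospective", "trainer"}

// checkToolsList decodes a tools/list response and verifies that it
// advertises exactly the four expected tool names, in any order.
func checkToolsList(body []byte) error {
	var resp struct {
		Error *struct {
			Code    int    `json:"code"`
			Message string `json:"message"`
		} `json:"error"`
		Result struct {
			Tools []struct {
				Name string `json:"name"`
			} `json:"tools"`
		} `json:"result"`
	}
	if err := json.Unmarshal(body, &resp); err != nil {
		return fmt.Errorf("decode tools/list: %w", err)
	}
	if resp.Error != nil {
		return fmt.Errorf("tools/list error %d: %s", resp.Error.Code, resp.Error.Message)
	}
	got := make([]string, 0, len(resp.Result.Tools))
	for _, t := range resp.Result.Tools {
		got = append(got, t.Name)
	}
	sort.Strings(got)
	if len(got) != len(wantTools) {
		return fmt.Errorf("want %d tools, got %d: %v", len(wantTools), len(got), got)
	}
	for i := range got {
		if got[i] != wantTools[i] {
			return fmt.Errorf("tool %d: want %q, got %q", i, wantTools[i], got[i])
		}
	}
	return nil
}

func main() {
	canned := []byte(`{"jsonrpc":"2.0","id":1,"result":{"tools":[
		{"name":"trainer"},{"name":"debug"},{"name":"code_review"},{"name":"retrospective"}]}}`)
	fmt.Println(checkToolsList(canned)) // <nil>
}
```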

I. Mode-template update (cmd/hyperguild/mode.go)

modeClientLocal is amended:

  • The routing entry's url stays at http://koala:30310/mcp.
  • A new key headers is added with {"X-Hyperguild-Mode": "client-local"}.
  • The placeholder _routing_pending field is removed, since the routing pod now exists.

Tests in cmd/hyperguild/mode_test.go are updated to assert the new structure. README in cmd/hyperguild/README.md updated to drop the "not deployed yet" note.

Rationale: Plan 4 deliberately scaffolded the placeholder for Plan 6 to fill in. This is the fill-in. Removing _routing_pending is the implicit cleanup the kickoff anticipates — making it explicit in the spec avoids a Plan-completeness gap (per 2026-05-03-implicit-cleanup-third-category).

Risks

  • Empty pass-rate window in the first weeks. Plans 3-5 merged on 2026-05-03; usage data has not accumulated. With default-to-local active for all four routed skills, the first weeks of Mode 2 = "everything goes local." If local quality is rough on code_review or debug, the operator's first impression of Mode 2 is bad, and confidence in Plan 6 erodes before data lands. Mitigation: the FLOOR / CEIL are env-tunable. If local quality is unworkable in the first week, set HYPERGUILD_ROUTE_LOCAL_FLOOR=2.0 (impossible threshold) and the pod becomes default-to-Claude with no code change. This is a deliberate kill switch for the early window.

  • LiteLLM-as-single-dependency. The routing pod has exactly one upstream LLM provider: piguard:4000. If LiteLLM is misconfigured (wrong model name routed to wrong provider, expired Anthropic key in LiteLLM's config), every routing-pod call returns garbage. Mitigation: the smoke test catches gross misconfig before deploy; once deployed, LiteLLM's own /health endpoint is the canary (the pod doesn't probe it — operator monitors LiteLLM separately). If a deeper failure mode emerges, add a routing-pod liveness probe in a follow-up.

  • Skill-schema drift. The routing pod's tools/list is asserted byte-identical to the supervisor's via snapshot test. If someone evolves the supervisor's schemas between Plan 6 merge and Plan 7 (a long window), the snapshot drifts. Mitigation: the spec documents that Plan 6 freezes the schemas; supervisor edits to skill schemas are out of scope until Plan 7 deletes the supervisor. This is a soft constraint enforced by the spec, not by code. If the supervisor genuinely needs a schema change before Plan 7, that's a separate plan.

  • Flux drift on kubectl rollout restart. Demonstrated during the bearer-auth rollout earlier today: Flux server-side-applies the deployment every 30s and strips the kubectl.kubernetes.io/restartedAt annotation, which deletes the new ReplicaSet's pod. Mitigation: the Plan 6 implementer prompt and the README note that kubectl delete pod -l app=routing is the correct way to force a restart on Flux-managed deployments — the existing ReplicaSet recreates without an annotation Flux can revert. (This finding is worth a brain entry; capture in retrospective.)

  • Mode header not forwarded by Claude Code. Plan 6 assumes Claude Code propagates headers from .mcp.json. The bearer-auth rollout proved this works for Authorization. The same path should work for X-Hyperguild-Mode. Mitigation: the pod treats absent header as client-local (the only mode that registers the pod). If forwarding silently breaks, behavior is identical — header is forward-compat only.

  • Sample-band hash bias producing skewed routing. Hash inputs are the (system, user) strings. If skill prompts produce highly similar bodies (debug bug A vs debug bug B with similar wording), the low-bit distribution might cluster on one side. Mitigation: at the volumes Plan 6 expects (single operator, tens of routed calls per hour at peak), any bias is statistically invisible. If volume ever rises, swap hash & 1 for a stronger split. Not the first failure mode worth pre-engineering.

Cross-references

  • Spec for Plan 5 (producer of /pass-rate, which Plan 6 consumes): docs/superpowers/specs/2026-05-03-pass-rate-logging-design.md
  • Spec for Plan 4 (which scaffolded the :30310 placeholder): docs/superpowers/specs/2026-05-03-hyperguild-cli-design.md
  • Auto-memory entries project_three_modes, project_skill_migration_plans, project_per_skill_routing, feedback_per_task_verification, feedback_sudo
  • Brain entries 2026-05-03-rest-semantics-vs-precedent, 2026-05-03-aggregator-normalization-backwards-compat, 2026-05-03-fake-tests-vs-real-contract, 2026-05-03-implicit-cleanup-third-category, 2026-05-03-code-reviewer-output-as-candidates, 2026-05-03-done-with-concerns-vs-blocked, 2026-05-03-verification-depth-formula, 2026-05-03-plan-canonical-dispatch-ephemeral