From b6bcc9304872222aca1b3c4ae69f9e98f772c97d Mon Sep 17 00:00:00 2001 From: Mathias Bergqvist Date: Mon, 4 May 2026 14:53:03 +0200 Subject: [PATCH] docs(plan6): implementation plan for Mode 2 routing pod MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit 14 TDD-shaped tasks across two worktrees: hyperguild for code (internal/routing package, cmd/routing binary, Dockerfile, CD workflow, mode template, smoke test, docs) and infra for the k3s manifests (deployment, service, nodeport, SOPS-encrypted secret). Plan 7 amendment baked in: internal/skills/{review, debug,retrospective,trainer} survive Plan 6 — Plan 7 only deletes tdd, spec, and the supervisor binary. Co-Authored-By: Claude Opus 4.7 (1M context) --- .../plans/2026-05-04-mode-2-routing-pod.md | 2449 +++++++++++++++++ 1 file changed, 2449 insertions(+) create mode 100644 docs/superpowers/plans/2026-05-04-mode-2-routing-pod.md diff --git a/docs/superpowers/plans/2026-05-04-mode-2-routing-pod.md b/docs/superpowers/plans/2026-05-04-mode-2-routing-pod.md new file mode 100644 index 0000000..dec93c9 --- /dev/null +++ b/docs/superpowers/plans/2026-05-04-mode-2-routing-pod.md @@ -0,0 +1,2449 @@ +# Mode 2 Routing Pod Implementation Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** Ship a thin policy pod at `koala:30310` that routes the four cost-routable skill calls (`code_review`, `debug`, `retrospective`, `trainer`) to a LiteLLM-proxied local or Claude model based on per-skill pass rate. Replaces the unconditional supervisor-runs-locally behavior in client-local mode. + +**Architecture:** New Go binary at `cmd/routing/`, reusing `internal/skills/{review,debug,retrospective,trainer}/`, `internal/exec/litellm.go`, `internal/registry`, and `internal/mcp` (bearer-auth handler from `f49850d`). 
A new `internal/routing` package adds (a) a pure-function decision policy, (b) a TTL-cached pass-rate fetcher, (c) a session-log decision logger, and (d) a router that wraps a `CompleteFunc` so the existing skill packages stay routing-oblivious. Deployed via Flux at NodePort `:30310` alongside the supervisor and ingestion pods. + +**Tech Stack:** Go 1.26 stdlib (`net/http`, `crypto/sha256`, `encoding/json`, `time`, `sync`); existing `testify` for tests; SOPS-encrypted Secret in the infra repo; gitea CI buildctl→skopeo; Flux Kustomize reconciliation. + +--- + +## Plan 6 of 7 — Hyperguild Skill Migration + +Plans 1–5 merged. Plan 6 is the substantive routing-pod plan; Plan 7 (supervisor retirement) follows. + +**Spec:** `docs/superpowers/specs/2026-05-04-mode-2-routing-pod-design.md` (committed `51e0123`). + +### Two worktrees + +- **Hyperguild worktree:** `~/Documents/local-dev/AI/hyperguild/.worktrees/mode-2-routing-pod/` on branch `feat/mode-2-routing-pod`. Contains the Go code, Dockerfile addition, CD workflow update, mode-template update, README, and smoke test. +- **Infra worktree:** `~/Documents/local-dev/AI/infra/.worktrees/mode-2-routing-pod/` on branch `feat/routing-pod-manifests`. Contains the k3s manifests for the new pod plus the SOPS-encrypted Secret. + +Each task's "Files" header names the worktree. Implementer subagents must `cd` into the named worktree before any read/edit/git operation. Plan paths describe the post-merge canonical state (per `2026-05-03-plan-canonical-dispatch-ephemeral` brain entry); dispatch prompts add the worktree translation. + +### Verification convention + +Per task, the implementer runs `task check` (lint + test + vet + drift + govulncheck), not just `go test ./...`. CI's lint gate caught a Plan-1 errcheck regression that local tests missed (per `feedback_per_task_verification` memory). Append `//nolint:errcheck` to any `fmt.Fprint*` to stdout/stderr that ignores its return value. 
For a deferred `resp.Body.Close()` whose error is ignored, use `defer func() { _ = resp.Body.Close() }()`. + +### Status taxonomy for implementer subagents + +- `DONE`: task completed, all checks green, verification commands ran clean. +- `DONE_WITH_CONCERNS`: task completed, but the implementer noticed a plan bug, an environmental anomaly, or related code that looks suspicious. Controller decides: doc-patch, follow-up commit, or accept and roll on (per `2026-05-03-done-with-concerns-vs-blocked` brain entry). +- `BLOCKED`: implementer cannot complete the assigned work. Controller re-dispatches with more context. +- `NEEDS_CONTEXT`: implementer needs information not in the dispatch (rare; usually a doc bug). + +### Code-reviewer expectations + +The reviewer agent surfaces candidate improvements; the controller filters. Per `2026-05-03-code-reviewer-output-as-candidates`, reject reviewer suggestions that add helpers for single-use sites, abstractions for hypothetical futures, or stylistic refactors that diverge from the plan's heredocs. Apply fixes for genuine bugs and security findings; defer the rest. + +### Flux operational note + +The auth rollout (commit `afe9a08` in infra) demonstrated that Flux server-side-applies the `routing` Deployment every ~30s and strips any `kubectl rollout restart` annotation, deleting the new ReplicaSet's pod. To force a pod restart on a Flux-managed deployment, use `kubectl -n <namespace> delete pod -l app=<app>`; the existing ReplicaSet recreates the pod, and there is no annotation for Flux to revert. + +### Plan 7 amendment baked in + +`internal/skills/{review,debug,retrospective,trainer}/` are reused by the routing pod and **must not be deleted in Plan 7**. Plan 7 deletes only `internal/skills/{tdd,spec}/`, the supervisor binary, and the supervisor manifests, and it frees NodePort `:30320`. The implementer of Plan 7 must read this paragraph and the matching note in the spec before deleting anything. 
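Returning to the verification convention earlier in this section, the two errcheck idioms can be sketched as follows. This is a minimal illustration, not code from the plan: `readAll` and `demo` are hypothetical names, and an in-memory reader stands in for an HTTP response body so the example needs no network.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"strings"
)

// readAll drains rc and closes it. The Close error is discarded explicitly,
// in the deferred-closure form the verification convention asks for.
func readAll(rc io.ReadCloser) (string, error) {
	defer func() { _ = rc.Close() }()

	b, err := io.ReadAll(rc)
	if err != nil {
		return "", fmt.Errorf("read: %w", err)
	}
	return string(b), nil
}

// demo is a hypothetical helper: it feeds readAll an in-memory body.
func demo() (string, error) {
	return readAll(io.NopCloser(strings.NewReader("ok")))
}

func main() {
	body, err := demo()
	if err != nil {
		// Return value ignored on purpose; the directive silences errcheck.
		fmt.Fprintln(os.Stderr, "error:", err) //nolint:errcheck
		return
	}
	fmt.Fprintln(os.Stdout, body) //nolint:errcheck
}
```

The deferred closure makes the ignored `Close` error visible in the source, which is usually easier to review than a blanket linter exclusion.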
+ +## File Structure + +### Hyperguild worktree + +| Path | Action | Responsibility | +|---|---|---| +| `internal/config/routing.go` | create | `RoutingConfig` typed struct, `LoadRouting()` env parser | +| `internal/config/routing_test.go` | create | Defaults + env-override tests | +| `internal/routing/policy.go` | create | `Decision` enum, `Policy.Decide(passRate, hash) Decision` | +| `internal/routing/policy_test.go` | create | Table-driven coverage of all four rules | +| `internal/routing/hash.go` | create | `CanonicalHash(system, user) uint64` (SHA-256 prefix) | +| `internal/routing/hash_test.go` | create | Determinism + low-bit distribution sanity | +| `internal/routing/passrate.go` | create | `Fetcher` with TTL cache, calls `GET /pass-rate` | +| `internal/routing/passrate_test.go` | create | `httptest.Server`; cache hit/miss, error path | +| `internal/routing/log.go` | create | `Logger.LogDecision(...)` posts to brain MCP `session_log` | +| `internal/routing/log_test.go` | create | `httptest.Server` capture + body shape assertion | +| `internal/routing/router.go` | create | `Router.Run(...)` wraps fetcher + policy + logger + LiteLLM | +| `internal/routing/router_test.go` | create | Mocked fetcher/logger/litellm; route + fail-open paths | +| `internal/routing/snapshot_test.go` | create | Asserts routing pod's `tools/list` byte-equals captured snapshot | +| `internal/routing/testdata/tools_list.snapshot.json` | create | Snapshot from current supervisor advertisement | +| `cmd/routing/main.go` | create | Wires Config → LiteLLM → Router → Skills → Registry → MCP server | +| `cmd/routing/main_test.go` | create | Integration test with fakes for LiteLLM + brain | +| `cmd/hyperguild/mode.go:74-87` | modify | `modeClientLocal` adds `headers: X-Hyperguild-Mode`, removes `_routing_pending` | +| `cmd/hyperguild/mode_test.go` | modify | Updated assertion for the new shape | +| `cmd/hyperguild/README.md` | modify | Drop "not deployed yet" note; document the header | +| 
`Dockerfile.routing` | create | Builds `cmd/routing`, bakes `config/`, runs as non-root, no claude CLI | +| `.gitea/workflows/cd.yml` | modify | Build + push routing image; sed `routing/deployment.yaml` in infra | +| `Taskfile.yml` | modify | Add `smoke:routing` task | +| `scripts/smoke-routing.sh` | create | Boots binary, hits each tool, asserts brain has `_routing` entries | +| `README.md` | modify | Mode 2 + new env vars + routing pod URL | +| `.context/PROJECT.md` | modify | Document `koala:30310/mcp` + the four routed skills | + +### Infra worktree + +| Path | Action | Responsibility | +|---|---|---| +| `k3s/apps/routing/namespace.yaml` | create | Namespace `routing` | +| `k3s/apps/routing/deployment.yaml` | create | One-replica Deployment, koala nodeSelector, image from gitea registry | +| `k3s/apps/routing/service.yaml` | create | ClusterIP `routing` on port 3210 | +| `k3s/apps/routing/nodeport.yaml` | create | NodePort 30310 → service 3210 | +| `k3s/apps/routing/secrets.enc.yaml` | create | SOPS-encrypted `LITELLM_API_KEY` + optional `ROUTING_MCP_TOKEN` | +| `k3s/apps/routing/kustomization.yaml` | create | Bundles the above | +| `k3s/apps/kustomization.yaml` | modify | Add `routing` to the apps list | + +--- + +## Task 1: `RoutingConfig` struct + env parser + +**Worktree:** hyperguild + +Typed config struct for the routing pod. New struct (not appended to `Config`) because the routing pod's surface differs from the supervisor's; merging would force every routing field onto the supervisor and vice versa. 
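One detail of the env parser is worth a sketch. The defaults test in Step 1 clears each variable by setting it to the empty string, so the lookup has to treat an empty value the same as an unset one. The block below illustrates that behavior under stated assumptions: `valueOr` is a name invented for this example, and `envOr` here is a stand-in for the helper of the same name that `LoadRouting` calls, whose real definition in `internal/config` is not shown in this plan and may differ.

```go
package main

import (
	"fmt"
	"os"
)

// valueOr returns v unless it is empty, in which case it returns def.
// Hypothetical helper, split out so the rule can be checked without touching
// the process environment.
func valueOr(v, def string) string {
	if v == "" {
		return def
	}
	return v
}

// envOr is an assumed shape for the existing config helper: an empty
// variable falls back to the default exactly like an unset one.
func envOr(key, def string) string {
	return valueOr(os.Getenv(key), def)
}

func main() {
	// An empty value selects the default.
	if err := os.Setenv("ROUTING_PORT", ""); err != nil {
		fmt.Println("setenv:", err)
		return
	}
	fmt.Println(envOr("ROUTING_PORT", "3210"))

	// A non-empty value overrides it.
	if err := os.Setenv("ROUTING_PORT", "3250"); err != nil {
		fmt.Println("setenv:", err)
		return
	}
	fmt.Println(envOr("ROUTING_PORT", "3210"))
}
```

A lookup built on `os.LookupEnv` would distinguish "set but empty" from "unset", which would make the defaults test fail for every variable it clears.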
+ +**Files:** +- Create: `internal/config/routing.go` +- Create: `internal/config/routing_test.go` + +- [ ] **Step 1: Write the failing test** + +Create `internal/config/routing_test.go`: + +```go +package config_test + +import ( + "testing" + + "github.com/mathiasbq/supervisor/internal/config" + "github.com/stretchr/testify/assert" + "github.com/stretchr/testify/require" +) + +func TestLoadRoutingDefaults(t *testing.T) { + for _, k := range []string{ + "ROUTING_PORT", "ROUTING_MCP_TOKEN", "LITELLM_BASE_URL", "LITELLM_API_KEY", + "BRAIN_URL", "HYPERGUILD_LOCAL_MODEL", "HYPERGUILD_CLAUDE_MODEL", + "HYPERGUILD_ROUTE_LOCAL_FLOOR", "HYPERGUILD_ROUTE_LOCAL_CEIL", + "HYPERGUILD_PASS_RATE_TTL_SECONDS", + } { + t.Setenv(k, "") + } + + cfg, err := config.LoadRouting() + require.NoError(t, err) + assert.Equal(t, "3210", cfg.Port) + assert.Equal(t, "", cfg.MCPAuthToken) + assert.Equal(t, "http://piguard:4000", cfg.LiteLLMBaseURL) + assert.Equal(t, "http://ingestion.supervisor:3300", cfg.BrainURL) + assert.Equal(t, "qwen35", cfg.LocalModel) + assert.Equal(t, "claude-sonnet-4-6", cfg.ClaudeModel) + assert.InDelta(t, 0.90, cfg.RouteLocalFloor, 1e-9) + assert.InDelta(t, 0.70, cfg.RouteLocalCeil, 1e-9) + assert.Equal(t, 60, cfg.PassRateTTLSeconds) +} + +func TestLoadRoutingFromEnv(t *testing.T) { + t.Setenv("ROUTING_PORT", "3250") + t.Setenv("ROUTING_MCP_TOKEN", "tok-xyz") + t.Setenv("LITELLM_BASE_URL", "http://localhost:4000") + t.Setenv("LITELLM_API_KEY", "lk") + t.Setenv("BRAIN_URL", "http://localhost:3300") + t.Setenv("HYPERGUILD_LOCAL_MODEL", "qwen2-7b") + t.Setenv("HYPERGUILD_CLAUDE_MODEL", "claude-opus-4-7") + t.Setenv("HYPERGUILD_ROUTE_LOCAL_FLOOR", "0.85") + t.Setenv("HYPERGUILD_ROUTE_LOCAL_CEIL", "0.65") + t.Setenv("HYPERGUILD_PASS_RATE_TTL_SECONDS", "30") + + cfg, err := config.LoadRouting() + require.NoError(t, err) + assert.Equal(t, "3250", cfg.Port) + assert.Equal(t, "tok-xyz", cfg.MCPAuthToken) + assert.Equal(t, "http://localhost:4000", cfg.LiteLLMBaseURL) + 
assert.Equal(t, "lk", cfg.LiteLLMAPIKey) + assert.Equal(t, "http://localhost:3300", cfg.BrainURL) + assert.Equal(t, "qwen2-7b", cfg.LocalModel) + assert.Equal(t, "claude-opus-4-7", cfg.ClaudeModel) + assert.InDelta(t, 0.85, cfg.RouteLocalFloor, 1e-9) + assert.InDelta(t, 0.65, cfg.RouteLocalCeil, 1e-9) + assert.Equal(t, 30, cfg.PassRateTTLSeconds) +} + +func TestLoadRoutingRejectsBadFloat(t *testing.T) { + t.Setenv("HYPERGUILD_ROUTE_LOCAL_FLOOR", "not-a-number") + _, err := config.LoadRouting() + require.Error(t, err) + assert.Contains(t, err.Error(), "HYPERGUILD_ROUTE_LOCAL_FLOOR") +} +``` + +- [ ] **Step 2: Run the test to confirm it fails** + +```bash +cd ~/Documents/local-dev/AI/hyperguild/.worktrees/mode-2-routing-pod +go test ./internal/config/... -run TestLoadRouting -v +``` + +Expected: FAIL — `undefined: config.LoadRouting` and `undefined: config.RoutingConfig`. + +- [ ] **Step 3: Write the implementation** + +Create `internal/config/routing.go`: + +```go +package config + +import ( + "fmt" + "os" + "strconv" +) + +// RoutingConfig holds the runtime configuration for the routing pod. +// Separate from Config because the routing pod's surface differs from the supervisor's. 
+type RoutingConfig struct { + Port string // ROUTING_PORT, default 3210 + MCPAuthToken string // ROUTING_MCP_TOKEN, optional bearer token + LiteLLMBaseURL string // LITELLM_BASE_URL, default http://piguard:4000 + LiteLLMAPIKey string // LITELLM_API_KEY + BrainURL string // BRAIN_URL, default http://ingestion.supervisor:3300 + LocalModel string // HYPERGUILD_LOCAL_MODEL, default qwen35 + ClaudeModel string // HYPERGUILD_CLAUDE_MODEL, default claude-sonnet-4-6 + RouteLocalFloor float64 // HYPERGUILD_ROUTE_LOCAL_FLOOR, default 0.90 + RouteLocalCeil float64 // HYPERGUILD_ROUTE_LOCAL_CEIL, default 0.70 + PassRateTTLSeconds int // HYPERGUILD_PASS_RATE_TTL_SECONDS, default 60 +} + +func LoadRouting() (RoutingConfig, error) { + cfg := RoutingConfig{ + Port: envOr("ROUTING_PORT", "3210"), + MCPAuthToken: os.Getenv("ROUTING_MCP_TOKEN"), + LiteLLMBaseURL: envOr("LITELLM_BASE_URL", "http://piguard:4000"), + LiteLLMAPIKey: os.Getenv("LITELLM_API_KEY"), + BrainURL: envOr("BRAIN_URL", "http://ingestion.supervisor:3300"), + LocalModel: envOr("HYPERGUILD_LOCAL_MODEL", "qwen35"), + ClaudeModel: envOr("HYPERGUILD_CLAUDE_MODEL", "claude-sonnet-4-6"), + } + + floor, err := parseFloatEnv("HYPERGUILD_ROUTE_LOCAL_FLOOR", 0.90) + if err != nil { + return RoutingConfig{}, err + } + cfg.RouteLocalFloor = floor + + ceil, err := parseFloatEnv("HYPERGUILD_ROUTE_LOCAL_CEIL", 0.70) + if err != nil { + return RoutingConfig{}, err + } + cfg.RouteLocalCeil = ceil + + ttl, err := parseIntEnv("HYPERGUILD_PASS_RATE_TTL_SECONDS", 60) + if err != nil { + return RoutingConfig{}, err + } + cfg.PassRateTTLSeconds = ttl + + return cfg, nil +} + +func parseFloatEnv(key string, def float64) (float64, error) { + v := os.Getenv(key) + if v == "" { + return def, nil + } + f, err := strconv.ParseFloat(v, 64) + if err != nil { + return 0, fmt.Errorf("config: %s: %w", key, err) + } + return f, nil +} + +func parseIntEnv(key string, def int) (int, error) { + v := os.Getenv(key) + if v == "" { + return def, nil + } + 
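n, err := strconv.Atoi(v) + if err != nil { + return 0, fmt.Errorf("config: %s: %w", key, err) + } + return n, nil +} +``` + +- [ ] **Step 4: Run the test to confirm it passes** + +```bash +go test ./internal/config/... -run TestLoadRouting -v +``` + +Expected: PASS, all three `TestLoadRouting*` tests green. + +- [ ] **Step 5: Run `task check`** + +```bash +task check 2>&1 | tail -20 +``` + +Expected: lint clean, test green, vet clean, no drift, govulncheck clean. + +- [ ] **Step 6: Commit** + +```bash +git add internal/config/routing.go internal/config/routing_test.go +git commit -m "feat(routing): RoutingConfig + LoadRouting" +``` + +--- + +## Task 2: Decision policy + +**Worktree:** hyperguild + +Pure-function policy with no I/O. Decision rules in priority order: null → local; ≥floor → local; below ceil → claude; otherwise (sample band) the low bit of the request hash picks local (0) or claude (1). + +**Files:** +- Create: `internal/routing/policy.go` +- Create: `internal/routing/policy_test.go` + +- [ ] **Step 1: Write the failing test** + +Create `internal/routing/policy_test.go` with a table-driven `TestPolicyDecide` covering all four rules. + +- [ ] **Step 2: Run the test** + +```bash +go test ./internal/routing/... -run TestPolicyDecide -v +``` + +Expected: FAIL, since `routing.Policy` does not exist yet. + +- [ ] **Step 3: Write the implementation** + +Create `internal/routing/policy.go`: + +```go +package routing + +// Decision is the routing target for a single call. +// NOTE: this declaration is reconstructed from its uses below; the underlying type is an assumption. +type Decision int + +const ( + DecideLocal Decision = iota + DecideClaude +) + +// Policy holds the two thresholds. Rules in priority order: +// 1. passRate == nil → DecideLocal (no data yet) +// 2. *passRate >= Floor → DecideLocal (trust local) +// 3. *passRate < Ceil → DecideClaude (don't trust local) +// 4. otherwise (sample band) → requestHash low bit picks: 0=local, 1=claude +type Policy struct { + Floor float64 + Ceil float64 +} + +// Decide returns the routing decision for a single call. +// requestHash is consulted only when passRate is in the sample band [Ceil, Floor). +func (p Policy) Decide(passRate *float64, requestHash uint64) Decision { + if passRate == nil { + return DecideLocal + } + if *passRate >= p.Floor { + return DecideLocal + } + if *passRate < p.Ceil { + return DecideClaude + } + if requestHash&1 == 0 { + return DecideLocal + } + return DecideClaude +} +``` + +- [ ] **Step 4: Run the test to confirm it passes** + +```bash +go test ./internal/routing/... -run TestPolicyDecide -v +``` + +Expected: PASS, eight subtests green. 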
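For orientation, the four rules can be exercised in isolation. The sketch below is self-contained and deliberately uses its own lowercase names (`decision`, `decide`, `rate`) rather than the package's `Decision` and `Policy`, so it is an illustration of the rule order and not the plan's test file. The thresholds 0.90 and 0.70 are the defaults from Task 1.

```go
package main

import "fmt"

// decision is a local stand-in for the routing target.
type decision int

const (
	local decision = iota
	claude
)

func (d decision) String() string {
	if d == local {
		return "local"
	}
	return "claude"
}

// decide applies the four rules in priority order.
func decide(passRate *float64, floor, ceil float64, hash uint64) decision {
	switch {
	case passRate == nil: // rule 1: no data, default to local
		return local
	case *passRate >= floor: // rule 2: local is trusted
		return local
	case *passRate < ceil: // rule 3: local is not trusted
		return claude
	case hash&1 == 0: // rule 4: sample band, even hash goes local
		return local
	default: // rule 4: sample band, odd hash goes to claude
		return claude
	}
}

// rate returns a pointer to v, mirroring the nullable pass rate.
func rate(v float64) *float64 { return &v }

func main() {
	cases := []struct {
		name     string
		passRate *float64
		hash     uint64
	}{
		{"no data", nil, 1},
		{"at floor", rate(0.90), 1},
		{"below ceil", rate(0.30), 0},
		{"at ceil, even hash", rate(0.70), 0},
		{"sample band, even hash", rate(0.80), 2},
		{"sample band, odd hash", rate(0.80), 3},
	}
	for _, c := range cases {
		fmt.Printf("%-24s -> %s\n", c.name, decide(c.passRate, 0.90, 0.70, c.hash))
	}
}
```

Note the boundary behavior: a pass rate exactly at the floor routes local, while a pass rate exactly at the ceil falls into the sample band rather than going straight to claude.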
+ +- [ ] **Step 5: Run `task check`** + +```bash +task check 2>&1 | tail -10 +``` + +- [ ] **Step 6: Commit** + +```bash +git add internal/routing/policy.go internal/routing/policy_test.go +git commit -m "feat(routing): decision policy" +``` + +--- + +## Task 3: Canonical request hash + +**Worktree:** hyperguild + +SHA-256-based hash of `(system, user)` for deterministic sample-band routing. The same prompt pair always hashes to the same value, so its sample-band decision is stable across calls. + +**Files:** +- Create: `internal/routing/hash.go` +- Create: `internal/routing/hash_test.go` + +- [ ] **Step 1: Write the failing test** + +Create `internal/routing/hash_test.go`: + +```go +package routing_test + +import ( + "testing" + + "github.com/mathiasbq/supervisor/internal/routing" + "github.com/stretchr/testify/assert" +) + +func TestCanonicalHashDeterministic(t *testing.T) { + a := routing.CanonicalHash("system one", "user one") + b := routing.CanonicalHash("system one", "user one") + assert.Equal(t, a, b, "same inputs must produce same hash") +} + +func TestCanonicalHashDistinguishesInputs(t *testing.T) { + cases := [][2]string{ + {"sys", "user"}, + {"sys", "user2"}, + {"sys2", "user"}, + {"", "system\x00user"}, // separator collision attempt + {"system\x00user", ""}, + } + seen := make(map[uint64]bool) + for _, c := range cases { + h := routing.CanonicalHash(c[0], c[1]) + assert.False(t, seen[h], "collision on %v", c) + seen[h] = true + } +} + +func TestCanonicalHashLowBitDistribution(t *testing.T) { + // Sanity check: across 1000 distinct inputs, low-bit split is roughly even. + zeros, ones := 0, 0 + for i := 0; i < 1000; i++ { + h := routing.CanonicalHash("sys", string(rune('a'+(i%26)))+string(rune(i))) + if h&1 == 0 { + zeros++ + } else { + ones++ + } + } + // Allow ±150 around the ideal 500/500 split (15% of the 1000 samples). Tighter would be flaky on real data. + assert.InDelta(t, 500, zeros, 150) + assert.InDelta(t, 500, ones, 150) +} +``` + +- [ ] **Step 2: Run the test** + +```bash +go test ./internal/routing/... 
-run TestCanonicalHash -v +``` + +Expected: FAIL — `undefined: routing.CanonicalHash`. + +- [ ] **Step 3: Write the implementation** + +Create `internal/routing/hash.go`: + +```go +package routing + +import ( + "crypto/sha256" + "encoding/binary" +) + +// CanonicalHash returns a deterministic 64-bit hash of (system, user). +// Used to make sample-band routing decisions reproducible: identical input +// strings produce the same hash on every call, independent of process state. +// +// Inputs are joined with a 0x00 byte separator before hashing — distinguishes +// (system="ab", user="cd") from (system="abcd", user=""). +func CanonicalHash(system, user string) uint64 { + h := sha256.New() + h.Write([]byte(system)) + h.Write([]byte{0}) + h.Write([]byte(user)) + sum := h.Sum(nil) + return binary.BigEndian.Uint64(sum[:8]) +} +``` + +- [ ] **Step 4: Run tests + `task check`** + +```bash +go test ./internal/routing/... -run TestCanonicalHash -v +task check 2>&1 | tail -10 +``` + +Expected: PASS, all checks green. + +- [ ] **Step 5: Commit** + +```bash +git add internal/routing/hash.go internal/routing/hash_test.go +git commit -m "feat(routing): canonical request hash" +``` + +--- + +## Task 4: Pass-rate fetcher with TTL cache + +**Worktree:** hyperguild + +HTTP client that calls `GET ${BrainURL}/pass-rate?skill=X&window=7d`, caches the response (`*float64`, possibly nil) for `TTL`. On error, returns `(nil, err)` so the dispatch wrapper falls through to default-to-local. 
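The caching behavior described above (serve from the cache while the entry is younger than the TTL, go back upstream once it expires, and never cache an error) can be sketched without any HTTP. This is an illustration under assumptions, not the plan's `Fetcher`: `ttlCache`, its injected `now` and `fetch` functions, and `demo` are all names invented for this example so the expiry logic can be driven by a fake clock.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// entry is one cached pass rate; value may be nil when upstream has no data.
type entry struct {
	value *float64
	at    time.Time
}

// ttlCache is a hypothetical per-skill cache with an injectable clock and
// upstream function.
type ttlCache struct {
	ttl   time.Duration
	now   func() time.Time
	fetch func(skill string) (*float64, error)

	mu      sync.Mutex
	entries map[string]entry
}

// get returns the cached value while it is fresh, otherwise refetches.
func (c *ttlCache) get(skill string) (*float64, error) {
	c.mu.Lock()
	if e, ok := c.entries[skill]; ok && c.now().Sub(e.at) < c.ttl {
		c.mu.Unlock()
		return e.value, nil
	}
	c.mu.Unlock()

	v, err := c.fetch(skill)
	if err != nil {
		return nil, err // errors are not cached, so the next call retries
	}

	c.mu.Lock()
	c.entries[skill] = entry{value: v, at: c.now()}
	c.mu.Unlock()
	return v, nil
}

// demo reports the upstream call count after a first lookup, a second lookup
// inside the TTL, and a third lookup after the TTL has elapsed.
func demo() [3]int {
	clock := time.Unix(0, 0)
	calls := 0
	c := &ttlCache{
		ttl: time.Minute,
		now: func() time.Time { return clock },
		fetch: func(string) (*float64, error) {
			calls++
			v := 0.94
			return &v, nil
		},
		entries: make(map[string]entry),
	}

	var counts [3]int
	_, _ = c.get("code_review")
	counts[0] = calls
	_, _ = c.get("code_review")
	counts[1] = calls
	clock = clock.Add(61 * time.Second)
	_, _ = c.get("code_review")
	counts[2] = calls
	return counts
}

func main() {
	fmt.Println(demo())
}
```

As in this sketch, releasing the lock before the upstream call means two concurrent misses for the same skill can both fetch. With a 60-second TTL and a handful of skills that is probably acceptable, but it is a trade-off to be aware of.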
+ +**Files:** +- Create: `internal/routing/passrate.go` +- Create: `internal/routing/passrate_test.go` + +- [ ] **Step 1: Write the failing test** + +Create `internal/routing/passrate_test.go`: + +```go +package routing_test + +import ( + "context" + "encoding/json" + "net/http" + "net/http/httptest" + "sync/atomic" + "testing" + "time" + + "github.com/mathiasbq/supervisor/internal/routing" + "github.com/stretchr/testify/assert" + "github.com/stretchr/testify/require" +) + +func TestFetcherGetReturnsPassRate(t *testing.T) { + srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + assert.Equal(t, http.MethodGet, r.Method) + assert.Equal(t, "/pass-rate", r.URL.Path) + assert.Equal(t, "tdd", r.URL.Query().Get("skill")) + assert.Equal(t, "7d", r.URL.Query().Get("window")) + w.Header().Set("Content-Type", "application/json") + _ = json.NewEncoder(w).Encode(map[string]any{"skill": "tdd", "pass_rate": 0.94}) + })) + defer srv.Close() + + f := routing.NewFetcher(srv.URL, "7d", time.Minute) + pr, err := f.Get(context.Background(), "tdd") + require.NoError(t, err) + require.NotNil(t, pr) + assert.InDelta(t, 0.94, *pr, 1e-9) +} + +func TestFetcherGetReturnsNilWhenNoData(t *testing.T) { + srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + _ = json.NewEncoder(w).Encode(map[string]any{"skill": "novel", "pass_rate": nil}) + })) + defer srv.Close() + + f := routing.NewFetcher(srv.URL, "7d", time.Minute) + pr, err := f.Get(context.Background(), "novel") + require.NoError(t, err) + assert.Nil(t, pr) +} + +func TestFetcherCachesWithinTTL(t *testing.T) { + var calls int32 + srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) { + atomic.AddInt32(&calls, 1) + _ = json.NewEncoder(w).Encode(map[string]any{"pass_rate": 0.5}) + })) + defer srv.Close() + + f := routing.NewFetcher(srv.URL, "7d", time.Minute) + for i := 0; i < 5; i++ { + _, err := f.Get(context.Background(), "tdd") 
+ require.NoError(t, err) + } + assert.Equal(t, int32(1), atomic.LoadInt32(&calls), "should hit upstream once and serve four times from cache") +} + +func TestFetcherSurfacesUpstreamError(t *testing.T) { + srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) { + http.Error(w, "boom", http.StatusInternalServerError) + })) + defer srv.Close() + + f := routing.NewFetcher(srv.URL, "7d", time.Minute) + pr, err := f.Get(context.Background(), "tdd") + require.Error(t, err) + assert.Nil(t, pr) +} +``` + +- [ ] **Step 2: Run the test** + +```bash +go test ./internal/routing/... -run TestFetcher -v +``` + +Expected: FAIL — `undefined: routing.NewFetcher`. + +- [ ] **Step 3: Write the implementation** + +Create `internal/routing/passrate.go`: + +```go +package routing + +import ( + "context" + "encoding/json" + "fmt" + "net/http" + "net/url" + "sync" + "time" +) + +// Fetcher reads /pass-rate from the brain pod with a per-skill TTL cache. +type Fetcher struct { + BaseURL string + Window string + TTL time.Duration + HTTP *http.Client + + mu sync.Mutex + cache map[string]cachedRate +} + +type cachedRate struct { + value *float64 + at time.Time +} + +type passRateResponse struct { + PassRate *float64 `json:"pass_rate"` +} + +// NewFetcher returns a Fetcher that calls baseURL + /pass-rate with the +// given window string. If ttl is zero, defaults to 60 seconds. The HTTP +// client uses a 1-second total timeout. +func NewFetcher(baseURL, window string, ttl time.Duration) *Fetcher { + if ttl == 0 { + ttl = 60 * time.Second + } + return &Fetcher{ + BaseURL: baseURL, + Window: window, + TTL: ttl, + HTTP: &http.Client{Timeout: time.Second}, + cache: make(map[string]cachedRate), + } +} + +// Get returns the pass rate for the named skill, or nil if no data exists, +// or an error if the brain is unreachable. Caches successful results. 
+func (f *Fetcher) Get(ctx context.Context, skill string) (*float64, error) { + f.mu.Lock() + if c, ok := f.cache[skill]; ok && time.Since(c.at) < f.TTL { + v := c.value + f.mu.Unlock() + return v, nil + } + f.mu.Unlock() + + u := fmt.Sprintf("%s/pass-rate?skill=%s&window=%s", + f.BaseURL, url.QueryEscape(skill), url.QueryEscape(f.Window)) + req, err := http.NewRequestWithContext(ctx, http.MethodGet, u, nil) + if err != nil { + return nil, fmt.Errorf("passrate: build request: %w", err) + } + resp, err := f.HTTP.Do(req) + if err != nil { + return nil, fmt.Errorf("passrate: request: %w", err) + } + defer func() { _ = resp.Body.Close() }() + if resp.StatusCode != http.StatusOK { + return nil, fmt.Errorf("passrate: server returned status %d", resp.StatusCode) + } + + var body passRateResponse + if err := json.NewDecoder(resp.Body).Decode(&body); err != nil { + return nil, fmt.Errorf("passrate: decode: %w", err) + } + + f.mu.Lock() + f.cache[skill] = cachedRate{value: body.PassRate, at: time.Now()} + f.mu.Unlock() + + return body.PassRate, nil +} +``` + +- [ ] **Step 4: Run tests + `task check`** + +```bash +go test ./internal/routing/... -run TestFetcher -v +task check 2>&1 | tail -10 +``` + +- [ ] **Step 5: Commit** + +```bash +git add internal/routing/passrate.go internal/routing/passrate_test.go +git commit -m "feat(routing): pass-rate fetcher with TTL cache" +``` + +--- + +## Task 5: Decision logger + +**Worktree:** hyperguild + +Posts a `session_log` MCP call to the brain pod's `/mcp` endpoint after every routing decision. Best-effort: returns errors but the caller does not block real work on them. 
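To make the shape of that `session_log` call concrete, here is a small sketch that builds the JSON-RPC envelope and prints it. The field names and fixed values (`tools/call`, `session_log`, `_routing`, `decide`, `skip` and `fail`) follow the assertions in this task's test; the struct and function names (`sessionLogArgs`, `callParams`, `rpcRequest`, `buildRequest`) are invented for this example, and typed structs are used here only to get a stable field order in the output.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// sessionLogArgs mirrors the arguments the brain's session_log tool receives.
type sessionLogArgs struct {
	SessionID   string `json:"session_id"`
	Skill       string `json:"skill"`
	Phase       string `json:"phase"`
	FinalStatus string `json:"final_status"`
	Message     string `json:"message"`
	DurationMs  int64  `json:"duration_ms"`
	ProjectRoot string `json:"project_root"`
}

// callParams is the params object of an MCP tools/call request.
type callParams struct {
	Name      string         `json:"name"`
	Arguments sessionLogArgs `json:"arguments"`
}

// rpcRequest is the JSON-RPC 2.0 envelope.
type rpcRequest struct {
	JSONRPC string     `json:"jsonrpc"`
	ID      int        `json:"id"`
	Method  string     `json:"method"`
	Params  callParams `json:"params"`
}

// buildRequest assembles one routing-decision log entry. A failed call is
// recorded with final_status "fail"; everything else is recorded as "skip".
func buildRequest(sessionID, message, projectRoot string, durationMs int64, failed bool) rpcRequest {
	status := "skip"
	if failed {
		status = "fail"
	}
	return rpcRequest{
		JSONRPC: "2.0",
		ID:      1,
		Method:  "tools/call",
		Params: callParams{
			Name: "session_log",
			Arguments: sessionLogArgs{
				SessionID:   sessionID,
				Skill:       "_routing", // fixed marker; the routed skill goes in the message
				Phase:       "decide",
				FinalStatus: status,
				Message:     message,
				DurationMs:  durationMs,
				ProjectRoot: projectRoot,
			},
		},
	}
}

func main() {
	req := buildRequest("sess-1", "code_review: local", "/home/x/proj", 1234, false)
	b, err := json.MarshalIndent(req, "", "  ")
	if err != nil {
		fmt.Println("marshal:", err)
		return
	}
	fmt.Println(string(b))
}
```

Because the `skill` argument is always `_routing`, the name of the skill that was actually routed only survives inside `message`, so anything that later aggregates these entries per skill has to parse that string.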
+ +**Files:** +- Create: `internal/routing/log.go` +- Create: `internal/routing/log_test.go` + +- [ ] **Step 1: Write the failing test** + +Create `internal/routing/log_test.go`: + +```go +package routing_test + +import ( + "context" + "encoding/json" + "io" + "net/http" + "net/http/httptest" + "testing" + + "github.com/mathiasbq/supervisor/internal/routing" + "github.com/stretchr/testify/assert" + "github.com/stretchr/testify/require" +) + +func TestLoggerLogDecision(t *testing.T) { + var captured map[string]any + srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + assert.Equal(t, http.MethodPost, r.Method) + assert.Equal(t, "/mcp", r.URL.Path) + body, _ := io.ReadAll(r.Body) + require.NoError(t, json.Unmarshal(body, &captured)) + _ = json.NewEncoder(w).Encode(map[string]any{"jsonrpc": "2.0", "id": 1, "result": map[string]any{"content": []map[string]any{{"type": "text", "text": "ok"}}}}) + })) + defer srv.Close() + + l := routing.NewLogger(srv.URL) + err := l.LogDecision(context.Background(), routing.LogEntry{ + SessionID: "sess-1", + Skill: "code_review", + Decision: "local", + Message: "model=qwen35, pass_rate=0.94", + ProjectRoot: "/home/x/proj", + DurationMs: 1234, + Failed: false, + }) + require.NoError(t, err) + + params := captured["params"].(map[string]any) + assert.Equal(t, "tools/call", captured["method"]) + assert.Equal(t, "session_log", params["name"]) + + args := params["arguments"].(map[string]any) + assert.Equal(t, "_routing", args["skill"]) + assert.Equal(t, "decide", args["phase"]) + assert.Equal(t, "skip", args["final_status"]) + assert.Contains(t, args["message"].(string), "code_review: local") + assert.Equal(t, "sess-1", args["session_id"]) + assert.Equal(t, "/home/x/proj", args["project_root"]) + assert.Equal(t, float64(1234), args["duration_ms"]) +} + +func TestLoggerLogFailure(t *testing.T) { + var captured map[string]any + srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r 
*http.Request) { + body, _ := io.ReadAll(r.Body) + _ = json.Unmarshal(body, &captured) + _ = json.NewEncoder(w).Encode(map[string]any{"jsonrpc": "2.0", "id": 1, "result": map[string]any{}}) + })) + defer srv.Close() + + l := routing.NewLogger(srv.URL) + err := l.LogDecision(context.Background(), routing.LogEntry{ + SessionID: "s", Skill: "debug", Decision: "local", Message: "litellm down", Failed: true, + }) + require.NoError(t, err) + + args := captured["params"].(map[string]any)["arguments"].(map[string]any) + assert.Equal(t, "fail", args["final_status"]) +} + +func TestLoggerSurfacesUpstreamError(t *testing.T) { + srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) { + http.Error(w, "down", http.StatusBadGateway) + })) + defer srv.Close() + + l := routing.NewLogger(srv.URL) + err := l.LogDecision(context.Background(), routing.LogEntry{Skill: "x", SessionID: "y", Decision: "local"}) + require.Error(t, err) +} +``` + +- [ ] **Step 2: Run the test** + +```bash +go test ./internal/routing/... -run TestLogger -v +``` + +Expected: FAIL — `undefined: routing.NewLogger`. + +- [ ] **Step 3: Write the implementation** + +Create `internal/routing/log.go`: + +```go +package routing + +import ( + "bytes" + "context" + "encoding/json" + "fmt" + "net/http" + "time" +) + +// LogEntry describes a single routing decision to log via the brain MCP. +type LogEntry struct { + SessionID string + Skill string // the original skill the call routed (e.g., "code_review") + Decision string // "local" or "claude" or "claude_fallback" + Message string // free-form, e.g. "model=qwen35, pass_rate=0.94" + ProjectRoot string + DurationMs int64 + Failed bool // true → final_status: "fail"; false → "skip" +} + +// Logger posts session_log entries to a brain MCP at BrainURL + /mcp. +type Logger struct { + BrainURL string + HTTP *http.Client +} + +// NewLogger creates a Logger with a 2-second HTTP timeout. 
+func NewLogger(brainURL string) *Logger { + return &Logger{ + BrainURL: brainURL, + HTTP: &http.Client{Timeout: 2 * time.Second}, + } +} + +// LogDecision posts a session_log MCP call. Errors are returned but the caller +// MUST NOT block real work on them — logging is best-effort. +func (l *Logger) LogDecision(ctx context.Context, e LogEntry) error { + status := "skip" + if e.Failed { + status = "fail" + } + payload := map[string]any{ + "jsonrpc": "2.0", + "id": 1, + "method": "tools/call", + "params": map[string]any{ + "name": "session_log", + "arguments": map[string]any{ + "session_id": e.SessionID, + "skill": "_routing", + "phase": "decide", + "final_status": status, + "message": fmt.Sprintf("%s: %s — %s", e.Skill, e.Decision, e.Message), + "duration_ms": e.DurationMs, + "project_root": e.ProjectRoot, + }, + }, + } + body, err := json.Marshal(payload) + if err != nil { + return fmt.Errorf("log: marshal: %w", err) + } + req, err := http.NewRequestWithContext(ctx, http.MethodPost, l.BrainURL+"/mcp", bytes.NewReader(body)) + if err != nil { + return fmt.Errorf("log: build request: %w", err) + } + req.Header.Set("Content-Type", "application/json") + resp, err := l.HTTP.Do(req) + if err != nil { + return fmt.Errorf("log: request: %w", err) + } + defer func() { _ = resp.Body.Close() }() + if resp.StatusCode != http.StatusOK { + return fmt.Errorf("log: server returned status %d", resp.StatusCode) + } + return nil +} +``` + +- [ ] **Step 4: Run tests + `task check`** + +```bash +go test ./internal/routing/... -run TestLogger -v +task check 2>&1 | tail -10 +``` + +- [ ] **Step 5: Commit** + +```bash +git add internal/routing/log.go internal/routing/log_test.go +git commit -m "feat(routing): decision logger via brain MCP session_log" +``` + +--- + +## Task 6: Router (dispatch wrapper) + +**Worktree:** hyperguild + +Composes Fetcher + Policy + Logger + a `CompleteFunc`. The wrapper is what the four skill packages receive as their `CompleteFunc`. 
On a local-route error, it falls open by retrying once on the Claude model. + +**Files:** +- Create: `internal/routing/router.go` +- Create: `internal/routing/router_test.go` + +- [ ] **Step 1: Write the failing test** + +Create `internal/routing/router_test.go`: + +```go +package routing_test + +import ( + "context" + "encoding/json" + "errors" + "net/http" + "net/http/httptest" + "sync" + "testing" + "time" + + "github.com/mathiasbq/supervisor/internal/routing" + "github.com/stretchr/testify/assert" + "github.com/stretchr/testify/require" +) + +type fakeLLM struct { + mu sync.Mutex + calls []struct{ Model, System, User string } + resp string + err error + errOn string // if non-empty, only the named model errors +} + +func (f *fakeLLM) Complete(_ context.Context, model, system, user string) (string, int64, error) { + f.mu.Lock() + defer f.mu.Unlock() + f.calls = append(f.calls, struct{ Model, System, User string }{model, system, user}) + if f.errOn == model { + return "", 0, f.err + } + if f.err != nil && f.errOn == "" { + return "", 0, f.err + } + return f.resp, 100, nil +} + +func newRouter(t *testing.T, llm *fakeLLM, passRate float64) (*routing.Router, *httptest.Server, *httptest.Server) { + t.Helper() + brain := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + switch r.URL.Path { + case "/pass-rate": + _ = json.NewEncoder(w).Encode(map[string]any{"pass_rate": passRate}) + case "/mcp": + _ = json.NewEncoder(w).Encode(map[string]any{"jsonrpc": "2.0", "id": 1, "result": map[string]any{}}) + } + })) + t.Cleanup(brain.Close) + + r := &routing.Router{ + Fetcher: routing.NewFetcher(brain.URL, "7d", time.Minute), + Logger: routing.NewLogger(brain.URL), + Policy: routing.Policy{Floor: 0.9, Ceil: 0.7}, + LocalModel: "qwen35", + ClaudeModel: "claude-sonnet-4-6", + Complete: llm.Complete, + } + return r, brain, brain +} + +func TestRouterRoutesLocalAtHighPassRate(t *testing.T) { + llm := &fakeLLM{resp: "ok"} + r, _, _ := newRouter(t, 
llm, 0.95) + + out, _, err := r.Run(context.Background(), routing.RunInput{ + Skill: "code_review", System: "sys", User: "user", SessionID: "s1", ProjectRoot: "/p", + }) + require.NoError(t, err) + assert.Equal(t, "ok", out) + + llm.mu.Lock() + defer llm.mu.Unlock() + require.Len(t, llm.calls, 1) + assert.Equal(t, "qwen35", llm.calls[0].Model) +} + +func TestRouterRoutesClaudeAtLowPassRate(t *testing.T) { + llm := &fakeLLM{resp: "ok"} + r, _, _ := newRouter(t, llm, 0.3) + + _, _, err := r.Run(context.Background(), routing.RunInput{ + Skill: "code_review", System: "sys", User: "user", SessionID: "s2", + }) + require.NoError(t, err) + + llm.mu.Lock() + defer llm.mu.Unlock() + require.Len(t, llm.calls, 1) + assert.Equal(t, "claude-sonnet-4-6", llm.calls[0].Model) +} + +func TestRouterFailsOpenLocalErrorToClaude(t *testing.T) { + llm := &fakeLLM{resp: "ok-after-fallback", err: errors.New("local boom"), errOn: "qwen35"} + r, _, _ := newRouter(t, llm, 0.95) // would route local + + out, _, err := r.Run(context.Background(), routing.RunInput{ + Skill: "code_review", System: "sys", User: "user", SessionID: "s3", + }) + require.NoError(t, err) + assert.Equal(t, "ok-after-fallback", out) + + llm.mu.Lock() + defer llm.mu.Unlock() + require.Len(t, llm.calls, 2) + assert.Equal(t, "qwen35", llm.calls[0].Model) + assert.Equal(t, "claude-sonnet-4-6", llm.calls[1].Model) +} + +func TestRouterDefaultsToLocalWhenBrainUnreachable(t *testing.T) { + // Brain returns 500 → fetcher errors → router treats pass rate as nil → local. 
+ brain := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) { + http.Error(w, "down", http.StatusInternalServerError) + })) + defer brain.Close() + + llm := &fakeLLM{resp: "ok"} + r := &routing.Router{ + Fetcher: routing.NewFetcher(brain.URL, "7d", time.Minute), + Logger: routing.NewLogger(brain.URL), + Policy: routing.Policy{Floor: 0.9, Ceil: 0.7}, + LocalModel: "qwen35", + ClaudeModel: "claude-sonnet-4-6", + Complete: llm.Complete, + } + + _, _, err := r.Run(context.Background(), routing.RunInput{ + Skill: "code_review", System: "sys", User: "user", SessionID: "s4", + }) + require.NoError(t, err) + + llm.mu.Lock() + defer llm.mu.Unlock() + require.Len(t, llm.calls, 1) + assert.Equal(t, "qwen35", llm.calls[0].Model) +} +``` + +- [ ] **Step 2: Run the test** + +```bash +go test ./internal/routing/... -run TestRouter -v +``` + +Expected: FAIL — `undefined: routing.Router`, `undefined: routing.RunInput`. + +- [ ] **Step 3: Write the implementation** + +Create `internal/routing/router.go`: + +```go +package routing + +import ( + "context" + "fmt" + "log/slog" +) + +// CompleteFunc matches the signature used by every skill package's Config. +type CompleteFunc func(ctx context.Context, model, system, user string) (string, int64, error) + +// RunInput captures the per-call inputs the dispatch wrapper needs. +type RunInput struct { + Skill string + System string + User string + SessionID string + ProjectRoot string +} + +// Router composes a pass-rate fetcher, a decision policy, a session logger, +// and a LiteLLM client. Skill packages receive Router.Run as their CompleteFunc. +type Router struct { + Fetcher *Fetcher + Logger *Logger + Policy Policy + LocalModel string + ClaudeModel string + Complete CompleteFunc +} + +// Run executes one skill call: decides local vs claude, calls LiteLLM, logs the +// decision. On local-side error, falls open by retrying once on the Claude model. 
+func (r *Router) Run(ctx context.Context, in RunInput) (string, int64, error) { + pr, ferr := r.Fetcher.Get(ctx, in.Skill) + if ferr != nil { + slog.Warn("router: pass-rate unreachable, defaulting to local", "skill", in.Skill, "err", ferr) + pr = nil + } + hash := CanonicalHash(in.System, in.User) + decision := r.Policy.Decide(pr, hash) + + model := r.ClaudeModel + if decision == DecideLocal { + model = r.LocalModel + } + + out, ms, err := r.Complete(ctx, model, in.System, in.User) + _ = r.Logger.LogDecision(ctx, LogEntry{ + SessionID: in.SessionID, + Skill: in.Skill, + Decision: decision.String(), + Message: fmt.Sprintf("model=%s, pass_rate=%s", model, formatPassRate(pr)), + ProjectRoot: in.ProjectRoot, + DurationMs: ms, + Failed: err != nil, + }) + + if err != nil && decision == DecideLocal { + slog.Warn("router: local failed, falling open to claude", "skill", in.Skill, "err", err) + out, ms, err = r.Complete(ctx, r.ClaudeModel, in.System, in.User) + _ = r.Logger.LogDecision(ctx, LogEntry{ + SessionID: in.SessionID, + Skill: in.Skill, + Decision: "claude_fallback", + Message: fmt.Sprintf("model=%s, after-local-error", r.ClaudeModel), + ProjectRoot: in.ProjectRoot, + DurationMs: ms, + Failed: err != nil, + }) + } + return out, ms, err +} + +func formatPassRate(pr *float64) string { + if pr == nil { + return "null" + } + return fmt.Sprintf("%.2f", *pr) +} +``` + +- [ ] **Step 4: Run tests + `task check`** + +```bash +go test ./internal/routing/... -run TestRouter -v +task check 2>&1 | tail -10 +``` + +- [ ] **Step 5: Commit** + +```bash +git add internal/routing/router.go internal/routing/router_test.go +git commit -m "feat(routing): router dispatch wrapper" +``` + +--- + +## Task 7: Snapshot test for tool-schema parity + +**Worktree:** hyperguild + +Capture the supervisor's current advertisement of the four routed skills (`code_review`, `debug`, `retrospective`, `trainer`) into a JSON snapshot file. 
Add a test that spins up a registry with the same four skill packages and asserts that its `tools/list` output matches the snapshot after JSON normalization, so whitespace and key order are ignored. This pins the schema contract: a downstream change in any skill package fails the routing pod's test loudly.
+
+**Files:**
+- Create: `internal/routing/testdata/tools_list.snapshot.json`
+- Create: `internal/routing/snapshot_test.go`
+
+- [ ] **Step 1: Capture the supervisor's current advertisement**
+
+```bash
+cd ~/Documents/local-dev/AI/hyperguild/.worktrees/mode-2-routing-pod
+mkdir -p internal/routing/testdata
+go run ./cmd/supervisor &
+SUPERVISOR_PID=$!
+# Give `go run` time to compile and bind; increase on a cold build cache.
+sleep 2
+curl -sS -X POST http://localhost:3200/mcp \
+  -H 'Content-Type: application/json' \
+  -d '{"jsonrpc":"2.0","id":1,"method":"tools/list"}' \
+  | jq '.result.tools | map(select(.name == "code_review" or .name == "debug" or .name == "retrospective" or .name == "trainer")) | sort_by(.name)' \
+  > internal/routing/testdata/tools_list.snapshot.json
+kill $SUPERVISOR_PID
+wait $SUPERVISOR_PID 2>/dev/null
+```
+
+If the supervisor binary requires extra env vars to start, set them inline:
+
+```bash
+SUPERVISOR_CONFIG_DIR=./config/supervisor go run ./cmd/supervisor &
+```
+
+Inspect the file:
+
+```bash
+jq 'length' internal/routing/testdata/tools_list.snapshot.json
+```
+
+Expected: `4`.
+ +- [ ] **Step 2: Write the failing test** + +Create `internal/routing/snapshot_test.go`: + +```go +package routing_test + +import ( + "context" + "encoding/json" + "os" + "sort" + "testing" + + iexec "github.com/mathiasbq/supervisor/internal/exec" + "github.com/mathiasbq/supervisor/internal/registry" + "github.com/mathiasbq/supervisor/internal/skills/debug" + "github.com/mathiasbq/supervisor/internal/skills/retrospective" + "github.com/mathiasbq/supervisor/internal/skills/review" + "github.com/mathiasbq/supervisor/internal/skills/trainer" + "github.com/stretchr/testify/assert" + "github.com/stretchr/testify/require" +) + +// TestToolsListMatchesSupervisorSnapshot pins the four routed skills' tool +// definitions to the supervisor's current advertisement. If a skill package +// changes its schema, this test fails loudly so the snapshot can be updated +// in lockstep with the consumer. +func TestToolsListMatchesSupervisorSnapshot(t *testing.T) { + complete := func(_ context.Context, _, _, _ string) (string, int64, error) { + return "", 0, nil + } + _ = iexec.NewLiteLLM // keep import for future use + + reg := registry.New() + reg.Register(review.New(review.Config{ + SkillPrompt: "stub", + DefaultModel: "stub", + CompleteFunc: complete, + })) + reg.Register(debug.New(debug.Config{ + SkillPrompt: "stub", + DefaultModel: "stub", + CompleteFunc: complete, + })) + reg.Register(retrospective.New(retrospective.Config{ + SkillPrompt: "stub", + DefaultModel: "stub", + CompleteFunc: complete, + })) + reg.Register(trainer.New(trainer.Config{ + ReaderPrompt: "stub", + WriterPrompt: "stub", + DefaultModel: "stub", + CompleteFunc: complete, + })) + + tools := reg.Tools() + // Filter to the four routed skills only (registry may expose additional tools). 
+ wanted := map[string]bool{"code_review": true, "debug": true, "retrospective": true, "trainer": true} + var routed []registry.ToolDef + for _, td := range tools { + if wanted[td.Name] { + routed = append(routed, td) + } + } + sort.Slice(routed, func(i, j int) bool { return routed[i].Name < routed[j].Name }) + + got, err := json.MarshalIndent(routed, "", " ") + require.NoError(t, err) + + want, err := os.ReadFile("testdata/tools_list.snapshot.json") + require.NoError(t, err) + + // Normalize both via re-encode so whitespace differences don't dominate. + var gotV, wantV any + require.NoError(t, json.Unmarshal(got, &gotV)) + require.NoError(t, json.Unmarshal(want, &wantV)) + + gotN, _ := json.MarshalIndent(gotV, "", " ") + wantN, _ := json.MarshalIndent(wantV, "", " ") + + assert.Equal(t, string(wantN), string(gotN), + "tool advertisement drifted from supervisor snapshot — update testdata/tools_list.snapshot.json deliberately if the schema change is intentional") +} +``` + +If the actual skill tool name is `review` rather than `code_review` (or vice versa), discover by inspecting `internal/skills/review/skill.go`'s `Tools()` and adjust both the snapshot capture filter and the test's `wanted` map. Use the discovered name throughout the rest of the plan. + +- [ ] **Step 3: Run the test** + +```bash +go test ./internal/routing/... -run TestToolsListMatchesSupervisorSnapshot -v +``` + +Expected: PASS — the snapshot was captured from the same registry the test exercises. If FAIL, the captured names differ from the wanted map; reconcile names per the note above. 
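The normalization step used by the snapshot test can be illustrated in isolation. This is a minimal sketch using only the standard library; `normalizeJSON` is a hypothetical helper name and is not part of the plan's code.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// normalizeJSON is a hypothetical helper: it decodes arbitrary JSON and
// re-encodes it with fixed indentation. encoding/json writes map keys in
// sorted order, so two documents that differ only in whitespace or object
// key order normalize to the same string.
func normalizeJSON(raw []byte) (string, error) {
	var v any
	if err := json.Unmarshal(raw, &v); err != nil {
		return "", fmt.Errorf("normalize: %w", err)
	}
	out, err := json.MarshalIndent(v, "", "  ")
	if err != nil {
		return "", fmt.Errorf("normalize: %w", err)
	}
	return string(out), nil
}

func main() {
	a, _ := normalizeJSON([]byte(`[{"name":"debug","description":"x"}]`))
	b, _ := normalizeJSON([]byte(`[ { "description": "x", "name": "debug" } ]`))
	// Compares equal: key order and whitespace no longer matter.
	fmt.Println(a == b)
}
```

Array element order still matters after normalization, which is presumably why both the capture command and the test sort the tools by name before comparing.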
+ +- [ ] **Step 4: `task check`** + +```bash +task check 2>&1 | tail -10 +``` + +- [ ] **Step 5: Commit** + +```bash +git add internal/routing/snapshot_test.go internal/routing/testdata/tools_list.snapshot.json +git commit -m "test(routing): pin tool-schema parity with supervisor" +``` + +--- + +## Task 8: `cmd/routing/main.go` wiring + +**Worktree:** hyperguild + +Compose the binary: load config, build LiteLLM client, build Fetcher/Logger/Router, register the four skills, mount on the existing `internal/mcp` server with bearer auth. + +**Files:** +- Create: `cmd/routing/main.go` +- Create: `cmd/routing/main_test.go` + +- [ ] **Step 1: Write the integration test first** + +Create `cmd/routing/main_test.go`: + +```go +package main_test + +import ( + "context" + "encoding/json" + "net/http" + "net/http/httptest" + "os/exec" + "strings" + "testing" + "time" + + "github.com/stretchr/testify/assert" + "github.com/stretchr/testify/require" +) + +// TestRoutingPodEndToEnd boots the binary against fake LiteLLM + brain servers, +// calls tools/list and one tools/call, and verifies the brain saw a session_log POST. 
+func TestRoutingPodEndToEnd(t *testing.T) {
+	if testing.Short() {
+		t.Skip("end-to-end binary boot")
+	}
+
+	// Handlers run on server goroutines, so session_log hits are signalled
+	// over a buffered channel instead of a shared counter (race-free).
+	sessionLogs := make(chan struct{}, 16)
+	llm := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
+		_ = json.NewEncoder(w).Encode(map[string]any{
+			"choices": []map[string]any{{"message": map[string]any{"role": "assistant", "content": "stub"}}},
+		})
+	}))
+	defer llm.Close()
+
+	brain := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
+		switch r.URL.Path {
+		case "/pass-rate":
+			_ = json.NewEncoder(w).Encode(map[string]any{"pass_rate": 0.95})
+		case "/mcp":
+			select {
+			case sessionLogs <- struct{}{}:
+			default: // buffer full; one signal is all the test needs
+			}
+			_ = json.NewEncoder(w).Encode(map[string]any{"jsonrpc": "2.0", "id": 1, "result": map[string]any{}})
+		}
+	}))
+	defer brain.Close()
+
+	bin := buildRouting(t)
+	cmd := exec.Command(bin)
+	cmd.Env = append(cmd.Env,
+		"ROUTING_PORT=33310",
+		"LITELLM_BASE_URL="+llm.URL,
+		"LITELLM_API_KEY=stub",
+		"BRAIN_URL="+brain.URL,
+		// `go test` runs with the package directory (cmd/routing) as its
+		// working directory, so the repo-level config dir is two levels up.
+		"SUPERVISOR_CONFIG_DIR=../../config/supervisor",
+		"PATH="+osPath(),
+	)
+	require.NoError(t, cmd.Start())
+	t.Cleanup(func() { _ = cmd.Process.Kill() })
+
+	require.NoError(t, waitForPort(t, "127.0.0.1:33310", 5*time.Second))
+
+	resp := mcpCall(t, "http://127.0.0.1:33310/mcp", `{"jsonrpc":"2.0","id":1,"method":"tools/list"}`)
+	assert.Contains(t, resp, "code_review")
+
+	resp = mcpCall(t, "http://127.0.0.1:33310/mcp", `{"jsonrpc":"2.0","id":2,"method":"tools/call","params":{"name":"code_review","arguments":{"project_root":"/tmp","files":["README.md"]}}}`)
+	_ = resp // shape varies by skill; we only need a response
+
+	// Only /mcp hits are counted, so one session_log POST is the expectation.
+	select {
+	case <-sessionLogs:
+	case <-time.After(2 * time.Second):
+		t.Fatal("expected at least one /mcp session_log hit on the brain")
+	}
+}
+```
+
+Add helpers in the same file:
+
+```go
+func buildRouting(t *testing.T) string {
+	t.Helper()
+	bin := t.TempDir() + "/routing"
+	// "." is this package: the test's working directory is cmd/routing.
+	out, err := exec.Command("go", "build", "-o", bin, ".").CombinedOutput()
+	require.NoError(t, err, "build failed: %s", out)
+	return bin
+}
+
+func waitForPort(_ *testing.T, addr string, dur time.Duration) error {
+	deadline := time.Now().Add(dur)
+	for time.Now().Before(deadline) {
+		// Any HTTP response, whatever its status, proves the listener is up.
+		resp, err := http.Get("http://" + addr + "/healthz")
+		if err == nil {
+			_ = resp.Body.Close()
+			return nil
+		}
+		time.Sleep(50 * time.Millisecond)
+	}
+	return context.DeadlineExceeded
+}
+
+func mcpCall(t *testing.T, url, body string) string {
+	t.Helper()
+	r, err := http.Post(url, "application/json", strings.NewReader(body))
+	require.NoError(t, err)
+	defer func() { _ = r.Body.Close() }()
+	// Drain the body in chunks; strings.Builder has no ReadFrom method.
+	var b strings.Builder
+	buf := make([]byte, 4096)
+	for {
+		n, err := r.Body.Read(buf)
+		b.Write(buf[:n])
+		if err != nil {
+			break
+		}
+	}
+	return b.String()
+}
+
+func osPath() string {
+	// Environ() on a fresh Cmd returns the parent process environment;
+	// the Env field itself is nil until it is set explicitly.
+	for _, e := range exec.Command("env").Environ() {
+		if strings.HasPrefix(e, "PATH=") {
+			return strings.TrimPrefix(e, "PATH=")
+		}
+	}
+	return "/usr/bin:/bin"
+}
+```
+
+- [ ] **Step 2: Run the test**
+
+```bash
+go test ./cmd/routing/... -v
+```
+
+Expected: FAIL, because `cmd/routing/main.go` doesn't exist yet.
+ +- [ ] **Step 3: Write the binary** + +Create `cmd/routing/main.go`: + +```go +// cmd/routing/main.go +package main + +import ( + "context" + "log/slog" + "net/http" + "os" + "time" + + "github.com/mathiasbq/supervisor/internal/config" + iexec "github.com/mathiasbq/supervisor/internal/exec" + "github.com/mathiasbq/supervisor/internal/mcp" + "github.com/mathiasbq/supervisor/internal/registry" + "github.com/mathiasbq/supervisor/internal/routing" + "github.com/mathiasbq/supervisor/internal/skills/debug" + "github.com/mathiasbq/supervisor/internal/skills/retrospective" + "github.com/mathiasbq/supervisor/internal/skills/review" + "github.com/mathiasbq/supervisor/internal/skills/trainer" +) + +func main() { + logger := slog.New(slog.NewTextHandler(os.Stderr, nil)) + slog.SetDefault(logger) + + cfg, err := config.LoadRouting() + if err != nil { + logger.Error("config load failed", "err", err) + os.Exit(1) + } + + // Load prompts from config dir (same files the supervisor uses). + configDir := envOr("SUPERVISOR_CONFIG_DIR", "/app/config/supervisor") + mustRead := func(path string) string { + b, err := os.ReadFile(configDir + "/" + path) + if err != nil { + logger.Error("read prompt failed", "path", path, "err", err) + os.Exit(1) + } + return string(b) + } + + llm := iexec.NewLiteLLM(cfg.LiteLLMBaseURL, cfg.LiteLLMAPIKey, 0) + + router := &routing.Router{ + Fetcher: routing.NewFetcher(cfg.BrainURL, "7d", time.Duration(cfg.PassRateTTLSeconds)*time.Second), + Logger: routing.NewLogger(cfg.BrainURL), + Policy: routing.Policy{Floor: cfg.RouteLocalFloor, Ceil: cfg.RouteLocalCeil}, + LocalModel: cfg.LocalModel, + ClaudeModel: cfg.ClaudeModel, + Complete: llm.Complete, + } + + // Skill packages call CompleteFunc(ctx, model, system, user) — no session_id + // or project_root in the signature. Rather than modifying every skill's API + // (and inflating Plan 6's blast radius), the routing pod logs every decision + // under a fixed session_id "_routing". 
Operators query + // `GET /pass-rate?skill=_routing&window=...` to inspect routing health; per- + // session correlation is sacrificed for a much simpler implementation. + const routingSessionID = "_routing" + wrap := func(skillName string) routing.CompleteFunc { + return func(ctx context.Context, _, system, user string) (string, int64, error) { + // The model param is ignored: the router picks the model based on policy. + return router.Run(ctx, routing.RunInput{ + Skill: skillName, + System: system, + User: user, + SessionID: routingSessionID, + ProjectRoot: "", + }) + } + } + + reg := registry.New() + reg.Register(review.New(review.Config{ + SkillPrompt: mustRead("review.md"), + DefaultModel: cfg.LocalModel, + CompleteFunc: wrap("code_review"), + })) + reg.Register(debug.New(debug.Config{ + SkillPrompt: mustRead("debug.md"), + DefaultModel: cfg.LocalModel, + CompleteFunc: wrap("debug"), + })) + reg.Register(retrospective.New(retrospective.Config{ + SkillPrompt: mustRead("retrospective.md"), + DefaultModel: cfg.LocalModel, + CompleteFunc: wrap("retrospective"), + })) + reg.Register(trainer.New(trainer.Config{ + ReaderPrompt: mustRead("trainer-reader.md"), + WriterPrompt: mustRead("trainer-writer.md"), + DefaultModel: cfg.LocalModel, + CompleteFunc: wrap("trainer"), + })) + + srv := mcp.NewServer(reg, cfg.MCPAuthToken) + mux := http.NewServeMux() + mux.Handle("/mcp", srv) + mux.HandleFunc("/healthz", func(w http.ResponseWriter, _ *http.Request) { + w.WriteHeader(http.StatusOK) + }) + + addr := ":" + cfg.Port + logger.Info("routing pod starting", "addr", addr, + "local", cfg.LocalModel, "claude", cfg.ClaudeModel, + "floor", cfg.RouteLocalFloor, "ceil", cfg.RouteLocalCeil) + if err := http.ListenAndServe(addr, mux); err != nil { + logger.Error("server stopped", "err", err) + os.Exit(1) + } +} + +func envOr(key, def string) string { + if v := os.Getenv(key); v != "" { + return v + } + return def +} +``` + +If the existing skill packages' `Config` field names differ 
from what's used here (e.g. `SkillPrompt` vs `Prompt`), adjust by reading each package's `skill.go`. + +- [ ] **Step 4: Run integration test + `task check`** + +```bash +go test ./cmd/routing/... -v +task check 2>&1 | tail -15 +``` + +Expected: PASS for both. + +- [ ] **Step 5: Commit** + +```bash +git add cmd/routing/main.go cmd/routing/main_test.go +git commit -m "feat(routing): cmd/routing binary" +``` + +--- + +## Task 9: Update `mode client-local` template + +**Worktree:** hyperguild + +Replace the `_routing_pending` placeholder with a real `headers` block carrying `X-Hyperguild-Mode: client-local`. URL stays at `koala:30310/mcp`. + +**Files:** +- Modify: `cmd/hyperguild/mode.go` +- Modify: `cmd/hyperguild/mode_test.go` +- Modify: `cmd/hyperguild/README.md` + +- [ ] **Step 1: Update the failing test** + +In `cmd/hyperguild/mode_test.go`, find the existing `TestModeClientLocal` (or equivalent). Add an assertion for the new shape: + +```go +func TestModeClientLocalHasRoutingHeader(t *testing.T) { + tmp := t.TempDir() + "/mcp.json" + out := &bytes.Buffer{} + stderr := &bytes.Buffer{} + require.NoError(t, runMode(context.Background(), []string{"client-local", "--out", tmp}, nil, out, stderr)) + + body, err := os.ReadFile(tmp) + require.NoError(t, err) + var doc map[string]any + require.NoError(t, json.Unmarshal(body, &doc)) + + servers := doc["mcpServers"].(map[string]any) + routing := servers["routing"].(map[string]any) + assert.Equal(t, "http://koala:30310/mcp", routing["url"]) + assert.NotContains(t, routing, "_routing_pending", "placeholder should be removed once Plan 6 ships") + + headers, ok := routing["headers"].(map[string]any) + require.True(t, ok, "routing entry should have headers block") + assert.Equal(t, "client-local", headers["X-Hyperguild-Mode"]) +} +``` + +- [ ] **Step 2: Run the test** + +```bash +go test ./cmd/hyperguild/... -run TestModeClientLocal -v +``` + +Expected: FAIL — `_routing_pending` is still there OR `headers` is missing. 
+ +- [ ] **Step 3: Update `mode.go`** + +Replace the `routing` entry inside `modeClientLocal`: + +```go +"routing": map[string]any{ + "url": "http://koala:30310/mcp", + "description": "Mode 2 routing pod — routes skill calls to LiteLLM/local", + "headers": map[string]any{ + "X-Hyperguild-Mode": "client-local", + }, +}, +``` + +- [ ] **Step 4: Update `cmd/hyperguild/README.md`** + +Find the section that mentions "Plan 6 — routing pod not deployed yet" and rewrite that paragraph: + +```markdown +The `routing` entry points at `koala:30310/mcp` (the routing pod, deployed +in Plan 6). The `X-Hyperguild-Mode: client-local` header is forward-compat +for future modes; the pod treats absent or unknown values as `client-local`. +``` + +- [ ] **Step 5: Run tests + `task check`** + +```bash +go test ./cmd/hyperguild/... -run TestModeClientLocal -v +task check 2>&1 | tail -10 +``` + +- [ ] **Step 6: Commit** + +```bash +git add cmd/hyperguild/mode.go cmd/hyperguild/mode_test.go cmd/hyperguild/README.md +git commit -m "feat(hyperguild): mode client-local writes routing headers" +``` + +--- + +## Task 10: `Dockerfile.routing` + CD workflow extension + +**Worktree:** hyperguild + +Add a Dockerfile for the routing binary and extend the CD workflow to build + push the image and update the infra repo's routing deployment manifest. + +**Files:** +- Create: `Dockerfile.routing` +- Modify: `.gitea/workflows/cd.yml` + +- [ ] **Step 1: Write `Dockerfile.routing`** + +```dockerfile +# syntax=docker/dockerfile:1 + +# ── Build stage ─────────────────────────────────────────────────────────────── +FROM golang:1.26-bookworm AS builder + +ARG VERSION=dev +WORKDIR /src + +COPY go.mod go.sum ./ +RUN go mod download + +COPY . . 
+RUN CGO_ENABLED=0 GOOS=linux GOARCH=amd64 \ + go build -trimpath -ldflags="-s -w -X main.version=${VERSION}" \ + -o /out/routing ./cmd/routing + +# ── Runtime stage ───────────────────────────────────────────────────────────── +FROM gcr.io/distroless/base-debian12 + +COPY --from=builder /out/routing /usr/local/bin/routing +COPY config/ /app/config/ + +ENV SUPERVISOR_CONFIG_DIR=/app/config/supervisor +ENV ROUTING_PORT=3210 + +EXPOSE 3210 + +USER 65532:65532 + +ENTRYPOINT ["/usr/local/bin/routing"] +``` + +- [ ] **Step 2: Extend `.gitea/workflows/cd.yml`** + +Add an `env:` entry: + +```yaml +env: + SERVICE: supervisor + IMAGE: gitea.d-ma.be/mathias/supervisor + INGESTION_IMAGE: gitea.d-ma.be/mathias/ingestion + ROUTING_IMAGE: gitea.d-ma.be/mathias/routing + INFRA_REPO: git@gitea.d-ma.be:mathias/infra.git + BUILDKIT_HOST: unix:///run/buildkit/buildkitd.sock +``` + +Add a new step after the ingestion build step: + +```yaml +- name: Build and push routing image + run: | + set -e + trap 'rm -f /tmp/routing-image.tar' EXIT + IMAGE_TAG="${{ github.sha }}" + echo "Building ${ROUTING_IMAGE}:${IMAGE_TAG}" + + buildctl --addr "${BUILDKIT_HOST}" build \ + --frontend dockerfile.v0 \ + --local context=. \ + --local dockerfile=. 
\
+      --opt filename=Dockerfile.routing \
+      --opt build-arg:VERSION="${IMAGE_TAG}" \
+      --output type=oci,dest=/tmp/routing-image.tar
+
+    skopeo copy \
+      oci-archive:/tmp/routing-image.tar \
+      docker://${ROUTING_IMAGE}:${IMAGE_TAG} \
+      --dest-creds "${{ secrets.REGISTRY_CREDS }}"
+
+    echo "Built and pushed ${ROUTING_IMAGE}:${IMAGE_TAG}"
+```
+
+In the "Update infra repo" step, add a third sed and update the commit:
+
+```yaml
+sed -i "s|gitea.d-ma.be/mathias/routing:.*|gitea.d-ma.be/mathias/routing:${IMAGE_TAG}|" \
+  "k3s/apps/routing/deployment.yaml"
+
+git config user.email "cd-bot@d-ma.be"
+git config user.name "CD Bot"
+git add "k3s/apps/${SERVICE}/deployment.yaml" \
+        "k3s/apps/${SERVICE}/ingestion-deployment.yaml" \
+        "k3s/apps/routing/deployment.yaml"
+git commit -m "chore(deploy): supervisor+ingestion+routing → ${IMAGE_TAG}"
+```
+
+- [ ] **Step 3: Validate the YAML locally**
+
+```bash
+yq eval '.jobs.deploy.steps | length' .gitea/workflows/cd.yml
+```
+
+Expected: a number greater than the original (one new step added).
+
+- [ ] **Step 4: Commit**
+
+The workflow change is hot: once pushed, CD will try to build the routing image. Until the infra repo has `k3s/apps/routing/deployment.yaml`, the infra-update step fails, because `sed -i` exits non-zero on a missing file and the `git add` of that path fails as well. Two options:
+
+**Option A (preferred):** Land the infra-repo manifests (Tasks 11–12) in the infra worktree FIRST, push them so they exist on `infra` main, then push this commit. Order: Tasks 11 → 12 → 10.
+
+**Option B:** Land the workflow change with a guard, then drop the guard once manifests exist.
+
+Implementer should pick Option A. After the manifests are in place:
+
+```bash
+git add Dockerfile.routing .gitea/workflows/cd.yml
+git commit -m "build(routing): Dockerfile + CD workflow"
+```
+
+DO NOT push this commit until Tasks 11 and 12 have been pushed to the infra repo's `main`.
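The image-tag substitution that the `sed` line in Step 2 performs can be sketched in Go to make its behavior explicit. This is an illustration under stated assumptions, not part of the workflow: `bumpImageTag` is a hypothetical name, and unlike the `sed` pattern it escapes the dots in the image name.

```go
package main

import (
	"fmt"
	"regexp"
)

// bumpImageTag is a hypothetical Go equivalent of the workflow's sed line.
// It replaces everything after "<image>:" up to the end of that line with
// the new tag. In Go regexps "." does not match a newline by default, so
// ".*" stops at the end of the line, as it does in sed.
func bumpImageTag(manifest, image, tag string) string {
	re := regexp.MustCompile(regexp.QuoteMeta(image) + `:.*`)
	// The literal variant avoids interpreting "$" in the replacement.
	return re.ReplaceAllLiteralString(manifest, image+":"+tag)
}

func main() {
	manifest := "containers:\n" +
		"  - name: routing\n" +
		"    image: gitea.d-ma.be/mathias/routing:initial\n" +
		"    ports:\n"
	fmt.Print(bumpImageTag(manifest, "gitea.d-ma.be/mathias/routing", "abc123"))
}
```

When the image line is absent the manifest comes back unchanged and no error is raised. A line-based substitution like this gives no signal if the manifest's image reference is ever renamed.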
+ +--- + +## Task 11: Routing pod manifests (infra worktree) + +**Worktree:** infra + +Create the k3s manifests for the routing pod. Mirror the supervisor's structure for operator familiarity. + +**Files:** +- Create: `k3s/apps/routing/namespace.yaml` +- Create: `k3s/apps/routing/deployment.yaml` +- Create: `k3s/apps/routing/service.yaml` +- Create: `k3s/apps/routing/nodeport.yaml` +- Create: `k3s/apps/routing/kustomization.yaml` +- Modify: `k3s/apps/kustomization.yaml` + +- [ ] **Step 1: `namespace.yaml`** + +```yaml +apiVersion: v1 +kind: Namespace +metadata: + name: routing +``` + +- [ ] **Step 2: `deployment.yaml`** + +The image tag will be bumped by CD; seed it with a placeholder that gets overwritten on first deploy. + +```yaml +apiVersion: apps/v1 +kind: Deployment +metadata: + name: routing + namespace: routing +spec: + replicas: 1 + selector: + matchLabels: + app: routing + template: + metadata: + labels: + app: routing + spec: + nodeSelector: + kubernetes.io/hostname: koala + imagePullSecrets: + - name: gitea-registry + containers: + - name: routing + image: gitea.d-ma.be/mathias/routing:initial + ports: + - containerPort: 3210 + envFrom: + - secretRef: + name: routing-secrets + env: + - name: ROUTING_PORT + value: "3210" + - name: LITELLM_BASE_URL + value: "http://piguard:4000" + - name: BRAIN_URL + value: "http://ingestion.supervisor:3300" + - name: HYPERGUILD_LOCAL_MODEL + value: "qwen35" + - name: HYPERGUILD_CLAUDE_MODEL + value: "claude-sonnet-4-6" + - name: HYPERGUILD_ROUTE_LOCAL_FLOOR + value: "0.90" + - name: HYPERGUILD_ROUTE_LOCAL_CEIL + value: "0.70" + - name: HYPERGUILD_PASS_RATE_TTL_SECONDS + value: "60" + readinessProbe: + httpGet: + path: /healthz + port: 3210 + initialDelaySeconds: 2 + periodSeconds: 10 +``` + +The `gitea-registry` imagePullSecret needs to exist in the `routing` namespace. If only present in `supervisor`, copy it (Step 6 below). 
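Kubernetes requires every `env` value to be a string, which is why the thresholds and the TTL in the Deployment are quoted. The pod therefore has to parse them at startup. The sketch below shows one way that parsing might look. It is not the plan's `config.LoadRouting`: `routeSettings` and `loadRouteSettings` are hypothetical names, and only the env-var names and values come from the manifest.

```go
package main

import (
	"fmt"
	"strconv"
)

// routeSettings is a hypothetical subset of the routing pod's configuration.
type routeSettings struct {
	LocalModel  string
	ClaudeModel string
	Floor       float64
	Ceil        float64
	TTLSeconds  int
}

// loadRouteSettings reads the env-var names used in the Deployment through
// an injected lookup function, so it can be exercised without touching the
// process environment.
func loadRouteSettings(getenv func(string) string) (routeSettings, error) {
	s := routeSettings{
		LocalModel:  getenv("HYPERGUILD_LOCAL_MODEL"),
		ClaudeModel: getenv("HYPERGUILD_CLAUDE_MODEL"),
	}
	var err error
	if s.Floor, err = strconv.ParseFloat(getenv("HYPERGUILD_ROUTE_LOCAL_FLOOR"), 64); err != nil {
		return routeSettings{}, fmt.Errorf("HYPERGUILD_ROUTE_LOCAL_FLOOR: %w", err)
	}
	if s.Ceil, err = strconv.ParseFloat(getenv("HYPERGUILD_ROUTE_LOCAL_CEIL"), 64); err != nil {
		return routeSettings{}, fmt.Errorf("HYPERGUILD_ROUTE_LOCAL_CEIL: %w", err)
	}
	if s.TTLSeconds, err = strconv.Atoi(getenv("HYPERGUILD_PASS_RATE_TTL_SECONDS")); err != nil {
		return routeSettings{}, fmt.Errorf("HYPERGUILD_PASS_RATE_TTL_SECONDS: %w", err)
	}
	return s, nil
}

func main() {
	// Values copied from the Deployment manifest.
	env := map[string]string{
		"HYPERGUILD_LOCAL_MODEL":           "qwen35",
		"HYPERGUILD_CLAUDE_MODEL":          "claude-sonnet-4-6",
		"HYPERGUILD_ROUTE_LOCAL_FLOOR":     "0.90",
		"HYPERGUILD_ROUTE_LOCAL_CEIL":      "0.70",
		"HYPERGUILD_PASS_RATE_TTL_SECONDS": "60",
	}
	s, err := loadRouteSettings(func(k string) string { return env[k] })
	fmt.Println(s, err)
}
```

Failing fast on an unparsable value, instead of silently falling back to a default, would surface a typo in the manifest as a crash loop. Whether the real loader does this is not stated in this section.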
+ +- [ ] **Step 3: `service.yaml`** + +```yaml +apiVersion: v1 +kind: Service +metadata: + name: routing + namespace: routing +spec: + selector: + app: routing + ports: + - port: 3210 + targetPort: 3210 + protocol: TCP +``` + +- [ ] **Step 4: `nodeport.yaml`** + +```yaml +apiVersion: v1 +kind: Service +metadata: + name: routing-nodeport + namespace: routing +spec: + type: NodePort + selector: + app: routing + ports: + - port: 3210 + targetPort: 3210 + nodePort: 30310 + protocol: TCP +``` + +- [ ] **Step 5: `kustomization.yaml`** (inside `k3s/apps/routing/`) + +```yaml +apiVersion: kustomize.config.k8s.io/v1beta1 +kind: Kustomization +resources: + - namespace.yaml + - deployment.yaml + - service.yaml + - nodeport.yaml + - secrets.enc.yaml +``` + +`secrets.enc.yaml` is added in Task 12; reference it now so the directory is complete. + +- [ ] **Step 6: Add `routing` to the apps `kustomization.yaml`** + +Modify `k3s/apps/kustomization.yaml`: + +```yaml +apiVersion: kustomize.config.k8s.io/v1beta1 +kind: Kustomization +resources: + - imagepullsecret + - registry + - gitea + - infra-mcp + - supervisor + - cobalt-dingo + - routing +``` + +If `imagepullsecret/` only seeds the secret in specific namespaces, ensure `routing` is added to that list — inspect `k3s/apps/imagepullsecret/` and follow the existing pattern. + +- [ ] **Step 7: Validate manifest syntax with `kustomize build`** + +```bash +cd ~/Documents/local-dev/AI/infra/.worktrees/mode-2-routing-pod +kustomize build k3s/apps/routing 2>&1 | head -20 +``` + +Expected: valid YAML output, no errors. If `secrets.enc.yaml` is referenced but missing, suppress for now by temporarily commenting that line; uncomment in Task 12. + +- [ ] **Step 8: Commit (do NOT push yet)** + +```bash +git add k3s/apps/routing/ k3s/apps/kustomization.yaml +git commit -m "feat(routing): k3s manifests for the new pod" +``` + +Push happens after Task 12 (with the encrypted Secret) so the kustomization is consistent on first Flux apply. 
+
+---
+
+## Task 12: Routing-secrets Secret + Flux verification
+
+**Worktree:** infra
+
+Encrypt and add the `routing-secrets` Secret. The Secret carries `LITELLM_API_KEY` (reused from the supervisor's secret) and optionally a `ROUTING_MCP_TOKEN` for bearer auth.
+
+**Files:**
+- Create: `k3s/apps/routing/secrets.enc.yaml`
+
+- [ ] **Step 1: Generate a token (or skip auth for first deploy)**
+
+```bash
+# generate (or omit ROUTING_MCP_TOKEN for unauthenticated first deploy):
+openssl rand -hex 32
+```
+
+Record the value; it will be set in the operator's shell env when Mode 2 is exercised in any project.
+
+- [ ] **Step 2: Decode the cluster's age key**
+
+```bash
+export SOPS_AGE_KEY="$(kubectl get secret sops-age -n flux-system -o jsonpath='{.data.age\.agekey}' | base64 -d)"
+[ -n "$SOPS_AGE_KEY" ] && echo "age key loaded ($(echo -n "$SOPS_AGE_KEY" | wc -c) bytes)" || (echo "FAIL"; exit 1)
+```
+
+- [ ] **Step 3: Pull `LITELLM_API_KEY` value from the supervisor's secret**
+
+Decrypt the supervisor's Secret to read the existing value:
+
+```bash
+LITELLM_API_KEY="$(sops -d k3s/apps/supervisor/secrets.enc.yaml | yq eval '.stringData.DMABE_LLMAPI_KEY' -)"
+[ -n "$LITELLM_API_KEY" ] && echo "found litellm key" || (echo "FAIL: empty"; exit 1)
+```
+
+(`DMABE_LLMAPI_KEY` is the supervisor's name for the LiteLLM key; it holds the same value under a different env-var name in the consumer.)
+
+- [ ] **Step 4: Create the routing Secret**
+
+```bash
+cat > /tmp/routing-secrets.yaml <<EOF
+apiVersion: v1
+kind: Secret
+metadata:
+  name: routing-secrets
+  namespace: routing
+stringData:
+  LITELLM_API_KEY: "${LITELLM_API_KEY}"
+  ROUTING_MCP_TOKEN: "<token-from-step-1>"
+EOF
+```
+
+Edit `/tmp/routing-secrets.yaml` and paste the token (or leave the field as `""` for unauthenticated first deploy).
+ +- [ ] **Step 5: Encrypt with SOPS** + +```bash +sops --encrypt --age age15xez8pcmgg3daxpuqnye9ewawvzjtallheddcrq88ph573yle3nsr5hdq6 \ + --encrypted-regex '^(stringData|data)$' \ + /tmp/routing-secrets.yaml \ + > k3s/apps/routing/secrets.enc.yaml + +rm /tmp/routing-secrets.yaml +unset SOPS_AGE_KEY LITELLM_API_KEY +``` + +Verify the file: + +```bash +head -10 k3s/apps/routing/secrets.enc.yaml +``` + +Expected: `apiVersion: v1`, `kind: Secret`, `stringData:` with `ENC[...]` values. + +- [ ] **Step 6: `kustomize build` re-check** + +```bash +kustomize build k3s/apps/routing | head -30 +``` + +Expected: namespaces, deployment, services, and a Secret with encrypted data fields. Should succeed. + +- [ ] **Step 7: Commit and push (this is the Flux activation)** + +```bash +git add k3s/apps/routing/secrets.enc.yaml +git commit -m "feat(routing): SOPS-encrypted routing-secrets" +git pull --rebase origin main +git push origin main +``` + +`git pull --rebase` accommodates intervening CD-bot commits on `main` (per the auth-rollout precedent earlier today). + +- [ ] **Step 8: Wait for Flux to reconcile** + +```bash +NEW_SHA=$(git rev-parse HEAD) +until kubectl -n flux-system get kustomization apps -o jsonpath='{.status.lastAppliedRevision}' 2>/dev/null | grep -qE "${NEW_SHA:0:7}"; do + sleep 3 +done +echo "Flux applied $NEW_SHA" +``` + +The pod will be in `ImagePullBackOff` because the `:initial` placeholder image doesn't exist yet — that's expected. The CD workflow (Task 10) will publish the real image and bump the tag. + +- [ ] **Step 9: Verify expected partial state** + +```bash +kubectl -n routing get all +``` + +Expected: namespace, deployment (0/1 ready), service, nodeport-service. Pod is in `ErrImagePull` until Task 10 runs end-to-end. + +--- + +## Task 13: `task smoke:routing` live-contract test + +**Worktree:** hyperguild + +Boots the routing binary against the real `piguard:4000` LiteLLM and the real `koala:30330` brain. 
+Calls each of the four advertised tools once, then verifies that `_routing` entries appear in the brain.
+
+**Files:**
+- Create: `scripts/smoke-routing.sh`
+- Modify: `Taskfile.yml`
+
+- [ ] **Step 1: Write `scripts/smoke-routing.sh`**
+
+```bash
+#!/usr/bin/env bash
+set -euo pipefail
+
+# Boot the routing binary and exercise its four tools against live deps.
+# Skipped when LITELLM_BASE_URL or BRAIN_URL is unreachable.
+
+LITELLM_BASE_URL="${LITELLM_BASE_URL:-http://piguard:4000}"
+BRAIN_URL="${BRAIN_URL:-http://koala:30330}"
+
+if ! curl -sS --max-time 2 "${LITELLM_BASE_URL}/v1/models" >/dev/null 2>&1; then
+  echo "SKIP: LITELLM at ${LITELLM_BASE_URL} unreachable"
+  exit 0
+fi
+if ! curl -sS --max-time 2 "${BRAIN_URL}/query" -X POST -d '{"query":"x","k":1}' -H 'Content-Type: application/json' >/dev/null 2>&1; then
+  echo "SKIP: BRAIN at ${BRAIN_URL} unreachable"
+  exit 0
+fi
+
+PORT=33310
+BIN=$(mktemp)
+trap 'rm -f $BIN; pkill -P $$ -f "$BIN" 2>/dev/null || true' EXIT
+
+go build -o "$BIN" ./cmd/routing
+
+LITELLM_BASE_URL="$LITELLM_BASE_URL" BRAIN_URL="$BRAIN_URL" \
+  ROUTING_PORT="$PORT" SUPERVISOR_CONFIG_DIR="$(pwd)/config/supervisor" \
+  "$BIN" &
+BIN_PID=$!
+
+# Wait for the binary to bind.
+for _ in $(seq 1 50); do
+  curl -sS "http://127.0.0.1:${PORT}/healthz" >/dev/null 2>&1 && break
+  sleep 0.1
+done
+
+call_tool() {
+  local tool="$1"
+  local args="$2"
+  curl -sS -X POST "http://127.0.0.1:${PORT}/mcp" \
+    -H 'Content-Type: application/json' \
+    -d "{\"jsonrpc\":\"2.0\",\"id\":1,\"method\":\"tools/call\",\"params\":{\"name\":\"${tool}\",\"arguments\":${args}}}" \
+    | jq -e '.result // .error' > /dev/null
+}
+
+echo "calling tools/list..."
+curl -sS -X POST "http://127.0.0.1:${PORT}/mcp" \
+  -H 'Content-Type: application/json' \
+  -d '{"jsonrpc":"2.0","id":1,"method":"tools/list"}' \
+  | jq -r '.result.tools | map(.name) | sort | .[]'
+
+echo "calling each tool..."
+call_tool code_review '{"project_root":"/tmp","files":["README.md"],"session_id":"smoke-1"}'
+call_tool debug '{"project_root":"/tmp","problem":"smoke test","session_id":"smoke-1"}'
+call_tool retrospective '{"project_root":"/tmp","session_id":"smoke-1"}'
+call_tool trainer '{"project_root":"/tmp","session_id":"smoke-1"}'
+
+echo "checking brain has _routing entries..."
+sleep 2
+COUNT=$(curl -sS "${BRAIN_URL}/pass-rate?skill=_routing&window=1h" | jq -r '.total // 0')
+if [ "${COUNT}" -lt 4 ]; then
+  echo "FAIL: expected ≥4 _routing entries in last 1h, got ${COUNT}"
+  exit 1
+fi
+
+echo "PASS: smoke:routing"
+```
+
+Make it executable:
+
+```bash
+chmod +x scripts/smoke-routing.sh
+```
+
+The exact `arguments` shape per tool may need to be adjusted based on each skill's required fields. If a smoke call returns a JSON-RPC error like "missing required argument", read the failing tool's `Tools()` definition in `internal/skills//skill.go` and add the required field with a stub value.
+
+- [ ] **Step 2: Add the Taskfile target**
+
+In `Taskfile.yml`, append to the `tasks:` map:
+
+```yaml
+  smoke:routing:
+    desc: Boot the routing pod against live LiteLLM + brain and verify _routing logs land
+    cmds:
+      - bash scripts/smoke-routing.sh
+```
+
+- [ ] **Step 3: Run it**
+
+```bash
+task smoke:routing
+```
+
+Expected: SKIP if offline; PASS otherwise.
+
+- [ ] **Step 4: Commit**
+
+```bash
+git add scripts/smoke-routing.sh Taskfile.yml
+git commit -m "test(routing): live-contract smoke target"
+```
+
+---
+
+## Task 14: Documentation updates
+
+**Worktree:** hyperguild
+
+Update the project-level docs to describe Mode 2, the new env vars, and the routing-pod URL.
+
+**Files:**
+- Modify: `README.md`
+- Modify: `.context/PROJECT.md`
+
+- [ ] **Step 1: Update `README.md`'s "Key env vars" table**
+
+Append:
+
+```markdown
+| `ROUTING_PORT` | `3210` | Routing pod's listen port |
+| `ROUTING_MCP_TOKEN` | — | Optional bearer token for the routing MCP HTTP endpoint |
+| `BRAIN_URL` | `http://ingestion.supervisor:3300` | Routing pod → brain (in-cluster) |
+| `HYPERGUILD_LOCAL_MODEL` | `qwen35` | Local model for routed-to-local skill calls |
+| `HYPERGUILD_CLAUDE_MODEL` | `claude-sonnet-4-6` | Claude model for routed-to-Claude skill calls |
+| `HYPERGUILD_ROUTE_LOCAL_FLOOR` | `0.90` | At or above this pass rate, route to local |
+| `HYPERGUILD_ROUTE_LOCAL_CEIL` | `0.70` | Below this pass rate, route to Claude. Rates from CEIL up to but excluding FLOOR fall in the sample band. |
+| `HYPERGUILD_PASS_RATE_TTL_SECONDS` | `60` | Per-skill pass-rate cache TTL |
+```
+
+In the architecture diagram block at the top of the README, add the routing pod:
+
+```
+Your Claude Code session (in any project)
+    │
+    │ MCP over HTTP (Tailscale)
+    ├──▶ supervisor :3200 (NodePort 30320 on koala) — skill workers: tdd, debug, spec, …
+    ├──▶ routing :3210 (NodePort 30310 on koala) — Mode 2 only: code_review, debug, retrospective, trainer
+    └──▶ brain :3300 (NodePort 30330 on koala) — brain_query, brain_write, brain_ingest, session_log
+```
+
+- [ ] **Step 2: Update `.context/PROJECT.md`**
+
+Find the "MCP endpoints" section and add a third bullet:
+
+```markdown
+- **`routing`** at `http://koala:30310/mcp` — Mode 2 routing pod. Advertises
+  the same four cost-routable skills as the supervisor (`code_review`,
+  `debug`, `retrospective`, `trainer`) but per-call decides whether to use
+  a local model or Claude based on the brain's `/pass-rate` response.
+  Bearer auth via `ROUTING_MCP_TOKEN` (opt-in). Only `mode client-local`
+  registers this endpoint; Mode 1 and Mode 3 do not.
+```
+
+- [ ] **Step 3: Run `task context:sync` so derived adapters update**
+
+```bash
+task context:sync
+```
+
+This regenerates `CLAUDE.md`, `AGENTS.md`, `.cursorrules`, `.aider.conventions.md`, and `.context/system-prompt.txt` from the canonical sources.
+
+- [ ] **Step 4: `task check`**
+
+```bash
+task check 2>&1 | tail -10
+```
+
+Expected: drift check green (regenerated adapters tracked).
+
+- [ ] **Step 5: Commit**
+
+```bash
+git add README.md .context/PROJECT.md CLAUDE.md AGENTS.md .cursorrules .aider.conventions.md .context/system-prompt.txt
+git commit -m "docs(routing): document Mode 2 routing pod + env vars"
+```
+
+---
+
+## Final verification before merge
+
+After all 14 tasks land, on the hyperguild worktree's branch:
+
+- [ ] **Run the full check chain**
+
+```bash
+cd ~/Documents/local-dev/AI/hyperguild/.worktrees/mode-2-routing-pod
+task check 2>&1 | tail -15
+```
+
+Expected: 0 issues across lint, test, vet, drift, govulncheck.
+
+- [ ] **Run the smoke test if Tailscale is available**
+
+```bash
+task smoke:routing
+```
+
+Expected: PASS or SKIP (with a clear reason).
+
+- [ ] **Verify the snapshot test still passes**
+
+The skill packages can drift between when the snapshot was captured and merge time. Re-run:
+
+```bash
+go test ./internal/routing/... -run TestToolsListMatchesSupervisorSnapshot -v
+```
+
+If it fails because of an intentional schema change in the merge window, re-capture the snapshot per Task 7's Step 1 and commit the update with a clear message.
+
+- [ ] **Push the hyperguild branch and merge**
+
+```bash
+git push -u origin feat/mode-2-routing-pod
+```
+
+Open a PR (or merge to main if the workflow allows direct push). Once merged, gitea CI builds the routing image and CD pushes the image-tag bump to the infra repo.
+
+- [ ] **Verify Flux applies the new image and the pod becomes Ready**
+
+```bash
+NEW_SHA=$(git -C ~/Documents/local-dev/AI/hyperguild rev-parse main)
+echo "Watching for image tag $NEW_SHA on routing deployment..."
+until kubectl -n routing get deployment routing -o jsonpath='{.spec.template.spec.containers[0].image}' 2>/dev/null | grep -qE "${NEW_SHA:0:7}"; do
+  sleep 5
+done
+kubectl -n routing rollout status deployment/routing --timeout=120s
+```
+
+Expected: deployment becomes `1/1 Ready` with the new image.
+
+If the pod stays `Pending` or `ImagePullBackOff` past 2 minutes, check:
+
+```bash
+kubectl -n routing describe pod -l app=routing | tail -30
+kubectl -n routing logs -l app=routing --tail=50
+```
+
+- [ ] **Final live verification**
+
+```bash
+# tools/list should return 4 tools.
+# If ROUTING_MCP_TOKEN is set on the pod, this call needs the bearer token too
+# (assumed to be the standard header: -H "Authorization: Bearer $ROUTING_MCP_TOKEN");
+# without it the request is rejected with 401 instead of listing tools.
+curl -sS -X POST http://koala:30310/mcp \
+  -H 'Content-Type: application/json' \
+  -d '{"jsonrpc":"2.0","id":1,"method":"tools/list"}' \
+  | jq '.result.tools | length'
+# expected: 4
+
+# auth check (only meaningful if ROUTING_MCP_TOKEN is set on the pod)
+curl -isS -X POST http://koala:30310/mcp \
+  -H 'Content-Type: application/json' \
+  -d '{"jsonrpc":"2.0","id":1,"method":"tools/list"}' | head -1
+# expected: 401 if token set, 200 otherwise
+```
+
+- [ ] **Restart pod the Flux-friendly way if needed**
+
+For any post-merge restart that doesn't ride a fresh image bump, use `kubectl delete pod` (not `kubectl rollout restart`, because Flux strips the annotation):
+
+```bash
+kubectl -n routing delete pod -l app=routing
+```
+
+The existing ReplicaSet recreates the pod, picking up any Secret data changes on startup.
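
The `until` loops used in this plan to wait on Flux have no upper bound, so a reconcile that never lands leaves the shell polling forever. A minimal sketch of a bounded variant follows. `wait_until`, `flux_applied`, `WAIT_INTERVAL`, and the 300-second budget are illustrative assumptions, not anything defined in the repos.

```shell
# wait_until TIMEOUT_SECONDS CMD [ARGS...]
# Re-runs CMD until it succeeds or TIMEOUT_SECONDS have elapsed.
wait_until() {
  wu_timeout="$1"
  shift
  wu_deadline=$(( $(date +%s) + wu_timeout ))
  until "$@"; do
    if [ "$(date +%s)" -ge "$wu_deadline" ]; then
      echo "timed out after ${wu_timeout}s waiting for: $*" >&2
      return 1
    fi
    sleep "${WAIT_INTERVAL:-3}"   # poll interval in seconds, default 3
  done
}

# Hypothetical condition: the same check as the Flux loop in Task 12, Step 8,
# wrapped in a function so it can be passed to wait_until.
flux_applied() {
  kubectl -n flux-system get kustomization apps \
    -o jsonpath='{.status.lastAppliedRevision}' 2>/dev/null | grep -q "$1"
}

# Example (not run here):
#   wait_until 300 flux_applied "$(git rev-parse --short=7 HEAD)"
```

Wrapping the condition in a function keeps the pipeline out of the loop header and lets the same helper cover the image-tag wait in the final verification. A non-zero return makes it easy to fall through to the `kubectl describe` and `kubectl logs` diagnostics.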