docs: model orchestration design spec for Phase 3

2026-04-20 07:45:32 +02:00
parent f901d4e67d
commit 76f195de2a
1 changed files with 322 additions and 0 deletions
--- a/docs/superpowers/specs/2026-04-20-model-orchestration-design.md
+++ b/docs/superpowers/specs/2026-04-20-model-orchestration-design.md
@@ -0,0 +1,322 @@
+# Model Orchestration Design
+
+**Date:** 2026-04-20  
+**Status:** Approved for implementation
+
+## Problem statement
+
+The hyperguild supervisor currently spawns a `claude --print` subprocess for every skill call. The model routing config (`models.yaml`) exists but is dead weight — the model name is injected as text into the task prompt and ignored. Every skill call costs Claude tokens regardless of task complexity or data sensitivity.
+
+## Goal
+
+Route skill work to the most appropriate model — weighing cost, latency, and quality — with Claude acting as the real supervisor: verifying outputs and deciding when to escalate. Local models on owned hardware handle the common case; Claude escalates through a chain to frontier models only when local quality is insufficient.
+
+## Success criteria
+
+- [ ] Each skill dispatches generation to its configured local model via LiteLLM by default
+- [ ] Claude verifies every local output and either accepts or escalates
+- [ ] Escalation walks a per-skill chain (local small → local large → Sonnet → Opus) with one attempt per tier
+- [ ] Every attempt (model, tier, duration, warm state, verdict) is logged in the session JSONL
+- [ ] Cloud tiers (Sonnet/Opus) self-certify — no separate verifier call
+- [ ] Zero changes to skill handlers — they call `ExecutorFn` exactly as today
+- [ ] `LiteLTMBaseURL` already in config; no new env vars required beyond `LLAMA_SWAP_URL`
+
+## Constraints
+
+- One attempt per tier before escalating (no retry within a tier)
+- Anthropic T&C: Claude is called normally via Anthropic API; local models are called directly via LiteLLM HTTP — no API redirection
+- `models.yaml` remains the single routing config file
+
+## Out of scope
+
+- Auto-rerouting based on real-time warm state (logged, not acted on — Phase 4)
+- Multi-tenant / public service exposure
+- RAG/CAG model boosting
+- Managed Agent cloud delegation (chain stub only in Phase 3)
+
+---
+
+## Architecture
+
+```
+MCP tool call (Claude Code)
+    ↓
+Skill handler — calls ExecutorFn (unchanged)
+    ↓
+Orchestrator.Run (implements ExecutorFn)
+    ├─ Resolve chain from models.yaml
+    ├─ For each model in chain:
+    │   ├─ [ollama/*] → LiteLLM executor → generate
+    │   │       ↓
+    │   │   Claude verifier (task + output + discipline)
+    │   │       ├─ accept  → return Result (log attempt)
+    │   │       └─ escalate → next tier (log attempt)
+    │   │
+    │   └─ [claude-*] → Claude executor (current) → generate + self-certify
+    │           └─ return Result (log attempt)
+    │
+    └─ All tiers exhausted → return best attempt with escalation note
+```
+
+Claude is always the verifier for local tiers. At cloud tiers, Claude generates and self-certifies — the verifier call is skipped.
+
+---
+
+## Components
+
+### 1. `internal/exec/litellm.go` — LiteLLM executor
+
+Calls `POST /v1/chat/completions` on the configured LiteLLM server. Implements the same `ExecutorFn` signature as the existing claude executor.
+
+```go
+type LiteLLMExecutor struct {
+    BaseURL    string
+    APIKey     string
+    HTTPClient *http.Client
+    Timeout    time.Duration
+}
+
+func NewLiteLLM(baseURL, apiKey string, timeout time.Duration) *LiteLLMExecutor
+
+func (e *LiteLLMExecutor) Run(ctx context.Context, req Request) (Result, error)
+```
+
+Request mapping:
+- `req.SkillPrompt` → system message
+- `req.TaskPrompt` → user message
+- `req.Model` → `model` field in the chat completions request
+
+Response handling: local models are prompted (via the discipline file output contract) to return a JSON object matching the `Result` schema. The executor attempts `json.Unmarshal` into `Result` directly — no envelope unwrapping needed (unlike the `--output-format json` claude envelope). If unmarshalling fails, the executor returns an error that the orchestrator treats as an automatic escalation trigger.
+
+### 2. `internal/exec/verifier.go` — Claude verifier
+
+A focused Claude call that judges local model output. Uses the existing `Executor` (claude subprocess) internally.
+
+```go
+type Verdict struct {
+    Accept   bool   `json:"accept"`
+    Feedback string `json:"feedback"` // reason if not accepting; empty if accept
+}
+
+type Verifier struct {
+    executor *Executor // the existing claude executor
+}
+
+func NewVerifier(executor *Executor) *Verifier
+
+func (v *Verifier) Verify(ctx context.Context, skillPrompt, taskPrompt string, output Result) (Verdict, error)
+```
+
+The verifier prompt gives Claude:
+1. The skill discipline file (so it knows the iron laws and output contract)
+2. The original task prompt (informed verification — Claude sees what was asked)
+3. The generated output
+4. A short instruction: "Does this output satisfy the discipline's iron laws and output contract? Reply with JSON: `{\"accept\": true|false, \"feedback\": \"...\"}`"
+
+The verifier uses a lightweight JSON schema for its own output (a `Verdict` schema), keeping the call fast.
+
+### 3. `internal/exec/orchestrator.go` — chain walker
+
+Implements `ExecutorFn`. Walks the escalation chain, delegating generation and verification per tier.
+
+```go
+type Chain []ChainEntry
+
+type ChainEntry struct {
+    Model    string // e.g. "ollama/phi4", "claude-sonnet-4-5"
+    Tier     string // "local" | "subagent" | "managed"
+    IsCloud  bool   // true for claude-* models; skips verifier
+}
+
+type Orchestrator struct {
+    chain    Chain
+    litellm  *LiteLLMExecutor
+    claude   *Executor
+    verifier *Verifier
+    llamaSwapURL string // for warm-state probe
+}
+
+func NewOrchestrator(chain Chain, litellm *LiteLLMExecutor, claude *Executor, verifier *Verifier, llamaSwapURL string) *Orchestrator
+
+func (o *Orchestrator) Run(ctx context.Context, req Request) (Result, error)
+```
+
+Algorithm:
+```
+for each entry in chain:
+    warm = probe llama-swap (if local tier)
+    start = now()
+    if entry.IsCloud:
+        result, err = claude.Run(ctx, req with entry.Model)
+        log attempt(model, tier, duration, warm, verified=true)
+        if err == nil: return result
+    else:
+        result, err = litellm.Run(ctx, req with entry.Model)
+        duration = now() - start
+        if err != nil:
+            log attempt(model, tier, duration, warm, verified=false)
+            continue  // automatic escalation on parse/network error
+        verdict = verifier.Verify(ctx, req.SkillPrompt, req.TaskPrompt, result)
+        log attempt(model, tier, duration, warm, verified=verdict.Accept)
+        if verdict.Accept: return result
+        // inject verifier feedback into next tier's task prompt
+        req.TaskPrompt = req.TaskPrompt + "\n\nPrior attempt feedback: " + verdict.Feedback
+
+return error("all tiers exhausted")
+```
+
+### 4. `internal/config/models.go` — chain parser
+
+Replaces the current single-model resolution with chain parsing.
+
+Updated `models.yaml` format:
+
+```yaml
+verifier: claude-sonnet-4-6   # fixed verifier for all local tiers
+
+llama_swap_url: http://koala:8080   # for warm-state probing
+
+default_chain:
+  - ollama/qwen3-coder-30b-tuned
+  - claude-sonnet-4-5
+
+skills:
+  tdd:
+    chain:
+      - ollama/qwen3-coder-30b-tuned
+      - claude-sonnet-4-5
+  review:
+    chain:
+      - ollama/devstral-tuned
+      - ollama/gemma4
+      - claude-sonnet-4-5
+  debug:
+    chain:
+      - ollama/deepseek-r1-tuned
+      - claude-sonnet-4-5
+  spec:
+    chain:
+      - ollama/phi4
+      - ollama/gemma4
+      - claude-sonnet-4-5
+      - claude-opus-4-6
+  retrospective:
+    chain:
+      - ollama/qwen3-coder-30b-tuned
+      - claude-sonnet-4-5
+  trainer:
+    chain:
+      - ollama/qwen3-coder-30b-tuned
+      - claude-sonnet-4-5
+```
+
+The parser exposes:
+```go
+func (m *Models) ChainFor(skill string) Chain
+func (m *Models) Verifier() string
+func (m *Models) LlamaSwapURL() string
+```
+
+Caller override (`model` param in MCP tool call) pins the chain to a single entry — one model, no escalation. This preserves the existing override behaviour for power users.
+
+### 5. `internal/session/session.go` — updated `Attempt` struct
+
+```go
+type Attempt struct {
+    Attempt       int    `json:"attempt"`
+    Model         string `json:"model"`
+    Tier          string `json:"tier"`          // local | subagent | managed
+    DurationMs    int64  `json:"duration_ms"`
+    WarmStart     bool   `json:"warm_start"`    // model was already loaded in llama-swap
+    Verified      bool   `json:"verified"`
+    Verdict       string `json:"verdict,omitempty"` // accept | escalate | error
+    Feedback      string `json:"feedback,omitempty"` // verifier feedback on escalation
+    OutputSummary string `json:"output_summary,omitempty"`
+    RunnerOutput  string `json:"runner_output,omitempty"`
+}
+```
+
+### 6. `cmd/supervisor/main.go` — one wiring change
+
+```go
+// Before:
+reg.Register(review.New(review.Config{ExecutorFn: executor.Run, ...}))
+
+// After:
+chain := models.ChainFor("review")
+orch := exec.NewOrchestrator(chain, litellmExec, claudeExec, verifier, models.LlamaSwapURL())
+reg.Register(review.New(review.Config{ExecutorFn: orch.Run, ...}))
+```
+
+One orchestrator per skill, sharing the same `litellmExec`, `claudeExec`, and `verifier` instances.
+
+---
+
+## Data flow example: `review` skill call
+
+1. Claude Code calls `review` tool with `files: ["internal/foo.go"]`
+2. Skill handler builds task prompt, calls `orch.Run`
+3. Orchestrator resolves chain: `[devstral, gemma4, sonnet]`
+4. Probes llama-swap: devstral is warm
+5. LiteLLM calls devstral → returns JSON result
+6. Verifier asks Claude: "does this review satisfy the iron laws?"
+7. Claude: `{"accept": false, "feedback": "missing line references for all findings"}`
+8. Orchestrator logs attempt #1 (devstral, local, 4200ms, warm, escalate)
+9. Injects feedback into task prompt, calls gemma4
+10. Verifier: `{"accept": true}`
+11. Orchestrator logs attempt #2 (gemma4, local, 6100ms, cold, accept)
+12. Returns result to skill handler → MCP response
+
+Session JSONL records both attempts. You can see: devstral was warm but produced weak output; gemma4 was cold but passed.
+
+---
+
+## Observability
+
+Session JSONL is the primary store. Each `Entry.Attempts` slice records the full escalation trail. To analyse across sessions:
+
+```bash
+# Which models are escalating most?
+jq -r '.attempts[] | select(.verdict == "escalate") | .model' brain/sessions/*.jsonl | sort | uniq -c
+
+# Average latency per model
+jq -r '.attempts[] | [.model, .duration_ms] | @tsv' brain/sessions/*.jsonl | awk '{sum[$1]+=$2; n[$1]++} END {for (m in sum) print m, sum[m]/n[m]}'
+
+# Cold start frequency
+jq -r '.attempts[] | select(.warm_start == false) | .model' brain/sessions/*.jsonl | sort | uniq -c
+```
+
+No new metrics infrastructure needed for Phase 3. Phase 4 can build a dashboard on top of this data.
+
+---
+
+## Error handling
+
+| Scenario | Behaviour |
+|----------|-----------|
+| LiteLLM unreachable | Log attempt as error, escalate immediately |
+| Local model returns unparseable JSON | Log attempt as error, escalate |
+| Verifier call fails | Log, treat as escalate (safe default) |
+| All tiers exhausted | Return error to skill handler; skill returns MCP error to caller |
+| Caller passes `model` override | Single-entry chain, no escalation, no verifier call |
+
+---
+
+## Testing approach
+
+- `TestLiteLLMExecutor`: mock HTTP server returning valid/invalid JSON; verify parse logic and error escalation
+- `TestVerifier`: fake claude executor returning accept/escalate verdicts; verify prompt construction
+- `TestOrchestrator`: table-driven — chains of 1/2/3 tiers, various accept/escalate/error combinations; verify attempt log contents and final result
+- `TestModelsChainFor`: YAML parsing for all skill overrides and default_chain fallback
+- Integration smoke test: start real LiteLLM (or mock), call `review` tool via MCP, verify attempt log written
+
+---
+
+## Risks
+
+| Risk | Mitigation |
+|------|------------|
+| Local models ignore output contract → bad JSON | Discipline files already specify JSON output contract; parse failure auto-escalates |
+| Verifier Claude call adds latency to every local attempt | Verifier prompt is small and fast; acceptable tradeoff for quality gate |
+| llama-swap warm probe adds overhead | Probe is a single lightweight HTTP GET; timeout at 200ms, treat failure as `warm_start: false` |
+| Chain exhaustion leaves caller with no result | Return structured error via MCP; caller can retry with explicit `model` override |