From 76f195de2a127ea8c338115052680c6f212aaadd Mon Sep 17 00:00:00 2001
From: Mathias Bergqvist
Date: Mon, 20 Apr 2026 07:45:32 +0200
Subject: [PATCH] docs: model orchestration design spec for Phase 3

---
 .../2026-04-20-model-orchestration-design.md | 322 ++++++++++++++++++
 1 file changed, 322 insertions(+)
 create mode 100644 docs/superpowers/specs/2026-04-20-model-orchestration-design.md

diff --git a/docs/superpowers/specs/2026-04-20-model-orchestration-design.md b/docs/superpowers/specs/2026-04-20-model-orchestration-design.md
new file mode 100644
index 0000000..eb47ae4
--- /dev/null
+++ b/docs/superpowers/specs/2026-04-20-model-orchestration-design.md
@@ -0,0 +1,322 @@
+# Model Orchestration Design
+
+**Date:** 2026-04-20
+**Status:** Approved for implementation
+
+## Problem statement
+
+The hyperguild supervisor currently spawns a `claude --print` subprocess for every skill call. The model routing config (`models.yaml`) exists but is dead weight — the model name is injected as text into the task prompt and ignored. Every skill call costs Claude tokens regardless of task complexity or data sensitivity.
+
+## Goal
+
+Route skill work to the most appropriate model — weighing cost, latency, and quality — with Claude acting as the real supervisor: verifying outputs and deciding when to escalate. Local models on owned hardware handle the common case; Claude escalates through a chain to frontier models only when local quality is insufficient.
+
+## Success criteria
+
+- [ ] Each skill dispatches generation to its configured local model via LiteLLM by default
+- [ ] Claude verifies every local output and either accepts or escalates
+- [ ] Escalation walks a per-skill chain (local small → local large → Sonnet → Opus) with one attempt per tier
+- [ ] Every attempt (model, tier, duration, warm state, verdict) is logged in the session JSONL
+- [ ] Cloud tiers (Sonnet/Opus) self-certify — no separate verifier call
+- [ ] Zero changes to skill handlers — they call `ExecutorFn` exactly as today
+- [ ] `LiteLLMBaseURL` already in config; no new env vars required beyond `LLAMA_SWAP_URL`
+
+## Constraints
+
+- One attempt per tier before escalating (no retry within a tier)
+- Anthropic T&C: Claude is called normally via Anthropic API; local models are called directly via LiteLLM HTTP — no API redirection
+- `models.yaml` remains the single routing config file
+
+## Out of scope
+
+- Auto-rerouting based on real-time warm state (logged, not acted on — Phase 4)
+- Multi-tenant / public service exposure
+- RAG/CAG model boosting
+- Managed Agent cloud delegation (chain stub only in Phase 3)
+
+---
+
+## Architecture
+
+```
+MCP tool call (Claude Code)
+        ↓
+Skill handler — calls ExecutorFn (unchanged)
+        ↓
+Orchestrator.Run (implements ExecutorFn)
+  ├─ Resolve chain from models.yaml
+  ├─ For each model in chain:
+  │    ├─ [ollama/*] → LiteLLM executor → generate
+  │    │        ↓
+  │    │    Claude verifier (task + output + discipline)
+  │    │      ├─ accept → return Result (log attempt)
+  │    │      └─ escalate → next tier (log attempt)
+  │    │
+  │    └─ [claude-*] → Claude executor (current) → generate + self-certify
+  │         └─ return Result (log attempt)
+  │
+  └─ All tiers exhausted → return error to skill handler
+```
+
+Claude is always the verifier for local tiers. At cloud tiers, Claude generates and self-certifies — the verifier call is skipped.
+
+---
+
+## Components
+
+### 1. `internal/exec/litellm.go` — LiteLLM executor
+
+Calls `POST /v1/chat/completions` on the configured LiteLLM server. Implements the same `ExecutorFn` signature as the existing claude executor.
+
+```go
+type LiteLLMExecutor struct {
+	BaseURL    string
+	APIKey     string
+	HTTPClient *http.Client
+	Timeout    time.Duration
+}
+
+func NewLiteLLM(baseURL, apiKey string, timeout time.Duration) *LiteLLMExecutor
+
+func (e *LiteLLMExecutor) Run(ctx context.Context, req Request) (Result, error)
+```
+
+Request mapping:
+- `req.SkillPrompt` → system message
+- `req.TaskPrompt` → user message
+- `req.Model` → `model` field in the chat completions request
+
+Response handling: local models are prompted (via the discipline file output contract) to return a JSON object matching the `Result` schema. The executor attempts `json.Unmarshal` into `Result` directly — no envelope unwrapping needed (unlike the `--output-format json` claude envelope). If unmarshalling fails, the executor returns an error that the orchestrator treats as an automatic escalation trigger.
+
+### 2. `internal/exec/verifier.go` — Claude verifier
+
+A focused Claude call that judges local model output. Uses the existing `Executor` (claude subprocess) internally.
+
+```go
+type Verdict struct {
+	Accept   bool   `json:"accept"`
+	Feedback string `json:"feedback"` // reason if not accepting; empty if accept
+}
+
+type Verifier struct {
+	executor *Executor // the existing claude executor
+}
+
+func NewVerifier(executor *Executor) *Verifier
+
+func (v *Verifier) Verify(ctx context.Context, skillPrompt, taskPrompt string, output Result) (Verdict, error)
+```
+
+The verifier prompt gives Claude:
+1. The skill discipline file (so it knows the iron laws and output contract)
+2. The original task prompt (informed verification — Claude sees what was asked)
+3. The generated output
+4. A short instruction: "Does this output satisfy the discipline's iron laws and output contract? Reply with JSON: `{"accept": true|false, "feedback": "..."}`"
+
+The verifier uses a lightweight JSON schema for its own output (a `Verdict` schema), keeping the call fast.
+
+### 3. `internal/exec/orchestrator.go` — chain walker
+
+Implements `ExecutorFn`. Walks the escalation chain, delegating generation and verification per tier.
+
+```go
+type Chain []ChainEntry
+
+type ChainEntry struct {
+	Model   string // e.g. "ollama/phi4", "claude-sonnet-4-5"
+	Tier    string // "local" | "subagent" | "managed"
+	IsCloud bool   // true for claude-* models; skips verifier
+}
+
+type Orchestrator struct {
+	chain        Chain
+	litellm      *LiteLLMExecutor
+	claude       *Executor
+	verifier     *Verifier
+	llamaSwapURL string // for warm-state probe
+}
+
+func NewOrchestrator(chain Chain, litellm *LiteLLMExecutor, claude *Executor, verifier *Verifier, llamaSwapURL string) *Orchestrator
+
+func (o *Orchestrator) Run(ctx context.Context, req Request) (Result, error)
+```
+
+Algorithm:
+```
+for each entry in chain:
+    warm = probe llama-swap (if local tier)
+    start = now()
+    if entry.IsCloud:
+        result, err = claude.Run(ctx, req with entry.Model)
+        log attempt(model, tier, now() - start, warm, verified=(err == nil))
+        if err == nil: return result
+    else:
+        result, err = litellm.Run(ctx, req with entry.Model)
+        duration = now() - start
+        if err != nil:
+            log attempt(model, tier, duration, warm, verified=false)
+            continue // automatic escalation on parse/network error
+        verdict, err = verifier.Verify(ctx, req.SkillPrompt, req.TaskPrompt, result) // on error: treat as Accept=false (escalate)
+        log attempt(model, tier, duration, warm, verified=verdict.Accept)
+        if verdict.Accept: return result
+        // inject verifier feedback into next tier's task prompt
+        req.TaskPrompt = req.TaskPrompt + "\n\nPrior attempt feedback: " + verdict.Feedback
+
+return error("all tiers exhausted")
+```
+
+### 4. `internal/config/models.go` — chain parser
+
+Replaces the current single-model resolution with chain parsing.
+
+Updated `models.yaml` format:
+
+```yaml
+verifier: claude-sonnet-4-6 # fixed verifier for all local tiers
+
+llama_swap_url: http://koala:8080 # for warm-state probing
+
+default_chain:
+  - ollama/qwen3-coder-30b-tuned
+  - claude-sonnet-4-5
+
+skills:
+  tdd:
+    chain:
+      - ollama/qwen3-coder-30b-tuned
+      - claude-sonnet-4-5
+  review:
+    chain:
+      - ollama/devstral-tuned
+      - ollama/gemma4
+      - claude-sonnet-4-5
+  debug:
+    chain:
+      - ollama/deepseek-r1-tuned
+      - claude-sonnet-4-5
+  spec:
+    chain:
+      - ollama/phi4
+      - ollama/gemma4
+      - claude-sonnet-4-5
+      - claude-opus-4-6
+  retrospective:
+    chain:
+      - ollama/qwen3-coder-30b-tuned
+      - claude-sonnet-4-5
+  trainer:
+    chain:
+      - ollama/qwen3-coder-30b-tuned
+      - claude-sonnet-4-5
+```
+
+The parser exposes:
+```go
+func (m *Models) ChainFor(skill string) Chain
+func (m *Models) Verifier() string
+func (m *Models) LlamaSwapURL() string
+```
+
+Caller override (`model` param in MCP tool call) pins the chain to a single entry — one model, no escalation. This preserves the existing override behaviour for power users.
+
+### 5. `internal/session/session.go` — updated `Attempt` struct
+
+```go
+type Attempt struct {
+	Attempt       int    `json:"attempt"`
+	Model         string `json:"model"`
+	Tier          string `json:"tier"` // local | subagent | managed
+	DurationMs    int64  `json:"duration_ms"`
+	WarmStart     bool   `json:"warm_start"` // model was already loaded in llama-swap
+	Verified      bool   `json:"verified"`
+	Verdict       string `json:"verdict,omitempty"`  // accept | escalate | error
+	Feedback      string `json:"feedback,omitempty"` // verifier feedback on escalation
+	OutputSummary string `json:"output_summary,omitempty"`
+	RunnerOutput  string `json:"runner_output,omitempty"`
+}
+```
+
+### 6. `cmd/supervisor/main.go` — one wiring change
+
+```go
+// Before:
+reg.Register(review.New(review.Config{ExecutorFn: executor.Run, ...}))
+
+// After:
+chain := models.ChainFor("review")
+orch := exec.NewOrchestrator(chain, litellmExec, claudeExec, verifier, models.LlamaSwapURL())
+reg.Register(review.New(review.Config{ExecutorFn: orch.Run, ...}))
+```
+
+One orchestrator per skill, sharing the same `litellmExec`, `claudeExec`, and `verifier` instances.
+
+---
+
+## Data flow example: `review` skill call
+
+1. Claude Code calls `review` tool with `files: ["internal/foo.go"]`
+2. Skill handler builds task prompt, calls `orch.Run`
+3. Orchestrator resolves chain: `[devstral, gemma4, sonnet]`
+4. Probes llama-swap: devstral is warm
+5. LiteLLM calls devstral → returns JSON result
+6. Verifier asks Claude: "does this review satisfy the iron laws?"
+7. Claude: `{"accept": false, "feedback": "missing line references for all findings"}`
+8. Orchestrator logs attempt #1 (devstral, local, 4200ms, warm, escalate)
+9. Injects feedback into task prompt, calls gemma4
+10. Verifier: `{"accept": true}`
+11. Orchestrator logs attempt #2 (gemma4, local, 6100ms, cold, accept)
+12. Returns result to skill handler → MCP response
+
+Session JSONL records both attempts. You can see: devstral was warm but produced weak output; gemma4 was cold but passed.
+
+---
+
+## Observability
+
+Session JSONL is the primary store. Each `Entry.Attempts` slice records the full escalation trail. To analyse across sessions:
+
+```bash
+# Which models are escalating most?
+jq -r '.attempts[] | select(.verdict == "escalate") | .model' brain/sessions/*.jsonl | sort | uniq -c
+
+# Average latency per model
+jq -r '.attempts[] | [.model, .duration_ms] | @tsv' brain/sessions/*.jsonl | awk '{sum[$1]+=$2; n[$1]++} END {for (m in sum) print m, sum[m]/n[m]}'
+
+# Cold start frequency
+jq -r '.attempts[] | select(.warm_start == false) | .model' brain/sessions/*.jsonl | sort | uniq -c
+```
+
+No new metrics infrastructure needed for Phase 3. Phase 4 can build a dashboard on top of this data.
+
+---
+
+## Error handling
+
+| Scenario | Behaviour |
+|----------|-----------|
+| LiteLLM unreachable | Log attempt as error, escalate immediately |
+| Local model returns unparseable JSON | Log attempt as error, escalate |
+| Verifier call fails | Log, treat as escalate (safe default) |
+| All tiers exhausted | Return error to skill handler; skill returns MCP error to caller |
+| Caller passes `model` override | Single-entry chain, no escalation, no verifier call |
+
+---
+
+## Testing approach
+
+- `TestLiteLLMExecutor`: mock HTTP server returning valid/invalid JSON; verify parse logic and error escalation
+- `TestVerifier`: fake claude executor returning accept/escalate verdicts; verify prompt construction
+- `TestOrchestrator`: table-driven — chains of 1/2/3 tiers, various accept/escalate/error combinations; verify attempt log contents and final result
+- `TestModelsChainFor`: YAML parsing for all skill overrides and default_chain fallback
+- Integration smoke test: start real LiteLLM (or mock), call `review` tool via MCP, verify attempt log written
+
+---
+
+## Risks
+
+| Risk | Mitigation |
+|------|------------|
+| Local models ignore output contract → bad JSON | Discipline files already specify JSON output contract; parse failure auto-escalates |
+| Verifier Claude call adds latency to every local attempt | Verifier prompt is small and fast; acceptable tradeoff for quality gate |
+| llama-swap warm probe adds overhead | Probe is a single lightweight HTTP GET; timeout at 200ms, treat failure as `warm_start: false` |
+| Chain exhaustion leaves caller with no result | Return structured error via MCP; caller can retry with explicit `model` override |
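
Reviewer sketch (not part of the patch): the chain walk described under Component 3 could look roughly like the Go below. It is a minimal sketch under stated assumptions. `GenerateFn`, `VerifyFn`, `Walk`, `ErrExhausted`, and `runDemo` are hypothetical names standing in for the LiteLLM executor, the Claude executor, and the verifier. `Result` is reduced to a single field because the spec does not show its schema. Warm-state probing, timing, and the session log writer are left out.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Result is a stand-in; the spec does not show the real Result fields.
type Result struct{ Output string }

// ChainEntry mirrors the spec's entry, minus the Tier label.
type ChainEntry struct {
	Model   string
	IsCloud bool // cloud tiers self-certify, so the verifier is skipped
}

// Attempt is a trimmed version of the session Attempt record.
type Attempt struct {
	Model    string
	Verdict  string // accept | escalate | error
	Feedback string
}

// GenerateFn and VerifyFn are hypothetical seams, not names from the spec.
type GenerateFn func(ctx context.Context, model, task string) (Result, error)
type VerifyFn func(ctx context.Context, task string, out Result) (accept bool, feedback string, err error)

// ErrExhausted is returned when no tier produced an accepted result.
var ErrExhausted = errors.New("all tiers exhausted")

// Walk tries each tier once, in order, and returns the first accepted result.
func Walk(ctx context.Context, chain []ChainEntry, task string, gen GenerateFn, verify VerifyFn) (Result, []Attempt, error) {
	var attempts []Attempt
	for _, e := range chain {
		res, err := gen(ctx, e.Model, task)
		if err != nil {
			// Parse or network failure: escalate without calling the verifier.
			attempts = append(attempts, Attempt{Model: e.Model, Verdict: "error"})
			continue
		}
		if e.IsCloud {
			attempts = append(attempts, Attempt{Model: e.Model, Verdict: "accept"})
			return res, attempts, nil
		}
		accept, feedback, verr := verify(ctx, task, res)
		if verr == nil && accept {
			attempts = append(attempts, Attempt{Model: e.Model, Verdict: "accept"})
			return res, attempts, nil
		}
		// A rejected output and a failed verifier call both escalate.
		attempts = append(attempts, Attempt{Model: e.Model, Verdict: "escalate", Feedback: feedback})
		if feedback != "" {
			task += "\n\nPrior attempt feedback: " + feedback
		}
	}
	return Result{}, attempts, ErrExhausted
}

// runDemo replays the review example: the first local model is rejected, the second accepted.
func runDemo() (Result, []Attempt, error) {
	chain := []ChainEntry{
		{Model: "ollama/devstral-tuned"},
		{Model: "ollama/gemma4"},
		{Model: "claude-sonnet-4-5", IsCloud: true},
	}
	gen := func(_ context.Context, model, _ string) (Result, error) {
		return Result{Output: "review from " + model}, nil
	}
	verify := func(_ context.Context, _ string, out Result) (bool, string, error) {
		if out.Output == "review from ollama/devstral-tuned" {
			return false, "missing line references for all findings", nil
		}
		return true, "", nil
	}
	return Walk(context.Background(), chain, "review internal/foo.go", gen, verify)
}

func main() {
	res, attempts, err := runDemo()
	fmt.Println(res.Output, err)
	for i, a := range attempts {
		fmt.Printf("attempt %d: %s %s %q\n", i+1, a.Model, a.Verdict, a.Feedback)
	}
}
```

Passing generation and verification in as plain functions keeps the walk testable with the table-driven cases listed under "Testing approach", without a LiteLLM server or a claude subprocess. The real `Orchestrator` holds concrete executors instead, so this shape is only one possible reading of the spec.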