323 lines
12 KiB
Markdown
323 lines
12 KiB
Markdown
# Model Orchestration Design
|
|
|
|
**Date:** 2026-04-20
|
|
**Status:** Approved for implementation
|
|
|
|
## Problem statement
|
|
|
|
The hyperguild supervisor currently spawns a `claude --print` subprocess for every skill call. The model routing config (`models.yaml`) exists but is dead weight — the model name is injected as text into the task prompt and ignored. Every skill call costs Claude tokens regardless of task complexity or data sensitivity.
|
|
|
|
## Goal
|
|
|
|
Route skill work to the most appropriate model — weighing cost, latency, and quality — with Claude acting as the real supervisor: verifying outputs and deciding when to escalate. Local models on owned hardware handle the common case; Claude escalates through a chain to frontier models only when local quality is insufficient.
|
|
|
|
## Success criteria
|
|
|
|
- [ ] Each skill dispatches generation to its configured local model via LiteLLM by default
|
|
- [ ] Claude verifies every local output and either accepts or escalates
|
|
- [ ] Escalation walks a per-skill chain (local small → local large → Sonnet → Opus) with one attempt per tier
|
|
- [ ] Every attempt (model, tier, duration, warm state, verdict) is logged in the session JSONL
|
|
- [ ] Cloud tiers (Sonnet/Opus) self-certify — no separate verifier call
|
|
- [ ] Zero changes to skill handlers — they call `ExecutorFn` exactly as today
|
|
- [ ] `LiteLTMBaseURL` already in config; no new env vars required beyond `LLAMA_SWAP_URL`
|
|
|
|
## Constraints
|
|
|
|
- One attempt per tier before escalating (no retry within a tier)
|
|
- Anthropic T&C: Claude is called normally via Anthropic API; local models are called directly via LiteLLM HTTP — no API redirection
|
|
- `models.yaml` remains the single routing config file
|
|
|
|
## Out of scope
|
|
|
|
- Auto-rerouting based on real-time warm state (logged, not acted on — Phase 4)
|
|
- Multi-tenant / public service exposure
|
|
- RAG/CAG model boosting
|
|
- Managed Agent cloud delegation (chain stub only in Phase 3)
|
|
|
|
---
|
|
|
|
## Architecture
|
|
|
|
```
|
|
MCP tool call (Claude Code)
|
|
↓
|
|
Skill handler — calls ExecutorFn (unchanged)
|
|
↓
|
|
Orchestrator.Run (implements ExecutorFn)
|
|
├─ Resolve chain from models.yaml
|
|
├─ For each model in chain:
|
|
│ ├─ [ollama/*] → LiteLLM executor → generate
|
|
│ │ ↓
|
|
│ │ Claude verifier (task + output + discipline)
|
|
│ │ ├─ accept → return Result (log attempt)
|
|
│ │ └─ escalate → next tier (log attempt)
|
|
│ │
|
|
│ └─ [claude-*] → Claude executor (current) → generate + self-certify
|
|
│ └─ return Result (log attempt)
|
|
│
|
|
└─ All tiers exhausted → return best attempt with escalation note
|
|
```
|
|
|
|
Claude is always the verifier for local tiers. At cloud tiers, Claude generates and self-certifies — the verifier call is skipped.
|
|
|
|
---
|
|
|
|
## Components
|
|
|
|
### 1. `internal/exec/litellm.go` — LiteLLM executor
|
|
|
|
Calls `POST /v1/chat/completions` on the configured LiteLLM server. Implements the same `ExecutorFn` signature as the existing claude executor.
|
|
|
|
```go
|
|
type LiteLLMExecutor struct {
|
|
BaseURL string
|
|
APIKey string
|
|
HTTPClient *http.Client
|
|
Timeout time.Duration
|
|
}
|
|
|
|
func NewLiteLLM(baseURL, apiKey string, timeout time.Duration) *LiteLLMExecutor
|
|
|
|
func (e *LiteLLMExecutor) Run(ctx context.Context, req Request) (Result, error)
|
|
```
|
|
|
|
Request mapping:
|
|
- `req.SkillPrompt` → system message
|
|
- `req.TaskPrompt` → user message
|
|
- `req.Model` → `model` field in the chat completions request
|
|
|
|
Response handling: local models are prompted (via the discipline file output contract) to return a JSON object matching the `Result` schema. The executor attempts `json.Unmarshal` into `Result` directly — no envelope unwrapping needed (unlike the `--output-format json` claude envelope). If unmarshalling fails, the executor returns an error that the orchestrator treats as an automatic escalation trigger.
|
|
|
|
### 2. `internal/exec/verifier.go` — Claude verifier
|
|
|
|
A focused Claude call that judges local model output. Uses the existing `Executor` (claude subprocess) internally.
|
|
|
|
```go
|
|
type Verdict struct {
|
|
Accept bool `json:"accept"`
|
|
Feedback string `json:"feedback"` // reason if not accepting; empty if accept
|
|
}
|
|
|
|
type Verifier struct {
|
|
executor *Executor // the existing claude executor
|
|
}
|
|
|
|
func NewVerifier(executor *Executor) *Verifier
|
|
|
|
func (v *Verifier) Verify(ctx context.Context, skillPrompt, taskPrompt string, output Result) (Verdict, error)
|
|
```
|
|
|
|
The verifier prompt gives Claude:
|
|
1. The skill discipline file (so it knows the iron laws and output contract)
|
|
2. The original task prompt (informed verification — Claude sees what was asked)
|
|
3. The generated output
|
|
4. A short instruction: "Does this output satisfy the discipline's iron laws and output contract? Reply with JSON: `{\"accept\": true|false, \"feedback\": \"...\"}`"
|
|
|
|
The verifier uses a lightweight JSON schema for its own output (a `Verdict` schema), keeping the call fast.
|
|
|
|
### 3. `internal/exec/orchestrator.go` — chain walker
|
|
|
|
Implements `ExecutorFn`. Walks the escalation chain, delegating generation and verification per tier.
|
|
|
|
```go
|
|
type Chain []ChainEntry
|
|
|
|
type ChainEntry struct {
|
|
Model string // e.g. "ollama/phi4", "claude-sonnet-4-5"
|
|
Tier string // "local" | "subagent" | "managed"
|
|
IsCloud bool // true for claude-* models; skips verifier
|
|
}
|
|
|
|
type Orchestrator struct {
|
|
chain Chain
|
|
litellm *LiteLLMExecutor
|
|
claude *Executor
|
|
verifier *Verifier
|
|
llamaSwapURL string // for warm-state probe
|
|
}
|
|
|
|
func NewOrchestrator(chain Chain, litellm *LiteLLMExecutor, claude *Executor, verifier *Verifier, llamaSwapURL string) *Orchestrator
|
|
|
|
func (o *Orchestrator) Run(ctx context.Context, req Request) (Result, error)
|
|
```
|
|
|
|
Algorithm:
|
|
```
|
|
for each entry in chain:
|
|
warm = probe llama-swap (if local tier)
|
|
start = now()
|
|
if entry.IsCloud:
|
|
result, err = claude.Run(ctx, req with entry.Model)
|
|
log attempt(model, tier, duration, warm, verified=true)
|
|
if err == nil: return result
|
|
else:
|
|
result, err = litellm.Run(ctx, req with entry.Model)
|
|
duration = now() - start
|
|
if err != nil:
|
|
log attempt(model, tier, duration, warm, verified=false)
|
|
continue // automatic escalation on parse/network error
|
|
verdict = verifier.Verify(ctx, req.SkillPrompt, req.TaskPrompt, result)
|
|
log attempt(model, tier, duration, warm, verified=verdict.Accept)
|
|
if verdict.Accept: return result
|
|
// inject verifier feedback into next tier's task prompt
|
|
req.TaskPrompt = req.TaskPrompt + "\n\nPrior attempt feedback: " + verdict.Feedback
|
|
|
|
return error("all tiers exhausted")
|
|
```
|
|
|
|
### 4. `internal/config/models.go` — chain parser
|
|
|
|
Replaces the current single-model resolution with chain parsing.
|
|
|
|
Updated `models.yaml` format:
|
|
|
|
```yaml
|
|
verifier: claude-sonnet-4-6 # fixed verifier for all local tiers
|
|
|
|
llama_swap_url: http://koala:8080 # for warm-state probing
|
|
|
|
default_chain:
|
|
- ollama/qwen3-coder-30b-tuned
|
|
- claude-sonnet-4-5
|
|
|
|
skills:
|
|
tdd:
|
|
chain:
|
|
- ollama/qwen3-coder-30b-tuned
|
|
- claude-sonnet-4-5
|
|
review:
|
|
chain:
|
|
- ollama/devstral-tuned
|
|
- ollama/gemma4
|
|
- claude-sonnet-4-5
|
|
debug:
|
|
chain:
|
|
- ollama/deepseek-r1-tuned
|
|
- claude-sonnet-4-5
|
|
spec:
|
|
chain:
|
|
- ollama/phi4
|
|
- ollama/gemma4
|
|
- claude-sonnet-4-5
|
|
- claude-opus-4-6
|
|
retrospective:
|
|
chain:
|
|
- ollama/qwen3-coder-30b-tuned
|
|
- claude-sonnet-4-5
|
|
trainer:
|
|
chain:
|
|
- ollama/qwen3-coder-30b-tuned
|
|
- claude-sonnet-4-5
|
|
```
|
|
|
|
The parser exposes:
|
|
```go
|
|
func (m *Models) ChainFor(skill string) Chain
|
|
func (m *Models) Verifier() string
|
|
func (m *Models) LlamaSwapURL() string
|
|
```
|
|
|
|
Caller override (`model` param in MCP tool call) pins the chain to a single entry — one model, no escalation. This preserves the existing override behaviour for power users.
|
|
|
|
### 5. `internal/session/session.go` — updated `Attempt` struct
|
|
|
|
```go
|
|
type Attempt struct {
|
|
Attempt int `json:"attempt"`
|
|
Model string `json:"model"`
|
|
Tier string `json:"tier"` // local | subagent | managed
|
|
DurationMs int64 `json:"duration_ms"`
|
|
WarmStart bool `json:"warm_start"` // model was already loaded in llama-swap
|
|
Verified bool `json:"verified"`
|
|
Verdict string `json:"verdict,omitempty"` // accept | escalate | error
|
|
Feedback string `json:"feedback,omitempty"` // verifier feedback on escalation
|
|
OutputSummary string `json:"output_summary,omitempty"`
|
|
RunnerOutput string `json:"runner_output,omitempty"`
|
|
}
|
|
```
|
|
|
|
### 6. `cmd/supervisor/main.go` — one wiring change
|
|
|
|
```go
|
|
// Before:
|
|
reg.Register(review.New(review.Config{ExecutorFn: executor.Run, ...}))
|
|
|
|
// After:
|
|
chain := models.ChainFor("review")
|
|
orch := exec.NewOrchestrator(chain, litellmExec, claudeExec, verifier, models.LlamaSwapURL())
|
|
reg.Register(review.New(review.Config{ExecutorFn: orch.Run, ...}))
|
|
```
|
|
|
|
One orchestrator per skill, sharing the same `litellmExec`, `claudeExec`, and `verifier` instances.
|
|
|
|
---
|
|
|
|
## Data flow example: `review` skill call
|
|
|
|
1. Claude Code calls `review` tool with `files: ["internal/foo.go"]`
|
|
2. Skill handler builds task prompt, calls `orch.Run`
|
|
3. Orchestrator resolves chain: `[devstral, gemma4, sonnet]`
|
|
4. Probes llama-swap: devstral is warm
|
|
5. LiteLLM calls devstral → returns JSON result
|
|
6. Verifier asks Claude: "does this review satisfy the iron laws?"
|
|
7. Claude: `{"accept": false, "feedback": "missing line references for all findings"}`
|
|
8. Orchestrator logs attempt #1 (devstral, local, 4200ms, warm, escalate)
|
|
9. Injects feedback into task prompt, calls gemma4
|
|
10. Verifier: `{"accept": true}`
|
|
11. Orchestrator logs attempt #2 (gemma4, local, 6100ms, cold, accept)
|
|
12. Returns result to skill handler → MCP response
|
|
|
|
Session JSONL records both attempts. You can see: devstral was warm but produced weak output; gemma4 was cold but passed.
|
|
|
|
---
|
|
|
|
## Observability
|
|
|
|
Session JSONL is the primary store. Each `Entry.Attempts` slice records the full escalation trail. To analyse across sessions:
|
|
|
|
```bash
|
|
# Which models are escalating most?
|
|
jq -r '.attempts[] | select(.verdict == "escalate") | .model' brain/sessions/*.jsonl | sort | uniq -c
|
|
|
|
# Average latency per model
|
|
jq -r '.attempts[] | [.model, .duration_ms] | @tsv' brain/sessions/*.jsonl | awk '{sum[$1]+=$2; n[$1]++} END {for (m in sum) print m, sum[m]/n[m]}'
|
|
|
|
# Cold start frequency
|
|
jq -r '.attempts[] | select(.warm_start == false) | .model' brain/sessions/*.jsonl | sort | uniq -c
|
|
```
|
|
|
|
No new metrics infrastructure needed for Phase 3. Phase 4 can build a dashboard on top of this data.
|
|
|
|
---
|
|
|
|
## Error handling
|
|
|
|
| Scenario | Behaviour |
|
|
|----------|-----------|
|
|
| LiteLLM unreachable | Log attempt as error, escalate immediately |
|
|
| Local model returns unparseable JSON | Log attempt as error, escalate |
|
|
| Verifier call fails | Log, treat as escalate (safe default) |
|
|
| All tiers exhausted | Return error to skill handler; skill returns MCP error to caller |
|
|
| Caller passes `model` override | Single-entry chain, no escalation, no verifier call |
|
|
|
|
---
|
|
|
|
## Testing approach
|
|
|
|
- `TestLiteLLMExecutor`: mock HTTP server returning valid/invalid JSON; verify parse logic and error escalation
|
|
- `TestVerifier`: fake claude executor returning accept/escalate verdicts; verify prompt construction
|
|
- `TestOrchestrator`: table-driven — chains of 1/2/3 tiers, various accept/escalate/error combinations; verify attempt log contents and final result
|
|
- `TestModelsChainFor`: YAML parsing for all skill overrides and default_chain fallback
|
|
- Integration smoke test: start real LiteLLM (or mock), call `review` tool via MCP, verify attempt log written
|
|
|
|
---
|
|
|
|
## Risks
|
|
|
|
| Risk | Mitigation |
|
|
|------|------------|
|
|
| Local models ignore output contract → bad JSON | Discipline files already specify JSON output contract; parse failure auto-escalates |
|
|
| Verifier Claude call adds latency to every local attempt | Verifier prompt is small and fast; acceptable tradeoff for quality gate |
|
|
| llama-swap warm probe adds overhead | Probe is a single lightweight HTTP GET; timeout at 200ms, treat failure as `warm_start: false` |
|
|
| Chain exhaustion leaves caller with no result | Return structured error via MCP; caller can retry with explicit `model` override |
|