# Multi-Model Routing for supervisor

Reference document for implementing multi-model access within the supervisor project.
Researched April 2026. Constraints: Claude Max subscription (ToS must be respected).

---

## Goal

Route tasks to specialized, cheaper, or local models during agent and skill flows, without violating Anthropic's terms or introducing unnecessary infrastructure risk.

---

## Hard Constraints

- Claude Max subscription is in use. Anthropic's April 2026 terms **prohibit using the subscription with third-party harnesses that spoof the Anthropic API surface**.
- `ANTHROPIC_BASE_URL` → LiteLLM workaround is explicitly out of scope.
- Claude must remain the reasoning engine. Other models are tools, not replacements.

---

## Infrastructure Available

| Machine | Role | Relevant services |
|---------|------|-------------------|
| koala | GPU inference | llama-swap, Ollama, Qdrant, LiteLLM proxy |
| iguana | Services, builds | k3s, general services |
| flamingo | Daily driver | Claude Code runs here |

LiteLLM proxy on koala exposes 100+ models (local + cloud) through a unified API. All machines are connected via Tailscale.

---

## Approved Patterns

### Pattern 1: Native Claude model tiering (zero build)

Claude Code subagents support per-agent model selection via frontmatter. Use this for cost routing within the Claude model family.

```yaml
# ~/.claude/agents/explorer.md
---
name: explorer
description: File reading, code search, codebase mapping - use for all exploration tasks
model: haiku
---
```

- `haiku` for exploration, summarization, classification
- `sonnet` (default) for main reasoning and implementation
- `opus` for deep analysis, architecture decisions

**When to use**: Always. Add `model: haiku` to any subagent that does read-heavy or classification work. Cheapest and fastest path to cost control.

---

### Pattern 2: MCP tools wrapping local models (primary build target)

Expose local models on koala as named MCP tools.
Claude remains the orchestrator and reasoning engine: it calls local models as tools the same way it calls any other tool. This is the intended MCP use case and carries zero ToS risk.

**Semantic contract**: Claude decides *when* to delegate based on the tool description. Write descriptions that tell Claude what the model is good for.

#### MCP server implementation

Small Python server, run on koala or flamingo, registered in Claude Code settings.

```python
# supervisor/scripts/mcp_local_models.py
# Requires the MCP Python SDK and requests: pip install mcp requests
import requests
from mcp.server.fastmcp import FastMCP

server = FastMCP("local-models")

LITELLM_BASE = "http://koala:4000"
OLLAMA_BASE = "http://koala:11434"  # currently unused; all calls go through LiteLLM
REQUEST_TIMEOUT = 120  # seconds; local generation can be slow on a cold model


def _litellm_chat(model: str, prompt: str) -> str:
    """Send a single-turn chat request to the LiteLLM proxy and return the reply text."""
    r = requests.post(
        f"{LITELLM_BASE}/v1/chat/completions",
        json={
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 2048,
        },
        timeout=REQUEST_TIMEOUT,
    )
    r.raise_for_status()
    return r.json()["choices"][0]["message"]["content"]


@server.tool()
def ask_local_llama(prompt: str) -> str:
    """Ask the local Llama model on koala. Use for: bulk summarization,
    first-pass analysis, classification, simple Q&A, anything that does not
    require deep reasoning or up-to-date knowledge. Faster and cheaper than
    cloud models for routine subtasks."""
    return _litellm_chat("llama3-local", prompt)


@server.tool()
def ask_coding_model(code: str, question: str) -> str:
    """Ask a code-specialized local model. Use for: syntax checking,
    boilerplate generation, code formatting questions, simple refactors
    where pattern-matching is sufficient."""
    return _litellm_chat("codellama-local", f"Code:\n{code}\n\nQuestion: {question}")


@server.tool()
def list_available_local_models() -> list[str]:
    """List all models currently available on the local LiteLLM proxy."""
    r = requests.get(f"{LITELLM_BASE}/v1/models", timeout=REQUEST_TIMEOUT)
    r.raise_for_status()
    return [m["id"] for m in r.json()["data"]]


if __name__ == "__main__":
    # Claude Code launches the server as a subprocess and talks to it over stdio.
    server.run(transport="stdio")
```

#### Register in Claude Code

Add to `~/.claude/settings.json` (or project-level `.claude/settings.json`):

```json
{
  "mcpServers": {
    "local-models": {
      "command": "python3",
      "args": ["/path/to/supervisor/scripts/mcp_local_models.py"]
    }
  }
}
```

Depending on the Claude Code version, MCP servers may instead be read from a project-level `.mcp.json` or registered with `claude mcp add`; run `claude mcp list` to confirm the server was picked up.

#### LiteLLM config additions needed on koala

```yaml
# litellm config.yaml: add model entries for local models
model_list:
  - model_name: llama3-local
    litellm_params:
      model: ollama/llama3.2
      api_base: http://localhost:11434
  - model_name: codellama-local
    litellm_params:
      model: ollama/codellama
      api_base: http://localhost:11434
```

---

### Pattern 3: External orchestration scripts (for pipeline workflows)

Use this for multi-model pipelines that don't need to live inside a Claude Code session. These scripts use their own API key, billed through the API and separate from the Max subscription, so they can call the Claude API and LiteLLM freely. Claude Code invokes them via the Bash tool.
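Since Claude Code reaches these scripts through a shell command, the command-line surface is effectively the whole interface. The sketch below shows one possible shape for such an entry point, not a settled design: it takes a file path, refuses to start without `ANTHROPIC_API_KEY`, prints the result to stdout, and reports failures on stderr with a non-zero exit code. The names `main` and `placeholder_pipeline`, and the specific exit codes, are hypothetical and chosen only for illustration.

```python
import argparse
import os
import sys
from typing import Callable, Mapping, Optional, Sequence


def placeholder_pipeline(text: str) -> str:
    """Hypothetical stand-in for the real multi-model pipeline."""
    return f"received {len(text)} characters"


def main(
    argv: Optional[Sequence[str]] = None,
    env: Optional[Mapping[str, str]] = None,
    pipeline: Callable[[str], str] = placeholder_pipeline,
) -> int:
    """Run the pipeline on one file and return a shell exit code."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(description="Run a pipeline on one document.")
    parser.add_argument("path", help="document to process")
    args = parser.parse_args(argv)

    # The pipeline bills against an API key, not the Max subscription,
    # so stop early if no key is present (exit code 2 is an arbitrary choice).
    if not env.get("ANTHROPIC_API_KEY"):
        print("ANTHROPIC_API_KEY is not set", file=sys.stderr)
        return 2

    try:
        with open(args.path, encoding="utf-8") as f:
            text = f.read()
    except OSError as exc:
        print(f"cannot read {args.path}: {exc}", file=sys.stderr)
        return 1

    # The result goes to stdout so the calling shell command can capture it.
    print(pipeline(text))
    return 0
```

To use it as a script, end the file with `sys.exit(main())` under the usual `if __name__ == "__main__":` guard and pass the real pipeline function in place of the placeholder.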
```
Claude Code → [Bash tool] → ./scripts/orchestrate.py → {Claude API, LiteLLM, local models}
```

```python
# supervisor/scripts/orchestrate.py
import anthropic
import requests

LITELLM_URL = "http://koala:4000/v1/chat/completions"

# Reads ANTHROPIC_API_KEY from the environment; separate from the Max subscription.
claude = anthropic.Anthropic()


def analyze_document(path: str) -> str:
    with open(path, encoding="utf-8") as f:
        content = f.read()

    # Step 1: local Llama extracts structure (fast, cheap)
    r = requests.post(
        LITELLM_URL,
        json={
            "model": "llama3-local",
            "messages": [{"role": "user", "content": f"Extract key sections from:\n{content}"}],
        },
        timeout=120,
    )
    r.raise_for_status()  # surface HTTP errors instead of a KeyError on an error body
    structure = r.json()["choices"][0]["message"]["content"]

    # Step 2: Claude synthesizes and reasons over it
    synthesis = claude.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        messages=[{"role": "user", "content": f"Synthesize these findings:\n{structure}"}],
    )
    return synthesis.content[0].text
```

**When to use**: Batch processing, automated pipelines, workflows triggered by cron or external events. Not for interactive Claude Code sessions.

---

## What to Skip

| Approach | Why skip |
|----------|----------|
| `ANTHROPIC_BASE_URL` → LiteLLM | ToS violation with Max subscription (April 2026 terms) |
| Third-party harnesses (OpenClaw etc.) | Explicitly banned for subscription users |
| A2A in Claude Code | Not implemented by Anthropic yet; revisit late 2026 |
| OpenAI agent handoffs | Loses execution context, not worth the complexity |

---

## Protocol Landscape (for awareness, not immediate action)

- **MCP**: in production, 97M monthly downloads, the primary tool-access protocol for this project. LiteLLM natively supports it as both MCP gateway and MCP client as of v1.60+.
- **A2A v1.0**: Google/Linux Foundation, 150+ orgs in production, but Anthropic has not shipped it in Claude Code. The intent is agent-to-agent peer delegation (vs MCP's agent-to-tool). Worth watching for H2 2026.
- **AGNTCY**: Cisco/Linux Foundation, discovery and identity layer beneath MCP+A2A.
  Potentially relevant for multi-machine routing across koala/iguana/flamingo once mature.

---

## Build Priority

| Step | Effort | Value | When |
|------|--------|-------|------|
| Add `model: haiku` to explorer subagents | 10 min | Immediate cost saving | Now |
| Write MCP server for local models | 2–3h | Local model access in sessions | Soon |
| Register MCP server in Claude Code settings | 15 min | Activates pattern 2 | With above |
| Write orchestration script template | 1–2h | Pipeline workflows | When needed |

---

## References

- LiteLLM MCP docs: https://docs.litellm.ai/docs/mcp
- Community MCP wrapper for LiteLLM: https://github.com/itsDarianNgo/mcp-server-litellm
- Ollama MCP server: https://github.com/rawveg/ollama-mcp
- A2A protocol status: https://www.linuxfoundation.org/press/a2a-protocol-surpasses-150-organizations-lands-in-major-cloud-platforms-and-sees-enterprise-production-use-in-first-year
- AGNTCY: https://github.com/agntcy