hyperguild/docs/multi-model-routing.md
Mathias Bergqvist c9310b1079
2026-04-22 19:23:07 +02:00


Multi-Model Routing for supervisor

Reference document for implementing multi-model access within the supervisor project. Researched April 2026. Constraints: Claude Max subscription (ToS must be respected).


Goal

Route tasks to specialized, cheaper, or local models during agent and skill flows — without violating Anthropic's terms or introducing unnecessary infrastructure risk.


Hard Constraints

  • Claude Max subscription is in use. Anthropic's April 2026 terms prohibit using the subscription with third-party harnesses that spoof the Anthropic API surface.
  • ANTHROPIC_BASE_URL → LiteLLM workaround is explicitly out of scope.
  • Claude must remain the reasoning engine. Other models are tools, not replacements.
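The second constraint can be checked mechanically before a session starts. The following is a minimal sketch, assuming the out-of-scope workaround would show up as `ANTHROPIC_BASE_URL` pointing at any host other than `api.anthropic.com`; the function name `base_url_problems` is hypothetical.

```python
import os
from urllib.parse import urlparse

# Assumption: a redirect is recognizable by ANTHROPIC_BASE_URL naming a host
# other than Anthropic's own API host.
ANTHROPIC_HOST = "api.anthropic.com"


def base_url_problems(env: dict) -> list:
    """Return human-readable problems found in the given environment mapping."""
    base_url = env.get("ANTHROPIC_BASE_URL", "").strip()
    if not base_url:
        return []  # variable unset: nothing is being redirected
    if urlparse(base_url).hostname == ANTHROPIC_HOST:
        return []  # explicitly set, but still the official endpoint
    return [f"ANTHROPIC_BASE_URL is set to {base_url!r}; unset it before starting Claude Code"]


if __name__ == "__main__":
    for problem in base_url_problems(dict(os.environ)):
        print(f"WARNING: {problem}")
```

Run from a shell profile or wrapper script on flamingo, this prints a warning whenever the variable is set to a non-Anthropic host.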

Infrastructure Available

| Machine | Role | Relevant services |
|---|---|---|
| koala | GPU inference | llama-swap, Ollama, Qdrant, LiteLLM proxy |
| iguana | Services, builds | k3s, general services |
| flamingo | Daily driver | Claude Code runs here |

The LiteLLM proxy on koala exposes 100+ models (local + cloud) through a unified API. All machines are connected via Tailscale.


Approved Patterns

Pattern 1 — Native Claude model tiering (zero build)

Claude Code subagents support per-agent model selection via frontmatter. Use this for cost routing within the Claude model family.

```markdown
# ~/.claude/agents/explorer.md
---
name: explorer
description: File reading, code search, codebase mapping — use for all exploration tasks
model: haiku
---
```

  • haiku for exploration, summarization, classification
  • sonnet (default) for main reasoning and implementation
  • opus for deep analysis, architecture decisions

When to use: Always. Add model: haiku to any subagent that does read-heavy or classification work. This is the cheapest and fastest path to cost control.
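The tiering rule can be written down as a small lookup, which is handy when generating several subagent files at once. This is a sketch only: the task labels in `TIER_FOR_TASK` paraphrase the list above, and `pick_model` and `render_subagent` are hypothetical helpers, not part of Claude Code.

```python
# Tiers as listed above; anything unlisted falls back to the default tier.
TIER_FOR_TASK = {
    "exploration": "haiku",
    "summarization": "haiku",
    "classification": "haiku",
    "reasoning": "sonnet",
    "implementation": "sonnet",
    "deep-analysis": "opus",
    "architecture": "opus",
}
DEFAULT_TIER = "sonnet"


def pick_model(task_kind: str) -> str:
    """Map a task label to a Claude model tier."""
    return TIER_FOR_TASK.get(task_kind, DEFAULT_TIER)


def render_subagent(name: str, description: str, task_kind: str) -> str:
    """Build the frontmatter for a subagent file.

    The description is written unquoted, so it must fit on one line.
    """
    if "\n" in description:
        raise ValueError("description must be a single line")
    return (
        "---\n"
        f"name: {name}\n"
        f"description: {description}\n"
        f"model: {pick_model(task_kind)}\n"
        "---\n"
    )
```

Writing the returned string to `~/.claude/agents/<name>.md` gives a file with the same shape as the explorer example.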


Pattern 2 — MCP tools wrapping local models (primary build target)

Expose local models on koala as named MCP tools. Claude remains the orchestrator and reasoning engine — it calls local models as tools the same way it calls any other tool.

This is the intended MCP use case and carries zero ToS risk.

Semantic contract: Claude decides when to delegate based on the tool description. Write descriptions that tell Claude what the model is good for.

MCP server implementation

A small Python server, run on koala or flamingo and registered with Claude Code.

```python
# supervisor/scripts/mcp_local_models.py
# Requires the MCP Python SDK (pip install mcp) and requests.
import requests
from mcp.server.fastmcp import FastMCP

server = FastMCP("local-models")

LITELLM_BASE = "http://koala:4000"
OLLAMA_BASE = "http://koala:11434"  # direct Ollama endpoint; unused while all calls go through LiteLLM
REQUEST_TIMEOUT = 120  # seconds; a cold local model can take a while to load


def _litellm_chat(model: str, prompt: str) -> str:
    """Send a single-turn prompt to the LiteLLM proxy and return the reply text."""
    r = requests.post(
        f"{LITELLM_BASE}/v1/chat/completions",
        json={
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 2048,
        },
        timeout=REQUEST_TIMEOUT,
    )
    r.raise_for_status()
    return r.json()["choices"][0]["message"]["content"]


@server.tool()
def ask_local_llama(prompt: str) -> str:
    """Ask the local Llama model on koala.
    Use for: bulk summarization, first-pass analysis, classification, simple Q&A,
    anything that does not require deep reasoning or up-to-date knowledge.
    Faster and cheaper than cloud models for routine subtasks."""
    return _litellm_chat("llama3-local", prompt)


@server.tool()
def ask_coding_model(code: str, question: str) -> str:
    """Ask a code-specialized local model.
    Use for: syntax checking, boilerplate generation, code formatting questions,
    simple refactors where pattern-matching is sufficient."""
    return _litellm_chat("codellama-local", f"Code:\n{code}\n\nQuestion: {question}")


@server.tool()
def list_available_local_models() -> list[str]:
    """List all models currently available on the local LiteLLM proxy."""
    r = requests.get(f"{LITELLM_BASE}/v1/models", timeout=REQUEST_TIMEOUT)
    r.raise_for_status()
    return [m["id"] for m in r.json()["data"]]


if __name__ == "__main__":
    # Claude Code launches the server as a subprocess and talks to it over stdio.
    server.run(transport="stdio")
```
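If koala is offline or a model fails to load, the tool functions in the server raise, and Claude sees only a generic tool error. One option is to convert failures into a short message that tells Claude to handle the subtask itself. The sketch below assumes that behavior is wanted; `safe_tool_call` is a hypothetical helper. It catches `OSError` because `requests.RequestException` derives from it.

```python
from typing import Callable


def safe_tool_call(call: Callable[[], str], label: str) -> str:
    """Run a model call and turn failures into a readable string for Claude."""
    try:
        return call()
    except OSError as exc:
        # Covers requests.RequestException (an OSError subclass) as well as
        # the built-in ConnectionError and TimeoutError.
        return f"[{label} unavailable: {type(exc).__name__}] Handle this subtask directly."
    except (KeyError, IndexError, ValueError) as exc:
        # The proxy answered, but not with the expected chat-completion shape.
        return f"[{label} returned an unexpected response: {exc!r}]"
```

Inside a tool this would wrap the existing call, for example `safe_tool_call(lambda: _litellm_chat("llama3-local", prompt), "llama3-local")`.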

Register in Claude Code

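Add to .mcp.json at the project root (project scope), or register the server at user scope with claude mcp add. Claude Code reads MCP server definitions from these locations, not from settings.json: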

```json
{
  "mcpServers": {
    "local-models": {
      "command": "python3",
      "args": ["/path/to/supervisor/scripts/mcp_local_models.py"]
    }
  }
}
```
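When the config file already lists other MCP servers, the new entry has to be merged in rather than pasted over them. A minimal sketch of that merge follows, working on the parsed JSON so the choice of file stays with the caller; `with_mcp_server` is a hypothetical helper.

```python
from copy import deepcopy


def with_mcp_server(config: dict, name: str, command: str, args: list) -> dict:
    """Return a copy of config with one MCP server entry added or replaced."""
    merged = deepcopy(config)  # leave the caller's dict untouched
    servers = merged.setdefault("mcpServers", {})
    servers[name] = {"command": command, "args": list(args)}
    return merged
```

Load the existing file with `json.load`, pass the result through this function, and write it back with `json.dump(..., indent=2)`.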

LiteLLM config additions needed on koala

```yaml
# litellm config.yaml — add model entries for local models
model_list:
  - model_name: llama3-local
    litellm_params:
      model: ollama/llama3.2
      api_base: http://localhost:11434

  - model_name: codellama-local
    litellm_params:
      model: ollama/codellama
      api_base: http://localhost:11434
```
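After reloading the proxy, both aliases should appear in its `/v1/models` listing. The check below is a sketch that takes the parsed response body (the same `{"data": [{"id": ...}]}` shape that `list_available_local_models` reads) and reports which aliases are absent; `missing_aliases` is a hypothetical helper.

```python
REQUIRED_ALIASES = ("llama3-local", "codellama-local")


def missing_aliases(models_response: dict, required=REQUIRED_ALIASES) -> list:
    """Return the required model aliases that the proxy does not list."""
    available = {entry["id"] for entry in models_response.get("data", [])}
    return [alias for alias in required if alias not in available]
```

An empty list means the MCP tools from Pattern 2 can resolve both model names.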

Pattern 3 — External orchestration scripts (for pipeline workflows)

Use this for multi-model pipelines that don't need to live inside a Claude Code session. These scripts authenticate with their own API key, which is billed through the API and is separate from the Max subscription, so they can call the Claude API and LiteLLM freely.

Claude Code invokes them via the Bash tool.

Claude Code → [Bash tool] → ./scripts/orchestrate.py → {Claude API, LiteLLM, local models}

```python
# supervisor/scripts/orchestrate.py
import anthropic
import requests

LITELLM_BASE = "http://koala:4000"

claude = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY, separate from the Max subscription


def analyze_document(path: str) -> str:
    with open(path, encoding="utf-8") as f:
        content = f.read()

    # Step 1: local Llama extracts structure (fast, cheap)
    r = requests.post(
        f"{LITELLM_BASE}/v1/chat/completions",
        json={
            "model": "llama3-local",
            "messages": [{"role": "user", "content": f"Extract key sections from:\n{content}"}],
        },
        timeout=120,
    )
    r.raise_for_status()  # surface proxy errors instead of a KeyError on the error body
    structure = r.json()["choices"][0]["message"]["content"]

    # Step 2: Claude synthesizes and reasons over it
    synthesis = claude.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        messages=[{"role": "user", "content": f"Synthesize these findings:\n{structure}"}],
    )
    return synthesis.content[0].text
```

When to use: Batch processing, automated pipelines, workflows triggered by cron or external events. Not for interactive Claude Code sessions.
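For the Bash tool or a cron job to call the script, it needs a command-line entry point with a meaningful exit code. The sketch below shows one possible shape. The `main` function and its `analyze` parameter are hypothetical; the analyzer is passed in so the wrapper can be exercised without network access.

```python
import argparse
import sys
from typing import Callable, Sequence


def main(argv: Sequence[str], analyze: Callable[[str], str]) -> int:
    """Analyze each path given on the command line; return a process exit code."""
    parser = argparse.ArgumentParser(description="Run the document pipeline on one or more files.")
    parser.add_argument("paths", nargs="+", help="documents to analyze")
    args = parser.parse_args(argv)

    exit_code = 0
    for path in args.paths:
        try:
            result = analyze(path)
        except Exception as exc:  # keep going so one bad file does not stop a batch
            print(f"{path}: failed ({exc})", file=sys.stderr)
            exit_code = 1
            continue
        print(f"## {path}\n{result}")
    return exit_code
```

In orchestrate.py this could be wired up as `sys.exit(main(sys.argv[1:], analyze_document))` under an `if __name__ == "__main__":` guard, so a non-zero exit tells the caller that at least one file failed.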


What to Skip

| Approach | Why skip |
|---|---|
| ANTHROPIC_BASE_URL → LiteLLM | ToS violation with Max subscription (April 2026 terms) |
| Third-party harnesses (OpenClaw etc.) | Explicitly banned for subscription users |
| A2A in Claude Code | Not implemented by Anthropic yet; revisit late 2026 |
| OpenAI agent handoffs | Loses execution context, not worth the complexity |

Protocol Landscape (for awareness, not immediate action)

  • MCP — production, 97M monthly downloads, your primary tool-access protocol. LiteLLM natively supports it as both MCP gateway and MCP client as of v1.60+.
  • A2A v1.0 — Google/Linux Foundation, 150+ orgs in production, but Anthropic has not shipped it in Claude Code. The intent is agent-to-agent peer delegation (vs MCP's agent-to-tool). Worth watching for H2 2026.
  • AGNTCY — Cisco/Linux Foundation, discovery and identity layer beneath MCP+A2A. Potentially relevant for multi-machine routing across koala/iguana/flamingo once mature.

Build Priority

| Step | Effort | Value | When |
|---|---|---|---|
| Add model: haiku to explorer subagents | 10 min | Immediate cost saving | Now |
| Write MCP server for local models | 2-3 h | Local model access in sessions | Soon |
| Register MCP server in Claude Code settings | 15 min | Activates pattern 2 | With above |
| Write orchestration script template | 1-2 h | Pipeline workflows | When needed |

References