Multi-Model Routing for supervisor

Reference document for implementing multi-model access within the supervisor project. Researched April 2026. Constraints: Claude Max subscription (ToS must be respected).

Goal

Route tasks to specialized, cheaper, or local models during agent and skill flows — without violating Anthropic's terms or introducing unnecessary infrastructure risk.

Hard Constraints

Claude Max subscription is in use. Anthropic's April 2026 terms prohibit using the subscription with third-party harnesses that spoof the Anthropic API surface.
ANTHROPIC_BASE_URL → LiteLLM workaround is explicitly out of scope.
Claude must remain the reasoning engine. Other models are tools, not replacements.
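
The second constraint can be checked mechanically before a session starts. The following is a minimal sketch, assuming the only signal worth checking is the ANTHROPIC_BASE_URL environment variable; the function name base_url_override is hypothetical.

```python
from typing import Mapping, Optional


def base_url_override(env: Mapping[str, str]) -> Optional[str]:
    """Return the ANTHROPIC_BASE_URL value if it is set and non-empty, else None.

    Hypothetical helper: a non-None result means the session would be
    pointed at something other than the default Anthropic endpoint.
    """
    value = env.get("ANTHROPIC_BASE_URL", "").strip()
    return value or None
```

Called as base_url_override(os.environ) from a shell profile or wrapper script, a non-None result is a cue to unset the variable before launching Claude Code.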

Infrastructure Available

Machine	Role	Relevant services
koala	GPU inference	llama-swap, Ollama, Qdrant, LiteLLM proxy
iguana	Services, builds	k3s, general services
flamingo	Daily driver	Claude Code runs here

The LiteLLM proxy on koala exposes 100+ models (local and cloud) through a unified API. All machines are connected via Tailscale.

Approved Patterns

Pattern 1 — Native Claude model tiering (zero build)

Claude Code subagents support per-agent model selection via frontmatter. Use this for cost routing within the Claude model family.

# ~/.claude/agents/explorer.md
---
name: explorer
description: File reading, code search, codebase mapping — use for all exploration tasks
model: haiku
---

haiku for exploration, summarization, classification
sonnet (default) for main reasoning and implementation
opus for deep analysis, architecture decisions

When to use: Always. Add model: haiku to any subagent that does read-heavy or classification work. Cheapest and fastest path to cost control.
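
To see which subagents still lack a model line, a small audit script can help. This is a sketch under the assumption that every subagent file starts with a ----delimited frontmatter block of simple key: value lines, as in the explorer example; frontmatter_model and audit_agents are hypothetical names, and the parser is deliberately not a full YAML parser.

```python
from pathlib import Path
from typing import Dict, Optional


def frontmatter_model(text: str) -> Optional[str]:
    """Return the `model:` value from a leading frontmatter block, or None."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return None  # no frontmatter at the top of the file
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of frontmatter; ignore anything in the body
        key, sep, value = line.partition(":")
        if sep and key.strip() == "model":
            return value.strip() or None
    return None


def audit_agents(agents_dir: Path) -> Dict[str, Optional[str]]:
    """Map each *.md file in agents_dir to its declared model (None if absent)."""
    return {
        p.name: frontmatter_model(p.read_text(encoding="utf-8"))
        for p in sorted(agents_dir.glob("*.md"))
    }
```

Running audit_agents(Path.home() / ".claude" / "agents") lists every file whose value is None, which are the candidates for an explicit model line.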

Pattern 2 — MCP tools wrapping local models (primary build target)

Expose local models on koala as named MCP tools. Claude remains the orchestrator and reasoning engine — it calls local models as tools the same way it calls any other tool.

This is the intended MCP use case and carries zero ToS risk.

Semantic contract: Claude decides when to delegate based on the tool description. Write descriptions that tell Claude what the model is good for.

MCP server implementation

Small Python server, run on koala or flamingo, registered in Claude Code settings.

# supervisor/scripts/mcp_local_models.py
# Requires the official MCP Python SDK (pip install mcp) and requests.
from mcp.server.fastmcp import FastMCP
import requests

server = FastMCP("local-models")

LITELLM_BASE = "http://koala:4000"
OLLAMA_BASE  = "http://koala:11434"  # direct Ollama endpoint; not used by the tools below
REQUEST_TIMEOUT = 120  # seconds; local inference can be slow when a model is cold

def _litellm_chat(model: str, prompt: str) -> str:
    """Send a single-turn chat request to the LiteLLM proxy and return the reply text."""
    r = requests.post(f"{LITELLM_BASE}/v1/chat/completions", json={
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 2048,
    }, timeout=REQUEST_TIMEOUT)
    r.raise_for_status()
    return r.json()["choices"][0]["message"]["content"]


# The docstring of each tool is what Claude sees as the tool description.
@server.tool()
def ask_local_llama(prompt: str) -> str:
    """Ask the local Llama model on koala.
    Use for: bulk summarization, first-pass analysis, classification, simple Q&A,
    anything that does not require deep reasoning or up-to-date knowledge.
    Faster and cheaper than cloud models for routine subtasks."""
    return _litellm_chat("llama3-local", prompt)


@server.tool()
def ask_coding_model(code: str, question: str) -> str:
    """Ask a code-specialized local model.
    Use for: syntax checking, boilerplate generation, code formatting questions,
    simple refactors where pattern-matching is sufficient."""
    return _litellm_chat("codellama-local", f"Code:\n{code}\n\nQuestion: {question}")


@server.tool()
def list_available_local_models() -> list[str]:
    """List all models currently available on the local LiteLLM proxy."""
    r = requests.get(f"{LITELLM_BASE}/v1/models", timeout=REQUEST_TIMEOUT)
    r.raise_for_status()
    return [m["id"] for m in r.json()["data"]]


if __name__ == "__main__":
    # Claude Code launches the server as a subprocess and talks to it over stdio.
    server.run(transport="stdio")
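
The tools above let any network failure propagate as an exception. One option is to hand Claude a short, readable error string instead, so it can fall back to doing the work itself. The following is a sketch of that idea, not part of the server as written: safe_chat and its post_json parameter are hypothetical, and it uses urllib from the standard library rather than requests only to stay dependency-free.

```python
import json
import urllib.request
from typing import Callable

LITELLM_BASE = "http://koala:4000"  # same proxy address as the server script


def _post_json(url: str, payload: dict, timeout: float) -> dict:
    """POST a JSON payload and return the decoded JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def safe_chat(model: str, prompt: str,
              post_json: Callable[[str, dict, float], dict] = _post_json,
              timeout: float = 120.0) -> str:
    """Return the completion text, or a bracketed error message on failure."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 2048,
    }
    try:
        body = post_json(f"{LITELLM_BASE}/v1/chat/completions", payload, timeout)
        return body["choices"][0]["message"]["content"]
    except OSError as exc:
        # URLError, HTTPError, and socket timeouts are all OSError subclasses.
        return f"[local model unavailable: {exc}]"
    except (KeyError, IndexError, TypeError, ValueError) as exc:
        # The proxy answered, but not with the expected chat-completion shape.
        return f"[unexpected proxy response: {exc!r}]"
```

The injectable post_json argument exists so the error paths can be exercised without koala being reachable.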

Register in Claude Code

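Add to the project-level .mcp.json, or register the server at user scope with claude mcp add (which writes to ~/.claude.json). Claude Code has read MCP server definitions from these locations rather than from settings.json, so check the current docs if the server does not show up: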

{
  "mcpServers": {
    "local-models": {
      "command": "python3",
      "args": ["/path/to/supervisor/scripts/mcp_local_models.py"]
    }
  }
}
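
If the config file already contains other servers, hand-editing risks clobbering them. A minimal sketch of a merge helper follows, assuming the file is plain JSON with a top-level mcpServers object as shown above; add_mcp_server and update_config_file are hypothetical names, and the file path is left to the caller.

```python
import json
from pathlib import Path
from typing import List


def add_mcp_server(config: dict, name: str, command: str, args: List[str]) -> dict:
    """Return a copy of config with mcpServers[name] set; other keys are kept."""
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))
    servers[name] = {"command": command, "args": list(args)}
    merged["mcpServers"] = servers
    return merged


def update_config_file(path: Path, name: str, command: str, args: List[str]) -> None:
    """Read the JSON config at path (if it exists), merge the entry, write it back."""
    existing = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    merged = add_mcp_server(existing, name, command, args)
    path.write_text(json.dumps(merged, indent=2) + "\n", encoding="utf-8")
```

The merge works on copies, so the input dictionary is left unmodified and existing server entries survive.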

LiteLLM config additions needed on koala

# litellm config.yaml — add model entries for local models
model_list:
  - model_name: llama3-local
    litellm_params:
      model: ollama/llama3.2
      api_base: http://localhost:11434

  - model_name: codellama-local
    litellm_params:
      model: ollama/codellama
      api_base: http://localhost:11434
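
After reloading the proxy, it is worth confirming that both aliases are actually served. The sketch below assumes the proxy's /v1/models endpoint returns an OpenAI-style payload of the form {"data": [{"id": ...}]}, the same shape list_available_local_models relies on; missing_aliases and fetch_models are hypothetical helpers.

```python
import json
import urllib.request
from typing import List

LITELLM_BASE = "http://koala:4000"
REQUIRED_ALIASES = ["llama3-local", "codellama-local"]


def missing_aliases(required: List[str], models_payload: dict) -> List[str]:
    """Return the required model names that the proxy does not list."""
    available = {m["id"] for m in models_payload.get("data", [])}
    return [name for name in required if name not in available]


def fetch_models(base_url: str = LITELLM_BASE, timeout: float = 10.0) -> dict:
    """GET /v1/models from the proxy and return the decoded JSON payload."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

missing_aliases(REQUIRED_ALIASES, fetch_models()) returns an empty list when the config took effect. If the proxy enforces a master key, the request would also need an Authorization header, which this sketch omits.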

Pattern 3 — External orchestration scripts (for pipeline workflows)

Use this for multi-model pipelines that don't need to live inside a Claude Code session. These scripts authenticate with their own Anthropic API key, which is billed per use and separate from the Max subscription, so they can call the Claude API and LiteLLM freely.

Claude Code invokes them via the Bash tool.

Claude Code → [Bash tool] → ./scripts/orchestrate.py → {Claude API, LiteLLM, local models}

# supervisor/scripts/orchestrate.py
import anthropic
import requests

claude = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY (API billing, separate from the Max subscription)

def analyze_document(path: str) -> str:
    with open(path, encoding="utf-8") as f:
        content = f.read()

    # Step 1: local Llama extracts structure (fast, cheap)
    r = requests.post("http://koala:4000/v1/chat/completions", json={
        "model": "llama3-local",
        "messages": [{"role": "user", "content": f"Extract key sections from:\n{content}"}],
    }, timeout=120)
    r.raise_for_status()  # fail here rather than with a KeyError on an error payload
    structure = r.json()["choices"][0]["message"]["content"]

    # Step 2: Claude synthesizes and reasons over it
    synthesis = claude.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        messages=[{"role": "user", "content": f"Synthesize these findings:\n{structure}"}]
    )
    return synthesis.content[0].text

When to use: Batch processing, automated pipelines, workflows triggered by cron or external events. Not for interactive Claude Code sessions.
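
Because Claude Code reaches these scripts through the Bash tool, they need a command-line entry point, and the two-stage flow is easier to test when the model calls are passed in as functions. The following is a sketch under those assumptions; run_pipeline, build_parser, and the --local-model flag are hypothetical and not part of orchestrate.py as written.

```python
import argparse
from typing import Callable


def run_pipeline(content: str,
                 extract: Callable[[str], str],
                 synthesize: Callable[[str], str]) -> str:
    """Run the two stages: a cheap extraction pass, then a synthesis pass.

    `extract` and `synthesize` each take a prompt and return model text,
    for example thin wrappers around the LiteLLM proxy and the Claude API.
    """
    structure = extract(f"Extract key sections from:\n{content}")
    return synthesize(f"Synthesize these findings:\n{structure}")


def build_parser() -> argparse.ArgumentParser:
    """Command-line interface for invoking the pipeline from a shell."""
    parser = argparse.ArgumentParser(description="Two-stage document analysis")
    parser.add_argument("path", help="document to analyze")
    parser.add_argument("--local-model", default="llama3-local",
                        help="LiteLLM alias used for the extraction stage")
    return parser
```

With this split, a cron job or the Bash tool only needs the script path and a document path, and the stages can be swapped or stubbed without touching the flow.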

What to Skip

Approach	Why skip
`ANTHROPIC_BASE_URL` → LiteLLM	ToS violation with Max subscription (April 2026 terms)
Third-party harnesses (OpenClaw etc.)	Explicitly banned for subscription users
A2A in Claude Code	Not implemented by Anthropic yet — revisit late 2026
OpenAI agent handoffs	Loses execution context, not worth the complexity

Protocol Landscape (for awareness, not immediate action)

MCP: in production, 97M monthly downloads, and the primary tool-access protocol for this setup. LiteLLM natively supports it as both MCP gateway and MCP client as of v1.60+.
A2A v1.0 — Google/Linux Foundation, 150+ orgs in production, but Anthropic has not shipped it in Claude Code. The intent is agent-to-agent peer delegation (vs MCP's agent-to-tool). Worth watching for H2 2026.
AGNTCY — Cisco/Linux Foundation, discovery and identity layer beneath MCP+A2A. Potentially relevant for multi-machine routing across koala/iguana/flamingo once mature.

Build Priority

Step	Effort	Value	When
Add `model: haiku` to explorer subagents	10 min	Immediate cost saving	Now
Write MCP server for local models	2–3h	Local model access in sessions	Soon
Register MCP server in Claude Code settings	15 min	Activates pattern 2	With above
Write orchestration script template	1–2h	Pipeline workflows	When needed

References

LiteLLM MCP docs: https://docs.litellm.ai/docs/mcp
Community MCP wrapper for LiteLLM: https://github.com/itsDarianNgo/mcp-server-litellm
Ollama MCP server: https://github.com/rawveg/ollama-mcp
A2A protocol status: https://www.linuxfoundation.org/press/a2a-protocol-surpasses-150-organizations-lands-in-major-cloud-platforms-and-sees-enterprise-production-use-in-first-year
AGNTCY: https://github.com/agntcy
