Brain Ingestion Pipeline — Design Spec

Date: 2026-04-22 · Status: approved · Author: Mathias + Claude


Overview

Add a structured ingestion pipeline to the hyperguild brain. The pipeline accepts raw content (directly or from files) and uses an LLM to produce structured wiki pages in brain/wiki/, the declarative layer of the Two-Layer Brain. Pages fall into three fixed knowledge classes: concepts, entities, and sources.

This spec covers:

  • Four new packages in the ingestion Go module (llm, wiki, pipeline, watcher)
  • Two new HTTP endpoints on the ingestion server (/ingest, /ingest-path)
  • A background file watcher for brain/raw/
  • Config additions to both the ingestion server and the supervisor

It does not cover Layer 2 (training data, brain/training-data/) — that is the trainer worker's concern.


Information Model

Three fixed wiki page classes, matching the Two-Layer Brain design spec and the existing ingestion-svc model:

wiki/sources/<slug>.md

One page per ingested source (project, book, article, note). Updated (not replaced) on re-ingestion.

Required frontmatter: title, type (article|pdf|book|video|note|project), domain, source_url, date_ingested, last_updated, aliases.

Body sections: Summary · Key Claims · Concepts Introduced or Reinforced · Entities Mentioned · Open Questions Raised. Books add: Chapters · Argument Arc · Updates (dated, append-only).

wiki/concepts/<slug>.md

One page per idea, framework, methodology, or pattern (e.g. Domain Driven Design, TDD, event sourcing).

Required frontmatter: title, domain, last_updated, aliases.

Body sections: Definition · Why It Matters · Related Concepts · Related Entities · Sources · Evolving Notes.

wiki/entities/<slug>.md

One page per person, tool, organisation, technology, or product.

Required frontmatter: title, type (person|company|tool|model|framework|technology), domain, last_updated, aliases.

Body sections: Description · Relevance · Key Positions/Products/Claims · Related Concepts · Related Entities · Sources.

All cross-references use [[slug|Display Text]]. A slug is the lowercased title with spaces replaced by hyphens and all other non-alphanumeric characters stripped. Every slug used in a cross-reference must resolve to an existing file in the wiki.
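
The slug rule can be sketched in a few lines of Go. This is an illustration only: the names Slugify and WikiLink are hypothetical, and the sketch assumes that Unicode letters and digits count as alphanumeric and that runs of spaces collapse into one hyphen, neither of which the spec states.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// Slugify lowercases the title, turns runs of whitespace or hyphens into a
// single hyphen, and drops every other non-alphanumeric character.
// (Assumption: Unicode letters and digits are kept as-is.)
func Slugify(title string) string {
	var b strings.Builder
	prevHyphen := true // suppresses a leading hyphen
	for _, r := range strings.ToLower(strings.TrimSpace(title)) {
		switch {
		case unicode.IsLetter(r) || unicode.IsDigit(r):
			b.WriteRune(r)
			prevHyphen = false
		case unicode.IsSpace(r) || r == '-':
			if !prevHyphen {
				b.WriteByte('-')
				prevHyphen = true
			}
		}
		// Anything else (apostrophes, colons, dots) is stripped.
	}
	return strings.TrimSuffix(b.String(), "-")
}

// WikiLink renders the [[slug|Display Text]] cross-reference form.
func WikiLink(title string) string {
	return fmt.Sprintf("[[%s|%s]]", Slugify(title), title)
}

func main() {
	fmt.Println(Slugify("Domain Driven Design"))
	fmt.Println(Slugify("Don't Repeat: v1.2"))
	fmt.Println(WikiLink("Shape Up"))
}
```

Under these assumptions a version string such as "v1.2" loses its dot, which is one of the edge cases the wiki/ tests are meant to pin down.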

Supporting files

  • brain/wiki/index.md — auto-rebuilt on every ingest: one-sentence summary per page, grouped by type
  • brain/log.md — append-only audit trail: date, source, pages written, warnings

Architecture

New packages (ingestion module)

ingestion/internal/
  llm/        — OpenAI-compatible HTTP client (chat completions, retry on 429,
                configurable timeout and temperature)
  wiki/        — Page types, slug utilities, merge logic, inventory loader,
                index rebuilder, log appender
  pipeline/   — Orchestrates one ingest run end-to-end (content or extracted file text)
  watcher/    — Polls brain/raw/ and triggers pipeline on new files

The existing api/ and search/ packages are updated; no other existing packages change.

Brain directory layout

brain/
  wiki/
    concepts/        ← LLM-structured concept pages
    entities/        ← LLM-structured entity pages
    sources/         ← LLM-structured source pages
    index.md         ← auto-rebuilt on each ingest
  knowledge/         ← quick raw notes via brain_write (BM25-searchable, unchanged)
  raw/               ← drop zone; watcher picks up files here
    processed/       ← moved here on success (organised by date: processed/YYYY-MM-DD/)
    failed/          ← moved here on failure
  sessions/          ← session logs (retrospective/trainer concern, not touched here)
  training-data/     ← Layer 2 (trainer worker concern, not touched here)
  log.md             ← append-only audit trail
  CLAUDE.md          ← schema document injected into every ingest prompt

If brain/CLAUDE.md is absent, the pipeline falls back to an embedded default schema compiled into the binary.


API

POST /ingest

Ingest content provided directly by the caller.

Request:

{
  "content": "...",
  "source": "shape-up-book",
  "dry_run": false
}

Response:

{
  "pages": ["wiki/sources/shape-up.md", "wiki/concepts/betting-table.md"],
  "warnings": []
}

source is the human-readable name used when writing or updating wiki/sources/<slug>.md. When dry_run is true, the pipeline returns the generated page contents without writing them to disk.
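
The request and response bodies map onto Go structs roughly as follows. The type and function names are hypothetical, and the validation rule (both content and source required) is an assumption; the spec only says that missing fields produce a 400.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// IngestRequest mirrors the JSON body of POST /ingest.
type IngestRequest struct {
	Content string `json:"content"`
	Source  string `json:"source"`
	DryRun  bool   `json:"dry_run"`
}

// IngestResponse mirrors the response shape shared by both endpoints.
type IngestResponse struct {
	Pages    []string `json:"pages"`
	Warnings []string `json:"warnings"`
}

// DecodeIngestRequest parses a request body and rejects bodies that lack
// content or source; a handler would answer such errors with 400.
func DecodeIngestRequest(body []byte) (IngestRequest, error) {
	var req IngestRequest
	if err := json.Unmarshal(body, &req); err != nil {
		return req, fmt.Errorf("invalid JSON: %w", err)
	}
	if strings.TrimSpace(req.Content) == "" {
		return req, errors.New("content is required")
	}
	if strings.TrimSpace(req.Source) == "" {
		return req, errors.New("source is required")
	}
	return req, nil
}

func main() {
	req, err := DecodeIngestRequest([]byte(`{"content":"...","source":"shape-up-book","dry_run":true}`))
	if err != nil {
		fmt.Println("400:", err)
		return
	}
	// A non-nil empty slice encodes as [] rather than null.
	resp := IngestResponse{Pages: []string{"wiki/sources/shape-up.md"}, Warnings: []string{}}
	out, _ := json.Marshal(resp)
	fmt.Println(req.Source, req.DryRun, string(out))
}
```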

POST /ingest-path

Ingest a file or walk a directory recursively. Supports .md, .txt, .pdf.

Request:

{
  "path": "/Users/mathias/brain/raw/shape-up.pdf",
  "source": "shape-up-book",
  "dry_run": false
}

If path is a directory, all supported files within it are ingested in sequence. source is optional for directory ingestion — if omitted, the LLM derives it from each file's name and content.

Response: same shape as /ingest, with pages and warnings aggregated across all files.

Supervisor skill update

brain_ingest in internal/skills/brain/handlers.go gains an optional path field. If path is set, it calls /ingest-path; otherwise /ingest.
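
The routing rule is small enough to show in full. The sketch below is not the handler in handlers.go: IngestArgs and RouteIngest are hypothetical names, and the base URL in main is made up for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// IngestArgs is a hypothetical shape for the brain_ingest tool arguments.
type IngestArgs struct {
	Content string `json:"content,omitempty"`
	Path    string `json:"path,omitempty"`
	Source  string `json:"source,omitempty"`
	DryRun  bool   `json:"dry_run"`
}

// RouteIngest returns the target URL and JSON body for a brain_ingest call:
// /ingest-path when Path is set, /ingest otherwise.
func RouteIngest(baseURL string, a IngestArgs) (string, []byte, error) {
	endpoint := "/ingest"
	if a.Path != "" {
		endpoint = "/ingest-path"
	}
	body, err := json.Marshal(a)
	if err != nil {
		return "", nil, fmt.Errorf("encode brain_ingest args: %w", err)
	}
	return strings.TrimRight(baseURL, "/") + endpoint, body, nil
}

func main() {
	url, body, err := RouteIngest("http://localhost:8080", IngestArgs{Path: "/tmp/notes.md"})
	fmt.Println(url, string(body), err)
}
```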


Pipeline

pipeline.Run(ctx, cfg, brainDir, content, source, dryRun) is called by both HTTP handlers, and by the file watcher, after any file reading is done.

Steps:

  1. Load inventory — walk brain/wiki/{concepts,entities,sources}/, build slug index grouped by type. Injected into prompt so LLM knows what to update vs create.
  2. Load schema — read brain/CLAUDE.md; fall back to embedded default if absent.
  3. Chunk — split content at INGEST_CHUNK_SIZE chars (default 6000; split on paragraph boundary). If INGEST_CHUNK_SIZE=0, no chunking.
  4. LLM call per chunk — returns JSON array of {"path": "wiki/concepts/foo.md", "content": "..."}. Prompt structure: system instruction → date → schema → inventory → non-negotiable slug/wikilink rules → source content.
  5. Parse + truncation recovery — strip markdown fences if present. If JSON array is truncated mid-object (token limit), salvage all complete objects before the break and log a warning.
  6. Merge — combine pages with the same path across chunks:
    • Bullet sections (Related Concepts, Related Entities, Sources, Key Claims): union unique lines
    • Append sections (Evolving Notes, Updates, Open Questions): append new content
    • All other sections: keep first occurrence
    • Frontmatter: keep first occurrence
  7. Write — create subdirs as needed, write files atomically. In dry-run mode, return page map without writing.
  8. Rebuild index.md — one-sentence summary per page (derived from first body paragraph), grouped by type, with page count header.
  9. Append to log.md — date, source, list of pages written, warning count.

File Watcher

Background goroutine started at server startup (when INGEST_WATCH_INTERVAL > 0).

Poll loop:

  1. Walk brain/raw/ for files with supported extensions (.md, .txt, .pdf), excluding processed/ and failed/ subdirs.
  2. For each file found: derive source from filename (strip extension, kebab-to-title), call pipeline.Run with the file content.
  3. On success: move file to brain/raw/processed/YYYY-MM-DD/<filename>.
  4. On failure: move file to brain/raw/failed/<filename>, append error to brain/log.md.
  5. Sleep INGEST_WATCH_INTERVAL seconds, repeat.

Files are processed one at a time (no concurrency within the watcher) to avoid LLM rate-limit collisions.


LLM Prompt

System:

You are a wiki agent. Read the source material and produce structured wiki pages following the schema provided. Output ONLY a valid JSON array — no markdown fences, no other text. Each element must have: "path" (relative path within wiki, e.g. "wiki/sources/foo.md") and "content" (full markdown including YAML frontmatter). Follow the schema strictly: correct frontmatter fields, wikilinks as [[slug|Display Text]], dates in YYYY-MM-DD format, paraphrase rather than quoting verbatim.

User (built dynamically):

  1. Today's date
  2. Full schema (brain/CLAUDE.md content)
  3. Existing wiki inventory grouped by type (for update-vs-create decisions)
  4. Non-negotiable rules: slug format, wikilink format, one-source-per-book, section type enforcement
  5. Source content (the chunk)

Temperature: 0.2, low enough to keep output largely consistent across runs.


Configuration

Ingestion server (new env vars)

| Variable | Default | Description |
| --- | --- | --- |
| INGEST_LLM_URL | http://iguana:4000/v1 | OpenAI-compatible endpoint |
| INGEST_LLM_KEY | (empty) | API key |
| INGEST_LLM_MODEL | koala/qwen35-9b-fast | Model name |
| INGEST_LLM_TIMEOUT | 15 | LLM call timeout (minutes) |
| INGEST_CHUNK_SIZE | 6000 | Max chars per LLM call (0 = no chunking) |
| INGEST_WATCH_INTERVAL | 30 | Watcher poll interval in seconds (0 = disabled) |

Supervisor (new env vars + wiring)

| Variable | Default | Description |
| --- | --- | --- |
| INGEST_SVC_URL | (empty) | URL of ingestion server for brain_ingest |
| KB_RETRIEVAL_URL | (empty) | URL of KB retrieval server for brain_search |

config.go gets two new fields, and main.go passes them to brain.New(). Each tool is registered as an MCP tool only when its URL is configured (already implemented in skill.go).


Testing

| Package | What is tested |
| --- | --- |
| wiki/ | Slug generation (edge cases: apostrophes, colons, version strings), merge logic (bullets union, append, keep-first), inventory loading from temp dir, truncation recovery (valid partial JSON), index rebuild output |
| pipeline/ | Integration test: temp brain dir + mock LLM HTTP server returning fixture JSON; verify files written to correct paths, index rebuilt, log appended |
| api/ | Handler tests for /ingest and /ingest-path using mock pipeline; 400 on missing fields, 200 with expected response shape |
| watcher/ | File placed in brain/raw/ is moved to processed/ on mock-pipeline success; moved to failed/ on error |

All tests are table-driven. No real LLM calls in tests.


Out of Scope

  • Python validation/correction loop (can be added later; the LLM prompt enforces schema rules as non-negotiable instructions)
  • brain/training-data/ — trainer worker concern
  • brain/sessions/ — retrospective/sessionlog concern
  • Upload endpoint (multipart HTTP) — scp/rsync to brain/raw/ + watcher covers this
  • Qdrant vector indexing — brain_search calls a separate KB retrieval service; ingestion does not write to Qdrant