mathias/hyperguild

Mathias Bergqvist 62fc3989f2 docs: add brain ingestion pipeline design spec

2026-04-22 22:05:19 +02:00

Brain Ingestion Pipeline — Design Spec

Date: 2026-04-22

Status: approved

Author: Mathias + Claude

Overview

Add a structured ingestion pipeline to the hyperguild brain. The pipeline accepts raw content (supplied directly or read from files) and uses an LLM to produce structured wiki pages in brain/wiki/, the declarative layer of the Two-Layer Brain. Pages fall into three fixed knowledge classes: concepts, entities, and sources.

This spec covers:

- Four new packages in the ingestion Go module (llm, wiki, pipeline, watcher)
- Two new HTTP endpoints on the ingestion server (/ingest, /ingest-path)
- A background file watcher for brain/raw/
- Config additions to both the ingestion server and the supervisor

It does not cover Layer 2 (training data, brain/training-data/) — that is the trainer worker's concern.

Information Model

Three fixed wiki page classes, matching the Two-Layer Brain design spec and the existing ingestion-svc model:

`wiki/sources/<slug>.md`

One page per ingested source (project, book, article, note). Updated (not replaced) on re-ingestion.

Body sections: Summary · Key Claims · Concepts Introduced or Reinforced · Entities Mentioned · Open Questions Raised. Books add: Chapters · Argument Arc · Updates (dated, append-only).

`wiki/concepts/<slug>.md`

One page per idea, framework, methodology, or pattern (e.g. Domain Driven Design, TDD, event sourcing).

Required frontmatter: title, domain, last_updated, aliases.

Body sections: Definition · Why It Matters · Related Concepts · Related Entities · Sources · Evolving Notes.

`wiki/entities/<slug>.md`

One page per person, tool, organisation, technology, or product.

Body sections: Description · Relevance · Key Positions/Products/Claims · Related Concepts · Related Entities · Sources.

Wikilink format

All cross-references use [[slug|Display Text]]. Slug = lowercase title, spaces→hyphens, non-alphanumeric stripped. Slugs must resolve to an existing file in the wiki.
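The slug rule above can be sketched in Go as follows. This is a minimal illustration, not the module's actual code: the names `Slugify` and `Wikilink` are hypothetical, and three behaviours are assumptions the spec does not state (existing hyphens are kept as separators, repeated separators collapse into one hyphen, and non-ASCII letters count as alphanumeric).

```go
package main

import (
	"strings"
	"unicode"
)

// Slugify applies the stated rule: lowercase the title, turn spaces into
// hyphens, and strip every other non-alphanumeric character.
// Assumptions: existing hyphens act as separators, runs of separators
// collapse into a single hyphen, and leading/trailing hyphens are dropped.
func Slugify(title string) string {
	var b strings.Builder
	prevHyphen := false
	for _, r := range strings.ToLower(strings.TrimSpace(title)) {
		switch {
		case unicode.IsLetter(r) || unicode.IsDigit(r):
			b.WriteRune(r)
			prevHyphen = false
		case unicode.IsSpace(r) || r == '-':
			if !prevHyphen && b.Len() > 0 {
				b.WriteByte('-')
				prevHyphen = true
			}
		}
		// Any other rune (apostrophes, colons, dots) is dropped.
	}
	return strings.TrimSuffix(b.String(), "-")
}

// Wikilink renders a cross-reference in the [[slug|Display Text]] form.
func Wikilink(title string) string {
	return "[[" + Slugify(title) + "|" + title + "]]"
}
```

With these assumptions, `Wikilink("Event Sourcing")` yields `[[event-sourcing|Event Sourcing]]`. Checking that the slug resolves to an existing file is a separate step and is not shown here.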

Supporting files

- brain/wiki/index.md: auto-rebuilt on every ingest; one-sentence summary per page, grouped by type
- brain/log.md: append-only audit trail; date, source, pages written, warnings

Architecture

New packages (`ingestion` module)

ingestion/internal/
  llm/        — OpenAI-compatible HTTP client (chat completions, retry on 429,
                configurable timeout and temperature)
  wiki/        — Page types, slug utilities, merge logic, inventory loader,
                index rebuilder, log appender
  pipeline/   — Orchestrates one ingest run end-to-end (content or extracted file text)
  watcher/    — Polls brain/raw/ and triggers pipeline on new files

The existing api/ and search/ packages are updated; no other existing packages change.

Brain directory layout

brain/
  wiki/
    concepts/        ← LLM-structured concept pages
    entities/        ← LLM-structured entity pages
    sources/         ← LLM-structured source pages
    index.md         ← auto-rebuilt on each ingest
  knowledge/         ← quick raw notes via brain_write (BM25-searchable, unchanged)
  raw/               ← drop zone; watcher picks up files here
    processed/       ← moved here on success (organised by date: processed/YYYY-MM-DD/)
    failed/          ← moved here on failure
  sessions/          ← session logs (retrospective/trainer concern, not touched here)
  training-data/     ← Layer 2 (trainer worker concern, not touched here)
  log.md             ← append-only audit trail
  CLAUDE.md          ← schema document injected into every ingest prompt

If brain/CLAUDE.md is absent, the pipeline falls back to an embedded default schema compiled into the binary.

API

`POST /ingest`

Ingest content provided directly by the caller.

Request:

{
  "content": "...",
  "source": "shape-up-book",
  "dry_run": false
}

Response:

{
  "pages": ["wiki/sources/shape-up.md", "wiki/concepts/betting-table.md"],
  "warnings": []
}

source is the human-readable name used when writing or updating wiki/sources/<slug>.md. When dry_run is true, the generated page contents are returned and nothing is written to disk.

`POST /ingest-path`

Ingest a file or walk a directory recursively. Supports .md, .txt, .pdf.

Request:

{
  "path": "/Users/mathias/brain/raw/shape-up.pdf",
  "source": "shape-up-book",
  "dry_run": false
}

If path is a directory, all supported files within it are ingested in sequence. source is optional for directory ingestion — if omitted, the LLM derives it from each file's name and content.

Response: same shape as /ingest, with pages and warnings aggregated across all files.

Supervisor skill update

brain_ingest in internal/skills/brain/handlers.go gains an optional path field. If path is set, it calls /ingest-path; otherwise /ingest.
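The routing rule for brain_ingest is small enough to show in full. The sketch below assumes a hypothetical argument struct and helper name; the real handler in handlers.go may be shaped differently.

```go
package main

import "strings"

// BrainIngestArgs is a hypothetical shape for the brain_ingest tool input.
// Only the optional Path field is described by the spec; the rest mirror
// the HTTP request bodies.
type BrainIngestArgs struct {
	Content string `json:"content,omitempty"`
	Path    string `json:"path,omitempty"`
	Source  string `json:"source,omitempty"`
	DryRun  bool   `json:"dry_run,omitempty"`
}

// IngestEndpoint picks the ingestion-server URL to POST to:
// /ingest-path when a path is set, /ingest otherwise.
func IngestEndpoint(baseURL string, args BrainIngestArgs) string {
	base := strings.TrimRight(baseURL, "/")
	if args.Path != "" {
		return base + "/ingest-path"
	}
	return base + "/ingest"
}
```

Here `baseURL` would be the value of INGEST_SVC_URL.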

Pipeline

pipeline.Run(ctx, cfg, brainDir, content, source, dryRun) is the single entry point for an ingest run. It is called by both HTTP handlers and by the file watcher, after any file reading is done.

Steps:

1. Load inventory: walk brain/wiki/{concepts,entities,sources}/, build slug index grouped by type. Injected into prompt so LLM knows what to update vs create.
2. Load schema: read brain/CLAUDE.md; fall back to embedded default if absent.
3. Chunk: split content at INGEST_CHUNK_SIZE chars (default 6000; split on paragraph boundary). If INGEST_CHUNK_SIZE=0, no chunking.
4. LLM call per chunk: returns JSON array of {"path": "wiki/concepts/foo.md", "content": "..."}. Prompt structure: system instruction → date → schema → inventory → non-negotiable slug/wikilink rules → source content.
5. Parse + truncation recovery: strip markdown fences if present. If JSON array is truncated mid-object (token limit), salvage all complete objects before the break and log a warning.
6. Merge: combine pages with the same path across chunks:
   - Bullet sections (Related Concepts, Related Entities, Sources, Key Claims): union unique lines
   - Append sections (Evolving Notes, Updates, Open Questions): append new content
   - All other sections: keep first occurrence
   - Frontmatter: keep first occurrence
7. Write: create subdirs as needed, write files atomically. In dry-run mode, return page map without writing.
8. Rebuild index.md: one-sentence summary per page (derived from first body paragraph), grouped by type, with page count header.
9. Append to log.md: date, source, list of pages written, warning count.
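The Chunk and Parse steps above can be sketched as follows. All names are hypothetical. Two choices are assumptions rather than part of the spec: chunk size is measured in bytes, and a single paragraph longer than the limit is kept whole. Truncation recovery relies on the streaming decoder in encoding/json, which yields every complete array element before it hits the break.

```go
package main

import (
	"encoding/json"
	"errors"
	"strings"
)

// Page is one element of the JSON array the LLM returns.
type Page struct {
	Path    string `json:"path"`
	Content string `json:"content"`
}

// ChunkByParagraph splits content into chunks of at most size bytes,
// breaking only on blank lines. size <= 0 disables chunking.
func ChunkByParagraph(content string, size int) []string {
	if size <= 0 || len(content) <= size {
		return []string{content}
	}
	var chunks []string
	var cur strings.Builder
	for _, para := range strings.Split(content, "\n\n") {
		if cur.Len() > 0 && cur.Len()+2+len(para) > size {
			chunks = append(chunks, cur.String())
			cur.Reset()
		}
		if cur.Len() > 0 {
			cur.WriteString("\n\n")
		}
		cur.WriteString(para)
	}
	if cur.Len() > 0 {
		chunks = append(chunks, cur.String())
	}
	return chunks
}

const fence = "\x60\x60\x60" // three backticks

// stripFences removes a surrounding markdown code fence, if present.
func stripFences(s string) string {
	s = strings.TrimSpace(s)
	if !strings.HasPrefix(s, fence) {
		return s
	}
	if i := strings.Index(s, "\n"); i >= 0 {
		s = s[i+1:] // drop the opening fence line, including any language tag
	}
	return strings.TrimSpace(strings.TrimSuffix(strings.TrimSpace(s), fence))
}

// ParsePages decodes the LLM reply. When the array is cut off, it returns
// the pages that decoded completely and truncated=true so the caller can
// log a warning.
func ParsePages(reply string) ([]Page, bool, error) {
	dec := json.NewDecoder(strings.NewReader(stripFences(reply)))
	if tok, err := dec.Token(); err != nil || tok != json.Delim('[') {
		return nil, false, errors.New("reply is not a JSON array")
	}
	var pages []Page
	for dec.More() {
		var p Page
		if err := dec.Decode(&p); err != nil {
			return pages, true, nil // incomplete object: keep what came before
		}
		pages = append(pages, p)
	}
	if _, err := dec.Token(); err != nil {
		return pages, true, nil // closing bracket never arrived
	}
	return pages, false, nil
}
```

In this sketch a syntax error in the middle of the array is also reported as a truncation, since the decoder cannot tell the two cases apart.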

File Watcher

Background goroutine started at server startup (when INGEST_WATCH_INTERVAL > 0).

Poll loop:

1. Walk brain/raw/ for files with supported extensions (.md, .txt, .pdf), excluding processed/ and failed/ subdirs.
2. For each file found: derive source from filename (strip extension, kebab-to-title), call pipeline.Run with the file content.
3. On success: move file to brain/raw/processed/YYYY-MM-DD/<filename>.
4. On failure: move file to brain/raw/failed/<filename>, append error to brain/log.md.
5. Sleep INGEST_WATCH_INTERVAL seconds, repeat.

Files are processed one at a time (no concurrency within the watcher) to avoid LLM rate-limit collisions.

LLM Prompt

System:

You are a wiki agent. Read the source material and produce structured wiki pages following the schema provided. Output ONLY a valid JSON array — no markdown fences, no other text. Each element must have: "path" (relative path within wiki, e.g. "wiki/sources/foo.md") and "content" (full markdown including YAML frontmatter). Follow the schema strictly: correct frontmatter fields, wikilinks as [[slug|Display Text]], dates in YYYY-MM-DD format, paraphrase rather than quoting verbatim.

User (built dynamically):

- Today's date
- Full schema (brain/CLAUDE.md content)
- Existing wiki inventory grouped by type (for update-vs-create decisions)
- Non-negotiable rules: slug format, wikilink format, one-source-per-book, section type enforcement
- Source content (the chunk)

Temperature: 0.2 for reproducibility.

Configuration

Ingestion server (new env vars)

| Variable | Default | Description |
| --- | --- | --- |
| `INGEST_LLM_URL` | `http://iguana:4000/v1` | OpenAI-compatible endpoint |
| `INGEST_LLM_KEY` | (empty) | API key |
| `INGEST_LLM_MODEL` | `koala/qwen35-9b-fast` | Model name |
| `INGEST_LLM_TIMEOUT` | `15` | LLM call timeout (minutes) |
| `INGEST_CHUNK_SIZE` | `6000` | Max chars per LLM call (0 = no chunking) |
| `INGEST_WATCH_INTERVAL` | `30` | Watcher poll interval in seconds (0 = disabled) |

Supervisor (new env vars + wiring)

| Variable | Default | Description |
| --- | --- | --- |
| `INGEST_SVC_URL` | (empty) | URL of ingestion server for `brain_ingest` |
| `KB_RETRIEVAL_URL` | (empty) | URL of KB retrieval server for `brain_search` |

config.go gets two new fields. main.go passes them to brain.New(). Both tools are only registered as MCP tools when the respective URL is configured (already implemented in skill.go).

Testing

| Package | What is tested |
| --- | --- |
| `wiki/` | Slug generation (edge cases: apostrophes, colons, version strings), merge logic (bullets union, append, keep-first), inventory loading from temp dir, truncation recovery (valid partial JSON), index rebuild output |
| `pipeline/` | Integration test: temp brain dir + mock LLM HTTP server returning fixture JSON; verify files written to correct paths, index rebuilt, log appended |
| `api/` | Handler tests for `/ingest` and `/ingest-path` using mock pipeline; 400 on missing fields, 200 with expected response shape |
| `watcher/` | File placed in `brain/raw/` is moved to `processed/` on mock-pipeline success; moved to `failed/` on error |

All tests are table-driven. No real LLM calls in tests.

Out of Scope

- Python validation/correction loop (can be added later; the LLM prompt enforces schema rules as non-negotiable instructions)
- brain/training-data/: trainer worker concern
- brain/sessions/: retrospective/sessionlog concern
- Upload endpoint (multipart HTTP): scp/rsync to brain/raw/ + watcher covers this
- Qdrant vector indexing: brain_search calls a separate KB retrieval service; ingestion does not write to Qdrant
