# Level 3: Strip Slug Authority from LLM (Design Spec)

## Problem

The ingestion pipeline currently asks the LLM to produce full wiki pages including the file path (e.g. `wiki/sources/finbert-huggingface.md`). This causes two classes of bug:

1. **Slug proliferation**: the LLM invents different slugs for the same concept across chunks or runs, producing duplicate pages that diverge in content.
2. **Unstable paths**: the LLM may shorten, expand, or vary titles, making deduplication via `Resolve` unreliable because the slug mismatch is upstream of the normalizer.

## Solution

Strip slug authority from the LLM entirely. The LLM returns a minimal structured object. The pipeline computes all slugs deterministically from titles using `wiki.Slug(title)`.

---

## LLM JSON Contract

### Output format (per page)

```json
{
  "title": "FinBERT",
  "type": "concept",
  "subtype": "framework",
  "domain": "ai-llm",
  "content": "## Definition\n\nA BERT-based model fine-tuned for financial sentiment...\n\n## Related\n\n- [[Sentiment Analysis]]\n- [[Hugging Face]]\n"
}
```

**Fields:**

| Field | Required | Values |
|-------|----------|--------|
| `title` | yes | Human-readable title, e.g. "FinBERT" |
| `type` | yes | `"source"` \| `"concept"` \| `"entity"` |
| `subtype` | for entity/source | entity: `person\|company\|tool\|model\|framework\|technology`; source: `article\|pdf\|book\|video\|note\|project` |
| `domain` | no | tag string, e.g. `ai-llm`, `finance` |
| `content` | yes | Markdown body sections only; no frontmatter, no path |

**Wikilinks in content:** `[[Display Name]]` only. No slug. The pipeline canonicalizes to `[[slug|Display Name]]` in a post-processing step.
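
The link rewrite described above can be sketched roughly as follows. This is an illustration only, not the pipeline's actual code: the `canonicalize` helper, the regular expression, and the case-insensitive title lookup are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// wikilinkRe matches [[Display Name]] links that carry no slug yet.
// Links that already contain a "|" do not match and are left alone.
var wikilinkRe = regexp.MustCompile(`\[\[([^\[\]|]+)\]\]`)

// canonicalize rewrites [[Display Name]] to [[slug|Display Name]] using a
// title-to-slug map. Titles with no known slug stay as-is and are reported
// as warnings. Lowercasing the lookup key is an assumption of this sketch.
func canonicalize(content string, slugs map[string]string) (string, []string) {
	var warnings []string
	out := wikilinkRe.ReplaceAllStringFunc(content, func(link string) string {
		name := strings.TrimSpace(link[2 : len(link)-2]) // strip "[[" and "]]"
		slug, ok := slugs[strings.ToLower(name)]
		if !ok {
			warnings = append(warnings, "no slug for link: "+name)
			return link
		}
		return "[[" + slug + "|" + name + "]]"
	})
	return out, warnings
}

func main() {
	// Hypothetical title-to-slug map; in the pipeline it would come from
	// the inventory plus the current batch.
	slugs := map[string]string{
		"sentiment analysis": "sentiment-analysis",
		"hugging face":       "hugging-face",
	}
	in := "- [[Sentiment Analysis]]\n- [[Hugging Face]]\n- [[Unknown Topic]]\n"
	out, warnings := canonicalize(in, slugs)
	fmt.Print(out)
	fmt.Println(warnings)
}
```

Because links that already contain a `|` are skipped, running the rewrite twice over the same content leaves it unchanged, which is convenient if a page passes through the step more than once.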

**The LLM never writes slugs, paths, or frontmatter.**

---

## Pipeline Changes

### New type: `RawPage`

```go
type RawPage struct {
	Title   string
	Type    string // "source" | "concept" | "entity"
	Subtype string
	Domain  string
	Content string
}
```

### New step order

```
ParseRawPages → BuildPages → Resolve → CanonicalizeLinks → injectSourceRefs → mergeAll → write
```

### Step descriptions

**`ParseRawPages(output string) ([]RawPage, []string)`**

Replaces `ParsePages`. Deserializes JSON objects with the new schema. Same truncation-recovery logic as today. Returns `(pages, warnings)`.

**`BuildPages(rawPages []RawPage, sourceSlug, date string) []wiki.Page`**

Converts `RawPage → wiki.Page`:

- Computes slug: `wiki.Slug(page.Title)`
- Computes path: `wiki/<dir>/<slug>.md`, where `<dir>` is the directory for the page type (e.g. `wiki/sources/` for source pages)
- Assembles frontmatter:

  ```
  ---
  title: <title>
  type: <type>
  subtype: <subtype>   # omitted if empty
  domain: <domain>     # omitted if empty
  created: <date>
  source: <sourceSlug> # omitted for the source page itself
  ---
  ```

- Concatenates frontmatter + content

**`Resolve(pages []wiki.Page, inventory) []wiki.Page`**

Unchanged. Normalizes near-duplicate titles to existing inventory slugs.

**`CanonicalizeLinks(pages []wiki.Page, inventory) ([]wiki.Page, []string)`**

New. Builds a title→slug map from inventory + current batch. Replaces `[[Display Name]]` with `[[slug|Display Name]]` in each page's content. Titles with no known slug are left as-is and returned as warnings.

**`injectSourceRefs`**

Unchanged. Reads `[[slug|...]]` links (post-canonicalization) to inject back-references.

**`mergeAll → write`**

Unchanged.

### `pipeline.Run` signature change

```go
func Run(ctx context.Context, cfg Config, brainDir, content, source string, dryRun bool) (Result, error)
```

`source` is already passed (it's the display name / filename). A new internal `sourceSlug` is derived from it via `wiki.Slug(source)` before calling `BuildPages`. No API change needed.
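
As a rough illustration of the `BuildPages` step described above, the sketch below assembles frontmatter from a `RawPage`. It rests on several assumptions: `slugify` is a hypothetical stand-in for `wiki.Slug` (whose real normalization rules are not given in this spec), the source page is detected by `Type == "source"`, and the sample slug and date are invented. Path computation is left out.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// RawPage mirrors the type defined in this spec.
type RawPage struct {
	Title, Type, Subtype, Domain, Content string
}

var nonAlnum = regexp.MustCompile(`[^a-z0-9]+`)

// slugify is a hypothetical stand-in for wiki.Slug: lowercase the title and
// collapse every run of non-alphanumeric characters into a single "-".
func slugify(title string) string {
	return strings.Trim(nonAlnum.ReplaceAllString(strings.ToLower(title), "-"), "-")
}

// frontmatter assembles the header block, omitting empty optional fields.
// Treating Type == "source" as "the source page itself" is an assumption.
// Values are written verbatim; YAML quoting is not handled in this sketch.
func frontmatter(p RawPage, sourceSlug, date string) string {
	var b strings.Builder
	b.WriteString("---\n")
	fmt.Fprintf(&b, "title: %s\n", p.Title)
	fmt.Fprintf(&b, "type: %s\n", p.Type)
	if p.Subtype != "" {
		fmt.Fprintf(&b, "subtype: %s\n", p.Subtype)
	}
	if p.Domain != "" {
		fmt.Fprintf(&b, "domain: %s\n", p.Domain)
	}
	fmt.Fprintf(&b, "created: %s\n", date)
	if p.Type != "source" {
		fmt.Fprintf(&b, "source: %s\n", sourceSlug)
	}
	b.WriteString("---\n")
	return b.String()
}

func main() {
	page := RawPage{
		Title:   "Hugging Face",
		Type:    "entity",
		Subtype: "company",
		Content: "## Overview\n\nHosts models and datasets.\n",
	}
	// The source name and date below are made up for the example.
	sourceSlug := slugify("FinBERT HuggingFace")
	fmt.Println("slug:", slugify(page.Title))
	fmt.Print(frontmatter(page, sourceSlug, "2024-01-01") + page.Content)
}
```

Deriving the slug from the title in one place means two chunks that both mention the same title produce the same slug, which is the property the integration test is meant to assert.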

---

## Files Changed

| File | Change |
|------|--------|
| `ingestion/internal/pipeline/parse.go` | Replace `ParsePages` with `ParseRawPages` + `RawPage` type |
| `ingestion/internal/pipeline/build.go` | New file: `BuildPages` |
| `ingestion/internal/pipeline/links.go` | New file: `CanonicalizeLinks` |
| `ingestion/internal/pipeline/pipeline.go` | Wire new steps; derive `sourceSlug` from `source` |
| `ingestion/internal/pipeline/prompt.go` | New system prompt + `BuildPrompt` for new JSON format |
| `brain/schema.md` | Update wikilink format and JSON schema docs |

`resolve.go`, `refs.go`, `backfill.go`, `merge.go`: no changes.

---

## Wikilink Format

- **LLM output**: `[[Display Name]]`
- **Stored on disk**: `[[slug|Display Name]]`
- **`CanonicalizeLinks`** converts between the two using the inventory

This matches Obsidian's display-alias syntax that the existing codebase already uses.

---

## Testing Strategy

- `ParseRawPages`: table-driven, cover valid JSON, truncated output, unknown type, missing title
- `BuildPages`: table-driven, cover slug computation, frontmatter assembly, source page (no `source:` field), entity with subtype
- `CanonicalizeLinks`: cover known title → replaced, unknown title → left as-is + warning, multiple links in one page
- Integration test: full `Run` call with mock LLM returning new JSON format, assert no slug duplication across two chunks of the same source

---

## Out of Scope

- Re-ingesting existing pages (user will trigger manually after deploy)
- Changing the `BackfillRefs` endpoint (already correct, slug-based)
- Changing the `Resolve` fuzzy-match algorithm