hyperguild/docs/superpowers/specs/2026-04-23-level3-slug-authority-design.md
2026-04-23 17:23:22 +02:00

Level 3: Strip Slug Authority from LLM — Design Spec

Problem

The ingestion pipeline currently asks the LLM to produce full wiki pages including the file path (e.g. wiki/sources/finbert-huggingface.md). This causes two classes of bug:

  1. Slug proliferation — the LLM invents different slugs for the same concept across chunks or runs, producing duplicate pages that diverge in content.
  2. Unstable paths — the LLM may shorten, expand, or vary titles, making deduplication via Resolve unreliable because the slug mismatch is upstream of the normalizer.

Solution

Strip slug authority from the LLM entirely. The LLM returns a minimal structured object. The pipeline computes all slugs deterministically from titles using wiki.Slug(title).
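To make "deterministic" concrete, here is a minimal sketch of a title-to-slug function. The real rules of wiki.Slug are not shown in this spec, so slugify below is a hypothetical stand-in: it assumes lowercasing and collapsing every run of non-alphanumeric characters into one hyphen. The point is only that the same title always yields the same slug, whatever the LLM does.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// slugify is a hypothetical stand-in for wiki.Slug (assumed behavior, not the
// project's actual implementation). It lowercases the title and collapses
// every run of non-alphanumeric characters into a single hyphen, with no
// leading or trailing hyphen.
func slugify(title string) string {
	var b strings.Builder
	pendingHyphen := false
	for _, r := range strings.ToLower(title) {
		if unicode.IsLetter(r) || unicode.IsDigit(r) {
			// Emit the separator only between two alphanumeric runs.
			if pendingHyphen && b.Len() > 0 {
				b.WriteByte('-')
			}
			pendingHyphen = false
			b.WriteRune(r)
			continue
		}
		pendingHyphen = true
	}
	return b.String()
}

func main() {
	// Spelling variants of one title collapse to one slug.
	for _, t := range []string{"FinBERT", "finbert", "  FinBERT  ", "Hugging Face"} {
		fmt.Printf("%q -> %q\n", t, slugify(t))
	}
}
```

A pure function like this cannot distinguish genuinely different titles for the same concept; that case is still left to Resolve.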


LLM JSON Contract

Output format (per page)

{
  "title": "FinBERT",
  "type": "concept",
  "subtype": "framework",
  "domain": "ai-llm",
  "content": "## Definition\n\nA BERT-based model fine-tuned for financial sentiment...\n\n## Related\n\n- [[Sentiment Analysis]]\n- [[Hugging Face]]\n"
}

Fields:

Field    Required           Values
title    yes                Human-readable title, e.g. "FinBERT"
type     yes                "source" | "concept" | "entity"
subtype  for entity/source  entity: person|company|tool|model|framework|technology; source: article|pdf|book|video|note|project
domain   no                 Tag string, e.g. ai-llm, finance
content  yes                Markdown body sections only (no frontmatter, no path)

Wikilinks in content: [[Display Name]] only. No slug. The pipeline canonicalizes to [[slug|Display Name]] in a post-processing step.

The LLM never writes slugs, paths, or frontmatter.


Pipeline Changes

New type: RawPage

type RawPage struct {
    Title   string
    Type    string // "source" | "concept" | "entity"
    Subtype string
    Domain  string
    Content string
}

New step order

ParseRawPages → BuildPages → Resolve → CanonicalizeLinks → injectSourceRefs → mergeAll → write

Step descriptions

ParseRawPages(output string) ([]RawPage, []string)
Replaces ParsePages. Deserializes JSON objects with the new schema, using the same truncation-recovery logic as today. Returns (pages, warnings).
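A rough sketch of how such a parser could look follows. It rests on several assumptions not stated in the spec: that the LLM output is a single JSON array of page objects, that truncation recovery means "keep every object decoded before the cut", and that pages with a missing title or unknown type are dropped with a warning. The existing recovery logic may work differently.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// RawPage mirrors the spec's struct; the JSON tags are an assumption.
type RawPage struct {
	Title   string `json:"title"`
	Type    string `json:"type"`
	Subtype string `json:"subtype"`
	Domain  string `json:"domain"`
	Content string `json:"content"`
}

// parseRawPages streams the array element by element so that pages decoded
// before a truncation point survive.
func parseRawPages(output string) ([]RawPage, []string) {
	var pages []RawPage
	var warnings []string

	dec := json.NewDecoder(strings.NewReader(output))
	if tok, err := dec.Token(); err != nil || tok != json.Delim('[') {
		return nil, []string{"output is not a JSON array"}
	}
	for dec.More() {
		var p RawPage
		if err := dec.Decode(&p); err != nil {
			// Typically a truncated final object: keep what we have.
			warnings = append(warnings,
				fmt.Sprintf("stopped after %d pages: %v", len(pages), err))
			break
		}
		switch {
		case strings.TrimSpace(p.Title) == "":
			warnings = append(warnings, "skipped page with missing title")
		case p.Type != "source" && p.Type != "concept" && p.Type != "entity":
			warnings = append(warnings,
				fmt.Sprintf("skipped %q: unknown type %q", p.Title, p.Type))
		default:
			pages = append(pages, p)
		}
	}
	return pages, warnings
}

func main() {
	// The second object is cut off mid-string, as a truncated response would be.
	out := `[{"title":"FinBERT","type":"concept","content":"## Definition\n"},{"title":"Hugging Fa`
	pages, warnings := parseRawPages(out)
	fmt.Println(len(pages), "page(s) parsed")
	for _, w := range warnings {
		fmt.Println("warning:", w)
	}
}
```

Decoding one element at a time, rather than calling json.Unmarshal on the whole string, is what lets a truncated tail cost only the last page instead of the entire batch.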

BuildPages(rawPages []RawPage, sourceSlug, date string) []wiki.Page
Converts each RawPage into a wiki.Page:

  • Computes slug: wiki.Slug(page.Title)
  • Computes path: wiki/<type>/<slug>.md
  • Assembles frontmatter:
    ---
    title: <Title>
    type: <type>
    subtype: <subtype>        # omitted if empty
    domain: <domain>          # omitted if empty
    created: <date>
    source: <sourceSlug>      # omitted for the source page itself
    ---
    
  • Concatenates frontmatter + content
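The conversion for a single page might be sketched as below. Page is a hypothetical stand-in for wiki.Page (its real fields are not given here), slug is a simplified stand-in for wiki.Slug, and treating Type == "source" as "the source page itself" is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

type RawPage struct {
	Title, Type, Subtype, Domain, Content string
}

// Page is a hypothetical stand-in for wiki.Page.
type Page struct {
	Slug, Path, Content string
}

// slug is a simplified stand-in for wiki.Slug (assumed behavior).
func slug(title string) string {
	return strings.Join(strings.Fields(strings.ToLower(title)), "-")
}

// buildPage computes slug and path from the title and prepends frontmatter.
func buildPage(rp RawPage, sourceSlug, date string) Page {
	s := slug(rp.Title)

	var fm strings.Builder
	fm.WriteString("---\n")
	fmt.Fprintf(&fm, "title: %s\n", rp.Title)
	fmt.Fprintf(&fm, "type: %s\n", rp.Type)
	if rp.Subtype != "" {
		fmt.Fprintf(&fm, "subtype: %s\n", rp.Subtype)
	}
	if rp.Domain != "" {
		fmt.Fprintf(&fm, "domain: %s\n", rp.Domain)
	}
	fmt.Fprintf(&fm, "created: %s\n", date)
	// Assumption: a page of type "source" is the source page itself.
	if rp.Type != "source" {
		fmt.Fprintf(&fm, "source: %s\n", sourceSlug)
	}
	fm.WriteString("---\n\n")

	return Page{
		Slug:    s,
		Path:    "wiki/" + rp.Type + "/" + s + ".md",
		Content: fm.String() + rp.Content,
	}
}

func main() {
	p := buildPage(RawPage{
		Title:   "FinBERT",
		Type:    "concept",
		Domain:  "ai-llm",
		Content: "## Definition\n",
	}, "finbert-huggingface", "2026-04-23")
	fmt.Println(p.Path)
	fmt.Print(p.Content)
}
```

This sketch writes the title into the frontmatter unquoted; a title containing a colon or other YAML-significant characters would need quoting or a proper YAML encoder.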

Resolve(pages []wiki.Page, inventory) []wiki.Page
Unchanged. Normalizes near-duplicate titles to existing inventory slugs.

CanonicalizeLinks(pages []wiki.Page, inventory) ([]wiki.Page, []string)
New. Builds a title→slug map from the inventory plus the current batch, then replaces [[Display Name]] with [[slug|Display Name]] in each page's content. Titles with no known slug are left as-is and returned as warnings.
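The rewrite itself can be sketched for a single content string as follows. The function name, the regular expression, and the exact-match title lookup are illustrative assumptions; the real step operates on []wiki.Page and may normalize titles before looking them up.

```go
package main

import (
	"fmt"
	"regexp"
)

// bareLink matches [[Display Name]] but not [[slug|Display Name]], so links
// that are already canonical are left untouched.
var bareLink = regexp.MustCompile(`\[\[([^\[\]|]+)\]\]`)

// canonicalizeLinks rewrites bare wikilinks using titleToSlug and reports
// every title it could not resolve.
func canonicalizeLinks(content string, titleToSlug map[string]string) (string, []string) {
	var warnings []string
	out := bareLink.ReplaceAllStringFunc(content, func(m string) string {
		title := m[2 : len(m)-2] // strip the surrounding [[ and ]]
		slug, ok := titleToSlug[title]
		if !ok {
			warnings = append(warnings, "no slug for title: "+title)
			return m
		}
		return "[[" + slug + "|" + title + "]]"
	})
	return out, warnings
}

func main() {
	titles := map[string]string{
		"Sentiment Analysis": "sentiment-analysis",
		"Hugging Face":       "hugging-face",
	}
	in := "- [[Sentiment Analysis]]\n- [[Hugging Face]]\n- [[Unknown Thing]]\n"
	out, warnings := canonicalizeLinks(in, titles)
	fmt.Print(out)
	fmt.Println(warnings)
}
```

Because the pattern excludes the pipe character, running the step twice over the same content leaves canonical links unchanged.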

injectSourceRefs
Unchanged. Reads [[slug|...]] links (post-canonicalization) to inject back-references.
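This ordering matters because a slug-based reader only sees links that already carry a slug. The sketch below is not the existing injectSourceRefs code; linkedSlugs is a hypothetical helper that illustrates how canonical links could be read, and why a bare [[Display Name]] link would be invisible to it.

```go
package main

import (
	"fmt"
	"regexp"
)

// canonicalLink matches [[slug|Display Name]] and captures the slug.
var canonicalLink = regexp.MustCompile(`\[\[([^\[\]|]+)\|[^\[\]]+\]\]`)

// linkedSlugs returns the distinct slugs referenced in content, in order of
// first appearance. Bare links without a slug are ignored.
func linkedSlugs(content string) []string {
	seen := map[string]bool{}
	var slugs []string
	for _, m := range canonicalLink.FindAllStringSubmatch(content, -1) {
		if !seen[m[1]] {
			seen[m[1]] = true
			slugs = append(slugs, m[1])
		}
	}
	return slugs
}

func main() {
	content := "See [[finbert|FinBERT]] and [[hugging-face|Hugging Face]], " +
		"also [[finbert|the model]] and [[Bare Title]]."
	fmt.Println(linkedSlugs(content))
}
```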

mergeAll → write
Unchanged.

pipeline.Run signature change

func Run(ctx context.Context, cfg Config, brainDir, content, source string, dryRun bool) (Result, error)

The signature itself stays the same. source is already passed (it is the display name or filename), and the pipeline derives an internal sourceSlug from it via wiki.Slug(source) before calling BuildPages, so callers need no API change.


Files Changed

File                                   Change
ingestion/internal/pipeline/parse.go     Replace ParsePages with ParseRawPages + RawPage type
ingestion/internal/pipeline/build.go     New file: BuildPages
ingestion/internal/pipeline/links.go     New file: CanonicalizeLinks
ingestion/internal/pipeline/pipeline.go  Wire new steps; derive sourceSlug from source
ingestion/internal/pipeline/prompt.go    New system prompt + BuildPrompt for new JSON format
brain/schema.md                          Update wikilink format and JSON schema docs

resolve.go, refs.go, backfill.go, merge.go — no changes.


Wikilink Format

  • LLM output: [[Display Name]]
  • Stored on disk: [[slug|Display Name]]
  • CanonicalizeLinks converts the first form into the second using a title→slug map built from the inventory and the current batch

This matches Obsidian's display-alias syntax that the existing codebase already uses.


Testing Strategy

  • ParseRawPages: table-driven, cover valid JSON, truncated output, unknown type, missing title
  • BuildPages: table-driven, cover slug computation, frontmatter assembly, source page (no source: field), entity with subtype
  • CanonicalizeLinks: cover known title → replaced, unknown title → left as-is + warning, multiple links in one page
  • Integration test: full Run call with mock LLM returning new JSON format, assert no slug duplication across two chunks of the same source

Out of Scope

  • Re-ingesting existing pages (user will trigger manually after deploy)
  • Changing the BackfillRefs endpoint (already correct, slug-based)
  • Changing the Resolve fuzzy-match algorithm