Level 3: Strip Slug Authority from LLM — Design Spec
Problem
The ingestion pipeline currently asks the LLM to produce full wiki pages including the file path (e.g. wiki/sources/finbert-huggingface.md). This causes two classes of bug:
- Slug proliferation: the LLM invents different slugs for the same concept across chunks or runs, producing duplicate pages that diverge in content.
- Unstable paths: the LLM may shorten, expand, or vary titles, making deduplication via Resolve unreliable because the slug mismatch is upstream of the normalizer.
Solution
Strip slug authority from the LLM entirely. The LLM returns a minimal structured object. The pipeline computes all slugs deterministically from titles using wiki.Slug(title).
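To illustrate why computing slugs in the pipeline removes the variation, here is a minimal sketch of a deterministic slug function. The real wiki.Slug is not shown in this spec; slugify is a hypothetical stand-in, and its rules (lowercase, runs of non-alphanumeric characters collapsed to one hyphen) are assumptions.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// slugify is a hypothetical stand-in for wiki.Slug; the real rules may differ.
// It lowercases the title and collapses every run of characters that are not
// letters or digits into a single hyphen, with no hyphen at either end.
func slugify(title string) string {
	var b strings.Builder
	pendingHyphen := false
	for _, r := range strings.ToLower(title) {
		if unicode.IsLetter(r) || unicode.IsDigit(r) {
			// Emit a separator only between two alphanumeric runs.
			if pendingHyphen && b.Len() > 0 {
				b.WriteByte('-')
			}
			pendingHyphen = false
			b.WriteRune(r)
			continue
		}
		pendingHyphen = true
	}
	return b.String()
}

func main() {
	// The same title always yields the same slug, regardless of chunk or run.
	for _, t := range []string{"FinBERT", "Hugging Face", "  Sentiment   Analysis! "} {
		fmt.Printf("%q -> %q\n", t, slugify(t))
	}
}
```

Because the function depends only on the title string, two chunks that mention the same title cannot disagree on the path. Titles that differ in wording still need Resolve.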
LLM JSON Contract
Output format (per page)
{
  "title": "FinBERT",
  "type": "concept",
  "subtype": "framework",
  "domain": "ai-llm",
  "content": "## Definition\n\nA BERT-based model fine-tuned for financial sentiment...\n\n## Related\n\n- [[Sentiment Analysis]]\n- [[Hugging Face]]\n"
}
Fields:
| Field | Required | Values |
|---|---|---|
| title | yes | Human-readable title, e.g. "FinBERT" |
| type | yes | "source" \| "concept" \| "entity" |
| subtype | for entity/source | entity: person\|company\|tool\|model\|framework\|technology; source: article\|pdf\|book\|video\|note\|project |
| domain | no | tag string, e.g. ai-llm, finance |
| content | yes | Markdown body sections only; no frontmatter, no path |
Wikilinks in content: [[Display Name]] only. No slug. The pipeline canonicalizes to [[slug|Display Name]] in a post-processing step.
The LLM never writes slugs, paths, or frontmatter.
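The contract above can be checked mechanically before any page is built. The following is a sketch only: validate and allowedSubtypes are hypothetical names, and treating subtype as unconstrained for concepts is an assumption drawn from the example object, which pairs type "concept" with subtype "framework".

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// RawPage mirrors the JSON contract: no slug, no path, no frontmatter.
type RawPage struct {
	Title   string `json:"title"`
	Type    string `json:"type"`
	Subtype string `json:"subtype"`
	Domain  string `json:"domain"`
	Content string `json:"content"`
}

// allowedSubtypes lists the subtype values from the field table.
// Concepts have no entry because subtype is not required for them.
var allowedSubtypes = map[string][]string{
	"entity": {"person", "company", "tool", "model", "framework", "technology"},
	"source": {"article", "pdf", "book", "video", "note", "project"},
}

// validate is a hypothetical helper that checks one page against the table.
func validate(p RawPage) error {
	if p.Title == "" {
		return errors.New("missing title")
	}
	if p.Content == "" {
		return errors.New("missing content")
	}
	switch p.Type {
	case "concept":
		return nil // subtype is optional for concepts (assumption)
	case "entity", "source":
		for _, s := range allowedSubtypes[p.Type] {
			if p.Subtype == s {
				return nil
			}
		}
		return fmt.Errorf("invalid subtype %q for type %q", p.Subtype, p.Type)
	default:
		return fmt.Errorf("unknown type %q", p.Type)
	}
}

func main() {
	raw := `{"title":"FinBERT","type":"concept","subtype":"framework","domain":"ai-llm","content":"## Definition\n"}`
	var p RawPage
	if err := json.Unmarshal([]byte(raw), &p); err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Println(p.Title, validate(p))
}
```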
Pipeline Changes
New type: RawPage
type RawPage struct {
	Title   string
	Type    string // "source" | "concept" | "entity"
	Subtype string
	Domain  string
	Content string
}
New step order
ParseRawPages → BuildPages → Resolve → CanonicalizeLinks → injectSourceRefs → mergeAll → write
Step descriptions
ParseRawPages(output string) ([]RawPage, []string)
Replaces ParsePages. Deserializes JSON objects with the new schema. Same truncation-recovery logic as today. Returns (pages, warnings).
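One way such a parser could tolerate truncated output is to stream the objects one at a time and keep everything decoded before the cut. This is a sketch under assumptions: it presumes the LLM emits a JSON array of page objects, and the existing truncation-recovery logic, which this spec does not show, may work differently. parseRawPages is a hypothetical name.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

type RawPage struct {
	Title   string `json:"title"`
	Type    string `json:"type"`
	Subtype string `json:"subtype"`
	Domain  string `json:"domain"`
	Content string `json:"content"`
}

// parseRawPages decodes a JSON array element by element. If the output is cut
// off mid-object, the pages decoded so far are kept and a warning is returned.
func parseRawPages(output string) ([]RawPage, []string) {
	var pages []RawPage
	var warnings []string

	dec := json.NewDecoder(strings.NewReader(output))
	// Consume the opening '[' so that Decode reads one element per call.
	if tok, err := dec.Token(); err != nil || tok != json.Delim('[') {
		return nil, []string{"output is not a JSON array"}
	}
	for dec.More() {
		var p RawPage
		if err := dec.Decode(&p); err != nil {
			warnings = append(warnings,
				fmt.Sprintf("stopped after %d pages: %v", len(pages), err))
			break
		}
		if p.Title == "" {
			warnings = append(warnings, "skipped page with missing title")
			continue
		}
		pages = append(pages, p)
	}
	return pages, warnings
}

func main() {
	truncated := `[{"title":"FinBERT","type":"concept","content":"x"},{"title":"Hugging Fa`
	pages, warnings := parseRawPages(truncated)
	fmt.Println(len(pages), "page(s) recovered;", len(warnings), "warning(s)")
}
```

Streaming with json.Decoder means a single bad trailing object does not discard the complete objects before it, which a single json.Unmarshal call over the whole array would do.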
BuildPages(rawPages []RawPage, sourceSlug, date string) []wiki.Page
Converts RawPage → wiki.Page:
- Computes slug: wiki.Slug(page.Title)
- Computes path: wiki/<type>/<slug>.md
- Assembles frontmatter:
  ---
  title: <Title>
  type: <type>
  subtype: <subtype> # omitted if empty
  domain: <domain> # omitted if empty
  created: <date>
  source: <sourceSlug> # omitted for the source page itself
  ---
- Concatenates frontmatter + content
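The conversion steps above can be sketched for a single page as follows. Several names here are assumptions: Page stands in for wiki.Page, whose real fields this spec does not list; slugify stands in for wiki.Slug; and a page is treated as "the source page itself" when its type is "source". The path follows the wiki/<type>/<slug>.md template literally.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

type RawPage struct{ Title, Type, Subtype, Domain, Content string }

// Page is a hypothetical stand-in for wiki.Page.
type Page struct{ Slug, Path, Body string }

var nonAlnum = regexp.MustCompile(`[^a-z0-9]+`)

// slugify is a hypothetical stand-in for wiki.Slug.
func slugify(s string) string {
	return strings.Trim(nonAlnum.ReplaceAllString(strings.ToLower(s), "-"), "-")
}

// buildPage assembles the frontmatter and prepends it to the LLM content.
// Note: titles are written unquoted here; a real implementation would need
// YAML quoting for titles containing characters such as ':'.
func buildPage(rp RawPage, sourceSlug, date string) Page {
	slug := slugify(rp.Title)

	var fm strings.Builder
	fm.WriteString("---\n")
	fmt.Fprintf(&fm, "title: %s\n", rp.Title)
	fmt.Fprintf(&fm, "type: %s\n", rp.Type)
	if rp.Subtype != "" {
		fmt.Fprintf(&fm, "subtype: %s\n", rp.Subtype)
	}
	if rp.Domain != "" {
		fmt.Fprintf(&fm, "domain: %s\n", rp.Domain)
	}
	fmt.Fprintf(&fm, "created: %s\n", date)
	if rp.Type != "source" { // assumption: type "source" marks the source page
		fmt.Fprintf(&fm, "source: %s\n", sourceSlug)
	}
	fm.WriteString("---\n\n")

	return Page{
		Slug: slug,
		Path: fmt.Sprintf("wiki/%s/%s.md", rp.Type, slug),
		Body: fm.String() + rp.Content,
	}
}

func main() {
	// Sample values below are illustrative only.
	p := buildPage(RawPage{
		Title: "Hugging Face", Type: "entity", Subtype: "company",
		Content: "## About\n",
	}, "finbert-paper", "2024-01-01")
	fmt.Println(p.Path)
	fmt.Print(p.Body)
}
```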
Resolve(pages []wiki.Page, inventory) []wiki.Page
Unchanged. Normalizes near-duplicate titles to existing inventory slugs.
CanonicalizeLinks(pages []wiki.Page, inventory) ([]wiki.Page, []string)
New. Builds a title→slug map from inventory + current batch. Replaces [[Display Name]] with [[slug|Display Name]] in each page's content. Titles with no known slug are left as-is and returned as warnings.
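A minimal sketch of the rewrite for one page's content is shown below. The function name canonicalize, the warning text, and exact-match (case-sensitive) title lookup are assumptions; the real step may normalize titles before looking them up.

```go
package main

import (
	"fmt"
	"regexp"
)

// plainLink matches [[Display Name]] but not links that already carry a
// slug, because the character class excludes '|'.
var plainLink = regexp.MustCompile(`\[\[([^\[\]|]+)\]\]`)

// canonicalize rewrites [[Display Name]] to [[slug|Display Name]] using a
// title-to-slug map. Unknown titles are left as-is and reported as warnings.
func canonicalize(content string, slugByTitle map[string]string) (string, []string) {
	var warnings []string
	out := plainLink.ReplaceAllStringFunc(content, func(m string) string {
		name := m[2 : len(m)-2] // strip the surrounding [[ and ]]
		slug, ok := slugByTitle[name]
		if !ok {
			warnings = append(warnings, "no slug for link: "+name)
			return m
		}
		return "[[" + slug + "|" + name + "]]"
	})
	return out, warnings
}

func main() {
	titles := map[string]string{"Hugging Face": "hugging-face"}
	out, warnings := canonicalize("- [[Sentiment Analysis]]\n- [[Hugging Face]]\n", titles)
	fmt.Print(out)
	fmt.Println(warnings)
}
```

Excluding '|' from the match makes the step idempotent: running it again over already canonical content changes nothing.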
injectSourceRefs
Unchanged. Reads [[slug|...]] links (post-canonicalization) to inject back-references.
mergeAll → write
Unchanged.
pipeline.Run signature (unchanged)
func Run(ctx context.Context, cfg Config, brainDir, content, source string, dryRun bool) (Result, error)
source is already passed (it is the display name or filename). A new internal sourceSlug is derived from it via wiki.Slug(source) before calling BuildPages, so the signature stays as it is and no API change is needed.
Files Changed
| File | Change |
|---|---|
| ingestion/internal/pipeline/parse.go | Replace ParsePages with ParseRawPages + RawPage type |
| ingestion/internal/pipeline/build.go | New file: BuildPages |
| ingestion/internal/pipeline/links.go | New file: CanonicalizeLinks |
| ingestion/internal/pipeline/pipeline.go | Wire new steps; derive sourceSlug from source |
| ingestion/internal/pipeline/prompt.go | New system prompt + BuildPrompt for new JSON format |
| brain/schema.md | Update wikilink format and JSON schema docs |
resolve.go, refs.go, backfill.go, merge.go — no changes.
Wikilink Format
- LLM output: [[Display Name]]
- Stored on disk: [[slug|Display Name]]
- CanonicalizeLinks converts between the two using the inventory
This matches Obsidian's display-alias syntax that the existing codebase already uses.
Testing Strategy
- ParseRawPages: table-driven, cover valid JSON, truncated output, unknown type, missing title
- BuildPages: table-driven, cover slug computation, frontmatter assembly, source page (no source: field), entity with subtype
- CanonicalizeLinks: cover known title → replaced, unknown title → left as-is + warning, multiple links in one page
- Integration test: full Run call with mock LLM returning new JSON format, assert no slug duplication across two chunks of the same source
Out of Scope
- Re-ingesting existing pages (user will trigger manually after deploy)
- Changing the BackfillRefs endpoint (already correct, slug-based)
- Changing the Resolve fuzzy-match algorithm