20 Commits

Author SHA1 Message Date
Mathias Bergqvist
0a70d9e972 feat(pipeline): add POST /ingest-raw for direct batch ingestion without LLM
All checks were successful
CI / Lint / Test / Vet (push) Successful in 9s
CI / Mirror to GitHub (push) Has been skipped
Allows callers to provide pre-structured RawPage data directly, bypassing the
LLM extraction step. The pipeline still handles slug computation, frontmatter,
link canonicalization, source back-references, and dedup — only the extraction
is skipped. Useful when a more capable model or manual curation produces the
structured data.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-24 11:15:59 +02:00
Mathias Bergqvist
3e9a648115 fix(pipeline): repair invalid JSON escape sequences from LLM output before parsing
All checks were successful
CI / Lint / Test / Vet (push) Successful in 11s
CI / Mirror to GitHub (push) Has been skipped
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-23 22:04:27 +02:00
Mathias Bergqvist
923a665365 fix(pipeline): skip RawPages with empty title in BuildPages instead of producing broken paths
All checks were successful
CI / Lint / Test / Vet (push) Successful in 9s
CI / Mirror to GitHub (push) Has been skipped
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-23 19:55:37 +02:00
Mathias Bergqvist
537aebc302 feat(pipeline): update system prompt for new LLM JSON contract (no slugs)
- Change prompt to reflect new output format: title, type, subtype, domain, content
- Remove slug/path generation responsibility from LLM — pipeline now handles it
- Wikilinks change from [[slug|Display Name]] to [[Display Name]] only
- LLM no longer includes frontmatter or paths in output

docs(schema): update LLM output format and wikilink convention for Level 3

- Specify JSON schema: title, type, subtype, domain, content fields
- Remove frontmatter requirements from schema output (handled by pipeline)
- Simplify wikilink format to [[Display Name]] — no slug or pipe
- Pipeline now responsible for slug generation and frontmatter construction

These changes shift slug/frontmatter generation from LLM to pipeline,
reducing cognitive load on the model and improving control over output.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-23 19:45:21 +02:00
Mathias Bergqvist
de35d4dbb0 feat(pipeline): wire ParseRawPages+BuildPages+CanonicalizeLinks into Run
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-23 19:07:33 +02:00
Mathias Bergqvist
26855f69b0 feat(pipeline): add CanonicalizeLinks — convert [[Display Name]] to [[slug|Display Name]] 2026-04-23 18:59:10 +02:00
Mathias Bergqvist
a7b363d589 fix(pipeline): quote YAML scalar fields in buildFrontmatter to prevent injection 2026-04-23 18:56:39 +02:00
Mathias Bergqvist
7b57051af8 feat(pipeline): add BuildPages — compute slugs/paths/frontmatter from RawPage 2026-04-23 18:50:37 +02:00
Mathias Bergqvist
a620f6cb01 fix(pipeline): guard empty-title bridge + skip stale integration tests until task4
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-23 18:46:07 +02:00
Mathias Bergqvist
26b5636b43 feat(pipeline): replace ParsePages with ParseRawPages + RawPage type
Strips slug authority from the LLM. The new RawPage type carries only
{title, type, subtype, domain, content} — no paths or frontmatter.
Pipeline will derive slugs deterministically (Task 4).

pipeline.go gets a temporary bridge stub (TODO task4) to keep the
package compiling between tasks.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-23 18:41:33 +02:00
Mathias Bergqvist
1605624668 feat(pipeline): add POST /backfill-refs endpoint to retroactively inject source back-references 2026-04-23 16:50:00 +02:00
Mathias Bergqvist
3c2bd9268c feat(pipeline): wire source back-reference injection into Run 2026-04-23 16:36:22 +02:00
Mathias Bergqvist
29727ec2a5 feat(pipeline): inject source back-references into concept and entity pages 2026-04-23 16:35:47 +02:00
Mathias Bergqvist
3607920601 fix(lint): resolve all errcheck violations in ingestion module
All checks were successful
cd / Build and deploy (push) Successful in 10s
CI / Lint / Test / Vet (push) Successful in 10s
CI / Mirror to GitHub (push) Successful in 3s
2026-04-23 16:20:59 +02:00
Mathias Bergqvist
53e46781b1 feat(pipeline): resolve proposed pages against inventory before writing 2026-04-23 16:00:31 +02:00
Mathias Bergqvist
e9b5cc401c feat(pipeline): add fuzzy entity resolution to prevent slug proliferation 2026-04-23 15:59:36 +02:00
Mathias Bergqvist
1b0706f270 chore(brain): rename CLAUDE.md to schema.md for clarity
CLAUDE.md has a specific meaning in the Claude Code ecosystem (agent
instructions). The wiki schema for the ingestion pipeline should live
in schema.md to avoid confusion.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-22 23:06:32 +02:00
Mathias Bergqvist
04fefe8e9c fix(ingestion): wrap naked error returns and harden mustJSON helper
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-22 22:51:19 +02:00
Mathias Bergqvist
103f4d90bf feat(ingestion): add pipeline orchestrator with prompt builder
Adds prompt.go (BuildPrompt + systemPrompt) and pipeline.go (Run, Config,
Result, mergeAll) that wire chunking, LLM calls, parse, merge, index rebuild,
and log append into a single ingestion pipeline. Includes integration tests
covering write, dry-run, and duplicate-path merge scenarios.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-22 22:45:19 +02:00
Mathias Bergqvist
9b11719481 feat(ingestion): add content chunking and LLM JSON output parser 2026-04-22 22:37:14 +02:00