Files
hyperguild/docs/superpowers/plans/2026-04-23-level3-slug-authority.md
2026-04-23 17:37:45 +02:00

1324 lines
41 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Level 3: Strip Slug Authority from LLM — Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Remove slug/path/frontmatter generation from the LLM; pipeline derives all slugs deterministically from titles via `wiki.Slug()`.
**Architecture:** Add `RawPage` type + `ParseRawPages` (LLM returns minimal JSON), `BuildPages` (computes slugs/paths/frontmatter), and `CanonicalizeLinks` (converts `[[Display Name]]``[[slug|Display Name]]`). Wire into `pipeline.Run`. Update system prompt and schema doc.
**Tech Stack:** Go 1.23, `encoding/json`, `regexp`, `strings`, `testify`
**Working directory for all commands:** `ingestion/` (the Go module root — always `cd ingestion` before running `go` commands)
---
## File Map
| File | Action | Responsibility |
|------|--------|---------------|
| `ingestion/internal/pipeline/parse.go` | Modify | Replace `ParsePages`+`wiki.Page` deserialization with `ParseRawPages`+`RawPage` type |
| `ingestion/internal/pipeline/parse_test.go` | Modify | Replace old `ParsePages` tests with `ParseRawPages` tests |
| `ingestion/internal/pipeline/build.go` | Create | `BuildPages` — computes slug, path, frontmatter from `RawPage` |
| `ingestion/internal/pipeline/build_test.go` | Create | Tests for `BuildPages` |
| `ingestion/internal/pipeline/links.go` | Create | `CanonicalizeLinks` — converts `[[Display Name]]``[[slug|Display Name]]` |
| `ingestion/internal/pipeline/links_test.go` | Create | Tests for `CanonicalizeLinks` |
| `ingestion/internal/pipeline/pipeline.go` | Modify | Wire `ParseRawPages → BuildPages → CanonicalizeLinks` into `Run`; update tests |
| `ingestion/internal/pipeline/pipeline_test.go` | Modify | Update mock LLM responses to new JSON format |
| `ingestion/internal/pipeline/prompt.go` | Modify | New system prompt + updated `BuildPrompt` for new JSON contract |
| `brain/schema.md` | Modify | Update wikilink format and JSON output format |
**Unchanged:** `resolve.go`, `refs.go`, `backfill.go`, `merge.go`, `chunk.go`, `resolve_test.go`, `refs_test.go`, `backfill_test.go`
---
## Task 1: `RawPage` type + `ParseRawPages`
**Files:**
- Modify: `ingestion/internal/pipeline/parse.go`
- Modify: `ingestion/internal/pipeline/parse_test.go`
- [ ] **Step 1: Write failing tests**
Replace the entire content of `ingestion/internal/pipeline/parse_test.go` with:
```go
// ingestion/internal/pipeline/parse_test.go
package pipeline
import (
"testing"
"github.com/stretchr/testify/assert"
"github.com/stretchr/testify/require"
)
func TestParseRawPages_ValidJSON(t *testing.T) {
input := `[{"title":"Shape Up","type":"source","subtype":"book","domain":"product-strategy","content":"## Summary\n\nFoo."},{"title":"Betting","type":"concept","content":"## Definition\n\nA technique."}]`
pages, warnings := ParseRawPages(input)
require.Len(t, pages, 2)
assert.Empty(t, warnings)
assert.Equal(t, "Shape Up", pages[0].Title)
assert.Equal(t, "source", pages[0].Type)
assert.Equal(t, "book", pages[0].Subtype)
assert.Equal(t, "product-strategy", pages[0].Domain)
assert.Equal(t, "Betting", pages[1].Title)
assert.Equal(t, "concept", pages[1].Type)
assert.Empty(t, pages[1].Subtype)
}
func TestParseRawPages_StripsFences(t *testing.T) {
input := "```json\n[{\"title\":\"Foo\",\"type\":\"concept\",\"content\":\"## Definition\\n\\nFoo.\"}]\n```"
pages, warnings := ParseRawPages(input)
require.Len(t, pages, 1)
assert.Empty(t, warnings)
assert.Equal(t, "Foo", pages[0].Title)
}
func TestParseRawPages_TruncationRecovery(t *testing.T) {
input := `[{"title":"Foo","type":"concept","content":"## Definition\n\nFoo."},{"title":"Bar","type":"concept","content":"trunc`
pages, warnings := ParseRawPages(input)
require.Len(t, pages, 1)
assert.Equal(t, "Foo", pages[0].Title)
assert.NotEmpty(t, warnings)
}
func TestParseRawPages_EmptyInput(t *testing.T) {
pages, warnings := ParseRawPages("")
assert.Empty(t, pages)
assert.NotEmpty(t, warnings)
}
func TestParseRawPages_PlainFence(t *testing.T) {
input := "```\n[{\"title\":\"Foo\",\"type\":\"concept\",\"content\":\"ok\"}]\n```"
pages, warnings := ParseRawPages(input)
require.Len(t, pages, 1)
assert.Empty(t, warnings)
}
func TestParseRawPages_MissingTitle(t *testing.T) {
// Missing title — still parsed, Title is empty string
input := `[{"type":"concept","content":"## Definition\n\nFoo."}]`
pages, warnings := ParseRawPages(input)
require.Len(t, pages, 1)
assert.Empty(t, warnings)
assert.Empty(t, pages[0].Title)
}
```
- [ ] **Step 2: Run tests to verify they fail**
```bash
cd ingestion && go test ./internal/pipeline/ -run TestParseRawPages -v
```
Expected: FAIL — `ParseRawPages` undefined, `RawPage` undefined.
- [ ] **Step 3: Replace `parse.go` with new implementation**
Replace the entire content of `ingestion/internal/pipeline/parse.go` with:
```go
// ingestion/internal/pipeline/parse.go
package pipeline
import (
"encoding/json"
"fmt"
"strings"
)
// RawPage is the LLM's output format — minimal structured data with no path or frontmatter.
// The pipeline derives slugs, paths, and frontmatter from these fields.
type RawPage struct {
Title string `json:"title"`
Type string `json:"type"` // "source" | "concept" | "entity"
Subtype string `json:"subtype"` // entity: person|company|tool|model|framework|technology; source: article|pdf|book|video|note|project
Domain string `json:"domain"`
Content string `json:"content"` // Markdown body only — no frontmatter
}
// ParseRawPages parses LLM output as a JSON array of RawPage objects.
// If the array is truncated mid-object (token limit), it salvages all complete objects.
func ParseRawPages(output string) ([]RawPage, []string) {
output = strings.TrimSpace(output)
if output == "" {
return nil, []string{"LLM returned empty output"}
}
output = stripFences(output)
var pages []RawPage
if err := json.Unmarshal([]byte(output), &pages); err == nil {
return pages, nil
}
// Truncation recovery: find last `}` that closes a complete object.
idx := strings.LastIndex(output, "}")
if idx < 0 {
return nil, []string{"LLM output contained no complete JSON objects"}
}
start := strings.Index(output, "[")
if start < 0 {
return nil, []string{"LLM output contained no JSON array opening bracket"}
}
candidate := output[start:idx+1] + "]"
if err := json.Unmarshal([]byte(candidate), &pages); err != nil {
return nil, []string{fmt.Sprintf("truncation recovery failed: %v", err)}
}
return pages, []string{fmt.Sprintf("LLM output was truncated; recovered %d page(s)", len(pages))}
}
func stripFences(s string) string {
for _, prefix := range []string{"```json\n", "```json\r\n", "```\n", "```\r\n"} {
if strings.HasPrefix(s, prefix) {
s = strings.TrimPrefix(s, prefix)
s = strings.TrimSuffix(strings.TrimSpace(s), "```")
return strings.TrimSpace(s)
}
}
return s
}
```
- [ ] **Step 4: Run tests to verify they pass**
```bash
cd ingestion && go test ./internal/pipeline/ -run TestParseRawPages -v
```
Expected: all 6 tests PASS.
- [ ] **Step 5: Verify package still compiles (pipeline.go still references old ParsePages — that's expected for now)**
```bash
cd ingestion && go build ./...
```
Expected: compile error mentioning `ParsePages` undefined — that's fine, we'll fix it in Task 4. If the error is something else, investigate.
Actually the old `ParsePages` no longer exists so `pipeline.go` will fail. That's OK — this task is intentionally breaking the pipeline step to set up for Task 4. If CI blocks on this, group Tasks 14 into a single commit.
- [ ] **Step 6: Commit**
```bash
cd ingestion && git add internal/pipeline/parse.go internal/pipeline/parse_test.go
git commit -m "feat(pipeline): replace ParsePages with ParseRawPages + RawPage type"
```
---
## Task 2: `BuildPages`
**Files:**
- Create: `ingestion/internal/pipeline/build.go`
- Create: `ingestion/internal/pipeline/build_test.go`
- [ ] **Step 1: Write failing tests**
Create `ingestion/internal/pipeline/build_test.go`:
```go
// ingestion/internal/pipeline/build_test.go
package pipeline
import (
"strings"
"testing"
"github.com/stretchr/testify/assert"
"github.com/stretchr/testify/require"
)
func TestBuildPages_SourcePage(t *testing.T) {
raw := []RawPage{
{
Title: "Shape Up",
Type: "source",
Subtype: "book",
Domain: "product-strategy",
Content: "## Summary\n\nA book about shaping product work.\n",
},
}
pages := BuildPages(raw, "shape-up", "2026-04-23")
require.Len(t, pages, 1)
p := pages[0]
assert.Equal(t, "wiki/sources/shape-up.md", p.Path)
assert.Contains(t, p.Content, "title: Shape Up")
assert.Contains(t, p.Content, "type: book")
assert.Contains(t, p.Content, "domain: product-strategy")
assert.Contains(t, p.Content, "date_ingested: 2026-04-23")
assert.Contains(t, p.Content, "last_updated: 2026-04-23")
assert.Contains(t, p.Content, "aliases:\n - Shape Up")
assert.Contains(t, p.Content, "## Summary")
assert.True(t, strings.HasPrefix(p.Content, "---\n"), "content must start with frontmatter")
}
func TestBuildPages_ConceptPage(t *testing.T) {
raw := []RawPage{
{
Title: "Betting",
Type: "concept",
Domain: "product-strategy",
Content: "## Definition\n\nA resource allocation technique.\n",
},
}
pages := BuildPages(raw, "shape-up", "2026-04-23")
require.Len(t, pages, 1)
p := pages[0]
assert.Equal(t, "wiki/concepts/betting.md", p.Path)
assert.Contains(t, p.Content, "title: Betting")
assert.Contains(t, p.Content, "domain: product-strategy")
assert.Contains(t, p.Content, "last_updated: 2026-04-23")
assert.Contains(t, p.Content, "aliases:\n - Betting")
assert.NotContains(t, p.Content, "date_ingested")
assert.Contains(t, p.Content, "## Definition")
}
func TestBuildPages_EntityPage(t *testing.T) {
raw := []RawPage{
{
Title: "Ryan Singer",
Type: "entity",
Subtype: "person",
Domain: "product-strategy",
Content: "## Description\n\nA product designer.\n",
},
}
pages := BuildPages(raw, "shape-up", "2026-04-23")
require.Len(t, pages, 1)
p := pages[0]
assert.Equal(t, "wiki/entities/ryan-singer.md", p.Path)
assert.Contains(t, p.Content, "title: Ryan Singer")
assert.Contains(t, p.Content, "type: person")
assert.Contains(t, p.Content, "domain: product-strategy")
assert.Contains(t, p.Content, "last_updated: 2026-04-23")
assert.Contains(t, p.Content, "aliases:\n - Ryan Singer")
assert.NotContains(t, p.Content, "date_ingested")
}
func TestBuildPages_SourceSlugUsedForSourcePage(t *testing.T) {
// LLM title differs from filename — pipeline uses sourceSlug for the source page path.
raw := []RawPage{
{Title: "FinBERT: A Pretrained Model", Type: "source", Subtype: "article", Content: "## Summary\n\nA model.\n"},
}
pages := BuildPages(raw, "finbert-huggingface", "2026-04-23")
require.Len(t, pages, 1)
assert.Equal(t, "wiki/sources/finbert-huggingface.md", pages[0].Path)
// Concept/entity slugs derived from title, not sourceSlug
}
func TestBuildPages_ConceptSlugDerivedFromTitle(t *testing.T) {
raw := []RawPage{
{Title: "Domain-Driven Design", Type: "concept", Content: "## Definition\n\nFoo.\n"},
}
pages := BuildPages(raw, "some-source", "2026-04-23")
require.Len(t, pages, 1)
assert.Equal(t, "wiki/concepts/domain-driven-design.md", pages[0].Path)
}
func TestBuildPages_SourceDefaultSubtype(t *testing.T) {
// If subtype is omitted for a source, default to "article"
raw := []RawPage{
{Title: "Some Post", Type: "source", Content: "## Summary\n\nA post.\n"},
}
pages := BuildPages(raw, "some-post", "2026-04-23")
require.Len(t, pages, 1)
assert.Contains(t, pages[0].Content, "type: article")
}
func TestBuildPages_OmitsDomainWhenEmpty(t *testing.T) {
raw := []RawPage{
{Title: "Betting", Type: "concept", Content: "## Definition\n\nFoo.\n"},
}
pages := BuildPages(raw, "src", "2026-04-23")
require.Len(t, pages, 1)
assert.NotContains(t, pages[0].Content, "domain:")
}
func TestBuildPages_MultiplePages(t *testing.T) {
raw := []RawPage{
{Title: "Shape Up", Type: "source", Subtype: "book", Content: "## Summary\n\nA book.\n"},
{Title: "Betting", Type: "concept", Content: "## Definition\n\nA technique.\n"},
{Title: "Ryan Singer", Type: "entity", Subtype: "person", Content: "## Description\n\nA designer.\n"},
}
pages := BuildPages(raw, "shape-up", "2026-04-23")
require.Len(t, pages, 3)
assert.Equal(t, "wiki/sources/shape-up.md", pages[0].Path)
assert.Equal(t, "wiki/concepts/betting.md", pages[1].Path)
assert.Equal(t, "wiki/entities/ryan-singer.md", pages[2].Path)
}
```
- [ ] **Step 2: Run tests to verify they fail**
```bash
cd ingestion && go test ./internal/pipeline/ -run TestBuildPages -v
```
Expected: FAIL — `BuildPages` undefined.
- [ ] **Step 3: Create `build.go`**
Create `ingestion/internal/pipeline/build.go`:
```go
// ingestion/internal/pipeline/build.go
package pipeline
import (
"fmt"
"strings"
"github.com/mathiasbq/hyperguild/ingestion/internal/wiki"
)
// BuildPages converts RawPages from the LLM into wiki.Pages with computed slugs,
// paths, and YAML frontmatter. sourceSlug is the slug of the source being ingested
// (derived from the filename, not the LLM title).
func BuildPages(rawPages []RawPage, sourceSlug, date string) []wiki.Page {
out := make([]wiki.Page, 0, len(rawPages))
for _, rp := range rawPages {
out = append(out, buildPage(rp, sourceSlug, date))
}
return out
}
func buildPage(rp RawPage, sourceSlug, date string) wiki.Page {
var slug, dir string
switch rp.Type {
case "source":
slug = sourceSlug
dir = "wiki/sources"
case "concept":
slug = wiki.Slug(rp.Title)
dir = "wiki/concepts"
case "entity":
slug = wiki.Slug(rp.Title)
dir = "wiki/entities"
default:
slug = wiki.Slug(rp.Title)
dir = "wiki/" + rp.Type
}
path := dir + "/" + slug + ".md"
fm := buildFrontmatter(rp, date)
return wiki.Page{
Path: path,
Content: fm + "\n" + rp.Content,
}
}
func buildFrontmatter(rp RawPage, date string) string {
var sb strings.Builder
sb.WriteString("---\n")
fmt.Fprintf(&sb, "title: %s\n", rp.Title)
switch rp.Type {
case "source":
subtype := rp.Subtype
if subtype == "" {
subtype = "article"
}
fmt.Fprintf(&sb, "type: %s\n", subtype)
if rp.Domain != "" {
fmt.Fprintf(&sb, "domain: %s\n", rp.Domain)
}
fmt.Fprintf(&sb, "date_ingested: %s\n", date)
fmt.Fprintf(&sb, "last_updated: %s\n", date)
case "concept":
if rp.Domain != "" {
fmt.Fprintf(&sb, "domain: %s\n", rp.Domain)
}
fmt.Fprintf(&sb, "last_updated: %s\n", date)
case "entity":
if rp.Subtype != "" {
fmt.Fprintf(&sb, "type: %s\n", rp.Subtype)
}
if rp.Domain != "" {
fmt.Fprintf(&sb, "domain: %s\n", rp.Domain)
}
fmt.Fprintf(&sb, "last_updated: %s\n", date)
default:
if rp.Domain != "" {
fmt.Fprintf(&sb, "domain: %s\n", rp.Domain)
}
fmt.Fprintf(&sb, "last_updated: %s\n", date)
}
fmt.Fprintf(&sb, "aliases:\n - %s\n", rp.Title)
sb.WriteString("---\n")
return sb.String()
}
```
- [ ] **Step 4: Run tests to verify they pass**
```bash
cd ingestion && go test ./internal/pipeline/ -run TestBuildPages -v
```
Expected: all 8 tests PASS.
- [ ] **Step 5: Commit**
```bash
git add ingestion/internal/pipeline/build.go ingestion/internal/pipeline/build_test.go
git commit -m "feat(pipeline): add BuildPages — compute slugs/paths/frontmatter from RawPage"
```
---
## Task 3: `CanonicalizeLinks`
**Files:**
- Create: `ingestion/internal/pipeline/links.go`
- Create: `ingestion/internal/pipeline/links_test.go`
- [ ] **Step 1: Write failing tests**
Create `ingestion/internal/pipeline/links_test.go`:
```go
// ingestion/internal/pipeline/links_test.go
package pipeline
import (
"testing"
"github.com/stretchr/testify/assert"
"github.com/stretchr/testify/require"
"github.com/mathiasbq/hyperguild/ingestion/internal/wiki"
)
func TestCanonicalizeLinks_KnownTitle(t *testing.T) {
pages := []wiki.Page{
{
Path: "wiki/sources/shape-up.md",
Content: "---\ntitle: Shape Up\n---\n\n## Summary\n\nSee [[Betting]].\n",
},
}
inventory := map[wiki.PageType][]wiki.Entry{
wiki.PageTypeConcept: {
{Slug: "betting", Title: "Betting"},
},
}
got, warnings := CanonicalizeLinks(pages, inventory)
require.Len(t, got, 1)
assert.Empty(t, warnings)
assert.Contains(t, got[0].Content, "[[betting|Betting]]")
assert.NotContains(t, got[0].Content, "[[Betting]]")
}
func TestCanonicalizeLinks_UnknownTitleLeftAsIs(t *testing.T) {
pages := []wiki.Page{
{
Path: "wiki/sources/shape-up.md",
Content: "---\ntitle: Shape Up\n---\n\n## Summary\n\nSee [[Ghost Concept]].\n",
},
}
inventory := map[wiki.PageType][]wiki.Entry{}
got, warnings := CanonicalizeLinks(pages, inventory)
require.Len(t, got, 1)
assert.NotEmpty(t, warnings)
assert.Contains(t, got[0].Content, "[[Ghost Concept]]")
}
func TestCanonicalizeLinks_AlreadyCanonicalLinkUntouched(t *testing.T) {
// Links already in [[slug|Display]] format must not be double-converted
pages := []wiki.Page{
{
Path: "wiki/sources/shape-up.md",
Content: "---\ntitle: Shape Up\n---\n\n## Summary\n\nSee [[betting|Betting]].\n",
},
}
inventory := map[wiki.PageType][]wiki.Entry{
wiki.PageTypeConcept: {
{Slug: "betting", Title: "Betting"},
},
}
got, warnings := CanonicalizeLinks(pages, inventory)
require.Len(t, got, 1)
assert.Empty(t, warnings)
// Should remain exactly as-is — not double-wrapped
assert.Contains(t, got[0].Content, "[[betting|Betting]]")
assert.NotContains(t, got[0].Content, "[[betting|[[betting|Betting]]]]")
}
func TestCanonicalizeLinks_CaseInsensitiveMatch(t *testing.T) {
pages := []wiki.Page{
{
Path: "wiki/sources/foo.md",
Content: "---\ntitle: Foo\n---\n\n## Summary\n\nSee [[domain driven design]].\n",
},
}
inventory := map[wiki.PageType][]wiki.Entry{
wiki.PageTypeConcept: {
{Slug: "domain-driven-design", Title: "Domain Driven Design"},
},
}
got, warnings := CanonicalizeLinks(pages, inventory)
require.Len(t, got, 1)
assert.Empty(t, warnings)
assert.Contains(t, got[0].Content, "[[domain-driven-design|domain driven design]]")
}
func TestCanonicalizeLinks_CurrentBatchPagesResolved(t *testing.T) {
// A concept created in the same batch should be canonicalizable
pages := []wiki.Page{
{
Path: "wiki/sources/shape-up.md",
Content: "---\ntitle: Shape Up\n---\n\n## Summary\n\nSee [[Betting]].\n",
},
{
Path: "wiki/concepts/betting.md",
Content: "---\ntitle: Betting\n---\n\n## Definition\n\nA technique.\n",
},
}
inventory := map[wiki.PageType][]wiki.Entry{} // empty — Betting is in the batch, not inventory
got, warnings := CanonicalizeLinks(pages, inventory)
require.Len(t, got, 2)
assert.Empty(t, warnings)
assert.Contains(t, got[0].Content, "[[betting|Betting]]")
}
func TestCanonicalizeLinks_MultipleLinksInOnePage(t *testing.T) {
pages := []wiki.Page{
{
Path: "wiki/sources/foo.md",
Content: "---\ntitle: Foo\n---\n\n## Summary\n\nSee [[Betting]] and [[Shape Up]].\n",
},
}
inventory := map[wiki.PageType][]wiki.Entry{
wiki.PageTypeConcept: {
{Slug: "betting", Title: "Betting"},
},
wiki.PageTypeSource: {
{Slug: "shape-up", Title: "Shape Up"},
},
}
got, warnings := CanonicalizeLinks(pages, inventory)
require.Len(t, got, 1)
assert.Empty(t, warnings)
assert.Contains(t, got[0].Content, "[[betting|Betting]]")
assert.Contains(t, got[0].Content, "[[shape-up|Shape Up]]")
}
```
- [ ] **Step 2: Run tests to verify they fail**
```bash
cd ingestion && go test ./internal/pipeline/ -run TestCanonicalizeLinks -v
```
Expected: FAIL — `CanonicalizeLinks` undefined.
- [ ] **Step 3: Create `links.go`**
Create `ingestion/internal/pipeline/links.go`:
```go
// ingestion/internal/pipeline/links.go
package pipeline
import (
"fmt"
"path/filepath"
"regexp"
"strings"
"github.com/mathiasbq/hyperguild/ingestion/internal/wiki"
)
// plainLinkRE matches [[Display Name]] — wikilinks without a slug pipe.
// It does NOT match [[slug|Display]] (those already have a pipe).
var plainLinkRE = regexp.MustCompile(`\[\[([^\]|]+)\]\]`)
// CanonicalizeLinks converts [[Display Name]] wikilinks to [[slug|Display Name]]
// using a title→slug map built from the inventory and current batch.
// Unknown titles are left as-is and returned as warnings.
func CanonicalizeLinks(pages []wiki.Page, inventory map[wiki.PageType][]wiki.Entry) ([]wiki.Page, []string) {
titleToSlug := buildTitleMap(pages, inventory)
var allWarnings []string
out := make([]wiki.Page, len(pages))
for i, p := range pages {
newContent, warnings := canonicalizeContent(p.Content, titleToSlug)
p.Content = newContent
out[i] = p
allWarnings = append(allWarnings, warnings...)
}
return out, allWarnings
}
// buildTitleMap builds a lowercase-title → slug map from inventory and current batch.
// Current batch entries take precedence over inventory (they may be updates).
func buildTitleMap(pages []wiki.Page, inventory map[wiki.PageType][]wiki.Entry) map[string]string {
m := make(map[string]string)
for _, entries := range inventory {
for _, e := range entries {
m[strings.ToLower(e.Title)] = e.Slug
}
}
// Current batch overrides inventory
for _, p := range pages {
title := extractTitle(p.Content)
slug := strings.TrimSuffix(filepath.Base(p.Path), ".md")
if title != "" && slug != "" {
m[strings.ToLower(title)] = slug
}
}
return m
}
func canonicalizeContent(content string, titleToSlug map[string]string) (string, []string) {
var warnings []string
result := plainLinkRE.ReplaceAllStringFunc(content, func(match string) string {
sub := plainLinkRE.FindStringSubmatch(match)
if len(sub) < 2 {
return match
}
displayName := sub[1]
slug, ok := titleToSlug[strings.ToLower(displayName)]
if !ok {
warnings = append(warnings, fmt.Sprintf("unknown wikilink: [[%s]]", displayName))
return match
}
return "[[" + slug + "|" + displayName + "]]"
})
return result, warnings
}
```
- [ ] **Step 4: Run tests to verify they pass**
```bash
cd ingestion && go test ./internal/pipeline/ -run TestCanonicalizeLinks -v
```
Expected: all 6 tests PASS.
- [ ] **Step 5: Commit**
```bash
git add ingestion/internal/pipeline/links.go ingestion/internal/pipeline/links_test.go
git commit -m "feat(pipeline): add CanonicalizeLinks — convert [[Display Name]] to [[slug|Display Name]]"
```
---
## Task 4: Wire `pipeline.go` + update `pipeline_test.go`
**Files:**
- Modify: `ingestion/internal/pipeline/pipeline.go`
- Modify: `ingestion/internal/pipeline/pipeline_test.go`
- [ ] **Step 1: Update `pipeline_test.go` to use new LLM response format**
Replace the entire content of `ingestion/internal/pipeline/pipeline_test.go` with:
```go
// ingestion/internal/pipeline/pipeline_test.go
package pipeline
import (
"context"
"encoding/json"
"net/http"
"net/http/httptest"
"os"
"path/filepath"
"testing"
"time"
"github.com/stretchr/testify/assert"
"github.com/stretchr/testify/require"
"github.com/mathiasbq/hyperguild/ingestion/internal/llm"
)
func TestRun_WritesPages(t *testing.T) {
brainDir := t.TempDir()
for _, sub := range []string{"wiki/concepts", "wiki/entities", "wiki/sources"} {
require.NoError(t, os.MkdirAll(filepath.Join(brainDir, sub), 0o755))
}
llmResponse := mustJSON([]RawPage{
{
Title: "Test Article",
Type: "source",
Subtype: "article",
Domain: "software-engineering",
Content: "## Summary\n\nA test article.\n\n## Key Claims\n\n- It tests things.\n\n## Concepts Introduced or Reinforced\n\n[[Testing]]\n\n## Entities Mentioned\n\n## Open Questions Raised\n",
},
{
Title: "Testing",
Type: "concept",
Domain: "software-engineering",
Content: "## Definition\n\nThe practice of verifying software.\n\n## Why It Matters\n\nCatches bugs.\n\n## Related Concepts\n\n## Related Entities\n\n## Sources\n\n## Evolving Notes\n",
},
})
srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
w.Header().Set("Content-Type", "application/json")
_ = json.NewEncoder(w).Encode(map[string]any{
"choices": []map[string]any{
{"message": map[string]any{"role": "assistant", "content": llmResponse}},
},
})
}))
defer srv.Close()
cfg := Config{
Complete: llm.New(srv.URL, "", "test-model", 30*time.Second).Complete,
ChunkSize: 0,
}
result, err := Run(context.Background(), cfg, brainDir, "An article about testing.", "test-article", false)
require.NoError(t, err)
assert.Len(t, result.Pages, 2)
_, err = os.Stat(filepath.Join(brainDir, "wiki", "sources", "test-article.md"))
require.NoError(t, err)
_, err = os.Stat(filepath.Join(brainDir, "wiki", "concepts", "testing.md"))
require.NoError(t, err)
_, err = os.Stat(filepath.Join(brainDir, "wiki", "index.md"))
require.NoError(t, err)
_, err = os.Stat(filepath.Join(brainDir, "log.md"))
require.NoError(t, err)
}
func TestRun_DryRunDoesNotWrite(t *testing.T) {
brainDir := t.TempDir()
for _, sub := range []string{"wiki/concepts", "wiki/entities", "wiki/sources"} {
require.NoError(t, os.MkdirAll(filepath.Join(brainDir, sub), 0o755))
}
llmResponse := mustJSON([]RawPage{{
Title: "Foo",
Type: "source",
Subtype: "article",
Content: "## Summary\n\nFoo.\n",
}})
srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
_ = json.NewEncoder(w).Encode(map[string]any{
"choices": []map[string]any{{"message": map[string]any{"content": llmResponse}}},
})
}))
defer srv.Close()
cfg := Config{Complete: llm.New(srv.URL, "", "m", 30*time.Second).Complete}
result, err := Run(context.Background(), cfg, brainDir, "foo content", "foo", true)
require.NoError(t, err)
assert.Len(t, result.Pages, 1)
_, err = os.Stat(filepath.Join(brainDir, "wiki", "sources", "foo.md"))
assert.True(t, os.IsNotExist(err))
}
func TestRun_MergesDuplicatePaths(t *testing.T) {
brainDir := t.TempDir()
for _, sub := range []string{"wiki/concepts", "wiki/entities", "wiki/sources"} {
require.NoError(t, os.MkdirAll(filepath.Join(brainDir, sub), 0o755))
}
// LLM returns same title twice (simulates multi-chunk duplicate)
llmResponse := mustJSON([]RawPage{
{Title: "Foo", Type: "concept", Content: "## Definition\n\nFirst.\n\n## Related Concepts\n\n[[Bar]]\n"},
{Title: "Foo", Type: "concept", Content: "## Definition\n\nSecond.\n\n## Related Concepts\n\n[[Baz]]\n"},
})
srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
_ = json.NewEncoder(w).Encode(map[string]any{
"choices": []map[string]any{{"message": map[string]any{"content": llmResponse}}},
})
}))
defer srv.Close()
cfg := Config{Complete: llm.New(srv.URL, "", "m", 30*time.Second).Complete}
result, err := Run(context.Background(), cfg, brainDir, "content", "foo", false)
require.NoError(t, err)
assert.Len(t, result.Pages, 1) // deduplicated
content, err := os.ReadFile(filepath.Join(brainDir, "wiki", "concepts", "foo.md"))
require.NoError(t, err)
// keep-first for Definition, union for Related Concepts
assert.Contains(t, string(content), "First.")
// Bar and Baz unknown → left as plain [[links]]
assert.Contains(t, string(content), "[[Bar]]")
assert.Contains(t, string(content), "[[Baz]]")
}
func mustJSON(v any) string {
b, err := json.Marshal(v)
if err != nil {
panic(err)
}
return string(b)
}
```
- [ ] **Step 2: Run tests to verify they fail (expected — pipeline.go still uses old ParsePages)**
```bash
cd ingestion && go test ./internal/pipeline/ -run TestRun -v
```
Expected: compile error or test failures — that's fine.
- [ ] **Step 3: Update `pipeline.go` to wire new steps**
Replace the entire content of `ingestion/internal/pipeline/pipeline.go` with:
```go
// ingestion/internal/pipeline/pipeline.go
package pipeline
import (
"context"
"fmt"
"os"
"path/filepath"
"strings"
"time"
"github.com/mathiasbq/hyperguild/ingestion/internal/wiki"
)
// CompleteFunc is the function signature for LLM calls.
type CompleteFunc func(ctx context.Context, system, user string) (string, error)
// Config holds pipeline configuration.
type Config struct {
Complete CompleteFunc
ChunkSize int // 0 = no chunking
Schema string // overrides brain/schema.md when set (useful in tests)
}
// Result is the outcome of a pipeline run.
type Result struct {
Pages []string // relative paths written (or would-be written in dry-run)
Warnings []string
}
// Run ingests content and writes structured wiki pages to brainDir/wiki/.
// In dry-run mode, pages are returned but not written to disk.
func Run(ctx context.Context, cfg Config, brainDir, content, source string, dryRun bool) (Result, error) {
inventory, err := wiki.LoadInventory(brainDir)
if err != nil {
return Result{}, fmt.Errorf("load inventory: %w", err)
}
schema := cfg.Schema
if schema == "" {
schema = loadSchema(brainDir)
}
sourceSlug := wiki.Slug(source)
date := time.Now().UTC().Format("2006-01-02")
chunks := Chunk(content, cfg.ChunkSize)
var allRaw []RawPage
var allWarnings []string
for _, chunk := range chunks {
userPrompt := BuildPrompt(schema, source, chunk, inventory)
output, err := cfg.Complete(ctx, systemPrompt, userPrompt)
if err != nil {
return Result{}, fmt.Errorf("LLM call: %w", err)
}
raw, warnings := ParseRawPages(output)
allRaw = append(allRaw, raw...)
allWarnings = append(allWarnings, warnings...)
}
pages := BuildPages(allRaw, sourceSlug, date)
resolved := Resolve(pages, inventory)
canonicalized, linkWarnings := CanonicalizeLinks(resolved, inventory)
allWarnings = append(allWarnings, linkWarnings...)
withRefs := injectSourceRefs(canonicalized, inventory, brainDir)
merged := mergeAll(withRefs)
var written []string
for _, page := range merged {
if !dryRun {
dest := filepath.Join(brainDir, filepath.FromSlash(page.Path))
if err := os.MkdirAll(filepath.Dir(dest), 0o755); err != nil {
return Result{}, fmt.Errorf("mkdir for %s: %w", page.Path, err)
}
if err := os.WriteFile(dest, []byte(page.Content), 0o644); err != nil {
return Result{}, fmt.Errorf("write %s: %w", page.Path, err)
}
}
written = append(written, page.Path)
}
if !dryRun {
if err := wiki.RebuildIndex(brainDir, date); err != nil {
allWarnings = append(allWarnings, fmt.Sprintf("rebuild index: %v", err))
}
if err := wiki.AppendLog(brainDir, source, written, allWarnings, date); err != nil {
allWarnings = append(allWarnings, fmt.Sprintf("append log: %v", err))
}
}
return Result{Pages: written, Warnings: allWarnings}, nil
}
// mergeAll deduplicates pages by path, merging content from later occurrences.
func mergeAll(pages []wiki.Page) []wiki.Page {
order := make([]string, 0, len(pages))
byPath := make(map[string]wiki.Page, len(pages))
for _, p := range pages {
if _, seen := byPath[p.Path]; !seen {
order = append(order, p.Path)
byPath[p.Path] = p
} else {
byPath[p.Path] = wiki.Merge(byPath[p.Path], p)
}
}
result := make([]wiki.Page, 0, len(order))
for _, path := range order {
result = append(result, byPath[path])
}
return result
}
const defaultSchema = `# Brain Wiki Schema
Three page types: wiki/sources/, wiki/concepts/, wiki/entities/.
See brain/schema.md for the full schema.
`
func loadSchema(brainDir string) string {
b, err := os.ReadFile(filepath.Join(brainDir, "schema.md"))
if err != nil {
return defaultSchema
}
return strings.TrimSpace(string(b))
}
```
- [ ] **Step 4: Run all pipeline tests**
```bash
cd ingestion && go test ./internal/pipeline/ -v
```
Expected: all tests PASS. If `TestRun_WritesPages` has a warning about unknown wikilink `[[Testing]]` in warnings (since the Testing concept is in the same batch and should be resolved via `CanonicalizeLinks`), check that `CanonicalizeLinks` builds the batch map correctly before checking `result.Warnings`.
Note: `TestRun_WritesPages` now calls `assert.Len(t, result.Pages, 2)` without `assert.Empty(t, result.Warnings)` — unknown-link warnings from `CanonicalizeLinks` are acceptable since "Testing" is in the batch and should canonicalize cleanly. If the test fails with unexpected warnings, investigate `buildTitleMap` — it should find "Testing" from the current batch.
- [ ] **Step 5: Run full test suite**
```bash
cd ingestion && go test ./... 2>&1
```
Expected: all packages pass.
- [ ] **Step 6: Commit**
```bash
git add ingestion/internal/pipeline/pipeline.go ingestion/internal/pipeline/pipeline_test.go
git commit -m "feat(pipeline): wire ParseRawPages+BuildPages+CanonicalizeLinks into Run"
```
---
## Task 5: Update `prompt.go`
**Files:**
- Modify: `ingestion/internal/pipeline/prompt.go`
- [ ] **Step 1: Replace `prompt.go`**
Replace the entire content of `ingestion/internal/pipeline/prompt.go` with:
```go
// ingestion/internal/pipeline/prompt.go
package pipeline
import (
"fmt"
"strings"
"time"
"github.com/mathiasbq/hyperguild/ingestion/internal/wiki"
)
const systemPrompt = `You are a wiki agent. Read the source material and produce structured wiki pages following the schema provided.
Output ONLY a valid JSON array — no markdown fences, no other text before or after.
Each element must have exactly these fields:
"title" — exact page title (e.g. "FinBERT", "Ryan Singer", "Shape Up")
"type" — exactly one of: "source", "concept", "entity"
"subtype" — for source: article|pdf|book|video|note|project; for entity: person|company|tool|model|framework|technology; omit for concept
"domain" — one of the domains in the schema (omit if none fits)
"content" — Markdown body only — NO frontmatter, NO path, NO slug
Wikilinks in content: [[Display Name]] — just the display name, no slug, no pipe separator.
Only link to pages listed in the inventory or pages you are creating in this response.`
// BuildPrompt constructs the user prompt for a single chunk.
func BuildPrompt(schema, source, content string, inventory map[wiki.PageType][]wiki.Entry) string {
var sb strings.Builder
fmt.Fprintf(&sb, "Today's date is %s.\n\n", time.Now().UTC().Format("2006-01-02"))
sb.WriteString("## Schema\n\n")
sb.WriteString(schema)
sb.WriteString("\n\n")
sb.WriteString("## Existing wiki pages\n\n")
sb.WriteString("Reference these pages by display name only — [[Display Name]] — in your content.\n\n")
for _, pt := range []wiki.PageType{wiki.PageTypeConcept, wiki.PageTypeEntity, wiki.PageTypeSource} {
entries := inventory[pt]
label := strings.ToUpper(string(pt)[:1]) + string(pt)[1:]
if len(entries) == 0 {
fmt.Fprintf(&sb, "%s — (none yet)\n\n", label)
continue
}
fmt.Fprintf(&sb, "%s:\n", label)
for _, e := range entries {
fmt.Fprintf(&sb, " - %s\n", e.Title)
}
sb.WriteString("\n")
}
sb.WriteString("## Non-negotiable rules\n\n")
sb.WriteString("1. Output ONLY a valid JSON array — no prose, no fences.\n")
sb.WriteString("2. Fields: title, type, subtype (if applicable), domain (if applicable), content.\n")
sb.WriteString("3. Wikilinks: [[Display Name]] — no slug, no pipe. The pipeline handles slugs.\n")
sb.WriteString("4. Section links must match their section type (Related Concepts → concepts only, etc.).\n")
sb.WriteString("5. One source page per book — if inventory shows it exists, return it as an UPDATE.\n\n")
fmt.Fprintf(&sb, "## Source: %s\n\n", source)
sb.WriteString(content)
return sb.String()
}
```
- [ ] **Step 2: Verify compile and tests still pass**
```bash
cd ingestion && go test ./... 2>&1
```
Expected: all tests PASS (prompt.go has no direct unit tests — covered via integration tests in pipeline_test.go).
- [ ] **Step 3: Commit**
```bash
git add ingestion/internal/pipeline/prompt.go
git commit -m "feat(pipeline): update system prompt for new LLM JSON contract (no slugs)"
```
---
## Task 6: Update `brain/schema.md`
**Files:**
- Modify: `brain/schema.md`
- [ ] **Step 1: Update the schema doc**
Replace the entire content of `brain/schema.md` with:
```markdown
# Brain Wiki Schema
This document defines the three page types in the brain wiki.
The LLM must follow this schema exactly when generating wiki pages.
## Output Format
Return a JSON array. Each element:
```json
{
"title": "exact page title",
"type": "source | concept | entity",
"subtype": "see below — omit for concept",
"domain": "see domains — omit if none fits",
"content": "Markdown body only — no frontmatter, no path"
}
```
- `subtype` for **source**: `article | pdf | book | video | note | project`
- `subtype` for **entity**: `person | company | tool | model | framework | technology`
- The pipeline computes slugs and frontmatter — never include them in output.
## Wikilink Format
All cross-references use `[[Display Name]]` — just the display name, no slug, no pipe.
Rules:
- Only link to pages in the inventory or pages you are creating in this response
- The pipeline converts `[[Display Name]]` to `[[slug|Display Name]]` automatically
- Section links must match their section type (Related Concepts → concept pages only, etc.)
Examples: `[[Domain Driven Design]]`, `[[Ryan Singer]]`, `[[Shape Up]]`
## Domains
Use one of: `ai-llm`, `software-engineering`, `product-strategy`, `finance-markets`,
`personal`, `consulting`, `climate`, `infrastructure`, `security`
---
## Source Pages — wiki/sources/<slug>.md
One page per ingested source. Books are NEVER split across multiple source pages — update the existing one.
Body sections (in this order):
### Summary
23 sentences. Core argument or finding.
### Key Claims
Bulleted list. Paraphrase — no verbatim quotes or code.
### Concepts Introduced or Reinforced
Wikilinks to concept pages ONLY. One per line.
### Entities Mentioned
Wikilinks to entity pages ONLY. One per line.
### Open Questions Raised
Gaps or follow-up questions from this source.
For books only, also add:
### Chapters
One bullet per chapter with 12 sentence summary.
### Argument Arc
Overall narrative as it becomes clear across chapters.
### Updates
Dated entries appended on re-ingestion. NEVER rewrite — only append.
---
## Concept Pages — wiki/concepts/<slug>.md
One page per idea, framework, methodology, or pattern.
Body sections (in this order):
### Definition
One-paragraph plain-language explanation.
### Why It Matters
Practical significance. Why should anyone care?
### Related Concepts
Wikilinks to concept pages ONLY.
### Related Entities
Wikilinks to entity pages ONLY.
### Sources
Wikilinks to source pages ONLY.
### Evolving Notes
Updated as new sources arrive. Append, do not rewrite.
---
## Entity Pages — wiki/entities/<slug>.md
One page per person, tool, organisation, technology, or product.
Body sections (in this order):
### Description
One-line description.
### Relevance
Why this entity matters to this knowledge base.
### Key Positions, Products, or Claims
With dates where known.
### Related Concepts
Wikilinks to concept pages ONLY.
### Related Entities
Wikilinks to entity pages ONLY.
### Sources
Wikilinks to source pages ONLY.
---
## Non-Negotiable Rules
1. Output ONLY a valid JSON array — no markdown fences, no prose before or after
2. Each element: `{"title": "...", "type": "...", "subtype": "...", "domain": "...", "content": "..."}`
3. Never include slugs, paths, or frontmatter in output — the pipeline handles these
4. Wikilinks: `[[Display Name]]` only — no pipe, no slug
5. Dates always YYYY-MM-DD (used only in content body where contextually relevant)
6. Never reproduce verbatim code — describe the pattern or technique
7. Section links must match their section type
8. One source page per book — if inventory shows it exists, include it as an UPDATE
```
- [ ] **Step 2: Verify Go tests still pass (schema.md is loaded at runtime, not compile time)**
```bash
cd ingestion && go test ./... 2>&1
```
Expected: all tests PASS.
- [ ] **Step 3: Commit**
```bash
git add brain/schema.md
git commit -m "docs(schema): update LLM output format and wikilink convention for Level 3"
```
---
## Self-Review
**Spec coverage check:**
| Spec requirement | Task |
|-----------------|------|
| LLM returns `{title, type, subtype, domain, content}` | Task 1 (RawPage), Task 5 (prompt) |
| Pipeline computes all slugs via `wiki.Slug(title)` | Task 2 (BuildPages) |
| Source page uses `sourceSlug` from filename | Task 2 (buildPage: `case "source": slug = sourceSlug`) |
| Frontmatter assembled by pipeline | Task 2 (buildFrontmatter) |
| `[[Display Name]]``[[slug|Display Name]]` canonicalization | Task 3 (CanonicalizeLinks) |
| CanonicalizeLinks runs after BuildPages, before injectSourceRefs | Task 4 (pipeline.go step order) |
| Unknown titles left as-is + warnings | Task 3 (links_test: UnknownTitleLeftAsIs) |
| Current-batch pages resolvable | Task 3 (buildTitleMap includes batch) |
| `injectSourceRefs` unchanged | — (no task needed) |
| `Resolve` unchanged | — (no task needed) |
| `brain/schema.md` updated | Task 6 |
| Integration test for no slug duplication across chunks | Task 4 (TestRun_MergesDuplicatePaths uses same title twice) |
All spec requirements covered.
**Placeholder scan:** None found.
**Type consistency:**
- `RawPage` defined in Task 1, used in Tasks 2, 4 ✓
- `BuildPages([]RawPage, string, string) []wiki.Page` defined in Task 2, called in Task 4 ✓
- `CanonicalizeLinks([]wiki.Page, map[wiki.PageType][]wiki.Entry) ([]wiki.Page, []string)` defined in Task 3, called in Task 4 ✓
- `ParseRawPages(string) ([]RawPage, []string)` defined in Task 1, called in Task 4 ✓
- `extractTitle` used in Task 3 (`links.go`) — defined in `resolve.go` (same package) ✓