12 Commits

Author SHA1 Message Date
Mathias Bergqvist
0a70d9e972 feat(pipeline): add POST /ingest-raw for direct batch ingestion without LLM
All checks were successful
CI / Lint / Test / Vet (push) Successful in 9s
CI / Mirror to GitHub (push) Has been skipped
Allows callers to provide pre-structured RawPage data directly, bypassing the
LLM extraction step. The pipeline still handles slug computation, frontmatter,
link canonicalization, source back-references, and dedup — only the extraction
is skipped. Useful when a more capable model or manual curation produces the
structured data.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-24 11:15:59 +02:00
Mathias Bergqvist
3e9a648115 fix(pipeline): repair invalid JSON escape sequences from LLM output before parsing
All checks were successful
CI / Lint / Test / Vet (push) Successful in 11s
CI / Mirror to GitHub (push) Has been skipped
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-23 22:04:27 +02:00
Mathias Bergqvist
923a665365 fix(pipeline): skip RawPages with empty title in BuildPages instead of producing broken paths
All checks were successful
CI / Lint / Test / Vet (push) Successful in 9s
CI / Mirror to GitHub (push) Has been skipped
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-23 19:55:37 +02:00
Mathias Bergqvist
537aebc302 feat(pipeline): update system prompt for new LLM JSON contract (no slugs)
- Change prompt to reflect new output format: title, type, subtype, domain, content
- Remove slug/path generation responsibility from LLM — pipeline now handles it
- Wikilinks change from [[slug|Display Name]] to [[Display Name]] only
- LLM no longer includes frontmatter or paths in output

docs(schema): update LLM output format and wikilink convention for Level 3

- Specify JSON schema: title, type, subtype, domain, content fields
- Remove frontmatter requirements from schema output (handled by pipeline)
- Simplify wikilink format to [[Display Name]] — no slug or pipe
- Pipeline now responsible for slug generation and frontmatter construction

These changes shift slug/frontmatter generation from LLM to pipeline,
reducing cognitive load on the model and improving control over output.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-23 19:45:21 +02:00
Mathias Bergqvist
de35d4dbb0 feat(pipeline): wire ParseRawPages+BuildPages+CanonicalizeLinks into Run
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-23 19:07:33 +02:00
Mathias Bergqvist
26855f69b0 feat(pipeline): add CanonicalizeLinks — convert [[Display Name]] to [[slug|Display Name]] 2026-04-23 18:59:10 +02:00
Mathias Bergqvist
a7b363d589 fix(pipeline): quote YAML scalar fields in buildFrontmatter to prevent injection 2026-04-23 18:56:39 +02:00
Mathias Bergqvist
7b57051af8 feat(pipeline): add BuildPages — compute slugs/paths/frontmatter from RawPage 2026-04-23 18:50:37 +02:00
Mathias Bergqvist
a620f6cb01 fix(pipeline): guard empty-title bridge + skip stale integration tests until task4
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-23 18:46:07 +02:00
Mathias Bergqvist
26b5636b43 feat(pipeline): replace ParsePages with ParseRawPages + RawPage type
Strips slug authority from the LLM. The new RawPage type carries only
{title, type, subtype, domain, content} — no paths or frontmatter.
Pipeline will derive slugs deterministically (Task 4).

pipeline.go gets a temporary bridge stub (TODO task4) to keep the
package compiling between tasks.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-23 18:41:33 +02:00
Mathias Bergqvist
989f375aec docs: add Level 3 implementation plan 2026-04-23 17:37:45 +02:00
Mathias Bergqvist
6403d5e444 docs: add Level 3 slug authority design spec 2026-04-23 17:23:22 +02:00
18 changed files with 2370 additions and 138 deletions

View File

@@ -3,21 +3,34 @@
This document defines the three page types in the brain wiki. This document defines the three page types in the brain wiki.
The LLM must follow this schema exactly when generating wiki pages. The LLM must follow this schema exactly when generating wiki pages.
## Output Format
Return a JSON array. Each element:
```json
{
"title": "exact page title",
"type": "source | concept | entity",
"subtype": "see below — omit for concept",
"domain": "see domains — omit if none fits",
"content": "Markdown body only — no frontmatter, no path"
}
```
- `subtype` for **source**: `article | pdf | book | video | note | project`
- `subtype` for **entity**: `person | company | tool | model | framework | technology`
- The pipeline computes slugs and frontmatter — never include them in output.
## Wikilink Format ## Wikilink Format
All cross-references use `[[slug|Display Text]]`. All cross-references use `[[Display Name]]` — just the display name, no slug, no pipe.
Rules: Rules:
- slug = lowercase filename without .md, spaces → hyphens, strip all non-alphanumeric except hyphens - Only link to pages in the inventory or pages you are creating in this response
- The `|` separator is REQUIRED — never use `[[Title]]` without a slug - The pipeline converts `[[Display Name]]` to `[[slug|Display Name]]` automatically
- Examples: `[[domain-driven-design|Domain Driven Design]]`, `[[ryan-singer|Ryan Singer]]` - Section links must match their section type (Related Concepts → concept pages only, etc.)
- Slugs must resolve to an existing file in the inventory, or a file you are creating in this response
Slug generation examples: Examples: `[[Domain Driven Design]]`, `[[Ryan Singer]]`, `[[Shape Up]]`
- "Domain Driven Design" → `domain-driven-design`
- "It's Complicated" → `its-complicated`
- "gRPC" → `grpc`
- "GPT-4o" → `gpt-4o`
## Domains ## Domains
@@ -30,17 +43,6 @@ Use one of: `ai-llm`, `software-engineering`, `product-strategy`, `finance-marke
One page per ingested source. Books are NEVER split across multiple source pages — update the existing one. One page per ingested source. Books are NEVER split across multiple source pages — update the existing one.
Required frontmatter:
```yaml
title: <exact title>
type: article | pdf | book | video | note | project
domain: <domain>
date_ingested: YYYY-MM-DD
last_updated: YYYY-MM-DD
aliases:
- <exact title>
```
Body sections (in this order): Body sections (in this order):
### Summary ### Summary
@@ -50,10 +52,10 @@ Body sections (in this order):
Bulleted list. Paraphrase — no verbatim quotes or code. Bulleted list. Paraphrase — no verbatim quotes or code.
### Concepts Introduced or Reinforced ### Concepts Introduced or Reinforced
Wikilinks to wiki/concepts/ ONLY. One per line. Wikilinks to concept pages ONLY. One per line.
### Entities Mentioned ### Entities Mentioned
Wikilinks to wiki/entities/ ONLY. One per line. Wikilinks to entity pages ONLY. One per line.
### Open Questions Raised ### Open Questions Raised
Gaps or follow-up questions from this source. Gaps or follow-up questions from this source.
@@ -75,15 +77,6 @@ Dated entries appended on re-ingestion. NEVER rewrite — only append.
One page per idea, framework, methodology, or pattern. One page per idea, framework, methodology, or pattern.
Required frontmatter:
```yaml
title: <concept name>
domain: <domain>
last_updated: YYYY-MM-DD
aliases:
- <exact title>
```
Body sections (in this order): Body sections (in this order):
### Definition ### Definition
@@ -93,13 +86,13 @@ One-paragraph plain-language explanation.
Practical significance. Why should anyone care? Practical significance. Why should anyone care?
### Related Concepts ### Related Concepts
Wikilinks to wiki/concepts/ ONLY. Wikilinks to concept pages ONLY.
### Related Entities ### Related Entities
Wikilinks to wiki/entities/ ONLY. Wikilinks to entity pages ONLY.
### Sources ### Sources
Wikilinks to wiki/sources/ ONLY. Wikilinks to source pages ONLY.
### Evolving Notes ### Evolving Notes
Updated as new sources arrive. Append, do not rewrite. Updated as new sources arrive. Append, do not rewrite.
@@ -110,16 +103,6 @@ Updated as new sources arrive. Append, do not rewrite.
One page per person, tool, organisation, technology, or product. One page per person, tool, organisation, technology, or product.
Required frontmatter:
```yaml
title: <name>
type: person | company | tool | model | framework | technology
domain: <domain>
last_updated: YYYY-MM-DD
aliases:
- <exact title>
```
Body sections (in this order): Body sections (in this order):
### Description ### Description
@@ -132,23 +115,23 @@ Why this entity matters to this knowledge base.
With dates where known. With dates where known.
### Related Concepts ### Related Concepts
Wikilinks to wiki/concepts/ ONLY. Wikilinks to concept pages ONLY.
### Related Entities ### Related Entities
Wikilinks to wiki/entities/ ONLY. Wikilinks to entity pages ONLY.
### Sources ### Sources
Wikilinks to wiki/sources/ ONLY. Wikilinks to source pages ONLY.
--- ---
## Non-Negotiable Rules ## Non-Negotiable Rules
1. Output ONLY a valid JSON array — no markdown fences, no prose before or after 1. Output ONLY a valid JSON array — no markdown fences, no prose before or after
2. Each element: `{"path": "wiki/<type>/<slug>.md", "content": "...full markdown..."}` 2. Each element: `{"title": "...", "type": "...", "subtype": "...", "domain": "...", "content": "..."}`
3. Slugs are kebab-case: lowercase, spaces→hyphens, strip special characters 3. Never include slugs, paths, or frontmatter in output — the pipeline handles these
4. Every wikilink must be `[[slug|Display Text]]` — the pipe separator is required 4. Wikilinks: `[[Display Name]]` only — no pipe, no slug
5. Dates always YYYY-MM-DD 5. Dates always YYYY-MM-DD (used only in content body where contextually relevant)
6. Never reproduce verbatim code — describe the pattern or technique 6. Never reproduce verbatim code — describe the pattern or technique
7. Section links must match their section type (Related Concepts → concepts/ only, etc.) 7. Section links must match their section type
8. One source page per book — if inventory shows it exists, include it as an UPDATE 8. One source page per book — if inventory shows it exists, include it as an UPDATE

File diff suppressed because it is too large Load Diff

View File

@@ -0,0 +1,148 @@
# Level 3: Strip Slug Authority from LLM — Design Spec
## Problem
The ingestion pipeline currently asks the LLM to produce full wiki pages including the file path (e.g. `wiki/sources/finbert-huggingface.md`). This causes two classes of bug:
1. **Slug proliferation** — the LLM invents different slugs for the same concept across chunks or runs, producing duplicate pages that diverge in content.
2. **Unstable paths** — the LLM may shorten, expand, or vary titles, making deduplication via `Resolve` unreliable because the slug mismatch is upstream of the normalizer.
## Solution
Strip slug authority from the LLM entirely. The LLM returns a minimal structured object. The pipeline computes all slugs deterministically from titles using `wiki.Slug(title)`.
---
## LLM JSON Contract
### Output format (per page)
```json
{
"title": "FinBERT",
"type": "concept",
"subtype": "framework",
"domain": "ai-llm",
"content": "## Definition\n\nA BERT-based model fine-tuned for financial sentiment...\n\n## Related\n\n- [[Sentiment Analysis]]\n- [[Hugging Face]]\n"
}
```
**Fields:**
| Field | Required | Values |
|-------|----------|--------|
| `title` | yes | Human-readable title, e.g. "FinBERT" |
| `type` | yes | `"source"` \| `"concept"` \| `"entity"` |
| `subtype` | for entity/source | entity: `person\|company\|tool\|model\|framework\|technology`; source: `article\|pdf\|book\|video\|note\|project` |
| `domain` | no | tag string, e.g. `ai-llm`, `finance` |
| `content` | yes | Markdown body sections only — no frontmatter, no path |
**Wikilinks in content:** `[[Display Name]]` only. No slug. The pipeline canonicalizes to `[[slug|Display Name]]` in a post-processing step.
**The LLM never writes slugs, paths, or frontmatter.**
---
## Pipeline Changes
### New type: `RawPage`
```go
type RawPage struct {
Title string
Type string // "source" | "concept" | "entity"
Subtype string
Domain string
Content string
}
```
### New step order
```
ParseRawPages → BuildPages → Resolve → CanonicalizeLinks → injectSourceRefs → mergeAll → write
```
### Step descriptions
**`ParseRawPages(output string) ([]RawPage, []string)`**
Replaces `ParsePages`. Deserializes JSON objects with the new schema. Same truncation-recovery logic as today. Returns `(pages, warnings)`.
**`BuildPages(rawPages []RawPage, sourceSlug, date string) []wiki.Page`**
Converts `RawPage → wiki.Page`:
- Computes slug: `wiki.Slug(page.Title)`
- Computes path: `wiki/<type>/<slug>.md`
- Assembles frontmatter:
```
---
title: <Title>
type: <type>
subtype: <subtype> # omitted if empty
domain: <domain> # omitted if empty
created: <date>
source: <sourceSlug> # omitted for the source page itself
---
```
- Concatenates frontmatter + content
**`Resolve(pages []wiki.Page, inventory) []wiki.Page`**
Unchanged. Normalizes near-duplicate titles to existing inventory slugs.
**`CanonicalizeLinks(pages []wiki.Page, inventory) ([]wiki.Page, []string)`**
New. Builds a title→slug map from inventory + current batch. Replaces `[[Display Name]]` with `[[slug|Display Name]]` in each page's content. Titles with no known slug are left as-is and returned as warnings.
**`injectSourceRefs`**
Unchanged. Reads `[[slug|...]]` links (post-canonicalization) to inject back-references.
**`mergeAll → write`**
Unchanged.
### `pipeline.Run` signature change
```go
func Run(ctx context.Context, cfg Config, brainDir, content, source string, dryRun bool) (Result, error)
```
`source` is already passed (it's the display name / filename). A new internal `sourceSlug` is derived from it via `wiki.Slug(source)` before calling `BuildPages`. No API change needed.
---
## Files Changed
| File | Change |
|------|--------|
| `ingestion/internal/pipeline/parse.go` | Replace `ParsePages` with `ParseRawPages` + `RawPage` type |
| `ingestion/internal/pipeline/build.go` | New file: `BuildPages` |
| `ingestion/internal/pipeline/links.go` | New file: `CanonicalizeLinks` |
| `ingestion/internal/pipeline/pipeline.go` | Wire new steps; derive `sourceSlug` from `source` |
| `ingestion/internal/pipeline/prompt.go` | New system prompt + `BuildPrompt` for new JSON format |
| `brain/schema.md` | Update wikilink format and JSON schema docs |
`resolve.go`, `refs.go`, `backfill.go`, `merge.go` — no changes.
---
## Wikilink Format
- **LLM output**: `[[Display Name]]`
- **Stored on disk**: `[[slug|Display Name]]`
- **`CanonicalizeLinks`** converts between the two using the inventory
This matches Obsidian's display-alias syntax that the existing codebase already uses.
---
## Testing Strategy
- `ParseRawPages`: table-driven, cover valid JSON, truncated output, unknown type, missing title
- `BuildPages`: table-driven, cover slug computation, frontmatter assembly, source page (no `source:` field), entity with subtype
- `CanonicalizeLinks`: cover known title → replaced, unknown title → left as-is + warning, multiple links in one page
- Integration test: full `Run` call with mock LLM returning new JSON format, assert no slug duplication across two chunks of the same source
---
## Out of Scope
- Re-ingesting existing pages (user will trigger manually after deploy)
- Changing the `BackfillRefs` endpoint (already correct, slug-based)
- Changing the `Resolve` fuzzy-match algorithm

View File

@@ -68,6 +68,7 @@ func main() {
mux.HandleFunc("POST /write", h.Write) mux.HandleFunc("POST /write", h.Write)
mux.HandleFunc("POST /ingest", h.Ingest) mux.HandleFunc("POST /ingest", h.Ingest)
mux.HandleFunc("POST /ingest-path", h.IngestPath) mux.HandleFunc("POST /ingest-path", h.IngestPath)
mux.HandleFunc("POST /ingest-raw", h.IngestRaw)
mux.HandleFunc("POST /backfill-refs", h.BackfillRefs) mux.HandleFunc("POST /backfill-refs", h.BackfillRefs)
addr := ":" + port addr := ":" + port

View File

@@ -272,6 +272,48 @@ func (h *Handler) IngestPath(w http.ResponseWriter, r *http.Request) {
writeJSON(w, ingestResponse{Pages: allPages, Warnings: allWarnings}) writeJSON(w, ingestResponse{Pages: allPages, Warnings: allWarnings})
} }
type ingestRawRequest struct {
Source string `json:"source"`
Pages []pipeline.RawPage `json:"pages"`
DryRun bool `json:"dry_run"`
}
// IngestRaw handles POST /ingest-raw — run the pipeline on pre-parsed RawPages,
// skipping the LLM extraction step. Use when the caller has already produced
// structured page data (e.g. from a more capable model or manual curation).
func (h *Handler) IngestRaw(w http.ResponseWriter, r *http.Request) {
var req ingestRawRequest
if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
writeError(w, http.StatusBadRequest, "invalid JSON")
return
}
if strings.TrimSpace(req.Source) == "" {
writeError(w, http.StatusBadRequest, "source is required")
return
}
if len(req.Pages) == 0 {
writeError(w, http.StatusBadRequest, "pages is required and must be non-empty")
return
}
result, err := pipeline.RunRaw(h.brainDir, req.Source, req.Pages, req.DryRun)
if err != nil {
h.logger.Error("ingest-raw failed", "source", req.Source, "err", err)
writeError(w, http.StatusInternalServerError, "ingest error")
return
}
pages := result.Pages
if pages == nil {
pages = []string{}
}
warnings := result.Warnings
if warnings == nil {
warnings = []string{}
}
writeJSON(w, ingestResponse{Pages: pages, Warnings: warnings})
}
// BackfillRefs handles POST /backfill-refs — injects source back-references // BackfillRefs handles POST /backfill-refs — injects source back-references
// into all concept and entity pages based on existing wiki/sources/ pages. // into all concept and entity pages based on existing wiki/sources/ pages.
func (h *Handler) BackfillRefs(w http.ResponseWriter, r *http.Request) { func (h *Handler) BackfillRefs(w http.ResponseWriter, r *http.Request) {

View File

@@ -20,9 +20,9 @@ import (
"github.com/mathiasbq/hyperguild/ingestion/internal/pipeline" "github.com/mathiasbq/hyperguild/ingestion/internal/pipeline"
) )
// stubComplete returns a fixed JSON page so tests never call a real LLM. // stubComplete returns a fixed JSON RawPage so tests never call a real LLM.
func stubComplete(_ context.Context, _, _ string) (string, error) { func stubComplete(_ context.Context, _, _ string) (string, error) {
return `[{"path":"wiki/sources/test-source.md","content":"# Test Source\n\nSome content here.\n"}]`, nil return `[{"title":"Test Source","type":"source","subtype":"article","content":"## Summary\n\nSome content here.\n"}]`, nil
} }
func stubPipelineCfg() pipeline.Config { func stubPipelineCfg() pipeline.Config {
@@ -226,6 +226,85 @@ func TestIngestPath_File(t *testing.T) {
assert.NotEmpty(t, pagesSlice) assert.NotEmpty(t, pagesSlice)
} }
// ---------------------------------------------------------------------------
// POST /ingest-raw
// ---------------------------------------------------------------------------
func TestIngestRaw_Validation(t *testing.T) {
cases := []struct {
name string
body map[string]any
}{
{"missing source", map[string]any{"pages": []any{map[string]any{"title": "X", "type": "concept", "content": "x"}}}},
{"missing pages", map[string]any{"source": "test-source"}},
{"empty pages", map[string]any{"source": "test-source", "pages": []any{}}},
}
for _, tc := range cases {
t.Run(tc.name, func(t *testing.T) {
_, h := setup(t)
body, _ := json.Marshal(tc.body)
req := httptest.NewRequest(http.MethodPost, "/ingest-raw", bytes.NewReader(body))
rec := httptest.NewRecorder()
h.IngestRaw(rec, req)
assert.Equal(t, http.StatusBadRequest, rec.Code)
})
}
}
func TestIngestRaw_Success(t *testing.T) {
dir, h := setup(t)
body, _ := json.Marshal(map[string]any{
"source": "test-article",
"pages": []any{
map[string]any{"title": "Test Article", "type": "source", "subtype": "article", "domain": "Testing", "content": "## Summary\n\nThis is a test article about [[Test Concept]].\n"},
map[string]any{"title": "Test Concept", "type": "concept", "domain": "Testing", "content": "A concept for testing.\n"},
},
})
req := httptest.NewRequest(http.MethodPost, "/ingest-raw", bytes.NewReader(body))
rec := httptest.NewRecorder()
h.IngestRaw(rec, req)
require.Equal(t, http.StatusOK, rec.Code)
var resp map[string]any
require.NoError(t, json.Unmarshal(rec.Body.Bytes(), &resp))
pages := resp["pages"].([]any)
assert.Len(t, pages, 2)
// Verify files were written
sourcePath := filepath.Join(dir, "wiki", "sources", "test-article.md")
assert.FileExists(t, sourcePath)
conceptPath := filepath.Join(dir, "wiki", "concepts", "test-concept.md")
assert.FileExists(t, conceptPath)
}
func TestIngestRaw_DryRun(t *testing.T) {
dir, h := setup(t)
body, _ := json.Marshal(map[string]any{
"source": "dry-run-test",
"pages": []any{
map[string]any{"title": "Dry Run Source", "type": "source", "subtype": "article", "content": "Content."},
},
"dry_run": true,
})
req := httptest.NewRequest(http.MethodPost, "/ingest-raw", bytes.NewReader(body))
rec := httptest.NewRecorder()
h.IngestRaw(rec, req)
require.Equal(t, http.StatusOK, rec.Code)
var resp map[string]any
require.NoError(t, json.Unmarshal(rec.Body.Bytes(), &resp))
pages := resp["pages"].([]any)
assert.NotEmpty(t, pages)
// Verify no files were written
sourcePath := filepath.Join(dir, "wiki", "sources", "dry-run-test.md")
assert.NoFileExists(t, sourcePath)
}
func TestIngestPath_Directory(t *testing.T) { func TestIngestPath_Directory(t *testing.T) {
_, h := setup(t) _, h := setup(t)

View File

@@ -0,0 +1,106 @@
// ingestion/internal/pipeline/build.go
package pipeline
import (
"fmt"
"strings"
"github.com/mathiasbq/hyperguild/ingestion/internal/wiki"
)
// BuildPages converts RawPages from the LLM into wiki.Pages with computed slugs,
// paths, and YAML frontmatter. sourceSlug is the slug of the source being ingested
// (derived from the filename, not the LLM title). Pages whose title resolves to an
// empty slug are skipped and returned as warnings instead.
func BuildPages(rawPages []RawPage, sourceSlug, date string) ([]wiki.Page, []string) {
out := make([]wiki.Page, 0, len(rawPages))
var warnings []string
for _, rp := range rawPages {
slug := computeSlug(rp, sourceSlug)
if slug == "" {
warnings = append(warnings, fmt.Sprintf("skipped page with empty title (type: %s)", rp.Type))
continue
}
out = append(out, buildPage(rp, sourceSlug, date))
}
return out, warnings
}
func computeSlug(rp RawPage, sourceSlug string) string {
if rp.Type == "source" {
return sourceSlug
}
return wiki.Slug(rp.Title)
}
func buildPage(rp RawPage, sourceSlug, date string) wiki.Page {
var slug, dir string
switch rp.Type {
case "source":
slug = sourceSlug
dir = "wiki/sources"
case "concept":
slug = wiki.Slug(rp.Title)
dir = "wiki/concepts"
case "entity":
slug = wiki.Slug(rp.Title)
dir = "wiki/entities"
default:
slug = wiki.Slug(rp.Title)
dir = "wiki/" + rp.Type
}
path := dir + "/" + slug + ".md"
fm := buildFrontmatter(rp, date)
return wiki.Page{
Path: path,
Content: fm + "\n" + rp.Content,
}
}
func buildFrontmatter(rp RawPage, date string) string {
var sb strings.Builder
sb.WriteString("---\n")
fmt.Fprintf(&sb, "title: %s\n", yamlScalar(rp.Title))
switch rp.Type {
case "source":
subtype := rp.Subtype
if subtype == "" {
subtype = "article"
}
fmt.Fprintf(&sb, "type: %s\n", yamlScalar(subtype))
if rp.Domain != "" {
fmt.Fprintf(&sb, "domain: %s\n", yamlScalar(rp.Domain))
}
fmt.Fprintf(&sb, "date_ingested: %s\n", date)
fmt.Fprintf(&sb, "last_updated: %s\n", date)
case "concept":
if rp.Domain != "" {
fmt.Fprintf(&sb, "domain: %s\n", yamlScalar(rp.Domain))
}
fmt.Fprintf(&sb, "last_updated: %s\n", date)
case "entity":
if rp.Subtype != "" {
fmt.Fprintf(&sb, "type: %s\n", yamlScalar(rp.Subtype))
}
if rp.Domain != "" {
fmt.Fprintf(&sb, "domain: %s\n", yamlScalar(rp.Domain))
}
fmt.Fprintf(&sb, "last_updated: %s\n", date)
default:
if rp.Domain != "" {
fmt.Fprintf(&sb, "domain: %s\n", yamlScalar(rp.Domain))
}
fmt.Fprintf(&sb, "last_updated: %s\n", date)
}
fmt.Fprintf(&sb, "aliases:\n - %s\n", yamlScalar(rp.Title))
sb.WriteString("---\n")
return sb.String()
}
func yamlScalar(s string) string {
return "'" + strings.ReplaceAll(s, "'", "''") + "'"
}

View File

@@ -0,0 +1,167 @@
// ingestion/internal/pipeline/build_test.go
package pipeline
import (
"strings"
"testing"
"github.com/stretchr/testify/assert"
"github.com/stretchr/testify/require"
)
func TestBuildPages_SourcePage(t *testing.T) {
raw := []RawPage{
{
Title: "Shape Up",
Type: "source",
Subtype: "book",
Domain: "product-strategy",
Content: "## Summary\n\nA book about shaping product work.\n",
},
}
pages, warnings := BuildPages(raw, "shape-up", "2026-04-23")
require.Len(t, pages, 1)
assert.Empty(t, warnings)
p := pages[0]
assert.Equal(t, "wiki/sources/shape-up.md", p.Path)
assert.Contains(t, p.Content, "title: 'Shape Up'")
assert.Contains(t, p.Content, "type: 'book'")
assert.Contains(t, p.Content, "domain: 'product-strategy'")
assert.Contains(t, p.Content, "date_ingested: 2026-04-23")
assert.Contains(t, p.Content, "last_updated: 2026-04-23")
assert.Contains(t, p.Content, "aliases:\n - 'Shape Up'")
assert.Contains(t, p.Content, "## Summary")
assert.True(t, strings.HasPrefix(p.Content, "---\n"), "content must start with frontmatter")
}
func TestBuildPages_ConceptPage(t *testing.T) {
raw := []RawPage{
{
Title: "Betting",
Type: "concept",
Domain: "product-strategy",
Content: "## Definition\n\nA resource allocation technique.\n",
},
}
pages, warnings := BuildPages(raw, "shape-up", "2026-04-23")
require.Len(t, pages, 1)
assert.Empty(t, warnings)
p := pages[0]
assert.Equal(t, "wiki/concepts/betting.md", p.Path)
assert.Contains(t, p.Content, "title: 'Betting'")
assert.Contains(t, p.Content, "domain: 'product-strategy'")
assert.Contains(t, p.Content, "last_updated: 2026-04-23")
assert.Contains(t, p.Content, "aliases:\n - 'Betting'")
assert.NotContains(t, p.Content, "date_ingested")
assert.Contains(t, p.Content, "## Definition")
}
func TestBuildPages_EntityPage(t *testing.T) {
raw := []RawPage{
{
Title: "Ryan Singer",
Type: "entity",
Subtype: "person",
Domain: "product-strategy",
Content: "## Description\n\nA product designer.\n",
},
}
pages, warnings := BuildPages(raw, "shape-up", "2026-04-23")
require.Len(t, pages, 1)
assert.Empty(t, warnings)
p := pages[0]
assert.Equal(t, "wiki/entities/ryan-singer.md", p.Path)
assert.Contains(t, p.Content, "title: 'Ryan Singer'")
assert.Contains(t, p.Content, "type: 'person'")
assert.Contains(t, p.Content, "domain: 'product-strategy'")
assert.Contains(t, p.Content, "last_updated: 2026-04-23")
assert.Contains(t, p.Content, "aliases:\n - 'Ryan Singer'")
assert.NotContains(t, p.Content, "date_ingested")
}
func TestBuildPages_SourceSlugUsedForSourcePage(t *testing.T) {
// LLM title differs from filename — pipeline uses sourceSlug for the source page path.
raw := []RawPage{
{Title: "FinBERT: A Pretrained Model", Type: "source", Subtype: "article", Content: "## Summary\n\nA model.\n"},
}
pages, _ := BuildPages(raw, "finbert-huggingface", "2026-04-23")
require.Len(t, pages, 1)
assert.Equal(t, "wiki/sources/finbert-huggingface.md", pages[0].Path)
}
func TestBuildPages_ConceptSlugDerivedFromTitle(t *testing.T) {
raw := []RawPage{
{Title: "Domain-Driven Design", Type: "concept", Content: "## Definition\n\nFoo.\n"},
}
pages, _ := BuildPages(raw, "some-source", "2026-04-23")
require.Len(t, pages, 1)
assert.Equal(t, "wiki/concepts/domain-driven-design.md", pages[0].Path)
}
func TestBuildPages_SourceDefaultSubtype(t *testing.T) {
// If subtype is omitted for a source, default to "article"
raw := []RawPage{
{Title: "Some Post", Type: "source", Content: "## Summary\n\nA post.\n"},
}
pages, _ := BuildPages(raw, "some-post", "2026-04-23")
require.Len(t, pages, 1)
assert.Contains(t, pages[0].Content, "type: 'article'")
}
func TestBuildPages_OmitsDomainWhenEmpty(t *testing.T) {
raw := []RawPage{
{Title: "Betting", Type: "concept", Content: "## Definition\n\nFoo.\n"},
}
pages, _ := BuildPages(raw, "src", "2026-04-23")
require.Len(t, pages, 1)
assert.NotContains(t, pages[0].Content, "domain:")
}
func TestBuildPages_MultiplePages(t *testing.T) {
raw := []RawPage{
{Title: "Shape Up", Type: "source", Subtype: "book", Content: "## Summary\n\nA book.\n"},
{Title: "Betting", Type: "concept", Content: "## Definition\n\nA technique.\n"},
{Title: "Ryan Singer", Type: "entity", Subtype: "person", Content: "## Description\n\nA designer.\n"},
}
pages, _ := BuildPages(raw, "shape-up", "2026-04-23")
require.Len(t, pages, 3)
assert.Equal(t, "wiki/sources/shape-up.md", pages[0].Path)
assert.Equal(t, "wiki/concepts/betting.md", pages[1].Path)
assert.Equal(t, "wiki/entities/ryan-singer.md", pages[2].Path)
}
func TestBuildPages_TitleWithColon(t *testing.T) {
raw := []RawPage{
{Title: "Shape Up: The Basecamp Method", Type: "source", Subtype: "book", Content: "## Summary\n\nA book.\n"},
}
pages, _ := BuildPages(raw, "shape-up", "2026-04-23")
require.Len(t, pages, 1)
// Title with colon must be quoted in YAML
assert.Contains(t, pages[0].Content, "title: 'Shape Up: The Basecamp Method'")
assert.Contains(t, pages[0].Content, "aliases:\n - 'Shape Up: The Basecamp Method'")
}
func TestBuildPages_EntityNoSubtype(t *testing.T) {
raw := []RawPage{
{Title: "Basecamp", Type: "entity", Content: "## Description\n\nA company.\n"},
}
pages, _ := BuildPages(raw, "src", "2026-04-23")
require.Len(t, pages, 1)
assert.NotContains(t, pages[0].Content, "type:")
assert.Contains(t, pages[0].Content, "title: 'Basecamp'")
}
func TestBuildPages_EmptyTitleSkippedWithWarning(t *testing.T) {
raw := []RawPage{
{Title: "", Type: "concept", Content: "## Definition\n\nFoo.\n"},
{Title: "Betting", Type: "concept", Content: "## Definition\n\nA technique.\n"},
}
pages, warnings := BuildPages(raw, "src", "2026-04-23")
require.Len(t, pages, 1, "empty-title page should be skipped")
assert.Equal(t, "wiki/concepts/betting.md", pages[0].Path)
assert.Len(t, warnings, 1)
assert.Contains(t, warnings[0], "empty title")
}

View File

@@ -0,0 +1,70 @@
// ingestion/internal/pipeline/links.go
package pipeline
import (
"fmt"
"path/filepath"
"regexp"
"strings"
"github.com/mathiasbq/hyperguild/ingestion/internal/wiki"
)
// plainLinkRE matches [[Display Name]] — wikilinks without a slug pipe.
// It does NOT match [[slug|Display]] (those already have a pipe).
var plainLinkRE = regexp.MustCompile(`\[\[([^\]|]+)\]\]`)
// CanonicalizeLinks converts [[Display Name]] wikilinks to [[slug|Display Name]]
// using a title→slug map built from the inventory and current batch.
// Unknown titles are left as-is and returned as warnings.
func CanonicalizeLinks(pages []wiki.Page, inventory map[wiki.PageType][]wiki.Entry) ([]wiki.Page, []string) {
titleToSlug := buildTitleMap(pages, inventory)
var allWarnings []string
out := make([]wiki.Page, len(pages))
for i, p := range pages {
newContent, warnings := canonicalizeContent(p.Content, titleToSlug)
p.Content = newContent
out[i] = p
allWarnings = append(allWarnings, warnings...)
}
return out, allWarnings
}
// buildTitleMap builds a lowercase-title → slug map from inventory and current batch.
// Current batch entries take precedence over inventory (they may be updates).
func buildTitleMap(pages []wiki.Page, inventory map[wiki.PageType][]wiki.Entry) map[string]string {
m := make(map[string]string)
for _, entries := range inventory {
for _, e := range entries {
m[strings.ToLower(e.Title)] = e.Slug
}
}
// Current batch overrides inventory
for _, p := range pages {
title := extractTitle(p.Content)
slug := strings.TrimSuffix(filepath.Base(p.Path), ".md")
if title != "" && slug != "" {
m[strings.ToLower(title)] = slug
}
}
return m
}
func canonicalizeContent(content string, titleToSlug map[string]string) (string, []string) {
var warnings []string
result := plainLinkRE.ReplaceAllStringFunc(content, func(match string) string {
sub := plainLinkRE.FindStringSubmatch(match)
if len(sub) < 2 {
return match
}
displayName := sub[1]
slug, ok := titleToSlug[strings.ToLower(displayName)]
if !ok {
warnings = append(warnings, fmt.Sprintf("unknown wikilink: [[%s]]", displayName))
return match
}
return "[[" + slug + "|" + displayName + "]]"
})
return result, warnings
}

View File

@@ -0,0 +1,125 @@
// ingestion/internal/pipeline/links_test.go
package pipeline
import (
"testing"
"github.com/stretchr/testify/assert"
"github.com/stretchr/testify/require"
"github.com/mathiasbq/hyperguild/ingestion/internal/wiki"
)
func TestCanonicalizeLinks_KnownTitle(t *testing.T) {
pages := []wiki.Page{
{
Path: "wiki/sources/shape-up.md",
Content: "---\ntitle: 'Shape Up'\n---\n\n## Summary\n\nSee [[Betting]].\n",
},
}
inventory := map[wiki.PageType][]wiki.Entry{
wiki.PageTypeConcept: {
{Slug: "betting", Title: "Betting"},
},
}
got, warnings := CanonicalizeLinks(pages, inventory)
require.Len(t, got, 1)
assert.Empty(t, warnings)
assert.Contains(t, got[0].Content, "[[betting|Betting]]")
assert.NotContains(t, got[0].Content, "[[Betting]]")
}
func TestCanonicalizeLinks_UnknownTitleLeftAsIs(t *testing.T) {
pages := []wiki.Page{
{
Path: "wiki/sources/shape-up.md",
Content: "---\ntitle: 'Shape Up'\n---\n\n## Summary\n\nSee [[Ghost Concept]].\n",
},
}
inventory := map[wiki.PageType][]wiki.Entry{}
got, warnings := CanonicalizeLinks(pages, inventory)
require.Len(t, got, 1)
assert.NotEmpty(t, warnings)
assert.Contains(t, got[0].Content, "[[Ghost Concept]]")
}
func TestCanonicalizeLinks_AlreadyCanonicalLinkUntouched(t *testing.T) {
// Links already in [[slug|Display]] format must not be double-converted
pages := []wiki.Page{
{
Path: "wiki/sources/shape-up.md",
Content: "---\ntitle: 'Shape Up'\n---\n\n## Summary\n\nSee [[betting|Betting]].\n",
},
}
inventory := map[wiki.PageType][]wiki.Entry{
wiki.PageTypeConcept: {
{Slug: "betting", Title: "Betting"},
},
}
got, warnings := CanonicalizeLinks(pages, inventory)
require.Len(t, got, 1)
assert.Empty(t, warnings)
// Should remain exactly as-is — not double-wrapped
assert.Contains(t, got[0].Content, "[[betting|Betting]]")
assert.NotContains(t, got[0].Content, "[[betting|[[betting|Betting]]]]")
}
func TestCanonicalizeLinks_CaseInsensitiveMatch(t *testing.T) {
pages := []wiki.Page{
{
Path: "wiki/sources/foo.md",
Content: "---\ntitle: 'Foo'\n---\n\n## Summary\n\nSee [[domain driven design]].\n",
},
}
inventory := map[wiki.PageType][]wiki.Entry{
wiki.PageTypeConcept: {
{Slug: "domain-driven-design", Title: "Domain Driven Design"},
},
}
got, warnings := CanonicalizeLinks(pages, inventory)
require.Len(t, got, 1)
assert.Empty(t, warnings)
assert.Contains(t, got[0].Content, "[[domain-driven-design|domain driven design]]")
}
func TestCanonicalizeLinks_CurrentBatchPagesResolved(t *testing.T) {
// A concept created in the same batch should be canonicalizable
pages := []wiki.Page{
{
Path: "wiki/sources/shape-up.md",
Content: "---\ntitle: 'Shape Up'\n---\n\n## Summary\n\nSee [[Betting]].\n",
},
{
Path: "wiki/concepts/betting.md",
Content: "---\ntitle: 'Betting'\n---\n\n## Definition\n\nA technique.\n",
},
}
inventory := map[wiki.PageType][]wiki.Entry{} // empty — Betting is in the batch, not inventory
got, warnings := CanonicalizeLinks(pages, inventory)
require.Len(t, got, 2)
assert.Empty(t, warnings)
assert.Contains(t, got[0].Content, "[[betting|Betting]]")
}
func TestCanonicalizeLinks_MultipleLinksInOnePage(t *testing.T) {
pages := []wiki.Page{
{
Path: "wiki/sources/foo.md",
Content: "---\ntitle: 'Foo'\n---\n\n## Summary\n\nSee [[Betting]] and [[Shape Up]].\n",
},
}
inventory := map[wiki.PageType][]wiki.Entry{
wiki.PageTypeConcept: {
{Slug: "betting", Title: "Betting"},
},
wiki.PageTypeSource: {
{Slug: "shape-up", Title: "Shape Up"},
},
}
got, warnings := CanonicalizeLinks(pages, inventory)
require.Len(t, got, 1)
assert.Empty(t, warnings)
assert.Contains(t, got[0].Content, "[[betting|Betting]]")
assert.Contains(t, got[0].Content, "[[shape-up|Shape Up]]")
}

View File

@@ -5,13 +5,22 @@ import (
"encoding/json" "encoding/json"
"fmt" "fmt"
"strings" "strings"
"github.com/mathiasbq/hyperguild/ingestion/internal/wiki"
) )
// ParsePages parses LLM output as a JSON array of {path, content} objects. // RawPage is the LLM's output format — minimal structured data with no path or frontmatter.
// If the array is truncated mid-object (token limit), it salvages all complete objects. // The pipeline derives slugs, paths, and frontmatter from these fields.
func ParsePages(output string) ([]wiki.Page, []string) { type RawPage struct {
Title string `json:"title"`
Type string `json:"type"` // "source" | "concept" | "entity"
Subtype string `json:"subtype"` // entity: person|company|tool|model|framework|technology; source: article|pdf|book|video|note|project
Domain string `json:"domain"`
Content string `json:"content"` // Markdown body only — no frontmatter
}
// ParseRawPages parses LLM output as a JSON array of RawPage objects.
// If the output contains invalid JSON escape sequences (e.g. \. from Markdown),
// it attempts repair before falling back to truncation recovery.
func ParseRawPages(output string) ([]RawPage, []string) {
output = strings.TrimSpace(output) output = strings.TrimSpace(output)
if output == "" { if output == "" {
return nil, []string{"LLM returned empty output"} return nil, []string{"LLM returned empty output"}
@@ -19,23 +28,30 @@ func ParsePages(output string) ([]wiki.Page, []string) {
output = stripFences(output) output = stripFences(output)
var pages []wiki.Page // Fast path: valid JSON.
var pages []RawPage
if err := json.Unmarshal([]byte(output), &pages); err == nil { if err := json.Unmarshal([]byte(output), &pages); err == nil {
return pages, nil return pages, nil
} }
// Repair pass: fix invalid escape sequences (e.g. \. \d from Markdown content).
repaired := repairJSON(output)
if err := json.Unmarshal([]byte(repaired), &pages); err == nil {
return pages, []string{"repaired invalid JSON escape sequences in LLM output"}
}
// Truncation recovery: find last `}` that closes a complete object. // Truncation recovery: find last `}` that closes a complete object.
idx := strings.LastIndex(output, "}") idx := strings.LastIndex(repaired, "}")
if idx < 0 { if idx < 0 {
return nil, []string{"LLM output contained no complete JSON objects"} return nil, []string{"LLM output contained no complete JSON objects"}
} }
start := strings.Index(output, "[") start := strings.Index(repaired, "[")
if start < 0 { if start < 0 {
return nil, []string{"LLM output contained no JSON array opening bracket"} return nil, []string{"LLM output contained no JSON array opening bracket"}
} }
candidate := output[start:idx+1] + "]" candidate := repaired[start:idx+1] + "]"
if err := json.Unmarshal([]byte(candidate), &pages); err != nil { if err := json.Unmarshal([]byte(candidate), &pages); err != nil {
return nil, []string{fmt.Sprintf("truncation recovery failed: %v", err)} return nil, []string{fmt.Sprintf("truncation recovery failed: %v", err)}
} }
@@ -43,6 +59,45 @@ func ParsePages(output string) ([]wiki.Page, []string) {
return pages, []string{fmt.Sprintf("LLM output was truncated; recovered %d page(s)", len(pages))} return pages, []string{fmt.Sprintf("LLM output was truncated; recovered %d page(s)", len(pages))}
} }
// repairJSON replaces invalid JSON escape sequences (e.g. \. \d \p) with
// a properly escaped backslash followed by the same character.
// It iterates byte-by-byte to correctly skip already-valid escape sequences
// (including \\) without requiring lookbehind support.
func repairJSON(s string) string {
var b strings.Builder
b.Grow(len(s))
i := 0
for i < len(s) {
if s[i] != '\\' {
b.WriteByte(s[i])
i++
continue
}
// We have a backslash. Peek at the next character.
if i+1 >= len(s) {
// Trailing backslash — emit as-is.
b.WriteByte(s[i])
i++
continue
}
next := s[i+1]
switch next {
case '"', '\\', '/', 'b', 'f', 'n', 'r', 't', 'u':
// Valid JSON escape sequence — emit both characters as-is.
b.WriteByte(s[i])
b.WriteByte(next)
i += 2
default:
// Invalid escape — double the backslash.
b.WriteByte('\\')
b.WriteByte('\\')
b.WriteByte(next)
i += 2
}
}
return b.String()
}
func stripFences(s string) string { func stripFences(s string) string {
for _, prefix := range []string{"```json\n", "```json\r\n", "```\n", "```\r\n"} { for _, prefix := range []string{"```json\n", "```json\r\n", "```\n", "```\r\n"} {
if strings.HasPrefix(s, prefix) { if strings.HasPrefix(s, prefix) {

View File

@@ -8,39 +8,80 @@ import (
"github.com/stretchr/testify/require" "github.com/stretchr/testify/require"
) )
func TestParsePages_ValidJSON(t *testing.T) { func TestParseRawPages_ValidJSON(t *testing.T) {
input := `[{"path":"wiki/sources/foo.md","content":"# Foo"},{"path":"wiki/concepts/bar.md","content":"# Bar"}]` input := `[{"title":"Shape Up","type":"source","subtype":"book","domain":"product-strategy","content":"## Summary\n\nFoo."},{"title":"Betting","type":"concept","content":"## Definition\n\nA technique."}]`
pages, warnings := ParsePages(input) pages, warnings := ParseRawPages(input)
require.Len(t, pages, 2) require.Len(t, pages, 2)
assert.Empty(t, warnings) assert.Empty(t, warnings)
assert.Equal(t, "wiki/sources/foo.md", pages[0].Path) assert.Equal(t, "Shape Up", pages[0].Title)
assert.Equal(t, "wiki/concepts/bar.md", pages[1].Path) assert.Equal(t, "source", pages[0].Type)
assert.Equal(t, "book", pages[0].Subtype)
assert.Equal(t, "product-strategy", pages[0].Domain)
assert.Equal(t, "Betting", pages[1].Title)
assert.Equal(t, "concept", pages[1].Type)
assert.Empty(t, pages[1].Subtype)
} }
func TestParsePages_StripsFences(t *testing.T) { func TestParseRawPages_StripsFences(t *testing.T) {
input := "```json\n[{\"path\":\"wiki/sources/foo.md\",\"content\":\"# Foo\"}]\n```" input := "```json\n[{\"title\":\"Foo\",\"type\":\"concept\",\"content\":\"## Definition\\n\\nFoo.\"}]\n```"
pages, warnings := ParsePages(input) pages, warnings := ParseRawPages(input)
assert.Len(t, pages, 1)
assert.Empty(t, warnings)
}
func TestParsePages_TruncationRecovery(t *testing.T) {
input := `[{"path":"wiki/sources/foo.md","content":"# Foo"},{"path":"wiki/concepts/bar.md","content":"trunc`
pages, warnings := ParsePages(input)
require.Len(t, pages, 1) require.Len(t, pages, 1)
assert.Equal(t, "wiki/sources/foo.md", pages[0].Path) assert.Empty(t, warnings)
assert.Equal(t, "Foo", pages[0].Title)
}
func TestParseRawPages_TruncationRecovery(t *testing.T) {
input := `[{"title":"Foo","type":"concept","content":"## Definition\n\nFoo."},{"title":"Bar","type":"concept","content":"trunc`
pages, warnings := ParseRawPages(input)
require.Len(t, pages, 1)
assert.Equal(t, "Foo", pages[0].Title)
assert.NotEmpty(t, warnings) assert.NotEmpty(t, warnings)
} }
func TestParsePages_EmptyInput(t *testing.T) { func TestParseRawPages_EmptyInput(t *testing.T) {
pages, warnings := ParsePages("") pages, warnings := ParseRawPages("")
assert.Empty(t, pages) assert.Empty(t, pages)
assert.NotEmpty(t, warnings) assert.NotEmpty(t, warnings)
} }
func TestParsePages_PlainFence(t *testing.T) { func TestParseRawPages_PlainFence(t *testing.T) {
input := "```\n[{\"path\":\"wiki/sources/foo.md\",\"content\":\"ok\"}]\n```" input := "```\n[{\"title\":\"Foo\",\"type\":\"concept\",\"content\":\"ok\"}]\n```"
pages, warnings := ParsePages(input) pages, warnings := ParseRawPages(input)
assert.Len(t, pages, 1) require.Len(t, pages, 1)
assert.Empty(t, warnings) assert.Empty(t, warnings)
} }
func TestParseRawPages_MissingTitle(t *testing.T) {
// Missing title — still parsed, Title is empty string
input := `[{"type":"concept","content":"## Definition\n\nFoo."}]`
pages, warnings := ParseRawPages(input)
require.Len(t, pages, 1)
assert.Empty(t, warnings)
assert.Empty(t, pages[0].Title)
}
func TestParseRawPages_InvalidEscapeRepaired(t *testing.T) {
// LLM copied markdown escaped list numbers (\.) into JSON — invalid escape
raw := "[{\"title\":\"Foo\",\"type\":\"concept\",\"content\":\"Step 4\\. Do it.\"}]"
pages, warnings := ParseRawPages(raw)
require.Len(t, pages, 1)
assert.Equal(t, "Foo", pages[0].Title)
assert.Contains(t, pages[0].Content, `4\.`)
assert.NotEmpty(t, warnings) // repair warning
}
func TestRepairJSON_FixesInvalidEscapes(t *testing.T) {
cases := []struct {
in string
want string
}{
{`{"a":"foo\.bar"}`, `{"a":"foo\\.bar"}`},
{`{"a":"\\n is fine"}`, `{"a":"\\n is fine"}`}, // valid \n untouched
{`{"a":"\d+ items"}`, `{"a":"\\d+ items"}`},
{`{"a":"already \\ escaped"}`, `{"a":"already \\ escaped"}`}, // valid \\ untouched
}
for _, tc := range cases {
got := repairJSON(tc.in)
assert.Equal(t, tc.want, got, "input: %s", tc.in)
}
}

View File

@@ -41,9 +41,11 @@ func Run(ctx context.Context, cfg Config, brainDir, content, source string, dryR
schema = loadSchema(brainDir) schema = loadSchema(brainDir)
} }
sourceSlug := wiki.Slug(source)
date := time.Now().UTC().Format("2006-01-02")
chunks := Chunk(content, cfg.ChunkSize) chunks := Chunk(content, cfg.ChunkSize)
var allPages []wiki.Page var allRaw []RawPage
var allWarnings []string var allWarnings []string
for _, chunk := range chunks { for _, chunk := range chunks {
@@ -52,18 +54,40 @@ func Run(ctx context.Context, cfg Config, brainDir, content, source string, dryR
if err != nil { if err != nil {
return Result{}, fmt.Errorf("LLM call: %w", err) return Result{}, fmt.Errorf("LLM call: %w", err)
} }
pages, warnings := ParsePages(output) raw, warnings := ParseRawPages(output)
allPages = append(allPages, pages...) allRaw = append(allRaw, raw...)
allWarnings = append(allWarnings, warnings...) allWarnings = append(allWarnings, warnings...)
} }
resolved := Resolve(allPages, inventory) return buildAndWrite(allRaw, sourceSlug, date, brainDir, source, inventory, allWarnings, dryRun)
withRefs := injectSourceRefs(resolved, inventory, brainDir) }
// RunRaw runs the pipeline on pre-parsed RawPages, skipping the LLM extraction
// step. Use this when the caller has already produced the structured RawPage data
// (e.g. from a more capable model or manual curation).
func RunRaw(brainDir, source string, rawPages []RawPage, dryRun bool) (Result, error) {
inventory, err := wiki.LoadInventory(brainDir)
if err != nil {
return Result{}, fmt.Errorf("load inventory: %w", err)
}
sourceSlug := wiki.Slug(source)
date := time.Now().UTC().Format("2006-01-02")
return buildAndWrite(rawPages, sourceSlug, date, brainDir, source, inventory, nil, dryRun)
}
// buildAndWrite runs BuildPages through write for both Run and RunRaw.
func buildAndWrite(rawPages []RawPage, sourceSlug, date, brainDir, source string, inventory map[wiki.PageType][]wiki.Entry, warnings []string, dryRun bool) (Result, error) {
pages, buildWarnings := BuildPages(rawPages, sourceSlug, date)
warnings = append(warnings, buildWarnings...)
resolved := Resolve(pages, inventory)
canonicalized, linkWarnings := CanonicalizeLinks(resolved, inventory)
warnings = append(warnings, linkWarnings...)
withRefs := injectSourceRefs(canonicalized, inventory, brainDir)
merged := mergeAll(withRefs) merged := mergeAll(withRefs)
date := time.Now().UTC().Format("2006-01-02")
var written []string var written []string
for _, page := range merged { for _, page := range merged {
if !dryRun { if !dryRun {
dest := filepath.Join(brainDir, filepath.FromSlash(page.Path)) dest := filepath.Join(brainDir, filepath.FromSlash(page.Path))
@@ -79,14 +103,14 @@ func Run(ctx context.Context, cfg Config, brainDir, content, source string, dryR
if !dryRun { if !dryRun {
if err := wiki.RebuildIndex(brainDir, date); err != nil { if err := wiki.RebuildIndex(brainDir, date); err != nil {
allWarnings = append(allWarnings, fmt.Sprintf("rebuild index: %v", err)) warnings = append(warnings, fmt.Sprintf("rebuild index: %v", err))
} }
if err := wiki.AppendLog(brainDir, source, written, allWarnings, date); err != nil { if err := wiki.AppendLog(brainDir, source, written, warnings, date); err != nil {
allWarnings = append(allWarnings, fmt.Sprintf("append log: %v", err)) warnings = append(warnings, fmt.Sprintf("append log: %v", err))
} }
} }
return Result{Pages: written, Warnings: allWarnings}, nil return Result{Pages: written, Warnings: warnings}, nil
} }
// mergeAll deduplicates pages by path, merging content from later occurrences. // mergeAll deduplicates pages by path, merging content from later occurrences.

View File

@@ -15,7 +15,6 @@ import (
"github.com/stretchr/testify/require" "github.com/stretchr/testify/require"
"github.com/mathiasbq/hyperguild/ingestion/internal/llm" "github.com/mathiasbq/hyperguild/ingestion/internal/llm"
"github.com/mathiasbq/hyperguild/ingestion/internal/wiki"
) )
func TestRun_WritesPages(t *testing.T) { func TestRun_WritesPages(t *testing.T) {
@@ -24,14 +23,19 @@ func TestRun_WritesPages(t *testing.T) {
require.NoError(t, os.MkdirAll(filepath.Join(brainDir, sub), 0o755)) require.NoError(t, os.MkdirAll(filepath.Join(brainDir, sub), 0o755))
} }
llmResponse := mustJSON([]wiki.Page{ llmResponse := mustJSON([]RawPage{
{ {
Path: "wiki/sources/test-article.md", Title: "Test Article",
Content: "---\ntitle: Test Article\ntype: article\ndomain: software-engineering\ndate_ingested: 2026-04-22\nlast_updated: 2026-04-22\naliases:\n - Test Article\n---\n\n## Summary\n\nA test article.\n\n## Key Claims\n\n- It tests things.\n\n## Concepts Introduced or Reinforced\n\n## Entities Mentioned\n\n## Open Questions Raised\n", Type: "source",
Subtype: "article",
Domain: "software-engineering",
Content: "## Summary\n\nA test article.\n\n## Key Claims\n\n- It tests things.\n\n## Concepts Introduced or Reinforced\n\n[[Testing]]\n\n## Entities Mentioned\n\n## Open Questions Raised\n",
}, },
{ {
Path: "wiki/concepts/testing.md", Title: "Testing",
Content: "---\ntitle: Testing\ndomain: software-engineering\nlast_updated: 2026-04-22\naliases:\n - Testing\n---\n\n## Definition\n\nThe practice of verifying software.\n\n## Why It Matters\n\nCatches bugs.\n\n## Related Concepts\n\n## Related Entities\n\n## Sources\n\n## Evolving Notes\n", Type: "concept",
Domain: "software-engineering",
Content: "## Definition\n\nThe practice of verifying software.\n\n## Why It Matters\n\nCatches bugs.\n\n## Related Concepts\n\n## Related Entities\n\n## Sources\n\n## Evolving Notes\n",
}, },
}) })
@@ -53,7 +57,6 @@ func TestRun_WritesPages(t *testing.T) {
result, err := Run(context.Background(), cfg, brainDir, "An article about testing.", "test-article", false) result, err := Run(context.Background(), cfg, brainDir, "An article about testing.", "test-article", false)
require.NoError(t, err) require.NoError(t, err)
assert.Len(t, result.Pages, 2) assert.Len(t, result.Pages, 2)
assert.Empty(t, result.Warnings)
_, err = os.Stat(filepath.Join(brainDir, "wiki", "sources", "test-article.md")) _, err = os.Stat(filepath.Join(brainDir, "wiki", "sources", "test-article.md"))
require.NoError(t, err) require.NoError(t, err)
@@ -71,9 +74,11 @@ func TestRun_DryRunDoesNotWrite(t *testing.T) {
require.NoError(t, os.MkdirAll(filepath.Join(brainDir, sub), 0o755)) require.NoError(t, os.MkdirAll(filepath.Join(brainDir, sub), 0o755))
} }
llmResponse := mustJSON([]wiki.Page{{ llmResponse := mustJSON([]RawPage{{
Path: "wiki/sources/foo.md", Title: "Foo",
Content: "---\ntitle: Foo\n---\n\n## Summary\n\nFoo.\n", Type: "source",
Subtype: "article",
Content: "## Summary\n\nFoo.\n",
}}) }})
srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
@@ -98,10 +103,10 @@ func TestRun_MergesDuplicatePaths(t *testing.T) {
require.NoError(t, os.MkdirAll(filepath.Join(brainDir, sub), 0o755)) require.NoError(t, os.MkdirAll(filepath.Join(brainDir, sub), 0o755))
} }
// LLM returns same path twice (simulates multi-chunk merge) // LLM returns same title twice (simulates multi-chunk duplicate)
llmResponse := mustJSON([]wiki.Page{ llmResponse := mustJSON([]RawPage{
{Path: "wiki/concepts/foo.md", Content: "---\ntitle: Foo\n---\n\n## Definition\n\nFirst.\n\n## Related Concepts\n\n- [[bar|Bar]]\n"}, {Title: "Foo", Type: "concept", Content: "## Definition\n\nFirst.\n\n## Related Concepts\n\n[[Bar]]\n"},
{Path: "wiki/concepts/foo.md", Content: "---\ntitle: Foo\n---\n\n## Definition\n\nSecond.\n\n## Related Concepts\n\n- [[baz|Baz]]\n"}, {Title: "Foo", Type: "concept", Content: "## Definition\n\nSecond.\n\n## Related Concepts\n\n[[Baz]]\n"},
}) })
srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
@@ -120,8 +125,9 @@ func TestRun_MergesDuplicatePaths(t *testing.T) {
require.NoError(t, err) require.NoError(t, err)
// keep-first for Definition, union for Related Concepts // keep-first for Definition, union for Related Concepts
assert.Contains(t, string(content), "First.") assert.Contains(t, string(content), "First.")
assert.Contains(t, string(content), "[[bar|Bar]]") // Bar and Baz unknown in empty inventory → left as plain [[links]]
assert.Contains(t, string(content), "[[baz|Baz]]") assert.Contains(t, string(content), "[[Bar]]")
assert.Contains(t, string(content), "[[Baz]]")
} }
func mustJSON(v any) string { func mustJSON(v any) string {

View File

@@ -12,12 +12,15 @@ import (
const systemPrompt = `You are a wiki agent. Read the source material and produce structured wiki pages following the schema provided. const systemPrompt = `You are a wiki agent. Read the source material and produce structured wiki pages following the schema provided.
Output ONLY a valid JSON array — no markdown fences, no other text before or after. Output ONLY a valid JSON array — no markdown fences, no other text before or after.
Each element must have: Each element must have exactly these fields:
"path" — relative path within the wiki, e.g. "wiki/sources/foo.md" "title" — exact page title (e.g. "FinBERT", "Ryan Singer", "Shape Up")
"content" — full markdown content of the page including YAML frontmatter "type" — exactly one of: "source", "concept", "entity"
"subtype" — for source: article|pdf|book|video|note|project; for entity: person|company|tool|model|framework|technology; omit for concept
"domain" — one of the domains in the schema (omit if none fits)
"content" — Markdown body only — NO frontmatter, NO path, NO slug
Follow the schema strictly: correct frontmatter fields, wikilinks as [[slug|Display Text]], Wikilinks in content: [[Display Name]] — just the display name, no slug, no pipe separator.
dates in YYYY-MM-DD format, and paraphrase rather than quoting verbatim.` Only link to pages listed in the inventory or pages you are creating in this response.`
// BuildPrompt constructs the user prompt for a single chunk. // BuildPrompt constructs the user prompt for a single chunk.
func BuildPrompt(schema, source, content string, inventory map[wiki.PageType][]wiki.Entry) string { func BuildPrompt(schema, source, content string, inventory map[wiki.PageType][]wiki.Entry) string {
@@ -30,7 +33,7 @@ func BuildPrompt(schema, source, content string, inventory map[wiki.PageType][]w
sb.WriteString("\n\n") sb.WriteString("\n\n")
sb.WriteString("## Existing wiki pages\n\n") sb.WriteString("## Existing wiki pages\n\n")
sb.WriteString("Link ONLY to pages in this inventory or pages you are creating in this response.\n\n") sb.WriteString("Reference these pages by display name only — [[Display Name]] — in your content.\n\n")
for _, pt := range []wiki.PageType{wiki.PageTypeConcept, wiki.PageTypeEntity, wiki.PageTypeSource} { for _, pt := range []wiki.PageType{wiki.PageTypeConcept, wiki.PageTypeEntity, wiki.PageTypeSource} {
entries := inventory[pt] entries := inventory[pt]
@@ -39,19 +42,19 @@ func BuildPrompt(schema, source, content string, inventory map[wiki.PageType][]w
fmt.Fprintf(&sb, "%s — (none yet)\n\n", label) fmt.Fprintf(&sb, "%s — (none yet)\n\n", label)
continue continue
} }
fmt.Fprintf(&sb, "%s — link ONLY under the matching section:\n", label) fmt.Fprintf(&sb, "%s:\n", label)
for _, e := range entries { for _, e := range entries {
fmt.Fprintf(&sb, " - [[%s|%s]]\n", e.Slug, e.Title) fmt.Fprintf(&sb, " - %s\n", e.Title)
} }
sb.WriteString("\n") sb.WriteString("\n")
} }
sb.WriteString("## Non-negotiable rules\n\n") sb.WriteString("## Non-negotiable rules\n\n")
sb.WriteString("1. Output ONLY a valid JSON array — no prose, no fences.\n") sb.WriteString("1. Output ONLY a valid JSON array — no prose, no fences.\n")
sb.WriteString("2. Slugs are kebab-case: lowercase, spaces→hyphens, no special chars.\n") sb.WriteString("2. Fields: title, type, subtype (if applicable), domain (if applicable), content.\n")
sb.WriteString("3. Wikilinks: [[slug|Display Text]] — the pipe is required.\n") sb.WriteString("3. Wikilinks: [[Display Name]] — no slug, no pipe. The pipeline handles slugs.\n")
sb.WriteString("4. Section links must match their section type.\n") sb.WriteString("4. Section links must match their section type (Related Concepts → concepts only, etc.).\n")
sb.WriteString("5. One source page per book — update it if inventory shows it exists.\n\n") sb.WriteString("5. One source page per book — if inventory shows it exists, return it as an UPDATE.\n\n")
fmt.Fprintf(&sb, "## Source: %s\n\n", source) fmt.Fprintf(&sb, "## Source: %s\n\n", source)
sb.WriteString(content) sb.WriteString(content)

View File

@@ -14,13 +14,12 @@ import (
"github.com/stretchr/testify/require" "github.com/stretchr/testify/require"
"github.com/mathiasbq/hyperguild/ingestion/internal/pipeline" "github.com/mathiasbq/hyperguild/ingestion/internal/pipeline"
"github.com/mathiasbq/hyperguild/ingestion/internal/wiki"
) )
// successComplete returns a valid JSON-encoded page array for any call. // successComplete returns a valid JSON-encoded RawPage array for any call.
func successComplete(page wiki.Page) pipeline.CompleteFunc { func successComplete(raw pipeline.RawPage) pipeline.CompleteFunc {
return func(ctx context.Context, system, user string) (string, error) { return func(ctx context.Context, system, user string) (string, error) {
b, err := json.Marshal([]wiki.Page{page}) b, err := json.Marshal([]pipeline.RawPage{raw})
if err != nil { if err != nil {
return "", err return "", err
} }
@@ -50,16 +49,19 @@ func TestStart_ProcessesFile(t *testing.T) {
require.NoError(t, os.WriteFile(rawFile, []byte("Content about Shape Up."), 0o644)) require.NoError(t, os.WriteFile(rawFile, []byte("Content about Shape Up."), 0o644))
date := time.Now().UTC().Format("2006-01-02") date := time.Now().UTC().Format("2006-01-02")
wikiPage := wiki.Page{ rawPage := pipeline.RawPage{
Path: "wiki/sources/shape-up-book.md", Title: "Shape Up Book",
Content: "---\ntitle: Shape Up Book\ntype: article\ndomain: product-management\ndate_ingested: " + date + "\nlast_updated: " + date + "\naliases:\n - Shape Up Book\n---\n\n## Summary\n\nA book about Shape Up.\n", Type: "source",
Subtype: "article",
Domain: "product-management",
Content: "## Summary\n\nA book about Shape Up.\n",
} }
cfg := Config{ cfg := Config{
BrainDir: brainDir, BrainDir: brainDir,
Interval: 50 * time.Millisecond, Interval: 50 * time.Millisecond,
Pipeline: pipeline.Config{ Pipeline: pipeline.Config{
Complete: successComplete(wikiPage), Complete: successComplete(rawPage),
ChunkSize: 0, ChunkSize: 0,
Schema: "# Schema\nThree page types.", Schema: "# Schema\nThree page types.",
}, },
@@ -193,12 +195,14 @@ func TestProcessDir_SkipsSubdirs(t *testing.T) {
// Track which sources were passed to Complete. // Track which sources were passed to Complete.
var processedSources []string var processedSources []string
completeFn := func(ctx context.Context, system, user string) (string, error) { completeFn := func(ctx context.Context, system, user string) (string, error) {
// Record that this was called; return a minimal valid page. // Record that this was called; return a minimal valid RawPage.
page := wiki.Page{ raw := pipeline.RawPage{
Path: "wiki/sources/valid.md", Title: "Valid",
Content: "---\ntitle: Valid\n---\n\n## Summary\n\nValid.\n", Type: "source",
Subtype: "article",
Content: "## Summary\n\nValid.\n",
} }
b, _ := json.Marshal([]wiki.Page{page}) b, _ := json.Marshal([]pipeline.RawPage{raw})
processedSources = append(processedSources, "called") processedSources = append(processedSources, "called")
return string(b), nil return string(b), nil
} }

View File

@@ -17,6 +17,8 @@ func (s *Skill) Handle(ctx context.Context, tool string, args json.RawMessage) (
return s.query(ctx, args) return s.query(ctx, args)
case "brain_write": case "brain_write":
return s.write(ctx, args) return s.write(ctx, args)
case "brain_ingest_raw":
return s.ingestRaw(ctx, args)
case "brain_ingest": case "brain_ingest":
return s.ingest(ctx, args) return s.ingest(ctx, args)
case "brain_search": case "brain_search":
@@ -98,6 +100,33 @@ func (s *Skill) ingest(ctx context.Context, args json.RawMessage) (json.RawMessa
return nil, fmt.Errorf("either content+source or path is required") return nil, fmt.Errorf("either content+source or path is required")
} }
type ingestRawArgs struct {
Source string `json:"source"`
Pages []any `json:"pages"`
DryRun bool `json:"dry_run,omitempty"`
}
func (s *Skill) ingestRaw(ctx context.Context, args json.RawMessage) (json.RawMessage, error) {
var a ingestRawArgs
if err := json.Unmarshal(args, &a); err != nil {
return nil, fmt.Errorf("parse args: %w", err)
}
if s.cfg.IngestSvcURL == "" {
return nil, fmt.Errorf("brain_ingest_raw: INGEST_SVC_URL not configured")
}
if a.Source == "" {
return nil, fmt.Errorf("source is required")
}
if len(a.Pages) == 0 {
return nil, fmt.Errorf("pages is required and must be non-empty")
}
return s.postTo(ctx, s.cfg.IngestSvcURL+"/ingest-raw", map[string]any{
"source": a.Source,
"pages": a.Pages,
"dry_run": a.DryRun,
})
}
type searchArgs struct { type searchArgs struct {
Query string `json:"query"` Query string `json:"query"`
Collection string `json:"collection,omitempty"` Collection string `json:"collection,omitempty"`

View File

@@ -55,6 +55,32 @@ func (s *Skill) Tools() []registry.ToolDef {
}, },
} }
if s.cfg.IngestSvcURL != "" { if s.cfg.IngestSvcURL != "" {
tools = append(tools, registry.ToolDef{
Name: "brain_ingest_raw",
Description: "Ingest pre-structured pages into the brain wiki, bypassing the LLM extraction step. " +
"Use when you (the calling agent) have already extracted entities, concepts, and content from a source. " +
"Provide source (human-readable name) and pages (array of {title, type, subtype, domain, content} objects). " +
"The pipeline computes slugs, paths, frontmatter, wikilink canonicalization, and source back-references. " +
"Returns the list of wiki pages written.",
InputSchema: schema([]string{"source", "pages"}, map[string]any{
"source": map[string]any{"type": "string", "description": "human-readable name for the source, e.g. 'shape-up-book'"},
"pages": map[string]any{
"type": "array",
"items": map[string]any{
"type": "object",
"required": []string{"title", "type", "content"},
"properties": map[string]any{
"title": map[string]any{"type": "string", "description": "page title, e.g. 'Hash Encoding'"},
"type": map[string]any{"type": "string", "enum": []string{"source", "concept", "entity"}, "description": "page type"},
"subtype": map[string]any{"type": "string", "description": "entity: person|company|tool|model|framework|technology; source: article|pdf|book|video|note|project"},
"domain": map[string]any{"type": "string", "description": "knowledge domain, e.g. 'Machine Learning'"},
"content": map[string]any{"type": "string", "description": "markdown body — no frontmatter, use [[Display Name]] for wikilinks"},
},
},
},
"dry_run": map[string]any{"type": "boolean"},
}),
})
tools = append(tools, registry.ToolDef{ tools = append(tools, registry.ToolDef{
Name: "brain_ingest", Name: "brain_ingest",
Description: "Ingest content into the brain wiki (brain/wiki/). Calls an LLM to produce structured wiki pages. " + Description: "Ingest content into the brain wiki (brain/wiki/). Calls an LLM to produce structured wiki pages. " +