hyperguild

Author	SHA1	Message	Date
Mathias	4f78fecd06	feat(search): M4 tier-weighted BM25 re-rank (infra#72) The eval set under brain/eval/qa-2026-05.md showed BM25 top-1 at 20% with 5 of the missing slugs being short focused knowledge entries that lost to long aggregate docs on raw term-frequency. Tier weighting addresses that without touching the BM25 algorithm itself. How: - Result struct gains a Tier field, populated during the file walk via extractTier (frontmatter wins, path prefix as fallback; this mirrors the graph.inferTierFromPath logic so the two callers stay in lockstep). - After the existing sort (and optional hybridMerge), do a final stable re-sort by float64(Score) * tierWeight(Tier). Knowledge ×1.5, note ×1.0, inbox ×0.3, unknown ×1.0. - hydrate() (vector-only hits) also fills Tier so re-ranking covers the hybrid path. Test covers the load-bearing case: a long note-tier doc with raw=10 loses to a short knowledge-tier doc with raw=8 after weighting (8×1.5=12 vs 10×1.0=10). Measurement gate is in infra#72: re-run brain/eval/score.py against the live brain after this image lands; close the issue when top-1 hit rate lifts by ≥10 absolute points.	2026-05-25 18:45:20 +02:00
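The re-rank step in commit 4f78fecd06 can be sketched as follows. This is a minimal illustration, not the repository's code: the trimmed `Result` type, the lowercase tier strings, and the `rerankByTier` helper name are assumptions; only the multipliers and the "stable re-sort by float64(Score) * tierWeight(Tier)" rule come from the commit message.

```go
package main

import (
	"fmt"
	"sort"
)

// Result is a hypothetical, trimmed-down search hit with only the
// fields the re-rank needs.
type Result struct {
	Path  string
	Tier  string // assumed values: "knowledge", "note", "inbox"
	Score int    // raw BM25-style score
}

// tierWeight returns the multipliers quoted in the commit message.
func tierWeight(tier string) float64 {
	switch tier {
	case "knowledge":
		return 1.5
	case "inbox":
		return 0.3
	default: // "note" and unknown tiers stay unweighted
		return 1.0
	}
}

// rerankByTier stably re-sorts results by raw score times tier weight,
// highest first. Stability keeps the earlier ordering for ties.
func rerankByTier(results []Result) {
	sort.SliceStable(results, func(i, j int) bool {
		wi := float64(results[i].Score) * tierWeight(results[i].Tier)
		wj := float64(results[j].Score) * tierWeight(results[j].Tier)
		return wi > wj
	})
}

func main() {
	results := []Result{
		{Path: "notes/long-aggregate.md", Tier: "note", Score: 10},
		{Path: "knowledge/short-entry.md", Tier: "knowledge", Score: 8},
	}
	rerankByTier(results)
	for _, r := range results {
		fmt.Println(r.Path, float64(r.Score)*tierWeight(r.Tier))
	}
}
```

With the two documents from the commit's test case, the knowledge-tier entry (8 × 1.5 = 12) sorts ahead of the note-tier entry (10 × 1.0 = 10).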
Mathias	37fdd33b2d	feat(ingestion): chunk markdown before embedding (#38) Long markdown files (>~8KB) silently failed to embed because nomic-embed-text on iguana has a 2048-token context. embed sync logged errors=1 every cycle with no useful body until #37 added per-item logging. Three files exceed the ceiling: finbert source (8 KB), koala-machine-state (7.1 KB), litellm-absorption (8.8 KB). Curated knowledge entries should never be vector-blind. Approach: chunk-before-embed, no schema change. vectorstore/chunk.go (new) - ChunkMarkdown splits at H1/H2 boundaries; sections over maxBytes are further split at paragraph boundaries, packing greedily under budget. - NumberChunks assigns "<parent>#NNNN" storage paths (1-based, zero-padded to 4 digits; handles files with up to ~10k sections in stable sort order). - ParentPath strips the chunk suffix for retrieval-side dedup. vectorstore/sync.go - After ChunkMarkdown produces N pieces, each is embedded + upserted as a separate brain_embeddings row at "<parent>#NNNN". maxChunkBytes = 4000 (≈1000 nomic tokens, well under the 2048 ceiling with headroom for unicode/code blocks). - "Already embedded?" check now reduces known paths to parent set via ParentPath, so the first chunk hit short-circuits the file. - Delete walk also reduces via ParentPath; when a parent file disappears, every chunk row (and any pre-existing bare-path row, for backward compatibility with rows written before this change) gets dropped. search/search.go - hybridMerge collapses chunk-path vector hits to parent via ParentPath before scope check, RRF accumulation, and hydration. A file with three chunk hits returns one result row, not three. 
Backward compatibility: pre-existing bare-path rows in brain_embeddings keep working — ParentPath returns them unchanged, knownParents handles them as if they were "wiki/foo.md#NNNN" hits, sync skips re-embed, and search dedup is a no-op for them. No migration required to ship. Tests: - chunk_test.go covers short / heading split / oversized section / content preservation / chunk numbering / parent-path stripping. - sync_test.go adds long-file chunking, single-chunk-row short file, skip-if-any-chunk-known, delete-all-chunks-of-disappeared-file. Existing tests updated for #NNNN paths. - search_test.go adds chunk-paths-dedupe-to-parent. Closes gitea/mathias/infra#38.	2026-05-19 21:57:09 +02:00
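The "<parent>#NNNN" path scheme and the parent-level dedup from commit 37fdd33b2d can be sketched like this. The function names follow the commit message, but the signatures, the suffix regexp, and the `dedupeToParents` helper are assumptions made for illustration; the real vectorstore package may differ.

```go
package main

import (
	"fmt"
	"regexp"
)

// chunkSuffix matches a trailing "#" followed by four or more digits.
// The exact pattern is an assumption; the commit only states the
// "<parent>#NNNN" shape.
var chunkSuffix = regexp.MustCompile(`#\d{4,}$`)

// NumberChunks returns storage paths for n chunks of parent,
// 1-based and zero-padded to four digits.
func NumberChunks(parent string, n int) []string {
	paths := make([]string, 0, n)
	for i := 1; i <= n; i++ {
		paths = append(paths, fmt.Sprintf("%s#%04d", parent, i))
	}
	return paths
}

// ParentPath strips the chunk suffix. Bare paths written before
// chunking existed are returned unchanged.
func ParentPath(p string) string {
	return chunkSuffix.ReplaceAllString(p, "")
}

// dedupeToParents (hypothetical helper) collapses chunk hits to their
// parent files, keeping first-seen order.
func dedupeToParents(hits []string) []string {
	seen := make(map[string]bool)
	var out []string
	for _, h := range hits {
		p := ParentPath(h)
		if !seen[p] {
			seen[p] = true
			out = append(out, p)
		}
	}
	return out
}

func main() {
	fmt.Println(NumberChunks("wiki/foo.md", 3))
	hits := []string{"wiki/foo.md#0002", "wiki/foo.md#0001", "wiki/bar.md"}
	fmt.Println(dedupeToParents(hits))
}
```

Because `ParentPath` is a no-op on bare paths, old rows and chunk rows can share the same lookup and dedup code, which is the backward-compatibility property the commit describes.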
Mathias	57462b52ff	feat(brain): hybrid BM25 + pgvector retrieval (opt-in) Wires nomic-embed-text (iguana ollama) + pgvector on the shared postgres18 into brain_query / brain_answer via Reciprocal Rank Fusion. Pure BM25 stays the default; setting BRAIN_PG_DSN and BRAIN_EMBED_URL together opts in. Setting one without the other is misconfiguration → exit 1. New packages: - internal/embed Client.Embed(ctx, text) → []float32 via POST {URL}/api/embed. Defaults to nomic-embed-text:latest (768 dim). nil-on-empty-URL so callers gate on a single nil check. - internal/vectorstore PGStore wraps a pgxpool against postgres18. Init creates brain_embeddings(path PK, vector(768), updated_at) + HNSW cosine index idempotently. Upsert / Delete / Search / KnownPaths. Sync(brainDir, store, embedder) diffs brain/wiki/ against the store and upserts new files / deletes removed ones; StartSync runs it on a ticker (default 300s). Integration tests gated by BRAIN_PG_TEST_DSN. - scripts/brain-embeddings-init.sql One-time DBA setup: brain DB, brain_app role, vector extension, GRANTs. Idempotent. Search layer: - search.QueryOptions gains Vector + Embedder fields. - QueryContext is the cancellable variant; Query stays for callers. - When both are set, BM25 (top-N) and pgvector (top-4N) candidates merge via Reciprocal Rank Fusion (k=60, Cormack et al. 2009; no tuning knob, robust to scale differences between rankers). - Vector-only hits are hydrated from disk so callers see uniform Result records (path, title, excerpt, wing, hall, score). - Wing/hall filters still apply to vector candidates via path-prefix. - On embedder/vector errors the search falls back to BM25: an embedding outage degrades quality but doesn't take the brain offline. MCP wiring: - mcp.Server.WithHybridRetrieval(v, e) opt-in setter, same shape as WithReranker. 
- brainQuery and brainAnswer pass the wired vector/embedder through to search.QueryContext. REST: - POST /backfill-embeddings drives Sync synchronously. Returns {added, deleted, errors[]}. 503 when feature is unconfigured. cmd/server/main.go: - BRAIN_PG_DSN + BRAIN_EMBED_URL together enable hybrid; one alone → exit 1. - vectorAdapter bridges *PGStore (returns []Hit) to search.VectorSearcher (which takes []VectorHit) without either package importing the other. - BRAIN_EMBED_SYNC_INTERVAL (default 300s) controls the background Sync ticker. Backend pivot from Qdrant to pgvector recorded in DECISIONS.md 2026-05-18 (supersedes 2026-04-08): postgres18 already runs in databases/ ns, Qdrant was never deployed, one engine beats two. Dependency: github.com/jackc/pgx/v5 (modern, native pgvector via parametric vector literals). Tests: - embed.Client: empty-URL nil, request shape, dimension, upstream error propagation, empty-text rejection. - vectorstore.PGStore: dimension validation (unit); upsert/search/KnownPaths (integration, BRAIN_PG_TEST_DSN-gated). - vectorstore.Sync: adds new files, skips known, deletes disappeared, skips _index.md, no-op when nil, collects embedder errors. - search.Query: hybrid promotes vector-only hits via RRF; falls back to BM25 on embedder error. Closes hyperguild#8.	2026-05-18 23:11:25 +02:00
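The Reciprocal Rank Fusion merge named in commit 57462b52ff can be sketched as below. The formula (each document scores the sum of 1/(k + rank) over the lists it appears in, with k = 60 and 1-based ranks) is the standard one; the `rrfMerge` name, the use of plain path strings, and the alphabetical tie-break are assumptions and not taken from search.go.

```go
package main

import (
	"fmt"
	"sort"
)

// rrfK is the constant quoted in the commit message.
const rrfK = 60

// rrfMerge fuses several ranked lists of paths. Each list contributes
// 1/(rrfK + rank) per path, with rank starting at 1.
func rrfMerge(rankings ...[]string) []string {
	scores := make(map[string]float64)
	for _, ranking := range rankings {
		for i, path := range ranking {
			scores[path] += 1.0 / float64(rrfK+i+1)
		}
	}
	merged := make([]string, 0, len(scores))
	for p := range scores {
		merged = append(merged, p)
	}
	sort.Slice(merged, func(i, j int) bool {
		si, sj := scores[merged[i]], scores[merged[j]]
		if si != sj {
			return si > sj
		}
		return merged[i] < merged[j] // deterministic tie-break (assumption)
	})
	return merged
}

func main() {
	bm25 := []string{"a.md", "b.md", "c.md"}
	vector := []string{"d.md", "b.md", "a.md"}
	fmt.Println(rrfMerge(bm25, vector))
}
```

In this example `d.md` appears only in the vector list, yet its rank-1 contribution (1/61) beats the BM25-only `c.md` at rank 3 (1/63), which is the "vector-only hits are promoted" behaviour the commit's tests describe.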
Mathias	75685e7b67	feat(brain): structured wing/hall taxonomy + obsidian-compatible layout Adds a two-dimensional address (wing, hall) to brain notes. A wing is a topic domain (e.g. jepa-fx, hyperguild); a hall is one of a closed vocabulary of memory types (facts, decisions, failures, hypotheses, sources). Notes route to brain/wiki/<wing>/<hall>/<slug>.md with wing/hall/created_at YAML frontmatter, making the directory a valid Obsidian vault. Changes: - new package ingestion/internal/brain (NotePath, ValidHalls, Sanitise, BuildWingIndex, BuildAllWingIndexes) - api.WriteNote refactored to WriteNoteOptions; wing+hall routes to brain/wiki/, otherwise falls back to brain/knowledge/ (legacy) - search.Query → QueryOptions with optional Wing/Hall filtering; Result carries wing/hall extracted from frontmatter or path segments - MCP tools brain_write and brain_query gain optional wing/hall params (hall enum-validated); new brain_index tool regenerates _index.md MOC - POST /index REST endpoint mirrors brain_index - brain_write auto-rebuilds the wing's _index.md after a wing+hall write - scripts/migrate-brain-halls.sh migrates flat brain/wiki/{concepts,entities}/ into the new layout (dry-run by default, --commit applies) All existing tests pass; new tests cover wing/hall write routing, scope filtering, invalid hall rejection, _index.md generation, and migration script paths. Closes hyperguild#1.	2026-05-18 20:47:08 +02:00
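The wing/hall routing rule from commit 75685e7b67 can be sketched as follows. The hall vocabulary and the two target directories come from the commit message; the lowercase `notePath` function, its signature, and its error text are hypothetical and do not claim to match the real NotePath in ingestion/internal/brain.

```go
package main

import (
	"fmt"
	"path"
)

// validHalls is the closed vocabulary of memory types listed in the commit.
var validHalls = map[string]bool{
	"facts":      true,
	"decisions":  true,
	"failures":   true,
	"hypotheses": true,
	"sources":    true,
}

// notePath (hypothetical) routes a note to
// <brainDir>/wiki/<wing>/<hall>/<slug>.md when both wing and hall are set,
// and to the legacy <brainDir>/knowledge/<slug>.md otherwise.
func notePath(brainDir, wing, hall, slug string) (string, error) {
	if wing == "" || hall == "" {
		return path.Join(brainDir, "knowledge", slug+".md"), nil
	}
	if !validHalls[hall] {
		return "", fmt.Errorf("invalid hall %q", hall)
	}
	return path.Join(brainDir, "wiki", wing, hall, slug+".md"), nil
}

func main() {
	p, err := notePath("brain", "jepa-fx", "decisions", "rrf-k")
	fmt.Println(p, err)

	_, err = notePath("brain", "jepa-fx", "rumours", "x")
	fmt.Println(err)
}
```

Rejecting unknown halls at path-construction time keeps the directory tree limited to the five memory types, so path-segment filtering in search can rely on the second segment being a valid hall.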
Mathias Bergqvist	3625e1268d	feat(ingestion): simplify brain to knowledge/ — write and search use same dir	2026-04-22 15:36:10 +02:00
Mathias Bergqvist	3c1f6edf3e	feat(ingestion): add full-text wiki search package Implements search.Query which walks brainDir/wiki/*/.md, scores files by term-frequency across query tokens, and returns results sorted by score descending. Uses only stdlib — no external search deps. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-17 20:18:57 +02:00
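The term-frequency scoring described in commit 3c1f6edf3e can be sketched with the standard library alone. This version scores in-memory documents instead of walking the wiki directory, and the `tokenize`, `termFrequency`, and `rank` names are assumptions for illustration, not the package's actual API.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"unicode"
)

// scored pairs a document path with its term-frequency score.
type scored struct {
	Path  string
	Score int
}

// tokenize lowercases text and splits it on anything that is not a
// letter or digit.
func tokenize(text string) []string {
	return strings.FieldsFunc(strings.ToLower(text), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
}

// termFrequency counts how often the query tokens occur in body.
func termFrequency(body string, queryTokens []string) int {
	counts := make(map[string]int)
	for _, w := range tokenize(body) {
		counts[w]++
	}
	score := 0
	for _, tok := range queryTokens {
		score += counts[tok]
	}
	return score
}

// rank scores every document against the query, drops zero-score
// documents, and sorts by score descending (path ascending on ties).
func rank(docs map[string]string, query string) []scored {
	tokens := tokenize(query)
	var out []scored
	for p, body := range docs {
		if s := termFrequency(body, tokens); s > 0 {
			out = append(out, scored{Path: p, Score: s})
		}
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Score != out[j].Score {
			return out[i].Score > out[j].Score
		}
		return out[i].Path < out[j].Path
	})
	return out
}

func main() {
	docs := map[string]string{
		"a.md": "Vector search and search ranking.",
		"b.md": "Search once.",
		"c.md": "Nothing relevant here.",
	}
	fmt.Println(rank(docs, "search ranking"))
}
```

Raw term frequency favours long documents that repeat the query terms, which is the weakness the later tier-weighting commit sets out to offset.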

6 Commits