4f78fecd065b2429e1bb79866ea8609fc7d8dcb1
4 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
815739758e |
feat(vectorstore): re-embed on file mtime > store updated_at (#23)
Removes the TODO in Sync that left files static after their first embed.
Edits to brain/wiki/ and brain/knowledge/ now surface in subsequent syncs
without manual /backfill-embeddings calls.
Approach:
- Store interface: KnownPaths → KnownPathsWithTime returning
  path → updated_at. Callers compare against file mtime to detect edits.
- PGStore: SELECT path, updated_at FROM brain_embeddings.
- Sync groups known chunks by parent path and tracks the EARLIEST
  updated_at per parent. A file is stale when its mtime is after that
  oldest chunk's timestamp: any chunk older than the file means at least
  one chunk hasn't been refreshed since the last edit.
- Stale-path rewrite: delete every old chunk for the parent (handles
  "file shrunk → fewer chunks → orphan rows at higher #NNNN" cleanly),
  then re-chunk + re-embed + re-upsert.
Tests:
- New: TestSync_ReembedsFileWhenMtimeNewer. File mtime is forced into the
  future vs store updated_at; Sync deletes the old chunk + upserts a
  fresh one.
- New: TestSync_SkipsFileWhenMtimeOlder. File mtime is backdated; Sync is
  a no-op (no upserts, no deletes).
- Updated: stubStore.known is now map[string]time.Time. A zero value
  resolves to a far-future sentinel so existing "skip if already known"
  tests keep passing without per-test setup.
- pg_test renamed KnownPaths integration → KnownPathsWithTime; asserts
  updated_at is non-zero and within 5s of insert wall-clock.
Backward compat:
- brain_embeddings rows pre-dating this change carry valid updated_at
  values (column was always populated via `DEFAULT now()` + ON CONFLICT
  `updated_at = now()`). No migration needed. Live pod will start
  re-embedding any file whose source has been edited since its chunks
  were originally written.
Closes gitea/mathias/hyperguild#23. |
||
|
|
37fdd33b2d |
feat(ingestion): chunk markdown before embedding (#38)
Long markdown files (>~8KB) silently failed to embed because
nomic-embed-text on iguana has a 2048-token context. embed sync logged
errors=1 every cycle with no useful body until #37 added per-item
logging. Three files exceed the ceiling: finbert source (8 KB),
koala-machine-state (7.1 KB), litellm-absorption (8.8 KB). Curated
knowledge entries should never be vector-blind.
Approach: chunk-before-embed, no schema change.
vectorstore/chunk.go (new):
- ChunkMarkdown splits at H1/H2 boundaries; sections over maxBytes are
  further split at paragraph boundaries, packing greedily under budget.
- NumberChunks assigns "<parent>#NNNN" storage paths (1-based,
  zero-padded to 4 digits; handles files with up to ~10k sections in
  stable sort order).
- ParentPath strips the chunk suffix for retrieval-side dedup.
vectorstore/sync.go:
- After ChunkMarkdown produces N pieces, each is embedded + upserted as a
  separate brain_embeddings row at "<parent>#NNNN". maxChunkBytes = 4000
  (≈1000 nomic tokens, well under the 2048 ceiling with headroom for
  unicode/code blocks).
- "Already embedded?" check now reduces known paths to parent set via
  ParentPath, so the first chunk hit short-circuits the file.
- Delete walk also reduces via ParentPath; when a parent file disappears,
  every chunk row (and any pre-existing bare-path row, for backward
  compatibility with rows written before this change) gets dropped.
search/search.go:
- hybridMerge collapses chunk-path vector hits to parent via ParentPath
  before scope check, RRF accumulation, and hydration. A file with three
  chunk hits returns one result row, not three.
Backward compatibility: pre-existing bare-path rows in brain_embeddings
keep working. ParentPath returns them unchanged, knownParents handles
them as if they were "wiki/foo.md#NNNN" hits, sync skips re-embed, and
search dedup is a no-op for them. No migration required to ship.
Tests:
- chunk_test.go covers short / heading split / oversized section /
  content preservation / chunk numbering / parent-path stripping.
- sync_test.go adds long-file chunking, single-chunk-row short file,
  skip-if-any-chunk-known, delete-all-chunks-of-disappeared-file.
  Existing tests updated for #NNNN paths.
- search_test.go adds chunk-paths-dedupe-to-parent.
Closes gitea/mathias/infra#38. |
||
|
|
078ec029da |
fix(ingestion): embed sync also scans brain/knowledge/ + logs per-item errors
The embed sync goroutine only walked brain/wiki/. brain/knowledge/ (112
curated entries, per CLAUDE.md the most-important brain content) had zero
coverage in brain_embeddings, so vector retrieval was blind to it. Hybrid
BM25 + pgvector retrieval would never surface a curated knowledge entry
via the vector arm.
Extract the per-root walk into a loop over a small subdir list and add
"knowledge" alongside "wiki". scanDirs is package-level so it stays a
single source of truth for what gets embedded.
Also log each failing item's path + error string from StartSync.
Previously only the aggregate count was logged, so a persistent
`errors=1` per cycle was opaque. With per-item warnings, the actual
ollama "input length exceeds the context length" error surfaces
immediately.
Refs gitea/mathias/infra#37 (this commit covers the knowledge/ scan bug;
the long-file chunking bug is a separate change.) |
||
|
|
57462b52ff |
feat(brain): hybrid BM25 + pgvector retrieval (opt-in)
Wires nomic-embed-text (iguana ollama) + pgvector on the shared
postgres18 into brain_query / brain_answer via Reciprocal Rank Fusion.
Pure BM25 stays the default; setting BRAIN_PG_DSN and BRAIN_EMBED_URL
together opts in. Setting one without the other is misconfiguration →
exit 1.
New packages:
- internal/embed
Client.Embed(ctx, text) → []float32 via POST {URL}/api/embed.
Defaults to nomic-embed-text:latest (768 dim). nil-on-empty-URL so
callers gate on a single nil check.
- internal/vectorstore
PGStore wraps a pgxpool against postgres18. Init creates
brain_embeddings(path PK, vector(768), updated_at) + HNSW cosine
index idempotently. Upsert / Delete / Search / KnownPaths.
Sync(brainDir, store, embedder) diffs brain/wiki/ against the store
and upserts new files / deletes removed ones; StartSync runs it on
a ticker (default 300s). Integration tests gated by BRAIN_PG_TEST_DSN.
- scripts/brain-embeddings-init.sql
One-time DBA setup: brain DB, brain_app role, vector extension,
GRANTs. Idempotent.
Search layer:
- search.QueryOptions gains Vector + Embedder fields.
- QueryContext is the cancellable variant; Query stays for callers.
- When both are set, BM25 (top-N) and pgvector (top-4N) candidates
merge via Reciprocal Rank Fusion (k=60, Cormack et al. 2009 — no
tuning knob, robust to scale differences between rankers).
- Vector-only hits are hydrated from disk so callers see uniform
Result records (path, title, excerpt, wing, hall, score).
- Wing/hall filters still apply to vector candidates via path-prefix.
- On embedder/vector errors the search falls back to BM25 — embedding
outage degrades quality but doesn't take the brain offline.
MCP wiring:
- mcp.Server.WithHybridRetrieval(v, e) opt-in setter, same shape as
WithReranker.
- brainQuery and brainAnswer pass the wired vector/embedder through
to search.QueryContext.
REST:
- POST /backfill-embeddings drives Sync synchronously. Returns
{added, deleted, errors[]}. 503 when feature is unconfigured.
cmd/server/main.go:
- BRAIN_PG_DSN + BRAIN_EMBED_URL together enable hybrid; one alone
→ exit 1.
- vectorAdapter bridges *PGStore (returns []Hit) to
search.VectorSearcher (which takes []VectorHit) without either
package importing the other.
- BRAIN_EMBED_SYNC_INTERVAL (default 300s) controls the background
Sync ticker.
Backend pivot from Qdrant to pgvector recorded in DECISIONS.md
2026-05-18 (supersedes 2026-04-08): postgres18 already runs in
databases/ ns, Qdrant was never deployed, one engine beats two.
Dependency: github.com/jackc/pgx/v5 — modern, native pgvector via
parametric vector literals.
Tests:
- embed.Client: empty-URL nil, request shape, dimension, upstream
error propagation, empty-text rejection.
- vectorstore.PGStore: dimension validation (unit); upsert/search/
KnownPaths (integration, BRAIN_PG_TEST_DSN-gated).
- vectorstore.Sync: adds new files, skips known, deletes
disappeared, skips _index.md, no-op when nil, collects embedder
errors.
- search.Query: hybrid promotes vector-only hits via RRF; falls
back to BM25 on embedder error.
Closes hyperguild#8.
|