docs(eval): record M4 + M4b scorer runs — phase 2 gate cleared (infra#72)
Tier-weighted retrieval against the qa-2026-05.md 20-question set:

| run                            | top-1 | top-3 |
|--------------------------------|-------|-------|
| baseline (pre-phase-1)         | 20%   | 65%   |
| post phase 1 (parser+content)  | 20%   | 70%   |
| post M4 (tier weighting)       | 30%   | 75%   |
| post M4b (entities → K tier)   | 35%   | 80%   |

Net lift over the pre-phase-1 baseline: +15pt top-1, +15pt top-3. Phase 2 on its own (post phase 1 → post M4b) contributes +15pt top-1 and +10pt top-3, so the ≥10pt close-gate set in infra#72 is met either way.

The three remaining rank=0 misses (flux-healthcheck-stale-on-resource-removal, go-defer-errcheck-body-close, hyperguild-level3-pipeline-rewrite) are content-keyword issues, not structure issues: the questions don't share enough lexical surface with the target entries to surface via BM25 alone. Vector search would help here, but the iguana embedder is off-mesh (see infra#64).
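The hit rates and lifts quoted above can be re-derived with a few lines of arithmetic. The sketch below is illustrative only and is not the scorer itself: the names `hit_rate_pct`, `lift`, `POST_M4B_RANKS`, and `RUNS` are hypothetical, the rank list is read off the post-M4b per-question detail in this commit, and a rank of 0 is assumed to mean "expected entry not in the returned top-5".

```python
from __future__ import annotations

from typing import Sequence


def hit_rate_pct(ranks: Sequence[int], k: int) -> int:
    """Percentage of questions whose expected entry lands at rank 1..k.

    A rank of 0 is treated as a miss (assumption: not in the top-5).
    """
    hits = sum(1 for r in ranks if 1 <= r <= k)
    return round(100 * hits / len(ranks))


# Ranks read off the post-M4b per-question detail, in file order.
POST_M4B_RANKS = [3, 2, 1, 3, 2, 0, 1, 1, 4, 1, 1, 3, 2, 2, 2, 1, 1, 0, 0, 3]

# (top-1 %, top-3 %) per run, copied from the commit message table.
RUNS = {
    "baseline": (20, 65),
    "post phase 1": (20, 70),
    "post M4": (30, 75),
    "post M4b": (35, 80),
}


def lift(before: str, after: str) -> tuple[int, int]:
    """Percentage-point change in (top-1, top-3) between two runs."""
    b, a = RUNS[before], RUNS[after]
    return (a[0] - b[0], a[1] - b[1])


if __name__ == "__main__":
    print(hit_rate_pct(POST_M4B_RANKS, 1), hit_rate_pct(POST_M4B_RANKS, 3))
    print(lift("baseline", "post M4b"))
    print(lift("post phase 1", "post M4b"))
```

Comparing against both the baseline and the post-phase-1 run makes it explicit which starting point a quoted lift refers to.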
167
brain/eval/post-m4.txt
Normal file
@@ -0,0 +1,167 @@
# post-m4-tier-weighting — 20 questions, k=5

top-1 hit rate: 6/20 = 30%
top-3 hit rate: 15/20 = 75%

## per-question detail

· rank=3 expected=dex-in-memory-storage-wipes-oauth-tokens-on-every-pod-restart
q: how do I stop dex from logging users out on every pod restart?
1. homelab-network-perimeter-model
2. 2026-05-12-koala-machine-state
3. dex-in-memory-storage-wipes-oauth-tokens-on-every-pod-restart <-- expected
4. infra-litellm-absorption-2026-05-16
5. k8s-configmap-mount-no-reload-needs-pod-restart

· rank=2 expected=postgres-least-privilege-migration-tenant-grant-bypass-2026-05
q: my postgres-exporter broke after revoking PUBLIC CONNECT — why?
1. infra-litellm-absorption-2026-05-16
2. postgres-least-privilege-migration-tenant-grant-bypass-2026-05 <-- expected
3. extension-version-lags-platform-major-upgrade
4. ntfy-deny-all-rollout-ordering-keep-alert-pipeline-live-during-auth-flip
5. gitea-push-mirror-cannot-create-remote-repo-needs-pre-existing-github-repo

★ rank=1 expected=homelab-network-perimeter-model
q: when is a NodePort acceptable vs needing a public ingress with bearer gate?
1. homelab-network-perimeter-model <-- expected
2. qwen3-thinking-model-empty-content-trap
3. mcpclient-empty-token-silent-401-envfrom-missing-key
4. 2026-05-12-koala-machine-state
5. koala-llama-swap-native-tool-calls-survey-2026-05

· rank=3 expected=exit-255-unknown-reason-not-oom
q: what does container exit code 255 with reason Unknown mean?
1. qwen3-thinking-model-empty-content-trap
2. infra-litellm-absorption-2026-05-16
3. exit-255-unknown-reason-not-oom <-- expected
4. mcpclient-empty-token-silent-401-envfrom-missing-key
5. koala-llama-swap-native-tool-calls-survey-2026-05

· rank=2 expected=gitea-push-mirror-cannot-create-remote-repo-needs-pre-existing-github-repo
q: can gitea push-mirror create the github repo automatically?
1. infra-litellm-absorption-2026-05-16
2. gitea-push-mirror-cannot-create-remote-repo-needs-pre-existing-github-repo <-- expected
3. adr-new-project-gitea-first-github-mirror
4. adr-github-as-primary-remote
5. 2026-05-12-koala-machine-state

✗ rank=0 expected=flux-healthcheck-stale-on-resource-removal
q: a flux kustomization is stuck after I removed a resource — why?
1. qwen3-thinking-model-empty-content-trap
2. 2026-05-12-koala-machine-state
3. homelab-architecture-principles-2026-05
4. k8s-configmap-mount-no-reload-needs-pod-restart
5. training-on-rtx-5070-pretraining-vs-finetuning

★ rank=1 expected=go-bytes-buffer-bytes-reset-aliasing-trap
q: the bytes buffer aliasing trap with Reset in a loop — what's the bug?
1. go-bytes-buffer-bytes-reset-aliasing-trap <-- expected
2. homelab-security-chains-not-bugs
3. Financial Sentiment Analysis on Stock Market Headlines With FinBERT & HuggingFace
4. training-on-rtx-5070-pretraining-vs-finetuning
5. flux-healthcheck-stale-on-resource-removal

★ rank=1 expected=homelab-architecture-principles-2026-05
q: what are the homelab architecture principles from may 2026?
1. homelab-architecture-principles-2026-05 <-- expected
2. homelab-network-perimeter-model
3. homelab-core-glossary
4. 2026-05-12-koala-machine-state
5. pattern-reddit-tmux-multiagent-conductor

? rank=4 expected=2026-05-04-sops-age-key-from-flux-cluster
q: where does the sops age private key live in the cluster?
1. 2026-05-12-koala-machine-state
2. homelab-network-perimeter-model
3. dex-in-memory-storage-wipes-oauth-tokens-on-every-pod-restart
4. 2026-05-04-sops-age-key-from-flux-cluster <-- expected
5. homelab-security-chains-not-bugs

★ rank=1 expected=grafana-dashboards-as-code-not-ui-state
q: why do my grafana dashboards disappear after a pod restart?
1. grafana-dashboards-as-code-not-ui-state <-- expected
2. infra-litellm-absorption-2026-05-16
3. 2026-05-12-koala-machine-state
4. dex-in-memory-storage-wipes-oauth-tokens-on-every-pod-restart
5. k8s-configmap-mount-no-reload-needs-pod-restart

★ rank=1 expected=double-diamond-methodology
q: what is the double diamond methodology?
1. double-diamond-methodology <-- expected
2. unified-methodology-diamond-futures-autoresearch
3. futures-thinking-extended-double-diamond
4. insight-exploration-as-diamond-1
5. workflow-idea-to-running-service

· rank=3 expected=2026-05-04-mcp-transport-version-claude-ai-strict
q: my MCP server works from claude code but fails on claude.ai — what's different?
1. qwen3-thinking-model-empty-content-trap
2. mcp-resource-url-empty-breaks-claude-ai-discovery-silently
3. 2026-05-04-mcp-transport-version-claude-ai-strict <-- expected
4. 2026-05-04-claude-ai-custom-mcp-connectors
5. finding-github-mcp-claudeai-vs-claudecode

· rank=2 expected=homelab-security-chains-not-bugs
q: how should I rate security findings — isolated bugs or exploit chains?
1. homelab-network-perimeter-model
2. homelab-security-chains-not-bugs <-- expected
3. policy-audit-mode-blocks-nothing
4. homelab-document-accepted-risk-to-break-audit-cycle
5. audit-shortcut-tls-blocks-zero-equals-edge-only

· rank=2 expected=2026-05-03-canonical-vs-derived-context-flow
q: how should canonical context files relate to derived adapter files?
1. qwen3-thinking-model-empty-content-trap
2. 2026-05-03-canonical-vs-derived-context-flow <-- expected
3. 2026-05-12-koala-machine-state
4. 2026-05-04-claude-ai-custom-mcp-connectors
5. koala-llama-swap-native-tool-calls-survey-2026-05

· rank=2 expected=homelab-core-glossary
q: what is the homelab core vocabulary glossary?
1. homelab-architecture-principles-2026-05
2. homelab-core-glossary <-- expected
3. 2026-05-12-koala-machine-state
4. flux-kustomization-depends-on-bootstrap-ordering
5. brain-ingest-ntfy-service

★ rank=1 expected=koala-llama-swap-native-tool-calls-survey-2026-05
q: which models on koala llama-swap actually emit native tool_calls correctly?
1. koala-llama-swap-native-tool-calls-survey-2026-05 <-- expected
2. 2026-05-12-koala-machine-state
3. infra-litellm-absorption-2026-05-16
4. training-on-rtx-5070-pretraining-vs-finetuning
5. qwen3-thinking-model-empty-content-trap

✗ rank=0 expected=qwen35-9b-fast
q: what is qwen35-9b-fast and what's it used for?
1. koala-llama-swap-native-tool-calls-survey-2026-05
2. qwen3-thinking-model-empty-content-trap
3. infra-litellm-absorption-2026-05-16
4. 2026-05-12-koala-machine-state
5. index

✗ rank=0 expected=go-defer-errcheck-body-close
q: in go, how do I prevent defer body close from silently dropping errors?
1. homelab-network-perimeter-model
2. infra-litellm-absorption-2026-05-16
3. go-bytes-buffer-bytes-reset-aliasing-trap
4. mcpclient-empty-token-silent-401-envfrom-missing-key
5. koala-llama-swap-native-tool-calls-survey-2026-05

✗ rank=0 expected=hyperguild-level3-pipeline-rewrite
q: what was the level 3 rewrite of hyperguild's ingestion pipeline?
1. 2026-05-12-koala-machine-state
2. homelab-core-glossary
3. koala-llama-swap-native-tool-calls-survey-2026-05
4. infra-litellm-absorption-2026-05-16
5. homelab-architecture-principles-2026-05

· rank=3 expected=adr-new-project-gitea-first-github-mirror
q: what's the new-project ADR — is it gitea-first or github-first?
1. gitea-push-mirror-cannot-create-remote-repo-needs-pre-existing-github-repo
2. mcp-tool-design-get-needs-list-partner
3. adr-new-project-gitea-first-github-mirror <-- expected
4. 2026-05-04-gitea-mcp-build-session
5. adr-local-dev-vs-hyperguild-new-project

167
brain/eval/post-m4b.txt
Normal file
@@ -0,0 +1,167 @@
# post-m4b-entities-promoted — 20 questions, k=5

top-1 hit rate: 7/20 = 35%
top-3 hit rate: 16/20 = 80%

## per-question detail

· rank=3 expected=dex-in-memory-storage-wipes-oauth-tokens-on-every-pod-restart
q: how do I stop dex from logging users out on every pod restart?
1. homelab-network-perimeter-model
2. 2026-05-12-koala-machine-state
3. dex-in-memory-storage-wipes-oauth-tokens-on-every-pod-restart <-- expected
4. infra-litellm-absorption-2026-05-16
5. k8s-configmap-mount-no-reload-needs-pod-restart

· rank=2 expected=postgres-least-privilege-migration-tenant-grant-bypass-2026-05
q: my postgres-exporter broke after revoking PUBLIC CONNECT — why?
1. infra-litellm-absorption-2026-05-16
2. postgres-least-privilege-migration-tenant-grant-bypass-2026-05 <-- expected
3. extension-version-lags-platform-major-upgrade
4. ntfy-deny-all-rollout-ordering-keep-alert-pipeline-live-during-auth-flip
5. gitea-push-mirror-cannot-create-remote-repo-needs-pre-existing-github-repo

★ rank=1 expected=homelab-network-perimeter-model
q: when is a NodePort acceptable vs needing a public ingress with bearer gate?
1. homelab-network-perimeter-model <-- expected
2. qwen3-thinking-model-empty-content-trap
3. mcpclient-empty-token-silent-401-envfrom-missing-key
4. 2026-05-12-koala-machine-state
5. koala-llama-swap-native-tool-calls-survey-2026-05

· rank=3 expected=exit-255-unknown-reason-not-oom
q: what does container exit code 255 with reason Unknown mean?
1. qwen3-thinking-model-empty-content-trap
2. infra-litellm-absorption-2026-05-16
3. exit-255-unknown-reason-not-oom <-- expected
4. mcpclient-empty-token-silent-401-envfrom-missing-key
5. koala-llama-swap-native-tool-calls-survey-2026-05

· rank=2 expected=gitea-push-mirror-cannot-create-remote-repo-needs-pre-existing-github-repo
q: can gitea push-mirror create the github repo automatically?
1. infra-litellm-absorption-2026-05-16
2. gitea-push-mirror-cannot-create-remote-repo-needs-pre-existing-github-repo <-- expected
3. adr-new-project-gitea-first-github-mirror
4. adr-github-as-primary-remote
5. 2026-05-12-koala-machine-state

✗ rank=0 expected=flux-healthcheck-stale-on-resource-removal
q: a flux kustomization is stuck after I removed a resource — why?
1. qwen3-thinking-model-empty-content-trap
2. 2026-05-12-koala-machine-state
3. homelab-architecture-principles-2026-05
4. k8s-configmap-mount-no-reload-needs-pod-restart
5. training-on-rtx-5070-pretraining-vs-finetuning

★ rank=1 expected=go-bytes-buffer-bytes-reset-aliasing-trap
q: the bytes buffer aliasing trap with Reset in a loop — what's the bug?
1. go-bytes-buffer-bytes-reset-aliasing-trap <-- expected
2. homelab-security-chains-not-bugs
3. Financial Sentiment Analysis on Stock Market Headlines With FinBERT & HuggingFace
4. training-on-rtx-5070-pretraining-vs-finetuning
5. flux-healthcheck-stale-on-resource-removal

★ rank=1 expected=homelab-architecture-principles-2026-05
q: what are the homelab architecture principles from may 2026?
1. homelab-architecture-principles-2026-05 <-- expected
2. homelab-network-perimeter-model
3. homelab-core-glossary
4. 2026-05-12-koala-machine-state
5. pattern-reddit-tmux-multiagent-conductor

? rank=4 expected=2026-05-04-sops-age-key-from-flux-cluster
q: where does the sops age private key live in the cluster?
1. 2026-05-12-koala-machine-state
2. homelab-network-perimeter-model
3. dex-in-memory-storage-wipes-oauth-tokens-on-every-pod-restart
4. 2026-05-04-sops-age-key-from-flux-cluster <-- expected
5. homelab-security-chains-not-bugs

★ rank=1 expected=grafana-dashboards-as-code-not-ui-state
q: why do my grafana dashboards disappear after a pod restart?
1. grafana-dashboards-as-code-not-ui-state <-- expected
2. infra-litellm-absorption-2026-05-16
3. 2026-05-12-koala-machine-state
4. dex-in-memory-storage-wipes-oauth-tokens-on-every-pod-restart
5. k8s-configmap-mount-no-reload-needs-pod-restart

★ rank=1 expected=double-diamond-methodology
q: what is the double diamond methodology?
1. double-diamond-methodology <-- expected
2. unified-methodology-diamond-futures-autoresearch
3. futures-thinking-extended-double-diamond
4. insight-exploration-as-diamond-1
5. workflow-idea-to-running-service

· rank=3 expected=2026-05-04-mcp-transport-version-claude-ai-strict
q: my MCP server works from claude code but fails on claude.ai — what's different?
1. qwen3-thinking-model-empty-content-trap
2. mcp-resource-url-empty-breaks-claude-ai-discovery-silently
3. 2026-05-04-mcp-transport-version-claude-ai-strict <-- expected
4. 2026-05-04-claude-ai-custom-mcp-connectors
5. finding-github-mcp-claudeai-vs-claudecode

· rank=2 expected=homelab-security-chains-not-bugs
q: how should I rate security findings — isolated bugs or exploit chains?
1. homelab-network-perimeter-model
2. homelab-security-chains-not-bugs <-- expected
3. policy-audit-mode-blocks-nothing
4. homelab-document-accepted-risk-to-break-audit-cycle
5. audit-shortcut-tls-blocks-zero-equals-edge-only

· rank=2 expected=2026-05-03-canonical-vs-derived-context-flow
q: how should canonical context files relate to derived adapter files?
1. qwen3-thinking-model-empty-content-trap
2. 2026-05-03-canonical-vs-derived-context-flow <-- expected
3. 2026-05-12-koala-machine-state
4. 2026-05-04-claude-ai-custom-mcp-connectors
5. koala-llama-swap-native-tool-calls-survey-2026-05

· rank=2 expected=homelab-core-glossary
q: what is the homelab core vocabulary glossary?
1. homelab-architecture-principles-2026-05
2. homelab-core-glossary <-- expected
3. 2026-05-12-koala-machine-state
4. qwen35-9b-fast
5. flux-kustomization-depends-on-bootstrap-ordering

★ rank=1 expected=koala-llama-swap-native-tool-calls-survey-2026-05
q: which models on koala llama-swap actually emit native tool_calls correctly?
1. koala-llama-swap-native-tool-calls-survey-2026-05 <-- expected
2. 2026-05-12-koala-machine-state
3. infra-litellm-absorption-2026-05-16
4. training-on-rtx-5070-pretraining-vs-finetuning
5. qwen3-thinking-model-empty-content-trap

★ rank=1 expected=qwen35-9b-fast
q: what is qwen35-9b-fast and what's it used for?
1. qwen35-9b-fast <-- expected
2. koala-llama-swap-native-tool-calls-survey-2026-05
3. qwen3-thinking-model-empty-content-trap
4. infra-litellm-absorption-2026-05-16
5. 2026-05-12-koala-machine-state

✗ rank=0 expected=go-defer-errcheck-body-close
q: in go, how do I prevent defer body close from silently dropping errors?
1. homelab-network-perimeter-model
2. infra-litellm-absorption-2026-05-16
3. go-bytes-buffer-bytes-reset-aliasing-trap
4. mcpclient-empty-token-silent-401-envfrom-missing-key
5. koala-llama-swap-native-tool-calls-survey-2026-05

✗ rank=0 expected=hyperguild-level3-pipeline-rewrite
q: what was the level 3 rewrite of hyperguild's ingestion pipeline?
1. 2026-05-12-koala-machine-state
2. homelab-core-glossary
3. koala-llama-swap-native-tool-calls-survey-2026-05
4. infra-litellm-absorption-2026-05-16
5. homelab-architecture-principles-2026-05

· rank=3 expected=adr-new-project-gitea-first-github-mirror
q: what's the new-project ADR — is it gitea-first or github-first?
1. gitea-push-mirror-cannot-create-remote-repo-needs-pre-existing-github-repo
2. mcp-tool-design-get-needs-list-partner
3. adr-new-project-gitea-first-github-mirror <-- expected
4. 2026-05-04-gitea-mcp-build-session
5. adr-local-dev-vs-hyperguild-new-project

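To see which questions moved between the two runs, the `rank=N expected=slug` lines in the reports can be parsed and compared. The following is a rough sketch rather than part of the scorer: `parse_ranks`, `changed_ranks`, and the two excerpt constants are hypothetical names, and the regular expression assumes every per-question header keeps the `rank=<int> expected=<slug>` shape seen in these files.

```python
from __future__ import annotations

import re

# Matches the per-question header lines, e.g. "✗ rank=0 expected=qwen35-9b-fast".
RANK_LINE = re.compile(r"rank=(\d+)\s+expected=(\S+)")


def parse_ranks(report: str) -> dict[str, int]:
    """Map each expected entry slug to the rank recorded in a scorer report."""
    ranks: dict[str, int] = {}
    for line in report.splitlines():
        match = RANK_LINE.search(line)
        if match:
            ranks[match.group(2)] = int(match.group(1))
    return ranks


def changed_ranks(before: dict[str, int], after: dict[str, int]) -> dict[str, tuple[int, int]]:
    """Return {slug: (old_rank, new_rank)} for entries whose rank differs."""
    return {
        slug: (before[slug], after[slug])
        for slug in before
        if slug in after and before[slug] != after[slug]
    }


# Two-question excerpts copied from post-m4.txt and post-m4b.txt.
POST_M4_EXCERPT = """\
· rank=2 expected=homelab-core-glossary
q: what is the homelab core vocabulary glossary?
✗ rank=0 expected=qwen35-9b-fast
q: what is qwen35-9b-fast and what's it used for?
"""

POST_M4B_EXCERPT = """\
· rank=2 expected=homelab-core-glossary
q: what is the homelab core vocabulary glossary?
★ rank=1 expected=qwen35-9b-fast
q: what is qwen35-9b-fast and what's it used for?
"""

if __name__ == "__main__":
    diff = changed_ranks(parse_ranks(POST_M4_EXCERPT), parse_ranks(POST_M4B_EXCERPT))
    print(diff)
```

Run against the full files, the same comparison would show that the only rank change between M4 and M4b is qwen35-9b-fast moving from rank 0 to rank 1, which accounts for the +5pt step in both top-1 and top-3.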