Phase 19 — Tool Registry Restructure
Upgrades the tool registry from keyword-based discovery to semantic vector search with quality tracking. Introduces per-tool quality metrics, LLM-based auto-rating after tool execution, degradation alerting with quarantine escalation, and a feedback CLI. Builds on existing infrastructure: tool_retry_stats (migration 0005/0013), tool_quarantine (migration 0020), internal/vectorstore/, and internal/embedding/.
Status: Not Started
Depends on: Phases 1-14 complete
Migrations: 0026_tool_metrics (Phase 19A)
Branch: dev
Why Now
With Phases 1-14 complete, Cruvero has a functional tool registry with keyword-based discovery and basic retry statistics — but tool selection and quality management have three structural problems:
- Keyword-only discovery: filterRegistryForPrompt in internal/agent/workflow.go uses token scoring and exact name matching to select tools. This misses semantic relationships: a prompt about "send notification" cannot discover tools named "email_dispatch" or "slack_post" unless those exact words appear.
- No quality signal: tool_retry_stats (migration 0005) records binary success/failure counters per tool, but there is no measure of output quality. A tool that succeeds on every call but returns low-quality results appears identical to one that returns high-quality results.
- No degradation alerting: tool_quarantine (migration 0020) provides binary quarantine from the immune system, but there is no graduated degradation awareness. A tool trending toward failure has no pre-quarantine warning mechanism.
Phase 19 solves all three by introducing semantic vector search for tool discovery, quality scoring with LLM auto-rating, and degradation alerting that integrates with the existing quarantine path.
Architecture
Extended registry package: internal/registry/
All new tool quality and search functionality lives in the existing internal/registry/ package. No new package is created — this extends the existing Store, ToolDefinition, and supporting types.
┌──────────────────────────────────────────────────────────────────┐
│                      registry.ToolSearcher                       │
│                                                                  │
│  ┌───────────────────┐  ┌──────────────────┐  ┌──────────────┐   │
│  │ Vector Retrieval  │  │ Quality          │  │ Result       │   │
│  │ (embed + search   │  │ Re-Ranking       │  │ Assembly     │   │
│  │  tool desc)       │  │ (quality score   │  │ (merge,      │   │
│  │                   │  │  + recency +     │  │  format)     │   │
│  │                   │  │  success rate)   │  │              │   │
│  └─────────┬─────────┘  └────────┬─────────┘  └──────┬───────┘   │
│            │                     │                   │           │
│            └─────────┬───────────┘                   │           │
│                      │                               │           │
│              3-Stage Pipeline                        │           │
│                                                                  │
│  External deps (reused, not owned):                              │
│  ├─ internal/embedding/Embedder                                  │
│  ├─ internal/vectorstore/VectorStore (collection:                │
│  │    "tool_registry")                                           │
│  └─ internal/tenant/ (multi-tenant isolation)                    │
│                                                                  │
│  Existing tables (extended, not replaced):                       │
│  ├─ tool_retry_stats (migration 0005/0013)                       │
│  └─ tool_quarantine (migration 0020)                             │
└──────────────────────────────────────────────────────────────────┘
Core API
// MetricsStore tracks mutable quality signals for tools.
type MetricsStore interface {
	RecordExecution(ctx context.Context, toolName string, outcome ExecutionOutcome) error
	RecordFeedback(ctx context.Context, toolName string, feedback ToolFeedback) error
	GetMetrics(ctx context.Context, toolName string) (ToolMetrics, error)
	ListDegraded(ctx context.Context, threshold float64) ([]ToolMetrics, error)
}

// ToolSearcher finds tools by semantic similarity + quality ranking.
type ToolSearcher interface {
	Search(ctx context.Context, query string, k int, filter *ToolSearchFilter) ([]ScoredTool, error)
}

// ToolIndexer manages vector embeddings for tool descriptions.
type ToolIndexer interface {
	IndexTool(ctx context.Context, tool ToolDefinition) error
	IndexRegistry(ctx context.Context, reg ToolRegistry) error
	RemoveTool(ctx context.Context, toolName string) error
}
Key Types
type ExecutionOutcome struct {
	ToolName   string  `json:"tool_name"`
	RunID      string  `json:"run_id"`
	StepIdx    int     `json:"step_idx"`
	Success    bool    `json:"success"`
	LatencyMs  int64   `json:"latency_ms"`
	LLMRating  float64 `json:"llm_rating"` // 0.0-1.0, from post-execution assessment
	ErrorClass string  `json:"error_class,omitempty"`
	TenantID   string  `json:"tenant_id"`
}

type ToolFeedback struct {
	ToolName string  `json:"tool_name"`
	UserID   string  `json:"user_id"`
	Rating   float64 `json:"rating"` // 0.0-1.0
	Comment  string  `json:"comment,omitempty"`
	TenantID string  `json:"tenant_id"`
}

type ToolMetrics struct {
	ToolName      string     `json:"tool_name"`
	TenantID      string     `json:"tenant_id"`
	TotalCalls    int        `json:"total_calls"`
	SuccessCount  int        `json:"success_count"`
	FailureCount  int        `json:"failure_count"`
	AvgLatencyMs  float64    `json:"avg_latency_ms"`
	AvgLLMRating  float64    `json:"avg_llm_rating"`
	QualityScore  float64    `json:"quality_score"` // composite: success_rate * avg_llm_rating
	LastCalledAt  time.Time  `json:"last_called_at"`
	DegradedSince *time.Time `json:"degraded_since,omitempty"`
}

type ScoredTool struct {
	Tool       ToolDefinition  `json:"tool"`
	Score      float64         `json:"score"`
	Components ScoreComponents `json:"components"`
}

type ScoreComponents struct {
	Similarity float64 `json:"similarity"`
	Quality    float64 `json:"quality"`
	Recency    float64 `json:"recency"`
}

type ToolSearchFilter struct {
	TenantID     string   `json:"tenant_id,omitempty"`
	ExcludeNames []string `json:"exclude_names,omitempty"`
	MinQuality   float64  `json:"min_quality,omitempty"`
}
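The QualityScore comment above defines the composite as success_rate * avg_llm_rating. A minimal sketch of that arithmetic follows; the helper name compositeQuality and the choice to score a tool with no recorded calls as 0 are assumptions, not part of the plan.

```go
package main

import "fmt"

// compositeQuality mirrors the QualityScore comment: success_rate * avg_llm_rating.
// Hypothetical helper. Assumption: a tool with no recorded calls scores 0.
func compositeQuality(successCount, totalCalls int, avgLLMRating float64) float64 {
	if totalCalls <= 0 {
		return 0
	}
	successRate := float64(successCount) / float64(totalCalls)
	return successRate * avgLLMRating
}

func main() {
	// Always succeeds, but the output is rated poorly.
	fmt.Printf("%.2f\n", compositeQuality(10, 10, 0.2)) // 0.20
	// Fails 20% of the time, but the output is rated well.
	fmt.Printf("%.2f\n", compositeQuality(8, 10, 0.9)) // 0.72
}
```

Under binary success counters alone, the first tool would rank above the second; the composite reverses that ordering.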
Search Pipeline
Three-stage pipeline, same pattern as Phase 18's prompt search:
Stage 1: Vector Retrieval
- Embed query text using embedding.Embedder.Embed() (internal/embedding/embedder.go:22)
- Search the tool_registry collection via vectorstore.VectorStore.Search() (internal/vectorstore/store.go:35)
- Apply tenant isolation filter
- Retrieve top-K candidates (default K=30)
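To make Stage 1 concrete, here is an in-memory stand-in for what the vector store does on the real path: cosine similarity against each indexed tool, a tenant filter, and a top-K cut. The types indexedTool and candidate, the function retrieveTopK, and the exact-match tenant rule are all assumptions for illustration; the actual Embedder and VectorStore signatures are not shown in this document.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// indexedTool is a hypothetical stand-in for one entry in the tool_registry collection.
type indexedTool struct {
	Name     string
	TenantID string
	Vector   []float64
}

// candidate is a Stage 1 result: a tool name plus its similarity to the query.
type candidate struct {
	Name       string
	Similarity float64
}

// cosine returns the cosine similarity of a and b, or 0 for mismatched or zero vectors.
func cosine(a, b []float64) float64 {
	if len(a) != len(b) || len(a) == 0 {
		return 0
	}
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// retrieveTopK keeps tools for the given tenant (assumption: exact match)
// and returns the k most similar, best first.
func retrieveTopK(query []float64, tenantID string, k int, index []indexedTool) []candidate {
	out := make([]candidate, 0, len(index))
	for _, t := range index {
		if t.TenantID != tenantID {
			continue
		}
		out = append(out, candidate{Name: t.Name, Similarity: cosine(query, t.Vector)})
	}
	sort.SliceStable(out, func(i, j int) bool { return out[i].Similarity > out[j].Similarity })
	if k >= 0 && len(out) > k {
		out = out[:k]
	}
	return out
}

func main() {
	index := []indexedTool{
		{Name: "email_dispatch", TenantID: "acme", Vector: []float64{0.9, 0.1}},
		{Name: "slack_post", TenantID: "acme", Vector: []float64{0.8, 0.3}},
		{Name: "csv_parse", TenantID: "acme", Vector: []float64{0.0, 1.0}},
		{Name: "sms_send", TenantID: "other", Vector: []float64{1.0, 0.0}},
	}
	// Pretend {1, 0} is the embedding of "send notification".
	for _, c := range retrieveTopK([]float64{1, 0}, "acme", 2, index) {
		fmt.Printf("%s %.3f\n", c.Name, c.Similarity)
	}
}
```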
Stage 2: Quality Re-Ranking
Score each candidate using a weighted formula:
score = W_sim * similarity + W_qual * quality + W_rec * recency
| Weight | Default | Source |
|---|---|---|
| W_sim (similarity) | 0.5 | Vector cosine similarity from Stage 1 |
| W_qual (quality) | 0.35 | success_rate * avg_llm_rating (ToolMetrics.QualityScore) |
| W_rec (recency) | 0.15 | Recency decay from last successful call |
Tools with active quarantine entries (tool_quarantine where released_at IS NULL AND (expires_at IS NULL OR expires_at > NOW())) are excluded from results.
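The Stage 2 formula and the quarantine predicate can be sketched as follows. The weights are the documented defaults. The recency function is an assumption: the plan only says "recency decay", so this sketch uses an exponential half-life, and the names weights, recencyDecay, compositeScore, and quarantine are hypothetical.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// weights holds the Stage 2 ranking weights.
type weights struct{ Sim, Qual, Rec float64 }

// defaultWeights are the documented defaults (0.5 / 0.35 / 0.15).
var defaultWeights = weights{Sim: 0.5, Qual: 0.35, Rec: 0.15}

// compositeScore implements score = W_sim*similarity + W_qual*quality + W_rec*recency.
func compositeScore(w weights, similarity, quality, recency float64) float64 {
	return w.Sim*similarity + w.Qual*quality + w.Rec*recency
}

// recencyDecay maps time since the last successful call to (0, 1].
// Assumption: exponential decay with a configurable half-life.
func recencyDecay(age, halfLife time.Duration) float64 {
	if halfLife <= 0 {
		return 0
	}
	if age < 0 {
		age = 0
	}
	return math.Exp2(-float64(age) / float64(halfLife))
}

// quarantine mirrors the two tool_quarantine columns used by the exclusion rule.
type quarantine struct {
	ReleasedAt *time.Time
	ExpiresAt  *time.Time
}

// active reports released_at IS NULL AND (expires_at IS NULL OR expires_at > now).
func (q quarantine) active(now time.Time) bool {
	if q.ReleasedAt != nil {
		return false
	}
	return q.ExpiresAt == nil || q.ExpiresAt.After(now)
}

func main() {
	now := time.Date(2025, 1, 15, 12, 0, 0, 0, time.UTC)
	week := 7 * 24 * time.Hour

	rec := recencyDecay(week, week) // one half-life ago
	fmt.Printf("recency=%.2f score=%.3f\n", rec, compositeScore(defaultWeights, 0.8, 0.72, rec))

	expired := now.Add(-time.Hour)
	fmt.Println(quarantine{}.active(now))                    // never released, no expiry
	fmt.Println(quarantine{ExpiresAt: &expired}.active(now)) // expiry already passed
}
```

With similarity 0.8, quality 0.72, and recency 0.5, the score is 0.5*0.8 + 0.35*0.72 + 0.15*0.5 = 0.727.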
Stage 3: Result Assembly
- Sort by composite score
- Truncate to requested limit (default 20)
- Return []ScoredTool with score components for transparency
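Stage 3 is a sort and a cut. The sketch below assumes ties are broken by tool name so that results are deterministic; the plan does not specify tie-breaking, and the scored type is a simplified stand-in for ScoredTool.

```go
package main

import (
	"fmt"
	"sort"
)

// scored is a simplified stand-in for ScoredTool.
type scored struct {
	Name  string
	Score float64
}

// assemble returns a copy of in, sorted by descending score and cut to limit.
// Assumption: equal scores are ordered by name for determinism.
func assemble(in []scored, limit int) []scored {
	out := make([]scored, len(in))
	copy(out, in)
	sort.Slice(out, func(i, j int) bool {
		if out[i].Score != out[j].Score {
			return out[i].Score > out[j].Score
		}
		return out[i].Name < out[j].Name
	})
	if limit >= 0 && limit < len(out) {
		out = out[:limit]
	}
	return out
}

func main() {
	in := []scored{{"csv_parse", 0.31}, {"slack_post", 0.70}, {"email_dispatch", 0.73}}
	for _, s := range assemble(in, 2) {
		fmt.Printf("%s %.2f\n", s.Name, s.Score)
	}
}
```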
Quality Tracking
LLM Auto-Rating
After each tool execution in ToolExecuteActivity, a non-blocking Temporal activity records an ExecutionOutcome including:
- Binary success/failure (existing)
- Execution latency
- LLM quality rating (0.0-1.0) from a post-execution assessment prompt
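The LLM rating uses a lightweight prompt asking the model to rate tool output relevance and correctness on a 0-1 scale. It runs as a separate activity with a short timeout (5s by default, set by CRUVERO_TOOL_QUALITY_RATING_TIMEOUT) and fire-and-forget semantics, so tool execution never waits on the rating.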
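A rough sketch of the rating step in plain Go, not the Temporal SDK: bound the model call with a timeout, then validate the reply as a number in [0, 1]. The names parseRating and rateWithTimeout are hypothetical, and rejecting out-of-range replies (rather than clamping them) is an assumption.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"math"
	"strconv"
	"strings"
	"time"
)

var errBadRating = errors.New("rating is not a number in [0, 1]")

// parseRating validates a model reply such as " 0.85\n".
// Assumption: out-of-range or non-numeric replies are rejected, not clamped.
func parseRating(reply string) (float64, error) {
	v, err := strconv.ParseFloat(strings.TrimSpace(reply), 64)
	if err != nil || math.IsNaN(v) || v < 0 || v > 1 {
		return 0, errBadRating
	}
	return v, nil
}

// rateWithTimeout runs rate under a deadline and parses its reply.
// The buffered channel lets the goroutine finish even if the caller has moved on.
func rateWithTimeout(parent context.Context, timeout time.Duration,
	rate func(context.Context) (string, error)) (float64, error) {
	ctx, cancel := context.WithTimeout(parent, timeout)
	defer cancel()

	type result struct {
		reply string
		err   error
	}
	ch := make(chan result, 1)
	go func() {
		reply, err := rate(ctx)
		ch <- result{reply: reply, err: err}
	}()

	select {
	case <-ctx.Done():
		return 0, ctx.Err()
	case r := <-ch:
		if r.err != nil {
			return 0, r.err
		}
		return parseRating(r.reply)
	}
}

func main() {
	fast := func(ctx context.Context) (string, error) { return " 0.85\n", nil }
	slow := func(ctx context.Context) (string, error) {
		select {
		case <-time.After(time.Second):
			return "0.9", nil
		case <-ctx.Done():
			return "", ctx.Err()
		}
	}
	fmt.Println(rateWithTimeout(context.Background(), 50*time.Millisecond, fast))
	fmt.Println(rateWithTimeout(context.Background(), 50*time.Millisecond, slow))
}
```

A failed or timed-out rating should leave the execution record without an LLM rating rather than fail the tool call.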
Degradation Detection
A periodic activity, or an inline check during ToolExecuteActivity, computes a rolling quality score. When the score drops below a configurable threshold (CRUVERO_TOOL_QUALITY_DEGRADE_THRESHOLD, default 0.3):
- Warning: log a structured warning and emit a NATS event (if Phase 12 is active), or fall back to a memory episode
- Alert: set the degraded_since timestamp in tool_retry_stats (column added by migration 0026_tool_metrics)
- Quarantine escalation: if quality stays below the threshold for N consecutive calls (CRUVERO_TOOL_QUALITY_QUARANTINE_AFTER, default 5), insert into the existing tool_quarantine table (migration 0020) with a reason referencing quality degradation
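The escalation path above can be read as a small state machine. The sketch below is one possible reading, not the planned QualityTracker: the type degradationTracker and its action values are hypothetical, and it assumes that a single call at or above the threshold resets the consecutive counter and clears degraded_since.

```go
package main

import (
	"fmt"
	"time"
)

type action string

const (
	actionNone       action = "none"
	actionWarn       action = "warn"
	actionQuarantine action = "quarantine"
)

// degradationTracker is a hypothetical per-tool tracker.
type degradationTracker struct {
	Threshold       float64          // e.g. 0.3
	QuarantineAfter int              // e.g. 5 consecutive degraded calls
	Now             func() time.Time // nil means time.Now

	consecutive   int
	degradedSince *time.Time
}

// observe folds in the latest rolling quality score and returns the action to take.
// Assumption: one healthy observation resets the streak and clears degradedSince.
func (d *degradationTracker) observe(quality float64) action {
	if quality >= d.Threshold {
		d.consecutive = 0
		d.degradedSince = nil
		return actionNone
	}
	d.consecutive++
	if d.degradedSince == nil {
		now := time.Now
		if d.Now != nil {
			now = d.Now
		}
		t := now()
		d.degradedSince = &t
	}
	if d.QuarantineAfter > 0 && d.consecutive >= d.QuarantineAfter {
		return actionQuarantine
	}
	return actionWarn
}

func main() {
	d := &degradationTracker{Threshold: 0.3, QuarantineAfter: 5}
	for _, q := range []float64{0.8, 0.2, 0.25, 0.1, 0.2, 0.15, 0.9} {
		fmt.Printf("quality=%.2f -> %s\n", q, d.observe(q))
	}
}
```

In this run the first score is healthy, the next four produce warnings, the fifth consecutive degraded score triggers quarantine, and the final healthy score resets the tracker.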
Backward Compatibility
filterRegistryForPrompt in internal/agent/workflow.go is updated to use vector search when CRUVERO_TOOL_SEARCH_SEMANTIC=true, falling back to the existing keyword scoring when disabled or when the vector store is unavailable. The function signature remains unchanged.
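The fallback rule reduces to: try semantic search only when the flag is on, and use keyword scoring if the flag is off or the semantic path returns an error. The sketch below shows that control flow only. The actual filterRegistryForPrompt signature is not given here, so toolSelector, selectTools, and errVectorStoreUnavailable are hypothetical, and context plumbing is omitted for brevity.

```go
package main

import (
	"errors"
	"fmt"
)

// toolSelector is a hypothetical signature for "pick tool names for a prompt".
type toolSelector func(prompt string) ([]string, error)

var errVectorStoreUnavailable = errors.New("vector store unavailable")

// selectTools prefers semantic search when enabled and falls back to keyword
// scoring when it is disabled or fails. The second result names the path taken.
func selectTools(prompt string, semanticEnabled bool, semantic, keyword toolSelector) ([]string, string, error) {
	if semanticEnabled && semantic != nil {
		if tools, err := semantic(prompt); err == nil {
			return tools, "semantic", nil
		}
	}
	tools, err := keyword(prompt)
	return tools, "keyword", err
}

func main() {
	semantic := func(string) ([]string, error) { return []string{"email_dispatch", "slack_post"}, nil }
	broken := func(string) ([]string, error) { return nil, errVectorStoreUnavailable }
	keyword := func(string) ([]string, error) { return []string{"send_notification"}, nil }

	fmt.Println(selectTools("send notification", true, semantic, keyword))
	fmt.Println(selectTools("send notification", true, broken, keyword))
	fmt.Println(selectTools("send notification", false, semantic, keyword))
}
```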
Sub-Phases
| Sub-Phase | Name | Prompts | Depends On |
|---|---|---|---|
| 19A | Foundation: MetricsStore, Types, Migration | 4 | — |
| 19B | Vector Indexing + Semantic Search | 4 | 19A |
| 19C | Quality Tracking + Degradation Alerting | 4 | 19B |
| 19D | CLI, Agent Discovery Integration, Testing | 4 | 19C |
Total: 4 sub-phases, 16 prompts, 9 documentation files
Dependency Graph
19A (Foundation) → 19B (Vector Search) → 19C (Quality Tracking) → 19D (CLI/Integration)
Strictly sequential: each sub-phase builds on the previous.
Environment Variables
| Variable | Default | Description |
|---|---|---|
| CRUVERO_TOOL_SEARCH_SEMANTIC | false | Enable semantic vector search for tool discovery |
| CRUVERO_TOOL_SEARCH_COLLECTION | tool_registry | Vector store collection name |
| CRUVERO_TOOL_SEARCH_K | 30 | Vector retrieval candidates (Stage 1) |
| CRUVERO_TOOL_SEARCH_RESULT_LIMIT | 20 | Max tools returned to agent |
| CRUVERO_TOOL_SEARCH_W_SIMILARITY | 0.5 | Ranking weight: vector similarity |
| CRUVERO_TOOL_SEARCH_W_QUALITY | 0.35 | Ranking weight: quality score |
| CRUVERO_TOOL_SEARCH_W_RECENCY | 0.15 | Ranking weight: recency decay |
| CRUVERO_TOOL_QUALITY_ENABLED | true | Enable quality tracking and LLM auto-rating |
| CRUVERO_TOOL_QUALITY_RATING_TIMEOUT | 5s | Timeout for LLM auto-rating activity |
| CRUVERO_TOOL_QUALITY_DEGRADE_THRESHOLD | 0.3 | Quality score below which a tool is considered degraded |
| CRUVERO_TOOL_QUALITY_QUARANTINE_AFTER | 5 | Consecutive degraded calls before quarantine escalation |
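One way these variables could be wired into a config struct, with the documented defaults applied when a variable is unset. This is a sketch only: toolSearchConfig, envOr, and loadToolSearchConfig are hypothetical names, and the real wiring in internal/config/config.go and search_config.go may differ. The loader takes a lookup function so it can be exercised without touching the process environment; it requires Go 1.20 or later for errors.Join.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"strconv"
	"time"
)

// toolSearchConfig is a hypothetical holder for the Phase 19 settings.
type toolSearchConfig struct {
	Semantic         bool
	Collection       string
	K                int
	ResultLimit      int
	WSimilarity      float64
	WQuality         float64
	WRecency         float64
	QualityEnabled   bool
	RatingTimeout    time.Duration
	DegradeThreshold float64
	QuarantineAfter  int
}

// lookupFunc has the same shape as os.LookupEnv.
type lookupFunc func(key string) (string, bool)

// envOr returns the parsed value of key, or def when unset, empty, or invalid.
// Parse failures are collected in errs instead of aborting.
func envOr[T any](lookup lookupFunc, errs *[]error, key string, def T, parse func(string) (T, error)) T {
	raw, ok := lookup(key)
	if !ok || raw == "" {
		return def
	}
	v, err := parse(raw)
	if err != nil {
		*errs = append(*errs, fmt.Errorf("%s=%q: %w", key, raw, err))
		return def
	}
	return v
}

func parseString(s string) (string, error) { return s, nil }
func parseFloat(s string) (float64, error) { return strconv.ParseFloat(s, 64) }

// loadToolSearchConfig reads every variable in the table above.
func loadToolSearchConfig(lookup lookupFunc) (toolSearchConfig, error) {
	var errs []error
	cfg := toolSearchConfig{
		Semantic:         envOr(lookup, &errs, "CRUVERO_TOOL_SEARCH_SEMANTIC", false, strconv.ParseBool),
		Collection:       envOr(lookup, &errs, "CRUVERO_TOOL_SEARCH_COLLECTION", "tool_registry", parseString),
		K:                envOr(lookup, &errs, "CRUVERO_TOOL_SEARCH_K", 30, strconv.Atoi),
		ResultLimit:      envOr(lookup, &errs, "CRUVERO_TOOL_SEARCH_RESULT_LIMIT", 20, strconv.Atoi),
		WSimilarity:      envOr(lookup, &errs, "CRUVERO_TOOL_SEARCH_W_SIMILARITY", 0.5, parseFloat),
		WQuality:         envOr(lookup, &errs, "CRUVERO_TOOL_SEARCH_W_QUALITY", 0.35, parseFloat),
		WRecency:         envOr(lookup, &errs, "CRUVERO_TOOL_SEARCH_W_RECENCY", 0.15, parseFloat),
		QualityEnabled:   envOr(lookup, &errs, "CRUVERO_TOOL_QUALITY_ENABLED", true, strconv.ParseBool),
		RatingTimeout:    envOr(lookup, &errs, "CRUVERO_TOOL_QUALITY_RATING_TIMEOUT", 5*time.Second, time.ParseDuration),
		DegradeThreshold: envOr(lookup, &errs, "CRUVERO_TOOL_QUALITY_DEGRADE_THRESHOLD", 0.3, parseFloat),
		QuarantineAfter:  envOr(lookup, &errs, "CRUVERO_TOOL_QUALITY_QUARANTINE_AFTER", 5, strconv.Atoi),
	}
	return cfg, errors.Join(errs...)
}

func main() {
	cfg, err := loadToolSearchConfig(os.LookupEnv)
	fmt.Printf("%+v err=%v\n", cfg, err)
}
```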
Files Overview
New Files
| File | Sub-Phase | Description |
|---|---|---|
| internal/registry/metrics_types.go | 19A | ExecutionOutcome, ToolFeedback, ToolMetrics, ScoredTool, ScoreComponents |
| internal/registry/metrics_store.go | 19A | MetricsStore interface + PostgresMetricsStore |
| internal/registry/tool_indexer.go | 19B | ToolIndexer interface + DefaultToolIndexer |
| internal/registry/tool_searcher.go | 19B | ToolSearcher interface + DefaultToolSearcher (3-stage pipeline) |
| internal/registry/scorer.go | 19B | ToolScorer (ranking formula, weight config) |
| internal/registry/quality.go | 19C | QualityTracker, degradation detection, quarantine escalation |
| internal/registry/rating.go | 19C | LLM auto-rating prompt + activity |
| internal/registry/search_config.go | 19B | Search config wiring from env vars |
| cmd/tool-feedback/main.go | 19D | CLI to submit tool quality feedback |
| migrations/0026_tool_metrics.up.sql | 19A | Extend tool quality tracking tables |
| migrations/0026_tool_metrics.down.sql | 19A | Reverse migration |
| internal/registry/metrics_types_test.go | 19D | Type validation tests |
| internal/registry/metrics_store_test.go | 19D | PostgresMetricsStore tests (sqlmock) |
| internal/registry/tool_indexer_test.go | 19D | Indexer tests (mock embedder + vector store) |
| internal/registry/tool_searcher_test.go | 19D | Searcher pipeline tests |
| internal/registry/scorer_test.go | 19D | Scorer tests |
| internal/registry/quality_test.go | 19D | Quality tracking + degradation tests |
| internal/registry/rating_test.go | 19D | LLM rating tests |
Modified Files
| File | Sub-Phase | Change |
|---|---|---|
| internal/agent/activities.go | 19C | Wire quality recording in ToolExecuteActivity |
| internal/agent/workflow.go | 19D | Update filterRegistryForPrompt for semantic search fallback |
| internal/config/config.go | 19A | Add tool search/quality config fields |
| cmd/seed-registry/main.go | 19B | Add vector indexing after registry seed |
Migration: 0026_tool_metrics
-- 0026_tool_metrics.up.sql

-- Extend tool_retry_stats with quality tracking columns
ALTER TABLE tool_retry_stats
    ADD COLUMN IF NOT EXISTS total_calls INTEGER NOT NULL DEFAULT 0,
    ADD COLUMN IF NOT EXISTS avg_latency_ms DOUBLE PRECISION NOT NULL DEFAULT 0,
    ADD COLUMN IF NOT EXISTS total_rating DOUBLE PRECISION NOT NULL DEFAULT 0,
    ADD COLUMN IF NOT EXISTS rating_count INTEGER NOT NULL DEFAULT 0,
    ADD COLUMN IF NOT EXISTS quality_score DOUBLE PRECISION NOT NULL DEFAULT 0,
    ADD COLUMN IF NOT EXISTS last_called_at TIMESTAMPTZ,
    ADD COLUMN IF NOT EXISTS degraded_since TIMESTAMPTZ;

-- Backfill total_calls from existing success + failure counts
UPDATE tool_retry_stats
SET total_calls = successes + failures
WHERE total_calls = 0 AND (successes > 0 OR failures > 0);

-- Tool feedback table for user-submitted ratings
CREATE TABLE IF NOT EXISTS tool_feedback (
    id BIGSERIAL PRIMARY KEY,
    tenant_id TEXT NOT NULL DEFAULT '_global',
    tool_name TEXT NOT NULL,
    user_id TEXT NOT NULL DEFAULT '',
    rating DOUBLE PRECISION NOT NULL,
    comment TEXT NOT NULL DEFAULT '',
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- IF NOT EXISTS keeps the index statements re-runnable, like the rest of the migration
CREATE INDEX IF NOT EXISTS idx_tool_feedback_tool ON tool_feedback (tenant_id, tool_name);
CREATE INDEX IF NOT EXISTS idx_tool_feedback_created ON tool_feedback (created_at);
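Storing total_rating and rating_count separately, alongside a running avg_latency_ms, suggests the aggregates are updated incrementally on each call. The sketch below shows one plausible update rule in Go; the toolStats type, its record method, and the convention that a negative rating means "no LLM rating available" are assumptions, not the planned PostgresMetricsStore logic.

```go
package main

import "fmt"

// toolStats mirrors a subset of the columns added by 0026_tool_metrics.
type toolStats struct {
	TotalCalls   int
	AvgLatencyMs float64
	TotalRating  float64
	RatingCount  int
}

// record folds one execution into the aggregates.
// Assumption: rating < 0 means the LLM rating was unavailable (for example, it timed out).
func (s *toolStats) record(latencyMs int64, rating float64) {
	s.TotalCalls++
	// Incremental mean: avg += (x - avg) / n.
	s.AvgLatencyMs += (float64(latencyMs) - s.AvgLatencyMs) / float64(s.TotalCalls)
	if rating >= 0 {
		s.TotalRating += rating
		s.RatingCount++
	}
}

// avgRating derives the average LLM rating from the stored sum and count.
func (s toolStats) avgRating() float64 {
	if s.RatingCount == 0 {
		return 0
	}
	return s.TotalRating / float64(s.RatingCount)
}

func main() {
	var s toolStats
	s.record(100, 0.8)
	s.record(300, -1) // rating unavailable for this call
	s.record(200, 0.6)
	fmt.Printf("calls=%d avg_latency=%.1fms avg_rating=%.2f\n",
		s.TotalCalls, s.AvgLatencyMs, s.avgRating())
	// calls=3 avg_latency=200.0ms avg_rating=0.70
}
```

Keeping the sum and count, rather than only the average, lets unrated calls be skipped without skewing the rating.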
Success Metrics
| Metric | Target |
|---|---|
| Semantic search relevance | Top-5 results contain target tool >= 90% of test queries |
| Search latency (vector + re-rank) | < 50ms p99 |
| Quality score accuracy | LLM rating within 0.2 of manual assessment |
| Degradation detection | Alert within 3 calls of quality drop |
| Quarantine escalation | Automatic quarantine after N consecutive degraded calls |
| Backward compatibility | filterRegistryForPrompt unchanged when semantic disabled |
| Keyword fallback | Graceful degradation when vector store unavailable |
| Test coverage | >= 80% for internal/registry/ (enforced by scripts/check-coverage.sh) |
Code Quality Requirements (SonarQube)
All Go code produced by Phase 19 prompts must pass SonarQube quality gates:
- Error handling: Every returned error must be handled explicitly
- Function length and complexity: Keep functions under 50 lines where practical
- No dead code: No unused variables, empty blocks, or duplicated logic
- Resource cleanup: Close all resources with proper defer patterns
- Early returns: Prefer guard clauses over deep nesting
- No magic values: Use named constants
- Linting gate: Run go vet ./internal/registry/..., staticcheck ./internal/registry/..., and golangci-lint run ./internal/registry/... before considering prompts complete
- Test coverage: 80%+ for new registry files
Risk Mitigation
| Risk | Mitigation |
|---|---|
| Vector store unavailable | Semantic search is opt-in (CRUVERO_TOOL_SEARCH_SEMANTIC=false default). Falls back to keyword search. |
| LLM auto-rating latency | Fire-and-forget activity with 5s timeout. Tool execution is never blocked. |
| Cold start (no embeddings) | seed-registry CLI indexes tools on seed. Keyword fallback for un-indexed tools. |
| Quality score gaming | Composite score includes success rate, not just LLM rating. Manual feedback weighted separately. |
| Migration on existing data | ALTER TABLE ADD COLUMN with defaults. Backfill UPDATE is idempotent. |
Relationship to Other Phases
| Phase | Relationship |
|---|---|
| Phase 5 (Memory) | 19B may reuse salience scoring patterns for recency decay |
| Phase 6 (Tool Registry) | 19A extends existing registry Store + types |
| Phase 8 (Embeddings + Vector) | 19B reuses Embedder and VectorStore with new collection |
| Phase 10D (Immune System) | 19C integrates with existing tool_quarantine for escalation |
| Phase 12 (Events) | 19C emits degradation events via NATS if available |
| Phase 14 (API) | API endpoints can expose tool metrics via existing route patterns |
| Phase 18 (Prompt Library) | 19B mirrors the 3-stage search pipeline pattern from Phase 18 docs |
Progress Notes
(none yet)