Phase 26 — Prompt Library v2: Advanced Prompt Management
Extends the Phase 18 prompt library with deployment environments, composable snippets, A/B experimentation, structured evaluation, version diffing, CI/CD eval integration, production-to-dataset pipelines, NATS cache invalidation, provider-agnostic prompt blueprints, and prompt analytics. Modeled after capabilities found in Braintrust, PromptLayer, and Humanloop — filtered for features that create meaningful value in a Temporal-native agent orchestration platform.
Status: Planned
Depends on: Phase 18 (Prompt Library), Phase 12 (NATS), Phase 9 (Audit/Tenant), Phase 6C (Speculative Execution)
Migrations: 0034_prompt_environments, 0035_prompt_snippets, 0036_prompt_experiments, 0037_eval_datasets
Branch: dev
Why Now
Phase 18 established a production-ready prompt catalog with content-hashed versioning, embedding-based semantic search, quality metrics, and Go text/template rendering. But four critical gaps remain between "prompts exist in a catalog" and "prompts are safely managed across their lifecycle in production":
- No deployment safety net — A prompt goes from "created" to "used by agents" with nothing in between. There is no concept of dev → staging → production promotion, no quality gates, and no rollback path. A bad prompt can immediately affect all agents.
- Monolithic prompts — Every prompt is self-contained. System preambles, safety guardrails, output format instructions, and few-shot examples are duplicated across prompts. Tenant customization requires duplicating entire prompts rather than overriding fragments.
- No experimentation infrastructure — The existing SpeculativeConfig (Phase 6C) runs multiple decision paths but has no mechanism for controlled prompt A/B tests with traffic splitting, variant tracking, or statistical comparison.
- Primitive quality signals — success_rate * avg_llm_rating from fire-and-forget feedback is insufficient for pre-deployment confidence. There are no eval datasets, no regression detection, no LLM-as-a-judge scoring, and no CI/CD quality gates.
Phase 26 solves all four while preserving Phase 18's architectural strengths: content-hashed immutability, embedding-based semantic search, multi-tenant isolation, and Temporal-native durability.
Architecture
Extended internal/promptlib/ Package
Phase 26 extends the existing package with new subpackages rather than modifying core types:
internal/promptlib/
├── types.go (Phase 18 — unchanged)
├── store.go (Phase 18 — unchanged)
├── metrics_store.go (Phase 18 — unchanged)
├── hash.go (Phase 18 — unchanged)
├── renderer.go (Phase 18 — extended with snippet resolution)
├── searcher.go (Phase 18 — extended with environment filtering)
├── scorer.go (Phase 18 — unchanged)
├── indexer.go (Phase 18 — unchanged)
├── config.go (Phase 18 — extended with new env vars)
├── feedback.go (Phase 18 — unchanged)
│
├── environments.go (26A — NEW: environment store + promotion logic)
├── snippets.go (26A — NEW: snippet resolution + version pinning)
├── experiment.go (26B — NEW: A/B experiment types + resolution)
├── experiment_store.go (26B — NEW: Postgres experiment store)
│
├── eval/
│ ├── types.go (26C — NEW: dataset, eval run, scorer types)
│ ├── dataset_store.go (26C — NEW: eval dataset CRUD)
│ ├── runner.go (26C — NEW: EvalRunWorkflow orchestrator)
│ ├── scorers.go (26C — NEW: built-in scorer activities)
│ └── report.go (26C — NEW: eval result aggregation)
│
├── diff.go (26D — NEW: version diff engine)
├── blueprint.go (26E — NEW: provider-agnostic prompt representation)
└── analytics.go (26E — NEW: prompt-level time-series queries)
Integration Points
┌─────────────────────────────────────────────────────────────────┐
│ Phase 26 Extensions │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────────────┐ │
│ │ Environments │ │ Snippets │ │ Experiments │ │
│ │ (dev→stg→prd)│ │ (composable │ │ (A/B via Temporal │ │
│ │ quality gates│ │ fragments) │ │ SideEffect) │ │
│ └──────┬───────┘ └──────┬───────┘ └──────────┬───────────┘ │
│ │ │ │ │
│ ┌──────▼─────────────────▼──────────────────────▼───────────┐ │
│ │ Phase 18 Core (unchanged) │ │
│ │ Store · MetricsStore · Searcher · Renderer · Indexer │ │
│ └──────┬────────────────────────────────────────────────────┘ │
│ │ │
│ ┌──────▼───────────────────────────────────────────────────┐ │
│ │ Eval Framework │ │
│ │ Datasets · EvalRunWorkflow · Scorers · Reports │ │
│ │ (quality gates for environment promotion) │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │
│ External integrations: │
│ ├─ NATS (Phase 12) — cache invalidation on promotion │
│ ├─ Audit (Phase 9C) — production log → dataset pipeline │
│ ├─ Temporal — EvalRunWorkflow, experiment variant SideEffect │
│ ├─ LLM Client (Phase 7D) — LLM-as-a-judge scoring │
│ └─ Embedding (Phase 8D) — cosine similarity scoring │
└─────────────────────────────────────────────────────────────────┘
Three-Tier Implementation
Tier 1 — Highest ROI (Sub-phases A + B):
1.1 Deployment Environments with Quality Gates
1.2 Composable Prompt Snippets
1.3 A/B Testing via Temporal SideEffect
1.4 Structured Evaluation Framework
Tier 2 — Valuable (Sub-phases C + D):
2.1 Visual Version Diff
2.2 CI/CD Eval Integration
2.3 Production Log → Dataset Pipeline
Tier 3 — Polish (Sub-phase E):
3.1 NATS Cache Invalidation
3.2 Provider-Agnostic Prompt Blueprints
3.3 Prompt Analytics Dashboard
Competitive Comparison
| Capability | Braintrust | PromptLayer | Cruvero (Phase 18) | Cruvero (After Phase 26) |
|---|---|---|---|---|
| Versioning | Content-addressed | Sequential (v1, v2) | Content-hashed + immutable | Same (unchanged) |
| Semantic search | ✗ | ✗ | ✓ Embedding-based 3-stage | Same (unique advantage) |
| Environments | dev/staging/prod | Release labels | ✗ | ✓ Named environments + quality gates |
| Prompt composition | Functions (code) | Snippets (@@@) | ✗ Monolithic only | ✓ Snippets with version pinning |
| A/B testing | Playground + manual | Dynamic Release Labels | ✗ | ✓ Temporal SideEffect (replay-safe) |
| Eval framework | Eval() + autoevals | Visual pipeline builder | success_rate + avg_rating | ✓ Datasets + scorers + EvalRunWorkflow |
| CI/CD evals | GitHub Action | Auto-trigger on version | ✗ | ✓ CLI + exit code |
| Multi-tenancy | Project-level | Workspace | ✓ First-class tenant isolation | Same (unique advantage) |
| Observability | Brainstore + BTQL | Middleware logging | Basic prompt_metrics | ✓ Time-series analytics |
| Cache invalidation | — | Webhook-driven | ✗ | ✓ NATS pub/sub |
| Runtime SDK | Go (beta), TS, Python | Python, JS | Go-native | Same + environment-aware resolution |
| Diff/compare | Side-by-side UI | Diff views | ✗ | ✓ Structured line-level diff |
Core Types and Interfaces
// --- Environments (26A) ---
type Environment struct {
TenantID string `json:"tenant_id"`
Name string `json:"name"` // "dev", "staging", "production"
PromptID string `json:"prompt_id"`
Version int `json:"version"`
PromptHash string `json:"prompt_hash"`
PromotedBy string `json:"promoted_by"`
PromotedAt time.Time `json:"promoted_at"`
}
type QualityGate struct {
MinUsageCount int `json:"min_usage_count"`
MinSuccessRate float64 `json:"min_success_rate"` // 0.0-1.0
MinAvgRating float64 `json:"min_avg_rating"` // 0.0-1.0
RequireEvalPass bool `json:"require_eval_pass"`
EvalDatasetID string `json:"eval_dataset_id,omitempty"`
EvalThreshold float64 `json:"eval_threshold,omitempty"`
}
type EnvironmentStore interface {
Promote(ctx context.Context, env Environment, gate *QualityGate) error
GetActive(ctx context.Context, tenantID, promptID, envName string) (Environment, error)
ListEnvironments(ctx context.Context, tenantID, promptID string) ([]Environment, error)
GetPromotionHistory(ctx context.Context, tenantID, promptID, envName string, limit int) ([]Environment, error)
}
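A minimal sketch of how Promote might enforce a QualityGate before the environment upsert. PromptMetrics and checkGate are hypothetical names for this illustration; the real quality_gate.go would read the same aggregates from MetricsStore and additionally consult the latest EvalRun when RequireEvalPass is set.

```go
package main

import "fmt"

// PromptMetrics is a hypothetical stand-in for the aggregates MetricsStore
// exposes per prompt version.
type PromptMetrics struct {
	UsageCount  int
	SuccessRate float64 // 0.0-1.0
	AvgRating   float64 // 0.0-1.0
}

// QualityGate mirrors the type above (eval fields omitted in this sketch).
type QualityGate struct {
	MinUsageCount  int
	MinSuccessRate float64
	MinAvgRating   float64
}

// checkGate returns nil when the metrics satisfy the gate, or an error
// naming the first failed criterion so the caller can surface it.
func checkGate(gate QualityGate, m PromptMetrics) error {
	if m.UsageCount < gate.MinUsageCount {
		return fmt.Errorf("quality gate: usage %d < required %d", m.UsageCount, gate.MinUsageCount)
	}
	if m.SuccessRate < gate.MinSuccessRate {
		return fmt.Errorf("quality gate: success rate %.2f < required %.2f", m.SuccessRate, gate.MinSuccessRate)
	}
	if m.AvgRating < gate.MinAvgRating {
		return fmt.Errorf("quality gate: avg rating %.2f < required %.2f", m.AvgRating, gate.MinAvgRating)
	}
	return nil
}

func main() {
	gate := QualityGate{MinUsageCount: 50, MinSuccessRate: 0.9, MinAvgRating: 0.7}
	err := checkGate(gate, PromptMetrics{UsageCount: 120, SuccessRate: 0.95, AvgRating: 0.8})
	fmt.Println("gate passed:", err == nil)
}
```

Failing fast on the first unmet criterion keeps the promotion error actionable; nil gate means unconditional promotion (Phase 18 behavior).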
// --- Snippets (26A) ---
type SnippetRef struct {
PromptID string `json:"prompt_id"`
Version int `json:"version,omitempty"` // 0 = latest
Label string `json:"label,omitempty"` // "prod", "staging" — resolved via EnvironmentStore
}
type SnippetResolver interface {
Resolve(ctx context.Context, tenantID string, refs []SnippetRef) (map[string]string, error)
}
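The core of snippet resolution is cycle-safe, depth-bounded expansion. The sketch below uses a hypothetical in-memory snippetSource and a literal `{{snippet "id"}}` marker in place of the real Store-backed resolver wired into the renderer's FuncMap; the visited-set and depth counter match the mechanisms named in the risk table.

```go
package main

import (
	"fmt"
	"strings"
)

// snippetSource is a hypothetical stand-in for the Store: snippet ID → body.
type snippetSource map[string]string

const maxSnippetDepth = 3 // mirrors CRUVERO_PROMPTLIB_SNIPPET_MAX_DEPTH

// resolve expands snippet references depth-first. A visited set rejects
// cycles (A → B → A) and a depth counter bounds nesting, so resolution
// returns an error instead of looping forever.
func resolve(src snippetSource, id string, depth int, visited map[string]bool) (string, error) {
	if depth > maxSnippetDepth {
		return "", fmt.Errorf("snippet %q: max depth %d exceeded", id, maxSnippetDepth)
	}
	if visited[id] {
		return "", fmt.Errorf("snippet cycle detected at %q", id)
	}
	visited[id] = true
	defer delete(visited, id) // the same snippet may appear on sibling branches

	body, ok := src[id]
	if !ok {
		return "", fmt.Errorf("snippet %q not found", id)
	}
	out := body
	for sid := range src {
		ref := `{{snippet "` + sid + `"}}`
		if strings.Contains(out, ref) {
			expanded, err := resolve(src, sid, depth+1, visited)
			if err != nil {
				return "", err
			}
			out = strings.ReplaceAll(out, ref, expanded)
		}
	}
	return out, nil
}

func main() {
	src := snippetSource{
		"safety":   "Always refuse harmful requests.",
		"preamble": `You are a helpful agent. {{snippet "safety"}}`,
	}
	text, err := resolve(src, "preamble", 0, map[string]bool{})
	fmt.Println(text, err)
}
```

Deleting the ID from the visited set on return distinguishes true cycles from the legitimate case where two sibling snippets both include the same shared fragment.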
// --- Experiments (26B) ---
type Experiment struct {
ID string `json:"id"`
TenantID string `json:"tenant_id"`
PromptID string `json:"prompt_id"`
Name string `json:"name"`
Status ExperimentStatus `json:"status"` // "active", "paused", "completed"
Variants []ExperimentVariant `json:"variants"`
CreatedAt time.Time `json:"created_at"`
CompletedAt *time.Time `json:"completed_at,omitempty"`
Config ExperimentConfig `json:"config"`
}
type ExperimentVariant struct {
Name string `json:"name"`
PromptHash string `json:"prompt_hash"`
TrafficPct int `json:"traffic_pct"` // 0-100, all variants must sum to 100
SampleCount int `json:"sample_count"`
SuccessCount int `json:"success_count"`
AvgScore float64 `json:"avg_score"`
}
type ExperimentConfig struct {
MinSampleSize int `json:"min_sample_size"` // per variant
MaxDuration string `json:"max_duration"` // e.g., "168h" (7 days)
AutoComplete bool `json:"auto_complete"` // auto-promote winner
}
type ExperimentStatus string
const (
ExperimentActive ExperimentStatus = "active"
ExperimentPaused ExperimentStatus = "paused"
ExperimentCompleted ExperimentStatus = "completed"
)
type ExperimentStore interface {
Create(ctx context.Context, exp Experiment) error
Get(ctx context.Context, tenantID, expID string) (Experiment, error)
GetActiveForPrompt(ctx context.Context, tenantID, promptID string) (*Experiment, error)
RecordOutcome(ctx context.Context, expID, variantName string, success bool, score float64) error
Complete(ctx context.Context, expID string, winnerVariant string) error
List(ctx context.Context, tenantID string, status *ExperimentStatus) ([]Experiment, error)
}
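Variant selection has to be replay-safe, so in the workflow the random draw runs inside Temporal's workflow.SideEffect (the draw is recorded in history and reused on replay). The mapping from a recorded roll to a variant is then a pure function; pickVariant below is a hypothetical sketch of that mapping.

```go
package main

import "fmt"

// ExperimentVariant mirrors the type above (selection-relevant fields only).
type ExperimentVariant struct {
	Name       string
	TrafficPct int // all variants must sum to 100
}

// pickVariant maps a roll in [0,100) onto the cumulative traffic split.
// In the workflow the roll would come from workflow.SideEffect, roughly:
//
//	var roll int
//	workflow.SideEffect(ctx, func(workflow.Context) interface{} {
//	    return rand.Intn(100)
//	}).Get(&roll)
//
// so replay reuses the recorded value and the choice stays deterministic.
func pickVariant(variants []ExperimentVariant, roll int) string {
	cum := 0
	for _, v := range variants {
		cum += v.TrafficPct
		if roll < cum {
			return v.Name
		}
	}
	// Only reachable if percentages do not sum to 100; default to last.
	return variants[len(variants)-1].Name
}

func main() {
	variants := []ExperimentVariant{
		{Name: "control", TrafficPct: 80},
		{Name: "candidate", TrafficPct: 20},
	}
	fmt.Println(pickVariant(variants, 10))
	fmt.Println(pickVariant(variants, 85))
}
```

Keeping the mapping pure means only the single random draw needs to be recorded in workflow history, not the selection logic itself.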
// --- Eval Framework (26C) ---
type EvalDataset struct {
ID string `json:"id"`
TenantID string `json:"tenant_id"`
Name string `json:"name"`
PromptID string `json:"prompt_id,omitempty"` // optional binding
Version int `json:"version"`
Entries []EvalEntry `json:"entries"`
CreatedAt time.Time `json:"created_at"`
Metadata map[string]any `json:"metadata,omitempty"`
}
type EvalEntry struct {
ID string `json:"id"`
Input map[string]any `json:"input"` // template params
ExpectedOutput string `json:"expected_output"` // golden output
Metadata map[string]any `json:"metadata,omitempty"`
}
type EvalRun struct {
ID string `json:"id"`
TenantID string `json:"tenant_id"`
PromptHash string `json:"prompt_hash"`
DatasetID string `json:"dataset_id"`
DatasetVer int `json:"dataset_version"`
Scorers []string `json:"scorers"` // scorer names to run
Status string `json:"status"` // "running", "completed", "failed"
Results []EvalResult `json:"results"`
Summary EvalSummary `json:"summary"`
StartedAt time.Time `json:"started_at"`
CompletedAt *time.Time `json:"completed_at,omitempty"`
}
type EvalResult struct {
EntryID string `json:"entry_id"`
Output string `json:"output"`
Scores map[string]float64 `json:"scores"` // scorer_name → 0.0-1.0
Latency time.Duration `json:"latency"`
Tokens int `json:"tokens"`
Error string `json:"error,omitempty"`
}
type EvalSummary struct {
TotalEntries int `json:"total_entries"`
PassCount int `json:"pass_count"`
FailCount int `json:"fail_count"`
ErrorCount int `json:"error_count"`
AvgScores map[string]float64 `json:"avg_scores"` // per scorer
AvgLatency time.Duration `json:"avg_latency"`
TotalTokens int `json:"total_tokens"`
TotalCost float64 `json:"total_cost"`
PassRate float64 `json:"pass_rate"` // 0.0-1.0
}
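A sketch of the report.go aggregation: folding per-entry EvalResults into an EvalSummary. The pass rule here — every scorer must reach a threshold — is an assumption for illustration; the real criterion may be per-scorer or configurable. Only the fields needed for the fold are reproduced.

```go
package main

import "fmt"

type EvalResult struct {
	Scores map[string]float64 // scorer_name → 0.0-1.0
	Error  string
}

type EvalSummary struct {
	TotalEntries int
	PassCount    int
	FailCount    int
	ErrorCount   int
	AvgScores    map[string]float64
	PassRate     float64
}

// summarize folds results into a summary. Errored entries count toward
// TotalEntries and ErrorCount but are excluded from scoring averages.
func summarize(results []EvalResult, threshold float64) EvalSummary {
	s := EvalSummary{TotalEntries: len(results), AvgScores: map[string]float64{}}
	counts := map[string]int{}
	for _, r := range results {
		if r.Error != "" {
			s.ErrorCount++
			continue
		}
		pass := true
		for name, score := range r.Scores {
			s.AvgScores[name] += score
			counts[name]++
			if score < threshold {
				pass = false
			}
		}
		if pass {
			s.PassCount++
		} else {
			s.FailCount++
		}
	}
	for name, total := range s.AvgScores {
		s.AvgScores[name] = total / float64(counts[name])
	}
	if s.TotalEntries > 0 {
		s.PassRate = float64(s.PassCount) / float64(s.TotalEntries)
	}
	return s
}

func main() {
	sum := summarize([]EvalResult{
		{Scores: map[string]float64{"exact_match": 1.0}},
		{Scores: map[string]float64{"exact_match": 0.0}},
	}, 0.5)
	fmt.Println(sum.PassCount, sum.FailCount, sum.PassRate)
}
```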
// Scorer produces a 0.0-1.0 score for an eval entry.
type Scorer interface {
Name() string
Score(ctx context.Context, input ScorerInput) (float64, error)
}
type ScorerInput struct {
Prompt Prompt `json:"prompt"`
TemplateParams map[string]any `json:"template_params"`
Output string `json:"output"`
ExpectedOutput string `json:"expected_output"`
Metadata map[string]any `json:"metadata,omitempty"`
}
// --- Diff (26D) ---
type PromptDiff struct {
PromptID string `json:"prompt_id"`
FromVer int `json:"from_version"`
ToVer int `json:"to_version"`
Hunks []DiffHunk `json:"hunks"`
Summary string `json:"summary"` // "3 additions, 2 deletions, 1 modification"
}
type DiffHunk struct {
Type string `json:"type"` // "add", "delete", "modify", "equal"
FromLine int `json:"from_line"`
ToLine int `json:"to_line"`
Content string `json:"content"`
}
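A minimal version of the diff engine: a classic LCS-based line diff where lines on the longest common subsequence become "equal" hunks and everything else is a delete (old side) or add (new side). This is a sketch — the real ComputeDiff would also coalesce adjacent delete/add pairs into "modify" hunks, track FromLine/ToLine, and honor CRUVERO_PROMPTLIB_DIFF_CONTEXT_LINES.

```go
package main

import (
	"fmt"
	"strings"
)

type DiffHunk struct {
	Type    string // "add", "delete", "equal"
	Content string
}

func computeDiff(from, to string) []DiffHunk {
	a := strings.Split(from, "\n")
	b := strings.Split(to, "\n")
	// lcs[i][j] = length of the LCS of a[i:] and b[j:]
	lcs := make([][]int, len(a)+1)
	for i := range lcs {
		lcs[i] = make([]int, len(b)+1)
	}
	for i := len(a) - 1; i >= 0; i-- {
		for j := len(b) - 1; j >= 0; j-- {
			if a[i] == b[j] {
				lcs[i][j] = lcs[i+1][j+1] + 1
			} else if lcs[i+1][j] >= lcs[i][j+1] {
				lcs[i][j] = lcs[i+1][j]
			} else {
				lcs[i][j] = lcs[i][j+1]
			}
		}
	}
	// Walk the table, emitting hunks.
	var hunks []DiffHunk
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i] == b[j]:
			hunks = append(hunks, DiffHunk{Type: "equal", Content: a[i]})
			i++
			j++
		case lcs[i+1][j] >= lcs[i][j+1]:
			hunks = append(hunks, DiffHunk{Type: "delete", Content: a[i]})
			i++
		default:
			hunks = append(hunks, DiffHunk{Type: "add", Content: b[j]})
			j++
		}
	}
	for ; i < len(a); i++ {
		hunks = append(hunks, DiffHunk{Type: "delete", Content: a[i]})
	}
	for ; j < len(b); j++ {
		hunks = append(hunks, DiffHunk{Type: "add", Content: b[j]})
	}
	return hunks
}

func main() {
	for _, h := range computeDiff("a\nb\nc", "a\nx\nc") {
		fmt.Println(h.Type, h.Content)
	}
}
```

The O(n·m) table is comfortably within the < 10ms target for prompts up to 10KB, which is why larger prompts fall back to summary-only mode.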
// --- Provider Blueprint (26E) ---
type PromptBlueprint struct {
Messages []BlueprintMessage `json:"messages"`
Model string `json:"model,omitempty"`
Temperature *float64 `json:"temperature,omitempty"`
MaxTokens *int `json:"max_tokens,omitempty"`
StopSequence []string `json:"stop_sequence,omitempty"`
Tools []json.RawMessage `json:"tools,omitempty"`
}
type BlueprintMessage struct {
Role string `json:"role"` // "system", "user", "assistant"
Content string `json:"content"`
}
type BlueprintAdapter interface {
ToProviderFormat(bp PromptBlueprint, provider string) (json.RawMessage, error)
FromProviderFormat(raw json.RawMessage, provider string) (PromptBlueprint, error)
}
Sub-Phases
| Sub-Phase | Name | Prompts | Depends On |
|---|---|---|---|
| 26A | Environments + Snippets | 5 | — |
| 26B | A/B Experiments | 4 | 26A |
| 26C | Evaluation Framework | 5 | 26A |
| 26D | Diff + CI/CD + Log Pipeline | 4 | 26C |
| 26E | NATS Cache + Blueprints + Analytics | 3 | 26A, Phase 12 |
| 26F | CI/CD + Deployment | 3 | 26A–26E |
Total: 6 sub-phases, 24 prompts, 13 documentation files
Dependency Graph
26A (Environments + Snippets)
├─► 26B (A/B Experiments) ────────────┐
├─► 26C (Evaluation) ──► 26D (Diff) ├──► 26F (CI/CD + Deploy)
└─► 26E (NATS + Blueprints) ─────────┘
26B, 26C, and 26E are parallelizable after 26A. 26D depends on 26C (eval framework needed for CI/CD gates). 26F depends on all implementation sub-phases (26A–26E) and handles Docker images, Helm charts, and ArgoCD configuration.
Environment Variables
| Variable | Default | Description |
|---|---|---|
CRUVERO_PROMPTLIB_ENVS_ENABLED | true | Enable deployment environments |
CRUVERO_PROMPTLIB_DEFAULT_ENVS | dev,staging,production | Comma-separated environment names created per tenant |
CRUVERO_PROMPTLIB_SNIPPETS_ENABLED | true | Enable snippet composition |
CRUVERO_PROMPTLIB_SNIPPET_MAX_DEPTH | 3 | Max nested snippet resolution depth (prevents cycles) |
CRUVERO_PROMPTLIB_EXPERIMENTS_ENABLED | false | Enable A/B experimentation |
CRUVERO_PROMPTLIB_EXPERIMENT_MAX_VARIANTS | 4 | Max variants per experiment |
CRUVERO_PROMPTLIB_EVAL_ENABLED | true | Enable evaluation framework |
CRUVERO_PROMPTLIB_EVAL_TIMEOUT | 300s | Per-entry eval timeout |
CRUVERO_PROMPTLIB_EVAL_MAX_CONCURRENT | 10 | Max concurrent eval entries |
CRUVERO_PROMPTLIB_DIFF_CONTEXT_LINES | 3 | Lines of context in diff output |
CRUVERO_PROMPTLIB_NATS_CACHE_ENABLED | false | Enable NATS cache invalidation |
CRUVERO_PROMPTLIB_NATS_SUBJECT | cruvero.prompts.events | NATS subject for prompt events |
CRUVERO_PROMPTLIB_BLUEPRINT_ENABLED | false | Enable provider-agnostic blueprints |
CRUVERO_PROMPTLIB_ANALYTICS_RETENTION | 90d | Analytics data retention period |
Files Overview
New Files
| File | Sub-Phase | Description |
|---|---|---|
internal/promptlib/environments.go | 26A | EnvironmentStore interface + PostgresEnvironmentStore |
internal/promptlib/quality_gate.go | 26A | QualityGate evaluation logic, MetricsStore + EvalRun checks |
internal/promptlib/snippets.go | 26A | SnippetResolver with cycle detection + version pinning |
internal/promptlib/experiment.go | 26B | Experiment types, ExperimentStore interface |
internal/promptlib/experiment_store.go | 26B | PostgresExperimentStore implementation |
internal/promptlib/experiment_resolver.go | 26B | ResolveVariantActivity — Temporal SideEffect-based variant selection |
internal/promptlib/eval/types.go | 26C | EvalDataset, EvalEntry, EvalRun, EvalResult, EvalSummary, Scorer |
internal/promptlib/eval/dataset_store.go | 26C | PostgresDatasetStore (CRUD, versioning) |
internal/promptlib/eval/runner.go | 26C | EvalRunWorkflow — Temporal workflow orchestrating eval |
internal/promptlib/eval/scorers.go | 26C | Built-in scorers: exact_match, contains, regex, cosine_similarity, llm_judge |
internal/promptlib/eval/report.go | 26C | EvalSummary aggregation + comparison helpers |
internal/promptlib/diff.go | 26D | ComputeDiff — line-level diff between prompt versions |
internal/promptlib/log_pipeline.go | 26D | DatasetFromLogsActivity — audit log → eval dataset |
internal/promptlib/nats_cache.go | 26E | NATS publisher for prompt events + subscriber cache buster |
internal/promptlib/blueprint.go | 26E | PromptBlueprint + adapters for OpenAI, Anthropic, Azure |
internal/promptlib/analytics.go | 26E | Time-series queries over prompt_metrics |
cmd/prompt-eval/main.go | 26D | CLI to run eval against dataset + exit code on failure |
cmd/prompt-diff/main.go | 26D | CLI to diff prompt versions |
internal/tools/prompt_promote.go | 26A | PromptPromoteTool — agent tool for environment promotion |
migrations/0034_prompt_environments.up.sql | 26A | prompt_environments table |
migrations/0034_prompt_environments.down.sql | 26A | Drop table |
migrations/0035_prompt_snippets.up.sql | 26A | prompt_snippet_refs table (tracks snippet dependencies) |
migrations/0035_prompt_snippets.down.sql | 26A | Drop table |
migrations/0036_prompt_experiments.up.sql | 26B | prompt_experiments + experiment_outcomes tables |
migrations/0036_prompt_experiments.down.sql | 26B | Drop tables |
migrations/0037_eval_datasets.up.sql | 26C | eval_datasets + eval_entries + eval_runs + eval_results tables |
migrations/0037_eval_datasets.down.sql | 26C | Drop tables |
docker/Dockerfile.prompt-tools | 26F | Multi-binary Go image bundling prompt-eval, prompt-dataset, prompt-experiment, prompt-diff |
charts/cruvero/templates/prompt-tools/job.yaml | 26F | Helm Job template for batch eval/dataset operations |
Modified Files
| File | Sub-Phase | Change |
|---|---|---|
internal/promptlib/renderer.go | 26A | Add snippet resolution via FuncMap before template execution |
internal/promptlib/searcher.go | 26A | Add environment filter to search pipeline Stage 3 |
internal/promptlib/config.go | 26A | Add all new env vars + component wiring |
internal/tools/manager.go | 26A | Register prompt_promote tool |
internal/agent/activities.go | 26B | Wire experiment variant resolution before prompt selection |
cmd/ui/prompts_handler.go | 26D | Add /api/prompts/{id}/diff, /api/prompts/{id}/environments endpoints |
cmd/ui/frontend/src/pages/PromptLibraryPage.tsx | 26D | Add diff viewer, environment badges, eval results |
.github/workflows/build-images.yaml | 26F | Add prompt-tools to CI build matrix |
charts/cruvero/values.yaml | 26F | Add promptTools section + 14 new env vars |
charts/cruvero/values-dev.yaml | 26F | Enable prompt-tools, set image repo |
charts/cruvero/values-staging.yaml | 26F | Explicit promptTools.enabled: false |
charts/cruvero/values-prod.yaml | 26F | Explicit promptTools.enabled: false |
deploy/argocd/applicationset.yaml | 26F | Add prompt-tools Image Updater annotations |
Referenced (read, not modified)
| File | Purpose |
|---|---|
internal/promptlib/store.go | Store interface for Get/GetLatest (snippet resolution) |
internal/promptlib/metrics_store.go | MetricsStore for quality gate evaluation |
internal/memory/salience.go | Recency/usage scoring patterns |
internal/audit/logger.go | Audit event queries for log→dataset pipeline |
internal/events/nats.go | NATS client patterns (Phase 12) |
internal/llm/client.go | LLM client for llm_judge scorer |
internal/embedding/embedder.go | Embedder for cosine_similarity scorer |
internal/config/config.go | Config struct patterns |
Migrations
0034_prompt_environments.up.sql
CREATE TABLE IF NOT EXISTS prompt_environments (
tenant_id TEXT NOT NULL,
prompt_id TEXT NOT NULL,
env_name TEXT NOT NULL,
version INTEGER NOT NULL,
prompt_hash TEXT NOT NULL,
promoted_by TEXT NOT NULL DEFAULT '',
promoted_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
PRIMARY KEY (tenant_id, prompt_id, env_name)
);
CREATE INDEX idx_prompt_env_hash ON prompt_environments (prompt_hash);
-- Promotion history (append-only audit trail)
CREATE TABLE IF NOT EXISTS prompt_promotion_history (
id BIGSERIAL PRIMARY KEY,
tenant_id TEXT NOT NULL,
prompt_id TEXT NOT NULL,
env_name TEXT NOT NULL,
version INTEGER NOT NULL,
prompt_hash TEXT NOT NULL,
promoted_by TEXT NOT NULL DEFAULT '',
promoted_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE INDEX idx_promo_history_lookup ON prompt_promotion_history (tenant_id, prompt_id, env_name, promoted_at DESC);
0035_prompt_snippets.up.sql
-- Tracks which prompts reference which snippets (for dependency analysis)
CREATE TABLE IF NOT EXISTS prompt_snippet_refs (
tenant_id TEXT NOT NULL,
parent_hash TEXT NOT NULL, -- prompt that contains the snippet ref
snippet_id TEXT NOT NULL, -- referenced snippet prompt_id
snippet_version INTEGER, -- NULL = latest, >0 = pinned
snippet_label TEXT, -- NULL = direct version, non-null = environment label
PRIMARY KEY (tenant_id, parent_hash, snippet_id)
);
CREATE INDEX idx_snippet_refs_snippet ON prompt_snippet_refs (tenant_id, snippet_id);
0036_prompt_experiments.up.sql
CREATE TABLE IF NOT EXISTS prompt_experiments (
id TEXT NOT NULL,
tenant_id TEXT NOT NULL,
prompt_id TEXT NOT NULL,
name TEXT NOT NULL,
status TEXT NOT NULL DEFAULT 'active',
variants JSONB NOT NULL,
config JSONB NOT NULL DEFAULT '{}',
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
completed_at TIMESTAMPTZ,
winner TEXT,
PRIMARY KEY (tenant_id, id)
);
CREATE INDEX idx_experiments_prompt ON prompt_experiments (tenant_id, prompt_id, status);
CREATE TABLE IF NOT EXISTS experiment_outcomes (
id BIGSERIAL PRIMARY KEY,
experiment_id TEXT NOT NULL,
tenant_id TEXT NOT NULL,
variant_name TEXT NOT NULL,
run_id TEXT NOT NULL,
success BOOLEAN NOT NULL,
score DOUBLE PRECISION,
recorded_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE INDEX idx_exp_outcomes_lookup ON experiment_outcomes (tenant_id, experiment_id, variant_name);
0037_eval_datasets.up.sql
CREATE TABLE IF NOT EXISTS eval_datasets (
id TEXT NOT NULL,
tenant_id TEXT NOT NULL,
name TEXT NOT NULL,
prompt_id TEXT,
version INTEGER NOT NULL DEFAULT 1,
metadata JSONB DEFAULT '{}',
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
PRIMARY KEY (tenant_id, id, version)
);
CREATE TABLE IF NOT EXISTS eval_entries (
id TEXT NOT NULL,
dataset_id TEXT NOT NULL,
dataset_version INTEGER NOT NULL,
tenant_id TEXT NOT NULL,
input JSONB NOT NULL,
expected_output TEXT NOT NULL,
metadata JSONB DEFAULT '{}',
PRIMARY KEY (tenant_id, dataset_id, dataset_version, id)
);
CREATE TABLE IF NOT EXISTS eval_runs (
id TEXT NOT NULL,
tenant_id TEXT NOT NULL,
prompt_hash TEXT NOT NULL,
dataset_id TEXT NOT NULL,
dataset_ver INTEGER NOT NULL,
scorers TEXT[] NOT NULL,
status TEXT NOT NULL DEFAULT 'running',
summary JSONB DEFAULT '{}',
started_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
completed_at TIMESTAMPTZ,
PRIMARY KEY (tenant_id, id)
);
CREATE INDEX idx_eval_runs_prompt ON eval_runs (tenant_id, prompt_hash);
CREATE TABLE IF NOT EXISTS eval_results (
eval_run_id TEXT NOT NULL,
entry_id TEXT NOT NULL,
tenant_id TEXT NOT NULL,
output TEXT NOT NULL,
scores JSONB NOT NULL,
latency_ms INTEGER NOT NULL,
tokens INTEGER NOT NULL DEFAULT 0,
error TEXT,
PRIMARY KEY (tenant_id, eval_run_id, entry_id)
);
Success Metrics
| Metric | Target |
|---|---|
| Environment promotion latency | < 500ms (quality gate check + upsert) |
| Quality gate evaluation | Checks MetricsStore + optional eval run in < 2s |
| Snippet resolution depth | Max 3 levels, < 5ms p99 |
| Snippet cycle detection | 100% detection rate (no infinite loops) |
| A/B variant selection | Deterministic via Temporal SideEffect (replay-safe) |
| Experiment outcome recording | Non-blocking, < 5ms fire-and-forget |
| Eval run throughput | 10 concurrent entries by default, < 60s for 100-entry dataset |
| Built-in scorer accuracy | LLM-as-judge within 0.15 of human rating |
| Diff computation | < 10ms for prompts up to 10KB |
| CI/CD eval exit code | Non-zero on regression (score below threshold) |
| Log → dataset pipeline | < 30s for 1000 audit entries |
| NATS cache invalidation | < 100ms from promotion to subscriber notification |
| Test coverage | >= 80% for all new files |
| Backward compatibility | Phase 18 behavior unchanged when Phase 26 features disabled |
Code Quality Requirements (SonarQube)
All Go code produced by Phase 26 prompts must pass SonarQube quality gates:
- Error handling: Every returned error must be handled explicitly
- Cyclomatic complexity: Keep branching shallow and functions under 50 lines where practical
- No dead code: No unused variables, empty blocks, or duplicated logic
- Resource cleanup: Close all resources with proper defer patterns
- Early returns: Prefer guard clauses over deeply nested conditionals
- No magic values: Use named constants for strings and numbers
- Meaningful names: Descriptive variable and function names
- Linting gate: Run go vet, staticcheck, and golangci-lint run before considering the prompt complete
Each sub-phase Exit Criteria section includes:
[ ] go vet ./internal/promptlib/... reports no issues
[ ] staticcheck ./internal/promptlib/... reports no issues
[ ] No functions exceed 50 lines (extract helpers as needed)
[ ] All returned errors are handled (no _ = err patterns)
Risk Mitigation
| Risk | Mitigation |
|---|---|
| Environment promotion breaks agents | Feature-flagged. When disabled, searcher returns all prompts (Phase 18 behavior). Promotion is additive — unpromoted prompts still exist. |
| Snippet cycles (A references B references A) | Depth counter with hard limit (default 3). Cycle detection via visited-set in resolution. Returns error, not infinite loop. |
| A/B experiment affects workflow determinism | Variant selection uses Temporal SideEffect — deterministic on replay. Outcomes recorded fire-and-forget. |
| Eval framework cost (LLM-as-judge) | Eval entries processed with configurable concurrency limit. LLM-as-judge is optional scorer, not required. Cost tracked per eval run. |
| Diff on very large prompts | Diff operates on lines, bounded by prompt content size. Prompts > 10KB trigger summary-only mode. |
| NATS unavailability | Cache invalidation is best-effort. Agents fall back to TTL-based cache expiry. Promotion succeeds regardless of NATS state. |
| Migration conflicts with existing Phase 18 tables | All new tables use distinct names (no ALTER on existing prompts or prompt_metrics). |
Relationship to Other Phases
| Phase | Relationship |
|---|---|
| Phase 18 (Prompt Library) | 26 extends 18's types, store, and searcher. No breaking changes. |
| Phase 12 (NATS) | 26E uses NATS for cache invalidation events. |
| Phase 9C (Audit) | 26D reads audit log for production log → dataset pipeline. |
| Phase 6C (Speculative Execution) | 26B's A/B testing shares Temporal SideEffect patterns. |
| Phase 19 (Tool Registry Restructure) | Orthogonal — tool quality scoring and prompt quality scoring are independent. |
| Phase 22C (UI) | 26D extends PromptLibraryPage with diff viewer and environment badges. |
| Phase 24 (Context Management) | Orthogonal — context assembly consumes prompts; Phase 26 manages prompt lifecycle. |
Progress Notes
(none yet)