
Phase 26 — Prompt Library v2: Advanced Prompt Management

Extends the Phase 18 prompt library with deployment environments, composable snippets, A/B experimentation, structured evaluation, version diffing, CI/CD eval integration, production-to-dataset pipelines, NATS cache invalidation, provider-agnostic prompt blueprints, and prompt analytics. Modeled after capabilities found in Braintrust, PromptLayer, and Humanloop — filtered for features that create meaningful value in a Temporal-native agent orchestration platform.

Status: Planned
Depends on: Phase 18 (Prompt Library), Phase 12 (NATS), Phase 9 (Audit/Tenant), Phase 6C (Speculative Execution)
Migrations: 0034_prompt_environments, 0035_prompt_snippets, 0036_prompt_experiments, 0037_eval_datasets
Branch: dev


Why Now

Phase 18 established a production-ready prompt catalog with content-hashed versioning, embedding-based semantic search, quality metrics, and Go text/template rendering. But four critical gaps remain between "prompts exist in a catalog" and "prompts are safely managed across their lifecycle in production":

  1. No deployment safety net — A prompt goes from "created" to "used by agents" with nothing in between. There is no concept of dev → staging → production promotion, no quality gates, and no rollback path. A bad prompt can immediately affect all agents.
  2. Monolithic prompts — Every prompt is self-contained. System preambles, safety guardrails, output format instructions, and few-shot examples are duplicated across prompts. Tenant customization requires duplicating entire prompts rather than overriding fragments.
  3. No experimentation infrastructure — The existing SpeculativeConfig (Phase 6C) runs multiple decision paths but has no mechanism for controlled prompt A/B tests with traffic splitting, variant tracking, or statistical comparison.
  4. Primitive quality signals — success_rate * avg_llm_rating from fire-and-forget feedback is insufficient for pre-deployment confidence. There are no eval datasets, no regression detection, no LLM-as-a-judge scoring, and no CI/CD quality gates.

Phase 26 solves all four while preserving Phase 18's architectural strengths: content-hashed immutability, embedding-based semantic search, multi-tenant isolation, and Temporal-native durability.


Architecture

Extended internal/promptlib/ Package

Phase 26 extends the existing package with new subpackages rather than modifying core types:

internal/promptlib/
├── types.go (Phase 18 — unchanged)
├── store.go (Phase 18 — unchanged)
├── metrics_store.go (Phase 18 — unchanged)
├── hash.go (Phase 18 — unchanged)
├── renderer.go (Phase 18 — extended with snippet resolution)
├── searcher.go (Phase 18 — extended with environment filtering)
├── scorer.go (Phase 18 — unchanged)
├── indexer.go (Phase 18 — unchanged)
├── config.go (Phase 18 — extended with new env vars)
├── feedback.go (Phase 18 — unchanged)

├── environments.go (26A — NEW: environment store + promotion logic)
├── snippets.go (26A — NEW: snippet resolution + version pinning)
├── experiment.go (26B — NEW: A/B experiment types + resolution)
├── experiment_store.go (26B — NEW: Postgres experiment store)

├── eval/
│ ├── types.go (26C — NEW: dataset, eval run, scorer types)
│ ├── dataset_store.go (26C — NEW: eval dataset CRUD)
│ ├── runner.go (26C — NEW: EvalRunWorkflow orchestrator)
│ ├── scorers.go (26C — NEW: built-in scorer activities)
│ └── report.go (26C — NEW: eval result aggregation)

├── diff.go (26D — NEW: version diff engine)
├── blueprint.go (26E — NEW: provider-agnostic prompt representation)
└── analytics.go (26E — NEW: prompt-level time-series queries)

Integration Points

┌─────────────────────────────────────────────────────────────────┐
│ Phase 26 Extensions │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────────────┐ │
│ │ Environments │ │ Snippets │ │ Experiments │ │
│ │ (dev→stg→prd)│ │ (composable │ │ (A/B via Temporal │ │
│ │ quality gates│ │ fragments) │ │ SideEffect) │ │
│ └──────┬───────┘ └──────┬───────┘ └──────────┬───────────┘ │
│ │ │ │ │
│ ┌──────▼─────────────────▼──────────────────────▼───────────┐ │
│ │ Phase 18 Core (unchanged) │ │
│ │ Store · MetricsStore · Searcher · Renderer · Indexer │ │
│ └──────┬────────────────────────────────────────────────────┘ │
│ │ │
│ ┌──────▼───────────────────────────────────────────────────┐ │
│ │ Eval Framework │ │
│ │ Datasets · EvalRunWorkflow · Scorers · Reports │ │
│ │ (quality gates for environment promotion) │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │
│ External integrations: │
│ ├─ NATS (Phase 12) — cache invalidation on promotion │
│ ├─ Audit (Phase 9C) — production log → dataset pipeline │
│ ├─ Temporal — EvalRunWorkflow, experiment variant SideEffect │
│ ├─ LLM Client (Phase 7D) — LLM-as-a-judge scoring │
│ └─ Embedding (Phase 8D) — cosine similarity scoring │
└─────────────────────────────────────────────────────────────────┘

Three-Tier Implementation

Tier 1 — Highest ROI (Sub-phases A + B):
1.1 Deployment Environments with Quality Gates
1.2 Composable Prompt Snippets
1.3 A/B Testing via Temporal SideEffect
1.4 Structured Evaluation Framework

Tier 2 — Valuable (Sub-phases C + D):
2.1 Visual Version Diff
2.2 CI/CD Eval Integration
2.3 Production Log → Dataset Pipeline

Tier 3 — Polish (Sub-phase E):
3.1 NATS Cache Invalidation
3.2 Provider-Agnostic Prompt Blueprints
3.3 Prompt Analytics Dashboard

Competitive Comparison

| Capability | Braintrust | PromptLayer | Cruvero (Phase 18) | Cruvero (After Phase 26) |
|---|---|---|---|---|
| Versioning | Content-addressed | Sequential (v1, v2) | Content-hashed + immutable | Same (unchanged) |
| Semantic search | | | ✓ Embedding-based 3-stage | Same (unique advantage) |
| Environments | dev/staging/prod | Release labels | | ✓ Named environments + quality gates |
| Prompt composition | Functions (code) | Snippets (@@@) | ✗ Monolithic only | ✓ Snippets with version pinning |
| A/B testing | Playground + manual | Dynamic Release Labels | | ✓ Temporal SideEffect (replay-safe) |
| Eval framework | Eval() + autoevals | Visual pipeline builder | success_rate + avg_rating | ✓ Datasets + scorers + EvalRunWorkflow |
| CI/CD evals | GitHub Action | Auto-trigger on version | | ✓ CLI + exit code |
| Multi-tenancy | Project-level | Workspace | ✓ First-class tenant isolation | Same (unique advantage) |
| Observability | Brainstore + BTQL | Middleware logging | Basic prompt_metrics | ✓ Time-series analytics |
| Cache invalidation | Webhook-driven | | | ✓ NATS pub/sub |
| Runtime SDK | Go (beta), TS, Python | Python, JS | Go-native | Same + environment-aware resolution |
| Diff/compare | Side-by-side UI | Diff views | | ✓ Structured line-level diff |

Core Types and Interfaces

// --- Environments (26A) ---

type Environment struct {
TenantID string `json:"tenant_id"`
Name string `json:"name"` // "dev", "staging", "production"
PromptID string `json:"prompt_id"`
Version int `json:"version"`
PromptHash string `json:"prompt_hash"`
PromotedBy string `json:"promoted_by"`
PromotedAt time.Time `json:"promoted_at"`
}

type QualityGate struct {
MinUsageCount int `json:"min_usage_count"`
MinSuccessRate float64 `json:"min_success_rate"` // 0.0-1.0
MinAvgRating float64 `json:"min_avg_rating"` // 0.0-1.0
RequireEvalPass bool `json:"require_eval_pass"`
EvalDatasetID string `json:"eval_dataset_id,omitempty"`
EvalThreshold float64 `json:"eval_threshold,omitempty"`
}

type EnvironmentStore interface {
Promote(ctx context.Context, env Environment, gate *QualityGate) error
GetActive(ctx context.Context, tenantID, promptID, envName string) (Environment, error)
ListEnvironments(ctx context.Context, tenantID, promptID string) ([]Environment, error)
GetPromotionHistory(ctx context.Context, tenantID, promptID, envName string, limit int) ([]Environment, error)
}
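A minimal sketch of the quality-gate check that Promote would run before upserting an environment row. PromptMetrics and CheckGate are hypothetical names; the real implementation would read these values from the Phase 18 MetricsStore and additionally consult the latest EvalRun when RequireEvalPass is set.

```go
package main

import "fmt"

// QualityGate mirrors the type above (eval fields omitted for brevity).
type QualityGate struct {
	MinUsageCount  int
	MinSuccessRate float64
	MinAvgRating   float64
}

// PromptMetrics is a hypothetical read model over the Phase 18 MetricsStore.
type PromptMetrics struct {
	UsageCount  int
	SuccessRate float64
	AvgRating   float64
}

// CheckGate returns nil when metrics clear every threshold, or an error
// naming the first failed check so the caller can surface it to the operator.
func CheckGate(g QualityGate, m PromptMetrics) error {
	switch {
	case m.UsageCount < g.MinUsageCount:
		return fmt.Errorf("usage count %d below minimum %d", m.UsageCount, g.MinUsageCount)
	case m.SuccessRate < g.MinSuccessRate:
		return fmt.Errorf("success rate %.2f below minimum %.2f", m.SuccessRate, g.MinSuccessRate)
	case m.AvgRating < g.MinAvgRating:
		return fmt.Errorf("avg rating %.2f below minimum %.2f", m.AvgRating, g.MinAvgRating)
	}
	return nil
}

func main() {
	gate := QualityGate{MinUsageCount: 50, MinSuccessRate: 0.95, MinAvgRating: 0.8}
	fmt.Println(CheckGate(gate, PromptMetrics{UsageCount: 120, SuccessRate: 0.97, AvgRating: 0.85}))
	fmt.Println(CheckGate(gate, PromptMetrics{UsageCount: 10, SuccessRate: 0.99, AvgRating: 0.9}))
}
```

Passing a nil gate to Promote would skip the check entirely, which keeps dev-environment promotions friction-free.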

// --- Snippets (26A) ---

type SnippetRef struct {
PromptID string `json:"prompt_id"`
Version int `json:"version,omitempty"` // 0 = latest
Label string `json:"label,omitempty"` // "prod", "staging" — resolved via EnvironmentStore
}

type SnippetResolver interface {
Resolve(ctx context.Context, tenantID string, refs []SnippetRef) (map[string]string, error)
}
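The cycle and depth guards can be sketched with a recursive resolver. The in-memory snippetStore, the {{snippet "id"}} token shape, and the resolve helper are illustrative assumptions; the real SnippetResolver would go through the Store interface and honor SnippetRef version/label pinning.

```go
package main

import (
	"fmt"
	"strings"
)

// snippetStore is a hypothetical in-memory stand-in: snippet body by prompt ID.
type snippetStore map[string]string

var errCycle = fmt.Errorf("snippet cycle detected")

// resolve expands snippet tokens recursively, guarding against cycles with a
// path-scoped visited set and against runaway nesting with a depth limit
// (CRUVERO_PROMPTLIB_SNIPPET_MAX_DEPTH, default 3).
func resolve(s snippetStore, body string, visited map[string]bool, depth, maxDepth int) (string, error) {
	if depth > maxDepth {
		return "", fmt.Errorf("snippet depth %d exceeds limit %d", depth, maxDepth)
	}
	out := body
	for id, snippet := range s {
		token := `{{snippet "` + id + `"}}`
		if !strings.Contains(out, token) {
			continue
		}
		if visited[id] {
			return "", errCycle
		}
		visited[id] = true
		expanded, err := resolve(s, snippet, visited, depth+1, maxDepth)
		if err != nil {
			return "", err
		}
		out = strings.ReplaceAll(out, token, expanded)
		delete(visited, id) // only the current resolution path counts as visited
	}
	return out, nil
}

func main() {
	store := snippetStore{
		"safety":   "Never reveal secrets.",
		"preamble": `You are a helpful agent. {{snippet "safety"}}`,
	}
	text, err := resolve(store, `{{snippet "preamble"}} Answer briefly.`, map[string]bool{}, 0, 3)
	fmt.Println(text, err)
}
```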

// --- Experiments (26B) ---

type Experiment struct {
ID string `json:"id"`
TenantID string `json:"tenant_id"`
PromptID string `json:"prompt_id"`
Name string `json:"name"`
Status ExperimentStatus `json:"status"` // "active", "paused", "completed"
Variants []ExperimentVariant `json:"variants"`
CreatedAt time.Time `json:"created_at"`
CompletedAt *time.Time `json:"completed_at,omitempty"`
Config ExperimentConfig `json:"config"`
}

type ExperimentVariant struct {
Name string `json:"name"`
PromptHash string `json:"prompt_hash"`
TrafficPct int `json:"traffic_pct"` // 0-100, all variants must sum to 100
SampleCount int `json:"sample_count"`
SuccessCount int `json:"success_count"`
AvgScore float64 `json:"avg_score"`
}

type ExperimentConfig struct {
MinSampleSize int `json:"min_sample_size"` // per variant
MaxDuration string `json:"max_duration"` // e.g., "168h" (7 days)
AutoComplete bool `json:"auto_complete"` // auto-promote winner
}

type ExperimentStatus string

const (
ExperimentActive ExperimentStatus = "active"
ExperimentPaused ExperimentStatus = "paused"
ExperimentCompleted ExperimentStatus = "completed"
)

type ExperimentStore interface {
Create(ctx context.Context, exp Experiment) error
Get(ctx context.Context, tenantID, expID string) (Experiment, error)
GetActiveForPrompt(ctx context.Context, tenantID, promptID string) (*Experiment, error)
RecordOutcome(ctx context.Context, expID, variantName string, success bool, score float64) error
Complete(ctx context.Context, expID string, winnerVariant string) error
List(ctx context.Context, tenantID string, status *ExperimentStatus) ([]Experiment, error)
}
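Variant assignment reduces to mapping one random draw onto the cumulative traffic split. A pickVariant helper along these lines (a hypothetical name) would be called with a draw produced inside workflow.SideEffect, so replay re-selects the same variant deterministically.

```go
package main

import (
	"fmt"
	"math/rand"
)

type ExperimentVariant struct {
	Name       string
	TrafficPct int // all variants must sum to 100
}

// pickVariant maps a draw in [0,100) onto the cumulative traffic split.
// In the real workflow the draw comes from workflow.SideEffect, never from
// rand directly, so history replay is deterministic.
func pickVariant(variants []ExperimentVariant, draw int) string {
	cum := 0
	for _, v := range variants {
		cum += v.TrafficPct
		if draw < cum {
			return v.Name
		}
	}
	return variants[len(variants)-1].Name // guard against rounding gaps
}

func main() {
	variants := []ExperimentVariant{{"control", 50}, {"candidate", 50}}
	fmt.Println(pickVariant(variants, rand.Intn(100)))
}
```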

// --- Eval Framework (26C) ---

type EvalDataset struct {
ID string `json:"id"`
TenantID string `json:"tenant_id"`
Name string `json:"name"`
PromptID string `json:"prompt_id,omitempty"` // optional binding
Version int `json:"version"`
Entries []EvalEntry `json:"entries"`
CreatedAt time.Time `json:"created_at"`
Metadata map[string]any `json:"metadata,omitempty"`
}

type EvalEntry struct {
ID string `json:"id"`
Input map[string]any `json:"input"` // template params
ExpectedOutput string `json:"expected_output"` // golden output
Metadata map[string]any `json:"metadata,omitempty"`
}

type EvalRun struct {
ID string `json:"id"`
TenantID string `json:"tenant_id"`
PromptHash string `json:"prompt_hash"`
DatasetID string `json:"dataset_id"`
DatasetVer int `json:"dataset_version"`
Scorers []string `json:"scorers"` // scorer names to run
Status string `json:"status"` // "running", "completed", "failed"
Results []EvalResult `json:"results"`
Summary EvalSummary `json:"summary"`
StartedAt time.Time `json:"started_at"`
CompletedAt *time.Time `json:"completed_at,omitempty"`
}

type EvalResult struct {
EntryID string `json:"entry_id"`
Output string `json:"output"`
Scores map[string]float64 `json:"scores"` // scorer_name → 0.0-1.0
Latency time.Duration `json:"latency"`
Tokens int `json:"tokens"`
Error string `json:"error,omitempty"`
}

type EvalSummary struct {
TotalEntries int `json:"total_entries"`
PassCount int `json:"pass_count"`
FailCount int `json:"fail_count"`
ErrorCount int `json:"error_count"`
AvgScores map[string]float64 `json:"avg_scores"` // per scorer
AvgLatency time.Duration `json:"avg_latency"`
TotalTokens int `json:"total_tokens"`
TotalCost float64 `json:"total_cost"`
PassRate float64 `json:"pass_rate"` // 0.0-1.0
}
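The summary aggregation can be sketched as a single fold over per-entry results. The summarize helper and its passThreshold knob are assumptions for illustration: an entry passes when every scorer clears the threshold, and errored entries are counted separately.

```go
package main

import "fmt"

// EvalResult is trimmed to the two fields the fold needs.
type EvalResult struct {
	Scores map[string]float64
	Err    string
}

// summarize folds per-entry results into pass/fail/error counts and
// per-scorer averages (errored entries are excluded from the averages).
func summarize(results []EvalResult, passThreshold float64) (pass, fail, errs int, avg map[string]float64) {
	avg = map[string]float64{}
	counts := map[string]int{}
	for _, r := range results {
		if r.Err != "" {
			errs++
			continue
		}
		ok := true
		for name, score := range r.Scores {
			avg[name] += score
			counts[name]++
			if score < passThreshold {
				ok = false
			}
		}
		if ok {
			pass++
		} else {
			fail++
		}
	}
	for name := range avg {
		avg[name] /= float64(counts[name])
	}
	return
}

func main() {
	results := []EvalResult{
		{Scores: map[string]float64{"exact_match": 1.0}},
		{Scores: map[string]float64{"exact_match": 0.0}},
		{Err: "timeout"},
	}
	pass, fail, errs, avg := summarize(results, 0.7)
	fmt.Println(pass, fail, errs, avg["exact_match"]) // 1 1 1 0.5
}
```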

// Scorer produces a 0.0-1.0 score for an eval entry.
type Scorer interface {
Name() string
Score(ctx context.Context, input ScorerInput) (float64, error)
}

type ScorerInput struct {
Prompt Prompt `json:"prompt"`
TemplateParams map[string]any `json:"template_params"`
Output string `json:"output"`
ExpectedOutput string `json:"expected_output"`
Metadata map[string]any `json:"metadata,omitempty"`
}
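The simplest built-in scorer is a sketch away. This exactMatch helper (illustrative, not the shipped activity) shows the shape: normalize both strings, return 1.0 on equality, 0.0 otherwise; the real scorer would receive the full ScorerInput.

```go
package main

import (
	"fmt"
	"strings"
)

// exactMatch scores 1.0 when the model output equals the golden output after
// trimming and case-folding, else 0.0. Whether to case-fold is a judgment
// call; an assumed default here.
func exactMatch(output, expected string) float64 {
	norm := func(s string) string { return strings.ToLower(strings.TrimSpace(s)) }
	if norm(output) == norm(expected) {
		return 1.0
	}
	return 0.0
}

func main() {
	fmt.Println(exactMatch("  Paris\n", "paris")) // 1
	fmt.Println(exactMatch("Lyon", "Paris"))      // 0
}
```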

// --- Diff (26D) ---

type PromptDiff struct {
PromptID string `json:"prompt_id"`
FromVer int `json:"from_version"`
ToVer int `json:"to_version"`
Hunks []DiffHunk `json:"hunks"`
Summary string `json:"summary"` // "3 additions, 2 deletions, 1 modification"
}

type DiffHunk struct {
Type string `json:"type"` // "add", "delete", "modify", "equal"
FromLine int `json:"from_line"`
ToLine int `json:"to_line"`
Content string `json:"content"`
}
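ComputeDiff can be built on a standard line-level LCS. The lineDiff sketch below classifies each line as equal, deleted, or added; the real engine would additionally coalesce adjacent delete/add pairs into "modify" hunks and attach line numbers.

```go
package main

import (
	"fmt"
	"strings"
)

type DiffHunk struct {
	Type    string // "add", "delete", "equal"
	Content string
}

// lineDiff computes a minimal line diff via longest-common-subsequence:
// lines in the LCS are "equal", lines only in from are "delete",
// lines only in to are "add".
func lineDiff(from, to string) []DiffHunk {
	a, b := strings.Split(from, "\n"), strings.Split(to, "\n")
	// dp[i][j] = length of the LCS of a[i:] and b[j:]
	dp := make([][]int, len(a)+1)
	for i := range dp {
		dp[i] = make([]int, len(b)+1)
	}
	for i := len(a) - 1; i >= 0; i-- {
		for j := len(b) - 1; j >= 0; j-- {
			if a[i] == b[j] {
				dp[i][j] = dp[i+1][j+1] + 1
			} else if dp[i+1][j] >= dp[i][j+1] {
				dp[i][j] = dp[i+1][j]
			} else {
				dp[i][j] = dp[i][j+1]
			}
		}
	}
	var hunks []DiffHunk
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i] == b[j]:
			hunks = append(hunks, DiffHunk{"equal", a[i]})
			i++
			j++
		case dp[i+1][j] >= dp[i][j+1]:
			hunks = append(hunks, DiffHunk{"delete", a[i]})
			i++
		default:
			hunks = append(hunks, DiffHunk{"add", b[j]})
			j++
		}
	}
	for ; i < len(a); i++ {
		hunks = append(hunks, DiffHunk{"delete", a[i]})
	}
	for ; j < len(b); j++ {
		hunks = append(hunks, DiffHunk{"add", b[j]})
	}
	return hunks
}

func main() {
	for _, h := range lineDiff("a\nb\nc", "a\nc\nd") {
		fmt.Println(h.Type, h.Content)
	}
}
```

The O(n*m) table is fine here because prompts above 10KB fall back to summary-only mode.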

// --- Provider Blueprint (26E) ---

type PromptBlueprint struct {
Messages []BlueprintMessage `json:"messages"`
Model string `json:"model,omitempty"`
Temperature *float64 `json:"temperature,omitempty"`
MaxTokens *int `json:"max_tokens,omitempty"`
StopSequence []string `json:"stop_sequence,omitempty"`
Tools []json.RawMessage `json:"tools,omitempty"`
}

type BlueprintMessage struct {
Role string `json:"role"` // "system", "user", "assistant"
Content string `json:"content"`
}

type BlueprintAdapter interface {
ToProviderFormat(bp PromptBlueprint, provider string) (json.RawMessage, error)
FromProviderFormat(raw json.RawMessage, provider string) (PromptBlueprint, error)
}

Sub-Phases

| Sub-Phase | Name | Prompts | Depends On |
|---|---|---|---|
| 26A | Environments + Snippets | 5 | |
| 26B | A/B Experiments | 4 | 26A |
| 26C | Evaluation Framework | 5 | 26A |
| 26D | Diff + CI/CD + Log Pipeline | 4 | 26C |
| 26E | NATS Cache + Blueprints + Analytics | 3 | 26A, Phase 12 |
| 26F | CI/CD + Deployment | 3 | 26A–26E |

Total: 6 sub-phases, 24 prompts, 13 documentation files

Dependency Graph

26A (Environments + Snippets)
├─► 26B (A/B Experiments) ────────────┐
├─► 26C (Evaluation) ──► 26D (Diff) ├──► 26F (CI/CD + Deploy)
└─► 26E (NATS + Blueprints) ─────────┘

26B, 26C, and 26E are parallelizable after 26A. 26D depends on 26C (eval framework needed for CI/CD gates). 26F depends on all implementation sub-phases (26A–26E) and handles Docker images, Helm charts, and ArgoCD configuration.


Environment Variables

| Variable | Default | Description |
|---|---|---|
| CRUVERO_PROMPTLIB_ENVS_ENABLED | true | Enable deployment environments |
| CRUVERO_PROMPTLIB_DEFAULT_ENVS | dev,staging,production | Comma-separated environment names created per tenant |
| CRUVERO_PROMPTLIB_SNIPPETS_ENABLED | true | Enable snippet composition |
| CRUVERO_PROMPTLIB_SNIPPET_MAX_DEPTH | 3 | Max nested snippet resolution depth (prevents cycles) |
| CRUVERO_PROMPTLIB_EXPERIMENTS_ENABLED | false | Enable A/B experimentation |
| CRUVERO_PROMPTLIB_EXPERIMENT_MAX_VARIANTS | 4 | Max variants per experiment |
| CRUVERO_PROMPTLIB_EVAL_ENABLED | true | Enable evaluation framework |
| CRUVERO_PROMPTLIB_EVAL_TIMEOUT | 300s | Per-entry eval timeout |
| CRUVERO_PROMPTLIB_EVAL_MAX_CONCURRENT | 10 | Max concurrent eval entries |
| CRUVERO_PROMPTLIB_DIFF_CONTEXT_LINES | 3 | Lines of context in diff output |
| CRUVERO_PROMPTLIB_NATS_CACHE_ENABLED | false | Enable NATS cache invalidation |
| CRUVERO_PROMPTLIB_NATS_SUBJECT | cruvero.prompts.events | NATS subject for prompt events |
| CRUVERO_PROMPTLIB_BLUEPRINT_ENABLED | false | Enable provider-agnostic blueprints |
| CRUVERO_PROMPTLIB_ANALYTICS_RETENTION | 90d | Analytics data retention period |

Files Overview

New Files

| File | Sub-Phase | Description |
|---|---|---|
| internal/promptlib/environments.go | 26A | EnvironmentStore interface + PostgresEnvironmentStore |
| internal/promptlib/quality_gate.go | 26A | QualityGate evaluation logic, MetricsStore + EvalRun checks |
| internal/promptlib/snippets.go | 26A | SnippetResolver with cycle detection + version pinning |
| internal/promptlib/experiment.go | 26B | Experiment types, ExperimentStore interface |
| internal/promptlib/experiment_store.go | 26B | PostgresExperimentStore implementation |
| internal/promptlib/experiment_resolver.go | 26B | ResolveVariantActivity — Temporal SideEffect-based variant selection |
| internal/promptlib/eval/types.go | 26C | EvalDataset, EvalEntry, EvalRun, EvalResult, EvalSummary, Scorer |
| internal/promptlib/eval/dataset_store.go | 26C | PostgresDatasetStore (CRUD, versioning) |
| internal/promptlib/eval/runner.go | 26C | EvalRunWorkflow — Temporal workflow orchestrating eval |
| internal/promptlib/eval/scorers.go | 26C | Built-in scorers: exact_match, contains, regex, cosine_similarity, llm_judge |
| internal/promptlib/eval/report.go | 26C | EvalSummary aggregation + comparison helpers |
| internal/promptlib/diff.go | 26D | ComputeDiff — line-level diff between prompt versions |
| internal/promptlib/log_pipeline.go | 26D | DatasetFromLogsActivity — audit log → eval dataset |
| internal/promptlib/nats_cache.go | 26E | NATS publisher for prompt events + subscriber cache buster |
| internal/promptlib/blueprint.go | 26E | PromptBlueprint + adapters for OpenAI, Anthropic, Azure |
| internal/promptlib/analytics.go | 26E | Time-series queries over prompt_metrics |
| cmd/prompt-eval/main.go | 26D | CLI to run eval against dataset + exit code on failure |
| cmd/prompt-diff/main.go | 26D | CLI to diff prompt versions |
| internal/tools/prompt_promote.go | 26A | PromptPromoteTool — agent tool for environment promotion |
| migrations/0034_prompt_environments.up.sql | 26A | prompt_environments table |
| migrations/0034_prompt_environments.down.sql | 26A | Drop table |
| migrations/0035_prompt_snippets.up.sql | 26A | prompt_snippet_refs table (tracks snippet dependencies) |
| migrations/0035_prompt_snippets.down.sql | 26A | Drop table |
| migrations/0036_prompt_experiments.up.sql | 26B | prompt_experiments + experiment_outcomes tables |
| migrations/0036_prompt_experiments.down.sql | 26B | Drop tables |
| migrations/0037_eval_datasets.up.sql | 26C | eval_datasets + eval_entries + eval_runs + eval_results tables |
| migrations/0037_eval_datasets.down.sql | 26C | Drop tables |
| docker/Dockerfile.prompt-tools | 26F | Multi-binary Go image bundling prompt-eval, prompt-dataset, prompt-experiment, prompt-diff |
| charts/cruvero/templates/prompt-tools/job.yaml | 26F | Helm Job template for batch eval/dataset operations |

Modified Files

| File | Sub-Phase | Change |
|---|---|---|
| internal/promptlib/renderer.go | 26A | Add snippet resolution via FuncMap before template execution |
| internal/promptlib/searcher.go | 26A | Add environment filter to search pipeline Stage 3 |
| internal/promptlib/config.go | 26A | Add all new env vars + component wiring |
| internal/tools/manager.go | 26A | Register prompt_promote tool |
| internal/agent/activities.go | 26B | Wire experiment variant resolution before prompt selection |
| cmd/ui/prompts_handler.go | 26D | Add /api/prompts/{id}/diff, /api/prompts/{id}/environments endpoints |
| cmd/ui/frontend/src/pages/PromptLibraryPage.tsx | 26D | Add diff viewer, environment badges, eval results |
| .github/workflows/build-images.yaml | 26F | Add prompt-tools to CI build matrix |
| charts/cruvero/values.yaml | 26F | Add promptTools section + 14 new env vars |
| charts/cruvero/values-dev.yaml | 26F | Enable prompt-tools, set image repo |
| charts/cruvero/values-staging.yaml | 26F | Explicit promptTools.enabled: false |
| charts/cruvero/values-prod.yaml | 26F | Explicit promptTools.enabled: false |
| deploy/argocd/applicationset.yaml | 26F | Add prompt-tools Image Updater annotations |

Referenced (read, not modified)

| File | Purpose |
|---|---|
| internal/promptlib/store.go | Store interface for Get/GetLatest (snippet resolution) |
| internal/promptlib/metrics_store.go | MetricsStore for quality gate evaluation |
| internal/memory/salience.go | Recency/usage scoring patterns |
| internal/audit/logger.go | Audit event queries for log→dataset pipeline |
| internal/events/nats.go | NATS client patterns (Phase 12) |
| internal/llm/client.go | LLM client for llm_judge scorer |
| internal/embedding/embedder.go | Embedder for cosine_similarity scorer |
| internal/config/config.go | Config struct patterns |

Migrations

0034_prompt_environments.up.sql

CREATE TABLE IF NOT EXISTS prompt_environments (
tenant_id TEXT NOT NULL,
prompt_id TEXT NOT NULL,
env_name TEXT NOT NULL,
version INTEGER NOT NULL,
prompt_hash TEXT NOT NULL,
promoted_by TEXT NOT NULL DEFAULT '',
promoted_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
PRIMARY KEY (tenant_id, prompt_id, env_name)
);

CREATE INDEX idx_prompt_env_hash ON prompt_environments (prompt_hash);

-- Promotion history (append-only audit trail)
CREATE TABLE IF NOT EXISTS prompt_promotion_history (
id BIGSERIAL PRIMARY KEY,
tenant_id TEXT NOT NULL,
prompt_id TEXT NOT NULL,
env_name TEXT NOT NULL,
version INTEGER NOT NULL,
prompt_hash TEXT NOT NULL,
promoted_by TEXT NOT NULL DEFAULT '',
promoted_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE INDEX idx_promo_history_lookup ON prompt_promotion_history (tenant_id, prompt_id, env_name, promoted_at DESC);

0035_prompt_snippets.up.sql

-- Tracks which prompts reference which snippets (for dependency analysis)
CREATE TABLE IF NOT EXISTS prompt_snippet_refs (
tenant_id TEXT NOT NULL,
parent_hash TEXT NOT NULL, -- prompt that contains the snippet ref
snippet_id TEXT NOT NULL, -- referenced snippet prompt_id
snippet_version INTEGER, -- NULL = latest, >0 = pinned
snippet_label TEXT, -- NULL = direct version, non-null = environment label
PRIMARY KEY (tenant_id, parent_hash, snippet_id)
);

CREATE INDEX idx_snippet_refs_snippet ON prompt_snippet_refs (tenant_id, snippet_id);

0036_prompt_experiments.up.sql

CREATE TABLE IF NOT EXISTS prompt_experiments (
id TEXT NOT NULL,
tenant_id TEXT NOT NULL,
prompt_id TEXT NOT NULL,
name TEXT NOT NULL,
status TEXT NOT NULL DEFAULT 'active',
variants JSONB NOT NULL,
config JSONB NOT NULL DEFAULT '{}',
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
completed_at TIMESTAMPTZ,
winner TEXT,
PRIMARY KEY (tenant_id, id)
);

CREATE INDEX idx_experiments_prompt ON prompt_experiments (tenant_id, prompt_id, status);

CREATE TABLE IF NOT EXISTS experiment_outcomes (
id BIGSERIAL PRIMARY KEY,
experiment_id TEXT NOT NULL,
tenant_id TEXT NOT NULL,
variant_name TEXT NOT NULL,
run_id TEXT NOT NULL,
success BOOLEAN NOT NULL,
score DOUBLE PRECISION,
recorded_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE INDEX idx_exp_outcomes_lookup ON experiment_outcomes (tenant_id, experiment_id, variant_name);

0037_eval_datasets.up.sql

CREATE TABLE IF NOT EXISTS eval_datasets (
id TEXT NOT NULL,
tenant_id TEXT NOT NULL,
name TEXT NOT NULL,
prompt_id TEXT,
version INTEGER NOT NULL DEFAULT 1,
metadata JSONB DEFAULT '{}',
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
PRIMARY KEY (tenant_id, id, version)
);

CREATE TABLE IF NOT EXISTS eval_entries (
id TEXT NOT NULL,
dataset_id TEXT NOT NULL,
dataset_version INTEGER NOT NULL,
tenant_id TEXT NOT NULL,
input JSONB NOT NULL,
expected_output TEXT NOT NULL,
metadata JSONB DEFAULT '{}',
PRIMARY KEY (tenant_id, dataset_id, dataset_version, id)
);

CREATE TABLE IF NOT EXISTS eval_runs (
id TEXT NOT NULL,
tenant_id TEXT NOT NULL,
prompt_hash TEXT NOT NULL,
dataset_id TEXT NOT NULL,
dataset_ver INTEGER NOT NULL,
scorers TEXT[] NOT NULL,
status TEXT NOT NULL DEFAULT 'running',
summary JSONB DEFAULT '{}',
started_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
completed_at TIMESTAMPTZ,
PRIMARY KEY (tenant_id, id)
);

CREATE INDEX idx_eval_runs_prompt ON eval_runs (tenant_id, prompt_hash);

CREATE TABLE IF NOT EXISTS eval_results (
eval_run_id TEXT NOT NULL,
entry_id TEXT NOT NULL,
tenant_id TEXT NOT NULL,
output TEXT NOT NULL,
scores JSONB NOT NULL,
latency_ms INTEGER NOT NULL,
tokens INTEGER NOT NULL DEFAULT 0,
error TEXT,
PRIMARY KEY (tenant_id, eval_run_id, entry_id)
);

Success Metrics

| Metric | Target |
|---|---|
| Environment promotion latency | < 500ms (quality gate check + upsert) |
| Quality gate evaluation | Checks MetricsStore + optional eval run in < 2s |
| Snippet resolution depth | Max 3 levels, < 5ms p99 |
| Snippet cycle detection | 100% detection rate (no infinite loops) |
| A/B variant selection | Deterministic via Temporal SideEffect (replay-safe) |
| Experiment outcome recording | Non-blocking, < 5ms fire-and-forget |
| Eval run throughput | 10 concurrent entries by default, < 60s for 100-entry dataset |
| Built-in scorer accuracy | LLM-as-judge within 0.15 of human rating |
| Diff computation | < 10ms for prompts up to 10KB |
| CI/CD eval exit code | Non-zero on regression (score below threshold) |
| Log → dataset pipeline | < 30s for 1000 audit entries |
| NATS cache invalidation | < 100ms from promotion to subscriber notification |
| Test coverage | >= 80% for all new files |
| Backward compatibility | Phase 18 behavior unchanged when Phase 26 features disabled |

Code Quality Requirements (SonarQube)

All Go code produced by Phase 26 prompts must pass SonarQube quality gates:

  • Error handling: Every returned error must be handled explicitly
  • Cyclomatic complexity: Functions under 50 lines where practical
  • No dead code: No unused variables, empty blocks, or duplicated logic
  • Resource cleanup: Close all resources with proper defer patterns
  • Early returns: Prefer guard clauses over deeply nested conditionals
  • No magic values: Use named constants for strings and numbers
  • Meaningful names: Descriptive variable and function names
  • Linting gate: Run go vet, staticcheck, and golangci-lint run before considering the prompt complete

Each sub-phase Exit Criteria section includes:

  • [ ] go vet ./internal/promptlib/... reports no issues
  • [ ] staticcheck ./internal/promptlib/... reports no issues
  • [ ] No functions exceed 50 lines (extract helpers as needed)
  • [ ] All returned errors are handled (no _ = err patterns)

Risk Mitigation

| Risk | Mitigation |
|---|---|
| Environment promotion breaks agents | Feature-flagged. When disabled, searcher returns all prompts (Phase 18 behavior). Promotion is additive — unpromoted prompts still exist. |
| Snippet cycles (A references B references A) | Depth counter with hard limit (default 3). Cycle detection via visited-set in resolution. Returns error, not infinite loop. |
| A/B experiment affects workflow determinism | Variant selection uses Temporal SideEffect — deterministic on replay. Outcomes recorded fire-and-forget. |
| Eval framework cost (LLM-as-judge) | Eval entries processed with configurable concurrency limit. LLM-as-judge is optional scorer, not required. Cost tracked per eval run. |
| Diff on very large prompts | Diff operates on lines, bounded by prompt content size. Prompts > 10KB trigger summary-only mode. |
| NATS unavailability | Cache invalidation is best-effort. Agents fall back to TTL-based cache expiry. Promotion succeeds regardless of NATS state. |
| Migration conflicts with existing Phase 18 tables | All new tables use distinct names (no ALTER on existing prompts or prompt_metrics). |

Relationship to Other Phases

| Phase | Relationship |
|---|---|
| Phase 18 (Prompt Library) | 26 extends 18's types, store, and searcher. No breaking changes. |
| Phase 12 (NATS) | 26E uses NATS for cache invalidation events. |
| Phase 9C (Audit) | 26D reads audit log for production log → dataset pipeline. |
| Phase 6C (Speculative Execution) | 26B's A/B testing shares Temporal SideEffect patterns. |
| Phase 19 (Tool Registry Restructure) | Orthogonal — tool quality scoring and prompt quality scoring are independent. |
| Phase 22C (UI) | 26D extends PromptLibraryPage with diff viewer and environment badges. |
| Phase 24 (Context Management) | Orthogonal — context assembly consumes prompts; Phase 26 manages prompt lifecycle. |

Progress Notes

(none yet)