Phase 26 — Prompt Library v2: Advanced Prompt Management
Extends the Phase 18 prompt library with deployment environments, composable snippets, A/B experimentation, structured evaluation, version diffing, CI/CD eval integration, production-to-dataset pipelines, NATS cache invalidation, provider-agnostic prompt blueprints, and prompt analytics. Modeled after capabilities found in Braintrust, PromptLayer, and Humanloop — filtered for features that create meaningful value in a Temporal-native agent orchestration platform.
Status: Planned
Depends on: Phase 18 (Prompt Library), Phase 12 (NATS), Phase 9 (Audit/Tenant), Phase 6C (Speculative Execution)
Migrations: 0034_prompt_environments, 0035_prompt_snippets, 0036_prompt_experiments, 0037_eval_datasets
Branch: dev
Why Now
Phase 18 established a production-ready prompt catalog with content-hashed versioning, embedding-based semantic search, quality metrics, and Go text/template rendering. But four critical gaps remain between "prompts exist in a catalog" and "prompts are safely managed across their lifecycle in production":
- No deployment safety net — A prompt goes from "created" to "used by agents" with nothing in between. There is no concept of dev → staging → production promotion, no quality gates, and no rollback path. A bad prompt can immediately affect all agents.
- Monolithic prompts — Every prompt is self-contained. System preambles, safety guardrails, output format instructions, and few-shot examples are duplicated across prompts. Tenant customization requires duplicating entire prompts rather than overriding fragments.
- No experimentation infrastructure — The existing SpeculativeConfig (Phase 6C) runs multiple decision paths but has no mechanism for controlled prompt A/B tests with traffic splitting, variant tracking, or statistical comparison.
- Primitive quality signals — success_rate * avg_llm_rating from fire-and-forget feedback is insufficient for pre-deployment confidence. There are no eval datasets, no regression detection, no LLM-as-a-judge scoring, and no CI/CD quality gates.
Phase 26 solves all four while preserving Phase 18's architectural strengths: content-hashed immutability, embedding-based semantic search, multi-tenant isolation, and Temporal-native durability.
Architecture
Extended internal/promptlib/ Package
Phase 26 extends the existing package with new subpackages rather than modifying core types:
internal/promptlib/
├── types.go (Phase 18 — unchanged)
├── store.go (Phase 18 — unchanged)
├── metrics_store.go (Phase 18 — unchanged)
├── hash.go (Phase 18 — unchanged)
├── renderer.go (Phase 18 — extended with snippet resolution)
├── searcher.go (Phase 18 — extended with environment filtering)
├── scorer.go (Phase 18 — unchanged)
├── indexer.go (Phase 18 — unchanged)
├── config.go (Phase 18 — extended with new env vars)
├── feedback.go (Phase 18 — unchanged)
│
├── environments.go (26A — NEW: environment store + promotion logic)
├── snippets.go (26A — NEW: snippet resolution + version pinning)
├── experiment.go (26B — NEW: A/B experiment types + resolution)
├── experiment_store.go (26B — NEW: Postgres experiment store)
│
├── eval/
│ ├── types.go (26C — NEW: dataset, eval run, scorer types)
│ ├── dataset_store.go (26C — NEW: eval dataset CRUD)
│ ├── runner.go (26C — NEW: EvalRunWorkflow orchestrator)
│ ├── scorers.go (26C — NEW: built-in scorer activities)
│ └── report.go (26C — NEW: eval result aggregation)
│
├── diff.go (26D — NEW: version diff engine)
├── blueprint.go (26E — NEW: provider-agnostic prompt representation)
└── analytics.go (26E — NEW: prompt-level time-series queries)
Integration Points
┌─────────────────────────────────────────────────────────────────┐
│ Phase 26 Extensions │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────────────┐ │
│ │ Environments │ │ Snippets │ │ Experiments │ │
│ │ (dev→stg→prd)│ │ (composable │ │ (A/B via Temporal │ │
│ │ quality gates│ │ fragments) │ │ SideEffect) │ │
│ └──────┬───────┘ └──────┬───────┘ └──────────┬───────────┘ │
│ │ │ │ │
│ ┌──────▼─────────────────▼──────────────────────▼───────────┐ │
│ │ Phase 18 Core (unchanged) │ │
│ │ Store · MetricsStore · Searcher · Renderer · Indexer │ │
│ └──────┬────────────────────────────────────────────────────┘ │
│ │ │
│ ┌──────▼───────────────────────────────────────────────────┐ │
│ │ Eval Framework │ │
│ │ Datasets · EvalRunWorkflow · Scorers · Reports │ │
│ │ (quality gates for environment promotion) │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │
│ External integrations: │
│ ├─ NATS (Phase 12) — cache invalidation on promotion │
│ ├─ Audit (Phase 9C) — production log → dataset pipeline │
│ ├─ Temporal — EvalRunWorkflow, experiment variant SideEffect │
│ ├─ LLM Client (Phase 7D) — LLM-as-a-judge scoring │
│ └─ Embedding (Phase 8D) — cosine similarity scoring │
└─────────────────────────────────────────────────────────────────┘
Three-Tier Implementation
Tier 1 — Highest ROI (Sub-phases A + B):
1.1 Deployment Environments with Quality Gates
1.2 Composable Prompt Snippets
1.3 A/B Testing via Temporal SideEffect
1.4 Structured Evaluation Framework
Tier 2 — Valuable (Sub-phases C + D):
2.1 Visual Version Diff
2.2 CI/CD Eval Integration
2.3 Production Log → Dataset Pipeline
Tier 3 — Polish (Sub-phase E):
3.1 NATS Cache Invalidation
3.2 Provider-Agnostic Prompt Blueprints
3.3 Prompt Analytics Dashboard
Competitive Comparison
| Capability | Braintrust | PromptLayer | Cruvero (Phase 18) | Cruvero (After Phase 26) |
|---|---|---|---|---|
| Versioning | Content-addressed | Sequential (v1, v2) | Content-hashed + immutable | Same (unchanged) |
| Semantic search | ✗ | ✗ | ✓ Embedding-based 3-stage | Same (unique advantage) |
| Environments | dev/staging/prod | Release labels | ✗ | ✓ Named environments + quality gates |
| Prompt composition | Functions (code) | Snippets (@@@) | ✗ Monolithic only | ✓ Snippets with version pinning |
| A/B testing | Playground + manual | Dynamic Release Labels | ✗ | ✓ Temporal SideEffect (replay-safe) |
| Eval framework | Eval() + autoevals | Visual pipeline builder | success_rate + avg_rating | ✓ Datasets + scorers + EvalRunWorkflow |
| CI/CD evals | GitHub Action | Auto-trigger on version | ✗ | ✓ CLI + exit code |
| Multi-tenancy | Project-level | Workspace | ✓ First-class tenant isolation | Same (unique advantage) |
| Observability | Brainstore + BTQL | Middleware logging | Basic prompt_metrics | ✓ Time-series analytics |
| Cache invalidation | — | Webhook-driven | ✗ | ✓ NATS pub/sub |
| Runtime SDK | Go (beta), TS, Python | Python, JS | Go-native | Same + environment-aware resolution |
| Diff/compare | Side-by-side UI | Diff views | ✗ | ✓ Structured line-level diff |
Core Types and Interfaces
// --- Environments (26A) ---
type Environment struct {
TenantID string `json:"tenant_id"`
Name string `json:"name"` // "dev", "staging", "production"
PromptID string `json:"prompt_id"`
Version int `json:"version"`
PromptHash string `json:"prompt_hash"`
PromotedBy string `json:"promoted_by"`
PromotedAt time.Time `json:"promoted_at"`
}
type QualityGate struct {
MinUsageCount int `json:"min_usage_count"`
MinSuccessRate float64 `json:"min_success_rate"` // 0.0-1.0
MinAvgRating float64 `json:"min_avg_rating"` // 0.0-1.0
RequireEvalPass bool `json:"require_eval_pass"`
EvalDatasetID string `json:"eval_dataset_id,omitempty"`
EvalThreshold float64 `json:"eval_threshold,omitempty"`
}
type EnvironmentStore interface {
Promote(ctx context.Context, env Environment, gate *QualityGate) error
GetActive(ctx context.Context, tenantID, promptID, envName string) (Environment, error)
ListEnvironments(ctx context.Context, tenantID, promptID string) ([]Environment, error)
GetPromotionHistory(ctx context.Context, tenantID, promptID, envName string, limit int) ([]Environment, error)
}
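A minimal sketch of how Promote might enforce a QualityGate before the environment upsert. PromptMetrics and checkGate are hypothetical names for this illustration; the real quality_gate.go would read the same aggregates from MetricsStore and additionally consult the latest EvalRun when RequireEvalPass is set.

```go
package main

import "fmt"

// PromptMetrics is a hypothetical stand-in for the aggregates MetricsStore
// exposes per prompt version.
type PromptMetrics struct {
	UsageCount  int
	SuccessRate float64 // 0.0-1.0
	AvgRating   float64 // 0.0-1.0
}

// QualityGate mirrors the type above (eval fields omitted in this sketch).
type QualityGate struct {
	MinUsageCount  int
	MinSuccessRate float64
	MinAvgRating   float64
}

// checkGate returns nil when the metrics satisfy the gate, or an error
// naming the first failed criterion so the caller can surface it.
func checkGate(gate QualityGate, m PromptMetrics) error {
	if m.UsageCount < gate.MinUsageCount {
		return fmt.Errorf("quality gate: usage %d < required %d", m.UsageCount, gate.MinUsageCount)
	}
	if m.SuccessRate < gate.MinSuccessRate {
		return fmt.Errorf("quality gate: success rate %.2f < required %.2f", m.SuccessRate, gate.MinSuccessRate)
	}
	if m.AvgRating < gate.MinAvgRating {
		return fmt.Errorf("quality gate: avg rating %.2f < required %.2f", m.AvgRating, gate.MinAvgRating)
	}
	return nil
}

func main() {
	gate := QualityGate{MinUsageCount: 50, MinSuccessRate: 0.9, MinAvgRating: 0.7}
	err := checkGate(gate, PromptMetrics{UsageCount: 120, SuccessRate: 0.95, AvgRating: 0.8})
	fmt.Println("gate passed:", err == nil)
}
```

Failing fast on the first unmet criterion keeps the promotion error actionable; nil gate means unconditional promotion (Phase 18 behavior).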
// --- Snippets (26A) ---
type SnippetRef struct {
PromptID string `json:"prompt_id"`
Version int `json:"version,omitempty"` // 0 = latest
Label string `json:"label,omitempty"` // "prod", "staging" — resolved via EnvironmentStore
}
type SnippetResolver interface {
Resolve(ctx context.Context, tenantID string, refs []SnippetRef) (map[string]string, error)
}
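The core of snippet resolution is cycle-safe, depth-bounded expansion. The sketch below uses a hypothetical in-memory snippetSource and a literal `{{snippet "id"}}` marker in place of the real Store-backed resolver wired into the renderer's FuncMap; the visited-set and depth counter match the mechanisms named in the risk table.

```go
package main

import (
	"fmt"
	"strings"
)

// snippetSource is a hypothetical stand-in for the Store: snippet ID → body.
type snippetSource map[string]string

const maxSnippetDepth = 3 // mirrors CRUVERO_PROMPTLIB_SNIPPET_MAX_DEPTH

// resolve expands snippet references depth-first. A visited set rejects
// cycles (A → B → A) and a depth counter bounds nesting, so resolution
// returns an error instead of looping forever.
func resolve(src snippetSource, id string, depth int, visited map[string]bool) (string, error) {
	if depth > maxSnippetDepth {
		return "", fmt.Errorf("snippet %q: max depth %d exceeded", id, maxSnippetDepth)
	}
	if visited[id] {
		return "", fmt.Errorf("snippet cycle detected at %q", id)
	}
	visited[id] = true
	defer delete(visited, id) // the same snippet may appear on sibling branches

	body, ok := src[id]
	if !ok {
		return "", fmt.Errorf("snippet %q not found", id)
	}
	out := body
	for sid := range src {
		ref := `{{snippet "` + sid + `"}}`
		if strings.Contains(out, ref) {
			expanded, err := resolve(src, sid, depth+1, visited)
			if err != nil {
				return "", err
			}
			out = strings.ReplaceAll(out, ref, expanded)
		}
	}
	return out, nil
}

func main() {
	src := snippetSource{
		"safety":   "Always refuse harmful requests.",
		"preamble": `You are a helpful agent. {{snippet "safety"}}`,
	}
	text, err := resolve(src, "preamble", 0, map[string]bool{})
	fmt.Println(text, err)
}
```

Deleting the ID from the visited set on return distinguishes true cycles from the legitimate case where two sibling snippets both include the same shared fragment.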
// --- Experiments (26B) ---
type Experiment struct {
ID string `json:"id"`
TenantID string `json:"tenant_id"`
PromptID string `json:"prompt_id"`
Name string `json:"name"`
Status ExperimentStatus `json:"status"` // "active", "paused", "completed"
Variants []ExperimentVariant `json:"variants"`
CreatedAt time.Time `json:"created_at"`
CompletedAt *time.Time `json:"completed_at,omitempty"`
Config ExperimentConfig `json:"config"`
}
type ExperimentVariant struct {
Name string `json:"name"`
PromptHash string `json:"prompt_hash"`
TrafficPct int `json:"traffic_pct"` // 0-100, all variants must sum to 100
SampleCount int `json:"sample_count"`
SuccessCount int `json:"success_count"`
AvgScore float64 `json:"avg_score"`
}
type ExperimentConfig struct {
MinSampleSize int `json:"min_sample_size"` // per variant
MaxDuration string `json:"max_duration"` // e.g., "168h" (7 days)
AutoComplete bool `json:"auto_complete"` // auto-promote winner
}
type ExperimentStatus string
const (
ExperimentActive ExperimentStatus = "active"
ExperimentPaused ExperimentStatus = "paused"
ExperimentCompleted ExperimentStatus = "completed"
)
type ExperimentStore interface {
Create(ctx context.Context, exp Experiment) error
Get(ctx context.Context, tenantID, expID string) (Experiment, error)
GetActiveForPrompt(ctx context.Context, tenantID, promptID string) (*Experiment, error)
RecordOutcome(ctx context.Context, expID, variantName string, success bool, score float64) error
Complete(ctx context.Context, expID string, winnerVariant string) error
List(ctx context.Context, tenantID string, status *ExperimentStatus) ([]Experiment, error)
}
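Variant selection has to be replay-safe, so in the workflow the random draw runs inside Temporal's workflow.SideEffect (the draw is recorded in history and reused on replay). The mapping from a recorded roll to a variant is then a pure function; pickVariant below is a hypothetical sketch of that mapping.

```go
package main

import "fmt"

// ExperimentVariant mirrors the type above (selection-relevant fields only).
type ExperimentVariant struct {
	Name       string
	TrafficPct int // all variants must sum to 100
}

// pickVariant maps a roll in [0,100) onto the cumulative traffic split.
// In the workflow the roll would come from workflow.SideEffect, roughly:
//
//	var roll int
//	workflow.SideEffect(ctx, func(workflow.Context) interface{} {
//	    return rand.Intn(100)
//	}).Get(&roll)
//
// so replay reuses the recorded value and the choice stays deterministic.
func pickVariant(variants []ExperimentVariant, roll int) string {
	cum := 0
	for _, v := range variants {
		cum += v.TrafficPct
		if roll < cum {
			return v.Name
		}
	}
	// Only reachable if percentages do not sum to 100; default to last.
	return variants[len(variants)-1].Name
}

func main() {
	variants := []ExperimentVariant{
		{Name: "control", TrafficPct: 80},
		{Name: "candidate", TrafficPct: 20},
	}
	fmt.Println(pickVariant(variants, 10))
	fmt.Println(pickVariant(variants, 85))
}
```

Keeping the mapping pure means only the single random draw needs to be recorded in workflow history, not the selection logic itself.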
// --- Eval Framework (26C) ---
type EvalDataset struct {
ID string `json:"id"`
TenantID string `json:"tenant_id"`
Name string `json:"name"`
PromptID string `json:"prompt_id,omitempty"` // optional binding
Version int `json:"version"`
Entries []EvalEntry `json:"entries"`
CreatedAt time.Time `json:"created_at"`
Metadata map[string]any `json:"metadata,omitempty"`
}
type EvalEntry struct {
ID string `json:"id"`
Input map[string]any `json:"input"` // template params
ExpectedOutput string `json:"expected_output"` // golden output
Metadata map[string]any `json:"metadata,omitempty"`
}
type EvalRun struct {
ID string `json:"id"`
TenantID string `json:"tenant_id"`
PromptHash string `json:"prompt_hash"`
DatasetID string `json:"dataset_id"`
DatasetVer int `json:"dataset_version"`
Scorers []string `json:"scorers"` // scorer names to run
Status string `json:"status"` // "running", "completed", "failed"
Results []EvalResult `json:"results"`
Summary EvalSummary `json:"summary"`
StartedAt time.Time `json:"started_at"`
CompletedAt *time.Time `json:"completed_at,omitempty"`
}
type EvalResult struct {
EntryID string `json:"entry_id"`
Output string `json:"output"`
Scores map[string]float64 `json:"scores"` // scorer_name → 0.0-1.0
Latency time.Duration `json:"latency"`
Tokens int `json:"tokens"`
Error string `json:"error,omitempty"`
}
type EvalSummary struct {
TotalEntries int `json:"total_entries"`
PassCount int `json:"pass_count"`
FailCount int `json:"fail_count"`
ErrorCount int `json:"error_count"`
AvgScores map[string]float64 `json:"avg_scores"` // per scorer
AvgLatency time.Duration `json:"avg_latency"`
TotalTokens int `json:"total_tokens"`
TotalCost float64 `json:"total_cost"`
PassRate float64 `json:"pass_rate"` // 0.0-1.0
}
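A sketch of the report.go aggregation: folding per-entry EvalResults into an EvalSummary. The pass rule here — every scorer must reach a threshold — is an assumption for illustration; the real criterion may be per-scorer or configurable. Only the fields needed for the fold are reproduced.

```go
package main

import "fmt"

type EvalResult struct {
	Scores map[string]float64 // scorer_name → 0.0-1.0
	Error  string
}

type EvalSummary struct {
	TotalEntries int
	PassCount    int
	FailCount    int
	ErrorCount   int
	AvgScores    map[string]float64
	PassRate     float64
}

// summarize folds results into a summary. Errored entries count toward
// TotalEntries and ErrorCount but are excluded from scoring averages.
func summarize(results []EvalResult, threshold float64) EvalSummary {
	s := EvalSummary{TotalEntries: len(results), AvgScores: map[string]float64{}}
	counts := map[string]int{}
	for _, r := range results {
		if r.Error != "" {
			s.ErrorCount++
			continue
		}
		pass := true
		for name, score := range r.Scores {
			s.AvgScores[name] += score
			counts[name]++
			if score < threshold {
				pass = false
			}
		}
		if pass {
			s.PassCount++
		} else {
			s.FailCount++
		}
	}
	for name, total := range s.AvgScores {
		s.AvgScores[name] = total / float64(counts[name])
	}
	if s.TotalEntries > 0 {
		s.PassRate = float64(s.PassCount) / float64(s.TotalEntries)
	}
	return s
}

func main() {
	sum := summarize([]EvalResult{
		{Scores: map[string]float64{"exact_match": 1.0}},
		{Scores: map[string]float64{"exact_match": 0.0}},
	}, 0.5)
	fmt.Println(sum.PassCount, sum.FailCount, sum.PassRate)
}
```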
// Scorer produces a 0.0-1.0 score for an eval entry.
type Scorer interface {
Name() string
Score(ctx context.Context, input ScorerInput) (float64, error)
}
type ScorerInput struct {
Prompt Prompt `json:"prompt"`
TemplateParams map[string]any `json:"template_params"`
Output string `json:"output"`
ExpectedOutput string `json:"expected_output"`
Metadata map[string]any `json:"metadata,omitempty"`
}
// --- Diff (26D) ---
type PromptDiff struct {
PromptID string `json:"prompt_id"`
FromVer int `json:"from_version"`
ToVer int `json:"to_version"`
Hunks []DiffHunk `json:"hunks"`
Summary string `json:"summary"` // "3 additions, 2 deletions, 1 modification"
}
type DiffHunk struct {
Type string `json:"type"` // "add", "delete", "modify", "equal"
FromLine int `json:"from_line"`
ToLine int `json:"to_line"`
Content string `json:"content"`
}
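A minimal version of the diff engine: a classic LCS-based line diff where lines on the longest common subsequence become "equal" hunks and everything else is a delete (old side) or add (new side). This is a sketch — the real ComputeDiff would also coalesce adjacent delete/add pairs into "modify" hunks, track FromLine/ToLine, and honor CRUVERO_PROMPTLIB_DIFF_CONTEXT_LINES.

```go
package main

import (
	"fmt"
	"strings"
)

type DiffHunk struct {
	Type    string // "add", "delete", "equal"
	Content string
}

func computeDiff(from, to string) []DiffHunk {
	a := strings.Split(from, "\n")
	b := strings.Split(to, "\n")
	// lcs[i][j] = length of the LCS of a[i:] and b[j:]
	lcs := make([][]int, len(a)+1)
	for i := range lcs {
		lcs[i] = make([]int, len(b)+1)
	}
	for i := len(a) - 1; i >= 0; i-- {
		for j := len(b) - 1; j >= 0; j-- {
			if a[i] == b[j] {
				lcs[i][j] = lcs[i+1][j+1] + 1
			} else if lcs[i+1][j] >= lcs[i][j+1] {
				lcs[i][j] = lcs[i+1][j]
			} else {
				lcs[i][j] = lcs[i][j+1]
			}
		}
	}
	// Walk the table, emitting hunks.
	var hunks []DiffHunk
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i] == b[j]:
			hunks = append(hunks, DiffHunk{Type: "equal", Content: a[i]})
			i++
			j++
		case lcs[i+1][j] >= lcs[i][j+1]:
			hunks = append(hunks, DiffHunk{Type: "delete", Content: a[i]})
			i++
		default:
			hunks = append(hunks, DiffHunk{Type: "add", Content: b[j]})
			j++
		}
	}
	for ; i < len(a); i++ {
		hunks = append(hunks, DiffHunk{Type: "delete", Content: a[i]})
	}
	for ; j < len(b); j++ {
		hunks = append(hunks, DiffHunk{Type: "add", Content: b[j]})
	}
	return hunks
}

func main() {
	for _, h := range computeDiff("a\nb\nc", "a\nx\nc") {
		fmt.Println(h.Type, h.Content)
	}
}
```

The O(n·m) table is comfortably within the < 10ms target for prompts up to 10KB, which is why larger prompts fall back to summary-only mode.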
// --- Provider Blueprint (26E) ---
type PromptBlueprint struct {
Messages []BlueprintMessage `json:"messages"`
Model string `json:"model,omitempty"`
Temperature *float64 `json:"temperature,omitempty"`
MaxTokens *int `json:"max_tokens,omitempty"`
StopSequence []string `json:"stop_sequence,omitempty"`
Tools []json.RawMessage `json:"tools,omitempty"`
}
type BlueprintMessage struct {
Role string `json:"role"` // "system", "user", "assistant"
Content string `json:"content"`
}
type BlueprintAdapter interface {
ToProviderFormat(bp PromptBlueprint, provider string) (json.RawMessage, error)
FromProviderFormat(raw json.RawMessage, provider string) (PromptBlueprint, error)
}
Sub-Phases
| Sub-Phase | Name | Prompts | Depends On |
|---|---|---|---|
| 26A | Environments + Snippets | 5 | — |
| 26B | A/B Experiments | 4 | 26A |
| 26C | Evaluation Framework | 5 | 26A |
| 26D | Diff + CI/CD + Log Pipeline | 4 | 26C |
| 26E | NATS Cache + Blueprints + Analytics | 3 | 26A, Phase 12 |
| 26F | CI/CD + Deployment | 3 | 26A–26E |
Total: 6 sub-phases, 24 prompts, 13 documentation files
Dependency Graph
26A (Environments + Snippets)
├─► 26B (A/B Experiments) ────────────┐
├─► 26C (Evaluation) ──► 26D (Diff) ├──► 26F (CI/CD + Deploy)
└─► 26E (NATS + Blueprints) ─────────┘
26B, 26C, and 26E are parallelizable after 26A. 26D depends on 26C (eval framework needed for CI/CD gates). 26F depends on all implementation sub-phases (26A–26E) and handles Docker images, Helm charts, and ArgoCD configuration.
Environment Variables
| Variable | Default | Description |
|---|---|---|
CRUVERO_PROMPTLIB_ENVS_ENABLED | true | Enable deployment environments |
CRUVERO_PROMPTLIB_DEFAULT_ENVS | dev,staging,production | Comma-separated environment names created per tenant |
CRUVERO_PROMPTLIB_SNIPPETS_ENABLED | true | Enable snippet composition |
CRUVERO_PROMPTLIB_SNIPPET_MAX_DEPTH | 3 | Max nested snippet resolution depth (prevents cycles) |
CRUVERO_PROMPTLIB_EXPERIMENTS_ENABLED | false | Enable A/B experimentation |
CRUVERO_PROMPTLIB_EXPERIMENT_MAX_VARIANTS | 4 | Max variants per experiment |
CRUVERO_PROMPTLIB_EVAL_ENABLED | true | Enable evaluation framework |
CRUVERO_PROMPTLIB_EVAL_TIMEOUT | 300s | Per-entry eval timeout |
CRUVERO_PROMPTLIB_EVAL_MAX_CONCURRENT | 10 | Max concurrent eval entries |
CRUVERO_PROMPTLIB_DIFF_CONTEXT_LINES | 3 | Lines of context in diff output |
CRUVERO_PROMPTLIB_NATS_CACHE_ENABLED | false | Enable NATS cache invalidation |
CRUVERO_PROMPTLIB_NATS_SUBJECT | cruvero.prompts.events | NATS subject for prompt events |
CRUVERO_PROMPTLIB_BLUEPRINT_ENABLED | false | Enable provider-agnostic blueprints |
CRUVERO_PROMPTLIB_ANALYTICS_RETENTION | 90d | Analytics data retention period |
Files Overview
New Files
| File | Sub-Phase | Description |
|---|---|---|
internal/promptlib/environments.go | 26A | EnvironmentStore interface + PostgresEnvironmentStore |
internal/promptlib/quality_gate.go | 26A | QualityGate evaluation logic, MetricsStore + EvalRun checks |
internal/promptlib/snippets.go | 26A | SnippetResolver with cycle detection + version pinning |
internal/promptlib/experiment.go | 26B | Experiment types, ExperimentStore interface |
internal/promptlib/experiment_store.go | 26B | PostgresExperimentStore implementation |
internal/promptlib/experiment_resolver.go | 26B | ResolveVariantActivity — Temporal SideEffect-based variant selection |
internal/promptlib/eval/types.go | 26C | EvalDataset, EvalEntry, EvalRun, EvalResult, EvalSummary, Scorer |
internal/promptlib/eval/dataset_store.go | 26C | PostgresDatasetStore (CRUD, versioning) |
internal/promptlib/eval/runner.go | 26C | EvalRunWorkflow — Temporal workflow orchestrating eval |
internal/promptlib/eval/scorers.go | 26C | Built-in scorers: exact_match, contains, regex, cosine_similarity, llm_judge |
internal/promptlib/eval/report.go | 26C | EvalSummary aggregation + comparison helpers |
internal/promptlib/diff.go | 26D | ComputeDiff — line-level diff between prompt versions |
internal/promptlib/log_pipeline.go | 26D | DatasetFromLogsActivity — audit log → eval dataset |
internal/promptlib/nats_cache.go | 26E | NATS publisher for prompt events + subscriber cache buster |
internal/promptlib/blueprint.go | 26E | PromptBlueprint + adapters for OpenAI, Anthropic, Azure |
internal/promptlib/analytics.go | 26E | Time-series queries over prompt_metrics |
cmd/prompt-eval/main.go | 26D | CLI to run eval against dataset + exit code on failure |
cmd/prompt-diff/main.go | 26D | CLI to diff prompt versions |
internal/tools/prompt_promote.go | 26A | PromptPromoteTool — agent tool for environment promotion |
migrations/0034_prompt_environments.up.sql | 26A | prompt_environments table |
migrations/0034_prompt_environments.down.sql | 26A | Drop table |
migrations/0035_prompt_snippets.up.sql | 26A | prompt_snippet_refs table (tracks snippet dependencies) |
migrations/0035_prompt_snippets.down.sql | 26A | Drop table |
migrations/0036_prompt_experiments.up.sql | 26B | prompt_experiments + experiment_outcomes tables |
migrations/0036_prompt_experiments.down.sql | 26B | Drop tables |
migrations/0037_eval_datasets.up.sql | 26C | eval_datasets + eval_entries + eval_runs + eval_results tables |
migrations/0037_eval_datasets.down.sql | 26C | Drop tables |
docker/Dockerfile.prompt-tools | 26F | Multi-binary Go image bundling prompt-eval, prompt-dataset, prompt-experiment, prompt-diff |
charts/cruvero/templates/prompt-tools/job.yaml | 26F | Helm Job template for batch eval/dataset operations |
Modified Files
| File | Sub-Phase | Change |
|---|---|---|
internal/promptlib/renderer.go | 26A | Add snippet resolution via FuncMap before template execution |
internal/promptlib/searcher.go | 26A | Add environment filter to search pipeline Stage 3 |
internal/promptlib/config.go | 26A | Add all new env vars + component wiring |
internal/tools/manager.go | 26A | Register prompt_promote tool |
internal/agent/activities.go | 26B | Wire experiment variant resolution before prompt selection |
cmd/ui/prompts_handler.go | 26D | Add /api/prompts/{id}/diff, /api/prompts/{id}/environments endpoints |
cmd/ui/frontend/src/pages/PromptLibraryPage.tsx | 26D | Add diff viewer, environment badges, eval results |
.github/workflows/build-images.yaml | 26F | Add prompt-tools to CI build matrix |
charts/cruvero/values.yaml | 26F | Add promptTools section + 14 new env vars |
charts/cruvero/values-dev.yaml | 26F | Enable prompt-tools, set image repo |
charts/cruvero/values-staging.yaml | 26F | Explicit promptTools.enabled: false |
charts/cruvero/values-prod.yaml | 26F | Explicit promptTools.enabled: false |
deploy/argocd/applicationset.yaml | 26F | Add prompt-tools Image Updater annotations |
Referenced (read, not modified)
| File | Purpose |
|---|---|
internal/promptlib/store.go | Store interface for Get/GetLatest (snippet resolution) |
internal/promptlib/metrics_store.go | MetricsStore for quality gate evaluation |
internal/memory/salience.go | Recency/usage scoring patterns |
internal/audit/logger.go | Audit event queries for log→dataset pipeline |
internal/events/nats.go | NATS client patterns (Phase 12) |
internal/llm/client.go | LLM client for llm_judge scorer |
internal/embedding/embedder.go | Embedder for cosine_similarity scorer |
internal/config/config.go | Config struct patterns |
Migrations
0034_prompt_environments.up.sql
CREATE TABLE IF NOT EXISTS prompt_environments (
tenant_id TEXT NOT NULL,
prompt_id TEXT NOT NULL,
env_name TEXT NOT NULL,
version INTEGER NOT NULL,
prompt_hash TEXT NOT NULL,
promoted_by TEXT NOT NULL DEFAULT '',
promoted_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
PRIMARY KEY (tenant_id, prompt_id, env_name)
);
CREATE INDEX idx_prompt_env_hash ON prompt_environments (prompt_hash);
-- Promotion history (append-only audit trail)
CREATE TABLE IF NOT EXISTS prompt_promotion_history (
id BIGSERIAL PRIMARY KEY,
tenant_id TEXT NOT NULL,
prompt_id TEXT NOT NULL,
env_name TEXT NOT NULL,
version INTEGER NOT NULL,
prompt_hash TEXT NOT NULL,
promoted_by TEXT NOT NULL DEFAULT '',
promoted_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE INDEX idx_promo_history_lookup ON prompt_promotion_history (tenant_id, prompt_id, env_name, promoted_at DESC);
0035_prompt_snippets.up.sql
-- Tracks which prompts reference which snippets (for dependency analysis)
CREATE TABLE IF NOT EXISTS prompt_snippet_refs (
tenant_id TEXT NOT NULL,
parent_hash TEXT NOT NULL, -- prompt that contains the snippet ref
snippet_id TEXT NOT NULL, -- referenced snippet prompt_id
snippet_version INTEGER, -- NULL = latest, >0 = pinned
snippet_label TEXT, -- NULL = direct version, non-null = environment label
PRIMARY KEY (tenant_id, parent_hash, snippet_id)
);
CREATE INDEX idx_snippet_refs_snippet ON prompt_snippet_refs (tenant_id, snippet_id);
0036_prompt_experiments.up.sql
CREATE TABLE IF NOT EXISTS prompt_experiments (
id TEXT NOT NULL,
tenant_id TEXT NOT NULL,
prompt_id TEXT NOT NULL,
name TEXT NOT NULL,
status TEXT NOT NULL DEFAULT 'active',
variants JSONB NOT NULL,
config JSONB NOT NULL DEFAULT '{}',
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
completed_at TIMESTAMPTZ,
winner TEXT,
PRIMARY KEY (tenant_id, id)
);
CREATE INDEX idx_experiments_prompt ON prompt_experiments (tenant_id, prompt_id, status);
CREATE TABLE IF NOT EXISTS experiment_outcomes (
id BIGSERIAL PRIMARY KEY,
experiment_id TEXT NOT NULL,
tenant_id TEXT NOT NULL,
variant_name TEXT NOT NULL,
run_id TEXT NOT NULL,
success BOOLEAN NOT NULL,
score DOUBLE PRECISION,
recorded_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE INDEX idx_exp_outcomes_lookup ON experiment_outcomes (tenant_id, experiment_id, variant_name);
0037_eval_datasets.up.sql
CREATE TABLE IF NOT EXISTS eval_datasets (
id TEXT NOT NULL,
tenant_id TEXT NOT NULL,
name TEXT NOT NULL,
prompt_id TEXT,
version INTEGER NOT NULL DEFAULT 1,
metadata JSONB DEFAULT '{}',
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
PRIMARY KEY (tenant_id, id, version)
);
CREATE TABLE IF NOT EXISTS eval_entries (
id TEXT NOT NULL,
dataset_id TEXT NOT NULL,
dataset_version INTEGER NOT NULL,
tenant_id TEXT NOT NULL,
input JSONB NOT NULL,
expected_output TEXT NOT NULL,
metadata JSONB DEFAULT '{}',
PRIMARY KEY (tenant_id, dataset_id, dataset_version, id)
);
CREATE TABLE IF NOT EXISTS eval_runs (
id TEXT NOT NULL,
tenant_id TEXT NOT NULL,
prompt_hash TEXT NOT NULL,
dataset_id TEXT NOT NULL,
dataset_ver INTEGER NOT NULL,
scorers TEXT[] NOT NULL,
status TEXT NOT NULL DEFAULT 'running',
summary JSONB DEFAULT '{}',
started_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
completed_at TIMESTAMPTZ,
PRIMARY KEY (tenant_id, id)
);
CREATE INDEX idx_eval_runs_prompt ON eval_runs (tenant_id, prompt_hash);
CREATE TABLE IF NOT EXISTS eval_results (
eval_run_id TEXT NOT NULL,
entry_id TEXT NOT NULL,
tenant_id TEXT NOT NULL,
output TEXT NOT NULL,
scores JSONB NOT NULL,
latency_ms INTEGER NOT NULL,
tokens INTEGER NOT NULL DEFAULT 0,
error TEXT,
PRIMARY KEY (tenant_id, eval_run_id, entry_id)
);
Success Metrics
| Metric | Target |
|---|---|
| Environment promotion latency | < 500ms (quality gate check + upsert) |
| Quality gate evaluation | Checks MetricsStore + optional eval run in < 2s |
| Snippet resolution depth | Max 3 levels, < 5ms p99 |
| Snippet cycle detection | 100% detection rate (no infinite loops) |
| A/B variant selection | Deterministic via Temporal SideEffect (replay-safe) |
| Experiment outcome recording | Non-blocking, < 5ms fire-and-forget |
| Eval run throughput | 10 concurrent entries by default, < 60s for 100-entry dataset |
| Built-in scorer accuracy | LLM-as-judge within 0.15 of human rating |
| Diff computation | < 10ms for prompts up to 10KB |
| CI/CD eval exit code | Non-zero on regression (score below threshold) |
| Log → dataset pipeline | < 30s for 1000 audit entries |
| NATS cache invalidation | < 100ms from promotion to subscriber notification |
| Test coverage | >= 80% for all new files |
| Backward compatibility | Phase 18 behavior unchanged when Phase 26 features disabled |
Code Quality Requirements (SonarQube)
All Go code produced by Phase 26 prompts must pass SonarQube quality gates:
- Error handling: Every returned error must be handled explicitly
- Cyclomatic complexity: Keep branching shallow and functions under 50 lines where practical
- No dead code: No unused variables, empty blocks, or duplicated logic
- Resource cleanup: Close all resources with proper defer patterns
- Early returns: Prefer guard clauses over deeply nested conditionals
- No magic values: Use named constants for strings and numbers
- Meaningful names: Descriptive variable and function names
- Linting gate: Run go vet, staticcheck, and golangci-lint run before considering the prompt complete
Each sub-phase Exit Criteria section includes:
[ ] go vet ./internal/promptlib/... reports no issues
[ ] staticcheck ./internal/promptlib/... reports no issues
[ ] No functions exceed 50 lines (extract helpers as needed)
[ ] All returned errors are handled (no _ = err patterns)
Risk Mitigation
| Risk | Mitigation |
|---|---|
| Environment promotion breaks agents | Feature-flagged. When disabled, searcher returns all prompts (Phase 18 behavior). Promotion is additive — unpromoted prompts still exist. |
| Snippet cycles (A references B references A) | Depth counter with hard limit (default 3). Cycle detection via visited-set in resolution. Returns error, not infinite loop. |
| A/B experiment affects workflow determinism | Variant selection uses Temporal SideEffect — deterministic on replay. Outcomes recorded fire-and-forget. |
| Eval framework cost (LLM-as-judge) | Eval entries processed with configurable concurrency limit. LLM-as-judge is optional scorer, not required. Cost tracked per eval run. |
| Diff on very large prompts | Diff operates on lines, bounded by prompt content size. Prompts > 10KB trigger summary-only mode. |
| NATS unavailability | Cache invalidation is best-effort. Agents fall back to TTL-based cache expiry. Promotion succeeds regardless of NATS state. |
| Migration conflicts with existing Phase 18 tables | All new tables use distinct names (no ALTER on existing prompts or prompt_metrics). |
Relationship to Other Phases
| Phase | Relationship |
|---|---|
| Phase 18 (Prompt Library) | 26 extends 18's types, store, and searcher. No breaking changes. |
| Phase 12 (NATS) | 26E uses NATS for cache invalidation events. |
| Phase 9C (Audit) | 26D reads audit log for production log → dataset pipeline. |
| Phase 6C (Speculative Execution) | 26B's A/B testing shares Temporal SideEffect patterns. |
| Phase 19 (Tool Registry Restructure) | Orthogonal — tool quality scoring and prompt quality scoring are independent. |
| Phase 22C (UI) | 26D extends PromptLibraryPage with diff viewer and environment badges. |
| Phase 24 (Context Management) | Orthogonal — context assembly consumes prompts; Phase 26 manages prompt lifecycle. |
Progress Notes
(none yet)