
# Prompt Engineering Tools Guide

Cruvero ships a prompt-engineering CLI suite for dataset management, evaluation, experimentation, and version diffing.

Source: `cmd/prompt-tools/*`, `cmd/prompt-eval/*`, `cmd/prompt-dataset/*`, `cmd/prompt-experiment/*`, `cmd/prompt-diff/*`, `internal/promptcli/*`, `internal/promptlib/*`

## Architecture

`prompt-tools` is a dispatcher CLI that routes to dedicated subcommands:

- `prompt-tools eval ...` -> `prompt-eval`
- `prompt-tools dataset ...` -> `prompt-dataset`
- `prompt-tools experiment ...` -> `prompt-experiment`
- `prompt-tools diff ...` -> `prompt-diff`

The dedicated binaries can also be executed directly.
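
Because the dispatcher routes its remaining arguments to the dedicated binary, the two invocation styles should behave identically. A minimal sketch, where the prompt hash and dataset id are placeholder values:

```bash
# Via the dispatcher: prompt-tools routes "eval ..." to prompt-eval.
go run ./cmd/prompt-tools eval --prompt-hash ph_abc123 --dataset support-regression

# Direct invocation of the dedicated binary.
go run ./cmd/prompt-eval --prompt-hash ph_abc123 --dataset support-regression
```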

## Command Matrix

| Command | Primary Use | Backing Package |
|---|---|---|
| `prompt-eval` | Evaluate prompt output quality against a dataset | `internal/promptcli/evalcli` |
| `prompt-dataset` | Create/list/get datasets and build from audit logs | `internal/promptcli/datasetcli` |
| `prompt-experiment` | Manage A/B prompt experiments and winners | `internal/promptcli/experimentcli` |
| `prompt-diff` | Compare prompt versions | `internal/promptcli/diffcli` |

## prompt-eval

Runs prompt evaluations and computes summary pass/fail metrics.

### Key flags

| Flag | Description |
|---|---|
| `--prompt-hash` | Prompt hash to evaluate (required) |
| `--dataset` / `--dataset-version` | Dataset id and version (id is required) |
| `--scorers` | Comma-separated scorers (default `exact_match`) |
| `--threshold` | Pass threshold (default `0.8`) |
| `--fail-on-regression` | Exit non-zero when a regression is detected |
| `--baseline-run` / `--regression-baseline` | Baseline run strategy (`auto` or a run id) |
| `--tenant` | Tenant id (default `default`) |
| `--format` | Output format (`text` by default; the example below uses `markdown`) |
| `--ci` / `--github-summary` | CI-friendly output modes |
| `--notify` / `--notify-subject` | Publish a completion event to NATS |

### Example

```bash
go run ./cmd/prompt-eval \
  --prompt-hash ph_abc123 \
  --dataset support-regression \
  --dataset-version 3 \
  --scorers exact_match,semantic_similarity \
  --threshold 0.85 \
  --fail-on-regression \
  --regression-baseline auto \
  --format markdown
```
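
For CI gates, `--fail-on-regression` combined with the CI output modes turns an evaluation into a pass/fail pipeline step. A minimal sketch using only the flags above; the prompt hash and dataset id remain placeholders:

```bash
# Gate a pipeline on eval regressions against an auto-selected baseline;
# the command exits non-zero when a regression is detected.
go run ./cmd/prompt-eval \
  --prompt-hash ph_abc123 \
  --dataset support-regression \
  --fail-on-regression \
  --regression-baseline auto \
  --ci \
  --github-summary
```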

## prompt-dataset

Manages evaluation datasets in Postgres and can generate datasets from audit logs.

### Key flags

| Flag | Description |
|---|---|
| `--create <file>` | Create a dataset from a JSON file |
| `--list` | List datasets for the tenant |
| `--get <id>` / `--version` | Get a dataset by id and version |
| `--add-entries <file>` + `--dataset` | Add entries to an existing dataset |
| `--from-logs` | Build a dataset from audit logs |
| `--prompt-hash` | Required with `--from-logs` |
| `--failures-only` | Keep only failed cases in `--from-logs` mode |
| `--since` / `--max-entries` | Log extraction window and entry cap |
| `--tenant` / `--name` | Tenant id and dataset name override |

### Example

```bash
go run ./cmd/prompt-dataset \
  --from-logs \
  --prompt-hash ph_abc123 \
  --since 168h \
  --failures-only \
  --max-entries 300 \
  --name support-regression-v2
```
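
The manual create/list/get flags cover the rest of the dataset lifecycle. A hedged sketch in which `dataset.json` and the dataset id are placeholders:

```bash
# Create a dataset from a local JSON definition (placeholder file name).
go run ./cmd/prompt-dataset --create dataset.json --tenant default

# List datasets for the tenant, then fetch one by id and version.
go run ./cmd/prompt-dataset --list --tenant default
go run ./cmd/prompt-dataset --get support-regression --version 3
```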

## prompt-experiment

Creates and tracks prompt experiments with variant winner selection.

### Key flags

| Flag | Description |
|---|---|
| `--create <file>` | Create an experiment from JSON |
| `--list` | List tenant experiments |
| `--get <id>` | Fetch an experiment |
| `--complete <id>` | Mark an experiment complete |
| `--winner <name>` | Winner variant name for completion |
| `--tenant` | Tenant id |

### Example

```bash
go run ./cmd/prompt-experiment --complete exp-173 --winner concise_v2 --tenant default
```
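
Creating and inspecting experiments follows the same pattern. A sketch in which `experiment.json` and the experiment id are placeholders:

```bash
# Create an experiment from a JSON definition (placeholder file name).
go run ./cmd/prompt-experiment --create experiment.json --tenant default

# Inspect the experiment before declaring a winner.
go run ./cmd/prompt-experiment --get exp-173 --tenant default
```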

## prompt-diff

Computes a diff between prompt versions with text or JSON output.

### Key flags

| Flag | Description |
|---|---|
| `--prompt` | Prompt id (required) |
| `--from` | Source version (required) |
| `--to` | Target version (default: latest) |
| `--json` | JSON diff output |
| `--tenant` | Tenant id |

### Example

```bash
go run ./cmd/prompt-diff --prompt incident_classifier --from 7 --to 9
```
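
For machine-readable output, `--json` emits the diff as JSON, and omitting `--to` compares against the latest version per the default above:

```bash
# Compare version 7 against the latest version, emitting JSON.
go run ./cmd/prompt-diff --prompt incident_classifier --from 7 --json
```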

## Configuration

### Runtime dependencies

| Variable | Purpose |
|---|---|
| `CRUVERO_POSTGRES_URL` | Prompt library, dataset, and experiment storage |
| `CRUVERO_LLM_PROVIDER` | Active provider for evaluation calls |
| `CRUVERO_OPENROUTER_API_KEY` / `CRUVERO_OPENROUTER_MODEL` | OpenRouter provider settings |
| `CRUVERO_OPENAI_API_KEY` / `CRUVERO_OPENAI_MODEL` | OpenAI provider settings |
| `CRUVERO_GOOGLE_API_KEY` / `CRUVERO_GOOGLE_MODEL` | Google provider settings |
| `CRUVERO_ANTHROPIC_API_KEY` / `CRUVERO_ANTHROPIC_MODEL` | Anthropic provider settings |
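
A minimal environment sketch for an OpenAI-backed evaluation run; every value below is a placeholder, and the exact provider identifier accepted by `CRUVERO_LLM_PROVIDER` is an assumption:

```bash
# Placeholder values; set the key/model pair that matches CRUVERO_LLM_PROVIDER.
export CRUVERO_POSTGRES_URL="postgres://user:pass@localhost:5432/cruvero"
export CRUVERO_LLM_PROVIDER="openai"        # assumed provider identifier
export CRUVERO_OPENAI_API_KEY="sk-your-key" # placeholder
export CRUVERO_OPENAI_MODEL="your-model-id" # placeholder
```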

### Prompt library controls

| Variable | Purpose |
|---|---|
| `CRUVERO_PROMPTLIB_EVAL_ENABLED` | Enable evaluation paths |
| `CRUVERO_PROMPTLIB_EVAL_TIMEOUT` | Evaluation timeout budget |
| `CRUVERO_PROMPTLIB_EVAL_MAX_CONCURRENT` | Eval parallelism cap |
| `CRUVERO_PROMPTLIB_DIFF_CONTEXT_LINES` | Context lines in computed prompt diffs |
| `CRUVERO_PROMPTLIB_EXPERIMENTS_ENABLED` | Experiment feature switch |
| `CRUVERO_PROMPTLIB_EXPERIMENT_MAX_VARIANTS` | Max variants per experiment |
| `CRUVERO_PROMPTLIB_SNIPPETS_ENABLED` | Snippet composition support |
| `CRUVERO_PROMPTLIB_SNIPPET_MAX_DEPTH` | Max snippet nesting depth |
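
As an illustration, a CI profile might cap evaluation cost and keep diffs compact. The value formats here (booleans, a Go-style duration for the timeout, plain integers) are assumptions, not documented contracts:

```bash
# Illustrative values only; formats assumed (bool / duration / integers).
export CRUVERO_PROMPTLIB_EVAL_ENABLED=true
export CRUVERO_PROMPTLIB_EVAL_TIMEOUT=2m
export CRUVERO_PROMPTLIB_EVAL_MAX_CONCURRENT=4
export CRUVERO_PROMPTLIB_DIFF_CONTEXT_LINES=3
```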

## Integration with Prompt Library v2

The CLI suite and Prompt Library v2 share the same storage and scoring primitives:

- `prompt-dataset` creates datasets consumed by `prompt-eval`.
- `prompt-eval` writes `eval_runs` and `eval_results` used by prompt governance workflows.
- `prompt-experiment` persists experiment state and winner metadata used by promotion flows.
- `prompt-diff` uses the same diff engine as the UI and version review paths.
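
Taken together, the four tools support a dataset-to-promotion loop. A hedged end-to-end sketch in which all ids, names, and versions are placeholders:

```bash
# 1. Build a regression dataset from recent failed cases in the audit logs.
go run ./cmd/prompt-dataset --from-logs --prompt-hash ph_abc123 \
  --since 168h --failures-only --name support-regression

# 2. Evaluate the prompt against it, failing on regression for governance gates.
go run ./cmd/prompt-eval --prompt-hash ph_abc123 --dataset support-regression \
  --fail-on-regression

# 3. Record the winning variant so promotion flows can pick it up.
go run ./cmd/prompt-experiment --complete exp-173 --winner concise_v2

# 4. Review exactly what changed between the prompt versions involved.
go run ./cmd/prompt-diff --prompt incident_classifier --from 7 --to 9
```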