
# Prompt Engineering Tools Guide

Cruvero ships a prompt-engineering CLI suite for dataset management, evaluation, experimentation, and version diffing.

Source: `cmd/prompt-tools/*`, `cmd/prompt-eval/*`, `cmd/prompt-dataset/*`, `cmd/prompt-experiment/*`, `cmd/prompt-diff/*`, `internal/promptcli/*`, `internal/promptlib/*`

## Architecture

`prompt-tools` is a dispatcher CLI that routes to dedicated subcommands:

- `prompt-tools eval ...` -> `prompt-eval`
- `prompt-tools dataset ...` -> `prompt-dataset`
- `prompt-tools experiment ...` -> `prompt-experiment`
- `prompt-tools diff ...` -> `prompt-diff`

The dedicated binaries can also be executed directly.
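
Because the dispatcher routes its remaining arguments to the dedicated binary, the two invocation styles should behave identically. A minimal sketch, where the prompt hash and dataset id are placeholder values:

```bash
# Via the dispatcher: prompt-tools routes "eval ..." to prompt-eval.
go run ./cmd/prompt-tools eval --prompt-hash ph_abc123 --dataset support-regression

# Direct invocation of the dedicated binary.
go run ./cmd/prompt-eval --prompt-hash ph_abc123 --dataset support-regression
```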

## Command Matrix

| Command | Primary Use | Backing Package |
|---|---|---|
| `prompt-eval` | Evaluate prompt output quality against a dataset | `internal/promptcli/evalcli` |
| `prompt-dataset` | Create/list/get datasets and build from audit logs | `internal/promptcli/datasetcli` |
| `prompt-experiment` | Manage A/B prompt experiments and winners | `internal/promptcli/experimentcli` |
| `prompt-diff` | Compare prompt versions | `internal/promptcli/diffcli` |

## prompt-eval

Runs prompt evaluations and computes summary pass/fail metrics.

### Key flags

| Flag | Description |
|---|---|
| `--prompt-hash` | Prompt hash to evaluate (required) |
| `--dataset` / `--dataset-version` | Dataset id and version (id is required) |
| `--scorers` | Comma-separated scorers (default `exact_match`) |
| `--threshold` | Pass threshold (default `0.8`) |
| `--fail-on-regression` | Exit non-zero when a regression is detected |
| `--baseline-run` / `--regression-baseline` | Baseline run strategy (`auto` or a run id) |
| `--tenant` | Tenant id (default `default`) |
| `--format` | Output format (`text` by default; the example below uses `markdown`) |
| `--ci` / `--github-summary` | CI-friendly output modes |
| `--notify` / `--notify-subject` | Publish a completion event to NATS |

### Example

```bash
go run ./cmd/prompt-eval \
  --prompt-hash ph_abc123 \
  --dataset support-regression \
  --dataset-version 3 \
  --scorers exact_match,semantic_similarity \
  --threshold 0.85 \
  --fail-on-regression \
  --regression-baseline auto \
  --format markdown
```
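
For CI gates, `--fail-on-regression` combined with the CI output modes turns an evaluation into a pass/fail pipeline step. A minimal sketch using only the flags above; the prompt hash and dataset id remain placeholders:

```bash
# Gate a pipeline on eval regressions against an auto-selected baseline;
# the command exits non-zero when a regression is detected.
go run ./cmd/prompt-eval \
  --prompt-hash ph_abc123 \
  --dataset support-regression \
  --fail-on-regression \
  --regression-baseline auto \
  --ci \
  --github-summary
```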

## prompt-dataset

Manages evaluation datasets in Postgres and can generate datasets from audit logs.

### Key flags

| Flag | Description |
|---|---|
| `--create <file>` | Create a dataset from a JSON file |
| `--list` | List datasets for the tenant |
| `--get <id>` / `--version` | Get a dataset by id and version |
| `--add-entries <file>` + `--dataset` | Add entries to an existing dataset |
| `--from-logs` | Build a dataset from audit logs |
| `--prompt-hash` | Required with `--from-logs` |
| `--failures-only` | Keep only failed cases in `--from-logs` mode |
| `--since` / `--max-entries` | Log extraction window and entry cap |
| `--tenant` / `--name` | Tenant id and dataset name override |

### Example

```bash
go run ./cmd/prompt-dataset \
  --from-logs \
  --prompt-hash ph_abc123 \
  --since 168h \
  --failures-only \
  --max-entries 300 \
  --name support-regression-v2
```
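
The manual create/list/get flags cover the rest of the dataset lifecycle. A hedged sketch in which `dataset.json` and the dataset id are placeholders:

```bash
# Create a dataset from a local JSON definition (placeholder file name).
go run ./cmd/prompt-dataset --create dataset.json --tenant default

# List datasets for the tenant, then fetch one by id and version.
go run ./cmd/prompt-dataset --list --tenant default
go run ./cmd/prompt-dataset --get support-regression --version 3
```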

## prompt-experiment

Creates and tracks prompt experiments with variant winner selection.

### Key flags

| Flag | Description |
|---|---|
| `--create <file>` | Create an experiment from JSON |
| `--list` | List tenant experiments |
| `--get <id>` | Fetch an experiment |
| `--complete <id>` | Mark an experiment complete |
| `--winner <name>` | Winner variant name for completion |
| `--tenant` | Tenant id |

### Example

```bash
go run ./cmd/prompt-experiment --complete exp-173 --winner concise_v2 --tenant default
```
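
Creating and inspecting experiments follows the same pattern. A sketch in which `experiment.json` and the experiment id are placeholders:

```bash
# Create an experiment from a JSON definition (placeholder file name).
go run ./cmd/prompt-experiment --create experiment.json --tenant default

# Inspect the experiment before declaring a winner.
go run ./cmd/prompt-experiment --get exp-173 --tenant default
```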

## prompt-diff

Computes a diff between prompt versions with text or JSON output.

### Key flags

| Flag | Description |
|---|---|
| `--prompt` | Prompt id (required) |
| `--from` | Source version (required) |
| `--to` | Target version (default: latest) |
| `--json` | JSON diff output |
| `--tenant` | Tenant id |

### Example

```bash
go run ./cmd/prompt-diff --prompt incident_classifier --from 7 --to 9
```
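
For machine-readable output, `--json` emits the diff as JSON, and omitting `--to` compares against the latest version per the default above:

```bash
# Compare version 7 against the latest version, emitting JSON.
go run ./cmd/prompt-diff --prompt incident_classifier --from 7 --json
```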

## Configuration

### Runtime dependencies

| Variable | Purpose |
|---|---|
| `CRUVERO_POSTGRES_URL` | Prompt library, dataset, and experiment storage |
| `CRUVERO_LLM_PROVIDER` | Active provider for evaluation calls |
| `CRUVERO_OPENROUTER_API_KEY` / `CRUVERO_OPENROUTER_MODEL` | OpenRouter provider settings |
| `CRUVERO_OPENAI_API_KEY` / `CRUVERO_OPENAI_MODEL` | OpenAI provider settings |
| `CRUVERO_GOOGLE_API_KEY` / `CRUVERO_GOOGLE_MODEL` | Google provider settings |
| `CRUVERO_ANTHROPIC_API_KEY` / `CRUVERO_ANTHROPIC_MODEL` | Anthropic provider settings |
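
A minimal environment sketch for an OpenAI-backed evaluation run; every value below is a placeholder, and the exact provider identifier accepted by `CRUVERO_LLM_PROVIDER` is an assumption:

```bash
# Placeholder values; set the key/model pair that matches CRUVERO_LLM_PROVIDER.
export CRUVERO_POSTGRES_URL="postgres://user:pass@localhost:5432/cruvero"
export CRUVERO_LLM_PROVIDER="openai"        # assumed provider identifier
export CRUVERO_OPENAI_API_KEY="sk-your-key" # placeholder
export CRUVERO_OPENAI_MODEL="your-model-id" # placeholder
```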

### Prompt library controls

| Variable | Purpose |
|---|---|
| `CRUVERO_PROMPTLIB_EVAL_ENABLED` | Enable evaluation paths |
| `CRUVERO_PROMPTLIB_EVAL_TIMEOUT` | Evaluation timeout budget |
| `CRUVERO_PROMPTLIB_EVAL_MAX_CONCURRENT` | Eval parallelism cap |
| `CRUVERO_PROMPTLIB_DIFF_CONTEXT_LINES` | Context lines in computed prompt diffs |
| `CRUVERO_PROMPTLIB_EXPERIMENTS_ENABLED` | Experiment feature switch |
| `CRUVERO_PROMPTLIB_EXPERIMENT_MAX_VARIANTS` | Max variants per experiment |
| `CRUVERO_PROMPTLIB_SNIPPETS_ENABLED` | Snippet composition support |
| `CRUVERO_PROMPTLIB_SNIPPET_MAX_DEPTH` | Max snippet nesting depth |
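
As an illustration, a CI profile might cap evaluation cost and keep diffs compact. The value formats here (booleans, a Go-style duration for the timeout, plain integers) are assumptions, not documented contracts:

```bash
# Illustrative values only; formats assumed (bool / duration / integers).
export CRUVERO_PROMPTLIB_EVAL_ENABLED=true
export CRUVERO_PROMPTLIB_EVAL_TIMEOUT=2m
export CRUVERO_PROMPTLIB_EVAL_MAX_CONCURRENT=4
export CRUVERO_PROMPTLIB_DIFF_CONTEXT_LINES=3
```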

## Integration with Prompt Library v2

The CLI suite and Prompt Library v2 share the same storage and scoring primitives:

- `prompt-dataset` creates datasets consumed by `prompt-eval`.
- `prompt-eval` writes `eval_runs` and `eval_results` used by prompt governance workflows.
- `prompt-experiment` persists experiment state and winner metadata used by promotion flows.
- `prompt-diff` uses the same diff engine as the UI and version review paths.
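
Taken together, the four tools support a dataset-to-promotion loop. A hedged end-to-end sketch in which all ids, names, and versions are placeholders:

```bash
# 1. Build a regression dataset from recent failed cases in the audit logs.
go run ./cmd/prompt-dataset --from-logs --prompt-hash ph_abc123 \
  --since 168h --failures-only --name support-regression

# 2. Evaluate the prompt against it, failing on regression for governance gates.
go run ./cmd/prompt-eval --prompt-hash ph_abc123 --dataset support-regression \
  --fail-on-regression

# 3. Record the winning variant so promotion flows can pick it up.
go run ./cmd/prompt-experiment --complete exp-173 --winner concise_v2

# 4. Review exactly what changed between the prompt versions involved.
go run ./cmd/prompt-diff --prompt incident_classifier --from 7 --to 9
```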