Phase 12

Goal: Introduce a lightweight event bus layer using NATS that future-proofs Cruvero's extension surface without disrupting the Temporal-first architecture. Temporal remains the spine for all workflow orchestration, durability, and state management. NATS operates strictly at the edges — providing async event fan-out, MCP server lifecycle management, and embedding pipeline intake.

Status: Completed (2026-02-09)


Why Now

With Phases 1–11 complete, Cruvero has a fully operational runtime: durable agent loops, multi-agent supervision, memory, causal tracing, cost tracking, sandboxed tools, guardrails, streaming, evaluation, and enterprise hardening. What's missing is a scalable extension seam — a way for external consumers (dashboards, bots, audit pipelines, webhook relays) to react to agent events without polling Temporal queries or coupling to the SSE endpoint.

Current limitations this phase addresses:

  • SSE endpoint is single-consumer and ephemeral — the GET /api/stream endpoint works for the Web UI but can't serve multiple independent consumers, has no replay, and loses events on disconnect.
  • MCP server discovery is static — the CRUVERO_MCP_SERVERS env var requires a worker restart to add or remove servers. With 14+ MCP servers and growing, dynamic discovery is overdue.
  • Embedding pipeline is synchronous — Phase 8D's embedding work blocks workflow activities while waiting for embedding generation. Batching and backpressure need a queue.
  • Audit/telemetry writes are synchronous — Phase 9C audit events and Phase 8C telemetry write directly to Postgres in activities, creating write amplification under load.
  • Teams/Telegram bots need event subscriptions — The planned control plane bots need to subscribe to workflow events without polling.

Architecture Principle: Temporal is the Spine, NATS is the Nervous System

This is the critical constraint. NATS must never be in the workflow deterministic path. Activities can publish to NATS, but workflow completion must never depend on NATS availability. If NATS is down, workflows continue unaffected — events are simply not fanned out until recovery.

┌──────────────────────────────────────────────────────┐
│                   Temporal Cluster                   │
│  ┌────────────┐  ┌────────────┐  ┌────────────────┐  │
│  │  Agent WF  │  │  Graph WF  │  │ Supervisor WF  │  │
│  └─────┬──────┘  └─────┬──────┘  └───────┬────────┘  │
│        │               │                 │           │
│  ┌─────▼───────────────▼─────────────────▼────────┐  │
│  │               Worker Activities                │  │
│  │       (LLM, Tools, Memory, Traces, Cost)       │  │
│  └─────────────────────┬──────────────────────────┘  │
└────────────────────────┼─────────────────────────────┘
                         │ fire-and-forget publish
                         ▼
              ┌──────────────────────┐
              │   NATS / JetStream   │
              │                      │
              │  Subjects:           │
              │   cruvero.events.*   │
              │   cruvero.mcp.*      │
              │   cruvero.embed.*    │
              │   cruvero.audit.*    │
              └──────────┬───────────┘
                         │
           ┌─────────────┼─────────────┐
           ▼             ▼             ▼
     ┌─────────┐   ┌──────────┐   ┌──────────────┐
     │ Web UI  │   │ Telegram │   │ Audit Writer │
     │  (SSE)  │   │   Bot    │   │   (batch)    │
     └─────────┘   └──────────┘   └──────────────┘

What Gets Refactored (Existing Code)

Since all phases are complete, this phase touches existing code at well-defined seams:

1. Event Publishing in Activities (internal/agent/activities.go)

  • After each activity completes (LLM decision, tool execution, observation, trace save), publish an event to NATS.
  • Non-blocking, fire-and-forget. If NATS is unavailable, log a warning and continue.
  • No changes to workflow determinism or Temporal state.

2. SSE Endpoint Refactor (cmd/ui/main.go)

  • Current: SSE polls Temporal queries on interval.
  • Refactored: SSE subscribes to the NATS cruvero.events.<workflow_id> subject.
  • Fallback: If NATS unavailable, degrade to current polling behavior.
  • Benefit: Real-time events, multiple UI tabs, no polling overhead.

3. MCP Bridge Dynamic Discovery (internal/tools/mcp_bridge.go)

  • Current: Static env config loaded once at startup.
  • Refactored: MCP servers announce capabilities on cruvero.mcp.announce at startup.
  • Bridge subscribes and dynamically updates available tools.
  • Env config remains as fallback for servers that don't support NATS announce.

4. Embedding Pipeline (internal/memory/embedding.go)

  • Current: Synchronous embedding generation in activity.
  • Refactored: Activity publishes embedding request to cruvero.embed.requests JetStream.
  • Separate embedding worker consumes, batches, and writes results.
  • Activity can optionally wait (with timeout) or fire-and-forget for non-blocking semantic extraction.

5. Audit Event Buffering (internal/audit/)

  • Current: Phase 9C writes audit events directly to Postgres per-event.
  • Refactored: Publish to cruvero.audit.events JetStream stream.
  • Dedicated consumer batches writes to Postgres (configurable batch size and flush interval).
  • Guarantees: JetStream acknowledgment ensures no audit event loss.

6. Supervisor Inter-Agent Events (internal/supervisor/)

  • Current: Uses Temporal signals for all inter-agent communication.
  • No change to signals — they remain the durable communication path.
  • Addition: Supervisor publishes observability events (agent delegated, result received, saga step) to NATS for external monitoring.
  • External consumers can watch multi-agent orchestration in real-time.

What's Future-Proofed (New Extension Points)

1. Webhook Relay

Any external system can subscribe to cruvero.events.> and relay to webhooks. No Cruvero code changes needed — just a NATS consumer.

2. Teams/Telegram Bot Event Subscriptions

Control plane bots subscribe to approval requests, run completions, cost alerts, and error events. No polling, no coupling to Cruvero internals.

3. Multi-Worker Event Coordination

As Cruvero scales to multiple worker instances, NATS provides cross-worker event visibility without shared state.

4. Custom Alerting Pipelines

Operators define NATS consumers that trigger PagerDuty, OpsGenie, or Slack based on event patterns (cost threshold exceeded, agent stuck, tool failure rate spike).

5. External Agent Frameworks

Third-party agents can publish events to NATS subjects and participate in Cruvero's event ecosystem without being Temporal workflows.

6. Replay Event Streams

JetStream persistence means events can be replayed from any point — enabling post-hoc debugging, compliance review, and training data generation.


Sub-Phase Breakdown

This phase is split into 4 sub-phases, each independently deployable and testable. NATS is optional at every stage — CRUVERO_EVENTS_BACKEND=none disables the entire layer.


📡 Phase 12A: Core Event Bus Infrastructure

Goal: Establish the EventBus interface, NATS implementation, and first event publishers.

Duration: 1.5–2 weeks

Deliverables

1. EventBus Interface (internal/events/bus.go)

type EventBus interface {
    // Publish fires an event. Non-blocking, best-effort.
    Publish(ctx context.Context, subject string, event Event) error

    // Subscribe returns a channel of events for the subject pattern.
    Subscribe(ctx context.Context, subject string) (<-chan Event, error)

    // Request sends a request and waits for a reply (for discovery).
    Request(ctx context.Context, subject string, data []byte, timeout time.Duration) ([]byte, error)

    // Close tears down connections.
    Close() error
}

type Event struct {
    ID        string          `json:"id"`
    Type      string          `json:"type"`    // run.started, step.completed, tool.called, etc.
    Subject   string          `json:"subject"` // NATS subject
    Source    string          `json:"source"`  // workflow_id or component
    Timestamp time.Time       `json:"timestamp"`
    TenantID  string          `json:"tenant_id,omitempty"`
    RunID     string          `json:"run_id,omitempty"`
    StepIndex int             `json:"step_index,omitempty"`
    Data      json.RawMessage `json:"data"`
}

2. Implementations

  • NATSEventBus — full NATS + JetStream implementation.
  • NoopEventBus — silent no-op for disabled mode.
  • LogEventBus — logs events to structured logger (useful for debugging without NATS).

3. Event Taxonomy (Subject Hierarchy)

cruvero.events.run.started         {run_id, prompt, model}
cruvero.events.run.completed       {run_id, steps, cost, duration}
cruvero.events.run.failed          {run_id, error, step_index}
cruvero.events.step.decided        {run_id, step, tool, decision}
cruvero.events.step.completed      {run_id, step, tool, result_summary}
cruvero.events.tool.called         {run_id, step, tool, args_hash}
cruvero.events.tool.failed         {run_id, step, tool, error}
cruvero.events.tool.repaired       {run_id, step, tool, attempt}
cruvero.events.approval.requested  {run_id, step, tool, request_id}
cruvero.events.approval.resolved   {run_id, step, approved, reason}
cruvero.events.question.asked      {run_id, step, question}
cruvero.events.question.answered   {run_id, step, answer}
cruvero.events.cost.recorded       {run_id, step, cost}
cruvero.events.cost.threshold      {run_id, threshold, current}
cruvero.events.memory.saved        {run_id, type, key}
cruvero.events.trace.saved         {run_id, step}
cruvero.events.control.pause       {run_id}
cruvero.events.control.resume      {run_id}
cruvero.events.supervisor.*        {supervisor events mirror above}

4. Activity Integration

  • Add EventBus to activity context (injected via worker setup).
  • Instrument LLMDecideActivity, ToolExecuteActivity, ObserveActivity, SaveTraceActivity.
  • Each publishes 1 event after successful completion. Fire-and-forget with error logging.

5. Configuration

CRUVERO_EVENTS_BACKEND=nats|log|none      (default: none)
CRUVERO_NATS_URL=nats://localhost:4222
CRUVERO_NATS_CLUSTER_ID=cruvero
CRUVERO_NATS_CREDS_FILE= (optional NKey/JWT auth)
CRUVERO_NATS_TLS=auto|false
CRUVERO_NATS_CONNECT_TIMEOUT=5s
CRUVERO_NATS_RECONNECT_WAIT=2s
CRUVERO_NATS_MAX_RECONNECTS=60
CRUVERO_EVENTS_SUBJECT_PREFIX=cruvero (default: cruvero)

6. Docker Compose Addition

nats:
  image: nats:2.10-alpine
  command: ["--jetstream", "--store_dir=/data"]
  ports:
    - "4222:4222"
    - "8222:8222"   # monitoring
  volumes:
    - nats-data:/data
  deploy:
    resources:
      limits:
        memory: 128M

7. Tests

  • Unit: EventBus interface contract tests (run against Noop, Log, and NATS via testcontainer).
  • Integration: Publish event from activity mock, verify subscriber receives it.
  • Resilience: NATS unavailable — verify activity completes without error.
  • Subject validation: Verify taxonomy compliance.

8. Migration

  • migrations/0008_events_config.up.sql — optional table for event routing rules per tenant.

Files to Add/Change

  • internal/events/bus.go (interface)
  • internal/events/nats.go (NATS impl)
  • internal/events/noop.go (no-op impl)
  • internal/events/log.go (log impl)
  • internal/events/types.go (Event, subject constants)
  • internal/events/bus_test.go
  • internal/agent/activities.go (add publish calls)
  • internal/config/config.go (NATS config)
  • cmd/worker/main.go (wire EventBus)
  • docker-compose.yml (add NATS service)
  • .env.example (NATS vars)
  • migrations/0008_events_config.up.sql
  • migrations/0008_events_config.down.sql

Exit Criteria

  • EventBus interface with 3 implementations
  • Activities publish events on completion
  • NATS subscriber receives events in real-time
  • CRUVERO_EVENTS_BACKEND=none disables all publishing with zero overhead
  • Worker survives NATS being unavailable
  • Docker Compose includes NATS with JetStream

Dependencies

  • All prior phases complete
  • No breaking changes to any existing workflow or activity

📡 Phase 12B: SSE Refactor & Consumer Patterns

Goal: Refactor the Web UI SSE endpoint to consume from NATS, and establish reusable consumer patterns for external integrations.

Duration: 1–1.5 weeks

Deliverables

1. SSE-over-NATS Bridge (cmd/ui/sse_nats.go)

  • Subscribe to cruvero.events.run.<workflow_id> (wildcard for all step events).
  • Convert NATS events to SSE format and push to connected clients.
  • Support multiple concurrent SSE clients per workflow (fan-out from single NATS subscription).
  • Fallback: If NATS unavailable, degrade to current Temporal query polling.
  • Configurable via CRUVERO_UI_SSE_SOURCE=nats|temporal (default: nats if events backend is NATS, else temporal).

2. Consumer SDK (internal/events/consumer.go)

type ConsumerConfig struct {
    Subject       string
    Group         string        // queue group for load balancing
    BatchSize     int           // for batch consumers
    FlushInterval time.Duration // max time before flush
    MaxPending    int           // backpressure limit
    AckPolicy     AckPolicy     // explicit, none, all
}

type BatchConsumer interface {
    HandleBatch(ctx context.Context, events []Event) error
}

type StreamConsumer interface {
    HandleEvent(ctx context.Context, event Event) error
}

  • Reusable consumer patterns: single-event handler, batch handler, filtered handler.
  • Queue groups for load-balanced consumption (multiple bot instances, multiple audit writers).
  • Built-in metrics: events consumed, processing latency, errors.

3. Event Replay (cmd/event-replay)

  • CLI tool to replay JetStream events from a point in time.
  • Useful for debugging, compliance review, and backfilling consumers that were offline.

go run ./cmd/event-replay --subject "cruvero.events.run.*" --since 2h --consumer my-debug

4. Web UI Enhancements

  • Real-time event indicators in run detail view (events arrive instantly vs 1-5s poll).
  • Event log panel showing raw events for debugging.
  • Connection status indicator (NATS connected/degraded/polling).

5. Tests

  • SSE receives events within 100ms of activity publishing.
  • Multiple SSE clients receive same event.
  • Graceful degradation when NATS disconnects mid-stream.
  • Event replay CLI outputs correct events for time range.

Files to Add/Change

  • cmd/ui/sse_nats.go (NATS-backed SSE)
  • cmd/ui/main.go (wire NATS SSE source)
  • cmd/ui/ui/index.html (connection status indicator)
  • internal/events/consumer.go (consumer SDK)
  • internal/events/consumer_test.go
  • cmd/event-replay/main.go

Exit Criteria

  • SSE streams events from NATS with < 100ms latency
  • Multiple UI tabs receive events simultaneously
  • Graceful fallback to Temporal polling when NATS is down
  • Event replay CLI works for arbitrary time ranges
  • Consumer SDK supports batch and stream patterns

Dependencies

  • Phase 12A complete

📡 Phase 12C: MCP Dynamic Discovery & Embedding Pipeline

Goal: Use NATS for MCP server lifecycle management and async embedding generation.

Duration: 1.5–2 weeks

Deliverables

1. MCP Server Announce Protocol

cruvero.mcp.announce     — MCP server publishes tool manifest on startup
cruvero.mcp.heartbeat    — periodic health signal
cruvero.mcp.deregister   — graceful shutdown notification
cruvero.mcp.health.req   — request/reply health check from bridge

Announce payload:

{
  "server_name": "mcp-github",
  "transport": "stdio",
  "tools": [
    {"name": "github_list_repos", "description": "...", "schema": {...}}
  ],
  "version": "1.2.0",
  "capabilities": ["tools", "resources"],
  "timestamp": "2026-02-08T10:00:00Z"
}

2. Dynamic MCP Registry Update (internal/tools/mcp_bridge.go)

  • Bridge subscribes to cruvero.mcp.announce.
  • On new announcement: validate, diff against current tools, update in-memory tool map.
  • Periodic health check via cruvero.mcp.health.req (request/reply).
  • Stale servers (no heartbeat for 2 intervals) marked unhealthy; tools return error with explanation.
  • Env config (CRUVERO_MCP_SERVERS) remains as static fallback — announce supplements, doesn't replace.

3. MCP Server Sidecar Helper (cmd/mcp-announce)

  • Lightweight wrapper that any MCP server can use to announce to NATS.
  • Can be used as a process wrapper: mcp-announce --server mcp-github -- /path/to/mcp-github.
  • Publishes announce on start, heartbeat every 30s, deregister on SIGTERM.

4. Async Embedding Pipeline

JetStream stream: CRUVERO_EMBED

cruvero.embed.requests   — embedding generation requests
cruvero.embed.results    — completed embeddings (for consumers that need them)
cruvero.embed.dlq        — dead letter queue for failed embeddings

Request payload:

{
  "request_id": "emb-uuid",
  "source": "semantic_extract",
  "run_id": "run-123",
  "text": "User prefers formal tone in reports",
  "namespace": "user-prefs",
  "metadata": {"fact_id": "fact-456"}
}

5. Embedding Worker (cmd/embed-worker)

  • Consumes from cruvero.embed.requests JetStream.
  • Batches requests (configurable: CRUVERO_EMBED_BATCH_SIZE=32, CRUVERO_EMBED_FLUSH_MS=500).
  • Calls embedding provider (Phase 8D infrastructure).
  • Writes results to vector store (Qdrant/pgvector).
  • Publishes completion to cruvero.embed.results.
  • DLQ for persistent failures after retries.

6. Activity Refactor (internal/memory/embedding.go)

  • Current synchronous embedding call replaced with NATS publish.
  • Two modes:
    • async (default): Publish and continue. Embedding available eventually.
    • sync: Publish and wait for result on cruvero.embed.results with timeout. Falls back to direct call on timeout.
  • Mode configurable: CRUVERO_EMBED_MODE=async|sync|direct (direct = current behavior, no NATS).

7. Configuration

# MCP Discovery
CRUVERO_MCP_DISCOVERY=static|nats|both (default: static)
CRUVERO_MCP_HEARTBEAT_INTERVAL=30s
CRUVERO_MCP_STALE_THRESHOLD=90s

# Embedding Pipeline
CRUVERO_EMBED_MODE=async|sync|direct (default: direct)
CRUVERO_EMBED_BATCH_SIZE=32
CRUVERO_EMBED_FLUSH_MS=500
CRUVERO_EMBED_DLQ_MAX_RETRIES=3
CRUVERO_EMBED_WORKER_CONCURRENCY=4

8. Tests

  • MCP announce → bridge picks up new tools within 5s.
  • MCP deregister → tools marked unavailable.
  • Stale MCP server → health check fails → tools return error.
  • Embedding request → batched → written to vector store.
  • Embedding DLQ → failed requests retried, then dead-lettered.
  • Direct mode still works when NATS unavailable.

Files to Add/Change

  • internal/tools/mcp_bridge.go (add NATS discovery)
  • internal/tools/mcp_announce.go (announce types)
  • internal/memory/embedding.go (async mode)
  • cmd/mcp-announce/main.go
  • cmd/embed-worker/main.go
  • internal/events/streams.go (JetStream stream definitions)
  • internal/config/config.go (new config vars)

Exit Criteria

  • MCP server dynamically registers tools via NATS announce
  • MCP health check detects and marks stale servers
  • Embedding requests batched and processed asynchronously
  • DLQ captures failed embeddings for retry/investigation
  • All features degrade gracefully when NATS unavailable
  • Static MCP config continues to work as fallback

Dependencies

  • Phase 12A complete
  • Phase 8D (embedding infrastructure) complete

📡 Phase 12D: Audit Buffering, Telemetry Pipeline & Ops Tooling

Goal: Move audit and telemetry writes to buffered NATS consumers, add operational tooling for the event bus.

Duration: 1–1.5 weeks

Deliverables

1. Audit Event Buffering

JetStream stream: CRUVERO_AUDIT

  • Phase 9C audit events publish to cruvero.audit.events instead of direct Postgres writes.
  • Dedicated audit consumer batches writes (configurable: CRUVERO_AUDIT_BATCH_SIZE=50, CRUVERO_AUDIT_FLUSH_MS=1000).
  • JetStream AckExplicit ensures no audit event is lost — consumer acks only after successful Postgres write.
  • Retention: configurable stream retention matching audit retention policy.
  • PII detection still runs inline before publishing (PII flags travel with the event).

2. Telemetry Pipeline

cruvero.telemetry.metrics    — cost, latency, token usage per step
cruvero.telemetry.spans      — lightweight span summaries for non-OTEL consumers

  • Activities publish lightweight telemetry events alongside workflow events.
  • Consumer writes to Prometheus push gateway, Grafana Loki, or custom sinks.
  • Decouples telemetry collection from the OTEL exporter — operators can add custom sinks without modifying worker code.

3. Event Bus Health Dashboard

  • New UI page: /events.html
  • Shows: NATS connection status, stream stats (messages, consumers, lag), subject activity heatmap.
  • Uses NATS monitoring endpoint (http://nats:8222/jsz) for stream metadata.

4. Event Bus CLI (cmd/event-bus)

go run ./cmd/event-bus status                       # connection + stream health
go run ./cmd/event-bus streams                      # list JetStream streams
go run ./cmd/event-bus consumers                    # list active consumers
go run ./cmd/event-bus subscribe <subject>          # live subscribe for debugging
go run ./cmd/event-bus publish <subject> <payload>  # manual event injection
go run ./cmd/event-bus purge <stream>               # purge stream (with confirmation)

5. Tenant Isolation

  • Multi-tenant event isolation via subject prefix: cruvero.<tenant_id>.events.*
  • JetStream streams can be per-tenant or shared with subject filtering.
  • Respects Phase 9A tenant boundaries — consumers only see their tenant's events.

6. Debian Deployment Update

  • Update Debian deployment guide with NATS service configuration.
  • NATS memory budget: ~64–128MB (well within 8GB constraint).
  • JetStream file store for persistence (vs memory store).
  • Systemd unit file for NATS.
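
A minimal unit file along these lines could work (the binary path, store directory, user, and the MemoryMax cap are illustrative assumptions, not taken from the deployment guide):

```ini
[Unit]
Description=NATS Server with JetStream
After=network-online.target
Wants=network-online.target

[Service]
User=nats
ExecStart=/usr/local/bin/nats-server --jetstream --store_dir /var/lib/nats
Restart=on-failure
RestartSec=2
# Matches the 128MB budget from the resource table
MemoryMax=128M

[Install]
WantedBy=multi-user.target
```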

7. Configuration

# Audit Buffering
CRUVERO_AUDIT_BUFFER=nats|direct (default: direct)
CRUVERO_AUDIT_BATCH_SIZE=50
CRUVERO_AUDIT_FLUSH_MS=1000
CRUVERO_AUDIT_STREAM_RETENTION=365d

# Telemetry
CRUVERO_TELEMETRY_NATS=true|false (default: false)
CRUVERO_TELEMETRY_SUBJECTS=metrics,spans

# Tenant Isolation
CRUVERO_EVENTS_TENANT_ISOLATION=true|false (default: false)

8. Tests

  • Audit events buffered and batch-written (verify batch size and flush timing).
  • No audit event lost during consumer restart (JetStream redelivery).
  • Tenant A cannot subscribe to Tenant B's events.
  • Event bus CLI reports accurate stream statistics.
  • Debian deployment: NATS starts, worker connects, events flow.

Files to Add/Change

  • internal/audit/nats_writer.go (buffered audit consumer)
  • internal/events/telemetry.go (telemetry publisher)
  • cmd/event-bus/main.go (ops CLI)
  • cmd/ui/ui/events.html (event bus dashboard)
  • cmd/ui/main.go (events API endpoints)
  • internal/events/tenant.go (tenant subject routing)
  • docs/debian-deploy.md (NATS section)

Exit Criteria

  • Audit events buffered through JetStream with zero loss
  • Batch writes reduce Postgres write amplification by 10x+
  • Event bus CLI provides operational visibility
  • Tenant isolation enforced at subject level
  • Debian deployment guide updated with NATS
  • All features disabled cleanly with CRUVERO_EVENTS_BACKEND=none

Dependencies

  • Phase 12A and 12B complete
  • Phase 9A (multi-tenancy) and 9C (audit) complete

Timeline

Sub-Phase                                  Duration      Cumulative
12A: Core Event Bus                        1.5–2 weeks   2 weeks
12B: SSE Refactor & Consumers              1–1.5 weeks   3.5 weeks
12C: MCP Discovery & Embedding Pipeline    1.5–2 weeks   5.5 weeks
12D: Audit Buffering & Ops Tooling         1–1.5 weeks   7 weeks

Total estimated duration: 5–7 weeks

12A and 12B can be developed sequentially (B depends on A). 12C and 12D can begin once 12A is complete and run in parallel with 12B.


Resource Impact (8GB Debian Target)

Component         Memory          Disk      Notes
NATS server       64–128 MB       Minimal   JetStream file store
Embed worker      32–64 MB        None      Separate process, optional
Event consumers   16–32 MB each   None      Run in-process or separate
Total addition    ~128–256 MB     < 1 GB    Fits within 8GB budget

NATS is one of the lightest messaging systems available. Alpine image is ~18MB. JetStream file store avoids memory pressure for persistence.


Risk Mitigation

Risk                         Mitigation
NATS becomes critical path   Strict fire-and-forget in activities. Health checks log warnings, never block.
Event ordering               JetStream provides per-subject ordering. Cross-subject ordering not guaranteed (and not needed).
Message loss                 JetStream with AckExplicit for audit/embed. Best-effort for observability events.
Operational complexity       CRUVERO_EVENTS_BACKEND=none disables everything. Zero NATS dependency for core workflows.
8GB memory pressure          NATS server capped at 128MB. JetStream uses file store. Embed worker optional.
Schema evolution             Events are JSON with versioned type field. Consumers handle unknown types gracefully.

Environment Variables Summary

# Core
CRUVERO_EVENTS_BACKEND=nats|log|none
CRUVERO_NATS_URL=nats://localhost:4222
CRUVERO_NATS_CLUSTER_ID=cruvero
CRUVERO_NATS_CREDS_FILE=
CRUVERO_NATS_TLS=auto|false
CRUVERO_NATS_CONNECT_TIMEOUT=5s
CRUVERO_NATS_RECONNECT_WAIT=2s
CRUVERO_NATS_MAX_RECONNECTS=60
CRUVERO_EVENTS_SUBJECT_PREFIX=cruvero

# MCP Discovery
CRUVERO_MCP_DISCOVERY=static|nats|both
CRUVERO_MCP_HEARTBEAT_INTERVAL=30s
CRUVERO_MCP_STALE_THRESHOLD=90s

# Embedding Pipeline
CRUVERO_EMBED_MODE=async|sync|direct
CRUVERO_EMBED_BATCH_SIZE=32
CRUVERO_EMBED_FLUSH_MS=500
CRUVERO_EMBED_DLQ_MAX_RETRIES=3
CRUVERO_EMBED_WORKER_CONCURRENCY=4

# Audit Buffering
CRUVERO_AUDIT_BUFFER=nats|direct
CRUVERO_AUDIT_BATCH_SIZE=50
CRUVERO_AUDIT_FLUSH_MS=1000

# Telemetry
CRUVERO_TELEMETRY_NATS=true|false

# Tenant Events
CRUVERO_EVENTS_TENANT_ISOLATION=true|false

Success Metrics

Metric                                         Target
Event publish latency (activity overhead)      < 1ms p99
SSE delivery latency (event → browser)         < 100ms p99
Audit event loss rate                          0% (JetStream guarantee)
MCP discovery latency (announce → available)   < 5s
Embedding pipeline throughput                  100+ embeddings/sec batched
Worker impact when NATS down                   Zero (no errors, no latency change)
Memory overhead (NATS server)                  < 128MB
Test coverage                                  >= 80% for internal/events/ (enforced by scripts/check-coverage.sh)

Code Quality Requirements (SonarQube)

All Go code produced by Phase 12 prompts must pass SonarQube quality gates. Each PROMPT.md file includes these constraints in every prompt's Constraints section:

  • Error handling: Every returned error must be handled explicitly — never ignore with _
  • Cyclomatic complexity: Keep functions focused and small; extract complex logic into well-named helpers. Functions under 50 lines where practical
  • No dead code: No unused variables, empty blocks, or duplicated logic
  • Resource cleanup: Close all resources (DB rows, response bodies, NATS connections) with proper defer patterns
  • Early returns: Avoid deeply nested conditionals — prefer guard clauses
  • No magic values: Use named constants for strings and numbers
  • No hardcoded secrets: All credentials via env vars
  • Meaningful names: Descriptive variable and function names
  • Linting gate: Run go vet ./internal/events/..., staticcheck ./internal/events/..., and golangci-lint run ./internal/events/... before considering the prompt complete — these catch most issues SonarQube flags
  • Test coverage: 80%+ coverage target for internal/events/

Each sub-phase Exit Criteria section includes:

  • [ ] go vet ./internal/events/... reports no issues
  • [ ] staticcheck ./internal/events/... reports no issues
  • [ ] No functions exceed 50 lines (extract helpers as needed)
  • [ ] All returned errors are handled (no _ = err patterns)

Open Questions

  1. Core NATS or JetStream everywhere? — JetStream adds persistence but also complexity. Recommendation: JetStream for audit and embedding (events that must not be lost), core NATS for observability events (best-effort is fine).
  2. Embedded NATS? — NATS can run embedded in the Go worker process. Simpler deployment but less isolation. Recommendation: External process for production, embedded option for single-node dev.
  3. Event schema registry? — Should events have a formal schema registry? Recommendation: Start with documented JSON conventions, add formal schemas if consumer ecosystem grows.
  4. NATS auth model? — NKey auth, JWT auth, or none? Recommendation: None for dev, NKey for production single-tenant, JWT for multi-tenant.

Relationship to Other Phases

Phase                         Relationship
Phase 8C (Observability)      12D adds NATS-based telemetry pipeline as alternative sink
Phase 8D (Embeddings)         12C decouples embedding generation into async pipeline
Phase 9A (Multi-tenancy)      12D adds tenant-scoped event isolation
Phase 9C (Audit)              12D buffers audit writes through JetStream
Phase 10 (Neuro)              Event bus enables real-time metacognitive monitoring
Phase 11C (Streaming)         NATS can carry LLM token streams to UI
Phase 11G (Reward Learning)   Reward signals can be published as events
Teams/Telegram Bots           Bots subscribe to events instead of polling