Phase 12

Goal: Introduce a lightweight event bus layer using NATS that future-proofs Cruvero's extension surface without disrupting the Temporal-first architecture. Temporal remains the spine for all workflow orchestration, durability, and state management. NATS operates strictly at the edges — providing async event fan-out, MCP server lifecycle management, and embedding pipeline intake.

Status: Completed (2026-02-09)


Why Now

With Phases 1–11 complete, Cruvero has a fully operational runtime: durable agent loops, multi-agent supervision, memory, causal tracing, cost tracking, sandboxed tools, guardrails, streaming, evaluation, and enterprise hardening. What's missing is a scalable extension seam — a way for external consumers (dashboards, bots, audit pipelines, webhook relays) to react to agent events without polling Temporal queries or coupling to the SSE endpoint.

Current limitations this phase addresses:

  • SSE endpoint is single-consumer and ephemeral — the GET /api/stream endpoint works for the Web UI but can't serve multiple independent consumers, has no replay, and loses events on disconnect.
  • MCP server discovery is static — the CRUVERO_MCP_SERVERS env var requires a worker restart to add or remove servers. With 14+ MCP servers and growing, dynamic discovery is overdue.
  • Embedding pipeline is synchronous — Phase 8D's embedding work blocks workflow activities while waiting for embedding generation. Batching and backpressure need a queue.
  • Audit/telemetry writes are synchronous — Phase 9C audit events and Phase 8C telemetry write directly to Postgres in activities, creating write amplification under load.
  • Teams/Telegram bots need event subscriptions — The planned control plane bots need to subscribe to workflow events without polling.

Architecture Principle: Temporal is the Spine, NATS is the Nervous System

This is the critical constraint. NATS must never be in the workflow deterministic path. Activities can publish to NATS, but workflow completion must never depend on NATS availability. If NATS is down, workflows continue unaffected — events are simply not fanned out until recovery.

┌──────────────────────────────────────────────────────┐
│                   Temporal Cluster                   │
│  ┌────────────┐  ┌────────────┐  ┌────────────────┐  │
│  │  Agent WF  │  │  Graph WF  │  │ Supervisor WF  │  │
│  └─────┬──────┘  └─────┬──────┘  └───────┬────────┘  │
│        │               │                 │           │
│  ┌─────▼───────────────▼─────────────────▼────────┐  │
│  │               Worker Activities                │  │
│  │       (LLM, Tools, Memory, Traces, Cost)       │  │
│  └─────────────────────┬──────────────────────────┘  │
└────────────────────────┼─────────────────────────────┘
                         │ fire-and-forget publish
                         ▼
              ┌──────────────────────┐
              │   NATS / JetStream   │
              │                      │
              │  Subjects:           │
              │   cruvero.events.*   │
              │   cruvero.mcp.*      │
              │   cruvero.embed.*    │
              │   cruvero.audit.*    │
              └──────────┬───────────┘
                         │
           ┌─────────────┼─────────────┐
           ▼             ▼             ▼
     ┌─────────┐   ┌──────────┐   ┌──────────────┐
     │ Web UI  │   │ Telegram │   │ Audit Writer │
     │  (SSE)  │   │   Bot    │   │   (batch)    │
     └─────────┘   └──────────┘   └──────────────┘

What Gets Refactored (Existing Code)

Since all phases are complete, this phase touches existing code at well-defined seams:

1. Event Publishing in Activities (internal/agent/activities.go)

  • After each activity completes (LLM decision, tool execution, observation, trace save), publish an event to NATS.
  • Non-blocking, fire-and-forget. If NATS is unavailable, log a warning and continue.
  • No changes to workflow determinism or Temporal state.

2. SSE Endpoint Refactor (cmd/ui/main.go)

  • Current: SSE polls Temporal queries on interval.
  • Refactored: SSE subscribes to the NATS cruvero.events.<workflow_id> subject.
  • Fallback: If NATS unavailable, degrade to current polling behavior.
  • Benefit: Real-time events, multiple UI tabs, no polling overhead.

3. MCP Bridge Dynamic Discovery (internal/tools/mcp_bridge.go)

  • Current: Static env config loaded once at startup.
  • Refactored: MCP servers announce capabilities on cruvero.mcp.announce at startup.
  • Bridge subscribes and dynamically updates available tools.
  • Env config remains as fallback for servers that don't support NATS announce.

4. Embedding Pipeline (internal/memory/embedding.go)

  • Current: Synchronous embedding generation in activity.
  • Refactored: Activity publishes embedding request to cruvero.embed.requests JetStream.
  • Separate embedding worker consumes, batches, and writes results.
  • Activity can optionally wait (with timeout) or fire-and-forget for non-blocking semantic extraction.

5. Audit Event Buffering (internal/audit/)

  • Current: Phase 9C writes audit events directly to Postgres per-event.
  • Refactored: Publish to cruvero.audit.events JetStream stream.
  • Dedicated consumer batches writes to Postgres (configurable batch size and flush interval).
  • Guarantees: JetStream acknowledgment ensures no audit event loss.

6. Supervisor Inter-Agent Events (internal/supervisor/)

  • Current: Uses Temporal signals for all inter-agent communication.
  • No change to signals — they remain the durable communication path.
  • Addition: Supervisor publishes observability events (agent delegated, result received, saga step) to NATS for external monitoring.
  • External consumers can watch multi-agent orchestration in real-time.

What's Future-Proofed (New Extension Points)

1. Webhook Relay

Any external system can subscribe to cruvero.events.> and relay to webhooks. No Cruvero code changes needed — just a NATS consumer.

2. Teams/Telegram Bot Event Subscriptions

Control plane bots subscribe to approval requests, run completions, cost alerts, and error events. No polling, no coupling to Cruvero internals.

3. Multi-Worker Event Coordination

As Cruvero scales to multiple worker instances, NATS provides cross-worker event visibility without shared state.

4. Custom Alerting Pipelines

Operators define NATS consumers that trigger PagerDuty, OpsGenie, or Slack based on event patterns (cost threshold exceeded, agent stuck, tool failure rate spike).

5. External Agent Frameworks

Third-party agents can publish events to NATS subjects and participate in Cruvero's event ecosystem without being Temporal workflows.

6. Replay Event Streams

JetStream persistence means events can be replayed from any point — enabling post-hoc debugging, compliance review, and training data generation.


Sub-Phase Breakdown

This phase is split into 4 sub-phases, each independently deployable and testable. NATS is optional at every stage — CRUVERO_EVENTS_BACKEND=none disables the entire layer.


📡 Phase 12A: Core Event Bus Infrastructure

Goal: Establish the EventBus interface, NATS implementation, and first event publishers.

Duration: 1.5–2 weeks

Deliverables

1. EventBus Interface (internal/events/bus.go)

type EventBus interface {
    // Publish fires an event. Non-blocking, best-effort.
    Publish(ctx context.Context, subject string, event Event) error

    // Subscribe returns a channel of events for the subject pattern.
    Subscribe(ctx context.Context, subject string) (<-chan Event, error)

    // Request sends a request and waits for a reply (for discovery).
    Request(ctx context.Context, subject string, data []byte, timeout time.Duration) ([]byte, error)

    // Close tears down connections.
    Close() error
}

type Event struct {
    ID        string          `json:"id"`
    Type      string          `json:"type"`    // run.started, step.completed, tool.called, etc.
    Subject   string          `json:"subject"` // NATS subject
    Source    string          `json:"source"`  // workflow_id or component
    Timestamp time.Time       `json:"timestamp"`
    TenantID  string          `json:"tenant_id,omitempty"`
    RunID     string          `json:"run_id,omitempty"`
    StepIndex int             `json:"step_index,omitempty"`
    Data      json.RawMessage `json:"data"`
}

2. Implementations

  • NATSEventBus — full NATS + JetStream implementation.
  • NoopEventBus — silent no-op for disabled mode.
  • LogEventBus — logs events to structured logger (useful for debugging without NATS).

3. Event Taxonomy (Subject Hierarchy)

cruvero.events.run.started         {run_id, prompt, model}
cruvero.events.run.completed       {run_id, steps, cost, duration}
cruvero.events.run.failed          {run_id, error, step_index}
cruvero.events.step.decided        {run_id, step, tool, decision}
cruvero.events.step.completed      {run_id, step, tool, result_summary}
cruvero.events.tool.called         {run_id, step, tool, args_hash}
cruvero.events.tool.failed         {run_id, step, tool, error}
cruvero.events.tool.repaired       {run_id, step, tool, attempt}
cruvero.events.approval.requested  {run_id, step, tool, request_id}
cruvero.events.approval.resolved   {run_id, step, approved, reason}
cruvero.events.question.asked      {run_id, step, question}
cruvero.events.question.answered   {run_id, step, answer}
cruvero.events.cost.recorded       {run_id, step, cost}
cruvero.events.cost.threshold      {run_id, threshold, current}
cruvero.events.memory.saved        {run_id, type, key}
cruvero.events.trace.saved         {run_id, step}
cruvero.events.control.pause       {run_id}
cruvero.events.control.resume      {run_id}
cruvero.events.supervisor.*        {supervisor events mirror above}

4. Activity Integration

  • Add EventBus to activity context (injected via worker setup).
  • Instrument LLMDecideActivity, ToolExecuteActivity, ObserveActivity, SaveTraceActivity.
  • Each publishes 1 event after successful completion. Fire-and-forget with error logging.

5. Configuration

CRUVERO_EVENTS_BACKEND=nats|log|none      (default: none)
CRUVERO_NATS_URL=nats://localhost:4222
CRUVERO_NATS_CLUSTER_ID=cruvero
CRUVERO_NATS_CREDS_FILE= (optional NKey/JWT auth)
CRUVERO_NATS_TLS=auto|false
CRUVERO_NATS_CONNECT_TIMEOUT=5s
CRUVERO_NATS_RECONNECT_WAIT=2s
CRUVERO_NATS_MAX_RECONNECTS=60
CRUVERO_EVENTS_SUBJECT_PREFIX=cruvero (default: cruvero)

6. Docker Compose Addition

nats:
  image: nats:2.10-alpine
  command: ["--jetstream", "--store_dir=/data"]
  ports:
    - "4222:4222"
    - "8222:8222"   # monitoring
  volumes:
    - nats-data:/data
  deploy:
    resources:
      limits:
        memory: 128M

7. Tests

  • Unit: EventBus interface contract tests (run against Noop, Log, and NATS via testcontainer).
  • Integration: Publish event from activity mock, verify subscriber receives it.
  • Resilience: NATS unavailable — verify activity completes without error.
  • Subject validation: Verify taxonomy compliance.

8. Migration

  • migrations/0008_events_config.up.sql — optional table for event routing rules per tenant.

Files to Add/Change

  • internal/events/bus.go (interface)
  • internal/events/nats.go (NATS impl)
  • internal/events/noop.go (no-op impl)
  • internal/events/log.go (log impl)
  • internal/events/types.go (Event, subject constants)
  • internal/events/bus_test.go
  • internal/agent/activities.go (add publish calls)
  • internal/config/config.go (NATS config)
  • cmd/worker/main.go (wire EventBus)
  • docker-compose.yml (add NATS service)
  • .env.example (NATS vars)
  • migrations/0008_events_config.up.sql
  • migrations/0008_events_config.down.sql

Exit Criteria

  • EventBus interface with 3 implementations
  • Activities publish events on completion
  • NATS subscriber receives events in real-time
  • CRUVERO_EVENTS_BACKEND=none disables all publishing with zero overhead
  • Worker survives NATS being unavailable
  • Docker Compose includes NATS with JetStream

Dependencies

  • All prior phases complete
  • No breaking changes to any existing workflow or activity

📡 Phase 12B: SSE Refactor & Consumer Patterns

Goal: Refactor the Web UI SSE endpoint to consume from NATS, and establish reusable consumer patterns for external integrations.

Duration: 1–1.5 weeks

Deliverables

1. SSE-over-NATS Bridge (cmd/ui/sse_nats.go)

  • Subscribe to cruvero.events.run.<workflow_id> (wildcard for all step events).
  • Convert NATS events to SSE format and push to connected clients.
  • Support multiple concurrent SSE clients per workflow (fan-out from single NATS subscription).
  • Fallback: If NATS unavailable, degrade to current Temporal query polling.
  • Configurable via CRUVERO_UI_SSE_SOURCE=nats|temporal (default: nats if events backend is NATS, else temporal).

2. Consumer SDK (internal/events/consumer.go)

type ConsumerConfig struct {
    Subject       string
    Group         string        // queue group for load balancing
    BatchSize     int           // for batch consumers
    FlushInterval time.Duration // max time before flush
    MaxPending    int           // backpressure limit
    AckPolicy     AckPolicy     // explicit, none, all
}

type BatchConsumer interface {
    HandleBatch(ctx context.Context, events []Event) error
}

type StreamConsumer interface {
    HandleEvent(ctx context.Context, event Event) error
}

  • Reusable consumer patterns: single-event handler, batch handler, filtered handler.
  • Queue groups for load-balanced consumption (multiple bot instances, multiple audit writers).
  • Built-in metrics: events consumed, processing latency, errors.

3. Event Replay (cmd/event-replay)

  • CLI tool to replay JetStream events from a point in time.
  • Useful for debugging, compliance review, and backfilling consumers that were offline.

go run ./cmd/event-replay --subject "cruvero.events.run.*" --since 2h --consumer my-debug

4. Web UI Enhancements

  • Real-time event indicators in run detail view (events arrive instantly vs 1-5s poll).
  • Event log panel showing raw events for debugging.
  • Connection status indicator (NATS connected/degraded/polling).

5. Tests

  • SSE receives events within 100ms of activity publishing.
  • Multiple SSE clients receive same event.
  • Graceful degradation when NATS disconnects mid-stream.
  • Event replay CLI outputs correct events for time range.

Files to Add/Change

  • cmd/ui/sse_nats.go (NATS-backed SSE)
  • cmd/ui/main.go (wire NATS SSE source)
  • cmd/ui/ui/index.html (connection status indicator)
  • internal/events/consumer.go (consumer SDK)
  • internal/events/consumer_test.go
  • cmd/event-replay/main.go

Exit Criteria

  • SSE streams events from NATS with < 100ms latency
  • Multiple UI tabs receive events simultaneously
  • Graceful fallback to Temporal polling when NATS is down
  • Event replay CLI works for arbitrary time ranges
  • Consumer SDK supports batch and stream patterns

Dependencies

  • Phase 12A complete

📡 Phase 12C: MCP Dynamic Discovery & Embedding Pipeline

Goal: Use NATS for MCP server lifecycle management and async embedding generation.

Duration: 1.5–2 weeks

Deliverables

1. MCP Server Announce Protocol

cruvero.mcp.announce     — MCP server publishes tool manifest on startup
cruvero.mcp.heartbeat    — periodic health signal
cruvero.mcp.deregister   — graceful shutdown notification
cruvero.mcp.health.req   — request/reply health check from bridge

Announce payload:

{
  "server_name": "mcp-github",
  "transport": "stdio",
  "tools": [
    {"name": "github_list_repos", "description": "...", "schema": {...}}
  ],
  "version": "1.2.0",
  "capabilities": ["tools", "resources"],
  "timestamp": "2026-02-08T10:00:00Z"
}

2. Dynamic MCP Registry Update (internal/tools/mcp_bridge.go)

  • Bridge subscribes to cruvero.mcp.announce.
  • On new announcement: validate, diff against current tools, update in-memory tool map.
  • Periodic health check via cruvero.mcp.health.req (request/reply).
  • Stale servers (no heartbeat for 2 intervals) marked unhealthy; tools return error with explanation.
  • Env config (CRUVERO_MCP_SERVERS) remains as static fallback — announce supplements, doesn't replace.

3. MCP Server Sidecar Helper (cmd/mcp-announce)

  • Lightweight wrapper that any MCP server can use to announce to NATS.
  • Can be used as a process wrapper: mcp-announce --server mcp-github -- /path/to/mcp-github.
  • Publishes announce on start, heartbeat every 30s, deregister on SIGTERM.

4. Async Embedding Pipeline

JetStream stream: CRUVERO_EMBED

cruvero.embed.requests   — embedding generation requests
cruvero.embed.results    — completed embeddings (for consumers that need them)
cruvero.embed.dlq        — dead letter queue for failed embeddings

Request payload:

{
  "request_id": "emb-uuid",
  "source": "semantic_extract",
  "run_id": "run-123",
  "text": "User prefers formal tone in reports",
  "namespace": "user-prefs",
  "metadata": {"fact_id": "fact-456"}
}

5. Embedding Worker (cmd/embed-worker)

  • Consumes from cruvero.embed.requests JetStream.
  • Batches requests (configurable: CRUVERO_EMBED_BATCH_SIZE=32, CRUVERO_EMBED_FLUSH_MS=500).
  • Calls embedding provider (Phase 8D infrastructure).
  • Writes results to vector store (Qdrant/pgvector).
  • Publishes completion to cruvero.embed.results.
  • DLQ for persistent failures after retries.

6. Activity Refactor (internal/memory/embedding.go)

  • Current synchronous embedding call replaced with NATS publish.
  • Two modes:
    • async (default): Publish and continue. Embedding available eventually.
    • sync: Publish and wait for result on cruvero.embed.results with timeout. Falls back to direct call on timeout.
  • Mode configurable: CRUVERO_EMBED_MODE=async|sync|direct (direct = current behavior, no NATS).

7. Configuration

# MCP Discovery
CRUVERO_MCP_DISCOVERY=static|nats|both (default: static)
CRUVERO_MCP_HEARTBEAT_INTERVAL=30s
CRUVERO_MCP_STALE_THRESHOLD=90s

# Embedding Pipeline
CRUVERO_EMBED_MODE=async|sync|direct (default: direct)
CRUVERO_EMBED_BATCH_SIZE=32
CRUVERO_EMBED_FLUSH_MS=500
CRUVERO_EMBED_DLQ_MAX_RETRIES=3
CRUVERO_EMBED_WORKER_CONCURRENCY=4

8. Tests

  • MCP announce → bridge picks up new tools within 5s.
  • MCP deregister → tools marked unavailable.
  • Stale MCP server → health check fails → tools return error.
  • Embedding request → batched → written to vector store.
  • Embedding DLQ → failed requests retried, then dead-lettered.
  • Direct mode still works when NATS unavailable.

Files to Add/Change

  • internal/tools/mcp_bridge.go (add NATS discovery)
  • internal/tools/mcp_announce.go (announce types)
  • internal/memory/embedding.go (async mode)
  • cmd/mcp-announce/main.go
  • cmd/embed-worker/main.go
  • internal/events/streams.go (JetStream stream definitions)
  • internal/config/config.go (new config vars)

Exit Criteria

  • MCP server dynamically registers tools via NATS announce
  • MCP health check detects and marks stale servers
  • Embedding requests batched and processed asynchronously
  • DLQ captures failed embeddings for retry/investigation
  • All features degrade gracefully when NATS unavailable
  • Static MCP config continues to work as fallback

Dependencies

  • Phase 12A complete
  • Phase 8D (embedding infrastructure) complete

📡 Phase 12D: Audit Buffering, Telemetry Pipeline & Ops Tooling

Goal: Move audit and telemetry writes to buffered NATS consumers, add operational tooling for the event bus.

Duration: 1–1.5 weeks

Deliverables

1. Audit Event Buffering

JetStream stream: CRUVERO_AUDIT

  • Phase 9C audit events publish to cruvero.audit.events instead of direct Postgres writes.
  • Dedicated audit consumer batches writes (configurable: CRUVERO_AUDIT_BATCH_SIZE=50, CRUVERO_AUDIT_FLUSH_MS=1000).
  • JetStream AckExplicit ensures no audit event is lost — consumer acks only after successful Postgres write.
  • Retention: configurable stream retention matching audit retention policy.
  • PII detection still runs inline before publishing (PII flags travel with the event).

2. Telemetry Pipeline

cruvero.telemetry.metrics    — cost, latency, token usage per step
cruvero.telemetry.spans      — lightweight span summaries for non-OTEL consumers

  • Activities publish lightweight telemetry events alongside workflow events.
  • Consumer writes to Prometheus push gateway, Grafana Loki, or custom sinks.
  • Decouples telemetry collection from the OTEL exporter — operators can add custom sinks without modifying worker code.

3. Event Bus Health Dashboard

  • New UI page: /events.html
  • Shows: NATS connection status, stream stats (messages, consumers, lag), subject activity heatmap.
  • Uses NATS monitoring endpoint (http://nats:8222/jsz) for stream metadata.

4. Event Bus CLI (cmd/event-bus)

go run ./cmd/event-bus status                       # connection + stream health
go run ./cmd/event-bus streams                      # list JetStream streams
go run ./cmd/event-bus consumers                    # list active consumers
go run ./cmd/event-bus subscribe <subject>          # live subscribe for debugging
go run ./cmd/event-bus publish <subject> <payload>  # manual event injection
go run ./cmd/event-bus purge <stream>               # purge stream (with confirmation)

5. Tenant Isolation

  • Multi-tenant event isolation via subject prefix: cruvero.<tenant_id>.events.*
  • JetStream streams can be per-tenant or shared with subject filtering.
  • Respects Phase 9A tenant boundaries — consumers only see their tenant's events.

6. Debian Deployment Update

  • Update Debian deployment guide with NATS service configuration.
  • NATS memory budget: ~64–128MB (well within 8GB constraint).
  • JetStream file store for persistence (vs memory store).
  • Systemd unit file for NATS.
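
A minimal unit file along these lines could work (the binary path, store directory, user, and the MemoryMax cap are illustrative assumptions, not taken from the deployment guide):

```ini
[Unit]
Description=NATS Server with JetStream
After=network-online.target
Wants=network-online.target

[Service]
User=nats
ExecStart=/usr/local/bin/nats-server --jetstream --store_dir /var/lib/nats
Restart=on-failure
RestartSec=2
# Matches the 128MB budget from the resource table
MemoryMax=128M

[Install]
WantedBy=multi-user.target
```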

7. Configuration

# Audit Buffering
CRUVERO_AUDIT_BUFFER=nats|direct (default: direct)
CRUVERO_AUDIT_BATCH_SIZE=50
CRUVERO_AUDIT_FLUSH_MS=1000
CRUVERO_AUDIT_STREAM_RETENTION=365d

# Telemetry
CRUVERO_TELEMETRY_NATS=true|false (default: false)
CRUVERO_TELEMETRY_SUBJECTS=metrics,spans

# Tenant Isolation
CRUVERO_EVENTS_TENANT_ISOLATION=true|false (default: false)

8. Tests

  • Audit events buffered and batch-written (verify batch size and flush timing).
  • No audit event lost during consumer restart (JetStream redelivery).
  • Tenant A cannot subscribe to Tenant B's events.
  • Event bus CLI reports accurate stream statistics.
  • Debian deployment: NATS starts, worker connects, events flow.

Files to Add/Change

  • internal/audit/nats_writer.go (buffered audit consumer)
  • internal/events/telemetry.go (telemetry publisher)
  • cmd/event-bus/main.go (ops CLI)
  • cmd/ui/ui/events.html (event bus dashboard)
  • cmd/ui/main.go (events API endpoints)
  • internal/events/tenant.go (tenant subject routing)
  • docs/debian-deploy.md (NATS section)

Exit Criteria

  • Audit events buffered through JetStream with zero loss
  • Batch writes reduce Postgres write amplification by 10x+
  • Event bus CLI provides operational visibility
  • Tenant isolation enforced at subject level
  • Debian deployment guide updated with NATS
  • All features disabled cleanly with CRUVERO_EVENTS_BACKEND=none

Dependencies

  • Phase 12A and 12B complete
  • Phase 9A (multi-tenancy) and 9C (audit) complete

Timeline

Sub-Phase                                  Duration      Cumulative
12A: Core Event Bus                        1.5–2 weeks   2 weeks
12B: SSE Refactor & Consumers              1–1.5 weeks   3.5 weeks
12C: MCP Discovery & Embedding Pipeline    1.5–2 weeks   5.5 weeks
12D: Audit Buffering & Ops Tooling         1–1.5 weeks   7 weeks

Total estimated duration: 5–7 weeks

12A and 12B can be developed sequentially (B depends on A). 12C and 12D can begin once 12A is complete and run in parallel with 12B.


Resource Impact (8GB Debian Target)

Component         Memory          Disk      Notes
NATS server       64–128 MB       Minimal   JetStream file store
Embed worker      32–64 MB        None      Separate process, optional
Event consumers   16–32 MB each   None      Run in-process or separate
Total addition    ~128–256 MB     < 1 GB    Fits within 8GB budget

NATS is one of the lightest messaging systems available. Alpine image is ~18MB. JetStream file store avoids memory pressure for persistence.


Risk Mitigation

Risk                         Mitigation
NATS becomes critical path   Strict fire-and-forget in activities. Health checks log warnings, never block.
Event ordering               JetStream provides per-subject ordering. Cross-subject ordering not guaranteed (and not needed).
Message loss                 JetStream with AckExplicit for audit/embed. Best-effort for observability events.
Operational complexity       CRUVERO_EVENTS_BACKEND=none disables everything. Zero NATS dependency for core workflows.
8GB memory pressure          NATS server capped at 128MB. JetStream uses file store. Embed worker optional.
Schema evolution             Events are JSON with versioned type field. Consumers handle unknown types gracefully.

Environment Variables Summary

# Core
CRUVERO_EVENTS_BACKEND=nats|log|none
CRUVERO_NATS_URL=nats://localhost:4222
CRUVERO_NATS_CLUSTER_ID=cruvero
CRUVERO_NATS_CREDS_FILE=
CRUVERO_NATS_TLS=auto|false
CRUVERO_NATS_CONNECT_TIMEOUT=5s
CRUVERO_NATS_RECONNECT_WAIT=2s
CRUVERO_NATS_MAX_RECONNECTS=60
CRUVERO_EVENTS_SUBJECT_PREFIX=cruvero

# MCP Discovery
CRUVERO_MCP_DISCOVERY=static|nats|both
CRUVERO_MCP_HEARTBEAT_INTERVAL=30s
CRUVERO_MCP_STALE_THRESHOLD=90s

# Embedding Pipeline
CRUVERO_EMBED_MODE=async|sync|direct
CRUVERO_EMBED_BATCH_SIZE=32
CRUVERO_EMBED_FLUSH_MS=500
CRUVERO_EMBED_DLQ_MAX_RETRIES=3
CRUVERO_EMBED_WORKER_CONCURRENCY=4

# Audit Buffering
CRUVERO_AUDIT_BUFFER=nats|direct
CRUVERO_AUDIT_BATCH_SIZE=50
CRUVERO_AUDIT_FLUSH_MS=1000

# Telemetry
CRUVERO_TELEMETRY_NATS=true|false

# Tenant Events
CRUVERO_EVENTS_TENANT_ISOLATION=true|false

Success Metrics

Metric                                         Target
Event publish latency (activity overhead)      < 1ms p99
SSE delivery latency (event → browser)         < 100ms p99
Audit event loss rate                          0% (JetStream guarantee)
MCP discovery latency (announce → available)   < 5s
Embedding pipeline throughput                  100+ embeddings/sec batched
Worker impact when NATS down                   Zero (no errors, no latency change)
Memory overhead (NATS server)                  < 128MB
Test coverage                                  >= 80% for internal/events/ (enforced by scripts/check-coverage.sh)

Code Quality Requirements (SonarQube)

All Go code produced by Phase 12 prompts must pass SonarQube quality gates. Each PROMPT.md file includes these constraints in every prompt's Constraints section:

  • Error handling: Every returned error must be handled explicitly — never ignore with _
  • Cyclomatic complexity: Keep functions focused and small; extract complex logic into well-named helpers. Functions under 50 lines where practical
  • No dead code: No unused variables, empty blocks, or duplicated logic
  • Resource cleanup: Close all resources (DB rows, response bodies, NATS connections) with proper defer patterns
  • Early returns: Avoid deeply nested conditionals — prefer guard clauses
  • No magic values: Use named constants for strings and numbers
  • No hardcoded secrets: All credentials via env vars
  • Meaningful names: Descriptive variable and function names
  • Linting gate: Run go vet ./internal/events/..., staticcheck ./internal/events/..., and golangci-lint run ./internal/events/... before considering the prompt complete — these catch most issues SonarQube flags
  • Test coverage: 80%+ coverage target for internal/events/

Each sub-phase Exit Criteria section includes:

  • [ ] go vet ./internal/events/... reports no issues
  • [ ] staticcheck ./internal/events/... reports no issues
  • [ ] No functions exceed 50 lines (extract helpers as needed)
  • [ ] All returned errors are handled (no _ = err patterns)

Open Questions

  1. Core NATS or JetStream everywhere? — JetStream adds persistence but also complexity. Recommendation: JetStream for audit and embedding (events that must not be lost), core NATS for observability events (best-effort is fine).
  2. Embedded NATS? — NATS can run embedded in the Go worker process. Simpler deployment but less isolation. Recommendation: External process for production, embedded option for single-node dev.
  3. Event schema registry? — Should events have a formal schema registry? Recommendation: Start with documented JSON conventions, add formal schemas if consumer ecosystem grows.
  4. NATS auth model? — NKey auth, JWT auth, or none? Recommendation: None for dev, NKey for production single-tenant, JWT for multi-tenant.

Relationship to Other Phases

Phase                         Relationship
Phase 8C (Observability)      12D adds NATS-based telemetry pipeline as alternative sink
Phase 8D (Embeddings)         12C decouples embedding generation into async pipeline
Phase 9A (Multi-tenancy)      12D adds tenant-scoped event isolation
Phase 9C (Audit)              12D buffers audit writes through JetStream
Phase 10 (Neuro)              Event bus enables real-time metacognitive monitoring
Phase 11C (Streaming)         NATS can carry LLM token streams to UI
Phase 11G (Reward Learning)   Reward signals can be published as events
Teams/Telegram Bots           Bots subscribe to events instead of polling