
Phase 25 — MCP Enterprise Architecture

Evolves Cruvero's MCP integration from stdio-only subprocess management to a gateway-mediated Streamable HTTP architecture with per-integration scaling, persistent server registry, shared caching, circuit breakers, Vault-backed credential isolation, and Kubernetes-native deployment. Enables 1,000+ concurrent agents across 24+ integrations.

Status: Planned
Depends on: Phase 12C (NATS events for MCP dynamic discovery)
Migrations: 0030_mcp_server_registry (Phase 25B)
Branch: dev


Why Now

With Phases 1-19 complete plus MCP dynamic discovery (Phase 12C), Cruvero's MCP bridge handles static and NATS-discovered servers but has structural limits:

  1. Stdio-only transport — every MCP server must be a local subprocess on the worker host.
  2. No gateway — direct worker-to-server connections; no federation, centralized auth, or RBAC.
  3. No response caching — every CallTool hits the upstream MCP server.
  4. No circuit breakers — failing servers cause cascading timeouts.
  5. No retry policy — single-attempt calls; transient failures are not retried.
  6. In-memory state — dynamic registrations lost on restart.
  7. Credential coupling — flat env vars shared across tenants.
  8. No persistent registry — server metadata exists only in memory.
  9. Linear discovery — ListDefinitions iterates servers sequentially.
  10. No observability — no OTel spans or Prometheus metrics on MCP calls.

Architecture

Target Architecture

Design Principles

  1. Temporal as spine — all MCP tool calls remain Temporal activities.
  2. Gateway mediates all calls — no direct worker-to-server in production.
  3. Streamable HTTP everywhere — stdio servers wrapped at Docker build time.
  4. Vault for credentials — per-tenant secrets in Vault namespaces.
  5. KEDA for scaling — per-integration autoscaling based on Temporal queue depth.
  6. Dragonfly for caching — shared response/token/rate-limit cache.
  7. Application-level TLS — cert-manager-managed certificates.

Layer Summary

| Layer | Component | Namespace | Purpose |
|---|---|---|---|
| Orchestration | Temporal workers | cruvero | Workflow execution, MCP activity dispatch |
| Gateway | AgentGateway + kgateway | mcp-infra | Federation, RBAC, auth, routing |
| Servers | Per-integration Deployments | mcp-servers | One Deployment per integration type |
| Caching | Dragonfly | mcp-infra | Response cache, token cache, rate counters |
| Autoscaling | KEDA ScaledObjects | mcp-servers | Per-queue scaling, scale-to-zero |
| Credentials | Vault + ESO + sidecars | mcp-infra | Per-tenant secret isolation |
| Encryption | cert-manager + app-level TLS | cluster-wide | mTLS between workers and MCP servers |
| Observability | OTel Collector + Prometheus | monitoring | Traces, metrics, dashboards |
| Discovery | PostgreSQL + NATS | cruvero | Persistent server registry + event bus |

Reliability Stack

Temporal Activity Retry (outer)
→ CircuitBreaker (per-server, application-level)
→ RetryTransport (exponential backoff + jitter, transient errors only)
→ CachingTransport (Dragonfly, read-only tools)
→ HTTPTransport (TLS-enabled, connection pooling)

Core Types and Interfaces

```go
// Transport interface — all transports implement this
type Transport interface {
	Start(ctx context.Context) error
	Initialize(ctx context.Context) error
	ListTools(ctx context.Context) ([]Tool, error)
	CallTool(ctx context.Context, name string, args json.RawMessage) (json.RawMessage, error)
	Close() error
}
```

```go
type TransportConfig struct {
	Type       string // "stdio", "http", "gateway"
	Command    string
	Args       []string
	Env        []string
	Endpoint   string
	Timeout    time.Duration
	MaxConns   int
	GatewayURL string
	ServerName string
	AuthToken  string
}
```

```go
// MCPServerRecord — persistent server registry
type MCPServerRecord struct {
	ID               string          `json:"id"`
	TenantID         string          `json:"tenant_id"`
	Name             string          `json:"name"`
	Transport        string          `json:"transport"`
	Endpoint         string          `json:"endpoint,omitempty"`
	Command          string          `json:"command,omitempty"`
	Args             []string        `json:"args,omitempty"`
	EnvVars          json.RawMessage `json:"env_vars,omitempty"`
	AllowedEndpoints []string        `json:"allowed_endpoints,omitempty"`
	Debug            bool            `json:"debug"`
	Source           string          `json:"source"` // "env" or "database"
	Status           string          `json:"status"`
	ToolCount        int             `json:"tool_count"`
	Version          string          `json:"version,omitempty"`
	Capabilities     json.RawMessage `json:"capabilities,omitempty"`
	HealthStatus     string          `json:"health_status"`
	LastHealthy      *time.Time      `json:"last_healthy,omitempty"`
	LastSeen         *time.Time      `json:"last_seen,omitempty"`
	CreatedAt        time.Time       `json:"created_at"`
	UpdatedAt        time.Time       `json:"updated_at"`
}
```

```go
type MCPServerStore interface {
	Upsert(ctx context.Context, record MCPServerRecord) error
	Get(ctx context.Context, tenantID, name string) (MCPServerRecord, error)
	List(ctx context.Context, tenantID string) ([]MCPServerRecord, error)
	ListBySource(ctx context.Context, tenantID, source string) ([]MCPServerRecord, error)
	UpdateHealth(ctx context.Context, tenantID, name, healthStatus string) error
	UpdateStatus(ctx context.Context, tenantID, name, status string) error
	Delete(ctx context.Context, tenantID, name string) error
}
```

```go
// MCPCache — response caching
type MCPCache interface {
	Get(ctx context.Context, key string) (json.RawMessage, bool)
	Set(ctx context.Context, key string, value json.RawMessage, ttl time.Duration)
}
```

```go
// CircuitBreaker — per-server failure isolation
type CircuitBreaker struct {
	state            State // closed, half-open, open
	failureThreshold int
	successThreshold int
	halfOpenInterval time.Duration
}
```

```go
// CredentialProvider — Vault or env var credential resolution
type CredentialProvider interface {
	GetCredentials(ctx context.Context, tenantID, integration string) (map[string]string, error)
}
```

Sub-Phases

| Sub-Phase | Name | Prompts | Depends On |
|---|---|---|---|
| 25A | Transport Abstraction + HTTP | 3 | (none) |
| 25B | Code Exec MCP + Registry + Admin UI | 4 | 25A |
| 25C | Gateway Integration + Worker Routing | 2 | 25A, 25B |
| 25D | Caching + Circuit Breakers + Vault | 4 | 25A |
| 25E | Kubernetes + TLS + Observability | 3 | 25A-25D |

Total: 5 sub-phases, 16 prompts, 11 documentation files

Dependency Graph

25A (Transport) ──┬──→ 25B (Registry) ──→ 25C (Gateway) ──┐
                  │                                       ├──→ 25E (K8s/TLS/Observability)
                  └──→ 25D (Cache/Vault) ─────────────────┘

25A is the prerequisite for all other sub-phases. 25B and 25D can run in parallel after 25A. 25C depends on 25A and 25B. 25E depends on all preceding sub-phases.


Environment Variables (17 new)

| Variable | Default | Description |
|---|---|---|
| CRUVERO_MCP_TRANSPORT | stdio | Transport mode: stdio, http, gateway |
| CRUVERO_MCP_GATEWAY_URL | (none) | AgentGateway endpoint URL |
| CRUVERO_MCP_GATEWAY_AUTH | (none) | Gateway auth mode: jwt, apikey, none |
| CRUVERO_MCP_HTTP_TIMEOUT | 30s | HTTP transport request timeout |
| CRUVERO_MCP_HTTP_MAX_CONNS | 100 | Max HTTP connections per server |
| CRUVERO_MCP_RETRY_MAX | 3 | Max retry attempts for transient failures |
| CRUVERO_MCP_RETRY_BACKOFF | 1s | Initial retry backoff duration |
| CRUVERO_MCP_CACHE_ENABLED | false | Enable MCP response caching |
| CRUVERO_MCP_CACHE_TTL | 60s | Default cache TTL |
| CRUVERO_MCP_CACHE_ADDR | (none) | Cache address (defaults to CRUVERO_DRAGONFLY_ADDR) |
| CRUVERO_MCP_REGISTRY_ENABLED | false | Enable persistent MCP server registry |
| CRUVERO_MCP_VAULT_ENABLED | false | Enable Vault credential resolution for MCP |
| CRUVERO_MCP_VAULT_PATH | admin/tenant-{id}/kv | Vault path template for MCP credentials |
| CRUVERO_MCP_TLS_ENABLED | false | Enable TLS for MCP HTTP transport |
| CRUVERO_MCP_TLS_CA_CERT | (none) | Path to CA certificate bundle |
| CRUVERO_MCP_TLS_CERT | (none) | Path to client certificate (for mTLS) |
| CRUVERO_MCP_TLS_KEY | (none) | Path to client private key (for mTLS) |

None conflict with existing CRUVERO_MCP_* variables.


Files Overview

New Files

| File | Sub-Phase | Description | Est. Lines |
|---|---|---|---|
| internal/mcp/transport.go | 25A | Transport interface, TransportConfig struct | ~60 |
| internal/mcp/transport_stdio.go | 25A | StdioTransport — extracted from current NewClient | ~80 |
| internal/mcp/transport_http.go | 25A | HTTPTransport — Streamable HTTP via mcp-go | ~120 |
| internal/mcp/transport_gateway.go | 25C | GatewayTransport — gateway-aware routing headers | ~100 |
| internal/mcp/tls.go | 25E | BuildTLSConfig — TLS config builder | ~60 |
| internal/mcp/retry.go | 25D | Retry policy with exponential backoff + jitter | ~80 |
| internal/mcp/cache.go | 25D | MCPCache interface, DragonflyCache implementation | ~120 |
| internal/mcp/circuit.go | 25D | CircuitBreaker per-server state machine | ~150 |
| internal/mcp/store.go | 25B | MCPServerStore interface, MCPServerRecord struct | ~60 |
| internal/mcp/store_postgres.go | 25B | PostgresMCPServerStore implementation | ~200 |
| internal/mcp/vault.go | 25D | CredentialProvider, VaultCredentialProvider, EnvCredentialProvider | ~120 |
| internal/mcp/observability.go | 25E | OTel span helpers, Prometheus metric registration | ~100 |
| cmd/mcp-code-exec/main.go | 25B | Code execution MCP server entrypoint | ~200 |
| migrations/0030_mcp_server_registry.up.sql | 25B | mcp_servers table | ~30 |
| migrations/0030_mcp_server_registry.down.sql | 25B | Drop mcp_servers table | ~1 |

Modified Files

| File | Sub-Phase | Change |
|---|---|---|
| internal/mcp/client.go | 25A | Refactor to accept Transport interface |
| internal/mcp/config.go | 25A, 25B | Add transport, cache, retry, TLS, vault fields |
| internal/tools/mcp_bridge.go | 25A, 25B, 25C | Transport-mode-aware init, lifecycle methods, gateway mode |
| internal/tools/mcp_announce.go | 25B | Add MCPGatewayHealth and MCPRegistrySync event types |
| internal/config/config.go | 25A | Add 17 MCP config fields with validation |
| cmd/worker/main.go | 25A, 25B | Transport-aware init, dual-source loading |
| cmd/ui/mcp_api.go | 25B | Inject bridge, live status, CRUD endpoints |
| cmd/ui/frontend/src/pages/McpStatusPage.tsx | 25B | Live status badges, add/remove/restart controls |

Success Metrics

| Metric | Target |
|---|---|
| HTTP transport latency overhead vs stdio | < 5ms p99 |
| Gateway-mediated call latency | < 100ms p99 (excl. upstream) |
| Cache hit rate | > 60% for read-heavy integrations |
| Scale-to-zero cold start | < 30s from task arrival to tool response |
| Circuit breaker recovery | < 60s after upstream recovers |
| Concurrent agent support | 1,000 agents with 24 integrations |
| Test coverage | >= 80% for internal/mcp/ and internal/tools/ |

Code Quality Requirements (SonarQube)

All Go code produced by Phase 25 prompts must pass SonarQube quality gates:

  • Error handling: Every returned error must be handled explicitly
  • Cyclomatic complexity: Functions under 50 lines where practical
  • No dead code: No unused variables, empty blocks, or duplicated logic
  • Resource cleanup: Close all resources with proper defer patterns
  • Early returns: Prefer guard clauses over deeply nested conditionals
  • No magic values: Use named constants for strings and numbers
  • Meaningful names: Descriptive variable and function names
  • Linting gate: Run go vet, staticcheck, and golangci-lint run before considering the prompt complete

Each sub-phase Exit Criteria section includes:

  • [ ] go vet ./internal/mcp/... reports no issues
  • [ ] staticcheck ./internal/mcp/... reports no issues
  • [ ] No functions exceed 50 lines (extract helpers as needed)
  • [ ] All returned errors are handled (no _ = err patterns)

Risk Mitigation

| Risk | Mitigation |
|---|---|
| AgentGateway external dependency | Gateway mode is opt-in (CRUVERO_MCP_TRANSPORT=gateway). Stdio and HTTP modes work without gateway. |
| Migration 0030 breaks existing deployments | mcp_servers table is additive; CRUVERO_MCP_REGISTRY_ENABLED=false by default. |
| Stdio-to-HTTP wrapping complexity | Standard pattern via mcp-proxy/supergateway at Docker build time. |
| Vault integration complexity | VaultCredentialProvider activates only when CRUVERO_MCP_VAULT_ENABLED=true. EnvCredentialProvider fallback preserves current behavior. |
| KEDA autoscaling stability | 300-600s cooldown periods. Graceful shutdown handlers. Temporal activity heartbeating. |
| TLS certificate management | cert-manager automates issuance/renewal. Application-level TLS avoids service mesh overhead. |

Relationship to Other Phases

| Phase | Relationship |
|---|---|
| Phase 12 (Events/NATS) | MCP dynamic discovery uses NATS subjects from Phase 12C. Gateway-aware lifecycle events extend the existing event schema. |
| Phase 14 (Production API) | API layer can expose MCP server registry endpoints for dashboards. No direct dependency. |
| Phase 19 (Tool Registry Restructure) | Tool quality scoring applies to MCP tools through Manager.Execute. Persistent MCP registry (0030) complements tool metrics store (0026). |
| Phase 21 (Kubernetes Deployment) | Phase 21 defines core service k8s manifests. Phase 25 MCP manifests follow the same patterns but target the mcp-servers and mcp-infra namespaces. |
| Phase 24 (Context Management) | Orthogonal — context management sits in the agent/LLM layer; MCP sits in the transport layer. |

Progress Notes

(none yet)