Phase 25 — MCP Enterprise Architecture
Evolves Cruvero's MCP integration from stdio-only subprocess management to a gateway-mediated Streamable HTTP architecture with per-integration scaling, persistent server registry, shared caching, circuit breakers, Vault-backed credential isolation, and Kubernetes-native deployment. Enables 1,000+ concurrent agents across 24+ integrations.
Status: Planned
Depends on: Phase 12C (NATS events for MCP dynamic discovery)
Migrations: 0030_mcp_server_registry (Phase 25B)
Branch: dev
Why Now
With Phases 1-19 complete plus MCP dynamic discovery (Phase 12C), Cruvero's MCP bridge handles static and NATS-discovered servers but has structural limits:
- Stdio-only transport — every MCP server must be a local subprocess on the worker host.
- No gateway — direct worker-to-server connections; no federation, centralized auth, or RBAC.
- No response caching — every `CallTool` hits the upstream MCP server.
- No circuit breakers — failing servers cause cascading timeouts.
- No retry policy — single-attempt calls; transient failures are not retried.
- In-memory state — dynamic registrations lost on restart.
- Credential coupling — flat env vars shared across tenants.
- No persistent registry — server metadata exists only in memory.
- Linear discovery — `ListDefinitions` iterates servers sequentially.
- No observability — no OTel spans or Prometheus metrics on MCP calls.
Architecture
Target Architecture
Design Principles
- Temporal as spine — all MCP tool calls remain Temporal activities.
- Gateway mediates all calls — no direct worker-to-server in production.
- Streamable HTTP everywhere — stdio servers wrapped at Docker build time.
- Vault for credentials — per-tenant secrets in Vault namespaces.
- KEDA for scaling — per-integration autoscaling based on Temporal queue depth.
- Dragonfly for caching — shared response/token/rate-limit cache.
- Application-level TLS — cert-manager-managed certificates.
Layer Summary
| Layer | Component | Namespace | Purpose |
|---|---|---|---|
| Orchestration | Temporal workers | cruvero | Workflow execution, MCP activity dispatch |
| Gateway | AgentGateway + kgateway | mcp-infra | Federation, RBAC, auth, routing |
| Servers | Per-integration Deployments | mcp-servers | One Deployment per integration type |
| Caching | Dragonfly | mcp-infra | Response cache, token cache, rate counters |
| Autoscaling | KEDA ScaledObjects | mcp-servers | Per-queue scaling, scale-to-zero |
| Credentials | Vault + ESO + sidecars | mcp-infra | Per-tenant secret isolation |
| Encryption | cert-manager + app-level TLS | cluster-wide | mTLS between workers and MCP servers |
| Observability | OTel Collector + Prometheus | monitoring | Traces, metrics, dashboards |
| Discovery | PostgreSQL + NATS | cruvero | Persistent server registry + event bus |
Reliability Stack
```
Temporal Activity Retry (outer)
  → CircuitBreaker (per-server, application-level)
    → RetryTransport (exponential backoff + jitter, transient errors only)
      → CachingTransport (Dragonfly, read-only tools)
        → HTTPTransport (TLS-enabled, connection pooling)
```
Core Types and Interfaces
```go
// Transport interface — all transports implement this
type Transport interface {
	Start(ctx context.Context) error
	Initialize(ctx context.Context) error
	ListTools(ctx context.Context) ([]Tool, error)
	CallTool(ctx context.Context, name string, args json.RawMessage) (json.RawMessage, error)
	Close() error
}

type TransportConfig struct {
	Type       string // "stdio", "http", "gateway"
	Command    string
	Args       []string
	Env        []string
	Endpoint   string
	Timeout    time.Duration
	MaxConns   int
	GatewayURL string
	ServerName string
	AuthToken  string
}
```
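Each transport mode requires a different subset of these fields. A sketch of the per-mode invariants, assuming a hypothetical `validate` helper over a trimmed-down config (the endpoint URL below is illustrative):

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// TransportConfig trimmed to the fields each mode requires; validate is a
// hypothetical helper illustrating the invariants, not the real code.
type TransportConfig struct {
	Type       string
	Command    string
	Endpoint   string
	GatewayURL string
	ServerName string
	Timeout    time.Duration
}

func validate(cfg TransportConfig) error {
	switch cfg.Type {
	case "stdio":
		if cfg.Command == "" {
			return errors.New("stdio transport requires Command")
		}
	case "http":
		if cfg.Endpoint == "" {
			return errors.New("http transport requires Endpoint")
		}
	case "gateway":
		if cfg.GatewayURL == "" || cfg.ServerName == "" {
			return errors.New("gateway transport requires GatewayURL and ServerName")
		}
	default:
		return fmt.Errorf("unknown transport type %q", cfg.Type)
	}
	return nil
}

func main() {
	fmt.Println(validate(TransportConfig{Type: "http", Endpoint: "https://example.internal/mcp"}))
	fmt.Println(validate(TransportConfig{Type: "gateway"}))
}
```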
```go
// MCPServerRecord — persistent server registry
type MCPServerRecord struct {
	ID               string          `json:"id"`
	TenantID         string          `json:"tenant_id"`
	Name             string          `json:"name"`
	Transport        string          `json:"transport"`
	Endpoint         string          `json:"endpoint,omitempty"`
	Command          string          `json:"command,omitempty"`
	Args             []string        `json:"args,omitempty"`
	EnvVars          json.RawMessage `json:"env_vars,omitempty"`
	AllowedEndpoints []string        `json:"allowed_endpoints,omitempty"`
	Debug            bool            `json:"debug"`
	Source           string          `json:"source"` // "env" or "database"
	Status           string          `json:"status"`
	ToolCount        int             `json:"tool_count"`
	Version          string          `json:"version,omitempty"`
	Capabilities     json.RawMessage `json:"capabilities,omitempty"`
	HealthStatus     string          `json:"health_status"`
	LastHealthy      *time.Time      `json:"last_healthy,omitempty"`
	LastSeen         *time.Time      `json:"last_seen,omitempty"`
	CreatedAt        time.Time       `json:"created_at"`
	UpdatedAt        time.Time       `json:"updated_at"`
}

type MCPServerStore interface {
	Upsert(ctx context.Context, record MCPServerRecord) error
	Get(ctx context.Context, tenantID, name string) (MCPServerRecord, error)
	List(ctx context.Context, tenantID string) ([]MCPServerRecord, error)
	ListBySource(ctx context.Context, tenantID, source string) ([]MCPServerRecord, error)
	UpdateHealth(ctx context.Context, tenantID, name, healthStatus string) error
	UpdateStatus(ctx context.Context, tenantID, name, status string) error
	Delete(ctx context.Context, tenantID, name string) error
}
```
```go
// MCPCache — response caching
type MCPCache interface {
	Get(ctx context.Context, key string) (json.RawMessage, bool)
	Set(ctx context.Context, key string, value json.RawMessage, ttl time.Duration)
}

// CircuitBreaker — per-server failure isolation
type CircuitBreaker struct {
	state            State // closed, half-open, open
	failureThreshold int
	successThreshold int
	halfOpenInterval time.Duration
}

// CredentialProvider — Vault or env var credential resolution
type CredentialProvider interface {
	GetCredentials(ctx context.Context, tenantID, integration string) (map[string]string, error)
}
```
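The breaker moves from closed to open after `failureThreshold` consecutive failures, probes in half-open, and closes again after `successThreshold` successes. A minimal sketch of that state machine (time-based probing via `halfOpenInterval` is replaced by an explicit `Probe` call here; names other than the thresholds are illustrative):

```go
package main

import "fmt"

type State int

const (
	Closed State = iota
	Open
	HalfOpen
)

// breaker is a hypothetical minimal version of the per-server CircuitBreaker.
type breaker struct {
	state            State
	failures         int
	successes        int
	failureThreshold int
	successThreshold int
}

// Allow reports whether a call may proceed.
func (b *breaker) Allow() bool { return b.state != Open }

// RecordFailure opens the breaker once consecutive failures hit the threshold.
func (b *breaker) RecordFailure() {
	b.successes = 0
	b.failures++
	if b.failures >= b.failureThreshold {
		b.state = Open
	}
}

// RecordSuccess closes a half-open breaker after enough successful probes.
func (b *breaker) RecordSuccess() {
	b.failures = 0
	if b.state == HalfOpen {
		b.successes++
		if b.successes >= b.successThreshold {
			b.state = Closed
			b.successes = 0
		}
	}
}

// Probe moves an open breaker to half-open; the real implementation does
// this after halfOpenInterval elapses.
func (b *breaker) Probe() {
	if b.state == Open {
		b.state = HalfOpen
	}
}

func main() {
	b := &breaker{failureThreshold: 3, successThreshold: 2}
	b.RecordFailure()
	b.RecordFailure()
	b.RecordFailure()
	fmt.Println(b.Allow()) // false: opened after 3 failures
	b.Probe()
	b.RecordSuccess()
	b.RecordSuccess()
	fmt.Println(b.Allow(), b.state == Closed) // true true
}
```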
Sub-Phases
| Sub-Phase | Name | Prompts | Depends On |
|---|---|---|---|
| 25A | Transport Abstraction + HTTP | 3 | — |
| 25B | Code Exec MCP + Registry + Admin UI | 4 | 25A |
| 25C | Gateway Integration + Worker Routing | 2 | 25A, 25B |
| 25D | Caching + Circuit Breakers + Vault | 4 | 25A |
| 25E | Kubernetes + TLS + Observability | 3 | 25A-25D |
Total: 5 sub-phases, 16 prompts, 11 documentation files
Dependency Graph
```
25A (Transport) ──┬──→ 25B (Registry) ──────┬──→ 25C (Gateway)
                  │                         │
                  └──→ 25D (Cache/Vault) ───┴──→ 25E (K8s/TLS/Observability)
```
25A is the prerequisite for all other sub-phases. 25B and 25D can run in parallel after 25A. 25C depends on 25A and 25B. 25E depends on all preceding sub-phases.
Environment Variables (17 new)
| Variable | Default | Description |
|---|---|---|
| CRUVERO_MCP_TRANSPORT | stdio | Transport mode: stdio, http, gateway |
| CRUVERO_MCP_GATEWAY_URL | — | AgentGateway endpoint URL |
| CRUVERO_MCP_GATEWAY_AUTH | — | Gateway auth mode: jwt, apikey, none |
| CRUVERO_MCP_HTTP_TIMEOUT | 30s | HTTP transport request timeout |
| CRUVERO_MCP_HTTP_MAX_CONNS | 100 | Max HTTP connections per server |
| CRUVERO_MCP_RETRY_MAX | 3 | Max retry attempts for transient failures |
| CRUVERO_MCP_RETRY_BACKOFF | 1s | Initial retry backoff duration |
| CRUVERO_MCP_CACHE_ENABLED | false | Enable MCP response caching |
| CRUVERO_MCP_CACHE_TTL | 60s | Default cache TTL |
| CRUVERO_MCP_CACHE_ADDR | — | Cache address (defaults to CRUVERO_DRAGONFLY_ADDR) |
| CRUVERO_MCP_REGISTRY_ENABLED | false | Enable persistent MCP server registry |
| CRUVERO_MCP_VAULT_ENABLED | false | Enable Vault credential resolution for MCP |
| CRUVERO_MCP_VAULT_PATH | admin/tenant-{id}/kv | Vault path template for MCP credentials |
| CRUVERO_MCP_TLS_ENABLED | false | Enable TLS for MCP HTTP transport |
| CRUVERO_MCP_TLS_CA_CERT | — | Path to CA certificate bundle |
| CRUVERO_MCP_TLS_CERT | — | Path to client certificate (for mTLS) |
| CRUVERO_MCP_TLS_KEY | — | Path to client private key (for mTLS) |
None conflict with existing CRUVERO_MCP_* variables.
Files Overview
New Files
| File | Sub-Phase | Description | Est. Lines |
|---|---|---|---|
| internal/mcp/transport.go | 25A | Transport interface, TransportConfig struct | ~60 |
| internal/mcp/transport_stdio.go | 25A | StdioTransport — extracted from current NewClient | ~80 |
| internal/mcp/transport_http.go | 25A | HTTPTransport — Streamable HTTP via mcp-go | ~120 |
| internal/mcp/transport_gateway.go | 25C | GatewayTransport — gateway-aware routing headers | ~100 |
| internal/mcp/tls.go | 25E | BuildTLSConfig — TLS config builder | ~60 |
| internal/mcp/retry.go | 25D | Retry policy with exponential backoff + jitter | ~80 |
| internal/mcp/cache.go | 25D | MCPCache interface, DragonflyCache implementation | ~120 |
| internal/mcp/circuit.go | 25D | CircuitBreaker per-server state machine | ~150 |
| internal/mcp/store.go | 25B | MCPServerStore interface, MCPServerRecord struct | ~60 |
| internal/mcp/store_postgres.go | 25B | PostgresMCPServerStore implementation | ~200 |
| internal/mcp/vault.go | 25D | CredentialProvider, VaultCredentialProvider, EnvCredentialProvider | ~120 |
| internal/mcp/observability.go | 25E | OTel span helpers, Prometheus metric registration | ~100 |
| cmd/mcp-code-exec/main.go | 25B | Code execution MCP server entrypoint | ~200 |
| migrations/0030_mcp_server_registry.up.sql | 25B | mcp_servers table | ~30 |
| migrations/0030_mcp_server_registry.down.sql | 25B | Drop mcp_servers table | ~1 |
Modified Files
| File | Sub-Phase | Change |
|---|---|---|
| internal/mcp/client.go | 25A | Refactor to accept Transport interface |
| internal/mcp/config.go | 25A, 25B | Add transport, cache, retry, TLS, vault fields |
| internal/tools/mcp_bridge.go | 25A, 25B, 25C | Transport-mode-aware init, lifecycle methods, gateway mode |
| internal/tools/mcp_announce.go | 25B | Add MCPGatewayHealth and MCPRegistrySync event types |
| internal/config/config.go | 25A | Add 17 MCP config fields with validation |
| cmd/worker/main.go | 25A, 25B | Transport-aware init, dual-source loading |
| cmd/ui/mcp_api.go | 25B | Inject bridge, live status, CRUD endpoints |
| cmd/ui/frontend/src/pages/McpStatusPage.tsx | 25B | Live status badges, add/remove/restart controls |
Success Metrics
| Metric | Target |
|---|---|
| HTTP transport latency overhead vs stdio | < 5ms p99 |
| Gateway-mediated call latency | < 100ms p99 (excl. upstream) |
| Cache hit rate | > 60% for read-heavy integrations |
| Scale-to-zero cold start | < 30s from task arrival to tool response |
| Circuit breaker recovery | < 60s after upstream recovers |
| Concurrent agent support | 1,000 agents with 24 integrations |
| Test coverage | >= 80% for internal/mcp/ and internal/tools/ |
Code Quality Requirements (SonarQube)
All Go code produced by Phase 25 prompts must pass SonarQube quality gates:
- Error handling: Every returned error must be handled explicitly
- Cyclomatic complexity: Functions under 50 lines where practical
- No dead code: No unused variables, empty blocks, or duplicated logic
- Resource cleanup: Close all resources with proper `defer` patterns
- Early returns: Prefer guard clauses over deeply nested conditionals
- No magic values: Use named constants for strings and numbers
- Meaningful names: Descriptive variable and function names
- Linting gate: Run `go vet`, `staticcheck`, and `golangci-lint run` before considering the prompt complete
Each sub-phase Exit Criteria section includes:
- [ ] `go vet ./internal/mcp/...` reports no issues
- [ ] `staticcheck ./internal/mcp/...` reports no issues
- [ ] No functions exceed 50 lines (extract helpers as needed)
- [ ] All returned errors are handled (no `_ = err` patterns)
Risk Mitigation
| Risk | Mitigation |
|---|---|
| AgentGateway external dependency | Gateway mode is opt-in (CRUVERO_MCP_TRANSPORT=gateway). Stdio and HTTP modes work without gateway. |
| Migration 0030 breaks existing deployments | mcp_servers table is additive; CRUVERO_MCP_REGISTRY_ENABLED=false by default. |
| Stdio-to-HTTP wrapping complexity | Standard pattern via mcp-proxy/supergateway at Docker build time. |
| Vault integration complexity | VaultCredentialProvider activates only when CRUVERO_MCP_VAULT_ENABLED=true. EnvCredentialProvider fallback preserves current behavior. |
| KEDA autoscaling stability | 300-600s cooldown periods. Graceful shutdown handlers. Temporal activity heartbeating. |
| TLS certificate management | cert-manager automates issuance/renewal. Application-level TLS avoids service mesh overhead. |
Relationship to Other Phases
| Phase | Relationship |
|---|---|
| Phase 12 (Events/NATS) | MCP dynamic discovery uses NATS subjects from Phase 12C. Gateway-aware lifecycle events extend the existing event schema. |
| Phase 14 (Production API) | API layer can expose MCP server registry endpoints for dashboards. No direct dependency. |
| Phase 19 (Tool Registry Restructure) | Tool quality scoring applies to MCP tools through Manager.Execute. Persistent MCP registry (0030) complements tool metrics store (0026). |
| Phase 21 (Kubernetes Deployment) | Phase 21 defines core service k8s manifests. Phase 25 MCP manifests follow the same patterns but target mcp-servers and mcp-infra namespaces. |
| Phase 24 (Context Management) | Orthogonal — context management sits in agent/LLM layer, MCP sits in transport layer. |
Progress Notes
(none yet)