
Phase 24 — Enterprise Context Management

Closes the remaining high-ROI context optimization gaps: accurate tokenization, provider-level prompt caching, observation masking, tool schema compression, multi-turn conversation, rolling summaries, per-tenant budget policies, proactive compression, a composable context pipeline, OTel monitoring, waste detection, and serialization optimization. The twelve improvements are organized into three sequential tiers; items within a tier are independent and parallelizable.

Status: Planned
Depends on: Phases 1-19 complete (tool quality scoring from Phase 19 feeds waste detection)
Migrations: none (all state in AgentState/TenantConfig.Metadata)
Branch: dev


Why Now

With Phases 1-19 complete, Cruvero has phase-aware budgets, 5-component salience scoring, semantic tool search, and multi-tenant isolation. But several high-ROI optimizations remain:

  1. Inaccurate tokenization — Heuristic chars-per-token ratios (±15-20% error) waste ~25K tokens on a 128K window.
  2. No prompt caching — All major LLM APIs support it; Pricing.InputCacheRead/InputCacheWrite fields exist but are never used.
  3. Full observation retention — Tool outputs stay in context long after the LLM has acted on them.
  4. Verbose tool schemas — 100-1,000 bytes per tool; 20+ tools consume 5-15% of context budget.
  5. Single-turn interaction — Fresh {system, user} pair every step; no prior reasoning available.
  6. One-shot summarization — Replaces previous summary entirely, losing older information.
  7. Hardcoded budget percentages — No per-tenant customization.
  8. Reactive-only compression — Only triggers after overflow.
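To make gap 1 concrete, here is a minimal sketch of a `Tokenizer` selector keyed off `CRUVERO_TOKENIZER_MODE`. The interface matches the Core Types section below; in the real phase the `bpe` branch would construct a `BPETokenizer` backed by tiktoken-go, which is omitted here to keep the sketch dependency-free, so both branches fall back to the heuristic. The 4.0 chars-per-token constant is an assumption for illustration.

```go
package main

import (
	"fmt"
	"math"
	"os"
	"unicode/utf8"
)

// Tokenizer mirrors the interface defined under Core Types below.
type Tokenizer interface {
	CountTokens(text string) int
}

// HeuristicTokenizer is the current chars-per-token estimate (±15-20% error).
type HeuristicTokenizer struct{ charsPerToken float64 }

func (h HeuristicTokenizer) CountTokens(text string) int {
	return int(math.Ceil(float64(utf8.RuneCountInString(text)) / h.charsPerToken))
}

// newTokenizer picks an implementation from CRUVERO_TOKENIZER_MODE.
// In bpe mode the real code would build a BPETokenizer via
// tiktoken.GetEncoding("cl100k_base"); elided here, so both
// branches return the heuristic fallback.
func newTokenizer() Tokenizer {
	if os.Getenv("CRUVERO_TOKENIZER_MODE") == "heuristic" {
		return HeuristicTokenizer{charsPerToken: 4.0}
	}
	return HeuristicTokenizer{charsPerToken: 4.0}
}

func main() {
	t := newTokenizer()
	fmt.Println(t.CountTokens("hello world, this is a test"))
}
```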

Architecture

Current Context Assembly Pipeline

buildDecisionPrompts() (activity_llm_prompt.go)
├─ DetectPhase(stepIndex, maxSteps) → planning|executing|reviewing
├─ AllocateBudget(totalTokens, systemTokens, phase)
│  ├─ phasePercentages(phase) → hardcoded % per section per phase
│  └─ returns ContextBudget with per-section caps
├─ buildContextAssemblerInput() → gathers episodes, memories, tools
├─ AssembleContext(input, budget, model) → deterministic section ordering
│  ├─ [SYSTEM] → [PROCEDURES] → [AVAILABLE_TOOLS] → [WORKING_MEMORY] → ...
│  └─ enforcePromptTokenCap() → reactive truncation on overflow
└─ returns []llm.Message{system, user}

Target Pipeline (After Phase 24)

ContextPipeline.Execute(state)
├─ stageDetectPhase → planning|executing|reviewing
├─ stageAllocateBudget → per-section token budgets (with tenant overrides)
├─ stageMaskObservations → replace old observations with one-line refs
├─ stageCompressSchemas → minify/truncate/aggressive schema compression
├─ stageBuildConversation → sliding window multi-turn (if enabled)
├─ stageAssembleContext → deterministic section assembly
└─ stageProactiveCompression → utilization check + escalating compression

Three-Tier Implementation

Tier 1 — Highest ROI (Sub-phase A, independent/parallel):
1.1 Accurate Go-Native Tokenizer
1.2 Prompt Caching (All 5 Providers)
1.3 Observation Masking
1.4 Tool Schema Compression

Tier 2 — Significant Value (Sub-phase B, depends on Tier 1 tokenizer):
2.1 Multi-Turn Conversation Builder
2.2 Rolling Anchored Summaries
2.3 Per-Tenant Context Budget Policies
2.4 Proactive Compression Triggers

Tier 3 — Polish (Sub-phase C, depends on Tier 2):
3.1 Context Pipeline as Middleware
3.2 OTel Token Monitoring
3.3 Context Waste Detection
3.4 Serialization Optimization
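Item 1.3 (observation masking) can be sketched as keeping the most recent `window` tool outputs verbatim and replacing older ones with one-line references, matching the CRUVERO_OBSERVATION_MASK_WINDOW behavior described later. The `Observation` field names and the mask wording are assumptions; per the risk table, full outputs stay in the episodic store and only the assembled context is masked.

```go
package main

import "fmt"

// Observation is a tool output recorded at a given step (field names assumed).
type Observation struct {
	StepIndex int
	Tool      string
	Output    string
}

// maskObservations keeps the last `window` observations in full and
// replaces older ones with a one-line reference.
func maskObservations(obs []Observation, window int) []string {
	out := make([]string, 0, len(obs))
	cutoff := len(obs) - window
	for i, o := range obs {
		if i < cutoff {
			out = append(out, fmt.Sprintf(
				"[masked] step %d: %s output elided (see episodic store)",
				o.StepIndex, o.Tool))
		} else {
			out = append(out, o.Output)
		}
	}
	return out
}

func main() {
	obs := []Observation{
		{1, "search", "long result A"},
		{2, "fetch", "long result B"},
		{3, "parse", "long result C"},
	}
	for _, line := range maskObservations(obs, 2) {
		fmt.Println(line)
	}
}
```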

Competitive Comparison

| Capability | LangChain/LangGraph | Cruvero (Current) | Cruvero (After Phase 24) |
| --- | --- | --- | --- |
| Budget allocation | None | Phase-aware (plan/execute/review) | Phase-aware + per-tenant overrides |
| Memory ranking | FIFO or naive vector | 5-component salience scoring | Same + rolling anchored summaries |
| Multi-tenancy | None | Full (quotas, models, tools) | + per-tenant context policies |
| Prompt caching | Basic | None | Anthropic explicit + OpenAI auto |
| Tokenization | tiktoken (Python) | Heuristic (±15-20%) | BPE via tiktoken-go (MIT, pure Go) |
| Observation masking | None | None | JetBrains-validated masking |
| Tool selection | All tools every time | Semantic search (Phase 19) | + schema compression + waste tracking |
| Conversation state | Manual checkpointing | Temporal-native durability | + sliding window multi-turn |
| Proactive compression | None | None | Configurable utilization triggers |

Core Types and Interfaces

```go
// Tokenizer counts tokens for a given text.
type Tokenizer interface {
	CountTokens(text string) int
}

type BPETokenizer struct {
	enc *tiktoken.Tiktoken
}

type HeuristicTokenizer struct {
	charsPerToken float64
}

type CompressionLevel int

const (
	CompressionNone CompressionLevel = iota
	CompressionMinify
	CompressionTruncate
	CompressionAggressive
)

func CompressSchema(raw json.RawMessage, level CompressionLevel) json.RawMessage

type ConversationTurn struct {
	StepIndex int    `json:"step_index"`
	Assistant string `json:"assistant"`
	User      string `json:"user"`
}

type ConversationBuilder struct {
	Window int
	Model  string
}

type BudgetOverride struct {
	Working    int `json:"working,omitempty"`
	Episodic   int `json:"episodic,omitempty"`
	Semantic   int `json:"semantic,omitempty"`
	Tools      int `json:"tools,omitempty"`
	Procedural int `json:"procedural,omitempty"`
	Reserved   int `json:"reserved,omitempty"`
}

type ContextPolicy struct {
	PhaseOverrides map[TaskPhase]BudgetOverride `json:"phase_overrides,omitempty"`
}

type ContextStage func(ctx context.Context, input *ContextPipelineState) error

type ContextPipeline struct {
	stages []ContextStage
}

type AssembledContext struct {
	// ... existing fields ...
	IncludedTools []string `json:"included_tools"`
	WastedTools   []string `json:"wasted_tools,omitempty"`
	WasteRatio    float64  `json:"waste_ratio,omitempty"`
}
```

Sub-Phases

| Sub-Phase | Name | Prompts | Depends On |
| --- | --- | --- | --- |
| 24A | Tier 1: Highest ROI | 4 | |
| 24B | Tier 2: Significant Value | 4 | 24A |
| 24C | Tier 3: Polish | 4 | 24B |

Total: 3 sub-phases, 12 prompts, 7 documentation files

Dependency Graph

24A (Tier 1) → 24B (Tier 2) → 24C (Tier 3)

Strictly sequential: each tier builds on the previous. Items within a tier are independent and parallelizable.


Environment Variables

| Variable | Default | Description |
| --- | --- | --- |
| CRUVERO_TOKENIZER_MODE | bpe | Token counting mode: bpe or heuristic |
| CRUVERO_PROMPT_CACHE_ENABLED | true | Enable provider-level prompt caching |
| CRUVERO_OBSERVATION_MASK_ENABLED | true | Enable observation masking for consumed outputs |
| CRUVERO_OBSERVATION_MASK_WINDOW | 2 | Number of recent full observations to keep |
| CRUVERO_TOOL_SCHEMA_COMPRESSION | truncate | Schema compression level: none, minify, truncate, aggressive |
| CRUVERO_CONVERSATION_ENABLED | false | Enable multi-turn conversation builder |
| CRUVERO_CONVERSATION_WINDOW | 5 | Max conversation turns in sliding window |
| CRUVERO_SUMMARY_MODE | oneshot | Summary mode: rolling or oneshot |
| CRUVERO_SUMMARY_MAX_BULLETS | 5 | Max bullets in rolling summary |
| CRUVERO_COMPRESSION_THRESHOLD | 0.85 | Utilization ratio that triggers proactive compression |
| CRUVERO_CONTEXT_WASTE_TRACKING | false | Enable context waste detection metrics |

Every variable can be overridden per tenant via TenantConfig.Metadata, using the variable name lowercased and without the CRUVERO_ prefix (e.g., tokenizer_mode, observation_mask_window).
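The resolution order implied here (tenant metadata over process environment over default) can be sketched as follows; `TenantConfig.Metadata` is modeled as a plain string map for the example, and the helper name is hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// resolveSetting resolves a context setting for a tenant: a tenant metadata
// entry (env var name lowercased, CRUVERO_ prefix stripped) wins over the
// process-level environment variable, which wins over the built-in default.
func resolveSetting(meta map[string]string, envVar, def string) string {
	key := strings.ToLower(strings.TrimPrefix(envVar, "CRUVERO_"))
	if v, ok := meta[key]; ok && v != "" {
		return v
	}
	if v := os.Getenv(envVar); v != "" {
		return v
	}
	return def
}

func main() {
	meta := map[string]string{"observation_mask_window": "4"}
	fmt.Println(resolveSetting(meta, "CRUVERO_OBSERVATION_MASK_WINDOW", "2")) // tenant override
	fmt.Println(resolveSetting(meta, "CRUVERO_CONVERSATION_WINDOW", "5"))     // built-in default
}
```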


Files Overview

New Files

| File | Sub-Phase | Description |
| --- | --- | --- |
| internal/registry/schema_compressor.go | 24A | CompressSchema() with 3 compression levels (~80 lines) |
| internal/agent/conversation.go | 24B | ConversationBuilder with sliding window (~120 lines) |
| internal/agent/context_pipeline.go | 24C | ContextPipeline with ordered ContextStage functions (~100 lines) |

Modified Files

| File | Sub-Phase | Change |
| --- | --- | --- |
| internal/agent/tokenizer.go | 24A | Add Tokenizer interface and BPETokenizer; keep HeuristicTokenizer |
| internal/llm/anthropic.go | 24A | Content blocks + cache_control markers |
| internal/llm/openai_chat.go | 24A | Parse cached_tokens from response |
| internal/llm/google.go | 24A | Add CachedContent + cache manager |
| internal/llm/openrouter.go | 24A | cache_control hints for Anthropic-backed models |
| internal/llm/types.go | 24A | Add CacheReadTokens, CacheWriteTokens to Usage |
| internal/agent/activity_llm.go | 24A, 24B | Cache metrics, conversation builder, budget overrides |
| internal/agent/activity_llm_prompt.go | 24A | Observation masking |
| internal/agent/context_assembler.go | 24A, 24B, 24C | Schema compression, proactive compression, waste tracking, serialization |
| internal/agent/state.go | 24B | Add ConversationHistory to AgentState |
| internal/agent/activity_observe.go | 24B | Rolling incremental summaries |
| internal/agent/activity_memory.go | 24B | Accept existing summary in rolling mode |
| internal/agent/context_budget.go | 24B | Accept optional *BudgetOverride |
| internal/tenant/config.go | 24B | Add ContextPolicy struct |

New Dependency

| Library | License | Purpose |
| --- | --- | --- |
| github.com/pkoukk/tiktoken-go | MIT | BPE tokenization (cl100k_base, o200k_base) |

Success Metrics

| Metric | Target |
| --- | --- |
| Tokenizer accuracy | ±2% vs reference (OpenAI tokenizer playground) |
| Prompt cache hit rate | > 50% for multi-step runs |
| Observation masking token savings | 30-60% on steps with stale observations |
| Schema compression token savings | 20-50% reduction in [AVAILABLE_TOOLS] section |
| Conversation builder coherence | LLM references prior reasoning in 80%+ of multi-step runs |
| Rolling summary information retention | Critical facts preserved across 5+ summarization rounds |
| Proactive compression trigger rate | < 10% of assemblies need reactive overflow truncation |
| Context waste ratio | < 30% wasted tools across runs |
| Test coverage | >= 80% for internal/agent/ and internal/registry/ |

Code Quality Requirements (SonarQube)

All Go code produced by Phase 24 prompts must pass SonarQube quality gates:

  • Error handling: Every returned error must be handled explicitly
  • Cyclomatic complexity: Functions under 50 lines where practical
  • No dead code: No unused variables, empty blocks, or duplicated logic
  • Resource cleanup: Close all resources with proper defer patterns
  • Early returns: Prefer guard clauses over deeply nested conditionals
  • No magic values: Use named constants for strings and numbers
  • Meaningful names: Descriptive variable and function names
  • Linting gate: Run go vet, staticcheck, and golangci-lint run before considering the prompt complete

Each sub-phase Exit Criteria section includes:

  • [ ] go vet ./internal/agent/... reports no issues
  • [ ] staticcheck ./internal/agent/... reports no issues
  • [ ] No functions exceed 50 lines (extract helpers as needed)
  • [ ] All returned errors are handled (no _ = err patterns)

Risk Mitigation

| Risk | Mitigation |
| --- | --- |
| tiktoken-go adds a new dependency | MIT license, pure Go, no CGo. Fallback to heuristic mode via env var. |
| Prompt caching breaks provider APIs | Feature-flagged per provider. Existing request format preserved when disabled. |
| Observation masking loses critical info | Masking happens at assembly time only; full observations are always preserved in the episodic store. |
| Conversation history grows Temporal state | Sliding window is bounded; history is trimmed before ContinueAsNew. |
| Schema compression breaks tool schemas | All compression levels preserve JSON Schema validity; table-driven tests validate. |
| Proactive compression is too aggressive | Escalation order is incremental; each strategy rechecks utilization before the next. |
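The recheck-between-strategies behavior in the last mitigation can be sketched as a loop that stops as soon as utilization drops back under the threshold. The strategy signature (returning the new utilization) and the scaling factors are assumptions for illustration.

```go
package main

import "fmt"

// compressionStrategy applies one compression step and reports the
// resulting utilization (signature assumed for this sketch).
type compressionStrategy struct {
	name  string
	apply func(util float64) float64
}

// proactiveCompress applies strategies in escalation order, rechecking
// utilization against the threshold after each one and stopping early.
func proactiveCompress(util, threshold float64, strategies []compressionStrategy) (float64, []string) {
	var applied []string
	for _, s := range strategies {
		if util <= threshold {
			break
		}
		util = s.apply(util)
		applied = append(applied, s.name)
	}
	return util, applied
}

func main() {
	strategies := []compressionStrategy{
		{"maskObservations", func(u float64) float64 { return u * 0.8 }},
		{"compressSchemas", func(u float64) float64 { return u * 0.9 }},
		{"truncateEpisodes", func(u float64) float64 { return u * 0.7 }},
	}
	// 0.95 * 0.8 = 0.76 <= 0.85, so only the first strategy runs.
	util, applied := proactiveCompress(0.95, 0.85, strategies)
	fmt.Printf("%.2f %v\n", util, applied)
}
```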

Relationship to Other Phases

| Phase | Relationship |
| --- | --- |
| Phase 10B (Salience + Context Budget) | 24A builds on existing context_budget.go and context_assembler.go |
| Phase 19 (Tool Registry Restructure) | 24C waste detection feeds Phase 19's tool quality scoring |
| Phase 25 (MCP Enterprise Architecture) | Orthogonal: context management sits in the agent/LLM layer, not the MCP transport |
| Phase 17 (PII Guard) | PII detection applies to context content via existing boundaries; no Phase 24 interaction |

Progress Notes

(none yet)