
Phase 24 — Enterprise Context Management

Closes the remaining high-ROI context optimization gaps: accurate tokenization, provider-level prompt caching, observation masking, tool schema compression, multi-turn conversation, rolling summaries, per-tenant budget policies, proactive compression, a composable context pipeline, OTel monitoring, waste detection, and serialization optimization. The twelve improvements are organized into three sequential tiers; items within a tier are independent and parallelizable.

Status: Planned
Depends on: Phases 1-19 complete (tool quality scoring from Phase 19 feeds waste detection)
Migrations: none (all state in AgentState/TenantConfig.Metadata)
Branch: dev


Why Now

With Phases 1-19 complete, Cruvero has phase-aware budgets, 5-component salience scoring, semantic tool search, and multi-tenant isolation. But several high-ROI optimizations remain:

  1. Inaccurate tokenization — Heuristic chars-per-token ratios (±15-20% error) waste ~25K tokens on a 128K window.
  2. No prompt caching — All major LLM APIs support it; Pricing.InputCacheRead/InputCacheWrite fields exist but are never used.
  3. Full observation retention — Tool outputs stay in context long after the LLM has acted on them.
  4. Verbose tool schemas — 100-1,000 bytes per tool; 20+ tools consume 5-15% of context budget.
  5. Single-turn interaction — Fresh {system, user} pair every step; no prior reasoning available.
  6. One-shot summarization — Replaces previous summary entirely, losing older information.
  7. Hardcoded budget percentages — No per-tenant customization.
  8. Reactive-only compression — Only triggers after overflow.
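To make gap 1 concrete, here is a minimal sketch of a `Tokenizer` selector keyed off `CRUVERO_TOKENIZER_MODE`. The interface matches the Core Types section below; in the real phase the `bpe` branch would construct a `BPETokenizer` backed by tiktoken-go, which is omitted here to keep the sketch dependency-free, so both branches fall back to the heuristic. The 4.0 chars-per-token constant is an assumption for illustration.

```go
package main

import (
	"fmt"
	"math"
	"os"
	"unicode/utf8"
)

// Tokenizer mirrors the interface defined under Core Types below.
type Tokenizer interface {
	CountTokens(text string) int
}

// HeuristicTokenizer is the current chars-per-token estimate (±15-20% error).
type HeuristicTokenizer struct{ charsPerToken float64 }

func (h HeuristicTokenizer) CountTokens(text string) int {
	return int(math.Ceil(float64(utf8.RuneCountInString(text)) / h.charsPerToken))
}

// newTokenizer picks an implementation from CRUVERO_TOKENIZER_MODE.
// In bpe mode the real code would build a BPETokenizer via
// tiktoken.GetEncoding("cl100k_base"); elided here, so both
// branches return the heuristic fallback.
func newTokenizer() Tokenizer {
	if os.Getenv("CRUVERO_TOKENIZER_MODE") == "heuristic" {
		return HeuristicTokenizer{charsPerToken: 4.0}
	}
	return HeuristicTokenizer{charsPerToken: 4.0}
}

func main() {
	t := newTokenizer()
	fmt.Println(t.CountTokens("hello world, this is a test"))
}
```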

Architecture

Current Context Assembly Pipeline

buildDecisionPrompts() (activity_llm_prompt.go)
├─ DetectPhase(stepIndex, maxSteps) → planning|executing|reviewing
├─ AllocateBudget(totalTokens, systemTokens, phase)
│  ├─ phasePercentages(phase) → hardcoded % per section per phase
│  └─ returns ContextBudget with per-section caps
├─ buildContextAssemblerInput() → gathers episodes, memories, tools
├─ AssembleContext(input, budget, model) → deterministic section ordering
│  ├─ [SYSTEM] → [PROCEDURES] → [AVAILABLE_TOOLS] → [WORKING_MEMORY] → ...
│  └─ enforcePromptTokenCap() → reactive truncation on overflow
└─ returns []llm.Message{system, user}

Target Pipeline (After Phase 24)

ContextPipeline.Execute(state)
├─ stageDetectPhase → planning|executing|reviewing
├─ stageAllocateBudget → per-section token budgets (with tenant overrides)
├─ stageMaskObservations → replace old observations with one-line refs
├─ stageCompressSchemas → minify/truncate/aggressive schema compression
├─ stageBuildConversation → sliding window multi-turn (if enabled)
├─ stageAssembleContext → deterministic section assembly
└─ stageProactiveCompression → utilization check + escalating compression

Three-Tier Implementation

Tier 1 — Highest ROI (Sub-phase A, independent/parallel):
1.1 Accurate Go-Native Tokenizer
1.2 Prompt Caching (All 5 Providers)
1.3 Observation Masking
1.4 Tool Schema Compression

Tier 2 — Significant Value (Sub-phase B, depends on Tier 1 tokenizer):
2.1 Multi-Turn Conversation Builder
2.2 Rolling Anchored Summaries
2.3 Per-Tenant Context Budget Policies
2.4 Proactive Compression Triggers

Tier 3 — Polish (Sub-phase C, depends on Tier 2):
3.1 Context Pipeline as Middleware
3.2 OTel Token Monitoring
3.3 Context Waste Detection
3.4 Serialization Optimization
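Item 1.3 (observation masking) can be sketched as keeping the most recent `window` tool outputs verbatim and replacing older ones with one-line references, matching the CRUVERO_OBSERVATION_MASK_WINDOW behavior described later. The `Observation` field names and the mask wording are assumptions; per the risk table, full outputs stay in the episodic store and only the assembled context is masked.

```go
package main

import "fmt"

// Observation is a tool output recorded at a given step (field names assumed).
type Observation struct {
	StepIndex int
	Tool      string
	Output    string
}

// maskObservations keeps the last `window` observations in full and
// replaces older ones with a one-line reference.
func maskObservations(obs []Observation, window int) []string {
	out := make([]string, 0, len(obs))
	cutoff := len(obs) - window
	for i, o := range obs {
		if i < cutoff {
			out = append(out, fmt.Sprintf(
				"[masked] step %d: %s output elided (see episodic store)",
				o.StepIndex, o.Tool))
		} else {
			out = append(out, o.Output)
		}
	}
	return out
}

func main() {
	obs := []Observation{
		{1, "search", "long result A"},
		{2, "fetch", "long result B"},
		{3, "parse", "long result C"},
	}
	for _, line := range maskObservations(obs, 2) {
		fmt.Println(line)
	}
}
```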

Competitive Comparison

| Capability | LangChain/LangGraph | Cruvero (Current) | Cruvero (After Phase 24) |
| --- | --- | --- | --- |
| Budget allocation | None | Phase-aware (plan/execute/review) | Phase-aware + per-tenant overrides |
| Memory ranking | FIFO or naive vector | 5-component salience scoring | Same + rolling anchored summaries |
| Multi-tenancy | None | Full (quotas, models, tools) | + per-tenant context policies |
| Prompt caching | Basic | None | Anthropic explicit + OpenAI auto |
| Tokenization | tiktoken (Python) | Heuristic (±15-20%) | BPE via tiktoken-go (MIT, pure Go) |
| Observation masking | None | None | JetBrains-validated masking |
| Tool selection | All tools every time | Semantic search (Phase 19) | + schema compression + waste tracking |
| Conversation state | Manual checkpointing | Temporal-native durability | + sliding window multi-turn |
| Proactive compression | None | None | Configurable utilization triggers |

Core Types and Interfaces

```go
// Tokenizer counts tokens for a given text.
type Tokenizer interface {
	CountTokens(text string) int
}

type BPETokenizer struct {
	enc *tiktoken.Tiktoken
}

type HeuristicTokenizer struct {
	charsPerToken float64
}

type CompressionLevel int

const (
	CompressionNone CompressionLevel = iota
	CompressionMinify
	CompressionTruncate
	CompressionAggressive
)

func CompressSchema(raw json.RawMessage, level CompressionLevel) json.RawMessage

type ConversationTurn struct {
	StepIndex int    `json:"step_index"`
	Assistant string `json:"assistant"`
	User      string `json:"user"`
}

type ConversationBuilder struct {
	Window int
	Model  string
}

type BudgetOverride struct {
	Working    int `json:"working,omitempty"`
	Episodic   int `json:"episodic,omitempty"`
	Semantic   int `json:"semantic,omitempty"`
	Tools      int `json:"tools,omitempty"`
	Procedural int `json:"procedural,omitempty"`
	Reserved   int `json:"reserved,omitempty"`
}

type ContextPolicy struct {
	PhaseOverrides map[TaskPhase]BudgetOverride `json:"phase_overrides,omitempty"`
}

type ContextStage func(ctx context.Context, input *ContextPipelineState) error

type ContextPipeline struct {
	stages []ContextStage
}

type AssembledContext struct {
	// ... existing fields ...
	IncludedTools []string `json:"included_tools"`
	WastedTools   []string `json:"wasted_tools,omitempty"`
	WasteRatio    float64  `json:"waste_ratio,omitempty"`
}
```

Sub-Phases

| Sub-Phase | Name | Prompts | Depends On |
| --- | --- | --- | --- |
| 24A | Tier 1: Highest ROI | 4 | |
| 24B | Tier 2: Significant Value | 4 | 24A |
| 24C | Tier 3: Polish | 4 | 24B |

Total: 3 sub-phases, 12 prompts, 7 documentation files

Dependency Graph

24A (Tier 1) → 24B (Tier 2) → 24C (Tier 3)

Strictly sequential: each tier builds on the previous. Items within a tier are independent and parallelizable.


Environment Variables

| Variable | Default | Description |
| --- | --- | --- |
| CRUVERO_TOKENIZER_MODE | bpe | Token counting mode: bpe or heuristic |
| CRUVERO_PROMPT_CACHE_ENABLED | true | Enable provider-level prompt caching |
| CRUVERO_OBSERVATION_MASK_ENABLED | true | Enable observation masking for consumed outputs |
| CRUVERO_OBSERVATION_MASK_WINDOW | 2 | Number of recent full observations to keep |
| CRUVERO_TOOL_SCHEMA_COMPRESSION | truncate | Schema compression level: none, minify, truncate, aggressive |
| CRUVERO_CONVERSATION_ENABLED | false | Enable multi-turn conversation builder |
| CRUVERO_CONVERSATION_WINDOW | 5 | Max conversation turns in sliding window |
| CRUVERO_SUMMARY_MODE | oneshot | Summary mode: rolling or oneshot |
| CRUVERO_SUMMARY_MAX_BULLETS | 5 | Max bullets in rolling summary |
| CRUVERO_COMPRESSION_THRESHOLD | 0.85 | Utilization ratio that triggers proactive compression |
| CRUVERO_CONTEXT_WASTE_TRACKING | false | Enable context waste detection metrics |

Every variable can be overridden per tenant via TenantConfig.Metadata, using the variable name lowercased and without the CRUVERO_ prefix (e.g., tokenizer_mode, observation_mask_window).
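The resolution order implied here (tenant metadata over process environment over default) can be sketched as follows; `TenantConfig.Metadata` is modeled as a plain string map for the example, and the helper name is hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// resolveSetting resolves a context setting for a tenant: a tenant metadata
// entry (env var name lowercased, CRUVERO_ prefix stripped) wins over the
// process-level environment variable, which wins over the built-in default.
func resolveSetting(meta map[string]string, envVar, def string) string {
	key := strings.ToLower(strings.TrimPrefix(envVar, "CRUVERO_"))
	if v, ok := meta[key]; ok && v != "" {
		return v
	}
	if v := os.Getenv(envVar); v != "" {
		return v
	}
	return def
}

func main() {
	meta := map[string]string{"observation_mask_window": "4"}
	fmt.Println(resolveSetting(meta, "CRUVERO_OBSERVATION_MASK_WINDOW", "2")) // tenant override
	fmt.Println(resolveSetting(meta, "CRUVERO_CONVERSATION_WINDOW", "5"))     // built-in default
}
```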


Files Overview

New Files

| File | Sub-Phase | Description |
| --- | --- | --- |
| internal/registry/schema_compressor.go | 24A | CompressSchema() with 3 compression levels (~80 lines) |
| internal/agent/conversation.go | 24B | ConversationBuilder with sliding window (~120 lines) |
| internal/agent/context_pipeline.go | 24C | ContextPipeline with ordered ContextStage functions (~100 lines) |

Modified Files

| File | Sub-Phase | Change |
| --- | --- | --- |
| internal/agent/tokenizer.go | 24A | Add Tokenizer interface and BPETokenizer; keep HeuristicTokenizer |
| internal/llm/anthropic.go | 24A | Content blocks + cache_control markers |
| internal/llm/openai_chat.go | 24A | Parse cached_tokens from response |
| internal/llm/google.go | 24A | Add CachedContent + cache manager |
| internal/llm/openrouter.go | 24A | cache_control hints for Anthropic-backed models |
| internal/llm/types.go | 24A | Add CacheReadTokens, CacheWriteTokens to Usage |
| internal/agent/activity_llm.go | 24A, 24B | Cache metrics, conversation builder, budget overrides |
| internal/agent/activity_llm_prompt.go | 24A | Observation masking |
| internal/agent/context_assembler.go | 24A, 24B, 24C | Schema compression, proactive compression, waste tracking, serialization |
| internal/agent/state.go | 24B | Add ConversationHistory to AgentState |
| internal/agent/activity_observe.go | 24B | Rolling incremental summaries |
| internal/agent/activity_memory.go | 24B | Accept existing summary in rolling mode |
| internal/agent/context_budget.go | 24B | Accept optional *BudgetOverride |
| internal/tenant/config.go | 24B | Add ContextPolicy struct |

New Dependency

| Library | License | Purpose |
| --- | --- | --- |
| github.com/pkoukk/tiktoken-go | MIT | BPE tokenization (cl100k_base, o200k_base) |

Success Metrics

| Metric | Target |
| --- | --- |
| Tokenizer accuracy | ±2% vs reference (OpenAI tokenizer playground) |
| Prompt cache hit rate | > 50% for multi-step runs |
| Observation masking token savings | 30-60% on steps with stale observations |
| Schema compression token savings | 20-50% reduction in [AVAILABLE_TOOLS] section |
| Conversation builder coherence | LLM references prior reasoning in 80%+ of multi-step runs |
| Rolling summary information retention | Critical facts preserved across 5+ summarization rounds |
| Proactive compression trigger rate | < 10% of assemblies need reactive overflow truncation |
| Context waste ratio | < 30% wasted tools across runs |
| Test coverage | >= 80% for internal/agent/ and internal/registry/ |

Code Quality Requirements (SonarQube)

All Go code produced by Phase 24 prompts must pass SonarQube quality gates:

  • Error handling: Every returned error must be handled explicitly
  • Cyclomatic complexity: Functions under 50 lines where practical
  • No dead code: No unused variables, empty blocks, or duplicated logic
  • Resource cleanup: Close all resources with proper defer patterns
  • Early returns: Prefer guard clauses over deeply nested conditionals
  • No magic values: Use named constants for strings and numbers
  • Meaningful names: Descriptive variable and function names
  • Linting gate: Run go vet, staticcheck, and golangci-lint run before considering the prompt complete

Each sub-phase Exit Criteria section includes:

  • [ ] go vet ./internal/agent/... reports no issues
  • [ ] staticcheck ./internal/agent/... reports no issues
  • [ ] No functions exceed 50 lines (extract helpers as needed)
  • [ ] All returned errors are handled (no _ = err patterns)

Risk Mitigation

| Risk | Mitigation |
| --- | --- |
| tiktoken-go adds a new dependency | MIT license, pure Go, no CGo. Fallback to heuristic mode via env var. |
| Prompt caching breaks provider APIs | Feature-flagged per provider. Existing request format preserved when disabled. |
| Observation masking loses critical info | Masking happens at assembly time only; full observations are always preserved in the episodic store. |
| Conversation history grows Temporal state | Sliding window is bounded; history is trimmed before ContinueAsNew. |
| Schema compression breaks tool schemas | All compression levels preserve JSON Schema validity; table-driven tests validate. |
| Proactive compression is too aggressive | Escalation order is incremental; each strategy rechecks utilization before the next. |
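The recheck-between-strategies behavior in the last mitigation can be sketched as a loop that stops as soon as utilization drops back under the threshold. The strategy signature (returning the new utilization) and the scaling factors are assumptions for illustration.

```go
package main

import "fmt"

// compressionStrategy applies one compression step and reports the
// resulting utilization (signature assumed for this sketch).
type compressionStrategy struct {
	name  string
	apply func(util float64) float64
}

// proactiveCompress applies strategies in escalation order, rechecking
// utilization against the threshold after each one and stopping early.
func proactiveCompress(util, threshold float64, strategies []compressionStrategy) (float64, []string) {
	var applied []string
	for _, s := range strategies {
		if util <= threshold {
			break
		}
		util = s.apply(util)
		applied = append(applied, s.name)
	}
	return util, applied
}

func main() {
	strategies := []compressionStrategy{
		{"maskObservations", func(u float64) float64 { return u * 0.8 }},
		{"compressSchemas", func(u float64) float64 { return u * 0.9 }},
		{"truncateEpisodes", func(u float64) float64 { return u * 0.7 }},
	}
	// 0.95 * 0.8 = 0.76 <= 0.85, so only the first strategy runs.
	util, applied := proactiveCompress(0.95, 0.85, strategies)
	fmt.Printf("%.2f %v\n", util, applied)
}
```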

Relationship to Other Phases

| Phase | Relationship |
| --- | --- |
| Phase 10B (Salience + Context Budget) | 24A builds on existing context_budget.go and context_assembler.go |
| Phase 19 (Tool Registry Restructure) | 24C waste detection feeds Phase 19's tool quality scoring |
| Phase 25 (MCP Enterprise Architecture) | Orthogonal: context management sits in the agent/LLM layer, not the MCP transport |
| Phase 17 (PII Guard) | PII detection applies to context content via existing boundaries; no Phase 24 interaction |

Progress Notes

(none yet)