06-consolidation-experiment-improvements
Dec 7, 2025
Decision Tree Experiment Improvements
Status: Planning Phase
Branch: honolulu
Previous Experiment: TEN-311 (feature/ten-311)
Date: 2025-12-06
Overview
This document outlines improvements to the Decision Tree Logic validation experiments based on findings from TEN-311. The previous experiment successfully validated the basic decision flow but revealed four critical areas needing enhancement for production readiness.
Background: TEN-311 Experiment Summary
What Was Done
The TEN-311 experiment validated Decision Tree Logic using:
80 test cases (20 each: SKIP, UPDATE, CREATE, LINK)
Multi-factor weighted similarity: content 50%, people 15%, threadId 15%, subject 10%, entities 10%
Threshold-based decisions: 0.95 (duplicate), 0.80 (update), 0.50 (related), combined with the weights above as sketched after this list
Ground truth approach: Semantic labels mapped to numeric scores
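For reference, a minimal sketch of how the weights and thresholds above combine into a single decision. The factor names mirror the TEN-311 summary, but the `FactorScores` type and `decide` function are illustrative, not the experiment's actual code.

```typescript
// Illustrative only: weights and thresholds taken from the TEN-311 summary above.
type Decision = "SKIP" | "UPDATE" | "LINK" | "CREATE";

interface FactorScores {
  content: number;   // 0..1 similarity of body text
  people: number;    // 0..1 overlap of participants
  threadId: number;  // 1 if same email thread, else 0
  subject: number;   // 0..1 similarity of subject lines
  entities: number;  // 0..1 overlap of extracted entities
}

const WEIGHTS: Record<keyof FactorScores, number> = {
  content: 0.5,
  people: 0.15,
  threadId: 0.15,
  subject: 0.1,
  entities: 0.1,
};

function weightedSimilarity(scores: FactorScores): number {
  return (Object.keys(WEIGHTS) as (keyof FactorScores)[])
    .reduce((sum, key) => sum + WEIGHTS[key] * scores[key], 0);
}

// Threshold mapping: >= 0.95 duplicate (SKIP), >= 0.80 UPDATE, >= 0.50 LINK, else CREATE.
function decide(scores: FactorScores): Decision {
  const s = weightedSimilarity(scores);
  if (s >= 0.95) return "SKIP";
  if (s >= 0.8) return "UPDATE";
  if (s >= 0.5) return "LINK";
  return "CREATE";
}
```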
Findings
The experiment provided valuable baseline validation but exposed limitations:
Test data unrealistic: Content too short and structured
Semantic nuance insufficient: Cannot distinguish context (e.g., "성장" (growth) used in different domains)
UPDATE vs LINK ambiguous: Similar scores don't distinguish between updating same entity vs linking related entities
No history preservation: UPDATE operations lose previous information
Four Critical Improvements
1. Realistic Dataset Design
Problem: Test cases contain 20-50 character structured sentences, unlike real emails with signatures, disclaimers, quoted replies, and hundreds to thousands of characters.
Current example: a short, structured sentence of 20-50 characters.
Real email reality: hundreds to thousands of characters, including signatures, disclaimers, and quoted replies.
Solution: Create realistic email datasets with noise factors
→ Detailed Design: Realistic Dataset Design
2. Contextual Similarity Analysis
Problem: Current approach uses simple embedding similarity or word overlap, failing to distinguish semantic context.
Example Failure Case (CREATE_010):
Memory A: "회사 성장 전략 회의: 매출 증대 방안 논의" (company growth strategy)
Memory B: "직원 성장 프로그램: 직무 교육, 멘토링" (employee growth program)
The keyword "성장" (growth) overlaps, but the contexts are completely different
Current approach: raw embedding similarity or keyword overlap, which cannot distinguish the same keyword used in different domains.
Solution: LLM-based semantic decomposition and context-aware similarity (sketched below)
→ Detailed Design: Contextual Similarity
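A sketch of what the planned semantic decomposition could look like. `SemanticDecomposition` is named in Phase 2, but its fields here are assumptions, and the LLM extraction and embedding comparison are abstracted behind injected functions rather than tied to any specific model API.

```typescript
// Hypothetical shape for the SemanticDecomposition type named in Phase 2;
// the actual fields will be defined in lib/consolidation/semantic-decomposition.ts.
interface SemanticDecomposition {
  topic: string;         // e.g. "revenue strategy" vs "employee development"
  domain: string;        // business area the memory belongs to
  keyEntities: string[]; // people, teams, projects mentioned
  intent: string;        // what the memory is about (plan, report, request, ...)
}

// The LLM extraction is abstracted away; any model and prompt can sit behind this signature.
type Decompose = (content: string) => Promise<SemanticDecomposition>;

// Context-aware similarity: a shared surface keyword only counts when domain and
// topic also agree, so "company growth strategy" and "employee growth program"
// score low despite both containing "growth".
async function contextualSimilarity(
  a: string,
  b: string,
  decompose: Decompose,
  embedSim: (x: string, y: string) => Promise<number>,
): Promise<number> {
  const [da, db] = await Promise.all([decompose(a), decompose(b)]);
  const sameDomain = da.domain === db.domain;
  const topicSim = await embedSim(da.topic, db.topic);
  const rawSim = await embedSim(a, b);
  // Down-weight raw embedding similarity when the decomposed contexts disagree.
  return sameDomain ? 0.5 * rawSim + 0.5 * topicSim : Math.min(rawSim, topicSim);
}
```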
3. UPDATE vs LINK Distinction
Problem: Current threshold-based approach (0.80-0.95 = UPDATE, 0.50-0.80 = LINK) cannot distinguish:
UPDATE: Same entity, property changed (budget 5000 → 6000)
LINK: Related entities, both valid (Q1 OKR → Q2 OKR)
Key difference: UPDATE replaces a property on the same entity, while LINK connects two distinct but related entities that both remain valid.
Current limitation: similar similarity scores can mean either UPDATE or LINK; the distinction depends on semantic context, not just the numbers.
Solution: Property-level change detection + dynamic factor weights (sketched below)
→ Detailed Design: UPDATE vs LINK Distinction
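A sketch of the intended rule, assuming a hypothetical shape for the `PropertyChangeAnalysis` result planned in Phase 3: a confirmed property change on the same entity forces UPDATE, while related-but-distinct entities fall through to LINK regardless of how close the raw scores are.

```typescript
// Hypothetical result shape; the real type will live in
// lib/consolidation/property-change-detector.ts.
interface PropertyChange {
  property: string; // e.g. "budget"
  oldValue: string; // e.g. "5000"
  newValue: string; // e.g. "6000"
}

interface PropertyChangeAnalysis {
  sameEntity: boolean;       // do both memories describe the same underlying entity?
  changes: PropertyChange[]; // property-level diffs, meaningful when sameEntity is true
}

// Property changes on the same entity mean UPDATE; related but distinct entities
// (Q1 OKR vs Q2 OKR) mean LINK, independent of the exact similarity score.
function refineDecision(
  similarity: number,
  analysis: PropertyChangeAnalysis,
): "UPDATE" | "LINK" | "CREATE" {
  if (similarity < 0.5) return "CREATE";
  if (analysis.sameEntity && analysis.changes.length > 0) return "UPDATE";
  return "LINK";
}
```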
4. Version History Management
Problem: When updating memories, previous information is lost. Need history tracking without excessive storage.
Requirements:
Preserve update history for audit and rollback
Keep email thread links for source verification
Minimize storage cost (don't store all full contents)
Support version reconstruction when needed
Trade-off: storing full content for every version maximizes recoverability but multiplies storage cost, while discarding history entirely removes audit and rollback capability.
Solution: Hybrid strategy with content pruning and source references (see the sketch below)
→ Detailed Design: Version History Management
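A sketch of the hybrid strategy, assuming a hypothetical `MemoryVersion` shape and the Phase 4 target of pruning full content after three versions: every version keeps its change summary and source references, and only the most recent versions keep full content.

```typescript
// Hypothetical version record; the real type is planned for lib/types/memory-version.ts.
interface MemoryVersion {
  version: number;
  updatedAt: string;        // ISO timestamp
  changeSummary: string;    // short description of what changed (always kept)
  fullContent?: string;     // kept only for the most recent versions
  sourceEmailIds: string[]; // thread links for source-based reconstruction
}

const FULL_CONTENT_RETENTION = 3; // Phase 4 target: prune content after 3 versions

// Every UPDATE appends a version; older versions keep only the change summary and
// source references, so storage stays bounded while reconstruction from the
// original emails remains possible.
function appendVersion(history: MemoryVersion[], next: MemoryVersion): MemoryVersion[] {
  const updated = [...history, next];
  return updated.map((v, i) => {
    const isRecent = i >= updated.length - FULL_CONTENT_RETENTION;
    return isRecent ? v : { ...v, fullContent: undefined };
  });
}
```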
Implementation Roadmap
Phase 1: Realistic Dataset (Day 1-2)
Goal: Create test data matching real email characteristics
Tasks:
Create 20 realistic email test cases
Add noise factors (signatures, disclaimers, quoted replies)
Establish baseline performance comparison
Deliverables:
.data/experiments/datasets/v2/ with realistic cases
Manifest with noise factor metadata (see the sketch after this phase's acceptance criteria)
Baseline experiment run comparing v1 vs v2
Acceptance Criteria:
Average email length > 200 characters
At least 15/20 cases include signatures
At least 10/20 cases include quoted replies
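As a concreteness check, a sketch of what a v2 manifest entry with noise factor metadata could look like, plus a helper that verifies the acceptance criteria above. The schema and field names are assumptions, not the final manifest format.

```typescript
// Assumed layout for a v2 manifest entry; the real schema under
// .data/experiments/datasets/v2/ may differ.
interface NoiseFactors {
  hasSignature: boolean;
  hasDisclaimer: boolean;
  hasQuotedReply: boolean;
}

interface TestCaseManifestEntry {
  id: string; // e.g. "UPDATE_003"
  expectedDecision: "SKIP" | "UPDATE" | "CREATE" | "LINK";
  emailLength: number; // characters; target average > 200
  noise: NoiseFactors;
}

// Quick check against the Phase 1 acceptance criteria.
function meetsPhase1Criteria(entries: TestCaseManifestEntry[]): boolean {
  const avgLength = entries.reduce((sum, e) => sum + e.emailLength, 0) / entries.length;
  const withSignatures = entries.filter((e) => e.noise.hasSignature).length;
  const withQuotedReplies = entries.filter((e) => e.noise.hasQuotedReply).length;
  return avgLength > 200 && withSignatures >= 15 && withQuotedReplies >= 10;
}
```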
Phase 2: Contextual Similarity (Day 3-4)
Goal: Implement semantic decomposition for better context distinction
Tasks:
Implement SemanticDecomposition type and LLM extraction
Create 20 "same keyword, different context" test cases
Compare raw embedding vs contextual similarity
Deliverables:
lib/consolidation/semantic-decomposition.ts
lib/consolidation/contextual-similarity.ts
Test cases in .data/experiments/datasets/v2/contextual/
Experiment comparing accuracy on context-sensitive cases (see the evaluation sketch after this phase)
Acceptance Criteria:
Contextual similarity correctly distinguishes 90%+ of different-context cases
False positive rate < 10% (marking different contexts as same)
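A sketch of how the acceptance criteria above could be measured over labeled context-sensitive pairs; the `ContextCase` shape and the 0.8 decision threshold are placeholders for whatever the Phase 2 experiment runner actually uses.

```typescript
// Illustrative evaluation over labeled context-sensitive pairs.
interface ContextCase {
  sameContext: boolean;    // ground truth label
  contextualScore: number; // score produced by contextual similarity
}

function evaluateContextCases(cases: ContextCase[], threshold = 0.8) {
  let correct = 0;
  let falsePositives = 0;      // different contexts scored as same
  let differentContextTotal = 0;

  for (const c of cases) {
    const predictedSame = c.contextualScore >= threshold;
    if (predictedSame === c.sameContext) correct++;
    if (!c.sameContext) {
      differentContextTotal++;
      if (predictedSame) falsePositives++;
    }
  }

  return {
    accuracy: correct / cases.length,                                         // target > 90%
    falsePositiveRate: falsePositives / Math.max(differentContextTotal, 1),   // target < 10%
  };
}
```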
Phase 3: UPDATE vs LINK Distinction (Day 5-6)
Goal: Accurately distinguish property updates from entity relationships
Tasks:
Implement PropertyChangeAnalysis with LLM
Implement dynamic factor weight adjustment (see the weight sketch after this phase)
Create 20 boundary test cases (UPDATE/LINK ambiguous)
Deliverables:
lib/consolidation/property-change-detector.ts
lib/consolidation/dynamic-weights.ts
Updated decision flow incorporating property analysis
Experiment comparing fixed weights vs dynamic weights
Acceptance Criteria:
UPDATE/LINK distinction accuracy > 85%
Confusion rate between UPDATE and LINK < 15%
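A sketch of what dynamic factor weight adjustment could look like. The base weights come from TEN-311; the adjustment rules (redistributing the threadId weight when no thread exists, leaning on entity overlap when a memory is entity-rich) are illustrative examples, not the final policy.

```typescript
interface FactorWeights {
  content: number;
  people: number;
  threadId: number;
  subject: number;
  entities: number;
}

// Base weights from TEN-311.
const BASE_WEIGHTS: FactorWeights = {
  content: 0.5,
  people: 0.15,
  threadId: 0.15,
  subject: 0.1,
  entities: 0.1,
};

// Example adjustment rules; the resulting weights still sum to 1.
function adjustWeights(hasThread: boolean, entityCount: number): FactorWeights {
  const w = { ...BASE_WEIGHTS };
  if (!hasThread) {
    // No thread signal available: redistribute its weight onto content.
    w.content += w.threadId;
    w.threadId = 0;
  }
  if (entityCount >= 3) {
    // Entity-rich memories: lean more on entity overlap at UPDATE/LINK boundaries.
    w.entities += 0.05;
    w.content -= 0.05;
  }
  return w;
}
```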
Phase 4: Version History (Day 7-8)
Goal: Implement version management with content pruning
Tasks:
Implement VersionManager class
Implement content pruning strategy
Implement source-based version reconstruction
Test version recovery from email threads
Deliverables:
lib/storage/version-manager.ts
lib/types/memory-version.ts
Unit tests for version creation and reconstruction
Documentation on version policies
Acceptance Criteria:
Versions created on every UPDATE
Content pruned after 3 versions
Source links preserved for all versions
Reconstruction success rate > 95%
Success Metrics
Experiment Validation Metrics
| Metric | TEN-311 Baseline | Target | Measurement |
|---|---|---|---|
| Overall Accuracy | TBD | > 90% | Correct decisions / total cases |
| Context Distinction | ~60%* | > 90% | Same-keyword-different-context cases |
| UPDATE/LINK Accuracy | ~70%* | > 85% | Correct UPDATE vs LINK decisions |
| Realistic Data Performance | N/A | > 85% | Accuracy on noisy, realistic emails |
*Estimated based on current limitations
Production Readiness Metrics
Precision: False positive rate < 5% (marking unrelated as related)
Recall: False negative rate < 10% (missing true relationships)
Latency: Decision time < 2 seconds per memory
Storage Efficiency: Version storage < 2x original content size
Risk Mitigation
Risk 1: LLM-based Analysis Latency
Risk: Semantic decomposition and property change analysis require LLM calls, adding latency
Mitigation:
Use streaming for parallel analysis
Cache decomposition results
Use fast model (GPT-4o-mini or Claude Haiku) for analysis
Implement timeout fallbacks to embedding-only similarity (see the sketch below)
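A sketch of the timeout fallback, with the LLM analysis and embedding-only scorer passed in as functions so no specific model client is assumed.

```typescript
// If LLM-based analysis does not return within the budget, fall back to the
// embedding-only score. Function names and the default budget are illustrative.
async function similarityWithFallback(
  llmAnalysis: () => Promise<number>,
  embeddingOnly: () => Promise<number>,
  timeoutMs = 1500,
): Promise<number> {
  const timeout = new Promise<"timeout">((resolve) =>
    setTimeout(() => resolve("timeout"), timeoutMs),
  );
  const result = await Promise.race([llmAnalysis(), timeout]);
  return result === "timeout" ? embeddingOnly() : result;
}
```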
Risk 2: LLM Analysis Cost
Risk: Per-memory LLM calls increase operational cost
Mitigation:
Use LLM only for UPDATE/LINK boundary cases (similarity 0.50-0.95); see the gating sketch after this list
Batch process multiple comparisons in single prompt
Use cheaper models for decomposition
Implement result caching
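A sketch of the gating and caching mitigations: scores outside the 0.50-0.95 band never reach the LLM, and boundary analyses are memoized per memory pair. The `analyzeWithLLM` parameter is a placeholder for the real analysis call.

```typescript
// Memoize boundary analyses by memory-pair key so repeated comparisons are free.
const analysisCache = new Map<string, Promise<"UPDATE" | "LINK">>();

async function classifyBoundary(
  pairKey: string, // e.g. `${memoryA.id}:${memoryB.id}`
  similarity: number,
  analyzeWithLLM: () => Promise<"UPDATE" | "LINK">,
): Promise<"SKIP" | "UPDATE" | "LINK" | "CREATE"> {
  if (similarity >= 0.95) return "SKIP"; // clear duplicate, no LLM call needed
  if (similarity < 0.5) return "CREATE"; // clearly unrelated, no LLM call needed

  if (!analysisCache.has(pairKey)) {
    analysisCache.set(pairKey, analyzeWithLLM());
  }
  return analysisCache.get(pairKey)!;
}
```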
Risk 3: Context Analysis Accuracy
Risk: LLM may incorrectly judge context similarity
Mitigation:
Use few-shot examples in prompts
Implement confidence thresholds (low confidence → fallback)
Human review loop for low-confidence decisions
Continuously improve prompt based on failure analysis
Risk 4: Version Reconstruction Failure
Risk: Source emails may be deleted or inaccessible
Mitigation:
Keep full content for latest N versions (configurable)
Store change summaries for all versions
Graceful degradation (show "content unavailable"); see the reconstruction sketch after this list
User notification when source becomes unavailable
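A sketch of the reconstruction fallback, assuming the hypothetical version shape from the Version History section and a placeholder `fetchEmail` dependency: prefer retained content, then the source email, then degrade to a "content unavailable" marker with the change summary.

```typescript
// Graceful degradation during version reconstruction.
async function reconstructVersion(
  version: { fullContent?: string; changeSummary: string; sourceEmailIds: string[] },
  fetchEmail: (id: string) => Promise<string | null>, // placeholder dependency
): Promise<string> {
  if (version.fullContent) return version.fullContent;

  for (const id of version.sourceEmailIds) {
    const body = await fetchEmail(id);
    if (body !== null) return body;
  }
  // Source deleted or inaccessible: surface the retained summary instead of failing.
  return `[content unavailable] ${version.changeSummary}`;
}
```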
Open Questions
Q1: LLM Prompt Design
Question: What prompt structure gives best semantic decomposition accuracy?
Investigation Needed:
Test zero-shot vs few-shot prompts
Compare structured output (JSON) vs natural language
Evaluate different LLM models (GPT-4, Claude, etc.)
Q2: Version Retention Policy
Question: How many versions should retain full content before being pruned?
Investigation Needed:
Analyze typical update frequency per memory
Calculate storage cost projections
Survey user needs for version access
Q3: Dynamic Weight Tuning
Question: Should weights be learned from data or rule-based?
Investigation Needed:
Experiment with fixed rules vs ML-based weight prediction
Evaluate interpretability vs accuracy trade-off
Consider online learning from user feedback
Q4: Embedding Model Selection
Question: Which embedding model optimizes for Korean semantic similarity?
Investigation Needed:
Benchmark OpenAI, Cohere, multilingual models
Test on Korean-specific context distinction cases
Evaluate latency vs accuracy trade-off
Related Documentation
Decision Tree Logic - Current decision flow
Experiment Guide - TEN-311 methodology
Similarity Types - Current similarity system
Realistic Dataset Design - Improvement #1
Contextual Similarity - Improvement #2
UPDATE vs LINK Distinction - Improvement #3
Version History Management - Improvement #4
Next Steps
Review and Approve: Stakeholder review of improvement plan
Create Detailed Designs: Complete detailed design documents (linked above)
LLM Analysis Spike: Quick spike to validate LLM-based approaches
Phase 1 Execution: Begin with realistic dataset creation
Iterative Validation: Execute phases 2-4 with continuous validation
Change Log
| Date | Author | Change |
|---|---|---|
| 2025-12-06 | Claude | Initial improvement plan created |