06-consolidation-experiment-improvements

Dec 7, 2025

Decision Tree Experiment Improvements

Status: Planning Phase
Branch: honolulu
Previous Experiment: TEN-311 (feature/ten-311)
Date: 2025-12-06

Overview

This document outlines improvements to the Decision Tree Logic validation experiments based on findings from TEN-311. The previous experiment successfully validated the basic decision flow but revealed four critical areas needing enhancement for production readiness.

Background: TEN-311 Experiment Summary

What Was Done

The TEN-311 experiment validated Decision Tree Logic using:

  • 80 test cases (20 each: SKIP, UPDATE, CREATE, LINK)

  • Multi-factor weighted similarity: content 50%, people 15%, threadId 15%, subject 10%, entities 10%

  • Threshold-based decisions: 0.95 (duplicate), 0.80 (update), 0.50 (related); see the sketch after this list

  • Ground truth approach: Semantic labels mapped to numeric scores
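
As a concrete reference, here is a minimal TypeScript sketch of this scoring scheme. The weights and thresholds are taken from this document; the type and function names are illustrative assumptions, not the experiment's actual code.

// Illustrative sketch of TEN-311 scoring; names are assumptions.
interface SimilarityFactors {
  content: number;   // each factor scored 0..1
  people: number;
  threadId: number;
  subject: number;
  entities: number;
}

const WEIGHTS: SimilarityFactors = {
  content: 0.5,
  people: 0.15,
  threadId: 0.15,
  subject: 0.1,
  entities: 0.1,
};

type Decision = "SKIP" | "UPDATE" | "LINK" | "CREATE";

function weightedSimilarity(f: SimilarityFactors): number {
  return (Object.keys(WEIGHTS) as (keyof SimilarityFactors)[]).reduce(
    (sum, k) => sum + WEIGHTS[k] * f[k],
    0
  );
}

function decide(score: number): Decision {
  if (score >= 0.95) return "SKIP";   // duplicate
  if (score >= 0.80) return "UPDATE";
  if (score >= 0.50) return "LINK";   // related
  return "CREATE";
}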

Findings

The experiment provided valuable baseline validation but exposed limitations:

  1. Test data unrealistic: Content too short and structured

  2. Semantic nuance insufficient: Cannot distinguish context (e.g., "성장" (growth) appearing in different domains)

  3. UPDATE vs LINK ambiguous: Similar scores don't distinguish between updating same entity vs linking related entities

  4. No history preservation: UPDATE operations lose previous information

Four Critical Improvements

1. Realistic Dataset Design

Problem: Test cases consist of 20-50 character structured sentences, whereas real emails run hundreds to thousands of characters and include signatures, disclaimers, and quoted replies.

Current Example:

{
  "content": "Q1 마케팅 캠페인 예산은 5000만원입니다."
}

(Translation: "The Q1 marketing campaign budget is 50 million KRW.")

Real Email Reality:

안녕하세요 마케팅팀 여러분,

말씀드렸던 Q1 예산 건으로 연락드립니다.

검토 결과 예산을 6000만원으로 증액하기로 결정되었습니다.
추가 캠페인 진행이 가능하니 참고 부탁드립니다.

감사합니다.

---
김재무 | 재무팀
finance@company.com
Tel: 02-1234-5678

메일은 발신전용이며

(Translation: greeting; notice that, after review, the Q1 budget was increased to 60 million KRW; sign-off; a finance-team signature block; and the truncated start of a "this mailbox is send-only" disclaimer.)

Solution: Create realistic email datasets with noise factors

Detailed Design: Realistic Dataset Design

2. Contextual Similarity Analysis

Problem: Current approach uses simple embedding similarity or word overlap, failing to distinguish semantic context.

Example Failure Case (CREATE_010):

  • Memory A: "회사 성장 전략 회의: 매출 증대 방안 논의" (company growth strategy)

  • Memory B: "직원 성장 프로그램: 직무 교육, 멘토링" (employee growth program)

  • Keyword "성장" overlaps but contexts are completely different

Current Approach:

content_similarity = embedding_cosine(A, B)  // Single number

Problem: Cannot distinguish same keyword in different domains

Solution: LLM-based semantic decomposition and context-aware similarity

Detailed Design: Contextual Similarity
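
To make the proposal concrete, here is a minimal sketch of a SemanticDecomposition type and a context-aware comparison. The field names, penalty policy, and blend weights are assumptions to be settled in the detailed design.

// Hypothetical shape for lib/consolidation/semantic-decomposition.ts;
// an LLM extracts this structure from each memory's content.
interface SemanticDecomposition {
  topic: string;      // e.g. "growth strategy" vs "growth program"
  domain: string;     // e.g. "sales" vs "HR"; disambiguates shared keywords
  entities: string[]; // people, teams, programs referenced
  intent: string;     // what the memory asserts or decides
}

// Blend raw embedding similarity with decomposition-level agreement so
// that "성장" (growth) in unrelated domains no longer scores as a match.
function contextualSimilarity(
  a: SemanticDecomposition,
  b: SemanticDecomposition,
  rawEmbeddingSim: number
): number {
  const domainMatch = a.domain === b.domain ? 1 : 0;
  const entityOverlap =
    a.entities.filter((e) => b.entities.includes(e)).length /
    Math.max(a.entities.length, b.entities.length, 1);
  // Assumed blend: the raw score is gated by contextual agreement.
  return rawEmbeddingSim * (0.4 + 0.4 * domainMatch + 0.2 * entityOverlap);
}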

3. UPDATE vs LINK Distinction

Problem: Current threshold-based approach (0.80-0.95 = UPDATE, 0.50-0.80 = LINK) cannot distinguish:

  • UPDATE: Same entity, property changed (budget: 50M → 60M KRW)

  • LINK: Related entities, both valid (Q1 OKR → Q2 OKR)

Key Difference:

UPDATE conditions:
- Same "subject" (entity)
- "Property" changed
- Old information "replaced" by new

LINK conditions:
- Related "subjects"
- Each has independent information
- Both valid, need "connection"

Current Limitation: In the overlapping score range, the same numeric similarity can indicate either UPDATE or LINK; the distinction is semantic, so thresholds alone cannot make it.

Solution: Property-level change detection + dynamic factor weights

Detailed Design: UPDATE vs LINK Distinction
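
To make the distinction operational, here is a minimal sketch of property-level change detection refining the boundary decision. The PropertyChangeAnalysis shape and the rule below are assumptions for the detailed design, not a finished spec.

// Hypothetical output of lib/consolidation/property-change-detector.ts
// (produced by an LLM call in practice); field names are assumptions.
interface PropertyChangeAnalysis {
  sameSubject: boolean;        // do both memories describe the same entity?
  changedProperties: string[]; // e.g. ["budget"] for the 50M -> 60M change
  supersedes: boolean;         // does the new value replace the old one?
}

// Within the ambiguous 0.50-0.95 band, semantics decide, not the raw score.
function refineBoundaryDecision(
  analysis: PropertyChangeAnalysis
): "UPDATE" | "LINK" {
  if (analysis.sameSubject && analysis.supersedes) return "UPDATE";
  return "LINK"; // related subjects: both values stay valid, so connect them
}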

4. Version History Management

Problem: When updating memories, previous information is lost. Need history tracking without excessive storage.

Requirements:

  • Preserve update history for audit and rollback

  • Keep email thread links for source verification

  • Minimize storage cost (don't store all full contents)

  • Support version reconstruction when needed

Trade-offs:

Full Version Storage:     High cost, instant recovery
No Version Storage:       No cost, no history
Diff + Source Links:      Low cost, recoverable

Solution: Hybrid strategy with content pruning and source references

Detailed Design: Version History Management
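
As one possible shape for this hybrid strategy, here is a sketch of a version record with cheap change summaries plus source links, and a pruning pass. The keep-3 retention mirrors the Phase 4 acceptance criteria below; everything else is an assumption.

// Hypothetical record for lib/types/memory-version.ts: always keep a
// change summary and source links; keep full content only while recent.
interface MemoryVersion {
  version: number;
  createdAt: string;        // ISO timestamp
  changeSummary: string;    // e.g. "budget: 50M -> 60M KRW"
  fullContent?: string;     // pruned for older versions
  sourceEmailIds: string[]; // enables reconstruction after pruning
}

const FULL_CONTENT_RETENTION = 3; // mirrors the Phase 4 acceptance criteria

// Drop full content outside the retention window; summaries and source
// links remain, so older versions stay reconstructable from email threads.
function pruneVersions(versions: MemoryVersion[]): MemoryVersion[] {
  const newestFirst = [...versions].sort((a, b) => b.version - a.version);
  return newestFirst.map((v, i) =>
    i < FULL_CONTENT_RETENTION ? v : { ...v, fullContent: undefined }
  );
}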

Implementation Roadmap

Phase 1: Realistic Dataset (Day 1-2)

Goal: Create test data matching real email characteristics

Tasks:

  • Create 20 realistic email test cases

  • Add noise factors (signatures, disclaimers, quoted replies)

  • Establish baseline performance comparison

Deliverables:

  • .data/experiments/datasets/v2/ with realistic cases

  • Manifest with noise factor metadata (example entry sketched below)

  • Baseline experiment run comparing v1 vs v2
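
For illustration, one hypothetical v2 manifest entry with noise-factor metadata; the field names are assumptions to be fixed when the dataset is designed.

{
  "id": "UPDATE_003",
  "expectedDecision": "UPDATE",
  "contentLength": 412,
  "noiseFactors": {
    "signature": true,
    "disclaimer": true,
    "quotedReply": false
  }
}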

Acceptance Criteria:

  • Average email length > 200 characters

  • At least 15/20 cases include signatures

  • At least 10/20 cases include quoted replies

Phase 2: Contextual Similarity (Day 3-4)

Goal: Implement semantic decomposition for better context distinction

Tasks:

  • Implement SemanticDecomposition type and LLM extraction

  • Create 20 "same keyword, different context" test cases

  • Compare raw embedding vs contextual similarity

Deliverables:

  • lib/consolidation/semantic-decomposition.ts

  • lib/consolidation/contextual-similarity.ts

  • Test cases in .data/experiments/datasets/v2/contextual/

  • Experiment comparing accuracy on context-sensitive cases

Acceptance Criteria:

  • Contextual similarity correctly distinguishes 90%+ of different-context cases

  • False positive rate < 10% (marking different contexts as same)

Phase 3: UPDATE vs LINK Distinction (Day 5-6)

Goal: Accurately distinguish property updates from entity relationships

Tasks:

  • Implement PropertyChangeAnalysis with LLM

  • Implement dynamic factor weight adjustment

  • Create 20 boundary test cases (UPDATE/LINK ambiguous)

Deliverables:

  • lib/consolidation/property-change-detector.ts

  • lib/consolidation/dynamic-weights.ts

  • Updated decision flow incorporating property analysis

  • Experiment comparing fixed weights vs dynamic weights

Acceptance Criteria:

  • UPDATE/LINK distinction accuracy > 85%

  • Confusion rate between UPDATE and LINK < 15%

Phase 4: Version History (Day 7-8)

Goal: Implement version management with content pruning

Tasks:

  • Implement VersionManager class

  • Implement content pruning strategy

  • Implement source-based version reconstruction

  • Test version recovery from email threads

Deliverables:

  • lib/storage/version-manager.ts

  • lib/types/memory-version.ts

  • Unit tests for version creation and reconstruction

  • Documentation on version policies

Acceptance Criteria:

  • Versions created on every UPDATE

  • Content pruned after 3 versions

  • Source links preserved for all versions

  • Reconstruction success rate > 95%

Success Metrics

Experiment Validation Metrics

| Metric                     | TEN-311 Baseline | Target | Measurement                          |
| -------------------------- | ---------------- | ------ | ------------------------------------ |
| Overall Accuracy           | TBD              | > 90%  | Correct decisions / total cases      |
| Context Distinction        | ~60%*            | > 90%  | Same-keyword-different-context cases |
| UPDATE/LINK Accuracy       | ~70%*            | > 85%  | Correct UPDATE vs LINK decisions     |
| Realistic Data Performance | N/A              | > 85%  | Accuracy on noisy, realistic emails  |

*Estimated based on current limitations

Production Readiness Metrics

  • Precision: False positive rate < 5% (marking unrelated as related)

  • Recall: False negative rate < 10% (missing true relationships)

  • Latency: Decision time < 2 seconds per memory

  • Storage Efficiency: Version storage < 2x original content size

Risk Mitigation

Risk 1: LLM-based Analysis Latency

Risk: Semantic decomposition and property change analysis require LLM calls, adding latency

Mitigation:

  • Run independent analyses in parallel and stream results where possible

  • Cache decomposition results

  • Use fast model (GPT-4o-mini or Claude Haiku) for analysis

  • Implement timeout fallbacks to embedding-only similarity (sketched below)
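
A minimal sketch of the timeout fallback, assuming a generic analysis promise; the 2-second budget echoes the latency target above, and llmContextualScore is a hypothetical call.

// Race the LLM analysis against a timer; on timeout, fall back to the
// embedding-only score so the pipeline never blocks on the LLM.
function withTimeout<T>(work: Promise<T>, ms: number, fallback: T): Promise<T> {
  const timer = new Promise<T>((resolve) =>
    setTimeout(() => resolve(fallback), ms)
  );
  return Promise.race([work, timer]);
}

// Usage sketch:
// const score = await withTimeout(llmContextualScore(a, b), 2000, embeddingScore);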

Risk 2: LLM Analysis Cost

Risk: Per-memory LLM calls increase operational cost

Mitigation:

  • Use LLM only for UPDATE/LINK boundary cases (similarity 0.50-0.95); see the gating sketch after this list

  • Batch process multiple comparisons in single prompt

  • Use cheaper models for decomposition

  • Implement result caching
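
A sketch of the gating idea: pay for an LLM call only inside the ambiguous band, with a simple in-memory result cache. The thresholds come from this document; the function shapes are assumptions.

const llmScoreCache = new Map<string, number>();

// Clear CREATE (< 0.50) and duplicate (>= 0.95) cases are decided from the
// embedding score alone; only the ambiguous band pays for an LLM call.
async function scoreWithGating(
  pairKey: string,
  embeddingScore: number,
  llmScore: () => Promise<number> // hypothetical LLM comparison
): Promise<number> {
  if (embeddingScore < 0.5 || embeddingScore >= 0.95) return embeddingScore;
  const cached = llmScoreCache.get(pairKey);
  if (cached !== undefined) return cached;
  const refined = await llmScore();
  llmScoreCache.set(pairKey, refined);
  return refined;
}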

Risk 3: Context Analysis Accuracy

Risk: LLM may incorrectly judge context similarity

Mitigation:

  • Use few-shot examples in prompts

  • Implement confidence thresholds (low confidence → fallback)

  • Human review loop for low-confidence decisions

  • Continuously improve prompt based on failure analysis

Risk 4: Version Reconstruction Failure

Risk: Source emails may be deleted or inaccessible

Mitigation:

  • Keep full content for latest N versions (configurable)

  • Store change summaries for all versions

  • Graceful degradation (show "content unavailable")

  • User notification when source becomes unavailable

Open Questions

Q1: LLM Prompt Design

Question: What prompt structure gives best semantic decomposition accuracy?

Investigation Needed:

  • Test zero-shot vs few-shot prompts

  • Compare structured output (JSON) vs natural language (candidate prompt sketched below)

  • Evaluate different LLM models (GPT-4, Claude, etc.)
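
One candidate worth testing first: a few-shot prompt that requests structured JSON output. The schema and example below are assumptions for the spike, not a validated prompt.

// Hypothetical few-shot prompt for semantic decomposition (JSON output).
const DECOMPOSITION_PROMPT = `Decompose the memory into JSON with keys
"topic", "domain", "entities", and "intent".

Example:
Input: "회사 성장 전략 회의: 매출 증대 방안 논의" (company growth strategy meeting)
Output: {"topic": "growth strategy", "domain": "sales",
         "entities": ["company"], "intent": "discuss revenue growth plans"}

Input: {{memory_content}}
Output:`;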

Q2: Version Retention Policy

Question: For how many versions should full content be kept before pruning?

Investigation Needed:

  • Analyze typical update frequency per memory

  • Calculate storage cost projections

  • Survey user needs for version access

Q3: Dynamic Weight Tuning

Question: Should weights be learned from data or rule-based?

Investigation Needed:

  • Experiment with fixed rules vs ML-based weight prediction

  • Evaluate interpretability vs accuracy trade-off

  • Consider online learning from user feedback

Q4: Embedding Model Selection

Question: Which embedding model optimizes for Korean semantic similarity?

Investigation Needed:

  • Benchmark OpenAI, Cohere, multilingual models

  • Test on Korean-specific context distinction cases

  • Evaluate latency vs accuracy trade-off

Related Documentation

  • Decision Tree Logic - Current decision flow

  • Experiment Guide - TEN-311 methodology

  • Similarity Types - Current similarity system

  • Realistic Dataset Design - Improvement #1

  • Contextual Similarity - Improvement #2

  • UPDATE vs LINK Distinction - Improvement #3

  • Version History Management - Improvement #4

Next Steps

  1. Review and Approve: Stakeholder review of improvement plan

  2. Create Detailed Designs: Complete detailed design documents (linked above)

  3. Spike: LLM Analysis: Quick spike to validate LLM-based approaches

  4. Phase 1 Execution: Begin with realistic dataset creation

  5. Iterative Validation: Execute phases 2-4 with continuous validation

Change Log

| Date       | Author | Change                           |
| ---------- | ------ | -------------------------------- |
| 2025-12-06 | Claude | Initial improvement plan created |