Dec 7, 2025
Similarity Types
Code: lib/types/similarity.ts
Phase: 3.2 - Similarity Search
Overview
Similarity types define how we classify and compare memories based on their semantic similarity. This is crucial for:
Deduplication: Avoiding duplicate memories from similar emails
Updates: Recognizing when new information updates an existing memory
Relationships: Linking related but distinct memories
SimilarityCategory
The result of comparing two memories falls into one of four categories:
Visual Threshold Map
Category Actions
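| Category | Score Range | Action | Example |
|---|---|---|---|
| Duplicate | > 0.95 | Skip or merge | "미팅 3시" (meeting at 3) vs "미팅 3시 변경" (meeting at 3, changed) |
| Update | 0.80-0.95 | Update existing | "미팅 2시" (meeting at 2) → "미팅 3시로 변경" (changed to meeting at 3) |
| Related | 0.50-0.80 | Create + link | "Q4 예산 회의" (Q4 budget meeting) ↔ "예산 승인 결과" (budget approval result) |
| Below Related | < 0.50 | Create new | "미팅 일정" (meeting schedule) vs "점심 메뉴" (lunch menu) |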
Threshold Constants
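As a rough sketch of how the three thresholds could be expressed and applied, the snippet below defines them as constants and maps a score to a category. The names `SIMILARITY_THRESHOLDS` and `classifySimilarity`, and the lowercase category labels, are assumptions made for illustration; the actual definitions live in lib/types/similarity.ts. The boundary handling is also an assumption: "duplicate" is strictly above 0.95, and the lower bound of each other band is inclusive.

```typescript
// Hypothetical names; the numeric values come from the table in this document.
const SIMILARITY_THRESHOLDS = {
  DUPLICATE: 0.95,
  UPDATE: 0.8,
  RELATED: 0.5,
} as const;

// Assumed labels; the fourth ("new") stands for scores below RELATED.
type SimilarityCategory = "duplicate" | "update" | "related" | "new";

// Assumption: "> 0.95" is strict, the other lower bounds are inclusive.
function classifySimilarity(score: number): SimilarityCategory {
  if (score > SIMILARITY_THRESHOLDS.DUPLICATE) return "duplicate";
  if (score >= SIMILARITY_THRESHOLDS.UPDATE) return "update";
  if (score >= SIMILARITY_THRESHOLDS.RELATED) return "related";
  return "new";
}
```

Under these assumptions a score of exactly 0.95 is treated as an update, not a duplicate, which errs on the side of not skipping content.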
Why These Values?
0.95 (Duplicate): High threshold to avoid false positives. Only nearly identical content should be skipped.
0.80 (Update): Medium-high threshold for information that updates existing knowledge.
0.50 (Related): Medium threshold captures topically related but distinct information.
These values are tuned for OpenAI's text-embedding-3-small model with L2 normalization.
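To make the role of L2 normalization concrete: if the similarity score is cosine similarity (an assumption; this document does not name the metric), then for unit-length vectors it reduces to a plain dot product. The helper names below are hypothetical and only illustrate the arithmetic.

```typescript
// Scale a vector to unit length (L2 norm of 1).
function l2Normalize(v: number[]): number[] {
  const norm = Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
  if (norm === 0) return v.slice(); // the zero vector is returned unchanged
  return v.map((x) => x / norm);
}

// Dot product of two vectors of equal length.
function dotProduct(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  return a.reduce((sum, x, i) => sum + x * (b[i] ?? 0), 0);
}

// Cosine similarity computed as the dot product of the normalized vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  return dotProduct(l2Normalize(a), l2Normalize(b));
}
```

If embeddings are stored already normalized, only the dot product is needed at query time.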
Multi-Factor Similarity
Beyond raw embedding similarity, we evaluate multiple factors:
Factor Weights
Example Calculation
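As an illustration of how a weighted combination of factors could be computed, the sketch below averages four factor scores. The factor names are taken from the signals discussed in this document, but the interface, the function name, and especially the weights are placeholders for illustration, not the project's tuned values.

```typescript
// Hypothetical factor set; each score is expected to lie in [0, 1].
interface SimilarityFactors {
  content: number;      // embedding similarity
  thread: number;       // 1 if both memories share a thread, else 0
  participants: number; // overlap between the people involved
  subject: number;      // subject-line similarity
}

// PLACEHOLDER weights for illustration only; not the values used in the codebase.
const EXAMPLE_WEIGHTS: SimilarityFactors = {
  content: 0.6,
  thread: 0.2,
  participants: 0.1,
  subject: 0.1,
};

const FACTOR_KEYS = ["content", "thread", "participants", "subject"] as const;

// Weighted average: sum(score * weight) / sum(weight).
function weightedSimilarity(
  factors: SimilarityFactors,
  weights: SimilarityFactors = EXAMPLE_WEIGHTS,
): number {
  let total = 0;
  let weightSum = 0;
  for (const key of FACTOR_KEYS) {
    total += factors[key] * weights[key];
    weightSum += weights[key];
  }
  return weightSum === 0 ? 0 : total / weightSum;
}
```

With these placeholder weights, factors of content 0.85, thread 1, participants 0.5, and subject 0.4 give 0.51 + 0.20 + 0.05 + 0.04, or about 0.80. Dividing by the weight sum keeps the result in [0, 1] even if the weights do not add up to exactly 1.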
Result Types
SimilarityResult
Single comparison result:
SimilaritySearchResult
Full search response:
MemoryComparisonResult
Detailed comparison between two memories:
SimilarityAction
Recommended action based on comparison:
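A minimal sketch of how a comparison result might map to a recommended action, following the Category Actions table. The field names, the action labels, and the `recommendAction` helper are all hypothetical; only the type names `SimilarityResult` and `SimilarityAction` and the category-to-action pairing come from this document.

```typescript
// Assumed category labels (the fourth covers scores below the Related threshold).
type SimilarityCategory = "duplicate" | "update" | "related" | "new";

// Hypothetical action labels mirroring the Action column of the table.
type SimilarityAction =
  | "skip_or_merge"
  | "update_existing"
  | "create_and_link"
  | "create_new";

// Hypothetical shape of a single comparison result.
interface SimilarityResult {
  memoryId: string; // the existing memory that was compared against
  score: number;    // similarity score in [0, 1]
  category: SimilarityCategory;
}

const ACTION_BY_CATEGORY: Record<SimilarityCategory, SimilarityAction> = {
  duplicate: "skip_or_merge",
  update: "update_existing",
  related: "create_and_link",
  new: "create_new",
};

function recommendAction(result: SimilarityResult): SimilarityAction {
  return ACTION_BY_CATEGORY[result.category];
}
```

Typing the lookup as a `Record` over the category union means the compiler flags a missing entry if a category is ever added.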
Design Decisions
Why Multi-Factor?
Pure embedding similarity misses contextual signals:
Same thread but different topics → should be linked
Same people discussing different projects → may be related
Same subject line but different senders → different contexts
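The contextual signals above could be turned into numeric factors along these lines. This is only a sketch: the `EmailContext` shape and both helper names are hypothetical, and Jaccard overlap is one reasonable choice for comparing participant sets, not necessarily the one used in the codebase.

```typescript
// Hypothetical minimal context attached to a memory.
interface EmailContext {
  threadId: string;
  participants: string[]; // email addresses
}

// Binary signal: 1 when both memories come from the same thread.
function threadFactor(a: EmailContext, b: EmailContext): number {
  return a.threadId === b.threadId ? 1 : 0;
}

// Jaccard overlap of participants: |A ∩ B| / |A ∪ B|, case-insensitive.
function participantOverlap(a: string[], b: string[]): number {
  const setA = new Set(a.map((p) => p.toLowerCase()));
  const setB = new Set(b.map((p) => p.toLowerCase()));
  if (setA.size === 0 && setB.size === 0) return 0;
  let shared = 0;
  setA.forEach((p) => {
    if (setB.has(p)) shared += 1;
  });
  return shared / (setA.size + setB.size - shared);
}
```

Two messages with participants {a, b} and {b, c} share one of three distinct people, so the overlap is 1/3.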
Why Weighted Average?
Different factors have different reliability:
Content embedding is most reliable for semantic meaning
Thread ID is binary but highly informative
Subject lines can be misleading (Re: Re: Re:...)
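One way to reduce the noise from reply chains before comparing subject lines is to strip the leading reply and forward prefixes. The helper below is a hypothetical sketch; it handles only the English "Re:", "Fw:", and "Fwd:" prefixes, and other locales would need additional patterns.

```typescript
// Remove any run of leading "Re:" / "Fw:" / "Fwd:" prefixes, then trim and lowercase.
function normalizeSubject(subject: string): string {
  return subject
    .replace(/^(\s*(re|fwd?)\s*:\s*)+/i, "")
    .trim()
    .toLowerCase();
}

// Binary subject signal: 1 when the normalized subjects match exactly.
function subjectFactor(a: string, b: string): number {
  return normalizeSubject(a) === normalizeSubject(b) ? 1 : 0;
}
```

For example, "Re: Re: Re: Q4 budget" and "Fwd: Q4 Budget" both normalize to "q4 budget". A word that merely starts with "re", such as "Regarding", is left intact because the pattern requires a colon.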
Threshold Tuning
Thresholds may need adjustment based on:
Domain-specific vocabulary density
Average email length
Language (Korean vs English embedding behavior)
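If thresholds do need adjusting per deployment or per language, one option is to keep the defaults and apply validated overrides. The sketch below is hypothetical (the `Thresholds` shape and `withThresholdOverrides` are not taken from the codebase); it only enforces that the three values stay ordered and within [0, 1].

```typescript
interface Thresholds {
  duplicate: number;
  update: number;
  related: number;
}

// Defaults taken from the values listed in this document.
const DEFAULT_THRESHOLDS: Thresholds = { duplicate: 0.95, update: 0.8, related: 0.5 };

// Merge overrides into the base thresholds and reject inconsistent results.
function withThresholdOverrides(
  overrides: Partial<Thresholds>,
  base: Thresholds = DEFAULT_THRESHOLDS,
): Thresholds {
  const merged: Thresholds = { ...base, ...overrides };
  const ordered =
    merged.related >= 0 &&
    merged.related < merged.update &&
    merged.update < merged.duplicate &&
    merged.duplicate <= 1;
  if (!ordered) {
    throw new Error("thresholds must satisfy 0 <= related < update < duplicate <= 1");
  }
  return merged;
}
```

A call such as `withThresholdOverrides({ update: 0.85 })` raises only the update threshold, while an override that breaks the ordering (for example `related: 0.9`) throws instead of silently producing overlapping bands.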
Related Documentation
Memory Node - Memory structure with embedding field
Embedding Module - Embedding generation
Vector Index - Storage and search