02-types-similarity

Dec 7, 2025

Similarity Types

Code: lib/types/similarity.ts
Phase: 3.2 - Similarity Search

Overview

Similarity types define how memories are compared and classified by semantic similarity. The classification drives three behaviors:

Deduplication: Avoiding duplicate memories from similar emails
Updates: Recognizing when new information updates an existing memory
Relationships: Linking related but distinct memories

SimilarityCategory

The result of comparing two memories falls into one of four categories:

enum SimilarityCategory {
  DUPLICATE = 'duplicate',  // > 0.95
  UPDATE = 'update',        // 0.80 - 0.95
  RELATED = 'related',      // 0.50 - 0.80
  UNRELATED = 'unrelated',  // < 0.50
}

Visual Threshold Map

Score: 0.0 ─────────────────────────────────────────────────────── 1.0
       │                    │                    │              │
       │     UNRELATED      │      RELATED       │    UPDATE    │ DUP
       │    (create new)    │   (link related)   │   (update)   │(skip)
       │                    │                    │              │
       └────────────────────┴────────────────────┴──────────────┴────
                          0.50                  0.80           0.95

Category Actions

Category	Score Range	Action	Example
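`DUPLICATE`	> 0.95	Skip or merge	"미팅 3시" vs "미팅 3시 변경" ("Meeting at 3" vs "Meeting at 3, changed")
`UPDATE`	0.80-0.95	Update existing	"미팅 2시" → "미팅 3시로 변경" ("Meeting at 2" → "Meeting changed to 3")
`RELATED`	0.50-0.80	Create + link	"Q4 예산 회의" ↔ "예산 승인 결과" ("Q4 budget meeting" ↔ "Budget approval result")
`UNRELATED`	< 0.50	Create new	"미팅 일정" vs "점심 메뉴" ("Meeting schedule" vs "Lunch menu")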

Threshold Constants

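const SimilarityThreshold = {
  DUPLICATE: 0.95,  // scores above this are duplicates
  UPDATE: 0.80,     // lower bound of the update range
  RELATED: 0.50,    // lower bound of the related range
  UNRELATED: 0.50,  // scores below this are unrelated (same boundary as RELATED)
} as const;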
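The thresholds translate into a small classification function. The following is a minimal sketch, not the code in lib/types/similarity.ts: `categorizeScore` is a hypothetical helper, the category names are written as string literals matching the enum values, and treating 0.80 and 0.50 as inclusive lower bounds is an assumption, since the documented ranges do not say which side owns those boundary values.

```typescript
// String literals mirror the SimilarityCategory enum values.
type Category = 'duplicate' | 'update' | 'related' | 'unrelated';

// Local copy of the documented thresholds so the snippet is self-contained.
const SimilarityThreshold = {
  DUPLICATE: 0.95,
  UPDATE: 0.80,
  RELATED: 0.50,
} as const;

// Hypothetical helper: maps a similarity score to a category.
// Assumption: DUPLICATE is strictly above 0.95; the other lower bounds are inclusive.
function categorizeScore(score: number): Category {
  if (Number.isNaN(score)) {
    throw new RangeError('score must be a number');
  }
  if (score > SimilarityThreshold.DUPLICATE) return 'duplicate';
  if (score >= SimilarityThreshold.UPDATE) return 'update';
  if (score >= SimilarityThreshold.RELATED) return 'related';
  return 'unrelated';
}
```

Checking from the highest threshold downward keeps each branch to a single comparison and makes the boundary policy visible in one place.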

Why These Values?

0.95 (Duplicate): High threshold to avoid false positives. Only nearly identical content should be skipped.
0.80 (Update): Medium-high threshold. Content this close usually restates an existing memory with one detail changed, such as a new meeting time.
0.50 (Related): Medium threshold captures topically related but distinct information.

These values are tuned for OpenAI's text-embedding-3-small model with L2 normalization.
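For L2-normalized (unit-length) vectors, cosine similarity reduces to a plain dot product. A minimal sketch with hypothetical helper names (`l2Normalize`, `dotProduct`), using short toy vectors in place of real embeddings:

```typescript
// Scales a vector to unit length (L2 normalization).
function l2Normalize(v: number[]): number[] {
  const norm = Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
  if (norm === 0) throw new Error('cannot normalize a zero vector');
  return v.map((x) => x / norm);
}

// Dot product; for unit-length inputs this equals cosine similarity.
function dotProduct(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('dimension mismatch');
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    sum += a[i] * b[i];
  }
  return sum;
}

// Toy example: two 2-dimensional vectors pointing in similar directions.
const vecA = l2Normalize([3, 4]); // approximately [0.6, 0.8]
const vecB = l2Normalize([4, 3]); // approximately [0.8, 0.6]
const contentScore = dotProduct(vecA, vecB); // approximately 0.96
```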

Multi-Factor Similarity

Beyond raw embedding similarity, we evaluate multiple factors:

interface SimilarityFactors {
  content: number;   // Embedding similarity (primary signal)
  people: number;    // Participant overlap
  threadId: number;  // 1 if both memories share a thread, else 0
  subject: number;   // Subject line match
  entities: number;  // Shared references
}
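How each non-content factor is computed is not specified here. As one possible reading, participant overlap could be a Jaccard index over addresses, and the thread factor a 0/1 match. Both functions below are hypothetical illustrations under that assumption, not the project's implementation:

```typescript
// Hypothetical: Jaccard overlap of participant addresses, in [0, 1].
// Addresses are compared case-insensitively.
function peopleOverlap(a: string[], b: string[]): number {
  const setA = new Set(a.map((p) => p.toLowerCase()));
  const setB = new Set(b.map((p) => p.toLowerCase()));
  if (setA.size === 0 && setB.size === 0) return 0;
  let shared = 0;
  setA.forEach((p) => {
    if (setB.has(p)) shared += 1;
  });
  // |A ∩ B| / |A ∪ B|
  return shared / (setA.size + setB.size - shared);
}

// Hypothetical: binary thread factor. Missing thread ids never match.
function threadMatch(a?: string, b?: string): number {
  return a !== undefined && a === b ? 1 : 0;
}
```

A Jaccard index penalizes large recipient lists that share only one address, which may or may not be the desired behavior for broadcast emails.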

Factor Weights

const SimilarityFactorWeights = {
  content: 0.50,   // 50% - Semantic meaning
  people: 0.15,    // 15% - Who's involved
  threadId: 0.15,  // 15% - Conversation context
  subject: 0.10,   // 10% - Topic indicator
  entities: 0.10,  // 10% - Shared references
};

Example Calculation

Memory A: "프로젝트 킥오프 미팅 1월 15일" ("Project kickoff meeting, January 15")
Memory B: "프로젝트 킥오프 미팅 시간 변경 (3시로)" ("Project kickoff meeting time changed (to 3 o'clock)")

Factors:
  content:  0.85 × 0.50 = 0.425
  people:   1.00 × 0.15 = 0.150  (same participants)
  threadId: 1.00 × 0.15 = 0.150  (same thread)
  subject:  0.90 × 0.10 = 0.090  (similar subject)
  entities: 0.80 × 0.10 = 0.080  (same project)
  ────────────────────────────────
  Total:                  0.895  → UPDATE category
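The calculation above is a weighted sum of the five factors. A minimal sketch that reproduces it, where `weightedScore` is a hypothetical helper and the interface and weights are repeated so the snippet is self-contained:

```typescript
interface SimilarityFactors {
  content: number;
  people: number;
  threadId: number;
  subject: number;
  entities: number;
}

// Documented weights; they sum to 1.0.
const SimilarityFactorWeights: SimilarityFactors = {
  content: 0.50,
  people: 0.15,
  threadId: 0.15,
  subject: 0.10,
  entities: 0.10,
};

// Hypothetical helper: sum of factor score × factor weight.
function weightedScore(factors: SimilarityFactors): number {
  const keys = Object.keys(SimilarityFactorWeights) as Array<keyof SimilarityFactors>;
  return keys.reduce((sum, key) => sum + factors[key] * SimilarityFactorWeights[key], 0);
}

// Factor scores from the kickoff-meeting example.
const exampleTotal = weightedScore({
  content: 0.85,
  people: 1.0,
  threadId: 1.0,
  subject: 0.9,
  entities: 0.8,
});

console.log(exampleTotal.toFixed(3)); // "0.895"
```

Because the weights sum to 1.0, the total stays in the same range as the individual factor scores, so the category thresholds can be applied to it directly.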

Result Types

SimilarityResult

Single comparison result:

interface SimilarityResult {
  memoryId: string;
  score: number;
  category: SimilarityCategory;
  factors?: SimilarityFactors;
}

SimilaritySearchResult

Full search response:

interface SimilaritySearchResult {
  query: string;
  results: SimilarityResult[];
  distribution: {
    duplicate: number;
    update: number;
    related: number;
    unrelated: number;
    total: number;
  };
  metadata: {
    searchTimeMs: number;
    topK: number;
    minScore: number;
  };
}
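The `distribution` field can be derived from `results` in a single pass. A minimal sketch, assuming `total` counts all returned results; `buildDistribution` is a hypothetical helper, and categories are written as string literals matching the enum values:

```typescript
type Category = 'duplicate' | 'update' | 'related' | 'unrelated';

// Trimmed-down result shape; only the fields this sketch needs.
interface ScoredResult {
  memoryId: string;
  score: number;
  category: Category;
}

interface Distribution {
  duplicate: number;
  update: number;
  related: number;
  unrelated: number;
  total: number;
}

// Hypothetical helper: counts results per category.
function buildDistribution(results: ScoredResult[]): Distribution {
  const distribution: Distribution = {
    duplicate: 0,
    update: 0,
    related: 0,
    unrelated: 0,
    total: results.length,
  };
  for (const result of results) {
    distribution[result.category] += 1;
  }
  return distribution;
}
```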

MemoryComparisonResult

Detailed comparison between two memories:

interface MemoryComparisonResult {
  memoryAId: string;
  memoryBId: string;
  overallScore: number;
  category: SimilarityCategory;
  factors: SimilarityFactors;
  dominantFactors: Array<{
    factor: keyof SimilarityFactors;
    contribution: number;
    score: number;
  }>;
  recommendedAction: SimilarityAction;
}
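One way to populate `dominantFactors` is to rank factors by their weighted contribution. The sketch below assumes `contribution` means score × weight, which this page does not state explicitly; `rankDominantFactors` and its `topN` parameter are hypothetical.

```typescript
interface SimilarityFactors {
  content: number;
  people: number;
  threadId: number;
  subject: number;
  entities: number;
}

const SimilarityFactorWeights: SimilarityFactors = {
  content: 0.50,
  people: 0.15,
  threadId: 0.15,
  subject: 0.10,
  entities: 0.10,
};

interface DominantFactor {
  factor: keyof SimilarityFactors;
  contribution: number; // Assumption: score × weight
  score: number;        // Raw factor score
}

// Hypothetical helper: returns the topN factors by weighted contribution.
function rankDominantFactors(factors: SimilarityFactors, topN = 2): DominantFactor[] {
  const keys = Object.keys(SimilarityFactorWeights) as Array<keyof SimilarityFactors>;
  return keys
    .map((factor) => ({
      factor,
      score: factors[factor],
      contribution: factors[factor] * SimilarityFactorWeights[factor],
    }))
    .sort((a, b) => b.contribution - a.contribution)
    .slice(0, topN);
}
```

Ranking by contribution instead of raw score keeps a perfect but lightly weighted factor from outranking the content embedding.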

SimilarityAction

Recommended action based on comparison:

type SimilarityAction =
  | { type: 'skip'; reason: string }
  | { type: 'merge'; targetId: string; reason: string }
  | { type: 'update'; targetId: string; reason: string }
  | { type: 'link'; targetId: string; relationshipType: string; reason: string }
  | { type: 'create'; reason: string };
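The Category Actions table suggests a direct mapping from category to action. A minimal sketch of that mapping, with several assumptions: `recommendAction` is a hypothetical helper, duplicates are skipped (the table lists "Skip or merge" and this sketch picks skip), and the `relationshipType` value `'related'` and the reason strings are placeholders.

```typescript
type Category = 'duplicate' | 'update' | 'related' | 'unrelated';

type SimilarityAction =
  | { type: 'skip'; reason: string }
  | { type: 'merge'; targetId: string; reason: string }
  | { type: 'update'; targetId: string; reason: string }
  | { type: 'link'; targetId: string; relationshipType: string; reason: string }
  | { type: 'create'; reason: string };

// Hypothetical helper: picks an action for a comparison against an existing memory.
function recommendAction(category: Category, targetId: string): SimilarityAction {
  switch (category) {
    case 'duplicate':
      return { type: 'skip', reason: 'near-identical memory already exists' };
    case 'update':
      return { type: 'update', targetId, reason: 'new information revises an existing memory' };
    case 'related':
      return { type: 'link', targetId, relationshipType: 'related', reason: 'topically related memory' };
    case 'unrelated':
      return { type: 'create', reason: 'no similar memory found' };
    default: {
      // Compile-time exhaustiveness check: fails to type-check if a category is added.
      const unreachable: never = category;
      throw new Error(`unknown category: ${String(unreachable)}`);
    }
  }
}
```

The discriminated union lets callers narrow on `type` before reading `targetId`, which exists only on the merge, update, and link variants.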

Design Decisions

Why Multi-Factor?

Pure embedding similarity misses contextual signals:

Same thread but different topics → should be linked
Same people discussing different projects → may be related
Same subject line but different senders → different contexts

Why Weighted Average?

Different factors have different reliability:

Content embedding is most reliable for semantic meaning
Thread ID is binary but highly informative
Subject lines can be misleading (Re: Re: Re:...)

Threshold Tuning

Thresholds may need adjustment based on:

Domain-specific vocabulary density
Average email length
Language (Korean vs English embedding behavior)
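If thresholds are adjusted per domain or language, it helps to validate that they stay ordered. A minimal sketch, where `withThresholds` is a hypothetical helper and the override value is purely illustrative, not a measured recommendation:

```typescript
interface Thresholds {
  DUPLICATE: number;
  UPDATE: number;
  RELATED: number;
}

// Documented defaults.
const DEFAULT_THRESHOLDS: Thresholds = { DUPLICATE: 0.95, UPDATE: 0.80, RELATED: 0.50 };

// Hypothetical helper: merges overrides into the defaults and checks ordering.
function withThresholds(overrides: Partial<Thresholds>): Thresholds {
  const merged: Thresholds = { ...DEFAULT_THRESHOLDS, ...overrides };
  const ordered =
    0 <= merged.RELATED &&
    merged.RELATED < merged.UPDATE &&
    merged.UPDATE < merged.DUPLICATE &&
    merged.DUPLICATE <= 1;
  if (!ordered) {
    throw new RangeError('thresholds must satisfy 0 <= RELATED < UPDATE < DUPLICATE <= 1');
  }
  return merged;
}

// Illustrative override only; 0.85 is an arbitrary value, not a tuned one.
const tunedThresholds = withThresholds({ UPDATE: 0.85 });
```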