02-types-similarity

Dec 7, 2025

Similarity Types

Code: lib/types/similarity.ts

Phase: 3.2 - Similarity Search

Overview

Similarity types define how we classify and compare memories based on their semantic similarity. This is crucial for:

  • Deduplication: Avoiding duplicate memories from similar emails

  • Updates: Recognizing when new information updates existing memory

  • Relationships: Linking related but distinct memories

SimilarityCategory

The result of comparing two memories falls into one of four categories:

enum SimilarityCategory {
  DUPLICATE = 'duplicate',  // > 0.95
  UPDATE = 'update',        // 0.80 - 0.95
  RELATED = 'related',      // 0.50 - 0.80
  UNRELATED = 'unrelated',  // < 0.50
}

Visual Threshold Map

Score: 0.0 ───────────────────────────────────────────────────────────────── 1.0

            UNRELATED          RELATED           UPDATE           DUPLICATE
          (create new)     (create + link)  (update existing)  (skip or merge)

       └─────────────────┴─────────────────┴─────────────────┴─────────────────┘
                       0.50              0.80              0.95

Category Actions

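| Category  | Score Range | Action          | Example |
| --------- | ----------- | --------------- | ------- |
| DUPLICATE | > 0.95      | Skip or merge   | "미팅 3시" (Meeting at 3) vs "미팅 3시 변경" (Meeting changed to 3) |
| UPDATE    | 0.80 - 0.95 | Update existing | "미팅 2시" (Meeting at 2) → "미팅 3시로 변경" (Meeting changed to 3) |
| RELATED   | 0.50 - 0.80 | Create + link   | "Q4 예산 회의" (Q4 budget meeting) ↔ "예산 승인 결과" (Budget approval result) |
| UNRELATED | < 0.50      | Create new      | "미팅 일정" (Meeting schedule) vs "점심 메뉴" (Lunch menu) |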

Threshold Constants

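const SimilarityThreshold = {
  DUPLICATE: 0.95,  // Scores above this are duplicates
  UPDATE: 0.80,     // Lower bound of the update range
  RELATED: 0.50,    // Lower bound of the related range
  UNRELATED: 0.50,  // Scores below this are unrelated
} as const;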
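
The thresholds above can be turned into a classifier along these lines. This is a minimal sketch: `categorizeScore` is a hypothetical helper name, and the handling of scores that land exactly on a boundary (0.95 counts as UPDATE, 0.80 as UPDATE, 0.50 as RELATED) is an assumption read off the range comments, not something the source confirms.

```typescript
// Mirrors the enum and constants documented above.
enum SimilarityCategory {
  DUPLICATE = 'duplicate',
  UPDATE = 'update',
  RELATED = 'related',
  UNRELATED = 'unrelated',
}

const SimilarityThreshold = {
  DUPLICATE: 0.95,
  UPDATE: 0.80,
  RELATED: 0.50,
  UNRELATED: 0.50,
} as const;

// Hypothetical helper: maps a score in [0, 1] to a category.
// Boundary handling (strict ">" for DUPLICATE, ">=" elsewhere) is assumed.
function categorizeScore(score: number): SimilarityCategory {
  if (score > SimilarityThreshold.DUPLICATE) return SimilarityCategory.DUPLICATE;
  if (score >= SimilarityThreshold.UPDATE) return SimilarityCategory.UPDATE;
  if (score >= SimilarityThreshold.RELATED) return SimilarityCategory.RELATED;
  return SimilarityCategory.UNRELATED;
}
```

Checking from the highest threshold downward keeps each branch to a single comparison and makes the ranges mutually exclusive.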

Why These Values?

  • 0.95 (Duplicate): High threshold to avoid false positives. Only nearly identical content should be skipped.

  • 0.80 (Update): Medium-high threshold for information that updates existing knowledge.

  • 0.50 (Related): Medium threshold captures topically related but distinct information.

These values are tuned for OpenAI's text-embedding-3-small model with L2 normalization.
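
To illustrate why L2 normalization matters: once two vectors are normalized to unit length, their cosine similarity is just their dot product. The sketch below assumes the content score is a cosine similarity, which this document does not state explicitly; `l2Normalize`, `dot`, and `cosineSimilarity` are hypothetical helper names.

```typescript
// Scales a vector to unit length. A zero vector is returned unchanged.
function l2Normalize(v: number[]): number[] {
  const norm = Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
  if (norm === 0) return v.slice();
  return v.map((x) => x / norm);
}

// Plain dot product of two equal-length vectors.
function dot(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error('Vectors must have the same length');
  }
  return a.reduce((sum, x, i) => sum + x * b[i], 0);
}

// For unit-length vectors the dot product equals the cosine similarity.
function cosineSimilarity(a: number[], b: number[]): number {
  return dot(l2Normalize(a), l2Normalize(b));
}
```

If the stored embeddings are already unit length, the normalization step can be skipped and `dot` alone gives the score.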

Multi-Factor Similarity

Beyond raw embedding similarity, we evaluate multiple factors:

interface SimilarityFactors {
  content: number;   // Embedding similarity (primary)
  people: number;    // Participant overlap
  threadId: number;  // Same conversation?
  subject: number;   // Subject line match
  entities: number;  // Shared references
}

Factor Weights

const SimilarityFactorWeights = {
  content: 0.50,   // 50% - Semantic meaning
  people: 0.15,    // 15% - Who's involved
  threadId: 0.15,  // 15% - Conversation context
  subject: 0.10,   // 10% - Topic indicator
  entities: 0.10,  // 10% - Shared references
};

Example Calculation

Memory A: "프로젝트 킥오프 미팅 1월 15일" (Project kickoff meeting, January 15)
Memory B: "프로젝트 킥오프 미팅 시간 변경 (3시로)" (Project kickoff meeting time changed, to 3 o'clock)

Factors:
  content:  0.85 × 0.50 = 0.425
  people:   1.00 × 0.15 = 0.150  (same participants)
  threadId: 1.00 × 0.15 = 0.150  (same thread)
  subject:  0.90 × 0.10 = 0.090  (similar subject)
  entities: 0.80 × 0.10 = 0.080  (same project)
  ────────────────────────────────
  Total:                  0.895  → UPDATE category (0.80 - 0.95)
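
The calculation above can be reproduced with a short weighted sum. This is a sketch under the assumption that the overall score is the plain sum of factor × weight, as the worked example suggests; `weightedScore` is a hypothetical function name.

```typescript
interface SimilarityFactors {
  content: number;
  people: number;
  threadId: number;
  subject: number;
  entities: number;
}

// Weights as documented above; they sum to 1.0.
const SimilarityFactorWeights: SimilarityFactors = {
  content: 0.50,
  people: 0.15,
  threadId: 0.15,
  subject: 0.10,
  entities: 0.10,
};

// Hypothetical helper: sums each factor score multiplied by its weight.
function weightedScore(factors: SimilarityFactors): number {
  const keys = Object.keys(SimilarityFactorWeights) as Array<keyof SimilarityFactors>;
  return keys.reduce(
    (total, key) => total + factors[key] * SimilarityFactorWeights[key],
    0,
  );
}

// Factor scores from the worked example.
const exampleFactors: SimilarityFactors = {
  content: 0.85,
  people: 1.0,
  threadId: 1.0,
  subject: 0.9,
  entities: 0.8,
};

console.log(weightedScore(exampleFactors).toFixed(3)); // "0.895"
```

Because the weights sum to 1.0, the result stays in the same 0 to 1 range as the individual factors, so the same thresholds can be applied to it.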

Result Types

SimilarityResult

Single comparison result:

interface SimilarityResult {
  memoryId: string;
  score: number;
  category: SimilarityCategory;
  factors?: SimilarityFactors;
}

SimilaritySearchResult

Full search response:

interface SimilaritySearchResult {
  query: string;
  results: SimilarityResult[];
  distribution: {
    duplicate: number;
    update: number;
    related: number;
    unrelated: number;
    total: number;
  };
  metadata: {
    searchTimeMs: number;
    topK: number;
    minScore: number;
  };
}
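
As an illustration of how the `distribution` field might be filled in, the sketch below counts results per category. `buildDistribution` is a hypothetical helper; the source does not say where or how the counts are actually computed.

```typescript
enum SimilarityCategory {
  DUPLICATE = 'duplicate',
  UPDATE = 'update',
  RELATED = 'related',
  UNRELATED = 'unrelated',
}

// Trimmed copy of SimilarityResult (optional factors omitted).
interface SimilarityResult {
  memoryId: string;
  score: number;
  category: SimilarityCategory;
}

interface Distribution {
  duplicate: number;
  update: number;
  related: number;
  unrelated: number;
  total: number;
}

// Hypothetical helper: tallies how many results fall into each category.
function buildDistribution(results: SimilarityResult[]): Distribution {
  const counts: Record<SimilarityCategory, number> = {
    [SimilarityCategory.DUPLICATE]: 0,
    [SimilarityCategory.UPDATE]: 0,
    [SimilarityCategory.RELATED]: 0,
    [SimilarityCategory.UNRELATED]: 0,
  };
  for (const result of results) {
    counts[result.category] += 1;
  }
  return {
    duplicate: counts[SimilarityCategory.DUPLICATE],
    update: counts[SimilarityCategory.UPDATE],
    related: counts[SimilarityCategory.RELATED],
    unrelated: counts[SimilarityCategory.UNRELATED],
    total: results.length,
  };
}
```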

MemoryComparisonResult

Detailed comparison between two memories:

interface MemoryComparisonResult {
  memoryAId: string;
  memoryBId: string;
  overallScore: number;
  category: SimilarityCategory;
  factors: SimilarityFactors;
  dominantFactors: Array<{
    factor: keyof SimilarityFactors;
    contribution: number;
    score: number;
  }>;
  recommendedAction: SimilarityAction;
}
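
One plausible way to derive `dominantFactors` is to rank factors by their weighted contribution. The sketch below assumes `contribution` means score × weight and that only the top few entries are kept; both are assumptions, and `rankDominantFactors` and its `topN` parameter are hypothetical.

```typescript
interface SimilarityFactors {
  content: number;
  people: number;
  threadId: number;
  subject: number;
  entities: number;
}

interface DominantFactor {
  factor: keyof SimilarityFactors;
  contribution: number; // Assumed: score multiplied by the factor's weight
  score: number;        // Raw factor score
}

// Hypothetical helper: returns the topN factors by weighted contribution.
function rankDominantFactors(
  factors: SimilarityFactors,
  weights: SimilarityFactors,
  topN = 2,
): DominantFactor[] {
  const keys = Object.keys(weights) as Array<keyof SimilarityFactors>;
  return keys
    .map((factor) => ({
      factor,
      score: factors[factor],
      contribution: factors[factor] * weights[factor],
    }))
    .sort((a, b) => b.contribution - a.contribution)
    .slice(0, topN);
}
```

With the factor scores and weights from the Example Calculation, `content` ranks first with a contribution of 0.425.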

SimilarityAction

Recommended action based on comparison:

type SimilarityAction =
  | { type: 'skip'; reason: string }
  | { type: 'merge'; targetId: string; reason: string }
  | { type: 'update'; targetId: string; reason: string }
  | { type: 'link'; targetId: string; relationshipType: string; reason: string }
  | { type: 'create'; reason: string };
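
The Category Actions table suggests a direct mapping from category to action. The sketch below follows that table under a few assumptions: `recommendAction` is a hypothetical function, DUPLICATE is mapped to `skip` rather than `merge`, and the `relationshipType` value and `reason` strings are placeholders.

```typescript
enum SimilarityCategory {
  DUPLICATE = 'duplicate',
  UPDATE = 'update',
  RELATED = 'related',
  UNRELATED = 'unrelated',
}

type SimilarityAction =
  | { type: 'skip'; reason: string }
  | { type: 'merge'; targetId: string; reason: string }
  | { type: 'update'; targetId: string; reason: string }
  | { type: 'link'; targetId: string; relationshipType: string; reason: string }
  | { type: 'create'; reason: string };

// Hypothetical helper: picks an action for a new memory compared against
// the existing memory identified by targetId.
function recommendAction(
  category: SimilarityCategory,
  targetId: string,
): SimilarityAction {
  switch (category) {
    case SimilarityCategory.DUPLICATE:
      // Assumption: skip by default; a merge policy could be used instead.
      return { type: 'skip', reason: 'Nearly identical to an existing memory' };
    case SimilarityCategory.UPDATE:
      return { type: 'update', targetId, reason: 'Updates an existing memory' };
    case SimilarityCategory.RELATED:
      return {
        type: 'link',
        targetId,
        relationshipType: 'related', // Placeholder value
        reason: 'Related to an existing memory but distinct',
      };
    case SimilarityCategory.UNRELATED:
      return { type: 'create', reason: 'No similar memory found' };
  }
}
```

Because `SimilarityAction` is a discriminated union on `type`, callers can switch on `action.type` and TypeScript narrows to the fields available for that variant, such as `targetId` on `update`.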

Design Decisions

Why Multi-Factor?

Pure embedding similarity misses contextual signals:

  • Same thread but different topics → should be linked

  • Same people discussing different projects → may be related

  • Same subject line but different senders → different contexts

Why Weighted Average?

Different factors have different reliability:

  • Content embedding is most reliable for semantic meaning

  • Thread ID is binary but highly informative

  • Subject lines can be misleading (Re: Re: Re:...)

Threshold Tuning

Thresholds may need adjustment based on:

  • Domain-specific vocabulary density

  • Average email length

  • Language (Korean vs English embedding behavior)

Related Documentation

  • Memory Node - Memory structure with embedding field

  • Embedding Module - Embedding generation

  • Vector Index - Storage and search