
Dec 7, 2025

Realistic Dataset Design

Improvement: #1 of 4
Parent Doc: Experiment Improvements
Status: Design Phase

Problem Statement

The TEN-311 test dataset contains overly structured, short content that doesn't reflect real-world email characteristics. This limits the experiment's ability to validate production performance.

Current vs Real Email Comparison

Current Test Data (UPDATE_001):

{
  "existing_memory": {
    "content": "Q1 마케팅 캠페인 예산은 5000만원입니다."
  },
  "new_email": {
    "body": "예산이 6000만원으로 증액되었습니다. 추가 캠페인 진행 가능합니다."
  }
}
  • Length: 20-50 characters

  • Structure: Perfect grammar, no noise

  • Context: Compressed information

Real Email Example:

Subject: Re: Q1 마케팅 예산

안녕하세요 마케팅팀 여러분,

지난주 말씀드렸던 Q1 예산 건으로 연락드립니다.

검토 결과 예산을 6000만원으로 증액하기로 최종 결정되었습니다.
이전에 승인된 5000만원에서 1000만원 추가 배정되었으며,
이를 활용해 추가 캠페인 진행이 가능하니 참고 부탁드립니다.

자세한 내용은 첨부 파일을 확인해주세요.

감사합니다.

---
김재무 | 재무팀
finance@company.com
Tel: 02-1234-5678
Mobile: 010-1234-5678

본 메일은 발신전용이며 회신되지 않습니다.
문의사항은 고객센터(1588-xxxx)로 연락주시기 바랍니다.

-----Original Message-----
From: 박마케팅 <marketing@company.com>
Sent: Wednesday, January 8, 2025 3:24 PM
To: 김재무 <finance@company.com>
Subject: Q1 마케팅 예산

안녕하세요,

Q1 마케팅 캠페인 예산 승인 요청드립니다

  • Length: 300-500+ characters

  • Noise: Signatures, disclaimers, quoted replies

  • Context: Natural conversation flow

Impact on System Performance

Challenges Real Emails Introduce:

  1. Noise Filtering: System must extract signal from noise

  2. Content Length: Longer content affects embedding quality and search performance

  3. Quoted Context: Must distinguish new information from quoted previous emails

  4. Multi-turn Conversations: Thread context becomes critical

  5. Language Variations: Real emails contain typos, informal language, mixed Korean/English

Design Goals

Primary Goals

  1. Realism: Test data matches actual Gmail email characteristics

  2. Diversity: Cover various email types (announcement, discussion, decision, etc.)

  3. Noise Spectrum: Range from clean to heavily noisy emails

  4. Backwards Compatibility: Maintain compatibility with existing test framework

Non-Goals

  • Real User Data: Not using actual user emails (privacy concerns)

  • Perfect Realism: Not simulating HTML artifacts or attachment metadata (out of scope)

  • Multilingual: Focusing on Korean + English keywords (no full English emails yet)

Dataset Structure

Directory Layout

.data/experiments/datasets/v2/
├── cases/
│   ├── realistic/              # Full realistic emails
│   │   ├── SKIP_R001.json
│   │   ├── UPDATE_R001.json
│   │   ├── CREATE_R001.json
│   │   └── LINK_R001.json
│   ├── noise_levels/           # Varying noise intensities
│   │   ├── clean/              # Minimal noise (like v1)
│   │   ├── moderate/           # Signature + disclaimer
│   │   └── heavy/              # + quoted replies
│   ├── edge_cases/             # Specific challenge scenarios
│   │   ├── long_quoted_chain.json
│   │   ├── mixed_language.json
│   │   ├── signature_collision.json  # Info in signature
│   │   └── attachment_reference.json
│   └── synthetic/              # Original v1 style (baseline)
│       └── ... (existing v1 cases)
├── templates/                  # Reusable email components
│   ├── signatures.json
│   ├── disclaimers.json
│   └── quoted_reply_markers.json
├── raw_samples/                # Reference raw email examples
│   ├── sample_001.txt
│   └── ...
└── manifest.json               # Dataset metadata

Test Case Schema Extensions

Enhanced Test Case Structure

interface RealisticTestCase extends TestCase {
  // Original fields (unchanged)
  id: string;
  category: 'SKIP' | 'UPDATE' | 'CREATE' | 'LINK';
  difficulty: 'easy' | 'medium' | 'hard' | 'edge';

  // New: Email realism metadata
  realism: {
    noise_level: 'clean' | 'moderate' | 'heavy';
    noise_factors: {
      has_signature: boolean;
      has_disclaimer: boolean;
      has_quoted_reply: boolean;
      quoted_depth?: number;          // How many levels of quoting
      has_html_artifacts: boolean;    // Reserved: simulating HTML artifacts is listed under Non-Goals
      has_mixed_language: boolean;
      has_typos: boolean;
    };

    // Content characteristics
    characteristics: {
      body_length: number;            // Character count
      paragraph_count: number;
      sentence_count: number;
      signal_to_noise_ratio: number; // 0-1, how much is actual content
    };
  };

  // Enhanced email body
  new_email: {
    messageId: string;
    subject: string;

    // Full realistic body
    body: string;

    // Structured breakdown (for analysis)
    body_parts?: {
      greeting?: string;
      main_content: string;
      signature?: string;
      disclaimer?: string;
      quoted_reply?: string;
    };

    from: string;
    to?: string[];
    cc?: string[];
    threadId: string;
    date: string;
  };

  // Expected extraction (what system should extract from noisy email)
  expected_extraction: {
    cleaned_content: string;        // After noise removal
    signal_preserved: boolean;      // Did extraction keep key info?
    noise_filtered: boolean;        // Did extraction remove noise?
  };
}
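The `characteristics` fields can be derived from the email text instead of being entered by hand. The sketch below shows one way to do that. `computeCharacteristics` is a hypothetical helper, and defining `signal_to_noise_ratio` as the length of `main_content` divided by the length of `body` is an assumption, since the schema only says the value lies between 0 and 1.

```typescript
// Hypothetical helper (not part of the existing framework): derives the
// `realism.characteristics` block from an email body.
interface BodyCharacteristics {
  body_length: number;
  paragraph_count: number;
  sentence_count: number;
  signal_to_noise_ratio: number;
}

function computeCharacteristics(body: string, mainContent: string): BodyCharacteristics {
  // Paragraphs: runs of text separated by at least one blank line.
  const paragraphs = body.split(/\n\s*\n/).filter((p) => p.trim().length > 0);
  // Sentences: rough split on ., ! or ? followed by whitespace or end of text.
  const sentences = body.split(/[.!?](?:\s+|$)/).filter((s) => s.trim().length > 0);
  // Assumption: signal-to-noise = main content length / full body length.
  const ratio = body.length === 0 ? 0 : Math.min(1, mainContent.length / body.length);
  return {
    body_length: body.length, // UTF-16 code units; equals characters for Hangul and ASCII
    paragraph_count: paragraphs.length,
    sentence_count: sentences.length,
    signal_to_noise_ratio: Math.round(ratio * 100) / 100,
  };
}
```

Passing `new_email.body` and `new_email.body_parts.main_content` keeps the stored metadata reproducible from the case itself. The sentence split is deliberately rough, so counts for bodies that end sentences without punctuation will be low.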

Example: Realistic UPDATE Case

{
  "id": "UPDATE_R001",
  "category": "UPDATE",
  "difficulty": "medium",

  "realism": {
    "noise_level": "moderate",
    "noise_factors": {
      "has_signature": true,
      "has_disclaimer": true,
      "has_quoted_reply": false,
      "has_html_artifacts": false,
      "has_mixed_language": true,
      "has_typos": false
    },
    "characteristics": {
      "body_length": 387,
      "paragraph_count": 4,
      "sentence_count": 6,
      "signal_to_noise_ratio": 0.65
    }
  },

  "existing_memory": {
    "id": "mem_q1_budget",
    "content": "Q1 마케팅 캠페인 예산은 5000만원으로 책정됨. 집행 기간 1월 15일 ~ 3월 31일.",
    "category": "project",
    "threadId": "thread_q1_marketing",
    "subject": "Q1 마케팅 예산",
    "people": ["marketing@company.com", "finance@company.com"],
    "entities": ["Q1 마케팅 캠페인"],
    "importance": 0.8,
    "occurredAt": "2025-01-10T09:00:00Z"
  },

  "new_email": {
    "messageId": "msg_budget_increase",
    "subject": "Re: Q1 마케팅 예산",
    "body": "안녕하세요 마케팅팀 여러분,\n\n지난주 말씀드렸던 Q1 예산 건으로 연락드립니다.\n\n검토 결과 예산을 6000만원으로 증액하기로 최종 결정되었습니다. 이전 승인액 5000만원에서 1000만원 추가 배정되었으며, 이를 활용해 추가 캠페인 진행이 가능하니 참고 부탁드립니다.\n\nBudget breakdown은 첨부 파일을 확인해주세요.\n\n감사합니다.\n\n---\n김재무 | 재무팀\nfinance@company.com\nTel: 02-1234-5678\n\n본 메일은 발신전용이며 회신되지 않습니다.\n문의사항은 고객센터로 연락주시기 바랍니다.",

    "body_parts": {
      "greeting": "안녕하세요 마케팅팀 여러분,",
      "main_content": "지난주 말씀드렸던 Q1 예산 건으로 연락드립니다.\n\n검토 결과 예산을 6000만원으로 증액하기로 최종 결정되었습니다. 이전 승인액 5000만원에서 1000만원 추가 배정되었으며, 이를 활용해 추가 캠페인 진행이 가능하니 참고 부탁드립니다.\n\nBudget breakdown은 첨부 파일을 확인해주세요.",
      "signature": "김재무 | 재무팀\nfinance@company.com\nTel: 02-1234-5678",
      "disclaimer": "본 메일은 발신전용이며 회신되지 않습니다.\n문의사항은 고객센터로 연락주시기 바랍니다."
    },

    "from": "finance@company.com",
    "to": ["marketing@company.com"],
    "threadId": "thread_q1_marketing",
    "date": "2025-01-15T14:30:00Z"
  },

  "extracted_memory": {
    "tempId": "temp_budget_increase",
    "content": "Q1 마케팅 캠페인 예산이 6000만원으로 증액됨 (기존 5000만원에서 1000만원 추가). 추가 캠페인 진행 가능.",
    "category": "project",
    "people": ["marketing@company.com", "finance@company.com"],
    "entities": ["Q1 마케팅 캠페인", "예산"],
    "threadId": "thread_q1_marketing",
    "subject": "Re: Q1 마케팅 예산",
    "importance": 0.8
  },

  "expected_extraction": {
    "cleaned_content": "Q1 예산을 6000만원으로 증액 결정. 이전 5000만원에서 1000만원 추가 배정. 추가 캠페인 진행 가능.",
    "signal_preserved": true,
    "noise_filtered": true
  },

  "expected": {
    "decision": "UPDATE",
    "target_memory_id": "mem_q1_budget",
    "reason": "같은 Q1 마케팅 캠페인의 예산 정보가 5000만원에서 6000만원으로 변경됨"
  },

  "expected_factors": {
    "content": "high",
    "people": "same",
    "threadId": "same",
    "subject": "similar",
    "entities": "same"
  },

  "notes": "Moderate noise level: signature + disclaimer present. System should extract core budget change info while filtering noise."
}
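Because each case stores the same text twice (`body` and `body_parts`) plus flags describing it, the three can drift apart during manual editing. The following is a minimal sketch of a consistency check. `checkCaseConsistency` is a hypothetical name, and the two rules it applies (parts must be verbatim substrings of the body; flags must match which parts are present) are assumptions about how the fields are meant to relate.

```typescript
// Hypothetical consistency check; field names follow the schema above.
interface BodyParts {
  greeting?: string;
  main_content: string;
  signature?: string;
  disclaimer?: string;
  quoted_reply?: string;
}

interface NoiseFlags {
  has_signature: boolean;
  has_disclaimer: boolean;
  has_quoted_reply: boolean;
}

function checkCaseConsistency(body: string, parts: BodyParts, flags: NoiseFlags): string[] {
  const problems: string[] = [];
  const names: Array<keyof BodyParts> = ["greeting", "main_content", "signature", "disclaimer", "quoted_reply"];

  // Every declared part should appear verbatim in the full body.
  for (const name of names) {
    const text = parts[name];
    if (text !== undefined && body.indexOf(text) === -1) {
      problems.push(`body_parts.${name} is not a substring of body`);
    }
  }

  // Each noise flag should agree with whether the matching part is present.
  const pairs: Array<[keyof NoiseFlags, keyof BodyParts]> = [
    ["has_signature", "signature"],
    ["has_disclaimer", "disclaimer"],
    ["has_quoted_reply", "quoted_reply"],
  ];
  for (const [flag, part] of pairs) {
    if (flags[flag] !== (parts[part] !== undefined)) {
      problems.push(`${flag} disagrees with body_parts.${part}`);
    }
  }
  return problems;
}
```

An empty result means the case is internally consistent under these two rules; it says nothing about whether the expected decision is correct.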

Email Component Templates

Signature Templates

{
  "signatures": [
    {
      "id": "sig_standard_kr",
      "template": "---\n{name} | {department}\n{email}\nTel: {phone}",
      "example": "김철수 | 마케팅팀\nmarketing@company.com\nTel: 02-1234-5678"
    },
    {
      "id": "sig_detailed_kr",
      "template": "---\n{name} {title}\n{department} | {company}\nEmail: {email}\nTel: {phone}\nMobile: {mobile}",
      "example": "박영희 대리\n인사팀 | ABC 주식회사\nEmail: hr@company.com\nTel: 02-1234-5678\nMobile: 010-1234-5678"
    },
    {
      "id": "sig_minimal",
      "template": "{name}\n{email}",
      "example": "이대리\nlee@company.com"
    }
  ]
}

Disclaimer Templates

{
  "disclaimers": [
    {
      "id": "disc_no_reply",
      "text": "본 메일은 발신전용이며 회신되지 않습니다."
    },
    {
      "id": "disc_confidential",
      "text": "본 메일은 수신인에게만 전달되는 기밀 정보를 포함하고 있습니다. 무단 열람, 사용, 공개, 배포를 금지합니다."
    },
    {
      "id": "disc_privacy",
      "text": "개인정보 보호법에 따라 본 메일의 무단 전송 및 배포를 금지합니다."
    },
    {
      "id": "disc_contact",
      "text": "문의사항은 고객센터(1588-xxxx)로 연락주시기 바랍니다."
    }
  ]
}

Quoted Reply Markers

{
  "quoted_markers": [
    {
      "id": "marker_original_msg",
      "pattern": "-----Original Message-----\nFrom: {from}\nSent: {date}\nTo: {to}\nSubject: {subject}\n\n{body}"
    },
    {
      "id": "marker_simple_kr",
      "pattern": "\n\n--- {date}에 {from}님이 작성 ---\n\n{body}"
    },
    {
      "id": "marker_quote_prefix",
      "pattern": "> {line}"
    }
  ]
}
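The templates above use `{placeholder}` fields, so generating a realistic body amounts to filling the fields and joining the parts. Here is a minimal sketch of that step. `renderTemplate` and `composeBody` are hypothetical helpers, and joining parts with a blank line is an assumption based on the example bodies in this document.

```typescript
// Hypothetical helpers for filling `{placeholder}` templates and assembling a body.
function renderTemplate(template: string, values: Record<string, string>): string {
  return template.replace(/\{(\w+)\}/g, (placeholder: string, key: string): string => {
    // Unknown placeholders are left intact so missing values are easy to spot.
    const value = Object.prototype.hasOwnProperty.call(values, key) ? values[key] : undefined;
    return value !== undefined ? value : placeholder;
  });
}

function composeBody(mainContent: string, extras: Array<string | undefined>): string {
  // Assumption: parts are separated by one blank line, as in the example emails.
  const parts: string[] = [mainContent];
  for (const extra of extras) {
    if (extra !== undefined && extra.length > 0) parts.push(extra);
  }
  return parts.join("\n\n");
}

// Usage: render the `sig_standard_kr` template and append it to a message.
const signature = renderTemplate("---\n{name} | {department}\n{email}\nTel: {phone}", {
  name: "김철수",
  department: "마케팅팀",
  email: "marketing@company.com",
  phone: "02-1234-5678",
});
const composed = composeBody("예산이 6000만원으로 증액되었습니다.", [
  signature,
  "본 메일은 발신전용이며 회신되지 않습니다.",
]);
```

Leaving unknown placeholders in the output, rather than replacing them with an empty string, makes an incomplete template fill visible during manual review.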

Noise Level Specifications

Clean (Baseline - v1 Style)

Characteristics:

  • No signature

  • No disclaimer

  • No quoted replies

  • Concise, structured content

  • Signal-to-noise ratio: > 0.95

Use Case: Baseline comparison with TEN-311 results

Example:

예산이 6000만원으로 증액되었습니다

Moderate (Typical Business Email)

Characteristics:

  • Standard signature present

  • Single-line disclaimer

  • No quoted replies

  • Greeting + main content + closing

  • Signal-to-noise ratio: 0.60-0.80

Use Case: Most common real-world scenario

Example:

안녕하세요,

예산이 6000만원으로 증액되었습니다.

감사합니다.

---
김철수 | 재무팀
finance@company.com

본 메일은 발신전용이며 회신되지 않습니다.

Heavy (Complex Email Thread)

Characteristics:

  • Full signature with multiple contact methods

  • Multi-line disclaimer

  • Quoted previous email(s)

  • Multiple paragraphs

  • Mixed language (Korean + English terms)

  • Signal-to-noise ratio: 0.40-0.60

Use Case: Challenging extraction scenarios

Example:

안녕하세요 팀원 여러분,

지난번 논의했던 Q1 budget 업데이트드립니다.

최종 승인되어 6000만원으로 증액되었습니다.
이전 5000만원에서 1000만원 추가 배정이니 참고 부탁드립니다.

첨부 파일 확인 바랍니다.

Best regards,

---
김철수 대리
재무팀 | ABC Corp.
Email: finance@company.com
Tel: 02-1234-5678
Mobile: 010-1234-5678

본 메일은 발신전용입니다.
개인정보 보호법에 따라 무단 배포를 금지합니다.

-----Original Message-----
From: 박마케팅 <marketing@company.com>
Sent: Wednesday, January 8, 2025 3:24 PM
To: 김철수 <finance@company.com>
Subject: Q1 마케팅 예산

안녕하세요,

Q1 마케팅 캠페인 예산 승인 요청드립니다.
5000만원으로 요청드리며
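The three levels above can be told apart from the `noise_factors` flags alone, which is useful for checking that a case is filed under the right `noise_level`. The sketch below assumes a simple mapping read off the level descriptions (quoted reply implies heavy; signature or disclaimer implies moderate). `classifyNoiseLevel` and `snrWithinBand` are hypothetical names.

```typescript
type NoiseLevel = "clean" | "moderate" | "heavy";

interface NoiseFlags {
  has_signature: boolean;
  has_disclaimer: boolean;
  has_quoted_reply: boolean;
}

// Assumed mapping: any quoted reply makes a case heavy; otherwise a signature
// or disclaimer makes it moderate; otherwise it is clean.
function classifyNoiseLevel(flags: NoiseFlags): NoiseLevel {
  if (flags.has_quoted_reply) return "heavy";
  if (flags.has_signature || flags.has_disclaimer) return "moderate";
  return "clean";
}

// Signal-to-noise bands copied from the three level descriptions.
const SNR_BANDS: Record<NoiseLevel, { min: number; max: number }> = {
  clean: { min: 0.95, max: 1.0 },
  moderate: { min: 0.6, max: 0.8 },
  heavy: { min: 0.4, max: 0.6 },
};

// Bounds are treated as inclusive here; the specification writes the clean band as "> 0.95".
function snrWithinBand(level: NoiseLevel, ratio: number): boolean {
  const band = SNR_BANDS[level];
  return ratio >= band.min && ratio <= band.max;
}
```

Note that the bands leave ratios between 0.80 and 0.95 unassigned, so a case landing there fails the band check for every level and needs a manual decision.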

Test Case Distribution

Recommended Distribution (Total: 80 cases)

Category   Clean   Moderate   Heavy   Total
SKIP       5       10         5       20
UPDATE     5       10         5       20
CREATE     5       10         5       20
LINK       5       10         5       20
Total      20      40         20      80

Rationale:

  • Moderate focus (50%): Most common real-world scenario

  • Clean (25%): Baseline comparison

  • Heavy (25%): Stress testing
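The totals and percentages in the rationale follow directly from the per-category counts. As a quick arithmetic check, the sketch below recomputes them; the constant and function names are illustrative only.

```typescript
type Category = "SKIP" | "UPDATE" | "CREATE" | "LINK";
type NoiseLevel = "clean" | "moderate" | "heavy";

const CATEGORIES: Category[] = ["SKIP", "UPDATE", "CREATE", "LINK"];
// Cases per category at each noise level, as in the distribution table.
const CASES_PER_CATEGORY: Record<NoiseLevel, number> = { clean: 5, moderate: 10, heavy: 5 };

interface DistributionSummary {
  totals: Record<NoiseLevel, number>;
  shares: Record<NoiseLevel, number>;
  grandTotal: number;
}

function summarizeDistribution(): DistributionSummary {
  const n = CATEGORIES.length;
  const totals: Record<NoiseLevel, number> = {
    clean: CASES_PER_CATEGORY.clean * n,
    moderate: CASES_PER_CATEGORY.moderate * n,
    heavy: CASES_PER_CATEGORY.heavy * n,
  };
  const grandTotal = totals.clean + totals.moderate + totals.heavy;
  const shares: Record<NoiseLevel, number> = {
    clean: totals.clean / grandTotal,
    moderate: totals.moderate / grandTotal,
    heavy: totals.heavy / grandTotal,
  };
  return { totals, shares, grandTotal };
}
```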

Edge Case Scenarios (Additional 20 cases)

  1. Long Quoted Chain (5 cases)

    • Multiple levels of quoted replies

    • Test: Extract only new information

  2. Signature Info Collision (5 cases)

    • Important info appears in signature section

    • Test: System doesn't incorrectly filter signal as noise

  3. Mixed Language (5 cases)

    • Korean + English mixed sentences

    • Test: Embedding quality with code-switching

  4. Attachment References (5 cases)

    • Email references external attachments

    • Test: Handle incomplete information gracefully

Validation Criteria

Extraction Quality Metrics

interface ExtractionQuality {
  // All values are fractions in the range 0-1.

  // Signal preservation
  signal_recall: number;        // Share of key information that was extracted
  signal_precision: number;     // Share of extracted content that is signal (not noise)

  // Noise filtering
  noise_filtered: number;       // Share of noise successfully removed
  false_positives: number;      // Share of signal incorrectly filtered out

  // Overall
  f1_score: number;             // Harmonic mean of signal_precision and signal_recall
}

// Target thresholds (lower bounds, except false_positives, which is an upper bound)
const QUALITY_THRESHOLDS = {
  signal_recall: 0.90,          // Capture at least 90% of key info
  signal_precision: 0.85,       // At least 85% of extracted content is signal
  noise_filtered: 0.80,         // Remove at least 80% of noise
  false_positives: 0.05,        // Lose less than 5% of signal
};
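The metrics above do not yet say how signal and noise are counted. One simple option, sketched below, is to annotate each case with key facts that must survive and noise phrases that must not, then score by substring match. The annotation lists, the function names, and the precision formula (matched facts over matched facts plus leaked noise) are all assumptions, not something the design specifies.

```typescript
interface ExtractionScores {
  signal_recall: number;
  signal_precision: number;
  noise_filtered: number;
  false_positives: number;
  f1_score: number;
}

// Hypothetical scorer: `keyFacts` and `noisePhrases` are per-case annotations.
function scoreExtraction(extracted: string, keyFacts: string[], noisePhrases: string[]): ExtractionScores {
  const hits = keyFacts.filter((fact) => extracted.indexOf(fact) !== -1).length;
  const leaks = noisePhrases.filter((noise) => extracted.indexOf(noise) !== -1).length;

  const recall = keyFacts.length === 0 ? 1 : hits / keyFacts.length;
  // Assumption: precision counts matched facts against matched facts plus leaked noise.
  const precision = hits + leaks === 0 ? 0 : hits / (hits + leaks);
  const noiseFiltered = noisePhrases.length === 0 ? 1 : 1 - leaks / noisePhrases.length;
  const f1 = precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);

  return {
    signal_recall: recall,
    signal_precision: precision,
    noise_filtered: noiseFiltered,
    false_positives: 1 - recall, // share of key facts that were lost
    f1_score: f1,
  };
}

// Compares scores against the target thresholds listed above.
function meetsThresholds(s: ExtractionScores): boolean {
  return (
    s.signal_recall >= 0.9 &&
    s.signal_precision >= 0.85 &&
    s.noise_filtered >= 0.8 &&
    s.false_positives < 0.05
  );
}
```

Under this particular definition `false_positives` is simply `1 - signal_recall`, so its 0.05 limit is stricter than the 0.90 recall target. If the two are meant to be independent, they need separate annotation sets.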

Decision Accuracy by Noise Level

Noise Level   Target Accuracy   Rationale
Clean         > 95%             Should match TEN-311 baseline
Moderate      > 90%             Acceptable degradation
Heavy         > 85%             Challenging but production-viable
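To report against these targets, decision results need to be grouped by noise level. The following is a minimal sketch of that aggregation. The `CaseResult` shape and the function name are assumptions, and the targets are treated as strict lower bounds because the table writes them with ">".

```typescript
type NoiseLevel = "clean" | "moderate" | "heavy";

// Hypothetical per-case result record.
interface CaseResult {
  noise_level: NoiseLevel;
  expected_decision: string;
  actual_decision: string;
}

interface LevelReport {
  total: number;
  correct: number;
  accuracy: number;
  meets_target: boolean;
}

const TARGET_ACCURACY: Record<NoiseLevel, number> = { clean: 0.95, moderate: 0.9, heavy: 0.85 };

function accuracyByNoiseLevel(results: CaseResult[]): Record<NoiseLevel, LevelReport> {
  const report: Record<NoiseLevel, LevelReport> = {
    clean: { total: 0, correct: 0, accuracy: 0, meets_target: false },
    moderate: { total: 0, correct: 0, accuracy: 0, meets_target: false },
    heavy: { total: 0, correct: 0, accuracy: 0, meets_target: false },
  };
  for (const result of results) {
    const entry = report[result.noise_level];
    entry.total += 1;
    if (result.expected_decision === result.actual_decision) entry.correct += 1;
  }
  const levels: NoiseLevel[] = ["clean", "moderate", "heavy"];
  for (const level of levels) {
    const entry = report[level];
    entry.accuracy = entry.total === 0 ? 0 : entry.correct / entry.total;
    // Targets are strict lower bounds; a level with no cases never passes.
    entry.meets_target = entry.total > 0 && entry.accuracy > TARGET_ACCURACY[level];
  }
  return report;
}
```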

Performance Benchmarks

  • Extraction Time: < 1 second per email (including noise filtering)

  • Decision Time: < 2 seconds per memory (including similarity search)

  • End-to-End Latency: < 3 seconds (extraction + decision)

Implementation Plan

Step 1: Template Creation (Day 1, Morning)

Tasks:

  • Create signature templates JSON

  • Create disclaimer templates JSON

  • Create quoted reply marker templates JSON

Deliverable: .data/experiments/datasets/v2/templates/

Step 2: Realistic Case Generation (Day 1, Afternoon - Day 2)

Tasks:

  • Convert 40 existing cases to "moderate" noise level

  • Create 20 "heavy" noise level cases

  • Create 20 edge case scenarios

Approach:

  • Use templates to augment existing v1 cases

  • LLM-assisted realistic email body generation

  • Manual review for quality

Deliverable: 80 realistic test cases (40 moderate, 20 heavy, 20 edge) in .data/experiments/datasets/v2/cases/

Step 3: Noise Filtering Implementation (Day 2)

Tasks:

  • Implement signature detection and removal

  • Implement disclaimer detection and removal

  • Implement quoted reply detection and removal

Approach:

// lib/extraction/noise-filter.ts
interface NoiseFilter {
  removeSignatures(body: string): string;
  removeDisclaimers(body: string): string;
  removeQuotedReplies(body: string, keepDepth?: number): string;
  extractMainContent(body: string): string;
}

Deliverable: lib/extraction/noise-filter.ts
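As a starting point for this step, here is a minimal line-based sketch of the interface. Every pattern in it is a heuristic assumption drawn from the templates in this document (the `---` delimiter, the quoted-reply markers, the disclaimer phrases), and `keepDepth` is interpreted as the number of quoting levels to retain. It is not a finished design.

```typescript
interface NoiseFilter {
  removeSignatures(body: string): string;
  removeDisclaimers(body: string): string;
  removeQuotedReplies(body: string, keepDepth?: number): string;
  extractMainContent(body: string): string;
}

// Header lines that open a quoted previous message.
const QUOTE_MARKERS: RegExp[] = [
  /^-{2,}\s*Original(?: Message)?\s*-{2,}$/, // "-----Original Message-----"
  /^-{3} .+에 .+님이 작성 -{3}$/, // "--- {date}에 {from}님이 작성 ---"
];

// Phrases taken from the disclaimer templates (heuristic).
const DISCLAIMER_PATTERNS: RegExp[] = [/발신전용/, /기밀 정보/, /무단 (?:열람|전송|배포)/, /문의사항은 고객센터/];

const stripTrailingSpace = (text: string): string => text.replace(/\s+$/, "");

// Number of leading ">" quote prefixes on a line ("> > text" has depth 2).
function quoteDepth(line: string): number {
  const match = /^(?:>\s?)+/.exec(line);
  return match ? match[0].replace(/\s/g, "").length : 0;
}

function removeSignatures(body: string): string {
  const kept: string[] = [];
  for (const line of body.split("\n")) {
    // A line holding only "--" or "---" is treated as the signature delimiter;
    // it is dropped together with everything below it.
    if (/^-{2,3}\s*$/.test(line)) break;
    kept.push(line);
  }
  return stripTrailingSpace(kept.join("\n"));
}

function removeDisclaimers(body: string): string {
  const kept = body.split("\n").filter((line) => !DISCLAIMER_PATTERNS.some((re) => re.test(line)));
  return stripTrailingSpace(kept.join("\n"));
}

function removeQuotedReplies(body: string, keepDepth: number = 0): string {
  const kept: string[] = [];
  let markersSeen = 0;
  for (const line of body.split("\n")) {
    if (QUOTE_MARKERS.some((re) => re.test(line.trim()))) {
      markersSeen += 1;
      if (markersSeen > keepDepth) break; // everything below is older history
    }
    if (quoteDepth(line) > keepDepth) continue; // drop deeper ">" quoted lines
    kept.push(line);
  }
  return stripTrailingSpace(kept.join("\n"));
}

function extractMainContent(body: string): string {
  // Order matters: quoted history first, then disclaimers, then the signature block.
  return removeSignatures(removeDisclaimers(removeQuotedReplies(body)));
}

const noiseFilter: NoiseFilter = { removeSignatures, removeDisclaimers, removeQuotedReplies, extractMainContent };
```

Cutting at the first `---` line is cheap but would discard real content in the signature-collision edge cases, so those cases are expected to fail against this sketch and should drive the next iteration.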

Step 4: Baseline Experiment (Day 2)

Tasks:

  • Run experiment on v2 dataset

  • Compare with v1 baseline

  • Analyze performance by noise level

Metrics to Track:

  • Accuracy by noise level

  • Extraction quality scores

  • False positive/negative rates

Deliverable: Experiment run with comparison report

Testing Strategy

Unit Tests

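// Assumes Jest- or Vitest-style globals (describe, it, expect) and that
// lib/extraction/noise-filter.ts exports a `noiseFilter` object implementing
// NoiseFilter. The import path and export name are assumptions.
import { noiseFilter } from '../lib/extraction/noise-filter';

describe('NoiseFilter', () => {
  it('should remove standard Korean signature', () => {
    const input = "메시지\n\n---\n김철수 | 팀\nemail@company.com";
    const output = noiseFilter.removeSignatures(input);
    expect(output).toBe("메시지");
  });

  it('should preserve signature-like content in main body', () => {
    // A "|" in the body must not be mistaken for a "name | department" signature line.
    const input = "우리 팀 | 프로젝트 진행 상황\n내용...";
    const output = noiseFilter.removeSignatures(input);
    expect(output).toContain("우리 팀 | 프로젝트");
  });

  it('should remove quoted reply but preserve new content', () => {
    // Uses the `marker_original_msg` marker from quoted_reply_markers.json.
    const input = "답변: 확인했습니다.\n\n-----Original Message-----\n이전 내용...";
    const output = noiseFilter.removeQuotedReplies(input);
    expect(output).toBe("답변: 확인했습니다.");
  });
});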

Integration Tests

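// `loadTestCase` and `extractMemoryFromEmail` are assumed to come from the
// existing experiment framework; their module paths are not specified here.
describe('Realistic Email Extraction', () => {
  it('should correctly extract from moderate noise email', async () => {
    const testCase = await loadTestCase('UPDATE_R001');
    const extracted = await extractMemoryFromEmail(testCase.new_email);

    // Signal is kept; disclaimer and signature text is filtered out.
    expect(extracted.content).toContain("6000만원");
    expect(extracted.content).not.toContain("발신전용");
    expect(extracted.content).not.toContain("Tel:");
  });
});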

Success Criteria

Phase 1 Success Metrics

  • 80 realistic test cases created and validated

  • Noise filtering accuracy > 80%

  • Signal preservation > 90%

  • Baseline experiment shows < 10% accuracy drop on moderate noise

  • Documentation complete and reviewed

Go/No-Go Decision

Proceed to Phase 2 if:

  • Moderate noise accuracy > 85%

  • Heavy noise accuracy > 75%

  • No critical bugs in noise filtering

Block/Revise if:

  • Accuracy drops > 15% on moderate noise

  • Signal loss > 10% (false positives)

  • Noise filtering causes systematic errors
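The two lists above can be encoded as a single gate so the decision is reproducible from the experiment report. In this sketch the `PhaseOneResults` shape is hypothetical, "accuracy drops > 15%" is assumed to mean percentage points relative to the clean baseline, and block conditions are assumed to take precedence over proceed conditions. Results that satisfy neither list are returned as `needs_review`.

```typescript
// Hypothetical summary of a Phase 1 run; all rates are fractions in 0-1.
interface PhaseOneResults {
  baselineAccuracy: number; // accuracy on clean (v1-style) cases
  moderateAccuracy: number;
  heavyAccuracy: number;
  signalLoss: number; // share of signal incorrectly filtered
  criticalBugCount: number;
  systematicFilterErrors: boolean;
}

type GateDecision = "proceed" | "block_or_revise" | "needs_review";

function evaluateGate(r: PhaseOneResults): { decision: GateDecision; reasons: string[] } {
  const reasons: string[] = [];
  // Assumption: the drop is measured in percentage points against the baseline.
  if (r.baselineAccuracy - r.moderateAccuracy > 0.15) reasons.push("moderate accuracy drop > 15 points");
  if (r.signalLoss > 0.1) reasons.push("signal loss > 10%");
  if (r.systematicFilterErrors) reasons.push("systematic noise-filtering errors");
  if (reasons.length > 0) return { decision: "block_or_revise", reasons };

  const proceed = r.moderateAccuracy > 0.85 && r.heavyAccuracy > 0.75 && r.criticalBugCount === 0;
  return { decision: proceed ? "proceed" : "needs_review", reasons };
}
```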

Related Documentation

  • Experiment Improvements - Parent overview

  • Contextual Similarity - Next improvement

  • Experiment Guide - Testing methodology

Appendix: Real Email Samples

Sample 1: Budget Approval

Subject: Re: Q1 Marketing Budget Approval

안녕하세요 마케팅팀 여러분,

Q1 마케팅 예산 최종 승인되었습니다.

요청하신 6000만원으로 확정되었으며, 아래와 같이 breakdown됩니다:
- 디지털 광고: 3500만원
- 오프라인 이벤트: 1500만원
- 콘텐츠 제작: 1000만원

집행 기간은 2025년 1월 15일부터 3월 31일까지이며,
월별 리포트는 재무팀으로 제출 부탁드립니다.

추가 문의사항 있으시면 연락주세요.

Best regards,

---
김재무 차장
재무팀 | ABC 주식회사
Email: finance@company.com
Direct: 02-1234-5678
Mobile: 010-1234-5678

본 메일에 포함된 정보는 수신인에게만 전달되는 기밀 정보입니다.
무단 열람, 사용, 공개, 배포를 금지합니다.

-----Original Message-----
From: 박마케팅 <marketing@company.com>
Sent: Sunday, January 5, 2025 2:15 PM
To: 김재무 <finance@company.com>
Cc: 경영지원팀 <support@company.com>
Subject: Q1 Marketing Budget Approval

안녕하세요,

Q1 마케팅 캠페인을 위한 예산 승인 요청드립니다.

예산: 6000만원
기간: 2025년 Q1 (1~3월)

자세한 내용은 첨부 파일을 참고해주세요.

감사합니다

Expected Extraction:

Q1 마케팅 예산 6000만원 최종 승인. Breakdown: 디지털 광고 3500만원, 오프라인 이벤트 1500만원, 콘텐츠 제작 1000만원. 집행 기간: 2025년 1월 15일~3월 31일. 월별 리포트 제출 필요.

Change Log

Date         Author   Change
2025-12-06   Claude   Initial realistic dataset design