Dec 7, 2025
Realistic Dataset Design
Improvement: #1 of 4
Parent Doc: Experiment Improvements
Status: Design Phase
Problem Statement
The TEN-311 test dataset contains overly structured, short content that doesn't reflect real-world email characteristics. This limits the experiment's ability to validate production performance.
Current vs Real Email Comparison
Current Test Data (UPDATE_001):
Length: 20-50 characters
Structure: Perfect grammar, no noise
Context: Compressed information
Real Email Example:
Length: 300-500+ characters
Noise: Signatures, disclaimers, quoted replies
Context: Natural conversation flow
Impact on System Performance
Challenges Real Emails Introduce:
Noise Filtering: System must extract signal from noise
Content Length: Longer content affects embedding quality and search performance
Quoted Context: Must distinguish new information from quoted previous emails
Multi-turn Conversations: Thread context becomes critical
Language Variations: Real emails contain typos, informal language, mixed Korean/English
Design Goals
Primary Goals
Realism: Test data matches actual Gmail email characteristics
Diversity: Cover various email types (announcement, discussion, decision, etc.)
Noise Spectrum: Range from clean to heavily noisy emails
Backwards Compatibility: Maintain compatibility with existing test framework
Non-Goals
Real User Data: Not using actual user emails (privacy concerns)
Perfect Realism: Not simulating HTML artifacts or attachment metadata (out of scope)
Multilingual: Focusing on Korean emails with embedded English keywords (no fully English emails yet)
Dataset Structure
Directory Layout
Test Case Schema Extensions
Enhanced Test Case Structure
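A minimal sketch of what an extended test case could look like. The field names (`noiseLevel`, `body`, `expectedSignal`) and the `normalizeCase` helper are hypothetical and not taken from the existing test framework; only the category and noise-level values come from this document.

```typescript
// Hypothetical shape of a v2 test case. Field names are illustrative
// assumptions, not the framework's actual schema.
type Category = "SKIP" | "UPDATE" | "CREATE" | "LINK";
type NoiseLevel = "clean" | "moderate" | "heavy";

interface RealisticTestCase {
  id: string;             // e.g. "UPDATE_001"
  category: Category;     // expected decision for this case
  noiseLevel: NoiseLevel; // position on the noise spectrum
  body: string;           // full email text, including any noise
  expectedSignal: string; // the part an extractor should keep
}

// A v1 case has no noise fields, so they are optional on input.
type RawCase = Pick<RealisticTestCase, "id" | "category" | "body"> &
  Partial<Pick<RealisticTestCase, "noiseLevel" | "expectedSignal">>;

// Backwards compatibility: a v1 case is treated as "clean" and its
// whole body is assumed to be signal.
function normalizeCase(raw: RawCase): RealisticTestCase {
  return {
    id: raw.id,
    category: raw.category,
    noiseLevel: raw.noiseLevel ?? "clean",
    body: raw.body,
    expectedSignal: raw.expectedSignal ?? raw.body,
  };
}
```

Making the new fields optional on input is one way to satisfy the backwards-compatibility goal: existing v1 cases load unchanged and land in the clean bucket.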
Example: Realistic UPDATE Case
Email Component Templates
Signature Templates
Disclaimer Templates
Quoted Reply Markers
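As a rough illustration of how the three template groups could be combined, the sketch below assembles an email from a content string plus optional signature, disclaimer, and quoted reply. The `EmailTemplates` shape, the `composeEmail` helper, and every template string are invented placeholders, not the project's actual templates.

```typescript
// Hypothetical template container; the real JSON layout may differ.
interface EmailTemplates {
  signatures: string[];
  disclaimers: string[];
  quotedReplyMarkers: string[]; // may contain {date} and {sender} placeholders
}

// Placeholder values for illustration only.
const templates: EmailTemplates = {
  signatures: ["--\nHong Gildong\nPlatform Team | 010-0000-0000"],
  disclaimers: ["This email may contain confidential information."],
  quotedReplyMarkers: ["On {date}, {sender} wrote:", "-----Original Message-----"],
};

interface ComposeOptions {
  signature?: number;  // index into templates.signatures
  disclaimer?: number; // index into templates.disclaimers
  quoted?: { marker: number; date: string; sender: string; text: string };
}

// Joins content and the selected noise components with blank lines.
function composeEmail(content: string, t: EmailTemplates, opts: ComposeOptions = {}): string {
  const parts: string[] = [content];
  if (opts.signature !== undefined) parts.push(t.signatures[opts.signature]);
  if (opts.disclaimer !== undefined) parts.push(t.disclaimers[opts.disclaimer]);
  if (opts.quoted) {
    const header = t.quotedReplyMarkers[opts.quoted.marker]
      .replace("{date}", opts.quoted.date)
      .replace("{sender}", opts.quoted.sender);
    // Prefix each quoted line with "> ", the common plain-text quoting style.
    const quotedLines = opts.quoted.text
      .split("\n")
      .map((line) => `> ${line}`)
      .join("\n");
    parts.push(`${header}\n${quotedLines}`);
  }
  return parts.join("\n\n");
}
```

Keeping the components separate from the content makes it possible to reuse one v1 case body at several noise levels.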
Noise Level Specifications
Clean (Baseline - v1 Style)
Characteristics:
No signature
No disclaimer
No quoted replies
Concise, structured content
Signal-to-noise ratio: > 0.95
Use Case: Baseline comparison with TEN-311 results
Example:
Moderate (Typical Business Email)
Characteristics:
Standard signature present
Single-line disclaimer
No quoted replies
Greeting + main content + closing
Signal-to-noise ratio: 0.60-0.80
Use Case: Most common real-world scenario
Example:
Heavy (Complex Email Thread)
Characteristics:
Full signature with multiple contact methods
Multi-line disclaimer
Quoted previous email(s)
Multiple paragraphs
Mixed language (Korean + English terms)
Signal-to-noise ratio: 0.40-0.60
Use Case: Challenging extraction scenarios
Example:
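Across the three levels above, the signal-to-noise ratio is given as a range but not defined further. One possible reading, sketched here purely as an assumption, is the share of non-whitespace characters that belong to the signal. The function names and the "unclassified" band for values outside the stated ranges are hypothetical.

```typescript
type NoiseBand = "clean" | "moderate" | "heavy" | "unclassified";

// Assumption: ratio = non-whitespace signal characters divided by
// non-whitespace characters in the full email, capped at 1.
function signalToNoiseRatio(signal: string, fullEmail: string): number {
  const count = (s: string): number => s.replace(/\s/g, "").length;
  const total = count(fullEmail);
  return total === 0 ? 0 : Math.min(1, count(signal) / total);
}

// Ranges follow the specification above. The boundary value 0.60 appears in
// both the moderate and heavy ranges; this sketch assigns it to moderate.
function classifyNoise(ratio: number): NoiseBand {
  if (ratio > 0.95) return "clean";
  if (ratio >= 0.6 && ratio <= 0.8) return "moderate";
  if (ratio >= 0.4 && ratio < 0.6) return "heavy";
  return "unclassified"; // e.g. 0.80-0.95, which the specification leaves open
}
```

A check like this could run during case generation to confirm that each case actually falls inside the band it is labeled with.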
Test Case Distribution
Recommended Distribution (Total: 80 cases)
| Category | Clean | Moderate | Heavy | Total |
|---|---|---|---|---|
| SKIP | 5 | 10 | 5 | 20 |
| UPDATE | 5 | 10 | 5 | 20 |
| CREATE | 5 | 10 | 5 | 20 |
| LINK | 5 | 10 | 5 | 20 |
| Total | 20 | 40 | 20 | 80 |
Rationale:
Moderate focus (50%): Most common real-world scenario
Clean (25%): Baseline comparison
Heavy (25%): Stress testing
Edge Case Scenarios (Additional 20 cases)
Long Quoted Chain (5 cases)
Multiple levels of quoted replies
Test: Extract only new information
Signature Info Collision (5 cases)
Important info appears in signature section
Test: System doesn't incorrectly filter signal as noise
Mixed Language (5 cases)
Korean + English mixed sentences
Test: Embedding quality with code-switching
Attachment References (5 cases)
Email references external attachments
Test: Handle incomplete information gracefully
Validation Criteria
Extraction Quality Metrics
Decision Accuracy by Noise Level
| Noise Level | Target Accuracy | Rationale |
|---|---|---|
| Clean | > 95% | Should match TEN-311 baseline |
| Moderate | > 90% | Acceptable degradation |
| Heavy | > 85% | Challenging but production-viable |
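The per-level targets could be checked with a small helper like the one below. This is a sketch under assumptions: the `CaseResult` shape and both function names are hypothetical, and accuracy is taken to mean the fraction of cases whose actual decision equals the expected one.

```typescript
type NoiseLevel = "clean" | "moderate" | "heavy";

// Hypothetical per-case result record.
interface CaseResult {
  noiseLevel: NoiseLevel;
  expected: string; // expected decision, e.g. "UPDATE"
  actual: string;   // decision the system produced
}

// Targets from the table above; each is a strict lower bound.
const TARGET_ACCURACY: Record<NoiseLevel, number> = {
  clean: 0.95,
  moderate: 0.9,
  heavy: 0.85,
};

// Returns accuracy per noise level; levels with no cases are omitted.
function accuracyByNoiseLevel(results: CaseResult[]): Partial<Record<NoiseLevel, number>> {
  const tally: Partial<Record<NoiseLevel, { correct: number; total: number }>> = {};
  for (const r of results) {
    const entry = tally[r.noiseLevel] ?? { correct: 0, total: 0 };
    entry.total += 1;
    if (r.expected === r.actual) entry.correct += 1;
    tally[r.noiseLevel] = entry;
  }
  const accuracy: Partial<Record<NoiseLevel, number>> = {};
  for (const level of Object.keys(tally) as NoiseLevel[]) {
    const entry = tally[level];
    if (entry) accuracy[level] = entry.correct / entry.total;
  }
  return accuracy;
}

function meetsTarget(level: NoiseLevel, accuracy: number): boolean {
  return accuracy > TARGET_ACCURACY[level];
}
```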
Performance Benchmarks
Extraction Time: < 1 second per email (including noise filtering)
Decision Time: < 2 seconds per memory (including similarity search)
End-to-End Latency: < 3 seconds (extraction + decision)
Implementation Plan
Step 1: Template Creation (Day 1, Morning)
Tasks:
Create signature templates JSON
Create disclaimer templates JSON
Create quoted reply marker templates JSON
Deliverable: .data/experiments/datasets/v2/templates/
Step 2: Realistic Case Generation (Day 1, Afternoon - Day 2)
Tasks:
Convert 40 existing cases to "moderate" noise level
Create 20 "heavy" noise level cases
Create 20 edge case scenarios
Approach:
Use templates to augment existing v1 cases
LLM-assisted realistic email body generation
Manual review for quality
Deliverable: 80 realistic test cases in .data/experiments/datasets/v2/cases/
Step 3: Noise Filtering Implementation (Day 2)
Tasks:
Implement signature detection and removal
Implement disclaimer detection and removal
Implement quoted reply detection and removal
Approach:
Deliverable: lib/extraction/noise-filter.ts
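The three filtering tasks above could be sketched as a single line-based pass. This is not the contents of `noise-filter.ts`; the regular expressions, the `FilterResult` shape, and the `filterNoise` name are illustrative assumptions, and real emails will need a broader set of patterns.

```typescript
// Heuristic patterns, invented for illustration.
const QUOTE_HEADER = /^(On .+ wrote:|-----Original Message-----)\s*$/;
const SIGNATURE_DELIMITER = /^--\s*$/; // conventional "-- " signature separator
const DISCLAIMER = /(confidential|본 메일은)/i;

interface FilterResult {
  signal: string;    // text kept after filtering
  removed: string[]; // non-blank lines classified as noise
}

function filterNoise(body: string): FilterResult {
  const kept: string[] = [];
  const removed: string[] = [];
  let cut = false; // once set, every following line is treated as noise

  for (const line of body.split(/\r?\n/)) {
    // A quote header or signature delimiter starts a trailing noise block.
    if (!cut && (QUOTE_HEADER.test(line) || SIGNATURE_DELIMITER.test(line))) {
      cut = true;
    }
    if (cut || line.startsWith(">") || DISCLAIMER.test(line)) {
      if (line.trim() !== "") removed.push(line);
    } else {
      kept.push(line);
    }
  }
  return { signal: kept.join("\n").trim(), removed };
}
```

Cutting everything after the signature delimiter is simple, but it would discard important information that appears inside a signature, which is exactly the "Signature Info Collision" edge case. Returning the removed lines alongside the signal leaves room to inspect them and to measure signal loss.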
Step 4: Baseline Experiment (Day 2)
Tasks:
Run experiment on v2 dataset
Compare with v1 baseline
Analyze performance by noise level
Metrics to Track:
Accuracy by noise level
Extraction quality scores
False positive/negative rates
Deliverable: Experiment run with comparison report
Testing Strategy
Unit Tests
Integration Tests
Success Criteria
Phase 1 Success Metrics
80 realistic test cases created and validated
Noise filtering accuracy > 80%
Signal preservation > 90%
Baseline experiment shows < 10% accuracy drop on moderate noise
Documentation complete and reviewed
Go/No-Go Decision
Proceed to Phase 2 if:
Moderate noise accuracy > 85%
Heavy noise accuracy > 75%
No critical bugs in noise filtering
Block/Revise if:
Accuracy drops by more than 15% on moderate noise
Signal loss > 10% (signal incorrectly filtered out as noise)
Noise filtering causes systematic errors
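The decision rules above can be expressed as a small function. This is a sketch under several assumptions: the `Phase1Metrics` field names are hypothetical, the accuracy drop is measured in percentage points against the v1 baseline, block conditions are checked before proceed conditions, and results that satisfy neither list are reported as "undecided".

```typescript
interface Phase1Metrics {
  baselineAccuracy: number;        // v1 accuracy, in the range 0..1
  moderateAccuracy: number;        // accuracy on moderate-noise cases
  heavyAccuracy: number;           // accuracy on heavy-noise cases
  signalLoss: number;              // fraction of signal wrongly filtered out
  criticalFilterBugs: number;      // open critical bugs in noise filtering
  systematicFilterErrors: boolean; // whether filtering fails in a consistent pattern
}

type Decision = "proceed" | "block" | "undecided";

function goNoGo(m: Phase1Metrics): Decision {
  // Assumption: "drop" means baseline minus moderate, in absolute terms.
  const moderateDrop = m.baselineAccuracy - m.moderateAccuracy;

  if (moderateDrop > 0.15 || m.signalLoss > 0.1 || m.systematicFilterErrors) {
    return "block";
  }
  if (m.moderateAccuracy > 0.85 && m.heavyAccuracy > 0.75 && m.criticalFilterBugs === 0) {
    return "proceed";
  }
  // Neither list applies, e.g. moderate accuracy at 0.84 with a small drop.
  return "undecided";
}
```

The "undecided" outcome makes the gap between the two lists explicit instead of silently mapping it to either side.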
Related Documentation
Experiment Improvements - Parent overview
Contextual Similarity - Next improvement
Experiment Guide - Testing methodology
Appendix: Real Email Samples
Sample 1: Budget Approval
Expected Extraction:
Change Log
| Date | Author | Change |
|---|---|---|
| 2025-12-06 | Claude | Initial realistic dataset design |