07-extraction-01-pipeline-overview

Dec 7, 2025

Pipeline Overview

Overall Flow

Email → Extraction LLM → Structured Output → Judge LLM → Validation Result
         (Haiku)           (domain, subject)   (Haiku)     (correct/incorrect)

1. Input: Raw Email

Raw email fetched from the Gmail API:

subject: email subject line
from: sender
content: body (already converted from HTML to plaintext)
groundTruth: manually labeled correct answer (used for validation)
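
The input fields above could be typed roughly as follows. This is a minimal sketch: the source names the four fields but not their types, so the shape of `groundTruth` (a labeled `domain` and `subject`) and the `isEmailSample` helper are assumptions made for illustration.

```typescript
// Sketch only: the source lists these four fields but not their types.
interface GroundTruth {
  domain: string;  // assumed: manually labeled domain
  subject: string; // assumed: manually labeled one-sentence subject
}

interface EmailSample {
  subject: string;          // email subject line
  from: string;             // sender
  content: string;          // body, already converted from HTML to plaintext
  groundTruth: GroundTruth; // manual label, used only for validation
}

// Hypothetical runtime guard for samples loaded from JSON.
function isEmailSample(value: unknown): value is EmailSample {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  const gt = v.groundTruth as Record<string, unknown> | null | undefined;
  return (
    typeof v.subject === "string" &&
    typeof v.from === "string" &&
    typeof v.content === "string" &&
    typeof gt === "object" &&
    gt !== null &&
    typeof gt.domain === "string" &&
    typeof gt.subject === "string"
  );
}
```

A guard like this lets malformed samples fail before any model call is made.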

2. Extraction LLM

Model: claude-3-5-haiku-20241022

Input: email body

Output:

{
  "domain": "finance | hr | engineering | marketing | legal | operations | general",
  "subject": "core topic of the email (one sentence)",
  "action": "required_action | fyi | none"
}

Prompt: based on the six Constitutional AI principles (see 02-domain-classification.md)
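
Because the model returns free-form text, the output schema above is worth checking before it is used downstream. The following is a sketch of one way to do that. The `parseExtraction` name and the error messages are hypothetical, and it assumes the model replies with a bare JSON object.

```typescript
// Allowed values copied from the output schema above.
const DOMAINS = [
  "finance", "hr", "engineering", "marketing", "legal", "operations", "general",
] as const;
const ACTIONS = ["required_action", "fyi", "none"] as const;

type Domain = (typeof DOMAINS)[number];
type Action = (typeof ACTIONS)[number];

interface Extraction {
  domain: Domain;
  subject: string;
  action: Action;
}

// Hypothetical helper: parse the raw model reply and reject anything off-schema.
function parseExtraction(raw: string): Extraction {
  const data: unknown = JSON.parse(raw); // throws on invalid JSON
  if (typeof data !== "object" || data === null) {
    throw new Error("extraction output is not a JSON object");
  }
  const { domain, subject, action } = data as Record<string, unknown>;
  if (typeof domain !== "string" || (DOMAINS as readonly string[]).indexOf(domain) === -1) {
    throw new Error("unknown domain: " + String(domain));
  }
  if (typeof subject !== "string" || subject.trim() === "") {
    throw new Error("subject must be a non-empty string");
  }
  if (typeof action !== "string" || (ACTIONS as readonly string[]).indexOf(action) === -1) {
    throw new Error("unknown action: " + String(action));
  }
  return { domain: domain as Domain, subject: subject.trim(), action: action as Action };
}
```

Rejecting unknown domains here keeps a hallucinated label from being counted as an ordinary misclassification later.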

3. Judge LLM

Purpose: decide whether the predicted subject is semantically equivalent to the Ground Truth subject

Model: claude-3-5-haiku-20241022

Input:

Ground Truth subject
Predicted subject

Output:

{
  "equivalent": true/false,
  "confidence": 0.0-1.0,
  "reasoning": "rationale for the judgment"
}

Details: see 03-validation-framework.md
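
The Judge output can be checked and mapped onto the validation fields along these lines. This is a sketch, not the implementation described in 03-validation-framework.md: `parseJudgeVerdict` and `toResultFields` are hypothetical names, and the Judge is assumed to return a bare JSON object.

```typescript
interface JudgeVerdict {
  equivalent: boolean;
  confidence: number; // expected range: 0.0 to 1.0
  reasoning: string;
}

// Hypothetical helper: parse the raw Judge reply and enforce the schema above.
function parseJudgeVerdict(raw: string): JudgeVerdict {
  const data: unknown = JSON.parse(raw); // throws on invalid JSON
  if (typeof data !== "object" || data === null) {
    throw new Error("judge output is not a JSON object");
  }
  const { equivalent, confidence, reasoning } = data as Record<string, unknown>;
  if (typeof equivalent !== "boolean") {
    throw new Error("equivalent must be a boolean");
  }
  if (typeof confidence !== "number" || !(confidence >= 0 && confidence <= 1)) {
    throw new Error("confidence must be a number in [0, 1]");
  }
  if (typeof reasoning !== "string") {
    throw new Error("reasoning must be a string");
  }
  return { equivalent, confidence, reasoning };
}

// The Judge verdict drives the subject-level fields of the validation result.
function toResultFields(verdict: JudgeVerdict) {
  return {
    subjectCorrect: verdict.equivalent,
    judgeEquivalent: verdict.equivalent,
    judgeConfidence: verdict.confidence,
  };
}
```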

4. Validation Result

Final validation result:

interface ValidationResult {
  email: EmailSample;
  predicted: { domain: string; subject: string; action: string };
  correct: boolean;           // true only when both domain and subject match
  domainCorrect: boolean;     // whether the domain is correct
  subjectCorrect: boolean;    // result of the Judge verdict
  subjectSimilarity?: number; // embedding similarity (for reference only)
  judgeEquivalent?: boolean;  // Judge verdict
  judgeConfidence?: number;   // Judge confidence
  latencyMs: number;
}
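
A result of this shape could be assembled as shown below. This is a sketch under stated assumptions: `buildValidationResult` is a hypothetical helper, `groundTruth` is assumed to carry a labeled `domain` and `subject`, and domain correctness is assumed to be an exact string match, which the source does not spell out.

```typescript
interface EmailSample {
  subject: string;
  from: string;
  content: string;
  groundTruth: { domain: string; subject: string }; // assumed shape
}

interface Prediction { domain: string; subject: string; action: string }

interface ValidationResult {
  email: EmailSample;
  predicted: Prediction;
  correct: boolean;
  domainCorrect: boolean;
  subjectCorrect: boolean;
  subjectSimilarity?: number;
  judgeEquivalent?: boolean;
  judgeConfidence?: number;
  latencyMs: number;
}

// Hypothetical helper that combines the per-stage outputs into one record.
function buildValidationResult(
  email: EmailSample,
  predicted: Prediction,
  judge: { equivalent: boolean; confidence: number },
  latencyMs: number,
  subjectSimilarity?: number,
): ValidationResult {
  // Assumption: the domain is scored by exact match against the manual label.
  const domainCorrect = predicted.domain === email.groundTruth.domain;
  // The subject is scored by the Judge verdict, not by embedding similarity.
  const subjectCorrect = judge.equivalent;
  return {
    email,
    predicted,
    correct: domainCorrect && subjectCorrect,
    domainCorrect,
    subjectCorrect,
    subjectSimilarity,
    judgeEquivalent: judge.equivalent,
    judgeConfidence: judge.confidence,
    latencyMs,
  };
}
```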

Cost Structure

Stage	Model	Cost/email
Extraction	Haiku 3.5	~$0.0008
Embedding	text-embedding-3-small	~$0.0001
Judge	Haiku 3.5	~$0.0002
Total	-	~$0.001
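
The arithmetic behind the table can be checked with a few lines. The per-stage figures are the approximate values from the table; the batch size of 10,000 emails is a hypothetical example. Summing the three stages gives about $0.0011 per email, which the table rounds to ~$0.001, so Extraction accounts for roughly 73% of the unrounded sum.

```typescript
// Per-email costs in USD, copied from the table above (all approximate).
const costPerEmail = { extraction: 0.0008, embedding: 0.0001, judge: 0.0002 };

function totalCost(c: typeof costPerEmail): number {
  return c.extraction + c.embedding + c.judge;
}

const total = totalCost(costPerEmail);
const extractionShare = costPerEmail.extraction / total;

// Hypothetical batch size, used only to illustrate scaling.
const BATCH_SIZE = 10000;

const totalLabel = total.toFixed(4);                          // "0.0011"
const shareLabel = (extractionShare * 100).toFixed(1) + "%";  // "72.7%"
const batchLabel = (total * BATCH_SIZE).toFixed(2);           // "11.00"
```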

Bottlenecks and Optimization

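Extraction: largest share of the cost (~80%)
- Optimization: simplify the prompt, cap output tokens
Judge: call only when needed
- Optimization: high-confidence cases can skip the Judge
Embedding: optional, since it is for reference only
- Optimization: can be removed when only the Judge is used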
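
One way to read "call the Judge only when needed" is to gate the call on a cheaper signal. The source does not say which signal defines a high-confidence case, so the sketch below assumes embedding similarity is used and that 0.95 is the cutoff. Both are illustrative assumptions, as are the function names.

```typescript
// Hypothetical cutoff; the source does not specify a threshold.
const DEFAULT_SKIP_THRESHOLD = 0.95;

// Returns true when the Judge should still be called for this sample.
// Assumption: a very high embedding similarity counts as "high confidence".
function shouldCallJudge(
  subjectSimilarity: number | undefined,
  skipThreshold: number = DEFAULT_SKIP_THRESHOLD,
): boolean {
  if (subjectSimilarity === undefined) return true; // no cheap signal available
  return subjectSimilarity < skipThreshold;
}

// Counts how many Judge calls a batch would need under this gate.
function countJudgeCalls(
  similarities: Array<number | undefined>,
  skipThreshold: number = DEFAULT_SKIP_THRESHOLD,
): number {
  return similarities.filter((s) => shouldCallJudge(s, skipThreshold)).length;
}
```

Under this reading the two optimizations trade off against each other: gating on similarity requires keeping the Embedding stage, while dropping Embedding means every sample goes to the Judge.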
