Email Gateway Submission¶
This page documents how the HostAgent email collector constructs ingestion payloads and submits them to the Gateway APIs. The Swift implementation lives under hostagent/Sources/HostAgent/Collectors/EmailCollector.swift with HTTP submission handled by Submission/GatewaySubmissionClient.swift.
Overview¶
- Endpoint targets
POST /v1/ingestfor single redacted email document payloads.POST /v1/ingest:batchfor multiple documents in one request (used whenbatch: truein collector config).POST /v1/ingest/filefor binary attachments plus enrichment metadata.- Idempotency keys
- Documents:
SHA256("{source_type}:{source_id}:{text_hash}"), wheretext_hashis the SHA-256 of the normalized body (CRLF → LF, trimmed) before redaction. - Attachments:
email-attachment:{message_id_or_hash}:{file_sha256}. - Retries
- Automatically retries transient Gateway responses (
429,503) using exponential back-off (0.5s, 1.0s). - Treats other
4xxresponses as hard failures and surfaces the body text in logs.
Document Payload Shape¶
{
"source_type": "email_local",
"source_id": "email:message-123",
"title": "Receipt for your order",
"content": {
"mime_type": "text/plain",
"data": "Hello [EMAIL_REDACTED], thanks for your payment...",
"encoding": null
},
"people": [
{
"identifier": "billing@example.com",
"identifier_type": "email",
"role": "sender",
"display_name": "Billing"
},
{
"identifier": "alice@example.com",
"identifier_type": "email",
"role": "recipient"
}
],
"metadata": {
"message_id": "message-123",
"subject": "Receipt for your order",
"snippet": "Hello [EMAIL_REDACTED], thanks for your payment...",
"content_hash": "ce62…",
"in_reply_to": "parent-456",
"references": ["parent-456","root-001"],
"has_attachments": true,
"attachment_count": 2,
"intent": {
"primary_intent": "receipt",
"confidence": 0.92,
"secondary_intents": ["actionRequired"],
"extracted_entities": {
"amount": "42.00",
"currency": "USD"
}
},
"relevance_score": 0.87
},
"thread_id": "eb3f5f8c-…",
"thread": {
"external_id": "email-thread:85e43…",
"source_type": "email",
"title": "Receipt for your order",
"participants": [
{ "identifier": "billing@example.com", "role": "sender" },
{ "identifier": "alice@example.com", "role": "recipient" }
],
"metadata": {
"message_ids": "message-123,parent-456,root-001"
}
},
"relevance_score": 0.87
}
Field Notes¶
- Redaction – Bodies are run through
EmailService.redactPII(emails, phone numbers, account numbers, SSNs) before hashing. - People – Sender, To, Cc, and Bcc addresses are deduplicated and lowercased.
- Thread – Deterministic UUID derived from all message IDs (current, parent, references). Absent if no threading headers exist.
- Intent – Optional; populated from
IntentClassificationwhen available and mirrored on bothmetadata.intentand top-levelintent. - Timestamps –
content_timestampis set from the emailDateheader withcontent_timestamp_type="received"by default.
Attachment Uploads¶
Attachments are uploaded individually using multipart form data. The collector computes the SHA-256 hash before submitting to minimise duplicate uploads.
Metadata envelope:
{
"source": "email_local",
"path": "email/message-123/invoice.pdf",
"filename": "invoice.pdf",
"mime_type": "application/pdf",
"sha256": "94fe…",
"size": 20480,
"message_id": "message-123",
"content_id": "cid-abc123",
"intent": {
"primary_intent": "receipt",
"confidence": 0.92
},
"relevance_score": 0.87,
"enrichment": {
"ocr_text": "Total due $42.00",
"entities": { "amount": ["42.00"], "merchant": ["Acme"] },
"caption": "Scanned receipt with store logo"
}
}
- The metadata JSON is passed in the
metafield; binary bytes are provided inupload. - The same intent and relevance score from the parent document can be mirrored at the attachment level.
- When enrichment fails, the collector still submits the file but omits the
enrichmentobject.
Configuration¶
Add (or confirm) the Gateway configuration block in ~/.haven/hostagent.yaml:
gateway:
base_url: http://gateway:8085
ingest_path: /v1/ingest
ingest_file_path: /v1/ingest/file
timeout: 30
The ingest_file_path property defaults to /v1/ingest/file for backwards compatibility, so older configuration files continue to work without edits.
Email collector state configuration¶
modules:
mail:
enabled: true
state:
clear_on_new_run: true # discard prior per-run map when new items are discovered
run_state_path: ~/.haven/email_collector_state_run.json
rejected_log_path: ~/.haven/rejected_emails.log
lock_file_path: ~/.haven/email_collector.lock
rejected_retention_days: 30 # daily rotated logs kept for this many days
All fields are optional; the defaults shown match the built-in configuration. When clear_on_new_run is true, discovering any new Envelope Index rows clears the previous per-run map before processing; set it to false to reattempt outstanding submissions across runs. The lock file prevents two collectors from processing the same mailbox concurrently.
Error Handling & Logging¶
- All Gateway interactions log structured events under the
gateway-submissioncategory. - On non-retryable failures (
4xxother than429), the error body is logged and propagated to the caller. - After three retry attempts, the client raises
EmailCollectorError.gatewayHTTPError(-1, "Exceeded retry attempts"). - Attachment read failures bubble up as
EmailCollectorError.attachmentReadFailed; the collector may choose to skip the offending file but continue with the remaining batch. - The email collector persists a per-run submission map to
run_state_path(default~/.haven/email_collector_state_run.json). Each entry records the Gateway idempotency key, attempts, last attempt timestamp, and the latest status (found,submitted,accepted,rejected). - Final rejections are appended as newline-delimited JSON to
rejected_emails-YYYY-MM-DD.log(rotated daily), including the row ID, idempotency key, and server response to aid triage.
Admin endpoints¶
GET /v1/collectors/email_local/state now includes the run-state snapshot:
{
"status": "completed",
"run_state": {
"last_accepted_rowid": 42,
"entries": [
{
"key": "42",
"row_id": 42,
"external_id": "email:message-123",
"status": "accepted",
"attempts": 1,
"last_attempt_at": "2025-10-22T23:45:01.238Z",
"mailbox": "Inbox"
},
{
"key": "43",
"row_id": 43,
"status": "submitted",
"attempts": 2,
"last_error": "Gateway HTTP error 503: Service Unavailable"
}
]
}
}
This payload omits message bodies and attachment data, exposing only metadata needed for operational debugging.
Metrics¶
The collector emits Prometheus-style metrics through /v1/metrics:
email_local_found_total{mailbox="Inbox"}– messages discovered in the current run.email_local_submitted_total{mailbox="Inbox"}– submission attempts (documents).email_local_accepted_total{mailbox="Inbox", duplicate="false"}– documents accepted or deduplicated by Gateway.email_local_rejected_total{mailbox="Inbox", status="400"}– hard failures that will not be retried.email_local_submission_latency_ms{mailbox="Inbox"}– histogram of end-to-end submission latency in milliseconds.
Testing¶
Unit tests in hostagent/Tests/SubmissionTests/EmailCollectorTests.swift verify:
- Payload redaction, people extraction, and hash stability.
- Document requests include the computed idempotency key header.
- Attachment uploads retry automatically on
429and propagate enrichment metadata.
These tests use a custom URLProtocol to simulate Gateway responses, ensuring developers can iterate locally without reaching the real API.