Skip to content

Local Files Collector

The Local Files collector watches directories for files and ingests them into Haven for search and analysis.

Overview

The Local Files collector: - Monitors directories for new and modified files - Supports multiple file types (text, PDF, images) - Extracts text content automatically - Enriches images with OCR and captioning - Tracks processed files to avoid duplicates

Prerequisites

  • macOS, Linux, or Windows
  • Gateway API running and accessible
  • Read access to watched directories

Supported File Types

Extension MIME Type Text Extraction
.txt text/plain Direct read
.md text/markdown Direct read
.pdf application/pdf PDF parser
.png image/png OCR (if enabled)
.jpg, .jpeg image/jpeg OCR (if enabled)
.heic image/heic OCR (if enabled)

Haven.app provides the easiest way to run the Local Files collector:

  1. Launch Haven.app

  2. Configure Collector:

  3. Open Settings (⌘,)
  4. Navigate to Local Files Collector
  5. Configure options:

    • Watch directory (e.g., ~/HavenInbox)
    • Include patterns (glob, e.g., *.txt, *.pdf)
    • Exclude patterns (glob, e.g., *.tmp)
    • Move/delete after processing
    • Tags for ingested files
  6. Run Collector:

  7. Open Collectors window (⌘2)
  8. Select Local Files collector
  9. Click "Run" or use menu "Run All Collectors"

  10. Monitor Progress:

  11. View Dashboard (⌘1) for activity log
  12. Check processed file count
  13. Review any errors

Using CLI (Alternative)

For environments without Haven.app or automated runs:

Basic Usage

# Set authentication
export AUTH_TOKEN="changeme"
export GATEWAY_URL="http://localhost:8085"

# Watch a directory
python scripts/collectors/collector_localfs.py \
  --watch ~/HavenInbox \
  --move-to ~/.haven/localfs/processed

Options

Flag Description
--watch DIR Directory to watch for files
--move-to DIR Move files here after processing
--include PATTERN Include files matching pattern (glob)
--exclude PATTERN Exclude files matching pattern (glob)
--delete-after Delete files after processing (instead of move)
--dry-run Log matches without uploading
--one-shot Process current backlog and exit
--tags TAG1,TAG2 Add tags to ingested files
--state-file PATH Custom state file location

Examples

# Watch directory, move processed files
python scripts/collectors/collector_localfs.py \
  --watch ~/Documents/Inbox \
  --move-to ~/Documents/Processed \
  --include "*.txt" --include "*.pdf"

# One-time run, delete after processing
python scripts/collectors/collector_localfs.py \
  --watch ~/Downloads \
  --delete-after \
  --one-shot

# Dry run to see what would be processed
python scripts/collectors/collector_localfs.py \
  --watch ~/HavenInbox \
  --dry-run

# Add tags to files
python scripts/collectors/collector_localfs.py \
  --watch ~/Documents \
  --tags "personal,important"

How It Works

File Detection

  1. Initial Scan:
  2. Scans watch directory for matching files
  3. Checks against include/exclude patterns
  4. Skips already processed files (via state file)

  5. File Watching:

  6. Uses filesystem events (FSEvents on macOS)
  7. Detects new and modified files
  8. Triggers processing automatically

  9. Duplicate Detection:

  10. Calculates SHA256 hash of file content
  11. Checks against Gateway for existing files
  12. Skips duplicates automatically

File Processing

  1. Upload:
  2. Uploads file to Gateway
  3. Gateway stores in MinIO object storage
  4. Returns file metadata and SHA256

  5. Text Extraction:

  6. Text files: Direct read
  7. PDF: Uses PDF parser
  8. Images: OCR extraction (if enabled)

  9. Image Enrichment:

  10. OCR for text in images
  11. Optional captioning via Ollama
  12. Entity detection

  13. Document Creation:

  14. Creates document with extracted text
  15. Links file attachment
  16. Adds metadata (tags, timestamps, filename)

  17. Post-Processing:

  18. Moves file to move-to directory (if configured)
  19. Or deletes file (if --delete-after)
  20. Updates state file with processed file hash

Configuration

Environment Variables

See Configuration Reference for complete list. Key variables:

  • AUTH_TOKEN - Gateway API authentication
  • GATEWAY_URL - Gateway API base URL
  • LOCALFS_MAX_FILE_MB - Maximum file size (default: 100 MB)
  • LOCALFS_REQUEST_TIMEOUT - HTTP timeout (default: 300 seconds)

Haven.app Configuration

Configure via Settings (⌘,) → Local Files Collector:

collectors:
  localfs:
    enabled: true
    watch_dir: "~/HavenInbox"
    include: ["*.txt", "*.md", "*.pdf", "*.png", "*.jpg"]
    exclude: ["*.tmp", "*.log"]
    move_to: "~/.haven/localfs/processed"
    delete_after: false
    tags: ["inbox"]

Per-Collector Enrichment:

Control enrichment behavior for Local Files collector via Settings (⌘,) → Enrichment Settings:

  • Skip Enrichment: When enabled, file documents are submitted without OCR, face detection, entity extraction, or captioning
  • Default: Enrichment is enabled (skipEnrichment: false)

Global enrichment module settings (OCR quality, entity types, captioning models) are configured in Advanced Settings. See Configuration Reference for details.

File Patterns

Use glob patterns for include/exclude:

  • *.txt - All .txt files
  • *.pdf - All PDF files
  • **/*.md - All .md files recursively
  • *.tmp - Exclude .tmp files
  • test_* - Files starting with "test_"

State Management

State File

State is tracked in ~/Library/Application Support/Haven/State/localfs_collector_state.json:

{
  "processed_files": {
    "sha256_hash": {
      "path": "/full/path/to/file",
      "processed_at": "2024-01-01T12:00:00Z"
    }
  }
}

Resetting State

To reprocess all files:

# Remove state file
rm ~/Library/Application\ Support/Haven/State/localfs_collector_state.json

# Run collector (will process all files)
# In Haven.app: Collectors → Local Files → Run

Warning: This will re-upload all files. Use with caution.

Workflow Examples

Inbox Processing

Set up an inbox directory for manual file drops:

# Create inbox
mkdir -p ~/HavenInbox

# Configure collector to watch inbox
# In Haven.app: Settings → Local Files → Watch Directory: ~/HavenInbox

# Drop files into inbox
cp document.pdf ~/HavenInbox/

# Collector automatically processes and moves files

Automated Document Sync

Watch a shared directory:

python scripts/collectors/collector_localfs.py \
  --watch ~/Documents/Shared \
  --include "*.pdf" \
  --include "*.docx" \
  --tags "shared,work"

One-Time Import

Process existing files:

python scripts/collectors/collector_localfs.py \
  --watch ~/OldDocuments \
  --one-shot \
  --delete-after

Troubleshooting

Files Not Processing

Issue: Files in watch directory not being processed

Solutions: - Check include/exclude patterns match files - Verify file types are supported - Check state file for already-processed files - Review collector logs for errors - Ensure collector is running (check Haven.app status)

Permission Errors

Error: Cannot read files

Solution: - Verify read permissions on watch directory - Check file permissions - On macOS: ensure Full Disk Access if needed

Large File Errors

Error: File too large

Solution: - Increase LOCALFS_MAX_FILE_MB environment variable - Or split large files before processing - Default limit is 100 MB

Duplicate Files

Issue: Same file processed multiple times

Solution: - Check state file is being updated - Verify SHA256 hashing is working - Check for multiple collector instances - Review Gateway deduplication logs

Performance Considerations

File Size Limits

  • Small files (< 1 MB): Fast processing
  • Medium files (1-10 MB): Moderate processing time
  • Large files (10-100 MB): Slower, consider splitting
  • Very large files (> 100 MB): May timeout, increase LOCALFS_REQUEST_TIMEOUT

Batch Processing

The collector processes files one at a time. For large directories: - Use --one-shot for initial import - Then enable continuous watching - Consider processing in smaller batches

Network Considerations

  • Files are uploaded to Gateway
  • Ensure sufficient bandwidth for large files
  • Consider local Gateway for faster uploads