Haven LLM Provider Abstraction¶
The haven_llm module provides a unified, provider-agnostic interface for interacting with multiple LLM backends. This abstraction allows seamless switching between local (Ollama) and cloud-based (OpenAI) LLM providers without changing application code.
Overview¶
Location: shared/haven_llm.py
Supported Providers:
- Ollama: Local, self-hosted open-source models (Llama, Mistral, Neural Chat, etc.)
- OpenAI: Cloud-based proprietary models (GPT-4, GPT-3.5-turbo, etc.)
Key Features:
- Unified LLMProvider interface for all providers
- Automatic provider selection from environment variables
- Support for both text generation and embeddings
- JSON-formatted responses
- System prompts and streaming-ready architecture
- Health checks for provider availability
- Comprehensive error handling
- Built-in logging and debugging support
Installation¶
The module uses httpx for HTTP communication, which should already be available in the Haven environment:
# Typically already installed
pip install httpx
For local development with Ollama:
# Install Ollama from https://ollama.ai
ollama pull llama3.2
ollama serve
For OpenAI support, set the environment variable:
export OPENAI_API_KEY="sk-..."
Quick Start¶
Basic Usage with Automatic Provider Selection¶
from shared.haven_llm import get_llm_provider
# Automatically selects provider from LLM_PROVIDER env var (default: Ollama)
provider = get_llm_provider()
response = provider.generate(
prompt="What is machine learning?",
)
print(response.text)
Explicit Ollama Provider¶
from shared.haven_llm import OllamaProvider, OllamaSettings
settings = OllamaSettings(
base_url="http://localhost:11434",
model="llama3.2",
temperature=0.7,
)
provider = OllamaProvider(settings=settings)
response = provider.generate(
prompt="Explain neural networks.",
)
Explicit OpenAI Provider¶
from shared.haven_llm import OpenAIProvider, OpenAISettings
settings = OpenAISettings(
model="gpt-4",
temperature=0.8,
max_tokens=2000,
)
provider = OpenAIProvider(settings=settings)
response = provider.generate(
prompt="Write a poem about AI.",
)
Core Components¶
LLMProvider (Abstract Base Class)¶
The LLMProvider abstract base class defines the interface all providers must implement:
class LLMProvider(abc.ABC):
    def generate(
        self,
        prompt: str,
        *,
        model: Optional[str] = None,
        temperature: Optional[float] = None,
        format: str = "text",
        system_prompt: Optional[str] = None,
        **kwargs: Any,
    ) -> LLMResponse:
        """Generate a response from the LLM."""
        ...

    def embed(
        self,
        text: str,
        *,
        model: Optional[str] = None,
        **kwargs: Any,
    ) -> list[float]:
        """Generate embeddings for text."""
        ...

    def health_check(self) -> bool:
        """Check if the provider is accessible."""
        ...
LLMResponse¶
All generate() calls return an LLMResponse object:
@dataclass
class LLMResponse:
    text: str                        # The generated response
    model: str                       # Model that generated it
    provider: str                    # Provider name ('ollama', 'openai')
    raw_response: Dict[str, Any]     # Raw API response
    usage: Optional[Dict[str, int]]  # Token usage {input, output, total}
Usage:
response = provider.generate(prompt="Hello")
print(response.text) # The actual response
print(response.model) # Which model was used
print(response.usage) # Token counts (if available)
print(response.raw_response) # Raw API response for debugging
Settings Classes¶
OllamaSettings¶
@dataclass
class OllamaSettings(LLMSettings):
    base_url: str = "http://localhost:11434"  # Ollama API endpoint
    model: str = "llama3.2"                   # Default model
    temperature: float = 0.7                  # Creativity level (0.0-1.0)
    top_p: float = 0.95                       # Nucleus sampling
    top_k: int = 40                           # Top-k sampling
    timeout: float = 60.0                     # Request timeout (seconds)
    max_retries: int = 3                      # Retry attempts
    retry_delay: float = 1.0                  # Initial retry delay (seconds)
OpenAISettings¶
@dataclass
class OpenAISettings(LLMSettings):
    api_key: str = ""                     # API key (read from OPENAI_API_KEY if empty)
    model: str = "gpt-5-nano-2025-08-07"  # Default model
    temperature: float = 0.7              # Creativity level (not supported by all models)
    max_tokens: Optional[int] = None      # Max response length
    timeout: float = 60.0                 # Request timeout (seconds)
    max_retries: int = 3                  # Retry attempts
    retry_delay: float = 1.0              # Initial retry delay (seconds)
Providers¶
OllamaProvider¶
Local, self-hosted LLM provider using the Ollama API.
Supported Operations:
- generate(): Text generation with /api/generate endpoint
- embed(): Embeddings with /api/embeddings endpoint
- health_check(): Check /api/tags endpoint
Example:
from shared.haven_llm import OllamaProvider
provider = OllamaProvider()
# Text generation
response = provider.generate(
prompt="What is the Python GIL?",
temperature=0.5,
)
print(response.text)
# JSON format
response = provider.generate(
prompt="Extract emails from: John@example.com, Jane@test.org",
format="json",
)
# Embeddings
embedding = provider.embed("The quick brown fox")
print(len(embedding)) # e.g., 1024 dimensions
Configuration via Environment:
export OLLAMA_BASE_URL="http://localhost:11434"
export OLLAMA_MODEL="llama3.2"
OpenAIProvider¶
Cloud-based OpenAI models (GPT family) via the OpenAI API.
Supported Operations:
- generate(): Text generation via Responses API (/v1/responses)
- embed(): Embeddings via Embeddings API (/v1/embeddings)
- health_check(): Check Models endpoint (/v1/models)
Example:
from shared.haven_llm import OpenAIProvider
provider = OpenAIProvider()
# Text generation
response = provider.generate(
prompt="Write a Python decorator tutorial.",
model="gpt-5-nano-2025-08-07",
temperature=0.8,
max_tokens=2000,
)
print(response.text)
# With system prompt
response = provider.generate(
prompt="What is 2+2?",
system_prompt="You are a math tutor. Be clear and concise.",
)
# Note: Some models (e.g., gpt-5-nano) don't support temperature parameter
# The provider will automatically skip temperature for unsupported models
# Embeddings
embedding = provider.embed("The quick brown fox")
print(len(embedding)) # 1536 for text-embedding-3-small
Configuration via Environment:
export OPENAI_API_KEY="sk-..."
export OPENAI_MODEL="gpt-5-nano-2025-08-07" # or gpt-5-mini, gpt-5.1, etc.
Common Usage Patterns¶
Pattern 1: Text Generation with Custom Temperature¶
provider = get_llm_provider()
# Deterministic responses (temperature closer to 0)
factual = provider.generate(
prompt="What is the capital of France?",
temperature=0.1,
)
# Creative responses (temperature closer to 1)
creative = provider.generate(
prompt="Write a creative short story about cats.",
temperature=0.9,
)
Pattern 2: JSON-Formatted Extraction¶
provider = get_llm_provider()
response = provider.generate(
prompt="""Extract structured data from:
"Alice works at Google as a Software Engineer with 5 years experience."
Return JSON: {name, company, title, years_experience}""",
format="json",
)
import json
data = json.loads(response.text)
print(data['name']) # Alice
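Even with format="json", models can occasionally return text that is not valid JSON, so it is worth guarding the parse. A defensive variant of the pattern above (the empty-dict fallback is just one possible policy, not behaviour the module provides):
import json

from shared.haven_llm import get_llm_provider

provider = get_llm_provider()
response = provider.generate(
    prompt="Extract {name, company} as JSON from: 'Alice works at Google.'",
    format="json",
)

try:
    data = json.loads(response.text)
except json.JSONDecodeError:
    # Log the raw text and fall back to an empty result (or re-prompt).
    data = {}
print(data.get("name"))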
Pattern 3: System Prompts for Context¶
provider = get_llm_provider()
response = provider.generate(
prompt="What programming languages should I learn?",
system_prompt="""You are an expert Python developer and open-source advocate.
Prioritize Python and open-source technologies in your recommendations.""",
)
Pattern 4: Provider Detection and Fallback¶
from shared.haven_llm import get_llm_provider, OllamaProvider, LLMProviderError
# Try to use preferred provider
try:
provider = get_llm_provider("openai")
except LLMProviderError:
print("OpenAI not available, falling back to Ollama")
provider = get_llm_provider("ollama")
response = provider.generate(prompt="Hello!")
Pattern 5: Health Checks Before Processing¶
from shared.haven_llm import get_llm_provider
provider = get_llm_provider()
if not provider.health_check():
print("LLM provider is not available!")
# Handle gracefully, skip processing, etc.
else:
response = provider.generate(prompt="Process this...")
Pattern 6: Embeddings for Semantic Search¶
from shared.haven_llm import get_llm_provider
provider = get_llm_provider()
# Generate embeddings for documents
documents = [
"Python is a programming language",
"The capital of France is Paris",
"Machine learning is part of AI",
]
embeddings = [provider.embed(doc) for doc in documents]
# Store in vector database (e.g., Qdrant)
for doc, embedding in zip(documents, embeddings):
# vector_store.add(doc, embedding)
pass
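Ranking is then typically done with cosine similarity between the query embedding and the stored document embeddings. A minimal, dependency-free sketch (cosine_similarity here is a local helper, not part of haven_llm):
import math

from shared.haven_llm import get_llm_provider

provider = get_llm_provider()


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


documents = [
    "Python is a programming language",
    "The capital of France is Paris",
    "Machine learning is part of AI",
]
doc_embeddings = [provider.embed(doc) for doc in documents]

query = provider.embed("Which city is the French capital?")
best_doc, _ = max(
    zip(documents, doc_embeddings),
    key=lambda pair: cosine_similarity(query, pair[1]),
)
print(best_doc)  # Expected to surface the Paris document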
Environment Variables¶
Provider Selection¶
# Choose provider (default: 'ollama')
export LLM_PROVIDER="ollama" # or "openai"
Ollama Configuration¶
# Ollama API endpoint (default: http://localhost:11434)
export OLLAMA_BASE_URL="http://localhost:11434"
# Default model for Ollama
export OLLAMA_MODEL="llama3.2" # or neural-chat, mistral, etc.
OpenAI Configuration¶
# Required: OpenAI API key
export OPENAI_API_KEY="sk-..."
# Default model (default: gpt-5-nano-2025-08-07)
export OPENAI_MODEL="gpt-5-mini" # or gpt-5.1, etc.
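These variables are read when the provider is constructed, so they can also be set from Python (for example in a test harness) before calling get_llm_provider. A small sketch; exactly when the settings objects read the environment may vary:
import os

from shared.haven_llm import get_llm_provider

# Point the factory at a local Ollama instance for this process only.
os.environ["LLM_PROVIDER"] = "ollama"
os.environ["OLLAMA_BASE_URL"] = "http://localhost:11434"
os.environ["OLLAMA_MODEL"] = "llama3.2"

provider = get_llm_provider()
print(provider.health_check())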
Error Handling¶
The module defines three exception types:
from shared.haven_llm import (
LLMError, # Base exception
LLMProviderError, # Network or API failures
LLMConfigError, # Missing config, invalid settings
)
try:
response = provider.generate(prompt="test")
except LLMConfigError as e:
print(f"Configuration error: {e}")
except LLMProviderError as e:
print(f"Provider error: {e}")
except LLMError as e:
print(f"Unexpected LLM error: {e}")
Integration Examples¶
Integration with Intents Classification¶
from shared.haven_llm import get_llm_provider
from src.haven.intents.classifier import classify_artifact
provider = get_llm_provider()
response = provider.generate(
prompt="Classify this as work or personal: 'Meeting at 3pm'",
format="json",
)
Integration with Slot Extraction¶
from shared.haven_llm import get_llm_provider
provider = get_llm_provider()
response = provider.generate(
prompt="Extract: name, phone, email from contact info...",
format="json",
)
Integration with Document Processing¶
from shared.haven_llm import get_llm_provider
provider = get_llm_provider()
# Summarize documents
summary = provider.generate(
prompt=f"Summarize: {document_text}",
temperature=0.3,
)
# Extract entities
entities = provider.generate(
prompt=f"Extract entities: {document_text}",
format="json",
)
Performance Considerations¶
Ollama (Local)¶
- Advantages: Privacy, no API costs, works offline
- Trade-offs: Requires local computational resources
- Model Selection:
  - llama3.2 (3B): Fast, lightweight
  - llama3 (8B): Balance of speed and quality
  - mistral: Good for instruction following
  - neural-chat: Optimized for conversations
OpenAI (Cloud)¶
- Advantages: No local setup, latest models, managed
- Trade-offs: API costs, network latency, data privacy
- Model Selection:
  - gpt-5-nano-2025-08-07: Fast, cost-effective (doesn't support temperature)
  - gpt-5-mini: Balanced performance and cost
  - gpt-5.1: Most capable
- Cost tracking is automatically logged for all models
Optimization Tips¶
- Batch requests to reduce overhead
- Use lower temperature (0.1-0.3) for factual tasks
- Set appropriate timeouts for local/remote
- Cache embeddings for repeated documents (see the sketch after this list)
- Use shorter prompts to reduce token usage
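As a sketch of the embedding-cache tip above, repeated documents can reuse previously computed vectors keyed by their text (embed_cached is a local helper, not part of the module; a real deployment might persist the cache alongside the vector store):
from shared.haven_llm import get_llm_provider

provider = get_llm_provider()
_embedding_cache: dict[str, list[float]] = {}


def embed_cached(text: str) -> list[float]:
    # Reuse the vector if this exact text has been embedded before.
    if text not in _embedding_cache:
        _embedding_cache[text] = provider.embed(text)
    return _embedding_cache[text]


vectors = [embed_cached(doc) for doc in ["same text", "same text", "other text"]]
print(len(_embedding_cache))  # 2 unique texts embedded, one call saved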
Debugging¶
Enable Verbose Logging¶
import logging
from shared.logging import get_logger
logger = get_logger("shared.haven_llm")
logger.setLevel(logging.DEBUG)
Inspect Raw Responses¶
response = provider.generate(prompt="test")
print("Raw response:", response.raw_response)
Provider Information¶
provider = get_llm_provider()
print(provider) # Detailed provider info
print(provider.health_check()) # Is it available?
Testing¶
The module includes comprehensive error handling and logging. When writing or troubleshooting tests, enable debug logging (see Debugging above) to see detailed request/response information.
Cost tracking is automatically logged for OpenAI requests when available.
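For unit tests, code that depends on haven_llm can also be exercised against a stub instead of a live backend. A hedged sketch: FakeProvider is hypothetical and only mirrors the generate() signature used in this document:
from typing import Any

from shared.haven_llm import LLMResponse


class FakeProvider:
    """Test double that returns canned text without any network access."""

    def __init__(self, canned_text: str = "ok") -> None:
        self.canned_text = canned_text
        self.prompts: list[str] = []

    def generate(self, prompt: str, **kwargs: Any) -> LLMResponse:
        self.prompts.append(prompt)
        return LLMResponse(
            text=self.canned_text,
            model="fake",
            provider="fake",
            raw_response={},
            usage=None,
        )

    def health_check(self) -> bool:
        return True


def test_classification_prompt_is_sent() -> None:
    provider = FakeProvider(canned_text='{"label": "work"}')
    response = provider.generate(prompt="Classify: 'Meeting at 3pm'", format="json")
    assert "Meeting" in provider.prompts[0]
    assert response.text.startswith("{")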
Related Modules¶
- shared/image_enrichment.py: Image processing with OCR/vision
- src/haven/intents/classifier/classifier.py: Intent classification using LLMs
- src/haven/intents/slots/extractor.py: Slot extraction using LLMs
- services/worker_service/workers/intents.py: Intent worker using Ollama
Migration Guide¶
If you have existing code using direct httpx calls to Ollama or OpenAI:
Before:
import httpx
response = httpx.post("http://localhost:11434/api/generate", json=payload)
data = response.json()
text = data["response"]
After:
from shared.haven_llm import get_llm_provider
provider = get_llm_provider()
response = provider.generate(prompt=...)
text = response.text # Clean interface with automatic error handling
OpenAI Responses API Migration:
The OpenAI provider uses the Responses API (/v1/responses), not Chat Completions. The provider handles the request/response format automatically:
Before:
import httpx
response = httpx.post(
"https://api.openai.com/v1/chat/completions",
headers={"Authorization": f"Bearer {api_key}"},
json={"model": "gpt-4", "messages": [...]}
)
After:
from shared.haven_llm import OpenAIProvider
provider = OpenAIProvider()
response = provider.generate(prompt="...", system_prompt="...")
Future Enhancements¶
Potential additions to the module:
- [ ] Streaming response support
- [ ] Async/await interface
- [ ] Retry policy customization
- [x] Cost tracking for OpenAI (implemented)
- [ ] Support for Anthropic Claude
- [ ] Support for local LLMs (LLaMA.cpp, vLLM)
- [ ] Request/response caching
- [ ] Rate limiting
- [ ] Multi-provider fallback chains
- [ ] Structured output formats beyond JSON
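As a taste of what a multi-provider fallback chain could look like on top of the existing interface (purely illustrative; nothing like this exists in the module yet):
from typing import Optional

from shared.haven_llm import LLMProviderError, LLMResponse, get_llm_provider


def generate_with_fallback(
    prompt: str,
    providers: tuple[str, ...] = ("openai", "ollama"),
) -> LLMResponse:
    last_error: Optional[Exception] = None
    for name in providers:
        try:
            provider = get_llm_provider(name)
            if provider.health_check():
                return provider.generate(prompt=prompt)
        except LLMProviderError as exc:
            last_error = exc
    raise LLMProviderError(f"All providers failed; last error: {last_error}")


print(generate_with_fallback("Hello!").text)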