System Architecture
Reservoir is designed as a transparent proxy for OpenAI-compatible APIs, with a focus on capturing and enriching AI conversations. This section provides an overview of the system architecture and how components interact.
Request Processing Sequence
Reservoir intercepts your API calls, enriches them with relevant history, manages token limits, and then forwards them to the actual Language Model service. Here's the detailed sequence:
```mermaid
sequenceDiagram
    participant App
    participant Reservoir
    participant Neo4j
    participant LLM as OpenAI/Ollama

    App->>Reservoir: Request (e.g. /v1/chat/completions/$USER/my-application)
    Reservoir->>Reservoir: Check if last message exceeds token limit (Return error if true)
    Reservoir->>Reservoir: Tag with Trace ID + Partition
    Reservoir->>Neo4j: Store original request message(s)
    %% --- Context Enrichment Steps ---
    Reservoir->>Neo4j: Query for similar & recent messages
    Neo4j-->>Reservoir: Return relevant context messages
    Reservoir->>Reservoir: Inject context messages into request payload
    %% --- End Enrichment Steps ---
    Reservoir->>Reservoir: Check total token count & truncate if needed (preserving system/last messages)
    Reservoir->>LLM: Forward enriched & potentially truncated request
    LLM->>Reservoir: Return LLM response
    Reservoir->>Neo4j: Store LLM response message
    Reservoir->>App: Return LLM response
```
High-Level Architecture
```mermaid
flowchart TB
    Client(["Client App"]) -->|API Request| HTTPServer{{HTTP Server}}
    HTTPServer -->|Process Request| Handler[Request Handler]

    subgraph Handler Logic
        direction LR
        Handler_Start(Start) --> CheckInputTokens(Check Input Tokens)
        CheckInputTokens -- OK --> StoreRequest(Store Request)
        CheckInputTokens -- Too Long --> ReturnError(Return Error Response)
        StoreRequest --> QueryContext(Query Neo4j for Context)
        QueryContext --> InjectContext(Inject Context)
        InjectContext --> CheckTotalTokens(Check/Truncate Total Tokens)
        CheckTotalTokens --> ForwardRequest(Forward to LLM)
    end

    Handler -->|Store/Query| Neo4j[(Neo4j Database)]
    Handler -->|Forward/Receive| OpenAI([OpenAI/Ollama API])
    OpenAI --> Handler
    Handler -->|Return Response| HTTPServer
    HTTPServer -->|API Response| Client

    Config[/Env Vars/] --> HTTPServer
    Config --> Handler
    Config --> Neo4j
```
Core Components
1. Client Application
Your application making API calls to Reservoir. This could be:
- A web application using the OpenAI JavaScript library
- A Python script using the OpenAI Python library
- A command-line tool like curl
- Any application that can make HTTP requests
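For example, a Python client can point the official OpenAI SDK at Reservoir instead of api.openai.com. This is a minimal sketch assuming Reservoir is running locally on its default port (3017); `alice` and `my-app` are placeholder partition and instance names following the URL pattern described in the next section.

```python
# Minimal sketch: use the OpenAI Python SDK against a local Reservoir instance.
# The partition ("alice") and instance ("my-app") names are placeholders.
from openai import OpenAI

client = OpenAI(
    # The SDK appends /chat/completions to this base URL, so requests hit
    # Reservoir's /v1/partition/{partition}/instance/{instance}/chat/completions route.
    base_url="http://localhost:3017/v1/partition/alice/instance/my-app",
    api_key="sk-...",  # forwarded by Reservoir to the upstream provider
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What did we decide about the schema yesterday?"}],
)
print(response.choices[0].message.content)
```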
2. HTTP Server (Hyper/Tokio)
The HTTP server built on Rust's async ecosystem:
- Receives requests on the configured port (default: 3017)
- Routes based on URL path following the pattern
/v1/partition/{partition}/instance/{instance}/chat/completions
- Handles CORS for web applications
- Manages concurrent requests efficiently using Tokio's async runtime
3. Request Handler
The core logic that processes each request:
Input Validation
- Token size checking: Validates that the last message doesn't exceed token limits
- Request format validation: Ensures the request follows OpenAI's API structure
- Authentication: Forwards API keys to the appropriate provider
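To make the token size check concrete, here is an illustrative sketch. Reservoir itself is written in Rust, so the tokenizer, limit, and names below are assumptions for illustration only.

```python
# Illustrative sketch of the last-message token check; the actual tokenizer
# and limit used by Reservoir are implementation details.
import tiktoken

def last_message_fits(messages: list[dict], limit: int = 8192) -> bool:
    """Return True if the most recent message stays within the token limit."""
    encoding = tiktoken.get_encoding("cl100k_base")
    return len(encoding.encode(messages[-1]["content"])) <= limit
```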
Context Management
- Trace ID assignment: Each request gets a unique identifier for tracking
- Partition/Instance extraction: Pulls organization parameters from the URL path
- Message storage: Stores incoming messages in Neo4j with proper tagging
Context Enrichment
- Historical context query: Searches Neo4j for relevant past conversations
- Similarity matching: Uses vector embeddings to find semantically similar messages
- Recency filtering: Includes recent messages from the same partition/instance
- Context injection: Adds relevant context to the request payload
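A hedged sketch of what context injection might look like, assuming retrieved messages arrive as two lists (semantically similar and recent). The function and field names are illustrative, not Reservoir's internal API.

```python
# Sketch of context injection: retrieved messages are spliced into the payload
# ahead of the caller's original conversation. Names are illustrative only.
def enrich(messages: list[dict], similar: list[dict], recent: list[dict]) -> list[dict]:
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    # Deduplicate while keeping order: semantically similar context first, then recent.
    context, seen = [], set()
    for m in similar + recent:
        key = (m["role"], m["content"])
        if key not in seen:
            seen.add(key)
            context.append({"role": m["role"], "content": m["content"]})
    # System prompt(s) stay first; context goes before the original conversation.
    return system + context + rest
```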
Token Management
- Total token calculation: Counts tokens in the enriched message list
- Smart truncation: Removes older context while preserving system prompts and latest messages
- Provider-specific limits: Respects different token limits for different models
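The truncation rule can be sketched as follows; the token counting and the exact drop order are simplified assumptions for illustration.

```python
# Sketch of smart truncation: drop the oldest non-system context first, never
# the system prompt or the most recent message.
def truncate(messages: list[dict], max_tokens: int, count) -> list[dict]:
    """`count` is a callable mapping a single message dict to its token count."""
    while sum(count(m) for m in messages) > max_tokens and len(messages) > 1:
        # Find the oldest message that is neither a system prompt nor the last message.
        for i, m in enumerate(messages[:-1]):
            if m["role"] != "system":
                del messages[i]
                break
        else:
            break  # nothing left that is safe to drop
    return messages
```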
Request Forwarding
- Provider routing: Automatically routes to the correct provider based on model name
- Request forwarding: Sends the enriched request to the upstream LLM
- Response handling: Processes and stores the LLM's response
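A simplified sketch of name-based routing; the prefixes and endpoint URLs below are illustrative assumptions rather than Reservoir's actual routing table.

```python
# Sketch of provider routing keyed on the requested model name.
# Prefixes and endpoints are illustrative assumptions.
PROVIDERS = {
    "gpt-": "https://api.openai.com/v1",
    "mistral": "https://api.mistral.ai/v1",
    "gemini": "https://generativelanguage.googleapis.com/v1beta/openai",
}

def route(model: str, default: str = "http://localhost:11434/v1") -> str:
    """Pick an upstream base URL from the model name, falling back to a local Ollama."""
    for prefix, base_url in PROVIDERS.items():
        if model.lower().startswith(prefix):
            return base_url
    return default
```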
Relationship Building
- Synapse connections: Links semantically similar messages using vector similarity
- Weak connection removal: Removes relationships with similarity scores below 0.85
- Conversation threading: Maintains coherent conversation threads over time
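Conceptually, a synapse link is kept only when the cosine similarity between two message embeddings meets the 0.85 threshold mentioned above. A minimal sketch:

```python
# Minimal sketch of synapse scoring: cosine similarity between embeddings,
# keeping only links at or above the 0.85 threshold.
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def should_link(emb_a: list[float], emb_b: list[float], threshold: float = 0.85) -> bool:
    return cosine(emb_a, emb_b) >= threshold
```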
4. Neo4j Database
The graph database that stores all conversation data:
Data Storage
- MessageNode entities: Each message is stored as a node with properties
- Partition/Instance tagging: Messages are tagged for proper organization
- Vector embeddings: Semantic representations for similarity search
- Temporal information: Timestamps for recency-based queries
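An illustrative shape for a stored message node; the actual property names in Reservoir's Neo4j schema may differ.

```python
# Illustrative node shape only; Reservoir's real property names may differ.
from dataclasses import dataclass, field

@dataclass
class MessageNode:
    role: str                 # "system", "user", or "assistant"
    content: str              # message text
    partition: str            # organizational scope taken from the URL path
    instance: str             # sub-scope within the partition
    trace_id: str             # identifier shared by a request/response pair
    timestamp: float          # used for recency-based queries
    embedding: list[float] = field(default_factory=list)  # vector for similarity search
```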
Graph Relationships
- Synapse relationships: Connect related messages across conversations
- Conversation threads: Maintain sequential flow of discussions
- Similarity scores: Weighted relationships based on semantic similarity
Query Capabilities
- Vector similarity search: Find semantically similar messages
- Temporal queries: Retrieve recent messages within time windows
- Graph traversal: Navigate conversation relationships
- Partition/Instance filtering: Scope queries to specific contexts
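For example, a partition-scoped similarity query could be expressed with Neo4j's vector index procedure. The index name, property names, and driver usage below are assumptions for illustration.

```python
# Sketch of a similarity query scoped to one partition/instance, using the
# Neo4j Python driver and the db.index.vector.queryNodes procedure (Neo4j 5.x).
# The index name and node properties are assumed for illustration.
from neo4j import GraphDatabase

QUERY = """
CALL db.index.vector.queryNodes('message_embeddings', $k, $embedding)
YIELD node, score
WHERE node.partition = $partition AND node.instance = $instance
RETURN node.role AS role, node.content AS content, score
ORDER BY score DESC
"""

def similar_messages(uri, auth, embedding, partition, instance, k=10):
    with GraphDatabase.driver(uri, auth=auth) as driver:
        records, _, _ = driver.execute_query(
            QUERY, k=k, embedding=embedding, partition=partition, instance=instance
        )
        return [(r["role"], r["content"], r["score"]) for r in records]
```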
5. External LLM Services
Reservoir supports multiple AI providers:
- OpenAI: GPT-4, GPT-4o, GPT-3.5-turbo, and specialized models
- Ollama: Local models like Llama, Gemma, and custom models
- Mistral AI: Mistral's cloud-hosted models
- Google Gemini: Google's AI models
- Custom providers: Any OpenAI-compatible API endpoint
6. Configuration Management
Environment-based configuration:
- Database connection: Neo4j URI, credentials, and connection pooling
- Server settings: Port, host, CORS configuration
- API keys: Credentials for various AI providers
- Provider endpoints: Custom URLs for different services
- Token limits: Configurable limits for different models
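A hypothetical illustration of reading such settings from the environment; the variable names are placeholders, not Reservoir's documented configuration keys.

```python
# Placeholder variable names for illustration only.
import os

config = {
    "neo4j_uri": os.environ.get("NEO4J_URI", "bolt://localhost:7687"),
    "neo4j_user": os.environ.get("NEO4J_USERNAME", "neo4j"),
    "neo4j_password": os.environ.get("NEO4J_PASSWORD", ""),
    "port": int(os.environ.get("RESERVOIR_PORT", "3017")),
    "openai_api_key": os.environ.get("OPENAI_API_KEY", ""),
}
```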
Request Processing Flow
1. Request Arrival: Client sends a request to Reservoir's endpoint
2. URL Parsing: Extract partition and instance from the URL path
3. Input Validation: Check message format and token limits
4. Message Storage: Store the user's message in Neo4j
5. Context Retrieval: Query for relevant historical context
6. Context Enrichment: Inject relevant messages into the request
7. Token Management: Ensure the enriched request fits within limits
8. Provider Routing: Determine which AI provider to use
9. Request Forwarding: Send the enriched request to the AI provider
10. Response Processing: Receive and process the AI's response
11. Response Storage: Store the AI's response in Neo4j
12. Relationship Building: Create or update message relationships
13. Response Return: Send the response back to the client
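The sketch below ties these steps together in one function, reusing the illustrative helpers from earlier sections. `store`, `retrieve_context`, `count_tokens`, `forward`, and `link_similar` are hypothetical stand-ins for the components described above, and error handling is simplified.

```python
# Hedged end-to-end sketch of the flow above. store(), retrieve_context(),
# count_tokens(), forward(), and link_similar() are hypothetical placeholders;
# last_message_fits, enrich, truncate, and route refer to the earlier sketches.
def handle_request(partition: str, instance: str, payload: dict) -> dict:
    messages = payload["messages"]
    if not last_message_fits(messages):                                    # step 3
        return {"error": "last message exceeds the token limit"}
    store(partition, instance, messages[-1])                               # step 4
    similar, recent = retrieve_context(partition, instance, messages[-1])  # step 5
    enriched = enrich(messages, similar, recent)                           # step 6
    enriched = truncate(enriched, max_tokens=8192, count=count_tokens)     # step 7
    base_url = route(payload["model"])                                     # step 8
    response = forward(base_url, {**payload, "messages": enriched})        # steps 9-10
    store(partition, instance, response["choices"][0]["message"])          # step 11
    link_similar(partition, instance, response)                            # step 12
    return response                                                        # step 13
```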
Scalability Considerations
Horizontal Scaling
- Stateless design: Each request is independent
- Database connection pooling: Efficient resource utilization
- Async processing: Non-blocking I/O for high concurrency
Vertical Scaling
- Memory management: Efficient vector storage and retrieval
- CPU optimization: Fast similarity calculations
- Disk I/O: Optimized database queries and indexing
Performance Optimizations
- Vector indexing: Fast similarity search in Neo4j
- Connection pooling: Reuse database connections
- Caching strategies: Cache frequently accessed data
- Batching: Efficient bulk operations where possible
Security Architecture
Authentication
- API key forwarding: Secure handling of provider credentials
- No key storage: Reservoir doesn't store AI provider keys
- Environment-based secrets: Secure configuration management
Data Privacy
- Local storage: All conversation data stays on your infrastructure
- No external logging: Conversation content never leaves your network
- Configurable retention: Control how long data is stored
Access Control
- Partition isolation: Conversations are isolated by partition/instance
- URL-based permissions: Access control through URL structure
- Network security: Configurable CORS and network policies
Monitoring and Observability
Logging
- Request tracing: Unique trace IDs for each request
- Error logging: Detailed error information for debugging
- Performance metrics: Request timing and processing statistics
Health Checks
- Database connectivity: Monitor Neo4j connection health
- Provider availability: Check AI service availability
- Resource utilization: Memory and CPU monitoring
This architecture provides a robust, scalable foundation for AI conversation management while maintaining transparency and compatibility with existing applications.