System Architecture
Reservoir is designed as a transparent proxy for OpenAI-compatible APIs, with a focus on capturing and enriching AI conversations. This section provides an overview of the system architecture and how components interact.
Request Processing Sequence
Reservoir intercepts your API calls, enriches them with relevant history, manages token limits, and then forwards them to the actual Language Model service. Here's the detailed sequence:
```mermaid
sequenceDiagram
    participant App
    participant Reservoir
    participant Neo4j
    participant LLM as OpenAI/Ollama

    App->>Reservoir: Request (e.g. /v1/chat/completions/$USER/my-application)
    Reservoir->>Reservoir: Check if last message exceeds token limit (Return error if true)
    Reservoir->>Reservoir: Tag with Trace ID + Partition
    Reservoir->>Neo4j: Store original request message(s)
    %% --- Context Enrichment Steps ---
    Reservoir->>Neo4j: Query for similar & recent messages
    Neo4j-->>Reservoir: Return relevant context messages
    Reservoir->>Reservoir: Inject context messages into request payload
    %% --- End Enrichment Steps ---
    Reservoir->>Reservoir: Check total token count & truncate if needed (preserving system/last messages)
    Reservoir->>LLM: Forward enriched & potentially truncated request
    LLM->>Reservoir: Return LLM response
    Reservoir->>Neo4j: Store LLM response message
    Reservoir->>App: Return LLM response
```
High-Level Architecture
```mermaid
flowchart TB
    Client(["Client App"]) -->|API Request| HTTPServer{{HTTP Server}}
    HTTPServer -->|Process Request| Handler[Request Handler]

    subgraph Handler Logic
        direction LR
        Handler_Start(Start) --> CheckInputTokens(Check Input Tokens)
        CheckInputTokens -- OK --> StoreRequest(Store Request)
        CheckInputTokens -- Too Long --> ReturnError(Return Error Response)
        StoreRequest --> QueryContext(Query Neo4j for Context)
        QueryContext --> InjectContext(Inject Context)
        InjectContext --> CheckTotalTokens(Check/Truncate Total Tokens)
        CheckTotalTokens --> ForwardRequest(Forward to LLM)
    end

    Handler -->|Store/Query| Neo4j[(Neo4j Database)]
    Handler -->|Forward/Receive| OpenAI([OpenAI/Ollama API])
    OpenAI --> Handler
    Handler -->|Return Response| HTTPServer
    HTTPServer -->|API Response| Client

    Config[/Env Vars/] --> HTTPServer
    Config --> Handler
    Config --> Neo4j
```
Core Components
1. Client Application
Your application making API calls to Reservoir. This could be:
- A web application using the OpenAI JavaScript library
- A Python script using the OpenAI Python library
- A command-line tool like curl
- Any application that can make HTTP requests
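For example, a Python client can point the official OpenAI SDK at Reservoir instead of api.openai.com. This is a minimal sketch assuming Reservoir is running locally on its default port (3017); `alice` and `my-app` are placeholder partition and instance names following the URL pattern described in the next section.

```python
# Minimal sketch: use the OpenAI Python SDK against a local Reservoir instance.
# The partition ("alice") and instance ("my-app") names are placeholders.
from openai import OpenAI

client = OpenAI(
    # The SDK appends /chat/completions to this base URL, so requests hit
    # Reservoir's /v1/partition/{partition}/instance/{instance}/chat/completions route.
    base_url="http://localhost:3017/v1/partition/alice/instance/my-app",
    api_key="sk-...",  # forwarded by Reservoir to the upstream provider
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What did we decide about the schema yesterday?"}],
)
print(response.choices[0].message.content)
```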
2. HTTP Server (Hyper/Tokio)
The HTTP server built on Rust's async ecosystem:
- Receives requests on the configured port (default: 3017)
- Routes based on URL path following the pattern
/v1/partition/{partition}/instance/{instance}/chat/completions
- Handles CORS for web applications
- Manages concurrent requests efficiently using Tokio's async runtime
3. Request Handler
The core logic that processes each request:
Input Validation
- Token size checking: Validates that the last message doesn't exceed token limits
- Request format validation: Ensures the request follows OpenAI's API structure
- Authentication: Forwards API keys to the appropriate provider
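To make the token size check concrete, here is an illustrative sketch. Reservoir itself is written in Rust, so the tokenizer, limit, and names below are assumptions for illustration only.

```python
# Illustrative sketch of the last-message token check; the actual tokenizer
# and limit used by Reservoir are implementation details.
import tiktoken

def last_message_fits(messages: list[dict], limit: int = 8192) -> bool:
    """Return True if the most recent message stays within the token limit."""
    encoding = tiktoken.get_encoding("cl100k_base")
    return len(encoding.encode(messages[-1]["content"])) <= limit
```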
Context Management
- Trace ID assignment: Each request gets a unique identifier for tracking
- Partition/Instance extraction: Pulls organization parameters from the URL path
- Message storage: Stores incoming messages in Neo4j with proper tagging
Context Enrichment
- Historical context query: Searches Neo4j for relevant past conversations
- Similarity matching: Uses vector embeddings to find semantically similar messages
- Recency filtering: Includes recent messages from the same partition/instance
- Context injection: Adds relevant context to the request payload
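A hedged sketch of what context injection might look like, assuming retrieved messages arrive as two lists (semantically similar and recent). The function and field names are illustrative, not Reservoir's internal API.

```python
# Sketch of context injection: retrieved messages are spliced into the payload
# ahead of the caller's original conversation. Names are illustrative only.
def enrich(messages: list[dict], similar: list[dict], recent: list[dict]) -> list[dict]:
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    # Deduplicate while keeping order: semantically similar context first, then recent.
    context, seen = [], set()
    for m in similar + recent:
        key = (m["role"], m["content"])
        if key not in seen:
            seen.add(key)
            context.append({"role": m["role"], "content": m["content"]})
    # System prompt(s) stay first; context goes before the original conversation.
    return system + context + rest
```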
Token Management
- Total token calculation: Counts tokens in the enriched message list
- Smart truncation: Removes older context while preserving system prompts and latest messages
- Provider-specific limits: Respects different token limits for different models
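The truncation rule can be sketched as follows; the token counting and the exact drop order are simplified assumptions for illustration.

```python
# Sketch of smart truncation: drop the oldest non-system context first, never
# the system prompt or the most recent message.
def truncate(messages: list[dict], max_tokens: int, count) -> list[dict]:
    """`count` is a callable mapping a single message dict to its token count."""
    while sum(count(m) for m in messages) > max_tokens and len(messages) > 1:
        # Find the oldest message that is neither a system prompt nor the last message.
        for i, m in enumerate(messages[:-1]):
            if m["role"] != "system":
                del messages[i]
                break
        else:
            break  # nothing left that is safe to drop
    return messages
```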
Request Forwarding
- Provider routing: Automatically routes to the correct provider based on model name
- Request forwarding: Sends the enriched request to the upstream LLM
- Response handling: Processes and stores the LLM's response
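A simplified sketch of name-based routing; the prefixes and endpoint URLs below are illustrative assumptions rather than Reservoir's actual routing table.

```python
# Sketch of provider routing keyed on the requested model name.
# Prefixes and endpoints are illustrative assumptions.
PROVIDERS = {
    "gpt-": "https://api.openai.com/v1",
    "mistral": "https://api.mistral.ai/v1",
    "gemini": "https://generativelanguage.googleapis.com/v1beta/openai",
}

def route(model: str, default: str = "http://localhost:11434/v1") -> str:
    """Pick an upstream base URL from the model name, falling back to a local Ollama."""
    for prefix, base_url in PROVIDERS.items():
        if model.lower().startswith(prefix):
            return base_url
    return default
```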
Relationship Building
- Synapse connections: Links semantically similar messages using vector similarity
- Weak connection removal: Removes relationships with similarity scores below 0.85
- Conversation threading: Maintains coherent conversation threads over time
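Conceptually, a synapse link is kept only when the cosine similarity between two message embeddings meets the 0.85 threshold mentioned above. A minimal sketch:

```python
# Minimal sketch of synapse scoring: cosine similarity between embeddings,
# keeping only links at or above the 0.85 threshold.
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def should_link(emb_a: list[float], emb_b: list[float], threshold: float = 0.85) -> bool:
    return cosine(emb_a, emb_b) >= threshold
```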
4. Neo4j Database
The graph database that stores all conversation data:
Data Storage
- MessageNode entities: Each message is stored as a node with properties
- Partition/Instance tagging: Messages are tagged for proper organization
- Vector embeddings: Semantic representations for similarity search
- Temporal information: Timestamps for recency-based queries
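An illustrative shape for a stored message node; the actual property names in Reservoir's Neo4j schema may differ.

```python
# Illustrative node shape only; Reservoir's real property names may differ.
from dataclasses import dataclass, field

@dataclass
class MessageNode:
    role: str                 # "system", "user", or "assistant"
    content: str              # message text
    partition: str            # organizational scope taken from the URL path
    instance: str             # sub-scope within the partition
    trace_id: str             # identifier shared by a request/response pair
    timestamp: float          # used for recency-based queries
    embedding: list[float] = field(default_factory=list)  # vector for similarity search
```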
Graph Relationships
- Synapse relationships: Connect related messages across conversations
- Conversation threads: Maintain sequential flow of discussions
- Similarity scores: Weighted relationships based on semantic similarity
Query Capabilities
- Vector similarity search: Find semantically similar messages
- Temporal queries: Retrieve recent messages within time windows
- Graph traversal: Navigate conversation relationships
- Partition/Instance filtering: Scope queries to specific contexts
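For example, a partition-scoped similarity query could be expressed with Neo4j's vector index procedure. The index name, property names, and driver usage below are assumptions for illustration.

```python
# Sketch of a similarity query scoped to one partition/instance, using the
# Neo4j Python driver and the db.index.vector.queryNodes procedure (Neo4j 5.x).
# The index name and node properties are assumed for illustration.
from neo4j import GraphDatabase

QUERY = """
CALL db.index.vector.queryNodes('message_embeddings', $k, $embedding)
YIELD node, score
WHERE node.partition = $partition AND node.instance = $instance
RETURN node.role AS role, node.content AS content, score
ORDER BY score DESC
"""

def similar_messages(uri, auth, embedding, partition, instance, k=10):
    with GraphDatabase.driver(uri, auth=auth) as driver:
        records, _, _ = driver.execute_query(
            QUERY, k=k, embedding=embedding, partition=partition, instance=instance
        )
        return [(r["role"], r["content"], r["score"]) for r in records]
```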
5. External LLM Services
Reservoir supports multiple AI providers:
- OpenAI: GPT-4, GPT-4o, GPT-3.5-turbo, and specialized models
- Ollama: Local models like Llama, Gemma, and custom models
- Mistral AI: Mistral's cloud-hosted models
- Google Gemini: Google's AI models
- Custom providers: Any OpenAI-compatible API endpoint
6. Configuration Management
Environment-based configuration:
- Database connection: Neo4j URI, credentials, and connection pooling
- Server settings: Port, host, CORS configuration
- API keys: Credentials for various AI providers
- Provider endpoints: Custom URLs for different services
- Token limits: Configurable limits for different models
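A hypothetical illustration of reading such settings from the environment; the variable names are placeholders, not Reservoir's documented configuration keys.

```python
# Placeholder variable names for illustration only.
import os

config = {
    "neo4j_uri": os.environ.get("NEO4J_URI", "bolt://localhost:7687"),
    "neo4j_user": os.environ.get("NEO4J_USERNAME", "neo4j"),
    "neo4j_password": os.environ.get("NEO4J_PASSWORD", ""),
    "port": int(os.environ.get("RESERVOIR_PORT", "3017")),
    "openai_api_key": os.environ.get("OPENAI_API_KEY", ""),
}
```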
Request Processing Flow
1. Request Arrival: Client sends a request to Reservoir's endpoint
2. URL Parsing: Extract partition and instance from the URL path
3. Input Validation: Check message format and token limits
4. Message Storage: Store the user's message in Neo4j
5. Context Retrieval: Query for relevant historical context
6. Context Enrichment: Inject relevant messages into the request
7. Token Management: Ensure the enriched request fits within limits
8. Provider Routing: Determine which AI provider to use
9. Request Forwarding: Send the enriched request to the AI provider
10. Response Processing: Receive and process the AI's response
11. Response Storage: Store the AI's response in Neo4j
12. Relationship Building: Create or update message relationships
13. Response Return: Send the response back to the client
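The sketch below ties these steps together in one function, reusing the illustrative helpers from earlier sections. `store`, `retrieve_context`, `count_tokens`, `forward`, and `link_similar` are hypothetical stand-ins for the components described above, and error handling is simplified.

```python
# Hedged end-to-end sketch of the flow above. store(), retrieve_context(),
# count_tokens(), forward(), and link_similar() are hypothetical placeholders;
# last_message_fits, enrich, truncate, and route refer to the earlier sketches.
def handle_request(partition: str, instance: str, payload: dict) -> dict:
    messages = payload["messages"]
    if not last_message_fits(messages):                                    # step 3
        return {"error": "last message exceeds the token limit"}
    store(partition, instance, messages[-1])                               # step 4
    similar, recent = retrieve_context(partition, instance, messages[-1])  # step 5
    enriched = enrich(messages, similar, recent)                           # step 6
    enriched = truncate(enriched, max_tokens=8192, count=count_tokens)     # step 7
    base_url = route(payload["model"])                                     # step 8
    response = forward(base_url, {**payload, "messages": enriched})        # steps 9-10
    store(partition, instance, response["choices"][0]["message"])          # step 11
    link_similar(partition, instance, response)                            # step 12
    return response                                                        # step 13
```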
Scalability Considerations
Horizontal Scaling
- Stateless design: Each request is independent
- Database connection pooling: Efficient resource utilization
- Async processing: Non-blocking I/O for high concurrency
Vertical Scaling
- Memory management: Efficient vector storage and retrieval
- CPU optimization: Fast similarity calculations
- Disk I/O: Optimized database queries and indexing
Performance Optimizations
- Vector indexing: Fast similarity search in Neo4j
- Connection pooling: Reuse database connections
- Caching strategies: Cache frequently accessed data
- Batching: Efficient bulk operations where possible
Security Architecture
Authentication
- API key forwarding: Secure handling of provider credentials
- No key storage: Reservoir doesn't store AI provider keys
- Environment-based secrets: Secure configuration management
Data Privacy
- Local storage: All conversation data stays on your infrastructure
- No external logging: Conversation content never leaves your network
- Configurable retention: Control how long data is stored
Access Control
- Partition isolation: Conversations are isolated by partition/instance
- URL-based permissions: Access control through URL structure
- Network security: Configurable CORS and network policies
Monitoring and Observability
Logging
- Request tracing: Unique trace IDs for each request
- Error logging: Detailed error information for debugging
- Performance metrics: Request timing and processing statistics
Health Checks
- Database connectivity: Monitor Neo4j connection health
- Provider availability: Check AI service availability
- Resource utilization: Memory and CPU monitoring
This architecture provides a robust, scalable foundation for AI conversation management while maintaining transparency and compatibility with existing applications.