Ollama Integration

Reservoir works seamlessly with Ollama, allowing you to use local AI models with persistent memory and context enrichment. This is perfect for privacy-focused workflows where you want to keep all your conversations completely local.

What is Ollama?

Ollama is a tool that makes it easy to run large language models locally on your machine. It supports popular models like Llama, Gemma, and many others, all running entirely on your hardware.

Benefits of Using Ollama with Reservoir

  • Complete Privacy: All conversations stay on your device
  • No API Keys: No need for cloud service API keys
  • Offline Capable: Works without internet connection
  • Cost Effective: No usage-based charges
  • Full Control: Choose exactly which models to use

Setting Up Ollama

Step 1: Install Ollama

First, install Ollama from ollama.ai:

# On macOS
brew install ollama

# On Linux
curl -fsSL https://ollama.ai/install.sh | sh

# Or download from https://ollama.ai/download

Step 2: Start Ollama Service

ollama serve

This starts the Ollama service on http://localhost:11434.

Step 3: Download Models

Download the models you want to use:

# Download Gemma 3 (Google's model)
ollama pull gemma3

# Download Llama 3.2 (Meta's model)
ollama pull llama3.2

# Download Mistral (Mistral AI's model)
ollama pull mistral

# See all available models
ollama list
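
Before wiring Reservoir up to Ollama, you can confirm that the service is reachable and that the models you just pulled are installed. Here is a minimal Python sketch against Ollama's /api/tags endpoint (the same endpoint used in the troubleshooting section below):

import json
import urllib.request

# Ollama's tags endpoint lists every locally installed model
OLLAMA_TAGS_URL = "http://localhost:11434/api/tags"

try:
    with urllib.request.urlopen(OLLAMA_TAGS_URL, timeout=5) as response:
        tags = json.load(response)
    installed = [model["name"] for model in tags.get("models", [])]
    print("Ollama is running. Installed models:", ", ".join(installed) or "none")
except OSError as error:
    print("Could not reach Ollama on localhost:11434 -- is `ollama serve` running?")
    print(error)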

Using Ollama with Reservoir

Regular Mode

By default, Reservoir routes any unrecognized model names to Ollama:

curl "http://127.0.0.1:3017/partition/$USER/instance/ollama-chat/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d '{
        "model": "gemma3",
        "messages": [
            {
                "role": "user",
                "content": "Explain machine learning in simple terms."
            }
        ]
    }'

No API key required!

Ollama Mode

Reservoir also provides a special "Ollama mode" that makes it a drop-in replacement for Ollama's API:

# Start Reservoir in Ollama mode
cargo run -- start --ollama

In Ollama mode, Reservoir:

  • Uses the same API endpoints as Ollama
  • Provides the same response format
  • Adds memory and context enrichment automatically
  • Makes existing Ollama clients work with persistent memory

Testing Ollama Mode

# Test with Ollama's OpenAI-compatible endpoint format
curl "http://127.0.0.1:3017/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d '{
        "model": "gemma3",
        "messages": [
            {
                "role": "user",
                "content": "Hello, can you remember our previous conversations?"
            }
        ]
    }'
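
Because Ollama mode mirrors Ollama's API, existing Ollama clients can be pointed at Reservoir instead of Ollama itself. Here is a minimal sketch using the official ollama Python package (pip install ollama); it assumes Reservoir in Ollama mode also serves Ollama's native /api/chat endpoint on port 3017, so adjust the host if your setup differs:

from ollama import Client

# Point a standard Ollama client at Reservoir rather than at Ollama directly.
# The port (3017) and native /api/chat support are assumptions; adjust as needed.
client = Client(host="http://127.0.0.1:3017")

response = client.chat(
    model="gemma3",
    messages=[
        {"role": "user", "content": "Hello, can you remember our previous conversations?"}
    ],
)

print(response["message"]["content"])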

Recommended Models

Gemma 3 (Google)

Excellent for general conversation and coding:

curl "http://127.0.0.1:3017/partition/$USER/instance/coding/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d '{
        "model": "gemma3",
        "messages": [
            {
                "role": "user",
                "content": "Write a Python function to sort a list of dictionaries by a specific key."
            }
        ]
    }'

Llama 3.2 (Meta)

Great for reasoning and complex tasks:

curl "http://127.0.0.1:3017/partition/$USER/instance/reasoning/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d '{
        "model": "llama3.2",
        "messages": [
            {
                "role": "user",
                "content": "Solve this logic puzzle: If all roses are flowers, and some flowers are red, can we conclude that some roses are red?"
            }
        ]
    }'

Mistral 7B

Efficient and good for general tasks:

curl "http://127.0.0.1:3017/partition/$USER/instance/general/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d '{
        "model": "mistral",
        "messages": [
            {
                "role": "user",
                "content": "Summarize the key points of quantum computing for a beginner."
            }
        ]
    }'
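
Because all of these models sit behind the same Reservoir endpoint, you can send one prompt to each of them and compare the answers. A small sketch using the OpenAI Python library; the instance name model-comparison is just an illustrative choice:

import os
from openai import OpenAI

PARTITION = os.getenv("USER", "default")
BASE_URL = f"http://127.0.0.1:3017/partition/{PARTITION}/instance/model-comparison/v1"

client = OpenAI(base_url=BASE_URL, api_key="not-needed-for-ollama")

prompt = "Summarize the key points of quantum computing for a beginner."

# Send the same prompt to each local model and print the replies side by side
for model in ["gemma3", "llama3.2", "mistral"]:
    completion = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    print(f"--- {model} ---")
    print(completion.choices[0].message.content)

Keep in mind that all three exchanges share one instance, so Reservoir's context enrichment will also feed earlier answers into the later calls; use separate instances if you want isolated comparisons.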

Python Integration with Ollama

Using the OpenAI library with local Ollama models:

import os
from openai import OpenAI

# Setup for Ollama through Reservoir
INSTANCE = "ollama-python"
PARTITION = os.getenv("USER", "default")
RESERVOIR_PORT = os.getenv('RESERVOIR_PORT', '3017')
RESERVOIR_BASE_URL = f"http://localhost:{RESERVOIR_PORT}/v1/partition/{PARTITION}/instance/{INSTANCE}"

client = OpenAI(
    base_url=RESERVOIR_BASE_URL,
    api_key="not-needed-for-ollama"  # Ollama doesn't require API keys
)

# Chat with memory using local model
completion = client.chat.completions.create(
    model="gemma3",
    messages=[
        {
            "role": "user",
            "content": "My favorite hobby is gardening. What plants would you recommend for a beginner?"
        }
    ]
)

print(completion.choices[0].message.content)

# Ask a follow-up that requires memory
follow_up = client.chat.completions.create(
    model="gemma3",
    messages=[
        {
            "role": "user", 
            "content": "What tools do I need to get started with my hobby?"
        }
    ]
)

print(follow_up.choices[0].message.content)
# Will remember you're interested in gardening!

Environment Configuration

You can customize the Ollama endpoint if needed:

# Default Ollama endpoint
export RSV_OLLAMA_BASE_URL="http://localhost:11434/v1/chat/completions"

# Custom endpoint (if running Ollama on different port/host)
export RSV_OLLAMA_BASE_URL="http://192.168.1.100:11434/v1/chat/completions"

Performance Tips

Model Selection

  • gemma3: Good balance of speed and quality
  • llama3.2: Higher quality but slower
  • mistral: Fast and efficient
  • smaller models (7B parameters): Faster on limited hardware
  • larger models (13B+): Better quality but require more resources

Hardware Considerations

  • RAM: 8GB minimum, 16GB+ recommended for larger models
  • GPU: Optional but significantly speeds up inference
  • Storage: Models range from 4GB to 40GB+ each

Optimizing Performance

# Ollama uses GPU acceleration automatically when a supported GPU is available

# Check which models are loaded and whether they run on GPU or CPU
ollama ps

Troubleshooting Ollama

Common Issues

Ollama Not Found

# Check if Ollama is running
curl http://localhost:11434/api/tags

# If not running, start it
ollama serve

Model Not Available

# List installed models
ollama list

# Pull missing model
ollama pull gemma3

Performance Issues

# Check system resources
ollama ps

# Try a smaller model
ollama pull gemma3:1b  # 1B parameter version

Error Messages

  • "connection refused": Ollama service isn't running
  • "model not found": Model needs to be pulled with ollama pull
  • "out of memory": Try a smaller model or close other applications

Combining Local and Cloud Models

One of Reservoir's strengths is seamlessly switching between local and cloud models:

import os
from openai import OpenAI

# Same client setup as in the Python integration example above
INSTANCE = "ollama-python"
PARTITION = os.getenv("USER", "default")
RESERVOIR_PORT = os.getenv("RESERVOIR_PORT", "3017")
RESERVOIR_BASE_URL = f"http://localhost:{RESERVOIR_PORT}/partition/{PARTITION}/instance/{INSTANCE}/v1"

client = OpenAI(base_url=RESERVOIR_BASE_URL, api_key=os.environ.get("OPENAI_API_KEY", ""))

# Start with local model for initial draft
local_response = client.chat.completions.create(
    model="gemma3",  # Local Ollama model
    messages=[{"role": "user", "content": "Write a draft email about project updates"}]
)

# Refine with cloud model for better quality
cloud_response = client.chat.completions.create(
    model="gpt-4",  # Cloud OpenAI model
    messages=[{"role": "user", "content": "Please improve the writing quality and make it more professional"}]
)

Both responses will have access to the same conversation context!

Next Steps

Ready to go private? 🔒 With Ollama and Reservoir, you have a completely local AI assistant with persistent memory!