TokenPak + LiteLLM Adapter

LiteLLM is a multi-provider LLM router that abstracts away provider differences under a unified OpenAI-compatible API. By routing LiteLLM through TokenPak, you gain automatic compression, caching, and token accounting — all while maintaining LiteLLM's multi-provider fallback and cost-optimization features.

Why Use LiteLLM + TokenPak?

Feature	LiteLLM	TokenPak	Together
Multi-provider routing	✅ Fallback, cost optimization	—	✅ Add compression + caching
OpenAI compatibility	✅ Unified API	✅ `/v1/chat/completions`	✅ Seamless integration
Token compression	—	✅ Reduce input/output tokens	✅ Lower costs further
Request caching	—	✅ Cache identical prompts	✅ Deduplicate across clients
Token accounting	Limited	✅ Detailed stats/usage	✅ Unified usage tracking

Use Cases

Multi-provider fallback with TokenPak compression: Use LiteLLM's fallback to Claude → Gemini → GPT, with TokenPak deduplicating requests across all routes
Cost optimization across providers: LiteLLM optimizes provider selection, TokenPak optimizes tokens — compound savings
Controlled multi-client access: Route multiple services through TokenPak proxy + LiteLLM for unified auth and cost tracking

Setup

1. Install TokenPak and LiteLLM

pip install litellm

# TokenPak runs as a standalone proxy service
# See: https://github.com/...tokenpak#getting-started

2. Start TokenPak Proxy

# TokenPak default: http://localhost:8766/v1
python -m tokenpak.proxy --port 8766

Verify proxy is running:

curl http://localhost:8766/health
# { "status": "ok" }
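For scripted setups, the same check can be run from Python before any traffic is sent. This is a minimal sketch using only the standard library. It assumes the `/health` endpoint returns the JSON body shown above; `is_healthy` and `wait_for_tokenpak` are illustrative names, not part of TokenPak.

```python
import json
import time
import urllib.error
import urllib.request


def is_healthy(body: str) -> bool:
    """Return True if a /health response body reports status "ok"."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and payload.get("status") == "ok"


def wait_for_tokenpak(base_url: str = "http://localhost:8766",
                      attempts: int = 5, delay: float = 1.0) -> bool:
    """Poll <base_url>/health until it reports ok or the attempts run out."""
    for attempt in range(attempts):
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=2) as resp:
                if is_healthy(resp.read().decode("utf-8")):
                    return True
        except (urllib.error.URLError, OSError):
            pass  # proxy not reachable yet; retry after a short pause
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

Calling `wait_for_tokenpak()` at application startup gives a clear failure early, rather than a connection error on the first completion request.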

Usage Patterns

Pattern 1: Direct `litellm.completion()`

Route a single completion request through TokenPak:

import os

import litellm

# Set your Anthropic API key (TokenPak passes it upstream)
os.environ["ANTHROPIC_API_KEY"] = "sk-ant-..."

response = litellm.completion(
    model="openai/claude-sonnet-4-6",  # "openai/" prefix = OpenAI-compatible endpoint
    api_base="http://localhost:8766/v1",  # Point to TokenPak proxy
    # The "openai/" route reads OPENAI_API_KEY, not ANTHROPIC_API_KEY,
    # so pass the key explicitly
    api_key=os.environ["ANTHROPIC_API_KEY"],
    messages=[
        {"role": "user", "content": "What is quantum computing?"}
    ],
    temperature=0.7,
)

print(response.choices[0].message.content)

Key points:

- Use the openai/<model-name> format so that LiteLLM routes to the OpenAI-compatible endpoint
- api_base points to the TokenPak proxy (default: localhost:8766/v1)
- TokenPak handles compression, caching, and token accounting
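When several call sites share these settings, the key points can be wrapped in a small helper. The sketch below is one way to do it; `tokenpak_kwargs` and `TOKENPAK_URL` are hypothetical names, and falling back to `ANTHROPIC_API_KEY` is an assumption that only suits Claude models.

```python
import os
from typing import Optional

TOKENPAK_URL = "http://localhost:8766/v1"  # default proxy address used in this guide


def tokenpak_kwargs(model: str, api_key: Optional[str] = None,
                    api_base: str = TOKENPAK_URL) -> dict:
    """Build keyword arguments for litellm.completion() that route via TokenPak."""
    # Add the "openai/" prefix once so LiteLLM uses its OpenAI-compatible route.
    if not model.startswith("openai/"):
        model = f"openai/{model}"
    kwargs = {"model": model, "api_base": api_base}
    # Assumption: fall back to ANTHROPIC_API_KEY when no key is given.
    key = api_key or os.environ.get("ANTHROPIC_API_KEY")
    if key:
        kwargs["api_key"] = key
    return kwargs
```

A call then looks like `litellm.completion(messages=[...], **tokenpak_kwargs("claude-sonnet-4-6"))`.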

Pattern 2: LiteLLM Proxy Mode with TokenPak Backend

Run LiteLLM's own proxy to manage multiple clients, all routing through TokenPak:

# litellm_config.yaml
model_list:
  - model_name: "claude-sonnet"
    litellm_params:
      model: "openai/claude-sonnet-4-6"
      api_base: "http://localhost:8766/v1"
      api_key: "sk-ant-..."  # Or reference an env var: os.environ/ANTHROPIC_API_KEY

  - model_name: "claude-opus"
    litellm_params:
      model: "openai/claude-opus-4-6"
      api_base: "http://localhost:8766/v1"
      api_key: "sk-ant-..."

router_settings:
  fallback_ratio: 0.1  # Fallback after 10% failure rate

Start LiteLLM proxy:

litellm --config litellm_config.yaml --port 8000

Use from your application:

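import litellm

response = litellm.completion(
    model="openai/claude-sonnet",  # "openai/" prefix + model_name from litellm_config.yaml
    api_base="http://localhost:8000",  # LiteLLM proxy
    api_key="sk-1234",  # Placeholder; use your LiteLLM proxy key if one is configured
    messages=[{"role": "user", "content": "Hello!"}],
)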

Flow:

Your App → LiteLLM Proxy (8000) → TokenPak Proxy (8766) → Anthropic API

Pattern 3: Multi-Provider Fallback with TokenPak Caching

Combine LiteLLM's fallback logic with TokenPak's caching for resilient + efficient routing:

# litellm_config.yaml
model_list:
  - model_name: "smart-router"
    litellm_params:
      model: "openai/claude-opus-4-6"
      api_base: "http://localhost:8766/v1"  # Primary: TokenPak → Claude
      api_key: "sk-ant-..."

  - model_name: "smart-router"
    litellm_params:
      model: "openai/gpt-4-turbo"
      api_base: "http://localhost:8766/v1"  # Fallback: TokenPak → GPT-4
      api_key: "sk-openai-..."

router_settings:
  fallback_ratio: 0.2  # Fallback after 20% failure rate
  allowed_fails: 1  # Allow 1 request to fail before fallback kicks in

How it works:

1. LiteLLM sends requests to Claude (primary) while its failure rate stays below fallback_ratio (20% here)
2. On failures, it automatically routes to GPT-4
3. TokenPak caches both paths, so a prompt repeated for the same model is served from cache on either route
4. Token usage is tracked in one place across all routes
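The fallback behaviour in steps 1 and 2 can be pictured as trying an ordered list of routes until one succeeds. The sketch below illustrates that idea only. It is not LiteLLM's router implementation, and `call_with_fallback` and the stub route functions are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Route = Tuple[str, Callable[[str], str]]  # (route name, function that sends a prompt)


def call_with_fallback(routes: Sequence[Route], prompt: str) -> Tuple[str, str]:
    """Try each route in order; return (route name, reply) from the first success."""
    errors: List[str] = []
    for name, send in routes:
        try:
            return name, send(prompt)
        except Exception as exc:  # a real router would catch narrower error types
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all routes failed: " + "; ".join(errors))


# Stub routes standing in for "TokenPak -> Claude" and "TokenPak -> GPT-4".
def claude_down(prompt: str) -> str:
    raise ConnectionError("primary unavailable")


def gpt_ok(prompt: str) -> str:
    return f"echo: {prompt}"


used, reply = call_with_fallback([("claude", claude_down), ("gpt-4", gpt_ok)], "Hello!")
print(used, reply)  # gpt-4 echo: Hello!
```

Because both routes point at the same TokenPak proxy, the proxy sees the retried request as well as the original one.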

Model Name Convention

TokenPak uses the openai/<model-name> pattern for all providers. This tells LiteLLM to treat the endpoint as OpenAI-compatible.

Provider	Model Name
Anthropic	`openai/claude-opus-4-6`, `openai/claude-sonnet-4-6`
OpenAI	`openai/gpt-4-turbo`, `openai/gpt-4o`
Google	`openai/gemini-2.0-flash`

See TokenPak Model Support for the full list.
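A quick way to catch naming mistakes is to check the parsed config before starting the LiteLLM proxy. This is a minimal sketch, assuming the YAML has already been loaded into a dict with the `model_list` layout used in the patterns above. `check_model_list` is an illustrative helper, not a LiteLLM or TokenPak function.

```python
from typing import List


def check_model_list(config: dict,
                     tokenpak_base: str = "http://localhost:8766/v1") -> List[str]:
    """Return a list of problems found in a LiteLLM-style model_list."""
    problems = []
    for entry in config.get("model_list", []):
        name = entry.get("model_name", "<unnamed>")
        params = entry.get("litellm_params", {})
        model = params.get("model", "")
        if not model.startswith("openai/"):
            problems.append(f"{name}: model {model!r} lacks the 'openai/' prefix")
        if params.get("api_base") != tokenpak_base:
            problems.append(
                f"{name}: api_base {params.get('api_base')!r} is not {tokenpak_base!r}"
            )
    return problems


# Example config: the second entry is missing the prefix on purpose.
config = {
    "model_list": [
        {"model_name": "claude-sonnet",
         "litellm_params": {"model": "openai/claude-sonnet-4-6",
                            "api_base": "http://localhost:8766/v1"}},
        {"model_name": "gemini",
         "litellm_params": {"model": "gemini-2.0-flash",
                            "api_base": "http://localhost:8766/v1"}},
    ]
}
for problem in check_model_list(config):
    print(problem)
```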

Verify Routing

Check that requests are flowing through TokenPak correctly:

# Check TokenPak stats endpoint
curl http://localhost:8766/stats | jq .

# Example output ("token_usage.cached" = tokens saved by caching):
{
  "cached": 12,
  "compressed_requests": 45,
  "token_usage": {
    "input": 2840,
    "output": 1230,
    "cached": 340
  }
}

If your request doesn't appear in stats, check:

1. TokenPak is running (curl http://localhost:8766/health)
2. api_base in the LiteLLM config matches the TokenPak port (default 8766)
3. Firewall/network rules allow connections to localhost:8766
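To confirm that a specific request registered, one option is to snapshot `/stats` before and after the call and compare the two. This is a minimal sketch, assuming the response has the shape of the example output above. `fetch_stats` and `stats_delta` are illustrative names, and the two snapshots are made-up values.

```python
import json
import urllib.request


def fetch_stats(base_url: str = "http://localhost:8766") -> dict:
    """Fetch the TokenPak /stats payload as a dict."""
    with urllib.request.urlopen(f"{base_url}/stats", timeout=5) as resp:
        return json.loads(resp.read().decode("utf-8"))


def stats_delta(before: dict, after: dict) -> dict:
    """Return the numeric counters that changed between two snapshots."""
    delta = {}
    for key, new in after.items():
        old = before.get(key)
        if isinstance(new, dict):
            nested = stats_delta(old if isinstance(old, dict) else {}, new)
            if nested:
                delta[key] = nested
        elif isinstance(new, (int, float)) and new != (old or 0):
            delta[key] = new - (old or 0)
    return delta


# Hypothetical snapshots taken around one repeated request.
before = {"cached": 12, "compressed_requests": 45,
          "token_usage": {"input": 2840, "output": 1230, "cached": 340}}
after = {"cached": 13, "compressed_requests": 45,
         "token_usage": {"input": 2840, "output": 1230, "cached": 352}}
print(stats_delta(before, after))  # {'cached': 1, 'token_usage': {'cached': 12}}
```

An empty result after a request suggests that the call never reached TokenPak, which points back to the checklist above.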

Limitations & Known Gaps

LiteLLM Specific

Limitation	Impact	Workaround
Streaming	LiteLLM streaming via TokenPak works, but caching doesn't apply to streamed responses	Use non-streaming requests when you need cache hits
Custom headers	LiteLLM passes headers through; TokenPak ignores non-standard headers	Use `api_key` + `api_base` only
Async routing	LiteLLM's async API works with TokenPak, but fallback logic runs serially	Acceptable for most use cases

TokenPak Specific

Limitation	Impact	Workaround
Cache key	Caching uses model + prompt hash (ignores temperature, top_p)	Ideal for deterministic requests; less useful for creative tasks
Rate limits	TokenPak respects upstream rate limits; LiteLLM's retry logic stacks on top	Configure reasonable `max_retries` in LiteLLM config
Request size	Max request size is 10 MB	Unlikely to hit in practice
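The cache-key behaviour in the table can be illustrated with a small sketch. It assumes the key is a SHA-256 digest over the model name and the serialized messages. TokenPak's actual hashing scheme is not described here, so treat `cache_key` as hypothetical.

```python
import hashlib
import json
from typing import Dict, List


def cache_key(model: str, messages: List[Dict[str, str]], **sampling_params) -> str:
    """Hash model + messages; sampling parameters are accepted but ignored."""
    canonical = json.dumps({"model": model, "messages": messages},
                           sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


msg = [{"role": "user", "content": "What is 2+2?"}]
k1 = cache_key("claude-sonnet-4-6", msg, temperature=0.7)
k2 = cache_key("claude-sonnet-4-6", msg, temperature=0.5)
k3 = cache_key("claude-opus-4-6", msg, temperature=0.7)

print(k1 == k2)  # True: temperature does not affect the key
print(k1 == k3)  # False: a different model produces a different key
```

Under this scheme a request repeated at a different temperature returns the cached reply, which is why the cache suits deterministic requests better than creative ones.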

Troubleshooting

"Connection refused" on `api_base`

Error:

litellm.exceptions.APIConnectionError: Failed to connect to http://localhost:8766/v1

Fix:

1. Verify TokenPak is running: curl http://localhost:8766/health
2. Check that the port is correct (default 8766)
3. If TokenPak runs on a remote host, use its actual IP instead of localhost (e.g., http://192.168.1.100:8766/v1)

"Invalid API key" errors

Error:

litellm.exceptions.AuthenticationError: Invalid API key

Fix:

1. Verify ANTHROPIC_API_KEY is set correctly
2. TokenPak passes your API key upstream, so check that it matches your provider (Anthropic, OpenAI, etc.)
3. Test directly against TokenPak: curl -H "Authorization: Bearer sk-ant-..." http://localhost:8766/v1/models

Requests not cached

Symptom: Stats show cached: 0 even after repeated identical requests

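Cause: Likely a mismatch in the model name or message content, or the requests are streamed (streamed responses are not cached)

Fix:

1. Cache key = model + prompt hash. Other parameters (temperature, top_p, etc.) are ignored.
2. Check that your requests are truly identical:

import litellm

proxy = "http://localhost:8766/v1"  # TokenPak proxy
# Assumes the provider key is supplied (api_key=... or OPENAI_API_KEY)

# These will cache (same model, same messages):
msg = [{"role": "user", "content": "What is 2+2?"}]
r1 = litellm.completion(model="openai/claude-sonnet-4-6", messages=msg, api_base=proxy)
r2 = litellm.completion(model="openai/claude-sonnet-4-6", messages=msg, api_base=proxy)

# These will also cache (temperature is not part of the cache key):
r3 = litellm.completion(model="openai/claude-sonnet-4-6", messages=msg, temperature=0.7, api_base=proxy)
r4 = litellm.completion(model="openai/claude-sonnet-4-6", messages=msg, temperature=0.5, api_base=proxy)

# This won't hit the cache (different prompt text):
other = [{"role": "user", "content": "What is 3+3?"}]
r5 = litellm.completion(model="openai/claude-sonnet-4-6", messages=other, api_base=proxy)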

Next Steps

OpenAI SDK Adapter — Use TokenPak with openai library
LangChain Adapter — Integrate with LangChain
TokenPak Configuration — Advanced proxy settings
Model Support — Full provider & model reference
