TokenPak + LiteLLM Adapter¶
LiteLLM is a multi-provider LLM router that abstracts away provider differences under a unified OpenAI-compatible API. By routing LiteLLM through TokenPak, you gain automatic compression, caching, and token accounting — all while maintaining LiteLLM's multi-provider fallback and cost-optimization features.
Why Use LiteLLM + TokenPak?¶
| Feature | LiteLLM | TokenPak | Together |
|---|---|---|---|
| Multi-provider routing | ✅ Fallback, cost optimization | — | ✅ Add compression + caching |
| OpenAI compatibility | ✅ Unified API | ✅ /v1/chat/completions |
✅ Seamless integration |
| Token compression | — | ✅ Reduce input/output tokens | ✅ Lower costs further |
| Request caching | — | ✅ Cache identical prompts | ✅ Deduplicate across clients |
| Token accounting | Limited | ✅ Detailed stats/usage | ✅ Unified usage tracking |
Use Cases¶
- Multi-provider fallback with TokenPak compression: Use LiteLLM's fallback to Claude → Gemini → GPT, with TokenPak deduplicating requests across all routes
- Cost optimization across providers: LiteLLM optimizes provider selection, TokenPak optimizes tokens — compound savings
- Controlled multi-client access: Route multiple services through TokenPak proxy + LiteLLM for unified auth and cost tracking
Setup¶
1. Install TokenPak and LiteLLM¶
pip install litellm
# TokenPak runs as a standalone proxy service
# See: https://github.com/...tokenpak#getting-started
2. Start TokenPak Proxy¶
# TokenPak default: http://localhost:8766/v1
python -m tokenpak.proxy --port 8766
Verify proxy is running:
curl http://localhost:8766/health
# { "status": "ok" }
Usage Patterns¶
Pattern 1: Direct litellm.completion()¶
Route a single completion request through TokenPak:
import litellm
# Set your Anthropic API key
import os
os.environ["ANTHROPIC_API_KEY"] = "sk-ant-..."
response = litellm.completion(
model="openai/claude-sonnet-4-6", # "openai/" prefix = OpenAI-compatible endpoint
api_base="http://localhost:8766/v1", # Point to TokenPak proxy
messages=[
{"role": "user", "content": "What is quantum computing?"}
],
temperature=0.7
)
print(response.choices[0].message.content)
Key points:
- Use openai/<model-name> format: LiteLLM routes to the OpenAI-compatible endpoint
- api_base points to TokenPak proxy (default: localhost:8766/v1)
- TokenPak handles compression, caching, and token accounting
Pattern 2: LiteLLM Proxy Mode with TokenPak Backend¶
Run LiteLLM's own proxy to manage multiple clients, all routing through TokenPak:
# litellm_config.yaml
model_list:
- model_name: "claude-sonnet"
litellm_params:
model: "openai/claude-sonnet-4-6"
api_base: "http://localhost:8766/v1"
api_key: "sk-ant-..." # Or set via env var: ANTHROPIC_API_KEY
- model_name: "claude-opus"
litellm_params:
model: "openai/claude-opus-4-6"
api_base: "http://localhost:8766/v1"
api_key: "sk-ant-..."
router_settings:
fallback_ratio: 0.1 # Fallback after 10% failure rate
Start LiteLLM proxy:
litellm --config litellm_config.yaml --port 8000
Use from your application:
import litellm
response = litellm.completion(
model="claude-sonnet", # Routes to LiteLLM proxy
api_base="http://localhost:8000",
messages=[{"role": "user", "content": "Hello!"}]
)
Flow:
Your App → LiteLLM Proxy (8000) → TokenPak Proxy (8766) → Anthropic API
Pattern 3: Multi-Provider Fallback with TokenPak Caching¶
Combine LiteLLM's fallback logic with TokenPak's caching for resilient + efficient routing:
# litellm_config.yaml
model_list:
- model_name: "smart-router"
litellm_params:
model: "openai/claude-opus-4-6"
api_base: "http://localhost:8766/v1" # Primary: TokenPak → Claude
api_key: "sk-ant-..."
- model_name: "smart-router"
litellm_params:
model: "openai/gpt-4-turbo"
api_base: "http://localhost:8766/v1" # Fallback: TokenPak → GPT-4
api_key: "sk-openai-..."
router_settings:
fallback_ratio: 0.2 # Fallback after 20% failure rate
allowed_fails: 1 # Allow 1 request to fail before fallback kicks in
How it works: 1. LiteLLM routes 80% of requests to Claude (primary) 2. On failures, automatically routes to GPT-4 3. TokenPak caches both paths — identical prompts are deduplicated across providers 4. Unified token usage tracking across all routes
Model Name Convention¶
TokenPak uses the openai/<model-name> pattern for all providers. This tells LiteLLM to treat the endpoint as OpenAI-compatible.
| Provider | Model Name |
|---|---|
| Anthropic | openai/claude-opus-4-6, openai/claude-sonnet-4-6 |
| OpenAI | openai/gpt-4-turbo, openai/gpt-4o |
openai/gemini-2.0-flash |
See TokenPak Model Support for the full list.
Verify Routing¶
Check that requests are flowing through TokenPak correctly:
# Check TokenPak stats endpoint
curl http://localhost:8766/stats | jq .
# Example output:
{
"cached": 12,
"compressed_requests": 45,
"token_usage": {
"input": 2840,
"output": 1230,
"cached": 340 # Tokens saved by caching
}
}
If your request doesn't appear in stats, check:
1. TokenPak is running (curl http://localhost:8766/health)
2. api_base in LiteLLM config matches TokenPak port (default 8766)
3. Firewall/network allows localhost:8766 connection
Limitations & Known Gaps¶
LiteLLM Specific¶
| Limitation | Impact | Workaround |
|---|---|---|
| Streaming | LiteLLM streaming via TokenPak works, but caching doesn't apply to streamed responses | Cache only unstreamed requests |
| Custom headers | LiteLLM passes headers through; TokenPak ignores non-standard headers | Use api_key + api_base only |
| Async routing | LiteLLM's async API works with TokenPak, but fallback logic runs serially | Acceptable for most use cases |
TokenPak Specific¶
| Limitation | Impact | Workaround |
|---|---|---|
| Cache key | Caching uses model + prompt hash (ignores temperature, top_p) | Ideal for deterministic requests; less useful for creative tasks |
| Rate limits | TokenPak respects upstream rate limits; LiteLLM's retry logic stacks on top | Configure reasonable max_retries in LiteLLM config |
| Request size | Max request size is 10 MB | Unlikely to hit in practice |
Troubleshooting¶
"Connection refused" on api_base¶
Error:
litellm.exceptions.APIConnectionError: Failed to connect to http://localhost:8766/v1
Fix:
1. Verify TokenPak is running: curl http://localhost:8766/health
2. Check port is correct (default 8766)
3. If remote: use actual IP instead of localhost (e.g., http://192.168.1.100:8766/v1)
"Invalid API key" errors¶
Error:
litellm.exceptions.AuthenticationError: Invalid API key
Fix:
1. Verify ANTHROPIC_API_KEY is set correctly
2. TokenPak passes your API key upstream — check it matches your provider (Anthropic, OpenAI, etc.)
3. Test directly against TokenPak: curl -H "Authorization: Bearer sk-ant-..." http://localhost:8766/v1/models
Requests not cached¶
Symptom: Stats show cached: 0 even after repeated identical requests
Cause: Likely mismatch in request parameters (temperature, top_p, etc.)
Fix: 1. Cache key = model + prompt hash. Other parameters are ignored. 2. Check your requests are truly identical:
# These will cache:
msg = [{"role": "user", "content": "What is 2+2?"}]
r1 = litellm.completion(model="openai/claude-sonnet-4-6", messages=msg, api_base=proxy)
r2 = litellm.completion(model="openai/claude-sonnet-4-6", messages=msg, api_base=proxy)
# These won't cache (different parameters):
r1 = litellm.completion(model="openai/claude-sonnet-4-6", messages=msg, temperature=0.7, api_base=proxy)
r2 = litellm.completion(model="openai/claude-sonnet-4-6", messages=msg, temperature=0.5, api_base=proxy)
Next Steps¶
- OpenAI SDK Adapter — Use TokenPak with
openailibrary - LangChain Adapter — Integrate with LangChain
- TokenPak Configuration — Advanced proxy settings
- Model Support — Full provider & model reference