TokenPak Architecture¶
TokenPak is a transparent, feature-rich proxy that sits between your LLM client application and multiple LLM providers (Anthropic, OpenAI, etc.). It handles routing, caching, token counting, cost tracking, rate limiting, and security—without requiring you to change a single line in your application code.
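In practice, "transparent" means your client keeps using its normal SDK and simply points at the proxy. A minimal sketch, assuming TokenPak is listening on localhost:8766 (the default bind shown under Configuration) and exposing the native /v1/messages endpoint used in the request flow below; the model id and API key are placeholders:

import anthropic

# Point the client at TokenPak instead of the provider; TokenPak forwards the
# request upstream and handles caching, token counting, and rate limiting.
client = anthropic.Anthropic(
    base_url="http://localhost:8766",    # TOKENPAK_BIND from the Configuration section
    api_key="placeholder-key",           # whatever key your TokenPak deployment expects
)

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",  # any model your routing rules support
    max_tokens=256,
    messages=[{"role": "user", "content": "Hello from behind TokenPak"}],
)
print(response.content[0].text)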
High-Level Overview¶
graph LR
A["Your Application"]
B["TokenPak Proxy"]
C["Request Router"]
D["Validation Gate"]
E["Token Counter"]
F["Cache Manager"]
G["Rate Limiter"]
H["Provider Router"]
I["Anthropic API"]
J["OpenAI API"]
K["LLM Providers"]
L["Monitoring & Stats"]
A -->|HTTP/HTTPS| B
B --> C
C --> D
D --> E
E --> F
F --> G
G --> H
H -->|Route Request| I
H -->|Route Request| J
H -->|Route Request| K
E -->|Stats| L
F -->|Cache Hit/Miss| L
B -.->|Response| A
I -.->|Response| H
J -.->|Response| H
K -.->|Response| H
Core Components¶
1. Request Router¶
The entry point that receives all API requests from your application. It normalizes incoming requests (supporting both OpenAI-compatible and native formats), extracts metadata, and passes them through the pipeline.
Responsibility: Parse and validate incoming requests, extract user intent and model name, prepare request body for downstream processing.
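As a rough illustration, normalization can be as simple as folding an OpenAI-style system message into a native-style top-level system field. This is a hypothetical sketch, not TokenPak's actual implementation:

def normalize_request(body: dict) -> dict:
    """Normalize an OpenAI-style chat body into a native-style body.

    OpenAI puts the system prompt inside the messages list; the native
    format carries it as a top-level "system" field. Everything else
    is passed through unchanged.
    """
    messages = body.get("messages", [])
    system_parts = [m["content"] for m in messages if m.get("role") == "system"]
    chat_messages = [m for m in messages if m.get("role") != "system"]
    normalized = dict(body)
    normalized["messages"] = chat_messages
    if system_parts:
        normalized["system"] = "\n".join(system_parts)
    return normalized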
2. Validation Gate¶
An optional safety layer that inspects message content against configured policies before the request continues through the pipeline. It can detect and block suspicious patterns, enforce compliance rules, or rate-limit based on content risk.
Responsibility: Content security scanning, policy enforcement, risk classification of requests and responses.
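A minimal sketch of what a policy check at this stage can look like; the patterns below are hypothetical examples, and a real deployment would load its rules from configuration:

import re

# Hypothetical policy patterns; real rules come from the proxy's configuration.
BLOCKED_PATTERNS = [
    re.compile(r"(?i)ignore (all )?previous instructions"),  # prompt-injection style
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),                    # looks like a US SSN
]

def check_content(text: str) -> tuple[bool, str | None]:
    """Return (allowed, reason). Block the request if any pattern matches."""
    for pattern in BLOCKED_PATTERNS:
        if pattern.search(text):
            return False, f"blocked by policy pattern: {pattern.pattern}"
    return True, None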
3. Token Counter¶
Counts input and output tokens accurately using provider-specific tokenizers. Works transparently for streaming and non-streaming responses, supports prompt caching token accounting, and feeds real usage data to the cost tracker.
Responsibility: Accurate token counting per provider, cache-aware token calculation, real-time stats collection.
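For OpenAI-family models, counting can be done locally with the tiktoken tokenizer, as in the sketch below; Anthropic models need either their own tokenizer or the usage block the provider returns with each response. The helper name is illustrative:

import tiktoken

def count_openai_tokens(text: str, model: str = "gpt-4o") -> int:
    """Count tokens with the provider-specific tokenizer (OpenAI models only here)."""
    try:
        enc = tiktoken.encoding_for_model(model)
    except KeyError:
        enc = tiktoken.get_encoding("cl100k_base")  # reasonable fallback encoding
    return len(enc.encode(text))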
4. Cache Manager¶
Implements a multi-layer caching strategy: semantic deduplication (recognizes similar prompts), prompt caching integration (leverages provider caching when available), and configurable TTL-based cache eviction.
Responsibility: Cache storage and retrieval, cache hit rate optimization, prompt cache header management, token savings calculation.
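The exact-match layer can be as simple as hashing a canonical form of the request and storing the response with a timestamp. A minimal sketch (semantic matching and prompt-cache integration are not shown); the default TTL mirrors CACHE_TTL_SECONDS from the Configuration section:

import hashlib
import json
import time

class TTLCache:
    """Minimal exact-match response cache with TTL-based eviction."""

    def __init__(self, ttl_seconds: float = 3600):
        self.ttl = ttl_seconds
        self._store: dict[str, tuple[float, dict]] = {}

    @staticmethod
    def key(request_body: dict) -> str:
        # Hash a canonical form of the request so identical requests collide.
        canonical = json.dumps(request_body, sort_keys=True)
        return hashlib.sha256(canonical.encode()).hexdigest()

    def get(self, request_body: dict) -> dict | None:
        k = self.key(request_body)
        entry = self._store.get(k)
        if entry is None:
            return None
        stored_at, response = entry
        if time.time() - stored_at > self.ttl:
            del self._store[k]  # expired: evict and treat as a miss
            return None
        return response

    def put(self, request_body: dict, response: dict) -> None:
        self._store[self.key(request_body)] = (time.time(), response)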
5. Rate Limiter¶
Enforces per-IP rate limiting, per-model rate limits, and cost-per-minute budgets. Prevents runaway spending and protects against abuse.
Responsibility: Rate limit enforcement, cost-based throttling, backpressure handling.
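The per-IP limit can be implemented as a sliding window, as in this sketch; the default of 100 requests per minute matches RATE_LIMIT_PER_IP in the Configuration section. Cost-based throttling tracks dollars instead of request counts but follows the same shape:

import time
from collections import defaultdict, deque

class PerIPRateLimiter:
    """Sliding-window limiter: at most `limit` requests per IP per window."""

    def __init__(self, limit: int = 100, window_seconds: float = 60.0):
        self.limit = limit
        self.window = window_seconds
        self._hits: dict[str, deque] = defaultdict(deque)

    def allow(self, ip: str) -> bool:
        now = time.time()
        hits = self._hits[ip]
        while hits and now - hits[0] > self.window:
            hits.popleft()    # drop requests that fell out of the window
        if len(hits) >= self.limit:
            return False      # over quota; caller should return HTTP 429
        hits.append(now)
        return True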
6. Provider Router¶
Decides which LLM provider to use based on request metadata, fallback rules, and provider health. Supports weighted routing, circuit breakers that detect unavailable providers, and failover logic.
Responsibility: Provider selection, failover logic, circuit breaker management, health checking.
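Weighted routing over healthy providers can be sketched as below; the provider table is hypothetical, and in TokenPak the weights come from routing rules while the healthy flags come from the circuit breaker:

import random

# Hypothetical routing table; weights and health flags would normally come
# from configuration and from the circuit breaker, respectively.
PROVIDERS = [
    {"name": "anthropic", "weight": 0.7, "healthy": True},
    {"name": "openai",    "weight": 0.3, "healthy": True},
]

def select_provider() -> str:
    """Weighted random selection over healthy providers, with simple failover."""
    healthy = [p for p in PROVIDERS if p["healthy"]]
    if not healthy:
        raise RuntimeError("no healthy providers available")
    total = sum(p["weight"] for p in healthy)
    pick = random.uniform(0, total)
    for p in healthy:
        pick -= p["weight"]
        if pick <= 0:
            return p["name"]
    return healthy[-1]["name"]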
7. Monitoring & Observability¶
Collects real-time stats on token usage, cost, cache hit rates, latency, and provider health, and exports metrics to dashboards and analytics tools.
Responsibility: Metrics collection, stats aggregation, performance monitoring, usage reporting.
Request Flow¶
Here's what happens when your application sends a request through TokenPak:
sequenceDiagram
participant App as Your App
participant TP as TokenPak Proxy
participant VG as Validation Gate
participant TC as Token Counter
participant CM as Cache Manager
participant RL as Rate Limiter
participant PR as Provider Router
participant LLM as LLM Provider
App->>TP: POST /v1/messages (with API key)
TP->>TP: Parse & normalize request
TP->>VG: Check content policy
VG->>VG: Risk assessment
VG-->>TP: ✓ Allowed
TP->>CM: Check cache for similar request
CM-->>TP: Cache hit? Return cached response
alt Cache Hit
TP->>TP: No token usage
TP-->>App: Cached response (instant)
else Cache Miss
TP->>RL: Check rate limit & budget
RL-->>TP: ✓ Within limits
TP->>PR: Select provider (routing rules)
PR->>LLM: Forward request
LLM-->>PR: Response + usage
PR-->>TP: Response + usage
TP->>TC: Count tokens (input + output)
TC-->>TP: Token counts
TP->>CM: Store in cache
TP-->>App: Response (with token metadata)
end
TP->>TP: Log stats (cost, latency, cache, etc.)
- Parse Request — Normalize the incoming request format (OpenAI-compatible, native, etc.)
- Validation — Check content against policies; block if unsafe
- Cache Check — Look for cached response (exact or semantic match)
- Rate Limit Check — Verify IP is within quota; verify cost budget
- Provider Selection — Pick the best provider based on routing rules and health
- Forward Request — Send to the chosen LLM provider
- Count Tokens — Calculate input and output token usage
- Update Cache — Store response for future use
- Collect Stats — Record cost, latency, cache hit, usage metrics
- Return Response — Send response back to application
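As a rough sketch of how these ten steps compose, assuming the collaborators are the kinds of components sketched in the sections above (deps is a hypothetical bundle of them; streaming, retries, and error handling are omitted):

def handle_request(body: dict, client_ip: str, deps) -> dict:
    """Simplified pipeline mirroring the numbered steps above."""
    body = deps.normalize(body)                       # 1. parse & normalize
    allowed, reason = deps.validate(body)             # 2. validation gate
    if not allowed:
        return {"error": f"request blocked: {reason}"}
    cached = deps.cache.get(body)                     # 3. cache check
    if cached is not None:
        return cached                                 #    cache hit: zero provider tokens
    if not deps.rate_limiter.allow(client_ip):        # 4. rate limit & cost budget
        return {"error": "rate limit exceeded"}
    provider = deps.router.select()                   # 5. provider selection
    response = deps.forward(provider, body)           # 6. forward to provider
    usage = deps.count_tokens(body, response)         # 7. token counting
    deps.cache.put(body, response)                    # 8. update cache
    deps.stats.record(provider, usage)                # 9. collect stats
    return response                                   # 10. return to application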
Deployment Models¶
Single-Machine Deployment¶
TokenPak runs on one machine and all requests flow through it. Simple, low-overhead setup.
Your Application → [TokenPak Proxy] → LLM Provider
                          ↓
                 Local SQLite Cache
                  Local Stats DB
Docker Deployment¶
Run TokenPak in a containerized environment for an easily scalable setup.
Docker Container
├── TokenPak Proxy
├── Cache (volume mount)
└── Stats (volume mount)
Multi-Node Deployment (Distributed)¶
Multiple TokenPak instances for high availability and load distribution.
              Load Balancer
                    ↓
     ┌──────────────┼──────────────┐
     ↓              ↓              ↓
   Node 1         Node 2         Node 3
 [TokenPak]     [TokenPak]     [TokenPak]
     ↓              ↓              ↓
 [Shared Cache]  ← Redis/Memcached or similar
 [Shared Stats]  ← Prometheus/InfluxDB or similar
Internal Module Structure¶
graph TD
A["StageTrace & PipelineTrace"]
B["VaultIndex<br/>Token Counting & Compression"]
C["Provider Router<br/>Route Selection & Failover"]
D["Validation Gate<br/>Content Security"]
E["Cache Manager<br/>Response & Prompt Cache"]
F["Rate Limiter<br/>Quota Enforcement"]
G["Monitor<br/>Stats & Metrics"]
H["FormatAdapter<br/>OpenAI ↔ Native Conversion"]
I["Circuit Breaker<br/>Provider Health"]
A -->|Tracing| B
B -->|Token Data| G
B -->|Routes to| C
C -->|Routes to| I
D -->|Filters| E
E -->|Cache Stats| G
F -->|Quota Check| G
H -->|Format Convert| C
I -->|Health Status| C
- StageTrace & PipelineTrace: Request tracing for debugging and performance analysis
- VaultIndex: Token counting, semantic compression, and cost calculation
- Provider Router: Logic for selecting which LLM provider to use
- Validation Gate: Content scanning and policy enforcement
- Cache Manager: Response caching and prompt cache integration
- Rate Limiter: Per-IP, per-model, and cost-based limits
- Monitor: Real-time stats and usage reporting
- FormatAdapter: Converts between OpenAI and native formats transparently
- Circuit Breaker: Detects and routes around failing providers
Caching Strategy¶
TokenPak uses a three-tier caching approach to maximize token savings:
- Exact Match Cache — If we've seen this exact request before, return the cached response instantly (0 tokens)
- Semantic Cache — If a similar request exists (same intent, minor wording differences), TokenPak can return a cached response with high confidence
- Prompt Cache Headers — When available, TokenPak automatically injects prompt caching headers so the LLM provider caches expensive prompt prefixes
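For the third tier, the marker injection can look like the sketch below, shown here in Anthropic's prompt-caching request format (where the cache marker is a cache_control block on the system prompt rather than a literal HTTP header); the length threshold is an illustrative heuristic:

def inject_prompt_cache_markers(body: dict) -> dict:
    """Mark a large, stable system prompt as cacheable for providers that
    support prompt caching (Anthropic's format shown here)."""
    system = body.get("system")
    if isinstance(system, str) and len(system) > 4000:   # only worth caching long prefixes
        body = dict(body)
        body["system"] = [
            {"type": "text", "text": system, "cache_control": {"type": "ephemeral"}}
        ]
    return body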
Token Counting & Cost Tracking¶
TokenPak counts tokens accurately for every request/response, accounting for:
- Input tokens — User message + system prompt
- Output tokens — Model response
- Cache read tokens — Tokens served from provider caching (1/4 cost)
- Cache creation tokens — Tokens used to create a new cache entry (full cost)
Cost is calculated per-provider using live pricing data, giving you real per-request cost visibility.
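Putting those four token classes together, a per-request cost calculation looks roughly like the sketch below; the price table and usage field names are placeholders, since TokenPak pulls live pricing per provider:

# Hypothetical per-million-token prices; TokenPak uses live provider pricing.
PRICES = {
    "example-model": {"input": 3.00, "output": 15.00},   # USD per 1M tokens
}

def request_cost(model: str, usage: dict) -> float:
    """Cost of one request using the four token classes listed above.
    Cache reads are billed at 1/4 of the input price and cache creation
    at the full input price, matching the accounting in this section."""
    p = PRICES[model]
    per_tok_in = p["input"] / 1_000_000
    per_tok_out = p["output"] / 1_000_000
    return (
        usage.get("input_tokens", 0) * per_tok_in
        + usage.get("output_tokens", 0) * per_tok_out
        + usage.get("cache_read_tokens", 0) * per_tok_in * 0.25
        + usage.get("cache_creation_tokens", 0) * per_tok_in
    )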
Monitoring & Observability¶
TokenPak exports metrics for:
- Token usage — Input, output, cache reads, cache creates
- Cost — Per-request, per-model, cumulative
- Cache metrics — Hit rate, miss rate, semantic matches
- Provider health — Response times, error rates, circuit breaker status
- Rate limiting — Requests throttled, budgets exceeded
- Latency — End-to-end response time, provider latency
Access stats via:
curl http://localhost:8766/stats
Security Features¶
- Validation Gate: Blocks suspicious content before it reaches providers
- Rate Limiting: Prevents abuse and runaway costs
- Per-IP Quotas: Control who can use the proxy and how much
- API Key Isolation: Provider API keys stay on the proxy; proxied requests never expose them to clients
- Encrypted Config: Sensitive settings encrypted at rest
Configuration¶
TokenPak is configured via environment variables and a local config file:
# Core settings
TOKENPAK_BIND=0.0.0.0:8766
TOKENPAK_UPSTREAM=https://api.anthropic.com
# Cache settings
CACHE_ENABLED=true
CACHE_TTL_SECONDS=3600
# Rate limiting
RATE_LIMIT_PER_IP=100 # requests per minute
COST_LIMIT_PER_MINUTE=10.0 # dollars per minute
# Validation gate
VALIDATION_GATE_ENABLED=true
See docs/CONFIG.md for full options.
Extension Points¶
TokenPak is designed to be extended:
- Custom providers — Add support for new LLM APIs
- Custom validation rules — Implement your own content policies
- Custom cache backends — Use Redis, Memcached, or your own storage
- Custom routing logic — Implement custom provider selection rules
- Custom metrics exporters — Send stats to your monitoring system
See docs/CONTRIBUTING.md for extension patterns.
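Conceptually, a custom provider plugs in behind an adapter interface along these lines; this is only an illustration of the shape, and the real extension interface is the one documented in docs/CONTRIBUTING.md:

from abc import ABC, abstractmethod

class ProviderAdapter(ABC):
    """Conceptual shape of a custom provider plugin (illustrative only)."""

    name: str

    @abstractmethod
    def send(self, request_body: dict) -> dict:
        """Forward a normalized request to the provider and return its response."""

    @abstractmethod
    def usage(self, response: dict) -> dict:
        """Extract token usage (input/output/cache) from a provider response."""

    def healthy(self) -> bool:
        """Optional health probe used by the circuit breaker."""
        return True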