TokenPak Architecture¶
TokenPak is a transparent, feature-rich proxy that sits between your LLM client application and multiple LLM providers (Anthropic, OpenAI, etc.). It handles routing, caching, token counting, cost tracking, rate limiting, and security—without requiring you to change a single line in your application code.
High-Level Overview¶
graph LR
A["Your Application"]
B["TokenPak Proxy"]
C["Request Router"]
D["Validation Gate"]
E["Token Counter"]
F["Cache Manager"]
G["Rate Limiter"]
H["Provider Router"]
I["Anthropic API"]
J["OpenAI API"]
K["LLM Providers"]
L["Monitoring & Stats"]
A -->|HTTP/HTTPS| B
B --> C
C --> D
D --> E
E --> F
F --> G
G --> H
H -->|Route Request| I
H -->|Route Request| J
H -->|Route Request| K
E -->|Stats| L
F -->|Cache Hit/Miss| L
B -.->|Response| A
I -.->|Response| H
J -.->|Response| H
K -.->|Response| H
Core Components¶
1. Request Router¶
The entry point that receives all API requests from your application. It normalizes incoming requests (support for both OpenAI-compatible and native formats), extracts metadata, and passes requests through the pipeline.
Responsibility: Parse and validate incoming requests, extract user intent and model name, prepare request body for downstream processing.
2. Validation Gate¶
An optional safety layer that inspects message content against configured policies before passing to the proxy. Can detect and block suspicious patterns, enforce compliance rules, or rate-limit based on content risk.
Responsibility: Content security scanning, policy enforcement, risk classification of requests and responses.
3. Token Counter¶
Counts input and output tokens accurately using provider-specific tokenizers. Works transparently for streaming and non-streaming responses, supports prompt caching token accounting, and feeds real usage data to the cost tracker.
Responsibility: Accurate token counting per provider, cache-aware token calculation, real-time stats collection.
4. Cache Manager¶
Implements a multi-layer caching strategy: provider-native prompt caching (e.g. Anthropic prompt cache pass-through), and an opt-in TokenPak-managed semantic cache for narrow read-shaped route classes. Semantic response substitution is OFF by default, bypassed entirely for code/debug/streaming/Claude-Code traffic, and gated by per-route similarity thresholds. Configurable TTL-based eviction governs cache lifetime. See the Caching Strategy section below for the full safety contract.
Responsibility: Cache storage and retrieval, cache hit rate optimization, prompt cache header management, token savings calculation.
5. Rate Limiter¶
Enforces per-IP rate limiting, per-model rate limits, and cost-per-minute budgets. Prevents runaway spending and protects against abuse.
Responsibility: Rate limit enforcement, cost-based throttling, backpressure handling.
6. Provider Router¶
Decides which LLM provider to use based on request metadata, fallback rules, and provider health. Supports weighted routing, circuit breakers (detects down providers), and failover logic.
Responsibility: Provider selection, failover logic, circuit breaker management, health checking.
7. Monitoring & Observability¶
Real-time stats collection: token usage, cost, cache hit rates, latency, provider health. Exports metrics to dashboards and analytics tools.
Responsibility: Metrics collection, stats aggregation, performance monitoring, usage reporting.
Request Flow¶
Here's what happens when your application sends a request through TokenPak:
sequenceDiagram
participant App as Your App
participant TP as TokenPak Proxy
participant VG as Validation Gate
participant TC as Token Counter
participant CM as Cache Manager
participant RL as Rate Limiter
participant PR as Provider Router
participant LLM as LLM Provider
App->>TP: POST /v1/messages (with API key)
TP->>TP: Parse & normalize request
TP->>VG: Check content policy
VG->>VG: Risk assessment
VG-->>TP: ✓ Allowed
TP->>CM: Check cache for similar request
CM-->>TP: Cache hit? Return cached response
alt Cache Hit
TP->>TP: No token usage
TP-->>App: Cached response (instant)
else Cache Miss
TP->>RL: Check rate limit & budget
RL-->>TP: ✓ Within limits
TP->>PR: Select provider (routing rules)
PR->>LLM: Forward request
LLM-->>PR: Response + usage
TP->>TC: Count tokens (input + output)
TC-->>TP: Token counts
TP->>CM: Store in cache
TP-->>App: Response (with token metadata)
end
TP->>TP: Log stats (cost, latency, cache, etc.)
- Parse Request — Normalize the incoming request format (OpenAI-compatible, native, etc.)
- Validation — Check content against policies; block if unsafe
- Cache Check — Look for cached response (exact or semantic match)
- Rate Limit Check — Verify IP is within quota; verify cost budget
- Provider Selection — Pick the best provider based on routing rules and health
- Forward Request — Send to the chosen LLM provider
- Count Tokens — Calculate input and output token usage
- Update Cache — Store response for future use
- Collect Stats — Record cost, latency, cache hit, usage metrics
- Return Response — Send response back to application
Deployment Models¶
Single-Machine Deployment¶
TokenPak runs on one machine and all requests flow through it. Simple, low-overhead setup.
Your Application → [TokenPak Proxy] → LLM Provider
↓
Local SQLite Cache
Local Stats DB
Docker Deployment¶
Run TokenPak in a containerized environment, easily scalable.
Docker Container
├── TokenPak Proxy
├── Cache (volume mount)
└── Stats (volume mount)
Multi-Node Deployment (Distributed)¶
Multiple TokenPak instances for high availability and load distribution.
Load Balancer
↓
┌─────────────┬─────────────┬─────────────┐
↓ ↓ ↓
Node 1 Node 2 Node 3
[TokenPak] [TokenPak] [TokenPak]
↓ ↓ ↓
[Shared Cache] ← Redis/Memcached or similar
[Shared Stats] ← Prometheus/InfluxDB or similar
Internal Module Structure¶
graph TD
A["StageTrace & PipelineTrace"]
B["VaultIndex<br/>Vault Retrieval & Semantic Search"]
C["Provider Router<br/>Route Selection & Failover"]
D["Validation Gate<br/>Content Security"]
E["Cache Manager<br/>Response & Prompt Cache"]
F["Rate Limiter<br/>Quota Enforcement"]
G["Monitor<br/>Stats & Metrics"]
H["FormatAdapter<br/>OpenAI ↔ Native Conversion"]
I["Circuit Breaker<br/>Provider Health"]
A -->|Tracing| B
B -->|Token Data| G
B -->|Routes to| C
C -->|Routes to| I
D -->|Filters| E
E -->|Cache Stats| G
F -->|Quota Check| G
H -->|Format Convert| C
I -->|Health Status| C
- StageTrace & PipelineTrace: Request tracing for debugging and performance analysis
- VaultIndex: Vault retrieval and semantic-search index — selects relevant indexed blocks for context injection (not the token counter)
- Provider Router: Logic for selecting which LLM provider to use
- Validation Gate: Content scanning and policy enforcement
- Cache Manager: Response caching and prompt cache integration
- Rate Limiter: Per-IP, per-model, and cost-based limits
- Monitor: Real-time stats and usage reporting
- FormatAdapter: Converts between OpenAI and native formats transparently
- Circuit Breaker: Detects and routes around failing providers
Caching Strategy¶
TokenPak uses a three-tier caching approach to maximize token savings:
- Exact Match Cache — If we've seen this exact request before, return the cached response instantly (0 tokens).
- Semantic Cache (opt-in) — When explicitly enabled via
TOKENPAK_SEMANTIC_CACHE_STAGE, TokenPak may serve cached responses for a narrow set of read-shaped route classes (status checks, summarization, configuration inspection) at conservative per-route similarity thresholds. All code-generation, code-edit, code-review, debugging, test-failure, log-analysis, git-diff-review, and shell-command-analysis prompts are bypassed entirely. Streaming requests are never served from semantic cache. Claude Code traffic is excluded to preserve message-id fidelity. Unknown route classes default to no response substitution. The TokenPak semantic cache stores normalized + hashed query forms only — entries hold a 12-character query hash, response bytes, content type, and wire format. Raw prompt text is not persisted in the semantic-cache store. (This statement scopes only the semantic-cache store; other TokenPak surfaces — request telemetry, trace records, companion journal, capsule storage — have their own retention rules documented in their respective standards and operator configuration.) - Prompt Cache Headers — When available, TokenPak automatically injects prompt caching headers so the LLM provider caches expensive prompt prefixes.
Token Counting & Cost Tracking¶
TokenPak counts tokens accurately for every request/response, accounting for:
- Input tokens — User message + system prompt
- Output tokens — Model response
- Cache read tokens — Tokens served from provider caching (1/4 cost)
- Cache creation tokens — Tokens used to create a new cache entry (full cost)
Cost is calculated per-provider using live pricing data, giving you real per-request cost visibility.
Monitoring & Observability¶
TokenPak exports metrics for:
- Token usage — Input, output, cache reads, cache creates
- Cost — Per-request, per-model, cumulative
- Cache metrics — Hit rate, miss rate, semantic matches
- Provider health — Response times, error rates, circuit breaker status
- Rate limiting — Requests throttled, budgets exceeded
- Latency — End-to-end response time, provider latency
Access stats via:
curl http://localhost:8766/stats
Security Features¶
- Validation Gate: Blocks suspicious content before it reaches providers
- Rate Limiting: Prevents abuse and runaway costs
- Per-IP Quotas: Control who can use the proxy and how much
- API Key Isolation: Proxied requests don't leak your API keys to the client
- Encrypted Config: Sensitive settings encrypted at rest
Configuration¶
TokenPak is configured via environment variables and a local config file (~/.tokenpak/config.yaml). Environment variables override config file values:
# Core settings
TOKENPAK_BIND_ADDRESS=127.0.0.1 # default bind host (loopback only)
TOKENPAK_PORT=8766 # default listen port
TOKENPAK_MODE=hybrid # compression mode: strict | hybrid | aggressive
TOKENPAK_COMPACT=1 # master compression switch (0 = off)
# Storage
TOKENPAK_DB=.tokenpak/monitor.db # SQLite database path
# Logging
TOKENPAK_LOG_LEVEL=info # debug | info | warning | error
# Spend guard
TOKENPAK_SPEND_GUARD_ENABLED=true
See configuration.md and env-vars.md for the full set of options.
Extension Points¶
TokenPak is designed to be extended:
- Custom providers — Add support for new LLM APIs
- Custom validation rules — Implement your own content policies
- Custom cache backends — Use Redis, Memcached, or your own storage
- Custom routing logic — Implement custom provider selection rules
- Custom metrics exporters — Send stats to your monitoring system
See the Plugin Guide for extension patterns, and the tokenpak/tokenpak repository for contribution guidelines.