TokenPak Architecture¶
TokenPak is a transparent, feature-rich proxy that sits between your LLM client application and multiple LLM providers (Anthropic, OpenAI, etc.). It handles routing, caching, token counting, cost tracking, rate limiting, and security—without requiring you to change a single line in your application code.
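In practice, "transparent" means your client keeps using its normal SDK and simply points at the proxy. A minimal sketch, assuming TokenPak is listening on localhost:8766 (the default bind shown under Configuration) and exposing the native /v1/messages endpoint used in the request flow below; the model id and API key are placeholders:

import anthropic

# Point the client at TokenPak instead of the provider; TokenPak forwards the
# request upstream and handles caching, token counting, and rate limiting.
client = anthropic.Anthropic(
    base_url="http://localhost:8766",    # TOKENPAK_BIND from the Configuration section
    api_key="placeholder-key",           # whatever key your TokenPak deployment expects
)

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",  # any model your routing rules support
    max_tokens=256,
    messages=[{"role": "user", "content": "Hello from behind TokenPak"}],
)
print(response.content[0].text)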
High-Level Overview¶
graph LR
A["Your Application"]
B["TokenPak Proxy"]
C["Request Router"]
D["Validation Gate"]
E["Token Counter"]
F["Cache Manager"]
G["Rate Limiter"]
H["Provider Router"]
I["Anthropic API"]
J["OpenAI API"]
K["LLM Providers"]
L["Monitoring & Stats"]
A -->|HTTP/HTTPS| B
B --> C
C --> D
D --> E
E --> F
F --> G
G --> H
H -->|Route Request| I
H -->|Route Request| J
H -->|Route Request| K
E -->|Stats| L
F -->|Cache Hit/Miss| L
B -.->|Response| A
I -.->|Response| H
J -.->|Response| H
K -.->|Response| H
Core Components¶
1. Request Router¶
The entry point that receives all API requests from your application. It normalizes incoming requests (supporting both OpenAI-compatible and native formats), extracts metadata, and passes them through the pipeline.
Responsibility: Parse and validate incoming requests, extract user intent and model name, prepare request body for downstream processing.
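As a rough illustration, normalization can be as simple as folding an OpenAI-style system message into a native-style top-level system field. This is a hypothetical sketch, not TokenPak's actual implementation:

def normalize_request(body: dict) -> dict:
    """Normalize an OpenAI-style chat body into a native-style body.

    OpenAI puts the system prompt inside the messages list; the native
    format carries it as a top-level "system" field. Everything else
    is passed through unchanged.
    """
    messages = body.get("messages", [])
    system_parts = [m["content"] for m in messages if m.get("role") == "system"]
    chat_messages = [m for m in messages if m.get("role") != "system"]
    normalized = dict(body)
    normalized["messages"] = chat_messages
    if system_parts:
        normalized["system"] = "\n".join(system_parts)
    return normalized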
2. Validation Gate¶
An optional safety layer that inspects message content against configured policies before the request continues through the pipeline. It can detect and block suspicious patterns, enforce compliance rules, or rate-limit based on content risk.
Responsibility: Content security scanning, policy enforcement, risk classification of requests and responses.
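A minimal sketch of what a policy check at this stage can look like; the patterns below are hypothetical examples, and a real deployment would load its rules from configuration:

import re

# Hypothetical policy patterns; real rules come from the proxy's configuration.
BLOCKED_PATTERNS = [
    re.compile(r"(?i)ignore (all )?previous instructions"),  # prompt-injection style
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),                    # looks like a US SSN
]

def check_content(text: str) -> tuple[bool, str | None]:
    """Return (allowed, reason). Block the request if any pattern matches."""
    for pattern in BLOCKED_PATTERNS:
        if pattern.search(text):
            return False, f"blocked by policy pattern: {pattern.pattern}"
    return True, None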
3. Token Counter¶
Counts input and output tokens accurately using provider-specific tokenizers. Works transparently for streaming and non-streaming responses, supports prompt caching token accounting, and feeds real usage data to the cost tracker.
Responsibility: Accurate token counting per provider, cache-aware token calculation, real-time stats collection.
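For OpenAI-family models, counting can be done locally with the tiktoken tokenizer, as in the sketch below; Anthropic models need either their own tokenizer or the usage block the provider returns with each response. The helper name is illustrative:

import tiktoken

def count_openai_tokens(text: str, model: str = "gpt-4o") -> int:
    """Count tokens with the provider-specific tokenizer (OpenAI models only here)."""
    try:
        enc = tiktoken.encoding_for_model(model)
    except KeyError:
        enc = tiktoken.get_encoding("cl100k_base")  # reasonable fallback encoding
    return len(enc.encode(text))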
4. Cache Manager¶
Implements a multi-layer caching strategy: semantic deduplication (recognizes similar prompts), prompt caching integration (leverages provider caching when available), and configurable TTL-based cache eviction.
Responsibility: Cache storage and retrieval, cache hit rate optimization, prompt cache header management, token savings calculation.
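The exact-match layer can be as simple as hashing a canonical form of the request and storing the response with a timestamp. A minimal sketch (semantic matching and prompt-cache integration are not shown); the default TTL mirrors CACHE_TTL_SECONDS from the Configuration section:

import hashlib
import json
import time

class TTLCache:
    """Minimal exact-match response cache with TTL-based eviction."""

    def __init__(self, ttl_seconds: float = 3600):
        self.ttl = ttl_seconds
        self._store: dict[str, tuple[float, dict]] = {}

    @staticmethod
    def key(request_body: dict) -> str:
        # Hash a canonical form of the request so identical requests collide.
        canonical = json.dumps(request_body, sort_keys=True)
        return hashlib.sha256(canonical.encode()).hexdigest()

    def get(self, request_body: dict) -> dict | None:
        k = self.key(request_body)
        entry = self._store.get(k)
        if entry is None:
            return None
        stored_at, response = entry
        if time.time() - stored_at > self.ttl:
            del self._store[k]  # expired: evict and treat as a miss
            return None
        return response

    def put(self, request_body: dict, response: dict) -> None:
        self._store[self.key(request_body)] = (time.time(), response)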
5. Rate Limiter¶
Enforces per-IP rate limiting, per-model rate limits, and cost-per-minute budgets. Prevents runaway spending and protects against abuse.
Responsibility: Rate limit enforcement, cost-based throttling, backpressure handling.
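The per-IP limit can be implemented as a sliding window, as in this sketch; the default of 100 requests per minute matches RATE_LIMIT_PER_IP in the Configuration section. Cost-based throttling tracks dollars instead of request counts but follows the same shape:

import time
from collections import defaultdict, deque

class PerIPRateLimiter:
    """Sliding-window limiter: at most `limit` requests per IP per window."""

    def __init__(self, limit: int = 100, window_seconds: float = 60.0):
        self.limit = limit
        self.window = window_seconds
        self._hits: dict[str, deque] = defaultdict(deque)

    def allow(self, ip: str) -> bool:
        now = time.time()
        hits = self._hits[ip]
        while hits and now - hits[0] > self.window:
            hits.popleft()    # drop requests that fell out of the window
        if len(hits) >= self.limit:
            return False      # over quota; caller should return HTTP 429
        hits.append(now)
        return True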
6. Provider Router¶
Decides which LLM provider to use based on request metadata, fallback rules, and provider health. Supports weighted routing, circuit breakers that detect unavailable providers, and failover logic.
Responsibility: Provider selection, failover logic, circuit breaker management, health checking.
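Weighted routing over healthy providers can be sketched as below; the provider table is hypothetical, and in TokenPak the weights come from routing rules while the healthy flags come from the circuit breaker:

import random

# Hypothetical routing table; weights and health flags would normally come
# from configuration and from the circuit breaker, respectively.
PROVIDERS = [
    {"name": "anthropic", "weight": 0.7, "healthy": True},
    {"name": "openai",    "weight": 0.3, "healthy": True},
]

def select_provider() -> str:
    """Weighted random selection over healthy providers, with simple failover."""
    healthy = [p for p in PROVIDERS if p["healthy"]]
    if not healthy:
        raise RuntimeError("no healthy providers available")
    total = sum(p["weight"] for p in healthy)
    pick = random.uniform(0, total)
    for p in healthy:
        pick -= p["weight"]
        if pick <= 0:
            return p["name"]
    return healthy[-1]["name"]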
7. Monitoring & Observability¶
Collects real-time stats on token usage, cost, cache hit rates, latency, and provider health, and exports metrics to dashboards and analytics tools.
Responsibility: Metrics collection, stats aggregation, performance monitoring, usage reporting.
Request Flow¶
Here's what happens when your application sends a request through TokenPak:
sequenceDiagram
participant App as Your App
participant TP as TokenPak Proxy
participant VG as Validation Gate
participant TC as Token Counter
participant CM as Cache Manager
participant RL as Rate Limiter
participant PR as Provider Router
participant LLM as LLM Provider
App->>TP: POST /v1/messages (with API key)
TP->>TP: Parse & normalize request
TP->>VG: Check content policy
VG->>VG: Risk assessment
VG-->>TP: ✓ Allowed
TP->>CM: Check cache for similar request
CM-->>TP: Cache hit? Return cached response
alt Cache Hit
TP->>TP: No token usage
TP-->>App: Cached response (instant)
else Cache Miss
TP->>RL: Check rate limit & budget
RL-->>TP: ✓ Within limits
TP->>PR: Select provider (routing rules)
PR->>LLM: Forward request
LLM-->>PR: Response + usage
PR-->>TP: Response + usage
TP->>TC: Count tokens (input + output)
TC-->>TP: Token counts
TP->>CM: Store in cache
TP-->>App: Response (with token metadata)
end
TP->>TP: Log stats (cost, latency, cache, etc.)
- Parse Request — Normalize the incoming request format (OpenAI-compatible, native, etc.)
- Validation — Check content against policies; block if unsafe
- Cache Check — Look for cached response (exact or semantic match)
- Rate Limit Check — Verify IP is within quota; verify cost budget
- Provider Selection — Pick the best provider based on routing rules and health
- Forward Request — Send to the chosen LLM provider
- Count Tokens — Calculate input and output token usage
- Update Cache — Store response for future use
- Collect Stats — Record cost, latency, cache hit, usage metrics
- Return Response — Send response back to application
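As a rough sketch of how these ten steps compose, assuming the collaborators are the kinds of components sketched in the sections above (deps is a hypothetical bundle of them; streaming, retries, and error handling are omitted):

def handle_request(body: dict, client_ip: str, deps) -> dict:
    """Simplified pipeline mirroring the numbered steps above."""
    body = deps.normalize(body)                       # 1. parse & normalize
    allowed, reason = deps.validate(body)             # 2. validation gate
    if not allowed:
        return {"error": f"request blocked: {reason}"}
    cached = deps.cache.get(body)                     # 3. cache check
    if cached is not None:
        return cached                                 #    cache hit: zero provider tokens
    if not deps.rate_limiter.allow(client_ip):        # 4. rate limit & cost budget
        return {"error": "rate limit exceeded"}
    provider = deps.router.select()                   # 5. provider selection
    response = deps.forward(provider, body)           # 6. forward to provider
    usage = deps.count_tokens(body, response)         # 7. token counting
    deps.cache.put(body, response)                    # 8. update cache
    deps.stats.record(provider, usage)                # 9. collect stats
    return response                                   # 10. return to application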
Deployment Models¶
Single-Machine Deployment¶
TokenPak runs on one machine and all requests flow through it. Simple, low-overhead setup.
Your Application → [TokenPak Proxy] → LLM Provider
                          ↓
                 Local SQLite Cache
                  Local Stats DB
Docker Deployment¶
Run TokenPak in a containerized environment for an easily scalable setup.
Docker Container
├── TokenPak Proxy
├── Cache (volume mount)
└── Stats (volume mount)
Multi-Node Deployment (Distributed)¶
Multiple TokenPak instances for high availability and load distribution.
              Load Balancer
                    ↓
     ┌──────────────┼──────────────┐
     ↓              ↓              ↓
   Node 1         Node 2         Node 3
 [TokenPak]     [TokenPak]     [TokenPak]
     ↓              ↓              ↓
 [Shared Cache]  ← Redis/Memcached or similar
 [Shared Stats]  ← Prometheus/InfluxDB or similar
Internal Module Structure¶
graph TD
A["StageTrace & PipelineTrace"]
B["VaultIndex<br/>Token Counting & Compression"]
C["Provider Router<br/>Route Selection & Failover"]
D["Validation Gate<br/>Content Security"]
E["Cache Manager<br/>Response & Prompt Cache"]
F["Rate Limiter<br/>Quota Enforcement"]
G["Monitor<br/>Stats & Metrics"]
H["FormatAdapter<br/>OpenAI ↔ Native Conversion"]
I["Circuit Breaker<br/>Provider Health"]
A -->|Tracing| B
B -->|Token Data| G
B -->|Routes to| C
C -->|Routes to| I
D -->|Filters| E
E -->|Cache Stats| G
F -->|Quota Check| G
H -->|Format Convert| C
I -->|Health Status| C
- StageTrace & PipelineTrace: Request tracing for debugging and performance analysis
- VaultIndex: Token counting, semantic compression, and cost calculation
- Provider Router: Logic for selecting which LLM provider to use
- Validation Gate: Content scanning and policy enforcement
- Cache Manager: Response caching and prompt cache integration
- Rate Limiter: Per-IP, per-model, and cost-based limits
- Monitor: Real-time stats and usage reporting
- FormatAdapter: Converts between OpenAI and native formats transparently
- Circuit Breaker: Detects and routes around failing providers
Caching Strategy¶
TokenPak uses a three-tier caching approach to maximize token savings:
- Exact Match Cache — If we've seen this exact request before, return the cached response instantly (0 tokens)
- Semantic Cache — If a similar request exists (same intent, minor wording differences), TokenPak can return a cached response with high confidence
- Prompt Cache Headers — When available, TokenPak automatically injects prompt caching headers so the LLM provider caches expensive prompt prefixes
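For the third tier, the marker injection can look like the sketch below, shown here in Anthropic's prompt-caching request format (where the cache marker is a cache_control block on the system prompt rather than a literal HTTP header); the length threshold is an illustrative heuristic:

def inject_prompt_cache_markers(body: dict) -> dict:
    """Mark a large, stable system prompt as cacheable for providers that
    support prompt caching (Anthropic's format shown here)."""
    system = body.get("system")
    if isinstance(system, str) and len(system) > 4000:   # only worth caching long prefixes
        body = dict(body)
        body["system"] = [
            {"type": "text", "text": system, "cache_control": {"type": "ephemeral"}}
        ]
    return body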
Token Counting & Cost Tracking¶
TokenPak counts tokens accurately for every request/response, accounting for:
- Input tokens — User message + system prompt
- Output tokens — Model response
- Cache read tokens — Tokens served from provider caching (1/4 cost)
- Cache creation tokens — Tokens used to create a new cache entry (full cost)
Cost is calculated per-provider using live pricing data, giving you real per-request cost visibility.
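Putting those four token classes together, a per-request cost calculation looks roughly like the sketch below; the price table and usage field names are placeholders, since TokenPak pulls live pricing per provider:

# Hypothetical per-million-token prices; TokenPak uses live provider pricing.
PRICES = {
    "example-model": {"input": 3.00, "output": 15.00},   # USD per 1M tokens
}

def request_cost(model: str, usage: dict) -> float:
    """Cost of one request using the four token classes listed above.
    Cache reads are billed at 1/4 of the input price and cache creation
    at the full input price, matching the accounting in this section."""
    p = PRICES[model]
    per_tok_in = p["input"] / 1_000_000
    per_tok_out = p["output"] / 1_000_000
    return (
        usage.get("input_tokens", 0) * per_tok_in
        + usage.get("output_tokens", 0) * per_tok_out
        + usage.get("cache_read_tokens", 0) * per_tok_in * 0.25
        + usage.get("cache_creation_tokens", 0) * per_tok_in
    )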
Monitoring & Observability¶
TokenPak exports metrics for:
- Token usage — Input, output, cache reads, cache creates
- Cost — Per-request, per-model, cumulative
- Cache metrics — Hit rate, miss rate, semantic matches
- Provider health — Response times, error rates, circuit breaker status
- Rate limiting — Requests throttled, budgets exceeded
- Latency — End-to-end response time, provider latency
Access stats via:
curl http://localhost:8766/stats
Security Features¶
- Validation Gate: Blocks suspicious content before it reaches providers
- Rate Limiting: Prevents abuse and runaway costs
- Per-IP Quotas: Control who can use the proxy and how much
- API Key Isolation: Provider API keys stay on the proxy; proxied requests never expose them to clients
- Encrypted Config: Sensitive settings encrypted at rest
Configuration¶
TokenPak is configured via environment variables and a local config file:
# Core settings
TOKENPAK_BIND=0.0.0.0:8766
TOKENPAK_UPSTREAM=https://api.anthropic.com
# Cache settings
CACHE_ENABLED=true
CACHE_TTL_SECONDS=3600
# Rate limiting
RATE_LIMIT_PER_IP=100 # requests per minute
COST_LIMIT_PER_MINUTE=10.0 # dollars per minute
# Validation gate
VALIDATION_GATE_ENABLED=true
See docs/CONFIG.md for full options.
Extension Points¶
TokenPak is designed to be extended:
- Custom providers — Add support for new LLM APIs
- Custom validation rules — Implement your own content policies
- Custom cache backends — Use Redis, Memcached, or your own storage
- Custom routing logic — Implement custom provider selection rules
- Custom metrics exporters — Send stats to your monitoring system
See docs/CONTRIBUTING.md for extension patterns.
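Conceptually, a custom provider plugs in behind an adapter interface along these lines; this is only an illustration of the shape, and the real extension interface is the one documented in docs/CONTRIBUTING.md:

from abc import ABC, abstractmethod

class ProviderAdapter(ABC):
    """Conceptual shape of a custom provider plugin (illustrative only)."""

    name: str

    @abstractmethod
    def send(self, request_body: dict) -> dict:
        """Forward a normalized request to the provider and return its response."""

    @abstractmethod
    def usage(self, response: dict) -> dict:
        """Extract token usage (input/output/cache) from a provider response."""

    def healthy(self) -> bool:
        """Optional health probe used by the circuit breaker."""
        return True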