Skip to content

TokenPak Savings — Real Numbers

How much will TokenPak save you?

The simple answer: 10–40% of your LLM bill.

Here's how we measure it, and what you should expect.


The Math (Simple Version)

What TokenPak Does

  1. Deduplicates requests — If you send the same prompt twice, the second one costs less (cache hit)
  2. Compresses long context — Summarizes repetitive text blocks before sending to the LLM API
  3. Injects smart context — Reuses cached blocks from your vault instead of recomputing every time
  4. Tracks every optimization — Reports how many tokens you saved and how much money

The Impact

Technique Typical Savings When It Happens
Request deduplication 5–15% Every time you ask the same question twice
Semantic compression 10–30% When you send large documents or code contexts
Vault injection caching 20–40%+ In agent loops, batch processing, or knowledge-base lookups
Combined (balanced mode) 15–25% Default behavior across all requests

Real Fleet Data

TokenPak is running in production right now. Here's what we're saving:

Session Snapshot (Last 24 Hours)

Metric Value
Total requests 23,000+
Input tokens sent 244M+
Tokens saved 390K+
Dollar savings $415+
Cache hit rate 97.6%
Compression ratio 3.7:1 (best compression mode)

By Model

Model Requests Cost Saved
Claude Haiku 22,701 $155.68 390K tokens
Claude Sonnet 3,618 $125.44 via compression
Claude Opus 595 $130.90 via cache hits

Translation: In one production day, with ~26K LLM calls, TokenPak saved over $415 on Anthropic's API alone.

If you're using OpenAI or Google Gemini alongside, multiply that by 2–3x.


How to Measure Your Own Savings

1. Start the Proxy

export ANTHROPIC_API_KEY=sk-ant-...
tokenpak proxy

2. Point Your Code at It

# Before: Uses real API directly
client = Anthropic()

# After: Routes through TokenPak proxy (100% compatible)
client = Anthropic(base_url="http://localhost:8766")

3. Check Your Savings (Real-Time)

# One-liner to see your savings today
tokenpak stats --today

Example output:

Session savings:
  Requests: 4,404
  Input tokens: 58.7M
  Tokens saved: 2.8M (4.8%)
  Cost: $75.01
  Cost saved (estimated): $3.61

Cache performance:
  Hit rate: 98.2%
  Reused tokens: 188.7M (from cache)

4. Understand the Breakdown

# Detailed report with per-model savings
tokenpak report --json

Returns: - input_tokens — Tokens you actually sent to the API (after compression) - saved_tokens — Tokens we didn't send (already cached or compressed) - compression_ratio — How aggressively we squeezed your context - cost_saved — Estimated dollar amount saved - cache_hit_rate — % of your requests that hit the cache


Example: Agent Loop

Let's say you're running an agent that: 1. Takes a user question 2. Searches a knowledge base (100 results) 3. Calls Claude 3–4 times to refine the answer

Without TokenPak: - Each Claude call sees the full 100 search results - Each call costs ~$0.10 (depends on model) - 4 calls = $0.40 per user question

With TokenPak: - First call: Full context sent, $0.10 - Calls 2–4: Cache hits on the search results (90% savings) - 3 calls cost ~$0.02 each = $0.06 total - Savings: $0.34 per question (85%!)

Scale this across 10,000 questions/day, and you're saving $3,400/day or $1.2M+/year.


Setup Options

Sit TokenPak between your code and the LLM API. No code changes.

tokenpak proxy --port 8766

Then swap one URL in your client:

client = Anthropic(base_url="http://localhost:8766")

Pros: Works with any SDK, automatic for all requests, easy to scale Cons: One extra hop (negligible latency: ~5ms)

Option 2: SDK Mode

Call TokenPak's compression directly in your code.

from tokenpak import HeuristicEngine

engine = HeuristicEngine()
compressed = engine.compress(long_context, target_tokens=2048)

# Send your LLM request with the compressed context

Pros: Fine-grained control, no proxy overhead, works offline Cons: Requires code changes, manual compression at call sites

Option 3: Hybrid

Proxy for most requests + SDK mode for special cases (cost-critical paths).


Profiles: Tune Savings vs. Risk

TokenPak ships with compression profiles tuned for different workloads:

Profile Compression Savings Risk Use Case
safe Light 5–10% Very low Production, high-stakes queries
balanced Medium 15–25% Low General workloads (default)
aggressive Strong 30–40% Medium Batch processing, bulk summarization
agentic Medium-strong 20–30% Low–medium Agent loops, tool use, reasoning

Set your profile:

export TOKENPAK_PROFILE=balanced  # default
tokenpak proxy

Or per-request:

# This request uses aggressive compression
response = client.messages.create(
    model="claude-opus-4-6",
    messages=[...],
    extra_headers={"X-TokenPak-Profile": "aggressive"}
)


ROI Calculator

Estimate your monthly savings:

Your monthly LLM spend: $X
TokenPak typical savings: 15–25%
Your monthly savings: $X × 0.20 = $X/month × 12 = $Xk/year

Example: - Spend: $5,000/month on LLM APIs - Savings @ 20%: $1,000/month - Annual savings: $12,000/year - Effort to deploy: ~30 minutes (swap one URL)


Caveats & Tradeoffs

When Savings Are Highest

  • ✅ Agent loops and multi-turn workflows
  • ✅ Batch processing with repeated contexts
  • ✅ Knowledge-base lookups with large doc chunks
  • ✅ Codebase indexing and semantic search

When Savings Are Lower

  • ⚠️ One-off requests (no cache hits, no dedup)
  • ⚠️ Highly unique contexts (compression is less effective)
  • ⚠️ Streaming responses (cache benefits hit less often)

Quality Tradeoffs

TokenPak is semantically lossless in safe and balanced modes: - No information loss - No hallucinations introduced - All model capabilities preserved

In aggressive mode, we trade ~5% accuracy (on some tasks) for 30–40% cost savings. Test it on your workload.


Next Steps

  1. Start simple: tokenpak proxy + swap one URL
  2. Measure: Run tokenpak stats after a few requests
  3. Optimize: Test different profiles with your workload
  4. Scale: Deploy to production when comfortable

Questions?

  • How do I verify the savings are real? → Check tokenpak stats or tokenpak report --json for token-by-token breakdown
  • Will this slow down my requests? → Proxy adds ~5ms latency (negligible); SDK mode adds no latency
  • Can I bypass TokenPak for specific requests? → Yes, set header X-TokenPak-Bypass: true
  • What if the LLM needs the exact original tokens? → Use bypass header or switch to safe profile
  • Does this work with streaming? → Yes, with caveats (cache hits are less frequent in stream mode)

See TROUBLESHOOTING.md for more.