TokenPak Savings
How much will TokenPak save you?
It depends on your workload — specifically how much of your context repeats and how compressible it is. The honest answer is: measure it on your own traffic. TokenPak gives you the tools to do exactly that.
A note on numbers: TokenPak does not publish headline savings or cost figures until they are backed by a validated, frozen-fixture benchmark run. Our benchmark suite is in progress; receipt-backed figures will be published once it produces a validated run. Until then, the most reliable savings number is the one you measure on your own workload with tokenpak stats and tokenpak report --json.
The Math (Simple Version)
What TokenPak Does
- Deduplicates requests — If you send the same prompt twice, the second one can be served from cache instead of re-sent (cache hit)
- Compresses long context — Summarizes repetitive text blocks before sending to the LLM API
- Injects smart context — Reuses cached blocks from your vault instead of recomputing every time
- Tracks every optimization — Reports how many tokens and how much money you saved, per request
The Impact
Savings come from three complementary mechanisms, which balanced mode combines by default:
| Technique | What it does | When it helps |
|---|---|---|
| Request deduplication | Avoids resending an identical prompt | Every time you ask the same question twice |
| Semantic compression | Shrinks repetitive or verbose context | When you send large documents or code contexts |
| Vault injection caching | Reuses cached context blocks | In agent loops, batch processing, or knowledge-base lookups |
| Combined (balanced mode) | Applies caching + light compression by default | Across all requests |
How much each of these saves depends entirely on your workload and repeat rate — there is no single number that holds across all traffic.
How Savings Behave in Practice
TokenPak's savings are workload-dependent. The general shape:
- Highly repetitive traffic (agent loops, batch jobs, knowledge-base lookups) benefits most, because cached and compressible context dominates.
- Provider-cached flows show lower incremental gains, because the provider is already discounting repeated context.
- One-off, highly unique requests benefit least, because there is little to dedup or compress.
The only way to know your number is to run TokenPak on your traffic and read the report.
How to Measure Your Own Savings
1. Start the Proxy
export ANTHROPIC_API_KEY=sk-ant-...
tokenpak serve
2. Point Your Code at It
from anthropic import Anthropic
# Before: calls the API directly
client = Anthropic()
# After: routes through the TokenPak proxy (drop-in compatible)
client = Anthropic(base_url="http://localhost:8766")
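If you want to turn the proxy on and off without editing code, one option is to read the base URL from the environment. This is a minimal sketch, not a TokenPak feature: the variable name `TOKENPAK_BASE_URL` and the helper `client_kwargs` are hypothetical and chosen only for illustration. Only the `base_url` argument and the `http://localhost:8766` address come from the step above.

```python
import os


def client_kwargs(env=None):
    """Build keyword arguments for Anthropic(...).

    TOKENPAK_BASE_URL is a hypothetical variable name used for this
    sketch; TokenPak itself may not read or define it.
    """
    env = os.environ if env is None else env
    base_url = env.get("TOKENPAK_BASE_URL", "").strip()
    # No URL configured: return no overrides, so the SDK uses its default endpoint.
    return {"base_url": base_url} if base_url else {}


if __name__ == "__main__":
    # With the anthropic package installed: client = Anthropic(**client_kwargs())
    print(client_kwargs({"TOKENPAK_BASE_URL": "http://localhost:8766"}))
```

Unsetting the variable then routes traffic straight to the API again, which can be handy when comparing behavior with and without the proxy.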
3. Check Your Savings (Real-Time)
# One-liner to see your savings today
tokenpak savings
# Or scope it to the current session/day
tokenpak stats --today
The output reports the requests, tokens, and estimated cost saved for your traffic — these are the numbers that matter for your decision.
4. Understand the Breakdown
# Detailed report with per-model savings
tokenpak report --json
Returns:
- input_tokens: Tokens you actually sent to the API (after compression)
- saved_tokens: Tokens we didn't send (already cached or compressed)
- compression_ratio: How aggressively we compressed your context
- cost_saved: Estimated dollar amount saved
- cache_hit_rate: Share of your requests that hit the cache
Because these are computed from your own traffic, they are the authoritative measure of what TokenPak does for you — far more reliable than any generic headline figure.
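To turn the report fields into a single savings rate, you can divide the tokens that were not sent by the tokens you would have sent without TokenPak. The sketch below assumes the report is a flat JSON object carrying the `input_tokens` and `saved_tokens` fields described above; the real layout of `tokenpak report --json` may differ (for example, it may nest values per model), and the sample values are made up for illustration.

```python
import json


def savings_rate(report):
    """Fraction of would-be input tokens that were never sent.

    Assumes a flat object with `input_tokens` (sent after compression)
    and `saved_tokens` (not sent); the real report layout may differ.
    """
    sent = report["input_tokens"]
    saved = report["saved_tokens"]
    total = sent + saved  # tokens you would have sent without TokenPak
    return saved / total if total else 0.0


# Made-up values for illustration only; not a measured result.
sample = json.loads('{"input_tokens": 750, "saved_tokens": 250}')
print(f"{savings_rate(sample):.0%}")  # prints 25%
```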
Example: Agent Loop
Consider an agent that:
- Takes a user question
- Searches a knowledge base
- Calls Claude several times to refine the answer
Without TokenPak: Every Claude call re-sends the full search context, so you pay for the same large context on each call.
With TokenPak: The first call sends the full context; subsequent calls reuse it from cache instead of re-sending it. The more calls reuse the same context, the larger the cumulative saving.
This is the workload shape where TokenPak helps most — but the actual saving depends on how much context repeats across your calls. Run tokenpak report --json against your own agent to see the real figure.
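The shape of that saving can be sketched with a deliberately simplified model. It assumes that, with reuse, the shared context is sent once and later calls send none of it; real cache behavior is likely less clean, and the token counts below are illustrative inputs rather than measurements. The function name `context_tokens_sent` is hypothetical.

```python
def context_tokens_sent(context_tokens, calls, reuse=False):
    """Context tokens sent across `calls` calls in a simplified model.

    Assumption: with reuse, the context is sent once and later calls
    send none of it. Without reuse, every call re-sends it in full.
    """
    if calls <= 0:
        return 0
    return context_tokens if reuse else context_tokens * calls


# Illustrative inputs, not measurements.
without = context_tokens_sent(10_000, calls=5)
with_reuse = context_tokens_sent(10_000, calls=5, reuse=True)
print(without, with_reuse)  # prints 50000 10000
```

In this model the gap grows linearly with the number of calls that share the same context, which is why repeat-heavy loops benefit most.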
Setup Options
Option 1: Proxy (Recommended)
Sit TokenPak between your code and the LLM API. No code changes.
tokenpak serve --port 8766
Then swap one URL in your client:
client = Anthropic(base_url="http://localhost:8766")
Pros: Works with any SDK, automatic for all requests, easy to scale
Cons: One extra network hop; the added latency depends on your deployment path (run the proxy on the same machine/network to minimize it)
Option 2: SDK Mode
Call TokenPak's compression directly in your code.
from tokenpak import HeuristicEngine
engine = HeuristicEngine()
compressed = engine.compress(long_context, target_tokens=2048)
# Send your LLM request with the compressed context
Pros: Fine-grained control, no proxy overhead, works offline
Cons: Requires code changes, manual compression at call sites
Option 3: Hybrid
Proxy for most requests + SDK mode for special cases (cost-critical paths).
Profiles: Tune Savings vs. Risk
TokenPak ships with compression profiles tuned for different workloads. Heavier compression trades some fidelity for larger savings; lighter compression prioritizes fidelity.
| Profile | Compression | Risk | Use Case |
|---|---|---|---|
| safe | Light | Very low | Production, high-stakes queries |
| balanced | Medium | Low | General workloads (default) |
| aggressive | Strong | Medium | Batch processing, bulk summarization |
| agentic | Medium-strong | Low–medium | Agent loops, tool use, reasoning |
Set your profile:
export TOKENPAK_PROFILE=balanced # default
tokenpak serve
Or per-request:
# This request uses aggressive compression
response = client.messages.create(
    model="claude-opus-4-8",
    max_tokens=1024,  # required by the Anthropic SDK; choose a value that fits your use case
    messages=[...],
    extra_headers={"X-TokenPak-Profile": "aggressive"},
)
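If several call sites set these headers, a small helper can keep profile names consistent and catch typos early. The sketch below is hypothetical and not part of TokenPak: the function `tokenpak_headers` is invented for illustration, while the header names `X-TokenPak-Profile` and `X-TokenPak-Bypass` and the four profile names are taken from this page. It also assumes that a bypassed request needs no profile header.

```python
# Profile names as listed in the profile table on this page.
PROFILES = {"safe", "balanced", "aggressive", "agentic"}


def tokenpak_headers(profile=None, bypass=False):
    """Build per-request headers; a hypothetical helper, not TokenPak's API."""
    if bypass:
        # Assumption: when bypassing, a profile header is unnecessary.
        return {"X-TokenPak-Bypass": "true"}
    if profile is None:
        return {}  # fall back to the server-side default profile
    if profile not in PROFILES:
        raise ValueError(f"unknown profile: {profile!r}")
    return {"X-TokenPak-Profile": profile}


print(tokenpak_headers("aggressive"))  # prints {'X-TokenPak-Profile': 'aggressive'}
```

The returned dictionary can be passed as `extra_headers` in the request shown above.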
Estimating Your ROI
To estimate your own return, measure first, then extrapolate:
- Run TokenPak over a representative slice of your traffic.
- Read your measured saving from tokenpak savings / tokenpak report --json.
- Apply that measured rate to your monthly LLM spend.
Deploying the proxy is low-effort — typically a single URL swap in your client — so you can measure your real savings rate before committing to a wider rollout.
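The extrapolation step is simple arithmetic, sketched below. It assumes the slice you measured is representative of your monthly traffic and that the measured rate applies to the part of your bill TokenPak can affect; the function name `estimated_monthly_saving` is hypothetical and the inputs are illustrative, not measured figures.

```python
def estimated_monthly_saving(monthly_spend, measured_rate):
    """Extrapolate a measured savings rate (0.0 to 1.0) to monthly spend."""
    if not 0.0 <= measured_rate <= 1.0:
        raise ValueError("measured_rate must be between 0 and 1")
    return monthly_spend * measured_rate


# Illustrative inputs only: a monthly spend of 2000 and a measured rate of 0.25.
print(estimated_monthly_saving(2000, 0.25))  # prints 500.0
```

Treat the result as a rough planning number and re-measure after any significant change in traffic mix or profile.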
Caveats & Tradeoffs
When Savings Are Highest
- ✅ Agent loops and multi-turn workflows
- ✅ Batch processing with repeated contexts
- ✅ Knowledge-base lookups with large doc chunks
- ✅ Codebase indexing and semantic search
When Savings Are Lower
- ⚠️ One-off requests (no cache hits, no dedup)
- ⚠️ Highly unique contexts (compression is less effective)
- ⚠️ Provider-cached flows (the provider already discounts repeated context, so incremental gains are smaller)
- ⚠️ Streaming responses (cache benefits hit less often)
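Quality Tradeoffs
TokenPak is designed to be semantically lossless in safe and balanced modes, which the profile table rates as very low and low risk respectively. The design goals for these modes are: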
- No information loss
- No hallucinations introduced
- All model capabilities preserved
In aggressive mode, TokenPak compresses more heavily to favor cost savings, which can affect accuracy on some tasks. Test it on your workload before relying on it.
Next Steps
- Start simple: tokenpak serve + swap one URL
- Measure: Run tokenpak savings / tokenpak stats after a few requests
- Optimize: Test different profiles with your workload
- Scale: Deploy to production when comfortable
Questions?
- How do I verify the savings are real? → Check tokenpak savings, tokenpak stats, or tokenpak report --json for a token-by-token breakdown of your own traffic
- Will this slow down my requests? → The proxy adds a network hop; the added latency depends on your deployment path (run it on the same machine/network to minimize it). SDK mode adds no network hop.
- Can I bypass TokenPak for specific requests? → Yes, set header X-TokenPak-Bypass: true
- What if the LLM needs the exact original tokens? → Use bypass header or switch to safe profile
- Does this work with streaming? → Yes, with caveats (cache hits are less frequent in stream mode)
See Troubleshooting for more.