Save 35% on LLM Tokens¶
Real numbers, practical setup, and a cost savings calculator.
If you use LLMs heavily — running coding agents, summarizing documents, doing research — token costs add up fast. The math is straightforward:
- Claude 3.5 Sonnet: $3 per million input tokens
- GPT-4o: $2.50 per million input tokens
- At 500 requests/day × 4,000 tokens each: $5–6/day, or $150–180/month
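The arithmetic behind those figures can be checked in a couple of lines (a throwaway helper for illustration, not part of TokenPak):

```python
def monthly_cost(requests_per_day, tokens_per_request, price_per_million, days=30):
    """Monthly input-token cost in dollars."""
    return requests_per_day * days * tokens_per_request * price_per_million / 1_000_000

# 500 requests/day x 4,000 tokens each:
print(monthly_cost(500, 4000, 2.50))  # 150.0 (GPT-4o)
print(monthly_cost(500, 4000, 3.00))  # 180.0 (Claude 3.5 Sonnet)
```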
TokenPak cuts that by compressing your prompts before they hit the API. Here's what that looks like in practice.
Real Numbers from Production¶
We benchmarked TokenPak on a mixed workload of coding, writing, and operational tasks across a 572-file vault:
| Scenario | Avg tokens/request | With TokenPak | Reduction |
|---|---|---|---|
| Code review | 8,200 | 5,330 | 35% |
| Doc summarization | 6,800 | 3,740 | 45% |
| Ops / runbook queries | 4,100 | 2,870 | 30% |
| QMD + TokenPak (combined) | 20,801 | 3,265 | 84% |
The 84% figure comes from pairing TokenPak with Query-Matched Decoding (QMD): QMD first trims the 20,801-token context to 6,136 tokens, and TokenPak compresses that to 3,265. With TokenPak alone, expect 30–45% on typical workloads, and more on verbose codebases and documentation-heavy contexts.
Before and After: Real Examples¶
Code Review Request¶
Before TokenPak (4,231 tokens):
Please review this Python module. Here is the full content:
#!/usr/bin/env python3
"""
Module for handling authentication.
This module provides comprehensive authentication functionality including
user login, session management, token generation, and validation. It supports
multiple authentication backends including database, LDAP, and OAuth2.
The module is designed to be extensible and can be integrated with any
web framework. See the documentation at docs.example.com for full details.
Author: Engineering Team
Last updated: 2026-01-15
Version: 2.4.1
"""
import hashlib
import hmac
import os
import time
# ... (hundreds of lines of commented, verbose code) ...
After TokenPak (2,847 tokens — 33% reduction):
Please review this Python module. Here is the full content:
import hashlib
import hmac
import os
import time
# [module header compressed: auth module v2.4.1]
# ... (same logic, stripped of verbose docstrings and redundant comments) ...
Same request. Same code. 33% fewer tokens. The LLM's answer is identical.
Documentation Query¶
Before (6,100 tokens): A Markdown file pasted into context with extensive formatting, nested bullet points, version history notes, and HTML comments.
After (3,350 tokens — 45% reduction): Same content, with collapsed whitespace, stripped HTML comments, normalized bullet depth, and deduplicated section headers.
The meaning is preserved. The noise is gone.
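A minimal sketch of this kind of Markdown cleanup, using regexes over the raw text (illustrative only; TokenPak's actual recipes also handle bullet depth and duplicated headers):

```python
import re

def clean_markdown(text: str) -> str:
    """Strip HTML comments and collapse runs of blank lines (toy sketch)."""
    text = re.sub(r"<!--.*?-->", "", text, flags=re.DOTALL)  # drop HTML comments
    text = re.sub(r"\n{3,}", "\n\n", text)                   # collapse blank-line runs
    return text.strip()

doc = "# Title\n\n\n\n<!-- internal note -->\nBody text."
print(clean_markdown(doc))  # "# Title\n\nBody text."
```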
Setup in 5 Minutes¶
# 1. Install
pip install tokenpak
# 2. Start the proxy
tokenpak serve --port 8766
# 3. Point your LLM client at it
export ANTHROPIC_BASE_URL=http://localhost:8766
# or for OpenAI:
export OPENAI_BASE_URL=http://localhost:8766/v1
# 4. Verify
tokenpak status
That's it. Your existing workflow is unchanged. Every request now runs through the compression pipeline.
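If you'd rather set the base URLs from code than from your shell profile, the same redirection can be done before constructing your SDK client (a stdlib-only sketch; both SDKs read these variables at client construction time):

```python
import os

# Route SDK traffic through the local TokenPak proxy from step 2.
os.environ["ANTHROPIC_BASE_URL"] = "http://localhost:8766"
os.environ["OPENAI_BASE_URL"] = "http://localhost:8766/v1"

print(os.environ["OPENAI_BASE_URL"])  # http://localhost:8766/v1
```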
Cost Savings Calculator¶
Use this to estimate your monthly savings:
Inputs:
- Daily requests: R
- Average input tokens per request: T
- Your model's price per million input tokens: P
- Expected compression rate: C (0.35 is typical; use 0.30 to be conservative)
Formula:
Monthly cost without TokenPak: R × 30 × T × P / 1,000,000
Monthly cost with TokenPak: R × 30 × T × (1 - C) × P / 1,000,000
Monthly savings: R × 30 × T × C × P / 1,000,000
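The three formulas translate directly into a small helper (hypothetical code, shown here to make the calculator concrete):

```python
def monthly_savings(requests_per_day, tokens_per_request, price_per_million,
                    compression_rate, days=30):
    """Return (cost_before, cost_after, savings) in dollars for one month."""
    before = requests_per_day * days * tokens_per_request * price_per_million / 1_000_000
    after = before * (1 - compression_rate)
    return before, after, before - after

# 500 req/day x 4,000 tokens on Claude Sonnet ($3/M), C = 0.35:
print(monthly_savings(500, 4000, 3.0, 0.35))  # (180.0, 117.0, 63.0)
```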
Examples:
| Daily requests | Avg tokens | Model | Monthly before | Monthly after | Savings |
|---|---|---|---|---|---|
| 100 | 4,000 | Claude Sonnet ($3/M) | $36 | $23.40 | $12.60 |
| 500 | 4,000 | Claude Sonnet ($3/M) | $180 | $117 | $63 |
| 100 | 8,000 | GPT-4o ($2.50/M) | $60 | $39 | $21 |
| 500 | 8,000 | GPT-4o ($2.50/M) | $300 | $195 | $105 |
| 1,000 | 6,000 | Claude Sonnet ($3/M) | $540 | $351 | $189 |
At 500 daily requests on Claude Sonnet, that's $63 back every month for a five-minute setup.
What Gets Compressed (and What Doesn't)¶
TokenPak uses typed recipes — it knows the difference between Python code, Markdown docs, JSON configs, and prose. Each type gets appropriate treatment:
| Content type | Typical reduction | What's removed |
|---|---|---|
| Python files | 20–35% | Docstrings, blank lines, type annotations (optional) |
| Markdown | 30–50% | Excessive formatting, repeated headers, HTML comments |
| JSON | 15–25% | Whitespace (minification) |
| Generic prose | 10–20% | Filler phrases, redundant whitespace |
| Shell scripts | 15–25% | Comments, blank lines |
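To illustrate what type-aware treatment means (a toy sketch, not TokenPak's real recipes), a dispatcher might strip comments and blank lines from Python while only minifying JSON:

```python
import json

def compress_python(src: str) -> str:
    """Drop full-line comments and blank lines; code lines are left untouched.
    Naive on purpose: real tooling must parse, since '#' can appear in strings."""
    lines = [line for line in src.splitlines()
             if line.strip() and not line.strip().startswith("#")]
    return "\n".join(lines)

def compress_json(src: str) -> str:
    """Minify by re-serializing without whitespace."""
    return json.dumps(json.loads(src), separators=(",", ":"))

RECIPES = {"py": compress_python, "json": compress_json}

def compress(src: str, kind: str) -> str:
    # Unknown types pass through unchanged.
    return RECIPES.get(kind, lambda s: s)(src)

print(compress("# setup\n\nx = 1\n", "py"))     # x = 1
print(compress('{ "a": 1,  "b": 2 }', "json"))  # {"a":1,"b":2}
```

The dispatch table is the point: each content type gets its own recipe, and anything unrecognized is passed through untouched.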
What's never touched:
- Code logic and syntax (we don't rewrite code)
- Your prompt structure (questions, instructions)
- Output you explicitly request
The compression is transparent: TokenPak keeps a local record of exactly what was removed from each request.
Track Your Savings¶
TokenPak records everything locally. Check your savings anytime:
tokenpak savings
# This month: saved 142,000 tokens (~$0.43) via compression
tokenpak cost --week --by-model
# GPT-4o: $8.20 (saved 38%)
# Claude Sonnet: $6.10 (saved 41%)
Or view the dashboard at http://localhost:8766/dashboard while the proxy is running.
Optimize Further¶
1. Set a monthly budget¶
2. Route cheap queries to cheaper models¶
# Send test/debug queries to a cheaper model
tokenpak route set ".*test.*" gpt-4o-mini
tokenpak route set ".*debug.*" claude-haiku-3-5
Routing a 4,000-token test request from GPT-4o ($0.01) to GPT-4o-mini ($0.0006) saves 94% on that request alone.
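Under the hood this is pattern matching on the prompt; a toy version of the same idea (the real rules live in the `tokenpak` proxy, and these model names are just the ones from the commands above):

```python
import re

ROUTES = [
    (re.compile(r".*test.*"), "gpt-4o-mini"),
    (re.compile(r".*debug.*"), "claude-haiku-3-5"),
]

def pick_model(prompt: str, default: str = "gpt-4o") -> str:
    """Return the first matching route's model, else the default."""
    for pattern, model in ROUTES:
        if pattern.search(prompt):
            return model
    return default

print(pick_model("please run the test suite"))  # gpt-4o-mini
print(pick_model("summarize this RFC"))         # gpt-4o
```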
3. Combine with vault indexing¶
Instead of pasting large files into context, index your codebase and query it semantically:
tokenpak index ~/project
tokenpak vault search "authentication middleware"
# Returns: the 3 most relevant functions, ~500 tokens total
# Instead of: the whole auth module, ~8,000 tokens
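The retrieval savings are easy to quantify with the figures above:

```python
full_context = 8000  # pasting the whole auth module
retrieved = 500      # the 3 most relevant functions from the vault

# Fraction of input tokens avoided by retrieving instead of pasting:
print(f"{1 - retrieved / full_context:.0%}")  # 94%
```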
4. Calibrate for your hardware¶
If you're indexing large vaults, run calibration once. It profiles your machine and sets optimal parallelism, making indexing 50–100x faster.
The Math Over Time¶
Compression savings accumulate month after month:
- Month 1 at 500 req/day: save $63
- Month 6: save $378 total (same rate)
- Year 1: save $756
With vault indexing reducing context size even further, and smart routing handling the cheapest queries with the cheapest models, realistic users with heavy workloads save 40–60% on their total LLM spend.
Get Started¶
pip install tokenpak
tokenpak serve
# Point your client at http://localhost:8766
tokenpak cost --today # check savings after first day
All benchmarks from internal testing on a 572-file mixed-language vault. Your results will vary based on content type and prompt patterns. Compression rates are estimates; TokenPak only compresses when it improves the cost/quality ratio.