TokenPak Compression Tuning Guide¶
This guide explains how to tune TokenPak's compression engine to maximize token savings while minimizing latency impact for your specific workload.
Overview: Why Compression Matters¶
LLM API costs scale with token count. TokenPak's compression pipeline intercepts requests and semantically equivalent content with 2–8% fewer tokens, depending on your data.
The Math¶
- Without compression: 10,000-token request = $0.30 (Sonnet)
- With 5% compression: 9,500 tokens = $0.285 (saves $0.015 per request)
- At 100 requests/day: ~$5/month saved (or 5% cost reduction)
The Tradeoff¶
Compression has latency cost:
| Strategy | Latency | Token Savings | Best For |
|---|---|---|---|
| Dedup | <1ms | 2–3% (on repeated context) | Iterative workflows, session context |
| Segmentation | <2ms | 1–2% (metadata, structure) | Code review, doc analysis |
| Alias compression | <5ms | 3–5% (long repeated names) | Large schemas, entity lists |
| Instruction table | <10ms | 4–6% (cookbook patterns) | Repetitive tasks, templates |
| Semantic caching (off) | N/A | 15–40% (prompt cache hit) | Same prompts, different inputs |
Default (all enabled): ~5ms latency for ~3–5% savings. Acceptable for most workloads.
Compression Strategies: How to Use Each¶
Strategy 1: Dedup (Fast, Safe)¶
What it does: Removes duplicate message turns from conversation history.
When it helps: - Iterative debugging (code repeatedly pasted) - Multi-turn conversations where context is re-injected - Workflow loops where the same block appears multiple times
Real example:
Message 1: "Here's the current auth schema:\n<200 lines JSON>"
Message 2: "Review line 42."
Message 3: [Assistant response]
Message 4: "Here's the current auth schema:\n<200 lines JSON>" ← DEDUP removes this
Message 5: "Now add refresh tokens."
Savings: 1–3% (depends on how often context repeats)
Configuration (in proxy config):
# tokenpack/proxy.py
pipeline = CompressionPipeline(
enable_dedup=True, # ← enable/disable
enable_segmentation=True,
enable_alias=True,
enable_directives=True,
)
When to disable: Single-turn requests (queries, completions). No benefit, adds latency.
Strategy 2: Segmentation (Safe, Structural)¶
What it does: Classifies message content into typed blocks (code, markdown, JSON, tool results, etc.) and applies targeted compression to each type.
Strategies per segment type:
| Segment Type | What Gets Compressed | Savings | Risk |
|---|---|---|---|
| Code | Signature extraction, docstring keep | 5–8% | Low (retains logic) |
| Markdown | Keep headers, strip body text | 3–6% | Medium (loses details) |
| JSON | Schema + sample data (strip repetitive rows) | 4–7% | Medium (loses volume) |
| Tool results | Truncation (keep first N lines) | 2–4% | Low (summaries) |
| Text/prose | Token filtering by importance | 3–5% | High (selective) |
Real example — code compression:
# Before (28 tokens)
def calculate_total(items):
"""Calculate the sum of item values."""
result = 0
for item in items:
result += item['price']
return result
# After (signature only, 8 tokens)
def calculate_total(items): ...
"""Calculate the sum of item values."""
Configuration (in proxy.py):
pipeline = CompressionPipeline(
enable_segmentation=True, # ← enable/disable
enable_dedup=True,
enable_alias=True,
enable_directives=True,
)
# Optionally provide a recipe (directives)
# See recipes/oss/*.yaml for examples
When to disable: If you need full code bodies preserved (not just signatures). Saves 2ms latency but gives up 3–6% compression.
Strategy 3: Alias Compression (Moderate)¶
What it does: Detects long repeated names/entities (variable names, long strings, UUIDs) and replaces them with short aliases.
Real example:
Before:
"The ManagerInterface.process_authentication_token() method..."
"Then ManagerInterface.process_authentication_token() handles..."
"Finally ManagerInterface.process_authentication_token() returns..."
After:
"The A1() method..."
"Then A1() handles..."
"Finally A1() returns..."
Mapping: A1 → ManagerInterface.process_authentication_token
When it helps: - Long class/function names repeated 3+ times - Domain-specific acronyms or entity names - Code with verbose variable names
Savings: 3–5% (depends on repetition and name length)
Configuration:
pipeline = CompressionPipeline(
enable_alias=True, # ← enable/disable
alias_min_occurrences=3, # minimum times to alias
alias_min_length=20, # minimum name length to alias
enable_dedup=True,
enable_segmentation=True,
)
Tuning parameters:
- alias_min_occurrences=2 → more aggressive, catch 2+ repeats
- alias_min_occurrences=5 → conservative, only high-frequency names
- alias_min_length=15 → catch shorter names
- alias_min_length=30 → only very long names
When to disable: If output is sent to users (aliases make it unreadable). Safe to disable; minimal latency impact.
Strategy 4: Instruction Table (Advanced)¶
What it does: Uses a persistent table of common instructions and replaces repetitive task descriptions with references.
Real example:
Before:
"You are a code reviewer. Your job is to find bugs, suggest improvements,
enforce style consistency, and suggest refactoring opportunities..."
After:
"Apply instruction [CODE-REVIEW-V2]"
Lookup table maps [CODE-REVIEW-V2] → full instruction text
When it helps: - Batch processing (same role repeated 10+ times) - Service agents (standard prompts) - Workflows with template instructions
Savings: 4–8% (depends on instruction repetition)
Configuration:
pipeline = CompressionPipeline(
enable_instruction_table=True, # ← enable/disable
instruction_table_path="path/to/instruction.db", # optional custom table
context_budget_tight=True, # aggressive mode
enable_dedup=True,
enable_segmentation=True,
enable_alias=True,
)
How to add instructions:
# In your code:
from tokenpak.agent.compression.instruction_table import InstructionTable
table = InstructionTable(path="instruction.db")
table.add_instruction(
id="CODE-REVIEW-V2",
text="You are a code reviewer...",
)
When to disable: One-shot requests, unique prompts. Overhead > savings for low-repetition tasks.
Strategy 5: Semantic Caching (Native to Claude API)¶
What it does: Reuses cached prompt prefixes when subsequent requests have similar context.
How it works: - First request with context → stored in Claude's cache (5 min TTL, by default) - Identical or very similar context → reuses cached tokens at ~10% cost
Real example:
Request 1: "Here's the codebase:\n<50KB context>" → 12 cache creation tokens
Request 2: "Same codebase, different question" → 24 cache read tokens (10% cost)
Savings: (12 - 2.4) tokens per request = ~80% on that chunk
Savings: 15–40% (only on repeated prefix, but huge when it hits)
How to enable (in your client code):
import anthropic
client = anthropic.Anthropic()
response = client.messages.create(
model="claude-opus-4-6",
max_tokens=1024,
system=[
{
"type": "text",
"text": "You are a code reviewer. [... static system prompt ...]",
"cache_control": {"type": "ephemeral"} # ← enable caching
}
],
messages=[
{
"role": "user",
"content": [
{
"type": "text",
"text": "Here's the full codebase:\n" + large_code,
"cache_control": {"type": "ephemeral"} # ← cache this too
}
]
}
]
)
# Subsequent requests with same codebase will hit the cache
When to use: - System prompts (static, reused 100% of the time) - Large context blocks (code, docs, schemas) used in multiple requests - Batch workflows where the same context applies to different questions
When NOT to use: - One-off requests - Context that changes every turn
Performance Characteristics: Latency vs Savings¶
Measured on Fleet (March 2026 Benchmark)¶
| Agent | Compression Mode | Token Savings | P50 Latency | P99 Latency |
|---|---|---|---|---|
| Trix | All enabled (default) | 2.8% | 5.2ms | 12ms |
| Trix | Dedup + Segment only | 2.1% | 2.1ms | 5ms |
| Trix | Dedup only | 1.2% | 0.8ms | 2ms |
| Sue | All enabled | 2.2% | 6.1ms | 14ms |
| Cali | All enabled | 2.8% | 4.9ms | 11ms |
Analysis: - Dedup: <1ms overhead, 1–2% savings (always worth it) - Segmentation: <2ms overhead, 1–2% savings (usually worth it) - Alias: <5ms overhead, 3–5% savings (worth it for code-heavy workloads) - Instruction table: <10ms overhead, 4–6% savings (worth it for batch/service work)
Configuration: Copy-Paste Examples¶
Example 1: Lightweight (Low Latency)¶
Use this for real-time chat, quick queries.
# In proxy.py
pipeline = CompressionPipeline(
enable_dedup=True,
enable_segmentation=False,
enable_alias=False,
enable_instruction_table=False,
enable_directives=False,
)
Tradeoff: <2ms latency, 1–2% savings.
Example 2: Balanced (Default)¶
Use this for general workloads (development, analysis).
# In proxy.py
pipeline = CompressionPipeline(
enable_dedup=True,
enable_segmentation=True,
enable_alias=True,
enable_instruction_table=False,
enable_directives=True,
)
Tradeoff: ~5ms latency, 3–5% savings.
Example 3: Aggressive (High Savings)¶
Use this for batch work, background jobs, offline analysis.
# In proxy.py
pipeline = CompressionPipeline(
enable_dedup=True,
enable_segmentation=True,
enable_alias=True,
enable_instruction_table=True,
enable_directives=True,
context_budget_tight=True,
alias_min_occurrences=2, # catch more aliases
alias_min_length=15, # shorter names too
)
Tradeoff: ~10–15ms latency, 5–8% savings.
Example 4: Code Review Specialized¶
Optimized for code review tasks.
# In proxy.py
pipeline = CompressionPipeline(
enable_dedup=True,
enable_segmentation=True,
enable_alias=True,
enable_instruction_table=True,
enable_directives=True,
)
# Add custom hook for code-specific compression
def code_priority_hook(messages):
"""Keep code segments, compress narrative text."""
for msg in messages:
# Custom logic here
pass
return messages
pipeline.add_hook(code_priority_hook)
Tradeoff: ~8ms latency, 6–8% savings on code.
Tuning Checklist¶
When you want to optimize compression for YOUR workload:
- Profile your requests: What's the typical size? Code? Text? JSON?
- Set a baseline: Run a week with
enable_all=True, measure token savings. - Identify bottlenecks: Which compression stage gives the most savings? (Use
PipelineResult.stages_run) - Disable low-ROI stages: If alias compression adds 5ms for <0.5% savings, disable it.
- Batch profile: Test on 100+ requests to get real averages (single-request measurements are noisy).
- Test in production: A/B test config changes on real workloads, measure cost + latency.
Monitoring¶
# After pipeline.run(), inspect:
result = pipeline.run(messages)
print(f"Tokens saved: {result.tokens_saved} ({result.savings_pct}%)")
print(f"Latency: {result.duration_ms}ms")
print(f"Stages run: {', '.join(result.stages_run)}")
Common Tuning Questions¶
Q: "Compression makes responses slightly different. Is this safe?"¶
A: TokenPak compression is semantic-preserving. The meaning of the request/response is identical; only formatting and redundancy are removed. Safe for production.
Q: "Can I compress the response too?"¶
A: TokenPak currently compresses requests only (to LLM). Response compression would require client-side modifications. Future feature.
Q: "How much should I save?"¶
A: Typical range is 2–6% depending on workload: - Text-heavy (essays, reports): 2–3% - Code-heavy (review, analysis): 4–6% - JSON/structured: 3–5% - Real-time chat (short messages): <1%
Q: "Should I use alias compression?"¶
A: Yes, unless output is user-facing. Aliases make text unreadable in logs/exports.
Q: "How often should I update the instruction table?"¶
A: Once per week or when your templates change significantly. It's auto-reloaded every 5 minutes.
Q: "What if compression breaks something?"¶
A: File an issue on GitHub. In the meantime, disable the offending stage and continue. Compression is designed to fail gracefully.
Reference: Source Code Links¶
- Pipeline orchestrator:
packages/core/tokenpak/agent/compression/pipeline.py(line 20–150) - Dedup logic:
packages/core/tokenpak/agent/compression/dedup.py - Segmentizer:
packages/core/tokenpak/agent/compression/segmentizer.py - Alias compressor:
packages/core/tokenpak/agent/compression/alias_compressor.py(line 30–80 for tuning) - Instruction table:
packages/core/tokenpak/agent/compression/instruction_table.py - Directives applier:
packages/core/tokenpak/agent/compression/directives.py
Next Steps¶
- Start with the balanced config (Example 2 above).
- Measure token savings on your workload for 1 week.
- Adjust based on your latency tolerance: Trade off ~5% savings for <10ms latency most cases.
- Monitor regularly: Token costs shift as context size changes.
Questions? Issues? Open a GitHub issue or reach out to the TokenPak team on Slack.