# Compression: How It Works
TokenPak intercepts LLM requests on your machine and applies a multi-stage compression pipeline before forwarding them to the provider. The result is semantically equivalent content in fewer tokens.
## When Compression Runs
Compression activates only when the request meets or exceeds a token threshold (default: 4,500 tokens). Requests below the threshold are forwarded unchanged, with no added overhead.
```
Request received
│
├── input_tokens < threshold → passthrough (0ms overhead)
│
└── input_tokens ≥ threshold → compression pipeline
```
Adjust the threshold:
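A sketch, assuming a `TOKENPAK_THRESHOLD` environment variable analogous to `TOKENPAK_MODE` (the variable name is an assumption; check your configuration reference for the exact setting):

```python
import os

# Hypothetical variable name; verify against your TokenPak configuration docs.
os.environ["TOKENPAK_THRESHOLD"] = "8000"  # compress only requests at or above ~8k tokens
```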
## The Pipeline

### Stage 1: Dedup
Scans the message history for duplicate or near-duplicate turns. If the same content appears multiple times (common when context is repeatedly injected), duplicates after the first occurrence are removed.
```python
# Before
messages = [
    {"role": "user", "content": "Here is the code:\n<500 lines>"},
    {"role": "assistant", "content": "I'll review it."},
    {"role": "user", "content": "Here is the code:\n<500 lines>"},  # ← duplicate
    {"role": "user", "content": "Now fix line 42."},
]

# After dedup
messages = [
    {"role": "user", "content": "Here is the code:\n<500 lines>"},
    {"role": "assistant", "content": "I'll review it."},
    {"role": "user", "content": "Now fix line 42."},
]
```
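Exact-duplicate removal can be as simple as first-occurrence filtering on a content hash. A minimal sketch (assumed behavior; TokenPak's near-duplicate detection needs fuzzier matching than shown here):

```python
import hashlib

def dedup_exact(messages: list[dict]) -> list[dict]:
    """Keep the first occurrence of each (role, content) pair, drop later repeats."""
    seen: set[str] = set()
    kept: list[dict] = []
    for msg in messages:
        key = hashlib.sha256(f"{msg['role']}:{msg['content']}".encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(msg)
    return kept
```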
### Stage 2: Segmentize
The segmentizer classifies message content into typed blocks:
| Segment type | Content | Compression strategy |
|---|---|---|
| `code` | Fenced code blocks | Signature extraction |
| `markdown` | Headers, lists, prose | Sentence filtering |
| `json` | JSON objects/arrays | Schema + sampling |
| `tool_call` | Tool use / function calls | Keep as-is |
| `tool_result` | Tool outputs | Truncation |
| `system` | System prompt | Recipe-based |
| `text` | Plain prose | Token filtering |
Each segment carries metadata: estimated token count, language (for code), and priority score.
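In code, a segment might look like the following sketch (field names are assumptions based on the metadata above, not TokenPak's exact definition):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Segment:
    type: str                # one of the segment types in the table above
    content: str             # the raw block text
    tokens_est: int          # estimated token count
    language: Optional[str]  # set for code segments, e.g. "python"
    priority: float          # higher = more important to preserve
```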
### Stage 3: Directives

Directives are declarative compression instructions attached to a recipe. Each directive targets a segment type and describes what to do.

Example directive (`recipes/oss/code-review.yaml`):
```yaml
directives:
  - type: code
    action: signature_only      # keep function signatures, strip bodies
    language: [python, js, ts]
    preserve_docstrings: true
  - type: markdown
    action: keep_headers        # strip body text, keep heading structure
    max_depth: 3
  - type: text
    action: filter_tokens
    ratio: 0.6                  # keep top 60% by importance score
```
Built-in recipes live in `recipes/oss/`. Pro recipes add more aggressive options.
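To make `signature_only` concrete, here is a minimal Python-only version built on the standard `ast` module (a sketch, not TokenPak's implementation: it flattens nesting and assumes single-line signatures):

```python
import ast

def signatures_only(source: str) -> str:
    """Keep def/class headers (and docstrings, as with preserve_docstrings);
    drop the bodies."""
    src_lines = source.splitlines()
    out = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            out.append(src_lines[node.lineno - 1].rstrip())  # the "def ..." / "class ..." line
            doc = ast.get_docstring(node)
            if doc:
                out.append(f'    """{doc}"""')
            out.append("    ...")
    return "\n".join(out)
```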
## Result

After the pipeline, the `PipelineResult` object contains:
```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class PipelineResult:
    messages: List[Dict]     # compressed messages (same format, fewer tokens)
    segments: List[Segment]  # per-segment metadata
    tokens_raw: int          # tokens before compression
    tokens_after: int        # tokens after compression
    duration_ms: float       # pipeline wall time
    stages_run: List[str]    # which stages ran

    @property
    def savings_pct(self) -> float: ...
```
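For example (illustrative only; assumes a pipeline was constructed and run as shown under Custom Hooks below):

```python
result = pipeline.run(messages)
saved = result.tokens_raw - result.tokens_after
print(f"Saved {saved} tokens ({result.savings_pct:.1f}%) "
      f"in {result.duration_ms:.1f}ms via {' → '.join(result.stages_run)}")
```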
## Compression Modes

| Mode | `TOKENPAK_MODE` | Behavior |
|---|---|---|
| Hybrid (default) | `hybrid` | Compresses when tokens ≥ threshold; skips if below |
| Strict | `strict` | Always compresses, no threshold check |
| Aggressive | `aggressive` | Maximum compression; accepts some quality reduction |
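For example, to force compression on every request (mode names and values per the table above):

```python
import os

# "hybrid" is the default; "strict" always runs the pipeline.
os.environ["TOKENPAK_MODE"] = "strict"
```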
## Engines

### Heuristic engine (default)

Rule-based compression. Runs in <5ms with zero ML dependencies. Handles:

- Regex-based whitespace normalization
- Comment stripping (configurable per language)
- Boilerplate removal (common patterns: `# type: ignore`, `pylint: disable=...`)
- Markdown flattening
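The first item, as a minimal sketch (assumed rules; TokenPak's actual normalization may differ):

```python
import re

# Pre-compiled patterns, per the performance notes at the end of this page.
_TRAILING_WS = re.compile(r"[ \t]+$", re.MULTILINE)
_BLANK_RUNS = re.compile(r"\n{3,}")

def normalize_whitespace(text: str) -> str:
    """Strip trailing spaces and collapse runs of blank lines."""
    return _BLANK_RUNS.sub("\n\n", _TRAILING_WS.sub("", text))
```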
### LLMLingua engine (optional, Pro/advanced)
ML-powered token-level compression using the LLMLingua-2 model. Achieves 2–20x compression with <5% quality loss (per Microsoft benchmarks).
Install:
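LLMLingua-2 is distributed as Microsoft's `llmlingua` package on PyPI; a typical install (TokenPak may also wrap this in an optional extra) is:

```
pip install llmlingua
```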
LLMLingua activates automatically when installed. It runs locally — no API calls.
## Custom Hooks
Add your own compression logic via the pipeline hook API:
```python
from tokenpak.agent.compression.pipeline import CompressionPipeline

def my_hook(messages):
    # Remove messages older than 10 turns
    return messages[-10:]

pipeline = CompressionPipeline()
pipeline.add_hook(my_hook)
result = pipeline.run(messages)
```
Hooks run after the standard stages in insertion order.
## Recipe Development

Recipes are YAML files in `recipes/oss/` that define directives for a specific use case.

Minimal recipe:
```yaml
# recipes/oss/my-recipe.yaml
name: my-recipe
version: "1.0"
description: Custom compression for my workflow

directives:
  - type: text
    action: filter_tokens
    ratio: 0.7
  - type: code
    action: signature_only
    language: [python]
```
Apply a recipe:
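A sketch, assuming recipe selection via an environment variable analogous to `TOKENPAK_MODE` (the variable name is hypothetical; see the reference below for the real mechanism):

```python
import os

# Hypothetical variable name; consult the Recipe Development reference.
os.environ["TOKENPAK_RECIPE"] = "my-recipe"
```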
See Recipe Development for the full directive schema reference.
## Dry Run
Preview what compression would do without actually sending the request:
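Because the pipeline itself never contacts a provider, you can also preview from Python with the API shown under Custom Hooks (assuming `messages` holds the request payload; the summary printed here mirrors the sample output below):

```python
from tokenpak.agent.compression.pipeline import CompressionPipeline

pipeline = CompressionPipeline()
result = pipeline.run(messages)  # runs locally; nothing is forwarded

print(f"Input:  {result.tokens_raw:,} tokens")
print(f"Output: {result.tokens_after:,} tokens")
print(f"Saved:  {result.tokens_raw - result.tokens_after:,} tokens ({result.savings_pct:.1f}%)")
print(f"Time:   {result.duration_ms:.1f}ms")
print("Stages:", " → ".join(result.stages_run))
```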
Output:
```
Input:  12,840 tokens
Output: 6,918 tokens
Saved:  5,922 tokens (46.1%)
Time:   8.4ms
Stages: dedup (0 removed) → segmentize (14 blocks) → directives (applied)
```
## Performance
| Optimization | Speedup |
|---|---|
| LRU token count cache | 25x faster repeated counting |
| Pre-compiled regex | 30% faster processing |
| Batch SQLite WAL writes | 60% faster telemetry |
Compression runs in the request path. On typical payloads it adds 10–50ms, which is negligible compared to LLM latency (500ms–5s).
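To picture the first optimization, here is a minimal sketch of an LRU-cached token counter (`tiktoken` is assumed as the tokenizer; TokenPak's actual counter may differ):

```python
from functools import lru_cache

import tiktoken  # assumed tokenizer, not necessarily what TokenPak uses

_enc = tiktoken.get_encoding("cl100k_base")

@lru_cache(maxsize=4096)
def count_tokens(text: str) -> int:
    # Identical strings recur across pipeline stages; repeat counts hit the
    # cache instead of re-encoding.
    return len(_enc.encode(text))
```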