Compression: How It Works

TokenPak intercepts LLM requests on your machine and applies a multi-stage compression pipeline before forwarding them to the provider. The result is semantically equivalent content in fewer tokens.


When Compression Runs

Compression only activates when the request meets or exceeds a token threshold (default: 4,500 tokens). Requests below the threshold are forwarded unchanged, with no added overhead.

Request received
       ├── input_tokens < threshold → passthrough (0ms overhead)
       └── input_tokens ≥ threshold → compression pipeline

Adjust the threshold:

TOKENPAK_COMPACT_THRESHOLD_TOKENS=2000 tokenpak serve
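
The gate can be sketched in a few lines. Here, `estimate_tokens` and its 4-characters-per-token heuristic are illustrative stand-ins, not TokenPak's actual counter:

```python
import os

DEFAULT_THRESHOLD = 4500

def estimate_tokens(messages):
    # Rough heuristic: ~4 characters per token (illustrative only)
    return sum(len(m["content"]) for m in messages) // 4

def should_compress(messages, threshold=None):
    # Environment override mirrors TOKENPAK_COMPACT_THRESHOLD_TOKENS
    if threshold is None:
        threshold = int(os.environ.get("TOKENPAK_COMPACT_THRESHOLD_TOKENS",
                                       DEFAULT_THRESHOLD))
    return estimate_tokens(messages) >= threshold
```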

The Pipeline

Stage 1: Dedup

Scans the message history for duplicate or near-duplicate turns. If the same content appears multiple times (common when context is repeatedly injected), duplicates after the first occurrence are removed.

# Before
messages = [
    {"role": "user", "content": "Here is the code:\n<500 lines>"},
    {"role": "assistant", "content": "I'll review it."},
    {"role": "user", "content": "Here is the code:\n<500 lines>"},  # ← duplicate
    {"role": "user", "content": "Now fix line 42."},
]

# After dedup
messages = [
    {"role": "user", "content": "Here is the code:\n<500 lines>"},
    {"role": "assistant", "content": "I'll review it."},
    {"role": "user", "content": "Now fix line 42."},
]
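
A minimal sketch of the dedup pass, assuming exact-match detection only (the real stage also catches near-duplicates):

```python
def dedup_messages(messages):
    """Drop duplicate turns after their first occurrence (exact-match
    sketch; TokenPak's stage also handles near-duplicates)."""
    seen = set()
    out = []
    for msg in messages:
        key = (msg["role"], msg["content"])
        if key in seen:
            continue  # duplicate turn: skip it
        seen.add(key)
        out.append(msg)
    return out
```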

Stage 2: Segmentize

The segmentizer classifies message content into typed blocks:

Segment type   Content                     Compression strategy
code           Fenced code blocks          Signature extraction
markdown       Headers, lists, prose       Sentence filtering
json           JSON objects/arrays         Schema + sampling
tool_call      Tool use / function calls   Keep as-is
tool_result    Tool outputs                Truncation
system         System prompt               Recipe-based
text           Plain prose                 Token filtering

Each segment carries metadata: estimated token count, language (for code), and priority score.
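
A simplified sketch of how such a classifier might work. The rules and the `(type, text)` output shape here are illustrative; the real segmentizer also attaches token counts, language, and priority scores:

```python
import json
import re

def segmentize(content):
    """Split message content into (type, text) blocks -- a simplified
    sketch of the classifier, covering only a few segment types."""
    segments = []
    # Peel off fenced code blocks first, keeping them as delimiters
    parts = re.split(r"(```[\s\S]*?```)", content)
    for part in parts:
        if not part.strip():
            continue
        if part.startswith("```"):
            segments.append(("code", part))
            continue
        stripped = part.strip()
        try:
            json.loads(stripped)
            segments.append(("json", part))
        except ValueError:
            # Headers or list markers suggest markdown; otherwise plain text
            if re.search(r"^#{1,6} |^[-*] ", stripped, re.MULTILINE):
                segments.append(("markdown", part))
            else:
                segments.append(("text", part))
    return segments
```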


Stage 3: Directives

Directives are declarative compression instructions attached to a recipe. Each directive targets a segment type and describes what to do.

Example directive (recipes/oss/code-review.yaml):

directives:
  - type: code
    action: signature_only     # keep function signatures, strip bodies
    language: [python, js, ts]
    preserve_docstrings: true

  - type: markdown
    action: keep_headers       # strip body text, keep heading structure
    max_depth: 3

  - type: text
    action: filter_tokens
    ratio: 0.6                 # keep top 60% by importance score

Built-in recipes live in recipes/oss/. Pro recipes add more aggressive options.
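
As an illustration, the filter_tokens action above could be applied roughly like this. The word-length importance score is a stand-in for TokenPak's real scoring model:

```python
def apply_filter_tokens(text, ratio):
    """Sketch of the filter_tokens action: keep the top `ratio` share of
    words by importance, preserving original word order. Word length is
    an illustrative proxy for a real importance score."""
    words = text.split()
    keep_n = max(1, int(len(words) * ratio))
    # Rank word positions by score, keep the survivors in original order
    ranked = sorted(range(len(words)), key=lambda i: len(words[i]), reverse=True)
    keep = set(ranked[:keep_n])
    return " ".join(w for i, w in enumerate(words) if i in keep)
```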


Result

After the pipeline, the PipelineResult object contains:

@dataclass
class PipelineResult:
    messages: List[Dict]    # compressed messages (same format, fewer tokens)
    segments: List[Segment] # per-segment metadata
    tokens_raw: int         # tokens before compression
    tokens_after: int       # tokens after compression
    duration_ms: float      # pipeline wall time
    stages_run: List[str]   # which stages ran

    @property
    def savings_pct(self) -> float: ...
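
The savings_pct property body is elided above. One plausible reading, shown here on a hypothetical stand-in type rather than the real PipelineResult, is the percentage of input tokens removed:

```python
from dataclasses import dataclass

@dataclass
class SavingsExample:
    # Hypothetical stand-in carrying PipelineResult's two token counts
    tokens_raw: int
    tokens_after: int

    @property
    def savings_pct(self) -> float:
        # Percentage of input tokens removed by the pipeline
        if self.tokens_raw == 0:
            return 0.0
        return (self.tokens_raw - self.tokens_after) / self.tokens_raw * 100
```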

Compression Modes

Mode               TOKENPAK_MODE   Behavior
Hybrid (default)   hybrid          Compresses when tokens ≥ threshold; passthrough below
Strict             strict          Always compresses, no threshold check
Aggressive         aggressive      Maximum compression; accepts some quality reduction

Engines

Heuristic engine (default)

Rule-based compression. Runs in <5ms, zero ML dependencies. Handles:

  • Regex-based whitespace normalization
  • Comment stripping (configurable per language)
  • Boilerplate removal (common patterns: # type: ignore, pylint: disable=...)
  • Markdown flattening
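
A sketch of the boilerplate-removal and whitespace rules, with an illustrative pattern list (the real set is configurable per language):

```python
import re

# Illustrative boilerplate patterns; the real rule set is configurable
BOILERPLATE = [
    re.compile(r"\s*#\s*type:\s*ignore.*$", re.MULTILINE),
    re.compile(r"\s*#\s*pylint:\s*disable=[\w,-]+.*$", re.MULTILINE),
]
BLANK_RUNS = re.compile(r"\n{3,}")

def heuristic_compress(code):
    """Strip known boilerplate comments and collapse runs of blank lines."""
    for pattern in BOILERPLATE:
        code = pattern.sub("", code)
    return BLANK_RUNS.sub("\n\n", code)
```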

LLMLingua engine (optional, Pro/advanced)

ML-powered token-level compression using the LLMLingua-2 model. Achieves 2–20x compression with <5% quality loss (per Microsoft benchmarks).

Install:

pip install tokenpak[ml]

LLMLingua activates automatically when installed. It runs locally — no API calls.


Custom Hooks

Add your own compression logic via the pipeline hook API:

from tokenpak.agent.compression.pipeline import CompressionPipeline

def my_hook(messages):
    # Remove messages older than 10 turns
    return messages[-10:]

pipeline = CompressionPipeline()
pipeline.add_hook(my_hook)
result = pipeline.run(messages)

Hooks run after the standard stages in insertion order.


Recipe Development

Recipes are YAML files in recipes/oss/ that define directives for a specific use case.

Minimal recipe:

# recipes/oss/my-recipe.yaml
name: my-recipe
version: "1.0"
description: Custom compression for my workflow

directives:
  - type: text
    action: filter_tokens
    ratio: 0.7

  - type: code
    action: signature_only
    language: [python]

Apply a recipe:

tokenpak template use my-recipe

See Recipe Development for the full directive schema reference.


Dry Run

Preview what compression would do without actually sending the request:

tokenpak compress myfile.txt

Output:

Input:  12,840 tokens
Output:  6,918 tokens
Saved:   5,922 tokens (46.1%)
Time:    8.4ms

Stages: dedup (0 removed) → segmentize (14 blocks) → directives (applied)

Performance

Optimization              Speedup
LRU token count cache     25x faster repeated counting
Pre-compiled regex        30% faster processing
Batch SQLite WAL writes   60% faster telemetry
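
The LRU token-count cache can be illustrated with functools.lru_cache; count_tokens here is a whitespace-split stand-in for a real tokenizer:

```python
from functools import lru_cache

@lru_cache(maxsize=4096)
def count_tokens(text: str) -> int:
    # Stand-in for a real tokenizer call; the LRU cache makes repeated
    # counts of an unchanged message near-free
    return len(text.split())
```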

Compression runs in the request path. On typical payloads it adds 10–50ms, which is negligible compared to LLM latency (500ms–5s).