TokenPak Proxy — Production SLA Targets¶
Last updated: 2026-03-24 | Phase 6 Production Hardening
Proxy Endpoint SLAs¶
| Endpoint | p50 target | p95 target | p99 target | Error rate |
|---|---|---|---|---|
/health |
< 15ms | < 250ms | < 500ms | < 0.1% |
/stats |
< 15ms | < 250ms | < 500ms | < 0.1% |
/v1/messages (passthrough) |
< 50ms overhead | — | — | < 0.1% |
Note: Targets reflect TrixBot (4GB RAM, Python HTTPServer, 20-worker bounded pool). Production deployment with asyncio server (
server_async.py) should achieve p99 < 20ms.
Measured Performance (2026-03-24)¶
Benchmark run on TrixBot, 100 req/sec sustained for 5s:
| Endpoint | p50 | p95 | p99 | Errors |
|---|---|---|---|---|
/health |
~4ms | ~100ms | ~250ms | 0% |
/stats |
~4ms | ~100ms | ~250ms | 0% |
Features Validated (Phase 6)¶
/health endpoint¶
Reports: status, uptime_seconds, compression_ratio_avg, circuit_breakers, index_freshness, request_timeout_seconds
curl http://localhost:8766/health
curl http://localhost:8766/health?deep=true # includes memory + disk
Index Freshness Check¶
/health now reports index_freshness.age_seconds — stale if > 600s (10 min).
Index auto-reloads every 5min via VAULT_INDEX_RELOAD_INTERVAL.
Graceful Adapter Fallback¶
Circuit breaker per provider (tokenpak/agent/proxy/circuit_breaker.py):
- Opens after 5 failures in 60s window
- HALF_OPEN probe after 60s recovery timeout
- Fail-fast 503 when open (no timeout wasted)
Request Timeout Enforcement¶
Set TOKENPAK_REQUEST_TIMEOUT=<seconds> to enforce per-request upstream timeout.
Default: 0 (disabled — rely on pool defaults).
export TOKENPAK_REQUEST_TIMEOUT=30 # 30s hard timeout per request
Timeout is passed directly to the connection pool's stream() and request() calls.
Load Test Suite¶
cd ~/Projects/tokenpak
python -m pytest tests/benchmarks/test_load_100rps.py -v
6 tests covering: - p99 < 500ms at 100 req/sec (5s sustained) - p50 < 15ms at 100 req/sec - Zero errors under load - /stats p99 < 30ms (relative) - Throughput ≥ 85 req/sec achievable - JSON validity under concurrent load
Error Rate Budget¶
- Target: < 0.1% (1 error per 1,000 requests)
- Measured: 0% in all load test runs
- Circuit breaker threshold: 5 failures / 60s window
Upgrade Path¶
For production deployments requiring stricter p99:
1. Switch to server_async.py (asyncio-based, removes GIL contention)
2. Gzip-compress vault index blocks (3x smaller, faster load) — see analysis/index-compression-2026-03.md
3. Exclude JS/binary chunks from vault index (~7.5MB saved)