Error Telemetry & Observability¶

TokenPak includes error logging, telemetry export, and per-type error metrics for production deployments.

Error Logging¶

Quick Start¶

The error logger records exceptions together with request context for post-mortem analysis. Call log_error() from an except block:

from tokenpak.telemetry.error_logger import get_error_logger

logger = get_error_logger()

try:
    result = call_llm(model="gpt-4")  # call_llm stands in for your own provider call
except Exception as e:
    logger.log_error(
        request_id="req-123",
        error=e,
        context={
            "model": "gpt-4",
            "provider": "openai",
            "input_size": 1024,
            "cost_estimate": 0.045,
            "duration_ms": 2350
        }
    )

Using the Decorator¶

For automatic exception logging, use the @log_exception decorator:

import openai
from tokenpak.telemetry.error_logger import log_exception

@log_exception(
    request_id="req-456",
    context={"model": "gpt-4", "provider": "openai"}
)
def call_model():
    # Legacy (pre-1.0) openai SDK call; arguments are elided here
    return openai.ChatCompletion.create(...)

Any exception raised in the decorated function is automatically logged and re-raised.
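
To make the log-and-re-raise behavior concrete, here is a minimal sketch of how a decorator of this kind could be written. It is not TokenPak's implementation: `log_and_reraise` and the `sink` callback are hypothetical names used only for illustration, and the sink stands in for whatever actually writes the log entry.

```python
import functools
from typing import Any, Callable, Dict, Optional


def log_and_reraise(
    sink: Callable[[str, BaseException, Dict[str, Any]], None],
    request_id: str,
    context: Optional[Dict[str, Any]] = None,
):
    """Hypothetical stand-in for @log_exception: log, then re-raise."""

    def decorator(func):
        @functools.wraps(func)  # keep the wrapped function's name and docstring
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception as exc:
                # Record the failure, then propagate the original exception
                # with its traceback intact (bare `raise`).
                sink(request_id, exc, dict(context or {}))
                raise

        return wrapper

    return decorator


# Usage: collect logged errors in a list instead of a file.
captured = []


def list_sink(request_id, exc, context):
    captured.append((request_id, type(exc).__name__, context))


@log_and_reraise(list_sink, request_id="req-456", context={"model": "gpt-4"})
def flaky():
    raise ValueError("Invalid model parameter")
```

Because the wrapper uses a bare `raise`, callers still see the original exception and can handle it as if the decorator were not there.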

Log Storage¶

Errors are stored in append-only JSON Lines files, one file per day:

~/.tokenpak/logs/errors-2026-03-24.jsonl
~/.tokenpak/logs/errors-2026-03-23.jsonl
...

Each line is a JSON object containing:

- timestamp: ISO 8601 UTC timestamp
- request_id: Unique request identifier
- error_type: Exception class name
- message: Exception message
- stack_trace: Full Python traceback
- context: Dict with optional metadata (model, provider, cost, timing, etc.)

Example log entry:

{
  "timestamp": "2026-03-24T17:35:22.123456Z",
  "request_id": "req-123",
  "error_type": "ValueError",
  "message": "Invalid model parameter",
  "stack_trace": "Traceback (most recent call last):\n  ...",
  "context": {
    "model": "invalid-model",
    "provider": "openai",
    "input_size": 1024
  }
}
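
As an illustration of how a record with these fields can be produced from a caught exception, the sketch below builds one with the standard library only. `build_error_record` is a hypothetical helper, not part of the TokenPak API; the field names are taken from the list above.

```python
import json
import traceback
from datetime import datetime, timezone
from typing import Any, Dict, Optional


def build_error_record(
    request_id: str,
    error: BaseException,
    context: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: turn an exception into one log record."""
    now = datetime.now(timezone.utc)
    return {
        # ISO 8601 in UTC with a trailing "Z", as in the example entry.
        "timestamp": now.strftime("%Y-%m-%dT%H:%M:%S.%fZ"),
        "request_id": request_id,
        "error_type": type(error).__name__,
        "message": str(error),
        "stack_trace": "".join(
            traceback.format_exception(type(error), error, error.__traceback__)
        ),
        "context": dict(context or {}),
    }


try:
    raise ValueError("Invalid model parameter")
except ValueError as exc:
    record = build_error_record("req-123", exc, {"provider": "openai"})

# One record per line: json.dumps escapes the newlines inside the traceback,
# so the serialized record never spans more than one line of the file.
line = json.dumps(record)
```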

Log Rotation¶

Log files are rotated daily. Logs older than 7 days are automatically:

1. Compressed with gzip
2. Moved to ~/.tokenpak/logs/archive/

This keeps active logs lean while preserving historical data for analysis.
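
The rotation policy can be pictured with a short sketch. This is an assumption about how such a policy might be coded, not TokenPak's own rotation logic; `archive_old_logs` is a hypothetical function, and it reads each file's age from the date in its name rather than from its modification time.

```python
import gzip
import shutil
from datetime import date, datetime, timedelta
from pathlib import Path


def archive_old_logs(log_dir: Path, today: date, max_age_days: int = 7) -> list:
    """Hypothetical sketch: gzip dated logs older than the cutoff into archive/."""
    archive_dir = log_dir / "archive"
    archive_dir.mkdir(parents=True, exist_ok=True)
    cutoff = today - timedelta(days=max_age_days)
    archived = []
    for path in sorted(log_dir.glob("errors-*.jsonl")):
        # File names look like errors-YYYY-MM-DD.jsonl; skip anything else.
        try:
            file_date = datetime.strptime(path.stem, "errors-%Y-%m-%d").date()
        except ValueError:
            continue
        if file_date < cutoff:
            target = archive_dir / (path.name + ".gz")
            with path.open("rb") as src, gzip.open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
            path.unlink()  # remove the original only after the copy succeeds
            archived.append(target.name)
    return archived
```

Passing `today` explicitly keeps the function easy to test; a caller would normally supply `date.today()`.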

CLI Commands¶

Export Telemetry Events¶

Export telemetry/event data to JSON or CSV with tokenpak telemetry export:

# Export to JSON (default format)
tokenpak telemetry export --format json

# Export to CSV
tokenpak telemetry export --format csv

# Filter by time window
tokenpak telemetry export --since 2026-03-01 --until 2026-03-31

# Filter by provider
tokenpak telemetry export --provider anthropic

Options:

Option	Description
`--format`	Output format: `json` or `csv`
`--since`	Include events on/after this date
`--until`	Include events on/before this date
`--provider`	Filter to a single provider

The exported data is suitable for:

- External analysis tools
- Visualization dashboards
- Integration with error tracking services (Sentry, Datadog, etc.)

Inspect Errors Programmatically¶

Error counts and metrics are available through the error logger API (see below). To browse the raw error log files, read the JSON Lines files under ~/.tokenpak/logs/ directly.
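
Reading the files directly needs nothing beyond the standard library. The sketch below walks every errors-*.jsonl file in a directory, skips blank or unparseable lines, and tallies records by error_type. The helper names `iter_error_records` and `count_by_type` are hypothetical; only the file layout and field names come from this page.

```python
import json
from collections import Counter
from pathlib import Path
from typing import Iterator


def iter_error_records(log_dir: Path) -> Iterator[dict]:
    """Yield parsed records from every errors-*.jsonl file, oldest first."""
    for path in sorted(log_dir.glob("errors-*.jsonl")):
        with path.open(encoding="utf-8") as handle:
            for raw in handle:
                raw = raw.strip()
                if not raw:
                    continue
                try:
                    record = json.loads(raw)
                except json.JSONDecodeError:
                    continue  # e.g. a line truncated by a crash
                if isinstance(record, dict):
                    yield record


def count_by_type(log_dir: Path) -> Counter:
    """Tally records by their error_type field."""
    return Counter(
        record.get("error_type", "unknown") for record in iter_error_records(log_dir)
    )


# Usage, assuming the default location described on this page:
# print(count_by_type(Path.home() / ".tokenpak" / "logs").most_common(5))
```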

Prometheus Metrics¶

Error counts by type are tracked for Prometheus integration:

from tokenpak.telemetry.error_logger import get_error_logger

logger = get_error_logger()
metrics = logger.get_metrics()

# Example return value:
# {
#   'ValueError': 18,
#   'TimeoutError': 12,
#   'AuthenticationError': 8
# }

Include in your metrics endpoint:

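from prometheus_client import Gauge
from tokenpak.telemetry.error_logger import get_error_logger

logger = get_error_logger()
error_gauge = Gauge('tokenpak_errors_total', 'Total errors by type', ['error_type'])

# Refresh the gauge from the logger's current per-type counts
metrics = logger.get_metrics()
for error_type, count in metrics.items():
    error_gauge.labels(error_type=error_type).set(count)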

Thread Safety¶

The error logger is thread-safe. Multiple threads can log errors concurrently through the same logger instance, with no extra locking in your code:

import threading
from tokenpak.telemetry.error_logger import get_error_logger

logger = get_error_logger()

def worker():
    try:
        # Do work
        pass
    except Exception as e:
        logger.log_error(request_id=f"req-{threading.current_thread().name}", error=e)

threads = [threading.Thread(target=worker) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
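
One common way to provide this guarantee is a single lock around the file append. The sketch below shows that general technique and is not a description of TokenPak's internals; `LockedJsonlWriter` is a hypothetical class, and the demo writes to a temporary directory rather than ~/.tokenpak/logs/.

```python
import json
import tempfile
import threading
from pathlib import Path


class LockedJsonlWriter:
    """Hypothetical sketch: one lock serializes appends from all threads."""

    def __init__(self, path: Path) -> None:
        self._path = path
        self._lock = threading.Lock()

    def append(self, record: dict) -> None:
        line = json.dumps(record) + "\n"  # serialize outside the lock
        with self._lock:
            # Only the file write is guarded, so the critical section is short.
            with self._path.open("a", encoding="utf-8") as handle:
                handle.write(line)


log_path = Path(tempfile.mkdtemp()) / "errors-demo.jsonl"
writer = LockedJsonlWriter(log_path)


def worker(n: int) -> None:
    for i in range(50):
        writer.append({"request_id": f"req-{n}-{i}", "error_type": "TimeoutError"})


threads = [threading.Thread(target=worker, args=(n,)) for n in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Every line should parse on its own; none should be interleaved.
lines = log_path.read_text(encoding="utf-8").splitlines()
```

Threads do briefly wait on each other at the lock, which is the usual trade-off for keeping each JSON line intact.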

Error Reporting Best Practices¶

Always include a request ID — Use it to correlate errors across distributed logs
Add context fields — Include model, provider, timing, and cost data for debugging
Don't log personally identifiable information — Filter PII before logging
Review logs regularly — Use daily error reports to catch new failure patterns
Integrate with alerting — Set up alerts for error spikes or new error types
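
For the PII point in particular, a small scrubbing step before calling log_error() is often enough. The sketch below assumes a simple policy chosen for illustration (mask a fixed set of keys, mask e-mail addresses inside other string values); `scrub_context`, `SENSITIVE_KEYS`, and the key names in it are hypothetical and should be adapted to your own data.

```python
import re
from typing import Any, Dict

# Assumed policy for illustration: mask these keys entirely,
# and mask e-mail addresses found in any other string value.
SENSITIVE_KEYS = {"email", "user_name", "api_key", "prompt"}
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def scrub_context(context: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of context that is safer to pass to log_error()."""
    clean: Dict[str, Any] = {}
    for key, value in context.items():
        if key.lower() in SENSITIVE_KEYS:
            clean[key] = "[REDACTED]"
        elif isinstance(value, str):
            clean[key] = EMAIL_RE.sub("[EMAIL]", value)
        else:
            clean[key] = value
    return clean


safe = scrub_context({
    "model": "gpt-4",
    "api_key": "sk-live-123",
    "note": "retry requested by jane@example.com",
    "duration_ms": 2350,
})
```

A pattern-based scrubber like this only catches what it is told to look for, so it complements rather than replaces a review of which fields are logged at all.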

Troubleshooting¶

"Failed to write error log"¶

Check that ~/.tokenpak/logs/ is writable:

ls -la ~/.tokenpak/logs/
chmod 755 ~/.tokenpak/logs/

"Malformed log line"¶

A log file can end with a truncated or otherwise invalid line if the process crashes mid-write. This is non-fatal: the logger skips malformed lines and continues. Use the export command to re-emit valid entries:

tokenpak telemetry export --format json

Large log files¶

If log files grow too quickly, consider:

1. Enabling sampling (log only a percentage of errors)
2. Reducing context verbosity
3. Exporting and archiving old logs manually

# Manually archive logs older than 30 days
mkdir -p ~/.tokenpak/logs/archive
find ~/.tokenpak/logs -maxdepth 1 -name "errors-*.jsonl" -mtime +30 \
  -exec gzip {} \; -exec mv {}.gz ~/.tokenpak/logs/archive/ \;
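
This page does not show how sampling is switched on in TokenPak itself, so the following is a sketch of doing it in application code before calling log_error(). `should_log` is a hypothetical helper. It hashes the request ID so that the keep-or-drop decision is the same every time a given request is seen.

```python
import hashlib


def should_log(request_id: str, sample_rate: float) -> bool:
    """Hypothetical sampler: keep roughly sample_rate of request IDs."""
    if sample_rate >= 1.0:
        return True
    if sample_rate <= 0.0:
        return False
    # Hash the ID so the decision is stable for a given request across
    # processes and retries, unlike a call to random.random().
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return bucket < sample_rate


# Roughly a quarter of these IDs are kept at a 25% sample rate.
kept = sum(should_log(f"req-{i}", 0.25) for i in range(10_000))
```

Sampling discards information, so it is usually worth always logging rare or critical error types and sampling only the high-volume ones.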

Next Steps¶

Set up monitoring: Integrate error reports into your observability stack (Datadog, New Relic, etc.)
Create dashboards: Visualize error trends by type, provider, and time
Automate alerts: Trigger notifications on error spikes or critical error types