Recipe: Streaming Responses

Status: Verified runnable. TokenPak's proxy is a byte-preserving passthrough, so streaming works transparently: when your client requests a streaming response, the proxy forwards the upstream provider's stream (server-sent events) back to your client verbatim. The commands below run against the default proxy at http://127.0.0.1:8766.

Note: there is no dedicated streaming: config block, no separate -streaming model names, and no proxy-side token buffering in the current release — streaming is enabled by the "stream": true flag in the request body and the provider's own SSE response, which the proxy passes through unchanged.

What this solves: Receiving responses token-by-token instead of waiting for the full response, improving perceived latency.

Prerequisites

TokenPak installed (tokenpak --help)
A streaming-aware client (curl --no-buffer, Python requests with stream=True, Node.js streams)
A valid API key for your provider

Start the proxy

tokenpak serve   # listens on http://127.0.0.1:8766

No special config is required for streaming — the proxy forwards the upstream stream as-is.

Test & Verify

Streaming with curl (server-sent events):

curl -N -X POST http://127.0.0.1:8766/v1/messages \
  -H "Content-Type: application/json" \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -d '{
    "model": "claude-3-5-sonnet-20241022",
    "max_tokens": 128,
    "stream": true,
    "messages": [{"role": "user", "content": "Write a haiku about API proxies"}]
  }'

# Output (illustrative shape — exact events come from the upstream provider):
# event: content_block_delta
# data: {"type":"content_block_delta","delta":{"type":"text_delta","text":"API"}}
# event: content_block_delta
# data: {"type":"content_block_delta","delta":{"type":"text_delta","text":" proxies"}}
# ...

curl -N (--no-buffer) disables curl's output buffering so you see events as they arrive.

Streaming with Python:

import os
import requests

with requests.post(
    "http://127.0.0.1:8766/v1/messages",
    headers={
        "Content-Type": "application/json",
        "x-api-key": os.environ["ANTHROPIC_API_KEY"],
        "anthropic-version": "2023-06-01",
    },
    json={
        "model": "claude-3-5-sonnet-20241022",
        "max_tokens": 128,
        "stream": True,
        "messages": [{"role": "user", "content": "Write a 2-sentence story"}],
    },
    stream=True,
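) as response:
    # Fail fast on HTTP errors instead of printing an error body as if it were a stream.
    response.raise_for_status()
    # iter_lines() yields raw bytes as lines arrive; blank lines separate SSE events.
    for line in response.iter_lines():
        if line:
            print(line.decode("utf-8"))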

Streaming with Node.js:

async function streamResponse() {
  const response = await fetch('http://127.0.0.1:8766/v1/messages', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'x-api-key': process.env.ANTHROPIC_API_KEY,
      'anthropic-version': '2023-06-01',
    },
    body: JSON.stringify({
      model: 'claude-3-5-sonnet-20241022',
      max_tokens: 128,
      stream: true,
      messages: [{ role: 'user', content: 'Count to 5' }],
    }),
  });

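  if (!response.ok) {
    throw new Error(`Request failed with HTTP ${response.status}`);
  }

  // Read the body incrementally; the built-in fetch requires Node.js 18+.
  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    // { stream: true } keeps multi-byte UTF-8 characters intact across chunk boundaries.
    process.stdout.write(decoder.decode(value, { stream: true }));
  }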
}

streamResponse();

What Just Happened

1. Client sends a request with "stream": true.
2. The proxy forwards the request to the upstream provider with streaming enabled.
3. As the provider emits SSE events, the proxy passes them through to your client verbatim (byte-preserving passthrough).
4. Your client renders the deltas as they arrive.

Because the proxy does not buffer or rewrite the stream, the streaming behavior your client sees is the provider's own — TokenPak adds transport pass-through, not a separate streaming engine.

Common Pitfalls

Client doesn't handle the SSE format
- ❌ Wrong: JSON.parse() the whole streamed body (it is a sequence of data: lines, not one JSON document).
- ✅ Right: Parse SSE line-by-line; each data: line carries a JSON payload.
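
The line-by-line parsing described in this pitfall can be sketched as follows. This is a minimal illustration, not part of TokenPak: the helper names iter_sse_events and collect_text are hypothetical, and the code assumes every data: payload is JSON shaped like the illustrative content_block_delta events shown earlier. It also assumes blank lines are passed in unfiltered, because a blank line is what terminates an SSE event.

```python
import json
from typing import Iterable, Iterator, List, Optional, Tuple


def iter_sse_events(lines: Iterable[str]) -> Iterator[Tuple[Optional[str], dict]]:
    """Yield (event_name, payload) pairs from already-decoded SSE lines.

    Hypothetical helper: a blank line ends an event, and consecutive
    data: lines belonging to one event are joined with newlines.
    """
    event_name: Optional[str] = None
    data_parts: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # Blank line: dispatch the event collected so far, then reset.
            if data_parts:
                yield event_name, json.loads("\n".join(data_parts))
            event_name, data_parts = None, []
        elif line.startswith(":"):
            continue  # SSE comment, often used as a keep-alive
        elif line.startswith("event:"):
            event_name = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data_parts.append(line[len("data:"):].lstrip(" "))
    if data_parts:
        # The stream ended without a trailing blank line.
        yield event_name, json.loads("\n".join(data_parts))


def collect_text(lines: Iterable[str]) -> str:
    """Concatenate the text of all text_delta events (assumed payload shape)."""
    return "".join(
        payload["delta"]["text"]
        for _, payload in iter_sse_events(lines)
        if payload.get("type") == "content_block_delta"
        and payload.get("delta", {}).get("type") == "text_delta"
    )
```

With the requests client, something like iter_sse_events(line.decode("utf-8") for line in response.iter_lines()) should work, provided empty lines are not dropped before they reach the parser.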

Output buffering hides the stream
- ❌ Wrong: Run plain curl, whose output buffering can make the response look non-streaming.
- ✅ Right: Use curl -N / --no-buffer, or a client that reads incrementally.

Not handling connection drops
- ❌ Wrong: Assume the streaming connection never breaks.
- ✅ Right: Implement reconnect/retry logic for long streams.
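
One possible shape for that retry logic is sketched below. It is an assumption-laden illustration rather than a TokenPak feature: stream_with_retry and its parameters are hypothetical names, and each retry issues a brand-new request, so the provider generates a fresh response instead of resuming the dropped one. The attempt number is yielded alongside each item so the caller can discard partial output from a failed attempt.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple, Type


def stream_with_retry(
    open_stream: Callable[[], Iterable[str]],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[Tuple[int, str]]:
    """Yield (attempt, item) pairs, reopening the stream if it drops.

    Hypothetical helper: open_stream must start a new request each time
    it is called. Nothing is resumed; a retry restarts from the beginning.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            for item in open_stream():
                yield attempt, item
            return  # the stream finished normally
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts; surface the last error
            # Exponential backoff: base_delay, then 2x, 4x, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

When the stream comes from requests, the built-in ConnectionError will not match its exceptions; passing retry_on=(requests.exceptions.RequestException,) is one way to cover dropped or truncated connections.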

Measuring total time instead of time-to-first-token
- ❌ Wrong: Report total response time as "latency."
- ✅ Right: Track time-to-first-token separately; that is where streaming improves perceived latency.
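
A rough way to record both numbers is to wrap the stream iterator, as in the sketch below. The name timed_stream and the stats keys are hypothetical. The first non-empty chunk is used as a stand-in for the first token, which is only an approximation because a provider may emit metadata events before any text; pass a stricter is_content predicate if that matters for your measurement.

```python
import time
from typing import Any, Callable, Dict, Iterable, Iterator, Optional


def timed_stream(
    chunks: Iterable[Any],
    started_at: float,
    stats: Dict[str, Optional[float]],
    clock: Callable[[], float] = time.monotonic,
    is_content: Callable[[Any], bool] = bool,
) -> Iterator[Any]:
    """Pass chunks through unchanged while recording timings into stats.

    Hypothetical helper: started_at should be read from the same clock
    just before the request is sent, so the timings include connection
    setup and the upstream provider's own latency.
    """
    stats["time_to_first_chunk"] = None
    stats["total_time"] = None
    try:
        for chunk in chunks:
            if stats["time_to_first_chunk"] is None and is_content(chunk):
                stats["time_to_first_chunk"] = clock() - started_at
            yield chunk
    finally:
        # Recorded even if the consumer stops early or the stream fails.
        stats["total_time"] = clock() - started_at
```

In practice you would read t0 = time.monotonic() immediately before requests.post(..., stream=True) and then iterate over timed_stream(response.iter_lines(), t0, stats), reporting the two values separately.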