Recipe: Streaming Responses
Status: Verified runnable. TokenPak's proxy is a byte-preserving passthrough, so streaming works transparently: when your client requests a streaming response, the proxy forwards the upstream provider's stream (server-sent events) back to your client verbatim. The commands below run against the default proxy at http://127.0.0.1:8766.
Note: there is no dedicated streaming: config block, no separate -streaming model names, and no proxy-side token buffering in the current release. Streaming is enabled by the "stream": true flag in the request body and the provider's own SSE response, which the proxy passes through unchanged.
What this solves: Receiving responses token-by-token instead of waiting for the full response, improving perceived latency.
Prerequisites
- TokenPak installed (tokenpak --help)
- A streaming-aware client (curl --no-buffer, Python requests with stream=True, Node.js streams)
- A valid API key for your provider
Start the proxy
tokenpak serve # listens on http://127.0.0.1:8766
No special config is required for streaming — the proxy forwards the upstream stream as-is.
Test & Verify
Streaming with curl (server-sent events):
curl -N -X POST http://127.0.0.1:8766/v1/messages \
  -H "Content-Type: application/json" \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -d '{
    "model": "claude-3-5-sonnet-20241022",
    "max_tokens": 128,
    "stream": true,
    "messages": [{"role": "user", "content": "Write a haiku about API proxies"}]
  }'
# Output (illustrative shape — exact events come from the upstream provider):
# event: content_block_delta
# data: {"type":"content_block_delta","delta":{"type":"text_delta","text":"API"}}
# event: content_block_delta
# data: {"type":"content_block_delta","delta":{"type":"text_delta","text":" proxies"}}
# ...
curl -N (--no-buffer) disables curl's output buffering so you see events as they arrive.
Streaming with Python:
import os

import requests

# stream=True makes requests read the body incrementally instead of buffering it.
with requests.post(
    "http://127.0.0.1:8766/v1/messages",
    headers={
        "Content-Type": "application/json",
        "x-api-key": os.environ["ANTHROPIC_API_KEY"],
        "anthropic-version": "2023-06-01",
    },
    json={
        "model": "claude-3-5-sonnet-20241022",
        "max_tokens": 128,
        "stream": True,
        "messages": [{"role": "user", "content": "Write a 2-sentence story"}],
    },
    stream=True,
) as response:
    response.raise_for_status()  # fail fast on a non-2xx status
    # iter_lines() yields each SSE line as bytes; blank lines separate events.
    for line in response.iter_lines():
        if line:
            print(line.decode("utf-8"))
Streaming with Node.js:
// Requires Node.js 18+ for the built-in fetch API.
async function streamResponse() {
  const response = await fetch('http://127.0.0.1:8766/v1/messages', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'x-api-key': process.env.ANTHROPIC_API_KEY,
      'anthropic-version': '2023-06-01',
    },
    body: JSON.stringify({
      model: 'claude-3-5-sonnet-20241022',
      max_tokens: 128,
      stream: true,
      messages: [{ role: 'user', content: 'Count to 5' }],
    }),
  });
  if (!response.ok) {
    throw new Error(`HTTP ${response.status}`);
  }

  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    // { stream: true } keeps multi-byte characters intact across chunk boundaries.
    process.stdout.write(decoder.decode(value, { stream: true }));
  }
}

streamResponse().catch(console.error);
What Just Happened
- Client sends a request with "stream": true.
- The proxy forwards the request to the upstream provider with streaming enabled.
- As the provider emits SSE events, the proxy passes them through to your client verbatim (byte-preserving passthrough).
- Your client renders the deltas as they arrive.
Because the proxy does not buffer or rewrite the stream, the streaming behavior your client sees is the provider's own — TokenPak adds transport pass-through, not a separate streaming engine.
Common Pitfalls
Client doesn't handle the SSE format
- ❌ Wrong: JSON.parse() the whole streamed body (it is a sequence of data: lines, not one JSON document).
- ✅ Right: Parse SSE line-by-line; each data: line carries a JSON payload.
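The line-by-line parsing described above can be sketched in Python as follows. This is a minimal illustration, not part of TokenPak: the helper names iter_sse_events and collect_text are hypothetical, the event shapes are the illustrative content_block_delta ones shown earlier, and every data: payload is assumed to be JSON (a provider that sends a non-JSON sentinel would need an extra check).

```python
import json
from typing import Iterable, Iterator, List, Optional


def iter_sse_events(lines: Iterable[bytes]) -> Iterator[dict]:
    """Yield {"event": name_or_None, "data": parsed_json} for each complete SSE event.

    Hypothetical helper: assumes every data payload is a JSON document.
    """
    event_name: Optional[str] = None
    data_parts: List[str] = []
    for raw in lines:
        line = raw.decode("utf-8").rstrip("\r\n")
        if not line:
            # A blank line terminates the current event.
            if data_parts:
                yield {"event": event_name, "data": json.loads("\n".join(data_parts))}
            event_name, data_parts = None, []
            continue
        if line.startswith(":"):
            continue  # SSE comment, often used as a keep-alive
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # the format allows one optional space after the colon
        if field == "event":
            event_name = value
        elif field == "data":
            data_parts.append(value)
    # An event with no terminating blank line is dropped, as the SSE format specifies.


def collect_text(events: Iterable[dict]) -> str:
    """Join the text of all text_delta payloads (shape taken from the sample output)."""
    parts = []
    for ev in events:
        data = ev["data"]
        if data.get("type") == "content_block_delta":
            delta = data.get("delta", {})
            if delta.get("type") == "text_delta":
                parts.append(delta.get("text", ""))
    return "".join(parts)
```

With the requests example earlier in this recipe, pass response.iter_lines() as lines without filtering out empty lines: it yields an empty bytes object for each blank line, and that blank line is what marks the end of an event here.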
Output buffering hides the stream
- ❌ Wrong: Plain curl (buffers output) makes the response look non-streaming.
- ✅ Right: Use curl -N / --no-buffer, or a client that reads incrementally.
Not handling connection drops
- ❌ Wrong: Assume the streaming connection never breaks.
- ✅ Right: Implement reconnect/retry logic for long streams.
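A reconnect policy for dropped streams could look like the sketch below. It is an illustration under stated assumptions rather than TokenPak behavior: stream_with_retry and open_stream are hypothetical names, a retry re-sends the whole request (nothing in this recipe describes a mid-stream resume), and each yielded line is tagged with its attempt number so the caller can discard partial output from a failed attempt.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple, Type


def stream_with_retry(
    open_stream: Callable[[], Iterable[bytes]],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[Tuple[int, bytes]]:
    """Yield (attempt, line) pairs, restarting the request when the stream drops.

    open_stream is a hypothetical callable that sends the request and returns
    an iterable of lines. Delays grow exponentially: base_delay, 2x, 4x, ...
    """
    for attempt in range(1, max_attempts + 1):
        try:
            for line in open_stream():
                yield attempt, line
            return  # stream finished cleanly
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            sleep(base_delay * 2 ** (attempt - 1))
```

When the stream comes from requests, pass retry_on=(requests.exceptions.ConnectionError, requests.exceptions.ChunkedEncodingError), since those classes do not derive from Python's built-in ConnectionError. A change in the attempt number tells the caller to clear whatever it rendered from the previous attempt.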
Measuring total time instead of time-to-first-token
- ❌ Wrong: Report total response time as "latency."
- ✅ Right: Track time-to-first-token separately; that is where streaming improves perceived latency.
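Separating the two measurements might look like this. The sketch is an approximation under stated assumptions: measure_stream is a hypothetical helper, the first non-empty chunk stands in for the first token (with the event shapes shown earlier, a stricter is_token predicate could look for the first text_delta), and the timer starts when iteration begins, so the iterable should send the request lazily if connection time is meant to be included.

```python
import time
from typing import Callable, Iterable


def measure_stream(
    chunks: Iterable[bytes],
    is_token: Callable[[bytes], bool] = bool,
    clock: Callable[[], float] = time.perf_counter,
) -> dict:
    """Consume a stream and report time-to-first-token and total time in seconds.

    Hypothetical helper: by default any non-empty chunk counts as the first token.
    """
    start = clock()
    ttft = None
    count = 0
    for chunk in chunks:
        count += 1
        if ttft is None and is_token(chunk):
            ttft = clock() - start  # recorded once, at the first qualifying chunk
    total = clock() - start
    return {"ttft_s": ttft, "total_s": total, "chunks": count}
```

Reporting ttft_s and total_s side by side makes the benefit visible: streaming typically leaves total_s about the same while ttft_s is much smaller than it.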