50ms of LLM proxy overhead — what it costs and buys
Every LLM gateway adds latency. The honest question is how much, when it shows up, and whether the tradeoffs are worth it. Real k6 measurements, not vibes.
- performance
- benchmarks
- engineering
The strongest objection to putting an LLM gateway in front of your model calls is latency. The argument goes: every millisecond between user and model is a millisecond of perceived slowness, especially for streaming chat, where time-to-first-token (TTFT) is the time between request submission and the first character of the model's response — the metric users actually feel. If you can call OpenAI directly, why add a hop?
This post answers that question with measurements. We measured Synapse Garden's overhead vs going direct to OpenAI and Anthropic, on the same hardware, against the same models, with the same payloads. We're publishing the numbers we saw and the numbers we target. If you're evaluating any LLM gateway (not just ours), the framing in this post will tell you what to ask the others.
What we measured
The setup, kept boring on purpose:
- Client. A k6 load runner from us-east-1 (AWS), the same region the gateway is in. The point isn't to test the network between Sydney and Iowa; it's to test the gateway itself.
- Direct calls. k6 → `api.openai.com/v1/chat/completions` and k6 → `api.anthropic.com/v1/messages`.
- Proxied calls. k6 → `synapse.garden/api/v1/chat/completions` and k6 → `synapse.garden/api/v1/messages`. Same gateway routes the request to the same upstream provider.
- Models. `openai/gpt-5.4-mini` and `anthropic/claude-sonnet-4-6`. Cheap and frontier-class, two ends of the spectrum.
- Payload. A 50-token system prompt and a 200-token user message. Single-turn, non-streaming and streaming. The point is to isolate gateway overhead, so the request is short on purpose.
- Concurrency. 100 RPS sustained for 5 minutes. We ran each scenario three times and took the median across runs to dodge transient glitches.
- What we measured. Time-to-first-byte for non-streaming, time-to-first-token for streaming, and total request time. P50, P95, P99.
The full k6 script lives in tests/load/k6-proxy.js if you want to run your own. Numbers below are from a real run on May 5, 2026.
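If you just want the shape of the test without opening the repo, here's a trimmed sketch of that scenario. The prompts are stand-ins and the config mirrors the bullets above rather than reproducing the file verbatim.

```ts
// Trimmed sketch of the scenario described above -- not the actual
// tests/load/k6-proxy.js, just its shape.
import http from "k6/http";
import { check } from "k6";

export const options = {
  scenarios: {
    proxied: {
      executor: "constant-arrival-rate",
      rate: 100,        // 100 requests per second...
      timeUnit: "1s",
      duration: "5m",   // ...sustained for five minutes
      preAllocatedVUs: 200,
    },
  },
};

// Stand-ins for the 50-token system prompt and 200-token user message.
const systemPrompt = "You are a terse assistant ...";
const userMessage = "Summarize the following passage ...";

const payload = JSON.stringify({
  model: "openai/gpt-5.4-mini",
  messages: [
    { role: "system", content: systemPrompt },
    { role: "user", content: userMessage },
  ],
  stream: false,
});

export default function () {
  const res = http.post(`${__ENV.K6_BASE_URL}/v1/chat/completions`, payload, {
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${__ENV.K6_API_KEY}`,
    },
  });
  check(res, { "status is 200": (r) => r.status === 200 });
  // res.timings.waiting is k6's time-to-first-byte; P50/P95/P99 come from
  // the built-in http_req_waiting summary.
}
```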
The numbers
| Metric | Direct OpenAI | Via Synapse Garden | Delta |
|---|---|---|---|
| TTFB, P50 | 412 ms | 426 ms | +14 ms |
| TTFB, P95 | 982 ms | 1024 ms | +42 ms |
| TTFB, P99 | 1670 ms | 1741 ms | +71 ms |
| Total, P50 | 891 ms | 906 ms | +15 ms |
| Total, P95 | 1842 ms | 1888 ms | +46 ms |
| Total, P99 | 2934 ms | 3018 ms | +84 ms |
For Claude:
| Metric | Direct Anthropic | Via Synapse Garden | Delta |
|---|---|---|---|
| TTFB, P50 | 504 ms | 519 ms | +15 ms |
| TTFB, P95 | 1147 ms | 1190 ms | +43 ms |
| TTFB, P99 | 1893 ms | 1968 ms | +75 ms |
| Total, P50 | 1023 ms | 1041 ms | +18 ms |
| Total, P95 | 2113 ms | 2160 ms | +47 ms |
| Total, P99 | 3284 ms | 3380 ms | +96 ms |
The headline: median overhead is about 15 ms. P95 overhead is 42-47 ms, which is below our internal 50 ms target. P99 overhead is 71-96 ms, above target on the worst run. We have CI that fails the build if P95 exceeds 50 ms; we don't currently gate on P99 because the variance from upstream tail latency dominates the gateway's contribution.
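The CI gate itself is just a k6 threshold on a custom metric. Here's a sketch of the idea, assuming the script records the proxied-minus-direct TTFB difference into a Trend named `overhead_ms`; the metric name, env vars, and pairing logic are illustrative, not copied from our pipeline.

```ts
// Sketch of the latency gate: k6 exits non-zero when a threshold fails,
// which fails the CI job. Names here are illustrative.
import http from "k6/http";
import { Trend } from "k6/metrics";

const overheadMs = new Trend("overhead_ms");

export const options = {
  thresholds: {
    // Build fails if P95 gateway overhead crosses 50 ms.
    overhead_ms: ["p(95)<50"],
  },
};

const payload = JSON.stringify({
  model: "openai/gpt-5.4-mini",
  messages: [{ role: "user", content: "ping" }],
});

function post(baseUrl: string, apiKey: string) {
  return http.post(`${baseUrl}/chat/completions`, payload, {
    headers: { "Content-Type": "application/json", Authorization: `Bearer ${apiKey}` },
  });
}

export default function () {
  const direct = post(__ENV.DIRECT_BASE_URL, __ENV.OPENAI_API_KEY);
  const proxied = post(__ENV.PROXY_BASE_URL, __ENV.K6_API_KEY);
  // timings.waiting is time-to-first-byte for each leg of the pair.
  overheadMs.add(proxied.timings.waiting - direct.timings.waiting);
}
```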
Where the 15 ms goes
Adding a hop to a network path is not free. For our stack, the breakdown of the median overhead, instrumented in OpenTelemetry spans:
| Stage | Median (ms) | What it does |
|---|---|---|
| Header parse + key format check | 0.3 | Pull Authorization, validate mg_live_* shape, reject malformed keys before any I/O |
| Upstash key cache lookup (HIT) | 2.1 | Look up the hashed key in Redis; >99% hit rate in steady state |
| Body validation (Zod) | 1.4 | Parse and validate the JSON body. Rejects malformed requests before they reach the upstream |
| Model allowlist check | 0.2 | Confirm the requested model is enabled for this org/key |
| Rate-limit check (Upstash Lua) | 1.6 | Sliding-window rate limit per key, atomic on the Redis side |
| Token estimate + budget check | 2.4 | Estimate input cost via tiktoken, confirm the org has budget |
| Proxy network hop (us-east-1 → us-east-1) | ~6.0 | TCP + TLS handshake amortized via keep-alive; this is the only "physics" cost |
| Buffer + parse upstream response | 1.2 | The fast common path; not actually parsing the streamed body, just headers |
| Total internal overhead | ~15 | Sum of the stages above, some of which overlap |
The math adds up to slightly more than 15 ms because we don't run all stages strictly serially — the body validation overlaps with the key lookup. The point is that almost all of the overhead comes from explicit work we're doing on your behalf, not from the proxy hop itself.
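To make the table concrete, here's roughly how those stages compose in a handler. Every helper below is a stub standing in for the real internals, not the actual Synapse Garden code; the part worth noticing is the `Promise.all` that overlaps body validation with the key lookup.

```ts
// Illustrative hot-path handler. All helpers are stubs; the stage ordering
// mirrors the breakdown table above.
interface KeyRecord { orgId: string; allowedModels: string[] }
interface ChatBody { model: string; messages: { role: string; content: string }[] }

declare function parseAuthHeader(req: Request): string;                       // 0.3 ms, no I/O
declare function lookupKey(apiKey: string): Promise<KeyRecord>;               // 2.1 ms on cache hit
declare function validateBody(raw: unknown): ChatBody;                        // 1.4 ms, Zod
declare function assertModelAllowed(key: KeyRecord, model: string): void;     // 0.2 ms
declare function checkRateLimit(key: KeyRecord): Promise<void>;               // 1.6 ms, Upstash Lua
declare function estimateTokens(body: ChatBody): number;                      // part of the 2.4 ms
declare function checkBudget(key: KeyRecord, estTokens: number): Promise<void>;
declare function forwardToUpstream(key: KeyRecord, body: ChatBody): Promise<Response>;

export async function handleChatCompletion(req: Request): Promise<Response> {
  const apiKey = parseAuthHeader(req);

  // Key lookup (Redis) and body validation (Zod) run concurrently, which is
  // why the per-stage medians sum to a bit more than the ~15 ms total.
  const [key, body] = await Promise.all([
    lookupKey(apiKey),
    req.json().then((raw) => validateBody(raw)),
  ]);

  assertModelAllowed(key, body.model);
  await checkRateLimit(key);
  await checkBudget(key, estimateTokens(body));

  // ~6 ms: the only "physics" cost -- forward over a kept-alive connection
  // and stream the upstream response straight back to the caller.
  return forwardToUpstream(key, body);
}
```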
What's not on this list:
- DB calls. None on the hot path. We never hit Postgres for a request that's getting served. Logging, aggregation, and budget deduction get pushed to a queue after the response has already started streaming back.
- Logging. Token counts, model id, latency, and status get pushed fire-and-forget into Vercel Queues (see the sketch after this list). Total time on the hot path: zero (the queue write is async).
- Tracing. OTEL spans are sampled and pushed to a separate collector. The hot path doesn't await the trace export.
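A sketch of that fire-and-forget pattern. It leans on `waitUntil` from `@vercel/functions` to keep the function alive for the queue write without the response waiting on it; the queue client and event shape are assumptions for illustration, not our exact schema.

```ts
import { waitUntil } from "@vercel/functions";

// Illustrative event shape -- not our exact logging schema.
interface RequestLogEvent {
  keyId: string;
  model: string;
  inputTokens: number;
  outputTokens: number;
  latencyMs: number;
  status: number;
}

// Hypothetical queue client; stands in for whatever enqueues into
// Vercel Queues in the real handler.
declare const logQueue: { enqueue(event: RequestLogEvent): Promise<void> };

export function logRequest(event: RequestLogEvent): void {
  // The hot path never awaits this. waitUntil keeps the function alive until
  // the enqueue settles, after the response has started streaming back.
  waitUntil(
    logQueue.enqueue(event).catch((err) => {
      // A failed log write must never fail the user's request.
      console.error("log enqueue failed", err);
    }),
  );
}
```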
If you're evaluating other gateways, ask them what hits the database during a request. The honest answer for any gateway with strong governance is "the key lookup, but only on cache miss." The dishonest answer is "a few queries for analytics, but they're fast." Analytics queries on the hot path are how you turn 15 ms of overhead into 80 ms.
When the overhead matters and when it doesn't
The 15 ms median number is small enough that for most workloads it's invisible. Where it does and doesn't matter, case by case:
Voice and live conversation. If your product streams audio and you're measuring round-trip from "user stops talking" to "first audio chunk plays," every millisecond between the model and the speaker is felt. For these workloads, even 15 ms of gateway overhead can be the difference between "natural conversation" and "feels slightly off." Test with users, not with stopwatches.
High-frequency tool calls in an agent loop. An agent that makes 8 tool calls in 4 seconds is paying the proxy hop 8 times. 15 ms × 8 = 120 ms of overhead per task. If you're building this kind of agent, consider batching where possible, and consider whether you're hitting the same upstream model 8 times in a row (in which case provider-side caching might help more than gateway optimization).
Background batch jobs. None. If you're running a million classifications overnight, 15 ms doesn't matter. Summed across a million requests, the gateway's overhead is about four hours (15 ms × 1,000,000); running a million LLM calls at all costs several days. The bottleneck is not the gateway.
Interactive chat. Probably none. Time-to-first-token is dominated by the model itself, which takes 200-1000 ms to start. 15 ms on top of 500 ms is 3% — within the noise of network jitter.
The honest framing: gateways are bad for sub-100 ms response budgets and fine for everything else. If you're at <100 ms and need to be, you're in a regime where a gateway's tradeoffs probably aren't worth it. If you're at 500+ ms (which is where 99% of LLM workloads live), the overhead is below user-perception threshold.
What the overhead buys you
Every millisecond of overhead exists because we're doing work that prevents bad things. Some of those bad things:
- Authentication. The key cache lookup blocks revoked, expired, or malformed keys before they reach the upstream. Without it, a leaked key keeps spending until you revoke at the provider, which costs minutes (and dollars).
- Spend caps. The budget check blocks requests that would exceed your monthly cap. Without it, a runaway loop costs you whatever's left in your provider quota — sometimes thousands.
- Rate limits. The rate-limit stage stops abusive callers. Without it, one buggy client's retry loop becomes everyone's outage when you hit the provider's TPM limit.
- Validation. The Zod body check catches malformed requests before they hit the upstream (a sketch follows this list). Without it, you get partial charges for requests the model never finishes.
- Observability. The async logging gives you per-request data without paying for it on the hot path. Without it, you're guessing at where cost goes.
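For the validation item above, a minimal sketch of what a Zod body check might look like. The schema is illustrative; the real one covers streaming options, tool definitions, and provider-specific fields.

```ts
import { z } from "zod";

// Illustrative request schema -- the idea is to reject malformed bodies
// before they cost an upstream round trip.
const chatCompletionBody = z.object({
  model: z.string().min(1),
  messages: z
    .array(
      z.object({
        role: z.enum(["system", "user", "assistant", "tool"]),
        content: z.string(),
      }),
    )
    .min(1),
  stream: z.boolean().optional(),
  max_tokens: z.number().int().positive().optional(),
  temperature: z.number().min(0).max(2).optional(),
});

export function validateBody(raw: unknown) {
  const parsed = chatCompletionBody.safeParse(raw);
  if (!parsed.success) {
    // Return a structured 400 instead of forwarding a doomed request.
    throw Object.assign(new Error("invalid request body"), {
      status: 400,
      issues: parsed.error.issues,
    });
  }
  return parsed.data;
}
```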
Together, these are what make the gateway a gateway and not just a proxy. The overhead is the price of governance. If you're at a stage where governance doesn't matter yet (early prototype, internal tools, two engineers), the price is too high. If you're past that stage, 15 ms is the cheapest insurance you'll buy.
How to verify (don't trust us)
The argument we're making depends on the numbers being roughly right at your scale, in your region, against your traffic. Verify:
- Run k6 yourself. Our script is in `tests/load/k6-proxy.js`. Set `K6_BASE_URL=https://synapse.garden/api` and `K6_API_KEY=mg_live_...` (see the authentication docs for getting a key). Compare to the same script pointed at `api.openai.com/v1` with your OpenAI key.
- Look at OTEL spans in your dashboard. Every request gets a `synapse.proxy.duration_ms` span that includes the per-stage breakdown. The observability guide covers the full span schema.
- Watch the CI gate. Our build fails if the k6 P95 overhead exceeds 50 ms. The build status is public. If you see green, the gate passed.
If your measurements show wildly different numbers, tell us. Either there's a regression we haven't caught, or there's something specific about your workload that breaks our model. Either way, we want to know.
What this post isn't
This isn't a benchmark of OpenAI vs Anthropic. The two models in the table above have different price points, different output styles, and different latency characteristics from each other. Comparing them is a different post. Both numbers above are "gateway vs direct," not "GPT vs Claude."
This also isn't a comparison of gateways. We haven't run the same benchmark against OpenRouter, Portkey, or Helicone in this post. We've used all of them; if you want a head-to-head, that's the next benchmark, and we'll publish it once we have numbers we trust.
The takeaway is just this: in the regime where most LLM workloads live (interactive chat, batch processing, agents), 15 ms of P50 overhead is a non-event. P95 stays inside our 50 ms target by construction. P99 occasionally tickles the alert threshold and we have work to do there. If you're in a sub-100 ms regime, talk to us before committing — there's probably an architecture that gets you what you need.
For more on the engineering side, our architecture notes on per-project keys cover what the 15 ms is actually buying. To compare gateways head-to-head, see the gateway comparison post.