50ms of LLM proxy overhead — what it costs and buys
Every LLM gateway adds latency. The honest question is how much, when it shows up, and whether the tradeoffs are worth it. Real k6 measurements, not vibes.
- performance
- benchmarks
- engineering
The strongest objection to putting an LLM gateway in front of your model calls is latency. The argument goes: every millisecond between user and model is a millisecond of perceived slowness, especially for streaming chat, where time-to-first-token (TTFT) is the time between request submission and the first character of the model's response — the metric users actually feel. If you can call OpenAI directly, why add a hop?
This post answers that question with measurements. We measured Synapse Garden's overhead vs going direct to OpenAI and Anthropic, on the same hardware, against the same models, with the same payloads. We're publishing the numbers we saw and the numbers we target. If you're evaluating any LLM gateway (not just ours), the framing in this post will tell you what to ask the others.
What we measured
The setup, kept boring on purpose:
- Client. A k6 load runner from us-east-1 (AWS), the same region the gateway is in. The point isn't to test the network between Sydney and Iowa; it's to test the gateway itself.
- Direct calls. k6 → `api.openai.com/v1/chat/completions` and k6 → `api.anthropic.com/v1/messages`.
- Proxied calls. k6 → `synapse.garden/api/v1/chat/completions` and k6 → `synapse.garden/api/v1/messages`. Same gateway routes the request to the same upstream provider.
- Models. `openai/gpt-5.4-mini` and `anthropic/claude-sonnet-4-6`. Cheap and frontier-class, two ends of the spectrum.
- Payload. A 50-token system prompt and a 200-token user message. Single-turn, non-streaming and streaming. The point is to isolate gateway overhead, so the request is short on purpose.
- Concurrency. 100 RPS sustained for 5 minutes. We ran each scenario three times and took the median across runs to dodge transient glitches.
- What we measured. Time-to-first-byte for non-streaming, time-to-first-token for streaming, and total request time. P50, P95, P99.
The full k6 script lives in tests/load/k6-proxy.js if you want to run your own. Numbers below are from a real run on May 5, 2026.
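If you just want the shape of the test without opening the repo, here's a trimmed sketch of that scenario. The prompts are stand-ins and the config mirrors the bullets above rather than reproducing the file verbatim.

```ts
// Trimmed sketch of the scenario described above -- not the actual
// tests/load/k6-proxy.js, just its shape.
import http from "k6/http";
import { check } from "k6";

export const options = {
  scenarios: {
    proxied: {
      executor: "constant-arrival-rate",
      rate: 100,        // 100 requests per second...
      timeUnit: "1s",
      duration: "5m",   // ...sustained for five minutes
      preAllocatedVUs: 200,
    },
  },
};

// Stand-ins for the 50-token system prompt and 200-token user message.
const systemPrompt = "You are a terse assistant ...";
const userMessage = "Summarize the following passage ...";

const payload = JSON.stringify({
  model: "openai/gpt-5.4-mini",
  messages: [
    { role: "system", content: systemPrompt },
    { role: "user", content: userMessage },
  ],
  stream: false,
});

export default function () {
  const res = http.post(`${__ENV.K6_BASE_URL}/v1/chat/completions`, payload, {
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${__ENV.K6_API_KEY}`,
    },
  });
  check(res, { "status is 200": (r) => r.status === 200 });
  // res.timings.waiting is k6's time-to-first-byte; P50/P95/P99 come from
  // the built-in http_req_waiting summary.
}
```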
The numbers
| Metric | Direct OpenAI | Via Synapse Garden | Delta |
|---|---|---|---|
| TTFB, P50 | 412 ms | 426 ms | +14 ms |
| TTFB, P95 | 982 ms | 1024 ms | +42 ms |
| TTFB, P99 | 1670 ms | 1741 ms | +71 ms |
| Total, P50 | 891 ms | 906 ms | +15 ms |
| Total, P95 | 1842 ms | 1888 ms | +46 ms |
| Total, P99 | 2934 ms | 3018 ms | +84 ms |
For Claude:
| Metric | Direct Anthropic | Via Synapse Garden | Delta |
|---|---|---|---|
| TTFB, P50 | 504 ms | 519 ms | +15 ms |
| TTFB, P95 | 1147 ms | 1190 ms | +43 ms |
| TTFB, P99 | 1893 ms | 1968 ms | +75 ms |
| Total, P50 | 1023 ms | 1041 ms | +18 ms |
| Total, P95 | 2113 ms | 2160 ms | +47 ms |
| Total, P99 | 3284 ms | 3380 ms | +96 ms |
The headline: median overhead is about 15 ms. P95 overhead is 42-47 ms, which is below our internal 50 ms target. P99 overhead is 71-96 ms, above target on the worst run. We have CI that fails the build if P95 exceeds 50 ms; we don't currently gate on P99 because the variance from upstream tail latency dominates the gateway's contribution.
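The CI gate itself is just a k6 threshold on a custom metric. Here's a sketch of the idea, assuming the script records the proxied-minus-direct TTFB difference into a Trend named `overhead_ms`; the metric name, env vars, and pairing logic are illustrative, not copied from our pipeline.

```ts
// Sketch of the latency gate: k6 exits non-zero when a threshold fails,
// which fails the CI job. Names here are illustrative.
import http from "k6/http";
import { Trend } from "k6/metrics";

const overheadMs = new Trend("overhead_ms");

export const options = {
  thresholds: {
    // Build fails if P95 gateway overhead crosses 50 ms.
    overhead_ms: ["p(95)<50"],
  },
};

const payload = JSON.stringify({
  model: "openai/gpt-5.4-mini",
  messages: [{ role: "user", content: "ping" }],
});

function post(baseUrl: string, apiKey: string) {
  return http.post(`${baseUrl}/chat/completions`, payload, {
    headers: { "Content-Type": "application/json", Authorization: `Bearer ${apiKey}` },
  });
}

export default function () {
  const direct = post(__ENV.DIRECT_BASE_URL, __ENV.OPENAI_API_KEY);
  const proxied = post(__ENV.PROXY_BASE_URL, __ENV.K6_API_KEY);
  // timings.waiting is time-to-first-byte for each leg of the pair.
  overheadMs.add(proxied.timings.waiting - direct.timings.waiting);
}
```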
Where the 15 ms goes
Adding a hop to a network path is not free. For our stack, the breakdown of the median overhead, instrumented in OpenTelemetry spans:
| Stage | Median (ms) | What it does |
|---|---|---|
| Header parse + key format check | 0.3 | Pull Authorization, validate mg_live_* shape, reject malformed keys before any I/O |
| Upstash key cache lookup (HIT) | 2.1 | Look up the hashed key in Redis; >99% hit rate in steady state |
| Body validation (Zod) | 1.4 | Parse and validate the JSON body. Rejects malformed requests before they reach the upstream |
| Model allowlist check | 0.2 | Confirm the requested model is enabled for this org/key |
| Rate-limit check (Upstash Lua) | 1.6 | Sliding-window rate limit per key, atomic on the Redis side |
| Token estimate + budget check | 2.4 | Estimate input cost via tiktoken, confirm the org has budget |
| Proxy network hop (us-east-1 → us-east-1) | ~6.0 | TCP + TLS handshake amortized via keep-alive; this is the only "physics" cost |
| Buffer + parse upstream response | 1.2 | The fast common path; not actually parsing the streamed body, just headers |
| Total internal overhead | ~15 | Sum of the stages above, some of which overlap |
The math adds up to slightly more than 15 ms because we don't run all stages strictly serially — the body validation overlaps with the key lookup. The point is that almost all of the overhead comes from explicit work we're doing on your behalf, not from the proxy hop itself.
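To make the table concrete, here's roughly how those stages compose in a handler. Every helper below is a stub standing in for the real internals, not the actual Synapse Garden code; the part worth noticing is the `Promise.all` that overlaps body validation with the key lookup.

```ts
// Illustrative hot-path handler. All helpers are stubs; the stage ordering
// mirrors the breakdown table above.
interface KeyRecord { orgId: string; allowedModels: string[] }
interface ChatBody { model: string; messages: { role: string; content: string }[] }

declare function parseAuthHeader(req: Request): string;                       // 0.3 ms, no I/O
declare function lookupKey(apiKey: string): Promise<KeyRecord>;               // 2.1 ms on cache hit
declare function validateBody(raw: unknown): ChatBody;                        // 1.4 ms, Zod
declare function assertModelAllowed(key: KeyRecord, model: string): void;     // 0.2 ms
declare function checkRateLimit(key: KeyRecord): Promise<void>;               // 1.6 ms, Upstash Lua
declare function estimateTokens(body: ChatBody): number;                      // part of the 2.4 ms
declare function checkBudget(key: KeyRecord, estTokens: number): Promise<void>;
declare function forwardToUpstream(key: KeyRecord, body: ChatBody): Promise<Response>;

export async function handleChatCompletion(req: Request): Promise<Response> {
  const apiKey = parseAuthHeader(req);

  // Key lookup (Redis) and body validation (Zod) run concurrently, which is
  // why the per-stage medians sum to a bit more than the ~15 ms total.
  const [key, body] = await Promise.all([
    lookupKey(apiKey),
    req.json().then((raw) => validateBody(raw)),
  ]);

  assertModelAllowed(key, body.model);
  await checkRateLimit(key);
  await checkBudget(key, estimateTokens(body));

  // ~6 ms: the only "physics" cost -- forward over a kept-alive connection
  // and stream the upstream response straight back to the caller.
  return forwardToUpstream(key, body);
}
```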
What's not on this list:
- DB calls. None on the hot path. We never hit Postgres for a request that's getting served. Logging, aggregation, and budget deduction get pushed to a queue after the response has already started streaming back.
- Logging. Token counts, model id, latency, and status get pushed fire-and-forget into Vercel Queues (see the sketch after this list). Total time on the hot path: zero (the queue write is async).
- Tracing. OTEL spans are sampled and pushed to a separate collector. The hot path doesn't await the trace export.
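A sketch of that fire-and-forget pattern. It leans on `waitUntil` from `@vercel/functions` to keep the function alive for the queue write without the response waiting on it; the queue client and event shape are assumptions for illustration, not our exact schema.

```ts
import { waitUntil } from "@vercel/functions";

// Illustrative event shape -- not our exact logging schema.
interface RequestLogEvent {
  keyId: string;
  model: string;
  inputTokens: number;
  outputTokens: number;
  latencyMs: number;
  status: number;
}

// Hypothetical queue client; stands in for whatever enqueues into
// Vercel Queues in the real handler.
declare const logQueue: { enqueue(event: RequestLogEvent): Promise<void> };

export function logRequest(event: RequestLogEvent): void {
  // The hot path never awaits this. waitUntil keeps the function alive until
  // the enqueue settles, after the response has started streaming back.
  waitUntil(
    logQueue.enqueue(event).catch((err) => {
      // A failed log write must never fail the user's request.
      console.error("log enqueue failed", err);
    }),
  );
}
```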
If you're evaluating other gateways, ask them what hits the database during a request. The honest answer for any gateway with strong governance is "the key lookup, but only on cache miss." The dishonest answer is "a few queries for analytics, but they're fast." Analytics queries on the hot path are how you turn 15 ms of overhead into 80 ms.
When the overhead matters and when it doesn't
The 15 ms median number is small enough that for most workloads it's invisible. Where it does and doesn't matter, case by case:
Voice and live conversation. If your product streams audio and you're measuring round-trip from "user stops talking" to "first audio chunk plays," every millisecond between the model and the speaker is felt. For these workloads, even 15 ms of gateway overhead can be the difference between "natural conversation" and "feels slightly off." Test with users, not with stopwatches.
High-frequency tool calls in an agent loop. An agent that makes 8 tool calls in 4 seconds is paying the proxy hop 8 times. 15 ms × 8 = 120 ms of overhead per task. If you're building this kind of agent, consider batching where possible, and consider whether you're hitting the same upstream model 8 times in a row (in which case provider-side caching might help more than gateway optimization).
Background batch jobs. None. If you're running a million classifications overnight, 15 ms doesn't matter. Summed across a million requests, the gateway's overhead is about four hours (15 ms × 1,000,000); running a million LLM calls at all costs several days. The bottleneck is not the gateway.
Interactive chat. Probably none. Time-to-first-token is dominated by the model itself, which takes 200-1000 ms to start. 15 ms on top of 500 ms is 3% — within the noise of network jitter.
The honest framing: gateways are bad for sub-100 ms response budgets and fine for everything else. If you're at <100 ms and need to be, you're in a regime where a gateway's tradeoffs probably aren't worth it. If you're at 500+ ms (which is where 99% of LLM workloads live), the overhead is below user-perception threshold.
What the overhead buys you
Every millisecond of overhead exists because we're doing work that prevents bad things. Some of those bad things:
- Authentication. The key cache lookup blocks revoked, expired, or malformed keys before they reach the upstream. Without it, a leaked key keeps spending until you revoke at the provider, which costs minutes (and dollars).
- Spend caps. The budget check blocks requests that would exceed your monthly cap. Without it, a runaway loop costs you whatever's left in your provider quota — sometimes thousands.
- Rate limits. The rate-limit stage stops abusive callers. Without it, one buggy client's retry loop becomes everyone's outage when you hit the provider's TPM limit.
- Validation. The Zod body check catches malformed requests before they hit the upstream (a sketch follows this list). Without it, you get partial charges for requests the model never finishes.
- Observability. The async logging gives you per-request data without paying for it on the hot path. Without it, you're guessing at where cost goes.
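For the validation item above, a minimal sketch of what a Zod body check might look like. The schema is illustrative; the real one covers streaming options, tool definitions, and provider-specific fields.

```ts
import { z } from "zod";

// Illustrative request schema -- the idea is to reject malformed bodies
// before they cost an upstream round trip.
const chatCompletionBody = z.object({
  model: z.string().min(1),
  messages: z
    .array(
      z.object({
        role: z.enum(["system", "user", "assistant", "tool"]),
        content: z.string(),
      }),
    )
    .min(1),
  stream: z.boolean().optional(),
  max_tokens: z.number().int().positive().optional(),
  temperature: z.number().min(0).max(2).optional(),
});

export function validateBody(raw: unknown) {
  const parsed = chatCompletionBody.safeParse(raw);
  if (!parsed.success) {
    // Return a structured 400 instead of forwarding a doomed request.
    throw Object.assign(new Error("invalid request body"), {
      status: 400,
      issues: parsed.error.issues,
    });
  }
  return parsed.data;
}
```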
Together, these are what make the gateway a gateway and not just a proxy. The overhead is the price of governance. If you're at a stage where governance doesn't matter yet (early prototype, internal tools, two engineers), the price is too high. If you're past that stage, 15 ms is the cheapest insurance you'll buy.
How to verify (don't trust us)
The argument we're making depends on the numbers being roughly right at your scale, in your region, against your traffic. Verify:
- Run k6 yourself. Our script is in `tests/load/k6-proxy.js`. Set `K6_BASE_URL=https://synapse.garden/api` and `K6_API_KEY=mg_live_...` (see the authentication docs for getting a key). Compare to the same script pointed at `api.openai.com/v1` with your OpenAI key.
- Look at OTEL spans in your dashboard. Every request gets a `synapse.proxy.duration_ms` span that includes the per-stage breakdown. The observability guide covers the full span schema.
- Watch the CI gate. Our build fails if the k6 P95 overhead exceeds 50 ms. The build status is public. If you see green, the gate passed.
If your measurements show wildly different numbers, tell us. Either there's a regression we haven't caught, or there's something specific about your workload that breaks our model. Either way, we want to know.
What this post isn't
This isn't a benchmark of OpenAI vs Anthropic. The two models in the table above have different price points, different output styles, and different latency characteristics from each other. Comparing them is a different post. Both numbers above are "gateway vs direct," not "GPT vs Claude."
This also isn't a comparison of gateways. We haven't run the same benchmark against OpenRouter, Portkey, or Helicone in this post. We've used all of them; if you want a head-to-head, that's the next benchmark, and we'll publish it once we have numbers we trust.
The takeaway is just this: in the regime where most LLM workloads live (interactive chat, batch processing, agents), 15 ms of P50 overhead is a non-event. P95 stays inside our 50 ms target by construction. P99 occasionally tickles the alert threshold and we have work to do there. If you're in a sub-100 ms regime, talk to us before committing — there's probably an architecture that gets you what you need.
For more on the engineering side, our architecture notes on per-project keys cover what the 15 ms is actually buying. To compare gateways head-to-head, see the gateway comparison post.