Streaming

Stream model output token-by-token. Server-Sent Events under the hood, idiomatic AI SDK on top.

FIG. 00 · STREAMING SSE · text/event-stream

Every chat, message, and response endpoint streams. The wire format is OpenAI-style Server-Sent Events (SSE) — text/event-stream with one event per \n\n block.

FIG. 01 · WIRE TIMELINE · SCHEMATIC
The server flushes one SSE chunk per token batch with no buffering; each chunk lands on the wire as it is produced. The client decodes the stream into tokens via `result.textStream`. The terminal `[DONE]` sentinel closes the stream.

The AI SDK gives you streamText and streamObject. Both wrap the SSE protocol and return an awaitable result with strongly-typed async iterators for tokens, parts, and tool calls, plus a final usage promise. See the AI SDK streamText reference for the full surface.

import { createOpenAI } from "@ai-sdk/openai"
import { streamText } from "ai"

// Point the AI SDK's OpenAI-compatible provider at the Synapse Garden base URL.
const synapse = createOpenAI({
  baseURL: "https://synapse.garden/api/v1",
  apiKey: process.env.MG_KEY,
})

const result = streamText({
  model: synapse.chat("openai/gpt-5.4"),
  prompt: "Write three bullet points about Postgres partitioning.",
})

for await (const part of result.textStream) {
  process.stdout.write(part)
}

console.log("\n— Usage:", await result.usage)
console.log("— Finish reason:", await result.finishReason)
console.log("— Provider:", await result.providerMetadata)

result.textStream yields plain text deltas. For structured streams (tool calls, reasoning summaries, multi-part output), use result.fullStream which yields a discriminated union of part kinds.
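
For example, here's a minimal sketch of consuming result.fullStream. The part type and field names below (text-delta/textDelta, tool-call, finish) follow AI SDK 4-style part shapes and can differ slightly between SDK majors, so treat them as assumptions and check the streamText reference for your version:

for await (const part of result.fullStream) {
  switch (part.type) {
    case "text-delta":
      // Incremental text, the same content textStream yields.
      process.stdout.write(part.textDelta)
      break
    case "tool-call":
      // A complete tool invocation requested by the model.
      console.log("\n[tool]", part.toolName, part.args)
      break
    case "finish":
      // Final part: finish reason plus usage for the whole generation.
      console.log("\n[finish]", part.finishReason, part.usage)
      break
  }
}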

Streaming costs the same as non-streaming

First token typically arrives in 200–800ms depending on the model. The total cost is identical to a single-shot call — what changes is perceived latency.

With the OpenAI SDK

import OpenAI from "openai"

const client = new OpenAI({
  apiKey: process.env.MG_KEY,
  baseURL: "https://synapse.garden/api/v1",
})

const stream = await client.chat.completions.create({
  model: "openai/gpt-5.4",
  messages: [{ role: "user", content: "Tell me about Synapse Garden." }],
  stream: true,
})

for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content ?? "")
}

Wire format (raw SSE)

If you're hand-rolling, the events look like this:

data: {"id":"chatcmpl_…","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"role":"assistant","content":"Post"},"finish_reason":null}]}

data: {"id":"chatcmpl_…","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"gres"},"finish_reason":null}]}



data: {"id":"chatcmpl_…","object":"chat.completion.chunk","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]

Three rules:

  • One event per \n\n block; lines inside an event are prefixed with data:.
  • Each data: payload is JSON, except the terminating data: [DONE].
  • The final event before [DONE] carries finish_reason (stop, length, tool_calls, content_filter).
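
If you're parsing by hand, a minimal client sketch looks roughly like this (the /chat/completions path and request body mirror the OpenAI SDK example above, and the loop assumes a runtime such as Node 18+ where the response body is async-iterable):

const res = await fetch("https://synapse.garden/api/v1/chat/completions", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.MG_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    model: "openai/gpt-5.4",
    messages: [{ role: "user", content: "Tell me about Synapse Garden." }],
    stream: true,
  }),
})

const decoder = new TextDecoder()
let buffer = ""

for await (const chunk of res.body!) {
  buffer += decoder.decode(chunk, { stream: true })

  // Events are separated by a blank line (\n\n); keep the trailing partial event.
  const events = buffer.split("\n\n")
  buffer = events.pop() ?? ""

  for (const event of events) {
    for (const line of event.split("\n")) {
      if (!line.startsWith("data:")) continue
      const data = line.slice("data:".length).trim()
      if (data === "[DONE]") continue // terminal sentinel
      const payload = JSON.parse(data)
      process.stdout.write(payload.choices[0]?.delta?.content ?? "")
    }
  }
}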

Anthropic-style streaming

The /v1/messages endpoint streams using the Anthropic event protocol:

event: message_start
data: {"type":"message_start","message":{...}}

event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"Post"}}



event: message_stop
data: {"type":"message_stop"}

Use the Anthropic SDK and let it parse for you — the wire shape only matters when you're debugging.
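
A minimal sketch with the Anthropic SDK (the base URL for the Anthropic-compatible route is an assumption here; the SDK appends /v1/messages to whatever base you give it):

import Anthropic from "@anthropic-ai/sdk"

const client = new Anthropic({
  apiKey: process.env.MG_KEY,
  // Assumption: the SDK appends /v1/messages to this base path.
  baseURL: "https://synapse.garden/api",
})

const stream = client.messages.stream({
  model: "claude-sonnet-4.6",
  max_tokens: 1024,
  messages: [{ role: "user", content: "Tell me about Synapse Garden." }],
})

// The SDK parses message_start / content_block_delta / message_stop for you.
stream.on("text", (text) => process.stdout.write(text))
await stream.finalMessage()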

Cancellation

The AI SDK accepts an AbortSignal. Cancel mid-stream and we stop charging tokens after the cancellation propagates (a few hundred ms).

const controller = new AbortController()
setTimeout(() => controller.abort(), 1500) // cancel after 1.5s

const result = streamText({
  model: "openai/gpt-5.4",
  prompt: "…",
  abortSignal: controller.signal,
})

try {
  for await (const part of result.textStream) {
    process.stdout.write(part)
  }
} catch (err) {
  if (err.name === "AbortError") console.log("\n(cancelled)")
  else throw err
}

For raw fetch, pass the same signal and the connection closes immediately.
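
A sketch of the raw-fetch version, reusing the controller from above (the endpoint path is the same assumption as in the parser sketch earlier):

const res = await fetch("https://synapse.garden/api/v1/chat/completions", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.MG_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    model: "openai/gpt-5.4",
    messages: [{ role: "user", content: "…" }],
    stream: true,
  }),
  // Aborting the signal tears down the HTTP connection mid-stream.
  signal: controller.signal,
})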

Edge / serverless caveats

Streaming responses pass through serverless function bodies just fine, but CDN buffering can break SSE if the platform queues bytes. We send X-Accel-Buffering: no and Cache-Control: no-cache, no-transform to tell common proxies to flush each event.

If you're piping our stream through your own function, do the same:

return new Response(upstream.body, {
  headers: {
    "Content-Type": "text/event-stream",
    "Cache-Control": "no-cache, no-transform",
    "X-Accel-Buffering": "no",
  },
})
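
For context, upstream above is just the fetch Response from the API. A complete route-handler sketch (the handler shape and the /chat/completions path are assumptions):

export async function POST(req: Request) {
  // Forward the client's request body to the API unchanged.
  const upstream = await fetch("https://synapse.garden/api/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.MG_KEY}`,
      "Content-Type": "application/json",
    },
    body: await req.text(),
  })

  // Pipe the SSE body straight through with the anti-buffering headers.
  return new Response(upstream.body, {
    headers: {
      "Content-Type": "text/event-stream",
      "Cache-Control": "no-cache, no-transform",
      "X-Accel-Buffering": "no",
    },
  })
}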

Errors mid-stream

If the upstream model errors after some tokens have flowed, we send a final SSE event with the error envelope, then [DONE]:

data: {"error":{"code":"UPSTREAM_ERROR","message":"Provider returned 500"}}

data: [DONE]

The AI SDK throws on these automatically. With raw fetch, parse each data: line and check for an error field before accumulating content.
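
For example, a small helper (the name handleDataLine is hypothetical) that slots into the parsing loop from the wire-format section:

function handleDataLine(data: string): string {
  if (data === "[DONE]") return ""
  const payload = JSON.parse(data)
  if (payload.error) {
    // A mid-stream failure: surface it instead of silently truncating output.
    throw new Error(`${payload.error.code}: ${payload.error.message}`)
  }
  return payload.choices[0]?.delta?.content ?? ""
}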

Time-to-first-token (TTFT)

Streaming TTFT is the most common latency metric for conversational UX. We surface per-provider TTFT on the model detail pages so you can pick a fast variant when you need it.

const t0 = performance.now()
const result = streamText({ model: synapse.chat("openai/gpt-5.4"), prompt: "…" })

let firstToken: number | null = null
for await (const part of result.textStream) {
  if (firstToken === null) firstToken = performance.now() - t0
  process.stdout.write(part)
}
console.log(`\nTTFT: ${firstToken?.toFixed(0)}ms`)

Typical TTFT ranges:

Model class                                              TTFT (P50)
Edge / nano (gpt-5.4-nano, gemini-2.5-nano)              80–200 ms
Fast workhorse (gpt-5.4-mini, claude-haiku-4-5)          200–400 ms
Flagship (gpt-5.4, claude-sonnet-4.6, gemini-2.5-pro)    400–800 ms
Reasoning models (with reasoningEffort: 'high')          1.5–8 s before first token