Caching
Implicit and explicit prompt caching. Free latency wins, dramatically cheaper repeat prompts.
Most providers cache prompts internally to avoid recomputing the same prefix on repeat requests. Synapse Garden gives you a single switch (caching: 'auto') that handles every provider's caching contract correctly — opt in once on any streamText call and reap the savings everywhere.
Implicit vs explicit caching
Different providers handle caching differently:
| Provider | Caching | Action required |
|---|---|---|
| OpenAI | Implicit | None — automatic for any prefix re-used in 5–10 min |
| Google Gemini | Implicit | None |
| DeepSeek | Implicit | None |
| Anthropic Claude | Explicit | Need cache_control markers on messages |
| MiniMax | Explicit | Need cache markers |
For implicit cachers, repeat prompts cost less without any code change. For explicit cachers (Anthropic, MiniMax), you need to flag which content to cache.
Auto caching (recommended)
Pass caching: 'auto' and Synapse Garden does the right thing per provider:
```ts
import { streamText } from "ai"

const result = streamText({
  model: "anthropic/claude-sonnet-4.6",
  baseURL: "https://synapse.garden/api/v1",
  apiKey: process.env.MG_KEY,
  system:
    "You are a helpful assistant with access to a large knowledge base… [long prompt]",
  prompt: "What does the law say about X?",
  providerOptions: {
    gateway: { caching: "auto" },
  },
})
```

For Anthropic, we add a cache_control marker at the end of your system prompt automatically. For OpenAI / Google / DeepSeek, the request passes through unchanged (they cache automatically).
Without the flag, Anthropic responses still work, but the prompt won't be cached and every call recomputes the full prefix. The caching: 'auto' flag is the cheapest performance win in the whole product.
Manual cache markers (Anthropic)
For fine-grained control, add cacheControl: { type: 'ephemeral' } to specific messages or content blocks:
```ts
import { streamText } from "ai"

const result = streamText({
  model: "anthropic/claude-sonnet-4.6",
  messages: [
    {
      role: "system",
      content: "You are a legal research assistant.",
    },
    {
      role: "user",
      content: [
        {
          type: "text",
          text: longLegalCorpus, // 50K tokens
          experimental_providerMetadata: {
            anthropic: { cacheControl: { type: "ephemeral" } },
          },
        },
        { type: "text", text: "What does the law say about copyright fair use?" },
      ],
    },
  ],
})
```

The cached corpus is 90% cheaper to read on the next call (within ~5 minutes) versus the full prompt rate.
Cache lifetimes
| Provider | TTL | Notes |
|---|---|---|
| OpenAI | 5–10 min | Implicit; rolls per identical prefix |
| Google Gemini | 5 min | Implicit |
| DeepSeek | 5 min | Implicit |
| Anthropic | 5 min (ephemeral) / 1 hour (with ttl: '1h') | Explicit |
| MiniMax | 5 min | Explicit |
For longer caches (Anthropic):
```ts
{
  type: "text",
  text: largeCorpus,
  experimental_providerMetadata: {
    anthropic: { cacheControl: { type: "ephemeral", ttl: "1h" } },
  },
}
```

Longer TTL costs slightly more on the write but saves dramatically on reads if the same prompt is hot.
What gets cached
Caching works on prefix match — the model's KV-cache for the prompt up to a certain point is reused. If you change a single token in the middle of a long prompt, the cache benefit is lost from that point forward.
The pattern:
```
[ stable system prompt + corpus ]  ← cached
[ user's specific question ]       ← varies per call
```

If you put the user's question first, you defeat the cache. Always, as sketched below:
- Stable / shared content first — system prompt, large corpus, examples
- Per-request content last — the user's question, current state, dynamic data
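As a concrete sketch of that ordering, reusing the auto-caching setup from earlier (the corpus and question values here are stand-ins for illustration):

```ts
import { streamText } from "ai"

// Stand-ins: a large, slowly-changing corpus and a per-request question.
const knowledgeBase: string = "… tens of thousands of tokens of reference material …"
const userQuestion: string = "What does the law say about copyright fair use?"

const result = streamText({
  model: "anthropic/claude-sonnet-4.6",
  baseURL: "https://synapse.garden/api/v1",
  apiKey: process.env.MG_KEY,
  // Stable, shared content first: identical bytes on every request, so it stays cached.
  system: `You are a research assistant.\n\n${knowledgeBase}`,
  // Per-request content last, so the cached prefix above is never invalidated.
  prompt: userQuestion,
  providerOptions: {
    gateway: { caching: "auto" },
  },
})
```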
Reading cache hit metrics
The response provider metadata tells you what cached:
```ts
const result = streamText({ ... })
const meta = await result.providerMetadata

// Anthropic:
meta?.anthropic?.usage
// { cache_read_input_tokens: 49234, cache_creation_input_tokens: 0, input_tokens: 542, output_tokens: 312 }

// OpenAI:
meta?.openai?.cachedTokens
// 49234
```

Watch the ratio of cached to uncached tokens; a high cache hit rate means lower cost. Aim for >70% on hot prompts.
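If you want a single number to track, a small helper along these lines (a sketch assuming the Anthropic usage shape shown above) computes the hit rate:

```ts
// Sketch: cache hit rate from Anthropic-style usage numbers.
// cached reads / (cached reads + cache writes + uncached input)
function cacheHitRate(usage: {
  cache_read_input_tokens: number
  cache_creation_input_tokens: number
  input_tokens: number
}): number {
  const total =
    usage.cache_read_input_tokens +
    usage.cache_creation_input_tokens +
    usage.input_tokens
  return total === 0 ? 0 : usage.cache_read_input_tokens / total
}

// With the numbers above: 49234 / (49234 + 0 + 542) ≈ 0.99
```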
Cache cost
Cached reads cost a fraction of normal input tokens (typically 10–25%). Cache writes cost slightly more than normal input; this is the upfront fee for putting the prompt in the KV cache. At these rates the cache pays for itself after a single re-read, and re-reading the prefix 2–3 times makes the savings substantial.
Rough Anthropic numbers:
- Normal input: 1× rate
- Cache write: 1.25× rate (one-time on first call)
- Cache read: 0.1× rate (per subsequent call)
Three calls with the same 50K prompt:
- Without caching: 3 × 50K × 1× = 150K-token-equivalent cost
- With caching: 50K × 1.25 + 2 × 50K × 0.1 = 72.5K-token-equivalent cost
- Savings: ~52%
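To run the same break-even arithmetic for your own prompt sizes, a back-of-the-envelope sketch using the multipliers above (illustrative only, not billing code):

```ts
// Rough Anthropic-style multipliers from the text.
const WRITE_RATE = 1.25 // first call writes the cache
const READ_RATE = 0.1   // subsequent calls read it

// Token-equivalent cost of `calls` requests sharing a cached prefix of `prefixTokens`.
function cachedCost(prefixTokens: number, calls: number): number {
  return prefixTokens * WRITE_RATE + prefixTokens * READ_RATE * (calls - 1)
}

function uncachedCost(prefixTokens: number, calls: number): number {
  return prefixTokens * calls
}

// Three calls with a 50K-token prefix:
// cachedCost(50_000, 3)   →  72_500 token-equivalents
// uncachedCost(50_000, 3) → 150_000 token-equivalents  (~52% savings)
```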
Use cases that benefit most
- RAG with stable context — 10K+ tokens of retrieved docs that vary slowly across queries
- Long system prompts — extensive instructions, examples, persona definitions
- Multi-turn conversations — system prompt + earlier messages stay cached as the conversation grows
- Document Q&A — same document, many questions
Use cases that don't benefit
- Single-shot, one-time prompts — no repeat read, no cache benefit
- Prompts where every byte changes — random sampling, unique generation per call
- Tiny prompts — overhead of the cache write isn't worth it under ~1K tokens
Caveats
- Cache key is exact. Change a single whitespace character and you miss; normalize your prompts before sending (see the sketch after this list).
- Cache scope is narrow. The cache is keyed per (provider, model, key, prefix), so requests to a different model or provider, or from a different API key, start cold even with an identical prompt.
- TTL is best-effort. Providers may evict before the stated TTL under memory pressure.
- Streaming + caching works fine — caching happens at the prompt-prefix layer, not the response layer.
- Tools and caching. Tool definitions count toward the cached prefix — define your tools in a stable order and they'll be cached too.
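For the exact-match caveat above, even a minimal normalization pass helps. This is a sketch, not part of the gateway; adapt it to whatever invariants your prompts rely on:

```ts
// Sketch: normalize prompt text so identical content always produces identical bytes.
function normalizePrompt(text: string): string {
  return text
    .replace(/\r\n/g, "\n")        // consistent line endings
    .split("\n")
    .map((line) => line.trimEnd()) // drop stray trailing whitespace
    .join("\n")
    .trim()                        // no leading/trailing blank lines
}
```

Run every stable segment (system prompt, corpus, tool descriptions) through the same pass before it goes into the request, so two semantically identical builds of the prompt can't differ by a stray space.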