Caching

Implicit and explicit prompt caching. Free latency wins, dramatically cheaper repeat prompts.

FIG. 00 · CACHE HIT FORK
Most providers cache prompts internally to avoid recomputing the same prefix on repeat requests. Synapse Garden gives you a single switch (caching: 'auto') that handles every provider's caching contract correctly — opt in once on any streamText call and reap the savings everywhere.

FIG. 01 · PREFIX REUSE (SCHEMATIC)
Caching is prefix-keyed: the stable system prompt and corpus get a cache marker, while the per-request question stays uncached. A hit pays ~10% of the normal input rate; a miss writes the prefix for the next call.

Implicit vs explicit caching

Different providers handle caching differently:

| Provider | Caching | Action required |
| --- | --- | --- |
| OpenAI | Implicit | None — automatic for any prefix reused within 5–10 min |
| Google Gemini | Implicit | None |
| DeepSeek | Implicit | None |
| Anthropic Claude | Explicit | Needs cache_control markers on messages |
| MiniMax | Explicit | Needs cache markers |

For implicit cachers, repeat prompts cost less without any code change. For explicit cachers (Anthropic, MiniMax), you need to flag which content to cache.

Pass caching: 'auto' and Synapse Garden does the right thing per provider:

import { streamText } from "ai"

const result = streamText({
  model: "anthropic/claude-sonnet-4.6",
  baseURL: "https://synapse.garden/api/v1",
  apiKey: process.env.MG_KEY,
  system: "You are a helpful assistant with access to a large knowledge base… [long prompt]",
  prompt: "What does the law say about X?",
  providerOptions: {
    gateway: { caching: "auto" },
  },
})

For Anthropic, we add a cache_control marker at the end of your system prompt automatically. For OpenAI / Google / DeepSeek, the request passes through unchanged (they cache automatically).
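As an illustration, the Anthropic-bound request ends up with the system prompt carrying a trailing cache marker, roughly like this (a sketch of the transformed request, not the exact wire format):

```json
{
  "system": [
    {
      "type": "text",
      "text": "You are a helpful assistant with access to a large knowledge base… [long prompt]",
      "cache_control": { "type": "ephemeral" }
    }
  ]
}
```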

Without `caching: 'auto'`

Anthropic responses still work, but the prompt won't be cached — every call recomputes the full prefix. The caching: 'auto' flag is the cheapest performance win in the whole product.

Manual cache markers (Anthropic)

For fine-grained control, add cacheControl: { type: 'ephemeral' } to specific messages or content blocks:

import { streamText } from "ai"

const result = streamText({
  model: "anthropic/claude-sonnet-4.6",
  messages: [
    {
      role: "system",
      content: "You are a legal research assistant.",
    },
    {
      role: "user",
      content: [
        {
          type: "text",
          text: longLegalCorpus, // 50K tokens
          experimental_providerMetadata: {
            anthropic: { cacheControl: { type: "ephemeral" } },
          },
        },
        { type: "text", text: "What does the law say about copyright fair use?" },
      ],
    },
  ],
})

The cached corpus is 90% cheaper to read on the next call (within ~5 minutes) versus the full prompt rate.

Cache lifetimes

| Provider | TTL | Notes |
| --- | --- | --- |
| OpenAI | 5–10 min | Implicit; rolls per identical prefix |
| Google | 5 min | Implicit |
| DeepSeek | 5 min | Implicit |
| Anthropic | 5 min (ephemeral) / 1 hour (with ttl: '1h') | Explicit |
| MiniMax | 5 min | Explicit |

For longer caches (Anthropic):

{
  type: "text",
  text: largeCorpus,
  experimental_providerMetadata: {
    anthropic: { cacheControl: { type: "ephemeral", ttl: "1h" } },
  },
}

Longer TTL costs slightly more on the write but saves dramatically on reads if the same prompt is hot.

What gets cached

Caching works on prefix match — the model's KV-cache for the prompt up to a certain point is reused. If you change a single token in the middle of a long prompt, the cache benefit is lost from that point forward.

The pattern:

[ stable system prompt + corpus ]   ← cached
[ user's specific question ]         ← varies per call

If you put the user's question first, you defeat the cache. Always:

  1. Stable / shared content first — system prompt, large corpus, examples
  2. Per-request content last — the user's question, current state, dynamic data
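The ordering rule can be sketched as a small message builder — stable content first, per-request content last. The message shape follows the AI SDK convention; `buildMessages` itself is a hypothetical helper, not part of Synapse Garden:

```typescript
type Message = { role: "system" | "user"; content: string }

// Hypothetical helper: keeps the cacheable prefix stable across calls.
function buildMessages(stablePrefix: string, corpus: string, question: string): Message[] {
  return [
    // Stable, shared content first — this forms the cacheable prefix.
    { role: "system", content: stablePrefix },
    // The large corpus is still part of the stable prefix.
    { role: "user", content: corpus },
    // Per-request content last — the only part that varies per call.
    { role: "user", content: question },
  ]
}
```

Only the final message changes between calls, so every earlier token stays inside the cached prefix.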

Reading cache hit metrics

The response provider metadata tells you what was cached:

const result = streamText({ ... })
const meta = await result.providerMetadata

// Anthropic:
meta?.anthropic?.usage
// { cache_read_input_tokens: 49234, cache_creation_input_tokens: 0, input_tokens: 542, output_tokens: 312 }

// OpenAI:
meta?.openai?.cachedTokens
// 49234

Watch the ratio of cached vs uncached tokens — high cache hit rate = lower cost. Aim for >70% on hot prompts.
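That ratio is easy to compute from the Anthropic-style usage fields shown above. A hypothetical helper (not part of the SDK):

```typescript
interface AnthropicUsage {
  cache_read_input_tokens: number
  cache_creation_input_tokens: number
  input_tokens: number
}

// Fraction of input tokens served from cache; aim for > 0.7 on hot prompts.
function cacheHitRate(u: AnthropicUsage): number {
  const total = u.cache_read_input_tokens + u.cache_creation_input_tokens + u.input_tokens
  return total === 0 ? 0 : u.cache_read_input_tokens / total
}
```

With the usage numbers above (49,234 cached reads, 542 uncached input tokens), the hit rate is around 0.99 — a fully warm prompt.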

Cache cost

Cached reads cost a fraction of normal input tokens (typically 10–25%). Cache writes cost slightly more than normal input — this is the upfront fee for putting the prompt in the KV cache. The math works out cheap as long as you re-read the cached prefix at least 2–3 times.

Rough Anthropic numbers:

  • Normal input: 1× rate
  • Cache write: 1.25× rate (one-time on first call)
  • Cache read: 0.1× rate (per subsequent call)

Three calls with the same 50K prompt:

  • Without caching: 3 × 50K × 1× = 150K-token-equivalent cost
  • With caching: 50K × 1.25 + 2 × 50K × 0.1 = 72.5K-token-equivalent cost
  • Savings: ~52%
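The arithmetic above is simple enough to sketch as two functions, using the rough Anthropic multipliers (1.25× write, 0.1× read). Costs are in token-equivalents; these helpers are illustrative, not an API:

```typescript
// Cost when the first call writes the cache and later calls read it.
function cachedCost(promptTokens: number, calls: number): number {
  return promptTokens * 1.25 + (calls - 1) * promptTokens * 0.1
}

// Cost when every call pays the full input rate.
function uncachedCost(promptTokens: number, calls: number): number {
  return promptTokens * calls
}
```

For the 50K-token, three-call example: `uncachedCost(50_000, 3)` is 150K, `cachedCost(50_000, 3)` is 72.5K — about 52% saved, and the gap widens with every additional read.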

Use cases that benefit most

  • RAG with stable context — 10K+ tokens of retrieved docs that vary slowly across queries
  • Long system prompts — extensive instructions, examples, persona definitions
  • Multi-turn conversations — system prompt + earlier messages stay cached as the conversation grows
  • Document Q&A — same document, many questions

Use cases that don't benefit

  • Single-shot, one-time prompts — no repeat read, no cache benefit
  • Prompts where every byte changes — random sampling, unique generation per call
  • Tiny prompts — overhead of the cache write isn't worth it under ~1K tokens

Caveats

  • Cache key is exact — change one whitespace and you miss. Normalize your prompts before sending.
  • No cache sharing across keys. Caches are scoped per (provider, model, API key, prefix) — requests under different keys or models never share entries, though repeat requests with the same key and prefix do (that's the whole point).
  • TTL is best-effort. Providers may evict before the stated TTL under memory pressure.
  • Streaming + caching works fine — caching happens at the prompt-prefix layer, not the response layer.
  • Tools and caching. Tool definitions count toward the cached prefix — define your tools in a stable order and they'll be cached too.
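Two of these caveats lend themselves to small helpers — hypothetical sketches, not part of Synapse Garden or the AI SDK:

```typescript
// Caveat: the cache key is exact. Collapse whitespace so an incidental
// formatting change doesn't produce a different prefix.
function normalizePrompt(s: string): string {
  return s.replace(/\s+/g, " ").trim()
}

// Caveat: tool definitions count toward the prefix. Sort them by name so
// they serialize identically on every request.
interface ToolDef {
  name: string
  description: string
}

function stableToolOrder(tools: ToolDef[]): ToolDef[] {
  return [...tools].sort((a, b) => a.name.localeCompare(b.name))
}
```

Run both before every request so the serialized prompt is byte-identical whenever the content is logically the same.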