Caching
Implicit and explicit prompt caching. Free latency wins, dramatically cheaper repeat prompts.
Most providers cache prompts internally to avoid recomputing the same prefix on repeat requests. Synapse Garden gives you a single switch (caching: 'auto') that handles every provider's caching contract correctly — opt in once on any streamText call and reap the savings everywhere.
Implicit vs explicit caching
Different providers handle caching differently:
| Provider | Caching | Action required |
|---|---|---|
| OpenAI | Implicit | None — automatic for any prefix re-used in 5–10 min |
| Google Gemini | Implicit | None |
| DeepSeek | Implicit | None |
| Anthropic Claude | Explicit | Need cache_control markers on messages |
| MiniMax | Explicit | Need cache markers |
For implicit cachers, repeat prompts cost less without any code change. For explicit cachers (Anthropic, MiniMax), you need to flag which content to cache.
Auto caching (recommended)
Pass caching: 'auto' and Synapse Garden does the right thing per provider:
```ts
import { streamText } from "ai"

const result = streamText({
  model: "anthropic/claude-sonnet-4.6",
  baseURL: "https://synapse.garden/api/v1",
  apiKey: process.env.MG_KEY,
  system:
    "You are a helpful assistant with access to a large knowledge base… [long prompt]",
  prompt: "What does the law say about X?",
  providerOptions: {
    gateway: { caching: "auto" },
  },
})
```

For Anthropic, we add a cache_control marker at the end of your system prompt automatically. For OpenAI / Google / DeepSeek, the request passes through unchanged (they cache automatically).
Without the flag, Anthropic responses still work, but the prompt won't be cached and every call recomputes the full prefix. The caching: 'auto' flag is the cheapest performance win in the whole product.
Manual cache markers (Anthropic)
For fine-grained control, add cacheControl: { type: 'ephemeral' } to specific messages or content blocks:
```ts
import { streamText } from "ai"

const result = streamText({
  model: "anthropic/claude-sonnet-4.6",
  messages: [
    {
      role: "system",
      content: "You are a legal research assistant.",
    },
    {
      role: "user",
      content: [
        {
          type: "text",
          text: longLegalCorpus, // 50K tokens
          experimental_providerMetadata: {
            anthropic: { cacheControl: { type: "ephemeral" } },
          },
        },
        { type: "text", text: "What does the law say about copyright fair use?" },
      ],
    },
  ],
})
```

The cached corpus is 90% cheaper to read on the next call (within ~5 minutes) versus the full prompt rate.
Cache lifetimes
| Provider | TTL | Notes |
|---|---|---|
| OpenAI | 5–10 min | Implicit; rolls per identical prefix |
| Google Gemini | 5 min | Implicit |
| DeepSeek | 5 min | Implicit |
| Anthropic | 5 min (ephemeral) / 1 hour (with ttl: '1h') | Explicit |
| MiniMax | 5 min | Explicit |
For longer caches (Anthropic):
```ts
{
  type: "text",
  text: largeCorpus,
  experimental_providerMetadata: {
    anthropic: { cacheControl: { type: "ephemeral", ttl: "1h" } },
  },
}
```

Longer TTL costs slightly more on the write but saves dramatically on reads if the same prompt is hot.
What gets cached
Caching works on prefix match — the model's KV-cache for the prompt up to a certain point is reused. If you change a single token in the middle of a long prompt, the cache benefit is lost from that point forward.
The pattern:
```
[ stable system prompt + corpus ]  ← cached
[ user's specific question ]       ← varies per call
```

If you put the user's question first, you defeat the cache. Always, as sketched below:
- Stable / shared content first — system prompt, large corpus, examples
- Per-request content last — the user's question, current state, dynamic data
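As a concrete sketch of that ordering, reusing the auto-caching setup from earlier (the corpus and question values here are stand-ins for illustration):

```ts
import { streamText } from "ai"

// Stand-ins: a large, slowly-changing corpus and a per-request question.
const knowledgeBase: string = "… tens of thousands of tokens of reference material …"
const userQuestion: string = "What does the law say about copyright fair use?"

const result = streamText({
  model: "anthropic/claude-sonnet-4.6",
  baseURL: "https://synapse.garden/api/v1",
  apiKey: process.env.MG_KEY,
  // Stable, shared content first: identical bytes on every request, so it stays cached.
  system: `You are a research assistant.\n\n${knowledgeBase}`,
  // Per-request content last, so the cached prefix above is never invalidated.
  prompt: userQuestion,
  providerOptions: {
    gateway: { caching: "auto" },
  },
})
```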
Reading cache hit metrics
The response provider metadata tells you what cached:
```ts
const result = streamText({ ... })
const meta = await result.providerMetadata

// Anthropic:
meta?.anthropic?.usage
// { cache_read_input_tokens: 49234, cache_creation_input_tokens: 0, input_tokens: 542, output_tokens: 312 }

// OpenAI:
meta?.openai?.cachedTokens
// 49234
```

Watch the ratio of cached to uncached tokens; a high cache hit rate means lower cost. Aim for >70% on hot prompts.
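If you want a single number to track, a small helper along these lines (a sketch assuming the Anthropic usage shape shown above) computes the hit rate:

```ts
// Sketch: cache hit rate from Anthropic-style usage numbers.
// cached reads / (cached reads + cache writes + uncached input)
function cacheHitRate(usage: {
  cache_read_input_tokens: number
  cache_creation_input_tokens: number
  input_tokens: number
}): number {
  const total =
    usage.cache_read_input_tokens +
    usage.cache_creation_input_tokens +
    usage.input_tokens
  return total === 0 ? 0 : usage.cache_read_input_tokens / total
}

// With the numbers above: 49234 / (49234 + 0 + 542) ≈ 0.99
```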
Cache cost
Cached reads cost a fraction of normal input tokens (typically 10–25%). Cache writes cost slightly more than normal input; this is the upfront fee for putting the prompt in the KV cache. At these rates the cache pays for itself after a single re-read, and re-reading the prefix 2–3 times makes the savings substantial.
Rough Anthropic numbers:
- Normal input: 1× rate
- Cache write: 1.25× rate (one-time on first call)
- Cache read: 0.1× rate (per subsequent call)
Three calls with the same 50K prompt:
- Without caching: 3 × 50K × 1× = 150K-token-equivalent cost
- With caching: 50K × 1.25 + 2 × 50K × 0.1 = 72.5K-token-equivalent cost
- Savings: ~52%
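To run the same break-even arithmetic for your own prompt sizes, a back-of-the-envelope sketch using the multipliers above (illustrative only, not billing code):

```ts
// Rough Anthropic-style multipliers from the text.
const WRITE_RATE = 1.25 // first call writes the cache
const READ_RATE = 0.1   // subsequent calls read it

// Token-equivalent cost of `calls` requests sharing a cached prefix of `prefixTokens`.
function cachedCost(prefixTokens: number, calls: number): number {
  return prefixTokens * WRITE_RATE + prefixTokens * READ_RATE * (calls - 1)
}

function uncachedCost(prefixTokens: number, calls: number): number {
  return prefixTokens * calls
}

// Three calls with a 50K-token prefix:
// cachedCost(50_000, 3)   →  72_500 token-equivalents
// uncachedCost(50_000, 3) → 150_000 token-equivalents  (~52% savings)
```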
Use cases that benefit most
- RAG with stable context — 10K+ tokens of retrieved docs that vary slowly across queries
- Long system prompts — extensive instructions, examples, persona definitions
- Multi-turn conversations — system prompt + earlier messages stay cached as the conversation grows
- Document Q&A — same document, many questions
Use cases that don't benefit
- Single-shot, one-time prompts — no repeat read, no cache benefit
- Prompts where every byte changes — random sampling, unique generation per call
- Tiny prompts — overhead of the cache write isn't worth it under ~1K tokens
Caveats
- Cache key is exact. Change a single whitespace character and you miss; normalize your prompts before sending (see the sketch after this list).
- Cache scope is narrow. The cache is keyed per (provider, model, key, prefix), so requests to a different model or provider, or from a different API key, start cold even with an identical prompt.
- TTL is best-effort. Providers may evict before the stated TTL under memory pressure.
- Streaming + caching works fine — caching happens at the prompt-prefix layer, not the response layer.
- Tools and caching. Tool definitions count toward the cached prefix — define your tools in a stable order and they'll be cached too.
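For the exact-match caveat above, even a minimal normalization pass helps. This is a sketch, not part of the gateway; adapt it to whatever invariants your prompts rely on:

```ts
// Sketch: normalize prompt text so identical content always produces identical bytes.
function normalizePrompt(text: string): string {
  return text
    .replace(/\r\n/g, "\n")        // consistent line endings
    .split("\n")
    .map((line) => line.trimEnd()) // drop stray trailing whitespace
    .join("\n")
    .trim()                        // no leading/trailing blank lines
}
```

Run every stable segment (system prompt, corpus, tool descriptions) through the same pass before it goes into the request, so two semantically identical builds of the prompt can't differ by a stray space.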