How to switch from OpenAI to Claude without rewriting
A practical walkthrough of moving an OpenAI integration to Claude, Gemini, or any other model. SDK swap, prompt translation, tool-calling differences, streaming gotchas, and the parts that actually break.
- migration
- openai
- anthropic
- how-to
In our experience migrating internal services across providers, the hard part of switching LLMs is rarely the code. The OpenAI SDK and the Anthropic SDK are surprisingly similar at the surface. The hard part is everything underneath: prompts that worked on gpt-4o returning slightly different shapes from claude-sonnet-4-6, tool-call schemas that look identical but behave differently when arguments are nested, and streaming events that arrive in a different order.
A drop-in migration is one where the model id and base URL change, but no other code does. That's the bar this post aims for.
This is a complete migration walkthrough. We're going to assume you have an OpenAI integration in production today and you want to either move to Claude entirely or, more realistically, route some requests to Claude and others to GPT depending on the task. The end state we're aiming for is one codebase that talks to both without forking.
The shortest possible migration
If your code looks like this:
import OpenAI from "openai"
const client = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
})

const response = await client.chat.completions.create({
  model: "gpt-4o-mini",
  messages: [{ role: "user", content: "Summarize this email." }],
})

The minimum viable migration to Claude is two lines. Point the OpenAI SDK at a Claude-compatible endpoint and change the model string:
const client = new OpenAI({
  apiKey: process.env.SYNAPSE_GARDEN_KEY, // mg_live_...
  baseURL: "https://synapse.garden/api/v1",
})

const response = await client.chat.completions.create({
  model: "anthropic/claude-sonnet-4-6", // was: "gpt-4o-mini"
  messages: [{ role: "user", content: "Summarize this email." }],
})

That's it for a basic chat call. The OpenAI SDK speaks an OpenAI-compatible interface, the gateway translates the request into Anthropic's messages API behind the scenes, and the response comes back in the OpenAI shape. You can keep using client.chat.completions.create and treat it as an OpenAI call.
Two things to know:
- The base URL has /v1 for OpenAI-style routes. Anthropic's official SDK omits the /v1 because it adds it internally; if you decide to use the Anthropic SDK directly instead, point it at https://synapse.garden/api (no /v1) and use model: "claude-sonnet-4-6" (no anthropic/ prefix, since Anthropic's SDK assumes its own catalog). A sketch of that setup follows this list.
- The model id format is <provider>/<model> when going through the gateway. This is how a single endpoint exposes 100+ models without naming collisions. The full catalog and per-model pricing live on /pricing.
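For reference, the direct-SDK variant looks roughly like this. This is a sketch, not the gateway's documented setup: it assumes the official @anthropic-ai/sdk package and reuses the same environment variable as the example above; max_tokens is required by Anthropic's messages API.

import Anthropic from "@anthropic-ai/sdk"

const anthropic = new Anthropic({
  apiKey: process.env.SYNAPSE_GARDEN_KEY,
  baseURL: "https://synapse.garden/api", // no /v1 -- the SDK appends it internally
})

const message = await anthropic.messages.create({
  model: "claude-sonnet-4-6", // no anthropic/ prefix
  max_tokens: 1024, // required by the messages API
  messages: [{ role: "user", content: "Summarize this email." }],
})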
That covers the simple case. Now the parts that actually trip people up.
Prompts don't transfer one-to-one
A prompt tuned on GPT-4o will work on Claude. It will not necessarily work as well. The two model families have different default behaviors:
- GPT tends to over-format. If you ask GPT-4o for a list of three items, you'll usually get bullet points, often with bold lead-ins. Claude is more willing to write prose.
- Claude tends to over-explain. Claude likes to caveat and acknowledge. If you ask for a one-word answer, you have a better-than-even chance of getting "Sure! The answer is X." from Claude.
- System prompts are weighted differently. Claude is more obedient to system prompts than GPT. If your existing system prompt is loose ("You're a helpful assistant"), GPT will fill in defaults; Claude will take the prompt at face value and may produce sparser output than you expect.
Practical rule: re-test the top 20 inputs your production endpoint sees. When we tried this internally on a customer-support summarization workload, the CI suite stayed green, but reading the first 10 Claude outputs by hand surfaced two prompts that needed tightening to match the GPT baseline. The test suite checks shape, not feel.
The corrections that usually fix it:
- Add response-shape constraints to the user message, not just the system prompt: "Reply with one sentence. No preamble."
- Use Claude's preference for XML tags. Claude was trained heavily on XML-tagged inputs (Anthropic's prompt engineering guide covers this). <input>...</input> and <output_format>...</output_format> work well. GPT also handles XML, so you can write prompts that work for both; a combined example follows this list.
- Move examples up. Few-shot examples earlier in the prompt help Claude more than GPT. Claude follows patterns aggressively.
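Put together, a prompt that holds up on both families can look like the sketch below. emailBody is a placeholder for your own input, and client is the gateway-pointed OpenAI client from earlier.

const response = await client.chat.completions.create({
  model: "anthropic/claude-sonnet-4-6",
  messages: [
    {
      role: "system",
      content: "You summarize customer emails. Follow <output_format> exactly.",
    },
    {
      role: "user",
      content: [
        "<input>",
        emailBody, // placeholder: the text you want summarized
        "</input>",
        "<output_format>",
        "Reply with one sentence. No preamble. No bullet points.",
        "</output_format>",
      ].join("\n"),
    },
  ],
})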
Tool calling: the same API, slightly different physics
Both OpenAI and Anthropic support tool calls, and through an OpenAI-compatible gateway both look identical to your code. The differences are at the model layer:
- Claude is more conservative about calling tools. It will often answer from its own context when GPT would call a tool. This is good for cost and bad if you want strict tool-only behavior.
- Claude handles parallel tool calls differently. Claude tends to call one tool, look at the result, then call the next. GPT-4o is happier issuing several tool calls in one turn. If your application depends on parallel calls (for example, fanning out to multiple data sources), measure on both before committing.
- Tool result formatting matters more for Claude. When you return a tool result, give Claude well-structured JSON or XML rather than free text. Claude is sensitive to result shape; it can lose track of which tool produced which output if you concatenate everything into a string.
The OpenAI tool-call format works through the gateway:
const response = await client.chat.completions.create({
  model: "anthropic/claude-sonnet-4-6",
  messages: [...],
  tools: [
    {
      type: "function",
      function: {
        name: "get_weather",
        description: "Get the current weather for a city.",
        parameters: {
          type: "object",
          properties: {
            city: { type: "string" },
          },
          required: ["city"],
        },
      },
    },
  ],
})

You don't need to rewrite this. The gateway translates between OpenAI's tools array and Anthropic's tools array (which has the same shape but different envelope). The response also comes back in OpenAI shape, with tool_calls on the message.
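Handling the result is also unchanged. Here's a sketch of the round trip, staying in OpenAI shape throughout; getWeather and the messages array stand in for your own implementation and conversation history.

const message = response.choices[0].message

if (message.tool_calls?.length) {
  const call = message.tool_calls[0]
  const args = JSON.parse(call.function.arguments) // e.g. { "city": "Berlin" }

  const result = await getWeather(args.city) // your own function

  const followUp = await client.chat.completions.create({
    model: "anthropic/claude-sonnet-4-6",
    messages: [
      ...messages,
      message, // the assistant turn that contains the tool call
      {
        role: "tool",
        tool_call_id: call.id,
        // Structured JSON rather than free text -- see the note above.
        content: JSON.stringify(result),
      },
    ],
  })
}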
Streaming: same surface, different event order
Both providers support server-sent events for streaming. Both speak SSE. The OpenAI SDK exposes streaming the same way regardless of upstream:
const stream = await client.chat.completions.create({
  model: "anthropic/claude-sonnet-4-6",
  messages: [{ role: "user", content: "Write a haiku." }],
  stream: true,
})

for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content ?? "")
}

This works. But if you've written code that depends on specific event ordering — for example, "the model finishes thinking before it emits the first content token" — the assumptions don't always hold across providers.
A few specifics:
- Claude's "thinking" blocks (extended-thinking models) emit a separate event type. The gateway maps these into OpenAI's
deltashape, so your existing for-loop won't break. But if you're parsing custom event types, you'll need to handle thinking content differently. - Tool-call deltas arrive incrementally under both providers. OpenAI sends
delta.tool_calls[]with index-keyed partial JSON; Claude does the same. The gateway preserves indexes. Existing tool-streaming code works. - The
finish_reasonfield is normalized across providers (stop,length,tool_calls,content_filter). If you have a switch onfinish_reason, you don't need to change it.
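For completeness, the accumulator in question, as a sketch against a streamed request that includes tools:

const toolCalls: { name: string; arguments: string }[] = []

for await (const chunk of stream) {
  for (const tc of chunk.choices[0]?.delta?.tool_calls ?? []) {
    // Partial JSON arrives keyed by index; concatenate per slot.
    toolCalls[tc.index] ??= { name: "", arguments: "" }
    if (tc.function?.name) toolCalls[tc.index].name += tc.function.name
    if (tc.function?.arguments) toolCalls[tc.index].arguments += tc.function.arguments
  }
}

// After the stream ends, each toolCalls[i].arguments is a complete JSON string.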
Pricing: stop comparing per-million-token rates in isolation
The most common migration mistake is comparing list prices in spreadsheets. claude-sonnet-4-6 is priced differently per token than gpt-4o, but token counts also differ between models for the same prompt — different tokenizers split words at different boundaries. A 1,000-word document is 1,300 tokens with cl100k_base (GPT-4 tokenizer) and roughly 1,400 with Claude's tokenizer. Across a year of traffic, that adds up.
The honest comparison is cost per task. Run 200 representative prompts through both, measure output tokens, multiply by list price, compare. The answer is rarely what the per-million headline rate suggests.
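A sketch of that arithmetic follows. The per-million rates below are placeholders, not quotes; substitute the figures from /pricing and your own prompt set.

const PRICE_PER_MTOK: Record<string, { input: number; output: number }> = {
  "openai/gpt-4o-mini": { input: 0.15, output: 0.6 }, // placeholder rates
  "anthropic/claude-sonnet-4-6": { input: 3.0, output: 15.0 }, // placeholder rates
}

async function costPerTask(model: string, prompts: string[]) {
  let dollars = 0
  for (const prompt of prompts) {
    const res = await client.chat.completions.create({
      model,
      messages: [{ role: "user", content: prompt }],
    })
    const { prompt_tokens, completion_tokens } = res.usage! // OpenAI-shape usage
    dollars +=
      (prompt_tokens / 1e6) * PRICE_PER_MTOK[model].input +
      (completion_tokens / 1e6) * PRICE_PER_MTOK[model].output
  }
  return dollars / prompts.length // average dollars per task
}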
While you're benchmarking, also measure:
- First-token latency. GPT-4o-mini and Claude Haiku both target sub-second; in practice, latency varies by region and time of day. Measure during your real traffic windows.
- Cache hit rates. Both providers offer prompt caching, with different rules. Anthropic caches blocks marked with cache_control: { type: "ephemeral" }; OpenAI caches automatically when prefixes match. If you have long, repeated system prompts, the caching strategy meaningfully changes your bill. An Anthropic-native example follows this list.
- Refusal rates. Both models occasionally refuse benign prompts. The rate isn't zero on either side. If your application is in a domain that touches policy edges (medical, legal, compliance), test refusal behavior at scale.
What actually breaks
Here are the migration failures we've seen most often:
- Hard-coded token-count assumptions. Code that says if (tokens > 3500) truncate(...) was tuned for a specific tokenizer. After switching, the truncation point is wrong by several percent. Re-tune.
- JSON parsing on non-JSON output. GPT is more reliably valid-JSON than Claude when asked for JSON output. Claude sometimes wraps responses in prose ("Here is the JSON: ..."). Use the response_format: { type: "json_object" } parameter, which the gateway translates correctly to Anthropic's structured output mode. A defensive-parsing sketch follows this list.
- Test suites that pin to exact strings. Output isn't deterministic across providers (or even within the same provider across versions). Tests that assert exact substrings will go red on migration. Move to semantic assertions: "the response mentions a refund," not "the response contains the word 'refund' at position 12."
- Rate-limit handling. OpenAI and Anthropic have different rate-limit headers and different burst behavior. The gateway normalizes the response shape but doesn't paper over the underlying limits. If you were close to OpenAI's TPM limit, you'll be close to a different limit on Claude.
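For the JSON case, a belt-and-suspenders pattern, sketched with an illustrative prompt: ask for json_object and still tolerate a prose wrapper.

const response = await client.chat.completions.create({
  model: "anthropic/claude-sonnet-4-6",
  messages: [{ role: "user", content: "Extract the invoice fields as JSON." }],
  response_format: { type: "json_object" },
})

const raw = response.choices[0].message.content ?? ""

let data: unknown
try {
  data = JSON.parse(raw)
} catch {
  // Fall back to the first {...} block if the model wrapped the JSON in prose.
  const match = raw.match(/\{[\s\S]*\}/)
  if (!match) throw new Error("no JSON object in model output")
  data = JSON.parse(match[0])
}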
Doing both at once
Most teams that migrate end up routing different request types to different models. Cheap classification calls go to a small, fast model; long-form synthesis goes to a frontier model; vision goes to whichever provider has the better current model for the specific image type. This is the long-term value of going through a gateway: the model id becomes a deployment-time decision instead of a code change.
A common shape:
async function summarize(text: string, mode: "fast" | "detailed") {
  const model =
    mode === "fast"
      ? "openai/gpt-5.4-mini"
      : "anthropic/claude-opus-4-7"

  return client.chat.completions.create({
    model,
    messages: [
      { role: "system", content: SUMMARY_SYSTEM_PROMPT },
      { role: "user", content: text },
    ],
  })
}

This pattern works because the calling code doesn't change when you swap models. The only thing that changes is the model id, and the model id is data, not code.
What to do this week
If you're on OpenAI and considering Claude, the cheapest experiment is:
- Sign up for a Synapse Garden account (free tier, no card).
- Change baseURL and the API key in a non-production environment.
- Pick one endpoint that handles a non-critical workload — internal summarization, log classification, anything where wrong is recoverable.
- Add a 50/50 split: half the requests to your existing model, half to Claude. Compare outputs for a week. (A rough split helper is sketched after this list.)
- Decide based on what you actually saw, not the spreadsheet.
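For step 4, a deterministic split keeps the same request on the same model, which makes the week-long comparison easier to read. A sketch; requestId and text are placeholders for whatever identifies and carries your request.

import { createHash } from "node:crypto"

function pickModel(requestId: string): string {
  // Hash the request id so the 50/50 assignment is stable across retries.
  const bucket = createHash("sha256").update(requestId).digest()[0] % 2
  return bucket === 0 ? "openai/gpt-4o-mini" : "anthropic/claude-sonnet-4-6"
}

const response = await client.chat.completions.create({
  model: pickModel(requestId),
  messages: [{ role: "user", content: text }],
})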
You don't have to commit to anything. The migration is reversible up to the point you delete your OpenAI key — and we'd suggest not doing that for at least a quarter, regardless of what the comparison says.
For the architectural side of running both providers under one set of credentials, see per-project API keys for LLMs. For the latency math, 50ms of LLM proxy overhead covers what the gateway costs you on the wire.