Vision input

Pass images to multimodal models. URLs, base64, files. Mixed content blocks.

FIG. 00 · VISION INPUT · IMAGE + TEXT

Vision input lets you pass images alongside text. Most modern flagship models (GPT-5.4, Claude 4.6, Gemini 2.5) accept images natively, and with the AI SDK you can mix text and image content blocks in the same call to streamText. Filter the catalog by the Vision modality on /models to see all supported models.

FIG. 01 · MULTIMODAL INPUT · SCHEMATIC
Each `user` message can carry interleaved text and image content blocks. URLs pass through; buffers and `File` objects are base64-encoded by the SDK. The model sees them as a single multimodal turn and answers in text.

With the AI SDK

The AI SDK accepts images as URL strings, Uint8Array buffers, or File/Blob objects in its messages array:

import { generateText } from "ai"
import { createOpenAI } from "@ai-sdk/openai"
import fs from "node:fs"

// Point an OpenAI-compatible provider at the gateway
const synapse = createOpenAI({
  baseURL: "https://synapse.garden/api/v1",
  apiKey: process.env.MG_KEY,
})

const { text } = await generateText({
  model: synapse("openai/gpt-5.4"),
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: "What's in this image?" },
        { type: "image", image: "https://example.com/photo.jpg" },
      ],
    },
  ],
})

From a local file

const buffer = fs.readFileSync("photo.jpg")

await generateText({
  model: "openai/gpt-5.4",
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: "Describe this image." },
        { type: "image", image: buffer, mediaType: "image/jpeg" },
      ],
    },
  ],
})

The AI SDK base64-encodes the buffer and sends it as a data URL. Maximum image size depends on the model — typically 20MB.
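
If you want to see exactly what goes over the wire, the same request can be written with a hand-built data URL; the SDK produces an equivalent payload from a raw buffer:

const base64 = fs.readFileSync("photo.jpg").toString("base64")

// Equivalent to passing the buffer: the data URL travels as an inline image
{ type: "image", image: `data:image/jpeg;base64,${base64}` }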

Multiple images at once

messages: [
  {
    role: "user",
    content: [
      { type: "text", text: "Pick the better photo and explain why." },
      { type: "image", image: "https://example.com/option-a.jpg" },
      { type: "image", image: "https://example.com/option-b.jpg" },
    ],
  },
]

With the OpenAI SDK

import OpenAI from "openai"

const client = new OpenAI({
  baseURL: "https://synapse.garden/api/v1",
  apiKey: process.env.MG_KEY,
})

const res = await client.chat.completions.create({
  model: "openai/gpt-5.4",
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: "What's in this image?" },
        {
          type: "image_url",
          image_url: { url: "https://example.com/photo.jpg" },
        },
      ],
    },
  ],
})

For base64:

const base64 = fs.readFileSync("photo.jpg").toString("base64")

const res = await client.chat.completions.create({
  model: "openai/gpt-5.4",
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: "Describe this image." },
        {
          type: "image_url",
          image_url: { url: `data:image/jpeg;base64,${base64}` },
        },
      ],
    },
  ],
})

With the Anthropic SDK

Anthropic uses a slightly different content block format:

import Anthropic from "@anthropic-ai/sdk"

// Assumes the gateway exposes an Anthropic-compatible endpoint at the same base URL
const client = new Anthropic({
  baseURL: "https://synapse.garden/api/v1",
  apiKey: process.env.MG_KEY,
})

const message = await client.messages.create({
  model: "anthropic/claude-opus-4.6",
  max_tokens: 1024,
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: "What's in this image?" },
        {
          type: "image",
          source: {
            type: "url",
            url: "https://example.com/photo.jpg",
          },
        },
      ],
    },
  ],
})

Anthropic also supports base64-encoded images:

{
  type: "image",
  source: {
    type: "base64",
    media_type: "image/jpeg",
    data: base64String,
  },
}

Image quality / detail

OpenAI vision models accept a detail flag that trades cost for resolution:

{
  type: "image_url",
  image_url: {
    url: "…",
    detail: "high", // "low" | "high" | "auto" (default)
  },
}
Detail | Behavior                          | Cost impact
low    | 512×512 fixed, ~85 tokens         | Cheapest
high   | Full resolution, tiled at 768×768 | Up to 8× more tokens
auto   | Model decides                     | Variable

For most use cases, auto is fine. Use low when the image is decorative; use high when you need to read fine print or count items.
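
For instance, a cheap yes/no classification over a thumbnail can pin detail to low (this sketch reuses the client from the OpenAI SDK section above; the URL is a placeholder):

const res = await client.chat.completions.create({
  model: "openai/gpt-5.4",
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: "Is this a photo of a shoe? Answer yes or no." },
        {
          type: "image_url",
          // "low" caps processing at 512×512 for a flat ~85 tokens
          image_url: { url: "https://example.com/thumb.jpg", detail: "low" },
        },
      ],
    },
  ],
})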

Token cost for images

Images are billed as input tokens based on resolution. Rough heuristics:

Resolution | Approximate tokens (high detail)
512×512    | ~170
1024×1024  | ~765
2048×2048  | ~3,000
4096×4096  | ~12,000

Different providers use different tokenizations, so the exact number varies. For accurate cost projection, send a sample image and check result.usage on the response.
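
A minimal measurement sketch, reusing the synapse provider from earlier with a placeholder image URL:

const probe = await generateText({
  model: synapse("openai/gpt-5.4"),
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: "Describe this image in one sentence." },
        { type: "image", image: "https://example.com/sample.jpg" },
      ],
    },
  ],
})

// The prompt-side token count includes the image; compare with a text-only
// run of the same prompt to isolate the image's cost
console.log(probe.usage)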

Vision + tools

Vision works seamlessly with tool use. The model can call tools based on what it sees:

import { streamText, tool } from "ai"
import { z } from "zod"

streamText({
  model: synapse("openai/gpt-5.4"),
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: "What's the price of the item in this photo?" },
        { type: "image", image: shoppingPhoto },
      ],
    },
  ],
  tools: {
    searchProducts: tool({
      description: "Search the product catalog by name",
      parameters: z.object({ query: z.string() }),
      execute: async ({ query }) => searchCatalog(query),
    }),
  },
  maxSteps: 2, // allow a second step so the model can answer after the tool result
})

The model "sees" the photo, extracts a query, calls the tool, and uses the tool result to answer. No extra orchestration needed.

File input (PDFs, audio)

Some models (Gemini, Claude with file beta) accept richer file types:

// Gemini accepts PDFs directly
messages: [
  {
    role: "user",
    content: [
      { type: "text", text: "Summarize this paper." },
      {
        type: "file",
        data: pdfBuffer,
        mediaType: "application/pdf",
      },
    ],
  },
]

Filter the catalog by file-input capability to see which models accept what. Audio input is supported on google/gemini-2.5-pro (multimodal) and a handful of specialized speech-to-text models.
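
Audio goes through the same file content block. A sketch, assuming an MP3 on disk (meeting.mp3 is a placeholder) and the synapse provider from earlier:

const audio = fs.readFileSync("meeting.mp3")

await generateText({
  model: synapse("google/gemini-2.5-pro"),
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: "Transcribe the first minute of this recording." },
        { type: "file", data: audio, mediaType: "audio/mpeg" },
      ],
    },
  ],
})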

Caveats

  • EXIF orientation isn't always honored. Rotate the image yourself before sending if it matters (see the sketch after this list).
  • Resolution caps vary. Models silently downsample huge images, but extremely large inputs can still degrade output. Resize to 2048×2048 or smaller for predictable behavior (also covered in the sketch below).
  • Faces, IDs, license plates — provider-side safety policies may refuse to describe these. The model's response will say so explicitly.
  • Animated GIFs / video frames — most vision models accept the first frame only. For multi-frame analysis, extract frames and send them as separate images, or use a video-capable model (Gemini 2.5 Pro accepts short video clips).
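
A preprocessing sketch covering both the EXIF and resolution caveats, using the sharp library (the choice of sharp is an assumption; any EXIF-aware image library works):

import sharp from "sharp"

// Normalize before sending: bake in EXIF rotation, cap dimensions at 2048×2048
async function prepareImage(path: string): Promise<Buffer> {
  return sharp(path)
    .rotate() // with no arguments, auto-rotates based on the EXIF orientation tag
    .resize(2048, 2048, { fit: "inside", withoutEnlargement: true })
    .jpeg({ quality: 90 })
    .toBuffer()
}

const image = await prepareImage("photo.jpg")
// ...then send it as { type: "image", image, mediaType: "image/jpeg" }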