Embeddings

Convert text to vectors. RAG, search, classification, clustering — all the same primitive.

FIG. 00 · EMBEDDINGS: text → float[N]

An embedding is a fixed-length vector that represents the meaning of a piece of text. Two pieces of text that mean similar things will have similar vectors (cosine similarity near 1). Embeddings power semantic search, clustering, classification, and retrieval-augmented generation (RAG). With the AI SDK, use embed for one input and embedMany for batches.

FIG. 01 · RETRIEVAL PIPELINE (schematic). Indexing runs once per document: chunk → batch-embed via `POST /v1/embeddings` → store vectors in pgvector or similar. Query runs on every user request: embed the query → ANN search for top-k → optional `POST /v1/rerank` to reorder by a cross-encoder.
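
On the wire, the embedding step in that pipeline is a single OpenAI-compatible request. A minimal sketch with fetch (the base URL and MG_KEY environment variable match the SDK examples below):

// What the SDKs send under the hood: a plain POST to the embeddings endpoint.
const res = await fetch("https://synapse.garden/api/v1/embeddings", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.MG_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    model: "openai/text-embedding-3-large",
    input: ["chunk one", "chunk two"],
  }),
})

const { data } = await res.json()
// data[i].embedding is the vector for input[i]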

Quick example

import { embed } from "ai"
import { createOpenAI } from "@ai-sdk/openai"

// Point the OpenAI-compatible provider at the gateway.
const synapse = createOpenAI({
  baseURL: "https://synapse.garden/api/v1",
  apiKey: process.env.MG_KEY,
})

const { embedding } = await embed({
  model: synapse.textEmbeddingModel("openai/text-embedding-3-large"),
  value: "The quick brown fox jumps over the lazy dog.",
})

console.log(embedding.length)   // 3072 for text-embedding-3-large
console.log(embedding.slice(0, 5))  // [0.012, -0.034, 0.089, ...]

Batch embed

Embedding hundreds of strings one at a time wastes round-trips. Use embedMany:

import { embedMany } from "ai"

const { embeddings } = await embedMany({
  model: "openai/text-embedding-3-large",
  values: chunks, // string[]
})

// embeddings.length === chunks.length
// each embeddings[i] is a number[] of dimension 3072

The AI SDK splits the batch into multiple requests under the hood when it exceeds the provider's per-call limit (typically 2048 inputs per request).
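
If you call the REST endpoint directly instead, that splitting is on you. A minimal sketch of the same batching (the 2048 default mirrors the typical limit above):

// Split an array into provider-sized batches before sending embedding requests.
function toBatches<T>(items: T[], size = 2048): T[][] {
  const batches: T[][] = []
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size))
  }
  return batches
}

// toBatches(chunks).length === Math.ceil(chunks.length / 2048)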

OpenAI-compatible API

import OpenAI from "openai"

const client = new OpenAI({
  baseURL: "https://synapse.garden/api/v1",
  apiKey: process.env.MG_KEY,
})

const res = await client.embeddings.create({
  model: "openai/text-embedding-3-large",
  input: ["The quick brown fox", "Jumps over the lazy dog"],
})

for (const item of res.data) {
  console.log(item.index, item.embedding.length)
}

Available embedding models

Filter the catalog by the Embeddings modality on /models. Common choices:

Model | Dimensions | Max tokens | Notes
openai/text-embedding-3-large | 3072 | 8192 | OpenAI flagship; best quality
openai/text-embedding-3-small | 1536 | 8192 | OpenAI default; cheap and capable
cohere/embed-v4 | 1024 | 8192 | Multilingual; 100+ languages
voyage/voyage-3-large | 1024 | 32000 | Long context; strong on code/legal
google/gemini-embedding-exp | 768 | 8192 | Google embeddings

Reducing dimensions

OpenAI's v3 models support dimension truncation — request a smaller vector and they truncate-then-renormalize. Useful for storage cost:

await client.embeddings.create({
  model: "openai/text-embedding-3-large",
  input: "...",
  dimensions: 1024, // truncate from 3072
})

Quality drops gracefully; 1024 dimensions is enough for most retrieval tasks.
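
You can do the same operation client-side on a full-size vector, which is handy if you already stored 3072-dim embeddings and want to experiment with smaller ones. A minimal sketch of truncate-then-renormalize (this mirrors what the API does server-side and only makes sense for models designed for truncation, like OpenAI's v3 family):

// Keep the first k dimensions, then rescale back to unit length.
function truncateEmbedding(vec: number[], k: number): number[] {
  const head = vec.slice(0, k)
  const norm = Math.hypot(...head) || 1
  return head.map((x) => x / norm)
}

// e.g. truncateEmbedding(embedding, 1024) on a stored 3072-dim vector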

Building a RAG pipeline

import { embed, embedMany, generateText } from "ai"
// `synapse` below is the provider instance created in the quick example above.

// 1. Index your corpus (do this once at ingest time)
// chunkText is your own splitter; a sketch appears under Caveats below
const chunks = chunkText(myDocument, { size: 800, overlap: 100 })
const { embeddings } = await embedMany({
  model: "openai/text-embedding-3-large",
  values: chunks,
})

// Persist (chunks[i], embeddings[i]) tuples to your vector store —
// Postgres pgvector, Pinecone, Weaviate, Qdrant, etc.
await db.insert("docs", chunks.map((c, i) => ({ text: c, vector: embeddings[i] })))

// 2. Query (do this every request)
const userQuestion = "How do I rotate an API key?"
const { embedding: queryVec } = await embed({
  model: "openai/text-embedding-3-large",
  value: userQuestion,
})

// Assumes `db` is your SQL client and serializes number[] as a pgvector literal.
const top = await db.query`
  SELECT text, 1 - (vector <=> ${queryVec}) AS similarity
  FROM docs
  ORDER BY vector <=> ${queryVec}
  LIMIT 5
`

// 3. Generate with retrieved context
const { text } = await generateText({
  model: "openai/gpt-5.4-mini",
  system: "Answer using only the context provided.",
  prompt: `Context:\n${top.map((t) => t.text).join("\n---\n")}\n\nQuestion: ${userQuestion}`,
})

console.log(text)

For better results, over-retrieve (e.g. top-20 instead of top-5) and rerank the candidates with a cross-encoder before generation; see Reranking.
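
A rough sketch of that step against the gateway's `POST /v1/rerank` endpoint mentioned above. The model ID and request body (query / documents / top_n) are assumptions based on the common Cohere-style rerank contract; check the Reranking page for the exact shape:

// Hypothetical rerank call: reorder retrieved chunks by cross-encoder score.
// Body shape and model ID are assumptions; see the Reranking docs for specifics.
const rerankRes = await fetch("https://synapse.garden/api/v1/rerank", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.MG_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    model: "cohere/rerank-v3.5",
    query: userQuestion,
    documents: top.map((t) => t.text),
    top_n: 3,
  }),
})

const { results } = await rerankRes.json()
// results: [{ index, relevance_score }, ...] in the Cohere-style contract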

Cosine similarity

import { cosineSimilarity } from "ai"

const a = await embed({ model: "...", value: "..." })
const b = await embed({ model: "...", value: "..." })

const score = cosineSimilarity(a.embedding, b.embedding)
// 1.0 = identical, 0.0 = unrelated, -1.0 = opposite

For pairwise scoring across many candidates, batch the query through a vector DB rather than computing similarity in JS — it's orders of magnitude faster.
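
That said, for a small corpus (a few thousand chunks) a brute-force scan in process is perfectly workable. A minimal sketch using the SDK helper:

import { cosineSimilarity } from "ai"

// Brute-force top-k over in-memory vectors. Fine for small corpora;
// move to a vector DB with an ANN index once the corpus grows.
function topK(
  query: number[],
  docs: { text: string; vector: number[] }[],
  k = 5,
) {
  return docs
    .map((d) => ({ text: d.text, score: cosineSimilarity(query, d.vector) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
}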

Storing embeddings

Recommended vector stores:

Store | Best for | Notes
Postgres + pgvector | Most apps | Same DB as your relational data; HNSW index
Pinecone | Managed; massive scale | Pay-as-you-go
Weaviate | Hybrid (vector + keyword) | Open source + managed
Qdrant | Speed + filtering | Rust-based; very fast
LanceDB | Embedded / local | "SQLite for vectors"
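
For the pgvector route, the schema behind the earlier db.query example might look like the sketch below. Table and index names are illustrative; note that pgvector's HNSW index currently caps indexed vectors at 2000 dimensions, so store a reduced 1024-dim embedding (see Reducing dimensions) or use the halfvec type for full 3072-dim vectors:

// Illustrative pgvector setup matching the RAG example above.
await db.query`CREATE EXTENSION IF NOT EXISTS vector`
await db.query`
  CREATE TABLE IF NOT EXISTS docs (
    id     bigserial PRIMARY KEY,
    text   text NOT NULL,
    vector vector(1024)   -- reduced dims; HNSW indexing caps out at 2000
  )
`
await db.query`
  CREATE INDEX IF NOT EXISTS docs_vector_idx
  ON docs USING hnsw (vector vector_cosine_ops)
`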

Caveats

  • Pick one model and stick with it. Embeddings from different models live in different vector spaces — you can't mix them. If you swap models, re-index your whole corpus.
  • Normalize before storing. Most vector DBs assume unit vectors for cosine similarity. The big providers return normalized vectors, but verify with Math.hypot(...vec) ≈ 1.
  • Don't embed raw HTML. Strip tags, run through a text extractor, optionally summarize. The model's tokens are precious.
  • Chunk smartly. 600–1200 character chunks with 10–15% overlap is the standard. Break at paragraph or sentence boundaries, not arbitrary character counts (see the sketch after this list).
  • Cache where possible. If you embed the same text twice, you pay twice. Cache by SHA-256 of the input.
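
The chunkText helper used in the RAG example above isn't part of any SDK; you own it. A minimal character-based version that follows these chunking rules (sizes in characters, purely illustrative):

// Minimal chunker: ~`size` characters per chunk with `overlap` characters of
// overlap, preferring to break at paragraph or sentence boundaries.
function chunkText(
  text: string,
  { size = 800, overlap = 100 }: { size?: number; overlap?: number } = {},
): string[] {
  const chunks: string[] = []
  let start = 0
  while (start < text.length) {
    let end = Math.min(start + size, text.length)
    if (end < text.length) {
      // Look for a paragraph or sentence break in the back half of the window.
      const slice = text.slice(start, end)
      const breakAt = Math.max(slice.lastIndexOf("\n\n"), slice.lastIndexOf(". "))
      if (breakAt > size * 0.5) end = start + breakAt + 1
    }
    chunks.push(text.slice(start, end).trim())
    if (end >= text.length) break
    start = end - overlap
  }
  return chunks
}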

Pricing

Embedding models are billed per million input tokens. Output dimension doesn't affect cost. Browse /models filtered by Embeddings for live rates.

A typical RAG pipeline:

  • Ingest 1M words ≈ 1.3M tokens → one-time cost (depends on the model; see the sketch after this list)
  • Each query embeds ~50 tokens → near-zero per-query cost
  • The big spend is the LLM doing the actual generation, not the embedding step
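
As a back-of-envelope helper (the rate below is a placeholder; use the live per-million-token price from /models for your model):

// Back-of-envelope embedding cost; `ratePerMTok` is a placeholder rate in USD.
function embeddingCostUSD(tokens: number, ratePerMTok: number): number {
  return (tokens / 1_000_000) * ratePerMTok
}

embeddingCostUSD(1_300_000, 0.1) // ingest: 0.13 (at a hypothetical $0.10/M)
embeddingCostUSD(50, 0.1)        // per query: effectively zero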