Embedding
Meaning represented as numbers
A vector representation that captures the meaning of text, images, or audio.
An embedding model takes an input like the word "king" and represents it as a fixed-length vector. This vector encodes its meaning — for example, the embeddings of "king" and "queen" are close to each other.
Semantic search, recommendation systems, clustering, and RAG — all are powered by embeddings. Modern models include OpenAI text-embedding-3, Cohere Embed v3, Voyage, BGE, and Jina.
Model choice is critical: language support, vector dimension (256–3072), context window (maximum input tokens), and cost ($0.02–$0.13 per 1M tokens) vary across models. A poor choice leads to weak retrieval and poor RAG performance.
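To make the cost range concrete, here is a rough back-of-the-envelope sketch. The corpus size (200,000 chunks of about 500 tokens each) is a made-up assumption, and `embedding_cost_usd` is a hypothetical helper, not part of any provider's SDK; only the per-million-token prices come from the range quoted above.

```python
def embedding_cost_usd(num_tokens: int, price_per_million_usd: float) -> float:
    """Cost of embedding `num_tokens` at a given price per 1M tokens."""
    return num_tokens / 1_000_000 * price_per_million_usd

# Hypothetical corpus: 200,000 chunks of about 500 tokens each (100M tokens).
corpus_tokens = 200_000 * 500

low = embedding_cost_usd(corpus_tokens, 0.02)   # cheapest price in the range
high = embedding_cost_usd(corpus_tokens, 0.13)  # most expensive price in the range
print(f"${low:.2f} to ${high:.2f}")
```

The same corpus costs several times more at the top of the range, and that cost is paid again on every full re-embedding.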
You compress a 12-megapixel image into 1536 numbers, yet those numbers still contain enough information to answer "is this a cat or a dog?". That’s what embeddings do: reduce detail while preserving meaning. It’s lossy compression — but it discards the right details.
Imagine a Slack message archive. You search for "I mentioned the deploy failure last month." A traditional keyword search returns only messages containing both "deploy" and "failure", so it misses something like "Friday's release exploded" (different words, same meaning).
With embedding-based search, your query is converted into a vector, and each message already has its own vector. Using cosine similarity, the system returns the top 20 closest messages — including "release exploded".
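The retrieval step described above can be sketched as a small top-k search. This is a minimal illustration, not a production index: `top_k` is a hypothetical helper, and the 3-dimensional vectors are made-up stand-ins for real embeddings so the example runs without an API call.

```python
import numpy as np

def top_k(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int) -> list[tuple[int, float]]:
    """Return (row index, cosine similarity) for the k rows of doc_vecs closest to query_vec."""
    # Normalize so that a dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q
    order = np.argsort(-scores)[:k]  # highest similarity first
    return [(int(i), float(scores[i])) for i in order]

messages = [
    "Friday's release exploded",
    "Lunch menu for next week",
    "Deploy failure on the staging cluster",
]
# Toy vectors standing in for real embeddings (values are invented).
message_vecs = np.array([
    [0.9, 0.1, 0.0],
    [0.0, 0.2, 0.9],
    [0.8, 0.3, 0.1],
])
query_vec = np.array([1.0, 0.2, 0.0])  # stand-in for the embedded query

for idx, score in top_k(query_vec, message_vecs, k=2):
    print(f"{score:.3f}  {messages[idx]}")
```

With real data the message vectors would be computed once and stored, and a vector database would typically replace the brute-force scan once the archive grows large.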
from openai import OpenAI
import numpy as np

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(text: str) -> np.ndarray:
    """Return the embedding of `text` as a NumPy vector."""
    resp = client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    )
    return np.array(resp.data[0].embedding)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: 1.0 means same direction, 0.0 means orthogonal."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

v_king = embed("king")
v_queen = embed("queen")
v_potato = embed("potato")

# Exact scores vary by model; what matters is the relative ordering.
print(cosine(v_king, v_queen))   # ≈ 0.78 (semantically close)
print(cosine(v_king, v_potato))  # ≈ 0.23 (semantically distant)

When to use
- Semantic search (beyond keyword matching)
- Indexing chunks for RAG
- Recommendation systems (user/item similarity)
- Duplicate detection (near-duplicates rather than exact matches)
- Using embeddings as features for classification

When not to use
- Exact string matching (slug, ID, username)
- Structured queries (WHERE created_at > X)
- When decisions must be strictly explainable
- Re-embedding unchanged data when model cost is high (cache the vectors instead)
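The caching advice in the last item can be sketched roughly as follows. `EmbeddingCache` and `fake_embed` are hypothetical names introduced for illustration; `fake_embed` stands in for a paid API call, and a real setup would likely persist the cache to disk or a database instead of a dict.

```python
import hashlib
from typing import Callable

class EmbeddingCache:
    """In-memory cache keyed by model name plus a hash of the text."""

    def __init__(self, embed_fn: Callable[[str], list[float]], model: str):
        self.embed_fn = embed_fn
        self.model = model
        self.store: dict[str, list[float]] = {}
        self.misses = 0  # number of times the underlying model was called

    def _key(self, text: str) -> str:
        # Including the model name keeps vectors from different models apart.
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        return f"{self.model}:{digest}"

    def get(self, text: str) -> list[float]:
        key = self._key(text)
        if key not in self.store:
            self.misses += 1
            self.store[key] = self.embed_fn(text)
        return self.store[key]

def fake_embed(text: str) -> list[float]:
    # Placeholder for a real embedding call; returns a deterministic dummy vector.
    return [float(len(text)), float(text.count(" "))]

cache = EmbeddingCache(fake_embed, model="example-model-v1")
cache.get("deploy failed on Friday")
cache.get("deploy failed on Friday")  # second call is served from the cache
print(cache.misses)
```

Keying on the text hash means an unchanged message is never embedded twice, and a change of model name naturally invalidates old entries.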
Embed and forget
Updating the embedding model requires re-embedding all data. Old and new vectors do not share the same space, so you must plan a proper migration.
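One way to plan for this, sketched under assumptions: store the model name next to every vector so stale records can be found and cross-model comparisons rejected. `StoredVector`, `needs_reembedding`, `check_comparable`, and the model names are all hypothetical and not tied to any particular vector database.

```python
from dataclasses import dataclass

@dataclass
class StoredVector:
    doc_id: str
    model: str            # name of the embedding model that produced the vector
    vector: list[float]

def needs_reembedding(records: list[StoredVector], target_model: str) -> list[str]:
    """Return doc_ids whose vectors were produced by a different model."""
    return [r.doc_id for r in records if r.model != target_model]

def check_comparable(a: StoredVector, b: StoredVector) -> None:
    """Raise if two vectors come from different models, i.e. different spaces."""
    if a.model != b.model:
        raise ValueError(f"cannot compare {a.model!r} vectors with {b.model!r} vectors")

records = [
    StoredVector("doc-1", "model-v1", [0.1, 0.2]),
    StoredVector("doc-2", "model-v2", [0.3, 0.4]),
    StoredVector("doc-3", "model-v1", [0.5, 0.6]),
]
stale = needs_reembedding(records, target_model="model-v2")
print(stale)  # doc-1 and doc-3 still carry model-v1 vectors
```

During a migration, queries would keep hitting the old index until every stale record has been re-embedded, and only then switch to the new one.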
Underestimating chunking
Embedding an entire document as a single vector causes loss of detail. You don’t want to compress 50 pages into one vector. Proper chunking is essential.
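As a rough illustration of chunking, the sketch below splits text into overlapping word windows. `chunk_words` is a hypothetical helper, and the default sizes are arbitrary assumptions; real pipelines often split on tokens, sentences, or headings instead of raw words.

```python
def chunk_words(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into chunks of up to chunk_size words, each sharing `overlap` words with the previous one."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks

# Tiny example: ten "words", windows of four with one word of overlap.
document = " ".join(f"w{i}" for i in range(10))
chunks = chunk_words(document, chunk_size=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

Each chunk is then embedded separately, so a query can match the one passage that answers it and not a blurred average of 50 pages. The overlap reduces the chance that a sentence split across a boundary is lost to both neighbors.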
Using English-only models for other languages
Models trained only on English perform poorly on other languages. Prefer multilingual models such as Cohere's multilingual Embed models or BGE-M3.