Understanding Embeddings: From Theory to Production
Embeddings are the bridge between human-readable data and machine-processable mathematics. They are the input to vector search, the foundation of recommendation systems, and the representation layer in modern NLP. Despite their centrality, most engineering teams treat embeddings as a black box: call an API, get a list of floats, move on. This works until it doesn’t—until your search quality degrades on a specific data type, your recommendation system produces nonsensical results for edge cases, or your embedding storage costs balloon unexpectedly. Understanding what embeddings actually are, how they are produced, and what can go wrong gives you the tools to diagnose and fix these problems instead of guessing.
What an Embedding Actually Represents
An embedding is a learned mapping from a discrete, high-dimensional space (words, sentences, images, user behavior) to a continuous, lower-dimensional vector space where geometric relationships encode semantic relationships. That definition is precise but abstract. Let me make it concrete.
Consider the sentence “The cat sat on the mat.” An embedding model converts this into a vector of 384 floating-point numbers (if using MiniLM) or 1536 numbers (if using OpenAI’s text-embedding-ada-002). Each number represents the sentence’s position along one axis of that model’s vector space. The sentence “A feline rested on the rug” would be mapped to a nearby vector in that space, because the two sentences have similar meanings. The sentence “Quarterly revenue exceeded projections” would be mapped to a distant vector, because its meaning is unrelated.
“Nearby” and “distant” are measured by a similarity or distance function, typically cosine similarity, which ranges from -1 (vectors pointing in opposite directions) to 1 (vectors pointing in the same direction). In practice, unrelated sentences score around 0.1–0.3, vaguely related sentences score 0.4–0.6, and semantically equivalent sentences score 0.7–0.95. These ranges are model-dependent, which is why you should always evaluate on your own data rather than relying on published thresholds.
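To make the scoring concrete, here is a minimal sketch of cosine similarity in plain Python. The three-dimensional vectors are toy values invented for illustration; they are not the output of any real model.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors (hypothetical values, not produced by an embedding model)
cat = [0.9, 0.1, 0.0]
feline = [0.8, 0.2, 0.1]
revenue = [0.0, 0.1, 0.9]

print(cosine_similarity(cat, feline))   # close to 1: similar direction
print(cosine_similarity(cat, revenue))  # close to 0: nearly orthogonal
```

When vectors are already scaled to unit length, the denominator is 1 and cosine similarity reduces to a plain dot product.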
The key insight is that this geometric structure is learned from data. The embedding model was trained on millions (sometimes billions) of text pairs where the model was asked to produce similar vectors for similar meanings and different vectors for different meanings. The resulting vector space reflects the statistical patterns of meaning in the training data. It is not a hand-crafted feature space—it is an emergent structure that captures semantic relationships automatically. This is both its power (it captures nuances that hand-crafted features miss) and its limitation (it can only capture patterns that exist in the training data).
How Embedding Models Work
Modern text embedding models are typically transformer-based architectures (BERT, RoBERTa, or proprietary variants) that have been fine-tuned specifically for producing useful sentence-level representations. The training process involves three stages, each building on the previous one:
Stage 1: Pre-training. A base language model (e.g., BERT-base with 110 million parameters) is trained on a massive text corpus (Wikipedia, BookCorpus, web crawls) using masked language modeling. The model learns to predict missing words in sentences, which forces it to develop internal representations of word meaning, syntax, and world knowledge. This stage takes weeks on large GPU clusters and is typically done once by a research lab. However, the pre-trained model does not produce good sentence embeddings. Raw BERT embeddings are notoriously poor for similarity tasks because the model was trained to predict tokens, not to make semantically similar sentences produce similar vectors. Averaging BERT’s token-level outputs produces vectors where any two random sentences have cosine similarity around 0.6–0.8—the space is “anisotropic,” meaning all vectors are clustered in a narrow cone rather than spread across the full space.
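The "averaging token-level outputs" step mentioned above is usually called mean pooling. The sketch below shows the operation on a toy NumPy array, assuming the model returns one vector per token plus an attention mask marking padding; the function name and shapes are illustrative, not taken from any particular library.

```python
import numpy as np


def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors per sentence, ignoring padding positions.

    token_embeddings: shape (batch, seq_len, hidden)
    attention_mask:   shape (batch, seq_len), 1 for real tokens, 0 for padding
    """
    mask = attention_mask[:, :, None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts


# Toy input: one sentence, three token slots (the last is padding), hidden size 2
tokens = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])
pooled = mean_pool(tokens, mask)  # the padding row is excluded from the average
```

Pooling only fixes how token vectors are combined. It does nothing about the anisotropy problem, which is what the next stage addresses.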
Stage 2: Contrastive fine-tuning. The pre-trained model is further trained on pairs of sentences labeled as similar or dissimilar. The training objective pushes similar pairs’ vectors together and dissimilar pairs’ vectors apart. This is where the useful geometric structure emerges. The main training approaches are:
- SimCSE — Uses dropout as a form of data augmentation: the same sentence passed through the model twice with different dropout masks produces two slightly different vectors, which are treated as a positive pair. All other sentences in the batch serve as negatives. Elegant because it requires no labeled data—just sentences.
- MNRL (Multiple Negatives Ranking Loss) — Each batch contains explicit positive pairs (e.g., query-document pairs from search logs). All other sentences in the batch serve as negatives. Efficient because you get O(n²) training signal from n pairs. This is how sentence-transformers models are typically trained.
- Triplet loss — For each anchor sentence, a positive (similar) and negative (dissimilar) sentence are provided. The model learns to place the anchor closer to the positive than the negative by a configurable margin. More explicit than MNRL but requires pre-mined triplets.
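The two supervised objectives in the list above can be sketched in a few lines of NumPy. This is a simplified illustration of the loss computations only (no gradients, no model). The `scale` and `margin` values are assumptions chosen for the example, not settings taken from the source.

```python
import numpy as np


def in_batch_negatives_loss(anchors: np.ndarray, positives: np.ndarray, scale: float = 20.0) -> float:
    """Cross-entropy over an (n, n) cosine-similarity matrix.

    Row i holds the similarity of anchor i to every positive in the batch.
    The diagonal entry is the true pair; every other column acts as a negative.
    """
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = scale * (a @ p.T)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))


def triplet_loss(anchor: np.ndarray, positive: np.ndarray, negative: np.ndarray, margin: float = 0.5) -> float:
    """Hinge on distances: zero once the positive is closer than the negative by `margin`."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return float(max(0.0, d_pos - d_neg + margin))


rng = np.random.default_rng(0)
anchors = rng.normal(size=(4, 8))
# Positives that nearly coincide with their anchors give a low loss ...
aligned_loss = in_batch_negatives_loss(anchors, anchors + 0.01 * rng.normal(size=(4, 8)))
# ... while pairing each anchor with the wrong positive gives a high one.
shuffled_loss = in_batch_negatives_loss(anchors, anchors[::-1])
```

With n pairs in a batch, the similarity matrix has n² entries, which is where the O(n²) training signal comes from.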
Stage 3: Task-specific fine-tuning (optional). For domain-specific applications (legal, medical, scientific, code), the model is further fine-tuned on domain data. This can dramatically improve quality for specialized content because the general-purpose training data may not contain enough domain-specific examples. We will discuss this further under “What Can Go Wrong.”
Practical Embedding Generation
In production, you need to handle embedding generation at two scales: bulk ingestion (embedding your entire document corpus, potentially millions of documents) and real-time queries (embedding individual search queries as they arrive, with latency requirements under 50ms).
```python
from typing import Iterable, Iterator

import numpy as np
import torch
from sentence_transformers import SentenceTransformer


class EmbeddingService:
    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(model_name)
        self.dimension = self.model.get_sentence_embedding_dimension()
        # Use GPU if available, fall back to CPU
        if torch.cuda.is_available():
            self.model = self.model.to('cuda')
            self.device = 'cuda'
        else:
            self.device = 'cpu'

    def embed_query(self, text: str) -> np.ndarray:
        """Embed a single query. Optimized for latency."""
        with torch.no_grad():
            embedding = self.model.encode(
                text,
                normalize_embeddings=True,
                show_progress_bar=False
            )
        return embedding

    def embed_batch(self, texts: list[str], batch_size: int = 128) -> np.ndarray:
        """Embed a batch of documents. Optimized for throughput."""
        with torch.no_grad():
            embeddings = self.model.encode(
                texts,
                batch_size=batch_size,
                normalize_embeddings=True,
                show_progress_bar=True
            )
        return embeddings

    def embed_stream(
        self, texts: Iterable[str], batch_size: int = 128
    ) -> Iterator[np.ndarray]:
        """Stream embeddings for large corpora that don't fit in memory."""
        batch = []
        for text in texts:
            batch.append(text)
            if len(batch) >= batch_size:
                yield self.embed_batch(batch, batch_size=batch_size)
                batch = []
        # Flush the final partial batch
        if batch:
            yield self.embed_batch(batch, batch_size=batch_size)
```
The streaming interface (embed_stream) is essential for ingesting large corpora. If you have 10 million documents at 300 words each, you cannot load them all into memory simultaneously—just the text strings would consume 15–20 GB of RAM, and the embeddings (10M x 384 x 4 bytes) add another 15 GB. Instead, you stream batches through the pipeline, writing each batch to your vector store before processing the next one. Our ingestion pipeline processes 10 million documents in about 3 hours on a single GPU using this streaming approach, with peak memory usage under 4 GB.
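The torch.no_grad() context manager is important for inference performance. Without it, PyTorch tracks gradients for every operation (needed for training but unnecessary for inference), consuming additional memory and CPU cycles. Note that encode() in sentence-transformers already runs the forward pass with gradient tracking disabled, so the explicit wrapper here is mainly a safeguard; it matters most when you call the underlying transformer directly. For embedding generation, disabling gradient tracking reduces memory usage by approximately 30% and improves throughput by 15–20%.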
Choosing the Right Model
The embedding model landscape changes rapidly, but the evaluation methodology is stable. What matters is your data and your use case, not benchmark leaderboards. Models that rank first on the MTEB benchmark may perform poorly on your specific data if your domain is underrepresented in the benchmark datasets. Here is a structured evaluation process:
- Create an evaluation set. Assemble 100–200 query-document pairs where you know the correct matches. This is manual work and there is no shortcut—automated evaluation sets are circular (they test whether the model matches itself, not whether it matches human judgment). Invest a day in building this dataset. It will save you weeks of guesswork.
- Select 3–5 candidate models based on dimension size, inference speed, and licensing constraints.
- Embed your evaluation set with each candidate model.
- Compute Recall@10 and MRR for each model against your human-labeled ground truth.
- Measure latency and throughput. A model that is 3% more accurate but 10x slower might not be worth it if your application has real-time requirements.
```python
import time

import numpy as np


def evaluate_model(model_name: str, eval_set: list[dict]) -> dict:
    """Evaluate an embedding model against a labeled evaluation set.

    Each item is {'query': ..., 'document': ...}; the document at position i
    is treated as the single correct match for the query at position i.
    """
    service = EmbeddingService(model_name)  # class defined in the earlier listing

    # Embed all documents
    doc_texts = [item['document'] for item in eval_set]
    doc_embeddings = service.embed_batch(doc_texts)

    recall_scores = []
    mrr_scores = []
    latencies = []

    # enumerate() yields the correct document index directly; list.index()
    # is O(n) per query and returns the wrong index for duplicate items
    for correct_idx, item in enumerate(eval_set):
        start = time.perf_counter()
        query_embedding = service.embed_query(item['query'])
        latencies.append((time.perf_counter() - start) * 1000)

        # Embeddings are L2-normalized, so the dot product is the cosine similarity
        similarities = np.dot(doc_embeddings, query_embedding)
        top_10_indices = np.argsort(similarities)[-10:][::-1]

        # Check if correct document is in top 10
        recall_scores.append(1.0 if correct_idx in top_10_indices else 0.0)

        # Compute reciprocal rank, truncated at rank 10 (MRR@10)
        for rank, idx in enumerate(top_10_indices):
            if idx == correct_idx:
                mrr_scores.append(1.0 / (rank + 1))
                break
        else:
            mrr_scores.append(0.0)

    return {
        'model': model_name,
        'recall@10': np.mean(recall_scores),
        'mrr': np.mean(mrr_scores),
        'dimension': service.dimension,
        'avg_latency_ms': np.mean(latencies),
        'p99_latency_ms': np.percentile(latencies, 99)
    }
```
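For readers who want the two metrics in isolation, the following sketch defines them on a single ranked result list, independent of any embedding model. The function names and the integer document ids are illustrative assumptions.

```python
def recall_at_k(ranked_ids: list[int], relevant_id: int, k: int = 10) -> float:
    """1.0 if the relevant document appears among the first k results, else 0.0."""
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0


def reciprocal_rank(ranked_ids: list[int], relevant_id: int) -> float:
    """1 / rank of the relevant document (ranks start at 1), or 0.0 if it is absent."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == relevant_id:
            return 1.0 / rank
    return 0.0


# Hypothetical ranking returned for one query; document 3 is the correct answer
ranking = [7, 3, 9]
print(recall_at_k(ranking, relevant_id=3, k=1))  # 0.0: not in the top 1
print(recall_at_k(ranking, relevant_id=3, k=2))  # 1.0: found in the top 2
print(reciprocal_rank(ranking, relevant_id=3))   # 0.5: found at rank 2
```

Recall@10 and MRR are the averages of these per-query values over the whole evaluation set.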
In our evaluation across five candidate models for Harbor Software’s documentation search use case, the results were illuminating:
- all-MiniLM-L6-v2: Recall@10 = 0.82, MRR = 0.61, 384 dims, 10ms/query (CPU)
- all-mpnet-base-v2: Recall@10 = 0.88, MRR = 0.68, 768 dims, 35ms/query (CPU)
- text-embedding-ada-002: Recall@10 = 0.91, MRR = 0.72, 1536 dims, 120ms/query (API)
- e5-large-v2: Recall@10 = 0.89, MRR = 0.70, 1024 dims, 50ms/query (CPU)
- bge-base-en-v1.5: Recall@10 = 0.87, MRR = 0.67, 768 dims, 32ms/query (CPU)
We chose all-mpnet-base-v2 because the 6-point recall improvement over MiniLM justified the 3.5x latency increase, while the incremental 3-point gain from ada-002 did not justify the API dependency, the further jump in latency (120ms versus 35ms, or 12x the MiniLM latency), and per-token cost. These numbers are specific to our technical documentation corpus; your evaluation will yield different results, and that is exactly the point.
What Can Go Wrong
Domain mismatch. General-purpose embedding models are trained primarily on web text (Wikipedia, Reddit, news articles, web pages). If your data is highly specialized—legal contracts, medical records, source code, financial reports—the model may not capture domain-specific semantics. Two legal clauses with opposite meanings (“the licensee MAY distribute” vs. “the licensee MAY NOT distribute”) might produce similar vectors because the model treats them as “legal text about distribution” without understanding the contractual distinction. Fine-tuning on domain data typically improves recall by 10–25 percentage points for specialized content. We saw a 14-point recall improvement when we fine-tuned on our technical documentation corpus versus using the off-the-shelf model.
Length sensitivity. Most embedding models produce better representations for shorter texts (1–3 sentences) than longer texts (full documents). This is because the model must compress all information into a fixed-size vector, and longer texts contain more information to compress. A 10-page document embedded as a single vector loses nuance because 384 or 768 dimensions cannot faithfully represent that much information. This is why chunking is essential—and why chunk size is a hyperparameter worth tuning carefully.
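As one possible starting point for chunking, the sketch below splits text into overlapping word windows. The default `chunk_size` and `overlap` values are placeholders to be tuned against your evaluation set, and splitting on whitespace is a simplifying assumption (sentence- or heading-aware splitting is often preferable).

```python
def chunk_words(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into windows of `chunk_size` words, each sharing
    `overlap` words with the previous window."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
print(chunk_words(sample, chunk_size=4, overlap=1))
# ['w0 w1 w2 w3', 'w3 w4 w5 w6', 'w6 w7 w8 w9']
```

The overlap keeps a sentence that straddles a boundary retrievable from at least one chunk, at the cost of storing some words twice.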
Semantic collapse. Some queries produce embeddings that are equally close to many unrelated documents, making retrieval effectively random. This typically happens with very short queries (“help”), very generic queries (“information about the product”), or queries that match the “average” topic of your corpus. Mitigate by requiring minimum query length, augmenting short queries with context from the user’s session, or combining vector search with keyword filtering to narrow the candidate set.
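One rough way to detect this condition at query time is to check whether the best results stand out from the rest of the corpus at all. The heuristic below is a sketch under assumed thresholds: `top_k` and `min_margin` are hypothetical values that would need calibration on real queries.

```python
from statistics import mean


def looks_collapsed(scores: list[float], top_k: int = 10, min_margin: float = 0.05) -> bool:
    """Return True when the top-k similarity scores barely exceed the corpus average.

    `scores` holds the query's similarity to every document in the corpus.
    """
    if not scores:
        return True  # nothing to retrieve from
    ranked = sorted(scores, reverse=True)
    top_mean = mean(ranked[:top_k])
    overall_mean = mean(ranked)
    return (top_mean - overall_mean) < min_margin


# A generic query: every document scores about the same
flat_scores = [0.30 + 0.0001 * i for i in range(100)]
# A specific query: a handful of documents clearly stand out
peaked_scores = [0.2] * 95 + [0.8] * 5

print(looks_collapsed(flat_scores))    # True
print(looks_collapsed(peaked_scores))  # False
```

A flagged query can then be routed to one of the mitigations above, such as asking for more detail or applying a keyword filter first.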
Embedding drift. If you switch embedding models (or even model versions within the same family), old and new embeddings are incompatible. They occupy different vector spaces—a vector from model A is meaningless when compared to vectors from model B. You cannot mix vectors from different models in the same index. Always re-embed your entire corpus when changing models, and version your embedding model alongside your vector index.
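Versioning can be enforced in code rather than by convention. The sketch below uses a toy in-memory index that records which model produced its vectors and rejects anything else. `EmbeddingSpec` and `VersionedIndex` are hypothetical names invented for this example, not part of any vector database API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingSpec:
    """Identifies the vector space an index was built in."""
    model_name: str
    model_version: str
    dimension: int


class VersionedIndex:
    """Toy in-memory index that refuses vectors from a different embedding spec."""

    def __init__(self, spec: EmbeddingSpec):
        self.spec = spec
        self.vectors: list[list[float]] = []

    def add(self, vector: list[float], spec: EmbeddingSpec) -> None:
        if spec != self.spec:
            raise ValueError(
                f"index was built with {self.spec}, got a vector from {spec}; "
                "re-embed the corpus instead of mixing models"
            )
        if len(vector) != self.spec.dimension:
            raise ValueError(f"expected {self.spec.dimension} dimensions, got {len(vector)}")
        self.vectors.append(vector)


spec_a = EmbeddingSpec("model-a", "1", 3)  # placeholder model names
spec_b = EmbeddingSpec("model-b", "1", 3)
index = VersionedIndex(spec_a)
index.add([0.1, 0.2, 0.3], spec_a)         # accepted
# index.add([0.1, 0.2, 0.3], spec_b)       # would raise ValueError
```

In a real deployment the same spec would typically be stored as index metadata and checked on both the write path and the query path.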
Storage and Cost Considerations
Embedding storage costs are often underestimated, especially at scale. A single 384-dimensional vector stored as 32-bit floats consumes 1,536 bytes (384 x 4). At 10 million documents with an average of 5 chunks per document, that is 50 million vectors x 1,536 bytes = 76.8 GB of raw vector storage. Add metadata, index structures, and replication overhead, and you are looking at 150–250 GB total for a managed vector database.
With 768-dimensional vectors, those numbers double. With 1536-dimensional vectors (ada-002), they quadruple. At managed service pricing ($0.10–$0.25 per GB/month for vector databases), the 150–250 GB MiniLM footprint above costs roughly $180–$750 per year, while the 600–1,000 GB needed for 50 million ada-002 vectors costs roughly $720–$3,000 per year. The model choice has direct and significant cost implications beyond quality and latency.
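The storage arithmetic is easy to script so that you can plug in your own corpus size and pricing. The helper names below are illustrative, and the sketch assumes decimal gigabytes (1 GB = 10^9 bytes) and a flat per-GB monthly price. Real managed services may bill differently.

```python
def raw_vector_gb(num_vectors: int, dimension: int, bytes_per_value: int = 4) -> float:
    """Raw vector payload in decimal gigabytes, excluding metadata,
    index structures, and replication."""
    return num_vectors * dimension * bytes_per_value / 1e9


def annual_storage_cost_usd(total_gb: float, price_per_gb_month: float) -> float:
    """Yearly cost for a given stored volume at a flat per-GB monthly price."""
    return total_gb * price_per_gb_month * 12


minilm_gb = raw_vector_gb(50_000_000, 384)   # about 76.8 GB
mpnet_gb = raw_vector_gb(50_000_000, 768)    # about 153.6 GB
ada_gb = raw_vector_gb(50_000_000, 1536)     # about 307.2 GB
print(minilm_gb, mpnet_gb, ada_gb)
```

Pass the total footprint (raw vectors plus overhead) to `annual_storage_cost_usd`, since the overhead often dominates the bill.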
Strategies for reducing storage costs without sacrificing quality:
- Dimensionality reduction. PCA or random projection can reduce 768-dimensional vectors to 384 dimensions with minimal quality loss (typically 2–4% recall reduction on our benchmarks). This halves storage with a small, measurable quality trade-off. Run your evaluation before and after to verify the trade-off is acceptable for your use case.
- Quantization. Storing vectors as 8-bit integers instead of 32-bit floats reduces storage by 4x with 1–2% recall loss. FAISS supports this natively through its scalar quantizer indexes (IndexScalarQuantizer, IndexIVFScalarQuantizer); product quantization (IndexIVFPQ) is a separate technique that typically compresses further at a larger quality cost. Qdrant and other managed services offer similar options. This is the most effective storage optimization because the quality loss is negligible for most applications.
- Choosing a smaller model. A 384-dimensional model uses half the storage of a 768-dimensional model. If your evaluation shows marginal quality difference for your use case, the smaller model wins on cost, latency, and storage. Do not assume bigger is better.
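The first two strategies can be illustrated with plain NumPy. This is a hand-rolled sketch meant to show the mechanics, not a substitute for a vector library's built-in quantizers. The per-dimension min/max scheme and the function names are assumptions made for the example.

```python
import numpy as np


def quantize_8bit(vectors: np.ndarray) -> tuple[np.ndarray, np.ndarray, np.ndarray]:
    """Map each dimension's [min, max] range onto 256 integer levels."""
    lo = vectors.min(axis=0)
    hi = vectors.max(axis=0)
    scale = np.where(hi > lo, (hi - lo) / 255.0, 1.0)  # guard constant dimensions
    codes = np.round((vectors - lo) / scale).astype(np.uint8)
    return codes, lo, scale


def dequantize_8bit(codes: np.ndarray, lo: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Approximate reconstruction of the original float vectors."""
    return codes.astype(np.float32) * scale + lo


def pca_reduce(vectors: np.ndarray, target_dim: int) -> np.ndarray:
    """Project onto the top `target_dim` principal directions."""
    centered = vectors - vectors.mean(axis=0)
    # Rows of vt are principal directions, ordered by explained variance
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:target_dim].T


rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, 64)).astype(np.float32)  # stand-in for real embeddings

codes, lo, scale = quantize_8bit(vectors)
restored = dequantize_8bit(codes, lo, scale)
reduced = pca_reduce(vectors, target_dim=16)

print(vectors.nbytes // codes.nbytes)  # 4: one byte per value instead of four
```

Whichever strategy you apply, rerun the evaluation on the compressed vectors, because the recall impact depends on the model and the data.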
Conclusion
Embeddings are not a black box you can safely ignore. The model you choose, the way you chunk your documents, the distance metric you use, and the way you handle edge cases all materially affect the quality of any downstream application. Treat embedding generation as a first-class component of your system—with its own evaluation framework, version management, and performance monitoring. Evaluate models empirically against your data, plan for the operational realities of storing and updating millions of vectors, and budget for re-embedding when models change. The teams that do this well build search and recommendation systems that feel almost magical. The teams that treat embeddings as a solved problem build systems that work 80% of the time and fail in ways that are frustratingly hard to diagnose.
Practical Next Steps
If you are starting from zero, here is a concrete action plan. First, install sentence-transformers and embed a small sample of your data (100–500 documents) to get hands-on experience with the tooling. Compute cosine similarities between a few queries and documents manually to develop intuition for what similarity scores mean in practice for your data. You will quickly learn that 0.8 cosine similarity feels very relevant while 0.4 feels barely related—but these numbers are model-dependent, so calibrate on your own corpus.
Second, build a minimal evaluation set of 50 query-document pairs where you know the correct answers. This is the most important investment you will make. Every subsequent decision—model selection, chunk size, index parameters, score threshold—depends on having a reliable way to measure quality. Spend a few hours manually creating these pairs. Use real questions from customer support tickets, search logs, or internal Slack channels—these represent actual usage patterns, not hypothetical queries you think users might ask.
Third, evaluate at least three embedding models against your evaluation set. Do not skip this step because you read a blog post (including this one) recommending a specific model. The model that works best for our technical documentation corpus may perform poorly on your e-commerce product descriptions or legal contracts. Run the evaluation, look at the numbers, and make a data-driven choice. The evaluation code in this post is a working starting point: copy it, adapt it to your data format, and run it.
Fourth, choose a vector storage solution appropriate for your scale. Under 100K vectors: FAISS flat index or even a simple NumPy array with brute-force cosine similarity. 100K–5M vectors: FAISS IVF or a managed service like Qdrant. Over 5M vectors: a managed service with horizontal sharding is almost certainly the right choice unless you have dedicated infrastructure engineers. The managed service cost ($50–$200/month at moderate scale) is almost always less than the engineering time you would spend operating a self-hosted solution.
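For the smallest tier, brute-force search over a NumPy array really is only a few lines. The sketch below is one way to write it. The function name is illustrative, and the tiny corpus is made up to keep the example readable.

```python
import numpy as np


def top_k_cosine(query: np.ndarray, corpus: np.ndarray, k: int = 10) -> tuple[np.ndarray, np.ndarray]:
    """Return (indices, scores) of the k corpus rows most similar to the query."""
    if k < 1:
        raise ValueError("k must be at least 1")
    q = query / np.linalg.norm(query)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    scores = c @ q
    k = min(k, len(scores))
    # argpartition selects the top k in linear time; only those k are then sorted
    candidates = np.argpartition(-scores, k - 1)[:k]
    order = candidates[np.argsort(-scores[candidates])]
    return order, scores[order]


# Made-up 3-dimensional corpus of four "documents"
corpus = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
])
query = np.array([1.0, 0.2, 0.0])

indices, scores = top_k_cosine(query, corpus, k=2)
print(indices)  # [0 3]
```

If the corpus vectors are normalized once at ingestion time, the per-query work shrinks to a single matrix-vector product plus the top-k selection.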