Building Your First Vector Search Pipeline

Vector search is one of those technologies that sounds intimidating until you build one. The core idea is deceptively simple: convert your data into numerical vectors (embeddings), store them in a specialized index, and find similar items by measuring distance in vector space. The implementation details, however, are where teams stumble. This guide walks through building a production-grade vector search pipeline from scratch, covering embedding generation, index selection, query optimization, and the operational concerns that documentation rarely mentions.

Article Overview

  1. What Vector Search Actually Does
  2. Stage 1: Generating Embeddings
  3. Stage 2: Choosing and Building an Index
  4. Stage 3: Query Pipeline
  5. Stage 4: Evaluation and Tuning
  6. Operational Concerns
  7. Appendix: Our Recommended Starting Stack

HARBOR SOFTWARE · Engineering Insights

What Vector Search Actually Does

Traditional keyword search works by matching tokens. If a user searches for “comfortable running shoes,” a keyword engine looks for documents containing those exact words (or stems of them). Vector search works differently: it converts both the query and the documents into high-dimensional numerical representations, then finds documents whose vectors are geometrically close to the query vector.

This means vector search can find semantically similar results even when there is zero keyword overlap. A search for “comfortable running shoes” can surface a document about “ergonomic athletic footwear with cushioned soles” because their embedding vectors occupy nearby regions of the vector space. This is the fundamental advantage over keyword search: it operates on meaning, not tokens.
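
To make "geometrically close" concrete, here is a minimal sketch of cosine similarity, a common closeness measure for text embeddings. The three-dimensional vectors are invented for illustration; real embeddings come from a model and have hundreds of dimensions.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings"; the values are made up for this example.
query = [0.9, 0.1, 0.0]          # "comfortable running shoes"
footwear_doc = [0.8, 0.2, 0.1]   # "ergonomic athletic footwear with cushioned soles"
tax_doc = [0.0, 0.1, 0.9]        # an unrelated document

# The footwear document points in nearly the same direction as the query.
print(cosine_similarity(query, footwear_doc) > cosine_similarity(query, tax_doc))
```

The two shoe-related vectors share no "keywords" here, only direction, which is the property vector search relies on.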

To make this concrete, consider a customer support knowledge base. A customer asks “how do I get my money back?” Keyword search requires the knowledge base article to contain the words “money” or “back” or “get.” Vector search matches the query to an article titled “Refund Policy and Return Process” because the embedding model understands that “get my money back” and “refund” are semantically equivalent. This is not a theoretical advantage—it is the difference between a support system that works 60% of the time and one that works 90% of the time.

The pipeline has four stages:

  1. Embedding generation — Convert text (or images, audio, etc.) into fixed-length vectors
  2. Indexing — Store vectors in a data structure optimized for similarity search
  3. Querying — Convert the search query into a vector and find nearest neighbors
  4. Post-processing — Re-rank, filter, and format results

Stage 1: Generating Embeddings

The embedding model is the most consequential decision in the entire pipeline. It determines the quality of your search results, and switching models later requires re-embedding your entire corpus—which can be expensive and time-consuming at scale.

For text embeddings in early 2023, the landscape looks like this:

  • sentence-transformers (open source) — The all-MiniLM-L6-v2 model produces 384-dimensional vectors, runs on CPU in ~10ms per sentence, and handles most English-language use cases well. This is our default recommendation for getting started. The model is 80 MB and can run on a laptop without a GPU.
  • OpenAI text-embedding-ada-002 — 1536-dimensional vectors via API call. Higher quality for diverse content, but adds network latency (~50–150ms) and costs $0.0001 per 1K tokens. For a corpus of 1 million chunks averaging 200 tokens each, that is $20 for initial embedding.
  • Cohere embed-english-v2.0 — 4096-dimensional vectors. Strong performance on retrieval benchmarks, but the high dimensionality increases storage and compute costs proportionally.
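
The cost and storage figures above are simple arithmetic, and it can help to have them as functions when comparing models. This is a rough sketch: the function names are ours, and the storage estimate assumes uncompressed float32 values (4 bytes each) with no index overhead.

```python
def embedding_api_cost(n_chunks: int, avg_tokens: int, usd_per_1k_tokens: float) -> float:
    """Estimated one-time cost of embedding a corpus through a paid API."""
    total_tokens = n_chunks * avg_tokens
    return total_tokens / 1000 * usd_per_1k_tokens

def raw_vector_bytes(n_vectors: int, dimension: int, bytes_per_value: int = 4) -> int:
    """Raw storage for the vectors alone (assumes float32, no index overhead)."""
    return n_vectors * dimension * bytes_per_value

# 1 million chunks of 200 tokens at $0.0001 per 1K tokens
print(f"API cost: ${embedding_api_cost(1_000_000, 200, 0.0001):.2f}")

# Storage grows linearly with dimensionality
for name, dim in [("MiniLM", 384), ("ada-002", 1536), ("Cohere v2", 4096)]:
    print(f"{name}: {raw_vector_bytes(1_000_000, dim) / 1e9:.2f} GB per million vectors")
```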

Here is a minimal embedding pipeline using sentence-transformers:

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('all-MiniLM-L6-v2')

def embed_documents(documents: list[str], batch_size: int = 64) -> np.ndarray:
    """Embed a list of documents in batches.
    
    Returns an array of shape (n_documents, 384).
    """
    embeddings = model.encode(
        documents,
        batch_size=batch_size,
        show_progress_bar=True,
        normalize_embeddings=True  # Critical for cosine similarity
    )
    return embeddings

def embed_query(query: str) -> np.ndarray:
    """Embed a single search query."""
    return model.encode(
        query,
        normalize_embeddings=True
    )

Two critical details here. First, normalize_embeddings=True normalizes each vector to unit length. This converts dot product similarity into cosine similarity, which is what you almost always want for text search. Without normalization, longer documents tend to have larger vector magnitudes, biasing search results toward long documents regardless of relevance. Second, batch encoding is significantly faster than encoding documents one at a time due to GPU/CPU parallelism—expect 5–10x throughput improvement with batch_size=64 on CPU.
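
The effect of normalization is easy to check by hand. The sketch below uses made-up three-dimensional vectors in plain Python: two documents point in the same direction but differ tenfold in magnitude, so the raw dot product favors the larger one while the normalized scores are identical.

```python
import math

def normalize(v: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

# Hypothetical vectors: same direction, different magnitude.
small_doc = [1.0, 2.0, 2.0]
large_doc = [10.0, 20.0, 20.0]
query = [2.0, 1.0, 2.0]

# Raw dot product is dominated by magnitude.
print(dot(query, small_doc), dot(query, large_doc))

# After normalization, both documents score the same (the cosine similarity).
print(dot(normalize(query), normalize(small_doc)))
print(dot(normalize(query), normalize(large_doc)))
```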

Chunking Strategy

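Embedding models have input length limits (256 word pieces by default for all-MiniLM-L6-v2, 8191 tokens for ada-002). Documents longer than the limit get truncated silently, losing information. These limits count tokens, not words, so a chunk size expressed in words should be checked against the model's tokenizer. You need a chunking strategy. This is not optional: it is arguably the most impactful design decision after model selection.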

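def chunk_document(text: str, chunk_size: int = 256, overlap: int = 50) -> list[str]:
    """Split document into overlapping chunks by word count."""
    if not 0 <= overlap < chunk_size:
        # overlap >= chunk_size would never advance and loop forever
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        end = start + chunk_size
        chunks.append(' '.join(words[start:end]))
        if end >= len(words):
            break  # Last chunk reached; avoids a trailing chunk made only of overlap
        start = end - overlap  # Overlap prevents information loss at boundaries
    return chunks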

We use 256-word chunks with 50-word overlap. The overlap ensures that information spanning a chunk boundary is captured by at least one chunk. Without overlap, a question like "what are the shipping costs for international orders?" might fail to match if "shipping costs" is at the end of one chunk and "international orders" is at the beginning of the next. The 50-word overlap gives a buffer zone where boundary-spanning information is captured.
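
The boundary problem is easier to see at toy scale. The sketch below uses a small stand-alone word chunker (a simplified helper written for this example, with a 6-word chunk size chosen only to keep the output readable) and checks whether a boundary-spanning phrase survives intact in any chunk.

```python
def chunk_words(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into word chunks; consecutive chunks share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the current chunk already reaches the end of the text
    return chunks

text = ("we ship worldwide and shipping costs for international "
        "orders start at ten dollars")
phrase = "shipping costs for international orders"

without_overlap = chunk_words(text, chunk_size=6, overlap=0)
with_overlap = chunk_words(text, chunk_size=6, overlap=3)

# Without overlap the phrase is split across two chunks.
print(any(phrase in chunk for chunk in without_overlap))
# With overlap, one chunk contains the whole phrase.
print(any(phrase in chunk for chunk in with_overlap))
```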

Chunk size is a trade-off that depends on your data. Smaller chunks (100–150 words) give more precise retrieval—the vector represents a narrow topic, so similarity scores are more meaningful. But smaller chunks increase index size (more vectors to store and search) and may lack context for the downstream application to use them effectively. Larger chunks (500+ words) provide more context per result but dilute the embedding signal—a 500-word chunk about three different topics produces a vector that is vaguely related to all three but strongly related to none.

We experimented with three chunk sizes on our documentation corpus and measured Recall@10:

  • 128 words: Recall@10 = 0.79 (too small, lost context)
  • 256 words: Recall@10 = 0.86 (best balance)
  • 512 words: Recall@10 = 0.81 (too large, diluted signal)

Stage 2: Choosing and Building an Index

Once you have vectors, you need to store and search them efficiently. Brute-force nearest neighbor search (computing distance to every vector) works for small datasets but scales linearly. At 1 million vectors with 384 dimensions, a brute-force search takes ~50ms on a modern CPU. At 100 million vectors, you are looking at 5 seconds per query—far too slow for interactive applications.
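
For reference, brute-force search is only a few lines. This is a dependency-free sketch with toy two-dimensional unit vectors; a real implementation would use a vectorized library, but the logic (score everything, keep the best k) is the same, which is why the cost grows linearly with corpus size.

```python
import heapq

def brute_force_search(query: list[float], vectors: list[list[float]], top_k: int = 3):
    """Exact search: score every vector by inner product and keep the top_k.

    Returns a list of (score, index) tuples, best first.
    """
    scored = (
        (sum(q * x for q, x in zip(query, vec)), i)
        for i, vec in enumerate(vectors)
    )
    return heapq.nlargest(top_k, scored)

# Toy unit vectors, invented for this example.
vectors = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8], [0.8, 0.6]]
query = [1.0, 0.0]

for score, idx in brute_force_search(query, vectors, top_k=2):
    print(idx, round(score, 2))
```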

Approximate Nearest Neighbor (ANN) indices trade a small amount of accuracy for dramatic speed improvements. The main options:

  • FAISS (Facebook AI Similarity Search) — The gold standard for self-hosted vector search. Supports multiple index types (flat, IVF, HNSW, PQ), GPU acceleration, and scales to billions of vectors. Written in C++ with Python bindings. The learning curve is steeper than alternatives but the flexibility is unmatched.
  • Annoy (Spotify) — Memory-mapped read-only index based on random projection trees. Excellent for static datasets where you build the index once and query many times. Cannot add vectors after building, which limits its use for dynamic data.
  • Hnswlib — HNSW (Hierarchical Navigable Small World) graph implementation. Best recall-speed trade-off for datasets under 10 million vectors. Supports incremental inserts without rebuilding. Simpler API than FAISS.
  • Managed services — Pinecone, Weaviate, Milvus, Qdrant. These handle infrastructure, scaling, metadata filtering, and updates. Higher cost but dramatically lower operational burden. If you do not want to manage index servers, start here.

For this guide, we will use FAISS with an IVF (Inverted File) index, which clusters vectors and only searches relevant clusters at query time:

import faiss
import numpy as np

def build_ivf_index(
    embeddings: np.ndarray,
    n_clusters: int = 256,
    n_probe: int = 16
) -> faiss.IndexIVFFlat:
    """Build a FAISS IVF index for approximate nearest neighbor search.
    
    Args:
        embeddings: Array of shape (n_vectors, dimension)
        n_clusters: Number of Voronoi cells (more = faster search but lower recall at a fixed n_probe)
        n_probe: Number of cells to search at query time (more = slower, higher recall)
    """
    dimension = embeddings.shape[1]
    
    # Quantizer maps vectors to cluster centroids
    quantizer = faiss.IndexFlatIP(dimension)  # Inner product (= cosine for normalized vectors)
    
    # IVF index with flat storage within each cluster
    index = faiss.IndexIVFFlat(quantizer, dimension, n_clusters, faiss.METRIC_INNER_PRODUCT)
    
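    # Training phase: learn cluster centroids from data
    # (FAISS expects contiguous float32 arrays)
    embeddings = np.ascontiguousarray(embeddings, dtype=np.float32)
    index.train(embeddings)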
    
    # Add vectors to trained index
    index.add(embeddings)
    
    # Set search-time parameter
    index.nprobe = n_probe
    
    return index

def search(index: faiss.IndexIVFFlat, query_vector: np.ndarray, top_k: int = 10):
    """Search the index for nearest neighbors."""
    query_vector = query_vector.reshape(1, -1)  # FAISS expects 2D input
    scores, indices = index.search(query_vector, top_k)
    return scores[0], indices[0]

The n_clusters and n_probe parameters control the speed-accuracy trade-off. With 1 million vectors, 256 clusters, and n_probe=16, you search roughly 6% of the index per query, giving you sub-5ms query times with 95%+ recall (compared to exact search). Increasing n_probe to 32 raises recall to 98% but doubles query time to ~10ms. For most applications, 95% recall with 5ms latency is the right operating point.
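
To see why n_probe trades recall for speed, here is a toy inverted-file index in plain Python. It is a teaching sketch, not how FAISS is implemented: the centroids are hand-picked rather than learned by k-means, and the vectors are invented. The query sits near a cell boundary, so probing only the closest cell misses a true neighbor that was assigned to the adjacent cell.

```python
def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

class ToyIVF:
    """Toy IVF: vectors are bucketed by nearest centroid; search scans n_probe buckets."""

    def __init__(self, centroids: list[list[float]]):
        self.centroids = centroids                 # assumed given; FAISS learns these in train()
        self.cells = [[] for _ in centroids]       # one inverted list per centroid

    def _rank_cells(self, vec: list[float]) -> list[int]:
        """Cell indices ordered from most to least similar centroid."""
        return sorted(range(len(self.centroids)),
                      key=lambda c: dot(vec, self.centroids[c]), reverse=True)

    def add(self, vec_id: int, vec: list[float]) -> None:
        self.cells[self._rank_cells(vec)[0]].append((vec_id, vec))

    def search(self, query: list[float], top_k: int, n_probe: int):
        candidates = []
        for c in self._rank_cells(query)[:n_probe]:
            candidates.extend(self.cells[c])
        scored = sorted(((dot(query, v), i) for i, v in candidates), reverse=True)
        return scored[:top_k]

index = ToyIVF(centroids=[[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
for vec_id, vec in enumerate([[0.98, 0.2], [0.72, 0.69], [0.69, 0.72], [-0.9, 0.1]]):
    index.add(vec_id, vec)

query = [0.7, 0.714]
print([i for _, i in index.search(query, top_k=2, n_probe=1)])  # only one cell scanned
print([i for _, i in index.search(query, top_k=2, n_probe=2)])  # neighbor cell included

# Fraction of cells scanned with the settings discussed above
print(16 / 256)
```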

The training step is important and often misunderstood. IVF indexes need to learn cluster centroids from a representative sample of your data before you can add vectors. You need at least n_clusters * 39 training vectors (roughly 10,000 for 256 clusters) for stable centroids. If your dataset is smaller than this, use a flat index instead—at small scale, brute-force search is fast enough and gives you perfect recall.
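
The flat-versus-IVF rule of thumb above can be written as a small guard. The helper names are hypothetical, and the factor of 39 is simply the figure quoted in the text.

```python
MIN_POINTS_PER_CENTROID = 39  # training-set rule of thumb quoted above

def min_training_vectors(n_clusters: int) -> int:
    """Smallest training set that satisfies the points-per-centroid rule."""
    return n_clusters * MIN_POINTS_PER_CENTROID

def choose_index_kind(n_vectors: int, n_clusters: int = 256) -> str:
    """Return "ivf" only when there is enough data to train stable centroids."""
    return "ivf" if n_vectors >= min_training_vectors(n_clusters) else "flat"

print(min_training_vectors(256))
print(choose_index_kind(5_000))
print(choose_index_kind(1_000_000))
```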

Stage 3: Query Pipeline

A production query pipeline does more than embed-and-search. Here is our full query function with filtering, deduplication, and score thresholding:

from dataclasses import dataclass
from typing import Optional

@dataclass
class SearchResult:
    document_id: str
    chunk_id: int
    score: float
    text: str
    metadata: dict

def search_pipeline(
    query: str,
    index: faiss.IndexIVFFlat,
    document_store: dict,
    top_k: int = 10,
    score_threshold: float = 0.5,
    deduplicate: bool = True
) -> list[SearchResult]:
    """Full search pipeline with embedding, retrieval, filtering, and dedup."""
    
    # 1. Embed the query
    query_vector = embed_query(query)
    
    # 2. Retrieve candidates (fetch more than needed for filtering)
    fetch_k = top_k * 3 if deduplicate else top_k
    scores, indices = search(index, query_vector, top_k=fetch_k)
    
    # 3. Build result objects with metadata
    results = []
    for score, idx in zip(scores, indices):
        if idx == -1:  # FAISS returns -1 for empty slots
            continue
        if score < score_threshold:
            continue
        
        chunk_info = document_store[idx]
        results.append(SearchResult(
            document_id=chunk_info['doc_id'],
            chunk_id=chunk_info['chunk_id'],
            score=float(score),
            text=chunk_info['text'],
            metadata=chunk_info['metadata']
        ))
    
    # 4. Deduplicate: keep only the best chunk per document
    if deduplicate:
        seen_docs = set()
        deduped = []
        for result in results:
            if result.document_id not in seen_docs:
                seen_docs.add(result.document_id)
                deduped.append(result)
        results = deduped
    
    return results[:top_k]

The deduplication step is important and often overlooked. When you chunk a long document, multiple chunks from the same document might appear in the top results. Without deduplication, your top 10 results might contain 7 chunks from the same document, which is rarely what users want. The fetch_k = top_k * 3 over-fetching compensates for the results that get deduplicated, making it likely (though not guaranteed) that you still return top_k unique documents.

The score threshold (score_threshold=0.5) prevents low-quality results from appearing. Without it, a query with no good matches still returns 10 results; they are just irrelevant. The threshold encodes a preference for silence over noise. We tuned our threshold empirically: at 0.3, we got too many false positives; at 0.7, we missed valid results. The optimal threshold depends on your embedding model and data, so do not copy this number blindly.
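
One way to tune the threshold is to sweep it over labelled (score, relevant?) pairs and look at precision and recall at each setting. The sketch below is an assumption about how such a sweep might look, not the exact procedure we used, and the scores and labels are made up.

```python
def sweep_thresholds(scored_pairs: list[tuple[float, bool]], thresholds: list[float]):
    """Return (threshold, precision, recall) for each candidate threshold.

    scored_pairs holds (similarity score, is_relevant) for judged results.
    Precision is None when no result passes the threshold.
    """
    total_relevant = sum(1 for _, relevant in scored_pairs if relevant)
    rows = []
    for threshold in thresholds:
        kept = [relevant for score, relevant in scored_pairs if score >= threshold]
        hits = sum(kept)
        precision = hits / len(kept) if kept else None
        recall = hits / total_relevant if total_relevant else 0.0
        rows.append((threshold, precision, recall))
    return rows

# Hypothetical judged results for illustration only.
pairs = [(0.82, True), (0.74, True), (0.61, True), (0.55, False),
         (0.48, True), (0.41, False), (0.35, False), (0.22, False)]

for threshold, precision, recall in sweep_thresholds(pairs, [0.3, 0.5, 0.7]):
    print(threshold, precision, recall)
```

On this toy data the low threshold keeps every relevant result but lets noise through, and the high threshold is clean but drops half of the relevant results, which mirrors the trade-off described above.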

Stage 4: Evaluation and Tuning

The hardest part of vector search is not building the pipeline—it is knowing whether the results are good. You need an evaluation dataset: a set of queries with known relevant documents. Start with 50–100 manually annotated query-document pairs. This is tedious work, but there is no shortcut. Automated evaluation (using the same embedding model to judge its own results) is circular and unreliable.

Use these metrics:

  • Recall@K — What fraction of relevant documents appear in the top K results? This measures completeness. If a user's question has 3 relevant documents and Recall@10 is 0.67, it means 2 of the 3 were retrieved.
  • MRR (Mean Reciprocal Rank) — The average, over all queries, of 1/rank of the first relevant result. MRR = 1.0 means the correct answer is always first; MRR = 0.5 is what you get if it is always second. This measures ranking quality.
  • Precision@K — What fraction of the top K results are relevant? This measures noise. High precision means few irrelevant results pollute the top positions.

def recall_at_k(results: list[SearchResult], relevant_ids: set[str], k: int) -> float:
    """Compute recall@K."""
    retrieved_ids = {r.document_id for r in results[:k]}
    hits = retrieved_ids & relevant_ids
    return len(hits) / len(relevant_ids) if relevant_ids else 0.0

def mrr(results: list[SearchResult], relevant_ids: set[str]) -> float:
    """Compute the reciprocal rank for one query (average over all queries to get MRR)."""
    for i, result in enumerate(results):
        if result.document_id in relevant_ids:
            return 1.0 / (i + 1)
    return 0.0
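
For completeness, here is a sketch of Precision@K and of the averaging step that turns per-query reciprocal ranks into MRR. To stay self-contained it works on plain lists of document IDs rather than SearchResult objects, and it divides precision by k even when fewer than k results are returned; that is one common convention, not the only one.

```python
def precision_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the top-k slots filled by relevant documents."""
    if k <= 0:
        return 0.0
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant_ids)
    return hits / k

def reciprocal_rank(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """1/rank of the first relevant result, or 0.0 if none was retrieved."""
    for position, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / position
    return 0.0

def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average reciprocal rank over (retrieved_ids, relevant_ids) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(retrieved, relevant) for retrieved, relevant in runs) / len(runs)

# Hypothetical evaluation runs: first hit at rank 1, rank 3, and never.
runs = [(["a", "b"], {"a"}), (["x", "y", "b"], {"b"}), (["x"], {"z"})]
print(round(mean_reciprocal_rank(runs), 3))
```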

Our production system targets Recall@10 >= 0.85 and MRR >= 0.6. When these metrics dip, we investigate in this order (from highest to lowest impact): chunking strategy, embedding model choice, index parameters, score threshold, deduplication logic. The first two levers account for 80% of quality variation in our experience.

Operational Concerns

Three things that will bite you in production that tutorials never mention:

Index updates. FAISS IVF indices are built for appends, not edits. IndexIVFFlat does expose remove_ids, but it makes a pass over the whole index, and there is no in-place update. If your documents change, you either need to rebuild the index periodically (simple but slow: our 2-million-vector index takes 4 minutes to rebuild) or use a system like Qdrant or Weaviate that supports CRUD operations natively. We rebuild nightly for datasets under 5 million vectors using a background job that builds a new index, verifies it against the evaluation set, and swaps it in atomically. For larger datasets or more frequent updates, a managed solution with native CRUD is the pragmatic choice.
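
The build, verify, swap pattern can be sketched with the standard library alone. The function below is a simplified outline, not our production job: build and verify are placeholder callables (in a FAISS setup, build might call faiss.write_index and verify might run the evaluation set against the new file), and the atomic step relies on os.replace, which renames atomically when source and destination are on the same filesystem.

```python
import os
import tempfile
from typing import Callable

def rebuild_and_swap(
    live_path: str,
    build: Callable[[str], None],     # writes a fresh index file to the given path
    verify: Callable[[str], bool],    # returns True if the new index is acceptable
) -> bool:
    """Build a new index next to the live one and swap it in only if it verifies."""
    directory = os.path.dirname(os.path.abspath(live_path))
    # Create the temp file in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    os.close(fd)
    try:
        build(tmp_path)
        if not verify(tmp_path):
            return False              # keep serving the old index
        os.replace(tmp_path, live_path)
        return True
    finally:
        if os.path.exists(tmp_path):  # clean up after a failed build or verification
            os.remove(tmp_path)
```

Readers that already have the old file open keep reading the old data; new readers pick up the new file on their next load.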

Embedding model versioning. If you update your embedding model, all existing vectors become incompatible. The new model maps text to a different vector space—similar texts might produce vectors that are far apart in the old space but close in the new space, or vice versa. You must re-embed your entire corpus when changing models. Plan for this by storing raw text alongside vectors, tracking which model version produced each vector, and automating the re-embedding pipeline so it is a single command, not a multi-day project.
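
A minimal way to track this is to store the model name next to each chunk. The record layout and field names below are hypothetical; the point is only that raw text and model version travel together, so finding what needs re-embedding is a one-line filter.

```python
from dataclasses import dataclass

@dataclass
class StoredChunk:
    chunk_id: int
    text: str            # raw text, kept so the chunk can be re-embedded later
    model_version: str   # embedding model that produced the stored vector

def chunks_needing_reembedding(chunks: list[StoredChunk], current_model: str) -> list[StoredChunk]:
    """Chunks whose stored vector came from a different model than the current one."""
    return [chunk for chunk in chunks if chunk.model_version != current_model]

# Hypothetical store contents for illustration.
store = [
    StoredChunk(0, "Refund Policy and Return Process", "all-MiniLM-L6-v2"),
    StoredChunk(1, "International shipping costs", "all-MiniLM-L6-v2"),
    StoredChunk(2, "Warranty terms", "text-embedding-ada-002"),
]

stale = chunks_needing_reembedding(store, current_model="text-embedding-ada-002")
print([chunk.chunk_id for chunk in stale])
```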

Cold starts. Loading a large FAISS index into memory takes time. Our 2-million-vector index (384 dimensions, IVF256) is 3.2 GB on disk and takes 8 seconds to load into RAM. In a containerized environment, this means your service is unavailable for 8 seconds on every restart. We mitigate this with health check grace periods (the load balancer does not route traffic until the index is loaded), rolling deployments (only restart one instance at a time), and persistent memory-mapped storage where possible.
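
The health-check part can be sketched as a readiness flag that flips only after the index has finished loading. The class and method names are illustrative; load_index stands in for whatever actually reads the index from disk, and is_ready is what a readiness endpoint for the load balancer would report.

```python
import threading
from typing import Any, Callable

class IndexService:
    """Loads the index in the background and reports readiness once it is in memory."""

    def __init__(self, load_index: Callable[[], Any]):
        self._load_index = load_index      # placeholder for the real loader
        self._ready = threading.Event()
        self.index = None

    def start(self) -> threading.Thread:
        """Begin loading without blocking process startup."""
        thread = threading.Thread(target=self._load, daemon=True)
        thread.start()
        return thread

    def _load(self) -> None:
        self.index = self._load_index()
        self._ready.set()                  # only now should traffic be routed here

    def is_ready(self) -> bool:
        """Value a readiness probe would return."""
        return self._ready.is_set()
```

If the loader raises, the flag never flips and the instance stays out of rotation, which is usually the behavior you want.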

Conclusion

A vector search pipeline is four components: embeddings, an index, a query pipeline, and an evaluation framework. Start with sentence-transformers and FAISS for a local setup that runs on a single machine with no external dependencies. Graduate to managed solutions (Pinecone, Qdrant, Weaviate) when you need CRUD operations, horizontal scaling, or consistently low latency at scale. The embedding model is your most important decision, so invest time in evaluating models against your specific data before committing to one. And always, always build an evaluation dataset before you start tuning parameters. Without evaluation data, you are optimizing blind, and every parameter change is a guess.

Appendix: Our Recommended Starting Stack

For teams building their first vector search pipeline, here is the specific stack we recommend based on six months of production experience. Use all-MiniLM-L6-v2 for embedding generation unless you have a specific reason to use a larger model: it runs on CPU, produces high-quality 384-dimensional vectors, and accepts input sequences of up to 256 tokens. Use 256-word chunks with 50-word overlap for documents longer than 256 words. Store chunks in a dictionary or SQLite database alongside their metadata (document ID, chunk index, title, URL, timestamp) so you can enrich search results without a second database query.

For the index, start with FAISS IndexFlatIP if you have fewer than 100,000 vectors—brute-force search is fast enough at this scale (under 10ms) and gives you perfect recall with zero configuration. Switch to IndexIVFFlat when query latency exceeds your target, typically around 500,000–1,000,000 vectors. If you need vector CRUD (add, update, delete without rebuilding), skip FAISS entirely and start with Qdrant or Weaviate—the operational simplicity is worth the managed service cost.

Build your evaluation set before you launch. Fifty query-document pairs is enough to catch major quality issues. One hundred pairs gives you statistical confidence in your metrics. Two hundred pairs is ideal but rarely necessary at launch. The evaluation set does not need to be perfect—it needs to be representative. Include common queries, edge cases (very short queries, queries with typos, queries in different phrasings), and adversarial examples (queries that should return no results). Review and expand the evaluation set monthly as you learn from production usage patterns. The teams that maintain their evaluation sets build great search. The teams that build the pipeline and forget evaluation build search that degrades silently until users complain.
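
One possible shape for such an evaluation set is shown below. The structure, field names, and pass criterion are assumptions made for this sketch: each case records its category, and adversarial cases carry an empty set of relevant IDs, so they pass only when the pipeline returns nothing.

```python
from dataclasses import dataclass, field

@dataclass
class EvalCase:
    query: str
    relevant_ids: set[str] = field(default_factory=set)  # empty = should return nothing
    category: str = "common"                             # "common", "edge", or "adversarial"

def passes(case: EvalCase, retrieved_ids: list[str], k: int = 10) -> bool:
    """A case passes if a relevant document is in the top k, or nothing is returned when nothing should be."""
    if not case.relevant_ids:
        return len(retrieved_ids) == 0
    return any(doc_id in case.relevant_ids for doc_id in retrieved_ids[:k])

# Hypothetical cases and document IDs for illustration.
cases = [
    EvalCase("how do I get my money back?", {"refund-policy"}, "common"),
    EvalCase("refnd polcy", {"refund-policy"}, "edge"),
    EvalCase("best pizza near me", set(), "adversarial"),
]

print(passes(cases[0], ["shipping-faq", "refund-policy"]))
print(passes(cases[2], []))
```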
