Natural Language Processing for Enterprise Search

Author: Sarah Chen

Published on: May 26, 2023

Enterprise search is broken in most organizations, and almost everyone has accepted the brokenness as normal. The search bar exists on the company intranet, the document management system, the knowledge base, the wiki, the ticketing system, and a dozen other tools. Users type a query, get 500 results sorted by some opaque relevance score, scan the first page, find nothing useful, reformulate their query three times, give up, and walk over to a colleague’s desk to ask in person. The information they needed existed in the system the entire time. The search just could not find it.

Article Overview

1. The Limitations of Keyword Search
2. Semantic Search with Vector Embeddings
3. Hybrid Search: Combining Keywords and Semantics
4. Vector Database Selection
5. Query Understanding and Enhancement
6. Retrieval-Augmented Generation (RAG)
7. Evaluation and Continuous Improvement

HARBOR SOFTWARE · Engineering Insights

The problem is not a lack of documents or a lack of indexing. It is a fundamental mismatch between how humans express information needs and how traditional search systems work. A user searching for “how do we handle returns from international customers” gets results containing the word “returns” — including financial returns reports, return values in engineering documentation, the return shipping policy that applies to domestic orders only, and the actual international returns policy buried on page 3 of results. Keyword matching finds words, not meaning.

NLP-powered enterprise search bridges this gap by understanding the semantics of both queries and documents. It returns results that match the user’s intent, not just the words they happened to type. Here is how to build one that works in practice, not just in a demo.

The Limitations of Keyword Search

Traditional search engines (Elasticsearch, Apache Solr, and even PostgreSQL’s built-in full-text search) work by tokenizing documents into terms, building an inverted index that maps each term to the documents containing it, and scoring matches using algorithms like BM25. BM25 is surprisingly effective for many use cases. It considers term frequency (how often the search term appears in a document, with diminishing returns for repeated occurrences), document length normalization (so long documents are not favored merely because they contain more words), and inverse document frequency (rare terms are weighted more heavily than common ones). For web search engines, where queries and documents share extensive vocabulary, BM25 works remarkably well.
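To make the three BM25 ingredients concrete, here is a minimal sketch of the textbook formula in plain Python. It assumes whitespace tokenization and commonly used default values for k1 and b. The function name and sample documents are ours for illustration, and production engines add analyzers, stemming, and index structures on top of this.

```python
import math
from collections import Counter


def bm25_scores(
    query: str, docs: list[str], k1: float = 1.5, b: float = 0.75
) -> list[float]:
    """Score each document against the query with a textbook BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: how many documents contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # no keyword overlap, no contribution
            # Inverse document frequency: rare terms weigh more
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # k1 saturates term frequency; b controls length normalization
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "returns and exchanges procedure for international customers",
    "quarterly financial returns report",
    "refund policy overview",
]
scores = bm25_scores("refund policy", docs)
```

Only the third document receives a non-zero score for this query. The first document covers the same topic but shares no terms with the query, so it scores exactly zero.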

Enterprise search has three specific problems that BM25 fundamentally cannot solve, no matter how you tune its parameters:

Vocabulary mismatch. The user searches for “refund policy” but the document uses “returns and exchanges procedure.” Same concept, zero keyword overlap, zero BM25 score. In enterprise environments where different departments, business units, and geographic offices use different terminology for the same concepts (“headcount” vs. “FTE” vs. “employee count” vs. “team size”), vocabulary mismatch is pervasive and systematic. We measured this on one client’s knowledge base: 23% of relevant documents for a set of 200 test queries had zero keyword overlap with the query that should have found them.

Intent ambiguity. The query “Python” might mean the programming language (engineering team), the Monty Python comedy troupe (events committee), or a type of snake (facilities team reporting a pest incident — this actually happened at one of our clients). BM25 cannot disambiguate without explicit boolean filters that users should not have to construct manually. The search system should understand context from the user’s department, recent search history, or the content of the query itself.

Complex natural language queries. Users increasingly express queries as complete questions: “What is the approval process for vendor contracts over $50K?” BM25 tokenizes this into individual words — “approval”, “process”, “vendor”, “contracts”, “50K” — and matches documents containing those words independently. It cannot understand the compositional meaning: that the user wants a process, specifically for approval, specifically for vendor contracts, specifically with a dollar threshold. A document about “vendor management best practices” might score highly despite not answering the question at all.
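Vocabulary mismatch can be estimated on your own corpus before committing to a rebuild. The sketch below computes the share of (query, relevant document) pairs with no content terms in common. The stopword list, the helper names, and the sample pairs are illustrative assumptions, not the measurement setup from the client project mentioned above.

```python
import re

# Small illustrative stopword list; a real analysis would use a fuller one
STOPWORDS = {"the", "a", "an", "and", "or", "of", "for", "to", "in", "on",
             "is", "do", "we", "how", "your"}


def content_terms(text: str) -> set[str]:
    """Lowercase alphanumeric tokens minus stopwords (no stemming)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {t for t in tokens if t not in STOPWORDS}


def zero_overlap_rate(pairs: list[tuple[str, str]]) -> float:
    """Fraction of (query, relevant document) pairs sharing no content terms."""
    if not pairs:
        return 0.0
    misses = sum(
        1 for query, doc in pairs
        if not (content_terms(query) & content_terms(doc))
    )
    return misses / len(pairs)


pairs = [
    ("refund policy", "Returns and exchanges procedure"),
    ("headcount by region", "FTE totals per office"),
    ("vendor contract approval", "Approval workflow for vendor contracts"),
    ("reset VPN password", "How to reset your VPN password"),
]
rate = zero_overlap_rate(pairs)
```

On these four sample pairs the rate is 0.5: the first two pairs share no terms at all. Because the sketch does not stem, "contract" and "contracts" count as different terms, so a real analysis should apply the same analyzer your keyword index uses.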

Semantic Search with Vector Embeddings

Semantic search represents both queries and documents as dense numerical vectors (embeddings) in a high-dimensional space where geometric proximity corresponds to semantic similarity. Documents and queries that are “about the same thing” are close together in this space, regardless of whether they share specific words. This solves vocabulary mismatch by design — the model learns that “refund policy” and “returns and exchanges procedure” should have similar vector representations because they appear in similar contexts across the training data.

The pipeline is conceptually simple:

Index time: For each document, split it into chunks, generate an embedding vector for each chunk using a language model, and store the vectors in a vector database alongside the document metadata and original text.
Query time: Generate an embedding for the user’s query using the same model. Find the nearest vectors in the database using approximate nearest neighbor search. Return the corresponding document chunks, ranked by vector similarity.

from sentence_transformers import SentenceTransformer
import numpy as np

# Initialize embedding model
# all-MiniLM-L6-v2: 384 dimensions, fast inference, good quality
# all-mpnet-base-v2: 768 dimensions, slower, better semantic quality
model = SentenceTransformer('all-MiniLM-L6-v2')

def generate_embedding(text: str) -> np.ndarray:
    """Generate a normalized dense vector for text."""
    return model.encode(text, normalize_embeddings=True)

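def chunk_document(
    text: str, chunk_size: int = 500, overlap: int = 50
) -> list[str]:
    """Split document into overlapping chunks for embedding.

    chunk_size and overlap are counted in words, not tokens. Text beyond
    the embedding model's maximum sequence length is truncated at encode
    time, so keep chunk_size within that limit for the model you choose.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    chunks = []
    for i in range(0, len(words), chunk_size - overlap):
        chunk = ' '.join(words[i:i + chunk_size])
        if len(chunk.strip()) > 50:  # Skip trailing fragments of 50 characters or fewer
            chunks.append(chunk)
    return chunks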

def index_document(doc_id: str, title: str, text: str, metadata: dict) -> list[dict]:
    """Chunk, embed, and prepare a document for vector storage."""
    chunks = chunk_document(text)
    records = []
    for i, chunk in enumerate(chunks):
        # Prepend title to chunk for better context
        enriched_text = f"{title}. {chunk}"
        embedding = generate_embedding(enriched_text)
        records.append({
            'id': f"{doc_id}_chunk_{i}",
            'doc_id': doc_id,
            'chunk_index': i,
            'text': chunk,
            'title': title,
            'embedding': embedding.tolist(),
            'metadata': metadata
        })
    return records
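The code above covers index time only. For query time, a minimal sketch is a brute-force similarity scan over the stored records, shown here with NumPy and toy three-dimensional vectors. The `search` function and the sample records are illustrative assumptions. A vector database replaces this scan with approximate nearest neighbor search once the corpus grows.

```python
import numpy as np


def search(
    query_embedding: np.ndarray, records: list[dict], top_k: int = 5
) -> list[dict]:
    """Rank stored chunk records by cosine similarity to the query embedding."""
    if not records:
        return []
    matrix = np.array([r['embedding'] for r in records], dtype=np.float32)
    # Embeddings are assumed L2-normalized, so dot product equals cosine similarity
    similarities = matrix @ query_embedding
    top = np.argsort(-similarities)[:top_k]
    return [{**records[i], 'score': float(similarities[i])} for i in top]


# Toy records standing in for the output of an indexing step
records = [
    {'id': 'a_chunk_0', 'doc_id': 'a', 'embedding': [1.0, 0.0, 0.0]},
    {'id': 'b_chunk_0', 'doc_id': 'b', 'embedding': [0.0, 1.0, 0.0]},
    {'id': 'c_chunk_0', 'doc_id': 'c', 'embedding': [0.6, 0.8, 0.0]},
]
query_embedding = np.array([0.0, 1.0, 0.0])
results = search(query_embedding, records, top_k=2)
```

In a real pipeline the query vector would come from the same embedding model used at index time, with the same normalization setting.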

The chunking strategy matters more than most people realize. Documents longer than the model’s context window (typically 256-512 tokens for efficient embedding models) must be split into chunks that are each independently meaningful. Overlapping chunks (we use 50-word overlap) prevent information from being lost at chunk boundaries. Prepending the document title to each chunk provides context that helps the embedding model understand what the chunk is about, even when the chunk alone is ambiguous.

The choice of embedding model significantly impacts quality. For English-only enterprise search, all-MiniLM-L6-v2 offers the best speed-quality tradeoff: 384-dimensional embeddings generated in about 80ms per document on CPU. For multilingual support or higher accuracy needs, OpenAI’s text-embedding-ada-002 (1536 dimensions, $0.0001 per 1K tokens) or Cohere’s embed-english-v3.0 are strong options. We benchmark candidate models against a set of 100-200 query-document pairs specific to the client’s domain before committing to one.
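A model benchmark of this kind can be as simple as recall@k over labeled query-document pairs. The sketch below is one way to structure it, not a description of our exact harness. `recall_at_k` takes any embedding function, and `toy_embed` is a deliberately crude stand-in (word counts over a fixed vocabulary) so the example runs without downloading a model.

```python
from typing import Callable

import numpy as np


def recall_at_k(
    embed: Callable[[list[str]], np.ndarray],
    queries: list[str],
    documents: list[str],
    relevant: list[int],
    k: int = 5,
) -> float:
    """Fraction of queries whose labeled relevant document is in the top k."""
    doc_vecs = embed(documents)
    query_vecs = embed(queries)
    # Normalize rows so the dot product is cosine similarity
    doc_vecs = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    query_vecs = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    similarities = query_vecs @ doc_vecs.T
    top_k = np.argsort(-similarities, axis=1)[:, :k]
    hits = sum(1 for row, rel in zip(top_k, relevant) if rel in row)
    return hits / len(queries)


# Hypothetical stand-in embedder: word counts over a tiny fixed vocabulary
VOCAB = ["refund", "returns", "vpn", "password", "invoice"]


def toy_embed(texts: list[str]) -> np.ndarray:
    return np.array([
        [text.lower().split().count(word) + 1e-6 for word in VOCAB]
        for text in texts
    ])


queries = ["refund status", "vpn password reset"]
documents = ["invoice templates", "vpn password rotation", "refund processing times"]
relevant = [2, 1]  # index of the relevant document for each query
recall = recall_at_k(toy_embed, queries, documents, relevant, k=1)
```

To compare real candidates, pass each model's batch encoder as `embed` (for a SentenceTransformer, `model.encode` accepts a list of strings and returns an array) and keep the query set, documents, and k fixed across runs.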

Hybrid Search: Combining Keywords and Semantics

Pure semantic search has its own blind spots. It struggles with exact-match requirements (searching for error code “ERR-4523” should find documents containing that exact string, not documents about error handling in general), acronyms that are not in the model’s training data, newly coined terms, and proper nouns. The solution is hybrid search: combine keyword (BM25) results with semantic (embedding) results to get the strengths of both.

def hybrid_search(
    query: str,
    bm25_results: list[dict],
    semantic_results: list[dict],
    alpha: float = 0.6
) -> list[dict]:
    """
    Combine BM25 and semantic results using Reciprocal Rank Fusion.
    alpha: weight for semantic (0 = pure BM25, 1 = pure semantic)
    """
    k = 60  # RRF constant -- robust to changes, rarely needs tuning
    scores = {}

    for rank, result in enumerate(bm25_results):
        doc_id = result['doc_id']
        rrf_score = (1 - alpha) * (1 / (k + rank + 1))
        scores[doc_id] = scores.get(doc_id, 0) + rrf_score

    for rank, result in enumerate(semantic_results):
        doc_id = result['doc_id']
        rrf_score = alpha * (1 / (k + rank + 1))
        scores[doc_id] = scores.get(doc_id, 0) + rrf_score

    ranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    return [{'doc_id': doc_id, 'score': score} for doc_id, score in ranked]

Reciprocal Rank Fusion (RRF) is our preferred combination method because it requires no score normalization across ranking systems (BM25 scores and cosine similarity scores are on completely different scales), is parameter-light (the k constant is robust to changes), and performs well empirically. The alpha parameter controls the balance: we typically use 0.6-0.7 (favoring semantic) for general knowledge base search where vocabulary mismatch is the primary problem, and 0.3-0.4 (favoring keyword) for technical documentation where exact terms and code references matter.
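The effect of alpha is easiest to see on a small example. The sketch below restates the weighted RRF scoring in compact form over plain lists of document IDs. The function name and the IDs are made up for illustration.

```python
def weighted_rrf(
    bm25_ids: list[str],
    semantic_ids: list[str],
    alpha: float = 0.6,
    k: int = 60,
) -> list[tuple[str, float]]:
    """Fuse two ranked ID lists; alpha weights the semantic list."""
    scores: dict[str, float] = {}
    for rank, doc_id in enumerate(bm25_ids):
        scores[doc_id] = scores.get(doc_id, 0.0) + (1 - alpha) / (k + rank + 1)
    for rank, doc_id in enumerate(semantic_ids):
        scores[doc_id] = scores.get(doc_id, 0.0) + alpha / (k + rank + 1)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


bm25_ids = ["err-4523-runbook", "returns-policy", "vendor-guide"]
semantic_ids = ["returns-policy", "refund-faq", "err-4523-runbook"]

semantic_heavy = weighted_rrf(bm25_ids, semantic_ids, alpha=0.6)
keyword_heavy = weighted_rrf(bm25_ids, semantic_ids, alpha=0.3)
```

With alpha at 0.6 the top result is `returns-policy`, which the semantic ranker placed first. With alpha at 0.3 the top result flips to `err-4523-runbook`, the exact-match hit that BM25 placed first. Documents that appear in both lists outrank documents that appear in only one in both settings.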

Vector Database Selection

You need a database optimized for approximate nearest neighbor (ANN) search on high-dimensional vectors. The market has grown rapidly. Our assessment based on production deployments:

pgvector (PostgreSQL extension): Add vector search to your existing PostgreSQL database with a single CREATE EXTENSION. No new infrastructure, no new operational burden. Performance is adequate up to approximately 1 million vectors with IVFFlat indexing, and up to 5 million with HNSW indexing. Our default choice for projects starting out — operational simplicity trumps raw performance at this scale.
Pinecone: Fully managed, excellent query performance, simple API. Best for teams that want zero infrastructure management. Pricing starts around $70 per month for 1 million vectors.
Weaviate: Open-source, supports hybrid search natively within a single query. Good if you want to self-host and need hybrid keyword-plus-vector search without building the fusion logic yourself.
Qdrant: Open-source, written in Rust, excellent filtering performance. Best for use cases combining vector search with complex metadata filters (“find similar documents created in the last 30 days by the legal department in the EMEA region”).
Elasticsearch 8.x with kNN: If you already operate Elasticsearch, version 8 supports vector search natively. Avoids introducing a new database. Quality is reasonable though not best-in-class for pure vector search.

Query Understanding and Enhancement

Before executing search, processing the query can meaningfully improve result quality. We apply three query-time techniques, each targeting a different weakness:

Query rewriting with an LLM. Take the user’s raw natural language query and rewrite it into a more search-effective form. This helps with verbose queries, poorly structured questions, and queries that use conversational language rather than information-seeking language:

def rewrite_query(original_query: str, llm_client) -> str:
    response = llm_client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "system",
            "content": (
                "Rewrite the user's search query to improve search results. "
                "Keep the core intent. Remove filler words. "
                "Expand abbreviations. Add relevant synonyms in parentheses. "
                "Return only the rewritten query."
            )
        }, {
            "role": "user",
            "content": original_query
        }],
        temperature=0,
        max_tokens=100
    )
    return response.choices[0].message.content.strip()

Hypothetical Document Embeddings (HyDE). Instead of embedding the short query directly, ask an LLM to generate a hypothetical paragraph that would answer the query, then embed that paragraph. The embedding of a paragraph-length hypothetical answer is often geometrically closer in vector space to actual answer documents than the embedding of a short query. This technique consistently improves recall by 10-15% in our benchmarks.
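A minimal sketch of the HyDE flow, with the LLM call and the embedding model injected as plain callables so the structure is visible without any API dependency. `hyde_embedding`, `fake_generate`, and `fake_embed` are hypothetical names. In a real system `generate` would wrap your LLM client and `embed` would be the same function used to embed documents at index time.

```python
from typing import Callable

import numpy as np


def hyde_embedding(
    query: str,
    generate: Callable[[str], str],
    embed: Callable[[str], np.ndarray],
) -> np.ndarray:
    """Embed an LLM-written hypothetical answer instead of the raw query."""
    prompt = (
        "Write a short passage that answers the question below, in the "
        "style of an internal policy document.\n\n"
        f"Question: {query}"
    )
    hypothetical_answer = generate(prompt)
    # The hypothetical answer may be factually wrong; only its vector is used
    return embed(hypothetical_answer)


# Stand-ins so the sketch runs offline (assumptions, not real models)
def fake_generate(prompt: str) -> str:
    return "Vendor contracts above $50K require approval from the VP of Finance."


def fake_embed(text: str) -> np.ndarray:
    return np.array([float(len(text.split()))])


vector = hyde_embedding(
    "approval process for vendor contracts over $50K?",
    fake_generate,
    fake_embed,
)
```

The generated passage is never shown to the user. It serves only as a richer search probe, and the nearest-neighbor lookup then proceeds exactly as it would for a query embedding. The extra LLM call adds latency to every search, so it is worth measuring whether the recall gain justifies it for your query mix.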

Query decomposition for complex questions. Split compound queries into sub-queries. “What is the approval process for vendor contracts over $50K and who needs to sign off?” becomes two independent searches: one for the approval process and one for signoff authorities. Results from both searches are merged and deduplicated. This improves recall for complex information needs that span multiple documents.
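The merge-and-deduplicate step can be sketched as follows, assuming the sub-queries have already been produced (for example by an LLM) and that every sub-query goes through the same retriever so the scores are comparable. `search_decomposed` and the canned `fake_index` results are illustrative assumptions.

```python
from typing import Callable


def search_decomposed(
    sub_queries: list[str],
    search: Callable[[str], list[dict]],
    top_k: int = 5,
) -> list[dict]:
    """Run each sub-query, keep the best-scoring hit per chunk, and merge."""
    best: dict[str, dict] = {}
    for sub_query in sub_queries:
        for result in search(sub_query):
            key = result['id']
            # Deduplicate: keep only the highest-scoring occurrence of a chunk
            if key not in best or result['score'] > best[key]['score']:
                best[key] = {**result, 'matched_query': sub_query}
    merged = sorted(best.values(), key=lambda r: r['score'], reverse=True)
    return merged[:top_k]


# Canned results standing in for a real retriever
fake_index = {
    "vendor contract approval process": [
        {'id': 'proc_chunk_0', 'score': 0.82},
        {'id': 'matrix_chunk_1', 'score': 0.55},
    ],
    "vendor contract sign-off authority": [
        {'id': 'matrix_chunk_1', 'score': 0.91},
        {'id': 'proc_chunk_0', 'score': 0.40},
    ],
}
merged = search_decomposed(list(fake_index), lambda q: fake_index[q])
```

Recording `matched_query` on each result keeps track of which part of the compound question a chunk answers, which is useful when the results are later passed to an LLM as context.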

Retrieval-Augmented Generation (RAG)

The most transformative application of NLP in enterprise search is RAG: using retrieved document chunks as context for an LLM to generate a direct, synthesized answer to the user’s question. Instead of returning a ranked list of documents that the user must read and synthesize themselves, the system returns a concise answer derived from the most relevant sources, with citations back to the original documents.

def search_and_answer(query: str, vector_db, llm_client) -> dict:
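    # Step 1: Retrieve top relevant chunks via hybrid search.
    # hybrid_search_pipeline is not defined in this article; it is assumed to
    # run BM25 and vector retrieval, fuse the rankings, and return chunk dicts
    # with 'doc_id', 'title', 'text', and 'score' keys.
    results = hybrid_search_pipeline(query, vector_db, top_k=5)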

    # Step 2: Build context string from retrieved chunks
    context_parts = []
    sources = []
    for i, result in enumerate(results):
        context_parts.append(f"[Source {i+1}] {result['title']}:\n{result['text']}")
        sources.append({
            'doc_id': result['doc_id'],
            'title': result['title'],
            'url': result.get('url', ''),
            'relevance_score': result['score']
        })

    context = "\n\n".join(context_parts)

    # Step 3: Generate answer grounded in retrieved context
    response = llm_client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "system",
            "content": (
                "Answer the user's question based ONLY on the provided sources. "
                "If the sources don't contain enough information, say so explicitly. "
                "Cite sources using [Source N] notation. "
                "Be concise, specific, and actionable."
            )
        }, {
            "role": "user",
            "content": f"Sources:\n{context}\n\nQuestion: {query}"
        }],
        temperature=0
    )

    return {
        'answer': response.choices[0].message.content,
        'sources': sources,
        'query': query
    }

RAG transforms enterprise search from a document discovery tool (“here are 50 documents that might contain what you need”) into an answer engine (“here is the answer to your question, synthesized from these 3 documents”). For internal knowledge bases, policy repositories, and technical documentation, this is genuinely transformative. Users get answers in 10 seconds instead of spending 15-30 minutes reading through multiple documents and synthesizing the answer themselves.

The grounding instruction (“based ONLY on the provided sources”) is critical for reducing hallucination. Without it, the LLM might generate plausible-sounding answers from its training data that contradict the organization’s actual policies. With grounding, many would-be wrong answers become “I don’t have enough information to answer this,” which is dramatically safer. The instruction lowers the risk rather than eliminating it, so the cited sources should still be shown alongside the answer for verification.
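A cheap safeguard on top of the prompt is to verify that the generated answer cites sources that were actually supplied. The sketch below parses the [Source N] notation requested in the system prompt. `check_citations` is a hypothetical helper, and it checks only that citations exist and are in range, not that the cited text supports the claim.

```python
import re


def check_citations(answer: str, num_sources: int) -> dict:
    """Find [Source N] citations and flag any that point outside 1..num_sources."""
    cited = {int(n) for n in re.findall(r"\[Source (\d+)\]", answer)}
    valid = {n for n in cited if 1 <= n <= num_sources}
    return {
        'cited': sorted(valid),
        'invalid': sorted(cited - valid),
        'has_citation': bool(valid),
    }


answer = (
    "Contracts over $50K need VP approval [Source 2]. "
    "Legal reviews them first [Source 7]."
)
report = check_citations(answer, num_sources=5)
```

An answer with no valid citations, or with citations to sources that were never provided, is a reasonable candidate for falling back to the plain ranked result list.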

Evaluation and Continuous Improvement

Measuring search quality is notoriously difficult because relevance is subjective and context-dependent. We use a pragmatic combination of quantitative metrics and qualitative review:

Click-through rate (CTR): What percentage of searches result in a user clicking a result? Target: above 60% on the first page. Below 40% indicates systemic relevance problems.
Mean Reciprocal Rank (MRR): The average of 1/rank where rank is the position of the first relevant result. An MRR of 0.5 means the first relevant result is typically at position 2. Target: above 0.6.
Abandonment rate: What percentage of searches lead to no clicks, no query refinement, and no answer interaction? Target: below 25%.
RAG answer faithfulness: For RAG systems, use a separate LLM (different from the answer generator) to evaluate whether the generated answer is faithful to the source documents and actually answers the question. This automated evaluation correlates well with human judgments at a fraction of the cost.
Weekly query review: Every week, review the 20 worst-performing queries (lowest CTR, highest abandonment, lowest faithfulness scores). These reveal gaps in your document corpus, embedding model weaknesses, chunking problems, and opportunities for query understanding improvement.
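The first three metrics can be computed directly from a search log. The sketch below assumes a hypothetical log schema with one entry per search: `clicked_ranks` (positions the user clicked), `first_relevant_rank` (position of the first result judged relevant, or None), `refined` (whether the user reformulated the query), and `answer_used` (whether the user interacted with a generated answer). Your logging will differ, so treat the field names as placeholders.

```python
def search_metrics(log: list[dict]) -> dict:
    """Compute CTR, MRR, and abandonment rate from per-search log entries."""
    total = len(log)
    if total == 0:
        return {'ctr': 0.0, 'mrr': 0.0, 'abandonment_rate': 0.0}

    clicked = sum(1 for entry in log if entry['clicked_ranks'])
    # Reciprocal rank is 0 when no relevant result was returned
    reciprocal_ranks = [
        1 / entry['first_relevant_rank'] if entry['first_relevant_rank'] else 0.0
        for entry in log
    ]
    abandoned = sum(
        1 for entry in log
        if not entry['clicked_ranks']
        and not entry['refined']
        and not entry['answer_used']
    )
    return {
        'ctr': clicked / total,
        'mrr': sum(reciprocal_ranks) / total,
        'abandonment_rate': abandoned / total,
    }


log = [
    {'clicked_ranks': [1], 'first_relevant_rank': 1, 'refined': False, 'answer_used': False},
    {'clicked_ranks': [2], 'first_relevant_rank': 2, 'refined': False, 'answer_used': False},
    {'clicked_ranks': [], 'first_relevant_rank': None, 'refined': True, 'answer_used': False},
    {'clicked_ranks': [], 'first_relevant_rank': 4, 'refined': False, 'answer_used': False},
]
metrics = search_metrics(log)
```

For this four-entry log the CTR is 0.5, the MRR is 0.4375, and the abandonment rate is 0.25. Sorting per-query versions of the same numbers gives the worst-performing list for the weekly review.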

Conclusion

Enterprise search powered by NLP is not about replacing Elasticsearch with a vector database. It is about building a multi-layered system that understands what users actually need, matches their intent against document semantics rather than just keywords, and delivers synthesized answers rather than ranked lists of links. The technical components — embedding models, vector databases, hybrid search fusion, query enhancement, and RAG — are all available as production-ready, well-documented tools. The hard work is integrating them into a system that handles real enterprise data with all its messiness: inconsistent terminology across departments, stale documents that were never archived, duplicate content across platforms, and users who do not know the precise keywords that would find what they need.

Start with hybrid search on your most critical document collection — the one your team searches most frequently and complains about most loudly. Measure CTR and abandonment rate as your baseline. Layer on RAG for the use cases where users consistently need synthesized answers rather than individual documents. Every incremental improvement in search quality pays compound dividends across the entire organization, because every employee searches for information every day, and every minute saved searching is a minute available for actual work.

The organizations that get the most value from NLP-powered search are those that treat it as an ongoing capability, not a one-time project. Search quality degrades naturally as new documents are added in new formats, organizational terminology evolves, and user needs shift. Continuous investment in evaluation, query analysis, embedding model updates, and document corpus curation is what separates search systems that deliver sustained value from those that impress in the demo and disappoint six months later.
