
Building Recommendation Systems That Don’t Feel Creepy

Author: Sarah Chen

Published on: August 30, 2024

Recommendation systems are among the most commercially impactful applications of machine learning. Netflix has estimated that its recommendation engine saves it $1 billion per year in reduced churn. Amazon is widely reported to derive roughly 35% of its revenue from recommendations. Spotify’s Discover Weekly has become a primary way for 40 million users to find new music. But there is a line between “helpful suggestions” and “my phone is listening to me,” and most engineering teams do not think about where that line is until they have already crossed it and users are tweeting screenshots of uncomfortably precise recommendations.

HARBOR SOFTWARE · Engineering Insights

This post covers how to build recommendation systems that are effective without being unsettling, including the technical architecture, the algorithmic choices, and the product design patterns that keep users comfortable while still driving meaningful business metrics.

Why Recommendations Feel Creepy

The creepiness factor is not random. It comes from three specific, identifiable causes, and understanding them precisely is essential because the technical solutions differ for each one:

Unexplainable precision. When a recommendation is too accurate based on data the user did not knowingly provide, it feels like surveillance. Recommending baby products to someone who only searched for folic acid once crosses this line. The user’s mental model is: “I searched for a vitamin supplement. Why does this store think I’m pregnant?” The system connected dots the user did not expect it to connect, and the resulting recommendation reveals an inference the user did not consent to.
Cross-context leakage. Using work browsing behavior to recommend personal products (or vice versa) violates implicit contextual boundaries. A user who researches ergonomic office furniture during work hours does not expect to see standing desk recommendations in their personal evening browsing session on the same platform. Users maintain mental separation between their professional and personal selves, and recommendations that breach this boundary feel invasive even if the data is technically from the same account.
Temporal uncanniness. This takes two forms. The “I was just talking about this” phenomenon occurs when a recommendation coincidentally aligns with an offline conversation, which is almost always coincidence combined with confirmation bias, but it erodes trust in the platform. The more common form is the “I already bought this” problem: continuing to recommend something the user already purchased. This is not creepy in the surveillance sense but it signals that the system is not paying attention, which paradoxically makes users think about how much the system IS paying attention to their other behavior.

Architecture: The Three-Layer Approach

Our recommendation architecture separates candidate generation, scoring, and filtering into distinct layers. This separation is the architectural decision that makes it possible to add privacy constraints, diversity requirements, and sensitivity rules without rebuilding the core recommendation algorithm:

class RecommendationPipeline:
    def __init__(self, config: PipelineConfig):
        self.candidate_generator = CandidateGenerator(
            collaborative=CollaborativeFilter(config.cf_model),
            content_based=ContentBasedFilter(config.cb_model),
            popularity=PopularityFilter(config.pop_params)
        )
        self.scorer = RecommendationScorer(
            model=config.scoring_model
        )
        self.filter_chain = FilterChain([
            PurchaseHistoryFilter(),    # Remove recently purchased
            InventoryFilter(),          # Remove out-of-stock
            PrivacyFilter(),            # Block cross-context leakage
            SensitivityFilter(),        # Handle health, financial, etc.
            DiversityFilter(),          # Ensure category variety
            ExplainabilityFilter(),     # Only show explainable recs
            FreshnessFilter(),          # Avoid stale recommendations
        ])
    
    async def recommend(
        self, user: User, 
        context: RecommendationContext, 
        n: int = 20
    ) -> list[Recommendation]:
        # Step 1: Generate up to 500 candidates (fast, broad)
        candidates = await self.candidate_generator.generate(
            user, context, limit=500
        )
        
        # Step 2: Score each candidate (slower, more precise)
        scored = await self.scorer.score(
            user, candidates, context
        )
        scored.sort(key=lambda x: x.score, reverse=True)
        
        # Step 3: Apply filter chain sequentially
        filtered = await self.filter_chain.apply(
            scored, user, context
        )
        
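        # Step 4: Attach explanations to surviving recs,
        # keeping any explanation a filter already set
        # (e.g. the serendipity label)
        top_n = filtered[:n]
        for rec in top_n:
            if getattr(rec, "explanation", None) is None:
                rec.explanation = self._generate_explanation(
                    rec, user, context
                )
        
        return top_n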
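
The pipeline relies on a FilterChain whose implementation the post does not show. A minimal sketch of what such a chain might look like follows, assuming each filter exposes an apply(recommendations, user, context) method that may be either synchronous or asynchronous. The names RecFilter, DropNegativeScores, and AsyncTopHalf are hypothetical and exist only to exercise the chain.

```python
import asyncio
import inspect
from typing import Any, Protocol


class RecFilter(Protocol):
    """Assumed filter interface: take a ranked list, return a ranked list."""
    def apply(self, recommendations: list, user: Any, context: Any) -> list: ...


class FilterChain:
    """Hypothetical sketch: run filters in order, each seeing the previous output."""

    def __init__(self, filters: list[RecFilter]):
        self.filters = filters

    async def apply(self, recommendations: list, user: Any, context: Any) -> list:
        current = recommendations
        for f in self.filters:
            result = f.apply(current, user, context)
            # Accept both plain and async filters
            if inspect.isawaitable(result):
                result = await result
            current = result
            if not current:
                break  # Nothing left for later filters to do
        return current


class DropNegativeScores:
    """Toy synchronous filter."""
    def apply(self, recommendations, user, context):
        return [r for r in recommendations if r["score"] >= 0]


class AsyncTopHalf:
    """Toy asynchronous filter that keeps the top half of the list."""
    async def apply(self, recommendations, user, context):
        return recommendations[: max(1, len(recommendations) // 2)]


recs = [
    {"id": "a", "score": 0.9},
    {"id": "b", "score": 0.4},
    {"id": "c", "score": -0.1},
]
chain = FilterChain([DropNegativeScores(), AsyncTopHalf()])
result = asyncio.run(chain.apply(recs, user=None, context=None))
print([r["id"] for r in result])
```

Because every filter sees only what earlier filters let through, the order of the chain matters: cheap hard exclusions (purchase history, inventory) run first, and policy filters such as privacy and sensitivity run before diversity so that exploration items are chosen from candidates that are already safe to show.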

Candidate generation in detail

We use a hybrid candidate generator that combines three signal sources. Collaborative filtering (“users who bought X also bought Y”) uses matrix factorization via the implicit library with ALS (Alternating Least Squares), trained nightly on the full user-item interaction matrix. This captures taste patterns that transcend item attributes: people who like this obscure art house film also tend to like this particular brand of tea. Content-based similarity uses pre-computed item embeddings from a fine-tuned sentence transformer model, capturing attribute-level similarity: items with similar descriptions, specifications, and categories. Popularity within the user’s segment captures trending items that the other two methods might miss because they are too new to have interaction data.

Each method contributes candidates independently, and the candidate pool is the union of all three (typically 150-200 from collaborative, 150-200 from content-based, and 50-100 from popularity). The scoring layer then re-ranks this combined pool using a gradient-boosted model (LightGBM) that considers all three signals plus contextual features (time of day, device type, recent browsing session).
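
As a rough illustration of the union step, the sketch below merges candidate IDs from the three sources and remembers which sources proposed each item, so the scoring layer can use that as a feature. The function name merge_candidates and the rule of keeping items proposed by more sources first when the pool exceeds the limit are assumptions made for this example, not details from our production system.

```python
def merge_candidates(
    sources: dict[str, list[str]], limit: int = 500
) -> dict[str, set[str]]:
    """Union candidate IDs across sources, tracking which sources proposed each."""
    merged: dict[str, set[str]] = {}
    for source_name, item_ids in sources.items():
        for item_id in item_ids:
            merged.setdefault(item_id, set()).add(source_name)

    # Assumption: when trimming, prefer items several sources agree on.
    # sorted() is stable, so ties keep their first-seen order.
    ranked = sorted(merged, key=lambda i: len(merged[i]), reverse=True)
    return {item_id: merged[item_id] for item_id in ranked[:limit]}


pool = merge_candidates(
    {
        "collaborative": ["tea", "film", "lamp"],
        "content_based": ["lamp", "desk"],
        "popularity": ["mug", "tea"],
    },
    limit=4,
)
for item_id, proposed_by in pool.items():
    print(item_id, sorted(proposed_by))
```

Duplicates collapse into a single candidate, which is why the merged pool is usually smaller than the sum of the per-source counts.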

The Privacy Filter: Technical Implementation

The privacy filter is where we prevent cross-context leakage and limit unexplainable precision. It operates on a simple but powerful principle: every recommendation must be explainable by data the user knowingly provided in the current context. If we cannot point to specific user behavior that justifies the recommendation, we do not show it.

class PrivacyFilter:
    SENSITIVE_CATEGORIES = {
        'health', 'fertility', 'sexual_wellness', 
        'addiction_recovery', 'mental_health',
        'political', 'religious', 'financial',
        'weight_management', 'disability'
    }
    
    def apply(
        self, recommendations: list[ScoredItem], 
        user: User, context: RecommendationContext
    ) -> list[ScoredItem]:
        filtered = []
        
        for rec in recommendations:
            # Rule 1: Sensitive categories need explicit signal
            if rec.item.category in self.SENSITIVE_CATEGORIES:
                if not self._has_explicit_interest(
                    user, rec.item.category, context
                ):
                    continue
            
            # Rule 2: Must be traceable to in-context behavior
            explanation = self._generate_explanation(
                rec, user, context
            )
            if explanation is None:
                continue  # Can't explain = don't show
            
            # Rule 3: Cross-context contamination check
            if self._is_cross_context(rec, user, context):
                continue
            
            rec.explanation = explanation
            filtered.append(rec)
        
        return filtered
    
    def _has_explicit_interest(
        self, user: User, category: str, 
        context: RecommendationContext
    ) -> bool:
        """Require explicit engagement in this context."""
        actions = user.get_actions(
            context=context.id,
            categories=[category],
            action_types=['view_product', 'add_to_cart', 
                         'search', 'category_browse'],
            days=30
        )
        # Require at least 2 deliberate interactions
        # (a single accidental click should not trigger 
        #  sensitive recommendations)
        deliberate = [
            a for a in actions 
            if a.duration_seconds > 5  # Stayed on page >5s
        ]
        return len(deliberate) >= 2
    
    def _is_cross_context(
        self, rec: ScoredItem, user: User,
        context: RecommendationContext
    ) -> bool:
        """Check if rec is primarily driven by 
        behavior from a different context."""
        signals = rec.signal_breakdown
        in_context_weight = sum(
            w for source, w in signals.items()
            if source.startswith(context.id)
        )
        total_weight = sum(signals.values())
        
        # If >60% of the signal comes from other contexts,
        # this is cross-context leakage
        if total_weight <= 0:
            return True  # No attributable signal: treat as leakage
        return (in_context_weight / total_weight) < 0.4
    
    def _generate_explanation(
        self, rec: ScoredItem, user: User, 
        context: RecommendationContext
    ) -> str | None:
        """Generate human-readable explanation.
        Returns None if no non-creepy explanation exists."""
        reasons = []
        signals = rec.signal_sources
        
        if signals.get('purchased_similar'):
            item_name = signals['purchased_similar']['name']
            reasons.append(
                f"Similar to {item_name} that you purchased"
            )
        
        if signals.get('viewed_category'):
            reasons.append(
                "Based on categories you've browsed"
            )
        
        if signals.get('collaborative'):
            reasons.append(
                "Popular with customers who have "
                "similar interests"
            )
        
        if signals.get('trending_segment'):
            reasons.append("Trending in your area")
        
        # Only return if we have at least one 
        # user-friendly reason
        return reasons[0] if reasons else None

The key design decisions in this filter are: requiring 2 deliberate interactions (not just clicks) for sensitive categories, checking dwell time to exclude accidental clicks, and requiring that at least 40% of the recommendation signal comes from the current context. These thresholds were calibrated through A/B testing: lower thresholds produced more relevant recommendations but higher creepiness reports in user surveys, while higher thresholds were too aggressive and suppressed legitimately good recommendations.
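
To make the 40% rule concrete, here is a small worked example of the in-context share calculation. It is a sketch under one assumption: signal sources are keyed as "context_id:signal_name", which is one way to satisfy the startswith check in the filter. The function names and the sample weights are illustrative.

```python
def in_context_share(signal_breakdown: dict[str, float], context_id: str) -> float:
    """Fraction of total signal weight that comes from the given context."""
    total = sum(signal_breakdown.values())
    if total <= 0:
        return 0.0
    in_context = sum(
        weight
        for source, weight in signal_breakdown.items()
        if source.startswith(context_id)
    )
    return in_context / total


def is_cross_context(
    signal_breakdown: dict[str, float], context_id: str, min_share: float = 0.4
) -> bool:
    """True when too little of the signal comes from the current context."""
    return in_context_share(signal_breakdown, context_id) < min_share


# Half of the weight comes from the personal context: allowed
balanced = {
    "personal:viewed_category": 0.3,
    "personal:purchased_similar": 0.2,
    "work:search": 0.5,
}
# Only 10% comes from the personal context: blocked as leakage
leaky = {
    "personal:viewed_category": 0.1,
    "work:search": 0.6,
    "work:view_product": 0.3,
}

print(is_cross_context(balanced, "personal"))
print(is_cross_context(leaky, "personal"))
```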

The Diversity Filter: Breaking the Echo Chamber

Left unchecked, recommendation systems create feedback loops. Users click on what is recommended, which reinforces those recommendations, which narrows the recommendation pool, which limits what users see. Over weeks, a user who once browsed broadly across 10 categories gets funneled into 2-3 categories. This is bad for discovery (users stop finding new things they might love), bad for long-term engagement (recommendation fatigue sets in when every session looks the same), and bad for the business (you want users discovering new high-margin categories, not just re-buying commodity products from a single category).

class DiversityFilter:
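    def __init__(
        self, 
        min_categories: int = 3, 
        max_same_brand: int = 3,
        exploration_ratio: float = 0.15,
        popularity_pool: list | None = None,
        quality_threshold: float = 0.0
    ):
        self.min_categories = min_categories
        self.max_same_brand = max_same_brand
        self.exploration_ratio = exploration_ratio
        # Used by _add_serendipity below; these defaults
        # are placeholders, not tuned values
        self.popularity_pool = popularity_pool or []
        self.quality_threshold = quality_threshold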
    
    def apply(
        self, recommendations: list[ScoredItem], 
        user: User, context: RecommendationContext
    ) -> list[ScoredItem]:
        selected = []
        category_counts = {}
        brand_counts = {}
        
        # Phase 1: Select top items with diversity constraints
        for rec in recommendations:
            cat = rec.item.category
            brand = rec.item.brand
            
            if brand_counts.get(brand, 0) >= self.max_same_brand:
                continue
            
            selected.append(rec)
            category_counts[cat] = (
                category_counts.get(cat, 0) + 1
            )
            brand_counts[brand] = (
                brand_counts.get(brand, 0) + 1
            )
        
        # Phase 2: Ensure minimum category diversity
        if len(category_counts) < self.min_categories:
            selected = self._inject_exploration(
                selected, recommendations, 
                category_counts
            )
        
        # Phase 3: Add serendipity items
        n_explore = max(
            1, int(len(selected) * self.exploration_ratio)
        )
        selected = self._add_serendipity(
            selected, user, n_explore
        )
        
        return selected
    
    def _add_serendipity(
        self, selected: list, user: User, n: int
    ) -> list:
        """Replace lowest-scored items with items from 
        categories the user has never explored."""
        user_categories = set(
            user.get_interacted_categories(days=90)
        )
        novel_items = [
            item for item in self.popularity_pool
            if item.category not in user_categories
            and item.score > self.quality_threshold
        ]
        
        if novel_items and len(selected) > n:
            # Replace bottom N items with novel discoveries
            for i, novel in enumerate(novel_items[:n]):
                novel.explanation = (
                    "Something new you might like"
                )
                selected[-(i+1)] = novel
        
        return selected

The serendipity injection is the most controversial design choice. Replacing “objectively better” recommendations with novel discoveries reduces short-term click-through rate by 3-5%. But in our longitudinal studies across 3 e-commerce platforms, the 15% serendipity injection increased 90-day category breadth (number of distinct categories a user purchases from) by 28% and increased average order value by 12% because users discovered higher-value categories they had not previously considered. The key is that the serendipity items must pass a quality threshold (only popular, well-reviewed items) so users still find value in them even though they are outside the user’s established preference profile.
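
One practical detail: exploration items only help if they land in the slots the user actually sees. The sketch below shows one way to do that, assuming the replacement runs on a slate that has already been truncated to its final length. The Rec dataclass, the add_serendipity function, and the sample items and scores are all hypothetical and are not taken from the filter listing.

```python
from dataclasses import dataclass


@dataclass
class Rec:
    item_id: str
    category: str
    score: float
    explanation: str = ""


def add_serendipity(
    slate: list[Rec],
    novel_pool: list[Rec],
    seen_categories: set[str],
    exploration_ratio: float = 0.15,
    quality_threshold: float = 0.5,
) -> list[Rec]:
    """Swap the lowest-ranked slots of a final-length slate for novel items."""
    n_explore = max(1, int(len(slate) * exploration_ratio))
    # Only categories the user has not touched, and only items above the quality bar
    novel = [
        r for r in novel_pool
        if r.category not in seen_categories and r.score > quality_threshold
    ]
    novel.sort(key=lambda r: r.score, reverse=True)
    if not novel or len(slate) <= n_explore:
        return list(slate)

    picks = novel[:n_explore]
    for pick in picks:
        pick.explanation = "Something new you might like"
    return list(slate[: len(slate) - len(picks)]) + picks


slate = [Rec(f"audio-{i}", "audio", 1.0 - i * 0.01) for i in range(20)]
pool = [
    Rec("garden-1", "garden", 0.9),
    Rec("audio-99", "audio", 0.95),   # Already-explored category: skipped
    Rec("travel-1", "travel", 0.4),   # Below the quality threshold: skipped
    Rec("crafts-1", "crafts", 0.7),
    Rec("pets-1", "pets", 0.8),
]
result = add_serendipity(slate, pool, seen_categories={"audio", "books"})
print([r.item_id for r in result[-3:]])
```

With a 20-item slate and a 15% ratio, three slots are given to exploration and the other 17 keep their original ranking.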

The Sensitivity Tier System

Some product categories require extra care in how they are recommended. The sensitivity is not about the products themselves but about what the recommendation implies about the user. Recommending headphones is neutral. Recommending a pregnancy test implies knowledge about the user’s reproductive status. We maintain a four-tier sensitivity system:

Tier 0 (unrestricted): Electronics, clothing, books, home goods, kitchen, office supplies. Can be recommended based on any signal including weak collaborative filtering signals.
Tier 1 (moderate): Diet and fitness products, self-help books, career coaching services, dating-related items. Require at least 2 explicit interactions in the category within 30 days. Explanations use generic language (“Popular in categories you browse” rather than “Because you searched for weight loss”).
Tier 2 (sensitive): Health supplements, fertility products, financial counseling, addiction recovery resources, religious items. Require at least 3 explicit interactions AND a direct category page visit (not just a search that happened to match). Explanations must be explicitly opt-in: users see a “Recommended for you” label only after they enable personalization for that category in their settings.
Tier 3 (restricted): Categories the user has explicitly opted out of via a “Don’t show me this” action. Never recommend regardless of signal strength. Maintain a per-user blocklist that persists indefinitely.

The tier assignments are maintained in a configuration file reviewed quarterly by the product team. New categories default to Tier 1 (moderate) until explicitly classified. The quarterly review catches categories that should be reclassified based on user feedback: if a category generates disproportionate “don’t show me this” feedback (more than 3x the average rate), it is escalated to a higher tier.
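
The tier rules can be expressed as a small lookup plus a gate. The sketch below is illustrative only: the category-to-tier mapping, the function names, and the sample dismissal rates are assumptions, and the Tier 2 opt-in requirement for explanation labels is not modeled.

```python
from enum import IntEnum


class Tier(IntEnum):
    UNRESTRICTED = 0
    MODERATE = 1
    SENSITIVE = 2
    RESTRICTED = 3


# Illustrative assignments; the real configuration file is not shown here
TIER_BY_CATEGORY = {
    "electronics": Tier.UNRESTRICTED,
    "books": Tier.UNRESTRICTED,
    "fitness": Tier.MODERATE,
    "fertility": Tier.SENSITIVE,
}

# Minimum explicit interactions in the last 30 days, per tier
MIN_INTERACTIONS = {Tier.UNRESTRICTED: 0, Tier.MODERATE: 2, Tier.SENSITIVE: 3}


def tier_for(category: str, user_blocklist: set[str]) -> Tier:
    """A user opt-out always wins; unclassified categories default to Tier 1."""
    if category in user_blocklist:
        return Tier.RESTRICTED
    return TIER_BY_CATEGORY.get(category, Tier.MODERATE)


def may_recommend(
    category: str,
    interactions_30d: int,
    visited_category_page: bool,
    user_blocklist: set[str],
) -> bool:
    tier = tier_for(category, user_blocklist)
    if tier is Tier.RESTRICTED:
        return False
    if interactions_30d < MIN_INTERACTIONS[tier]:
        return False
    # Tier 2 additionally requires a direct category page visit
    if tier is Tier.SENSITIVE and not visited_category_page:
        return False
    return True


def categories_to_escalate(
    dismissal_rates: dict[str, float], factor: float = 3.0
) -> list[str]:
    """Categories whose dismissal rate exceeds `factor` times the average."""
    if not dismissal_rates:
        return []
    average = sum(dismissal_rates.values()) / len(dismissal_rates)
    return sorted(c for c, rate in dismissal_rates.items() if rate > factor * average)


print(may_recommend("fertility", 3, False, set()))
print(categories_to_escalate(
    {"electronics": 0.01, "books": 0.01, "kitchen": 0.01,
     "fitness": 0.01, "supplements": 0.11}
))
```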

Measuring What Matters

Standard recommendation metrics (click-through rate, conversion rate, revenue per recommendation slot) capture effectiveness but not user comfort. We track three additional metrics specifically designed to detect creepiness before it becomes a PR problem:

Recommendation dismissal rate: How often users click “not interested,” “hide this,” or similar negative feedback buttons. We track this per category, per recommendation source (collaborative vs. content-based vs. popularity), and per sensitivity tier. A dismissal rate above 5% for any segment triggers investigation. Above 8% triggers an automatic reduction in recommendation aggressiveness for that segment.
Category diversity index: Shannon entropy of recommended categories per user per session, computed with a base-2 logarithm and measured in bits. Higher entropy means more diverse recommendations. We target an entropy of at least 2.0 bits (equivalent to roughly equal representation of 4+ categories). Below 1.5 bits indicates the filter bubble is tightening and the diversity filter parameters need adjustment.
Explanation coverage: What percentage of displayed recommendations have a human-readable explanation attached? We require 100%. If we cannot explain a recommendation in terms the user would find reasonable, we do not show it. This is enforced by the ExplainabilityFilter in the filter chain: recommendations without explanations are dropped before they reach the user.
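
For reference, the diversity index and the dismissal thresholds above can be computed as in the following sketch. The function names and the sample sessions are illustrative; the thresholds are the ones quoted in the list.

```python
import math
from collections import Counter


def category_entropy_bits(categories: list[str]) -> float:
    """Shannon entropy (base 2) of the category mix in one session."""
    counts = Counter(categories)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum(
        (c / total) * math.log2(c / total) for c in counts.values()
    )


def dismissal_action(dismissal_rate: float) -> str:
    """Map a segment's dismissal rate to the response described above."""
    if dismissal_rate > 0.08:
        return "reduce_aggressiveness"
    if dismissal_rate > 0.05:
        return "investigate"
    return "ok"


# Four categories in equal proportion: exactly 2.0 bits
even_session = ["audio", "books", "kitchen", "garden"] * 3
# One dominant category: well under the 1.5-bit warning level
narrow_session = ["audio"] * 8 + ["books", "kitchen"]

print(category_entropy_bits(even_session))
print(round(category_entropy_bits(narrow_session), 3))
print(dismissal_action(0.06))
```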

We also run quarterly user surveys with a direct question: “In the past month, have any product recommendations felt uncomfortable, invasive, or too personal?” A “yes” rate above 3% triggers an architecture review. We have hit this threshold once, traced it to cross-context leakage from a new feature that merged browsing sessions across devices, and fixed it by tightening the cross-context filter. The survey is our ground truth for whether the system is behaving acceptably from the user’s perspective, as opposed to the algorithmic metrics that measure behavior but not sentiment.

A/B Testing Recommendation Changes Safely

Every change to a recommendation system has the potential to affect revenue, engagement, and user trust simultaneously. We never deploy recommendation algorithm changes to 100% of users without A/B testing, and we have learned through painful experience that recommendation A/B tests require a significantly longer observation period than typical UI tests.

A standard UI test (button color, layout change, copy variation) reaches statistical significance in 3-7 days because the effect is immediate and the primary metric (click rate or conversion) is fast-moving. Recommendation algorithm changes take 3-4 weeks to reach meaningful conclusions because the most important metrics (retention, category breadth, lifetime value, and the creepiness indicators described above) are inherently slow-moving. A recommendation change might increase short-term clicks while decreasing 30-day retention, and you will not see the retention effect in a 7-day test. We run all recommendation tests for a minimum of 28 days with a minimum sample size of 10,000 users per variant to ensure statistical power on slow-moving metrics.

We also monitor automated guardrail metrics during every active test. Revenue per user in the treatment group must not decrease by more than 2% during the first week of the test, even if our hypothesis predicts it will recover later as users adapt to the new recommendations. The recommendation dismissal rate must not increase by more than 1.5 percentage points versus control. And the category diversity index must not drop below our 2.0 entropy threshold for any treatment variant at any point during the test. If any guardrail is violated, the test is automatically paused by our experimentation platform and flagged for review by the data science team before it can be resumed or expanded.
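
The three guardrails can be checked with a few comparisons. The sketch below assumes per-variant metrics are already aggregated; the VariantMetrics dataclass, the guardrail_violations function, and the sample numbers are hypothetical and do not reflect our experimentation platform's API.

```python
from dataclasses import dataclass


@dataclass
class VariantMetrics:
    revenue_per_user: float
    dismissal_rate: float          # Fraction, e.g. 0.04 means 4%
    diversity_entropy_bits: float


def guardrail_violations(
    control: VariantMetrics,
    treatment: VariantMetrics,
    in_first_week: bool,
) -> list[str]:
    """Return the names of any guardrails the treatment variant violates."""
    violations = []

    # Revenue per user may not fall more than 2% during the first week
    if in_first_week and control.revenue_per_user > 0:
        drop = (
            control.revenue_per_user - treatment.revenue_per_user
        ) / control.revenue_per_user
        if drop > 0.02:
            violations.append("revenue_per_user")

    # Dismissal rate may not rise more than 1.5 percentage points
    if treatment.dismissal_rate - control.dismissal_rate > 0.015:
        violations.append("dismissal_rate")

    # Diversity may never fall below the 2.0-bit entropy threshold
    if treatment.diversity_entropy_bits < 2.0:
        violations.append("category_diversity")

    return violations


control = VariantMetrics(10.0, 0.040, 2.3)
treatment = VariantMetrics(9.7, 0.045, 1.9)
violations = guardrail_violations(control, treatment, in_first_week=True)
print(violations)
```

A non-empty result is the signal to pause the test and hand it to a reviewer.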

This automated guardrail system has prevented two problematic recommendation changes from reaching full deployment. The first inadvertently surfaced Tier 2 sensitive category items to users who had not met the explicit interaction threshold, because a code change to the scoring model bypassed the sensitivity filter for items above a certain score. The second reduced category diversity below our threshold due to a bug in the serendipity injection logic that caused the exploration items to be drawn from the same dominant category as the rest of the recommendations, rather than from underrepresented categories as intended. In both cases, the guardrails triggered within 72 hours of test start, well before the tests would have been evaluated for full rollout.

The Business Case for Restraint

The counterargument from stakeholders is always: “But showing more targeted recommendations increases conversion.” This is demonstrably true in the short term and demonstrably false in the long term. Our data across three e-commerce deployments shows a consistent pattern: aggressive recommendation targeting (using all available signals including cross-context, low interaction thresholds, no sensitivity filtering) increases 30-day conversion by 8-12% compared to our restrained approach. However, the same aggressive approach decreases 180-day retention by 15-20%. Users who feel surveilled reduce engagement gradually: they stop browsing, they use private/incognito mode more frequently, they provide less explicit feedback, and eventually they stop visiting entirely.

The restrained approach (explanation-required, diversity-enforced, sensitivity-aware, cross-context-blocked) produces 5-7% lower short-term conversion but 10-15% higher long-term retention. Over a 12-month period, the restrained approach generates more total revenue for every client we have measured, because the compounding effect of higher retention outweighs the marginal conversion loss. A user who converts at 4% but visits weekly for 12 months generates far more revenue than a user who converts at 4.5% but churns after 4 months.
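
The closing comparison is easy to check with back-of-the-envelope arithmetic. The sketch below assumes one visit per week, a per-visit conversion rate, equal order values for both users, and treats 4 months as roughly 17 weeks. These simplifications are for illustration and are not measured figures.

```python
def expected_purchases(
    conversion_rate: float, visits_per_week: float, active_weeks: int
) -> float:
    """Expected number of purchases over the period the user stays active."""
    return conversion_rate * visits_per_week * active_weeks


# Retained user: converts at 4%, visits weekly for 12 months (52 weeks)
retained = expected_purchases(0.04, 1, 52)
# Churned user: converts at 4.5%, but leaves after about 4 months (17 weeks)
churned = expected_purchases(0.045, 1, 17)

print(round(retained, 3), round(churned, 3), round(retained / churned, 2))
```

Under these assumptions the retained user makes about 2.7 times as many purchases, so the half-point conversion advantage does not come close to offsetting the early churn.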

Building recommendation systems that do not feel creepy is not about sacrificing effectiveness for ethics (though the ethical argument is valid on its own). It is about optimizing for the right time horizon. Short-term conversion is easy to measure and easy to optimize for. Long-term trust is harder to measure, slower to build, and impossible to rebuild once lost. Build for trust.