Building a Social Media Monitoring System from Scratch

Author: David Park

Published on: June 21, 2024

Commercial social media monitoring tools like Brandwatch, Sprout Social, and Mention charge $500-$3,000/month and still miss the signals that matter most to your business. They excel at volume metrics and sentiment pie charts. They fail at detecting emerging narratives, tracking competitor positioning shifts, and surfacing the specific conversations your sales team should join. When a client asked us to build a custom monitoring system that could do all of that for their niche B2B market, we learned a lot about what it actually takes to build one from the ground up.

HARBOR SOFTWARE · Engineering Insights

This post covers the architecture, data ingestion challenges, processing pipeline, and the AI layer that makes raw social data actually useful. We are not talking about a weekend project. This is a production system processing 50,000-100,000 posts per day across four platforms.

Architecture Overview

The system has four layers, each with distinct responsibilities and failure modes:

┌─────────────────────────────────────────────┐
│           Dashboard / API Layer              │
│  (Next.js + Chart.js + WebSocket feeds)      │
├─────────────────────────────────────────────┤
│          Processing & Analysis Layer         │
│  (Python workers: NLP, sentiment, topics)    │
├─────────────────────────────────────────────┤
│          Message Queue / Stream              │
│  (Redis Streams for buffering & fan-out)     │
├─────────────────────────────────────────────┤
│          Data Collection Layer               │
│  (Platform scrapers, API clients, webhooks)  │
└─────────────────────────────────────────────┘

Storage: PostgreSQL (structured) + Elasticsearch (full-text search)
Cache: Redis (rate limiting, dedup, real-time counters)

The critical design decision was using Redis Streams as the central bus rather than going straight from collection to processing. Social media data arrives in unpredictable bursts. A single viral tweet mentioning your brand can generate 10x normal volume in minutes. Without a buffer, your processing workers either need to be provisioned for peak load (expensive, wasteful 99% of the time) or they fall behind during spikes and create a growing backlog that takes hours to clear. Redis Streams absorbs these spikes and lets processors consume at their own pace while providing consumer group semantics for parallel processing.
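To make the buffering idea concrete, here is a minimal sketch of the collector and worker sides of a Redis Streams bus. It assumes a client object that behaves like `redis.Redis` from redis-py; the stream name, group name, and helper names are hypothetical and chosen only for illustration.

```python
import json

STREAM = "posts:raw"      # hypothetical stream name
GROUP = "processors"      # hypothetical consumer group name


def publish(client, post: dict, maxlen: int = 1_000_000) -> str:
    """Collector side: append one raw post to the stream.

    `client` is expected to behave like redis.Redis (redis-py).
    The approximate MAXLEN cap keeps memory bounded if workers stall.
    """
    return client.xadd(STREAM, {"payload": json.dumps(post)},
                       maxlen=maxlen, approximate=True)


def ensure_group(client) -> None:
    """Create the consumer group (and the stream) if it does not exist."""
    try:
        client.xgroup_create(STREAM, GROUP, id="0", mkstream=True)
    except Exception as exc:
        # Redis answers BUSYGROUP when the group already exists.
        if "BUSYGROUP" not in str(exc):
            raise


def consume_batch(client, consumer: str, handle,
                  count: int = 100, block_ms: int = 5000) -> int:
    """Worker side: read up to `count` new entries, process, acknowledge."""
    processed = 0
    response = client.xreadgroup(GROUP, consumer, {STREAM: ">"},
                                 count=count, block=block_ms)
    for _stream, entries in response or []:
        for entry_id, fields in entries:
            # Keys are bytes unless the client decodes responses.
            payload = fields.get("payload", fields.get(b"payload"))
            handle(json.loads(payload))
            client.xack(STREAM, GROUP, entry_id)  # ack only after success
            processed += 1
    return processed
```

Because each worker acknowledges an entry only after handling it, a crashed worker leaves its entries pending in the group rather than losing them, and adding workers to the same group spreads the load during a spike.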

Data Collection: The Ugly Reality

Every platform has different data access rules, rate limits, and reliability characteristics. Here is what we deal with on each major platform:

Twitter/X API

The v2 API provides the cleanest data pipeline, at a price: Pro access cost $5,000/month and Basic $100/month at the time of writing, and the full firehose is reserved for Enterprise contracts. We use the filtered stream endpoint for real-time collection and the search endpoint for backfilling:

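from datetime import datetime, timedelta, timezone

import tweepy

class TwitterCollector:
    def __init__(self, bearer_token: str):
        self.client = tweepy.Client(
            bearer_token=bearer_token,
            wait_on_rate_limit=True
        )
    
    def setup_stream(self, keywords: list[str],
                     stream_client: tweepy.StreamingClient):
        # Clear existing rules
        existing = stream_client.get_rules()
        if existing.data:
            stream_client.delete_rules(
                [r.id for r in existing.data]
            )
        
        # Add keyword rules (max 25 on Basic, 1000 on Pro)
        rules = [
            tweepy.StreamRule(f'"{kw}" -is:retweet lang:en')
            for kw in keywords
        ]
        stream_client.add_rules(rules)
    
    def backfill(self, query: str, days_back: int = 7):
        """Search recent tweets for historical data."""
        # Recent search only reaches back about 7 days, so clamp the
        # window and leave a small margin for clock skew.
        start_time = (datetime.now(timezone.utc)
                      - timedelta(days=min(days_back, 7))
                      + timedelta(minutes=1))
        tweets = []
        for response in tweepy.Paginator(
            self.client.search_recent_tweets,
            query=f'"{query}" -is:retweet lang:en',
            start_time=start_time,
            max_results=100,
            tweet_fields=['created_at', 'public_metrics', 
                         'author_id', 'conversation_id'],
            # user_fields only take effect with the author_id expansion;
            # follower counts live in each user's public_metrics, and
            # user objects arrive in response.includes['users'].
            expansions=['author_id'],
            user_fields=['username', 'public_metrics', 'verified']
        ):
            if response.data:
                tweets.extend(response.data)
        return tweets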

The filtered stream is the gold standard for real-time monitoring: sub-second delivery, no polling overhead, and you only receive tweets matching your rules. The downside is that stream disconnections happen regularly (network issues, API maintenance, rate limit enforcement), and reconnecting requires careful backfill logic to avoid gaps. We track the last received tweet timestamp and use the search endpoint to backfill any gap longer than 30 seconds upon reconnection.
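The reconnection rule can be sketched as a small pure function. This is an illustration under stated assumptions, not the production code: the function name is hypothetical, and the 7-day clamp reflects the reach of the recent search endpoint.

```python
from datetime import datetime, timedelta, timezone

GAP_THRESHOLD = timedelta(seconds=30)  # threshold described in the text
SEARCH_REACH = timedelta(days=7)       # recent search covers about 7 days


def backfill_window(last_seen: datetime, reconnected_at: datetime):
    """Return (start, end) to backfill via search, or None for a small gap."""
    gap = reconnected_at - last_seen
    if gap <= GAP_THRESHOLD:
        return None
    # Never ask the search endpoint for more history than it can serve.
    start = max(last_seen, reconnected_at - SEARCH_REACH)
    return start, reconnected_at
```

The returned pair maps onto the `start_time` and `end_time` parameters of the search call. Any overlap with tweets already received from the stream is harmless as long as posts are deduplicated by tweet ID downstream.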

Reddit

Reddit’s API is generous for read-only monitoring: 100 requests per minute with OAuth, which is enough to poll 50+ subreddits at 60-second intervals. We use asyncpraw for async collection. The challenge with Reddit is comment threading. A mention in a deeply nested comment reply carries different significance than one in a top-level post. We store the full thread context and compute a “visibility score” from three inputs: position in the thread (top-level = 1.0, each nesting level reduces it by 0.3), upvote count relative to the subreddit’s average, and the subreddit’s subscriber count. A top-level post in r/technology (15M subscribers, high visibility) is weighted 50x higher than a nested comment in a 5,000-subscriber niche subreddit.
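A minimal sketch of such a visibility score follows. The depth penalty comes from the description above; the way the three factors are combined (a product, with a logarithmic audience term and a floor of 0.1 on the depth factor) is an assumption made for illustration, so it will not reproduce the exact weights quoted above.

```python
import math


def depth_factor(depth: int) -> float:
    """1.0 for top-level posts, minus 0.3 per nesting level.

    The floor of 0.1 is an assumption so that deeply nested
    comments are down-weighted rather than zeroed out.
    """
    return max(1.0 - 0.3 * depth, 0.1)


def visibility_score(depth: int, upvotes: int,
                     subreddit_avg_upvotes: float,
                     subscribers: int) -> float:
    """Combine thread position, relative upvotes, and audience size."""
    # +1 smoothing avoids division by zero in quiet subreddits.
    relative_upvotes = (upvotes + 1) / (subreddit_avg_upvotes + 1)
    # Log scale keeps very large subreddits from dominating entirely.
    audience = math.log10(max(subscribers, 10))
    return depth_factor(depth) * relative_upvotes * audience
```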

LinkedIn

LinkedIn has the most restrictive API. Unless you are a LinkedIn Marketing Partner (which requires significant revenue commitment and a multi-month application process), you cannot access the feed API. We use a hybrid approach: the official API for company page analytics (follower counts, engagement metrics on your own posts) and a headless browser with Playwright for public post collection. The Playwright approach works by navigating to public profiles and company pages, scrolling to load posts, and extracting the post content, engagement counts, and timestamps from the DOM. This is fragile, rate-limited by IP, and breaks every time LinkedIn updates its frontend markup. We budget 4-6 hours per quarter for LinkedIn scraper maintenance, and we always keep a 48-hour buffer in our monitoring SLA to account for scraper downtime when LinkedIn ships a breaking change.

News and Blogs

RSS feeds cover the long tail of industry publications. We aggregate 200+ feeds with a custom crawler that respects robots.txt and uses conditional GET requests (If-Modified-Since, plus If-None-Match carrying the stored ETag) to minimize bandwidth. For major publications without RSS, we use web scraping with rotating residential proxies through a provider like Bright Data or Oxylabs. We rotate user agents, randomize request timing, and implement exponential backoff on 429 responses. The proxy cost is about $200/month for our volume, which is worth it to avoid IP blocks.
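As an illustration of the conditional GET step, the sketch below uses only the standard library. The function names and the User-Agent string are hypothetical; the idea is to send back the validators the server gave us last time and treat a 304 response as "nothing new".

```python
import urllib.error
import urllib.request


def conditional_headers(etag=None, last_modified=None) -> dict:
    """Build request headers from the validators stored on the last fetch."""
    headers = {"User-Agent": "feed-monitor/0.1"}  # hypothetical agent string
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def fetch_feed(url: str, etag=None, last_modified=None, timeout: int = 10):
    """Return (body, etag, last_modified); body is None if unchanged."""
    request = urllib.request.Request(
        url, headers=conditional_headers(etag, last_modified))
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return (response.read(),
                    response.headers.get("ETag"),
                    response.headers.get("Last-Modified"))
    except urllib.error.HTTPError as err:
        if err.code == 304:  # not modified: keep the old validators
            return None, etag, last_modified
        raise
```

Storing the returned ETag and Last-Modified values per feed lets the next poll skip the download entirely when the publisher has posted nothing new.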

The Processing Pipeline

Raw social posts are noisy. The processing pipeline transforms them into structured, searchable, actionable data. Each post passes through five stages, and the design of each stage was refined through months of production operation.

from datetime import datetime, timezone
from typing import Optional

# DeduplicationFilter, LanguageDetector, SpacyNERExtractor, SentimentAnalyzer,
# TopicClassifier, RawPost and ProcessedPost are project-internal classes.

class PostProcessor:
    def __init__(self, config):
        # config must expose target_languages, e.g. {"en"}
        self.config = config
        self.dedup = DeduplicationFilter()
        self.lang_detector = LanguageDetector()
        self.ner = SpacyNERExtractor("en_core_web_trf")
        self.sentiment = SentimentAnalyzer()
        self.topic_classifier = TopicClassifier()
    
    async def process(self, post: RawPost) -> Optional[ProcessedPost]:
        """Run one post through all five stages; None means it was dropped."""
        # Stage 1: Deduplication (simhash fingerprint)
        if self.dedup.is_duplicate(post.text):
            return None
        
        # Stage 2: Language detection and filtering
        lang = self.lang_detector.detect(post.text)
        if lang not in self.config.target_languages:
            return None
        
        # Stage 3: Entity extraction
        entities = self.ner.extract(post.text)
        # Brands, products, people, locations
        
        # Stage 4: Sentiment analysis
        sentiment = self.sentiment.analyze(
            post.text, 
            entities=entities  # aspect-based sentiment
        )
        
        # Stage 5: Topic classification
        topics = self.topic_classifier.classify(post.text)
        
        return ProcessedPost(
            raw=post,
            language=lang,
            entities=entities,
            sentiment=sentiment,
            topics=topics,
            # _compute_visibility is defined elsewhere in the class
            visibility_score=self._compute_visibility(post),
            processed_at=datetime.now(timezone.utc)
        )

Deduplication is harder than you think

Cross-platform deduplication is essential. The same blog post gets shared on Twitter, LinkedIn, Reddit, and Hacker News with different snippets and commentary. A product launch announcement gets reposted by 50 different accounts with minor variations. Without deduplication, your analytics are inflated and your team wastes time reviewing the same content multiple times.

We use simhash fingerprinting with a Hamming distance threshold of 3 to catch near-duplicates. The simhash algorithm creates a 64-bit fingerprint where textually similar posts (those sharing most of their words and phrases) produce fingerprints that differ in only a few bits. A Hamming distance of 3 (at most 3 bits different out of 64) catches 94% of cross-platform duplicates while keeping the false positive rate below 0.1%. The remaining 6% are caught by URL deduplication for posts containing links: we normalize URLs by stripping tracking parameters, normalizing www/non-www and http/https, and comparing the canonical URLs.
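The URL normalization step might look like the following sketch. The list of tracking parameters is an assumption for illustration; a real deployment would tune it to the links it actually sees.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed tracking parameters; extend as needed.
TRACKING_PARAMS = {"fbclid", "gclid", "mc_cid", "mc_eid"}
TRACKING_PREFIXES = ("utm_",)


def canonical_url(url: str) -> str:
    """Normalize a URL so that shared copies of one link compare equal."""
    parts = urlsplit(url.strip())
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    if parts.port and parts.port not in (80, 443):
        host = f"{host}:{parts.port}"  # keep non-default ports
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in TRACKING_PARAMS
        and not key.lower().startswith(TRACKING_PREFIXES)
    ]
    path = parts.path.rstrip("/") or "/"
    # Force https, sort remaining parameters, drop the fragment.
    return urlunsplit(("https", host, path, urlencode(sorted(query)), ""))
```

Two posts are then treated as duplicates when any of their canonical URLs match, regardless of how the surrounding commentary differs.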

The dedup database stores simhash fingerprints in a Redis sorted set. To check a new post, we compute its simhash and look up stored fingerprints within Hamming distance 3. Each comparison is O(1) using bit manipulation (XOR the two fingerprints and count the set bits), and we expire fingerprints after 72 hours to prevent unbounded memory growth.
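For readers unfamiliar with simhash, here is a self-contained in-memory sketch. It is not the Redis-backed implementation described above: the 3-word shingling, the BLAKE2 hash, and the four-band index are assumptions chosen to show one common way of finding fingerprints within Hamming distance 3 without comparing against every stored value.

```python
import hashlib
import re


def simhash(text: str) -> int:
    """64-bit simhash over 3-word shingles."""
    tokens = re.findall(r"\w+", text.lower())
    shingles = [" ".join(tokens[i:i + 3])
                for i in range(max(len(tokens) - 2, 1))]
    weights = [0] * 64
    for shingle in shingles:
        digest = hashlib.blake2b(shingle.encode(), digest_size=8).digest()
        value = int.from_bytes(digest, "big")
        for bit in range(64):
            weights[bit] += 1 if (value >> bit) & 1 else -1
    return sum(1 << bit for bit in range(64) if weights[bit] > 0)


def hamming(a: int, b: int) -> int:
    """Number of differing bits: XOR, then count the set bits."""
    return bin(a ^ b).count("1")


def bands(fingerprint: int):
    """Split a 64-bit fingerprint into four 16-bit bands."""
    return [(i, (fingerprint >> (16 * i)) & 0xFFFF) for i in range(4)]


class NearDuplicateIndex:
    """Two fingerprints within distance 3 must agree on at least one of
    the four bands (pigeonhole), so only matching buckets are scanned."""

    def __init__(self, max_distance: int = 3):
        self.max_distance = max_distance
        self.buckets = {}  # (band index, band value) -> set of fingerprints

    def seen(self, fingerprint: int) -> bool:
        for key in bands(fingerprint):
            for candidate in self.buckets.get(key, ()):
                if hamming(fingerprint, candidate) <= self.max_distance:
                    return True
        return False

    def add(self, fingerprint: int) -> None:
        for key in bands(fingerprint):
            self.buckets.setdefault(key, set()).add(fingerprint)
```

A production version would also need the 72-hour expiry, for example by storing an insertion timestamp alongside each fingerprint and pruning old entries periodically.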

Aspect-based sentiment

Overall sentiment (“this post is positive”) is nearly useless for business decisions. A post saying “Love the new dashboard design but the API documentation is terrible” has positive overall sentiment, which hides the critical feedback about documentation. Aspect-based sentiment tells you what people feel positive or negative about. The same post registers as positive for UI and negative for documentation.

We fine-tuned a DeBERTa-v3-base model on 5,000 manually labeled examples to extract aspect-sentiment pairs. The labeling was done using Scale AI’s data labeling service at approximately $0.40 per example ($2,000 total). Annotators identified aspects (product features, company attributes, competitor mentions) and assigned sentiment (positive, negative, neutral) to each aspect. The model achieves 0.84 macro-F1 on aspect extraction and 0.89 accuracy on sentiment classification for extracted aspects. Training took 8 hours on a single A100 GPU, and inference runs at 15ms per post on a T4, which is well within our processing budget.

The AI Layer: From Data to Insights

The most valuable part of the system is not the data collection or processing. It is the insight generation layer that runs on top of the processed data. We use GPT-4o for two specific analytical tasks that would be prohibitively expensive to do with human analysts at our data volume.

Narrative detection

Every hour, we cluster the last 24 hours of posts using BERTopic (which combines UMAP dimensionality reduction with HDBSCAN clustering on top of sentence-transformer embeddings). The top 10 clusters that are either new (not matching any existing narrative) or growing (50%+ volume increase in 6 hours) are fed to GPT-4o with a structured prompt that asks the model to identify whether each cluster represents a meaningful narrative or noise:
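The selection step before the model call can be sketched as below. The data structure is hypothetical, and two details are assumptions: "growing" is measured by comparing the last 6 hours with the 6 hours before that, and candidates are ranked by recent volume before taking the top 10.

```python
from dataclasses import dataclass


@dataclass
class ClusterStats:
    cluster_id: int
    posts_last_6h: int
    posts_prev_6h: int
    matches_existing_narrative: bool


def is_growing(cluster: ClusterStats, threshold: float = 0.5) -> bool:
    """True when volume rose by at least 50% versus the previous window."""
    if cluster.posts_prev_6h == 0:
        return cluster.posts_last_6h > 0
    growth = ((cluster.posts_last_6h - cluster.posts_prev_6h)
              / cluster.posts_prev_6h)
    return growth >= threshold


def select_for_review(clusters, limit: int = 10):
    """Keep clusters that are new or growing, largest recent volume first."""
    candidates = [
        c for c in clusters
        if not c.matches_existing_narrative or is_growing(c)
    ]
    candidates.sort(key=lambda c: c.posts_last_6h, reverse=True)
    return candidates[:limit]
```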

SYSTEM: You analyze clusters of social media posts about {brand} 
and identify emerging narratives. A narrative is a specific story, 
theme, or sentiment shift gaining traction. It is NOT a generic 
topic like "product discussion" - it is something specific like 
"customers are comparing our pricing unfavorably to Competitor X's 
new free tier."

USER: Here are {n} representative posts from a cluster detected 
in the last 24 hours:

{posts}

Previously identified active narratives:
{existing_narratives}

For this cluster, determine which one of the following it is:
1. A new narrative not in the existing list. If so, 
   summarize it in one sentence, rate urgency (1-5), and 
   estimate the audience reach based on the post metrics.
2. An evolution of an existing narrative. If so, say which 
   one and what changed.
3. Noise / not a meaningful narrative. Explain briefly.

Return JSON with your classification and reasoning.

This catches signals like “Customers are comparing our pricing unfavorably to Competitor X’s new free tier” days before it shows up in support tickets or sales call objections. In one case, we detected a narrative about product quality concerns from a batch of returns that started as 3 Reddit posts and grew to 40+ mentions across platforms over 48 hours. The client’s product team was briefed within 6 hours of the first detection, two full business days before their customer support team reported the trend.
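Since the prompt asks for JSON, the response should be validated before anything downstream trusts it. The sketch below assumes a hypothetical response schema (a `classification` field with the values `new`, `evolution`, or `noise`, and an integer `urgency` for new narratives); the prompt itself does not fix these field names.

```python
import json
from typing import Optional

ALLOWED_CLASSIFICATIONS = {"new", "evolution", "noise"}  # assumed labels


def parse_narrative_response(raw: str) -> Optional[dict]:
    """Return the parsed response, or None if it is malformed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    if data.get("classification") not in ALLOWED_CLASSIFICATIONS:
        return None
    if data["classification"] == "new":
        urgency = data.get("urgency")
        # type() check rejects booleans, which are ints in Python.
        if type(urgency) is not int or not 1 <= urgency <= 5:
            return None
    return data
```

Rejected responses can be retried or logged for review, which keeps a single malformed model output from corrupting the narrative list.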

Daily digest generation

We send a daily email digest summarizing the most important signals, generated by GPT-4o from the structured data. The digest includes: new narratives with urgency ratings, sentiment shifts exceeding 2 standard deviations from the 30-day moving average, high-engagement mentions (posts with 10x+ average engagement for their platform), and competitor activity highlights. This replaced a $1,200/month Brandwatch subscription for one client and was universally preferred by the marketing team because the summaries were tailored to their specific competitive landscape rather than generic metrics.
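The sentiment-shift rule in the digest reduces to a z-score test. A minimal sketch, assuming the input is a list of daily mean sentiment values for the previous 30 days (the function name and input shape are illustrative):

```python
from statistics import mean, stdev


def sentiment_shift(history, today, threshold=2.0):
    """Return today's z-score if it exceeds the threshold, else None.

    history: daily mean sentiment for the trailing window (e.g. 30 days).
    """
    if len(history) < 2:
        return None  # not enough data for a standard deviation
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        return None  # flat history: a z-score is undefined
    z = (today - baseline) / spread
    return z if abs(z) > threshold else None
```

The sign of the returned value tells the digest whether sentiment moved up or down, so the same check covers both sudden praise and sudden complaints.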

Storage and Querying

We use PostgreSQL for structured data (posts, authors, metrics, processed results) and Elasticsearch for full-text search and aggregation queries. The split is deliberate and serves different access patterns:

PostgreSQL handles: time-series queries (“sentiment trend for the last 30 days”), relational queries (“all posts by authors with >10k followers mentioning our brand”), and transactional writes from the processing pipeline. We use the TimescaleDB extension for time-series hypertables on the metrics data.
Elasticsearch handles: full-text search across millions of posts with complex boolean queries, faceted search (filter by platform + topic + sentiment + date range simultaneously), and real-time aggregations for the dashboard (post count by hour, sentiment distribution by platform, top entities this week).

Posts reach both stores through a transactional outbox pattern: the processing worker writes the post and a matching outbox row to PostgreSQL within a single transaction, and a separate process reads the outbox table and indexes into Elasticsearch asynchronously. We consider Elasticsearch the secondary store: if it falls behind or has indexing issues, we rebuild from PostgreSQL using a full reindex job that takes about 4 hours for 10M documents. This has saved us twice in 18 months of operation. Both times were Elasticsearch cluster issues (once a disk space problem, once a mapping conflict from a schema change) that would have caused data loss if Elasticsearch were the primary store.
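The outbox pattern is easier to see in code. The sketch below uses the standard library's `sqlite3` as a stand-in for PostgreSQL so that it runs anywhere; the table layout and function names are hypothetical, and `index_document` stands for whatever call sends a document to Elasticsearch.

```python
import json
import sqlite3


def init_schema(conn: sqlite3.Connection) -> None:
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS posts (
            id TEXT PRIMARY KEY,
            body TEXT NOT NULL
        );
        CREATE TABLE IF NOT EXISTS outbox (
            seq INTEGER PRIMARY KEY AUTOINCREMENT,
            post_id TEXT NOT NULL,
            payload TEXT NOT NULL,
            indexed INTEGER NOT NULL DEFAULT 0
        );
    """)


def save_post(conn: sqlite3.Connection, post_id: str, body: str) -> None:
    """Write the post and its outbox row in one transaction."""
    with conn:  # commits both inserts, or rolls both back on error
        conn.execute("INSERT INTO posts (id, body) VALUES (?, ?)",
                     (post_id, body))
        conn.execute("INSERT INTO outbox (post_id, payload) VALUES (?, ?)",
                     (post_id, json.dumps({"id": post_id, "body": body})))


def relay_once(conn: sqlite3.Connection, index_document,
               batch_size: int = 100) -> int:
    """Send pending outbox rows to the search index, oldest first."""
    rows = conn.execute(
        "SELECT seq, payload FROM outbox WHERE indexed = 0 "
        "ORDER BY seq LIMIT ?", (batch_size,)).fetchall()
    done = 0
    for seq, payload in rows:
        index_document(json.loads(payload))  # if this raises, row stays pending
        with conn:
            conn.execute("UPDATE outbox SET indexed = 1 WHERE seq = ?", (seq,))
        done += 1
    return done
```

Delivery is at-least-once: a crash between indexing and marking the row can cause a repeat, so the indexing call should be idempotent, for example by using the post ID as the document ID.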

Operational Lessons

After running this system for 18 months, here is what we know that we did not know at the start:

Rate limit management is a full-time concern. We track rate limit headers from every API call, implement exponential backoff with jitter (randomized delay to prevent thundering herd), and maintain per-platform circuit breakers. When Twitter rate-limited us during a product launch (200% volume spike), the circuit breaker paused collection for 15 minutes rather than burning through our monthly quota in hours. The circuit breaker evaluates a sliding 5-minute observation window: if more than 30% of requests in the window return 429, collection pauses and later resumes with reduced throughput.
Data quality degrades silently. Scrapers break, API schemas change, and sentiment models drift as language evolves. We run automated checks every 6 hours on collection volume by platform (alert if below 50% of the 7-day baseline) and sentiment distribution (alert if the mean shifts more than 0.3 standard deviations). We audit by hand on a slower cadence: entity extraction recall (a weekly spot-check of 50 random posts against manual labels) and language detection accuracy (a monthly audit of 100 posts flagged as non-English). The automated checks caught a LinkedIn scraper failure 4 hours after it occurred, compared to the 3 days it took to notice a similar failure before we had the checks in place.
The dashboard is 80% of the perceived value. We could have the most sophisticated NLP pipeline in the world, but if the dashboard is slow, confusing, or ugly, the client thinks the product is broken. We spent as much engineering time on the visualization layer (real-time WebSocket updates, sub-200ms query responses, mobile-responsive charts, intuitive filtering) as on the entire processing pipeline. The first version of the dashboard used pre-rendered charts updated every 15 minutes. When we switched to real-time WebSocket feeds with client-side rendering, the client’s NPS score for the tool jumped from 6 to 9.
Historical backfill is critical at launch. Nobody wants to stare at an empty dashboard for 30 days while data accumulates. We budget 2-3 days for historical backfill using platform search APIs and archive services. This means the system launches with 30-90 days of data already loaded, so trend lines, baseline metrics, and narrative history are available from day one.
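Returning to the rate limit lesson, the circuit breaker and jittered backoff can be sketched as follows. The 5-minute window, 30% threshold, and 15-minute pause come from the description above; the class name, the minimum sample count, and the "full jitter" variant of backoff are assumptions for illustration.

```python
import random
import time
from collections import deque


class RateLimitBreaker:
    """Pause collection when too many recent requests return HTTP 429."""

    def __init__(self, window_s=300, threshold=0.30, pause_s=900,
                 min_samples=20, clock=time.monotonic):
        self.window_s = window_s
        self.threshold = threshold
        self.pause_s = pause_s
        self.min_samples = min_samples  # assumed guard against tiny samples
        self.clock = clock
        self.events = deque()           # (timestamp, was_rate_limited)
        self.paused_until = float("-inf")

    def record(self, status_code: int) -> None:
        now = self.clock()
        self.events.append((now, status_code == 429))
        # Drop events that fell out of the observation window.
        while self.events and self.events[0][0] < now - self.window_s:
            self.events.popleft()
        limited = sum(1 for _, hit in self.events if hit)
        if (len(self.events) >= self.min_samples
                and limited / len(self.events) > self.threshold):
            self.paused_until = now + self.pause_s
            self.events.clear()

    def allow(self) -> bool:
        return self.clock() >= self.paused_until


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential backoff with full jitter: uniform in [0, min(cap, base * 2^attempt)]."""
    return random.uniform(0, min(cap, base * 2 ** attempt))
```

A collector would call `allow()` before each request and `record()` after each response, sleeping for `backoff_delay(attempt)` between retries of a failed call.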

Building a social media monitoring system is 30% data engineering, 30% NLP, 20% infrastructure operations, and 20% product design. The technical challenges are real but solvable. The harder problem is defining what “signal” means for each specific business and tuning the system to surface it reliably without drowning users in noise.

Total infrastructure cost for our reference deployment: approximately $800/month on AWS (2x m5.large for processing workers, 1x t3.medium for Redis and PostgreSQL, 1x t3.medium for Elasticsearch, plus platform API costs). That is a fraction of what commercial tools charge, with 10x the customization potential and none of the “one-size-fits-all” limitations that make commercial tools mediocre for specialized use cases.
