Why Reddit Monitoring Matters More Than You Think
Every brand has conversations happening about it that it never sees. Product complaints in niche subreddits. Feature requests buried in comment threads. Competitor comparisons that shape purchase decisions. For companies that sell developer tools or other technical products, Reddit is often the platform where authentic, unfiltered opinions live — and unlike Twitter/X, those opinions come with multi-paragraph context and community discussion.
SparkAI started as an internal tool. We needed to track mentions of our clients’ brands across Reddit, identify sentiment shifts in real time, and route actionable items to the right teams. Existing social monitoring tools either ignore Reddit entirely (Brandwatch, Sprout Social) or treat it as an afterthought — scraping titles without understanding the nested comment structure that makes Reddit discussions uniquely valuable. None of them handle the nuance of Reddit-specific communication: sarcasm, nested context, community-specific jargon, and the signal difference between a top-level post with 500 upvotes and a buried reply with 3.
Building a real-time monitoring platform for Reddit turned out to be a masterclass in stream processing, rate limit engineering, and the nuances of applying NLP to conversational text that breaks every assumption traditional sentiment models make. Here’s how we architected SparkAI and what we learned along the way.
The Reddit Data Challenge
Reddit’s API is both a gift and a constraint. The gift: it’s well-documented, returns structured JSON, and provides access to posts, comments, and metadata. The constraint: rate limits. Reddit allows 100 requests per minute for OAuth-authenticated applications, which sounds generous until you’re monitoring 50 subreddits with active comment streams. Each subreddit poll uses one request. Each comment thread expansion uses one request. Enriching a single mention with full comment context can consume 3-5 requests. At 100 mentions per day across 50 subreddits, you’re already pushing limits.
The first architecture decision was choosing between polling and streaming. Reddit deprecated its real-time streaming endpoint in 2023, so polling is the only official option. Third-party services like Pushshift have become unreliable and legally uncertain. Polling it is — but polling at the volume we needed required careful orchestration to stay within rate limits while meeting our latency targets.
We built a priority-based polling scheduler. High-activity subreddits (like r/programming with thousands of daily posts) get polled every 30 seconds. Low-activity subreddits (like r/sysadmin, which is highly relevant for certain clients but slower) get polled every 5 minutes. The scheduler dynamically adjusts intervals based on observed post frequency, and it tracks rate limit headers from the Reddit API to avoid hitting walls:
interface SubredditPollConfig {
name: string;
intervalMs: number;
lastPostRate: number; // posts per hour, trailing 24h
priority: 'critical' | 'high' | 'medium' | 'low';
lastPollAt: Date;
consecutiveErrors: number;
}
const calculateInterval = (config: SubredditPollConfig): number => {
const baseIntervals = {
critical: 15_000, // 15s
high: 30_000, // 30s
medium: 120_000, // 2min
low: 300_000, // 5min
};
let interval = baseIntervals[config.priority];
// Scale based on activity
if (config.lastPostRate > 100) interval = Math.max(interval * 0.5, 15_000);
if (config.lastPostRate < 5) interval = interval * 2;
// Back off on errors
if (config.consecutiveErrors > 0) {
interval = Math.min(interval * Math.pow(2, config.consecutiveErrors), 600_000);
}
return interval;
};
This scheduling approach keeps us well within rate limits while ensuring we catch mentions within minutes. For context, our target SLA is “mention detected within 5 minutes of posting” for critical-priority subreddits, and within 15 minutes for everything else. In practice, we typically detect mentions within 2 minutes for critical subreddits.
We also implemented a rate limit budget system. At any given moment, we know how many requests we’ve made in the current minute. High-priority tasks (expanding a comment thread for a detected mention) get allocated from the budget first. Low-priority tasks (polling quiet subreddits) yield when the budget is tight. This prevents a burst of mentions in one subreddit from starving monitoring of other subreddits.
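To make the budget idea concrete, here is a minimal sketch of a per-minute budget with priority tiers. The class name `RateLimitBudget`, the `tryConsume` method, and the 80% ceiling for low-priority work are illustrative assumptions, not SparkAI's actual implementation:

```typescript
type Priority = "high" | "low";

// Sketch: a rolling one-minute request budget. High-priority work (context
// expansion for detected mentions) may spend the full budget; low-priority
// work (polling quiet subreddits) yields once 80% is spent.
class RateLimitBudget {
  private used = 0;
  private windowStart = Date.now();

  constructor(private readonly limitPerMinute: number) {}

  private rollWindow(): void {
    // Reset the counter when a new one-minute window begins.
    if (Date.now() - this.windowStart >= 60_000) {
      this.used = 0;
      this.windowStart = Date.now();
    }
  }

  tryConsume(priority: Priority, requests: number): boolean {
    this.rollWindow();
    const ceiling =
      priority === "high"
        ? this.limitPerMinute
        : Math.floor(this.limitPerMinute * 0.8); // reserve headroom for mentions
    if (this.used + requests > ceiling) return false;
    this.used += requests;
    return true;
  }
}
```

A poll worker would call `tryConsume("low", 1)` before each quiet-subreddit poll and simply skip the cycle when it returns false, which is what keeps a mention burst in one subreddit from starving the rest.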
Comment Tree Processing
Posts are easy. Comments are hard. A Reddit post might have 500 comments in a nested tree structure. A brand mention in a deeply nested reply carries different weight than a mention in a top-level comment. More importantly, the parent context matters for sentiment analysis — a comment saying “I switched from [Competitor] to [Client] and it’s been great” has very different implications than “I switched from [Client] to [Competitor] and never looked back.”
We process comments as trees, not flat lists. When SparkAI detects a mention in a comment, it captures the full ancestor chain up to the root comment, plus sibling comments for peer context. This context window is what makes our sentiment analysis meaningfully different from tools that analyze comments in isolation.
interface CommentContext {
comment: RedditComment;
ancestors: RedditComment[]; // Root to parent, ordered
siblings: RedditComment[]; // Same-level replies to parent
postTitle: string;
postBody: string;
subreddit: string;
postScore: number;
commentScore: number;
commentDepth: number;
}
const buildCommentContext = async (
comment: RedditComment,
postData: RedditPost
): Promise<CommentContext> => {
const ancestors: RedditComment[] = [];
let current = comment;
// Walk up the tree to the root comment
while (current.parent_id && current.parent_id !== postData.name) {
const parent = await fetchComment(current.parent_id);
ancestors.unshift(parent);
current = parent;
}
// Fetch sibling comments for peer context
const siblings = await fetchCommentReplies(comment.parent_id);
return {
comment,
ancestors,
siblings: siblings.filter(s => s.id !== comment.id).slice(0, 5),
postTitle: postData.title,
postBody: postData.selftext,
subreddit: postData.subreddit,
postScore: postData.score,
commentScore: comment.score,
commentDepth: ancestors.length,
};
};
Building this context consumes API requests, which is why the rate limit budget system matters. Expanding a full comment context for a mention costs 2-4 requests (fetching ancestors and siblings). For a busy monitoring day with 50 mentions, that’s 100-200 requests just for context expansion — a significant chunk of our per-minute budget.
The Mention Detection Pipeline
“Mention detection” sounds simple. Search for the brand name. Done. In practice, it’s riddled with edge cases that make it one of the most iteratively improved components in SparkAI.
Brand names collide with common words. One of our clients has a product called “Flow” — try searching for that without drowning in false positives from every subreddit discussing workflow, cash flow, state management flow, or yoga flow. Abbreviations, misspellings, and Reddit-specific slang add further complexity. Users write “VibeGuard” as “vibe guard”, “VG”, or even “that harbor security thing.”
We built a multi-layer detection system:
- Exact match: Case-insensitive search for the brand name and configured aliases. Fast, high precision, catches 60-70% of mentions.
- Fuzzy match: Levenshtein distance with a threshold calibrated per brand name length. Short names (4 characters or fewer) get distance 1 only to avoid false positives. Longer names get distance 2. Catches typos and common misspellings.
- Semantic match: For ambiguous brand names, we run a lightweight classifier that considers the surrounding text. If someone says “the flow was terrible” in r/yoga, that’s not about our client’s product. If they say “the flow integration broke my CI” in r/devops, it probably is. This layer uses a fine-tuned classifier trained on 2,000 labeled examples.
- Context match: Even when the brand name doesn’t appear directly, related terms might. If a competitor is mentioned alongside product features our client offers, that’s a competitive intelligence signal worth capturing. This layer is opt-in and generates more noise but surfaces conversations the other layers miss entirely.
interface MentionMatch {
type: 'exact' | 'fuzzy' | 'semantic';
confidence: number;
matchedTerm: string;
}
const detectMention = async (
text: string,
brand: BrandConfig
): Promise<MentionMatch | null> => {
// Layer 1: Exact match
for (const alias of [brand.name, ...brand.aliases]) {
const regex = new RegExp(`\\b${escapeRegex(alias)}\\b`, 'gi');
if (regex.test(text)) {
return { type: 'exact', confidence: 0.95, matchedTerm: alias };
}
}
// Layer 2: Fuzzy match
const words = text.split(/\s+/);
for (const word of words) {
for (const alias of [brand.name, ...brand.aliases]) {
const distance = levenshtein(word.toLowerCase(), alias.toLowerCase());
const threshold = alias.length <= 4 ? 1 : 2;
if (distance <= threshold && distance > 0) {
return { type: 'fuzzy', confidence: 0.70, matchedTerm: word };
}
}
}
// Layer 3: Semantic match (only for ambiguous brands)
if (brand.isAmbiguous) {
const isRelevant = await classifyRelevance(text, brand);
if (isRelevant.score > 0.8) {
return { type: 'semantic', confidence: isRelevant.score, matchedTerm: brand.name };
}
}
return null;
};
Sentiment Analysis: Beyond Positive/Negative
Basic sentiment analysis is commoditized. Every NLP library can tell you a sentence is positive or negative. For brand monitoring, that granularity is barely useful. Telling a product manager “62% of mentions this week were negative” provides no actionable information. They need to know what was negative, why it was negative, and how much reach the negative sentiment had.
We developed an aspect-based sentiment analysis pipeline that extracts multiple dimensions from each mention:
- Aspect identification: What part of the product is being discussed? Pricing, performance, UX, documentation, support, reliability, features?
- Comparative context: Is the product being compared to a competitor? Is the comparison favorable or unfavorable?
- Intent classification: Is the user venting frustration, seeking help, recommending the product, warning others away, or just discussing it neutrally?
- Reach estimation: How many people are likely seeing this mention? Based on upvote score, comment depth, subreddit size, and the post’s age-to-score ratio.
interface SentimentAnalysis {
overall: 'positive' | 'negative' | 'neutral' | 'mixed';
confidence: number;
aspects: {
aspect: string;
sentiment: 'positive' | 'negative' | 'neutral';
evidence: string; // The specific text supporting this
}[];
intent: 'complaint' | 'praise' | 'question' | 'comparison'
| 'recommendation' | 'warning' | 'discussion';
comparisons: {
competitor: string;
direction: 'favorable' | 'unfavorable' | 'neutral';
context: string;
}[];
reach: {
upvotes: number;
commentDepth: number;
subredditSubscribers: number;
estimatedVisibility: 'high' | 'medium' | 'low';
velocityScore: number; // Upvote rate in first hour — predicts viral potential
};
}
The output isn’t a single score — it’s a structured analysis that routes to different teams. A complaint about billing goes to customer success. A performance issue goes to engineering. A feature request goes to product. A competitive comparison goes to marketing. This routing saves teams from sifting through irrelevant mentions and ensures actionable items reach the people who can act on them.
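The routing step can be sketched as a simple lookup from aspect and intent to an owning team. The team names, the `aspectRouting` table, and the default owner are illustrative examples, not SparkAI's actual configuration:

```typescript
type Team = "customer-success" | "engineering" | "product" | "marketing";

// Illustrative mapping from extracted aspect to owning team.
const aspectRouting: Record<string, Team> = {
  billing: "customer-success",
  support: "customer-success",
  performance: "engineering",
  reliability: "engineering",
  documentation: "product",
  features: "product",
  ux: "product",
};

function routeMention(intent: string, aspects: string[]): Team[] {
  // Competitive comparisons go to marketing regardless of aspect.
  if (intent === "comparison") return ["marketing"];
  const teams = new Set<Team>();
  for (const aspect of aspects) {
    teams.add(aspectRouting[aspect] ?? "product"); // product owns unmapped aspects
  }
  return [...teams];
}
```

A mention that complains about both billing and performance fans out to two teams, which matches how the structured analysis is meant to be consumed.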
Sentiment analysis on Reddit text is particularly tricky because of sarcasm. “Oh great, another update that breaks everything. Exactly what I needed on a Friday” is extremely negative despite containing the words “great” and “exactly what I needed.” Off-the-shelf sentiment models get this wrong consistently. We handle it by including the comment tree context in the analysis prompt — the surrounding conversation usually disambiguates sarcasm — and by using a sarcasm-aware few-shot prompt that includes Reddit-specific examples.
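Folding the comment tree into the analysis prompt might look like the sketch below. The prompt wording, the `PromptContext` shape, and the 500-character truncation are assumptions for illustration, not SparkAI's actual prompt:

```typescript
interface PromptContext {
  subreddit: string;
  postTitle: string;
  ancestors: string[]; // ancestor comment bodies, root comment first
  comment: string; // the comment being analyzed
}

// Sketch: serialize the ancestor chain so the model sees the reply
// structure that disambiguates sarcasm.
function buildSentimentPrompt(ctx: PromptContext): string {
  const thread = ctx.ancestors
    .map((body, depth) => `${"  ".repeat(depth)}> ${body.slice(0, 500)}`)
    .join("\n");
  return [
    `Subreddit: r/${ctx.subreddit}`,
    `Post title: ${ctx.postTitle}`,
    "Thread so far (root comment first):",
    thread,
    `Comment to analyze: ${ctx.comment}`,
    "Reddit comments are frequently sarcastic. Use the thread above to",
    "judge whether positive-sounding words are meant literally.",
  ].join("\n");
}
```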
The Alerting System
Real-time monitoring is only useful if the right people get notified at the right time. We built a tiered alerting system with different urgency levels:
- Tier 1 (Immediate — Slack notification within 2 minutes): Negative mentions with high visibility (100+ upvotes, or top-level in subreddits with 500K+ subscribers). Also triggered for any mention classified as a “security incident” or “data breach” discussion involving the client.
- Tier 2 (Hourly digest): All mentions grouped by aspect and sentiment. This is the default view for marketing and product teams — a curated summary that takes 5 minutes to review instead of requiring constant monitoring.
- Tier 3 (Daily summary): Trend analysis showing sentiment shifts, emerging topics, competitive comparisons, and reach metrics over time. This serves the executive level — are we trending positively or negatively this week compared to last week?
The Tier 1 alerts are the most valuable. We’ve seen clients respond to a viral negative post within 30 minutes of it being published — long before it hits mainstream tech media or the company’s support queue. That early response window has measurably impacted outcomes. A genuine, helpful reply from an official account when a post has 50 upvotes changes the entire trajectory of the conversation compared to arriving when it has 5,000 upvotes and an entrenched narrative.
We also implemented alert fatigue protection. If the same subreddit generates more than 5 Tier 1 alerts in an hour (which happens during product launches or incidents), we consolidate them into a single summary alert. This prevents the Slack channel from becoming unusable during the moments when it matters most.
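The fatigue protection amounts to a rolling per-subreddit counter. A minimal sketch, assuming an `AlertThrottle` class and injectable clock that are illustrative rather than SparkAI's actual code:

```typescript
// Sketch: after maxPerHour Tier 1 alerts for a subreddit within a rolling
// hour, further alerts are held for a single consolidated summary.
class AlertThrottle {
  private sent = new Map<string, number[]>(); // subreddit -> alert timestamps

  constructor(
    private readonly maxPerHour = 5,
    private readonly now: () => number = Date.now // injectable for testing
  ) {}

  shouldSendIndividually(subreddit: string): boolean {
    const cutoff = this.now() - 3_600_000; // rolling one-hour window
    const recent = (this.sent.get(subreddit) ?? []).filter(t => t > cutoff);
    if (recent.length >= this.maxPerHour) {
      this.sent.set(subreddit, recent);
      return false; // hold for the consolidated summary alert
    }
    recent.push(this.now());
    this.sent.set(subreddit, recent);
    return true;
  }
}
```

Throttling is keyed per subreddit, so a launch-day flood in one community never suppresses an unrelated alert from another.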
Data Pipeline and Storage
SparkAI processes roughly 50,000 comments per day across all monitored subreddits for all clients combined. Not all of these contain mentions — the vast majority are background noise that gets filtered out. But every comment passes through the detection pipeline, and the detection pipeline needs to be fast enough that processing 50,000 items doesn’t create a backlog.
The data pipeline is built on BullMQ for job orchestration and PostgreSQL for storage. We considered Kafka but it was overkill for our throughput. BullMQ gives us reliable job queuing, retry logic, rate limiting, and priority handling without the operational overhead of a dedicated streaming platform. The job flow is straightforward:
// Job flow
// 1. Poll scheduler enqueues subreddit poll jobs at configured intervals
// 2. Poll workers fetch new posts/comments from Reddit
// 3. New content goes to mention detection queue (high throughput, fast)
// 4. Detected mentions go to context enrichment queue (moderate throughput)
// 5. Enriched mentions go to sentiment analysis queue (lower throughput, LLM calls)
// 6. Analyzed mentions go to alert/storage queue
const mentionQueue = new Queue('mention-detection', {
defaultJobOptions: {
attempts: 3,
backoff: { type: 'exponential', delay: 2000 },
removeOnComplete: { age: 86400 }, // Keep completed jobs for 24h
removeOnFail: { age: 604800 }, // Keep failed jobs for 7d
}
});
const sentimentQueue = new Queue('sentiment-analysis', {
limiter: {
max: 10, // Max 10 concurrent sentiment analysis jobs
duration: 1000, // Per second
},
});
The mention detection step runs in under 5ms per comment (regex + Levenshtein). At 50,000 comments per day, that’s about 4 minutes of total CPU time — negligible. The bottleneck is sentiment analysis, which requires an LLM call per mention. We process about 50-100 mentions per day, each taking 2-5 seconds for the AI analysis. Total daily processing time for sentiment is about 5-8 minutes.
For storage, we keep raw Reddit data for 90 days (compliance reasons — some clients operate in regulated industries) and processed mention data indefinitely. The mention data is stored in a schema designed for time-series queries — we need to answer questions like “how did sentiment about Feature X change after the v2.0 release?” efficiently. PostgreSQL’s native partitioning by month keeps queries fast as the dataset grows.
Handling Reddit’s Idiosyncrasies
Reddit has quirks that don’t exist on other platforms, and they all affect monitoring quality:
- Deleted and removed content: Reddit serves [deleted] and [removed] placeholders for content that was there when we first polled but has since been taken down. We store the original content at first detection and flag when it gets removed — sometimes the removal itself is signal (a highly upvoted complaint that gets removed by moderators suggests the subreddit may have a relationship with the brand).
- Score volatility: A comment’s upvote score fluctuates wildly in the first hour due to Reddit’s vote fuzzing algorithm. We re-poll high-priority mentions at 1h, 6h, and 24h intervals to get stable scores for visibility estimation. The 1h score is used for alerting; the 24h score is used for trend reports.
- Subreddit-specific norms: r/technology has very different conversational norms than r/sysadmin, which has different norms than r/gaming. Sarcasm rates vary significantly. Technical jargon varies. We tune sentiment analysis confidence thresholds per subreddit based on historical accuracy data. Subreddits where we consistently misjudge sentiment get wider confidence intervals.
- Bot comments: AutoModerator and other bots generate noise. We maintain a blocklist of known bot accounts and filter their content before it hits the detection pipeline. We also detect bot-like patterns (identical comments posted across multiple subreddits, comments that exactly match a template).
- Crossposting: The same content can appear in multiple subreddits via crossposting. We deduplicate crossposts by tracking the original post ID to avoid alerting on the same mention multiple times.
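The crosspost deduplication reduces to tracking original-post fullnames. Reddit's JSON exposes the original's fullname on crossposts via `crosspost_parent`; the `Set`-based approach below is a sketch, not SparkAI's actual implementation:

```typescript
interface IncomingPost {
  name: string; // Reddit fullname, e.g. "t3_abc123"
  crosspost_parent?: string; // fullname of the original post, set on crossposts
}

const seenOriginals = new Set<string>();

// A crosspost carries the original's fullname; a regular post is its own
// original. Tracking the original ID suppresses repeat alerts.
function isDuplicateCrosspost(post: IncomingPost): boolean {
  const originalId = post.crosspost_parent ?? post.name;
  if (seenOriginals.has(originalId)) return true;
  seenOriginals.add(originalId);
  return false;
}
```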
Lessons for Building Real-Time Monitoring Systems
Building SparkAI crystallized several principles that apply to any real-time monitoring system, regardless of the data source:
Rate limits are a feature, not a bug. Reddit’s rate limits forced us to build a smarter polling system than we would have otherwise. The priority-based scheduler outperforms a naive “poll everything as fast as possible” approach because it focuses resources where activity actually happens. If Reddit had unlimited API access, we’d probably have built something dumber and more expensive to operate.
Context is everything for sentiment. Analyzing text in isolation produces unreliable results. The comment tree, the subreddit culture, the post title, the user’s history — all of it feeds into accurate interpretation. Every hour we spent building the context enrichment pipeline paid for itself in reduced false positive alerts and more accurate routing.
Latency targets drive architecture. Our 5-minute detection SLA shaped every component choice. If we’d accepted 1-hour latency, the entire system could have been a simple cron job with a SQL database. The real-time requirement demanded proper job queues, parallel processing, and careful rate limit management. Choose your latency target before your tech stack.
The hardest problem is knowing what to ignore. 50,000 comments per day, maybe 50 actionable mentions. The 99.9% that aren’t relevant still consume processing time. Every optimization we made to the filtering pipeline — better pre-filters, smarter subreddit prioritization, more aggressive bot detection — directly improved the signal-to-noise ratio that determines whether clients actually trust and use the alerts.
SparkAI is running in production for several Harbor clients today, and the pattern has proven replicable. We’ve had conversations about extending it to Hacker News, Stack Overflow, and Discord. The mention detection and sentiment pipelines are platform-agnostic — only the ingestion layer needs to change. That modularity wasn’t accidental. It was the result of learning, the hard way during early prototyping, that platforms change their APIs and policies faster than you can rewrite your application logic. Keeping the platform-specific code in a thin ingestion layer is the only architecture that survives long-term.