The Engineering Behind RSS Feed Intelligence

Author: Sarah Chen

Published on: August 2, 2024

RSS is the cockroach of web technologies. It was declared dead a decade ago, and yet it quietly powers an enormous amount of the internet’s information infrastructure. Podcast directories, news aggregators, financial data feeds, government publication systems, academic paper repositories — they all run on RSS or its close cousin Atom. At Harbor Software, we built NimbusFeed to make sense of this firehose. What started as a side project to aggregate tech news became a production system processing over 2 million feed items per day across 47,000 active feeds.


This article is a deep technical walkthrough of the engineering behind NimbusFeed’s RSS processing pipeline — the architecture decisions, the gnarly edge cases, the performance optimizations that actually mattered, and the ones that turned out to be premature.

The Deceptive Simplicity of RSS

On the surface, RSS is trivially simple. It’s an XML document with a channel containing items, each with a title, link, description, and publication date. You fetch it, parse it, store the new items. What could go wrong?

Everything. Everything can go wrong.

In the real world, RSS feeds are a case study in Postel’s Law pushed to its absolute limit. We’ve encountered feeds that serve RSS 0.91, RSS 1.0, RSS 2.0, Atom 1.0, and occasionally something that isn’t quite any of them. We’ve seen feeds with invalid XML (unclosed tags, unescaped ampersands, UTF-8 declared but actually Windows-1252). We’ve seen feeds that return HTML error pages with a 200 status code. We’ve seen feeds where the publication date is in formats ranging from RFC 822 to ISO 8601 to “March 29th, 2026” to a Unix timestamp to just the word “today.”

The first lesson of building an RSS processing system: your parser needs to be absurdly tolerant.
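One cheap form of tolerance can be sketched as a pre-parse sniff that catches HTML error pages served with a 200 status. This is an illustration under assumptions, not NimbusFeed's actual code: the name `sniffBodyKind` and the list of root tags are hypothetical.

```typescript
type BodyKind = 'rss' | 'atom' | 'rdf' | 'html' | 'unknown';

// Hypothetical helper: guess what a response body is from its first element tag.
function sniffBodyKind(body: string): BodyKind {
  // Only inspect the start of the document, and drop comments so that
  // commented-out markup is not mistaken for the root element.
  const head = body.slice(0, 4096).replace(/<!--[\s\S]*?-->/g, '');

  // The first "<name" sequence is the root element; "<?xml" and "<!DOCTYPE"
  // do not match because their second character is not a letter.
  const match = /<([A-Za-z][\w:.-]*)/.exec(head);
  if (!match) return 'unknown';

  const root = match[1].toLowerCase();
  if (root === 'rss') return 'rss';
  if (root === 'feed') return 'atom';
  if (root === 'rdf:rdf') return 'rdf'; // RSS 1.0 is RDF-based
  if (root === 'html' || root === 'head' || root === 'body') return 'html';
  return 'unknown';
}
```

A body classified as `html` can be recorded as a fetch error instead of being handed to the parser.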

Pipeline Architecture

NimbusFeed’s processing pipeline has five stages, each running as an independent service connected by message queues. This wasn’t the original architecture — we started with a monolithic cron job and decomposed it as we hit scaling bottlenecks. Here’s where we ended up:

┌─────────────┐    ┌──────────────┐    ┌──────────────┐    ┌───────────┐    ┌──────────┐
│  Scheduler   │───▶│   Fetcher    │───▶│   Parser     │───▶│ Enricher  │───▶│  Writer  │
│  (cron)      │    │  (workers)   │    │  (workers)   │    │ (workers) │    │ (batch)  │
└─────────────┘    └──────────────┘    └──────────────┘    └───────────┘    └──────────┘
       │                  │                   │                  │               │
       ▼                  ▼                   ▼                  ▼               ▼
   Feed Registry     Fetch Queue        Parse Queue        Enrich Queue    PostgreSQL
   (PostgreSQL)       (Redis)            (Redis)            (Redis)        + Typesense

Stage 1: The Scheduler

The scheduler determines which feeds need to be fetched and when. This sounds simple until you realize that 47,000 feeds have wildly different update frequencies. A major news outlet might publish 50 items per hour. A personal blog might publish once a month. Polling both at the same interval is either wasteful (checking the blog every 5 minutes) or insufficient (checking the news site every hour).

We use an adaptive polling algorithm that adjusts fetch frequency based on historical update patterns:

interface FeedSchedule {
  feedId: string;
  lastFetched: Date;
  avgItemsPerDay: number;
  lastItemPublished: Date;
  consecutiveEmptyFetches: number;
  fetchIntervalMinutes: number;
}

function calculateNextFetch(schedule: FeedSchedule): number {
  const daysSinceLastItem = daysBetween(schedule.lastItemPublished, new Date());
  
  // Active feeds: fetch frequently
  if (schedule.avgItemsPerDay > 10) {
    return 5; // every 5 minutes
  }
  
  if (schedule.avgItemsPerDay > 2) {
    return 15; // every 15 minutes
  }
  
  if (schedule.avgItemsPerDay > 0.5) {
    return 60; // hourly
  }
  
  // Dormant feeds: back off exponentially, cap at 24 hours
  if (schedule.consecutiveEmptyFetches > 10) {
    return Math.min(1440, schedule.fetchIntervalMinutes * 1.5);
  }
  
  // Potentially dead feeds: check daily
  if (daysSinceLastItem > 90) {
    return 1440;
  }
  
  return 120; // default: every 2 hours
}

This adaptive scheduling reduced our total daily fetch operations from 6.8 million to 1.2 million — an 82% reduction — with no measurable increase in item discovery latency for active feeds.
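To make the dormant-feed back-off concrete, the sketch below repeatedly applies the same rule used in calculateNextFetch (multiply by 1.5, cap at 1440 minutes). The helper name `dormantBackoffSeries` is hypothetical and exists only for illustration.

```typescript
// Repeatedly apply the dormant-feed rule: next = min(1440, current * 1.5).
function dormantBackoffSeries(startMinutes: number, steps: number): number[] {
  const series: number[] = [];
  let interval = startMinutes;
  for (let i = 0; i < steps; i++) {
    interval = Math.min(1440, interval * 1.5);
    series.push(interval);
  }
  return series;
}

// Starting from the 2-hour default interval:
console.log(dormantBackoffSeries(120, 7));
// [180, 270, 405, 607.5, 911.25, 1366.875, 1440]
```

Starting from the default interval, the 24-hour cap is reached on the seventh back-off step. The rule also produces fractional minutes, so a scheduler that needs whole minutes would have to round the result.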

Stage 2: The Fetcher

The fetcher is responsible for one thing: turning a feed URL into raw bytes. It handles HTTP correctly, which is harder than it sounds.

Conditional requests are the single most important optimization. Every fetch stores the ETag and Last-Modified headers from the response. On subsequent fetches, we send If-None-Match and If-Modified-Since. If the feed hasn’t changed, the server returns 304 Not Modified with no body, saving bandwidth and processing time.

async function fetchFeed(feed: Feed): Promise<FetchResult> {
  const headers: Record<string, string> = {
    'User-Agent': 'NimbusFeed/1.0 (+https://nimbusfeed.com/bot)',
    'Accept': 'application/rss+xml, application/atom+xml, application/xml, text/xml',
  };
  
  if (feed.etag) {
    headers['If-None-Match'] = feed.etag;
  }
  if (feed.lastModified) {
    headers['If-Modified-Since'] = feed.lastModified;
  }
  
  const response = await fetch(feed.url, {
    headers,
    redirect: 'follow',
    signal: AbortSignal.timeout(15000), // 15s timeout
  });
  
  if (response.status === 304) {
    return { status: 'not_modified', feedId: feed.id };
  }
  
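  if (response.redirected && response.url !== feed.url) {
    // With redirect: 'follow', fetch() returns only the final response, so a
    // 301 status is never observable here. response.redirected tells us that a
    // redirect occurred, but not whether it was permanent (301/308) or
    // temporary (302/307); telling them apart requires redirect: 'manual'.
    await updateFeedUrl(feed.id, response.url);
  }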
  
  if (!response.ok) {
    return {
      status: 'error',
      feedId: feed.id,
      httpStatus: response.status,
      error: `HTTP ${response.status}`,
    };
  }
  
  const body = await response.text();
  return {
    status: 'fetched',
    feedId: feed.id,
    body,
    etag: response.headers.get('etag'),
    lastModified: response.headers.get('last-modified'),
    contentType: response.headers.get('content-type'),
  };
}

In practice, about 40% of our fetches return 304. Those requests skip the body download and the parse entirely, all for the cost of two HTTP headers.

The User-Agent header matters. Some feed servers block requests without a proper User-Agent, or throttle unidentified clients. We include our bot name and a URL where server operators can learn about us and request exclusion. Being a good web citizen pays off — we’ve never been blocked by a major feed provider.

Timeout handling is critical. A feed server that accepts connections but never responds will tie up a worker indefinitely. Our 15-second timeout is aggressive by design — if a feed can’t respond in 15 seconds, it’s probably having issues, and we’ll try again at the next scheduled interval.
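When the timeout fires, a fetch using AbortSignal.timeout() rejects with an error whose name is "TimeoutError". A minimal sketch of separating timeouts from other failures follows; the helper name `classifyFetchError` and the three categories are assumptions, not part of the pipeline shown above.

```typescript
type FetchFailure = 'timeout' | 'aborted' | 'network';

// Hypothetical helper: map a rejected fetch() into a coarse failure category.
function classifyFetchError(err: unknown): FetchFailure {
  const name =
    typeof err === 'object' && err !== null && 'name' in err
      ? String((err as { name: unknown }).name)
      : '';
  if (name === 'TimeoutError') return 'timeout'; // AbortSignal.timeout() elapsed
  if (name === 'AbortError') return 'aborted';   // an AbortController was aborted
  return 'network';                              // DNS failure, connection reset, TLS error, etc.
}
```

Wrapping the fetch call in try/catch and passing the caught value to this helper lets the worker record timeouts separately, which is useful when deciding whether a feed is slow or actually unreachable.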

Stage 3: The Parser

This is where the real complexity lives. The parser takes raw XML (or XML-like substance) and produces normalized item objects.

We use a three-tier parsing strategy:

Strict parse: Try to parse as valid XML using a spec-compliant parser. Works for about 85% of feeds.
Lenient parse: If strict parsing fails, run the content through an HTML-aware tidying pass that fixes common XML errors (unclosed tags, unescaped entities, invalid characters), then parse again. Catches another 12%.
Regex fallback: For the remaining 3% of feeds that are so malformed they can’t be parsed as XML at all, we use regex extraction to pull out titles, links, and dates. This is ugly and we’re not proud of it, but it works.

function parseFeed(raw: string, contentType: string): ParsedFeed {
  // Errors are kept in the outer scope so the final ParseError can report all three.
  let strictError: unknown;
  let lenientError: unknown;

  // Tier 1: Strict XML parse
  try {
    const doc = parseXml(raw, { strict: true });
    return extractFeedData(doc);
  } catch (err) {
    strictError = err;
    metrics.increment('parse.strict_failure');
  }

  // Tier 2: Lenient parse with tidying
  try {
    const tidied = tidyXml(raw, {
      fixUnclosedTags: true,
      escapeAmpersands: true,
      stripInvalidChars: true,
      normalizeEncoding: true,
    });
    const doc = parseXml(tidied, { strict: false });
    return extractFeedData(doc);
  } catch (err) {
    lenientError = err;
    metrics.increment('parse.lenient_failure');
  }

  // Tier 3: Regex fallback
  try {
    return regexExtractFeed(raw);
  } catch (regexError) {
    metrics.increment('parse.total_failure');
    throw new ParseError('All parsing strategies failed', { strictError, lenientError, regexError });
  }
}

Date parsing deserves its own section. We wrote a date parser that handles 23 different date formats encountered in the wild. Here are the ones that caused the most headaches:

// Formats we've actually encountered in production feeds
const realWorldDateFormats = [
  'ddd, DD MMM YYYY HH:mm:ss ZZ',    // RFC 822 (standard RSS)
  'YYYY-MM-DDTHH:mm:ssZ',              // ISO 8601 (standard Atom)
  'YYYY-MM-DD HH:mm:ss',               // MySQL datetime
  'MM/DD/YYYY',                          // American date
  'DD/MM/YYYY',                          // European date (ambiguous!)
  'MMMM Do, YYYY',                       // "March 29th, 2026"
  'X',                                    // Unix timestamp (seconds)
  'x',                                    // Unix timestamp (milliseconds)
  'ddd, DD MMM YYYY HH:mm:ss',          // RFC 822 without timezone
  'YYYY-MM-DDTHH:mm:ss.SSSZ',           // ISO 8601 with milliseconds
];

The ambiguous date formats (is 03/04/2026 March 4th or April 3rd?) are resolved using feed-level heuristics. If any date in a feed’s history has a first number greater than 12, the feed must be using DD/MM; if the second number ever exceeds 12, it must be MM/DD. If we can’t determine the format, we flag the feed for manual review.
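The feed-level heuristic can be sketched as follows. This is a simplified illustration, assuming slash-separated dates with a four-digit year; the function name `inferDayMonthOrder` is hypothetical.

```typescript
type DayMonthOrder = 'DD/MM' | 'MM/DD' | 'ambiguous';

// Look across all of a feed's date strings for a value that can only be a day (> 12).
function inferDayMonthOrder(samples: string[]): DayMonthOrder {
  let firstOver12 = false;
  let secondOver12 = false;

  for (const sample of samples) {
    const m = /^\s*(\d{1,2})\/(\d{1,2})\/\d{4}/.exec(sample);
    if (!m) continue; // not a slash-separated date; ignore
    if (Number(m[1]) > 12) firstOver12 = true;
    if (Number(m[2]) > 12) secondOver12 = true;
  }

  if (firstOver12 && !secondOver12) return 'DD/MM';
  if (secondOver12 && !firstOver12) return 'MM/DD';
  return 'ambiguous'; // no evidence either way, or contradictory evidence
}
```

A result of `ambiguous` corresponds to the manual-review case: either every date seen so far has both numbers at or below 12, or the feed mixes formats.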

Stage 4: The Enricher

The enricher takes normalized feed items and adds intelligence. This is where NimbusFeed goes beyond simple aggregation:

Deduplication: The same article might appear in multiple feeds (syndication, cross-posting, feed aggregators). We use a combination of URL normalization, title similarity (using trigram comparison), and content fingerprinting to identify duplicates.
Categorization: We classify articles into topic categories using a lightweight text classification model. This runs locally — no external API calls — because we need it to be fast and we process millions of items per day. The model is a fine-tuned DistilBERT that we trained on 200,000 manually-labeled articles.
Entity extraction: We pull out named entities (companies, people, technologies) and store them as structured tags. This powers NimbusFeed’s “topic tracking” feature, where users can follow a technology or company across all feeds.
Language detection: We support 14 languages. Language detection uses a combination of character set analysis and n-gram frequency, which is more reliable than the xml:lang attribute that many feeds either omit or set incorrectly.

async function enrichItem(item: NormalizedItem): Promise<EnrichedItem> {
  const [category, entities, language, duplicateOf] = await Promise.all([
    classifier.categorize(item.title + ' ' + item.contentText),
    entityExtractor.extract(item.contentText),
    detectLanguage(item.title + ' ' + item.contentText),
    deduplicator.findDuplicate(item),
  ]);
  
  return {
    ...item,
    category: category.label,
    categoryConfidence: category.score,
    entities,
    language,
    isDuplicate: !!duplicateOf,
    duplicateOf: duplicateOf?.id,
  };
}
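The trigram comparison used for title similarity can be illustrated with a small in-memory version. This sketch is modeled loosely on how PostgreSQL's pg_trgm extension builds trigrams (lowercase, split into words, pad each word with two leading spaces and one trailing space); it is an approximation for explanation, not the function the database runs.

```typescript
// Build the set of distinct trigrams for a piece of text.
function trigrams(text: string): Set<string> {
  const out = new Set<string>();
  const words = text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
  for (const word of words) {
    const padded = `  ${word} `;
    for (let i = 0; i + 3 <= padded.length; i++) {
      out.add(padded.slice(i, i + 3));
    }
  }
  return out;
}

// Similarity = shared trigrams / total distinct trigrams across both strings.
function trigramSimilarity(a: string, b: string): number {
  const ta = trigrams(a);
  const tb = trigrams(b);
  if (ta.size === 0 && tb.size === 0) return 0;
  let shared = 0;
  ta.forEach((t) => {
    if (tb.has(t)) shared++;
  });
  return shared / (ta.size + tb.size - shared);
}
```

Because the comparison works on character triples instead of whole words, small differences in punctuation, casing, or word endings lower the score only slightly, while unrelated titles score near zero.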

Stage 5: The Writer

The writer stage batch-inserts enriched items into PostgreSQL and indexes them in Typesense for full-text search. We use PostgreSQL’s ON CONFLICT clause to handle items we’ve already seen:

INSERT INTO feed_items (
  id, feed_id, title, url, content_html, content_text,
  published_at, category, language, entities, content_hash
) VALUES 
  ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11)
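-- content_hash cannot be part of the conflict target: a row that conflicts on it
-- has the same hash by definition, so the WHERE below would never be true.
-- This version assumes a unique constraint on (feed_id, url); adjust to your schema.
ON CONFLICT (feed_id, url) DO UPDATE SET
  title = EXCLUDED.title,
  content_html = EXCLUDED.content_html,
  content_text = EXCLUDED.content_text,
  content_hash = EXCLUDED.content_hash,
  updated_at = NOW()
WHERE feed_items.content_hash IS DISTINCT FROM EXCLUDED.content_hash;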

The content_hash is a SHA-256 of the normalized content text. This means if a feed updates an article’s content (fixing a typo, adding a correction), we update our stored version. If nothing changed, the WHERE clause prevents an unnecessary write.

Batch sizes matter enormously. We experimented with batch sizes from 1 (insert as they come) to 10,000. The sweet spot for our workload turned out to be 500 items per batch — large enough to amortize the round-trip overhead, small enough to keep write latency predictable and not lock the table for too long.
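A batched write needs two small pieces: splitting the incoming items into fixed-size groups, and generating the multi-row placeholder list for the VALUES clause. The sketch below shows both; the helper names `chunk` and `buildPlaceholders` are hypothetical and not taken from NimbusFeed.

```typescript
// Split items into consecutive batches of at most `size` elements.
function chunk<T>(items: T[], size: number): T[][] {
  if (!Number.isInteger(size) || size <= 0) {
    throw new RangeError('size must be a positive integer');
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Build "($1, $2, ...), ($n+1, ...)" for a multi-row INSERT.
function buildPlaceholders(rowCount: number, columnCount: number): string {
  const rows: string[] = [];
  for (let r = 0; r < rowCount; r++) {
    const cols: string[] = [];
    for (let c = 1; c <= columnCount; c++) {
      cols.push(`$${r * columnCount + c}`);
    }
    rows.push(`(${cols.join(', ')})`);
  }
  return rows.join(', ');
}
```

With 11 columns per row, a 500-item batch binds 5,500 parameters in one statement, comfortably below the 65,535-parameter ceiling of PostgreSQL's wire protocol.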

The Deduplication Problem

Deduplication deserves deeper treatment because it’s the hardest problem in the entire pipeline. The naive approach — deduplicate on URL — misses about 30% of duplicates. The same article often has different URLs across syndication partners (different UTM parameters, different URL structures, AMP vs. canonical).

Our deduplication pipeline:

URL normalization: Strip query parameters (except meaningful ones like id or page), remove trailing slashes, lowercase the hostname, resolve shortlinks. This catches about 60% of duplicates.
Title similarity: Compute trigram similarity between the new item’s title and recent items. If similarity exceeds 0.85, flag as potential duplicate. This catches headline variations like “Breaking: Company X Acquires Y” vs. “Company X to Acquire Y in $3B Deal.”
Content fingerprinting: Generate a SimHash of the article’s content text. SimHash is locality-sensitive — similar documents produce similar hashes — so we can find near-duplicates efficiently using Hamming distance.

async function findDuplicate(item: NormalizedItem): Promise<FeedItem | null> {
  // Step 1: Check normalized URL
  const normalizedUrl = normalizeUrl(item.url);
  const urlMatch = await db.query(
    'SELECT * FROM feed_items WHERE normalized_url = $1 LIMIT 1',
    [normalizedUrl]
  );
  if (urlMatch.rows.length > 0) return urlMatch.rows[0];
  
  // Step 2: Check title similarity (last 7 days only)
  const titleMatch = await db.query(
    `SELECT *, similarity(title, $1) AS sim 
     FROM feed_items 
     WHERE published_at > NOW() - INTERVAL '7 days'
     AND similarity(title, $1) > 0.85
     ORDER BY sim DESC LIMIT 1`,
    [item.title]
  );
  if (titleMatch.rows.length > 0) return titleMatch.rows[0];
  
  // Step 3: Check content SimHash (Hamming distance <= 3)
  const simhash = computeSimHash(item.contentText);
  const hashMatch = await db.query(
    `SELECT * FROM feed_items 
     WHERE published_at > NOW() - INTERVAL '7 days'
     AND bit_count(content_simhash # $1::bit(64)) <= 3
     LIMIT 1`,
    [simhash]
  );
  if (hashMatch.rows.length > 0) return hashMatch.rows[0];
  
  return null;
}
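One possible shape for the normalizeUrl helper called above is sketched here using the standard URL class. It is a simplified illustration: the allow-list of kept parameters comes from the examples in the text (id, page), and shortlink resolution is left out because it requires a network request.

```typescript
// Query parameters treated as meaningful; everything else (utm_*, ref, ...) is dropped.
const KEEP_PARAMS = new Set(['id', 'page']);

function normalizeUrl(raw: string): string {
  const url = new URL(raw); // parsing already lowercases the scheme and hostname
  url.hash = '';

  // Collect the parameters worth keeping, then rebuild the query in sorted order
  // so that "?id=1&page=2" and "?page=2&id=1" normalize to the same string.
  const kept: [string, string][] = [];
  url.searchParams.forEach((value, key) => {
    if (KEEP_PARAMS.has(key)) kept.push([key, value]);
  });
  kept.sort((a, b) => (a[0] < b[0] ? -1 : a[0] > b[0] ? 1 : 0));
  url.search = '';
  for (const [key, value] of kept) url.searchParams.append(key, value);

  // Remove trailing slashes, but keep the root path as "/".
  url.pathname = url.pathname.replace(/\/+$/, '') || '/';
  return url.toString();
}

console.log(normalizeUrl('https://Example.COM/news/story/?utm_source=x&id=42#top'));
// https://example.com/news/story?id=42
```

An allow-list is the conservative choice here: unknown parameters are assumed to be tracking noise, at the cost of occasionally merging URLs that differ only in a parameter that actually mattered.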

This multi-layer approach catches 94% of duplicates with a false positive rate under 0.2%. The remaining 6% are edge cases — heavily rewritten articles that share the same core story but are genuinely different enough to be considered separate content.
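For readers unfamiliar with SimHash, here is a minimal sketch of the idea. It makes several simplifying assumptions that are not from the pipeline described above: tokens are single words with equal weight, the per-token hash is a seeded FNV-1a-style function applied twice to get 64 bits, and the result is returned as a 64-character string of 0s and 1s (the text form PostgreSQL accepts for bit(64)).

```typescript
// 32-bit FNV-1a-style hash over UTF-16 code units, with a seed.
function fnv1a(token: string, seed: number): number {
  let h = (0x811c9dc5 ^ seed) >>> 0;
  for (let i = 0; i < token.length; i++) {
    h ^= token.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h;
}

// Each token votes +1 or -1 on every bit position; the sign of the total decides the bit.
function computeSimHash(text: string): string {
  const counts = new Array<number>(64).fill(0);
  const tokens = text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
  for (const token of tokens) {
    const hi = fnv1a(token, 0);
    const lo = fnv1a(token, 0x9e3779b9);
    for (let bit = 0; bit < 32; bit++) {
      counts[bit] += (hi >>> (31 - bit)) & 1 ? 1 : -1;
      counts[32 + bit] += (lo >>> (31 - bit)) & 1 ? 1 : -1;
    }
  }
  return counts.map((c) => (c > 0 ? '1' : '0')).join('');
}

// Number of positions where two equal-length bit strings differ.
function hammingDistance(a: string, b: string): number {
  let distance = 0;
  for (let i = 0; i < a.length; i++) {
    if (a[i] !== b[i]) distance++;
  }
  return distance;
}
```

Because each bit is decided by a majority vote over all tokens, changing a few words flips only a few bits, which is why a small Hamming distance is a reasonable near-duplicate signal. Production implementations typically use weighted features or word shingles instead of bare tokens.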

Performance: What Actually Mattered

We went through three major optimization rounds. Here’s what each one taught us:

Round 1: Connection pooling. Our first scaling bottleneck wasn’t CPU or memory — it was database connections. Each worker opened its own connection, and with 50 workers running, we were hitting PostgreSQL’s connection limit. Switching to PgBouncer in transaction pooling mode let us run 200 workers on 20 actual database connections. This single change increased throughput by 4x.

Round 2: Fetch parallelism with backpressure. Fetching feeds is I/O-bound, so we wanted maximum parallelism. But unbounded parallelism creates problems — you can accidentally DDoS a feed server by sending 100 concurrent requests to the same host. We implemented per-host concurrency limits (max 2 concurrent requests to any single host) with a global concurrency limit of 200 total active fetches.
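The per-host and global limits can be sketched as a small bookkeeping class. This is an illustration under assumptions: the class name `HostLimiter` and its try/release interface are hypothetical, and the defaults simply mirror the numbers quoted above (2 per host, 200 overall).

```typescript
class HostLimiter {
  private active = new Map<string, number>();
  private total = 0;

  constructor(
    private readonly perHost = 2,
    private readonly globalMax = 200,
  ) {}

  // Reserve a slot for this host; returns false if either limit is already reached.
  tryAcquire(host: string): boolean {
    const current = this.active.get(host) ?? 0;
    if (this.total >= this.globalMax || current >= this.perHost) return false;
    this.active.set(host, current + 1);
    this.total++;
    return true;
  }

  // Give the slot back once the fetch has finished (successfully or not).
  release(host: string): void {
    const current = this.active.get(host) ?? 0;
    if (current === 0) return; // nothing to release
    if (current === 1) this.active.delete(host);
    else this.active.set(host, current - 1);
    this.total--;
  }
}
```

A worker that gets `false` back would put the job back on the queue with a short delay instead of fetching, which is where the backpressure comes from. The release call belongs in a `finally` block so that failed fetches do not leak slots.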

Round 3: Selective indexing. We were indexing every item in Typesense, but discovered that only 15% of items were ever searched for. Items older than 30 days were almost never accessed via search. We moved to a two-tier system: all items go into PostgreSQL (our system of record), but only items from the last 30 days get indexed in Typesense. Older items are searchable via PostgreSQL’s built-in full-text search, which is slower but perfectly adequate for infrequent queries.
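The routing decision behind the two-tier setup is small enough to show directly. The sketch below is an assumption about how such a rule might look; the names `searchTierFor` and `SearchTier` are illustrative only.

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;

type SearchTier = 'typesense' | 'postgres-fts';

// Items inside the window are served by the search index; older ones fall back
// to PostgreSQL full-text search.
function searchTierFor(publishedAt: Date, now: Date, windowDays = 30): SearchTier {
  const ageMs = now.getTime() - publishedAt.getTime();
  return ageMs <= windowDays * DAY_MS ? 'typesense' : 'postgres-fts';
}
```

The same predicate can drive both sides of the system: the writer uses it to decide whether to index a new item, and a periodic cleanup job uses it to find documents that have aged out of the index.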

Monitoring and Observability

A system processing 2 million items per day has a lot of ways to fail silently. Our monitoring covers four dimensions:

Feed health: For each feed, we track success rate, average response time, consecutive failures, and content freshness. A feed that returns 200 OK but hasn’t had new content in 3x its historical average is flagged as “stale” — often indicating a server-side issue where the feed is being served from a stale cache.
Pipeline throughput: Items per second at each stage, queue depths, worker utilization. If the parse queue grows while the fetch queue shrinks, we know the parser workers are falling behind.
Data quality: Percentage of items with parseable dates, percentage with content (vs. title-only), percentage classified with high confidence. Sudden drops in data quality usually indicate a format change in a high-volume feed.
Resource utilization: Standard infrastructure metrics (CPU, memory, disk, network), plus application-specific metrics like active HTTP connections, database connection pool usage, and Redis memory consumption.
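The "stale" rule from the feed health item can be sketched as a single predicate. This reads "3x its historical average" as three times the average gap between items, derived from items per day; that interpretation, and the name `isStale`, are assumptions.

```typescript
// A feed is stale when the time since its last item exceeds `factor` times
// its average gap between items.
function isStale(
  avgItemsPerDay: number,
  lastItemPublished: Date,
  now: Date,
  factor = 3,
): boolean {
  if (avgItemsPerDay <= 0) return false; // no publishing history to compare against
  const avgGapMs = (24 * 60 * 60 * 1000) / avgItemsPerDay;
  const silenceMs = now.getTime() - lastItemPublished.getTime();
  return silenceMs > factor * avgGapMs;
}
```

For a feed averaging two items per day, the average gap is 12 hours, so the feed is flagged after more than 36 hours of silence. A monthly blog gets roughly three months before the same flag is raised.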

// Health check endpoint exposes pipeline status
GET /api/health/pipeline

{
  "status": "healthy",
  "stages": {
    "scheduler": { "status": "ok", "feedsDueForFetch": 342 },
    "fetcher": {
      "status": "ok",
      "activeWorkers": 47,
      "queueDepth": 128,
      "fetchesLastHour": 18420,
      "notModifiedRate": 0.41
    },
    "parser": {
      "status": "ok",
      "activeWorkers": 12,
      "queueDepth": 34,
      "strictParseRate": 0.847,
      "totalFailureRate": 0.003
    },
    "enricher": {
      "status": "ok",
      "activeWorkers": 8,
      "queueDepth": 67,
      "duplicateRate": 0.12
    },
    "writer": {
      "status": "ok",
      "batchesLastHour": 4200,
      "itemsWrittenLastHour": 89400
    }
  }
}

Lessons Learned

Build for the worst feed, not the best. If your system only works with well-formed RSS 2.0, you’ll miss a third of the real-world web. Tolerance is a feature.

Conditional HTTP requests are not optional. They’re the single highest-impact optimization you can implement in any feed processing system. Implement them on day one.

Deduplication is a spectrum, not a binary. You’ll never catch 100% of duplicates without also catching false positives. Pick a threshold that’s right for your use case and accept the trade-off.

Message queues between pipeline stages are worth the operational complexity. They give you independent scaling, fault isolation, and natural backpressure. When the enricher falls behind, the queue grows but the fetcher and parser keep running. Without queues, a slow enricher would block the entire pipeline.

RSS will outlive most of the technologies you use to process it. We’ve rewritten NimbusFeed’s stack twice (from Python to Node.js, then from Node.js to a Node.js + Rust hybrid for the parser). The RSS feeds we’re consuming haven’t changed at all. Design your system to evolve around a stable input format.

NimbusFeed started as a weekend project and grew into one of our most technically challenging systems. If you’re building something that needs to process RSS or Atom feeds at scale, we’d be happy to share more details. Drop us a line.
