Message Queues Compared: RabbitMQ vs Kafka vs Redis Streams

Every distributed system needs a message passing layer, and the choice between RabbitMQ, Kafka, and Redis Streams shapes your architecture in ways that are painful to undo later. We have deployed all three in production across 12 projects over the last 4 years, sometimes within the same organization for different use cases. The right choice depends on factors that benchmark blog posts never discuss: operational complexity at your team’s actual skill level, failure mode characteristics under real-world conditions, and how the system behaves when things go wrong at 3 AM on a Saturday.

HARBOR SOFTWARE · Engineering Insights

The Fundamental Mental Model Difference

Before comparing features and performance, understand that these three systems have fundamentally different mental models. This is not a marketing distinction. It determines how you think about data flow, how you handle failures, and what happens when consumers fall behind.

  • RabbitMQ is a message broker. It actively routes messages from producers to consumers using exchanges, bindings, and queues. Messages are consumed, acknowledged, and deleted. The broker manages delivery semantics, retry logic, and dead-letter routing. Think of it as a smart postal service that guarantees delivery, tracks acknowledgments, and handles returned mail.
  • Kafka is a distributed log. Producers append events to partitioned, ordered, durable logs called topics. Consumers read from the log at their own pace and track their position (offset). Messages persist for a configurable retention period regardless of whether anyone consumed them. Think of it as a shared, append-only journal that anyone can read from any point.
  • Redis Streams is a data structure. It is an append-only log with consumer group support, built as a native data structure within Redis. It borrows concepts from Kafka (consumer groups, offsets, persistent storage) but lives inside your Redis instance alongside your caches and counters. Think of it as Kafka’s lightweight cousin that shares an apartment with your cache layer.

This difference matters because it determines not just how you publish and consume messages, but what happens when a consumer crashes, when you need to replay historical events, and when you need to add a new consumer to an existing data stream.
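
The contrast between consume-and-delete and read-by-offset can be sketched with two toy in-memory classes. This is an illustration only: ToyQueue and ToyLog are hypothetical names, and neither class models durability, acknowledgments, or partitions.

```python
from collections import deque


class ToyQueue:
    """Broker-style queue: a message is removed once it is consumed."""

    def __init__(self):
        self._messages = deque()

    def publish(self, message):
        self._messages.append(message)

    def consume(self):
        # Destructive read: returns None when the queue is empty.
        return self._messages.popleft() if self._messages else None


class ToyLog:
    """Log-style stream: entries stay put, each group tracks its own offset."""

    def __init__(self):
        self._entries = []
        self._offsets = {}  # consumer group name -> next offset to read

    def append(self, entry):
        self._entries.append(entry)

    def read(self, group):
        offset = self._offsets.get(group, 0)
        if offset >= len(self._entries):
            return None
        self._offsets[group] = offset + 1
        return self._entries[offset]

    def seek(self, group, offset):
        # Replay: move a group's position back to an earlier offset.
        self._offsets[group] = offset


queue = ToyQueue()
log = ToyLog()
for event in ["order.created", "order.paid"]:
    queue.publish(event)
    log.append(event)

queue_first = queue.consume()
queue_second = queue.consume()
queue_third = queue.consume()        # nothing left to read

log.read("billing")
log.read("billing")
late_joiner = log.read("analytics")  # a new group starts from offset 0
log.seek("billing", 0)
replayed = log.read("billing")       # the first entry again
```

In the queue, the two events are gone after two reads. In the log, a group that joins later still sees the first event, and an existing group can rewind.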

RabbitMQ: The Flexible Router

import json
import time

import pika

# Publisher
connection = pika.BlockingConnection(
    pika.ConnectionParameters(
        host='localhost',
        heartbeat=600,
        blocked_connection_timeout=300
    )
)
channel = connection.channel()

# Declare durable exchange and queue.
# The 'orders.dlx' dead-letter exchange referenced below must be declared separately.
channel.exchange_declare(
    exchange='orders', 
    exchange_type='topic', 
    durable=True
)
channel.queue_declare(
    queue='order.processing', 
    durable=True,
    arguments={
        'x-dead-letter-exchange': 'orders.dlx',
        'x-message-ttl': 86400000,  # 24 hour TTL
        'x-max-length': 100000      # Max queue depth
    }
)
channel.queue_bind(
    queue='order.processing', 
    exchange='orders', 
    routing_key='order.created'
)

# Publish persistent message
channel.basic_publish(
    exchange='orders',
    routing_key='order.created',
    body=json.dumps({
        'order_id': 123, 
        'total': 99.99,
        'customer_id': 'cust_456'
    }),
    properties=pika.BasicProperties(
        delivery_mode=2,  # Persistent to disk
        content_type='application/json',
        message_id='msg_789',
        timestamp=int(time.time())
    )
)

# Consumer with manual acknowledgment.
# process_order, TransientError, and PermanentError are application-defined.
def callback(ch, method, properties, body):
    order = json.loads(body)
    try:
        process_order(order)
        ch.basic_ack(delivery_tag=method.delivery_tag)
    except TransientError:
        # Requeue for retry
        ch.basic_nack(
            delivery_tag=method.delivery_tag, 
            requeue=True
        )
    except PermanentError:
        # Send to dead-letter queue
        ch.basic_nack(
            delivery_tag=method.delivery_tag, 
            requeue=False
        )

channel.basic_qos(prefetch_count=10)
channel.basic_consume(
    queue='order.processing', 
    on_message_callback=callback
)
channel.start_consuming()  # Blocks and dispatches deliveries to callback

Strengths

  • Routing flexibility: Topic exchanges support wildcard routing (order.* matches order.created and order.updated; * stands for exactly one word and # for zero or more). Fanout exchanges broadcast to all bound queues. Headers exchanges route on message metadata. Direct exchanges route on exact key match. Neither Kafka nor Redis Streams matches this routing flexibility without custom code.
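  • Message-level acknowledgment: Each message is individually acknowledged, rejected, or requeued. This gives fine-grained control over error handling. Kafka tracks one committed offset per partition rather than per-message state, and Redis Streams acknowledges individual entries with XACK but has no reject-and-requeue equivalent.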
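  • Dead-letter queues: Messages that are rejected without requeue, expire via TTL, or overflow the queue’s length limit are automatically routed to the configured dead-letter exchange. This is built into the broker, not an application-level concern. Retry counting is not automatic on classic queues; quorum queues add a configurable delivery limit. You can inspect DLQ messages through the management UI and, with the shovel plugin enabled, move them back for reprocessing.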
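  • Priority queues: Native message priority, enabled per queue with the x-max-priority argument (values up to 255 are accepted, though a small number of levels is recommended), where higher-priority messages are delivered first. Kafka and Redis Streams have no equivalent without building a multi-queue workaround.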
  • Management UI: The RabbitMQ management plugin provides a comprehensive web UI for monitoring queue depths, message rates, consumer status, and connection health. You can publish and consume test messages from the UI, which is invaluable for debugging.
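
To make the wildcard rules concrete, here is a minimal pure-Python sketch of topic-exchange pattern matching. It mirrors the documented semantics (words separated by dots, * for exactly one word, # for zero or more) but is not RabbitMQ’s implementation, and topic_matches is a hypothetical helper name.

```python
def topic_matches(pattern: str, routing_key: str) -> bool:
    """Return True if an AMQP-style topic pattern matches a routing key."""
    p = pattern.split(".")
    k = routing_key.split(".") if routing_key else []

    def match(i: int, j: int) -> bool:
        if i == len(p):
            # Pattern exhausted: match only if the key is exhausted too.
            return j == len(k)
        if p[i] == "#":
            # '#' may absorb zero or more words.
            return any(match(i + 1, n) for n in range(j, len(k) + 1))
        if j == len(k):
            return False
        if p[i] == "*" or p[i] == k[j]:
            # '*' or a literal word consumes exactly one word.
            return match(i + 1, j + 1)
        return False

    return match(0, 0)
```

With this, order.* matches order.created but not order.created.eu, while order.# matches both.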

Weaknesses

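  • No replay: Once a message is consumed and acknowledged from a classic or quorum queue, it is gone. You cannot reprocess yesterday’s messages to populate a new analytics pipeline or recover from a consumer bug that corrupted data. This is the single biggest limitation and the primary reason teams migrate to Kafka. (The stream queue type added in RabbitMQ 3.9 supports non-destructive reads, but it is a different model from the queues discussed here.)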
  • Memory pressure under backlog: If consumers fall behind, messages accumulate in memory. RabbitMQ can page to disk (“lazy queues”), but performance degrades sharply once paging begins. We have seen clusters become unresponsive when a slow consumer caused memory to spike to 95% of the high watermark, triggering connection-blocking publisher flow control that cascaded into timeouts across the entire system.
  • Scaling is per-queue, not per-topic: A single queue is processed by a single Erlang process on a single node. To scale beyond that node’s capacity, you need to manually shard into multiple queues (e.g., order.processing.shard1, order.processing.shard2) and add routing logic. This is not hard but it is not automatic either.
  • Cluster operations are delicate: Adding and removing nodes from a RabbitMQ cluster, especially with classic mirrored queues, requires careful orchestration. We have experienced data loss during a cluster reconfiguration that coincided with a node failure. Quorum queues (introduced in 3.8) are more robust but require more resources per queue, and they are the replacement for mirrored queues, which were removed in RabbitMQ 4.0.
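
The manual sharding described above comes down to picking a shard from a stable hash of some message key. A minimal sketch, assuming the order.processing.shardN naming from the example; shard_queue_name and the choice of CRC32 are illustrative assumptions, not a RabbitMQ feature.

```python
import zlib


def shard_queue_name(base: str, key: str, shard_count: int) -> str:
    """Map a message key to one of shard_count queue names (1-based)."""
    # crc32 is stable across processes, unlike Python's built-in hash(),
    # so every publisher picks the same shard for the same key.
    shard = zlib.crc32(key.encode("utf-8")) % shard_count + 1
    return f"{base}.shard{shard}"


# All messages for one customer land on the same shard queue.
name_a = shard_queue_name("order.processing", "cust_456", 4)
name_b = shard_queue_name("order.processing", "cust_456", 4)
```

The publisher would use the returned name as the routing key. Note that changing shard_count remaps most keys, so resizing needs a drain or migration plan.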

Best for

Task queues (background job processing with retry and dead-letter handling), complex routing requirements (when different consumers need different subsets of messages), request-reply patterns (RPC over messaging), and systems where message-level acknowledgment and priority are important.

Kafka: The Distributed Log

import json
import logging

from confluent_kafka import Consumer, KafkaError, KafkaException, Producer

logger = logging.getLogger(__name__)

# Producer with reliability settings
producer = Producer({
    'bootstrap.servers': 'kafka1:9092,kafka2:9092,kafka3:9092',
    'acks': 'all',               # Wait for all ISR replicas
    'enable.idempotence': True,   # No duplicates from producer retries
    'linger.ms': 5,               # Micro-batch for throughput
    'batch.size': 65536,          # 64KB batches
    'compression.type': 'lz4',    # Compress for network savings
    'retries': 3,
    'retry.backoff.ms': 100,
})

def delivery_callback(err, msg):
    if err:
        logger.error(
            f'Delivery failed for {msg.key()}: {err}'
        )
    else:
        logger.debug(
            f'Delivered to {msg.topic()} '
            f'[{msg.partition()}] @ {msg.offset()}'
        )

# order and order_id are supplied by the application
producer.produce(
    topic='orders',
    key=str(order_id).encode(),  # Partition by order_id
    value=json.dumps(order).encode(),
    callback=delivery_callback
)
producer.flush()

# Consumer with manual offset management
consumer = Consumer({
    'bootstrap.servers': 'kafka1:9092,kafka2:9092',
    'group.id': 'order-processor',
    'auto.offset.reset': 'earliest',
    'enable.auto.commit': False,    # Manual commit
    'max.poll.interval.ms': 300000, # 5 min processing budget
    'session.timeout.ms': 45000,    # 45s heartbeat timeout
    'fetch.min.bytes': 1024,        # Batch fetches
})
consumer.subscribe(['orders'])

try:
    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is None:
            continue
        if msg.error():
            if msg.error().code() == KafkaError._PARTITION_EOF:
                continue
            raise KafkaException(msg.error())
        
        order = json.loads(msg.value())
        process_order(order)
        consumer.commit(asynchronous=False)
finally:
    consumer.close()

Strengths

  • Replay capability: Messages persist for a configurable retention period (7 days by default, configurable to weeks, months, or infinite). You can reset a consumer group’s offset to any point in time and reprocess historical data. This is transformative for three scenarios: debugging production consumer bugs (replay the same data through a fixed consumer), populating new consumers (a new analytics pipeline can process all historical data without replaying from the producer side), and disaster recovery (if a downstream system loses data, replay from Kafka).
  • Throughput: Kafka handles 1-2 million messages per second per broker with commodity hardware (8 cores, 32GB RAM, NVMe SSD). A 3-node cluster can sustain 3-5 million messages per second for small messages. The sequential disk I/O pattern, zero-copy data transfer, and client-side batching make Kafka’s throughput characteristics fundamentally different from broker-based systems.
  • Ordering guarantees: Messages within a partition are strictly ordered by offset. Partition by entity ID (user_id, order_id) and you get per-entity ordering without any coordination overhead. Kafka is the only system of the three that guarantees ordering at this level without sacrificing throughput.
  • Multiple consumer groups: Multiple independent consumer groups can read the same topic at their own pace. One group for real-time processing, another for analytics, another for audit logging, another for search indexing. Each tracks its own offset. Adding a new consumer group has zero impact on existing consumers and does not duplicate data. RabbitMQ can only approximate this by binding one queue per consumer group to the exchange: each queue accumulates its own backlog, and a queue bound later never sees messages published before the binding existed.
  • Exactly-once semantics: With idempotent producers and the transactional API (used directly or through Kafka Streams), Kafka provides exactly-once processing for read-process-write pipelines that stay within Kafka. Side effects on external systems, such as database writes, still need idempotent handling. Neither RabbitMQ nor Redis Streams offers an equivalent without application-level deduplication.
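
Application-level deduplication, mentioned in the last point, usually means remembering which message IDs have already been handled. A minimal sketch, assuming each message carries a unique ID; IdempotentHandler is a hypothetical name, and the in-memory set stands in for whatever durable store a real consumer would use.

```python
class IdempotentHandler:
    """Wrap a handler so that redelivered messages are processed only once."""

    def __init__(self, handler):
        self._handler = handler
        self._seen = set()  # assumption: a durable store in real deployments

    def handle(self, message_id, payload):
        if message_id in self._seen:
            return False  # duplicate delivery, skipped
        self._handler(payload)
        # Record the ID only after the handler succeeds, so a failed
        # attempt is retried on redelivery.
        self._seen.add(message_id)
        return True


processed = []
dedup = IdempotentHandler(processed.append)
first = dedup.handle("msg_789", {"order_id": 123})
second = dedup.handle("msg_789", {"order_id": 123})  # redelivery
```

Recording the ID after the handler runs keeps at-least-once behavior on failure. The handler and the ID write are not atomic here, which is the gap transactions close.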

Weaknesses

  • Operational complexity: A production Kafka deployment requires a controller quorum (ZooKeeper in older releases; KRaft in current ones, which has been production-ready since Kafka 3.3 and is the only mode as of Kafka 4.0), careful partition planning (partitions cannot be reduced after creation), ISR management, and broker configuration tuning. A minimum production deployment is 3 brokers + 3 ZooKeeper nodes (or 3 KRaft controllers). We budget 10-15 hours per month for Kafka operations on a 5-node cluster: version upgrades, partition rebalancing after adding topics, disk space management, consumer lag monitoring, and troubleshooting under-replicated partitions.
  • No per-message routing or filtering: All consumers of a topic get all messages for their assigned partitions. If Consumer A only cares about order.created events and Consumer B only cares about order.updated events, you have two choices: put them on separate topics (operational overhead) or have each consumer filter client-side (wasted network and CPU). RabbitMQ’s exchange-based routing solves this at the broker level.
  • Latency floor: Kafka’s batching and sequential I/O design optimizes for throughput, not latency. With default settings, end-to-end latency (produce to consume) is typically 5-50ms. With aggressive tuning (linger.ms=0, fetch.min.bytes=1), you can get to 2-5ms, but at the cost of throughput. RabbitMQ delivers individual messages in under 1ms.
  • No native dead-letter queue: Failed message handling is entirely application-level. You need to catch exceptions, publish failed messages to a separate DLQ topic, and build your own retry/replay tooling. This is roughly 200-300 lines of boilerplate that every Kafka consumer application needs.
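
The core of that boilerplate is a retry loop that hands the message to a dead-letter publisher once attempts run out. A minimal broker-agnostic sketch: handle_with_dlq is a hypothetical helper, and publish_dlq is assumed to wrap something like a produce call to a separate DLQ topic. Backoff between attempts is omitted for brevity.

```python
import json


def handle_with_dlq(message, process, publish_dlq, max_attempts=3):
    """Try process(message) up to max_attempts times, then dead-letter it."""
    last_error = None
    for _ in range(max_attempts):
        try:
            process(message)
            return True
        except Exception as exc:  # real code would separate transient errors
            last_error = exc
    # Attach failure context so the DLQ entry can be diagnosed and replayed.
    publish_dlq(json.dumps({
        "payload": message,
        "error": repr(last_error),
        "attempts": max_attempts,
    }))
    return False


def always_fails(message):
    raise ValueError("bad order")


dlq = []
ok_result = handle_with_dlq({"order_id": 1}, lambda m: None, dlq.append)
failed_result = handle_with_dlq({"order_id": 2}, always_fails, dlq.append)
```

In a Kafka consumer, this call would sit between poll and commit, so the offset advances only after the message is either processed or dead-lettered.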

Best for

Event sourcing and event-driven architectures, stream processing (Kafka Streams, Flink, Spark Streaming), high-throughput data pipelines (log aggregation, metrics collection, CDC), systems where multiple consumers need the same data, and any scenario where replay capability is valuable (which in our experience is most scenarios that justify a message queue at all).

Redis Streams: The Pragmatic Middle Ground

import asyncio
import logging
import os
import time

import redis.asyncio as redis

logger = logging.getLogger(__name__)

r = redis.Redis(host='localhost', port=6379, 
                decode_responses=True)

# Producer: XADD (the awaits below must run inside a coroutine)
await r.xadd('orders', {
    'order_id': '123',
    'total': '99.99',
    'customer': 'user_456',
    'created_at': str(time.time())
}, maxlen=1000000)  # Cap stream at 1M entries

# Create consumer group (idempotent)
try:
    await r.xgroup_create(
        'orders', 'order-processors', 
        id='0', mkstream=True
    )
except redis.ResponseError as e:
    # BUSYGROUP means the group already exists, which is fine
    if 'BUSYGROUP' not in str(e):
        raise

# Consumer loop with acknowledgment
async def consume():
    consumer_name = f'worker-{os.getpid()}'
    
    while True:
        # Read new messages
        messages = await r.xreadgroup(
            groupname='order-processors',
            consumername=consumer_name,
            streams={'orders': '>'},
            count=10,      # Batch size
            block=5000     # Block 5s max
        )
        
        for stream, entries in messages:
            for msg_id, data in entries:
                try:
                    await process_order(data)
                    await r.xack(
                        'orders', 'order-processors', 
                        msg_id
                    )
                except Exception as e:
                    logger.error(
                        f'Failed {msg_id}: {e}'
                    )
                    # Message stays pending for claim

async def claim_stuck_messages():
    """Periodically claim messages stuck >60 seconds."""
    while True:
        pending = await r.xpending_range(
            'orders', 'order-processors',
            min='-', max='+', count=100
        )
        for msg in pending:
            idle_ms = msg['time_since_delivered']
            if idle_ms > 60000:  # Stuck >60 seconds
                claimed = await r.xclaim(
                    'orders', 'order-processors',
                    f'worker-{os.getpid()}',
                    min_idle_time=60000,
                    message_ids=[msg['message_id']]
                )
                for msg_id, data in claimed:
                    await process_order(data)
                    await r.xack(
                        'orders', 'order-processors', 
                        msg_id
                    )
        await asyncio.sleep(30)

Strengths

  • Zero additional infrastructure if you already run Redis. No new cluster, no new monitoring, no new deployment pipeline. Your existing Redis backup strategy, monitoring dashboards, and connection pooling all cover the stream.
  • Sub-millisecond latency: Typical publish-to-consume latency is 0.1-0.5ms when consumers are not blocked. This was the fastest of the three in our deployments, because Redis serves reads and writes from memory and adds no batching delay.
  • Consumer groups with pending message tracking: Similar to Kafka’s consumer groups but with built-in visibility into stuck messages via XPENDING and claim-based recovery via XCLAIM. When a consumer crashes, its pending messages can be claimed by another consumer after a configurable idle timeout, without any coordinator or rebalance protocol.
  • Simple operational model: It is Redis. Your team already knows how to operate Redis, how to monitor it, how to back it up, and how to scale it.
  • Atomic operations: Because Redis executes commands on a single thread, stream operations (XADD, XACK, XCLAIM) are atomic. There are no race conditions in consumer group message assignment, unlike Kafka where rebalances can cause brief periods of duplicate delivery.

Weaknesses

  • Durability depends on persistence config: With RDB snapshots alone, a crash loses everything written since the last snapshot. The stock redis.conf snapshots after 60 seconds only if at least 10,000 writes occurred, and otherwise waits 5 minutes or longer, so the loss window can be minutes rather than seconds. With AOF in appendfsync everysec mode, loss is limited to ~1 second. With appendfsync always, every acknowledged write has been fsynced, but throughput drops to ~10K messages/second (every write waits for fsync). For most use cases, everysec is the right trade-off, but if you need guaranteed zero message loss, Redis Streams is not the right choice.
  • Memory-bound: All stream data lives in RAM. At 100,000 messages per hour with 1KB average size, you accumulate 2.4GB per day. You must configure MAXLEN (cap at N entries) or MINID (cap by minimum ID/timestamp) to prevent unbounded growth. This limits your retention period to what fits in memory, which is typically hours to days rather than Kafka’s days to weeks.
  • Single-node throughput ceiling: A single Redis instance handles roughly 100,000-200,000 XADD operations per second (depending on message size and hardware). With consumer groups, XREADGROUP throughput is similar. Beyond this, you need Redis Cluster, which adds complexity and does not support streams spanning multiple shards (each stream lives on one shard).
  • No built-in routing: One stream = one topic. If you need producer-side routing (messages going to different consumers based on content), you need multiple streams and application-level routing logic. RabbitMQ’s exchange model handles this at the broker level.
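
The memory arithmetic above is worth running for your own numbers before settling on a MAXLEN. A rough sketch, assuming 1KB means 1,000 bytes and counting payload bytes only (Redis adds per-entry overhead that is ignored here); the function names and the 8GB budget are illustrative.

```python
def daily_stream_bytes(messages_per_hour: int, avg_message_bytes: int) -> int:
    """Payload bytes accumulated in 24 hours with no trimming."""
    return messages_per_hour * 24 * avg_message_bytes


def retention_hours(memory_budget_bytes: int, messages_per_hour: int,
                    avg_message_bytes: int) -> float:
    """How many hours of traffic fit in the memory budget."""
    return memory_budget_bytes / (messages_per_hour * avg_message_bytes)


def maxlen_for_budget(memory_budget_bytes: int, avg_message_bytes: int) -> int:
    """Largest MAXLEN whose payload fits in the memory budget."""
    return memory_budget_bytes // avg_message_bytes


per_day = daily_stream_bytes(100_000, 1_000)            # 2.4GB, as in the text
hours = retention_hours(8_000_000_000, 100_000, 1_000)  # 8GB budget (assumed)
cap = maxlen_for_budget(8_000_000_000, 1_000)
```

At the article’s rate, an 8GB budget holds about 80 hours of messages, or a MAXLEN of roughly 8 million entries, before overhead.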

Best for

Lightweight event buses between microservices, real-time notifications and activity feeds, inter-service communication in systems that already use Redis, task queues where sub-millisecond latency matters, and any scenario where simplicity and latency matter more than long-term durability and high throughput.

Decision Framework

After deploying all three across 12 projects over 4 years, here is the decision tree we actually use:

  1. Do you need message replay (reprocessing historical events)? If yes, Kafka. This is the killer feature that no other system provides at scale. If there is even a 30% chance you will need replay in the next 12 months, start with Kafka.
  2. Do you need complex routing (topic-based, header-based, priority queues)? If yes, RabbitMQ. Its exchange model is unmatched for routing flexibility.
  3. Is your throughput under 100K messages/second and you already run Redis? If yes, Redis Streams. Zero additional infrastructure cost and the simplest operational model of the three.
  4. Is your throughput over 500K messages/second? Kafka. It is the only option that handles this without heroic optimization or horizontal scaling workarounds.
  5. Do you need exactly-once delivery semantics? Kafka with idempotent producers and the transactional API. RabbitMQ and Redis Streams both provide at-least-once, which means your consumers need to be idempotent.
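
The list can be written down as a function, which makes the rule order explicit. This sketch assumes the questions are evaluated top to bottom and the first match wins; the list does not say what to do when no rule applies, so the function returns None in that case. choose_queue and its parameter names are hypothetical.

```python
from typing import Optional


def choose_queue(needs_replay: bool, needs_complex_routing: bool,
                 msgs_per_second: int, runs_redis: bool,
                 needs_exactly_once: bool) -> Optional[str]:
    """Apply the five questions in order; the first matching rule wins."""
    if needs_replay:
        return "Kafka"
    if needs_complex_routing:
        return "RabbitMQ"
    if msgs_per_second < 100_000 and runs_redis:
        return "Redis Streams"
    if msgs_per_second > 500_000:
        return "Kafka"
    if needs_exactly_once:
        return "Kafka"
    return None  # no rule matched; the list does not cover this case


# Roughly 500 messages per minute, Redis already in place.
small_system = choose_queue(
    needs_replay=False, needs_complex_routing=False,
    msgs_per_second=8, runs_redis=True, needs_exactly_once=False,
)
```

Because replay is checked first, a system that needs both replay and complex routing resolves to Kafka under this ordering.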

The most common mistake is choosing Kafka for a system that processes 500 messages per minute because “we might need to scale later.” Kafka’s operational overhead is real: cluster management, partition planning, consumer group monitoring, version upgrades, and troubleshooting. If your message volume does not justify that overhead today, use Redis Streams or RabbitMQ and migrate when you outgrow them. The migration cost (typically 1-2 weeks of engineering time) is almost always less than 18 months of unnecessary Kafka operations (5-15 hours per month of engineering time).
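
The cost comparison is simple enough to check directly. The only added assumption is a 40-hour engineering week; the other figures come from the paragraph above.

```python
HOURS_PER_WEEK = 40  # assumption: one engineer, full time

# Migration later: 1-2 weeks of engineering time.
migration_hours = (1 * HOURS_PER_WEEK, 2 * HOURS_PER_WEEK)

# Running Kafka early: 18 months at 5-15 hours per month.
kafka_ops_hours = (18 * 5, 18 * 15)

# Even the slowest migration costs less than the cheapest 18 months of operations.
migration_is_cheaper = migration_hours[1] < kafka_ops_hours[0]
```

That is 40-80 hours for the migration against 90-270 hours of operations.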

Conversely, the most expensive mistake is choosing RabbitMQ or Redis Streams for a system that later needs replay. Retrofitting replay onto a broker-based system requires rearchitecting the entire data flow, adding a parallel persistence layer, and migrating consumers. If there is any realistic chance you will need to reprocess historical events (building new analytics, recovering from consumer bugs, adding new downstream systems), start with Kafka and accept the operational overhead.
