Architecture

Redpanda's internals: the Seastar reactor, thread-per-core model, per-partition Raft replication, the controller, archival uploader (tiered storage), Schema Registry, Pandaproxy, and WASM data transforms.

Component Overview

flowchart TB
    Client["Kafka client / Connect / kcat"]
    subgraph Process["Single Redpanda Process (per node)"]
        direction TB
        Reactor["Seastar Reactor\n(thread-per-core)"]
        subgraph Shards["Per-core Shards"]
            S0["Shard 0\n(reactor + partitions)"]
            S1["Shard 1"]
            S2["Shard 2"]
            S3["Shard 3"]
        end
        Controller["Controller\n(Raft group on Shard 0)"]
        PartitionRaft["Partition Raft groups\n(distributed across shards)"]
        ArchivalUploader["Archival Uploader\n(per-partition uploaders)"]
        SchemaRegistry["Schema Registry"]
        Pandaproxy["Pandaproxy (HTTP REST)"]
        WasmRuntime["WASM Transform Runtime\n(in-broker)"]
        StorageEngine["Storage Engine\n(NTP segments + index)"]
    end
    subgraph ObjectStore["Object Storage Tier"]
        S3["S3 / GCS / Azure ADLS"]
    end
    Client --> Reactor
    Reactor --> Shards
    S0 --> Controller
    Shards --> PartitionRaft
    PartitionRaft --> StorageEngine
    StorageEngine --> ArchivalUploader
    ArchivalUploader --> S3
    PartitionRaft --> WasmRuntime
    Reactor --> SchemaRegistry
    Reactor --> Pandaproxy

Components

  • Seastar reactor: one event loop per CPU core; futures/promises replace blocking calls and locks.
  • Shard (core): owns a subset of partitions; work moves between shards via lockless message-passing queues.
  • Controller: cluster metadata (topic configs, ACLs, cluster membership, broker registration).
  • Per-partition Raft: each partition is its own Raft group with its own leader and followers.
  • Storage engine: append-only segments plus indexes per Namespace-Topic-Partition (NTP).
  • Archival uploader: uploads closed segments to S3/GCS/Azure for tiered storage.
  • Cloud storage interface: reads tiered segments back transparently when consumers ask for old offsets.
  • Schema Registry: stores Avro/Protobuf/JSON schemas; compatible with the Confluent Schema Registry API.
  • Pandaproxy: REST gateway implementing the Kafka REST Proxy contract.
  • WASM Transform Runtime: runs user-supplied WASM modules that transform records from input to output topics inside the broker.
  • rpk: CLI bundled with the binary; talks to the Admin API on :9644.

Thread-per-core (Seastar)

Traditional brokers (Kafka, RabbitMQ) use thread pools shared across CPU cores, which incurs context switches and locking. Redpanda uses Seastar, where:

  • Each CPU core has one reactor thread pinned to it.
  • Tasks on a core never block; I/O is asynchronous via io_uring or aio.
  • Inter-core communication is through lockless message-passing queues.
  • Memory is shard-local; each core has its own slab allocator.

This eliminates lock contention for hot paths and gives predictable tail latency at the cost of slightly more bookkeeping when partitions need cross-shard work (uncommon).
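
To make the model concrete, here is a minimal Go sketch of the same idea: one goroutine stands in for each core's reactor, every partition is pinned to exactly one shard by hash, and all cross-shard traffic goes over channels instead of shared, lock-protected state. This is an analogy only; Redpanda implements it in C++ on Seastar with reactors and lockless queues.

// Conceptual shard-per-core sketch: one goroutine "owns" each simulated core,
// partitions are statically assigned to a shard by hash, and no mutexes are
// needed because each shard's state is touched by exactly one goroutine.
package main

import (
    "fmt"
    "hash/fnv"
    "runtime"
    "sync"
)

type produceReq struct {
    partition string
    record    string
}

// shardFor routes a partition to a fixed shard, mirroring how Redpanda pins
// each partition's state to exactly one core.
func shardFor(partition string, shards int) int {
    h := fnv.New32a()
    h.Write([]byte(partition))
    return int(h.Sum32()) % shards
}

func main() {
    shards := runtime.NumCPU()
    queues := make([]chan produceReq, shards)
    var wg sync.WaitGroup

    for i := 0; i < shards; i++ {
        queues[i] = make(chan produceReq, 128)
        wg.Add(1)
        go func(id int, in <-chan produceReq) {
            defer wg.Done()
            // Shard-local state: no lock, only this goroutine ever touches it.
            logs := map[string][]string{}
            for req := range in {
                logs[req.partition] = append(logs[req.partition], req.record)
            }
            fmt.Printf("shard %d owns %d partitions\n", id, len(logs))
        }(i, queues[i])
    }

    // "Clients" route each request to the owning shard instead of locking a shared log.
    for p := 0; p < 16; p++ {
        partition := fmt.Sprintf("topic/%d", p)
        queues[shardFor(partition, shards)] <- produceReq{partition: partition, record: "hello"}
    }
    for _, q := range queues {
        close(q)
    }
    wg.Wait()
}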

Per-partition Raft

sequenceDiagram
    participant P as Producer
    participant L as Partition Leader (Node A)
    participant F1 as Follower (Node B)
    participant F2 as Follower (Node C)
    P->>L: Produce(records)
    L->>L: append to log + flush
    L->>F1: AppendEntries
    L->>F2: AppendEntries
    F1->>L: ack
    F2->>L: ack
    Note right of L: quorum reached (2/3)
    L->>P: ProduceResponse

Unlike Kafka (which keeps a separate KRaft quorum for metadata and ISR-based replication for data), Redpanda uses Raft for both metadata and partition data. Benefits:

  • Single, well-understood consensus protocol.
  • No "elections vs ISR" duality.
  • Cluster membership changes are just Raft reconfiguration.

Trade-off: per-partition Raft groups have a small fixed cost (heartbeats, vote tracking), so workloads with very many low-traffic partitions are not the optimal regime.
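
The quorum step in the diagram is just counting: the leader can acknowledge the producer once an offset is durable on a majority of replicas, itself included. A small Go sketch of that commit rule (not Redpanda's code):

// Minimal sketch of the Raft commit rule used in the sequence diagram above.
// matchIndex holds, per replica (leader included), the last offset known to
// be durable on that replica; the commit point is the highest offset present
// on a majority.
package main

import (
    "fmt"
    "sort"
)

func commitIndex(matchIndex []int64) int64 {
    sorted := append([]int64(nil), matchIndex...)
    sort.Slice(sorted, func(i, j int) bool { return sorted[i] > sorted[j] })
    // With n replicas, the value at position n/2 (0-based) is replicated on a
    // majority: for n=3 that is position 1, i.e. at least 2 replicas have it.
    return sorted[len(sorted)/2]
}

func main() {
    // Leader and one follower at 120, one follower lagging at 95:
    // offset 120 is on 2/3 replicas, so it is committed.
    fmt.Println(commitIndex([]int64{120, 120, 95})) // 120

    // Only the leader has 120; the quorum point is 95.
    fmt.Println(commitIndex([]int64{120, 95, 95})) // 95
}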

Cluster Metadata (Controller)

The controller is a single Raft group covering the whole cluster. It holds:

  • Topic configurations and ACLs.
  • Partition assignment to brokers.
  • Cluster membership and broker registration.
  • Schema Registry data (when running in single-process mode).

flowchart LR
    AdminAPI["rpk / Admin API"]
    Controller["Controller (Raft)"]
    Broker1["Broker 1"]
    Broker2["Broker 2"]
    Broker3["Broker 3"]
    AdminAPI --> Controller
    Controller --> Broker1
    Controller --> Broker2
    Controller --> Broker3
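
Controller-managed metadata is what rpk reads when it talks to the Admin API on :9644. The Go sketch below does the same by hand; the /v1/brokers path and the response fields are assumptions about the Admin API and may differ between Redpanda versions, so treat it as illustrative.

// Reads the broker list from the Admin API (port per the components table).
// Endpoint path and field names are assumptions; decode loosely.
package main

import (
    "encoding/json"
    "fmt"
    "io"
    "log"
    "net/http"
)

func main() {
    resp, err := http.Get("http://localhost:9644/v1/brokers") // assumed Admin API endpoint
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    body, err := io.ReadAll(resp.Body)
    if err != nil {
        log.Fatal(err)
    }

    // Keep the decoding schemaless, since field names vary by version.
    var brokers []map[string]any
    if err := json.Unmarshal(body, &brokers); err != nil {
        log.Fatalf("unexpected payload: %v", err)
    }
    for _, b := range brokers {
        fmt.Printf("node_id=%v membership=%v\n", b["node_id"], b["membership_status"])
    }
}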

Tiered Storage

flowchart LR
    LocalDisk["Local NVMe segments"]
    Uploader["Archival Uploader"]
    S3["S3 / GCS / Azure"]
    SegMeta["Segment metadata\n(controller)"]
    LocalDisk -->|on close| Uploader
    Uploader --> S3
    Uploader --> SegMeta
    Reader["Consumer fetch (old offset)"]
    Reader --> CloudIface["Cloud Storage Interface"]
    CloudIface --> S3
    CloudIface --> Reader

When a segment closes (size or time threshold), the uploader pushes it to object storage. Consumers reading old offsets fetch transparently — Redpanda streams from S3 back to the client.
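
Conceptually, the read path is a per-fetch routing decision: serve from local segments if the offset is still on disk, otherwise pull it back through the cloud storage interface. A hypothetical Go sketch of that decision, not Redpanda code:

// Sketch of fetch routing under tiered storage: offsets older than the local
// start offset are served from the object store, newer ones from local NVMe.
package main

import "fmt"

type partitionState struct {
    localStartOffset int64 // oldest offset still present on local disk
    highWatermark    int64 // newest committed offset
}

// source decides where a fetch at the given offset would be served from.
func source(p partitionState, offset int64) string {
    switch {
    case offset >= p.highWatermark:
        return "none (offset not yet produced)"
    case offset >= p.localStartOffset:
        return "local segment"
    default:
        return "tiered storage (S3/GCS/Azure)"
    }
}

func main() {
    p := partitionState{localStartOffset: 1_000_000, highWatermark: 1_500_000}
    for _, off := range []int64{42, 1_200_000, 2_000_000} {
        fmt.Printf("offset %d -> %s\n", off, source(p, off))
    }
}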

Remote Read Replicas (Enterprise) take this further: a separate Redpanda cluster reads the tiered data directly, allowing analytics-style consumers to bypass the production cluster.

WASM Data Transforms

flowchart LR
    Producer --> Topic1["Topic input"]
    Topic1 --> Transform["WASM transform"]
    Transform --> Topic2["Topic output"]
    Topic2 --> Consumer

A WebAssembly module is registered against an input topic; for each record it can emit zero or more records to one or more output topics. Modules are compiled from Go or Rust source with rpk transform build.
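
A minimal Go transform might look like the sketch below. It assumes the Redpanda Go data-transforms SDK and its OnRecordWritten hook; the module path and callback signature have changed across SDK versions, so compare against an rpk transform init scaffold before relying on it.

// Sketch of a transform that drops tombstones (empty-value records),
// illustrating "zero or more output records per input record".
package main

import (
    "github.com/redpanda-data/redpanda/src/transform-sdk/go/transform"
)

func main() {
    // Register the callback invoked for every record written to the input topic.
    transform.OnRecordWritten(dropTombstones)
}

func dropTombstones(e transform.WriteEvent, w transform.RecordWriter) error {
    rec := e.Record()
    if len(rec.Value) == 0 {
        return nil // emit nothing for tombstones
    }
    return w.Write(rec) // forward everything else unchanged
}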

Iceberg Integration

Redpanda topics can be configured to write directly into an Apache Iceberg table:

flowchart LR
    Producer --> Topic["Redpanda topic"]
    Topic --> IcebergWriter["Iceberg Writer"]
    IcebergWriter --> ParquetFiles["Parquet files (S3)"]
    IcebergWriter --> Catalog["Iceberg catalog"]
    Trino["Trino / Spark / dbt"] --> Catalog
    Trino --> ParquetFiles

This pattern collapses the typical Kafka → Connect → Parquet pipeline into a single broker-native flow.

Performance Characteristics

Numbers below are Redpanda blog claims; verify them locally.

  • Single-broker NVMe: sustained 1+ GB/s producer throughput.
  • 3-broker cluster, R3: 4–5 GB/s sustained aggregate.
  • p99 produce latency: sub-10 ms typical; sub-1 ms achievable on well-tuned NVMe.
  • Tiered Storage upload throughput: proportional to the S3 PUT rate (10–50 MB/s per partition).
  • WASM transform overhead: ~50–100 µs per record for simple transforms.
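
One way to use the (unverified) upload figure above is a back-of-envelope partition count: uploads must keep pace with the produce rate on average, or local disks fill up. All numbers in this Go sketch are assumptions to replace with your own measurements.

// Back-of-envelope sizing: partitions needed for tiered-storage uploads to
// keep up with an assumed aggregate produce rate.
package main

import (
    "fmt"
    "math"
)

func main() {
    produceMBps := 2000.0          // target aggregate produce rate, MB/s (assumption)
    perPartitionUploadMBps := 25.0 // midpoint of the 10-50 MB/s/partition claim above

    minPartitions := int(math.Ceil(produceMBps / perPartitionUploadMBps))
    fmt.Printf("need at least %d partitions for uploads to keep up\n", minPartitions)
}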

Vendor benchmarks

Redpanda's published numbers are vendor-controlled, and Confluent has published counter-benchmarks. Always run openmessaging-benchmark with your real workload before making sizing decisions.

Comparison Hooks

  • vs Kafka — same wire protocol; Redpanda wins on ops simplicity and tail latency, Kafka wins on ecosystem breadth and JVM tooling familiarity.
  • vs Pulsar — Pulsar's compute/storage split scales differently; Redpanda's per-partition Raft is simpler.
  • vs NATS — different protocol; NATS is lighter for sub-ms request-reply but lacks Kafka-API compatibility.