# Architecture
Redpanda's internals: the Seastar reactor, thread-per-core model, per-partition Raft replication, the controller, archival uploader (tiered storage), Schema Registry, Pandaproxy, and WASM data transforms.
## Component Overview
```mermaid
flowchart TB
    Client["Kafka client / Connect / kcat"]
    subgraph Process["Single Redpanda Process (per node)"]
        direction TB
        Reactor["Seastar Reactor\n(thread-per-core)"]
        subgraph Shards["Per-core Shards"]
            S0["Shard 0\n(reactor + partitions)"]
            S1["Shard 1"]
            S2["Shard 2"]
            S3["Shard 3"]
        end
        Controller["Controller\n(Raft group on Shard 0)"]
        PartitionRaft["Partition Raft groups\n(distributed across shards)"]
        ArchivalUploader["Archival Uploader\n(per-partition uploaders)"]
        SchemaRegistry["Schema Registry"]
        Pandaproxy["Pandaproxy (HTTP REST)"]
        WasmRuntime["WASM Transform Runtime\n(in-broker)"]
        StorageEngine["Storage Engine\n(NTP segments + index)"]
    end
    subgraph ObjectStore["Object Storage Tier"]
        S3["S3 / GCS / Azure ADLS"]
    end
    Client --> Reactor
    Reactor --> Shards
    S0 --> Controller
    Shards --> PartitionRaft
    PartitionRaft --> StorageEngine
    StorageEngine --> ArchivalUploader
    ArchivalUploader --> S3
    PartitionRaft --> WasmRuntime
    Reactor --> SchemaRegistry
    Reactor --> Pandaproxy
```
## Components
| Component | Role |
|---|---|
| Seastar reactor | Runs one event loop per CPU core; futures/promises avoid syscalls and locks. |
| Shard (core) | Owns a subset of partitions; messages move between shards via lockless queues. |
| Controller | Cluster metadata: topic config, ACLs, cluster membership, broker registration. |
| Per-partition Raft | Each partition is its own Raft group with its own leader/followers. |
| Storage engine | Append-only segments + indexes per namespace-topic-partition (NTP). |
| Archival uploader | Uploads closed segments to S3/GCS/Azure for tiered storage. |
| Cloud storage interface | Reads tiered segments back transparently when consumers ask for old offsets. |
| Schema Registry | Stores Avro/Protobuf/JSON schemas, compatible with Confluent SR API. |
| Pandaproxy | REST gateway implementing the Kafka REST Proxy contract. |
| WASM Transform Runtime | Runs user-supplied WASM modules that map records from input topics to output topics. |
| rpk | CLI bundled with the binary; talks to the Admin API on :9644. |
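The shard-ownership model in the table above can be sketched as a deterministic hash mapping. This is a conceptual illustration only; `shard_for` and the CRC-based scheme are hypothetical, not Redpanda's internal assignment logic:

```python
# Conceptual sketch: how a thread-per-core broker might pin each
# namespace/topic/partition (NTP) to a CPU-core shard.
import zlib

NUM_SHARDS = 4  # one shard per CPU core

def shard_for(namespace: str, topic: str, partition: int) -> int:
    """Deterministically map an NTP to the shard that owns it."""
    ntp = f"{namespace}/{topic}/{partition}".encode()
    return zlib.crc32(ntp) % NUM_SHARDS

# Because the mapping is deterministic, any code path can compute the
# owning shard and route work to it without shared locks.
owners = {p: shard_for("kafka", "orders", p) for p in range(8)}
```

The key property is that ownership is computable, so requests never need a lock-protected lookup table.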
## Thread-per-core (Seastar)
Traditional brokers (Kafka, RabbitMQ) use thread pools shared across CPU cores, which incurs context switches and locking. Redpanda uses Seastar, where:
- Each CPU core has one reactor thread pinned to it.
- Tasks on a core never block; I/O is asynchronous via `io_uring` or `aio`.
- Inter-core communication is through lockless message-passing queues.
- Memory is shard-local; each core has its own slab allocator.
This eliminates lock contention for hot paths and gives predictable tail latency at the cost of slightly more bookkeeping when partitions need cross-shard work (uncommon).
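The shared-nothing pattern above can be modeled in a few lines: each shard owns its state outright, and other shards submit work to it instead of taking a lock. This toy model (class and method names are illustrative, not Seastar's API) shows only the message-passing idea:

```python
# Toy model of Seastar-style cross-shard communication: shard-local state
# plus per-shard work queues, instead of shared state plus mutexes.
from collections import deque

class Shard:
    def __init__(self, sid: int):
        self.sid = sid
        self.counters = {}    # shard-local state: only this shard touches it
        self.inbox = deque()  # stand-in for a lockless message-passing queue

    def submit(self, fn):
        """Another shard enqueues a task rather than mutating our state."""
        self.inbox.append(fn)

    def run_pending(self):
        while self.inbox:
            self.inbox.popleft()(self)

shards = [Shard(i) for i in range(4)]

# Shard 0 wants to bump a counter owned by shard 2: it sends the task.
def bump(s: Shard):
    s.counters["produced"] = s.counters.get("produced", 0) + 1

shards[2].submit(bump)
shards[2].run_pending()  # shard 2's reactor applies the task on its own core
```

In real Seastar the queues are single-producer/single-consumer per core pair and tasks are futures, but the invariant is the same: state is mutated only by the core that owns it.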
## Per-partition Raft
```mermaid
sequenceDiagram
    participant P as Producer
    participant L as Partition Leader (Node A)
    participant F1 as Follower (Node B)
    participant F2 as Follower (Node C)
    P->>L: Produce(records)
    L->>L: append to log + flush
    L->>F1: AppendEntries
    L->>F2: AppendEntries
    F1->>L: ack
    F2->>L: ack
    Note right of L: quorum reached (2/3)
    L->>P: ProduceResponse
```
Unlike Kafka, which uses a separate KRaft quorum for metadata and ISR-based replication for data, Redpanda uses Raft for both metadata and partition data. Benefits:
- Single, well-understood consensus protocol.
- No "elections vs ISR" duality.
- Cluster membership changes are just Raft reconfiguration.
Trade-off: per-partition Raft groups have a small fixed cost (heartbeats, vote tracking), so very tiny topics with many partitions are not the optimal regime.
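The quorum rule driving the sequence diagram above is simple majority arithmetic. A minimal sketch (the function name is illustrative):

```python
# Raft commit rule: an entry is committed once a majority of replicas
# have appended it. The leader's own local append counts as one vote.
def committed(acks_received: int, replicas: int) -> bool:
    """acks_received includes the leader's local append."""
    return acks_received >= replicas // 2 + 1

# Replication factor 3: the leader plus one follower reaches 2/3,
# so a single slow replica does not stall the ProduceResponse.
assert committed(2, 3)
assert not committed(1, 3)
```

This is also why the per-group fixed cost mentioned above exists: every group independently tracks votes and heartbeats to maintain this invariant.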
## Cluster Metadata (Controller)
The controller is a single Raft group covering the whole cluster. It holds:
- Topic configurations and ACLs.
- Partition assignment to brokers.
- Cluster membership and broker registration.
- Schema Registry data (when running in single-process mode).
```mermaid
flowchart LR
    AdminAPI["rpk / Admin API"]
    Controller["Controller (Raft)"]
    Broker1["Broker 1"]
    Broker2["Broker 2"]
    Broker3["Broker 3"]
    AdminAPI --> Controller
    Controller --> Broker1
    Controller --> Broker2
    Controller --> Broker3
```
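Conceptually, the controller is a replicated state machine: every broker applies the same Raft-ordered command log, so metadata converges without ad-hoc synchronization. A sketch with hypothetical command shapes (not Redpanda's wire format):

```python
# Replicated-state-machine sketch: applying the same commands in the same
# order yields the same metadata on every replica.
def apply(state: dict, cmd: dict) -> dict:
    if cmd["op"] == "create_topic":
        state["topics"][cmd["name"]] = {"partitions": cmd["partitions"]}
    elif cmd["op"] == "delete_topic":
        state["topics"].pop(cmd["name"], None)
    return state

# The Raft log fixes a single global order for metadata changes.
log = [
    {"op": "create_topic", "name": "orders", "partitions": 12},
    {"op": "create_topic", "name": "audit", "partitions": 3},
    {"op": "delete_topic", "name": "audit"},
]

state = {"topics": {}}
for cmd in log:
    state = apply(state, cmd)
# Every broker that replays this log reaches the identical state.
```

Because the log order is agreed by Raft, "who saw the topic creation first" is never ambiguous between brokers.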
## Tiered Storage
```mermaid
flowchart LR
    LocalDisk["Local NVMe segments"]
    Uploader["Archival Uploader"]
    S3["S3 / GCS / Azure"]
    SegMeta["Segment metadata\n(controller)"]
    LocalDisk -->|on close| Uploader
    Uploader --> S3
    Uploader --> SegMeta
    Reader["Consumer fetch (old offset)"]
    Reader --> CloudIface["Cloud Storage Interface"]
    CloudIface --> S3
    CloudIface --> Reader
```
When a segment closes (size or time threshold), the uploader pushes it to object storage. Consumers reading old offsets fetch transparently — Redpanda streams from S3 back to the client.
Remote Read Replicas (Enterprise) take this further: a separate Redpanda cluster reads tiered storage directly, allowing analytics-style consumers to bypass the production cluster.
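The routing decision in the read path above boils down to comparing the requested offset with the oldest offset still retained on local disk. A minimal sketch, with an illustrative `local_start_offset` parameter:

```python
# Tiered-storage fetch routing sketch: offsets still on local NVMe are
# served from disk; anything older falls back to the cloud interface.
def fetch_source(offset: int, local_start_offset: int) -> str:
    """local_start_offset = oldest offset retained in local segments."""
    return "local" if offset >= local_start_offset else "cloud"

assert fetch_source(5_000, local_start_offset=4_096) == "local"
assert fetch_source(100, local_start_offset=4_096) == "cloud"  # streamed from S3
```

The client never sees the difference: both branches return the same fetch response, only the latency profile changes.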
## WASM Data Transforms
```mermaid
flowchart LR
    Producer --> Topic1["Topic input"]
    Topic1 --> Transform["WASM transform"]
    Transform --> Topic2["Topic output"]
    Topic2 --> Consumer
```
A WebAssembly module is registered against an input topic; for each record it can emit zero or more records to one or more output topics. Modules are compiled from Go or Rust source with `rpk transform build`.
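The record-level contract is "one record in, zero or more records out". This Python sketch mimics only that contract, not the actual Go/Rust SDK (the redaction policy and record shape are invented for illustration):

```python
# Conceptual model of a data transform: a pure function from an input
# record to a list of output records. Returning [] drops the record.
def transform(record: dict) -> list[dict]:
    if record["value"] is None:
        return []  # tombstone: emit zero output records
    # Example policy: pass the record through with one field redacted.
    value = dict(record["value"], ssn="<redacted>")
    return [{"key": record["key"], "value": value}]

out = transform({"key": "u1", "value": {"name": "Ada", "ssn": "123-45-6789"}})
```

Keeping transforms pure functions of the input record is what lets the broker run them inline without coordination.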
## Iceberg Integration
Redpanda topics can be configured to write directly into an Apache Iceberg table:
```mermaid
flowchart LR
    Producer --> Topic["Redpanda topic"]
    Topic --> IcebergWriter["Iceberg Writer"]
    IcebergWriter --> ParquetFiles["Parquet files (S3)"]
    IcebergWriter --> Catalog["Iceberg catalog"]
    Trino["Trino / Spark / dbt"] --> Catalog
    Trino --> ParquetFiles
```
This pattern collapses the typical Kafka → Connect → Parquet pipeline into a single broker-native flow.
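The writer side of the diagram can be sketched as "buffer, flush, register": records accumulate, get flushed as data files, and each file is recorded with the catalog so query engines discover it. File paths, the flush threshold, and the catalog shape below are all illustrative, not Iceberg's actual spec:

```python
# Toy sketch of the broker-native flow: buffer topic records, flush them
# as data files, and register each file in a catalog stand-in.
def flush(records: list, file_seq: int) -> dict:
    path = f"s3://warehouse/orders/data-{file_seq:05d}.parquet"  # hypothetical path
    return {"path": path, "row_count": len(records)}

catalog = []      # stand-in for Iceberg table metadata
batch, seq = [], 0
for rec in range(250):
    batch.append(rec)
    if len(batch) == 100:  # size-based flush, analogous to segment rolling
        catalog.append(flush(batch, seq))
        batch, seq = [], seq + 1
# The trailing 50 records stay buffered until the next flush trigger.
```

Query engines only ever see files the catalog knows about, which is what makes the flush-then-register ordering safe.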
## Performance Characteristics
| Workload | Numbers (Redpanda blog claims; verify locally) |
|---|---|
| Single-broker NVMe sustained | 1+ GB/s producer throughput |
| 3-broker cluster R3 sustained | 4–5 GB/s aggregate |
| p99 produce latency | sub-10 ms typical, sub-1 ms achievable on well-tuned NVMe |
| Tiered Storage upload throughput | proportional to S3 PUT rate (10–50 MB/s/partition) |
| WASM transform overhead | ~50–100 µs/record for simple transforms |
> **Vendor benchmarks:** Redpanda's published numbers are vendor-controlled, and Confluent has published counter-benchmarks. Always run `openmessaging-benchmark` with your real workload before making sizing decisions.
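One way to use the (hedged) numbers in the table is back-of-envelope sizing: if tiered uploads are PUT-limited to roughly 10-50 MB/s per partition, partition count bounds aggregate archival throughput. A sketch with an illustrative helper name:

```python
# Back-of-envelope: minimum partitions needed to sustain a target
# archival rate, given a per-partition upload ceiling.
import math

def min_partitions(target_mb_s: float, per_partition_mb_s: float) -> int:
    return math.ceil(target_mb_s / per_partition_mb_s)

# To archive 1 GB/s at the pessimistic 10 MB/s/partition ceiling:
assert min_partitions(1024, 10) == 103
# At the optimistic 50 MB/s/partition ceiling:
assert min_partitions(1024, 50) == 21
```

The same arithmetic applies to any per-partition ceiling in the table; substitute measured numbers from your own benchmark runs before sizing.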
## Comparison Hooks
- vs Kafka — same wire protocol; Redpanda wins on ops simplicity and tail latency, Kafka wins on ecosystem breadth and JVM tooling familiarity.
- vs Pulsar — Pulsar's compute/storage split scales differently; Redpanda's per-partition Raft is simpler.
- vs NATS — different protocol; NATS is lighter for sub-ms request-reply but lacks Kafka-API compatibility.