Architecture¶
CockroachDB is a distributed SQL database built for cloud-native resilience. It presents a PostgreSQL-compatible SQL interface while internally organising data as a monolithic sorted key-value map, partitioned into contiguous ranges and replicated across nodes via Raft consensus. The system guarantees serializable isolation and survives full datacenter failures without manual intervention.
See also: index, architecture, operations, security
Layer Model¶
CockroachDB is organised as a stack of five cooperating layers. Each layer communicates only with the layer directly above or below it, keeping concerns cleanly separated.
| Layer | Responsibility | Key Abstractions |
|---|---|---|
| SQL | Parse, plan, optimise, execute SQL | pgwire protocol, planner, optimizer |
| Transaction | ACID guarantees, distributed txns | timestamp cache, txn coordinator, intent resolution |
| Distribution | Route requests to correct range | DistSender, range descriptor cache |
| Replication | Consistent replication via Raft | Raft groups, leaseholder, Raft log |
| Storage | Durable KV write path | Pebble LSM tree, MVCC keys |
graph TB
Client["SQL Client (pgwire)"]
SQL["SQL Layer<br/>(Parser / Optimizer / Executor)"]
TXN["Transaction Layer<br/>(TxnCoordinator / Timestamp Cache)"]
DIST["Distribution Layer<br/>(DistSender / Range Cache)"]
REPL["Replication Layer<br/>(Raft Groups / Leaseholder)"]
STORE["Storage Layer<br/>(Pebble Engine / MVCC)"]
Client --> SQL
SQL --> TXN
TXN --> DIST
DIST --> REPL
REPL --> STORE
style SQL fill:#e8f4f8,stroke:#2196f3,color:#000
style TXN fill:#fff3e0,stroke:#ff9800,color:#000
style DIST fill:#e8f5e9,stroke:#4caf50,color:#000
style REPL fill:#fce4ec,stroke:#e91e63,color:#000
style STORE fill:#f3e5f5,stroke:#9c27b0,color:#000
SQL Layer¶
The SQL layer speaks the PostgreSQL wire protocol (pgwire), making CockroachDB compatible with most PostgreSQL drivers and ORMs out of the box. Its pipeline is:
- Parser -- converts SQL text into an abstract syntax tree.
- Optimizer -- cost-based optimizer (CBO) that explores equivalent query plans using relational algebra transformations, then picks the lowest-cost plan. Heuristic rules handle trivial queries without full exploration.
- Executor -- dispatches the physical plan. For distributed queries, it pushes computation (filtering, aggregation, joins) as close to the data as possible, reducing network round-trips.
PostgreSQL Compatibility
CockroachDB supports a growing subset of PostgreSQL syntax and features, but not all PostgreSQL extensions are available. Window functions, CTEs, JSONB, and ARRAY types are supported; PL/pgSQL stored procedures and some advanced features are not.
Transaction Layer¶
CockroachDB implements distributed ACID transactions using a combination of techniques:
- Timestamp ordering -- every transaction is assigned a HLC (Hybrid Logical Clock) timestamp. Reads observe data at the transaction start timestamp; writes create intent records at the commit timestamp.
- Timestamp Cache -- tracks the highest read timestamp for each key. If a write's timestamp falls below the cached read timestamp for a key, the write's timestamp is pushed forward, preventing anomalies.
- Transaction Coordinator -- the gateway node that initiated the txn maintains a txn record and drives the two-phase commit protocol across all participating ranges.
- Intent Resolution -- uncommitted writes are stored as MVCC intents. On commit, the coordinator resolves intents by converting them to committed values. If the coordinator crashes, other nodes detect the dangling intents via the
TxnHeartbeatmechanism and resolve them asynchronously.
Serializable Isolation Only
Unlike PostgreSQL which offers multiple isolation levels, CockroachDB always runs at serializable isolation. This eliminates read-committed anomalies but may increase transaction retries under high contention.
Distribution Layer¶
The distribution layer maps logical key ranges to physical nodes:
- Range Descriptors -- the keyspace is divided into ranges (default split threshold 512 MiB). Each range has a descriptor listing its replica locations.
- DistSender -- the gateway node's DistSender routes
BatchRequeststo the leaseholder of the target range. It maintains a cache of range descriptors and refreshes it when it encounters stale entries. - Range Splits -- ranges split automatically when they exceed
range_max_bytes(default 512 MiB). Splits can also be triggered manually viaALTER TABLE ... SPLIT ATfor workload isolation.
Replication Layer¶
Each range is replicated via its own Raft group. This layer ensures strong consistency and fault tolerance:
- Replicas -- each range has
num_replicasreplicas (default 3), stored on different nodes. Replica placement is governed by zone configurations. - Leaseholder -- one replica per range holds the range lease. The leaseholder serves reads directly (bypassing Raft) and proposes all writes to the Raft group. It is almost always co-located with the Raft leader for lowest write latency.
- Raft Leader -- proposes entries to the Raft log. A write is committed once a majority of replicas acknowledge it.
- Raft Log -- an on-disk, ordered log of writes agreed upon by the replica group. This is the authoritative source of truth for consistent replication.
graph LR
Client["Client Request"]
LH["Leaseholder<br/>(Node A)"]
R1["Replica Follower<br/>(Node B)"]
R2["Replica Follower<br/>(Node C)"]
Client -->|"Read/Write"| LH
LH -->|"Raft Proposal"| R1
LH -->|"Raft Proposal"| R2
R1 -->|"Ack"| LH
R2 -->|"Ack"| LH
LH -->|"Majority reached<br/>Write committed"| Client
style LH fill:#e8f5e9,stroke:#4caf50,color:#000
style R1 fill:#e3f2fd,stroke:#2196f3,color:#000
style R2 fill:#e3f2fd,stroke:#2196f3,color:#000
Read Performance
Because strongly-consistent reads go only to the leaseholder (no Raft round-trip needed), single-range reads are as fast as a local key-value lookup on the leaseholder node.
Zone Configurations¶
Zone configs control replication topology. A zone config specifies:
| Parameter | Purpose |
|---|---|
num_replicas |
Number of replicas per range |
constraints |
Required or prohibited node attributes (e.g. +region=us-east) |
lease_preferences |
Preferred node attributes for lease placement |
range_min_bytes / range_max_bytes |
Target range size bounds |
gc.ttlseconds |
How long MVCC garbage is retained |
Example: pin a table's data to a specific region:
ALTER DATABASE appdb CONFIGURE ZONE USING
num_replicas = 5,
constraints = '[+region=us-east]';
-- lease_preferences uses double-bracket syntax, e.g. double-bracket with +region=us-east
Storage Layer (Pebble)¶
Pebble is CockroachDB's purpose-built storage engine, written in Go. It is an LSM-tree (Log-Structured Merge-tree) key-value store inspired by LevelDB and RocksDB, with several CockroachDB-specific optimisations.
LSM Tree Structure¶
graph TB
WRITE["Write Request"]
WAL["Write-Ahead Log<br/>(WAL)"]
MEM["MemTable<br/>(in-memory skiplist)"]
L0["L0 SSTables<br/>(sorted, may overlap)"]
L1["L1 SSTables<br/>(sorted, partitioned)"]
L2["L2 SSTables<br/>(~10x L1)"]
LPLUS["L3 ... L6<br/>(each ~10x prior)"]
WRITE --> WAL
WRITE --> MEM
MEM -->|"Flush when full"| L0
L0 -->|"Compaction"| L1
L1 -->|"Compaction"| L2
L2 -->|"Compaction"| LPLUS
style MEM fill:#fff3e0,stroke:#ff9800,color:#000
style WAL fill:#fce4ec,stroke:#e91e63,color:#000
style L0 fill:#e8f5e9,stroke:#4caf50,color:#000
Key Pebble internals:
| Component | Role |
|---|---|
| MemTable | In-memory sorted structure (skiplist). All writes land here first. |
| WAL | Sequential write-ahead log for crash recovery of in-flight MemTable data. |
| L0 SSTables | Flushed MemTables. Sorted within each file but key ranges may overlap across files. |
| L1-L6 SSTables | Compacted levels. Each level is partitioned into non-overlapping key ranges. Each level is approximately 10x the size of the previous. |
| Bloom Filters | Per-SSTable bloom filters reduce disk reads for point lookups on non-existent keys. |
| Block Cache | Frequently accessed data blocks are cached in memory to reduce disk I/O. |
| Compaction | Background process that merges SSTables from one level into the next, removing duplicates, expired MVCC versions, and deleted entries. |
MVCC Integration¶
CockroachDB keys are encoded with MVCC timestamps, producing keys of the form <key>/<timestamp>. Pebble stores these in sorted order, meaning all versions of a key are adjacent. During compaction, Pebble can garbage-collect MVCC versions older than the GC TTL, reclaiming disk space.
Why Pebble (not RocksDB)¶
CockroachDB originally used RocksDB but built Pebble to:
- Eliminate CGo overhead and the complexity of managing a C++ dependency from Go.
- Gain tighter control over compaction heuristics, especially MVCC-aware garbage collection.
- Implement CockroachDB-specific features such as range tombstones (for efficient large-range deletions) and ingest-based external file loading (used for IMPORT and backup restore).
Data Flow: End-to-End Write Path¶
The following diagram traces a single-row INSERT from client to persistent storage:
sequenceDiagram
participant C as SQL Client
participant GW as Gateway Node<br/>(TxnCoordinator)
participant LH as Leaseholder<br/>(Range R1)
participant F1 as Follower Replica
participant F2 as Follower Replica
participant P as Pebble (LH disk)
C->>GW: INSERT INTO orders (id, total) VALUES (42, 99.5)
GW->>GW: Parse + optimize SQL<br/>Convert to KV put
GW->>LH: BatchRequest (KV put for key /Table/51/1/42)
LH->>LH: Acquire range lease<br/>Check timestamp cache
LH->>LH: Write MVCC intent to Pebble
LH->>F1: Raft Propose (intent write)
LH->>F2: Raft Propose (intent write)
F1-->>LH: Raft Ack
F2-->>LH: Raft Ack
LH-->>GW: Write intent stored (majority)
GW->>GW: Record txn status = COMMITTED
GW->>LH: Resolve intent (commit)
LH->>P: Write committed MVCC value<br/>Clear intent
LH-->>GW: Intent resolved
GW-->>C: Query OK, 1 row affected
Replication Topology Patterns¶
Single-Region (3 replicas)¶
Suitable for applications that need HA within one region. Tolerates one node failure.
Multi-Region (5 replicas, 3 regions)¶
Example: 3 voters in us-east, 1 voter in us-central, 1 voter in eu-west. The leaseholder stays in us-east for low-latency reads. Tolerates a full region failure.
Geo-Partitioned¶
Tables and indexes are partitioned by region using zone configs. Each partition's replicas are pinned to the relevant region. Minimises cross-region latency for reads and writes.
Latency Trade-offs
Every write must reach a Raft majority. In multi-region deployments, configure num_voters and voter placement carefully to avoid committing writes across high-latency WAN links on every operation.
Key Configuration Knobs¶
| Setting | Default | Impact |
|---|---|---|
range_max_bytes |
512 MiB | Maximum range size before auto-split |
num_replicas |
3 | Replication factor per range |
gc.ttlseconds |
90000 (25 hours) | MVCC garbage retention window |
kv.raft_log.disable_fsync |
false | Disabling improves write throughput but risks data loss on power failure |
storage.max_sync_duration |
20s | Threshold for slow fsync warnings |
Sources¶
- CockroachDB Architecture Documentation
- Pebble Storage Engine (GitHub)
- CockroachDB Replication Layer
- CockroachDB Distribution Layer
How It Works¶
Range-based sharding, Raft consensus, distributed transaction protocol, and leaseholder reads.
Data Distribution¶
flowchart TB
subgraph Keyspace["Sorted Keyspace"]
R1["Range 1\n/meta → /system"]
R2["Range 2\n/table/users/a-m"]
R3["Range 3\n/table/users/n-z"]
R4["Range 4\n/table/orders/*"]
end
subgraph Cluster_C["3-Node Cluster"]
N1["Node 1\n(R1-leader, R2-follower, R4-leader)"]
N2["Node 2\n(R1-follower, R2-leader, R3-follower)"]
N3["Node 3\n(R1-follower, R3-leader, R4-follower)"]
end
R1 -.-> N1
R2 -.-> N2
R3 -.-> N3
style Keyspace fill:#6933ff,color:#fff
Distributed Transaction (Write)¶
sequenceDiagram
participant Client_C as Client (PG wire)
participant GW as Gateway Node
participant LH1 as Leaseholder (Range A)
participant LH2 as Leaseholder (Range B)
participant Followers as Raft Followers
Client_C->>GW: BEGIN; UPDATE users...; UPDATE orders...;
GW->>LH1: Write intent on users (Range A)
GW->>LH2: Write intent on orders (Range B)
Note over GW: Two-Phase Commit
GW->>LH1: PREPARE (write transaction record)
LH1->>Followers: Raft replicate (majority ACK)
Followers-->>LH1: ACK (2/3 nodes)
GW->>LH1: COMMIT
GW->>LH2: Resolve intent (committed)
Client_C-->>Client_C: COMMIT OK
Leaseholder Reads (Fast Path)¶
Each range has a leaseholder — the node that serves reads without Raft consensus:
| Read Type | Mechanism | Latency |
|---|---|---|
| Leaseholder read | Read from lease holder directly | ~1ms |
| Follower read | Read from closest replica (slightly stale) | ~1ms (geo-local) |
| Consistent read | Must go through leaseholder | +network RTT |
Follower Reads¶
CockroachDB supports follower reads -- reading from the nearest replica without waiting for leaseholder consensus:
- Exact staleness reads (
AS OF SYSTEM TIME): Read from a local replica at a specified timestamp. No network hop needed if the replica is sufficiently caught up. - ** bounded staleness reads**: Read from the closest replica within a staleness bound. CockroachDB automatically chooses the freshest available replica.
This dramatically reduces read latency in geo-distributed deployments by avoiding cross-region round trips to the leaseholder.
Range Splits and Merges¶
CockroachDB dynamically splits and merges ranges based on load:
- Split trigger: When a range exceeds 512 MiB or receives disproportionately high read/write traffic
- Merge trigger: When adjacent ranges are small and quiescent
- Rebalance: The store rebalancer moves range replicas between nodes to equalize disk usage and QPS across the cluster
Sources¶
Benchmarks¶
Scope
Performance characteristics, scaling limits, and resource consumption for CockroachDB.
TPC-C Performance¶
| Nodes | vCPUs/Node | Warehouses | TPS | Efficiency |
|---|---|---|---|---|
| 3 | 4 | 500 | 2,000 | 80% |
| 9 | 16 | 5,000 | 15,000 | 85% |
| 81 | 16 | 50,000 | 120,000 | 90% |
Latency Characteristics¶
| Operation | Same Zone | Cross Zone | Cross Region |
|---|---|---|---|
| Point read | 1-2ms | 2-5ms | 50-200ms |
| Point write | 5-10ms | 10-20ms | 100-300ms |
| Range scan (100 rows) | 5-15ms | 10-30ms | 100-500ms |
Scaling Limits¶
| Dimension | Limit | Notes |
|---|---|---|
| Cluster size | 500+ nodes | Tested by Cockroach Labs |
| Database size | 100TB+ | Linear scaling with nodes |
| Concurrent connections | 10,000+ per node | Connection pooling recommended |
| Transaction size | 64MB | kv.transaction.max_intents_bytes |
Sourcing Status¶
Unsourced Performance Data
The performance numbers in this document are estimated from vendor documentation, community benchmarks, and engineering judgment. They do not represent controlled benchmarks with documented test conditions. Specific hardware configurations, software versions, and test methodologies were not recorded.
Use these figures as rough guidance only. For production capacity planning, run your own benchmarks against your specific workload and infrastructure.