Architecture¶

CockroachDB is a distributed SQL database built for cloud-native resilience. It presents a PostgreSQL-compatible SQL interface while internally organising data as a monolithic sorted key-value map, partitioned into contiguous ranges and replicated across nodes via Raft consensus. The system guarantees serializable isolation and survives full datacenter failures without manual intervention.

See also: index, architecture, operations, security

Layer Model¶

CockroachDB is organised as a stack of five cooperating layers. Each layer communicates only with the layer directly above or below it, keeping concerns cleanly separated.

Layer	Responsibility	Key Abstractions
SQL	Parse, plan, optimise, execute SQL	pgwire protocol, planner, optimizer
Transaction	ACID guarantees, distributed txns	timestamp cache, txn coordinator, intent resolution
Distribution	Route requests to correct range	DistSender, range descriptor cache
Replication	Consistent replication via Raft	Raft groups, leaseholder, Raft log
Storage	Durable KV write path	Pebble LSM tree, MVCC keys

graph TB
    Client["SQL Client (pgwire)"]
    SQL["SQL Layer<br/>(Parser / Optimizer / Executor)"]
    TXN["Transaction Layer<br/>(TxnCoordinator / Timestamp Cache)"]
    DIST["Distribution Layer<br/>(DistSender / Range Cache)"]
    REPL["Replication Layer<br/>(Raft Groups / Leaseholder)"]
    STORE["Storage Layer<br/>(Pebble Engine / MVCC)"]

    Client --> SQL
    SQL --> TXN
    TXN --> DIST
    DIST --> REPL
    REPL --> STORE

    style SQL fill:#e8f4f8,stroke:#2196f3,color:#000
    style TXN fill:#fff3e0,stroke:#ff9800,color:#000
    style DIST fill:#e8f5e9,stroke:#4caf50,color:#000
    style REPL fill:#fce4ec,stroke:#e91e63,color:#000
    style STORE fill:#f3e5f5,stroke:#9c27b0,color:#000

SQL Layer¶

The SQL layer speaks the PostgreSQL wire protocol (pgwire), making CockroachDB compatible with most PostgreSQL drivers and ORMs out of the box. Its pipeline is:

Parser -- converts SQL text into an abstract syntax tree.
Optimizer -- cost-based optimizer (CBO) that explores equivalent query plans using relational algebra transformations, then picks the lowest-cost plan. Heuristic rules handle trivial queries without full exploration.
Executor -- dispatches the physical plan. For distributed queries, it pushes computation (filtering, aggregation, joins) as close to the data as possible, reducing network round-trips.

PostgreSQL Compatibility

CockroachDB supports a growing subset of PostgreSQL syntax and features, but not all PostgreSQL extensions are available. Window functions, CTEs, JSONB, and ARRAY types are supported; PL/pgSQL stored procedures and some advanced features are not.

Transaction Layer¶

CockroachDB implements distributed ACID transactions using a combination of techniques:

Timestamp ordering -- every transaction is assigned a HLC (Hybrid Logical Clock) timestamp. Reads observe data at the transaction start timestamp; writes create intent records at the commit timestamp.
Timestamp Cache -- tracks the highest read timestamp for each key. If a write's timestamp falls below the cached read timestamp for a key, the write's timestamp is pushed forward, preventing anomalies.
Transaction Coordinator -- the gateway node that initiated the txn maintains a txn record and drives the two-phase commit protocol across all participating ranges.
Intent Resolution -- uncommitted writes are stored as MVCC intents. On commit, the coordinator resolves intents by converting them to committed values. If the coordinator crashes, other nodes detect the dangling intents via the TxnHeartbeat mechanism and resolve them asynchronously.

Serializable Isolation Only

Unlike PostgreSQL which offers multiple isolation levels, CockroachDB always runs at serializable isolation. This eliminates read-committed anomalies but may increase transaction retries under high contention.

Distribution Layer¶

The distribution layer maps logical key ranges to physical nodes:

Range Descriptors -- the keyspace is divided into ranges (default split threshold 512 MiB). Each range has a descriptor listing its replica locations.
DistSender -- the gateway node's DistSender routes BatchRequests to the leaseholder of the target range. It maintains a cache of range descriptors and refreshes it when it encounters stale entries.
Range Splits -- ranges split automatically when they exceed range_max_bytes (default 512 MiB). Splits can also be triggered manually via ALTER TABLE ... SPLIT AT for workload isolation.

Replication Layer¶

Each range is replicated via its own Raft group. This layer ensures strong consistency and fault tolerance:

Replicas -- each range has num_replicas replicas (default 3), stored on different nodes. Replica placement is governed by zone configurations.
Leaseholder -- one replica per range holds the range lease. The leaseholder serves reads directly (bypassing Raft) and proposes all writes to the Raft group. It is almost always co-located with the Raft leader for lowest write latency.
Raft Leader -- proposes entries to the Raft log. A write is committed once a majority of replicas acknowledge it.
Raft Log -- an on-disk, ordered log of writes agreed upon by the replica group. This is the authoritative source of truth for consistent replication.

graph LR
    Client["Client Request"]
    LH["Leaseholder<br/>(Node A)"]
    R1["Replica Follower<br/>(Node B)"]
    R2["Replica Follower<br/>(Node C)"]

    Client -->|"Read/Write"| LH
    LH -->|"Raft Proposal"| R1
    LH -->|"Raft Proposal"| R2
    R1 -->|"Ack"| LH
    R2 -->|"Ack"| LH
    LH -->|"Majority reached<br/>Write committed"| Client

    style LH fill:#e8f5e9,stroke:#4caf50,color:#000
    style R1 fill:#e3f2fd,stroke:#2196f3,color:#000
    style R2 fill:#e3f2fd,stroke:#2196f3,color:#000

Read Performance

Because strongly-consistent reads go only to the leaseholder (no Raft round-trip needed), single-range reads are as fast as a local key-value lookup on the leaseholder node.

Zone Configurations¶

Zone configs control replication topology. A zone config specifies:

Parameter	Purpose
`num_replicas`	Number of replicas per range
`constraints`	Required or prohibited node attributes (e.g. `+region=us-east`)
`lease_preferences`	Preferred node attributes for lease placement
`range_min_bytes` / `range_max_bytes`	Target range size bounds
`gc.ttlseconds`	How long MVCC garbage is retained

Example: pin a table's data to a specific region:

ALTER DATABASE appdb CONFIGURE ZONE USING
    num_replicas = 5,
    constraints = '[+region=us-east]';
-- lease_preferences uses double-bracket syntax, e.g. double-bracket with +region=us-east

Storage Layer (Pebble)¶

Pebble is CockroachDB's purpose-built storage engine, written in Go. It is an LSM-tree (Log-Structured Merge-tree) key-value store inspired by LevelDB and RocksDB, with several CockroachDB-specific optimisations.

LSM Tree Structure¶

graph TB
    WRITE["Write Request"]
    WAL["Write-Ahead Log<br/>(WAL)"]
    MEM["MemTable<br/>(in-memory skiplist)"]
    L0["L0 SSTables<br/>(sorted, may overlap)"]
    L1["L1 SSTables<br/>(sorted, partitioned)"]
    L2["L2 SSTables<br/>(~10x L1)"]
    LPLUS["L3 ... L6<br/>(each ~10x prior)"]

    WRITE --> WAL
    WRITE --> MEM
    MEM -->|"Flush when full"| L0
    L0 -->|"Compaction"| L1
    L1 -->|"Compaction"| L2
    L2 -->|"Compaction"| LPLUS

    style MEM fill:#fff3e0,stroke:#ff9800,color:#000
    style WAL fill:#fce4ec,stroke:#e91e63,color:#000
    style L0 fill:#e8f5e9,stroke:#4caf50,color:#000

Key Pebble internals:

Component	Role
MemTable	In-memory sorted structure (skiplist). All writes land here first.
WAL	Sequential write-ahead log for crash recovery of in-flight MemTable data.
L0 SSTables	Flushed MemTables. Sorted within each file but key ranges may overlap across files.
L1-L6 SSTables	Compacted levels. Each level is partitioned into non-overlapping key ranges. Each level is approximately 10x the size of the previous.
Bloom Filters	Per-SSTable bloom filters reduce disk reads for point lookups on non-existent keys.
Block Cache	Frequently accessed data blocks are cached in memory to reduce disk I/O.
Compaction	Background process that merges SSTables from one level into the next, removing duplicates, expired MVCC versions, and deleted entries.

MVCC Integration¶

CockroachDB keys are encoded with MVCC timestamps, producing keys of the form <key>/<timestamp>. Pebble stores these in sorted order, meaning all versions of a key are adjacent. During compaction, Pebble can garbage-collect MVCC versions older than the GC TTL, reclaiming disk space.

Why Pebble (not RocksDB)¶

CockroachDB originally used RocksDB but built Pebble to:

Eliminate CGo overhead and the complexity of managing a C++ dependency from Go.
Gain tighter control over compaction heuristics, especially MVCC-aware garbage collection.
Implement CockroachDB-specific features such as range tombstones (for efficient large-range deletions) and ingest-based external file loading (used for IMPORT and backup restore).

Data Flow: End-to-End Write Path¶

The following diagram traces a single-row INSERT from client to persistent storage:

sequenceDiagram
    participant C as SQL Client
    participant GW as Gateway Node<br/>(TxnCoordinator)
    participant LH as Leaseholder<br/>(Range R1)
    participant F1 as Follower Replica
    participant F2 as Follower Replica
    participant P as Pebble (LH disk)

    C->>GW: INSERT INTO orders (id, total) VALUES (42, 99.5)
    GW->>GW: Parse + optimize SQL<br/>Convert to KV put
    GW->>LH: BatchRequest (KV put for key /Table/51/1/42)
    LH->>LH: Acquire range lease<br/>Check timestamp cache
    LH->>LH: Write MVCC intent to Pebble
    LH->>F1: Raft Propose (intent write)
    LH->>F2: Raft Propose (intent write)
    F1-->>LH: Raft Ack
    F2-->>LH: Raft Ack
    LH-->>GW: Write intent stored (majority)
    GW->>GW: Record txn status = COMMITTED
    GW->>LH: Resolve intent (commit)
    LH->>P: Write committed MVCC value<br/>Clear intent
    LH-->>GW: Intent resolved
    GW-->>C: Query OK, 1 row affected

Replication Topology Patterns¶

Single-Region (3 replicas)¶

Suitable for applications that need HA within one region. Tolerates one node failure.

Multi-Region (5 replicas, 3 regions)¶

Example: 3 voters in us-east, 1 voter in us-central, 1 voter in eu-west. The leaseholder stays in us-east for low-latency reads. Tolerates a full region failure.

Geo-Partitioned¶

Tables and indexes are partitioned by region using zone configs. Each partition's replicas are pinned to the relevant region. Minimises cross-region latency for reads and writes.

Latency Trade-offs

Every write must reach a Raft majority. In multi-region deployments, configure num_voters and voter placement carefully to avoid committing writes across high-latency WAN links on every operation.

Key Configuration Knobs¶

Setting	Default	Impact
`range_max_bytes`	512 MiB	Maximum range size before auto-split
`num_replicas`	3	Replication factor per range
`gc.ttlseconds`	90000 (25 hours)	MVCC garbage retention window
`kv.raft_log.disable_fsync`	false	Disabling improves write throughput but risks data loss on power failure
`storage.max_sync_duration`	20s	Threshold for slow fsync warnings

Sources¶

How It Works¶

Range-based sharding, Raft consensus, distributed transaction protocol, and leaseholder reads.

Data Distribution¶

flowchart TB
    subgraph Keyspace["Sorted Keyspace"]
        R1["Range 1\n/meta → /system"]
        R2["Range 2\n/table/users/a-m"]
        R3["Range 3\n/table/users/n-z"]
        R4["Range 4\n/table/orders/*"]
    end

    subgraph Cluster_C["3-Node Cluster"]
        N1["Node 1\n(R1-leader, R2-follower, R4-leader)"]
        N2["Node 2\n(R1-follower, R2-leader, R3-follower)"]
        N3["Node 3\n(R1-follower, R3-leader, R4-follower)"]
    end

    R1 -.-> N1
    R2 -.-> N2
    R3 -.-> N3

    style Keyspace fill:#6933ff,color:#fff

Distributed Transaction (Write)¶

sequenceDiagram
    participant Client_C as Client (PG wire)
    participant GW as Gateway Node
    participant LH1 as Leaseholder (Range A)
    participant LH2 as Leaseholder (Range B)
    participant Followers as Raft Followers

    Client_C->>GW: BEGIN; UPDATE users...; UPDATE orders...;
    GW->>LH1: Write intent on users (Range A)
    GW->>LH2: Write intent on orders (Range B)

    Note over GW: Two-Phase Commit
    GW->>LH1: PREPARE (write transaction record)
    LH1->>Followers: Raft replicate (majority ACK)
    Followers-->>LH1: ACK (2/3 nodes)
    GW->>LH1: COMMIT
    GW->>LH2: Resolve intent (committed)

    Client_C-->>Client_C: COMMIT OK

Leaseholder Reads (Fast Path)¶

Each range has a leaseholder — the node that serves reads without Raft consensus:

Read Type	Mechanism	Latency
Leaseholder read	Read from lease holder directly	~1ms
Follower read	Read from closest replica (slightly stale)	~1ms (geo-local)
Consistent read	Must go through leaseholder	+network RTT

Follower Reads¶

CockroachDB supports follower reads -- reading from the nearest replica without waiting for leaseholder consensus:

Exact staleness reads (AS OF SYSTEM TIME): Read from a local replica at a specified timestamp. No network hop needed if the replica is sufficiently caught up.
** bounded staleness reads**: Read from the closest replica within a staleness bound. CockroachDB automatically chooses the freshest available replica.

This dramatically reduces read latency in geo-distributed deployments by avoiding cross-region round trips to the leaseholder.

Range Splits and Merges¶

CockroachDB dynamically splits and merges ranges based on load:

Split trigger: When a range exceeds 512 MiB or receives disproportionately high read/write traffic
Merge trigger: When adjacent ranges are small and quiescent
Rebalance: The store rebalancer moves range replicas between nodes to equalize disk usage and QPS across the cluster

Sources¶

Benchmarks¶

Scope

Performance characteristics, scaling limits, and resource consumption for CockroachDB.

TPC-C Performance¶

Nodes	vCPUs/Node	Warehouses	TPS	Efficiency
3	4	500	2,000	80%
9	16	5,000	15,000	85%
81	16	50,000	120,000	90%

Latency Characteristics¶

Operation	Same Zone	Cross Zone	Cross Region
Point read	1-2ms	2-5ms	50-200ms
Point write	5-10ms	10-20ms	100-300ms
Range scan (100 rows)	5-15ms	10-30ms	100-500ms

Scaling Limits¶

Dimension	Limit	Notes
Cluster size	500+ nodes	Tested by Cockroach Labs
Database size	100TB+	Linear scaling with nodes
Concurrent connections	10,000+ per node	Connection pooling recommended
Transaction size	64MB	kv.transaction.max_intents_bytes

Sourcing Status¶

Unsourced Performance Data

The performance numbers in this document are estimated from vendor documentation, community benchmarks, and engineering judgment. They do not represent controlled benchmarks with documented test conditions. Specific hardware configurations, software versions, and test methodologies were not recorded.

Use these figures as rough guidance only. For production capacity planning, run your own benchmarks against your specific workload and infrastructure.