Skip to content

Architecture

CockroachDB is a distributed SQL database built for cloud-native resilience. It presents a PostgreSQL-compatible SQL interface while internally organising data as a monolithic sorted key-value map, partitioned into contiguous ranges and replicated across nodes via Raft consensus. The system guarantees serializable isolation and survives full datacenter failures without manual intervention.

See also: index, architecture, operations, security


Layer Model

CockroachDB is organised as a stack of five cooperating layers. Each layer communicates only with the layer directly above or below it, keeping concerns cleanly separated.

Layer Responsibility Key Abstractions
SQL Parse, plan, optimise, execute SQL pgwire protocol, planner, optimizer
Transaction ACID guarantees, distributed txns timestamp cache, txn coordinator, intent resolution
Distribution Route requests to correct range DistSender, range descriptor cache
Replication Consistent replication via Raft Raft groups, leaseholder, Raft log
Storage Durable KV write path Pebble LSM tree, MVCC keys
graph TB
    Client["SQL Client (pgwire)"]
    SQL["SQL Layer<br/>(Parser / Optimizer / Executor)"]
    TXN["Transaction Layer<br/>(TxnCoordinator / Timestamp Cache)"]
    DIST["Distribution Layer<br/>(DistSender / Range Cache)"]
    REPL["Replication Layer<br/>(Raft Groups / Leaseholder)"]
    STORE["Storage Layer<br/>(Pebble Engine / MVCC)"]

    Client --> SQL
    SQL --> TXN
    TXN --> DIST
    DIST --> REPL
    REPL --> STORE

    style SQL fill:#e8f4f8,stroke:#2196f3,color:#000
    style TXN fill:#fff3e0,stroke:#ff9800,color:#000
    style DIST fill:#e8f5e9,stroke:#4caf50,color:#000
    style REPL fill:#fce4ec,stroke:#e91e63,color:#000
    style STORE fill:#f3e5f5,stroke:#9c27b0,color:#000

SQL Layer

The SQL layer speaks the PostgreSQL wire protocol (pgwire), making CockroachDB compatible with most PostgreSQL drivers and ORMs out of the box. Its pipeline is:

  1. Parser -- converts SQL text into an abstract syntax tree.
  2. Optimizer -- cost-based optimizer (CBO) that explores equivalent query plans using relational algebra transformations, then picks the lowest-cost plan. Heuristic rules handle trivial queries without full exploration.
  3. Executor -- dispatches the physical plan. For distributed queries, it pushes computation (filtering, aggregation, joins) as close to the data as possible, reducing network round-trips.

PostgreSQL Compatibility

CockroachDB supports a growing subset of PostgreSQL syntax and features, but not all PostgreSQL extensions are available. Window functions, CTEs, JSONB, and ARRAY types are supported; PL/pgSQL stored procedures and some advanced features are not.


Transaction Layer

CockroachDB implements distributed ACID transactions using a combination of techniques:

  • Timestamp ordering -- every transaction is assigned a HLC (Hybrid Logical Clock) timestamp. Reads observe data at the transaction start timestamp; writes create intent records at the commit timestamp.
  • Timestamp Cache -- tracks the highest read timestamp for each key. If a write's timestamp falls below the cached read timestamp for a key, the write's timestamp is pushed forward, preventing anomalies.
  • Transaction Coordinator -- the gateway node that initiated the txn maintains a txn record and drives the two-phase commit protocol across all participating ranges.
  • Intent Resolution -- uncommitted writes are stored as MVCC intents. On commit, the coordinator resolves intents by converting them to committed values. If the coordinator crashes, other nodes detect the dangling intents via the TxnHeartbeat mechanism and resolve them asynchronously.

Serializable Isolation Only

Unlike PostgreSQL which offers multiple isolation levels, CockroachDB always runs at serializable isolation. This eliminates read-committed anomalies but may increase transaction retries under high contention.


Distribution Layer

The distribution layer maps logical key ranges to physical nodes:

  • Range Descriptors -- the keyspace is divided into ranges (default split threshold 512 MiB). Each range has a descriptor listing its replica locations.
  • DistSender -- the gateway node's DistSender routes BatchRequests to the leaseholder of the target range. It maintains a cache of range descriptors and refreshes it when it encounters stale entries.
  • Range Splits -- ranges split automatically when they exceed range_max_bytes (default 512 MiB). Splits can also be triggered manually via ALTER TABLE ... SPLIT AT for workload isolation.

Replication Layer

Each range is replicated via its own Raft group. This layer ensures strong consistency and fault tolerance:

  • Replicas -- each range has num_replicas replicas (default 3), stored on different nodes. Replica placement is governed by zone configurations.
  • Leaseholder -- one replica per range holds the range lease. The leaseholder serves reads directly (bypassing Raft) and proposes all writes to the Raft group. It is almost always co-located with the Raft leader for lowest write latency.
  • Raft Leader -- proposes entries to the Raft log. A write is committed once a majority of replicas acknowledge it.
  • Raft Log -- an on-disk, ordered log of writes agreed upon by the replica group. This is the authoritative source of truth for consistent replication.
graph LR
    Client["Client Request"]
    LH["Leaseholder<br/>(Node A)"]
    R1["Replica Follower<br/>(Node B)"]
    R2["Replica Follower<br/>(Node C)"]

    Client -->|"Read/Write"| LH
    LH -->|"Raft Proposal"| R1
    LH -->|"Raft Proposal"| R2
    R1 -->|"Ack"| LH
    R2 -->|"Ack"| LH
    LH -->|"Majority reached<br/>Write committed"| Client

    style LH fill:#e8f5e9,stroke:#4caf50,color:#000
    style R1 fill:#e3f2fd,stroke:#2196f3,color:#000
    style R2 fill:#e3f2fd,stroke:#2196f3,color:#000

Read Performance

Because strongly-consistent reads go only to the leaseholder (no Raft round-trip needed), single-range reads are as fast as a local key-value lookup on the leaseholder node.

Zone Configurations

Zone configs control replication topology. A zone config specifies:

Parameter Purpose
num_replicas Number of replicas per range
constraints Required or prohibited node attributes (e.g. +region=us-east)
lease_preferences Preferred node attributes for lease placement
range_min_bytes / range_max_bytes Target range size bounds
gc.ttlseconds How long MVCC garbage is retained

Example: pin a table's data to a specific region:

ALTER DATABASE appdb CONFIGURE ZONE USING
    num_replicas = 5,
    constraints = '[+region=us-east]';
-- lease_preferences uses double-bracket syntax, e.g. double-bracket with +region=us-east

Storage Layer (Pebble)

Pebble is CockroachDB's purpose-built storage engine, written in Go. It is an LSM-tree (Log-Structured Merge-tree) key-value store inspired by LevelDB and RocksDB, with several CockroachDB-specific optimisations.

LSM Tree Structure

graph TB
    WRITE["Write Request"]
    WAL["Write-Ahead Log<br/>(WAL)"]
    MEM["MemTable<br/>(in-memory skiplist)"]
    L0["L0 SSTables<br/>(sorted, may overlap)"]
    L1["L1 SSTables<br/>(sorted, partitioned)"]
    L2["L2 SSTables<br/>(~10x L1)"]
    LPLUS["L3 ... L6<br/>(each ~10x prior)"]

    WRITE --> WAL
    WRITE --> MEM
    MEM -->|"Flush when full"| L0
    L0 -->|"Compaction"| L1
    L1 -->|"Compaction"| L2
    L2 -->|"Compaction"| LPLUS

    style MEM fill:#fff3e0,stroke:#ff9800,color:#000
    style WAL fill:#fce4ec,stroke:#e91e63,color:#000
    style L0 fill:#e8f5e9,stroke:#4caf50,color:#000

Key Pebble internals:

Component Role
MemTable In-memory sorted structure (skiplist). All writes land here first.
WAL Sequential write-ahead log for crash recovery of in-flight MemTable data.
L0 SSTables Flushed MemTables. Sorted within each file but key ranges may overlap across files.
L1-L6 SSTables Compacted levels. Each level is partitioned into non-overlapping key ranges. Each level is approximately 10x the size of the previous.
Bloom Filters Per-SSTable bloom filters reduce disk reads for point lookups on non-existent keys.
Block Cache Frequently accessed data blocks are cached in memory to reduce disk I/O.
Compaction Background process that merges SSTables from one level into the next, removing duplicates, expired MVCC versions, and deleted entries.

MVCC Integration

CockroachDB keys are encoded with MVCC timestamps, producing keys of the form <key>/<timestamp>. Pebble stores these in sorted order, meaning all versions of a key are adjacent. During compaction, Pebble can garbage-collect MVCC versions older than the GC TTL, reclaiming disk space.

Why Pebble (not RocksDB)

CockroachDB originally used RocksDB but built Pebble to:

  • Eliminate CGo overhead and the complexity of managing a C++ dependency from Go.
  • Gain tighter control over compaction heuristics, especially MVCC-aware garbage collection.
  • Implement CockroachDB-specific features such as range tombstones (for efficient large-range deletions) and ingest-based external file loading (used for IMPORT and backup restore).

Data Flow: End-to-End Write Path

The following diagram traces a single-row INSERT from client to persistent storage:

sequenceDiagram
    participant C as SQL Client
    participant GW as Gateway Node<br/>(TxnCoordinator)
    participant LH as Leaseholder<br/>(Range R1)
    participant F1 as Follower Replica
    participant F2 as Follower Replica
    participant P as Pebble (LH disk)

    C->>GW: INSERT INTO orders (id, total) VALUES (42, 99.5)
    GW->>GW: Parse + optimize SQL<br/>Convert to KV put
    GW->>LH: BatchRequest (KV put for key /Table/51/1/42)
    LH->>LH: Acquire range lease<br/>Check timestamp cache
    LH->>LH: Write MVCC intent to Pebble
    LH->>F1: Raft Propose (intent write)
    LH->>F2: Raft Propose (intent write)
    F1-->>LH: Raft Ack
    F2-->>LH: Raft Ack
    LH-->>GW: Write intent stored (majority)
    GW->>GW: Record txn status = COMMITTED
    GW->>LH: Resolve intent (commit)
    LH->>P: Write committed MVCC value<br/>Clear intent
    LH-->>GW: Intent resolved
    GW-->>C: Query OK, 1 row affected

Replication Topology Patterns

Single-Region (3 replicas)

Suitable for applications that need HA within one region. Tolerates one node failure.

Multi-Region (5 replicas, 3 regions)

Example: 3 voters in us-east, 1 voter in us-central, 1 voter in eu-west. The leaseholder stays in us-east for low-latency reads. Tolerates a full region failure.

Geo-Partitioned

Tables and indexes are partitioned by region using zone configs. Each partition's replicas are pinned to the relevant region. Minimises cross-region latency for reads and writes.

Latency Trade-offs

Every write must reach a Raft majority. In multi-region deployments, configure num_voters and voter placement carefully to avoid committing writes across high-latency WAN links on every operation.


Key Configuration Knobs

Setting Default Impact
range_max_bytes 512 MiB Maximum range size before auto-split
num_replicas 3 Replication factor per range
gc.ttlseconds 90000 (25 hours) MVCC garbage retention window
kv.raft_log.disable_fsync false Disabling improves write throughput but risks data loss on power failure
storage.max_sync_duration 20s Threshold for slow fsync warnings

Sources


How It Works

Range-based sharding, Raft consensus, distributed transaction protocol, and leaseholder reads.

Data Distribution

flowchart TB
    subgraph Keyspace["Sorted Keyspace"]
        R1["Range 1\n/meta → /system"]
        R2["Range 2\n/table/users/a-m"]
        R3["Range 3\n/table/users/n-z"]
        R4["Range 4\n/table/orders/*"]
    end

    subgraph Cluster_C["3-Node Cluster"]
        N1["Node 1\n(R1-leader, R2-follower, R4-leader)"]
        N2["Node 2\n(R1-follower, R2-leader, R3-follower)"]
        N3["Node 3\n(R1-follower, R3-leader, R4-follower)"]
    end

    R1 -.-> N1
    R2 -.-> N2
    R3 -.-> N3

    style Keyspace fill:#6933ff,color:#fff

Distributed Transaction (Write)

sequenceDiagram
    participant Client_C as Client (PG wire)
    participant GW as Gateway Node
    participant LH1 as Leaseholder (Range A)
    participant LH2 as Leaseholder (Range B)
    participant Followers as Raft Followers

    Client_C->>GW: BEGIN; UPDATE users...; UPDATE orders...;
    GW->>LH1: Write intent on users (Range A)
    GW->>LH2: Write intent on orders (Range B)

    Note over GW: Two-Phase Commit
    GW->>LH1: PREPARE (write transaction record)
    LH1->>Followers: Raft replicate (majority ACK)
    Followers-->>LH1: ACK (2/3 nodes)
    GW->>LH1: COMMIT
    GW->>LH2: Resolve intent (committed)

    Client_C-->>Client_C: COMMIT OK

Leaseholder Reads (Fast Path)

Each range has a leaseholder — the node that serves reads without Raft consensus:

Read Type Mechanism Latency
Leaseholder read Read from lease holder directly ~1ms
Follower read Read from closest replica (slightly stale) ~1ms (geo-local)
Consistent read Must go through leaseholder +network RTT

Follower Reads

CockroachDB supports follower reads -- reading from the nearest replica without waiting for leaseholder consensus:

  • Exact staleness reads (AS OF SYSTEM TIME): Read from a local replica at a specified timestamp. No network hop needed if the replica is sufficiently caught up.
  • ** bounded staleness reads**: Read from the closest replica within a staleness bound. CockroachDB automatically chooses the freshest available replica.

This dramatically reduces read latency in geo-distributed deployments by avoiding cross-region round trips to the leaseholder.

Range Splits and Merges

CockroachDB dynamically splits and merges ranges based on load:

  • Split trigger: When a range exceeds 512 MiB or receives disproportionately high read/write traffic
  • Merge trigger: When adjacent ranges are small and quiescent
  • Rebalance: The store rebalancer moves range replicas between nodes to equalize disk usage and QPS across the cluster

Sources


Benchmarks

Scope

Performance characteristics, scaling limits, and resource consumption for CockroachDB.

TPC-C Performance

Nodes vCPUs/Node Warehouses TPS Efficiency
3 4 500 2,000 80%
9 16 5,000 15,000 85%
81 16 50,000 120,000 90%

Latency Characteristics

Operation Same Zone Cross Zone Cross Region
Point read 1-2ms 2-5ms 50-200ms
Point write 5-10ms 10-20ms 100-300ms
Range scan (100 rows) 5-15ms 10-30ms 100-500ms

Scaling Limits

Dimension Limit Notes
Cluster size 500+ nodes Tested by Cockroach Labs
Database size 100TB+ Linear scaling with nodes
Concurrent connections 10,000+ per node Connection pooling recommended
Transaction size 64MB kv.transaction.max_intents_bytes

Sourcing Status

Unsourced Performance Data

The performance numbers in this document are estimated from vendor documentation, community benchmarks, and engineering judgment. They do not represent controlled benchmarks with documented test conditions. Specific hardware configurations, software versions, and test methodologies were not recorded.

Use these figures as rough guidance only. For production capacity planning, run your own benchmarks against your specific workload and infrastructure.

Sources