
CockroachDB — How It Works

Range-based sharding, Raft consensus, distributed transaction protocol, and leaseholder reads.

Data Distribution

flowchart TB
    subgraph Keyspace["Sorted Keyspace"]
        R1["Range 1\n/meta → /system"]
        R2["Range 2\n/table/users/a-m"]
        R3["Range 3\n/table/users/n-z"]
        R4["Range 4\n/table/orders/*"]
    end

    subgraph Cluster_C["3-Node Cluster"]
        N1["Node 1\n(R1-leader, R2-follower, R4-leader)"]
        N2["Node 2\n(R1-follower, R2-leader, R3-follower)"]
        N3["Node 3\n(R1-follower, R3-leader, R4-follower)"]
    end

    R1 -.-> N1
    R2 -.-> N2
    R3 -.-> N3
    R4 -.-> N1

    style Keyspace fill:#6933ff,color:#fff
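
The keyspace-to-range mapping in the diagram is just an ordered lookup over split points: a key belongs to the last range whose start key sorts at or before it. A minimal sketch, with hypothetical split keys mirroring the diagram (real split points live in CockroachDB's meta ranges, not in client code):

```python
import bisect

# Hypothetical range start keys, kept in sorted (lexicographic) order.
# Each range covers [range_starts[i], range_starts[i+1]).
range_starts = ["/meta", "/table/orders", "/table/users/a", "/table/users/n"]
range_names = ["Range 1", "Range 4", "Range 2", "Range 3"]

def range_for_key(key: str) -> str:
    """Return the range owning `key`: the last range whose start <= key."""
    i = bisect.bisect_right(range_starts, key) - 1
    return range_names[max(i, 0)]

print(range_for_key("/table/users/bob"))  # Range 2 (users a-m)
```

Because lookups are ordered rather than hashed, range scans stay contiguous, and a range can split in two without rehashing the rest of the keyspace.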

Distributed Transaction (Write)

sequenceDiagram
    participant Client_C as Client (PG wire)
    participant GW as Gateway Node
    participant LH1 as Leaseholder (Range A)
    participant LH2 as Leaseholder (Range B)
    participant Followers as Raft Followers

    Client_C->>GW: BEGIN; UPDATE users...; UPDATE orders...;
    GW->>LH1: Write intent on users (Range A)
    GW->>LH2: Write intent on orders (Range B)

    Note over GW: Two-Phase Commit (intents act as the prepare phase)
    GW->>LH1: Write transaction record (PENDING)
    LH1->>Followers: Raft replicate (majority ACK)
    Followers-->>LH1: ACK (2/3 nodes)
    GW->>LH1: COMMIT (flip record to COMMITTED)
    GW->>LH1: Resolve intent (committed)
    GW->>LH2: Resolve intent (committed)

    GW-->>Client_C: COMMIT OK
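
The sequence above can be modeled as a toy in-memory version of the protocol: provisional write intents per range, an atomic status flip on the transaction record, then intent resolution. This is a simplified sketch, not CockroachDB's implementation (the real system pipelines intents, replicates each step through Raft, and uses parallel commits with a STAGING record):

```python
class Txn:
    """A transaction record; its status flip is the atomic commit point."""
    def __init__(self, txn_id):
        self.id = txn_id
        self.status = "PENDING"
        self.intents = []  # (range_store, key) pairs written so far

class RangeStore:
    """One range's key-value data, as held by its leaseholder."""
    def __init__(self):
        self.committed = {}  # key -> visible value
        self.intents = {}    # key -> (txn, provisional value)

    def write_intent(self, txn, key, value):
        # Provisional write: invisible to other readers until txn commits.
        self.intents[key] = (txn, value)
        txn.intents.append((self, key))

def commit(txn):
    txn.status = "COMMITTED"  # atomic flip of the transaction record
    for store, key in txn.intents:  # intent resolution (async in practice)
        _, value = store.intents.pop(key)
        store.committed[key] = value

users, orders = RangeStore(), RangeStore()   # Range A, Range B
txn = Txn("t1")
users.write_intent(txn, "users/alice", "balance=90")
orders.write_intent(txn, "orders/42", "status=paid")
commit(txn)
print(orders.committed["orders/42"])  # status=paid
```

The key property the sketch preserves: either both ranges' writes become visible or neither does, because visibility hinges on the single transaction-record status, not on per-range state.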

Leaseholder Reads (Fast Path)

Each range has a leaseholder: the one replica that serves consistent reads without a Raft round trip.

| Read Type | Mechanism | Latency |
| --- | --- | --- |
| Leaseholder read | Served directly by the leaseholder | ~1 ms |
| Follower read | Served by the closest replica (slightly stale) | ~1 ms (geo-local) |
| Consistent read from a remote gateway | Must be forwarded to the leaseholder | + network RTT |
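
The routing choice above can be sketched using CockroachDB's closed-timestamp idea: a follower may serve a read only if the read's timestamp is at or below the timestamp the leaseholder has already "closed" for that follower. Node names and timestamp values here are illustrative:

```python
# Sketch of read routing (illustrative model, not CockroachDB's API).
# closed_ts[f] is the highest timestamp follower f can safely serve.
def route_read(read_ts, leaseholder, followers, closed_ts):
    fresh = [f for f in followers if closed_ts[f] >= read_ts]
    if fresh:
        return fresh[0]   # nearest follower that can serve this timestamp
    return leaseholder    # otherwise pay the RTT to the leaseholder

closed = {"n2": 100, "n3": 95}
print(route_read(90, "n1", ["n2", "n3"], closed))   # n2 (follower read)
print(route_read(120, "n1", ["n2", "n3"], closed))  # n1 (leaseholder)
```

This is why follower reads are "slightly stale": they are consistent as of an older timestamp, while reads at the current timestamp must still reach the leaseholder.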
