CockroachDB — How It Works
Range-based sharding, Raft consensus, distributed transaction protocol, and leaseholder reads.
Data Distribution
flowchart TB
subgraph Keyspace["Sorted Keyspace"]
R1["Range 1\n/meta → /system"]
R2["Range 2\n/table/users/a-m"]
R3["Range 3\n/table/users/n-z"]
R4["Range 4\n/table/orders/*"]
end
subgraph Cluster_C["3-Node Cluster"]
N1["Node 1\n(R1-leader, R2-follower, R4-leader)"]
N2["Node 2\n(R1-follower, R2-leader, R3-follower)"]
N3["Node 3\n(R1-follower, R3-leader, R4-follower)"]
end
R1 -.-> N1
R2 -.-> N2
R3 -.-> N3
R4 -.-> N1
style Keyspace fill:#6933ff,color:#fff
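The diagram above implies a lookup: the keyspace is sorted, each range owns a contiguous span of it, and a request is routed to the node that leads the matching range. The following is a minimal sketch of that lookup, not CockroachDB's actual code. The string keys, the `RANGES` table, and the `lookup_range` and `route` helpers are hypothetical; real keys are encoded bytes and range boundaries are discovered through the meta ranges.

```python
import bisect

# Hypothetical range table, sorted by start key. Each range covers
# [its start key, the next range's start key).
RANGES = [
    ("/meta", "Range 1"),
    ("/table/orders/", "Range 4"),
    ("/table/users/a", "Range 2"),
    ("/table/users/n", "Range 3"),
]

# Leader placement copied from the 3-node cluster in the diagram.
LEADERS = {
    "Range 1": "Node 1",
    "Range 2": "Node 2",
    "Range 3": "Node 3",
    "Range 4": "Node 1",
}

START_KEYS = [start for start, _ in RANGES]


def lookup_range(key: str) -> str:
    """Return the range whose span contains `key` (binary search on start keys)."""
    idx = bisect.bisect_right(START_KEYS, key) - 1
    if idx < 0:
        raise KeyError(f"key {key!r} sorts before the first range")
    return RANGES[idx][1]


def route(key: str) -> tuple[str, str]:
    """Return (range, leader node) for `key`."""
    rng = lookup_range(key)
    return rng, LEADERS[rng]
```

With this table, `route("/table/users/bob")` resolves to Range 2 on Node 2, and `route("/table/orders/42")` resolves to Range 4 on Node 1. Because the start keys are sorted, the lookup is a binary search rather than a scan.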
Distributed Transaction (Write)
sequenceDiagram
participant Client_C as Client (PG wire)
participant GW as Gateway Node
participant LH1 as Leaseholder (Range A)
participant LH2 as Leaseholder (Range B)
participant Followers as Raft Followers
Client_C->>GW: BEGIN; UPDATE users...; UPDATE orders...;
GW->>LH1: Write intent on users (Range A)
GW->>LH2: Write intent on orders (Range B)
Note over GW: Two-Phase Commit
GW->>LH1: PREPARE (write transaction record)
LH1->>Followers: Raft replicate (majority ACK)
Followers-->>LH1: ACK (2/3 nodes)
GW->>LH1: COMMIT
GW->>LH2: Resolve intent (committed)
GW-->>Client_C: COMMIT OK
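The write path in the sequence diagram can be modeled in a few lines: provisional write intents land on each range, a transaction record decides the outcome once a majority of replicas acknowledge it, and the intents are then resolved. This is a toy sketch under stated assumptions, not the real protocol. The `Range` and `Gateway` classes and the `acks` parameter are hypothetical, the transaction record is kept in a plain dict instead of on a range, and a missing quorum is treated as an abort for simplicity.

```python
from dataclasses import dataclass, field


def majority(replicas: int) -> int:
    """Smallest number of replicas that forms a Raft quorum."""
    return replicas // 2 + 1


@dataclass
class Range:
    name: str
    committed: dict = field(default_factory=dict)  # key -> value
    intents: dict = field(default_factory=dict)    # key -> (txn_id, value)

    def write_intent(self, txn_id, key, value):
        # A provisional write; it does not replace the committed value yet.
        self.intents[key] = (txn_id, value)

    def resolve_intent(self, txn_id, key, committed):
        owner, value = self.intents.pop(key)
        if owner != txn_id:
            raise ValueError(f"intent on {key!r} belongs to {owner!r}")
        if committed:
            self.committed[key] = value


class Gateway:
    def __init__(self):
        self.txn_records = {}  # txn_id -> "PENDING" | "COMMITTED" | "ABORTED"

    def run(self, txn_id, writes, acks, replicas=3):
        """Apply `writes` (a list of (range, key, value)) as one transaction."""
        self.txn_records[txn_id] = "PENDING"
        for rng, key, value in writes:
            rng.write_intent(txn_id, key, value)
        # The record is durable only once a majority has acknowledged it.
        committed = acks >= majority(replicas)
        self.txn_records[txn_id] = "COMMITTED" if committed else "ABORTED"
        for rng, key, _ in writes:
            rng.resolve_intent(txn_id, key, committed)
        return self.txn_records[txn_id]
```

With three replicas, `majority(3)` is 2, which matches the "2/3 nodes" acknowledgement in the diagram. A transaction run with `acks=2` commits its intents on both ranges, while one run with `acks=1` leaves the previously committed values untouched.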
Leaseholder Reads (Fast Path)
Each range has a leaseholder — the node that serves reads without Raft consensus:
| Read Type | Mechanism | Latency |
|---|---|---|
| Leaseholder read | Served directly by the leaseholder, with no Raft round trip | ~1ms |
| Follower read | Served by the closest replica (slightly stale) | ~1ms (geo-local) |
| Consistent read | Must be routed to the leaseholder, even when it is on a remote node | + network RTT to the leaseholder |
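The trade-off in the table can be sketched as a replica choice: a read that tolerates slightly stale data goes to the nearest replica, while a consistent read goes to the leaseholder wherever it is. The `Replica` type, the `pick_replica` and `estimated_latency_ms` helpers, and the round-trip figures below are all hypothetical; the ~1 ms local read cost is the figure from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    node: str
    rtt_ms: float         # assumed round-trip time from the gateway node
    is_leaseholder: bool


def pick_replica(replicas, allow_stale: bool) -> Replica:
    """Choose which replica serves a read."""
    if allow_stale:
        # Follower read: any replica will do, so take the closest one.
        return min(replicas, key=lambda r: r.rtt_ms)
    # Consistent read: only the leaseholder may serve it.
    for replica in replicas:
        if replica.is_leaseholder:
            return replica
    raise LookupError("no leaseholder known for this range")


def estimated_latency_ms(replica: Replica, local_read_ms: float = 1.0) -> float:
    """Local read cost plus the network round trip to the chosen replica."""
    return local_read_ms + replica.rtt_ms


replicas = [
    Replica("Node 1", rtt_ms=0.0, is_leaseholder=False),   # gateway-local
    Replica("Node 2", rtt_ms=60.0, is_leaseholder=True),
    Replica("Node 3", rtt_ms=120.0, is_leaseholder=False),
]
```

With these assumed numbers, a consistent read is served by Node 2 and pays the 60 ms round trip, while a stale-tolerant read stays on Node 1. In CockroachDB SQL, a follower read is requested by adding `AS OF SYSTEM TIME follower_read_timestamp()` to the query.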