Ceph — How It Works

RADOS object store, CRUSH data placement, OSD write path, and recovery mechanics.

RADOS Architecture

flowchart TB
    subgraph Clients["Client Access"]
        RBD["RBD Client\n(block)"]
        RGW["RGW\n(S3/Swift)"]
        CephFS_C["CephFS Client\n(POSIX)"]
    end

    subgraph RADOS["RADOS (Core)"]
        PG["Placement Groups\n(PG)"]
        CRUSH_A["CRUSH Map\n(deterministic placement)"]
    end

    subgraph Daemons["Cluster Daemons"]
        MON["MON ×3+\n(quorum, cluster map)"]
        MGR["MGR ×2\n(metrics, dashboard)"]
        OSD1["OSD 1"]
        OSD2["OSD 2"]
        OSD3["OSD 3"]
        OSDN["OSD N"]
        MDS_D["MDS ×2+\n(CephFS metadata)"]
    end

    Clients --> RADOS
    RADOS --> Daemons
    CRUSH_A --> PG
    PG --> OSD1
    PG --> OSD2
    PG --> OSD3

    style RADOS fill:#c62828,color:#fff
    style Daemons fill:#1565c0,color:#fff

CRUSH Algorithm

The CRUSH (Controlled Replication Under Scalable Hashing) algorithm is what makes Ceph unique: clients compute data placement themselves, so no central lookup table or metadata server sits on the data path.

flowchart LR
    Object["Object ID"] --> Hash["Hash\n(CRUSH)"]
    Hash --> PG_C["Placement Group\n(PG = hash mod num_pgs)"]
    PG_C --> CRUSH_C["CRUSH Rules\n(rack/host/OSD failure domains)"]
    CRUSH_C --> OSD_Set["OSD Set\n{OSD.4, OSD.17, OSD.29}"]

    style CRUSH_C fill:#e65100,color:#fff
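The two-step mapping in the flowchart can be sketched in a few lines. This is a toy model, not Ceph's actual CRUSH implementation: it uses a flat OSD list and a simple hash-sort draw, where real CRUSH walks a weighted hierarchy of failure domains (rack, host) and uses a stable modulo. All names and constants here are illustrative.

```python
import hashlib

NUM_PGS = 128           # pg_num for the pool (assumed; a power of two in practice)
OSDS = list(range(32))  # flat OSD list; real CRUSH descends a cluster-map hierarchy
REPLICAS = 3

def object_to_pg(name: str) -> int:
    """Step 1: hash the object name onto a placement group."""
    h = int.from_bytes(hashlib.md5(name.encode()).digest()[:4], "little")
    return h % NUM_PGS  # Ceph uses a stable mod; plain mod shown for clarity

def pg_to_osds(pg: int) -> list:
    """Step 2: deterministically pick REPLICAS distinct OSDs for the PG.

    Real CRUSH applies failure-domain rules here (no two replicas on the
    same host/rack); this draw only captures the determinism.
    """
    draws = sorted((hashlib.md5(f"{pg}:{osd}".encode()).digest(), osd)
                   for osd in OSDS)
    return [osd for _, osd in draws[:REPLICAS]]

pg = object_to_pg("rbd_data.0x1234")
acting = pg_to_osds(pg)
```

The key property is that every client running the same calculation gets the same acting set, so any client can locate an object's OSDs without asking a server.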

Write Path

sequenceDiagram
    participant Client as Client
    participant Primary as Primary OSD
    participant Replica1 as Replica OSD 1
    participant Replica2 as Replica OSD 2
    participant Journal as WAL/DB (BlueStore)

    Client->>Primary: Write object
    Primary->>Journal: Write to WAL (journal)
    par Replicate
        Primary->>Replica1: Forward write
        Primary->>Replica2: Forward write
    end
    Replica1-->>Primary: ACK
    Replica2-->>Primary: ACK
    Primary-->>Client: Write complete
    Note over Primary: All replicas confirmed<br/>before client ACK (strong consistency)
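The sequence above can be modeled as a few lines of control flow. This is a hedged sketch of the ACK discipline only (journal, fan out, wait for all replicas, then ACK the client); the class and method names are invented for illustration and are not Ceph APIs.

```python
class ReplicaOSD:
    """Stand-in for a replica OSD: apply the write, return an ACK."""
    def __init__(self):
        self.store = {}

    def apply(self, oid, data):
        self.store[oid] = data
        return True  # ACK back to the primary

class PrimaryOSD:
    def __init__(self, replicas):
        self.wal = []     # BlueStore WAL stand-in
        self.store = {}
        self.replicas = replicas

    def write(self, oid, data):
        self.wal.append((oid, data))   # 1. journal locally first
        self.store[oid] = data
        acks = [r.apply(oid, data) for r in self.replicas]  # 2. fan out
        if not all(acks):              # 3. every replica must confirm
            raise IOError("replica write failed; client gets no ACK")
        return "ack"                   # 4. only now ACK the client
```

Because the client ACK is gated on all replica ACKs, a read served by any OSD in the acting set after the ACK sees the write: this is the strong consistency the diagram notes.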

BlueStore (Default OSD Backend)

| Component   | Role                                            |
|-------------|-------------------------------------------------|
| BlockDevice | Raw block device (HDD/SSD/NVMe), no filesystem  |
| RocksDB     | Object metadata store                           |
| WAL         | Write-ahead log (best on fast NVMe)             |
| DB          | RocksDB data (best on SSD)                      |
| Data        | Object data (can be HDD)                        |
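A common way to realize this split is to put data on HDD, the RocksDB DB on SSD, and the WAL on NVMe when provisioning an OSD with `ceph-volume`. The device paths below are placeholders for whatever the host actually has; treat this as a provisioning sketch, not a recommendation for every cluster.

```shell
# BlueStore OSD with tiered devices: HDD data, SSD partition for RocksDB,
# NVMe partition for the WAL. Device paths are host-specific placeholders.
ceph-volume lvm create \
    --bluestore \
    --data /dev/sdb \
    --block.db /dev/sdc1 \
    --block.wal /dev/nvme0n1p1
```

If `--block.wal` is omitted, the WAL lives on the DB device; if both are omitted, everything shares the data device.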

Data Protection

| Method               | Overhead | Speed                     | Use Case            |
|----------------------|----------|---------------------------|---------------------|
| Replication (3×)     | 200%     | Fast writes               | Hot data, databases |
| Erasure Coding (4+2) | 50%      | Slower writes, fast reads | Cold data, archives |
| FastEC (Tentacle)    | 50%      | Improved small I/O        | General purpose     |
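The overhead column follows directly from the scheme parameters, which a short calculation makes concrete (function names are just for this sketch):

```python
def replication_overhead(copies: int) -> float:
    """Extra raw capacity beyond the usable data, as a fraction.

    3 full copies means 2 extra copies per object -> 200% overhead.
    """
    return float(copies - 1)

def ec_overhead(k: int, m: int) -> float:
    """Erasure coding k+m: k data chunks plus m coding chunks.

    Raw usage is (k + m) for k chunks of usable data, so the
    overhead is m/k -> 4+2 gives 0.5, i.e. 50% overhead.
    """
    return m / k
```

So storing 1 TB of data costs 3 TB raw under 3× replication but only 1.5 TB raw under 4+2 erasure coding, traded against the encode/decode work on the write path.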