Skip to content

Architecture

Ceph is a unified distributed storage system that delivers object, block, and file storage from a single RADOS (Reliable Autonomic Distributed Object Store) cluster. It uses intelligent daemons and the CRUSH algorithm to eliminate centralized lookup bottlenecks, enabling petabyte-to-exabyte scale with commodity hardware.

See also: index, architecture, operations, security

Core Component Diagram

graph TB
    subgraph Clients
        RBD[Ceph Block Device<br/>librbd / kernel RBD]
        RGW[Ceph Object Storage<br/>radosgw -- S3/Swift]
        CephFS[Ceph File System<br/>FUSE / kernel mount]
        LibR[Custom Apps<br/>librados]
    end

    subgraph RADOS["RADOS Storage Cluster"]
        MON[Ceph Monitor<br/>ceph-mon<br/>Paxos quorum]
        MGR[Ceph Manager<br/>ceph-mgr<br/>Metrics + Dashboard]
        OSD[Ceph OSD<br/>ceph-osd<br/>BlueStore backend]
        MDS[Ceph MDS<br/>ceph-mds<br/>CephFS metadata]
    end

    subgraph Storage
        DISK[(Local disks<br/>per OSD)]
    end

    RBD --> OSD
    RGW --> OSD
    CephFS --> MDS
    CephFS --> OSD
    LibR --> OSD
    MON -->|Cluster maps| OSD
    MON -->|Cluster maps| Clients
    MGR --> MON
    MDS --> OSD
    OSD --> DISK

RADOS Daemons

Ceph Monitor (ceph-mon)

Monitors maintain the master copy of the cluster map, a collection of five sub-maps that describe the entire cluster topology:

Map Contents View Command
Monitor Map Cluster fsid, monitor addresses, epochs ceph mon dump
OSD Map Pool list, replica sizes, PG numbers, OSD statuses ceph osd dump
PG Map PG versions, up/acting sets, per-PG state, usage stats ceph pg dump
CRUSH Map Device list, failure domain hierarchy, placement rules ceph osd getcrushmap
MDS Map MDS ranks, metadata pool, active/standby state ceph fs dump

Monitors achieve consensus via the Paxos algorithm. A quorum requires a strict majority (for example, 2 of 3, 3 of 5). Losing quorum makes the entire cluster unavailable for writes, which is why production deployments run at least 3 monitors across failure domains.

Ceph OSD Daemon (ceph-osd)

Each OSD manages a single storage device (typically one per physical disk). OSDs handle all read, write, replication, recovery, rebalancing, and scrubbing operations locally. Key responsibilities:

  • Data storage via BlueStore (the default backend since Luminous), which writes objects directly to raw block devices in a database-like structure (RocksDB metadata + WAL on SSD, object data on HDD or SSD).
  • Heartbeating neighboring OSDs and reporting state changes to monitors via MOSDBeacon messages.
  • Peer recovery when OSDs join or leave the acting set for a PG.
  • Scrubbing (light: daily metadata comparison; deep: weekly bit-for-bit data comparison).

OSDs are statefully described as up/down (daemon running or not) and in/out (participating in data placement or not).

Ceph Manager (ceph-mgr)

The manager daemon provides monitoring, orchestration, and a plug-in framework. It hosts the Ceph Dashboard (web UI), Prometheus metrics exporter, and orchestrator modules (cephadm, Rook). At least one active manager is required; a standby takes over automatically if the active fails.

Ceph Metadata Server (ceph-mds)

MDS is only required for CephFS. It stores filesystem metadata (directories, ownership, permissions, timestamps) in memory for fast POSIX operations. MDS supports two deployment modes:

  • Active-standby HA: One active MDS serves a filesystem; standbys take over on failure.
  • Active-active scaling: Multiple MDS daemons share the directory tree, splitting busy subtrees and even individual directory shards across instances.

RADOS Gateway (radosgw)

The RGW daemon provides S3-compatible and Swift-compatible RESTful APIs on top of RADOS. It maintains its own user database, authentication, and access control. The unified namespace means data written via the S3 API is readable via the Swift API and vice versa.

CRUSH Algorithm and Placement Groups

CRUSH Overview

CRUSH (Controlled Replication Under Scalable Hashing) is the algorithm that determines where each object is stored. Unlike a central lookup table, every client and OSD computes placement independently using the same CRUSH map, eliminating a single bottleneck.

Object-to-OSD Placement Flow

flowchart LR
    OBJ["Object<br/>(pool + object-id)"]
    HASH["Hash object-id<br/>mod PG count"]
    PG["Placement Group<br/>e.g. 4.58"]
    CRUSH["CRUSH algorithm<br/>+ failure domain"]
    OSDS["Acting Set<br/>[Primary, Secondary, ...]"]

    OBJ --> HASH --> PG --> CRUSH --> OSDS

The full computation:

  1. Client provides the pool name and object ID.
  2. Ceph hashes the object ID and takes the result modulo the number of PGs in the pool, yielding a PG ID (for example, 58).
  3. The pool ID is prepended (for example, pool 4 -> PG 4.58).
  4. CRUSH maps the PG to an ordered list of OSDs called the acting set. The first OSD is the primary and handles client writes; it forwards data to secondaries for replication.
  5. Because CRUSH is deterministic, any client with the same cluster map produces the same result.

Placement Groups (PGs)

PGs are logical buckets that amortize the cost of per-object placement. Each pool has a configurable number of PGs. The PG autoscaler (default since Pacific) adjusts PG counts automatically based on pool usage and target PGs-per-OSD ratios.

Key concepts:

  • Acting Set: The ordered list of OSDs responsible for a PG. The first OSD is the primary.
  • Up Set: The subset of the acting set where OSDs are currently up. When an OSD in the acting set fails, Ceph remaps the PG and the next OSD becomes primary.
  • Peering: The process by which OSDs in a PG's acting set agree on the consistent state of all objects. Peering must complete before the PG becomes active + clean.

Failure Domains

CRUSH uses a hierarchical topology (bucket types: osd, host, rack, row, room, root) to place replica copies across separate failure domains. A typical rule ensures that no two replicas of the same PG land on the same host (or rack, if configured).

Rebalancing

When OSDs are added or removed, the cluster map changes and CRUSH recomputes placement. Only the PGs whose mapping changes migrate data. CRUSH is stable in that most PGs remain where they are, and new OSDs receive a proportional share without causing load spikes.

Pools and Erasure Coding

Replicated Pools

By default, pools use replication (size = 3, min_size = 2). The primary OSD writes the object and synchronously replicates to the remaining OSDs in the acting set. This provides the lowest write latency.

Erasure-Coded Pools

Erasure coding splits each object into K data chunks and M coding chunks, distributed across K+M OSDs. For example, a K=3, M=2 profile uses 5 OSDs and tolerates 2 failures with only 67% storage overhead (versus 200% for 3x replication).

graph LR
    OBJ["Object NYAN<br/>ABCDEFGHI"]
    subgraph EC["Erasure Encoding K=3, M=2"]
        D1["D1: ABC"]
        D2["D2: DEF"]
        D3["D3: GHI"]
        C1["C1: YXY"]
        C2["C2: QGC"]
    end
    OBJ --> D1
    OBJ --> D2
    OBJ --> D3
    OBJ --> C1
    OBJ --> C2
    D1 --> O1[(OSD 5)]
    D2 --> O2[(OSD 1)]
    D3 --> O3[(OSD 2)]
    C1 --> O4[(OSD 3)]
    C2 --> O5[(OSD 4)]

Reads require K chunks (any combination of data and parity). Writes go through the primary OSD, which encodes the payload and distributes chunks. Erasure-coded pools are best suited for cold data, backups, and large objects where storage efficiency outweighs write performance.

Replication vs Erasure Coding

Use replicated pools for hot data (RBD VM images, CephFS active directories) where write latency matters. Use erasure-coded pools for cold data (RGW archival, backup targets) where storage efficiency matters. The EC write path is heavier because the primary must encode and distribute shards.

Client Interfaces

Interface Daemon/Lib Access Method Typical Use Case
RBD (Block) librbd, kernel rbd Block device (QEMU/libvirt, kernel) VM disks, Kubernetes PVs
RGW (Object) radosgw S3 / Swift REST API S3-compatible object storage
CephFS (File) ceph-mds, libcephfs POSIX mount (kernel/FUSE) Shared filesystem, HPC
Custom librados Direct object API Application-specific storage

BlueStore

BlueStore is the default OSD storage backend (replacing FileStore). It writes object data directly to a raw block device (no intervening filesystem) and stores metadata in an embedded RocksDB with a separate WAL device. Key advantages:

  • Direct I/O: Bypasses the Linux page cache for predictable performance.
  • WAL + DB separation: RocksDB metadata and WAL can be placed on faster SSD/NVMe while bulk data resides on HDD.
  • Checksums and compression: Per-object checksumming and optional inline compression (zlib, zstd, lz4).
  • Device encryption: Optional dmcrypt encryption per OSD.

Crimson (Experimental)

Crimson is a next-generation OSD implementation built on the Seastar framework, targeting improved CPU efficiency and lower latency. It is not yet production-ready as of the Reef/Squid release series.

Key Architectural Properties

  • No centralized gateway: Clients compute placement via CRUSH and talk directly to OSDs.
  • Self-healing: OSDs detect failures, trigger peering, and rebuild data from replicas automatically.
  • Elastic scaling: Add or remove OSDs; CRUSH rebalances only affected PGs.
  • Uniform hardware: Designed for commodity servers with local disks; no proprietary hardware required.
  • Shared-nothing OSDs: Each OSD owns its disk independently, avoiding shared-disk contention.

Sources


How It Works

RADOS object store, CRUSH data placement, OSD write path, and recovery mechanics.

RADOS Architecture

flowchart TB
    subgraph Clients["Client Access"]
        RBD["RBD Client\n(block)"]
        RGW["RGW\n(S3/Swift)"]
        CephFS_C["CephFS Client\n(POSIX)"]
    end

    subgraph RADOS["RADOS (Core)"]
        PG["Placement Groups\n(PG)"]
        CRUSH_A["CRUSH Map\n(deterministic placement)"]
    end

    subgraph Daemons["Cluster Daemons"]
        MON["MON ×3+\n(quorum, cluster map)"]
        MGR["MGR ×2\n(metrics, dashboard)"]
        OSD1["OSD 1"]
        OSD2["OSD 2"]
        OSD3["OSD 3"]
        OSDN["OSD N"]
        MDS_D["MDS ×2+\n(CephFS metadata)"]
    end

    Clients --> RADOS
    RADOS --> Daemons
    CRUSH_A --> PG
    PG --> OSD1
    PG --> OSD2
    PG --> OSD3

    style RADOS fill:#c62828,color:#fff
    style Daemons fill:#1565c0,color:#fff

CRUSH Algorithm

The CRUSH (Controlled Replication Under Scalable Hashing) algorithm is what makes Ceph unique — clients calculate data placement directly, without a central lookup table.

flowchart LR
    Object["Object ID"] --> Hash["Hash\n(CRUSH)"]
    Hash --> PG_C["Placement Group\n(PG = hash mod num_pgs)"]
    PG_C --> CRUSH_C["CRUSH Rules\n(rack/host/OSD failure domains)"]
    CRUSH_C --> OSD_Set["OSD Set\n{OSD.4, OSD.17, OSD.29}"]

    style CRUSH_C fill:#e65100,color:#fff

Write Path

sequenceDiagram
    participant Client as Client
    participant Primary as Primary OSD
    participant Replica1 as Replica OSD 1
    participant Replica2 as Replica OSD 2
    participant Journal as WAL/DB (BlueStore)

    Client->>Primary: Write object
    Primary->>Journal: Write to WAL (journal)
    par Replicate
        Primary->>Replica1: Forward write
        Primary->>Replica2: Forward write
    end
    Replica1-->>Primary: ACK
    Replica2-->>Primary: ACK
    Primary-->>Client: Write complete
    Note over Primary: All replicas confirmed<br/>before client ACK (strong consistency)

Read Path

sequenceDiagram
    participant Client as Client
    participant MON as Monitor
    participant Primary as Primary OSD
    participant BlueStore as BlueStore (Disk)

    Client->>MON: Get cluster map (cached)
    Client->>Client: CRUSH(object) -> PG -> OSD set
    Client->>Primary: Read object (from primary OSD)
    Primary->>BlueStore: Lookup in RocksDB (omap)
    BlueStore-->>Primary: Object location on block device
    Primary->>BlueStore: Read data from block device
    Primary-->>Client: Return object data
    Note over Client: Client reads only from primary OSD<br/>for strong consistency (min_size=2)

Read path is simpler than writes because the primary OSD serves reads directly. The client uses the CRUSH algorithm to compute the OSD set for a given object, then contacts the primary OSD. The primary looks up the object's on-disk location via RocksDB (stored in BlueStore's metadata partition) and reads the data from the raw block device.

For replicated pools (size=3), reads go only to the primary OSD. For erasure-coded pools, reads may involve reconstructing data from multiple OSD shards if some are unavailable.

BlueStore (Default OSD Backend)

Component Role
BlockDevice Raw block device (HDD/SSD/NVMe) — no filesystem
RocksDB Object metadata store
WAL Write-ahead log (best on fast NVMe)
DB RocksDB data (best on SSD)
Data Object data (can be HDD)

Data Protection

Method Overhead Speed Use Case
Replication (3×) 200% Fast writes Hot data, databases
Erasure Coding (4+2) 50% Slower writes, fast reads Cold data, archives
FastEC (Tentacle) 50% Improved small I/O General purpose

Sources


Benchmarks

Scope

Performance characteristics, scaling limits, and resource consumption for Ceph.

RADOS Bench Results

Operation 3 OSD (HDD) 3 OSD (SSD) 12 OSD (NVMe)
Seq Write 300-500 MB/s 1-2 GB/s 5-10 GB/s
Seq Read 400-600 MB/s 1.5-3 GB/s 8-15 GB/s
Random 4K Write 500-1,000 IOPS 10k-30k IOPS 100k+ IOPS
Random 4K Read 1,000-2,000 IOPS 20k-50k IOPS 200k+ IOPS

RBD (Block) Performance

Block Size Throughput IOPS Latency (P99)
4K random N/A 10k-50k 1-5ms
64K sequential 500MB-2GB/s N/A 2-10ms
1M sequential 1-5 GB/s N/A 5-20ms

Scaling Limits

Dimension Limit Notes
OSDs per cluster 10,000+ Tested by CERN
Total capacity Exabytes Linear scaling
Default pg_num per pool 64 Tune per pool based on OSD count
PGs per OSD 100-200 (recommended) Performance degrades beyond 300
Objects per PG Millions No hard limit

Sourcing Status

Unsourced Performance Data

The performance numbers in this document are estimated from vendor documentation, community benchmarks, and engineering judgment. They do not represent controlled benchmarks with documented test conditions. Specific hardware configurations, software versions, and test methodologies were not recorded.

Use these figures as rough guidance only. For production capacity planning, run your own benchmarks against your specific workload and infrastructure.

Sources