Architecture¶

Ceph is a unified distributed storage system that delivers object, block, and file storage from a single RADOS (Reliable Autonomic Distributed Object Store) cluster. It uses intelligent daemons and the CRUSH algorithm to eliminate centralized lookup bottlenecks, enabling petabyte-to-exabyte scale with commodity hardware.

See also: index, architecture, operations, security

Core Component Diagram¶

graph TB
    subgraph Clients
        RBD[Ceph Block Device<br/>librbd / kernel RBD]
        RGW[Ceph Object Storage<br/>radosgw -- S3/Swift]
        CephFS[Ceph File System<br/>FUSE / kernel mount]
        LibR[Custom Apps<br/>librados]
    end

    subgraph RADOS["RADOS Storage Cluster"]
        MON[Ceph Monitor<br/>ceph-mon<br/>Paxos quorum]
        MGR[Ceph Manager<br/>ceph-mgr<br/>Metrics + Dashboard]
        OSD[Ceph OSD<br/>ceph-osd<br/>BlueStore backend]
        MDS[Ceph MDS<br/>ceph-mds<br/>CephFS metadata]
    end

    subgraph Storage
        DISK[(Local disks<br/>per OSD)]
    end

    RBD --> OSD
    RGW --> OSD
    CephFS --> MDS
    CephFS --> OSD
    LibR --> OSD
    MON -->|Cluster maps| OSD
    MON -->|Cluster maps| Clients
    MGR --> MON
    MDS --> OSD
    OSD --> DISK

RADOS Daemons¶

Ceph Monitor (ceph-mon)¶

Monitors maintain the master copy of the cluster map, a collection of five sub-maps that describe the entire cluster topology:

Map	Contents	View Command
Monitor Map	Cluster `fsid`, monitor addresses, epochs	`ceph mon dump`
OSD Map	Pool list, replica sizes, PG numbers, OSD statuses	`ceph osd dump`
PG Map	PG versions, up/acting sets, per-PG state, usage stats	`ceph pg dump`
CRUSH Map	Device list, failure domain hierarchy, placement rules	`ceph osd getcrushmap`
MDS Map	MDS ranks, metadata pool, active/standby state	`ceph fs dump`

Monitors achieve consensus via the Paxos algorithm. A quorum requires a strict majority (for example, 2 of 3, 3 of 5). Losing quorum makes the entire cluster unavailable for writes, which is why production deployments run at least 3 monitors across failure domains.

Ceph OSD Daemon (ceph-osd)¶

Each OSD manages a single storage device (typically one per physical disk). OSDs handle all read, write, replication, recovery, rebalancing, and scrubbing operations locally. Key responsibilities:

Data storage via BlueStore (the default backend since Luminous), which writes objects directly to raw block devices in a database-like structure (RocksDB metadata + WAL on SSD, object data on HDD or SSD).
Heartbeating neighboring OSDs and reporting state changes to monitors via MOSDBeacon messages.
Peer recovery when OSDs join or leave the acting set for a PG.
Scrubbing (light: daily metadata comparison; deep: weekly bit-for-bit data comparison).

OSDs are statefully described as up/down (daemon running or not) and in/out (participating in data placement or not).

Ceph Manager (ceph-mgr)¶

The manager daemon provides monitoring, orchestration, and a plug-in framework. It hosts the Ceph Dashboard (web UI), Prometheus metrics exporter, and orchestrator modules (cephadm, Rook). At least one active manager is required; a standby takes over automatically if the active fails.

Ceph Metadata Server (ceph-mds)¶

MDS is only required for CephFS. It stores filesystem metadata (directories, ownership, permissions, timestamps) in memory for fast POSIX operations. MDS supports two deployment modes:

Active-standby HA: One active MDS serves a filesystem; standbys take over on failure.
Active-active scaling: Multiple MDS daemons share the directory tree, splitting busy subtrees and even individual directory shards across instances.

RADOS Gateway (radosgw)¶

The RGW daemon provides S3-compatible and Swift-compatible RESTful APIs on top of RADOS. It maintains its own user database, authentication, and access control. The unified namespace means data written via the S3 API is readable via the Swift API and vice versa.

CRUSH Algorithm and Placement Groups¶

CRUSH Overview¶

CRUSH (Controlled Replication Under Scalable Hashing) is the algorithm that determines where each object is stored. Unlike a central lookup table, every client and OSD computes placement independently using the same CRUSH map, eliminating a single bottleneck.

Object-to-OSD Placement Flow¶

flowchart LR
    OBJ["Object<br/>(pool + object-id)"]
    HASH["Hash object-id<br/>mod PG count"]
    PG["Placement Group<br/>e.g. 4.58"]
    CRUSH["CRUSH algorithm<br/>+ failure domain"]
    OSDS["Acting Set<br/>[Primary, Secondary, ...]"]

    OBJ --> HASH --> PG --> CRUSH --> OSDS

The full computation:

Client provides the pool name and object ID.
Ceph hashes the object ID and takes the result modulo the number of PGs in the pool, yielding a PG ID (for example, 58).
The pool ID is prepended (for example, pool 4 -> PG 4.58).
CRUSH maps the PG to an ordered list of OSDs called the acting set. The first OSD is the primary and handles client writes; it forwards data to secondaries for replication.
Because CRUSH is deterministic, any client with the same cluster map produces the same result.

Placement Groups (PGs)¶

PGs are logical buckets that amortize the cost of per-object placement. Each pool has a configurable number of PGs. The PG autoscaler (default since Pacific) adjusts PG counts automatically based on pool usage and target PGs-per-OSD ratios.

Key concepts:

Acting Set: The ordered list of OSDs responsible for a PG. The first OSD is the primary.
Up Set: The subset of the acting set where OSDs are currently up. When an OSD in the acting set fails, Ceph remaps the PG and the next OSD becomes primary.
Peering: The process by which OSDs in a PG's acting set agree on the consistent state of all objects. Peering must complete before the PG becomes active + clean.

Failure Domains¶

CRUSH uses a hierarchical topology (bucket types: osd, host, rack, row, room, root) to place replica copies across separate failure domains. A typical rule ensures that no two replicas of the same PG land on the same host (or rack, if configured).

Rebalancing¶

When OSDs are added or removed, the cluster map changes and CRUSH recomputes placement. Only the PGs whose mapping changes migrate data. CRUSH is stable in that most PGs remain where they are, and new OSDs receive a proportional share without causing load spikes.

Pools and Erasure Coding¶

Replicated Pools¶

By default, pools use replication (size = 3, min_size = 2). The primary OSD writes the object and synchronously replicates to the remaining OSDs in the acting set. This provides the lowest write latency.

Erasure-Coded Pools¶

Erasure coding splits each object into K data chunks and M coding chunks, distributed across K+M OSDs. For example, a K=3, M=2 profile uses 5 OSDs and tolerates 2 failures with only 67% storage overhead (versus 200% for 3x replication).

graph LR
    OBJ["Object NYAN<br/>ABCDEFGHI"]
    subgraph EC["Erasure Encoding K=3, M=2"]
        D1["D1: ABC"]
        D2["D2: DEF"]
        D3["D3: GHI"]
        C1["C1: YXY"]
        C2["C2: QGC"]
    end
    OBJ --> D1
    OBJ --> D2
    OBJ --> D3
    OBJ --> C1
    OBJ --> C2
    D1 --> O1[(OSD 5)]
    D2 --> O2[(OSD 1)]
    D3 --> O3[(OSD 2)]
    C1 --> O4[(OSD 3)]
    C2 --> O5[(OSD 4)]

Reads require K chunks (any combination of data and parity). Writes go through the primary OSD, which encodes the payload and distributes chunks. Erasure-coded pools are best suited for cold data, backups, and large objects where storage efficiency outweighs write performance.

Replication vs Erasure Coding

Use replicated pools for hot data (RBD VM images, CephFS active directories) where write latency matters. Use erasure-coded pools for cold data (RGW archival, backup targets) where storage efficiency matters. The EC write path is heavier because the primary must encode and distribute shards.

Client Interfaces¶

Interface	Daemon/Lib	Access Method	Typical Use Case
RBD (Block)	`librbd`, kernel `rbd`	Block device (QEMU/libvirt, kernel)	VM disks, Kubernetes PVs
RGW (Object)	`radosgw`	S3 / Swift REST API	S3-compatible object storage
CephFS (File)	`ceph-mds`, `libcephfs`	POSIX mount (kernel/FUSE)	Shared filesystem, HPC
Custom	`librados`	Direct object API	Application-specific storage

BlueStore¶

BlueStore is the default OSD storage backend (replacing FileStore). It writes object data directly to a raw block device (no intervening filesystem) and stores metadata in an embedded RocksDB with a separate WAL device. Key advantages:

Direct I/O: Bypasses the Linux page cache for predictable performance.
WAL + DB separation: RocksDB metadata and WAL can be placed on faster SSD/NVMe while bulk data resides on HDD.
Checksums and compression: Per-object checksumming and optional inline compression (zlib, zstd, lz4).
Device encryption: Optional dmcrypt encryption per OSD.

Crimson (Experimental)

Crimson is a next-generation OSD implementation built on the Seastar framework, targeting improved CPU efficiency and lower latency. It is not yet production-ready as of the Reef/Squid release series.

Key Architectural Properties¶

No centralized gateway: Clients compute placement via CRUSH and talk directly to OSDs.
Self-healing: OSDs detect failures, trigger peering, and rebuild data from replicas automatically.
Elastic scaling: Add or remove OSDs; CRUSH rebalances only affected PGs.
Uniform hardware: Designed for commodity servers with local disks; no proprietary hardware required.
Shared-nothing OSDs: Each OSD owns its disk independently, avoiding shared-disk contention.

Sources¶

Ceph Architecture Documentation
CRUSH Algorithm Paper: Sage Weil et al., "CRUSH: Controlled, Scalable, Decentralized Placement of Replicated Data"
BlueStore Design

How It Works¶

RADOS object store, CRUSH data placement, OSD write path, and recovery mechanics.

RADOS Architecture¶

flowchart TB
    subgraph Clients["Client Access"]
        RBD["RBD Client\n(block)"]
        RGW["RGW\n(S3/Swift)"]
        CephFS_C["CephFS Client\n(POSIX)"]
    end

    subgraph RADOS["RADOS (Core)"]
        PG["Placement Groups\n(PG)"]
        CRUSH_A["CRUSH Map\n(deterministic placement)"]
    end

    subgraph Daemons["Cluster Daemons"]
        MON["MON ×3+\n(quorum, cluster map)"]
        MGR["MGR ×2\n(metrics, dashboard)"]
        OSD1["OSD 1"]
        OSD2["OSD 2"]
        OSD3["OSD 3"]
        OSDN["OSD N"]
        MDS_D["MDS ×2+\n(CephFS metadata)"]
    end

    Clients --> RADOS
    RADOS --> Daemons
    CRUSH_A --> PG
    PG --> OSD1
    PG --> OSD2
    PG --> OSD3

    style RADOS fill:#c62828,color:#fff
    style Daemons fill:#1565c0,color:#fff

CRUSH Algorithm¶

The CRUSH (Controlled Replication Under Scalable Hashing) algorithm is what makes Ceph unique — clients calculate data placement directly, without a central lookup table.

flowchart LR
    Object["Object ID"] --> Hash["Hash\n(CRUSH)"]
    Hash --> PG_C["Placement Group\n(PG = hash mod num_pgs)"]
    PG_C --> CRUSH_C["CRUSH Rules\n(rack/host/OSD failure domains)"]
    CRUSH_C --> OSD_Set["OSD Set\n{OSD.4, OSD.17, OSD.29}"]

    style CRUSH_C fill:#e65100,color:#fff

Write Path¶

sequenceDiagram
    participant Client as Client
    participant Primary as Primary OSD
    participant Replica1 as Replica OSD 1
    participant Replica2 as Replica OSD 2
    participant Journal as WAL/DB (BlueStore)

    Client->>Primary: Write object
    Primary->>Journal: Write to WAL (journal)
    par Replicate
        Primary->>Replica1: Forward write
        Primary->>Replica2: Forward write
    end
    Replica1-->>Primary: ACK
    Replica2-->>Primary: ACK
    Primary-->>Client: Write complete
    Note over Primary: All replicas confirmed<br/>before client ACK (strong consistency)

Read Path¶

sequenceDiagram
    participant Client as Client
    participant MON as Monitor
    participant Primary as Primary OSD
    participant BlueStore as BlueStore (Disk)

    Client->>MON: Get cluster map (cached)
    Client->>Client: CRUSH(object) -> PG -> OSD set
    Client->>Primary: Read object (from primary OSD)
    Primary->>BlueStore: Lookup in RocksDB (omap)
    BlueStore-->>Primary: Object location on block device
    Primary->>BlueStore: Read data from block device
    Primary-->>Client: Return object data
    Note over Client: Client reads only from primary OSD<br/>for strong consistency (min_size=2)

Read path is simpler than writes because the primary OSD serves reads directly. The client uses the CRUSH algorithm to compute the OSD set for a given object, then contacts the primary OSD. The primary looks up the object's on-disk location via RocksDB (stored in BlueStore's metadata partition) and reads the data from the raw block device.

For replicated pools (size=3), reads go only to the primary OSD. For erasure-coded pools, reads may involve reconstructing data from multiple OSD shards if some are unavailable.

BlueStore (Default OSD Backend)¶

Component	Role
BlockDevice	Raw block device (HDD/SSD/NVMe) — no filesystem
RocksDB	Object metadata store
WAL	Write-ahead log (best on fast NVMe)
DB	RocksDB data (best on SSD)
Data	Object data (can be HDD)

Data Protection¶

Method	Overhead	Speed	Use Case
Replication (3×)	200%	Fast writes	Hot data, databases
Erasure Coding (4+2)	50%	Slower writes, fast reads	Cold data, archives
FastEC (Tentacle)	50%	Improved small I/O	General purpose

Sources¶

Benchmarks¶

Scope

Performance characteristics, scaling limits, and resource consumption for Ceph.

RADOS Bench Results¶

Operation	3 OSD (HDD)	3 OSD (SSD)	12 OSD (NVMe)
Seq Write	300-500 MB/s	1-2 GB/s	5-10 GB/s
Seq Read	400-600 MB/s	1.5-3 GB/s	8-15 GB/s
Random 4K Write	500-1,000 IOPS	10k-30k IOPS	100k+ IOPS
Random 4K Read	1,000-2,000 IOPS	20k-50k IOPS	200k+ IOPS

RBD (Block) Performance¶

Block Size	Throughput	IOPS	Latency (P99)
4K random	N/A	10k-50k	1-5ms
64K sequential	500MB-2GB/s	N/A	2-10ms
1M sequential	1-5 GB/s	N/A	5-20ms

Scaling Limits¶

Dimension	Limit	Notes
OSDs per cluster	10,000+	Tested by CERN
Total capacity	Exabytes	Linear scaling
Default pg_num per pool	64	Tune per pool based on OSD count
PGs per OSD	100-200 (recommended)	Performance degrades beyond 300
Objects per PG	Millions	No hard limit

Sourcing Status¶

Unsourced Performance Data

The performance numbers in this document are estimated from vendor documentation, community benchmarks, and engineering judgment. They do not represent controlled benchmarks with documented test conditions. Specific hardware configurations, software versions, and test methodologies were not recorded.

Use these figures as rough guidance only. For production capacity planning, run your own benchmarks against your specific workload and infrastructure.