Architecture¶
Ceph is a unified distributed storage system that delivers object, block, and file storage from a single RADOS (Reliable Autonomic Distributed Object Store) cluster. It uses intelligent daemons and the CRUSH algorithm to eliminate centralized lookup bottlenecks, enabling petabyte-to-exabyte scale with commodity hardware.
See also: index, architecture, operations, security
Core Component Diagram¶
graph TB
subgraph Clients
RBD[Ceph Block Device<br/>librbd / kernel RBD]
RGW[Ceph Object Storage<br/>radosgw -- S3/Swift]
CephFS[Ceph File System<br/>FUSE / kernel mount]
LibR[Custom Apps<br/>librados]
end
subgraph RADOS["RADOS Storage Cluster"]
MON[Ceph Monitor<br/>ceph-mon<br/>Paxos quorum]
MGR[Ceph Manager<br/>ceph-mgr<br/>Metrics + Dashboard]
OSD[Ceph OSD<br/>ceph-osd<br/>BlueStore backend]
MDS[Ceph MDS<br/>ceph-mds<br/>CephFS metadata]
end
subgraph Storage
DISK[(Local disks<br/>per OSD)]
end
RBD --> OSD
RGW --> OSD
CephFS --> MDS
CephFS --> OSD
LibR --> OSD
MON -->|Cluster maps| OSD
MON -->|Cluster maps| Clients
MGR --> MON
MDS --> OSD
OSD --> DISK
RADOS Daemons¶
Ceph Monitor (ceph-mon)¶
Monitors maintain the master copy of the cluster map, a collection of five sub-maps that describe the entire cluster topology:
| Map | Contents | View Command |
|---|---|---|
| Monitor Map | Cluster fsid, monitor addresses, epochs |
ceph mon dump |
| OSD Map | Pool list, replica sizes, PG numbers, OSD statuses | ceph osd dump |
| PG Map | PG versions, up/acting sets, per-PG state, usage stats | ceph pg dump |
| CRUSH Map | Device list, failure domain hierarchy, placement rules | ceph osd getcrushmap |
| MDS Map | MDS ranks, metadata pool, active/standby state | ceph fs dump |
Monitors achieve consensus via the Paxos algorithm. A quorum requires a strict majority (for example, 2 of 3, 3 of 5). Losing quorum makes the entire cluster unavailable for writes, which is why production deployments run at least 3 monitors across failure domains.
Ceph OSD Daemon (ceph-osd)¶
Each OSD manages a single storage device (typically one per physical disk). OSDs handle all read, write, replication, recovery, rebalancing, and scrubbing operations locally. Key responsibilities:
- Data storage via BlueStore (the default backend since Luminous), which writes objects directly to raw block devices in a database-like structure (RocksDB metadata + WAL on SSD, object data on HDD or SSD).
- Heartbeating neighboring OSDs and reporting state changes to monitors via
MOSDBeaconmessages. - Peer recovery when OSDs join or leave the acting set for a PG.
- Scrubbing (light: daily metadata comparison; deep: weekly bit-for-bit data comparison).
OSDs are statefully described as up/down (daemon running or not) and in/out (participating in data placement or not).
Ceph Manager (ceph-mgr)¶
The manager daemon provides monitoring, orchestration, and a plug-in framework. It hosts the Ceph Dashboard (web UI), Prometheus metrics exporter, and orchestrator modules (cephadm, Rook). At least one active manager is required; a standby takes over automatically if the active fails.
Ceph Metadata Server (ceph-mds)¶
MDS is only required for CephFS. It stores filesystem metadata (directories, ownership, permissions, timestamps) in memory for fast POSIX operations. MDS supports two deployment modes:
- Active-standby HA: One active MDS serves a filesystem; standbys take over on failure.
- Active-active scaling: Multiple MDS daemons share the directory tree, splitting busy subtrees and even individual directory shards across instances.
RADOS Gateway (radosgw)¶
The RGW daemon provides S3-compatible and Swift-compatible RESTful APIs on top of RADOS. It maintains its own user database, authentication, and access control. The unified namespace means data written via the S3 API is readable via the Swift API and vice versa.
CRUSH Algorithm and Placement Groups¶
CRUSH Overview¶
CRUSH (Controlled Replication Under Scalable Hashing) is the algorithm that determines where each object is stored. Unlike a central lookup table, every client and OSD computes placement independently using the same CRUSH map, eliminating a single bottleneck.
Object-to-OSD Placement Flow¶
flowchart LR
OBJ["Object<br/>(pool + object-id)"]
HASH["Hash object-id<br/>mod PG count"]
PG["Placement Group<br/>e.g. 4.58"]
CRUSH["CRUSH algorithm<br/>+ failure domain"]
OSDS["Acting Set<br/>[Primary, Secondary, ...]"]
OBJ --> HASH --> PG --> CRUSH --> OSDS
The full computation:
- Client provides the pool name and object ID.
- Ceph hashes the object ID and takes the result modulo the number of PGs in the pool, yielding a PG ID (for example,
58). - The pool ID is prepended (for example, pool
4-> PG4.58). - CRUSH maps the PG to an ordered list of OSDs called the acting set. The first OSD is the primary and handles client writes; it forwards data to secondaries for replication.
- Because CRUSH is deterministic, any client with the same cluster map produces the same result.
Placement Groups (PGs)¶
PGs are logical buckets that amortize the cost of per-object placement. Each pool has a configurable number of PGs. The PG autoscaler (default since Pacific) adjusts PG counts automatically based on pool usage and target PGs-per-OSD ratios.
Key concepts:
- Acting Set: The ordered list of OSDs responsible for a PG. The first OSD is the primary.
- Up Set: The subset of the acting set where OSDs are currently
up. When an OSD in the acting set fails, Ceph remaps the PG and the next OSD becomes primary. - Peering: The process by which OSDs in a PG's acting set agree on the consistent state of all objects. Peering must complete before the PG becomes
active + clean.
Failure Domains¶
CRUSH uses a hierarchical topology (bucket types: osd, host, rack, row, room, root) to place replica copies across separate failure domains. A typical rule ensures that no two replicas of the same PG land on the same host (or rack, if configured).
Rebalancing¶
When OSDs are added or removed, the cluster map changes and CRUSH recomputes placement. Only the PGs whose mapping changes migrate data. CRUSH is stable in that most PGs remain where they are, and new OSDs receive a proportional share without causing load spikes.
Pools and Erasure Coding¶
Replicated Pools¶
By default, pools use replication (size = 3, min_size = 2). The primary OSD writes the object and synchronously replicates to the remaining OSDs in the acting set. This provides the lowest write latency.
Erasure-Coded Pools¶
Erasure coding splits each object into K data chunks and M coding chunks, distributed across K+M OSDs. For example, a K=3, M=2 profile uses 5 OSDs and tolerates 2 failures with only 67% storage overhead (versus 200% for 3x replication).
graph LR
OBJ["Object NYAN<br/>ABCDEFGHI"]
subgraph EC["Erasure Encoding K=3, M=2"]
D1["D1: ABC"]
D2["D2: DEF"]
D3["D3: GHI"]
C1["C1: YXY"]
C2["C2: QGC"]
end
OBJ --> D1
OBJ --> D2
OBJ --> D3
OBJ --> C1
OBJ --> C2
D1 --> O1[(OSD 5)]
D2 --> O2[(OSD 1)]
D3 --> O3[(OSD 2)]
C1 --> O4[(OSD 3)]
C2 --> O5[(OSD 4)]
Reads require K chunks (any combination of data and parity). Writes go through the primary OSD, which encodes the payload and distributes chunks. Erasure-coded pools are best suited for cold data, backups, and large objects where storage efficiency outweighs write performance.
Replication vs Erasure Coding
Use replicated pools for hot data (RBD VM images, CephFS active directories) where write latency matters. Use erasure-coded pools for cold data (RGW archival, backup targets) where storage efficiency matters. The EC write path is heavier because the primary must encode and distribute shards.
Client Interfaces¶
| Interface | Daemon/Lib | Access Method | Typical Use Case |
|---|---|---|---|
| RBD (Block) | librbd, kernel rbd |
Block device (QEMU/libvirt, kernel) | VM disks, Kubernetes PVs |
| RGW (Object) | radosgw |
S3 / Swift REST API | S3-compatible object storage |
| CephFS (File) | ceph-mds, libcephfs |
POSIX mount (kernel/FUSE) | Shared filesystem, HPC |
| Custom | librados |
Direct object API | Application-specific storage |
BlueStore¶
BlueStore is the default OSD storage backend (replacing FileStore). It writes object data directly to a raw block device (no intervening filesystem) and stores metadata in an embedded RocksDB with a separate WAL device. Key advantages:
- Direct I/O: Bypasses the Linux page cache for predictable performance.
- WAL + DB separation: RocksDB metadata and WAL can be placed on faster SSD/NVMe while bulk data resides on HDD.
- Checksums and compression: Per-object checksumming and optional inline compression (zlib, zstd, lz4).
- Device encryption: Optional
dmcryptencryption per OSD.
Crimson (Experimental)
Crimson is a next-generation OSD implementation built on the Seastar framework, targeting improved CPU efficiency and lower latency. It is not yet production-ready as of the Reef/Squid release series.
Key Architectural Properties¶
- No centralized gateway: Clients compute placement via CRUSH and talk directly to OSDs.
- Self-healing: OSDs detect failures, trigger peering, and rebuild data from replicas automatically.
- Elastic scaling: Add or remove OSDs; CRUSH rebalances only affected PGs.
- Uniform hardware: Designed for commodity servers with local disks; no proprietary hardware required.
- Shared-nothing OSDs: Each OSD owns its disk independently, avoiding shared-disk contention.
Sources¶
- Ceph Architecture Documentation
- CRUSH Algorithm Paper: Sage Weil et al., "CRUSH: Controlled, Scalable, Decentralized Placement of Replicated Data"
- BlueStore Design
How It Works¶
RADOS object store, CRUSH data placement, OSD write path, and recovery mechanics.
RADOS Architecture¶
flowchart TB
subgraph Clients["Client Access"]
RBD["RBD Client\n(block)"]
RGW["RGW\n(S3/Swift)"]
CephFS_C["CephFS Client\n(POSIX)"]
end
subgraph RADOS["RADOS (Core)"]
PG["Placement Groups\n(PG)"]
CRUSH_A["CRUSH Map\n(deterministic placement)"]
end
subgraph Daemons["Cluster Daemons"]
MON["MON ×3+\n(quorum, cluster map)"]
MGR["MGR ×2\n(metrics, dashboard)"]
OSD1["OSD 1"]
OSD2["OSD 2"]
OSD3["OSD 3"]
OSDN["OSD N"]
MDS_D["MDS ×2+\n(CephFS metadata)"]
end
Clients --> RADOS
RADOS --> Daemons
CRUSH_A --> PG
PG --> OSD1
PG --> OSD2
PG --> OSD3
style RADOS fill:#c62828,color:#fff
style Daemons fill:#1565c0,color:#fff
CRUSH Algorithm¶
The CRUSH (Controlled Replication Under Scalable Hashing) algorithm is what makes Ceph unique — clients calculate data placement directly, without a central lookup table.
flowchart LR
Object["Object ID"] --> Hash["Hash\n(CRUSH)"]
Hash --> PG_C["Placement Group\n(PG = hash mod num_pgs)"]
PG_C --> CRUSH_C["CRUSH Rules\n(rack/host/OSD failure domains)"]
CRUSH_C --> OSD_Set["OSD Set\n{OSD.4, OSD.17, OSD.29}"]
style CRUSH_C fill:#e65100,color:#fff
Write Path¶
sequenceDiagram
participant Client as Client
participant Primary as Primary OSD
participant Replica1 as Replica OSD 1
participant Replica2 as Replica OSD 2
participant Journal as WAL/DB (BlueStore)
Client->>Primary: Write object
Primary->>Journal: Write to WAL (journal)
par Replicate
Primary->>Replica1: Forward write
Primary->>Replica2: Forward write
end
Replica1-->>Primary: ACK
Replica2-->>Primary: ACK
Primary-->>Client: Write complete
Note over Primary: All replicas confirmed<br/>before client ACK (strong consistency)
Read Path¶
sequenceDiagram
participant Client as Client
participant MON as Monitor
participant Primary as Primary OSD
participant BlueStore as BlueStore (Disk)
Client->>MON: Get cluster map (cached)
Client->>Client: CRUSH(object) -> PG -> OSD set
Client->>Primary: Read object (from primary OSD)
Primary->>BlueStore: Lookup in RocksDB (omap)
BlueStore-->>Primary: Object location on block device
Primary->>BlueStore: Read data from block device
Primary-->>Client: Return object data
Note over Client: Client reads only from primary OSD<br/>for strong consistency (min_size=2)
Read path is simpler than writes because the primary OSD serves reads directly. The client uses the CRUSH algorithm to compute the OSD set for a given object, then contacts the primary OSD. The primary looks up the object's on-disk location via RocksDB (stored in BlueStore's metadata partition) and reads the data from the raw block device.
For replicated pools (size=3), reads go only to the primary OSD. For erasure-coded pools, reads may involve reconstructing data from multiple OSD shards if some are unavailable.
BlueStore (Default OSD Backend)¶
| Component | Role |
|---|---|
| BlockDevice | Raw block device (HDD/SSD/NVMe) — no filesystem |
| RocksDB | Object metadata store |
| WAL | Write-ahead log (best on fast NVMe) |
| DB | RocksDB data (best on SSD) |
| Data | Object data (can be HDD) |
Data Protection¶
| Method | Overhead | Speed | Use Case |
|---|---|---|---|
| Replication (3×) | 200% | Fast writes | Hot data, databases |
| Erasure Coding (4+2) | 50% | Slower writes, fast reads | Cold data, archives |
| FastEC (Tentacle) | 50% | Improved small I/O | General purpose |
Sources¶
Benchmarks¶
Scope
Performance characteristics, scaling limits, and resource consumption for Ceph.
RADOS Bench Results¶
| Operation | 3 OSD (HDD) | 3 OSD (SSD) | 12 OSD (NVMe) |
|---|---|---|---|
| Seq Write | 300-500 MB/s | 1-2 GB/s | 5-10 GB/s |
| Seq Read | 400-600 MB/s | 1.5-3 GB/s | 8-15 GB/s |
| Random 4K Write | 500-1,000 IOPS | 10k-30k IOPS | 100k+ IOPS |
| Random 4K Read | 1,000-2,000 IOPS | 20k-50k IOPS | 200k+ IOPS |
RBD (Block) Performance¶
| Block Size | Throughput | IOPS | Latency (P99) |
|---|---|---|---|
| 4K random | N/A | 10k-50k | 1-5ms |
| 64K sequential | 500MB-2GB/s | N/A | 2-10ms |
| 1M sequential | 1-5 GB/s | N/A | 5-20ms |
Scaling Limits¶
| Dimension | Limit | Notes |
|---|---|---|
| OSDs per cluster | 10,000+ | Tested by CERN |
| Total capacity | Exabytes | Linear scaling |
| Default pg_num per pool | 64 | Tune per pool based on OSD count |
| PGs per OSD | 100-200 (recommended) | Performance degrades beyond 300 |
| Objects per PG | Millions | No hard limit |
Sourcing Status¶
Unsourced Performance Data
The performance numbers in this document are estimated from vendor documentation, community benchmarks, and engineering judgment. They do not represent controlled benchmarks with documented test conditions. Specific hardware configurations, software versions, and test methodologies were not recorded.
Use these figures as rough guidance only. For production capacity planning, run your own benchmarks against your specific workload and infrastructure.