Ceph
Unified, software-defined distributed storage providing block, object, and file storage at exabyte scale; widely deployed with OpenStack and Kubernetes.
Overview
Ceph is a unified distributed storage system that provides block (RBD), object (RGW/S3), and file (CephFS) storage from a single cluster. It uses the CRUSH algorithm for deterministic data placement, so clients compute where data lives instead of querying a central lookup table. Ceph is one of the most commonly deployed storage backends for OpenStack and is widely used with Kubernetes (via Rook).
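
To make the idea of computed placement concrete, here is a minimal sketch of the two-step flow: an object name is hashed to a placement group (PG), and the PG is then mapped to a set of OSDs on distinct hosts. It is not the real CRUSH algorithm. The SHA-256 ranking below is a rendezvous-hashing stand-in that ignores weights and multi-level hierarchies, and every name in it (`stable_hash`, `object_to_pg`, `pg_to_osds`, `locate`, `CLUSTER`, the host and OSD labels) is hypothetical.

```python
import hashlib


def stable_hash(*parts):
    """Deterministic 64-bit hash (Python's built-in hash() is salted per process)."""
    digest = hashlib.sha256("/".join(parts).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def object_to_pg(object_name, pg_num):
    """Step 1: hash the object name and reduce it to a placement group id."""
    return stable_hash("obj", object_name) % pg_num


def pg_to_osds(pg_id, osds_by_host, replicas):
    """Step 2: pick `replicas` OSDs for a PG, at most one per host.

    Stand-in for CRUSH (an assumption, not Ceph's code): rank hosts by a
    per-(PG, host) hash, keep the top `replicas`, then pick one OSD inside
    each chosen host the same way. The host acts as the failure domain.
    """
    ranked_hosts = sorted(
        osds_by_host,
        key=lambda host: stable_hash("host", str(pg_id), host),
        reverse=True,
    )
    chosen = []
    for host in ranked_hosts[:replicas]:
        osd = max(
            osds_by_host[host],
            key=lambda name: stable_hash("osd", str(pg_id), name),
        )
        chosen.append(osd)
    return chosen


def locate(object_name, osds_by_host, pg_num=128, replicas=3):
    """Any client holding the same map computes the same ordered OSD list."""
    return pg_to_osds(object_to_pg(object_name, pg_num), osds_by_host, replicas)


# Hypothetical four-host cluster map shared by every client.
CLUSTER = {
    "host-a": ["osd.0", "osd.1"],
    "host-b": ["osd.2", "osd.3"],
    "host-c": ["osd.4", "osd.5"],
    "host-d": ["osd.6", "osd.7"],
}

if __name__ == "__main__":
    print(locate("rbd_data.1234.0000", CLUSTER))
```

Because the result depends only on the object name and the shared map, two clients never need to ask a directory service where an object lives; they only need the same map version.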
Key Facts
| Attribute | Detail |
|---|---|
| Repository | github.com/ceph/ceph |
| Stars | ~14k ⭐ |
| Latest Version | v20.2.1 "Tentacle" (April 6, 2026) |
| Language | C++, Python |
| License | LGPL 2.1 / 3.0 |
| Governance | Community + Red Hat / IBM |
Evaluation
| Pros | Cons |
|---|---|
| Unified: block + object + file | Complex to deploy and operate |
| CRUSH: no SPOF, linear scaling | High resource requirements (RAM, CPU) |
| Self-healing, auto-rebalancing | Tuning required for optimal performance |
| Industry standard (OpenStack, K8s) | Erasure coding historically slower (improved in Tentacle) |
| Exabyte-scale proven | OSD recovery can saturate network |
| Cephadm for lifecycle management | |
| FastEC in v20: better EC performance | |
Key Features (Tentacle v20.2)
| Feature | Detail |
|---|---|
| FastEC | Major erasure coding performance improvement for small I/O |
| SMB Manager | Integrated Samba/CephFS SMB shares with AD support |
| SeaStore preview | Next-gen OSD object store for NVMe devices |
| Multi-cluster dashboard | Manage multiple Ceph clusters from one UI |
| OAuth 2.0 | Dashboard authentication |
| NVMe/TCP gateways | NVMe-oF target support |
| RBD transient locks | Better exclusive lock handling |
Storage Interfaces
| Interface | Protocol | Use Case |
|---|---|---|
| RBD | Block (RADOS Block Device) | VM disks, K8s PVs, databases |
| RGW | Object (S3/Swift API) | Backups, media, data lakes |
| CephFS | File (POSIX) | Shared filesystems, NFS replacement |
Notes
Sources
- Ceph Docs
- Architecture — RADOS, CRUSH
- BlueStore
- GitHub
Questions
Answered
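- Q: When to use EC vs replication? -- Use erasure coding for cold or capacity-oriented data; a 4+2 profile, for example, adds 50% raw-capacity overhead. Use 3x replication for hot data: it costs 200% overhead but gives faster writes.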
- Q: How many monitors should a production cluster run? -- At least 3 monitors for quorum. Use 5 for larger clusters or when spanning racks. Prefer an odd number: quorum requires a strict majority, so an even count adds no extra failure tolerance (4 monitors survive one failure, the same as 3). Each monitor stores a full copy of the cluster map and participates in Paxos consensus.
- Q: What is the minimum replication factor for data safety? -- Ceph supports `size=2` (R2) as a technical minimum, but R3 (`size=3`, `min_size=2`) is strongly recommended. The Ceph documentation warns that on a long enough timeline, data stored with R2 will be lost due to the probability of simultaneous dual failures.
- Q: How does CRUSH avoid a central lookup table? -- Both clients and OSDs compute object placement independently using the same CRUSH map and a deterministic pseudo-random hash. The client hashes the object ID, takes the result modulo the number of PGs, then uses CRUSH to map the PG to an ordered list of OSDs based on the failure domain hierarchy. No query to a central service is needed.
- Q: What storage backend does Ceph OSD use? -- BlueStore. It has been the default backend since the Luminous release and is the only supported one since Reef. It writes object data directly to raw block devices (no intervening filesystem) and stores metadata in an embedded RocksDB, whose database and WAL can optionally be placed on a separate, faster device. The older FileStore backend was deprecated in Quincy and is no longer supported.
- Q: How does Ceph handle OSD failures? -- OSDs heartbeat their neighbors. If an OSD misses heartbeats, a peer OSD reports it `down` to the monitors. The monitors mark the OSD `down` and publish a new OSD map; affected PGs keep serving I/O in a degraded state. If the OSD stays down longer than `mon_osd_down_out_interval` (600 seconds by default), it is marked `out`, and CRUSH remaps the affected PGs to other OSDs, triggering peering and recovery. The PG autoscaler is separate from this path: it adjusts placement group counts based on pool usage, not in response to failures.
- Q: Can Ceph encrypt data at rest? -- Yes. Ceph supports `dm-crypt` (LUKS) encryption per OSD device. Each OSD gets a unique encryption key stored in a Monitor-managed lockbox. BlueStore also checksums data and metadata, which provides integrity verification rather than encryption.
- Q: What is the role of the Ceph Manager (ceph-mgr)? -- The Manager provides the Ceph Dashboard (web UI), Prometheus metrics endpoint, and orchestrator modules (cephadm, Rook). It is required for cluster operation; at least one active and one standby are recommended for HA.
- Q: How does CephFS MDS scaling work? -- Multiple active MDS daemons can split the directory tree into subtrees (and shard busy directories), distributing metadata load. Additional standby MDS daemons provide failover. For example, run 3 active + 1 standby for a mix of scaling and HA.
- Q: What is Crimson? -- Crimson is an experimental next-generation OSD implementation built on the Seastar framework for user-space I/O. It targets improved CPU efficiency and lower latency compared to the classic OSD. It is not production-ready as of the Reef/Squid release series.
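
The overhead percentages and monitor counts in the answers above follow from two small pieces of arithmetic, sketched here. The function names are hypothetical, and the 4+2 erasure-coding profile is an assumed example of a profile that produces the 50% figure.

```python
def replication_overhead(size):
    """Extra raw capacity per byte of user data when keeping `size` full copies."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return float(size - 1)


def ec_overhead(k, m):
    """Extra raw capacity per byte of user data with k data chunks and m coding chunks."""
    if k < 1 or m < 0:
        raise ValueError("need k >= 1 and m >= 0")
    return m / k


def mon_quorum(monitors):
    """Smallest strict majority of `monitors`."""
    if monitors < 1:
        raise ValueError("need at least one monitor")
    return monitors // 2 + 1


def mon_failures_tolerated(monitors):
    """How many monitors can fail while a majority still exists."""
    return monitors - mon_quorum(monitors)


if __name__ == "__main__":
    print(f"3x replication: {replication_overhead(3):.0%} overhead")
    print(f"EC 4+2:         {ec_overhead(4, 2):.0%} overhead")
    for n in (3, 4, 5):
        print(f"{n} monitors: quorum {mon_quorum(n)}, "
              f"tolerates {mon_failures_tolerated(n)} failure(s)")
```

The monitor loop shows why an even count buys nothing: going from 3 to 4 monitors raises the quorum size without raising the number of failures the cluster can survive.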
Open
- Q: How does SeaStore (announced for Ceph Tentacle v20) compare to BlueStore for NVMe workloads? -- SeaStore is a next-generation OSD object store designed for NVMe devices with a log-structured design. Need to benchmark against BlueStore once it moves beyond preview.
- Q: What is the practical PG autoscaler overhead for very large clusters (1000+ OSDs)? -- The PG autoscaler monitors pool usage and adjusts PG counts, but the rebalancing impact on large clusters needs validation under production load.
- Q: How does FastEC in Ceph Tentacle v20 improve small I/O erasure coding performance? -- FastEC is listed as a major Tentacle feature. Need to evaluate benchmarks comparing FastEC vs standard EC for RBD small-block random write workloads.