Ceph

Unified, software-defined distributed storage providing block, object, and file storage at exabyte scale; a common storage choice for OpenStack and Kubernetes.

Overview

Ceph is a unified distributed storage system that provides block (RBD), object (RGW, with S3- and Swift-compatible APIs), and file (CephFS) storage from a single cluster. It uses the CRUSH algorithm for deterministic data placement, so clients compute where data lives instead of querying a central lookup table. Ceph is one of the most commonly deployed storage backends for OpenStack and is widely used with Kubernetes (via Rook).

Key Facts

Attribute	Detail
Repository	github.com/ceph/ceph
Stars	~14k ⭐
Latest Version	v20.2.2 "Tentacle" (2026)
Language	C++, Python
License	LGPL 2.1 / 3.0
Governance	Community + Red Hat / IBM

Evaluation

Pros	Cons
Unified: block + object + file	Complex to deploy and operate
CRUSH: no SPOF, linear scaling	High resource requirements (RAM, CPU)
Self-healing, auto-rebalancing	Tuning required for optimal performance
Industry standard (OpenStack, K8s)	Erasure coding historically slower (improved in Tentacle)
Exabyte-scale proven	OSD recovery can saturate network
Cephadm for lifecycle management
FastEC in v20: better EC performance

Key Features (Tentacle v20.2)

Feature	Detail
FastEC	Major erasure coding performance improvement for small I/O
SMB Manager	Integrated Samba/CephFS SMB shares with AD support
SeaStore preview	Next-gen OSD object store for NVMe devices
Multi-cluster dashboard	Manage multiple Ceph clusters from one UI
OAuth 2.0	Dashboard authentication
NVMe/TCP gateways	NVMe-oF target support
RBD transient locks	Better exclusive lock handling

Storage Interfaces

Interface	Protocol	Use Case
RBD	Block (RADOS Block Device)	VM disks, K8s PVs, databases
RGW	Object (S3/Swift API)	Backups, media, data lakes
CephFS	File (POSIX)	Shared filesystems, NFS replacement

Notes

Sources

Questions

Answered

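Q: When to use EC vs replication? -- Use EC for cold or capacity-oriented data; the overhead depends on the profile (for example, k=4, m=2 costs 50% extra raw space). Use replication for hot data; 3x replication costs 200% extra raw space but gives faster writes.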
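The overhead figures above follow from simple arithmetic. A minimal sketch, assuming overhead is measured as extra raw capacity per unit of usable data; the function names are illustrative and not part of any Ceph API:

```python
from fractions import Fraction


def replication_overhead(size: int) -> Fraction:
    """Extra raw capacity per unit of usable data when keeping `size` full copies."""
    return Fraction(size - 1, 1)


def ec_overhead(k: int, m: int) -> Fraction:
    """Extra raw capacity per unit of usable data for k data chunks + m coding chunks."""
    return Fraction(m, k)


def as_percent(x: Fraction) -> str:
    """Format a ratio as a percentage with one decimal place."""
    return f"{float(x) * 100:.1f}%"


if __name__ == "__main__":
    print("replication size=3:", as_percent(replication_overhead(3)))
    print("EC k=4, m=2:", as_percent(ec_overhead(4, 2)))
    print("EC k=8, m=3:", as_percent(ec_overhead(8, 3)))
```

A wider profile (larger k for the same m) lowers the overhead further, at the cost of touching more OSDs per write.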
Q: How many monitors should a production cluster run? -- At least 3 monitors for quorum. Use 5 for larger clusters or when spanning racks. Prefer an odd number: quorum requires a strict majority, so an even count adds a daemon without tolerating any additional failures (4 monitors survive one failure, the same as 3). Each monitor stores a full copy of the cluster map and participates in Paxos consensus.
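The majority rule behind the monitor count can be checked with a few lines of arithmetic. This is a sketch of the general majority-quorum calculation, not Ceph code; the function names are illustrative:

```python
def quorum_size(monitors: int) -> int:
    """Smallest strict majority of `monitors`."""
    return monitors // 2 + 1


def tolerated_failures(monitors: int) -> int:
    """How many monitors can fail while a majority remains."""
    return monitors - quorum_size(monitors)


if __name__ == "__main__":
    for n in range(1, 8):
        print(f"monitors={n} quorum={quorum_size(n)} tolerated_failures={tolerated_failures(n)}")
```

Running it shows that each even count tolerates exactly as many failures as the odd count below it.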
Q: What is the minimum replication factor for data safety? -- Ceph supports size=2 (R2) as a technical minimum, but R3 (size=3, min_size=2) is strongly recommended. The Ceph documentation warns that on a long enough timeline, data stored with R2 will be lost due to the probability of simultaneous dual failures.
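The interaction between size and min_size can be illustrated with a small table generator. A minimal sketch, assuming a PG serves client I/O only while at least min_size replicas are up; the helper names are hypothetical, and the size=2, min_size=1 combination appears only for contrast:

```python
def pg_accepts_io(up_replicas: int, min_size: int) -> bool:
    """A PG serves client I/O only while at least min_size replicas are up."""
    return up_replicas >= min_size


def describe(size: int, min_size: int) -> list[str]:
    """One status line per number of failed replicas, from 0 to size."""
    rows = []
    for failed in range(size + 1):
        up = size - failed
        if not pg_accepts_io(up, min_size):
            status = "I/O blocked"
        elif up == 1:
            status = "I/O on a single copy (at risk)"
        else:
            status = f"I/O on {up} copies"
        rows.append(f"size={size} min_size={min_size} failed={failed}: {status}")
    return rows


if __name__ == "__main__":
    for size, min_size in [(3, 2), (2, 1)]:
        print("\n".join(describe(size, min_size)))
```

With size=3 and min_size=2, one failure still leaves two copies accepting writes. With size=2 and min_size=1, one failure leaves new writes on a single copy.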
Q: How does CRUSH avoid a central lookup table? -- Both clients and OSDs compute object placement independently using the same CRUSH map and a deterministic pseudo-random hash. The client hashes the object ID, takes the result modulo the number of PGs, then uses CRUSH to map the PG to an ordered list of OSDs based on the failure domain hierarchy. No query to a central service is needed.
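The two-step mapping (object to PG, PG to OSDs) can be sketched as follows. This is a toy stand-in, not real CRUSH: it assumes SHA-256 and rendezvous hashing in place of Ceph's own hash and bucket algorithms, a flat host-level failure domain, and hypothetical host and object names. It only demonstrates that any party holding the same map computes the same placement without a lookup service.

```python
import hashlib


def _h(*parts) -> int:
    """Deterministic 64-bit hash of the given parts (stand-in for Ceph's hash)."""
    data = "/".join(str(p) for p in parts).encode()
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


def object_to_pg(object_name: str, pg_num: int) -> int:
    """Step 1: hash the object name and reduce it to a placement group ID."""
    return _h(object_name) % pg_num


def pg_to_osds(pool_id: int, pg: int, hosts: dict[str, list[int]], size: int) -> list[int]:
    """Step 2: pick `size` OSDs for the PG, at most one per host (failure domain)."""
    # Rank hosts by a per-(pool, pg, host) score; every client gets the same ranking.
    ranked = sorted(hosts, key=lambda host: _h(pool_id, pg, host), reverse=True)
    # Within each chosen host, pick the OSD with the highest score.
    return [max(hosts[host], key=lambda osd: _h(pool_id, pg, osd)) for host in ranked[:size]]


def locate(object_name: str, pool_id: int, pg_num: int,
           hosts: dict[str, list[int]], size: int = 3) -> tuple[int, list[int]]:
    """Compute (pg, ordered OSD list) purely from the name and the shared map."""
    pg = object_to_pg(object_name, pg_num)
    return pg, pg_to_osds(pool_id, pg, hosts, size)


if __name__ == "__main__":
    cluster = {"host-a": [0, 1], "host-b": [2, 3], "host-c": [4, 5], "host-d": [6, 7]}
    pg, osds = locate("example-object", pool_id=1, pg_num=128, hosts=cluster)
    print(f"pg={pg} osds={osds}")
```

Because the result depends only on the object name and the map, a client and an OSD that hold the same map version agree on placement independently.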
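Q: What storage backend does Ceph OSD use? -- BlueStore has been the default backend since the Luminous release and is the only supported backend since Reef. It writes object data directly to raw block devices (no intervening filesystem) and stores metadata in an embedded RocksDB, whose database and write-ahead log (WAL) can optionally be placed on separate, faster devices. The older FileStore backend was deprecated in Quincy and is no longer supported as of Reef.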
Q: How does Ceph handle OSD failures? -- OSDs heartbeat their neighbors. If an OSD misses heartbeats beyond the grace period, peer OSDs report it to the monitors, which mark it down and publish a new OSD map. Affected PGs continue in a degraded state on the surviving replicas. If the OSD stays down past the down-out interval (mon_osd_down_out_interval, 600 seconds by default), the monitors mark it out; CRUSH then remaps its PGs to other OSDs, triggering peering and recovery.
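Ceph distinguishes an OSD that is "down" (not responding) from one that is "out" (no longer assigned data). The timeline can be sketched as a small state function. This is a simplification that collapses peer reporting and monitor decisions into two timers; the default values mirror the commonly documented osd_heartbeat_grace (20 s) and mon_osd_down_out_interval (600 s) and should be checked against the release in use. The names below are illustrative:

```python
from enum import Enum


class OsdState(Enum):
    UP_IN = "up+in"        # healthy, serving data
    DOWN_IN = "down+in"    # PGs degraded, no data movement yet
    DOWN_OUT = "down+out"  # CRUSH remaps PGs, recovery/backfill starts


def osd_state(silent_for: float,
              heartbeat_grace: float = 20.0,
              down_out_interval: float = 600.0) -> OsdState:
    """Classify an OSD by how many seconds it has been silent."""
    if silent_for < heartbeat_grace:
        return OsdState.UP_IN
    if silent_for < heartbeat_grace + down_out_interval:
        return OsdState.DOWN_IN
    return OsdState.DOWN_OUT


if __name__ == "__main__":
    for t in (5, 60, 900):
        print(f"silent for {t:>4}s -> {osd_state(t).value}")
```

The gap between the two thresholds is what lets a briefly restarted OSD rejoin without triggering a full rebalance.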
Q: Can Ceph encrypt data at rest? -- Yes. Ceph supports dm-crypt (LUKS) encryption per OSD device. Each OSD gets a unique encryption key stored in a Monitor-managed lockbox. Separately, BlueStore checksums data and metadata (crc32c by default) for integrity verification; this is not encryption.
Q: What is the role of the Ceph Manager (ceph-mgr)? -- The Manager provides the Ceph Dashboard (web UI), Prometheus metrics endpoint, and orchestrator modules (cephadm, Rook). It is required for cluster operation; at least one active and one standby are recommended for HA.
Q: How does CephFS MDS scaling work? -- Multiple active MDS daemons can split the directory tree into subtrees (and shard busy directories), distributing metadata load. Additional standby MDS daemons provide failover. For example, run 3 active + 1 standby for a mix of scaling and HA.
Q: What is Crimson? -- Crimson is an experimental next-generation OSD implementation built on Seastar, an asynchronous, shared-nothing (thread-per-core) C++ framework. It targets improved CPU efficiency and lower latency compared to the classic OSD. It was not production-ready in the Reef and Squid release series, and its SeaStore object store is still listed as a preview in Tentacle.

Open

Q: How does SeaStore (announced for Ceph Tentacle v20) compare to BlueStore for NVMe workloads? -- SeaStore is a next-generation OSD object store designed for NVMe devices with a log-structured design. Need to benchmark against BlueStore once it moves beyond preview.
Q: What is the practical PG autoscaler overhead for very large clusters (1000+ OSDs)? -- The PG autoscaler monitors pool usage and adjusts PG counts, but the rebalancing impact on large clusters needs validation under production load.
Q: How does FastEC in Ceph Tentacle v20 improve small I/O erasure coding performance? -- FastEC is listed as a major Tentacle feature. Need to evaluate benchmarks comparing FastEC vs standard EC for RBD small-block random write workloads.