Operations

Scope

Cluster deployment, CRUSH map management, pool tuning, OSD operations, and health monitoring.

Cluster Architecture

Component | Role                               | Min Count
MON       | Cluster state, Paxos consensus     | 3 (odd number)
MGR       | Metrics, dashboard, orchestrator   | 2 (active/standby)
OSD       | Data storage (1 per disk)          | 3+
MDS       | CephFS metadata (if using CephFS)  | 2+
RGW       | S3/Swift gateway (if using object) | 2+
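
The odd-number recommendation for MONs follows from majority quorum: with n monitors, more than half must be reachable. The arithmetic can be sketched as below. The function names are illustrative helpers, not Ceph commands.

```shell
# Illustrative helpers (not part of Ceph): majority-quorum arithmetic for MONs.

# Number of monitors that must be up to form a quorum.
mon_quorum() {
    echo $(( $1 / 2 + 1 ))
}

# Number of monitor failures the cluster can tolerate while keeping quorum.
mon_failures_tolerated() {
    echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 3 4 5; do
    echo "mons=$n quorum=$(mon_quorum "$n") tolerated=$(mon_failures_tolerated "$n")"
done
```

Four monitors tolerate the same single failure as three, which is why an even count adds cost without adding resilience.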

Deployment Methods

# Cephadm (recommended for new clusters)
cephadm bootstrap --mon-ip 10.0.0.1 --initial-dashboard-user admin

# Add hosts
ceph orch host add node2 10.0.0.2
ceph orch host add node3 10.0.0.3

# Deploy OSDs on all available devices
ceph orch apply osd --all-available-devices

Pool Management

# Create replicated pool
ceph osd pool create mypool 128 128 replicated

# Create erasure-coded pool (default erasure-code profile; lower raw-space overhead than replication)
ceph osd pool create ecpool 128 128 erasure

# Set replication factor (pool set takes one key/value pair per call)
ceph osd pool set mypool size 3
ceph osd pool set mypool min_size 2

# Enable compression
ceph osd pool set mypool compression_algorithm zstd
ceph osd pool set mypool compression_mode aggressive
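
The two 128 values in the create commands are pg_num and pgp_num. As a rough sketch of where such a number can come from, the helper below applies a commonly cited rule of thumb: about 100 PGs per OSD, divided by the replica count, rounded up to a power of two. Both the rule and the function name are assumptions for illustration. On recent releases the PG autoscaler can manage this value instead.

```shell
# Hypothetical helper: suggest a pg_num for a pool expected to hold most of the data.
# Assumed rule of thumb: (OSD count * 100) / replica size, rounded up to a power of two.
suggest_pg_num() {
    osds=$1
    size=$2
    target=$(( osds * 100 / size ))
    pg=1
    while [ "$pg" -lt "$target" ]; do
        pg=$(( pg * 2 ))
    done
    echo "$pg"
}

suggest_pg_num 3 3    # 3 OSDs, size 3 -> prints 128
```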

CRUSH Map

# View CRUSH hierarchy
ceph osd tree

# Create failure domain rule
ceph osd crush rule create-replicated replicated_rack default rack

# Move OSD to specific host/rack
ceph osd crush set osd.5 1.0 root=default datacenter=dc1 rack=rack2 host=node5

Health & Monitoring

# Cluster health
ceph health detail
ceph status

# OSD performance
ceph osd perf
ceph osd df

# PG status
ceph pg stat
ceph pg dump_stuck unclean

# Prometheus metrics (via MGR)
ceph mgr module enable prometheus
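
For simple alerting outside Prometheus, the health string can be mapped to an exit code. This is a minimal sketch. The exit-code convention (0 OK, 1 warning, 2 critical, 3 unknown) is borrowed from common monitoring plugins and is an assumption, not something Ceph defines.

```shell
# Hypothetical helper: map a Ceph health string to a monitoring-plugin style exit code.
# Assumed convention: 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN.
health_exit_code() {
    case "$1" in
        HEALTH_OK*)   return 0 ;;
        HEALTH_WARN*) return 1 ;;
        HEALTH_ERR*)  return 2 ;;
        *)            return 3 ;;
    esac
}

# Usage on a cluster node (commented out so the sketch runs without Ceph):
# health_exit_code "$(ceph health)"; exit $?
```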

Common Issues

Issue                     | Diagnosis                   | Fix
HEALTH_WARN: PGs degraded | ceph health detail          | Wait for recovery or add OSDs
OSD down                  | ceph osd tree               | Check disk, restart OSD daemon
Slow requests             | ceph daemon osd.X perf dump | Check disk latency, network
Near-full OSDs            | ceph osd df                 | Reweight, add storage, delete data
Clock skew                | ceph health detail          | Configure NTP on all nodes
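
To spot near-full OSDs before the cluster warns about them, utilization figures can be filtered against a threshold. The sketch below assumes the input has already been reduced to "name percent" pairs, one per line. The function name and the sample data are hypothetical.

```shell
# Hypothetical filter: print OSDs whose utilization meets or exceeds a threshold.
# Input: one "name percent" pair per line on stdin.
# The threshold defaults to 85, matching Ceph's default nearfull ratio of 0.85.
nearfull_osds() {
    awk -v limit="${1:-85}" '$2 + 0 >= limit + 0 { printf "%s %.1f%%\n", $1, $2 }'
}

# Sample data standing in for real cluster output.
printf '%s\n' 'osd.0 62.4' 'osd.1 88.1' 'osd.2 91.7' | nearfull_osds 85
```

On a live cluster the pairs could probably be produced with `ceph osd df -f json | jq -r '.nodes[] | "\(.name) \(.utilization)"'`. The JSON field names are an assumption here and should be checked against the output of the installed release.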

Commands & Recipes

Cluster Health

# Quick status
ceph status
ceph health detail

# OSD tree (disk layout)
ceph osd tree

# Disk usage
ceph df
ceph osd df tree

# PG status
ceph pg stat
ceph pg dump_stuck

Pool Management

# Create replicated pool
ceph osd pool create mypool 128 128 replicated
ceph osd pool set mypool size 3
ceph osd pool set mypool min_size 2

# Create erasure-coded pool
ceph osd pool create ecpool 128 128 erasure

# Enable application
ceph osd pool application enable mypool rbd

RBD (Block Storage)

# Create RBD image (100 GiB; --size is in MiB when no unit suffix is given)
rbd create mypool/myimage --size 102400

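# Map to kernel device (the command prints the device path; /dev/rbd0 assumes this is the first mapping)
rbd device map mypool/myimage
mkfs.xfs /dev/rbd0
mkdir -p /mnt/rbd
mount /dev/rbd0 /mnt/rbd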

# Snapshot
rbd snap create mypool/myimage@snap1

# Roll back to the snapshot (overwrites the current image contents; unmount first)
rbd snap rollback mypool/myimage@snap1

RGW (Object Storage / S3)

# Create RGW user
radosgw-admin user create --uid=myuser --display-name="My User"
radosgw-admin user info --uid=myuser  # get access/secret keys

# S3 access test (aws cli)
aws --endpoint-url=http://rgw-host:7480 s3 mb s3://mybucket
aws --endpoint-url=http://rgw-host:7480 s3 cp file.txt s3://mybucket/

CephFS

# Create CephFS
ceph fs volume create myfs

# Mount (kernel client)
mount -t ceph mon1:/ /mnt/cephfs -o name=admin,secret=<key>

# Mount (FUSE)
ceph-fuse /mnt/cephfs
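
For a persistent kernel mount, the key is usually better kept in a file than passed inline with secret=. The sketch below builds an /etc/fstab entry using the secretfile= mount option. The helper name and the /etc/ceph/admin.secret path are assumptions for illustration.

```shell
# Hypothetical helper: build an /etc/fstab entry for a kernel CephFS mount.
# Uses secretfile= so the key itself never appears in fstab or on a command line.
cephfs_fstab_line() {
    mon=$1
    mountpoint=$2
    user=$3
    secretfile=$4
    printf '%s:/ %s ceph name=%s,secretfile=%s,_netdev,noatime 0 0\n' \
        "$mon" "$mountpoint" "$user" "$secretfile"
}

# Example path for the secret file; adjust to wherever the key is stored.
cephfs_fstab_line mon1 /mnt/cephfs admin /etc/ceph/admin.secret
```

The _netdev option tells the init system to wait for networking before attempting the mount.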

Deployment (Cephadm)

# Bootstrap new cluster
cephadm bootstrap --mon-ip 10.0.0.1

# Add hosts
ceph orch host add node2 10.0.0.2
ceph orch host add node3 10.0.0.3

# Deploy OSDs on all available disks
ceph orch apply osd --all-available-devices

# Deploy RGW
ceph orch apply rgw myrgw --placement="count:2"

# Deploy MDS (for CephFS)
ceph orch apply mds myfs --placement="count:2"
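
The --all-available-devices flag can also be expressed as a service spec file, which is easier to review and keep in version control. This is a minimal sketch, assuming a release whose OSD specs nest device filters under spec:. The service_id value is an arbitrary illustrative name.

```shell
# Write a cephadm OSD service spec to a temporary file.
# "all_devices" is an arbitrary, illustrative service_id.
spec_file=$(mktemp)
cat > "$spec_file" <<'EOF'
service_type: osd
service_id: all_devices
placement:
  host_pattern: '*'
spec:
  data_devices:
    all: true
EOF

echo "Spec written to $spec_file"

# Preview, then apply, on a cluster node (commented out so the sketch runs without Ceph):
# ceph orch apply -i "$spec_file" --dry-run
# ceph orch apply -i "$spec_file"
```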

Troubleshooting

# Find slow OSDs
ceph osd perf

# Check CRUSH map
ceph osd crush dump | jq '.buckets'

# Recovery status
ceph pg dump | grep -i recovering

# Check for data inconsistency
ceph health detail | grep inconsistent
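
After a restart or a reweight it is often useful to block until the cluster settles. The helper below polls a health command until it reports HEALTH_OK or the attempts run out. The function name and its arguments are hypothetical. The command to run is passed in so the sketch can be exercised without a cluster.

```shell
# Hypothetical helper: poll a health command until it reports HEALTH_OK.
# $1 = command that prints the health string (e.g. "ceph health")
# $2 = maximum number of checks, $3 = seconds to sleep between checks
wait_for_health_ok() {
    cmd=$1
    attempts=$2
    interval=$3
    i=0
    status=""
    while [ "$i" -lt "$attempts" ]; do
        status=$($cmd 2>/dev/null)
        case "$status" in
            HEALTH_OK*)
                echo "healthy after $(( i + 1 )) check(s)"
                return 0
                ;;
        esac
        i=$(( i + 1 ))
        sleep "$interval"
    done
    echo "gave up after $attempts check(s); last status: $status"
    return 1
}

# On a cluster node (commented out so the sketch runs without Ceph):
# wait_for_health_ok "ceph health" 60 10
```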
