Storage

Distributed storage systems for Kubernetes and cloud environments — covering unified software-defined storage, cloud-native block storage, and high-performance S3-compatible object storage.

Topics

System	Description
Ceph	Unified software-defined storage providing block (RBD), object (RGW), and file (CephFS) at exabyte scale via RADOS and the CRUSH algorithm.
Longhorn	Cloud-native distributed block storage for Kubernetes — CNCF Incubating, lightweight, with iSCSI-based engine and native snapshot/backup support.
MinIO	High-performance S3-compatible object storage with erasure coding, site replication, and IAM/STS — now primarily the commercial AIStor platform.

Comparisons

Comparison	Scope
Storage Comparison	Ceph vs Longhorn vs MinIO — use cases, architecture, performance, and operational complexity

Landscape

The Kubernetes storage ecosystem has matured around the Container Storage Interface (CSI) specification, which moved volume plugins out of the core Kubernetes codebase and enabled an ecosystem of 100+ out-of-tree drivers covering cloud-provider volumes, SAN/NAS appliances, and software-defined storage systems. S3-compatible object storage has become the de facto persistence layer for modern distributed systems: not just for unstructured data, but as the backing store for observability platforms (Loki, Tempo, Mimir), data lakes (Iceberg, Delta Lake), and AI/ML training pipelines.

Software-defined storage (Ceph, Longhorn, OpenEBS, with operators such as Rook handling deployment) abstracts away hardware heterogeneity, enabling organizations to build storage tiers on commodity servers with NVMe, SSD, and HDD pools managed by placement policies. Erasure coding has emerged as the storage-efficient alternative to full replication for large-scale object stores: MinIO uses Reed-Solomon erasure coding by default, and Ceph supports EC pools for RGW and CephFS.

Storage as a Platform

The trend toward "storage as a platform" is visible in Ceph's evolution from a POSIX filesystem into a unified block/object/file platform managed declaratively through Rook's Kubernetes operator. Rook translates CephCluster, CephBlockPool, and CephObjectStore CRDs into the underlying Ceph configuration, enabling GitOps-driven storage management alongside application workloads.
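As a rough illustration of this declarative model, the sketch below builds a CephBlockPool manifest as a Python dictionary and prints it as JSON, which kubectl accepts as well as YAML. The pool name `replicapool`, the `rook-ceph` namespace, and the field values are assumptions chosen for illustration; the Rook CRD reference for your version is the authority on supported fields.

```python
import json


def ceph_block_pool(name: str, namespace: str = "rook-ceph", replicas: int = 3) -> dict:
    """Build a minimal CephBlockPool manifest (illustrative values only)."""
    return {
        "apiVersion": "ceph.rook.io/v1",
        "kind": "CephBlockPool",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Spread replicas across hosts rather than across disks in one host.
            "failureDomain": "host",
            "replicated": {"size": replicas},
        },
    }


manifest = ceph_block_pool("replicapool")
print(json.dumps(manifest, indent=2))
```

Committing a file like this to Git and letting a GitOps controller apply it is what makes the pool definition reviewable and versioned like any other workload manifest.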

Data protection and disaster recovery are increasingly built into the storage layer — Longhorn provides scheduled snapshots with S3 backup targets, Ceph supports RBD mirroring for cross-cluster DR, and MinIO offers site replication for active-active multi-site deployments. The convergence of AI workloads and storage is driving demand for high-throughput parallel file systems and GPU-direct storage access, pushing CSI drivers to support RDMA and NVMe-oF protocols.

Key Concepts

CSI (Container Storage Interface)

A standardized gRPC-based interface between container orchestrators (Kubernetes) and storage providers, enabling storage vendors to implement a single plugin that works across any CSI-compatible orchestrator.

CSI defines three services:

Identity: Capability reporting and plugin health checks
Controller: Volume create/delete, snapshot create/delete, volume expand, and topology awareness
Node: Mount/unmount, stage/unstage, and node-local operations

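Kubernetes exposes CSI through StorageClass, PersistentVolume, PersistentVolumeClaim, and VolumeSnapshot objects. A CSI driver is typically deployed as a controller plugin running alongside the Kubernetes CSI sidecar containers (such as external-provisioner, external-attacher, and external-snapshotter), which watch the Kubernetes API and translate it into CSI calls, with node plugins deployed as DaemonSets.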
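To make the object model concrete, here is a minimal sketch that builds a PersistentVolumeClaim and a VolumeSnapshot of it as Python dictionaries. The names `db-data`, `fast`, and `csi-snapclass` are hypothetical placeholders, not values from any particular driver.

```python
import json


def pvc(name: str, storage_class: str, size: str, access_mode: str = "ReadWriteOnce") -> dict:
    """PersistentVolumeClaim requesting `size` from the given StorageClass."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            "accessModes": [access_mode],
            "storageClassName": storage_class,
            "resources": {"requests": {"storage": size}},
        },
    }


def volume_snapshot(name: str, pvc_name: str, snapshot_class: str) -> dict:
    """VolumeSnapshot that asks the CSI driver to snapshot an existing PVC."""
    return {
        "apiVersion": "snapshot.storage.k8s.io/v1",
        "kind": "VolumeSnapshot",
        "metadata": {"name": name},
        "spec": {
            "volumeSnapshotClassName": snapshot_class,
            "source": {"persistentVolumeClaimName": pvc_name},
        },
    }


claim = pvc("db-data", storage_class="fast", size="20Gi")
snap = volume_snapshot("db-data-snap", pvc_name="db-data", snapshot_class="csi-snapclass")
print(json.dumps([claim, snap], indent=2))
```

Creating the claim leads to a controller-side volume create call, and scheduling a pod that uses it leads to the node-side stage and mount calls described above.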

Block vs Object vs File

Storage Modalities

Block storage (RBD, EBS, Persistent Disks): Raw, fixed-size blocks exposed as a disk device, usually formatted with a local filesystem such as ext4 or XFS that supplies the POSIX semantics. Ideal for databases and stateful workloads that need low-latency access. Mounted by a single node (ReadWriteOnce) in most implementations.
Object storage (S3, RGW, MinIO): Flat namespace of key-value blobs accessed via HTTP APIs (PUT/GET/DELETE). Scales to billions of objects with built-in metadata. No POSIX semantics — not mountable as a filesystem without FUSE adapters (s3fs, goofys). Dominant for analytics, backups, and observability data.
File storage (CephFS, NFS, EFS): Hierarchical filesystem accessible by multiple clients simultaneously (ReadWriteMany). Necessary for shared workloads like CMS platforms, legacy applications, and build caches. CephFS provides POSIX compliance at scale using MDS (Metadata Server) for namespace operations.
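The "flat namespace" point is easy to miss, because object-store tooling often displays folders. The following sketch emulates an S3-style prefix and delimiter listing over a plain dictionary to show that "directories" are only shared key prefixes. The function `list_prefix` and the sample keys are hypothetical; this is not any SDK's API.

```python
def list_prefix(keys, prefix="", delimiter="/"):
    """Return (objects, common_prefixes) directly under `prefix`.

    Keys that contain another delimiter after the prefix are rolled up
    into a common prefix, which is what makes a flat keyspace look like
    a directory tree.
    """
    objects, common = [], set()
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter in rest:
            common.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
        else:
            objects.append(key)
    return objects, sorted(common)


# A bucket is just a mapping from full key to bytes; there are no directories.
bucket = {
    "logs/2024/01/app.log": b"a",
    "logs/2024/02/app.log": b"b",
    "logs/README.txt": b"c",
    "backup.tar": b"d",
}

top_level = list_prefix(bucket)
under_logs = list_prefix(bucket, "logs/")
print(top_level)
print(under_logs)
```

Because nothing exists at `logs/` itself, operations such as renaming a "directory" turn into a copy and delete per key, one reason object stores are a poor fit for workloads that expect filesystem semantics.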

Erasure Coding

A data protection scheme that splits data into k data fragments and m parity fragments, tolerating up to m simultaneous drive failures while consuming only (k+m)/k times the size of the original data, which is significantly more storage-efficient than full replication. MinIO's default parity depends on the erasure set size: an 8-drive set, for example, runs as EC 4+4 (2x overhead), and parity can be raised to half the drives in a set (up to EC 8+8 on 16 drives) for higher durability. Ceph supports EC pools for object (RGW) and file (CephFS) workloads but recommends 3x replication for latency-sensitive block (RBD) workloads, because every EC read gathers chunks from multiple OSDs and small overwrites require a read-modify-write cycle. The trade-off is always storage efficiency versus read/write latency and CPU overhead for parity computation.
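The (k+m)/k overhead formula is simple enough to check by hand, but a small calculator makes the comparison with replication explicit. This is a sketch; the function names are invented for illustration and model capacity only, not latency or CPU cost.

```python
def ec_profile(k: int, m: int) -> dict:
    """Capacity characteristics of a k data + m parity erasure-coded layout."""
    if k < 1 or m < 0:
        raise ValueError("k must be >= 1 and m must be >= 0")
    return {
        "overhead": (k + m) / k,          # raw bytes stored per logical byte
        "usable_fraction": k / (k + m),   # share of raw capacity holding data
        "failures_tolerated": m,
    }


def replication_profile(copies: int) -> dict:
    """The same characteristics for full replication with `copies` copies."""
    if copies < 1:
        raise ValueError("copies must be >= 1")
    return {
        "overhead": float(copies),
        "usable_fraction": 1 / copies,
        "failures_tolerated": copies - 1,
    }


for label, profile in [
    ("EC 4+4", ec_profile(4, 4)),
    ("EC 8+4", ec_profile(8, 4)),
    ("3x replication", replication_profile(3)),
]:
    print(label, profile)
```

Under these assumptions EC 8+4 survives four drive failures at 1.5x overhead, whereas 3x replication survives two at 3x overhead, which is the efficiency gap the paragraph describes.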

Replication Factor

The number of copies of each data unit maintained across storage nodes for durability and availability. Ceph defaults to 3x replication for pool type "replicated" — each write is committed to a primary OSD and two replicas before acknowledging the client. Longhorn similarly defaults to 3 replicas with synchronous writes. Higher replication factors increase durability (tolerating more simultaneous node failures) but linearly increase storage consumption and write amplification. The choice between replication and erasure coding depends on the workload's I/O pattern: replication for latency-sensitive random I/O (databases), erasure coding for throughput-oriented sequential I/O (object storage, backups).
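The synchronous write path described above can be sketched as a toy model in which a write is acknowledged only once every replica has committed it. The `Replica` class and `replicated_write` function are hypothetical, and the model deliberately ignores rollback, degraded-mode writes, and recovery, all of which real systems such as Ceph and Longhorn handle.

```python
class Replica:
    """One storage node holding a copy of the data (toy in-memory model)."""

    def __init__(self, name: str, healthy: bool = True):
        self.name = name
        self.healthy = healthy
        self.blocks = {}

    def commit(self, key: str, data: bytes) -> bool:
        """Persist the block; report failure if the node is unavailable."""
        if not self.healthy:
            return False
        self.blocks[key] = data
        return True


def replicated_write(replicas, key: str, data: bytes) -> bool:
    """Acknowledge the client only if every replica committed the write."""
    acks = [replica.commit(key, data) for replica in replicas]
    return all(acks)


replicas = [Replica("node-a"), Replica("node-b"), Replica("node-c")]
ok = replicated_write(replicas, "obj1", b"payload")

replicas[2].healthy = False  # simulate a node failure
degraded = replicated_write(replicas, "obj2", b"payload")
print(ok, degraded)
```

Each extra replica adds another commit that must finish before the acknowledgement, which is where the write amplification and latency cost of higher replication factors comes from.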

Storage Classes

Kubernetes StorageClass objects define tiers of storage with different performance, durability, and cost characteristics. Each StorageClass specifies a CSI provisioner, reclaim policy (Delete or Retain), volume binding mode (Immediate or WaitForFirstConsumer), and provider-specific parameters (IOPS, throughput, filesystem type, encryption). Platform teams typically expose 3-4 storage classes: fast (NVMe-backed, high IOPS for databases), standard (SSD, general purpose), bulk (HDD or erasure-coded, high capacity for logs/backups), and shared (ReadWriteMany file storage). WaitForFirstConsumer binding delays provisioning until a pod using the claim has been scheduled, so the volume is created in the same availability zone as that pod, which avoids both unschedulable pods and cross-AZ latency.
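A tiered set of storage classes might be generated along the following lines. The provisioner names (`block.csi.example.com`, `file.csi.example.com`) and the `tier` parameter are hypothetical stand-ins for whatever the chosen CSI driver documents; `csi.storage.k8s.io/fstype` is the standard key for the filesystem type.

```python
import json


def storage_class(name, provisioner, parameters,
                  reclaim_policy="Delete",
                  binding_mode="WaitForFirstConsumer",
                  allow_expansion=True) -> dict:
    """Build a StorageClass manifest; parameter values must be strings."""
    return {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "StorageClass",
        "metadata": {"name": name},
        "provisioner": provisioner,
        "parameters": parameters,
        "reclaimPolicy": reclaim_policy,
        "volumeBindingMode": binding_mode,
        "allowVolumeExpansion": allow_expansion,
    }


# Hypothetical tiers; provisioners and the "tier" parameter are placeholders.
tiers = [
    storage_class("fast", "block.csi.example.com",
                  {"tier": "nvme", "csi.storage.k8s.io/fstype": "xfs"},
                  reclaim_policy="Retain"),
    storage_class("standard", "block.csi.example.com",
                  {"tier": "ssd", "csi.storage.k8s.io/fstype": "ext4"}),
    storage_class("bulk", "block.csi.example.com",
                  {"tier": "hdd", "csi.storage.k8s.io/fstype": "ext4"}),
    storage_class("shared", "file.csi.example.com", {"tier": "ssd"},
                  binding_mode="Immediate"),
]

print(json.dumps(tiers, indent=2))
```

Using Retain on the database tier is a cautious choice so that deleting a claim does not immediately destroy the data; the other tiers keep the default Delete policy.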

Open Questions

As S3-compatible APIs become the universal storage interface for distributed systems, does block storage (RBD, Longhorn) remain necessary beyond stateful database workloads, or will S3-native architectures (Iceberg, Delta Lake) subsume most persistent storage needs?
How does Ceph's operational complexity compare to running separate purpose-built systems (Longhorn for block, MinIO for object) — at what scale does Ceph's unified architecture justify the operational investment?
With NVMe-oF (NVMe over Fabrics) enabling remote NVMe access with near-local latency, will software-defined storage on commodity hardware lose its cost advantage to disaggregated NVMe pools managed by dedicated controllers?