Architecture¶
Longhorn is a cloud-native distributed block storage system for Kubernetes. It implements storage using containers and microservices, creating a dedicated storage controller (the Longhorn Engine) for each volume and synchronously replicating data across multiple replicas on separate nodes.
High-Level Component Diagram¶
```mermaid
graph TB
    subgraph Kubernetes Cluster
        subgraph Control Plane
            CSI[CSI Driver + Provisioner]
            LHM[Longhorn Manager<br/>DaemonSet per node]
            UI[Longhorn UI]
        end
        subgraph Node A
            ENGINE_A[Longhorn Engine V1/V2<br/>Per-volume controller]
            REPLICA_A1[Replica A1]
            REPLICA_A2[Replica A2<br/>on Node B]
            REPLICA_A3[Replica A3<br/>on Node C]
        end
        subgraph Data Plane Components
            IM[Instance Manager<br/>Engine + Replica pods]
            SM[Share Manager<br/>RWX NFS server]
            BIM[Backing Image Manager]
        end
    end
    subgraph External
        S3[(Backup Target<br/>S3 / NFS)]
    end
    K8S_API[Kubernetes API Server]
    POD[Application Pod]
    POD -->|PVC| CSI
    CSI -->|Longhorn API| LHM
    LHM -->|Watches CRs| K8S_API
    UI -->|Longhorn API| LHM
    LHM --> ENGINE_A
    ENGINE_A -->|Sync replication| REPLICA_A1
    ENGINE_A -->|Sync replication| REPLICA_A2
    ENGINE_A -->|Sync replication| REPLICA_A3
    LHM --> IM
    LHM --> SM
    LHM --> BIM
    LHM -->|Backups| S3
```
Two-Layer Design¶
Longhorn separates concerns into a control plane and a data plane:
| Layer | Component | Role |
|---|---|---|
| Control Plane | Longhorn Manager | Orchestrates volumes, handles CSI calls, manages CRs |
| Data Plane | Longhorn Engine | Per-volume storage controller, synchronous replication |
| Data Plane | Instance Manager | Hosts the engine and replica processes inside a system-managed pod |
| Data Plane | Share Manager | Provides NFS-based RWX access |
Longhorn Manager¶
The Longhorn Manager runs as a Kubernetes DaemonSet with one pod per node. It follows the Kubernetes controller/operator pattern:
- Watches for Longhorn Custom Resources (volumes, engines, replicas, nodes, settings) via the Kubernetes API server.
- When a new Volume CR is created (triggered by a CSI provisioning request), the Manager on the target node creates a Longhorn Engine instance and schedules replicas on separate nodes.
- Handles all API calls from the Longhorn UI and the CSI plugin.
- Manages the lifecycle of Instance Manager pods, Backing Image Manager pods, and Share Manager pods.
Longhorn Engine (Data Plane)¶
The Engine is the per-volume storage controller. It always runs on the same node as the Pod consuming the volume, which keeps the data path local and avoids an extra network hop for I/O.
V1 Engine (GA)¶
- Exposes a block device to the host via iSCSI (requires `open-iscsi` or `iscsiadm` on the host).
- Engine and replicas run as Linux processes inside Instance Manager pods.
- Synchronously replicates writes to all healthy replicas.
V2 Engine (Technical Preview)¶
- Built on SPDK (Storage Performance Development Kit) for user-space I/O, bypassing the kernel.
- Engine operates as an SPDK RAID bdev; replicas as SPDK logical volume bdevs.
- Frontend options: NVMe-TCP (requires the `nvme_tcp` kernel module) or UBLK (requires the `ublk_drv` module and huge pages).
- Delivers lower latency and higher IOPS/throughput compared to V1.
V1 vs V2 Feature Parity
V2 Data Engine is currently a Technical Preview. Not all V1 features are available in V2. Check the official feature parity matrix before using V2 in production.
Volume Replication¶
Replica Placement and Synchronous Replication¶
```mermaid
sequenceDiagram
    participant App as Application Pod
    participant Eng as Longhorn Engine
    participant R1 as Replica 1 (Node A)
    participant R2 as Replica 2 (Node B)
    participant R3 as Replica 3 (Node C)
    App->>Eng: WRITE data block
    Eng->>R1: Sync write (local)
    Eng->>R2: Sync write (network)
    Eng->>R3: Sync write (network)
    R1-->>Eng: ACK
    R2-->>Eng: ACK
    R3-->>Eng: ACK
    Eng-->>App: Write complete
```
Key properties:
- Synchronous replication: The Engine waits for all healthy replicas to acknowledge before completing the write. This guarantees crash consistency.
- Failure tolerance: With `N` replicas, the volume tolerates `N-1` replica failures and stays operational.
- Default replica count: Configurable globally via Longhorn settings; can be overridden per StorageClass (`numberOfReplicas` parameter).
- Replica rebuilding: When a replica fails, the Manager creates a blank replica on another node. The Engine pauses I/O briefly, takes a system snapshot of all healthy replicas, adds the new replica in write-only mode, resumes I/O, then syncs historical data in the background. Once sync completes, the replica enters read-write mode.
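The acknowledgement rule and the `N-1` failure tolerance described above can be illustrated with a small in-memory model. This is a sketch only: `ToyReplica` and `ToyEngine` are hypothetical names, not Longhorn code, and the model ignores networking, rebuilds, and snapshots.

```python
from dataclasses import dataclass, field


@dataclass
class ToyReplica:
    """Hypothetical in-memory stand-in for one replica (not Longhorn code)."""
    name: str
    healthy: bool = True
    blocks: dict = field(default_factory=dict)

    def write(self, offset: int, data: bytes) -> bool:
        """Store the block and return True as the acknowledgement."""
        self.blocks[offset] = data
        return True


class ToyEngine:
    """Hypothetical per-volume controller that fans each write out to replicas."""

    def __init__(self, replicas):
        self.replicas = list(replicas)

    def healthy_replicas(self):
        return [r for r in self.replicas if r.healthy]

    def write(self, offset: int, data: bytes) -> bool:
        targets = self.healthy_replicas()
        if not targets:
            raise IOError("no healthy replica left; volume is unavailable")
        # Collect an ACK from every healthy replica before completing the write.
        acks = [r.write(offset, data) for r in targets]
        return all(acks)


replicas = [ToyReplica("r1"), ToyReplica("r2"), ToyReplica("r3")]
engine = ToyEngine(replicas)
engine.write(0, b"first")

# Lose N-1 = 2 replicas: the surviving replica keeps the volume writable.
replicas[1].healthy = False
replicas[2].healthy = False
engine.write(4096, b"second")
```

In this model a failed replica simply stops receiving writes, which is why a rebuilt replica has to catch up on historical data before it can serve reads.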
Replica Storage: Sparse Files and Read Index¶
Each replica is stored as a Linux sparse file on the host disk, providing thin provisioning out of the box. A 1 TB volume that contains 10 GB of actual data consumes only 10 GB on disk.
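The gap between a sparse file's apparent size and its allocated size can be observed directly. The sketch below is a generic POSIX demonstration, not Longhorn code; the helper name `sparse_file_usage` is hypothetical, and it assumes a filesystem that supports sparse files and reports `st_blocks` in 512-byte units, as Linux does.

```python
import os
import tempfile


def sparse_file_usage(apparent_size: int, payload: bytes):
    """Create a sparse file, write payload at offset 0, return (apparent, allocated) bytes."""
    with tempfile.NamedTemporaryFile(delete=False) as f:
        path = f.name
        f.truncate(apparent_size)  # extends the file without allocating data blocks
        f.seek(0)
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())
    try:
        st = os.stat(path)
        # st_size is the apparent size; st_blocks counts the units actually allocated.
        return st.st_size, st.st_blocks * 512
    finally:
        os.remove(path)


# A 1 GiB file holding a single 4 KB block of real data.
apparent, allocated = sparse_file_usage(1 << 30, b"x" * 4096)
print(f"apparent={apparent} bytes, allocated={allocated} bytes")
```

On a typical Linux filesystem the allocated figure is a few kilobytes, while the apparent size stays at 1 GiB.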
Longhorn maintains an in-memory read index per replica: a byte array with one byte per 4 KB block that records which layer (a snapshot or the live data) holds the most recent data for that block. The index avoids scanning the entire snapshot chain on reads. Because each entry is a single byte, it also limits each volume to 254 snapshots.
- A 1 TB volume consumes approximately 256 MB of read index memory per replica.
- Write operations reset the read index entry to point to live data.
- Read operations traverse the index to find the correct source layer.
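The memory figure and the lookup behaviour above can be sketched in a few lines. `ToyReadIndex` is a hypothetical model, and its encoding (0 for "never written", layer ids 1 to 255 with the highest id as live data) is an assumption made for illustration, not Longhorn's actual on-disk or in-memory format.

```python
BLOCK_SIZE = 4 * 1024  # one index entry per 4 KB block (from the text)


def read_index_bytes(volume_size_bytes: int) -> int:
    """Memory needed for the read index: one byte per 4 KB block."""
    return volume_size_bytes // BLOCK_SIZE


class ToyReadIndex:
    """Toy layered volume: layer ids 1..255, where the highest id is the live data."""

    EMPTY = 0  # assumption: 0 marks a block that was never written

    def __init__(self, num_blocks: int):
        self.index = bytearray(num_blocks)  # one byte per block, all EMPTY
        self.layers = {1: {}}               # layer id -> {block number: data}
        self.live = 1

    def write(self, block_no: int, data: bytes) -> None:
        self.layers[self.live][block_no] = data
        self.index[block_no] = self.live    # entry now points at live data

    def snapshot(self) -> None:
        if self.live == 255:
            raise RuntimeError("a one-byte entry cannot address another layer")
        self.live += 1                      # the old live layer becomes a snapshot
        self.layers[self.live] = {}

    def read(self, block_no: int):
        layer_id = self.index[block_no]     # single lookup, no chain scan
        if layer_id == self.EMPTY:
            return None
        return self.layers[layer_id][block_no]


one_tib = 1 << 40
print(read_index_bytes(one_tib) // (1 << 20), "MiB of index for a 1 TiB volume")

vol = ToyReadIndex(num_blocks=8)
vol.write(3, b"v1")
vol.snapshot()
before = vol.read(3)  # served from the snapshot layer
vol.write(3, b"v2")
after = vol.read(3)   # served from live data after the overwrite
```

Under this assumed encoding, 255 usable layer ids leave room for 254 snapshots plus the live layer, which is consistent with the limit stated above.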
CSI Driver Integration¶
```mermaid
sequenceDiagram
    participant PVC as PersistentVolumeClaim
    participant K8s as Kubernetes API
    participant CSI as Longhorn CSI Plugin
    participant LHM as Longhorn Manager
    participant Eng as Longhorn Engine
    PVC->>K8s: Provision request
    K8s->>CSI: CreateVolume
    CSI->>LHM: Create Longhorn Volume CR
    LHM->>Eng: Start Engine + Replicas
    Eng-->>LHM: Volume ready
    LHM-->>CSI: Volume path
    CSI->>K8s: PV created + formatted + mounted
    K8s-->>PVC: Bound and ready
```
The Longhorn CSI driver handles:
- CreateVolume / DeleteVolume: Provisioning and teardown.
- ControllerPublishVolume / ControllerUnpublishVolume: Attach/detach to nodes.
- NodeStageVolume / NodePublishVolume: Format and mount the block device into the Pod.
- CreateSnapshot / DeleteSnapshot: Snapshot management.
- ExpandVolume: Online volume expansion.
For encrypted volumes, the CSI driver passes encryption secrets (stored as Kubernetes Secrets) to dm_crypt / cryptsetup on the host.
Instance Manager¶
The Instance Manager is a system-managed pod (one per node per engine version) that runs the Engine and Replica processes. It replaces the older model of one pod per engine/replica, reducing pod overhead. Longhorn automatically manages the Instance Manager lifecycle.
Share Manager (RWX Volumes)¶
Longhorn provides ReadWriteMany (RWX) access by deploying a Share Manager pod that runs an NFS server backed by the Longhorn volume. The NFS export is exposed as a Kubernetes Service, and the CSI driver mounts the NFS share into the requesting Pods.
- Each RWX volume gets its own Share Manager pod.
- Share Manager pods are managed by a dedicated Deployment controller.
- RWX volumes do not support Block (volumeMode: Block) mode.
Backup Architecture¶
```mermaid
flowchart LR
    subgraph Primary Storage
        VOL[Longhorn Volume]
        SNAP[Snapshot chain]
    end
    subgraph Secondary Storage
        BS[Backupstore<br/>S3 or NFS]
        BK1[Backup 1<br/>2 MB blocks]
        BK2[Backup 2<br/>Incremental 2 MB blocks]
    end
    VOL --> SNAP
    SNAP -->|Flatten + diff| BK1
    SNAP -->|Incremental diff| BK2
    BK1 -->|Shared 2 MB blocks| BK2
```
- Backup target: Configured as an S3-compatible endpoint or NFS share external to the cluster.
- Incremental backups: Each backup transmits only changed 2 MB blocks since the previous backup. Block-level checksums provide deduplication within the same volume.
- Disaster Recovery (DR) volumes: A DR volume in a secondary cluster incrementally restores from the backupstore. On failover, the DR volume is activated and becomes a normal Longhorn volume.
- Recurring backups/snapshots: Configurable schedules per volume or per StorageClass.
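The way 2 MB blocks and checksums combine to make backups incremental can be modelled with a checksum-keyed block store. This is a simplified sketch: the `backup` and `restore` helpers and the dict-based store are hypothetical, SHA-256 is an assumed checksum, and the model hashes every block instead of diffing snapshots as the text describes.

```python
import hashlib

BACKUP_BLOCK_SIZE = 2 * 1024 * 1024  # 2 MB backup blocks (from the text)


def backup(data: bytes, store: dict) -> list:
    """Upload blocks missing from `store`; return the backup's ordered checksum list."""
    manifest = []
    for offset in range(0, len(data), BACKUP_BLOCK_SIZE):
        block = data[offset:offset + BACKUP_BLOCK_SIZE]
        digest = hashlib.sha256(block).hexdigest()
        if digest not in store:
            store[digest] = block  # only new or changed blocks are transmitted
        manifest.append(digest)
    return manifest


def restore(manifest: list, store: dict) -> bytes:
    """Rebuild the volume contents from a backup's checksum list."""
    return b"".join(store[digest] for digest in manifest)


store = {}  # hypothetical stand-in for the S3 or NFS backupstore

# First backup: three distinct 2 MB blocks are uploaded.
v1 = b"".join(bytes([i]) * BACKUP_BLOCK_SIZE for i in (1, 2, 3))
backup1 = backup(v1, store)
blocks_after_first = len(store)

# Second backup: only the middle block changed, so one new block is uploaded.
v2 = v1[:BACKUP_BLOCK_SIZE] + bytes([9]) * BACKUP_BLOCK_SIZE + v1[2 * BACKUP_BLOCK_SIZE:]
backup2 = backup(v2, store)
blocks_after_second = len(store)
```

Both backups remain restorable because unchanged blocks are shared between them by checksum.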
Key Architectural Properties¶
- One engine per volume: Failure domains are isolated; a controller crash affects only one volume.
- Microservices-based: Engine, replicas, Instance Manager, and Share Manager are all orchestrated as Kubernetes resources.
- Thin provisioning: Volumes consume only the space actually written.
- Crash-consistent: Longhorn runs `sync` before creating snapshots, but OS-level cache may contain unflushed data at crash time.
- Live upgrades: Engines and replicas can be upgraded without disrupting I/O, using rolling upgrade jobs.
How It Works¶
Per-volume engine model, synchronous replication, snapshot mechanics, and CSI integration.
Per-Volume Engine Architecture¶
Unlike Ceph's shared daemon model, Longhorn assigns a dedicated Engine process to each volume. This isolates failures: a bug or crash in one volume's engine does not affect other volumes.
```mermaid
flowchart TB
    subgraph Node1["Node 1"]
        E1["Engine (Vol-1)\n(storage controller)"]
        R1A["Replica 1A\n(Vol-1 data)"]
        R2A["Replica 2A\n(Vol-2 data)"]
        E2["Engine (Vol-2)"]
    end
    subgraph Node2["Node 2"]
        R1B["Replica 1B\n(Vol-1 data)"]
    end
    subgraph Node3["Node 3"]
        R1C["Replica 1C\n(Vol-1 data)"]
    end
    Pod1["Pod (uses Vol-1)"] --> E1
    E1 -->|"sync write"| R1A
    E1 -->|"sync write"| R1B
    E1 -->|"sync write"| R1C
    style Node1 fill:#2e7d32,color:#fff
```
Write Path¶
```mermaid
sequenceDiagram
    participant Pod as Pod
    participant Engine as Longhorn Engine
    participant R1 as Replica 1 (local)
    participant R2 as Replica 2 (remote)
    participant R3 as Replica 3 (remote)
    Pod->>Engine: Write block
    par Synchronous replication
        Engine->>R1: Write to replica 1
        Engine->>R2: Write to replica 2
        Engine->>R3: Write to replica 3
    end
    R1-->>Engine: ACK
    R2-->>Engine: ACK
    R3-->>Engine: ACK
    Engine-->>Pod: Write complete
    Note over Engine: All replicas confirmed before ACK
```
Snapshot & Backup¶
```mermaid
flowchart LR
    Vol["Volume\n(live data)"] --> Snap1["Snapshot 1\n(point-in-time)"]
    Snap1 --> Snap2["Snapshot 2\n(incremental)"]
    Snap2 --> Backup["Backup\n(to S3/NFS)"]
    Backup --> DR["DR Volume\n(remote cluster)"]
    style Backup fill:#1565c0,color:#fff
```
Benchmarks¶
Scope: Performance characteristics, scaling limits, and resource consumption for Longhorn.
I/O Performance¶
| Configuration | Seq Read | Seq Write | Random 4K Read | Random 4K Write |
|---|---|---|---|---|
| 1 replica | 500-800 MB/s | 300-500 MB/s | 15k IOPS | 8k IOPS |
| 2 replicas | 500-800 MB/s | 200-350 MB/s | 15k IOPS | 5k IOPS |
| 3 replicas | 500-800 MB/s | 150-300 MB/s | 15k IOPS | 3k IOPS |
Note: Performance depends heavily on the underlying disk type (HDD, SSD, or NVMe) and on network bandwidth between nodes.
Resource Overhead¶
| Component | CPU | Memory | Per |
|---|---|---|---|
| Longhorn Manager | 100-300m | 256-512Mi | Per node |
| Engine (per volume) | 50-200m | 100-200Mi | Per volume |
| Replica (per replica) | 50-100m | 100-200Mi | Per replica |
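The per-component ranges in the table above can be combined into a rough cluster-wide estimate. The sketch below is plain arithmetic over those figures; the `estimate` helper is hypothetical, and the result inherits the table's status as unsourced guidance.

```python
# (low, high) per instance, copied from the table above:
# CPU in millicores, memory in MiB.
OVERHEAD = {
    "manager": {"cpu": (100, 300), "mem": (256, 512)},
    "engine": {"cpu": (50, 200), "mem": (100, 200)},
    "replica": {"cpu": (50, 100), "mem": (100, 200)},
}


def estimate(nodes: int, volumes: int, replicas_per_volume: int) -> dict:
    """Return {'cpu': (low, high), 'mem': (low, high)} summed over all components."""
    counts = {
        "manager": nodes,                         # one Manager pod per node
        "engine": volumes,                        # one Engine per volume
        "replica": volumes * replicas_per_volume,
    }
    total = {"cpu": [0, 0], "mem": [0, 0]}
    for component, count in counts.items():
        for resource in ("cpu", "mem"):
            low, high = OVERHEAD[component][resource]
            total[resource][0] += count * low
            total[resource][1] += count * high
    return {resource: tuple(bounds) for resource, bounds in total.items()}


# Example: 3 nodes, 10 volumes, 3 replicas each.
result = estimate(nodes=3, volumes=10, replicas_per_volume=3)
print(result)
```

The replica term dominates as the replica count grows, since it scales with volumes times replicas per volume.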
Scaling Limits¶
| Dimension | Limit | Notes |
|---|---|---|
| Volumes per cluster | 1,000+ | Manager memory grows with the number of volumes |
| Volume size | 10 TB+ | Read index memory grows with volume size (about 256 MB per TB per replica) |
| Replicas per volume | 1-20 | 3 is default |
| Nodes | 50+ | Tested in production |
| Snapshots per volume | 250 | Disk-based snapshots; the one-byte read index caps the chain at 254 |
Sourcing Status¶
Unsourced Performance Data
The performance numbers in this document are estimated from vendor documentation, community benchmarks, and engineering judgment. They do not represent controlled benchmarks with documented test conditions. Specific hardware configurations, software versions, and test methodologies were not recorded.
Use these figures as rough guidance only. For production capacity planning, run your own benchmarks against your specific workload and infrastructure.