Architecture¶

Longhorn is a cloud-native distributed block storage system for Kubernetes. It implements storage using containers and microservices, creating a dedicated storage controller (the Longhorn Engine) for each volume and synchronously replicating data across multiple replicas on separate nodes.

See also: index, architecture, operations, security

High-Level Component Diagram¶

graph TB
    subgraph Kubernetes Cluster
        subgraph Control Plane
            CSI[CSI Driver + Provisioner]
            LHM[Longhorn Manager<br/>DaemonSet per node]
            UI[Longhorn UI]
        end

        subgraph Node A
            ENGINE_A[Longhorn Engine V1/V2<br/>Per-volume controller]
            REPLICA_A1[Replica A1]
        end

        subgraph Node B
            REPLICA_A2[Replica A2]
        end

        subgraph Node C
            REPLICA_A3[Replica A3]
        end

        subgraph Data Plane Components
            IM[Instance Manager<br/>Engine + Replica pods]
            SM[Share Manager<br/>RWX NFS server]
            BIM[Backing Image Manager]
        end
    end

    subgraph External
        S3[(Backup Target<br/>S3 / NFS)]
    end

    K8S_API[Kubernetes API Server]
    POD[Application Pod]

    POD -->|PVC| CSI
    CSI -->|Longhorn API| LHM
    LHM -->|Watches CRs| K8S_API
    UI -->|Longhorn API| LHM
    LHM --> ENGINE_A
    ENGINE_A -->|Sync replication| REPLICA_A1
    ENGINE_A -->|Sync replication| REPLICA_A2
    ENGINE_A -->|Sync replication| REPLICA_A3
    LHM --> IM
    LHM --> SM
    LHM --> BIM
    LHM -->|Backups| S3

Two-Layer Design¶

Longhorn separates concerns into a control plane and a data plane:

Layer	Component	Role
Control Plane	Longhorn Manager	Orchestrates volumes, handles CSI calls, manages CRs
Data Plane	Longhorn Engine	Per-volume storage controller, synchronous replication
Data Plane	Instance Manager	Runs engine and replica processes inside a per-node pod
Data Plane	Share Manager	Provides NFS-based RWX access

Longhorn Manager¶

The Longhorn Manager runs as a Kubernetes DaemonSet with one pod per node. It follows the Kubernetes controller/operator pattern:

Watches for Longhorn Custom Resources (volumes, engines, replicas, nodes, settings) via the Kubernetes API server.
When a new Volume CR is created (triggered by a CSI provisioning request), the Manager on the target node creates a Longhorn Engine instance and schedules replicas on separate nodes.
Handles all API calls from the Longhorn UI and the CSI plugin.
Manages the lifecycle of Instance Manager pods, Backing Image Manager pods, and Share Manager pods.
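
The reconcile step described above can be sketched as a toy model. This is not Longhorn code: the `Volume` dataclass, `schedule_replicas`, `reconcile`, and the node names are hypothetical, and real scheduling also weighs factors such as free disk space that this sketch ignores. It only illustrates placing the engine on the attached node and one replica per distinct node.

```python
from dataclasses import dataclass, field


@dataclass
class Volume:
    """Hypothetical stand-in for a Longhorn Volume custom resource."""
    name: str
    attached_node: str                 # node where the consuming Pod runs
    number_of_replicas: int
    engine_node: str = ""              # observed state, filled in by reconcile
    replica_nodes: list[str] = field(default_factory=list)


def schedule_replicas(volume: Volume, nodes: list[str]) -> list[str]:
    """Pick a distinct node for each missing replica, in the order given.

    Raises if there are fewer nodes than replicas, because this sketch
    never places two replicas of one volume on the same node.
    """
    if volume.number_of_replicas > len(nodes):
        raise ValueError("not enough nodes for one replica per node")
    placed = list(volume.replica_nodes)
    for node in nodes:
        if len(placed) == volume.number_of_replicas:
            break
        if node not in placed:
            placed.append(node)
    return placed


def reconcile(volume: Volume, nodes: list[str]) -> Volume:
    """One reconcile pass: move observed state toward the desired state."""
    volume.engine_node = volume.attached_node          # engine follows the Pod
    volume.replica_nodes = schedule_replicas(volume, nodes)
    return volume


vol = reconcile(Volume("vol-1", "node-b", 3), ["node-a", "node-b", "node-c"])
print(vol.engine_node, vol.replica_nodes)
```

Running `reconcile` again on the same volume changes nothing, which is the idempotent behaviour a controller loop relies on.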

Longhorn Engine (Data Plane)¶

The Engine is the per-volume storage controller. It always runs on the same node as the Pod consuming the volume, which keeps the path between the Pod and its storage controller local, so I/O does not take an extra network hop to reach the Engine.

V1 Engine (GA)¶

Exposes a block device to the host via iSCSI (requires the open-iscsi package, which provides iscsiadm, on the host).
Engine and replicas run as Linux processes inside Instance Manager pods.
Synchronously replicates writes to all healthy replicas.

V2 Engine (Technical Preview)¶

Built on SPDK (Storage Performance Development Kit) for user-space I/O, bypassing the kernel.
Engine operates as an SPDK RAID bdev; replicas as SPDK logical volume bdevs.
Frontend options: NVMe-TCP (requires nvme_tcp kernel module) or UBLK (requires ublk_drv module and huge pages).
Delivers lower latency and higher IOPS/throughput compared to V1.

V1 vs V2 Feature Parity

V2 Data Engine is currently a Technical Preview. Not all V1 features are available in V2. Check the official feature parity matrix before using V2 in production.

Volume Replication¶

Replica Placement and Synchronous Replication¶

sequenceDiagram
    participant App as Application Pod
    participant Eng as Longhorn Engine
    participant R1 as Replica 1 (Node A)
    participant R2 as Replica 2 (Node B)
    participant R3 as Replica 3 (Node C)

    App->>Eng: WRITE data block
    Eng->>R1: Sync write (local)
    Eng->>R2: Sync write (network)
    Eng->>R3: Sync write (network)
    R1-->>Eng: ACK
    R2-->>Eng: ACK
    R3-->>Eng: ACK
    Eng-->>App: Write complete

Key properties:

Synchronous replication: The Engine waits for all healthy replicas to acknowledge before completing the write, so every healthy replica holds the same acknowledged data and any one of them can serve the volume after a failure.
Failure tolerance: With N replicas, the volume tolerates N-1 replica failures and stays operational.
Default replica count: Configurable globally via Longhorn settings; can be overridden per StorageClass (numberOfReplicas parameter).
Replica rebuilding: When a replica fails, the Manager creates a blank replica on another node. The Engine pauses I/O briefly, takes a system snapshot of all healthy replicas, adds the new replica in write-only mode, resumes I/O, then syncs historical data in the background. Once sync completes, the replica enters read-write mode.
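
The rebuild order in the last item can be traced with a small in-memory model. This is a sketch under stated assumptions, not the Engine's implementation: the `Replica` and `Engine` classes, the mode strings `"RW"`, `"WO"`, and `"ERR"`, and the split into `start_rebuild` and `finish_rebuild` are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    node: str
    mode: str = "RW"        # "RW" read-write, "WO" write-only, "ERR" failed
    blocks: dict[int, bytes] = field(default_factory=dict)


@dataclass
class Engine:
    replicas: list[Replica]
    io_paused: bool = False
    snapshot: dict[int, bytes] = field(default_factory=dict)

    def write(self, block: int, data: bytes) -> None:
        """Apply a write to every replica that currently accepts writes."""
        if self.io_paused:
            raise RuntimeError("I/O is paused")
        for r in self.replicas:
            if r.mode in ("RW", "WO"):
                r.blocks[block] = data

    def start_rebuild(self, new_node: str) -> Replica:
        source = next(r for r in self.replicas if r.mode == "RW")
        self.io_paused = True                  # 1. pause I/O briefly
        self.snapshot = dict(source.blocks)    # 2. snapshot the healthy data
        new = Replica(new_node, mode="WO")     # 3. add blank replica, write-only
        self.replicas.append(new)
        self.io_paused = False                 # 4. resume I/O
        return new

    def finish_rebuild(self, new: Replica) -> None:
        for block, data in self.snapshot.items():
            # 5. copy historical data; blocks written since the snapshot win
            new.blocks.setdefault(block, data)
        new.mode = "RW"                        # 6. promote to read-write


engine = Engine([Replica("node-a"), Replica("node-b"), Replica("node-c")])
engine.write(0, b"old")
engine.write(1, b"keep")
engine.replicas[2].mode = "ERR"                # the replica on node-c fails
new = engine.start_rebuild("node-d")
engine.write(0, b"new")                        # live write reaches the WO replica
engine.finish_rebuild(new)
print(new.mode, new.blocks)
```

Adding the new replica in write-only mode before copying history is what lets live writes continue during the sync without being overwritten by older snapshot data.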

Replica Storage: Sparse Files and Read Index¶

Each replica is stored as a Linux sparse file on the host disk, providing thin provisioning out of the box. A 1 TB volume that contains 10 GB of actual data consumes only about 10 GB on disk per replica.

Longhorn maintains an in-memory read index per replica: a byte array with one byte per 4 KB block that records which layer (a snapshot or the live data) holds the most recent data for that block. The index avoids scanning the entire snapshot chain on reads. Because each entry is a single byte, it also limits each volume to 254 snapshots.

A 1 TB volume consumes approximately 256 MB of read index memory per replica.
Write operations reset the read index entry to point to live data.
Read operations look up the block's index entry to find the layer that holds the data, instead of walking the snapshot chain.
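
A minimal sketch of the idea, including the memory arithmetic, is shown below. The encoding is an assumption chosen to be consistent with the figures above (value 0 for a never-written block, values 1 to 255 naming a layer, with the live layer always last); it is not a description of Longhorn's actual in-memory format, and the `Replica` class here is hypothetical.

```python
BLOCK_SIZE = 4 * 1024      # 4 KB blocks, one index byte each (from the text)
MAX_LAYERS = 255           # assumption: values 1..255 name layers, 0 = unwritten


def read_index_bytes(volume_bytes: int) -> int:
    """Memory needed for the read index of one replica."""
    return volume_bytes // BLOCK_SIZE


class Replica:
    """Toy replica: a chain of snapshot layers plus one live head layer."""

    def __init__(self, volume_bytes: int) -> None:
        self.index = bytearray(read_index_bytes(volume_bytes))
        self.layers: list[dict[int, bytes]] = [{}]   # last entry is the live head

    def snapshot(self) -> None:
        """Freeze the current head and start a new, empty live layer."""
        if len(self.layers) == MAX_LAYERS:
            raise RuntimeError("snapshot limit reached (254 snapshots + live)")
        self.layers.append({})

    def write(self, block: int, data: bytes) -> None:
        """Store data in the live layer and point the index entry at it."""
        self.layers[-1][block] = data
        self.index[block] = len(self.layers)

    def read(self, block: int) -> bytes:
        """One index lookup instead of walking the snapshot chain."""
        layer = self.index[block]
        if layer == 0:
            return bytes(BLOCK_SIZE)                 # never written: zeros
        return self.layers[layer - 1][block]


# 2**40 bytes / 4096 bytes per block = 2**28 index bytes = 256 MiB
print(read_index_bytes(1 << 40) // (1 << 20), "MiB of index for a 1 TiB volume")

r = Replica(volume_bytes=1 << 20)    # small 1 MiB volume for the demo
r.write(0, b"v1")
r.snapshot()
r.write(1, b"v2")
print(r.read(0), r.read(1), list(r.index[:2]))
```

With one byte per entry there are 256 possible values. Reserving one for "unwritten" and one for the live layer leaves 254 for snapshots, which is one way to arrive at the limit quoted above.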

CSI Driver Integration¶

sequenceDiagram
    participant PVC as PersistentVolumeClaim
    participant K8s as Kubernetes API
    participant CSI as Longhorn CSI Plugin
    participant LHM as Longhorn Manager
    participant Eng as Longhorn Engine

    PVC->>K8s: Provision request
    K8s->>CSI: CreateVolume
    CSI->>LHM: Create Longhorn Volume CR
    LHM->>Eng: Start Engine + Replicas
    Eng-->>LHM: Volume ready
    LHM-->>CSI: Volume path
    CSI->>K8s: PV created + formatted + mounted
    K8s-->>PVC: Bound and ready

The Longhorn CSI driver handles:

CreateVolume / DeleteVolume: Provisioning and teardown.
ControllerPublishVolume / ControllerUnpublishVolume: Attach/detach to nodes.
NodeStageVolume / NodePublishVolume: Format and mount the block device into the Pod.
CreateSnapshot / DeleteSnapshot: Snapshot management.
ControllerExpandVolume / NodeExpandVolume: Online volume expansion.
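
Provisioning through these calls starts from a StorageClass that names the Longhorn CSI driver. The sketch below builds such a manifest as a Python dictionary and prints it as JSON. The `numberOfReplicas` parameter comes from this page. The provisioner name `driver.longhorn.io`, the helper `longhorn_storage_class`, and the class name `longhorn-two-replicas` are not taken from this page and should be checked against the installed Longhorn version.

```python
import json


def longhorn_storage_class(name: str, replicas: int) -> dict:
    """Build a StorageClass manifest that overrides the replica count."""
    if replicas < 1:
        raise ValueError("a volume needs at least one replica")
    return {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "StorageClass",
        "metadata": {"name": name},
        "provisioner": "driver.longhorn.io",     # assumed Longhorn CSI driver name
        "allowVolumeExpansion": True,            # permits later expansion requests
        # StorageClass parameters are always strings.
        "parameters": {"numberOfReplicas": str(replicas)},
    }


manifest = longhorn_storage_class("longhorn-two-replicas", 2)
print(json.dumps(manifest, indent=2))
```

Because `kubectl` accepts JSON as well as YAML, the printed document can be piped to `kubectl apply -f -`.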

For encrypted volumes, the CSI driver passes encryption secrets (stored as Kubernetes Secrets) to dm_crypt / cryptsetup on the host.

Instance Manager¶

The Instance Manager is a system-managed pod (one per node per engine version) that runs the Engine and Replica processes. It replaces the older model of one pod per engine/replica, reducing pod overhead. Longhorn automatically manages the Instance Manager lifecycle.

Share Manager (RWX Volumes)¶

Longhorn provides ReadWriteMany (RWX) access by deploying a Share Manager pod that runs an NFS server backed by the Longhorn volume. The NFS export is exposed as a Kubernetes Service, and the CSI driver mounts the NFS share into the requesting Pods.

Each RWX volume gets its own Share Manager pod.
Share Manager pods are managed by a dedicated controller in the Longhorn Manager.
RWX volumes do not support Block (volumeMode: Block) mode.

Backup Architecture¶

flowchart LR
    subgraph Primary Storage
        VOL[Longhorn Volume]
        SNAP[Snapshot chain]
    end

    subgraph Secondary Storage
        BS[Backupstore<br/>S3 or NFS]
        BK1[Backup 1<br/>2 MB blocks]
        BK2[Backup 2<br/>Incremental 2 MB blocks]
    end

    VOL --> SNAP
    SNAP -->|Flatten + diff| BK1
    SNAP -->|Incremental diff| BK2
    BK1 -->|Shared 2 MB blocks| BK2

Backup target: Configured as an S3-compatible endpoint or NFS share external to the cluster.
Incremental backups: Each backup transmits only changed 2 MB blocks since the previous backup. Block-level checksums provide deduplication within the same volume.
Disaster Recovery (DR) volumes: A DR volume in a secondary cluster incrementally restores from the backupstore. On failover, the DR volume is activated and becomes a normal Longhorn volume.
Recurring backups/snapshots: Configurable schedules per volume or per StorageClass.
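
The effect of 2 MB blocks and block-level checksums can be illustrated with a toy backupstore. This is a sketch, not Longhorn's backup format: the `Backupstore` class is hypothetical, SHA-256 is an assumed hash choice, and the sketch detects unchanged blocks purely by checksum rather than by diffing snapshots.

```python
import hashlib

BLOCK_SIZE = 2 * 1024 * 1024     # 2 MB backup blocks, as stated above


def split_blocks(data: bytes) -> list[bytes]:
    """Cut a volume image into fixed-size backup blocks."""
    return [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]


class Backupstore:
    """Toy backupstore: blocks keyed by checksum, backups as checksum lists."""

    def __init__(self) -> None:
        self.blocks: dict[str, bytes] = {}
        self.backups: dict[str, list[str]] = {}

    def backup(self, name: str, volume: bytes) -> int:
        """Store a backup and return how many new blocks were uploaded."""
        uploaded = 0
        manifest = []
        for block in split_blocks(volume):
            digest = hashlib.sha256(block).hexdigest()   # assumed hash function
            if digest not in self.blocks:                # skip blocks already stored
                self.blocks[digest] = block
                uploaded += 1
            manifest.append(digest)
        self.backups[name] = manifest
        return uploaded

    def restore(self, name: str) -> bytes:
        """Reassemble a volume image from its list of block checksums."""
        return b"".join(self.blocks[d] for d in self.backups[name])


store = Backupstore()
v1 = b"".join(bytes([i]) * BLOCK_SIZE for i in range(4))            # four distinct blocks
v2 = v1[:BLOCK_SIZE] + b"\xff" * BLOCK_SIZE + v1[2 * BLOCK_SIZE:]   # second block changed
print(store.backup("backup-1", v1), "blocks uploaded for the first backup")
print(store.backup("backup-2", v2), "block uploaded for the second backup")
```

Both backups remain independently restorable even though the second one shares three of its four blocks with the first.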

Key Architectural Properties¶

One engine per volume: Failure domains are isolated; a controller crash affects only one volume.
Microservices-based: Engine, replicas, Instance Manager, and Share Manager are all orchestrated as Kubernetes resources.
Thin provisioning: Volumes consume only the space actually written.
Crash-consistent: Longhorn runs sync before creating a snapshot, but data still held in the OS-level cache when a node crashes is not captured.
Live upgrades: Engines and replicas can be upgraded without disrupting I/O, using rolling upgrade jobs.

How It Works¶

Per-volume engine model, synchronous replication, snapshot mechanics, and CSI integration.

Per-Volume Engine Architecture¶

Unlike Ceph's shared daemon model, Longhorn assigns a dedicated Engine process to each volume. This isolates failures: a crash in one volume's engine does not take down other volumes.

flowchart TB
    subgraph Node1["Node 1"]
        E1["Engine (Vol-1)\n(storage controller)"]
        R1A["Replica 1A\n(Vol-1 data)"]
        R2A["Replica 2A\n(Vol-2 data)"]
        E2["Engine (Vol-2)"]
    end

    subgraph Node2["Node 2"]
        R1B["Replica 1B\n(Vol-1 data)"]
    end

    subgraph Node3["Node 3"]
        R1C["Replica 1C\n(Vol-1 data)"]
    end

    Pod1["Pod (uses Vol-1)"] --> E1
    E1 -->|"sync write"| R1A
    E1 -->|"sync write"| R1B
    E1 -->|"sync write"| R1C

    style Node1 fill:#2e7d32,color:#fff

Write Path¶

sequenceDiagram
    participant Pod as Pod
    participant Engine as Longhorn Engine
    participant R1 as Replica 1 (local)
    participant R2 as Replica 2 (remote)
    participant R3 as Replica 3 (remote)

    Pod->>Engine: Write block
    par Synchronous replication
        Engine->>R1: Write to replica 1
        Engine->>R2: Write to replica 2
        Engine->>R3: Write to replica 3
    end
    R1-->>Engine: ACK
    R2-->>Engine: ACK
    R3-->>Engine: ACK
    Engine-->>Pod: Write complete
    Note over Engine: All replicas confirmed before ACK
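
The parallel fan-out in the diagram can be sketched with a thread pool. This is a simplified model rather than the Engine's data path: the `Replica` and `Engine` classes and the `fail` flag are hypothetical, and a replica that returns an error is simply dropped from the healthy set.

```python
from concurrent.futures import ThreadPoolExecutor


class Replica:
    def __init__(self, name: str, fail: bool = False) -> None:
        self.name = name
        self.fail = fail                     # simulates a broken disk or link
        self.blocks: dict[int, bytes] = {}

    def write(self, block: int, data: bytes) -> str:
        if self.fail:
            raise IOError(f"{self.name} unreachable")
        self.blocks[block] = data
        return "ACK"


class Engine:
    def __init__(self, replicas: list[Replica]) -> None:
        self.healthy = list(replicas)

    def write(self, block: int, data: bytes) -> int:
        """Send the write to all healthy replicas in parallel and wait for
        every result before returning. Returns the number of ACKs."""
        with ThreadPoolExecutor(max_workers=len(self.healthy)) as pool:
            futures = {r: pool.submit(r.write, block, data) for r in self.healthy}
        # Leaving the with-block waits until every submitted write has finished.
        acked = [r for r, f in futures.items() if f.exception() is None]
        if not acked:
            raise IOError("no healthy replica left")
        self.healthy = acked                 # replicas that errored drop out
        return len(acked)


engine = Engine([Replica("r1"), Replica("r2"), Replica("r3", fail=True)])
acks = engine.write(0, b"data")
print(acks, [r.name for r in engine.healthy])
```

The write completes only after every replica has either acknowledged or failed, and the volume keeps serving I/O as long as at least one replica acknowledges.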

Snapshot & Backup¶

flowchart LR
    Vol["Volume\n(live data)"] --> Snap1["Snapshot 1\n(point-in-time)"]
    Snap1 --> Snap2["Snapshot 2\n(incremental)"]
    Snap2 --> Backup["Backup\n(to S3/NFS)"]
    Backup --> DR["DR Volume\n(remote cluster)"]

    style Backup fill:#1565c0,color:#fff

Benchmarks¶

Scope

Performance characteristics, scaling limits, and resource consumption for Longhorn.

I/O Performance¶

Configuration	Seq Read	Seq Write	Random 4K Read	Random 4K Write
1 replica	500-800 MB/s	300-500 MB/s	15k IOPS	8k IOPS
2 replicas	500-800 MB/s	200-350 MB/s	15k IOPS	5k IOPS
3 replicas	500-800 MB/s	150-300 MB/s	15k IOPS	3k IOPS

Note

Performance depends heavily on underlying disk type (HDD vs SSD vs NVMe) and network bandwidth between nodes.

Resource Overhead¶

Component	CPU	Memory	Scope
Longhorn Manager	100-300m	256-512Mi	Per node
Engine	50-200m	100-200Mi	Per volume
Replica	50-100m	100-200Mi	Per replica
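
To turn the table into a cluster-wide figure, multiply each range by the number of nodes, volumes, and replicas. The helper below is a rough calculator built only from the estimated ranges above; the function name `estimate` and the example cluster size are illustrative assumptions, and the result inherits the uncertainty of the table.

```python
# (low, high) ranges copied from the table above: CPU in millicores, memory in Mi.
OVERHEAD = {
    "manager": {"cpu": (100, 300), "mem": (256, 512)},   # per node
    "engine":  {"cpu": (50, 200),  "mem": (100, 200)},   # per volume
    "replica": {"cpu": (50, 100),  "mem": (100, 200)},   # per replica
}


def estimate(nodes: int, volumes: int, replicas_per_volume: int) -> dict:
    """Return cluster-wide (low, high) totals for CPU and memory."""
    counts = {
        "manager": nodes,
        "engine": volumes,
        "replica": volumes * replicas_per_volume,
    }
    total = {}
    for resource in ("cpu", "mem"):
        low = sum(OVERHEAD[c][resource][0] * n for c, n in counts.items())
        high = sum(OVERHEAD[c][resource][1] * n for c, n in counts.items())
        total[resource] = (low, high)
    return total


# 3 nodes, 10 volumes, 3 replicas each
print(estimate(nodes=3, volumes=10, replicas_per_volume=3))
# -> {'cpu': (2300, 5900), 'mem': (4768, 9536)}
```

For this example the totals come to roughly 2.3 to 5.9 CPU cores and 4.7 to 9.3 Gi of memory across the whole cluster.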

Scaling Limits¶

Dimension	Limit	Notes
Volumes per cluster	1,000+	Manager memory scales with volume count
Volume size	10 TB+	Read index memory grows with volume size (about 256 MB per TB per replica)
Replicas per volume	1-20	Default is 3
Nodes	50+	Tested in production
Snapshots per volume	250	The one-byte read index allows at most 254

Sourcing Status¶

Unsourced Performance Data

The performance numbers in this document are estimated from vendor documentation, community benchmarks, and engineering judgment. They do not represent controlled benchmarks with documented test conditions. Specific hardware configurations, software versions, and test methodologies were not recorded.

Use these figures as rough guidance only. For production capacity planning, run your own benchmarks against your specific workload and infrastructure.
