
Architecture

Longhorn is a cloud-native distributed block storage system for Kubernetes. It implements storage using containers and microservices, creating a dedicated storage controller (the Longhorn Engine) for each volume and synchronously replicating data across multiple replicas on separate nodes.

High-Level Component Diagram

graph TB
    subgraph Kubernetes Cluster
        subgraph Control Plane
            CSI[CSI Driver + Provisioner]
            LHM[Longhorn Manager<br/>DaemonSet per node]
            UI[Longhorn UI]
        end

        subgraph Node A
            ENGINE_A[Longhorn Engine V1/V2<br/>Per-volume controller]
            REPLICA_A1[Replica A1]
            REPLICA_A2[Replica A2<br/>on Node B]
            REPLICA_A3[Replica A3<br/>on Node C]
        end

        subgraph Data Plane Components
            IM[Instance Manager<br/>Engine + Replica pods]
            SM[Share Manager<br/>RWX NFS server]
            BIM[Backing Image Manager]
        end
    end

    subgraph External
        S3[(Backup Target<br/>S3 / NFS)]
    end

    K8S_API[Kubernetes API Server]
    POD[Application Pod]

    POD -->|PVC| CSI
    CSI -->|Longhorn API| LHM
    LHM -->|Watches CRs| K8S_API
    UI -->|Longhorn API| LHM
    LHM --> ENGINE_A
    ENGINE_A -->|Sync replication| REPLICA_A1
    ENGINE_A -->|Sync replication| REPLICA_A2
    ENGINE_A -->|Sync replication| REPLICA_A3
    LHM --> IM
    LHM --> SM
    LHM --> BIM
    LHM -->|Backups| S3

Two-Layer Design

Longhorn separates concerns into a control plane and a data plane:

| Layer | Component | Role |
|---|---|---|
| Control Plane | Longhorn Manager | Orchestrates volumes, handles CSI calls, manages CRs |
| Data Plane | Longhorn Engine | Per-volume storage controller, synchronous replication |
| Data Plane | Instance Manager | Runs engine and replica processes as containers |
| Data Plane | Share Manager | Provides NFS-based RWX access |

Longhorn Manager

The Longhorn Manager runs as a Kubernetes DaemonSet with one pod per node. It follows the Kubernetes controller/operator pattern:

  1. Watches for Longhorn Custom Resources (volumes, engines, replicas, nodes, settings) via the Kubernetes API server.
  2. When a new Volume CR is created (triggered by a CSI provisioning request), the Manager on the target node creates a Longhorn Engine instance and schedules replicas on separate nodes.
  3. Handles all API calls from the Longhorn UI and the CSI plugin.
  4. Manages the lifecycle of Instance Manager pods, Backing Image Manager pods, and Share Manager pods.
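
The placement decision in step 2 can be illustrated with a minimal sketch. This is not Longhorn's scheduler: `VolumePlan` and `plan_volume` are hypothetical names, and preferring a replica on the engine's own node is an assumption of the sketch. It only shows the constraint that each replica lands on a distinct node.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class VolumePlan:
    """Hypothetical result of reconciling one Volume CR."""
    engine_node: str
    replica_nodes: list[str] = field(default_factory=list)


def plan_volume(nodes: list[str], engine_node: str, replica_count: int) -> VolumePlan:
    """Place the engine on the requested node and each replica on a distinct node."""
    if engine_node not in nodes:
        raise ValueError(f"unknown node: {engine_node}")
    if replica_count > len(nodes):
        raise ValueError("not enough nodes to keep replicas on separate nodes")
    # Assumption: try the engine's node first so one replica is local,
    # then fill the remaining slots from the other nodes in order.
    ordered = [engine_node] + [n for n in nodes if n != engine_node]
    return VolumePlan(engine_node=engine_node, replica_nodes=ordered[:replica_count])


plan = plan_volume(["node-a", "node-b", "node-c", "node-d"], "node-b", 3)
print(plan.engine_node, plan.replica_nodes)
```

A real controller would also weigh disk capacity, zones, and scheduling settings; the sketch ignores all of these.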

Longhorn Engine (Data Plane)

The Engine is the per-volume storage controller. It always runs on the same node as the Pod consuming the volume, which keeps the data path local and avoids an extra network hop for I/O.

V1 Engine (GA)

  • Exposes a block device to the host via iSCSI (requires the open-iscsi package, which provides iscsiadm, on the host).
  • Engine and replicas run as Linux processes inside Instance Manager pods.
  • Synchronously replicates writes to all healthy replicas.

V2 Engine (Technical Preview)

  • Built on SPDK (Storage Performance Development Kit) for user-space I/O, bypassing the kernel.
  • Engine operates as an SPDK RAID bdev; replicas as SPDK logical volume bdevs.
  • Frontend options: NVMe-TCP (requires nvme_tcp kernel module) or UBLK (requires ublk_drv module and huge pages).
  • Delivers lower latency and higher IOPS/throughput compared to V1.

V1 vs V2 Feature Parity

V2 Data Engine is currently a Technical Preview. Not all V1 features are available in V2. Check the official feature parity matrix before using V2 in production.

Volume Replication

Replica Placement and Synchronous Replication

sequenceDiagram
    participant App as Application Pod
    participant Eng as Longhorn Engine
    participant R1 as Replica 1 (Node A)
    participant R2 as Replica 2 (Node B)
    participant R3 as Replica 3 (Node C)

    App->>Eng: WRITE data block
    Eng->>R1: Sync write (local)
    Eng->>R2: Sync write (network)
    Eng->>R3: Sync write (network)
    R1-->>Eng: ACK
    R2-->>Eng: ACK
    R3-->>Eng: ACK
    Eng-->>App: Write complete

Key properties:

  • Synchronous replication: The Engine completes a write only after every healthy replica has acknowledged it, which keeps the healthy replicas identical and the volume crash-consistent.
  • Failure tolerance: With N replicas, the volume tolerates N-1 replica failures and stays operational.
  • Default replica count: Configurable globally via Longhorn settings; can be overridden per StorageClass (numberOfReplicas parameter).
  • Replica rebuilding: When a replica fails, the Manager creates a blank replica on another node. The Engine pauses I/O briefly, takes a system snapshot of all healthy replicas, adds the new replica in write-only mode, resumes I/O, then syncs historical data in the background. Once sync completes, the replica enters read-write mode.
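
The rebuild sequence described in the last bullet can be sketched as a toy in-memory model. All class and method names here are hypothetical, replicas are plain dictionaries, and the I/O pause is only marked by a comment; the point is the ordering: snapshot, add in write-only mode, accept new writes, sync history, switch to read-write.

```python
from enum import Enum


class Mode(Enum):
    WRITE_ONLY = "WO"
    READ_WRITE = "RW"


class Replica:
    def __init__(self, name):
        self.name = name
        self.mode = Mode.READ_WRITE
        self.snapshot = {}  # block -> data frozen at the last snapshot
        self.live = {}      # block -> data written since that snapshot

    def take_snapshot(self):
        # Fold live data into the snapshot layer and start a fresh live layer.
        self.snapshot = {**self.snapshot, **self.live}
        self.live = {}


class Engine:
    def __init__(self, replicas):
        self.replicas = list(replicas)

    def write(self, block, data):
        # Writes go to every replica, including one that is still write-only.
        for r in self.replicas:
            r.live[block] = data

    def read(self, block):
        # Reads are served only from replicas in read-write mode.
        source = next(r for r in self.replicas if r.mode is Mode.READ_WRITE)
        return source.live.get(block, source.snapshot.get(block))

    def start_rebuild(self, failed, blank):
        self.replicas.remove(failed)
        # I/O would be paused here: snapshot every healthy replica.
        for r in self.replicas:
            r.take_snapshot()
        blank.mode = Mode.WRITE_ONLY
        self.replicas.append(blank)
        # I/O resumes: new writes now also reach the blank replica.

    def finish_rebuild(self, blank):
        # Background sync of historical (snapshot) data from a healthy replica.
        source = next(r for r in self.replicas if r.mode is Mode.READ_WRITE)
        blank.snapshot = dict(source.snapshot)
        blank.mode = Mode.READ_WRITE


r1, r2, r3 = Replica("r1"), Replica("r2"), Replica("r3")
engine = Engine([r1, r2, r3])
engine.write(0, b"old")
new = Replica("r4")
engine.start_rebuild(failed=r3, blank=new)
engine.write(1, b"new")  # arrives while the rebuild is in progress
engine.finish_rebuild(new)
print(new.mode, new.snapshot, new.live)
```

Taking the snapshot before adding the blank replica is what separates "history to copy" from "new writes to accept", so the two can proceed independently.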

Replica Storage: Sparse Files and Read Index

Each replica is stored as a Linux sparse file on the host disk, providing thin provisioning out of the box. A 1 TB volume that contains 10 GB of actual data consumes only 10 GB on disk.
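
Sparse-file behaviour is easy to observe directly. The sketch below is not Longhorn code: `make_sparse_file` is a hypothetical helper and the file name is arbitrary. It compares the apparent size of a file with the space actually allocated, assuming a POSIX system whose filesystem supports sparse files.

```python
from __future__ import annotations

import os
import tempfile


def make_sparse_file(path: str, apparent_size: int, payload: bytes) -> tuple[int, int]:
    """Create a file of apparent_size bytes, write payload at offset 0,
    and return (apparent_bytes, allocated_bytes)."""
    with open(path, "wb") as f:
        f.truncate(apparent_size)  # extends the file without writing data
        f.write(payload)           # only this region needs real disk blocks
    st = os.stat(path)
    # st_blocks is reported in 512-byte units on POSIX; it is unavailable on Windows.
    return st.st_size, st.st_blocks * 512


with tempfile.TemporaryDirectory() as tmp:
    apparent, allocated = make_sparse_file(
        os.path.join(tmp, "replica.img"), 64 * 1024 * 1024, b"x" * 4096
    )
    print(f"apparent={apparent} bytes, allocated={allocated} bytes")
```

On a filesystem with sparse-file support, the allocated figure should be far smaller than the apparent size; the exact value depends on the filesystem.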

Longhorn maintains an in-memory read index per replica: a byte array with one byte per 4 KB block that records which snapshot or live data layer holds the most recent data for that block. Reads therefore do not need to scan the entire snapshot chain. Because each entry is a single byte, a volume is limited to 254 snapshots.

  • A 1 TB volume consumes approximately 256 MB of read index memory per replica.
  • Write operations reset the read index entry to point to live data.
  • Read operations look up the index entry to find the correct source layer.
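
The memory figure and the read/write behaviour above can be checked with a small model. The encoding is an assumption made for illustration (0 for "never written", 1 to 254 for snapshot layers, 255 for live data) and may not match Longhorn's actual layout; `ReadIndex` and its methods are hypothetical names.

```python
BLOCK_SIZE = 4 * 1024  # 4 KB blocks, as stated above
EMPTY = 0              # assumed marker: block never written
LIVE = 255             # assumed marker: newest data is in the live layer


def read_index_bytes(volume_bytes: int) -> int:
    """One index byte per 4 KB block."""
    return volume_bytes // BLOCK_SIZE


class ReadIndex:
    def __init__(self, volume_bytes: int):
        self.index = bytearray(read_index_bytes(volume_bytes))  # all EMPTY

    def on_write(self, block: int) -> None:
        # A write resets the entry to point at live data.
        self.index[block] = LIVE

    def on_snapshot(self, snapshot_id: int) -> None:
        # Blocks that were live now belong to the new snapshot layer.
        if not 1 <= snapshot_id <= 254:
            raise ValueError("snapshot id must fit in the assumed 1..254 range")
        for i, layer in enumerate(self.index):
            if layer == LIVE:
                self.index[i] = snapshot_id

    def source_layer(self, block: int) -> int:
        # A read is a single lookup instead of a walk along the snapshot chain.
        return self.index[block]


print(read_index_bytes(2**40) / 2**20, "MiB of index for a 1 TiB volume")

idx = ReadIndex(volume_bytes=16 * BLOCK_SIZE)
idx.on_write(3)
idx.on_snapshot(1)
idx.on_write(5)
print(idx.source_layer(3), idx.source_layer(5), idx.source_layer(0))
```

With binary units, 2^40 bytes divided by 2^12 bytes per block gives 2^28 index bytes, which is the 256 MB quoted above.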

CSI Driver Integration

sequenceDiagram
    participant PVC as PersistentVolumeClaim
    participant K8s as Kubernetes API
    participant CSI as Longhorn CSI Plugin
    participant LHM as Longhorn Manager
    participant Eng as Longhorn Engine

    PVC->>K8s: Provision request
    K8s->>CSI: CreateVolume
    CSI->>LHM: Create Longhorn Volume CR
    LHM->>Eng: Start Engine + Replicas
    Eng-->>LHM: Volume ready
    LHM-->>CSI: Volume path
    CSI->>K8s: PV created + formatted + mounted
    K8s-->>PVC: Bound and ready

The Longhorn CSI driver handles:

  • CreateVolume / DeleteVolume: Provisioning and teardown.
  • ControllerPublishVolume / ControllerUnpublishVolume: Attach/detach to nodes.
  • NodeStageVolume / NodePublishVolume: Format and mount the block device into the Pod.
  • CreateSnapshot / DeleteSnapshot: Snapshot management.
  • ControllerExpandVolume / NodeExpandVolume: Online volume expansion.

For encrypted volumes, the CSI driver passes encryption secrets (stored as Kubernetes Secrets) to dm_crypt / cryptsetup on the host.
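
The order in which Kubernetes issues these RPCs can be sketched as a small state machine. The RPC names come from the CSI specification (including NodeUnstageVolume and NodeUnpublishVolume, which are not listed above); the state names and the `apply_rpcs` helper are hypothetical and exist only for this illustration.

```python
# Allowed transitions: current state -> {RPC name: next state}
TRANSITIONS = {
    "absent": {"CreateVolume": "created"},
    "created": {"ControllerPublishVolume": "attached", "DeleteVolume": "absent"},
    "attached": {"NodeStageVolume": "staged", "ControllerUnpublishVolume": "created"},
    "staged": {"NodePublishVolume": "published", "NodeUnstageVolume": "attached"},
    "published": {"NodeUnpublishVolume": "staged"},
}


def apply_rpcs(calls, state="absent"):
    """Replay a sequence of RPC names and return the final volume state."""
    for rpc in calls:
        try:
            state = TRANSITIONS[state][rpc]
        except KeyError:
            raise ValueError(f"{rpc} is not valid while the volume is {state}") from None
    return state


bring_up = ["CreateVolume", "ControllerPublishVolume", "NodeStageVolume", "NodePublishVolume"]
print(apply_rpcs(bring_up))
```

Snapshot and expansion RPCs are left out to keep the table small; they would not change the attach and mount ordering shown here.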

Instance Manager

The Instance Manager is a system-managed pod (one per node per engine version) that runs the Engine and Replica processes. It replaces the older model of one pod per engine/replica, reducing pod overhead. Longhorn automatically manages the Instance Manager lifecycle.

Share Manager (RWX Volumes)

Longhorn provides ReadWriteMany (RWX) access by deploying a Share Manager pod that runs an NFS server backed by the Longhorn volume. The NFS export is exposed as a Kubernetes Service, and the CSI driver mounts the NFS share into the requesting Pods.

  • Each RWX volume gets its own Share Manager pod.
  • Share Manager pods are managed by a dedicated controller in the Longhorn Manager.
  • RWX volumes do not support Block (volumeMode: Block) mode.

Backup Architecture

flowchart LR
    subgraph Primary Storage
        VOL[Longhorn Volume]
        SNAP[Snapshot chain]
    end

    subgraph Secondary Storage
        BS[Backupstore<br/>S3 or NFS]
        BK1[Backup 1<br/>2 MB blocks]
        BK2[Backup 2<br/>Incremental 2 MB blocks]
    end

    VOL --> SNAP
    SNAP -->|Flatten + diff| BK1
    SNAP -->|Incremental diff| BK2
    BK1 -->|Shared 2 MB blocks| BK2

  • Backup target: Configured as an S3-compatible endpoint or NFS share external to the cluster.
  • Incremental backups: Each backup transmits only changed 2 MB blocks since the previous backup. Block-level checksums provide deduplication within the same volume.
  • Disaster Recovery (DR) volumes: A DR volume in a secondary cluster incrementally restores from the backupstore. On failover, the DR volume is activated and becomes a normal Longhorn volume.
  • Recurring backups/snapshots: Configurable schedules per volume or per StorageClass.
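
Block-level deduplication with checksums can be sketched as a content-addressed store. This is a simplification under stated assumptions: SHA-256 is chosen for illustration, the `backup` and `restore` helpers are hypothetical, and the sketch re-hashes the whole volume, whereas the text above describes sending only blocks changed since the previous backup.

```python
import hashlib

BLOCK_SIZE = 2 * 1024 * 1024  # 2 MB backup blocks, as described above


def backup(volume: bytes, store: dict, block_size: int = BLOCK_SIZE):
    """Upload blocks missing from the store; return (manifest, uploaded_count)."""
    manifest, uploaded = [], 0
    for offset in range(0, len(volume), block_size):
        block = volume[offset:offset + block_size]
        digest = hashlib.sha256(block).hexdigest()
        if digest not in store:  # an identical block is already in the backupstore
            store[digest] = block
            uploaded += 1
        manifest.append(digest)
    return manifest, uploaded


def restore(manifest, store) -> bytes:
    """Rebuild the volume contents from a backup's ordered block list."""
    return b"".join(store[d] for d in manifest)


# A tiny block size keeps the demo readable; real backups would use BLOCK_SIZE.
store = {}
v1 = b"A" * 8 + b"B" * 8 + b"C" * 8
m1, n1 = backup(v1, store, block_size=8)
v2 = b"A" * 8 + b"X" * 8 + b"C" * 8
m2, n2 = backup(v2, store, block_size=8)
print(n1, n2, restore(m1, store) == v1, restore(m2, store) == v2)
```

Because both manifests reference the unchanged blocks by checksum, the second backup stores only the one block that differs while each backup remains independently restorable.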

Key Architectural Properties

  • One engine per volume: Failure domains are isolated; a controller crash affects only one volume.
  • Microservices-based: Engine, replicas, Instance Manager, and Share Manager are all orchestrated as Kubernetes resources.
  • Thin provisioning: Volumes consume only the space actually written.
  • Crash-consistent: Longhorn runs sync before creating snapshots, but OS-level cache may contain unflushed data at crash time.
  • Live upgrades: Engines and replicas can be upgraded without disrupting I/O, using rolling upgrade jobs.


How It Works

Per-volume engine model, synchronous replication, snapshot mechanics, and CSI integration.

Per-Volume Engine Architecture

Unlike Ceph's shared daemon model, Longhorn assigns a dedicated Engine process to each volume. This isolates failures: a crash or bug in one volume's engine cannot affect other volumes.

flowchart TB
    subgraph Node1["Node 1"]
        E1["Engine (Vol-1)\n(storage controller)"]
        R1A["Replica 1A\n(Vol-1 data)"]
        R2A["Replica 2A\n(Vol-2 data)"]
        E2["Engine (Vol-2)"]
    end

    subgraph Node2["Node 2"]
        R1B["Replica 1B\n(Vol-1 data)"]
    end

    subgraph Node3["Node 3"]
        R1C["Replica 1C\n(Vol-1 data)"]
    end

    Pod1["Pod (uses Vol-1)"] --> E1
    E1 -->|"sync write"| R1A
    E1 -->|"sync write"| R1B
    E1 -->|"sync write"| R1C

    style Node1 fill:#2e7d32,color:#fff

Write Path

sequenceDiagram
    participant Pod as Pod
    participant Engine as Longhorn Engine
    participant R1 as Replica 1 (local)
    participant R2 as Replica 2 (remote)
    participant R3 as Replica 3 (remote)

    Pod->>Engine: Write block
    par Synchronous replication
        Engine->>R1: Write to replica 1
        Engine->>R2: Write to replica 2
        Engine->>R3: Write to replica 3
    end
    R1-->>Engine: ACK
    R2-->>Engine: ACK
    R3-->>Engine: ACK
    Engine-->>Pod: Write complete
    Note over Engine: All replicas confirmed before ACK
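
The parallel fan-out in the diagram can be sketched with a thread pool. The `Replica` and `Engine` classes below are hypothetical stand-ins, and dropping a replica that fails to acknowledge is an assumption about failure handling made for this sketch; it shows that the write completes only after every replica has responded, and that the volume stays writable while at least one replica remains.

```python
from concurrent.futures import ThreadPoolExecutor


class Replica:
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.blocks = {}

    def write(self, block, data):
        if not self.healthy:
            raise OSError(f"{self.name} unreachable")
        self.blocks[block] = data
        return True


class Engine:
    def __init__(self, replicas):
        self.replicas = list(replicas)

    def write(self, block, data):
        """Send the write to all replicas in parallel and wait for every response."""
        if not self.replicas:
            raise OSError("volume has no healthy replicas")
        with ThreadPoolExecutor(max_workers=len(self.replicas)) as pool:
            futures = {r: pool.submit(r.write, block, data) for r in self.replicas}
        # Leaving the with-block waits for all submitted writes to finish.
        survivors = [r for r, f in futures.items() if f.exception() is None]
        if not survivors:
            raise OSError("no replica acknowledged the write")
        self.replicas = survivors  # failed replicas are dropped from the volume
        return len(survivors)      # number of ACKs behind this completed write


r1, r2, r3 = Replica("r1"), Replica("r2"), Replica("r3")
engine = Engine([r1, r2, r3])
print(engine.write(0, b"data"))
r3.healthy = False
print(engine.write(1, b"more"))
```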

Snapshot & Backup

flowchart LR
    Vol["Volume\n(live data)"] --> Snap1["Snapshot 1\n(point-in-time)"]
    Snap1 --> Snap2["Snapshot 2\n(incremental)"]
    Snap2 --> Backup["Backup\n(to S3/NFS)"]
    Backup --> DR["DR Volume\n(remote cluster)"]

    style Backup fill:#1565c0,color:#fff


Benchmarks

Scope

Performance characteristics, scaling limits, and resource consumption for Longhorn.

I/O Performance

| Configuration | Seq Read | Seq Write | Random 4K Read | Random 4K Write |
|---|---|---|---|---|
| 1 replica | 500-800 MB/s | 300-500 MB/s | 15k IOPS | 8k IOPS |
| 2 replicas | 500-800 MB/s | 200-350 MB/s | 15k IOPS | 5k IOPS |
| 3 replicas | 500-800 MB/s | 150-300 MB/s | 15k IOPS | 3k IOPS |

Note

Performance depends heavily on underlying disk type (HDD vs SSD vs NVMe) and network bandwidth between nodes.

Resource Overhead

| Component | CPU | Memory | Scope |
|---|---|---|---|
| Longhorn Manager | 100-300m | 256-512Mi | Per node |
| Engine | 50-200m | 100-200Mi | Per volume |
| Replica | 50-100m | 100-200Mi | Per replica |
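
The per-component ranges above can be combined into a rough cluster-wide estimate. The sketch below only multiplies the table's figures by component counts; `OVERHEAD` and `estimate` are hypothetical names, and the result inherits the unsourced nature of the inputs.

```python
# (low, high) ranges copied from the table above: CPU in millicores, memory in MiB.
OVERHEAD = {
    "manager": {"cpu": (100, 300), "mem": (256, 512)},  # per node
    "engine": {"cpu": (50, 200), "mem": (100, 200)},    # per volume
    "replica": {"cpu": (50, 100), "mem": (100, 200)},   # per replica
}


def estimate(nodes: int, volumes: int, replicas_per_volume: int) -> dict:
    """Return {"cpu": (low, high), "mem": (low, high)} summed over all components."""
    counts = {
        "manager": nodes,
        "engine": volumes,
        "replica": volumes * replicas_per_volume,
    }
    total = {}
    for resource in ("cpu", "mem"):
        low = sum(OVERHEAD[c][resource][0] * n for c, n in counts.items())
        high = sum(OVERHEAD[c][resource][1] * n for c, n in counts.items())
        total[resource] = (low, high)
    return total


print(estimate(nodes=3, volumes=10, replicas_per_volume=3))
```

For 3 nodes and 10 volumes with 3 replicas each, this sums 3 managers, 10 engines, and 30 replicas.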

Scaling Limits

| Dimension | Limit | Notes |
|---|---|---|
| Volumes per cluster | 1,000+ | Manager memory scales with volume count |
| Volume size | 10 TB+ | Large volumes need more memory |
| Replicas per volume | 1-20 | Default is 3 |
| Nodes | 50+ | Tested in production |
| Snapshots per volume | 250 | Disk-based snapshots |

Sourcing Status

Unsourced Performance Data

The performance numbers in this document are estimated from vendor documentation, community benchmarks, and engineering judgment. They do not represent controlled benchmarks with documented test conditions. Specific hardware configurations, software versions, and test methodologies were not recorded.

Use these figures as rough guidance only. For production capacity planning, run your own benchmarks against your specific workload and infrastructure.
