Architecture¶
MinIO is a high-performance, S3-compatible object storage server designed for cloud-native and on-premises deployments. It uses erasure coding as its foundational resiliency mechanism and delivers multi-petabyte scale with deterministic performance on commodity hardware.
Deployment Architecture¶
graph TB
subgraph Clients
SDK1[S3 SDK<br/>Java/Python/Go/JS]
SDK2[MinIO mc CLI]
SDK3[S3-compatible app]
end
subgraph LB[Load Balancer<br/>NGINX / HAProxy]
end
subgraph Pool1["Server Pool 1 (4 nodes)"]
N1A[Node 1<br/>minio server]
N2A[Node 2<br/>minio server]
N3A[Node 3<br/>minio server]
N4A[Node 4<br/>minio server]
subgraph ES1["Erasure Set 1 (16 drives)"]
D1A[Drive 1]
D2A[Drive 2]
D15A[...Drive 16]
end
end
subgraph Pool2["Server Pool 2 (expansion)"]
N1B[Node 5]
N2B[Node 6]
N3B[Node 7]
N4B[Node 8]
end
SDK1 --> LB
SDK2 --> LB
SDK3 --> LB
LB --> N1A
LB --> N2A
LB --> N3A
LB --> N4A
LB --> N1B
LB --> N2B
LB --> N3B
LB --> N4B
Server Pools¶
A production MinIO deployment consists of at least 4 homogeneous nodes (matching CPU, RAM, storage, network). MinIO aggregates all nodes in the initial deployment into a single server pool.
Key properties:
- Locally-attached storage: MinIO performs best with direct-attached NVMe or SSD drives. Drives should be formatted as XFS, presented in JBOD configuration with no RAID, pooling, or hardware caching.
- Any node can serve any request: Every MinIO server has a complete picture of the distributed topology. The receiving node handles internode routing transparently.
- Pool expansion: New pools (groups of nodes) can be added to increase capacity. Each pool has its own independent erasure sets. MinIO queries each pool to locate the correct erasure set for a given object, which means each additional pool adds some internode coordination overhead.
Erasure Coding¶
How Erasure Coding Works¶
MinIO automatically groups all drives in a pool into erasure sets -- the foundational unit of availability and resiliency. Each erasure set consists of up to 16 drives striped symmetrically across nodes.
graph LR
OBJ["Object<br/>(binary data)"]
subgraph EC["Erasure Encoding EC:4"]
SH1["Data shard 1"]
SH2["Data shard 2"]
SH3["Data shard 3"]
SH4["Data shard 4"]
SH5["Data shard 5"]
SH6["Data shard 6"]
SH7["Data shard 7"]
SH8["Data shard 8"]
SH9["Parity shard 1"]
SH10["Parity shard 2"]
SH11["Parity shard 3"]
SH12["Parity shard 4"]
end
OBJ --> SH1
OBJ --> SH2
OBJ --> SH3
OBJ --> SH4
OBJ --> SH5
OBJ --> SH6
OBJ --> SH7
OBJ --> SH8
OBJ --> SH9
OBJ --> SH10
OBJ --> SH11
OBJ --> SH12
SH1 --> DRV1[(Drive 1)]
SH2 --> DRV2[(Drive 2)]
SH3 --> DRV3[(Drive 3)]
SH4 --> DRV4[(Drive 4)]
SH5 --> DRV5[(Drive 5)]
SH6 --> DRV6[(Drive 6)]
SH7 --> DRV7[(Drive 7)]
SH8 --> DRV8[(Drive 8)]
SH9 --> DRV9[(Drive 9)]
SH10 --> DRV10[(Drive 10)]
SH11 --> DRV11[(Drive 11)]
SH12 --> DRV12[(Drive 12)]
MinIO partitions each object into data shards and parity shards based on the configured parity level (EC:N). With the maximum parity of EC:8, an object is split into 8 data and 8 parity blocks across the erasure set.
Parity Levels and Fault Tolerance¶
| Parity Setting | Data Shards | Parity Shards | Storage Overhead | Drives Tolerated |
|---|---|---|---|---|
| EC:0 | 16 | 0 | 0% (replication only) | 0 |
| EC:2 | 14 | 2 | ~14% | 2 |
| EC:4 | 12 | 4 | ~33% | 4 |
| EC:8 | 8 | 8 | 100% | 8 |
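The table columns follow from the erasure set size and the parity count. The sketch below is a minimal illustration, assuming a 16-drive erasure set and defining overhead as parity shards divided by data shards; the function name `parity_profile` is hypothetical and not part of any MinIO tooling.

```python
def parity_profile(set_size: int, parity: int) -> dict:
    """Derive the table columns for one EC:N setting (illustrative only)."""
    if not 0 <= parity <= set_size // 2:
        raise ValueError("parity must be between 0 and half the erasure set size")
    data = set_size - parity
    return {
        "data_shards": data,
        "parity_shards": parity,
        # Overhead is the extra space spent on parity, relative to user data.
        "overhead_pct": round(100 * parity / data, 1),
        "drives_tolerated": parity,
    }


for n in (0, 2, 4, 8):
    print(f"EC:{n}", parity_profile(16, n))
```

Running the loop reproduces the four rows above, which makes it easy to check other set sizes before choosing a parity level.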
Read and Write Quorum¶
- Read quorum: MinIO needs at least `data_shards` intact shards (data or parity) to serve an object. With `EC:4` on a 16-drive erasure set, 12 of 16 drives must be available.
- Write quorum: MinIO needs at least `data_shards` drives available to accept a write. When parity is exactly half of the erasure set (for example `EC:8` on 16 drives), it needs `data_shards + 1` drives, which prevents split-brain writes to the same object.
- Bitrot protection: MinIO computes HighwayHash-256 checksums on every shard, detecting silent data corruption at the drive level.
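The read quorum rule can be expressed as a simple availability check. This is a sketch under the assumption that each drive in the erasure set holds exactly one shard of the object; `can_read` is a hypothetical helper, not a MinIO function.

```python
def can_read(set_size: int, parity: int, failed_drives: int) -> bool:
    """Return True when enough shards survive to serve the object."""
    data_shards = set_size - parity
    available = set_size - failed_drives
    # Any mix of data and parity shards works, as long as there are
    # at least as many intact shards as there are data shards.
    return available >= data_shards


print(can_read(16, 4, 4))  # True: 12 of 16 drives remain
print(can_read(16, 4, 5))  # False: only 11 drives remain
```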
Object Healing¶
When drives fail or shards become corrupted, MinIO heals objects automatically:
- Detects damaged or missing shards during read or scrub operations.
- Uses remaining data and parity shards to reconstruct lost shards.
- Writes healed shards to healthy drives (or replacement drives).
- Healing is transparent to the client; the requesting node reconstructs the full object before returning it.
Erasure Set Selection¶
MinIO uses a deterministic hashing algorithm based on the object name and path (BUCKET/PREFIX/.../OBJECT) to select the erasure set. For any given object namespace, MinIO always selects the same erasure set, ensuring consistency. No single drive contains only data or only parity for all objects; shards are randomized across drives for even load distribution.
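To illustrate deterministic placement, the sketch below maps an object path to an erasure set index. It is not MinIO's implementation: SHA-256 is used here only as a stand-in hash, and `select_erasure_set` and the bucket and key names are hypothetical.

```python
import hashlib


def select_erasure_set(bucket: str, key: str, set_count: int) -> int:
    """Map BUCKET/KEY to an erasure set index in a repeatable way."""
    digest = hashlib.sha256(f"{bucket}/{key}".encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and reduce modulo the set count.
    return int.from_bytes(digest[:8], "big") % set_count


first = select_erasure_set("photos", "2024/cat.jpg", 4)
second = select_erasure_set("photos", "2024/cat.jpg", 4)
print(first == second)  # True: the same path always selects the same set
```

Because the index depends only on the path and the number of sets, any node can compute it without consulting a central directory.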
Identity and Access Management (IAM)¶
MinIO implements a full IAM subsystem:
- Root credentials: Set via environment variables at startup. Equivalent to AWS root account.
- Users: Created via `mc admin user add`. Authenticated by access key + secret key.
- Groups: Logical groupings of users for policy attachment.
- Policies: JSON policy documents (AWS IAM policy format) attached to users or groups. Support policy variables like `${aws:username}` and `${jwt:preferred_username}` for OIDC-integrated policies.
- Built-in policies: `readwrite`, `readonly`, `writeonly`, `diagnostics`, `consoleAdmin`, etc.
- OIDC / LDAP integration: External identity providers can be configured for federated authentication.
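As an example of a policy that uses a policy variable, the sketch below builds a document in AWS IAM policy format that limits each user to a prefix named after them. The bucket name `home` and the chosen actions are assumptions for illustration only.

```python
import json

# Hypothetical policy: each user may read and write only under home/<username>/.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": ["arn:aws:s3:::home/${aws:username}/*"],
        }
    ],
}

policy_json = json.dumps(policy, indent=2)
print(policy_json)
```

The printed JSON could be saved to a file and attached to a user or group; the variable is resolved by the server at request time, not by this script.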
Security Token Service (STS)¶
MinIO supports STS for issuing temporary credentials:
- Web Identity (OIDC): Exchange an OIDC token for temporary MinIO credentials.
- Client Grants: Exchange a client credentials grant for temporary access.
- AssumeRole: Similar to AWS STS AssumeRole, allowing a user to assume a specific policy for a duration.
- LDAP STS: Bind LDAP credentials to temporary S3 access.
Temporary credentials include an access key, secret key, and session token with a configurable expiration.
Bucket Notifications¶
MinIO supports event notifications on bucket operations:
- Event types: `s3:ObjectCreated:*`, `s3:ObjectRemoved:*`, `s3:ObjectAccessed:*`
- Targets: AMQP, Elasticsearch, Kafka, MQTT, MySQL, NATS, PostgreSQL, Redis, Webhooks.
- Configuration via `mc event add` or the S3-compatible notification API.
- Notifications are not replicated across sites in a site replication configuration.
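The wildcard event types behave like prefix patterns over concrete event names. The sketch below shows one way a consumer might test whether an incoming event name matches a configured rule; it uses Python's `fnmatch` as a stand-in for the server's matching logic, and `event_matches` is a hypothetical helper.

```python
from fnmatch import fnmatchcase


def event_matches(event_name: str, patterns: list[str]) -> bool:
    """Return True if the event name matches any configured pattern."""
    return any(fnmatchcase(event_name, pattern) for pattern in patterns)


rules = ["s3:ObjectCreated:*", "s3:ObjectRemoved:*"]
print(event_matches("s3:ObjectCreated:Put", rules))   # True
print(event_matches("s3:ObjectAccessed:Get", rules))  # False
```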
Information Lifecycle Management (ILM)¶
ILM rules define automated object tiering and expiration:
- Transition rules: Move objects between storage tiers (for example, from NVMe to HDD-based MinIO or to a remote S3 tier) after a specified number of days.
- Expiration rules: Delete objects or incomplete multipart uploads after a specified period.
- Noncurrent version expiration: Manage lifecycle of noncurrent object versions in versioned buckets.
- ILM configurations are not replicated across sites in a site replication configuration.
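The interaction between transition and expiration rules can be sketched as a small decision function. This is a simplified model that assumes a single rule with day-based thresholds and gives expiration priority over transition; `LifecycleRule` and `lifecycle_action` are hypothetical names, not MinIO APIs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LifecycleRule:
    transition_after_days: Optional[int] = None
    expire_after_days: Optional[int] = None


def lifecycle_action(age_days: int, rule: LifecycleRule) -> str:
    """Decide what a rule would do to an object of the given age."""
    if rule.expire_after_days is not None and age_days >= rule.expire_after_days:
        return "expire"
    if rule.transition_after_days is not None and age_days >= rule.transition_after_days:
        return "transition"
    return "none"


rule = LifecycleRule(transition_after_days=30, expire_after_days=365)
print(lifecycle_action(10, rule))   # none
print(lifecycle_action(45, rule))   # transition
print(lifecycle_action(400, rule))  # expire
```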
Site Replication¶
MinIO supports multi-site replication for business continuity and disaster recovery (BC/DR) and for geo-distributed access:
graph TB
subgraph SiteA["Site A (Primary)"]
MA[MinIO Server]
end
subgraph SiteB["Site B (Peer)"]
MB[MinIO Server]
end
subgraph SiteC["Site C (Peer)"]
MC[MinIO Server]
end
GLB[Global Load Balancer<br/>Geo-local / failover]
GLB --> MA
GLB --> MB
GLB --> MC
MA <-->|Bidirectional replication| MB
MB <-->|Bidirectional replication| MC
MA <-->|Bidirectional replication| MC
- Bidirectional: All sites are peers; writes to any site replicate to all others.
- Replicated objects: Buckets, objects, and IAM configuration replicate automatically.
- Not replicated: Bucket notifications, ILM configurations, site-level settings.
- Setup: `mc admin replicate add site1 site2 site3`
- Latency consideration: Replication lag depends on inter-site network latency. A 100 ms round trip means at least 100 ms before an object is available on all peers.
- Queued replication: Transient failures are handled by queuing objects for retry.
Key Architectural Properties¶
- S3 API strict compatibility: Requests are signed with AWS Signature V4 (V2 is also accepted). Because the signature covers the signed headers, an intermediary such as a proxy cannot modify those headers without invalidating the request.
- Erasure coding by default: No separate replication layer; erasure coding provides both resiliency and storage efficiency.
- Deterministic placement: Object-to-erasure-set mapping is hash-based and consistent.
- No RAID, no caching: MinIO expects direct access to raw XFS drives. Hardware RAID or drive-level caching introduces unpredictable performance.
- Any-to-any routing: Any node can handle any request and internally routes to the correct erasure set.
- Pool-based horizontal scaling: Add capacity by deploying additional pools; existing data stays in place.
How It Works¶
Erasure coding internals, distributed object placement, bitrot protection, and S3-native architecture.
Architecture Overview¶
flowchart TB
subgraph MinIO_C["MinIO Cluster"]
subgraph ES["Erasure Set (4+4 example)"]
D1["Drive 1\n(data)"]
D2["Drive 2\n(data)"]
D3["Drive 3\n(data)"]
D4["Drive 4\n(data)"]
P1["Drive 5\n(parity)"]
P2["Drive 6\n(parity)"]
P3["Drive 7\n(parity)"]
P4["Drive 8\n(parity)"]
end
end
Client_M["S3 Client"] -->|"PUT object"| MinIO_C
NOTE["With 4 parity drives,<br/>survives loss of any 4 drives"]
ES -.- NOTE
style ES fill:#c62828,color:#fff
Erasure Set Organization¶
MinIO groups drives into erasure sets -- fixed-size groups of drives (default: 16 drives per set, configurable from 4 to 16). A cluster with 64 drives has 4 erasure sets of 16 drives each.
- Each object is placed on exactly one erasure set using a deterministic hash of the object name
- Within the erasure set, the object is split into data and parity shards using Reed-Solomon coding
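- The ratio of data to parity shards is set deployment-wide through the storage class configuration (for example, the `MINIO_STORAGE_CLASS_STANDARD` environment variable). Individual objects can select the standard or reduced-redundancy class through the S3 `x-amz-storage-class` header
- Default parity: `EC:4` for production deployments (configurable through the same storage class setting). The maximum parity is `N/2`, where N is the erasure set size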
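The 64-drive example above is a straightforward division. The sketch below assumes the caller supplies the set size and that the drive count divides evenly; MinIO chooses the set size itself, so `erasure_set_count` is a hypothetical helper for capacity planning only.

```python
def erasure_set_count(total_drives: int, set_size: int = 16) -> int:
    """Return how many erasure sets a pool of drives would form."""
    if not 4 <= set_size <= 16:
        raise ValueError("set size must be between 4 and 16 drives")
    if total_drives % set_size != 0:
        raise ValueError("total drives must be a multiple of the set size")
    return total_drives // set_size


print(erasure_set_count(64))  # 4
```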
Erasure Coding Math¶
For a 16-drive erasure set with 8 data + 8 parity:
- A 16MB object is split into 8 x 2MB data shards
- 8 x 2MB parity shards are computed from the data shards
- All 16 shards (32MB total) are written to 16 drives
- Storage efficiency: 50% (32MB stored for 16MB of user data)
- Durability: tolerates loss of any 8 drives simultaneously
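The same arithmetic can be checked in a few lines. This sketch assumes the object divides evenly across the data shards and ignores metadata; `shard_layout` is a hypothetical helper.

```python
def shard_layout(object_mb: float, data_shards: int, parity_shards: int) -> dict:
    """Compute per-shard size, total bytes stored, and storage efficiency."""
    shard_mb = object_mb / data_shards
    stored_mb = shard_mb * (data_shards + parity_shards)
    return {
        "shard_mb": shard_mb,
        "stored_mb": stored_mb,
        "efficiency": object_mb / stored_mb,
    }


print(shard_layout(16, 8, 8))
# {'shard_mb': 2.0, 'stored_mb': 32.0, 'efficiency': 0.5}
```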
Write Path¶
sequenceDiagram
participant Client_MW as S3 Client
participant MinIO_H as MinIO Node
participant EC as Erasure Coder
participant Drives as Local Drives (NVMe/SSD)
Client_MW->>MinIO_H: PUT /bucket/object (multipart)
MinIO_H->>EC: Split into data + parity shards
EC->>Drives: Write shards across drives
Drives-->>MinIO_H: All shards written
MinIO_H-->>Client_MW: 200 OK (ETag)
Write Internals¶
- Hash to erasure set: Object name + bucket is hashed to select the target erasure set
- Bitrot hashing: A HighwayHash-256 checksum is computed for each shard. Unlike MD5/SHA256, HighwayHash is SIMD-accelerated and runs at memory bandwidth speeds
- Parallel shard write: All data and parity shards are written to their respective drives simultaneously
- Quorum check: Write succeeds once a write quorum of shards is confirmed. The quorum equals the number of data shards, or (N/2)+1 when parity is exactly half of the N shards in the erasure set
- Metadata: Object metadata (size, ETag, content-type, custom headers) is stored alongside the data shards in `xl.meta` format
Bitrot Protection¶
MinIO uses HighwayHash for bitrot detection at the shard level. Each shard gets a checksum stored in its metadata:
- Traditional bitrot (silent data corruption from disk firmware, cosmic rays, etc.) is detected on every read
- If a corrupted shard is detected during read, MinIO reconstructs it from the remaining healthy shards
- The reconstruction happens transparently -- the S3 client sees no error
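Checksum-based detection can be sketched as follows. HighwayHash is not in the Python standard library, so this example substitutes BLAKE2b with a 256-bit digest purely as a stand-in; the function names are hypothetical.

```python
import hashlib


def shard_checksum(shard: bytes) -> str:
    """Compute a 256-bit checksum for one shard (BLAKE2b as a stand-in)."""
    return hashlib.blake2b(shard, digest_size=32).hexdigest()


def is_corrupted(shard: bytes, stored_checksum: str) -> bool:
    """Compare the shard's current checksum with the one saved at write time."""
    return shard_checksum(shard) != stored_checksum


original = b"shard payload"
stored = shard_checksum(original)
# Flip a single bit in the first byte to simulate silent corruption.
flipped = bytes([original[0] ^ 0x01]) + original[1:]

print(is_corrupted(original, stored))  # False
print(is_corrupted(flipped, stored))   # True
```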
Read Path¶
- Client sends `GET /bucket/object`
- MinIO hashes the object name to find the erasure set
- Reads data shards from the fastest drives (measured by recent latency)
- If any data shard is corrupted or missing, reads parity shards in its place and reconstructs the missing data
- Returns the reassembled object to the client
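The reconstruction step can be illustrated with a much simpler code than Reed-Solomon. The sketch below uses a single XOR parity shard, which can recover at most one missing shard; it is a teaching stand-in, not the algorithm MinIO uses, and all names are hypothetical.

```python
from functools import reduce
from typing import Optional


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def make_parity(shards: list[bytes]) -> bytes:
    """Compute one parity shard as the XOR of all data shards."""
    return reduce(xor_bytes, shards)


def reconstruct(shards: list[Optional[bytes]], parity: bytes) -> list[bytes]:
    """Rebuild at most one missing (None) data shard from the parity shard."""
    missing = [i for i, shard in enumerate(shards) if shard is None]
    if len(missing) > 1:
        raise ValueError("single XOR parity can recover only one shard")
    repaired = list(shards)
    if missing:
        present = [shard for shard in shards if shard is not None]
        # XOR of the parity with every surviving shard yields the lost one.
        repaired[missing[0]] = reduce(xor_bytes, present, parity)
    return repaired


data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
parity = make_parity(data)
damaged = [data[0], None, data[2], data[3]]
print(b"".join(reconstruct(damaged, parity)))  # b'AAAABBBBCCCCDDDD'
```

Reed-Solomon generalizes this idea so that any combination of lost shards, up to the parity count, can be rebuilt.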
⚠️ OSS Archive Notice¶
The MinIO open-source repository was archived February 13, 2026. MinIO Inc. now develops the commercial AIStor product. For open-source S3 storage, consider Ceph RGW, SeaweedFS, or Garage.
Benchmarks¶
Scope
Performance characteristics, scaling limits, and resource consumption for MinIO.
Object Storage Performance¶
| Configuration | PUT (obj/s) | GET (obj/s) | Throughput |
|---|---|---|---|
| 4 nodes, HDD | 500-1,000 | 1,000-2,000 | 1-2 GB/s |
| 4 nodes, SSD | 2,000-5,000 | 5,000-10,000 | 5-10 GB/s |
| 16 nodes, NVMe | 10,000-30,000 | 30,000-80,000 | 30-80 GB/s |
Erasure Coding Overhead¶
| Parity | Storage Efficiency | Write Penalty | Failure Tolerance |
|---|---|---|---|
| EC:2 | 87.5% | +15% | 2 drives |
| EC:4 (default) | 75% | +30% | 4 drives |
| EC:8 | 50% | +60% | 8 drives |
Scaling Limits¶
| Dimension | Limit | Notes |
|---|---|---|
| Objects per bucket | Billions | No practical limit |
| Object size | 5TB (single PUT) | Multipart for larger |
| Buckets per server | 1,000+ | Metadata overhead |
| Server pools | 32 | Horizontal expansion |
| Total capacity | Exabytes | Linear scaling |
Resource Requirements¶
| Nodes | CPU/Node | Memory/Node | Network |
|---|---|---|---|
| 4 (minimum) | 4 vCPU | 8Gi | 10Gbps |
| 8 (production) | 8 vCPU | 16Gi | 25Gbps |
| 16 (large) | 16 vCPU | 32Gi | 25-100Gbps |
Sourcing Status¶
Unsourced Performance Data
The performance numbers in this document are estimated from vendor documentation, community benchmarks, and engineering judgment. They do not represent controlled benchmarks with documented test conditions. Specific hardware configurations, software versions, and test methodologies were not recorded.
Use these figures as rough guidance only. For production capacity planning, run your own benchmarks against your specific workload and infrastructure.