Architecture¶
PostgreSQL uses a process-per-connection model supervised by a postmaster daemon. Shared memory structures (buffer pool, WAL buffer, lock tables) are accessible to all backend processes, while each backend handles query parsing, planning, and execution independently. This architecture provides strong isolation between client sessions and relies on the Write-Ahead Log (WAL) protocol for crash recovery.
See also: index, architecture, operations, security
Process Architecture¶
When PostgreSQL starts, the postmaster daemon initialises shared memory and spawns several background processes. For each incoming client connection, the postmaster forks a dedicated backend process.
graph TB
POSTMASTER["postmaster<br/>(supervisor daemon)"]
subgraph "Background Processes"
BGWRITER["Background Writer<br/>(bgwriter)"]
CHECKPT["Checkpointer"]
WALWRITER["WAL Writer<br/>(walwriter)"]
AUTOVAC["Autovacuum Launcher"]
AUTOWORK["Autovacuum Workers<br/>(dynamic)"]
STATS["Statistics Collector<br/>(or shared memory stats in PG 15+)"]
LOGICAL["Logical Replication<br/>Worker (if configured)"]
end
subgraph "Client Backends"
BE1["Backend Process<br/>(session 1)"]
BE2["Backend Process<br/>(session 2)"]
BEN["Backend Process<br/>(session N)"]
end
subgraph "Shared Memory"
SHARED["shared_buffers<br/>WAL Buffer<br/>Lock Tables<br/>ProcArray / CLog"]
end
POSTMASTER --> BGWRITER
POSTMASTER --> CHECKPT
POSTMASTER --> WALWRITER
POSTMASTER --> AUTOVAC
AUTOVAC --> AUTOWORK
POSTMASTER --> STATS
POSTMASTER --> LOGICAL
POSTMASTER --> BE1
POSTMASTER --> BE2
POSTMASTER --> BEN
BE1 --- SHARED
BE2 --- SHARED
BEN --- SHARED
BGWRITER --- SHARED
CHECKPT --- SHARED
WALWRITER --- SHARED
style POSTMASTER fill:#e8f4f8,stroke:#2196f3,color:#000
style SHARED fill:#fff3e0,stroke:#ff9800,color:#000
style BGWRITER fill:#e8f5e9,stroke:#4caf50,color:#000
Background Processes¶
| Process | Role |
|---|---|
| Background Writer (bgwriter) | Periodically writes dirty pages from shared_buffers to the OS page cache. Its goal is to reduce the I/O spike during checkpoints by spreading writes over time. Tuned via bgwriter_delay, bgwriter_lru_maxpages. |
| Checkpointer | Forces all dirty pages in shared_buffers to disk (via fsync) and records a checkpoint record in the WAL. This allows WAL segments older than the checkpoint to be recycled or archived. Triggered by checkpoint_timeout (default 5 min) or max_wal_size being reached. |
| WAL Writer (walwriter) | Flushes WAL records from the in-memory WAL buffer to WAL segment files on disk. Runs on a cycle controlled by wal_writer_delay (default 200 ms). Ensures recent WAL records are durable even for transactions that have not yet committed. |
| Autovacuum Launcher | Monitors table statistics and spawns autovacuum worker processes for tables that have exceeded their dead-tuple threshold. Essential for MVCC garbage collection. |
| Statistics Collector | Collects runtime statistics (table access counts, dead tuple counts, etc.). In PostgreSQL 15+, replaced by an in-memory shared statistics area eliminating the separate collector process. |
Shared Memory Structures¶
PostgreSQL allocates shared memory at startup. The most important regions are:
graph LR
subgraph "Shared Memory"
SB["shared_buffers<br/>(8 KiB page frames)"]
WALB["WAL Buffer<br/>(circular)"]
LT["Lock Tables"]
PA["ProcArray<br/>(active txn XIDs)"]
CL["CLOG / CommitLog<br/>(txn status bits)"]
MT["MultiXact<br/>(shared row locks)"]
end
style SB fill:#e3f2fd,stroke:#2196f3,color:#000
style WALB fill:#fce4ec,stroke:#e91e63,color:#000
| Region | Purpose |
|---|---|
| shared_buffers | Main buffer pool. Default 128 MiB; production systems typically set this to 25% of system RAM. Pages are 8 KiB. Uses a clock-sweep algorithm for page eviction. |
| WAL Buffer | Circular buffer for WAL records before they are written to disk. Default 64 KiB (wal_buffers). Autotuned to 1/32 of shared_buffers if left at default. |
| Lock Tables | Shared lock manager tracking relation-level, page-level, tuple-level, and advisory locks held by all backends. |
| ProcArray | Array of all active backend processes and their current transaction IDs. Used to compute snapshot visibility (which XIDs are visible to a given transaction). |
| CLOG (Commit Log) | Stores the commit status of each transaction ID (in-progress, committed, aborted). Stored as bitmaps in shared memory and materialised to disk in pg_xact/. |
| MultiXact | Tracks sets of transaction IDs that hold locks on the same row. Used for SELECT ... FOR SHARE and similar operations. |
WAL (Write-Ahead Log)¶
The WAL protocol is the foundation of PostgreSQL's crash recovery. The core rule: dirty data pages are never written to disk before their corresponding WAL records have been flushed.
WAL Record Lifecycle¶
sequenceDiagram
participant BE as Backend Process
participant WB as WAL Buffer
participant WF as WAL Segment Files<br/>(pg_wal/)
participant CK as Checkpointer
participant DF as Data Files<br/>(table extents)
BE->>BE: Modify tuple in shared_buffers
BE->>WB: XLogInsertRecord (WAL entry)
Note over BE,WB: WAL record appended to WAL buffer
BE->>WF: XLogFlush at COMMIT
Note over BE,WF: Ensures durability before ack to client
CK->>DF: Write all dirty pages to disk (fsync)
CK->>WF: Write checkpoint record
Note over CK,WF: Old WAL segments before checkpoint<br/>can be recycled or archived
WAL Configuration¶
| Parameter | Default | Purpose |
|---|---|---|
wal_level |
replica |
Controls amount of WAL data. minimal for crash recovery only; replica for streaming replication; logical for logical decoding. |
max_wal_size |
1 GiB | Maximum WAL size before a checkpoint is triggered. |
min_wal_size |
80 MiB | Minimum WAL size to keep for recycling. |
checkpoint_timeout |
5 min | Maximum time between automatic checkpoints. |
wal_buffers |
-1 (auto) | Size of WAL buffer in shared memory. Auto = 1/32 of shared_buffers. |
wal_compression |
off |
Compress full-page images in WAL. lz4 or zstd reduce WAL volume significantly. |
Full-Page Writes (FPW)¶
After a checkpoint, the first modification to any data page triggers a full-page image (FPI) to be written to the WAL. This protects against torn page writes: if a partial 8 KiB write occurs, the complete page can be reconstructed from the FPI in the WAL.
MVCC and VACUUM¶
PostgreSQL implements Multi-Version Concurrency Control (MVCC) using tuple header fields, not undo logs. Each row version (tuple) has:
- xmin -- XID of the transaction that inserted this tuple.
- xmax -- XID of the transaction that deleted or updated this tuple (0 if still valid).
- infomask -- status bits (committed, aborted, locked, etc.).
Visibility Rules¶
When a backend reads a table, it computes a snapshot of which transaction IDs are visible:
- Tuples with
xmincommitted andxmaxeither 0 or not-yet-committed are visible. - Tuples with
xminnot-yet-committed are invisible (still in-progress from another session). - Tuples with
xmaxcommitted are invisible (deleted or replaced by an update).
Dead Tuples and VACUUM¶
When rows are updated or deleted, the old tuple versions become dead tuples. They remain on disk until VACUUM reclaims the space.
- VACUUM -- marks dead tuple space as available for reuse. Does not shrink the file or return space to the OS (unless the dead tuples are at the end of the file).
- VACUUM FULL -- rewrites the entire table, compacting it and returning space to the OS. Requires an exclusive lock.
- Autovacuum -- daemon that automatically runs VACUUM and ANALYZE based on thresholds defined by
autovacuum_vacuum_scale_factorandautovacuum_vacuum_threshold. Enabled by default.
Bloat from Long-Running Transactions
Long-running transactions (including idle-in-transaction sessions) prevent autovacuum from reclaiming dead tuples that are newer than the oldest open transaction's snapshot. This causes table and index bloat.
Transaction ID Wraparound¶
PostgreSQL uses a 32-bit transaction ID counter. Without intervention, it would wrap around after approximately 2 billion transactions, causing data corruption. To prevent this:
- Freeze operations mark old tuples as "always visible" by replacing their
xminwith a specialFrozenXID. - Autovacuum triggers freeze operations when
autovacuum_freeze_max_age(default 200 million transactions) is approached. - The system forces an autovacuum to prevent wraparound even if autovacuum is otherwise disabled.
Query Processing Pipeline¶
Each backend processes queries through the following stages:
- Parser -- converts SQL text into a parse tree.
- Analyzer/Semantic Analysis -- resolves table and column references, performs type checking. Produces a query tree.
- Rewriter -- applies rules (e.g., views are expanded, RLS policies are attached).
- Planner/Optimizer -- generates execution plans. Uses a cost-based optimizer that considers:
- Sequential scans vs. index scans vs. index-only scans.
- Join methods: nested loop, hash join, merge join.
- Join ordering via dynamic programming or GEQO (for many-table joins).
- Parallel query paths (parallel sequential scan, parallel hash join).
- Executor -- runs the plan using a tuple-oriented pipeline (each node pulls tuples from its children).
Logical Replication¶
PostgreSQL supports two native replication mechanisms:
Streaming (Physical) Replication¶
- Ships WAL byte streams to standby servers.
- Standbys are byte-for-byte copies of the primary.
- Supports synchronous and asynchronous modes.
- Standbys can serve read-only queries (hot standby).
Logical Replication¶
- Decodes WAL records into logical changes (INSERT, UPDATE, DELETE) using a logical decoding output plugin (e.g.,
pgoutput). - Publications define which tables to replicate; subscriptions consume them.
- Uses replication slots to track the consumer's WAL position, ensuring the primary retains WAL until all subscribers have consumed it.
- Allows selective table replication and cross-version replication.
graph LR
PUB["Primary<br/>(Publisher)"]
WALD["Logical Decoder<br/>(pgoutput plugin)"]
SLOT["Replication Slot<br/>(tracks LSN)"]
SUB1["Subscriber 1"]
SUB2["Subscriber 2"]
PUB --> WALD --> SLOT
SLOT -->|"pg_replication slot"| SUB1
SLOT -->|"pg_replication slot"| SUB2
style PUB fill:#e8f5e9,stroke:#4caf50,color:#000
style SLOT fill:#fff3e0,stroke:#ff9800,color:#000
Replication Slot Dangers
If a subscriber is offline for an extended period, its replication slot prevents WAL recycling on the primary. Monitor pg_replication_slots and set max_slot_wal_keep_size to prevent disk exhaustion.
Extension API¶
PostgreSQL's extensibility model allows adding new types, functions, operators, index access methods, and procedural languages without modifying core code:
- Extensions -- packaged via
CREATE EXTENSION(e.g.,pgcrypto,PostGIS,pg_stat_statements). - Procedural Languages -- PL/pgSQL (built-in), PL/Python, PL/Perl, PL/v8 (JavaScript).
- Custom Scan Providers -- allow extensions to replace or augment query execution plans (used by Citus for distributed queries).
- Foreign Data Wrappers (FDW) -- access external data sources as local tables (e.g.,
postgres_fdw,file_fdw).
Key Configuration Parameters¶
| Parameter | Default | Purpose |
|---|---|---|
shared_buffers |
128 MiB | Buffer pool size. Set to ~25% of system RAM. |
work_mem |
4 MiB | Per-sort/hash memory before spilling to disk. |
maintenance_work_mem |
64 MiB | Memory for VACUUM, CREATE INDEX, ALTER TABLE. |
effective_cache_size |
4 GiB | Planner hint for total OS + PG cache. Set to ~75% of RAM. |
max_connections |
100 | Maximum concurrent backend processes. |
wal_level |
replica |
WAL detail level for replication support. |
max_wal_size |
1 GiB | Checkpoint trigger threshold. |
autovacuum |
ON | Enable automatic VACUUM and ANALYZE. |
Sources¶
- PostgreSQL Documentation -- Architecture
- PostgreSQL Documentation -- WAL
- PostgreSQL Documentation -- MVCC
- PostgreSQL Documentation -- Logical Replication
How It Works¶
Process model, shared memory, MVCC, WAL, query execution, and v18 async I/O.
Process Architecture¶
flowchart TB
subgraph Postmaster["Postmaster (main process)"]
PM["Process Manager\n(forks backends)"]
end
subgraph Backends["Backend Processes"]
B1["Backend 1\n(client connection)"]
B2["Backend 2"]
BN["Backend N"]
end
subgraph Background["Background Processes"]
WAL_W["WAL Writer"]
Checkpointer["Checkpointer"]
Autovac["Autovacuum"]
BGWriter["Background Writer"]
StatsCol["Stats Collector"]
end
subgraph SharedMem["Shared Memory"]
SharedBuf["Shared Buffers\n(page cache)"]
WAL_Buf["WAL Buffers"]
CLOG["CLOG\n(transaction status)"]
end
PM --> Backends
PM --> Background
Backends --> SharedMem
Background --> SharedMem
style SharedMem fill:#1565c0,color:#fff
MVCC (Multi-Version Concurrency Control)¶
Each row has hidden xmin (created by) and xmax (deleted by) transaction IDs. Readers never block writers.
sequenceDiagram
participant TX1 as Transaction 1 (xid=100)
participant Heap as Table Heap
participant TX2 as Transaction 2 (xid=101)
TX1->>Heap: UPDATE row → creates new version (xmin=100)
Note over Heap: Old version: xmax=100<br/>New version: xmin=100, xmax=∞
TX2->>Heap: SELECT → sees old version (100 not committed yet)
TX1->>TX1: COMMIT
TX2->>Heap: SELECT → now sees new version
Async I/O (v18)¶
flowchart LR
subgraph Old["v17 (Synchronous I/O)"]
Req1_O["Read page 1"] --> Wait1["⏳ Wait"] --> Req2_O["Read page 2"] --> Wait2["⏳ Wait"]
end
subgraph New["v18 (Async I/O)"]
Req1_N["Read page 1"]
Req2_N["Read page 2"]
Req3_N["Read page 3"]
Req1_N --> Batch["io_uring batch\n(all in parallel)"]
Req2_N --> Batch
Req3_N --> Batch
Batch --> Done["All pages ready\n(2-3× faster)"]
end
style Old fill:#c62828,color:#fff
style New fill:#2e7d32,color:#fff
Sources¶
Benchmarks¶
Scope
PostgreSQL performance metrics, pgbench results, scaling characteristics, and comparison baselines.
pgbench Results (Standard Benchmark)¶
TPC-B Like Workload¶
| Hardware | Clients | TPS (read-write) | TPS (read-only) | Latency P99 |
|---|---|---|---|---|
| 4 vCPU, 16Gi, SSD | 16 | 2,500-4,000 | 15,000-25,000 | 5-10ms |
| 8 vCPU, 32Gi, NVMe | 32 | 8,000-15,000 | 50,000-80,000 | 2-5ms |
| 16 vCPU, 64Gi, NVMe | 64 | 20,000-35,000 | 100,000-150,000 | 1-3ms |
| 32 vCPU, 128Gi, NVMe | 128 | 40,000-60,000 | 200,000-300,000 | 1-2ms |
Index Performance¶
| Operation | B-tree | Hash | GIN | GiST | BRIN |
|---|---|---|---|---|---|
| Point lookup | ~0.1ms | ~0.05ms | N/A | ~0.5ms | ~1ms |
| Range scan | ~1ms | N/A | ~5ms | ~2ms | ~0.5ms |
| Insert overhead | Low | Low | High | Medium | Very low |
| Storage per row | 8-16 bytes | 4-8 bytes | Variable | Variable | ~1 byte |
Connection Scaling¶
| Connection Count | Without PgBouncer | With PgBouncer | Notes |
|---|---|---|---|
| 50 | 100% baseline | 100% | No difference |
| 200 | 85-90% | 98% | PgBouncer multiplexes |
| 500 | 60-70% | 95% | Memory pressure without pooler |
| 1,000 | 30-40% | 90% | Context switching kills perf |
| 5,000 | Fails (OOM) | 85% | Must use connection pooling |
WAL Write Performance¶
| Storage Type | WAL Write Throughput | fsync Latency | Notes |
|---|---|---|---|
| HDD | 50-100 MB/s | 5-20ms | Not recommended for production |
| SATA SSD | 200-500 MB/s | 0.5-2ms | Acceptable for small deployments |
| NVMe SSD | 1-3 GB/s | 0.05-0.2ms | Recommended for production |
Replication Performance¶
| Scenario | Replication Lag | Throughput | Notes |
|---|---|---|---|
| Async, same DC | < 10ms | Near line-rate | Default config |
| Async, cross-DC | 10-100ms | Network limited | WAN bandwidth matters |
| Sync, same DC | < 1ms | 70-80% of standalone | Commit must wait for replica |
| Sync, cross-DC | 10-50ms | 20-40% of standalone | Not recommended for write-heavy |
Scaling Limits¶
| Dimension | Soft Limit | Hard Limit | Notes |
|---|---|---|---|
| Database size | 10TB (comfortable) | 100TB+ | Needs partitioning beyond 10TB |
| Table size | 1TB | 32TB | Partition large tables |
| Rows per table | 1 billion | No hard limit | Performance degrades with bloat |
| Columns per table | 250 | 1,600 | TOAST for wide rows |
| Indexes per table | 20 | No limit | Each index adds write overhead |
| Concurrent connections | 200-500 | 10,000+ (with pooler) | Use PgBouncer |
Sourcing Status¶
Unsourced Performance Data
The performance numbers in this document are estimated from vendor documentation, community benchmarks, and engineering judgment. They do not represent controlled benchmarks with documented test conditions. Specific hardware configurations, software versions, and test methodologies were not recorded.
Use these figures as rough guidance only. For production capacity planning, run your own benchmarks against your specific workload and infrastructure.