Architecture¶

PostgreSQL uses a process-per-connection model supervised by a postmaster daemon. Shared memory structures (buffer pool, WAL buffer, lock tables) are accessible to all backend processes, while each backend handles query parsing, planning, and execution independently. This architecture provides strong isolation between client sessions and relies on the Write-Ahead Log (WAL) protocol for crash recovery.

See also: index, architecture, operations, security

Process Architecture¶

When PostgreSQL starts, the postmaster daemon initialises shared memory and spawns several background processes. For each incoming client connection, the postmaster forks a dedicated backend process.

graph TB
    POSTMASTER["postmaster<br/>(supervisor daemon)"]

    subgraph "Background Processes"
        BGWRITER["Background Writer<br/>(bgwriter)"]
        CHECKPT["Checkpointer"]
        WALWRITER["WAL Writer<br/>(walwriter)"]
        AUTOVAC["Autovacuum Launcher"]
        AUTOWORK["Autovacuum Workers<br/>(dynamic)"]
        STATS["Statistics Collector<br/>(or shared memory stats in PG 15+)"]
        LOGICAL["Logical Replication<br/>Worker (if configured)"]
    end

    subgraph "Client Backends"
        BE1["Backend Process<br/>(session 1)"]
        BE2["Backend Process<br/>(session 2)"]
        BEN["Backend Process<br/>(session N)"]
    end

    subgraph "Shared Memory"
        SHARED["shared_buffers<br/>WAL Buffer<br/>Lock Tables<br/>ProcArray / CLog"]
    end

    POSTMASTER --> BGWRITER
    POSTMASTER --> CHECKPT
    POSTMASTER --> WALWRITER
    POSTMASTER --> AUTOVAC
    AUTOVAC --> AUTOWORK
    POSTMASTER --> STATS
    POSTMASTER --> LOGICAL
    POSTMASTER --> BE1
    POSTMASTER --> BE2
    POSTMASTER --> BEN

    BE1 --- SHARED
    BE2 --- SHARED
    BEN --- SHARED
    BGWRITER --- SHARED
    CHECKPT --- SHARED
    WALWRITER --- SHARED

    style POSTMASTER fill:#e8f4f8,stroke:#2196f3,color:#000
    style SHARED fill:#fff3e0,stroke:#ff9800,color:#000
    style BGWRITER fill:#e8f5e9,stroke:#4caf50,color:#000

Background Processes¶

Process	Role
Background Writer (bgwriter)	Periodically writes dirty pages from `shared_buffers` to the OS page cache. Its goal is to reduce the I/O spike during checkpoints by spreading writes over time. Tuned via `bgwriter_delay`, `bgwriter_lru_maxpages`.
Checkpointer	Forces all dirty pages in `shared_buffers` to disk (via `fsync`) and records a checkpoint record in the WAL. This allows WAL segments older than the checkpoint to be recycled or archived. Triggered by `checkpoint_timeout` (default 5 min) or `max_wal_size` being reached.
WAL Writer (walwriter)	Flushes WAL records from the in-memory WAL buffer to WAL segment files on disk. Runs on a cycle controlled by `wal_writer_delay` (default 200 ms). Ensures recent WAL records are durable even for transactions that have not yet committed.
Autovacuum Launcher	Monitors table statistics and spawns autovacuum worker processes for tables that have exceeded their dead-tuple threshold. Essential for MVCC garbage collection.
Statistics Collector	Collects runtime statistics (table access counts, dead tuple counts, etc.). In PostgreSQL 15+, replaced by an in-memory shared statistics area eliminating the separate collector process.

Shared Memory Structures¶

PostgreSQL allocates shared memory at startup. The most important regions are:

graph LR
    subgraph "Shared Memory"
        SB["shared_buffers<br/>(8 KiB page frames)"]
        WALB["WAL Buffer<br/>(circular)"]
        LT["Lock Tables"]
        PA["ProcArray<br/>(active txn XIDs)"]
        CL["CLOG / CommitLog<br/>(txn status bits)"]
        MT["MultiXact<br/>(shared row locks)"]
    end

    style SB fill:#e3f2fd,stroke:#2196f3,color:#000
    style WALB fill:#fce4ec,stroke:#e91e63,color:#000

Region	Purpose
shared_buffers	Main buffer pool. Default 128 MiB; production systems typically set this to 25% of system RAM. Pages are 8 KiB. Uses a clock-sweep algorithm for page eviction.
WAL Buffer	Circular buffer for WAL records before they are written to disk. Default 64 KiB (`wal_buffers`). Autotuned to 1/32 of `shared_buffers` if left at default.
Lock Tables	Shared lock manager tracking relation-level, page-level, tuple-level, and advisory locks held by all backends.
ProcArray	Array of all active backend processes and their current transaction IDs. Used to compute snapshot visibility (which XIDs are visible to a given transaction).
CLOG (Commit Log)	Stores the commit status of each transaction ID (in-progress, committed, aborted). Stored as bitmaps in shared memory and materialised to disk in `pg_xact/`.
MultiXact	Tracks sets of transaction IDs that hold locks on the same row. Used for `SELECT ... FOR SHARE` and similar operations.

WAL (Write-Ahead Log)¶

The WAL protocol is the foundation of PostgreSQL's crash recovery. The core rule: dirty data pages are never written to disk before their corresponding WAL records have been flushed.

WAL Record Lifecycle¶

sequenceDiagram
    participant BE as Backend Process
    participant WB as WAL Buffer
    participant WF as WAL Segment Files<br/>(pg_wal/)
    participant CK as Checkpointer
    participant DF as Data Files<br/>(table extents)

    BE->>BE: Modify tuple in shared_buffers
    BE->>WB: XLogInsertRecord (WAL entry)
    Note over BE,WB: WAL record appended to WAL buffer
    BE->>WF: XLogFlush at COMMIT
    Note over BE,WF: Ensures durability before ack to client
    CK->>DF: Write all dirty pages to disk (fsync)
    CK->>WF: Write checkpoint record
    Note over CK,WF: Old WAL segments before checkpoint<br/>can be recycled or archived

WAL Configuration¶

Parameter	Default	Purpose
`wal_level`	`replica`	Controls amount of WAL data. `minimal` for crash recovery only; `replica` for streaming replication; `logical` for logical decoding.
`max_wal_size`	1 GiB	Maximum WAL size before a checkpoint is triggered.
`min_wal_size`	80 MiB	Minimum WAL size to keep for recycling.
`checkpoint_timeout`	5 min	Maximum time between automatic checkpoints.
`wal_buffers`	-1 (auto)	Size of WAL buffer in shared memory. Auto = 1/32 of `shared_buffers`.
`wal_compression`	`off`	Compress full-page images in WAL. `lz4` or `zstd` reduce WAL volume significantly.

Full-Page Writes (FPW)¶

After a checkpoint, the first modification to any data page triggers a full-page image (FPI) to be written to the WAL. This protects against torn page writes: if a partial 8 KiB write occurs, the complete page can be reconstructed from the FPI in the WAL.

MVCC and VACUUM¶

PostgreSQL implements Multi-Version Concurrency Control (MVCC) using tuple header fields, not undo logs. Each row version (tuple) has:

xmin -- XID of the transaction that inserted this tuple.
xmax -- XID of the transaction that deleted or updated this tuple (0 if still valid).
infomask -- status bits (committed, aborted, locked, etc.).

Visibility Rules¶

When a backend reads a table, it computes a snapshot of which transaction IDs are visible:

Tuples with xmin committed and xmax either 0 or not-yet-committed are visible.
Tuples with xmin not-yet-committed are invisible (still in-progress from another session).
Tuples with xmax committed are invisible (deleted or replaced by an update).

Dead Tuples and VACUUM¶

When rows are updated or deleted, the old tuple versions become dead tuples. They remain on disk until VACUUM reclaims the space.

VACUUM -- marks dead tuple space as available for reuse. Does not shrink the file or return space to the OS (unless the dead tuples are at the end of the file).
VACUUM FULL -- rewrites the entire table, compacting it and returning space to the OS. Requires an exclusive lock.
Autovacuum -- daemon that automatically runs VACUUM and ANALYZE based on thresholds defined by autovacuum_vacuum_scale_factor and autovacuum_vacuum_threshold. Enabled by default.

Bloat from Long-Running Transactions

Long-running transactions (including idle-in-transaction sessions) prevent autovacuum from reclaiming dead tuples that are newer than the oldest open transaction's snapshot. This causes table and index bloat.

Transaction ID Wraparound¶

PostgreSQL uses a 32-bit transaction ID counter. Without intervention, it would wrap around after approximately 2 billion transactions, causing data corruption. To prevent this:

Freeze operations mark old tuples as "always visible" by replacing their xmin with a special FrozenXID.
Autovacuum triggers freeze operations when autovacuum_freeze_max_age (default 200 million transactions) is approached.
The system forces an autovacuum to prevent wraparound even if autovacuum is otherwise disabled.

Query Processing Pipeline¶

Each backend processes queries through the following stages:

Parser -- converts SQL text into a parse tree.
Analyzer/Semantic Analysis -- resolves table and column references, performs type checking. Produces a query tree.
Rewriter -- applies rules (e.g., views are expanded, RLS policies are attached).
Planner/Optimizer -- generates execution plans. Uses a cost-based optimizer that considers:
- Sequential scans vs. index scans vs. index-only scans.
- Join methods: nested loop, hash join, merge join.
- Join ordering via dynamic programming or GEQO (for many-table joins).
- Parallel query paths (parallel sequential scan, parallel hash join).
Executor -- runs the plan using a tuple-oriented pipeline (each node pulls tuples from its children).

Logical Replication¶

PostgreSQL supports two native replication mechanisms:

Streaming (Physical) Replication¶

Ships WAL byte streams to standby servers.
Standbys are byte-for-byte copies of the primary.
Supports synchronous and asynchronous modes.
Standbys can serve read-only queries (hot standby).

Logical Replication¶

Decodes WAL records into logical changes (INSERT, UPDATE, DELETE) using a logical decoding output plugin (e.g., pgoutput).
Publications define which tables to replicate; subscriptions consume them.
Uses replication slots to track the consumer's WAL position, ensuring the primary retains WAL until all subscribers have consumed it.
Allows selective table replication and cross-version replication.

graph LR
    PUB["Primary<br/>(Publisher)"]
    WALD["Logical Decoder<br/>(pgoutput plugin)"]
    SLOT["Replication Slot<br/>(tracks LSN)"]
    SUB1["Subscriber 1"]
    SUB2["Subscriber 2"]

    PUB --> WALD --> SLOT
    SLOT -->|"pg_replication slot"| SUB1
    SLOT -->|"pg_replication slot"| SUB2

    style PUB fill:#e8f5e9,stroke:#4caf50,color:#000
    style SLOT fill:#fff3e0,stroke:#ff9800,color:#000

Replication Slot Dangers

If a subscriber is offline for an extended period, its replication slot prevents WAL recycling on the primary. Monitor pg_replication_slots and set max_slot_wal_keep_size to prevent disk exhaustion.

Extension API¶

PostgreSQL's extensibility model allows adding new types, functions, operators, index access methods, and procedural languages without modifying core code:

Extensions -- packaged via CREATE EXTENSION (e.g., pgcrypto, PostGIS, pg_stat_statements).
Procedural Languages -- PL/pgSQL (built-in), PL/Python, PL/Perl, PL/v8 (JavaScript).
Custom Scan Providers -- allow extensions to replace or augment query execution plans (used by Citus for distributed queries).
Foreign Data Wrappers (FDW) -- access external data sources as local tables (e.g., postgres_fdw, file_fdw).

Key Configuration Parameters¶

Parameter	Default	Purpose
`shared_buffers`	128 MiB	Buffer pool size. Set to ~25% of system RAM.
`work_mem`	4 MiB	Per-sort/hash memory before spilling to disk.
`maintenance_work_mem`	64 MiB	Memory for VACUUM, CREATE INDEX, ALTER TABLE.
`effective_cache_size`	4 GiB	Planner hint for total OS + PG cache. Set to ~75% of RAM.
`max_connections`	100	Maximum concurrent backend processes.
`wal_level`	`replica`	WAL detail level for replication support.
`max_wal_size`	1 GiB	Checkpoint trigger threshold.
`autovacuum`	ON	Enable automatic VACUUM and ANALYZE.

Sources¶

How It Works¶

Process model, shared memory, MVCC, WAL, query execution, and v18 async I/O.

Process Architecture¶

flowchart TB
    subgraph Postmaster["Postmaster (main process)"]
        PM["Process Manager\n(forks backends)"]
    end

    subgraph Backends["Backend Processes"]
        B1["Backend 1\n(client connection)"]
        B2["Backend 2"]
        BN["Backend N"]
    end

    subgraph Background["Background Processes"]
        WAL_W["WAL Writer"]
        Checkpointer["Checkpointer"]
        Autovac["Autovacuum"]
        BGWriter["Background Writer"]
        StatsCol["Stats Collector"]
    end

    subgraph SharedMem["Shared Memory"]
        SharedBuf["Shared Buffers\n(page cache)"]
        WAL_Buf["WAL Buffers"]
        CLOG["CLOG\n(transaction status)"]
    end

    PM --> Backends
    PM --> Background
    Backends --> SharedMem
    Background --> SharedMem

    style SharedMem fill:#1565c0,color:#fff

MVCC (Multi-Version Concurrency Control)¶

Each row has hidden xmin (created by) and xmax (deleted by) transaction IDs. Readers never block writers.

sequenceDiagram
    participant TX1 as Transaction 1 (xid=100)
    participant Heap as Table Heap
    participant TX2 as Transaction 2 (xid=101)

    TX1->>Heap: UPDATE row → creates new version (xmin=100)
    Note over Heap: Old version: xmax=100<br/>New version: xmin=100, xmax=∞
    TX2->>Heap: SELECT → sees old version (100 not committed yet)
    TX1->>TX1: COMMIT
    TX2->>Heap: SELECT → now sees new version

Async I/O (v18)¶

flowchart LR
    subgraph Old["v17 (Synchronous I/O)"]
        Req1_O["Read page 1"] --> Wait1["⏳ Wait"] --> Req2_O["Read page 2"] --> Wait2["⏳ Wait"]
    end

    subgraph New["v18 (Async I/O)"]
        Req1_N["Read page 1"]
        Req2_N["Read page 2"]
        Req3_N["Read page 3"]
        Req1_N --> Batch["io_uring batch\n(all in parallel)"]
        Req2_N --> Batch
        Req3_N --> Batch
        Batch --> Done["All pages ready\n(2-3× faster)"]
    end

    style Old fill:#c62828,color:#fff
    style New fill:#2e7d32,color:#fff

Sources¶

Benchmarks¶

Scope

PostgreSQL performance metrics, pgbench results, scaling characteristics, and comparison baselines.

pgbench Results (Standard Benchmark)¶

TPC-B Like Workload¶

Hardware	Clients	TPS (read-write)	TPS (read-only)	Latency P99
4 vCPU, 16Gi, SSD	16	2,500-4,000	15,000-25,000	5-10ms
8 vCPU, 32Gi, NVMe	32	8,000-15,000	50,000-80,000	2-5ms
16 vCPU, 64Gi, NVMe	64	20,000-35,000	100,000-150,000	1-3ms
32 vCPU, 128Gi, NVMe	128	40,000-60,000	200,000-300,000	1-2ms

Index Performance¶

Operation	B-tree	Hash	GIN	GiST	BRIN
Point lookup	~0.1ms	~0.05ms	N/A	~0.5ms	~1ms
Range scan	~1ms	N/A	~5ms	~2ms	~0.5ms
Insert overhead	Low	Low	High	Medium	Very low
Storage per row	8-16 bytes	4-8 bytes	Variable	Variable	~1 byte

Connection Scaling¶

Connection Count	Without PgBouncer	With PgBouncer	Notes
50	100% baseline	100%	No difference
200	85-90%	98%	PgBouncer multiplexes
500	60-70%	95%	Memory pressure without pooler
1,000	30-40%	90%	Context switching kills perf
5,000	Fails (OOM)	85%	Must use connection pooling

WAL Write Performance¶

Storage Type	WAL Write Throughput	fsync Latency	Notes
HDD	50-100 MB/s	5-20ms	Not recommended for production
SATA SSD	200-500 MB/s	0.5-2ms	Acceptable for small deployments
NVMe SSD	1-3 GB/s	0.05-0.2ms	Recommended for production

Replication Performance¶

Scenario	Replication Lag	Throughput	Notes
Async, same DC	< 10ms	Near line-rate	Default config
Async, cross-DC	10-100ms	Network limited	WAN bandwidth matters
Sync, same DC	< 1ms	70-80% of standalone	Commit must wait for replica
Sync, cross-DC	10-50ms	20-40% of standalone	Not recommended for write-heavy

Scaling Limits¶

Dimension	Soft Limit	Hard Limit	Notes
Database size	10TB (comfortable)	100TB+	Needs partitioning beyond 10TB
Table size	1TB	32TB	Partition large tables
Rows per table	1 billion	No hard limit	Performance degrades with bloat
Columns per table	250	1,600	TOAST for wide rows
Indexes per table	20	No limit	Each index adds write overhead
Concurrent connections	200-500	10,000+ (with pooler)	Use PgBouncer

Sourcing Status¶

Unsourced Performance Data

The performance numbers in this document are estimated from vendor documentation, community benchmarks, and engineering judgment. They do not represent controlled benchmarks with documented test conditions. Specific hardware configurations, software versions, and test methodologies were not recorded.

Use these figures as rough guidance only. For production capacity planning, run your own benchmarks against your specific workload and infrastructure.