Skip to content

Architecture

PostgreSQL uses a process-per-connection model supervised by a postmaster daemon. Shared memory structures (buffer pool, WAL buffer, lock tables) are accessible to all backend processes, while each backend handles query parsing, planning, and execution independently. This architecture provides strong isolation between client sessions and relies on the Write-Ahead Log (WAL) protocol for crash recovery.

See also: index, architecture, operations, security


Process Architecture

When PostgreSQL starts, the postmaster daemon initialises shared memory and spawns several background processes. For each incoming client connection, the postmaster forks a dedicated backend process.

graph TB
    POSTMASTER["postmaster<br/>(supervisor daemon)"]

    subgraph "Background Processes"
        BGWRITER["Background Writer<br/>(bgwriter)"]
        CHECKPT["Checkpointer"]
        WALWRITER["WAL Writer<br/>(walwriter)"]
        AUTOVAC["Autovacuum Launcher"]
        AUTOWORK["Autovacuum Workers<br/>(dynamic)"]
        STATS["Statistics Collector<br/>(or shared memory stats in PG 15+)"]
        LOGICAL["Logical Replication<br/>Worker (if configured)"]
    end

    subgraph "Client Backends"
        BE1["Backend Process<br/>(session 1)"]
        BE2["Backend Process<br/>(session 2)"]
        BEN["Backend Process<br/>(session N)"]
    end

    subgraph "Shared Memory"
        SHARED["shared_buffers<br/>WAL Buffer<br/>Lock Tables<br/>ProcArray / CLog"]
    end

    POSTMASTER --> BGWRITER
    POSTMASTER --> CHECKPT
    POSTMASTER --> WALWRITER
    POSTMASTER --> AUTOVAC
    AUTOVAC --> AUTOWORK
    POSTMASTER --> STATS
    POSTMASTER --> LOGICAL
    POSTMASTER --> BE1
    POSTMASTER --> BE2
    POSTMASTER --> BEN

    BE1 --- SHARED
    BE2 --- SHARED
    BEN --- SHARED
    BGWRITER --- SHARED
    CHECKPT --- SHARED
    WALWRITER --- SHARED

    style POSTMASTER fill:#e8f4f8,stroke:#2196f3,color:#000
    style SHARED fill:#fff3e0,stroke:#ff9800,color:#000
    style BGWRITER fill:#e8f5e9,stroke:#4caf50,color:#000

Background Processes

Process Role
Background Writer (bgwriter) Periodically writes dirty pages from shared_buffers to the OS page cache. Its goal is to reduce the I/O spike during checkpoints by spreading writes over time. Tuned via bgwriter_delay, bgwriter_lru_maxpages.
Checkpointer Forces all dirty pages in shared_buffers to disk (via fsync) and records a checkpoint record in the WAL. This allows WAL segments older than the checkpoint to be recycled or archived. Triggered by checkpoint_timeout (default 5 min) or max_wal_size being reached.
WAL Writer (walwriter) Flushes WAL records from the in-memory WAL buffer to WAL segment files on disk. Runs on a cycle controlled by wal_writer_delay (default 200 ms). Ensures recent WAL records are durable even for transactions that have not yet committed.
Autovacuum Launcher Monitors table statistics and spawns autovacuum worker processes for tables that have exceeded their dead-tuple threshold. Essential for MVCC garbage collection.
Statistics Collector Collects runtime statistics (table access counts, dead tuple counts, etc.). In PostgreSQL 15+, replaced by an in-memory shared statistics area eliminating the separate collector process.

Shared Memory Structures

PostgreSQL allocates shared memory at startup. The most important regions are:

graph LR
    subgraph "Shared Memory"
        SB["shared_buffers<br/>(8 KiB page frames)"]
        WALB["WAL Buffer<br/>(circular)"]
        LT["Lock Tables"]
        PA["ProcArray<br/>(active txn XIDs)"]
        CL["CLOG / CommitLog<br/>(txn status bits)"]
        MT["MultiXact<br/>(shared row locks)"]
    end

    style SB fill:#e3f2fd,stroke:#2196f3,color:#000
    style WALB fill:#fce4ec,stroke:#e91e63,color:#000
Region Purpose
shared_buffers Main buffer pool. Default 128 MiB; production systems typically set this to 25% of system RAM. Pages are 8 KiB. Uses a clock-sweep algorithm for page eviction.
WAL Buffer Circular buffer for WAL records before they are written to disk. Default 64 KiB (wal_buffers). Autotuned to 1/32 of shared_buffers if left at default.
Lock Tables Shared lock manager tracking relation-level, page-level, tuple-level, and advisory locks held by all backends.
ProcArray Array of all active backend processes and their current transaction IDs. Used to compute snapshot visibility (which XIDs are visible to a given transaction).
CLOG (Commit Log) Stores the commit status of each transaction ID (in-progress, committed, aborted). Stored as bitmaps in shared memory and materialised to disk in pg_xact/.
MultiXact Tracks sets of transaction IDs that hold locks on the same row. Used for SELECT ... FOR SHARE and similar operations.

WAL (Write-Ahead Log)

The WAL protocol is the foundation of PostgreSQL's crash recovery. The core rule: dirty data pages are never written to disk before their corresponding WAL records have been flushed.

WAL Record Lifecycle

sequenceDiagram
    participant BE as Backend Process
    participant WB as WAL Buffer
    participant WF as WAL Segment Files<br/>(pg_wal/)
    participant CK as Checkpointer
    participant DF as Data Files<br/>(table extents)

    BE->>BE: Modify tuple in shared_buffers
    BE->>WB: XLogInsertRecord (WAL entry)
    Note over BE,WB: WAL record appended to WAL buffer
    BE->>WF: XLogFlush at COMMIT
    Note over BE,WF: Ensures durability before ack to client
    CK->>DF: Write all dirty pages to disk (fsync)
    CK->>WF: Write checkpoint record
    Note over CK,WF: Old WAL segments before checkpoint<br/>can be recycled or archived

WAL Configuration

Parameter Default Purpose
wal_level replica Controls amount of WAL data. minimal for crash recovery only; replica for streaming replication; logical for logical decoding.
max_wal_size 1 GiB Maximum WAL size before a checkpoint is triggered.
min_wal_size 80 MiB Minimum WAL size to keep for recycling.
checkpoint_timeout 5 min Maximum time between automatic checkpoints.
wal_buffers -1 (auto) Size of WAL buffer in shared memory. Auto = 1/32 of shared_buffers.
wal_compression off Compress full-page images in WAL. lz4 or zstd reduce WAL volume significantly.

Full-Page Writes (FPW)

After a checkpoint, the first modification to any data page triggers a full-page image (FPI) to be written to the WAL. This protects against torn page writes: if a partial 8 KiB write occurs, the complete page can be reconstructed from the FPI in the WAL.


MVCC and VACUUM

PostgreSQL implements Multi-Version Concurrency Control (MVCC) using tuple header fields, not undo logs. Each row version (tuple) has:

  • xmin -- XID of the transaction that inserted this tuple.
  • xmax -- XID of the transaction that deleted or updated this tuple (0 if still valid).
  • infomask -- status bits (committed, aborted, locked, etc.).

Visibility Rules

When a backend reads a table, it computes a snapshot of which transaction IDs are visible:

  1. Tuples with xmin committed and xmax either 0 or not-yet-committed are visible.
  2. Tuples with xmin not-yet-committed are invisible (still in-progress from another session).
  3. Tuples with xmax committed are invisible (deleted or replaced by an update).

Dead Tuples and VACUUM

When rows are updated or deleted, the old tuple versions become dead tuples. They remain on disk until VACUUM reclaims the space.

  • VACUUM -- marks dead tuple space as available for reuse. Does not shrink the file or return space to the OS (unless the dead tuples are at the end of the file).
  • VACUUM FULL -- rewrites the entire table, compacting it and returning space to the OS. Requires an exclusive lock.
  • Autovacuum -- daemon that automatically runs VACUUM and ANALYZE based on thresholds defined by autovacuum_vacuum_scale_factor and autovacuum_vacuum_threshold. Enabled by default.

Bloat from Long-Running Transactions

Long-running transactions (including idle-in-transaction sessions) prevent autovacuum from reclaiming dead tuples that are newer than the oldest open transaction's snapshot. This causes table and index bloat.

Transaction ID Wraparound

PostgreSQL uses a 32-bit transaction ID counter. Without intervention, it would wrap around after approximately 2 billion transactions, causing data corruption. To prevent this:

  • Freeze operations mark old tuples as "always visible" by replacing their xmin with a special FrozenXID.
  • Autovacuum triggers freeze operations when autovacuum_freeze_max_age (default 200 million transactions) is approached.
  • The system forces an autovacuum to prevent wraparound even if autovacuum is otherwise disabled.

Query Processing Pipeline

Each backend processes queries through the following stages:

  1. Parser -- converts SQL text into a parse tree.
  2. Analyzer/Semantic Analysis -- resolves table and column references, performs type checking. Produces a query tree.
  3. Rewriter -- applies rules (e.g., views are expanded, RLS policies are attached).
  4. Planner/Optimizer -- generates execution plans. Uses a cost-based optimizer that considers:
    • Sequential scans vs. index scans vs. index-only scans.
    • Join methods: nested loop, hash join, merge join.
    • Join ordering via dynamic programming or GEQO (for many-table joins).
    • Parallel query paths (parallel sequential scan, parallel hash join).
  5. Executor -- runs the plan using a tuple-oriented pipeline (each node pulls tuples from its children).

Logical Replication

PostgreSQL supports two native replication mechanisms:

Streaming (Physical) Replication

  • Ships WAL byte streams to standby servers.
  • Standbys are byte-for-byte copies of the primary.
  • Supports synchronous and asynchronous modes.
  • Standbys can serve read-only queries (hot standby).

Logical Replication

  • Decodes WAL records into logical changes (INSERT, UPDATE, DELETE) using a logical decoding output plugin (e.g., pgoutput).
  • Publications define which tables to replicate; subscriptions consume them.
  • Uses replication slots to track the consumer's WAL position, ensuring the primary retains WAL until all subscribers have consumed it.
  • Allows selective table replication and cross-version replication.
graph LR
    PUB["Primary<br/>(Publisher)"]
    WALD["Logical Decoder<br/>(pgoutput plugin)"]
    SLOT["Replication Slot<br/>(tracks LSN)"]
    SUB1["Subscriber 1"]
    SUB2["Subscriber 2"]

    PUB --> WALD --> SLOT
    SLOT -->|"pg_replication slot"| SUB1
    SLOT -->|"pg_replication slot"| SUB2

    style PUB fill:#e8f5e9,stroke:#4caf50,color:#000
    style SLOT fill:#fff3e0,stroke:#ff9800,color:#000

Replication Slot Dangers

If a subscriber is offline for an extended period, its replication slot prevents WAL recycling on the primary. Monitor pg_replication_slots and set max_slot_wal_keep_size to prevent disk exhaustion.


Extension API

PostgreSQL's extensibility model allows adding new types, functions, operators, index access methods, and procedural languages without modifying core code:

  • Extensions -- packaged via CREATE EXTENSION (e.g., pgcrypto, PostGIS, pg_stat_statements).
  • Procedural Languages -- PL/pgSQL (built-in), PL/Python, PL/Perl, PL/v8 (JavaScript).
  • Custom Scan Providers -- allow extensions to replace or augment query execution plans (used by Citus for distributed queries).
  • Foreign Data Wrappers (FDW) -- access external data sources as local tables (e.g., postgres_fdw, file_fdw).

Key Configuration Parameters

Parameter Default Purpose
shared_buffers 128 MiB Buffer pool size. Set to ~25% of system RAM.
work_mem 4 MiB Per-sort/hash memory before spilling to disk.
maintenance_work_mem 64 MiB Memory for VACUUM, CREATE INDEX, ALTER TABLE.
effective_cache_size 4 GiB Planner hint for total OS + PG cache. Set to ~75% of RAM.
max_connections 100 Maximum concurrent backend processes.
wal_level replica WAL detail level for replication support.
max_wal_size 1 GiB Checkpoint trigger threshold.
autovacuum ON Enable automatic VACUUM and ANALYZE.

Sources


How It Works

Process model, shared memory, MVCC, WAL, query execution, and v18 async I/O.

Process Architecture

flowchart TB
    subgraph Postmaster["Postmaster (main process)"]
        PM["Process Manager\n(forks backends)"]
    end

    subgraph Backends["Backend Processes"]
        B1["Backend 1\n(client connection)"]
        B2["Backend 2"]
        BN["Backend N"]
    end

    subgraph Background["Background Processes"]
        WAL_W["WAL Writer"]
        Checkpointer["Checkpointer"]
        Autovac["Autovacuum"]
        BGWriter["Background Writer"]
        StatsCol["Stats Collector"]
    end

    subgraph SharedMem["Shared Memory"]
        SharedBuf["Shared Buffers\n(page cache)"]
        WAL_Buf["WAL Buffers"]
        CLOG["CLOG\n(transaction status)"]
    end

    PM --> Backends
    PM --> Background
    Backends --> SharedMem
    Background --> SharedMem

    style SharedMem fill:#1565c0,color:#fff

MVCC (Multi-Version Concurrency Control)

Each row has hidden xmin (created by) and xmax (deleted by) transaction IDs. Readers never block writers.

sequenceDiagram
    participant TX1 as Transaction 1 (xid=100)
    participant Heap as Table Heap
    participant TX2 as Transaction 2 (xid=101)

    TX1->>Heap: UPDATE row → creates new version (xmin=100)
    Note over Heap: Old version: xmax=100<br/>New version: xmin=100, xmax=∞
    TX2->>Heap: SELECT → sees old version (100 not committed yet)
    TX1->>TX1: COMMIT
    TX2->>Heap: SELECT → now sees new version

Async I/O (v18)

flowchart LR
    subgraph Old["v17 (Synchronous I/O)"]
        Req1_O["Read page 1"] --> Wait1["⏳ Wait"] --> Req2_O["Read page 2"] --> Wait2["⏳ Wait"]
    end

    subgraph New["v18 (Async I/O)"]
        Req1_N["Read page 1"]
        Req2_N["Read page 2"]
        Req3_N["Read page 3"]
        Req1_N --> Batch["io_uring batch\n(all in parallel)"]
        Req2_N --> Batch
        Req3_N --> Batch
        Batch --> Done["All pages ready\n(2-3× faster)"]
    end

    style Old fill:#c62828,color:#fff
    style New fill:#2e7d32,color:#fff

Sources


Benchmarks

Scope

PostgreSQL performance metrics, pgbench results, scaling characteristics, and comparison baselines.

pgbench Results (Standard Benchmark)

TPC-B Like Workload

Hardware Clients TPS (read-write) TPS (read-only) Latency P99
4 vCPU, 16Gi, SSD 16 2,500-4,000 15,000-25,000 5-10ms
8 vCPU, 32Gi, NVMe 32 8,000-15,000 50,000-80,000 2-5ms
16 vCPU, 64Gi, NVMe 64 20,000-35,000 100,000-150,000 1-3ms
32 vCPU, 128Gi, NVMe 128 40,000-60,000 200,000-300,000 1-2ms

Index Performance

Operation B-tree Hash GIN GiST BRIN
Point lookup ~0.1ms ~0.05ms N/A ~0.5ms ~1ms
Range scan ~1ms N/A ~5ms ~2ms ~0.5ms
Insert overhead Low Low High Medium Very low
Storage per row 8-16 bytes 4-8 bytes Variable Variable ~1 byte

Connection Scaling

Connection Count Without PgBouncer With PgBouncer Notes
50 100% baseline 100% No difference
200 85-90% 98% PgBouncer multiplexes
500 60-70% 95% Memory pressure without pooler
1,000 30-40% 90% Context switching kills perf
5,000 Fails (OOM) 85% Must use connection pooling

WAL Write Performance

Storage Type WAL Write Throughput fsync Latency Notes
HDD 50-100 MB/s 5-20ms Not recommended for production
SATA SSD 200-500 MB/s 0.5-2ms Acceptable for small deployments
NVMe SSD 1-3 GB/s 0.05-0.2ms Recommended for production

Replication Performance

Scenario Replication Lag Throughput Notes
Async, same DC < 10ms Near line-rate Default config
Async, cross-DC 10-100ms Network limited WAN bandwidth matters
Sync, same DC < 1ms 70-80% of standalone Commit must wait for replica
Sync, cross-DC 10-50ms 20-40% of standalone Not recommended for write-heavy

Scaling Limits

Dimension Soft Limit Hard Limit Notes
Database size 10TB (comfortable) 100TB+ Needs partitioning beyond 10TB
Table size 1TB 32TB Partition large tables
Rows per table 1 billion No hard limit Performance degrades with bloat
Columns per table 250 1,600 TOAST for wide rows
Indexes per table 20 No limit Each index adds write overhead
Concurrent connections 200-500 10,000+ (with pooler) Use PgBouncer

Sourcing Status

Unsourced Performance Data

The performance numbers in this document are estimated from vendor documentation, community benchmarks, and engineering judgment. They do not represent controlled benchmarks with documented test conditions. Specific hardware configurations, software versions, and test methodologies were not recorded.

Use these figures as rough guidance only. For production capacity planning, run your own benchmarks against your specific workload and infrastructure.

Sources