Databases¶

Relational database engines covering distributed SQL, traditional RDBMS, replication topologies, and modern capabilities including JSON, full-text search, and AI-integrated analytics.

← Knowledge Base

Topics¶

Database	Description
CockroachDB	Distributed SQL with global ACID consistency via Raft consensus — PostgreSQL-compatible wire protocol, horizontal scale-out, multi-region survivability.
MySQL	World's most deployed open-source database with dual release tracks (LTS 8.4 / Innovation 9.x) and commercial HeatWave AI/analytics integration.
PostgreSQL	World's most advanced open-source RDBMS — ACID compliance, rich extensibility (JSONB, PostGIS, FDW), and landmark async I/O in v18.

Comparisons¶

Comparison	Scope
Database Comparison	CockroachDB vs MySQL vs PostgreSQL — consistency models, replication, scalability, and operational trade-offs

Landscape¶

The database landscape is stratified between mature single-node RDBMS engines (PostgreSQL, MySQL) and a new generation of distributed SQL databases (CockroachDB, TiDB, YugabyteDB) that offer horizontal scaling while maintaining SQL semantics and ACID guarantees. Cloud providers have further blurred the lines with proprietary NewSQL offerings — Aurora, AlloyDB, PolarDB — that decouple storage from compute to achieve elastic scaling without abandoning PostgreSQL or MySQL wire compatibility.

The "PostgreSQL everywhere" trend is accelerating as PG becomes the default wire protocol even for non-relational workloads: Timescale for time-series, pgvector for embeddings, and Hydra/Citus for analytics. Connection management has emerged as a critical operational concern at scale, with PgBouncer, PgCat, and Supavisor competing to solve the connection pooling problem for serverless and multi-tenant architectures.

Embedded Analytics Shift

The rise of embedded analytical databases like DuckDB is enabling a new pattern where OLAP queries run in-process alongside application code, challenging the traditional separation of transactional and analytical workloads. DuckDB can query Parquet files directly from S3, making it possible to run analytical queries without a dedicated data warehouse in many scenarios.

Meanwhile, the AI/ML wave is pulling databases into the vector search domain — PostgreSQL with pgvector, MySQL with vector type support in 9.x, and CockroachDB's planned vector index all aim to serve RAG and similarity search workloads without requiring a separate vector database.

Key Concepts¶

ACID vs BASE¶

ACID (Atomicity, Consistency, Isolation, Durability) guarantees that every transaction is all-or-nothing, moves the database between valid states, isolates concurrent transactions, and persists committed data. BASE (Basically Available, Soft state, Eventually consistent) is the relaxed model used by many NoSQL and AP-leaning distributed systems, trading strict consistency for availability and partition tolerance.

Distributed SQL databases like CockroachDB attempt to provide ACID across nodes using consensus protocols (Raft), though this introduces latency proportional to inter-node round trips. The practical impact: a CockroachDB transaction spanning three regions incurs consensus latency of 2x the longest inter-region RTT, while a single-region transaction sees sub-millisecond consensus overhead.

CAP Theorem in Practice¶

Brewer's CAP theorem states that a distributed system can guarantee at most two of Consistency, Availability, and Partition tolerance simultaneously. In practice, the choice is between CP systems (CockroachDB, Spanner — consistent but may reject writes during partitions) and AP systems (Cassandra, DynamoDB — available but may serve stale reads).

Modern databases often allow per-query tuning of the consistency-availability trade-off:

CockroachDB supports AS OF SYSTEM TIME for bounded-staleness reads that can be served from the nearest replica
Cassandra allows per-query consistency levels (ONE, QUORUM, ALL)
DynamoDB offers both eventually consistent and strongly consistent read options

Replication Topologies¶

Common Topologies

Single-leader (primary-replica): All writes go to one node; replicas serve reads. Simple but creates a write bottleneck. Used by PostgreSQL streaming replication and MySQL Group Replication in single-primary mode.
Multi-leader: Multiple nodes accept writes, with conflict resolution required. Used by MySQL Group Replication in multi-primary mode and CockroachDB's leaseholder model.
Leaderless (quorum): Any node can serve reads/writes with quorum agreement. Used by Cassandra and DynamoDB.

Connection Pooling¶

A middleware layer that maintains a pool of persistent database connections, multiplexing many short-lived application connections onto fewer backend connections. Essential for Kubernetes workloads where pod churn creates connection storms that can overwhelm the database's max_connections limit.

PgBouncer operates in three modes:

Session mode: Holds a backend connection for the entire client session (safest, least efficient)
Transaction mode: Releases the backend connection after each transaction (best balance for most workloads)
Statement mode: Releases after each statement (most efficient but breaks multi-statement transactions)

Newer poolers like PgCat add query-level load balancing across read replicas, and Supavisor (Elixir-based) supports multi-tenant pooling with per-tenant connection limits.

MVCC (Multi-Version Concurrency Control)¶

A concurrency control method where the database maintains multiple versions of each row, allowing readers to see a consistent snapshot without blocking writers. PostgreSQL implements MVCC by storing old row versions in the heap (requiring VACUUM to reclaim dead tuples), while MySQL/InnoDB uses an undo log with rollback segments that are cleaned by a purge thread.

CockroachDB extends MVCC across distributed ranges with hybrid-logical clock (HLC) timestamps for global ordering, ensuring that a read at timestamp T always sees a consistent snapshot across all nodes — even when data is spread across multiple regions.

Open Questions¶

As pgvector and DiskANN indexes mature in PostgreSQL, does it eliminate the need for dedicated vector databases (Pinecone, Weaviate) for most retrieval-augmented generation (RAG) workloads?
What are the real-world latency and consistency trade-offs when running CockroachDB across multiple cloud regions with survival goals set to "region" versus "zone"?
How will the convergence of OLTP and OLAP in single engines (HTAP) — as attempted by TiDB's TiFlash and AlloyDB's columnar engine — affect the traditional ETL pipeline to a data warehouse?