Observability 2.0

A paradigm shift from the "three pillars" model (metrics, logs, traces as separate silos) to a unified, wide-event-centric approach to understanding distributed systems. Coined and popularized by Charity Majors (Honeycomb CTO), the term describes both a philosophical and architectural change in how telemetry data is collected, stored, and queried.

Summary

Observability 1.0 treats metrics, logs, and traces as independent signal types requiring different storage backends, query languages, and dashboards. Engineers "bunny-hop" between tools to correlate data during incidents. This model was adequate for monoliths and simple distributed systems but breaks down at modern microservice scale.

Observability 2.0 replaces this with a single source of truth: wide structured events — one context-rich record per request per service hop, containing every dimension needed for debugging. From this raw event stream, metrics, traces, and log views are derived at query time rather than pre-aggregated at write time.
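
To make "derived at query time" concrete, here is a minimal sketch that computes request count, error rate, and p99 latency directly from raw events. The event fields (`service`, `status`, `duration_ms`) and the helper names are assumptions chosen for illustration; a real backend would run the equivalent aggregation inside its query engine.

```python
from collections import defaultdict
from math import ceil

# Hypothetical raw wide events; field names are illustrative only.
EVENTS = [
    {"service": "checkout", "status": 200, "duration_ms": 42},
    {"service": "checkout", "status": 500, "duration_ms": 910},
    {"service": "checkout", "status": 200, "duration_ms": 55},
    {"service": "search", "status": 200, "duration_ms": 18},
]


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def derive_metrics(events, group_by="service"):
    """Aggregate raw events into per-group count, error rate, and p99."""
    groups = defaultdict(list)
    for event in events:
        groups[event[group_by]].append(event)

    metrics = {}
    for key, rows in groups.items():
        errors = sum(1 for row in rows if row["status"] >= 500)
        metrics[key] = {
            "requests": len(rows),
            "error_rate": errors / len(rows),
            "p99_ms": percentile([row["duration_ms"] for row in rows], 99),
        }
    return metrics


metrics = derive_metrics(EVENTS)
```

Because nothing was pre-aggregated, the same events can be regrouped by any other field later without changing the instrumentation.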

The Core Insight

Instead of logging what your code is doing, log what happened to this request. Stop thinking about logs as a debugging diary. Start thinking about them as a structured record of business events.

Observability 1.0 vs 2.0

Dimension	Observability 1.0	Observability 2.0
Data model	Three pillars: metrics, logs, traces	Unified wide events
Storage	Separate backends per signal (Prometheus, Loki, Tempo)	Single unified data store
Schema	Low cardinality, pre-defined labels	High cardinality, high dimensionality, dynamic fields
Aggregation	Pre-aggregated at write time (recording rules)	Derived at query time from raw events
Debugging	Known unknowns — dashboards for anticipated failures	Unknown unknowns — ad-hoc exploratory queries
Context	Scattered across log lines, requires correlation	Complete per-request context in one record
Scaling	Metric cardinality explosions, log volume cost	Event volume scales with traffic; tail sampling controls cost
Backward compat	N/A	Must still power Grafana dashboards, PromQL, trace views

Key Concepts

Wide Events (Canonical Log Lines)

A single, context-rich structured event emitted once per request per service hop, containing 30-100+ fields. Popularized by Stripe as "canonical log lines." Unlike traditional logging (many small log lines per request), a wide event captures the full picture:

Request context: method, path, status, duration
Infrastructure context: service, version, deployment ID, region
Business context: user ID, subscription tier, account age, lifetime value
Operation details: payment provider, latency, attempt count
Error details: type, code, message, retriability
Feature flags: which experiments are active
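
The field groups above might be assembled as in the following sketch, which builds one event over the lifetime of a request and emits it once at the end. Everything here is hypothetical: `handle_checkout`, the shape of `request`, the `charge` callable, and the field names and values are illustrations, not an API from any of the tools discussed.

```python
import json
import time


def handle_checkout(request, charge):
    """Handle one request and emit exactly one wide event for it.

    `request` is a plain dict and `charge` is a callable; both are
    hypothetical stand-ins for a real framework and payment client.
    """
    event = {
        # Request and infrastructure context, known up front.
        "http.method": request["method"],
        "http.path": request["path"],
        "service": "checkout",
        "service.version": "1.4.2",  # illustrative value
        "region": "eu-west-1",  # illustrative value
        # Business context.
        "user.id": request["user_id"],
        "user.tier": request["tier"],
        "feature_flags": request.get("flags", []),
    }
    start = time.monotonic()
    try:
        result = charge(request)
        # Operation details are added as they become known.
        event["payment.provider"] = result["provider"]
        event["payment.attempts"] = result["attempts"]
        event["http.status"] = 200
    except Exception as exc:
        # Error details go on the same event instead of a separate log line.
        event["http.status"] = 500
        event["error.type"] = type(exc).__name__
        event["error.message"] = str(exc)
    finally:
        event["duration_ms"] = round((time.monotonic() - start) * 1000, 3)
        print(json.dumps(event))  # one line per request per service hop
    return event
```

The sketch returns the event only so it can be inspected; a real handler would return its response and hand the event to whatever exporter is in use.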

Structured logging is necessary but not sufficient

Structured logging means your logs are JSON instead of strings — that's table stakes. Wide events are a philosophy: one comprehensive event per request, with all context attached. You can have structured logs that are still useless (5 fields, no user context, scattered across 20 log lines).

High Cardinality

The number of unique values a field can have. user_id has high cardinality (millions of unique values). http_method has low cardinality (GET, POST, PUT, DELETE). High-cardinality fields are what make observability data useful for debugging: they let you drill down to individual users, requests, or deployments. Traditional metrics systems such as Prometheus struggle with high cardinality because every unique label combination becomes its own time series; Observability 2.0 (O11y 2.0) databases are designed to handle it.
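
A small sketch of how cardinality can be measured on a batch of events, and why it matters for label-based metrics: in a system where each distinct label combination is a separate time series, adding a high-cardinality field as a label multiplies the series count. The sample events and function names are assumptions for illustration.

```python
# Hypothetical events carrying one low-cardinality and one high-cardinality field.
EVENTS = [
    {"http_method": "GET", "user_id": "u1"},
    {"http_method": "GET", "user_id": "u2"},
    {"http_method": "POST", "user_id": "u2"},
    {"http_method": "GET", "user_id": "u3"},
    {"http_method": "POST", "user_id": "u4"},
    {"http_method": "GET", "user_id": "u1"},
]


def cardinality(events, field):
    """Number of distinct values observed for `field`."""
    return len({event[field] for event in events if field in event})


def label_combinations(events, fields):
    """Distinct value tuples across `fields`.

    In a label-based metrics store this is the number of time series
    a single metric would need if `fields` were used as labels.
    """
    return len({tuple(event.get(f) for f in fields) for event in events})


method_cardinality = cardinality(EVENTS, "http_method")
user_cardinality = cardinality(EVENTS, "user_id")
series_by_method = label_combinations(EVENTS, ["http_method"])
series_by_method_and_user = label_combinations(EVENTS, ["http_method", "user_id"])
```

With millions of users instead of four, the last number is what grows without bound, which is the "cardinality explosion" referred to in the comparison table.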

High Dimensionality

The number of fields in an event. A log with 5 fields has low dimensionality; a wide event with 50+ fields has high dimensionality. More dimensions mean more questions you can answer without re-instrumenting.
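
As a rough illustration of answering new questions without re-instrumenting, the sketch below breaks the error rate down by whichever field the investigator picks. The events and field names (`region`, `app_version`, `user_tier`) are made up for the example.

```python
from collections import Counter

# Hypothetical wide events; only a few of their many fields are shown.
EVENTS = [
    {"status": 200, "region": "eu", "app_version": "2.1", "user_tier": "free"},
    {"status": 500, "region": "eu", "app_version": "2.2", "user_tier": "premium"},
    {"status": 500, "region": "us", "app_version": "2.2", "user_tier": "free"},
    {"status": 200, "region": "us", "app_version": "2.1", "user_tier": "premium"},
]


def error_rate_by(events, field):
    """Error rate per value of `field`.

    Events that lack the field are grouped under the key None.
    """
    totals, errors = Counter(), Counter()
    for event in events:
        key = event.get(field)
        totals[key] += 1
        if event["status"] >= 500:
            errors[key] += 1
    return {key: errors[key] / totals[key] for key in totals}


by_region = error_rate_by(EVENTS, "region")
by_version = error_rate_by(EVENTS, "app_version")
```

In this toy data the region breakdown is flat while the version breakdown isolates every failure to one release. Had `app_version` not been recorded on the event, that question could not have been asked after the fact.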

Tail Sampling

Making the sampling decision after the request completes, based on its outcome. This keeps costs manageable while preserving the events that matter:

Always keep errors — 100% of 500s, exceptions, and failures
Always keep slow requests — anything above your p99 latency threshold
Always keep specific users — VIP customers, internal testing accounts
Randomly sample the rest — happy, fast requests at 1-5%

See architecture#tail-sampling-implementation for code.
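
For orientation, the four rules above can be sketched as a single decision function that runs after the request has completed. This is not the implementation referenced above; the function name, field names, default threshold, and the injectable `rng` parameter are all assumptions for illustration.

```python
import random


def should_keep(event, p99_ms=800, vip_users=frozenset(), sample_rate=0.05,
                rng=random.random):
    """Decide, after the request has finished, whether to keep its event.

    `p99_ms`, `vip_users`, and `sample_rate` are hypothetical settings;
    `rng` is injectable so the random branch can be tested.
    """
    # Rule 1: always keep errors.
    if event.get("status", 0) >= 500 or "error.type" in event:
        return True
    # Rule 2: always keep slow requests.
    if event.get("duration_ms", 0) > p99_ms:
        return True
    # Rule 3: always keep specific users.
    if event.get("user_id") in vip_users:
        return True
    # Rule 4: randomly sample the remaining happy, fast requests.
    return rng() < sample_rate
```

A common companion step, not shown here, is to store the applied sample rate on each kept event so that aggregate counts can be re-weighted at query time.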

Ecosystem

Tool / Project	Role in O11y 2.0
Honeycomb	Pioneer of the paradigm. SaaS platform built on wide events with BubbleUp analysis
GreptimeDB	Open-source analytical database purpose-built for wide events. Rust, columnar, disaggregated storage
ClickHouse	General-purpose OLAP engine widely used as an O11y 2.0 backend (SigNoz, ClickStack)
OpenTelemetry	Vendor-neutral telemetry collection standard — delivery mechanism, not the solution itself
Dash0	O11y 2.0-aligned platform with native wide event support
Nominal	Boris Tane's upcoming platform ("nobody should be on-call in 2026")

OpenTelemetry won't save you alone

OTel is a protocol and SDK set — it standardizes how telemetry is collected and exported. But it doesn't decide what to log. It doesn't add business context. If you're still thinking in terms of "log statements," you'll just emit bad telemetry in a standardized format.

Misconceptions

Misconception	Reality
"Structured logging is the same as wide events"	Structured logging is JSON instead of strings — table stakes. Wide events are a philosophy: one comprehensive event per request with all context attached
"We already use OTel, so we're good"	OTel is a delivery mechanism. Most implementations capture only the bare minimum (span name, duration, status). You must deliberately instrument with business context
"This is just tracing with extra steps"	Tracing gives request flow across services. Wide events give context within a service. Ideally, your wide events ARE your trace spans, enriched
"Logs are for debugging, metrics are for dashboards"	This distinction is artificial. Wide events power both — query them for debugging, aggregate them for dashboards
"High-cardinality data is expensive and slow"	Expensive on legacy logging systems built for low-cardinality string search. Modern columnar databases (ClickHouse, GreptimeDB) are designed for exactly this

Related

Observability: parent domain
SigNoz: OpenTelemetry-native platform on ClickHouse
OpenObserve: Parquet + S3 unified platform
LGTM Stack: traditional three-pillar stack (the "1.0" approach)
Coroot: eBPF-based auto-instrumentation

Sources

loggingsucks.com — Boris Tane — definitive practical guide to wide events with implementation patterns, sampling strategies, and misconception debunking
GreptimeDB: The New Database for Observability 2.0 — database requirements for O11y 2.0 (columnar storage, disaggregated compute, backward-compat)
Observability 2.0 — Honeycomb — Charity Majors' original framing
The Three Pillars are a Lie — charity.wtf — critique of the pillar model as vendor marketing
Observability 2.0 — SUSE — industry overview
Wide Events — Charity Majors — structured events philosophy
Canonical Log Lines — Stripe — origin of the canonical log line pattern

Questions

How do organizations migrate incrementally from 1.0 to 2.0 without a "big bang" rewrite of their observability stack?
At what scale does tail sampling accuracy become a concern — can 1% sampling reliably represent the full request distribution?
Will O11y 2.0 databases (GreptimeDB, ClickHouse) converge on a common query interface, or will PromQL/SQL/TraceQL fragmentation persist?
How does O11y 2.0 interact with AI-powered root cause analysis — does the richer event context make LLM-based debugging significantly more effective?
What is the real-world storage cost comparison between a three-pillar stack and a unified wide-event store at 100K events/sec?