Observability 2.0

A paradigm shift from the "three pillars" model (metrics, logs, traces as separate silos) to a unified, wide-event-centric approach to understanding distributed systems. Coined and popularized by Charity Majors (Honeycomb CTO), the term describes both a philosophical and architectural change in how telemetry data is collected, stored, and queried.

Summary

Observability 1.0 treats metrics, logs, and traces as independent signal types requiring different storage backends, query languages, and dashboards. Engineers "bunny-hop" between tools to correlate data during incidents. This model was adequate for monoliths and simple distributed systems but breaks down at modern microservice scale.

Observability 2.0 replaces this with a single source of truth: wide structured events — one context-rich record per request per service hop, containing every dimension needed for debugging. From this raw event stream, metrics, traces, and log views are derived at query time rather than pre-aggregated at write time.
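To make "derived at query time" concrete, here is a minimal sketch that computes an error-rate metric from raw events, grouped by whichever field the question calls for. The event fields and the `error_rate_by` helper are illustrative assumptions, not a standard schema or a specific product's API.

```python
from collections import defaultdict

# Hypothetical wide events; field names are assumptions for illustration.
events = [
    {"service": "checkout", "status": 200, "duration_ms": 120, "tier": "free"},
    {"service": "checkout", "status": 500, "duration_ms": 950, "tier": "premium"},
    {"service": "checkout", "status": 200, "duration_ms": 80, "tier": "premium"},
    {"service": "search", "status": 200, "duration_ms": 40, "tier": "free"},
]


def error_rate_by(events, field):
    """Group raw events by any field and return the share of 5xx responses."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for event in events:
        key = event[field]
        totals[key] += 1
        if event["status"] >= 500:
            errors[key] += 1
    return {key: errors[key] / totals[key] for key in totals}


# The same raw stream answers two different questions; nothing was
# pre-aggregated when the events were written.
by_service = error_rate_by(events, "service")
by_tier = error_rate_by(events, "tier")
```

Because the grouping field is chosen when the query runs, a breakdown by `tier` needs no new instrumentation, whereas a pre-aggregated metric would have needed that label defined in advance.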

The Core Insight

Instead of logging what your code is doing, log what happened to this request. Stop thinking about logs as a debugging diary. Start thinking about them as a structured record of business events.

Observability 1.0 vs 2.0

Dimension | Observability 1.0 | Observability 2.0
Data model | Three pillars: metrics, logs, traces | Unified wide events
Storage | Separate backends per signal (Prometheus, Loki, Tempo) | Single unified data store
Schema | Low cardinality, pre-defined labels | High cardinality, high dimensionality, dynamic fields
Aggregation | Pre-aggregated at write time (recording rules) | Derived at query time from raw events
Debugging | Known unknowns: dashboards for anticipated failures | Unknown unknowns: ad-hoc exploratory queries
Context | Scattered across log lines, requires correlation | Complete per-request context in one record
Scaling | Metric cardinality explosions, log volume cost | Event volume scales with traffic; tail sampling controls cost
Backward compatibility | N/A | Must still power Grafana dashboards, PromQL, trace views

Key Concepts

Wide Events (Canonical Log Lines)

A single, context-rich structured event emitted once per request per service hop, containing 30-100+ fields. Popularized by Stripe as "canonical log lines." Unlike traditional logging (many small log lines per request), a wide event captures the full picture:

  • Request context: method, path, status, duration
  • Infrastructure context: service, version, deployment ID, region
  • Business context: user ID, subscription tier, account age, lifetime value
  • Operation details: payment provider, latency, attempt count
  • Error details: type, code, message, retriability
  • Feature flags: which experiments are active
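The field groups above can be sketched as a small accumulator that gathers context while a request is handled and emits exactly one JSON line at the end. The `WideEvent` class, the `handle_checkout` handler, and every field value here are hypothetical; this is one possible shape, not Stripe's or any vendor's implementation.

```python
import json
import time


class WideEvent:
    """Accumulates context during a request and is emitted once at the end."""

    def __init__(self, **fields):
        self.fields = dict(fields)
        self._start = time.monotonic()

    def add(self, **fields):
        """Attach more context as it becomes known during the request."""
        self.fields.update(fields)

    def finish(self, status):
        """Record outcome and duration, then serialize to a single line."""
        self.fields["status"] = status
        elapsed_ms = (time.monotonic() - self._start) * 1000
        self.fields["duration_ms"] = round(elapsed_ms, 3)
        return json.dumps(self.fields, sort_keys=True)


def handle_checkout(user, cart_total):
    # Request and infrastructure context (values are made up).
    event = WideEvent(
        method="POST",
        path="/checkout",
        service="checkout-api",
        version="1.4.2",
        region="eu-west-1",
    )
    # Business context.
    event.add(user_id=user["id"], subscription_tier=user["tier"])
    # Operation details and active feature flags.
    event.add(payment_provider="example-pay", attempt=1, cart_total=cart_total)
    event.add(feature_flags=["new_checkout_flow"])
    line = event.finish(status=200)
    print(line)  # one line per request per service hop
    return line


line = handle_checkout({"id": "u_123", "tier": "premium"}, 49.90)
```

In a real service the event would typically live in request-scoped middleware so that every layer can call `add` on the same object; the error fields would be attached in the exception path before `finish`.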

Structured logging is necessary but not sufficient

Structured logging means your logs are JSON instead of strings — that's table stakes. Wide events are a philosophy: one comprehensive event per request, with all context attached. You can have structured logs that are still useless (5 fields, no user context, scattered across 20 log lines).

High Cardinality

The number of unique values a field can have. user_id has high cardinality (millions of unique values). http_method has low cardinality (GET, POST, PUT, DELETE). High-cardinality fields are what make observability data actually useful for debugging: they let you drill down to individual users, requests, or deployments. Traditional metrics systems such as Prometheus struggle with high cardinality because every unique combination of label values becomes its own time series; O11y 2.0 databases are built to embrace it.
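As a rough illustration, cardinality can be measured by counting distinct values per field, and the product of those counts gives an upper bound on the number of series a label-based metrics system would have to track if every field were used as a label. The sample events and the `cardinality` helper are assumptions made for this sketch.

```python
from math import prod

# Hypothetical events; field names and values are illustrative only.
events = [
    {"http_method": "GET", "user_id": "u_1", "region": "eu"},
    {"http_method": "GET", "user_id": "u_2", "region": "eu"},
    {"http_method": "POST", "user_id": "u_3", "region": "us"},
    {"http_method": "GET", "user_id": "u_4", "region": "us"},
]


def cardinality(events, field):
    """Count the distinct values observed for one field."""
    return len({event[field] for event in events if field in event})


cards = {f: cardinality(events, f) for f in ("http_method", "user_id", "region")}

# Worst case: every combination of label values occurs, and each
# combination is tracked as a separate time series.
worst_case_series = prod(cards.values())
```

Scaled up, the same arithmetic shows the problem: 4 methods × 5 regions × 1,000,000 users gives up to 20,000,000 series for a single metric, which is why user_id is normally kept out of metric labels but is welcome as a field on a wide event.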

High Dimensionality

The number of fields in your event. A log with 5 fields has low dimensionality. A wide event with 50+ fields has high dimensionality. More dimensions = more questions you can answer without re-instrumenting.

Tail Sampling

Making the sampling decision after the request completes, based on its outcome. This keeps costs manageable while preserving the events that matter:

  1. Always keep errors — 100% of 500s, exceptions, and failures
  2. Always keep slow requests — anything above your p99 latency threshold
  3. Always keep specific users — VIP customers, internal testing accounts
  4. Randomly sample the rest — happy, fast requests at 1-5%

See architecture#tail-sampling-implementation for code.
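For quick reference, the four rules can be sketched as a single decision function. The threshold, the VIP list, the 5% baseline rate, and the `sample_decision` name are assumptions chosen for illustration. Returning a sample rate alongside the decision is a convention assumed here so that counts can be re-weighted later.

```python
import random

P99_THRESHOLD_MS = 800    # assumed latency threshold for "slow"
VIP_USERS = {"u_vip_1"}   # assumed allow-list of always-kept users
BASELINE_RATE = 0.05      # keep 5% of ordinary requests


def sample_decision(event, roll=None):
    """Return (keep, sample_rate) for a completed request.

    sample_rate is the number of original requests a kept event
    stands for: 1 for always-kept events, 1 / BASELINE_RATE otherwise.
    `roll` is a number in [0, 1); it is drawn randomly when omitted.
    """
    # Rule 1: always keep errors.
    if event.get("status", 0) >= 500 or event.get("error"):
        return True, 1
    # Rule 2: always keep slow requests.
    if event.get("duration_ms", 0) > P99_THRESHOLD_MS:
        return True, 1
    # Rule 3: always keep specific users.
    if event.get("user_id") in VIP_USERS:
        return True, 1
    # Rule 4: randomly sample the rest.
    if roll is None:
        roll = random.random()
    return roll < BASELINE_RATE, round(1 / BASELINE_RATE)


keep, rate = sample_decision({"status": 200, "duration_ms": 35, "user_id": "u_9"})
```

If the sample rate is stored on each kept event, summing it over the kept events estimates the original request count, so dashboards built on sampled data stay approximately correct.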

Ecosystem

Tool / Project | Role in O11y 2.0
Honeycomb | Pioneer of the paradigm. SaaS platform built on wide events with BubbleUp analysis
GreptimeDB | Open-source analytical database purpose-built for wide events. Rust, columnar, disaggregated storage
ClickHouse | General-purpose OLAP engine widely used as an O11y 2.0 backend (SigNoz, ClickStack)
OpenTelemetry | Vendor-neutral telemetry collection standard: a delivery mechanism, not the solution itself
Dash0 | O11y 2.0-aligned platform with native wide event support
Nominal | Boris Tane's upcoming platform ("nobody should be on-call in 2026")

OpenTelemetry alone won't save you

OTel is a protocol and SDK set — it standardizes how telemetry is collected and exported. But it doesn't decide what to log. It doesn't add business context. If you're still thinking in terms of "log statements," you'll just emit bad telemetry in a standardized format.

Misconceptions

Misconception | Reality
"Structured logging is the same as wide events" | Structured logging is JSON instead of strings, which is table stakes. Wide events are a philosophy: one comprehensive event per request with all context attached
"We already use OTel, so we're good" | OTel is a delivery mechanism. Most implementations capture the bare minimum (span name, duration, status). You must deliberately instrument with business context
"This is just tracing with extra steps" | Tracing gives request flow across services. Wide events give context within a service. Ideally, your wide events ARE your trace spans, enriched
"Logs are for debugging, metrics are for dashboards" | This distinction is artificial. Wide events power both: query them for debugging, aggregate them for dashboards
"High-cardinality data is expensive and slow" | Expensive on legacy logging systems built for low-cardinality string search. Modern columnar databases (ClickHouse, GreptimeDB) are designed for exactly this

Related

  • Observability — parent domain
  • SigNoz — OpenTelemetry-native platform on ClickHouse
  • OpenObserve — Parquet + S3 unified platform
  • LGTM Stack — traditional three-pillar stack (the "1.0" approach)
  • Coroot — eBPF-based auto-instrumentation

Questions

  • How do organizations migrate incrementally from 1.0 to 2.0 without a "big bang" rewrite of their observability stack?
  • At what scale does tail sampling accuracy become a concern — can 1% sampling reliably represent the full request distribution?
  • Will O11y 2.0 databases (GreptimeDB, ClickHouse) converge on a common query interface, or will PromQL/SQL/TraceQL fragmentation persist?
  • How does O11y 2.0 interact with AI-powered root cause analysis — does the richer event context make LLM-based debugging significantly more effective?
  • What is the real-world storage cost comparison between a three-pillar stack and a unified wide-event store at 100K events/sec?