Observability 2.0
A paradigm shift from the "three pillars" model (metrics, logs, traces as separate silos) to a unified, wide-event-centric approach to understanding distributed systems. Coined and popularized by Charity Majors (Honeycomb CTO), the term describes both a philosophical and architectural change in how telemetry data is collected, stored, and queried.
Summary
Observability 1.0 treats metrics, logs, and traces as independent signal types requiring different storage backends, query languages, and dashboards. Engineers "bunny-hop" between tools to correlate data during incidents. This model was adequate for monoliths and simple distributed systems but breaks down at modern microservice scale.
Observability 2.0 replaces this with a single source of truth: wide structured events — one context-rich record per request per service hop, containing every dimension needed for debugging. From this raw event stream, metrics, traces, and log views are derived at query time rather than pre-aggregated at write time.
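As a rough illustration of "derived at query time", the sketch below computes request counts, error rates, and a latency percentile directly from a list of raw events. The event fields (`service`, `status`, `duration_ms`) and the helper names are assumptions made for this example, and an in-memory list stands in for a real event store.

```python
import math
from collections import defaultdict

# Hypothetical wide events; the field names are illustrative only.
EVENTS = [
    {"service": "checkout", "status": 200, "duration_ms": 120, "user_id": "u1"},
    {"service": "checkout", "status": 500, "duration_ms": 900, "user_id": "u2"},
    {"service": "checkout", "status": 200, "duration_ms": 80, "user_id": "u3"},
    {"service": "search", "status": 200, "duration_ms": 40, "user_id": "u1"},
]


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def derive_metrics(events, group_by):
    """Aggregate raw events into per-group metrics at read time."""
    groups = defaultdict(list)
    for event in events:
        groups[event[group_by]].append(event)

    metrics = {}
    for key, rows in groups.items():
        errors = sum(1 for row in rows if row["status"] >= 500)
        metrics[key] = {
            "requests": len(rows),
            "error_rate": errors / len(rows),
            "p95_ms": percentile([row["duration_ms"] for row in rows], 95),
        }
    return metrics


metrics_by_service = derive_metrics(EVENTS, "service")
```

Because nothing is pre-aggregated, the same function can be called with `group_by="user_id"` (or any other field present on every event) without changing the instrumentation.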
The Core Insight
Instead of logging what your code is doing, log what happened to this request. Stop thinking about logs as a debugging diary. Start thinking about them as a structured record of business events.
Observability 1.0 vs 2.0
| Dimension | Observability 1.0 | Observability 2.0 |
|---|---|---|
| Data model | Three pillars: metrics, logs, traces | Unified wide events |
| Storage | Separate backends per signal (Prometheus, Loki, Tempo) | Single unified data store |
| Schema | Low cardinality, pre-defined labels | High cardinality, high dimensionality, dynamic fields |
| Aggregation | Pre-aggregated at write time (recording rules) | Derived at query time from raw events |
| Debugging | Known unknowns — dashboards for anticipated failures | Unknown unknowns — ad-hoc exploratory queries |
| Context | Scattered across log lines, requires correlation | Complete per-request context in one record |
| Scaling | Metric cardinality explosions, log volume cost | Event volume scales with traffic; tail sampling controls cost |
| Backward compatibility | N/A | Must still power Grafana dashboards, PromQL, trace views |
Key Concepts
Wide Events (Canonical Log Lines)
A single, context-rich structured event emitted once per request per service hop, containing 30-100+ fields. Popularized by Stripe as "canonical log lines." Unlike traditional logging (many small log lines per request), a wide event captures the full picture:
- Request context: method, path, status, duration
- Infrastructure context: service, version, deployment ID, region
- Business context: user ID, subscription tier, account age, lifetime value
- Operation details: payment provider, latency, attempt count
- Error details: type, code, message, retriability
- Feature flags: which experiments are active
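One way to picture the pattern is an object that accumulates fields while the request is handled and is serialized exactly once at the end. The following is a minimal sketch under that assumption; the `WideEvent` class, the `handle_checkout` handler, and every field name and value are hypothetical and not taken from any particular library.

```python
import json
import time


class WideEvent:
    """Accumulates fields for one request and emits a single JSON record."""

    def __init__(self, **fields):
        self.fields = dict(fields)
        self._start = time.monotonic()

    def add(self, **fields):
        """Attach more context as the request moves through the handler."""
        self.fields.update(fields)

    def emit(self, sink=print):
        """Stamp the duration and write the one record for this request."""
        elapsed_ms = (time.monotonic() - self._start) * 1000
        self.fields["duration_ms"] = round(elapsed_ms, 3)
        line = json.dumps(self.fields, sort_keys=True)
        sink(line)
        return line


def handle_checkout(event, user, cart_total_cents):
    """Hypothetical handler that enriches the event instead of logging lines."""
    event.add(user_id=user["id"], subscription_tier=user["tier"])
    event.add(payment_provider="example-pay", payment_attempt=1)
    if cart_total_cents <= 0:
        event.add(
            status=400,
            error_type="EmptyCart",
            error_message="cart is empty",
            error_retriable=False,
        )
        return
    event.add(status=200, amount_cents=cart_total_cents)


# Request and infrastructure context are attached up front.
event = WideEvent(
    method="POST",
    path="/checkout",
    service="checkout",
    version="1.4.2",
    region="eu-west-1",
)
handle_checkout(event, {"id": "u_123", "tier": "pro"}, cart_total_cents=0)
event.emit()  # one JSON line per request, regardless of outcome
```

In a real service the `emit()` call would typically live in middleware or a `finally` block so the event is written even when the handler raises.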
Structured logging is necessary but not sufficient
Structured logging means your logs are JSON instead of strings — that's table stakes. Wide events are a philosophy: one comprehensive event per request, with all context attached. You can have structured logs that are still useless (5 fields, no user context, scattered across 20 log lines).
High Cardinality
The number of unique values a field can have. user_id has high cardinality (millions of unique values); http_method has low cardinality (GET, POST, PUT, DELETE, and a handful of others). High-cardinality fields are what make observability data useful for debugging: they let you drill down to individual users, requests, or deployments. Traditional metrics systems such as Prometheus struggle with high cardinality because every unique combination of label values becomes its own time series; O11y 2.0 databases are designed to store and query such fields directly.
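To make the cost difference concrete, the sketch below measures the observed cardinality of a field and computes the worst-case number of time series a label-based metrics system would need if those fields were used as labels. The function names and the sample numbers are assumptions for illustration; the product is an upper bound, since not every combination necessarily occurs in practice.

```python
from math import prod


def field_cardinality(events, field):
    """Number of distinct values observed for one field."""
    return len({event[field] for event in events if field in event})


def max_series(label_cardinalities):
    """Upper bound on time series for one metric: product of label cardinalities."""
    return prod(label_cardinalities.values())


# Hypothetical events; only the two fields of interest are shown.
EVENTS = [
    {"user_id": "u1", "http_method": "GET"},
    {"user_id": "u2", "http_method": "GET"},
    {"user_id": "u3", "http_method": "POST"},
    {"user_id": "u1", "http_method": "POST"},
]

observed = {
    "user_id": field_cardinality(EVENTS, "user_id"),
    "http_method": field_cardinality(EVENTS, "http_method"),
}

# Assumed production-scale cardinalities, for the arithmetic only.
low_cardinality_labels = {"http_method": 4, "status": 5}
with_user_id = {**low_cardinality_labels, "user_id": 1_000_000}

series_without_user = max_series(low_cardinality_labels)
series_with_user = max_series(with_user_id)
```

With these assumed numbers, adding `user_id` as a label raises the bound from 20 series to 20,000,000, whereas in an event store it is simply one more column on each record.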
High Dimensionality
The number of fields in your event. A log with 5 fields has low dimensionality. A wide event with 50+ fields has high dimensionality. More dimensions = more questions you can answer without re-instrumenting.
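The payoff of many dimensions is that a question nobody anticipated can be asked of data that already exists. As a deliberately naive sketch, the function below compares how often each field value appears among failing events versus healthy ones and ranks the largest differences. The scoring rule, function name, and sample events are all assumptions for this example and do not reproduce any vendor's analysis feature.

```python
from collections import Counter


def suspicious_dimensions(events, is_bad, ignore=(), top=3):
    """Rank (field, value) pairs that are over-represented among bad events."""
    bad = [e for e in events if is_bad(e)]
    good = [e for e in events if not is_bad(e)]
    if not bad or not good:
        return []

    fields = {key for e in events for key in e} - set(ignore)
    scored = []
    for field in sorted(fields):
        bad_counts = Counter(e.get(field) for e in bad)
        good_counts = Counter(e.get(field) for e in good)
        for value, count in bad_counts.items():
            # Share among bad events minus share among good events.
            diff = count / len(bad) - good_counts.get(value, 0) / len(good)
            scored.append((diff, field, value))

    scored.sort(key=lambda item: (-item[0], item[1], str(item[2])))
    return [(field, value, round(diff, 3)) for diff, field, value in scored[:top]]


# Hypothetical events: failures happen to coincide with version 2.0.
EVENTS = [
    {"status": 500, "region": "eu", "version": "2.0", "tier": "free"},
    {"status": 500, "region": "us", "version": "2.0", "tier": "pro"},
    {"status": 200, "region": "eu", "version": "1.9", "tier": "free"},
    {"status": 200, "region": "us", "version": "1.9", "tier": "pro"},
    {"status": 200, "region": "us", "version": "1.9", "tier": "free"},
    {"status": 200, "region": "eu", "version": "2.0", "tier": "pro"},
]

top_suspects = suspicious_dimensions(
    EVENTS,
    is_bad=lambda e: e["status"] >= 500,
    ignore=("status",),  # the field that defines "bad" is not informative
    top=1,
)
```

The comparison only works because `version`, `region`, and `tier` were already on every event; with a five-field log line, the same question would require new instrumentation and a redeploy.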
Tail Sampling
Making the sampling decision after the request completes, based on its outcome. This keeps costs manageable while preserving the events that matter:
- Always keep errors: 100% of 5xx responses, exceptions, and failures
- Always keep slow requests: anything above your p99 latency threshold
- Always keep specific users: VIP customers, internal testing accounts
- Randomly sample the rest: happy, fast requests at 1-5%
See architecture#tail-sampling-implementation for code.
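For readers who want the rules in one place, here is a minimal, self-contained sketch of the keep/drop decision, written independently of the linked page. The function name, the field names (`status`, `error_type`, `duration_ms`, `user_id`), and the default thresholds are assumptions; a real deployment would derive the slow-request threshold from its own measured p99.

```python
import random


def should_keep(
    event,
    *,
    slow_ms=2000,            # assumed stand-in for a measured p99 threshold
    vip_users=frozenset(),   # assumed set of user IDs that are always kept
    sample_rate=0.05,        # share of healthy, fast requests to keep
    rng=random.random,       # injectable so the decision can be tested
):
    """Decide after the request completes whether to store its wide event."""
    # Rule 1: always keep errors.
    if event.get("status", 0) >= 500 or event.get("error_type"):
        return True
    # Rule 2: always keep slow requests.
    if event.get("duration_ms", 0) > slow_ms:
        return True
    # Rule 3: always keep specific users.
    if event.get("user_id") in vip_users:
        return True
    # Rule 4: randomly sample everything else.
    return rng() < sample_rate


healthy = {"status": 200, "duration_ms": 35, "user_id": "u_1"}
failed = {"status": 503, "duration_ms": 35, "user_id": "u_1"}

keep_failed = should_keep(failed)
keep_vip = should_keep(healthy, vip_users=frozenset({"u_1"}))
```

The rules are checked in order of importance, so an event that is both slow and failing is kept by the first rule, and the random draw is only made for requests that are otherwise uninteresting.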
Ecosystem
| Tool / Project | Role in O11y 2.0 |
|---|---|
| Honeycomb | Pioneer of the paradigm. SaaS platform built on wide events with BubbleUp analysis |
| GreptimeDB | Open-source analytical database purpose-built for wide events. Rust, columnar, disaggregated storage |
| ClickHouse | General-purpose OLAP engine widely used as an O11y 2.0 backend (SigNoz, ClickStack) |
| OpenTelemetry | Vendor-neutral telemetry collection standard — delivery mechanism, not the solution itself |
| Dash0 | O11y 2.0-aligned platform with native wide event support |
| Nominal | Boris Tane's upcoming platform ("nobody should be on-call in 2026") |
OpenTelemetry won't save you alone
OTel is a protocol and SDK set — it standardizes how telemetry is collected and exported. But it doesn't decide what to log. It doesn't add business context. If you're still thinking in terms of "log statements," you'll just emit bad telemetry in a standardized format.
Misconceptions
| Misconception | Reality |
|---|---|
| "Structured logging is the same as wide events" | Structured logging is JSON instead of strings — table stakes. Wide events are a philosophy: one comprehensive event per request with all context attached |
| "We already use OTel, so we're good" | OTel is a delivery mechanism. Most implementations capture only the bare minimum (span name, duration, status). You must deliberately instrument with business context |
| "This is just tracing with extra steps" | Tracing gives request flow across services. Wide events give context within a service. Ideally, your wide events ARE your trace spans, enriched |
| "Logs are for debugging, metrics are for dashboards" | This distinction is artificial. Wide events power both — query them for debugging, aggregate them for dashboards |
| "High-cardinality data is expensive and slow" | Expensive on legacy logging systems built for low-cardinality string search. Modern columnar databases (ClickHouse, GreptimeDB) are designed for exactly this |
Related Topics
- Observability — parent domain
- SigNoz — OpenTelemetry-native platform on ClickHouse
- OpenObserve — Parquet + S3 unified platform
- LGTM Stack — traditional three-pillar stack (the "1.0" approach)
- Coroot — eBPF-based auto-instrumentation
Sources
- loggingsucks.com — Boris Tane — definitive practical guide to wide events with implementation patterns, sampling strategies, and misconception debunking
- GreptimeDB: The New Database for Observability 2.0 — database requirements for O11y 2.0 (columnar storage, disaggregated compute, backward-compat)
- Observability 2.0 — Honeycomb — Charity Majors' original framing
- The Three Pillars are a Lie — charity.wtf — critique of the pillar model as vendor marketing
- Observability 2.0 — SUSE — industry overview
- Wide Events — Charity Majors — structured events philosophy
- Canonical Log Lines — Stripe — origin of the canonical log line pattern
Questions
- How do organizations migrate incrementally from 1.0 to 2.0 without a "big bang" rewrite of their observability stack?
- At what scale does tail sampling accuracy become a concern — can 1% sampling reliably represent the full request distribution?
- Will O11y 2.0 databases (GreptimeDB, ClickHouse) converge on a common query interface, or will PromQL/SQL/TraceQL fragmentation persist?
- How does O11y 2.0 interact with AI-powered root cause analysis — does the richer event context make LLM-based debugging significantly more effective?
- What is the real-world storage cost comparison between a three-pillar stack and a unified wide-event store at 100K events/sec?