Architecture

Technical deep dive into the Observability 2.0 paradigm — wide event anatomy, implementation patterns, sampling strategies, database requirements, and the GreptimeDB reference architecture.

Observability 1.0 vs 2.0 Architecture

graph TB
    subgraph "Observability 1.0 — Three Pillars"
        direction TB
        APP1["Application"] --> PROM_EXP["Prometheus Exporter"]
        APP1 --> LOG_AGENT["Fluentd / Fluent Bit"]
        APP1 --> OTEL_SDK1["OTel SDK (Traces)"]

        PROM_EXP --> MIMIR["Mimir / Prometheus"]
        LOG_AGENT --> LOKI["Loki / Elasticsearch"]
        OTEL_SDK1 --> TEMPO["Tempo / Jaeger"]

        MIMIR --> GRAFANA1["Grafana Dashboards"]
        LOKI --> GRAFANA1
        TEMPO --> GRAFANA1
    end

    subgraph "Observability 2.0 — Unified Wide Events"
        direction TB
        APP2["Application"] --> WIDE["Wide Event Middleware"]
        WIDE --> OTEL_COL["OTel Collector"]
        OTEL_COL --> UNIFIED_DB["Unified DB\n(GreptimeDB / ClickHouse)"]

        UNIFIED_DB --> DASH["Dashboards (PromQL)"]
        UNIFIED_DB --> EXPLORE["Exploratory Queries (SQL)"]
        UNIFIED_DB --> ALERT["Alerts & Rules"]
        UNIFIED_DB --> TRACE_VIEW["Trace View"]
    end

    style APP1 fill:#e74c3c,color:#fff
    style APP2 fill:#27ae60,color:#fff
    style UNIFIED_DB fill:#2980b9,color:#fff

Key difference: In 1.0, the application emits three separate signal types to three separate backends. In 2.0, the application emits one wide event per request, routed through a single pipeline to a single store that powers all views.

Wide Event Anatomy

A wide event captures the complete context of a single request in one structured record. This example from a checkout service shows approximately 30 fields across 7 context groups:

{
  "timestamp": "2025-01-15T10:23:45.612Z",
  "request_id": "req_8bf7ec2d",
  "trace_id": "abc123",

  "service": "checkout-service",
  "version": "2.4.1",
  "deployment_id": "deploy_789",
  "region": "us-east-1",

  "method": "POST",
  "path": "/api/checkout",
  "status_code": 500,
  "duration_ms": 1247,

  "user": {
    "id": "user_456",
    "subscription": "premium",
    "account_age_days": 847,
    "lifetime_value_cents": 284700
  },

  "cart": {
    "id": "cart_xyz",
    "item_count": 3,
    "total_cents": 15999,
    "coupon_applied": "SAVE20"
  },

  "payment": {
    "method": "card",
    "provider": "stripe",
    "latency_ms": 1089,
    "attempt": 3
  },

  "error": {
    "type": "PaymentError",
    "code": "card_declined",
    "message": "Card declined by issuer",
    "retriable": false,
    "stripe_decline_code": "insufficient_funds"
  },

  "feature_flags": {
    "new_checkout_flow": true,
    "express_payment": false
  }
}

When a user complains, searching user.id = "user_456" instantly reveals:

  • Premium customer, 2+ year account (high priority)
  • Payment failed on 3rd attempt — insufficient funds
  • Using the new checkout flow (potential correlation?)
  • No grep-ing, no guessing, no second search

Context Groups in a Wide Event

| Group | Fields | Purpose |
| --- | --- | --- |
| Identity | request_id, trace_id, timestamp | Correlation and ordering |
| Infrastructure | service, version, deployment_id, region | Where the event happened |
| Request | method, path, status_code, duration_ms | What happened |
| User / Business | user.id, subscription, account_age_days, lifetime_value_cents | Who was affected and business impact |
| Operation | payment.method, payment.provider, payment.latency_ms | Domain-specific operation details |
| Error | error.type, error.code, error.message, error.retriable | Failure specifics |
| Experiments | feature_flags.* | Active experiments for correlation analysis |
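
The groups above can also be written down as a type, so that handlers and sampling code agree on field names. This is a minimal sketch that assumes the field names of the sample checkout event; the CheckoutWideEvent name and the contextGroupsPresent helper are illustrative and not part of any library.

```typescript
// Illustrative type for the sample checkout event; field names follow the JSON above.
interface CheckoutWideEvent {
  // Identity
  timestamp: string;
  request_id: string;
  trace_id?: string;
  // Infrastructure
  service: string;
  version: string;
  deployment_id: string;
  region: string;
  // Request
  method: string;
  path: string;
  status_code?: number;
  duration_ms?: number;
  // User / Business
  user?: { id: string; subscription: string; account_age_days: number; lifetime_value_cents: number };
  cart?: { id: string; item_count: number; total_cents: number; coupon_applied?: string };
  // Operation
  payment?: { method: string; provider: string; latency_ms: number; attempt: number };
  // Error
  error?: { type: string; code?: string; message?: string; retriable?: boolean; stripe_decline_code?: string };
  // Experiments
  feature_flags?: Record<string, boolean>;
}

// Hypothetical helper: lists which optional context groups a given event carries.
function contextGroupsPresent(event: CheckoutWideEvent): string[] {
  const optionalGroups = ['user', 'cart', 'payment', 'error', 'feature_flags'] as const;
  return optionalGroups.filter((group) => event[group] !== undefined);
}
```

Fields that are filled in late in the request lifecycle (status_code, duration_ms, and the business groups) are optional here because the event is built incrementally.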

Queries Enabled by Wide Events

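With wide events you run analytics on production traffic, not string searches on logs. The dotted paths below (user.id, error.code) are illustrative; how nested fields are addressed, and whether a name such as user needs quoting, depends on the database.

-- Premium users hitting server errors (5xx) in the last hour with new checkout flow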
SELECT user.id, error.code, payment.attempt, duration_ms
FROM events
WHERE status_code >= 500
  AND user.subscription = 'premium'
  AND feature_flags.new_checkout_flow = true
  AND timestamp > NOW() - INTERVAL '1 hour'
ORDER BY user.lifetime_value_cents DESC;

-- Error rate by deployment, grouped by region
SELECT deployment_id, region,
       COUNT(*) FILTER (WHERE status_code >= 500) AS errors,
       COUNT(*) AS total,
       ROUND(100.0 * COUNT(*) FILTER (WHERE status_code >= 500) / COUNT(*), 2) AS error_pct
FROM events
WHERE timestamp > NOW() - INTERVAL '15 minutes'
GROUP BY deployment_id, region
ORDER BY error_pct DESC;

-- P99 latency by service version (canary vs stable)
SELECT version,
       PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY duration_ms) AS p99_ms,
       COUNT(*) AS request_count
FROM events
WHERE service = 'checkout-service'
  AND timestamp > NOW() - INTERVAL '1 hour'
GROUP BY version;
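
To make the second query concrete, the same aggregation can be written over an in-memory array. This is only a sketch: the EventRow shape and the errorRateByDeployment function are hypothetical names, and a real backend would run the aggregation inside the database rather than in application code.

```typescript
interface EventRow {
  deployment_id: string;
  region: string;
  status_code: number;
}

interface ErrorRateRow {
  deployment_id: string;
  region: string;
  errors: number;
  total: number;
  error_pct: number;
}

// In-memory equivalent of "error rate by deployment, grouped by region".
function errorRateByDeployment(events: EventRow[]): ErrorRateRow[] {
  const groups = new Map<string, ErrorRateRow>();
  for (const e of events) {
    const key = `${e.deployment_id}|${e.region}`; // GROUP BY deployment_id, region
    let row = groups.get(key);
    if (!row) {
      row = { deployment_id: e.deployment_id, region: e.region, errors: 0, total: 0, error_pct: 0 };
      groups.set(key, row);
    }
    row.total += 1; // COUNT(*)
    if (e.status_code >= 500) row.errors += 1; // COUNT(*) FILTER (WHERE status_code >= 500)
  }
  const rows = Array.from(groups.values());
  for (const row of rows) {
    // ROUND(100.0 * errors / total, 2)
    row.error_pct = Math.round((10000 * row.errors) / row.total) / 100;
  }
  return rows.sort((a, b) => b.error_pct - a.error_pct); // ORDER BY error_pct DESC
}
```

The point of the comparison is that every grouping column is already on the event, so no join against a separate metrics or log store is needed.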

Implementation Pattern

The key insight: build the event throughout the request lifecycle, then emit once at the end.

Middleware Approach (TypeScript / Hono)

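// middleware/wideEvent.ts
import type { MiddlewareHandler } from 'hono';

// Stand-in emitter (an assumption): replace with your structured logger.
const logger = {
  info: (event: Record<string, unknown>) => console.log(JSON.stringify(event)),
};

// Copy the fields we care about from a thrown value into an error record
function describeError(error: unknown) {
  const e = (error ?? {}) as { name?: string; message?: string; code?: string; retriable?: boolean };
  return { type: e.name, message: e.message, code: e.code, retriable: e.retriable ?? false };
}

export function wideEventMiddleware(): MiddlewareHandler {
  return async (ctx, next) => {
    const startTime = Date.now();

    // Initialize the wide event with request context
    const event: Record<string, unknown> = {
      request_id: ctx.get('requestId'), // set by an upstream request-ID middleware
      timestamp: new Date().toISOString(),
      method: ctx.req.method,
      path: ctx.req.path,
      service: process.env.SERVICE_NAME,
      version: process.env.SERVICE_VERSION,
      deployment_id: process.env.DEPLOYMENT_ID,
      region: process.env.REGION,
    };

    // Make the event accessible to handlers
    ctx.set('wideEvent', event);

    try {
      await next();
      event.status_code = ctx.res.status;
      // Hono passes handler exceptions to its error handler and exposes them as
      // ctx.error, so they normally do not reach the catch block below.
      if (ctx.error) event.error = describeError(ctx.error);
    } catch (error) {
      event.status_code = 500;
      event.error = describeError(error);
      throw error;
    } finally {
      event.duration_ms = Date.now() - startTime;
      // A handler may attach event.error without throwing, so check both signals
      const failed = event.error !== undefined || Number(event.status_code) >= 500;
      event.outcome = failed ? 'error' : 'success';

      // Emit the wide event: ONE log line per request
      logger.info(event);
    }
  };
}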

Handler Enrichment

Handlers enrich the event with business context as they process the request:

app.post('/checkout', async (ctx) => {
  const event = ctx.get('wideEvent');
  const user = ctx.get('user');

  // Add user context
  event.user = {
    id: user.id,
    subscription: user.subscription,
    account_age_days: daysSince(user.createdAt),
    lifetime_value_cents: user.ltv,
  };

  // Add business context as you process
  const cart = await getCart(user.id);
  event.cart = {
    id: cart.id,
    item_count: cart.items.length,
    total_cents: cart.total,
    coupon_applied: cart.coupon?.code,
  };

  // Process payment — measure sub-operation latency
  const paymentStart = Date.now();
  const payment = await processPayment(cart, user);

  event.payment = {
    method: payment.method,
    provider: payment.provider,
    latency_ms: Date.now() - paymentStart,
    attempt: payment.attemptNumber,
  };

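  // If payment fails, add error details and return an error response
  if (payment.error) {
    event.error = {
      type: 'PaymentError',
      code: payment.error.code,
      stripe_decline_code: payment.error.declineCode,
    };
    // 500 matches the sample event above; a 4xx such as 402 may fit a declined card better
    return ctx.json({ error: payment.error.code }, 500);
  }

  return ctx.json({ orderId: payment.orderId });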
});
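
The handler above measures payment latency by hand, and the measurement is lost if processPayment throws. One way to tighten this, sketched here under the assumption that sub-operations are async functions, is a small wrapper that records the latency in a finally block. The timedInto name and its signature are hypothetical.

```typescript
// Hypothetical helper: runs an async operation and writes its latency (in ms)
// into `target[field]`, even when the operation throws.
async function timedInto<T>(
  target: Record<string, unknown>,
  field: string,
  operation: () => Promise<T>,
): Promise<T> {
  const start = Date.now();
  try {
    return await operation();
  } finally {
    target[field] = Date.now() - start;
  }
}
```

In the checkout handler this could look like: create an empty paymentContext object, assign it to event.payment, then call timedInto(paymentContext, 'latency_ms', () => processPayment(cart, user)). The latency then reaches the wide event whether the call succeeds or fails.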

Wide Event Data Flow

sequenceDiagram
    participant Client
    participant Middleware as Wide Event Middleware
    participant Handler as Request Handler
    participant DB as Business Logic / DB
    participant Logger as Event Emitter
    participant Backend as O11y Backend

    Client->>Middleware: POST /checkout
    Middleware->>Middleware: Initialize event (request_id, method, path, service, version)
    Middleware->>Handler: next()

    Handler->>Handler: Enrich event (user.id, subscription, LTV)
    Handler->>DB: getCart(user.id)
    DB-->>Handler: cart data
    Handler->>Handler: Enrich event (cart.id, item_count, total)
    Handler->>DB: processPayment(cart, user)
    DB-->>Handler: payment result
    Handler->>Handler: Enrich event (payment.method, latency_ms, attempt)

    Handler-->>Middleware: response

    Middleware->>Middleware: Finalize event (duration_ms, status_code, outcome)
    Middleware->>Logger: logger.info(event)
    Logger->>Backend: Single wide event (~30 fields)
    Middleware-->>Client: HTTP response

Tail Sampling Implementation

Tail sampling makes the keep/drop decision after the request completes, based on outcome. This keeps costs manageable while never losing the events that matter.

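// Minimal event shape: only the fields the sampler reads
interface WideEvent {
  status_code: number;
  duration_ms: number;
  error?: unknown;
  user?: { subscription?: string };
  feature_flags?: Record<string, boolean>;
}

// Static stand-in for the service's p99 latency; tune per service
const SLOW_THRESHOLD_MS = 2000;

// Tail sampling decision function
function shouldSample(event: WideEvent): boolean {
  // Always keep errors
  if (event.status_code >= 500) return true;
  if (event.error) return true;

  // Always keep slow requests (above the p99 threshold)
  if (event.duration_ms > SLOW_THRESHOLD_MS) return true;

  // Always keep VIP users
  if (event.user?.subscription === 'enterprise') return true;

  // Always keep requests with specific feature flags (debugging rollouts)
  if (event.feature_flags?.new_checkout_flow) return true;

  // Random sample the rest at 5%
  return Math.random() < 0.05;
}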

Sampling Rules Summary

| Rule | Keep Rate | Rationale |
| --- | --- | --- |
| Errors (5xx, exceptions) | 100% | Never lose failure evidence |
| Slow requests (> p99) | 100% | Tail latency is where problems hide |
| VIP / enterprise users | 100% | Business-critical; immediate escalation |
| Feature flag rollouts | 100% | Correlate new code with new failures |
| Everything else | 1-5% | Happy, fast requests; sample for baselines |
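
How much volume these rules actually keep depends on the traffic mix. The sketch below estimates the overall keep rate under the simplifying assumption that the always-keep categories do not overlap. The expectedKeepRate function and the traffic fractions are made up for illustration, not measurements.

```typescript
// Fraction of traffic matched by each always-keep rule (assumed non-overlapping).
interface TrafficMix {
  errors: number;
  slow: number;
  vip: number;
  flagged: number;
}

function expectedKeepRate(mix: TrafficMix, baselineRate: number): number {
  const alwaysKept = mix.errors + mix.slow + mix.vip + mix.flagged;
  if (alwaysKept > 1) throw new RangeError('traffic fractions exceed 100%');
  // Everything outside the always-keep rules is sampled at the baseline rate
  return alwaysKept + (1 - alwaysKept) * baselineRate;
}

// Illustrative mix: 1% errors, 1% slow, 2% VIP, 10% behind a rollout flag, 5% baseline
const keepRate = expectedKeepRate({ errors: 0.01, slow: 0.01, vip: 0.02, flagged: 0.10 }, 0.05);
console.log(`${(keepRate * 100).toFixed(1)}% of events kept`);
```

With these assumed numbers roughly 18% of events are kept, and the feature-flag rule accounts for more than half of that. A 100% rule on a widely enabled flag can dominate cost, so it is worth revisiting once a rollout completes.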

Naive random sampling is dangerous

If you randomly sample 1% of all traffic, any single request has a 99% chance of being dropped, including the one request that explains your outage. Always use tail sampling with outcome-based rules.
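
The risk can be quantified. If k requests carry the evidence you need and each is kept independently with probability r, the chance of keeping at least one is 1 - (1 - r)^k. A minimal sketch, with probKeepAtLeastOne as a hypothetical helper name:

```typescript
// Probability that uniform sampling at `rate` keeps at least one of `k` relevant requests.
function probKeepAtLeastOne(rate: number, k: number): number {
  return 1 - Math.pow(1 - rate, k);
}

for (const k of [1, 10, 100]) {
  const pct = (probKeepAtLeastOne(0.01, k) * 100).toFixed(1);
  console.log(`k=${k}: ${pct}% chance of keeping at least one`);
}
```

At a 1% rate, a failure that affects a single request is almost always lost, and even a failure that affects 100 requests is missed more than a third of the time. Outcome-based rules remove that gamble for the cases that matter.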

OTel Collector Tail Sampling

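A comparable policy set can be configured in the OTel Collector's tail sampling processor (tail_sampling, shipped in the Collector contrib distribution), which buffers the spans of each trace for decision_wait before deciding whether to keep it. The configuration below covers the error, latency, and baseline rules; the VIP and feature-flag rules would need additional attribute-based policies: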

# otel-collector-config.yaml
processors:
  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    policies:
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow-traces
        type: latency
        latency: { threshold_ms: 2000 }
      - name: probabilistic-sample
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }

Database Requirements for Observability 2.0

A single uncompressed wide event can exceed 2KB. At 10,000 requests/second, that's 20MB/s of raw event data. The database must handle this efficiently while supporting both real-time dashboards and ad-hoc exploratory queries.
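
The arithmetic behind those figures, extended to a daily total, can be sketched as follows. The rawIngestBytes function is illustrative and uses decimal units (1 KB = 1,000 bytes).

```typescript
// Raw (uncompressed, unsampled) ingest volume for a given event size and request rate.
function rawIngestBytes(eventBytes: number, requestsPerSecond: number) {
  const perSecond = eventBytes * requestsPerSecond;
  return { perSecond, perDay: perSecond * 86400 }; // 86,400 seconds per day
}

const checkoutVolume = rawIngestBytes(2000, 10000);
console.log(`${checkoutVolume.perSecond / 1e6} MB/s, ${checkoutVolume.perDay / 1e12} TB/day`);
```

At 2 KB per event and 10,000 requests per second this comes to 20 MB/s, or about 1.7 TB of raw data per day before compression and sampling, which is why the storage requirements below matter.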

Core Requirements

graph LR
    subgraph "Ingest"
        OTLP["OTLP / OpenTelemetry"]
        TRANSFORM["Transform Engine\n(pre-processing)"]
    end

    subgraph "Store"
        COLUMNAR["Columnar Storage\n(Parquet / Arrow)"]
        OBJECT["Object Storage\n(S3 / GCS)"]
        MATVIEW["Materialized Views\n(derived metrics)"]
    end

    subgraph "Query"
        ROUTINE["Routine Queries\n(dashboards, alerts)"]
        EXPLORE["Exploratory Queries\n(ad-hoc analysis)"]
        PROMQL["PromQL\n(backward compat)"]
    end

    OTLP --> TRANSFORM --> COLUMNAR
    COLUMNAR --> OBJECT
    COLUMNAR --> MATVIEW
    COLUMNAR --> ROUTINE
    COLUMNAR --> EXPLORE
    MATVIEW --> PROMQL
    MATVIEW --> ROUTINE

    style COLUMNAR fill:#2980b9,color:#fff
    style MATVIEW fill:#8e44ad,color:#fff

| Requirement | Why | How |
| --- | --- | --- |
| Columnar storage | Wide events have 50+ fields; columnar format enables column pruning and vectorized processing | Apache Parquet, Arrow; dictionary/RLE encoding per column |
| Disaggregated compute + storage | Storage scales independently of compute; cost-efficient long-term retention | S3/GCS as primary persistence; local SSD for hot data |
| Dynamic schema | New fields appear as instrumentation evolves; can't ALTER TABLE for every new attribute | Auto-create columns on first occurrence |
| High-cardinality indexing | user_id, trace_id, request_id have millions of unique values | Inverted indexes, skip indexes, bloom filters |
| Real-time ingestion + query | Data must be visible within seconds for dashboards and alerting | WAL + memtable architecture; streaming ingestion |
| Materialized views | Metrics derived from raw events (error rates, p99 latencies) for dashboard performance | Incremental computation; update aggregates without reprocessing |
| PromQL backward compatibility | Existing Grafana dashboards and alert rules must work without rebuild | PromQL query engine on top of columnar store |
| Read replicas | Exploratory analytics must not degrade dashboard/alert performance | Isolated compute for heavy analytical queries |
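
To illustrate the dynamic schema requirement, the sketch below flattens a nested event into dotted column names and registers each column the first time it appears. It is a toy model of the idea, not how any particular database implements it; flattenEvent and DynamicSchema are hypothetical names.

```typescript
type Scalar = string | number | boolean | null;

// Flattens nested objects into dotted column names, e.g. { user: { id } } -> "user.id".
function flattenEvent(event: Record<string, unknown>, prefix = ''): Record<string, Scalar> {
  const row: Record<string, Scalar> = {};
  for (const [key, value] of Object.entries(event)) {
    const column = prefix ? `${prefix}.${key}` : key;
    if (Array.isArray(value)) {
      row[column] = JSON.stringify(value); // arrays are kept as a JSON string in this sketch
    } else if (value !== null && typeof value === 'object') {
      Object.assign(row, flattenEvent(value as Record<string, unknown>, column));
    } else if (value !== undefined) {
      row[column] = value as Scalar;
    }
  }
  return row;
}

// Toy stand-in for a dynamic schema: a column is registered the first time it is seen.
class DynamicSchema {
  private columns = new Map<string, string>(); // column name -> inferred type

  // Returns the names of columns that this row introduced.
  observe(row: Record<string, Scalar>): string[] {
    const added: string[] = [];
    for (const [column, value] of Object.entries(row)) {
      if (!this.columns.has(column)) {
        this.columns.set(column, value === null ? 'unknown' : typeof value);
        added.push(column);
      }
    }
    return added;
  }

  columnNames(): string[] {
    return Array.from(this.columns.keys());
  }
}
```

When a handler starts attaching a new attribute, the first event that carries it extends the schema, and earlier rows simply have no value in that column. This is the behavior that removes the need for a migration per new field.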

Routine vs Exploratory Queries

| Query Type | Purpose | Latency Target | Example |
| --- | --- | --- | --- |
| Routine | Dashboards, alerts, SLO tracking | Sub-second | Error rate by service over last 5 min |
| Exploratory | Ad-hoc debugging, unknown unknowns | Seconds to minutes | "Show me all requests from user X where feature flag Y was on and latency > 2s" |

Removing metrics as first-class citizens doesn't eliminate pre-aggregation — it shifts this responsibility from the application layer to the database via materialized views.
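
What a materialized view does with incoming events can be sketched as an incrementally maintained aggregate: each event updates one bucket, and dashboards read the buckets instead of rescanning raw events. This is a conceptual model only; ErrorRateView is a hypothetical class, and it assumes ISO-8601 UTC timestamps like the one in the sample event.

```typescript
interface MinuteAggregate {
  total: number;
  errors: number;
}

// Toy incremental view: one aggregate per (service, minute), updated once per event.
class ErrorRateView {
  private buckets = new Map<string, MinuteAggregate>();

  apply(event: { service: string; timestamp: string; status_code: number }): void {
    // "2025-01-15T10:23:45.612Z" -> "2025-01-15T10:23"
    const minute = event.timestamp.slice(0, 16);
    const key = `${event.service}|${minute}`;
    const bucket = this.buckets.get(key) ?? { total: 0, errors: 0 };
    bucket.total += 1;
    if (event.status_code >= 500) bucket.errors += 1;
    this.buckets.set(key, bucket);
  }

  // Error rate for one service and minute, or undefined if no events were seen.
  errorRate(service: string, minute: string): number | undefined {
    const bucket = this.buckets.get(`${service}|${minute}`);
    return bucket ? bucket.errors / bucket.total : undefined;
  }
}
```

The cost of a dashboard query then depends on the number of buckets rather than the number of raw events, while the raw events stay available for exploratory queries.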

GreptimeDB Reference Architecture

GreptimeDB is an open-source analytical database purpose-built for O11y 2.0 wide events. It is written in Rust and designed for cloud-native deployments.

graph TB
    subgraph "Ingestion"
        OTLP_IN["OTLP Receiver"]
        PROM_RW["Prometheus Remote Write"]
        TRANSFORM_ENG["Built-in Transform Engine\n(pre-processing, enrichment)"]
    end

    subgraph "GreptimeDB Core"
        INGEST_NODE["Ingest Nodes\n(high-throughput write)"]
        QUERY_NODE["Query Nodes\n(real-time API)"]
        READ_REPLICA["Read Replicas\n(isolated analytics)"]
        MAT_VIEW["Materialized Views\n(metric derivation)"]
        RULE_ENGINE["Rule Engine\n(alerts, triggers)"]
    end

    subgraph "Storage"
        LOCAL_SSD["Local SSD\n(hot data)"]
        S3["Object Storage (S3/GCS)\n(warm + cold data)"]
    end

    subgraph "Consumers"
        GRAFANA["Grafana\n(PromQL dashboards)"]
        SQL_CLIENT["SQL Client\n(exploratory queries)"]
        ALERT_MGR["Alertmanager\n(push notifications)"]
    end

    OTLP_IN --> TRANSFORM_ENG
    PROM_RW --> TRANSFORM_ENG
    TRANSFORM_ENG --> INGEST_NODE
    INGEST_NODE --> LOCAL_SSD
    LOCAL_SSD --> S3

    QUERY_NODE --> LOCAL_SSD
    QUERY_NODE --> S3
    READ_REPLICA --> S3

    MAT_VIEW --> QUERY_NODE
    RULE_ENGINE --> ALERT_MGR

    QUERY_NODE --> GRAFANA
    READ_REPLICA --> SQL_CLIENT
    RULE_ENGINE --> GRAFANA

    style INGEST_NODE fill:#27ae60,color:#fff
    style QUERY_NODE fill:#2980b9,color:#fff
    style READ_REPLICA fill:#8e44ad,color:#fff
    style S3 fill:#e67e22,color:#fff

Key GreptimeDB features for O11y 2.0:

  • Native OTLP ingestion — accepts OpenTelemetry data directly
  • Built-in transform engine — pre-process and enrich events at ingest time
  • Materialized views — derive metrics from raw wide events within the database
  • Read replicas — isolate heavy analytical queries from real-time dashboard queries
  • Rule engine + triggers — push-based alerting without external dependencies
  • Automatic data tiering — hot data on local SSD, warm/cold on S3 with minimal management

GreptimeDB vs ClickHouse for O11y 2.0

| Dimension | GreptimeDB | ClickHouse |
| --- | --- | --- |
| Design intent | Purpose-built for time-series and observability | General-purpose OLAP analytical engine |
| Storage layout | Timestamp-first: data partitioned/sorted by time | Columnar-first: time is just another dimension |
| O11y stack | Native OTLP, PromQL, Jaeger query support | Requires ClickStack or external middleware (Kafka, Redis) |
| Schema | Dynamic: auto-creates columns for new attributes | Requires ALTER TABLE or migrations for new columns |
| Best fit | Observability/telemetry workloads with native OTel | Massive-scale ad-hoc analytical/BI queries |
| Maturity | Newer, rapidly evolving | Battle-tested at massive scale (Cloudflare, Uber) |

Migration: 1.0 to 2.0

The transition is incremental, not big-bang:

  1. Start instrumenting wide events alongside existing logs/metrics (dual-write)
  2. Enrich events with business context in handlers (user, cart, payment details)
  3. Deploy an O11y 2.0-capable backend (GreptimeDB, ClickHouse, or Honeycomb)
  4. Create materialized views that replace existing Prometheus recording rules
  5. Point Grafana dashboards at the new backend via PromQL compatibility
  6. Enable tail sampling to control event volume and cost
  7. Gradually retire separate log/metric/trace pipelines as confidence builds
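
Step 1 (dual-write) can be as small as fanning each event out to more than one destination. The sketch below assumes sinks are plain synchronous functions; the Sink type and the dualWrite function are hypothetical names, and a production emitter would usually batch and send asynchronously.

```typescript
type Sink = (event: Record<string, unknown>) => void;

// Fans one event out to every sink. A failing sink is reported but must not
// break the other sinks or the request that produced the event.
function dualWrite(sinks: Sink[]): Sink {
  return (event) => {
    for (const sink of sinks) {
      try {
        sink(event);
      } catch (err) {
        console.error('wide-event sink failed', err);
      }
    }
  };
}
```

During migration one sink would keep feeding the existing log pipeline while another sends the same event to the new backend, so the two can be compared before the old pipeline is retired.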

Backward compatibility is non-negotiable

Existing Grafana dashboards, alert rules, and trace analysis workflows must be preserved and enhanced, not discarded. The 2.0 backend must speak PromQL for dashboards and support trace views for distributed debugging.