Architecture

Technical deep dive into the Observability 2.0 paradigm — wide event anatomy, implementation patterns, sampling strategies, database requirements, and the GreptimeDB reference architecture.

Observability 1.0 vs 2.0 Architecture

graph TB
    subgraph "Observability 1.0 — Three Pillars"
        direction TB
        APP1["Application"] --> PROM_EXP["Prometheus Exporter"]
        APP1 --> LOG_AGENT["Fluentd / Fluent Bit"]
        APP1 --> OTEL_SDK1["OTel SDK (Traces)"]

        PROM_EXP --> MIMIR["Mimir / Prometheus"]
        LOG_AGENT --> LOKI["Loki / Elasticsearch"]
        OTEL_SDK1 --> TEMPO["Tempo / Jaeger"]

        MIMIR --> GRAFANA1["Grafana Dashboards"]
        LOKI --> GRAFANA1
        TEMPO --> GRAFANA1
    end

    subgraph "Observability 2.0 — Unified Wide Events"
        direction TB
        APP2["Application"] --> WIDE["Wide Event Middleware"]
        WIDE --> OTEL_COL["OTel Collector"]
        OTEL_COL --> UNIFIED_DB["Unified DB\n(GreptimeDB / ClickHouse)"]

        UNIFIED_DB --> DASH["Dashboards (PromQL)"]
        UNIFIED_DB --> EXPLORE["Exploratory Queries (SQL)"]
        UNIFIED_DB --> ALERT["Alerts & Rules"]
        UNIFIED_DB --> TRACE_VIEW["Trace View"]
    end

    style APP1 fill:#e74c3c,color:#fff
    style APP2 fill:#27ae60,color:#fff
    style UNIFIED_DB fill:#2980b9,color:#fff

Key difference: In 1.0, the application emits three separate signal types to three separate backends. In 2.0, the application emits one wide event per request, routed through a single pipeline to a single store that powers all views.

Wide Event Anatomy

A wide event captures the complete context of a single request in one structured record. This example from a checkout service carries 30 fields, grouped by the kind of context they describe:

{
  "timestamp": "2025-01-15T10:23:45.612Z",
  "request_id": "req_8bf7ec2d",
  "trace_id": "abc123",

  "service": "checkout-service",
  "version": "2.4.1",
  "deployment_id": "deploy_789",
  "region": "us-east-1",

  "method": "POST",
  "path": "/api/checkout",
  "status_code": 500,
  "duration_ms": 1247,

  "user": {
    "id": "user_456",
    "subscription": "premium",
    "account_age_days": 847,
    "lifetime_value_cents": 284700
  },

  "cart": {
    "id": "cart_xyz",
    "item_count": 3,
    "total_cents": 15999,
    "coupon_applied": "SAVE20"
  },

  "payment": {
    "method": "card",
    "provider": "stripe",
    "latency_ms": 1089,
    "attempt": 3
  },

  "error": {
    "type": "PaymentError",
    "code": "card_declined",
    "message": "Card declined by issuer",
    "retriable": false,
    "stripe_decline_code": "insufficient_funds"
  },

  "feature_flags": {
    "new_checkout_flow": true,
    "express_payment": false
  }
}

When a user complains, searching user.id = "user_456" instantly reveals:

Premium customer, 2+ year account (high priority)
Payment failed on 3rd attempt — insufficient funds
Using the new checkout flow (potential correlation?)
No grep-ing, no guessing, no second search

Context Groups in a Wide Event

Group	Fields	Purpose
Identity	`request_id`, `trace_id`, `timestamp`	Correlation and ordering
Infrastructure	`service`, `version`, `deployment_id`, `region`	Where the event happened
Request	`method`, `path`, `status_code`, `duration_ms`	What happened
User / Business	`user.id`, `user.subscription`, `user.account_age_days`, `user.lifetime_value_cents`	Who was affected and business impact
Operation	`payment.method`, `payment.provider`, `payment.latency_ms`	Domain-specific operation details
Error	`error.type`, `error.code`, `error.message`, `error.retriable`	Failure specifics
Experiments	`feature_flags.*`	Active experiments for correlation analysis
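
One way to keep these groups consistent across services is to encode them in a shared type. The sketch below is an assumption about how such a type could look, not an official schema: the names `CheckoutWideEvent` and `missingGroups` are hypothetical, and the fields simply mirror the sample event above.

```typescript
// Hypothetical shared type mirroring the sample checkout event above.
interface CheckoutWideEvent {
  // Identity
  timestamp: string;
  request_id: string;
  trace_id?: string;
  // Infrastructure
  service: string;
  version: string;
  deployment_id: string;
  region: string;
  // Request (status and duration are only known when the request finishes)
  method: string;
  path: string;
  status_code?: number;
  duration_ms?: number;
  // Business and operation context, filled in by handlers
  user?: { id: string; subscription: string; account_age_days: number; lifetime_value_cents: number };
  cart?: { id: string; item_count: number; total_cents: number; coupon_applied?: string };
  payment?: { method: string; provider: string; latency_ms: number; attempt: number };
  error?: { type: string; code?: string; message?: string; retriable?: boolean };
  feature_flags?: Record<string, boolean>;
}

// Optional groups that handlers are expected to attach.
const enrichedGroups = ['user', 'cart', 'payment', 'feature_flags'] as const;

// Report which optional context groups were never attached, for example to
// flag handlers that forgot to enrich the event.
function missingGroups(ev: CheckoutWideEvent): string[] {
  return enrichedGroups.filter((group) => ev[group] === undefined);
}
```

A check like `missingGroups` could run in tests or in the emitting middleware to catch handlers that silently stop enriching events.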

Queries Enabled by Wide Events

With wide events you run analytics on production traffic, not string searches on logs:

-- Premium users hitting payment errors in the last hour with new checkout flow
SELECT user.id, error.code, payment.attempt, duration_ms
FROM events
WHERE status_code >= 500
  AND user.subscription = 'premium'
  AND feature_flags.new_checkout_flow = true
  AND timestamp > NOW() - INTERVAL '1 hour'
ORDER BY user.lifetime_value_cents DESC;

-- Error rate by deployment, grouped by region
SELECT deployment_id, region,
       COUNT(*) FILTER (WHERE status_code >= 500) AS errors,
       COUNT(*) AS total,
       ROUND(100.0 * COUNT(*) FILTER (WHERE status_code >= 500) / COUNT(*), 2) AS error_pct
FROM events
WHERE timestamp > NOW() - INTERVAL '15 minutes'
GROUP BY deployment_id, region
ORDER BY error_pct DESC;

-- P99 latency by service version (canary vs stable)
SELECT version,
       PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY duration_ms) AS p99_ms,
       COUNT(*) AS request_count
FROM events
WHERE service = 'checkout-service'
  AND timestamp > NOW() - INTERVAL '1 hour'
GROUP BY version;
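
To make the last query concrete: `PERCENTILE_CONT` sorts the values and interpolates linearly between the two nearest ranks. The function below is a minimal in-memory sketch of that calculation, useful for checking query results against a small fixture. The name `percentileCont` is illustrative, and databases may use approximate algorithms at scale.

```typescript
// Continuous percentile with linear interpolation between neighboring ranks.
// `p` is a fraction in [0, 1], e.g. 0.99 for p99.
function percentileCont(values: number[], p: number): number {
  if (values.length === 0) throw new Error('no values');
  if (p < 0 || p > 1) throw new Error('p must be in [0, 1]');
  const sorted = [...values].sort((a, b) => a - b);
  const position = p * (sorted.length - 1); // fractional zero-based rank
  const lower = Math.floor(position);
  const upper = Math.ceil(position);
  return sorted[lower] + (sorted[upper] - sorted[lower]) * (position - lower);
}
```

For durations of 10, 20, 30, and 40 ms, the median (`p = 0.5`) falls halfway between ranks 1 and 2, giving 25 ms.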

Implementation Pattern

The key insight: build the event throughout the request lifecycle, then emit once at the end.
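
Independent of any web framework, the pattern can be sketched as a wrapper that owns the event, lets the work enrich it, and emits exactly once in a `finally` block. This is a simplified, synchronous illustration under assumed names (`withWideEvent`, `LifecycleEvent`). Real request handlers are asynchronous.

```typescript
type LifecycleEvent = Record<string, unknown>;

// Run `work` with a mutable event, then emit exactly once, even if `work` throws.
// `now` is injectable so the timing can be tested deterministically.
function withWideEvent<T>(
  base: LifecycleEvent,
  emit: (ev: LifecycleEvent) => void,
  work: (ev: LifecycleEvent) => T,
  now: () => number = Date.now,
): T {
  const ev: LifecycleEvent = { ...base };
  const start = now();
  try {
    const result = work(ev); // the work enriches `ev` as it goes
    ev.outcome = 'success';
    return result;
  } catch (err) {
    ev.outcome = 'error';
    ev.error = {
      type: err instanceof Error ? err.name : 'Unknown',
      message: err instanceof Error ? err.message : String(err),
    };
    throw err;
  } finally {
    ev.duration_ms = now() - start;
    emit(ev); // the single emission point
  }
}
```

Because the emit sits in `finally`, failed requests produce an event just like successful ones, which is what makes outcome-based sampling possible later.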

Middleware Approach (TypeScript / Hono)

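// middleware/wideEvent.ts
import type { MiddlewareHandler } from 'hono';
// Assumption: the app exports a structured JSON logger (for example pino)
import { logger } from '../logger';

type HandlerError = Error & { code?: string; retriable?: boolean };

// Convert a thrown error into the event's error group
function toErrorGroup(error: HandlerError) {
  return {
    type: error.name,
    message: error.message,
    code: error.code,
    retriable: error.retriable ?? false,
  };
}

export function wideEventMiddleware(): MiddlewareHandler {
  return async (ctx, next) => {
    const startTime = Date.now();

    // Initialize the wide event with request context
    const event: Record<string, unknown> = {
      request_id: ctx.get('requestId'),
      timestamp: new Date().toISOString(),
      method: ctx.req.method,
      path: ctx.req.path,
      service: process.env.SERVICE_NAME,
      version: process.env.SERVICE_VERSION,
      deployment_id: process.env.DEPLOYMENT_ID,
      region: process.env.REGION,
    };

    // Make the event accessible to handlers
    ctx.set('wideEvent', event);

    try {
      await next();
      event.status_code = ctx.res.status;
      // Hono hands handler exceptions to app.onError and exposes them as
      // ctx.error, so they usually do not reach the catch block below
      if (ctx.error) event.error = toErrorGroup(ctx.error as HandlerError);
      event.outcome = event.error || ctx.res.status >= 500 ? 'error' : 'success';
    } catch (error) {
      // Defensive path for anything that does propagate
      event.status_code = 500;
      event.outcome = 'error';
      event.error = toErrorGroup(error as HandlerError);
      throw error;
    } finally {
      event.duration_ms = Date.now() - startTime;

      // Emit the wide event: ONE log line per request
      logger.info(event);
    }
  };
}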

Handler Enrichment

Handlers enrich the event with business context as they process the request:

app.post('/checkout', async (ctx) => {
  const event = ctx.get('wideEvent');
  const user = ctx.get('user');

  // Add user context
  event.user = {
    id: user.id,
    subscription: user.subscription,
    account_age_days: daysSince(user.createdAt),
    lifetime_value_cents: user.ltv,
  };

  // Add business context as you process
  const cart = await getCart(user.id);
  event.cart = {
    id: cart.id,
    item_count: cart.items.length,
    total_cents: cart.total,
    coupon_applied: cart.coupon?.code,
  };

  // Process payment — measure sub-operation latency
  const paymentStart = Date.now();
  const payment = await processPayment(cart, user);

  event.payment = {
    method: payment.method,
    provider: payment.provider,
    latency_ms: Date.now() - paymentStart,
    attempt: payment.attemptNumber,
  };

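  // If payment fails, add error details and fail the request instead of
  // returning an order ID
  if (payment.error) {
    event.error = {
      type: 'PaymentError',
      code: payment.error.code,
      stripe_decline_code: payment.error.declineCode,
    };
    // 500 mirrors the sample event above; use whatever status your API defines
    return ctx.json({ error: payment.error.code }, 500);
  }

  return ctx.json({ orderId: payment.orderId });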
});

Wide Event Data Flow

sequenceDiagram
    participant Client
    participant Middleware as Wide Event Middleware
    participant Handler as Request Handler
    participant DB as Business Logic / DB
    participant Logger as Event Emitter
    participant Backend as O11y Backend

    Client->>Middleware: POST /checkout
    Middleware->>Middleware: Initialize event (request_id, method, path, service, version)
    Middleware->>Handler: next()

    Handler->>Handler: Enrich event (user.id, subscription, LTV)
    Handler->>DB: getCart(user.id)
    DB-->>Handler: cart data
    Handler->>Handler: Enrich event (cart.id, item_count, total)
    Handler->>DB: processPayment(cart, user)
    DB-->>Handler: payment result
    Handler->>Handler: Enrich event (payment.method, latency_ms, attempt)

    Handler-->>Middleware: response

    Middleware->>Middleware: Finalize event (duration_ms, status_code, outcome)
    Middleware->>Logger: logger.info(event)
    Logger->>Backend: Single wide event (~50 fields)
    Middleware-->>Client: HTTP response

Tail Sampling Implementation

Tail sampling makes the keep/drop decision after the request completes, based on outcome. This keeps costs manageable while never losing the events that matter.

// Minimal shape of the fields the sampler reads
interface WideEvent {
  status_code?: number;
  duration_ms?: number;
  error?: unknown;
  user?: { subscription?: string };
  feature_flags?: Record<string, boolean>;
}

// Tail sampling decision function
function shouldSample(event: WideEvent): boolean {
  // Always keep errors
  if ((event.status_code ?? 0) >= 500) return true;
  if (event.error) return true;

  // Always keep slow requests (2000 ms stands in for the service's measured p99)
  if ((event.duration_ms ?? 0) > 2000) return true;

  // Always keep VIP users
  if (event.user?.subscription === 'enterprise') return true;

  // Always keep requests with specific feature flags (debugging rollouts)
  if (event.feature_flags?.new_checkout_flow) return true;

  // Random sample the rest at 5%
  return Math.random() < 0.05;
}
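
`Math.random()` gives every service an independent coin flip, so a request that crosses several services may be kept by one and dropped by another. A common alternative is to derive the baseline decision from a hash of the trace ID. The sketch below is an assumption, not part of the implementation above: `keepByTraceId` is a hypothetical helper, and FNV-1a (applied here to UTF-16 code units) is used only because it is short enough to inline.

```typescript
// 32-bit FNV-1a hash of a string, inlined to keep the sketch dependency-free.
function fnv1a32(text: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    // Multiply by the FNV prime 16777619 (2^24 + 2^8 + 2^7 + 2^4 + 2^1 + 1)
    // using shifts, then wrap to an unsigned 32-bit value.
    hash = (hash + (hash << 1) + (hash << 4) + (hash << 7) + (hash << 8) + (hash << 24)) >>> 0;
  }
  return hash;
}

// Deterministic keep/drop: the same trace ID and rate always give the same answer.
function keepByTraceId(traceId: string, rate: number): boolean {
  return fnv1a32(traceId) / 0x100000000 < rate;
}
```

Replacing the final `Math.random() < 0.05` with `keepByTraceId(traceId, 0.05)` would make the baseline sample consistent across services that share the trace ID.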

Sampling Rules Summary

Rule	Keep Rate	Rationale
Errors (5xx, exceptions)	100%	Never lose failure evidence
Slow requests (> p99)	100%	Tail latency is where problems hide
VIP / enterprise users	100%	Business-critical — immediate escalation
Feature flag rollouts	100%	Correlate new code with new failures
Everything else	1-5%	Happy, fast requests — sample for baselines
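
Because kept events are no longer a uniform slice of traffic, raw counts over them are biased toward errors. One common remedy, sketched below under the assumption of a hypothetical `sample_rate` field, is to record on each kept event how many requests it represents and to sum those weights at query time.

```typescript
interface SampledRecord {
  status_code: number;
  // How many original requests this event stands for:
  // 1 for events kept by a 100% rule, 20 for events kept by a 5% rule (1 / 0.05).
  sample_rate: number;
}

// Estimate the true error percentage from sampled events by weighting each one.
function weightedErrorPct(records: SampledRecord[]): number {
  let total = 0;
  let errors = 0;
  for (const record of records) {
    total += record.sample_rate;
    if (record.status_code >= 500) errors += record.sample_rate;
  }
  return total === 0 ? 0 : (100 * errors) / total;
}
```

With 10 errors kept at 100% and 50 healthy requests kept at 5%, an unweighted count suggests a 16.7% error rate, while the weighted estimate is about 1%. In SQL the same idea amounts to replacing `COUNT(*)` with a sum over the weight column.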

Naive random sampling is dangerous

If you uniformly sample 1% of all traffic, any given request has a 99% chance of being dropped, including the one that explains your outage. Always use tail sampling with outcome-based rules.
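
The risk can be quantified for incidents that touch more than one request. Under uniform sampling at rate r, the chance that all k affected requests are dropped is (1 - r)^k. The helper below is only a worked illustration of that formula, and its name is made up for this example.

```typescript
// Probability that uniform sampling at `rate` keeps none of `affected` requests.
function probAllDropped(rate: number, affected: number): number {
  return Math.pow(1 - rate, affected);
}
```

At a 1% rate, even an incident that affects 50 requests leaves no evidence roughly 60% of the time (`probAllDropped(0.01, 50)`).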

OTel Collector Tail Sampling

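Much of the same logic can run in the OpenTelemetry Collector's tail sampling processor (part of the Collector contrib distribution), which buffers the spans of a trace for `decision_wait` and then decides whether to keep the whole trace. In the configuration below, a trace is kept when any of the three policies matches. It covers the error, latency, and baseline rules; attribute-based rules such as the VIP rule can be expressed with a `string_attribute` policy.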

# otel-collector-config.yaml
processors:
  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    policies:
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow-traces
        type: latency
        latency: { threshold_ms: 2000 }
      - name: probabilistic-sample
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }

Database Requirements for Observability 2.0

A single uncompressed wide event can exceed 2 KB. At 10,000 requests/second, that's 20 MB/s of raw event data. The database must handle this efficiently while supporting both real-time dashboards and ad-hoc exploratory queries.
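
The arithmetic behind these figures, and the effect of tail sampling on them, can be reproduced with a few lines. This is a back-of-the-envelope sketch: the function names are illustrative, and the share of always-kept traffic used in the note below is an assumption, not a measurement.

```typescript
// Raw ingest rate in bytes per second (decimal units: 1 KB = 1000 bytes).
function rawBytesPerSecond(eventBytes: number, requestsPerSecond: number): number {
  return eventBytes * requestsPerSecond;
}

// Daily volume for a sustained byte rate.
function bytesPerDay(bytesPerSecond: number): number {
  return bytesPerSecond * 86400;
}

// Fraction of events stored when `alwaysKeptShare` of traffic matches a
// 100% rule and the remainder is sampled at `baselineRate`.
function keptFraction(alwaysKeptShare: number, baselineRate: number): number {
  return alwaysKeptShare + (1 - alwaysKeptShare) * baselineRate;
}
```

`rawBytesPerSecond(2000, 10000)` gives 20,000,000 bytes per second, which `bytesPerDay` turns into about 1.7 TB per day. If, hypothetically, 2% of traffic matched an always-keep rule and the rest were sampled at 5%, about 6.9% of that volume would be stored.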

Core Requirements

graph LR
    subgraph "Ingest"
        OTLP["OTLP / OpenTelemetry"]
        TRANSFORM["Transform Engine\n(pre-processing)"]
    end

    subgraph "Store"
        COLUMNAR["Columnar Storage\n(Parquet / Arrow)"]
        OBJECT["Object Storage\n(S3 / GCS)"]
        MATVIEW["Materialized Views\n(derived metrics)"]
    end

    subgraph "Query"
        ROUTINE["Routine Queries\n(dashboards, alerts)"]
        EXPLORE["Exploratory Queries\n(ad-hoc analysis)"]
        PROMQL["PromQL\n(backward compat)"]
    end

    OTLP --> TRANSFORM --> COLUMNAR
    COLUMNAR --> OBJECT
    COLUMNAR --> MATVIEW
    COLUMNAR --> ROUTINE
    COLUMNAR --> EXPLORE
    MATVIEW --> PROMQL
    MATVIEW --> ROUTINE

    style COLUMNAR fill:#2980b9,color:#fff
    style MATVIEW fill:#8e44ad,color:#fff

Requirement	Why	How
Columnar storage	Wide events have 50+ fields; columnar format enables column pruning and vectorized processing	Apache Parquet, Arrow; dictionary/RLE encoding per column
Disaggregated compute + storage	Storage scales independently of compute; cost-efficient long-term retention	S3/GCS as primary persistence; local SSD for hot data
Dynamic schema	New fields appear as instrumentation evolves; can't ALTER TABLE for every new attribute	Auto-create columns on first occurrence
High-cardinality indexing	`user.id`, `trace_id`, `request_id` have millions of unique values	Inverted indexes, skip indexes, bloom filters
Real-time ingestion + query	Data must be visible within seconds for dashboards and alerting	WAL + memtable architecture; streaming ingestion
Materialized views	Metrics derived from raw events (error rates, p99 latencies) for dashboard performance	Incremental computation; update aggregates without reprocessing
PromQL backward compatibility	Existing Grafana dashboards and alert rules must work without rebuild	PromQL query engine on top of columnar store
Read replicas	Exploratory analytics must not degrade dashboard/alert performance	Isolated compute for heavy analytical queries
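
The dynamic schema and columnar storage rows can be illustrated together with a toy in-memory store. This is only a sketch of the idea (flatten nested groups into dotted column names, create a column the first time a field appears, back-fill earlier rows with nulls). It does not reflect how any particular database implements it, and `flattenEvent` and `ToyColumnStore` are made-up names.

```typescript
type Scalar = string | number | boolean | null;

// Flatten nested groups into dotted column names: { user: { id } } -> "user.id".
function flattenEvent(input: Record<string, unknown>, prefix = ''): Record<string, Scalar> {
  const out: Record<string, Scalar> = {};
  for (const key of Object.keys(input)) {
    const value = input[key];
    const column = prefix ? `${prefix}.${key}` : key;
    if (Array.isArray(value)) {
      out[column] = JSON.stringify(value); // keep arrays as opaque JSON text
    } else if (value !== null && typeof value === 'object') {
      const nested = flattenEvent(value as Record<string, unknown>, column);
      for (const nestedKey of Object.keys(nested)) out[nestedKey] = nested[nestedKey];
    } else {
      out[column] = value as Scalar;
    }
  }
  return out;
}

// One array per column; every column always holds one cell per inserted row.
class ToyColumnStore {
  readonly columns: Record<string, Scalar[]> = {};
  private rowCount = 0;

  insert(row: Record<string, unknown>): void {
    const flat = flattenEvent(row);
    // Auto-create columns on first occurrence, back-filled with nulls.
    for (const column of Object.keys(flat)) {
      if (!(column in this.columns)) {
        const backfill: Scalar[] = [];
        for (let i = 0; i < this.rowCount; i++) backfill.push(null);
        this.columns[column] = backfill;
      }
    }
    // Append one cell per column; fields missing from this row become null.
    for (const column of Object.keys(this.columns)) {
      this.columns[column].push(column in flat ? flat[column] : null);
    }
    this.rowCount++;
  }
}
```

A query that only needs `status_code` and `region` would read two arrays and skip the rest, which is the column pruning the table refers to.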

Routine vs Exploratory Queries

Query Type	Purpose	Latency Target	Example
Routine	Dashboards, alerts, SLO tracking	Sub-second	Error rate by service over last 5 min
Exploratory	Ad-hoc debugging, unknown unknowns	Seconds to minutes	"Show me all requests from user X where feature flag Y was on and latency > 2s"

Removing metrics as first-class citizens doesn't eliminate pre-aggregation — it shifts this responsibility from the application layer to the database via materialized views.
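
What the database takes over can be pictured as an incrementally maintained rollup: each incoming event updates a small aggregate row, and dashboards read the aggregates instead of rescanning raw events. The class below is a conceptual sketch with hypothetical names (`ErrorRateRollup`, `RollupInput`). It assumes ISO 8601 UTC timestamps and is not how any specific engine implements materialized views.

```typescript
interface RollupInput {
  timestamp: string; // ISO 8601 in UTC, e.g. "2025-01-15T10:23:45.612Z"
  service: string;
  status_code: number;
}

interface RollupRow {
  total: number;
  errors: number;
}

// Per-service, per-minute error counts maintained one event at a time.
class ErrorRateRollup {
  private readonly rows: Record<string, RollupRow> = {};

  // Apply one event incrementally; earlier events are never re-read.
  apply(ev: RollupInput): void {
    const minute = ev.timestamp.slice(0, 16); // "2025-01-15T10:23"
    const key = `${ev.service}|${minute}`;
    let row = this.rows[key];
    if (!row) {
      row = { total: 0, errors: 0 };
      this.rows[key] = row;
    }
    row.total += 1;
    if (ev.status_code >= 500) row.errors += 1;
  }

  // Error percentage for one service and minute, or undefined if no data.
  errorPct(service: string, minute: string): number | undefined {
    const row = this.rows[`${service}|${minute}`];
    return row ? (100 * row.errors) / row.total : undefined;
  }
}
```

The aggregate rows stay small no matter how many raw events arrive, which is why routine dashboard queries against them can meet sub-second targets.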

GreptimeDB Reference Architecture

GreptimeDB is an open-source analytical database purpose-built for O11y 2.0 wide events. It is written in Rust and designed for cloud-native deployments.

graph TB
    subgraph "Ingestion"
        OTLP_IN["OTLP Receiver"]
        PROM_RW["Prometheus Remote Write"]
        TRANSFORM_ENG["Built-in Transform Engine\n(pre-processing, enrichment)"]
    end

    subgraph "GreptimeDB Core"
        INGEST_NODE["Ingest Nodes\n(high-throughput write)"]
        QUERY_NODE["Query Nodes\n(real-time API)"]
        READ_REPLICA["Read Replicas\n(isolated analytics)"]
        MAT_VIEW["Materialized Views\n(metric derivation)"]
        RULE_ENGINE["Rule Engine\n(alerts, triggers)"]
    end

    subgraph "Storage"
        LOCAL_SSD["Local SSD\n(hot data)"]
        S3["Object Storage (S3/GCS)\n(warm + cold data)"]
    end

    subgraph "Consumers"
        GRAFANA["Grafana\n(PromQL dashboards)"]
        SQL_CLIENT["SQL Client\n(exploratory queries)"]
        ALERT_MGR["Alertmanager\n(push notifications)"]
    end

    OTLP_IN --> TRANSFORM_ENG
    PROM_RW --> TRANSFORM_ENG
    TRANSFORM_ENG --> INGEST_NODE
    INGEST_NODE --> LOCAL_SSD
    LOCAL_SSD --> S3

    QUERY_NODE --> LOCAL_SSD
    QUERY_NODE --> S3
    READ_REPLICA --> S3

    MAT_VIEW --> QUERY_NODE
    RULE_ENGINE --> ALERT_MGR

    QUERY_NODE --> GRAFANA
    READ_REPLICA --> SQL_CLIENT
    RULE_ENGINE --> GRAFANA

    style INGEST_NODE fill:#27ae60,color:#fff
    style QUERY_NODE fill:#2980b9,color:#fff
    style READ_REPLICA fill:#8e44ad,color:#fff
    style S3 fill:#e67e22,color:#fff

Key GreptimeDB features for O11y 2.0:

Native OTLP ingestion — accepts OpenTelemetry data directly
Built-in transform engine — pre-process and enrich events at ingest time
Materialized views — derive metrics from raw wide events within the database
Read replicas — isolate heavy analytical queries from real-time dashboard queries
Rule engine + triggers — push-based alerting without external dependencies
Automatic data tiering — hot data on local SSD, warm/cold on S3 with minimal management

GreptimeDB vs ClickHouse for O11y 2.0

Dimension	GreptimeDB	ClickHouse
Design intent	Purpose-built for time-series and observability	General-purpose OLAP analytical engine
Storage layout	Timestamp-first: data partitioned/sorted by time	Columnar-first: time is just another dimension
O11y stack	Native OTLP, PromQL, Jaeger query support	Usually paired with ClickStack or an ingestion layer such as the OTel Collector's ClickHouse exporter
Schema	Dynamic: auto-creates columns for new attributes	Fixed columns need `ALTER TABLE`; open-ended attributes usually go into `Map` or `JSON` columns
Best fit	Observability/telemetry workloads with native OTel	Massive-scale ad-hoc analytical/BI queries
Maturity	Newer, rapidly evolving	Battle-tested at massive scale (Cloudflare, Uber)

Migration: 1.0 to 2.0

The transition is incremental, not big-bang:

Start instrumenting wide events alongside existing logs/metrics (dual-write)
Enrich events with business context in handlers (user, cart, payment details)
Deploy an O11y 2.0-capable backend (GreptimeDB, ClickHouse, or Honeycomb)
Create materialized views that replace existing Prometheus recording rules
Point Grafana dashboards at the new backend via PromQL compatibility
Enable tail sampling to control event volume and cost
Gradually retire separate log/metric/trace pipelines as confidence builds
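
During the dual-write phase, the new pipeline should not be able to break the existing one. A minimal sketch of that isolation, using hypothetical names (`dualWrite`, `EventSink`), fans each event out to every sink and reports sink failures instead of letting them propagate:

```typescript
type EventSink = (ev: Record<string, unknown>) => void;

// Fan one event out to several named sinks. A failing sink is reported through
// `onSinkError` and does not stop the remaining sinks or the request itself.
function dualWrite(
  sinks: Record<string, EventSink>,
  onSinkError: (sinkName: string, err: unknown) => void,
): EventSink {
  return (ev) => {
    for (const sinkName of Object.keys(sinks)) {
      try {
        sinks[sinkName](ev);
      } catch (err) {
        onSinkError(sinkName, err);
      }
    }
  };
}
```

The returned function can stand in wherever a single emit function was used before, so retiring the legacy sink later is a one-line change.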

Backward compatibility is non-negotiable

Existing Grafana dashboards, alert rules, and trace analysis workflows must be preserved and enhanced, not discarded. The 2.0 backend must speak PromQL for dashboards and support trace views for distributed debugging.