Architecture¶
Technical deep dive into the Observability 2.0 paradigm — wide event anatomy, implementation patterns, sampling strategies, database requirements, and the GreptimeDB reference architecture.
Observability 1.0 vs 2.0 Architecture¶
graph TB
subgraph "Observability 1.0 — Three Pillars"
direction TB
APP1["Application"] --> PROM_EXP["Prometheus Exporter"]
APP1 --> LOG_AGENT["Fluentd / Fluent Bit"]
APP1 --> OTEL_SDK1["OTel SDK (Traces)"]
PROM_EXP --> MIMIR["Mimir / Prometheus"]
LOG_AGENT --> LOKI["Loki / Elasticsearch"]
OTEL_SDK1 --> TEMPO["Tempo / Jaeger"]
MIMIR --> GRAFANA1["Grafana Dashboards"]
LOKI --> GRAFANA1
TEMPO --> GRAFANA1
end
subgraph "Observability 2.0 — Unified Wide Events"
direction TB
APP2["Application"] --> WIDE["Wide Event Middleware"]
WIDE --> OTEL_COL["OTel Collector"]
OTEL_COL --> UNIFIED_DB["Unified DB\n(GreptimeDB / ClickHouse)"]
UNIFIED_DB --> DASH["Dashboards (PromQL)"]
UNIFIED_DB --> EXPLORE["Exploratory Queries (SQL)"]
UNIFIED_DB --> ALERT["Alerts & Rules"]
UNIFIED_DB --> TRACE_VIEW["Trace View"]
end
style APP1 fill:#e74c3c,color:#fff
style APP2 fill:#27ae60,color:#fff
style UNIFIED_DB fill:#2980b9,color:#fff
Key difference: In 1.0, the application emits three separate signal types to three separate backends. In 2.0, the application emits one wide event per request, routed through a single pipeline to a single store that powers all views.
Wide Event Anatomy¶
A wide event captures the complete context of a single request in one structured record. This example from a checkout service shows approximately 30 fields across 6 context groups:
{
"timestamp": "2025-01-15T10:23:45.612Z",
"request_id": "req_8bf7ec2d",
"trace_id": "abc123",
"service": "checkout-service",
"version": "2.4.1",
"deployment_id": "deploy_789",
"region": "us-east-1",
"method": "POST",
"path": "/api/checkout",
"status_code": 500,
"duration_ms": 1247,
"user": {
"id": "user_456",
"subscription": "premium",
"account_age_days": 847,
"lifetime_value_cents": 284700
},
"cart": {
"id": "cart_xyz",
"item_count": 3,
"total_cents": 15999,
"coupon_applied": "SAVE20"
},
"payment": {
"method": "card",
"provider": "stripe",
"latency_ms": 1089,
"attempt": 3
},
"error": {
"type": "PaymentError",
"code": "card_declined",
"message": "Card declined by issuer",
"retriable": false,
"stripe_decline_code": "insufficient_funds"
},
"feature_flags": {
"new_checkout_flow": true,
"express_payment": false
}
}
When a user complains, a single search on user.id = "user_456" reveals:
- Premium customer, 2+ year account (high priority)
- Payment failed on 3rd attempt — insufficient funds
- Using the new checkout flow (potential correlation?)
- No grep-ing, no guessing, no second search
Context Groups in a Wide Event¶
| Group | Fields | Purpose |
|---|---|---|
| Identity | request_id, trace_id, timestamp | Correlation and ordering |
| Infrastructure | service, version, deployment_id, region | Where the event happened |
| Request | method, path, status_code, duration_ms | What happened |
| User / Business | user.id, subscription, account_age_days, lifetime_value_cents | Who was affected and business impact |
| Operation | payment.method, payment.provider, payment.latency_ms | Domain-specific operation details |
| Error | error.type, error.code, error.message, error.retriable | Failure specifics |
| Experiments | feature_flags.* | Active experiments for correlation analysis |
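To make the grouping concrete, the example event can be described with a type. This is a minimal sketch: the `WideEvent` interface and the `countFields` helper are illustrative names that do not appear in the original, and which fields are optional is an assumption.

```typescript
// Hypothetical type for the checkout wide event shown above.
// Identity, infrastructure, and basic request fields are required;
// the other groups are filled in as the request progresses.
interface WideEvent {
  // Identity
  timestamp: string;
  request_id: string;
  trace_id?: string;
  // Infrastructure
  service: string;
  version: string;
  deployment_id: string;
  region: string;
  // Request
  method: string;
  path: string;
  status_code?: number;
  duration_ms?: number;
  // User / business
  user?: { id: string; subscription: string; account_age_days: number; lifetime_value_cents: number };
  cart?: { id: string; item_count: number; total_cents: number; coupon_applied?: string };
  // Operation
  payment?: { method: string; provider: string; latency_ms: number; attempt: number };
  // Error (provider-specific details such as stripe_decline_code are allowed)
  error?: { type: string; code?: string; message?: string; retriable?: boolean; [detail: string]: unknown };
  // Experiments
  feature_flags?: Record<string, boolean>;
}

// Counts leaf fields, which is the "width" of a wide event.
function countFields(value: unknown): number {
  if (value === null || typeof value !== 'object') return 1;
  const obj = value as Record<string, unknown>;
  return Object.keys(obj).reduce((sum, key) => sum + countFields(obj[key]), 0);
}
```

Applied to the checkout example above, `countFields` returns 30.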
Queries Enabled by Wide Events¶
With wide events you run analytics on production traffic, not string searches on logs:
-- Premium users hitting payment errors in the last hour with new checkout flow
SELECT user.id, error.code, payment.attempt, duration_ms
FROM events
WHERE status_code >= 500
AND user.subscription = 'premium'
AND feature_flags.new_checkout_flow = true
AND timestamp > NOW() - INTERVAL '1 hour'
ORDER BY user.lifetime_value_cents DESC;
-- Error rate by deployment, grouped by region
SELECT deployment_id, region,
COUNT(*) FILTER (WHERE status_code >= 500) AS errors,
COUNT(*) AS total,
ROUND(100.0 * COUNT(*) FILTER (WHERE status_code >= 500) / COUNT(*), 2) AS error_pct
FROM events
WHERE timestamp > NOW() - INTERVAL '15 minutes'
GROUP BY deployment_id, region
ORDER BY error_pct DESC;
-- P99 latency by service version (canary vs stable)
SELECT version,
PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY duration_ms) AS p99_ms,
COUNT(*) AS request_count
FROM events
WHERE service = 'checkout-service'
AND timestamp > NOW() - INTERVAL '1 hour'
GROUP BY version;
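For readers less familiar with SQL aggregates, the error-rate query above can be read as the following in-memory computation. It is only a sketch of what the database evaluates; `EventRow`, `ErrorRate`, and `errorRateByDeployment` are hypothetical names, and the time filter is assumed to have been applied already.

```typescript
interface EventRow {
  deployment_id: string;
  region: string;
  status_code: number;
}

interface ErrorRate {
  deployment_id: string;
  region: string;
  errors: number;
  total: number;
  error_pct: number;
}

// Groups events by (deployment_id, region) and computes the 5xx percentage,
// rounded to two decimals and sorted worst-first like the SQL query.
function errorRateByDeployment(events: EventRow[]): ErrorRate[] {
  const groups = new Map<string, ErrorRate>();
  for (const e of events) {
    const key = `${e.deployment_id}|${e.region}`;
    let group = groups.get(key);
    if (!group) {
      group = { deployment_id: e.deployment_id, region: e.region, errors: 0, total: 0, error_pct: 0 };
      groups.set(key, group);
    }
    group.total += 1;
    if (e.status_code >= 500) group.errors += 1;
  }
  const rows = Array.from(groups.values());
  for (const row of rows) {
    row.error_pct = Math.round((10000 * row.errors) / row.total) / 100;
  }
  return rows.sort((a, b) => b.error_pct - a.error_pct);
}
```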
Implementation Pattern¶
The key insight: build the event throughout the request lifecycle, then emit once at the end.
Middleware Approach (TypeScript / Hono)¶
// middleware/wideEvent.ts
import type { MiddlewareHandler } from 'hono';
// Assumption: `logger` is any structured JSON logger (for example pino) exported by your app.
import { logger } from '../logger';

type HandlerError = Error & { code?: string; retriable?: boolean };

export function wideEventMiddleware(): MiddlewareHandler {
  return async (ctx, next) => {
    const startTime = Date.now();
    // Initialize the wide event with request context
    const event: Record<string, unknown> = {
      request_id: ctx.get('requestId'),
      timestamp: new Date().toISOString(),
      method: ctx.req.method,
      path: ctx.req.path,
      service: process.env.SERVICE_NAME,
      version: process.env.SERVICE_VERSION,
      deployment_id: process.env.DEPLOYMENT_ID,
      region: process.env.REGION,
    };
    // Make the event accessible to handlers
    ctx.set('wideEvent', event);
    try {
      await next();
    } finally {
      // Hono passes handler exceptions to app.onError and exposes them as ctx.error,
      // so next() does not rethrow them; read the error here instead of in a catch block.
      const error = ctx.error as HandlerError | undefined;
      if (error) {
        event.error = {
          type: error.name,
          message: error.message,
          code: error.code,
          retriable: error.retriable ?? false,
        };
      }
      event.status_code = ctx.res.status;
      event.outcome = error || ctx.res.status >= 500 ? 'error' : 'success';
      event.duration_ms = Date.now() - startTime;
      // Emit the wide event: ONE log line per request
      logger.info(event);
    }
  };
}
Handler Enrichment¶
Handlers enrich the event with business context as they process the request:
app.post('/checkout', async (ctx) => {
const event = ctx.get('wideEvent');
const user = ctx.get('user');
// Add user context
event.user = {
id: user.id,
subscription: user.subscription,
account_age_days: daysSince(user.createdAt),
lifetime_value_cents: user.ltv,
};
// Add business context as you process
const cart = await getCart(user.id);
event.cart = {
id: cart.id,
item_count: cart.items.length,
total_cents: cart.total,
coupon_applied: cart.coupon?.code,
};
// Process payment — measure sub-operation latency
const paymentStart = Date.now();
const payment = await processPayment(cart, user);
event.payment = {
method: payment.method,
provider: payment.provider,
latency_ms: Date.now() - paymentStart,
attempt: payment.attemptNumber,
};
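// If payment fails, add error details and return a failure response
if (payment.error) {
event.error = {
type: 'PaymentError',
code: payment.error.code,
stripe_decline_code: payment.error.declineCode,
};
// Non-2xx status so the event is not recorded as a successful checkout
return ctx.json({ error: payment.error.code }, 500);
}
return ctx.json({ orderId: payment.orderId });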
});
Wide Event Data Flow¶
sequenceDiagram
participant Client
participant Middleware as Wide Event Middleware
participant Handler as Request Handler
participant DB as Business Logic / DB
participant Logger as Event Emitter
participant Backend as O11y Backend
Client->>Middleware: POST /checkout
Middleware->>Middleware: Initialize event (request_id, method, path, service, version)
Middleware->>Handler: next()
Handler->>Handler: Enrich event (user.id, subscription, LTV)
Handler->>DB: getCart(user.id)
DB-->>Handler: cart data
Handler->>Handler: Enrich event (cart.id, item_count, total)
Handler->>DB: processPayment(cart, user)
DB-->>Handler: payment result
Handler->>Handler: Enrich event (payment.method, latency_ms, attempt)
Handler-->>Middleware: response
Middleware->>Middleware: Finalize event (duration_ms, status_code, outcome)
Middleware->>Logger: logger.info(event)
Logger->>Backend: Single wide event (~50 fields)
Middleware-->>Client: HTTP response
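The lifecycle in the diagram does not depend on a particular web framework. The sketch below shows the same initialize, enrich, finalize, emit-once flow as a plain function. `withWideEvent`, its parameters, and the injectable clock are hypothetical names introduced for illustration only.

```typescript
type WideEvent = Record<string, unknown>;
type Handler<T> = (event: WideEvent) => Promise<T>;

// Runs `handler` with a mutable event and emits exactly once,
// whether the handler succeeds or throws.
async function withWideEvent<T>(
  base: WideEvent,
  handler: Handler<T>,
  emit: (event: WideEvent) => void,
  now: () => number = Date.now, // injectable clock, handy in tests
): Promise<T> {
  const event: WideEvent = { ...base };
  const start = now();
  try {
    const result = await handler(event); // handler enriches `event` as it works
    event.outcome = 'success';
    return result;
  } catch (err) {
    const e = err instanceof Error ? err : new Error(String(err));
    event.outcome = 'error';
    event.error = { type: e.name, message: e.message };
    throw err;
  } finally {
    event.duration_ms = now() - start;
    emit(event); // single emission point
  }
}
```

Because emission happens in `finally`, a failed request still produces one event, with whatever context the handler attached before the failure.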
Tail Sampling Implementation¶
Tail sampling makes the keep/drop decision after the request completes, based on outcome. This keeps costs manageable while never losing the events that matter.
// Minimal shape of the fields the sampler reads
interface WideEvent {
  status_code: number;
  duration_ms: number;
  error?: unknown;
  user?: { subscription?: string };
  feature_flags?: Record<string, boolean>;
}

// Tail sampling decision function
function shouldSample(event: WideEvent): boolean {
  // Always keep errors
  if (event.status_code >= 500) return true;
  if (event.error) return true;
  // Always keep slow requests (2000 ms stands in for this service's p99)
  if (event.duration_ms > 2000) return true;
  // Always keep VIP users
  if (event.user?.subscription === 'enterprise') return true;
  // Always keep requests with specific feature flags (debugging rollouts)
  if (event.feature_flags?.new_checkout_flow) return true;
  // Randomly sample the rest at 5%
  return Math.random() < 0.05;
}
Sampling Rules Summary¶
| Rule | Keep Rate | Rationale |
|---|---|---|
| Errors (5xx, exceptions) | 100% | Never lose failure evidence |
| Slow requests (> p99) | 100% | Tail latency is where problems hide |
| VIP / enterprise users | 100% | Business-critical — immediate escalation |
| Feature flag rollouts | 100% | Correlate new code with new failures |
| Everything else | 1-5% | Happy, fast requests — sample for baselines |
Naive random sampling is dangerous
If you randomly sample 1% of all traffic, there is a 99% chance that you drop the one request that explains your outage. Use tail sampling with outcome-based rules so that errors and slow requests are always kept.
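One practical consequence of sampling is that raw counts no longer match real traffic. A common remedy, which the original text does not cover and is shown here as an assumption, is to store a sample rate on each kept event so that aggregates can be re-weighted. The sketch below applies the rules from the table. `sample`, `SamplerOptions`, and the `sample_rate` field are hypothetical names.

```typescript
interface SampledEvent {
  status_code: number;
  duration_ms: number;
  error?: unknown;
  user?: { subscription?: string };
  feature_flags?: Record<string, boolean>;
  sample_rate?: number; // how many original requests this kept event represents
}

interface SamplerOptions {
  slowThresholdMs: number; // assumed to track the service's p99
  baselineRate: number;    // fraction of ordinary traffic to keep, e.g. 0.05
  random?: () => number;   // injectable for deterministic tests
}

// Returns the event annotated with its sample rate, or null if it is dropped.
function sample(event: SampledEvent, opts: SamplerOptions): SampledEvent | null {
  const random = opts.random ?? Math.random;
  const mustKeep =
    event.status_code >= 500 ||
    event.error !== undefined ||
    event.duration_ms > opts.slowThresholdMs ||
    event.user?.subscription === 'enterprise' ||
    Boolean(event.feature_flags?.new_checkout_flow);
  if (mustKeep) return { ...event, sample_rate: 1 };
  if (random() < opts.baselineRate) {
    return { ...event, sample_rate: 1 / opts.baselineRate };
  }
  return null;
}
```

With this in place, a query can estimate true request volume by summing `sample_rate` instead of counting rows.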
OTel Collector Tail Sampling¶
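Comparable rules can be configured in the OTel Collector's tail sampling processor, which ships in the Collector contrib distribution. It buffers the spans of each trace for decision_wait and then decides whether to keep the whole trace. The example below covers the error, latency, and baseline rules: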
# otel-collector-config.yaml
processors:
  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    policies:
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow-traces
        type: latency
        latency: { threshold_ms: 2000 }
      - name: probabilistic-sample
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }
Database Requirements for Observability 2.0¶
A single uncompressed wide event can exceed 2KB. At 10,000 requests/second, that's 20MB/s of raw event data. The database must handle this efficiently while supporting both real-time dashboards and ad-hoc exploratory queries.
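The arithmetic behind these figures is easy to reproduce for your own traffic. This is a rough sketch: `ingestEstimate` is a hypothetical helper, sizes are in decimal units (1 MB = 10^6 bytes), and `keepFraction` is an assumed knob for the share of events that survive sampling.

```typescript
// Estimates raw (uncompressed) ingest volume for wide events.
function ingestEstimate(
  eventBytes: number,
  requestsPerSecond: number,
  keepFraction = 1, // 1 = no sampling
): { megabytesPerSecond: number; terabytesPerDay: number } {
  const bytesPerSecond = eventBytes * requestsPerSecond * keepFraction;
  return {
    megabytesPerSecond: bytesPerSecond / 1e6,
    terabytesPerDay: (bytesPerSecond * 86400) / 1e12,
  };
}

// 2 KB events at 10,000 requests/second, unsampled
const unsampled = ingestEstimate(2000, 10000);
```

For the numbers in the text this gives 20 MB/s, or about 1.73 TB per day before compression, which is why both columnar encoding and sampling matter.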
Core Requirements¶
graph LR
subgraph "Ingest"
OTLP["OTLP / OpenTelemetry"]
TRANSFORM["Transform Engine\n(pre-processing)"]
end
subgraph "Store"
COLUMNAR["Columnar Storage\n(Parquet / Arrow)"]
OBJECT["Object Storage\n(S3 / GCS)"]
MATVIEW["Materialized Views\n(derived metrics)"]
end
subgraph "Query"
ROUTINE["Routine Queries\n(dashboards, alerts)"]
EXPLORE["Exploratory Queries\n(ad-hoc analysis)"]
PROMQL["PromQL\n(backward compat)"]
end
OTLP --> TRANSFORM --> COLUMNAR
COLUMNAR --> OBJECT
COLUMNAR --> MATVIEW
COLUMNAR --> ROUTINE
COLUMNAR --> EXPLORE
MATVIEW --> PROMQL
MATVIEW --> ROUTINE
style COLUMNAR fill:#2980b9,color:#fff
style MATVIEW fill:#8e44ad,color:#fff
| Requirement | Why | How |
|---|---|---|
| Columnar storage | Wide events have 50+ fields; columnar format enables column pruning and vectorized processing | Apache Parquet, Arrow; dictionary/RLE encoding per column |
| Disaggregated compute + storage | Storage scales independently of compute; cost-efficient long-term retention | S3/GCS as primary persistence; local SSD for hot data |
| Dynamic schema | New fields appear as instrumentation evolves; can't ALTER TABLE for every new attribute | Auto-create columns on first occurrence |
| High-cardinality indexing | user_id, trace_id, request_id have millions of unique values | Inverted indexes, skip indexes, bloom filters |
| Real-time ingestion + query | Data must be visible within seconds for dashboards and alerting | WAL + memtable architecture; streaming ingestion |
| Materialized views | Metrics derived from raw events (error rates, p99 latencies) for dashboard performance | Incremental computation; update aggregates without reprocessing |
| PromQL backward compatibility | Existing Grafana dashboards and alert rules must work without rebuild | PromQL query engine on top of columnar store |
| Read replicas | Exploratory analytics must not degrade dashboard/alert performance | Isolated compute for heavy analytical queries |
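The dynamic schema requirement is easier to picture with a small model of what a store does on ingest: flatten the nested event into dotted column names, then create any column it has not seen before. The sketch below is illustrative only and does not describe how any particular database implements it. `flattenEvent` and `newColumns` are hypothetical names.

```typescript
type ColumnValue = string | number | boolean | null;

// Flattens nested objects into dotted column names, e.g. user.id.
// Arrays are stored as JSON strings to keep one value per column.
function flattenEvent(event: Record<string, unknown>, prefix = ''): Record<string, ColumnValue> {
  const row: Record<string, ColumnValue> = {};
  for (const key of Object.keys(event)) {
    const value = event[key];
    const column = prefix ? `${prefix}.${key}` : key;
    if (value === undefined) continue;
    if (Array.isArray(value)) {
      row[column] = JSON.stringify(value);
    } else if (value !== null && typeof value === 'object') {
      Object.assign(row, flattenEvent(value as Record<string, unknown>, column));
    } else {
      row[column] = value as ColumnValue;
    }
  }
  return row;
}

// Returns the columns a store would have to create for this row,
// and records them as known for subsequent rows.
function newColumns(known: Set<string>, row: Record<string, ColumnValue>): string[] {
  const added = Object.keys(row).filter((column) => !known.has(column));
  added.forEach((column) => known.add(column));
  return added;
}
```

The first event that carries a new attribute triggers column creation; later events with the same attribute need no schema change.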
Routine vs Exploratory Queries¶
| Query Type | Purpose | Latency Target | Example |
|---|---|---|---|
| Routine | Dashboards, alerts, SLO tracking | Sub-second | Error rate by service over last 5 min |
| Exploratory | Ad-hoc debugging, unknown unknowns | Seconds to minutes | "Show me all requests from user X where feature flag Y was on and latency > 2s" |
Removing metrics as first-class citizens doesn't eliminate pre-aggregation — it shifts this responsibility from the application layer to the database via materialized views.
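To illustrate what "incremental" means here, the sketch below folds each incoming event into a per-minute aggregate, so an error-rate dashboard never has to rescan raw events. It is a toy model under stated assumptions (ISO 8601 UTC timestamps, a single process), not a description of any database's materialized view engine. `ErrorRateView` is a hypothetical name.

```typescript
interface MinuteBucket {
  total: number;
  errors: number;
}

class ErrorRateView {
  private buckets = new Map<string, MinuteBucket>();

  // Folds one event into its (service, minute) bucket.
  apply(event: { service: string; timestamp: string; status_code: number }): void {
    const minute = event.timestamp.slice(0, 16); // e.g. "2025-01-15T10:23"
    const key = `${event.service}|${minute}`;
    const bucket = this.buckets.get(key) ?? { total: 0, errors: 0 };
    bucket.total += 1;
    if (event.status_code >= 500) bucket.errors += 1;
    this.buckets.set(key, bucket);
  }

  // Reads the derived metric without touching raw events.
  errorRate(service: string, minute: string): number | undefined {
    const bucket = this.buckets.get(`${service}|${minute}`);
    return bucket ? bucket.errors / bucket.total : undefined;
  }
}
```

Each write costs a constant amount of work, and each dashboard read is a lookup, which is the trade that makes routine queries sub-second.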
GreptimeDB Reference Architecture¶
GreptimeDB is an open-source analytical database purpose-built for O11y 2.0 wide events. It is written in Rust and designed for cloud-native deployments.
graph TB
subgraph "Ingestion"
OTLP_IN["OTLP Receiver"]
PROM_RW["Prometheus Remote Write"]
TRANSFORM_ENG["Built-in Transform Engine\n(pre-processing, enrichment)"]
end
subgraph "GreptimeDB Core"
INGEST_NODE["Ingest Nodes\n(high-throughput write)"]
QUERY_NODE["Query Nodes\n(real-time API)"]
READ_REPLICA["Read Replicas\n(isolated analytics)"]
MAT_VIEW["Materialized Views\n(metric derivation)"]
RULE_ENGINE["Rule Engine\n(alerts, triggers)"]
end
subgraph "Storage"
LOCAL_SSD["Local SSD\n(hot data)"]
S3["Object Storage (S3/GCS)\n(warm + cold data)"]
end
subgraph "Consumers"
GRAFANA["Grafana\n(PromQL dashboards)"]
SQL_CLIENT["SQL Client\n(exploratory queries)"]
ALERT_MGR["Alertmanager\n(push notifications)"]
end
OTLP_IN --> TRANSFORM_ENG
PROM_RW --> TRANSFORM_ENG
TRANSFORM_ENG --> INGEST_NODE
INGEST_NODE --> LOCAL_SSD
LOCAL_SSD --> S3
QUERY_NODE --> LOCAL_SSD
QUERY_NODE --> S3
READ_REPLICA --> S3
MAT_VIEW --> QUERY_NODE
RULE_ENGINE --> ALERT_MGR
QUERY_NODE --> GRAFANA
READ_REPLICA --> SQL_CLIENT
RULE_ENGINE --> GRAFANA
style INGEST_NODE fill:#27ae60,color:#fff
style QUERY_NODE fill:#2980b9,color:#fff
style READ_REPLICA fill:#8e44ad,color:#fff
style S3 fill:#e67e22,color:#fff
Key GreptimeDB features for O11y 2.0:
- Native OTLP ingestion — accepts OpenTelemetry data directly
- Built-in transform engine — pre-process and enrich events at ingest time
- Materialized views — derive metrics from raw wide events within the database
- Read replicas — isolate heavy analytical queries from real-time dashboard queries
- Rule engine + triggers — push-based alerting without external dependencies
- Automatic data tiering — hot data on local SSD, warm/cold on S3 with minimal management
GreptimeDB vs ClickHouse for O11y 2.0¶
| Dimension | GreptimeDB | ClickHouse |
|---|---|---|
| Design intent | Purpose-built for time-series and observability | General-purpose OLAP analytical engine |
| Storage layout | Timestamp-first: data partitioned/sorted by time | Columnar-first: time is just another dimension |
| O11y stack | Native OTLP, PromQL, Jaeger query support | Requires ClickStack or external middleware (Kafka, Redis) |
| Schema | Dynamic: auto-creates columns for new attributes | New typed columns need ALTER TABLE or migrations, unless attributes are kept in Map or JSON columns |
| Best fit | Observability/telemetry workloads with native OTel | Massive-scale ad-hoc analytical/BI queries |
| Maturity | Newer, rapidly evolving | Battle-tested at massive scale (Cloudflare, Uber) |
Migration: 1.0 to 2.0¶
The transition is incremental, not big-bang:
- Start instrumenting wide events alongside existing logs/metrics (dual-write)
- Enrich events with business context in handlers (user, cart, payment details)
- Deploy an O11y 2.0-capable backend (GreptimeDB, ClickHouse, or Honeycomb)
- Create materialized views that replace existing Prometheus recording rules
- Point Grafana dashboards at the new backend via PromQL compatibility
- Enable tail sampling to control event volume and cost
- Gradually retire separate log/metric/trace pipelines as confidence builds
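The dual-write step can be as small as a wrapper around the existing emitter. The sketch below is one possible shape, assuming both pipelines accept the same event object. `dualWrite`, `Sink`, and `onNewSinkError` are hypothetical names.

```typescript
type Sink = (event: Record<string, unknown>) => void;

// Sends each event to the legacy sink first, then to the new one.
// A failure in the new pipeline is reported but never breaks the legacy path.
function dualWrite(
  legacy: Sink,
  next: Sink,
  onNewSinkError: (err: unknown) => void = () => {},
): Sink {
  return (event) => {
    legacy(event);
    try {
      next(event);
    } catch (err) {
      onNewSinkError(err);
    }
  };
}
```

Keeping the legacy sink authoritative during migration means the new backend can be evaluated, and rolled back, without risk to existing dashboards and alerts.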
Backward compatibility is non-negotiable
Existing Grafana dashboards, alert rules, and trace analysis workflows must be preserved and enhanced, not discarded. The 2.0 backend must speak PromQL for dashboards and support trace views for distributed debugging.