Observability
Telemetry platforms, time-series databases, log aggregators, distributed tracing, and APM tools for maintaining visibility into cloud-native systems.
Topics
| Tool | Description |
|---|---|
| Coroot | eBPF-powered observability and APM with zero-instrumentation auto-discovery, AI root cause analysis, and continuous profiling. |
| Grafana | De facto visualization and dashboarding layer — hub of the LGTM stack with plugins for every data source. |
| LGTM Stack | Grafana's unified telemetry stack: Loki (logs), Grafana (viz), Tempo (traces), Mimir (metrics), Pyroscope (profiling). |
| OpenObserve | Rust-native observability platform with Apache Parquet + S3-native storage, marketed by its vendor as offering roughly 140× lower storage cost than Elasticsearch. |
| SigNoz | OpenTelemetry-native platform unifying metrics, logs, and traces on a ClickHouse analytical backend. |
| Monoscope | Open-source observability with S3-native BYOS storage, LLM-powered natural language queries, and AI anomaly detection agents. Built on Haskell + TimeFusion. |
| Apache SkyWalking | ASF APM and observability platform for distributed systems, microservices, and service mesh environments. |
| Victoria Stack | High-performance observability suite: VictoriaMetrics (metrics), VictoriaLogs (logs), and VictoriaTraces (traces), positioned as alternatives to Prometheus, Loki, and Tempo. |
Comparisons
| Comparison | Scope |
|---|---|
| LGTM vs Victoria Stack | Detailed feature and operational comparison of the LGTM stack vs VictoriaMetrics-based stack |
| Observability Stacks | Broad comparison across Grafana/LGTM, VictoriaMetrics, SigNoz, OpenObserve, SkyWalking, and Coroot |
Landscape
The observability space is converging around OpenTelemetry (OTel) as the universal instrumentation standard, now the second-most-active CNCF project after Kubernetes. OTel provides a single SDK and collector architecture for metrics, logs, and traces, replacing the fragmented instrumentation landscape of Prometheus client libraries, Jaeger SDKs, and Fluentd/Fluent Bit log agents.
eBPF-based auto-instrumentation has emerged as a zero-code alternative — tools like Coroot, Grafana Beyla, and Odigos can extract HTTP, gRPC, and database call telemetry directly from kernel-level socket events without modifying application code. The "three pillars" model (metrics, logs, traces) is expanding to include profiling as a fourth signal, with Grafana Pyroscope and Parca providing continuous profiling correlated with traces via span IDs.
Storage Economics
On the backend side, object storage (S3, GCS) has become the default persistence layer for modern observability platforms. Loki, Tempo, and Mimir write their long-term data to object storage and buffer only recent data locally; VictoriaMetrics, by contrast, stores data on local or block storage and uses object storage mainly for backups. This architectural shift has made high-retention observability (90-180 days) economically viable where it was previously cost-prohibitive.
The competitive landscape is bifurcating between vertically integrated platforms (SigNoz, OpenObserve, Coroot) that bundle all signals in a single product and composable stacks (LGTM, Victoria) that optimize each signal independently. AI-powered features are entering the space rapidly: natural language querying (Monoscope, Grafana AI), automated root cause analysis (Coroot, Dynatrace), and anomaly detection are becoming table stakes rather than premium differentiators.
Key Concepts
Three Signals (Metrics, Logs, Traces)
Signal Types
- Metrics: Numeric time-series data (counters, gauges, histograms) sampled at regular intervals. Compact and highly compressible as long as label cardinality stays low, which makes them ideal for alerting and dashboards. The Prometheus exposition format and OpenTelemetry's OTLP are the two dominant wire formats.
- Logs: Timestamped textual records of discrete events. High volume, semi-structured, essential for debugging. Modern log systems (Loki, VictoriaLogs) index labels but store log lines in compressed chunks to manage cost.
- Traces: Distributed call graphs composed of spans, each representing a unit of work with timing, status, and parent-child relationships. Critical for understanding latency in microservice architectures.
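The three signal shapes described above can be sketched as plain data types. This is a minimal illustration, not any platform's actual schema: the class names, field names, and the `self_time` helper are hypothetical, and the sample spans are made up for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MetricSample:
    """One point in a numeric time series, identified by name plus labels."""
    name: str
    labels: dict[str, str]
    timestamp: float
    value: float


@dataclass
class LogRecord:
    """A timestamped event: a few indexed labels plus an unindexed text line."""
    timestamp: float
    labels: dict[str, str]
    line: str


@dataclass
class Span:
    """One unit of work in a trace; parent_id links it into a call graph."""
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    name: str
    start: float
    end: float
    status: str = "ok"

    @property
    def duration(self) -> float:
        return self.end - self.start


def children_by_parent(spans: list[Span]) -> dict[Optional[str], list[Span]]:
    """Group spans by parent ID so the call graph can be walked top-down."""
    tree: dict[Optional[str], list[Span]] = {}
    for span in spans:
        tree.setdefault(span.parent_id, []).append(span)
    return tree


def self_time(span: Span, tree: dict[Optional[str], list[Span]]) -> float:
    """Time spent in the span itself, assuming its children do not overlap."""
    return span.duration - sum(c.duration for c in tree.get(span.span_id, []))


# Hypothetical three-span trace: one root request with two child calls.
spans = [
    Span("t1", "a", None, "GET /checkout", 0.00, 0.50),
    Span("t1", "b", "a", "auth.verify", 0.02, 0.07),
    Span("t1", "c", "a", "db.query", 0.10, 0.45),
]
tree = children_by_parent(spans)
root = tree[None][0]
print(f"{root.name}: total={root.duration:.2f}s self={self_time(root, tree):.2f}s")
```

Comparing a span's total duration with its self time is one simple way to see which hop in a call graph actually contributes the latency.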
OpenTelemetry
A CNCF Incubating project providing vendor-neutral APIs, SDKs, and the OTel Collector for generating, collecting, processing, and exporting telemetry data. The OTel Collector acts as a telemetry pipeline with three stages:
- Receivers: Ingest data via OTLP, Prometheus scrape, Jaeger, Zipkin, Fluentd, syslog, and 80+ other protocols
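- Processors: Transform data in flight, including batching, sampling (head-based or tail-based), attribute mutation, and filtering. Span-to-metrics conversion, formerly a processor, is now handled by a connector that links a traces pipeline to a metrics pipeline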
- Exporters: Send processed data to any backend — Prometheus remote write, OTLP to Mimir/Tempo/Loki, ClickHouse, or proprietary SaaS platforms
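The receive, process, export flow can be modeled in a few lines. The sketch below is a conceptual toy, not the Collector's real API or configuration format: `run_pipeline`, `flush`, `keep_if`, and `drop_attribute` are hypothetical names, and records are simplified to plain dictionaries.

```python
from typing import Callable

Record = dict  # a telemetry item, simplified to a plain dict
Processor = Callable[[list[Record]], list[Record]]
Exporter = Callable[[list[Record]], None]


def keep_if(predicate: Callable[[Record], bool]) -> Processor:
    """Filtering processor: keep only records matching the predicate."""
    def process(batch: list[Record]) -> list[Record]:
        return [r for r in batch if predicate(r)]
    return process


def drop_attribute(key: str) -> Processor:
    """Attribute processor: remove one attribute from every record."""
    def process(batch: list[Record]) -> list[Record]:
        return [{k: v for k, v in r.items() if k != key} for r in batch]
    return process


def flush(batch: list[Record], processors: list[Processor],
          exporters: list[Exporter]) -> None:
    """Run one batch through the processors in order, then fan out."""
    for proc in processors:
        batch = proc(batch)
    if batch:
        for export in exporters:
            export(batch)


def run_pipeline(received, processors, exporters, batch_size=2) -> None:
    """Consume received records, group them into batches, and flush each one."""
    batch: list[Record] = []
    for record in received:
        batch.append(record)
        if len(batch) >= batch_size:
            flush(batch, processors, exporters)
            batch = []
    if batch:  # flush the final partial batch
        flush(batch, processors, exporters)


# Hypothetical input standing in for data arriving at a receiver.
records = [
    {"service": "cart", "level": "debug", "user_id": "u1"},
    {"service": "cart", "level": "error", "user_id": "u2"},
    {"service": "pay", "level": "info", "user_id": "u3"},
]
processors = [keep_if(lambda r: r["level"] != "debug"), drop_attribute("user_id")]
exported: list[Record] = []
run_pipeline(records, processors, [exported.extend])
print(exported)
```

Processor order matters in this model just as it does in a real pipeline: filtering before attribute mutation means the dropped records never pay the cost of later stages.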
The Collector's tail-based sampling processor examines complete traces before deciding whether to keep them, dramatically reducing storage costs for high-throughput systems while preserving error and high-latency traces.
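The tail-based decision can be illustrated with a simplified sketch. It assumes all spans of a trace have already arrived (a real sampler has to buffer spans and wait before deciding), and the function name, parameter names, threshold, and span dictionary layout are hypothetical rather than the Collector's actual policy configuration.

```python
import random
from collections import defaultdict


def tail_sample(spans, latency_threshold_s=1.0, baseline_rate=0.1, rng=None):
    """Decide per complete trace: keep errors, keep slow traces,
    and keep a random fraction of everything else."""
    rng = rng or random.Random()

    # Group spans by trace so each decision sees the whole trace.
    traces = defaultdict(list)
    for span in spans:
        traces[span["trace_id"]].append(span)

    kept = {}
    for trace_id, trace_spans in traces.items():
        has_error = any(s["status"] == "error" for s in trace_spans)
        duration = (max(s["end"] for s in trace_spans)
                    - min(s["start"] for s in trace_spans))
        if (has_error
                or duration >= latency_threshold_s
                or rng.random() < baseline_rate):
            kept[trace_id] = trace_spans
    return kept


# Hypothetical spans: one healthy trace, one with an error, one slow trace.
spans = [
    {"trace_id": "ok", "status": "ok", "start": 0.0, "end": 0.2},
    {"trace_id": "err", "status": "ok", "start": 0.0, "end": 0.1},
    {"trace_id": "err", "status": "error", "start": 0.02, "end": 0.08},
    {"trace_id": "slow", "status": "ok", "start": 0.0, "end": 1.6},
]
# baseline_rate=0.0 disables random sampling so the result is deterministic.
kept = tail_sample(spans, latency_threshold_s=1.0, baseline_rate=0.0)
print(sorted(kept))  # ['err', 'slow']
```

Head-based sampling, by contrast, would have to decide when the first span starts, before it is known whether the trace ends in an error or turns out slow.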
Cardinality
The number of unique time-series created by the combination of metric name and label values. High cardinality (e.g., labeling metrics with user IDs, request paths, or container IDs) causes storage explosion and query performance degradation. Prometheus and Mimir enforce series limits; VictoriaMetrics handles high cardinality more gracefully through its merge-tree storage engine. Strategies to control cardinality include recording rules (pre-aggregation), relabeling (dropping labels), and OTel Collector attribute processors.
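A short calculation shows why a single unbounded label dominates series count. The helper names, label counts, and sample data below are hypothetical and chosen only to illustrate the arithmetic.

```python
from math import prod


def max_series(label_values: dict[str, int]) -> int:
    """Upper bound on series for one metric: the product of the
    number of distinct values each label can take."""
    return prod(label_values.values())


def count_series(samples) -> int:
    """Actual series: distinct (metric name, label set) combinations observed."""
    return len({(name, tuple(sorted(labels.items()))) for name, labels in samples})


def drop_label(samples, key):
    """Mimic a relabeling rule that removes one label from every sample."""
    return [(name, {k: v for k, v in labels.items() if k != key})
            for name, labels in samples]


# Hypothetical label counts for a single HTTP request counter.
bounded = {"method": 5, "status": 10, "pod": 200}
print(max_series(bounded))                           # 10000
print(max_series({**bounded, "user_id": 50_000}))    # 500000000

samples = [
    ("http_requests_total", {"method": "GET", "user_id": "u1"}),
    ("http_requests_total", {"method": "GET", "user_id": "u2"}),
    ("http_requests_total", {"method": "POST", "user_id": "u1"}),
]
print(count_series(samples))                         # 3
print(count_series(drop_label(samples, "user_id")))  # 2
```

The product is an upper bound; real workloads rarely produce every combination, but an unbounded label such as a user ID still pushes the observed count toward it over time.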
Exemplars
A mechanism for linking metrics to representative trace spans, bridging the gap between aggregated statistical data and individual request-level detail. When a Prometheus histogram bucket records a high-latency request, an exemplar attaches the corresponding trace ID, enabling a single click from a dashboard panel to the exact distributed trace. Grafana supports exemplar overlays on metric panels, and both Mimir and VictoriaMetrics store exemplars alongside metrics.
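As a rough illustration of the idea, the sketch below attaches a trace ID to the histogram bucket an observation falls into. It is not the Prometheus client API: the `Histogram` and `Exemplar` classes are hypothetical, buckets hold per-bucket rather than cumulative counts for simplicity, and only the most recent exemplar per bucket is retained.

```python
import bisect
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Exemplar:
    trace_id: str
    value: float


@dataclass
class Histogram:
    bounds: list[float]  # ascending upper bounds; the +Inf bucket is implicit
    counts: list[int] = field(init=False)
    exemplars: dict[int, Exemplar] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # One extra slot for observations above the largest bound (+Inf).
        self.counts = [0] * (len(self.bounds) + 1)

    def observe(self, value: float, trace_id: Optional[str] = None) -> None:
        # First bucket whose upper bound is >= value ("le" semantics).
        idx = bisect.bisect_left(self.bounds, value)
        self.counts[idx] += 1
        if trace_id is not None:
            # Keep the latest exemplar seen for this bucket.
            self.exemplars[idx] = Exemplar(trace_id, value)


latency = Histogram(bounds=[0.1, 0.5, 1.0])
latency.observe(0.05)  # no trace context available for this request
latency.observe(2.3, trace_id="4bf92f3577b34da6a3ce929d0e0e4736")  # made-up ID

slow = latency.exemplars[len(latency.bounds)]  # exemplar in the +Inf bucket
print(latency.counts, slow.trace_id)
```

A dashboard that renders the slowest bucket can then offer the stored trace ID as a link, which is the jump from aggregate to individual request that the paragraph above describes.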
SLI/SLO (Service Level Indicators / Objectives)
SLIs are quantitative measures of service behavior (e.g., the proportion of requests completing in under 300 ms, or the fraction of requests that fail). SLOs set target thresholds for SLIs over a rolling window (e.g., 99.9% of requests succeed over 30 days). Error budgets, the allowed amount of unreliability (0.1% of requests for a 99.9% SLO), drive operational decisions: when the budget is spent, teams freeze deployments and focus on reliability. Tools like Sloth and Pyrra generate Prometheus recording and alerting rules from SLO definitions, automating burn-rate alerts that fire when the error budget consumption rate predicts an SLO breach. OpenSLO, by contrast, is a vendor-neutral specification for writing SLO definitions rather than a rule generator.
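The arithmetic behind burn-rate alerting is small enough to sketch directly. The function names and request counts below are hypothetical; the 14.4 threshold is a commonly cited fast-burn value for a 30-day window (it corresponds to spending 2% of the budget in one hour) and is an assumption here, not something the tools above are confirmed to use by default.

```python
def error_budget(slo_target: float) -> float:
    """Allowed failure fraction, e.g. 0.001 for a 99.9% SLO."""
    return 1.0 - slo_target


def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """How fast the budget is being consumed: 1.0 means exactly on budget."""
    if total_events == 0:
        return 0.0
    return (bad_events / total_events) / error_budget(slo_target)


def hours_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """Hours until a full budget is spent if the current rate persists."""
    return float("inf") if rate <= 0 else window_days * 24 / rate


def should_page(long_window_rate: float, short_window_rate: float,
                threshold: float = 14.4) -> bool:
    """Multi-window check: both windows must exceed the threshold,
    so a brief spike that has already ended does not page."""
    return long_window_rate >= threshold and short_window_rate >= threshold


# Hypothetical traffic: 200 failures out of 10,000 requests in the last hour.
slo = 0.999
rate_1h = burn_rate(200, 10_000, slo)   # roughly 20x the sustainable rate
rate_5m = burn_rate(25, 1_000, slo)     # roughly 25x over the last 5 minutes

print(f"1h burn rate: {rate_1h:.1f}")
print(f"budget gone in about {hours_to_exhaustion(rate_1h):.0f} hours")
print("page on-call:", should_page(rate_1h, rate_5m))
```

A burn rate of 1.0 spends the budget exactly at the end of the window, so alert thresholds are expressed as multiples of that sustainable rate.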
Open Questions
- As eBPF-based auto-instrumentation matures, will manual OpenTelemetry SDK instrumentation become limited to business-logic-specific spans and custom metrics, effectively making zero-code instrumentation the default?
- How should organizations manage the cost of high-cardinality observability data — is aggressive pre-aggregation via recording rules the right trade-off, or do newer engines like VictoriaMetrics and ClickHouse make high-cardinality storage economically viable?
- With profiling joining metrics, logs, and traces as a fourth signal, what is the realistic overhead of continuous profiling in production, and does it justify the debugging value for most workloads?