LGTM Stack
Summary
LGTM is the Grafana Labs open-source observability stack, named after its four core components: Loki (Logs), Grafana (visualization), Tempo (Traces), and Mimir (Metrics). A fifth pillar, Pyroscope (Profiles), is frequently included, sometimes expanding the acronym to LGTMP or referring to it as "big tent" observability.
The stack is purpose-built so each backend is independently scalable, uses object storage (S3/GCS/Azure Blob) as its primary persistence layer, and speaks OpenTelemetry natively. Grafana sits at the center as the single pane of glass, correlating across all signals.
| Component | Signal | Query Language | GitHub Stars | Latest Version |
|---|---|---|---|---|
| Grafana Mimir | Metrics | PromQL | ~5k ⭐ | 3.0.5 |
| Grafana Loki | Logs | LogQL | ~27.9k ⭐ | 3.7.1 |
| Grafana Tempo | Traces | TraceQL | ~4k ⭐ | 2.10.1 (3.0 in dev) |
| Grafana Pyroscope | Profiles | FlameQL | ~10k ⭐ | 1.20.2 |
| Grafana | Visualization | — | ~73.1k ⭐ | 12.4.2 |
| Grafana Alloy | Collection | Alloy configuration syntax (formerly River, HCL-inspired; a config language, not a query language) | — | 1.15.0 |
Evaluation
- Why it's better: The only fully open-source stack that covers all four observability pillars (metrics, logs, traces, profiles) with cross-signal correlation in a single UI. Each backend is optimized for its signal type and uses cheap object storage, making the stack 3–10x cheaper than Datadog at scale.
- When it fits (Applicability):
  - Organizations with platform engineering capacity to operate multiple backends
  - Teams standardizing on OpenTelemetry who want no vendor lock-in
  - Cloud-native (Kubernetes) environments needing horizontal scalability
  - Mixed environments with heterogeneous data sources
  - Budget-conscious organizations needing enterprise-grade observability at open-source cost
- Pros and Cons:
| Pros | Cons |
|---|---|
| Each component best-of-breed for its signal type | Operational complexity — 4+ backends to manage |
| Object-storage-first = dramatically reduced cost | Requires solid Kubernetes & DevOps expertise |
| OpenTelemetry-native, no vendor lock-in | Signal correlation requires careful config |
| Massive community, battle-tested at scale | Query languages differ per signal (PromQL, LogQL, TraceQL, FlameQL) |
| Independent horizontal scaling per component | Multi-tenancy requires auth proxy setup |
| All-in-one Docker image for dev (`grafana/otel-lgtm`) | Production setup requires 6+ Helm charts |
| Cross-signal correlation (exemplars, derived fields) | Label cardinality is the #1 operational pitfall |
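The label-cardinality pitfall in the last row comes down to multiplication: the worst-case number of series (or log streams) for one metric is the product of the distinct values of each label. The sketch below works through that arithmetic; the label names and counts are hypothetical, chosen only to show how a single unbounded label changes the total.

```python
from math import prod


def estimate_series(label_values: dict[str, int]) -> int:
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(label_values.values())


# Hypothetical label set with bounded values only.
base = {"namespace": 20, "service": 50, "status_code": 5}

# Same metric after adding one unbounded label (hypothetical count).
with_user = {**base, "user_id": 10_000}

print(estimate_series(base))       # 5000
print(estimate_series(with_user))  # 50000000
```

The real series count is usually lower than this bound because not every combination occurs, but the bound is what capacity planning has to allow for.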
- Common Use Cases:
  - Full-stack Kubernetes observability — metrics, logs, traces, and profiles from all workloads in one view
  - Centralized enterprise observability platform — multi-tenant, shared infrastructure for multiple teams (Maersk, DHL, Salesforce pattern)
  - Cost-effective log aggregation — replacing Elasticsearch with Loki for 10–100x cost reduction
  - Distributed tracing at scale — Tempo handles 100M+ spans/day on object storage alone
  - AI/ML pipeline observability — tracking model inference latency, GPU utilization, and training metrics
  - IoT and industrial telemetry — high-volume metric ingestion via Mimir
- Licensing & Commercial Use:
  - Grafana, Loki, Tempo: AGPL-3.0
  - Mimir: AGPL-3.0
  - Pyroscope: AGPL-3.0
  - Alloy: Apache 2.0
  - All components are free to self-host. If you modify the source and offer it as SaaS, you must release modifications under AGPL-3.0.
  - Grafana Cloud provides fully managed LGTM: Free ($0), Pro ($19/mo + usage), Enterprise ($25k+/yr)
- Ecosystem & Data Connections:
  - Ingestion protocols: OTLP (gRPC/HTTP), Prometheus `remote_write`, Jaeger, Zipkin, Syslog, Fluent Bit
  - Collection: Grafana Alloy (primary), OpenTelemetry Collector, Prometheus, Promtail (legacy)
  - Storage: S3, GCS, Azure Blob Storage, MinIO (self-hosted)
  - IaC: Helm charts, Terraform provider, Jsonnet/Tanka, Ansible
  - Instrumentation: OpenTelemetry SDKs (Go, Java, Python, Node.js, .NET, Rust), auto-instrumentation agents, eBPF
- Compatibility & Requirements:
  - Runs on Kubernetes (recommended), Docker, or bare-metal Linux
  - Min dev setup: `docker run grafana/otel-lgtm` (single container with all components)
  - Production requires: Kubernetes cluster, object storage, PostgreSQL (for Grafana metadata), Redis (for sessions)
  - Object storage is mandatory for Mimir, Loki, and Tempo in production
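To send telemetry to the single-container dev setup, an application's OpenTelemetry SDK needs to know where the OTLP receiver listens. The sketch below builds the standard OpenTelemetry exporter environment variables and derives the per-signal OTLP/HTTP URLs. It assumes the container was started with the default OTLP/HTTP port 4318 published on the host; the `demo-service` name and both function names are hypothetical.

```python
OTLP_HTTP_PORT = 4318  # default OTLP/HTTP port; assumes it is published by the container
SIGNALS = ("traces", "metrics", "logs")


def otlp_env(host: str = "localhost", service_name: str = "demo-service") -> dict[str, str]:
    """Environment variables read by OpenTelemetry SDK exporters."""
    return {
        "OTEL_EXPORTER_OTLP_ENDPOINT": f"http://{host}:{OTLP_HTTP_PORT}",
        "OTEL_EXPORTER_OTLP_PROTOCOL": "http/protobuf",
        "OTEL_SERVICE_NAME": service_name,  # hypothetical service name
    }


def signal_url(base_endpoint: str, signal: str) -> str:
    """Per-signal OTLP/HTTP URL: the base endpoint plus /v1/<signal>."""
    if signal not in SIGNALS:
        raise ValueError(f"unknown signal: {signal}")
    return f"{base_endpoint.rstrip('/')}/v1/{signal}"


env = otlp_env()
for name in SIGNALS:
    print(signal_url(env["OTEL_EXPORTER_OTLP_ENDPOINT"], name))
```

Setting only the base endpoint is usually enough, since SDKs append the per-signal path themselves; `signal_url` is shown to make that path convention explicit.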
- Alternatives:
  - Datadog — All-in-one SaaS, highest cost, lowest ops burden
  - SigNoz — Open-source, OTel-native, ClickHouse-backed, unified single-binary
  - ELK Stack — Mature for logs, weaker for metrics/traces
  - New Relic — SaaS, generous free tier, proprietary
  - Splunk Observability — Enterprise, very expensive
  - OpenObserve — Open-source, Rust-based, single binary
- Migration & Lock-in Risks:
  - Low lock-in on individual components — each backend uses open storage formats
  - Moderate lock-in on query languages — PromQL is universal, but LogQL, TraceQL, and FlameQL are Grafana-specific (well-documented, but not portable)
  - Gradual migration is supported — run old and new stacks in parallel, move one signal at a time
  - Migration from ELK: KQL/Lucene → LogQL requires query rewriting; Elasticsearch → Loki is a fundamental architecture shift (full-text index → label-only index)
  - Migration from Prometheus + Jaeger: Mimir accepts `remote_write` directly; Tempo accepts Jaeger protocol directly — both are near-drop-in replacements
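For the Prometheus-to-Mimir path, "near-drop-in" mostly means adding a `remote_write` entry to the existing Prometheus configuration so both systems receive data during the parallel run. The sketch below builds such an entry as plain data. The hostname, port, and tenant ID are hypothetical; the `/api/v1/push` path and the `X-Scope-OrgID` tenant header follow Mimir's documented defaults, and in production the header is normally set by an auth proxy rather than by the client.

```python
import json


def remote_write_stanza(mimir_base_url: str, tenant_id: str) -> dict:
    """Prometheus remote_write section pointing at a Mimir push endpoint."""
    return {
        "remote_write": [
            {
                "url": f"{mimir_base_url.rstrip('/')}/api/v1/push",
                # Tenant header; usually injected by an auth proxy in production.
                "headers": {"X-Scope-OrgID": tenant_id},
            }
        ]
    }


# Hypothetical endpoint and tenant, for illustration only.
stanza = remote_write_stanza("http://mimir.example.internal:8080", "team-a")

# JSON is a subset of YAML, so this output can be pasted into prometheus.yml.
print(json.dumps(stanza, indent=2))
```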
- Community Health & Support:
  - Combined GitHub stars across components: 120k+ (Grafana 73k, Loki 28k, Mimir 5k, Tempo 4k, Pyroscope 10k)
  - Battle-tested at: Maersk, DHL Express, Dutch Tax Office, Salesforce, and thousands of organizations
  - Enterprise SLAs via Grafana Labs
  - Active community forums, Slack, regular GrafanaCON conferences
Notes In This Folder
Related Topics
- Grafana — the visualization layer and hub of the LGTM stack
- Victoria Stack — competing full-stack (VictoriaMetrics + VictoriaLogs + VictoriaTraces), Apache 2.0, lower resource footprint
- LGTM vs Victoria Stack — canonical comparison note
- OpenTelemetry — the industry-standard telemetry collection framework used to feed the LGTM stack
- Observability Stacks Comparison — 6-way comparison including Coroot, SigNoz, SkyWalking, OpenObserve
Assets
Store local images, diagrams, and PDFs in the _assets/ subfolder. Prefer Mermaid for inline diagrams.
Next Actions
- Deep dive into Grafana Adaptive Metrics and Adaptive Logs (cost optimization features)
- ~~Research LGTM vs SigNoz comparison note~~ → covered in Observability Stacks Comparison
- Benchmark object storage costs across S3, GCS, and Azure Blob for LGTM workloads
Sources
Primary Sources
| URL | Source Kind | Authority | Retrieved Via | Date |
|---|---|---|---|---|
| https://github.com/grafana/mimir | repository | primary | web search | 2026-04-10 |
| https://github.com/grafana/loki | repository | primary | web search | 2026-04-10 |
| https://github.com/grafana/tempo | repository | primary | web search | 2026-04-10 |
| https://github.com/grafana/pyroscope | repository | primary | web search | 2026-04-10 |
| https://github.com/grafana/alloy | repository | primary | web search | 2026-04-10 |
| https://github.com/grafana/docker-otel-lgtm | repository | primary | web search | 2026-04-10 |
| https://grafana.com/docs/mimir/latest/ | docs | primary | web search | 2026-04-10 |
| https://grafana.com/docs/loki/latest/ | docs | primary | web search | 2026-04-10 |
| https://grafana.com/docs/tempo/latest/ | docs | primary | web search | 2026-04-10 |
| https://grafana.com/docs/pyroscope/latest/ | docs | primary | web search | 2026-04-10 |
| https://grafana.com/docs/alloy/latest/ | docs | primary | web search | 2026-04-10 |
| https://grafana.com/pricing/ | docs | primary | web search | 2026-04-10 |
| https://grafana.com/docs/grafana/latest/datasources/tempo/configure-tempo-data-source/ | docs | primary | web search | 2026-04-10 |
| https://grafana.com/docs/grafana/latest/datasources/loki/configure-loki-data-source/ | docs | primary | web search | 2026-04-10 |
Secondary Sources
| URL | Source Kind | Authority | Retrieved Via | Date |
|---|---|---|---|---|
| Grafana customer case studies (page removed) | case study | primary | web search | 2026-04-10 |
| https://opentelemetry.io/docs/ | docs | primary | web search | 2026-04-10 |
| https://signoz.io/comparisons/ | comparison | secondary | web search | 2026-04-10 |
Community Sources
| URL | Source Kind | Authority | Retrieved Via | Date |
|---|---|---|---|---|
| https://community.grafana.com/ | forum | community | manual | 2026-04-10 |
| https://grafana.com/about/events/grafanacon/ | conference | primary | web search | 2026-04-10 |
| https://play.grafana.org/ | demo | community | manual | 2026-04-10 |
Related Notes
Questions
Open
Answered
- What is the all-in-one dev image for LGTM? — `grafana/otel-lgtm`, includes OTel Collector + all backends + Grafana, resolved in operations
- How does multi-tenancy work across LGTM? — via `X-Scope-OrgID` header on all API calls, requires auth proxy, resolved in observability/lgtm/architecture
- Should LGTM components share an object storage bucket? — No, use separate buckets per component (Mimir, Loki, Tempo), resolved in observability/lgtm/architecture
- How does cross-signal correlation work? — Exemplars (metrics→traces), Derived Fields (logs→traces), Trace-to-Logs/Metrics/Profiles (traces→everything), resolved in observability/lgtm/architecture
- What are the four query languages? — PromQL (metrics), LogQL (logs), TraceQL (traces), FlameQL (profiles), resolved in observability/lgtm/architecture
- How do Grafana Adaptive Metrics and Adaptive Logs actually work internally? What is the ROI in practice? — Adaptive Metrics analyzes query logs to identify unused/low-value series and creates aggregation rules that drop or combine high-cardinality metrics not referenced in dashboards/alerts, typically yielding 20-40%+ reduction in active series. Adaptive Logs applies the same concept to log data, identifying logs never queried or alerted on to reduce ingestion volume. Both are part of Grafana Cloud's cost optimization suite and work by correlating telemetry usage patterns with stored data, resolved in observability/lgtm/operations
- What is the production experience of running LGTM in a single AZ vs multi-AZ? What is the actual cost difference? — Single-AZ is simpler but risks total observability loss during an AZ failure; multi-AZ requires zone-aware replication in Mimir (ingester/store-gateway zones), Loki (ingester zones), and Tempo (live-store across AZs). Multi-AZ roughly doubles compute costs (minimum 2 replicas per component across zones) but uses the same object storage. Operational overhead increases significantly with multi-AZ due to cross-zone networking and quorum management. Most production deployments use multi-AZ for the backends (Mimir/Loki/Tempo) while running Grafana with a managed database, resolved in observability/lgtm/architecture
- How does tail-based sampling in Alloy compare to head-based sampling for Tempo cost optimization? — Alloy supports tail-based sampling via the `otelcol.processor.tail_sampling` component. Strategies like `traceidratio`, `always_on`, and `parentbased_traceidratio` are head-based samplers (configurable in Alloy on the `beyla.ebpf` component, for example), set globally or per-service. Head-based sampling (set by the application at trace start) is simpler but cannot make decisions based on full trace outcome. Tail-based sampling waits for the complete trace to decide whether to keep it (e.g., keep all error traces, sample 10% of success), dramatically reducing storage costs while preserving important traces. Trade-off: tail-based requires buffering traces in memory, increasing agent resource usage, resolved in observability/lgtm/operations
- What is the Tempo 3.0 decoupled architecture, and how does it change deployment? — Tempo 3.0 mandates a Kafka-compatible system (Apache Kafka, Redpanda, or WarpStream) as a durable write-ahead log for both monolithic and microservices modes, decoupling read and write paths. The distributor writes traces to Kafka topics; ingesters consume from Kafka and flush to object storage. This provides backpressure handling, fault tolerance (data buffers in Kafka if ingesters go down), trace replay capability, and independent scaling of write vs read paths. It increases TCO and operational complexity (Kafka dependency) but enables per-component horizontal scaling, resolved in observability/lgtm/architecture
- How does the OTel Operator for Kubernetes compare to Alloy DaemonSet for auto-instrumentation? — The OTel Operator auto-injects instrumentation libraries into application pods via pod annotations (e.g., `instrumentation.opentelemetry.io/inject-java`), supporting Java, Node.js, Python, .NET, Go, Apache HTTPD, and Nginx. It manages OpenTelemetry Collector instances via CRDs. Alloy DaemonSet (using `beyla.ebpf`) provides eBPF-based zero-code auto-instrumentation without modifying pods or requiring annotations. OTel Operator gives language-level instrumentation depth; Alloy gives breadth (any TCP/HTTP/gRPC service) with less application-specific detail. They can complement each other: OTel Operator for deep app traces, Alloy for infrastructure-level auto-discovery, resolved in observability/lgtm/architecture
- What are the concrete limitations of running LGTM in monolithic/SSD (simple scalable deployment) mode at medium scale (1M series, 100 GB/day logs)? — Monolithic mode runs all components in a single process per backend -- suitable for <1M series and <50 GB/day but lacks independent scaling. At 1M series, Mimir monolithic will need significant RAM (32-64 GB) and fast solid-state disks for the store-gateway. Loki monolithic at 100 GB/day needs solid-state disks for the WAL and adequate memory for index queries. Tempo monolithic is limited by trace volume. All three lack HA (single replica = single point of failure). Monolithic mode cannot scale reads independently from writes. Plan migration to microservices/distributed mode when approaching these limits, resolved in observability/lgtm/architecture
- How does SigNoz (ClickHouse-backed, single binary) compare to LGTM in real-world operational overhead? — SigNoz uses ClickHouse as a unified storage backend for metrics, logs, and traces, reducing component sprawl compared to LGTM (which requires separate backends: Mimir + Loki + Tempo). SigNoz's single-binary dev mode and fewer moving parts lower initial operational overhead. However, LGTM components are individually battle-tested at massive scale (Grafana Cloud runs them for thousands of customers) and offer deeper ecosystem integration. SigNoz is younger but operationally simpler for small-medium deployments; LGTM offers more mature scaling and broader community support, resolved in observability/lgtm/index
- What is the migration path from Datadog to self-hosted LGTM, and what are the gotchas (agent migration, dashboard porting, alert porting)? — Key gotchas: (1) Agent: replace Datadog Agent with Grafana Alloy; DogStatsD extended features do not map 1:1; Datadog's 700+ auto-discovery integrations require explicit Prometheus scrape configs. (2) Dashboards: no automated migration tool; rebuild manually in Grafana (different JSON schema). (3) Alerts: rewrite from Datadog's proprietary query language to PromQL/LogQL; alert routing uses Grafana Contact Points instead of Datadog's `@slack-channel` syntax. (4) Tags to labels: sanitize and rename all tags (no dots allowed in Prometheus labels). (5) Trace libraries: switch from `dd-trace-*` to OpenTelemetry SDKs; Datadog's proprietary trace context is incompatible with W3C TraceContext. Strategy: dual-ship during transition, migrate infrastructure metrics first, then app metrics, logs, traces, alerts, dashboards, resolved in observability/lgtm/operations
- How mature is eBPF-based auto-instrumentation for Go applications in production? — Go's non-standard calling convention and static binary compilation make eBPF instrumentation harder than for C/Rust. Go 1.17+ register-based ABI improved but complicated tooling. Network/L3-4 observability (Cilium/Hubble) is very mature and production-proven. Continuous profiling via eBPF (Parca/Pyroscope) is production-ready. Auto-tracing for Go via eBPF (Odigos, Pixie/Beyla) is emerging but has gaps with goroutine-based concurrency and async patterns. Uprobe-based instrumentation can have non-trivial overhead on hot paths. Production concern: uprobe breakpoints use INT3, costly for frequently-called functions. Overall: viable for network-level tracing, but deep function-level Go auto-instrumentation via eBPF is still evolving, resolved in observability/lgtm/architecture
- What are the best practices for Loki structured metadata (introduced in Loki 3.x) vs traditional labels? — Use structured metadata for high-cardinality fields that should not be indexed but need to be queryable (pod names, process IDs, trace IDs, container IDs, user IDs). Keep indexed labels minimal (~10 or fewer per stream) with only low-cardinality values (job, namespace, cluster, container_name). Requires OTel-format ingestion (Alloy or OTel Collector) and `allow_structured_metadata: true` in Loki config. Structured metadata fields are queryable via LogQL without runtime parsing. For 75+ TB/month customers, Bloom filters (Loki 3.3+) utilize structured metadata for accelerated searches. Never use `k8s.pod.name` or `service.instance.id` as indexed labels; move them to structured metadata, resolved in observability/lgtm/operations
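
The labels-versus-structured-metadata rule in the last answer can be expressed as a small triage step: given an observed count of distinct values per field, keep the lowest-cardinality fields as indexed labels, up to the budget of about 10, and send everything else to structured metadata. This is a sketch only; the cardinality threshold of 100, the function name, and the sample counts are assumptions for illustration, not values taken from Loki.

```python
INDEXED_LABEL_BUDGET = 10  # "~10 or fewer per stream", per the guidance above


def split_fields(
    distinct_counts: dict[str, int],
    max_label_cardinality: int = 100,  # hypothetical cut-off, tune per environment
) -> tuple[list[str], list[str]]:
    """Return (indexed_labels, structured_metadata), lowest cardinality first."""
    labels: list[str] = []
    metadata: list[str] = []
    for name, count in sorted(distinct_counts.items(), key=lambda kv: kv[1]):
        if count <= max_label_cardinality and len(labels) < INDEXED_LABEL_BUDGET:
            labels.append(name)
        else:
            metadata.append(name)
    return labels, metadata


# Hypothetical distinct-value counts observed over a sampling window.
observed = {
    "cluster": 3,
    "namespace": 20,
    "job": 40,
    "k8s.pod.name": 5_000,
    "trace_id": 1_000_000,
}

labels, metadata = split_fields(observed)
print(labels)    # ['cluster', 'namespace', 'job']
print(metadata)  # ['k8s.pod.name', 'trace_id']
```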