How It Works

Core Mechanism

The LGTM stack operates on a collect → route → store → correlate → visualize pipeline. Applications emit telemetry (metrics, logs, traces, profiles) via OpenTelemetry SDKs, which flows through a collection layer (Alloy/OTel Collector) into purpose-built backends, all unified in Grafana for cross-signal analysis.

The Four Pillars + Collection

| Pillar | Component | What It Stores | Key Insight |
|---|---|---|---|
| Metrics | Mimir | Time-series data (Prometheus format) | Horizontally scalable Prometheus with long-term storage |
| Logs | Loki | Log streams with label metadata | Indexes labels only, not log content, typically 10–100x cheaper than ELK |
| Traces | Tempo | Distributed trace spans (Parquet) | No index at all; relies on object storage + columnar format |
| Profiles | Pyroscope | Continuous profiling data (pprof) | Links CPU/memory hotspots to exact lines of code |
| Collection | Alloy | N/A (pipeline agent) | OTel Collector distribution that receives, processes, and routes all signals |

Data Flow

```mermaid
flowchart TB
    subgraph Apps["Instrumented Applications"]
        direction LR
        A1["Service A<br/>(OTel SDK)"]
        A2["Service B<br/>(Auto-instrumented)"]
        A3["Service C<br/>(Prometheus client)"]
    end

    subgraph Alloy["Grafana Alloy / OTel Collector"]
        direction TB
        Recv["Receivers<br/>OTLP gRPC:4317<br/>OTLP HTTP:4318<br/>Prometheus scrape"]
        Proc["Processors<br/>Batch · MemoryLimiter<br/>ResourceDetection · Transform"]
        Exp["Exporters"]
        Recv --> Proc --> Exp
    end

    subgraph Backends["Signal Backends"]
        direction LR
        Mimir["Mimir<br/>📊 Metrics<br/>PromQL"]
        Loki["Loki<br/>📝 Logs<br/>LogQL"]
        Tempo["Tempo<br/>🔍 Traces<br/>TraceQL"]
        Pyro["Pyroscope<br/>🔥 Profiles<br/>FlameQL"]
    end

    subgraph Storage["Object Storage (S3 / GCS / Azure)"]
        S3M["Metric TSDB Blocks"]
        S3L["Log Chunks + Index"]
        S3T["Trace Parquet Blocks"]
        S3P["Profile Blocks"]
    end

    subgraph Grafana["Grafana (Single Pane of Glass)"]
        Dash["Dashboards"]
        Explore["Explore"]
        Alert["Alerting"]
    end

    Apps -->|OTLP / scrape| Alloy
    Alloy -->|remote_write| Mimir
    Alloy -->|push| Loki
    Alloy -->|OTLP| Tempo
    Alloy -->|push| Pyro

    Mimir --> S3M
    Loki --> S3L
    Tempo --> S3T
    Pyro --> S3P

    Grafana -.->|PromQL| Mimir
    Grafana -.->|LogQL| Loki
    Grafana -.->|TraceQL| Tempo
    Grafana -.->|FlameQL| Pyro

    style Apps fill:#0d7377,color:#fff
    style Alloy fill:#ff6600,color:#fff
    style Backends fill:#2a2d3e,color:#fff
    style Storage fill:#0d1117,color:#fff
    style Grafana fill:#ff6600,color:#fff
```
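The Alloy box in the diagram maps to a handful of components in Alloy's configuration language. Below is a minimal sketch of that receive → batch → export pipeline; the component labels and endpoints (`mimir:9009`, `loki:3100`, `tempo:4317`) are illustrative placeholders, so verify them against the Alloy component reference for the version you run:

```alloy
// Receivers: accept OTLP over gRPC and HTTP from instrumented apps
otelcol.receiver.otlp "default" {
  grpc { endpoint = "0.0.0.0:4317" }
  http { endpoint = "0.0.0.0:4318" }

  output {
    metrics = [otelcol.processor.batch.default.input]
    logs    = [otelcol.processor.batch.default.input]
    traces  = [otelcol.processor.batch.default.input]
  }
}

// Processor: batch telemetry before export, then fan out per signal
otelcol.processor.batch "default" {
  output {
    metrics = [otelcol.exporter.prometheus.to_mimir.input]
    logs    = [otelcol.exporter.loki.to_loki.input]
    traces  = [otelcol.exporter.otlp.to_tempo.input]
  }
}

// Metrics: convert OTLP to Prometheus and remote_write to Mimir
otelcol.exporter.prometheus "to_mimir" {
  forward_to = [prometheus.remote_write.mimir.receiver]
}
prometheus.remote_write "mimir" {
  endpoint { url = "http://mimir:9009/api/v1/push" }
}

// Logs: push to Loki
otelcol.exporter.loki "to_loki" {
  forward_to = [loki.write.loki.receiver]
}
loki.write "loki" {
  endpoint { url = "http://loki:3100/loki/api/v1/push" }
}

// Traces: forward OTLP straight to Tempo
otelcol.exporter.otlp "to_tempo" {
  client {
    endpoint = "tempo:4317"
    tls { insecure = true }  // in-cluster traffic; use TLS in production
  }
}
```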

Cross-Signal Correlation

The killer feature of the LGTM stack is seamless navigation between signals. This requires both instrumentation (injecting trace IDs everywhere) and Grafana configuration (linking data sources).
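On the instrumentation side, the metrics → traces link relies on exemplars: the client library attaches a trace ID to individual observations, which surfaces in the OpenMetrics exposition format as a `#`-suffixed annotation on the sample. A sketch (metric name, trace ID, and values are made up):

```text
# A histogram bucket with an exemplar: everything after '#' records one
# concrete request that landed in this bucket — its trace_id, observed
# value, and timestamp — which Grafana renders as a clickable dot.
http_request_duration_seconds_bucket{le="0.5"} 1027 # {trace_id="abc123def456"} 0.32 1700000000.0
```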

Correlation Matrix

```mermaid
flowchart LR
    Metrics["📊 Metrics<br/>(Mimir)"]
    Logs["📝 Logs<br/>(Loki)"]
    Traces["🔍 Traces<br/>(Tempo)"]
    Profiles["🔥 Profiles<br/>(Pyroscope)"]

    Metrics -->|"Exemplars<br/>(trace ID on data point)"| Traces
    Traces -->|"Trace-to-Logs<br/>(span labels → Loki query)"| Logs
    Logs -->|"Derived Fields<br/>(regex extracts trace ID)"| Traces
    Traces -->|"Trace-to-Metrics<br/>(span attrs → PromQL)"| Metrics
    Traces -->|"Trace-to-Profiles<br/>(span_id → Pyroscope)"| Profiles
    Traces -.->|"Span Metrics Generator<br/>(RED metrics → Mimir)"| Metrics

    style Metrics fill:#7b42bc,color:#fff
    style Logs fill:#2a7de1,color:#fff
    style Traces fill:#e65100,color:#fff
    style Profiles fill:#c62828,color:#fff
```

Correlation Configuration Checklist

| Link | From → To | How | Configuration Location |
|---|---|---|---|
| Exemplars | Metrics → Traces | Trace IDs attached to metric data points | App instrumentation + Mimir/Prometheus backend + Grafana Prometheus DS settings |
| Trace-to-Logs | Traces → Logs | Span labels used to query Loki | Tempo DS → "Trace to logs" section → select Loki DS |
| Derived Fields | Logs → Traces | Regex extracts trace ID from log line | Loki DS → "Derived fields" → regex + internal link to Tempo |
| Trace-to-Metrics | Traces → Metrics | Span attributes mapped to PromQL filter | Tempo DS → "Trace to metrics" → select Mimir DS |
| Trace-to-Profiles | Traces → Profiles | Span ID linked to Pyroscope profile | Tempo DS → "Trace to profiles" → select Pyroscope DS |
| Span Metrics | Traces → Metrics (auto) | Tempo Metrics Generator computes RED metrics | Tempo config: metrics_generator → remote_write to Mimir |
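The "Derived Fields" row reduces to a regex: Grafana applies your pattern to each log line, and the first capture group becomes the trace ID behind the Tempo link. A minimal sketch of that extraction, using a hypothetical log line and a pattern of the kind you would paste into the Loki data source settings:

```python
import re

# Hypothetical log line from an instrumented service; the traceID=
# key is whatever your logging middleware actually emits.
log_line = 'level=error msg="connection timeout to DB" traceID=abc123def456'

# The capture group is what Grafana's derived field turns into a
# Tempo deep link.
pattern = re.compile(r'traceID=(\w+)')

match = pattern.search(log_line)
trace_id = match.group(1) if match else None
print(trace_id)  # → abc123def456
```

If the regex fails to match (for example, logs written before trace propagation was enabled), Grafana simply renders no link, which is why guarding for `None` here mirrors real behavior.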

Example: Incident Workflow

```mermaid
sequenceDiagram
    participant SRE as SRE / On-Call
    participant Metrics as Mimir (Metrics)
    participant Traces as Tempo (Traces)
    participant Logs as Loki (Logs)
    participant Profiles as Pyroscope (Profiles)

    SRE->>Metrics: Alert fires: error_rate > 5%
    SRE->>Metrics: Open dashboard, see spike
    SRE->>Metrics: Click exemplar on spike
    Metrics-->>Traces: Jump to trace ID abc123
    SRE->>Traces: See slow span in payment-service (2.3s)
    SRE->>Traces: Click "Trace to Logs"
    Traces-->>Logs: Query: {service="payment"} |= "abc123"
    SRE->>Logs: See "connection timeout to DB" error
    SRE->>Traces: Click "Trace to Profiles"
    Traces-->>Profiles: See flame graph for payment-service
    SRE->>Profiles: CPU hotspot: connection pool retry loop
    SRE->>SRE: Root cause: DB connection pool exhausted
```

Query Languages

The LGTM stack uses four purpose-built query languages, all sharing PromQL's label-matching DNA:

PromQL (Metrics — Mimir)

```promql
# Rate of HTTP requests over 5 minutes, grouped by status code
sum(rate(http_requests_total{job="api"}[5m])) by (status)

# 99th percentile latency
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))

# Alert expression: error rate > 5%
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
```

LogQL (Logs — Loki)

```logql
# Filter error logs from production
{app="payment-service", env="prod"} |= "error" != "timeout"

# Parse JSON logs and filter by status
{app="api"} | json | status >= 500

# Count error logs per minute (metric query)
sum(rate({app="api"} |= "error" [1m])) by (pod)
```

TraceQL (Traces — Tempo)

```traceql
# Find traces with HTTP 500 errors
{span.http.status_code = 500}

# Find slow spans in a specific service
{resource.service.name = "checkout" && duration > 2s}

# Find traces where parent and child spans are in different services
{resource.service.name = "frontend"} >> {resource.service.name = "payment"}
```

FlameQL (Profiles — Pyroscope)

```
# CPU profiles for a specific service
process_cpu{service_name="payment-service"}

# Memory allocation profiles filtered by environment
memory_alloc{service_name="api", env="production"}
```

Multi-Tenancy

All LGTM backends support multi-tenancy via the X-Scope-OrgID HTTP header:

```mermaid
flowchart LR
    subgraph Clients["Tenants"]
        T1["Team Alpha<br/>X-Scope-OrgID: alpha"]
        T2["Team Beta<br/>X-Scope-OrgID: beta"]
    end

    Proxy["Auth Proxy<br/>(NGINX / Envoy)<br/>Validates identity,<br/>injects X-Scope-OrgID"]

    subgraph LGTM["LGTM Backends"]
        direction TB
        M["Mimir"]
        L["Loki"]
        T["Tempo"]
    end

    T1 --> Proxy
    T2 --> Proxy
    Proxy -->|"X-Scope-OrgID: alpha"| LGTM
    Proxy -->|"X-Scope-OrgID: beta"| LGTM

    style Clients fill:#0d7377,color:#fff
    style Proxy fill:#ff6600,color:#fff
    style LGTM fill:#2a2d3e,color:#fff
```
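On the wire, tenancy is nothing more than one HTTP header on every write and query. A minimal sketch of what the proxy ultimately sends, built with the standard library (URL and payload are placeholders; in practice the proxy, never the client, sets this header):

```python
from urllib import request

# Build (but don't send) a Loki push request scoped to tenant "alpha".
req = request.Request(
    "http://loki.example.internal:3100/loki/api/v1/push",
    data=b'{"streams": []}',
    headers={
        "Content-Type": "application/json",
        "X-Scope-OrgID": "alpha",  # the tenant boundary, nothing more
    },
    method="POST",
)

# urllib normalizes stored header names to "X-scope-orgid"
print(req.get_header("X-scope-orgid"))  # → alpha
```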

Configuration per component

| Component | Enable Multi-Tenancy | Cross-Tenant Queries |
|---|---|---|
| Mimir | Enabled by default | `tenant-federation.enabled=true`, use `tenant1\|tenant2` in header |
| Loki | `auth_enabled: true` | `multi_tenant_queries_enabled: true` |
| Tempo | `multitenancy_enabled: true` | `tenant1\|tenant2` in header |
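Pulled together, the "enable" column amounts to one key per config file. A sketch with the three files collapsed into one annotated block (surrounding config omitted; verify the keys against the versions you deploy):

```yaml
# loki.yaml — reject any request that lacks a tenant header
auth_enabled: true

# tempo.yaml — same idea, different key name
multitenancy_enabled: true

# mimir.yaml — multi-tenancy is already on by default;
# this additionally allows cross-tenant (federated) queries
tenant_federation:
  enabled: true
```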

Critical: Never trust X-Scope-OrgID from end users. Always use an auth proxy that validates identity and injects the correct tenant header.

Lifecycle

Signal Lifecycle (Write Path → Read Path)

  1. Instrumentation — App emits telemetry via OTel SDK
  2. Collection — Alloy receives, batches, processes, and routes to backends
  3. Ingestion — Each backend's Distributor validates and shards to Ingesters
  4. In-Memory Write — Ingesters hold data in memory + WAL for durability
  5. Flush — Ingesters periodically flush to object storage (2h for Mimir, configurable for Loki/Tempo)
  6. Compaction — Background compactors merge and optimize stored blocks
  7. Query — Queriers fetch from both ingesters (recent) and object storage (historical)
  8. Visualization — Grafana presents data with cross-signal links
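Steps 4–6 (in-memory write, WAL, flush) can be sketched as a toy ingester. Everything here is illustrative, not real Mimir/Loki/Tempo internals: the point is the ordering — WAL append before the in-memory write, and flush to (object) storage only when a block fills up:

```python
import json
import os
import tempfile

class ToyIngester:
    """Toy model of the ingester write path: buffer samples in memory,
    append each write to a WAL for crash recovery, and 'flush' full
    blocks to object storage (modeled here as a plain list)."""

    def __init__(self, wal_path, flush_threshold=3):
        self.wal_path = wal_path
        self.memory = []            # recent samples, queryable immediately
        self.flush_threshold = flush_threshold
        self.flushed_blocks = []    # stand-in for S3/GCS block uploads

    def write(self, sample):
        with open(self.wal_path, "a") as wal:     # durability first:
            wal.write(json.dumps(sample) + "\n")  # WAL before memory
        self.memory.append(sample)
        if len(self.memory) >= self.flush_threshold:
            self.flush()

    def flush(self):
        self.flushed_blocks.append(list(self.memory))  # "upload" a block
        self.memory.clear()                            # free the buffer

wal_path = os.path.join(tempfile.mkdtemp(), "wal.log")
ing = ToyIngester(wal_path)
for v in range(5):
    ing.write({"metric": "http_requests_total", "value": v})

# One full block flushed, two samples still queryable from memory,
# all five writes preserved in the WAL for crash recovery.
print(len(ing.flushed_blocks), len(ing.memory))  # → 1 2
```

The query path in step 7 follows directly from this split: queriers must merge `memory` (recent) with `flushed_blocks` (historical) to answer a single query.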