Architecture
1. Default Topology / Flow
flowchart LR
App["Applications"] --> Alloy["Alloy"]
Alloy --> M["Mimir<br/>(all-in-one)"]
Alloy --> L["Loki<br/>(all-in-one)"]
Alloy --> T["Tempo<br/>(all-in-one)"]
M --> S3["Object Storage"]
L --> S3
T --> S3
G["Grafana"] -.-> M
G -.-> L
G -.-> T
style M fill:#7b42bc,color:#fff
style L fill:#2a7de1,color:#fff
style T fill:#e65100,color:#fff
style G fill:#ff6600,color:#fff
Deployment Modes
All LGTM backends (Mimir, Loki, Tempo) share the same deployment mode philosophy: a single binary with a -target flag that selects which component to run. This gives three deployment tiers:
| Mode | Description | Use Case | Ops Complexity |
| --- | --- | --- | --- |
| Monolithic | All components in one process | Dev, testing, PoC, small prod | Low |
| Simple Scalable (SSD) | Read/Write/Backend targets as separate services | Mid-sized production | Medium |
| Microservices | Each component as independent pod | Large-scale, hyperscale production | High |
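As a rough illustration of how one binary plus a -target flag yields the three tiers, the sketch below builds the command lines a deployment would run. The helper names (`build_command`, `plan_deployment`) are hypothetical, and the microservices component list is simply copied from the Mimir diagram in this document; nothing is validated against a particular release.

```python
from typing import List

# Targets per deployment mode. "all", "read", "write" and "backend" follow the
# table above; the microservices list is illustrative and taken from the Mimir
# diagram in this document.
MODE_TARGETS = {
    "monolithic": ["all"],
    "simple-scalable": ["read", "write", "backend"],
    "microservices": [
        "distributor", "ingester", "query-frontend",
        "querier", "store-gateway", "compactor",
    ],
}


def build_command(binary: str, target: str = "all", config_file: str = "") -> List[str]:
    """Return an argv list such as ["loki", "-target=write"]."""
    if not binary or not target:
        raise ValueError("binary and target must be non-empty")
    argv = [binary, f"-target={target}"]
    if config_file:
        argv.append(f"-config.file={config_file}")
    return argv


def plan_deployment(binary: str, mode: str, config_file: str = "") -> List[List[str]]:
    """Return one command line per process that the chosen mode requires."""
    if mode not in MODE_TARGETS:
        raise ValueError(f"unknown mode: {mode}")
    return [build_command(binary, t, config_file) for t in MODE_TARGETS[mode]]
```

Calling `plan_deployment("loki", "simple-scalable")` yields three command lines, one per read, write, and backend service, which is the whole difference between the tiers from the binary's point of view.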
Monolithic Mode
Simple Scalable Deployment
flowchart TB
subgraph MimirSSD["Mimir (Simple Scalable)"]
MW["Write Path<br/>(distributor + ingester)"]
MR["Read Path<br/>(query-frontend + querier)"]
MB["Backend<br/>(compactor + store-gateway)"]
end
subgraph LokiSSD["Loki (Simple Scalable)"]
LW["Write Path<br/>(distributor + ingester)"]
LR["Read Path<br/>(query-frontend + querier)"]
LB["Backend<br/>(compactor + index-gateway)"]
end
subgraph TempoSSD["Tempo (Simple Scalable)"]
TW["Write Path<br/>(distributor + ingester)"]
TR["Read Path<br/>(query-frontend + querier)"]
TB2["Backend<br/>(compactor)"]
end
S3["Object Storage"]
MW --> S3
LW --> S3
TW --> S3
MR --> S3
LR --> S3
TR --> S3
MB --> S3
LB --> S3
TB2 --> S3
style MimirSSD fill:#7b42bc,color:#fff
style LokiSSD fill:#2a7de1,color:#fff
style TempoSSD fill:#e65100,color:#fff
Microservices Mode (Production)
flowchart TB
subgraph MimirMS["Mimir Microservices"]
MD["Distributor"]
MI["Ingester ×3"]
MQF["Query Frontend"]
MQ["Querier ×2"]
MSG["Store-Gateway ×2"]
MC["Compactor"]
end
subgraph LokiMS["Loki Microservices"]
LD["Distributor"]
LI["Ingester ×3"]
LQF["Query Frontend"]
LQ["Querier ×2"]
LIG["Index Gateway"]
LC["Compactor"]
end
subgraph TempoMS["Tempo Microservices"]
TD["Distributor"]
TI["Ingester ×3"]
TQF["Query Frontend"]
TQ["Querier ×2"]
TMG["Metrics Generator"]
TC["Compactor"]
end
S3["Object Storage<br/>(3 separate buckets)"]
MI --> S3
MSG --> S3
MC --> S3
LI --> S3
LIG --> S3
LC --> S3
TI --> S3
TC --> S3
TMG -->|"remote_write"| MD
style MimirMS fill:#7b42bc,color:#fff
style LokiMS fill:#2a7de1,color:#fff
style TempoMS fill:#e65100,color:#fff
style S3 fill:#0d1117,color:#fff
Full Production Topology
flowchart TB
subgraph Apps["Applications & Infrastructure"]
App["Services<br/>(OTel SDK)"]
K8s["Kubernetes<br/>Nodes"]
end
subgraph Collection["Collection Layer"]
Alloy["Grafana Alloy<br/>(DaemonSet)"]
end
subgraph Backends["LGTM Backends (Microservices)"]
Mimir["Mimir<br/>📊 Metrics"]
Loki["Loki<br/>📝 Logs"]
Tempo["Tempo<br/>🔍 Traces"]
Pyro["Pyroscope<br/>🔥 Profiles"]
end
subgraph Storage["Object Storage"]
B1["mimir-blocks (S3)"]
B2["loki-chunks (S3)"]
B3["tempo-traces (S3)"]
B4["pyroscope-data (S3)"]
end
subgraph Grafana["Grafana (HA)"]
G1["Pod 1"]
G2["Pod 2"]
G3["Pod 3"]
end
subgraph Support["Supporting Services"]
PG["PostgreSQL<br/>(Grafana DB)"]
Redis["Redis<br/>(Sessions)"]
LB["Ingress / LB"]
MC["Memcached<br/>(Query Cache)"]
end
Apps --> Alloy
K8s --> Alloy
Alloy -->|remote_write| Mimir
Alloy -->|push| Loki
Alloy -->|OTLP| Tempo
Alloy -->|push| Pyro
Mimir --> B1
Loki --> B2
Tempo --> B3
Pyro --> B4
LB --> G1 & G2 & G3
G1 --> PG
G1 --> Redis
Mimir --> MC
Loki --> MC
Grafana -.-> Mimir
Grafana -.-> Loki
Grafana -.-> Tempo
Grafana -.-> Pyro
style Apps fill:#0d7377,color:#fff
style Collection fill:#ff6600,color:#fff
style Backends fill:#2a2d3e,color:#fff
style Storage fill:#0d1117,color:#fff
style Grafana fill:#ff6600,color:#fff
style Support fill:#1a1d2e,color:#fff
Object Storage Layout
Critical rule: Each LGTM component must use separate buckets (or at minimum, separate prefixes within a bucket). Never share the same path.
| Component | Recommended Bucket | Contents |
| --- | --- | --- |
| Mimir (blocks) | observability-mimir-blocks | TSDB blocks (2h intervals) |
| Mimir (ruler) | observability-mimir-ruler | Recording and alerting rules |
| Mimir (alertmanager) | observability-mimir-alertmanager | Alertmanager state |
| Loki (chunks + index) | observability-loki-chunks | Compressed log chunks + TSDB index |
| Tempo (traces) | observability-tempo-traces | Parquet trace blocks + bloom filters |
| Pyroscope (profiles) | observability-pyroscope-data | Profile blocks |
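The separation rule above can be checked mechanically before deployment. The following is a minimal sketch, assuming each component's storage location is described as a (bucket, prefix) pair; the function name `find_path_conflicts` and the layout shape are hypothetical and not part of any LGTM tooling.

```python
from typing import Dict, List, Tuple


def find_path_conflicts(layout: Dict[str, Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Return pairs of components whose storage paths overlap.

    layout maps a component name to (bucket, prefix). Two components conflict
    when they use the same bucket and one prefix equals or contains the other.
    An empty prefix means "the whole bucket".
    """
    conflicts = []
    items = sorted(layout.items())
    for i, (name_a, (bucket_a, prefix_a)) in enumerate(items):
        for name_b, (bucket_b, prefix_b) in items[i + 1:]:
            if bucket_a != bucket_b:
                continue
            pa, pb = prefix_a.strip("/"), prefix_b.strip("/")
            nested = (pa + "/").startswith(pb + "/") or (pb + "/").startswith(pa + "/")
            if pa == "" or pb == "" or nested:
                conflicts.append((name_a, name_b))
    return conflicts
```

An empty result means every component has its own bucket or a non-overlapping prefix; any returned pair identifies two components that would write into the same path.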
Kubernetes Deployment Matrix
| Component | Helm Chart | Pods (Min HA) | Scaling Dimension | Key Resource |
| --- | --- | --- | --- | --- |
| Grafana | grafana/grafana | 2–3 | HPA (CPU/mem) | Memory |
| Mimir | grafana/mimir-distributed | 7+ (dist×1, ing×3, q×2, sg×1) | Per-component HPA | Memory (ingesters), CPU (queriers) |
| Loki | grafana/loki | 6+ (dist×1, ing×3, q×2) | Per-component HPA | Memory (ingesters), CPU (queriers) |
| Tempo | grafana/tempo-distributed | 5+ (dist×1, ing×3, q×1) | Per-component HPA | Memory (ingesters) |
| Pyroscope | grafana/pyroscope | 1–3 | Replicas | Memory |
| Alloy | grafana/alloy | 1 per node (DaemonSet) | DaemonSet | CPU, Memory |
| PostgreSQL | External managed | HA pair | Managed service | Disk IOPS |
| Redis | External managed | HA pair | Managed service | Memory |
| Memcached | bitnami/memcached | 2–3 | Replicas | Memory |
Shared Infrastructure Patterns
Hash Rings
Mimir, Loki, and Tempo all use consistent hash rings for sharding data across ingesters. The ring is backed by:
- memberlist (default, gossip-based, no external dependency)
- Consul or etcd (for environments that already run them)
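To make the sharding idea concrete, here is a minimal consistent hash ring sketch. It is not the ring implementation used by Mimir, Loki, or Tempo; the class name, the SHA-256 based token function, the token count, and the default replication factor of 3 are all assumptions chosen for illustration.

```python
import bisect
import hashlib
from typing import Dict, List


class HashRing:
    """Toy consistent hash ring: each instance owns several tokens on a 32-bit ring."""

    def __init__(self, tokens_per_instance: int = 64) -> None:
        self.tokens_per_instance = tokens_per_instance
        self._tokens: List[int] = []      # kept sorted
        self._owner: Dict[int, str] = {}  # token -> instance name

    @staticmethod
    def _hash(key: str) -> int:
        return int.from_bytes(hashlib.sha256(key.encode()).digest()[:4], "big")

    def add(self, instance: str) -> None:
        for i in range(self.tokens_per_instance):
            token = self._hash(f"{instance}-{i}")
            if token in self._owner:
                continue  # skip the rare token collision
            bisect.insort(self._tokens, token)
            self._owner[token] = instance

    def remove(self, instance: str) -> None:
        self._tokens = [t for t in self._tokens if self._owner[t] != instance]
        self._owner = {t: o for t, o in self._owner.items() if o != instance}

    def lookup(self, key: str, replication_factor: int = 3) -> List[str]:
        """Walk clockwise from the key's hash and collect distinct instances."""
        if not self._tokens:
            return []
        start = bisect.bisect_right(self._tokens, self._hash(key))
        owners: List[str] = []
        for offset in range(len(self._tokens)):
            token = self._tokens[(start + offset) % len(self._tokens)]
            owner = self._owner[token]
            if owner not in owners:
                owners.append(owner)
                if len(owners) == replication_factor:
                    break
        return owners
```

The useful property is that removing an instance only moves the keys that instance was responsible for; every other series keeps the same set of ingesters, which is why ring membership changes do not reshuffle all data.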
Caching
| Cache Layer | Purpose | Technology |
| --- | --- | --- |
| Query results cache | Cache query responses | Memcached (recommended) or Redis |
| Chunks cache | Cache data chunks from object storage | Memcached |
| Index cache | Cache index lookups | Memcached |
| Metadata cache | Cache block metadata | Memcached |
Memcached is strongly recommended over Redis for cache layers due to lower latency and simpler scaling.
Data Model
erDiagram
Lgtm_CORE ||--o{ CONFIG : requires
Lgtm_CORE ||--o{ STATE : writes
CONFIG {
string runtime_params
string limits
}
STATE {
string metric_id
json payload
}
How It Works
Core Mechanism
The LGTM stack operates on a collect → route → store → correlate → visualize pipeline. Applications emit telemetry (metrics, logs, traces, profiles) via OpenTelemetry SDKs, which flows through a collection layer (Alloy/OTel Collector) into purpose-built backends, all unified in Grafana for cross-signal analysis.
The Four Pillars + Collection
| Pillar | Component | What It Stores | Key Insight |
| --- | --- | --- | --- |
| Metrics | Mimir | Time-series data (Prometheus format) | Horizontally scalable Prometheus with long-term storage |
| Logs | Loki | Log streams with label metadata | Indexes labels only, not log content; 10–100x cheaper than ELK |
| Traces | Tempo | Distributed trace spans (Parquet) | No dedicated index cluster; relies on object storage, per-block bloom filters, and a columnar format |
| Profiles | Pyroscope | Continuous profiling data (pprof) | Links CPU/memory hotspots to exact lines of code |
| Collection | Alloy | N/A (pipeline agent) | OTel Collector distribution that receives, processes, and routes all signals |
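The routing step of the pipeline can be pictured as a lookup from signal type to backend. The sketch below is illustrative only: the `Route` type and function names are hypothetical, and the protocol labels simply mirror the edge labels used in this document's diagrams rather than real Alloy configuration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Route:
    backend: str
    protocol: str


# Signal -> destination, as drawn in the data-flow diagrams of this document.
ROUTES: Dict[str, Route] = {
    "metrics": Route("Mimir", "remote_write"),
    "logs": Route("Loki", "push"),
    "traces": Route("Tempo", "OTLP"),
    "profiles": Route("Pyroscope", "push"),
}


def route(signal: str) -> Route:
    """Return the destination for one signal type, or fail loudly."""
    try:
        return ROUTES[signal]
    except KeyError:
        raise ValueError(f"unsupported signal type: {signal}") from None


def group_by_backend(records: Iterable[Tuple[str, object]]) -> Dict[str, List[object]]:
    """Group (signal, payload) records into one batch per backend."""
    batches: Dict[str, List[object]] = {}
    for signal, payload in records:
        batches.setdefault(route(signal).backend, []).append(payload)
    return batches
```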
Data Flow
flowchart TB
subgraph Apps["Instrumented Applications"]
direction LR
A1["Service A<br/>(OTel SDK)"]
A2["Service B<br/>(Auto-instrumented)"]
A3["Service C<br/>(Prometheus client)"]
end
subgraph Alloy["Grafana Alloy / OTel Collector"]
direction TB
Recv["Receivers<br/>OTLP gRPC:4317<br/>OTLP HTTP:4318<br/>Prometheus scrape"]
Proc["Processors<br/>Batch · MemoryLimiter<br/>ResourceDetection · Transform"]
Exp["Exporters"]
Recv --> Proc --> Exp
end
subgraph Backends["Signal Backends"]
direction LR
Mimir["Mimir<br/>📊 Metrics<br/>PromQL"]
Loki["Loki<br/>📝 Logs<br/>LogQL"]
Tempo["Tempo<br/>🔍 Traces<br/>TraceQL"]
Pyro["Pyroscope<br/>🔥 Profiles<br/>FlameQL"]
end
subgraph Storage["Object Storage (S3 / GCS / Azure)"]
S3M["Metric TSDB Blocks"]
S3L["Log Chunks + Index"]
S3T["Trace Parquet Blocks"]
S3P["Profile Blocks"]
end
subgraph Grafana["Grafana (Single Pane of Glass)"]
Dash["Dashboards"]
Explore["Explore"]
Alert["Alerting"]
end
Apps -->|OTLP / scrape| Alloy
Alloy -->|remote_write| Mimir
Alloy -->|push| Loki
Alloy -->|OTLP| Tempo
Alloy -->|push| Pyro
Mimir --> S3M
Loki --> S3L
Tempo --> S3T
Pyro --> S3P
Grafana -.->|PromQL| Mimir
Grafana -.->|LogQL| Loki
Grafana -.->|TraceQL| Tempo
Grafana -.->|FlameQL| Pyro
style Apps fill:#0d7377,color:#fff
style Alloy fill:#ff6600,color:#fff
style Backends fill:#2a2d3e,color:#fff
style Storage fill:#0d1117,color:#fff
style Grafana fill:#ff6600,color:#fff
Cross-Signal Correlation
The killer feature of the LGTM stack is seamless navigation between signals. This requires both instrumentation (injecting trace IDs everywhere) and Grafana configuration (linking data sources).
Correlation Matrix
flowchart LR
Metrics["📊 Metrics<br/>(Mimir)"]
Logs["📝 Logs<br/>(Loki)"]
Traces["🔍 Traces<br/>(Tempo)"]
Profiles["🔥 Profiles<br/>(Pyroscope)"]
Metrics -->|"Exemplars<br/>(trace ID on data point)"| Traces
Traces -->|"Trace-to-Logs<br/>(span labels → Loki query)"| Logs
Logs -->|"Derived Fields<br/>(regex extracts trace ID)"| Traces
Traces -->|"Trace-to-Metrics<br/>(span attrs → PromQL)"| Metrics
Traces -->|"Trace-to-Profiles<br/>(span_id → Pyroscope)"| Profiles
Traces -.->|"Span Metrics Generator<br/>(RED metrics → Mimir)"| Metrics
style Metrics fill:#7b42bc,color:#fff
style Logs fill:#2a7de1,color:#fff
style Traces fill:#e65100,color:#fff
style Profiles fill:#c62828,color:#fff
Correlation Configuration Checklist
| Link | From → To | How | Configuration Location |
| --- | --- | --- | --- |
| Exemplars | Metrics → Traces | Trace IDs attached to metric data points | App instrumentation + Mimir/Prometheus backend + Grafana Prometheus DS settings |
| Trace-to-Logs | Traces → Logs | Span labels used to query Loki | Tempo DS → "Trace to logs" section → select Loki DS |
| Derived Fields | Logs → Traces | Regex extracts trace ID from log line | Loki DS → "Derived fields" → regex + internal link to Tempo |
| Trace-to-Metrics | Traces → Metrics | Span attributes mapped to PromQL filter | Tempo DS → "Trace to metrics" → select Mimir DS |
| Trace-to-Profiles | Traces → Profiles | Span ID linked to Pyroscope profile | Tempo DS → "Trace to profiles" → select Pyroscope DS |
| Span Metrics | Traces → Metrics (auto) | Tempo Metrics Generator computes RED | Tempo config: metrics_generator → remote_write to Mimir |
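The Derived Fields link depends on a regex that can pull a trace ID out of a log line. A minimal sketch of that extraction follows, assuming a hypothetical log format in which the ID appears as `trace_id=` followed by 32 lowercase hex characters; adjust the pattern to whatever format the application actually logs.

```python
import re
from typing import Optional

# Assumed log format: ... trace_id=<32 lowercase hex chars> ...
TRACE_ID_PATTERN = re.compile(r"trace_id=([0-9a-f]{32})")


def extract_trace_id(log_line: str) -> Optional[str]:
    """Return the first trace ID found in the line, or None if there is none."""
    match = TRACE_ID_PATTERN.search(log_line)
    return match.group(1) if match else None
```

The same pattern, with its single capture group, is the kind of expression that would go into the Loki data source's derived field; lines without a matching ID simply produce no link.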
Example: Incident Workflow
sequenceDiagram
participant SRE as SRE / On-Call
participant Metrics as Mimir (Metrics)
participant Traces as Tempo (Traces)
participant Logs as Loki (Logs)
participant Profiles as Pyroscope (Profiles)
SRE->>Metrics: Alert fires: error_rate > 5%
SRE->>Metrics: Open dashboard, see spike
SRE->>Metrics: Click exemplar on spike
Metrics-->>Traces: Jump to trace ID abc123
SRE->>Traces: See slow span in payment-service (2.3s)
SRE->>Traces: Click "Trace to Logs"
Traces-->>Logs: Query: {service="payment"} |= "abc123"
SRE->>Logs: See "connection timeout to DB" error
SRE->>Traces: Click "Trace to Profiles"
Traces-->>Profiles: See flame graph for payment-service
SRE->>Profiles: CPU hotspot: connection pool retry loop
SRE->>SRE: Root cause: DB connection pool exhausted
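The "Trace to Logs" jump in the workflow above amounts to building a LogQL query from span labels plus the trace ID. The helper below is a hypothetical sketch of that construction, not Grafana's implementation; it only escapes backslashes and double quotes.

```python
from typing import Mapping


def _escape(value: str) -> str:
    """Escape characters that would break a double-quoted LogQL string."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def trace_to_logs_query(labels: Mapping[str, str], trace_id: str) -> str:
    """Build a LogQL query: a label selector followed by a line filter on the trace ID."""
    if not labels:
        raise ValueError("at least one label matcher is required")
    matchers = ", ".join(
        '{}="{}"'.format(name, _escape(value)) for name, value in sorted(labels.items())
    )
    return '{' + matchers + '} |= "' + _escape(trace_id) + '"'
```

For the incident in the diagram, `trace_to_logs_query({"service": "payment"}, "abc123")` produces the same query string shown on the Traces to Logs arrow.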
Query Languages
The LGTM stack uses four purpose-built query languages, all sharing PromQL's label-matching DNA:
PromQL (Metrics — Mimir)
# Rate of HTTP requests over 5 minutes, grouped by status code
sum(rate(http_requests_total{job="api"}[5m])) by (status)
# 99th percentile latency, aggregated across all series
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
# Alert expression: error rate > 5%
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
LogQL (Logs — Loki)
# Filter error logs from production
{app="payment-service", env="prod"} |= "error" != "timeout"
# Parse JSON logs and filter by status
{app="api"} | json | status >= 500
# Per-second rate of error log lines over a 1-minute window, per pod (metric query)
sum(rate({app="api"} |= "error" [1m])) by (pod)
TraceQL (Traces — Tempo)
# Find traces with HTTP 500 errors
{span.http.status_code = 500}
# Find slow spans in a specific service
{resource.service.name = "checkout" && duration > 2s}
# Find traces where a frontend span has a descendant span in the payment service (use > for direct children only)
{resource.service.name = "frontend"} >> {resource.service.name = "payment"}
FlameQL (Profiles — Pyroscope)
# CPU profiles for a specific service
process_cpu{service_name="payment-service"}
# Memory allocation profiles filtered by environment
memory_alloc{service_name="api", env="production"}
Multi-Tenancy
All LGTM backends support multi-tenancy via the X-Scope-OrgID HTTP header:
flowchart LR
subgraph Clients["Tenants"]
T1["Team Alpha<br/>X-Scope-OrgID: alpha"]
T2["Team Beta<br/>X-Scope-OrgID: beta"]
end
Proxy["Auth Proxy<br/>(NGINX / Envoy)<br/>Validates identity,<br/>injects X-Scope-OrgID"]
subgraph LGTM["LGTM Backends"]
direction TB
M["Mimir"]
L["Loki"]
T["Tempo"]
end
T1 --> Proxy
T2 --> Proxy
Proxy -->|"X-Scope-OrgID: alpha"| LGTM
Proxy -->|"X-Scope-OrgID: beta"| LGTM
style Clients fill:#0d7377,color:#fff
style Proxy fill:#ff6600,color:#fff
style LGTM fill:#2a2d3e,color:#fff
Configuration per component
| Component | Enable Multi-Tenancy | Cross-Tenant Queries |
| --- | --- | --- |
| Mimir | Enabled by default | tenant-federation.enabled=true, use tenant1\|tenant2 in header |
| Loki | auth_enabled: true | multi_tenant_queries_enabled: true |
| Tempo | multitenancy_enabled: true | tenant1\|tenant2 in header |
Critical: Never trust X-Scope-OrgID from end users. Always use an auth proxy that validates identity and injects the correct tenant header.
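The core of that rule is small: discard whatever tenant header the client sent, then set it from the validated identity. The sketch below shows only that step, assuming a hypothetical bearer-token to tenant mapping; a real proxy (NGINX, Envoy) would validate credentials against an identity provider instead.

```python
from typing import Dict, Mapping

TENANT_HEADER = "X-Scope-OrgID"


def build_upstream_headers(
    client_headers: Mapping[str, str],
    tenant_by_token: Mapping[str, str],
) -> Dict[str, str]:
    """Return the headers to forward upstream, with a trusted tenant header.

    tenant_by_token is a stand-in for real identity validation (assumption).
    """
    lowered = {name.lower(): value for name, value in client_headers.items()}
    auth = lowered.get("authorization", "")
    prefix = "Bearer "
    token = auth[len(prefix):] if auth.startswith(prefix) else ""
    tenant = tenant_by_token.get(token)
    if tenant is None:
        raise PermissionError("unknown or missing credentials")

    # Drop any client-supplied tenant header, whatever its letter case.
    upstream = {
        name: value
        for name, value in client_headers.items()
        if name.lower() != TENANT_HEADER.lower()
    }
    upstream[TENANT_HEADER] = tenant
    return upstream
```

A client authenticated as Team Alpha that sends `X-Scope-OrgID: beta` still reaches the backends as `alpha`, and a request without valid credentials never reaches them at all.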
Lifecycle
Signal Lifecycle (Write Path → Read Path)
- Instrumentation — App emits telemetry via OTel SDK
- Collection — Alloy receives, batches, processes, and routes to backends
- Ingestion — Each backend's Distributor validates and shards to Ingesters
- In-Memory Write — Ingesters hold data in memory + WAL for durability
- Flush — Ingesters periodically flush to object storage (2h for Mimir, configurable for Loki/Tempo)
- Compaction — Background compactors merge and optimize stored blocks
- Query — Queriers fetch from both ingesters (recent) and object storage (historical)
- Visualization — Grafana presents data with cross-signal links
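The write and read halves of this lifecycle can be sketched in a few lines. The toy model below is an assumption-heavy simplification: the flush trigger is a sample count rather than a time window, the WAL is an in-memory list, object storage is a plain dict, and all names are hypothetical.

```python
from typing import Dict, List, Tuple

Sample = Tuple[int, float]            # (timestamp, value)
ObjectStore = Dict[str, List[Sample]]  # stand-in for a bucket


class Ingester:
    def __init__(self, flush_after: int) -> None:
        self.flush_after = flush_after  # assumption: count-based, not time-based
        self.wal: List[Tuple[str, Sample]] = []
        self.memory: Dict[str, List[Sample]] = {}

    def push(self, series: str, ts: int, value: float, store: ObjectStore) -> None:
        self.wal.append((series, (ts, value)))  # log first, for durability
        self.memory.setdefault(series, []).append((ts, value))
        if len(self.wal) >= self.flush_after:
            self.flush(store)

    def flush(self, store: ObjectStore) -> None:
        """Move everything held in memory to object storage and truncate the WAL."""
        for series, samples in self.memory.items():
            store.setdefault(series, []).extend(samples)
        self.memory.clear()
        self.wal.clear()


def query(series: str, ingester: Ingester, store: ObjectStore) -> List[Sample]:
    """Merge recent samples (ingester memory) with historical ones (object storage)."""
    historical = store.get(series, [])
    recent = ingester.memory.get(series, [])
    return sorted(set(historical + recent))
```

The `query` function is the reason queriers must talk to both ingesters and object storage: until a flush happens, the newest samples exist only in ingester memory.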
Benchmarks
Test Conditions
- Stack versions: Mimir 3.0.x, Loki 3.7.x, Tempo 2.x, Pyroscope 1.20.x, Grafana 12.4.x
- Date: April 2026
- Note: Benchmarks reflect both published Grafana Labs data and community reports. Individual component benchmarks are covered in detail in the Grafana Benchmarks folder.
Per-Component Throughput
| Component | Metric | Benchmark | Conditions |
| --- | --- | --- | --- |
| Mimir | Active series | 1B+ | Documented by Grafana Labs |
| Mimir | Ingestion rate | 30M+ samples/sec | Multi-tenant, microservices mode |
| Mimir | Storage efficiency | ~1.3 bytes/sample | Compressed TSDB blocks |
| Loki | Ingestion rate | 1 TB+/day | Label-indexed, object storage |
| Loki | Compression ratio | 10–20:1 | Snappy/GZIP on chunks |
| Loki | Query (label-filtered) | Sub-second | When label cardinality < 10k |
| Tempo | Ingestion rate | 100M+ spans/day | Parquet format, object storage |
| Tempo | Trace ID lookup | < 200ms | Direct lookup |
| Tempo | TraceQL search | 1–30s | Depends on time range and selectivity |
| Pyroscope | Profiles ingested | Millions/hour | Low overhead (<1% CPU typically) |
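The ~1.3 bytes/sample figure can be turned into a rough capacity estimate. The calculation below is a back-of-the-envelope sketch: the 15-second scrape interval is an assumption not stated in the table, and index, replication, and pre-compaction overhead are ignored.

```python
def daily_storage_bytes(
    active_series: int,
    scrape_interval_s: float,
    bytes_per_sample: float = 1.3,  # figure from the table above
) -> float:
    """Estimate compressed block bytes written per day for a set of active series."""
    samples_per_series_per_day = 86_400 / scrape_interval_s
    return active_series * samples_per_series_per_day * bytes_per_sample


# Example: 1M active series scraped every 15 s (assumed interval).
estimate_gb = daily_storage_bytes(1_000_000, 15) / 1e9
```

With those assumptions each series produces 5,760 samples per day, so 1M series come to roughly 7.5 GB of compressed blocks per day.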
Stack-Level Benchmarks
End-to-End Latency (Ingestion → Queryable)
| Signal | Typical Latency | Notes |
| --- | --- | --- |
| Metrics (recent) | < 15 seconds | Queried from ingester memory |
| Metrics (historical) | < 5 seconds (cold) | Object storage + cache |
| Logs (recent) | < 10 seconds | Queried from ingester memory |
| Logs (historical) | 1–30 seconds | Depends on time range and label selectivity |
| Traces (by ID) | < 500ms | Bloom filter lookup |
| Traces (search) | 2–60 seconds | Object storage scan |
Scale Limits (Documented Production)
| Organization | Scale | Source |
| --- | --- | --- |
| Salesforce | 70M metrics/min, 120k alerts/min | GrafanaCON |
| Maersk | Enterprise-wide centralized observability | GrafanaCON case study |
| Grafana Cloud | Multi-billion active series globally | Grafana Labs |
Cost Comparison: LGTM vs Alternatives
At 1M Active Series + 100 GB/day Logs + 50M Spans/day
| Stack | Estimated Monthly Cost | Operational Burden | Vendor Lock-in |
| --- | --- | --- | --- |
| Self-hosted LGTM | $1,000–3,000 | High (4+ backends) | Low |
| Grafana Cloud Pro | $1,500–4,000 | Low (managed) | Low-Medium |
| SigNoz (self-hosted) | $500–1,500 | Medium (single binary) | Low |
| Datadog | $5,000–17,000 | Very Low (SaaS) | High |
| New Relic | $2,500–8,000 | Very Low (SaaS) | Medium |
| ELK + Prometheus + Jaeger | $2,000–6,000 | Very High (3 stacks) | Low |
Costs are rough estimates for mid-2026 and vary significantly by cloud provider, retention, and configuration.
Why LGTM Is Cheaper
- Object storage costs pennies — S3 Standard: ~$0.023/GB/month vs EBS gp3: ~$0.08/GB/month
- Label-only indexing (Loki) — 10–100x less storage than Elasticsearch full-text indexing
- No index (Tempo) — traces stored as Parquet on object storage, no expensive index cluster
- Compression — Loki achieves 10–20:1 compression on log chunks
- Open source — no per-host, per-user, or per-GB licensing fees
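The storage-related points above combine into simple arithmetic. The sketch below uses the per-GB prices and the 10:1 compression ratio quoted in this section; the 30-day retention is an assumption, and request charges, replication, and index storage are left out.

```python
def monthly_storage_cost(
    raw_gb_per_day: float,
    retention_days: int,
    compression_ratio: float,
    price_per_gb_month: float,
) -> float:
    """Steady-state monthly cost of keeping `retention_days` of data."""
    stored_gb = raw_gb_per_day * retention_days / compression_ratio
    return stored_gb * price_per_gb_month


# 100 GB/day of logs, 30-day retention (assumed).
loki_on_s3 = monthly_storage_cost(100, 30, 10, 0.023)        # compressed chunks on S3
uncompressed_on_gp3 = monthly_storage_cost(100, 30, 1, 0.08)  # raw data on EBS gp3
```

Under these assumptions the compressed chunks on S3 cost about $6.90 per month against about $240 for the same logs held uncompressed on gp3 volumes, which shows why storage is rarely the dominant line item in a self-hosted LGTM bill.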
Caveats
- These benchmarks assume microservices mode for production loads
- Loki performance degrades severely with high label cardinality (> 50k active streams)
- Tempo TraceQL search is slower than indexed alternatives (Jaeger/Elasticsearch) but dramatically cheaper
- Object storage latency varies by cloud provider and region; keep compute in the same region as its buckets
- Memcached caching is essential for production query performance — without it, queries hit object storage directly
Sources
| URL | Source Kind | Authority | Date |
| --- | --- | --- | --- |
| https://grafana.com/docs/mimir/latest/references/architecture/ | docs | primary | 2026-04-10 |
| https://grafana.com/docs/loki/latest/get-started/overview/ | docs | primary | 2026-04-10 |
| https://grafana.com/docs/tempo/latest/getting-started/ | docs | primary | 2026-04-10 |
| https://grafana.com/pricing/ | docs | primary | 2026-04-10 |
| https://grafana.com/about/events/grafanacon/ | conference | primary | 2026-04-10 |