Architecture¶
1. Default Topology / Flow¶
flowchart TB
subgraph Frontend["Frontend (TypeScript / React)"]
direction LR
DashUI["Dashboard UI"]
ExploreUI["Explore"]
AlertUI["Alerting UI"]
PluginUI["Panel & App Plugins"]
end
subgraph Backend["Backend (Go)"]
direction LR
API["HTTP API Server"]
Auth["Auth & RBAC"]
QEngine["Query Engine"]
AlertEng["Alert Rule Evaluator"]
Prov["Provisioning Engine"]
PluginMgr["Plugin Manager<br/>(gRPC subprocess host)"]
end
subgraph State["State Layer"]
DB["Database<br/>(PostgreSQL / MySQL / SQLite)"]
Cache["Session Cache<br/>(Redis / Memcached)"]
end
subgraph External["External Data Sources"]
Prom["Prometheus / Mimir"]
LokiDS["Loki"]
TempoDS["Tempo"]
SQL["MySQL / PostgreSQL"]
ES["Elasticsearch"]
CW["CloudWatch"]
end
Frontend --> API
API --> Auth
API --> QEngine
API --> AlertEng
API --> Prov
QEngine --> PluginMgr
PluginMgr -->|gRPC| External
Auth --> DB
AlertEng --> DB
Prov --> DB
Auth --> Cache
style Frontend fill:#ff6600,color:#fff
style Backend fill:#2a2d3e,color:#fff
style State fill:#1a1d2e,color:#fff
style External fill:#0d7377,color:#fff
Grafana Server Components¶
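The Grafana server itself is a stateless web application. Its internal layers (HTTP API server, auth and RBAC, query engine, alert rule evaluator, provisioning engine, and plugin manager) are shown in the Backend group of the diagram above.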
Key Architectural Properties¶
| Property | Detail |
|---|---|
| Stateless server | All persistent state lives in the external database and session cache |
| Plugin isolation | Backend plugins run as gRPC subprocesses |
| Provisioning | Dashboards, data sources, alerts loaded from YAML/JSON at startup |
| Multi-org | Single Grafana instance, multiple isolated organizations |
| API-first | All UI operations have corresponding REST API endpoints |
LGTM Stack — Full Production Architecture¶
flowchart TB
subgraph Apps["Instrumented Applications"]
App1["Service A<br/>(OTel SDK)"]
App2["Service B<br/>(OTel SDK)"]
App3["Service C<br/>(Prometheus client)"]
end
subgraph Infra["Infrastructure"]
K8s["Kubernetes"]
Nodes["VM / Bare Metal"]
end
subgraph Collection["Grafana Alloy (DaemonSet / Sidecar)"]
Recv["Receivers<br/>OTLP, Prometheus, Syslog"]
Proc["Processors<br/>Batch, MemoryLimiter, ResourceDetection"]
Exp["Exporters"]
end
subgraph Mimir["Grafana Mimir"]
MD["Distributor"]
MI["Ingester"]
MQ["Querier"]
MSg["Store-Gateway"]
MC["Compactor"]
end
subgraph Loki["Grafana Loki"]
LD["Distributor"]
LI["Ingester"]
LQ["Querier"]
LQF["Query Frontend"]
LC["Compactor"]
end
subgraph Tempo["Grafana Tempo"]
TD["Distributor"]
TI["Ingester"]
TQ["Querier"]
TQF["Query Frontend"]
TMG["Metrics Generator"]
end
subgraph ObjStore["Object Storage (S3 / GCS / Azure)"]
Blocks["Metric Blocks"]
Chunks["Log Chunks + Index"]
Traces["Trace Blocks (Parquet)"]
end
subgraph Grafana["Grafana Server (HA)"]
GF1["Grafana Pod 1"]
GF2["Grafana Pod 2"]
GFn["Grafana Pod N"]
end
subgraph Supporting
PG["PostgreSQL<br/>(Grafana metadata DB)"]
Redis["Redis<br/>(Session cache)"]
LB["Load Balancer / Ingress"]
end
Apps --> Collection
Infra --> Collection
Collection -->|remote_write| MD
Collection -->|push| LD
Collection -->|OTLP gRPC| TD
MD --> MI
MI --> Blocks
MQ --> MI
MQ --> MSg
MSg --> Blocks
MC --> Blocks
LD --> LI
LI --> Chunks
LQF --> LQ
LQ --> LI
LQ --> Chunks
LC --> Chunks
TD --> TI
TI --> Traces
TQF --> TQ
TQ --> TI
TQ --> Traces
TMG --> MD
GF1 --> PG
GF2 --> PG
GFn --> PG
GF1 --> Redis
GF2 --> Redis
GFn --> Redis
LB --> GF1
LB --> GF2
LB --> GFn
Grafana -.->|PromQL| MQ
Grafana -.->|LogQL| LQF
Grafana -.->|TraceQL| TQF
style Apps fill:#0d7377,color:#fff
style Infra fill:#0d7377,color:#fff
style Collection fill:#ff6600,color:#fff
style Mimir fill:#7b42bc,color:#fff
style Loki fill:#2a7de1,color:#fff
style Tempo fill:#e65100,color:#fff
style ObjStore fill:#0d1117,color:#fff
style Grafana fill:#ff6600,color:#fff
style Supporting fill:#1a1d2e,color:#fff
Mimir Architecture (Metrics)¶
flowchart LR
subgraph Write["Write Path"]
D["Distributor<br/>(validates, shards, replicates)"]
I["Ingester<br/>(in-memory TSDB + WAL)"]
end
subgraph Read["Read Path"]
QF["Query Frontend<br/>(splits, caches, queues)"]
Q["Querier<br/>(executes PromQL)"]
SG["Store-Gateway<br/>(indexes object storage)"]
end
subgraph Background["Background"]
C["Compactor<br/>(vertical + horizontal compaction)"]
end
subgraph Storage["Object Storage"]
OS["S3 / GCS / Azure<br/>(TSDB Blocks)"]
end
Prom["Prometheus / Alloy"] -->|remote_write| D
D -->|hash ring| I
I -->|flush every 2h| OS
QF --> Q
Q -->|recent data| I
Q -->|historical data| SG
SG --> OS
C --> OS
style Write fill:#7b42bc,color:#fff
style Read fill:#2a7de1,color:#fff
style Background fill:#1a1d2e,color:#fff
style Storage fill:#0d1117,color:#fff
Deployment Modes¶
| Mode | Description | Use Case |
|---|---|---|
| Monolithic | All components in a single process/pod | Dev, testing, small scale |
| Read-Write | Separate read and write paths | Medium scale |
| Microservices | Each component as independent pods | Production, hyperscale |
Loki Architecture (Logs)¶
flowchart LR
subgraph Write["Write Path"]
LD["Distributor<br/>(validates, routes by label hash)"]
LI["Ingester<br/>(compresses into chunks, indexes labels)"]
end
subgraph Read["Read Path"]
LQF["Query Frontend<br/>(splits time ranges, queues)"]
LQ["Querier<br/>(executes LogQL)"]
LIG["Index Gateway<br/>(metadata lookups)"]
end
subgraph Background["Background"]
LC["Compactor<br/>(merges index files, retention)"]
end
subgraph Storage["Object Storage"]
LOS["S3 / GCS / Azure<br/>(Chunks + Index)"]
end
Alloy["Alloy / Promtail"] -->|push| LD
LD --> LI
LI -->|flush| LOS
LQF --> LQ
LQ -->|recent| LI
LQ -->|historical| LIG
LIG --> LOS
LC --> LOS
style Write fill:#2a7de1,color:#fff
style Read fill:#0d7377,color:#fff
style Background fill:#1a1d2e,color:#fff
style Storage fill:#0d1117,color:#fff
Key Design Choice: Loki indexes only labels, not log content. This makes it substantially cheaper to operate than full-text-indexing alternatives such as Elasticsearch (the Benchmarks section cites up to 90% lower storage cost), but it requires effective label design.
Tempo Architecture (Traces)¶
flowchart LR
subgraph Write["Write Path"]
TD["Distributor<br/>(OTLP, Jaeger, Zipkin)"]
TI["Ingester<br/>(Parquet columns + bloom filters)"]
end
subgraph Read["Read Path"]
TQF["Query Frontend<br/>(splits, shards)"]
TQ["Querier<br/>(TraceQL engine)"]
end
subgraph SideEffects["Side Effects"]
TMG["Metrics Generator<br/>(RED metrics → Mimir)"]
end
subgraph Storage["Object Storage"]
TOS["S3 / GCS / Azure<br/>(Parquet Trace Blocks)"]
end
OTel["Apps (OTel SDK)"] -->|OTLP| TD
TD --> TI
TD --> TMG
TI -->|flush blocks| TOS
TQF --> TQ
TQ --> TI
TQ --> TOS
TMG -->|remote_write| Mimir["Mimir"]
style Write fill:#e65100,color:#fff
style Read fill:#ff6600,color:#fff
style SideEffects fill:#7b42bc,color:#fff
style Storage fill:#0d1117,color:#fff
Key Design Choice: No traditional index — Tempo uses Parquet columnar storage with bloom filters. TraceQL queries selectively load required columns, making large-scale trace search performant.
Kubernetes Deployment Topology¶
A typical production Grafana + LGTM deployment on Kubernetes uses these Helm charts:
| Component | Helm Chart | Min Replicas | Scaling |
|---|---|---|---|
| Grafana | grafana/grafana | 2+ (HA) | HPA on CPU/memory |
| Mimir | grafana/mimir-distributed | 3+ ingesters | Per-component HPA |
| Loki | grafana/loki | 3+ ingesters | Per-component HPA |
| Tempo | grafana/tempo-distributed | 3+ ingesters | Per-component HPA |
| Alloy | grafana/alloy (DaemonSet) | 1 per node | DaemonSet auto-scales |
| PostgreSQL | External managed (RDS/CloudSQL) | HA pair | Managed service |
| Redis | External managed (ElastiCache) | HA pair | Managed service |
Data Model¶
1. Default Topology / Flow¶
erDiagram
Grafana_CORE ||--o{ CONFIG : requires
Grafana_CORE ||--o{ STATE : writes
CONFIG {
string runtime_params
string limits
}
STATE {
string metric_id
json payload
}
How It Works¶
Core Mechanism¶
Grafana is fundamentally a query, transform, and visualize engine. It does not store time-series data itself; its own database holds only configuration and alert state. Instead, it proxies queries to external data sources and renders the results in a browser.
Request Lifecycle¶
- User opens a dashboard → browser loads the dashboard JSON model
- Panel queries are dispatched → each panel sends its query to the Grafana backend
- Backend proxies to data source → Grafana translates the query and forwards it to the appropriate backend (Prometheus, Loki, SQL, etc.) using the configured data-source plugin
- Results are returned → data frames are sent back to the frontend
- Frontend renders → React-based panel plugins render the visualization
sequenceDiagram
participant User as Browser
participant GF as Grafana Server
participant DS as Data Source<br/>(Prometheus, Loki, etc.)
User->>GF: Open Dashboard
GF-->>User: Dashboard JSON + Panel Config
User->>GF: Execute Panel Queries
GF->>DS: Proxy Query (PromQL, LogQL, SQL...)
DS-->>GF: Data Frames / Results
GF-->>User: Transformed Data
User->>User: Render Visualization
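The "Execute Panel Queries" step above can be sketched as the request body a panel sends to the Grafana backend. This is a minimal sketch, assuming the `/api/ds/query` HTTP endpoint and a Prometheus data source; the helper name `build_query_request` and the data source UID `prom-main` are hypothetical and only illustrate the shape of the request.

```python
import json


def build_query_request(datasource_uid, expr, time_from="now-1h", time_to="now",
                        max_data_points=500, interval_ms=15000):
    """Build the JSON body for a single panel query (hypothetical helper)."""
    return {
        "from": time_from,
        "to": time_to,
        "queries": [
            {
                "refId": "A",                           # identifies the query within the panel
                "datasource": {"uid": datasource_uid},  # tells the backend which plugin to use
                "expr": expr,                           # PromQL expression (Prometheus-specific field)
                "maxDataPoints": max_data_points,
                "intervalMs": interval_ms,
            }
        ],
    }


# "prom-main" is an assumed data source UID, not one from this document.
body = build_query_request("prom-main", "rate(http_requests_total[5m])")
print(json.dumps(body, indent=2))
```

The backend resolves the data source UID to a plugin, forwards the query, and returns the result as data frames.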
Data Frames¶
Grafana uses a unified data abstraction called Data Frames — typed, columnar data structures (similar to Pandas DataFrames) that all data-source plugins must return. This abstraction lets any panel plugin render data from any source without tight coupling.
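To make the columnar idea concrete, here is a minimal Python sketch of a data-frame-like structure. It is an illustration only, not Grafana's actual implementation; the class names `Field` and `DataFrame` and the type strings are assumptions chosen for readability.

```python
from dataclasses import dataclass, field
from typing import Any, List, Tuple


@dataclass
class Field:
    """One typed column of values."""
    name: str
    type: str            # e.g. "time", "number", "string" (illustrative type names)
    values: List[Any]


@dataclass
class DataFrame:
    """A named collection of equal-length columns."""
    name: str
    fields: List[Field] = field(default_factory=list)

    def __post_init__(self):
        # Columnar invariant: every field holds the same number of values.
        lengths = {len(f.values) for f in self.fields}
        if len(lengths) > 1:
            raise ValueError("all fields must have the same length")

    def __len__(self) -> int:
        return len(self.fields[0].values) if self.fields else 0

    def rows(self) -> List[Tuple[Any, ...]]:
        """Pivot the columns into rows, e.g. for a table panel."""
        return list(zip(*(f.values for f in self.fields)))


frame = DataFrame("cpu", [
    Field("time", "time", [1000, 2000]),
    Field("value", "number", [0.5, 0.7]),
])
print(frame.rows())  # [(1000, 0.5), (2000, 0.7)]
```

Because every source produces the same shape, a panel only needs to understand fields and their types, not the query language that produced them.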
The LGTM Stack¶
The Grafana ecosystem addresses all pillars of observability through purpose-built backends:
| Signal | Backend | Query Language | Storage |
|---|---|---|---|
| Metrics | Grafana Mimir | PromQL | Object Storage (S3/GCS/Azure) |
| Logs | Grafana Loki | LogQL | Object Storage |
| Traces | Grafana Tempo | TraceQL | Object Storage (Parquet columnar) |
| Profiles | Grafana Pyroscope | FlameQL | Object Storage |
| Collection | Grafana Alloy | Alloy configuration syntax (formerly River, HCL-inspired) | N/A (pipeline agent) |
Cross-Signal Correlation¶
The true power of the LGTM stack is cross-signal linking:
- Exemplars: Metric data points carry trace IDs → click a spike in Mimir and jump to the exact trace in Tempo
- Trace-to-Logs: A trace span carries labels that map to Loki log streams → jump from trace to logs
- Derived Fields: Loki logs are parsed for trace IDs → jump from logs back to traces
- Profiles: Pyroscope profiles are linked via labels to traces and metrics
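The derived-fields idea in the list above amounts to running a regular expression over each log line and turning the captured trace ID into a link. The sketch below assumes a `trace_id=<hex>` log format; the pattern and the function name `extract_trace_id` are hypothetical, and a real deployment would configure its own regex on the Loki data source.

```python
import re
from typing import Optional

# Assumed log format: a 16 to 32 character lowercase hex trace ID after "trace_id=".
TRACE_ID_PATTERN = re.compile(r"trace_id=([0-9a-f]{16,32})")


def extract_trace_id(log_line: str) -> Optional[str]:
    """Return the trace ID embedded in a log line, or None if absent."""
    match = TRACE_ID_PATTERN.search(log_line)
    return match.group(1) if match else None


line = 'level=error msg="payment failed" trace_id=4bf92f3577b34da6a3ce929d0e0e4736'
print(extract_trace_id(line))  # 4bf92f3577b34da6a3ce929d0e0e4736
```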
flowchart TB
subgraph Collection["Grafana Alloy (OTel Collector)"]
direction LR
R[Receivers<br/>OTLP, Prometheus, Syslog]
P[Processors<br/>Batch, Filter, Transform]
E[Exporters<br/>Remote Write, OTLP]
R --> P --> E
end
subgraph Backends["LGTM Backends"]
Mimir["Mimir<br/>(Metrics)"]
Loki["Loki<br/>(Logs)"]
Tempo["Tempo<br/>(Traces)"]
Pyroscope["Pyroscope<br/>(Profiles)"]
end
subgraph Storage["Object Storage"]
S3["S3 / GCS / Azure Blob"]
end
Collection -->|remote_write| Mimir
Collection -->|push| Loki
Collection -->|OTLP| Tempo
Collection -->|push| Pyroscope
Mimir --> S3
Loki --> S3
Tempo --> S3
Pyroscope --> S3
subgraph Grafana["Grafana UI"]
Dash[Dashboards]
Explore[Explore]
Alert[Alerting]
end
Mimir -.->|PromQL| Grafana
Loki -.->|LogQL| Grafana
Tempo -.->|TraceQL| Grafana
Pyroscope -.->|FlameQL| Grafana
style Collection fill:#2a2d3e,color:#fff
style Backends fill:#1a1d2e,color:#fff
style Storage fill:#0d1117,color:#fff
style Grafana fill:#ff6600,color:#fff
Plugin Architecture¶
Grafana's extensibility is built on a modular plugin system:
Plugin Types¶
| Type | Purpose | Example |
|---|---|---|
| Data Source | Connect to external data backends | Prometheus, MySQL, Elasticsearch |
| Panel | Custom visualization types | Time series, Stat, Geomap, Flame graph |
| App | Bundles of datasources + panels + pages | Grafana Incident, Grafana OnCall |
| Renderer | Server-side image/PDF rendering | grafana-image-renderer |
Plugin Lifecycle¶
- Discovery: Grafana scans the plugin directory on startup
- Bootstrap: reads plugin.json metadata (ID, type, dependencies)
- Validation: checks plugin signature (signed/unsigned/private)
- Initialization: loads frontend (React) and backend (Go via gRPC subprocess)
Frontend ↔ Backend Communication¶
Backend plugins run as separate processes and communicate with the Grafana server via gRPC. This isolation means:
- A crashing plugin does not crash Grafana
- Plugins can implement custom auth, caching, and alerting
- Sensitive operations (secrets, credentials) stay server-side
Key SDK packages:
- @grafana/data — data structures, plugin base classes
- @grafana/ui — reusable React UI components (Grafana design system)
- @grafana/runtime — runtime services (data fetching, config)
- Grafana Plugin SDK for Go — server-side plugin development in Go
Alerting Pipeline (Unified Alerting)¶
Unified Alerting, the default alerting system since Grafana 9, uses a single architecture that works across all data sources:
flowchart LR
subgraph Rules["Alert Rules"]
R1["Rule 1<br/>PromQL: cpu > 80%"]
R2["Rule 2<br/>LogQL: error rate"]
end
subgraph Eval["Rule Evaluator"]
E["Periodic Evaluation<br/>(every N seconds)"]
end
subgraph State["Alert State Manager"]
S["Normal → Pending → Alerting"]
end
subgraph NP["Notification Policies"]
Tree["Routing Tree<br/>(label matchers)"]
end
subgraph CP["Contact Points"]
Slack[Slack]
PD[PagerDuty]
Email[Email]
WH[Webhook]
end
Rules --> Eval --> State --> NP
NP -->|severity=critical| PD
NP -->|team=backend| Slack
NP -->|default| Email
NP -->|custom| WH
style Rules fill:#1a1d2e,color:#fff
style Eval fill:#2a2d3e,color:#fff
style State fill:#2a2d3e,color:#fff
style NP fill:#ff6600,color:#fff
style CP fill:#0d7377,color:#fff
Key Concepts¶
- Alert Rules define what to evaluate and the threshold conditions
- Labels on alert instances drive routing (e.g., severity=critical, team=infra)
- Notification Policies form a routing tree; each policy matches labels and routes to contact points
- Contact Points define destinations (Slack, PagerDuty, Email, Webhook, OpsGenie, etc.)
- Mute Timings suppress notifications on a recurring schedule, such as maintenance windows
- Silences temporarily suppress notifications for specific alert instances during incidents
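The routing behaviour described above can be sketched as a small tree walk. This is a simplified illustration under stated assumptions: matchers are exact equality only, the first matching child wins, and the `Policy` class and `route` function are hypothetical names. Grafana's real notification policies also support regex matchers and continued matching, which this sketch omits.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Policy:
    contact_point: str
    matchers: Dict[str, str] = field(default_factory=dict)
    children: List["Policy"] = field(default_factory=list)

    def matches(self, labels: Dict[str, str]) -> bool:
        # Every matcher must equal the corresponding alert label.
        return all(labels.get(k) == v for k, v in self.matchers.items())


def route(policy: Policy, labels: Dict[str, str]) -> str:
    """Return the contact point of the deepest first-matching policy."""
    for child in policy.children:
        if child.matches(labels):
            return route(child, labels)
    return policy.contact_point  # no child matched: fall back to this level


# Mirrors the diagram: critical alerts page, backend alerts go to Slack, the rest to email.
root = Policy("email", children=[
    Policy("pagerduty", {"severity": "critical"}),
    Policy("slack", {"team": "backend"}),
])

print(route(root, {"severity": "critical", "team": "backend"}))  # pagerduty
print(route(root, {"team": "backend"}))                          # slack
print(route(root, {"team": "frontend"}))                         # email
```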
Data Flow¶
Grafana Alloy Pipeline¶
Grafana Alloy (successor to Grafana Agent) is the recommended telemetry collection agent:
- Two configuration modes:
    - Default Engine (Alloy configuration syntax, formerly River): HCL-inspired, component-oriented, supports clustering and a debug UI
    - OpenTelemetry Engine: standard YAML OTel Collector config for portability
- Pipeline stages: Receivers → Processors → Exporters
- Debug UI available at http://localhost:12345 for real-time pipeline inspection
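The Receivers → Processors → Exporters flow can be modelled in a few lines of Python. This is a conceptual sketch only and does not use Alloy's configuration syntax or APIs; `run_pipeline`, `batch`, and `add_cluster` are hypothetical names, and the batching step stands in for a batch processor.

```python
from typing import Callable, Iterable, Iterator, List


def batch(items: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Group records into lists of at most `size` items."""
    buf: List[dict] = []
    for item in items:
        buf.append(item)
        if len(buf) == size:
            yield buf
            buf = []
    if buf:
        yield buf  # flush the final partial batch


def run_pipeline(receiver: Iterable[dict],
                 processors: List[Callable[[dict], dict]],
                 exporter: Callable[[List[dict]], None],
                 batch_size: int = 2) -> None:
    stream = receiver
    for processor in processors:
        stream = map(processor, stream)   # chain processors lazily, in order
    for chunk in batch(stream, batch_size):
        exporter(chunk)                   # e.g. a remote-write or OTLP push


def add_cluster(record: dict) -> dict:
    """Example processor: attach a static label to every record."""
    return {**record, "cluster": "dev"}


sent: List[List[dict]] = []
run_pipeline([{"msg": "a"}, {"msg": "b"}, {"msg": "c"}], [add_cluster], sent.append)
print([len(chunk) for chunk in sent])  # [2, 1]
```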
Grafana Mimir (Metrics)¶
- Distributor receives remote-write from Prometheus/Alloy → validates samples, then shards and replicates series across ingesters via the hash ring
- Ingester writes to in-memory TSDB + WAL → flushes 2-hour blocks to object storage
- Querier executes PromQL across ingesters (recent) and store-gateways (historical)
- Compactor merges and deduplicates blocks in object storage
- Store-Gateway indexes object storage blocks for fast historical queries
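The distributor's hash-ring sharding with a replication factor of 3 can be illustrated with a small consistent-hashing sketch. The hash function (SHA-256), token count, and key format below are assumptions made for the example and do not reflect Mimir's internal choices; `HashRing` and `owners` are hypothetical names.

```python
import bisect
import hashlib
from typing import Dict, List


def _hash(key: str) -> int:
    # 32-bit token derived from SHA-256 (an assumption for this sketch).
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:4], "big")


class HashRing:
    def __init__(self, ingesters: List[str], tokens_per_ingester: int = 64):
        # Each ingester registers several tokens so load spreads evenly.
        self._ring = sorted(
            (_hash(f"{name}-{i}"), name)
            for name in ingesters
            for i in range(tokens_per_ingester)
        )
        self._tokens = [token for token, _ in self._ring]

    def owners(self, tenant: str, labels: Dict[str, str],
               replication_factor: int = 3) -> List[str]:
        """Walk clockwise from the series token and pick distinct ingesters."""
        key = tenant + "/" + ",".join(f"{k}={v}" for k, v in sorted(labels.items()))
        start = bisect.bisect(self._tokens, _hash(key))
        result: List[str] = []
        for offset in range(len(self._ring)):
            name = self._ring[(start + offset) % len(self._ring)][1]
            if name not in result:
                result.append(name)
            if len(result) == replication_factor:
                break
        return result


ring = HashRing(["ingester-0", "ingester-1", "ingester-2", "ingester-3"])
series = {"__name__": "http_requests_total", "job": "api"}
print(ring.owners("tenant-a", series))
```

The same series always maps to the same three ingesters, which is what lets the querier know where to look for recent data.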
Grafana Loki (Logs)¶
- Distributor receives log streams from Alloy → routes by label hash
- Ingester compresses logs into chunks, indexes labels only (not full text)
- Querier executes LogQL across ingesters and object storage
- Compactor merges index files and enforces retention
Key insight: Loki does not index log content — only metadata labels. This dramatically reduces storage costs but requires queries to start with a label selector.
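The trade-off can be seen in a toy model: streams are located through their labels, and line content is only ever scanned, never indexed. This is an illustrative sketch, not Loki's storage engine; `TinyLogStore` is a hypothetical class, and the query shown corresponds roughly to the LogQL expression `{app="api"} |= "timeout"`.

```python
from collections import defaultdict
from typing import Dict, List


class TinyLogStore:
    def __init__(self):
        # The "index": one entry per unique label set (a stream).
        self._streams = defaultdict(list)

    def push(self, labels: Dict[str, str], line: str) -> None:
        self._streams[frozenset(labels.items())].append(line)

    def query(self, selector: Dict[str, str], contains: str = "") -> List[str]:
        wanted = set(selector.items())
        results: List[str] = []
        for stream_labels, lines in self._streams.items():
            if wanted <= stream_labels:  # label match narrows the search
                # Content filter is a brute-force scan over the matching streams only.
                results.extend(l for l in lines if contains in l)
        return results


store = TinyLogStore()
store.push({"app": "api", "namespace": "prod"}, "GET /orders timeout after 30s")
store.push({"app": "api", "namespace": "prod"}, "GET /health ok")
store.push({"app": "worker", "namespace": "prod"}, "job timeout")

print(store.query({"app": "api"}, contains="timeout"))
# ['GET /orders timeout after 30s']
```

The label selector keeps the scan small, which is why a selective set of labels matters more than any content filter.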
Grafana Tempo (Traces)¶
- Distributor receives traces (OTLP, Jaeger, Zipkin) → routes by trace ID hash
- Ingester organizes spans into Apache Parquet columns, creates bloom filters, flushes blocks
- Querier searches by trace ID or uses TraceQL for attribute-based search
- Metrics-Generator (optional) extracts RED metrics (Rate, Errors, Duration) from spans → pushes to Mimir
Key insight: Tempo uses no index — it relies on object storage + Parquet columnar format + bloom filters, making it extremely cheap to operate at scale.
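A bloom filter answers "might this block contain trace X?" without an index, which lets a querier skip most blocks. The sketch below is a generic bloom filter in Python, not Tempo's implementation; the sizes, the SHA-256 based hashing, and the block names are assumptions for illustration.

```python
import hashlib
from typing import Dict, Iterator, List


class BloomFilter:
    def __init__(self, size_bits: int = 1024, num_hashes: int = 3):
        self.size = size_bits              # must be a multiple of 8 in this sketch
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: str) -> Iterator[int]:
        # Derive several bit positions by salting the hash with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # False means definitely absent; True means possibly present.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def candidate_blocks(blocks: Dict[str, BloomFilter], trace_id: str) -> List[str]:
    """Return only the blocks worth fetching from object storage."""
    return [name for name, bloom in blocks.items() if bloom.might_contain(trace_id)]


# Hypothetical block names and trace IDs.
blocks = {"block-01": BloomFilter(), "block-02": BloomFilter()}
blocks["block-01"].add("trace-aaa")
blocks["block-02"].add("trace-bbb")

print("block-01" in candidate_blocks(blocks, "trace-aaa"))  # True
```

A bloom filter never produces false negatives, so a block that holds the trace is always in the candidate list; occasional false positives only cost an extra block read.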
Lifecycle¶
Grafana Server Lifecycle¶
- Startup: loads grafana.ini config, runs database migrations, discovers plugins
- Runtime: serves HTTP/HTTPS, processes API requests, evaluates alert rules, manages sessions
- Shutdown: graceful drain of connections, flushes pending alert state
Dashboard Lifecycle¶
- Created (UI or provisioning) → stored as JSON in the Grafana database
- Versioned — each save creates a new version (built-in version history)
- Provisioned (optional) — dashboards loaded from YAML/JSON files on disk, watched for changes every 10s
- Exported — dashboards can be exported as JSON for sharing or IaC
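Because a dashboard is just a JSON document, it can be generated and kept in version control. The sketch below builds a minimal dashboard model in Python; the helper `make_dashboard`, the UID, the queries, and the `schemaVersion` value are assumptions for illustration, and a real export from Grafana contains many more fields.

```python
import json
from typing import Dict, List


def make_dashboard(title: str, uid: str, queries: Dict[str, str]) -> dict:
    """Build a minimal dashboard JSON model with one time series panel per query."""
    panels: List[dict] = []
    for i, (panel_title, expr) in enumerate(queries.items()):
        panels.append({
            "id": i + 1,
            "type": "timeseries",
            "title": panel_title,
            # Two panels per row on Grafana's 24-column grid.
            "gridPos": {"h": 8, "w": 12, "x": (i % 2) * 12, "y": (i // 2) * 8},
            "targets": [{"refId": "A", "expr": expr}],
        })
    return {
        "uid": uid,
        "title": title,
        "schemaVersion": 39,   # assumed value; depends on the Grafana version
        "version": 1,
        "panels": panels,
    }


dashboard = make_dashboard("API Overview", "api-overview", {
    "Request rate": "sum(rate(http_requests_total[5m]))",
    "Error rate": 'sum(rate(http_requests_total{status=~"5.."}[5m]))',
})
print(json.dumps(dashboard, indent=2))
```

Writing this JSON to a directory watched by a file provisioning provider is one way to manage dashboards as code.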
Benchmarks¶
Test Conditions¶
- Grafana Version tested: v12.x (current stable line, 2026)
- LGTM Stack Components: Mimir, Loki, Tempo (latest stable)
- Date: April 2026
- Note: Grafana itself is primarily a visualization layer. Performance is almost always bottlenecked by the underlying data source, not Grafana's rendering engine. The benchmarks below reflect both Grafana UI limits and backend throughput.
Grafana Server Performance¶
Dashboard Rendering¶
| Metric | Observation | Source |
|---|---|---|
| Panels per dashboard | No hard limit; practical limit ~25–30 before browser sluggishness | Grafana Docs |
| Recommended panel count | 8–12 (overview), 15–20 (detailed) | Community best practices |
| Data points per panel | Rendering degrades above ~10k points; use maxDataPoints to cap | Grafana Docs |
| Dashboard load time target | < 3 seconds for 95th percentile | Industry SRE standard |
| Concurrent viewers | Grafana server itself handles hundreds; bottleneck is query load on backends | Grafana Docs |
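The effect of capping data points can be worked out with simple arithmetic: the query step is roughly the time range divided by maxDataPoints, with a floor at the minimum interval. The function below is a simplified sketch of that calculation; Grafana additionally rounds to preset interval values, which is omitted here, and the 15-second floor is an assumed scrape interval.

```python
import math


def query_interval_seconds(range_seconds: int, max_data_points: int,
                           min_interval_seconds: int = 15) -> int:
    """Approximate the step a panel query would use for a given time range."""
    raw = range_seconds / max_data_points
    return max(min_interval_seconds, math.ceil(raw))


day = 24 * 60 * 60
hour = 60 * 60

print(query_interval_seconds(day, 1000))   # 87
print(query_interval_seconds(hour, 1000))  # 15
```

With a 24-hour range and a cap of 1,000 points, each series returns about 993 points instead of the 5,760 that a raw 15-second step would produce.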
Query Performance Guidelines¶
| Query Type | Acceptable p99 | Concern Threshold |
|---|---|---|
| Simple PromQL (1 series) | < 200ms | > 500ms |
| Moderate PromQL (10–50 series) | < 1s | > 2s |
| Complex PromQL (100+ series, range) | < 5s | > 10s |
| LogQL (label-filtered) | < 2s | > 5s |
| LogQL (full scan, large window) | < 30s | > 60s |
| TraceQL (by trace ID) | < 500ms | > 2s |
| TraceQL (attribute search) | < 10s | > 30s |
Mimir Benchmarks (Metrics)¶
Mimir is designed for hyperscale Prometheus metrics:
| Metric | Benchmark | Conditions |
|---|---|---|
| Active series | 1 billion+ | Documented by Grafana Labs |
| Ingestion rate | 30M+ samples/sec | Large-scale production deployments |
| Query throughput | Thousands of concurrent PromQL queries | With query-frontend sharding |
| Storage efficiency | 1.2–1.5 bytes per sample (compressed TSDB blocks on object storage) | With compaction |
| Ingester flush interval | 2 hours (default TSDB block size) | Configurable |
| Replication factor | 3 (default for durability) | Configurable |
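The bytes-per-sample figure in the table translates into a back-of-the-envelope storage estimate. The sketch below assumes a 15-second scrape interval and a 30-day window, counts a single compacted copy of the data, and ignores index overhead; `estimate_block_storage_gb` is a hypothetical helper.

```python
def estimate_block_storage_gb(active_series: int, scrape_interval_s: int,
                              bytes_per_sample: float, retention_days: int) -> float:
    """Rough object storage footprint in decimal gigabytes."""
    samples_per_day = active_series * (86_400 / scrape_interval_s)
    return samples_per_day * bytes_per_sample * retention_days / 1e9


# 1M series x 5,760 samples/day x 1.5 bytes x 30 days
size_gb = estimate_block_storage_gb(1_000_000, 15, 1.5, 30)
print(f"{size_gb:.1f} GB")  # 259.2 GB
```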
Mimir Cost Efficiency¶
| Scale | Estimated Infra Cost (Self-Hosted) | Notes |
|---|---|---|
| 100k active series | ~$50–100/mo | Monolithic mode, minimal nodes |
| 1M active series | ~$200–500/mo | Microservices mode recommended |
| 10M active series | ~$1,000–3,000/mo | Full microservices, HA |
| 100M+ active series | $5,000–20,000+/mo | Enterprise-grade infra |
Loki Benchmarks (Logs)¶
| Metric | Benchmark | Conditions |
|---|---|---|
| Ingestion rate | 1 TB+/day | Documented in production at scale |
| Query performance | Label-filtered: sub-second; full scan: depends heavily on time range and volume | Label cardinality is the primary factor |
| Compression ratio | 10–20:1 (Snappy/GZIP on chunks) | Varies by log structure |
| Storage cost | Up to 90% cheaper than Elasticsearch for same data volume | Due to label-only indexing |
| Max active streams | Configurable per tenant (default: 5,000) | Set via max_global_streams_per_user |
Loki Cardinality Guidelines¶
| Label Strategy | Active Streams | Impact |
|---|---|---|
| Ideal (namespace, pod, job) | < 10k | Optimal performance |
| Moderate (+ container, node) | 10k–50k | Acceptable |
| High cardinality (+ request ID) | 50k–500k+ | Performance degrades, ingester memory spikes |
Critical: Never use user IDs, request IDs, or IP addresses as Loki labels. Use them in log content and filter with LogQL pipe expressions.
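The reason high-cardinality labels are so damaging is multiplicative: the worst-case stream count is the product of the distinct values of every label. The sketch below computes that upper bound; the cardinalities are made-up numbers for illustration, and real stream counts are usually lower because not every combination occurs.

```python
from math import prod
from typing import Dict


def estimated_streams(label_cardinalities: Dict[str, int]) -> int:
    """Upper bound on active streams: product of distinct values per label."""
    return prod(label_cardinalities.values())


ideal = {"namespace": 5, "job": 20, "pod": 40}
with_request_id = {**ideal, "request_id": 1000}

print(estimated_streams(ideal))            # 4000
print(estimated_streams(with_request_id))  # 4000000
```

Adding a single unbounded label moves the same workload from the "ideal" row of the table far past the high-cardinality row.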
Tempo Benchmarks (Traces)¶
| Metric | Benchmark | Conditions |
|---|---|---|
| Ingestion rate | 100M+ spans/day | Production at Grafana Labs scale |
| Trace ID lookup | < 200ms typical | Direct trace ID queries |
| TraceQL search | Seconds to tens of seconds | Depends on time range and attribute selectivity |
| Storage cost | Significantly cheaper than Jaeger + Elasticsearch | No index required; object storage only |
| Parquet block size | Configurable, typically 100–500 MB | Larger blocks improve search, increase flush latency |
Comparison: LGTM Cost vs Alternatives¶
| Stack | Metrics (1M series) | Logs (100 GB/day) | Traces (50M spans/day) | Total Estimated |
|---|---|---|---|---|
| Self-hosted LGTM | $200–500/mo | $300–800/mo | $200–500/mo | $700–1,800/mo |
| Grafana Cloud Pro | $500–1,000/mo | $500–1,500/mo | $300–800/mo | $1,300–3,300/mo |
| Datadog | $1,500–5,000/mo | $2,000–8,000/mo | $1,000–4,000/mo | $4,500–17,000/mo |
| New Relic (Full Platform) | $1,000–3,000/mo | $1,500–5,000/mo | included | $2,500–8,000/mo |
Costs are rough estimates for mid-2026 and vary significantly by configuration, data volume, retention, and provider.
Caveats¶
- Grafana UI performance depends heavily on the browser — Chrome performs best for large dashboards
- Backend query performance is determined overwhelmingly by the data source, not by Grafana
- Loki and Tempo are optimized for object storage — running them on local disk undermines cost benefits
- Mimir benchmarks assume proper recording rules for expensive queries
- All cost estimates assume reasonable retention (15–30 days for logs/traces, 13 months for metrics)
Sources¶
| URL | Source Kind | Authority | Date |
|---|---|---|---|
| https://grafana.com/docs/grafana/latest/best-practices/ | docs | primary | 2026-04-10 |
| https://grafana.com/docs/mimir/latest/references/architecture/ | docs | primary | 2026-04-10 |
| https://grafana.com/docs/loki/latest/get-started/overview/ | docs | primary | 2026-04-10 |
| https://grafana.com/docs/tempo/latest/getting-started/ | docs | primary | 2026-04-10 |
| https://grafana.com/pricing/ | docs | primary | 2026-04-10 |