Operations¶
Deployment & Typical Setup¶
Quick Dev Setup (All-in-One Docker)¶
The fastest way to start with LGTM — a single Docker image with all components:
- Grafana UI: http://localhost:3000 (admin/admin)
- OTLP gRPC: localhost:4317
- OTLP HTTP: localhost:4318
Includes: OTel Collector, Prometheus, Loki, Tempo, Pyroscope, Grafana — all pre-wired.
Production Setup (Kubernetes)¶
For production, deploy each component independently via Helm:
# Add repository
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
# Deploy in order: storage backends first, then Grafana
helm install mimir grafana/mimir-distributed -n monitoring -f mimir-values.yaml
helm install loki grafana/loki -n monitoring -f loki-values.yaml
helm install tempo grafana/tempo-distributed -n monitoring -f tempo-values.yaml
helm install pyroscope grafana/pyroscope -n monitoring -f pyroscope-values.yaml
helm install alloy grafana/alloy -n monitoring -f alloy-values.yaml
helm install grafana grafana/grafana -n monitoring -f grafana-values.yaml
Production Readiness Checklist¶
- Object storage configured for Mimir, Loki, Tempo (separate buckets)
- PostgreSQL for Grafana metadata (not SQLite)
- Redis for Grafana session management
- Memcached for query/chunk/index caching (Mimir + Loki)
- Auth proxy for multi-tenant X-Scope-OrgID injection
- Ingress / LB with TLS termination
- HPA configured per component
- Resource requests/limits set on all pods
- Provisioning for data sources, dashboards, alerting rules
- Cross-signal correlation configured (exemplars, trace-to-logs, derived fields)
- Retention policies set per backend
- Monitoring the monitoring — meta-monitoring for LGTM components
Configuration & Optimal Tuning¶
Label Strategy (CRITICAL for Loki)¶
The #1 operational pitfall is label cardinality. Follow these rules:
| Label Type | Good ✅ | Bad ❌ |
|---|---|---|
| Static metadata | namespace, pod, job, env | user_id, request_id, ip_address |
| Bounded values | status_code (200, 404, 500) | timestamp, trace_id |
| Grouping | team, region, cluster | url_path (unbounded) |
Target: keep active streams below 10,000 per tenant for optimal performance.
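Because every unique combination of label values becomes its own stream, the product of distinct values per label gives a rough upper bound on stream count. The sketch below illustrates that arithmetic; the label names and value counts are hypothetical, not measurements.

```python
from math import prod

# Target from the guidance above: fewer than 10,000 active streams per tenant.
STREAM_TARGET = 10_000


def max_streams(distinct_values: dict[str, int]) -> int:
    """Upper bound on streams: the product of distinct values per label."""
    return prod(distinct_values.values())


# Hypothetical label sets; the counts are illustrative only.
bounded = {"namespace": 10, "job": 20, "env": 3}
with_user_id = {**bounded, "user_id": 50_000}

for name, labels in [("bounded", bounded), ("with user_id", with_user_id)]:
    total = max_streams(labels)
    verdict = "ok" if total < STREAM_TARGET else "over target"
    print(f"{name}: up to {total:,} streams ({verdict})")
```

Real stream counts are usually lower than this bound because not every combination occurs, but a single unbounded label is enough to blow past the target.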
Retention Configuration¶
| Component | Config Key | Recommended Defaults |
|---|---|---|
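| Mimir | limits.compactor_blocks_retention_period | 13 months (metrics) |
| Loki | limits_config.retention_period | 30 days (logs) |
| Tempo | compactor.compaction.block_retention | 14–30 days (traces) |
| Pyroscope | retention config | 14 days (profiles) |

In Mimir, blocks_storage.tsdb.retention_period only controls how long ingesters keep blocks on local disk; long-term retention is enforced by the compactor through compactor_blocks_retention_period. Loki applies retention_period only when its compactor runs with retention_enabled: true.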
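Retention values are typically written as hour-based duration strings (for example, 720h for 30 days). A minimal sketch of that conversion, with a hypothetical helper name:

```python
def days_to_duration(days: int) -> str:
    """Convert whole days to an hour-based duration string such as '720h'."""
    if days < 0:
        raise ValueError("days must be non-negative")
    return f"{days * 24}h"


for label, days in [("Loki logs", 30), ("Tempo traces", 14)]:
    print(f"{label}: {days} days = {days_to_duration(days)}")
```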
Per-Tenant Limits (Multi-Tenancy)¶
Set per-tenant limits to prevent noisy neighbors:
# Mimir overrides
overrides:
  tenant-alpha:
    max_global_series_per_user: 500000
    ingestion_rate: 50000  # samples/sec
    max_fetched_series_per_query: 100000
  tenant-beta:
    max_global_series_per_user: 100000
    ingestion_rate: 10000

# Loki overrides
overrides:
  tenant-alpha:
    max_global_streams_per_user: 10000
    ingestion_rate_mb: 10
    max_query_length: 720h
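To spot a tenant that is about to hit its limits, observed usage can be compared against the configured overrides. The sketch below assumes usage figures have already been collected (the numbers are hypothetical) and flags any limit at or above 80% utilisation; the threshold is an arbitrary choice.

```python
def near_limit(usage: dict[str, float], limits: dict[str, float],
               threshold: float = 0.8) -> list[str]:
    """Return the limit names whose usage is at or above threshold * limit."""
    return sorted(
        name for name, limit in limits.items()
        if usage.get(name, 0) >= threshold * limit
    )


# Limits copied from the tenant-alpha Mimir overrides above.
tenant_alpha_limits = {
    "max_global_series_per_user": 500_000,
    "ingestion_rate": 50_000,
}
# Hypothetical observed usage for the same tenant.
observed = {"max_global_series_per_user": 450_000, "ingestion_rate": 12_000}

print(near_limit(observed, tenant_alpha_limits))
```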
Reliability & Scaling¶
Scaling Decision Matrix¶
| Symptom | Component to Scale | How |
|---|---|---|
| Slow metric queries | Mimir queriers | Add querier replicas |
| Write backpressure on metrics | Mimir ingesters | Add ingester replicas |
| Slow log search | Loki queriers | Add querier replicas, check label cardinality |
| Log ingestion lag | Loki ingesters | Add ingester replicas, increase limits |
| Slow trace search | Tempo queriers | Add querier replicas |
| Cache miss rate > 20% | Memcached | Add memcached replicas, increase memory |
| Object storage latency | All | Verify same-AZ deployment, enable caching |
High Availability Requirements¶
| Component | HA Mechanism | Minimum Replicas |
|---|---|---|
| Ingesters (all backends) | Replication factor (RF=3 recommended) | 3 |
| Distributors | Stateless, load-balanced | 2+ |
| Queriers | Stateless, load-balanced | 2+ |
| Compactors | Leader election (single active) | 1 (with standby) |
| Store-Gateways (Mimir) | Sharded by blocks | 2+ |
| Query Frontends | Stateless, request splitting | 2+ |
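The RF=3 recommendation follows from quorum arithmetic. Assuming writes succeed once a majority of replicas acknowledge them, the sketch below shows how many ingester failures each replication factor can absorb.

```python
def write_quorum(replication_factor: int) -> int:
    """Majority of replicas that must acknowledge a write."""
    return replication_factor // 2 + 1


def tolerated_failures(replication_factor: int) -> int:
    """Replicas that can be unavailable while writes still reach quorum."""
    return replication_factor - write_quorum(replication_factor)


for rf in (1, 2, 3, 5):
    print(f"RF={rf}: quorum={write_quorum(rf)}, "
          f"tolerates {tolerated_failures(rf)} failure(s)")
```

Under this assumption RF=2 tolerates no failures at all, which is why three is the practical minimum.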
Cost¶
Cost Drivers¶
| Factor | Primary Driver | Optimization |
|---|---|---|
| Object storage | Data volume × retention | Set retention policies, use lifecycle rules, compress |
| Compute (ingesters) | Ingestion rate | Right-size, use spot/preemptible nodes |
| Compute (queriers) | Query volume and complexity | Recording rules, caching, query limits |
| Network (cross-AZ) | Cross-AZ traffic between components | Co-locate in single AZ or use VPC endpoints |
| Memcached | Cache size × hit ratio | Size to achieve > 80% hit rate |
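The "data volume × retention" driver for object storage can be made concrete with a small estimate. This is a rough sketch: the price per GB-month and the compression ratio are placeholder assumptions, and it ignores request and egress charges.

```python
def monthly_storage_cost(gb_per_day: float, retention_days: int,
                         price_per_gb_month: float,
                         compression_ratio: float = 1.0) -> float:
    """Steady-state stored volume multiplied by the storage price."""
    stored_gb = gb_per_day * retention_days / compression_ratio
    return stored_gb * price_per_gb_month


# Hypothetical inputs: 100 GB/day of logs, 30-day retention,
# an assumed price of 0.023 USD per GB-month, no compression credit.
cost = monthly_storage_cost(100, 30, 0.023)
print(f"Estimated object storage cost: ${cost:.2f}/month")
```

Halving retention or doubling the compression ratio halves the result, which is why retention policies lead the optimization column.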
Cost at Scale¶
| Scale | Metrics (active series) | Logs (GB/day) | Traces (spans/day) | Est. Monthly Self-Hosted |
|---|---|---|---|---|
| Small | 100k | 10 GB | 5M | $200–500 |
| Medium | 1M | 100 GB | 50M | $1,000–3,000 |
| Large | 10M | 1 TB | 500M | $5,000–15,000 |
| Enterprise | 100M+ | 10 TB+ | 5B+ | $20,000–100,000+ |
Cost Optimization Strategies¶
- Recording rules — precompute expensive PromQL queries in Mimir
- Adaptive Metrics (Grafana Cloud) — automatically drop unused metrics
- Adaptive Logs (Grafana Cloud) — automatically reduce noisy log volumes
- Sampling — head-based or tail-based trace sampling to reduce Tempo costs
- Log pipeline filtering — drop debug/info logs in Alloy before they reach Loki
- Single-AZ deployment — eliminate cross-AZ network costs (accept reduced availability)
- Object storage lifecycle rules — transition old data to cheaper tiers (S3 IA/Glacier)
Security¶
Authentication & Multi-Tenancy¶
- Deploy an auth proxy (NGINX, Envoy, or API gateway) in front of all backends
- The proxy authenticates users and injects X-Scope-OrgID based on verified identity
- Never expose backends directly without authentication
- Use per-tenant limits to prevent resource exhaustion
- All inter-component communication should use mTLS in production
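The header-injection step can be sketched as a pure function: discard whatever X-Scope-OrgID the client sent and set it from the authenticated identity. The user-to-tenant mapping and function name below are hypothetical; a real proxy would take the identity from its authentication layer.

```python
# Hypothetical mapping from authenticated user to tenant.
TENANT_BY_USER = {"alice": "tenant-alpha", "bob": "tenant-beta"}


def upstream_headers(verified_user: str,
                     client_headers: dict[str, str]) -> dict[str, str]:
    """Build the headers forwarded to the backend for a verified user."""
    tenant = TENANT_BY_USER.get(verified_user)
    if tenant is None:
        raise PermissionError(f"no tenant mapping for {verified_user!r}")
    # Never trust a client-supplied tenant header, whatever its casing.
    headers = {k: v for k, v in client_headers.items()
               if k.lower() != "x-scope-orgid"}
    headers["X-Scope-OrgID"] = tenant
    return headers


print(upstream_headers("alice", {"x-scope-orgid": "tenant-beta",
                                 "Accept": "application/json"}))
```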
Network Security¶
- Use NetworkPolicies in Kubernetes to restrict pod-to-pod communication
- Only Alloy should talk to backend Distributors
- Only Query Frontends should be exposed to Grafana
- Object storage should be accessed via VPC endpoints (no public internet)
Best Practices¶
Instrumentation¶
- Use OpenTelemetry everywhere: standardize on OTLP as the protocol
- Inject trace IDs into logs: this is the foundation of log-trace correlation
- Set resource attributes: service.name, deployment.environment, k8s.pod.name on every signal
- Use auto-instrumentation first: the Java agent, opentelemetry-instrument for Python, eBPF for Go
- Add manual spans for critical business logic that auto-instrumentation misses
- Sample in production: head-based (simple) or tail-based (captures errors and slow requests) sampling
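As an illustration of injecting trace IDs into logs, the sketch below uses only the standard library to emit compact JSON lines with a traceID field, the shape that a Loki derived-field regex such as "traceID":"(\w+)" expects. The logger name is hypothetical and the trace ID is passed in by hand; in an instrumented app it would come from the active span.

```python
import io
import json
import logging


class JsonTraceFormatter(logging.Formatter):
    """Render each record as compact JSON with a traceID field."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps(
            {
                "level": record.levelname,
                "message": record.getMessage(),
                # trace_id is supplied via `extra`; empty when absent.
                "traceID": getattr(record, "trace_id", ""),
            },
            separators=(",", ":"),  # no space after ':' so the regex matches
        )


def make_logger(stream: io.TextIOBase) -> logging.Logger:
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonTraceFormatter())
    logger = logging.getLogger("lgtm-demo")  # hypothetical logger name
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


buffer = io.StringIO()
log = make_logger(buffer)
log.info("order placed", extra={"trace_id": "5b8efff798038103d269b633813fc60c"})
print(buffer.getvalue().strip())
```

The compact separators matter: the default json.dumps output puts a space after the colon, which a strict regex would not match.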
Operations¶
- Monitor the monitoring: deploy a separate, smaller "meta-monitoring" LGTM stack that monitors the primary stack
- Use Grafana mixins: pre-built dashboards for Mimir, Loki, Tempo internals
- Label governance: enforce labeling standards to prevent cardinality explosions
- Test with load: use k6 with the Grafana extension to load-test the stack before production
- GitOps everything: dashboards, alerts, data sources, and Helm values in version control
Common Issues & Playbook¶
| Symptom | Likely Cause | Fix |
|---|---|---|
| "too many outstanding requests" | Ingester overwhelmed | Scale ingesters, increase per-tenant limits |
| "max streams limit reached" (Loki) | High label cardinality | Reduce label cardinality, drop high-cardinality labels in Alloy |
| "context deadline exceeded" | Slow object storage or oversized query | Enable caching, add query limits, check AZ placement |
| Exemplars not showing | Mimir not storing exemplars | Set max_global_exemplars_per_user above 0 in the Mimir limits (exemplar storage is off by default), verify app instrumentation |
| Trace-to-logs not working | Missing trace ID in logs | Verify OTel SDK injects trace_id into log output |
| Derived fields not clickable | Regex doesn't match | Test regex against actual log lines, verify Loki DS config |
| High memory on ingesters | WAL too large or too many active series/streams | Increase ingester memory, tune WAL flush interval |
| Slow TraceQL queries | Large time range or low selectivity | Narrow time range, add specific attribute filters |
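For the "derived fields not clickable" row, testing the regex against real log lines is quick to do offline. The sketch below checks the matcher "traceID":"(\w+)" against a few hypothetical sample lines; swap in lines copied from your own Loki output.

```python
import re
from typing import Optional

# Same pattern as a Loki derived-field matcherRegex for JSON logs.
MATCHER = re.compile(r'"traceID":"(\w+)"')


def extract_trace_id(line: str) -> Optional[str]:
    """Return the captured trace ID, or None when the regex does not match."""
    match = MATCHER.search(line)
    return match.group(1) if match else None


# Hypothetical log lines in three common shapes.
samples = [
    '{"level":"INFO","traceID":"5b8efff798038103d269b633813fc60c","msg":"ok"}',
    '{"level":"INFO","traceID": "5b8efff798038103d269b633813fc60c","msg":"ok"}',
    'level=info trace_id=5b8efff798038103d269b633813fc60c msg=ok',
]

for line in samples:
    print(extract_trace_id(line), "<-", line)
```

Only the first sample matches: a space after the colon or a logfmt-style trace_id key both defeat this pattern, which is a common reason the link never appears.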
Monitoring & Troubleshooting¶
Key Metrics to Monitor (Meta-Monitoring)¶
| Component | Metric | What It Tells You |
|---|---|---|
| All | *_request_duration_seconds | Internal API latency |
| Ingesters | *_ingester_memory_series / *_live_entries | In-memory load |
| Distributors | *_distributor_received_samples_total | Ingestion throughput |
| Queriers | *_querier_request_duration_seconds | Query latency |
| Compactors | *_compactor_runs_completed_total | Compaction health |
| Object Storage | *_thanos_objstore_bucket_operation_duration_seconds | Storage latency |
Grafana Mixins¶
Pre-built monitoring dashboards for each LGTM component:
- Mimir: grafana/mimir → operations/mimir-mixin/
- Loki: grafana/loki → production/loki-mixin/
- Tempo: grafana/tempo → operations/tempo-mixin/
Commands & Recipes¶
Quick Start (All-in-One Docker)¶
# Start the entire LGTM stack in one container (dev/testing only)
docker run --name lgtm \
-p 3000:3000 \
-p 4317:4317 \
-p 4318:4318 \
--rm -ti grafana/otel-lgtm
# With persistent data
docker run --name lgtm \
-v "$(pwd)/data:/data" \
-p 3000:3000 \
-p 4317:4317 \
-p 4318:4318 \
grafana/otel-lgtm
# Enable internal component logs
docker run --name lgtm \
-e ENABLE_LOGS_ALL=true \
-p 3000:3000 \
-p 4317:4317 \
-p 4318:4318 \
grafana/otel-lgtm
Test it immediately — send a trace with curl:
curl -X POST http://localhost:4318/v1/traces \
-H "Content-Type: application/json" \
-d '{
"resourceSpans": [{
"resource": {"attributes": [{"key": "service.name", "value": {"stringValue": "test-service"}}]},
"scopeSpans": [{
"spans": [{
"traceId": "5b8efff798038103d269b633813fc60c",
"spanId": "eee19b7ec3c1b174",
"name": "test-span",
"kind": 1,
"startTimeUnixNano": "1544712660000000000",
"endTimeUnixNano": "1544712661000000000"
}]
}]
}]
}'
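The curl example uses a fixed trace ID and 2018 timestamps, so repeated runs overwrite the same trace outside Grafana's default time window. A sketch of generating the same OTLP/JSON body with fresh IDs and current timestamps follows; the function names are hypothetical and `send` assumes the all-in-one container is listening on localhost:4318.

```python
import json
import secrets
import time
import urllib.request


def build_trace_payload(service: str, span_name: str,
                        duration_ms: int = 100) -> dict:
    """Build an OTLP/JSON body with one span ending now."""
    end_ns = time.time_ns()
    start_ns = end_ns - duration_ms * 1_000_000
    span = {
        "traceId": secrets.token_hex(16),  # 32 hex characters
        "spanId": secrets.token_hex(8),    # 16 hex characters
        "name": span_name,
        "kind": 1,
        "startTimeUnixNano": str(start_ns),
        "endTimeUnixNano": str(end_ns),
    }
    resource = {"attributes": [
        {"key": "service.name", "value": {"stringValue": service}},
    ]}
    return {"resourceSpans": [
        {"resource": resource, "scopeSpans": [{"spans": [span]}]},
    ]}


def send(payload: dict, url: str = "http://localhost:4318/v1/traces") -> int:
    """POST the payload; returns the HTTP status. Not called in this sketch."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


payload = build_trace_payload("test-service", "test-span")
print(json.dumps(payload, indent=2))
```

Calling `send(payload)` against a running container should make the trace searchable in the last few minutes of the Tempo data source.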
Grafana Data Source Provisioning (Cross-Signal)¶
This is the most critical provisioning file for the LGTM stack — it wires up all cross-signal correlation:
# /etc/grafana/provisioning/datasources/lgtm.yaml
apiVersion: 1
datasources:
  # === METRICS (Mimir) ===
  - name: Mimir
    type: prometheus
    uid: mimir
    access: proxy
    url: http://mimir-query-frontend:8080/prometheus
    isDefault: true
    jsonData:
      httpMethod: POST
      exemplarTraceIdDestinations:
        - name: traceID
          datasourceUid: tempo
  # === LOGS (Loki) ===
  - name: Loki
    type: loki
    uid: loki
    access: proxy
    url: http://loki-gateway:3100
    jsonData:
      derivedFields:
        - datasourceUid: tempo
          matcherRegex: '"traceID":"(\w+)"'
          name: TraceID
          url: '$${__value.raw}'
          urlDisplayLabel: 'View Trace'
  # === TRACES (Tempo) ===
  - name: Tempo
    type: tempo
    uid: tempo
    access: proxy
    url: http://tempo-query-frontend:3200
    jsonData:
      tracesToLogsV2:
        datasourceUid: loki
        spanStartTimeShift: '-1h'
        spanEndTimeShift: '1h'
        tags:
          - key: service.name
            value: service_name
        filterByTraceID: true
        filterBySpanID: false
      tracesToMetrics:
        datasourceUid: mimir
        spanStartTimeShift: '-1h'
        spanEndTimeShift: '1h'
        tags:
          - key: service.name
            value: job
        queries:
          - name: 'Request Rate'
            query: 'sum(rate(traces_spanmetrics_calls_total{$$__tags}[5m]))'
          - name: 'Error Rate'
            query: 'sum(rate(traces_spanmetrics_calls_total{$$__tags,status_code="STATUS_CODE_ERROR"}[5m]))'
      tracesToProfiles:
        datasourceUid: pyroscope
        tags:
          - key: service.name
            value: service_name
        profileTypeId: 'process_cpu:cpu:nanoseconds:cpu:nanoseconds'
      serviceMap:
        datasourceUid: mimir
      nodeGraph:
        enabled: true
  # === PROFILES (Pyroscope) ===
  - name: Pyroscope
    type: grafana-pyroscope-datasource
    uid: pyroscope
    access: proxy
    url: http://pyroscope:4040
Alloy Configuration (Full LGTM Pipeline)¶
// config.alloy: LGTM pipeline for metrics, logs, and traces (profiles are not configured here)
// =============================================
// RECEIVERS
// =============================================
// OTLP receiver for metrics, traces, and logs
otelcol.receiver.otlp "default" {
  grpc { endpoint = "0.0.0.0:4317" }
  http { endpoint = "0.0.0.0:4318" }
  output {
    metrics = [otelcol.processor.batch.default.input]
    traces  = [otelcol.processor.batch.default.input]
    logs    = [otelcol.processor.batch.default.input]
  }
}
// Prometheus scrape for Kubernetes pods
prometheus.scrape "k8s_pods" {
  targets    = discovery.kubernetes.pods.targets
  forward_to = [prometheus.remote_write.mimir.receiver]
}
discovery.kubernetes "pods" {
  role = "pod"
}
// =============================================
// PROCESSORS
// =============================================
otelcol.processor.batch "default" {
  timeout         = "5s"
  send_batch_size = 8192
  output {
    metrics = [otelcol.processor.memory_limiter.default.input]
    traces  = [otelcol.processor.memory_limiter.default.input]
    logs    = [otelcol.processor.memory_limiter.default.input]
  }
}
otelcol.processor.memory_limiter "default" {
  check_interval = "1s"
  limit_mib      = 512
  output {
    metrics = [otelcol.exporter.prometheus.mimir.input]
    traces  = [otelcol.exporter.otlp.tempo.input]
    logs    = [otelcol.exporter.loki.default.input]
  }
}
// =============================================
// EXPORTERS
// =============================================
// Metrics → Mimir
prometheus.remote_write "mimir" {
  endpoint {
    url = "http://mimir-distributor:8080/api/v1/push"
  }
}
otelcol.exporter.prometheus "mimir" {
  forward_to = [prometheus.remote_write.mimir.receiver]
}
// Traces → Tempo
otelcol.exporter.otlp "tempo" {
  client {
    endpoint = "tempo-distributor:4317"
    tls { insecure = true }
  }
}
// Logs → Loki
otelcol.exporter.loki "default" {
  forward_to = [loki.write.default.receiver]
}
loki.write "default" {
  endpoint {
    url = "http://loki-distributor:3100/loki/api/v1/push"
  }
}
Helm Values Snippets¶
Mimir (Key Production Settings)¶
# mimir-values.yaml (key settings only)
mimir:
  structuredConfig:
    common:
      storage:
        backend: s3
        s3:
          endpoint: s3.us-east-1.amazonaws.com
          bucket_name: observability-mimir-blocks
          region: us-east-1
    blocks_storage:
      tsdb:
        retention_period: 13h  # local ingester TSDB retention, not long-term retention
      bucket_store:
        sync_interval: 15m
    limits:
      max_global_series_per_user: 1500000
      ingestion_rate: 100000
    ruler_storage:
      backend: s3
      s3:
        bucket_name: observability-mimir-ruler
ingester:
  replicas: 3
  resources:
    requests: { cpu: "1", memory: "4Gi" }
    limits: { cpu: "2", memory: "8Gi" }
  persistentVolume:
    enabled: true
    size: 50Gi
querier:
  replicas: 2
  resources:
    requests: { cpu: "500m", memory: "2Gi" }
store_gateway:
  replicas: 2
compactor:
  replicas: 1
Loki (Key Production Settings)¶
# loki-values.yaml
loki:
  auth_enabled: true
  storage:
    type: s3
    s3:
      endpoint: s3.us-east-1.amazonaws.com
      bucketnames: observability-loki-chunks
      region: us-east-1
  schemaConfig:
    configs:
      - from: "2024-01-01"
        store: tsdb
        object_store: s3
        schema: v13
        index:
          prefix: loki_index_
          period: 24h
  limits_config:
    retention_period: 720h  # 30 days
    max_global_streams_per_user: 10000
    ingestion_rate_mb: 20
    per_stream_rate_limit: 5MB
ingester:
  replicas: 3
querier:
  replicas: 2
Tempo (Key Production Settings)¶
# tempo-values.yaml
tempo:
  multitenancyEnabled: true
  storage:
    trace:
      backend: s3
      s3:
        bucket: observability-tempo-traces
        endpoint: s3.us-east-1.amazonaws.com
        region: us-east-1
  metricsGenerator:
    enabled: true
    remoteWriteUrl: "http://mimir-distributor:8080/api/v1/push"
  global_overrides:
    defaults:
      metrics_generator:
        processors: [span-metrics, service-graphs]
ingester:
  replicas: 3
querier:
  replicas: 2
compactor:
  replicas: 1
OpenTelemetry SDK Quickstart¶
Java (Auto-Instrumentation)¶
# Download the OTel Java agent
curl -LO https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/latest/download/opentelemetry-javaagent.jar
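# Run your app with the agent
# (agent 2.x defaults to http/protobuf, so select gRPC explicitly when targeting port 4317)
java -javaagent:opentelemetry-javaagent.jar \
  -Dotel.service.name=my-service \
  -Dotel.exporter.otlp.protocol=grpc \
  -Dotel.exporter.otlp.endpoint=http://alloy:4317 \
  -jar my-app.jar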
Python (Auto-Instrumentation)¶
# Install OTel packages
pip install opentelemetry-distro opentelemetry-exporter-otlp
# Auto-install all detected instrumentation libraries
opentelemetry-bootstrap -a install
# Run your app with auto-instrumentation
OTEL_SERVICE_NAME=my-service \
OTEL_EXPORTER_OTLP_ENDPOINT=http://alloy:4317 \
opentelemetry-instrument python app.py
Go (Manual SDK)¶
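// Initialize OTel in your Go app
import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/sdk/resource"
	"go.opentelemetry.io/otel/sdk/trace"
	// Pick the semconv version that matches your SDK release.
	semconv "go.opentelemetry.io/otel/semconv/v1.26.0"
)

// initTracer builds an OTLP/gRPC exporter and registers a global TracerProvider.
// The caller should defer tp.Shutdown(ctx) to flush pending spans on exit.
func initTracer(ctx context.Context) (*trace.TracerProvider, error) {
	exporter, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithEndpoint("alloy:4317"),
		otlptracegrpc.WithInsecure(),
	)
	if err != nil {
		return nil, err
	}
	tp := trace.NewTracerProvider(
		trace.WithBatcher(exporter),
		trace.WithResource(resource.NewWithAttributes(
			semconv.SchemaURL,
			semconv.ServiceNameKey.String("my-service"),
		)),
	)
	otel.SetTracerProvider(tp)
	return tp, nil
}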
Environment Variables (All Languages)¶
# Universal OTel configuration via env vars
export OTEL_SERVICE_NAME=my-service
export OTEL_EXPORTER_OTLP_ENDPOINT=http://alloy:4317
export OTEL_EXPORTER_OTLP_PROTOCOL=grpc
export OTEL_RESOURCE_ATTRIBUTES="deployment.environment=production,k8s.namespace.name=default"
export OTEL_TRACES_SAMPLER=parentbased_traceidratio
export OTEL_TRACES_SAMPLER_ARG=0.1 # 10% sampling
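Ratio-based samplers decide from the trace ID itself, so every service reaches the same verdict for a given trace. The sketch below shows one way such a decision can be made (comparing the low 64 bits of the ID against a threshold); it illustrates the idea and is not a copy of any SDK's implementation.

```python
import random

LOW_64_MASK = (1 << 64) - 1  # keeps the low 64 bits of a 128-bit trace ID


def is_sampled(trace_id_hex: str, ratio: float) -> bool:
    """Deterministic decision: the same trace ID always gets the same answer."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    bound = round(ratio * (LOW_64_MASK + 1))
    return (int(trace_id_hex, 16) & LOW_64_MASK) < bound


def kept_fraction(ratio: float, trials: int = 20_000, seed: int = 0) -> float:
    """Fraction of random trace IDs that the sampler keeps."""
    rng = random.Random(seed)
    kept = sum(
        is_sampled(f"{rng.getrandbits(128):032x}", ratio) for _ in range(trials)
    )
    return kept / trials


print(f"kept at ratio 0.1: {kept_fraction(0.1):.3f}")
```

With OTEL_TRACES_SAMPLER_ARG=0.1 the kept fraction lands near 10%, so trace volume drops by roughly a factor of ten.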
Migration Recipes¶
Prometheus → Mimir (Add Long-Term Storage)¶
# Add to existing Prometheus config — zero-downtime migration
remote_write:
  - url: http://mimir-distributor:8080/api/v1/push
    headers:
      X-Scope-OrgID: default
Jaeger → Tempo (Trace Backend Swap)¶
# Tempo natively accepts Jaeger protocol
# Just re-point your Jaeger agents/collectors to Tempo's endpoint:
# Jaeger Thrift HTTP: tempo-distributor:14268
# Jaeger gRPC: tempo-distributor:14250
# Or preferably, switch to OTLP: tempo-distributor:4317
Elasticsearch/Kibana → Loki/Grafana (Conceptual)¶
- Deploy Loki alongside Elasticsearch
- Configure Alloy to send logs to both Loki AND Elasticsearch (dual-write)
- Rebuild critical Kibana dashboards in Grafana using LogQL
- Validate data completeness and query parity
- Cut over: stop writing to Elasticsearch
- Decommission Elasticsearch after retention period expires
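During the dual-write phase, the completeness check can be as simple as comparing per-service log counts from both systems over the same window. The sketch below assumes those counts were already gathered by separate queries; the service names, counts, and 1% tolerance are hypothetical.

```python
def parity_gaps(es_counts: dict[str, int], loki_counts: dict[str, int],
                tolerance: float = 0.01) -> dict[str, tuple[int, int]]:
    """Return services whose Loki count deviates from Elasticsearch by more than tolerance."""
    gaps = {}
    for service, expected in es_counts.items():
        actual = loki_counts.get(service, 0)
        if expected and abs(expected - actual) / expected > tolerance:
            gaps[service] = (expected, actual)
    return gaps


# Hypothetical counts for the same one-hour window.
es = {"api": 1000, "web": 500}
loki = {"api": 998, "web": 450}

print(parity_gaps(es, loki))
```

An empty result over several consecutive windows is a reasonable signal that the cut-over step is safe to schedule.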
Useful One-Liners¶
# Check LGTM component health (each service on the HTTP port used elsewhere in this guide)
for target in mimir-distributor:8080 loki-distributor:3100 tempo-distributor:3200; do
  echo "$target: $(curl -s "http://$target/ready")"
done
# Query Mimir directly via curl
curl -s -H "X-Scope-OrgID: default" \
"http://mimir-query-frontend:8080/prometheus/api/v1/query?query=up" | jq .
# Push a test log to Loki
curl -X POST -H "Content-Type: application/json" \
-H "X-Scope-OrgID: default" \
"http://loki-distributor:3100/loki/api/v1/push" \
-d '{"streams":[{"stream":{"app":"test"},"values":[ ["'$(date +%s)000000000'","hello from curl"]]}]}'
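# Query Loki directly (-G with --data-urlencode stops curl from globbing the braces)
curl -s -G -H "X-Scope-OrgID: default" \
  "http://loki-query-frontend:3100/loki/api/v1/query_range" \
  --data-urlencode 'query={app="test"}' \
  --data-urlencode 'limit=10' | jq .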
# Check Tempo trace by ID
curl -s "http://tempo-query-frontend:3200/api/traces/5b8efff798038103d269b633813fc60c" | jq .
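The Loki push body from the one-liners can also be built programmatically, which avoids shell quoting around the nanosecond timestamp. A minimal sketch with a hypothetical helper name:

```python
import json
import time
from typing import Optional


def loki_push_body(labels: dict[str, str], lines: list[str],
                   now_ns: Optional[int] = None) -> dict:
    """Build a Loki push payload with one stream and string nanosecond timestamps."""
    timestamp = str(now_ns if now_ns is not None else time.time_ns())
    return {"streams": [
        {"stream": labels, "values": [[timestamp, line] for line in lines]},
    ]}


body = loki_push_body({"app": "test"}, ["hello from python"])
print(json.dumps(body))
```

The printed JSON can be sent to /loki/api/v1/push with the same Content-Type and X-Scope-OrgID headers as the curl example.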