
Operations

Deployment & Typical Setup

Quick Dev Setup (All-in-One Docker)

The fastest way to get started with the LGTM stack is a single Docker image that bundles every component:

docker run --name lgtm \
  -p 3000:3000 \
  -p 4317:4317 \
  -p 4318:4318 \
  --rm -ti grafana/otel-lgtm

  • Grafana UI: http://localhost:3000 (admin/admin)
  • OTLP gRPC: localhost:4317
  • OTLP HTTP: localhost:4318

Includes: OTel Collector, Prometheus, Loki, Tempo, Pyroscope, Grafana — all pre-wired.
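
For local development it is often convenient to run your service next to the all-in-one image. A minimal docker-compose sketch, assuming a placeholder `my-app:latest` image instrumented with an OpenTelemetry SDK; the `OTEL_*` variables are the standard SDK environment variables:

```yaml
# docker-compose.yaml sketch: all-in-one LGTM plus an instrumented app.
# "my-app:latest" is a placeholder for your own service image.
services:
  lgtm:
    image: grafana/otel-lgtm
    ports:
      - "3000:3000"   # Grafana UI
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP
  app:
    image: my-app:latest
    environment:
      # Point the SDK at the lgtm container over the compose network
      OTEL_EXPORTER_OTLP_ENDPOINT: http://lgtm:4317
      OTEL_EXPORTER_OTLP_PROTOCOL: grpc
      OTEL_SERVICE_NAME: my-app
    depends_on:
      - lgtm
```

With this in place, `docker compose up` gives you the same endpoints as the `docker run` command above, plus your app already wired to them.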

Production Setup (Kubernetes)

For production, deploy each component independently via Helm:

# Add repository
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

# Deploy in order: storage backends first, then Grafana
helm install mimir grafana/mimir-distributed -n monitoring -f mimir-values.yaml
helm install loki grafana/loki -n monitoring -f loki-values.yaml
helm install tempo grafana/tempo-distributed -n monitoring -f tempo-values.yaml
helm install pyroscope grafana/pyroscope -n monitoring -f pyroscope-values.yaml
helm install alloy grafana/alloy -n monitoring -f alloy-values.yaml
helm install grafana grafana/grafana -n monitoring -f grafana-values.yaml
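
A sketch of the `grafana-values.yaml` referenced above, pre-provisioning the four backends as data sources via the chart's `datasources` key. The service DNS names and ports are assumptions based on the release names used here; adjust them to what your charts actually create:

```yaml
# grafana-values.yaml sketch: provision data sources at install time.
# Service names/ports assume the release names above in "monitoring".
datasources:
  datasources.yaml:
    apiVersion: 1
    datasources:
      - name: Mimir
        type: prometheus
        url: http://mimir-nginx.monitoring.svc/prometheus
        isDefault: true
      - name: Loki
        type: loki
        url: http://loki-gateway.monitoring.svc
      - name: Tempo
        type: tempo
        url: http://tempo-query-frontend.monitoring.svc:3200
      - name: Pyroscope
        type: grafana-pyroscope-datasource
        url: http://pyroscope.monitoring.svc:4040
```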

Production Readiness Checklist

  • Object storage configured for Mimir, Loki, Tempo (separate buckets)
  • PostgreSQL for Grafana metadata (not SQLite)
  • Redis for Grafana session management
  • Memcached for query/chunk/index caching (Mimir + Loki)
  • Auth proxy for multi-tenant X-Scope-OrgID injection
  • Ingress / LB with TLS termination
  • HPA configured per component
  • Resource requests/limits set on all pods
  • Provisioning for data sources, dashboards, alerting rules
  • Cross-signal correlation configured (exemplars, trace-to-logs, derived fields)
  • Retention policies set per backend
  • Monitoring the monitoring — a separate meta-monitoring setup for the LGTM components themselves
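
The cross-signal correlation item in the checklist can be provisioned declaratively. A sketch, assuming a log format that emits `trace_id=<id>` and the data-source UIDs shown here (both are assumptions to adapt):

```yaml
# Data-source provisioning sketch for cross-signal correlation:
# a Loki derived field turns trace IDs in log lines into Tempo links,
# and Tempo links back to Loki. Regex and UIDs are illustrative.
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    uid: loki
    url: http://loki-gateway.monitoring.svc
    jsonData:
      derivedFields:
        - name: TraceID
          matcherRegex: 'trace_id=(\w+)'
          url: '${__value.raw}'
          datasourceUid: tempo   # link target: the Tempo DS below
  - name: Tempo
    type: tempo
    uid: tempo
    url: http://tempo-query-frontend.monitoring.svc:3200
    jsonData:
      tracesToLogsV2:
        datasourceUid: loki      # trace-to-logs goes the other way
```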

Configuration & Optimal Tuning

Label Strategy (CRITICAL for Loki)

The #1 operational pitfall is label cardinality. Follow these rules:

| Label Type      | Good ✅                      | Bad ❌                           |
|-----------------|------------------------------|----------------------------------|
| Static metadata | namespace, pod, job, env     | user_id, request_id, ip_address  |
| Bounded values  | status_code (200, 404, 500)  | timestamp, trace_id              |
| Grouping        | team, region, cluster        | url_path (unbounded)             |

Target: keep active streams below 10,000 per tenant for good query and ingest performance.
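
One way to enforce the rules above is to strip the "bad" attributes before logs ever reach Loki. A sketch using the OpenTelemetry Collector's `attributes` processor (contrib); the attribute keys match the table and should be adjusted to your schema, and the Loki OTLP endpoint is an assumption:

```yaml
# Collector pipeline sketch: delete high-cardinality attributes so they
# can never become Loki labels.
receivers:
  otlp:
    protocols:
      grpc: {}
processors:
  attributes/drop_high_cardinality:
    actions:
      - key: user_id
        action: delete
      - key: request_id
        action: delete
      - key: ip_address
        action: delete
exporters:
  otlphttp/loki:
    endpoint: http://loki-gateway.monitoring.svc/otlp
service:
  pipelines:
    logs:
      receivers: [otlp]
      processors: [attributes/drop_high_cardinality]
      exporters: [otlphttp/loki]
```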

Retention Configuration

| Component | Config Key                               | Recommended Default   |
|-----------|------------------------------------------|-----------------------|
| Mimir     | limits.compactor_blocks_retention_period | 13 months (metrics)   |
| Loki      | limits_config.retention_period           | 30 days (logs)        |
| Tempo     | compactor.compaction.block_retention     | 14–30 days (traces)   |
| Pyroscope | retention config                         | 14 days (profiles)    |
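
A sketch of what those keys look like in practice. Each fragment belongs in that backend's own config file or Helm values, not one merged file; durations mirror the table (13 months ≈ 9480h, 30 days = 720h):

```yaml
# Mimir: per-tenant limit on how long blocks stay in object storage
limits:
  compactor_blocks_retention_period: 9480h

# Loki: delete chunks older than 30 days
# (also requires compactor.retention_enabled: true)
limits_config:
  retention_period: 720h

# Tempo: drop trace blocks after 14 days
compactor:
  compaction:
    block_retention: 336h
```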

Per-Tenant Limits (Multi-Tenancy)

Set per-tenant limits to prevent noisy neighbors:

# Mimir overrides
overrides:
  tenant-alpha:
    max_global_series_per_user: 500000
    ingestion_rate: 50000    # samples/sec
    max_fetched_series_per_query: 100000
  tenant-beta:
    max_global_series_per_user: 100000
    ingestion_rate: 10000

# Loki overrides
overrides:
  tenant-alpha:
    max_global_streams_per_user: 10000
    ingestion_rate_mb: 10
    max_query_length: 720h
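
Both Mimir and Loki load these overrides from a runtime configuration file that is re-read periodically, so per-tenant limits can change without a restart. A sketch; the file path is an assumption and should point at wherever you mount the overrides above:

```yaml
# Backend config fragment: hot-reload the per-tenant overrides file.
runtime_config:
  file: /etc/mimir/overrides.yaml
  period: 10s   # how often the file is re-read
```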

Reliability & Scaling

Scaling Decision Matrix

| Symptom                       | Component to Scale | How                                            |
|-------------------------------|--------------------|------------------------------------------------|
| Slow metric queries           | Mimir queriers     | Add querier replicas                           |
| Write backpressure on metrics | Mimir ingesters    | Add ingester replicas                          |
| Slow log search               | Loki queriers      | Add querier replicas, check label cardinality  |
| Log ingestion lag             | Loki ingesters     | Add ingester replicas, increase limits         |
| Slow trace search             | Tempo queriers     | Add querier replicas                           |
| Cache miss rate > 20%         | Memcached          | Add memcached replicas, increase memory        |
| Object storage latency        | All                | Verify same-AZ deployment, enable caching      |

High Availability Requirements

| Component                | HA Mechanism                           | Minimum Replicas  |
|--------------------------|----------------------------------------|-------------------|
| Ingesters (all backends) | Replication factor (RF=3 recommended)  | 3                 |
| Distributors             | Stateless, load-balanced               | 2+                |
| Queriers                 | Stateless, load-balanced               | 2+                |
| Compactors               | Leader election (single active)        | 1 (with standby)  |
| Store-gateways (Mimir)   | Sharded by blocks                      | 2+                |
| Query frontends          | Stateless, request splitting           | 2+                |
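
The ingester row of the table translates into Helm values roughly as follows. A sketch against the grafana/mimir-distributed chart; key names vary between chart versions, so verify them against your chart's values reference:

```yaml
# mimir-values.yaml sketch: RF=3 ingesters with zone-aware replication.
mimir:
  structuredConfig:
    ingester:
      ring:
        replication_factor: 3
        zone_awareness_enabled: true
ingester:
  replicas: 3                  # one per zone at minimum
  zoneAwareReplication:
    enabled: true
distributor:
  replicas: 2                  # stateless, behind the load balancer
```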

Cost

Cost Drivers

| Factor              | Primary Driver                       | Optimization                                            |
|---------------------|--------------------------------------|---------------------------------------------------------|
| Object storage      | Data volume × retention              | Set retention policies, use lifecycle rules, compress   |
| Compute (ingesters) | Ingestion rate                       | Right-size, use spot/preemptible nodes                  |
| Compute (queriers)  | Query volume and complexity          | Recording rules, caching, query limits                  |
| Network (cross-AZ)  | Cross-AZ traffic between components  | Co-locate in single AZ or use VPC endpoints             |
| Memcached           | Cache size × hit ratio               | Size to achieve > 80% hit rate                          |

Cost at Scale

| Scale      | Metrics (active series) | Logs (GB/day) | Traces (spans/day) | Est. Monthly Self-Hosted |
|------------|-------------------------|---------------|--------------------|--------------------------|
| Small      | 100k                    | 10 GB         | 5M                 | $200–500                 |
| Medium     | 1M                      | 100 GB        | 50M                | $1,000–3,000             |
| Large      | 10M                     | 1 TB          | 500M               | $5,000–15,000            |
| Enterprise | 100M+                   | 10 TB+        | 5B+                | $20,000–100,000+         |

Cost Optimization Strategies

  1. Recording rules — precompute expensive PromQL queries in Mimir
  2. Adaptive Metrics (Grafana Cloud) — automatically drop unused metrics
  3. Adaptive Logs (Grafana Cloud) — automatically reduce noisy log volumes
  4. Sampling — head-based or tail-based trace sampling to reduce Tempo costs
  5. Log pipeline filtering — drop debug/info logs in Alloy before they reach Loki
  6. Single-AZ deployment — eliminate cross-AZ network costs (accept reduced availability)
  7. Object storage lifecycle rules — transition old data to cheaper tiers (S3 IA/Glacier)
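
Strategy 1 in practice: a recording-rule sketch for Mimir's ruler that precomputes a per-service error ratio once a minute, so dashboards query the cheap precomputed series instead of re-evaluating the raw expression. The metric and label names (`http_requests_total`, `service`, `status`) are assumptions:

```yaml
# Recording rule sketch: precompute an expensive PromQL expression.
groups:
  - name: precomputed
    interval: 1m
    rules:
      - record: service:http_errors:ratio_rate5m
        expr: |
          sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum by (service) (rate(http_requests_total[5m]))
```

Dashboards then query `service:http_errors:ratio_rate5m` directly.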

Security

Authentication & Multi-Tenancy

  1. Deploy an auth proxy (NGINX, Envoy, or API gateway) in front of all backends
  2. The proxy authenticates users and injects X-Scope-OrgID based on verified identity
  3. Never expose backends directly without authentication
  4. Use per-tenant limits to prevent resource exhaustion
  5. All inter-component communication should use mTLS in production
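
With ingress-nginx, steps 1–2 can be sketched as an Ingress that delegates authentication to an external service and then sets the tenant header. The `auth-service` URL, hostname, and backend port are assumptions, and the static tenant shown here is illustrative only; real deployments derive it from the verified identity:

```yaml
# Ingress sketch: authenticate upstream, then inject X-Scope-OrgID.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: mimir-write
  namespace: monitoring
  annotations:
    nginx.ingress.kubernetes.io/auth-url: "http://auth-service.monitoring.svc/validate"
    nginx.ingress.kubernetes.io/configuration-snippet: |
      proxy_set_header X-Scope-OrgID "tenant-alpha";
spec:
  ingressClassName: nginx
  rules:
    - host: mimir.example.com
      http:
        paths:
          - path: /api/v1/push
            pathType: Prefix
            backend:
              service:
                name: mimir-distributor
                port:
                  number: 8080
```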

Network Security

  • Use NetworkPolicies in Kubernetes to restrict pod-to-pod communication
  • Only Alloy should talk to backend Distributors
  • Only Query Frontends should be exposed to Grafana
  • Object storage should be accessed via VPC endpoints (no public internet)
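
The "only Alloy talks to Distributors" rule can be sketched as a NetworkPolicy. Pod labels and the port are assumptions based on common chart conventions; repeat the pattern for the Mimir and Tempo distributors:

```yaml
# NetworkPolicy sketch: only Alloy pods may reach the Loki distributor.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: loki-distributor-ingress
  namespace: monitoring
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: loki
      app.kubernetes.io/component: distributor
  policyTypes: [Ingress]
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: alloy
      ports:
        - protocol: TCP
          port: 3100   # Loki HTTP listen port
```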

Best Practices

Instrumentation

  1. Use OpenTelemetry everywhere — standardize on OTLP as the protocol
  2. Inject trace IDs into logs — this is the foundation of log-trace correlation
  3. Set resource attributes (service.name, deployment.environment, k8s.pod.name) on every signal
  4. Use auto-instrumentation first — Java agent, Python's opentelemetry-instrument, eBPF-based instrumentation for Go
  5. Add manual spans for critical business logic the auto-instrumentation misses
  6. Sample in production — head-based (simple) or tail-based (captures errors/slow) sampling
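
Item 6's tail-based option can be sketched with the OpenTelemetry Collector's `tail_sampling` processor (contrib). Policies are OR-ed: keep every error trace, everything slower than 500 ms, and a 10% baseline of the rest; the thresholds are illustrative:

```yaml
# Tail-sampling sketch: decide after seeing the whole trace.
processors:
  tail_sampling:
    decision_wait: 10s          # buffer spans this long before deciding
    policies:
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: slow
        type: latency
        latency:
          threshold_ms: 500
      - name: baseline
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
```

Tail sampling must run on a component that sees all spans of a trace, so route traces to the sampling collector consistently (e.g. by trace ID).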

Operations

  1. Monitor the monitoring — deploy a separate, smaller "meta-monitoring" LGTM stack that monitors the primary stack
  2. Use Grafana mixins — pre-built dashboards for Mimir, Loki, Tempo internals
  3. Label governance — enforce labeling standards to prevent cardinality explosions
  4. Test with load — use k6 with the Grafana extension to load-test the stack before production
  5. GitOps everything — dashboards, alerts, data sources, and Helm values in version control

Common Issues & Playbook

| Symptom                            | Likely Cause                                   | Fix                                                             |
|------------------------------------|------------------------------------------------|-----------------------------------------------------------------|
| "too many outstanding requests"    | Ingester overwhelmed                           | Scale ingesters, increase per-tenant limits                     |
| "max streams limit reached" (Loki) | High label cardinality                         | Reduce label cardinality, drop high-cardinality labels in Alloy |
| "context deadline exceeded"        | Slow object storage or oversized query         | Enable caching, add query limits, check AZ placement            |
| Exemplars not showing              | Mimir not storing exemplars                    | Enable exemplar_storage in Mimir, verify app instrumentation    |
| Trace-to-logs not working          | Missing trace ID in logs                       | Verify OTel SDK injects trace_id into log output                |
| Derived fields not clickable       | Regex doesn't match                            | Test regex against actual log lines, verify Loki DS config      |
| High memory on ingesters           | WAL too large or too many active series/streams | Increase ingester memory, tune WAL flush interval              |
| Slow TraceQL queries               | Large time range or low selectivity            | Narrow time range, add specific attribute filters               |

Monitoring & Troubleshooting

Key Metrics to Monitor (Meta-Monitoring)

| Component      | Metric                                             | What It Tells You    |
|----------------|----------------------------------------------------|----------------------|
| All            | *_request_duration_seconds                         | Internal API latency |
| Ingesters      | *_ingester_memory_series / *_live_entries          | In-memory load       |
| Distributors   | *_distributor_received_samples_total               | Ingestion throughput |
| Queriers       | *_querier_request_duration_seconds                 | Query latency        |
| Compactors     | *_compactor_runs_completed_total                   | Compaction health    |
| Object storage | *_thanos_objstore_bucket_operation_duration_seconds | Storage latency     |
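
The compactor metric above lends itself to a meta-monitoring alert. A sketch using Mimir's metric name (`cortex_compactor_runs_completed_total`); the Loki and Tempo equivalents differ, so check each backend's exposed metrics:

```yaml
# Alert rule sketch: fire if Mimir's compactor stalls for a day.
groups:
  - name: lgtm-meta
    rules:
      - alert: MimirCompactorStalled
        expr: increase(cortex_compactor_runs_completed_total[24h]) == 0
        for: 1h
        labels:
          severity: critical
        annotations:
          summary: "Mimir compactor has not completed a run in 24h"
```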

Grafana Mixins

Pre-built monitoring dashboards for each LGTM component:

  • Mimir: operations/mimir-mixin/ in the grafana/mimir repo
  • Loki: production/loki-mixin/ in the grafana/loki repo
  • Tempo: operations/tempo-mixin/ in the grafana/tempo repo