
Operations

Deployment & Typical Setup

Quick Dev Setup (All-in-One Docker)

The fastest way to start with LGTM — a single Docker image with all components:

docker run --name lgtm \
  -p 3000:3000 \
  -p 4317:4317 \
  -p 4318:4318 \
  --rm -ti grafana/otel-lgtm
  • Grafana UI: http://localhost:3000 (admin/admin)
  • OTLP gRPC: localhost:4317
  • OTLP HTTP: localhost:4318

Includes: OTel Collector, Prometheus, Loki, Tempo, Pyroscope, Grafana — all pre-wired.

Production Setup (Kubernetes)

For production, deploy each component independently via Helm:

# Add repository
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

# Deploy in order: storage backends first, then Grafana
helm install mimir grafana/mimir-distributed -n monitoring --create-namespace -f mimir-values.yaml
helm install loki grafana/loki -n monitoring -f loki-values.yaml
helm install tempo grafana/tempo-distributed -n monitoring -f tempo-values.yaml
helm install pyroscope grafana/pyroscope -n monitoring -f pyroscope-values.yaml
helm install alloy grafana/alloy -n monitoring -f alloy-values.yaml
helm install grafana grafana/grafana -n monitoring -f grafana-values.yaml

Production Readiness Checklist

  • Object storage configured for Mimir, Loki, Tempo (separate buckets)
  • PostgreSQL for Grafana metadata (not SQLite)
  • Redis for Grafana session management
  • Memcached for query/chunk/index caching (Mimir + Loki)
  • Auth proxy for multi-tenant X-Scope-OrgID injection
  • Ingress / LB with TLS termination
  • HPA configured per component
  • Resource requests/limits set on all pods
  • Provisioning for data sources, dashboards, alerting rules
  • Cross-signal correlation configured (exemplars, trace-to-logs, derived fields)
  • Retention policies set per backend
  • Monitoring the monitoring — meta-monitoring for LGTM components

Configuration & Optimal Tuning

Label Strategy (CRITICAL for Loki)

The #1 operational pitfall is label cardinality. Follow these rules:

Label Type | Good ✅ | Bad ❌
Static metadata | namespace, pod, job, env | user_id, request_id, ip_address
Bounded values | status_code (200, 404, 500) | timestamp, trace_id
Grouping | team, region, cluster | url_path (unbounded)

Target: Keep active streams below 10,000 per tenant for optimal performance.
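
To see why a single unbounded label matters, here is a rough Python sketch that estimates the worst-case stream count as the product of distinct values per label. The label counts are made-up illustrations, not measurements, and real stream counts are usually lower than this upper bound because not every combination occurs.

```python
from math import prod


def estimated_streams(distinct_values_per_label: dict) -> int:
    """Worst-case stream count: every combination of label values occurs."""
    return prod(distinct_values_per_label.values())


# Hypothetical label sets for one tenant.
bounded = {"namespace": 5, "job": 20, "env": 3}
with_user_id = {**bounded, "user_id": 50_000}

print(estimated_streams(bounded))       # 300
print(estimated_streams(with_user_id))  # 15000000
```

With only bounded labels the estimate stays far below the 10,000-stream target; adding user_id pushes it several orders of magnitude past it.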

Retention Configuration

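Component | Config Key | Recommended Default
Mimir | limits.compactor_blocks_retention_period | 13 months (metrics)
Loki | limits_config.retention_period | 30 days (logs)
Tempo | compactor.compaction.block_retention | 14–30 days (traces)
Pyroscope | retention config | 14 days (profiles)

Note: in Mimir, blocks_storage.tsdb.retention_period only controls how long ingesters keep local TSDB blocks; long-term retention in object storage is set by compactor_blocks_retention_period.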

Per-Tenant Limits (Multi-Tenancy)

Set per-tenant limits to prevent noisy neighbors:

# Mimir overrides
overrides:
  tenant-alpha:
    max_global_series_per_user: 500000
    ingestion_rate: 50000    # samples/sec
    max_fetched_series_per_query: 100000
  tenant-beta:
    max_global_series_per_user: 100000
    ingestion_rate: 10000

# Loki overrides
overrides:
  tenant-alpha:
    max_global_streams_per_user: 10000
    ingestion_rate_mb: 10
    max_query_length: 720h
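
Before raising or lowering a limit it helps to know how close each tenant already is. The following is a minimal Python sketch, assuming you have already collected current usage per limit from your meta-monitoring; the observed numbers and the 20% threshold are hypothetical.

```python
def limit_headroom(limits: dict, usage: dict) -> dict:
    """Fraction of each limit still unused; a negative value means over the limit."""
    return {
        key: round(1 - usage.get(key, 0) / limit, 3)
        for key, limit in limits.items()
    }


# Limits taken from the tenant-alpha Mimir override above.
tenant_alpha_limits = {"max_global_series_per_user": 500_000, "ingestion_rate": 50_000}
# Hypothetical observed usage for the same tenant.
observed = {"max_global_series_per_user": 450_000, "ingestion_rate": 20_000}

headroom = limit_headroom(tenant_alpha_limits, observed)
# Flag limits with less than 20% headroom (threshold is an arbitrary example).
near_limit = [key for key, free in headroom.items() if free < 0.2]
print(headroom, near_limit)
```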

Reliability & Scaling

Scaling Decision Matrix

Symptom | Component to Scale | How
Slow metric queries | Mimir queriers | Add querier replicas
Write backpressure on metrics | Mimir ingesters | Add ingester replicas
Slow log search | Loki queriers | Add querier replicas, check label cardinality
Log ingestion lag | Loki ingesters | Add ingester replicas, increase limits
Slow trace search | Tempo queriers | Add querier replicas
Cache miss rate > 20% | Memcached | Add memcached replicas, increase memory
Object storage latency | All | Verify same-AZ deployment, enable caching

High Availability Requirements

Component | HA Mechanism | Minimum Replicas
Ingesters (all backends) | Replication factor (RF=3 recommended) | 3
Distributors | Stateless, load-balanced | 2+
Queriers | Stateless, load-balanced | 2+
Compactors | Leader election (single active) | 1 (with standby)
Store-Gateways (Mimir) | Sharded by blocks | 2+
Query Frontends | Stateless, request splitting | 2+
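
The RF=3 recommendation follows from quorum arithmetic. The sketch below assumes writes succeed once a majority of replicas acknowledge them, which is how quorum-based replication generally works; the function names are illustrative only.

```python
def write_quorum(replication_factor: int) -> int:
    """Smallest majority of replicas that must acknowledge a write."""
    return replication_factor // 2 + 1


def tolerated_failures(replication_factor: int) -> int:
    """Replicas that can be down while writes still reach quorum."""
    return replication_factor - write_quorum(replication_factor)


for rf in (1, 2, 3, 5):
    print(rf, write_quorum(rf), tolerated_failures(rf))
```

Under this assumption RF=2 tolerates no more failures than RF=1, which is why three replicas is the smallest setting that survives the loss of one ingester.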

Cost

Cost Drivers

Factor | Primary Driver | Optimization
Object storage | Data volume × retention | Set retention policies, use lifecycle rules, compress
Compute (ingesters) | Ingestion rate | Right-size, use spot/preemptible nodes
Compute (queriers) | Query volume and complexity | Recording rules, caching, query limits
Network (cross-AZ) | Cross-AZ traffic between components | Co-locate in single AZ or use VPC endpoints
Memcached | Cache size × hit ratio | Size to achieve > 80% hit rate
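
The "data volume × retention" driver can be turned into a back-of-the-envelope estimate. This is a rough Python sketch; the compression ratio and the price per GB-month are placeholder assumptions, so substitute figures measured in your own environment.

```python
def steady_state_storage_gb(ingest_gb_per_day: float, retention_days: int,
                            compression_ratio: float) -> float:
    """Stored volume once the retention window is full."""
    return ingest_gb_per_day * retention_days / compression_ratio


def monthly_storage_cost(stored_gb: float, price_per_gb_month: float) -> float:
    """Object storage cost for the stored volume (requests and egress excluded)."""
    return stored_gb * price_per_gb_month


# Hypothetical inputs: 100 GB/day of logs, 30-day retention, 10x compression.
stored = steady_state_storage_gb(100, 30, 10)
cost = monthly_storage_cost(stored, price_per_gb_month=0.023)  # placeholder price
print(f"{stored:.0f} GB stored, about ${cost:.2f} per month")
```

The estimate covers storage only; compute, caching, and cross-AZ traffic usually dominate the totals in the table below.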

Cost at Scale

Scale | Metrics (active series) | Logs (GB/day) | Traces (spans/day) | Est. Monthly Self-Hosted
Small | 100k | 10 GB | 5M | $200–500
Medium | 1M | 100 GB | 50M | $1,000–3,000
Large | 10M | 1 TB | 500M | $5,000–15,000
Enterprise | 100M+ | 10 TB+ | 5B+ | $20,000–100,000+

Cost Optimization Strategies

  1. Recording rules — precompute expensive PromQL queries in Mimir
  2. Adaptive Metrics (Grafana Cloud) — automatically drop unused metrics
  3. Adaptive Logs (Grafana Cloud) — automatically reduce noisy log volumes
  4. Sampling — head-based or tail-based trace sampling to reduce Tempo costs
  5. Log pipeline filtering — drop debug/info logs in Alloy before they reach Loki
  6. Single-AZ deployment — eliminate cross-AZ network costs (accept reduced availability)
  7. Object storage lifecycle rules — transition old data to cheaper tiers (S3 IA/Glacier)

Security

Authentication & Multi-Tenancy

  1. Deploy an auth proxy (NGINX, Envoy, or API gateway) in front of all backends
  2. The proxy authenticates users and injects X-Scope-OrgID based on verified identity
  3. Never expose backends directly without authentication
  4. Use per-tenant limits to prevent resource exhaustion
  5. All inter-component communication should use mTLS in production
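
The header-injection step in items 1 and 2 can be sketched in a few lines of Python. This is not a real proxy, only the header logic one might implement in it; the user-to-tenant mapping and the function name are hypothetical, and the sketch assumes the user has already been authenticated upstream.

```python
def upstream_headers(client_headers: dict, verified_user: str,
                     user_to_tenant: dict) -> dict:
    """Build headers for the backend request from a verified identity."""
    tenant = user_to_tenant.get(verified_user)
    if tenant is None:
        raise PermissionError(f"no tenant mapped for user {verified_user!r}")
    # Discard any tenant header sent by the client (header names are case-insensitive).
    headers = {k: v for k, v in client_headers.items() if k.lower() != "x-scope-orgid"}
    # Set the tenant from the verified identity only.
    headers["X-Scope-OrgID"] = tenant
    return headers


tenants = {"alice": "tenant-alpha", "bob": "tenant-beta"}  # hypothetical mapping
spoofed = {"Accept": "application/json", "x-scope-orgid": "tenant-beta"}
print(upstream_headers(spoofed, "alice", tenants))
```

Dropping the client-supplied header before setting the verified one is the important part: otherwise a caller could read or write another tenant's data simply by sending its ID.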

Network Security

  • Use NetworkPolicies in Kubernetes to restrict pod-to-pod communication
  • Only Alloy should talk to backend Distributors
  • Only Query Frontends should be exposed to Grafana
  • Object storage should be accessed via VPC endpoints (no public internet)

Best Practices

Instrumentation

  1. Use OpenTelemetry everywhere — standardize on OTLP as the protocol
  2. Inject trace IDs into logs — this is the foundation of log-trace correlation
  3. Set resource attributes: service.name, deployment.environment, k8s.pod.name on every signal
  4. Use auto-instrumentation first — Java Agent, Python instrument, eBPF for Go
  5. Add manual spans for critical business logic the auto-instrumentation misses
  6. Sample in production — head-based (simple) or tail-based (captures errors/slow) sampling

Operations

  1. Monitor the monitoring — deploy a separate, smaller "meta-monitoring" LGTM stack that monitors the primary stack
  2. Use Grafana mixins — pre-built dashboards for Mimir, Loki, Tempo internals
  3. Label governance — enforce labeling standards to prevent cardinality explosions
  4. Test with load — use k6 with the Grafana extension to load-test the stack before production
  5. GitOps everything — dashboards, alerts, data sources, and Helm values in version control

Common Issues & Playbook

Symptom | Likely Cause | Fix
"too many outstanding requests" | Per-tenant query queue is full on the query frontend/scheduler | Scale queriers, raise the per-tenant outstanding-request limit, reduce query fan-out
"max streams limit reached" (Loki) | High label cardinality | Reduce label cardinality, drop high-cardinality labels in Alloy
"context deadline exceeded" | Slow object storage or oversized query | Enable caching, add query limits, check AZ placement
Exemplars not showing | Mimir not storing exemplars | Set max_global_exemplars_per_user above 0 in Mimir limits, verify app instrumentation
Trace-to-logs not working | Missing trace ID in logs | Verify OTel SDK injects trace_id into log output
Derived fields not clickable | Regex doesn't match | Test regex against actual log lines, verify Loki DS config
High memory on ingesters | WAL too large or too many active series/streams | Increase ingester memory, tune WAL flush interval
Slow TraceQL queries | Large time range or low selectivity | Narrow time range, add specific attribute filters
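
For the "derived fields not clickable" row, a quick offline check of the regex against real log lines usually finds the problem. The sketch below uses Python's re module as a stand-in; Grafana evaluates the pattern with its own regex engine, so treat this as an approximation. The sample log lines are invented.

```python
import re
from typing import Optional

# Pattern as used in a Loki derived-field definition.
DERIVED_FIELD_REGEX = r'"traceID":"(\w+)"'


def extract_trace_id(pattern: str, line: str) -> Optional[str]:
    """Return the first capture group, or None when the line does not match."""
    match = re.search(pattern, line)
    return match.group(1) if match else None


samples = [
    '{"level":"INFO","traceID":"5b8efff798038103d269b633813fc60c","msg":"ok"}',
    '{"level": "INFO", "traceID": "5b8efff798038103d269b633813fc60c", "msg": "ok"}',
    'level=info trace_id=5b8efff798038103d269b633813fc60c msg=ok',
]
results = [extract_trace_id(DERIVED_FIELD_REGEX, line) for line in samples]
print(results)
```

Only the first sample matches: the second has a space after the colon and the third uses logfmt with a different key, so each would need its own pattern (for example one that allows optional whitespace).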

Monitoring & Troubleshooting

Key Metrics to Monitor (Meta-Monitoring)

Component | Metric | What It Tells You
All | *_request_duration_seconds | Internal API latency
Ingesters | *_ingester_memory_series / *_live_entries | In-memory load
Distributors | *_distributor_received_samples_total | Ingestion throughput
Queriers | *_querier_request_duration_seconds | Query latency
Compactors | *_compactor_runs_completed_total | Compaction health
Object Storage | *_thanos_objstore_bucket_operation_duration_seconds | Storage latency

Grafana Mixins

Pre-built monitoring dashboards for each LGTM component:

  • Mimir: grafana/mimir, under operations/mimir-mixin/
  • Loki: grafana/loki, under production/loki-mixin/
  • Tempo: grafana/tempo, under operations/tempo-mixin/


Commands & Recipes

Quick Start (All-in-One Docker)

# Start the entire LGTM stack in one container (dev/testing only)
docker run --name lgtm \
  -p 3000:3000 \
  -p 4317:4317 \
  -p 4318:4318 \
  --rm -ti grafana/otel-lgtm

# With persistent data
docker run --name lgtm \
  -v "$(pwd)/data:/data" \
  -p 3000:3000 \
  -p 4317:4317 \
  -p 4318:4318 \
  grafana/otel-lgtm

# Enable internal component logs
docker run --name lgtm \
  -e ENABLE_LOGS_ALL=true \
  -p 3000:3000 \
  -p 4317:4317 \
  -p 4318:4318 \
  grafana/otel-lgtm

Test it immediately — send a trace with curl:

curl -X POST http://localhost:4318/v1/traces \
  -H "Content-Type: application/json" \
  -d '{
    "resourceSpans": [{
      "resource": {"attributes": [{"key": "service.name", "value": {"stringValue": "test-service"}}]},
      "scopeSpans": [{
        "spans": [{
          "traceId": "5b8efff798038103d269b633813fc60c",
          "spanId": "eee19b7ec3c1b174",
          "name": "test-span",
          "kind": 1,
          "startTimeUnixNano": "1544712660000000000",
          "endTimeUnixNano": "1544712661000000000"
        }]
      }]
    }]
  }'
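
The curl example above uses fixed timestamps from 2018, so the span can fall outside the default time range in Grafana. As an alternative, this Python sketch builds the same OTLP/JSON structure with the current time and random IDs. The function name is illustrative, and the payload mirrors only the fields shown in the curl example.

```python
import json
import secrets
import time


def build_test_trace(service_name: str, span_name: str, duration_ms: int = 100) -> dict:
    """Build a one-span OTLP/JSON payload that ends now."""
    end_ns = time.time_ns()
    start_ns = end_ns - duration_ms * 1_000_000
    return {
        "resourceSpans": [{
            "resource": {"attributes": [
                {"key": "service.name", "value": {"stringValue": service_name}}
            ]},
            "scopeSpans": [{
                "spans": [{
                    "traceId": secrets.token_hex(16),  # 32 hex characters
                    "spanId": secrets.token_hex(8),    # 16 hex characters
                    "name": span_name,
                    "kind": 1,
                    "startTimeUnixNano": str(start_ns),
                    "endTimeUnixNano": str(end_ns),
                }]
            }]
        }]
    }


payload = build_test_trace("test-service", "test-span")
body = json.dumps(payload)
print(body)
```

Redirect the output to a file and send it with the same curl command by replacing the inline body with -d @file.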

Grafana Data Source Provisioning (Cross-Signal)

This is the most critical provisioning file for the LGTM stack — it wires up all cross-signal correlation:

# /etc/grafana/provisioning/datasources/lgtm.yaml
apiVersion: 1
datasources:
  # === METRICS (Mimir) ===
  - name: Mimir
    type: prometheus
    uid: mimir
    access: proxy
    url: http://mimir-query-frontend:8080/prometheus
    isDefault: true
    jsonData:
      httpMethod: POST
      exemplarTraceIdDestinations:
        - name: traceID
          datasourceUid: tempo

  # === LOGS (Loki) ===
  - name: Loki
    type: loki
    uid: loki
    access: proxy
    url: http://loki-gateway:3100
    jsonData:
      derivedFields:
        - datasourceUid: tempo
          matcherRegex: '"traceID":"(\w+)"'
          name: TraceID
          url: '$${__value.raw}'
          urlDisplayLabel: 'View Trace'

  # === TRACES (Tempo) ===
  - name: Tempo
    type: tempo
    uid: tempo
    access: proxy
    url: http://tempo-query-frontend:3200
    jsonData:
      tracesToLogsV2:
        datasourceUid: loki
        spanStartTimeShift: '-1h'
        spanEndTimeShift: '1h'
        tags:
          - key: service.name
            value: service_name
        filterByTraceID: true
        filterBySpanID: false
      tracesToMetrics:
        datasourceUid: mimir
        spanStartTimeShift: '-1h'
        spanEndTimeShift: '1h'
        tags:
          - key: service.name
            value: job
        queries:
          - name: 'Request Rate'
            query: 'sum(rate(traces_spanmetrics_calls_total{$$__tags}[5m]))'
          - name: 'Error Rate'
            query: 'sum(rate(traces_spanmetrics_calls_total{$$__tags,status_code="STATUS_CODE_ERROR"}[5m]))'
      tracesToProfiles:
        datasourceUid: pyroscope
        tags:
          - key: service.name
            value: service_name
        profileTypeId: 'process_cpu:cpu:nanoseconds:cpu:nanoseconds'
      serviceMap:
        datasourceUid: mimir
      nodeGraph:
        enabled: true

  # === PROFILES (Pyroscope) ===
  - name: Pyroscope
    type: grafana-pyroscope-datasource
    uid: pyroscope
    access: proxy
    url: http://pyroscope:4040

Alloy Configuration (Full LGTM Pipeline)

// config.alloy — Full LGTM pipeline with all 4 signals

// =============================================
// RECEIVERS
// =============================================

// OTLP receiver for traces, metrics, and logs
otelcol.receiver.otlp "default" {
  grpc { endpoint = "0.0.0.0:4317" }
  http { endpoint = "0.0.0.0:4318" }

  // The memory limiter runs first so overload is rejected before data is batched
  output {
    metrics = [otelcol.processor.memory_limiter.default.input]
    traces  = [otelcol.processor.memory_limiter.default.input]
    logs    = [otelcol.processor.memory_limiter.default.input]
  }
}

// Prometheus scrape for Kubernetes pods
prometheus.scrape "k8s_pods" {
  targets    = discovery.kubernetes.pods.targets
  forward_to = [prometheus.remote_write.mimir.receiver]
}

discovery.kubernetes "pods" {
  role = "pod"
}

// =============================================
// PROCESSORS
// =============================================

otelcol.processor.memory_limiter "default" {
  check_interval = "1s"
  limit          = "512MiB"

  output {
    metrics = [otelcol.processor.batch.default.input]
    traces  = [otelcol.processor.batch.default.input]
    logs    = [otelcol.processor.batch.default.input]
  }
}

otelcol.processor.batch "default" {
  timeout         = "5s"
  send_batch_size = 8192

  output {
    metrics = [otelcol.exporter.prometheus.mimir.input]
    traces  = [otelcol.exporter.otlp.tempo.input]
    logs    = [otelcol.exporter.loki.default.input]
  }
}

// =============================================
// EXPORTERS
// =============================================

// Metrics → Mimir
prometheus.remote_write "mimir" {
  endpoint {
    url = "http://mimir-distributor:8080/api/v1/push"
  }
}

otelcol.exporter.prometheus "mimir" {
  forward_to = [prometheus.remote_write.mimir.receiver]
}

// Traces → Tempo
otelcol.exporter.otlp "tempo" {
  client {
    endpoint = "tempo-distributor:4317"
    tls { insecure = true }
  }
}

// Logs → Loki
otelcol.exporter.loki "default" {
  forward_to = [loki.write.default.receiver]
}

loki.write "default" {
  endpoint {
    url = "http://loki-distributor:3100/loki/api/v1/push"
  }
}

Helm Values Snippets

Mimir (Key Production Settings)

# mimir-values.yaml (key settings only)
mimir:
  structuredConfig:
    common:
      storage:
        backend: s3
        s3:
          endpoint: s3.us-east-1.amazonaws.com
          bucket_name: observability-mimir-blocks
          region: us-east-1
    blocks_storage:
      tsdb:
        retention_period: 13h  # how long ingesters keep local TSDB blocks; not long-term retention
      bucket_store:
        sync_interval: 15m
    limits:
      max_global_series_per_user: 1500000
      ingestion_rate: 100000
    ruler_storage:
      backend: s3
      s3:
        bucket_name: observability-mimir-ruler

ingester:
  replicas: 3
  resources:
    requests: { cpu: "1", memory: "4Gi" }
    limits:   { cpu: "2", memory: "8Gi" }
  persistentVolume:
    enabled: true
    size: 50Gi

querier:
  replicas: 2
  resources:
    requests: { cpu: "500m", memory: "2Gi" }

store_gateway:
  replicas: 2

compactor:
  replicas: 1

Loki (Key Production Settings)

# loki-values.yaml
loki:
  auth_enabled: true
  storage:
    type: s3
    s3:
      endpoint: s3.us-east-1.amazonaws.com
      bucketnames: observability-loki-chunks
      region: us-east-1
  schemaConfig:
    configs:
      - from: "2024-01-01"
        store: tsdb
        object_store: s3
        schema: v13
        index:
          prefix: loki_index_
          period: 24h
  limits_config:
    retention_period: 720h  # 30 days; enforced only when compactor retention is enabled
    max_global_streams_per_user: 10000
    ingestion_rate_mb: 20
    per_stream_rate_limit: 5MB

ingester:
  replicas: 3

querier:
  replicas: 2

Tempo (Key Production Settings)

# tempo-values.yaml
tempo:
  multitenancyEnabled: true
  storage:
    trace:
      backend: s3
      s3:
        bucket: observability-tempo-traces
        endpoint: s3.us-east-1.amazonaws.com
        region: us-east-1
  metricsGenerator:
    enabled: true
    remoteWriteUrl: "http://mimir-distributor:8080/api/v1/push"
  global_overrides:
    defaults:
      metrics_generator:
        processors: [span-metrics, service-graphs]

ingester:
  replicas: 3

querier:
  replicas: 2

compactor:
  replicas: 1

OpenTelemetry SDK Quickstart

Java (Auto-Instrumentation)

# Download the OTel Java agent
curl -LO https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/latest/download/opentelemetry-javaagent.jar

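# Run your app with the agent
# (agent 2.x defaults to http/protobuf, so set grpc explicitly when targeting port 4317)
java -javaagent:opentelemetry-javaagent.jar \
  -Dotel.service.name=my-service \
  -Dotel.exporter.otlp.protocol=grpc \
  -Dotel.exporter.otlp.endpoint=http://alloy:4317 \
  -jar my-app.jar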

Python (Auto-Instrumentation)

# Install OTel packages
pip install opentelemetry-distro opentelemetry-exporter-otlp

# Auto-install all detected instrumentation libraries
opentelemetry-bootstrap -a install

# Run your app with auto-instrumentation
OTEL_SERVICE_NAME=my-service \
OTEL_EXPORTER_OTLP_ENDPOINT=http://alloy:4317 \
opentelemetry-instrument python app.py

Go (Manual SDK)

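// Initialize OTel tracing in your Go app
import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    "go.opentelemetry.io/otel/sdk/resource"
    "go.opentelemetry.io/otel/sdk/trace"
    semconv "go.opentelemetry.io/otel/semconv/v1.26.0" // use the version matching your SDK
)

// initTracer registers a global TracerProvider that exports spans to Alloy over OTLP/gRPC.
// Call tp.Shutdown(ctx) on exit so buffered spans are flushed.
func initTracer(ctx context.Context) (*trace.TracerProvider, error) {
    exporter, err := otlptracegrpc.New(ctx,
        otlptracegrpc.WithEndpoint("alloy:4317"),
        otlptracegrpc.WithInsecure(), // plaintext; enable TLS outside trusted networks
    )
    if err != nil {
        return nil, err
    }
    tp := trace.NewTracerProvider(
        trace.WithBatcher(exporter),
        trace.WithResource(resource.NewWithAttributes(
            semconv.SchemaURL,
            semconv.ServiceNameKey.String("my-service"),
        )),
    )
    otel.SetTracerProvider(tp)
    return tp, nil
}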

Environment Variables (All Languages)

# Universal OTel configuration via env vars
export OTEL_SERVICE_NAME=my-service
export OTEL_EXPORTER_OTLP_ENDPOINT=http://alloy:4317
export OTEL_EXPORTER_OTLP_PROTOCOL=grpc
export OTEL_RESOURCE_ATTRIBUTES="deployment.environment=production,k8s.namespace.name=default"
export OTEL_TRACES_SAMPLER=parentbased_traceidratio
export OTEL_TRACES_SAMPLER_ARG=0.1  # 10% sampling
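
To make the parentbased_traceidratio setting concrete, here is a simplified Python sketch of the decision it describes: a span with a parent inherits the parent's decision, and a root span is kept when its trace ID falls below a threshold derived from the ratio. This mirrors one common implementation approach and is not the code of any particular SDK; all names are illustrative.

```python
import random
from typing import Optional

TRACE_ID_LIMIT = (1 << 64) - 1  # compare only the low 64 bits of the trace ID


def ratio_decision(trace_id: int, ratio: float) -> bool:
    """Deterministic decision: the same trace ID always yields the same answer."""
    bound = round(ratio * (TRACE_ID_LIMIT + 1))
    return (trace_id & TRACE_ID_LIMIT) < bound


def parent_based_decision(trace_id: int, ratio: float,
                          parent_sampled: Optional[bool] = None) -> bool:
    """Follow the parent's decision when there is one, else apply the ratio."""
    if parent_sampled is not None:
        return parent_sampled
    return ratio_decision(trace_id, ratio)


rng = random.Random(42)  # seeded so the run is repeatable
trace_ids = [rng.getrandbits(128) for _ in range(20_000)]
kept = sum(parent_based_decision(t, 0.1) for t in trace_ids)
kept_fraction = kept / len(trace_ids)
print(f"kept {kept_fraction:.1%} of root spans")
```

Because the decision depends only on the trace ID, every service that sees the same trace reaches the same verdict, which keeps sampled traces complete.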

Migration Recipes

Prometheus → Mimir (Add Long-Term Storage)

# Add to existing Prometheus config — zero-downtime migration
remote_write:
  - url: http://mimir-distributor:8080/api/v1/push
    headers:
      X-Scope-OrgID: default

Jaeger → Tempo (Trace Backend Swap)

# Tempo natively accepts Jaeger protocol
# Just re-point your Jaeger agents/collectors to Tempo's endpoint:
# Jaeger Thrift HTTP: tempo-distributor:14268
# Jaeger gRPC:        tempo-distributor:14250
# Or preferably, switch to OTLP: tempo-distributor:4317

Elasticsearch/Kibana → Loki/Grafana (Conceptual)

  1. Deploy Loki alongside Elasticsearch
  2. Configure Alloy to send logs to both Loki AND Elasticsearch (dual-write)
  3. Rebuild critical Kibana dashboards in Grafana using LogQL
  4. Validate data completeness and query parity
  5. Cut over: stop writing to Elasticsearch
  6. Decommission Elasticsearch after retention period expires

Useful One-Liners

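# Check LGTM component health
# (ports follow the URLs used elsewhere in this guide; adjust to your chart values)
for target in mimir-distributor:8080 loki-distributor:3100 tempo-distributor:3200; do
  echo "$target: $(curl -s "http://$target/ready")"
done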

# Query Mimir directly via curl
curl -s -H "X-Scope-OrgID: default" \
  "http://mimir-query-frontend:8080/prometheus/api/v1/query?query=up" | jq .

# Push a test log to Loki
curl -X POST -H "Content-Type: application/json" \
  -H "X-Scope-OrgID: default" \
  "http://loki-distributor:3100/loki/api/v1/push" \
  -d '{"streams":[{"stream":{"app":"test"},"values":[ ["'$(date +%s)000000000'","hello from curl"]]}]}'

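# Query Loki directly
# (-G with --data-urlencode keeps curl from globbing the {} selector and URL-encodes the LogQL)
curl -s -G -H "X-Scope-OrgID: default" \
  "http://loki-query-frontend:3100/loki/api/v1/query_range" \
  --data-urlencode 'query={app="test"}' \
  --data-urlencode 'limit=10' | jq .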

# Check Tempo trace by ID
curl -s "http://tempo-query-frontend:3200/api/traces/5b8efff798038103d269b633813fc60c" | jq .
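
The inline JSON in the "push a test log" one-liner is easy to mistype, so here is a small Python sketch that builds the same Loki push body. It only constructs and prints the payload; sending it is left to curl or an HTTP client of your choice. The function name and sample lines are illustrative.

```python
import json
import time


def loki_push_payload(labels: dict, lines: list) -> dict:
    """Build a Loki push body: one stream, one [timestamp, line] pair per log line."""
    now_ns = time.time_ns()
    # Timestamps are nanoseconds since the epoch, encoded as strings.
    # Adding the index keeps entries within the stream in order.
    values = [[str(now_ns + i), line] for i, line in enumerate(lines)]
    return {"streams": [{"stream": labels, "values": values}]}


payload = loki_push_payload({"app": "test"}, ["hello from python", "second line"])
print(json.dumps(payload))
```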