Operations¶

Deployment & Typical Setup¶

Quick Dev Setup (All-in-One Docker)¶

The fastest way to start with LGTM is a single Docker image that bundles all components:

docker run --name lgtm \
  -p 3000:3000 \
  -p 4317:4317 \
  -p 4318:4318 \
  --rm -ti grafana/otel-lgtm

Grafana UI: http://localhost:3000 (admin/admin)
OTLP gRPC: localhost:4317
OTLP HTTP: localhost:4318

Includes: OTel Collector, Prometheus, Loki, Tempo, Pyroscope, Grafana — all pre-wired.

Production Setup (Kubernetes)¶

For production, deploy each component independently via Helm:

# Add repository
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

# Deploy in order: storage backends first, then Grafana
helm install mimir grafana/mimir-distributed -n monitoring -f mimir-values.yaml
helm install loki grafana/loki -n monitoring -f loki-values.yaml
helm install tempo grafana/tempo-distributed -n monitoring -f tempo-values.yaml
helm install pyroscope grafana/pyroscope -n monitoring -f pyroscope-values.yaml
helm install alloy grafana/alloy -n monitoring -f alloy-values.yaml
helm install grafana grafana/grafana -n monitoring -f grafana-values.yaml

Production Readiness Checklist¶

Configuration & Optimal Tuning¶

Label Strategy (CRITICAL for Loki)¶

The #1 operational pitfall is label cardinality. Follow these rules:

Label Type	Good ✅	Bad ❌
Static metadata	`namespace`, `pod`, `job`, `env`	`user_id`, `request_id`, `ip_address`
Bounded values	`status_code` (200, 404, 500)	`timestamp`, `trace_id`
Grouping	`team`, `region`, `cluster`	`url_path` (unbounded)

Target: Keep active streams (unique label combinations) below 10,000 per tenant for optimal performance.
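
To see why unbounded labels are a problem, recall that every distinct combination of label values becomes its own stream. The sketch below estimates a worst-case stream count; the helper name `max_streams` and the value counts are made up for illustration and are not taken from Loki.

```python
from math import prod

def max_streams(distinct_values: dict[str, int]) -> int:
    """Upper bound on stream count: the product of distinct values per label."""
    return prod(distinct_values.values())

# Hypothetical value counts, for illustration only.
bounded = {"namespace": 10, "env": 3, "status_code": 5}
with_user_id = {**bounded, "user_id": 50_000}

print(max_streams(bounded))       # 150
print(max_streams(with_user_id))  # 7500000
```

Adding a single unbounded label moves the worst case from well under the 10,000-stream target to far beyond it.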

Retention Configuration¶

Component	Config Key	Recommended Defaults
Mimir	`limits.compactor_blocks_retention_period`	13 months (metrics)
Loki	`limits_config.retention_period` (requires `compactor.retention_enabled: true`)	30 days (logs)
Tempo	`compactor.compaction.block_retention`	14–30 days (traces)
Pyroscope	retention config	14 days (profiles)

Per-Tenant Limits (Multi-Tenancy)¶

Set per-tenant limits to prevent noisy neighbors:

# Mimir overrides
overrides:
  tenant-alpha:
    max_global_series_per_user: 500000
    ingestion_rate: 50000    # samples/sec
    max_fetched_series_per_query: 100000
  tenant-beta:
    max_global_series_per_user: 100000
    ingestion_rate: 10000

# Loki overrides
overrides:
  tenant-alpha:
    max_global_streams_per_user: 10000
    ingestion_rate_mb: 10
    max_query_length: 720h
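
Conceptually, an ingestion rate limit behaves like a token bucket: a steady refill rate plus a burst allowance. The following is an illustrative sketch only, not the implementation used by Mimir or Loki; the class name `TokenBucket` and the burst size are assumptions, while the rate reuses `tenant-beta`'s `ingestion_rate` from above.

```python
class TokenBucket:
    """Minimal token bucket: refills at `rate_per_sec`, capped at `burst`."""

    def __init__(self, rate_per_sec: float, burst: float) -> None:
        self.rate = rate_per_sec
        self.burst = burst
        self.tokens = burst  # start full
        self.last = 0.0      # timestamp of the previous call, in seconds

    def allow(self, n: int, now: float) -> bool:
        # Refill for the elapsed time, never exceeding the burst size.
        elapsed = now - self.last
        self.last = now
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        if n > self.tokens:
            return False  # over the limit: the push would be rejected
        self.tokens -= n
        return True

bucket = TokenBucket(rate_per_sec=10_000, burst=20_000)  # burst is hypothetical
print(bucket.allow(20_000, now=0.0))  # True: the full burst is available
print(bucket.allow(5_000, now=0.1))   # False: only about 1,000 tokens refilled
print(bucket.allow(5_000, now=1.0))   # True: about 10,000 tokens available
```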

Reliability & Scaling¶

Scaling Decision Matrix¶

Symptom	Component to Scale	How
Slow metric queries	Mimir queriers	Add querier replicas
Write backpressure on metrics	Mimir ingesters	Add ingester replicas
Slow log search	Loki queriers	Add querier replicas, check label cardinality
Log ingestion lag	Loki ingesters	Add ingester replicas, increase limits
Slow trace search	Tempo queriers	Add querier replicas
Cache miss rate > 20%	Memcached	Add memcached replicas, increase memory
Object storage latency	All	Verify same-AZ deployment, enable caching

High Availability Requirements¶

Component	HA Mechanism	Minimum Replicas
Ingesters (all backends)	Replication factor (RF=3 recommended)	3
Distributors	Stateless, load-balanced	2+
Queriers	Stateless, load-balanced	2+
Compactors	Leader election (single active)	1 (with standby)
Store-Gateways (Mimir)	Sharded by blocks	2+
Query Frontends	Stateless, request splitting	2+
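
The RF=3 recommendation follows from majority-quorum arithmetic. The sketch below assumes writes must be acknowledged by a majority of replicas; the function names are hypothetical.

```python
def write_quorum(replication_factor: int) -> int:
    """Smallest majority of replicas that must acknowledge a write."""
    return replication_factor // 2 + 1

def tolerated_failures(replication_factor: int) -> int:
    """Replicas that can be down while writes still reach a quorum."""
    return replication_factor - write_quorum(replication_factor)

for rf in (1, 2, 3):
    print(rf, write_quorum(rf), tolerated_failures(rf))
# 1 1 0
# 2 2 0
# 3 2 1
```

Under this assumption, RF=3 is the smallest factor that survives the loss of one ingester, which is why three replicas is the minimum in the table.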

Cost¶

Cost Drivers¶

Factor	Primary Driver	Optimization
Object storage	Data volume × retention	Set retention policies, use lifecycle rules, compress
Compute (ingesters)	Ingestion rate	Right-size, use spot/preemptible nodes
Compute (queriers)	Query volume and complexity	Recording rules, caching, query limits
Network (cross-AZ)	Cross-AZ traffic between components	Co-locate in single AZ or use VPC endpoints
Memcached	Cache size × hit ratio	Size to achieve > 80% hit rate
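
The "data volume × retention" driver can be turned into a rough estimate. This is a back-of-the-envelope sketch: the compression ratio and the per-GB price are placeholder assumptions, not measured or quoted figures.

```python
def stored_gb(daily_gb: float, retention_days: int, compression_ratio: float) -> float:
    """Steady-state object storage footprint for one signal."""
    return daily_gb * retention_days / compression_ratio

def monthly_storage_cost(gb: float, usd_per_gb_month: float) -> float:
    """Monthly cost of keeping `gb` in object storage."""
    return gb * usd_per_gb_month

# Assumed inputs: 100 GB/day of logs, 30-day retention, 10:1 compression.
footprint = stored_gb(daily_gb=100, retention_days=30, compression_ratio=10)
print(footprint)  # 300.0

# Assumed price of 0.023 USD per GB-month; substitute your provider's rate.
cost = monthly_storage_cost(footprint, usd_per_gb_month=0.023)
print(f"{cost:.2f}")
```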

Cost at Scale¶

Scale	Metrics (active series)	Logs (GB/day)	Traces (spans/day)	Est. Monthly Self-Hosted
Small	100k	10 GB	5M	$200–500
Medium	1M	100 GB	50M	$1,000–3,000
Large	10M	1 TB	500M	$5,000–15,000
Enterprise	100M+	10 TB+	5B+	$20,000–100,000+

Cost Optimization Strategies¶

Recording rules — precompute expensive PromQL queries in Mimir
Adaptive Metrics (Grafana Cloud) — automatically drop unused metrics
Adaptive Logs (Grafana Cloud) — automatically reduce noisy log volumes
Sampling — head-based or tail-based trace sampling to reduce Tempo costs
Log pipeline filtering — drop debug/info logs in Alloy before they reach Loki
Single-AZ deployment — eliminate cross-AZ network costs (accept reduced availability)
Object storage lifecycle rules — transition old data to cheaper tiers (S3 IA/Glacier)

Security¶

Authentication & Multi-Tenancy¶

Deploy an auth proxy (NGINX, Envoy, or API gateway) in front of all backends
The proxy authenticates users and injects X-Scope-OrgID based on verified identity
Never expose backends directly without authentication
Use per-tenant limits to prevent resource exhaustion
All inter-component communication should use mTLS in production
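
The header-injection step can be sketched as follows. This is a simplified illustration of the logic an auth proxy would apply, not a configuration for NGINX or Envoy; the user-to-tenant mapping and the function name are hypothetical.

```python
# Hypothetical mapping from authenticated user to tenant.
TENANT_BY_USER = {"alice": "tenant-alpha", "bob": "tenant-beta"}

def upstream_headers(verified_user: str, client_headers: dict[str, str]) -> dict[str, str]:
    """Build the headers forwarded to the backend for an authenticated user."""
    tenant = TENANT_BY_USER.get(verified_user)
    if tenant is None:
        raise PermissionError(f"no tenant for user {verified_user!r}")
    # Drop any client-supplied tenant header (case-insensitive) so it cannot be spoofed.
    headers = {k: v for k, v in client_headers.items() if k.lower() != "x-scope-orgid"}
    headers["X-Scope-OrgID"] = tenant
    return headers

# A client trying to impersonate another tenant is overridden.
print(upstream_headers("alice", {"x-scope-orgid": "tenant-beta", "Accept": "*/*"}))
# {'Accept': '*/*', 'X-Scope-OrgID': 'tenant-alpha'}
```

The important property is that the tenant ID always comes from the verified identity and never from the incoming request.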

Network Security¶

Use NetworkPolicies in Kubernetes to restrict pod-to-pod communication
Only Alloy should talk to backend Distributors
Only Query Frontends should be exposed to Grafana
Object storage should be accessed via VPC endpoints (no public internet)

Best Practices¶

Instrumentation¶

Use OpenTelemetry everywhere — standardize on OTLP as the protocol
Inject trace IDs into logs — this is the foundation of log-trace correlation
Set resource attributes — service.name, deployment.environment, k8s.pod.name on every signal
Use auto-instrumentation first: the Java agent, `opentelemetry-instrument` for Python, eBPF for Go
Add manual spans for critical business logic the auto-instrumentation misses
Sample in production — head-based (simple) or tail-based (captures errors/slow) sampling
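
Trace ID injection into logs can be sketched with the standard library alone. The `current_trace_id` context variable below is a hypothetical stand-in for the active span context; a real service would read the ID from its tracing SDK instead.

```python
import contextvars
import json
import logging

# Hypothetical holder for the active trace ID (stand-in for the SDK's span context).
current_trace_id = contextvars.ContextVar("current_trace_id", default="")

class TraceIdFilter(logging.Filter):
    """Attach the current trace ID to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        return True

class JsonFormatter(logging.Formatter):
    """Emit one compact JSON object per log line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "msg": record.getMessage(),
            "traceID": getattr(record, "trace_id", ""),
        }
        # Compact separators: no spaces after ':' or ','.
        return json.dumps(payload, separators=(",", ":"))

handler = logging.StreamHandler()
handler.addFilter(TraceIdFilter())
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

current_trace_id.set("5b8efff798038103d269b633813fc60c")
logger.info("order placed")
```

Whatever field name and spacing the formatter produces must agree with the pattern the log data source uses to extract trace IDs.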

Operations¶

Monitor the monitoring — deploy a separate, smaller "meta-monitoring" LGTM stack that monitors the primary stack
Use Grafana mixins — pre-built dashboards for Mimir, Loki, Tempo internals
Label governance — enforce labeling standards to prevent cardinality explosions
Test with load — use k6 with the Grafana extension to load-test the stack before production
GitOps everything — dashboards, alerts, data sources, and Helm values in version control

Common Issues & Playbook¶

Symptom	Likely Cause	Fix
"too many outstanding requests"	Query queue in the query frontend/scheduler is full	Scale queriers, reduce query fan-out, raise the per-tenant outstanding-request limit
"max streams limit reached" (Loki)	High label cardinality	Reduce label cardinality, drop high-cardinality labels in Alloy
"context deadline exceeded"	Slow object storage or oversized query	Enable caching, add query limits, check AZ placement
Exemplars not showing	Mimir not storing exemplars	Set `max_global_exemplars_per_user` above 0 in Mimir limits, verify app instrumentation
Trace-to-logs not working	Missing trace ID in logs	Verify OTel SDK injects `trace_id` into log output
Derived fields not clickable	Regex doesn't match	Test regex against actual log lines, verify Loki DS config
High memory on ingesters	WAL too large or too many active series/streams	Increase ingester memory, tune WAL flush interval
Slow TraceQL queries	Large time range or low selectivity	Narrow time range, add specific attribute filters
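
For the "Derived fields not clickable" row, it helps to run the regex against real log lines before touching Grafana. The sketch below uses the `matcherRegex` from this page's Loki data source; the sample log lines are invented.

```python
import re
from typing import Optional

# Same pattern as the Loki data source's matcherRegex on this page.
MATCHER = re.compile(r'"traceID":"(\w+)"')

def extract(line: str) -> Optional[str]:
    """Return the captured trace ID, or None when the regex does not match."""
    match = MATCHER.search(line)
    return match.group(1) if match else None

samples = [
    '{"msg":"ok","traceID":"5b8efff798038103d269b633813fc60c"}',    # compact JSON
    '{"msg": "ok", "traceID": "5b8efff798038103d269b633813fc60c"}',  # space after the colon
    'level=info trace_id=5b8efff798038103d269b633813fc60c msg=ok',   # logfmt
]
for line in samples:
    print(extract(line))
# 5b8efff798038103d269b633813fc60c
# None
# None
```

Only the first format matches, so a logger that pretty-prints JSON or writes logfmt needs a different pattern.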

Monitoring & Troubleshooting¶

Key Metrics to Monitor (Meta-Monitoring)¶

Component	Metric	What It Tells You
All	`*_request_duration_seconds`	Internal API latency
Ingesters	`*_ingester_memory_series` / `*_live_entries`	In-memory load
Distributors	`*_distributor_received_samples_total`	Ingestion throughput
Queriers	`*_querier_request_duration_seconds`	Query latency
Compactors	`*_compactor_runs_completed_total`	Compaction health
Object Storage	`*_thanos_objstore_bucket_operation_duration_seconds`	Storage latency

Grafana Mixins¶

Pre-built monitoring dashboards for each LGTM component:

Mimir: grafana/mimir → operations/mimir-mixin/
Loki: grafana/loki → production/loki-mixin/
Tempo: grafana/tempo → operations/tempo-mixin/

Commands & Recipes¶

Quick Start (All-in-One Docker)¶

# Start the entire LGTM stack in one container (dev/testing only)
docker run --name lgtm \
  -p 3000:3000 \
  -p 4317:4317 \
  -p 4318:4318 \
  --rm -ti grafana/otel-lgtm

# With persistent data
docker run --name lgtm \
  -v "$(pwd)/data:/data" \
  -p 3000:3000 \
  -p 4317:4317 \
  -p 4318:4318 \
  grafana/otel-lgtm

# Enable internal component logs
docker run --name lgtm \
  -e ENABLE_LOGS_ALL=true \
  -p 3000:3000 \
  -p 4317:4317 \
  -p 4318:4318 \
  grafana/otel-lgtm

Test it immediately — send a trace with curl:

curl -X POST http://localhost:4318/v1/traces \
  -H "Content-Type: application/json" \
  -d '{
    "resourceSpans": [{
      "resource": {"attributes": [{"key": "service.name", "value": {"stringValue": "test-service"}}]},
      "scopeSpans": [{
        "spans": [{
          "traceId": "5b8efff798038103d269b633813fc60c",
          "spanId": "eee19b7ec3c1b174",
          "name": "test-span",
          "kind": 1,
          "startTimeUnixNano": "1544712660000000000",
          "endTimeUnixNano": "1544712661000000000"
        }]
      }]
    }]
  }'
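
The curl example uses fixed timestamps from December 2018, so the span falls outside Grafana's default time range. As an alternative, the sketch below builds the same OTLP/HTTP JSON payload with current timestamps and random IDs. The function names are hypothetical, and the endpoint assumes the all-in-one container from above.

```python
import json
import secrets
import time
import urllib.request

def build_trace_payload(service: str, span_name: str, duration_s: float = 1.0) -> dict:
    """Build a one-span OTLP/JSON payload that ends at the current time."""
    end_ns = time.time_ns()
    start_ns = end_ns - int(duration_s * 1_000_000_000)
    return {
        "resourceSpans": [{
            "resource": {"attributes": [
                {"key": "service.name", "value": {"stringValue": service}},
            ]},
            "scopeSpans": [{
                "spans": [{
                    "traceId": secrets.token_hex(16),  # 32 hex characters
                    "spanId": secrets.token_hex(8),    # 16 hex characters
                    "name": span_name,
                    "kind": 1,
                    "startTimeUnixNano": str(start_ns),
                    "endTimeUnixNano": str(end_ns),
                }],
            }],
        }],
    }

def send(payload: dict, url: str = "http://localhost:4318/v1/traces") -> int:
    """POST the payload to an OTLP/HTTP endpoint and return the HTTP status."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status

payload = build_trace_payload("test-service", "test-span")
# Print the trace ID so it can be pasted into Grafana's Tempo search.
print(payload["resourceSpans"][0]["scopeSpans"][0]["spans"][0]["traceId"])
```

Calling `send(payload)` while the container is running posts the span; the printed trace ID can then be looked up in Grafana.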

Grafana Data Source Provisioning (Cross-Signal)¶

This is the most critical provisioning file for the LGTM stack — it wires up all cross-signal correlation:

# /etc/grafana/provisioning/datasources/lgtm.yaml
apiVersion: 1
datasources:
  # === METRICS (Mimir) ===
  - name: Mimir
    type: prometheus
    uid: mimir
    access: proxy
    url: http://mimir-query-frontend:8080/prometheus
    isDefault: true
    jsonData:
      httpMethod: POST
      exemplarTraceIdDestinations:
        - name: traceID
          datasourceUid: tempo

  # === LOGS (Loki) ===
  - name: Loki
    type: loki
    uid: loki
    access: proxy
    url: http://loki-gateway:3100
    jsonData:
      derivedFields:
        - datasourceUid: tempo
          matcherRegex: '"traceID":"(\w+)"'
          name: TraceID
          url: '$${__value.raw}'
          urlDisplayLabel: 'View Trace'

  # === TRACES (Tempo) ===
  - name: Tempo
    type: tempo
    uid: tempo
    access: proxy
    url: http://tempo-query-frontend:3200
    jsonData:
      tracesToLogsV2:
        datasourceUid: loki
        spanStartTimeShift: '-1h'
        spanEndTimeShift: '1h'
        tags:
          - key: service.name
            value: service_name
        filterByTraceID: true
        filterBySpanID: false
      tracesToMetrics:
        datasourceUid: mimir
        spanStartTimeShift: '-1h'
        spanEndTimeShift: '1h'
        tags:
          - key: service.name
            value: job
        queries:
          - name: 'Request Rate'
            query: 'sum(rate(traces_spanmetrics_calls_total{$$__tags}[5m]))'
          - name: 'Error Rate'
            query: 'sum(rate(traces_spanmetrics_calls_total{$$__tags,status_code="STATUS_CODE_ERROR"}[5m]))'
      tracesToProfiles:
        datasourceUid: pyroscope
        tags:
          - key: service.name
            value: service_name
        profileTypeId: 'process_cpu:cpu:nanoseconds:cpu:nanoseconds'
      serviceMap:
        datasourceUid: mimir
      nodeGraph:
        enabled: true

  # === PROFILES (Pyroscope) ===
  - name: Pyroscope
    type: grafana-pyroscope-datasource
    uid: pyroscope
    access: proxy
    url: http://pyroscope:4040

Alloy Configuration (Full LGTM Pipeline)¶

// config.alloy: LGTM pipeline for metrics, logs, and traces (profiles are not wired here)

// =============================================
// RECEIVERS
// =============================================

// OTLP receiver for traces, metrics, and logs
otelcol.receiver.otlp "default" {
  grpc { endpoint = "0.0.0.0:4317" }
  http { endpoint = "0.0.0.0:4318" }

  output {
    metrics = [otelcol.processor.memory_limiter.default.input]
    traces  = [otelcol.processor.memory_limiter.default.input]
    logs    = [otelcol.processor.memory_limiter.default.input]
  }
}

// Prometheus scrape for Kubernetes pods
prometheus.scrape "k8s_pods" {
  targets    = discovery.kubernetes.pods.targets
  forward_to = [prometheus.remote_write.mimir.receiver]
}

discovery.kubernetes "pods" {
  role = "pod"
}

// =============================================
// PROCESSORS
// =============================================

// The memory limiter runs first so it can refuse data before the batcher buffers it.
otelcol.processor.memory_limiter "default" {
  check_interval = "1s"
  limit          = "512MiB"

  output {
    metrics = [otelcol.processor.batch.default.input]
    traces  = [otelcol.processor.batch.default.input]
    logs    = [otelcol.processor.batch.default.input]
  }
}

otelcol.processor.batch "default" {
  timeout         = "5s"
  send_batch_size = 8192

  output {
    metrics = [otelcol.exporter.prometheus.mimir.input]
    traces  = [otelcol.exporter.otlp.tempo.input]
    logs    = [otelcol.exporter.loki.default.input]
  }
}

// =============================================
// EXPORTERS
// =============================================

// Metrics → Mimir
prometheus.remote_write "mimir" {
  endpoint {
    url = "http://mimir-distributor:8080/api/v1/push"
  }
}

otelcol.exporter.prometheus "mimir" {
  forward_to = [prometheus.remote_write.mimir.receiver]
}

// Traces → Tempo
otelcol.exporter.otlp "tempo" {
  client {
    endpoint = "tempo-distributor:4317"
    tls { insecure = true }
  }
}

// Logs → Loki
otelcol.exporter.loki "default" {
  forward_to = [loki.write.default.receiver]
}

loki.write "default" {
  endpoint {
    url = "http://loki-distributor:3100/loki/api/v1/push"
  }
}

Helm Values Snippets¶

Mimir (Key Production Settings)¶

# mimir-values.yaml (key settings only)
mimir:
  structuredConfig:
    common:
      storage:
        backend: s3
        s3:
          endpoint: s3.us-east-1.amazonaws.com
          bucket_name: observability-mimir-blocks
          region: us-east-1
    blocks_storage:
      tsdb:
        retention_period: 13h  # how long ingesters keep local TSDB blocks; not the long-term retention
      bucket_store:
        sync_interval: 15m
    limits:
      max_global_series_per_user: 1500000
      ingestion_rate: 100000
    ruler_storage:
      backend: s3
      s3:
        bucket_name: observability-mimir-ruler

ingester:
  replicas: 3
  resources:
    requests: { cpu: "1", memory: "4Gi" }
    limits:   { cpu: "2", memory: "8Gi" }
  persistentVolume:
    enabled: true
    size: 50Gi

querier:
  replicas: 2
  resources:
    requests: { cpu: "500m", memory: "2Gi" }

store_gateway:
  replicas: 2

compactor:
  replicas: 1

Loki (Key Production Settings)¶

# loki-values.yaml
loki:
  auth_enabled: true
  storage:
    type: s3
    s3:
      endpoint: s3.us-east-1.amazonaws.com
      bucketnames: observability-loki-chunks
      region: us-east-1
  schemaConfig:
    configs:
      - from: "2024-01-01"
        store: tsdb
        object_store: s3
        schema: v13
        index:
          prefix: loki_index_
          period: 24h
  limits_config:
    retention_period: 720h  # 30 days
    max_global_streams_per_user: 10000
    ingestion_rate_mb: 20
    per_stream_rate_limit: 5MB

ingester:
  replicas: 3

querier:
  replicas: 2

Tempo (Key Production Settings)¶

# tempo-values.yaml
tempo:
  multitenancyEnabled: true
  storage:
    trace:
      backend: s3
      s3:
        bucket: observability-tempo-traces
        endpoint: s3.us-east-1.amazonaws.com
        region: us-east-1
  metricsGenerator:
    enabled: true
    remoteWriteUrl: "http://mimir-distributor:8080/api/v1/push"
  global_overrides:
    defaults:
      metrics_generator:
        processors: [span-metrics, service-graphs]

ingester:
  replicas: 3

querier:
  replicas: 2

compactor:
  replicas: 1

OpenTelemetry SDK Quickstart¶

Java (Auto-Instrumentation)¶

# Download the OTel Java agent
curl -LO https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/latest/download/opentelemetry-javaagent.jar

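# Run your app with the agent
# Agent 2.x defaults to http/protobuf (port 4318), so set grpc explicitly for port 4317
java -javaagent:opentelemetry-javaagent.jar \
  -Dotel.service.name=my-service \
  -Dotel.exporter.otlp.protocol=grpc \
  -Dotel.exporter.otlp.endpoint=http://alloy:4317 \
  -jar my-app.jar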

Python (Auto-Instrumentation)¶

# Install OTel packages
pip install opentelemetry-distro opentelemetry-exporter-otlp

# Auto-install all detected instrumentation libraries
opentelemetry-bootstrap -a install

# Run your app with auto-instrumentation
OTEL_SERVICE_NAME=my-service \
OTEL_EXPORTER_OTLP_ENDPOINT=http://alloy:4317 \
opentelemetry-instrument python app.py

Go (Manual SDK)¶

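// Initialize OTel tracing in your Go app
import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    "go.opentelemetry.io/otel/sdk/resource"
    "go.opentelemetry.io/otel/sdk/trace"
    // Assumption: pick the semconv version that matches your SDK release.
    semconv "go.opentelemetry.io/otel/semconv/v1.26.0"
)

func initTracer(ctx context.Context) (*trace.TracerProvider, error) {
    // Export spans over OTLP/gRPC to Alloy (plaintext; in-cluster traffic only).
    exporter, err := otlptracegrpc.New(ctx,
        otlptracegrpc.WithEndpoint("alloy:4317"),
        otlptracegrpc.WithInsecure(),
    )
    if err != nil {
        return nil, err
    }
    tp := trace.NewTracerProvider(
        trace.WithBatcher(exporter),
        trace.WithResource(resource.NewWithAttributes(
            semconv.SchemaURL,
            semconv.ServiceNameKey.String("my-service"),
        )),
    )
    otel.SetTracerProvider(tp)
    // The caller should defer tp.Shutdown(ctx) to flush pending spans on exit.
    return tp, nil
}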

Environment Variables (All Languages)¶

# Universal OTel configuration via env vars
export OTEL_SERVICE_NAME=my-service
export OTEL_EXPORTER_OTLP_ENDPOINT=http://alloy:4317
export OTEL_EXPORTER_OTLP_PROTOCOL=grpc
export OTEL_RESOURCE_ATTRIBUTES="deployment.environment=production,k8s.namespace.name=default"
export OTEL_TRACES_SAMPLER=parentbased_traceidratio
export OTEL_TRACES_SAMPLER_ARG=0.1  # 10% sampling
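
The idea behind `parentbased_traceidratio` can be sketched in a few lines: follow the parent's decision when there is one, otherwise derive a deterministic decision from the trace ID. This is a simplified illustration; the exact hashing differs between SDK implementations, and the function names are hypothetical.

```python
from typing import Optional

def ratio_sample(trace_id_hex: str, ratio: float) -> bool:
    """Deterministic decision: the same trace ID always yields the same answer."""
    low_64_bits = int(trace_id_hex[16:], 16)  # last 16 hex chars of a 32-char ID
    return low_64_bits < ratio * 2**64

def parent_based(parent_sampled: Optional[bool], trace_id_hex: str, ratio: float) -> bool:
    """Follow the parent's decision when present; otherwise apply the ratio."""
    if parent_sampled is not None:
        return parent_sampled
    return ratio_sample(trace_id_hex, ratio)

trace_id = "5b8efff798038103d269b633813fc60c"
print(parent_based(None, trace_id, 0.1))   # False: root span, ID falls outside the 10% band
print(parent_based(True, trace_id, 0.1))   # True: the parent was sampled, so the child is too
```

Because the decision depends only on the trace ID, every service that sees the same trace reaches the same verdict, which keeps traces complete.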

Migration Recipes¶

Prometheus → Mimir (Add Long-Term Storage)¶

# Add to existing Prometheus config — zero-downtime migration
remote_write:
  - url: http://mimir-distributor:8080/api/v1/push
    headers:
      X-Scope-OrgID: default

Jaeger → Tempo (Trace Backend Swap)¶

# Tempo natively accepts Jaeger protocol
# Just re-point your Jaeger agents/collectors to Tempo's endpoint:
# Jaeger Thrift HTTP: tempo-distributor:14268
# Jaeger gRPC:        tempo-distributor:14250
# Or preferably, switch to OTLP: tempo-distributor:4317

Elasticsearch/Kibana → Loki/Grafana (Conceptual)¶

Deploy Loki alongside Elasticsearch
Configure Alloy to send logs to both Loki AND Elasticsearch (dual-write)
Rebuild critical Kibana dashboards in Grafana using LogQL
Validate data completeness and query parity
Cut over: stop writing to Elasticsearch
Decommission Elasticsearch after retention period expires
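
The "validate data completeness" step can be approached by comparing per-service log counts from both systems over the same window. The sketch below assumes the counts have already been fetched; the function name, the sample numbers, and the 1% tolerance are all hypothetical.

```python
def parity_report(es_counts: dict[str, int], loki_counts: dict[str, int],
                  tolerance: float = 0.01) -> dict[str, float]:
    """Return services whose relative count difference exceeds `tolerance`."""
    drift = {}
    for service, expected in es_counts.items():
        actual = loki_counts.get(service, 0)
        diff = abs(expected - actual) / max(expected, 1)
        if diff > tolerance:
            drift[service] = diff
    return drift

# Invented counts for the same one-hour window in both systems.
es = {"checkout": 1000, "search": 500}
loki = {"checkout": 998, "search": 450}
print(parity_report(es, loki))  # {'search': 0.1}
```

An empty report over several consecutive windows is a reasonable signal that the cut-over can proceed.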

Useful One-Liners¶

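# Check LGTM component health
# Ports follow the URLs used elsewhere on this page; adjust them to your chart values
for target in mimir-distributor:8080 loki-distributor:3100 tempo-distributor:3200; do
  echo "$target: $(curl -s "http://$target/ready")"
done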

# Query Mimir directly via curl
curl -s -H "X-Scope-OrgID: default" \
  "http://mimir-query-frontend:8080/prometheus/api/v1/query?query=up" | jq .

# Push a test log to Loki
curl -X POST -H "Content-Type: application/json" \
  -H "X-Scope-OrgID: default" \
  "http://loki-distributor:3100/loki/api/v1/push" \
  -d '{"streams":[{"stream":{"app":"test"},"values":[ ["'$(date +%s)000000000'","hello from curl"]]}]}'

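# Query Loki directly
# -G with --data-urlencode keeps curl from treating the braces in the selector as URL globbing
curl -s -G -H "X-Scope-OrgID: default" \
  "http://loki-query-frontend:3100/loki/api/v1/query_range" \
  --data-urlencode 'query={app="test"}' \
  --data-urlencode 'limit=10' | jq .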

# Check Tempo trace by ID
curl -s "http://tempo-query-frontend:3200/api/traces/5b8efff798038103d269b633813fc60c" | jq .
