Operations
Deployment & Typical Setup
Installation Methods
| Method | Recommended For | Notes |
| --- | --- | --- |
| Docker | Dev, CI, small prod | docker run -d -p 3000:3000 grafana/grafana-oss |
| Helm (Kubernetes) | Production | helm install grafana grafana/grafana |
| apt/yum (Linux) | Traditional servers | Official Grafana repo packages |
| macOS (Homebrew) | Local dev | brew install grafana |
| Binary | Air-gapped envs | Download from grafana.com/grafana/download |
| Grafana Cloud | Managed SaaS | Zero infrastructure overhead |
| AWS Managed Grafana | AWS-native teams | Per-editor/viewer pricing |
| Azure Managed Grafana | Azure-native teams | Resource-based pricing |
Typical Single-Node Setup
grafana.ini (or env vars)
├── [database] → SQLite (default) or PostgreSQL/MySQL
├── [server] → http_port=3000, domain, root_url
├── [security] → admin_password, secret_key
├── [auth] → SSO/LDAP/OAuth config
└── [paths] → data, logs, plugins, provisioning
Production-Grade Setup (Kubernetes)
- External PostgreSQL database (not SQLite)
- External Redis for session storage
- Multiple Grafana replicas behind a load balancer
- Dashboards and data sources managed via provisioning (YAML/JSON in ConfigMaps)
- Alert rules managed as code (provisioning YAML or Terraform)
- HPA (Horizontal Pod Autoscaler) for Grafana pods
- Ingress with TLS termination
Configuration & Optimal Tuning
Essential grafana.ini Settings
# === Database (CRITICAL for production) ===
[database]
type = postgres
host = postgres.internal:5432
name = grafana
user = grafana
password = ${GF_DATABASE_PASSWORD}
ssl_mode = require
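# === Remote cache (Redis, shared by all replicas in HA) ===
# Current Grafana releases keep login sessions in the main database;
# Redis is configured through [remote_cache], not a [sessions] section.
[remote_cache]
type = redis
connstr = addr=redis.internal:6379,pool_size=100,db=0,ssl=false
prefix = grafana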
# === Server ===
[server]
http_port = 3000
domain = grafana.example.com
root_url = https://grafana.YOUR_DOMAIN
serve_from_sub_path = false
# === Security ===
[security]
admin_password = ${GF_SECURITY_ADMIN_PASSWORD}
secret_key = ${GF_SECURITY_SECRET_KEY}
cookie_secure = true
cookie_samesite = lax
content_security_policy = true
strict_transport_security = true
# === Auth (Example: OAuth with Okta) ===
[auth.generic_oauth]
enabled = true
name = Okta
client_id = ${GF_AUTH_OKTA_CLIENT_ID}
client_secret = ${GF_AUTH_OKTA_CLIENT_SECRET}
scopes = openid profile email groups
auth_url = https://your-org.okta.com/oauth2/v1/authorize
token_url = https://your-org.okta.com/oauth2/v1/token
api_url = https://your-org.okta.com/oauth2/v1/userinfo
role_attribute_path = contains(groups[*], 'grafana-admins') && 'Admin' || contains(groups[*], 'grafana-editors') && 'Editor' || 'Viewer'
allow_sign_up = true
# === Alerting ===
[unified_alerting]
enabled = true
execute_alerts = true
# === Performance ===
[dataproxy]
timeout = 300
dial_timeout = 30
keep_alive_seconds = 30
[rendering]
concurrent_render_request_limit = 30
Environment Variable Override Pattern
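All grafana.ini settings can be overridden via environment variables using the pattern:
GF_<SECTION_NAME>_<KEY_NAME>
Section and key names are uppercased, and any dots or dashes in them become underscores.
Examples: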
- GF_DATABASE_TYPE=postgres
- GF_SECURITY_ADMIN_PASSWORD=supersecret
- GF_AUTH_GENERIC_OAUTH_ENABLED=true
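The naming rule can be checked mechanically. The following is a minimal sketch, assuming a POSIX shell; `gf_env_name` is a hypothetical helper name introduced here for illustration and is not part of Grafana.

```shell
# Derive the GF_* override name for a grafana.ini section and key.
# Usage: gf_env_name SECTION KEY
gf_env_name() {
  section=$1
  key=$2
  # Uppercase everything, then turn dots and dashes into underscores.
  printf 'GF_%s_%s\n' "$section" "$key" \
    | tr '[:lower:]' '[:upper:]' \
    | tr '.-' '__'
}
```

For example, `gf_env_name auth.generic_oauth enabled` prints `GF_AUTH_GENERIC_OAUTH_ENABLED`, matching the third example above.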
Reliability & Scaling
Horizontal Scaling Checklist
High Availability Architecture
flowchart TB
LB["Load Balancer<br/>(NGINX Ingress / ALB)"]
subgraph Grafana["Grafana Replicas"]
G1["Pod 1"]
G2["Pod 2"]
G3["Pod 3"]
end
PG["PostgreSQL<br/>(HA: RDS / CloudSQL)"]
Redis["Redis<br/>(HA: ElastiCache)"]
LB --> G1
LB --> G2
LB --> G3
G1 --> PG
G2 --> PG
G3 --> PG
G1 --> Redis
G2 --> Redis
G3 --> Redis
style LB fill:#ff6600,color:#fff
style Grafana fill:#2a2d3e,color:#fff
style PG fill:#2a7de1,color:#fff
style Redis fill:#e65100,color:#fff
Scaling the LGTM Backends
| Component | Scale Strategy | Key Metric |
| --- | --- | --- |
| Mimir Ingesters | Add replicas | Active series count |
| Mimir Queriers | Add replicas | Query latency p99 |
| Loki Ingesters | Add replicas | Log ingestion rate (bytes/sec) |
| Loki Queriers | Add replicas | LogQL query latency |
| Tempo Ingesters | Add replicas | Spans/sec |
| Alloy | DaemonSet (1 per node) | Automatic |
Cost
Self-Hosted Cost Factors
| Factor | Driver | Optimization |
| --- | --- | --- |
| Compute | Number of backend pods | Right-size resources, use spot/preemptible nodes |
| Object Storage | Data retention × ingestion rate | Set retention policies, use lifecycle rules, compress |
| Database | PostgreSQL instance size | Start small, scale with usage |
| Network | Cross-AZ / cross-region traffic | Co-locate components in same AZ, use VPC endpoints |
Grafana Cloud Pricing Summary (2026)
| Tier | Base Cost | Included | Billing Model |
| --- | --- | --- | --- |
| Free | $0 | 10k active metrics series, 50 GB logs/traces | N/A |
| Pro | $19/mo platform fee | Base allowances | Usage-based (per series, per GB) |
| Enterprise | $25k+/yr | Volume discounts, enhanced SLAs | Annual commitment |
Cost Comparison: Self-Hosted vs Cloud
For a typical mid-size setup (500k active series, 100 GB/day logs, 50M spans/day):
| Model | Estimated Monthly Cost | Trade-off |
| --- | --- | --- |
| Self-hosted (K8s) | $500–2,000 | Full control, higher ops burden |
| Grafana Cloud Pro | $1,000–3,000 | Managed, lower ops burden |
| Datadog equivalent | $5,000–15,000 | Fully managed, highest cost |
Costs are approximate and vary significantly by cloud provider and configuration.
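The object-storage driver listed earlier (data retention × ingestion rate) is simple arithmetic. The sketch below applies it to the 100 GB/day log volume from the example setup. The 30-day retention and the 10:1 compression ratio are illustrative assumptions, not figures from this guide, and `retained_gb` is a hypothetical helper name.

```shell
# Estimate steady-state storage for a signal.
# Usage: retained_gb DAILY_GB RETENTION_DAYS [COMPRESSION_RATIO]
retained_gb() {
  daily_gb=$1
  retention_days=$2
  ratio=${3:-1}   # assumed compression ratio; defaults to no compression
  # Integer arithmetic: ingestion rate x retention, divided by compression.
  echo $(( daily_gb * retention_days / ratio ))
}
```

`retained_gb 100 30` prints `3000` (GB of raw logs held at any time), and `retained_gb 100 30 10` prints `300` under the assumed 10:1 compression. Multiplying the result by your provider's per-GB price gives a first-order storage cost.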
Security
Authentication Hardening
- Disable basic auth in production — use SSO (OAuth 2.0 / SAML)
- Enforce MFA via your identity provider (Okta, Azure AD, Google)
- Disable anonymous access ([auth.anonymous] enabled = false)
- Disable self-registration ([users] allow_sign_up = false)
- Set session timeouts (login_maximum_lifetime_duration = 12h)
- Use HTTPS/TLS for all traffic
- Enable CSRF protection (enabled by default)
- Set Content Security Policy headers
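Several items in this list map directly onto grafana.ini keys, so they can be applied as GF_* environment overrides. The block below is a sketch of that mapping rather than a complete hardening profile; the section for each key is noted in the trailing comment.

```shell
# Hardening settings expressed as GF_* overrides (section and key in comments).
export GF_AUTH_ANONYMOUS_ENABLED=false               # [auth.anonymous] enabled
export GF_USERS_ALLOW_SIGN_UP=false                  # [users] allow_sign_up
export GF_AUTH_LOGIN_MAXIMUM_LIFETIME_DURATION=12h   # [auth] login_maximum_lifetime_duration
export GF_AUTH_BASIC_ENABLED=false                   # [auth.basic] enabled
export GF_SECURITY_COOKIE_SECURE=true                # [security] cookie_secure
export GF_SECURITY_CONTENT_SECURITY_POLICY=true      # [security] content_security_policy
```

Confirm that SSO login works before turning off basic auth. With `GF_AUTH_BASIC_ENABLED=false`, `curl -u user:password` calls stop working and automation has to use bearer tokens instead.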
RBAC Best Practices
| Role | Permissions | Who |
| --- | --- | --- |
| Viewer | View dashboards, explore data | Most users |
| Editor | Create/edit dashboards, create alerts | Team leads, SREs |
| Admin | Manage org, users, data sources | Org administrators |
| Grafana Admin | System-wide access | Platform team only (minimize!) |
- Use Teams synced with your IdP groups for permission management
- Use data source permissions to restrict which teams can query which backends
- Use proxy mode for data sources to avoid exposing backend credentials to browsers
- Enterprise/Cloud: Use custom roles for fine-grained permissions (e.g., "can edit dashboards in folder X but not Y")
LDAP/SAML Hardening
- Always use TLS/SSL for LDAP connections
- Use a dedicated service account with read-only permissions for LDAP binding
- Verify certificates (ssl_skip_verify = false)
- Set minimum TLS version to 1.2+
- Enable SAML request signing for integrity
Secrets Management
- Never hardcode secrets in grafana.ini; use environment variables or a secrets manager (Vault, AWS KMS)
- Use Kubernetes Secrets (or ExternalSecrets Operator) to inject credentials
- Use read-only database users for data source connections
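When secrets arrive as environment variables, a startup wrapper can refuse to launch Grafana if any of them is missing. A minimal sketch, assuming a POSIX shell; `require_env` is a hypothetical helper, and the variable names are the ones referenced from the grafana.ini example earlier in this guide.

```shell
# Return non-zero (and report on stderr) if any named variable is unset or empty.
# Usage: require_env VAR_NAME...
require_env() {
  missing=0
  for name in "$@"; do
    # Indirect lookup of the variable called $name; "-" keeps this safe under set -u.
    eval "value=\${$name-}"
    if [ -z "$value" ]; then
      echo "missing required variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

A container entrypoint could run `require_env GF_DATABASE_PASSWORD GF_SECURITY_ADMIN_PASSWORD GF_SECURITY_SECRET_KEY || exit 1` before starting the server, so a misconfigured Secret fails fast instead of falling back to defaults.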
Best Practices
Dashboard Governance
- Use folders to organize dashboards by team/domain
- Use provisioning for infrastructure-critical dashboards (prevents manual drift)
- Set ownership: every dashboard should have a clear owner/team
- Review cadence: quarterly review of all dashboards, archive unused ones
- Naming conventions: prefix dashboards with team or domain (e.g., [infra] Node Overview)
- Template variables: use for environment, region, service filtering
- Max panels per dashboard: aim for 8–12 (overview) or 15–20 (detailed)
Query Optimization
- Filter early: use precise label selectors in PromQL/LogQL
- Avoid high cardinality: don't use user IDs, IP addresses, or request paths as labels
- Use recording rules: precompute expensive PromQL queries in Mimir/Prometheus
- Set Max Data Points: prevent over-fetching (10k points for a 1k-pixel graph wastes resources)
- Optimize refresh intervals: avoid < 10s unless truly needed
- Use $__interval: let Grafana auto-calculate appropriate step size
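The relationship between time range, Max Data Points, and step size can be approximated as range divided by points, with a floor at the data source's minimum interval. The sketch below shows only that arithmetic; Grafana additionally rounds the result to a human-friendly step, which is not modeled here. `approx_interval_seconds` is a hypothetical helper name.

```shell
# Approximate the query step for a panel.
# Usage: approx_interval_seconds RANGE_SECONDS MAX_DATA_POINTS [MIN_INTERVAL_SECONDS]
approx_interval_seconds() {
  range=$1
  max_points=$2
  min_interval=${3:-1}
  # Ceiling division, so the number of returned points never exceeds max_points.
  step=$(( (range + max_points - 1) / max_points ))
  # Never go below the minimum interval (e.g. the scrape interval).
  if [ "$step" -lt "$min_interval" ]; then
    step=$min_interval
  fi
  echo "$step"
}
```

`approx_interval_seconds 86400 1000` prints `87`: a 24-hour panel limited to 1,000 points needs about one sample every 87 seconds. Raising the limit to 10,000 points with a 15-second minimum (`approx_interval_seconds 86400 10000 15`) prints `15`, because the computed 9-second step is clamped to the floor.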
Infrastructure
- Monitor Grafana with Grafana: use kube-prometheus-stack to monitor the monitoring
- Set resource limits: define CPU/memory requests and limits in Kubernetes
- Use immutable images: pre-install plugins in custom Docker images instead of runtime installs
- Backup the database: automated PostgreSQL backups with PITR
- Audit logs: enable for compliance (Enterprise feature)
Common Issues & Playbook
| Symptom | Likely Cause | Fix |
| --- | --- | --- |
| Dashboard loads slowly | Expensive queries or too many panels | Use Query Inspector, add recording rules, reduce panel count |
| "Data source is not available" | Connection issue or misconfigured URL | Check network, verify URL in data source settings, check proxy mode |
| Alerts not firing | Evaluator not running or contact point misconfigured | Check [unified_alerting] is enabled, verify contact point with Test |
| Login loop / session issues | SQLite under HA or missing Redis config | Switch to PostgreSQL + Redis for sessions |
| Plugin not loading | Unsigned plugin or missing signature | Set allow_loading_unsigned_plugins or sign the plugin |
| High memory on Grafana process | Too many concurrent dashboard viewers | Scale horizontally, reduce auto-refresh intervals |
| "database is locked" | SQLite with multiple replicas | Switch to PostgreSQL/MySQL immediately |
Monitoring & Troubleshooting
Key Grafana Metrics to Monitor
Grafana exposes Prometheus metrics at /metrics:
| Metric | What It Tells You |
| --- | --- |
| grafana_http_request_duration_seconds | API request latency |
| grafana_alerting_rule_evaluations_total | Alert evaluation throughput |
| grafana_alerting_rule_evaluation_failures_total | Alert evaluation errors |
| grafana_proxy_request_duration_seconds | Data source proxy latency |
| grafana_stat_totals | Total dashboards, users, orgs |
| grafana_active_user_sessions | Current active sessions |
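For a quick look without a dashboard, the /metrics text can be summed per metric name in a shell pipeline. This is a minimal sketch, assuming the standard Prometheus text exposition format with no trailing timestamps on sample lines; `sum_metric` is a hypothetical helper name.

```shell
# Sum every sample of one metric, across all label sets, from exposition text on stdin.
# Usage: curl -s "$GRAFANA_URL/metrics" | sum_metric METRIC_NAME
sum_metric() {
  awk -v name="$1" '
    /^#/ { next }                     # skip HELP and TYPE comment lines
    {
      metric = $1
      sub(/[{].*/, "", metric)        # drop the label set, keep the bare name
      if (metric == name) total += $NF
    }
    END { printf "%d\n", total }
  '
}
```

Comparing `sum_metric grafana_alerting_rule_evaluation_failures_total` with `sum_metric grafana_alerting_rule_evaluations_total` a few minutes apart gives a rough failure ratio for alert evaluation.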
- Query Inspector: built-in panel tool to see raw query, response time, and data
- Grafana Server Logs: grafana-server.log or stdout in containers
- API Explorer: /api/ endpoints for programmatic inspection
- Provisioning debug: watch mode logs file-change detection events
- Alloy Debug UI: http://localhost:12345 for the real-time pipeline graph and health
Commands & Recipes
Installation
Docker (Quick Start)
# Latest OSS version
docker run -d --name grafana \
-p 3000:3000 \
-v grafana-data:/var/lib/grafana \
grafana/grafana-oss:latest
# With environment variable overrides
docker run -d --name grafana \
-p 3000:3000 \
-e GF_SECURITY_ADMIN_PASSWORD=mysecretpassword \
-e GF_INSTALL_PLUGINS=grafana-clock-panel,grafana-piechart-panel \
-e GF_DATABASE_TYPE=postgres \
-e GF_DATABASE_HOST=postgres:5432 \
-e GF_DATABASE_NAME=grafana \
-e GF_DATABASE_USER=grafana \
-e GF_DATABASE_PASSWORD=dbpassword \
grafana/grafana-oss:latest
Docker Compose (LGTM Stack)
# docker-compose.yml: Minimal LGTM stack for development
version: '3.8'
services:
  grafana:
    image: grafana/grafana-oss:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - grafana-data:/var/lib/grafana
      - ./provisioning:/etc/grafana/provisioning
  mimir:
    image: grafana/mimir:latest
    command: ["-config.file=/etc/mimir/config.yaml", "-target=all"]
    ports:
      - "9009:9009"
    volumes:
      - ./mimir-config.yaml:/etc/mimir/config.yaml
  loki:
    image: grafana/loki:latest
    ports:
      - "3100:3100"
    command: ["-config.file=/etc/loki/local-config.yaml"]
  tempo:
    image: grafana/tempo:latest
    ports:
      - "3200:3200" # Tempo API
      - "4317:4317" # OTLP gRPC
      - "4318:4318" # OTLP HTTP
    command: ["-config.file=/etc/tempo/config.yaml"]
    volumes:
      - ./tempo-config.yaml:/etc/tempo/config.yaml
  alloy:
    image: grafana/alloy:latest
    ports:
      - "12345:12345" # Debug UI
    volumes:
      - ./alloy-config.alloy:/etc/alloy/config.alloy
    command: ["run", "/etc/alloy/config.alloy"]
volumes:
  grafana-data:
Helm (Kubernetes Production)
# Add Grafana Helm repo
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
# Install Grafana
# The chart's values key is literally "grafana.ini", so its dot is escaped (grafana\.ini) in --set paths.
helm install grafana grafana/grafana \
  --namespace monitoring \
  --create-namespace \
  --set replicas=3 \
  --set persistence.enabled=false \
  --set "grafana\.ini.database.type=postgres" \
  --set "grafana\.ini.database.host=postgres.internal:5432" \
  --set "grafana\.ini.database.name=grafana" \
  --set "grafana\.ini.database.user=grafana" \
  --set "grafana\.ini.database.password=${DB_PASSWORD}" \
  --set "grafana\.ini.server.root_url=https://grafana.YOUR_DOMAIN"
# Install Mimir (distributed mode)
helm install mimir grafana/mimir-distributed \
--namespace monitoring \
-f mimir-values.yaml
# Install Loki
helm install loki grafana/loki \
--namespace monitoring \
-f loki-values.yaml
# Install Tempo
helm install tempo grafana/tempo-distributed \
--namespace monitoring \
-f tempo-values.yaml
# Install Alloy (DaemonSet)
helm install alloy grafana/alloy \
--namespace monitoring \
-f alloy-values.yaml
Homebrew (macOS)
brew install grafana
brew services start grafana
# Access at http://localhost:3000 (admin/admin)
grafana-cli Commands
Plugin Management
# Install a plugin
grafana-cli plugins install grafana-piechart-panel
# Install specific version
grafana-cli plugins install grafana-piechart-panel 1.6.4
# List installed plugins
grafana-cli plugins ls
# List all available plugins
grafana-cli plugins list-remote
# Update all plugins
grafana-cli plugins update-all
# Remove a plugin
grafana-cli plugins remove grafana-piechart-panel
# Install from a custom URL (private/unsigned)
grafana-cli --pluginUrl https://YOUR_DOMAIN/plugin.zip plugins install custom-plugin
Admin Commands
# Reset admin password
grafana-cli admin reset-admin-password MyNewPassword
# Reset admin password (custom install path)
grafana-cli --homepath /usr/share/grafana \
--config /etc/grafana/grafana.ini \
admin reset-admin-password MyNewPassword
# Encrypt data source passwords (migrate to secure_json_data)
grafana-cli admin data-migration encrypt-datasource-passwords
# Show Grafana version
grafana-cli --version
# Generate a random value for [security] secret_key (uses OpenSSL rather than grafana-cli)
openssl rand -base64 32
API Recipes
Authentication
# API key auth
curl -H "Authorization: Bearer YOUR_API_KEY" \
https://grafana.YOUR_DOMAIN/api/dashboards/home
# Basic auth
curl -u admin:password \
https://grafana.YOUR_DOMAIN/api/org
# Service account token (recommended for automation)
curl -H "Authorization: Bearer sa-token-xxx" \
https://grafana.YOUR_DOMAIN/api/dashboards/home
Dashboard Operations
# List all dashboards
curl -s -H "Authorization: Bearer $TOKEN" \
"$GRAFANA_URL/api/search?type=dash-db" | jq '.[] | {title, uid, url}'
# Get dashboard by UID
curl -s -H "Authorization: Bearer $TOKEN" \
"$GRAFANA_URL/api/dashboards/uid/YOUR_DASHBOARD_UID" | jq .
# Export dashboard JSON (for backup/provisioning)
curl -s -H "Authorization: Bearer $TOKEN" \
"$GRAFANA_URL/api/dashboards/uid/YOUR_DASHBOARD_UID" \
| jq '.dashboard' > dashboard-export.json
# Import/create dashboard
curl -X POST -H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"dashboard": '"$(cat dashboard-export.json)"',
"overwrite": true,
"folderId": 0
}' "$GRAFANA_URL/api/dashboards/db"
# Delete dashboard
curl -X DELETE -H "Authorization: Bearer $TOKEN" \
"$GRAFANA_URL/api/dashboards/uid/YOUR_DASHBOARD_UID"
Data Source Operations
# List all data sources
curl -s -H "Authorization: Bearer $TOKEN" \
"$GRAFANA_URL/api/datasources" | jq '.[] | {name, type, url}'
# Create a Prometheus data source
curl -X POST -H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"name": "Prometheus",
"type": "prometheus",
"url": "http://prometheus:9090",
"access": "proxy",
"isDefault": true
}' "$GRAFANA_URL/api/datasources"
# Test data source connectivity
curl -s -H "Authorization: Bearer $TOKEN" \
"$GRAFANA_URL/api/datasources/proxy/1/api/v1/query?query=up"
Alert Operations
# List all alert rules
curl -s -H "Authorization: Bearer $TOKEN" \
"$GRAFANA_URL/api/v1/provisioning/alert-rules" | jq .
# Get alert rule by UID
curl -s -H "Authorization: Bearer $TOKEN" \
"$GRAFANA_URL/api/v1/provisioning/alert-rules/RULE_UID"
# List contact points
curl -s -H "Authorization: Bearer $TOKEN" \
"$GRAFANA_URL/api/v1/provisioning/contact-points" | jq .
# List notification policies
curl -s -H "Authorization: Bearer $TOKEN" \
"$GRAFANA_URL/api/v1/provisioning/policies" | jq .
User & Org Management
# List all users
curl -s -H "Authorization: Bearer $TOKEN" \
"$GRAFANA_URL/api/org/users" | jq '.[] | {login, role}'
# Create a service account
curl -X POST -H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{"name": "automation-sa", "role": "Editor"}' \
"$GRAFANA_URL/api/serviceaccounts"
# Create service account token
curl -X POST -H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{"name": "ci-token"}' \
"$GRAFANA_URL/api/serviceaccounts/1/tokens"
Provisioning Recipes
Data Source Provisioning (YAML)
# /etc/grafana/provisioning/datasources/datasources.yaml
# Each data source sets an explicit uid so the datasourceUid cross-references resolve.
apiVersion: 1
datasources:
  - name: Mimir
    uid: mimir
    type: prometheus
    access: proxy
    url: http://mimir-query-frontend:8080/prometheus
    isDefault: true
    jsonData:
      httpMethod: POST
      exemplarTraceIdDestinations:
        - name: traceID
          datasourceUid: tempo
  - name: Loki
    uid: loki
    type: loki
    access: proxy
    url: http://loki-gateway:3100
    jsonData:
      derivedFields:
        - datasourceUid: tempo
          matcherRegex: '"traceID":"(\w+)"'
          name: TraceID
          url: '$${__value.raw}'
  - name: Tempo
    uid: tempo
    type: tempo
    access: proxy
    url: http://tempo-query-frontend:3200
    jsonData:
      tracesToMetrics:
        datasourceUid: mimir
      tracesToLogs:
        datasourceUid: loki
        tags: ['job', 'namespace', 'pod']
Dashboard Provider (YAML)
# /etc/grafana/provisioning/dashboards/provider.yaml
apiVersion: 1
providers:
  - name: 'Infrastructure'
    orgId: 1
    folder: 'Infrastructure'
    type: file
    editable: false
    options:
      path: /etc/grafana/provisioning/dashboards/infra
      foldersFromFilesStructure: true
Alert Rule Provisioning (YAML)
# /etc/grafana/provisioning/alerting/rules.yaml
apiVersion: 1
groups:
  - orgId: 1
    name: infrastructure-alerts
    folder: Infrastructure Alerts
    interval: 1m
    rules:
      - uid: high-cpu-alert
        title: High CPU Usage
        condition: A
        data:
          - refId: A
            relativeTimeRange:
              from: 600
              to: 0
            datasourceUid: mimir
            model:
              expr: '100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80'
              refId: A
        for: 5m
        labels:
          severity: warning
          team: infra
        annotations:
          summary: "CPU usage above 80% on {{ $labels.instance }}"
# /etc/grafana/provisioning/alerting/contactpoints.yaml
apiVersion: 1
contactPoints:
  - orgId: 1
    name: slack-oncall
    receivers:
      - uid: slack-receiver
        type: slack
        settings:
          recipient: '#alerts-oncall'
          token: '${SLACK_BOT_TOKEN}'
          title: '{{ template "slack.default.title" . }}'
          text: '{{ template "slack.default.text" . }}'
Grafana Alloy Configuration
Basic OTel Pipeline (Alloy Syntax)
// config.alloy — Basic OTLP receiver → Grafana Cloud
// OTLP Receiver (gRPC + HTTP)
otelcol.receiver.otlp "default" {
grpc { endpoint = "0.0.0.0:4317" }
http { endpoint = "0.0.0.0:4318" }
output {
metrics = [otelcol.processor.batch.default.input]
logs = [otelcol.processor.batch.default.input]
traces = [otelcol.processor.batch.default.input]
}
}
// Batch Processor
otelcol.processor.batch "default" {
output {
metrics = [otelcol.exporter.otlphttp.grafana_cloud.input]
logs = [otelcol.exporter.otlphttp.grafana_cloud.input]
traces = [otelcol.exporter.otlphttp.grafana_cloud.input]
}
}
// Export to Grafana Cloud
otelcol.exporter.otlphttp "grafana_cloud" {
client {
endpoint = env("GRAFANA_CLOUD_OTLP_ENDPOINT")
auth = otelcol.auth.basic.grafana_cloud.handler
}
}
otelcol.auth.basic "grafana_cloud" {
username = env("GRAFANA_CLOUD_INSTANCE_ID")
password = env("GRAFANA_CLOUD_API_KEY")
}
Prometheus Scraping (Alloy Syntax)
// Scrape Prometheus endpoints
prometheus.scrape "kubernetes_pods" {
targets = discovery.kubernetes.pods.targets
forward_to = [prometheus.remote_write.mimir.receiver]
}
// Kubernetes pod discovery
discovery.kubernetes "pods" {
role = "pod"
}
// Remote write to Mimir
prometheus.remote_write "mimir" {
endpoint {
url = "http://mimir-distributor:8080/api/v1/push"
}
}
Grafana Provider Setup
terraform {
required_providers {
grafana = {
source = "grafana/grafana"
version = ">= 3.0.0"
}
}
}
provider "grafana" {
url = "https://grafana.YOUR_DOMAIN"
auth = var.grafana_api_key
}
resource "grafana_dashboard" "node_overview" {
config_json = file("${path.module}/dashboards/node-overview.json")
folder = grafana_folder.infrastructure.id
overwrite = true
}
resource "grafana_folder" "infrastructure" {
title = "Infrastructure"
}
resource "grafana_data_source" "prometheus" {
type = "prometheus"
name = "Mimir"
url = "http://mimir-query-frontend:8080/prometheus"
is_default = true
json_data_encoded = jsonencode({
httpMethod = "POST"
})
}
Useful One-Liners
# Backup all dashboards via API
for uid in $(curl -s -H "Authorization: Bearer $TOKEN" "$GRAFANA_URL/api/search?type=dash-db" | jq -r '.[].uid'); do
curl -s -H "Authorization: Bearer $TOKEN" "$GRAFANA_URL/api/dashboards/uid/$uid" \
| jq '.dashboard' > "backup-$uid.json"
done
# Check Grafana health
curl -s "$GRAFANA_URL/api/health" | jq .
# Get Grafana build info
curl -s "$GRAFANA_URL/api/frontend/settings" | jq '.buildInfo'
# Count dashboards per folder
curl -s -H "Authorization: Bearer $TOKEN" "$GRAFANA_URL/api/search?type=dash-db" \
| jq 'group_by(.folderTitle) | map({folder: .[0].folderTitle, count: length})'
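# Count dashboards whose sortMeta value is older than 90 days (requires admin)
# Caveat, worth verifying on your version: sortMeta is typically 0 unless the search
# uses a sort option that reports a value, in which case every dashboard matches.
curl -s -H "Authorization: Bearer $TOKEN" "$GRAFANA_URL/api/search?type=dash-db" \
  | jq '[.[] | select(.sortMeta < (now - 7776000))] | length'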