Operations

Deployment & Typical Setup

Installation Methods

| Method | Recommended For | Notes |
|---|---|---|
| Docker | Dev, CI, small prod | docker run -d -p 3000:3000 grafana/grafana-oss |
| Helm (Kubernetes) | Production | helm install grafana grafana/grafana |
| apt/yum (Linux) | Traditional servers | Official Grafana repo packages |
| macOS (Homebrew) | Local dev | brew install grafana |
| Binary | Air-gapped envs | Download from grafana.com/grafana/download |
| Grafana Cloud | Managed SaaS | Zero infrastructure overhead |
| AWS Managed Grafana | AWS-native teams | Per-editor/viewer pricing |
| Azure Managed Grafana | Azure-native teams | Resource-based pricing |

Typical Single-Node Setup

grafana.ini (or env vars)
├── [database] → SQLite (default) or PostgreSQL/MySQL
├── [server] → http_port=3000, domain, root_url
├── [security] → admin_password, secret_key
├── [auth] → SSO/LDAP/OAuth config
└── [paths] → data, logs, plugins, provisioning
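
A single-node stack like the one above can be run with a minimal Docker Compose file. This is a sketch only; the image tag, domain, and volume name are illustrative:

```yaml
# docker-compose.yml — minimal single-node Grafana (SQLite, local volume)
services:
  grafana:
    image: grafana/grafana-oss:latest   # pin a specific tag in practice
    ports:
      - "3000:3000"
    environment:
      GF_SERVER_DOMAIN: grafana.example.com          # placeholder domain
      GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_ADMIN_PASSWORD}
    volumes:
      - grafana-data:/var/lib/grafana   # persists SQLite DB, plugins, dashboards
volumes:
  grafana-data:
```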

Production-Grade Setup (Kubernetes)

  1. External PostgreSQL database (not SQLite)
  2. External Redis for session storage
  3. Multiple Grafana replicas behind a load balancer
  4. Dashboards and data sources managed via provisioning (YAML/JSON in ConfigMaps)
  5. Alert rules managed as code (provisioning YAML or Terraform)
  6. HPA (Horizontal Pod Autoscaler) for Grafana pods
  7. Ingress with TLS termination

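The checklist above maps roughly onto a values.yaml for the official grafana/grafana Helm chart. Key names follow the chart's documented schema, but verify them against your chart version:

```yaml
# values.yaml sketch — production-grade Grafana on Kubernetes
replicas: 3                      # item 3: multiple replicas behind the LB
env:
  GF_DATABASE_TYPE: postgres     # item 1: external PostgreSQL
  GF_DATABASE_HOST: postgres.internal:5432
envFromSecret: grafana-db-credentials   # DB password injected from a Secret
ingress:                         # item 7: Ingress with TLS termination
  enabled: true
  hosts:
    - grafana.example.com
  tls:
    - secretName: grafana-tls
      hosts:
        - grafana.example.com
autoscaling:                     # item 6: HPA for Grafana pods
  enabled: true
  minReplicas: 3
  maxReplicas: 6
sidecar:                         # item 4: dashboards from labeled ConfigMaps
  dashboards:
    enabled: true
```
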
Configuration & Optimal Tuning

Essential grafana.ini Settings

# === Database (CRITICAL for production) ===
[database]
type = postgres
host = postgres.internal:5432
name = grafana
user = grafana
password = ${GF_DATABASE_PASSWORD}
ssl_mode = require

# === Remote cache (recommended for HA) ===
# Note: the legacy [session] Redis store was removed in Grafana 6.2;
# sessions now live in the database, and Redis is configured via [remote_cache].
[remote_cache]
type = redis
connstr = addr=redis.internal:6379,pool_size=100,db=0

# === Server ===
[server]
http_port = 3000
domain = grafana.example.com
root_url = https://grafana.example.com
serve_from_sub_path = false

# === Security ===
[security]
admin_password = ${GF_SECURITY_ADMIN_PASSWORD}
secret_key = ${GF_SECURITY_SECRET_KEY}
cookie_secure = true
cookie_samesite = lax
content_security_policy = true
strict_transport_security = true

# === Auth (Example: OAuth with Okta) ===
[auth.generic_oauth]
enabled = true
name = Okta
client_id = ${GF_AUTH_OKTA_CLIENT_ID}
client_secret = ${GF_AUTH_OKTA_CLIENT_SECRET}
scopes = openid profile email groups
auth_url = https://your-org.okta.com/oauth2/v1/authorize
token_url = https://your-org.okta.com/oauth2/v1/token
api_url = https://your-org.okta.com/oauth2/v1/userinfo
role_attribute_path = contains(groups[*], 'grafana-admins') && 'Admin' || contains(groups[*], 'grafana-editors') && 'Editor' || 'Viewer'
allow_sign_up = true

# === Alerting ===
[unified_alerting]
enabled = true
execute_alerts = true

# === Performance ===
[dataproxy]
timeout = 300
dial_timeout = 30
keep_alive_seconds = 30

[rendering]
concurrent_render_request_limit = 30

Environment Variable Override Pattern

All grafana.ini settings can be overridden via environment variables using the pattern:

GF_<SECTION>_<KEY>=value

Examples:

  • GF_DATABASE_TYPE=postgres
  • GF_SECURITY_ADMIN_PASSWORD=supersecret
  • GF_AUTH_GENERIC_OAUTH_ENABLED=true
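
The mapping can be sketched as a small shell helper; gf_env_name is a hypothetical function for illustration (note that dots in section names, as in auth.generic_oauth, become underscores):

```shell
# Derive the GF_<SECTION>_<KEY> env var name for any grafana.ini setting.
gf_env_name() {
  section="$1"; key="$2"
  # Join, replace dots with underscores, then uppercase.
  printf 'GF_%s_%s\n' "$section" "$key" | tr '.' '_' | tr '[:lower:]' '[:upper:]'
}

gf_env_name "auth.generic_oauth" "enabled"   # GF_AUTH_GENERIC_OAUTH_ENABLED
gf_env_name "database" "type"                # GF_DATABASE_TYPE
```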

Reliability & Scaling

Horizontal Scaling Checklist

  • Switch from SQLite to PostgreSQL or MySQL
  • Configure Redis/Memcached via [remote_cache] (user sessions themselves are stored in the shared database)
  • Set replicas: 3+ in Helm values
  • Enable HPA with CPU/memory targets
  • Use Ingress with TLS termination
  • Provision dashboards and data sources via ConfigMaps or sidecar
  • Set editable: false on provisioned dashboards to prevent drift
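
For the provisioning and editable: false items, a file-based dashboard provider looks like this. allowUiUpdates: false is what makes file-provisioned dashboards read-only in the UI; the folder and path are placeholders:

```yaml
# provisioning/dashboards/default.yaml
apiVersion: 1
providers:
  - name: infra-dashboards
    folder: Infra                 # Grafana folder the dashboards land in
    type: file
    allowUiUpdates: false         # prevents drift: UI edits are rejected
    options:
      path: /var/lib/grafana/dashboards   # mount your ConfigMap here
```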

High Availability Architecture

flowchart TB
    LB["Load Balancer<br/>(NGINX Ingress / ALB)"]

    subgraph Grafana["Grafana Replicas"]
        G1["Pod 1"]
        G2["Pod 2"]
        G3["Pod 3"]
    end

    PG["PostgreSQL<br/>(HA: RDS / CloudSQL)"]
    Redis["Redis<br/>(HA: ElastiCache)"]

    LB --> G1
    LB --> G2
    LB --> G3
    G1 --> PG
    G2 --> PG
    G3 --> PG
    G1 --> Redis
    G2 --> Redis
    G3 --> Redis

    style LB fill:#ff6600,color:#fff
    style Grafana fill:#2a2d3e,color:#fff
    style PG fill:#2a7de1,color:#fff
    style Redis fill:#e65100,color:#fff

Scaling the LGTM Backends

| Component | Scale Strategy | Key Metric |
|---|---|---|
| Mimir Ingesters | Add replicas | Active series count |
| Mimir Queriers | Add replicas | Query latency p99 |
| Loki Ingesters | Add replicas | Log ingestion rate (bytes/sec) |
| Loki Queriers | Add replicas | LogQL query latency |
| Tempo Ingesters | Add replicas | Spans/sec |
| Alloy | DaemonSet (1 per node) | Scales automatically with node count |
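
The "add replicas" strategy can be automated with a standard autoscaling/v2 HPA. A sketch for Mimir queriers; names and thresholds are illustrative:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: mimir-querier
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: mimir-querier      # assumes queriers run as a Deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out above 70% average CPU
```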

Cost

Self-Hosted Cost Factors

| Factor | Driver | Optimization |
|---|---|---|
| Compute | Number of backend pods | Right-size resources, use spot/preemptible nodes |
| Object Storage | Data retention × ingestion rate | Set retention policies, use lifecycle rules, compress |
| Database | PostgreSQL instance size | Start small, scale with usage |
| Network | Cross-AZ / cross-region traffic | Co-locate components in same AZ, use VPC endpoints |
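
As one example of a retention policy, Loki's compactor-based retention can cap object-storage growth. Key names follow recent Loki versions; verify against yours:

```yaml
# Loki config fragment — retain logs ~31 days, then delete via the compactor
limits_config:
  retention_period: 744h        # ~31 days
compactor:
  retention_enabled: true
  delete_request_store: s3      # required for retention in Loki 3.x
```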

Grafana Cloud Pricing Summary (2026)

| Tier | Base Cost | Included | Billing Model |
|---|---|---|---|
| Free | $0 | 10k active metric series, 50 GB logs/traces | — |
| Pro | $19/mo platform fee | Base allowances | Usage-based (per series, per GB) |
| Enterprise | $25k+/yr | Volume discounts, enhanced SLAs | Annual commitment |

Cost Comparison: Self-Hosted vs Cloud

For a typical mid-size setup (500k active series, 100 GB/day logs, 50M spans/day):

| Model | Estimated Monthly Cost | Trade-off |
|---|---|---|
| Self-hosted (K8s) | $500–2,000 | Full control, higher ops burden |
| Grafana Cloud Pro | $1,000–3,000 | Managed, lower ops burden |
| Datadog equivalent | $5,000–15,000 | Fully managed, highest cost |

Costs are approximate and vary significantly by cloud provider and configuration.

Security

Authentication Hardening

  1. Disable basic auth in production — use SSO (OAuth 2.0 / SAML)
  2. Enforce MFA via your identity provider (Okta, Azure AD, Google)
  3. Disable anonymous access ([auth.anonymous] enabled = false)
  4. Disable self-registration ([users] allow_sign_up = false)
  5. Set session timeouts (login_maximum_lifetime_duration = 12h)
  6. Use HTTPS/TLS for all traffic
  7. Enable CSRF protection (enabled by default)
  8. Set Content Security Policy headers

RBAC Best Practices

| Role | Permissions | Who |
|---|---|---|
| Viewer | View dashboards, explore data | Most users |
| Editor | Create/edit dashboards, create alerts | Team leads, SREs |
| Admin | Manage org, users, data sources | Org administrators |
| Grafana Admin | System-wide access | Platform team only (minimize!) |
  • Use Teams synced with your IdP groups for permission management
  • Use data source permissions to restrict which teams can query which backends
  • Use proxy mode for data sources to avoid exposing backend credentials to browsers
  • Enterprise/Cloud: Use custom roles for fine-grained permissions (e.g., "can edit dashboards in folder X but not Y")

LDAP/SAML Hardening

  • Always use TLS/SSL for LDAP connections
  • Use a dedicated service account with read-only permissions for LDAP binding
  • Verify certificates (ssl_skip_verify = false)
  • Set minimum TLS version to 1.2+
  • Enable SAML request signing for integrity
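
A hardened ldap.toml server block might look like this. Host, bind DN, and search base are placeholders, and min_tls_version availability depends on your Grafana version:

```toml
[[servers]]
host = "ldap.example.com"
port = 636
use_ssl = true                  # LDAPS, not plaintext
ssl_skip_verify = false         # always verify certificates
min_tls_version = "TLS1.2"
bind_dn = "cn=grafana-ro,ou=services,dc=example,dc=com"   # read-only service account
bind_password = "${LDAP_BIND_PASSWORD}"                   # injected, never hardcoded
search_filter = "(cn=%s)"
search_base_dns = ["ou=users,dc=example,dc=com"]
```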

Secrets Management

  • Never hardcode secrets in grafana.ini — use environment variables or a secrets manager (Vault, AWS KMS)
  • Use Kubernetes Secrets (or ExternalSecrets Operator) to inject credentials
  • Use read-only database users for data source connections
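
A sketch of the Kubernetes-native pattern: keep the credential in a Secret and let the GF_ environment override pick it up (names are illustrative):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: grafana-db-credentials
type: Opaque
stringData:
  GF_DATABASE_PASSWORD: change-me   # in practice, populate via ExternalSecrets/Vault
---
# In the Grafana Deployment's container spec, inject the Secret as env vars:
# envFrom:
#   - secretRef:
#       name: grafana-db-credentials
```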

Best Practices

Dashboard Governance

  1. Use folders to organize dashboards by team/domain
  2. Use provisioning for infrastructure-critical dashboards (prevents manual drift)
  3. Set ownership — every dashboard should have a clear owner/team
  4. Review cadence — quarterly review of all dashboards, archive unused ones
  5. Naming conventions — prefix dashboards with team or domain (e.g., [infra] Node Overview)
  6. Template variables — use for environment, region, service filtering
  7. Max panels per dashboard — aim for 8–12 (overview) or 15–20 (detailed)

Query Optimization

  1. Filter early — use precise label selectors in PromQL/LogQL
  2. Avoid high cardinality — don't use user IDs, IP addresses, or request paths as labels
  3. Use recording rules — precompute expensive PromQL queries in Mimir/Prometheus
  4. Set Max Data Points — prevent over-fetching (10k points for a 1k-pixel graph wastes resources)
  5. Optimize refresh intervals — avoid < 10s unless truly needed
  6. Use $__interval — let Grafana auto-calculate appropriate step size
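
A recording rule that precomputes an expensive aggregation might look like this (standard Prometheus/Mimir rule syntax; the metric and rule names are illustrative):

```yaml
groups:
  - name: dashboard-precompute
    interval: 1m
    rules:
      # Dashboards query job:http_requests:rate5m directly instead of
      # re-aggregating raw series on every panel refresh.
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
```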

Infrastructure

  1. Monitor Grafana with Grafana — use kube-prometheus-stack to monitor the monitoring
  2. Set resource limits — define CPU/memory requests and limits in Kubernetes
  3. Use immutable images — pre-install plugins in custom Docker images instead of runtime installs
  4. Backup the database — automated PostgreSQL backups with PITR
  5. Audit logs — enable for compliance (Enterprise feature)
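
For point 3, plugins can be baked into a custom image at build time. The base tag and plugin are illustrative; pin both for reproducible builds:

```dockerfile
FROM grafana/grafana-oss:11.2.0
# Install plugins at build time, not at container startup.
RUN grafana-cli plugins install grafana-clock-panel
```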

Common Issues & Playbook

| Symptom | Likely Cause | Fix |
|---|---|---|
| Dashboard loads slowly | Expensive queries or too many panels | Use Query Inspector, add recording rules, reduce panel count |
| "Data source is not available" | Connection issue or misconfigured URL | Check network, verify URL in data source settings, check proxy mode |
| Alerts not firing | Evaluator not running or contact point misconfigured | Check [unified_alerting] is enabled, verify contact point with Test |
| Login loop / session issues | SQLite under HA or missing Redis config | Switch to PostgreSQL + Redis for sessions |
| Plugin not loading | Unsigned plugin or missing signature | Set allow_loading_unsigned_plugins or sign the plugin |
| High memory on Grafana process | Too many concurrent dashboard viewers | Scale horizontally, reduce auto-refresh intervals |
| "database is locked" | SQLite with multiple replicas | Switch to PostgreSQL/MySQL immediately |

Monitoring & Troubleshooting

Key Grafana Metrics to Monitor

Grafana exposes Prometheus metrics at /metrics:

| Metric | What It Tells You |
|---|---|
| grafana_http_request_duration_seconds | API request latency |
| grafana_alerting_rule_evaluations_total | Alert evaluation throughput |
| grafana_alerting_rule_evaluation_failures_total | Alert evaluation errors |
| grafana_proxy_request_duration_seconds | Data source proxy latency |
| grafana_stat_totals | Total dashboards, users, orgs |
| grafana_active_user_sessions | Current active sessions |
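
A minimal Prometheus scrape job for this endpoint (the target address is a placeholder):

```yaml
scrape_configs:
  - job_name: grafana
    metrics_path: /metrics      # Grafana's default metrics path
    static_configs:
      - targets: ["grafana.example.com:3000"]
```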

Troubleshooting Tools

  1. Query Inspector — built-in panel tool to see raw query, response time, and data
  2. Grafana Server Logs — grafana-server.log or stdout in containers
  3. API Explorer — /api/ endpoints for programmatic inspection
  4. Provisioning debug — watch mode logs file-change detection events
  5. Alloy Debug UI — http://localhost:12345 — real-time pipeline graph and health