Operations

Deployment & Typical Setup

Installation Methods

| Method | Recommended For | Notes |
|---|---|---|
| Docker | Dev, CI, small prod | docker run -d -p 3000:3000 grafana/grafana-oss |
| Helm (Kubernetes) | Production | helm install grafana grafana/grafana |
| apt/yum (Linux) | Traditional servers | Official Grafana repo packages |
| macOS (Homebrew) | Local dev | brew install grafana |
| Binary | Air-gapped envs | Download from grafana.com/grafana/download |
| Grafana Cloud | Managed SaaS | Zero infrastructure overhead |
| AWS Managed Grafana | AWS-native teams | Per-editor/viewer pricing |
| Azure Managed Grafana | Azure-native teams | Resource-based pricing |

Typical Single-Node Setup

grafana.ini (or env vars)
├── [database] → SQLite (default) or PostgreSQL/MySQL
├── [server] → http_port=3000, domain, root_url
├── [security] → admin_password, secret_key
├── [auth] → SSO/LDAP/OAuth config
└── [paths] → data, logs, plugins, provisioning
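
A single-node stack like the one above can be run with a minimal Docker Compose file. This is a sketch only; the image tag, domain, and volume name are illustrative:

```yaml
# docker-compose.yml — minimal single-node Grafana (SQLite, local volume)
services:
  grafana:
    image: grafana/grafana-oss:latest   # pin a specific tag in practice
    ports:
      - "3000:3000"
    environment:
      GF_SERVER_DOMAIN: grafana.example.com          # placeholder domain
      GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_ADMIN_PASSWORD}
    volumes:
      - grafana-data:/var/lib/grafana   # persists SQLite DB, plugins, dashboards
volumes:
  grafana-data:
```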

Production-Grade Setup (Kubernetes)

  1. External PostgreSQL database (not SQLite)
  2. External Redis for session storage
  3. Multiple Grafana replicas behind a load balancer
  4. Dashboards and data sources managed via provisioning (YAML/JSON in ConfigMaps)
  5. Alert rules managed as code (provisioning YAML or Terraform)
  6. HPA (Horizontal Pod Autoscaler) for Grafana pods
  7. Ingress with TLS termination

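The checklist above maps roughly onto a values.yaml for the official grafana/grafana Helm chart. Key names follow the chart's documented schema, but verify them against your chart version:

```yaml
# values.yaml sketch — production-grade Grafana on Kubernetes
replicas: 3                      # item 3: multiple replicas behind the LB
env:
  GF_DATABASE_TYPE: postgres     # item 1: external PostgreSQL
  GF_DATABASE_HOST: postgres.internal:5432
envFromSecret: grafana-db-credentials   # DB password injected from a Secret
ingress:                         # item 7: Ingress with TLS termination
  enabled: true
  hosts:
    - grafana.example.com
  tls:
    - secretName: grafana-tls
      hosts:
        - grafana.example.com
autoscaling:                     # item 6: HPA for Grafana pods
  enabled: true
  minReplicas: 3
  maxReplicas: 6
sidecar:                         # item 4: dashboards from labeled ConfigMaps
  dashboards:
    enabled: true
```
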
Configuration & Optimal Tuning

Essential grafana.ini Settings

# === Database (CRITICAL for production) ===
[database]
type = postgres
host = postgres.internal:5432
name = grafana
user = grafana
password = ${GF_DATABASE_PASSWORD}
ssl_mode = require

# === Remote cache (recommended for HA) ===
# Note: the legacy [session] Redis store was removed in Grafana 6.2;
# sessions now live in the database, and Redis is configured via [remote_cache].
[remote_cache]
type = redis
connstr = addr=redis.internal:6379,pool_size=100,db=0

# === Server ===
[server]
http_port = 3000
domain = grafana.example.com
root_url = https://grafana.example.com
serve_from_sub_path = false

# === Security ===
[security]
admin_password = ${GF_SECURITY_ADMIN_PASSWORD}
secret_key = ${GF_SECURITY_SECRET_KEY}
cookie_secure = true
cookie_samesite = lax
content_security_policy = true
strict_transport_security = true

# === Auth (Example: OAuth with Okta) ===
[auth.generic_oauth]
enabled = true
name = Okta
client_id = ${GF_AUTH_OKTA_CLIENT_ID}
client_secret = ${GF_AUTH_OKTA_CLIENT_SECRET}
scopes = openid profile email groups
auth_url = https://your-org.okta.com/oauth2/v1/authorize
token_url = https://your-org.okta.com/oauth2/v1/token
api_url = https://your-org.okta.com/oauth2/v1/userinfo
role_attribute_path = contains(groups[*], 'grafana-admins') && 'Admin' || contains(groups[*], 'grafana-editors') && 'Editor' || 'Viewer'
allow_sign_up = true

# === Alerting ===
[unified_alerting]
enabled = true
execute_alerts = true

# === Performance ===
[dataproxy]
timeout = 300
dial_timeout = 30
keep_alive_seconds = 30

[rendering]
concurrent_render_request_limit = 30

Environment Variable Override Pattern

All grafana.ini settings can be overridden via environment variables using the pattern:

GF_<SECTION>_<KEY>=value

Examples:

  • GF_DATABASE_TYPE=postgres
  • GF_SECURITY_ADMIN_PASSWORD=supersecret
  • GF_AUTH_GENERIC_OAUTH_ENABLED=true
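
The mapping can be sketched as a small shell helper; gf_env_name is a hypothetical function for illustration (note that dots in section names, as in auth.generic_oauth, become underscores):

```shell
# Derive the GF_<SECTION>_<KEY> env var name for any grafana.ini setting.
gf_env_name() {
  section="$1"; key="$2"
  # Join, replace dots with underscores, then uppercase.
  printf 'GF_%s_%s\n' "$section" "$key" | tr '.' '_' | tr '[:lower:]' '[:upper:]'
}

gf_env_name "auth.generic_oauth" "enabled"   # GF_AUTH_GENERIC_OAUTH_ENABLED
gf_env_name "database" "type"                # GF_DATABASE_TYPE
```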

Reliability & Scaling

Horizontal Scaling Checklist

  • Switch from SQLite to PostgreSQL or MySQL
  • Configure Redis/Memcached via [remote_cache] (user sessions themselves are stored in the shared database)
  • Set replicas: 3+ in Helm values
  • Enable HPA with CPU/memory targets
  • Use Ingress with TLS termination
  • Provision dashboards and data sources via ConfigMaps or sidecar
  • Set editable: false on provisioned dashboards to prevent drift
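
For the provisioning and editable: false items, a file-based dashboard provider looks like this. allowUiUpdates: false is what makes file-provisioned dashboards read-only in the UI; the folder and path are placeholders:

```yaml
# provisioning/dashboards/default.yaml
apiVersion: 1
providers:
  - name: infra-dashboards
    folder: Infra                 # Grafana folder the dashboards land in
    type: file
    allowUiUpdates: false         # prevents drift: UI edits are rejected
    options:
      path: /var/lib/grafana/dashboards   # mount your ConfigMap here
```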

High Availability Architecture

flowchart TB
    LB["Load Balancer<br/>(NGINX Ingress / ALB)"]

    subgraph Grafana["Grafana Replicas"]
        G1["Pod 1"]
        G2["Pod 2"]
        G3["Pod 3"]
    end

    PG["PostgreSQL<br/>(HA: RDS / CloudSQL)"]
    Redis["Redis<br/>(HA: ElastiCache)"]

    LB --> G1
    LB --> G2
    LB --> G3
    G1 --> PG
    G2 --> PG
    G3 --> PG
    G1 --> Redis
    G2 --> Redis
    G3 --> Redis

    style LB fill:#ff6600,color:#fff
    style Grafana fill:#2a2d3e,color:#fff
    style PG fill:#2a7de1,color:#fff
    style Redis fill:#e65100,color:#fff

Scaling the LGTM Backends

| Component | Scale Strategy | Key Metric |
|---|---|---|
| Mimir Ingesters | Add replicas | Active series count |
| Mimir Queriers | Add replicas | Query latency p99 |
| Loki Ingesters | Add replicas | Log ingestion rate (bytes/sec) |
| Loki Queriers | Add replicas | LogQL query latency |
| Tempo Ingesters | Add replicas | Spans/sec |
| Alloy | DaemonSet (1 per node) | Scales automatically with node count |
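
The "add replicas" strategy can be automated with a standard autoscaling/v2 HPA. A sketch for Mimir queriers; names and thresholds are illustrative:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: mimir-querier
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: mimir-querier      # assumes queriers run as a Deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out above 70% average CPU
```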

Cost

Self-Hosted Cost Factors

| Factor | Driver | Optimization |
|---|---|---|
| Compute | Number of backend pods | Right-size resources, use spot/preemptible nodes |
| Object Storage | Data retention × ingestion rate | Set retention policies, use lifecycle rules, compress |
| Database | PostgreSQL instance size | Start small, scale with usage |
| Network | Cross-AZ / cross-region traffic | Co-locate components in same AZ, use VPC endpoints |
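
As one example of a retention policy, Loki's compactor-based retention can cap object-storage growth. Key names follow recent Loki versions; verify against yours:

```yaml
# Loki config fragment — retain logs ~31 days, then delete via the compactor
limits_config:
  retention_period: 744h        # ~31 days
compactor:
  retention_enabled: true
  delete_request_store: s3      # required for retention in Loki 3.x
```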

Grafana Cloud Pricing Summary (2026)

| Tier | Base Cost | Included | Billing Model |
|---|---|---|---|
| Free | $0 | 10k active metric series, 50 GB logs/traces | — |
| Pro | $19/mo platform fee | Base allowances | Usage-based (per series, per GB) |
| Enterprise | $25k+/yr | Volume discounts, enhanced SLAs | Annual commitment |

Cost Comparison: Self-Hosted vs Cloud

For a typical mid-size setup (500k active series, 100 GB/day logs, 50M spans/day):

| Model | Estimated Monthly Cost | Trade-off |
|---|---|---|
| Self-hosted (K8s) | $500–2,000 | Full control, higher ops burden |
| Grafana Cloud Pro | $1,000–3,000 | Managed, lower ops burden |
| Datadog equivalent | $5,000–15,000 | Fully managed, highest cost |

Costs are approximate and vary significantly by cloud provider and configuration.

Security

Authentication Hardening

  1. Disable basic auth in production — use SSO (OAuth 2.0 / SAML)
  2. Enforce MFA via your identity provider (Okta, Azure AD, Google)
  3. Disable anonymous access ([auth.anonymous] enabled = false)
  4. Disable self-registration ([users] allow_sign_up = false)
  5. Set session timeouts (login_maximum_lifetime_duration = 12h)
  6. Use HTTPS/TLS for all traffic
  7. Enable CSRF protection (enabled by default)
  8. Set Content Security Policy headers

RBAC Best Practices

| Role | Permissions | Who |
|---|---|---|
| Viewer | View dashboards, explore data | Most users |
| Editor | Create/edit dashboards, create alerts | Team leads, SREs |
| Admin | Manage org, users, data sources | Org administrators |
| Grafana Admin | System-wide access | Platform team only (minimize!) |
  • Use Teams synced with your IdP groups for permission management
  • Use data source permissions to restrict which teams can query which backends
  • Use proxy mode for data sources to avoid exposing backend credentials to browsers
  • Enterprise/Cloud: Use custom roles for fine-grained permissions (e.g., "can edit dashboards in folder X but not Y")

LDAP/SAML Hardening

  • Always use TLS/SSL for LDAP connections
  • Use a dedicated service account with read-only permissions for LDAP binding
  • Verify certificates (ssl_skip_verify = false)
  • Set minimum TLS version to 1.2+
  • Enable SAML request signing for integrity
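
A hardened ldap.toml server block might look like this. Host, bind DN, and search base are placeholders, and min_tls_version availability depends on your Grafana version:

```toml
[[servers]]
host = "ldap.example.com"
port = 636
use_ssl = true                  # LDAPS, not plaintext
ssl_skip_verify = false         # always verify certificates
min_tls_version = "TLS1.2"
bind_dn = "cn=grafana-ro,ou=services,dc=example,dc=com"   # read-only service account
bind_password = "${LDAP_BIND_PASSWORD}"                   # injected, never hardcoded
search_filter = "(cn=%s)"
search_base_dns = ["ou=users,dc=example,dc=com"]
```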

Secrets Management

  • Never hardcode secrets in grafana.ini — use environment variables or a secrets manager (Vault, AWS KMS)
  • Use Kubernetes Secrets (or ExternalSecrets Operator) to inject credentials
  • Use read-only database users for data source connections
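
A sketch of the Kubernetes-native pattern: keep the credential in a Secret and let the GF_ environment override pick it up (names are illustrative):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: grafana-db-credentials
type: Opaque
stringData:
  GF_DATABASE_PASSWORD: change-me   # in practice, populate via ExternalSecrets/Vault
---
# In the Grafana Deployment's container spec, inject the Secret as env vars:
# envFrom:
#   - secretRef:
#       name: grafana-db-credentials
```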

Best Practices

Dashboard Governance

  1. Use folders to organize dashboards by team/domain
  2. Use provisioning for infrastructure-critical dashboards (prevents manual drift)
  3. Set ownership — every dashboard should have a clear owner/team
  4. Review cadence — quarterly review of all dashboards, archive unused ones
  5. Naming conventions — prefix dashboards with team or domain (e.g., [infra] Node Overview)
  6. Template variables — use for environment, region, service filtering
  7. Max panels per dashboard — aim for 8–12 (overview) or 15–20 (detailed)

Query Optimization

  1. Filter early — use precise label selectors in PromQL/LogQL
  2. Avoid high cardinality — don't use user IDs, IP addresses, or request paths as labels
  3. Use recording rules — precompute expensive PromQL queries in Mimir/Prometheus
  4. Set Max Data Points — prevent over-fetching (10k points for a 1k-pixel graph wastes resources)
  5. Optimize refresh intervals — avoid < 10s unless truly needed
  6. Use $__interval — let Grafana auto-calculate appropriate step size
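
A recording rule that precomputes an expensive aggregation might look like this (standard Prometheus/Mimir rule syntax; the metric and rule names are illustrative):

```yaml
groups:
  - name: dashboard-precompute
    interval: 1m
    rules:
      # Dashboards query job:http_requests:rate5m directly instead of
      # re-aggregating raw series on every panel refresh.
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
```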

Infrastructure

  1. Monitor Grafana with Grafana — use kube-prometheus-stack to monitor the monitoring
  2. Set resource limits — define CPU/memory requests and limits in Kubernetes
  3. Use immutable images — pre-install plugins in custom Docker images instead of runtime installs
  4. Backup the database — automated PostgreSQL backups with PITR
  5. Audit logs — enable for compliance (Enterprise feature)
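
For point 3, plugins can be baked into a custom image at build time. The base tag and plugin are illustrative; pin both for reproducible builds:

```dockerfile
FROM grafana/grafana-oss:11.2.0
# Install plugins at build time, not at container startup.
RUN grafana-cli plugins install grafana-clock-panel
```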

Common Issues & Playbook

| Symptom | Likely Cause | Fix |
|---|---|---|
| Dashboard loads slowly | Expensive queries or too many panels | Use Query Inspector, add recording rules, reduce panel count |
| "Data source is not available" | Connection issue or misconfigured URL | Check network, verify URL in data source settings, check proxy mode |
| Alerts not firing | Evaluator not running or contact point misconfigured | Check [unified_alerting] is enabled, verify contact point with Test |
| Login loop / session issues | SQLite under HA or missing Redis config | Switch to PostgreSQL + Redis for sessions |
| Plugin not loading | Unsigned plugin or missing signature | Set allow_loading_unsigned_plugins or sign the plugin |
| High memory on Grafana process | Too many concurrent dashboard viewers | Scale horizontally, reduce auto-refresh intervals |
| "database is locked" | SQLite with multiple replicas | Switch to PostgreSQL/MySQL immediately |

Monitoring & Troubleshooting

Key Grafana Metrics to Monitor

Grafana exposes Prometheus metrics at /metrics:

| Metric | What It Tells You |
|---|---|
| grafana_http_request_duration_seconds | API request latency |
| grafana_alerting_rule_evaluations_total | Alert evaluation throughput |
| grafana_alerting_rule_evaluation_failures_total | Alert evaluation errors |
| grafana_proxy_request_duration_seconds | Data source proxy latency |
| grafana_stat_totals | Total dashboards, users, orgs |
| grafana_active_user_sessions | Current active sessions |
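
A minimal Prometheus scrape job for this endpoint (the target address is a placeholder):

```yaml
scrape_configs:
  - job_name: grafana
    metrics_path: /metrics      # Grafana's default metrics path
    static_configs:
      - targets: ["grafana.example.com:3000"]
```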

Troubleshooting Tools

  1. Query Inspector — built-in panel tool to see raw query, response time, and data
  2. Grafana Server Logs — grafana-server.log or stdout in containers
  3. API Explorer — /api/ endpoints for programmatic inspection
  4. Provisioning debug — watch mode logs file-change detection events
  5. Alloy Debug UI — http://localhost:12345 — real-time pipeline graph and health