Operations
Deployment & Typical Setup
Installation Methods
| Method | Recommended For | Notes |
| --- | --- | --- |
| Docker | Dev, CI, small prod | docker run -d -p 3000:3000 grafana/grafana-oss |
| Helm (Kubernetes) | Production | helm install grafana grafana/grafana |
| apt/yum (Linux) | Traditional servers | Official Grafana repo packages |
| macOS (Homebrew) | Local dev | brew install grafana |
| Binary | Air-gapped envs | Download from grafana.com/grafana/download |
| Grafana Cloud | Managed SaaS | Zero infrastructure overhead |
| AWS Managed Grafana | AWS-native teams | Per-editor/viewer pricing |
| Azure Managed Grafana | Azure-native teams | Resource-based pricing |
Typical Single-Node Setup
grafana.ini (or env vars)
├── [database] → SQLite (default) or PostgreSQL/MySQL
├── [server] → http_port=3000, domain, root_url
├── [security] → admin_password, secret_key
├── [auth] → SSO/LDAP/OAuth config
└── [paths] → data, logs, plugins, provisioning
Production-Grade Setup (Kubernetes)
- External PostgreSQL database (not SQLite)
- External Redis as the shared remote cache (since Grafana 6.0, login sessions themselves are stored in the database)
- Multiple Grafana replicas behind a load balancer
- Dashboards and data sources managed via provisioning (YAML/JSON in ConfigMaps)
- Alert rules managed as code (provisioning YAML or Terraform)
- HPA (Horizontal Pod Autoscaler) for Grafana pods
- Ingress with TLS termination
Configuration & Optimal Tuning
Essential grafana.ini Settings
# === Database (CRITICAL for production) ===
[database]
type = postgres
host = postgres.internal:5432
name = grafana
user = grafana
password = ${GF_DATABASE_PASSWORD}
ssl_mode = require
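# === Remote cache (shared across replicas in HA) ===
# Grafana 6.0+ stores login sessions in the database; the old session provider settings were removed.
[remote_cache]
type = redis
connstr = addr=redis.internal:6379,pool_size=100,db=0,ssl=false
prefix = grafana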
# === Server ===
[server]
http_port = 3000
domain = grafana.example.com
root_url = https://grafana.example.com
serve_from_sub_path = false
# === Security ===
[security]
admin_password = ${GF_SECURITY_ADMIN_PASSWORD}
secret_key = ${GF_SECURITY_SECRET_KEY}
cookie_secure = true
cookie_samesite = lax
content_security_policy = true
strict_transport_security = true
# === Auth (Example: OAuth with Okta) ===
[auth.generic_oauth]
enabled = true
name = Okta
client_id = ${GF_AUTH_OKTA_CLIENT_ID}
client_secret = ${GF_AUTH_OKTA_CLIENT_SECRET}
scopes = openid profile email groups
auth_url = https://your-org.okta.com/oauth2/v1/authorize
token_url = https://your-org.okta.com/oauth2/v1/token
api_url = https://your-org.okta.com/oauth2/v1/userinfo
role_attribute_path = contains(groups[*], 'grafana-admins') && 'Admin' || contains(groups[*], 'grafana-editors') && 'Editor' || 'Viewer'
allow_sign_up = true
# === Alerting ===
[unified_alerting]
enabled = true
execute_alerts = true
# === Performance ===
[dataproxy]
timeout = 300
dial_timeout = 30
keep_alive_seconds = 30
[rendering]
concurrent_render_request_limit = 30
Environment Variable Override Pattern
All grafana.ini settings can be overridden via environment variables using the pattern GF_<SECTION_NAME>_<KEY>, written in uppercase with any dots or dashes replaced by underscores.
Examples:
- GF_DATABASE_TYPE=postgres
- GF_SECURITY_ADMIN_PASSWORD=supersecret
- GF_AUTH_GENERIC_OAUTH_ENABLED=true
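The naming rule can be sketched as a small helper that derives the variable name from a section and key. The function name `env_var_name` is hypothetical and exists only for illustration; it is not part of Grafana.

```python
def env_var_name(section: str, key: str) -> str:
    """Build the GF_<SECTION_NAME>_<KEY> override name for a grafana.ini setting."""
    # Dots and dashes are not valid in the variable name, so they become underscores.
    def normalize(part: str) -> str:
        return part.replace(".", "_").replace("-", "_").upper()

    return f"GF_{normalize(section)}_{normalize(key)}"


if __name__ == "__main__":
    # Reproduces the three examples listed above.
    print(env_var_name("database", "type"))
    print(env_var_name("security", "admin_password"))
    print(env_var_name("auth.generic_oauth", "enabled"))
```

A helper like this is mainly useful when generating container environment blocks from an existing grafana.ini.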
Reliability & Scaling
Horizontal Scaling Checklist
High Availability Architecture
flowchart TB
LB["Load Balancer<br/>(NGINX Ingress / ALB)"]
subgraph Grafana["Grafana Replicas"]
G1["Pod 1"]
G2["Pod 2"]
G3["Pod 3"]
end
PG["PostgreSQL<br/>(HA: RDS / CloudSQL)"]
Redis["Redis<br/>(HA: ElastiCache)"]
LB --> G1
LB --> G2
LB --> G3
G1 --> PG
G2 --> PG
G3 --> PG
G1 --> Redis
G2 --> Redis
G3 --> Redis
style LB fill:#ff6600,color:#fff
style Grafana fill:#2a2d3e,color:#fff
style PG fill:#2a7de1,color:#fff
style Redis fill:#e65100,color:#fff
Scaling the LGTM Backends
| Component | Scale Strategy | Key Metric |
| --- | --- | --- |
| Mimir Ingesters | Add replicas | Active series count |
| Mimir Queriers | Add replicas | Query latency p99 |
| Loki Ingesters | Add replicas | Log ingestion rate (bytes/sec) |
| Loki Queriers | Add replicas | LogQL query latency |
| Tempo Ingesters | Add replicas | Spans/sec |
| Alloy | DaemonSet (1 per node) | Automatic |
Cost
Self-Hosted Cost Factors
| Factor | Driver | Optimization |
| --- | --- | --- |
| Compute | Number of backend pods | Right-size resources, use spot/preemptible nodes |
| Object Storage | Data retention × ingestion rate | Set retention policies, use lifecycle rules, compress |
| Database | PostgreSQL instance size | Start small, scale with usage |
| Network | Cross-AZ / cross-region traffic | Co-locate components in same AZ, use VPC endpoints |
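The object storage driver (data retention × ingestion rate) can be turned into a rough estimate. The sketch below is only illustrative: the function names, the compression ratio, and the price per GB are placeholder assumptions, not measured values or quoted prices.

```python
def estimate_stored_gb(ingest_gb_per_day: float,
                       retention_days: int,
                       compression_ratio: float = 1.0) -> float:
    """Steady-state stored volume: daily ingestion x retention x on-disk ratio."""
    return ingest_gb_per_day * retention_days * compression_ratio


def monthly_storage_cost(stored_gb: float, price_per_gb_month: float) -> float:
    """Monthly object storage bill for a given stored volume."""
    return stored_gb * price_per_gb_month


if __name__ == "__main__":
    # Assumed inputs: 100 GB/day of logs, 30-day retention,
    # data stored at half its raw size, and a placeholder price of 0.02 per GB-month.
    stored = estimate_stored_gb(100, 30, compression_ratio=0.5)
    print(f"stored: {stored:.0f} GB")
    print(f"monthly cost: {monthly_storage_cost(stored, 0.02):.2f}")
```

Halving retention or doubling compression halves the result, which is why retention policies and lifecycle rules are usually the first lever to pull.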
Grafana Cloud Pricing Summary (2026)
| Tier | Base Cost | Included | Billing Model |
| --- | --- | --- | --- |
| Free | $0 | 10k active metrics series, 50 GB logs/traces | N/A |
| Pro | $19/mo platform fee | Base allowances | Usage-based (per series, per GB) |
| Enterprise | $25k+/yr | Volume discounts, enhanced SLAs | Annual commitment |
Cost Comparison: Self-Hosted vs Cloud
For a typical mid-size setup (500k active series, 100 GB/day logs, 50M spans/day):
| Model | Estimated Monthly Cost | Trade-off |
| --- | --- | --- |
| Self-hosted (K8s) | $500–2,000 | Full control, higher ops burden |
| Grafana Cloud Pro | $1,000–3,000 | Managed, lower ops burden |
| Datadog equivalent | $5,000–15,000 | Fully managed, highest cost |
Costs are approximate and vary significantly by cloud provider and configuration.
Security
Authentication Hardening
- Disable basic auth in production; use SSO (OAuth 2.0 / SAML)
- Enforce MFA via your identity provider (Okta, Azure AD, Google)
- Disable anonymous access ([auth.anonymous] enabled = false)
- Disable self-registration ([users] allow_sign_up = false)
- Set session timeouts (login_maximum_lifetime_duration = 12h)
- Use HTTPS/TLS for all traffic
- Enable CSRF protection (enabled by default)
- Set Content Security Policy headers
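Several of these items can be verified mechanically against a grafana.ini file. The following is a minimal sketch using Python's standard `configparser`; the `EXPECTED` table and the `hardening_findings` helper are hypothetical, and the check ignores environment variable overrides, so treat it as a starting point rather than a complete audit.

```python
import configparser

# Expected values for a hardened setup, taken from the checklist above.
EXPECTED = {
    ("auth.anonymous", "enabled"): "false",
    ("users", "allow_sign_up"): "false",
    ("auth.basic", "enabled"): "false",
    ("security", "cookie_secure"): "true",
}


def hardening_findings(ini_text: str) -> list[str]:
    """Return one message per setting that is missing or differs from EXPECTED."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    findings = []
    for (section, key), expected in EXPECTED.items():
        actual = parser.get(section, key, fallback=None)
        if actual is None:
            findings.append(f"[{section}] {key} is not set explicitly")
        elif actual.strip().lower() != expected:
            findings.append(f"[{section}] {key} = {actual}, expected {expected}")
    return findings


SAMPLE = """
[auth.anonymous]
enabled = true

[users]
allow_sign_up = false
"""

if __name__ == "__main__":
    for finding in hardening_findings(SAMPLE):
        print(finding)
```

Running a check like this in CI against the rendered configuration catches regressions before they reach production.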
RBAC Best Practices
| Role | Permissions | Who |
| --- | --- | --- |
| Viewer | View dashboards, explore data | Most users |
| Editor | Create/edit dashboards, create alerts | Team leads, SREs |
| Admin | Manage org, users, data sources | Org administrators |
| Grafana Admin | System-wide access | Platform team only (minimize!) |
- Use Teams synced with your IdP groups for permission management
- Use data source permissions to restrict which teams can query which backends
- Use proxy mode for data sources to avoid exposing backend credentials to browsers
- Enterprise/Cloud: Use custom roles for fine-grained permissions (e.g., "can edit dashboards in folder X but not Y")
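When roles are derived from IdP groups, it helps to write the intended policy down in a testable form. The sketch below mirrors the role_attribute_path expression from the OAuth example in plain Python; `map_role` is a hypothetical helper for unit-testing the policy, not something Grafana runs.

```python
def map_role(groups: list[str]) -> str:
    """Return the Grafana role for a user's IdP groups, highest privilege first."""
    if "grafana-admins" in groups:
        return "Admin"
    if "grafana-editors" in groups:
        return "Editor"
    # Everyone else falls through to the least-privileged role.
    return "Viewer"


if __name__ == "__main__":
    print(map_role(["grafana-editors", "sre"]))
    print(map_role(["marketing"]))
```

The order of the checks matters: a user in both groups should receive the higher role, which is the same precedence the JMESPath expression encodes.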
LDAP/SAML Hardening
- Always use TLS/SSL for LDAP connections
- Use a dedicated service account with read-only permissions for LDAP binding
- Verify certificates (ssl_skip_verify = false)
- Set minimum TLS version to 1.2+
- Enable SAML request signing for integrity
Secrets Management
- Never hardcode secrets in grafana.ini; use environment variables or a secrets manager (Vault, AWS Secrets Manager)
- Use Kubernetes Secrets (or ExternalSecrets Operator) to inject credentials
- Use read-only database users for data source connections
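A simple lint can flag secrets that were typed directly into grafana.ini. This sketch assumes that sensitive keys end in `password`, `secret`, or `secret_key`, and that acceptable values are references such as `${VAR}`, `$__env{VAR}`, or `$__file{path}`; the helper name and the sample file path are hypothetical.

```python
import configparser
import re

# Keys treated as sensitive (assumption: suffix match is good enough for a first pass).
SENSITIVE_KEY = re.compile(r"(password|secret|secret_key)$")
# Values that defer to the environment, a file, or a vault instead of a literal.
REFERENCE = re.compile(r"^\$(\{[^}]+\}|__(env|file|vault)\{[^}]+\})$")


def hardcoded_secrets(ini_text: str) -> list[tuple[str, str]]:
    """Return (section, key) pairs whose sensitive value is a literal string."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    hits = []
    for section in parser.sections():
        for key, value in parser.items(section):
            if SENSITIVE_KEY.search(key) and value and not REFERENCE.match(value.strip()):
                hits.append((section, key))
    return hits


SAMPLE = """
[database]
password = ${GF_DATABASE_PASSWORD}

[security]
admin_password = supersecret
secret_key = $__file{/etc/secrets/secret_key}

[auth.generic_oauth]
client_secret = abc123
token_url = https://your-org.okta.com/oauth2/v1/token
"""

if __name__ == "__main__":
    for section, key in hardcoded_secrets(SAMPLE):
        print(f"literal secret in [{section}] {key}")
```

The suffix match deliberately skips keys such as token_url, which contain a sensitive-looking word but hold no secret.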
Best Practices
Dashboard Governance
- Use folders to organize dashboards by team/domain
- Use provisioning for infrastructure-critical dashboards (prevents manual drift)
- Set ownership: every dashboard should have a clear owner/team
- Review cadence: quarterly review of all dashboards, archive unused ones
- Naming conventions: prefix dashboards with team or domain (e.g., [infra] Node Overview)
- Template variables: use for environment, region, service filtering
- Max panels per dashboard: aim for 8–12 (overview) or 15–20 (detailed)
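The naming and panel-count guidelines are easy to enforce in a review script. The sketch below assumes a lowercase bracketed prefix such as [infra] followed by a title; the exact pattern and both helper names are illustrative choices, not a Grafana convention.

```python
import re

# Assumed convention: "[team-or-domain] Title", with a lowercase prefix.
NAME_PATTERN = re.compile(r"^\[[a-z0-9-]+\] \S.*$")

# Upper bounds taken from the guideline above.
PANEL_LIMITS = {"overview": 12, "detailed": 20}


def follows_naming_convention(title: str) -> bool:
    """True when the dashboard title starts with a bracketed team/domain prefix."""
    return bool(NAME_PATTERN.match(title))


def within_panel_budget(panel_count: int, kind: str) -> bool:
    """True when the panel count does not exceed the limit for this dashboard kind."""
    return panel_count <= PANEL_LIMITS[kind]


if __name__ == "__main__":
    print(follows_naming_convention("[infra] Node Overview"))
    print(follows_naming_convention("Node Overview"))
    print(within_panel_budget(14, "overview"))
```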
Query Optimization
- Filter early: use precise label selectors in PromQL/LogQL
- Avoid high cardinality: don't use user IDs, IP addresses, or request paths as labels
- Use recording rules: precompute expensive PromQL queries in Mimir/Prometheus
- Set Max Data Points: prevent over-fetching (10k points for a 1k-pixel graph wastes resources)
- Optimize refresh intervals: avoid < 10s unless truly needed
- Use $__interval: let Grafana auto-calculate appropriate step size
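The relationship between the time range, Max Data Points, and the resulting step can be approximated as range divided by points, bounded below by a minimum interval. This is a simplified sketch under that assumption; Grafana additionally rounds the result to a convenient value, which the hypothetical helper below does not attempt.

```python
def approximate_interval_seconds(range_seconds: float,
                                 max_data_points: int,
                                 min_interval_seconds: float = 1.0) -> float:
    """Approximate step size: time range / max data points, never below the minimum."""
    if max_data_points <= 0:
        raise ValueError("max_data_points must be positive")
    return max(min_interval_seconds, range_seconds / max_data_points)


if __name__ == "__main__":
    six_hours = 6 * 60 * 60
    # About 1k points suits a 1k-pixel-wide panel; 10k points fetches ten times the data.
    print(approximate_interval_seconds(six_hours, 1_000))
    print(approximate_interval_seconds(six_hours, 10_000))
    # A 15s minimum (for example, the scrape interval) keeps the step from dropping too low.
    print(approximate_interval_seconds(six_hours, 10_000, min_interval_seconds=15))
```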
Infrastructure
- Monitor Grafana with Grafana: use kube-prometheus-stack to monitor the monitoring
- Set resource limits: define CPU/memory requests and limits in Kubernetes
- Use immutable images: pre-install plugins in custom Docker images instead of installing them at runtime
- Back up the database: automated PostgreSQL backups with PITR
- Audit logs: enable for compliance (Enterprise feature)
Common Issues & Playbook
| Symptom | Likely Cause | Fix |
| --- | --- | --- |
| Dashboard loads slowly | Expensive queries or too many panels | Use Query Inspector, add recording rules, reduce panel count |
| "Data source is not available" | Connection issue or misconfigured URL | Check network, verify URL in data source settings, check proxy mode |
| Alerts not firing | Evaluator not running or contact point misconfigured | Check [unified_alerting] is enabled, verify contact point with Test |
| Login loop / session issues | SQLite under HA (each replica keeps its own database) | Switch to a shared PostgreSQL/MySQL database |
| Plugin not loading | Unsigned plugin or missing signature | Set allow_loading_unsigned_plugins or sign the plugin |
| High memory on Grafana process | Too many concurrent dashboard viewers | Scale horizontally, reduce auto-refresh intervals |
| "database is locked" | SQLite with multiple replicas | Switch to PostgreSQL/MySQL immediately |
Monitoring & Troubleshooting
Key Grafana Metrics to Monitor
Grafana exposes Prometheus metrics at /metrics:
| Metric | What It Tells You |
| --- | --- |
| grafana_http_request_duration_seconds | API request latency |
| grafana_alerting_rule_evaluations_total | Alert evaluation throughput |
| grafana_alerting_rule_evaluation_failures_total | Alert evaluation errors |
| grafana_proxy_request_duration_seconds | Data source proxy latency |
| grafana_stat_totals | Total dashboards, users, orgs |
| grafana_active_user_sessions | Current active sessions |
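For a quick check without a full Prometheus setup, the text served at /metrics can be parsed directly. The sketch below handles only simple `name{labels} value` lines; the sample values and the `org` label are made up for illustration, and `parse_metrics` is a hypothetical helper rather than a replacement for a real Prometheus client parser.

```python
def parse_metrics(text: str) -> dict[str, float]:
    """Parse Prometheus text-format lines into {series: value}, skipping comments."""
    samples: dict[str, float] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if "}" in line:
            # Labels may contain spaces, so split after the closing brace.
            head, _, tail = line.rpartition("}")
            series = head + "}"
        else:
            series, _, tail = line.partition(" ")
        fields = tail.split()
        if fields:
            # The first field is the value; an optional timestamp is ignored.
            samples[series] = float(fields[0])
    return samples


SAMPLE = """
# TYPE grafana_alerting_rule_evaluations_total counter
grafana_alerting_rule_evaluations_total{org="1"} 120
# TYPE grafana_alerting_rule_evaluation_failures_total counter
grafana_alerting_rule_evaluation_failures_total{org="1"} 3
"""

if __name__ == "__main__":
    metrics = parse_metrics(SAMPLE)
    total = metrics['grafana_alerting_rule_evaluations_total{org="1"}']
    failed = metrics['grafana_alerting_rule_evaluation_failures_total{org="1"}']
    print(f"alert evaluation failure ratio: {failed / total:.3f}")
```

In practice the same ratio is better expressed as a PromQL alert rule; the script is only meant for ad hoc inspection.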
- Query Inspector: built-in panel tool to see the raw query, response time, and data
- Grafana Server Logs: grafana.log (by default under /var/log/grafana) or stdout in containers
- API Explorer: /api/ endpoints for programmatic inspection
- Provisioning debug: watch mode logs file-change detection events
- Alloy Debug UI: http://localhost:12345 shows a real-time pipeline graph and component health
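As one example of programmatic inspection, Grafana's /api/health endpoint returns a small JSON document that a probe can evaluate. The sketch below assumes the response carries a "database" field set to "ok" when the instance is healthy; the sample payloads are invented, and `is_healthy` is a hypothetical helper.

```python
import json


def is_healthy(body: str) -> bool:
    """Interpret an /api/health response body; healthy means the database check is ok."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        # A non-JSON body (for example, a proxy error page) counts as unhealthy.
        return False
    return isinstance(payload, dict) and payload.get("database") == "ok"


if __name__ == "__main__":
    # Placeholder payloads; the commit and version values are not real.
    print(is_healthy('{"commit": "abc123", "database": "ok", "version": "0.0.0"}'))
    print(is_healthy('{"database": "failing"}'))
    print(is_healthy("<html>502 Bad Gateway</html>"))
```

The body can be fetched with any HTTP client; keeping the interpretation separate from the request makes the check easy to unit-test.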