Operations

Deployment & Typical Setup

Single-Node (Simplest Production Path)

# VictoriaMetrics: single binary, metrics (a bare number means months, so 12 = 12 months)
./victoria-metrics -storageDataPath=/data/vm -retentionPeriod=12

# VictoriaLogs: single binary, logs
./victoria-logs -storageDataPath=/data/vl -retentionPeriod=30d

# VictoriaTraces: single binary, traces
./victoria-traces -storageDataPath=/data/vt

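Each binary starts an HTTP server (by default on port 8428 for VictoriaMetrics, 9428 for VictoriaLogs, and 10428 for VictoriaTraces) and is immediately ready to receive data. No configuration files are needed for basic usage.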
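
A quick way to confirm that each binary is up is to probe its HTTP port. The sketch below assumes all three processes run on localhost with their default listen ports and that each serves a /health endpoint; the check_health helper is a hypothetical name, and hosts and ports should be adjusted to your setup.

```shell
#!/bin/sh
# Probe the /health endpoint of one instance.
# Prints "<name> ok" or "<name> unreachable".
check_health() {
    name="$1"
    base_url="$2"
    if curl -fsS --max-time 2 "${base_url}/health" >/dev/null 2>&1; then
        echo "${name} ok"
    else
        echo "${name} unreachable"
    fi
}

# Assumed default ports; change them if you pass -httpListenAddr.
check_health victoria-metrics http://localhost:8428
check_health victoria-logs    http://localhost:9428
check_health victoria-traces  http://localhost:10428
```

Running it right after startup gives a one-line status per database, which is convenient in a systemd ExecStartPost hook or a container readiness script.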

Kubernetes (vmoperator)

The recommended production path uses the vmoperator with CRDs:

# Install the operator
helm repo add vm https://victoriametrics.github.io/helm-charts/
helm repo update
helm install vmoperator vm/victoria-metrics-operator -n monitoring --create-namespace

# Deploy cluster via CRD
kubectl apply -f - <<EOF
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMCluster
metadata:
  name: vm-cluster
spec:
  retentionPeriod: "12"
  replicationFactor: 2
  vminsert:
    replicaCount: 2
    resources:
      requests: { cpu: "500m", memory: "512Mi" }
  vmselect:
    replicaCount: 2
    resources:
      requests: { cpu: "500m", memory: "1Gi" }
  vmstorage:
    replicaCount: 3
    storageDataPath: /vm-data
    resources:
      requests: { cpu: "1", memory: "4Gi" }
    storage:
      volumeClaimTemplate:
        spec:
          resources:
            requests: { storage: 100Gi }
          storageClassName: fast-ssd
EOF
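
Because replicationFactor: 2 stores every sample on two vmstorage nodes, the raw disk requested in the manifest does not translate one-to-one into usable capacity. A rough back-of-the-envelope calculation with the manifest's values follows; the 20% free-space reserve is an assumption added for illustration, not something the manifest sets.

```shell
#!/bin/sh
storage_nodes=3            # vmstorage.replicaCount
disk_per_node_gib=100      # volumeClaimTemplate storage request
replication_factor=2       # spec.replicationFactor
free_reserve_percent=20    # assumption: headroom kept free for merges

raw_gib=$((storage_nodes * disk_per_node_gib))
# Each sample is written replication_factor times.
unique_gib=$((raw_gib / replication_factor))
usable_gib=$((unique_gib * (100 - free_reserve_percent) / 100))

echo "raw=${raw_gib}GiB unique=${unique_gib}GiB usable=${usable_gib}GiB"
```

With these inputs the 300 GiB of raw disk holds about 150 GiB of unique data, or roughly 120 GiB once the reserve is subtracted.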

Production Readiness Checklist

  • VictoriaMetrics deployed (single-node or cluster)
  • VictoriaLogs deployed for log aggregation
  • VictoriaTraces deployed for distributed tracing
  • vmauth configured as routing proxy (with auth)
  • vmagent deployed as DaemonSet for metric scraping
  • vmalert configured with recording rules and alerts
  • vmbackup scheduled for automated snapshots to S3/GCS
  • Grafana configured with Prometheus, VictoriaLogs (datasource plugin), and Jaeger data sources
  • Resource requests/limits set on all pods
  • SSD-backed storage for all stateful components
  • Monitoring the monitoring (self-scrape)

Configuration & Optimal Tuning

vmauth Routing Configuration

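This is the single most important config file: it routes traffic across all three databases. The example uses unauthorized_user, which serves requests that carry no credentials; for production, define users entries with tokens or passwords instead.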

# vmauth-config.yaml
unauthorized_user:
  url_map:
    # === METRICS ===
    - src_paths:
        - "/api/v1/write"
        - "/api/v1/import.*"
      url_prefix: "http://vminsert:8480/insert/0/prometheus"
    - src_paths:
        - "/api/v1/query.*"
        - "/api/v1/series.*"
        - "/api/v1/labels.*"
      url_prefix: "http://vmselect:8481/select/0/prometheus"

    # === LOGS ===
    - src_paths:
        - "/insert/jsonline.*"
        - "/insert/elasticsearch.*"
        - "/loki/api/v1/push"
      url_prefix: "http://victorialogs:9428"
    - src_paths:
        - "/select/logsql/.*"
      url_prefix: "http://victorialogs:9428"

    # === TRACES ===
    - src_paths:
        - "/insert/opentelemetry/.*"
      url_prefix: "http://victoriatraces:10428"
    - src_paths:
        - "/api/traces.*"
        - "/api/services.*"
      # Assumes VictoriaTraces serves its Jaeger-compatible query API under /select/jaeger
      url_prefix: "http://victoriatraces:10428/select/jaeger"
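
To put authentication in front of the same routes, vmauth accepts a users list in place of unauthorized_user. The following is a minimal sketch of a read-only client identified by a bearer token; the token value, the temporary file path, and the choice of routes are illustrative assumptions.

```shell
#!/bin/sh
# Hypothetical token; in practice load it from a secret store.
READ_TOKEN="${READ_TOKEN:-replace-with-a-long-random-token}"
cfg="${TMPDIR:-/tmp}/vmauth-auth-example.yaml"

cat > "$cfg" <<EOF
users:
  # Read-only client: may query metrics and logs, but has no write routes.
  - bearer_token: "${READ_TOKEN}"
    url_map:
      - src_paths:
          - "/api/v1/query.*"
          - "/api/v1/series.*"
          - "/api/v1/labels.*"
        url_prefix: "http://vmselect:8481/select/0/prometheus"
      - src_paths:
          - "/select/logsql/.*"
        url_prefix: "http://victorialogs:9428"
EOF
echo "wrote ${cfg}"

# Start vmauth with the file, then query through it (8427 is vmauth's default port):
#   ./vmauth -auth.config="${cfg}"
#   curl -H "Authorization: Bearer ${READ_TOKEN}" 'http://localhost:8427/api/v1/query?query=up'
```

Requests without the token, or to paths not listed under url_map, are rejected by vmauth instead of reaching the databases.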

Critical Tuning Flags

| Component | Flag | Purpose | Default |
| --- | --- | --- | --- |
| VictoriaMetrics | -retentionPeriod | Data retention duration | 1 month |
| vmstorage | -search.maxUniqueTimeseries | Prevent OOM on high-cardinality queries | 300,000 |
| vmstorage | -memory.allowedPercent | Percent of system memory that caches may occupy | 60% |
| vmselect | -search.maxQueryDuration | Max single query execution time | 30s |
| vminsert | -replicationFactor=N | Replicate data to N storage nodes | 1 |
| vmselect | -dedup.minScrapeInterval | Deduplicate data when RF > 1 (set the same value on vmstorage) | 0s |
| vmagent | -remoteWrite.label | Add global labels to all scraped metrics | none |
| VictoriaLogs | -retentionPeriod | Log retention | 7d |

Reliability & Scaling

Scaling Decision Matrix

| Symptom | Component to Scale | How |
| --- | --- | --- |
| Slow metric queries | vmselect | Add replicas |
| Write backpressure | vminsert | Add replicas |
| Disk full on metrics | vmstorage | Add nodes or increase disk |
| High RAM on storage | vmstorage | Lower -memory.allowedPercent to shrink caches, reduce cardinality |
| Slow log search | VictoriaLogs | Add CPU/RAM (single-node) or cluster |
| Log ingestion lag | VictoriaLogs | Increase resources or switch to cluster |

High Availability

| Mechanism | Implementation |
| --- | --- |
| Metrics replication | -replicationFactor=2 on vminsert + -dedup.minScrapeInterval on vmselect |
| Metrics availability | With RF=2, all data stays queryable if one vmstorage node fails; also pass -replicationFactor=2 to vmselect so responses are not marked partial in that case |
| Logs/Traces HA | Deploy cluster mode with vlinsert/vlstorage/vlselect |
| Proxy HA | Multiple vmauth replicas behind load balancer |
| Backup | vmbackup creates instant, consistent snapshots without locking the DB |
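
The replication row combines flags on three components. Here is a minimal sketch of how they might fit together, assuming three vmstorage nodes with the hypothetical host names vmstorage-0 to vmstorage-2, the default vmstorage ports for vminsert (8400) and vmselect (8401), and a 30s scrape interval. The script only prints the command lines; it starts nothing, and the binary names are placeholders.

```shell
#!/bin/sh
RF=2
DEDUP_INTERVAL=30s   # assumption: set this to your real scrape interval

# Hypothetical vmstorage hosts; 8400 accepts writes, 8401 serves queries.
INSERT_NODES="vmstorage-0:8400,vmstorage-1:8400,vmstorage-2:8400"
SELECT_NODES="vmstorage-0:8401,vmstorage-1:8401,vmstorage-2:8401"

vminsert_args="-storageNode=${INSERT_NODES} -replicationFactor=${RF}"
# Telling vmselect the replication factor keeps responses from being
# marked partial while fewer than RF vmstorage nodes are unavailable.
vmselect_args="-storageNode=${SELECT_NODES} -replicationFactor=${RF} -dedup.minScrapeInterval=${DEDUP_INTERVAL}"
vmstorage_args="-retentionPeriod=12 -dedup.minScrapeInterval=${DEDUP_INTERVAL}"

echo "./vminsert  ${vminsert_args}"
echo "./vmselect  ${vmselect_args}"
echo "./vmstorage ${vmstorage_args}"
```

Keeping the deduplication interval in one variable avoids the common mistake of configuring different values on vmselect and vmstorage.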

Cost

Cost Drivers

| Factor | Driver | Optimization |
| --- | --- | --- |
| Compute | Insert + select pods | Right-size, use spot nodes for vmselect |
| Storage | Data volume × retention | ZSTD compression shrinks data 2–7x; tune retention |
| Network | Internal cluster traffic | Co-locate in same AZ |
| No object storage | Local SSD only | Eliminates S3/GCS egress costs entirely |
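
The storage row (data volume × retention) can be turned into a rough disk estimate. The sketch below assumes 1M active series scraped every 30 seconds, one year of retention, replication factor 2, and an on-disk cost of 1 byte per sample after compression. That last figure is an illustrative assumption, not a guarantee, so measure it on your own data before budgeting.

```shell
#!/bin/sh
active_series=1000000
scrape_interval_s=30
retention_days=365
replication_factor=2
bytes_per_sample=1     # assumption: depends heavily on the data

# Samples ingested per day across all series.
samples_per_day=$((active_series * 86400 / scrape_interval_s))
bytes_total=$((samples_per_day * bytes_per_sample * retention_days * replication_factor))
gb_total=$((bytes_total / 1000000000))

echo "samples/day=${samples_per_day}"
echo "estimated disk for ${retention_days} days: ~${gb_total} GB"
```

Halving the retention or doubling the scrape interval halves the result, which makes these two settings the cheapest levers to pull before adding disks.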

Cost at Scale (Self-Hosted)

| Scale | Active Series | Logs per Day | Estimated Monthly Cost |
| --- | --- | --- | --- |
| Small | 100k | 10 GB | $100–300 |
| Medium | 1M | 100 GB | $500–1,500 |
| Large | 10M | 1 TB | $2,000–8,000 |
| Enterprise | 100M+ | 10 TB+ | $10,000–50,000 |

VictoriaMetrics Cloud Pricing

| Tier | Starting Cost | Includes |
| --- | --- | --- |
| Single-node | ~$225/mo | Up to 500k active series, 1-month retention |
| Cluster | ~$1,300/mo | Multi-tenancy, HA, advanced networking |

Security

Authentication & Authorization

  • The databases themselves do not implement RBAC natively.
  • Security relies strictly on vmauth, which acts as the gatekeeper and provides:
      ◦ Bearer token authentication
      ◦ Basic auth
      ◦ URL-based access control
      ◦ Header manipulation
      ◦ Enterprise: SSO integration in vmauth

Network Security Best Practices

  1. Never expose ingestion nodes to the internet — always put vmauth or NGINX in front
  2. Use Kubernetes NetworkPolicies to restrict pod-to-pod communication
  3. Only vmauth should be externally accessible
  4. Use mTLS between components in sensitive environments
  5. Cluster multi-tenancy: Data isolation via account IDs in URL paths (/insert/TENANT_ID/)
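
Tenant isolation in cluster mode comes down to which account ID appears in the URL path. A small sketch of the resulting endpoints follows, assuming the vminsert and vmselect host names and ports used in the vmauth routing config; the helper function names and tenant 42 are hypothetical.

```shell
#!/bin/sh
# Build the Prometheus remote-write endpoint for one tenant.
tenant_insert_url() {
    echo "http://vminsert:8480/insert/$1/prometheus/api/v1/write"
}

# Build the Prometheus query endpoint for one tenant.
tenant_select_url() {
    echo "http://vmselect:8481/select/$1/prometheus/api/v1/query"
}

tenant_insert_url 42
tenant_select_url 42
```

Since the tenant is only a path segment, nothing stops a client from using another tenant's ID if it can reach vminsert or vmselect directly. That is why these URLs belong behind vmauth, with each credential mapped to a fixed tenant prefix.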

Best Practices

Metrics

  1. Global Relabeling: Append datacenter/environment labels at the vmagent layer before data hits storage
  2. Drop high-cardinality labels: Use vmagent relabeling to drop labels like pod_ip, request_id before ingestion
  3. Recording rules: Precompute expensive MetricsQL expressions via vmalert
  4. Deduplication: With replication, always set -dedup.minScrapeInterval on vmselect

Logs

  1. Avoid Translation: Use native APIs whenever possible — point Fluent Bit directly to /insert/jsonline rather than going through an intermediary
  2. Structured logging: Use JSON logs to enable field extraction at query time
  3. Stream fields: Set _stream_fields on ingestion to logically group related log entries
  4. Retention per signal: Set different retention periods for logs (30d) vs metrics (12mo) vs traces (14d)
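
Points 1 to 3 can be seen in a single request against the native JSON-lines endpoint. The sketch assumes a VictoriaLogs instance on localhost:9428; the field names ts, host, app, level, and message are illustrative, and the script prints a notice instead of failing when nothing is listening.

```shell
#!/bin/sh
VL_URL="${VL_URL:-http://localhost:9428}"

# host and app form the stream identity; ts and message are mapped to
# the built-in _time and _msg fields.
ingest_url="${VL_URL}/insert/jsonline?_stream_fields=host,app&_time_field=ts&_msg_field=message"

# One JSON object per line, stamped with the current UTC time.
payload=$(printf '{"ts":"%s","host":"web-1","app":"checkout","level":"error","message":"payment failed"}' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)")

curl -sS --max-time 5 -X POST -H 'Content-Type: application/stream+json' \
    --data-binary "$payload" "$ingest_url" \
    || echo "VictoriaLogs not reachable at ${VL_URL}" >&2

# Query the last hour of that stream, filtered by a field:
#   curl "${VL_URL}/select/logsql/query" -d 'query=_time:1h {app="checkout"} level:error'
```

Choosing low-cardinality fields such as host and app for the stream keeps related entries together without creating a new stream per request.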

Operations

  1. Monitor with itself: Scrape VictoriaMetrics' own /metrics endpoint
  2. Use vmbackup regularly: Schedule daily incremental backups to S3
  3. Test upgrades on LTS: Use the LTS release line for production stability
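
A scheduled backup might look like the following sketch. It assumes a single-node instance on localhost:8428 with data in /data/vm, as in the single-node example; the bucket name and the backup_cmd helper are hypothetical. The script prints the vmbackup command lines and does not run them.

```shell
#!/bin/sh
# Hypothetical destination bucket.
BUCKET="s3://my-backups/victoria-metrics"

# Print the vmbackup invocation for a named backup.
# $1 = backup name, e.g. "latest" or a date.
backup_cmd() {
    echo "./vmbackup -storageDataPath=/data/vm -snapshot.createURL=http://localhost:8428/snapshot/create -dst=${BUCKET}/$1"
}

# Re-running against the same -dst uploads only new data.
backup_cmd latest
# A dated copy, e.g. for weekly retention.
backup_cmd "$(date +%Y-%m-%d)"

# Example crontab entry (runs daily at 02:00):
#   0 2 * * * /usr/local/bin/vm-backup.sh
```

Whichever schedule you choose, restore a backup into a scratch instance from time to time; an untested backup is only a hope.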

Common Issues & Playbook

| Symptom | Likely Cause | Fix |
| --- | --- | --- |
| High CPU on vmstorage during queries | Large time-window queries | Limit -search.maxQueryDuration, scale vmselect |
| OOM on vmstorage | High cardinality churn | Tune -memory.allowedPercent, drop unused labels at vmagent |
| "too many unique timeseries" | Query returns too many series | Increase -search.maxUniqueTimeseries or refine query |
| Slow VictoriaLogs queries | Large time range without filters | Add time restrictions (_time:1h), use specific filters |
| vmagent not discovering targets | VMServiceScrape/VMPodScrape objects (or converted ServiceMonitor/PodMonitor objects) not picked up | Verify vmoperator is running, check CRD labels |
| VictoriaTraces not receiving spans | OTLP gRPC not enabled | Explicitly enable gRPC port in config |
| Data gap after vmstorage restart | VictoriaMetrics has no WAL: an unclean shutdown can lose the last few seconds of in-memory data | Shut vmstorage down gracefully so buffers are flushed; use replication so other nodes accept writes during the restart |

Monitoring & Troubleshooting

Key Self-Monitoring Metrics

| Metric | What It Tells You |
| --- | --- |
| vm_rows_inserted_total | Ingestion throughput |
| vm_active_timeseries | Current cardinality |
| vm_slow_queries_total | Queries exceeding duration threshold |
| vm_cache_entries | Cache utilization |
| vm_data_size_bytes | On-disk data size |
| process_resident_memory_bytes | Actual RAM usage |
| vm_merge_duration_seconds | Background compaction health |
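
For a quick look without Grafana, the metrics listed here can be filtered straight out of the /metrics output. The sketch assumes a single-node instance on localhost:8428; filter_key_metrics is a hypothetical helper, and it matches by name prefix, so suffixed series such as _sum or _count are included.

```shell
#!/bin/sh
KEY_METRICS='vm_rows_inserted_total|vm_active_timeseries|vm_slow_queries_total|vm_cache_entries|vm_data_size_bytes|process_resident_memory_bytes|vm_merge_duration_seconds'

# Keep only exposition lines whose metric name starts with a key metric.
filter_key_metrics() {
    grep -E "^(${KEY_METRICS})"
}

curl -fsS --max-time 5 http://localhost:8428/metrics 2>/dev/null \
    | filter_key_metrics \
    || echo "no key metrics fetched from localhost:8428" >&2
```

The same filter works on a saved scrape, for example `filter_key_metrics < metrics.txt`, which helps when comparing snapshots taken before and after a change.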