Architecture

1. Default Topology / Flow

flowchart TB
    subgraph K8s["Kubernetes Cluster"]
        subgraph Agents["Data Plane (DaemonSet + Deployment)"]
            NA["coroot-node-agent\n(eBPF DaemonSet)\nPer node"]
            CA["coroot-cluster-agent\n(Deployment)\nDatabase discovery"]
        end

        subgraph Control["Control Plane"]
            OP["Coroot Operator\n(Lifecycle management)"]
            CR["Coroot CR\n(Custom Resource)"]
        end

        subgraph Server["Coroot Server (StatefulSet)"]
            direction TB
            InspEng["Inspection Engine\n18+ auto-inspections"]
            AIRCA["AI RCA Engine\n(pattern detection)"]
            SvcMap["Service Map Builder\n(eBPF topology)"]
            SLOEng["SLO Engine\n(error budget tracking)"]
            APIGW["API / Web UI\n(port 8080)"]
        end

        subgraph Storage["Storage (StatefulSet / External)"]
            Prom["Prometheus / VM / Mimir\n(metrics)"]
            CH["ClickHouse\n(logs, traces, profiles)"]
        end
    end

    subgraph External["Optional"]
        OTEL["OTel SDK\n(app-level traces)"]
        LLM["LLM API\n(Enterprise AI RCA)"]
    end

    NA -->|"metrics, traces,\nlogs, profiles"| Server
    CA -->|"DB metrics\n(pg_stat, INFO)"| Server
    OTEL -->|OTLP| Server
    Server -->|"remote_write /\nPromQL"| Prom
    Server -->|"clickhouse-native"| CH
    LLM -.->|"API"| AIRCA
    OP -->|"reconcile"| CR
    CR -->|"manages"| Agents
    CR -->|"manages"| Server
    CR -->|"manages"| Storage

    style Server fill:#1565c0,color:#fff
    style Agents fill:#2e7d32,color:#fff
    style Storage fill:#e65100,color:#fff

Detailed component breakdown, deployment topologies, and data flow diagrams for Coroot.

Component Architecture

18 Built-In Inspections

Coroot continuously runs its automated inspections on every discovered service. They fall into the following categories:

| Category | Inspections |
| --- | --- |
| SLOs | Availability SLO, Latency SLO |
| Instances | Pod restarts, unavailable replicas |
| CPU | CPU throttling, CPU usage near limits |
| GPU | GPU utilization, memory usage |
| Memory | OOM kills, memory near limits |
| Storage | Disk usage, I/O latency |
| Network | Connection errors, DNS failures, TCP retransmits |
| Logs | Error log rate spikes, warning patterns |
| Runtime | JVM heap/GC, .NET GC, Python GIL contention |
| Databases | Postgres, MySQL, MongoDB, Redis, Memcached health |
| Deployments | Rollout tracking, canary detection |

Deployment Topologies

Single-Cluster (Standard)

flowchart LR
    subgraph Cluster["K8s Cluster"]
        NA1["node-agent<br/>(node 1)"]
        NA2["node-agent<br/>(node 2)"]
        NAN["node-agent<br/>(node N)"]
        CA["cluster-agent"]
        CS["Coroot Server"]
        CH["ClickHouse<br/>(2 shards × 2 replicas)"]
        Prom["Prometheus / VM"]
    end

    NA1 --> CS
    NA2 --> CS
    NAN --> CS
    CA --> CS
    CS --> CH
    CS --> Prom

Multi-Cluster (Hub and Spoke)

flowchart TB
    subgraph Central["Central Cluster"]
        CS["Coroot Server\n(full install)"]
        CH["ClickHouse"]
        Prom["Prometheus / VM"]
    end

    subgraph Remote1["Remote Cluster 1"]
        NA_R1["node-agents"]
        CA_R1["cluster-agent"]
    end

    subgraph Remote2["Remote Cluster 2"]
        NA_R2["node-agents"]
        CA_R2["cluster-agent"]
    end

    NA_R1 -->|"agentsOnly=true"| CS
    CA_R1 --> CS
    NA_R2 -->|"agentsOnly=true"| CS
    CA_R2 --> CS
    CS --> CH
    CS --> Prom

    style Central fill:#1565c0,color:#fff

Sequence: Incident Detection → RCA

sequenceDiagram
    participant App as Application
    participant Kernel as Linux Kernel
    participant Agent as node-agent (eBPF)
    participant Server as Coroot Server
    participant Insp as Inspection Engine
    participant RCA as AI RCA
    participant Alert as Alert Channel

    Kernel->>Agent: eBPF events (TCP, DNS, disk)
    Agent->>Server: Metrics + traces + logs
    Server->>Insp: Run 18 inspection categories
    Insp->>Insp: SLO breach detected
    Insp->>RCA: Trigger root cause analysis
    RCA->>RCA: Walk dependency graph
    RCA->>RCA: Correlate metrics ↔ traces ↔ logs
    RCA->>RCA: Rank root causes
    RCA->>Alert: Send alert with RCA summary
    Note over Alert: Slack / PagerDuty / Webhook

Data Model

1. Default Topology / Flow

erDiagram
    Coroot_CORE ||--o{ CONFIG : requires
    Coroot_CORE ||--o{ STATE : writes
    CONFIG {
        string runtime_params
        string limits
    }
    STATE {
        string metric_id
        json payload
    }

How It Works

How Coroot uses eBPF for zero-instrumentation data collection, automated service discovery, and AI-powered root cause analysis.

Data Collection Pipeline

eBPF-Based Auto-Instrumentation

Coroot's core differentiator is kernel-level telemetry collection via eBPF (extended Berkeley Packet Filter). The coroot-node-agent runs as a DaemonSet on every Kubernetes node and attaches eBPF programs to kernel tracepoints and kprobes:

flowchart LR
    subgraph Kernel["Linux Kernel (4.16+)"]
        TP["Tracepoints"]
        KP["kprobes/kretprobes"]
        TC["Traffic Control (tc)"]
    end

    subgraph Agent["coroot-node-agent"]
        eBPF["eBPF Programs"]
        Perf["Perf Buffer"]
        Agg["Userspace Aggregation"]
    end

    TP --> eBPF
    KP --> eBPF
    TC --> eBPF
    eBPF --> Perf --> Agg
    Agg -->|OTLP / Prom RW| Server["Coroot Server"]
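
The userspace aggregation stage in the diagram above can be pictured as folding raw per-event records into per-connection counters. The following is a minimal sketch under stated assumptions: `TcpEvent`, `ConnStats`, and `aggregate` are hypothetical names chosen for illustration, and the event fields are not taken from the coroot-node-agent source.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TcpEvent:
    """Hypothetical event, as it might be read from the perf buffer."""
    src: str            # "ip:port" of the client side
    dst: str            # "ip:port" of the server side
    bytes_sent: int
    duration_ms: float
    failed: bool


@dataclass
class ConnStats:
    """Running counters for one (src, dst) pair."""
    requests: int = 0
    errors: int = 0
    bytes_sent: int = 0
    total_ms: float = 0.0

    @property
    def avg_latency_ms(self) -> float:
        return self.total_ms / self.requests if self.requests else 0.0


def aggregate(events):
    """Fold raw events into per-(src, dst) counters."""
    stats = defaultdict(ConnStats)
    for ev in events:
        s = stats[(ev.src, ev.dst)]
        s.requests += 1
        s.errors += int(ev.failed)
        s.bytes_sent += ev.bytes_sent
        s.total_ms += ev.duration_ms
    return dict(stats)


# Synthetic events for illustration only.
events = [
    TcpEvent("10.0.0.1:40000", "10.0.0.2:5432", 120, 4.0, False),
    TcpEvent("10.0.0.1:40000", "10.0.0.2:5432", 80, 6.0, True),
]
stats = aggregate(events)
```

Aggregating before export keeps the volume sent to the server proportional to the number of connections rather than the number of kernel events.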

What eBPF Captures (Without Code Changes)

| Signal | Kernel Attachment Point | Data Collected |
| --- | --- | --- |
| Network metrics | tcp_sendmsg, tcp_recvmsg, tcp_connect | Latency, throughput, error rates per connection |
| HTTP/gRPC traces | Socket read/write | Request method, path, status code, duration |
| DNS | UDP socket | Resolution time, failures |
| Disk I/O | blk_mq_start_request | IOPS, latency, bandwidth per container |
| CPU profiling | perf_event_open | On-CPU flame graphs per process |
| Memory profiling | Allocation tracepoints | Heap allocation patterns |
| Container lifecycle | cgroup events | Start/stop times, resource limits |
| Log collection | Container stdout/stderr | Application log lines |

Cluster Agent Discovery

The coroot-cluster-agent complements eBPF data by connecting directly to databases:

| Database | Discovery Method | Metrics Collected |
| --- | --- | --- |
| PostgreSQL | SQL queries via pg_stat_* | Active connections, query latency, replication lag |
| MySQL | SHOW STATUS / information_schema | Thread count, slow queries, buffer pool hit rate |
| Redis | INFO command | Memory usage, connected clients, hit rate |
| MongoDB | serverStatus command | Operations/sec, document counts, lock percentages |
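
To make the Redis row concrete, the sketch below parses the text of an `INFO` reply and derives a hit rate from the `keyspace_hits` and `keyspace_misses` fields. The field names are standard Redis `INFO` fields. The helpers `parse_redis_info` and `hit_rate` are hypothetical, and it is an assumption that the cluster agent computes the hit rate this way.

```python
def parse_redis_info(text: str) -> dict:
    """Parse the key:value lines of a Redis INFO reply, skipping section headers."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and "# Section" headers carry no data
        key, _, value = line.partition(":")
        fields[key] = value
    return fields


def hit_rate(fields: dict) -> float:
    """Cache hit ratio: hits / (hits + misses), or 0.0 when there were no lookups."""
    hits = int(fields.get("keyspace_hits", 0))
    misses = int(fields.get("keyspace_misses", 0))
    total = hits + misses
    return hits / total if total else 0.0


# Synthetic, abbreviated INFO output for illustration only.
sample = """# Stats
keyspace_hits:900
keyspace_misses:100
# Clients
connected_clients:12
# Memory
used_memory:1048576
"""
info = parse_redis_info(sample)
```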

Service Map Generation

Coroot automatically builds a real-time service dependency graph by correlating eBPF network traces:

  1. Connection tracking: eBPF programs track every TCP connection (source IP:port ↔ dest IP:port)
  2. Container resolution: IP addresses are mapped to Kubernetes pods via the container runtime
  3. Service grouping: Pods are grouped by Deployment/StatefulSet/DaemonSet
  4. Protocol detection: L7 protocol (HTTP, gRPC, MySQL, PostgreSQL, Redis, Kafka, etc.) is identified from payload patterns
  5. Dependency graph: Directed edges between services are weighted by request rate, latency, and error rate
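
The five steps above can be sketched as a single pass over observed connections. Everything here is illustrative: the `POD_BY_IP` lookup table stands in for container-runtime resolution, `detect_protocol` recognizes only two protocols by their leading bytes, and none of the names come from Coroot itself.

```python
from collections import defaultdict

# Hypothetical lookup: pod IP -> (pod name, owning workload)
POD_BY_IP = {
    "10.0.0.1": ("web-7c9f-abc", "Deployment/web"),
    "10.0.0.2": ("web-7c9f-def", "Deployment/web"),
    "10.0.1.5": ("pg-0", "StatefulSet/pg"),
}

HTTP_METHODS = (b"GET ", b"POST ", b"PUT ", b"DELETE ", b"HEAD ", b"PATCH ")


def detect_protocol(payload: bytes) -> str:
    """Very rough L7 guess from the first bytes of a payload."""
    if payload.startswith(HTTP_METHODS):
        return "http"
    if payload[:1] == b"*" and payload.endswith(b"\r\n"):
        return "redis"  # RESP arrays begin with '*'
    return "unknown"


def build_service_map(connections):
    """connections: iterable of (src_ip, dst_ip, payload, latency_ms, failed)."""
    edges = defaultdict(
        lambda: {"requests": 0, "errors": 0, "total_ms": 0.0, "protocols": set()}
    )
    for src_ip, dst_ip, payload, latency_ms, failed in connections:
        src = POD_BY_IP.get(src_ip)
        dst = POD_BY_IP.get(dst_ip)
        if src is None or dst is None:
            continue  # unresolved endpoints are skipped in this sketch
        edge = edges[(src[1], dst[1])]  # group pods by owning workload
        edge["requests"] += 1
        edge["errors"] += int(failed)
        edge["total_ms"] += latency_ms
        edge["protocols"].add(detect_protocol(payload))
    return dict(edges)


# Synthetic connections for illustration only.
conns = [
    ("10.0.0.1", "10.0.1.5", b"\x00\x00\x00\x08", 3.0, False),
    ("10.0.0.2", "10.0.1.5", b"\x00\x00\x00\x08", 5.0, True),
    ("10.9.9.9", "10.0.0.1", b"GET / HTTP/1.1\r\n", 1.0, False),
]
service_map = build_service_map(conns)
```

Both `web` pods collapse into one `Deployment/web` node, so the map carries a single edge to `StatefulSet/pg` with the combined request, error, and latency totals.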

AI-Powered Root Cause Analysis

When an SLO violation or anomaly is detected, Coroot's AI RCA engine automatically:

  1. Identifies the impacted service from SLO breach alerts
  2. Walks the dependency graph upstream and downstream
  3. Correlates signals across metrics, traces, logs, and profiles for each service in the path
  4. Ranks root causes using statistical anomaly detection (e.g., sudden CPU spike, disk saturation, memory leak, new deployment)
  5. Generates remediation suggestions (e.g., "Service X shows a 95th-percentile latency spike correlated with disk I/O saturation on node Y; consider increasing PVC size or migrating to an SSD-backed storage class")
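
As a rough illustration of steps 2 to 4, the sketch below walks a dependency graph from the impacted service and ranks each service's signals by z-score against recent history. This is an assumption-laden simplification: it only walks downstream dependencies, the z-score stands in for whatever anomaly detection Coroot actually uses, and `DEPENDS_ON`, `downstream`, `z_score`, and `rank_root_causes` are hypothetical names.

```python
from statistics import mean, pstdev

# Hypothetical dependency graph: service -> services it calls
DEPENDS_ON = {
    "frontend": ["cart", "catalog"],
    "cart": ["redis"],
    "catalog": ["postgres"],
    "redis": [],
    "postgres": [],
}


def downstream(service, graph):
    """Return the service and everything reachable from it (depth-first)."""
    seen, stack = [], [service]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.append(node)
        stack.extend(graph.get(node, []))
    return seen


def z_score(history, current):
    """How many standard deviations `current` sits from the historical mean."""
    sigma = pstdev(history)
    return 0.0 if sigma == 0 else (current - mean(history)) / sigma


def rank_root_causes(service, graph, signals):
    """signals: {service: {metric: (history, current)}}; most anomalous first."""
    candidates = []
    for svc in downstream(service, graph):
        for metric, (history, current) in signals.get(svc, {}).items():
            candidates.append((abs(z_score(history, current)), svc, metric))
    return sorted(candidates, reverse=True)


# Synthetic signals for illustration only.
signals = {
    "frontend": {"latency_ms": ([100, 110, 90, 100], 180)},
    "postgres": {"disk_io_wait_pct": ([2, 3, 2, 3], 40)},
    "redis": {"cpu_pct": ([10, 12, 11, 11], 12)},
}
ranking = rank_root_causes("frontend", DEPENDS_ON, signals)
```

With this data the disk I/O wait on `postgres` deviates furthest from its history, so it ranks above the latency symptom on `frontend` itself.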

AI RCA Integration (Enterprise)

The Enterprise edition integrates with LLM APIs to provide natural-language explanations of incidents, parse log patterns for error classification, and suggest specific remediations based on historical incident patterns.

SLO Monitoring

Coroot provides built-in SLO tracking based on RED metrics (Rate, Errors, Duration):

  • Automatically calculates availability and latency SLOs per service
  • Tracks error budgets in real time
  • Fires alerts when the burn rate exceeds thresholds
  • Requires no manual SLO configuration: SLOs are derived automatically from eBPF data
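
The error-budget and burn-rate terms above follow the usual SLO arithmetic: the budget is the fraction of requests allowed to fail, and the burn rate is the observed error ratio divided by that budget. The sketch below shows the calculation; the function names are hypothetical and the request counts are made up for illustration, not Coroot defaults.

```python
def error_budget(slo_target: float) -> float:
    """Fraction of requests allowed to fail, e.g. 0.01 for a 99% SLO."""
    return 1.0 - slo_target


def burn_rate(failed: int, total: int, slo_target: float) -> float:
    """Observed error ratio divided by the error budget; 1.0 means exactly on budget."""
    if total == 0:
        return 0.0
    return (failed / total) / error_budget(slo_target)


def budget_remaining(failed: int, total: int, slo_target: float) -> float:
    """Share of the error budget left for the window (negative once overspent)."""
    allowed = total * error_budget(slo_target)
    return 1.0 - failed / allowed if allowed else 0.0


# Example: a 99% availability SLO over a window of 10,000 requests.
rate = burn_rate(failed=250, total=10_000, slo_target=0.99)
left = budget_remaining(failed=50, total=10_000, slo_target=0.99)
```

A burn rate above 1.0 means the service is consuming its budget faster than the SLO allows, which is the condition a burn-rate alert watches for.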

Data Flow Summary

sequenceDiagram
    participant App as Application
    participant Kernel as Linux Kernel
    participant NA as coroot-node-agent
    participant CA as coroot-cluster-agent
    participant Server as Coroot Server
    participant Prom as Prometheus / VM
    participant CH as ClickHouse

    App->>Kernel: syscalls (normal operation)
    Kernel->>NA: eBPF events (TCP, DNS, disk)
    NA->>Server: Metrics (Prometheus format)
    NA->>Server: Traces (OTLP)
    NA->>Server: Logs (container stdout)
    NA->>Server: Profiles (pprof)
    CA->>Server: DB metrics (SQL/INFO)
    Server->>Prom: Store metrics
    Server->>CH: Store logs, traces, profiles
    Server->>Server: Build service map
    Server->>Server: Run inspections & AI RCA

Benchmarks

Performance overhead, resource consumption, and scale limits for Coroot's eBPF-based observability.

eBPF Agent Overhead

Test Conditions

| Parameter | Value |
| --- | --- |
| Workload | Go HTTP server (baseline) |
| Load | 10,000 requests per second (RPS) |
| Agent | coroot-node-agent (eBPF) |
| Methodology | Latency comparison: baseline (no agent) vs agent enabled |

Results

| Metric | Without Agent | With Agent | Impact |
| --- | --- | --- | --- |
| Request latency | Baseline | Within margin of error | Negligible |
| CPU consumption | | ~200 millicores | ~20% of 1 CPU core |
| Latency impact | | | Within measurement error |

Key finding: At 10,000 RPS, the latency difference between running with coroot-node-agent enabled and the baseline falls within the margin of measurement error. The kernel's eBPF verifier checks each program before it is loaded and rejects any that could loop indefinitely or access memory unsafely, which bounds the work a program can do in kernel context.
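
For readers reproducing this comparison on their own workloads, one simple way to decide whether a latency difference is "within the margin of error" is to compare the difference in means against a roughly 95% interval built from the standard error. The sketch below uses a normal approximation. The sample latencies are synthetic, and `within_margin` is a hypothetical helper, not the vendor's methodology.

```python
from math import sqrt
from statistics import mean, variance


def within_margin(baseline, with_agent, z=1.96):
    """True if the difference in mean latency is inside the ~95% margin of error."""
    diff = mean(with_agent) - mean(baseline)
    # Standard error of the difference between two independent sample means.
    std_err = sqrt(
        variance(baseline) / len(baseline) + variance(with_agent) / len(with_agent)
    )
    return abs(diff) <= z * std_err


# Synthetic latency samples in milliseconds, for illustration only.
baseline_ms = [1.00, 1.02, 0.98, 1.01, 0.99, 1.00, 1.03, 0.97]
agent_ms = [1.01, 1.02, 0.99, 1.00, 1.00, 1.01, 1.02, 0.98]
negligible = within_margin(baseline_ms, agent_ms)
```

With only a handful of samples a t-distribution would be more appropriate than the fixed `z=1.96`; real runs should collect far more measurements than shown here.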

CPU Profiler Overhead

| Component | Overhead | Notes |
| --- | --- | --- |
| eBPF CPU profiler | 1–3% | Based on Grafana Pyroscope implementation |
| JVM with -XX:+PreserveFramePointer | 1–3% | Required for accurate JVM stack traces |

Resource Consumption at Scale

Coroot Server

| Cluster Size | Recommended CPU | Recommended RAM | Notes |
| --- | --- | --- | --- |
| < 50 services | 1 vCPU | 2 GB | Single-node sufficient |
| 50–200 services | 2 vCPU | 4 GB | Inspection engine overhead |
| 200+ services | 4+ vCPU | 8+ GB | Service map complexity |
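
The sizing tiers above can be encoded as a small lookup, for example in a provisioning script. This is a sketch only: `recommend_server_resources` is a hypothetical helper, and because the table lists 200 services in both the middle and top tiers, placing exactly 200 in the middle tier is an assumption.

```python
def recommend_server_resources(services: int) -> dict:
    """Map a service count to the Coroot Server sizing tiers listed above."""
    if services < 0:
        raise ValueError("service count cannot be negative")
    if services < 50:
        return {"cpu": "1 vCPU", "ram": "2 GB"}
    if services <= 200:  # assumption: exactly 200 falls in the middle tier
        return {"cpu": "2 vCPU", "ram": "4 GB"}
    return {"cpu": "4+ vCPU", "ram": "8+ GB"}


small = recommend_server_resources(30)
large = recommend_server_resources(450)
```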

Node Agent

| Per Node | CPU | RAM |
| --- | --- | --- |
| Base overhead | ~50 millicores | ~50 MB |
| Under load (10K RPS) | ~200 millicores | ~100 MB |
| Heavy profiling | ~300 millicores | ~150 MB |
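
Because the node agent runs on every node, its cluster-wide cost is the per-node figure multiplied by the node count. The sketch below does that arithmetic with the approximate values from the table above; `NODE_AGENT_PROFILES` and `fleet_overhead` are hypothetical names, and treating 1 GB as 1000 MB is a simplifying assumption.

```python
# Approximate per-node figures from the table above: (millicores, MB of RAM)
NODE_AGENT_PROFILES = {
    "base": (50, 50),
    "under_load": (200, 100),
    "heavy_profiling": (300, 150),
}


def fleet_overhead(nodes: int, profile: str) -> tuple[float, float]:
    """Return (CPU cores, RAM in GB) used by the DaemonSet across `nodes` nodes."""
    millicores, mem_mb = NODE_AGENT_PROFILES[profile]
    return nodes * millicores / 1000, nodes * mem_mb / 1000


# Example: a 200-node cluster with every node at roughly 10K RPS.
cores, ram_gb = fleet_overhead(200, "under_load")
```

For the 200-node example this comes to about 40 cores and 20 GB of RAM spread across the cluster, which is worth budgeting for even though the per-node cost is small.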

Scale Limits

| Dimension | Practical Limit | Bottleneck |
| --- | --- | --- |
| Services per cluster | 500+ | Server CPU for service map |
| Nodes per cluster | 200+ | Node agent DaemonSet scaling |
| Multi-cluster | 10+ clusters | Network bandwidth to central server |
| Metrics cardinality | Backend dependent | Prometheus/VM limits apply |
| Trace throughput | Backend dependent | ClickHouse write capacity |

Storage Backend Requirements

| Backend | Scenario | Resources |
| --- | --- | --- |
| Prometheus | < 1M series | 2 CPU, 8 GB RAM, 100 GB SSD |
| VictoriaMetrics | 1–10M series | 2 CPU, 4 GB RAM, 200 GB SSD |
| ClickHouse | 100 GB/day logs + traces | 4 CPU, 16 GB RAM, 500 GB SSD |

Caveats

  • Benchmarks are from vendor-provided testing. Users with specialized workloads should conduct their own validation.
  • eBPF overhead can vary based on kernel version, workload characteristics, and enabled collection features.
  • Disabling span capture while maintaining eBPF metrics can further reduce resource consumption in extremely high-load environments.
