Architecture¶

1. Default Topology / Flow¶

flowchart TB
    subgraph K8s["Kubernetes Cluster"]
        subgraph Agents["Data Plane (DaemonSet + Deployment)"]
            NA["coroot-node-agent\n(eBPF DaemonSet)\nPer node"]
            CA["coroot-cluster-agent\n(Deployment)\nDatabase discovery"]
        end

        subgraph Control["Control Plane"]
            OP["Coroot Operator\n(Lifecycle management)"]
            CR["Coroot CR\n(Custom Resource)"]
        end

        subgraph Server["Coroot Server (StatefulSet)"]
            direction TB
            InspEng["Inspection Engine\n18+ auto-inspections"]
            AIRCA["AI RCA Engine\n(pattern detection)"]
            SvcMap["Service Map Builder\n(eBPF topology)"]
            SLOEng["SLO Engine\n(error budget tracking)"]
            APIGW["API / Web UI\n(port 8080)"]
        end

        subgraph Storage["Storage (StatefulSet / External)"]
            Prom["Prometheus / VM / Mimir\n(metrics)"]
            CH["ClickHouse\n(logs, traces, profiles)"]
        end
    end

    subgraph External["Optional"]
        OTEL["OTel SDK\n(app-level traces)"]
        LLM["LLM API\n(Enterprise AI RCA)"]
    end

    NA -->|"metrics, traces,\nlogs, profiles"| Server
    CA -->|"DB metrics\n(pg_stat, INFO)"| Server
    OTEL -->|OTLP| Server
    Server -->|"remote_write /\nPromQL"| Prom
    Server -->|"clickhouse-native"| CH
    AIRCA -.->|"API"| LLM
    OP -->|"reconcile"| CR
    CR -->|"manages"| Agents
    CR -->|"manages"| Server
    CR -->|"manages"| Storage

    style Server fill:#1565c0,color:#fff
    style Agents fill:#2e7d32,color:#fff
    style Storage fill:#e65100,color:#fff

Detailed component breakdown, deployment topologies, and data flow diagrams for Coroot.

Component Architecture¶

18 Built-In Inspections¶

Coroot continuously runs 18+ automated inspections against every discovered service. They are grouped into the following categories:

Category	Inspections
SLOs	Availability SLO, Latency SLO
Instances	Pod restarts, unavailable replicas
CPU	CPU throttling, CPU usage near limits
GPU	GPU utilization, memory usage
Memory	OOM kills, memory near limits
Storage	Disk usage, I/O latency
Network	Connection errors, DNS failures, TCP retransmits
Logs	Error log rate spikes, warning patterns
Runtime	JVM heap/GC, .NET GC, Python GIL contention
Databases	Postgres, MySQL, MongoDB, Redis, Memcached health
Deployments	Rollout tracking, canary detection

Deployment Topologies¶

Single-Cluster (Standard)¶

flowchart LR
    subgraph Cluster["K8s Cluster"]
        NA1["node-agent<br/>(node 1)"]
        NA2["node-agent<br/>(node 2)"]
        NAN["node-agent<br/>(node N)"]
        CA["cluster-agent"]
        CS["Coroot Server"]
        CH["ClickHouse<br/>(2 shards × 2 replicas)"]
        Prom["Prometheus / VM"]
    end

    NA1 --> CS
    NA2 --> CS
    NAN --> CS
    CA --> CS
    CS --> CH
    CS --> Prom

Multi-Cluster (Hub and Spoke)¶

flowchart TB
    subgraph Central["Central Cluster"]
        CS["Coroot Server\n(full install)"]
        CH["ClickHouse"]
        Prom["Prometheus / VM"]
    end

    subgraph Remote1["Remote Cluster 1"]
        NA_R1["node-agents"]
        CA_R1["cluster-agent"]
    end

    subgraph Remote2["Remote Cluster 2"]
        NA_R2["node-agents"]
        CA_R2["cluster-agent"]
    end

    NA_R1 -->|"agentsOnly=true"| CS
    CA_R1 --> CS
    NA_R2 -->|"agentsOnly=true"| CS
    CA_R2 --> CS
    CS --> CH
    CS --> Prom

    style Central fill:#1565c0,color:#fff

Sequence: Incident Detection → RCA¶

sequenceDiagram
    participant App as Application
    participant Kernel as Linux Kernel
    participant Agent as node-agent (eBPF)
    participant Server as Coroot Server
    participant Insp as Inspection Engine
    participant RCA as AI RCA
    participant Alert as Alert Channel

    Kernel->>Agent: eBPF events (TCP, DNS, disk)
    Agent->>Server: Metrics + traces + logs
    Server->>Insp: Run automated inspections
    Insp->>Insp: SLO breach detected
    Insp->>RCA: Trigger root cause analysis
    RCA->>RCA: Walk dependency graph
    RCA->>RCA: Correlate metrics ↔ traces ↔ logs
    RCA->>RCA: Rank root causes
    RCA->>Alert: Send alert with RCA summary
    Note over Alert: Slack / PagerDuty / Webhook

Data Model¶

1. Default Topology / Flow¶

erDiagram
    Coroot_CORE ||--o{ CONFIG : requires
    Coroot_CORE ||--o{ STATE : writes
    CONFIG {
        string runtime_params
        string limits
    }
    STATE {
        string metric_id
        json payload
    }

How It Works¶

How Coroot uses eBPF for zero-instrumentation data collection, automated service discovery, and AI-powered root cause analysis.

Data Collection Pipeline¶

eBPF-Based Auto-Instrumentation¶

Coroot's core differentiator is kernel-level telemetry collection via eBPF (extended Berkeley Packet Filter). The coroot-node-agent runs as a DaemonSet on every Kubernetes node and attaches eBPF programs to kernel tracepoints and kprobes:

flowchart LR
    subgraph Kernel["Linux Kernel (4.16+)"]
        TP["Tracepoints"]
        KP["kprobes/kretprobes"]
        TC["Traffic Control (tc)"]
    end

    subgraph Agent["coroot-node-agent"]
        eBPF["eBPF Programs"]
        Perf["Perf Buffer"]
        Agg["Userspace Aggregation"]
    end

    TP --> eBPF
    KP --> eBPF
    TC --> eBPF
    eBPF --> Perf --> Agg
    Agg -->|OTLP / Prom RW| Server["Coroot Server"]

What eBPF Captures (Without Code Changes)¶

Signal	Kernel Attachment Point	Data Collected
Network metrics	`tcp_sendmsg`, `tcp_recvmsg`, `tcp_connect`	Latency, throughput, error rates per connection
HTTP/gRPC traces	Socket read/write	Request method, path, status code, duration
DNS	UDP socket	Resolution time, failures
Disk I/O	`blk_mq_start_request`	IOPS, latency, bandwidth per container
CPU profiling	`perf_event_open`	On-CPU flame graphs per process
Memory profiling	Allocation tracepoints	Heap allocation patterns
Container lifecycle	cgroup events	Start/stop times, resource limits
Log collection	Container stdout/stderr	Application log lines

Cluster Agent Discovery¶

The coroot-cluster-agent complements eBPF data by connecting directly to databases:

Database	Discovery Method	Metrics Collected
PostgreSQL	SQL queries via `pg_stat_*`	Active connections, query latency, replication lag
MySQL	`SHOW STATUS` / `information_schema`	Thread count, slow queries, buffer pool hit rate
Redis	`INFO` command	Memory usage, connected clients, hit rate
MongoDB	`serverStatus` command	Operations/sec, document counts, lock percentages
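
As an illustration of how a metric such as the Redis hit rate can be derived from an `INFO` reply, the sketch below parses a sample reply and computes hits / (hits + misses). `keyspace_hits` and `keyspace_misses` are standard Redis `INFO` fields. The sample values and the helper names `parse_redis_info` and `hit_rate` are made up for this example, and this is not necessarily how coroot-cluster-agent computes the metric.

```python
from __future__ import annotations


def parse_redis_info(raw: str) -> dict[str, str]:
    """Parse the text reply of the Redis INFO command into a dict."""
    fields = {}
    for line in raw.splitlines():
        line = line.strip()
        # Skip blank lines and section headers such as "# Stats".
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        fields[key] = value
    return fields


def hit_rate(fields: dict[str, str]) -> float | None:
    """Return the keyspace hit rate in [0, 1], or None if there were no lookups."""
    hits = int(fields.get("keyspace_hits", 0))
    misses = int(fields.get("keyspace_misses", 0))
    total = hits + misses
    return hits / total if total else None


# Illustrative sample only (assumed values); real replies contain many more fields.
SAMPLE_INFO = "# Stats\r\nkeyspace_hits:900\r\nkeyspace_misses:100\r\n"

print(hit_rate(parse_redis_info(SAMPLE_INFO)))  # prints 0.9
```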

Service Map Generation¶

Coroot automatically builds a real-time service dependency graph by correlating eBPF network traces:

Connection tracking: eBPF programs track every TCP connection (source IP:port ↔ dest IP:port)
Container resolution: IP addresses are mapped to Kubernetes pods via the container runtime
Service grouping: Pods are grouped by Deployment/StatefulSet/DaemonSet
Protocol detection: L7 protocol (HTTP, gRPC, MySQL, PostgreSQL, Redis, Kafka, etc.) is identified from payload patterns
Dependency graph: Directed edges between services are weighted by request rate, latency, and error rate
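
The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Coroot's implementation: the `Connection` record, the two lookup tables, and the function names are hypothetical, protocol detection is omitted, and edges are weighted only by request and error counts.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Connection:
    """One tracked TCP connection with aggregated counters (hypothetical shape)."""
    src_ip: str
    dst_ip: str
    requests: int
    errors: int


def resolve(ip, ip_to_pod, pod_to_workload):
    """Map an IP to its owning workload; unknown IPs stay as external endpoints."""
    pod = ip_to_pod.get(ip)
    if pod is None:
        return f"external:{ip}"
    return pod_to_workload.get(pod, pod)


def build_service_map(connections, ip_to_pod, pod_to_workload):
    """Aggregate per-connection stats into directed workload-to-workload edges."""
    edges = defaultdict(lambda: {"requests": 0, "errors": 0})
    for conn in connections:
        src = resolve(conn.src_ip, ip_to_pod, pod_to_workload)
        dst = resolve(conn.dst_ip, ip_to_pod, pod_to_workload)
        edges[(src, dst)]["requests"] += conn.requests
        edges[(src, dst)]["errors"] += conn.errors
    return dict(edges)


# Assumed example data: two web pods calling one database pod and one external IP.
ip_to_pod = {"10.0.0.1": "web-abc", "10.0.0.2": "web-def", "10.0.1.1": "db-0"}
pod_to_workload = {
    "web-abc": "Deployment/web",
    "web-def": "Deployment/web",
    "db-0": "StatefulSet/db",
}
connections = [
    Connection("10.0.0.1", "10.0.1.1", requests=100, errors=1),
    Connection("10.0.0.2", "10.0.1.1", requests=50, errors=0),
    Connection("10.0.0.1", "203.0.113.7", requests=10, errors=2),
]

service_map = build_service_map(connections, ip_to_pod, pod_to_workload)
for (src, dst), stats in service_map.items():
    print(f"{src} -> {dst}: {stats}")
```

Both web pods collapse into a single `Deployment/web` node, which is what turns a per-connection view into a per-service graph.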

AI-Powered Root Cause Analysis¶

When an SLO violation or anomaly is detected, Coroot's AI RCA engine automatically:

Identifies the impacted service from SLO breach alerts
Walks the dependency graph upstream and downstream
Correlates signals across metrics, traces, logs, and profiles for each service in the path
Ranks root causes using statistical anomaly detection (e.g., sudden CPU spike, disk saturation, memory leak, new deployment)
Generates remediation suggestions (e.g., "Service X shows 95th percentile latency spike correlated with disk I/O saturation on node Y — consider increasing PVC size or migrating to SSD-backed storage class")
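
To make the graph walk and ranking steps concrete, here is a simplified sketch. It is not Coroot's actual algorithm: it assumes a plain z-score as the anomaly measure, walks only downstream dependencies for brevity, and uses hypothetical service names, metric names, and sample values.

```python
from __future__ import annotations

from collections import deque
from statistics import mean, pstdev


def reachable(graph, start):
    """Breadth-first walk over downstream dependencies of the impacted service."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for dep in graph.get(node, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen


def z_score(baseline, current):
    """Number of standard deviations `current` lies from the baseline mean."""
    sigma = pstdev(baseline)
    if sigma == 0:
        return 0.0  # simplification: a flat baseline yields no score
    return abs(current - mean(baseline)) / sigma


def rank_candidates(graph, impacted, signals):
    """Return (service, metric, score) tuples, most anomalous first."""
    scored = []
    for service in reachable(graph, impacted):
        for metric, (baseline, current) in signals.get(service, {}).items():
            scored.append((service, metric, z_score(baseline, current)))
    return sorted(scored, key=lambda item: item[2], reverse=True)


# Assumed example: checkout breaches its SLO; postgres sits two hops downstream.
graph = {"checkout": ["cart", "payments"], "payments": ["postgres"]}
signals = {
    "checkout": {"latency_ms": ([100, 102, 98, 100], 150)},
    "payments": {"cpu_cores": ([0.5, 0.5, 0.6, 0.4], 0.55)},
    "postgres": {"disk_io_wait_ms": ([2, 2, 3, 1], 40)},
}

ranking = rank_candidates(graph, "checkout", signals)
for service, metric, score in ranking:
    print(f"{service:10s} {metric:18s} z={score:.1f}")
```

With these sample values the disk I/O wait on `postgres` ranks above the latency symptom on `checkout`, which is the kind of ordering a root cause ranking is meant to surface.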

AI RCA Integration (Enterprise)¶

The Enterprise edition integrates with LLM APIs to provide natural-language explanations of incidents, parse log patterns for error classification, and suggest specific remediations based on historical incident patterns.

SLO Monitoring¶

Coroot provides built-in SLO tracking based on RED metrics (Rate, Errors, Duration):

Automatically calculates availability and latency SLOs per service
Tracks error budgets in real time
Fires alerts when burn rate exceeds thresholds
Requires no manual SLO configuration: objectives are derived automatically from eBPF data
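
The relationship between an SLO target, its error budget, and the burn rate can be shown with a short worked example. The definitions below are the common SRE ones (error budget = 1 - target, burn rate = observed error ratio / error budget). The function names, the request counts, and the 14.4 threshold (a value often cited for a one-hour fast-burn alert on a 30-day SLO) are assumptions for illustration, not Coroot defaults.

```python
import math


def error_budget(slo_target: float) -> float:
    """Fraction of requests allowed to fail, e.g. 0.01 for a 99% target."""
    return 1.0 - slo_target


def burn_rate(failed: int, total: int, slo_target: float) -> float:
    """How fast the budget is being consumed; 1.0 means exactly on budget."""
    if total == 0:
        return 0.0
    return (failed / total) / error_budget(slo_target)


def should_alert(rate: float, threshold: float = 14.4) -> bool:
    """Alert when the burn rate exceeds the (assumed) threshold."""
    return rate > threshold


# Assumed window: 1,000 requests with 200 failures against a 99% availability SLO.
rate = burn_rate(failed=200, total=1000, slo_target=0.99)
print(f"burn rate ~ {rate:.1f}, alert: {should_alert(rate)}")
```

A burn rate of about 20 means the service would exhaust its error budget roughly twenty times faster than the SLO allows if the failure ratio persisted.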

Data Flow Summary¶

sequenceDiagram
    participant App as Application
    participant Kernel as Linux Kernel
    participant NA as coroot-node-agent
    participant CA as coroot-cluster-agent
    participant Server as Coroot Server
    participant Prom as Prometheus / VM
    participant CH as ClickHouse

    App->>Kernel: syscalls (normal operation)
    Kernel->>NA: eBPF events (TCP, DNS, disk)
    NA->>Server: Metrics (Prometheus format)
    NA->>Server: Traces (OTLP)
    NA->>Server: Logs (container stdout)
    NA->>Server: Profiles (pprof)
    CA->>Server: DB metrics (SQL/INFO)
    Server->>Prom: Store metrics
    Server->>CH: Store logs, traces, profiles
    Server->>Server: Build service map
    Server->>Server: Run inspections & AI RCA

Benchmarks¶

Performance overhead, resource consumption, and scale limits for Coroot's eBPF-based observability.

eBPF Agent Overhead¶

Test Conditions¶

Parameter	Value
Workload	Go HTTP server (baseline)
Load	10,000 requests per second (RPS)
Agent	coroot-node-agent (eBPF)
Methodology	Latency comparison: baseline (no agent) vs agent enabled

Results¶

Metric	Without Agent	With Agent	Impact
Request latency	Baseline	Within margin of error	Negligible
CPU consumption	—	~200 millicores	~20% of 1 CPU core
Latency impact	—	—	Within measurement error

Key finding: At 10,000 RPS, the latency difference between running with coroot-node-agent enabled and the baseline falls within the margin of measurement error. Before an eBPF program is loaded, the kernel verifier checks that it terminates and accesses memory safely, which is intended to keep it from hanging or crashing the kernel.

CPU Profiler Overhead¶

Component	Overhead	Notes
eBPF CPU profiler	1–3%	Based on Grafana Pyroscope implementation
JVM with `-XX:+PreserveFramePointer`	1–3%	Required for accurate JVM stack traces

Resource Consumption at Scale¶

Coroot Server¶

Cluster Size	Recommended CPU	Recommended RAM	Notes
< 50 services	1 vCPU	2 GB	Single-node sufficient
50–200 services	2 vCPU	4 GB	Inspection engine overhead
200+ services	4+ vCPU	8+ GB	Service map complexity

Node Agent¶

Per Node	CPU	RAM
Base overhead	~50 millicores	~50 MB
Under load (10K RPS)	~200 millicores	~100 MB
Heavy profiling	~300 millicores	~150 MB
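
Because the agent runs on every node, the per-node figures multiply across the cluster. The sketch below does that arithmetic using the values from the table above. The profile keys and the function name are hypothetical, and real consumption will vary by workload.

```python
# Per-node figures copied from the table above: (CPU millicores, RAM in MB).
NODE_AGENT_PROFILES = {
    "base": (50, 50),
    "under_load": (200, 100),
    "heavy_profiling": (300, 150),
}


def cluster_overhead(nodes: int, profile: str) -> dict:
    """Total node-agent footprint for a cluster of `nodes` nodes."""
    cpu_millicores, ram_mb = NODE_AGENT_PROFILES[profile]
    return {
        "cpu_cores": nodes * cpu_millicores / 1000,  # 1000 millicores = 1 core
        "ram_mb": nodes * ram_mb,
    }


print(cluster_overhead(200, "under_load"))  # {'cpu_cores': 40.0, 'ram_mb': 20000}
```

At 200 nodes under load, the agents together use about 40 CPU cores and 20,000 MB of RAM, spread evenly across the nodes rather than concentrated on one host.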

Scale Limits¶

Dimension	Practical Limit	Bottleneck
Services per cluster	500+	Server CPU for service map
Nodes per cluster	200+	Node agent DaemonSet scaling
Multi-cluster	10+ clusters	Network bandwidth to central server
Metrics cardinality	Backend dependent	Prometheus/VM limits apply
Trace throughput	Backend dependent	ClickHouse write capacity

Storage Backend Requirements¶

Backend	Scenario	Resources
Prometheus	< 1M series	2 CPU, 8 GB RAM, 100 GB SSD
VictoriaMetrics	1–10M series	2 CPU, 4 GB RAM, 200 GB SSD
ClickHouse	100 GB/day logs + traces	4 CPU, 16 GB RAM, 500 GB SSD
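
A rough retention estimate follows from the ClickHouse row: disk capacity divided by the daily on-disk volume. The table does not say whether 100 GB/day is measured before or after compression, so the sketch below exposes the compression ratio as an explicit assumption. The function and parameter names are hypothetical.

```python
def retention_days(
    disk_gb: float,
    ingest_gb_per_day: float,
    compression_ratio: float = 1.0,
) -> float:
    """Days of data that fit on disk, given an assumed compression ratio."""
    if ingest_gb_per_day <= 0 or compression_ratio <= 0:
        raise ValueError("ingest rate and compression ratio must be positive")
    stored_per_day = ingest_gb_per_day / compression_ratio
    return disk_gb / stored_per_day


# If 100 GB/day is the on-disk volume, 500 GB holds about 5 days.
print(retention_days(500, 100))
# If it is the raw volume and data compresses 10x (an assumed ratio), about 50 days.
print(retention_days(500, 100, compression_ratio=10))
```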

Caveats¶

Benchmarks are from vendor-provided testing. Users with specialized workloads should conduct their own validation.
eBPF overhead can vary based on kernel version, workload characteristics, and enabled collection features.
Disabling span capture while maintaining eBPF metrics can further reduce resource consumption in extremely high-load environments.