Architecture¶
1. Default Topology / Flow¶
flowchart TB
subgraph K8s["Kubernetes Cluster"]
subgraph Agents["Data Plane (DaemonSet + Deployment)"]
NA["coroot-node-agent\n(eBPF DaemonSet)\nPer node"]
CA["coroot-cluster-agent\n(Deployment)\nDatabase discovery"]
end
subgraph Control["Control Plane"]
OP["Coroot Operator\n(Lifecycle management)"]
CR["Coroot CR\n(Custom Resource)"]
end
subgraph Server["Coroot Server (StatefulSet)"]
direction TB
InspEng["Inspection Engine\n18+ auto-inspections"]
AIRCA["AI RCA Engine\n(pattern detection)"]
SvcMap["Service Map Builder\n(eBPF topology)"]
SLOEng["SLO Engine\n(error budget tracking)"]
APIGW["API / Web UI\n(port 8080)"]
end
subgraph Storage["Storage (StatefulSet / External)"]
Prom["Prometheus / VM / Mimir\n(metrics)"]
CH["ClickHouse\n(logs, traces, profiles)"]
end
end
subgraph External["Optional"]
OTEL["OTel SDK\n(app-level traces)"]
LLM["LLM API\n(Enterprise AI RCA)"]
end
NA -->|"metrics, traces,\nlogs, profiles"| Server
CA -->|"DB metrics\n(pg_stat, INFO)"| Server
OTEL -->|OTLP| Server
Server -->|"remote_write /\nPromQL"| Prom
Server -->|"clickhouse-native"| CH
LLM -.->|"API"| AIRCA
OP -->|"reconcile"| CR
CR -->|"manages"| Agents
CR -->|"manages"| Server
CR -->|"manages"| Storage
style Server fill:#1565c0,color:#fff
style Agents fill:#2e7d32,color:#fff
style Storage fill:#e65100,color:#fff
Detailed component breakdown, deployment topologies, and data flow diagrams for Coroot.
Component Architecture¶
18 Built-In Inspections¶
Coroot continuously runs 18+ automated inspections against every discovered service, grouped into the following categories:
| Category | Inspections |
|---|---|
| SLOs | Availability SLO, Latency SLO |
| Instances | Pod restarts, unavailable replicas |
| CPU | CPU throttling, CPU usage near limits |
| GPU | GPU utilization, memory usage |
| Memory | OOM kills, memory near limits |
| Storage | Disk usage, I/O latency |
| Network | Connection errors, DNS failures, TCP retransmits |
| Logs | Error log rate spikes, warning patterns |
| Runtime | JVM heap/GC, .NET GC, Python GIL contention |
| Databases | Postgres, MySQL, MongoDB, Redis, Memcached health |
| Deployments | Rollout tracking, canary detection |
Deployment Topologies¶
Single-Cluster (Standard)¶
flowchart LR
subgraph Cluster["K8s Cluster"]
NA1["node-agent<br/>(node 1)"]
NA2["node-agent<br/>(node 2)"]
NAN["node-agent<br/>(node N)"]
CA["cluster-agent"]
CS["Coroot Server"]
CH["ClickHouse<br/>(2 shards × 2 replicas)"]
Prom["Prometheus / VM"]
end
NA1 --> CS
NA2 --> CS
NAN --> CS
CA --> CS
CS --> CH
CS --> Prom
Multi-Cluster (Hub and Spoke)¶
flowchart TB
subgraph Central["Central Cluster"]
CS["Coroot Server\n(full install)"]
CH["ClickHouse"]
Prom["Prometheus / VM"]
end
subgraph Remote1["Remote Cluster 1"]
NA_R1["node-agents"]
CA_R1["cluster-agent"]
end
subgraph Remote2["Remote Cluster 2"]
NA_R2["node-agents"]
CA_R2["cluster-agent"]
end
NA_R1 -->|"agentsOnly=true"| CS
CA_R1 --> CS
NA_R2 -->|"agentsOnly=true"| CS
CA_R2 --> CS
CS --> CH
CS --> Prom
style Central fill:#1565c0,color:#fff
Sequence: Incident Detection → RCA¶
sequenceDiagram
participant App as Application
participant Kernel as Linux Kernel
participant Agent as node-agent (eBPF)
participant Server as Coroot Server
participant Insp as Inspection Engine
participant RCA as AI RCA
participant Alert as Alert Channel
Kernel->>Agent: eBPF events (TCP, DNS, disk)
Agent->>Server: Metrics + traces + logs
Server->>Insp: Run automated inspections
Insp->>Insp: SLO breach detected
Insp->>RCA: Trigger root cause analysis
RCA->>RCA: Walk dependency graph
RCA->>RCA: Correlate metrics ↔ traces ↔ logs
RCA->>RCA: Rank root causes
RCA->>Alert: Send alert with RCA summary
Note over Alert: Slack / PagerDuty / Webhook
Data Model¶
1. Default Topology / Flow¶
erDiagram
Coroot_CORE ||--o{ CONFIG : requires
Coroot_CORE ||--o{ STATE : writes
CONFIG {
string runtime_params
string limits
}
STATE {
string metric_id
json payload
}
How It Works¶
How Coroot uses eBPF for zero-instrumentation data collection, automated service discovery, and AI-powered root cause analysis.
Data Collection Pipeline¶
eBPF-Based Auto-Instrumentation¶
Coroot's core differentiator is kernel-level telemetry collection via eBPF (extended Berkeley Packet Filter). The coroot-node-agent runs as a DaemonSet on every Kubernetes node and attaches eBPF programs to kernel tracepoints and kprobes:
flowchart LR
subgraph Kernel["Linux Kernel (4.16+)"]
TP["Tracepoints"]
KP["kprobes/kretprobes"]
TC["Traffic Control (tc)"]
end
subgraph Agent["coroot-node-agent"]
eBPF["eBPF Programs"]
Perf["Perf Buffer"]
Agg["Userspace Aggregation"]
end
TP --> eBPF
KP --> eBPF
TC --> eBPF
eBPF --> Perf --> Agg
Agg -->|OTLP / Prom RW| Server["Coroot Server"]
What eBPF Captures (Without Code Changes)¶
| Signal | Kernel Attachment Point | Data Collected |
|---|---|---|
| Network metrics | tcp_sendmsg, tcp_recvmsg, tcp_connect | Latency, throughput, error rates per connection |
| HTTP/gRPC traces | Socket read/write | Request method, path, status code, duration |
| DNS | UDP socket | Resolution time, failures |
| Disk I/O | blk_mq_start_request | IOPS, latency, bandwidth per container |
| CPU profiling | perf_event_open | On-CPU flame graphs per process |
| Memory profiling | Allocation tracepoints | Heap allocation patterns |
| Container lifecycle | cgroup events | Start/stop times, resource limits |
| Log collection | Container stdout/stderr | Application log lines |
Cluster Agent Discovery¶
The coroot-cluster-agent complements eBPF data by connecting directly to databases:
| Database | Discovery Method | Metrics Collected |
|---|---|---|
| PostgreSQL | SQL queries via pg_stat_* | Active connections, query latency, replication lag |
| MySQL | SHOW STATUS / information_schema | Thread count, slow queries, buffer pool hit rate |
| Redis | INFO command | Memory usage, connected clients, hit rate |
| MongoDB | serverStatus command | Operations/sec, document counts, lock percentages |
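To make the Redis row concrete, here is a minimal sketch of how the three listed metrics could be derived from the text returned by the `INFO` command. The field names (`used_memory`, `connected_clients`, `keyspace_hits`, `keyspace_misses`) are standard Redis `INFO` fields; the helper names and the way the cluster agent actually processes the output are assumptions made for illustration.

```python
def parse_redis_info(raw: str) -> dict:
    """Parse the text returned by the Redis INFO command into a dict of strings."""
    info = {}
    for line in raw.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and "# Section" headers
        key, _, value = line.partition(":")
        info[key] = value
    return info


def redis_summary(raw: str) -> dict:
    """Hypothetical helper: extract memory, clients, and cache hit rate."""
    info = parse_redis_info(raw)
    hits = int(info.get("keyspace_hits", 0))
    misses = int(info.get("keyspace_misses", 0))
    lookups = hits + misses
    return {
        "used_memory_bytes": int(info.get("used_memory", 0)),
        "connected_clients": int(info.get("connected_clients", 0)),
        # The hit rate is undefined until at least one lookup has happened.
        "hit_rate": hits / lookups if lookups else None,
    }


# Illustrative sample, not captured from a real server.
sample = (
    "# Clients\r\nconnected_clients:12\r\n"
    "# Memory\r\nused_memory:1048576\r\n"
    "# Stats\r\nkeyspace_hits:900\r\nkeyspace_misses:100\r\n"
)
print(redis_summary(sample))
```

With the sample above, 900 hits out of 1,000 lookups gives a hit rate of 0.9.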
Service Map Generation¶
Coroot automatically builds a real-time service dependency graph by correlating eBPF network traces:
- Connection tracking: eBPF programs track every TCP connection (source IP:port ↔ dest IP:port)
- Container resolution: IP addresses are mapped to Kubernetes pods via the container runtime
- Service grouping: Pods are grouped by Deployment/StatefulSet/DaemonSet
- Protocol detection: L7 protocol (HTTP, gRPC, MySQL, PostgreSQL, Redis, Kafka, etc.) is identified from payload patterns
- Dependency graph: Directed edges between services are weighted by request rate, latency, and error rate
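The steps above can be sketched in a few lines. This is an illustration under stated assumptions, not Coroot's implementation: the input tuples, the two lookup tables, and the `external:` prefix for unresolved addresses are all hypothetical, and protocol detection is omitted.

```python
from collections import defaultdict


def resolve(ip, ip_to_pod, pod_to_workload):
    """Map an IP to its owning workload; unknown IPs become external nodes."""
    pod = ip_to_pod.get(ip)
    if pod is None:
        return f"external:{ip}"
    # Fall back to the pod name if it has no known controller.
    return pod_to_workload.get(pod, pod)


def build_service_map(connections, ip_to_pod, pod_to_workload):
    """Aggregate (src_ip, dst_ip, latency_ms, failed) tuples into weighted edges."""
    totals = defaultdict(lambda: {"requests": 0, "errors": 0, "latency_ms": 0.0})
    for src_ip, dst_ip, latency_ms, failed in connections:
        edge = (
            resolve(src_ip, ip_to_pod, pod_to_workload),
            resolve(dst_ip, ip_to_pod, pod_to_workload),
        )
        t = totals[edge]
        t["requests"] += 1
        t["errors"] += int(failed)
        t["latency_ms"] += latency_ms
    return {
        edge: {
            "requests": t["requests"],
            "error_rate": t["errors"] / t["requests"],
            "avg_latency_ms": t["latency_ms"] / t["requests"],
        }
        for edge, t in totals.items()
    }


# Hypothetical inputs for illustration.
ip_to_pod = {"10.0.0.1": "web-a", "10.0.0.3": "web-b", "10.0.0.2": "db-0"}
pod_to_workload = {
    "web-a": "Deployment/web",
    "web-b": "Deployment/web",
    "db-0": "StatefulSet/db",
}
connections = [
    ("10.0.0.1", "10.0.0.2", 5.0, False),
    ("10.0.0.3", "10.0.0.2", 15.0, True),
    ("10.0.0.1", "8.8.8.8", 1.0, False),
]
service_map = build_service_map(connections, ip_to_pod, pod_to_workload)
```

Both `web` pods collapse into a single `Deployment/web` node, so the two database connections become one edge with two requests.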
AI-Powered Root Cause Analysis¶
When an SLO violation or anomaly is detected, Coroot's AI RCA engine automatically:
- Identifies the impacted service from SLO breach alerts
- Walks the dependency graph upstream and downstream
- Correlates signals across metrics, traces, logs, and profiles for each service in the path
- Ranks root causes using statistical anomaly detection (e.g., sudden CPU spike, disk saturation, memory leak, new deployment)
- Generates remediation suggestions (e.g., "Service X shows 95th percentile latency spike correlated with disk I/O saturation on node Y — consider increasing PVC size or migrating to SSD-backed storage class")
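One simple way to picture the graph walk and ranking steps is a breadth-first traversal followed by z-score ranking. The sketch below is an assumption-laden stand-in, not Coroot's algorithm: it follows dependencies in one direction only, uses a plain z-score as the anomaly measure, and all function names and data shapes are hypothetical.

```python
from collections import deque
from statistics import mean, pstdev


def reachable(graph, start):
    """Breadth-first walk over the services that `start` depends on."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for dep in graph.get(node, ()):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen


def anomaly_score(history, current):
    """Absolute z-score of the current value against its recent history."""
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        return 0.0 if current == mu else float("inf")
    return abs(current - mu) / sigma


def rank_candidates(graph, impacted, metrics):
    """metrics: {service: {metric_name: (history, current)}}; highest score first."""
    ranked = []
    for svc in reachable(graph, impacted):
        for name, (history, current) in metrics.get(svc, {}).items():
            ranked.append((anomaly_score(history, current), svc, name))
    ranked.sort(reverse=True)
    return ranked


# Hypothetical example: the SLO breach is on "web", the anomaly is on "db".
graph = {"web": ["api"], "api": ["db"], "db": []}
metrics = {
    "web": {"latency_ms": ([100, 102, 98, 100], 101)},
    "db": {"disk_io_wait": ([1, 1, 2, 2], 10)},
}
ranked = rank_candidates(graph, "web", metrics)
```

Here the disk I/O wait on `db` deviates far more from its history than the latency on `web`, so it ranks first as the likely root cause.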
AI RCA Integration (Enterprise)¶
The Enterprise edition integrates with LLM APIs to provide natural-language explanations of incidents, parse log patterns for error classification, and suggest specific remediations based on historical incident patterns.
SLO Monitoring¶
Coroot provides built-in SLO tracking based on RED metrics (Rate, Errors, Duration):
- Automatically calculates availability and latency SLOs per service
- Tracks error budgets in real-time
- Fires alerts when burn rate exceeds thresholds
- No manual SLO configuration required — automatically derived from eBPF data
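The error budget and burn rate mentioned above follow the usual SLO arithmetic: the budget is the allowed failure fraction (1 minus the target), and the burn rate is the observed error rate divided by that budget. The sketch below shows the calculation with hypothetical function names and example numbers; the thresholds and windows Coroot actually uses are not stated here.

```python
def error_budget(slo_target: float) -> float:
    """Fraction of requests allowed to fail, e.g. 0.01 for a 99% target."""
    return 1.0 - slo_target


def burn_rate(failed: int, total: int, slo_target: float) -> float:
    """Observed error rate relative to the budget; above 1.0 the budget runs out early."""
    if total == 0:
        return 0.0
    return (failed / total) / error_budget(slo_target)


def budget_remaining(failed: int, total: int, slo_target: float) -> float:
    """Share of the window's error budget still unspent (negative if overspent)."""
    allowed = total * error_budget(slo_target)
    if allowed == 0:
        return 0.0
    return 1.0 - failed / allowed


# Example: 99% availability target, 10,000 requests, 250 failures.
rate = burn_rate(250, 10_000, 0.99)
left = budget_remaining(250, 10_000, 0.99)
```

In this example the service fails 2.5% of requests against a 1% budget, so the burn rate is roughly 2.5 and the budget for the window is already overspent.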
Data Flow Summary¶
sequenceDiagram
participant App as Application
participant Kernel as Linux Kernel
participant NA as coroot-node-agent
participant CA as coroot-cluster-agent
participant Server as Coroot Server
participant Prom as Prometheus / VM
participant CH as ClickHouse
App->>Kernel: syscalls (normal operation)
Kernel->>NA: eBPF events (TCP, DNS, disk)
NA->>Server: Metrics (Prometheus format)
NA->>Server: Traces (OTLP)
NA->>Server: Logs (container stdout)
NA->>Server: Profiles (pprof)
CA->>Server: DB metrics (SQL/INFO)
Server->>Prom: Store metrics
Server->>CH: Store logs, traces, profiles
Server->>Server: Build service map
Server->>Server: Run inspections & AI RCA
Benchmarks¶
Performance overhead, resource consumption, and scale limits for Coroot's eBPF-based observability.
eBPF Agent Overhead¶
Test Conditions¶
| Parameter | Value |
|---|---|
| Workload | Go HTTP server (baseline) |
| Load | 10,000 requests per second (RPS) |
| Agent | coroot-node-agent (eBPF) |
| Methodology | Latency comparison: baseline (no agent) vs agent enabled |
Results¶
| Metric | Without Agent | With Agent | Impact |
|---|---|---|---|
| Request latency | Baseline | Within margin of error | Negligible |
| CPU consumption | — | ~200 millicores | ~20% of 1 CPU core |
| Latency impact | — | — | Within measurement error |
Key finding: At 10,000 RPS, the latency difference between running with the coroot-node-agent enabled and the baseline falls within the margin of measurement error. Before an eBPF program is loaded, the kernel verifier checks that it terminates and accesses memory safely, which is designed to prevent it from hanging or crashing the kernel.
CPU Profiler Overhead¶
| Component | Overhead | Notes |
|---|---|---|
| eBPF CPU profiler | 1–3% | Based on Grafana Pyroscope implementation |
| JVM with -XX:+PreserveFramePointer | 1–3% | Required for accurate JVM stack traces |
Resource Consumption at Scale¶
Coroot Server¶
| Cluster Size | Recommended CPU | Recommended RAM | Notes |
|---|---|---|---|
| < 50 services | 1 vCPU | 2 GB | Single-node sufficient |
| 50–200 services | 2 vCPU | 4 GB | Inspection engine overhead |
| 200+ services | 4+ vCPU | 8+ GB | Service map complexity |
Node Agent¶
| Per Node | CPU | RAM |
|---|---|---|
| Base overhead | ~50 millicores | ~50 MB |
| Under load (10K RPS) | ~200 millicores | ~100 MB |
| Heavy profiling | ~300 millicores | ~150 MB |
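Because the agent runs on every node, the per-node figures multiply across the cluster. The sketch below is a back-of-the-envelope estimate that takes the approximate values from the table above at face value; the profile keys and function name are hypothetical, and real consumption will vary with workload.

```python
# Approximate per-node figures from the table above: (millicores, MB of RAM).
NODE_AGENT_PROFILES = {
    "base": (50, 50),
    "under_load": (200, 100),
    "heavy_profiling": (300, 150),
}


def cluster_agent_overhead(nodes: int, profile: str) -> dict:
    """Estimate total node-agent CPU and RAM for a cluster of `nodes` nodes."""
    cpu_millicores, ram_mb = NODE_AGENT_PROFILES[profile]
    return {
        "cpu_cores": nodes * cpu_millicores / 1000,
        "ram_gb": nodes * ram_mb / 1024,
    }


estimate = cluster_agent_overhead(100, "under_load")
```

For a 100-node cluster under load this works out to about 20 CPU cores and just under 10 GB of RAM in total, spread evenly across the nodes.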
Scale Limits¶
| Dimension | Practical Limit | Bottleneck |
|---|---|---|
| Services per cluster | 500+ | Server CPU for service map |
| Nodes per cluster | 200+ | Node agent DaemonSet scaling |
| Multi-cluster | 10+ clusters | Network bandwidth to central server |
| Metrics cardinality | Backend dependent | Prometheus/VM limits apply |
| Trace throughput | Backend dependent | ClickHouse write capacity |
Storage Backend Requirements¶
| Backend | Scenario | Resources |
|---|---|---|
| Prometheus | < 1M series | 2 CPU, 8 GB RAM, 100 GB SSD |
| VictoriaMetrics | 1–10M series | 2 CPU, 4 GB RAM, 200 GB SSD |
| ClickHouse | 100 GB/day logs + traces | 4 CPU, 16 GB RAM, 500 GB SSD |
Caveats¶
- Benchmarks are from vendor-provided testing. Users with specialized workloads should conduct their own validation.
- eBPF overhead can vary based on kernel version, workload characteristics, and enabled collection features.
- Disabling span capture while maintaining eBPF metrics can further reduce resource consumption in extremely high-load environments.