Architecture
1. Default Topology / Flow
flowchart TB
subgraph Sources["Data Sources"]
APP["Application\n+ OTel SDK"]
PROM["Prometheus\nTargets"]
LEGACY["Jaeger /\nZipkin Clients"]
FB["FluentBit /\nFluentD"]
RUM["Browser\n(RUM SDK)"]
end
subgraph CollectorFleet["OTel Collector Fleet"]
direction TB
GW["Gateway Collectors\n(load-balanced)"]
AGT["Agent Collectors\n(DaemonSet, optional)"]
end
subgraph Backend["SigNoz Backend"]
direction TB
QS["Query Service\n(Go API)"]
Rule["Ruler +\nAlertmanager"]
OpAMP["OpAMP Server\n(dynamic config)"]
EE["Enterprise Extensions\n(SSO, RBAC, SAML)"]
FE["React Frontend\n(SPA)"]
end
subgraph CHCluster["ClickHouse Cluster"]
direction TB
Shard1["Shard 1\n(Replica A + B)"]
Shard2["Shard 2\n(Replica A + B)"]
ZK["ZooKeeper /\nClickHouse Keeper"]
subgraph Tables["Core Tables"]
T_Traces["signoz_traces\n.signoz_index_v2"]
T_Logs["signoz_logs\n.logs"]
T_Metrics["signoz_metrics\n.samples_v4"]
end
end
subgraph Meta["Metadata"]
PG["PostgreSQL\n(metadata, auth)"]
end
Sources --> CollectorFleet
GW -->|"ClickHouse\nexporter"| CHCluster
AGT -->|"forward"| GW
QS -->|"query"| CHCluster
Rule -->|"eval"| CHCluster
QS --> FE
QS --> PG
OpAMP -.->|"reconfigure"| CollectorFleet
style Backend fill:#7b1fa2,color:#fff
style CHCluster fill:#1565c0,color:#fff
Component breakdown, deployment topologies, and data flow for SigNoz.
System Architecture
Component Responsibility Matrix
| Component | Language | Role | Scales Via |
|---|---|---|---|
| OTel Collector (Gateway) | Go | Ingestion, processing, routing | Horizontal (replicas behind LB) |
| OTel Collector (Agent) | Go | Per-node collection, forwarding | DaemonSet (1 per node) |
| Query Service | Go | API layer, ClickHouse queries | Horizontal (stateless) |
| Ruler / Alertmanager | Go | Alert evaluation, notifications | Single leader |
| OpAMP Server | Go | Dynamic collector reconfiguration | Single instance |
| React Frontend | TypeScript | UI, dashboards, query builder | Static assets (CDN/replicas) |
| ClickHouse | C++ | Columnar storage for all signals | Sharding + replication |
| ZooKeeper / Keeper | Java/C++ | ClickHouse coordination | 3-node ensemble |
| PostgreSQL | C | Metadata, user auth, settings | Standard HA (RDS etc.) |
Deployment Topologies
Small (< 50 GB/day)
flowchart LR
OTel["OTel Collector\n(single)"]
QS["Query Service"]
FE["Frontend"]
CH["ClickHouse\n(single node)"]
PG["PostgreSQL"]
OTel --> CH
QS --> CH
QS --> PG
QS --> FE
Production (50–200 GB/day)
flowchart LR
subgraph Collectors["Collector Fleet"]
C1["Collector 1"]
C2["Collector 2"]
C3["Collector 3"]
end
LB["Load Balancer"]
subgraph QSPool["Query Service Pool"]
QS1["QS 1"]
QS2["QS 2"]
end
subgraph CHCluster["ClickHouse (2×2)"]
S1R1["Shard1 Rep1"]
S1R2["Shard1 Rep2"]
S2R1["Shard2 Rep1"]
S2R2["Shard2 Rep2"]
end
LB --> Collectors --> CHCluster
QSPool --> CHCluster
ClickHouse Storage Schema Detail
Trace Index Table
| Column | Type | Purpose |
|---|---|---|
| timestamp | DateTime64(9) | Nanosecond precision timestamp |
| traceID | FixedString(32) | 128-bit trace identifier |
| spanID | String | Span identifier |
| parentSpanID | String | Parent span link |
| serviceName | LowCardinality(String) | Service name |
| name | LowCardinality(String) | Operation name |
| kind | Int8 | Span kind (server/client/etc.) |
| durationNano | UInt64 | Span duration |
| statusCode | Int16 | Status code |
| httpMethod | LowCardinality(String) | HTTP method |
| httpRoute | LowCardinality(String) | HTTP route |
| resourceAttributes | Map(String, String) | Resource attributes |
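
The FixedString(32) type for traceID matches the usual text encoding of a 128-bit identifier: 16 bytes rendered as 32 hex characters. A minimal sketch, assuming IDs follow the W3C Trace Context convention (lowercase hex, not all zeros). `new_trace_id` and `is_valid_trace_id` are illustrative helpers, not SigNoz or OpenTelemetry APIs:

```python
import secrets
import string

# Lowercase hex alphabet used by the text form of a trace ID.
_HEX = set(string.hexdigits.lower())


def new_trace_id() -> str:
    """Generate a random 128-bit trace ID as 32 lowercase hex characters."""
    return secrets.token_hex(16)  # 16 bytes -> 32 hex characters


def is_valid_trace_id(value: str) -> bool:
    """Check length, alphabet, and the all-zero ID that W3C Trace Context forbids."""
    return (
        len(value) == 32
        and all(ch in _HEX for ch in value)
        and value != "0" * 32
    )


if __name__ == "__main__":
    tid = new_trace_id()
    print(tid, is_valid_trace_id(tid))
```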
Log Table
| Column | Type | Purpose |
|---|---|---|
| timestamp | UInt64 | Unix nanoseconds |
| body | String | Log message body |
| severityText | LowCardinality(String) | ERROR, WARN, INFO, etc. |
| severityNumber | UInt8 | Numeric severity |
| traceID | String | Correlation to traces |
| spanID | String | Correlation to spans |
| resourceAttributes | Map(String, String) | Resource context |
| logAttributes | Map(String, String) | Log-specific attributes |
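
The log table stores timestamp as a raw UInt64 of Unix nanoseconds, so client code usually has to convert it before display. A small sketch of that conversion (`ns_to_datetime` is an illustrative helper). Python's `datetime` only keeps microseconds, so the last three digits are truncated:

```python
from datetime import datetime, timezone


def ns_to_datetime(ts_ns: int) -> datetime:
    """Convert Unix nanoseconds to an aware UTC datetime (microsecond precision)."""
    seconds, nanos = divmod(ts_ns, 1_000_000_000)
    # Build from whole seconds first to avoid float rounding on large values.
    base = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return base.replace(microsecond=nanos // 1000)


if __name__ == "__main__":
    print(ns_to_datetime(1_700_000_000_123_456_789).isoformat())
```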
Data Model
erDiagram
Signoz_CORE ||--o{ CONFIG : requires
Signoz_CORE ||--o{ STATE : writes
CONFIG {
string runtime_params
string limits
}
STATE {
string metric_id
json payload
}
How It Works
How SigNoz processes telemetry through its OTel-native pipeline, stores data in ClickHouse, and provides unified observability.
Data Pipeline
Ingestion Flow
flowchart LR
subgraph Sources["Data Sources"]
APP["App + OTel SDK"]
PROM["Prometheus"]
JAEG["Jaeger / Zipkin"]
FB["FluentBit / FluentD"]
end
subgraph Collector["SigNoz OTel Collector"]
Recv["Receivers\n(OTLP, Jaeger, Zipkin,\nPrometheus)"]
Proc["Processors\n(batch, memory_limiter,\nattribute, tail_sampling)"]
Exp["Exporters\n(ClickHouse)"]
end
subgraph Backend["SigNoz Backend"]
QS["Query Service\n(Go API)"]
FE["React Frontend"]
Rule["Ruler /\nAlertmanager"]
OpAMP["OpAMP Server\n(dynamic config)"]
end
subgraph CH["ClickHouse Cluster"]
T["signoz_traces"]
L["signoz_logs"]
M["signoz_metrics"]
end
Sources --> Recv --> Proc --> Exp --> CH
QS --> CH
Rule --> CH
QS --> FE
OpAMP -.->|reconfigure| Collector
OTel Collector Distribution
SigNoz ships a custom OpenTelemetry Collector distribution that includes:
| Component | Purpose |
|---|---|
| OTLP Receiver | Primary ingestion (gRPC + HTTP) |
| Prometheus Receiver | Scrape Prometheus targets |
| Jaeger/Zipkin Receiver | Legacy trace format support |
| FluentForward Receiver | FluentBit/FluentD log ingestion |
| Batch Processor | Batches data for efficient ClickHouse writes |
| Memory Limiter | Prevents OOM under load |
| Tail Sampling | Sample traces based on latency/error criteria |
| ClickHouse Exporter | Writes all signals to ClickHouse |
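
Tail sampling differs from head sampling in that the keep/drop decision is made after all spans of a trace have arrived, which is what makes latency and error criteria possible. A minimal sketch of such a decision, assuming hypothetical span fields `start_ns`, `durationNano`, and `error`, and an arbitrary 500 ms threshold. This is not the collector's actual implementation or configuration format:

```python
def keep_trace(spans, latency_threshold_ns=500_000_000):
    """Decide on a completed trace: keep it if any span errored or it was slow.

    `spans` is a non-empty list of dicts with `start_ns`, `durationNano`,
    and an optional boolean `error` (field names are assumptions).
    """
    has_error = any(s.get("error", False) for s in spans)
    # End-to-end latency: earliest span start to latest span end.
    start = min(s["start_ns"] for s in spans)
    end = max(s["start_ns"] + s["durationNano"] for s in spans)
    return has_error or (end - start) >= latency_threshold_ns


if __name__ == "__main__":
    trace = [
        {"start_ns": 0, "durationNano": 120_000_000},
        {"start_ns": 50_000_000, "durationNano": 600_000_000},
    ]
    print(keep_trace(trace))  # slow trace, so it is kept
```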
OpAMP (Open Agent Management Protocol)
SigNoz uses OpAMP for dynamic reconfiguration of the OTel Collector:
- Log pipelines: Add/modify log processing rules without collector restart
- Sampling rules: Adjust tail sampling dynamically
- Collector health: Monitor collector instances from the SigNoz UI
Storage Schema (ClickHouse)
Traces
-- signoz_traces.signoz_index_v2
-- Core trace/span index with columnar storage
-- Columns: traceID, spanID, serviceName, name, kind, durationNano,
-- statusCode, httpMethod, httpRoute, resourceAttributes, ...
-- Engine: MergeTree, partitioned by toDate(timestamp)
-- TTL: Configurable (default 7 days self-hosted, 15 days cloud)
Logs
-- signoz_logs.logs
-- Columnar log storage with full-text indexing
-- Columns: timestamp, body, severityText, severityNumber,
-- traceID, spanID, resourceAttributes, logAttributes
-- Engine: MergeTree, partitioned by toDate(timestamp)
-- Supports: JSON expansion, attribute indexing
Metrics
-- signoz_metrics.samples_v4
-- Time-series samples with metric metadata
-- Columns: metric_name, fingerprint, timestamp_ms, value,
-- labels (Map), temporality, type
-- Engine: MergeTree, partitioned by toDate(timestamp_ms)
-- Query: PromQL translated to ClickHouse SQL
Query Execution
Dual Query Language Support
| Signal | Query Language | How It Works |
|---|---|---|
| Metrics | PromQL | Translated to ClickHouse SQL by the query service |
| Logs | ClickHouse SQL | Direct columnar queries with filter pushdown |
| Traces | ClickHouse SQL | Span-level queries with attribute filtering |
| All | Query Builder | Visual query builder generates optimized CH SQL |
Query Builder → ClickHouse Translation
The React frontend's visual query builder generates structured query payloads that the Go query service translates into optimized ClickHouse SQL:
- User builds query visually (aggregation, filters, group-by)
- Frontend sends structured JSON payload to API
- Query Service compiles to ClickHouse SQL with proper materialized column usage
- ClickHouse executes with columnar vectorized processing
- Results returned as time-series or table data
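
The steps above can be sketched as a small compiler from a structured payload to parameterized SQL. The payload shape (`BuilderQuery`) and the `compile_to_sql` function are hypothetical and only illustrate the idea. The real query service payload and generated SQL are not shown in this document:

```python
from dataclasses import dataclass, field


@dataclass
class BuilderQuery:
    """Hypothetical query-builder payload; field names are illustrative."""
    table: str
    aggregation: str                              # e.g. "count()"
    filters: dict = field(default_factory=dict)   # column -> required value
    group_by: list = field(default_factory=list)
    step_seconds: int = 60


def compile_to_sql(q: BuilderQuery) -> tuple[str, dict]:
    """Compile the payload into ClickHouse-style SQL plus bound parameters."""
    step = int(q.step_seconds)
    select = [f"toStartOfInterval(timestamp, INTERVAL {step} SECOND) AS ts"]
    select += q.group_by
    select.append(f"{q.aggregation} AS value")

    params, where = {}, []
    for i, (column, value) in enumerate(q.filters.items()):
        key = f"p{i}"
        where.append(f"{column} = %({key})s")  # values are bound, never inlined
        params[key] = value

    sql = f"SELECT {', '.join(select)} FROM {q.table}"
    if where:
        sql += " WHERE " + " AND ".join(where)
    sql += " GROUP BY " + ", ".join(["ts"] + q.group_by) + " ORDER BY ts"
    return sql, params


if __name__ == "__main__":
    query = BuilderQuery(
        table="signoz_traces.signoz_index_v2",
        aggregation="count()",
        filters={"serviceName": "checkout"},
        group_by=["name"],
    )
    print(compile_to_sql(query))
```

Filter values are passed as parameters rather than interpolated into the string. A production compiler would also need to validate column and table names against the known schema, which this sketch omits.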
Cross-Signal Correlation
SigNoz enables correlation between signals using shared identifiers:
flowchart LR
Trace["Trace\n(traceID)"] <-->|traceID in log| Log["Log\n(traceID, spanID)"]
Trace <-->|service + timestamp| Metric["Metric\n(service, operation)"]
Log <-->|service + timestamp| Metric
- Trace → Log: Click a span to see logs with matching traceID
- Log → Trace: Click a log with traceID to jump to the trace waterfall
- Metric → Trace: Drill down from a latency spike to exemplar traces
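
The trace/log links above reduce to a join on the shared identifiers. A minimal in-memory sketch, assuming log and span records are plain dicts carrying the traceID and spanID columns from the schema tables. `index_logs_by_trace` and `logs_for_span` are illustrative helpers, not SigNoz APIs:

```python
from collections import defaultdict


def index_logs_by_trace(logs):
    """Group log records by traceID, skipping logs with no trace context."""
    by_trace = defaultdict(list)
    for log in logs:
        if log.get("traceID"):
            by_trace[log["traceID"]].append(log)
    return by_trace


def logs_for_span(by_trace, span):
    """Return logs that share both traceID and spanID with the given span."""
    candidates = by_trace.get(span["traceID"], [])
    return [log for log in candidates if log.get("spanID") == span["spanID"]]


if __name__ == "__main__":
    trace_id = "a" * 32
    logs = [
        {"traceID": trace_id, "spanID": "s1", "body": "request received"},
        {"traceID": trace_id, "spanID": "s2", "body": "db query slow"},
        {"traceID": "", "spanID": "", "body": "startup complete"},
    ]
    index = index_logs_by_trace(logs)
    print(logs_for_span(index, {"traceID": trace_id, "spanID": "s2"}))
```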
Alerting Pipeline
flowchart LR
Rule["Alert Rule\n(PromQL / CH SQL)"] --> Eval["Ruler\n(periodic eval)"]
Eval -->|threshold breach| AM["Alertmanager"]
AM --> Slack["Slack"]
AM --> PD["PagerDuty"]
AM --> WH["Webhook"]
AM --> Email["Email"]
AM --> MST["MS Teams"]
- Rules can be defined on any signal type (metrics, logs, traces)
- Anomaly detection available for automated threshold learning
- Alert history tracked with state transitions
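
Periodic evaluation with tracked state transitions can be sketched as a small state machine. This is a simplified model with only two states and a static threshold. `AlertRule` and its fields are hypothetical, and features such as pending periods, grouping, and anomaly detection are left out:

```python
from dataclasses import dataclass, field


@dataclass
class AlertRule:
    """Hypothetical threshold rule that records its own state history."""
    name: str
    threshold: float
    state: str = "inactive"
    history: list = field(default_factory=list)  # (time, old_state, new_state)

    def evaluate(self, value: float, now: int) -> bool:
        """Apply one evaluation; return True if the state changed (notify)."""
        new_state = "firing" if value > self.threshold else "inactive"
        changed = new_state != self.state
        if changed:
            self.history.append((now, self.state, new_state))
            self.state = new_state
        return changed


if __name__ == "__main__":
    rule = AlertRule(name="p99 latency", threshold=500.0)
    # Simulated query results at 60-second evaluation intervals.
    for t, value in enumerate([300, 700, 800, 100]):
        if rule.evaluate(value, now=t * 60):
            print(f"t={t * 60}s notify: {rule.name} is {rule.state}")
```

Returning True only on a transition means a rule that stays in breach does not notify on every evaluation, which is the behavior the history log is meant to capture.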
Benchmarks
Performance characteristics, capacity planning data, and scale limits for SigNoz.
vs ELK Stack
| Metric | SigNoz (ClickHouse) | ELK Stack | Advantage |
|---|---|---|---|
| Log ingestion speed | Baseline | ~2.5x slower | SigNoz 2.5x faster |
| Resource consumption | Baseline | ~2x more | SigNoz 50% less |
| Aggregate query speed | Baseline | ~13x slower | SigNoz up to 13x faster |
| Ingestion capacity | 10+ TB/day | Similar | Comparable |
| Compression ratio | 10–30x (columnar) | 1.5x (Lucene) | SigNoz 7–20x better |
Source: SigNoz vendor benchmarks. Cross-validated against ClickHouse engineering blog data on columnar efficiency.
High Cardinality Handling
| Aspect | Detail |
|---|---|
| Approach | Columnar storage, which avoids inverted index explosion |
| Impact | Adding a dimension with billions of unique values is trivial |
| Best for | Logs and traces with rich metadata |
| Caution | Avoid high-cardinality attributes as metric labels |
Capacity Planning
Resource Matrix (from SigNoz Official Docs)
| Component | Small (< 10 GB/day) | Medium (10–50 GB/day) | Large (50–200 GB/day) |
|---|---|---|---|
| OTel Collectors | 1 replica, 1 CPU, 2 GB | 2 replicas, 2 CPU, 4 GB | 4+ replicas, 4 CPU, 8 GB |
| Query Service | 1 replica, 0.5 CPU, 1 GB | 2 replicas, 1 CPU, 2 GB | 2 replicas, 2 CPU, 4 GB |
| ClickHouse | 1 node, 4 CPU, 16 GB | 2 shards × 2 replicas, 8 CPU, 32 GB | 4+ shards × 2 replicas, 16 CPU, 64 GB |
| ZooKeeper / Keeper | 1 node, 0.5 CPU, 1 GB | 3 nodes, 1 CPU, 2 GB | 3 nodes, 2 CPU, 4 GB |
| PostgreSQL | 1 node, 0.5 CPU, 1 GB | Managed DB (RDS) | Managed DB (RDS) |
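
Picking a column from the matrix is a simple range lookup on daily ingest volume. A sketch of that lookup, with the ClickHouse layouts copied from the table. `pick_tier` is an illustrative helper, and treating exactly 50 GB/day as Large is an assumption, since the table lists 50 in both the Medium and Large ranges:

```python
# ClickHouse layout per tier, copied from the resource matrix.
CLICKHOUSE_LAYOUT = {
    "Small": "1 node, 4 CPU, 16 GB",
    "Medium": "2 shards x 2 replicas, 8 CPU, 32 GB",
    "Large": "4+ shards x 2 replicas, 16 CPU, 64 GB",
}


def pick_tier(gb_per_day: float) -> str:
    """Map daily ingest volume (GB/day) to a tier name from the matrix."""
    if gb_per_day < 0:
        raise ValueError("gb_per_day must be non-negative")
    if gb_per_day < 10:
        return "Small"
    if gb_per_day < 50:
        return "Medium"
    if gb_per_day <= 200:
        return "Large"
    raise ValueError("volume exceeds the largest tier in the matrix")


if __name__ == "__main__":
    for volume in (5, 30, 120):
        tier = pick_tier(volume)
        print(f"{volume} GB/day -> {tier}: ClickHouse {CLICKHOUSE_LAYOUT[tier]}")
```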
Cloud Instance Recommendations
| Cloud | General Purpose (Collectors, QS) | Compute-Optimized (ClickHouse) |
|---|---|---|
| AWS | T3 family+ (Intel), T4g+ (ARM) | C5+ (Intel), C6g/C7g+ (ARM) |
| GCP | E2 family+ | C3 / C3D+ |
Storage Sizing
| Signal | Daily Volume | 15-Day Retention | 30-Day Retention |
|---|---|---|---|
| Logs (10:1 compression) | 50 GB raw/day | ~75 GB disk | ~150 GB disk |
| Traces (15:1 compression) | 20 GB raw/day | ~20 GB disk | ~40 GB disk |
| Metrics (30:1 compression) | 5 GB raw/day | ~2.5 GB disk | ~5 GB disk |
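
The figures in the table follow from dividing raw daily volume by the compression ratio and multiplying by the retention window. A minimal sketch of that arithmetic, assuming the listed compression ratios hold for your data (`disk_footprint_gb` is an illustrative helper, not part of SigNoz):

```python
def disk_footprint_gb(raw_gb_per_day: float, compression_ratio: float,
                      retention_days: int) -> float:
    """Estimate compressed on-disk size in GB for one copy of the data."""
    if compression_ratio <= 0:
        raise ValueError("compression_ratio must be positive")
    return raw_gb_per_day / compression_ratio * retention_days


# Rows from the Storage Sizing table: (signal, raw GB/day, compression ratio).
SIGNALS = [("logs", 50, 10), ("traces", 20, 15), ("metrics", 5, 30)]

if __name__ == "__main__":
    for name, raw, ratio in SIGNALS:
        for days in (15, 30):
            size = disk_footprint_gb(raw, ratio, days)
            print(f"{name}: {days} days -> {size:.1f} GB")
```

The estimate covers a single copy of the data. Each additional replica stores its own copy, so total cluster disk scales with the replica count.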
Scale Limits
| Dimension | Practical Limit | Notes |
|---|---|---|
| Daily ingestion | 10+ TB/day | Requires multi-shard ClickHouse |
| Active time series | 10M+ | ClickHouse handles high cardinality well |
| Concurrent queries | 50–100 | Depends on ClickHouse node count |
| Trace span retention | 15–90 days typical | Storage cost-limited |
| Log retention | 15–90 days typical | ClickHouse TTL-managed |
- System tables growth: ClickHouse's query_log and zookeeper_log can grow rapidly. Monitor and set TTLs.
- ClickHouse parts merges: Under very high ingestion, ensure sufficient CPU for background merges.
- ZooKeeper latency: In multi-shard setups, ZooKeeper latency directly impacts replication lag.
Caveats
- Benchmarks are from SigNoz vendor testing and ClickHouse engineering publications.
- Actual performance varies significantly based on data patterns, cardinality, and query complexity.
- Managed ClickHouse providers may exhibit different resource profiles.