Architecture

1. Default Topology / Flow

flowchart TB
    subgraph Sources["Data Sources"]
        APP["Application\n+ OTel SDK"]
        PROM["Prometheus\nTargets"]
        LEGACY["Jaeger /\nZipkin Clients"]
        FB["FluentBit /\nFluentD"]
        RUM["Browser\n(RUM SDK)"]
    end

    subgraph CollectorFleet["OTel Collector Fleet"]
        direction TB
        GW["Gateway Collectors\n(load-balanced)"]
        AGT["Agent Collectors\n(DaemonSet, optional)"]
    end

    subgraph Backend["SigNoz Backend"]
        direction TB
        QS["Query Service\n(Go API)"]
        Rule["Ruler +\nAlertmanager"]
        OpAMP["OpAMP Server\n(dynamic config)"]
        EE["Enterprise Extensions\n(SSO, RBAC, SAML)"]
        FE["React Frontend\n(SPA)"]
    end

    subgraph CHCluster["ClickHouse Cluster"]
        direction TB
        Shard1["Shard 1\n(Replica A + B)"]
        Shard2["Shard 2\n(Replica A + B)"]
        ZK["ZooKeeper /\nClickHouse Keeper"]

        subgraph Tables["Core Tables"]
            T_Traces["signoz_traces\n.signoz_index_v2"]
            T_Logs["signoz_logs\n.logs"]
            T_Metrics["signoz_metrics\n.samples_v4"]
        end
    end

    subgraph Meta["Metadata"]
        PG["PostgreSQL\n(metadata, auth)"]
    end

    Sources --> CollectorFleet
    GW -->|"ClickHouse\nexporter"| CHCluster
    AGT -->|"forward"| GW
    QS -->|"query"| CHCluster
    Rule -->|"eval"| CHCluster
    QS --> FE
    QS --> PG
    OpAMP -.->|"reconfigure"| CollectorFleet

    style Backend fill:#7b1fa2,color:#fff
    style CHCluster fill:#1565c0,color:#fff

Component breakdown, deployment topologies, and data flow for SigNoz.

System Architecture

Component Responsibility Matrix

Component	Language	Role	Scales Via
OTel Collector (Gateway)	Go	Ingestion, processing, routing	Horizontal (replicas behind LB)
OTel Collector (Agent)	Go	Per-node collection, forwarding	DaemonSet (1 per node)
Query Service	Go	API layer, ClickHouse queries	Horizontal (stateless)
Ruler / Alertmanager	Go	Alert evaluation, notifications	Single leader
OpAMP Server	Go	Dynamic collector reconfiguration	Single instance
React Frontend	TypeScript	UI, dashboards, query builder	Static assets (CDN/replicas)
ClickHouse	C++	Columnar storage for all signals	Sharding + replication
ZooKeeper / Keeper	Java/C++	ClickHouse coordination	3-node ensemble
PostgreSQL	C	Metadata, user auth, settings	Standard HA (RDS etc.)

Deployment Topologies

Small (< 50 GB/day)

flowchart LR
    OTel["OTel Collector\n(single)"]
    QS["Query Service"]
    FE["Frontend"]
    CH["ClickHouse\n(single node)"]
    PG["PostgreSQL"]

    OTel --> CH
    QS --> CH
    QS --> PG
    QS --> FE

Production (50–200 GB/day)

flowchart LR
    subgraph Collectors["Collector Fleet"]
        C1["Collector 1"]
        C2["Collector 2"]
        C3["Collector 3"]
    end

    LB["Load Balancer"]
    subgraph QSPool["Query Service Pool"]
        QS1["QS 1"]
        QS2["QS 2"]
    end

    subgraph CHCluster["ClickHouse (2×2)"]
        S1R1["Shard1 Rep1"]
        S1R2["Shard1 Rep2"]
        S2R1["Shard2 Rep1"]
        S2R2["Shard2 Rep2"]
    end

    Collectors --> LB --> CHCluster
    QSPool --> CHCluster

ClickHouse Storage Schema Detail

Trace Index Table

Column	Type	Purpose
`timestamp`	DateTime64(9)	Nanosecond precision timestamp
`traceID`	FixedString(32)	128-bit trace identifier
`spanID`	String	Span identifier
`parentSpanID`	String	Parent span link
`serviceName`	LowCardinality(String)	Service name
`name`	LowCardinality(String)	Operation name
`kind`	Int8	Span kind (server/client/etc.)
`durationNano`	UInt64	Span duration
`statusCode`	Int16	Status code
`httpMethod`	LowCardinality(String)	HTTP method
`httpRoute`	LowCardinality(String)	HTTP route
`resourceAttributes`	Map(String, String)	Resource attributes
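The columns above amount to a flattened view of an OpenTelemetry span. As a minimal sketch, assuming a simplified span shape (the `Span` class, the `to_index_row` helper, and the `http.method` / `http.route` attribute keys are illustrative assumptions, not SigNoz's exporter code), the mapping might look like this. It also shows why `FixedString(32)` fits a 128-bit ID: 16 raw bytes hex-encode to exactly 32 characters.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Span:
    # Hypothetical, simplified input; real OTLP spans carry many more fields.
    trace_id: bytes            # 16 raw bytes = 128 bits
    span_id: bytes             # 8 raw bytes = 64 bits
    parent_span_id: bytes      # empty for a root span
    service_name: str
    name: str
    kind: int
    start_ns: int
    end_ns: int
    status_code: int
    attributes: Dict[str, str] = field(default_factory=dict)
    resource: Dict[str, str] = field(default_factory=dict)


def to_index_row(span: Span) -> dict:
    """Flatten a span into a dict keyed by the index-table column names."""
    if len(span.trace_id) != 16:
        raise ValueError("trace ID must be 16 bytes (128 bits)")
    return {
        "timestamp": span.start_ns,
        "traceID": span.trace_id.hex(),          # 32 hex chars -> FixedString(32)
        "spanID": span.span_id.hex(),
        "parentSpanID": span.parent_span_id.hex(),
        "serviceName": span.service_name,
        "name": span.name,
        "kind": span.kind,
        "durationNano": span.end_ns - span.start_ns,
        "statusCode": span.status_code,
        # Assumed attribute keys (older OTel HTTP semantic conventions).
        "httpMethod": span.attributes.get("http.method", ""),
        "httpRoute": span.attributes.get("http.route", ""),
        "resourceAttributes": dict(span.resource),
    }
```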

Log Table

Column	Type	Purpose
`timestamp`	UInt64	Unix nanoseconds
`body`	String	Log message body
`severityText`	LowCardinality(String)	ERROR, WARN, INFO, etc.
`severityNumber`	UInt8	Numeric severity
`traceID`	String	Correlation to traces
`spanID`	String	Correlation to spans
`resourceAttributes`	Map(String, String)	Resource context
`logAttributes`	Map(String, String)	Log-specific attributes
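`severityText` holds whatever level name the source emits, while `severityNumber` follows the OpenTelemetry log data model, which assigns a range of numbers to each level (1–4 TRACE, 5–8 DEBUG, 9–12 INFO, 13–16 WARN, 17–20 ERROR, 21–24 FATAL). A minimal sketch of that correspondence, and of why the numeric column is convenient for range filters; the helper names are hypothetical and sub-levels such as `INFO2` are collapsed.

```python
# SeverityNumber ranges defined by the OpenTelemetry log data model.
_LEVELS = [(1, "TRACE"), (5, "DEBUG"), (9, "INFO"), (13, "WARN"), (17, "ERROR"), (21, "FATAL")]


def severity_text(number: int) -> str:
    """Coarse level name for an OTel SeverityNumber; 0 means unspecified."""
    if not 0 <= number <= 24:
        raise ValueError(f"severityNumber out of range: {number}")
    name = ""
    for lower_bound, level in _LEVELS:
        if number >= lower_bound:
            name = level
    return name


def errors_or_worse(rows: list) -> list:
    """A numeric range filter: ERROR (17-20) and FATAL (21-24) records."""
    return [r for r in rows if r["severityNumber"] >= 17]
```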

Data Model

erDiagram
    Signoz_CORE ||--o{ CONFIG : requires
    Signoz_CORE ||--o{ STATE : writes
    CONFIG {
        string runtime_params
        string limits
    }
    STATE {
        string metric_id
        json payload
    }

How It Works

How SigNoz processes telemetry through its OTel-native pipeline, stores data in ClickHouse, and provides unified observability.

Data Pipeline

Ingestion Flow

flowchart LR
    subgraph Sources["Data Sources"]
        APP["App + OTel SDK"]
        PROM["Prometheus"]
        JAEG["Jaeger / Zipkin"]
        FB["FluentBit / FluentD"]
    end

    subgraph Collector["SigNoz OTel Collector"]
        Recv["Receivers\n(OTLP, Jaeger, Zipkin,\nPrometheus)"]
        Proc["Processors\n(batch, memory_limiter,\nattribute, tail_sampling)"]
        Exp["Exporters\n(ClickHouse)"]
    end

    subgraph Backend["SigNoz Backend"]
        QS["Query Service\n(Go API)"]
        FE["React Frontend"]
        Rule["Ruler /\nAlertmanager"]
        OpAMP["OpAMP Server\n(dynamic config)"]
    end

    subgraph CH["ClickHouse Cluster"]
        T["signoz_traces"]
        L["signoz_logs"]
        M["signoz_metrics"]
    end

    Sources --> Recv --> Proc --> Exp --> CH
    QS --> CH
    Rule --> CH
    QS --> FE
    OpAMP -.->|reconfigure| Collector
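The batch processor sits before the ClickHouse exporter because ClickHouse handles a few large inserts far better than many small ones. A minimal sketch of size-or-timeout batching; the `Batcher` class is hypothetical, and while its defaults mirror the upstream batch processor's documented `send_batch_size` (8192) and `timeout` (200ms), the real processor's internals differ.

```python
import time
from typing import Callable, List, Optional


class Batcher:
    """Flush when the buffer reaches max_size items, or when the oldest
    buffered item has waited max_age seconds (checked on tick())."""

    def __init__(self, flush: Callable[[List[dict]], None], max_size: int = 8192,
                 max_age: float = 0.2, clock: Callable[[], float] = time.monotonic):
        self._flush = flush
        self._max_size = max_size
        self._max_age = max_age
        self._clock = clock
        self._buf: List[dict] = []
        self._first_at: Optional[float] = None

    def add(self, item: dict) -> None:
        if not self._buf:
            self._first_at = self._clock()  # start the age timer on the first item
        self._buf.append(item)
        if len(self._buf) >= self._max_size:
            self.flush()

    def tick(self) -> None:
        """Call periodically; sends a partial batch once it is old enough."""
        if self._buf and self._clock() - self._first_at >= self._max_age:
            self.flush()

    def flush(self) -> None:
        if self._buf:
            batch, self._buf, self._first_at = self._buf, [], None
            self._flush(batch)  # e.g. one INSERT into ClickHouse
```

Injecting the clock keeps the timeout path testable without sleeping.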

OTel Collector Distribution

SigNoz ships a custom OpenTelemetry Collector distribution that includes:

Component	Purpose
OTLP Receiver	Primary ingestion (gRPC + HTTP)
Prometheus Receiver	Scrape Prometheus targets
Jaeger/Zipkin Receiver	Legacy trace format support
FluentForward Receiver	FluentBit/FluentD log ingestion
Batch Processor	Batches data for efficient ClickHouse writes
Memory Limiter	Prevents OOM under load
Tail Sampling	Sample traces based on latency/error criteria
ClickHouse Exporter	Writes all signals to ClickHouse
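Tail sampling decides after all spans of a trace have been buffered, which is what lets it keep traces on latency or error criteria. A minimal sketch of an error-or-latency policy, loosely modeled on the upstream tail_sampling processor's `status_code` and `latency` policy types (trace duration measured from the earliest span start to the latest span end); `SpanSummary`, `keep_trace`, and the 500 ms threshold are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SpanSummary:
    start_ns: int
    end_ns: int
    is_error: bool


def trace_duration_ms(spans: List[SpanSummary]) -> float:
    """Earliest start to latest end across the buffered spans of one trace."""
    return (max(s.end_ns for s in spans) - min(s.start_ns for s in spans)) / 1_000_000


def keep_trace(spans: List[SpanSummary], latency_threshold_ms: float = 500.0) -> bool:
    """Keep the trace if any span errored OR the whole trace exceeds the threshold."""
    if not spans:
        return False
    if any(s.is_error for s in spans):
        return True
    return trace_duration_ms(spans) > latency_threshold_ms
```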

OpAMP (Open Agent Management Protocol)

SigNoz uses OpAMP for dynamic reconfiguration of the OTel Collector:

Log pipelines: Add/modify log processing rules without collector restart
Sampling rules: Adjust tail sampling dynamically
Collector health: Monitor collector instances from the SigNoz UI
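Reconfiguring without a restart depends on the agent recognizing which config it already runs. In the OpAMP specification, an offered remote config carries a hash and the agent reports the hash of the config it last applied, so re-sending an unchanged config is a no-op. A simplified agent-side sketch of that idea; `CollectorAgent`, `config_hash`, and the use of SHA-256 over canonical JSON are assumptions, not SigNoz's or the OpAMP SDK's actual types.

```python
import hashlib
import json
from dataclasses import dataclass, field
from typing import Dict, Optional


def config_hash(config: Dict[str, str]) -> str:
    """Hash a config map canonically so key order does not change the result."""
    return hashlib.sha256(json.dumps(config, sort_keys=True).encode()).hexdigest()


@dataclass
class CollectorAgent:
    # Hypothetical agent-side state.
    last_applied_hash: Optional[str] = None
    running_config: Dict[str, str] = field(default_factory=dict)
    reloads: int = 0

    def on_remote_config(self, config: Dict[str, str], offered_hash: str) -> bool:
        """Apply an offered config unless it is the one already running.

        Returns True if a reload happened; the agent would then report
        last_applied_hash back to the server as its remote-config status."""
        if offered_hash == self.last_applied_hash:
            return False
        self.running_config = dict(config)
        self.last_applied_hash = offered_hash
        self.reloads += 1  # stands in for hot-reloading the collector pipelines
        return True
```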

Storage Schema (ClickHouse)

Traces

-- signoz_traces.signoz_index_v2
-- Core trace/span index with columnar storage
-- Columns: traceID, spanID, serviceName, name, kind, durationNano,
--          statusCode, httpMethod, httpRoute, resourceAttributes, ...
-- Engine: MergeTree, partitioned by toDate(timestamp)
-- TTL: Configurable (default 7 days self-hosted, 15 days cloud)
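Daily partitioning and TTL interact: a row expires TTL days after its timestamp, and once every row of a daily partition has expired the whole partition can go. A minimal sketch of that arithmetic, assuming the ClickHouse server timezone is UTC (`toDate` uses the server timezone); the helper names are hypothetical, and 7 days is simply the self-hosted default quoted above.

```python
from datetime import date, datetime, timedelta, timezone
from typing import Iterable, List

NS_PER_S = 1_000_000_000


def partition_date(timestamp_ns: int) -> date:
    """Mirror of toDate(timestamp), assuming the server timezone is UTC."""
    return datetime.fromtimestamp(timestamp_ns // NS_PER_S, tz=timezone.utc).date()


def expires_at(timestamp_ns: int, ttl_days: int) -> datetime:
    """When a row becomes eligible for TTL deletion."""
    return datetime.fromtimestamp(timestamp_ns // NS_PER_S, tz=timezone.utc) + timedelta(days=ttl_days)


def fully_expired(partitions: Iterable[date], today: date, ttl_days: int) -> List[date]:
    """Daily partitions whose newest possible row is past the TTL at the start of `today`.

    The newest row of partition p is just before p + 1 day, so it expires
    by p + 1 + ttl_days; that is <= today exactly when p < today - ttl_days."""
    cutoff = today - timedelta(days=ttl_days)
    return sorted(p for p in partitions if p < cutoff)
```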

Logs

-- signoz_logs.logs
-- Columnar log storage with full-text indexing
-- Columns: timestamp, body, severityText, severityNumber,
--          traceID, spanID, resourceAttributes, logAttributes
-- Engine: MergeTree, partitioned by toDate(timestamp)
-- Supports: JSON expansion, attribute indexing
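JSON expansion turns a structured log body into queryable attributes that fit a `Map(String, String)` column. A minimal sketch, assuming dot-separated keys, list indices as key segments, and non-string scalars serialized as JSON text; `flatten` and `expand_body` are hypothetical and SigNoz's own JSON handling may name or type keys differently.

```python
import json
from typing import Any, Dict, Optional


def flatten(value: Any, prefix: str = "", out: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Flatten nested JSON into dot-separated keys with string values."""
    if out is None:
        out = {}
    if isinstance(value, dict):
        for key, child in value.items():
            flatten(child, f"{prefix}.{key}" if prefix else str(key), out)
    elif isinstance(value, list):
        for i, child in enumerate(value):
            flatten(child, f"{prefix}.{i}" if prefix else str(i), out)
    else:
        # Strings stay as-is; numbers, booleans and null become JSON text.
        out[prefix] = value if isinstance(value, str) else json.dumps(value)
    return out


def expand_body(body: str) -> Dict[str, str]:
    """Return flattened attributes if the body is a JSON object, else nothing."""
    try:
        parsed = json.loads(body)
    except json.JSONDecodeError:
        return {}
    return flatten(parsed) if isinstance(parsed, dict) else {}
```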

Metrics

-- signoz_metrics.samples_v4
-- Time-series samples with metric metadata
-- Columns: metric_name, fingerprint, timestamp_ms, value,
--          labels (Map), temporality, type
-- Engine: MergeTree, partitioned by toDate(timestamp_ms / 1000)
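The `fingerprint` column identifies one time series, i.e. one metric name plus one label set, so all samples of that series share a compact key. A minimal sketch of such a fingerprint; SigNoz's actual hash function is not specified here, and BLAKE2b with length-prefixed fields is used purely as an illustrative assumption.

```python
import hashlib
from typing import Dict


def fingerprint(metric_name: str, labels: Dict[str, str]) -> int:
    """Stable 64-bit series ID: the same name and label set, in any order, give the same ID."""
    h = hashlib.blake2b(digest_size=8)
    parts = [metric_name]
    for key in sorted(labels):          # sort so label order does not matter
        parts += [key, labels[key]]
    for part in parts:
        data = part.encode("utf-8")
        # Length-prefix each field so ("ab", "c") and ("a", "bc") hash differently.
        h.update(len(data).to_bytes(4, "big"))
        h.update(data)
    return int.from_bytes(h.digest(), "big")
```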
-- Query: PromQL translated to ClickHouse SQL

Query Execution

Dual Query Language Support

Signal	Query Language	How It Works
Metrics	PromQL	Translated to ClickHouse SQL by the query service
Logs	ClickHouse SQL	Direct columnar queries with filter pushdown
Traces	ClickHouse SQL	Span-level queries with attribute filtering
All	Query Builder	Visual query builder generates optimized CH SQL

Query Builder → ClickHouse Translation

The React frontend's visual query builder generates structured query payloads that the Go query service translates into optimized ClickHouse SQL:

1. The user builds a query visually (aggregation, filters, group-by).
2. The frontend sends a structured JSON payload to the API.
3. The query service compiles the payload to ClickHouse SQL, using materialized columns where available.
4. ClickHouse executes the query with columnar, vectorized processing.
5. Results are returned as time-series or table data.
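The compile step above can be sketched as a function from a structured payload to a SQL string. This is a minimal illustration, not SigNoz's compiler: the `BuilderQuery` shape, the aggregate and column allowlists, and the generated SQL are all assumptions against the `signoz_traces.signoz_index_v2` columns listed earlier. Because the payload comes from the client, column names are checked against an allowlist and literal values are escaped rather than interpolated raw.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical allowlists; SigNoz's real builder schema is much richer.
AGGREGATES = {
    "count": "count()",
    "avg_duration": "avg(durationNano)",
    "p99_duration": "quantile(0.99)(durationNano)",
}
COLUMNS = {"serviceName", "name", "httpMethod", "httpRoute"}
OPERATORS = {"=", "!="}


@dataclass
class BuilderQuery:
    aggregate: str
    filters: List[Tuple[str, str, str]] = field(default_factory=list)  # (column, op, value)
    group_by: List[str] = field(default_factory=list)
    step_seconds: int = 60


def quote(value: str) -> str:
    """ClickHouse string literal with backslashes and single quotes escaped."""
    return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"


def compile_traces(q: BuilderQuery, start_ns: int, end_ns: int) -> str:
    if q.aggregate not in AGGREGATES:
        raise ValueError(f"unsupported aggregate: {q.aggregate}")
    for column in q.group_by + [f[0] for f in q.filters]:
        if column not in COLUMNS:  # reject anything not explicitly allowed
            raise ValueError(f"unknown column: {column}")
    select = [f"toStartOfInterval(timestamp, INTERVAL {int(q.step_seconds)} SECOND) AS ts"]
    select += q.group_by + [f"{AGGREGATES[q.aggregate]} AS value"]
    where = [
        f"timestamp >= fromUnixTimestamp64Nano(toInt64({int(start_ns)}))",
        f"timestamp < fromUnixTimestamp64Nano(toInt64({int(end_ns)}))",
    ]
    for column, op, value in q.filters:
        if op not in OPERATORS:
            raise ValueError(f"unsupported operator: {op}")
        where.append(f"{column} {op} {quote(value)}")
    keys = ", ".join(["ts"] + q.group_by)
    return (
        f"SELECT {', '.join(select)} FROM signoz_traces.signoz_index_v2 "
        f"WHERE {' AND '.join(where)} GROUP BY {keys} ORDER BY {keys}"
    )
```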

Cross-Signal Correlation

SigNoz enables correlation between signals using shared identifiers:

flowchart LR
    Trace["Trace\n(traceID)"] <-->|traceID in log| Log["Log\n(traceID, spanID)"]
    Trace <-->|service + timestamp| Metric["Metric\n(service, operation)"]
    Log <-->|service + timestamp| Metric

Trace → Log: Click a span to see logs with matching traceID
Log → Trace: Click a log with traceID to jump to the trace waterfall
Metric → Trace: Drill down from a latency spike to exemplar traces
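The Trace → Log jump relies only on the shared `traceID` (and optionally `spanID`) columns. In practice the lookup would be a filter on `signoz_logs.logs`; the in-memory index below is a minimal sketch that only illustrates the join key, and `logs_by_trace` / `logs_for_span` are hypothetical helpers.

```python
from collections import defaultdict
from typing import Dict, List, Optional


def logs_by_trace(logs: List[dict]) -> Dict[str, List[dict]]:
    """Index log rows by traceID, skipping logs that carry no trace context."""
    index: Dict[str, List[dict]] = defaultdict(list)
    for row in logs:
        if row.get("traceID"):
            index[row["traceID"]].append(row)
    return index


def logs_for_span(index: Dict[str, List[dict]], trace_id: str,
                  span_id: Optional[str] = None) -> List[dict]:
    """Trace -> Log: all logs of the trace, optionally narrowed to one span, oldest first."""
    rows = index.get(trace_id, [])
    if span_id is not None:
        rows = [r for r in rows if r.get("spanID") == span_id]
    return sorted(rows, key=lambda r: r["timestamp"])
```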

Alerting Pipeline

flowchart LR
    Rule["Alert Rule\n(PromQL / CH SQL)"] --> Eval["Ruler\n(periodic eval)"]
    Eval -->|threshold breach| AM["Alertmanager"]
    AM --> Slack["Slack"]
    AM --> PD["PagerDuty"]
    AM --> WH["Webhook"]
    AM --> Email["Email"]
    AM --> MST["MS Teams"]

Rules can be defined on any signal type (metrics, logs, traces)
Anomaly detection available for automated threshold learning
Alert history tracked with state transitions
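Periodic evaluation with tracked state transitions can be sketched as a small state machine. This is a minimal sketch modeled on Prometheus-style alert states (`inactive`, `pending`, `firing`) with a hold duration before firing; the `AlertRule` class, the strictly-greater-than comparison, and the state names are assumptions rather than SigNoz's ruler code. In this model, a rule entering `firing` is what would be handed to Alertmanager.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class AlertRule:
    threshold: float
    for_seconds: float                      # how long a breach must persist before firing
    state: str = "inactive"
    active_since: Optional[float] = None
    history: List[Tuple[float, str]] = field(default_factory=list)

    def evaluate(self, now: float, value: float) -> str:
        """One periodic evaluation; records only the transitions in history."""
        if value > self.threshold:
            if self.active_since is None:
                self.active_since = now     # breach just started
            held = now - self.active_since >= self.for_seconds
            new_state = "firing" if held else "pending"
        else:
            new_state, self.active_since = "inactive", None
        if new_state != self.state:
            self.history.append((now, new_state))
            self.state = new_state
        return self.state
```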

Benchmarks

Performance characteristics, capacity planning data, and scale limits for SigNoz.

ClickHouse Performance

vs ELK Stack

Metric	SigNoz (ClickHouse)	ELK Stack	Advantage
Log ingestion speed	Baseline	~2.5x slower	SigNoz 2.5x faster
Resource consumption	Baseline	~2x more	SigNoz 50% less
Aggregate query speed	Baseline	~13x slower	SigNoz up to 13x faster
Ingestion capacity	10+ TB/day	Similar	Comparable
Compression ratio	10–30x (columnar)	1.5x (Lucene)	SigNoz 7–20x better

Source: SigNoz vendor benchmarks. Cross-validated against ClickHouse engineering blog data on columnar efficiency.

High Cardinality Handling

Aspect	Detail
Approach	Columnar storage — no inverted index explosion
Impact	A dimension with billions of unique values is stored as ordinary column data rather than as index entries, so it does not cause index blow-up
Best for	Logs and traces with rich metadata
Caution	Avoid high-cardinality attributes as metric labels
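The caution about metric labels follows from how series multiply: every distinct combination of label values is its own time series, whereas on a log line or span a high-cardinality value is just one more column value. A minimal worked sketch of the worst-case series count (the function name and label cardinalities are hypothetical; the real count is the number of combinations actually observed, so this is an upper bound).

```python
from math import prod
from typing import Dict


def worst_case_series(label_cardinalities: Dict[str, int]) -> int:
    """Upper bound on series for one metric: one series per combination of label values."""
    return prod(label_cardinalities.values())
```

For example, 50 services × 5 methods × 10 status codes allows 2,500 series; adding a `user_id` label with a million values raises the bound to 2.5 billion, far past the 10M+ active-series figure under Scale Limits.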

Capacity Planning

Resource Matrix (from SigNoz Official Docs)

Component	Small (< 10 GB/day)	Medium (10–50 GB/day)	Large (50–200 GB/day)
OTel Collectors	1 replica, 1 CPU, 2 GB	2 replicas, 2 CPU, 4 GB	4+ replicas, 4 CPU, 8 GB
Query Service	1 replica, 0.5 CPU, 1 GB	2 replicas, 1 CPU, 2 GB	2 replicas, 2 CPU, 4 GB
ClickHouse	1 node, 4 CPU, 16 GB	2 shards × 2 replicas, 8 CPU, 32 GB	4+ shards × 2 replicas, 16 CPU, 64 GB
ZooKeeper / Keeper	1 node, 0.5 CPU, 1 GB	3 nodes, 1 CPU, 2 GB	3 nodes, 2 CPU, 4 GB
PostgreSQL	1 node, 0.5 CPU, 1 GB	Managed DB (RDS)	Managed DB (RDS)
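Read as a lookup, the matrix maps daily ingest volume to a ClickHouse layout. A minimal sketch; `clickhouse_layout` is hypothetical, assigning the shared 50 GB/day boundary to Medium is an arbitrary choice, and for Large the matrix says "4+" shards, so 4 is only the floor.

```python
from typing import Tuple


def clickhouse_layout(gb_per_day: float) -> Tuple[str, int, int]:
    """Return (tier, shards, replicas_per_shard) from the resource matrix."""
    if gb_per_day < 0:
        raise ValueError("volume must be non-negative")
    if gb_per_day < 10:
        return ("Small", 1, 1)
    if gb_per_day <= 50:
        return ("Medium", 2, 2)
    if gb_per_day <= 200:
        return ("Large", 4, 2)
    raise ValueError("beyond the documented matrix; size from benchmarks instead")
```

Total ClickHouse nodes are shards × replicas, e.g. at least 8 for the Large tier.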

Cloud Instance Recommendations

Cloud	General Purpose (Collectors, QS)	Compute-Optimized (ClickHouse)
AWS	T3 family+ (Intel), T4g+ (ARM)	C5+ (Intel), C6g/C7g+ (ARM)
GCP	E2 family+	C3 / C3D+

Storage Sizing

Signal	Daily Volume	15-Day Retention	30-Day Retention
Logs (10:1 compression)	50 GB raw/day	~75 GB disk	~150 GB disk
Traces (15:1 compression)	20 GB raw/day	~20 GB disk	~40 GB disk
Metrics (30:1 compression)	5 GB raw/day	~2.5 GB disk	~5 GB disk
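Every row of the table follows one formula: raw daily volume ÷ compression ratio × retention days. A minimal sketch that reproduces the figures; the `replicas` parameter is an added assumption (the table appears to describe a single copy, and with replicated tables each replica stores its own copy of its shard's data).

```python
def disk_gb(raw_gb_per_day: float, compression_ratio: float,
            retention_days: int, replicas: int = 1) -> float:
    """Steady-state on-disk size: compressed daily volume times days kept, per copy."""
    if compression_ratio <= 0:
        raise ValueError("compression ratio must be positive")
    return raw_gb_per_day / compression_ratio * retention_days * replicas
```

Summing the three rows at 15-day retention gives about 97.5 GB per copy (75 + 20 + 2.5), or about 195 GB with two replicas.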

Scale Limits

Dimension	Practical Limit	Notes
Daily ingestion	10+ TB/day	Requires multi-shard ClickHouse
Active time series	10M+	ClickHouse handles high cardinality well
Concurrent queries	50–100	Depends on ClickHouse node count
Trace span retention	15–90 days typical	Storage cost-limited
Log retention	15–90 days typical	ClickHouse TTL-managed

Known Performance Considerations

System table growth: ClickHouse's system.query_log and system.zookeeper_log tables can grow rapidly. Monitor their size and set TTLs.
ClickHouse part merges: Under very high ingestion, ensure sufficient CPU for background merges.
ZooKeeper latency: In replicated (multi-replica) setups, ZooKeeper latency directly impacts replication lag.

Caveats

Benchmarks are from SigNoz vendor testing and ClickHouse engineering publications.
Actual performance varies significantly based on data patterns, cardinality, and query complexity.
Managed ClickHouse providers may exhibit different resource profiles.