Architecture

1. Default Topology / Flow

flowchart TB
    subgraph Sources["Data Sources"]
        APP["Application\n+ OTel SDK"]
        PROM["Prometheus\nTargets"]
        LEGACY["Jaeger /\nZipkin Clients"]
        FB["FluentBit /\nFluentD"]
        RUM["Browser\n(RUM SDK)"]
    end

    subgraph CollectorFleet["OTel Collector Fleet"]
        direction TB
        GW["Gateway Collectors\n(load-balanced)"]
        AGT["Agent Collectors\n(DaemonSet, optional)"]
    end

    subgraph Backend["SigNoz Backend"]
        direction TB
        QS["Query Service\n(Go API)"]
        Rule["Ruler +\nAlertmanager"]
        OpAMP["OpAMP Server\n(dynamic config)"]
        EE["Enterprise Extensions\n(SSO, RBAC, SAML)"]
        FE["React Frontend\n(SPA)"]
    end

    subgraph CHCluster["ClickHouse Cluster"]
        direction TB
        Shard1["Shard 1\n(Replica A + B)"]
        Shard2["Shard 2\n(Replica A + B)"]
        ZK["ZooKeeper /\nClickHouse Keeper"]

        subgraph Tables["Core Tables"]
            T_Traces["signoz_traces\n.signoz_index_v2"]
            T_Logs["signoz_logs\n.logs"]
            T_Metrics["signoz_metrics\n.samples_v4"]
        end
    end

    subgraph Meta["Metadata"]
        PG["PostgreSQL\n(metadata, auth)"]
    end

    Sources --> CollectorFleet
    GW -->|"ClickHouse\nexporter"| CHCluster
    AGT -->|"forward"| GW
    QS -->|"query"| CHCluster
    Rule -->|"eval"| CHCluster
    QS --> FE
    QS --> PG
    OpAMP -.->|"reconfigure"| CollectorFleet

    style Backend fill:#7b1fa2,color:#fff
    style CHCluster fill:#1565c0,color:#fff

Component breakdown, deployment topologies, and data flow for SigNoz.

System Architecture

Component Responsibility Matrix

| Component | Language | Role | Scales Via |
|---|---|---|---|
| OTel Collector (Gateway) | Go | Ingestion, processing, routing | Horizontal (replicas behind LB) |
| OTel Collector (Agent) | Go | Per-node collection, forwarding | DaemonSet (1 per node) |
| Query Service | Go | API layer, ClickHouse queries | Horizontal (stateless) |
| Ruler / Alertmanager | Go | Alert evaluation, notifications | Single leader |
| OpAMP Server | Go | Dynamic collector reconfiguration | Single instance |
| React Frontend | TypeScript | UI, dashboards, query builder | Static assets (CDN/replicas) |
| ClickHouse | C++ | Columnar storage for all signals | Sharding + replication |
| ZooKeeper / Keeper | Java/C++ | ClickHouse coordination | 3-node ensemble |
| PostgreSQL | C | Metadata, user auth, settings | Standard HA (RDS etc.) |

Deployment Topologies

Small (< 50 GB/day)

flowchart LR
    OTel["OTel Collector\n(single)"]
    QS["Query Service"]
    FE["Frontend"]
    CH["ClickHouse\n(single node)"]
    PG["PostgreSQL"]

    OTel --> CH
    QS --> CH
    QS --> PG
    QS --> FE

Production (50–200 GB/day)

flowchart LR
    subgraph Collectors["Collector Fleet"]
        C1["Collector 1"]
        C2["Collector 2"]
        C3["Collector 3"]
    end

    LB["Load Balancer"]
    subgraph QSPool["Query Service Pool"]
        QS1["QS 1"]
        QS2["QS 2"]
    end

    subgraph CHCluster["ClickHouse (2×2)"]
        S1R1["Shard1 Rep1"]
        S1R2["Shard1 Rep2"]
        S2R1["Shard2 Rep1"]
        S2R2["Shard2 Rep2"]
    end

    LB --> Collectors --> CHCluster
    QSPool --> CHCluster

ClickHouse Storage Schema Detail

Trace Index Table

| Column | Type | Purpose |
|---|---|---|
| timestamp | DateTime64(9) | Nanosecond precision timestamp |
| traceID | FixedString(32) | 128-bit trace identifier (32 hex characters) |
| spanID | String | Span identifier |
| parentSpanID | String | Parent span link |
| serviceName | LowCardinality(String) | Service name |
| name | LowCardinality(String) | Operation name |
| kind | Int8 | Span kind (server/client/etc.) |
| durationNano | UInt64 | Span duration in nanoseconds |
| statusCode | Int16 | Status code |
| httpMethod | LowCardinality(String) | HTTP method |
| httpRoute | LowCardinality(String) | HTTP route |
| resourceAttributes | Map(String, String) | Resource attributes |
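
The `FixedString(32)` width of `traceID` follows from writing a 128-bit identifier as 32 hexadecimal characters. A minimal sketch of that encoding is shown here; the helper names are hypothetical and not part of SigNoz.

```python
def encode_trace_id(trace_id: int) -> str:
    """Render a 128-bit trace ID as a 32-character lowercase hex string."""
    if not 0 <= trace_id < 1 << 128:
        raise ValueError("trace ID must fit in 128 bits")
    return format(trace_id, "032x")


def decode_trace_id(hex_id: str) -> int:
    """Parse a 32-character hex trace ID back into an integer."""
    if len(hex_id) != 32:
        raise ValueError("trace ID must be exactly 32 hex characters")
    return int(hex_id, 16)


print(encode_trace_id(0xABCDEF))
```

Zero-padding keeps every value at exactly 32 characters, which is what a fixed-width column needs.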

Log Table

| Column | Type | Purpose |
|---|---|---|
| timestamp | UInt64 | Unix nanoseconds |
| body | String | Log message body |
| severityText | LowCardinality(String) | ERROR, WARN, INFO, etc. |
| severityNumber | UInt8 | Numeric severity |
| traceID | String | Correlation to traces |
| spanID | String | Correlation to spans |
| resourceAttributes | Map(String, String) | Resource context |
| logAttributes | Map(String, String) | Log-specific attributes |
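
Because the log `timestamp` is stored as Unix nanoseconds in a `UInt64`, client code usually has to split it before display. The sketch below uses integer arithmetic so no precision is lost to floating point; the function name is hypothetical.

```python
from datetime import datetime, timezone


def split_unix_nanos(ts_ns: int) -> tuple[datetime, int]:
    """Return (UTC datetime at second precision, leftover nanoseconds)."""
    seconds, nanos = divmod(ts_ns, 1_000_000_000)
    return datetime.fromtimestamp(seconds, tz=timezone.utc), nanos


moment, nanos = split_unix_nanos(1_700_000_000_123_456_789)
print(moment.isoformat(), nanos)
```

Python's `datetime` only resolves microseconds, so the sub-second part is returned separately rather than folded into the datetime.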

Data Model

erDiagram
    Signoz_CORE ||--o{ CONFIG : requires
    Signoz_CORE ||--o{ STATE : writes
    CONFIG {
        string runtime_params
        string limits
    }
    STATE {
        string metric_id
        json payload
    }

How It Works

How SigNoz processes telemetry through its OTel-native pipeline, stores data in ClickHouse, and provides unified observability.

Data Pipeline

Ingestion Flow

flowchart LR
    subgraph Sources["Data Sources"]
        APP["App + OTel SDK"]
        PROM["Prometheus"]
        JAEG["Jaeger / Zipkin"]
        FB["FluentBit / FluentD"]
    end

    subgraph Collector["SigNoz OTel Collector"]
        Recv["Receivers\n(OTLP, Jaeger, Zipkin,\nPrometheus)"]
        Proc["Processors\n(batch, memory_limiter,\nattributes, tail_sampling)"]
        Exp["Exporters\n(ClickHouse)"]
    end

    subgraph Backend["SigNoz Backend"]
        QS["Query Service\n(Go API)"]
        FE["React Frontend"]
        Rule["Ruler /\nAlertmanager"]
        OpAMP["OpAMP Server\n(dynamic config)"]
    end

    subgraph CH["ClickHouse Cluster"]
        T["signoz_traces"]
        L["signoz_logs"]
        M["signoz_metrics"]
    end

    Sources --> Recv --> Proc --> Exp --> CH
    QS --> CH
    Rule --> CH
    QS --> FE
    OpAMP -.->|reconfigure| Collector

OTel Collector Distribution

SigNoz ships a custom OpenTelemetry Collector distribution that includes:

| Component | Purpose |
|---|---|
| OTLP Receiver | Primary ingestion (gRPC + HTTP) |
| Prometheus Receiver | Scrape Prometheus targets |
| Jaeger/Zipkin Receiver | Legacy trace format support |
| FluentForward Receiver | FluentBit/FluentD log ingestion |
| Batch Processor | Batches data for efficient ClickHouse writes |
| Memory Limiter | Prevents OOM under load |
| Tail Sampling | Sample traces based on latency/error criteria |
| ClickHouse Exporter | Writes all signals to ClickHouse |
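
To make the wiring of these components concrete, here is a sketch that mirrors the usual collector configuration layout (receivers, processors, exporters, and per-signal pipelines) as a Python dictionary and checks it for consistency. The component IDs, in particular the single `clickhouse` exporter, are assumptions for illustration; the SigNoz distribution may use different names. Placing `memory_limiter` first in each processor chain follows general OpenTelemetry Collector guidance.

```python
# Hypothetical component IDs for illustration; check the shipped config for real names.
config = {
    "receivers": {"otlp": {}, "prometheus": {}, "jaeger": {}, "zipkin": {}, "fluentforward": {}},
    "processors": {"memory_limiter": {}, "tail_sampling": {}, "batch": {}},
    "exporters": {"clickhouse": {}},
    "service": {
        "pipelines": {
            "traces": {
                "receivers": ["otlp", "jaeger", "zipkin"],
                "processors": ["memory_limiter", "tail_sampling", "batch"],
                "exporters": ["clickhouse"],
            },
            "metrics": {
                "receivers": ["otlp", "prometheus"],
                "processors": ["memory_limiter", "batch"],
                "exporters": ["clickhouse"],
            },
            "logs": {
                "receivers": ["otlp", "fluentforward"],
                "processors": ["memory_limiter", "batch"],
                "exporters": ["clickhouse"],
            },
        }
    },
}


def validate(cfg: dict) -> list[str]:
    """Return a list of problems; an empty list means the wiring is consistent."""
    problems = []
    for name, pipeline in cfg["service"]["pipelines"].items():
        # Every component a pipeline references must be defined in its section.
        for section in ("receivers", "processors", "exporters"):
            for component in pipeline.get(section, []):
                if component not in cfg.get(section, {}):
                    problems.append(f"{name}: {component} is not defined under {section}")
        # The memory limiter only protects the collector if it runs before other processors.
        processors = pipeline.get("processors", [])
        if "memory_limiter" in processors and processors[0] != "memory_limiter":
            problems.append(f"{name}: memory_limiter should be the first processor")
    return problems


print(validate(config))
```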

OpAMP (Open Agent Management Protocol)

SigNoz uses OpAMP for dynamic reconfiguration of the OTel Collector:

  • Log pipelines: Add/modify log processing rules without collector restart
  • Sampling rules: Adjust tail sampling dynamically
  • Collector health: Monitor collector instances from the SigNoz UI

Storage Schema (ClickHouse)

Traces

-- signoz_traces.signoz_index_v2
-- Core trace/span index with columnar storage
-- Columns: traceID, spanID, serviceName, name, kind, durationNano,
--          statusCode, httpMethod, httpRoute, resourceAttributes, ...
-- Engine: MergeTree, partitioned by toDate(timestamp)
-- TTL: Configurable (default 7 days self-hosted, 15 days cloud)

Logs

-- signoz_logs.logs
-- Columnar log storage with full-text indexing
-- Columns: timestamp, body, severityText, severityNumber,
--          traceID, spanID, resourceAttributes, logAttributes
-- Engine: MergeTree, partitioned by toDate(timestamp)
-- Supports: JSON expansion, attribute indexing

Metrics

-- signoz_metrics.samples_v4
-- Time-series samples with metric metadata
-- Columns: metric_name, fingerprint, timestamp_ms, value,
--          labels (Map), temporality, type
-- Engine: MergeTree, partitioned by toDate(timestamp_ms)
-- Query: PromQL translated to ClickHouse SQL

Query Execution

Dual Query Language Support

| Signal | Query Language | How It Works |
|---|---|---|
| Metrics | PromQL | Translated to ClickHouse SQL by the query service |
| Logs | ClickHouse SQL | Direct columnar queries with filter pushdown |
| Traces | ClickHouse SQL | Span-level queries with attribute filtering |
| All | Query Builder | Visual query builder generates optimized CH SQL |

Query Builder → ClickHouse Translation

The React frontend's visual query builder generates structured query payloads that the Go query service translates into optimized ClickHouse SQL:

  1. User builds query visually (aggregation, filters, group-by)
  2. Frontend sends structured JSON payload to API
  3. Query Service compiles the payload to ClickHouse SQL, using materialized columns where they exist
  4. ClickHouse executes with columnar vectorized processing
  5. Results returned as time-series or table data
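
The translation step in the list above can be sketched as a small compiler from a structured payload to parameterized SQL. The payload shape, the function names, and the restriction to string-valued equality filters are all assumptions made for this example; they do not describe the real query service API. Column names come from the trace index table, and the placeholders use ClickHouse's `{name:Type}` query parameter syntax.

```python
TABLE = "signoz_traces.signoz_index_v2"
# Columns taken from the trace index table; this sketch handles string-valued filters only.
ALLOWED_COLUMNS = {"serviceName", "name", "httpMethod", "httpRoute", "durationNano"}
AGGREGATIONS = {
    "count": "count()",
    "avg": "avg({col})",
    "p99": "quantile(0.99)({col})",
}
OPERATORS = {"=", "!="}


def checked_column(name: str) -> str:
    """Reject unknown columns so user input never reaches the SQL text."""
    if name not in ALLOWED_COLUMNS:
        raise ValueError(f"unknown column: {name}")
    return name


def compile_query(payload: dict) -> tuple[str, dict]:
    """Turn a structured payload into (sql, params) with {name:Type} placeholders."""
    agg = payload["aggregation"]
    if agg["op"] not in AGGREGATIONS:
        raise ValueError(f"unsupported aggregation: {agg['op']}")
    agg_sql = AGGREGATIONS[agg["op"]].format(col=checked_column(agg.get("column", "durationNano")))
    step = int(payload.get("step_seconds", 60))
    group_cols = [checked_column(c) for c in payload.get("group_by", [])]

    params, conditions = {}, []
    for i, flt in enumerate(payload.get("filters", [])):
        if flt["op"] not in OPERATORS:
            raise ValueError(f"unsupported operator: {flt['op']}")
        key = f"p{i}"
        params[key] = str(flt["value"])  # values travel as parameters, not SQL text
        conditions.append(f"{checked_column(flt['column'])} {flt['op']} {{{key}:String}}")

    select_cols = [
        f"toStartOfInterval(timestamp, INTERVAL {step} SECOND) AS ts",
        *group_cols,
        f"{agg_sql} AS value",
    ]
    sql = f"SELECT {', '.join(select_cols)} FROM {TABLE}"
    if conditions:
        sql += " WHERE " + " AND ".join(conditions)
    sql += " GROUP BY " + ", ".join(["ts", *group_cols]) + " ORDER BY ts"
    return sql, params


payload = {
    "aggregation": {"op": "p99", "column": "durationNano"},
    "filters": [{"column": "httpMethod", "op": "=", "value": "GET"}],
    "group_by": ["serviceName"],
    "step_seconds": 60,
}
sql, params = compile_query(payload)
print(sql)
print(params)
```

Whitelisting column names and passing filter values as parameters keeps user input out of the generated SQL string.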

Cross-Signal Correlation

SigNoz enables correlation between signals using shared identifiers:

flowchart LR
    Trace["Trace\n(traceID)"] <-->|traceID in log| Log["Log\n(traceID, spanID)"]
    Trace <-->|service + timestamp| Metric["Metric\n(service, operation)"]
    Log <-->|service + timestamp| Metric

  • Trace → Log: Click a span to see logs with matching traceID
  • Log → Trace: Click a log with traceID to jump to the trace waterfall
  • Metric → Trace: Drill down from a latency spike to exemplar traces

Alerting Pipeline

flowchart LR
    Rule["Alert Rule\n(PromQL / CH SQL)"] --> Eval["Ruler\n(periodic eval)"]
    Eval -->|threshold breach| AM["Alertmanager"]
    AM --> Slack["Slack"]
    AM --> PD["PagerDuty"]
    AM --> WH["Webhook"]
    AM --> Email["Email"]
    AM --> MST["MS Teams"]

  • Rules can be defined on any signal type (metrics, logs, traces)
  • Anomaly detection available for automated threshold learning
  • Alert history tracked with state transitions
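
Periodic evaluation with tracked state transitions can be pictured as a small state machine. The sketch below is illustrative only: the class, the two state names, and the simple "value above threshold" condition are assumptions, not the actual ruler implementation.

```python
from dataclasses import dataclass, field


@dataclass
class AlertRule:
    name: str
    threshold: float
    state: str = "inactive"  # hypothetical state names: "inactive" or "firing"
    history: list = field(default_factory=list)

    def evaluate(self, timestamp: int, value: float) -> bool:
        """Evaluate one sample; return True when a notification should be sent."""
        new_state = "firing" if value > self.threshold else "inactive"
        changed = new_state != self.state
        if changed:
            # Record (when, from, to) so the alert history can be replayed later.
            self.history.append((timestamp, self.state, new_state))
            self.state = new_state
        return changed and new_state == "firing"


rule = AlertRule(name="high_latency_ms", threshold=500)
notifications = [rule.evaluate(t, v) for t, v in enumerate([100, 600, 700, 200])]
print(notifications)
print(rule.history)
```

Notifying only on the transition into the firing state avoids repeating the alert on every evaluation while the breach persists.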


Benchmarks

Performance characteristics, capacity planning data, and scale limits for SigNoz.

ClickHouse Performance

vs ELK Stack

| Metric | SigNoz (ClickHouse) | ELK Stack | Advantage |
|---|---|---|---|
| Log ingestion speed | Baseline | ~2.5x slower | SigNoz 2.5x faster |
| Resource consumption | Baseline | ~2x more | SigNoz 50% less |
| Aggregate query speed | Baseline | ~13x slower | SigNoz up to 13x faster |
| Ingestion capacity | 10+ TB/day | Similar | Comparable |
| Compression ratio | 10–30x (columnar) | 1.5x (Lucene) | SigNoz 7–20x better |

Source: SigNoz vendor benchmarks. Cross-validated against ClickHouse engineering blog data on columnar efficiency.

High Cardinality Handling

| Aspect | Detail |
|---|---|
| Approach | Columnar storage, so no inverted index explosion |
| Impact | Adding a dimension with billions of unique values is trivial |
| Best for | Logs and traces with rich metadata |
| Caution | Avoid high-cardinality attributes as metric labels |

Capacity Planning

Resource Matrix (from SigNoz Official Docs)

| Component | Small (< 10 GB/day) | Medium (10–50 GB/day) | Large (50–200 GB/day) |
|---|---|---|---|
| OTel Collectors | 1 replica, 1 CPU, 2 GB | 2 replicas, 2 CPU, 4 GB | 4+ replicas, 4 CPU, 8 GB |
| Query Service | 1 replica, 0.5 CPU, 1 GB | 2 replicas, 1 CPU, 2 GB | 2 replicas, 2 CPU, 4 GB |
| ClickHouse | 1 node, 4 CPU, 16 GB | 2 shards × 2 replicas, 8 CPU, 32 GB | 4+ shards × 2 replicas, 16 CPU, 64 GB |
| ZooKeeper / Keeper | 1 node, 0.5 CPU, 1 GB | 3 nodes, 1 CPU, 2 GB | 3 nodes, 2 CPU, 4 GB |
| PostgreSQL | 1 node, 0.5 CPU, 1 GB | Managed DB (RDS) | Managed DB (RDS) |
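
A planning script might map an expected daily volume onto one of the three columns above. This sketch assumes the shared boundaries (10 and 50 GB/day) belong to the larger tier, which the matrix itself does not specify; the function name is hypothetical.

```python
def sizing_tier(gb_per_day: float) -> str:
    """Pick the resource-matrix column for a daily ingestion volume in GB."""
    if gb_per_day < 0:
        raise ValueError("volume cannot be negative")
    if gb_per_day < 10:
        return "small"
    if gb_per_day < 50:  # assumption: exactly 10 GB/day counts as medium
        return "medium"
    if gb_per_day <= 200:  # assumption: exactly 50 GB/day counts as large
        return "large"
    raise ValueError("volume exceeds the ranges covered by the matrix")


for volume in (5, 30, 120):
    print(volume, sizing_tier(volume))
```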

Cloud Instance Recommendations

| Cloud | General Purpose (Collectors, QS) | Compute-Optimized (ClickHouse) |
|---|---|---|
| AWS | T3 family+ (Intel), T4g+ (ARM) | C5+ (Intel), C6g/C7g+ (ARM) |
| GCP | E2 family+ | C3 / C3D+ |

Storage Sizing

| Signal | Daily Volume | 15-Day Retention | 30-Day Retention |
|---|---|---|---|
| Logs (10:1 compression) | 50 GB raw/day | ~75 GB disk | ~150 GB disk |
| Traces (15:1 compression) | 20 GB raw/day | ~20 GB disk | ~40 GB disk |
| Metrics (30:1 compression) | 5 GB raw/day | ~2.5 GB disk | ~5 GB disk |
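
The figures in this table follow from raw daily volume divided by the compression ratio, multiplied by the retention period. A short sketch of that arithmetic is given here; the function name and the `replicas` parameter are additions for illustration, reflecting that each ClickHouse replica stores a full copy of its shard's data.

```python
def disk_gb(raw_gb_per_day: float, compression_ratio: float,
            retention_days: int, replicas: int = 1) -> float:
    """Estimate on-disk size in GB: raw / ratio * days, once per replica."""
    if compression_ratio <= 0:
        raise ValueError("compression ratio must be positive")
    return raw_gb_per_day / compression_ratio * retention_days * replicas


# Rows of the table above: (signal, raw GB/day, compression ratio).
for signal, raw, ratio in [("logs", 50, 10), ("traces", 20, 15), ("metrics", 5, 30)]:
    print(signal, round(disk_gb(raw, ratio, 15), 1), round(disk_gb(raw, ratio, 30), 1))
```

The table values correspond to `replicas=1`; a 2-replica cluster needs roughly twice the total disk across its nodes.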

Scale Limits

| Dimension | Practical Limit | Notes |
|---|---|---|
| Daily ingestion | 10+ TB/day | Requires multi-shard ClickHouse |
| Active time series | 10M+ | ClickHouse handles high cardinality well |
| Concurrent queries | 50–100 | Depends on ClickHouse node count |
| Trace span retention | 15–90 days typical | Storage cost-limited |
| Log retention | 15–90 days typical | ClickHouse TTL-managed |

Known Performance Considerations

  1. System table growth: ClickHouse's system.query_log and system.zookeeper_log tables can grow rapidly. Monitor their size and set TTLs.
  2. ClickHouse part merges: Under very high ingestion, provision enough CPU for background merges.
  3. ZooKeeper latency: In multi-shard setups, ZooKeeper (or ClickHouse Keeper) latency directly impacts replication lag.

Caveats

  • Benchmarks are from SigNoz vendor testing and ClickHouse engineering publications.
  • Actual performance varies significantly based on data patterns, cardinality, and query complexity.
  • Managed ClickHouse providers may exhibit different resource profiles.
