Multi-Cloud Governance -- Operations

Observability, cost management, and day-2 operational patterns for multi-cloud environments spanning AWS, GCP, Alibaba Cloud, and Tencent Cloud.

Observability

The OpenTelemetry Standard

OpenTelemetry (OTel) is a vendor-neutral observability framework hosted by the CNCF. It provides APIs, SDKs, and a collector pipeline for the three signals (traces, metrics, logs) plus an emerging fourth signal (continuous profiling). In a multi-cloud environment, OTel is the lingua franca that normalizes telemetry regardless of the underlying cloud provider.

OTel Collector Architecture for Multi-Cloud

The OTel Collector is the central pipeline component. In a multi-cloud deployment, the recommended pattern is per-cloud collector deployment with centralized backend export.

[ AWS Cluster ]                 [ GCP Cluster ]
  OTel Collector (DaemonSet)      OTel Collector (DaemonSet)
       | OTLP                          | OTLP
       v                               v
[ AWS Gateway Collector ] -----> [ Central Backend ]
                                    (Grafana LGTM /
                                     Datadog / Dynatrace)
       ^                               ^
       | OTLP                          | OTLP
[ Alibaba Cluster ]             [ Tencent Cluster ]
  OTel Collector (DaemonSet)      OTel Collector (DaemonSet)

Collector deployment modes:

Mode | Description | Use Case
DaemonSet (agent) | One collector per node | Low-latency node-local collection, host-level enrichment
Deployment (gateway) | N replicas as a standalone service | Centralized processing, tail sampling, multi-tenant routing
Sidecar | One collector per pod | Strong isolation, per-app config

Recommended multi-cloud pattern: DaemonSet agents on each cluster send to a per-cloud gateway collector. The gateway collector handles processing (batching, retry, attribute enrichment with cloud/region labels) and exports OTLP to the central backend.

OTel Collector Configuration (Multi-Backend Export)

# Collector ConfigMap -- exports to Grafana LGTM stack
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 75
    spike_limit_percentage: 15
  batch:
    send_batch_size: 8192
    timeout: 5s
  resource:
    attributes:
      - key: cloud.provider
        value: "aws"  # Changed per cloud: aws, gcp, alibaba_cloud, tencent_cloud
        action: upsert
      - key: cloud.region
        value: "ap-southeast-1"
        action: upsert

exporters:
  otlphttp/grafana:
    endpoint: "https://otlp-gateway-prod-eu-west-0.grafana.net/otlp"
    headers:
      Authorization: "Basic <encoded-credentials>"
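  # Optional: dual-export traces to AWS X-Ray. The awsxray exporter ships in the
  # contrib and ADOT distributions; add it to the traces pipeline to enable it.
  awsxray:
    region: "ap-southeast-1"
  # The debug exporter replaces the deprecated logging exporter.
  debug:
    verbosity: detailed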

service:
  pipelines:
    # memory_limiter runs first; batch runs last so enriched data is batched
    # immediately before export.
    traces:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [otlphttp/grafana]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [otlphttp/grafana]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [otlphttp/grafana]

Per-Cloud OTel Integration

Cloud | Managed OTel Offering | Trace Backend | Metric Backend | Log Backend
AWS | AWS Distro for OpenTelemetry (ADOT) | X-Ray | CloudWatch Metrics | CloudWatch Logs
GCP | Cloud Operations OTel Exporter | Cloud Trace | Cloud Monitoring | Cloud Logging
Alibaba | OpenTelemetry for SLS | Trace Service (SLS) | Metric Store (SLS) | Log Service (SLS)
Tencent | OTLP to CLS | Cloud Trace (Tencent APM) | Cloud Monitor | Cloud Log Service (CLS)

OTel Semantic Conventions for Multi-Cloud

Standard attributes ensure telemetry is queryable across clouds in a unified backend:

Attribute | Example | Purpose
cloud.provider | aws, gcp, alibaba_cloud, tencent_cloud | Identify which cloud generated the telemetry
cloud.region | ap-southeast-1, asia-southeast1, cn-hangzhou | Region-level filtering
cloud.availability_zone | ap-southeast-1a | Zone-level correlation
k8s.cluster.name | prod-eks-ap-southeast-1 | Cluster-level grouping
service.namespace | billing, payments | Cross-cloud service grouping
deployment.environment | production, staging | Environment separation (renamed deployment.environment.name in recent semantic-convention releases)
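
One way to keep these attributes consistent is to check them before telemetry leaves a cluster. The sketch below is illustrative only: the alias map and the choice of which attributes count as required are assumptions, not part of the OTel specification or of any SDK.

```python
# Hypothetical alias map: keys are informal names a team might type,
# values are the cloud.provider strings listed in the table above.
PROVIDER_ALIASES = {
    "aws": "aws",
    "amazon": "aws",
    "gcp": "gcp",
    "google": "gcp",
    "alibaba": "alibaba_cloud",
    "aliyun": "alibaba_cloud",
    "alibaba_cloud": "alibaba_cloud",
    "tencent": "tencent_cloud",
    "tencent_cloud": "tencent_cloud",
}

# Assumed policy: these attributes must be present on every resource.
REQUIRED_ATTRIBUTES = (
    "cloud.provider",
    "cloud.region",
    "k8s.cluster.name",
    "service.namespace",
    "deployment.environment",
)


def normalize_provider(raw: str) -> str:
    """Map an informal provider name to its cloud.provider value."""
    key = raw.strip().lower()
    if key not in PROVIDER_ALIASES:
        raise ValueError(f"unknown cloud provider: {raw!r}")
    return PROVIDER_ALIASES[key]


def missing_attributes(resource: dict) -> list:
    """Return the required attribute names that are absent or empty."""
    return [name for name in REQUIRED_ATTRIBUTES if not resource.get(name)]
```

A check like this could run in CI against each cluster's collector values file, so that a stray "alibaba" never reaches the backend as a separate provider.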

Central Backend Options

Backend | Traces | Metrics | Logs | Profiling | License
Grafana LGTM (Loki, Grafana, Tempo, Mimir) | Tempo | Mimir | Loki | Pyroscope (optional) | AGPL v3 / Commercial
Datadog | APM | Metrics | Logs | Continuous Profiler | Commercial
Dynatrace | PurePath | Davis AI | Log Analytics | Code-level profiling | Commercial
New Relic | Distributed Tracing | Metrics | Logs in Context | n/a | Commercial
Splunk Observability | APM | Infrastructure | Splunk Log Observer | n/a | Commercial
Honeycomb | Events / Traces | Derived | n/a | n/a | Commercial
SigNoz | Traces (ClickHouse) | Metrics | Logs | n/a | MIT / Commercial

Logging Architecture

[ Applications with OTel SDK ]
         |
         v
[ OTel Collector -- log pipeline ]
         |
         +---> [ Central Backend (Loki / Datadog) ]
         |
         +---> [ Cloud-native Log Store (SLS / CLS / CloudWatch) ]
                  |
                  +---> [ Compliance archive (OSS / COS / S3 Glacier) ]

Key considerations:

  • Retention policies differ per cloud. Standardize retention at the central backend level.
  • Compliance-critical logs (audit trails) must be stored in immutable, append-only storage per cloud in addition to centralized aggregation.
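  • Log volume across clouds can be significant. Drop low-value records (for example, debug-level logs) with the OTel Collector's filter processor and apply sampling to control ingestion costs; the batch processor improves export efficiency but does not reduce volume.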
  • For Alibaba Cloud SLS and Tencent CLS, verify OTLP ingestion endpoint availability and stability -- both added OTLP support relatively recently.

Monitoring and Alerting

Cross-cloud alerting pattern:

  1. Each cluster runs a Prometheus-compatible scraper (Prometheus, VictoriaMetrics agent, or OTel Collector with Prometheus receiver).
  2. Metrics flow to a central Prometheus/VictoriaMetrics/Mimir instance.
  3. Alertmanager or Grafana Alerting evaluates rules against the unified metrics.
  4. Alerts route to a single notification system (PagerDuty, Opsgenie, Slack).

Key cross-cloud alert rules:

  • Inter-cloud latency SLO breach (measured via synthetic probes between clouds).
  • Error budget burn rate across clouds (combined service-level indicator).
  • Cost anomaly detection (spend spike in any cloud exceeding threshold).
  • Security events (new admin role, root account usage, failed federation attempts).
  • Certificate expiration across clouds (cert-manager, ACM, CAS).
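
The combined burn-rate rule can be sketched as follows. This is a minimal illustration that assumes per-cloud error and request counts are already available for a given window; the function names are hypothetical, and the 14.4 default is the fast-burn threshold suggested in the Google SRE Workbook for a 30-day SLO window.

```python
def burn_rate(errors_by_cloud: dict, requests_by_cloud: dict, slo_target: float) -> float:
    """Burn rate of the combined SLI: observed error ratio / allowed error ratio."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, e.g. 0.999")
    total_requests = sum(requests_by_cloud.values())
    if total_requests == 0:
        return 0.0
    error_ratio = sum(errors_by_cloud.values()) / total_requests
    error_budget = 1.0 - slo_target
    return error_ratio / error_budget


def should_page(long_window_rate: float, short_window_rate: float,
                threshold: float = 14.4) -> bool:
    """Multi-window rule: page only if both windows burn faster than the threshold."""
    return long_window_rate > threshold and short_window_rate > threshold
```

A burn rate of 1.0 means the error budget would be exactly exhausted at the end of the SLO window; summing counts across clouds before dividing keeps a low-traffic cloud from dominating the ratio.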

Distributed Tracing Across Clouds

Distributed traces that span services running on different clouds require a consistent trace context propagation mechanism. W3C Trace Context and W3C Baggage are the standard headers.

  • Trace Context: traceparent and tracestate HTTP headers (or gRPC metadata) propagate the trace ID across service boundaries.
  • Cross-cloud correlation: Ensure all services use the same W3C Trace Context format. OTel SDKs default to this.
  • Trace sampling: Use tail-based sampling at the gateway collector to make sampling decisions based on complete trace structure (e.g., sample all error traces, sample 10% of success traces).
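
To make the header format and the sampling policy concrete, here is a small sketch. It parses only the version-00 traceparent layout (version, trace ID, parent ID, flags), and the keep_trace helper is a hypothetical stand-in for the decision a gateway's tail sampler makes once it has seen all spans of a trace; it is not the collector's implementation.

```python
import re

# Version-00 layout: 2 hex version, 32 hex trace ID, 16 hex parent ID, 2 hex flags.
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header: str):
    """Return trace_id, parent_id and the sampled flag, or None if invalid."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per W3C Trace Context.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def keep_trace(trace_id: str, span_statuses: list, success_rate: float = 0.10) -> bool:
    """Keep every trace with an error span; keep a fixed share of the rest."""
    if any(status == "ERROR" for status in span_statuses):
        return True
    # Derive a stable value in [0, 1] from the trace ID so every gateway
    # replica reaches the same decision for the same trace.
    bucket = int(trace_id[:8], 16) / 0xFFFFFFFF
    return bucket < success_rate
```

Hashing on the trace ID rather than drawing a random number keeps the decision reproducible, which matters when spans from different clouds arrive at the gateway at different times.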

FinOps and Cost Management

FinOps Maturity Model

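The FinOps Foundation describes practice maturity as Crawl, Walk, Run, and separately defines three iterative lifecycle phases: Inform, Optimize, Operate. The table pairs each maturity level with the phase whose activities it most resembles:

Maturity (Phase) | Focus | Typical Activities
Crawl (Inform) | Visibility | Tagging, cost allocation, showback reports
Walk (Optimize) | Efficiency | Rightsizing, RI/Savings Plan coverage, spot usage, anomaly alerts
Run (Operate) | Business integration | Unit economics, chargeback, automated optimization, continuous benchmarking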

Multi-Cloud Cost Visibility

Centralized aggregation tools:

Tool | Clouds | Key Features | License
Apptio Cloudability | AWS, Azure, GCP, Alibaba | Anomaly detection, RI management, showback | Commercial
Flexera One | AWS, Azure, GCP, Alibaba, Tencent | Full stack visibility, optimization, SaaS management | Commercial
CloudHealth (Broadcom) | AWS, Azure, GCP | Policy-driven governance, cost allocation | Commercial
Vantage | AWS, Azure, GCP, Alibaba | Developer-friendly, API-first, cost reports | Commercial
Kubecost | Any K8s cluster | Namespace/pod/label-level cost allocation, OpenCost-based | Apache 2.0 / Commercial
OpenCost | Any K8s cluster | Open-source K8s cost monitoring | Apache 2.0
AWS Cost Explorer | AWS | Native cost visualization, RI coverage, Savings Plans | Managed service
GCP Billing | GCP | Cost breakdown, committed-use discounts, budgets | Managed service
Alibaba Cloud Billing | Alibaba | Subscription / Pay-as-you-go analysis, quota management | Managed service

Commitment-Based Savings

Each cloud offers discounted compute in exchange for a commitment period. Spot/preemptible capacity carries no commitment and is listed for comparison:

Cloud | Mechanism | Term | Typical Savings
AWS | Reserved Instances (RI) | 1 or 3 years | Up to 72% vs on-demand
AWS | Savings Plans (Compute, EC2 Instance, SageMaker) | 1 or 3 years | Up to 72% (EC2 Instance), 66% (Compute), 64% (SageMaker) vs on-demand
GCP | Committed Use Discounts (CUDs) | 1 or 3 years | Up to 46% (flexible, spend-based); up to 57% (resource-based), 70% for memory-optimized machine types
GCP | Spot VMs | None (preemptible) | Up to 91% vs on-demand
Alibaba | Reserved Instances (RI) | 1, 3, or 5 years | Up to 55% vs pay-as-you-go
Alibaba | Savings Plans | 1 or 3 years | Up to 40% vs pay-as-you-go
Tencent | Reserved Instances | 1, 3, or 5 years | Up to 50% vs pay-as-you-go
Tencent | Spot Instances | None (preemptible) | Up to 90% vs pay-as-you-go

Multi-cloud RI strategy:

  1. Analyze steady-state compute usage per cloud. Identify workloads running 24/7.
  2. Purchase RIs/Savings Plans for steady-state baseline in each cloud.
  3. Use spot/preemptible instances for fault-tolerant, batch, or stateless workloads.
  4. Review coverage monthly. Underutilized RIs represent wasted commitment.
  5. For workloads that can run on any cloud, consider spot pricing as a factor in placement decisions.
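
Steps 1, 2, and 4 can be approximated with a few lines of arithmetic. The sketch below assumes hourly running-instance counts for one cloud and one instance family; the 10th-percentile default is an illustrative choice rather than a recommendation, and the function names are hypothetical.

```python
def commitment_baseline(hourly_instances: list, percentile: float = 0.10) -> int:
    """Steady-state baseline: a low percentile of hourly running instances."""
    if not hourly_instances:
        return 0
    ordered = sorted(hourly_instances)
    index = int(percentile * (len(ordered) - 1))
    return ordered[index]


def coverage_and_utilization(hourly_instances: list, committed: int):
    """Coverage = share of demand served by commitments.
    Utilization = share of purchased commitment actually used."""
    used = sum(min(count, committed) for count in hourly_instances)
    demand = sum(hourly_instances)
    purchased = committed * len(hourly_instances)
    coverage = used / demand if demand else 0.0
    utilization = used / purchased if purchased else 0.0
    return coverage, utilization
```

Raising the commitment increases coverage but lowers utilization once it exceeds the baseline, which is the trade-off the monthly review in step 4 is meant to catch.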

Tagging Strategy

A consistent cross-cloud tagging taxonomy is the foundation of cost allocation.

Recommended mandatory tags:

Tag Key | Purpose | Example
environment | Deployment environment | production, staging, development
team | Owning team | billing-team, platform-engineering
service | Service name | payment-service, user-api
cost-center | Financial allocation | cc-1234, engineering
data-classification | Data sensitivity | public, internal, confidential, restricted
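
A provisioning-time check for these tags might look like the sketch below. It assumes the tag keys above are mandatory and that environment and data-classification are restricted to the example values shown; both the function name and that restriction are illustrative assumptions.

```python
MANDATORY_TAGS = ("environment", "team", "service", "cost-center", "data-classification")

# Assumed closed value sets, taken from the examples in the table above.
ALLOWED_VALUES = {
    "environment": {"production", "staging", "development"},
    "data-classification": {"public", "internal", "confidential", "restricted"},
}


def tag_violations(tags: dict) -> list:
    """Return human-readable problems for one resource's tag map."""
    problems = []
    for key in MANDATORY_TAGS:
        value = tags.get(key, "").strip()
        if not value:
            problems.append(f"missing tag: {key}")
        elif key in ALLOWED_VALUES and value not in ALLOWED_VALUES[key]:
            problems.append(f"invalid value for {key}: {value}")
    return problems
```

Running the same check in every cloud's IaC pipeline gives one definition of "correctly tagged" instead of four provider-specific ones.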

Per-cloud tag enforcement:

Cloud | Enforcement Mechanism
AWS | SCP conditions requiring tags, AWS Config rules, Service Catalog
GCP | Organization Policy constraints, labels on resources
Alibaba | RAM policies with tag conditions, Cloud Config rules
Tencent | CAM tag-based access control, Cloud Config rules

Unit Economics

Shift from raw spend to cost-per-business-metric:

  • cost-per-transaction = total cloud spend / number of transactions processed.
  • cost-per-active-user = total cloud spend / monthly active users.
  • cost-per-api-call = total cloud spend / API calls served.
  • revenue-per-dollar-of-cloud = revenue / cloud spend.

OTel metrics can instrument transaction counts and active users, enabling direct correlation with cloud spend data in a unified dashboard (Grafana, Datadog).
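
The four ratios above are simple divisions once spend and business counters cover the same period. A minimal sketch, assuming total spend has already been summed across clouds (the function name and return keys are hypothetical):

```python
def unit_economics(cloud_spend: float, transactions: int, active_users: int,
                   api_calls: int, revenue: float) -> dict:
    """Compute cost-per-business-metric ratios for one reporting period."""
    def per(denominator):
        # None signals "not computable" instead of raising on a zero count.
        return cloud_spend / denominator if denominator else None

    return {
        "cost_per_transaction": per(transactions),
        "cost_per_active_user": per(active_users),
        "cost_per_api_call": per(api_calls),
        "revenue_per_dollar_of_cloud": revenue / cloud_spend if cloud_spend else None,
    }
```

Tracking these ratios over time is usually more informative than any single value: rising spend with a flat cost-per-transaction indicates growth rather than waste.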

Cost Anomaly Detection

Tool | Approach
AWS Cost Anomaly Detection | ML-based, integrated with Cost Explorer, SNS alerts
GCP Budget Alerts | Threshold-based, integrated with Billing, Pub/Sub notifications
Alibaba Cloud Cost Alerts | Threshold-based, integrated with Billing, CloudMonitor alerts
Third-party (Cloudability, Vantage) | ML-based across clouds, anomaly scoring, alerting
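
Where a cloud offers only threshold alerts, a uniform rule can be applied to exported daily spend instead. The sketch below flags any day that exceeds the trailing-window average by more than a set fraction; the 7-day window and 10% default are illustrative assumptions, and this is far simpler than the ML-based services listed above.

```python
from statistics import mean


def spend_anomalies(daily_spend: list, window: int = 7, max_variance: float = 0.10) -> list:
    """Return (day_index, variance) for days above the trailing-average threshold."""
    anomalies = []
    for day in range(window, len(daily_spend)):
        baseline = mean(daily_spend[day - window:day])
        if baseline <= 0:
            continue  # no meaningful baseline to compare against
        variance = (daily_spend[day] - baseline) / baseline
        if variance > max_variance:
            anomalies.append((day, round(variance, 4)))
    return anomalies
```

Running the same function over each cloud's daily totals gives comparable alerts even when the native tooling differs.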

Green FinOps

Carbon-aware workload scheduling is an emerging practice:

  • Google Cloud Carbon Footprint: Provides carbon emissions data per project and region.
  • AWS Customer Carbon Footprint Tool: Reports estimated emissions per service and region.
  • Cloud Carbon Footprint (open source): Aggregates carbon data across AWS, GCP, Azure.
  • Strategy: Schedule batch workloads in regions with lower carbon intensity; prefer newer instance types with better performance-per-watt; include carbon cost in placement decisions.
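
Including carbon in placement decisions can be sketched as a weighted score. Everything here is an assumption for illustration: the field names, the equal default weighting, and the idea of normalizing each factor by its maximum. Real carbon-intensity figures would come from the tools listed above.

```python
def pick_region(candidates: list, carbon_weight: float = 0.5) -> str:
    """Choose the region with the lowest blended cost/carbon score.

    Each candidate is a dict with hypothetical keys:
    region, cost_per_hour, grams_co2_per_kwh.
    """
    if not candidates:
        raise ValueError("no candidate regions")
    # Normalize by the maximum so cost and carbon are on a comparable 0..1 scale.
    max_cost = max(c["cost_per_hour"] for c in candidates) or 1.0
    max_carbon = max(c["grams_co2_per_kwh"] for c in candidates) or 1.0

    def score(candidate):
        cost_part = (1.0 - carbon_weight) * candidate["cost_per_hour"] / max_cost
        carbon_part = carbon_weight * candidate["grams_co2_per_kwh"] / max_carbon
        return cost_part + carbon_part

    return min(candidates, key=score)["region"]
```

Setting carbon_weight to 0 reduces this to pure cost-based placement, which makes it easy to compare the two policies on the same candidate list.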

Operational Commands and Recipes

Terraform Multi-Cloud Plan

# Initialize with all providers
terraform init

# Plan against AWS and Alibaba Cloud
terraform plan \
  -var-file="aws-ap-southeast-1.tfvars" \
  -var-file="alibaba-cn-hangzhou.tfvars" \
  -out=multi-cloud.tfplan

# Apply
terraform apply multi-cloud.tfplan

Pulumi Stack Operations

# Select AWS stack
pulumi stack select prod-aws

# Preview changes
pulumi preview --diff

# Deploy to AWS
pulumi up --stack prod-aws --yes

# Switch to Alibaba stack
pulumi stack select prod-alibaba

# Deploy to Alibaba Cloud
pulumi up --stack prod-alibaba --yes

# View cross-stack outputs
pulumi stack output --stack prod-aws vpc_id

Crossplane Compose and Deploy

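# Install AWS provider (current crossplane CLI; older releases used the
# `kubectl crossplane install provider` plugin). Confirm the package path and
# tag against your registry before use.
crossplane xpkg install provider crossplanecontrib/provider-upjet-aws:v1.17.0

# Install Alibaba provider
crossplane xpkg install provider crossplanecontrib/provider-alicloud:v0.5.0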

# Apply composite resource
kubectl apply -f datastore-claim.yaml

# Check status
kubectl get composite
kubectl describe xdatastore my-prod-db

ArgoCD Multi-Cluster Sync

# Add remote cluster (GCP GKE)
argocd cluster add gke_my-project_asia-southeast1_prod --name prod-gke

# Add remote cluster (Alibaba ACK)
argocd cluster add alibaba-ack-context --name prod-ack

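# Sync every Application generated by the ApplicationSet. The CLI has no
# `argocd appset sync`; this assumes the ApplicationSet template labels its
# Applications with appset=my-appset.
argocd app sync --selector appset=my-appset --prune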

# Check sync status per cluster
argocd app list --output wide

OTel Collector Health Check

# Check collector pod status
kubectl get pods -n otel-system -l app=otel-collector

# Inspect recent spans via the zpages extension (must be enabled in the collector config)
curl -s http://otel-collector.otel-system:55679/debug/tracez

# Check pipeline metrics at the collector's own metrics endpoint
curl -s http://otel-collector.otel-system:8888/metrics | grep otelcol_exporter

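# Verify OTLP connectivity to backend from inside the cluster.
# Official collector images ship without a shell or wget, so use a throwaway curl pod.
kubectl run otlp-check -n otel-system --rm -i --restart=Never \
  --image=curlimages/curl --command -- \
  curl -sS -o /dev/null -w '%{http_code}\n' \
  -H 'Content-Type: application/json' \
  -d '{"resourceSpans":[]}' \
  https://otlp-gateway.example.com/v1/traces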

Multi-Cloud Cost Report

# AWS -- daily cost via CLI (the End date is exclusive, so this covers April 1-14)
aws ce get-cost-and-usage \
  --time-period Start=2026-04-01,End=2026-04-15 \
  --granularity DAILY \
  --metrics BlendedCost \
  --group-by Type=DIMENSION,Key=SERVICE

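# GCP -- billing export query (BigQuery)
# Standard export tables are named gcp_billing_export_v1_ followed by the
# billing account ID; substitute your own project, dataset, and table.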
bq query --use_legacy_sql=false \
  'SELECT service.description, SUM(cost) as total_cost
   FROM `project.billing_dataset.gcp_billing_export_v1`
   WHERE invoice.month = "202604"
   GROUP BY service.description
   ORDER BY total_cost DESC'

# Alibaba -- via OpenAPI
aliyun bssopenapi QueryBill \
  --BillingCycle 2026-04 \
  --PageNum 1 \
  --PageSize 100

Troubleshooting

Common Multi-Cloud Observability Issues

Symptom | Likely Cause | Resolution
Missing traces in one cloud | OTel Collector pod crashlooping or misconfigured exporter | Check collector pod logs; verify OTLP endpoint and credentials
High collector memory | memory_limiter missing or set too high for the pod's memory limit, or excessive log volume | Tune limit_percentage and spike_limit_percentage; add sampling
Attribute conflicts in backend | Different cloud.provider values for same service | Verify resource processor sets consistent attributes per deployment
Cross-cloud trace breaks | Missing W3C Trace Context propagation in a service hop | Verify all services use OTel SDK with W3C propagator; check load-balancer pass-through of trace headers

Common Multi-Cloud Networking Issues

Symptom | Likely Cause | Resolution
Intermittent inter-cloud latency spikes | Traffic routing over public internet instead of dedicated interconnect | Verify route tables point to Direct Connect / Express Connect / Interconnect; check BGP routes
DNS resolution failures between clouds | Split-horizon DNS misconfiguration or stale NS delegation | Verify NS records at apex zone match cloud DNS zone name servers; check DNS propagation
Connection resets between clouds | MTU mismatch on interconnect circuits | Verify MTU settings on VBR (Alibaba), Direct Connect virtual interface (AWS), and interconnect link

Common FinOps Issues

Symptom | Likely Cause | Resolution
Unattributed spend (>20% of total) | Missing tags on resources | Enforce tagging policy at provisioning time via IaC modules and cloud org policies
RI/Savings Plan underutilization | Over-purchased commitments or workload migration | Rightsize commitment portfolio; exchange convertible RIs; adjust Savings Plan coverage targets
Cost anomaly not detected | Alert threshold too high or no anomaly detection configured | Configure AWS Cost Anomaly Detection; set threshold alerts at 10% daily variance in all clouds
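
The unattributed-spend symptom can be measured directly from billing line items. The sketch below assumes each line item has been normalized to a dict with a cost and a tags map, which is a hypothetical shape rather than any provider's export format.

```python
def unattributed_share(line_items: list, required_tag: str = "cost-center") -> float:
    """Fraction of total spend on resources lacking the required tag."""
    total = sum(item["cost"] for item in line_items)
    if total == 0:
        return 0.0
    untagged = sum(
        item["cost"]
        for item in line_items
        if not item.get("tags", {}).get(required_tag)
    )
    return untagged / total
```

A result above 0.20 corresponds to the first symptom in the table and suggests tightening tag enforcement before investing in finer-grained allocation.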