Multi-Cloud Governance -- Operations¶

Observability, cost management, and day-2 operational patterns for multi-cloud environments spanning AWS, GCP, Alibaba Cloud, and Tencent Cloud.

Observability¶

The OpenTelemetry Standard¶

OpenTelemetry (OTel) is the vendor-neutral observability framework hosted by the CNCF. It provides APIs, SDKs, and a collector pipeline for the three established signals (traces, metrics, logs) plus an emerging fourth signal (continuous profiling). In a multi-cloud environment, OTel is the lingua franca that normalizes telemetry regardless of the underlying cloud provider.

OTel Collector Architecture for Multi-Cloud¶

The OTel Collector is the central pipeline component. In a multi-cloud deployment, the recommended pattern is per-cloud collector deployment with centralized backend export.

[ AWS Cluster ]                 [ GCP Cluster ]
  OTel Collector (DaemonSet)      OTel Collector (DaemonSet)
       | OTLP                          | OTLP
       v                               v
[ AWS Gateway Collector ] -----> [ Central Backend ]
                                    (Grafana LGTM /
                                     Datadog / Dynatrace)
       ^                               ^
       | OTLP                          | OTLP
[ Alibaba Cluster ]             [ Tencent Cluster ]
  OTel Collector (DaemonSet)      OTel Collector (DaemonSet)

Collector deployment modes:

Mode	Description	Use Case
DaemonSet (agent)	One collector per node	Low-latency collection, node-local enrichment
Deployment (gateway)	N replicas as a standalone service	Centralized processing, tail sampling, multi-tenant routing
Sidecar	One collector per pod	Strong isolation, per-app config

Recommended multi-cloud pattern: DaemonSet agents on each cluster send to a per-cloud gateway collector. The gateway collector handles processing (batching, retry, attribute enrichment with cloud/region labels) and exports OTLP to the central backend.

OTel Collector Configuration (Multi-Backend Export)¶

# Collector ConfigMap -- exports to Grafana LGTM stack
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 75
    spike_limit_percentage: 15
  batch:
    send_batch_size: 8192
    timeout: 5s
  resource:
    attributes:
      - key: cloud.provider
        value: "aws"  # Changed per-cloud: aws, gcp, alibaba_cloud, tencent_cloud
        action: upsert
      - key: cloud.region
        value: "ap-southeast-1"
        action: upsert

exporters:
  otlphttp/grafana:
    endpoint: "https://otlp-gateway-prod-eu-west-0.grafana.net/otlp"
    headers:
      Authorization: "Basic <encoded-credentials>"
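  # Optional: dual-export traces to the cloud-native service via the contrib
  # `awsxray` exporter (needs the contrib or ADOT distribution and AWS credentials).
  # Add it to a pipeline's exporters list to enable it.
  awsxray:
    region: "ap-southeast-1"
  # Console output for troubleshooting; `debug` replaces the deprecated `logging` exporter.
  debug:
    verbosity: detailed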

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch, resource]
      exporters: [otlphttp/grafana]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch, resource]
      exporters: [otlphttp/grafana]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch, resource]
      exporters: [otlphttp/grafana]

Per-Cloud OTel Integration¶

Cloud	Managed OTel Offering	Trace Backend	Metric Backend	Log Backend
AWS	AWS Distro for OpenTelemetry (ADOT)	X-Ray	CloudWatch Metrics	CloudWatch Logs
GCP	Cloud Operations OTel Exporter	Cloud Trace	Cloud Monitoring	Cloud Logging
Alibaba	OpenTelemetry for SLS	Trace Service (SLS)	Metric Store (SLS)	Log Service (SLS)
Tencent	OTLP to CLS	Cloud Trace (Tencent APM)	Cloud Monitor	Cloud Log Service (CLS)

OTel Semantic Conventions for Multi-Cloud¶

Standard attributes ensure telemetry is queryable across clouds in a unified backend:

Attribute	Example	Purpose
`cloud.provider`	`aws`, `gcp`, `alibaba_cloud`, `tencent_cloud`	Identify which cloud generated the telemetry
`cloud.region`	`ap-southeast-1`, `asia-southeast1`, `cn-hangzhou`	Region-level filtering
`cloud.availability_zone`	`ap-southeast-1a`	Zone-level correlation
`k8s.cluster.name`	`prod-eks-ap-southeast-1`	Cluster-level grouping
`service.namespace`	`billing`, `payments`	Cross-cloud service grouping
`deployment.environment.name`	`production`, `staging`	Environment separation (`deployment.environment` in older semantic-convention releases)
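The table above only helps if every deployment writes the same values. As a minimal sketch, the helper below maps informal provider names to the `cloud.provider` values listed in the table before they are written into a collector's `resource` processor. The alias table and the function names are assumptions made for illustration, not part of any OTel SDK.

```python
# Hypothetical helper: normalize provider names to the cloud.provider
# values used in the semantic-conventions table.
CANONICAL_PROVIDERS = {"aws", "gcp", "alibaba_cloud", "tencent_cloud"}

# Assumed aliases; extend to match the names your own tooling uses.
ALIASES = {
    "alibaba": "alibaba_cloud",
    "aliyun": "alibaba_cloud",
    "tencent": "tencent_cloud",
    "amazon": "aws",
    "google": "gcp",
}


def normalize_provider(name: str) -> str:
    """Return the canonical cloud.provider value or raise ValueError."""
    key = name.strip().lower()
    key = ALIASES.get(key, key)
    if key not in CANONICAL_PROVIDERS:
        raise ValueError(f"unknown cloud provider: {name!r}")
    return key


def resource_attributes(provider: str, region: str, cluster: str) -> dict[str, str]:
    """Build the resource attributes a per-cloud collector would upsert."""
    return {
        "cloud.provider": normalize_provider(provider),
        "cloud.region": region,
        "k8s.cluster.name": cluster,
    }
```

Running a check like this in the pipeline that renders collector configuration is one way to keep a shorthand such as `alibaba` from reaching the backend alongside `alibaba_cloud`.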

Central Backend Options¶

Backend	Traces	Metrics	Logs	Profiling	License
Grafana LGTM (Loki, Grafana, Tempo, Mimir)	Tempo	Mimir	Loki	Pyroscope (optional)	AGPL v3 / Commercial
Datadog	APM	Metrics	Logs	Continuous Profiler	Commercial
Dynatrace	PurePath	Metrics (Davis AI analysis)	Log Analytics	Code-level profiling	Commercial
New Relic	Distributed Tracing	Metrics	Logs in Context	n/a	Commercial
Splunk Observability	APM	Infrastructure	Splunk Log Observer	n/a	Commercial
Honeycomb	Events / Traces	Derived	n/a	n/a	Commercial
SigNoz	Traces (ClickHouse)	Metrics	Logs	n/a	MIT / Commercial

Logging Architecture¶

[ Applications with OTel SDK ]
         |
         v
[ OTel Collector -- log pipeline ]
         |
         +---> [ Central Backend (Loki / Datadog) ]
         |
         +---> [ Cloud-native Log Store (SLS / CLS / CloudWatch) ]
                  |
                  +---> [ Compliance archive (OSS / COS / S3 Glacier) ]

Key considerations:

Retention policies differ per cloud. Standardize retention at the central backend level.
Compliance-critical logs (audit trails) must be stored in immutable, append-only storage per cloud in addition to centralized aggregation.
Log volume across clouds can be significant. Filter low-value records and sample at the OTel Collector to control ingestion costs; the batch processor improves export efficiency but does not reduce volume.
For Alibaba Cloud SLS and Tencent CLS, verify OTLP ingestion endpoint availability and stability -- both added OTLP support relatively recently.

Monitoring and Alerting¶

Cross-cloud alerting pattern:

Each cluster runs a Prometheus-compatible scraper (Prometheus, VictoriaMetrics agent, or OTel Collector with Prometheus receiver).
Metrics flow to a central Prometheus/VictoriaMetrics/Mimir instance.
The Prometheus or Mimir ruler, or Grafana Alerting, evaluates rules against the unified metrics; Alertmanager deduplicates, groups, and routes the resulting alerts.
Alerts route to a single notification system (PagerDuty, Opsgenie, Slack).

Key cross-cloud alert rules:

Inter-cloud latency SLO breach (measured via synthetic probes between clouds).
Error budget burn rate across clouds (combined service-level indicator).
Cost anomaly detection (spend spike in any cloud exceeding threshold).
Security events (new admin role, root account usage, failed federation attempts).
Certificate expiration across clouds (cert-manager, ACM, CAS).
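The combined error budget burn rate can be computed from per-cloud request counts taken over the same window. The sketch below assumes a request-based availability SLO; the `CloudWindow` type and the function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CloudWindow:
    """Request counts for one cloud over a shared evaluation window."""
    provider: str
    total_requests: int
    failed_requests: int


def combined_burn_rate(windows: list[CloudWindow], slo_target: float) -> float:
    """Burn rate = observed error ratio / error budget (1 - SLO target).

    A value of 1.0 spends the budget exactly over the SLO period;
    values above 1.0 spend it faster.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1 (exclusive)")
    total = sum(w.total_requests for w in windows)
    if total == 0:
        return 0.0  # no traffic in the window, nothing burned
    failed = sum(w.failed_requests for w in windows)
    return (failed / total) / (1.0 - slo_target)
```

Summing counts before dividing weights each cloud by its traffic, so a low-volume cloud with a high error ratio does not dominate the combined indicator.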

Distributed Tracing Across Clouds¶

Distributed traces that span services running on different clouds require a consistent trace context propagation mechanism. W3C Trace Context and W3C Baggage are the standard headers.

Trace Context: traceparent and tracestate HTTP headers (or gRPC metadata) propagate the trace ID across service boundaries.
Cross-cloud correlation: Ensure all services use the same W3C Trace Context format. OTel SDKs default to this.
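Trace sampling: Use tail-based sampling at the gateway collector to make sampling decisions based on the complete trace (e.g., keep all error traces and 10% of successful traces). All spans of a trace must reach the same gateway instance for this to work, so route by trace ID when the gateway runs more than one replica.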
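To make the header format and the sampling rule concrete, the sketch below parses a version-00 `traceparent` value and applies a "keep all errors, keep a fixed share of the rest" decision. It illustrates the logic only and is not the Collector's tail sampling implementation; the function names and the hashing scheme are assumptions.

```python
import re
from typing import NamedTuple, Optional

# version - trace-id (32 hex) - parent-id (16 hex) - flags, all lowercase hex
_TRACEPARENT = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Parse a version-00 W3C traceparent header; return None if invalid."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    if version != "00":
        return None  # this sketch handles version 00 only
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None  # all-zero IDs are invalid
    return TraceParent(version, trace_id, parent_id, bool(int(flags, 16) & 0x01))


def keep_trace(trace_id: str, has_error: bool, success_ratio: float = 0.10) -> bool:
    """Keep every error trace and roughly `success_ratio` of the others.

    The decision is derived from the trace ID, so every gateway replica
    reaches the same answer for the same trace.
    """
    if has_error:
        return True
    bucket = int(trace_id[-8:], 16) / 0x100000000  # value in [0, 1)
    return bucket < success_ratio
```

A service hop that drops or rewrites the header shows up here as `parse_traceparent` returning `None`, which is a quick way to locate where a cross-cloud trace breaks.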

FinOps and Cost Management¶

FinOps Maturity Model¶

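The FinOps Foundation framework describes three maturity levels (Crawl, Walk, Run) and three iterative lifecycle phases (Inform, Optimize, Operate). The table pairs each level with the phase it most resembles, although in practice teams cycle through all three phases at every maturity level.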

Phase	Focus	Typical Activities
Crawl (Inform)	Visibility	Tagging, cost allocation, showback reports
Walk (Optimize)	Efficiency	Rightsizing, RI/Savings Plan coverage, spot usage, anomaly alerts
Run (Operate)	Business integration	Unit economics, chargeback, automated optimization, continuous benchmarking

Multi-Cloud Cost Visibility¶

Centralized aggregation tools:

Tool	Clouds	Key Features	License
Apptio Cloudability	AWS, Azure, GCP, Alibaba	Anomaly detection, RI management, showback	Commercial
Flexera One	AWS, Azure, GCP, Alibaba, Tencent	Full stack visibility, optimization, SaaS management	Commercial
CloudHealth (Broadcom)	AWS, Azure, GCP	Policy-driven governance, cost allocation	Commercial
Vantage	AWS, Azure, GCP, Alibaba	Developer-friendly, API-first, cost reports	Commercial
Kubecost	Any K8s cluster	Namespace/pod/label-level cost allocation, OpenCost-based	Apache 2.0 / Commercial
OpenCost	Any K8s cluster	Open-source K8s cost monitoring	Apache 2.0
AWS Cost Explorer	AWS	Native cost visualization, RI coverage, Savings Plans	Managed service
GCP Billing	GCP	Cost breakdown, committed-use discounts, budgets	Managed service
Alibaba Cloud Billing	Alibaba	Subscription / Pay-as-you-go analysis, quota management	Managed service

Commitment-Based Savings¶

Each cloud offers discounted compute in exchange for a usage or spend commitment over a fixed term. Spot and preemptible capacity is listed alongside for comparison: it carries no commitment, but the provider can reclaim it at short notice.

Cloud	Pricing Model	Term	Typical Savings
AWS	Reserved Instances (RI)	1 or 3 years	Up to 72% vs on-demand
AWS	Savings Plans (Compute, EC2 Instance, SageMaker)	1 or 3 years	Up to 72% (EC2 Instance), 66% (Compute), 64% (SageMaker) vs on-demand
GCP	Committed Use Discounts (CUDs)	1 or 3 years	Up to 57% for most machine types, 70% for memory-optimized (resource-based); up to 46% (flexible)
GCP	Spot VMs	None (preemptible)	Up to 91% vs on-demand
Alibaba	Reserved Instances (RI)	1, 3, or 5 years	Up to 55% vs pay-as-you-go
Alibaba	Savings Plans	1 or 3 years	Up to 40% vs pay-as-you-go
Tencent	Reserved Instances	1, 3, or 5 years	Up to 50% vs pay-as-you-go
Tencent	Spot Instances	None (preemptible)	Up to 90% vs pay-as-you-go

Multi-cloud RI strategy:

Analyze steady-state compute usage per cloud. Identify workloads running 24/7.
Purchase RIs/Savings Plans for steady-state baseline in each cloud.
Use spot/preemptible instances for fault-tolerant, batch, or stateless workloads.
Review coverage monthly. Underutilized RIs represent wasted commitment.
For workloads that can run on any cloud, consider spot pricing as a factor in placement decisions.
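The first and fourth steps come down to three numbers per cloud: the steady-state floor, commitment coverage, and commitment utilization. The sketch below assumes usage is already expressed in comparable hourly units (for example, normalized instance-hours); the function names are hypothetical.

```python
def steady_state_baseline(hourly_usage: list[float]) -> float:
    """Usage level sustained in every sampled hour, i.e. the 24/7 floor."""
    if not hourly_usage:
        raise ValueError("hourly_usage must not be empty")
    return min(hourly_usage)


def commitment_metrics(hourly_usage: list[float], committed: float) -> dict[str, float]:
    """Coverage and utilization for a flat hourly commitment.

    coverage    = usage covered by the commitment / total usage
    utilization = commitment actually consumed / commitment purchased
    """
    if not hourly_usage:
        raise ValueError("hourly_usage must not be empty")
    if committed < 0:
        raise ValueError("committed must not be negative")
    covered = sum(min(used, committed) for used in hourly_usage)
    total_usage = sum(hourly_usage)
    purchased = committed * len(hourly_usage)
    return {
        "coverage": covered / total_usage if total_usage else 0.0,
        "utilization": covered / purchased if purchased else 0.0,
    }
```

Low utilization signals over-purchased commitments, while low coverage with high utilization suggests room to commit more of the baseline.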

Tagging Strategy¶

A consistent cross-cloud tagging taxonomy is the foundation of cost allocation.

Recommended mandatory tags:

Tag Key	Purpose	Example
`environment`	Environment	`production`, `staging`, `development`
`team`	Owning team	`billing-team`, `platform-engineering`
`service`	Service name	`payment-service`, `user-api`
`cost-center`	Financial allocation	`cc-1234`, `engineering`
`data-classification`	Data sensitivity	`public`, `internal`, `confidential`, `restricted`

Per-cloud tag enforcement:

Cloud	Enforcement Mechanism
AWS	SCP conditions requiring tags, AWS Config rules, Service Catalog
GCP	Organization Policy constraints, labels on resources
Alibaba	RAM policies with tag conditions, Cloud Config rules
Tencent	CAM tag-based access control, Cloud Config rules
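The mandatory tag set can also be checked before provisioning, for example in a CI step that inspects planned resources. A minimal sketch follows; the allowed values are taken from the examples in the tag table and should be treated as assumptions, as should the function name.

```python
MANDATORY_TAGS = ("environment", "team", "service", "cost-center", "data-classification")

# Assumed value sets, copied from the examples in the tag table.
ALLOWED_VALUES = {
    "environment": {"production", "staging", "development"},
    "data-classification": {"public", "internal", "confidential", "restricted"},
}


def tag_violations(tags: dict[str, str]) -> list[str]:
    """Return a list of human-readable problems; empty means compliant."""
    problems = []
    for key in MANDATORY_TAGS:
        value = tags.get(key, "").strip()
        if not value:
            problems.append(f"missing tag: {key}")
        elif key in ALLOWED_VALUES and value not in ALLOWED_VALUES[key]:
            problems.append(f"invalid value for {key}: {value}")
    return problems
```

Failing the pipeline when the list is non-empty catches untagged resources before any cloud-side policy has to reject them.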

Unit Economics¶

Shift from raw spend to cost-per-business-metric:

cost-per-transaction = total cloud spend / number of transactions processed.
cost-per-active-user = total cloud spend / monthly active users.
cost-per-api-call = total cloud spend / API calls served.
revenue-per-dollar-of-cloud = revenue / cloud spend.

OTel metrics can instrument transaction counts and active users, enabling direct correlation with cloud spend data in a unified dashboard (Grafana, Datadog).
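The four ratios are simple divisions once spend from every cloud has been converted to a single currency. The sketch below assumes that conversion has already happened; the function and key names are hypothetical.

```python
from typing import Optional


def _ratio(numerator: float, denominator: float) -> Optional[float]:
    """Divide, returning None when the denominator is zero."""
    return numerator / denominator if denominator else None


def unit_economics(
    cloud_spend: float,
    transactions: int,
    monthly_active_users: int,
    api_calls: int,
    revenue: float,
) -> dict[str, Optional[float]]:
    """Compute the unit-cost metrics for one reporting period."""
    return {
        "cost_per_transaction": _ratio(cloud_spend, transactions),
        "cost_per_active_user": _ratio(cloud_spend, monthly_active_users),
        "cost_per_api_call": _ratio(cloud_spend, api_calls),
        "revenue_per_cloud_dollar": _ratio(revenue, cloud_spend),
    }
```

Returning `None` for an empty denominator keeps a quiet period from showing up as a division error or a misleading zero on the dashboard.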

Cost Anomaly Detection¶

Tool	Approach
AWS Cost Anomaly Detection	ML-based, integrated with Cost Explorer, SNS alerts
GCP Budget Alerts	Threshold-based, integrated with Billing, Pub/Sub notifications
Alibaba Cloud Cost Alerts	Threshold-based, integrated with Billing, CloudMonitor alerts
Third-party (Cloudability, Vantage)	ML-based across clouds, anomaly scoring, alerting
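Where a cloud offers only threshold alerts, the same check can be run centrally over exported daily spend. The sketch below flags days that exceed a trailing average by a fixed percentage. It is a simple threshold rule, not the ML-based approach of the managed services, and the window and threshold defaults are assumptions.

```python
from statistics import mean


def spend_anomalies(
    daily_spend: list[float], window: int = 7, threshold: float = 0.10
) -> list[int]:
    """Return indexes of days whose spend exceeds the trailing-window
    mean by more than `threshold` (0.10 = 10%)."""
    if window < 1:
        raise ValueError("window must be at least 1")
    flagged = []
    for i in range(window, len(daily_spend)):
        baseline = mean(daily_spend[i - window:i])
        if baseline > 0 and (daily_spend[i] - baseline) / baseline > threshold:
            flagged.append(i)
    return flagged
```

Running the function once per cloud and once on the combined total separates a spike in one provider from overall growth.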

Green FinOps¶

Carbon-aware workload scheduling is an emerging practice:

Google Cloud Carbon Footprint: Provides carbon emissions data per project and region.
AWS Customer Carbon Footprint Tool: Reports estimated emissions per service and region.
Cloud Carbon Footprint (open source): Aggregates carbon data across AWS, GCP, Azure.
Strategy: Schedule batch workloads in regions with lower carbon intensity; prefer newer instance types with better performance-per-watt; include carbon cost in placement decisions.
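One way to include carbon cost in placement decisions is to pick the lowest-carbon region among those within an acceptable price premium. The sketch below assumes the caller supplies carbon intensity and price per region from its own data sources; the `RegionOption` type, the function name, and the 15% default premium are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegionOption:
    region: str
    grams_co2_per_kwh: float  # grid carbon intensity, supplied by the caller
    hourly_cost: float        # price of the workload's instance type


def pick_region(options: list[RegionOption], max_cost_premium: float = 0.15) -> RegionOption:
    """Choose the lowest-carbon region whose cost is within
    `max_cost_premium` of the cheapest option."""
    if not options:
        raise ValueError("options must not be empty")
    cheapest = min(o.hourly_cost for o in options)
    limit = cheapest * (1 + max_cost_premium)
    affordable = [o for o in options if o.hourly_cost <= limit]
    # Break carbon ties by preferring the cheaper region.
    return min(affordable, key=lambda o: (o.grams_co2_per_kwh, o.hourly_cost))
```

Setting `max_cost_premium` to zero reduces this to pure cost-based placement, which makes the carbon trade-off an explicit, tunable policy.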

Operational Commands and Recipes¶

Terraform Multi-Cloud Plan¶

# Initialize with all providers
terraform init

# Plan against AWS and Alibaba Cloud
terraform plan \
  -var-file="aws-ap-southeast-1.tfvars" \
  -var-file="alibaba-cn-hangzhou.tfvars" \
  -out=multi-cloud.tfplan

# Apply
terraform apply multi-cloud.tfplan

Pulumi Stack Operations¶

# Select AWS stack
pulumi stack select prod-aws

# Preview changes
pulumi preview --diff

# Deploy to AWS
pulumi up --stack prod-aws --yes

# Switch to Alibaba stack
pulumi stack select prod-alibaba

# Deploy to Alibaba Cloud
pulumi up --stack prod-alibaba --yes

# View cross-stack outputs
pulumi stack output --stack prod-aws vpc_id

Crossplane Compose and Deploy¶

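# Install AWS provider (Crossplane CLI v1.14+; older releases used the `kubectl crossplane` plugin).
# Verify the package reference and version against your registry before use.
crossplane xpkg install provider crossplanecontrib/provider-upjet-aws:v1.17.0

# Install Alibaba provider
crossplane xpkg install provider crossplanecontrib/provider-alicloud:v0.5.0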

# Apply composite resource
kubectl apply -f datastore-claim.yaml

# Check status
kubectl get composite
kubectl describe xdatastore my-prod-db

ArgoCD Multi-Cluster Sync¶

# Add remote cluster (GCP GKE)
argocd cluster add gke_my-project_asia-southeast1_prod --name prod-gke

# Add remote cluster (Alibaba ACK)
argocd cluster add alibaba-ack-context --name prod-ack

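# Sync the Applications generated by the ApplicationSet across all clusters.
# `argocd appset` has no sync subcommand; this assumes the ApplicationSet
# template labels each generated app with appset=my-appset.
argocd app sync -l appset=my-appset --prune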

# Check sync status per cluster
argocd app list --output wide

OTel Collector Health Check¶

# Check collector pod status
kubectl get pods -n otel-system -l app=otel-collector

# Inspect recently processed spans (requires the zpages extension to be enabled)
curl -s http://otel-collector.otel-system:55679/debug/tracez

# Check pipeline metrics at the collector's own metrics endpoint
curl -s http://otel-collector.otel-system:8888/metrics | grep otelcol_exporter

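# Verify OTLP connectivity to backend from inside the cluster.
# The stock collector image ships without a shell or wget, so use a throwaway curl pod.
kubectl run otlp-probe -n otel-system --rm -i --restart=Never \
  --image=curlimages/curl --command -- \
  curl -sS -X POST -H 'Content-Type: application/json' \
  -d '{"resourceSpans":[]}' \
  https://otlp-gateway.example.com/v1/traces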
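The exporter counters exposed on port 8888 can be reduced to a per-exporter failure ratio. The sketch below parses the Prometheus text output; it assumes the counters are named `otelcol_exporter_sent_<signal>` and `otelcol_exporter_send_failed_<signal>` (optionally with a `_total` suffix) and carry an `exporter` label, which can vary by collector version. The function name is hypothetical.

```python
import re
from collections import defaultdict


def exporter_failure_ratio(metrics_text: str, signal: str = "spans") -> dict[str, float]:
    """Per-exporter share of failed sends, from Prometheus text output.

    `signal` is the metric suffix: spans, metric_points, or log_records.
    """
    pattern = re.compile(
        rf"otelcol_exporter_(sent|send_failed)_{re.escape(signal)}(?:_total)?"
        r"\{([^}]*)\}\s+([0-9.eE+-]+)"
    )
    counts = defaultdict(lambda: {"sent": 0.0, "send_failed": 0.0})
    for line in metrics_text.splitlines():
        match = pattern.fullmatch(line.strip())
        if match is None:
            continue  # skips # HELP / # TYPE lines and unrelated metrics
        kind, labels, value = match.groups()
        exporter = re.search(r'exporter="([^"]*)"', labels)
        if exporter is None:
            continue
        counts[exporter.group(1)][kind] += float(value)
    ratios = {}
    for name, c in counts.items():
        attempted = c["sent"] + c["send_failed"]
        ratios[name] = c["send_failed"] / attempted if attempted else 0.0
    return ratios
```

Feeding it the output of the `curl` command against `:8888/metrics` gives one number per exporter that can be compared across clouds.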

Multi-Cloud Cost Report¶

# AWS -- daily cost via CLI
aws ce get-cost-and-usage \
  --time-period Start=2026-04-01,End=2026-04-15 \
  --granularity DAILY \
  --metrics BlendedCost \
  --group-by Type=DIMENSION,Key=SERVICE

# GCP -- billing export query (BigQuery)
bq query --use_legacy_sql=false \
  'SELECT service.description, SUM(cost) as total_cost
   FROM `project.billing_dataset.gcp_billing_export_v1`
   WHERE invoice.month = "202604"
   GROUP BY service.description
   ORDER BY total_cost DESC'

# Alibaba -- via OpenAPI
aliyun bssopenapi QueryBill \
  --BillingCycle 2026-04 \
  --PageNum 1 \
  --PageSize 100
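The JSON returned by `aws ce get-cost-and-usage` with a `SERVICE` grouping can be collapsed into per-service totals for the period. The sketch below relies on the documented response shape (`ResultsByTime`, `Groups`, `Keys`, `Metrics`); the function name is hypothetical, and the GCP and Alibaba outputs would need their own adapters.

```python
from collections import defaultdict


def cost_by_service(ce_response: dict, metric: str = "BlendedCost") -> dict[str, float]:
    """Sum a Cost Explorer response over all time buckets, per service.

    Returns services ordered from most to least expensive.
    """
    totals: dict[str, float] = defaultdict(float)
    for period in ce_response.get("ResultsByTime", []):
        for group in period.get("Groups", []):
            service = group["Keys"][0]  # first (and only) group-by key
            totals[service] += float(group["Metrics"][metric]["Amount"])
    return dict(sorted(totals.items(), key=lambda item: item[1], reverse=True))
```

Loading the CLI output with `json.load` and passing it to `cost_by_service` yields a dictionary that can be merged with equivalent summaries from the other clouds.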

Troubleshooting¶

Common Multi-Cloud Observability Issues¶

Symptom	Likely Cause	Resolution
Missing traces in one cloud	OTel Collector pod crashlooping or misconfigured exporter	Check collector pod logs; verify OTLP endpoint and credentials
High collector memory	`memory_limiter` missing or set too loosely, or excessive log volume	Tune `limit_percentage` and `spike_limit_percentage`; add sampling
Attribute conflicts in backend	Different `cloud.provider` values for same service	Verify `resource` processor sets consistent attributes per deployment
Cross-cloud trace breaks	Missing W3C Trace Context propagation in a service hop	Verify all services use OTel SDK with W3C propagator; check load-balancer pass-through of trace headers

Common Multi-Cloud Networking Issues¶

Symptom	Likely Cause	Resolution
Intermittent inter-cloud latency spikes	Traffic routing over public internet instead of dedicated interconnect	Verify route tables point to Direct Connect / Express Connect / Interconnect; check BGP routes
DNS resolution failures between clouds	Split-horizon DNS misconfiguration or stale NS delegation	Verify NS records at apex zone match cloud DNS zone name servers; check DNS propagation
Connection resets between clouds	MTU mismatch on interconnect circuits	Verify MTU settings on VBR (Alibaba), Direct Connect virtual interface (AWS), and interconnect link

Common FinOps Issues¶

Symptom	Likely Cause	Resolution
Unattributed spend (>20% of total)	Missing tags on resources	Enforce tagging policy at provisioning time via IaC modules and cloud org policies
RI/Savings Plan underutilization	Over-purchased commitments or workload migration	Rightsize commitment portfolio; exchange convertible RIs; adjust Savings Plan coverage targets
Cost anomaly not detected	Alert threshold too high or no anomaly detection configured	Configure AWS Cost Anomaly Detection; set threshold alerts at 10% daily variance in all clouds