Multi-Cloud Governance -- Operations¶
Observability, cost management, and day-2 operational patterns for multi-cloud environments spanning AWS, GCP, Alibaba Cloud, and Tencent Cloud.
Observability¶
The OpenTelemetry Standard¶
OpenTelemetry (OTel) is the CNCF's vendor-neutral observability framework. It provides APIs, SDKs, and a collector pipeline for the three core signals (traces, metrics, logs) plus an emerging fourth signal (continuous profiling). In a multi-cloud environment, OTel is the lingua franca that normalizes telemetry regardless of the underlying cloud provider.
OTel Collector Architecture for Multi-Cloud¶
The OTel Collector is the central pipeline component. In a multi-cloud deployment, the recommended pattern is per-cloud collector deployment with centralized backend export.
[ AWS Cluster ]                        [ GCP Cluster ]
 OTel Collector (DaemonSet)             OTel Collector (DaemonSet)
        | OTLP                                 | OTLP
        v                                      v
[ AWS Gateway Collector ] --+      +-- [ GCP Gateway Collector ]
                            |      |
                            v      v
                       [ Central Backend ]
                        (Grafana LGTM /
                         Datadog / Dynatrace)
                            ^      ^
                            |      |
[ Alibaba Gateway Coll. ] --+      +-- [ Tencent Gateway Coll. ]
        ^ OTLP                                 ^ OTLP
        |                                      |
[ Alibaba Cluster ]                    [ Tencent Cluster ]
 OTel Collector (DaemonSet)             OTel Collector (DaemonSet)
Collector deployment modes:
| Mode | Description | Use Case |
|---|---|---|
| DaemonSet (agent) | One collector per node | Low-latency collection, tail sampling |
| Deployment (gateway) | N replicas as a standalone service | Centralized processing, multi-tenant routing |
| Sidecar | One collector per pod | Strong isolation, per-app config |
Recommended multi-cloud pattern: DaemonSet agents on each cluster send to a per-cloud gateway collector. The gateway collector handles processing (batching, retry, attribute enrichment with cloud/region labels) and exports OTLP to the central backend.
OTel Collector Configuration (Multi-Backend Export)¶
# Collector ConfigMap -- exports to Grafana LGTM stack
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 75
    spike_limit_percentage: 15
  batch:
    send_batch_size: 8192
    timeout: 5s
  resource:
    attributes:
      - key: cloud.provider
        value: "aws"  # Changed per-cloud: aws, gcp, alibaba_cloud, tencent_cloud
        action: upsert
      - key: cloud.region
        value: "ap-southeast-1"
        action: upsert

exporters:
  otlphttp/grafana:
    endpoint: "https://otlp-gateway-prod-eu-west-0.grafana.net/otlp"
    headers:
      Authorization: "Basic <encoded-credentials>"
  # Optional: dual-export traces to AWS X-Ray (requires a contrib/ADOT collector build)
  awsxray:
    region: ap-southeast-1
  debug:
    verbosity: detailed

service:
  pipelines:
    traces:
      receivers: [otlp]
      # memory_limiter first, batch last; resource enrichment in between
      processors: [memory_limiter, resource, batch]
      exporters: [otlphttp/grafana]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [otlphttp/grafana]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [otlphttp/grafana]
Per-Cloud OTel Integration¶
| Cloud | Managed OTel Offering | Trace Backend | Metric Backend | Log Backend |
|---|---|---|---|---|
| AWS | AWS Distro for OpenTelemetry (ADOT) | X-Ray | CloudWatch Metrics | CloudWatch Logs |
| GCP | Cloud Operations OTel Exporter | Cloud Trace | Cloud Monitoring | Cloud Logging |
| Alibaba | OpenTelemetry for SLS | Trace Service (SLS) | Metric Store (SLS) | Log Service (SLS) |
| Tencent | OTLP to CLS | Cloud Trace (Tencent APM) | Cloud Monitor | Cloud Log Service (CLS) |
OTel Semantic Conventions for Multi-Cloud¶
Standard attributes ensure telemetry is queryable across clouds in a unified backend:
| Attribute | Example | Purpose |
|---|---|---|
| `cloud.provider` | `aws`, `gcp`, `alibaba_cloud`, `tencent_cloud` | Identify which cloud generated the telemetry |
| `cloud.region` | `ap-southeast-1`, `asia-southeast1`, `cn-hangzhou` | Region-level filtering |
| `cloud.availability_zone` | `ap-southeast-1a` | Zone-level correlation |
| `k8s.cluster.name` | `prod-eks-ap-southeast-1` | Cluster-level grouping |
| `service.namespace` | `billing`, `payments` | Cross-cloud service grouping |
| `deployment.environment` | `production`, `staging` | Environment separation |
Central Backend Options¶
| Backend | Traces | Metrics | Logs | Profiling | License |
|---|---|---|---|---|---|
| Grafana LGTM (Loki, Grafana, Tempo, Mimir) | Tempo | Mimir | Loki | Pyroscope (optional) | AGPL v3 / Commercial |
| Datadog | APM | Metrics | Logs | Continuous Profiler | Commercial |
| Dynatrace | PurePath | Davis AI | Log Analytics | Code-level profiling | Commercial |
| New Relic | Distributed Tracing | Metrics | Logs in Context | n/a | Commercial |
| Splunk Observability | APM | Infrastructure | Splunk Log Observer | n/a | Commercial |
| Honeycomb | Events / Traces | Derived | n/a | n/a | Commercial |
| SigNoz | Traces (ClickHouse) | Metrics | Logs | n/a | MIT / Commercial |
Logging Architecture¶
[ Applications with OTel SDK ]
|
v
[ OTel Collector -- log pipeline ]
|
+---> [ Central Backend (Loki / Datadog) ]
|
+---> [ Cloud-native Log Store (SLS / CLS / CloudWatch) ]
|
+---> [ Compliance archive (OSS / COS / S3 Glacier) ]
Key considerations:
- Retention policies differ per cloud. Standardize retention at the central backend level.
- Compliance-critical logs (audit trails) must be stored in immutable, append-only storage per cloud in addition to centralized aggregation.
- Log volume across clouds can be significant. Use the OTel Collector's `batch` processor and sampling to control ingestion costs.
- For Alibaba Cloud SLS and Tencent CLS, verify OTLP ingestion endpoint availability and stability -- both added OTLP support relatively recently.
Monitoring and Alerting¶
Cross-cloud alerting pattern:
- Each cluster runs a Prometheus-compatible scraper (Prometheus, VictoriaMetrics agent, or OTel Collector with Prometheus receiver).
- Metrics flow to a central Prometheus/VictoriaMetrics/Mimir instance.
- Alertmanager or Grafana Alerting evaluates rules against the unified metrics.
- Alerts route to a single notification system (PagerDuty, Opsgenie, Slack).
Key cross-cloud alert rules:
- Inter-cloud latency SLO breach (measured via synthetic probes between clouds).
- Error budget burn rate across clouds (combined service-level indicator).
- Cost anomaly detection (spend spike in any cloud exceeding threshold).
- Security events (new admin role, root account usage, failed federation attempts).
- Certificate expiration across clouds (cert-manager, ACM, CAS).
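The error-budget burn-rate rule above is worth making concrete. A sketch of the standard multiwindow burn-rate arithmetic (the 99.9% SLO target and 14.4x paging threshold are illustrative defaults, not values mandated by this document):

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """Burn rate = observed error rate / allowed error rate.
    A value of 1.0 consumes the budget exactly over the SLO window."""
    allowed = 1.0 - slo_target
    if allowed <= 0:
        raise ValueError("SLO target must be below 1.0")
    return error_rate / allowed

def should_page(long_rate: float, short_rate: float,
                slo_target: float = 0.999, threshold: float = 14.4) -> bool:
    """Multiwindow alert: page only when BOTH the long and short
    windows burn faster than the threshold (14.4x burns a 30-day
    budget in ~2 days). The short window suppresses stale alerts."""
    return (burn_rate(long_rate, slo_target) >= threshold and
            burn_rate(short_rate, slo_target) >= threshold)
```

The same arithmetic applies per cloud or to the combined cross-cloud SLI, since burn rate is a ratio of rates.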
Distributed Tracing Across Clouds¶
Distributed traces that span services running on different clouds require a consistent trace context propagation mechanism. W3C Trace Context and W3C Baggage are the standard headers.
- Trace Context: `traceparent` and `tracestate` HTTP headers (or gRPC metadata) propagate the trace ID across service boundaries.
- Cross-cloud correlation: Ensure all services use the same W3C Trace Context format. OTel SDKs default to this.
- Trace sampling: Use tail-based sampling at the gateway collector to make sampling decisions based on complete trace structure (e.g., sample all error traces, sample 10% of success traces).
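Both mechanisms can be illustrated in a few lines of Python. The `tail_sample` policy below is a simplified stand-in for the collector's tail-sampling processor, and the hash-based fraction is one possible way to make the decision deterministic across collectors:

```python
import hashlib

def parse_traceparent(header: str) -> dict:
    """Parse a W3C Trace Context traceparent header of the form
    version-traceid-spanid-flags, e.g. 00-<32 hex>-<16 hex>-01."""
    parts = header.split("-")
    if len(parts) != 4 or len(parts[1]) != 32 or len(parts[2]) != 16:
        raise ValueError(f"malformed traceparent: {header}")
    version, trace_id, span_id, flags = parts
    return {"version": version, "trace_id": trace_id,
            "span_id": span_id, "sampled": int(flags, 16) & 0x01 == 1}

def tail_sample(trace_id: str, has_error: bool,
                success_ratio: float = 0.10) -> bool:
    """Tail-sampling policy: keep every error trace; keep a
    deterministic fraction of successful traces by hashing the
    trace ID, so repeated evaluations agree on the decision."""
    if has_error:
        return True
    digest = hashlib.sha256(trace_id.encode()).digest()
    # Map the first 8 bytes to [0, 1) and compare against the ratio.
    return int.from_bytes(digest[:8], "big") / 2**64 < success_ratio
```

Because the decision is a pure function of the trace ID, any gateway collector that sees the complete trace reaches the same verdict.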
FinOps and Cost Management¶
FinOps Maturity Model¶
The FinOps Foundation defines three maturity phases:
| Phase | Focus | Typical Activities |
|---|---|---|
| Crawl (Inform) | Visibility | Tagging, cost allocation, showback reports |
| Walk (Optimize) | Efficiency | Rightsizing, RI/Savings Plan coverage, spot usage, anomaly alerts |
| Run (Operate) | Business integration | Unit economics, chargeback, automated optimization, continuous benchmarking |
Multi-Cloud Cost Visibility¶
Centralized aggregation tools:
| Tool | Clouds | Key Features | License |
|---|---|---|---|
| Apptio Cloudability | AWS, Azure, GCP, Alibaba | Anomaly detection, RI management, showback | Commercial |
| Flexera One | AWS, Azure, GCP, Alibaba, Tencent | Full stack visibility, optimization, SaaS management | Commercial |
| CloudHealth (Broadcom) | AWS, Azure, GCP | Policy-driven governance, cost allocation | Commercial |
| Vantage | AWS, Azure, GCP, Alibaba | Developer-friendly, API-first, cost reports | Commercial |
| Kubecost | Any K8s cluster | Namespace/pod/label-level cost allocation, OpenCost-based | Apache 2.0 / Commercial |
| OpenCost | Any K8s cluster | Open-source K8s cost monitoring | Apache 2.0 |
| AWS Cost Explorer | AWS | Native cost visualization, RI coverage, Savings Plans | Managed service |
| GCP Billing | GCP | Cost breakdown, committed-use discounts, budgets | Managed service |
| Alibaba Cloud Billing | Alibaba | Subscription / Pay-as-you-go analysis, quota management | Managed service |
Commitment-Based Savings¶
Each cloud offers reservation mechanisms for discounted compute in exchange for a commitment period:
| Cloud | Reservation Type | Term | Typical Savings |
|---|---|---|---|
| AWS | Reserved Instances (RI) | 1 or 3 years | Up to 72% vs on-demand |
| AWS | Savings Plans (Compute, EC2 Instance, SageMaker) | 1 or 3 years | Up to 72% vs on-demand |
| GCP | Committed Use Discounts (CUDs) | 1 or 3 years | Up to 57% (resource-based; up to 70% for memory-optimized) |
| GCP | Spot VMs | Preemptible | Up to 91% vs on-demand |
| Alibaba | Reserved Instances (RI) | 1, 3, or 5 years | Up to 55% vs pay-as-you-go |
| Alibaba | Savings Plans | 1 or 3 years | Up to 40% vs pay-as-you-go |
| Tencent | Reserved Instances | 1, 3, or 5 years | Up to 50% vs pay-as-you-go |
| Tencent | Spot Instances | Preemptible | Up to 90% vs pay-as-you-go |
Multi-cloud RI strategy:
- Analyze steady-state compute usage per cloud. Identify workloads running 24/7.
- Purchase RIs/Savings Plans for steady-state baseline in each cloud.
- Use spot/preemptible instances for fault-tolerant, batch, or stateless workloads.
- Review coverage monthly. Underutilized RIs represent wasted commitment.
- For workloads that can run on any cloud, consider spot pricing as a factor in placement decisions.
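Steps 1-2 and the blended-cost reasoning behind them can be sketched numerically. The function names and rates below are illustrative, not provider pricing:

```python
def commitment_plan(hourly_usage: list[float],
                    coverage_target: float = 0.8) -> float:
    """Size a reservation at the coverage target applied to the
    steady-state baseline (the minimum observed hourly usage),
    leaving bursts to on-demand or spot capacity."""
    baseline = min(hourly_usage)
    return baseline * coverage_target

def blended_hourly_cost(usage: float, committed: float,
                        od_rate: float, ri_rate: float) -> float:
    """Blended cost: committed capacity bills at the reserved rate
    whether used or not; demand above the commitment bills
    on-demand. Underused commitments show up as waste here."""
    on_demand = max(usage - committed, 0.0)
    return committed * ri_rate + on_demand * od_rate
```

Running this over last quarter's usage history per cloud gives a quick sanity check on coverage targets before the monthly review in step 4.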
Tagging Strategy¶
A consistent cross-cloud tagging taxonomy is the foundation of cost allocation.
Recommended mandatory tags:
| Tag Key | Purpose | Example |
|---|---|---|
| `environment` | Environment | `production`, `staging`, `development` |
| `team` | Owning team | `billing-team`, `platform-engineering` |
| `service` | Service name | `payment-service`, `user-api` |
| `cost-center` | Financial allocation | `cc-1234`, `engineering` |
| `data-classification` | Data sensitivity | `public`, `internal`, `confidential`, `restricted` |
Per-cloud tag enforcement:
| Cloud | Enforcement Mechanism |
|---|---|
| AWS | SCP conditions requiring tags, AWS Config rules, Service Catalog |
| GCP | Organization Policy constraints, labels on resources |
| Alibaba | RAM policies with tag conditions, Cloud Config rules |
| Tencent | CAM tag-based access control, Cloud Config rules |
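Enforcement can also happen earlier, in CI against IaC plans, before any cloud-side policy fires. A minimal validator for the mandatory tag set above (the function name and the allowed-environment list are illustrative):

```python
MANDATORY_TAGS = {"environment", "team", "service",
                  "cost-center", "data-classification"}
ALLOWED_ENVIRONMENTS = {"production", "staging", "development"}

def validate_tags(tags: dict) -> list[str]:
    """Return a list of violations; an empty list means the
    resource passes the tagging policy."""
    violations = [f"missing tag: {key}"
                  for key in sorted(MANDATORY_TAGS - tags.keys())]
    env = tags.get("environment")
    if env is not None and env not in ALLOWED_ENVIRONMENTS:
        violations.append(f"invalid environment: {env}")
    return violations
```

Wiring this into a pre-apply pipeline step keeps untagged resources from ever being provisioned, which is cheaper than remediating them afterwards.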
Unit Economics¶
Shift from raw spend to cost-per-business-metric:
- `cost-per-transaction` = total cloud spend / number of transactions processed.
- `cost-per-active-user` = total cloud spend / monthly active users.
- `cost-per-api-call` = total cloud spend / API calls served.
- `revenue-per-dollar-of-cloud` = revenue / cloud spend.
OTel metrics can instrument transaction counts and active users, enabling direct correlation with cloud spend data in a unified dashboard (Grafana, Datadog).
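The formulas above reduce to simple division once spend and volume metrics land in the same place; a sketch (input figures are illustrative):

```python
def unit_economics(cloud_spend: float, transactions: int,
                   active_users: int, api_calls: int) -> dict:
    """Translate raw monthly spend into per-unit business metrics."""
    return {
        "cost_per_transaction": cloud_spend / transactions,
        "cost_per_active_user": cloud_spend / active_users,
        "cost_per_api_call": cloud_spend / api_calls,
    }

# Example: $120k monthly spend across all four clouds.
report = unit_economics(120_000.0, transactions=2_400_000,
                        active_users=60_000, api_calls=48_000_000)
```

Tracking these ratios over time is more informative than raw spend: unit costs should fall as the platform scales even while absolute spend grows.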
Cost Anomaly Detection¶
| Tool | Approach |
|---|---|
| AWS Cost Anomaly Detection | ML-based, integrated with Cost Explorer, SNS alerts |
| GCP Budget Alerts | Threshold-based, integrated with Billing, Pub/Sub notifications |
| Alibaba Cloud Cost Alerts | Threshold-based, integrated with Billing, CloudMonitor alerts |
| Third-party (Cloudability, Vantage) | ML-based across clouds, anomaly scoring, alerting |
Green FinOps¶
Carbon-aware workload scheduling is an emerging practice:
- Google Cloud Carbon Footprint: Provides carbon emissions data per project and region.
- AWS Customer Carbon Footprint Tool: Reports estimated emissions per service and region.
- Cloud Carbon Footprint (open source): Aggregates carbon data across AWS, GCP, Azure.
- Strategy: Schedule batch workloads in regions with lower carbon intensity; prefer newer instance types with better performance-per-watt; include carbon cost in placement decisions.
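A carbon-aware placement decision can be as simple as a filtered minimum over an intensity table. A sketch (region names and gCO2e/kWh figures are illustrative, not published provider data):

```python
def pick_greenest_region(carbon_intensity: dict,
                         allowed_regions: set) -> str:
    """Choose the allowed region with the lowest grid carbon
    intensity (gCO2e/kWh). Intensity data would come from a
    source such as Cloud Carbon Footprint or provider reports."""
    candidates = {region: grams for region, grams
                  in carbon_intensity.items() if region in allowed_regions}
    if not candidates:
        raise ValueError("no allowed region has carbon data")
    return min(candidates, key=candidates.get)

# Example: restrict a nightly batch job to regions that satisfy
# data-residency rules, then pick the lowest-carbon one.
intensity = {"europe-north1": 50, "asia-southeast1": 480,
             "us-central1": 390}
```

In practice this lookup would be one factor in a weighted placement score alongside latency, egress cost, and spot pricing.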
Operational Commands and Recipes¶
Terraform Multi-Cloud Plan¶
# Initialize with all providers
terraform init
# Plan against AWS and Alibaba Cloud
terraform plan \
-var-file="aws-ap-southeast-1.tfvars" \
-var-file="alibaba-cn-hangzhou.tfvars" \
-out=multi-cloud.tfplan
# Apply
terraform apply multi-cloud.tfplan
Pulumi Stack Operations¶
# Select AWS stack
pulumi stack select prod-aws
# Preview changes
pulumi preview --diff
# Deploy to AWS
pulumi up --stack prod-aws --yes
# Switch to Alibaba stack
pulumi stack select prod-alibaba
# Deploy to Alibaba Cloud
pulumi up --stack prod-alibaba --yes
# View cross-stack outputs
pulumi stack output --stack prod-aws vpc_id
Crossplane Compose and Deploy¶
# Install AWS provider (Crossplane CLI syntax)
crossplane xpkg install provider xpkg.upbound.io/crossplane-contrib/provider-upjet-aws:v1.17.0
# Install Alibaba provider
crossplane xpkg install provider xpkg.upbound.io/crossplane-contrib/provider-alicloud:v0.5.0
# Apply composite resource
kubectl apply -f datastore-claim.yaml
# Check status
kubectl get composite
kubectl describe xdatastore my-prod-db
ArgoCD Multi-Cluster Sync¶
# Add remote cluster (GCP GKE)
argocd cluster add gke_my-project_asia-southeast1_prod --name prod-gke
# Add remote cluster (Alibaba ACK)
argocd cluster add alibaba-ack-context --name prod-ack
# Sync the Applications generated by the ApplicationSet
# (assumes the ApplicationSet template labels its generated apps)
argocd app sync -l app.kubernetes.io/instance=my-appset --prune
# Check sync status per cluster
argocd app list --output wide
OTel Collector Health Check¶
# Check collector pod status
kubectl get pods -n otel-system -l app=otel-collector
# Check collector metrics (zpages extension)
curl -s http://otel-collector.otel-system:55679/debug/tracez
# Check pipeline metrics at the collector's own metrics endpoint
curl -s http://otel-collector.otel-system:8888/metrics | grep otelcol_exporter
# Verify OTLP connectivity to backend
kubectl exec -n otel-system deployment/otel-collector -- \
wget -qO- --post-data='{"resourceSpans":[]}' \
--header='Content-Type: application/json' \
https://otlp-gateway.example.com/v1/traces
Multi-Cloud Cost Report¶
# AWS -- daily cost via CLI
aws ce get-cost-and-usage \
--time-period Start=2026-04-01,End=2026-04-15 \
--granularity DAILY \
--metrics BlendedCost \
--group-by Type=DIMENSION,Key=SERVICE
# GCP -- billing export query (BigQuery)
bq query --use_legacy_sql=false \
'SELECT service.description, SUM(cost) as total_cost
FROM `project.billing_dataset.gcp_billing_export_v1`
WHERE invoice.month = "202604"
GROUP BY service.description
ORDER BY total_cost DESC'
# Alibaba -- via OpenAPI
aliyun bssopenapi QueryBill \
--BillingCycle 2026-04 \
--PageNum 1 \
--PageSize 100
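Once each CLI's output is parsed into (service, cost) rows, merging the three reports into one ranking is straightforward. A sketch (assumes all costs are already converted to a single currency):

```python
def merge_cost_rows(*per_cloud_rows) -> list[tuple[str, float]]:
    """Merge (service, cost) rows from several clouds into one
    report sorted by descending total cost."""
    totals: dict[str, float] = {}
    for rows in per_cloud_rows:
        for service, cost in rows:
            totals[service] = totals.get(service, 0.0) + cost
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

# Example rows as they might be parsed from the commands above.
aws_rows = [("compute", 120.0), ("storage", 30.0)]
gcp_rows = [("compute", 80.0)]
```

Mapping provider-specific service names (EC2, Compute Engine, ECS) onto shared categories before merging makes the comparison meaningful; that mapping is organization-specific and omitted here.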
Troubleshooting¶
Common Multi-Cloud Observability Issues¶
| Symptom | Likely Cause | Resolution |
|---|---|---|
| Missing traces in one cloud | OTel Collector pod crashlooping or misconfigured exporter | Check collector pod logs; verify OTLP endpoint and credentials |
| High collector memory | Insufficient memory_limiter or excessive log volume | Tune limit_percentage and spike_limit_percentage; add sampling |
| Attribute conflicts in backend | Different `cloud.provider` values for same service | Verify resource processor sets consistent attributes per deployment |
| Cross-cloud trace breaks | Missing W3C Trace Context propagation in a service hop | Verify all services use OTel SDK with W3C propagator; check load-balancer pass-through of trace headers |
Common Multi-Cloud Networking Issues¶
| Symptom | Likely Cause | Resolution |
|---|---|---|
| Intermittent inter-cloud latency spikes | Traffic routing over public internet instead of dedicated interconnect | Verify route tables point to Direct Connect / Express Connect / Interconnect; check BGP routes |
| DNS resolution failures between clouds | Split-horizon DNS misconfiguration or stale NS delegation | Verify NS records at apex zone match cloud DNS zone name servers; check DNS propagation |
| Connection resets between clouds | MTU mismatch on interconnect circuits | Verify MTU settings on VBR (Alibaba), Direct Connect virtual interface (AWS), and interconnect link |
Common FinOps Issues¶
| Symptom | Likely Cause | Resolution |
|---|---|---|
| Unattributed spend (>20% of total) | Missing tags on resources | Enforce tagging policy at provisioning time via IaC modules and cloud org policies |
| RI/Savings Plan underutilization | Over-purchased commitments or workload migration | Rightsize commitment portfolio; exchange convertible RIs; adjust Savings Plan coverage targets |
| Cost anomaly not detected | Alert threshold too high or no anomaly detection configured | Configure AWS Cost Anomaly Detection; set threshold alerts at 10% daily variance in all clouds |
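The 10% daily-variance rule in the last row reduces to a trailing-average comparison; a sketch:

```python
def is_cost_anomaly(daily_costs: list[float], today: float,
                    variance_threshold: float = 0.10) -> bool:
    """Flag today's spend if it deviates from the trailing average
    by more than the threshold (10% per the guidance above)."""
    if not daily_costs:
        return False  # no baseline yet
    baseline = sum(daily_costs) / len(daily_costs)
    if baseline == 0:
        return today > 0
    return abs(today - baseline) / baseline > variance_threshold
```

This catches spikes and drops alike; a drop can indicate a failed deployment as readily as a spike indicates runaway provisioning.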