Multi-Cloud Governance -- Architecture¶
Component breakdown, canonical topology patterns, and reference architectures for enterprises running across AWS, GCP, Alibaba Cloud, and Tencent Cloud.
Networking Patterns¶
Hub-and-Spoke¶
The most common multi-cloud topology. A central network hub (often an on-premises data center or a dedicated transit VPC/VNet) routes all inter-cloud traffic. Each cloud environment is a "spoke" connected via dedicated interconnect.
```
            [ On-Prem DC / Transit Hub ]
               /           |           \
      [ AWS TGW ]    [ GCP NCC ]    [ Alibaba CEN ]
           |               |               |
        [ VPCs ]        [ VPCs ]        [ VPCs ]
```
Key services per cloud:
| Cloud | Hub Service | Interconnect Service |
|---|---|---|
| AWS | Transit Gateway | Direct Connect (1/10/100 Gbps) |
| GCP | Network Connectivity Center (NCC) | Dedicated / Partner Interconnect |
| Alibaba Cloud | Cloud Enterprise Network (CEN) | Express Connect (physical dedicated line) |
| Tencent Cloud | Cloud Connect Network (CCN) | Direct Connect |
Advantages: Centralized routing policy, single audit point, clear blast-radius boundary. Disadvantages: Hub is a single point of failure; all cross-cloud traffic incurs hub hairpinning latency.
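As a minimal Terraform sketch of the AWS side of this topology (resource names, and the spoke VPC/subnet references, are illustrative and assumed to be defined elsewhere):

```hcl
# Illustrative only: a central Transit Gateway hub with one spoke VPC attached.
resource "aws_ec2_transit_gateway" "hub" {
  description                     = "Central hub for inter-cloud transit"
  default_route_table_association = "enable"
}

resource "aws_ec2_transit_gateway_vpc_attachment" "spoke" {
  transit_gateway_id = aws_ec2_transit_gateway.hub.id
  vpc_id             = aws_vpc.spoke.id       # spoke VPC assumed to exist
  subnet_ids         = aws_subnet.spoke[*].id # spoke subnets assumed to exist
}
```

The equivalent attachment on the other clouds goes through CEN transit router attachments (Alibaba) or CCN instance associations (Tencent).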
Full Mesh¶
Every cloud peer connects directly to every other cloud peer, typically via dedicated point-to-point circuits or an SD-WAN overlay. No central transit hub.
Advantages: Lowest inter-cloud latency (direct paths); no single point of failure. Disadvantages: O(n^2) circuit management; complex routing tables; higher cost at scale.
Transit (Backbone)¶
A provider-agnostic backbone -- often a third-party fabric like Equinix Fabric, Megaport, or PacketFabric -- acts as the Layer 2/3 transit layer. Each cloud connects to the nearest fabric node via its dedicated interconnect service. The backbone provides any-to-any reachability with per-circuit QoS and bandwidth policies.
```
       [ Equinix Fabric / Megaport Backbone ]
         /          |           |          \
  [ AWS DC ]  [ GCP DC ]  [ Alibaba PoP ]  [ Tencent PoP ]
       |           |            |               |
    [ VPCs ]    [ VPCs ]     [ VPCs ]        [ VPCs ]
```
Advantages: Centralized bandwidth management; on-demand virtual circuits provisioned below physical port speed; any-to-any reachability without full-mesh circuits. Disadvantages: Added cost of the third-party fabric; dependency on the fabric provider's SLA.
SD-WAN Overlay¶
For enterprises that cannot justify dedicated circuits at every edge, SD-WAN solutions (Cisco Viptela, VMware VeloCloud, Palo Alto Prisma SD-WAN, Fortinet Secure SD-WAN) create encrypted IPsec/GRE tunnels over the public internet between cloud VPCs and branch offices.
Advantages: Rapid provisioning; internet-based (no colo requirement); built-in WAN optimization. Disadvantages: Higher and variable latency; bandwidth limited by internet path; not suitable for latency-sensitive workloads.
Decision Matrix¶
| Factor | Hub-Spoke | Full Mesh | Transit Backbone | SD-WAN Overlay |
|---|---|---|---|---|
| Latency | Medium (hairpin) | Low (direct) | Low-Medium | High-Variable |
| Complexity | Low | High | Medium | Low-Medium |
| Cost | Medium | High | Medium-High | Low |
| Blast Radius | Hub is SPOF | Isolated per link | Fabric is SPOF | Per-tunnel |
| Use Case | Regulated enterprise | Low-latency apps | Global scale | Branch / DR |
Key Interconnect Services Reference¶
| Service | Cloud | Bandwidth | Protocol |
|---|---|---|---|
| AWS Direct Connect | AWS | 1/10/100 Gbps | 802.1Q VLAN, BGP |
| AWS Transit Gateway Connect | AWS | Up to 20 Gbps per attachment (4 × 5 Gbps GRE tunnels) | GRE/BGP |
| GCP Dedicated Interconnect | GCP | 10/100 Gbps per link | 802.1Q, BGP4 |
| GCP Partner Interconnect | GCP | 50 Mbps -- 50 Gbps | Varies by partner |
| GCP Cross-Cloud Interconnect | GCP | 10/100 Gbps per link | Dedicated link to other clouds |
| Alibaba Cloud Express Connect | Alibaba | 1/10/100 Gbps | Physical dedicated line, BGP |
| Alibaba Cloud CEN | Alibaba | Bandwidth packages | Transit routing |
| Tencent Cloud Direct Connect | Tencent | 1/10/100 Gbps | Physical dedicated line, BGP |
| Tencent Cloud CCN | Tencent | Bandwidth limits per instance | Transit routing |
| Equinix Fabric | Vendor-neutral | 50 Mbps -- 100 Gbps | Software-defined L2/L3 |
| Megaport | Vendor-neutral | 1 Mbps -- 100 Gbps | Software-defined L2 |
DNS and Traffic Management¶
Provider DNS Services¶
| Service | Cloud | Key Routing Policies |
|---|---|---|
| Amazon Route 53 | AWS | Latency, geolocation, weighted, failover, multivalue, IP-based |
| Google Cloud DNS | GCP | Geolocation, weighted round-robin, failover (routing policies) |
| Google Cloud Global LB | GCP | Global L7 load balancing with cross-region failover |
| Alibaba Cloud DNS (Alidns) | Alibaba | Geolocation, weighted, ISP-line routing (telecom/unicom/mobile) |
| Tencent DNSPod | Tencent | Geolocation, weighted, ISP-line routing, search-engine lines |
| Cloudflare DNS | Vendor-neutral | Geolocation, weighted, failover, load balancing |
| NS1 (IBM) | Vendor-neutral | Filter chains, regional targeting, Pulsar active telemetry |
Multi-Cloud DNS Strategy Patterns¶
1. Delegated Subdomain per Cloud
Each cloud owns a subdomain zone (e.g., aws.example.com, gcp.example.com, cn.example.com). A global apex zone delegates NS records per subdomain to the respective cloud DNS service.
```
example.com (Route 53 / Cloudflare)
├── aws.example.com     NS → Route 53 hosted zone
├── gcp.example.com     NS → Cloud DNS zone
├── cn.example.com      NS → Alidns zone
└── global.example.com  → GSLB (weighted/latency routing across clouds)
```
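The apex-zone delegation in pattern 1 can be sketched in Terraform (the apex zone reference and the Google name-server values are illustrative; in practice the NS values come from the created Cloud DNS zone's `name_servers` output):

```hcl
# Illustrative: delegate gcp.example.com to a Cloud DNS zone by publishing
# that zone's name servers as NS records in the apex zone.
resource "aws_route53_record" "gcp_delegation" {
  zone_id = aws_route53_zone.apex.zone_id # apex zone assumed to exist
  name    = "gcp.example.com"
  type    = "NS"
  ttl     = 300
  records = [
    "ns-cloud-a1.googledomains.com.",
    "ns-cloud-a2.googledomains.com.",
  ]
}
```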
2. GSLB Overlay with Health Checks
A global DNS layer (Route 53, Cloudflare, NS1) resolves global.example.com by evaluating health-check endpoints in each cloud. On failure, traffic shifts to the next healthy cloud within TTL convergence time.
- Route 53 health checks: HTTP/HTTPS/TCP, configurable request interval (10 s or 30 s), failure threshold.
- Cloudflare Load Balancing: Health checks per pool, failover steering, session affinity.
- NS1 Filter Chain: Programmatic DNS decisions based on telemetry feeds (Pulsar), availability, and geolocation.
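A Route 53 failover pair for `global.example.com` might be sketched as follows (zone reference, endpoint FQDNs, and thresholds are illustrative); a matching SECONDARY record would point at the next healthy cloud:

```hcl
# Illustrative: health-checked PRIMARY record; a SECONDARY record with
# failover_routing_policy type "SECONDARY" would target another cloud.
resource "aws_route53_health_check" "aws_endpoint" {
  fqdn              = "app.aws.example.com"
  type              = "HTTPS"
  resource_path     = "/healthz"
  request_interval  = 10
  failure_threshold = 3
}

resource "aws_route53_record" "global_primary" {
  zone_id         = aws_route53_zone.apex.zone_id # assumed to exist
  name            = "global.example.com"
  type            = "CNAME"
  ttl             = 60
  set_identifier  = "aws-primary"
  records         = ["app.aws.example.com"]
  health_check_id = aws_route53_health_check.aws_endpoint.id

  failover_routing_policy {
    type = "PRIMARY"
  }
}
```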
3. China-Specific DNS Consideration
Mainland China DNS resolution for example.cn domains requires ICP filing. Alidns and DNSPod both support ISP-line resolution (routing queries from China Telecom, China Unicom, China Mobile to the nearest endpoint). For multi-cloud within China, DNSPod or Alidns serve as the apex; for global-with-China architectures, a split-horizon DNS configuration is typical -- international queries resolve via Route 53/Cloudflare; China queries resolve via Alidns/DNSPod.
Traffic-Management Tools Beyond DNS¶
| Tool | Layer | Multi-Cloud |
|---|---|---|
| Istio | L7 (service mesh) | Cross-cluster multi-primary on different clouds |
| Cilium | L3-L7 (CNI + service mesh) | Cluster mesh across clouds via tunnel or direct routing |
| Google Cloud Traffic Director | L7 | Can manage Envoy proxies outside GCP |
| AWS App Mesh | L7 | Tied to AWS; can manage Envoys on EKS/ECS |
| Kong Gateway / Traefik | L7 API Gateway | Cloud-agnostic; deploy anywhere |
Infrastructure-as-Code¶
Terraform / OpenTofu¶
The dominant multi-cloud IaC tool. HCL configurations declare resources across providers in a single state, or split across workspaces. OpenTofu is the Linux Foundation-governed fork created after HashiCorp's BSL license change in 2023.
Multi-provider configuration example:
```hcl
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
    google = {
      source  = "hashicorp/google"
      version = "~> 5.0"
    }
    alicloud = {
      source  = "aliyun/alicloud"
      version = "~> 1.220"
    }
    tencentcloud = {
      source  = "tencentcloudstack/tencentcloud"
      version = "~> 1.80"
    }
  }
}

provider "aws" {
  region = "ap-southeast-1"
}

provider "google" {
  project = "my-project"
  region  = "asia-southeast1"
}

provider "alicloud" {
  region = "cn-hangzhou"
}

provider "tencentcloud" {
  region = "ap-guangzhou"
}
```
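With the providers configured, a single configuration can declare parallel resources in several clouds at once. As a sketch (resource names and CIDRs are illustrative):

```hcl
# Illustrative: one `terraform apply` creates networks in two clouds.
resource "aws_vpc" "app" {
  cidr_block = "10.10.0.0/16"

  tags = {
    Name = "app-vpc"
  }
}

resource "google_compute_network" "app" {
  name                    = "app-network"
  auto_create_subnetworks = false
}
```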
Multi-cloud deployment with identity tokens (Terraform Stacks):
```hcl
identity_token "aws" {
  audience = ["aws.workload.identity"]
}

identity_token "gcp" {
  audience = ["gcp.workload.identity"]
}

deployment "multi_cloud" {
  inputs = {
    aws_token = identity_token.aws.jwt
    gcp_token = identity_token.gcp.jwt
    aws_role  = "arn:aws:iam::123456789012:role/terraform-role"
    gcp_sa    = "terraform@my-project.iam.gserviceaccount.com"
  }
}
```
State isolation strategy: Use separate workspaces or separate state backends per cloud to limit blast radius. Common backend choices: S3 (AWS), GCS (GCP), OSS (Alibaba), COS (Tencent).
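Switching backends per cloud is a one-block change. As a sketch for the AWS-scoped state (bucket, key, and lock-table names are illustrative):

```hcl
# Illustrative: AWS-scoped state lives in S3; the GCP configuration would
# use a "gcs" backend, Alibaba an "oss" backend, and so on.
terraform {
  backend "s3" {
    bucket         = "tfstate-aws-prod"
    key            = "network/terraform.tfstate"
    region         = "ap-southeast-1"
    dynamodb_table = "tfstate-locks" # state locking
  }
}
```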
Pulumi¶
General-purpose IaC using TypeScript, Python, Go, C#, or Java. Same provider ecosystem as Terraform (bridged providers) plus native providers (AWS Native, Azure Native) that offer same-day support for new cloud services.
Multi-cloud provider configuration (TypeScript):
```typescript
import * as aws from "@pulumi/aws";
import * as gcp from "@pulumi/gcp";
import * as alicloud from "@pulumi/alicloud";

const awsProvider = new aws.Provider("aws-provider", {
  region: "ap-southeast-1",
});

const gcpProvider = new gcp.Provider("gcp-provider", {
  project: "my-project",
  region: "asia-southeast1",
});

const aliProvider = new alicloud.Provider("ali-provider", {
  region: "cn-hangzhou",
});
```
Pulumi ESC (Environments, Secrets, and Configuration) provides dynamic OIDC credentials for AWS, Azure, and GCP, eliminating static access keys. Stack references allow one stack to consume outputs from another, enabling dependency tracking across cloud boundaries.
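A stack reference looks like this in TypeScript (the organization/project/stack path and the `vpcId` output name are illustrative):

```typescript
import * as pulumi from "@pulumi/pulumi";

// Illustrative: consume the VPC ID exported by a separate AWS network stack,
// e.g. for peering it from a GCP stack.
const awsNetwork = new pulumi.StackReference("acme/aws-network/prod");
const vpcId = awsNetwork.getOutput("vpcId");
```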
Crossplane¶
Kubernetes-native IaC. Infrastructure is declared as CRDs and reconciled by provider controllers running inside the cluster. Crossplane Compositions allow platform teams to create abstract, cloud-agnostic APIs (XRDs) that map to cloud-specific managed resources underneath.
Multi-cloud composition concept:
```yaml
# Abstract XRD -- cloud-agnostic
apiVersion: apiextensions.crossplane.io/v1
kind: CompositeResourceDefinition
metadata:
  name: xdatastores.example.org
spec:
  group: example.org
  names:
    kind: XDataStore
  versions:
    - name: v1alpha1
      served: true
      referenceable: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                engine:
                  type: string
                  enum: ["mysql", "postgresql"]
                size:
                  type: string
                cloudProvider:
                  type: string
                  enum: ["aws", "gcp", "alibaba"]
```
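Each cloud then gets its own Composition satisfying the same abstract API. A sketch of the AWS side (names and the resource template are illustrative; this uses the classic `spec.resources` form rather than the newer pipeline mode):

```yaml
# Illustrative: map XDataStore to an RDS instance via provider-upjet-aws.
apiVersion: apiextensions.crossplane.io/v1
kind: Composition
metadata:
  name: xdatastores-aws
spec:
  compositeTypeRef:
    apiVersion: example.org/v1alpha1
    kind: XDataStore
  resources:
    - name: rds-instance
      base:
        apiVersion: rds.aws.upbound.io/v1beta1
        kind: Instance
        spec:
          forProvider:
            engine: postgres
            instanceClass: db.t3.medium # illustrative sizing
```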
Provider ecosystem (2025-2026):
| Provider | Maturity | Notes |
|---|---|---|
| provider-upjet-aws | High | Full AWS coverage |
| provider-upjet-gcp | High | Full GCP coverage |
| provider-upjet-azure | High | Full Azure coverage |
| provider-alicloud | Medium | Community-maintained, growing coverage |
| provider-tencentcloud | Low-Medium | Community contribution, limited resources |
Crossplane is strongest when the organization has already standardized on Kubernetes as its platform layer. It integrates natively with GitOps tools (ArgoCD, Flux) since CRDs are just Kubernetes objects.
Decision Matrix¶
| Factor | Terraform/OpenTofu | Pulumi | Crossplane |
|---|---|---|---|
| Learning curve | Low (HCL) | Medium (code) | High (K8s CRDs) |
| Multi-cloud maturity | Highest | High | Growing (AWS/GCP/Azure strong, Alibaba/Tencent developing) |
| GitOps native | Via Atlantis | Via Pulumi Deployments | Native (CRDs) |
| State management | External backend | Pulumi Cloud or self-managed | Kubernetes etcd |
| Platform abstraction | Modules | ComponentResource | Compositions + XRDs |
| Best fit | General infra teams | Developer-heavy teams | Platform engineering / K8s-first orgs |
CI/CD and GitOps¶
GitOps Patterns for Multi-Cloud¶
GitOps uses Git as the single source of truth for declarative infrastructure and application state. Automated controllers (ArgoCD, Flux) continuously reconcile the live state with the desired state in Git.
Hub-and-Spoke GitOps:
A central management cluster runs ArgoCD/Flux, which deploys to remote target clusters across clouds via ApplicationSets or Flux's Kustomization resources.
- ArgoCD ApplicationSets with Git directory or `matrix` generators create one Application per target cluster.
- Flux `Kustomization` resources with `spec.kubeConfig` reference kubeconfigs for remote clusters.
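As a sketch, a cluster-generator ApplicationSet fans one Application out to every cluster registered with ArgoCD (repository URL and paths are illustrative):

```yaml
# Illustrative: one Application per registered cluster; {{name}} and
# {{server}} are filled in by the cluster generator.
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: platform-services
spec:
  generators:
    - clusters: {} # every cluster registered with ArgoCD
  template:
    metadata:
      name: "platform-{{name}}"
    spec:
      project: default
      source:
        repoURL: https://github.com/acme/platform-manifests # illustrative
        targetRevision: main
        path: "clusters/{{name}}"
      destination:
        server: "{{server}}"
        namespace: platform
```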
Per-Cluster GitOps:
Each cluster runs its own ArgoCD/Flux instance. Simpler blast radius but harder to enforce global policy. Useful for regulated environments where clusters must be autonomous.
Progressive Delivery:
- Argo Rollouts: Canary / blue-green / experiment strategies with metric analysis.
- Flagger (with Flux): Automated canary deployments with Prometheus metric analysis.
- ArgoCD Image Updater / Flux Image Automation: Automatically update image tags in Git when new images are published.
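A minimal Argo Rollouts canary strategy, as a sketch (image, replica count, and step weights are illustrative):

```yaml
# Illustrative: shift 20% then 50% of traffic, pausing between steps;
# an analysis template could gate each step on Prometheus metrics.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api
spec:
  replicas: 4
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: registry.example.com/api:v2 # illustrative image
  strategy:
    canary:
      steps:
        - setWeight: 20
        - pause: { duration: 5m }
        - setWeight: 50
        - pause: { duration: 5m }
```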
Multi-Cloud CI Pipeline Structure¶
```
[ Git Push ]
     |
[ CI Pipeline (GitHub Actions / GitLab CI) ]
     ├── Build container image
     ├── Run tests
     ├── Push image to multi-cloud registries
     │     ├── ECR (AWS)
     │     ├── Artifact Registry (GCP)
     │     └── ACR / Alibaba Container Registry
     ├── Update GitOps manifest (image tag)
     └── GitOps controller detects change and rolls out
```
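The registry fan-out step can be sketched as a GitHub Actions job (account IDs, registry hosts, and the image name are illustrative; per-registry login steps are omitted):

```yaml
# Illustrative: build once, tag and push the same image to three registries.
name: release
on: push
jobs:
  release:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: docker build -t app:${{ github.sha }} .
      - run: |
          for registry in \
            123456789012.dkr.ecr.ap-southeast-1.amazonaws.com \
            asia-southeast1-docker.pkg.dev/my-project/apps \
            registry.cn-hangzhou.aliyuncs.com/acme; do
            docker tag app:${{ github.sha }} "$registry/app:${{ github.sha }}"
            docker push "$registry/app:${{ github.sha }}"
          done
```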
Secret management in GitOps: Sealed Secrets (Bitnami), SOPS (Mozilla) with KMS per cloud, or External Secrets Operator syncing from HashiCorp Vault or cloud-native secret stores (AWS Secrets Manager, GCP Secret Manager, Alibaba KMS).
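An External Secrets Operator sketch for the AWS case (the `ClusterSecretStore` name and the secret path are illustrative, and the store itself is assumed to be configured separately):

```yaml
# Illustrative: sync a secret from AWS Secrets Manager into the cluster;
# swapping the secretStoreRef targets GCP Secret Manager, Vault, etc.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: db-credentials
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager # assumed to exist
    kind: ClusterSecretStore
  target:
    name: db-credentials
  data:
    - secretKey: password
      remoteRef:
        key: prod/db/password # illustrative path
```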
Tool Reference¶
| Tool | Type | Multi-Cluster | License |
|---|---|---|---|
| ArgoCD | Pull-based GitOps | ApplicationSets, Sync Waves | Apache 2.0 |
| Flux | Pull-based GitOps | Remote Kustomization, KubeConfig refs | Apache 2.0 |
| Argo Rollouts | Progressive delivery | Multi-cluster rollouts | Apache 2.0 |
| Flagger | Progressive delivery | Per-cluster, pairs with Flux | MIT |
| External Secrets Operator | Secret syncing | Multi-cloud secret store backends | Apache 2.0 |
| Sealed Secrets | In-cluster encryption | Cloud-agnostic | Apache 2.0 |
| SOPS | File-level encryption | KMS per cloud | MPL 2.0 |
Reference Architecture¶
A vendor-neutral enterprise multi-cloud architecture built on CNCF and open-source components.
```mermaid
graph TB
    subgraph "Identity Layer"
        IdP[IdP: Okta / Entra ID / Keycloak]
        SPIFFE[SPIFFE / SPIRE]
    end
    subgraph "Control Plane"
        Git[Git Repository]
        CI[CI Pipeline]
        ArgoCD[ArgoCD / Flux]
        IaC[Terraform / Crossplane]
    end
    subgraph "Networking Layer"
        Fabric[Equinix Fabric / Megaport]
        DNS[Global DNS: Route 53 / Cloudflare]
    end
    subgraph "Workload Plane"
        AWS[AWS EKS]
        GCP[GCP GKE]
        ALI[Alibaba ACK]
        TENCENT[Tencent TKE]
    end
    subgraph "Observability Plane"
        OTel[OTel Collectors]
        Backend[Grafana LGTM / Datadog]
    end
    subgraph "Policy Layer"
        OPA[OPA / Gatekeeper]
        Sentinel[Sentinel / Cloud Org Policies]
    end
    IdP -->|SAML/OIDC| AWS
    IdP -->|SAML/OIDC| GCP
    IdP -->|SAML/OIDC| ALI
    IdP -->|SAML/OIDC| TENCENT
    Git --> CI --> ArgoCD
    ArgoCD --> AWS
    ArgoCD --> GCP
    ArgoCD --> ALI
    ArgoCD --> TENCENT
    IaC --> AWS
    IaC --> GCP
    IaC --> ALI
    IaC --> TENCENT
    Fabric --> AWS
    Fabric --> GCP
    Fabric --> ALI
    Fabric --> TENCENT
    DNS --> AWS
    DNS --> GCP
    DNS --> ALI
    DNS --> TENCENT
    OTel --> Backend
    AWS --> OTel
    GCP --> OTel
    ALI --> OTel
    TENCENT --> OTel
    OPA --> AWS
    OPA --> GCP
    OPA --> ALI
    OPA --> TENCENT
```
Layer Descriptions¶
| Layer | Components | Purpose |
|---|---|---|
| Identity | IdP (SAML/OIDC), SPIFFE/SPIRE | Unified human + workload identity |
| Control Plane | Git, CI, GitOps controller, IaC engine | Declarative intent and reconciliation |
| Networking | Interconnect fabric, global DNS | Cross-cloud connectivity and traffic steering |
| Workload Plane | Managed K8s clusters per cloud | Application runtime |
| Observability | OTel Collectors, central backend | Traces, metrics, logs, profiles |
| Policy | OPA/Gatekeeper, Sentinel, org policies | Governance, compliance, cost guardrails |
How It Works¶
1. Identity federation: The IdP issues SAML assertions (human SSO) and OIDC tokens (workload identity) to each cloud. SPIRE agents on each workload node provide mTLS workload identity independent of cloud IAM.
2. Infrastructure provisioning: Terraform or Crossplane declares VPCs, databases, queues, and other cloud resources. Changes land in Git, CI validates with `terraform plan` or a Crossplane dry run, and changes apply after approval.
3. Application delivery: CI builds images, pushes them to per-cloud registries, and updates GitOps manifests. ArgoCD/Flux reconciles the manifests to clusters across all clouds.
4. Networking: Equinix Fabric or Megaport provides any-to-any private connectivity. Global DNS routes users to the nearest healthy cloud endpoint.
5. Observability: OTel Collectors in each cluster collect traces, metrics, and logs. They export to a central backend (Grafana LGTM stack or a SaaS observability platform) via OTLP.
6. Policy enforcement: OPA/Gatekeeper enforces Kubernetes-level policies (resource limits, allowed images, labeling). Cloud-native org policies (AWS SCPs, Azure Policy, GCP Org Policy, Alibaba RAM policies) enforce cloud-resource-level governance. HashiCorp Sentinel adds policy-as-code guardrails for Terraform operations.
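As a sketch of the Kubernetes-level enforcement, a Gatekeeper constraint applied identically to every cluster might look like this (it assumes the `K8sRequiredLabels` ConstraintTemplate from the Gatekeeper library is installed; the label name is illustrative):

```yaml
# Illustrative: every Namespace must carry an "owner" label, in every cloud.
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels # assumes this ConstraintTemplate is installed
metadata:
  name: ns-must-have-owner
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Namespace"]
  parameters:
    labels: ["owner"]
```

Because this is a plain Kubernetes object, ArgoCD/Flux can distribute it through the same GitOps pipeline as application manifests.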