
GCP Landing Zone: Operations

Deployment recipes, Terraform configurations, CLI commands, and operational best practices for each architecture pattern.

Project and Hierarchy Management

Project Factory Pattern

Use Terraform with the Cloud Foundation Toolkit (CFT) project factory module to standardize project creation:

# Key elements of a project factory
module "project" {
  source  = "terraform-google-modules/project-factory/google"
  version = "~> 17.0"

  name              = "prod-payments-api"
  org_id            = var.org_id
  folder_id         = google_folder.production.id
  billing_account   = var.billing_account_id

  # Network configuration
  svpc_host_project_id = module.shared_vpc.project_id
  shared_vpc_subnets   = ["projects/shared-net-prod/regions/us-central1/subnetworks/app-subnet"]

  # Labels for cost attribution
  labels = {
    team        = "payments"
    environment = "prod"
    cost-center = "cc-1234"
    managed-by  = "terraform"
  }

  # APIs to enable
  activate_apis = [
    "compute.googleapis.com",
    "container.googleapis.com",
    "sqladmin.googleapis.com",
    "monitoring.googleapis.com",
  ]
}

Folder Creation

# Create top-level folders under the organization
gcloud resource-manager folders create \
  --display-name="Production" \
  --organization=ORGANIZATION_ID

gcloud resource-manager folders create \
  --display-name="Non-Production" \
  --organization=ORGANIZATION_ID

gcloud resource-manager folders create \
  --display-name="Shared-Infrastructure" \
  --organization=ORGANIZATION_ID

# Create sub-folders
gcloud resource-manager folders create \
  --display-name="Team-A" \
  --folder=PRODUCTION_FOLDER_ID
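After creation, confirm the hierarchy and capture the numeric folder IDs that the sub-folder commands above (and Terraform `folder_id` inputs) need:

```shell
# List top-level folders and their numeric IDs
gcloud resource-manager folders list --organization=ORGANIZATION_ID

# List sub-folders under the Production folder
gcloud resource-manager folders list --folder=PRODUCTION_FOLDER_ID
```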

Baseline Organization Policies

Apply these at the organization level using Terraform:

# Enforce OS Login
resource "google_organization_policy" "require_os_login" {
  org_id     = var.org_id
  constraint = "constraints/compute.requireOsLogin"

  boolean_policy {
    enforced = true
  }
}

# Restrict resource locations to US and EU
resource "google_organization_policy" "resource_locations" {
  org_id     = var.org_id
  constraint = "constraints/gcp.resourceLocations"

  list_policy {
    allow {
      values = [
        "in/us-locations",
        "in/eu-locations",
      ]
    }
  }
}

# Disable service account key creation
resource "google_organization_policy" "disable_sa_keys" {
  org_id     = var.org_id
  constraint = "constraints/iam.disableServiceAccountKeyCreation"

  boolean_policy {
    enforced = true
  }
}

# Restrict external IPs
resource "google_organization_policy" "vm_external_ip" {
  org_id     = var.org_id
  constraint = "constraints/compute.vmExternalIpAccess"

  list_policy {
    deny {
      all = true
    }
  }
}

# Enforce Shielded VMs
resource "google_organization_policy" "require_shielded_vm" {
  org_id     = var.org_id
  constraint = "constraints/compute.requireShieldedVm"

  boolean_policy {
    enforced = true
  }
}
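To confirm what is actually enforced, list the policies set at the organization level and inspect one of them; these are read-only checks using the v1 commands that match the `google_organization_policy` resources above:

```shell
# List all org policies set directly on the organization
gcloud resource-manager org-policies list --organization=ORGANIZATION_ID

# Show the effective (merged with inherited) policy for one constraint
gcloud resource-manager org-policies describe compute.requireOsLogin \
  --organization=ORGANIZATION_ID \
  --effective
```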

Networking Operations

Shared VPC Setup

# 1. Enable the host project
gcloud compute shared-vpc enable shared-net-prod

# 2. Attach service projects
gcloud compute shared-vpc associated-projects add payments-prod \
  --host-project=shared-net-prod

# 3. Grant subnet-level access to service project admins
gcloud compute networks subnets add-iam-policy-binding app-subnet \
  --region=us-central1 \
  --project=shared-net-prod \
  --member="group:[email protected]" \
  --role="roles/compute.networkUser"

VPC Peering Setup

# Create peering from network-a to network-b
gcloud compute networks peerings create peer-a-to-b \
  --network=network-a \
  --peer-network=network-b \
  --peer-project=project-b \
  --export-custom-routes \
  --import-custom-routes

# Create reciprocal peering from network-b to network-a
gcloud compute networks peerings create peer-b-to-a \
  --network=network-b \
  --peer-network=network-a \
  --peer-project=project-a \
  --export-custom-routes \
  --import-custom-routes

# Verify peering status
gcloud compute networks peerings list --network=network-a

Cloud NAT Configuration

# Create Cloud Router (required for Cloud NAT)
gcloud compute routers create nat-router \
  --network=shared-vpc \
  --region=us-central1

# Create Cloud NAT on the router
gcloud compute routers nats create nat-gateway \
  --router=nat-router \
  --region=us-central1 \
  --nat-all-subnet-ip-ranges \
  --auto-allocate-nat-external-ips

Private Service Connect for Google APIs

# Reserve a global internal address for the endpoint
gcloud compute addresses create google-api-address \
  --global \
  --purpose=PRIVATE_SERVICE_CONNECT \
  --addresses=10.0.100.10 \
  --network=shared-vpc

# Create a PSC endpoint for the Google APIs bundle
gcloud compute forwarding-rules create google-api-endpoint \
  --global \
  --network=shared-vpc \
  --address=google-api-address \
  --target-google-apis-bundle=all-apis
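The endpoint only takes effect once clients resolve Google API hostnames to it, which a private Cloud DNS zone handles. A minimal sketch, assuming the zone name googleapis-zone (hypothetical):

```shell
# Private zone overriding googleapis.com inside the VPC
gcloud dns managed-zones create googleapis-zone \
  --description="Route Google APIs through the PSC endpoint" \
  --dns-name=googleapis.com. \
  --visibility=private \
  --networks=shared-vpc

# Point the apex record at the PSC endpoint address
gcloud dns record-sets create googleapis.com. \
  --zone=googleapis-zone --type=A --ttl=300 \
  --rrdatas=10.0.100.10

# CNAME all subdomains (storage.googleapis.com, etc.) to the apex
gcloud dns record-sets create "*.googleapis.com." \
  --zone=googleapis-zone --type=CNAME --ttl=300 \
  --rrdatas=googleapis.com.
```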

Compute and Orchestration

Regional Managed Instance Group

# Create instance template
gcloud compute instance-templates create app-template \
  --machine-type=e2-standard-4 \
  --image-family=debian-12 \
  --image-project=debian-cloud \
  --network=shared-vpc \
  --subnet=app-subnet \
  --region=us-central1 \
  --no-address \
  --tags=app-server

# Create regional MIG
gcloud compute instance-groups managed create app-mig \
  --template=app-template \
  --region=us-central1 \
  --size=3 \
  --health-check=http-health-check

# Set autoscaling
gcloud compute instance-groups managed set-autoscaling app-mig \
  --region=us-central1 \
  --min-num-replicas=3 \
  --max-num-replicas=20 \
  --target-cpu-utilization=0.7

GKE Multi-Zone Cluster in Shared VPC

gcloud container clusters create prod-cluster \
  --network=shared-vpc \
  --subnetwork=gke-subnet \
  --region=us-central1 \
  --num-nodes=2 \
  --node-locations=us-central1-a,us-central1-b,us-central1-c \
  --enable-ip-alias \
  --enable-autoscaling \
  --min-nodes=1 \
  --max-nodes=10 \
  --enable-shielded-nodes \
  --workload-pool=prod-project.svc.id.goog \
  --enable-binary-authorization

Data Layer Operations

Cloud SQL HA with Cross-Region Replica

# Create HA instance (automatic regional failover)
gcloud sql instances create prod-db \
  --database-version=POSTGRES_15 \
  --tier=db-custom-4-16384 \
  --region=us-central1 \
  --availability-type=REGIONAL \
  --network=shared-vpc \
  --no-assign-ip

# Create cross-region read replica for DR
gcloud sql instances create prod-db-replica \
  --master-instance-name=prod-db \
  --region=us-east1 \
  --tier=db-custom-4-16384 \
  --network=shared-vpc \
  --no-assign-ip

Cloud Spanner Multi-Region

# Create multi-region Spanner instance
gcloud spanner instances create prod-spanner \
  --config=nam6 \
  --description="Production Spanner" \
  --nodes=3

# nam6 = us-central1 + us-east1 read-write regions (multi-region, 99.999% SLA)
# eur3 = europe-west1 + europe-west4

Dual-Region Cloud Storage

# Create dual-region bucket for DR (NAM4 = us-central1 + us-east1)
gcloud storage buckets create gs://prod-assets-bucket \
  --location=NAM4 \
  --default-storage-class=STANDARD
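Where the default asynchronous replication window is too slow for the DR plan, dual-region buckets support turbo replication, which targets a 15-minute RPO between the paired regions (at extra cost):

```shell
# Enable turbo replication on the existing dual-region bucket
gcloud storage buckets update gs://prod-assets-bucket \
  --recovery-point-objective=ASYNC_TURBO
```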

DR Operations

Failover Runbook (Warm Standby)

  1. Detect: Cloud Monitoring alert fires (primary region health check failures)
  2. Verify: On-call SRE confirms primary region is degraded
  3. Scale up DR region:
    # Scale up compute in DR region
    gcloud compute instance-groups managed resize app-mig-dr \
      --region=us-east1 \
      --size=10
    
    # Promote Cloud SQL replica to standalone
    gcloud sql instances promote-replica prod-db-replica
    
  4. Shift traffic: Global LB automatically routes to healthy DR region based on health checks
  5. Verify: Confirm application is serving traffic from DR region
  6. Communicate: Notify stakeholders of failover
  7. Post-incident: After primary recovers, re-establish replication and plan failback
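Steps 4 and 5 can be verified from the CLI rather than the console. A sketch, assuming the global backend service is named app-backend (hypothetical):

```shell
# Show per-backend health as seen by the global load balancer
gcloud compute backend-services get-health app-backend --global

# Confirm the promoted replica is now a standalone primary:
# instanceType changes from READ_REPLICA_INSTANCE to CLOUD_SQL_INSTANCE
gcloud sql instances describe prod-db-replica \
  --format="value(instanceType)"
```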

Automated Failover (Active-Active)

For active-active, failover is automatic. The global load balancer stops sending traffic to the unhealthy region based on health check results. No manual intervention required. The key operational task is testing:

# Run a game day / DR test
# 1. Disable health check on primary region backends
# 2. Verify traffic shifts to secondary region
# 3. Re-enable health check
# 4. Verify traffic returns to normal distribution
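Rather than deleting or disabling health checks during a game day, one less invasive approach is to drain the primary region's backend by setting its serving capacity to zero (backend service and MIG names assumed from the earlier examples):

```shell
# Drain primary-region backends: the LB sends 0% of traffic there
gcloud compute backend-services update-backend app-backend \
  --global \
  --instance-group=app-mig \
  --instance-group-region=us-central1 \
  --capacity-scaler=0.0

# Restore normal serving after the test
gcloud compute backend-services update-backend app-backend \
  --global \
  --instance-group=app-mig \
  --instance-group-region=us-central1 \
  --capacity-scaler=1.0
```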

Monitoring and Observability

Centralized Logging Sink

# Create aggregated log sink at organization level
gcloud logging sinks create org-logs-sink \
  bigquery.googleapis.com/projects/shared-logging/datasets/org_logs \
  --organization=ORGANIZATION_ID \
  --include-children \
  --log-filter='severity>=WARNING'
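The sink writes nothing until its generated service account is granted access to the destination. A coarse but simple approach is a project-level grant on the destination project (the writer identity is unique per sink):

```shell
# Look up the sink's writer identity service account
WRITER=$(gcloud logging sinks describe org-logs-sink \
  --organization=ORGANIZATION_ID \
  --format="value(writerIdentity)")

# Grant it write access to BigQuery in the destination project
gcloud projects add-iam-policy-binding shared-logging \
  --member="$WRITER" \
  --role="roles/bigquery.dataEditor"
```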

Recommended Alerts

| Alert | Condition | Severity |
|---|---|---|
| Region health check failures | >50% backends unhealthy for 2 min | Critical |
| Cross-region replication lag | Lag > 60 seconds for 5 min | Warning |
| Cloud NAT port exhaustion | Allocated ports >80% for 5 min | Warning |
| IAM policy changes | Any change to org-level IAM | Info |
| Budget threshold | Spend >80% of monthly budget | Warning |

Troubleshooting

Shared VPC: Service Project Cannot Reach Subnet

Check that the Service Project Admin has compute.networkUser on the specific subnet:

gcloud compute networks subnets get-iam-policy app-subnet \
  --region=us-central1 \
  --project=shared-net-prod

VPC Peering: Routes Not Propagating

Verify both sides have matching import/export settings:

gcloud compute networks peerings list --network=network-a
# Check: exportCustomRoutes=true, importCustomRoutes=true on both sides
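If either side is missing a flag, the peering can be updated in place rather than deleted and recreated:

```shell
# Fix the a-side; repeat on network-b if its flags are off too
gcloud compute networks peerings update peer-a-to-b \
  --network=network-a \
  --export-custom-routes \
  --import-custom-routes
```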

Cloud NAT: Connection Failures

Check NAT allocation and port exhaustion:

gcloud compute routers get-nat-mapping-info nat-router --region=us-central1
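If ports are exhausted, raising the per-VM port allocation and enabling NAT logging to see dropped connections are the usual first steps before adding static NAT IPs:

```shell
# Raise the per-VM port allocation and log dropped connections
gcloud compute routers nats update nat-gateway \
  --router=nat-router \
  --region=us-central1 \
  --min-ports-per-vm=128 \
  --enable-logging \
  --log-filter=ERRORS_ONLY
```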

DR: Cloud SQL Replica Promote Fails

Ensure the replica is in a healthy state before promoting:

gcloud sql instances describe prod-db-replica --format="value(status)"
# Must be RUNNABLE before promote
