Alibaba Cloud -- Operations
Deployment patterns, CLI recipes, monitoring, and troubleshooting for day-2 operations on Alibaba Cloud (Aliyun).
Deployment Patterns
Resource Directory & Landing Zone Deployment
Cloud Governance Center provides a Landing Zone setup wizard that provisions the core account structure automatically. For IaC-driven deployments, use either ROS (Resource Orchestration Service) or the Terraform Alibaba Cloud provider.
# Create a Resource Directory folder (OU) via CLI
# ParentFolderId takes the root folder ID (r-...) or an existing folder ID (fd-...)
aliyun resourcemanager CreateFolder \
--ParentFolderId r-XXXXXX \
--FolderName "Workloads"
# Create a member account under that folder
aliyun resourcemanager CreateResourceAccount \
--DisplayName "bu-a-prod" \
--ParentFolderId fd-XXXXXXXX \
--AccountNamePrefix bu-a-prod
# Apply a control policy to an OU
aliyun resourcemanager AttachControlPolicy \
--PolicyId cp-XXXXXXXX \
--TargetId fd-XXXXXXXX
Terraform Provider
The alicloud Terraform provider covers most services. A typical
multi-account setup uses the alicloud_resource_manager_* resources
combined with alicloud_cen_* for networking.
provider "alicloud" {
region = "cn-hangzhou"
access_key = var.access_key
secret_key = var.secret_key
}
resource "alicloud_vpc" "main" {
vpc_name = "prod-vpc"
cidr_block = "10.0.0.0/16"
}
resource "alicloud_vswitch" "app" {
vpc_id = alicloud_vpc.main.id
cidr_block = "10.0.1.0/24"
zone_id = "cn-hangzhou-h"
}
CLI & SDK
Installation
# Install aliyun CLI (macOS)
brew install aliyun-cli
# Configure a profile
aliyun configure set \
--profile prod \
--mode AK \
--region cn-hangzhou \
--access-key-id LTAI5tXXXXXXXXXX \
--access-key-secret XXXXXXXXXXXXXXXXXXXXXXXX
ECS (Elastic Compute Service)
# List ECS instances in cn-hangzhou as a table (first page of results only)
aliyun ecs DescribeInstances --RegionId cn-hangzhou --output cols=InstanceId,InstanceName,Status rows="Instances.Instance[]"
# Start a stopped instance
aliyun ecs StartInstance --InstanceId i-bp1xxxxxxxxxxxxxxxxx
# Create a snapshot of a disk
aliyun ecs CreateSnapshot --DiskId d-bp1xxxxxxxxxxxxxxxxx --SnapshotName "pre-deploy-2026-04-17"
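The snapshot name above embeds a fixed date. A minimal sketch of generating that name at run time follows; `snapshot_name` is a hypothetical helper, not part of the aliyun CLI, and the `prefix-YYYY-MM-DD` pattern is only the naming convention used in the example above.

```shell
# Hypothetical helper: build "<prefix>-<YYYY-MM-DD>".
# $1 = name prefix, $2 = date (optional, defaults to today).
snapshot_name() {
  echo "${1}-${2:-$(date +%Y-%m-%d)}"
}

# Example: a pre-deploy snapshot name stamped with today's date.
SNAPSHOT_NAME=$(snapshot_name pre-deploy)
echo "$SNAPSHOT_NAME"
```

The result can then be passed to the command above as `--SnapshotName "$SNAPSHOT_NAME"`.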
VPC & Networking
# List VPCs in a region
aliyun vpc DescribeVpcs --RegionId cn-hangzhou
# List the route tables of a VPC
aliyun vpc DescribeRouteTableList --RegionId cn-hangzhou --VpcId vpc-bp1xxxxxxxxxxxxxxxxx
# Create a security group rule (allow HTTPS inbound)
aliyun ecs AuthorizeSecurityGroup \
--SecurityGroupId sg-bp1xxxxxxxxxxxxxxxxx \
--IpProtocol tcp \
--PortRange 443/443 \
--SourceCidrIp 0.0.0.0/0 \
--Policy accept
SLB (Server Load Balancer)
# List all SLB instances
aliyun slb DescribeLoadBalancers --RegionId cn-hangzhou
# Add a backend server to an SLB
aliyun slb AddBackendServers \
--LoadBalancerId lb-bp1xxxxxxxxxxxxxxxxx \
--BackendServers '[{"ServerId":"i-bp1xxxxxxxxxxxxxxxxx","Weight":"100"}]'
RDS (ApsaraDB for RDS)
# List all RDS instances
aliyun rds DescribeDBInstances --RegionId cn-hangzhou
# Create a manual backup
aliyun rds CreateBackup --DBInstanceId rm-bp1xxxxxxxxxxxxxxxxx --BackupMethod Physical
# Switch to a standby instance (planned failover)
aliyun rds SwitchDBInstanceHA --DBInstanceId rm-bp1xxxxxxxxxxxxxxxxx --NodeId xxxxx
CEN (Cloud Enterprise Network)
# List CEN instances
aliyun cbn DescribeCens
# List the route entries in a Transit Router route table
aliyun cbn ListTransitRouterRouteEntries \
--TransitRouterRouteTableId vtb-bp1xxxxxxxxxxxxxxxxx
# Attach a VPC to a Transit Router
aliyun cbn CreateTransitRouterVpcAttachment \
--CenId cen-xxxxxxxxxxxxxxxxx \
--TransitRouterId tr-bp1xxxxxxxxxxxxxxxxx \
--VpcId vpc-bp1xxxxxxxxxxxxxxxxx \
--ZoneMappings '[{"ZoneId":"cn-hangzhou-h","VSwitchId":"vsw-bp1xxxxxxxxxxxxxxxxx"}]'
Monitoring & Alerting
CloudMonitor
CloudMonitor collects host-level and service-level metrics automatically.
Custom metrics can be pushed via the PutCustomMetric API.
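As a rough illustration of a custom-metric push, the sketch below only prints a `PutCustomMetric` command for one raw sample so it can be reviewed before running. `put_custom_metric_cmd`, the `queue_depth` metric, and the `host` dimension are hypothetical. The `MetricList.1.*` flag names are assumed to mirror the API's request fields (with `GroupId 0` meaning no application group and `Type 0` meaning raw data), so confirm them with `aliyun cms PutCustomMetric help` first.

```shell
# Hypothetical helper: print (do not run) a PutCustomMetric call for one raw sample.
# $1 = metric name, $2 = value of the "host" dimension, $3 = numeric sample value.
# Assumption: flags follow the MetricList.N.* request fields of PutCustomMetric.
put_custom_metric_cmd() {
  printf "aliyun cms PutCustomMetric"
  printf " --MetricList.1.GroupId 0"
  printf " --MetricList.1.MetricName %s" "$1"
  printf " --MetricList.1.Type 0"
  printf " --MetricList.1.Dimensions '{\"host\":\"%s\"}'" "$2"
  printf " --MetricList.1.Values '{\"value\":%s}'\n" "$3"
}

# Example: report a queue depth of 42 for host app-01.
put_custom_metric_cmd queue_depth app-01 42
```

Printing the command instead of executing it keeps the sketch free of side effects.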
# List available metric definitions for ECS
aliyun cms DescribeMetricMetaList --Namespace acs_ecs_dashboard
# Query the most recent CPU utilization data point for an ECS instance (300-second period)
aliyun cms DescribeMetricLast \
--Namespace acs_ecs_dashboard \
--MetricName CPUUtilization \
--Dimensions '[{"instanceId":"i-bp1xxxxxxxxxxxxxxxxx"}]' \
--Period 300
# Create an alarm rule: average CPU > 80% for 3 consecutive 300-second periods (15 minutes)
aliyun cms PutResourceMetricRule \
--RuleId cpu-high-prod \
--RuleName "CPU > 80%" \
--Namespace acs_ecs_dashboard \
--MetricName CPUUtilization \
--Resources '[{"instanceId":"i-bp1xxxxxxxxxxxxxxxxx"}]' \
--Escalations.Critical.Statistics Average \
--Escalations.Critical.ComparisonOperator GreaterThanThreshold \
--Escalations.Critical.Threshold 80 \
--Escalations.Critical.Times 3 \
--Period 300 \
--ContactGroups '["ops-team"]'
SLS (Simple Log Service)
SLS is Alibaba Cloud's centralized log management service. It provides real-time log collection, search, dashboards, and alerting.
# Create a log project
aliyun sls CreateProject --body '{"projectName":"prod-logs","description":"Production logging"}'
# Create a logstore within the project
aliyun sls CreateLogStore \
--project prod-logs \
--body '{"logstoreName":"app-logs","ttl":90,"shardCount":2}'
# Query logs (SLS query language)
aliyun sls GetLogs \
--project prod-logs \
--logstore app-logs \
--from 1713340800 \
--to 1713344400 \
--query "status >= 500 | SELECT count(*) as error_count, host GROUP BY host"
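The `--from` and `--to` values in the query above are Unix timestamps in seconds (the example spans one hour). A minimal sketch of computing such a window in the shell follows; `sls_window` is a hypothetical helper, not part of the aliyun CLI.

```shell
# Hypothetical helper: print "<from> <to>" in Unix epoch seconds.
# $1 = window length in seconds, $2 = window end (optional, defaults to now).
sls_window() {
  window_seconds="$1"
  window_end="${2:-$(date +%s)}"
  echo "$((window_end - window_seconds)) ${window_end}"
}

# Example: the last hour, split into separate variables.
WINDOW=$(sls_window 3600)
FROM=${WINDOW% *}   # text before the space
TO=${WINDOW#* }     # text after the space
echo "--from ${FROM} --to ${TO}"
```

The resulting values can be passed to `aliyun sls GetLogs` as `--from "$FROM" --to "$TO"` in place of the fixed timestamps.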
Centralized logging across accounts
Use SLS cross-account log delivery to stream ActionTrail and application logs from all member accounts to a central Log Archive account's SLS project. Configure log delivery rules in the Resource Directory management account.
Troubleshooting
CEN Connectivity Issues
| Symptom | Likely Cause | Resolution |
|---|---|---|
| VPC-to-VPC ping fails across CEN | Missing route table association or propagation | Check TR route tables: aliyun cbn ListTransitRouterRouteTables --TransitRouterId tr-xxx. Verify the VPC attachment is associated with the correct route table and that route propagation is enabled. |
| Cross-region traffic drops | No bandwidth package or bandwidth exhausted | Verify bandwidth package allocation: aliyun cbn DescribeCenBandwidthPackages --CenId cen-xxx. Purchase or increase bandwidth. |
| Asymmetric routing through firewall VPC | Custom route tables not directing return traffic | Ensure both inbound and outbound custom route tables point return traffic through the firewall VPC's TR attachment. |
Cross-Region Replication Failures
# Check DTS synchronization status
aliyun dts DescribeSynchronizationJobStatus \
--SynchronizationJobId dtsi-bp1xxxxxxxxxxxxxxxxx
# Check OSS CRR status
aliyun oss GetBucketReplication --bucket source-bucket --region cn-hangzhou
OSS CRR requires matching versioning states
Cross-region replication requires the source and destination buckets to be in the same versioning state: both versioning-enabled or both unversioned. Always verify the versioning status of both buckets before enabling CRR.
NAT Gateway Troubleshooting
| Symptom | Likely Cause | Resolution |
|---|---|---|
| Private instances cannot reach Internet | No SNAT entry for the vSwitch | Create an SNAT entry: aliyun vpc CreateSnatEntry --SnatTableId stb-xxx --SnatIp 47.x.x.x --SourceVSwitchId vsw-xxx |
| SNAT connections exhausted | High concurrency exceeding NAT Gateway capacity | Upgrade NAT Gateway specification or add additional EIPs to the SNAT pool |
| DNAT port forwarding not working | Security Group blocking the forwarded port | Verify the target ECS instance's Security Group allows inbound traffic on the DNAT port |
General Diagnostic Commands
# Describe an ECS instance's network interfaces
aliyun ecs DescribeNetworkInterfaces \
--InstanceId i-bp1xxxxxxxxxxxxxxxxx \
--RegionId cn-hangzhou
# Check the rules of a security group attached to the instance
aliyun ecs DescribeSecurityGroupAttribute \
--SecurityGroupId sg-bp1xxxxxxxxxxxxxxxxx \
--RegionId cn-hangzhou
# Verify VPC flow logs are enabled
aliyun vpc DescribeFlowLogs --RegionId cn-hangzhou --ResourceId vpc-bp1xxxxxxxxxxxxxxxxx