AWS -- Operations¶
Deployment patterns, CLI recipes, monitoring, and troubleshooting for day-2 operations on Amazon Web Services.
Deployment Patterns¶
Control Tower & Account Factory¶
AWS Control Tower automates Landing Zone provisioning with guardrails, centralized logging, and Account Factory for vending new accounts.
Account Factory for Terraform (AFT) extends Control Tower with IaC-driven account provisioning. The commands below inspect an existing Control Tower deployment:
# List controls (guardrails) enabled on an organizational unit
aws controltower list-enabled-controls \
--target-identifier arn:aws:organizations::123456789012:ou/o-xxx/ou-xxx
# List all accounts in the organization (run from the management account)
aws organizations list-accounts --output table
Service Catalog¶
Service Catalog provides self-service product provisioning with governance. Platform teams define approved CloudFormation templates as products; application teams launch them without direct CloudFormation access.
# Search the products you have access to by keyword
aws servicecatalog search-products \
--filters FullTextSearch=vpc
# Provision a product
aws servicecatalog provision-product \
--product-id prod-XXXXXXXXXXXX \
--provisioned-product-name my-vpc \
--provisioning-artifact-id pa-XXXXXXXXXXXX \
--provisioning-parameters Key=VpcCidr,Value=10.0.0.0/16
CloudFormation StackSets¶
StackSets deploy CloudFormation stacks across multiple accounts and regions from a single template.
# Create a StackSet with service-managed permissions (Organizations integration)
aws cloudformation create-stack-set \
--stack-set-name baseline-config-rules \
--template-body file://config-rules.yaml \
--permission-model SERVICE_MANAGED \
--auto-deployment Enabled=true,RetainStacksOnAccountRemoval=false
# Deploy to all accounts in an OU
aws cloudformation create-stack-instances \
--stack-set-name baseline-config-rules \
--deployment-targets OrganizationalUnitIds=ou-xxxx-xxxxxxxx \
--regions us-east-1 us-west-2
CLI & SDK¶
VPC & Networking¶
# List all VPCs
aws ec2 describe-vpcs --output table --query 'Vpcs[*].[VpcId,CidrBlock,Tags[?Key==`Name`].Value|[0]]'
# List subnets for a VPC
aws ec2 describe-subnets --filters Name=vpc-id,Values=vpc-0xxxxxxxxxxxxxxxxx \
--query 'Subnets[*].[SubnetId,CidrBlock,AvailabilityZone]' --output table
# Search the active routes in a Transit Gateway route table
aws ec2 search-transit-gateway-routes \
--transit-gateway-route-table-id tgw-rtb-0xxxxxxxxxxxxxxxxx \
--filters Name=state,Values=active
EC2¶
# List running instances with key details
aws ec2 describe-instances \
--filters Name=instance-state-name,Values=running \
--query 'Reservations[*].Instances[*].[InstanceId,InstanceType,PrivateIpAddress,Tags[?Key==`Name`].Value|[0]]' \
--output table
# Create an AMI from a running instance
# (--no-reboot skips the shutdown, so file system consistency is not guaranteed)
aws ec2 create-image \
--instance-id i-0xxxxxxxxxxxxxxxxx \
--name "pre-deploy-$(date +%Y-%m-%d)" \
--no-reboot
EKS¶
# List clusters
aws eks list-clusters
# Update kubeconfig for a cluster
aws eks update-kubeconfig --name my-cluster --region us-east-1
# Describe a node group
aws eks describe-nodegroup --cluster-name my-cluster --nodegroup-name workers
RDS¶
# List all RDS instances
aws rds describe-db-instances \
--query 'DBInstances[*].[DBInstanceIdentifier,Engine,DBInstanceStatus,MultiAZ]' \
--output table
# Create a manual snapshot
aws rds create-db-snapshot \
--db-instance-identifier prod-db \
--db-snapshot-identifier prod-db-$(date +%Y%m%d)
# Initiate a failover for a Multi-AZ instance
aws rds reboot-db-instance \
--db-instance-identifier prod-db \
--force-failover
CloudFormation¶
# Deploy a stack
aws cloudformation deploy \
--template-file infra.yaml \
--stack-name prod-infra \
--capabilities CAPABILITY_NAMED_IAM \
--parameter-overrides Environment=prod VpcCidr=10.0.0.0/16
# List stack events (troubleshoot failed deployments)
aws cloudformation describe-stack-events \
--stack-name prod-infra \
--query 'StackEvents[?contains(ResourceStatus, `FAILED`)].[LogicalResourceId,ResourceStatus,ResourceStatusReason]' \
--output table
Monitoring & Alerting¶
CloudWatch¶
CloudWatch collects metrics, logs, and traces. Custom metrics can be
published via put-metric-data.
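As a minimal sketch of publishing a custom metric, the helper below wraps put-metric-data in a shell function. The namespace Custom/App, the metric name QueueDepth, and the Environment dimension are hypothetical placeholders chosen for illustration; they do not come from this guide.

```shell
# Publish one data point for a custom metric.
# Namespace, metric name, and dimension key are illustrative placeholders.
#   $1 = numeric value to record
#   $2 = environment label used as a dimension (for example, prod)
publish_queue_depth() {
  aws cloudwatch put-metric-data \
    --namespace "Custom/App" \
    --metric-name QueueDepth \
    --dimensions "Environment=$2" \
    --value "$1" \
    --unit Count
}

# Usage: publish_queue_depth 42 prod
```

Note that put-metric-data takes dimensions as Key=Value pairs, unlike get-metric-statistics, which expects Name=...,Value=... pairs.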
# Query EC2 CPU utilization (last hour, 5-minute periods)
# Note: -v-1H is BSD/macOS date syntax; on GNU/Linux use: date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ
aws cloudwatch get-metric-statistics \
--namespace AWS/EC2 \
--metric-name CPUUtilization \
--dimensions Name=InstanceId,Value=i-0xxxxxxxxxxxxxxxxx \
--start-time "$(date -u -v-1H +%Y-%m-%dT%H:%M:%SZ)" \
--end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
--period 300 \
--statistics Average
# Create a CloudWatch alarm
aws cloudwatch put-metric-alarm \
--alarm-name "high-cpu-prod" \
--namespace AWS/EC2 \
--metric-name CPUUtilization \
--dimensions Name=InstanceId,Value=i-0xxxxxxxxxxxxxxxxx \
--statistic Average \
--period 300 \
--threshold 80 \
--comparison-operator GreaterThanThreshold \
--evaluation-periods 3 \
--alarm-actions arn:aws:sns:us-east-1:123456789012:ops-alerts
CloudTrail¶
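CloudTrail records API activity across AWS services. Management events are logged by default; data events (for example, S3 object-level operations) must be enabled explicitly on a trail. For multi-account setups, create an organization trail from the management account.
# Look up recent management events (lookup-events covers the last 90 days)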
aws cloudtrail lookup-events \
--lookup-attributes AttributeKey=EventName,AttributeValue=StopInstances \
--max-results 10
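The organization trail mentioned above could be created roughly as follows. This is a sketch under assumptions: it is run from the management account, trusted access for CloudTrail is already enabled in AWS Organizations, and the S3 bucket exists with a bucket policy that allows CloudTrail to write to it. The trail and bucket names in the usage line are hypothetical.

```shell
# Create a multi-region organization trail, then start logging.
#   $1 = trail name (placeholder)
#   $2 = existing S3 bucket that receives the log files (placeholder)
create_org_trail() {
  aws cloudtrail create-trail \
    --name "$1" \
    --s3-bucket-name "$2" \
    --is-organization-trail \
    --is-multi-region-trail &&
    aws cloudtrail start-logging --name "$1"
}

# Usage: create_org_trail org-trail my-org-cloudtrail-logs
```

A new trail does not log until start-logging is called, which is why the two commands are chained.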
AWS Config¶
Config continuously evaluates resource compliance against rules. Use conformance packs for bundled rule sets.
# Check compliance status of a rule
aws configservice get-compliance-details-by-config-rule \
--config-rule-name s3-bucket-server-side-encryption-enabled \
--compliance-types NON_COMPLIANT
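For the conformance packs mentioned above, a minimal sketch of deploying one pack to a single account and then listing its non-compliant rules might look like this. The pack name and template file in the usage lines are hypothetical; the template is assumed to be a valid conformance pack YAML file on local disk.

```shell
# Deploy (or update) a conformance pack from a local template.
#   $1 = conformance pack name (placeholder)
#   $2 = path to the pack template (placeholder)
deploy_conformance_pack() {
  aws configservice put-conformance-pack \
    --conformance-pack-name "$1" \
    --template-body "file://$2"
}

# List the rules in a pack that are currently non-compliant.
pack_noncompliant_rules() {
  aws configservice describe-conformance-pack-compliance \
    --conformance-pack-name "$1" \
    --filters ComplianceType=NON_COMPLIANT
}

# Usage:
#   deploy_conformance_pack baseline-pack baseline-pack.yaml
#   pack_noncompliant_rules baseline-pack
```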
GuardDuty¶
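GuardDuty combines machine learning, anomaly detection, and threat intelligence to detect threats across CloudTrail events, VPC Flow Logs, and DNS query logs. It is a regional service, so enable it in every region the organization uses.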
# List high-severity finding IDs (severity 7 and above); pass the IDs to get-findings for details
aws guardduty list-findings \
--detector-id XXXXXXXXXXXXXXXXXXXX \
--finding-criteria '{"Criterion":{"severity":{"Gte":7}}}'
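Because GuardDuty has to be enabled region by region, a read-only audit loop can show where a detector is missing. The sketch below assumes the caller may run ec2 describe-regions and guardduty list-detectors; it only reports and does not enable anything.

```shell
# Report, for each region enabled in the account, whether a GuardDuty detector exists.
audit_guardduty_regions() {
  for region in $(aws ec2 describe-regions \
      --query 'Regions[].RegionName' --output text); do
    # With --output text, a missing detector is printed as "None".
    detector=$(aws guardduty list-detectors --region "$region" \
      --query 'DetectorIds[0]' --output text)
    if [ -z "$detector" ] || [ "$detector" = "None" ]; then
      echo "$region: NO DETECTOR"
    else
      echo "$region: $detector"
    fi
  done
}

# Usage: audit_guardduty_regions
```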
Security Hub aggregation
Designate the Audit account as the Security Hub delegated administrator. Security Hub then aggregates findings from GuardDuty, Inspector, Config, Macie, and Firewall Manager into a single view and scores compliance against standards (CIS, PCI DSS, AWS Foundational Security Best Practices).
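A rough sketch of the delegated administrator setup follows. The account ID in the usage line is a placeholder; the first function is assumed to run from the Organizations management account, the second from the Audit account in the region chosen as the aggregation region.

```shell
# From the management account: make the Audit account the
# Security Hub delegated administrator.
#   $1 = Audit account ID (placeholder)
delegate_securityhub_admin() {
  aws securityhub enable-organization-admin-account --admin-account-id "$1"
}

# From the Audit account, in the aggregation region:
# pull findings from all other regions into this one.
aggregate_all_regions() {
  aws securityhub create-finding-aggregator --region-linking-mode ALL_REGIONS
}

# Usage:
#   delegate_securityhub_admin 123456789012
#   aggregate_all_regions
```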
Troubleshooting¶
VPC Connectivity¶
| Symptom | Likely Cause | Resolution |
|---|---|---|
| EC2 cannot reach the Internet | Missing NAT Gateway route or no IGW | Check route tables: private subnets need 0.0.0.0/0 -> nat-gw; public subnets need 0.0.0.0/0 -> igw |
| Cannot reach resources in peered VPC | Missing route entries in both VPCs | Add routes for the peer CIDR pointing to the peering connection in both VPC route tables |
| Cross-AZ latency higher than expected | Traffic hairpinning through a single-AZ NAT Gateway | Deploy NAT Gateways in each AZ and update route tables per AZ |
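To check the routes described in the table, the following sketch prints the routes in the route table explicitly associated with a subnet. The subnet ID in the usage line is a placeholder. An empty result usually means the subnet has no explicit association and falls back to the VPC's main route table.

```shell
# Show the routes that apply to a subnet via its explicit route table association.
#   $1 = subnet ID (placeholder)
subnet_routes() {
  aws ec2 describe-route-tables \
    --filters "Name=association.subnet-id,Values=$1" \
    --query 'RouteTables[].Routes[].[DestinationCidrBlock,GatewayId,NatGatewayId,VpcPeeringConnectionId,State]' \
    --output table
}

# Usage: subnet_routes subnet-0xxxxxxxxxxxxxxxxx
```

A default route (0.0.0.0/0) with a NatGatewayId marks a private subnet; one whose GatewayId starts with igw- marks a public subnet.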
IAM Permission Debugging¶
# Simulate whether a principal can perform an action
aws iam simulate-principal-policy \
--policy-source-arn arn:aws:iam::123456789012:role/deploy-role \
--action-names s3:PutObject \
--resource-arns arn:aws:s3:::my-bucket/*
# Decode an encoded authorization failure message
aws sts decode-authorization-message --encoded-message <encoded-message>
CloudTrail for permission debugging
When an API call fails with AccessDenied, search CloudTrail for the
event. The errorCode and errorMessage fields record the failure, and
the message often names the policy type responsible (for example, an
explicit deny in an SCP or a permissions boundary).
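The search described above can be sketched in two steps: fetch recent events for one API action, then keep only the calls that carry an errorCode. This assumes jq is installed, which this guide does not otherwise require, and it only sees management events, because lookup-events does not return data events such as S3 PutObject. RunInstances in the usage line is just an example action.

```shell
# Fetch recent management events for one API action as JSON.
#   $1 = API action name, for example RunInstances
lookup_api_events() {
  aws cloudtrail lookup-events \
    --lookup-attributes "AttributeKey=EventName,AttributeValue=$1" \
    --max-results 50 \
    --output json
}

# Keep only failed calls. CloudTrailEvent is a JSON string,
# so it is decoded with fromjson before filtering.
failed_calls() {
  jq -r '.Events[].CloudTrailEvent | fromjson
         | select(.errorCode != null)
         | [.eventTime, .eventName, .errorCode, .errorMessage] | @tsv'
}

# Usage: lookup_api_events RunInstances | failed_calls
```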
Transit Gateway Routing¶
| Symptom | Likely Cause | Resolution |
|---|---|---|
| Spoke VPCs cannot reach each other | TGW route table missing propagation or static routes | Verify route table associations: aws ec2 get-transit-gateway-route-table-associations. Enable route propagation for each attachment. |
| Traffic not flowing through inspection VPC | Appliance mode not enabled on TGW attachment | Enable appliance mode on the inspection VPC attachment: aws ec2 modify-transit-gateway-vpc-attachment --transit-gateway-attachment-id <attachment-id> --options ApplianceModeSupport=enable |
| Return traffic takes a different path (asymmetric) | Subnet route tables not updated for return path | Ensure VPC ingress routing and TGW return routes both point through the firewall ENI |
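Before changing an attachment, it can help to see which VPC attachments already have appliance mode enabled. The sketch below lists them for one Transit Gateway; the gateway ID in the usage line is a placeholder.

```shell
# List VPC attachments of a Transit Gateway with their appliance mode setting.
#   $1 = Transit Gateway ID (placeholder)
tgw_appliance_mode() {
  aws ec2 describe-transit-gateway-vpc-attachments \
    --filters "Name=transit-gateway-id,Values=$1" \
    --query 'TransitGatewayVpcAttachments[].[TransitGatewayAttachmentId,VpcId,Options.ApplianceModeSupport]' \
    --output table
}

# Usage: tgw_appliance_mode tgw-0xxxxxxxxxxxxxxxxx
```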
General Diagnostic Commands¶
# Check VPC Flow Logs (requires flow log enabled on VPC/subnet/ENI)
# Note: date -d is GNU syntax; on macOS/BSD use: date -v-1H +%s000
aws logs filter-log-events \
--log-group-name /vpc/flow-logs \
--filter-pattern "REJECT" \
--start-time $(date -d "-1 hour" +%s000)
# Get the firewall policy ARN for a Network Firewall (rule groups are listed in the policy; see describe-firewall-policy)
aws network-firewall describe-firewall \
--firewall-name central-inspection \
--query 'Firewall.FirewallPolicyArn'
# List the rules of a security group (use describe-network-interfaces to find the groups attached to an ENI)
aws ec2 describe-security-group-rules \
--filters Name=group-id,Values=sg-0xxxxxxxxxxxxxxxxx \
--output table