AWS -- Operations

Deployment patterns, CLI recipes, monitoring, and troubleshooting for day-2 operations on Amazon Web Services.


Deployment Patterns

Control Tower & Account Factory

AWS Control Tower automates Landing Zone provisioning with guardrails, centralized logging, and Account Factory for vending new accounts.

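Account Factory for Terraform (AFT) extends Control Tower with IaC-driven account provisioning. The commands below inspect the controls enabled on an OU and the accounts in the organization:

# List the controls (guardrails) enabled on an OU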
aws controltower list-enabled-controls \
  --target-identifier arn:aws:organizations::123456789012:ou/o-xxx/ou-xxx

# List all accounts in the organization
aws organizations list-accounts --output table

Service Catalog

Service Catalog provides self-service product provisioning with governance. Platform teams define approved CloudFormation templates as products; application teams launch them without direct CloudFormation access.

# Search the products available to the caller (full-text match on "vpc")
aws servicecatalog search-products \
  --filters FullTextSearch=vpc

# Provision a product
aws servicecatalog provision-product \
  --product-id prod-XXXXXXXXXXXX \
  --provisioned-product-name my-vpc \
  --provisioning-artifact-id pa-XXXXXXXXXXXX \
  --provisioning-parameters Key=VpcCidr,Value=10.0.0.0/16

CloudFormation StackSets

StackSets deploy CloudFormation stacks across multiple accounts and regions from a single template.

# Create a StackSet with service-managed permissions (Organizations integration)
aws cloudformation create-stack-set \
  --stack-set-name baseline-config-rules \
  --template-body file://config-rules.yaml \
  --permission-model SERVICE_MANAGED \
  --auto-deployment Enabled=true,RetainStacksOnAccountRemoval=false

# Deploy to all accounts in an OU
aws cloudformation create-stack-instances \
  --stack-set-name baseline-config-rules \
  --deployment-targets OrganizationalUnitIds=ou-xxxx-xxxxxxxx \
  --regions us-east-1 us-west-2
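
To confirm that a rollout like the one above has finished, the status of each stack instance can be listed per account and Region. This is a minimal sketch: the function name `stackset_instance_status` is hypothetical, and it assumes the caller is in the management account (from a delegated administrator account, `--call-as DELEGATED_ADMIN` would need to be added).

```shell
#!/usr/bin/env bash
# Print account, Region, and status for every stack instance in a StackSet.
# $1: StackSet name (for example, baseline-config-rules).
stackset_instance_status() {
  local stack_set_name="$1"
  aws cloudformation list-stack-instances \
    --stack-set-name "$stack_set_name" \
    --query 'Summaries[*].[Account,Region,Status,StatusReason]' \
    --output table
}

# Usage: stackset_instance_status baseline-config-rules
```

Instances that report OUTDATED or INOPERABLE usually carry an explanation in the StatusReason column.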

CLI & SDK

VPC & Networking

# List all VPCs
aws ec2 describe-vpcs --output table --query 'Vpcs[*].[VpcId,CidrBlock,Tags[?Key==`Name`].Value|[0]]'

# List subnets for a VPC
aws ec2 describe-subnets --filters Name=vpc-id,Values=vpc-0xxxxxxxxxxxxxxxxx \
  --query 'Subnets[*].[SubnetId,CidrBlock,AvailabilityZone]' --output table

# List the active routes in a Transit Gateway route table
aws ec2 search-transit-gateway-routes \
  --transit-gateway-route-table-id tgw-rtb-0xxxxxxxxxxxxxxxxx \
  --filters Name=state,Values=active

EC2

# List running instances with key details
aws ec2 describe-instances \
  --filters Name=instance-state-name,Values=running \
  --query 'Reservations[*].Instances[*].[InstanceId,InstanceType,PrivateIpAddress,Tags[?Key==`Name`].Value|[0]]' \
  --output table

# Create an AMI from a running instance (--no-reboot skips the shutdown, so file system consistency is not guaranteed)
aws ec2 create-image \
  --instance-id i-0xxxxxxxxxxxxxxxxx \
  --name "pre-deploy-$(date +%Y-%m-%d)" \
  --no-reboot

EKS

# List clusters
aws eks list-clusters

# Update kubeconfig for a cluster
aws eks update-kubeconfig --name my-cluster --region us-east-1

# Describe a node group
aws eks describe-nodegroup --cluster-name my-cluster --nodegroup-name workers

RDS

# List all RDS instances
aws rds describe-db-instances \
  --query 'DBInstances[*].[DBInstanceIdentifier,Engine,DBInstanceStatus,MultiAZ]' \
  --output table

# Create a manual snapshot
aws rds create-db-snapshot \
  --db-instance-identifier prod-db \
  --db-snapshot-identifier prod-db-$(date +%Y%m%d)

# Initiate a failover for a Multi-AZ instance
aws rds reboot-db-instance \
  --db-instance-identifier prod-db \
  --force-failover

CloudFormation

# Deploy a stack
aws cloudformation deploy \
  --template-file infra.yaml \
  --stack-name prod-infra \
  --capabilities CAPABILITY_NAMED_IAM \
  --parameter-overrides Environment=prod VpcCidr=10.0.0.0/16

# List stack events (troubleshoot failed deployments)
aws cloudformation describe-stack-events \
  --stack-name prod-infra \
  --query 'StackEvents[?ResourceStatus==`CREATE_FAILED`].[LogicalResourceId,ResourceStatusReason]' \
  --output table

Monitoring & Alerting

CloudWatch

CloudWatch collects metrics, logs, and traces. Custom metrics can be published via put-metric-data.
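
As a hedged illustration of publishing a custom metric, the sketch below wraps put-metric-data in a small function. The function name `publish_custom_metric`, the `Custom/App` namespace, and the `QueueDepth` metric are hypothetical; the unit is fixed to Count for brevity.

```shell
#!/usr/bin/env bash
# Publish one data point for a custom metric.
# $1: namespace (must not start with "AWS/"), $2: metric name, $3: numeric value.
publish_custom_metric() {
  local namespace="$1" metric_name="$2" value="$3"
  aws cloudwatch put-metric-data \
    --namespace "$namespace" \
    --metric-name "$metric_name" \
    --value "$value" \
    --unit Count
}

# Usage: publish_custom_metric "Custom/App" QueueDepth 42
```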

# Query EC2 CPU utilization (last hour, 5-minute periods)
# GNU date syntax; on macOS/BSD use: date -u -v-1H +%Y-%m-%dT%H:%M:%SZ
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-0xxxxxxxxxxxxxxxxx \
  --start-time "$(date -u -d '-1 hour' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 300 \
  --statistics Average

# Create a CloudWatch alarm
aws cloudwatch put-metric-alarm \
  --alarm-name "high-cpu-prod" \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-0xxxxxxxxxxxxxxxxx \
  --statistic Average \
  --period 300 \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 3 \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:ops-alerts
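
With a 300-second period and 3 evaluation periods, the alarm above needs roughly 15 minutes of sustained breach before it changes state. The sketch below makes that arithmetic explicit and reads back the alarm state; both function names are hypothetical.

```shell
#!/usr/bin/env bash
# Approximate time a breach must last before the alarm fires:
# period (seconds) x evaluation periods.
alarm_window_seconds() {
  local period="$1" evaluation_periods="$2"
  echo $(( period * evaluation_periods ))
}

# Read the current state (OK, ALARM, or INSUFFICIENT_DATA) of an alarm.
# $1: alarm name (for example, high-cpu-prod).
alarm_state() {
  aws cloudwatch describe-alarms \
    --alarm-names "$1" \
    --query 'MetricAlarms[0].StateValue' \
    --output text
}

alarm_window_seconds 300 3   # prints 900 (15 minutes)
```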

CloudTrail

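CloudTrail records API activity across AWS services. Management events are logged by default; data events (such as S3 object-level operations) must be enabled explicitly on a trail. For multi-account setups, create an organization trail from the management account.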
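
One way to create such an organization trail is sketched below. It assumes trusted access for CloudTrail is already enabled in AWS Organizations and that the S3 bucket exists with a bucket policy that lets CloudTrail write to it. The function name `create_org_trail` and the example names are hypothetical.

```shell
#!/usr/bin/env bash
# Create a multi-Region organization trail and start logging.
# Run from the management account.
# $1: trail name, $2: name of an existing, correctly permissioned S3 bucket.
create_org_trail() {
  local trail_name="$1" bucket="$2"
  aws cloudtrail create-trail \
    --name "$trail_name" \
    --s3-bucket-name "$bucket" \
    --is-organization-trail \
    --is-multi-region-trail &&
  aws cloudtrail start-logging --name "$trail_name"
}

# Usage: create_org_trail org-trail my-cloudtrail-logs-bucket
```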

# Look up recent management events (lookup-events covers the last 90 days in the current Region)
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventName,AttributeValue=StopInstances \
  --max-results 10

AWS Config

Config continuously evaluates resource compliance against rules. Use conformance packs for bundled rule sets.
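
A conformance pack can be deployed and checked with two calls, sketched here under assumptions: the function names, the pack name, and the template file are all hypothetical, and the template is expected to be a valid conformance pack YAML file on local disk.

```shell
#!/usr/bin/env bash
# Deploy (or update) a conformance pack from a local template.
# $1: conformance pack name, $2: path to the template file.
deploy_conformance_pack() {
  local pack_name="$1" template_file="$2"
  aws configservice put-conformance-pack \
    --conformance-pack-name "$pack_name" \
    --template-body "file://$template_file"
}

# List the rules in a conformance pack that are currently non-compliant.
# $1: conformance pack name.
conformance_pack_violations() {
  aws configservice describe-conformance-pack-compliance \
    --conformance-pack-name "$1" \
    --filters ComplianceType=NON_COMPLIANT \
    --query 'ConformancePackRuleComplianceList[*].ConfigRuleName' \
    --output text
}

# Usage:
#   deploy_conformance_pack baseline-pack ./baseline-pack.yaml
#   conformance_pack_violations baseline-pack
```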

# Check compliance status of a rule
aws configservice get-compliance-details-by-config-rule \
  --config-rule-name s3-bucket-server-side-encryption-enabled \
  --compliance-types NON_COMPLIANT

GuardDuty

GuardDuty combines threat intelligence and machine learning to detect threats in CloudTrail events, VPC Flow Logs, and DNS query logs. Enable it in every Region the organization uses.

# List the IDs of high-severity findings (severity >= 7)
aws guardduty list-findings \
  --detector-id XXXXXXXXXXXXXXXXXXXX \
  --finding-criteria '{"Criterion":{"severity":{"Gte":7}}}'
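
Because list-findings returns only finding IDs, a second call to get-findings is needed to see the details. The sketch below chains the two; the function name `guardduty_high_severity` is hypothetical, and it caps the lookup at 50 findings, the most that get-findings accepts per call.

```shell
#!/usr/bin/env bash
# Show severity, type, and title for up to 50 findings with severity >= 7.
# $1: GuardDuty detector ID (see: aws guardduty list-detectors).
guardduty_high_severity() {
  local detector_id="$1" ids
  ids="$(aws guardduty list-findings \
    --detector-id "$detector_id" \
    --finding-criteria '{"Criterion":{"severity":{"Gte":7}}}' \
    --max-results 50 \
    --query 'FindingIds' --output text)"
  if [ -z "$ids" ] || [ "$ids" = "None" ]; then
    echo "no findings"
    return 0
  fi
  # Word splitting is intentional: each ID becomes its own argument.
  # shellcheck disable=SC2086
  aws guardduty get-findings \
    --detector-id "$detector_id" \
    --finding-ids $ids \
    --query 'Findings[*].[Severity,Type,Title]' \
    --output table
}

# Usage: guardduty_high_severity XXXXXXXXXXXXXXXXXXXX
```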

Security Hub aggregation

Designate the Audit account as the Security Hub delegated administrator and enable Security Hub there. It aggregates findings from GuardDuty, Inspector, Config, Macie, and Firewall Manager into a single view and scores compliance against standards such as CIS, PCI DSS, and AWS Foundational Security Best Practices.
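
The delegated administrator setup and a first query over the aggregated findings might look like the sketch below. The function names are hypothetical, and the account ID in the usage line is a placeholder.

```shell
#!/usr/bin/env bash
# Designate the Security Hub delegated administrator.
# Run from the Organizations management account.
# $1: 12-digit ID of the Audit account.
designate_securityhub_admin() {
  aws securityhub enable-organization-admin-account --admin-account-id "$1"
}

# List active CRITICAL findings.
# Run from the delegated administrator (Audit) account.
list_critical_findings() {
  aws securityhub get-findings \
    --filters '{"SeverityLabel":[{"Value":"CRITICAL","Comparison":"EQUALS"}],"RecordState":[{"Value":"ACTIVE","Comparison":"EQUALS"}]}' \
    --max-items 20 \
    --query 'Findings[*].[AwsAccountId,Severity.Label,Title]' \
    --output table
}

# Usage:
#   designate_securityhub_admin 123456789012
#   list_critical_findings
```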


Troubleshooting

VPC Connectivity

| Symptom | Likely Cause | Resolution |
|---|---|---|
| EC2 cannot reach the Internet | Missing NAT Gateway route or no IGW | Check route tables: private subnets need 0.0.0.0/0 -> nat-gw; public subnets need 0.0.0.0/0 -> igw |
| Cannot reach resources in peered VPC | Missing route entries in both VPCs | Add routes for the peer CIDR pointing to the peering connection in both VPC route tables |
| Cross-AZ latency higher than expected | Traffic hairpinning through a single-AZ NAT Gateway | Deploy NAT Gateways in each AZ and update route tables per AZ |
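
To check the default route described in the first row, the route table associated with a subnet can be queried directly. This is a sketch with a hypothetical function name; note that a subnet without an explicit association uses the VPC's main route table, in which case this filter returns nothing.

```shell
#!/usr/bin/env bash
# Print the target of the 0.0.0.0/0 route for a subnet's explicitly
# associated route table. Output columns: GatewayId, NatGatewayId, State.
# $1: subnet ID.
subnet_default_route() {
  local subnet_id="$1"
  aws ec2 describe-route-tables \
    --filters "Name=association.subnet-id,Values=$subnet_id" \
    --query 'RouteTables[].Routes[] | [?DestinationCidrBlock==`0.0.0.0/0`].[GatewayId,NatGatewayId,State]' \
    --output text
}

# Usage: subnet_default_route subnet-0xxxxxxxxxxxxxxxxx
```

A public subnet should show an igw- ID in the first column; a private subnet should show a nat- ID in the second.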

IAM Permission Debugging

# Simulate whether a principal can perform an action
aws iam simulate-principal-policy \
  --policy-source-arn arn:aws:iam::123456789012:role/deploy-role \
  --action-names s3:PutObject \
  --resource-arns arn:aws:s3:::my-bucket/*

# Decode an encoded authorization failure message
aws sts decode-authorization-message --encoded-message <encoded-message>

CloudTrail for permission debugging

When an API call fails with AccessDenied, search CloudTrail for the event. The errorCode and errorMessage fields record the failure, and for many services the message names the policy type that denied the action (for example, an SCP or a permissions boundary), though not every service includes this detail.
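
Since lookup-events cannot filter on errorCode, the filtering has to happen client-side. The sketch below does this with jq, which is assumed to be installed; the function name `failed_events` is hypothetical and RunInstances is only an example event name.

```shell
#!/usr/bin/env bash
# Read the JSON output of `aws cloudtrail lookup-events` on stdin and print
# one tab-separated line (time, event, errorCode, errorMessage) for each
# event that carries an errorCode.
failed_events() {
  jq -r '.Events[].CloudTrailEvent
         | fromjson
         | select(.errorCode != null)
         | [.eventTime, .eventName, .errorCode, .errorMessage]
         | @tsv'
}

# Usage:
#   aws cloudtrail lookup-events \
#     --lookup-attributes AttributeKey=EventName,AttributeValue=RunInstances \
#     --output json | failed_events
```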

Transit Gateway Routing

| Symptom | Likely Cause | Resolution |
|---|---|---|
| Spoke VPCs cannot reach each other | TGW route table missing propagation or static routes | Verify route table associations: aws ec2 get-transit-gateway-route-table-associations --transit-gateway-route-table-id tgw-rtb-0xxxxxxxxxxxxxxxxx. Enable route propagation for each attachment. |
| Traffic not flowing through inspection VPC | Appliance mode not enabled on TGW attachment | Enable appliance mode: aws ec2 modify-transit-gateway-vpc-attachment --transit-gateway-attachment-id tgw-attach-0xxxxxxxxxxxxxxxxx --options ApplianceModeSupport=enable |
| Return traffic takes a different path (asymmetric) | Subnet route tables not updated for return path | Ensure VPC ingress routing and TGW return routes both point through the firewall ENI |
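
The association and propagation checks from the first row can be wrapped as below. This is a sketch; the function names are hypothetical and the IDs in the usage lines are placeholders.

```shell
#!/usr/bin/env bash
# List the attachments associated with a TGW route table.
# $1: Transit Gateway route table ID.
tgw_associations() {
  aws ec2 get-transit-gateway-route-table-associations \
    --transit-gateway-route-table-id "$1" \
    --query 'Associations[*].[TransitGatewayAttachmentId,ResourceType,State]' \
    --output table
}

# List the attachments that propagate routes into a TGW route table.
# $1: Transit Gateway route table ID.
tgw_propagations() {
  aws ec2 get-transit-gateway-route-table-propagations \
    --transit-gateway-route-table-id "$1" \
    --query 'TransitGatewayRouteTablePropagations[*].[TransitGatewayAttachmentId,ResourceType,State]' \
    --output table
}

# Enable propagation from one attachment into a route table.
# $1: route table ID, $2: attachment ID.
tgw_enable_propagation() {
  aws ec2 enable-transit-gateway-route-table-propagation \
    --transit-gateway-route-table-id "$1" \
    --transit-gateway-attachment-id "$2"
}

# Usage:
#   tgw_associations tgw-rtb-0xxxxxxxxxxxxxxxxx
#   tgw_enable_propagation tgw-rtb-0xxxxxxxxxxxxxxxxx tgw-attach-0xxxxxxxxxxxxxxxxx
```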

General Diagnostic Commands

# Check VPC Flow Logs for rejected traffic (requires flow logs delivered to CloudWatch Logs; the log group name is an example; GNU date syntax)
aws logs filter-log-events \
  --log-group-name /vpc/flow-logs \
  --filter-pattern "REJECT" \
  --start-time $(date -d "-1 hour" +%s000)

# Get the ARN of the firewall policy attached to a Network Firewall (the policy lists the rule groups)
aws network-firewall describe-firewall \
  --firewall-name central-inspection \
  --query 'Firewall.FirewallPolicyArn'

# List the rules of a security group (find the groups attached to an ENI with describe-network-interfaces)
aws ec2 describe-security-group-rules \
  --filters Name=group-id,Values=sg-0xxxxxxxxxxxxxxxxx \
  --output table
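
Raw flow log output is easier to read when rejected flows are grouped. The sketch below assumes the default flow log record format (version 2), where the source address, destination address, destination port, and action are fields 4, 5, 7, and 13. Both function names are hypothetical.

```shell
#!/usr/bin/env bash
# Fetch REJECT records, one flow log message per line.
# $1: CloudWatch Logs log group, $2: start time in epoch milliseconds.
fetch_flow_log_messages() {
  local log_group="$1" start_ms="$2"
  aws logs filter-log-events \
    --log-group-name "$log_group" \
    --filter-pattern "REJECT" \
    --start-time "$start_ms" \
    --query 'events[].[message]' \
    --output text
}

# Read flow log records on stdin and count REJECTs per
# "source -> destination:port", most frequent first.
summarize_rejects() {
  awk '$13 == "REJECT" { count[$4 " -> " $5 ":" $7]++ }
       END { for (k in count) print count[k], k }' | sort -rn
}

# Usage (GNU date):
#   fetch_flow_log_messages /vpc/flow-logs "$(date -d '-1 hour' +%s)000" | summarize_rejects
```

A custom flow log format would change the field positions, so the awk field numbers would need adjusting to match.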