Amazon CloudWatch
CloudWatch Metrics
- Provides metrics for every service in AWS
- Metric is a variable to monitor (CPUUtilization, NetworkIn, …)
- Metric belongs to a namespace
- Dimension is an attribute of a metric (instance ID, environment, …)
- Up to 30 dimensions per metric
- Metrics have timestamps
- Can create CloudWatch dashboards of metrics
- Can create CloudWatch Custom Metrics (for the RAM for example)
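A minimal sketch of how a custom metric (RAM, for example) with its namespace, dimensions, and timestamp could be shaped for the PutMetricData API. The metric name, namespace, instance ID, and the helper `build_metric_datum` are hypothetical; only the payload is built here, and nothing is sent to AWS.

```python
from datetime import datetime, timezone

MAX_DIMENSIONS = 30  # per-metric dimension limit stated in the notes above


def build_metric_datum(name, value, unit, dimensions, timestamp=None):
    """Build one entry for the MetricData list of a PutMetricData request."""
    if len(dimensions) > MAX_DIMENSIONS:
        raise ValueError(f"at most {MAX_DIMENSIONS} dimensions per metric")
    return {
        "MetricName": name,
        "Dimensions": [{"Name": k, "Value": v} for k, v in dimensions.items()],
        "Timestamp": timestamp or datetime.now(timezone.utc),
        "Value": float(value),
        "Unit": unit,
    }


# Hypothetical metric name and placeholder dimension values.
datum = build_metric_datum(
    "MemoryUtilization",
    63.5,
    "Percent",
    {"InstanceId": "i-0123456789abcdef0", "Environment": "dev"},
)

# With boto3 installed and credentials configured, the datum could be sent with:
#   boto3.client("cloudwatch").put_metric_data(
#       Namespace="Custom/App", MetricData=[datum])
```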
CloudWatch Metric Streams
- Continually stream CloudWatch metrics to a destination with near-real-time delivery and low latency:
  - Amazon Kinesis Data Firehose (and then its destinations)
  - 3rd party service providers: DataDog, Dynatrace, New Relic, Splunk, …
CloudWatch Logs
- Log groups: Arbitrary name, usually representing an app
- Log stream: Instances within app/log files/containers
- Can define log expiration policies (never expire, or 1 day to 10 years)
- CloudWatch Logs can send logs to:
  - S3
  - Kinesis Data Streams
  - Kinesis Data Firehose
  - Lambda
  - OpenSearch
- Logs are encrypted by default
- Can set KMS-based encryption
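The expiration policy accepts only specific retention values, not any number of days. The sketch below rounds a requested retention up to the nearest accepted value. The list reflects the values PutRetentionPolicy accepted at the time of writing and should be treated as an assumption to check against the current API reference; `round_up_retention` is a hypothetical helper.

```python
import bisect

# Assumed list of retention values (in days) accepted by PutRetentionPolicy.
ALLOWED_RETENTION_DAYS = [
    1, 3, 5, 7, 14, 30, 60, 90, 120, 150, 180, 365, 400, 545,
    731, 1096, 1827, 2192, 2557, 2922, 3288, 3653,
]


def round_up_retention(days):
    """Return the smallest accepted retention that is >= the requested days."""
    if days < 1:
        raise ValueError("retention must be at least 1 day")
    index = bisect.bisect_left(ALLOWED_RETENTION_DAYS, days)
    if index == len(ALLOWED_RETENTION_DAYS):
        # Beyond 10 years: leave the log group without a policy (never expire).
        raise ValueError("longer than the maximum retention")
    return ALLOWED_RETENTION_DAYS[index]


# With boto3, the result could be applied to a (hypothetical) log group with:
#   boto3.client("logs").put_retention_policy(
#       logGroupName="my-app", retentionInDays=round_up_retention(45))
```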
CloudWatch Logs - Sources
- SDK, CloudWatch Logs Agent, CloudWatch Unified Agent
- Elastic Beanstalk: Collection of logs from the app
- ECS: Collection from containers
- Lambda: Collection from function logs
- VPC Flow Logs: VPC-specific logs
- API Gateway
- CloudTrail based on filter
- Route 53: Log DNS queries
CloudWatch Logs Insights
- Search and analyze log data stored in CloudWatch Logs
- Provides a purpose-built query language
- Automatically discovers fields from AWS services and JSON log events
- Fetch desired event fields, filter based on conditions, calculate aggregate statistics, sort events, limit number of events, …
- Can save queries and add them to CloudWatch Dashboards
- Can query multiple Log Groups in different AWS accounts
- It’s a query engine, not a real-time engine
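Because Logs Insights is a query engine rather than a real-time engine, a query is started and its results are polled for afterwards. A minimal sketch of that loop, assuming `logs_client` is a boto3 CloudWatch Logs client (or any object exposing the same `start_query` and `get_query_results` calls); the helper name and polling defaults are illustrative.

```python
import time


def run_insights_query(logs_client, log_group, query, start_ts, end_ts,
                       poll_seconds=1.0, max_polls=60):
    """Start a Logs Insights query and poll until it finishes.

    Returns a list of rows, each row a dict of field name -> value.
    """
    query_id = logs_client.start_query(
        logGroupName=log_group,
        startTime=start_ts,  # epoch seconds
        endTime=end_ts,      # epoch seconds
        queryString=query,
    )["queryId"]

    for _ in range(max_polls):
        response = logs_client.get_query_results(queryId=query_id)
        status = response["status"]
        if status == "Complete":
            # Each result row is a list of {"field": ..., "value": ...} cells.
            return [{cell["field"]: cell["value"] for cell in row}
                    for row in response["results"]]
        if status in ("Failed", "Cancelled", "Timeout"):
            raise RuntimeError(f"query ended with status {status}")
        time.sleep(poll_seconds)
    raise TimeoutError("query did not complete within the polling budget")
```

Passing the client in as a parameter keeps the function easy to exercise with a stub before pointing it at a real account.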
CloudWatch Logs Subscriptions
- Get real-time log events from CloudWatch Logs for processing and analysis
- Send to Kinesis Data Streams, Kinesis Data Firehose, or Lambda
- Subscription Filter - filter which log events are delivered to the destination
- Cross-Account Subscription - send log events to resources in a different AWS account (Kinesis Data Streams, Kinesis Data Firehose)
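When the subscription destination is Lambda, the log events arrive base64-encoded and gzip-compressed under `event["awslogs"]["data"]`. A minimal sketch of a handler that unpacks them; what it returns is only illustrative.

```python
import base64
import gzip
import json


def decode_subscription_event(event):
    """Decode the payload CloudWatch Logs delivers to a Lambda subscription."""
    compressed = base64.b64decode(event["awslogs"]["data"])
    return json.loads(gzip.decompress(compressed))


def handler(event, context):
    payload = decode_subscription_event(event)
    # Each entry in logEvents carries an id, a timestamp, and the raw message.
    messages = [log_event["message"] for log_event in payload["logEvents"]]
    return {"logGroup": payload["logGroup"], "count": len(messages)}
```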
CloudWatch Logs for EC2
- By default, no logs from the EC2 machine will go to CloudWatch
- Need to run a CloudWatch Agent on EC2 to push the log files
- Make sure IAM permissions are correct
- CloudWatch Agent can be set on-premises too
CloudWatch Logs Agent & Unified Agent
- For virtual servers (EC2 instances, on-premises servers, …)
- CloudWatch Logs Agent
  - Old version of the agent
  - Can only send to CloudWatch Logs
- CloudWatch Unified Agent
  - Collect additional system-level metrics such as RAM, processes, …
  - Collect logs to send to CloudWatch Logs
  - Centralized configuration using SSM Parameter Store
CloudWatch Unified Agent - Metrics
- CPU
- Disk metrics
- RAM
- Netstat
- Processes
- Swap Space
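A sketch of what a Unified Agent configuration covering some of these metrics plus one log file could look like, built as a Python dict and serialized to the JSON the agent reads. The file path and log group name are hypothetical, and the measurement names are a small sample rather than a complete list.

```python
import json

agent_config = {
    "metrics": {
        "metrics_collected": {
            "mem": {"measurement": ["mem_used_percent"]},
            "swap": {"measurement": ["swap_used_percent"]},
            "disk": {"measurement": ["used_percent"], "resources": ["/"]},
        }
    },
    "logs": {
        "logs_collected": {
            "files": {
                "collect_list": [
                    {
                        "file_path": "/var/log/app/app.log",  # hypothetical path
                        "log_group_name": "my-app",           # hypothetical name
                        "log_stream_name": "{instance_id}",   # agent placeholder
                    }
                ]
            }
        }
    },
}

# The resulting JSON could be saved to a file, or stored in SSM Parameter
# Store so that many servers share one configuration.
config_json = json.dumps(agent_config, indent=2)
print(config_json)
```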
CloudWatch Alarms
- Alarms are used to trigger notifications for any metric
- Various options (sampling, %, max, min, etc, …)
- Alarm States:
  - OK
  - INSUFFICIENT_DATA
  - ALARM
- Period:
  - Length of time in seconds to evaluate the metric
  - High-resolution custom metrics: 10 sec, 30 sec, or multiples of 60 sec
CloudWatch Alarm Targets
- Stop, Terminate, Reboot, or Recover an EC2 instance
- Trigger Auto Scaling Action
- Send notification to SNS
CloudWatch Alarms - Composite Alarms
- CloudWatch Alarms are on a single metric
- Composite Alarms monitor the states of multiple other alarms
- AND and OR conditions
- Helpful to reduce “alarm noise” by creating complex composite alarms
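A composite alarm is defined by a rule expression over other alarms' states. The helpers below (hypothetical names) sketch how such an AND/OR expression could be assembled as a string; the alarm names are placeholders.

```python
def alarm(name):
    """Expression that is true while the named alarm is in the ALARM state."""
    return f'ALARM("{name}")'


def all_of(*expressions):
    return "(" + " AND ".join(expressions) + ")"


def any_of(*expressions):
    return "(" + " OR ".join(expressions) + ")"


# Notify only when CPU is high AND at least one of the other two alarms fires.
rule = all_of(
    alarm("cpu-high"),
    any_of(alarm("iops-high"), alarm("latency-high")),
)
print(rule)
# (ALARM("cpu-high") AND (ALARM("iops-high") OR ALARM("latency-high")))

# With boto3, the rule could be used as:
#   boto3.client("cloudwatch").put_composite_alarm(
#       AlarmName="app-degraded", AlarmRule=rule)
```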
CloudWatch Insights & Operational Visibility
- CloudWatch Container Insights
  - ECS, EKS, k8s on EC2, Fargate (needs an agent for k8s)
  - Metrics and logs
- CloudWatch Lambda Insights
  - Detailed metrics to troubleshoot serverless apps
- CloudWatch Contributor Insights
  - Find “Top-N” contributors through CloudWatch Logs
- CloudWatch Application Insights
  - Automatic dashboard to troubleshoot an app and its related AWS services
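EKS Audit Log Forensics (Logs Insights)
If EKS control plane logs (the audit log in particular) are sent to CloudWatch Logs, every API request made in the cluster can be analyzed after the fact with Logs Insights. The core pattern is to aggregate the raw logs with parse + stats and extract the distribution of causes.
Prerequisites and limits:
- The audit log is cluster-wide, so narrow it to a specific node or incident by filtering on name.
- If the audit level is Metadata (the default), the requestObject/responseObject bodies may be empty: you can tell what was done, but not the details.
- With Container Insights enabled, the pod↔node mapping is recorded directly, which shortens this kind of join analysis.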
Eviction 429 extraction: traces of a drain being blocked by a PDB. Eviction subresource requests whose response code is not 201 (success):
fields @timestamp, objectRef.namespace, objectRef.name, responseStatus.code
| filter objectRef.resource="pods" and objectRef.subresource="eviction" and responseStatus.code!=201
| sort @timestamp asc
FailedScheduling cause breakdown: the share of topology-related failures in each service's total failure count. A row is counted when parse finds the phrase "topology spread" in the message:
fields objectRef.name, requestObject.message
| filter @message like /FailedScheduling/
| filter objectRef.name like /svc1|svc2/
| parse objectRef.name /(?<svc>svc1|svc2)/
# pre is populated only when the message matches
| parse requestObject.message "*topology spread*" as pre, post
| stats count(*) as total_fail, count(pre) as topo_fail by svc
| sort total_fail desc
If total_fail == topo_fail, 100% of that service's scheduling failures are due to topology. If even a small service (e.g. 200m/1Gi) was blocked, that shows topology, not resources, is the main culprit.
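The comparison of total_fail and topo_fail can be done mechanically once the query results are exported. A minimal sketch, assuming each result row has been turned into a dict with the svc, total_fail, and topo_fail columns from the query above (values arrive as strings from the API); the sample numbers are made up for illustration.

```python
def topology_share(rows):
    """Return, per service, the fraction of scheduling failures due to topology."""
    shares = {}
    for row in rows:
        total = int(row["total_fail"])
        topo = int(row["topo_fail"])
        shares[row["svc"]] = topo / total if total else 0.0
    return shares


# Hypothetical result rows, not real measurements.
sample_rows = [
    {"svc": "svc1", "total_fail": "40", "topo_fail": "40"},
    {"svc": "svc2", "total_fail": "10", "topo_fail": "4"},
]

for svc, share in topology_share(sample_rows).items():
    print(f"{svc}: {share:.0%} of scheduling failures mention topology spread")
```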
Time from node cordon (drain start) to terminate: take the cordon time from the node patch in the audit log and the terminate time from the node delete, then measure the drain length per node. Filter on the username of the upgrade principal so that kubelet heartbeat node patches are not mixed in. The two queries below are run separately:
# 1) Identify the username of the upgrade principal (the one that issued the most evictions)
fields user.username
| filter objectRef.resource="pods" and objectRef.subresource="eviction"
| stats count(*) by user.username
# 2) Per-node drain_start (cordon) to terminate
fields @timestamp, objectRef.name, verb
| filter objectRef.resource="nodes"
| filter (verb="delete") or ((verb="patch" or verb="update") and user.username="<principal>")
| stats min(@timestamp) as drain_start, max(@timestamp) as terminate by objectRef.name
| sort drain_start asc
A case where these queries pinned the cause of a 4-hour delay on topology: 06. K8s Cluster Maintenance > EKS node group upgrade delay case. The node-group-level launch/terminate timeline comes from ASG describe-scaling-activities rather than the audit log (see that note).
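Once the per-node drain_start and terminate values are exported, the drain length per node is a simple subtraction. A minimal sketch, assuming the rows have already been converted so that both columns are epoch milliseconds (the format the query returns may differ and may need parsing first); the node names and timestamps are made up.

```python
from datetime import timedelta


def drain_durations(rows):
    """Return node name -> drain duration, longest drain first."""
    durations = {}
    for row in rows:
        delta_ms = int(row["terminate"]) - int(row["drain_start"])
        durations[row["objectRef.name"]] = timedelta(milliseconds=delta_ms)
    return dict(sorted(durations.items(), key=lambda kv: kv[1], reverse=True))


# Hypothetical rows: node-a drained in 5 minutes, node-b took 2 hours.
sample_nodes = [
    {"objectRef.name": "node-a", "drain_start": 1_700_000_000_000,
     "terminate": 1_700_000_300_000},
    {"objectRef.name": "node-b", "drain_start": 1_700_000_000_000,
     "terminate": 1_700_007_200_000},
]

for node, duration in drain_durations(sample_nodes).items():
    print(node, duration)
```

Sorting by duration puts the slowest nodes first, which is where a blocked eviction is most likely to show up.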