01. Amazon CloudWatch

CloudWatch Metrics

Provides metrics for every service in AWS
Metric is a variable to monitor (CPUUtilization, NetworkIn, …)
Metric belongs to a namespace
Dimension is an attribute of a metric (instance ID, environment, …)
Up to 30 dimensions per metric
Metrics have timestamps
Can create CloudWatch dashboards of metrics
Can create CloudWatch Custom Metrics (for the RAM for example)
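
A minimal sketch of what a custom metric (RAM, for example) looks like as a PutMetricData payload. The namespace `Custom/App`, the metric name, and the instance ID are made-up illustration values, and the helper `build_metric_datum` is hypothetical. Only the payload is built here; actually sending it assumes boto3 and AWS credentials.

```python
from datetime import datetime, timezone

MAX_DIMENSIONS = 30  # per-metric dimension limit noted above


def build_metric_datum(name, value, unit, dimensions, high_resolution=False):
    """Build one MetricData entry in the shape PutMetricData expects."""
    if len(dimensions) > MAX_DIMENSIONS:
        raise ValueError(f"at most {MAX_DIMENSIONS} dimensions per metric")
    return {
        "MetricName": name,
        "Dimensions": [{"Name": k, "Value": v} for k, v in dimensions.items()],
        "Timestamp": datetime.now(timezone.utc),  # metrics carry timestamps
        "Value": float(value),
        "Unit": unit,
        # 1 = high resolution (per second), 60 = standard resolution
        "StorageResolution": 1 if high_resolution else 60,
    }


# Hypothetical values for illustration only.
datum = build_metric_datum(
    "MemoryUtilization",
    71.5,
    "Percent",
    {"InstanceId": "i-0123456789abcdef0", "Environment": "dev"},
)
payload = {"Namespace": "Custom/App", "MetricData": [datum]}

# With boto3 installed and credentials configured, this would be sent as:
#   boto3.client("cloudwatch").put_metric_data(**payload)
```

The metric belongs to the namespace, and each dimension is a Name/Value pair attached to the datum.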

CloudWatch Metric Streams

Continually stream CloudWatch metrics to a destination, with near-real-time delivery and low latency
- Amazon Kinesis Data Firehose (and then its destinations)
- 3rd party service providers: DataDog, Dynatrace, New Relic, Splunk, …

CloudWatch Logs

Log groups: Arbitrary name, usually representing an app
Log stream: Instances within app/log files/containers
Can define log expiration (retention) policies: never expire, or from 1 day to 10 years
CloudWatch Logs can send logs to:
- S3
- Kinesis Data Streams
- Kinesis Data Firehose
- Lambda
- OpenSearch
Logs are encrypted by default
Can set KMS-based encryption
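
Retention is not an arbitrary number of days: the API accepts a fixed set of values between 1 day and 10 years. A small sketch that rounds a requested retention up to the next accepted value. The list below is written from memory of the PutRetentionPolicy documentation, so treat it as an assumption to verify; `pick_retention_days` is a hypothetical helper.

```python
import bisect

# Retention values in days accepted by PutRetentionPolicy (assumed, verify
# against the current API reference). 3653 days is roughly 10 years.
ALLOWED_RETENTION_DAYS = [
    1, 3, 5, 7, 14, 30, 60, 90, 120, 150, 180, 365, 400,
    545, 731, 1096, 1827, 2192, 2557, 2922, 3288, 3653,
]


def pick_retention_days(requested_days):
    """Return the smallest accepted retention >= requested_days.

    None means "never expire" and is passed through unchanged.
    """
    if requested_days is None:
        return None
    if requested_days < 1:
        raise ValueError("retention must be at least 1 day")
    index = bisect.bisect_left(ALLOWED_RETENTION_DAYS, requested_days)
    if index == len(ALLOWED_RETENTION_DAYS):
        raise ValueError("longer than 10 years; use None for never expire")
    return ALLOWED_RETENTION_DAYS[index]


days = pick_retention_days(10)  # 10 is not accepted, so round up

# With boto3, the value would be applied to a log group as:
#   boto3.client("logs").put_retention_policy(
#       logGroupName="/app/demo", retentionInDays=days)
```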

CloudWatch Logs - Sources

SDK, CloudWatch Logs Agent, CloudWatch Unified Agent
Elastic Beanstalk: Collection of logs from the app
ECS: Collection from containers
Lambda: Collection from function logs
VPC Flow Logs: VPC-specific logs
API Gateway
CloudTrail based on filter
R53: Log DNS queries

CloudWatch Logs Insights

Search and analyze log data stored in CloudWatch Logs
Provides a purpose-built query language
- Automatically discovers fields from AWS services and JSON log events
- Fetch desired event fields, filter based on conditions, calculate aggregate statistics, sort events, limit number of events, …
- Can save queries and add them to CloudWatch Dashboards
Can query multiple Log Groups in different AWS accounts
It’s a query engine, not a real-time engine
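
Because Logs Insights is a query engine, a query is started and then polled for results rather than streamed. A sketch of that start/poll loop, assuming a client that behaves like `boto3.client("logs")` (`start_query`, `get_query_results`). `FakeLogsClient` is a made-up offline stand-in so the example runs without AWS; the log group name is hypothetical.

```python
import time


def run_insights_query(logs_client, log_groups, query, start, end,
                       poll_seconds=1.0, timeout_seconds=60.0):
    """Start a Logs Insights query and poll until it completes.

    start and end are epoch seconds. Returns rows as {field: value} dicts.
    """
    query_id = logs_client.start_query(
        logGroupNames=log_groups,
        startTime=int(start),
        endTime=int(end),
        queryString=query,
    )["queryId"]
    deadline = time.monotonic() + timeout_seconds
    while True:
        response = logs_client.get_query_results(queryId=query_id)
        status = response["status"]
        if status == "Complete":
            # Each row is a list of {"field": ..., "value": ...} pairs.
            return [{c["field"]: c["value"] for c in row}
                    for row in response["results"]]
        if status in ("Failed", "Cancelled", "Timeout"):
            raise RuntimeError(f"query ended with status {status}")
        if time.monotonic() >= deadline:
            raise TimeoutError("query did not complete in time")
        time.sleep(poll_seconds)


class FakeLogsClient:
    """Offline stand-in: reports Running once, then Complete."""

    def __init__(self):
        self.polls = 0

    def start_query(self, **kwargs):
        return {"queryId": "q-1"}

    def get_query_results(self, queryId):
        self.polls += 1
        if self.polls < 2:
            return {"status": "Running", "results": []}
        return {"status": "Complete",
                "results": [[{"field": "@message", "value": "hello"}]]}


rows = run_insights_query(FakeLogsClient(), ["/app/demo"],
                          "fields @message | limit 1", 0, 60, poll_seconds=0)
```

Swapping `FakeLogsClient()` for a real boto3 Logs client should leave the loop unchanged.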

CloudWatch Logs Subscriptions

Get real-time log events from CloudWatch Logs for processing and analysis
Send to Kinesis Data Streams, Kinesis Data Firehose, or Lambda
Subscription Filter - filters which log events are delivered to the destination
Cross-Account Subscription - send log events to resources in a different AWS account (KDS, KDF)
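
When the subscription destination is Lambda, the log events arrive base64-encoded and gzip-compressed under `awslogs.data`. A sketch of a handler that decodes them; the sample payload is hand-built with made-up values so the round trip can run locally.

```python
import base64
import gzip
import json


def decode_subscription_event(event):
    """Decode the payload CloudWatch Logs delivers to a Lambda subscriber."""
    compressed = base64.b64decode(event["awslogs"]["data"])
    return json.loads(gzip.decompress(compressed))


def handler(event, context):
    """Return the raw message of every delivered log event."""
    payload = decode_subscription_event(event)
    return [log_event["message"] for log_event in payload["logEvents"]]


# Local round trip with a hand-built payload (all values are made up).
sample = {
    "messageType": "DATA_MESSAGE",
    "logGroup": "/app/demo",
    "logStream": "i-0123456789abcdef0",
    "logEvents": [
        {"id": "1", "timestamp": 1700000000000, "message": "ERROR boom"},
    ],
}
encoded = base64.b64encode(
    gzip.compress(json.dumps(sample).encode("utf-8"))
).decode("ascii")
messages = handler({"awslogs": {"data": encoded}}, None)
```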

CloudWatch Logs for EC2

By default, no logs from the EC2 machine will go to CloudWatch
Need to run a CloudWatch Agent on EC2 to push the log files
Make sure IAM permissions are correct
CloudWatch Agent can be set on-premises too

CloudWatch Logs Agent & Unified Agent

For virtual servers
CloudWatch Logs Agent
- Old version of the agent
- Can only send to CloudWatch Logs
CloudWatch Unified Agent
- Collect additional system-level metrics such as RAM, processes, …
- Collect logs to send to CloudWatch Logs
- Centralized configuration using SSM Parameter Store

CloudWatch Unified Agent - Metrics

CPU
Disk metrics
RAM
Netstat
Processes
Swap Space

CloudWatch Alarms

Alarms are used to trigger notifications for any metric
Various statistic options (sampling, %, max, min, …)
Alarm States:
- OK
- INSUFFICIENT_DATA
- ALARM
Period:
- Length of time in seconds to evaluate the metric
- High-resolution custom metrics: 10 sec, 30 sec, or multiples of 60 sec

CloudWatch Alarm Targets

Stop, Terminate, Reboot, or Recover an EC2 instance
Trigger Auto Scaling Action
Send notification to SNS

CloudWatch Alarms - Composite Alarms

CloudWatch Alarms are on a single metric
Composite Alarms monitor the states of multiple other alarms
AND and OR conditions
Helpful to reduce “alarm noise” by creating complex composite alarms
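
A sketch of how AND/OR conditions over child alarm states combine. The nested-tuple rule format and both helpers are hypothetical; `to_alarm_rule` renders the rule in the `ALARM("name") AND/OR ...` expression style used for composite alarm rules. Alarm names are made up.

```python
def evaluate(rule, states):
    """Return True when the rule evaluates to ALARM.

    rule is an alarm name (str) or a tuple ("AND" | "OR", rule, rule, ...).
    states maps alarm name -> "OK" | "ALARM" | "INSUFFICIENT_DATA".
    """
    if isinstance(rule, str):
        return states[rule] == "ALARM"
    op, *operands = rule
    results = [evaluate(operand, states) for operand in operands]
    if op == "AND":
        return all(results)
    if op == "OR":
        return any(results)
    raise ValueError(f"unknown operator: {op}")


def to_alarm_rule(rule):
    """Render the nested rule as a composite alarm rule expression."""
    if isinstance(rule, str):
        return f'ALARM("{rule}")'
    op, *operands = rule
    return "(" + f" {op} ".join(to_alarm_rule(o) for o in operands) + ")"


# Alarm only when CPU is high AND at least one I/O symptom is present.
rule = ("AND", "high-cpu", ("OR", "high-iops", "high-latency"))
states = {"high-cpu": "ALARM", "high-iops": "OK", "high-latency": "ALARM"}

in_alarm = evaluate(rule, states)
expression = to_alarm_rule(rule)
```

Requiring two symptoms together is what cuts the noise: a CPU spike alone leaves the composite in OK.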

CloudWatch Insights & Operational Visibility

CloudWatch Container Insights
- ECS, EKS, Kubernetes on EC2, Fargate (Kubernetes needs an agent)
- Metrics and logs
CloudWatch Lambda Insights
- Detailed metrics to troubleshoot serverless apps
CloudWatch Contributor Insights
- Find “Top-N” Contributors through CloudWatch Logs
CloudWatch Application Insights
- Automatic dashboard to troubleshoot app and related AWS services

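EKS Audit Log Forensics (Logs Insights)

When EKS control plane logs (especially audit) are sent to CloudWatch Logs, every API request made in the cluster can be analyzed after the fact with Logs Insights. The core pattern is to aggregate the raw logs with parse + stats and extract the distribution of causes.

Prerequisites and limitations:

The audit log is cluster-wide, so narrow it to a specific node or incident by filtering on names.
If the default audit level is Metadata, the requestObject/responseObject bodies may be empty (you can tell what was done, but not the details).
With Container Insights enabled, the pod↔node mapping is recorded directly, which shortens this kind of join analysis.

Extracting eviction 429s: traces of a drain blocked by a PDB. Eviction subresource requests whose response is not 201 (success):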

fields @timestamp, objectRef.namespace, objectRef.name, responseStatus.code
| filter objectRef.resource="pods" and objectRef.subresource="eviction" and responseStatus.code!=201
| sort @timestamp asc

Aggregating FailedScheduling causes: the share of each service's total failures that topology accounts for. parse counts an event when its message contains "topology spread":

fields objectRef.name, requestObject.message
| filter @message like /FailedScheduling/
| filter objectRef.name like /svc1|svc2/
| parse objectRef.name /(?<svc>svc1|svc2)/
# pre is populated only when the pattern matches
| parse requestObject.message "*topology spread*" as pre, post
| stats count(*) as total_fail, count(pre) as topo_fail by svc
| sort total_fail desc

If total_fail == topo_fail, 100% of that service's scheduling failures are due to topology. If even a small service (e.g., 200m/1Gi) was blocked, that shows topology, not resources, is the main culprit.
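
The same total_fail / topo_fail aggregation, sketched offline in Python for events already exported from the audit log. The `(object name, message)` tuple shape and the sample events are made up for illustration; `summarize_failures` is a hypothetical helper that mirrors the parse + stats steps.

```python
import re
from collections import defaultdict

SERVICE_RE = re.compile(r"(svc1|svc2)")  # same alternation as the query


def summarize_failures(events):
    """Count failures per service and how many mention topology spread.

    events: iterable of (object_name, message) pairs.
    Returns {service: (total_fail, topo_fail)}.
    """
    counts = defaultdict(lambda: [0, 0])
    for object_name, message in events:
        match = SERVICE_RE.search(object_name)
        if match is None:
            continue  # not one of the services under investigation
        entry = counts[match.group(1)]
        entry[0] += 1
        if "topology spread" in (message or ""):
            entry[1] += 1
    return {svc: tuple(entry) for svc, entry in counts.items()}


# Made-up sample events.
events = [
    ("svc1-7d9f-abc", "0/5 nodes are available: 5 node(s) didn't match pod topology spread constraints."),
    ("svc1-7d9f-def", "0/5 nodes are available: 5 node(s) didn't match pod topology spread constraints."),
    ("svc2-5c4b-ghi", "0/5 nodes are available: 5 Insufficient cpu."),
    ("svc2-5c4b-jkl", "0/5 nodes are available: 5 node(s) didn't match pod topology spread constraints."),
    ("other-pod", "unrelated"),
]
summary = summarize_failures(events)
shares = {svc: topo / total for svc, (total, topo) in summary.items()}
```

In this sample, svc1 fails only on topology while svc2 is split between topology and CPU.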

Node cordon (drain start) to terminate duration: take the cordon time from the node patch in the audit log and the terminate time from the delete, then measure the drain length per node. Filter on the username of the principal running the upgrade so the kubelet heartbeat's node patches are not mixed in:

# 1) Identify the upgrade principal's username (the one that issued the most evictions)
fields user.username
| filter objectRef.resource="pods" and objectRef.subresource="eviction"
| stats count(*) as evictions by user.username
| sort evictions desc

# 2) Per-node drain_start (cordon) to terminate
fields @timestamp, objectRef.name, verb
| filter objectRef.resource="nodes"
| filter (verb="delete") or ((verb="patch" or verb="update") and user.username="<principal>")
| stats min(@timestamp) as drain_start, max(@timestamp) as terminate by objectRef.name
| sort drain_start asc
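
The query returns drain_start and terminate per node; the drain length is the difference between them. A sketch of the same filter and min/max logic in Python for exported events. The event dict shape, the usernames (`upgrader`, `node-controller`), and the timestamps are all made up for illustration.

```python
from datetime import datetime


def drain_durations(events, principal):
    """Return {node: seconds from first cordon-like event to last event}.

    Keeps node deletes by anyone, and patch/update only from the upgrade
    principal, so kubelet heartbeat patches are excluded.
    """
    spans = {}
    for event in events:
        is_delete = event["verb"] == "delete"
        is_cordon = (event["verb"] in ("patch", "update")
                     and event["username"] == principal)
        if not (is_delete or is_cordon):
            continue
        ts = datetime.fromisoformat(event["timestamp"])
        start, end = spans.get(event["node"], (ts, ts))
        spans[event["node"]] = (min(start, ts), max(end, ts))
    return {node: (end - start).total_seconds()
            for node, (start, end) in spans.items()}


# Made-up sample events.
events = [
    {"timestamp": "2024-05-01T10:00:00+00:00", "node": "node-a",
     "verb": "patch", "username": "upgrader"},
    {"timestamp": "2024-05-01T10:05:00+00:00", "node": "node-a",
     "verb": "patch", "username": "system:node:node-a"},  # heartbeat, ignored
    {"timestamp": "2024-05-01T10:42:30+00:00", "node": "node-a",
     "verb": "delete", "username": "node-controller"},
    {"timestamp": "2024-05-01T10:10:00+00:00", "node": "node-b",
     "verb": "patch", "username": "upgrader"},
    {"timestamp": "2024-05-01T10:15:00+00:00", "node": "node-b",
     "verb": "delete", "username": "node-controller"},
]
durations = drain_durations(events, principal="upgrader")
```

A node whose duration is far longer than its peers is the one to cross-check against the eviction 429 query.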

A case where these queries pinned a 4-hour delay on topology: 06. K8s Cluster Maintenance > EKS node group upgrade delay case. For the launch/terminate timeline at node group level, use ASG describe-scaling-activities rather than the audit log (see that note).

meatsby.github.io
