Amazon CloudWatch


CloudWatch Metrics

  • Provides metrics for every service in AWS
  • Metric is a variable to monitor (CPUUtilization, NetworkIn, …)
  • Metric belongs to a namespace
  • Dimension is an attribute of a metric (instance id, environment, …)
  • Up to 30 dimensions per metric
  • Metrics have timestamps
  • Can create CloudWatch dashboards of metrics
  • Can create CloudWatch Custom Metrics (for the RAM for example)
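A custom metric such as RAM usage is published through the PutMetricData API. The sketch below only builds the request arguments in the shape boto3's `put_metric_data` accepts; the namespace `MyApp`, the metric name `MemoryUsedPercent`, and the dimension values are hypothetical, chosen to show how namespace, dimensions, and timestamp fit together.

```python
from datetime import datetime, timezone


def build_memory_metric(instance_id: str, env: str, used_percent: float) -> dict:
    """Build keyword arguments for CloudWatch PutMetricData (names are illustrative)."""
    return {
        "Namespace": "MyApp",  # hypothetical custom namespace
        "MetricData": [
            {
                "MetricName": "MemoryUsedPercent",  # hypothetical metric name
                "Dimensions": [
                    {"Name": "InstanceId", "Value": instance_id},
                    {"Name": "Environment", "Value": env},
                ],
                "Timestamp": datetime.now(timezone.utc),
                "Value": used_percent,
                "Unit": "Percent",
            }
        ],
    }


params = build_memory_metric("i-0123456789abcdef0", "dev", 42.5)
# With boto3 installed and credentials configured, this could be sent as:
#   boto3.client("cloudwatch").put_metric_data(**params)
```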

CloudWatch Metric Streams

  • Continually stream CloudWatch metrics to a destination with near-real-time delivery and low-latency
    • Amazon Kinesis Data Firehose (and then its destinations)
    • 3rd party service providers: DataDog, Dynatrace, New Relic, Splunk, …

CloudWatch Logs


  • Log groups: Arbitrary name, usually representing an app
  • Log stream: Instances within app/log files/containers
  • Can define log expiration policies (never expire, or 1 day to 10 years)
  • CloudWatch Logs can send logs to:
    • S3
    • Kinesis Data Streams
    • Kinesis Data Firehose
    • Lambda
    • OpenSearch
  • Logs are encrypted by default
  • Can set KMS-based encryption

CloudWatch Logs - Sources

  • SDK, CloudWatch Logs Agent, CloudWatch Unified Agent
  • Elastic Beanstalk: Collection of logs from the app
  • ECS: Collection from containers
  • Lambda: Collection from function logs
  • VPC Flow Logs: VPC-specific logs
  • API Gateway
  • CloudTrail based on filter
  • Route 53: Log DNS queries

CloudWatch Logs Insights

  • Search and analyze log data stored in CloudWatch Logs
  • Provides a purpose-built query language
    • Automatically discovers fields from AWS services and JSON log events
    • Fetch desired event fields, filter based on conditions, calculate aggregate statistics, sort events, limit number of events, …
    • Can save queries and add them to CloudWatch Dashboards
  • Can query multiple Log Groups in different AWS accounts
  • It’s a query engine, not a real-time engine
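Because Logs Insights is a query engine, a query is started and its results are fetched afterwards. A minimal sketch of that start-then-poll flow is shown below. It assumes `client` behaves like boto3's CloudWatch Logs client (`start_query`, `get_query_results`); the helper name `run_insights_query` is hypothetical.

```python
import time


def run_insights_query(client, log_group, query, start_epoch, end_epoch, poll_seconds=1.0):
    """Start a Logs Insights query and poll until it reaches a terminal status."""
    query_id = client.start_query(
        logGroupName=log_group,
        startTime=start_epoch,  # seconds since the Unix epoch
        endTime=end_epoch,
        queryString=query,
    )["queryId"]
    while True:
        response = client.get_query_results(queryId=query_id)
        if response["status"] not in ("Scheduled", "Running"):
            break
        time.sleep(poll_seconds)
    if response["status"] != "Complete":
        raise RuntimeError(f"query ended with status {response['status']}")
    # Each result row is a list of {"field": ..., "value": ...} pairs.
    return [{cell["field"]: cell["value"] for cell in row} for row in response["results"]]
```

Passing the client in as an argument keeps the helper easy to exercise with a stub; with boto3 it could be called as `run_insights_query(boto3.client("logs"), ...)`.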

CloudWatch Logs Subscriptions

  • Get real-time log events from CloudWatch Logs for processing and analysis
  • Send to Kinesis Data Streams, Kinesis Data Firehose, or Lambda
  • Subscription Filter - filter which log events are delivered to the destination
  • Cross-Account Subscription - send log events to resources in a different AWS account (KDS, KDF)
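When the subscription destination is Lambda, the log events arrive base64-encoded and gzip-compressed under `event["awslogs"]["data"]`. The sketch below decodes that payload; the function names are hypothetical and the handler simply returns the raw messages.

```python
import base64
import gzip
import json


def decode_subscription_event(event: dict) -> dict:
    """Decode the payload CloudWatch Logs delivers to a Lambda subscription target."""
    compressed = base64.b64decode(event["awslogs"]["data"])
    return json.loads(gzip.decompress(compressed))


def handler(event, context):
    payload = decode_subscription_event(event)
    # Each log event carries an id, a timestamp in milliseconds, and the raw message.
    return [log_event["message"] for log_event in payload["logEvents"]]
```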

CloudWatch Logs for EC2

  • By default, no logs from the EC2 machine will go to CloudWatch
  • Need to run a CloudWatch Agent on EC2 to push the log files
  • Make sure IAM permissions are correct
  • CloudWatch Agent can be set on-premises too

CloudWatch Logs Agent & Unified Agent

  • For virtual servers
  • CloudWatch Logs Agent
    • Old version of the agent
    • Can only send to CloudWatch Logs
  • CloudWatch Unified Agent
    • Collect additional system-level metrics such as RAM, processes, …
    • Collect logs to send to CloudWatch Logs
    • Centralized configuration using SSM Parameter Store

CloudWatch Unified Agent - Metrics

  • CPU
  • Disk metrics
  • RAM
  • Netstat
  • Processes
  • Swap Space
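The Unified Agent reads a JSON configuration that names the metrics and log files to collect. Below is a minimal sketch of such a configuration, built as a Python dict and serialized to JSON. The log file path and log group name are hypothetical, and only a few of the metric sections are shown.

```python
import json

# Illustrative agent configuration; the file path and log group name are hypothetical.
agent_config = {
    "metrics": {
        "metrics_collected": {
            "mem": {"measurement": ["mem_used_percent"]},
            "swap": {"measurement": ["swap_used_percent"]},
            "disk": {"measurement": ["used_percent"], "resources": ["/"]},
        }
    },
    "logs": {
        "logs_collected": {
            "files": {
                "collect_list": [
                    {
                        "file_path": "/var/log/myapp/app.log",
                        "log_group_name": "myapp",
                        "log_stream_name": "{instance_id}",
                    }
                ]
            }
        }
    },
}

# The resulting JSON text is what would be stored, for example, in SSM Parameter Store.
config_json = json.dumps(agent_config, indent=2)
```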

CloudWatch Alarms


  • Alarms are used to trigger notifications for any metric
  • Various options (sampling, %, max, min, …)
  • Alarm States:
    • OK
    • INSUFFICIENT_DATA
    • ALARM
  • Period:
    • Length of time in seconds to evaluate the metric
    • High-resolution custom metrics: 10 sec, 30 sec, or multiples of 60 sec

CloudWatch Alarm Targets

  • Stop, Terminate, Reboot, or Recover an EC2 instance
  • Trigger Auto Scaling Action
  • Send notification to SNS
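To make the period, statistic, and target settings concrete, here is a sketch that builds the arguments for boto3's `put_metric_alarm`: an alarm on EC2 `CPUUtilization` that notifies an SNS topic. The alarm name, threshold, and topic ARN are hypothetical values chosen for illustration.

```python
def build_cpu_alarm(instance_id: str, sns_topic_arn: str) -> dict:
    """Build keyword arguments for CloudWatch PutMetricAlarm (values are illustrative)."""
    return {
        "AlarmName": f"high-cpu-{instance_id}",  # hypothetical naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,  # evaluate the metric over 5-minute windows
        "EvaluationPeriods": 2,  # two consecutive breaching periods move the alarm to ALARM
        "Threshold": 80.0,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],  # target: SNS notification
    }


alarm = build_cpu_alarm(
    "i-0123456789abcdef0",
    "arn:aws:sns:us-east-1:123456789012:ops-alerts",  # hypothetical topic ARN
)
# With boto3: boto3.client("cloudwatch").put_metric_alarm(**alarm)
```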

CloudWatch Alarms - Composite Alarms

  • CloudWatch Alarms are on a single metric
  • Composite Alarms monitor the states of multiple other alarms
  • AND and OR conditions
  • Helpful to reduce “alarm noise” by creating complex composite alarms
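A composite alarm is defined by a rule expression over the states of other alarms. The sketch below composes such a rule string with AND or OR; the helper `composite_rule` and the alarm names are hypothetical, and the result is the kind of string passed as `AlarmRule` to boto3's `put_composite_alarm`.

```python
from typing import Sequence


def composite_rule(operator: str, alarm_names: Sequence[str]) -> str:
    """Join ALARM("...") terms with AND or OR into a composite alarm rule."""
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be 'AND' or 'OR'")
    return f" {operator} ".join(f'ALARM("{name}")' for name in alarm_names)


# Fire only when both underlying alarms are in ALARM, which cuts down on noise.
rule = composite_rule("AND", ["high-cpu", "high-iops"])
# With boto3:
#   boto3.client("cloudwatch").put_composite_alarm(AlarmName="cpu-and-iops", AlarmRule=rule)
```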

CloudWatch Insights & Operational Visibility


  • CloudWatch Container Insights
    • ECS, EKS, k8s on EC2, Fargate, needs agent for k8s
    • Metrics and logs
  • CloudWatch Lambda Insights
    • Detailed metrics to troubleshoot serverless apps
  • CloudWatch Contributor Insights
    • Find “Top-N” Contributors through CloudWatch Logs
  • CloudWatch Application Insights
    • Automatic dashboard to troubleshoot app and related AWS services

EKS Audit Log Forensics (Logs Insights)


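When EKS control plane logs (especially audit) are sent to CloudWatch Logs, every API request made in the cluster can be analyzed after the fact with Logs Insights. The core pattern is to aggregate the raw logs with parse + stats and extract the distribution of causes.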

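Assumptions and limitations:

  • The audit log is cluster-wide, so narrow down to a specific node or incident by filtering on its name.
  • The default audit level is Metadata, so the requestObject/responseObject bodies may be empty (you can tell what was done, but not the details).
  • With Container Insights enabled, the pod↔node mapping is recorded directly, which shortens this kind of join analysis.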

Extracting eviction 429s: traces of a drain blocked by a PDB. These are requests to the eviction subresource whose response code is not 201 (success):

fields @timestamp, objectRef.namespace, objectRef.name, responseStatus.code
| filter objectRef.resource="pods" and objectRef.subresource="eviction" and responseStatus.code!=201
| sort @timestamp asc

Aggregating FailedScheduling causes: per service, the share of topology-related failures out of all scheduling failures. parse counts an event when its message contains the phrase “topology spread”:

fields objectRef.name, requestObject.message
| filter @message like /FailedScheduling/
| filter objectRef.name like /svc1|svc2/
| parse objectRef.name /(?<svc>svc1|svc2)/
# pre is populated only when the pattern matches
| parse requestObject.message "*topology spread*" as pre, post
| stats count(*) as total_fail, count(pre) as topo_fail by svc
| sort total_fail desc

If total_fail == topo_fail, 100% of that service's scheduling failures are due to topology. If even a small service (e.g. 200m/1Gi) was blocked, that shows topology, not resources, is the main culprit.
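The comparison of total_fail and topo_fail can be sketched as a small post-processing step over the query result rows. The function name and the sample rows are hypothetical and are not data from the incident. The sketch assumes each row has already been flattened into a dict with the `svc`, `total_fail`, and `topo_fail` fields, with counts given as strings.

```python
def topology_share(rows):
    """Return {svc: fraction of scheduling failures attributed to topology spread}."""
    shares = {}
    for row in rows:
        total = int(row["total_fail"])
        topo = int(row["topo_fail"])
        shares[row["svc"]] = topo / total if total else 0.0
    return shares


# Hypothetical sample rows for illustration only.
sample_rows = [
    {"svc": "svc1", "total_fail": "120", "topo_fail": "120"},
    {"svc": "svc2", "total_fail": "40", "topo_fail": "10"},
]
shares = topology_share(sample_rows)
# Services whose failures are entirely topology-related.
fully_topology = [svc for svc, share in shares.items() if share == 1.0]
```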

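Node cordon (drain start) to terminate duration: take the cordon time from the audit log's node patch and the terminate time from the node delete, then measure the drain length per node. Filter on the upgrade actor's username so that the kubelet heartbeat's node patches are not mixed in. The two queries below are run separately: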

# 1) Identify the upgrade actor's username (the principal that issued the most evictions)
fields user.username
| filter objectRef.resource="pods" and objectRef.subresource="eviction"
| stats count(*) by user.username

# 2) Per-node drain_start (cordon) to terminate
fields @timestamp, objectRef.name, verb
| filter objectRef.resource="nodes"
| filter (verb="delete") or ((verb="patch" or verb="update") and user.username="<actor>")
| stats min(@timestamp) as drain_start, max(@timestamp) as terminate by objectRef.name
| sort drain_start asc
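The per-node drain length follows from subtracting drain_start from terminate. A minimal sketch of that step is below. It assumes the two values are available as epoch milliseconds, which may not match the format your result export uses (convert first if it does not). The function name, node names, and timestamps are hypothetical.

```python
from datetime import timedelta


def drain_durations(rows):
    """Return {node name: time between cordon (drain_start) and terminate}."""
    durations = {}
    for row in rows:
        start_ms = int(row["drain_start"])  # assumed: epoch milliseconds
        end_ms = int(row["terminate"])  # assumed: epoch milliseconds
        durations[row["objectRef.name"]] = timedelta(milliseconds=end_ms - start_ms)
    return durations


# Hypothetical sample rows for illustration only.
sample_nodes = [
    {"objectRef.name": "node-a", "drain_start": "1700000000000", "terminate": "1700001800000"},
    {"objectRef.name": "node-b", "drain_start": "1700000600000", "terminate": "1700007800000"},
]
durations = drain_durations(sample_nodes)
slowest_node = max(durations, key=durations.get)
```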

A case where these queries pinned the cause of a 4-hour delay on topology: 06. K8s Cluster Maintenance > EKS node group upgrade delay case. For the node-group-level launch/terminate timeline, use ASG describe-scaling-activities rather than the audit log (see that note).

References