Modern frameworks for modern alerting
Daniel Maher @phrawzty
Ara Pulido @arapulido
Slide 2
Datadog is a monitoring and analytics platform that helps companies improve the observability of their infrastructure and applications.
Slide 3
Daniel Maher, Developer Relations dan.maher@datadoghq.com @phrawzty
Ara Pulido, Technical Evangelist ara.pulido@datadoghq.com @arapulido
Slide 4
Endpoint foobar is unavailable
Slide 5
Logs
Slide 6
Logs
Infrastructure metrics
Slide 7
Logs
Work metrics
Infrastructure metrics
Slide 8
Logs
Work metrics
RUM data
Infrastructure metrics
Slide 9
RUM data
Logs
Resource metrics
Work metrics
Infrastructure metrics
Slide 10
RUM data
Logs
Resource metrics
Work metrics
Infrastructure metrics
Traces
Slide 11
RUM data
Logs
Resource metrics
Work metrics
Infrastructure metrics
Events in our app
Traces
Slide 12
RUM data
Logs
Resource metrics
Infrastructure metrics
Events in our app
Traces
Work metrics
AI/ML aggregated data
Slide 13
Browser/UI tests
RUM data
Logs
Infrastructure metrics
Resource metrics
Events in our app
Traces
Work metrics
AI/ML aggregated data
Slide 14
Browser/UI tests
RUM data
Logs
Infrastructure metrics
Resource metrics
Business metrics
Events in our app
Traces
Work metrics
AI/ML aggregated data
Slide 15
Browser/UI tests
RUM data
Logs
Infrastructure metrics
Resource metrics
Business metrics
Events in our app
Events outside our app
Traces
Work metrics
AI/ML aggregated data
Slide 16
Browser/UI tests
RUM data
Logs
Infrastructure metrics
Networking data
Resource metrics
Business metrics
Events in our app
Events outside our app
Traces
Work metrics
AI/ML aggregated data
Slide 17
Security
RUM data
Browser/UI tests
Logs
Infrastructure metrics
Networking data
Resource metrics
Business metrics
Events in our app
Events outside our app
Traces
Work metrics
AI/ML aggregated data
Slide 18
Security
RUM data
Browser/UI tests
Logs
Infrastructure metrics
Networking data
Resource metrics
Business metrics
Events in our app
Events outside our app
Traces
Work metrics
AI/ML aggregated data
Slide 19
Security
RUM data
Browser/UI tests
Logs
Infrastructure metrics
Networking data
Resource metrics
Business metrics
Events in our app
Events outside our app
Traces
Work metrics
AI/ML aggregated data
FOUR
Slide 20
Security
RUM data
Browser/UI tests
Logs
Infrastructure metrics
Networking data
Resource metrics
Business metrics
Events in our app
Events outside our app
Traces
Work metrics
AI/ML aggregated data
FOUR GOLDEN
Slide 21
Security
RUM data
Browser/UI tests
Logs
Infrastructure metrics
Networking data
Resource metrics
Business metrics
Events in our app
Events outside our app
Traces
Work metrics
AI/ML aggregated data
FOUR GOLDEN SIGNALS
Slide 22
Slide 23
Alert
Fatigue
Slide 24
Information Overload
Slide 25
Service Level Indicators
Service Level Objectives
Service Level Agreements
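One way to picture how the three terms relate is the rough sketch below. It assumes an SLI is a measured ratio of good events to total events, an SLO is an internal target for that ratio, and an SLA is the looser, contractual threshold. The class name `ServiceLevel` and the example numbers (99.9% and 99.5%) are illustrative assumptions, not figures from the talk.

```python
from dataclasses import dataclass


@dataclass
class ServiceLevel:
    """Hypothetical container tying a measured SLI to SLO and SLA targets."""
    good_events: int
    total_events: int
    slo_target: float  # internal objective (assumed value, e.g. 0.999)
    sla_target: float  # contractual agreement (assumed value, e.g. 0.995)

    @property
    def sli(self) -> float:
        # The indicator is the measured fraction of good events.
        if self.total_events == 0:
            return 1.0  # no traffic: treat as fully healthy
        return self.good_events / self.total_events

    def meets_slo(self) -> bool:
        return self.sli >= self.slo_target

    def meets_sla(self) -> bool:
        return self.sli >= self.sla_target


level = ServiceLevel(good_events=99_800, total_events=100_000,
                     slo_target=0.999, sla_target=0.995)
print(level.sli, level.meets_slo(), level.meets_sla())
```

In this example the service misses its internal objective while still honouring the agreement, which is the gap an SLO is usually meant to provide.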
Slide 26
Alerts Pages
Slide 27
Slide 28
Slide 29
Symptoms vs. Causes
Slide 30
Symptom: 18% of queries are returning an error
Slide 31
Cause: 3 nodes of the cluster are down
Slide 32
It is so tempting to page on causes
Slide 33
DON’T
Slide 34
SLAs are written based on symptoms
Slide 35
Symptoms
Causes
Slide 36
Urgency
Slide 37
Slide 38
High: p99 latency > 100 ms
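As a minimal sketch of what a check like this might look like, the snippet below computes a 99th-percentile latency with the nearest-rank method and compares it to the 100 ms threshold from the slide. The function names and the sample data are hypothetical; a real monitoring platform would compute percentiles for you.

```python
def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    if not samples:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


def p99_breached(latencies_ms, threshold_ms=100):
    # True when the slowest 1% of requests push p99 over the threshold.
    return percentile(latencies_ms, 99) > threshold_ms


# Hypothetical samples: one slow outlier versus a slow tail of two requests.
one_outlier = [20] * 99 + [300]
slow_tail = [20] * 98 + [150, 300]
print(p99_breached(one_outlier), p99_breached(slow_tail))
```

With 100 samples, a single slow request does not move p99, while two slow requests do. That is part of why a percentile is a steadier paging signal than a maximum.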
Slide 39
Only page on these
Related to your SLOs
Slide 40
Medium: 1 node of the cluster is down
Slide 41
Notify on these
Slide 42
Low: Database I/O is slower than usual
Slide 43
Log these
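The three urgency levels above map naturally onto three actions. The sketch below shows one possible way to encode that mapping; the `Urgency` and `Action` enums and the `route` helper are hypothetical names chosen for illustration.

```python
from enum import Enum


class Urgency(Enum):
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


class Action(Enum):
    PAGE = "page"      # wake someone up
    NOTIFY = "notify"  # send a message to be read during working hours
    LOG = "log"        # record it for later investigation


# Mapping taken from the slides: page on high, notify on medium, log low.
ROUTING = {
    Urgency.HIGH: Action.PAGE,
    Urgency.MEDIUM: Action.NOTIFY,
    Urgency.LOW: Action.LOG,
}


def route(urgency: Urgency) -> Action:
    return ROUTING[urgency]


# The example conditions from the slides, tagged with their urgency.
examples = {
    "p99 latency > 100 ms": Urgency.HIGH,
    "1 node of the cluster is down": Urgency.MEDIUM,
    "Database I/O is slower than usual": Urgency.LOW,
}
for condition, urgency in examples.items():
    print(f"{condition}: {route(urgency).value}")
```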
Slide 44
Symptoms
Causes
Slide 45
Error Budget
Slide 46
Urgency; but make it (error) budget
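To make the idea concrete, here is a rough sketch that derives an urgency from how much error budget is left. It assumes the budget is the share of requests the SLO allows to fail, `1 - SLO`, over some window. The 25% cut-off and both function names are illustrative assumptions, not values from the talk.

```python
def error_budget_remaining(slo_target: float, total: int, failed: int) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    allowed_failures = (1.0 - slo_target) * total
    if allowed_failures <= 0:
        # A 100% target or an empty window leaves no budget to spend.
        return 1.0 if failed == 0 else float("-inf")
    return 1.0 - failed / allowed_failures


def urgency_from_budget(remaining: float) -> str:
    # Thresholds are assumptions chosen for illustration only.
    if remaining <= 0.0:
        return "page"    # budget exhausted: the SLO is being violated
    if remaining < 0.25:
        return "notify"  # budget running low: look at it soon
    return "log"         # plenty of budget left: record and move on


# Hypothetical window: 99% SLO over 10,000 requests allows about 100 failures.
for failed in (25, 90, 120):
    remaining = error_budget_remaining(0.99, 10_000, failed)
    print(failed, round(remaining, 2), urgency_from_budget(remaining))
```

Tying the action to the remaining budget means the same cause can be logged on a quiet week and paged on a bad one, depending on how close the service is to its objective.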
Slide 47
SLOs & SLAs
Alert or Page?
Slide 48
Slide 49
Availability vs. Accessibility
Slide 50
Organise data by team
Slide 51
Team A
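One way to organise data by team is to tag every alert with its owner and group on that tag. The sketch below assumes alerts are plain dictionaries carrying `key:value` tags such as `team:a`; the field names, the tag format, and the `unowned` fallback are assumptions made for illustration.

```python
from collections import defaultdict


def group_by_team(alerts):
    """Group alert names under the value of their 'team:' tag."""
    grouped = defaultdict(list)
    for alert in alerts:
        team = next(
            (tag.split(":", 1)[1] for tag in alert["tags"]
             if tag.startswith("team:")),
            "unowned",  # surface alerts that nobody has claimed
        )
        grouped[team].append(alert["name"])
    return dict(grouped)


# Hypothetical alerts; only the tags decide which team sees them.
alerts = [
    {"name": "checkout error rate", "tags": ["team:a", "env:prod"]},
    {"name": "search p99 latency", "tags": ["team:b", "env:prod"]},
    {"name": "checkout queue depth", "tags": ["team:a"]},
    {"name": "legacy cron failures", "tags": ["env:prod"]},
]
print(group_by_team(alerts))
```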
Slide 52
Information Underload (?)
Slide 53
Thank you! https://www.datadoghq.com
Daniel Maher @phrawzty Ara Pulido @arapulido