Modern Frameworks for Modern Alerting

A presentation at PagerDuty EMEA Summit 2020 in June 2020 by Daniel "phrawzty" Maher

Slide 1

Modern frameworks for modern alerting Daniel Maher @phrawzty Ara Pulido @arapulido

Slide 2

Datadog is a monitoring and analytics platform that helps companies improve observability of their infrastructure and applications

Slide 3

Daniel Maher, Developer Relations, dan.maher@datadoghq.com, @phrawzty
Ara Pulido, Technical Evangelist, ara.pulido@datadoghq.com, @arapulido

Slide 4

Endpoint foobar is unavailable

Slide 5

Logs

Slide 6

Logs, Infrastructure metrics

Slide 7

Logs, Work metrics, Infrastructure metrics

Slide 8

Logs, Work metrics, RUM data, Infrastructure metrics

Slide 9

RUM data, Logs, Resource metrics, Work metrics, Infrastructure metrics

Slide 10

RUM data, Logs, Resource metrics, Work metrics, Infrastructure metrics, Traces

Slide 11

RUM data, Logs, Resource metrics, Work metrics, Infrastructure metrics, Events in our app, Traces

Slide 12

RUM data, Logs, Resource metrics, Infrastructure metrics, Events in our app, Traces, Work metrics, AI/ML aggregated data

Slide 13

Browser/UI tests, RUM data, Logs, Infrastructure metrics, Resource metrics, Events in our app, Traces, Work metrics, AI/ML aggregated data

Slide 14

Browser/UI tests, RUM data, Logs, Infrastructure metrics, Resource metrics, Business metrics, Events in our app, Traces, Work metrics, AI/ML aggregated data

Slide 15

Browser/UI tests, RUM data, Logs, Infrastructure metrics, Resource metrics, Business metrics, Events in our app, Events outside our app, Traces, Work metrics, AI/ML aggregated data

Slide 16

Browser/UI tests, RUM data, Logs, Networking data, Infrastructure metrics, Resource metrics, Business metrics, Events in our app, Events outside our app, Traces, Work metrics, AI/ML aggregated data

Slide 17

Security, RUM data, Browser/UI tests, Logs, Networking data, Infrastructure metrics, Resource metrics, Business metrics, Events in our app, Events outside our app, Traces, Work metrics, AI/ML aggregated data

Slide 18

Security, RUM data, Browser/UI tests, Logs, Networking data, Infrastructure metrics, Resource metrics, Business metrics, Events in our app, Events outside our app, Traces, Work metrics, AI/ML aggregated data

Slide 19

Security, RUM data, Browser/UI tests, Logs, Networking data, Infrastructure metrics, Resource metrics, Business metrics, Events in our app, Events outside our app, Traces, Work metrics, AI/ML aggregated data
FOUR

Slide 20

Security, RUM data, Browser/UI tests, Logs, Networking data, Infrastructure metrics, Resource metrics, Business metrics, Events in our app, Events outside our app, Traces, Work metrics, AI/ML aggregated data
FOUR GOLDEN

Slide 21

Security, RUM data, Browser/UI tests, Logs, Networking data, Infrastructure metrics, Resource metrics, Business metrics, Events in our app, Events outside our app, Traces, Work metrics, AI/ML aggregated data
FOUR GOLDEN SIGNALS

Slide 22

Slide 23

Alert Fatigue

Slide 24

Information Overload

Slide 25

Service Level: Indicators, Objectives, Agreements

Slide 26

Alerts Pages

Slide 27

Slide 28

Slide 29

Symptoms vs. Causes

Slide 30

Symptom: 18% of queries are returning an error

Slide 31

Cause: 3 nodes of the cluster are down

Slide 32

It is so tempting to page on causes

Slide 33

DON’T

Slide 34

SLAs are written based on symptoms
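
The symptom/cause distinction on the preceding slides can be sketched in a few lines of Python. This is only an illustration, not something shown in the talk: the function names and the 1% paging threshold are hypothetical, and the query counts are chosen to reproduce the "18% of queries" symptom from Slide 30. The point is that the paging decision looks only at what users experience (the error ratio), while the number of nodes down is kept as diagnostic context.

```python
def error_ratio(failed_queries: int, total_queries: int) -> float:
    """Fraction of queries that returned an error (the user-facing symptom)."""
    if total_queries == 0:
        return 0.0
    return failed_queries / total_queries


def should_page(failed_queries: int, total_queries: int,
                max_error_ratio: float = 0.01) -> bool:
    """Page only when the symptom breaches the threshold.

    max_error_ratio is a hypothetical value; in practice it would be
    derived from the SLO for the service.
    """
    return error_ratio(failed_queries, total_queries) > max_error_ratio


# The cause (nodes down) is recorded for context but does not drive paging.
nodes_down = 3
page = should_page(failed_queries=180, total_queries=1000)
print(f"page={page}, context: {nodes_down} nodes down")
```

With this shape, three nodes going down without any user-visible errors would not page anyone, which matches the advice on Slides 32 and 33.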

Slide 35

Symptoms Causes

Slide 36

Urgency

Slide 37

Slide 38

High: p99 latency > 100 ms

Slide 39

Only page on these. Related to your SLOs.

Slide 40

Medium: 1 node of the cluster is down

Slide 41

Notify on these

Slide 42

Low: Database I/O is slower than usual

Slide 43

Log these
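
The three urgency tiers on Slides 38 to 43 (high: page, medium: notify, low: log) amount to a small routing table. A minimal Python sketch follows; the `Urgency` enum, the `ACTIONS` mapping and the `route` function are hypothetical names introduced for illustration and do not correspond to any PagerDuty or Datadog API. The example conditions are the ones quoted on the slides.

```python
from enum import Enum


class Urgency(Enum):
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


# Urgency-to-action mapping as stated on the slides.
ACTIONS = {
    Urgency.HIGH: "page",     # related to your SLOs, wake someone up
    Urgency.MEDIUM: "notify",  # needs attention, but not immediately
    Urgency.LOW: "log",       # keep for later investigation
}


def route(urgency: Urgency) -> str:
    """Return the action to take for an alert of the given urgency."""
    return ACTIONS[urgency]


# Example conditions taken from the slides.
examples = [
    ("p99 latency > 100 ms", Urgency.HIGH),
    ("1 node of the cluster is down", Urgency.MEDIUM),
    ("Database I/O is slower than usual", Urgency.LOW),
]

for condition, urgency in examples:
    print(f"{condition}: {route(urgency)}")
```

Keeping the mapping in one table makes it easy to audit that only the high tier ever results in a page.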

Slide 44

Symptoms Causes

Slide 45

Error Budget

Slide 46

Urgency; but make it (error) budget
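
One way to read "urgency, but make it error budget" is to derive the urgency tier from how much of the error budget remains. The slides give no formula, so the sketch below relies on the common definition (the budget is the share of requests allowed to fail, i.e. 1 minus the SLO target) and on assumptions that are purely illustrative: a 99.9% SLO, made-up request counts, and hypothetical cut-offs of 10% and 50% of budget remaining.

```python
def budget_remaining(slo_target: float, total_requests: int,
                     failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    allowed_failures = (1.0 - slo_target) * total_requests
    if allowed_failures <= 0:
        # A 100% target (or no traffic) leaves no budget to spend.
        return 0.0
    return 1.0 - failed_requests / allowed_failures


def urgency_from_budget(remaining: float) -> str:
    """Map remaining budget to an action; the cut-offs are hypothetical."""
    if remaining < 0.10:
        return "page"
    if remaining < 0.50:
        return "notify"
    return "log"


# Assumed numbers: 99.9% SLO, 1,000,000 requests, 250 failures so far.
remaining = budget_remaining(0.999, 1_000_000, 250)
print(f"budget remaining: {remaining:.0%}, action: {urgency_from_budget(remaining)}")
```

In this example roughly a quarter of the budget has been spent, so the event is only logged; the same failure count against a much smaller request volume would escalate to a page.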

Slide 47

SLOs & SLAs: Alert or Page?

Slide 48

Slide 49

Availability vs. Accessibility

Slide 50

Organise data by team

Slide 51

Team A

Slide 52

Information Underload (?)

Slide 53

Thank you! https://www.datadoghq.com Daniel Maher @phrawzty Ara Pulido @arapulido