Modern Frameworks for Modern Alerting

A presentation at PagerDuty EMEA Summit 2020 by Daniel "phrawzty" Maher

Datadog is a monitoring and analytics platform that helps companies improve observability of their infrastructure and applications

Daniel Maher, Developer Relations, dan.maher@datadoghq.com, @phrawzty
Ara Pulido, Technical Evangelist, ara.pulido@datadoghq.com, @arapulido

Logs, work metrics, infrastructure metrics, RUM data, resource metrics, traces, events in our app, AI/ML aggregated data, browser/UI tests, business metrics, events outside our app, networking data, security

Service Level Indicators, Objectives, Agreements

Symptom: 18% of queries are returning an error
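
A symptom like the one on this slide can be checked with a few lines of code. The following is a minimal sketch, not the speakers' implementation: the function names and the 5% paging threshold are hypothetical, chosen only to illustrate paging on a user-facing symptom.

```python
def error_ratio(failed: int, total: int) -> float:
    """Fraction of queries that returned an error (the symptom being measured)."""
    if total <= 0:
        return 0.0  # no traffic, so there is no observable symptom
    return failed / total


def should_page(failed: int, total: int, threshold: float = 0.05) -> bool:
    """Page only when the error ratio exceeds the threshold.

    The 5% default is an assumption for illustration, not a value from the talk.
    """
    return error_ratio(failed, total) > threshold


# 18 failed queries out of 100 is the 18% symptom from the slide.
print(should_page(failed=18, total=100))
```

Alerting on the ratio rather than on a raw failure count keeps the check meaningful as traffic volume changes.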

Thank you! https://www.datadoghq.com Daniel Maher @phrawzty Ara Pulido @arapulido



Monitoring is an ancient discipline—but one that has evolved significantly in the past few years. Modern monitoring platforms collect a lot of data from our systems: work and resource metrics, events that are happening inside and outside our applications, distributed tracing data, real user monitoring, and more. But are we using all that data in a way that helps to avoid outages without causing alert fatigue? Are we suffering from information overload in our monitoring systems? We’ll present strategies on how to organise your system data in a way that helps your teams anticipate future user-facing issues and avoids alert fatigue by paging only when immediate attention is required.

Buzz and feedback

Here’s what was said about this presentation on social media.

“The idea behind an error budget is that you’ve established what your acceptable level of errant behavior is and if you’re below it or above it, you behave differently.” - @phrawzty, @datadoghq #PDSummitEMEA
— PagerDuty (@pagerduty) June 30, 2020
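
The error budget idea in the quote above can be sketched in code. This is an illustrative example under stated assumptions: the `ErrorBudget` class, the `policy` function, and the two behaviours it returns are hypothetical and do not come from the talk, which only says that teams behave differently on either side of the budget.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    slo_target: float      # e.g. 0.99 means 99% of requests should succeed
    total_requests: int    # requests observed in the evaluation window
    failed_requests: int   # requests that failed in the same window

    @property
    def allowed_failures(self) -> float:
        """Number of failures the SLO tolerates for this window."""
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def consumed(self) -> float:
        """Fraction of the budget used; above 1.0 means it is overspent."""
        allowed = self.allowed_failures
        if allowed == 0:
            return float("inf") if self.failed_requests else 0.0
        return self.failed_requests / allowed


def policy(budget: ErrorBudget) -> str:
    """Hypothetical behaviours for each side of the budget."""
    if budget.consumed > 1.0:
        return "prioritise reliability"
    return "ship features"


within = ErrorBudget(slo_target=0.99, total_requests=10_000, failed_requests=50)
over = ErrorBudget(slo_target=0.99, total_requests=10_000, failed_requests=150)
print(policy(within), "|", policy(over))
```

Expressing consumption as a fraction of the allowed failures makes the same check reusable across services with different traffic levels and SLO targets.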
