Building SRE from Scratch at Coinbase during Hypergrowth

A presentation at AWS re:Invent in November 2018 in Las Vegas, NV, USA by Daniel "phrawzty" Maher

Slide 1

Slide 2

DEV315-S Building SRE from scratch at Coinbase during hypergrowth Niall O’Higgins Daniel Maher © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.

Slide 3

“Our goal is to make Coinbase the most trusted and easiest to use digital currency exchange.” Brian Armstrong Co-founder & CEO of Coinbase

Slide 4

What is SRE?
• New field
• Definitions vary
• Many misconceptions

Slide 5

Is this SRE?
• Endless firefighting?
• Being on-call?
• Operational toil?
Those are the symptoms; SRE is the cure.

Slide 6

What about DevOps?
SRE satisfies many, if not all, of the operational and cultural elements of DevOps.

Slide 7

Key SRE insight #1
Measure and improve human, organisational, and machine systems.

Slide 8

Key SRE insight #2
Move from reactive to proactive. Go find sources of toil and eliminate them!

Slide 9

Key SRE insight #3
Provide an organisational backpressure mechanism.

Slide 10

Set early expectations
• New language and fresh set of concepts.
• Takes time to absorb - no instant results!
• The best way to begin, is to begin.

Slide 11

The Coinbase strategy
• Service Level Indicators
• The “Four Golden Signals”

Slide 12

Metrics: the core of SLIs
• Natural tendency to over-engineer.
• Lots of data, none of it actionable or useful.
Optimise for KPIs, or high signal/noise.

Slide 13

The Four Golden Signals
• Latency
• Traffic
• Errors
• Saturation

Slide 14

Latency
• Direct impact on customer experience.
• Where and how you measure it is important.

Slide 15

Traffic
• The amount of work being done - or attempted.
• Direct relationship with business value.

Slide 16

Errors
• A nice, defined target to aim at.
• Direct impact on customer experience.

Slide 17

Saturation
• Real Talk: this is a tricky one.
• Direct relationship to both scaling and capacity planning.
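As a rough illustration of the four signals described on the preceding slides, the sketch below summarises one observation window of request data. It is not taken from the talk: the `Request` record, the nearest-rank 95th percentile, and the definition of saturation as used capacity divided by total capacity are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    """One served request. Hypothetical record, for illustration only."""
    duration_ms: float
    is_error: bool


def golden_signals(requests: List[Request], window_s: float,
                   used_capacity: float, total_capacity: float) -> Dict[str, float]:
    """Summarise one observation window as the four golden signals."""
    if not requests:
        raise ValueError("need at least one request")
    if window_s <= 0 or total_capacity <= 0:
        raise ValueError("window and capacity must be positive")

    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank 95th percentile: 1-based rank ceil(0.95 * n),
    # computed with integer arithmetic to avoid float rounding.
    rank = (95 * len(durations) + 99) // 100

    return {
        # Latency: how long requests took (assumed measured at the server).
        "latency_p95_ms": durations[rank - 1],
        # Traffic: work done or attempted per second.
        "traffic_rps": len(requests) / window_s,
        # Errors: fraction of requests that failed.
        "error_rate": sum(r.is_error for r in requests) / len(requests),
        # Saturation: assumed here to be a simple utilisation ratio.
        "saturation": used_capacity / total_capacity,
    }
```

Where latency is measured (client, load balancer, or service) changes the numbers, which is why the record type here is deliberately minimal.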

Slide 18

Humans and the Four Golden Signals
• Latency
• Traffic
• Errors
• Saturation

Slide 19

Start, then iterate
• Start with an initial specification - even if it’s not ideal.
• Iterating on feedback is the key to getting it right.
• Keep it simple!

Slide 20

Spreadsheets are simple, right?

Service  Latency            Errors          Saturation  Traffic
Foo      foo.latency        foo.error_rate  Disc space  TPS
Bar      bar.response_time  bar.error_rate  Memory      TPS
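The same specification can be kept as plain data once it outgrows a spreadsheet. The sketch below copies the two example rows from the slide into a dictionary and adds a small check for services that lack an indicator for one of the four signals. The names `SLI_SPEC` and `missing_signals` are hypothetical, introduced only for this example.

```python
from typing import Dict, List

SIGNALS = ("latency", "traffic", "errors", "saturation")

# Rows copied from the slide's example spreadsheet.
SLI_SPEC: Dict[str, Dict[str, str]] = {
    "Foo": {
        "latency": "foo.latency",
        "errors": "foo.error_rate",
        "saturation": "Disc space",
        "traffic": "TPS",
    },
    "Bar": {
        "latency": "bar.response_time",
        "errors": "bar.error_rate",
        "saturation": "Memory",
        "traffic": "TPS",
    },
}


def missing_signals(spec: Dict[str, Dict[str, str]]) -> Dict[str, List[str]]:
    """Return {service: [signals with no indicator]} for incomplete rows."""
    gaps = {}
    for service, indicators in spec.items():
        missing = [s for s in SIGNALS if not indicators.get(s)]
        if missing:
            gaps[service] = missing
    return gaps
```

An empty result from `missing_signals(SLI_SPEC)` means every listed service has an initial indicator for each signal, which is enough to start iterating.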

Slide 21

SLIs: Defining “done”
• Per-service dashboard in Datadog with timeseries chart for each indicator.
• Document describing the indicators and why they are important.

Slide 22

Datadog Dashboards

Slide 23

Datadog Dashboards

Slide 24

Specification Documentation
• Spec vs. implementation
• Where do you want to instrument?
• Where is it easy to instrument?

Slide 25

First SLIs, then Promises
• Plain-language statements; easy to parse, easy to understand.
• Plenty of potential stakeholders.
• Start simple!

Slide 26

“You can rely on us to buy or sell crypto whenever you want.” Coinbase’s “Prime Promise”

Slide 27

Essential Reading
Thinking in Promises by Mark Burgess

Slide 28

Concerning Promises
• Promises have two parties.
• Promises can be human to machine, human to human, or machine to machine.

Slide 29

Example Promises
• Your service promises to respond to clients within 50ms.
• A service you depend on promises that its error rate will be < 1%.
• On-call promises they will engage an incident within 15 minutes.
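Promises like the three examples above reduce to a measured value compared against a limit. The following is a minimal sketch of that idea, not the method used at Coinbase: the `Promise` class, the observation keys, and the choice to treat "within" as inclusive and "<" as strict are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Promise:
    """A plain-language promise plus the limit that makes it checkable."""
    promiser: str
    statement: str
    key: str          # hypothetical name of the observed value
    limit: float
    inclusive: bool   # True: observed <= limit keeps it; False: observed < limit

    def is_kept(self, observed: float) -> bool:
        if self.inclusive:
            return observed <= self.limit
        return observed < self.limit


# The three example promises from the slide; limits are taken from its text.
PROMISES = [
    Promise("your service", "respond to clients within 50ms",
            "latency_ms", 50.0, inclusive=True),
    Promise("a dependency", "error rate will be < 1%",
            "error_rate_pct", 1.0, inclusive=False),
    Promise("on-call", "engage an incident within 15 minutes",
            "engage_minutes", 15.0, inclusive=True),
]


def broken_promises(promises: List[Promise],
                    observed: Dict[str, float]) -> List[Promise]:
    """Return the promises whose observed value violates the limit.

    Promises with no observation are skipped rather than assumed broken.
    """
    return [p for p in promises
            if p.key in observed and not p.is_kept(observed[p.key])]
```

Calling `broken_promises(PROMISES, {"latency_ms": 42.0, "error_rate_pct": 1.3})` would return only the dependency's error-rate promise, since the latency promise is kept and on-call has no observation.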

Slide 30

Promise enumeration
• Each team must formalise the promises they are willing to keep.
• They must also understand the promises they rely upon to function.
• When a promise you rely upon is broken, what should you do? Who should you contact?

Slide 31

Promises: defining “done”
Promises are done when they have a Datadog monitor (alert).

Slide 32

When Promises are broken…
• It is inevitable that a promise will be broken at some point.
• What to do when that happens?

Slide 33

Blameless post-mortems
• Is “blameless post-mortem” a real thing?
• What about data-driven post-mortems?
• How does this relate to promises - specifically broken ones?
• https://v.gd/jpr_post_mortems
• https://v.gd/jyee_datadriven

Slide 34

Interpreting incidents
• Build a shared language.
• Practise communicating.
• Understand that incidents and outages are broken promises.

Slide 35

Measuring incident response
• Quantify and measure the quality of your incident responses.
• Quantitative: Time to detect, time to engage, time to fix.
• Qualitative: Quality of communication.
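The quantitative measures named above can be derived from four timestamps per incident. This is only a sketch: the slide does not define where each interval starts and ends, so the boundaries used here (detect from start, engage from detection, fix from engagement) and the `Incident` record itself are assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Incident:
    """Hypothetical incident record with the four timestamps needed."""
    started: datetime    # when the promise was first broken
    detected: datetime   # when monitoring or a human noticed
    engaged: datetime    # when a responder began working on it
    resolved: datetime   # when the promise was being kept again

    @property
    def time_to_detect(self) -> timedelta:
        return self.detected - self.started

    @property
    def time_to_engage(self) -> timedelta:
        # Assumed to run from detection, not from the start of the incident.
        return self.engaged - self.detected

    @property
    def time_to_fix(self) -> timedelta:
        # Assumed to run from engagement to resolution.
        return self.resolved - self.engaged


def mean_duration(durations: List[timedelta]) -> timedelta:
    """Arithmetic mean of a non-empty list of durations."""
    if not durations:
        raise ValueError("need at least one duration")
    return sum(durations, timedelta()) / len(durations)
```

Tracking `mean_duration` of each measure across incidents gives a trend to review over time. The qualitative side, quality of communication, still needs human judgement.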

Slide 36

The end-game
• Have a clear answer for “why SRE?”
• Start with instrumentation; keep it simple to start.
• Enumerate your promises.
• Measure your response when promises are broken.
• Transparency
• Understanding
• Confidence

Slide 37

Thank you!
Niall O’Higgins
Daniel Maher

Slide 38
