A presentation at AWS re:Invent in Las Vegas, NV, USA by Daniel "phrawzty" Maher
DEV315-S Building SRE from scratch at Coinbase during hypergrowth Niall O’Higgins Daniel Maher © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
“Our goal is to make Coinbase the most trusted and easiest to use digital currency exchange.” Brian Armstrong Co-founder & CEO of Coinbase
What is SRE? • New field • Definitions vary • Many misconceptions
Is this SRE? • Endless firefighting? • Being on-call? • Operational toil? Those are the symptoms; SRE is the cure.
What about DevOps? SRE satisfies many, if not all, of the operational and cultural elements of DevOps.
Key SRE insight #1 Measure and improve human, organisational, and machine systems.
Key SRE insight #2 Move from reactive to proactive. Go find sources of toil and eliminate them!
Key SRE insight #3 Provide an organisational backpressure mechanism.
Set early expectations • New language and fresh set of concepts. • Takes time to absorb - no instant results! • The best way to begin, is to begin.
The Coinbase strategy • Service Level Indicators • The “Four Golden Signals”
Metrics: the core of SLIs • Natural tendency to over-engineer. • Lots of data, none of it actionable or useful. Optimise for KPIs, or high signal/noise.
The Four Golden Signals • Latency • Traffic • Errors • Saturation
Latency • Direct impact on customer experience. • Where and how you measure it is important.
Traffic • The amount of work being done - or attempted. • Direct relationship with business value.
Errors • A nice, defined target to aim at. • Direct impact on customer experience.
Saturation • Real Talk: this is a tricky one. • Direct relationship to both scaling and capacity planning.
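As a rough illustration of the four signals, the sketch below summarises one time window of requests. The `Request` record, the 95th-percentile choice, and the definition of saturation as traffic divided by an assumed capacity are all illustrative assumptions, not something the slides specify; saturation in particular is, as the slide says, a tricky one.

```python
from dataclasses import dataclass
import math


@dataclass
class Request:
    duration_ms: float  # time taken to handle the request (hypothetical field)
    ok: bool            # False if the request ended in an error


def percentile(values, pct):
    """Nearest-rank percentile; pct is in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def golden_signals(requests, window_seconds, capacity_rps):
    """Summarise one time window of requests as the four signals."""
    count = len(requests)
    traffic = count / window_seconds
    error_rate = sum(1 for r in requests if not r.ok) / count if count else 0.0
    latency = percentile([r.duration_ms for r in requests], 95) if count else 0.0
    return {
        "latency_p95_ms": latency,
        "traffic_rps": traffic,
        "error_rate": error_rate,
        # Assumption: saturation is approximated as traffic / known capacity.
        "saturation": traffic / capacity_rps,
    }


sample = [Request(20, True), Request(35, True), Request(40, False), Request(120, True)]
signals = golden_signals(sample, window_seconds=2, capacity_rps=10)
```

Where the duration is measured (at the client, the load balancer, or inside the service) changes the latency number, which is why the measurement point belongs in the specification.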
Humans and the Four Golden Signals • Latency • Traffic • Errors • Saturation
Start, then iterate • Start with an initial specification - even if it’s not ideal. • Iterating on feedback is the key to getting it right. • Keep it simple!
Spreadsheets are simple, right?
Service | Latency | Errors | Saturation | Traffic
Foo | foo.latency | foo.error_rate | Disc space | TPS
Bar | bar.response_time | bar.error_rate | Memory | TPS
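The same spreadsheet can be kept as plain data, which makes it easy to check that every service has an indicator for each signal. This is a minimal sketch; the `SLI_SPEC` layout and the `missing_indicators` helper are hypothetical names, and only the cell values come from the slide.

```python
# One row per service, one entry per golden signal (values from the slide).
SLI_SPEC = {
    "Foo": {
        "latency": "foo.latency",
        "errors": "foo.error_rate",
        "saturation": "Disc space",
        "traffic": "TPS",
    },
    "Bar": {
        "latency": "bar.response_time",
        "errors": "bar.error_rate",
        "saturation": "Memory",
        "traffic": "TPS",
    },
}

SIGNALS = ("latency", "errors", "saturation", "traffic")


def missing_indicators(spec):
    """Return (service, signal) pairs that have no indicator yet."""
    return [
        (service, signal)
        for service, row in spec.items()
        for signal in SIGNALS
        if not row.get(signal)
    ]


gaps = missing_indicators(SLI_SPEC)
```

An empty `gaps` list means the initial specification is complete enough to start iterating on.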
SLIs: Defining “done” • Per-service dashboard in Datadog with timeseries chart for each indicator. • Document describing the indicators and why they are important.
Datadog Dashboards
Specification Documentation • Spec vs. implementation • Where do you want to instrument? • Where is it easy to instrument?
First SLIs, then Promises • Plain-language statements; easy to parse, easy to understand. • Plenty of potential stakeholders. • Start simple!
“You can rely on us to buy or sell crypto whenever you want.” Coinbase’s “Prime Promise”
Essential Reading Thinking in Promises by Mark Burgess
Concerning Promises • Promises have two parties. • Promises can be human to machine, human to human, or machine to machine.
Example Promises • Your service promises to respond to clients within 50ms. • A service you depend on promises that its error rate will be < 1%. • On-call promises they will engage an incident within 15 minutes.
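The three example promises can be written down as data plus a check, which keeps the plain-language statement next to the thing that verifies it. This is only a sketch: the `Promise` record and `broken` helper are hypothetical, and the promisee named for the on-call promise is an assumption, since the slide does not say who receives it.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Promise:
    promiser: str                   # party making the promise
    promisee: str                   # party relying on it
    statement: str                  # plain-language wording
    kept: Callable[[float], bool]   # True if an observed value keeps the promise


PROMISES = [
    Promise("your service", "clients",
            "respond within 50 ms", lambda ms: ms <= 50),
    Promise("a service you depend on", "your service",
            "error rate below 1%", lambda rate: rate < 0.01),
    # Assumption: the promisee of the on-call promise is the wider organisation.
    Promise("on-call", "the organisation",
            "engage an incident within 15 minutes", lambda minutes: minutes <= 15),
]


def broken(promises, observations):
    """Return the statements whose observed value breaks the promise."""
    return [p.statement for p, obs in zip(promises, observations) if not p.kept(obs)]


# Hypothetical observations: 42 ms latency, 2% error rate, 10 minutes to engage.
broken_now = broken(PROMISES, [42, 0.02, 10])
```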
Promise enumeration • Each team must formalise the promises they are willing to keep. • They must also understand the promises they rely upon to function. • When a promise you rely upon is broken, what should you do? Who should you contact?
Promises: defining “done” Promises are done when they have a Datadog monitor (alert).
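One way to picture "a promise has a monitor" is a small function that turns a promise into a monitor definition. The sketch below only builds a dictionary and sends nothing to Datadog. The field names and the query string are assumptions modelled on the general shape of Datadog metric monitors and should be checked against the Datadog documentation; `monitor_for` is a hypothetical helper.

```python
def monitor_for(promise_text, metric, comparator, threshold, window="last_5m"):
    """Build a monitor definition that alerts when the promise is broken.

    Assumption: the dictionary layout and query syntax imitate Datadog
    metric monitors; verify both before using them with a real account.
    """
    return {
        "name": f"Promise: {promise_text}",
        "type": "metric alert",
        # The alert condition is the negation of the promise.
        "query": f"avg({window}):avg:{metric}{{*}} {comparator} {threshold}",
        "message": f"Broken promise: {promise_text}",
    }


# The promise "error rate < 1%" is broken when the rate is >= 0.01.
monitor = monitor_for("error rate below 1%", "foo.error_rate", ">=", 0.01)
```

Note that the monitor encodes the broken condition, so a promise of "below 1%" becomes an alert on "at or above 1%".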
When Promises are broken… • It is inevitable that a promise will be broken at some point. • What to do when that happens?
Blameless post-mortems • Is “blameless post-mortem” a real thing? • What about data-driven post-mortems? • How does this relate to promises - specifically broken ones? • https://v.gd/jpr_post_mortems • https://v.gd/jyee_datadriven
Interpreting incidents • Build a shared language. • Practise communicating. • Understand that incidents and outages are broken promises.
Measuring incident response • Quantify and measure the quality of your incident responses. • Quantitative: Time to detect, time to engage, time to fix. • Qualitative: Quality of communication.
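The quantitative measures can be derived from four timestamps per incident. The sketch below is one possible reading: it assumes time to detect and time to fix are both counted from the start of the incident, and time to engage from the moment of detection. The function names and the sample incident are hypothetical.

```python
from datetime import datetime


def minutes_between(earlier, later):
    """Elapsed time between two datetimes, in minutes."""
    return (later - earlier).total_seconds() / 60


def response_times(started, detected, engaged, fixed):
    """Quantitative incident-response measures, in minutes."""
    return {
        # Assumption: detection and fix are measured from the incident start.
        "time_to_detect": minutes_between(started, detected),
        # Assumption: engagement is measured from detection.
        "time_to_engage": minutes_between(detected, engaged),
        "time_to_fix": minutes_between(started, fixed),
    }


# Hypothetical incident timeline.
incident = response_times(
    started=datetime(2018, 1, 1, 10, 0),
    detected=datetime(2018, 1, 1, 10, 4),
    engaged=datetime(2018, 1, 1, 10, 12),
    fixed=datetime(2018, 1, 1, 10, 47),
)

# Compare against the example on-call promise of engaging within 15 minutes.
engaged_in_time = incident["time_to_engage"] <= 15
```

Tracking these per incident makes it possible to say whether the on-call promise itself was kept, rather than relying on impressions.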
The end-game • Have a clear answer for “why SRE?” • Start with instrumentation; keep it simple to start. • Enumerate your promises. • Measure your response when promises are broken. • Transparency • Understanding • Confidence
Thank you! Niall O’Higgins Daniel Maher
Coinbase is a secure online platform for buying, selling, transferring, and storing digital currency. This talk covers its journey from a small band of engineers working on reliability to a centralized SRE organization, and the lessons learned along the way. We dive into the processes that we created—both technical and organizational—that enabled us to quickly build a world-class reliability engineering group. We also cover what reliability really means, and more importantly, how we measure it. This session is brought to you by AWS partner, Datadog.
Here’s what was said about this presentation on social media.
You need three things: good instrumentation, good understanding of what you have measured and a process for interpreting and reasoning about those measurements - without these no postmortem process will work! @niallohiggins #reInvent2018 @phrawzty
— Blameless (@blamelesshq) November 27, 2018