After 10 post-mortems in their first season, Tom and Jamie reflect on the common issues they’ve seen. Click through for details!
On May 11, 2021, Salesforce had a multi-hour outage that affected numerous services. Their public writeup was somewhat controversial: it's the first post-mortem we've covered on this show that calls out the actions of a single individual in a negative light. The latest SRE Weekly has a good roundup of articles on the subject.
In this episode, Tom and Jamie talk through the outage and all the different ways that losing DNS can break things, and weigh in on why this post-mortem is not a good example of how the industry should treat outages.
During a routine addition of servers to the Kinesis front-end cluster in US-East-1 in November 2020, AWS ran into an OS limit on the maximum number of threads. The result was a multi-hour outage that affected a number of other AWS services, including ECS, EKS, Cognito, and CloudWatch.
We probably won’t do a full episode on it, but this reminded Tom of one of his favorite historical outages: S3’s 2008 outage which combined single-bit corruption with gossip protocols. If you haven’t read it, definitely check out the post-mortem.
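The trigger in the Kinesis incident, an OS cap on how many threads a process can create, is something you can inspect from a running program. A minimal sketch using Python's standard library (illustrative only; the exact limit AWS hit is internal to their fleet configuration, and `RLIMIT_NPROC` is the Unix per-user process/thread cap, not necessarily the same knob):

```python
import resource

# RLIMIT_NPROC caps the number of processes/threads a user may create.
# Once a service hits a limit like this, thread creation starts failing,
# much like the failure mode described in the Kinesis writeup.
soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
print(f"process/thread limit: soft={soft}, hard={hard}")
```

A value of -1 means "unlimited"; the point is simply that these ceilings exist and are worth knowing before a routine scale-up finds them for you.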
In November 2020, Coinbase had a problem while rotating their internal TLS certificates and accidentally unleashed a huge amount of traffic on some internal services. This was a refreshingly non-database-related incident that led to an interesting discussion about the future of infrastructure as code, the limits of human code review, and how many load balancers might be too many.
Just one day after we released Episode 5 about Auth0’s 2018 outage, Auth0 suffered a 4-hour, 20-minute outage caused by a combination of several large queries and a series of database cache misses. This was a very serious outage: many users were unable to log in to sites across the internet.
This episode has a lot of discussion on caching, engineering leadership, and keeping your databases happy.
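One theme from the caching discussion is what happens when many requests miss the cache at the same time and all fall through to the database at once. A minimal read-through cache sketch (a hypothetical illustration, not Auth0's code) that collapses concurrent misses behind a lock:

```python
import threading

class ReadThroughCache:
    """Minimal read-through cache. The lock prevents a 'stampede' of
    identical expensive queries when many callers miss at once.
    (Hypothetical sketch, not Auth0's implementation.)"""

    def __init__(self, loader):
        self._loader = loader          # the expensive backing query
        self._data = {}
        self._lock = threading.Lock()

    def get(self, key):
        value = self._data.get(key)
        if value is not None:
            return value               # cache hit: database untouched
        with self._lock:
            # Re-check inside the lock: another thread may have loaded
            # the key while we were waiting.
            if key not in self._data:
                self._data[key] = self._loader(key)
            return self._data[key]

calls = []
cache = ReadThroughCache(lambda k: calls.append(k) or k.upper())
print(cache.get("user:1"), cache.get("user:1"), len(calls))
```

Even in this toy form you can see the trade-off: the lock keeps the database happy during a miss storm, at the cost of serializing the callers who missed.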
Tom was feeling under the weather after joining Team Pfizer last week, so today we have a special guest episode with Sujay Jayakar, Jamie’s co-founder and engineer extraordinaire.
While it’s great to respond well to an outage, it’s even better to design and test systems so that outages don’t happen in the first place. As we saw in the Cloudflare outage episode from a few weeks ago, code can behave in very unexpected ways when not all of its possible states have been tested, and covering all of those states by hand is hard for humans to do.
In this week’s episode, Jamie and Sujay talk about some of the ways to use automated tests to drive down the number of untested states in your programs.
Some links that came up during the discussion:
- Jason Warner (GitHub CTO) tweeted about what it was like behind the scenes during the big outage we discussed last week.
- How Dropbox decided to rewrite their sync engine and how they tested it.
- The QuickCheck testing framework.
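The core QuickCheck idea from the episode, generating many random inputs and checking an invariant instead of hand-picking test cases, can be sketched in a few lines of plain Python (a toy illustration; real frameworks like QuickCheck or Hypothesis also shrink failing inputs down to minimal counterexamples):

```python
import random

def check_property(prop, gen, runs=200):
    """Run `prop` against `runs` randomly generated inputs;
    return the first counterexample found, or None if all pass."""
    for _ in range(runs):
        case = gen()
        if not prop(case):
            return case
    return None

# Generator: random integer lists of random length.
gen = lambda: [random.randint(-100, 100) for _ in range(random.randint(0, 20))]

# Property: reversing a list twice gives the original list back.
result = check_property(lambda xs: list(reversed(list(reversed(xs)))) == xs, gen)
print("counterexample:", result)  # None: the property held on every run
```

A false property (say, "every list is already sorted") gets caught quickly, because the random generator explores states a human test author would never think to write down.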
In 2018, after 43 seconds of connectivity issues between their East and West coast datacenters and a rapid promotion of a new primary, GitHub ended up with unique data written to two different databases. As detailed in the postmortem, this resulted in 24 hours of degraded service.
This episode spends a lot of time on MySQL replication and the different types of problems it can both prevent and cause. The issues in this outage will be familiar to anyone who has had to untangle a large dataset, replicated across multiple servers, that has drifted slightly out of sync.
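The failure mode at the heart of this outage, two nodes both believing they are the primary and both accepting writes during a partition, has a simple shape. A toy sketch (not MySQL, just the structure of the problem):

```python
# Toy illustration of split-brain: during a network partition, two nodes
# each believe they are the primary and accept writes for the same key.
east = {"order:42": "shipped"}   # old primary keeps serving writes
west = dict(east)                # newly promoted primary starts from a copy

east["order:42"] = "cancelled"   # write lands on the east-coast node
west["order:42"] = "refunded"    # conflicting write lands on the west-coast node

# After the partition heals, replication cannot reconcile the two
# histories automatically: each primary now holds unique data.
conflicts = {k for k in east if k in west and east[k] != west[k]}
print("conflicting keys:", conflicts)
```

Resolving this requires a human (or application-specific logic) to decide which history wins, which is a big part of why GitHub's recovery took so long.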
Auth0 experienced multiple hours of degraded performance and increased error rates in November of 2018 after several unexpected events, including a migration that dropped some indexes from their database.
The published post-mortem has a full timeline and a great list of action items, though it is curiously missing a few details, like exactly what database the company was using.
In this episode, you’ll learn a little about what Auth0 does and what happened during the outage itself. You’ll also hear about the things that went well and went poorly, and some speculation about whether something like Honeycomb.io could have reduced the amount of time to resolve the outage.
On July 2, 2019, a subtle issue in a regular expression took down Cloudflare (and with it, a large portion of the internet) for 30 minutes.
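The subtle issue was catastrophic backtracking: nested wildcards in the rule forced the regex engine to try an exponential number of ways to match before giving up. A tiny standalone demonstration using a classic pathological pattern (not Cloudflare's actual rule):

```python
import re

# (a+)+$ is a textbook backtracking bomb: on a near-miss input, the
# engine tries exponentially many ways to split the run of 'a's between
# the inner and outer groups before concluding there is no match.
pattern = re.compile(r"(a+)+$")

print(bool(pattern.match("a" * 15)))        # True: matches instantly
print(bool(pattern.match("a" * 15 + "b")))  # False, but only after
                                            # ~2^15 backtracking steps;
                                            # each extra 'a' doubles the work
```

At 15 characters this is still fast; scale the near-miss input up and the match time explodes, which is how one expression can pin every CPU in a fleet.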
Monzo experienced some issues while adding servers to their Cassandra cluster on July 29, 2019. Thanks to some good practices, the team recovered quickly and no data was permanently lost.