Welcome to The Downtime Project! Here’s a quick episode where Tom and Jamie talk about why they created the show and what you can expect.
Tom: [00:00:00] Hi, everybody. Welcome to The Downtime Project, a new podcast that’s going to help you learn from the Internet’s most notable outages. Every episode, we’re going to go through a post-mortem for an internet outage that you might’ve heard of and talk a little bit about what happened. This is just an introduction episode where we’re going to give you a little bit of background about us and why we’re interested in this sort of thing.
So you can feel free to skip this one if you want to get straight to the good stuff. My name is Tom Kleinpeter and my co-host is Jamie Turner.
Jamie: Hello!
Tom: I’ve been building and running services since 1999 at companies of all sizes. I got started at a peer-to-peer music company called Audiogalaxy, which grew extremely rapidly until we ran into a few legal problems.
After that I built a file synchronization service at a startup called FolderShare, then spent some time at Microsoft. And then I did a second startup called Audiogalaxy before ending up at Dropbox, where I met Jamie. I didn’t get a chance to really break anything at Microsoft, but I’ve caused or been a part of stomach-turning outages at all of the other companies.
So I have a lot of empathy for everyone that’s ever dealt with a big outage. Jamie, what about you?
Jamie: [00:01:17] Yeah. Hi, I’m Jamie Turner and I’ve also been fortunate to have worked on a lot of big systems over the last 20 years, like Tom, at a lot of startups, working on web, mobile, market research and other things like that.
Most recently, I worked with Tom at Dropbox, as he mentioned. One of the systems I worked on that was particularly relevant to this podcast was Dropbox’s multi-exabyte in-house storage system, which is the equivalent of S3 at Dropbox, and it’s where we keep all the user files. On this system and others, we made mistakes as we developed them and built them and deployed them. And we slowly learned a lot of lessons about how to build things that are reliable. So we wanted to start this podcast to talk about the kind of things that happen when engineering teams have these kinds of outages, share lessons that we’ve learned along the way, and learn lessons from these public outages and postmortems.
And then we can collectively talk about these things on the podcast and kind of talk as a community of engineers who work on these kinds of services.
Tom: [00:02:20] Yeah. I totally agree with that. The postmortems that teams publish after they have an outage have so many valuable lessons in them. I think it’s a really great thing that the industry does.
Personally, I’m really excited to have a forcing function to get me to spend more time thinking about them and thinking about them a little bit more deeply than I might normally do if I just read through them on Hacker News or something. But at a higher level, I think it’s valuable to package up this information in a way that increases its consumption. Podcasts offer a really great route for learning about things you might not have otherwise heard about.
Hopefully with this one we can amplify some of the lessons that people have so painfully learned and overall just make the industry a better place.
Jamie: [00:02:57] And one way we’ll enrich the conversation is to make sure we have communication channels open with all of you who are listening so we can learn from your own experiences in similar situations.
And we can reincorporate those into the conversation that we have here.
Tom: [00:03:13] Yeah. The first outage we’re going to talk about happened earlier this year. On January 4th, 2021, Slack went down for a few hours on the first working day of the year. You might have been affected. I certainly was, but Slack put up a great article about what happened with a lot of really good details in it.
I think we’ll have some good lessons for everybody to hear about. So stay tuned!