
How Coinbase Unleashed a Thundering Herd

The Downtime Project

In November 2020, Coinbase had a problem while rotating their internal TLS certificates and accidentally unleashed a huge amount of traffic on some internal services. This was a refreshingly non-database related incident that led to an interesting discussion about the future of infrastructure as code, the limits of human code review, and how many load balancers might be too many. 

Jamie: [00:00:00] Welcome to the Downtime Project, where we learn from the Internet’s most notable outages. I’m Jamie Turner and with me is Tom Kleinpeter. Now, before we begin, I want to remind our listeners that these incidents are stressful. We peel them apart to learn, not to judge, and ultimately Tom and I have made and, in the future, unfortunately, will make similar mistakes on our own projects. So please view these conversations as educational, rather than as a judgment of mistakes that we think we would never make. Today we’re talking about Coinbase’s November 2020 TLS outage. But before we get into that, we have a few updates for you.

So first of all, keep those requests for episodes coming. And if you haven’t yet, don’t forget to rate and review the show and find and follow us on Twitter. And my startup is hiring. It’s called Zerowatt, and we’re building platforms for making serverless internet companies, so if that sounds exciting to you and you want to know more about it–you’re looking for a job– send us an email jobs@zerowatt.io. Tom?

Tom: [00:01:27] All right, well, what do you know, my startup is also hiring engineers. It’s slightly different work than Jamie, but we’re building a community platform for companies that really value their community and are trying to work with them better. We’ve got a lot of interesting projects to work on, so come find me on LinkedIn or come to our website at commonroom.io if you want to learn more about that. 

Jamie: [00:01:55] Well, let’s get into the incident here. So today, we are talking about Coinbase’s November of 2020 outage, which involves TLS. And we figured we’d chat about Coinbase today because we’re celebrating their recent IPO. So congratulations Coinbase. Now let’s talk about one of your foibles here. So Tom, maybe you could help us with a little background here on Coinbase and this outage. 

Tom: [00:02:24] So I guess starting with the very basics, Coinbase is a crypto exchange. If you want to buy some Bitcoins or you just watched SNL and you really want to get onto the Dogecoin trend, you would probably end up on Coinbase to buy and sell them. So security is obviously extremely important to a company like Coinbase, and internally all the traffic between their servers is encrypted. It’s using something called TLS, which replaced something called SSL, and it’s how HTTPS works. A couple of key points to this that you need to know about first is all this TLS infrastructure relies on certificates, which are basically proof of something–proof that you’ve been verified or that a connection is valid.

For this, they’re using these internal certificates, which isn’t super relevant for this. But the key point about all these certificates is that they have expirations. So these are not good for perpetuity. You have to cycle them–maybe once a year, some shorter, some longer–but they all have an expiration.

And when that changes or before you hit the expiration, you have to add new certificates to your system. So, Coinbase has a pretty non-trivial backend. And in the article they wrote up about this they mentioned they have 700 load balancers internally, which is a lot of load balancers because each one of these load balancers needs to have basically the path to the right certificate to use.

And so this outage occurred when they were updating the certificates on all these load balancers to push out the expiration date. So, you’ve got a ton of changes you need to make to a bunch of machines and that’s how we get started. 

Jamie: [00:04:20] Great. Cool. Well, let’s dig into the timeline here. So at about 3:30 Eastern time, 7:30 UTC on that November day, 2020, all of a sudden they started getting paged.

And they went and checked various graphs and they noticed that the traffic headed to all their internal services had suddenly dropped to zero. 

Tom: [00:04:47] Now that’s an interesting problem. A little bit atypical. That generally means when you see all your internal stuff go to zero, that something is happening on the edge.

It could be that somebody cut a cable somewhere. Those are always exciting when the real physical world intrudes on our perfect software world. But it could also be in this case it’s just a configuration issue with your load balancers on the outer edge or just something that’s getting the traffic into your system.

Jamie: [00:05:15] Yep. So the team, to Tom’s point, said, okay, well, there are no errors, there’s just no traffic. So what just changed that could have affected that? And they immediately suspected TLS could be involved because the migration was underway to deploy those new internal certificates to replace the expiring ones. So they said, okay, we know we’re messing with a whole lot of TLS, which certainly has a lot to do with the traffic layer. We probably messed something up there. About three minutes in, they were like, this is probably a TLS thing related to what we just did.

Six minutes in, they had their severity level assigned. They were all on VC. They started a document to collaborate on the incident and their status page was updated because they knew the site was not currently working. 14 minutes into this outage, they began to roll back the certificate change. They’d staged the rollout into multiple steps, each one with a pull request on GitHub associated with it.

So they focused on the pull requests containing the most critical services first. That started to roll all the TLS certificates back to the previous version, which had not yet expired–it was just coming up soon. So everything in theory should be fine at this point, but the site did not come back up. They weren’t completely sure why, it sounds like, at this point. But they decided to restart all the services because they figured maybe some services had gotten unhappy for some reason. But after restarting the services, the services were not able to come back and be healthy. This all seems to happen somewhere in the time period between 14 minutes and an hour and 16 minutes into the incident.

So there’s a little bit of extended time period here when they were not able to get their services online. And they were spending some time trying to figure out what was going on and restarting things. Now, finally, at some point, the team spent a couple of minutes really digging through the metrics and they noticed that they had really high rates of 500s on their backends when the backends were coming up. So they started to suspect a thundering herd problem.

Tom: [00:07:27] Well, the thundering herd is clearly one of the best names for classes of outages you can have. It’s just so vivid, just imagining the bison stampeding across the plain or something. This will happen when you have some event, you trigger something, and you end up with a ton of correlated traffic. You can imagine a ton of clients that all get kicked off of your network at the same time. And they all retry exactly 10 seconds later. What do you know, you’ve just correlated a ton of traffic between a large number of people. And when they all come back at the same time, instead of that traffic being nicely spread out, it’s going to hit you all at once.

And services tend not to like that. If you overload a system with a ton of traffic at once, you might end up in a state where it’s not actually completing any work, because every request it’s getting hit with is timing out. And so the services can’t even come up, because nobody finishes the work they’re being called for.

They just fail along with everybody else and they retry. And you’re particularly vulnerable to this type of problem when your service has just started. Barring slow memory leaks or something like that, services tend to be a little bit slower when they first start up. Maybe the JIT is still warming up. Maybe they’ve got a bunch of modules they have to pull off the disk. Or it could be caching: if your service is not fast until it’s cached a bunch of data, you’re going to be slow when you start up. And again, that’s one reason I’m sort of down on caches–the unreliable performance. But for all sorts of reasons, applications can be particularly slow when they first start, and Coinbase had just restarted all of their services.
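The write-ups don’t say how Coinbase’s clients retried, but the classic mitigation for the correlated-retry storm Tom describes is randomized (“jittered”) backoff. A minimal Python sketch, with all numbers illustrative:

```python
import random

def backoff_with_jitter(attempt, base=1.0, cap=60.0):
    # Full jitter: pick a random delay in [0, min(cap, base * 2**attempt)].
    # A large population of clients spreads its retries across the window
    # instead of all coming back in lockstep.
    return random.uniform(0, min(cap, base * 2 ** attempt))

# The correlated case: every client retries exactly 10 seconds later,
# so all the traffic lands at once.
fixed = [10.0 for _ in range(1000)]

# The jittered case: on the third retry the window is [0, 8] seconds,
# and the delays scatter across it.
jittered = [backoff_with_jitter(3) for _ in range(1000)]
```

Because each client draws its own delay, the herd disperses a little more on every failed round instead of re-forming.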

Jamie: [00:09:09] So that combination of a spike of traffic that had been waiting and suddenly wanted to succeed, and the services all being brand new, seemed to put them in some sort of cycle where they were not able to service as many requests as they needed to catch up.

So the 500s remained, and the team said, we kind of have a thundering herd issue. So they came up with a plan and started to enact that plan at about an hour and 16 minutes in. At the traffic layer, they turned off all the traffic headed to the main backend service–that’s what they say here in their notes. And then they let it successfully finish starting up.

So they had a healthy pool of backend services to handle the traffic. But then because they knew that they had an extra high demand right now because of this thundering herd that represented all this deferred traffic that wants to succeed, they decided to increase the size of the backend pool (the number of servers and therefore the number of resources available to handle this traffic). They increased that by some amount, and they said now we’re able to turn traffic back on. We’ve got a bunch of services that are up and running and we’ve increased the number of those services.

So they should be able to deal with the surge.  At about an hour and 27 minutes into the outage, which is about 11 minutes after they started setting this whole thing up, they opened the floodgates. Again, it doesn’t say exactly how quickly. So it sounds like maybe they just kind of went back from 0 to 100, like this is a turn-off/turn-on kind of thing.

For about 5 minutes error rates were still a little elevated because this is more traffic than normal. However, they were making progress through their backlog. Enough were succeeding to sort of start to have the thundering herd shrink. The bison were becoming fewer.

Tom: [00:11:03] The bison had made it through the gate.

Jamie: [00:11:05] Yeah, the bison made it across the plain and through the gate and all the kind of metaphorical things that Tom cited.

Tom: [00:11:11] That’s really standard, for error rates to be a little bit higher while you’re going through the backlog, because you could still be hitting timeouts. Other services could still just not be responding all the time, but they were definitely seeing their 200 rates go up and up, back up to where they needed them to be.

Jamie: [00:11:29] And so they determined, about five minutes after they had turned traffic back on, that all the error rates and success rates looked back to normal, and traffic levels were about where they should be. So it seemed as though the herd had passed on to another plain. This is about 1 hour 32 minutes in, and the site is considered a hundred percent online and the incident is considered resolved.

Tom: [00:11:50] How satisfying it must have been to watch everything go green again. You go from all 500s to all 200s. Oh, very peaceful. 

Jamie: [00:12:00] I do love that moment when everything is green again.

Tom: [00:12:04] Particularly because you’ve just doubled your capacity too. So everything is just going to look perfect. Latencies are going to be down. Everything’s going to be really, really smooth. 

Jamie: [00:12:11] The team probably has a moment where they look at each other and ask, should we give it back or just kind of keep it? Everything is so smooth and fast right now. Well, we went on this ride with them here through this timeline, but we sort of have to double back and ask why this happened. So we know TLS was somehow involved. We know that reverting the certificates back fixed the problem. After a little bit of herd management, we’re back online.

So what actually happened is, as Tom mentioned, certificates are signed by authorities, and that kind of gives them the domain in which they are effective–the domain in which clients will say, I will talk to you, I will listen to you, I will encrypt traffic to you, because you have the right to serve that. And so these internal certificates are not necessarily valid to serve the public domain that a customer’s web browser is talking to. And they’re probably also not signed by an authority that those browsers would recognize. So basically these internal certificates are not supposed to end up at the edge. And yet what happened is one of them did. So there was a configuration error that caused one of these internal certificates–they were only attempting to renew and update the internal certificates–to accidentally end up on the public load balancer, and that started to cut all the traffic off because the customers’ web browsers were not going to treat that as an authorized certificate to talk to. Hopefully they wouldn’t. If everything just kept working, that would have been even more disturbing actually.

Tom: [00:13:56] That would be somebody else’s SEV.

Jamie: [00:13:58] Yeah, that’s right. So how did this happen? How did this internal certificate end up on the edge at a public load balancer? So Coinbase uses Terraform to manage its infrastructure on AWS. And it puts its certificates into Amazon’s certificate management service. And then Terraform is used to describe to all of the load balancers and things like that which certificate to use. And Terraform uses something called HCL, which I think stands for HashiCorp Configuration Language–it’s basically a kind of configuration language that describes how you want all of your AWS stuff to be configured, and then you can use it to update it and things like that. So Coinbase does not directly write HCL. They allude to some transpiler that takes YAML and generates HTML or, sorry, HCL.

Tom: [00:15:05] Have you ever had a problem in infrastructure you couldn’t solve with more YAML files, Jamie?

Jamie: [00:15:10] Oh, YAML always finds its way into the mix. It has a talent for that. So in these changes, you can imagine these PRs that we mentioned earlier were each changes to the YAML. And because they were changing, as Tom mentioned, something like 700 load balancers, it sounds like each one of these pull requests had potentially hundreds of places where a YAML value was either added or changed to reflect the path to the new certificate. And so they submitted all these as pull requests. But on GitHub, when these pull requests were being reviewed, only three lines of context above and below the changed statement are shown. And what actually happened is, among those hundreds of places where this was being changed, one of those places was actually living within a block that was describing an external load balancer, not an internal load balancer. But that wasn’t really visible to the person reviewing the pull request.

So it’s a lot to process, but it looked like the right certificate landing in the right places. And so the pull request was approved without anyone really recognizing that one of the places that had been updated was actually a specification for an external load balancer. So once that was turned into HCL and once Terraform applied it, that meant Terraform went into AWS’s APIs and switched that load balancer to use the wrong certificate. And then fun happened.

Tom: [00:16:41] I feel like you could just keep asking why on this. It just goes so many levels. It goes all the way down to, as it turned out, GitHub doesn’t show you enough context around changes.

Jamie: [00:16:52] It’s amazing how many different pieces got pulled into this, right? There’s a transpiler, and then there’s Terraform and AWS’s certificate management. And then there’s the actual load balancer at the edge, and it gets all the way back to the dev workflow. We’re even implicating GitHub and the way it shows you pull requests right in this. So that’s sort of, at the end of the day, the root cause. How would you wrap this up, Tom? 

Tom: [00:17:18] I’m going to give my elevator pitch for this outage. It sounds like they pushed a bad change and had to restart everything.  Once you have a system this big, starting it up might be complicated. And you generally don’t practice just the cold restart of your whole system. So they had to make changes for the system to boot up effectively. And that took them a little while. 

Jamie: [00:17:41] Sounds about right to me. Sometimes in outages, it’s… so we’ve actually had a couple of these, right? So there’s another thematic thing for us to recognize–sometimes it’s a simple change and changing it back is simple, but then recovering from that is really complicated. And so GitHub’s outage is another good example of that–43 seconds, and then it took a long time to clean it up. And so we’ve had several outages where reverting the actual cause is identified and done quickly, but then getting the system back to a state of stability, to equilibrium takes a while. 

Tom: [00:18:29] These big complicated systems are, what do you know, complicated, and they have states that they spend 99% of their time running in. But once you get them out of that state, things get complicated fast. And it’s very easy to end up with something where it’s just, oh, look, we’ve never tested turning the whole thing off and on.

Jamie: [00:18:48] It’s like a locomotive. It runs really well when it’s on the track, but oddly, if you put it six inches off the track, it’s quite useless and very heavy to lift back up and put back on the track. Well, Tom, what would you say were sort of high points for you looking at this outage?

Tom: [00:19:12] I like having TLS between all the services. It obviously introduces issues. This outage was caused by having to deal with certificates. That’s a bit of a pain, but this is one of those things where, especially when you’re on the public cloud, it’s like an eat-your-veggies type thing. It shuts down a potential set of issues that you just don’t have to worry about anymore. And it’s just a nice thing to have, but it also signals that you’re taking security seriously because it is non-free to do this.

But if you see an organization doing this, they’re probably doing a lot of other things because it signals that they invest in security. And I would be pretty unhappy if any financial services that I was using were not using TLS everywhere they possibly could. So, I definitely liked hearing that or seeing very clear evidence that they are using TLS internally.

Jamie: [00:20:11] As you said, especially in the public cloud, right? If somebody puts an agent in the middle of two hops of your traffic to inspect it, to steal wallet information or login information to come get your wallet, you don’t have a lot of visibility into–I mean, we’re all of course trusting Amazon. But always having TLS, even if you’re racking your own, is a good idea. But in particular, when you’re running stuff on some cloud, you would have zero idea that somebody did something in a facility to intercept traffic. So using TLS is just like one extra little bit of insurance. If it’s for financial things, for sure, it’s especially important.

Tom: [00:20:54] Also, next point: they would be nuts not to have some automation here, but having your infrastructure configured as code–whether it’s via Terraform, or I personally like Pulumi a lot, since it lets me write in TypeScript how I want my infrastructure configured–having your whole system described in a way that you can code review and roll back is just essential once you get past a certain level of complexity. I would caution everybody that nobody starts with a really complicated system and says, hey, let’s do this manually. People start with a very simple system, and it changes incrementally, where no particular change is particularly complicated. But then you find out, oh, look, we’ve got dozens of things in AWS, and Alice is the only person who actually knows how to change things or remove things. We’d better keep her happy. So, just start early with Pulumi or Terraform or whatever, and you’ll be so much happier, as your company grows, that you’ve done it.

Jamie: [00:22:04] That’s true. Another thing in this that was definitely an asset for them is that their metrics were working. Not only did they have pretty good metrics–they identified really quickly that it was not errors, it was traffic dropping, when the initial incident happened, and that really helped them very quickly identify the TLS issue. And then later, the 500 errors helped them understand it was a thundering herd. We’ve had a number of incidents that we talked about where the metrics were unavailable because of a circular dependency or something. One thing to call out right out of the gate is they had the full power of their metrics available in this issue. And they seem to have pretty good, detailed metrics ready to help them find what was going wrong here.

Tom: [00:22:56] That’s just so key. They had everything they needed to know to understand this graph and someone with intuition about the system. It’s a little tricky for us because we don’t know what their system architecture looks like. And we don’t know what the diagram is, but for someone who was familiar with it, they clearly just had the graphs in front of them. They could look at it and say, oh, look, this is the problem. And they were able to very quickly get on to fixing it. So that’s a big metric success case right here.

Jamie: [00:23:25] One of the things just to point out here is that they had a sense of prioritization when their world was crumbling. So they mentioned that they focused on the core services first. They didn’t try to fix everything at once. From the language, it suggests they already had an order ready to go, so that they knew it was most critical to restore this first and then get to the other things later. And I think that helped them really focus on one hard problem instead of being a little bit blinded by so many things beeping at them at once. Well, maybe on the side of lowlights, or things that could have gone better, what sort of jumps out at you, Tom?

Tom: [00:24:22] I was joking about YAML a little bit earlier, but for some reason YAML just hurts my head. Once it grows, you get so much of it. And so much of it just gets copied and pasted and becomes boilerplate. You’ll have a file with a hundred lines in it, and one of them is really important, but the rest are just not understood. And you don’t change that, you just copy it from so-and-so. And it gets messy quickly, and it’s hard to simplify it. And you see kind of an issue here in that they didn’t have a validator or a linter or something like that where they could enforce rules like: private certs are never allowed on an external thing. It’s hard to say exactly why they weren’t able to do this, but you would certainly want some kind of system here where you could just validate that you never mix this with that.
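Coinbase hasn’t published its YAML schema, so the config shape below is invented, but a validator of the kind Tom is describing can be just a few lines run in CI, before the change ever reaches Terraform:

```python
def validate_load_balancers(config):
    # Flag any internet-facing load balancer that references an internal
    # certificate. The field names here are hypothetical, not Coinbase's.
    errors = []
    for lb in config:
        if lb["scheme"] == "internet-facing" and "internal" in lb["certificate"]:
            errors.append(f'{lb["name"]}: internal cert on an external load balancer')
    return errors

lbs = [
    {"name": "api-edge", "scheme": "internet-facing",
     "certificate": "certs/internal-2021.pem"},
    {"name": "svc-a", "scheme": "internal",
     "certificate": "certs/internal-2021.pem"},
]
errors = validate_load_balancers(lbs)  # flags only api-edge
```

A check like this turns “a human must notice one bad line among hundreds” into “the build fails.”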

Jamie: [00:25:28] Well actually, thinking back a little bit to what you said about liking Pulumi, I think we’re seeing more and more companies that are starting to do true code as configuration, because one thing you could say here is, ideally, they’d be able to leverage something like types. We run into the same issue in code: not only do you have files with 150 lines, you have files with 15,000 lines, 50,000 lines. And yet in code, we make it a somewhat manageable issue.

One of the things is that code has a lot more structure for reasoning about the stability and the semantics of what things mean. And so when we say, okay, there’s an external load balancer type, it should never be able to take an internal cert–there’s a typing kind of statement we’re making there. And actually, if we think about it from that perspective, how could we achieve something like that? We could achieve it with types. So it’s another thing to say, whether it’s via lint, as you said, or whatever, it’s another argument for going a little further and maybe saying all of us should be looking more closely at finding a way we can bring true programming languages into these things, just because they have a lot of helpers to make it so that you can’t accidentally pass an A when you’re expecting a B.

Tom: [00:26:57] We have decades of research on how you can write programs more safely with types. And you get a lot of that already with just these IAC systems–infrastructure as code systems out of the box. But you could definitely see this getting pushed a lot further where you have strong opinions about your system that, for example, you can just never attach something that lets you have external ports to a database or something like that. You could encode a lot more of these things you really need to get right and make them turn into red squiggles that you can’t even commit, much less send out for PR.
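As a sketch of what that could look like (these types are hypothetical, not any real infrastructure-as-code library): if internal and public certificates are distinct types, a checker like mypy turns the mix-up into a red squiggle before the change is even committed.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class InternalCert:
    arn: str

@dataclass(frozen=True)
class PublicCert:
    arn: str

@dataclass
class ExternalLoadBalancer:
    name: str
    cert: PublicCert  # annotating the field makes InternalCert a static type error

def make_edge_lb(name: str, cert: PublicCert) -> ExternalLoadBalancer:
    # Runtime backstop for callers not running a type checker.
    if not isinstance(cert, PublicCert):
        raise TypeError("external load balancer requires a public certificate")
    return ExternalLoadBalancer(name, cert)
```

Passing an `InternalCert` here is flagged statically by mypy or pyright, and rejected at runtime as a backstop.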

Jamie: [00:27:39] Certainly if you’re like, hey, these are the nouns Terraform knows about, Terraform might be able to say, well, here are policies I’m going to let you use that are part of the nouns that are in my world. But the beautiful thing about a programming language as you say, well, what I want to do also is layer in wrapping types and encapsulation types, stuff like that that really have to do with my company specifically, or my project specifically. And by the time you tried to make HCL or YAML capable of expressing all that, what do you know, you just reinvented a programming language, with layered encapsulation and all this kind of stuff. That’s just what programming languages do. 

So it’s a really interesting argument. I imagine this is actually a facet of another thing that didn’t go very well, which was the code review process here. So the PR only showed plus or minus like three lines. I think it’s fair to say that in a code review, the question is not whether that one line looks right; it’s whether this line is doing something correct, whether it represents a correct choice. And obviously with only plus or minus three lines of code, you can’t see the context it’s in. So you don’t really know that, yes, that string is a valid path to a certificate, but it’s not necessarily parameterizing the right object, because I can’t even see the whole object around this thing.

Tom: [00:29:08] I don’t know, man. I think once you’re reviewing–I mean, first off YAML, I swear it drops your IQ or something like that. So once you’re reviewing hundreds of changes in hundreds of YAML files manually, I think the chance of that happening without errors is virtually zero. That is just not a good way of validating these systems.

Jamie: [00:29:37] It makes sense what you’re saying. Even if I had sort of said, well, the person should have gone and looked right at the–even if GitHub didn’t show them anything beyond those three lines above and below, how could they know without going and clicking into the file and digging in? But I think it’s a good point you raised. Not only is YAML hard to understand that way, but as we know from the 700 load balancers, this change touched hundreds of places. Are we really going to have a human go look at all of those places? It’s one of those things where it’s like, code is configuration. Is there a way to not have this 700 times? It feels like another problem that programming languages can kind of help with: being better at preventing this kind of thing in general.

Tom: [00:30:28] Yeah, absolutely. I think if we’re going to talk about what could have gone better with code review here, what would have been better is to not need code review to catch these kinds of errors, because people just aren’t built for that type of monotony.

Jamie: [00:30:45] Yeah, it could be that if you had the right constructs, using something powerful like a programming language, there was actually only one thing to check, as opposed to 700 things.

Tom: [00:30:55] That would be code reviewable for sure.

Jamie: [00:30:59] Yeah. Because you have some sort of type you’ve made in the language–there’s the internal load balancer configuration type, and mixing anything else into it is just never right. And so you’re not ever putting something 700 times into a file. There’s some way that you’ve created an abstraction that explodes out into all those things, right? What else, Tom, should we talk about on this side of things?

Tom: [00:31:30] Let’s see, maybe at the risk of exposing myself as not totally on board with having a million microservices everywhere, I am really curious about this 700 load balancer number. There are some good diagrams and screenshots in the post-mortem, which I really appreciate, and it was very helpful for following what happened. But there’s no system architecture diagram, which at this level of microservices may just not even be possible to generate. I can understand why they wouldn’t want to publish that, but I’m assuming they have a ton of services inside, and every service or little group of services has a couple of machines, and every group of machines has a load balancer or multiple load balancers in front of it.

That seems how you end up with 700 load balancers, but man, that’s a lot, and that introduces problems like this, instead of just having a small number of file config files to review, you’ve got hundreds and hundreds. So that’s obviously not something you can just trivially change, but man, that is a lot of load balancers.

Jamie: [00:32:39] Yeah. I have to admit, the first time I read this, I kind of glossed over that. And then you and I were talking and you pointed out the 700 bit. That is a lot of load balancers. So yeah, it does seem almost impossible that all of those are external load balancers. Once we start talking about internal load balancers, you do wonder a little bit about what the service architecture looks like here. It sounds like potentially a lot more moving parts than ideal, because that is a lot of load balancers. So it feels like there’s a lot of things being routed between. We’re guessing a little bit here, but going based on size and complexity, it doesn’t feel like you should have 700 load balancers. Honestly, very few companies need 700 load balancers.

One other piece of this I think we were talking about a little bit is the traffic, right? So they had this approach when they wanted to recover from the thundering herd, all the bison coming–I’m going to go with bison from here on out, by the way. Whenever I think about this in my own companies, it’s going to be bison in my head. So they mention this approach where they kind of shut traffic off, and then they size up the cluster, and they turn traffic back on. And this is reading between the lines a little bit too, but that could potentially be smoother if you were able to just slowly let traffic back in. So taking advantage of things like HTTP 503 Service Unavailable–there are things like Retry-After headers and such that can help you be a good HTTP citizen in this kind of situation. But if you could drop a lot of the traffic and then slowly let it back in and let the herd do its thing, you might have a slightly smoother recovery, and maybe a little less risk of this over-provisioning-for-the-herd approach not working.
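A sketch of that gradual-readmission idea (the ramp length and bucketing scheme here are invented for illustration): admit a growing fraction of requests as the backends warm up, and shed the rest with a 503 plus a Retry-After hint.

```python
def admit_fraction(elapsed_s, ramp_s=600):
    # Ramp admission linearly from 0% to 100% over ramp_s seconds after
    # the backends come back healthy.
    return min(1.0, elapsed_s / ramp_s)

def handle(request_bucket, elapsed_s):
    # request_bucket is a stable per-request hash in [0, 999]. Requests
    # above the current admission fraction get a 503 with a Retry-After
    # hint, so well-behaved clients spread out their retries.
    if request_bucket / 1000.0 < admit_fraction(elapsed_s):
        return 200, {}
    return 503, {"Retry-After": "30"}
```

The point of the ramp is that some work always finishes, so the backlog drains instead of every request timing out together.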

Tom: [00:34:45] Yeah, because sometimes you just can’t over-provision quickly enough. Maybe the data you’d have to replicate would take too long, or maybe the type of hardware you need just isn’t available. This feels like, again, kind of a meta theme we’re seeing here, which is just knobs: the ability to adjust components of your system and scale them anywhere between zero and a hundred percent, assuming that you have tested them and they work. Those can be hugely helpful when you’re dealing with problems like this.

Jamie: [00:35:21] Yeah. And another thematic thing that we’ve talked about in the past too is, traffic layers are nice if they can do this in a way that’s kind of sticky–either via client cookies or VIPs or some sort of thing that does affinity–so that what you’re not actually doing is failing 90% of requests and only letting 10% succeed. You want 10% of clients to be able to succeed, so that at least 10% of people are having a good day instead of no one.
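One way to get that stickiness (a sketch, not anything from the post-mortem): hash each client ID into a stable bucket, so the same slice of clients is admitted on every request rather than everyone failing most of the time.

```python
import hashlib

def client_admitted(client_id, admit_pct):
    # Hash the client ID into a stable bucket from 0 to 99. Clients in
    # buckets below admit_pct always get through, so (say) 10% of users
    # get a fully working site instead of everyone failing 90% of
    # their requests.
    bucket = int(hashlib.sha256(client_id.encode()).hexdigest(), 16) % 100
    return bucket < admit_pct
```

Because the bucket depends only on the client ID, a given client sees consistent success or a consistent "come back later," which is exactly the affinity property Jamie describes.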

Tom: [00:35:50] You need work to be able to finish, because the essence of these problems is that nobody’s finishing. And so they keep starting from zero, and they’re like crawfish trying to crawl out of a bucket, pulling each other down. Nobody can actually finish. They just keep hurting each other.

Jamie: [00:36:08] How many animals are you going to drag into this, Tom, before we’re done? 

Tom: [00:36:12] I’m full of the metaphors. 

Jamie: [00:36:15] Can we somehow get the bison in a bucket? We just need a bison-sized bucket and have them do battle with the crawfish. There’s some sort of crawfish-bison pace and equivalency we have to work through here.

Tom: [00:36:27] A thundering herd of crawfish. 

Jamie: [00:36:28] Oh, that’s a great visual. I like that. We could get that illustrated. Maybe that needs to be the logo of the podcast–a thundering herd of crawfish. Cool. What do you think, Tom? That kind of sounds like a good list. 

Tom: [00:36:41] I think that’s good. I think this was a fun one to talk about, where we got away from databases for a week, at least. I’d love to do a security outage, but I would imagine a lot of companies aren’t going to publish a lot of details on those. This might be as close to a security one as we get, unless somebody sends us one that we could talk about. So yeah, this was a really interesting one. I appreciated reading about it.

Jamie: [00:37:11] Yeah. It was great to have a break from databases for a week. So to Tom’s point, if anybody listening out there has an idea about a security related one where there’s a good post mortem that was a security incident, we’d love to have a chance to talk about that and put that in the mix and see what we can learn about those kinds of issues. Thanks everybody for listening, and leave those reviews, follow us on Twitter, and until next time. We will be back to you then.


Host: [00:38:05] Thanks for listening to the Downtime Project. You can subscribe to the podcast on Apple, Spotify, or wherever you listen to your podcasts. If you’d like to read a transcript of the show or leave a comment, visit us at downtimeproject.com. You can follow us on Twitter @sevreview. And if you like the show, we’d appreciate a five-star review.
