Kinesis Hits the Thread Limit

The Downtime Project

00:00 / 44:42

During a routine addition of some servers to the Kinesis front end cluster in US-East-1 in November 2020, AWS ran into an OS limit on the max number of threads. That resulted in a multi hour outage that affected a number of other AWS servers, including ECS, EKS, Cognito, and Cloudwatch.

We probably won’t do a full episode on it, but this reminded Tom of one of his favorite historical outages: S3’s 2008 outage which combined single-bit corruption with gossip protocols. If you haven’t read it, definitely check out the post-mortem.

Tom: [00:00:00] Welcome to The Downtime Project, where we learn from the Internet’s most notable outages. With me is Jamie Turner. Before we begin, I just want to remind our listeners that these incidents are really stressful and we peel them apart to learn, not to judge. Ultimately Jamie and I have made, and we will in the future, undoubtedly, make similar mistakes on our own projects. So please view these conversations as educational, rather than a judgment of mistakes we think we would never make.

Today, we’re talking about AWS’s November 2020 Kinesis outage. Before we get into the details though, just a few quick notes. So first off, if you’re enjoying the show, we are very happy or glad you’re getting something out of it. We’ve gotten some great emails and notes from people, but if you’re really liking it, it’d be great if you share it with your friends or coworkers, because I’m sure that they would love it too. Also, if you want to follow us on Twitter, we occasionally post a few things there as well as new shows. It’s also a great way to send us requests for episodes. Every week we do this thing where we try and find a good outage and we sort of guess about what people are interested in, but if you send it to us, we will know for sure.

And finally, for me, if you want to work with me, you should come check out my startup commonroom.io or email me at tom at commonroom.io and tell me what you’re interested in doing. We’ve got a lot of roles open for engineers, and I’d love to talk to you.

Jamie: [00:01:36] And now it’s my turn to plug my startup. We also are hiring at Zerowatt. If you are excited about the serverless future and you want to help build storage and computational platforms that could power companies moving forward that are taking serverless architectures to large scale and all that kind of fun stuff, please reach out to me. We have engineering roles open, so jobs@zerowatt.io. That’s jobs@zerowatt.io.

Tom: [00:02:12] Awesome. Jamie, I would totally try and come work at your startup if I didn’t already have my own.

Jamie: [00:02:18] Thank you, Tom. And I yours.

Tom: [00:02:22] All right, we’ve talked about big companies and we’ve talked about small companies, and today we’re talking about just one of the biggest of them, AWS. It’s hard to even imagine how many servers AWS is running at this particular moment. And when you run a bunch of services and servers, things go wrong sometimes. On November 25th, 2020, AWS had a big outage in US-east-1 that was caused by Kinesis. Now I doubt you listen to this podcast if you don’t know what AWS is, but Jamie, what is Kinesis?

Jamie: [00:02:54] So Kinesis is a streaming subscription queuing service thing. So basically, you can publish events on a number of streams, and then all sorts of listeners can subscribe and follow those events and do computational processing on those events or react to those events in other ways that trigger insertions into other streams, creating all kinds of streaming architectures out of that. So those subscribers will have something like a cursor that will ensure that they follow along and miss no records. They can sort of replay history on that stream, depending on your retention policies and things like that. So it’s a lot like Kafka, if you’re familiar at all with Kafka. There’s a fair amount of overlap between the intended use cases. So if you’re queuing things up and doing stuff with them and in a kind of asynchronous follow along workloads, Kinesis may be a very useful tool for you.

Tom: [00:03:52] So our story gets started at 2:44 AM Pacific standard time when they were performing a routine addition of some capacity to the front-end Kinesis workers. So this just means they decided they need a little more capacity on the front end, so they were just adding servers to it. Jamie, do you want to talk about that a little bit?

Jamie: [00:04:14] Yeah, it sounds like from all these notes, Kinesis has this front-end layer and a back-end layer. And the front-end layer is responsible for authenticating the traffic and then routing the traffic to the appropriate back end, depending on, for example, what stream the traffic is either subscribing to or publishing to. And so this front-end layer is responsible for maintaining a kind of map of what traffic goes to, what back end. The back ends are the pieces that actually store the events on disk. The front ends make those events durable so that if they lose a machine on the back end, you don’t lose your event, and they act as the kind of service-of-record data in these streams. So, that’s the front-end job. It’s small, but it’s important.

Tom: [00:05:05] So at 3:47, about an hour after they start, the addition finishes; the front ends are all up and running. And there’s an interesting aspect of this design here–every front-end server keeps a connection open to every other front-end server. So because of the way the language is or just the way this is set up, they’re creating a thread for every other server effectively.

Jamie: [00:05:34] And it sounds like what they’re establishing here is a gossip protocol. So a gossip protocol at a high level is a stream of information that’s attempted to be shared in a kind of peer to peer basis between a number of members within a cluster. We don’t have full information on what this gossip protocol is for. But we’ll talk about a little bit more later in the podcast some ideas about what this might be for when we kind of explore the way it was built. But one thing it does mean though is that they were using this gossip protocol and it seems to have to build up some kind of stateful information in the front end about what the backend situation was. And, ultimately that seemed to be the complication here. Stateful front ends, if you can avoid them, or as we’ve talked about in the past, are best avoided.

Tom: [00:06:27] So I looked it up because I wasn’t sure, but it triggered a memory. There was an infamous S3 outage in 2008 that was caused by corruption in their gossip protocol. I don’t know if there’s enough in that for us to ever do an episode on, but if you look it up–I’ll link to it in the show notes–it was a particularly fun outage because it involved bit flips and that sort of thing.

So at 5:15, about an hour and a half after they finished, they started getting some errors on getting and putting Kinesis records. So we mentioned that each server is going to be creating a thread to every other server. They don’t give us exact numbers in the report, but they do say at one point they have many thousands of servers. This network, this gossip mesh is sort of slowly established. It doesn’t instantly, on start up, connect everything. So it makes a little bit of sense–and you’ll see why in a minute–why it took a little while before things started to go sideways. But anyway, at 5:15, they start getting some errors getting and putting Kinesis records. As they should, they immediately suspect that it was because of the new capacity, but it was a little bit confusing because it looked like the areas they were getting were unrelated. Regardless, even though they don’t have a root cause yet, they start removing the new capacity.

Jamie: [00:07:52] And you can see, this is probably the victory of the SRS over the software engineers here. “It’s almost always the thing you just changed” is a guiding principle that often will get you out of a lot of tough spots. So Tom and I were joking around a little bit about how oftentimes the software engineers say, oh, this is a problem, let’s go look at the code. And the SREs are the wise individuals who say, let’s just roll back what we just did. We don’t even understand it yet. We just need to get it to stop happening.

Tom: [00:08:26] I think that just goes to incentives. You know, everybody talks their own book.

Jamie: [00:08:31] That’s right.

Tom: [00:08:32] In the outage post-mortem, they sort of have a couple of different timelines because this ended up affecting a bunch of different services, and I’ve tried to glue them all together so we get one cohesive look. So we’re going to talk about a bunch of services in the next few minutes. But effectively at the same time the Kinesis team starts seeing these errors, a lot of other teams start seeing errors too. It starts with CloudWatch, which is how you record a bunch of your events and monitor them in AWS. It also starts to affect Lambdas, which are how you do serverless function calls on AWS. It also starts to affect ECS and EKS, which are Elastic Container Service and Elastic Kubernetes Service, I think, which are just how you run containers on AWS. So it says that provisioning new clusters was impacted, existing clusters weren’t scaling upright, and tasks weren’t being de-provisioned. So this is turning into a very big outage. I’m sure there are lots of AWS customers that don’t use Kinesis, but when you start having errors on CloudWatch, Lambdas, ECS, and EKS, the percentage of affected users is going to go up significantly.

So at 6:15, Lambda has just been buffering data at this point that it couldn’t push to CloudWatch, but now it starts to run into memory pressure. It was trying to save a bunch of stuff because it just couldn’t get it off the system, but it starts to run out of memory at 7:01. We start seeing an increase in error rates for AWS Cognito, which is a system that lets you build sign-in flows, maybe Auth0-ish, I’m not exactly sure. But those rates start going up because it’s using Kinesis for logging. You would definitely want your sign-in flows to do some good logging. So yeah, lots of people are getting woken up if they weren’t already awake and trying to figure out what to do about all this.

So 7:51, about three or four hours after this has all gotten started, they have a couple of root causes of what’s going on there. They don’t know exactly what it is, but they do know that the most likely causes would require a full restart of all of the many thousands of servers. And that is very hard. As we discussed with the Coinbase outage last week, starting up these big systems can be very tricky. One of the expensive things they have to do for a server to start up and to serve traffic is to fill out the shard map, which is partially filled out by talking to other front end servers. And so they explicitly say in the outage that the resources within a front-end server that are used to populate the shard map compete with the resources that are used to process incoming requests. If you start up too quickly, servers get marked as unhealthy because they can’t actually finish incoming requests. So they really don’t want to have to restart everything, and they want to be very sure that the problem needs a restart to fix before they do anything.

They spend a little more time trying to think of a root cause. In the meantime, they do have a config setting they can change to get more of the metadata they need from the authoritative store, rather than via the gossip protocol. But one important number that they throw out here is that they can only restart a few hundred servers per hour. Now it’s not clear if that is restart a few hundred servers per hour without errors or at all. But in either case they don’t want to do it.

So about two hours later, at 9:39, they finally have a root cause. As we said, every server has to create a connection to every other server and every connection they make is going to be on its own thread. And they have gone over an OS limit of how many threads each server can create. And so they just can’t complete the shard map. They can’t make the connection to everybody else.

Jamie: [00:12:47] This is the cousin of our friend, the file descriptor limit, which we’ve talked about a few times. On Linux, there’s definitely a few different ways you can fail to create more threads. For example, you could exhaust available stack stack size but on 64 bit systems is probably unlikely to be something related to that. But there are some explicit limits on threads within Linux. It looks like the default value is around 32 K or something like that, which is–well, we don’t know whether they’d tune that value up or down or anything–but 30 to 32,000 threads on a single system is a pretty high number of threads for sure. So it’s a little interesting as we’ll talk about a little bit more later that they sort of slowly tip their way over that limit because even 20% lower than that is still a really high number of threads to just be sitting there on a Linux system.

Tom: [00:13:45] Yup. So changing these limits is pretty easy. I mean, you’re just updating some config file or running a command, but wisely, they didn’t want to change this on their many thousands of servers without testing it. But they also knew that since the first step they’d taken was removing the servers to get it back into a known good state at this point, with the root cause and knowing what was going on, they had confidence that the restart would work. And so they decided to pull the trigger and do this very, very expensive thing.

So at 10:07 AM, they started the restarts. We’re about five hours into the errors at this point. All the other teams affected by this are highly caffeinated and working their problems as well.

At 10:15, the Cognito team has a change out to let them not use Kinesis. So their error rates start falling. 20 minutes later, the Lambda team does something that’s not exactly clear to fix their memory pressure, which solves its error rates. By noon, two hours after restarts, the error rates start to drop. I wish they had some more details about how exactly they did the restart, but I guess they had to do a rolling one and just do batches at a time. It seems that way.

Jamie: [00:15:06] Yeah.

Tom: [00:15:08] So you have 4:15 PM. They’ve got the majority of issues with ECS and EKS resolved, which is about 11 hours after the error started. At 5:47 PM, the CloudWatch errors finally start going down, but not fully, but then at 10:23 PM–12 hours after they started the restarts they had everything done. And then 10 minutes later, CloudWatch is fully recovered.

Oh man. It didn’t fit in the timeline, but this is becoming so classic now. They had the issue of not being able to update their status page in the way they typically do it because the tool that posts the updates uses Cognito for people to sign in, which is… eat your own dog food for sure… but that causes problems with this. Because they’re Amazon and they know what they’re doing, they did have another way of updating the status page. But the people working in the outage just weren’t as familiar with it, and so it took them a little bit longer to do it. But yeah, I did smile when I saw this because it’s a pretty common pattern at this point.

Jamie: [00:16:28] Yeah. The status pages of the world–who knew it’s the hardest service to keep up? All right. Let me give this a shot trying to summarize the proceedings here.

So the way I would sum up this outage, after listening to that timeline, walking through it with you, is they rolled out an expansion of their front-end cluster. A characteristic of this cluster is that every node in it needed to connect to every other node. And that characteristic meant that the cluster was creating too many threads because it was using a thread for each of these connections. And when that happened, presumably the exceptions being raised under process were being killed, and things were unhappy in general. And then in order to fix this problem, they shrunk their cluster back down again, but then restarting that smaller cluster to make it healthy without getting either run over by the thundering crawfish or whatever took many, many, many hours. And during that time, there were a lot of services that happen to use Kinesis that unfortunately were also suffering. In many cases it’s because they were attempting to log using Kinesis. And then finally, after a very, very long day, they had everything restored back to the previous state.

Tom: [00:17:51] Yeah. I think that that does sum it all up. Definitely a long day and wow. Yeah, just the problems you have, and this is kind of one of the ultimate champagne problems. It’s like, geez, I had many thousands of servers in my service and it takes a while to restart, but it’s still tough. Jamie, what do you think went well for Amazon during this outage?

Jamie: [00:18:17] They definitely have some very incisive outcome action items to follow up with here. So if you read their post-mortem, there is good stuff about increasing the size of the… it says increasing to bigger hardware.

And as they say, they don’t need as many threads, right? So what we can assume from that is if they make the machines bigger, each machine can handle more traffic. And therefore that kind of problem is not as significant. They’re adding alerting for thread usage, which is a great idea–increasing the thread limit on the machine. I don’t know if that’s a good idea or not, but there’s a long list of things where they’re going to try to make things more robust. They’re going to make sure that things like CloudWatch and dependent services are a little bit more robust against failures here by, for example, using a separate partition to the front-end fleet. There’s a good list of follow up items that are fairly thorough about what they’re going to do about this. To prevent this from happening again is certainly one highlight. Even though we did chuckle, of course, about the now infamous status page, unable-to-update circular dependency problem, they were armed with a backup way to update the status page.They had already prepared kind of a rainy day, what-if-we-can’t-use-Kinesis status page. And even though it took them a little bit of wrangling to figure out what that backup way was, they had already prepared that. And so therefore, there was a mechanism they could use to update it, even when the Kinesis was unable.

Tom: [00:20:10] And I guess I liked that they really knew the limitations of their service, and they were very deliberate about that. They could definitely have just started doing things more aggressively just to fix the problem. It’s hard to just completely sit on your hands sometimes when the whole world is down, but they were wise and they waited until they really knew that they were going to fix the problem before they started making anything worse.

Jamie: [00:20:42] Certainly things like the need to slowly roll, which we’re going to talk a little more about and who knows, right? What drives that? What drives that need, but if that is true, then I guess it’s a great thing. They didn’t just unleash the beast and try to really quickly kill everything and bring it back up. So they understood already what it looks like to try to keep this thing running well and to recover.

Tom: [00:21:11] All right. So what do you think could have gone a little bit better for them?

Jamie: [00:21:15] Well, I hate to harp on it, but that darn status dashboard. So it really is splitting the atom here, keeping a status page online. If we looked over the recent outages that we’ve discussed, having some sort of way, independent of your own systems, to tell your customers what’s going on, just feels super important, and even a company as sophisticated as Amazon–their status page suffered the same fate as some of our smaller companies that are not being as clever about how they run things.

Tom: [00:21:57] Maybe we should switch to Twitter as the primary status page and then use other pages as the backup.

Jamie: [00:22:04] That’s right. Isn’t it kind of really already?

Tom: [00:22:09] Twitter will let you know if Amazon is down for sure.

Jamie: [00:22:12] That’s right. But what does Twitter do? This is like turtles all the way down. Another thing is look at that gap, and I think it was almost two hours between when stuff started kind of going wrong and when they had definitively identified the root cause.

Tom: I think it was closer to four hours actually.

Jamie: Total in four hours. They sort of mentioned a couple of candidates, and then two hours later they say they have it. But you’re right. It’s a total of four hours. Exhausting threads is not the most exotic way to fail actually. And so it feels like they didn’t have great visibility into system level errors like this, like thread counts. If we probably watched the graphs of these machines and just looked at the thread counts, we would see them all rising until they hit a flat line and then they would suddenly careen when the processes were killed.

Tom: [00:23:19] So that flat line would probably be at some kind of magic number.

Jamie: [00:23:22] It’d be at a magic power-of-two number. So they didn’t have great visibility into what was actually going on here, and it took them quite a while to actually find the root cause.

Tom: [00:23:32] I think one principle I try to follow is, when you’re talking to an OS–creating a thread is fundamentally an OS thing–is just log really loudly if you get back an error that you’re not expecting because that might be how you find out that your hard disk is corrupted. If you get some just really wacky error back from doing IO or something, and this is a case where they were getting the exact same error on hundreds of servers at the same time. And just having a tool that would let you see all of these servers are all reporting this weird … what error code would you get back for this? Probably something dumb, like EACCES… but you’re getting the same one back from all of them. And it would be one you hadn’t seen before.

Jamie: [00:24:24] That would probably be like EAGAIN or something, just try again. Maybe we’ll let you have the thread next time. Just give it another shot after restarting your process. Oh, one thing that is kind of interesting and this is a pretty off the cuff idea–but you think about things like strace and stuff like that, right? So if you haven’t ever used strace, it is an awesome tool that on Linux, you can just attach to a process and watch the sys calls go by, which is really fun. So it’s a great way to find problems, especially obviously related to the kernel of the system or IO or anything like that.

It’d be really interesting to have something that was monitoring at kind of the S trace layer and

when suddenly a new kind of errno was raising its head that was unlike the normal kinds of errno would just be to flag that like, we normally have connections dropped and all this kind of stuff. But that errno, you don’t normally see. So that’d be pretty fascinating actually to think about if there’s some way to have some automation that is looking at the distribution of errno across the cluster. And when it suddenly changes, it points that out too. Kernels can sort of screw you over, and that’s an interesting way to think about just servicing anything that suddenly starts happening that’s really different.

Tom: [00:25:56] To some extent when, and this is kind of getting into a theme that we’re going to talk about a little bit more, which is that AWS is not you and me. They operate at a different scale. And so it’s always a little questionable when you start pulling design lessons from them or suggesting designs. But at this scale, I almost think kernel issues are almost the same as hardware issues. Like you’re just going to have some number of them, and you just need to be robust even for stuff that you’ve never seen before and you’re never going to see again, right.

Jamie: [00:26:33] But even just an alerting, even a diagnosing thing. Like you and I, when we’re talking about them, we’ll often say, but get product-market fit first. At Amazon, I’m sure there are teams working on interesting stuff like this and even more. So it’s kind of neat to think about at this scale because if we all talk about the impact of something like this, it’s tremendous. A lot of the internet doesn’t work. So there’s almost not too much money Amazon could spend on trying to be better at this. That could end up being really exciting for engineering teams because you can really try to innovate on ways to be excellent at reliability.

Tom: [00:27:19] Yup.

Jamie: [00:27:22] What else, Tom?

Tom: [00:27:24] I think we should talk about this gossip protocol design a little bit. I mean, in some ways, this is a classic issue where you build a thing and it works great and it just ends up exceeding so far beyond your expectations. Typically a design will only scale a certain number of orders of magnitude before things have to change a little bit. This is I think a good example of that where almost certainly when they wrote the original gossip protocol, they didn’t think, oh, maybe one day we’re going to have–and this is a total guess–16,000 servers. One thread per server is going to be a really bad idea.

Jamie: [00:28:08] There’s definitely some speculating here on the gossip protocol because of just a combination of past experiences and some stuff. We sort of spent some time reading a Hacker News thread where some other people chipped in. Because one of the questions that comes out when you read this post-mortem is what was this protocol for? Because there’s some hints about what it might be, but it’s not completely clear. The post-mortem states that they build this shard map for the front ends out of information from a few different places, right? One is like a configuration system, which I’m just going to kind of ignore because it’s probably a little more static. The second one is like they pull stuff out of Dynamo, which I would guess is kind of the authoritative place that sort of says here is how you map channels or streams or whatever to backend clusters.

But then the third one is this gossip protocol. And so one of the ways right now in the speculation side, in which sometimes these things are employed, is to provide hints, basically, so that you can always go back to the authoritative and say here’s the whole map and I know this to be a stable snapshot or whatever. But if the peers communicate to each other about–they could communicate about a couple of different kinds of information. Like one is they could propagate map updates more eagerly so that not everybody has to let go ask authoritative all the time to stay up to date.

And the second thing is a little different, which is health information. So even if I say these four servers are authoritative for this set of channels or whatever, I don’t know if all those servers are online and alive or if one is being repaired or restarted or whatever. And so there’s also health information. And reading a little bit about some of the architecture here, at least a little bit that we know about it, it sounds a little bit like Amazon might have a kind of RPC layer, which is a little bit more general, which in general shares health information about if there’s a set of hosts that you want to connect to provide a certain service even for a certain partition. Like these ones are currently online and known good. And that way you don’t cause cascading delays or something in the system by people trying to connect to hosts, which are currently overloaded or unresponsive or whatever. So it may be stuff like that that this gossip protocol was passing around. But there’s a little bit of guessing we’re doing there.

Tom: [00:30:39] It says that–they use the term shard map a lot. There’s a ton of these shards, and I guess streams belong to a shard. And then they have what they say are backend clusters. So one particular cluster will own a bunch of shards. And so each backend cluster is responsible for keeping all the bits in the right order. And having just, I assume, one primary that’s being written to. So yeah, I could totally see there’s the authoritative store of which shards belong to which clusters. But then cluster membership, like which servers are in there, which are healthy–that might be a little more fluid and passed around via the gossip protocol.

Jamie: [00:31:29] But it’s the way, as we talked about, Tom, the way it’s implemented here is kind of fragile. So, you know, there’s a number of things, right? Like thread per connection. I mean, gossip protocols could communicate via more like loosely connected mesh potentially so that you don’t need N by N connections. If you’re going to have N by N connections, thread per connection is not the right design basically ever, especially once the node counts get beyond a very small number. So using something that was under the covers using epoll, whatever, like I know they’re in Java. Java has NIO, which could help with stuff like that. So they probably would need to move to some evented async core to manage this or just find ways to have fewer connections. So by coming up with some other architecture that attempts to propagate, saturate the network with new messages without requiring this always on, fully interconnected. You could always just use UDP so that you don’t have stateful connections. There’s a bunch of choices you could potentially do if you want something like this that’s not like we’re going to have a TCP connection and we’re going to lock a pthread that’s going to be mostly idle between every single host and every other host in the cluster.

Tom: [00:32:52] So if you haven’t read about gossip protocols, you should go read them because it is a cool idea. And it’s definitely a fun thing, but I am confused about the N by N aspect of this because it seems like one of the neat things about gossip protocols in general is that you can get away with far less connections. Just statistically, if every server connects to 50 other servers just kind of randomly, you can propagate messages. And I forget the math, but in a pretty small number of hops, you can get one message throughout the whole cluster. And this was how early P2P networks worked way back in the day. This N square scaling factor. There’s a couple bad tastes that are all tasting bad together here, which is a stateful front end where that state has an N squared scaling factor.

Jamie: [00:33:49] It may likely be a system that is kind of old and venerable, and folks know it has limitations. And it’s long been on the chopping block to be replaced by something better and, or I’m guessing, a company the size of Amazon, they have systems that do this exact thing more efficiently that just kind of came along for the ride on maybe newer services than Kinesis and stuff like that. There’s probably some backlog items somewhere to backport those into. And oftentimes that’s the way these things break down. It’s like you have this early design, and everybody knew it was a little breaky and a little fragile, but it just never got prioritized to go back and make it more robust.

Tom: [00:34:32] And I think one of the action items they actually had explains maybe why they had never gone back and revisited this, which is they wanted to–I think the language was something like–”accelerate their efforts to cellularize the front ends”. I think what they’re talking about there is just the general idea that when you deploy a service to a cluster, you should deploy… like the way you work with things should be in a block of servers. You shouldn’t have one big cluster of 10,000 servers. You should have five each with 2000. And when you add capacity, you add one whole cell, and that is just the unit you validate things at. And those numbers are all just completely made up. I have no idea. But that way you don’t ever get into an unexpected state because everything is isolated into one particular cell and it’s fully tested and you can validate it and know that it works. And so if there was a plan to move to this, I could see why they would just say, oh, who cares about N squared once we cellularlize that it’s going to go away.

Jamie: [00:35:34] Once we have this radically different architecture that will. I’ve worked on some systems that were cellular, I guess, that had cells. And it’s a good design. You do still have a routing problem that you just displaced in front of that thing, but it can simplify your routing problem just because the kind of cluster unit can reason about a set of things in a smaller scope. And you don’t sort of have this single massive scope with the kind of sophisticated peer behaviors happening.

Tom: [00:36:11] Just round robin. You’re done. It’s just round robin.

Jamie: [00:36:14] That works for stateless things. Tom, not for stateful things. What else?

Tom: [00:36:23] This is a weird thing to say. So let me try and couch this pretty heavily. Most people building services are not building them at the Amazon scale. And there is so much crazy stuff that happens when you get just orders of magnitude beyond the typical service that there could be reasons for doing all this stuff that are far beyond my comprehension. It’s very weird to criticize how AWS does stuff too much. So think about these lessons more like lessons for people that are running more typical services. AWS, I’m sure they have good reasons for a lot of stuff they do, or maybe they don’t. But a good principle you should follow when you’re building stuff is you want to build stuff that you can restart pretty easily. That’s one of the general rules for why stateless servers are good. Now I can give you a long list of really fascinating reasons why stuff might be hard to restart that you might not be able to avoid in your design. But it’s a bit of a design smell if you say, oh, look, we can’t restart this because it’s gonna take six hours or 12 hours to restart the cluster like that. That’s something, if you can avoid that at design time, that’s a good idea.

Jamie: [00:37:54] Yeah. And I agree with you, and I think 99.9% of the time you can, which is what’s wonderful. Most of us don’t don’t have these situations where things have to roll slowly or whatever for resource balancing or various kinds of reasons. You’re right. There are some fun and interesting reasons why sometimes…. We have no idea what set of those or what other thing would be applicable to why in Amazon’s case this service took so long to roll, but that 99.9% of the time, those should not apply to the rest of our services.

So make it restart fast. You’re going to make your life so much easier if you can just make it really easy to just kill it and bring it back up. Even if you prefer to slowly restart it because it provides less traffic disruption in the common rollout case when you’re having an outage, the freedom to be able to really quickly slam it down and slam it up if you’re down anyway is essential. Like sometimes it’s like we’re down anyway, so a slight increase in error rates is not the problem we’re trying to avoid. We’re trying to get the whole system back online, and you want to be able to restart it quickly if you can.

Tom: [00:39:05] There are so many reasons to start adding state like, oh, let’s put a cache in here and we can make things faster. It takes a little bit of discipline to keep stuff stateless, but outages like this are some of the reasons why I think that in some ways it’s technical debt. It’s just going to come back and bite at some point later.

Jamie: [00:39:31] Yup. For sure. I’ll share just for the fun of it. And you can share one too, Tom. Here’s one obscure reason why you have to restart things slowly. So I worked on storage systems a lot. And one of the things that can happen in a storage system is let’s say you just lose power to an entire data center, right? You’re right, there should be generators. But believe it or not, sometimes that they don’t work and you sometimes do lose an entire room at once or whatever. And then all of a sudden, the power comes back on, and hundreds of petabytes of capacity just spin up all at once. One of the fun things is that for places like Amazon or places like big storage clusters, there’s a financial incentive to run those things, the steady state load to be a fairly reasonable percentage of the total throughput of the system.And when you have everything turned on at once, there can be slight anomalous behaviors that related to start up that actually oversaturate certain shared resources or whatever.

A good example is if all of the storage machines turned back on at the same time, what they can do is some will start up faster than other ones. And then what you don’t want is like automation to kick in and be like, oh my gosh, I have to start repairing everything because some of the servers are down so that means like some of the data’s underreplicated. And then once it starts trying to repair everything, it over saturates the network, trying to push bytes around. And then the control messages that are trying to say, “the service is actually online, you don’t have to repair anymore” start failing. So there’s ways around all this. Of course you can throttle things, but they’re really hard to get right. And so that’s one example of large scales when you don’t have excess capacity in the network necessarily. Or you have just enough to be able to do a typical repair load or maybe even a slightly atypical one. So things like that are one example of when these things can be interesting.

Tom: [00:41:27] Yeah, I think network saturation is another good one too. If you’re starting things up, you restart a bunch of front ends all at once, and the first machine that comes up just gets completely attacked. You can even have problems where you just have too much traffic going through your NIC or through your router and that causes problems if it can’t be spread around to start with. Probably my favorite problem happened back…. This was a long time ago back when all of the machines had these heavy spinning hard drives. It takes a lot of power to spin up a bunch of hard drives at once. And I had this rack of… these are all 2U machines. And I was in a hurry, and I needed to just actually physically power cycle them all. And so I just start top to bottom. They might have already been off, but I was trying to get them on. And I just went top to bottom and started flipping these big, heavy switches. Click, click, click, click, click. And all of a sudden, sparks started flying. I had somehow flipped a breaker or completely got something out of the way it should have been just because all the hard drives were trying to spin up at the same time. And it just drew too much power. So that was one of my scarier days in a data center.

Jamie: [00:42:48] My favorite version of this I heard, which was not me, but it was really super interesting to hear about was someone else I talked to that had done big storage clusters. And I think it was an older system. But at the time, they had a whole lot of disks mounted in a rack that was not super sturdy. And they noticed whenever they turned the whole rack off and on at the same time, they’d get a much higher likelihood of head crashes for like the first half an hour. And eventually they figured out that when all the disks spun up at the same time, they would vibrate the rack a little bit and actually cause some problems. So I thought that was another really interesting one. It’s in certain really exotic circumstances that you have to be kind of careful about being able to just bring a thing all up at once. But those circumstances are exotic, right? So yeah, as much as you can get away with it–and the good news is you can get away with it–almost every single time, design things that can just come down and come up as quickly as possible.

Tom: [00:43:54] All right, everybody. Well that’s it for this one. Thank you so much for listening. Until next time. And again, if you like to show, share it with your coworkers or your friends, and follow us on Twitter.

Jamie: [00:44:11] Thanks for listening to the Downtime Project. You can subscribe to the podcast on Apple, Spotify, or wherever you listen to your podcasts. If you’d like to read a transcript of the show or leave a comment, visit us at downtimeproject.com. You can follow us on Twitter at sevreview. And if you liked the show, we’d appreciate a five-star review.

Published

May 25, 2021

twk in | May 25, 2021

Published

May 25, 2021

Cancel Reply

Write a Comment