
Slack vs TGWs

The Downtime Project

Slack was down for about 1.5 hours on the first day everyone was back in their (virtual) office in 2021, Jan 4th. Fortunately for everyone, they published a great post-mortem on their blog about what happened. Listen to Tom and Jamie walk through the timeline, complain about Linux’s default file descriptor limit, and talk about some lessons learned.

Jamie: [00:00:00] Welcome to The Downtime Project, where we learn from the Internet’s most notable outages. I’m Jamie Turner and with me is Tom Kleinpeter. Before we begin, we want to remind the listeners that these incidents are stressful and we peel them apart to learn, not to judge. Ultimately, Tom and I have made, and in the future unfortunately will make, similar mistakes on our own projects. So please view these conversations as educational and constructive, rather than a judgment of mistakes we think we would never make. So with that out of the way, maybe what we’re going to start with here today is we’re going to dig through the post-mortem and walk through the timeline as we have it from the company.

So Tom, you want to kind of run us through it?

Tom: [00:01:04] Yeah, let’s get started. So Slack wrote a great post-mortem about this outage and they put it up on their blog. I highly recommend everybody read it. The big picture is that on January 4th, 2021, the first working day of the year in America and a lot of other places, Slack was pretty much completely down for about an hour and a half. And it was degraded for some time before and after that. They have a good timeline and a lot of details in the blog post. Let’s just talk through that and see what happened.

Slack runs in AWS and they started losing some packets around 6:00 AM Pacific, 9:00 AM Eastern. Around 6:45, they get a page from an external monitoring service and error rates start going up. So they start their incident process. They have got a link to an overview of their incident process, which is also worth reading. So one of the first casualties of the outage, which is just classic, is their dashboarding and alerting service. You know, during an outage like this, everybody is looking at the dashboard and that is one of the first things to go away, which you can just imagine all the stress levels would go up at that point.

So they rolled back some changes, which is also very standard, and that doesn’t help at all. And then they get hit by a wave of traffic at 7:00 AM Pacific, which is expected: at that time Central and Eastern are going to be fully awake. An interesting detail, quoting from the post, is that Slack has a traffic pattern of mini peaks at the top of each hour and half hour, as reminders and other kinds of automation trigger and send messages.

And that wave of traffic takes the service down completely. What they say is a mini peak at 7:00 AM Pacific, combined with the underlying network problems, led to saturation of our web tier. As load increased, so did the widespread packet loss. The increased packet loss led to much higher latency for calls from the web tier to its backends, which saturated system resources in our web tier. Slack became unavailable.

So the network problems caused a lot of their Apache threads to go idle. And things really start to go sideways when their autoscaler deprovisions a chunk of their web tier. They say many of the incident responders on our on-call had their SSH sessions ended abruptly as the instances they were working on were deprovisioned. You can really just imagine the screams, as you’re logged into this box trying to figure out what’s going on. Suddenly the box just disconnects and you’re done. And, you know, what’s happening there is not a huge surprise. You know, if something has slowed down, all the worker threads aren’t gonna be able to get as much work done. So they’re going to go idle. Your autoscaler is probably looking at CPU and it responded by just saying, hey, we don’t need as many servers anymore. So let’s free up some, save some money. So the autoscaler also had a check about thread utilization, which is really good. And so a little bit after it deprovisioned a bunch of servers, it detected the thread utilization was too high. So it tried to add about 1200 servers in a short amount of time.

Now they don’t say it explicitly, but if you’ve run any services on Linux, you know that 1024 is sort of a magic number, and that that’s the default number of open file descriptors a process can have. And trying to launch 1200 servers, you can sort of see what I might be headed towards here. You know, I definitely can imagine the provision service is going to have to keep at least one handle open for every server that’s coming up. And so if you’re trying to do them all at once, that’s going to put you over the limit.

And so at some point during this scale-up they ran into the open file limit, which broke the provisioning service and ended up taking them about an hour to fix, which brings us up to about 8:15 or so. So by 9:15, they finally get enough servers up and running that the service is degraded rather than down: things are slow, but people can still send messages and it’s not just completely out anymore. About this time, AWS shows up and figures out, or I’m sure AWS was already working on this, but at this point AWS has figured out the root cause, which is that the transit gateways (the TGWs) that Slack was using in their accounts had gotten overloaded.

So just some quick background on this. If you’re running a Slack-scale org in AWS, or even a much smaller one, you’re probably gonna want to split things up into multiple accounts and multiple VPCs (virtual private clouds). These are ways that you can segment resources in AWS so that things can’t see each other.

So not being able to see each other is a feature a lot of the time, but occasionally you do need to have a set of resources in one VPC talk to resources in another VPC. And the way you make that work is with an AWS transit gateway. So these normally just work, but they do become a dependency, just like anything else in your system.

And there was apparently a problem with these that took down Slack. So going back to what we talked about at the beginning, that’s also why the monitoring services went down. There’s a quote that “our dashboarding and alerting services failed because they were running in a different VPC from their backend services”, thus creating a dependency on the TGWs.

So they wrap up with some comments about the first Monday of the year being a particular scaling challenge. You know, the clients (the Slack clients) are all going to have cold caches. And when they start up, they’re going to make heavier requests. You can imagine people who’ve had their laptops and desktops off, so they’re going to start them up, and that’s generally just going to put more load on the back end than just keeping a client running. So the TGWs, the gateways, are supposed to scale up automatically, but something didn’t work. AWS had already caught this with their own monitoring, apparently. And so by 10:40 AM, they had the fix rolled out everywhere, and Slack was able to deprovision the extra servers.

Everything was back to normal. So, yeah, a really stressful couple of hours, but Slack got it back up and running.

Jamie: [00:07:16] Yeah, for sure. And I think for those of us that have run services, there’s a lot of things in here we’ll see and kind of recognize, and feel kind of viscerally the pain that team was probably going through during this thing.

So maybe one thing I’ll try to do real quick, Tom, is just summarize the timeline here to make sure I understood it, and we can do it together. So the high-level version sounds like: things started going down around seven. They have these kind of spikes on the hour and half-hour that are probably synchronized with some sort of automation that’s running, so not too surprising it happened around that time. Understanding why things were going wrong was difficult because the underlying causes were network issues, and so that made diagnosing more difficult. The autoscaler sort of down-and-up pattern, as I understand it, sounds like: because the network was choked, there were essentially too many idle machines, so the autoscaler threw a bunch of machines away and then really quickly realized that was a mistake, because they actually were needed for the load that the system was going to need to handle.

But spinning those machines back up ran into some sort of system limits on the machines when they tried to provision so many, so quickly. And then eventually AWS helped them recognize the fact that the root cause was related to these transit gateways between these VPCs. They eventually got everything back up, and then slowly error rates dropped and they were healthy again. So does that sound about right, like the timeline to you? Did I get it right? Okay, great. Yeah. I agree with you that Slack’s post-mortem here is awesome.

Like it’s very detailed. I think it’s very frank about things that went well and things that didn’t as they went, and there’s a lot to learn from it. And I think, for us, and I’m sure for a lot of the audience who have also dealt with things like this, there’s also a lot of shivering to be had, because you remember times you had similar issues happen.

Cool. So, all right, well maybe what we should move on to next is to dig in a little bit more, with some editorializing about observations we have about this timeline, starting with things that went well: things where it feels like, in the process of diagnosing and remediating, the team did an excellent job.

So,  should we start there? 

Tom: [00:09:46] Yeah, that sounds great. 

Jamie: [00:09:48] Cool.  So, what jumped out at you, Tom?  Is there something where you think “oh, it’s awesome they had that at hand, because that seemed like an asset to them in this process.” 

Tom: [00:09:59] Yeah. There are a couple things. First though, I want to say that working on an outage like this without dashboards is really hard. I mean, that is expert mode, for sure. You know, you can get intuition about how a system’s working, but without being able to see the dashboards, particularly for an outage like this, where you’re not getting specific errors, things are just slowing down... I mean, the way you typically diagnose that something has slowed down is you look at the graphs and you see what looks different, what doesn’t match what you typically see. And so that is tough.

Tom: [00:10:40] Another cool thing that I like to see is they mentioned in the blog post that they used panic mode in Envoy too. Do you want to explain what Envoy is?

Jamie: [00:10:49] Yeah, sure. Envoy is a project I have some familiarity with. It’s an HTTP proxy and can act as a kind of a traffic layer. And so, a lot of times what that means is an entity that is, you know, mostly about routing traffic and making decisions about keeping or dropping traffic, or where to send it within the service mesh behind it.

And so a lot of times the traffic layers end up being critical components of solving problems, especially problems that have to do with load, because they sit where the load originates, before it gets deeper into your network. And so if you’re trying to control something, using something like Envoy, using the traffic layer, it ends up being like the number one tool you’ll often use to get it under control, just because you can terminate the traffic when it’s the cheapest to make a decision about it, right?

Tom: [00:11:45] Yeah. Yeah. So something like that, and this is simplifying things, is going to have a list of servers it’s going to send traffic to. A connection will come into Envoy and it’s going to route it to some other server that’s going to perform a much more expensive operation. And, you know, Envoy will have something like health checks, where if one of the backend services is down it knows not to send traffic there anymore. But sometimes, in an outage like this, things can get confused and it can look like lots of your servers are down when maybe just your health checks themselves are broken. And so it doesn’t really do any good if Envoy just starts focusing all of the traffic on fewer and fewer of your servers, ’cause then it’s just going to DoS them and take them all down.

And so they were able to use a feature called panic mode in Envoy that basically says to ignore the health checks and just spray the traffic over all the servers it knows about. And sure, some of those requests will fail, because some of those servers might be down, but you’re going to end up serving more traffic successfully than you might otherwise.

So that was good to have that set up. 
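The panic-mode behavior Tom describes can be sketched in a few lines. This is a simplified illustration, not Envoy’s actual implementation: the `Host` type, the function names, and the selection logic are all hypothetical, and the 50% threshold is an assumption meant to echo Envoy’s configurable panic threshold.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Host:
    name: str
    healthy: bool  # result of the most recent health check


def eligible_hosts(hosts, panic_threshold=0.5):
    """Return the hosts that may receive traffic.

    Normally only healthy hosts are eligible. If the healthy fraction
    falls below panic_threshold, assume the health data itself may be
    wrong and spread traffic over every known host instead.
    """
    if not hosts:
        return []
    healthy = [h for h in hosts if h.healthy]
    if len(healthy) / len(hosts) < panic_threshold:
        return list(hosts)  # panic mode: ignore health checks
    return healthy


def pick_host(hosts, panic_threshold=0.5, rng=random):
    """Pick one eligible host at random, or None if there are no hosts."""
    candidates = eligible_hosts(hosts, panic_threshold)
    return rng.choice(candidates) if candidates else None
```

With one healthy host out of four, this sketch sends traffic to all four rather than concentrating everything on the single survivor, which is the trade-off described above: some requests fail, but the remaining healthy capacity is not overwhelmed.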

Jamie: [00:12:44] Yeah. I think this is a theme that I’m sure will come up in a lot of things we talk about, but the existence of a panic mode in Envoy is a good flag that panic modes in general are a good thing. It’s not only in these tools you use: the set of decisions you’ll make when things are going really wrong may actually be pretty different than when things are in steady state. And so having pieces of your infrastructure recognize a kind of panic mode is great. And in fact, even better is if they can share the concept of a panic mode as much as possible, because there’s a good chance that when one of them is in panic mode, there are other reactive pieces of your infrastructure you also would like to be making slightly different decisions.

Tom: [00:13:40] Yeah. And I love the idea of an explicit panic mode, just kind of in general, because I mean, one thing that I found true is that my IQ goes down a lot when there’s a big outage and the adrenaline gets cranked up. It’s nice to have thought ahead of time about what I’m going to do in that situation. And, you know, it’s even better if those thoughts are encoded into software and config that I can just easily enable, rather than having to try and write some code or patch something live.

Jamie: [00:14:08] And actually some of the characteristics of that panic mode are usually going to be things like a cleverness reduction mode, right? So in your steady state, you might be making all kinds of decisions for optimization or for cost purposes or for performance. And one of the characteristics of panic mode is to sort of revert the system to the dumbest possible version of itself, so that you’re not chasing automation around, because automation often reacts crazily to unusual circumstances. Even Envoy’s behavior here is a good example of that, right? It shifts into a dumber mode when you say so, and that makes it easier for you to diagnose what’s going wrong, because you’re isolating this other automation, which might itself be reacting to the anomalous circumstances that your system is in. Another thing you kind of mentioned is your IQ dropping during an outage. I think it’s useful to make sure entering panic mode, in most circumstances, is a decision a human makes and not something you try to also automate, because you sort of create this meta problem: the system has to understand its degraded state well enough to reliably put itself into panic mode. So, you know, the humans know that they’re freaking out right now because something bad is happening. And so having a way that the humans can say “system, immediately get pessimistic and get as dumb as possible” is really valuable, as is trying to thread that idea through as many pieces of infrastructure as you can.

Tom: [00:15:50] Yep. Let’s see. Another thing that jumped out to me is that they mentioned the engineers were still able to run queries on metrics manually, which, you know, I suspect it was a subset of the engineers on the team that could do that. Like there always seem to be a few people that really know the stack in and out and can actually go, you know, grab a shell and just start querying whatever by hand. And so I’m glad to see those folks were able to do that. That is hopefully knowledge that they can spread around, because that’s a good trick.

Jamie: [00:16:23] Yeah, this is kind of the break-glass type stuff, right? It’s like, okay, if things go bad, is there a way we can do it the old way or the slow way or the hard way, but at least we can do it? And do people still know how? ’Cause oftentimes systems evolve where the old way, the dumb way, the slow way used to be the only way to do these things. And then people build these amazing macro layers, but those are often the first things to go wrong if circumstances get really bad. So you maintain that cultural muscle of knowing how to do it the manual way, in case you have to. So it’s great to hear that they still knew how to do that. They had maintained that muscle.

Tom: [00:17:04] And I guess the last thing I’d point out is, it’s nice to see that AWS had found the error independently and was working on a fix. I’m sure Slack, you know, was on the phone or Slack or whatever with AWS telling them what they were seeing. But it did say in the post that AWS had found the error, I think independently, and had started working on a fix. And so that was the real resolution to the problem. So, you know, if you’re running really complicated infrastructure, yeah, AWS can be expensive, but you get stuff for free. You get people with pagers who are going to be monitoring your network and looking for problems with the infra.

Jamie: [00:17:41] So yeah, AWS gets their share of credit for this, for their customer service. And a portion of credit, I think, also goes to Slack, and the lesson maybe all of us can draw from it is building a relationship with your cloud provider. If the first time you really need them is when you need them five minutes ago, it’s tough, right? So getting a little more of an operational relationship built, to the extent you can, with your cloud provider is important, because there are some times, especially with things like the network, that only they have the data necessary to fix it, with you and for you. So it’s great that Slack had that kind of rapport built with AWS to get them collaborating as early as they did.

Tom: [00:18:32] Yeah. There are so many things that look like technical problems, but are actually relationship problems. So I think that’s a good call out. If you’re spending a lot of money on AWS, make sure you know somebody there and make sure that, you know, they like you.

Jamie: [00:18:45] That’s right. Good advice in general. Yeah. Technical problems versus relationship problems. Cool. Makes sense. Well, maybe we should dig a little bit now into some things where it feels in retrospect, and again, hindsight is 20/20, like they could have gone better. Right? Some things that were problematic, where we have some ideas from things we’ve seen, or probably things that Slack themselves have now reflected on in their own digging through, that might’ve led to faster remediation, for example.

Tom: [00:19:20] And just to be a hundred percent clear, this is not a criticism. This is not, you know, any sort of knock on Slack. This stuff happens, and the way you make it happen less in the future is you look at what went wrong. So yeah, we make no claims we could do any of this better, but it’s important to point out stuff that you might want to have next time.

Jamie: [00:19:41] Cool. Yep. Agree. 

Tom: [00:19:46] Yeah. So I mean, the big problem here is obviously the monitoring stuff. I am still just horrified at the idea of having to deal with a latency networking outage without knowing what’s actually happening. Without being able to see, you know, this thing normally takes 20 milliseconds and now it’s taking a thousand or 2000 or 10,000, that’s just going to make life really hard to know where the failure is, because the only errors you’re going to be seeing with something like this are just timeouts. You know, you’re not going to see really crisp errors. It’s going to be pretty vague.

Jamie: [00:20:25] Yeah, for sure. I mean, I think it’s something you and I discussed a little bit, in reflecting on this outage and in general, but it’s hard to come up with a prescription that’s ironclad here, because when the network gets involved, you kind of get into all-bets-are-off type territory, you know? Especially if you’re on a public cloud where you have fewer knobs available to know what’s going on there. But man, doing anything you can to make sure those tools stay available... It’s worth so much thought to figure out, what are the failure scenarios that separate you from your observability data?

You can almost not go too deep on that, because when things get bad, you really are sort of fighting with one arm tied behind your back if you don’t have access to that.

Tom: [00:21:16] Yeah. I’m sure Slack has a backup communication system, whether it’s GChat or IRC or whatever. I’m certain they have some system already where if Slack goes down, they still have some way of talking to each other, because I mean, communication and dashboards are the major things you need in an outage, and, you know, access, things like that. But you can’t really have backup dashboards like that. I mean, you can have manual tools for querying stuff, but you’re probably not going to maintain two entirely separate dashboarding stacks. You just want to keep your dashboards as isolated from failure as possible.

Jamie: [00:22:58] Yeah. And part of the reason why it’s difficult to say too many useful things that are prescriptive here is it’s just such a different answer for different kinds of infrastructures. Right? So, you know, if you manage your own networks, you can make decisions about things like management networks and failure domains when it comes to racks and rows and power units and things like that. The set of strategies you adopt on a cloud are really different, because you just don’t have as much control over those kinds of things. So there’s probably other choices that you can make in order to minimize the likelihood that your dashboarding becomes unavailable at the same time your system does.

But it’s probably a little harder in some respects. And it involves a different way of thinking, I think, than maybe the way that this was thought about in times past. So the point you made earlier about, you know, at least having a backstop that’s external is a great one.

’Cause the most ironclad way to have independence of failure between your systems and at least some measure of whether things are going wrong is to just have that measure on a different system. But that’s going to give you a different level of fidelity than something that is actually on your network and has detailed data. So ideally you would have both, and I’m sure in Slack’s case, that’s one of the things they’re discussing a lot: how to make it less likely in the future that they lose their dashboards when they’re trying to figure this stuff out.

Tom: [00:23:32] All right. So next thing I think we should probably take a look at is the auto scaling service. First off, that file descriptor limit. Okay. So first off, the outage would have happened even without the auto scaling service. It had nothing to do with the root cause. This is just one of those things where you’re trying to fix a problem, then all of a sudden you’re fixing a different problem because your solution has a new problem. And so, man, the file descriptor limit, that just kills me. Can the industry just fix that? Like, I have had so many problems in my career because of that stupid ulimit -n.

Jamie: [00:24:12] Yeah, for sure. Same thing. I will say in my experience, ulimit -n specifically has hurt me way more often than it has helped me. And it feels like that number has been the default of 1024 for 20-plus years, like as long as I’ve been aware of it. And so it feels like it’s really worth an examination about what is the right default ulimit these days. Especially when we look at machines, the amount of memory they have these days and things like that, the kinds of things it used to be trying to protect by saying, you know, only 1024 FDs, it’s not clear that those things are even close to still that order of magnitude.

Tom: [00:24:56] I think I ran a FreeBSD server or, you know, desktop in probably like 1996 or something like that, with 4 megabytes of RAM. And I’m almost certain it had a 1024 file descriptor limit. Now we have, how much more memory, thousands of times more?

Jamie: [00:25:16] Yeah, we’ve got several thousand, at least.

Tom: [00:25:20] Thousands of times more memory. Maybe, maybe it’s finally time to just take that up to 64K. Like, I would settle for 64K. Yeah.

Jamie: [00:25:30] That would be pretty reasonable, especially if you read through the getting-started of almost any non-trivial network thing: the first thing in everybody’s README is, increase your ulimit -n, right? So anything that involves a poll or any kind of event loop or whatever, and is going to be able to handle a non-trivial number of FDs.

I mean, these days, even threaded programs very easily manage systems which would make good use of tens of thousands of file descriptors. And so that default just feels way too low.

Tom: [00:26:02] Yeah. So maybe in the absence of doing that, I mean, this is something services can check when they start, and it’s probably not a bad practice: when you start your service up, just check that the file descriptor limit is correct. Yeah. Hopefully everything’s set up correctly where it’s a non-issue, but it clearly still happens all the time.
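A startup check like the one Tom suggests might look something like the sketch below, using Python’s standard `resource` module (Unix only). The function names and the 65536 default are assumptions for illustration, borrowed from the “64K” figure mentioned in the conversation rather than from anything Slack describes.

```python
import resource


def needed_soft_limit(soft, hard, required):
    """Decide what to do about the open-file limit.

    Returns None if the current soft limit is already enough, or the
    soft limit to request. Raises if the hard cap makes that impossible.
    """
    inf = resource.RLIM_INFINITY
    if soft == inf or soft >= required:
        return None
    if hard != inf and hard < required:
        raise RuntimeError(
            f"open-file hard limit {hard} is below the required {required}; "
            "raise it in the service manager or limits configuration"
        )
    return required


def ensure_fd_limit(required=65536):
    """Check RLIMIT_NOFILE at startup and raise the soft limit if allowed."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    target = needed_soft_limit(soft, hard, required)
    if target is not None:
        # A process may raise its own soft limit up to the hard limit.
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]
```

Calling `ensure_fd_limit()` at the top of a service’s entry point would make it fail loudly at startup instead of partway through a large scale-up.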

Jamie: [00:26:21] Yep. I can’t remember what it was, but I do remember some relatively modern daemon that I started up a few weeks ago, trying it out, and it actually refused to start with ulimit -n 1024. It was like, hey, you’re going to have a bad time, buddy. So I was like, thank you, thank you, mysterious daemon, for, you know, helping me.

Tom: [00:25:25] That’s awesome.

Jamie: [00:26:46] Yeah. Another thing I was going to point out about the auto scaling thing too is, certainly when we talked about panic modes and don’t be clever when things are bad, the auto scaling service feels like it’s sort of in the crosshairs of that criticism a little bit, right?

Tom: [00:27:08] Yeah. I mean, so there was this famous hedge fund that melted down in the 90s and their strategy was described as, quoting from memory here, but it was like “picking up nickels in front of a steamroller” — they’re making small amounts of money reliably, but they have some chance of getting flattened at any moment.  That kind of feels like what the autoscaler was doing here. I mean, I don’t have any access to Slack’s books. I don’t know how much money they’re actually saving by doing this, but you know, it does seem like maybe they were a little quick here. 

Jamie: [00:27:44] Yeah. So there’s a couple of patterns that I can think of that come out of this that could be useful here. One of them is that any kind of decision-making entity, if it suddenly decides to make a decision way outside of its typical parameters, should probably at that point panic and call a human, right? Like, if its math works out to turning off, you know, 50 times as many servers as it normally does in any given unit of time, it’s wrong, and it should throw its hands up and say, I’m not going to move forward, because clearly something anomalous has happened. If you make one of these, you need to be aware of the worst-case scenarios: the environment they operate in suddenly is doing crazy stuff, and they want to make a business decision that is, you know, 10X, an order of magnitude beyond what their typical reaction would be. There’s obviously different ways to control for that. Certainly the dumbest one is just a limit, right? Even if it’s just a configured limit, like you cannot turn off more than 10 servers per five minutes. There’s some limit that probably represents reasonable behavior in 99% of cases and would provide a kind of clipping that would reduce the worst-case behavior in some sort of outage scenario.
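The two guards Jamie describes, a dumb per-window limit plus a refusal when the request is wildly out of range, could be sketched as below. Everything here is hypothetical: the names, the default of 10 servers per window, and the 10x anomaly factor are taken from the figures in the conversation, not from Slack’s autoscaler.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScaleDownDecision:
    remove: int        # servers the autoscaler may actually remove now
    needs_human: bool  # True when the request looks anomalous
    reason: str


def clamp_scale_down(requested, typical, max_per_window=10, anomaly_factor=10):
    """Apply a dumb safety limit to a scale-down request.

    requested: servers the autoscaler wants to remove in this window
    typical:   servers it normally removes in a window of this length
    """
    if requested <= 0:
        return ScaleDownDecision(0, False, "nothing to remove")
    # An order of magnitude beyond normal behavior: stop and call a human.
    if typical > 0 and requested >= anomaly_factor * typical:
        return ScaleDownDecision(
            0, True, f"requested {requested} vs typical {typical}: refusing"
        )
    # Otherwise clip to the configured per-window maximum.
    allowed = min(requested, max_per_window)
    reason = "within limit" if allowed == requested else "clipped to limit"
    return ScaleDownDecision(allowed, False, reason)
```

The `typical` value is passed in rather than computed, which sidesteps the question of where historical data is stored; a fixed constant would be the simplest starting point.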

Tom: [00:29:10] And if you want to make that more sophisticated, you can probably pull in, like, what did you do last week at the same time? And then what did you do last month at this time?

Jamie: [00:29:20] Yeah, there’s, as you say, probably statistical methods that are even more sophisticated, but my inclination would be to start with something really dumb.

And then, you know, later on talk yourself into some sort of like, oh, I typically do X. Because as soon as you’re talking about recording and reusing data, there’s now a little bit of a mini storage problem you also need to couple into this thing, which is okay. But again, the default always has to be: if the clever thing doesn’t work out, stop. Don’t keep going. Don’t march forward to save money, right? Because the money you might save over the next hour is nothing compared to the downtime that you’re incurring by making some decisions that are just nonsensical. I do think this also ties back to when we talked about the panic mode thing earlier, right? If your system is in panic mode, I mean, if you just reason through it, it should never be trying to save money during panic mode. It should just not be doing any optimization.

Tom: [00:30:15] I think that’s great. I mean, if you have some concept of panic mode that you can propagate through your system, this is absolutely something you can very cheaply disable. You know, just like, don’t deprovision servers when panic mode is on, right? Because nobody would ever want to do that.

Jamie: [00:30:31] Yeah. Would a human ever decide now is the time to save money? No. Right. A human might say we should spin up more because we are dealing with some back pressure that needs to surge in or whatever retry load, but the human would never say like, let’s throw some servers away to save money right now while we’re down. Right. That’s not at the top of your list.
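A shared, human-operated panic switch that the autoscaler consults might be as small as the sketch below. The `PanicSwitch` class and `plan_capacity` function are hypothetical names for illustration; the only behavior taken from the conversation is that a human turns panic mode on, scale-up is still allowed, and scale-down is not.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PanicSwitch:
    """A flag flipped by a human operator, never by automation."""
    enabled: bool = False
    enabled_by: Optional[str] = None

    def enable(self, operator):
        self.enabled = True
        self.enabled_by = operator

    def disable(self):
        self.enabled = False
        self.enabled_by = None


def plan_capacity(current, desired, panic):
    """Return the server count the autoscaler should move towards.

    In panic mode the fleet may grow but never shrink: saving money
    is not a goal while the service is down.
    """
    if panic.enabled and desired < current:
        return current
    return desired
```

For example, with the switch enabled, a request to go from 1000 servers to 400 leaves the fleet at 1000, while a request to go to 1200 is still honored.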

Tom: [00:30:53] Makes sense. It’s always fun after an outage and you’ve allocated all these extra servers to see all your CPU numbers just beautifully, lower than they normally are.

But yeah, I think, I think in this case, you know, any sort of product that just has a massive number of clients, I think you had another good idea here, which is, you know, there’s probably, whether you tie them to panic mode or not, but you could build in a lot of other quality knobs.

I was a little surprised we didn’t see any of that.

Jamie: [00:31:23] I was surprised about that too. Yeah. So actually, going back to your point earlier that your IQ drops during an outage, that’s true of absolutely everybody, right? Like we’re just panicked and we’re trying to solve the thing in front of us. Having quality knobs ahead of time, and having policy decisions that you make clear-headed when you’re not in an outage, is a huge asset in these scenarios. Right? Because if you are back-loaded, your choice isn’t should we or should we not lose traffic. The choice is just which traffic should we lose. Like, you are failing things actively. And the question is, are they arbitrary sets of things or are they the right sets of things? And so like, yeah.

Tom: [00:32:04] A lot of times, it is “is it everybody?” 

Jamie: [00:32:07] That’s the thing: if you’re kind of messing things up, you know, 10% for everybody, your service is kind of effectively useless, right? Versus having 10% of people having a bad day, or having 10% of types of communications having a bad day, and those are less critical. Right? So with Slack, it might be like, decisions Slack makes about, oh geez, should they keep issuing reminders during this period at all? Right? Like maybe they delay those for now, but they keep real-time messages coming through.

Should they still have bots working, versus humans trying to communicate with each other? Right. And then, instead of classes of traffic, there are also, just to be honest, classes of users, right? You have paid users and you have unpaid users. You might want to make some decisions, especially if you have a lot of unpaid users.

It may be that you prioritize your paid users in a downtime scenario. If you haven’t already decided this policy before you have to make the critical decision when you’re under stress, boy, is it tough to make the right decisions.
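One way to picture deciding "which traffic should we lose" ahead of time is a small shedding policy that is agreed on in advance and only dialed up during an incident. The sketch below is an assumption-laden illustration: the traffic kinds, their ordering, the `shed_level` knob, and the paid/unpaid penalty are all made up for the example and are not taken from Slack’s systems.

```python
from dataclasses import dataclass

# Hypothetical priority order, lowest number = most important.
# The ordering itself is an illustrative policy choice, not Slack's.
PRIORITY = {"realtime_message": 0, "bot": 1, "reminder": 2}


@dataclass(frozen=True)
class Request:
    kind: str   # one of the keys in PRIORITY
    paid: bool  # whether the request comes from a paid account


def should_serve(req: Request, shed_level: int) -> bool:
    """Decide whether to serve a request at the given shed level.

    shed_level 0 serves everything; each higher level drops the
    least important remaining traffic. Unpaid traffic is treated as
    one step less important than the same kind of paid traffic.
    """
    score = PRIORITY[req.kind] + (0 if req.paid else 1)
    max_score = max(PRIORITY.values()) + 1
    return score <= max_score - shed_level
```

At `shed_level=1` only unpaid reminders are dropped; at `shed_level=3` only paid real-time messages still get through. The point is that the ordering is written down while everyone is clear-headed, so the incident responder only has to turn one knob.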

Tom: [00:33:13] I think, you know, a really key aspect here is, you know, you can kind of see this all in traffic, honestly, on the interstate or the freeway. You know, sometimes you add like 5% more traffic and everything slows down massively. There are these tipping points, these thresholds where, once you get over a certain amount of stuff trying to go through the system, you know, the congestion just collapses it. And so you don’t necessarily have to say, okay, we’re going to cut out 50% of our traffic to get things back working. Like, sometimes, you know, maybe 10% of our traffic is unpaid.

You know, if we can just flip a switch, turn it off to them, just lowering that sometimes can get you back under a threshold where things work. 
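The tipping-point effect shows up even in the simplest textbook queueing model. The sketch below uses the standard M/M/1 result for mean time in the system, 1 / (service rate - arrival rate). Treating a service as a single M/M/1 queue is a strong simplifying assumption made here for illustration, and the numbers are invented, not Slack’s.

```python
def mean_latency_seconds(arrival_rate: float, service_rate: float) -> float:
    """Mean time in system for an idealized M/M/1 queue.

    arrival_rate and service_rate are in requests per second. Once
    arrivals meet or exceed capacity the queue grows without bound,
    which is modelled here as infinite latency.
    """
    if arrival_rate >= service_rate:
        return float("inf")
    return 1.0 / (service_rate - arrival_rate)


# A server that can handle 100 req/s, running at 99 req/s:
overloaded = mean_latency_seconds(99.0, 100.0)          # 1.0 second
# Shed roughly 10% of that traffic (99 -> 89.1 req/s):
after_shedding = mean_latency_seconds(89.1, 100.0)      # about 0.09 seconds
```

In this toy model, dropping about a tenth of the load cuts mean latency by roughly a factor of ten, which is the freeway intuition: near the threshold, small reductions in traffic buy very large improvements.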

Jamie: [00:33:56] Right. And you’ll end up servicing those unpaid users in practice sooner than you would if you sort of tried to limp along servicing everybody. Right. So, yeah. I mean, even another, you know, another good idea with Slack. I don’t know whether it’s a good idea. I don’t have the data. Right. Slack knows whether it’s a good idea. Right. But when they have something like this, like, you know, they mentioned their cold caches or whatever, you know, they backload some scrollback when you load a channel, right?

Like maybe they should load less scrollback when they are, you know, when they’re in panic mode or whatever. Right. So maybe instead of giving you, you know, the last hundred messages, they just give you the last five, right, or something. But like, have they enumerated the choices, and then do they have the right tags laced throughout their services in order to actually take action on those policies?
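A "quality knob" like this can be as small as a limit that depends on a flag. The following is a sketch under stated assumptions: the function name, the `panic_mode` flag, and the limits (100 and 5, taken from the numbers Jamie throws out) are hypothetical and say nothing about how Slack actually loads channel history.

```python
NORMAL_SCROLLBACK = 100  # illustrative default, per the example in the discussion
PANIC_SCROLLBACK = 5     # illustrative degraded value


def fetch_scrollback(messages: list[str], panic_mode: bool) -> list[str]:
    """Return the most recent messages, fewer of them when in panic mode.

    `messages` is assumed to be ordered oldest to newest.
    """
    limit = PANIC_SCROLLBACK if panic_mode else NORMAL_SCROLLBACK
    return messages[-limit:]
```

The harder part, as Jamie notes, is not the function itself but having the flag plumbed through every service that would need to consult it.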

Tom: [00:34:50] I’ll tell you the real challenge here, and we should probably save this for another episode, is actually finding the time to build that stuff. It’s very easy to sit around and think of all the stuff you can do, right? In an engineering org, it’s much harder to actually get it on the schedule and make it happen.

Jamie: [00:35:07] Yeah. And even just as a business, having the discipline, right, to not only do that when you’ve had the downtime, but to do that ahead of time. To say, we’re going to carve off time, which you have to acknowledge is trading off versus feature work in many cases, right, to do those kinds of things.

So, I don’t want to trivialize how hard it is to pick the right amount of stuff to do and to actually prioritize it above other things the business could be doing. 

Tom: [00:35:32] Yeah. I mean, it’s probably worth pointing out that this is kind of a champagne problem. Like, getting overrun with traffic is only a problem you have if your business is successful. You have to do all the things to have a successful business. You can’t overly prioritize, you know, failure mitigation until you actually have, you know, a business that anyone cares about if it fails.

Jamie: [00:36:00] Yeah. That’s definitely a great meta point, right? Like all of this conversation is within the context of a business, like Slack, where their downtime is very, very meaningful. So if you’re just starting out, you probably should do very little of this because there’s no point in protecting something which does not yet have value. So you should keep building features and get somebody to care about you.

Tom: [00:36:21] Yes. If you have no users, no one will care if you’re down.

Jamie: [00:36:25] That’s right. Yep. And I have been at startups where that was true, so yes.

Tom: [00:36:30] Cool. All right, do we want to spend a bit of time talking about other preventative methods? 

Jamie: [00:36:38] Sure. Yeah. And I think, as in processes Tom and I have participated in in the past, you know, so far the things we talked about would be sort of detection and remediation type stuff, which is super valuable, right? Like how do we get out of the problem when we’re in it and stuff like that. And so now shifting a little bit to preventative things, right? We have some ideas we can talk through here about ways that we could not end up here at all.

Right. And again, some of these ideas are probably things Slack has already considered and made decisions about, but some of them might be new to some of you in the audience. And so we’ll kind of talk through a few ideas about how to not, you know, get in circumstances like this, some things that have worked in the past for us.

Tom: [00:37:25] Okay. So, I think the first thing to talk about here is jitter. One of my favorite words in computing. The idea that you have these peaks right at the hour and half hour just seems like a disaster waiting to happen. You know, you’re sort of setting up a regular DoS attack on yourself. And when this lined up with the January 4th, first day back, it tipped the service over and knocked it down. So, it’s pretty reasonable that people would want to have reminders and things fire on the hour, but maybe there’s some way you can spread that out over a minute or have some clever product thinking that makes people not expect it to happen exactly. That might be something to think about.

Jamie: [00:38:16] Yeah. There’s this zoomed-in version of smearing and jittering things we’re talking about here. And then the other one that Tom referenced too, right, is January, right? January 4th. And so, I think for probably many of you listening, and certainly for Tom and me, it is not at all surprising that the first work day after vacation is the day things had a problem. And so, you know, there’s oftentimes a backed-up code push that wants to go out, because there was a code freeze, right? There’s a lot more traffic. There are colder caches because your customers haven’t turned their computers on in a while. It’s unlikely to be a holiday for anybody because they’ve already taken a bunch of holiday time. So there’s all kinds of probabilistic things that are just going to make what happens to your system and its code on that day different than most of your days. Right? And things that are smooth are just so much easier to make sure they keep working than things that have a lot of variance, and that January 4th, first day back, is a perfect variance day in a lot of dimensions. And then piling that on these local variances they always have, right? Like the double-oh, hour-type mark, right? People that have written crontabs and stuff like that probably know some of these scenarios, where you’re trying to figure out what goes wrong on the system at midnight or whatever. So, yeah, I mean, I think you’re right, Tom. If there are ways to hide or smear or whatever the fact that everyone expects the reminder to happen on the hour mark, that’d be great. Like you could, for example, potentially pre-deliver some of these reminders to the client a few minutes early or whatever, so that the activity that happens at the hour is only client side, right? And therefore not on your system.
So there might be different things you can explore to make things still appear to happen at the time you think, but you’re spreading the work out over a little longer. Even if that was too clever, you might find, as long as you just use the 120 seconds around zero zero, that plus or minus one minute is accurate enough for a reminder to feel on time that it works and, again, provides your system more freedom to not have quite as high of a peak.
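The "120 seconds around zero zero" idea can be sketched as a deterministic jitter: each reminder gets a stable offset somewhere within plus or minus one minute of its scheduled time, so the work lands spread across the window instead of all at the top of the hour. This is a minimal sketch; `jittered_fire_time`, `reminder_id`, and the hashing scheme are assumptions made for the example, not a description of Slack’s reminder system.

```python
import hashlib


def jittered_fire_time(reminder_id: str, scheduled_epoch_s: int,
                       window_s: int = 60) -> int:
    """Shift a scheduled time by a stable offset in [-window_s, +window_s].

    The offset is derived from a hash of the reminder's id, so the same
    reminder always fires at the same shifted time, while different
    reminders spread roughly evenly across the window.
    """
    if window_s == 0:
        # No jitter requested: fire exactly on schedule.
        return scheduled_epoch_s
    digest = hashlib.sha256(reminder_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash onto the 2*window_s + 1 possible offsets.
    bucket = int.from_bytes(digest[:8], "big") % (2 * window_s + 1)
    return scheduled_epoch_s + bucket - window_s
```

Hashing the id rather than drawing a fresh random number keeps the fire time stable across retries and restarts, and passing `window_s=0` keeps exact timing for anything that genuinely needs it.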

Tom: [00:40:41] And maybe that’s another paid unpaid thing, you know, like you, you get higher precision stuff with, with a paid account.

Jamie: [00:40:47] Yeah, that’s right. That’s a good point.

Tom: [00:40:48] Upsell, right? 

Jamie: [00:40:50] Yeah. You could just have more rigor around the precision of these reminders for your paid customers, and otherwise you’d just be plus or minus a minute, right, where someone’s not going to freak out. So yeah, anything you can do to prevent those peaks is great. And there’s usually more tricks available than you think. It’s worth having your smart engineering teams and product teams and stuff think through: is there a thing that actually appears to be about the same to the user that introduces all kinds of degrees of freedom for the engineering teams, right?

Tom: [00:41:26] Yeah. I mean, I think if your graphs are not smooth, if your graphs have big spikes in them, you know, particularly if those spikes come at the peak of the day, you know, it’s good to ask yourself, how can we smooth this out? You know, how could we make this go away? You never know when that’s going to tip you over.

Jamie: [00:41:45] Yup. For sure. Makes sense. Cool. Yeah, I mean, we talked about it a little bit before, but I think another topic here that has bit, I think, all of us at some point or another, if you do services long enough, is that the hardest problems to really deal with are almost always network issues. And I think some of it is, you know, just an admission of a limitation of the software teams that are working on these things. Like, you know, a lot of times the first responders are the software engineering teams. And we know software really well, but there’s sometimes an assumption that we make, even if we’re running on premises, that the network just works. And this is an even bigger problem in the cloud. And so we’ll be looking at code, we’ll be looking at all the things we know well, right? Did we misconfigure something? Is it a CPU limitation thing? And you see your machine idle, but traffic still isn’t working. And a lot of times the network is one of the last things you suspect. And it’s also one of the hardest things to test, like when you’re doing, if you’re doing disaster training type stuff where you’re doing,

Tom: [00:43:01] Network failures are not binary either. Right? That’s the thing that’s so frustrating about them: the network might be fine. Like SSH might be fine. Everything might be fine. It’s only when you’ve really slammed it with traffic that it starts dropping packets and slows down.

Jamie: [00:43:17] And it may not even be your machine dropping packets. Right. It may be a switch where you don’t have the metrics on that switch. And you don’t know when, you know, there are two machines that are talking fine to each other and it’s because they’re on the same switch. But there’s some cluster router or something that things on different racks in the data center you’re in, or in different, you know, rooms in the same facility or whatever, go through. So especially on the public cloud, it’s harder to diagnose network issues. Right? It’s very hard, even when you have control of all these things, because normally software teams look at the network last, right? It’s just assumed to just work. It’s like the tap, right? You turn the water on and the water comes out, but that’s not always true. So what do you think, Tom? What can we do?

Tom: [00:44:05] Well, I think one of the takeaways I’m going to have from this is, you know, the, just the transit gateways as a potential source of failure, you know, like you, you can’t control that much networking on AWS, you know, like you don’t have any idea what the hardware is.

You don’t have any idea when they’re in the data center, like swapping cards out or anything like that. But you can reason that something’s happening, that it can fail when it goes through a TGW, whether that’s like some service running somewhere or whatever. And so just keeping that in mind when you set things up, so that, you know, things inside a VPC can still function as you expect, and that you don’t have too many dependencies. You don’t want to be surprised about what’s going to happen if the transit gateway stops working.

Jamie: [00:44:56] I mean, I think that there’s a special thing when we use cloud services, which we all do. There’s these interesting curves that pop up, right? Which is like, what is the sophistication of the thing you’re using? And what amount of transparency do you have about how it works and when it’s functioning or not? If you take something like, you know, just the top-of-rack switches or whatever, these are probably less likely to fail because they’re somewhat simpler. Right? Whereas something like a TGW has a rule set and it has quotas and it has all kinds of fun stuff, so it is a more sophisticated service. And as opposed to maybe your machine, where you can still run top on it and you can run, you know, dmesg, like, whatever the things are that you run to try to get a sense of your interrupt load or whatever, the TGW might represent a kind of thing where they’re doing something kind of sophisticated and we don’t have a lot of insight into how they actually work.

Right. And so it’s probably a thing to zoom in on a little bit, and certainly not a thing to say don’t use it. But just to have that in your risk assessment, right? About, like, when you have things that span those, that rely on them, how much does it matter if they stop working, and how much will you know that that’s the problem without, you know, having a really good buddy at Amazon, right?

Tom: [00:46:20] So I haven’t actually used a transit gateway myself, but I would be curious if it’s possible to, to make them like to turn them off or to, to scale them down yourself because, being able to run a DRT at a time when traffic is very low and you have all hands on deck and you can actually tell what’s going to happen would potentially be pretty valuable here.

But again, I wouldn’t be surprised if you don’t have control over that. I’m just not sure. 

Jamie: [00:46:44] Yeah, it would be awesome. I mean, the more knobs that the cloud providers can make available to us to simulate congestion and packet drops and stuff, that would be amazing, because that way, at least we could, you know, harden our systems ahead of time and know how they would react to some of these problems. So I agree. I don’t actually know to what extent that’s possible; I haven’t used TGWs directly myself either. But if those kinds of knobs could be made available, they’d be great for the operations teams to, like, know what they’re getting into when they depend on these things. So, yep, that’d be cool. Any other things we should talk about that feel like they’re sort of in the preventative pool of ideas?
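Short of a knob from the cloud provider, one place packet drops can be simulated is inside your own client code, at least for testing how retry logic behaves. The sketch below wraps an arbitrary send function and makes it fail some fraction of the time. Everything here (`with_simulated_drops`, `PacketDropError`, `send_with_retries`) is a hypothetical helper written for this illustration; it exercises application-level behaviour only and does not reproduce real congestion on a transit gateway.

```python
import random
from typing import Callable


class PacketDropError(Exception):
    """Raised when the simulated network drops a call."""


def with_simulated_drops(send: Callable[[bytes], bytes], drop_rate: float,
                         seed: int = 0) -> Callable[[bytes], bytes]:
    """Wrap `send` so that roughly `drop_rate` of calls fail.

    A seeded generator keeps test runs reproducible.
    """
    rng = random.Random(seed)

    def flaky_send(payload: bytes) -> bytes:
        if rng.random() < drop_rate:
            raise PacketDropError("simulated drop")
        return send(payload)

    return flaky_send


def send_with_retries(send: Callable[[bytes], bytes], payload: bytes,
                      attempts: int = 3) -> bytes:
    """Try `send` up to `attempts` times, re-raising the last drop."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return send(payload)
        except PacketDropError as err:
            last_error = err
    raise last_error
```

Running a game day with `drop_rate` turned up gradually is one way to find out, at a quiet time with all hands on deck, how a service behaves when the network is degraded rather than fully down.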

Tom: [00:47:30] I think that’s everything I had in mind. Okay. Yeah, just keep those dashboards running.

Jamie: [00:47:38] Yeah, keep the dashboards running. Yeah. I mean, I feel bad. Yeah. That’s a tough day. And again, a service as critical as Slack, like I think you mentioned in our intro episode, Tom, that you use Slack. My team uses Slack too, so we all felt this pain as customers. And then later, when you read the post-mortem, you sort of have this empathy for the teams that were trying to get the service back online, because it’s such a stressful situation. And it’s the first day of the year, and you feel like it’s setting a tone. But overall, I think when you look at what they had to deal with and the length of the outage, it feels like the team did a pretty stellar job actually in recovering from a tough circumstance.

Tom: [00:48:20] Yeah, definitely. I think things went bad, but I think they responded as well as they could. And, yeah, they do make this point in the blog post as well, but I can just quote exactly. They say “every incident is an opportunity to learn and an unplanned investment in future reliability”. And I really liked that “unplanned investment in future reliability”, because that’s exactly what these things are. I mean, things happen, you just don’t want them to happen multiple times. And now that this has happened, you know, the next outage related to something like this either isn’t going to happen or it’s going to be resolved much faster.

Jamie: [00:49:00] Definitely. Cool. Well, I guess we’re done for today. So, everybody out there, thank you for listening to this episode. We’ll be coming out with a new one soon and, yeah. Until next time.

Tom: [00:49:15] All right. Thanks everybody.
