Auth0 experienced multiple hours of degraded performance and increased error rates in November of 2018 after several unexpected events, including a migration that dropped some indexes from their database.
The published post-mortem has a full timeline and a great list of action items, though it is curiously missing a few details, like exactly what database the company was using.
In this episode, you’ll learn a little about what Auth0 does and what happened during the outage itself. You’ll also hear about the things that went well and went poorly, and some speculation about whether something like Honeycomb.io could have reduced the amount of time to resolve the outage.
Tom: [00:00:00] Welcome to The Downtime Project, where we learn from the Internet’s most notable outages. I’m Tom Kleinpeter, and with me is Jamie Turner. Before we begin, I just want to remind our listeners that these incidents are really stressful, and we peel them apart to learn, not to judge. Ultimately, Jamie and I have made, and will undoubtedly make in the future, similar mistakes on our own projects. So please view these conversations as education, rather than judgment of mistakes we think we would never make.
We’re talking about Auth0’s November 2018 outage where they had a number of hours of degraded performance and some errors. And we’ll get into all those details after a little bit of housekeeping.
So first off, thank you to everybody who’s been listening to the show. It’s been great to hear from people who are enjoying it. The only thing we would ask is that you go to Apple or Spotify and give us a nice rating and review, if you’ve liked it. That would really help us out. So thank you very much.
Next, if there are any outages that you’ve been a part of or heard of that have a public write-up, we would love for you to tell us about them so we can dig into what you think is most important. You can hit us up on Twitter or leave a comment on the blog at The Downtime Project, and we will take a look at everything you send us.
Jamie: [00:01:30] And, I think we also have a little news about your startup, right Tom?
Tom: [00:01:34] Yes! My startup is now out of stealth. I’ve been working on this for about six months and it’s amazing to finally be able to talk about it. The name of the company is Common Room and we’re building some great products to help companies work with their online communities.
You can learn a little bit more about what we’re doing at www.commonroom.io, but if your company has an online community like a public Slack or Discourse, definitely check out the website and sign up for the waiting list. Or if you’re interested in working in a rapidly growing startup with a pretty amazing team and hopefully not causing any terrible outages along the way, come find me on LinkedIn or apply on our jobs board.
Also as part of that media blitz for the launch, I was on a couple of other podcasts, one of my co-founders Viraj and I were on the March 31st episode of Software Engineering Daily, where we talked about a lot of the tech we’re using and how we’re building the product. And then we were on a two-part episode of the ELC podcast, where we talked about our history working together and how we work together at Common Room.
I’ve worked with Viraj for about 15 years in various capacities, so it was a ton of fun to sit down and reminisce about some of the fun stuff we’ve done together. So check those out. I really enjoyed recording them. And if you can’t get enough of listening to me now, you’ve got two more options.
Jamie: [00:02:45] And I’ll also throw in my pitch that you all should go check out Common Room. I think what they’re building over there is really cool. My own startup, which is not yet out of stealth, I can’t wait until we’re far enough along that we can make use of what Tom is building at his startup. So go check it out.
Tom: [00:03:00] Awesome. Awesome. Well, cool. So today we’re talking about Auth0. In 2018, they had an outage, and we’re going to go through it. They wrote a very solid post-mortem that we’ll link to in the show notes. To start with, though, Jamie, why don’t you tell us a little bit about what Auth0 does?
Jamie: [00:03:20] Cool. So Auth0 is an authentication and access control platform as a service provider. So you might say, what does this mean? Why wouldn’t I just do it myself? So let’s talk about what you would need to do if you try to do it yourself to modern standards. So you obviously need to have a way that users can create passwords. You need to be able to store those passwords safely. They need to be able to reset the passwords.
These days, the expectation is going to be that they can do OAuth-style single sign-on with Google, Facebook, and other third parties. You need to have some access control granularity, where you can flag what kinds of resources different accounts are entitled to based on their level or their status or whatever.
And you also need to have extra security features, right? Things like two-factor authentication using SMS, one-time codes, or a YubiKey. So there’s actually a lot to build here if you want to do this to the expectations of customers today, and Auth0 is a service that takes care of all that for you. So you just use them and you don’t have to build all that stuff.
Tom: [00:04:24] Yeah, don’t write all that stuff yourself. Just use a service. That is a morass to get into, and it is amazing that Auth0 takes care of all of it for you. So if you find yourself writing your own SSO integration or something, oh God, please just use something different.
Well, cool. So let’s run through the post-mortem. Reading this post-mortem was very educational. We learned a lot about the problem and the response, but you can tell that it has been redacted or run through some kind of approval process, because there are some things that aren’t in here that we’ll sort of call out as we go.
I think everything we need is still in there, but notably they don’t talk about the type of database they’re using, which would have been kind of important, I think. But let’s get into the details. This is an interesting outage because it’s not one of these things where you change something and it breaks immediately.
The root cause of this actually happens first about seven or eight days before the real problems start. So on November 20th, 2018, they pushed some new code, which caused some database migrations to run. This is very standard. As code gets deployed a lot of times you’ll have a schema migration, and that’s just how it works.
Normally this works fine, but in this case, two indexes failed to rebuild due to some malformed email addresses. What they say exactly is “a deployment of new code to our US production environment resulted in the failure of two indexes to rebuild”. So I don’t know how that would happen. Jamie, do you have any ideas?
Jamie: [00:06:07] Yeah, this actually speaks to Tom’s point about the missing details. It’s a little hard to understand exactly what they mean here. They mention index failures due to email addresses, but most indexing systems and databases are not going to care about the semantics or structure of an email address. One thing they might mean is that they added a uniqueness constraint to the email index, assuming the emails were unique. And then you can imagine a bug in application code: maybe they were checking whether an address was already in there before they had normalized the case or stripped whitespace, so they weren’t getting a match, and then they were normalizing just before inserting the email. Essentially, they were creating duplicate email records in the database. Later on, if you try to add an index that says, yeah, make that unique, it’s going to fail, because the data isn’t actually unique.
So that’s one example of what we could imagine happened here; the details are a little hard to pin down. But the net result is that this index, which I’m sure was optimizing a whole bunch of queries for them, did not exist after this, because the new definition could not be built.
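Jamie’s duplicate-email theory is easy to reproduce with any SQL database. Here’s a minimal sketch in Python using SQLite; the actual database, schema, and column names in Auth0’s system are unknown, so everything here is hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")

def email_exists(email):
    # Bug: checks for the raw address, before normalizing it.
    return conn.execute(
        "SELECT 1 FROM users WHERE email = ?", (email,)
    ).fetchone() is not None

def add_user(email):
    if not email_exists(email):
        # ...but normalizes just before inserting, so "Bob@example.com" and
        # "bob@example.com " both pass the check and land as the same value.
        conn.execute(
            "INSERT INTO users (email) VALUES (?)", (email.strip().lower(),)
        )

add_user("Bob@example.com")
add_user("bob@example.com ")  # duplicate sneaks past the existence check

# Days later, a migration tries to add a unique index over the column, and fails.
try:
    conn.execute("CREATE UNIQUE INDEX idx_users_email ON users (email)")
    index_built = True
except sqlite3.IntegrityError:
    index_built = False
```

The migration fails exactly the way the post-mortem describes: the index definition cannot be built because the data it is supposed to cover is already inconsistent.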
Tom: [00:07:24] Yeah. I think most indexes only get added after they’re needed. And so if an index that you were expecting to be there doesn’t exist that’s going to be a problem. But somehow the system powered on for about seven more days without any problems showing up until November 27th, when they started to get a heavy increase in these malformed requests.
They’ll later find out it was not due to a DoS attack, though they definitely thought it was at the time. I think they say it was just a customer with a misconfiguration. But for whatever reason, they start getting a bunch of these requests, and again, this is where it would be awesome to have some details, but these requests are somehow causing a lot more load, a lot more CPU.
So a couple hours later, they get an alert about database replication lag. We’ve talked about this before, but if you’ve got a primary and a secondary database, all your writes go to the primary, but all the writes also have to be streamed over to the secondary. And if the secondary gets too far behind, you start to have problems.
So most people have an alert about how many seconds behind the secondary is, so that fired and triggered some engineer or SRE to take a look at it.
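A replication lag alert like the one that fired here boils down to comparing the primary’s newest write timestamp with the last operation the secondary has applied. In MongoDB terms this would come from replica set status; here is a database-agnostic sketch, with the threshold and function names invented for illustration:

```python
import time

LAG_ALERT_SECONDS = 30  # hypothetical threshold; tune to your workload

def replication_lag(primary_last_write_ts, secondary_last_applied_ts):
    """Seconds the secondary is behind the primary's newest write."""
    return max(0.0, primary_last_write_ts - secondary_last_applied_ts)

def check_replica(primary_ts, secondary_ts):
    lag = replication_lag(primary_ts, secondary_ts)
    if lag > LAG_ALERT_SECONDS:
        return f"ALERT: replica is {lag:.0f}s behind"
    return "ok"

now = time.time()
healthy = check_replica(now, now - 5)    # 5 seconds behind: fine
lagging = check_replica(now, now - 120)  # 2 minutes behind: page someone
```

The important property is that the alert fires on the symptom (how stale the replica is), not on any particular cause, so it catches lag whether it comes from CPU pressure, disk saturation, or anything else.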
Jamie: [00:08:39] Yeah. This is a good example of a place where the post-mortem lacks the specific database details, so we have to guess a little about why this happened. We’ve both had some experience with MongoDB, and one guess we had was that this sounds like it might be MongoDB, which is known to have some issues with replication lag when the primary gets busy. We did some digging online and found other, separate documents where Auth0 does say MongoDB is their primary source of truth.
So that may be what happened here. Many databases will attempt to keep resources fairly separate between replicating data and servicing new requests from clients, but MongoDB is known to sometimes have issues where, if the primary gets too busy trying to keep up with client requests, it will start to lag its replication streams to the secondaries. That sounds like what happened here.
Tom: [00:09:39] And in this case, the primary was probably very busy because, I’m sure, it was burning a ton of CPU since it didn’t have the index it wanted, but potentially these malformed requests were also causing more load on it somehow. For whatever reason, they were getting a lot of this lag.
And so about an hour after the alert, they decide to resync the replicas. And as we’ve talked about before, that involves copying the data from the primary back to the secondary, and that is not a free process. It’s going to put some load onto the primary, which is just going to make everything a little bit worse.
So around the same time, they file an incident about degraded perf which kicks off their whole process. Their status page gets updated and a little bit after that they restart the application servers and that doesn’t seem to work. So they add more capacity to the app servers, which I think helps a little bit, but you have to be really careful about that when your problem is database load. Cause that can actually just allow you to put more load on the database. If you’re running out of workers on your app servers, because your database is too busy, being able to put more work into your database is not going to help. And I think you see that in the next point in the post-mortem, which is that they start seeing higher latencies and other services.
So I’m guessing what happened there is they started putting enough load onto the database now that it started to spill over and affect more stuff. Does that read right to you?
Jamie: [00:11:07] Yeah, I think that sounds exactly what happened. Their pinch was kind of in the back of this pipeline and by fattening up the middle of the pipeline, all they did was increase the pinch at the back, and then things just got worse.
Tom: [00:11:22] Yeah, just congestive failure. So at this point the request failure rate starts going up all over the place, and they promote the status level of the incident to major. And then this gets to a point that Jamie and I are a little confused about, where they bring the secondary cluster into action.
So they take all of the traffic from the primary cluster, the primary application servers, and start failing it over to a secondary cluster, which I assume is just a set of servers they have in a different availability zone. They do that instead of just kill -9’ing everything in the primary cluster. And right after they do this failover, things start to look a little bit better.
Jamie: [00:12:08] Yeah. It’s not completely clear how this helped. They did have an earlier note about restarting the services, so I’m sure they attempted that a few times. And at first, with the failover to the secondary cluster, you would ask: if it’s still talking to the same database, why did that fix anything? But later points in the notes make it a little clearer why they did this. It seems as though they’re able to target different deployments on these clusters.
So they can start to try to figure out what’s wrong with this code basically by having the secondary cluster and the primary cluster running different versions of their software so they can isolate the problem.
Tom: [00:12:56] So after failing traffic over to the secondary, things look okay, and they start working on bringing the primary back online. But they do that by rolling it back to a previous version, and they don’t specify if it’s the previously running version or the previous commit or what. This is a part I was a little confused about: bringing the primary back with the previous version doesn’t work because there’s a bug in it, so it doesn’t come up healthy.
Jamie: [00:13:24] Yeah, this is kind of confusing, because presumably if it was the previously deployed version, that one should already be known good, right? For a definition of good that clears their production bar, because it ran for a while. So it’s a little confusing as to why there’s a bug. One speculation could be that there was a forwards-versus-backwards compatibility issue, or they changed something in a way they couldn’t roll back from.
Tom: [00:13:49] That’s a great point. They might’ve already written out some state that the old version can’t handle. Yeah. That’s a classic issue right there.
Jamie: [00:13:55] Maybe it was that, yeah. Again, we don’t have the details.
Tom: [00:14:00] So at this point, they’re about five hours in after the first alert about the replication lag, and they begin to suspect that they may not have all the indexes they need. From here, things start to move a little bit quicker in terms of getting to a resolution. They take down the service that relies on the missing index and start to add the index back. 15 minutes later, they get that service back up. 40 minutes later, they find another endpoint that was using, I guess, the other missing index, and they can just block that endpoint at the router level, or I guess the load balancer level. A little bit later, they get the bug fix for the issue that was keeping the primary cluster down and get traffic back to it, but they hit a load balancer config issue, so they have to go back to the secondary cluster. At that point it seems like they kind of give up on the primary cluster for the day. They’re about seven hours in after the alert, so it’s probably the middle of the night by now, 8:45 UTC, which I forget what time that is locally, but it’s late. Then a few hours after that, they finally identify a suspicious IP address whose activity lines up with everything that happened during the incident.
Jamie: [00:15:26] This was their previously suspected denial of service attack was all coming from this IP, right?
Tom: [00:15:32] Yep. So they blocked the IP address and they notify the customer. And then a minute later metrics are all back to normal. So they got the indexes fixed. They got the IP address, blocked, and life is good again. And then five or six hours after that, they are able to route everything back to the primary cluster.
Jamie: [00:15:51] So it looks like there’s a nice meta comment actually about this. I think that’s actually like 17 hours later, Tom.
Tom: [00:16:10] Oh, that’s a whole different day. So they probably let everybody get some sleep.
Jamie: [00:16:11] Which, again, seems like the right kind of culture. You’re running on the secondary, and yes, you probably want to fall back over to the primary, but they probably took a breather at that point and told everybody, look, go home, we’ll fall back to the primary tomorrow. Which is probably the right call to make.
Tom: [00:16:29] Yep, that makes sense. Okay, so just to summarize all that: some really unlucky timing here. They had two indexes that didn’t rebuild when there was a code push, and it just made everything not work great and used resources that they otherwise would have had available to deal with this mysterious, anomalous traffic. Either one of these things on its own might have been fine, but when you bring them together, you see this problem. It’s always hard to isolate something when it wasn’t something that just changed.
And I’m sure they didn’t think about these indexes immediately, because that change had gone out a week ago and everything had been fine. But having these two things together, once you add in resyncing the databases, just tips everything over and causes the problems.
So let’s run through the things that you were thinking went well, Jamie.
Jamie: [00:17:28] Yeah. I mean, one thing to call out right out of the gate is that they had a good set of action items. If you read through the list of things they’re doing in response to this, they sound like a good set of things for them to dig into. And they not only have those action items, they published them as part of their post-mortem, so we all know they’re doing these things. A lot of the things they enumerate there are things that occurred to us too as we were reading through it.
Tom: [00:18:04] One of the whole goals of these post-mortems is to build trust with your customers, to let them know: look, the thing that happened was really crazy, but here are all the things we’re doing to keep anything like it from happening again. And yeah, the list of action items they had, I think I agree with every single one of them. It’s just a solid list.
Jamie: [00:18:26] Yeah. I think another thing here that was particularly well handled, compared to how other tech companies might do it, is that the communication with customers was good. You could certainly see a panicking team trying to restore the service just blocking that IP. The engineers could do that and not really think about the ramifications of blocking it, but Tom, it doesn’t sound like that’s what happened here, right?
Tom: [00:18:53] Yeah I think it’s pretty clear that they had somebody working the incident that could get in touch with the customer and tell them exactly what was going on. I think there were several cases where customers had to be contacted. And so that means they had some sort of incident process where when things start to go wrong, you get more than just engineers into the room or Slack or however they were dealing with this. You would have somebody who owns the customer relationship, who is on call to either contact the customer or advise whether or not something is going to be okay.
They probably had other people who could make decisions if things got really bad. But yeah, they were able to get in touch with the right people, and that should be a big part of any incident response: just making sure your customers know what’s going on and what to expect.
Jamie: [00:19:41] Yep. Once they had eventually figured out where this traffic was coming from, they did have a way to filter it out. So, going back to the engineering side of the response to this malformed traffic that they were able to take action on, it sounds like there was some kind of traffic layer that helped them, right?
Tom: [00:20:02] Yeah, that’s just so nice to be able to turn things off. If you have a sufficiently complicated application, just to be able to say, stop running this, just fail it. Sometimes you have to damage one thing to save everything else, but you have to have built that functionality beforehand.
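The “turn things off” capability Tom describes usually takes the form of a blocklist consulted in front of the application, something operators can flip at runtime without a deploy. Auth0’s actual traffic layer is not described in the post-mortem, so this is just a hypothetical sketch of the idea:

```python
# Runtime-mutable blocklists, checked before a request reaches the app.
# In a real system these would live in shared config or the load balancer.
BLOCKED_IPS = set()
BLOCKED_PATHS = set()

def gate(client_ip, path):
    """Return an HTTP status: 403/503 if blocked, else 200 (handled normally)."""
    if client_ip in BLOCKED_IPS:
        return 403
    if path in BLOCKED_PATHS:
        return 503  # deliberately fail this endpoint to save everything else
    return 200

# Mid-incident, an operator flips the switches without a deploy:
BLOCKED_PATHS.add("/api/endpoint-missing-its-index")
BLOCKED_IPS.add("203.0.113.9")
```

The point is that the hooks exist before the incident; building them during an outage is far harder than flipping them.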
Jamie: [00:20:19]. Yep. So what else, Tom?
Tom: [00:20:24] Let’s see. I liked that they had the secondary cluster of application servers. That tells you that they’ve thought ahead of time about the different failure modes they might run into. I’m speculating that it was in a different availability zone, but that would make the most sense to me. And if you have a service you need to keep running, you should have it spread out across multiple availability zones, because I don’t think Amazon makes any promises that they’re always going to keep any one AZ up all the time.
Jamie: [00:20:53] Yeah, and if you look at the timelines here, which we can talk about in the “could have gone better” section, it took them quite a while to fail over to this other cluster. I’m guessing it was there for a different purpose, as you say, Tom: probably AZ isolation, for physical issues, weather, an Amazon misconfiguration that takes out the network in an AZ, whatever. So they probably had this rainy-day secondary cluster, and they ended up using it for a different, unanticipated purpose here, trying to isolate a software problem. But they had it ready, and it was a resource they could utilize to help solve this problem.
Tom: [00:21:38] Also, just sort of a jokey item, but Auth0 sold, or is in the process of selling, for $6.5 billion to Okta, I think. So even if your startup has a really bad day and everything melts down, the future may still be bright if you write a good post-mortem and publish it to the internet.
Jamie: [00:22:00] That’s right. This wasn’t their brightest day, but things seem to have worked out okay for them. So you know, you can survive your bad days.
Tom: [00:22:11] Yeah. Always good to keep in mind. All right. Let’s talk about some of the things that could have gone a little bit better. So what do you think, Jamie?
Jamie: [00:22:19] Yeah. So one thing, if we think through this extended timeline in sequence, one thing early on is that you really shouldn’t have database migrations fail silently. There’s definitely something here about how it’s difficult to figure out what’s going wrong when the problem shows up much later than the cause. So that seems like an issue: if there’s something you deploy, whether it’s a code change or a configuration change or a database alteration or whatever, and a piece of that deployment does not succeed, that should be something you know about when it fails.
Tom: [00:23:13] Yeah, absolutely. That’s just so key. Like I said, if you’re adding indexes, you need them, and if one just fails and you don’t know about it, you’re basically running in a weird, untested state at that point, and any number of things could happen. You might start getting corrupted data, just because you were expecting some unique constraint to be there, and that could be a huge pain to clean up. One of the best-case scenarios, if you don’t add an index you were expecting, is that things just get slow and you can figure it out and fix it. Indexes are important stuff, so it’s a bad sign that one was able to fail without any alert going off.
Jamie: [00:23:59] Yeah, certainly. As you said, there could have been new uses of the database layer, constraint-type situations that are protecting you from bugs or something. And if one of those had failed, it could be really sinister, right? It could take you years to figure out that you’ve quietly corrupted some data, and at that point it’s really hard to wind the clock back and figure out how to fix it. So you definitely want to know if this thing didn’t work.
Another thing to call out here is that database queries were allowed to take arbitrarily long amounts of time. There are some notes in the action items about adding database timeouts, which suggests they did not have them before this. I will say with some confidence that if you have a production database, you need a non-empty value for the longest query you’ll allow to run on it. For an analytics database, or maybe your secondaries, you might loosen that or not care. But for your primary, servicing traffic with an uptime and availability requirement, you have to define some value for the longest query you expect to run on that system.
Tom: [00:25:24] Yeah if you expect a query to take 10 milliseconds and it’s taken 10 seconds, I mean, it might as well have failed at that point. So just kill it. Send the 500, right. Odds are, you’re going to get a much better and tighter alert on that anyway.
Jamie: [00:25:40] Or maybe all your traffic-servicing queries are totally fine, but someone doing an analytics job accidentally connected to the primary, and they’re running some big table scan on your system that they didn’t mean to. So you should have that job just die, and they go, oh, what happened to my query? And eventually they’ll discover they’re on the wrong database. So you definitely want to cap the worst-case load that any single query can put on your database.
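The query-timeout idea is straightforward to sketch. SQLite has no built-in statement timeout, but its progress handler can abort a query that exceeds a deadline, which makes for a self-contained demo of the principle; production databases expose first-class settings for this (for example, a per-statement or per-session timeout) rather than requiring a handler like this:

```python
import sqlite3
import time

MAX_QUERY_SECONDS = 0.05  # absurdly small, just for the demo

conn = sqlite3.connect(":memory:")
deadline = float("inf")

def watchdog():
    # SQLite calls this periodically; a nonzero return aborts the query.
    return 1 if time.monotonic() > deadline else 0

conn.set_progress_handler(watchdog, 10_000)  # check every ~10k VM ops

def run_with_timeout(sql):
    global deadline
    deadline = time.monotonic() + MAX_QUERY_SECONDS
    try:
        return conn.execute(sql).fetchall()
    except sqlite3.OperationalError:
        return None  # query killed; a real app would return a 500 here
    finally:
        deadline = float("inf")

conn.execute("CREATE TABLE t (n INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(2_000)])

# An accidental triple cross join: the kind of runaway scan that would
# otherwise hold the database hostage for a very long time.
slow = run_with_timeout("SELECT count(*) FROM t a, t b, t c")
fast = run_with_timeout("SELECT count(*) FROM t")
```

The runaway query gets killed within the budget while the quick one completes, which is exactly the behavior Tom describes: might as well fail fast and send the 500.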
I guess this is related to a meta point, but databases are kind of my background, so I’ll throw this in there: databases are happier with short queries, and not even linearly happier, superlinearly happier. It’s much better to try to make sure your database is doing as close to point queries as possible.
Or if it’s going to be doing range queries, have them hit an index where things are adjacent, especially if you’re on spinning media, which you may or may not be depending on your deployment. Generally speaking, you want queries to be fairly fast on a production database, and if there are things you need to do that are big scans of data, try to put those onto secondaries, or onto analytics databases or data warehouses or whatever. Try to find some way to move them off of the traffic-servicing database, where your customers’ performance and availability are going to be impacted.
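The point-query-versus-scan distinction is easy to see in a query planner. Sketching it with SQLite’s EXPLAIN QUERY PLAN (the table here is made up): before the index exists, the planner falls back to a full scan; afterward, it does an index search:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logins (user_id INTEGER, ts INTEGER)")

def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row describes the strategy.
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

before = plan("SELECT * FROM logins WHERE user_id = 42")  # full table scan

conn.execute("CREATE INDEX idx_logins_user ON logins (user_id)")
after = plan("SELECT * FROM logins WHERE user_id = 42")   # index search
```

On a big table, that is the difference between work proportional to the whole table and work proportional to the handful of matching rows, which is exactly why a silently missing index turns into a CPU fire.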
Another thing to point out, also related to the databases: when the secondary started lagging and they were getting alerted about it, they started resyncing the secondary before they completely understood what was happening or whether that would make things better or worse.
Now, I can certainly understand the impulse, right? Oh my gosh, we need a replica, because what if we lose this machine and lose our data? So I understand that data-loss type of thinking, where you really want to make sure you have a secondary. But if you do that too fast, that replication will probably just fail anyway, because you haven’t yet understood why it broke down the first time.
Tom: [00:27:52] Yeah, definitely. It’s really hard to make any comments on this without knowing what their playbook looked like, and it’s completely possible they hit a problem like this once a week and they just hit resync and things are fine.
And this was just the one time where that wasn’t the situation. Although if you’re resyncing your database once a week, you should probably look at what you’re doing; I don’t know if that’s a good thing anywhere. But without knowing how much they’ve had to do this for other reasons, it’s hard to say whether they really should have dug into it more. It also sounds like, were they resyncing multiple databases at once?
Jamie: [00:28:27] Yeah, it’s a little hard to tell from the language, but it sounds like they might have been.
Tom: [00:28:35] That may not be something you want to do unless you’ve done it before.
Jamie: [00:28:40] Yeah. I guess the last point I would make on the database-specific side of things is that their CPU usage had to go way up when this index went away, even in advance of this incident. So there are maybe two different thoughts here. One is a broad thought: realistically, across all your machines, you should just have CPU usage monitoring, because things can go wrong anywhere. If it goes wrong at the application layer, maybe you have a bug in your code that made something really expensive; we just talked about Cloudflare recently, where a change to their filtering got really expensive. If something goes wrong in your application layer and your CPU usage changes dramatically, you should know about it and know whether that’s what you meant. And certainly at your database layer as well, if the CPU usage spikes, you should know about it, for sure.
The other thing I would say about databases, and this gets back to measuring the thing you mean, which we talked about in a previous episode when we discussed whether you should alert on backups or on restorations: the definition of what a database does is that it services requests. People send queries, and it gives responses to those queries. So let’s say we only had CPU monitoring on the database, and let’s say what actually happened is the disks filled up. A lot of file systems, as they approach 90 or 95 percent capacity, start to run into breakdowns; performance gets significantly worse as you get close to disk saturation. What may actually happen is the database slows down, not because of CPU usage, but because the IO subsystem is taking longer to fetch records off the disk, or something like that. So it’s another instance where, for a database, you should definitely have monitoring around your query response times. You want to monitor those other, causal things too, but whatever the reason, if your database starts slowing down, that’s the problem. That way you have pretty broad coverage of the definition of what the service is, and of when it is no longer sufficiently providing that service. For a database, a safe way to characterize that is the request success rate, plus whatever different cuts you want to take of the response time. If those start to get way worse, you should go look into it, because that probably means something went wrong.
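Monitoring “the thing you mean” for a database reduces to watching response times and success rate, whatever the underlying cause turns out to be. A toy sketch; the p99 budget and error-rate threshold are invented numbers:

```python
def percentile(samples, p):
    # Simple nearest-rank percentile, good enough for a sketch.
    ordered = sorted(samples)
    idx = min(len(ordered) - 1, int(p / 100 * len(ordered)))
    return ordered[idx]

def evaluate(query_times_ms, failures, total):
    # Alert on the service's actual definition: how fast and how reliably
    # it answers queries, regardless of CPU, disk, or missing indexes.
    alerts = []
    if percentile(query_times_ms, 99) > 250:  # invented p99 budget, in ms
        alerts.append("p99 latency over budget")
    if total and failures / total > 0.01:     # invented 1% error threshold
        alerts.append("success rate degraded")
    return alerts

healthy = evaluate([5, 7, 9, 12, 40], failures=0, total=1000)
degraded = evaluate([5, 7, 9, 12, 9000], failures=50, total=1000)
```

Either symptom fires regardless of whether the cause was a full disk, a lost index, or a replication storm, which is the whole point Jamie is making.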
Tom: [00:31:22] Yeah, definitely. Alerts like this can be hard to tune, because your traffic might come and go, but I would be really curious to know what their alerting story on CPU and query time was like before this incident. I’m sure it’s way better now. It’s not just alerting, though; you really want to be able to visualize this sort of thing, because if you imagine the graph, I’m imagining there was a pretty clear stair-step between before the push that messed up the indexes went out and after. Anyone, even without any experience as an engineer, could look at that and say, something changed, that looks problematic, and then go, oh look, this lines up with a push. If you’re in a really good state, you can have your graphs overlaid with other events, like pushes or deploys. And one step beyond that is having graphs of where particular requests are spending their time, so you can see that this request used to take one second, spending a hundred milliseconds querying the database, and now it’s taking three seconds, spending 2.1 seconds waiting on the database. Clearly what regressed is the database.
Jamie: [00:32:45] This is tracing-type capability, and those things are great. On the database getting slower and that whole stair-step point, sometimes what's fun to talk about is the advanced versions of these things too. It probably doesn't apply to most companies, but it's useful to think about something amazing that huge companies will do if they have the resources to invest in it.
So one of the things that's interesting is that every time you make a change to your production environment, there's a whole bunch of lines that move a little bit in response. As your company gets bigger, it gets to the point where there are so many lines that it's hard to check all of them. Similarly, it's hard to anticipate all the threshold setting and alerting you should set up for all of those lines; a lot of them might be derivative of each other, so you don't need to monitor every one.
But one thing that's kind of cool, an advanced thing I know a lot of companies are looking at and some of the biggest have probably implemented versions of, is automated anomaly detection: systems that look for trend breaks, especially if they can do it in canary-type deployments. Then, as part of the deployment process, a release manager sees a summary like, these lines went up an odd amount, and that person has to actually approve those changes: we knew that was going to get slower, that's what we meant, yes. So there are definitely some versions of this out there.
Again, if you're getting into the gold standard of things that are awesome when it comes to these lines, you always want a human to be able to see them. But there also arrives a day when there are too many lines for a human to look at all of them, and then you can start to automate some of that and have systems look for anomalous changes across all of these things.
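A rough sketch of the automated trend-break detection Jamie describes for canary deploys: scan every metric series and flag any whose post-deploy mean has moved more than a few standard deviations from its baseline. The 3-sigma threshold and the data shapes are illustrative assumptions; real systems are considerably more sophisticated about seasonality and noise.

```python
from statistics import mean, stdev

def flag_trend_breaks(metrics, sigma=3.0):
    """metrics: dict mapping metric name -> (baseline_samples, canary_samples).

    Returns the names of metrics whose canary mean deviates from the
    baseline mean by more than `sigma` baseline standard deviations.
    """
    flagged = []
    for name, (baseline, canary) in metrics.items():
        if len(baseline) < 2 or not canary:
            continue  # not enough data to judge
        base_sd = stdev(baseline)
        if base_sd == 0:
            # A perfectly flat baseline: any movement at all is anomalous.
            if mean(canary) != mean(baseline):
                flagged.append(name)
            continue
        z = abs(mean(canary) - mean(baseline)) / base_sd
        if z > sigma:
            flagged.append(name)
    return flagged
```

The release-manager workflow Jamie sketches would then present this flagged list for a human to approve or reject before the deploy proceeds.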
Tom: [00:34:35] Yeah, that actually ties in really well with another point I wanted to make: it sounded like they didn't have any great systems here for identifying anomalies in their traffic. They weren't really able to cut everything up by customer ID or by remote IP address and see what's weird about the requests that are slow. There are tools now that claim to make this a lot easier than it used to be. I've been evaluating a tool called Honeycomb. We haven't put everything on it yet, but I've been going through some of the demos and taking a look at it. It has a really fascinating approach where it shows you all of the events in your system in a heat map format, with time on the X axis and latency on the Y axis. Then you can just drag a box around outliers, like a set of events that are a lot slower than everything else and standing out as dots up above.
You can just select those, and it will show you what is most statistically distinct about those requests. Attached to each event, I guess they use the term event, you have things like: what code version is it running, what IP address is it coming from, what is the customer ID, what is the route, whatever might cause a different pattern in your traffic. Once you select the anomalies, it will show you what's interesting about them, which of these event attributes they all share, or most of them share, compared to all the other traffic around the same time. With something like that, it would be pretty easy to see, oh look, all these things seem to be coming from the same IP address. That could have taken hours out of this investigation, because ultimately one of the big sources of their load was anomalous requests from one customer and one IP address.
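A toy version of the workflow Tom describes, not Honeycomb's actual algorithm: take a selected set of outlier events, each a bag of attributes, and rank which attribute values are most over-represented in the selection relative to the rest of the traffic. The scoring (difference in frequency share) is a deliberate simplification.

```python
from collections import Counter

def distinct_attributes(selected, baseline, top_n=3):
    """Each event is a dict of attributes, e.g. {"ip": "1.2.3.4", "route": "/login"}.

    Returns up to top_n (attribute, value, score) tuples, where score is how
    much more common the value is among selected events than among baseline.
    """
    def shares(events):
        counts = Counter((k, v) for e in events for k, v in e.items())
        return {kv: n / len(events) for kv, n in counts.items()}

    sel, base = shares(selected), shares(baseline)
    scored = [(k, v, sel_share - base.get((k, v), 0.0))
              for (k, v), sel_share in sel.items()]
    scored.sort(key=lambda t: t[2], reverse=True)
    return scored[:top_n]
```

In the Auth0 scenario, the slow events would all share one IP-address value that almost none of the baseline traffic has, so that attribute would score near the top immediately.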
Jamie: [00:36:37] Tools like that are amazingly powerful, for sure, especially for the human-guided parts, where a visualization like that is sometimes just the most direct way to discover: hey, most things are over here in this point cloud, and then there are these funky ones in a separate little cluster. It's usually a bimodal thing, where all of these others are over in this other weird cloud. The fact that the tool you just described, Honeycomb, has the capability to just drag a box around those and say, show me the attributes of those, and show me where those attributes are distinct.
Like highlight that. That’s amazing.
Tom: [00:37:22] A lot of times with these perf variations, there's not anything really magic happening. It's the customer who has the 500 megabyte file, or it's the team that has 20 times more whatevers than other people, right? You always have these outliers, and being able to see where they stand out is helpful. You have a lot of neurons in your visual cortex; being able to pull those into these problems is really helpful.
Jamie: [00:37:55] The way to describe which ones are interesting versus not: things that look like they're following a standard distribution are probably just physics, right? You're going to have a tail, and if it looks like a standard normal tail, fine. Things that are modal are usually where it gets interesting: you have your normal tail, and then you have a cluster at the end of the tail. That's weird. That probably means something funky is happening. Those are interesting. And your eye is able to tell that really quickly, right? It's not just a tapering off of the tail; there's this extra mode clustered at the tail, and what the hell is that?
Tom: [00:38:38] You’ll always learn something if you investigate a bi-modal distribution. There’s always something that’s causing it — it’s not natural. It’s some timeout or some cap or something like that.
Jamie: [00:38:52] Cool. Well, let’s see what else do you have, Tom?
Tom: [00:38:57] I think we have to point out that this was pretty unlucky, just to have all this stuff happen at once. There's absolutely a world where, even though that migration failed silently, somebody looked at the site, said, oh, this feels slower, went and looked, saw why they were burning all that CPU, and fixed the index issue an hour after it went out, versus having both of these things happen at the same time. They also didn't catch a great break on the timing. I think this thing really started around 8:00 PM Pacific and didn't wrap up until about 4:00 AM, and man, it just sucks at 2:00 AM to be unsure of what's going on, worried about breaking stuff, with everybody tired. It's just not fun.
Jamie: [00:39:46] Related to that, one thing I would point out, and these are always fun to talk about because it's more about the meta layer, the corporate layer, the culture layer: this timeline is really long. It's one of the things we have to point out. It's eight hours, and there were a lot of movements during those eight hours, and realistically, your company needs to be able to move through something like this faster than that.

The reason is not just how I feel about it. There's a way people box in their heads how bad it is when something happens. Bad things happen to systems all the time; you have an exception log, and I guarantee you it's not empty. But bad things happening for 10 seconds isn't even a conversation with the customer; people shrug their shoulders, refresh the browser, and the world goes on. Bad things happening for five minutes is different. Bad things happening for two hours is different again. These things are kind of modal too: it's oh, that kind of sucked, versus well, I didn't get anything done this morning, versus should I still use this company?

So when you start getting into the eight-plus-hour range, those incidents are about as serious as it gets. That landed in the meat of the day for a lot of international customers, and they're a global company, so that's a work day those customers were out; the whole day they didn't have service. There are obviously a million small cuts that get you there, and it's easier to say that than to write a simple prescription for avoiding it, but you definitely need to have conversations afterward about the overall timeline here.
As an operational culture, why did it take us so long to move through all these things? Are there meta issues around how much we've invested in this, issues that aren't just about a single thing we should change, maybe even around our confidence as an operational team? We don't know; we weren't in the room. But each one of these steps took quite a while to move through, and you do need to acknowledge that you have to build to operate a little tighter than that. And these aren't even just technical issues. Sometimes they're process issues or personnel issues, where the right people aren't in the room, or people are being indecisive because they don't know if they have the authority to make decisions. Eight hours is in the category of a fairly serious incident for a company's reputation.
Tom: [00:42:33] The nice thing is that it doesn't sound like they were totally down for a big chunk of this time. Again, they don't actually say what any of the error rates were, and there weren't any graphs, so it's totally possible that 98% of requests were successful and they just consider a 2% error rate over that many hours to be a big outage. It sounds like the kind of thing where if you retry a few times, it might work. But yeah, I definitely agree that the length of time is such a huge factor here, because even if you make things very bad for a small amount of time, you're just not going to affect that many people, whereas moderately bad for a longer time is going to hit a lot more of your customers, statistically.
Jamie: [00:43:21] Not only that, a lot of it is also about the PR, right? If you have small error rates, you might get some engineers talking to each other, and a few people tweeting about it if you go down for a short amount of time. If you're out for a couple hours and you're a reasonably high-profile internet service, now the press might talk about it a little bit. If you're out for a day, you're going to start to have CEOs of other companies reach out and say, I'm not sure I should still be a customer of yours.
Tom: [00:43:50] There’s also like, let’s talk about our SLA agreement, right?
Jamie: [00:43:52] There's also this escalating awareness of how much it's something everyone is talking about. But yeah, you're right, we actually don't know how serious this was. What I look at more than that is, when I read this post-mortem, I see steps I recognize. I've been on teams that have walked through those same kinds of steps, and in this instance the steps were reasonable, but the timelines were quite a bit longer than operational teams normally take when trying these things. So there's something worth examining there. Again, this is in the rear-view mirror; this was three years ago, and Auth0 has since been acquired. But for any given company, I think it's worth asking: are we happy with our time to resolve here?
And are there reasons, beyond a specific technical problem, that are causing us to move too slowly or too cautiously through these debugging phases?
Tom: [00:44:52] Yeah, and the last thing I would say, which ties in with that, is: do some DRTs. They don't mention these at all in the post-mortem, but they had some kind of weird load balancer configuration error when they were switching back and forth between primary and secondary, which is exactly the kind of thing a disaster recovery test is there to shake out.
Problems like that happen, but it's much better for them to happen when everybody is looking at all the graphs, everybody is caffeinated and awake, and you've potentially already told your customers this is a maintenance window. Having some practice failing back and forth is going to speed everything up. I'm guessing they didn't do that a lot. I'm guessing there was a lot of discussion about whether they should do it, whether it was worth it, how to do it, can you get so-and-so online, they were the last person to do it, things like that.
Jamie: [00:45:51] Yep, I agree completely. Makes sense. DRTs have come up in previous episodes, and it feels like they would have helped here as well.
Tom: [00:46:02] All right. So I think that’s about it for today. So again, if you like the show, please go leave us a review and a rating on Apple podcasts or wherever you get your podcast. We’d really appreciate it. Also, if you want to follow us, we’re @sevreview on Twitter. And that’s all for today. Thank you very much, folks.
Producer: [00:46:27] Thanks for listening to The Downtime Project. You can subscribe to the podcast on Apple, Spotify, or wherever you listen to your podcasts. If you'd like to read a transcript of the show or leave a comment, visit us at downtimeproject.com. You can follow us on Twitter @sevreview. And if you like the show, we'd appreciate a five star review.