Just one day after we released Episode 5 about Auth0’s 2018 outage, Auth0 suffered a 4 hour, 20 minute outage that was caused by a combination of several large queries and a series of database cache misses. This was a very serious outage, as many users were unable to log in to sites across the internet.
This episode has a lot of discussion on caching, engineering leadership, and keeping your databases happy.
Tom: [00:00:00] Welcome to The Downtime Project, where we learn from the Internet’s most notable outages. With me is Jamie Turner. Before we begin, I just want to remind our listeners that these incidents are really stressful and we peel them apart to learn, not to judge. Ultimately Jamie and I have made, and we will in the future, undoubtedly, make similar mistakes on our own projects. So please view these conversations as educational, rather than a judgment of mistakes we think we would never make.
Today we’re talking about an Auth0 outage that happened just a few weeks ago on April 20th, 2021. Before we get into the show though, we’ve got a few updates.
I was off last week getting a fresh install of some Pfizer microchips, but it was really interesting to see some of the discussions that were happening on Twitter. Chris Evans, who was or is the technical director of platform and reliability at Monzo listened to our show where we talked about an outage he was heavily involved in and he seemed to enjoy the show. He said it was “really interesting hearing a third party analyze your incident based on a public post-mortem. It’s a good analysis too, pragmatic, avoids hindsight bias and counterfactuals, and puts equal focus on the positive capacities.” So thanks Chris. That’s exactly the thing we’re going for here. And we’d love to hear from folks that are actually involved with these outages because we try to get a lot right here, but it’s always great when people chime in and tell us what we could have done better. It looks like Chris is working on a tool for incidents called incident.io. I haven’t signed up for it yet, but it says you can “create, manage, and resolve incidents directly in Slack and leave the admin and reporting to us”. That sounds super interesting. It’s really nice to see these processes that typically only get spun up at larger and more polished companies get turned into tools that everyone can use. So it’s nice when that kind of trickles down to the rest of the world. So I’m very interested in checking that out.
Also, I am still working on my startup Common Room. If you’re interested in joining a fast growing startup that’s working on some really interesting problems, come find me on LinkedIn or check out our jobs board commonroom.io. And that’s it for me. Jamie, what about you?
Jamie: [00:02:21] Yeah. Well, one thing Tom and I always want to reiterate is requesting ideas for future episodes. So keep those coming. If there are outages you think it would be fun to hear us talk about, send them our way. You can do that on Twitter, or you can do that on the website by leaving a comment. We'll read all of them and we will incorporate your ideas for sure. We've already done that several times, so keep them coming.
And then yeah, while you're there, don't forget to rate and review the show. It makes a big difference in folks discovering us. And follow us on Twitter. We have updates going out there whenever we have a new episode or when we have an engagement from the community on an episode. Finally, the last thing is I actually want to plug my startup this time, so it's not just Tom.
So my startup is called Zerowatt, which is zero, like the number zero and watt like power. We are actually just about to begin hiring an initial team, looking for some engineers that are excited about building platforms that could power the future of internet companies. So the kinds of platforms that would help have fewer of these kinds of incidents. So if you’re excited about building systems that can run companies, then come check us out. Make sure to apply to Tom’s startup too, but you can find us at jobs@zerowatt.io, and that’s about it.
Tom: [00:03:54] All right, Jamie. So today we are talking about Auth0 again. So what’s the background here?
Jamie: [00:04:02] Yeah, Auth0 is, as we discussed, an authentication service provider. And the day after we released Episode Five a couple of weeks ago that discussed their 2018 outage, Auth0 went down again. So we did not plan this; we don't think we had anything to do with it. We think it was a coincidence. And even though it feels like these two episodes are pretty close together, and they are, that other outage was from three years ago. So it's just kind of an unfortunate coincidence in this case, having Auth0 go down so soon after.
But this outage, this time, was pretty significant. Lots of sites on the internet were not allowing login. Auth0 is a very successful service for a reason. It does a great job providing the right kind of abstractions you want for authentication. But unfortunately, it did mean a whole lot of websites that rely on it were not able to log users in, and this all started about 8:30 AM Pacific on April 20th, as Tom mentioned. One of the pieces involved here, alongside Auth0's databases and a caching component we're going to talk about, is something that they refer to in their post-mortem as a feature flag service.
Tom and I put a little work into understanding what variant of this service they mean here. There are a few different things it could be; it's not completely clear from the post-mortem, at least to us. One example of a feature flag service is one used for experimentation or gradual rollout. So if you have some new capability that you want to launch on your website, you can use a feature flag service to turn it on for certain populations, either A/B against something else or by slowly growing the percentage of people exposed to it, to make sure it works well and that the customer behavior is what you expect. It could be that, but their language actually talks about the feature flag service providing the front end API with configuration values for tenants. So in this case, it almost sounds like it could be more basic configuration, or something we would probably call an entitlement service, where, depending on the account type and things like that, different product capabilities are turned on. So we're not exactly sure what this was enabling, but more or less, it's a service that has the application modify its behavior at runtime a little bit based on the values that are in there.
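For illustration only, since the post-mortem doesn't say which variant this is, here is a minimal sketch of the two flavors Jamie describes: a percentage-based rollout flag and a tenant entitlement check. All of the names and fields are hypothetical and don't reflect Auth0's actual service.

```python
import hashlib

# Hypothetical sketch of the two "feature flag" variants discussed above.
# None of these names reflect Auth0's actual service.

def rollout_enabled(flag_name: str, user_id: str, percent: int) -> bool:
    """Gradual-rollout flag: deterministically bucket each user into 0-99."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < percent

def entitlement_enabled(tenant_config: dict, capability: str) -> bool:
    """Entitlement-style flag: the capability list comes from the tenant's plan."""
    return capability in tenant_config.get("capabilities", [])

# Example usage:
# rollout_enabled("new_login_flow", "user_123", percent=10)
# entitlement_enabled({"capabilities": ["sso", "mfa"]}, "sso")
```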
Tom: [00:06:41] Yeah, that sounds right to me.
Jamie: [00:06:44] This is a post-mortem where we are going to have several checkpoints again, where we're going to make some guesses or speculations just to provide some structure to attach the exploration to. There are some details missing in the post-mortem, so in some places we're not exactly sure what's behind the timeline. So be patient and bear with us as we walk through this and speculate a little on what they might mean, just to provide the right kind of attachment points to talk about the outage. So let's get to the timeline.
Tom: [00:07:27] This all got started in the morning at 8:27 PDT. So everybody’s probably getting up, getting to the office, or I guess getting to the home office, or I don’t know if Auth0 is remote or what these days. But anyway, early in the morning starting the day, they get a flood of alerts for login failures, memory usage, database connections, and just errors. So three minutes later, they started their incident process, and they observed they were seeing poor performance across various services. So I assume they have some kind of microservice architecture where there’s lots of different services running, but when you start seeing bad performance across lots of them, you should look for what’s in common, which is usually the database. So there was some sort of glitch. They knew that query times were up, but they weren’t getting any metrics from the database. But they did observe that the autoscaler had gotten very ambitious and scaled up the number of front ends, API front-ends not web front ends. In this case, it had jumped from 37 to 100 servers, which I’m guessing is the maximum because that’s a nice round number. So at 8:45, about 18 minutes in, they announced this as a major outage. A few minutes later, they realized that–and this is sort of just the dark comedy that happens sometimes with outages–three minutes later, they realized that their status webpage status.auth0.com was returning errors.
They have a separate section for this in the post-mortem, and we can just take care of this. Apparently this was the first major outage they’d had since they revamped this page last year. They got it back up and running in about an hour, but the core problem was that they were powering this page with a provider that they were making an API call to on every page load.
And what do you know, when your very popular service goes down, tons of people know to hit your status page. And they got rate-limited calling the API of their provider. As soon as they realized this, they were able to update their DNS and just point to the actual stock provider page about the outage. And so this was back up and running in about an hour, but I can only imagine the extra level of stress that this caused when they were already dealing with a major outage.
Jamie: [00:09:39] Yeah, definitely. You definitely want a way to communicate with your customers when something is down to at least confirm their suspicions that you are having issues and you are aware of it and you are working on it.
Tom: [00:09:54] Yeah. And Auth0 was really active on Twitter this whole time, so that wasn't a huge problem–more just kind of an "oh geez, it's going to be one of those kinds of days, huh." Okay, about 30 minutes in, they disabled database promotion, and they rolled back to the previous week's code. Now this is where we have to jump in and speculate a little bit: they said they rolled back to the previous week's code, not the previous version. So I don't know if they did that because there were a bunch of changes that were all suspicious that they wanted to jump past, or if the previous week was the previous version. That's a little unclear.
Jamie: [00:10:36] It could imply that maybe they deploy once a week, or as you said, maybe for some reason they decided to jump back to an earlier version, but who knows. Either of these is possible, but that's what they did. There's also this thing about disabling database promotion, and the post-mortem doesn't necessarily say exactly why they did that. One reason we could imagine is that they started to assume the database load was not due to a hardware issue; it seemed as though real database load was starting to slow the database down. You wouldn't want your database automation to start flipping the primary around, because you could imagine automation misinterpreting an overloaded primary as a down primary. So just to simplify their world at this checkpoint, it probably was a smart move to disable promotion, so that you weren't chasing the primary around erroneously and introducing additional load to the system with those failovers.
Tom: [00:11:41] Ultimately the problem wasn’t hardware or anything particularly wrong with the primary. So yeah, just keeping the problem simpler and not having databases changing in and out definitely is a good move. So the next step is they squelched the autoscaler a little bit, told it to calm down, reduced the maximums and targets for it, which again is a good call because you don’t want that just changing things at this point. Every human that’s going to be looking at it is probably looking at it, and you don’t need the automation causing more confusion.
So 43 minutes in, they start getting database metrics again. I guess they fixed whatever was wrong, and they must have some kind of slow query log. They don't explicitly say this, but at 43 minutes in, they have some team members start looking at concerning queries. So if it wasn't already clear, now that they've got their metrics back, the issue is obviously that the database is overloaded. At this point, if they weren't already treating this as a huge deal, I'm sure they were now, and they get all hands on deck. This is a company with hundreds of people, and I think a lot of them were able to start looking at this as well, which is kind of its own challenge: you want to keep your core team focused on this, but if other people have ideas or the ability to contribute, you want to leverage that as well.
Jamie: [00:13:03] Yeah, I mean, one of the methods that we've seen used in different companies, including companies we've worked at, is to have roles like a tech lead and an incident manager. The incident manager would be responsible for, I would say, the upward and sideways communication: this is what's happening, let's make high level calls on what we're calling it, et cetera. And then the tech lead is the one that's heads down on fixing the problem. One of the incident manager's responsibilities is to protect the tech lead from people asking "how's it going, how's it going?" The incident manager can answer those.
And then when the tech lead needs something like, oh, hey can you loop in the database team? Can you loop in the network team? Like they can just say that and the incident manager goes yep, yep and goes off and pulls more people into the team that’s working on the incident. So the tech lead can kind of stay heads down, can kind of subdivide responsibilities on like, I’d like somebody with expertise on this to be looking into this, and kind of helping to coordinate some of that work.
And the incident manager can be someone that’s helping to basically enable the tech lead and communicate about what’s going on, communicate with legal, communicate with any messaging or comms going out, and making really high-level decisions like how are we going to resource this, should we get more people that kind of stuff, and just saying to the tech lead, what do you need, should we do this instead. Just kind of being the point communicator with the person that is leading the effort to fix it, but protecting that person that’s trying to really concentrate on what’s wrong is really important. And it’s nice to have that pairing so that the tech lead can really stay in flow, trying to figure out what’s going on, and delegating out pieces of that diagnostics to a team that they’re working with.
Tom: [00:14:47] There's probably a temptation to have a huge Zoom, which is probably kind of a failure mode of dealing with this stuff in a purely remote world. Because if you have a hundred people on a Zoom, unless 90 of them are keeping quiet, that's potentially going to be a distraction. So that's actually an interesting question: how do you best run these when you're in this purely remote world?
But yeah, in a situation like this, now that you've identified this core issue with the database, you probably want to have a lot of parallel investigation happening. Anybody that had checked code in in the last week, you probably want investigating in parallel whether their change potentially affected the database. But in terms of actually changing things, you do want that to be highly centralized, because you're already in a situation where you don't know what's going on, and the temptation to start changing things is powerful. If you have multiple people flipping switches and changing settings, you're just going to create more chaos and make it harder to figure out what's happening.
Jamie: [00:15:41] Yeah, that's a good example. So in that circumstance, you could see the tech lead say, or the incident manager say, "Ah, we should probably have anybody that's changed code go look at their code and see if it could contribute to this." And the incident manager can say, "Okay, cool, let me go grab all these people." And the incident manager, who has authority by virtue of being in charge of the critical thing the company is trying to fix right now, just goes and grabs people and says, "Hey, go look at this, go look at this." And then the tech lead can stay heads down on "I'm going to go look at this thing" or whatever, leveraging that pairing to have one person put things into operation while the other person stays focused on the deep problem of what's actually going on. And then that data can feed back: "Oh, hey, they all looked at the code. This one kind of jumped out as suspicious." And suddenly that resource becomes available to the tech lead to incorporate into the thinking about what to change.
Tom: [00:16:40] Yep, so at about 98 minutes in, they've identified one of these problematic queries, and they start adding an index to the database. I mean, this is very standard. You end up with a query that is slow because it's doing a big table scan. You add an index, and you can easily get a 1000:1 or 10,000:1 performance increase by adding the right index. Unfortunately, adding an index takes a little while. This one was actually pretty fast, just 13 minutes, but unfortunately it did not help. So they spent about 30 minutes on that, and I'm sure it helped something, but it didn't actually get the system unstuck.
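Since we believe from the last episode that Auth0 runs MongoDB, here is a sketch, and only a sketch, of what adding an index for a scanning query might look like with pymongo. The connection string, collection, and field names are all made up for illustration.

```python
from pymongo import MongoClient, ASCENDING

# Hypothetical example: adding an index to fix a scanning query.
# The connection string, collection, and field names are invented.
client = MongoClient("mongodb://localhost:27017")
sessions = client["auth_db"]["sessions"]

# Before: a query like this scans the whole collection if tenant_id isn't indexed.
# sessions.find({"tenant_id": "acme", "expires_at": {"$gt": now}})

# A compound index turns the scan into a cheap B-tree lookup.
# (On older MongoDB versions, background=True avoids blocking writes while the
# index builds; 4.2+ builds indexes without a long exclusive lock either way.)
sessions.create_index(
    [("tenant_id", ASCENDING), ("expires_at", ASCENDING)],
    background=True,
)
```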
So at this point they start looking at other ways to reduce the load, to let the system recover, because there are just too many queries, too much load hitting the database for it to actually come up and serve anything successfully. That's a pretty standard failure mode for databases: if you're hitting them with too much traffic, they can't complete anything successfully, so the work they do for each query is wasted, queries time out, and they're just not successfully returning any good rows. So they have to figure out a way to take the load off so that things can start making progress. Two hours in, there's another note about promoting a replica to primary, and it doesn't say why they did this. I could speculate that they might've done that if they were worried about bad hardware, potentially, but it's just pure speculation. I don't know. Jamie, any thoughts?
Jamie: [00:18:15] Maybe if at this point, if they were like, “we tried a few things, maybe something is wrong with that machine,” then maybe, just to try it, they say, “well, let’s move the primary somewhere else and see if somehow something was going wrong on that hardware.” So, yeah, I agree with you. That’s one thing that occurs to me as why you might try this at this time.
Tom: [00:18:35] About the same time, they start doing something they call "adjusting the auto scaling groups to cause front end API nodes to recycle again." It's not really clear what they're doing here. It doesn't sound like they're reducing the total number, and I don't know how this is different from just restarting things. "Recycle" may just have a particular meaning to them that I'm not aware of.
Jamie: [00:18:59] I sort of wonder if they temporarily set the group size to zero, let the nodes all drain, and then set the group size back up to the max of a hundred or whatever, just to see if completely re-imaging and restarting their services would fix something. But yeah, you're right. It's not completely clear what adjustment they made or what they mean specifically by recycling.
Tom: [00:19:28] So about three hours in, this is about half an hour after they started the auto scaling changes, I guess they weren't seeing a lot of movement there. So three hours in, they turn off the feature flag service, which is, again, a little confusing to me because, depending on which meaning of a feature flag service it is, I'm not sure how you actually turn it off safely.
Jamie: [00:19:51] Yeah, this is really tricky, right? Because of the variants we talked about, you could have essential configuration, in which case you just can't turn it off. If it was an entitlement service, you would say, okay, well everything defaults on or defaults off, so either all your customers see all capabilities or no one sees anything. So this does make it sound, again, like it's more of an experimentation system, because if you're doing experimentation, you could just revert everything to the default behavior and turn off the experiments.
But because we don't completely understand what the feature flag service does, and there's a little bit of conflicting information about it in the notes, it's not clear whether it fell into the category of things that are safe to turn off. It may have, because they seem to do it with confidence and don't mention any consequences. But it's definitely something that, if you own one of these, is really important to think through: if we need to turn it off, does that put the site into a state that we're okay with? And trust me, Tom and I have been involved in systems where it actually was not safe to not run that service, because what started as a gradual rollout service, a slow rollout service, eventually went to a hundred percent and became an essential part of the application. And then it just never really migrated off of the experimentation system, which is not a good practice, but is a practice that can happen sometimes if you're not careful.
Tom: [00:21:17] If this is an experimentation service and you might ever have to turn it off for load-related reasons, please make sure you clean up your experiments. Get those feature checks out of the code. But something you could turn off safely in a situation like this would be a throttling service, a rate limiting service. If that is in the critical path and it's choking everything somehow, you could just turn it off. And sure, you might rate limit somebody, or you might fail to rate limit somebody, but that's probably a lot better than the site being down.
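As a sketch of that "safe to turn off" property, here is a hypothetical rate limit check that fails open when its backing store is slow or unavailable. It assumes a Redis-backed counter purely for illustration; the key scheme, limits, and timeouts are made up.

```python
import redis

# Hypothetical limiter backend; the host and timeout are illustrative.
r = redis.Redis(host="limiter-host", port=6379, socket_timeout=0.05)

def allow_request(key: str, limit: int = 100, window_secs: int = 60) -> bool:
    """Fail-open rate limiter sketch: if the limiter backend is slow or down,
    let the request through rather than taking the whole site down with it."""
    try:
        bucket = f"ratelimit:{key}"
        count = r.incr(bucket)
        if count == 1:
            r.expire(bucket, window_secs)  # start the window on first hit
        return count <= limit
    except redis.RedisError:
        # Limiter unavailable: degrade by not rate limiting, as discussed above.
        return True
```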
Jamie: [00:21:48] Just because it's so related to this, and it was such a great trick: one of the things somebody taught me years ago is that a great way, with an experimentation service, to make sure it doesn't turn into a permanently-on service is to never let any of the populations go beyond something like 99%. If you do that, the person is forced to move off of the service in order to turn the feature on for everyone. There are probably some tricks like that you can do to make sure you're encouraging people who are using something that's really just meant for experimentation to not end up living there permanently.
Tom: [00:22:26] Yup. Or just randomly flip it off.
Jamie: [00:22:29] Or just randomly turn it off. Yeah. You fit the profile, Tom. Some people just like to watch the world burn, I guess.
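A tiny, hypothetical version of the trick Jamie describes: the experimentation system simply refuses to set any population above 99%, so launching to everyone forces the flag out of the experimentation system and into real code or configuration.

```python
MAX_ROLLOUT_PERCENT = 99  # Never allow 100% in the experimentation system.

def set_rollout(flag_name: str, percent: int, flags: dict) -> None:
    """Cap rollout at 99% so 'fully launched' features must move out of the
    experimentation system instead of living there permanently."""
    if percent > MAX_ROLLOUT_PERCENT:
        raise ValueError(
            f"{flag_name}: rollout capped at {MAX_ROLLOUT_PERCENT}%. "
            "To launch to everyone, remove the flag and ship it in code."
        )
    flags[flag_name] = percent

# Example usage:
# flags = {}
# set_rollout("new_login_flow", 99, flags)   # fine
# set_rollout("new_login_flow", 100, flags)  # raises, forcing a real launch
```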
Tom: [00:22:40] Oh man. All right. Well, about half an hour after that, I guess that wasn't helping, they turned off another feature, which is user exporting. I don't know exactly what that is. But then 50 minutes after that, now four hours into the outage, they took what ended up being the critical step for getting things back online: they dropped the number of front end API nodes from a hundred to 45. So they cut it in half. And that was the magic thing for relieving enough pressure on the database for it to stand back up and start actually finishing queries. So a few minutes later they start seeing some successful logins, and they do a few other things that don't really end up being relevant. But at, I guess, 12:47, so 4 hours and 20 minutes in, on April 20th, I guess the universe is making some comment about this: 4:20 on 4/20.
Jamie: [00:23:45] We did not realize this ahead of time. Yeah. It’s a four hour and 20 minute outage, and it occurred on 4/20. All right. All right. Universe. We get it. We get it.
Tom: [00:23:58] So at that magic moment, they get core auth restored and the Internet can get unblocked and start moving again. So that’s the timeline, but there’s a few questions that obviously pop out of that.
So the first one is, obviously, why was scaling down the number of nodes the critical step? They explain this in the post-mortem. They have what they call a feature flag service, which we talked a little bit about at the beginning, that provides the front end API nodes with config values. So on each server there's a client that runs, and I'm assuming it makes a single connection off to some other service, and then all the worker processes on the front end node can just make a localhost connection to it. This client, this feature flag client, talks to a caching service to get the values.
This is the really key point: if the cache doesn't respond in time, or even, I guess, if it's a cache miss, the per-node client is going to hit the database directly. So once they started scaling down the number of nodes, just the number of machines running the software, that drops the number of feature flag clients, which drops the number of things hitting the database, which helps it to recover.
Jamie: [00:25:15] Yeah. This is an interesting point, because it also gets into the speculation side a little bit. Your initial reaction might be to say, "Well, why would you go hit the database if the cache service was failing?" But one thing we could imagine happened here is that this code path was really intended to handle a cache miss, and maybe there wasn't a clear enough differentiation between a cache miss being the reason you couldn't get something from the cache and the cache service being horribly slow. Because obviously, in one circumstance, if just a reasonable number of things cache miss, the database is fine and that's just steady state. In the other, if the cache service is unresponsive, then essentially every single lookup will go to the database, and that is probably not something that will work; otherwise you likely would not have a cache. So yeah, again, reading between the lines, what may have been happening here is that this code path was really expecting to only go back to the database on a cache miss, and not on the cache service being down.
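To make that distinction concrete, here is a speculative sketch, not Auth0's actual client, of a flag lookup that treats a genuine cache miss differently from the cache service being down, falling back to a stale local value instead of stampeding the database. The cache is assumed to be Redis and every name here is invented.

```python
from typing import Callable, Dict, Optional

import redis

# Hypothetical cache backend; the host and timeout are illustrative.
cache = redis.Redis(host="cache-host", port=6379, socket_timeout=0.05)
last_known_good: Dict[str, bytes] = {}  # per-process stale fallback values

def get_flag_value(key: str, fetch_from_db: Callable[[str], bytes]) -> Optional[bytes]:
    """Treat 'key not in cache' and 'cache is down or slow' as different cases."""
    try:
        value = cache.get(key)
    except (redis.TimeoutError, redis.ConnectionError):
        # Cache service unhealthy: do NOT let every node stampede the database.
        # Serve the last value we saw (possibly stale) instead.
        return last_known_good.get(key)

    if value is None:
        # Genuine cache miss: cheap at steady state, so go to the database
        # and repopulate the cache.
        value = fetch_from_db(key)
        try:
            cache.set(key, value, ex=60)
        except redis.RedisError:
            pass  # best-effort repopulation

    last_known_good[key] = value
    return value
```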
Tom: [00:26:24] This is a point where correlation definitely is causation. You don't want a lot of correlated activity where, when the cache gets bad, everybody goes and starts hitting the database instead. Uncorrelated access is fine. But this is a case where it all happening at once took down the service.
So, obviously the next question from this is why was the cache slow? And the comment they have in their doc is that “an increase in traffic exceeded the caching capacity of that service and caused it to stop responding in a timely manner,” which is really interesting. Jamie and I were talking about it a little bit. I think we might have a theory as to what happened, but again, this is just total speculation. Jamie, what do you think it is?
Jamie: [00:27:08] Again, for illustrative purposes, one thing that could happen here is swapping, right? So caching services are often going to hold things in memory, whether it’s redis or memcache or something like that. They may have just sort of eventually just created too many values in the caching service, and it started swapping things to disk. And that’s obviously going to make the service significantly slower. So we don’t know if it was that, but that is one example of how an in-memory caching service suddenly got much slower.
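We don't know what their caching service actually runs on, but if it were Redis, one guardrail against this failure mode is to cap the cache's memory and evict rather than let the process grow until the host starts swapping. Normally this lives in redis.conf; CONFIG SET is shown here only to illustrate the knobs, and the values are made up.

```python
import redis

# Hypothetical cache host; values are illustrative only.
r = redis.Redis(host="cache-host", port=6379)

# Cap the cache's memory and evict least-recently-used keys instead of letting the
# process grow until the operating system starts swapping it to disk, which is one
# way an in-memory cache suddenly gets very slow.
r.config_set("maxmemory", "4gb")
r.config_set("maxmemory-policy", "allkeys-lru")
```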
Tom: [00:27:37] So after the cache service stopped responding, all the front end nodes just start querying the database at the same time. And that wouldn't have been a problem, except that there were three other nasty queries running at the same time, which caused the database to exceed its available disk IO. So it sounds like these three queries were running at about the same time the caching service failed, and that tipped the system over into a state where things just couldn't start back up anymore. Presumably those other queries eventually finished, but particularly after the autoscaler kicked in and added an extra 50 or 70 nodes or whatever attacking the database, it just got to a point where it couldn't successfully start up anymore.
They go into a few details about the three bad queries. One was just a query that didn’t have an index. So it was scanning. It was just querying, touching a lot more data on disk than they were expecting. The second was intentionally scanning a large number of documents, which you have to do sometimes, but it’s kind of an anti-pattern to do it on any database on your serving path. And the third was rare in frequency, but very resource intensive. They say “its cleanup cascaded through many collections.”
As we mentioned in the last episode on Auth0, we think they're using MongoDB. I'm not enough of a Mongo expert to know what they mean by "its cleanup cascaded through many collections," but I think the important point is this was just an expensive query to run, and they got very unlucky that it ran at the same time as the other two and tipped over the system. Once things slowed down, the autoscaler started adding machines, everything just got worse, and it wasn't until they took machines offline that they were able to make progress and actually finish some queries.
Jamie: [00:29:32] Yep, sounds about right. So I guess in summary here, the quick version of it if we look back on all this and say, okay, what happened, is something happened to their cache service that made it get slow–potentially swapping, potentially something else. Their cache service started timing out requests that were coming from this feature gating service, the feature flag service. When this feature flag service started having cache requests fail, it started to directly issue those requests back to the database. And the database was unable to service those requests fast enough, at least in part, because it had three queries running on it which were very expensive. And so they were causing all of these lookups that were now falling directly to the database to not return quickly enough.
And when everything started to slow down there, it looks like their auto-scaling spun up the number of machines in their application tier, this API edge tier, and that sort of just exacerbated the problem. And so they worked their way back from this, and then they were doing okay again.
I mean, they did some other moves in the middle of this, but does that sound about right? Okay, cool. One thing that is interesting about, and challenging for, Auth0 is that they have done a great job in building something essential. They've made something useful, and something that, if someone takes a dependency on it, its reliability needs to be very, very good. But at the same time, because what they've built is narrower in scope than if you're building Google or something like that, the size of their team, the sensible team size, and things like that are much smaller.
So you're in the tough position of having to maintain an operational excellence bar that often requires a significant investment in a particular kind of talent pool, but to do it with a smaller team. You have something that needs that level of reliability, but you have to achieve it while punching above your weight and being very efficient with talent. So it's tricky to get this right, to be honest. And it's in between two modes. One is where you're building something where the request rates are so small that if it goes down, nobody cares. Many of our startups are in that position right now, where it's not a big deal if we fail every now and then. Or you are a monster part of the internet and you have amazing, large operational teams and lots of internal best practices and culture around reliability; then yes, you can do this too. So they definitely have a real challenge. I mean, it's a good position they're in, they've built something so useful, but it does mean they really have to be very, very good at this. They need to look as good as these big companies at running reliable services, and they need to get to that level of quality with fewer resources.
Tom: [00:32:37] Yeah, because the response to an outage like this, it really takes a lot. There’s a lot you have to have–the right people, the right culture, the right technology. And you have to have invested enough in the tools and visibility, and the bigger a company, the easier it is to have a whole team that just works on your monitoring or metrics, or even lots of teams that work on that sort of thing, and have whole teams that just have tons of practice dealing with outages.
Because they've had so many of them. But at this smaller scale, it's a tricky spot to be in.
Jamie: [00:33:09] It is. It is challenging for sure.
Tom: [00:33:12] All right. So as we usually talk about, Jamie, what do you think went well with this outage?
Jamie: [00:33:17] Well, you know, if you read through the action items, it sounds like they’re starting to get pretty serious about slow queries. And it mentions, among other things, bringing in some Mongo experts. And they’re accelerating some projects that they already kind of had on their list, on their backlog there, in the next couple of months. So there’s some stuff around indexes and query optimizations, doing a kind of audit of that and making sure that they have all of those covered. They are talking about upping their resiliency in various ways with their database by June of this year. So they have deadlines associated with a lot of these things. In fact, two of the five that sort of are in this category are completed. So their investment into making the database fast enough looks like it’s deepening here. Does that sound right to you, Tom?
Tom: [00:34:15] Yeah, this is something we've talked about before: just making your queries fast is really good. Not just because it makes your application experience better, but because it makes your database happier–you know, databases like fast queries. That is the essence of what they do. I assume some of this would include visibility into the slow queries, but just auditing all of this stuff, making sure that all the queries that can be fast are that fast, sounds like great, super valuable work to me. It's probably going to make the app a better experience, too, as a side effect.
Jamie: [00:34:52] Yeah. I also think, in their notes, they call out this need for the core team to focus. They mention that they put enough focus on this to bring the power of the entire company to bear as necessary to help fix it, but at the same time they were doing their best, using methods like the ones we talked about, to protect the core team's ability to focus on solving the problem while still having the resources of the entire company available. Everyone was paying attention and ready to jump in and help if there was an opportunity to get involved without interrupting that core team. So that coordination sounded pretty good to me.
Tom: [00:35:43] Also, this is a real human thing. Maybe it's a minor thing, but I really like to see it: the post-mortem does apologize, and they acknowledge that this was inconvenient and very frustrating for the people that rely on them. I think it's just a nice little human touch that they really apologize for the outage. So I like that. But yeah, let's talk about what might've gone a little bit better. Jamie, what are your first thoughts here?
Jamie: [00:36:11] I think one thing, and this is probably a good exercise for listeners of the podcast and for us, is to start noticing thematic things, because recurring themes are probably extra important to pay attention to in our own projects. There are a couple of things we can talk about somewhat quickly because we've gone into them in some depth before. One of them is this auto scaling thing, where it auto scaled to a much larger number of nodes. We talked in one podcast about having things like rate limits on automation, so that if it's suddenly doing something very unusual, something it normally does not change that quickly, it should probably stop and page someone instead. I think that could have helped them in this case.
Tom: [00:36:57] With the auto scaling, you want your autoscaler to keep you inside of states that you are comfortable with. I understand the desire to save costs and things like that, but before you build automation like that, just really make sure that you need to save that money. And that is important. But definitely keep it within the guardrails of what a human would do without even thinking about it. And I think they almost tripled the number of front end servers, and that’s a pretty big change to make since there’s already a lot of stuff these services rely on.
Jamie: [00:37:33] One thing that's interesting is new companies have the highest variance, right? But the total count is really low. So when you are a new company, your curves are not smooth. Somebody shows up, does a bunch of stuff, and goes away. As time goes on, once you get off that near-zero level, the other thing that happens is you become more international. A lot of places that start, maybe they're starting in the US market or something, and so they clearly have a dip at 2:00 AM that's quite low and a peak at 9:30 that's quite a bit higher, or whatever. As time goes on, if you're successful, as Auth0 has been, the hard thing is you get more traffic. But the good news is the curve gets flatter, right? Because you have people all over the world whose peak of the day is different. And because of that, a lot of times for mature companies like that, the difference between their peak and their trough is sometimes like 30% or 40%. It's usually not 5x. So if you have a cluster of machines that very quickly wants to get three times the size, there probably isn't any real organic thing that would cause that to happen.
Another good example to talk about that smoothness is when you're really small, you might have one big customer. When they suddenly light up and do some behavior, your traffic suddenly gets crazy. Once you've diversified across tens of thousands of customers, it's unlikely one customer is going to suddenly change a curve in a big way. So yeah, even with your automation, to Tom's point about thinking about what's necessary: they're in this place where they've gotten successful, but you do have to look at your actual organic traffic and say, well, even if we decide we do somehow want to save the money by scaling this up and down, the real variation we have between the high point and the low point of the day is 50% or something like that. It's not 3x or whatever–so let's anchor on that and look for any rate of change that would never happen organically. That kind of a thing.
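As a hypothetical sketch of the kind of guardrail Jamie and Tom are describing, here is an autoscaler wrapper that clamps both the step size and the absolute bounds, and pages a human instead of silently making a huge jump. All of the numbers and the alert hook are illustrative.

```python
# Hypothetical autoscaler guardrail: limits are anchored to the real organic
# daily variation rather than letting automation jump to some huge multiple.

MAX_STEP_FRACTION = 0.2       # never change the fleet by more than 20% per step
HARD_MIN, HARD_MAX = 30, 60   # bounds a human would consider normal (made up)

def next_fleet_size(current: int, desired: int, alert) -> int:
    """Return a clamped target; page a human instead of making a huge jump."""
    max_step = max(1, int(current * MAX_STEP_FRACTION))
    clamped = max(current - max_step, min(current + max_step, desired))
    clamped = max(HARD_MIN, min(HARD_MAX, clamped))
    if clamped != desired:
        alert(f"autoscaler wanted {desired} nodes, clamped to {clamped}; please review")
    return clamped

# Example: current fleet of 37, autoscaler asks for 100.
# next_fleet_size(37, 100, alert=print) returns 44 and pages someone.
```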
The other thing I would note here is the batch versus online databases thing. We've talked about this a little bit in the past, and you mentioned it when we were walking through the timeline: this is a common theme. You want uniform types of load on your serving databases, typically point queries or pretty small range queries. Big scans are not good. It's very hard for any company to keep their site online if their primary database has big scans running on it at arbitrary times; it just changes the response time curve of your database so much. That's really problematic.
Tom: [00:40:36] Yeah. This is, again, one of those things that you get away with without any problem when your database all fits in memory. If your database is all in memory, you're probably going to be okay. You can't do any of these crazy joins, but you start to fall into a bad place when you have more data on disk than will fit in memory at once. What you'll see with a big scan is that your OS is going to use a bunch of memory to cache all the disk access, and if you're in a particularly bad state where you have just a little bit more data than you have memory, then as you scan through it, you're going to be evicting pages to pull new ones in. So you're no longer going to have your hot pages cached. You're going to have to pull everything into memory, which by nature uncaches the really important pages. And so you're just going to thrash back and forth, and you're going to run out of disk IO if you have the wrong access pattern. And then things are going to get very super linearly… or what's the opposite of that? Uh, things are just going to really fall off a cliff. It's not going to be very predictable performance.
Jamie: [00:41:52] I mean, the beautiful thing about your database access, almost everybody's database access, is that it's biased, right? The users currently logged in, their records are going to be hotter than people who haven't logged in in a long time. Newer accounts are going to be hotter than older accounts, typically speaking, because every business has churn. And so that bias, in many respects, will sort of make it seem as though your database is all still in memory. But the more you do table scans over every record, or arbitrary sets of records or whatever, the more you're going to cause your kernel's disk paging system to evict records which are actually important, just to visit other records one time as part of this scan. So you really want to keep those scans off your main serving database, so that those hot records can stay in your page cache and all of your response times can stay fast. There are some methods, probably, right, Tom, to still be able to run your scans without causing your serving databases to have issues?
Tom: [00:43:05] Yeah. A very standard approach is you have a read replica–a read-only replica that you don't really care how fast it is. If you have a database where you assume every byte you read comes off the disk and you're just okay with that performance, then that's an okay state to be in. You just don't want that database to be the same one that you need to make your sub-millisecond queries to. Having a read replica that is just a few seconds behind on replication is going to be fine for almost any big batch scan or walk you want to do. And it doesn't really matter what you do to that replica, because hopefully you're only sending other batch traffic to it, and that's generally fine if it slows down. But the thing that's actually on the critical path, that's serving the site, you want that to have very, very predictable performance. And I guess that's the bad thing about scans: they introduce highly unpredictable performance. It might be perfectly fine, then one day you add a few more gigs of data, and it becomes catastrophically slow.
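Here is a hedged sketch of that pattern with pymongo, since we think they're on MongoDB: the serving path keeps hitting the primary, while batch work reads from a secondary that is allowed to lag by a few seconds. The connection string, collection, and field names are invented for illustration.

```python
from datetime import datetime, timedelta
from pymongo import MongoClient, ReadPreference

# Hypothetical replica set; the connection string and names are made up.
client = MongoClient("mongodb://db1,db2,db3/?replicaSet=rs0")
users = client["auth_db"]["users"]

# Serving path: point lookups go to the primary, fast and predictable.
user = users.find_one({"_id": "user_123"})

# Batch/export path: same collection, but reads go to a secondary that is
# allowed to be slow and a few seconds behind replication.
batch_users = users.with_options(read_preference=ReadPreference.SECONDARY_PREFERRED)
cutoff = datetime.utcnow() - timedelta(days=90)
for doc in batch_users.find({"last_login": {"$lt": cutoff}}).batch_size(1000):
    print(doc["_id"])  # stand-in for whatever the export job actually does
```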
Jamie: [00:44:06] Yeah, for sure. Let’s see, what else, Tom, is on your mind here?
Tom: [00:44:13] On a similar note, I really don't like to tolerate slow queries that could be fast. I think engineers should both have visibility into how quick things are and have intuition about how fast they should be. And I don't know of a great general solution to this. At my first job, we ran a pretty large scale website, a thing called AudioGalaxy, back in the late 90s and early 2000s. We probably ended up with something like 20 read replicas; we were doing that much traffic on this big MySQL cluster. Probably the single most impactful thing I did for performance was I supported a parameter on every URL, and after it rendered the page, it would dump all the queries that it had made for that page to a table down at the bottom, along with how many milliseconds each one had taken and the total amount of time spent querying the database while the page was rendered. That was invaluable, both for building intuition about how long a query should take and for making it really, really obvious when I didn't have an index that I needed or when I was making a bad query. Because you'll see one millisecond, one millisecond, one millisecond, 300 milliseconds, and it's like, oh yeah, that's obviously the one something's wrong with. So I think every engineer should have some experience like that, where they can build some intuition about how queries actually work, but also be able to see if something looks anomalously slow. I wish I had a better overall solution for it, but just invest in that visibility and don't tolerate the slow ones.
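A minimal, hypothetical version of that kind of instrumentation: wrap every database call, record how long it took, and dump the list with a total when a debug parameter is set. Names are illustrative, and a real application would keep the log per request rather than in a module-level list.

```python
import time
from contextlib import contextmanager
from typing import List, Tuple

# (description, duration in ms); per-process here for simplicity,
# per-request in a real application.
query_log: List[Tuple[str, float]] = []

@contextmanager
def timed_query(description: str):
    """Wrap a database call and record its duration."""
    start = time.monotonic()
    try:
        yield
    finally:
        query_log.append((description, (time.monotonic() - start) * 1000.0))

def render_debug_footer() -> str:
    """Render the per-page query table Tom describes."""
    total = sum(ms for _, ms in query_log)
    rows = "\n".join(f"{ms:8.1f} ms  {desc}" for desc, ms in query_log)
    return f"<pre>{rows}\n--------\n{total:8.1f} ms total DB time</pre>"

# Hypothetical usage inside a request handler:
# with timed_query("SELECT * FROM users WHERE id = ?"):
#     row = db.fetch_user(user_id)
# if request.args.get("debug_queries"):
#     page += render_debug_footer()
```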
Jamie: [00:46:02] Yeah, that's a really good idea. I think you can even make that mandatory on the canary cluster, right? If you have a canary cluster, the table is always drawn. And I would say this is one place where teams listening should feel at liberty to do social experimentation, because it's kind of what your canary and dogfood clusters are for. I talked to someone one time who was in charge of the performance team at a company I probably shouldn't name, and one of the approaches they took was to black out parts of the DOM that took more than a certain threshold to render. They did that on their dogfooding cluster, right, not their canary cluster. So everybody in the company would immediately see if a part of the page was unacceptably slow, and it was amazingly effective. Obviously, if that thing made it to production, they're still going to draw it, but it was a really straightforward way to point out that, for your customer's sake, you should treat that part of the page like it's unusable, like it's as good as not there, because of how slow it was. And they found that really, really effective. It was just part of the client-side framework that was pulling these things and rendering them: if they took too long, it would be like, nope, you don't get a control, you get a black box.
Tom: [00:47:32] I love that. I think that’s really cool. All right, want to talk about the cache design a little bit?
Jamie: [00:47:36] Yeah. I think maybe the final thing to dive into here a little bit is what the caching scheme was and how well it was configured and understood. Because if you're caching something, there are a lot of reasons you could be caching it, right? One of the reasons to cache something is to take advantage of some kind of amplification.
So let's imagine your database was infinitely fast, right? If you're reading off of the primary, there's only one server, that's the primary, and that thing has a gigabit NIC or whatever. You can't possibly get more than a gigabit of service out of it. If you put in replicas, and you can imagine replicas as caches in a way, because they kind of are, then you can bring all of those to bear. You could also have things in memory. And so if you had a bunch of different, possibly stale, versions replicated in memory, then you have the aggregate throughput of all of those NICs available to serve that object many, many times.
So one reason is that you want to serve this request many, many times, and the database server just cannot serve it that many times. But a second reason is latency reduction. And this one feels like it shouldn't matter, but sometimes it does. Sometimes there's something about database architectures where, even if something is only being requested five times a second, and the database is of course completely capable of servicing that from a throughput perspective, it takes three milliseconds if it comes from your database and 300 microseconds if it comes from your cache. Maybe there's something architecturally different about the design of these two systems. It's probably not hitting disk; it's probably coming out of the page cache, but maybe the threading model is different or whatever. So that might be another reason: sometimes you say, okay, we can tighten up this pipeline by using a system that is not adding throughput, but is somehow reducing latency and getting that record back faster, so that the total response time is faster.
Now, you could just say, well, look, I put the cache in front of it and things got better; why does it matter how well I understand why I'm doing it? One of the reasons it matters is you need to understand whether it matters if the cache fails, because that's how you should be planning for that failure. In either of these cases, it could be acceptable or not acceptable for the cache to suddenly not exist. And this is a good example, where obviously the cache stopped working, the traffic passed through instead of going to the cache, and that was clearly never going to work, at least in this circumstance. So it is a little unclear from these notes how much they had understood why this cache was there and what would happen if it didn't work. I don't know, Tom, how would you talk about your questions about the caching scheme here?
Tom: [00:50:42] Generally, I am pretty down on caches.
Jamie: [00:50:45] Well, yeah, if you can get away with not having them, my goodness is your life easier.
Tom: [00:50:50] Historically, in my career, if I have N gigabytes of RAM to allocate, I'd rather allocate it to databases than to an in-memory cache, barring the case where you've done a ton of computation, like something that's very expensive to create: a rendered page, or some summation of a bunch of things. That makes sense to cache. But for just a query cache or something like this, if they had had a read replica here whose only job was serving these… what do they call it… feature flag clients, that would probably have been a lot simpler to diagnose. You would be able to see very clearly if the load was getting close to the database's limits. Sure, it wouldn't be as up to date as the primary, but if they're using a cache anyway, that can't be very relevant. And if that replica wasn't overloaded, you could use that capacity for other things as well. So that would be a simpler system design that would also give you primitives that would let you make other things faster.
Jamie: [00:52:03] Yeah, it's a good point to phrase it that way. It's so important to understand: every system is simpler with fewer components, period. You should only use a component if you absolutely have to have it, if you can't get away with not having it.
And a caching system, as opposed to just a read replica, is a new kind of system. That's actually why, just comparing the two reasons in my example, I would say it's much more suspicious if you're doing this for latency reduction, because what you might really be doing is papering over the fact that you don't have an index on that query.
It actually should be fast already. Versus: if you are happy with the performance of all of these queries, but you fundamentally just need to ask the question too many times a second, okay, now maybe you need a cache. But also, you could use read replicas, right?
So if you can approach these questions as "is there any way we can not have this?" and the answer is yes, it's worth strongly considering, because that process also forces you to prove to yourself what the thing is for. And almost inductively, you'll also have to say what would happen if you didn't have it, so that you can start making plans around failure.
Tom: [00:53:19] Yeah. You know, systems grow a lot more than they shrink. It’s hard to remove things — like not every engineer on a team gets excited about cutting stuff out and simplifying it. A lot of times once something’s there, it just creates more complexity and more stuff that people might not have used if it hadn’t already existed. But you know, once there is a cache, people will start using it in cases that are more marginal than what it would take to get the cache added. You should always think long and hard before you add something.
Jamie: [00:54:01] And I have to say, I'm somewhat ashamed to admit this, but as an engineer, I've had lots of times in my life where I was ready to go throw a Redis in front of something or whatever. And a DBA said to me, can you just show me what you're doing for a second? And then I saw, oh, this query is taking six milliseconds, and it seemed like it shouldn't, because the truth is the page cache is fast and MySQL is pretty fast, right? A lot of times they'd come back and say, oh, if you change the query to do this, your problem is just solved. You avoided provisioning a system, monitoring a system, blah, blah, blah. It could be a change to the way the query is written, or, oh, we need to increase this pool size so that the right values and indexes stay in memory, or whatever it was. And the amount of pain saved by keeping the system simpler was great.
And you could say, well, this is just abstract, it's hypothetical. But we know from reading their notes that they did discover several things that weren't indexed. So if you have a cache in front of something, hopefully you have already made sure you have all your indexes first. Because if you're pursuing that line of reasoning, let's only add things if we know we need them, then as a consequence of that process you probably would have gone and confirmed, yes, indexes are in all the places they should be, and yet we still need this. Right? Yep.
Tom: [00:55:30] But if you really convince yourself, you need a cache, just promise us you’ll turn off swap.
Jamie: [00:55:35] Yeah. It turns out, disable your swap. You fail fast. Just crash the machine.
Tom: [00:55:42] Far better for the machine just to crash.
Jamie: [00:55:45] Yes, than to turn into some sort of swamp that sucks all the energy into it. Yup, any other thoughts on this whole thing, Tom, before we wrap it up?
Tom: [00:56:01] Yeah. This was obviously a rough day for Auth0. I'm sure everybody was just absurdly stressed. It's a very high-attention thing to have an outage like this at the time of day they did, so my heart really goes out to them, because this was just a tough thing to go through.
But at the same time, I think this can be turned into a good thing, if the right attitude is taken towards it. I mean, all these problems are fixable. This is not some sort of problem that is fundamental to the architecture or the design and simply cannot be fixed.
Also, this could have been a lot worse. There was no data loss, and it didn't even seem like there was a big risk of data loss. You can definitely imagine outages companies have had, and we've talked about them, like GitLab deleting a chunk of their database. This could have been so much worse. But issues like this are a signal that there's room to get better operationally and to get to a place where you can deal with outages faster and more cleanly. And I think this is a good chance to use that as the motivation to really go and fundamentally make the improvements you need to get to just rock solid stability. The catch for something like that is you have to decide to not do something in order to do something new. There's just not a lot of slack available in the modern engineering org. And that's a job for leadership: to say keeping this service up is the number one priority, we accept that we had these other plans that we're going to defer, and we're going to not just sit around and come up with ideas on how to improve things, but really invest, not just in the things that are obvious, but in the things that are speculative: better monitoring, better tools, DRTs, things like that.
And so that's really just a job that leadership has right now: to say we have to make some fundamental changes to really up-level our game. And if they do that, I think they can end up with a much better system that is going to be much more resilient to future problems. Because that's a really important point: this could have been a lot worse, for sure.
Jamie: [00:58:15] Right? Yeah. And as someone that's been in the management position much more than the engineering position over the last few years, managing teams like this: it is very seldom that you go to an engineering team and say, do you guys want to take a little time so that you don't get woken up in the middle of the night as much, and they go, no, we'd actually rather not. So the point here is that investing in being good at this stuff is often hardest at the prioritization level. It's usually not that the engineering teams didn't care, because they do; engineering teams care about quality. There's a pride in the craft that most great engineering teams have. So I agree with Tom: when you talk about leadership, the real mandate here has to be picked up by the leadership, to say, okay, we are going to invest in this. We're going to bring in the talent we need to, if our team needs to be augmented with folks with specific skill sets and things like that. Management and leadership need to say, we are going to own this, and we are going to give you the time you need to develop the processes and tools and culture to be excellent at this. Because yeah, it's not a fun day for the engineering teams when these things happen. But they can emerge from this much, much stronger if they take it as a catalyzing moment to change the bar internally for how good they are at these kinds of things.
Tom: [00:59:53] Yep, all right, well, I think that’s it for today. Thanks for listening. Everybody send us any suggestions for outages that you want us to talk about or follow us on Twitter. We’re always online and happy to engage with anybody there. So thanks for listening.
Producer: [01:00:15] Thanks for listening to The Downtime Project. You can subscribe to the podcast on Apple, Spotify, or wherever you listen to your podcasts. If you'd like to read a transcript of the show or leave a comment, visit us at downtimeproject.com. You can follow us on Twitter @sevreview. And if you like the show, we'd appreciate a five-star review.