In 2018, after 43 seconds of connectivity issues between their East and West coast datacenters and a rapid promotion of a new primary, GitHub ended up with unique data written to two different databases. As detailed in the postmortem, this resulted in 24 hours of degraded service.
This episode spends a lot of time on MySQL replication and the different types of problems that it can both prevent and cause. The issues in this outage should be familiar to anyone who has had to deal with the problems that arise when a large amount of data replicated between multiple servers gets slightly out of sync.
Jamie: [00:00:00] Welcome to The Downtime Project, where we learn from the Internet’s most notable outages. I’m Jamie Turner. And with me is Tom Kleinpeter. Now, before we begin, I want to remind our listeners that these incidents are stressful and we peel them apart to learn, not to judge. Ultimately Tom and I have made and in the future unfortunately will make similar mistakes on our own projects. So please view these conversations as educational, rather than a judgment of mistakes that we think we would never make.
Today we’re talking about GitHub’s October 2018 database outage. But before we dig into that, we have a little bit of housekeeping to go through. One thing that was a lot of fun and a little crazy is that we released an episode about Auth0’s 2018 outage on Monday, which was due to database issues. And the very next day, on Tuesday, which was a couple of days before we recorded this, Auth0 had around a four hour incident. It was pretty amazing that it happened the day after we released an episode about an outage they had a few years ago. What’s kind of interesting is that we got some new listeners out of it, and hopefully you’re listening to this episode to check out the podcast, because we ended up being accidentally topical.
So that was kind of fun. So if you’re here and Auth0’s outage got you interested in hearing more about what we do, welcome to the show. We know with this latest Auth0 issue they said they’re going to publish a post-mortem soon. And Tom and I certainly plan on reading through it and we’ll do a follow-up episode on it once it comes out and check out what happened this time.
So anyway, a big coincidence is happening and so we’ll see.
So since we all, I think, rely on GitHub, and today we’re talking about GitHub, hopefully GitHub, knock on wood, doesn’t have any outages. It might just mean we’re bad luck or something like that. So anyways, as always, if you like what you hear, make sure you rate and review the show. It does make a difference for other folks discovering us. If you like it, certainly let us know that you like it, and it helps promote the show to other people that might be interested in it. We’re also certainly interested in any outages you think we should cover, so make sure you log on to the website or ping us on Twitter and let us know if there are any outages you want us to cover in an upcoming episode.
And we will definitely incorporate that. So if you’re not already following us on Twitter, definitely follow us there too, because there’s lots of good stuff on Twitter that we tweet every now and then when something comes up related to outages that we’ve covered, or every time we release a new episode.
Tom: [00:03:01] Yeah, thanks, Jamie. Also, I just want to go ahead and mention the startup that I have co-founded called Common Room. If you are building an online community, if you have a Slack or Discourse and you have people engaging with your community, but you would like to understand what’s happening a little bit more, check out our website, commonroom.io.
And if you’re interested in working for a small but rapidly growing venture-funded startup that I think is a really good place to work, come find me on LinkedIn or apply on our jobs board.
Jamie: [00:03:37] Alright, today we are talking about GitHub’s 2018 database outage. We’re talking about this particular outage because somebody online suggested that we dig into it. On Hacker News, dmlittle had pinged Tom in a comment, actually about the Auth0 outage, and pointed out a couple of interesting post mortems for us to check out, including this one about GitHub. So as we dug into it, it indeed was very, very fun to learn about and to talk about together, and now to share with all of you. Once again it is database related, so we’re definitely hitting a theme, Tom, of lots of things about databases.
Tom: [00:04:24] Well, you know, databases are hard. They tend to take sites down when they go down too. So yeah, just a little bit of background that might be helpful for understanding what happened to GitHub here. Pretty much anybody running a database in production is going to be using some form of replication. The writes will go to the primary and then they will be fanned out to one or more replicas, which can serve several purposes. You can offload reads to the replicas, but the main reason people do it initially is as a real-time backup. It doesn’t really help you if somebody gets to the command line and drops a table, but it is very helpful if you have a hardware failure: you can quickly fail over to one of your replicas.
There are several different types of replication. So the one GitHub was using was called semi sync, which implies there’s both synchronous and asynchronous forms, I guess, but let me just run through the three types here so you kind of know what’s going on. With asynchronous replication when a write goes to the primary, the client that does the write immediately gets told success. The replicas are not involved at all. And so you can end up with situations where you have data that only exists on one hard drive, which is not a great state. On the other end of the spectrum there is synchronous replication, where when a write goes to the primary, it doesn’t return to the client until all the replicas have received the data. This is great for people that are extremely paranoid or data that is extremely important, but it’s going to be a lot slower. As soon as you start doing things with more than one computer, you have to deal with tail latencies. And so the speed of the write is going to be limited by the slowest replica, which is not great.
So the compromise that a lot of people use is semi sync replication, where you fan out the writes to all the replicas. And as soon as one of them has successfully got it onto a disk, the write is considered successful and returns. So that’s generally a good balance of safety versus performance, and that is what GitHub was using in this situation.
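To make the trade-off concrete, here is a minimal Python sketch of when the client gets its acknowledgment under each mode. It is purely illustrative, not GitHub's setup or MySQL's implementation, and the latency numbers are invented.

```python
# Minimal sketch of when a write is acknowledged under each replication mode.
# Illustrative only; real MySQL semi sync is turned on with server settings,
# not implemented in application code.

def ack_latency_ms(primary_commit_ms: float, replica_rtt_ms: list[float], mode: str) -> float:
    """Return how long the client waits before the write is reported successful."""
    if mode == "async":
        # Client is told success as soon as the primary commits locally.
        return primary_commit_ms
    if mode == "semi-sync":
        # Wait for the *fastest* replica to confirm it has the event.
        return primary_commit_ms + min(replica_rtt_ms)
    if mode == "sync":
        # Wait for *every* replica, so the slowest one sets the pace.
        return primary_commit_ms + max(replica_rtt_ms)
    raise ValueError(f"unknown mode: {mode}")

# Hypothetical numbers: two replicas in the same (East coast) datacenter,
# two across the country with a ~60 ms round trip.
rtts = [1.0, 1.5, 60.0, 62.0]
for mode in ("async", "semi-sync", "sync"):
    print(mode, ack_latency_ms(2.0, rtts, mode), "ms")
```

Note that under semi sync the fastest acking replica is almost always one sitting in the same datacenter as the primary, which is exactly how writes can be acknowledged without ever reaching the other coast.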
Another factor I mentioned is that people will use these replicas to offload reads. If you have a very read heavy site, which is extremely common, you’re eventually going to overwhelm your primary with how many reads it can do. And so if you’re okay with a small amount of lag in your data, you can do a lot of your reads off the replicas. That is a very easy way of scaling up your read capacity without doing anything too complicated. So that’s also going to come into play.
And as soon as you start dealing with replicas and primaries, you have these decisions about when you demote or promote one of the machines. So if you have a primary that’s receiving all your writes and it goes down, every client doing the writes has to decide: where do I send my writes now? The thing that I thought was the primary is gone. What do I do? And GitHub was using a tool called Orchestrator to help with this, which generally will work great.
In this case, GitHub had a network partition. Network partitions are no fun at all. You have to remember that networks are real, physical things. There’s no magic happening in them. It’s all a bunch of equipment that’s just connected together, and lots of different things can fail, and you can end up in very weird states where some computers can see others and others can’t see the same ones.
So you’ll see how a fun network partition really affected GitHub on this day. And just the last little bit of context: GitHub had two data centers, at least two that they mention here, an East coast data center and a West coast data center. They were running semi sync replication, where they had a primary on the East coast, and then they had replicas both on the East coast, in the same data center, and a number on the West coast as well. I think they had up to 12 replicas to handle read traffic. As you would expect, GitHub is a pretty read heavy site. The West coast data center seemed like it was more of a standby, and pretty much all or most of the application servers were running on the East coast. That is also going to come into play. So, yeah, Jamie, let’s get into the timeline.
Jamie: [00:08:45] Sounds good. Yeah. So on that day in October 2018, there was some backbone maintenance being done on their network. The network link that connected the East coast data center to the internet and to the West coast data center was having some maintenance done on it. In the course of this maintenance, the East coast data center became disconnected from everything for 43 seconds. So essentially everything on the East coast goes offline, right? The Orchestrator group, which uses consensus in order to do this promotion and demotion of the MySQL servers, had some members in the West coast data center, some in the East coast data center, and some in the public cloud as well. And so when the East coast data center dropped offline, the Orchestrator network was still able to maintain a quorum to make decisions, because it still had both the West coast and the public cloud instances alive. And those instances decided, hey, we are seeing that the East coast is gone, so we need to promote something else. And they decided to promote some of the MySQL servers on the West coast to be the new primaries.
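As a rough sketch of the quorum math at work here (simplified, and not Orchestrator's actual Raft implementation; the member placement is a guess based on the description), you can see why losing one of three sites still leaves a working majority:

```python
# Simplified sketch of the quorum decision a consensus group like Orchestrator's makes.
# Member placement here is hypothetical; the real deployment details aren't in the postmortem.
members = {
    "east-1": "east", "east-2": "east",
    "west-1": "west", "west-2": "west",
    "cloud-1": "cloud",
}

def has_quorum(reachable: set[str]) -> bool:
    # A strict majority of all members must still be able to talk to each other.
    return len(reachable) > len(members) // 2

# During the 43-second partition, the East coast members drop out.
reachable = {m for m, site in members.items() if site != "east"}
print(has_quorum(reachable))  # True: 3 of 5 members remain, so they can still
                              # elect a new primary, and the only replicas they
                              # can reach to promote are on the West coast.
```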
Tom: [00:10:08] It’s important to note how quickly this happens. And this is the blessing and the curse of automation that they had 43 seconds of this outage or the network being down and orchestrator jumped in and made a pretty important decision in that timeframe.
Jamie: [00:10:22] Yep, East coast looks down, the Orchestrator quorum says we’re going to promote the West coast since the East coast disappeared, and the West coast becomes authoritative for recording new records in these databases. So, those 43 seconds elapse and the network is restored. The East coast data center comes back online, and at this point, from the reading of their notes, it looks as though Orchestrator preferred the East coast to be primary. So when the East coast came back online, Orchestrator began trying to repromote those MySQL machines to be the primaries.
Unfortunately, this did not work, and they discovered the reason about two minutes into the outage, when their pagers are all going off. They’re getting notified that Orchestrator is unhappy and that the MySQL clusters seem unable to do the automation that it wants to do. They start investigating, and they discover that the East coast servers have some records on them that were missing on the newly promoted West coast primaries.
Tom: [00:11:38] Oh boy. And that’s when you knew it was going to be a rough day right there, because now you’ve diverged, and that’s really rough. So the way this happened is almost a failure mode of the semi sync replication I mentioned earlier. There were a bunch of writes going to these primaries on the East coast, and the replica that was responding, hey, we’re good, we’ve got the backup data, was almost certainly on the East coast. Just the speed of light means it’s 60 milliseconds or so round trip to the West coast. So writes were going to the East coast primaries and being acked by an East coast replica, and the data was not making it to the West coast. So when the network partitioned, there was data that only existed on the East coast. Sure, it existed on multiple servers there, it was on more than one MySQL server, but it had not made it to the West coast when Orchestrator jumped in and said, hey, the West coast is the boss now, it’s the primary, let’s start doing our writes there.
Jamie: [00:12:46] Yeah. And so because of that, we ended up with two different sources of truth for what writes had been acknowledged and committed by these databases. So 15 minutes into the outage, starting to realize that this has happened, knowing that they’re going to be in for a bit of a process, and knowing that users were going to see some missing or inconsistent data, the GitHub team decides to put the site into a yellow status, and then quickly after that into a red status, acknowledging that something was wrong here.
Tom: [00:13:24] This is where you start to get into really weird territory, where all the assumptions you used to be able to make about your database are kind of gone now. Users might have written data, and then they reload the page and it’s gone.
So, it’s hard to even reason about all the weird stuff that can start to happen in that state, but one simple example would be: I file a bug or create a PR, and then I reload the page and it’s gone. So I do it again. And now there is one version of it on the East coast and one version on the West coast, and that’s going to be hard to reconcile.
Jamie: [00:14:01] Yep. Not ideal. So now we have two different sources of truth, with some set of records that only exists in each place. At this point, the West coast has been the primary for around 20 minutes, so 20 minutes of writes are only on the West coast, and somewhere around 15 to 30 seconds of writes, we don’t know exactly how many, are isolated on the East coast. What was interesting here is what they decided to do: make sure that they respected the West coast writes. Because they had, theoretically, 40 times as much data that only existed on the West coast, they decided the right move, about 20 minutes into the outage, was to keep the West coast as the primary for now and get the East coast set up as replicas of the West coast, because that’s the first step needed in order to eventually promote back over to the East coast. You might ask, why were they in such a hurry to make the East coast the primary again? The reason is that all the application servers were running in the East coast data center, so the performance of the site was really bad. They hadn’t intended, hadn’t really tested, and the architecture wasn’t really built for the round trip to the databases suddenly going from a few milliseconds to 60 milliseconds. So as soon as everything came back online and the application servers on the East coast were serving all of GitHub, the website got really slow, because of the 60 millisecond round trips the application servers suddenly had to do all the way across the country to get to their databases.
So they really wanted to get back up and have their site be responsive. But at the same time, they had this decision to make about 20 minutes of new data on the West coast versus the 30 seconds or so on the East coast that they would have to set aside if they immediately repromoted the East coast.
Tom: [00:16:07] Yeah. So one thing in the post-mortem that is really good to call out, and that kind of helped them make this decision, is that they state that the integrity of user data is GitHub’s highest priority. They had already put some thought into the principles they were going to make these decisions with. And that is the thinking you need to do ahead of time, because as we’ve mentioned before, everybody’s IQ drops in an outage. Everybody gets a little bit dumber and stops making decisions as well. And when you have to make the big, important decisions, it’s very nice if you’ve put thought into them ahead of time. So they’re stating here that integrity is more important than availability, and you can absolutely imagine sites where the reverse of that is correct. Imagine, say, a search engine’s data: you could probably rebuild all that stuff, so who cares? Availability would be all that really mattered.
But for GitHub, this is where people put the data they create or the information they create, and it’s very precious. Nobody would like to lose that. And so they say, hey, integrity is more important. We’re going to take a hit on downtime so that we can get the data back into the state it should be in.
Jamie: [00:17:14] Yep. And the alternative, as Tom mentioned, the fastest way for them to get back online, and in some respects the more selfish way given their principles, because it would get them out of this embarrassing state, would have been to immediately repromote the East coast machines and basically write off the West coast: we’ll figure out later how to get that 20 minutes of stuff merged back in, but let’s just get the site back online, because everybody’s looking at us and we look bad. But they didn’t do that. They said, we have 20 minutes of precious data on one side and 30 seconds on the other side, so we’re going to stay degraded to make sure we jeopardize less data. And that’s a very principled stand, and I think it is worth admiring.
So 27 minutes in, given this backed up performance and the fact that they had lots of back pressure building on the East coast because of this hop across the country to get to the database servers, they decided they needed to shed some load. So they turned off webhook invocations, they turned off GitHub Pages builds, a bunch of stuff that they deferred, knowing they would have to work through the backlog later, but it helped them keep the site running.
And they just sort of left a to-do for themselves to go back and rerun all those things later on when the site was healthy. About an hour and fifteen minutes in, they start formulating a plan for how they’re going to get the East coast MySQL clusters healthy again. Because again, remember, at this point the East coast MySQL clusters were not participating at all in the new authoritative West coast stream; they were refusing to do so because they had some records that the West coast did not have (not quite a superset, but records the West coast didn’t have). And so there was no clean way to set up sequential replication from the West coast. So what they decided is, okay, they have a backup system, thankfully. Every four hours they back up all their databases, which is good, that’s pretty frequent. And what they know is that those backups, because they were a few hours old, represent a clean subset of the data on the West coast. Those backups do not have the 30 seconds of bad data, bad in quotes here, that was complicating their ability to replicate from the West coast. So they say, all right, let’s restore the East coast MySQL machines from their backups, and after we do that, we can make them replicas of the West coast. And then finally, once they’re caught up, we can repromote them and our performance problems will go away, because the primaries will be back on the East coast.
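The recovery plan they describe maps roughly onto steps like the following. This is a hedged sketch: the hostnames, credentials, and the use of GTID auto-positioning are assumptions, since the postmortem doesn't give the exact commands GitHub ran.

```python
# Hedged sketch of the restore-and-rereplicate flow for one East coast cluster.
# Hostnames, credentials, and GTID auto-positioning are assumptions, not
# details from the postmortem.
import mysql.connector  # pip install mysql-connector-python

# Step 1 (outside this script): restore the most recent four-hourly backup onto
# the East coast host, e.g. with xtrabackup. That discards the ~30 seconds of
# writes that never reached the West coast, which is what unblocks replication.

conn = mysql.connector.connect(host="mysql-east-1.example.internal",
                               user="repl_admin", password="...")
cur = conn.cursor()

# Step 2: point the restored East coast server at the current West coast primary.
cur.execute("""
    CHANGE MASTER TO
        MASTER_HOST = 'mysql-west-primary.example.internal',
        MASTER_USER = 'replication',
        MASTER_PASSWORD = '...',
        MASTER_AUTO_POSITION = 1
""")
cur.execute("START SLAVE")

# Step 3: poll replication lag; once it reaches ~0 the cluster can be
# repromoted so the primary sits next to the East coast application servers.
cur.execute("SHOW SLAVE STATUS")
status = cur.fetchall()
```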
Tom: [00:20:07] And hopefully they had enough disk space on the West coast to save the logs.
Because then things would get really exciting. And presumably they also set aside at least two of the replicas that had that 30 seconds of data and just took them out of the cluster, so that they could go and query them later to try and recover the forked data.
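Setting those forked replicas aside probably looked something like this hedged sketch: stop them from following the new primary and fence them off so the stranded rows stay queryable. The hostname and specifics are invented; the postmortem doesn't describe the exact steps.

```python
# Hedged sketch: fencing off a replica that still holds the forked ~30 seconds
# of East coast writes, so it can be queried later during reconciliation.
import mysql.connector

conn = mysql.connector.connect(host="mysql-east-forked-1.example.internal",
                               user="repl_admin", password="...")
cur = conn.cursor()
cur.execute("STOP SLAVE")                      # freeze it; don't follow the new West coast primary
cur.execute("SET GLOBAL read_only = 1")        # refuse ordinary writes
cur.execute("SET GLOBAL super_read_only = 1")  # refuse writes even from privileged accounts
# Separately, the host would be pulled out of the application's read pool.
```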
Jamie: [00:20:28] Yup, yeah, that’s a good way to put it, the forked data, because that’s sort of what we have here, a fork. Now, one thing they knew, and this was very heartening to read, is that they did test restorations daily. So they knew the backups worked, but they also knew that restores take several hours, so they knew this was going to be a lengthy process.
But nevertheless, at about an hour and 50 minutes into the outage, they had kicked off restoration jobs on every single cluster on the East coast.
So they started rebuilding new MySQL databases out of the backups on the East coast for every cluster in their system. In the meantime, because they knew this was going to take several hours, they had a team of engineers investigating ways to speed the process up, but it is lengthy. The next checkpoint in their post-mortem is about six hours in. They did have several database clusters on the East coast that were restored and streaming as replicas of the West coast primaries, so they were getting closer to being able to promote everything and have the East coast be the primaries again. But they still had some of the bigger clusters that they thought were going to take about two more hours to restore. So six hours in, they’ve got some East coast clusters back online and ready to be primary, but not all of them.
What’s interesting here is a note just shy of nine hours in. In the background, they’re still trying to get those last couple of servers on the East coast caught up and ready to be primary. At eight hours and 54 minutes, GitHub published a blog post with a bunch of details about what was going on. They actually mention in their notes that they would have loved to have done this much sooner, but the primary way they communicate with the community uses GitHub Pages. So they had a dogfooding issue, which again is a little bit thematic with these outages: they use their own GitHub Pages system to publish their blog posts and communicate with their user base.
And so it wasn’t until this point that their systems were healthy enough to build those pages and publish that blog post. So just short of nine hours in, the first formal communication with any detail in it, other than yes, we are down, went out from the company. Now they have that message out, but in the background they’re still trying to rebuild those last couple of tough clusters on the East coast. It wasn’t until 12 hours into the incident that they finally had all of the East coast servers fully caught up with the West coast and were able to promote them all to primary.
And that took a long time: from when they made the decision to pursue this strategy until they were back online was on the order of ten and a half hours.
Right. So it’s a very long restoration, but they eventually got them all up. What made the end of the process pretty painful is that they started to get into the meat of their day, when GitHub is the busiest. The busier their traffic gets, the harder it is to do background jobs like this at the same time, and the longer they take, because the database clusters are busy serving live traffic. But needless to say, at 12:20, and I feel like taking a breather here for a minute, everything is back. They can repromote the East coast to be primary, and the application gets substantially faster. GitHub.com is starting to feel like its snappy self again, because they now have East coast application servers talking to East coast databases, and everybody is happy in that respect.
Tom: [00:24:35] It’s nice that it’s gotten faster, but it definitely seems like there are still going to be some inconsistencies here, where GitHub is serving reads from the application servers off replicas that are behind, in some cases hours and hours behind. And so you can imagine you’re browsing the site, you’re hitting different application servers, which have different database connections, and you might really be seeing snapshots of the past as you go through it. Fanning out reads to replicas is one of the easiest ways of scaling a database, but it comes with this problem: when your replicas get behind, users might see older data. A lot of times that’s just worth it; it’s still going to be a lot easier to scale your database this way than a lot of other ways, and so this is a trade off that most companies do choose to make.
But depending on your application, it’s something you have to be really cautious about. As you design your app, you have to be careful about round-tripping data read from a lagged replica through your clients and back up to your primary. You can imagine this really gets bad when you’re basing decisions, and making writes, on a lagged view of the world.
There are different ways that could cause subtle or not so subtle problems. So whenever you write to your primary, you want to verify that the data you’re basing the write on is up to date. It doesn’t sound like they had any problems due to that, but that’s something I would be really worried about in a situation like this: what writes are we seeing that are somehow subtly based on a stale view of the world?
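One common defense, sketched here under assumptions about the schema (a hypothetical issues table with a version column), is to make the write conditional on the state it was based on, so a write derived from a stale replica read affects zero rows instead of silently clobbering newer data:

```python
# Hedged sketch of an optimistic, compare-and-set style write.
# The `issues` table and its `version` column are hypothetical.
import mysql.connector

conn = mysql.connector.connect(host="mysql-primary.example.internal",
                               user="app", password="...", database="app")
cur = conn.cursor()

def update_issue_title(issue_id: int, new_title: str, version_read: int) -> bool:
    """Apply the write only if the row still matches the version we read
    (possibly from a lagged replica). Returns False if our view was stale."""
    cur.execute(
        "UPDATE issues SET title = %s, version = version + 1 "
        "WHERE id = %s AND version = %s",
        (new_title, issue_id, version_read),
    )
    conn.commit()
    return cur.rowcount == 1  # 0 rows means the read was stale; re-read from the primary and retry

# Usage: read (maybe from a replica), then write conditionally to the primary.
ok = update_issue_title(42, "Fix replication lag alerting", version_read=7)
if not ok:
    print("stale read detected; re-read from the primary and retry")
```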
Jamie: [00:26:14] Yeah, definitely. And they don’t say exactly how this happened in the write-up, but one guess we could make is that they were so focused on getting the primaries on the East coast running that they didn’t necessarily, at the same time, have the secondaries, the replicas, also getting caught up.
So when they finally were able to flip these clusters over to at least have the primaries on the East coast be a hundred percent caught up and authoritative for new writes, all the East coast secondaries were still quite a bit behind. And so, because of that, they were not completely out of the woods yet, to Tom’s point. There was still inconsistent data being served, even though it was being served quickly now.
Tom: [00:26:58] I’ll get you the wrong answer as fast as you want!
Jamie: [00:27:00] Yeah. So now they started focusing on making sure they could get all of these replicas in the East coast caught up. This really started to get into the peak traffic part of GitHub’s day; this is about 14 hours into the outage, and they started to notice that in some instances their read replicas were actually falling further behind. One of the issues, again, is the load on those replicas: when organic traffic goes up, it makes it even harder for them to catch up on the historical data. So what they did is spin up some extra read replicas in the public cloud, probably East coast AWS or whatever, and spread the read traffic onto those as well. By doing that, they had the load spread enough that they were able to get all the replicas caught up. And then at 17 hours and 32 minutes in, every single replica was back in sync and the East coast was authoritative again for writes. So this represents the point at which all the databases were happy, 17 hours later.
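The catch-up tactic they used amounts to lag-aware read routing: send reads only to replicas that are reasonably fresh, and add capacity (the cloud replicas) so the rest can catch up. A minimal sketch, with invented hostnames and lag numbers:

```python
# Hedged sketch of lag-aware read routing across replicas, including extra
# cloud replicas. Lag values and hostnames are invented for illustration.
import random

replica_lag_s = {
    "east-replica-1": 5400,   # still hours behind
    "east-replica-2": 30,
    "cloud-replica-1": 12,    # freshly provisioned in the public cloud
    "cloud-replica-2": 8,
}
MAX_ACCEPTABLE_LAG_S = 60

def pick_read_host() -> str:
    # Only route reads to replicas within the lag budget; a real router would
    # fall back to the primary if nothing qualifies (not shown here).
    fresh = [h for h, lag in replica_lag_s.items() if lag <= MAX_ACCEPTABLE_LAG_S]
    return random.choice(fresh)

print(pick_read_host())
```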
Tom: [00:28:18] It’s a good thing to remember that all the read replicas, they also have to run all the writes. Every single write to the database has to go through every single one of these servers.
And so if you have a non-trivial write load, your replicas aren’t free, basically. They have a lot of work to do just to keep up. And as you hit that peak traffic time, it definitely makes sense that there’s going to be a lot of work done on each read replica just to keep up with the current state of the world.
Jamie: [00:28:47] Yep. Yeah. A lot of times people are surprised to find that their replicas are even busier than their primaries. If you think of it in some respects, it does make sense. Cause they have to do all the writes that your primary is doing. And you’re probably also sending some read traffic at them.
Tom: [00:29:04] It’s been a long time since I was deeply involved with MySQL, but at one point, all the writes would come through on a single connection. And so you could potentially end up in situations where a read replica could not keep up, because it couldn’t do all the writes serially fast enough. I don’t know if that’s been fixed or not.
Jamie: [00:29:23] It has, it’s been fixed now. But as of maybe five or six years ago, that was still a problem; I can’t remember which version of MySQL it was, but there was a single connection that serially handled all the writes. And you could actually get into issues where you’d be limited by ack times. Especially once flash appeared and databases suddenly got way faster, the replication protocol and the round trip time, even between machines in the same data center, could start to be the limiting factor with something like semi sync replication. So yeah, it gets tricky for sure. Replication has so many different ways it can bite you.
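For reference, the fix Jamie is alluding to is MySQL's multi-threaded replication applier. A hedged sketch of turning it on for one replica is below; the variable names are the real pre-8.0.26 spellings (newer versions use the replica_parallel_* names), but the host and the worker count of 8 are assumptions, not anything GitHub documented.

```python
# Hedged sketch: enabling MySQL's multi-threaded replication applier on a replica.
# Hostname and worker count are assumptions; variable names are standard MySQL.
import mysql.connector

conn = mysql.connector.connect(host="mysql-replica-1.example.internal",
                               user="repl_admin", password="...")
cur = conn.cursor()

cur.execute("STOP SLAVE SQL_THREAD")               # the applier must be stopped to change these
cur.execute("SET GLOBAL slave_parallel_type = 'LOGICAL_CLOCK'")
cur.execute("SET GLOBAL slave_parallel_workers = 8")
cur.execute("START SLAVE SQL_THREAD")              # apply the relay log with 8 worker threads
```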
So, 17 and a half hours in, boom, the databases are happy again. The long and short of it is, all the databases are good now. But they were conservative here by not flipping the site back to green, because GitHub did have to admit that they had deferred a lot of webhook traffic that they needed to process, and a lot of GitHub Pages to build.
Tom: [00:30:32] So they had a pent-up DoS attack ready to unleash on their partners.
Jamie: [00:30:36] Yeah. So for the next, let’s say, seven hours or so, they work their way through what they said was about 200,000 webhooks they had to invoke, and lots of GitHub Pages to build. One of the complexities they ran into is that with those 200,000 webhooks, they had to be very careful to throttle how quickly they moved through them, because in some instances they would be hitting partners who had subscribed to those hooks at rates those folks were not ready for. And so they unleashed minor denial of service attacks, as Tom said, against some of the folks subscribed to these webhooks, just because the backlog was no longer limited by the rate the events happened; it was limited by how quickly GitHub could move through it. So they ran through a few more hiccups here at the end, just trying to get through this backlog and flush their queues so that they were current.
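Draining a backlog like that safely is mostly about pacing. Here's a minimal sketch of a per-endpoint throttled replay; the rate limit and the deliver callback are hypothetical, just to show the shape of the problem:

```python
# Hedged sketch of replaying a webhook backlog without flooding subscribers.
# The per-endpoint limit and the `deliver` callback are hypothetical placeholders.
import time
from collections import defaultdict, deque

MAX_PER_ENDPOINT_PER_SEC = 5  # replay cap, far below "as fast as we can"

def drain(backlog: deque, deliver) -> None:
    sent_at = defaultdict(deque)  # endpoint -> timestamps of recent deliveries
    while backlog:
        endpoint, payload = backlog[0]           # simple FIFO; preserves per-endpoint order
        now = time.monotonic()
        recent = sent_at[endpoint]
        while recent and now - recent[0] > 1.0:  # keep a sliding one-second window
            recent.popleft()
        if len(recent) >= MAX_PER_ENDPOINT_PER_SEC:
            time.sleep(0.05)                     # back off instead of bursting
            continue
        backlog.popleft()
        deliver(endpoint, payload)
        recent.append(now)

# Usage with a fake delivery function:
backlog = deque([("https://partner.example/hook", {"event": i}) for i in range(20)])
drain(backlog, lambda url, body: print("POST", url, body["event"]))
```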
24 hours and 11 minutes in, the backlog is empty. GitHub is fully caught up, their databases are happy, so they flip the site back to green. And they realize that the one job they have left is to go back to those 30 seconds of writes from way back in the beginning, the ones the East coast accepted that never made it to the West coast. They had just shoved those aside and shelved them for the moment. Now that they have the site back online, they need to go through those writes and figure out how to reconcile them with everything else that has happened in the databases since then.
Tom: [00:32:02] Yeah. That was 954 writes from the busiest cluster. That’s such an irritating number, because it’s big enough that you really want to automate some of that. If it had been 20, sure, you’d just go through them all by hand and see what they were. But that is a fascinating problem right there, because there are just so many crazy types of conflicts you can imagine.
I mentioned this earlier, but maybe somebody opens an issue on both sides. Now which one do you go with? In that case, maybe you want to just keep both of them and let the customer sort them out, but without really knowing all the details of their data model, it’s hard to even think about all the different, weird types of conflicts they could have. But yeah, almost a thousand writes, that is a really annoying number to work through.
Jamie: [00:32:49] It is, and it is another testament to their principle about integrity, right. They mention in their notes that they knew, for some of those writes that they were going to look at fairly individually, they would probably just have to contact the customer and say, hey, this happened, we didn’t really have a great way to know what you meant or to resolve this, so sorry, just be aware of it. So they were really willing to go as far as saying that if they ran into situations where they did not know how to make the world whole, they were going to contact the customer and admit that it had happened to their account.
Tom: [00:33:23] And that’s the right thing to do. Customer trust is everything and you really have to just go above and beyond to maintain that trust. As easy as it would have been just to toss all those writes and say, ah, it’s not that many. I really respect that they actually put in the work to do the right thing there.
So in summary, you hear about these network partitions with CAP theorem and all this stuff? Well, this is what happens when things go bad. GitHub had a network partition that more or less resulted in their database getting forked. Which is, is that ironic? I don’t know, sometimes I’m unsure. But they had two branches of their data, one on the West coast, one on the East coast. They had to shunt the East coast aside a little bit until they could get it replicating from the West coast, to preserve the bigger fork over on the West coast. Once they got it replicating, it just took a long time to get the backups restored and a long time to get everything re-replicated, and then I’m sure a little while again after that to get the stuff manually resolved. But they got it all done and were up serving data about 24 hours later. A 43 second network outage, 24 hours of degraded performance.
Jamie: [00:34:35] Sometimes that’s the way the world works. Huh? Geez. 43 seconds.
It’s a very fateful 43 seconds.
Tom: [00:34:44] I would be surprised if they didn’t have a conference room called 43 seconds.
Jamie: [00:34:49] That’s right, you definitely have to make that. Well, Tom, let’s talk first about what went well here, what impressed us when we were reading through it. So what stood out to you?
Tom: [00:35:01] First off, start with the backups. They had backups every four hours and they actually were testing that they could restore from them automatically. That is great. Backing up is certainly something you want to do at least every day, but every four hours is great. So I was happy to see that.
I love the principles: having already thought it through and having it be a stated value that integrity is the highest thing. Having made those decisions ahead of time, and just having a rock you can point to when you’re trying to work through this and say, well, we don’t really have a decision to make here, because we’ve already decided integrity is the right thing to do, so let us do the thing that lets us maintain integrity. That was really good.
Another thing, and this has come up in a few other episodes we’ve done, is that they had the ability to just lock out the tools, to say, okay, Orchestrator, you’re out of the game. It’s all manual here and we’re all paying attention, so don’t do anything weird, we’ve got this. Having already built that, that’s great. And similar to that, they also had the ability to shed work. This is something we’ve talked about several times now: if you have a complicated system, it’s very helpful to be able to turn parts of it off, and to have switches that you’ve already tested, that you use often, and that you just know work.
So I don’t know if they get into this explicitly, but they do it pretty quickly: okay, webhooks are a problem, let’s just flip the switch and turn them off for a while. Let’s turn off Pages builds. Let’s reduce the load and get down to the core. Kind of like keeping your core temperature in the right place and letting the extremities go. Having already built that stuff is a great thing for a mature company to have done. So those were a couple for me. What about you?
Jamie: [00:37:00] Yeah, on the side of being able to quickly make those decisions, guided by principles, I also love the fact that they very quickly went into the yellow and then red status and used that machinery. Sometimes companies, because it is kind of embarrassing to admit you’re down, will wait too long to say it, and play this game of chicken: are we going to get it back? Maybe we’ll get it back in a few minutes, maybe we won’t have to say anything on Twitter or whatever. They just said, nope, this is big, we’re going to say we’re yellow, we’re going to say we’re red. And also, they had a pretty high bar for what done meant; they did not go green again until they were done with their backlog. So I love the fact that they were very open and ready to admit that they were down, and that they had a high bar for when it was working again.
Tom: [00:37:50] And given GitHub’s position as such a center of developers’ lives, I really liked just seeing them do the right thing here, kind of modeling the good behavior, because so many developers were paying attention to this. That probably also made it harder for them to do the right thing, because they did have so many people who knew that there was probably a status page and that they could check it out. But it was nice to see them do the right thing and model good behavior for everybody else.
Jamie: [00:38:17] Yeah. A lot of you who are listening probably know that engineering teams can often feel a lot of pressure: can we put it green again? Can we say we’re back up? So I do like the fact that they were very tenacious about resisting that urge to say everything is fine too fast, and just being honest with themselves and with all of us about the state they were in.
So I think another thing that’s worth pointing out here, and again, having been part of companies that tried to solve this problem, is that even though they kind of got bit here in some respects by their geographic redundancy, there are a lot of companies who wait surprisingly long before they have any kind of answer for: hey, what happens if there’s a massive natural disaster in the part of the world where our primary data center is? Some of them have not yet done the work to have any story about why anything should be geographically distributed. Even though there were some aspects of having invested in that which made their day harder here, the fact that they’re clearly thinking about it, and that they do have West coast replicas and some amount of West coast capacity, is a testament to them being thoughtful about disaster planning. And as Tom said, we all rely on GitHub so much; it’s good to hear, because they’re a very important resource to the development world.
Tom: [00:39:48] One important point here is that nobody flips a switch or just files one ticket and gets geographic redundancy for a non-trivial app. It is really hard to actually build this the right way. There are a lot of discussions, a lot of decisions that have to be made. And so I’m guessing that at this point GitHub was in kind of an intermediate state, because it is a process to get to the point where this works really smoothly. It sounds like they just kind of got caught mid-stream, in that they were moving towards geographic redundancy. Their setup would have helped greatly if a hurricane hit the East coast and just wiped out everything there: they would have had their data. I’m sure they would have had to fight with everybody else for more servers on the West coast, but they would’ve had all the data up to the minute the water broke in or whatever.
At some point I’m sure they’ll get to a point where the whole East coast could go away and nobody would even notice, but it sounds like they were kind of in that intermediate state where they had done some of the work, but not all of it yet.
Jamie: [00:40:49] Yeah, it’s actually a good transition to talking about some things that could have gone better that day. And I think, to your point, the fact that they were in this intermediate state made things a little trickier. They were on a bridge to something good, and even their write-up about what they intend to do next is a continuation of that; they’re talking about active/active strategies and things like that.
But reading this, you say, okay, so they have the West coast thing, but what is it capable of? What problems is it capable of solving for them right now? It sounds like, as you said, Tom, they’ve got all their data over there, and they could probably go fight folks for West coast public cloud servers or something like that in order to have application capacity.
But if you read between the lines on this, one thing that made their day a little harder is that it does not sound like the West coast was provisioned yet to be able to handle all of their traffic. Because if it was, in theory they could have just flipped all their application servers over to running on the West coast, and then they wouldn’t have had these latency issues that forced them to figure this all out with such haste.
Tom: [00:41:54] Yeah, exactly, they were partially there. I mean, it’s extremely expensive to be able to completely lose one data center and have another one step in without any sort of degradation of service, because you need twice as many servers as you would otherwise. And GitHub is a non-trivial site; I’m sure they have a ton of money invested in their hardware. But yeah, to get to the point where you can just fail over from one data center to another, you need databases and application servers. And it sounds like at this point, what they had was the data, which is very important, but they didn’t have all the spare app servers lying around to take up the load.
Jamie: [00:42:33] Yep. Yeah, because so many of the issues they were running into were related to the 60 millisecond hop, you could imagine their day would have been simpler if they could just spin up their application on the West coast as well. Then the performance and everything would be fine while they took their time sorting it out; it could have taken a couple of days to get the East coast back online and it would have been okay. But they weren’t quite ready for that yet, and so they ended up having to be in this configuration where they were spanning database traffic across the country, which ended up creating a site that was not at a performance level they considered usable. It was significantly degraded. So let’s see, what else, Tom, do we look at and say, here’s an idea that, if they had implemented it, might’ve made their day a little easier?
Tom: [00:43:25] So I feel like we should maybe just start giving some of these patterns numbers or pithy names or something like that, because yet again we see a company where some circular dependencies got in the way of either communication or resolution. They were using GitHub Pages to communicate with their customers, and what do you know, GitHub Pages also needs the databases and the app servers and all that stuff to work. Did they say what they used as a backup, or did they just get Pages up and running?
Jamie: [00:43:55] They eventually just got GitHub Pages up and running. I think they were tweeting a little bit and they had maybe a status page, but they did not really get any details out about what was going wrong until they eventually got GitHub Pages running again, which was about eight hours or so into the outage, I think.
Tom: [00:44:16] Well, I guess Twitter is kind of everybody’s backup status page these days, so right, that’s always there, but yeah, that does seem like the kind of thing that just probably caused a little bit more anxiety than necessary when, oh god, we can’t actually broadcast this out the way we were expecting because the whole thing is down.
Jamie: [00:44:35] Yeah, I mean, another one that is in some respects kind of the beating heart of this whole outage is their replication and promotion strategy, and just how thoroughly they understood how the automation would work given their deployment.
They were almost in an awkward middle ground of sophistication, between something really simple and something quite sophisticated that was tested to work in all of these situations. By having something like Orchestrator that’s automatically doing failover, well, automatic failover is really, really hard to not screw up.
And some of it fundamentally comes from the fact that something like Orchestrator is using consensus to make decisions about master election, but Orchestrator is not in line with the replication stream. So when a machine goes down, it knows it’s down, but it doesn’t necessarily know what the latest record was that that machine had acknowledged.
And so you get the scenario we ran into here, where a network partition takes not only the primary offline, but all of the replicas that are likely to also have that latest record. You can’t even check those replicas to find out: hey, what is the latest record anyone has acknowledged besides the primary, which failed? And fundamentally, those are the only candidates for promotion, right? So in this kind of topology you’re not even sure exactly what is safe to promote, what the most recent record acknowledged to a client is, because you’ve separated the consensus group that decides who is master from the replication streams, which say what record is authoritative and most recent. And so there are some tricky consequences of that, right?
And this is one of them, and it bit them. There could have been something simpler, where maybe the Orchestrator group was only making promotions within a data center, a local kind of decision-making. That might’ve simplified their world a little bit. Or they could even have had a single-brain decision-maker, where someone just gets paged if it goes down: instead of a consensus group, just a daemon that makes these calls, which a lot of companies use. They were sort of halfway to something fully automatic that was going to make geographic decisions about failover, but without really reasoning through the distributed systems theory about how that would be safe to do, because it actually was not safe to do with this combination.
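To make that concrete, here's a tiny sketch of the candidate-selection problem: a failover tool can only compare the positions of the replicas it can still reach, so if the partition hides the most up-to-date replicas, the best visible candidate can still be missing acknowledged writes. Hostnames and positions are invented for illustration.

```python
# Hedged sketch: choosing a promotion candidate from *reachable* replicas only.
# Positions are simplified to a single integer (think "last applied transaction");
# hostnames and numbers are invented, not GitHub's real topology.
last_applied = {
    "east-replica-1": 1_000_954,  # has the most recent acknowledged writes...
    "east-replica-2": 1_000_954,
    "west-replica-1": 1_000_000,  # ...but only the West coast is reachable
    "west-replica-2": 1_000_000,
}
reachable = {"west-replica-1", "west-replica-2"}

def pick_new_primary(positions: dict, reachable: set) -> str:
    # The failover tool can only ask the replicas it can actually see.
    visible = {host: pos for host, pos in positions.items() if host in reachable}
    return max(visible, key=visible.get)

winner = pick_new_primary(last_applied, reachable)
print(winner, "is promoted, missing",
      max(last_applied.values()) - last_applied[winner], "acknowledged writes")
```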
Tom: [00:47:14] Yep, it sounds like this might’ve been one of those things that just kind of grew. They had a system with the semi sync replication, and then they added Orchestrator, and then they added the West coast. Probably no one sat down and designed this whole thing top down and reasoned through all the failures. And this is something you see when systems grow organically: you end up missing some of the edge cases. Because it certainly seems like if you were just going to sit down and sketch this out on paper, draw the diagrams, you would see: oh look, if the coasts get partitioned, the West coast is just going to start from a forked, older version of the data. So knowing how software grows organically, I can definitely see how you’d end up with something like this. But yeah, it definitely seems like they just missed some of the subtlety here.
Jamie: [00:48:10] Yeah. And some of this, it might sound a little unfair to GitHub, but it’s just a resistance to getting too clever if you haven’t done the cost-benefit analysis. Like, how often do you lose connectivity between the two data centers? Probably somewhat infrequently, right? How often do you have to promote a secondary to a primary because you lose a piece of hardware? If you have a lot of database servers, quite frequently. So that part you probably do want to automate. The whole “hey, what do we do if the two coasts can’t talk to each other” question, that one you might want to start with just having a human get paged, right?
Because the amount of investment you’re going to have to do in that to know it works is really high, and it’s not going to happen often enough for you to really build confidence that you’ve got it, unless you have a ton of testing and DRTing around it.
Tom: [00:49:00] Yeah. If you’re losing your East coast/West coast connectivity that often, that might be a problem to look into, right? It’s probably okay to have a 10 minute or so gap before deciding to start writing to the West coast, because that’s a big decision.
Jamie: [00:49:20] Yeah. It’s a really big decision, and it’s hard to test and all that kind of fun stuff. And the resulting state isn’t even a state where the site is really up, especially because you can’t move the application servers. So even if you do this and you keep the database online, the site’s kind of unusable anyway, because you can’t really run the application servers on the West coast. So it definitely seems like the right juice-for-the-squeeze ratio might’ve been great automation locally within a data center, and being a little more dumb about things that span data centers. So what else, Tom, stood out to you as something they could have thought through a little bit more?
Tom: [00:50:01] This is probably a universal experience for anyone that’s ever had to restore a big database, but I think everybody has had the thing where they try to restore the database and it’s slow, as slow as you knew it was going to be, but somehow when everything is down and everybody’s upset, that time is no longer acceptable. Sure, you knew it was going to take a couple hours to restore, and you figured, oh, that’s fine, it’s better to have backups. But then, oh god, now we actually need to do it. It’s down. It’s still down. Can we make it go faster? And you have a lot of idle engineers sitting around trying to figure out how to speed the process up. So, obviously you should have backups; having backups is better than not having backups, no matter how slow they are. But if you have a couple of different options when you’re looking at how to build your backup system, just remember what your mental state is going to be like when you need to actually pull the trigger and restore those backups. You’re going to want them back fast, and it’s probably worth paying a little bit of extra money, maybe keeping an extra copy a little bit closer. You don’t have to keep a ton of backups near your servers, but you probably want to keep your most recent backup somewhere you can restore it really quickly. So yeah, that hits close to home.
Jamie: [00:51:16] Yeah, definitely. And the good news is that the public cloud object stores, even though they are far away, are very, very reliable; empirically we know that to be true. So the copy you keep closer to you doesn’t have to be particularly sophisticated. It could just be sitting on a disk, because it’ll probably be there when you need it, and if not, then you have one part of your day that becomes rainy. So you don’t have to overinvest in that thing. You can view it more as an optimization, almost just a cache, rather than something that cannot be lost, because your backstop is always the public cloud, which is going to be very reliable, offsite, and probably geographically diverse from your data center in many respects. But yeah, that local backup is probably going to be there when you need it, and your day just got a lot easier.
Tom: [00:52:11] Just make sure it’s the right data.
Jamie: [00:52:12] Yeah, that’s right. Just make sure it’s the right data for sure. Cool. Well, this was another interesting one. And thank you so much for the suggestion from all of our listeners that have been giving us some suggestions. When we dug into this one, it ended up being really fruitful. And so we hope we keep getting more suggestions from all of you. And as usual, don’t forget to rate and review us, and we will talk to you all next time. Thank you.
Producer: [00:52:43] Thanks for listening to the Downtime Project. You can subscribe to the podcast on Apple, Spotify, or wherever you listen to your podcasts. If you’d like to read a transcript of the show or leave a comment, visit us at downtimeproject.com. You can follow us on Twitter @sevreview. And if you liked the show, we’d appreciate a five star review.