
GitLab's 2017 Postgres Outage


On January 31st, 2017, GitLab experienced 24 hours of downtime and some data loss. After it was over, the team wrote a fantastic post-mortem about the experience. Listen to Tom and Jamie walk through the outage and opine on the value of having a different color prompt on machines in your production environment.

Tom: [00:00:00] Welcome to The Downtime Project, where we learn from the Internet's most notable outages. I'm Tom Kleinpeter and with me is Jamie Turner. Before we begin, I just want to remind our listeners that these incidents are really stressful, and we peel them apart to learn, not to judge. Jamie and I have made similar mistakes on our own projects, and we will undoubtedly make more in the future. So please view these conversations as education rather than judgment of mistakes we think we would never make.

Today we're talking about the GitLab outage from January 31st, 2017. GitLab published a great overview, and they also had a live document they were updating during the actual outage. Let's set a little bit of context and then walk through the timeline.

Jamie: [00:01:03] Cool. Sounds good. So the zoomed-out version of what happened is there was a database load incident at GitLab that caused the site to be degraded. And while attempting to get the site back online, some mistakes were made that caused database data to go missing.

And then they had to work through various ways to restore that data. At the end of the day, 24 hours after the incident began, they were back online, but with some data loss. So we're going to explore the context around their systems and what state they were in when this all started, and then we'll walk through the sequence of events as it unraveled.

Tom: [00:01:56] Yeah. Okay. So like so many companies at this point, GitLab had one monolithic database server in production, a primary/replica pair running Postgres. So I guess technically they had two, but they had one primary with pretty much everything on it. They also had a staging cluster where they could validate new code, which is a good practice.

So they would take a snapshot from production, remove a little bit of data from it, and ship it over to the staging cluster, where they could try out new code with realistic data in a secure way and make sure everything was working. The staging work was done via LVM snapshots, but they also had a scheduled pg_dump run (pg_dump being a tool for taking a logical dump of a Postgres database).

They would run this every 24 hours and upload it to an S3 bucket. In the weeks before the outage, they were having a couple of sporadic load issues. These can be particularly bad when you just have one database, because one bad actor can get your primary database loaded, which makes things slow for everybody.
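
As a rough illustration, a nightly dump-and-upload job like the one described might look something like this (the host, database, and bucket names here are made up):

```bash
#!/usr/bin/env bash
# Hedged sketch of a nightly pg_dump-and-upload job; all names are illustrative.
set -euo pipefail
stamp=$(date +%F)
pg_dump -h db1.primary.internal -U gitlab gitlabhq_production \
    | gzip > "/var/backups/gitlab-${stamp}.sql.gz"
aws s3 cp "/var/backups/gitlab-${stamp}.sql.gz" "s3://example-db-backups/postgres/"
```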

So GitLab was looking into ways they could distribute the load across more backend databases. The issue seemed to be caused by spam, according to the post-mortem, but they don't get into a ton of detail about what exactly that means, and it's not really relevant for the outage.

Jamie: [00:03:31] Cool. Yeah. So that's sort of the state of play as we begin the time sequence. About 90 minutes before they started having site issues related to the load problems Tom just mentioned, which they had been encountering on a regular basis, someone on the engineering team was playing around with using pgpool to load balance traffic between multiple database servers. In order to begin safely testing this deployment, the engineer was going to use the staging cluster. As Tom mentioned, every 24 hours staging would use an LVM snapshot in order to mirror production and have a copy.

But in order to do this testing, the engineer decided to take an even fresher snapshot before they began. So they took an LVM snapshot of production and then began testing pgpool in the staging cluster.

Tom: [00:04:36] It's not totally clear why they needed a more recent snapshot. I would have thought the one from yesterday would be fine, but this is actually going to be really important, and it turns out to be some of the best luck they have over the next 24 hours.

Jamie: [00:04:47] Definitely. For sure, this exercise the engineer was doing doesn't actually contribute to the site going down, but it ends up mattering later in a way we'll discover as we go.

So anyway, 90 minutes later is the official start of the outage, the load-related site degradation. GitLab started noticing that performance on the site was degrading; commits and comments were starting to fail, along with various other things. It was a familiar pattern to them, and their dashboards indicated it was database load related.

Since they were familiar with these kinds of issues from the spam in previous weeks, they assumed this was probably an especially bad spam attack, similar to the others, in a way they didn't completely understand yet. Later on, it was learned that the actual cause of the extra database load appears to have been GitLab's abuse-report mechanism: enough reports were filed against someone, possibly in a trolling kind of way, that an automated account removal kicked off. Those reports were filed against a GitLab employee who had a ton of data in the system, so an especially expensive delete operation was kicked off as this automation tried to deprovision the account.

Tom: [00:06:24] One thing there is that this was actually a real deletion of the employee's data; it wasn't just a soft deletion. A lot of times, if you have data that gets deleted from your system, it's much better to mark it deleted, which can be a very cheap operation, and then, if you do it right, defer the really expensive work until there's a better time. So rather than trying to delete everything about this employee, you could just mark their root account row as deleted and then have things ignore them.
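
A hedged sketch of the pattern Tom describes, with made-up table and column names: the cheap path is a single flag update, not a cascading hard delete.

```bash
# Mark the account as deleted; application code filters on deleted_at IS NULL.
psql -d gitlabhq_production -c \
  "UPDATE users SET deleted_at = now() WHERE id = 12345;"
# A separate, throttled background job can later purge rows where deleted_at is set,
# at a time when the database has headroom.
```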

Jamie: [00:06:55] Yup, for sure. Yeah, that would have helped. So I guess the combination of the usual sort of background spam-type behavior and this large delete caused too much load, and maybe a cascading failure, because the overall load issues persisted for several hours. About four hours in, these load issues still seemed to be ongoing to some degree, and one of the consequences of that additional load is that the secondary, the replica database server, was starting to lag very far behind the primary. The primary had transactions committed from quite a while ago that had not yet made it to the secondary, and as this gap increased, it started to become a problem.

Tom: [00:07:47] Yeah. So this is a pretty standard problem that I've certainly seen before. If you're not familiar with the way replication logs, or transaction logs, work with a primary/replica setup: you run all your writes on your primary, and as those writes happen, it writes out the transaction log, which is the record of everything that's happened. All the writing has to happen on that machine. The log is then shipped over, or just sent over a socket, to the replica, and the replica applies those records locally, so all of that write work has to happen on the replica as well. So if you get into a situation where a ton of writes come into the system, the secondary can just back up trying to apply all of them.

And so the logs just continue to build up on the primary, and eventually that can be a problem. If you have some limit on how big the logs can get, replication will just break, and then you're stuck, because the replica is now out of sync and has to get a new copy from the primary before it can pick up the log again.
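
One way to watch for the situation Tom describes, sketched with Postgres 9.6-era names and a made-up host (Postgres 10+ renamed these to pg_current_wal_lsn() and replay_lsn), is to measure on the primary how many bytes of WAL each replica still has to replay:

```bash
psql -h db1.primary.internal -U monitor -d postgres -c "
  SELECT client_addr,
         pg_xlog_location_diff(pg_current_xlog_location(), replay_location) AS lag_bytes
    FROM pg_stat_replication;"
```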

Jamie: [00:08:53] Yup. Oh yeah. And that is what happened in this case. Eventually the primary began to evict some of the logs, and unfortunately those logs had not yet been propagated and applied to the secondary. So the standard replication stream was no longer going to be sufficient to get the secondary back in sync with the primary. When this happens, the option that becomes available is basically batch restoration: instead of using the transaction log, you go back to the raw data on the primary and replicate essentially everything. It's the easiest way to get a secondary that's very close to the primary again, so that you can resume normal streaming replication, and GitLab knew this. So they shifted into, okay, we need to completely re-replicate the primary.

An engineer swung into action, and the tool GitLab had decided to use for this is called pg_basebackup. You can point pg_basebackup at a primary and it will replicate the entire state of that primary; then you can start up the secondary and everything will be okay again. They ran this tool a few times and it wasn't really working for them. They had not used it very much before, and it wasn't giving a lot of good error output. They had to reconfigure things several times on the primary in order for it to have enough resources available to kick off this big batch replication.

In the course of these restarts, they eventually got to the place where they had the primary up and running and believed they had the resources configured correctly. Because this was a from-scratch replication, they had removed the data directory on the secondary, wiping all the Postgres data there so there was a clean slate to replicate into. After several failed invocations of pg_basebackup, and with the primary back online with the resources they thought this batch replication would need, they tried one more time to run pg_basebackup, and it still wasn't working.
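
For reference, a hedged sketch of rebuilding a replica with pg_basebackup, with made-up hostnames, paths, and service names, run on the replica after clearing its data directory:

```bash
sudo systemctl stop postgresql
sudo -u postgres rm -rf /var/lib/postgresql/9.6/main/*   # on the replica only!

# -X stream keeps WAL flowing during the copy so the result is consistent;
# -R writes recovery settings so the node comes back up as a streaming replica.
sudo -u postgres pg_basebackup -h db1.primary.internal -U replication \
    -D /var/lib/postgresql/9.6/main -X stream -P -R

sudo systemctl start postgresql
```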

And they really did not know why. However, they had a theory, and that theory is where things got kind of interesting again. The team thought that pg_basebackup was maybe silently refusing to begin the replication because the failed attempts at running it before had populated some files into the data directory, though obviously not the complete set, since it had never finished. So they thought, oh, maybe we should wipe the data directory again so that pg_basebackup will run. And unfortunately, and I think we've all been here before, when they reissued the command to delete the data directory, they were at the primary's command line and not the secondary's. So they removed the data directory from the primary.

Tom: [00:12:09] Oh man. I know exactly how that would have felt. That is really the worst feeling you can possibly have when you're working on an outage. You type in some command, you hit enter, and it takes longer than you think it should. You look at your terminal, you look at the prompt maybe, and you just go, oh my God. You Ctrl-C, maybe you hope for a second it's not as bad as you think, but no, you just really did the thing, and there's no getting back from it. Oh, it's so bad. It is a physical feeling in your stomach when you do something like that. My personal nightmare here is mixing up the c and f when I'm using tar to move something around. But it is so easy to annihilate something just with stock Linux tools. So, oh my God, I feel so bad.

Jamie: [00:13:05] Yeah, I've absolutely been there too. And I agree, usually that sinking feeling in your stomach kicks in when the command takes longer than you thought. It sounds like a similar thing happened here, because they also killed the command, but not before it had unfortunately done enough damage to leave the primary without a complete dataset. So for all intents and purposes, at this point GitLab had lost their entire database, and they needed to start looking at backup options.

Tom: [00:13:35] They lost the primary when they already did not have complete data on the secondary. 

Jamie: [00:13:43] There's nothing there anymore. The net result, unfortunately, of attempting the batch replication is that they removed the data directory on the secondary in order to replicate into it, and then accidentally removed the data directory on the primary when they meant to re-remove it on the secondary. So there's no data on either machine, and now you're in backup land. At this point, we're about five hours into the outage, and as you can probably imagine, GitLab is just down. It's no longer degraded; everything is shut off and there's no database anymore. So the focus shifts to recovery.

GitLab, as you do, started scanning what backup options they had. As Tom mentioned when setting context early on, they did have a nightly pg_dump job that was running and uploading backups into a bucket in S3. So they went to check that bucket, and that sinking feeling is about to come back: the bucket was empty. There was nothing in the S3 bucket. This is probably another moment of panic for the poor team as they're running around figuring out what's happening.

They eventually worked it out. When they initially set this all up, it was all working great, and the S3 bucket probably had an eviction policy because they were rotating backups. When they set it up, some time ago, they were running Postgres 9.2: the Postgres daemon on the database server was 9.2, and pg_dump was running on a compute machine at 9.2 as well. At some point after they declared victory and moved on, they upgraded the database server to 9.6, but nobody doubled back to make sure the pg_dump version was also upgraded. And it turns out that when pg_dump connects to a database running a newer major version than itself, it errors out and refuses to run. So pg_dump had not been successfully completing a dump for some period of time, and there was nothing left in the S3 bucket.
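
A hedged sketch of a check that would have caught this class of failure, with made-up host and database names: compare the pg_dump client version against the server before trusting the nightly dump, since an older pg_dump will refuse to dump a newer server.

```bash
server_ver=$(psql -h db1.primary.internal -d gitlabhq_production -At -c "SHOW server_version;")
client_ver=$(pg_dump --version | awk '{print $NF}')

# Compare major versions (e.g. 9.2 vs 9.6); bail out loudly instead of failing silently.
if [ "${server_ver%.*}" != "${client_ver%.*}" ]; then
    echo "pg_dump ${client_ver} does not match server ${server_ver}; nightly dump will fail" >&2
    exit 1
fi
```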

Tom: [00:16:06] Oh man.

Jamie: [00:16:08] Yeah, you can totally see how you get to this spot, but it is tough once you realize you're there. It's a pretty hard day. Now, they did have a way to know whether these jobs were succeeding. The job was being run from a crontab, and they had email set up so that on a non-successful invocation, the error output would be emailed to an address where an operational team could help fix it. However, bad gets worse: these emails were being sent, or attempted, by the cron daemon, but GitLab had switched over to using DMARC for email authentication on their domain. The crontab emails were not set up to pass DMARC, so they were getting silently dropped instead of delivered. So despite the fact that error emails about these failing backups were being attempted, they were never successfully delivered, because they weren't using email authentication on the domain.
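
The cron side of that setup might have looked roughly like this (the script path and address are made up); the catch is that mail sent this way still has to pass the domain's SPF/DKIM/DMARC policy before anyone ever sees it:

```bash
# crontab entry: cron mails whatever output the job produces to MAILTO
MAILTO=db-backup-alerts@example.com
0 2 * * * /usr/local/bin/nightly-pg-dump.sh
```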

Tom: [00:17:24] Yeah, you can totally imagine how this would happen. As employees come and go, there's somebody that's used to getting these emails, then maybe they change roles or they leave the company, and the new people don't know that these emails exist. If that happens around the same time the DMARC change goes in, it's so easy to imagine how that gets lost.

Jamie: [00:17:47] Yup. So, all right. So the S3 backups are not available. As you may recall, staging had a snapshot, and there was a system in place to create these snapshots every 24 hours. At this point that snapshot would have been about 24 hours old, which is sort of the worst-case scenario. Luckily, as we mentioned earlier in the timeline, in order to do the pgpool testing an engineer had taken a more recent snapshot that was only about six hours old at this point. So this LVM snapshot, the one the staging database was running on, was their best option. In fact, it was their only option.

Tom: [00:18:35] I've been in the industry for a while, and maybe things have changed, but at some point it was not okay to just take a disk snapshot of a bunch of b-trees and use that. I don't think Postgres guarantees that it writes everything out in a consistent state. So I'm sort of horrified by this, but fair enough, there wasn't an alternative.

Jamie: [00:18:57] That's right. Yeah. Any port in a storm. But yeah, it's not super safe to back databases up this way, because I agree with you, the guarantees about the stability of that are not ironclad. Still, it was available and it was an option they could use, and so they did. So they began looking into going the other direction, which is not their usual direction: bringing this LVM snapshot back out of staging and restoring production with it. Unfortunately, one more complexity arose, which was that the staging cluster was using cheaper versions of resources, including a slower network-based disk instead of a fast SSD or whatever, and the bandwidth coming off of those disks is pretty limited. So copying the database back out of staging to restore it into production was going to take about 18 hours, which would obviously extend the downtime. And given the nature of the cloud provider they were using, Azure, and the way it handles these classes of resources, it was not possible to just upgrade on the spot to a premium storage option. So once again, this was the option available to them, and they took it. They had these 18 hours to restore production from staging, added to the five-plus hours that they were down under load.

And about 24 hours after the outage started, they were back online. The snapshot was restored and Postgres was started again. They had a couple of details to take care of: for example, they had probably issued some unique IDs out of an auto-increment range that may have been retained by external systems, so they had to skip over something like a hundred thousand auto-increment IDs to prevent any kind of collisions. But then, 24 hours later, they're back online. So total consequences here: 24 hours of outage and about six hours of data loss, which ends up affecting some 5,000 or so customers in various ways. A big, big, hard day at GitLab. Reading through it, yeah, lots of PTSD here.
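
The ID-skipping step could be done with something along these lines, as a hedged sketch with an illustrative sequence name: bump each sequence well past anything that might already have been handed out before the restore point.

```bash
psql -d gitlabhq_production -c \
  "SELECT setval('projects_id_seq', (SELECT last_value FROM projects_id_seq) + 100000);"
```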

Tom: [00:21:27] If this happened to me, I would definitely be taking a screenshot of what my Apple Watch was telling me. My heart rate was over the…

Jamie: [00:21:39] yeah, for sure. 

Tom: [00:21:40] Oh God, that would be so so rough.

Jamie: [00:21:41] So Tom and I were talking before the podcast about that moment when you realize you were in the wrong shell when you ran the deletion, both of us having been there at one point in our careers. At that moment, you need to stand up and walk away from the keyboard, and you need to let a colleague sit down and take over. You probably need to go take a walk, get some tea, get a Coke, because you're going to be a little shook up for a while. That Apple Watch would be spiking a lot right about then.

Tom: [00:22:12] Yeah, sometimes you just have to settle down and take a break. Like, OK, things are going to be down for a while; let's not make things any worse.

Jamie: [00:22:21] Yup. Yup. Yeah. You sort of enter into a new class of outage and probably the first motion is actually for everybody to take a breath for a second and regroup.

Tom: [00:22:35] Okay, well, there's the timeline. That was a pretty intense one. As we usually do, let's run through the things that went well, and then we'll have a section on things that might've gone a little bit better.

Jamie: [00:22:45] Sounds great. 

Tom: [00:22:47] So for the stuff that went well: bad things happen, and that's just how life is sometimes. At that point you're going to be judged somewhat, and the way you're going to get judged is really on the recovery, how you explain what happened and how you tell the world what went on. So I give huge props for the transparency in dealing with this outage. They actually had a live document they were updating while the outage was going on, which I don't know if I would want to do, because it might distract the team a lot, but they did it, and then they had just an incredibly thorough post-mortem. I mean, one of the reasons we picked this one is not to pick on GitLab or anything, but just because it was such a great post-mortem. This kind of transparency is how you rebuild trust. It's how you make people understand that we've learned our lessons and, I promise you, this is not going to happen again. And certainly everyone involved in this incident is not going to do it again; these are lessons that will stick with you for decades. So that's the first thing I want to point out: really, really well done on the transparency front.

Jamie: [00:24:09] Yeah, a lot of transparency, and I think the positioning around accountability was appropriate too. The CEO was the voice in the post-mortem, claiming responsibility, and I saw senior folks from GitLab on places like Hacker News responding to feedback and stuff like that, and really just owning the mistake. Because, like Tom said, mistakes are going to happen; mistakes do happen, it's part of life. But seeing the company not shirk its responsibility to its customers at all, being very clear about what the mistakes were and what they're going to do about it, and owning that responsibility, was really impressive in this case.

Tom: [00:24:58] Yeah, for sure. So the next thing that went well is they had a staging environment, and that is what saved them. Honestly, if they had not had the staging environment, I think the company would have been gone. I don't know how you would recover from that. I mean, I guess you could start actually doing disk recovery, opening up the Postgres b-tree source and trying to extract the raw data, but then you're talking about weeks of downtime versus days, and I think it would've been very hard to come back from that. So, they had built a staging environment. That's not free; that takes engineering work, and it's just a great thing for validating changes and making sure code is safe before it goes out. So kudos to them for having put in the work to have a staging environment, which, although this was never its goal, did end up arguably saving the company.

Jamie: [00:25:52] Yeah, definitely. I think the staging environment existing is great, and obviously it was yet another backstop. Even though the monitoring of it, well, the backups were not working successfully, and we sort of know why now, the company had layers of redundancy here: you have a replica, you have 24-hour database dumps, you have a disk-level snapshot. The truth is, if you have a critical asset like your primary database, you want multiple diverse ways that it's made redundant. LVM snapshots succeed and fail in different ways than pg_dump running on the command line, which fails in different ways than online replication. That set of diverse ways to represent the same data makes it significantly less likely that they all fail at the same time.

So you could imagine instead, just as a contrast, that you had only online replication, but you had nine replicas in three different geographies. You might say, oh my gosh, I have so many copies of the data. But all it takes is one delete command that goes across all of them to break that, or one bad bug in Postgres or whatever. Even if Postgres had a bug, for example, LVM disk snapshotting would not; it's not correlated with bugs in Postgres. So the fact that they did have a few different ways this data lived in multiple places ultimately saved them, as Tom said before, the staging environment being one of those places. When two out of three of these representations failed, the third one still saved the day.

Tom: [00:27:44] All right. So let’s get into things that might have gone a little bit better and maybe some lessons other people can take away from this, just to help lower average heart rates across the industry.

This is definitely an outage that had two distinct phases. There was the time when it was an unavailability outage, when the site just wasn't available, and then it transitioned into a data loss outage when the bad command got issued. And this is so understandable and so predictable when you're under stress, when things are moving fast and you have that pressure to get the site back up and running. It's worth thinking about multiple different ways you can prevent issues like a bad command during an outage. So, one thing I like to do, even when I'm not in an outage: if I'm doing anything on a database that is remotely scary, I will get on a VC with somebody and have them actually look at the commands I'm typing, just so somebody else can OK it. We have code review for all the code, but you should think of an equivalent thing for typing in commands. Come up with the equivalent of code review for your CLI when it's really important.

Jamie: [00:29:09] Yup. Yeah. I think in some of the most successful firefighting scenarios I've been in (now this is in, let's say, an office environment, but there are VC equivalents), there's someone that's driving, and then there's usually at least one or two people sort of looking over their shoulder. Just a pause before you hit enter, asking "does this look good?", and getting an ACK from the other engineers around you is one way to do it when you're under fire. There's an extra anxiety the person at the keyboard feels, because there's this visceral sense that you're the one that's supposed to be fixing it. The folks that don't have their hands on the keyboard sometimes have slightly clearer heads, because they feel a little less directly accountable for making a mistake. So sometimes the person watching you work will be the one to catch the "oh wait, aren't you on the wrong machine?" kind of mistake.

Tom: [00:30:10] Right. So, I read this on the internet a long time ago, but somebody had this great story about how on US Navy submarines, which I think are all nuclear powered, there are people in charge of the reactor, and there's a big panel of stuff, but below the panel there's a brass bar. And when something is not right, it's the sailor's job to put their hands on the brass bar and not touch anything until they figure out what's going on. It's an actual physical place to put your hands so you don't just start flipping switches and potentially breaking something. I love that story, but then I asked a guy I knew who was actually on a sub about it, and he said that was BS.

Jamie: [00:30:53] Oh, I love that story too. I'm a little disappointed that it's not real, but it's still a good story, right? Yeah, we need the dev ops equivalent of the brass bar. On this topic, there are some increasingly sophisticated ways to think about preventing these kinds of bad commands. Certainly having a coworker looking at it with you is a relatively low-tech way to build some safety in, but maybe there are some others we can talk about, Tom, that ramp up the sophistication.

Tom: [00:31:29] Yeah. So, one of the easiest things you can do is just have different prompts on all of your machines, in all your environments. It's pretty easy to set up just a different color prompt, and this can happen automatically when your servers or whatever get deployed: when the machine gets set up, it just puts a line into the shell profile such that if you're on prod, it's red, and if you're in your staging environment, it's yellow or something like that. It's just a visual cue that things are different here. And of course, hopefully you're not spending so much time in a prod shell that you're just accustomed to it. But this is just another really cheap and easy layer you can put on top of things. It's defense in depth. Just one more thing.
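
A minimal sketch of that idea, assuming hosts are tagged with an ENVIRONMENT variable (the variable name and colors are illustrative), dropped into the shell profile at provision time:

```bash
case "${ENVIRONMENT:-dev}" in
    prod)    PS1='\[\e[1;31m\][PROD \h]\$\[\e[0m\] ' ;;     # red: hard to miss
    staging) PS1='\[\e[1;33m\][staging \h]\$\[\e[0m\] ' ;;  # yellow
    *)       PS1='\[\e[1;32m\][\h]\$\[\e[0m\] ' ;;          # green for everything else
esac
```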

Jamie: [00:32:17] Yup. And another slightly more sophisticated thing you could do is invest in a practice of making sure that deletes are soft deletes that move things out of the way by default.

As much as folks may giggle a little bit at the concept of the recycle bin, especially back in the day, I think in retrospect we have to say there's a lot of merit to this pattern. For example, on your database servers, it may be that you don't really want it to be possible to just rm unless someone opts in. The default should be to essentially move things to some sort of quarantine area that's eventually evicted. The human side would just be to get in the habit of doing mv instead of rm if possible, and the tooling side would be to try to have your rms actually be mvs in critical environments.
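
A hedged sketch of the rm-as-mv idea: a small wrapper that moves its targets into a quarantine directory instead of deleting them. The quarantine path and the cleanup policy are assumptions.

```bash
QUARANTINE="/var/quarantine/$(date +%F)"

safe_rm() {
    mkdir -p "$QUARANTINE"
    for target in "$@"; do
        mv -- "$target" "$QUARANTINE/" && echo "quarantined: $target -> $QUARANTINE/"
    done
}

# e.g. safe_rm /var/lib/postgresql/9.6/main
# A separate cron job can purge quarantine directories older than a few days.
```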

Tom: [00:33:18] Yeah. I think at the really high end, at some point in your company's life, all your machines should be tracked in some kind of database that knows what they are. Ideally, you would have tools that check your machine database and are a little bit more restrictive about what you can do when a machine is the primary. In a perfect world, you would have to demote the primary to be a secondary before you could really delete anything. Something like that.

Jamie: [00:34:01] Yeah. We've had things in production set up well enough that you can go as far as, if you can invest in it and you have the time and you're at that level of maturity, using the Linux capability system to make it so that even root can't remove files in the data directory. Only blessed tools can execute those commands, and those tools check your machine database to tell whether this machine is currently a primary. We even had a tool set up at one point that would check whether or not something was bound on port 5432 or whatever; if the database is running, it will not do anything. So there are different kinds of checks you can make to plumb into the cluster's concept of "is this a primary?" and have it essentially refuse to run tools that can destroy data if that is true, because that was almost definitely a mistake if you attempted it.
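
A hedged sketch of that kind of guard, refusing destructive operations if a live Postgres is accepting connections or a (hypothetical) machine database says the host is a primary:

```bash
if pg_isready -q -p 5432; then
    echo "refusing: a Postgres server is accepting connections on this host" >&2
    exit 1
fi
if [ "$(machine-db role "$(hostname)")" = "primary" ]; then   # machine-db is a made-up CLI
    echo "refusing: machine database says this host is a primary" >&2
    exit 1
fi
rm -rf /var/lib/postgresql/9.6/main/*
```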

Tom: [00:35:11] But before you build any of that stuff, make sure you have a successful business for sure.

Jamie: [00:35:16] All of this is a spectrum. You can get a long way with just practices and really inexpensive things like different prompts before you get into making it so that you can't do things based on machine database tags or whatever. That's a great place to be in, but it does take a lot of investment to get there.

Tom: [00:35:39] Okay. So the second kind of umbrella is just that if something's not tested, it doesn't work. And this is something we're going to come back to time and time again in this podcast, I expect: if something's not consistency checked, it's probably inconsistent. If something's not tested, it probably doesn't work. If you haven't restored your databases recently, your restores don't work. It's just the safest way to think about things. It's making time for it that's usually the hard problem.

Jamie: [00:36:10] Yeah. It's also about making sure you're checking for things that are at the end of the dependency chain, ideally, and not some intermediate point. One way to express this is: if you can restore a backup, the backup worked. If you combine "if we don't test it, it doesn't work" with "are we actually testing the outcome we're looking for?", it almost doesn't make sense to spend too much time checking whether you quote-unquote did the backup successfully. In order to know that the blob you put in the object store is not garbage, you have to try to restore it; in order to know you know how to restore it, you have to restore it. The purpose of the backup is not to have a file in S3. The purpose of the backup is to be able to restore the database. So you have to test that, otherwise you're not really testing anything.

Tom: [00:37:13] I think that's just a really great way of putting it. Think about the outcomes you care about and evaluate whether or not those outcomes are working, not whether the intermediate steps are.

Jamie: [00:37:24] Yeah, basically DRT-type exercises, where you do disaster recovery simulations, are great to do as a team every now and then. You just say, oh hey, we're going to use pg_basebackup this week. That way you discover the lessons you need to know ahead of time, or if the person who knows how it really works isn't around, you can just take a note and get the answer later, asynchronously, as opposed to needing that person to be available when you're under fire and have a deadline. Not all of these things have to be automated, but automation in the long run is obviously great. Automating restorations is something a lot of companies do, and that's amazing, especially if you can make sure that what you restore is a meaningful subset of your data; for example, if you could replicate against the primary successfully, that probably means the restoration worked. But even lacking that, depending on where you are on the maturity spectrum and the size of your team, just try it: get a calendar and make sure that once a quarter, or whatever, your team uses the tools to bring your database back online from S3, or uses pg_basebackup to rebuild a secondary from the primary, as in the sketch below.
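
A hedged sketch of what a periodic restore drill might look like, with made-up bucket, database, and table names: pull the newest dump out of S3, load it into a scratch database, and fail loudly if the result looks empty.

```bash
set -euo pipefail
latest=$(aws s3 ls s3://example-db-backups/postgres/ | sort | tail -n 1 | awk '{print $4}')
aws s3 cp "s3://example-db-backups/postgres/${latest}" /tmp/restore-test.sql.gz

dropdb --if-exists restore_test
createdb restore_test
gunzip -c /tmp/restore-test.sql.gz | psql -q -d restore_test

count=$(psql -At -d restore_test -c "SELECT count(*) FROM projects;")
[ "$count" -gt 0 ] || { echo "restore drill failed: no rows in projects" >&2; exit 1; }
```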

Tom: [00:38:48] Yeah, that is all great stuff, but the higher-level point I want to make here is that this is really an ownership thing. None of this stuff is technically hard. There just has to be somebody who cares about making it happen. There has to be someone whose job it is. Ultimately that's the CEO, but hopefully they will have delegated it to somebody and said, this is now your job, I want to hear what you've done. Somebody needs to own this, and that's where it all begins.

Jamie: [00:39:22] Yeah, I agree with that. Another thing, being pretty familiar with these kinds of teams: find a way to celebrate it too. This ends up being the kind of work that nobody notices unless it goes wrong, and if you're doing preventive work, it's hard to illustrate the value of it, because, well, we did the work, so we don't know whether it was necessary. So it can kind of get lost. As an engineering leader, or as a team and a culture, or yes, as the CEO of the company, you need to make sure there's a way to value and celebrate this work, so that the teams doing this preventative work feel like they're not trading it off against more "valuable" work like shipping features. You have to find a way to celebrate the preventative stuff.

Tom: [00:40:11] And DRTs are also a neat way of, I wouldn't say onboarding people, but taking people that are maybe a month or two into learning about a system and really deepening their understanding of it, because that's when you might have to be on your toes a bit more and learn about the system pretty quickly. A lot of people never have to learn that stuff until the people that already knew it are gone, and then it gets a lot harder. Let's get on to the next point. So, you have databases, which are really important for businesses these days. That's not a place you ever want to skimp. If you have the money, it's a good place to spend it: better databases and better storage generally make everything faster. In this case, if they'd been able to keep more of the transaction log around, they might've been perfectly fine, and if they'd had a faster network connection or a faster disk where the LVM snapshots were stored, they wouldn't have had as much downtime. But more importantly, if you can pay somebody else to deal with your databases, please do.

Jamie: [00:41:36] Yeah. Yes, for sure. I mean, I think the database, in many respects, is the company, right? The company's assets are the things customers have given you, that you keep for them and represent back to them, and you move that data through workflows to enrich it. If there's one place you should really make sure you're doing whatever you can to not have to worry about it, it's your databases. And it's also one of the harder things to get right, because of all the things that can go wrong in a network: replication, backup and recovery procedures, latency. A lot of times, kind of like what happened to GitLab, even if companies have amazing backups, the first time they have to restore those backups is the first time they learn it will take three days to restore. That's something you should probably know ahead of time, or you should have someone that knows the magic to make it fast. Ideally, if you're using a higher-level service provider, they have disks fail all the time too, and you never really know about it. So I think databases are one of the things that, in very few circumstances, should you try to run yourself. It's just really, really hard. And I say that having been in charge of teams that did it on tens of thousands of database servers; the list of edge cases is incredibly long, and yet the criticality of doing it correctly is very, very high. If you can pay someone to do that for you, it's probably a no-brainer.

Tom: [00:43:18] Yeah. Databases sneak up on you. They start out very simple, like everything is fast to start with, everything is easy, there are just no problems, there are no edge cases to start with. But, as things grow, it’s more complicated than you think it is, is generally how I would put it.

Jamie: [00:43:38] Yes. Very, very complicated. Especially at scale. Yep. So pay somebody to do it for you and you will pay a premium for that. But I think when it comes to your database, it’s worth it. 

Tom: [00:43:50] All right. And the last kind of bucket we want to talk about, or put things into, is just a general philosophy around notifying the right person when something's gone wrong.

Jamie: [00:44:03] Yep. Yeah, for sure. When something goes wrong, how do you know about it, how do you know about it reliably, and how many different ways are there to discover it? It's another one of those things where it's so important for it to actually work; if something goes wrong and someone knows about it, they can take action on it. If at all possible, you want to make it so there's only one way that happens. Maybe the opposite of this, just to illustrate what it actually means: you have a metrics-oriented thing that pages when things cross thresholds, and then you have error logs that people only see when they check their email. Not to pick on GitLab, but that's probably kind of what we're describing here. Having one way that, whenever anything goes wrong, someone will know about it is great, because you can harden that one way, and it's much more likely that it won't fail. Whereas if you have three or four ways, but one of them is used 90% of the time and the others are used less than 10% collectively, the ones used less than 10% are just way more likely to break without people knowing.

Tom: [00:45:19] Yeah, once you have this one way of doing things, you can really invest in making sure that it is completely bulletproof. You can inject synthetic errors and make sure this one way is picking them up; you can have systems that detect that this thing isn't running and then alert you based on that. So yeah, just have one way of getting errors to you and make sure that that's working.

Jamie: [00:45:49] If this is originating from error logs in this circumstance forever, that's totally okay, but then use something that turns those error logs into metrics: things like Logstash or Vector, or Datadog if you're using a hosted thing, or lots of other solutions that can turn logs into metrics.

The other thing, which is a little more in the weeds about these kinds of systems, is that if you need to know something is wrong because something tells you it's wrong, that pattern is usually more fragile than ringing the phone unless something is right. The example here often has to do with metrics systems, which we already talked about, and the question of what a missing value means. So, say we take a database backup, and this is again based on actual teams I've worked on, so I'm stealing a little bit here. A great thing to check is basically: ring the phone unless there's a value, in seconds, for the time since the last database backup (or restore, ideally) and it's less than two days or whatever your threshold is. Nominally, what this means is: if it's more than two days, page someone, and if the value is missing, page someone. Because what can otherwise happen is that the system doing the checking just stops working, and you never get an "error occurred" type value. So you're inverting it: it's not "ring the phone if an error happens," it's "ring the phone unless you got data that said everything worked." If you can shape your systems that way, they're just going to be a lot more robust to the kind of missed-signal problems that occur when something quietly stops running. That would be my recommendation in this area; in my experience it works pretty well for these kinds of systems.
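
A hedged sketch of that inversion, with a made-up metrics endpoint and a hypothetical page-oncall command: page if the backup-age value is stale or missing, rather than waiting for an explicit error.

```bash
threshold=$((2 * 24 * 3600))   # two days, in seconds
age=$(curl -sf https://metrics.example.com/last_successful_backup_age_seconds || echo "")

# page-oncall is a made-up pager CLI; a missing or non-numeric value pages too.
case "$age" in
    ''|*[!0-9]*) page-oncall "backup freshness metric is missing; assume backups are broken" ;;
    *) [ "$age" -gt "$threshold" ] && page-oncall "last good database backup is over 2 days old" ;;
esac
```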

Tom: [00:47:59] Yeah, absolutely. Absolutely agree. Well, cool. anything else you want to call out from this one? 

Jamie: [00:48:07] No, I think that's it. Other than, reading through it, I was on the edge of my seat. I think we've all been there when it seems like one thing goes wrong after another. The resiliency teams have to show to get the site back online when things like this are happening can end up really pulling the team together. And as you said, the team emerges from the other side so much stronger and wiser. They are hard lessons, but they're lessons learned really deeply. And I'm sure one of the things GitLab did, if you look at all their follow-up items, is not only have the individuals learn it, but have those individuals help build it into the culture, so that it's not just those individuals who won't make that mistake again; the company has become stronger at this kind of problem.

Tom: [00:49:07] GitLab is still in business; they're still going strong. They did the right things and they survived. So again, big kudos to them for being so open about this really, really rough 24 hours.

Jamie: [00:49:20] Definitely.

Tom: [00:49:23] Well, cool. Well, thanks everybody for listening. We’ll be back for another episode soon.

Producer: [00:49:31] Thanks for listening to The Downtime Project. You can subscribe to the podcast on Apple, Spotify, or wherever you listen to your podcasts. If you’d like to read a transcript of the show or leave a comment, visit us at www.downtimeproject.com. You can follow us on Twitter @sevreview. And if you liked the show, we’d appreciate a five star review.
