
Salesforce Publishes a Controversial Postmortem (and breaks their DNS)


On May 11, 2021, Salesforce had a multi-hour outage that affected numerous services. Their public writeup was somewhat controversial: it's the first one we've covered on this show that called out the actions of a single individual in a negative light. The latest SRE Weekly has a good list of different articles on the subject.

In this episode, Tom and Jamie talk through the outage and all the different ways that losing DNS can break things, and weigh in on why this post-mortem is not a good example of how the industry should treat outages.

Jamie: [00:00:00] Welcome to the Downtime Project where we learn from the Internet’s most notable outages. I’m Jamie Turner, and with me is Tom Kleinpeter. Now, before we begin, I want to remind our listeners that these incidents are stressful and we peel them apart to learn, not to judge. And ultimately Tom and I have made, and in the future, unfortunately, will make similar mistakes on our own projects. So please view these conversations as educational, rather than a judgment of mistakes that we think we would never make. And today we’re talking about Salesforce’s May 2021 DNS outage. 

But before we get into that, we have a few updates for you all. So please continue to share the show with your coworkers. If you enjoy it, there's a good chance they will enjoy it, too. And make sure you're following us on Twitter if you're not already. Every time we have a new episode, we tweet about it there, and we also sometimes have content where we're engaging with folks that were involved in these outages and things like that. So check us out on Twitter.

And another plug for my startup here. If you want to build cool serverless platforms, the kind of systems that could be used to build the internet companies of the future without those companies even needing operations teams or having to solve the kinds of problems we talk about all the time on this podcast, come apply at my company, Zerowatt. We're a brand new startup that's just hiring our first batch of engineers. So, jobs@zerowatt.io. And, Tom?

Tom: [00:01:48] All right. Just another quick plug for my startup as well. It’s called Common Room. If you’re interested in working on some really interesting software that helps companies with their communities, you should email me at tom@commonroom.io. We have a ton of customers, and we have so much work to do. It is crazy. So there is no shortage of interesting problems we’re working on. 

Also, if there is an outage that you want to hear us do an episode on, please send us an email, or follow us on Twitter, or just get in touch with us somehow. We’d love to hear what you want to hear more about. 

Jamie: [00:02:22] Yes, among the many benefits of the modern internet, as you all know, there are about 80 different ways to get ahold of us. So please, please use any of those 80 ways; we will find your recommendation in one place or another. Cool. So today's sev is Salesforce's no good, very bad DNS day. Tom, you want to give us a little background on this particular incident?

Tom: [00:02:47] Okay. If you’re listening to this podcast, you almost certainly know what DNS does. But just for a very quick refresher, DNS is the thing that turns a host name, like downtimeproject.com into an IP address, which is what you need to actually route some packets to a server.

Salesforce is going to have a DNS problem in this podcast. They are big enough to run their own name servers. And when those go down, it is really nasty because DNS is just at the root of everything. You need them to get to any login page. You need them to use almost any of your internal tools. And as you’ve seen in some of our past episodes, if you can’t get to your login pages or your internal tools, you’re going to have a rough day. So DNS is on the very short list of services that absolutely positively have to keep running for everything else to work. So let’s see what happens.
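To make the lookup step concrete, here is a minimal sketch of what every client does before it can open a connection. This has nothing to do with Salesforce's stack; it just uses the Python standard library resolver, and the host name is an example.

```python
# Minimal illustration of the resolution step described above: turning a
# host name into IP addresses before any packets can be routed. Uses only
# the Python standard library; the host name is just an example.
import socket

def resolve(hostname: str) -> list[str]:
    # getaddrinfo asks the system resolver, which in turn queries DNS
    # servers like the ones Salesforce runs for its own infrastructure.
    results = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in results})

if __name__ == "__main__":
    print(resolve("downtimeproject.com"))
```

When the name servers are down, this call is what starts failing everywhere at once, which is why logins, internal tools, and status pages all break together.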

Jamie: [00:03:40] That's right. Well, here's what happened. Just a warning before we dig into the timeline: there aren't very many timestamps on this timeline, so unfortunately we're going to have some steps that we talk about here where we're not quite sure when they occurred. But we'll just do our best as we go.

So at a certain point, let's just say about 10 minutes before pagers started going off, Salesforce started to roll out a DNS change. Now, the reason for this DNS change is not super important, but basically they were changing how two data centers resolve names with each other. And so this involved changing their DNS server configurations, meaning the config files that drive the DNS server they're using, which is BIND, would be changed, and then the servers all need to be restarted in order to reload those configurations and change their behavior.

So they make this change. They start rolling it out. Let’s say a few minutes later, logins start failing. So let’s call this the beginning of the incident. So a pager starts going off, and the Salesforce engineering team starts to see that authentication services everywhere are starting to fail and people are not able to log into services anymore.

Tom: [00:05:06] Now, they don't say exactly why the authentication services were failing, but it doesn't surprise me one bit. Once DNS goes down, it's going to fail, and login is probably just the first thing people notice because that's what they're hitting first. If you can't even get to a webpage to sign in, or you can't resolve the host name of a server you have to call to log someone in, it's going to fail.

Jamie: [00:05:25]  Well, Tom, but at least the status page is working, right? 

Tom: [00:05:31] Oh, but, wait. Yeah, so this is another one, and I promise you, we’re not just cherry picking sevs that have status page failures. But yet again, the Salesforce status page failed–not hugely surprising because once DNS is down, it’s like everything is going to be broken. So yet again, we see another company who can’t update their customers because the status page is broken.

Jamie: [00:06:03] So anyway, to recap the place we find ourselves: everybody gets paged about logins failing, and a recent DNS change just went out. And so the engineers quite sensibly say, okay, this recent DNS change seems to have been at fault, because we can't resolve any names.

Their remediation lagged over a couple-hour period at this point because they were trying to figure out how to get into their systems in order to fix them. Their authentication layers, and probably even finding hosts in the first place, relied on DNS, and getting onto hosts relied on their authentication services. And since those things weren't working, the operations team struggled to get onto the hosts in order to fix the problem. There's a reference in the post-mortem to a kind of break glass routine that the teams eventually figured out how to employ so that they could get into some of the servers and start bringing the name servers back online.

Tom: [00:07:01] If your DNS has gone, you can just be screwed in a lot of different ways trying to get to your servers. I don’t have the IP addresses for any of my servers written down right now. It’s all on AWS, so it doesn’t really matter. And I trust AWS to keep this stuff running, but if you have bastion.hostname.com or something, you can’t SSH to that anymore unless you know what the IP address is. You probably can figure out the IP address for some of your DNS servers, but it might be hard to get all of them. 

But what's even worse is, depending on how you have login set up, if connecting to the server requires DNS in some way on the other end, like if it has to resolve that you're coming from some trusted host name or something, you just might be completely out of luck in terms of SSH. At that point, hopefully you have some kind of IPMI or KVM over IP, which, if you've never done anything with that before, is effectively a way to simulate being physically present at a server without having to be there. It routes over the network what you would see if you connected a keyboard and a monitor to the machine, and you can do things like remotely restart the server, connect to the terminal, things like that. If you don't have that running, then your only real recourse would be to actually start rolling a little cart.

And it's fascinating, because this was such a big part of my early data center life, but I imagine a lot of engineers now have never even thought about this. In every data center that I've ever been in, there are rows and rows of these tall racks of servers, and sprinkled throughout the data center are these things called crash carts, which are a monitor and a keyboard and a mouse on wheels. They usually take them out of the pretty pictures they use to show off the data centers, but you can roll these things around and plug them into servers when you really need to. I don't know if these things really still exist. Maybe they do, maybe they don't. I'm sure they're still out there in some data centers, but that would be the ultimate fallback if you really locked yourself out of a server because you needed DNS to operate.

Jamie: [00:09:29] If you're at the point where you're getting a real cart, things are getting real. I think in that last outage, we talked about the physical world intruding on your sort of idealistic digital world, and the moment that crash cart is rolling down the aisle, and you're driving to the data center trying not to speed and stuff like that... that's when things get interesting.

Tom: [00:09:50] So I think probably one of my top three biggest screw-ups of all time resulted in me having to physically connect the crash cart monitor and keyboard to about 300 servers one night. This was back in probably 2002, for Audiogalaxy. We had hundreds and hundreds of these servers, and we had them all configured with iptables to only allow SSH from a hard-coded set of IPs. And I really was not very good at that sort of work, and I had stepped in for somebody else to do this big IP migration one night. So I got in there in the evening, planning on being there all night, but I screwed something up, did some change out of order, and ended up without any IP address that the servers would actually let me connect from. And so, yeah, I had to literally physically connect the monitor and the keyboard to hundreds of servers and type in a password and reset the iptables rules. And I was very sad to have to do that.

Jamie: [00:10:57] And here by getting into computing, you thought you were getting out of manual labor, Tom, but that night you learned not necessarily. 

Tom: [00:11:04] I’m getting shivers just thinking about that. You know, the fascinating thing is that just statistically, if you start physically plugging into that many servers, you’re going to just break some of them. Like some of them are just not going to come back up. And I think we lost probably two servers that night, just because, you know, either static or just who knows what–they didn’t like being touched and rebooted. So, oh God, what a pain that was. 

Jamie: [00:11:29] Well, 3 hours 45 minutes into this outage, we don’t exactly know how much glass was broken in Salesforce’s particular version of this procedure and whether a car was driven to a data center and a cart was rolled. But somehow they started to fix the issue on these name servers, which we’ll talk about in a minute. And so 3 hours 45 minutes in, most services were back online. DNS resolution was working again in general in the cluster. 

Believe it or not, that darn status page was still stubbornly refusing to work for about another hour. The write-up says their status page was built on a service that runs on Heroku, and they needed to spin up more dynos, which are kind of like the container capacity units behind Heroku. They were finally able to, probably because the authentication to get into their Heroku account and increase the resources wasn't working until just then. So about an hour after they got DNS working again and most services back online, their status page was able to reflect that information. That was kind of interesting.

Tom: [00:12:39] I had actually forgotten that Salesforce had bought Heroku, but yeah, they own them. And it’s unclear how much their stacks are integrated at this point, but enough to make it hard to update the status page.  

Jamie: [00:12:52] Normally in acquisitions, the technical integrations are some of the last things to actually happen and can sometimes take a long time. In this particular instance, Salesforce was probably suddenly wishing it had taken a little longer. So five hours and 12 minutes in, they declare all clear on all clouds, all services, all status pages. Everybody is happy, and Salesforce is back to where they started.

So let's talk a little bit more about the root cause here, because the timeline is interesting but fairly straightforward: DNS goes down; they can't get into the machines; they eventually get back into the machines; they do something; everything comes back up. The root cause, what actually went wrong, is, as I said at the outset, that they were rolling out a configuration change for BIND, which is the software they use to run their name servers, and probably what many or most of us running Linux-based name servers use as well. When they do this configuration change, new config files are put into place and then the name servers are restarted; that is, the named daemon is restarted. And it ends up that Salesforce was restarting this daemon with a script that would look something like: kill the daemon, wait a little while, then start the daemon back up.

And so, a few issues with this, but at the end of the day, the kill signal being sent to the daemon was just a TERM signal, a kind of soft kill, asking the daemon to clean up and shut down. It was a suggestion to be killed, as opposed to a kill -9, which tells the operating system to immediately terminate that process without giving it an opportunity to terminate itself.

And the issue is that this simple script actually had a race condition built into it: if the sleep wasn't long enough, the script would try to start named back up, and named would attempt to create a PID file, which is a file that holds the process ID. If that PID file still existed, named would just immediately terminate and not start up. So what had happened is they ran the script, the script sent the please-die signal to the daemon, and the daemon worked on dying. But the issue that was critical in this particular instance is that they hadn't really used this restart script that often under peak load, and because named was really busy servicing name resolutions, it took a little longer to shut down and to remove the PID file. Their script just didn't wait long enough, so when it tried to restart, named did not start, the script just exited, and that was the end of it. And because this was a racy condition, it did not happen on every DNS server in their clusters, but they say it did happen on many or most of them, therefore rendering DNS basically inoperable in all of their data centers.
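The write-up doesn't include the script itself, so the following is a hypothetical reconstruction of the pattern described above; the PID file path, timeout, and commands are made up for illustration.

```python
# Hypothetical reconstruction of the racy restart pattern described above.
# The PID file path, sleep duration, and commands are illustrative only.
import os
import signal
import subprocess
import time

PID_FILE = "/run/named/named.pid"   # assumed location; varies by system
SLEEP_SECONDS = 5                   # a fixed wait that can be too short under load

def racy_restart():
    # 1. Ask named to shut down gracefully (SIGTERM, the "suggestion to die").
    with open(PID_FILE) as f:
        os.kill(int(f.read().strip()), signal.SIGTERM)

    # 2. Sleep and hope the daemon finished cleaning up. Under heavy query
    #    load the shutdown takes longer, and the PID file is still on disk
    #    when we move on.
    time.sleep(SLEEP_SECONDS)

    # 3. Start named again. If the old PID file still exists, named exits
    #    immediately instead of starting, and this script never notices --
    #    which is the race that left so many servers without DNS.
    subprocess.run(["named", "-u", "named"], check=False)

def safer_restart():
    # A more defensive variant: wait for the old process to actually go
    # away, escalating to SIGKILL if the graceful shutdown stalls.
    with open(PID_FILE) as f:
        pid = int(f.read().strip())
    os.kill(pid, signal.SIGTERM)
    deadline = time.time() + 60
    while os.path.exists(PID_FILE):
        if time.time() > deadline:
            try:
                os.kill(pid, signal.SIGKILL)   # hard stop
            except ProcessLookupError:
                pass                           # already gone; PID file is just stale
            os.remove(PID_FILE)
            break
        time.sleep(0.5)
    subprocess.run(["named", "-u", "named"], check=True)
```

Even the safer version is doing work a process supervisor would normally do for you, which is where the conversation goes next.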

Tom: [00:16:07] So maybe there was some kind of thundering herd unleashed on the other servers…

Jamie: [00:16:12] Of crawfish! I've got Adobe Illustrator fired up. I'm working on that logo, Tom. We're going to get those crawfish one day or another. So that's kind of what happens, right?

We had this change; they attempted to restart named. The script trying to restart it was not the most robust thing in the world, and when this race condition happens, you end up without a DNS server running.

Tom: [00:16:41] Wow. Yeah. Well, that sucked. Losing DNS is just awful. Can you even DRT (run a disaster recovery test on) losing DNS? Would that be sane? That would be pretty dangerous, I think.

Jamie: [00:16:55] What do you do? You turn, you face the nearest wall, and you just run as fast as you can into it. That's how you DRT losing DNS.

Tom: [00:17:04] Then you will have the proper mental framework to understand.

Jamie: [00:17:07] You’ll just say, “Okay, I never want that to happen again.”

Tom: [00:17:11] So just a super brief summary. Routine configuration change rollout exposed a race condition in the script they had managing named and that knocked out their DNS. And because that’s sort of like losing gravity or losing just a really fundamental part of your existence, it took them a long time to get it back up and running.

Jamie: [00:17:35] Sounds right. Rough day, rough day, rough day. All right, Tom. Well, let’s start with the positives here. So what stood out as going well here to you? 

Tom: [00:17:48] Well, this post-mortem is fascinating. We'll talk about some of our criticisms of it later, but it has a lot of stuff in it. It has a lot of text and a lot of action items that they are working on. Clearly somebody is putting a lot of time into this communication, so kudos for that.

Jamie: [00:18:11] It looks like they’re actively updating it as they go here too. 

Tom: [00:18:15] I do like that. I mean, they published something early, and they are continually updating it as they find new stuff out. And that's good. We'll talk about this a little bit more later, but we do postmortems to help establish trust, and seeing a lot of communication about this is pretty good. Second, you can see they have a lot of good pieces of response in here. They have this break glass thing already built. You want companies to have the break glass stuff, because it shows that they're actually taking security seriously. You don't need break glass if every engineer can SSH to every server and just do whatever they want; you might end up with a lot of other problems if you do that. But if a company has built break glass procedures, that means they've actually locked things down to the point where people need to use them. And that's certainly something you want to see from a company like Salesforce.

Jamie: [00:19:16] There were definitely some good primitives that they had worked on. So the break glass, the EBF (emergency break fix) process, which we're going to talk about a little bit more… Even the fact that they failed closed as hard as they did implied that they were taking security seriously and stuff like that. Well, maybe we should dig now into things that could have gone better here.

When we read through this, what jumps out at you first? 

Tom: [00:19:54] So let’s start with the more concrete, and we can work our way up to the more abstract. But clearly the place to start is just this race condition for restarting named. There’s just a bunch of problems with this. First, process supervisors are a thing. This is not really a new problem in Linux. There are pretty established ways of being able to tell Linux that this thing is important, that I want you to keep it running for me at all times. You should not have to write your own script to restart named. You should have just had systemd just say, hey, we’ve changed the config file, restart this thing. You would get a lot of stuff along with that. It would retry; it would shut it down hard. It would hopefully be tied in with all your other alerting, and you could find out if named is flapping or restarting continually or something like that. 

Jamie: [00:20:55] So absolutely. Having named running under some kind of supervision framework, whether it's systemd or something else, would have prevented a lot of issues here, because even though we're talking about a configuration change and an explicit restart, named could have crashed for some other reason. Something else could have gone wrong that caused named to crash. And there's subtlety even in these details, like the PID file having to be deleted, or what happens if it doesn't respond to the TERM signal, or what happens if it crashes hard versus being asked to shut down gracefully. The kind of infrastructure built into something like systemd is designed to handle all of those cases. Systemd definitely gets criticized, probably somewhat fairly and a little unfairly, for being fairly complex. But a lot of that complexity is there because if you really want this to work all the time, specifying the semantics of a service starting and stopping safely has some moving pieces to it, and it's better to have the supervisor be pretty robust and complete about its definition of what running means and what stopping means and so on.
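To make the supervision idea concrete, here is a toy sketch of what a supervisor fundamentally does. In practice you would let systemd (or another supervisor) do this rather than rolling your own loop, and the flags here are only illustrative.

```python
# Toy sketch of what a process supervisor provides: keep the service
# running, restart it when it exits for any reason, and escalate from a
# graceful stop to a hard kill instead of sleeping and hoping. systemd
# gives you all of this (plus logging, backoff, and alerting hooks) from
# a few lines of configuration.
import signal
import subprocess
import time

COMMAND = ["named", "-f", "-u", "named"]   # -f keeps named in the foreground (illustrative)
STOP_TIMEOUT = 30                          # seconds to wait before a hard kill
RESTART_PAUSE = 2                          # pause between restarts so crash loops are visible

def supervise() -> None:
    while True:
        child = subprocess.Popen(COMMAND)
        code = child.wait()                # blocks until the process dies, however it dies
        print(f"named exited with status {code}; restarting in {RESTART_PAUSE}s")
        time.sleep(RESTART_PAUSE)

def stop(child: subprocess.Popen) -> None:
    # Graceful first, forceful second -- no fixed sleep, no race on a PID file.
    child.send_signal(signal.SIGTERM)
    try:
        child.wait(timeout=STOP_TIMEOUT)
    except subprocess.TimeoutExpired:
        child.kill()                       # SIGKILL; the supervise loop starts a fresh one
        child.wait()

if __name__ == "__main__":
    supervise()
```

A crash loop is also immediately visible in a setup like this, which is exactly the flapping-alert hook Tom mentions above.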

Tom: [00:22:09] I'm not going to get into the details of why named is the way it is, except that it's old; it's been around for a long time. But if you're running a service now, I'm not a huge fan of graceful shutdowns. You've got to plan for the non-graceful ones anyway, so maybe just collapse everything in that path and always kill -9 it. It's going to be a lot more reproducible. You're not going to have race conditions, because the process is going to be gone very quickly after you issue that command. And you should be able to start up anyway, even if you left resources in some weird state.

Jamie: [00:22:51] On the storage system at Dropbox, even when we wrote our own file system, basically, the rule always was there’s only kill nine. There’s nothing that’s not kill nine because the truth is, especially at scale, your stuff will crash hard. You have to be able to handle that. And so a lot of times people will talk about crash safety, and this kind of goes back to our whole thing about being able to restart fast. If possible, and it should be possible almost every time, just plan for failure and make your programs crash safe. And the best way to keep yourself honest about crash safety is to only kill nine. If for some reason that’s not okay to do, then that means you need to go back to the drawing board because that’s going to happen anyway at a time that you’re not anticipating. Your program will go down hard at some point, and that should be okay. 

Tom: [00:23:45] I will give you one obscure reason to support a clean shutdown, even though you should probably still always kill -9. One reason why I have used a TERM signal in the past to shut down cleanly is so I can check for memory leaks. I would have a process, run it under Valgrind, let it run for a while, then send it a signal to have it shut down, clean up everything it knows about, and print out what hasn't been freed. That was handy.

Jamie: [00:24:22] Here's a second reason; we immediately contradict ourselves. In certain proxy-type scenarios, if something is bound to a port, it's not a correctness issue, but it is a disruption issue, an availability issue. So you would ideally want some services to unbind the port and finish the outstanding requests before they die. That way you don't 500 some requests that you don't need to. So, there are some reasons. But you should definitely be comfortable with kill -9 and probably make that the default, unless you have a really strong reason not to.
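As a small sketch of the drain behavior Jamie is describing, here a standard-library HTTP server stands in for a real proxy: on SIGTERM, it stops accepting new connections, lets in-flight work finish, then releases the port. Everything here is illustrative.

```python
# Sketch of "unbind the port and finish the outstanding requests before
# dying": on SIGTERM, stop accepting new connections, let in-flight work
# complete, then release the socket.
import signal
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok\n")

server = HTTPServer(("0.0.0.0", 8080), Handler)

def drain(signum, frame):
    # shutdown() stops the accept loop but lets the request currently being
    # handled finish, so clients don't see aborted connections or needless
    # 500s. It has to be called from a different thread than serve_forever().
    threading.Thread(target=server.shutdown).start()

signal.signal(signal.SIGTERM, drain)
server.serve_forever()    # returns once drain() has run
server.server_close()     # finally unbind the port
```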

Tom: [00:25:04] The proxies thing is a good point. Yeah, that’s true. 

Jamie: [00:25:09] Well, what else, Tom? 

Tom: [00:25:11] This is like a perfect case study in why you should stagger your changes, whether you call that canarying or just an incremental rollout or whatever. If they had lost one DNS server, no one would have even noticed; that's just not a factor. This was a problem because they lost almost all of their DNS servers in a very small window of time. So if they had done this incrementally, they would have caught it earlier, and I think it would have been okay.

Jamie: [00:25:49] Yeah, it sounds like they did have some mechanisms and systems for doing staggering; there's some stuff in the write-up about it. I guess they were not used in this circumstance, which is kind of related to the larger conversation we want to have about this. But absolutely, in cases like this, if it had been staggered (and they definitely call this out in the post-mortem), you still would have had the problem, but it would have been localized to just the first data center, the canary or whatever, and it would have been less disruptive.

Tom: [00:26:25] Although if they’re not running named inside of a monitoring script, they might not have even noticed it was crashing. 
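Here is a rough sketch of what a staggered rollout with a health check between batches could look like. The server IPs, the deploy command, and the canary record are all hypothetical, and the per-server query uses the third-party dnspython package.

```python
# Hypothetical sketch of a staggered DNS config rollout: push to a small
# batch of name servers, verify each one still answers queries, and halt
# the moment a batch fails. Server IPs, the deploy command, and the canary
# record are made up; dnspython (pip install dnspython) does the check.
import subprocess
import sys

import dns.resolver

DNS_SERVERS = [f"10.0.0.{i}" for i in range(1, 13)]   # placeholder addresses
BATCH_SIZE = 2
CANARY_NAME = "login.example.internal"                # a record every server should answer

def push_config(server_ip: str) -> None:
    # Stand-in for "copy the new BIND config to this host and restart named".
    subprocess.run(["ssh", server_ip, "sudo", "deploy-named-config"], check=True)

def still_resolving(server_ip: str) -> bool:
    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = [server_ip]
    try:
        resolver.resolve(CANARY_NAME, "A", lifetime=5)
        return True
    except Exception:
        return False

for start in range(0, len(DNS_SERVERS), BATCH_SIZE):
    batch = DNS_SERVERS[start:start + BATCH_SIZE]
    for server_ip in batch:
        push_config(server_ip)
    if not all(still_resolving(server_ip) for server_ip in batch):
        print(f"batch {batch} stopped answering queries; halting rollout")
        sys.exit(1)
```

And as Tom points out, the check between batches only helps if something is actually watching whether named is still up.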

Jamie: [00:26:35] You're absolutely right. It's not necessarily something that went wrong, but it's something for all of us to be aware of as a risk with these kinds of outages: DNS is the perfect combination of things that will hurt you really badly when it goes down, for a couple of reasons. One is certainly that everything depends on it, and anything like that is already bad. A second is that DNS is a standard; we all use off-the-shelf software packages, mostly BIND. So we didn't write it, and we don't really know how it works, because in most cases it just works 99% of the time. So when it fails, not only is everything affected at the same time, but the system was always kind of opaque, because it always just kind of worked. And it ends up that there's a circular dependency built into it: DNS is usually necessary to access the tooling you'd use to fix DNS. Tom, you and I have both been involved in DNS-related incidents at previous companies, and they're always really painful because of this particular combination of characteristics: the criticality, the opacity, and the fact that you don't have good muscle memory for fixing it because it just happens to work most of the time. And then when it suddenly fails, your entire world is down.

Tom: [00:28:04] This outage is sort of fascinating because there’s no data corruption. There’s no hard to deal with load issues. There’s no hardware issues. The process just crashed. That’s it. DNS, such a pain.

Jamie: [00:28:27] Yeah, for sure. And I think the last topic we wanted to touch on, if we're being honest, is that we don't like this post-mortem very much.

Tom: [00:28:42] Yeah. It’s definitely an outlier for all the ones we’ve looked at. And this is our 10th one that we’ve gone through. It definitely feels a little bit different than all the other ones that we’ve looked at and not necessarily in a good way. 

Jamie: [00:28:57] One thing right out of the gate is that this postmortem focuses much more on the individual than the system. There is an engineer mentioned a couple of times: it says the engineer executing this change did not follow the aforementioned process, and that this process was not leveraged by the engineer making the change. Some of the best postmortems we've seen felt like they were written by a leader, and the only individual you'd identify in there is a leader taking responsibility, as opposed to the person that messed up. You can sense this nameless person that messed up throughout this postmortem, and we don't feel like that's the right way to write postmortems.

Tom: [00:29:49] It's complicated, but ultimately failures like this are not on individual engineers. Yes, they may have been the people that typed the stuff into the keyboard, but they are just one part of a big system, and ultimately some leader here is responsible. I would like to think that if I were the leader of this org, I would be the one taking responsibility for this outage, because I didn't build a system (where the system includes the hardware, the software, and the people) that was safe for doing the routine things.

And there are a lot of reasons why this was unsafe. First, they mentioned that the unnamed engineer used this emergency break fix process to push the change out. This is something that was supposed to be used only for sev zero, one, or two incidents, which are the top three severity levels for outages, but this person used it for part of a routine change. And they sort of stopped right there instead of asking why the engineer did that and why nobody else knew they were doing it.

Jamie: [00:31:17] And even more than that, a lot of the language around remediations here is process-oriented instead of system-oriented. It says extensive training is being expanded and refreshed to ensure the organization understands the processes and policies and the ramifications of non-compliance. So there's language saying, hey guys, try harder, this is a serious kind of thing, as opposed to addressing the systemic thing.

Tom and I both believe–and we think a lot of the best engineering organizations believe this–that systems are there so that humans are allowed to be flawed because we all are flawed. And it’s inherently human to be flawed. And so when we talk about more process and things like ramifications of noncompliance and stuff like that, I don’t think it turns the lens back on the leadership to say, well, why aren’t people following the rules. If you worked so hard to hire all these people, these talented people who are running under your banner, you should probably assume that they have good intentions and they’re trying to do their job. And so if they’re not following the rules, why aren’t they following the rules? Why aren’t the rules working for them? And that’s a leadership consideration. 

Tom: [00:32:28] If it’s too hard to do it the right way, then you probably need to invest in making that easier. If people are circumventing your rules in order to get their routine jobs done, then something’s broken. If you think you can just fix outages by making up rules, you’re going to have a bad day. Rules are very easy to make. People love making rules after outages because it’s so easy to make a rule. You just write it down. You think if everybody did this, we’d never have this problem again. So write it down, put it in the training, and we’re done. What’s hard after an outage is saying, oh, you know what, we’re not going to ship this feature we’ve committed to because what we’re going to do instead is migrate all of our servers to put named inside of a process supervisor. We’re going to fix this whole class of things. But that is really hard to do, and that, again, requires a leader to step up and change priorities, which is way harder than putting in a new policy or new rule. But that’s how you actually harden your system. You just don’t rely on people. Diligence is in short supply in the world, and giving people more rules doesn’t increase the amount of diligence that exists.

Jamie: [00:33:46] No. And in fact, it misaligns incentives between you and the people who are there just trying to get their job done. Versus if we say, hey, we’re going to create systems that support you. Now they’re excited about the improvements because the improvements make their jobs easier as opposed to like, just be more careful and we’re going to make a longer checklist or whatever. And next time, make sure you follow it. Otherwise, what. Or else. That’s an interesting dialogue to have with your talented engineering team you work so hard to recruit. 

The one thing we have to admit is that we're making some assumptions here: we don't know this was systemic. It may be that this one individual did this and it isn't happening routinely. But I would say, in our experience, there's still a way to talk about all these things as systemic. Let's just say it was an individual that didn't follow a clearly exceptional policy. Well, what you need then is monitoring that knows that was done and sends it to all the decision makers so that it's squashed immediately. It's like, oh, hey, this happened; we do not do that, or justify that usage, or whatever. You have to have the controls.

The opposite of that, which may be more likely here, is that the volume of people subverting the rules becomes high enough that people just get numb to it, and people are routinely using this because they can't get their jobs done with the efficiency they want. So you can either have a kind of volume that makes you numb to it, or, if it really was an individual doing something very exceptional, then the question is how you build controls to make sure that doesn't happen: a second approval or a third approval, or getting that sent to the right people. If you've put in the right kind of cultural pressures, those are still systemic pressures, and you have to make sure the automation is helping you maintain them, because these things usually either go to zero or they go to everywhere. They usually don't hover around a low number. You either push them to zero, or they will become common.

Tom: [00:35:55] Just to make that concrete, if you have something exceptional that engineers have to do on your team, whether that’s the break glass thing or this emergency break fix or SSH into some machine, or just something like that, that really they should not typically have to do, make that highly visible when it happens. Dump it in a slack channel, send an email to the team, make sure it gets flagged in some way so that you can talk about it at a reliability review or something like that. That way you can keep things exceptional. And if everybody is using this every day and the whole team sees it, at least you’ll know about it then. And you can start to have some conversations like, hey, why are you guys circumventing all this process I put into place?
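To make "dump it in a channel" concrete, here is a tiny sketch of announcing an exceptional action when it happens. The webhook URL and payload shape are placeholders, not any particular chat product's API.

```python
# Tiny sketch of making exceptional actions loud: whenever the emergency
# path is used, post an event where the whole team will see it. The
# webhook URL and payload format are placeholders.
import getpass
import json
import socket
from datetime import datetime, timezone
from urllib.request import Request, urlopen

AUDIT_WEBHOOK = "https://hooks.example.internal/break-glass"   # hypothetical endpoint

def announce_break_glass(reason: str) -> None:
    event = {
        "user": getpass.getuser(),
        "host": socket.gethostname(),
        "time": datetime.now(timezone.utc).isoformat(),
        "reason": reason,
    }
    body = json.dumps({"text": "BREAK GLASS used: " + json.dumps(event)}).encode()
    req = Request(AUDIT_WEBHOOK, data=body, headers={"Content-Type": "application/json"})
    urlopen(req, timeout=5)

# e.g. announce_break_glass("EBF restart of named during DNS outage")
# before touching anything with the emergency credentials.
```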

Jamie: [00:36:43] Which is still a leadership responsibility, right? You were just being the leader saying that thing you just said. I think a second, but related, thing is that this post-mortem has a different tone than some of the other postmortems. It feels a little bit more like it was written by a marketing team instead of an engineering team. For example, there's a lot of copy in there about a 2019 incident and all the things they have done since then to be good now, framing this as an exception. A lot of the content is devoted to talking about processes for things that were not related to this incident, with this incident presented as something that slipped through the cracks. So there's a little bit of a question about who the author and the audience of this post-mortem are. That's interesting to think about.

Tom: [00:37:35] So we’ve done a bunch of postmortems now that have all been written by different people, and it’s definitely interesting to look at the differences between them.

And to some extent that comes down to who the customers are for Salesforce. The customers for Github and AWS and Gitlab are very different from the customers for Salesforce. So it is understandable to me that you want to put a spin on this that makes you look good to nontechnical people. The problem is that it makes you look kind of bad to a lot of the people who might help fix this problem, a lot of the people you might want to hire, or the people who already work there. I don't think I'm out on a limb here to say that the tone of this, where it was just this one person's bad decision that resulted in all this... I don't think it's that simple.

Jamie: [00:38:37] No, it’s very unlikely to be true that that’s the case. Yeah, I agree. 

Tom: [00:38:41] Just to be really clear, this was not a malicious thing. This person wasn’t trying to sabotage the system. They were trying to roll out a DNS change, which is just part of the job. So if you had a postmortem or if you had an outage because a malicious employee deleted all your data or something, yes, it’s totally okay to say that happened. But this is just somebody doing their job.

Jamie: [00:39:07] And so yes, if it is still important to your company to attract and retain amazing engineering teams, it’s important to keep in mind that certainly your customers are one of the audiences of this, but so is the talent pool that both works for you and that you want to work for you.  

Well, Tom, I think that’s about all to say about this one. What do you think? 

Tom: [00:39:31] Yeah, I think that’s about it. This was definitely an interesting one to go through. 

Jamie: [00:39:44] Yeah, for sure. All right. Well, thanks everybody for listening to us as usual. And we’ll be back with a new episode soon. Don’t forget to share the show with people you know you think would like the show. We always appreciate when you do that. 

Tom: [00:39:59] I’m told the best way to share the show is by making a big post on LinkedIn about how much you like it. 

Jamie: [00:40:05] Make a big post, Tom says, make a big post on LinkedIn about how much you like it, everybody. So this is a clear, a clear directive to go out and spread the word. So, anyway, thanks again. Bye everyone.

Thanks for listening to the Downtime Project. You can subscribe to the podcast on Apple, Spotify, or wherever you listen to your podcasts. If you'd like to read a transcript of the show or leave a comment, visit us at downtimeproject.com. You can follow us on Twitter at @sevreview. And if you liked the show, we'd appreciate a five-star review.
