Tom was feeling under the weather after joining Team Pfizer last week, so today we have a special guest episode with Sujay Jayakar, Jamie’s co-founder and engineer extraordinaire.
While it’s great to respond well to an outage, it’s even better to design and test systems in such a way that outages don’t happen. As we saw in the Cloudflare outage episode from a few weeks ago, there can be very unexpected results from code if all possible variations haven’t been tested, which is hard for humans to do.
In this week’s episode, Jamie and Sujay talk about some of the ways to use automated tests to drive down the number of untested states in your programs.
Some links that came up during the discussion:
- Jason Warner (GitHub CTO) tweeted about what it was like behind the scenes during the big outage we discussed last week.
- How Dropbox decided to rewrite their sync engine and how they tested it
- The QuickCheck testing framework.
Jamie: [00:00:00] Welcome to the Downtime Project where we learn from the Internet’s most notable outages. I’m Jamie Turner, and Tom is out this week–so we’re doing a special interview episode with a guest, which is very exciting. But before we get to that, a little bit of housekeeping.
So, I want to really reiterate my appreciation, and Tom’s appreciation, for the support that we’ve been getting from listeners. Lots of good emails and reviews from folks. And we’ve had some great engagement from some of the leaders involved in some of these outages. John Graham-Cumming and Matthew Prince at Cloudflare both sort of chipped in on Twitter and left a comment about the Cloudflare issue we covered about regular expressions a few weeks ago.
And actually we just did a release on GitHub’s 2018 network partition database 24-hour outage from a few years ago. And Jason Warner, the CTO at GitHub, replied to our podcast with a series of tweets that sort of expanded on his own personal backstory during that outage.
As a spoiler, he was on a crater in Hawaii when the site went down. So go check out the full series of tweets on Twitter. We’ll have a link in the show notes to it, but it’s kind of interesting to hear his side of the story and the fact that he was trying to help manage this situation while trying to enjoy vacation. It sounded like a well deserved vacation with his family.
And one last thing Tom and I want to mention, most meaningful really to us, is we’ve been reading your reviews. We read them, and a few of the folks that have written in are people that are new to running services or new to software engineering in general and who have said they’re enthusiastically listening as part of their education about how to do these things. And we definitely want to share that it’s really some of the most positive feedback we could possibly get from doing this. I think if you’re in this industry long enough, you eventually know, and kind of accept that every line of code you ever write is going to get replaced or deleted one day.
But you know, the people that you’ve helped and the relationships that you’ve grown are really kind of the most durable, satisfying, value in the long run. So Tom and I are just really happy to hear the podcast might, in some small way, be helping some folks beef up their careers. So thank you for writing those reviews and yeah… keep listening. Hopefully it continues to be valuable for you.
So for today’s special guest, back in the Cloudflare outage, you might recall that we talked about how the core thing that contributed to that outage was CPU exhaustion. It was due to a runaway regular expression that went from linear to exponential. And we talked a little bit about fuzzing approaches and ways things like fuzzing can be brought to bear to help sniff out problems like that by, you know, having properties and generating inputs that surprise you when they do something unexpected.
So this week, while Tom’s out, I’m sitting down with someone who’s done a lot of really interesting work in this area. And I’m also really happy to say he’s one of my co-founders at my new startup. His name is Sujay Jayakar. So Sujay has been a principal engineer at Dropbox recently, and he was the tech lead of a project at Dropbox for about four years that was Dropbox’s new sync engine, which is called Nucleus. And it’s a system that has a really high quality bar, as we’ll hear more about from him.
And his team wrote a series of really excellent blog posts about the project and the testing approaches they used to harden it, including some things that involved these kinds of fuzzing or QuickCheck patterns, which we’ll talk about today. So we’ll put links in the show notes to those blog posts, and I encourage you guys all to read them because it’s a nice expansion on some of the stuff we’re talking about today. But I’m glad to have the opportunity here to dig in a little deeper into the project with him and learn more about the way they approached reliability using these kinds of methods. So Sujay, welcome to the show.
Sujay: [00:04:31] Thanks for having me.
Jamie: [00:04:33] Cool. So you know, given it’s slightly Dropbox-specific parlance, I think maybe you could tell us what a sync engine is, because we’re going to be talking today about Nucleus, which is a project where you guys rewrote the sync engine. Is that right?
Sujay: [00:04:46] Yeah, that’s it. So before we talk about a sync engine, we’ll take a step back and talk about Dropbox for all the listeners out there who haven’t used the product. So Dropbox is a service you can sign up for, and it creates a magic folder on your computer. Let’s say you have a desktop and a laptop. A lot of times, if you’re working on your desktop and creating some word document or something, before Dropbox, you would have to maybe email it to yourself or use a thumb drive to be able to see it on your laptop. And when you sign up for Dropbox, what it does is it creates this magic folder on your computer that automatically syncs between all your devices when they’re online. And the way it does that is by having a client software that you install on all of your computers, and that client software has a sync engine as one of its components.
And the sync engine is then the part of the system that’s responsible for reading and writing to this magic folder. So if you add a file on your desktop computer, the sync engine is the part that’s watching the local file system noticing when anything changes. Determining what type of change was made by the user, sending that change up to the server. And then on the laptop, the sync engine is also responsible for watching the remote file system on the server and replicating changes from there to the local file system.
Jamie: [00:06:23] So is it kind of like a service running on the customer’s computer? Is it a library? Like, what form does it take?
Sujay: [00:06:29] Yeah, it’s a service. So it’s a piece of software. It’s like a service that just sits in the background. And it’s passively in the background and watches changes that happen either locally or remotely.
Jamie: [00:06:41] Gotcha. Okay. So, then, in these blog posts you guys talk about Nucleus, and Nucleus is a replacement of the sync engine. Can you talk a little bit about the Nucleus project? Uh, what, what actually is Nucleus in the context of the sync engine?
Sujay: [00:06:54] Yeah, totally. So Dropbox got started around 2008, and syncing files is the core of the product. So we’ve had a sync engine from the very beginning and the sync engine grew and evolved from its beginnings to 2016, where Nucleus comes in. And after eight years of growth and going from being a startup prototype to servicing hundreds of millions of users and devices, the original sync engine, which we endearingly called “sync engine classic,” was starting to show some cracks, starting to creak under loads that it wasn’t designed for. So we decided in 2016 to undergo a rewrite of the old sync engine to fundamentally change its underlying data model, and tighten up a lot of the guarantees for what we could provide our users, and also to write it in a different language.
Jamie: [00:08:00] Got it. Yeah. And reading the blog posts you guys have written about it–cause there’s a couple of great blog posts we’re going to link in the show notes–but there’s a lot of emphasis on testing and quality when you guys were making this new sync engine.
Can you talk a little bit about why the quality bar is so high on this sync engine and why it’s so important to get it right?
Sujay: [00:08:20] Totally. That’s a good thing to talk about because I think it’s sometimes easy to take some of these things for granted. Like why is it so important for us to get sync right? And the first reason is just tied to the product. From the very beginning Dropbox has been the place where you put your most important stuff, and we keep it safe. So if people are putting their family photos–I think the CEO used to always bring up the line of like, when your house is burning down, what are the things you run for? You know, like your box of family photos. And we want people to keep those types of things in their Dropbox because they know it’s going to be safe forever. Even if the computer crashes or they lose a hard drive or whatever. And people also put their sensitive, private data in Dropbox, right? Things that make Dropbox like a very private space where it’s not just file hosting that’s accessible on the internet. So it’s important that we keep things safe and we don’t lose them. And we also don’t expose them to an audience that the user didn’t intend.
The second reason for why testing and validation and correctness are so critical for sync, it has to do with the nature of sync itself as a problem domain. And the kind of short answer here is that it is really, really hard to fix problems in sync when they go wrong. Because sync is persistent, it’s a very stateful type of system where we’re storing files and directory structure on users’ devices and on our remote file system. If we lose data, then that data might just be gone forever. And it’s really hard to remediate it. If there is some type of corruption where the data’s in an inconsistent state, we might have to go and traverse hundreds of millions of users’ directory structures and fix them up in a way that isn’t just pushing a new fixed version of the code. So remediations can be really hard.
Jamie: [00:10:33] Yep. That makes sense. Some things are kind of hard to take back, right? If you’ve made a certain decision on the customer’s computer. Yeah. It’s not, you don’t necessarily have a backup like you do on the server side, right?
Sujay: [00:10:49] Totally. And, also comparing it to the server side, all this code and these systems are running on users’ devices. These are out there in the wild, as opposed to thinking about doing remediations for a backend system, where you control all the servers. So if we messed up in some way, even if we understand really well how to do the remediation, actually going and running it on hundreds of millions of devices out there is a pretty big job.
And users can take their computers offline. We’ve seen plenty of examples of users like closing the lid on their laptop and then coming back in five years and expecting it to work. So it’s also just hard to know when these remediations are done.
Jamie: [00:11:37] Makes sense. Yeah. I mean, if I read the details about, I guess the testing, some of the testing philosophy you guys pursued, there were some things in there that were kind of unusual or unorthodox, or maybe a little bit more exotic than, let’s say, just unit tests and integration tests. So why, what’s so hard about it? I understand the importance of getting it right, but what’s so hard about getting this system right where you guys had to invest in unusual or, or just slightly more specialized testing methodologies.
Sujay: [00:12:07] Yeah, totally. I think when it comes to testing I like to think of it in terms of state spaces, in terms of like, what are all the possible states the program can be in and what types of statements are we trying to make. Like, if we want to say that the test passes on every single platform that we support, then the user could be on any of these platforms, and we’re making a statement that on every single platform, a particular property holds. And so one of the first difficulties of testing that’s really hard, specifically for the desktop client, is the large number of environments. So we support Windows, macOS, and Linux, and on each one, there’s all these different file systems, and they can be configured in different ways. The operating system environment can be different with kernel extensions and minifilter drivers on Windows. And being able to make strong statements about our program when it’s executing in all of these different environments requires us to actually go and test it in all of them.
Another way of looking at the problem from a state-based perspective is thinking about the distributed system of doing sync. And one of the things that makes the sync really difficult is that the system is extremely concurrent. So if you have many people in an organization who are collaborating on some files, there may be a thousand people in that organization. They may be using their laptops, going on airplanes or turning off wifi, going online, offline, making changes all the time. And the sync engine is responsible for synchronizing all of these concurrent actors in the system. So thinking about this from a distributed systems standpoint, network partitions are normal operation, and users expect to be able to go offline, make some changes, come back online, and reconcile them.
So thinking about this, then, from a state space perspective, there’s just so many possible states for the distributed system to get into and testing that all of the interleavings of this user goes offline, they make a change to this file, but then while they’re offline, another user deletes the parent directory, and then the user comes back online and they try to reconcile their changes–exploring all those options is a really involved affair. And it’s really hard to kind of think about this by yourself, sitting with pencil and paper.
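To make the size of that state space concrete, here is a small illustrative sketch (not from the episode or from Dropbox's code). It enumerates the orderings of the scenario described above, then counts how fast the number of orderings grows. The operation names and the `interleavings` helper are hypothetical, chosen only for illustration.

```python
from itertools import combinations
from math import comb

def interleavings(a, b):
    """Yield every merged ordering of a and b that keeps each list's own order."""
    n = len(a) + len(b)
    # Choose which slots in the merged sequence are taken by operations from `a`.
    for positions in combinations(range(n), len(a)):
        chosen = set(positions)
        ia, ib = iter(a), iter(b)
        yield [next(ia) if i in chosen else next(ib) for i in range(n)]

# Hypothetical operations for two users working concurrently.
offline_user = ["go offline", "edit docs/report.txt", "come online"]
other_user = ["delete docs/"]

orderings = list(interleavings(offline_user, other_user))
print(len(orderings))  # prints 4

# Two users with ten operations each already give C(20, 10) orderings.
print(comb(20, 10))  # prints 184756
```

Even this toy count ignores network failures, platforms, and file system quirks, which multiply the number of cases further.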
Jamie: [00:14:42] So if I try to restate that, if I tried to apply traditional testing methodology to this, where I anticipated all of the set of things I needed to test, there’s just so many things I would have to anticipate that like, it’s not reasonable for a human to enumerate them and then write the code? Okay, that makes sense. So this does sound a little familiar, right? Part of why we’re talking today was we talked a little bit about fuzzing and QuickCheck and stuff like that. And those philosophies feel like they are having the computer generate the test cases for you. I mean, is that how you would kind of summarize what the QuickCheck-type thinking is?
Sujay: [00:15:24] Yep. Yeah. I think the framing of test case generation is totally spot on. And the idea here is that on the first level, we can go and tell the computer, please run these functions, make sure they return true, or make sure they don’t panic, or they return success. And the computer can go and say, okay, I’m going to prove the property of the system that these functions can all execute. But then a tool like QuickCheck raises it up a level where, instead of saying very specific things about a program running and it succeeding, in QuickCheck you’ll instead express a property of the system. So the classic example, the Hello World of QuickCheck, is that if you reverse a list twice, you get the same list back, and that should be true universally across all lists and all types within those lists. So then the QuickCheck testing runtime will take those properties, and almost the way I think about it is that it’s writing a bunch of unit tests for you. It’s writing a unit test for all types of lists, and it may write unit tests you didn’t even think about, like, let me double-check that reversing the empty list and doing it again gives you the empty list back.
And of course this was a very simple property. All these things are obvious, but computers are just really good at this methodical exploring of all the different options. And it’s very easy, when you adapt all of your data structures and your program to this style of testing, to have the computer come up with scenarios that you wouldn’t have thought about.
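The reverse-twice property can be sketched in a few lines. This is a hand-rolled illustration using only Python's standard library, not the QuickCheck API itself; `check_property` and `random_list` are hypothetical helper names, and real QuickCheck-style tools also shrink failing inputs to a minimal counterexample, which this sketch omits.

```python
import random

def reverse(xs):
    """Return a new list with the elements of xs in reverse order."""
    return xs[::-1]

def random_list(rng, max_len=20):
    """Generate a list of random length (possibly empty) of small integers."""
    return [rng.randint(-100, 100) for _ in range(rng.randint(0, max_len))]

def check_property(prop, gen, seed=0, runs=1000):
    """Run prop on generated inputs; return the first counterexample, or None."""
    rng = random.Random(seed)
    for _ in range(runs):
        case = gen(rng)
        if not prop(case):
            return case
    return None

def reverse_twice_is_identity(xs):
    """The property: reversing a list twice gives back the original list."""
    return reverse(reverse(xs)) == xs

print(check_property(reverse_twice_is_identity, random_list))  # prints None
```

Each generated list plays the role of one unit test that nobody had to write by hand, including edge cases such as the empty list.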
Jamie: [00:17:07] Yeah. Some of that, I imagine, is admitting that as humans, we are biased, right, whether we want to be, or not, and computers are startlingly unbiased. So they just generate things we never would have thought of. One thing that kind of emerges out of that, when we think about that in terms of the podcast, is that a few weeks ago, when we talked about the Cloudflare outage, we were talking about regular expression libraries. And so, maybe one way to apply the QuickCheck-type thinking would be: hey, for all of the regular expressions in our system, if I give it an input of length 10, then later I give it an input of length 20, it should take around twice as long to run.
Right. Something like that is an example of potentially the kind of thing where I’m stating a property about, in this case, not the output, but like the execution. So is that one way to think about how, for example, that property could have found the issue in the regular expression that affected Cloudflare?
Sujay: [00:18:15] Yeah, totally. And yeah, instead of just asserting that the program completes successfully or that it doesn’t crash, you can assert anything, right? And this is maybe fast-forwarding a little bit to when we applied the QuickCheck philosophy to our system, to our core data structures: when you have, like, a local file system tree and a remote file system tree, how do you merge them and reconcile them for a final result?
The initial application of QuickCheck would just check that it had produced some output and that it didn’t crash. Right. And we got a surprising amount of mileage out of that. But the next step would then be to assert different properties about how that final result was generated and what the final result actually was. So, for example, in the case of syncing file system trees, if you just end up deleting everything without crashing, that is a valid–that is an outcome, it’s an outcome that didn’t crash, but deleting everything isn’t the outcome that anyone wants. So we would be able to express a property that if a user hadn’t explicitly deleted the file, the system wouldn’t do that itself. And being able to design these properties to get maximum coverage of our system while remaining simple themselves–it’s kind of one of the arts in this domain.
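As a rough illustration of the "no deletes the user didn't ask for" property, here is a toy sketch. It assumes a drastically simplified model in which a file system tree is just a set of paths and `base` is the last state both sides agreed on. The `merge` function and every name here are hypothetical and far simpler than a real sync engine.

```python
import random

PATHS = ["a.txt", "b.txt", "docs/c.txt", "docs/d.txt", "photos/e.jpg"]

def merge(base, local, remote):
    """Toy three-way merge over sets of paths (hypothetical, for illustration)."""
    deleted = (base - local) | (base - remote)  # removed on either side
    added = (local - base) | (remote - base)    # created on either side
    return (base - deleted) | added

def delete_everything(base, local, remote):
    """A 'merge' that never crashes but loses all data."""
    return set()

def no_spurious_deletes(merge_fn, base, local, remote):
    """Property: anything missing from the result was explicitly deleted."""
    merged = merge_fn(base, local, remote)
    explicitly_deleted = (base - local) | (base - remote)
    missing = (base | local | remote) - merged
    return missing <= explicitly_deleted

def random_tree(rng):
    """Pick a random subset of the known paths."""
    return {p for p in PATHS if rng.random() < 0.5}

def find_counterexample(merge_fn, seed=0, runs=1000):
    """Return the first (base, local, remote) that violates the property."""
    rng = random.Random(seed)
    for _ in range(runs):
        case = (random_tree(rng), random_tree(rng), random_tree(rng))
        if not no_spurious_deletes(merge_fn, *case):
            return case
    return None

print(find_counterexample(merge))  # prints None
print(find_counterexample(delete_everything) is not None)  # prints True
```

A check that only asserts "did not crash" would accept `delete_everything`; the stronger property rejects it.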
Jamie: [00:19:43] That makes sense. But for you guys, though, you could state these properties, but you’re stating properties really that aren’t just about a library, right? They’re about kind of a distributed system that has a client component and server component. So like, another thing I do know from the QuickCheck side is that emergent discovery usually happens when you can just generate an absolute ton of test cases to find them. So how did you guys balance that–where you guys actually have a distributed system that talks over a network, but yet you guys want to be able to run like millions of test cases, or whatever, to find these edge cases where you have a bug in your system. How did you achieve that?
Sujay: [00:20:19] Yeah, it’s a really interesting problem. So the first ingredient that was absolutely critical to all of this was making it so that all of our sync engine logic was fully deterministic. And the way we did this while still having concurrency within the system is that we used futures in Rust, which are like user-scheduled cooperative threads, that would multiplex all of these operations that are happening concurrently onto a single thread. So if you’re syncing a hundred files at the same time, those hundred files can sync concurrently. But the actual program execution is all on a single thread. And with scheduling decisions and with the results from the network fully fixed, then the execution of the system would be deterministic. It’s a pretty tall order to write a system in this way. But the benefit from a testing perspective was enormous because going back to your question of when you’re simulating this distributed system that has a bunch of things happening concurrently, there’s non-determinism both in scheduling, but maybe even in the actual system itself. Like if you upload two things concurrently, it might be fine for them to happen in either order. Then writing it in this way allowed us to have a different test execution runtime that would run this deterministic code. And that test execution runtime would have a random number generator that was initialized with some seed. It could then control all of those scheduling decisions and try either ordering: say if two futures were outstanding, it could try scheduling one before the other, or the other way. It would make those decisions with respect to that pseudorandom seed. If there was a simulated network request, it could try injecting an error. It could also simulate the case where the network request succeeds, but then after it succeeds, the connection gets dropped or some error gets introduced.
And we could do all of these decisions with respect to the pseudo random number generator, and then still maintain determinism. So given that seed, the entire run of the test would be reproduced.
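The idea of a seed-driven test runtime can be sketched with Python generators standing in for Rust futures. This is an assumption-laden toy, not the Nucleus runtime: `upload`, `run_simulation`, and the error rate are all hypothetical. Every scheduling choice and every injected network error is drawn from one seeded random number generator, so a given seed always replays the same run.

```python
import random

def upload(name, log):
    """A cooperative task; each `yield` is a simulated network request."""
    log.append(f"{name}: start")
    ok = yield "network"
    if not ok:
        log.append(f"{name}: error, retrying")
        ok = yield "network"
    log.append(f"{name}: {'done' if ok else 'failed'}")

def run_simulation(seed, names=("a.txt", "b.txt", "c.txt"), error_rate=0.3):
    """Run all uploads on one thread, with every decision taken from `seed`."""
    rng = random.Random(seed)
    log = []
    # Each entry is (task, value to send when the task is next resumed).
    runnable = [(upload(name, log), None) for name in names]
    while runnable:
        # Scheduling decision: which outstanding task runs next.
        task, value = runnable.pop(rng.randrange(len(runnable)))
        try:
            task.send(value)  # resume until the next simulated request
        except StopIteration:
            continue          # task finished
        # Fault injection decision: does this request succeed?
        response = rng.random() >= error_rate
        runnable.append((task, response))
    return log

# The same seed reproduces the same interleaving and the same injected errors.
print(run_simulation(7) == run_simulation(7))  # prints True
```

When a randomized run fails, recording its seed is enough to replay the exact failing schedule under a debugger.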
Jamie: [00:22:47] I get some of that. I think one of the things I’m not quite getting right is that sometimes these things fail because a request times out or it takes a while. And so, wouldn’t that make your guys’ simulation testing too slow if you guys have things that are delay based like that, right? Like a timed out request, how do you keep the system getting through its simulations fast enough when you have these time-related things that occurred?
Sujay: [00:23:15] Yeah. That’s a really good question. And the answer is that we just had to bring time itself into our testing runtime. So we fully simulated time within our execution environment. So all of the code within the sync engine isn’t allowed to use the standard library’s time functions, like Instant::now. And it’s not allowed to sleep for a real amount of time. It always has to go through a runtime object that we control, and in production, when it’s running, that just uses the real standard time. But then in tests, we have a notion of virtual time, and the timer is part of the runtime. When someone wants to sleep for a particular amount of time, it registers that there’s a future that’s blocked on a particular timer. And then the test orchestrator, the thing that was doing all of those scheduling decisions, can also decide to move time forward. So it itself also controls the passage of time. So we can have a test that looks like it’s sleeping for five minutes, but the actual execution of the test is instantaneous because it blasts through everything and notices that there’s nothing to be done. And the only part of the system that’s blocked is waiting on a timer for five minutes. So it can just fast forward time.
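A minimal sketch of virtual time, again using Python generators as stand-ins for futures. The `Simulator` class and the two tasks are hypothetical and only illustrate the idea: a task sleeps by yielding a duration, and the runtime jumps its clock straight to the earliest pending timer instead of waiting.

```python
import heapq

class Simulator:
    """Runs generator tasks against a virtual clock (illustrative only)."""

    def __init__(self):
        self.now = 0.0
        self._timers = []  # heap of (wake_time, sequence, task)
        self._seq = 0      # tie-breaker so tasks are never compared

    def spawn(self, task):
        self._schedule(self.now, task)

    def _schedule(self, wake_time, task):
        heapq.heappush(self._timers, (wake_time, self._seq, task))
        self._seq += 1

    def run(self):
        while self._timers:
            wake_time, _, task = heapq.heappop(self._timers)
            self.now = wake_time  # fast-forward; no real sleeping happens
            try:
                delay = next(task)  # task yields how long it wants to sleep
            except StopIteration:
                continue
            self._schedule(self.now + delay, task)

def request_with_timeout(sim, events, timeout=300.0):
    events.append((sim.now, "request sent"))
    yield timeout  # "sleep" for five virtual minutes
    events.append((sim.now, "request timed out"))

def heartbeat(sim, events, interval=120.0, count=3):
    for _ in range(count):
        yield interval
        events.append((sim.now, "heartbeat"))

sim = Simulator()
events = []
sim.spawn(request_with_timeout(sim, events))
sim.spawn(heartbeat(sim, events))
sim.run()
print(sim.now)  # prints 360.0
```

The five-minute timeout still fires between the second and third heartbeats, so the ordering effects of the delay are preserved even though no wall-clock time is spent waiting.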
Jamie: [00:24:45] I see. So the time, the fact that everything is obeying this virtual clock means that you sort of are capturing the essential characteristic of that delay in that it changed the order things happened in. But it doesn’t actually make it wait five minutes. I see. Okay, cool. So whether we call all this stuff fuzzing or QuickCheck or simulation or whatever, it feels like there are a lot of valuable patterns in here. But I’m sure most of the folks listening to this podcast don’t have sync engines. They probably have something else. They have other libraries. They have server side services. So do you have any advice for what kind of patterns to keep an eye out for when simulation-type approaches like this are really high yield that folks listening should be aware of, so that they know when these kinds of methods might be useful?
Sujay: [00:25:39] Totally. So I would say that going back to what we were talking about earlier about tests as being statements about a state space of your program–if you think your program has a lot of states or a lot of paths through the code where when you’re writing your test, you’re going and you’re just manually enumerating a bunch of cases, you’re writing a unit test for each one–maybe you still write those unit tests, but it would be probably a good exercise to take a step back and think, how can I have a computer helping here? If I want to have a level of confidence in this code where I know that all of these paths have been checked, can I have the computer help me do that? And it’s like writing a tool, right? Like we spent a lot on it, we invested a lot of time in this framework, and it saved us tons of time in not having to write all these tests explicitly. But the upfront cost was there. So I think my advice would be that if you have a particularly critical piece of code or a system where there’s a bunch of things going on–maybe there is a lot of state or a lot of cases in the system–and you find yourself manually enumerating them, it’d be good to take a step back and think, is there a way I can almost like script this? Can I have the computer do it for me?
Jamie: [00:27:09] Got it. Okay. So it almost could be a discovery. Like if I find myself starting to accumulate a whole lot of tests where I’m trying to cover every new “oh, if this happens and then that, then I need an integration test for that,” that might be a moment to just pause and maybe examine, hey, is there another layer of this testing where something should be generating these cases for me? Is that one way? Totally. Okay.
Sujay: [00:27:36] And one thing I think is really important here is that when you accumulate a lot of these explicitly written tests, they are valuable in that the computer is automating work that you would otherwise be doing by just running the code and manually testing it. So the unit tests are like one level of automation. But those tests actually slow you down when you’re developing, right? If you want to change your code, the semantics of your code to do something else, then you have to go and update all the tests. And if those tests are redundant, if they’re all expressing the same property of the system, but testing multiple states of it or cases where that property holds, then maybe they could all be replaced by a single higher-level test that is doing something like QuickCheck, expressing this property over all the cases in just one test. And then if you ever change the semantics of the system, then you just have to update that one place.
Jamie: [00:28:30] Cool. That makes sense. From a maintainability perspective as well. That’s great. Well, you are officially our first guest on the podcast, so thank you so much for coming on.
Sujay: [00:28:42] Thanks for having me.
Jamie: [00:28:46] All right, everybody. Thank you so much for listening, and thank you again to Sujay for jumping on and talking to me today. Tom and I will be back next week with another outage to pore over with all of you.
Producer: [00:29:06] Thanks for listening to the Downtime Project. You can subscribe to the podcast on Apple, Spotify, or wherever you listen to your podcasts. If you’d like to read a transcript of the show or leave a comment, visit us at downtimeproject.com. You can follow us on Twitter @sevreview. And if you like the show, we’d appreciate a five-star review.