

Why are we so bad at software engineering?

XKCD comic 2030, follow attribution link below

From XKCD, released under the Creative Commons Attribution-NonCommercial 2.5 License

An app contributed to chaos at last week's 2020 Iowa Democratic caucuses. Hours after the caucuses opened, it became obvious that something had gone wrong. No results had been reported yet. Reports surfaced describing technical problems and inconsistencies. The Iowa Democratic Party released a statement declaring that it hadn't suffered a cyberattack, but instead had technical difficulties with an app.

A week later, we have a better understanding of what happened. A mobile app was written specifically for the caucus. It was distributed through beta testing programs instead of the major app stores, and users struggled to install it via this process. Once installed, it had a high risk of becoming unresponsive. Some caucus locations had no internet connectivity, rendering an internet-connected app useless. The party had a backup plan: use the same phone lines that the caucus had always used. But online trolls jammed those phone lines "for the lulz."

As Tweets containing the words "app" and "problems" made their rounds, software engineers started spreading the above XKCD comic. I did too. One line summarizes the comic (and the sentiment that I saw on Twitter): "I don't quite know how to put this, but our entire field is bad at what we do, and if you rely on us, everyone will die." Software engineers don't literally believe this. But it also rings true. What do we mean?

Here's what we mean: We're decent at building software when the consequences of failure are unimportant. The average piece of software is good enough that it's expected to work. Yet most software is bad enough that bugs don't surprise us. This is no accident. Many common practices in software engineering come from environments where failures can be retried and new features are lucrative. And failure truly is cheap. If any online service provided by the top 10 public companies by market capitalization were completely offline for two hours, it would be forgotten within a week. This premise is driven home in mantras like "Move fast and break things" and "launch and iterate."

And the rewards are tremendous. A small per-user gain is multiplied by millions (or billions!) of users at many web companies. This is lucrative for companies with consumer-facing apps or websites. Implementation is expensive but finite, and distribution is nearly free. The consumer software engineering industry reaches a tradeoff: we reduce our implementation velocity just enough to keep our defect rate low, but not any lower than it has to be.

I'll call this the "website economic model" of software development: When the rewards of implementation are high and the cost of retries is low, management sets incentives to optimize for a high short-term feature velocity. This is reflected in modern project management practices and their implementation, which I will discuss below.

But as I said earlier, "We're decent at building software when the consequences of failure are unimportant." This approach fails horribly when failure isn't cheap, like in Iowa. Common software engineering practices grew out of the website economic model, and when the assumptions of that model are violated, software engineers become bad at what we do.

How does software engineering work in web companies?

Let's imagine a hypothetical company: QwertyCo. It's a consumer-facing software company that earns $100 million in revenue per year. We can estimate the size of QwertyCo by comparing it to other companies. WP Engine, a WordPress hosting company, hit $100 million ARR in 2018. Blue Apron earned $667 million of revenue in 2018. So QwertyCo is a medium-sized company: it has between a few dozen and a few hundred engineers, and it is not public.

First, let's look at the economics of project management at QwertyCo. Executives have learned that you can't decree a feature into existence immediately. There are tradeoffs between software quality, time given, and implementation speed.

How much does software quality matter to them? Not much. If QwertyCo's website were down for 24 hours a year, they'd expect to lose $273,972 in total (assuming that revenue correlates linearly with uptime). And anecdotally, the site is often down for 15 minutes and nobody seems to care. If a feature takes the site down, they roll the feature back and try again later. Retries are cheap.
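The back-of-envelope math can be sketched directly (assuming, as above, that revenue is spread evenly across the year; QwertyCo and its numbers are hypothetical):

```python
# Back-of-envelope cost of downtime for the hypothetical QwertyCo,
# assuming revenue accrues evenly across the year.
ANNUAL_REVENUE = 100_000_000  # dollars per year
HOURS_PER_YEAR = 365 * 24

def downtime_cost(hours_down: float) -> float:
    """Expected revenue lost while the site is down for `hours_down` hours."""
    return ANNUAL_REVENUE * hours_down / HOURS_PER_YEAR

print(round(downtime_cost(24)))    # a full day down: 273973 dollars
print(round(downtime_cost(0.25)))  # a 15-minute outage: 2854 dollars
```

A 15-minute outage costing under $3,000 is exactly why "nobody seems to care."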

How valuable is a new feature to QwertyCo? Based on my own observation, one engineer-month can change an optimized site's revenue by somewhere in the ballpark of -2% to 1%. That's a monthly chance at $1 million of incremental QwertyCo revenue per engineer. Techniques like A/B testing even mitigate the mistakes: within a few weeks, you can detect negative or neutral changes and delete those features. The bad features don't cost a lot - they last a finite amount of time, and the wins are forever. Even small win rates are lucrative for QwertyCo.
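The detection step can be sketched as a two-proportion z-test over conversion counts; the traffic numbers and conversion rates below are invented for illustration:

```python
# Sketch: detecting a negative feature in an A/B test with a
# two-proportion z-test. All numbers here are made up.
from statistics import NormalDist

def two_proportion_z(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """z-statistic comparing conversion rates of control (A) and variant (B)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
    return (p_b - p_a) / se

# Control converts at 5.0%, the new feature at 4.6%, over a week of traffic.
z = two_proportion_z(5000, 100_000, 4600, 100_000)
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
if z < 0 and p_value < 0.05:
    print("feature looks negative; delete it and keep the engineering lesson")
```

With enough traffic, even a 0.4-point drop is unambiguous within days, which is what makes "launch and iterate" cheap for QwertyCo.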

Considering the downside and upside, when should QwertyCo launch a feature? The economics suggest that features should launch even if they're high risk, as long as they occasionally produce revenue wins. Accordingly, every project turns into an optimization game: "How much can be implemented by $date?", "How long does it take to implement $big_project? What if we took out X? What if we took out X and Y? Is there any way that we can make $this_part take less time?"

Now let's examine a software project from the software engineer's perspective.

The software engineer's primary commodity is time. Safe software engineering takes a lot of time. Once a project crosses a small complexity threshold, it will have many stages (even if they don't happen as part of an explicit process). It needs to be scoped with the help of a designer or product manager, converted into a technical design or plan if necessary, and divided into subtasks if necessary. Then the code is written with tests, the code is reviewed, stats are logged and integrated with dashboards and alerting if necessary, and manual testing is performed if necessary. Additionally, coding often has an up-front cost: refactoring, i.e., modifying the existing system to make it easier to implement the new feature. Coding itself could take as little as 10-30% of the time required to implement a "small" feature.

How do engineers lose time? System-wide failures are the most obvious way. Site downtime is an all-hands-on-deck situation. The most knowledgeable engineers stop what they are doing to make the site operational again. But time spent firefighting is time not spent adding value. Their projects are now behind schedule, which reflects poorly on them. How can downtime be mitigated? Written tests, monitoring, alerting, and manual testing all reduce the risk that these catastrophic events will happen.

How else do engineers lose time? Through subtle bugs. Some bugs are serious but uncommon. Maybe users lose data if they perform a rare set of actions. When an engineer receives this bug report, they must stop everything and fix the bug. This detracts from their current project, and can be a significant penalty over time.

Accordingly, experienced software engineers become bullish on code quality. They want to validate that code is correct. This is why engineering organizations adopt practices that, on their face, slow down development speed: code review, continuous integration, observability and monitoring, etc. Errors are more expensive the later they are caught, so engineers invest heavily in catching errors early. They also focus on refactorings that make implementation simpler. Simpler implementations are less likely to have bugs.

Thus, management and engineering have opposing perspectives on quality. Management wants the error rate to be merely tolerable (low enough, but no lower than it has to be), while engineers want the error rate to be low.

How does this feed into project management? Product and engineering split projects into small tasks that encompass the whole project. The project length is a function of the number of tasks and the number of engineers. Most commonly, the project will take too long, and it is adjusted by removing features. Then the engineers implement the tasks. Task implementation is often done inside of a "sprint." If the sprint time is two weeks, then every task has an implicit two-week timer. Yet tasks often take longer than expected, so engineers make tough prioritization decisions to finish on time: "I can get this done by the end of the sprint if I write basic tests, and if I skip this refactoring I was planning." The sprint process applies constant downward pressure on time spent, which means that the engineer can either compromise on quality or admit failure in the sprint planning meeting.

Some will say that I'm being too hard on the sprint process, and they're right. The real culprit is time-boxed incentives; the sprint process is just a convenient way to apply time pressure multiple times: once when scoping the entire project, and once for each task. If the product organization is judged by how much value it adds to the company, it will naturally negotiate implementation time with engineers without any extra prodding from management. Engineers are also incentivized to implement quickly, but they might try optimizing for the long term instead of the short term. This is why multiple organizations are often given incentives to increase short-term velocity.

So by setting the proper incentive structure, executives get what they wanted at the beginning: they can name a feature and a future date, and product and engineering will naturally negotiate what is necessary to make it happen. "I want you to implement account-free checkouts within 2 months." And product and engineering will write out all of the 2 week tasks, and pare down the list until they can launch something called "account-free checkouts." It will have a moderate risk of breaking, and will likely undergo a few iterations before it's mature. But the breakage is temporary, and the feature is forever.

What happens if the assumptions of the website economic model are violated?

As I said before, "We're decent at building software when the consequences of failure are unimportant." The "launch and iterate" and "move fast and break things" slogans point to this assumption. But we can all imagine situations where a do-over is expensive or impossible. At the extreme end, a building collapse could kill thousands of people and cause billions of dollars in damage. The 2020 Iowa Democratic Caucus is a milder example. If the caucus fails, everyone will go home at the end of the day. But a party can't run a caucus a second time… not without burning lots of time, money, and goodwill.

Quick note: In this section, I'm going to use "high-risk" as a shorthand for "situations without do-overs" and "situations with expensive do-overs."

What happens when the website economic model is applied to a high-risk situation? Let's pick an example completely at random: you are writing an app for reporting Iowa Caucus results. What steps will you take to write, test, and validate the app?

First, the engineering logistics: you must write both an Android app and an iPhone app. Reporting is a central requirement, so a server is necessary. The convoluted caucus rules must be coded into both the client and the server. The system must report results to an end-user; this is yet another interface that you must code. The Democratic Party probably has validation and reporting requirements that you must write into the app. Also, it'd be really bad if the server went down during the caucus, so you need to write some kind of observability into the system.

Next, how would you validate the app? One option is user testing. You would show hypothetical images of the app to potential users and ask them questions like, "What do you think this screen allows you to do?" and "If you wanted to accomplish $a_thing, what would you tap?" Design always requires iteration, so you can expect several rounds of user testing before your mockups reflect a high-quality app. Big companies often perform several rounds of testing before implementing large features. Sometimes they cancel features based on this feedback, before they ever write a line of code. User testing is cheap. How hard is it to find 5 people to answer questions for 15 minutes for a $5 gift card? The only trick is finding users who are representative of Iowa Caucus volunteers.

Next, you need to verify the end-to-end experience: The app must be installed and set up. The Democratic Party must understand how to retrieve the results. A backup plan will be required in case the app fails. A good test might involve holding a "practice caucus" where a few Iowa Democratic Party operatives download the app and report results on a given date. This can uncover systemic problems or help set expectations. This could also be done in stages as parts of the product are implemented.

Next, the Internet is filled with bad actors. For instance, Russian groups famously ran a disinformation campaign across social media sites like Facebook, Reddit, and Twitter. You will need to ensure that they cannot interfere with the caucus. Can you verify that the results you receive are from Iowa caucusgoers? Also, the Internet is filled with people who will lie and cause damage just for the lulz. Can the system withstand denial-of-service attacks? If it can't, do you have a fallback plan? Who is responsible for declaring that the fallback plan is in effect and communicating that to the caucus sites? What happens if individuals hack into the accounts of caucusgoers? If the company has no in-house security experts, it's plausible that an app that runs a caucus or election should undergo a full third-party security review.

Next, how do you ensure that there isn't a bug in the software that misreports or misaggregates the results? Relatedly, the Democratic Party should also be suspicious of you: can the Democratic Party be confident of the results even if your company has a bad actor? The results should be auditable with paper backups.

Ok, let's stop enumerating issues. You will need a lot of time and resources to validate that this system works.

The maker of the Iowa Caucus app was given $60,000 and 2 months. They had four engineers. $60k doesn't cover salary and benefits for four engineers for two months, especially on top of any business expenses. Money cannot be traded for time. There is little or no outside help.

Let's imagine that you apply the common practice of removing and scoping-down tasks until your timeline makes sense. You will do everything possible to save time. App review frequently takes less than a day, but worst-case it can take a week or be rejected. So let's skip that: the caucus staff will need to download the app through beta testing links. Even if the security review was free, it would take too long to implement all of their recommendations. You're not doing a security review. Maybe you pay a designer $1000 to make app mockups and a logo while you build the server. You will plan to do one round of user testing (and later skip it once engineering timelines slip). Launch and iterate! You can always fix it before the next caucus.

And coding always takes longer than you expect! You will run into roadblocks. First, the caucus' rules will have ambiguities. This always happens when applying a digital solution to an analog world: the real world can handle ambiguity and inconsistency, and the digital world cannot. The caucus may issue rule clarifications in response to your questions. This will delay you. The caucus might also change its rules at the last second, forcing you to change your app very close to the deadline. Next, there are multiple developers, so there will be coordination overhead. Is every coder 100% comfortable with both mobile and server development? Is everyone fully fluent in React Native? JS? TypeScript? Client-server communication? The exact frameworks and libraries that you picked? Every "no" will add development time to account for coordination and learning. Is everyone comfortable with the test frameworks that you are using? Just kidding. A few tests were written in the beginning, but the app changed so quickly that the tests were deleted.

Time waits for no one. 2 months are up, and you crash across the finish line in flames.

In the website economic model, crashing across the finish line in flames is good. After all, the flames don't matter, and you crossed the finish line! You can fix your problems within a few weeks and then move to the next project.

But flames matter in the Iowa caucus. As the evening wears on, the Iowa Democratic Party is fielding calls from people complaining about the app. You get results that are impossible or double-reported. Soon, software engineers are gleefully sharing comics and declaring that the Iowa Caucus never should have paid for an app, and that paper is the only technology that can be trusted for voting.

What did we learn?

This essay helped me develop a personal takeaway: I need to formalize the cost of a redo when planning a project. I've handled this intuitively in the past, but it should be explicit. This formalization makes it easier to determine which tasks cannot be compromised on. It matches my past behavior: I used to work in mobile robotics, where implementation cycles were long and the damage of failure could be high. We spent a lot of time adding observability and building foolproof ways to throttle and terminate out-of-control systems. I've also worked on consumer websites for a decade, where the consequences of failure are lower. There I've been more willing to take on short-term debt and push forward in the face of temporary failure, especially when rollback is cheap and data loss isn't likely. After all, I'm incentivized to do this. Our industry also has techniques for teasing out these questions. "Premortems" are one example. I should do more of those.

On the positive side, some people outside of the software engineering profession will learn that sometimes software projects go really badly. People funding political process app development will be asking, "How do we know this won't turn into an Iowa Caucus situation?" for a few years. They might stumble upon some of the literature that attempts to teach non-engineers how to hire engineers. For example, the Department of Defense has a guide called "Detecting Agile BS" (PDF warning) that gives non-engineers tools for detecting red flags when negotiating a contract. Startup forums are filled with non-technical founders who ask for (and receive) advice on hiring engineers.

The software engineering industry learned nothing. The Iowa Caucus gives our industry an opportunity. We could be examining how the assumption of "expensive failure" should change our underlying processes. We will not take this opportunity, and we will not grow from it. The consumer-facing software engineering industry doesn't respond to the risk of failure. In fact, we celebrate our plan to fail. If the outside world is interested in increasing our code quality in specific domains, it should regulate those domains. This wouldn't be unprecedented: HIPAA and Sarbanes-Oxley are examples of regulations that affect engineering at website-economic-model companies. Regulation is insufficient, but it may be necessary.

But, yeah. That's what we mean when we say, "I don't quite know how to put this, but our entire field is bad at what we do, and if you rely on us, everyone will die." Our industry's mindset grew in an environment where failure is cheap and we are incentivized to move quickly. Our processes are poorly applied when the cost of a redo is high or a redo is impossible.

2019 year in review

Here is my 2019 year in review! I lost weight, struggled with sleep, grew professionally, struggled with some personal challenges, took some much-needed vacations, and worked hard on side projects.

Health

I lost over 20 pounds in 2019. I started the year at 223.2 lbs and ended at 201.0 lbs, dropping as low as 189. But I achieved this with a strict diet. I ate mostly fruits, vegetables, cheese, nuts, and lean meats. These are delicious, but I missed the satisfaction of eating a bagel or a slice of pizza every now and then. When I started enjoying food again, I slowly gained 10 pounds back.

I'm looking for a sustainable solution. At the moment, I'm trying 16/8 intermittent fasting with calorie counting. I have failed to do each of these individually. Unrestricted intermittent fasting didn't limit my calories enough. I also feel hungry when I count calories across a full day. But I'm optimistic about the combination. I can eat a huge breakfast and still have enough calorie budget for a healthy and filling dinner. Alternatively, I can stick to salads, nuts, and cheeses in the afternoon so that I can enjoy myself later.

"Memoriae Majorum" at the entrance to the ossuary of the Paris Catacombs

The entrance to the ossuary part of the Paris Catacombs

Sleep was rough. I had a prolonged bout of insomnia that stretched from February into July. I would fall asleep easily at midnight, then wake up between 4:30 and 5:30. Insomnia sucks. It affects everything. I produced less at work, I was meaner, I was less happy, and my chores fell by the wayside. I saw a sleep therapist and she provided a list of recommendations: wear orange-colored goggles at night to limit blue light exposure. Limit caffeine. Get as much sunlight as possible before 8:30am. Have a consistent sleep schedule. Don't eat, exercise, or shower within 3 hours of going to sleep.

I applied these techniques and shifted my wake time from 8 to 7, and this helped. I still wake in the middle of the night. But I can go back to sleep now. I can be less militant about sleep hygiene by picking my battles. I limit caffeine, keep a strict schedule, and go for a walk before 8:00 every morning. The other changes didn't matter as much.

This required sacrifices. My friends prefer hanging out late. For years I had considered moving my sleep schedule earlier. But I can't "just make it up" on the weekends. The opposite is true - staying out once or twice a week can ruin a whole week of regular sleep for me. So I always tended to keep my schedule later to match that of my friends. But the severity of my insomnia bout made me reconsider this. So I leave our plans early on a weekly basis, which is awkward. I'm not sure how this will play out in the long run.

Career

I had a productive year.

My time was split between three things: API stabilization, API support, and GraphQL.

I spent the year working on the API platform team. We worked with a more senior engineer who had helped write broad swaths of our product stack. He successfully pitched the existence of the API platform team, and spent time ramping us up. This was mostly positive. He used his experience to select high-impact projects that helped me learn. We spent a bunch of time cleaning up the user experience and the code, and we ended up with a more stable, cleaner, and faster API. His feedback was always invaluable, and he also made a major contribution to the internals during this time. On the other hand, the downsides to working under him were unusual. He wasn't officially on the team, but he wasn't NOT on the team. This created lots of coordination problems. I did discuss this with my manager, but in retrospect I could have handled the situation better by getting everyone to agree to fill out a RACI matrix (or similar). He left Etsy halfway through the year. By that point we knew enough to stand on our own legs, and the team had a fairly successful year.

Supporting the internal API took lots of my team's time. Etsy is a PHP shop, and PHP is single-threaded, so an easy way to parallelize computation is to fan I/O out across concurrent cURL requests. This system is called the "internal API" at Etsy. Hundreds of engineers use the internal API, which means that lots of people need help. This version of the API has lasted 5+ years and has advanced users throughout the company, so the problems we hear about are often complex multi-system problems. It's amazing how many different ways the same metaproblem emerges: "This code was written on the assumption that it would only run on one machine, and now it's running on two."

I spent the year planning and prototyping GraphQL at Etsy. This was more solitary than I would have liked, but it was driven by necessity. Our team's surface area is broad: the whole company runs on APIs, and there were just 3 of us for much of the year. The only way we could move forward was divide-and-conquer. A second engineer, Kaley Sullivan, worked on it once I started prototyping. She kicked ass - she is a powerhouse engineer and gave me some great feedback on designs. The prototype that we wrote was a success, and we're moving forward with introducing it at the beginning of next year. My challenge next year will be growing and evolving GraphQL at Etsy so that it doesn't rely on me anymore.

The ferry building on the water in San Francisco.

Jet lag advantage: 6:15 AM run through the touristy part of SF. Peaceful and quiet.

I was promoted to Staff Engineer at the beginning of the year. This was an important milestone for me to hit. I didn't care about the title, but I wanted to prove to myself that I was growing as an engineer. A major reason I joined Etsy was the mission - lots of sellers from all around the world making money. It's far more palatable than padding a tech billionaire's pockets by making $cog a little more efficient. Reaching Staff Engineer meant that I'd been growing my engineering effectiveness, which means that I'd been more effective at helping our sellers. I now have access to opportunities that I didn't have as a senior engineer: I emceed our engineering all-hands, I'm leading a working group, and I am in a regular meeting with directors and the CTO. It almost feels like a virtuous cycle: once I reached a certain threshold, it became easier to keep getting high-impact or high-visibility work. In 2020, I'd like to use my position to effect positive change - which I did in 2019, but not enough.

I attended a software engineering conference for the first time: GraphQL Summit in San Francisco. I was pleasantly surprised by how useful it was. I had already done a deep dive on industry writing about GraphQL, so reading the presentations was helpful, but some of the most valuable lessons came from the smaller talks. One fact amazed me: literally half of the attendees didn't even use GraphQL yet. Their companies sent engineers to the conference to learn more. It stands to reason: of course companies would speculatively spend thousands of dollars to avoid mistakes that would cost them millions. But it hadn't occurred to me that I could seek out these opportunities myself, and I don't have a lot of more-senior mentors in my life who can point me in these directions.

Personal life

My year had some ups and downs.

My dog Rupert running through a field with his mouth open and smiling.

Derp on, you crazy diamond

I had a dog at the beginning of the year, Rupert. I don't have him anymore. The rescue gave him to me because he was high-energy, but I didn't fully understand what this meant. He could run for an hour in the morning and be back to 100% at the end of the day. I also didn't understand that active dogs have active minds. He'd get bored and destructive when I left him alone for a normal work day. He destroyed thousands of dollars worth of stuff in my apartment, including some things that were irreplaceable. He loved daycare, but he got depressed and distressed if I took him too many days in a row. I worked with a trainer a bunch, and at the first lesson she warned me, "Most single owners who have dogs this energetic eventually give them away. I often advise people to do this. It's okay if you do." Eventually, I came to the conclusion that she was right and that he would be better off with another owner. I returned him to the rescue, which found him another loving family. This is very hard to write now. I miss him. But it was unsustainable. One of my close friends was very unsupportive of me during this time, which damaged our friendship. I regret this.

My girlfriend and I went on some trips together: a long weekend in Montreal, a week in Vancouver, and 2 weeks in France. I hadn't traveled internationally in a few years, and it was fun to go on an adventure together. It was also the first time in a long time that I managed to fully disconnect from work. The stories that we collected together were wonderful: the meals we had, the places we stayed, the time that we were defeated by the rocky shores of Nice, the time I doused myself in diesel fuel in Avignon, the delicious fruit varieties we'd never seen before. It's nice to share that with someone.

My girlfriend and I suited up to go whale watching in Vancouver, BC

Suited up for whale watching in Vancouver, BC

I started trying to "learn about business" in the beginning of 2019. I was wondering if I could make and sell bar trivia. But a landing page that I set up with a Squarespace domain didn't convert into any purchases, even though I got a bunch of clicks. Around this time, I was in the middle of having major problems with my dog, and I gave up working on business stuff to take him to the dog park every morning for an hour. After this, I never got back to it.

I tried to learn French. I picked up a fair bit of vocabulary, but I made very little progress with listening comprehension, despite taking French classes, practicing for over an hour every day, and listening to 100+ hours of French learning podcasts. My girlfriend is a French translator who speaks it fluently. I thought this would help me, and it was part of the reason I chose French. But I was so remedial that we never found a middle ground that wasn't frustrating for me. Actually going to France was fun - I had some basic transactional conversations with shopkeepers, and I could read menus, order, and comprehend most of the posted advertisements around the country. But the trip underscored that I would likely never use French in a serious way - if I had picked Spanish, I'd at least have the benefit of seeing it around New York. And I don't find French culture or history particularly interesting. It was a long and frustrating process for something that would never have an upside.

Eventually, I stopped spending my mornings studying French and started studying machine learning. Machine learning has interested me a few times over the years, most notably around the time of the Netflix Prize. Recently, I've been spending 1.5 hours every morning learning about it, and it has been fun so far. I've written a backpropagation algorithm in NumPy, and I've been working on the Titanic Kaggle competition. My first submission was yesterday, and it performed worse than the naive approach of predicting that all the women survived and all the men died, haha. But there is a whole forum filled with people who have helpful suggestions on how to perform feature selection, and there's a whole Internet of helpful materials. I'd like to learn more about real-world data processing, different neural network architectures, and autoencoders.
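A minimal sketch of that kind of exercise - a tiny two-layer network trained on XOR with hand-derived gradients (illustrative only, not my actual code):

```python
# Toy backpropagation in NumPy: a 2-4-1 sigmoid network learning XOR.
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0.0], [1.0], [1.0], [0.0]])  # XOR targets

W1 = rng.normal(0.0, 1.0, (2, 4))  # input -> hidden weights
b1 = np.zeros((1, 4))
W2 = rng.normal(0.0, 1.0, (4, 1))  # hidden -> output weights
b2 = np.zeros((1, 1))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 1.0
for _ in range(5000):
    # Forward pass.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Backward pass: gradients of squared error through each sigmoid.
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    # Gradient descent update.
    W2 -= lr * (h.T @ d_out)
    b2 -= lr * d_out.sum(axis=0, keepdims=True)
    W1 -= lr * (X.T @ d_h)
    b1 -= lr * d_h.sum(axis=0, keepdims=True)

print(out.round(2).ravel())  # predictions after training; targets are [0, 1, 1, 0]
```

Nothing here is production-grade - it's the "derive the gradients by hand, check that the loss falls" exercise that makes frameworks less magical afterward.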

I've been writing more in 2019. This is perpetually a goal of mine, so I'm glad that I'm finally making room for it. I started a "Simple software engineering" series where I examine the tradeoffs that I make when I write software. I also found it helpful to keep a "knowledge base" in Wordpress, where I record things that I learn as I learn them. There's only a loose organization so far, but I've only been keeping the knowledge base for about 3 months.

Conclusion

I had a good 2019 despite some personal challenges. My year went really well professionally, and I took some much-needed breaks that I hadn't really granted myself. I'd like to focus a little more on friendships in 2020, but it's not clear if that means old friendships or new friendships.

This entry was posted in uncategorized on December 31, 2019 by jake.

Simple software engineering: Mocks are a last resort

Most tests that rely on automatic mocking frameworks should be rewritten to use either the real dependencies or manually-written fakes.

Wait, let's back up. Tests have a few moving parts. First, there is some code being tested. This is commonly called the unit. The unit might have dependencies. The dependencies are not under test, but they can help determine whether the unit behaves correctly. Ideally, they would be passed into the unit. But dependencies can be many things: static data, global data, files on the filesystem, etc.

Dependencies interact with tests in a few ways. The unit can introduce side effects on the dependencies and vice versa. Automatic mocking frameworks are designed to aid this process. Mocks can assert that expected method calls happened with the correct parameters, override return values, and execute arbitrary replacement logic. Mocks have almost absolute power to override the behavior of dependencies (within the confines of what the language allows).

But mocks aren't the only way to write tests that involve dependencies. Real objects can be used directly. This isn't always possible: the real object might be nondeterministic. It might provide random numbers, make a call on the network, etc. Nondeterminism is difficult to test, since there's not necessarily an expected output. Nondeterministic failures decrease confidence in tests, since it's difficult to know whether a failure is real. Accordingly, nondeterminism should be avoided in tests.
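One standard way to keep nondeterminism out of tests is to inject the nondeterministic source, so the test can substitute a deterministic one. A minimal Python sketch (the function and names are illustrative, not from any particular codebase):

```python
import random

def shuffled_deck(rng: random.Random) -> list:
    """Return a shuffled deck of 52 cards, using an injected random source."""
    deck = list(range(52))
    rng.shuffle(deck)
    return deck

# Production code passes an unseeded Random; tests pass a seeded one,
# so the "random" output becomes reproducible and assertable.
assert shuffled_deck(random.Random(42)) == shuffled_deck(random.Random(42))
```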

Statue of Leif Erikson in Reykjavik, Iceland.

Leif Erikson discovered automatic mocking frameworks in the year 998

"Test fakes" are an alternative. They are a fake implementation of a real object. For example, a trivial implementation of an interface that the real object implements. Here's an example from a side project of mine. It allows a clock to be simulated and advanced for testing. Test fakes have a maintenance cost. The tradeoff is that the fake can be reused everywhere.

How should you pick which one to use?

How I select a dependency to use for testing

  1. Use the real object, if possible.
  2. Use a fake implementation, if possible.
  3. Use a mock.

I try to get as close to the production configuration as possible. Why? When a test fails with a real dependency, it's likely a real problem. The more differences between a test object and a production object, the less likely the failure is real, and more likely that the failure involves the test configuration.

OK, so, where am I going with this? In the next section, I will explain common issues with automatic mocking. Then I will describe the tradeoffs that real objects and test fakes have. I will finish by explaining a few situations where mocking is preferable to the other alternatives.

Automatic mocks are very manual

Consider a unit that uses Redis as a key/value store. Talking to Redis involves I/O. So we mock the return values of Redis anywhere it's used in tests.

The first mocked test isn't so bad. It reads one value and writes one value. The second test reads a few values. The third test reads a bunch of objects, but it doesn't modify the values at all. And so on.

Imagine this Redis class spreading through a codebase. Dozens of usages. After all, everyone loves Redis. Every call must be mocked in the test.

But this requires that every test author behave like a human key/value store. Why should a person hand-compute return values for all of these tests? It is simpler to put the Redis key/value store behind an interface and use an in-memory implementation. This would save time per test and would make tests easier to write. The fake would save time the way any code does - by automating a task.

The tests become easier to write because it becomes trivial to assert both the effects and the side effects of the test. Did the unit return the correct value? Did the fake end up in the correct final state? Great!
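As a sketch, assuming the unit talks to Redis through a small interface of our own rather than the full client API, the fake can be little more than a dict:

```python
from typing import Optional

class KeyValueStore:
    """Interface the unit depends on; the real implementation wraps Redis."""
    def get(self, key: str) -> Optional[str]:
        raise NotImplementedError

    def set(self, key: str, value: str) -> None:
        raise NotImplementedError

class InMemoryKeyValueStore(KeyValueStore):
    """Test fake: a plain dict stands in for Redis."""
    def __init__(self):
        self._data = {}

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)

    def set(self, key: str, value: str) -> None:
        self._data[key] = value

# Tests can assert both return values and final state:
store = InMemoryKeyValueStore()
store.set("greeting", "hello")
assert store.get("greeting") == "hello"
assert store.get("missing") is None
```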

I find that the break-even point for this approach is n=1. As in, implementing the fake often takes roughly the same amount of time as implementing the first mock. And then the fake can be reused, but the mock can't. There are exceptions to this that are discussed at the end of this post.

Automatic mocks don't have to behave correctly

Mocks can behave absolutely incorrectly with no consequences. Can one plus one equal three? Sure, why not:

when(mockInstance.addTwoNumbers(1, 1)).thenReturn(3);

A human must hand-write every mocked return value. This leads to situations where the bug and the test both have errors that mask each other. In fact, mocks can be written by watching the test fail and seeing what value would have made the assertions pass. Then the engineer simply enters the expected values into the mock. People really do this. I've done it. I've watched other people do it. These errors get past code review.
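Python's unittest.mock shows how little resistance a mock offers: it returns whatever the test author types, whether or not it matches reality (the adder here is hypothetical):

```python
from unittest.mock import Mock

adder = Mock()
adder.add_two_numbers.return_value = 3  # wrong: 1 + 1 is 2

# A unit that trusts this dependency now "passes" with the bad value baked in.
assert adder.add_two_numbers(1, 1) == 3  # the test is green; the math is wrong
```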

Granted, this can happen with both real objects and fake objects in testing. But since real and fake objects are not customizable per-test, the overall error rate will be lower with these approaches.

Automatic mocks can silently break during a refactoring

This is more insidious. Let's say there is a widely-used dependency, and one of its methods provides the path of a URL. It needs to be changed to provide the full URL string as part of a project to support multiple domains. And it's being renamed from providePath to provideFullURL or something.

So you rename the method. You change the behavior. The full URL is returned instead of the path. The tests pass. Hooray 🎉  But that method is called in 50ish places, and each of those call sites has tests that are written using mocks. Furthermore, some of those call sites are within code that is itself mocked in other tests. Are you confident that nothing is wrong?

I'd be confident in the opposite: something broke somewhere. The mocks silently hid the problem because the return value was simulated. Imagine the developers of each of those call sites. If even one had a tight deadline and needed the full URL, they're gonna prepend the server name they expect. They won't think twice. It could even take days for these errors to appear - when the next nightly big data job runs, when the next weekly marketing email is sent, etc.

A real object would have a better chance of exposing these errors in tests. A fake object would be changed from providing a path to providing a URL, which would also allow the error to be caught across the codebase with a single change. The change would need the same level of scrutiny and QA testing. But with a reasonably complete test suite, it's less likely that it would lead to real problems.
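Some frameworks can partially guard against the rename itself. In Python's unittest.mock, for example, a bare Mock happily accepts a call to the deleted method, while an autospec'd mock rejects it (the class below is illustrative):

```python
from unittest.mock import Mock, create_autospec

class UrlProvider:
    def provide_full_url(self) -> str:  # renamed from provide_path
        return "https://example.com/some/path"

# A bare mock silently accepts the old, deleted method name:
loose = Mock()
loose.provide_path()  # no error; the stale call site goes undetected

# An autospec'd mock only allows methods the real class actually has:
strict = create_autospec(UrlProvider)
try:
    strict.provide_path()
    caught = False
except AttributeError:
    caught = True
assert caught  # the stale call site is flagged
```

Even then, autospeccing checks names and signatures only; it can't tell that the return value changed from a path to a full URL, which is exactly the kind of break a shared fake or real object would surface.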

Tradeoffs

Using a real object has a philosophical tradeoff. Strictly speaking, the test stops being a unit test. It becomes an integration test of the unit and its dependencies. That's fine. If a test can be written quicker and increase confidence, then it's a reasonable tradeoff. If the simplest and most maintainable test is an integration test, then write an integration test. Life's too short for ideological purity.

There are more tradeoffs. A breakage in a real object can cause dozens of failures throughout the codebase. This often makes it easier to debug the failure (since there are lots of examples to debug), but it can also obscure the failure. Conversely, a real object with many call sites can cause failures in just one or two tests. This is often difficult to diagnose. Is the test subtly wrong? Is there a subtle bug in the real object? Is there a subtle bug in how they interact?

Fakes add a maintenance cost. They need to be written and maintained along with the real object or interface. Plus, since they simulate the behavior of an object without being the full implementation, they can easily introduce incorrect behavior that is then reused everywhere. There is also an art to writing them that has to be learned.

A few situations where mocks are the best approach

There are definitely situations where mocks should be used. Here are some common "last resort" cases that I've discovered over the years.

Faking complex behavior, like SQL

At a certain point, a fake would be so complicated that an in-memory solution is totally infeasible. It's implausible to write an in-memory implementation of a SQL server that matches all of the syntactic quirks and features of MySQL. In this situation, using a mock dramatically reduces the maintenance required for the test.

Preventing a method from being called multiple times

Sometimes, calling a method twice is REALLY BAD - maybe it causes a deadlock, maybe a buggy device driver would cause a kernel panic, etc. Code review and instrumentation aren't enough, and it's desirable to assert that it can never happen. Mocks excel at this type of assertion.
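With Python's unittest.mock, the assertion is one line (the driver and its dangerous_init method are hypothetical):

```python
from unittest.mock import Mock

def bring_up_device(driver) -> None:
    """Unit under test: must initialize the driver exactly once."""
    driver.dangerous_init()

driver = Mock()
bring_up_device(driver)
driver.dangerous_init.assert_called_once()  # raises if called zero or 2+ times
```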

Legacy code is poorly structured and there ain't time to fix it

Sometimes, you have to parachute into a codebase, make a fix, and then get extracted. Sometimes it's just not reasonable to spend three weeks refactoring to make a one-day change more testable.

Determining whether a delegate is being invoked

A delegate wraps a second object, and is responsible for calling methods on that second object. An automatic mocking library is an easy solution for ensuring that these calls happen as expected.
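A sketch with Python's unittest.mock (the names are made up): the delegate's only job is forwarding, so the test's only job is verifying the forwarded call.

```python
from unittest.mock import Mock

class LoggingDelegate:
    """Wraps an inner service and logs before delegating each call."""
    def __init__(self, inner, log):
        self._inner = inner
        self._log = log

    def save(self, record: str) -> None:
        self._log.append("saving " + record)
        self._inner.save(record)

inner = Mock()
log = []
LoggingDelegate(inner, log).save("order-123")

inner.save.assert_called_once_with("order-123")  # the call was forwarded intact
assert log == ["saving order-123"]
```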

Thank you for attending my Jake Talk

Automatic mocking frameworks are a last resort. Mocks have uses. But real objects and fake objects should be preferred, in that order.