Finish something every day

When you write code in an engineering organization, you will do the following:

  • Type the code out.
  • Test some of it. Hell, maybe you’ll test all of it.
  • Get someone to review the code.
  • Push it to source control.

These items aren’t discrete or ordered. Test-driven development and pair programming are practices that reorder or merge these items. But these should happen for most changes.

Sometimes, you’re given a large task. You have a question at this point: should I break it up? Should I write the whole thing at once? In my experience, the best tradeoff is to finish something every day, even if it’s small. Write it. Test it. Send it for review.

This introduces a lot of tradeoffs. It’s not always possible. It makes some assumptions about your work environment. We will discuss all of these below.

Benefits

Small changes are better for you

Let’s say that you’re a full stack developer at a web shop. You are assigned a new task: add a new form to collect and store some data, and then display it on another part of the site. The designer whipped up mocks. The product manager wrote out the expected behavior. Now it’s your turn.

You see all of your tasks: Modify a form to collect the data. Include it on the request to the server. Write it into the ORM. Modify the database to store it. Think about the default values and whether we can backfill the data from another location. Read the data from the new location. Render it into the view.

There are a few obvious ways to approach the code:

  • Write all of it at once.
  • Write the database code, write the write path, write the read path.

There’s a less obvious way to approach the code:

  • Write the database code, write the read path (with stubbed data), and write the save path.

There may be other alternatives that depend on the details of your project. But look at what happened: The project is now decomposed into smaller tasks. Even better, we see that the ordering of two of the tasks doesn’t matter. The data layer code blocks everything else. But the other two tasks are independent of each other. You can pick the most convenient task ordering. They could even be done at the same time. This is the first insight of decomposing tasks: some work becomes parallelizable.
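
If it helps to see the dependency structure written down, here is a minimal sketch in Python. The task names are hypothetical labels for the three chunks described above, not names from a real project. It assumes Python 3.9 or later, and uses the standard library's graphlib module to group tasks into batches, where everything in one batch can be worked on at the same time.

```python
from graphlib import TopologicalSorter

# Hypothetical task names for the form feature.
# Each task maps to the set of tasks that must be finished first.
tasks = {
    "data_layer": set(),
    "read_path": {"data_layer"},
    "write_path": {"data_layer"},
}


def parallel_batches(graph):
    """Group tasks into batches; tasks in the same batch don't block each other."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        # Everything returned here has all of its dependencies finished.
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


batches = parallel_batches(tasks)
print(batches)  # [['data_layer'], ['read_path', 'write_path']]
```

The second batch has two entries. That is the parallelizable part: once the data layer lands, the read path and the write path can go to two different people, or be done in either order by one person.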

Parallelization is where the magic happens. This means that you’ve converted a one-developer project into a two-developer project. This is a great way to grow as an engineer. It lets you practice for project lead or tech lead positions. This also helps you practice for artificial deadline pressure. In an “emergency,” you could make the following proposal to your team: “we can get this out faster if we add a second engineer to this. I can add the model code today. Someone can work on the view tomorrow while I work on the save path.”

It’s also good practice to go through a full “write, test, review, deploy” cycle frequently. Practice makes perfect. You will become more capable as you push more and more code. Your “small” changes will become larger and larger. The habit also becomes insurance against the demands of seniority. As you get more responsibilities, you will probably suffer from the Swiss cheese calendars that plague many senior employees. It’ll be your job to help people and maintain relationships around the company. People often need help at awkward times on your calendar. If you are in the habit of producing small changes, it’s a little easier to write code. You can still finish something if you have two hours between meetings.

Interestingly, you will discover failure cases as you parallelize work. These failure cases aren’t always obvious. What could go wrong? Some tasks are just too small. Every “write, test, review, deploy” cycle has overhead. Sometimes the overhead is huge compared to the size of the change. You will also notice that saturating a project with the maximum number of engineers doesn’t work as well as it sounds. If someone’s schedule slips, other engineers will be blocked. This is okay in the occasional “emergencies” where shipping a feature ASAP is the most important thing in the world. But you burn intangible resources (goodwill, happiness, team cohesion) by perpetually oversubscribing projects. You will learn to find a sustainable headcount for a project.

There are selfish reasons to ship all the time. Shipping is a form of advertisement. People see that you’re constantly “done” with something because you’re always asking for a review. But this is a double-edged sword. You’re always going to be asking for code reviews. The reviews should be worth the time of the reviewer. Make them large enough to be interesting. If you’re distracting them with adding a single line, you’re doing the team a disservice. This is why I’ve found “a day of work” to be a good tradeoff.

Better for the codebase

I’m going to tell you a horror story. Remember the above example: adding a UI feature to a web app? I’m going to work on that change. And I’m going to do the whole thing in a single pull request. I swoop my cape over my face and disappear into my evil lair for over a week.

I send you the code review out of nowhere. You look at it. Thousands of lines added across a few dozen files: tests, database configurations, view templates. This is going to take forever to review. You skim the files. You eventually get to the database file. You see that something is wrong: I should have added another table for the new data. Instead, I wedged it into an existing record. And this change was foundational. The write path depends on this mistake. The read path depends on this mistake. The UI on either side depends on this. The tests assert this. Everything depends on this mistake. Fixing this is expensive. It’s closer to a rewrite than a refactoring.

But we’re in the “website economic model” of development. Our sprint process puts downward pressure on how much time this task should take. I shipped a functional version of the project. It’s now your job to argue that we should throw away a working version in favor of a different working version.

This puts you in a difficult spot. The team estimated this task would be completed in under 1 sprint. But now we’re more than halfway to the deadline, and the change is wrong. Fixing it will take it past the end of the sprint. I’m incentivized to push back against your feedback. I may not. But let’s remember: this is a horror story. I’m going to push back. Bringing this up will also invite discussions with product or management stakeholders to negotiate whether there’s a cheaper fix that avoids a rewrite.

Furthermore, it took you forever to review the entire change. You need to do the entire review a second time after my rewrite. And maybe a third time if another round of revisions is necessary. That could add up to hours of reviewing that you’re not dedicating to your own work.

All of this leaves you with two bad options: rubber stamping a bad change (with some perfunctory “I was here” feedback to show that you reviewed it), or reducing your own velocity and your team’s velocity to argue for a gut renovation because of less-tangible long-term improvements.

Ok, let’s end the horror story. What if I had split my task into day-long chunks? My first task would be to write the data layer. So I’d write the database changes and any ORM changes. I’d send them to you for review. You’d look at my changes and say, “Hey, let’s move these fields into a separate table instead of wedging this into the HugeTable. We used to follow that pattern, but we’ve been regretting it lately for $these_reasons.” And it’s totally cool – I don’t push back on this. I take the few hours to make a change, you approve the changes, and I move on.

What was different? I incorporated you earlier into the process. You weren’t sacrificing anybody’s time or velocity. You made me better at my job by giving me feedback early. The codebase turned out better. Nobody’s feelings were hurt. This means that I improved the entire team’s engineering outcome by splitting my changes into small chunks. It was easy to review. It wasn’t difficult for me to fix my mistakes.

Why wouldn’t I split my changes into smaller chunks?

When does shipping every day work? When does it fail?

“Finish something every day” makes a lot of assumptions.

There is an obvious assumption: something worthwhile can be finished in less than a day. This isn’t always true. I’ve heard of legacy enterprise environments where the test suite takes days to run. I’ve also worked in mobile robotics environments where “write, compile, test” cycles took 30 minutes. In those situations, it can be impossible to finish something every day. There is a different optimal cadence that balances the enormous overhead with parallelization.

“Finish something every day” also assumes that the work can be decomposed. Some tasks are inherently large. Designing a large software system is a potentially unbounded problem. Fixing a performance regression can involve lots of experimentation and redesigning, and is unlikely to take only one day. Don’t kill yourself trying to finish these in a day. But it can be interesting to ask yourself, “can I do something quickly each morning and spend the rest of the day working on the design?”

Another assumption is that your teammates review code quickly. Quick reviews are essential. This system is painful when your code reviews languish. Changes start depending on each other. Fixes have to be rebased across all of them. Yes, the tools support it. But managing 5 dependent pull requests is hard. Fixes in the first pull request need to be merged into all the others. If your teammates review them out of order, fixing all of them becomes a nightmare.

If I may be so bold: if you’re getting slow code reviews, you should bring it up with your team. Do you do retrospectives? Bring it up there. Otherwise, find a way to get buy-in from your team’s leaders. You should explain the benefits that they will receive from fast code reviews: “Fast code reviews make it feasible to make smaller changes, because our work doesn’t pile up. Our implementation velocity improves because we’re submitting changes faster. We all know that smaller changes are easier to review. It’ll lead to better engineering outcomes because we’ll provide feedback earlier in the process.” Whatever you think people care about. Ask your team to agree on a short SLA, like half a day.

You can model the behavior you want. You should review others’ code at the SLA that you want. If you want your code reviewed within a couple of hours, review people’s code within a couple of hours. This works well if you can provide good feedback. If you constantly find bugs in their code, and offer improvements that they view as improvements, they’ll welcome your reviews and perspective as critical for the team’s success. If you nitpick their coding style and never find substantial problems, don’t bother. The goal is to add value. When you’re viewed as critical for the team’s success, then it’s easier to argue that “we will all be more successful if we review code quicker.”

I take this to an extreme. When I get a code review, I drop everything and review it. My job as a staff engineer is to move the organization forward. So I do everything possible to unblock others. If this doesn’t work for you, find a sustainable heuristic. “Before I start something, I look to see if anyone has a pending code review that I can help with.” Find a way to provide helpful code reviews quickly.

Finish something every day

Try to finish something every day. You will get better at making small changes, and your definition of “small” will keep getting bigger. You will get better at decomposing tasks. This is the first step towards creating parallelizable projects. Additionally, you will get exposure for continually being “done.”

It helps your reviewers. Smaller tasks are much easier to review than larger tasks. They won’t have to give large reviews. They also won’t have to feel bad about asking for a complete rewrite.

It will also help the codebase. If reviewers can give feedback early, they can help you avoid problems before you’ve written too much code to turn back.

In practice, “finish something every day” really means “find the smallest amount of work that makes sense compared to the per-change overhead.” In many environments, this can be “every day,” but it won’t be universal.
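
Here is one rough way to put numbers on that tradeoff, as a sketch rather than a rule. The function names and the 25% threshold are my own assumptions for illustration. The idea: if every change costs a fixed amount of overhead (running tests, waiting for review, deploying), you can work out how much real work a change needs before the overhead stops dominating.

```python
def overhead_share(work_hours, overhead_hours):
    """Fraction of a change's total cost that is per-change overhead."""
    return overhead_hours / (work_hours + overhead_hours)


def smallest_sensible_change(overhead_hours, max_overhead_share):
    """Smallest amount of work (in hours) that keeps overhead at or below the share.

    Derived from overhead / (work + overhead) <= share, solved for work.
    """
    return overhead_hours * (1 - max_overhead_share) / max_overhead_share


# Hypothetical numbers: 1 hour of overhead per change, and a team that is
# comfortable with overhead being at most 25% of a change's total cost.
print(overhead_share(3, 1))               # 0.25
print(smallest_sensible_change(1, 0.25))  # 3.0
print(smallest_sensible_change(2, 0.25))  # 6.0
```

With these made-up numbers, one hour of overhead fits a daily cadence comfortably. Double the overhead and the smallest sensible change doubles too, which is how you end up with a cadence longer than a day in slower environments.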

Please consider donating to Black Girls Code. When I was growing up,
I had access to high school classes about programming and a computer
in my bedroom which allowed me to hone these skills. I'd like
everyone to have this kind of access to opportunities.

https://www.blackgirlscode.com/

I donated $500, but please consider donating even if it's $5.

The multi-year process of getting promoted to staff software engineer

Note: these are my thoughts, and not the thoughts of my employer.

I wasn’t interested in promotions for most of my career. I had a simple reason: I didn’t want the jobs that promotions would give me. They would have made me unhappy. But then I joined Etsy four years ago. This put me in unfamiliar territory: I wanted to grow my engineering career for the first time. Getting the next promotion would make me a staff software engineer.

I achieved this a year and a half ago. This is the advice that I’d give myself when I started at Etsy four years ago. I tried to make it broadly applicable. I hope that others find it useful. It is targeted towards senior engineers who are looking to reach staff within a few years.

Let’s chat definitions real quick. “Staff software engineer” varies wildly between companies. Sometimes it’s called a “principal engineer,” and other times “staff” and “principal” are two different levels on the same ladder. But I’m describing the first level that is partially scoped outside of a team. A staff engineer is often an individual contributor on a team. They are also recognized for having broad impact outside of their team. Maybe they’re clutch on business-critical projects. Maybe they wrote and maintain the framework that everyone uses. Maybe they’re good at coordinating the technical work of huge projects. Maybe they do all of that. They’re senior employees who have started to level up other parts of the company.

Understanding the process

First, understand the mechanics of how promotions work. Read any documentation on the evaluation and promotion process.

Let’s imagine a process at a mid-size company.

Employee evaluation

  • Every level has written performance expectations.
  • Employees write a self-evaluation every 6 or 12 months. They compare themselves to their level’s written performance expectations.
  • Managers assess their employees. This produces a rating against the performance expectations. This may contradict the employee’s assessment.
  • There is a company-wide adjustment process. This attempts to correct for the fact that different managers might rate the same employee differently.

Employee promotion

  • A promotion candidate should have two consecutive evaluations showing that they are meeting the next level’s criteria. There may be some flexibility here.
  • The employee or the manager will document the employee’s promotion justification. This is often called a “promotion packet.”
  • The promotion packet follows a template. The company is “trying out a new template that is much shorter than it used to be.” It will still take forever to fill out.
  • The manager collects peer reviews from higher-level employees that are familiar with the candidate’s work.
  • One or more high-level employees read the reviews and packet and make a decision on the promotion.

To some approximation, this is what formalized promotion processes look like at mid-size companies. Most companies aren’t trying to innovate here. I’d be suspicious if they did.

Understand the real promotion process

I had a misconception when I was a new grad. I figured that companies followed their written rules. It turned out that this isn’t true. Companies develop their own conventions and interpretations that might surprise the casual reader. There are also unwritten rules that you’ll learn over time.

What do I mean? Let’s take hockey as an example. The NHL has an official rulebook. You could read it today. The rulebook explains how hockey is supposed to work. Rink size, skates, sticks, face-offs, 20 minute periods, offsides, power plays, overtime, waivers, etc.

You read the rulebook. Armed with your new knowledge, you watch some games. The rulebook prepared you fairly well. But some things confuse you.

Every game has blatant infractions that the refs ignore. A few hooks aren’t penalized. Power forwards ram into goalies and goalies cross-check them back. Hits behind the play. The referees see it all. Yet they don’t call penalties. Even more confusingly, other rules are universally obeyed. The players avoid offsides penalties and the linesmen call them. There was a high-stick, and the offending player looked at the ceiling in frustration and skated to the penalty box without a struggle.

Then you’ll notice other oddities. Sometimes, the refs will make a blatant mistake. A star forward falls over when skating past an opposing player. The refs raise their hand: tripping. The star player’s team has a one-man advantage for 2 minutes. But then the video replay shows that the star player just fell over. Even pros lose their edge sometimes. It shouldn’t have been a penalty. The man advantage is unfair. But then the oddest thing happens. Later, the refs will make a second mistake in the opposite direction. Maybe it wasn’t a “mistake,” but they called a penalty they normally ignore.

What’s going on?

Hockey would suck if referees called every possible penalty. The game would take forever. Nobody would watch it. So the refs “put the whistles away” when games are under control: they avoid stopping play by ignoring minor infractions. Additionally, the officials have their own unspoken rules. Maybe they award make-up calls to correct accidental imbalances. They could make the same game different for every player. In fact, one of the most punitive tools available to a referee is holding just a single player to the letter of every rule. This isn’t unique to hockey.

I could go on and on. But let’s face it: the NHL rulebook doesn’t define all of the rules that govern a hockey game. It’s the NHL’s best effort to write down what hockey should be. In practice, the referees decide what hockey will be on any given day. But what if the rulebook grants the referees discretion? It doesn’t change the argument. Different refs will call the same game differently. The enforced rules still vary from the written rules in ways that can’t be learned from the rulebook.

The actual staff engineering promotion process

What lesson can we draw from hockey? The staff engineering promotion process will differ from the written process in ways that are hard to predict. You should be aware of the differences.

These differences can be discovered. Ask existing staff engineers what surprised them about their promotion. Ask your manager. Ask your manager’s manager. Disgruntled Glassdoor reviews or blog posts might have some perspectives that correct for survivorship bias.

In theory, promotions are awarded based on merit. In practice, nobody has equal context across all departments. The decision will be imprecise. Discussions and decisions rely on the written reviews and personal opinions of the participants. People always describe this part like it’s shrouded in mystery. But in practice it’s just “people trying to decide something,” and will have all of the properties of this type of system.

Sometimes there are secret rules or systems. You should try to discover them. Sometimes the secret rules are mundane, like “HighRankingEmployee thinks that employees should be at their level for three years between each promotion because otherwise how could you have enough of a track record?” Sometimes these rules are secret because the company would get sued if they were written down. If you’ve been in the industry long enough, you’ve heard stories about how someone (usually from an underrepresented group) learned that their salary was dramatically less than their peers. Here’s an example if you need one. The employee handbook doesn’t say “Take advantage of structural discrimination and information asymmetry to save a few bucks.” But that’s what the company does. I don’t have a good guide for finding this out. It’s not like you can just ask at the next all-hands. But making and maintaining lots of work friendships opens you up to hearing this kind of gossip.

So, yeah! Understand the written promotion process. And talk to people to understand the human element of the promotion process.

Being honest with yourself

I joined Etsy as a senior engineer. During an early conversation with my director, Tim, I asked what the staff engineering promotion process is like. Tim turned it around on me. “Why do you want to be a staff engineer?”

I was caught off-guard. I should have admitted the truth: I hadn’t asked about this when I was interviewing and I was curious. Instead I stammered out a half-baked answer. Tim listened and said, “It sounds like you haven’t quite figured out why you want to be a staff engineer. It’s not the right path for everyone. Please understand the answer to this before you spend a lot of people’s time trying to get there.”

Fair.

It took a while to find an answer. I got there eventually. Someone asked me a year later and I answered “I was never super interested in my career before I joined Etsy. And I probably could have joined some other company like Facebook and given 5 years of my life to the Like button. But I joined Etsy because I like the idea of helping small sellers sell their stuff on a global stage and giving them a platform to help pay the bills or be their own boss. Performing at my job means that I’m enacting more of a change that I want to see in the world. Becoming a staff engineer means that I’m starting to make the entire organization more effective at enacting the positive changes that I want to see in the world. It’s important to me to keep leveling this up.”

Don’t memorize the paragraph that I have above. That was my reason. It’s important to find your own reason.

Why? It helps you avoid wasting years on a goal that will make you unhappy. Maybe you’d rather work towards becoming a director because you’d be energized by running a department. Maybe you look at the people above your level and think, “I’d be really unhappy if I had any of their jobs.” Maybe you live and breathe tech and less hands-on coding would make you dissatisfied. Maybe your passion lives outside of work and you need something that pays the bills. There are so many valid career paths. It’d be a shame to spend years reaching for the wrong one.

Getting yourself in position

So you want to be a staff engineer. What next?

First, you need to execute at the level of a staff engineer.

Have a frank conversation with your manager about your performance and career trajectory. I’ve found it useful to just say, “My rough plan is to ‘exceed expectations’ in $cycle and to get promoted in $next_cycle. Does that seem realistic?”

Expect the answer to be “no.” That’s okay! Setting a target and expressing interest helps both of you. Your manager can guide you towards projects that will help you grow. They can find you opportunities. They can give you realistic expectations. They can tell you that you need to work longer before you even ask about it. But they may not do any of these things if you don’t express interest.

The goal of promotion is twofold: to meet the performance objectives of the next level, and to demonstrate that you are performing at that level. These are subtly different. You should always find ways to leave breadcrumbs of your accomplishments. This provides evidence that you’re performing. This usually involves leaving a summary. Consider the case where you consulted on another team’s project across a few meetings. You could email out a summary of each meeting and everyone’s contributions. This will provide a record of your accomplishments that might have otherwise been lost.

I want to be clear: the first priority is to be valuable. The second priority is to prove that you added value. Are you hyping up work to make it sound like you’re meeting a competency? Then you’re not meeting that competency and you’re not performing at the staff level yet. Performing at the level is foundational.

Schedule meetings with people who have worked with you for at least 6 months. Bonus if they’re more senior than you. Bonus if they’ve worked at the company for a long time. Bonus if they’re directly involved in the promotion process. Ask them questions like “What’s the difference between me and a staff engineer?” You’re hunting for objections that people might have to your promotion. You’ll notice themes after asking a few people. These are the possible objections that people may raise when you apply for promotion. Work hard to grow past these objections. Be mindful to produce evidence that the objections have been overcome.

Are you known for quickly implementing things? Consider spending an hour a day working on high-impact side projects. These projects should be low-effort and high-reward. These types of projects are easier to find at mid-sized companies. Mid-size companies are large enough to create small problems, but small enough to have trouble allocating resources to fix them. Need an example? Write a testing utility to reduce boilerplate for a common testing situation. People should want the result. Explain it to a few people if you’re not sure. If nobody says some variant of “I wish that someone did that,” consider picking another project. Tell people after you implement it. Send an email or Slack message or whatever telling people about it. Rinse and repeat. You’ll get better at identifying these projects over time.

You might want to change teams to pursue a good opportunity for promotion. There are lots of valid reasons to look for a team change. Maybe the new role will clearly help you grow compared to your current job. Maybe you’ve always wanted to work under the new manager. There’s nothing wrong with changing teams. It’s just business. But don’t change teams solely for a promotion. That’s a recipe for unhappiness. A good fit is way more important. You may need to overperform your current role for years before you are promoted. Find a team where you may be happy for years.

Putting yourself in position is not a solitary process. You will need the feedback, help, and support of everyone around you. You need your manager’s support to produce a good plan and be on the right projects. You need your colleagues’ support to help you identify your blind spots to get yourself promoted. And you need your teammates’ support to work on projects with you (and potentially under you, if your promotion plan involves being a tech lead). Don’t forget this.

Providing the evidence

You’ll be ready when your manager is willing to defend your promotion. At a mid-size company, it’d be difficult to be promoted without your manager’s support. This may not be true in FAANG-sized companies that have realized that processes need exceptions. But a smaller company may not yet have any workarounds.

First, gather all of the evidence that supports your promotion. You can collect this from a variety of places:

  • What have you committed?
  • What pull requests have you commented on?
  • What designs have you written?
  • What designs have you commented on?
  • What meetings were you in?
  • What documentation have you written?
  • Who have you been mentoring?
  • What emails have you sent?
  • Who can vouch for your work? Can they write you a recommendation letter?

Understand how the promotion packet is crafted. The manager will do part of it. This could range from “the manager writes a statement in support of the employee” to “the manager summarizes the entire argument for promotion.”

Manager involvement has benefits. Managers are supposed to understand the performance management system. They are well-positioned to craft persuasive arguments within this system. Heavy manager participation also has drawbacks: managers are most overwhelmed during performance management and promotion season. The most overwhelmed may be forced to make hard choices about how to spend their time.

Help your manager before the process starts. Write a document summarizing how you met every competency. Provide links. “I worked on ProjectX and produced these designs: [link, link], and had the following technical accomplishments: [link, link]. I had mentorship relationships with $engineer_x and $engineer_y, and wrote $internal_documentation offering people suggestions on how to effectively mentor engineers.” Err towards oversharing. Make it easy to copy and paste. This means that the writing should be polished.

Don’t overstate your accomplishments. If you worked on a design with someone, say that you “co-designed” it. Don’t make it sound like you implemented entire projects if you were one part of a team. If you’re ready for promotion, you should be able to directly state why your accomplishments met the guidelines.

Managers are familiar with your work. But they’re not living your life. They may not remember all of your accomplishments, especially ones that happened outside of the team. Be sure to produce the summary. Don’t hold back. They can cherry pick what you gave them.

What if your promotion may be borderline? This can happen naturally: the performance process happens at a fixed cadence. Sometimes it doesn’t happen at a perfectly convenient time. Focus on the story of why your accomplishments meet the promotion guidelines. Remember all of those electronic records that you’ve been producing about your accomplishments? Extract and summarize the themes from them. Use the records as proof. Pull things into linkable formats and provide the links. People like stories and they like having reasons for their decisions.

What if you get rejected? Rejections are okay, even if they feel bad. They’re not permanent. They simply mean that you either haven’t persuaded everyone that you’re meeting the guidelines, or there are guidelines that you need to meet. Have the specific objections explained to you. Keep asking until you are given specifics. Vague reasons are difficult to address. Don’t accept the “keep doing what you’re doing” explanation. 6 months later, you probably kept doing what you were doing, but you still won’t get promoted. “Not enough of a track record” isn’t a specific reason. “If you launched your current project, we’d have the evidence we need” is a specific reason. “You met the technical criteria, but your teammates report that you give harsh feedback so we’re worried about putting you in a broader role” is a specific reason.

Work with your manager to develop a plan for showing evidence that you’ve overcome those objections. The evidence is important. It will provide the story about why you’re ready for promotion the next time. “We thought that they hadn’t provided enough mentorship. But in addition to the 1:1s they had been holding with their teammates, they took a few professional coaching sessions on mentorship, they onboarded 2 new engineers, and polished and promoted the team’s onboarding documentation for everyone across the company to use.”

In summary

Understand the process: Talk to as many people as possible about your specific company’s promotion process. Learn how it’s supposed to happen. Learn how it actually happens. Understand the human elements that factor into the promotion decision.

Put yourself in position: Learn how to perform at the necessary level. Have frank performance conversations about how you are performing, and how people are perceiving your performance. Do everything that you can to address any objections that people bring up. Leave evidence about your accomplishments.

Provide the evidence: Help your manager when you’re going up for promotion. Provide a summary of how your work meets the promotion threshold. If you get rejected, ask until you receive specifics. Then, work with your manager to craft a plan to overcome these objections.

Why are we so bad at software engineering?

From XKCD, released under the Creative Commons 2.5 License

An app contributed to chaos at last week’s 2020 Democratic Iowa Caucus. Hours after the caucus opened, it became obvious that something had gone wrong. No results had been reported yet. Reports surfaced that described technical problems and inconsistencies. The Iowa Democratic Party released a statement declaring that they didn’t suffer a cyberattack, but instead had technical difficulties with an app.

A week later, we have a better understanding of what happened. A mobile app was written specifically for the caucus. The app was distributed through beta testing programs instead of the major app stores. Users struggled to install the app via this process. Once installed, it had a high risk of becoming unresponsive. Some caucus locations had no internet connectivity, rendering an internet-connected app useless. They had a backup plan: use the same phone lines that the caucus had always used. But those lines were clogged by online trolls who jammed them “for the lulz.”

As Tweets containing the words “app” and “problems” made their rounds, software engineers started spreading the above XKCD comic. I did too. One line summarizes the comic (and the sentiment that I saw on Twitter): “I don’t quite know how to put this, but our entire field is bad at what we do, and if you rely on us, everyone will die.” Software engineers don’t literally believe this. But it also rings true. What do we mean?

Here’s what we mean: We’re decent at building software when the consequences of failure are unimportant. The average piece of software is good enough that it’s expected to work. Yet most software is bad enough that bugs don’t surprise us. This is no accident. Many common practices in software engineering come from environments where failures can be retried and new features are lucrative. And failure truly is cheap. If any online service provided by the top 10 public companies by market capitalization were completely offline for two hours, it would be forgotten within a week. This premise is driven home in mantras like “Move fast and break things” and “launch and iterate.”

And the rewards are tremendous. A small per-user gain is multiplied by millions (or billions!) of users at many web companies. This is lucrative for companies with consumer-facing apps or websites. Implementation is expensive but finite, and distribution is nearly free. The consumer software engineering industry reaches a tradeoff: we reduce our implementation velocity just enough to keep our defect rate low, but not any lower than it has to be.

I’ll call this the “website economic model” of software development: When the rewards of implementation are high and the cost of retries is low, management sets incentives to optimize for a high short-term feature velocity. This is reflected in modern project management practices and their implementation, which I will discuss below.

But as I said earlier, “We’re decent at building software when the consequences of failure are unimportant.” The website economic model fails horribly when failure isn’t cheap, like in Iowa. Common software engineering practices grew out of this model, and when its assumptions are violated, software engineers become bad at what we do.

How does software engineering work in web companies?

Let’s imagine our hypothetical company: QwertyCo. It’s a consumer-facing software company that earns $100 million in revenue per year. We can estimate the size of QwertyCo by comparing it to other companies. WP Engine, a WordPress hosting site, hit $100 million ARR in 2018. Blue Apron earned $667 million of revenue in 2018. So QwertyCo is a medium-size company. It has between a few dozen and a few hundred engineers and is not public.

First, let’s look at the economics of project management at QwertyCo. Executives have learned that you can’t decree a feature into existence immediately. There are tradeoffs between software quality, time given, and implementation speed.

How much does software quality matter to them? Not much. If QwertyCo’s website is down for 24 hours a year, they’d expect to lose 273,972 dollars total (assuming that uptime linearly correlates with revenue). And anecdotally, the site is often down for 15 minutes and nobody seems to care. If a feature takes the site down, they roll the feature back and try again later. Retries are cheap.
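
The downtime figure is simple arithmetic, and it is worth seeing how small the numbers get for short outages. Here is a minimal sketch, assuming (as above) that revenue accrues evenly across every hour of the year. The function name `downtime_cost` is my own, not something from a library.

```python
# QwertyCo's annual revenue from the text, in dollars.
ANNUAL_REVENUE = 100_000_000
HOURS_PER_YEAR = 365 * 24


def downtime_cost(hours_down: float, annual_revenue: float = ANNUAL_REVENUE) -> float:
    """Revenue lost to an outage, assuming uptime correlates linearly with revenue."""
    return annual_revenue * hours_down / HOURS_PER_YEAR


# A full day of downtime per year, truncated to whole dollars.
print(int(downtime_cost(24)))    # 273972

# A single 15-minute outage under the same assumption.
print(int(downtime_cost(0.25)))  # 2853
```

Under this assumption, a 15-minute outage costs less than $3,000 against $100 million in annual revenue, which is consistent with nobody seeming to care.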

How valuable is a new feature to QwertyCo? Based on my own personal observation, one engineer-month can change an optimized site’s revenue in the ballpark of -2% to 1%. That’s a monthly chance at $1 million of incremental QwertyCo revenue per engineer. Techniques like A/B testing even mitigate the mistakes: within a few weeks, you can detect negative or neutral changes and delete those features. The bad features don’t cost a lot – they last a finite amount of time, and the wins are forever. Small win rates are still lucrative for QwertyCo.
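
To make the asymmetry concrete, here is a rough sketch of the expected first-year value of one launch. Everything except the revenue figure and the -2% to 1% range is a hypothetical assumption of mine: the 20% win rate, the 3-week detection window, and treating every non-win as the worst case of -2%. The point is only to show how rollback changes the sign of the bet, not to estimate real numbers.

```python
ANNUAL_REVENUE = 100_000_000  # QwertyCo's revenue from the text, in dollars
WEEKS_PER_YEAR = 52


def expected_first_year_gain(
    p_win: float,            # hypothetical probability that a launch is a win
    win_pct: float,          # revenue lift from a win (0.01 means +1%)
    loss_pct: float,         # revenue drop from a bad launch (0.02 means -2%)
    weeks_to_detect: float,  # how long a bad launch stays live before rollback
    annual_revenue: float = ANNUAL_REVENUE,
) -> float:
    """Expected dollars gained in one year from a single launch."""
    weekly_revenue = annual_revenue / WEEKS_PER_YEAR
    # A win is kept, so it pays out for the whole year.
    win_value = win_pct * annual_revenue
    # A loss only costs money until it is detected and removed.
    loss_value = loss_pct * weekly_revenue * weeks_to_detect
    return p_win * win_value - (1 - p_win) * loss_value


# With rollback after 3 weeks, the launch has positive expected value.
with_rollback = expected_first_year_gain(0.2, 0.01, 0.02, weeks_to_detect=3)

# If bad launches stayed live all year, the same bet would lose money.
without_rollback = expected_first_year_gain(0.2, 0.01, 0.02, weeks_to_detect=52)

print(round(with_rollback), round(without_rollback))
```

With these made-up inputs, the bet is worth taking only because the losses are time-boxed. That is the whole trick behind "the wins are forever."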

Considering the downside and upside, when should QwertyCo launch a feature? The economics suggest that features should launch even if they’re high risk, as long as they occasionally produce revenue wins. Accordingly, every project turns into an optimization game: “How much can be implemented by $date?”, “How long does it take to implement $big_project? What if we took out X? What if we took out X and Y? Is there any way that we can make $this_part take less time?”

Now let’s examine a software project from the software engineer’s perspective.

The software engineer’s primary commodity is time. Safe software engineering takes a lot of time. Once a project crosses a small complexity threshold, it will have many stages (even if they don’t happen as part of an explicit process). It needs to be scoped with the help of a designer or product manager, converted into a technical design or plan if necessary, and divided into subtasks if necessary. Then the code is written with tests, the code is reviewed, stats are logged and integrated with dashboards and alerting if necessary, and manual testing is performed if necessary. Additionally, coding often has up-front costs known as refactoring: modifying the existing system to make it easier to implement the new feature. Coding could take as little as 10-30% of the time required to implement a “small” feature.

How do engineers lose time? System-wide failures are the most obvious. Site downtime is an all-hands-on-deck situation. The most knowledgeable engineers stop what they are doing to make the site operational again. But time spent firefighting is time they are not adding value. Their projects are now behind schedule, which reflects poorly on them. How can downtime be mitigated? Written tests, monitoring, alerting, and manual testing all reduce the risk that these catastrophic events will happen.

How else do engineers lose time? Through subtle bugs. Some bugs are serious but uncommon. Maybe users lose data if they perform a rare set of actions. When an engineer receives this bug report, they must stop everything and fix the bug. This detracts from their current project, and can be a significant penalty over time.

Accordingly, experienced software engineers become bullish on code quality. They want to validate that code is correct. This is why engineering organizations adopt practices that, on their face, slow down development speed: code review, continuous integration, observability and monitoring, etc. Errors are more expensive the later they are caught, so engineers invest heavily in catching errors early. They also focus on refactorings that make implementation simpler. Simpler implementations are less likely to have bugs.

Thus, management and engineering have opposing perspectives on quality. Management wants the error rate to be high (but low enough), and engineers want the error rate to be low.

How does this feed into project management? Product and engineering split projects into small tasks that encompass the whole project. The project length is a function of the number of tasks and the number of engineers. Most commonly, the project will take too long and it is adjusted by removing features. Then the engineers implement the tasks. Task implementation is often done inside of a “sprint.” If the sprint time is two weeks, then every task has an implicit two week timer. Yet tasks often take longer than you think. Engineers make tough prioritization decisions to finish on time: “I can get this done by the end of the sprint if I write basic tests, and if I skip this refactoring I was planning.” The sprint process applies a constant downward pressure on time spent, which means that the engineer can either compromise on quality, or admit failure in the sprint planning meeting.

Some will say that I’m being too hard on the sprint process, and they’re right. This is really because of time-boxed incentives. The sprint process is just a convenient way to apply time pressure multiple times: once when scoping the entire project, and once for each task. If the product organization is judged by how much value they add to the company, then they will naturally negotiate implementation time with engineers without any extra prodding from management. Engineers are also incentivized to implement quickly, but they might try optimizing for the long-term instead of the short-term. This is why multiple organizations are often given incentives to increase short-term velocity.

So by setting the proper incentive structure, executives get what they wanted at the beginning: they can name a feature and a future date, and product and engineering will naturally negotiate what is necessary to make it happen. “I want you to implement account-free checkouts within 2 months.” And product and engineering will write out all of the 2 week tasks, and pare down the list until they can launch something called “account-free checkouts.” It will have a moderate risk of breaking, and will likely undergo a few iterations before it’s mature. But the breakage is temporary, and the feature is forever.

What happens if the assumptions of the website economic model are violated?

As I said before, “We’re decent at building software when the consequences of failure are unimportant.” The “launch and iterate” and “move fast and break things” slogans point to this assumption. But we can all imagine situations where a do-over is expensive or impossible. At the extreme end, a building collapse could kill thousands of people and cause billions of dollars in damage. The 2020 Iowa Democratic Caucus is a more mild example. If the caucus fails, everyone will go home at the end of the day. But a party can’t run a caucus a second time… not without burning lots of time, money, and goodwill.

Quick note: In this section, I’m going to use “high-risk” as a shorthand for “situations without do-overs” and “situations with expensive do-overs.”

What happens when the website economic model is applied to a high-risk situation? Let’s pick an example completely at random: you are writing an app for reporting Iowa Caucus results. What steps will you take to write, test, and validate the app?

First, the engineering logistics: you must write both an Android app and an iPhone app. Reporting is a central requirement, so a server is necessary. The convoluted caucus rules must be coded into both the client and the server. The system must report results to an end-user; this is yet another interface that you must code. The Democratic Party probably has validation and reporting requirements that you must write into the app. Also, it’d be really bad if the server went down during the caucus, so you need to write some kind of observability into the system.

Next, how would you validate the app? One option is user testing. You would show hypothetical images of the app to potential users and ask them questions like, “What do you think this screen allows you to do?” and “If you wanted to accomplish $a_thing, what would you tap?”. Design always requires iteration, so you can expect several rounds of user testing before your mockups reflect a high-quality app. Big companies often perform several rounds of testing before implementing large features. Sometimes they cancel features based on this feedback, before they ever write a line of code. User testing is cheap. How hard is it to find 5 people to answer questions for 15 minutes for a $5 gift card? The only trick is finding users that are representative of Iowa Caucus volunteers.

Next, you need to verify the end-to-end experience: The app must be installed and set up. The Democratic Party must understand how to retrieve the results. A backup plan will be required in case the app fails. A good test might involve holding a “practice caucus” where a few Iowa Democratic Party operatives download the app and report results on a given date. This can uncover systemic problems or help set expectations. This could also be done in stages as parts of the product are implemented.

Next, the Internet is filled with bad actors. For instance, Russian groups famously ran a disinformation campaign across social media sites like Facebook, Reddit, and Twitter. You will need to ensure that they cannot interfere with the caucus. Can you verify that the results you receive are from Iowa caucusgoers? Also, the Internet is filled with people who will lie and cause damage just for the lulz. Can the system withstand Denial of Service attacks? If it can’t, do you have a fallback plan? Who is responsible for declaring that the fallback plan is in action and communicating that to the caucuses? What happens if individuals hack into the accounts of caucusgoers? If there are no security experts within the company, it’s plausible that an app that runs a caucus or election should undergo a full third-party security review.

Next, how do you ensure that there isn’t a bug in the software that misreports or misaggregates the results? Relatedly, the Democratic Party should also be suspicious of you: can the Democratic Party be confident of the results even if your company has a bad actor? The results should be auditable with paper backups.

Ok, let’s stop enumerating issues. You will require a lot of time and resources to validate that this is working.

The maker of the Iowa Caucus app was given $60,000 and 2 months. They had four engineers. $60k doesn’t cover salary and benefits for four engineers for two months, especially on top of any business expenses. Money cannot be traded for time. There is little or no outside help.
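
The budget math is quick to check. This sketch uses only the three numbers above ($60,000, four engineers, two months) and assumes, generously, that every dollar goes to engineers and none to any other business expense.

```python
BUDGET = 60_000  # dollars, from the text
ENGINEERS = 4
MONTHS = 2

# Budget available per engineer per month if nothing else is paid for.
per_engineer_month = BUDGET / (ENGINEERS * MONTHS)

# The same rate expressed as a yearly figure, for comparison with a salary.
annualized_per_engineer = per_engineer_month * 12

print(per_engineer_month)       # 7500.0
print(annualized_per_engineer)  # 90000.0
```

That is $7,500 per engineer-month, or a $90,000 annual rate, and it has to cover benefits, servers, design work, and everything else before it covers anyone's salary.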

Let’s imagine that you apply the common practice of removing and scoping-down tasks until your timeline makes sense. You will do everything possible to save time. App review frequently takes less than a day, but worst-case it can take a week or be rejected. So let’s skip that: the caucus staff will need to download the app through beta testing links. Even if the security review was free, it would take too long to implement all of their recommendations. You’re not doing a security review. Maybe you pay a designer $1000 to make app mockups and a logo while you build the server. You will plan to do one round of user testing (and later skip it once engineering timelines slip). Launch and iterate! You can always fix it before the next caucus.

And coding always takes longer than you expect! You will run into roadblocks. First, the caucus’ rules will have ambiguities. This always happens when applying a digital solution to an analog world: the real world can handle ambiguity and inconsistency and the digital world cannot. The caucus may issue rule clarifications in response to your questions. This will delay you. The caucus might also change their rules at the last second. This will cause you to change your app very close to the deadline. Next, there are multiple developers, so there will be coordination overhead. Is every coder 100% comfortable with both mobile and server development? Is everyone fully fluent in React Native? JS? Typescript? Client-server communication? The exact frameworks and libraries that you picked? Every “no” will add development time to account for coordination and learning. Is everyone comfortable with the test frameworks that you are using? Just kidding. A few tests were written in the beginning, but the app changed so quickly that the tests were deleted.

Time waits for no one. 2 months are up, and you crash across the finish line in flames.

In the website economic model, crashing across the finish line in flames is good. After all, the flames don’t matter, and you crossed the finish line! You can fix your problems within a few weeks and then move to the next project.

But flames matter in the Iowa caucus. As the evening wears on, the Democratic Caucus is fielding calls from people complaining about the app. You get results that are impossible or double-reported. Soon, software engineers are gleefully sharing comics and declaring that the Iowa Caucus never should have paid for an app, and that paper is the only technology that can be trusted for voting.

What did we learn?

This essay helped me develop a personal takeaway: I need to formalize the cost of a redo when planning a project. I’ve handled this intuitively in the past, but it should be explicit. This formalization makes it easier to determine which tasks cannot be compromised on. This matches my past behavior; I used to work in mobile robotics, where implementation cycles were long and the damage from failure could be high. We spent a lot of time adding observability and making foolproof ways to throttle and terminate out-of-control systems. I’ve also worked on consumer websites for a decade, where the consequences of failure are lower. I’ve been more willing to take on short-term debt and push forward in the face of temporary failure, especially when rollback is cheap and data loss isn’t likely. After all, I’m incentivized to do this. Our industry also has techniques for teasing out these questions. “Premortems” are one example. I should do more of those.

On the positive side, some people outside of the software engineering profession will learn that sometimes software projects go really badly. People funding political process app development will be asking, “How do we know this won’t turn into an Iowa Caucus situation?” for a few years. They might stumble upon some of the literature that attempts to teach non-engineers how to hire engineers. For example, the Department of Defense has a guide called “Detecting Agile BS” (PDF warning) that gives non-engineers tools for detecting red flags when negotiating a contract. Startup forums are filled with non-technical founders who ask for (and receive) advice on hiring engineers.

The software engineering industry learned nothing. The Iowa Caucus gives our industry an opportunity. We could be examining how the assumption of “expensive failure” should change our underlying processes. We will not take this opportunity, and we will not grow from it. The consumer-facing software engineering industry doesn’t respond to the risk of failure. In fact, we celebrate our plan to fail. If the outside world is interested in increasing our code quality in specific domains, they should regulate those domains. It wouldn’t be the first one: HIPAA and Sarbanes-Oxley are examples of regulations that affect engineering at website economic model companies. Regulation is insufficient, but it may be necessary.

But, yeah. That’s what we mean when we say, “I don’t quite know how to put this, but our entire field is bad at what we do, and if you rely on us, everyone will die.” Our industry’s mindset grew in an environment where failure is cheap and we are incentivized to move quickly. Our processes are poorly applied when the cost of a redo is high or a redo is impossible.