
Seven steps to writing a better software engineering resume

Over the past few years, I’ve reviewed lots of resumes for friends and former coworkers. I’ve also reviewed resumes for strangers on Twitter, where I offered free reviews. I’ve been told that my reviews are helpful, so I’m documenting my approach, which focuses on whether the resume is a good fit for the job that the candidate wants.

Software engineers can have their resume reviewed for free in a number of ways, such as by consulting Reddit, Discord and other forums, or by asking friends. From my observation, most online reviewers begin with the assumption that the candidate has provided all the necessary information. They then suggest tactics for strengthening that information. For example, they help candidates to rewrite each bullet point to fit frameworks like “Situation, Task, Action, Outcome.” They fix typos, suggest rewrites, and highlight formatting inconsistencies. They’re really good at focusing on the tactics of writing a resume.


Tactics are fine. Everything I mentioned in the previous paragraph is important. But tactics are best supplemented with a good strategy. How do you know that your resume has the right information? You could describe your work experience in an infinite number of ways. The strategy is to demonstrate how your existing work experience makes you a good candidate for the job that you want.

Many reviewers ask, “How can you present this information in the best way possible?” I also ask, “How do you know that you included the right information?”

That might sound obvious, but it’s difficult to apply in practice. I know this from my own experience. When I review resumes, I ask candidates for the resume and the job postings that interest them. In the vast majority of reviews, I demonstrate how candidates can include extra experience to more effectively show that they meet the criteria mentioned in the job description.

Discord message that says, "My first piece of feedback is that the first 3 paragraphs of that job posting talk about working with teams, and your resume does not hint, in the least, that you've ever worked with another person on a team"
Part of a “please tear my resume to shreds” review that I did for a friend

Resumes suck, but most jobs require one. Maximize the impact of your resume. Incorporate job postings, job descriptions, and any other relevant information, and use it all to produce a targeted resume.

A quick note: this is written for software engineers with previous work experience who want a new job in the field. If that doesn’t describe you, you might still find this useful but you should think about what’s different for your situation.

Here’s how I’d write my own resume if I had to write one today

The TL;DR of my strategy is to iterate from a generic resume to a specific one. Start with a baseline resume that reflects your experience to the best of your ability but is not tailored to a specific job. Then, find information about the jobs that you want. Take notes from the job postings. Rewrite your resume using the information in those notes.

And now, in more detail:

Step 1: Write a resume the way you normally would

Create a resume however you want. Modify an old resume or create a new one. I don’t care which. Don’t overthink it.

Personally, I modify my old resumes. They’re always written in LaTeX because when I was in college, I hoped that I would get street cred by ignoring decades of word processor innovations. I don’t know if this worked, but I’m too lazy to change my approach at this point. If I had to make a resume today, I’d dust off this file and modify it to reflect my most recent experience and remove irrelevant information. Then I’d relearn how to run LaTeX in The Year of Our Lord 2021.

Anyways, include all of the usual sections in your resume: name, contact information, experience, any relevant schooling, maybe relevant/interesting projects, maybe publication experience if you’re a researcher, maybe other things like security clearance if you have it and it’s relevant.

Your resume should resemble other resumes in your field. If resumes in your field tend to be compact, then your resume should be compact.

Spell check and proofread it. Be prepared to have a conversation about every fact on your resume. Just like you shouldn’t build a house on an unstable foundation, you shouldn’t mention something on your resume if you can’t back it up.

Step 2: Collect three or four relevant job postings that intrigue you

Find job postings that appeal to you. If you want to work for a specific company, find the available listings in the “Jobs” or “Careers” section of its website. Otherwise, look for companies offering the role you want.

It’s OK if you don’t know what job you want yet. Look for listings that appeal to you. Spend some time researching different companies and fields until you have identified a few job postings that stand out to you.

It’s important to collect more than one job posting. If you’re starting with a job posting from a specific company that you’re interested in working for, supplement it with similar postings from that company, or other postings from similar companies. Why? You don’t want to assume that any single job posting is complete. You will be mining these postings for data, and a single posting might omit something relevant to the job. But the more postings you examine, the less likely it is that an important qualification will go unmentioned in all of them.

Step 3: Create a spreadsheet and aggregate data from the job postings

Make a table in a spreadsheet with 3-5 columns. The columns should represent different, and mostly non-overlapping, competency areas. Then, read through the job postings sentence-by-sentence, and add information from the job postings to these columns.

I typically create categories like these:

TECHNICAL SKILLS: Any hard skills that are explicitly named. Programming languages, operating systems, third-party libraries, applications, etc. “Experience with Python in Unix operating systems a plus” or keyword soup kind of things.

DELIVERY: Anything related to execution. The types of projects that you’ve designed, implemented, or delivered. Whether you’re going to be on a pager rotation. The size and scope of the tasks that you’re expected to complete. The types of financial, product, or technical outcomes that you’re expected to achieve (or have already achieved).

LEADERSHIP: This can take a lot of different forms. Information in this category might call out specific leadership roles, like tech lead, project lead, or management. They may want you to be proactive about suggesting product or process improvements. They may want you to mentor more junior engineers. If you haven’t been in the industry for long, it’s okay to skip this category, or merge it with COMMUNICATION.

COMMUNICATION: Anything related to interacting with other people. A wide range of skills could fit here: project management practices, how you work day-to-day with teams, expected response times for answering support tickets, or your ability to be a point person for your team or for external stakeholders.

Tailor these categories to your specific field. If you are applying for research jobs, then you’ll probably have a RESEARCH category. The category names don’t matter too much. You only need a few that capture different concepts.

Now that you’ve made the categories, the actual work begins. Read through every sentence of each job posting and incorporate each piece of information from the sentence into at least one of the categories. If you encounter a line like “Experience with Go a plus,” then put “Go” under TECHNICAL SKILLS. You don’t need to record duplicate information. If two job postings mention Go, put it once. If two job postings mention working as a tech lead, put it once. Make sure that the information in each sentence is captured in your spreadsheet.

This can be time-consuming. It takes a while to process all of the sentences in the job postings. Thankfully, processing each subsequent posting will take less time once the first posting is finished. Job postings will have duplicate information. You only need to record the differences.
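If you’d rather work in code than a spreadsheet, the same bookkeeping fits in a few lines of Python. This is a toy sketch with invented postings and keyword lists; real postings need human judgment, not string matching, but the deduplication idea is the same:

    # Toy sketch: bucket keywords from job postings into categories.
    # Sets implement the "record duplicates only once" rule for free.
    postings = [
        "Experience with Go a plus. You will mentor junior engineers.",
        "We use Go and Python on Unix. You will act as tech lead.",
    ]

    # Invented keyword lists; in reality you decide these as you read.
    keywords = {
        "TECHNICAL SKILLS": ["Go", "Python", "Unix"],
        "LEADERSHIP": ["mentor", "tech lead"],
    }

    buckets = {category: set() for category in keywords}
    for posting in postings:
        for category, words in keywords.items():
            for word in words:
                if word.lower() in posting.lower():
                    buckets[category].add(word)

    for category, found in buckets.items():
        print(f"{category}: {sorted(found)}")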

If something doesn’t fit into one of the categories, try to include it somewhere, even if you need a MISCELLANEOUS category. You don’t want blind spots in your resume. For example, senior+ engineers often overemphasize their technical skills and project experience, and underemphasize their soft skills.

Step 4: Modify your resume to address as many of the relevant competencies as possible

Look at each item that you recorded in the previous step. Look through your resume and ask yourself, “Does my resume address this? If not, can I add a new bullet point, or can I add relevant information to an existing bullet point?”

At this point, your resume might get longer as you add information to address criteria from the previous step. If it does, rewrite the bullet points for brevity. Avoid making your resume significantly longer than when you started.

It’s okay if you don’t address every item in the categories that you produced. Again, you should feel confident in every item on your resume. Don’t add something if you can’t talk about it for 10 minutes in an interview. If a job asks for PHP experience and you’ve only written 50 lines of PHP in your life, leave it off your resume.

Step 5: Make the hiring manager want to hire you

People will look at each line of your resume and say, “So what?” To counter this, make sure that you explain the impact of your experience.

Think about the people making the hiring decision. This will differ per company. At a megacorp, the people hiring you might be a collective of managers or senior (or above) engineers. If you’re applying to work for medium-sized companies, the hiring manager will probably be in your management chain; either your manager or a boss of theirs. At a small startup, the founders might hire you themselves.

Make people want to work with you. The people at the megacorp want to prove that they consistently hire high performers. Your manager wants to know that you can function effectively in a team environment. A startup founder wants to know that you can ship.

Explaining your impact allows them to imagine how you would function on their team. Let’s look at an example that I made up:

A bullet point without impact: “Tech lead for a 3 developer team. Worked with design and product to implement customer-facing features in addition to tech lead duties.”

A bullet point with impact: “Tech lead for a 3 developer team. Worked with design and product to implement customer-facing features in addition to tech lead duties. Increased revenue $n million while maintaining a low defect rate.”

When you refer to a responsibility without describing the impact that you had, you’re just claiming that you did something. The impact helps you argue that you did it well. An engineer will think, “It sounds like they can ship code, and I can talk to them about why they had a low error rate.” A manager will think, “OK, they can function on a team.” A director will think, “OK, they help produce business results.” A founder will think, “OK, they get code out the door that gets results.” In each case, you’re telling them that you’re going to make them look good. “One additional strong developer! A team player! More money!”

Step 6 [optional]: Find a relevant career ladder

Find a career ladder that describes the job that you want. Then, look through your resume and see if you would hire yourself for the level that you want. If there are gaps, could you add any relevant experience to your resume to help fill these gaps?

This is optional, because it’s uncommon for career ladders to be published. But there are a few. For example, Etsy open sourced their career ladder. If you find a career ladder that covers your desired job, look at your resume through the lens of this ladder. Is your resume missing anything?

I understand that these ladders are mostly used for internal promotions, but hiring managers sometimes look at resumes and ask themselves questions like, “If this person were working for me, would I promote them to the job they’re interviewing for?” Don’t worry if you can’t find something. But if you can, use it.

Step 7: Proofread, spellcheck, and get feedback from real people

People reject resumes for the dumbest reasons you can imagine. There’s the apocryphal story/joke of the hiring manager who shuffled resumes, grabbed a handful, and threw them out, “because I don’t want to hire anyone that’s unlucky.” This is only untrue in the sense that these exact circumstances didn’t happen: it’s depressingly easy to find forums where people say they would never hire someone based on arbitrary resume criteria that have literally nothing to do with job performance.

Strive to eliminate as many mistakes from your resume as possible. If you do literally nothing else, please run a spell check. Copy/paste your resume into Google Docs if your text editor doesn’t have a built-in spell checker. Spelling mistakes are easily avoidable, so you don’t have any excuse for allowing them to slip through. Don’t make someone second-guess you because you wrote your resume in Visual Studio Code without spell check.

Read your resume out loud. Ask people to review it. Look for inconsistencies, for example in the use of periods at the end of bulleted sentences, and in the formatting of dates.

Ask people to give you feedback; the more, the better.

Even if you disagree with your reviewers, ask yourself whether you could restate something to make it clearer, or improve your resume such that they wouldn’t have needed to give you feedback. If a reviewer raises an issue, then someone reading your resume at a company might think the same thing.

So, there you have it!

You can produce a resume that is tailored to the job that you want. Incorporate information from job postings and career ladders, explain your impact, and correct mistakes. It’s not difficult, but it requires some grinding, and the time investment is worth it. It won’t magically get you a job, but it brings you closer to getting a hiring manager to consider your application and think, “We need someone like this.” You’ll be more likely to get to the next step.

Finish something every day

When you write code in an engineering organization, you will do the following:

  • Type the code out.
  • Test some of it. Hell, maybe you’ll test all of it.
  • Get someone to review the code.
  • Push it to source control.

These items aren’t discrete or ordered. Test-driven development and pair programming are practices that reorder or merge them. But all four should happen for most changes.

Sometimes, you’re given a large task. You have a question at this point: should I break it up? Should I write the whole thing at once? In my experience, the best tradeoff is to finish something every day, even if it’s small. Write it. Test it. Send it for review.

This introduces a lot of tradeoffs. It’s not always possible. It makes some assumptions about your work environment. We will discuss all of these below.

Benefits

Small changes are better for you

Let’s say that you’re a full stack developer at a web shop. You are assigned a new task: add a new form to collect and store some data, and then display it on another part of the site. The designer whipped up mocks. The product manager wrote out the expected behavior. Now it’s your turn.

You see all of your tasks: Modify a form to collect the data. Include it on the request to the server. Write it into the ORM. Modify the database to store it. Think about the default values and whether we can backfill the data from another location. Read the data from the new location. Render it into the view.

There are a few obvious ways to approach the code:

  • Write all of it at once.
  • Write the database code, write the write path, write the read path.

There’s a less obvious way to approach the code:

  • Write the database code, write the read path (with stubbed data), and then write the write path.

There may be other alternatives that depend on the details of your project. But look at what happened: The project is now decomposed into smaller tasks. Even better, we see that the ordering of two of the tasks doesn’t matter. The data layer code blocks everything else. But the other two tasks are independent of each other. You can pick the most convenient task ordering. They could even be done at the same time. This is the first insight of decomposing tasks: some work becomes parallelizable.
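To make “stubbed data” concrete, here’s a toy sketch (the function and field names are invented): the read path returns hardcoded data so the view can be written, tested, and reviewed before the write path exists.

    # Toy sketch of a "read path with stubbed data." All names invented.

    def fetch_favorite_color(user_id: int) -> str:
        """Read path. Stubbed for now: returns a hardcoded value so the
        view can be built and reviewed before the write path lands."""
        # TODO: replace with a real ORM query once the data is written.
        return "teal"

    def render_profile(user_id: int) -> str:
        """View code that consumes the read path."""
        color = fetch_favorite_color(user_id)
        return f"<p>Favorite color: {color}</p>"

    if __name__ == "__main__":
        print(render_profile(user_id=42))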

Parallelization is where the magic happens. It means that you’ve converted a one-developer project into a two-developer project. This is a great way to grow as an engineer. It lets you practice for project lead or tech lead positions. It also gives you practice at handling deadline pressure. In an “emergency,” you could make the following proposal to your team: “we can get this out faster if we add a second engineer. I can add the model code today. Someone can work on the view tomorrow while I work on the write path.”

It’s also good practice to go through a full “write, test, review, deploy” cycle frequently. Practice makes perfect. You will become more capable as you push more and more code. Your “small” changes will become larger and larger. It also becomes insurance against seniority. As you get more responsibilities, you will probably suffer from the Swiss cheese calendars that plague many senior employees. It’ll be your job to help people and maintain relationships around the company. People often need help at awkward times on your calendar. If you are in the habit of producing small changes, it’s easier to keep writing code: you can still finish something in the two hours between meetings.

Interestingly, you will discover failure cases as you parallelize work. These failure cases aren’t always obvious. What could go wrong? Some tasks are just too small. Every “write, test, review, deploy” cycle has overhead. Sometimes the overhead is huge compared to the size of the change. You will also notice that saturating a project with the maximum number of engineers doesn’t work as well as it sounds. If someone’s schedule slips, other engineers will be blocked. This is okay in the occasional “emergencies” where shipping a feature ASAP is the most important thing in the world. But you burn intangible resources (goodwill, happiness, team cohesion) by perpetually oversubscribing projects. You will learn to find a sustainable headcount for a project.

There are selfish reasons to ship all the time. Shipping is a form of advertisement. People see that you’re constantly “done” with something because you’re always asking for a review. But this is a double-edged sword. You’re always going to be asking for code reviews, and the reviews should be worth the reviewer’s time. Make changes large enough to be interesting. If you’re distracting a reviewer with a single added line, you’re doing the team a disservice. This is why I’ve found “a day of work” to be a good tradeoff.

Better for the codebase

I’m going to tell you a horror story. Remember the above example: adding a UI feature to a web app? I’m going to work on that change. And I’m going to do the whole thing in a single pull request. I swoop my cape over my face and disappear into my evil lair for over a week.

I send you the code review out of nowhere. You look at it. Thousands of lines added across a few dozen files: tests, database configurations, view templates. This is going to take forever to review. You skim the files. You eventually get to the database file. You see that something is wrong: I should have added another table for the new data. Instead, I wedged it into an existing record. And this change was foundational. The write path depends on this mistake. The read path depends on this mistake. The UI on either side depends on this. The tests assert this. Everything depends on this mistake. Fixing this is expensive. It’s closer to a rewrite than a refactoring.

But we’re in the “website economic model” of development. Our sprint process puts downward pressure on how much time this task should take. I shipped a functional version of the project. It’s now your job to argue that we should throw away a working version in favor of a different working version.

This puts you in a difficult spot. The team estimated this task would be completed in under 1 sprint. But now we’re more than halfway to the deadline, and the change is wrong. Fixing it will take it past the end of the sprint. I’m incentivized to push back against your feedback. I may not. But let’s remember: this is a horror story. I’m going to push back. Bringing this up will also invite discussions with product or management stakeholders to negotiate whether there’s a cheaper fix that avoids a rewrite.

Furthermore, it took you forever to review the entire change. You need to do the entire review a second time after my rewrite. And maybe a third time if another round of revisions is necessary. That could add up to hours of reviewing that you’re not dedicating to your own work.

All of this leaves you with two bad options: rubber stamping a bad change (with some perfunctory “I was here” feedback to show that you reviewed it), or reducing your own velocity and your team’s velocity to argue for a gut renovation because of less-tangible long-term improvements.

Ok, let’s end the horror story. What if I had split my task into day-long chunks? My first task would be to write the data layer. So I’d write the database changes and any ORM changes. I’d send them to you for review. You’d look at my changes and say, “Hey, let’s move these fields into a separate table instead of wedging this into the HugeTable. We used to follow that pattern, but we’ve been regretting it lately for $these_reasons.” And it’s totally cool – I don’t push back on this. I take the few hours to make a change, you approve the changes, and I move on.
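In ORM terms, the fix might look something like the following. This is a hypothetical sketch in SQLAlchemy declarative style; every table and column name is invented:

    # Hypothetical sketch of the reviewer's suggestion (SQLAlchemy 1.4+).
    from sqlalchemy import Column, ForeignKey, Integer, String
    from sqlalchemy.orm import declarative_base

    Base = declarative_base()

    class HugeTable(Base):
        """The existing, overloaded table. Tempting to wedge new columns
        in here; the reviewer says no."""
        __tablename__ = "huge_table"
        id = Column(Integer, primary_key=True)
        # ...dozens of existing columns...

    class FeatureData(Base):
        """The new data gets its own table, joined back by a foreign key.
        Cheap to change now; expensive after a week of dependent code."""
        __tablename__ = "feature_data"
        id = Column(Integer, primary_key=True)
        huge_table_id = Column(Integer, ForeignKey("huge_table.id"), nullable=False)
        value = Column(String(255), nullable=False)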

What was different? I incorporated you earlier into the process. You weren’t sacrificing anybody’s time or velocity. You made me better at my job by giving me feedback early. The codebase turned out better. Nobody’s feelings were hurt. This means that I improved the entire team’s engineering outcome by splitting my changes into small chunks. It was easy to review. It wasn’t difficult for me to fix my mistakes.

Why wouldn’t I split my changes into smaller chunks?

When does shipping every day work? When does it fail?

“Finish something every day” makes a lot of assumptions.

There is an obvious assumption: something worthwhile can be finished in less than a day. This isn’t always true. I’ve heard of legacy enterprise environments where the test suite takes days to run. I’ve also worked in mobile robotics environments where “write, compile, test” cycles took 30 minutes. In those situations, it can be impossible to finish something every day. There is a different optimal cadence that balances the enormous overhead with parallelization.

“Finish something every day” also assumes that the work can be decomposed. Some tasks are inherently large. Designing a large software system is a potentially unbounded problem. Fixing a performance regression can involve lots of experimentation and redesigning, and is unlikely to take only one day. Don’t kill yourself trying to finish these in a day. But it can be interesting to ask yourself, “can I do something quickly each morning and spend the rest of the day working on the design?”

Another assumption is that your teammates review code quickly. Quick reviews are essential. This system is painful when your code reviews languish. Changes start depending on each other, and fixes in the first pull request have to be rebased across all the others. Yes, the tools support it, but managing 5 dependent pull requests is hard. If your teammates review them out of order, fixing all of them becomes a nightmare.

If I may be so bold: if you’re getting slow code reviews, you should bring it up with your team. Do you do retrospectives? Bring it up there. Otherwise, find a way to get buy-in from your team’s leaders. You should explain the benefits that they will receive from fast code reviews: “Fast code reviews make it feasible to make smaller changes, because our work doesn’t pile up. Our implementation velocity improves because we’re submitting changes faster. We all know that smaller changes are easier to review. It’ll lead to better engineering outcomes because we’ll provide feedback earlier in the process.” Whatever you think people care about. Ask your team to agree on a short SLA, like half a day.

You can model the behavior you want. You should review others’ code at the SLA that you want. If you want your code reviewed within a couple of hours, review people’s code within a couple of hours. This works well if you can provide good feedback. If you constantly find bugs in their code, and offer changes that they view as improvements, they’ll welcome your reviews and perspective as critical for the team’s success. If you nitpick their coding style and never find substantial problems, don’t bother. The goal is to add value. When you’re viewed as critical for the team’s success, it’s easier to argue that “we will all be more successful if we review code quicker.”

I take this to an extreme. When I get a code review, I drop everything and review it. My job as a staff engineer is to move the organization forward. So I do everything possible to unblock others. If this doesn’t work for you, find a sustainable heuristic. “Before I start something, I look to see if anyone has a pending code review that I can help with.” Find a way to provide helpful code reviews quickly.

Finish something every day

Try to finish something every day. You will get better at making small changes, and your definition of “small” will keep getting bigger. You will get better at decomposing tasks. This is the first step towards creating parallelizable projects. Additionally, you will get exposure for continually being “done.”

It helps your reviewers. Smaller tasks are much easier to review than larger tasks. They won’t have to give large reviews. They also won’t have to feel bad about asking for a complete rewrite.

It will also help the codebase. If reviewers can give feedback early, they can help you avoid problems before you’ve written too much code to turn back.

In practice, “finish something every day” really means “find the smallest amount of work that makes sense compared to the per-change overhead.” In many environments, this can be “every day,” but it won’t be universal.

Please consider donating to Black Girls Code. When I was growing up, I had access to high school classes about programming and a computer in my bedroom, which allowed me to hone these skills. I’d like everyone to have this kind of access to opportunities.

https://www.blackgirlscode.com/

I donated $500, but please consider donating even if it's $5.

Why are we so bad at software engineering?

From XKCD, released under the Creative Commons 2.5 License

An app contributed to chaos at last week’s 2020 Iowa Democratic Caucus. Hours after the caucus opened, it became obvious that something had gone wrong. No results had been reported yet. Reports surfaced describing technical problems and inconsistencies. The Iowa Democratic Party released a statement declaring that they hadn’t suffered a cyberattack, but instead had technical difficulties with an app.

A week later, we have a better understanding of what happened. A mobile app was written specifically for the caucus. The app was distributed through beta testing programs instead of the major app stores. Users struggled to install the app via this process. Once installed, it had a high risk of becoming unresponsive. Some caucus locations had no internet connectivity, rendering an internet-connected app useless. There was a backup plan: use the same phone lines that the caucus had always used. But those lines were clogged by online trolls jamming them “for the lulz.”

As Tweets containing the words “app” and “problems” made their rounds, software engineers started spreading the above XKCD comic. I did too. One line summarizes the comic (and the sentiment that I saw on Twitter): “I don’t quite know how to put this, but our entire field is bad at what we do, and if you rely on us, everyone will die.” Software engineers don’t literally believe this. But it also rings true. What do we mean?

Here’s what we mean: We’re decent at building software when the consequences of failure are unimportant. The average piece of software is good enough that it’s expected to work. Yet most software is bad enough that bugs don’t surprise us. This is no accident. Many common practices in software engineering come from environments where failures can be retried and new features are lucrative. And failure truly is cheap. If any online service provided by the top 10 public companies by market capitalization were completely offline for two hours, it would be forgotten within a week. This premise is driven home in mantras like “Move fast and break things” and “launch and iterate.”

And the rewards are tremendous. A small per-user gain is multiplied by millions (or billions!) of users at many web companies. This is lucrative for companies with consumer-facing apps or websites. Implementation is expensive but finite, and distribution is nearly free. The consumer software engineering industry reaches a tradeoff: we reduce our implementation velocity just enough to keep our defect rate low, but not any lower than it has to be.

I’ll call this the “website economic model” of software development: When the rewards of implementation are high and the cost of retries is low, management sets incentives to optimize for a high short-term feature velocity. This is reflected in modern project management practices and their implementation, which I will discuss below.

But as I said earlier, “We’re decent at building software when the consequences of failure are unimportant.” Our approach fails horribly when failure isn’t cheap, like in Iowa. Common software engineering practices grew out of the website economic model, and when the assumptions of that model are violated, software engineers become bad at what we do.

How does software engineering work in web companies?

Let’s imagine our hypothetical company: QwertyCo. It’s a consumer-facing software company that earns $100 million in revenue per year. We can estimate the size of QwertyCo by comparing it to other companies. WP Engine, a WordPress hosting site, hit $100 million ARR in 2018. Blue Apron earned $667 million of revenue in 2018. So QwertyCo is a medium-size company. It has between a few dozen and a few hundred engineers and is not public.

First, let’s look at the economics of project management at QwertyCo. Executives have learned that you can’t decree a feature into existence immediately. There are tradeoffs between software quality, time given, and implementation speed.

How much does software quality matter to them? Not much. If QwertyCo’s website is down for 24 hours a year, they’d expect to lose $273,972 total (assuming that uptime correlates linearly with revenue). And anecdotally, the site is often down for 15 minutes and nobody seems to care. If a feature takes the site down, they roll the feature back and try again later. Retries are cheap.
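The arithmetic behind that figure, with revenue spread evenly across the 8,760 hours in a year:

    \[ \$100{,}000{,}000 \times \frac{24}{8760} \approx \$273{,}972 \]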

How valuable is a new feature to QwertyCo? Based on my own observation, one engineer-month can change an optimized site’s revenue in the ballpark of -2% to 1%. That’s a monthly chance at $1 million of incremental QwertyCo revenue per engineer. Techniques like A/B testing even mitigate the mistakes: within a few weeks, you can detect negative or neutral changes and delete those features. The bad features don’t cost a lot – they last a finite amount of time, and the wins are forever. Small win rates are still lucrative for QwertyCo.
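That $1 million is just the top of that range applied to QwertyCo’s annual revenue:

    \[ \$100{,}000{,}000 \times 1\% = \$1{,}000{,}000 \]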

Considering the downside and upside, when should QwertyCo launch a feature? The economics suggest that features should launch even if they’re high risk, as long as they occasionally produce revenue wins. Accordingly, every project turns into an optimization game: “How much can be implemented by $date?”, “How long does it take to implement $big_project? What if we took out X? What if we took out X and Y? Is there any way that we can make $this_part take less time?”

Now let’s examine a software project from the software engineer’s perspective.

The software engineer’s primary commodity is time. Safe software engineering takes a lot of time. Once a project crosses a small complexity threshold, it will have many stages (even if they don’t happen as part of an explicit process). It needs to be scoped with the help of a designer or product manager, converted into a technical design or plan if necessary, and divided into subtasks if necessary. Then the code is written with tests, the code is reviewed, stats are logged and integrated with dashboards and alerting if necessary, and manual testing is performed if necessary. Additionally, coding often has an up-front cost known as refactoring: modifying the existing system to make it easier to implement the new feature. Coding could take as little as 10-30% of the time required to implement a “small” feature.

How do engineers lose time? System-wide failures are the most obvious. Site downtime is an all-hands-on-deck situation. The most knowledgeable engineers stop what they are doing to make the site operational again. But time spent firefighting is time spent not adding value. Their projects are now behind schedule, which reflects poorly on them. How can downtime be mitigated? Written tests, monitoring, alerting, and manual testing all reduce the risk of these catastrophic events.

How else do engineers lose time? Through subtle bugs. Some bugs are serious but uncommon. Maybe users lose data if they perform a rare set of actions. When an engineer receives this bug report, they must stop everything and fix the bug. This detracts from their current project, and can be a significant penalty over time.

Accordingly, experienced software engineers become bullish on code quality. They want to validate that code is correct. This is why engineering organizations adopt practices that, on their face, slow down development speed: code review, continuous integration, observability and monitoring, etc. Errors are more expensive the later they are caught, so engineers invest heavily in catching errors early. They also focus on refactorings that make implementation simpler. Simpler implementations are less likely to have bugs.

Thus, management and engineering have opposing perspectives on quality. Management will accept a high error rate (as long as it stays low enough not to hurt the business), and engineers want the error rate to be low.

How does this feed into project management? Product and engineering split projects into small tasks that encompass the whole project. The project length is a function of the number of tasks and the number of engineers. Most commonly, the project will take too long and it is adjusted by removing features. Then the engineers implement the tasks. Task implementation is often done inside of a “sprint.” If the sprint time is two weeks, then every task has an implicit two week timer. Yet tasks often take longer than you think. Engineers make tough prioritization decisions to finish on time: “I can get this done by the end of the sprint if I write basic tests, and if I skip this refactoring I was planning.” The sprint process applies a constant downward pressure on time spent, which means that the engineer can either compromise on quality, or admit failure in the sprint planning meeting.

Some will say that I’m being too hard on the sprint process, and they’re right. The real culprit is time-boxed incentives. The sprint process is just a convenient way to apply time pressure multiple times: once when scoping the entire project, and once for each task. If the product organization is judged by how much value it adds to the company, then it will naturally negotiate implementation time with engineers without any extra prodding from management. Engineers are also incentivized to implement quickly, but they might try optimizing for the long term instead of the short term. This is why both organizations are often given incentives to increase short-term velocity.

So by setting the proper incentive structure, executives get what they wanted at the beginning: they can name a feature and a future date, and product and engineering will naturally negotiate what is necessary to make it happen. “I want you to implement account-free checkouts within 2 months.” And product and engineering will write out all of the 2 week tasks, and pare down the list until they can launch something called “account-free checkouts.” It will have a moderate risk of breaking, and will likely undergo a few iterations before it’s mature. But the breakage is temporary, and the feature is forever.

What happens if the assumptions of the website economic model are violated?

As I said before, “We’re decent at building software when the consequences of failure are unimportant.” The “launch and iterate” and “move fast and break things” slogans point to this assumption. But we can all imagine situations where a do-over is expensive or impossible. At the extreme end, a building collapse could kill thousands of people and cause billions of dollars in damage. The 2020 Iowa Democratic Caucus is a more mild example. If the caucus fails, everyone will go home at the end of the day. But a party can’t run a caucus a second time… not without burning lots of time, money, and goodwill.

Quick note: In this section, I’m going to use “high-risk” as a shorthand for “situations without do-overs” and “situations with expensive do-overs.”

What happens when the website economic model is applied to a high-risk situation? Let’s pick an example completely at random: you are writing an app for reporting Iowa Caucus results. What steps will you take to write, test, and validate the app?

First, the engineering logistics: you must write both an Android app and an iPhone app. Reporting is a central requirement, so a server is necessary. The convoluted caucus rules must be coded into both the client and the server. The system must report results to an end-user; this is yet another interface that you must code. The Democratic Party probably has validation and reporting requirements that you must write into the app. Also, it’d be really bad if the server went down during the caucus, so you need to write some kind of observability into the system.
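Even a single slice of that server, the results-reporting endpoint, carries real surface area. Here’s a toy sketch in Flask; every route, field name, and validation rule is invented for illustration:

    # Toy sketch of one endpoint: accept a precinct's results and
    # reject obviously invalid reports. All names and rules invented.
    from flask import Flask, jsonify, request

    app = Flask(__name__)

    @app.route("/results", methods=["POST"])
    def report_results():
        payload = request.get_json(silent=True) or {}
        precinct = payload.get("precinct_id")
        tallies = payload.get("tallies", {})

        # Basic validation. A real system would also authenticate the
        # reporter, detect double reports, and write an audit log.
        if not precinct or not tallies:
            return jsonify(error="missing precinct_id or tallies"), 400
        if any(not isinstance(v, int) or v < 0 for v in tallies.values()):
            return jsonify(error="tallies must be non-negative integers"), 400

        # TODO: persist the report, emit metrics, alert on anomalies.
        return jsonify(status="accepted"), 200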

Next, how would you validate the app? One option is user testing. You would show hypothetical images of the app to potential users and ask them questions like, “What do you think this screen allows you to do?” and “If you wanted to accomplish $a_thing, what would you tap?”. Design always requires iteration, so you can expect several rounds of user testing before your mockups reflect a high-quality app. Big companies often perform several rounds of testing before implementing large features. Sometimes they cancel features based on this feedback, before they ever write a line of code. User testing is cheap. How hard is it to find 5 people to answer questions for 15 minutes for a $5 gift card? The only trick is finding users that are representative of Iowa Caucus volunteers.

Next, you need to verify the end-to-end experience: The app must be installed and set up. The Democratic Party must understand how to retrieve the results. A backup plan will be required in case the app fails. A good test might involve holding a “practice caucus” where a few Iowa Democratic Party operatives download the app and report results on a given date. This can uncover systemic problems or help set expectations. This could also be done in stages as parts of the product are implemented.

Next, the Internet is filled with bad actors. For instance, Russian groups famously ran a disinformation campaign across social media sites like Facebook, Reddit, and Twitter. You will need to ensure that they cannot interfere with the caucus. Can you verify that the results you receive are from Iowa caucusgoers? Also, the Internet is filled with people who will lie and cause damage just for the lulz. Can the system withstand Denial of Service attacks? If it can’t, do you have a fallback plan? Who is responsible for declaring that the fallback plan is in effect, and for communicating that to the caucuses? What happens if individuals hack into the accounts of caucusgoers? If the company has no in-house security experts, it’s plausible that an app that runs a caucus or election should undergo a full third-party security review.

Next, how do you ensure that there isn’t a bug in the software that misreports or misaggregates the results? Relatedly, the Democratic Party should also be suspicious of you: can they be confident in the results even if your company has a bad actor? The results should be auditable, with paper backups.

Ok, let’s stop enumerating issues. You will need a lot of time and resources to validate that all of this works.

The maker of the Iowa Caucus app was given $60,000 and 2 months. They had four engineers. $60k doesn’t cover salary and benefits for four engineers for two months, especially on top of any business expenses. There is no spare money to trade for time, and little or no outside help.

Let’s imagine that you apply the common practice of removing and scoping-down tasks until your timeline makes sense. You will do everything possible to save time. App review frequently takes less than a day, but worst-case it can take a week or be rejected. So let’s skip that: the caucus staff will need to download the app through beta testing links. Even if the security review was free, it would take too long to implement all of their recommendations. You’re not doing a security review. Maybe you pay a designer $1000 to make app mockups and a logo while you build the server. You will plan to do one round of user testing (and later skip it once engineering timelines slip). Launch and iterate! You can always fix it before the next caucus.

And coding always takes longer than you expect! You will run into roadblocks. First, the caucus’ rules will have ambiguities. This always happens when applying a digital solution to an analog world: the real world can handle ambiguity and inconsistency and the digital world cannot. The caucus may issue rule clarifications in response to your questions. This will delay you. The caucus might also change their rules at the last second. This will cause you to change your app very close to the deadline. Next, there are multiple developers, so there will be coordination overhead. Is every coder 100% comfortable with both mobile and server development? Is everyone fully fluent in React Native? JS? Typescript? Client-server communication? The exact frameworks and libraries that you picked? Every “no” will add development time to account for coordination and learning. Is everyone comfortable with the test frameworks that you are using? Just kidding. A few tests were written in the beginning, but the app changed so quickly that the tests were deleted.

Time waits for no one. Two months are up, and you crash across the finish line in flames.

In the website economic model, crashing across the finish line in flames is good. After all, the flames don’t matter, and you crossed the finish line! You can fix your problems within a few weeks and then move to the next project.

But flames matter in the Iowa caucus. As the evening wears on, the Iowa Democratic Party is fielding calls from people complaining about the app. You get results that are impossible or double-reported. Soon, software engineers are gleefully sharing comics and declaring that the Iowa Caucus never should have paid for an app, and that paper is the only technology that can be trusted for voting.

What did we learn?

This essay helped me develop a personal takeaway: I need to formalize the cost of a redo when planning a project. I’ve handled this intuitively in the past, but it should be explicit. This formalization makes it easier to determine which tasks cannot be compromised on. It matches my past behavior: I used to work in mobile robotics, which had long implementation cycles and where the damage from failure could be high. We spent a lot of time adding observability and building foolproof ways to throttle and terminate out-of-control systems. I’ve also worked on consumer websites for a decade, where the consequences of failure are lower. There I’ve been more willing to take on short-term debt and push forward in the face of temporary failure, especially when rollback is cheap and data loss isn’t likely. After all, I’m incentivized to do this. Our industry also has techniques for teasing out these questions. “Premortems” are one example. I should do more of those.

On the positive side, some people outside of the software engineering profession will learn that sometimes software projects go really badly. People funding political process app development will be asking, “How do we know this won’t turn into an Iowa Caucus situation?” for a few years. They might stumble upon some of the literature that attempts to teach non-engineers how to hire engineers. For example, the Department of Defense has a guide called “Detecting Agile BS” (PDF warning) that gives non-engineers tools for detecting red flags when negotiating a contract. Startup forums are filled with non-technical founders who ask for (and receive) advice on hiring engineers.

The software engineering industry learned nothing. The Iowa Caucus gives our industry an opportunity. We could be examining how the assumption of “expensive failure” should change our underlying processes. We will not take this opportunity, and we will not grow from it. The consumer-facing software engineering industry doesn’t respond to the risk of failure. In fact, we celebrate our plan to fail. If the outside world is interested in increasing our code quality in specific domains, it should regulate those domains. This wouldn’t be unprecedented: HIPAA and Sarbanes-Oxley are examples of regulations that affect engineering at website economic model companies. Regulation is insufficient, but it may be necessary.

But, yeah. That’s what we mean when we say, “I don’t quite know how to put this, but our entire field is bad at what we do, and if you rely on us, everyone will die.” Our industry’s mindset grew in an environment where failure is cheap and we are incentivized to move quickly. Our processes are poorly applied when the cost of a redo is high or a redo is impossible.