Blog posts

Why are we getting worse at software engineering?

Software quality is obviously getting worse. I'm not talking about LLM slop features nobody asked for. I'm not talking about services collapsing under unprecedented LLM-powered demand. Companies are obviously shipping user-visible bugs at an accelerating rate. And consequently, the software we are using is getting worse and worse.

As a quick aside, I'm experimenting with recording my blog posts for YouTube. So if you'd rather see that, here it is:

View on YouTube

Back to our regularly scheduled blogging!

You can thank GitHub for this post. I recently commented on a deleted line in GitHub and hit submit. An error dialogue appeared saying that some block of client-side code couldn't find the line ID. GitHub has had almost two decades to perfect "comment on code." Yet it regressed. And hilariously, I just triggered a Google Docs copy/paste bug while typing this[0]. And heaven help any heavy Claude Desktop users. Everything's getting worse around us.

I blame three factors for this.

  1. As you write code faster, the acceptable error rate drops.
  2. Implementation time is becoming decoupled from competence.
  3. The value of implementation becomes so high that slack time will trend to zero.

I have faith that we can overcome these as a discipline, but we can't do it by applying our old bag of tricks. Our old bag of tricks got us our old error rate. We need to evolve.

As you write code faster, the acceptable error rate drops[1]

I think this is obvious but it's worth stating. If you're shipping code faster, and you have a certain regression rate per change, then obviously you are shipping more regressions. It's just math.

So there are 2 pieces of this: are we shipping faster, and how do our users experience that?

Why does this matter? Let's say that you flip a switch and can immediately double your code production. Nothing else changes. Just twice as much code as before. There are a few consequences:

First, you ship 2x more user-visible regressions over any time period. This makes sense, right? Each change has about the same chance to introduce a regression as it did before. So on average, double the code means double the regressions.

The problem with user-visible regressions is that users encounter them. After you flip this "double code production" switch, the number of active issues in your product will trend towards 2x the baseline. If users are lucky, you notice the issues they encounter in some way, and you can go and fix the issue yourself with your magical new implementation rate. But often, users have to tell you that you fucked up. "This menu is broken in this configuration and I can't even click this option anymore." And that's slow. You need to aggregate the reports, try to reproduce, prioritize the fix, etc. And your best users, your power users? They're the poor saps that trigger all of your new bugs, over and over again.

I'm sure somebody wants to counterargue "LLMs can detect all of these kinds of issues and automatically fix them," and I don't want to hear it. Software is obviously degrading all around us! Whatever LLMs are currently doing isn't enough to resolve it. And if you have some magic technique that the rest of us aren't applying, please scream it from the rooftops. Even better, try to get it integrated into the official Claude Code harness the official way: by randomly Tweeting at members of the team on the off chance one of them notices.

So when you hear someone say "we're shipping 100x faster" and they can't explain how each change ships with 1% of the errors it did before, run away from their software before it explodes.
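To make the math above concrete, here is a back-of-the-envelope sketch in Python. Every number in it (50 changes a week, a 2% regression rate, 3 weeks to fix a regression) is a made-up assumption, and the function names are hypothetical. The "active regressions" estimate leans on Little's law: in a steady state, the number of open issues is roughly the arrival rate times the average time each one stays open.

```python
def regressions_per_week(changes_per_week: float, regression_rate: float) -> float:
    """Expected new user-visible regressions shipped each week."""
    return changes_per_week * regression_rate


def active_regressions(changes_per_week: float, regression_rate: float,
                       weeks_to_fix: float) -> float:
    """Steady-state count of live regressions (Little's law: arrivals x time open)."""
    return regressions_per_week(changes_per_week, regression_rate) * weeks_to_fix


def rate_to_hold_steady(baseline_rate: float, speedup: float) -> float:
    """Per-change regression rate needed so total regressions do not grow."""
    return baseline_rate / speedup


if __name__ == "__main__":
    # All inputs below are illustrative assumptions, not measurements.
    baseline = active_regressions(50, 0.02, weeks_to_fix=3)
    doubled = active_regressions(100, 0.02, weeks_to_fix=3)
    print(f"active regressions: {baseline:.1f} -> {doubled:.1f}")
    print(f"per-change rate needed at 100x: {rate_to_hold_steady(0.02, 100):.4%}")
```

If the time to fix a regression stays the same, doubling the change volume doubles the live regressions. The only ways out are a lower per-change rate or a faster fix loop.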

Implementation time is becoming decoupled from competence

In the old days[2], if you didn't know how to do something, then it took you a long time. But if you put in the reps, you'd develop expertise and get faster and faster, and eventually it became a natural part of your workflow.

This slowness was a blessing. The slowness was learning. You allowed the problem to impress itself on your brain. You weren't just reading the theory. You were actually developing the muscle memory for execution. You were learning every wrinkle, every pitfall, every exception, and you learned how to handle each of them.

Man, that went out the window, didn't it? Now you can be as clueless as you want to be. When I set up my blog recently, I spent weeks hammering the codebase into the right shape. Even though I generated it with Claude Code, I still have a good idea of how the code is organized and what each piece does. I chose to understand the project.

And when it came time to actually deploy my blog to Digital Ocean, I didn't want to understand. I didn't want to remember how to use Ansible and look up guides for hardening VPS instances. I didn't want to spend days tracking down the cause of obscure error messages. I just have one hour per day of side project time. I don't want to waste it. I told Claude everything I wanted: the Makefile command names, Ansible deployments, hardening, etc. It finished within 20 minutes. And sure, I checked over everything to make sure it wasn't leaking API tokens or anything. But I just read the generated code. I never truly allowed it to flow through my brain.

Did it do a good job? Not any worse than I would have done setting up my first VPS with Ansible in 6 years.

Did I learn anything? Absolutely not! I'm running Caddy in production now, and I have no idea how it's different from Nginx beyond automatically setting up HTTPS.

And that is my central point here. Implementation time is becoming decoupled from competence. Whether I knew how to set up a droplet or not, it would have taken about 20 minutes either way. Sure, an expert might have added bells and whistles, set up some extra monitoring, got Tailscale going, and whatever else experts do with their VPSes. But I did 3 days of 1-hour-a-day side project time in like 20 minutes. And that's bad! I shouldn't be the type of person that can set up a VPS in 20 minutes. It should actually take me a long time because I don't know what I'm doing. In some ways, it's actually dangerous that I can do this.

And that's what we're seeing throughout the industry. People can accelerate tasks outside of their own expertise. So review and expertise are becoming an increasingly important part of the job. Just because someone sent you a change no longer means that they have the competence level required to get it working. I sure didn't when I deployed my blog.

I'm not impressed anymore when somebody says that they pointed Claude at a ticket with its MCP or CLI, implemented the code, wrote tests, and pushed the result to a GitHub pull request with the GitHub CLI[3]. That's where the work starts now: actually evaluating the prompts and output for correctness, for scalability, for maintainability. For removing all of the little quirks that LLMs introduce. I'm impressed when engineers say, "I found this problem that I wouldn't have otherwise" or "it tuned this better than I could tune it myself" or "I had this insight I never would have had by myself."

Does this LLM speed boost lead to software correctness? If anything, you can now ship code faster if you're clueless because you're unconstrained by reality. You're not poring over the code looking for API keys it's leaking, looking for obvious scaling bottlenecks, looking for unnecessary bundles you're sending the client. You've merged, and you're already looking for the next ticket. And the number of active regressions in your product just ticked up a bit.

The value of implementation becomes so high that slack time will trend to zero

Slack[4] is an important concept in system resilience. It's the amount of time that an entity is unallocated. It is your tolerance to deviations from the norm. In manufacturing, slack might be the amount of time that a factory is not utilized. At zero slack (i.e. the factory has to run every hour to meet its demand), even a single hour of downtime needs to be made up. That's when the system comes under pressure. This is when accidents and errors creep in.
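The factory example can be turned into a tiny calculation. This is only a sketch under a simplifying assumption: lost hours pile up as a backlog, and the only way to clear that backlog is with hours that demand doesn't already claim. The function name and the numbers are hypothetical.

```python
import math
from typing import Optional


def days_to_recover(capacity_hours: float, demand_hours: float,
                    downtime_hours: float) -> Optional[int]:
    """Days needed to make up lost hours using only the daily slack.

    Returns None when there is no slack, because the backlog never clears.
    """
    slack = capacity_hours - demand_hours  # unallocated hours per day
    if slack <= 0:
        return None
    return math.ceil(downtime_hours / slack)


if __name__ == "__main__":
    # Illustrative assumption: a factory that can run 24 hours a day.
    print(days_to_recover(24, 24, downtime_hours=1))  # no slack at all
    print(days_to_recover(24, 20, downtime_hours=8))  # 4 hours of slack per day
```

With any slack at all, an outage is a bump that you absorb in a few days. With none, the same outage becomes a permanent debt.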

For software engineers, slack looks like unscheduled time. From the top down, unscheduled time sounds bad. This is time where engineers don't have guaranteed outcomes. But in reality, a little bit of slack can be some of the most important time that they spend. This is the time that they delete dead code, that they build that observability dashboard that everyone has been putting off, that they say, "This weird thing has been bothering me for a few weeks, I need to look into it. Oh fuck!". It's when you have a chance to say to your teammates, "Why does this part of the codebase feel wrong? What can we do about it?" and whiteboard for a week and come up with the architecture that powers you for the next 5 years.

Obviously no leader says that they want to suffer Knight Capital's fate, or that they wished they didn't have that dashboard that noticed that launch regression, or that they wished the subtle data loss bug was still live in prod, etc. But how do you make it a repeatable business outcome when it comes as the result of unallocated time? You can't.

You might say, "All of these things should obviously be part of any project." But that just misses the point of how software is made. I wrote more about it here, but the tl;dr is that in most methodologies, you set an objective like "Make widget Foo", you set the launch date along with the initial scoping of the project, and then repeatedly negotiate the scope until you have Foo on the launch date (or maybe pushed back a week or two). When your schedule starts to slip, do you know what gets descoped first? Your nice-to-have dashboards, the code cleanup tickets from the last project, etc. All the slack work goes right out the window.

I expect the industry to operate with less and less slack in the future. If we can really accelerate feature implementation, then there's some rate of implementation where it doesn't make sense to give your engineers unallocated time anymore. It just becomes too valuable to perform feature work. "So you're telling me that it used to take my team 3 months of work to figure out whether we might get +/- 2% from an engineering change? And now we can turn it around in 6 weeks?" Your quarter just got 6 weeks back, and you can bet that you are not spending it on anything but implementing more product features. Every implementation hour just got twice as valuable. That downtime between projects you used to have? Now you're spending it writing the specs for the next project.

And it leads to worse software outcomes. Without slack you can't even handle minor bumps in the road. What happens if a project suddenly needs more headcount? There's nowhere to borrow it from; everybody's already allocated. Something needs to get bumped, but you already burned a bunch of engineering time on it. You're already starting to lose the gains. Who's going to go back and delete all of the branches of the old experiments? Nobody. Who's going to spend a week whiteboarding to determine the future architecture of your company? Nobody. Who's going to investigate that weird thing before it becomes a huge problem? Nobody.

Hope for the future

I don't want to just be Doom and Gloom about the future. We can do something about this. Are we going to reduce the error rate by 100x? Probably not. But increased implementation speed applies to everything. It means that we don't have to be stuck in the old paradigm where we pair every single implementation change with a unit test and say that the test means that we verified that our software works.

I can imagine a future where an LLM with the right skill could actually verify that every single line of code has a test that fails if it regresses.

I can imagine a future where it becomes so cheap to produce integration tests that we default to integration tests over unit tests. Anyone who's worked with me knows that I hate mocking frameworks and think they lead to worse engineering outcomes, so this could become the perfect axe for me to grind.

I can imagine a world where we get so good at writing integration tests (because we exercise the muscle so much more) that they don't flake all that much.
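The first of those imagined futures, a test that fails whenever a line regresses, is close in spirit to what mutation testing checks. Here is a minimal sketch using only Python's standard `ast` module. The `is_adult` function, the two test functions, and the single `>=` to `>` mutation are all hypothetical stand-ins; a real tool would try many mutations across a whole codebase.

```python
import ast

# Hypothetical code under test.
SOURCE = """
def is_adult(age):
    return age >= 18
"""


def weak_check(ns):
    # Only exercises a value far from the boundary.
    assert ns["is_adult"](30) is True


def strong_check(ns):
    # Also pins down the boundary itself.
    assert ns["is_adult"](30) is True
    assert ns["is_adult"](18) is True
    assert ns["is_adult"](17) is False


class FlipGtE(ast.NodeTransformer):
    """Simulate a regression by rewriting every `>=` into `>`."""

    def visit_Compare(self, node):
        self.generic_visit(node)
        node.ops = [ast.Gt() if isinstance(op, ast.GtE) else op for op in node.ops]
        return node


def kills_mutant(check) -> bool:
    """Return True if `check` fails when run against the mutated code."""
    tree = FlipGtE().visit(ast.parse(SOURCE))
    ast.fix_missing_locations(tree)
    namespace = {}
    exec(compile(tree, "<mutant>", "exec"), namespace)
    try:
        check(namespace)
    except AssertionError:
        return True   # the test noticed the regression
    return False      # the regression slipped through


if __name__ == "__main__":
    print("weak check catches the mutant:", kills_mutant(weak_check))
    print("strong check catches the mutant:", kills_mutant(strong_check))
```

The weak check still passes against the broken code, which is exactly the kind of "we have a unit test, so it works" gap described above.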

And it's a bit dangerous, right? It's dangerous for me to assign LLMs these magical capabilities. Just because they can be given a prompt and produce an output doesn't mean it's correct. That doesn't mean it would help.

But the good news is that software engineering is more verifiable than it's ever been. It's never been easier to just take a change and open up one worktree or one checkout and do one technique there and then run a second implementation in parallel and compare the two outputs. What is different? Do you like one more than the other? Did one have more errors than the other?
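At the scale of a single function, "run two implementations and compare the outputs" looks like differential testing. The sketch below is a toy: the three `dedupe_*` functions are hypothetical, and in practice the two sides might be two worktrees or two builds instead of two functions in one file.

```python
import random


def dedupe_reference(items):
    """Remove duplicates while keeping first-seen order."""
    seen = set()
    result = []
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result


def dedupe_candidate(items):
    """Second implementation: dicts preserve insertion order."""
    return list(dict.fromkeys(items))


def dedupe_buggy(items):
    """Looks plausible, but silently sorts the output."""
    return sorted(set(items))


def differences(impl_a, impl_b, inputs):
    """Run both implementations on every input and collect the mismatches."""
    mismatches = []
    for case in inputs:
        out_a, out_b = impl_a(list(case)), impl_b(list(case))
        if out_a != out_b:
            mismatches.append((case, out_a, out_b))
    return mismatches


def random_inputs(count, seed=0):
    """Small random lists of small integers, seeded so runs are repeatable."""
    rng = random.Random(seed)
    return [[rng.randint(0, 5) for _ in range(rng.randint(0, 8))]
            for _ in range(count)]


if __name__ == "__main__":
    cases = random_inputs(200)
    print("candidate mismatches:", len(differences(dedupe_reference, dedupe_candidate, cases)))
    print("buggy mismatches:", len(differences(dedupe_reference, dedupe_buggy, cases)))
```

Neither side has to be "the correct one" for this to be useful. Any disagreement is a question worth asking.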

But I'm not trying to prescribe Exact Solutions. We can imagine a world beyond what software engineering was in 2023. It doesn't have to just be an implementation paired with a new set of unit tests until you retire. We can work together and share our results, share what's working.

Footnotes

[0]: They fixed it since I started writing this, but here was the repro: triple-click a single-line paragraph. Type Ctrl-C (Cmd-C on Mac). Triple click a headline. Type Ctrl-Shift-v (Cmd-Shift-v) to paste without formatting. The line is replaced with the paste content plus three newlines in a row. Even assuming that it needed to keep both the source and destination paragraph's newlines, where did the third one come from?

[1]: While I was writing this post, I read a different treatment of this subject here that views this through the lens of maintenance costs. If you feel like you'd be more swayed by a devex argument, check this out!

[2]: Before 2025.

[3]: I mean, I'm very impressed with the Claude Code and Codex teams that they made an agent where this is even possible. Holy cow. What a time to be alive.

[4]: I am absolutely not referring to the SaaS product here.

Experimenting with YouTube shorts

I made my first YouTube short, and I learned a lot about the format.

You can watch it here. I'm not going to embed it because the YouTube embedding payload is like a megabyte, which is bigger than the entire rest of my blog. So click the link and go watch it.

As usual, my level of respect for Gen Z has only increased. My lessons are below, but first... why did I do this?

Why am I making YouTube shorts?

I'm getting with the times.

I've blogged on and off in some form since college. Since I've had my kid I've done it less. But I enjoy writing, and I always want to write more.

My posts used to be really high effort. I've swung for the fences in the hopes of getting on the Hacker News or Reddit homepages. And this sometimes works, because sometimes I do end up on the homepages. But this sets the bar really high. Too high to usually justify writing. And so, I rarely write.

So I want to stop letting perfect be the enemy of good. I want to become good at rapidly producing content. And I want to meet the internet where it is. A lot of people want to read long-form posts. A lot of people want to watch long-form videos. And a lot of people want to watch shorts. And so, I want to practice working in each format, so that I can communicate to a broader audience.

What did I learn?

I found that making a short from scratch is much harder than making a full YouTube video. Every frame needs to provide value. I needed to cut over half of my video to make a 73-second video. And I was aiming for sub-60. I just couldn't pull it off with what I recorded.

The subject was pretty simple: Claude Code has had several performance regressions over the past few months, and three of them were fixed today. So I would basically interleave my "confessional" shot with stills from the tweet and blog post. When the blog post stills were up, I would explain them. When the confessional shot was up, I would explain my experience with them.

Well, I had to cut almost all of my confessionals entirely. I had to edit pauses out. I had to cut within sentences to economize. I basically had to throw all of the fluff and padding out. The next time, I need to plan from the beginning to have concise sentences.

I'm going to film a longer video (paired with a blog post) tomorrow, and I also hope to cut that up into shorts. But that means that I need to go over my shot list and script, and figure out what I want to be a short. And I need to make those sections punchy!

The reach is crazy

Within 30 minutes of publishing, the short already has 90 views. I'm sure their attention was much shallower and they are much less attached than someone who made it through my YouTube video. So I'm ultimately not sure how "good" the traffic quality is. But I also don't want to discount it entirely; would I get a lot of value from publishing these over and over again? What if I added common branding between my blog, longer-form YouTube videos, and my shorts? God, now I need to pay someone on Fiverr to make me a logo.

We haven't adapted teams to the magic wand yet

This does not reflect the opinion of my employer

TL;DR: we are DDoSing each other with code reviews, but it doesn't have to be this way.

Software engineering was solved until about last year. More or less.

Most projects got finished. We know which roles to hire and what levels they should be. We even have a default process, Scrum. If you can't lead an engineering team by yourself, just rub some Scrum on it. It won't be the fastest team in the world, but by golly they will finish the project at some point.

Projects still have lots of problems of course. But we're not bemoaning our ability to ship software like we did back in the '90s and '00s. We know we can do it. We can ship in spite of the problems around us.

Tech leads and project leads are a big reason for this success. They guarantee engineering outcomes. They collaborate with design and product. They lead the technical design. They are the primary reviewer for the project. And what happens when you make someone responsible for the technical execution of a project? They go into the important code reviews, even the ones they're not assigned. Even if they don't say a word, you know they're reading over the code looking for unhandled error cases and race conditions, making sure there aren't fatal flaws under the surface.

This "lead" is a strawman of sorts; it doesn't have to be one person. Maybe you had a cabal of 3 engineers out of 10 that guided the engineering, or you had a small stacked team and they could all truly retain context and hot swap for each other. But for the remainder of the post I will talk about "the lead" and we will all know what I mean.

The lead is the hub in a "hub-and-spoke" team model, since code reviews flow into this central point. Sometimes you'll send individual reviews to other people. But again, if someone is responsible for the technical delivery of a project, you know they're looking at what everyone's doing. You can't avoid this fanout, and it has been our secret sauce for a while. But our new magic wand is turning this fanout into an antipattern.

The magic wand

OK, well, something changed. We have a new magic wand. This magic wand vomits code at an impossible rate. And the worst part: the magic wand is pretty good! It's in a dangerous sweet spot. It can generate an entire "working" website, soup to nuts, from a single large prompt. And hopefully, you took the time to make sure that the API keys aren't on the client... and that the endpoints check auth... and it's doing something with CORS... and 2 dozen other fiddly bits necessary for launching in production... and it's not leaking debug errors with important information in 5xx responses... and before long you realize that the generated project wasn't even 50% of the way to a production system.

This magic wand is pressuring our tech and project leads. Don't believe me? Go ask one who works with a lot of agentic coders. "How has code review been feeling lately?" And you'll get a sigh and they'll tell you, "these tools are great but it's a lot to keep up with." I don't know where the performance ceiling is for these tools. But it's obvious that they will produce code faster and faster over the immediate future. This pressure will only increase.

This is creating an interesting problem. The hub will get DDoSed in all of these hub-and-spoke team models. This means that your most senior engineers will be spending a disproportionate amount of time reading and reviewing code.

This will create an even more interesting problem: a paradox! Our most senior engineers will practice less with these new tools, because they're spending their day sweating over line 351 and asking themselves "is it REALLY okay for this module to take on a dependency to the database?" because these are the kinds of questions that lead to decisions that avoid serious problems down the line. But the more junior members of the team are spending their time getting better with agentic programming. They may even start to drive how it's used at the company, while the more senior engineers begin to lack the experience to make these judgement calls themselves.

Software engineering isn't solved anymore. But we're still following the old rules and ignoring the magic wand and its impact.

What can we do about it?

Here's the disappointing part of the post: I don't know!

But you should have seen that answer coming. I told you that software engineering isn't solved anymore! How can I tell you a solution if I don't believe it's solved?

As a consolation prize, I want to highlight some tools and experiments I think are promising in the short term. The situation is evolving rapidly enough that I can't assert an expiration date on these.

Pair programming / the buddy system

When I joined Google in 2010, Google had a regimented code review system. I'm sure it still does, but I haven't worked there for a decade and I can't be bothered to ask anyone there now. Every changelist needed approval by another engineer. Between you and the reviewer, someone needed to be in OWNERS for that directory and someone needed to have "readability," a.k.a. clearance to write code in that language. And even if you had both permissions, someone still needed to explicitly approve your CL.

But there was a neat workaround. If you pair programmed a CL with a second person, you didn't need to get it reviewed by an external party, assuming that you and your pair had OWNERS and readability. This might not have been written down anywhere, but it was a logical application of the rules. One person sent out a changelist and another person approved it in the system. You also happened to be coauthors, but that wasn't forbidden at the time.

And that was a big deal at Google. Code reviews could get really bogged down. Some people just didn't review code that often, and some people just loved bogging down reviews in nitpicks that couldn't be found in any style guide. I knew a platform team that only reviewed external changes once a week, and if they left comments you needed to wait for the next week to hope to God they hit approve. A shortcut was a big deal.

But nobody pair programmed. I sure didn't. I hate pairing unless we're bug hunting or someone's getting training. It feels like a waste to burn 2x the engineering time when everything is going well.

But it's a potential solution to the hub-and-spoke problem with the magic wand. Here's what I'm imagining: a team consists of staff/junior, staff/senior, and senior/senior pairs. These aren't permanent pairings; they're just today's arrangement. Each team prompts together and looks at the output together. The pairing has enough combined seniority that the pairing can own technical decisions. They have the authority to decide that their code can be shipped.

This has an important caveat. These pairings must understand when they need outside input. They need to gossip to the other pairings if they need to highlight an architectural decision or a bad assumption. Or if the pair cannot come to an agreement on a decision they need to find tiebreakers. But ideally these are exceptions; they would be prompting together and reviewing together. By the end, both engineers agree on the technical outcome and own the decisions.

In fact, this would become part of the definition of junior, senior, or staff engineer; how much you're trusted to ask for input when you need it.

They don't have to literally sit with each other for the whole day. They just need to both be responsible for the prompting and agree with the direction of the final code that ships. They don't need to sit together when they're updating documentation or having meetings or shitposting on Slack. But at a certain point you're having a conversation about it and making sure the architecture is reasonable and the verification is correct, and ensuring that you don't need to raise any problems with the team.

How would this look on the previous example of a four IC and one project lead team? Maybe you have two senior:senior pairings, one staff:junior pairing, and then the final floating engineer is situational. Maybe they're performing individual IC work and it will be reviewed later with one of the pairings. Maybe the project lead is kinda doing two pairing assignments at once (instead of effectively the four they had previously). I don't really care; it's your team. You figure it out. But the important thing is that the project lead's workload doesn't scale with team size; the number of pairings does.
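Here is a rough way to see the scaling difference. It is a sketch under a loud assumption: every engineer produces the same number of changes per week, and each change gets read by exactly one other person. The function names and the numbers are hypothetical.

```python
def hub_and_spoke_load(team_size: int, changes_per_engineer: int) -> int:
    """Changes the lead reads per week when everyone else's work flows to them."""
    return (team_size - 1) * changes_per_engineer


def paired_load(changes_per_engineer: int, partners: int = 1) -> int:
    """Changes one person reads per week when reviews stay inside pairings."""
    return partners * changes_per_engineer


if __name__ == "__main__":
    # Illustrative assumption: 10 changes per engineer per week.
    for team_size in (5, 9, 17):
        hub = hub_and_spoke_load(team_size, 10)
        pair = paired_load(10)
        print(f"team of {team_size}: lead reads {hub}, a paired engineer reads {pair}")
    # A lead who sits in two pairings at once still reads a fixed amount.
    print("lead in two pairings reads", paired_load(10, partners=2))
```

The hub's load grows with every hire, while a paired engineer's load stays flat no matter how big the team gets.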

I haven't literally pair programmed with someone else yet in this manner. But I've worked on some two-engineer projects recently and it felt pretty good. Each of you has a default reviewer, and nobody is getting overwhelmed by N magic wands.

This has some benefits. First, it provides a concrete path to hire and train junior engineers for your organization. Even if you believe that the software engineering occupation will be decimated over and over by advances in the technology until finally one of Sundar, Sam, or Dario is holding the head of the last engineer, admit that you still need a way to teach new people how to do it. Second, it provides a role for staff engineers as a level, which obviously I appreciate as a staff engineer.

Product engineers

For a few months, I've been saying that I need to become a product manager before a product manager becomes an engineer. It turns out that that already existed, but I arrived at it independently. With the advent of increases in coding velocity, it becomes possible to start projects closer to the final implementation than ever before.

When I finally caught up on my unread backlog of The Pragmatic Engineer newsletters recently, I found an issue with the subject line "The product-minded engineer", which was an interview with the author of the book with the same name. This was a book about the need to grow your empathy with the user, and ways that technical skills and product skills can mesh together.

Why is this important? Look at the areas in LLMs that are seeing rapid development and rapid adoption. They're all dev tool related! Developers can be insanely productive nowadays, assuming they don't need to figure out what someone else needs them to build. But as soon as the topic is not "development" the process grinds to a halt.

But most companies aren't like that. If I've learned anything from working for B2B and B2C companies, it's that you can't possibly guess what people need without an obsession over qualitative and quantitative feedback. Thus, I believe that engineering is going to be more and more vital in the discovery phase of projects, where you're not even sure what to build. The ultimate software engineer will be one that can perform the product discovery work themselves. It'll be the ones that get better at producing up-front prototypes and iterating on those prototypes.

Have you ever seen a designer in a user research session, just tweaking upcoming mocks as a participant speaks to tailor it to them? Or chatting with a PM and calling an audible to tweak a major part of the mocks before the next session? Engineers who can do this kind of work will become more valuable because they will go beyond just putting up hypotheticals for reaction. They will be able to produce working systems for reaction. And sure, maybe they are only 50% prototypes and there is still a bunch of productionization work. But it's clear how adding more firepower to the earliest product iterations will only improve discovery.

I'm sure someone's gonna be like "oh no, the LLM will just be the product manager and the designer and the researcher." Really? You're going to do research for a dating app by putting Codex in front of someone and having them explore a user interview with questions like "So, puny human, is your situation more about copulation or procreation?" I don't see it.

So yeah, I think there will be a period of time where the lines between discovery and execution will blur. I've never worked for a proper startup, so it's possible I'm just making an assertion like "more and more companies will need to act like a startup" or something. But I'll let the startup people assert that for me.

I think this will help address the hub-and-spoke problem because at the start of the project, you start with a system that is already halfway there. You just need to refactor and add tests and productionize. This will reduce the scope of projects (or more accurately, move a lot of scope to the discovery phase) and reduce the during-the-project review workload.

AI code review

This one is exasperating. You have a magic wand that generates a pull request description, commit message, and code. And now you want to check if that magic wand did good work. So you wave the same magic wand -- but held differently! -- and now it's going to see why this code was such a bad idea? It sounds stupid when you say it out loud.

But at the moment, they're actually pretty good; I'd wager that they find more nitty-gritty problems than I do. They do all of the rote callsite checking that you might overlook. They catch swapped parameters of the same type by noticing name mismatches. They will notice when you try to set a dangerous or weird config value.

I mean, not RELIABLY. Half of the comments are horrible.

"Oh no, you changed this!", said the bot.

"Buddy, that's the whole point", said Jake.

But it's a good first pass. I'd be comfortable if my company adopted this rule: "You can't ask for human review until you do a pass with the bot and satisfy its comments." It removes silly errors so that the code reviewer can spend time focusing on the big picture.

I don't think this is some panacea. If an agent produced a major architecture flaw, I don't expect its corresponding reviewer to notice the flaw either. But it adds more value than noise at this point.
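As one concrete example of the rote checks mentioned above, here is a sketch of how "swapped parameters of the same type" can be spotted from name mismatches. This is not how any particular review bot works; it is a hypothetical heuristic built on Python's standard `ast` module, and the `resize` and `render_*` functions are invented for the demo.

```python
import ast

# Hypothetical code under review.
SOURCE = '''
def resize(width, height):
    ...

def render_thumbnail(width, height):
    resize(width, height)

def render_banner(width, height):
    resize(height, width)
'''


def find_swapped_args(source):
    """Flag positional arguments whose name matches a different parameter slot."""
    tree = ast.parse(source)
    # Map each function defined in this source to its positional parameter names.
    params = {
        node.name: [arg.arg for arg in node.args.args]
        for node in ast.walk(tree)
        if isinstance(node, ast.FunctionDef)
    }
    findings = []
    for node in ast.walk(tree):
        if not (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)):
            continue
        expected = params.get(node.func.id)
        if expected is None:
            continue  # callee is not defined in this source
        for position, arg in enumerate(node.args):
            if (isinstance(arg, ast.Name)
                    and position < len(expected)
                    and arg.id in expected
                    and expected.index(arg.id) != position):
                findings.append((node.lineno, node.func.id, arg.id, expected[position]))
    return findings


if __name__ == "__main__":
    for lineno, func, got, want in find_swapped_args(SOURCE):
        print(f"line {lineno}: {func}() got '{got}' where '{want}' is expected")
```

A type checker can't help here because both arguments are the same type. The names are the only signal, which is why this kind of check is easy for a human to skim past.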

To summarize

  • We used to love hub-and-spoke team structures, where reviews would fan-in to a lead engineer responsible for technical execution.
  • LLMs have increased execution velocity, putting lead engineers under additional strain.
  • We need to rethink how to scale teams without scaling the lead's workload.
  • This isn't a solved problem, but there are a few options.
    • Working in pairs / the buddy system, where the pair has enough authority and responsibility to make decisions and ship.
    • Getting engineers more involved in the discovery process.
    • Having the bots help out with code review, to remove obvious problems before a human looks at it.