Human code review is a crutch
Testing has always been better. We just couldn't afford it.
Code review is one of those practices that everyone agrees is essential, in the same way everyone agrees you should floss. It appears in every engineering hiring rubric, every process document, every “how we work” blog post from successful startups. To question it feels like questioning hygiene.
I want to question it.
Code review serves several functions: catching bugs, transferring knowledge between team members, maintaining stylistic consistency, and providing a sanity check that the approach makes sense. These are all real benefits. The problem is that bug-catching is the one everyone treats as primary, and bug-catching is precisely the function where code review is worst.
Humans are bad at this
A human reviewer looking at a diff is performing a deeply unnatural task. They are reading code out of context, usually in a web interface, trying to simulate execution in their head, while also attending to style, architecture, and whether the author has addressed the ticket requirements. They are doing this after writing their own code all morning, probably while Slack pings in the background.
The empirical evidence for code review catching bugs is surprisingly weak. A study by Microsoft Research found that the primary benefits of code review were knowledge transfer and finding alternative solutions, not defect detection. SmartBear’s analysis of 2,500 code reviews found that review effectiveness drops sharply after about 400 lines of code, and that reviewers become essentially useless after an hour of reviewing. The median defect density found in code review is around 0.5 defects per hundred lines, which sounds respectable until you set it against how many defects typically make it into code in the first place: most bugs sail straight through.
Contrast this with automated testing. A test suite runs the same way every time. It does not get tired. It does not have a meeting in ten minutes. It checks exactly what it was written to check, and it does so on every commit, not just when a human remembers to look carefully. If your test suite verifies that the APR calculation handles leap years correctly, it will catch a leap year bug with 100% reliability. A human reviewer will catch it if they happen to think about leap years, which they usually will not.
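To make the contrast concrete, here is a minimal sketch of what such a check looks like in pytest. The billing.accrued_interest helper, its signature, and its day-count convention are hypothetical, invented for illustration; the point is that once the leap-year expectation is written down, it gets re-verified on every commit whether or not anyone is thinking about leap years.

```python
# A minimal sketch: the test pins down leap-year behaviour once, then checks it
# on every run. accrued_interest is a hypothetical helper assumed to compute
# simple daily interest using the actual number of days in the year.
from datetime import date

import pytest

from billing import accrued_interest  # hypothetical module under test


def test_interest_accrual_uses_366_days_in_a_leap_year():
    # 2024 is a leap year: Jan 1 to Dec 31 spans 365 elapsed days of a 366-day year.
    interest = accrued_interest(
        principal=10_000,
        annual_rate=0.05,
        start=date(2024, 1, 1),
        end=date(2024, 12, 31),
    )
    assert interest == pytest.approx(10_000 * 0.05 * 365 / 366)


def test_february_29_is_a_valid_accrual_date():
    # A reviewer might never think of Feb 29; the suite checks it every time.
    interest = accrued_interest(
        principal=10_000,
        annual_rate=0.05,
        start=date(2024, 2, 28),
        end=date(2024, 2, 29),
    )
    assert interest == pytest.approx(10_000 * 0.05 * 1 / 366)
```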
The economics of verification
The reason code review became standard practice is not that it was the best way to catch bugs. It became standard because writing good tests was expensive.
Consider what a comprehensive test suite requires. You need to enumerate the behaviours that matter. You need to write setup code, assertions, and teardown. You need to handle edge cases, error conditions, and integration points. For a complex feature, this might take as long as writing the feature itself, sometimes longer. A codebase with 80% test coverage represents an enormous investment of engineering time, often doubling the total effort.
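To give a sense of that overhead, here is a sketch of what a single hand-written test can involve once fixtures, fakes, and teardown enter the picture. The LoanStore, RepaymentService, and FakePaymentGateway names are hypothetical, stand-ins for whatever your system actually contains.

```python
# A sketch of the overhead one hand-written test carries: setup and teardown
# via a fixture, a fake collaborator for determinism, and an explicit edge case.
import pytest

from lending import LoanStore, RepaymentService  # hypothetical module under test


class FakePaymentGateway:
    """Stand-in for the real payment provider so the test stays deterministic."""

    def __init__(self):
        self.charges = []

    def charge(self, account_id, amount):
        self.charges.append((account_id, amount))
        return {"status": "ok"}


@pytest.fixture
def repayment_service(tmp_path):
    # Setup: build the service against a throwaway data file and a fake gateway.
    store = LoanStore(tmp_path / "loans.db")
    service = RepaymentService(store=store, gateway=FakePaymentGateway())
    yield service
    # Teardown: release the store so the temp directory can be cleaned up.
    store.close()


def test_missed_payment_adds_late_fee(repayment_service):
    loan = repayment_service.create_loan(principal=1_000, term_months=12)
    repayment_service.mark_missed(loan.id, month=1)
    assert repayment_service.balance(loan.id) > 1_000  # late fee applied
```

Multiply that by every behaviour, edge case, and error path that matters, and the cost of comprehensive coverage becomes obvious.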
Code review, by comparison, is cheap. You write the code once, someone else glances at it for fifteen minutes, and you ship. The reviewer catches some fraction of the bugs, maybe 20-30% based on the literature, and you accept that the rest will be found in production. This was a rational trade-off when writing tests cost engineering hours and engineering hours were the bottleneck.
The trade-off has changed. If English is the new programming language and LLMs are probabilistic compilers, then writing tests is no longer expensive. You can specify test cases in English: “Write tests verifying that the repayment schedule handles early payoff, missed payments, and variable interest rates.” The LLM generates the test code. The bottleneck to comprehensive testing was always the typing, and the typing is now free.
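As a rough illustration, here is the kind of test code that English sentence might compile into. The RepaymentSchedule class and its methods are hypothetical, a sketch of plausible generated output rather than anything prescribed.

```python
# A sketch of generated tests for the English specification above.
# RepaymentSchedule and its methods are hypothetical, for illustration only.
from lending import RepaymentSchedule  # hypothetical module under test


def test_early_payoff_clears_remaining_balance():
    schedule = RepaymentSchedule(principal=12_000, annual_rate=0.06, term_months=24)
    schedule.pay(month=1, amount=schedule.payoff_amount(month=1))
    assert schedule.balance(month=1) == 0
    assert schedule.remaining_payments(after_month=1) == []


def test_missed_payment_rolls_into_next_month():
    schedule = RepaymentSchedule(principal=12_000, annual_rate=0.06, term_months=24)
    schedule.miss(month=1)
    assert schedule.amount_due(month=2) > schedule.amount_due(month=1)


def test_rate_change_reprices_future_installments_only():
    schedule = RepaymentSchedule(principal=12_000, annual_rate=0.06, term_months=24)
    original_due = schedule.amount_due(month=1)
    schedule.set_rate(0.08, effective_month=2)
    assert schedule.amount_due(month=1) == original_due  # past installment untouched
    assert schedule.amount_due(month=2) > original_due   # future installments repriced
```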
The new verification flow
In the emerging paradigm, the workflow looks something like this. You write a specification in English describing what the code should do. The LLM generates an implementation. You write a test specification in English describing how to verify the implementation. The LLM generates tests. You run the tests. If they pass, you ship.
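Sketched as code, with a placeholder generate function standing in for whatever model client you use, the loop is short. Nothing here is a prescribed toolchain; it is only meant to show where the human effort sits, in the two English prompts, and where it does not, in the generated files.

```python
# A minimal sketch of the spec -> generate -> verify loop described above.
# generate() is a placeholder for your model client; the file names are illustrative.
import pathlib
import subprocess


def generate(prompt: str) -> str:
    """Stand-in for a call to the LLM of your choice; returns source code as text."""
    raise NotImplementedError("wire this to your model client")


def build_and_verify(spec: str, test_spec: str) -> bool:
    src = pathlib.Path("lending.py")
    tests = pathlib.Path("test_lending.py")

    # 1. Compile the English specification into an implementation.
    src.write_text(generate(f"Implement this specification as a Python module:\n{spec}"))

    # 2. Compile the English test specification into executable checks.
    tests.write_text(generate(f"Write pytest tests for this specification:\n{test_spec}"))

    # 3. The tests, not a human reader, decide whether the artifact ships.
    result = subprocess.run(["pytest", str(tests), "-q"])
    return result.returncode == 0
```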
Where does code review fit in this flow? The generated code is an intermediate representation, like assembly output from a C compiler. You do not review assembly; you trust the compiler and verify the behaviour. If the tests pass and the tests correctly encode your specification, the code is correct by definition, regardless of how it looks.
The interesting review happens at the specification level. Does the spec make sense? Does it capture the actual requirements? Does the test spec cover the important edge cases? These are the questions worth human attention, and they operate entirely above the level of code.
The maintainability objection
The standard objection at this point is maintainability. We review code not just to catch bugs today, but to ensure that someone can understand it six months from now. Clever tricks, unclear variable names, and tangled control flow create future pain. Code review is supposed to catch these sins.
This objection assumes that a human will read the code later. In the new paradigm, why would they?
When a change is needed, you modify the specification and regenerate. You do not patch the existing code; you recompile from the updated spec. The generated code is disposable. Its maintainability is irrelevant in the same way that the maintainability of compiled assembly is irrelevant. Nobody refactors their binary.
If this sounds radical, consider that it is already how we treat many generated artifacts. Database migrations are generated from schema definitions. API clients are generated from OpenAPI specs. Infrastructure is generated from Terraform files. We do not review the generated output for maintainability; we review the source specification. Code is joining this category.
What code review is actually for
None of this means code review has no value. It means the value is not where we thought it was.
Knowledge transfer remains valuable. Having a second engineer read through a change, understanding what was built and why, creates redundancy in the team’s understanding. This becomes spec review rather than code review, but the function persists.
Catching logical errors in the specification is valuable. A reviewer might notice that the spec fails to handle a business case, or contradicts another part of the system. This is architectural review, operating at the level of intent rather than implementation.
Sanity-checking the test coverage is valuable. A reviewer might ask: “What happens if the payment processor times out? I don’t see a test for that.” This is verification review, ensuring that the test spec is comprehensive.
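That kind of question translates directly into a test. Here is a sketch of what closing the gap might look like; submit_payment, PaymentPending, and the timeout behaviour being asserted are all hypothetical, chosen to illustrate the review comment rather than any real system.

```python
# A sketch of the gap a verification review would close: the timeout path.
# submit_payment and PaymentPending are hypothetical names for illustration.
from lending import submit_payment, PaymentPending  # hypothetical module under test


class TimeoutGateway:
    """Fake gateway whose charge call always times out."""

    def charge(self, account_id, amount):
        raise TimeoutError("gateway did not respond")


def test_gateway_timeout_leaves_payment_pending_not_charged():
    result = submit_payment(gateway=TimeoutGateway(), account_id="acct-1", amount=250)
    # The business rule being pinned down: a timeout must never double-charge
    # or silently drop the payment; it parks it for retry.
    assert isinstance(result, PaymentPending)
    assert result.retry_scheduled is True
```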
What is not valuable, or at least not as valuable as we pretended, is staring at generated code trying to spot bugs. The tests do that better. They always did. We just could not afford enough of them.
The cultural shift
The difficult part of this transition is not technical; it is cultural. Code review carries moral weight in engineering organisations. Shipping without review feels reckless, even irresponsible. Engineers who skip review are cowboys. Engineers who write thorough reviews are diligent professionals.
This moral weight exists because we collectively decided that code review was the responsible way to verify software. We built a culture around it. We promoted people who were good at it. We told ourselves it was essential.
It was not essential. It’s a crutch, a second-best solution we adopted because the best solution was too expensive. The best solution was always comprehensive automated verification. We could not afford it, so we used human eyeballs instead and convinced ourselves this was rigorous.
The crutch is no longer necessary. The cost of comprehensive testing has collapsed. What remains is to update our culture to match, which may take longer than updating our tools.

