Brooks wrote about the second system effect in 1975, and every team I’ve been on since has rediscovered it, usually around the time their rewrite hits month six with no working prototype. The trap isn’t wanting a better system — it’s believing the old one was held together by accident. This post is the checklist I now force every “let’s just rewrite it” conversation through.

What the second system effect actually is

Brooks’s claim — paraphrased by everybody, including me — is that the second system a designer builds is the most over-engineered. The first system was built under constraints; it taught the designer what mattered. By the second, the designer feels liberated, knows “where the bodies are buried,” and finally gets to do it right. They overshoot. They add every feature they previously denied themselves. The result is a bloated system that’s harder to maintain and harder to replace — and it takes three times as long to ship as the first one.

What Brooks didn’t emphasize, but which 50 years of evidence have made clear: the second system effect is almost always a rewrite, not a greenfield. And rewrites almost always fail for the same six reasons.

The six warning signs

When someone on a team says “let’s just rewrite it,” I now ask these six questions in order. If the answers are wrong, the rewrite will die. Not might. Will.

1. Can you point at the exact production pain?

Not “the code is hard to work with.” Not “the tests are slow.” Not “the framework is old.” I want a specific production-visible behavior: a latency spike, a bug class that keeps recurring, a feature we’ve delayed three times because the current architecture makes it expensive.

If the answer is a developer-experience pain rather than a production pain, the rewrite is going to fail. Because the second system’s developers will, inevitably, have their own developer-experience pains that nobody has enumerated yet. You’re trading known-shape pain for unknown-shape pain.

2. Have you talked to the person who originally built it?

The original designer knows why that weird-looking thing is there. 70% of the time, the answer is a production incident from four years ago that isn’t in the commit history. If you don’t ask, you will reimplement the bug they fixed. It’s happened to me twice.

If the original designer has left: find the incident post-mortems. If those don’t exist either: scale your confidence down by half. The code you’re about to rewrite is load-bearing in ways you can’t see.

3. Is there a “walking skeleton” in the plan?

A rewrite that doesn’t ship something end-to-end in the first two weeks will ship nothing ever. The plan should have a day-14 milestone that looks like “one real request flows through the new system in production, shadowing the old one, with 0% of traffic.” If the plan opens with “first we’ll build the new data model, then the new API layer, then the new UI layer” — congratulations, you’ve written a waterfall plan in 2026 and nobody has noticed.

At Bytro, the event-driven modernization we did worked because we ran a live shadow from week two. Every request hit both the old and new system. We compared outputs nightly. The discrepancies were the spec.
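The nightly comparison can be sketched in a few lines. This is a minimal illustration, not the actual Bytro tooling: the function name and the shape of the logged responses (dicts keyed by request ID) are assumptions for the example.

```python
# Sketch of a nightly shadow-run comparison. `old_responses` and
# `new_responses` stand in for whatever each system logs for the
# shadowed traffic; the names and shapes here are hypothetical.

def diff_shadow_run(old_responses: dict, new_responses: dict) -> list:
    """Return (request_id, old, new) triples where the two systems disagree."""
    discrepancies = []
    for request_id, old in old_responses.items():
        new = new_responses.get(request_id)
        if new != old:
            discrepancies.append((request_id, old, new))
    return discrepancies

# Example: r1 matches, r2 disagrees.
old = {"r1": {"score": 10}, "r2": {"score": 7}}
new = {"r1": {"score": 10}, "r2": {"score": 8}}
print(diff_shadow_run(old, new))  # [('r2', {'score': 7}, {'score': 8})]
```

The output of a run like this is exactly what made the technique work: each discrepancy is a concrete, reproducible question about which system is right.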

4. Do you have a clear metric for “done”?

“The old system is gone” is not a metric. “99% of requests served by the new system, p99 within 10% of baseline, error rate flat or lower” is a metric. If you don’t know when you’re done, you will never be done, and the team will oscillate between two systems forever.
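A metric like that is cheap to make executable. The thresholds below mirror the ones in the sentence above; the function name and inputs are illustrative, not from any real system.

```python
# Hedged sketch of an executable "done" check. All thresholds are the
# illustrative ones from the text: 99% traffic share, p99 within 10%
# of baseline, error rate flat or lower.

def rewrite_is_done(new_traffic_share: float,
                    new_p99_ms: float,
                    baseline_p99_ms: float,
                    new_error_rate: float,
                    baseline_error_rate: float) -> bool:
    """True only when every completion criterion holds at once."""
    return (
        new_traffic_share >= 0.99                    # 99% served by the new system
        and new_p99_ms <= baseline_p99_ms * 1.10     # p99 within 10% of baseline
        and new_error_rate <= baseline_error_rate    # error rate flat or lower
    )

print(rewrite_is_done(0.995, 210.0, 200.0, 0.001, 0.002))  # True
print(rewrite_is_done(0.95, 210.0, 200.0, 0.001, 0.002))   # False: traffic share short
```

The point is less the code than the discipline: if "done" can't be written as a boolean over numbers you already collect, you don't have a metric yet.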

The real second-system-effect failure is not that the new system is bad. It’s that the org now maintains two systems while the rewrite’s “completion” is perpetually a quarter away. I’ve seen companies run in that state for three years.

5. Who owns the old system while the new one is being built?

The unglamorous answer: the same people who were going to build the new one. If your rewrite plan quietly depends on freezing the old system — “we’ll just stop adding features to it for 18 months” — your rewrite is dead before it ships.

The old system accumulates its own critical bug fixes while the new one is being built. Either you carry those fixes forward (which means building them twice) or you ship the new system missing 18 months of incremental fixes (which means it’s not actually a drop-in replacement).

6. What’s the rollback story?

The moment you turn on the new system at 1% of traffic, what happens when p99 goes to 2 seconds? Who notices? What do they do? Is there a dashboard? Is there an alarm? Is the rollback one command or a reconfigure-and-redeploy?

A rewrite without a fast rollback is a bet. Bets occasionally pay off, usually when the engineer making the bet had a career-defining reason to be right. In every other case they’re a resume-generating event for whoever has to explain the outage.
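What “one command” looks like in practice is a flag flip, not a redeploy. A minimal sketch, assuming a hypothetical in-memory flag store; real systems would back this with a flag service, but the shape of the rollback is the same:

```python
# Illustrative one-command rollback via a percentage feature flag.
# FlagStore is a stand-in for whatever flag service you run; the
# point is that rollback is a single state change, not a redeploy.

class FlagStore:
    def __init__(self):
        self._flags = {"use_new_system_pct": 1}  # 1% of traffic on the new path

    def set(self, name: str, value: int) -> None:
        self._flags[name] = value

    def get(self, name: str) -> int:
        return self._flags[name]

def route(request_id: int, flags: FlagStore) -> str:
    """Send a deterministic slice of traffic to the new system."""
    if request_id % 100 < flags.get("use_new_system_pct"):
        return "new"
    return "old"

flags = FlagStore()
print(route(0, flags))              # 'new' -- inside the 1% slice
flags.set("use_new_system_pct", 0)  # the rollback: one call, no redeploy
print(route(0, flags))              # 'old'
```

Deterministic routing (hashing the request ID rather than rolling a die) also matters: it keeps a given user on one path, which makes the shadow comparisons and the rollback both easier to reason about.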

The rewrite I actually enjoyed

The one rewrite in my career that went smoothly was the Bytro modernization. It worked because we answered those six questions before we started, and answered them mostly “yes”:

  1. Production pain: event ordering in the real-time game backend was causing player-visible desync on match joins. Quantified, user-reported, priority-1.
  2. Original designer: still on the team, actively involved in the rewrite. The bodies were visible.
  3. Walking skeleton: week two, one endpoint flowed through the new event bus. Week six, five endpoints. Week twelve, the critical path.
  4. Metric for done: 99% of matches joined via the new path with p99 within baseline. Took roughly 11 months to hit — two months past plan, which was unusually close.
  5. Ownership during transition: same squad owned both systems, and we froze non-critical changes to the old one. Product agreed to this because we showed them weekly progress on the new path.
  6. Rollback: one feature-flag flip. The flag was tested weekly — not just defined, tested. In a real on-call drill. That saved us on month eight when a corner case showed up.

Every other rewrite I’ve watched die missed 3+ of those six. I don’t know a single counterexample.

The version of this for 2026

Rewrites in 2026 have a new temptation: “AI will just rewrite it.” I have run roughly 3,400 commits of AI-assisted development through this site alone in the past year, and I’ll say the quiet part: AI does not protect you from the second system effect. The temptation to rewrite is, if anything, easier to indulge when the first draft feels free.

What AI-assisted development does let you do cheaply is spike a migration path — build a 48-hour version of the new system against one endpoint, see what hurts, throw it away. That’s a phenomenally underused technique. Most rewrites commit on the basis of a slide deck; an AI-aided spike gives you a week of real evidence at the cost of a deck.

If you take one thing from this post: don’t let the rewrite start from a slide deck. Spike it, throw the spike away, and then decide.

The thing I usually catch myself about

I wrote this whole post to argue against rewrites. I also, right now, run Fulcrum — an agent control plane I built because none of the existing runtimes fit how I work — which is, pedantically, a rewrite of something I could have built on top of existing tools.

I ran it through my own checklist. The production pain was real (no existing tool coordinated multi-agent runs the way I wanted). I talked to the designer (me; I also wrote down the history). Walking skeleton shipped in week one. I have a clear metric for done. I own both the old script-based workflow and the new thing. Rollback is “close the fulcrum CLI and go back to shell aliases.”

Six out of six. So I allowed myself the rewrite. If I’d hit three out of six, I’d have shelved it. That’s the bar. Use it, and your rewrites will ship. Ignore it, and you’ll spend 18 months explaining why the new system is “almost done.”