Fail Early, Fail Hard

[tags: software]

[date: 2014-06-17]

For more than a decade, whenever best practices in software development came up, I have recited a mantra to my coworkers:

Fail early, fail hard.

Prior Art?

"Fail early, fail hard." I claim this one as mine.

I think I made this phrase up at some point 10-15 years ago (somewhere during the dot-com bust while I was at Volera), but to me it seems so obvious that I'm surprised it's not ubiquitous.

Googling for this phrase doesn't turn up much. The nearest matches are "Fail early, fail fast", and generally refer to founding a startup. That may be sage advice for startups. I worked for [Volera](http://investing.businessweek.com/research/stocks/private/snapshot.asp?privcapId=823772) during the dot-com bust, and it died a slow death. That was a waste of time (and capital) for many people. Volera was holding out for the "whale" deals. Eventually, it's best to just call it a game, and move on to something better. So yes, for startups, fail early and fail fast.

Assert Contracts

But my meaning behind "fail early, fail hard" relates to asserting contracts in programming.

Let me phrase it another way:

When contracts are violated, it is far better to fail immediately, and fail in a way that cannot be ignored, than try to carry on.
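As a minimal sketch of the difference, in Python: the `average_lenient` and `average_strict` functions below are hypothetical, invented only for illustration. The first carries on past a violated contract; the second fails at the point of violation.

```python
def average_lenient(values):
    # Carries on: an empty list silently yields 0.0,
    # so the caller's bug stays hidden until much later.
    if not values:
        return 0.0
    return sum(values) / len(values)


def average_strict(values):
    # Contract (assumed for illustration): callers must pass a non-empty list.
    # A violation stops the program right here, with a message naming the cause.
    assert len(values) > 0, "average_strict() requires a non-empty list"
    return sum(values) / len(values)
```

With the lenient version, a bogus 0.0 flows downstream and surfaces somewhere unrelated. With the strict version, the failure points at the caller that broke the contract.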

Why?

The ability to purposefully ignore bugs until later tends to lead to poor planning, and a snowball effect near the end of the project.
For every later stage in the software development pipeline, a newly found bug is an order of magnitude more expensive to fix.
Enforced contracts flag "gray areas" which may otherwise not have been a visible bug until the product was in customers' hands.

Snowballing

Recently (circa 2014) I have been trying to help ship a product. We are almost reaching that point at which the bugs are "interesting", or in other words, the bugs are not expected. At this point, developers believe they are code complete, and the known shaky areas of the code have been pointed out to management. New bugs at this point are... well, interesting.

"Interesting" can be dangerous while trying to lock down a product. "Interesting" is another way of saying "Hm, I didn't consider that." And that potentially means another feature branch. Suddenly the product is no longer code complete.

It's like a fractal. Each new area you explore can lead to more new areas. You must cut it off early by asserting, so you quickly know what you do not know. Leaving it until later can snowball.

Late Bugs Are Expensive Bugs

I have long held that each later stage in the software development process is roughly an order of magnitude more expensive than the one before it. You want to catch bugs, of whatever variety, as quickly as possible.

The closer you catch the bug to the time it was created, the easier it will be to remember the mental state at the time, and therefore easier to fix. Your mental caches will be warmer.

When might bugs be stumbled upon?

Compile errors. I may be exposing my bias towards strong type-checking. If the compiler can catch something, you will save 10x the time compared to running the code, hopefully hitting the failure, walking through it, reconstructing the mental state you were in while writing the code, and understanding the disconnect.
Runtime assertions. This is where "fail early, fail hard" comes in.
Bugs potentially caught by unit tests. You write these, yes? They should be developed in lock-step with your assertion-ridden code. Bugs found here are still relatively cheap.
Bugs potentially caught by integration tests. You write these too, yes?
QA
System integration
Alpha test
Beta test
Early adopters
Customers
Support tiers

Each stage has more layers to go through to reach the developer who originally wrote the bug. As more time elapses, the developer forgets or moves on. Each layer in the communication chain has human latencies. Someone is on vacation. Someone quit. We don't remember. We no longer have that test setup.

It's cheaper if you can catch it earlier.

Gray Areas

Sometimes assertions are useful to document "This is how it works for me, but I am not yet sure." Assertions can be like documentation.

Don't be afraid of this. QA may test a debug build and come yelling, but QA is easier to appease than a customer on a product they have paid for and deployed.

Assertions are like that "Curve Ahead" sign. QA wants to know this. It speeds the feedback loop.
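A small sketch of such a gray-area assertion, in Python. The `earliest_deadline` function and its sortedness assumption are hypothetical, made up for illustration:

```python
def earliest_deadline(deadlines):
    # Known requirement: there must be at least one deadline.
    assert deadlines, "earliest_deadline() requires at least one deadline"
    # Gray area (assumed for illustration): every caller seen so far passes the
    # list already sorted, but nothing guarantees it yet. The assertion records
    # that assumption and flags the first caller that breaks it.
    assert deadlines == sorted(deadlines), "deadlines expected in ascending order"
    return deadlines[0]
```

The second assertion says "this is how it works for me" in a form that cannot go stale unnoticed: if the assumption turns out to be wrong, the first unsorted caller makes it visible.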

Caveats

As with everything else in life, this isn't a black-and-white issue.

Tainted Data

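This is related to the notion of "tainted" data, as supported by Perl's taint mode: data arriving from outside the program is untrusted until it has been validated. A primary example is HTTP, along with other network protocols, which commonly advise being "liberal in what you accept and conservative in what you emit".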

Perhaps this relates to looser versus tighter coupling of components. Networking protocols are loosely coupled (anything can be thrown at the server across the network). On the other hand, internally-developed and statically-linked libraries are very tightly coupled. These are under your responsibility and your control, so more aggressively asserting contracts is reasonable.

Assert for internally-controlled data, but not for possibly tainted data.
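One way to picture the split, as a Python sketch. Both `parse_port` and `connect_label` are hypothetical names, and the port range check is just an example contract:

```python
def parse_port(text):
    """Boundary code: `text` comes from outside, so bad input is expected, not a bug."""
    try:
        port = int(text)
    except ValueError:
        return None  # reject politely; the caller decides how to respond
    if not 1 <= port <= 65535:
        return None
    return port


def connect_label(host, port):
    """Internal code: callers are our own and must pass an already-validated port."""
    assert isinstance(port, int) and 1 <= port <= 65535, f"bad port: {port!r}"
    return f"{host}:{port}"
```

Garbage from the network is handled as an ordinary outcome at the boundary. Once the data is inside, an out-of-range port can only mean a bug in our own code, so it is asserted.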

Debug vs Release

You probably want separate debug and release builds. The debug builds should assert like crazy. The release builds should, yes, probably be more lenient, because there is a chance that at runtime the error would have been invisible to the user anyway.

Or perhaps not. You must understand your use case. If absolute correctness is essential, then assert even in release builds (or in those cases, perhaps you only have a debug build).
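Python happens to offer a rough analogue of this split: `assert` statements are removed when the interpreter runs with `-O`, while an explicit check stays on. A sketch, where `ContractViolation`, `require`, and `set_throttle` are all hypothetical names:

```python
class ContractViolation(Exception):
    """Hypothetical exception type for checks that stay on in every build."""


def require(condition, message):
    # Always enforced, even under `python -O`.
    if not condition:
        raise ContractViolation(message)


def set_throttle(percent):
    # Debug-only check: stripped when Python runs with -O.
    assert isinstance(percent, (int, float)), "throttle must be numeric"
    # Always-on check: correctness here is assumed to be essential.
    require(0 <= percent <= 100, f"throttle out of range: {percent}")
    return percent / 100
```

Which checks belong in which category is exactly the judgment call described above; the code only shows the mechanism.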

The first Ariane 5 rocket self-destructed, destroying equipment commonly valued at around $500 million, due to an uncaught exception. This is a very interesting thought experiment: What should it have done? The failing code in this case was not critical, so clearly should not have triggered self-destruction. But what if it had been in critical code? If the guidance system is operating "out of specification", should the rocket be destroyed? At this moment, I have no idea. If this is a build of the software operating in the lab, then I absolutely want to know the moment something is operating out of specification. But for a real live rocket, this is a difficult call. Is this a rounding error that actually won't matter, or an error that is steering the rocket towards a population center? As soon as you are operating out of specification, even trying to distinguish between these cases is a fool's errand.

(This touches on my philosophical disagreement with exceptions: Precisely because the exception handling is potentially far-removed from the point [and context] of the error, it is difficult to know the best way to handle it. Exception-laden code often devolves to the "just try again" style. This doesn't always work well for embedded systems, where there is no user to click "Retry" ad nauseam.)

I think it's clear that debug builds should "fail early, fail hard", but release builds are a more difficult question.

Mocking

Agile development practices say we should mock components. I agree. Full stop.

The only gotchas here are in the corner cases. If the user of a mocked object asserts every last corner case, your test code will be fragile. I think this just turns into a balance of sufficiently specifying the contracts of your classes (_including_ what they _do not_ guarantee) versus what you assert.
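One possible reading of that balance, sketched with Python's standard `unittest.mock`. The `greet` function and the `get_name` method are hypothetical; the assumed contract is only that `get_name(user_id)` returns a string.

```python
from unittest.mock import Mock


def greet(user_store, user_id):
    # Relies only on the assumed contract: get_name(user_id) returns a string.
    return "Hello, " + user_store.get_name(user_id)


store = Mock()
store.get_name.return_value = "Ada"

result = greet(store, 42)

# Robust: checks the result and the one call the contract requires.
assert result == "Hello, Ada"
store.get_name.assert_called_once_with(42)

# Fragile (left as a comment): pinning down the complete list of interactions,
# e.g. comparing store.mock_calls against an exact expected list, would break
# as soon as greet() made an extra, harmless call the contract never forbade.
```

The test asserts what the contract promises and stays silent on what it does not.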