
Everything will ultimately fail

Hardware is fallible, so we add redundancy. This allows us to survive individual hardware failures, but increases the likelihood of having at least one failure present at any given time.

Software is fallible. Our applications are made of software, so they’re vulnerable to failures. We add monitoring to tell us when the applications fail, but that monitoring is made of more software, so it too is fallible.

Humans make mistakes; we are fallible, too. So we automate actions, diagnostics, and processes. Automation removes the chance for an error of commission, but increases the chance of an error of omission. No automated system can respond to the same range of situations that a human can.

Therefore, we add monitoring to the automation. More software, more opportunities for failures.

Networks are built out of hardware, software, and very long wires. Therefore, networks are fallible. Even when they work, they are unpredictable because the state space of a large network is, for all practical purposes, infinite. Individual components may act deterministically, but still produce essentially chaotic behavior.

Every safety mechanism we employ to mitigate one kind of failure adds new failure modes. We add clustering software to move applications from a failed server to a healthy one, but now we risk “split-brain syndrome” if the cluster’s network acts up.

It’s worth remembering that the Three Mile Island accident was largely caused by a pressure relief valve, a safety mechanism meant to prevent certain types of overpressure failures.

So, faced with the certainty of failure in our systems, what can we do about it?

Accept that, no matter what, your system will have a variety of failure modes. Deny that inevitability, and you lose your power to control and contain them. Once you accept that failures will happen, you have the ability to design your system’s reaction to specific failures. Just as auto engineers create crumple zones—areas designed to protect passengers by failing first—you can create safe failure modes that contain the damage and protect the rest of the system.

If you do not design your failure modes, then you will get whatever unpredictable—and usually dangerous—ones happen to emerge.


Software architecture has ethical consequences

The ethical dimension of software is obvious when we are talking about civil rights, identity theft, or malicious software, but it also arises in less exotic circumstances. If a program is successful, it affects the lives of thousands or millions of people. That impact can be positive or negative: the program can make their lives better or worse, even if only in small ways.

Every time I make a decision about how a program behaves, I am really deciding what my users can and cannot do, in ways more inflexible than law. There is no appeals court for required fields or mandatory workflow.

Another way to think about it is in terms of multipliers. Think back to the last major Internet worm, or when a big geek movie came out. No doubt, you heard or read a story about how much productivity this thing would cost the country. You can always find some analyst with an estimate of outrageous damages, blamed on anything that takes people away from their desks. The real moral of this story isn’t about innumeracy in the press, or self-aggrandizing accountants. It’s about multipliers, and the effect they can have.

Suppose you have a decision to make about a particular feature. You can do it the easy way in about a day, or the hard way in about a week. Which way should you do it? Suppose that the easy way makes four new fields required, whereas doing it the hard way makes the program smart enough to handle incomplete data. Which way should you do it?

Required fields seem innocuous, but they are always an imposition of your will on users. They force users to gather more information before starting their jobs. This often means they have to keep their data on Post-It notes until they’ve got everything together at the same time, resulting in lost data, delays, and general frustration.
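To make the choice concrete, here is a minimal sketch in Python of the two approaches. The record and field names are hypothetical; the point is only that the "hard way" accepts whatever the user has now and follows up later, instead of blocking them at the door.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Customer:
    # Hypothetical record; the field names are illustrative only.
    name: str
    phone: Optional[str] = None
    email: Optional[str] = None
    tax_id: Optional[str] = None

LATER_FIELDS = ("phone", "email", "tax_id")

def save_easy_way(c: Customer) -> None:
    # The easy way: make the fields required and push the burden onto the user.
    missing = [f for f in LATER_FIELDS if getattr(c, f) is None]
    if missing:
        raise ValueError(f"cannot save, missing required fields: {missing}")
    persist(c)

def save_hard_way(c: Customer) -> None:
    # The hard way: accept a partial record now, and remind the user to
    # complete it once they actually have the information.
    persist(c)
    if any(getattr(c, f) is None for f in LATER_FIELDS):
        flag_for_follow_up(c)

def persist(c: Customer) -> None:
    ...  # write the record to storage

def flag_for_follow_up(c: Customer) -> None:
    ...  # queue a reminder so the record gets completed later
```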

As an analogy, suppose I’m putting a sign up on my building. Is it OK to mount the sign just six feet up on the wall, forcing pedestrians to duck or go around it? It’s easier for me to hang the sign if I don’t need a ladder and scaffold, and the sign wouldn’t even block the sidewalk. I get to save an hour installing the sign, at the expense of taking two seconds away from every pedestrian passing my store. Over the long run, all of those two second diversions are going to add up to many, many times more than the hour that I saved.
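The arithmetic behind that claim is easy to sketch; the foot-traffic figure below is purely an assumption for illustration:

```python
# Break-even arithmetic for the low-hanging sign (illustrative numbers).
hour_saved_s = 60 * 60            # the hour I save by skipping the ladder, in seconds
cost_per_pedestrian_s = 2         # roughly two seconds lost per passer-by
pedestrians_per_day = 500         # assumed foot traffic past the store

break_even_people = hour_saved_s / cost_per_pedestrian_s       # 1,800 people
days_to_break_even = break_even_people / pedestrians_per_day   # 3.6 days

print(f"The saved hour is spent after {break_even_people:.0f} passers-by, "
      f"about {days_to_break_even:.1f} days; everything after that is pure loss for them.")
```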

It’s not ethical to worsen the lives of others, even a small bit, just to make things easy for yourself. Successful software affects millions of people. Every decision you make imposes your will on your users. Always be mindful of the impact your decisions have on those people. You should be willing to bear large burdens to ease theirs.
