Don’t Control, but Observe

Today’s systems are distributed and loosely coupled. Building loosely coupled systems is a bit of a drag, so why do we bother? Because we want our systems to be flexible, so they do not break apart at the slightest change. This is a critical property in today’s environments, where we may control only a small portion of our application, the remainder living in distributed services or third-party packages controlled by other departments or external vendors.

So it looks like the effort to build a system that is flexible and can evolve over time is a good idea. But that also means our system will change over time. As in “today’s system is not what it was yesterday.” Unfortunately, this makes documenting the system challenging. It’s commonly known that documentation is out of date the moment it is printed, but in a system that changes all the time, things can only be worse. Moreover, building a system that is flexible generally means the architecture is more complex, and it’s more difficult to get the proverbial “big picture.” For example, if all system components communicate with each other over logical, configurable channels, you had better look at the channel configuration to have any idea what is going on. Sending messages into the logical la-la-land is unlikely to trigger a compiler error, but it is sure to disappoint the user whose action was encapsulated in that message.

Being a control-freak architect is so yesteryear, leading to tightly coupled and brittle solutions. But letting the software run wild is sure to spawn chaos. You have to compensate for the lack of control with other mechanisms, or you end up flying on instruments without the instruments. But what kind of instruments do we have? Plenty, actually. Today’s programming languages support reflection, and almost all run-time platforms provide run-time metrics. As your system becomes more configurable, the current system configuration is another great source of information. Because so much raw data is difficult to understand, extract a model from it. For example, once you figure out which components send messages to which logical channels, and which components listen to these channels, you can create a graph model of the actual communication between components. You can do this every few minutes or hours, providing an accurate and up-to-date image of the system as it evolves. Think of it as “Reverse MDA” (Model-Driven Architecture). Instead of a model driving the architecture, you build a flexible architecture, and extract the model from the actual system state.
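As a rough illustration of extracting such a graph model, here is a minimal Python sketch. It assumes the channel configuration can be read as a list of (component, action, channel) entries; the component names, channel names, and the `extract_graph` helper are all hypothetical and stand in for whatever your platform actually exposes.

```python
from collections import defaultdict

# Hypothetical snapshot of the current channel configuration:
# (component, action, logical channel). All names are made up.
CONFIG = [
    ("OrderEntry", "publish", "orders.new"),
    ("Billing", "subscribe", "orders.new"),
    ("Shipping", "subscribe", "orders.new"),
    ("Billing", "publish", "invoices.created"),
    ("Accounting", "subscribe", "invoices.created"),
]


def extract_graph(config):
    """Derive sender -> receiver edges from publish/subscribe entries."""
    publishers = defaultdict(set)
    subscribers = defaultdict(set)
    for component, action, channel in config:
        if action == "publish":
            publishers[channel].add(component)
        elif action == "subscribe":
            subscribers[channel].add(component)
        else:
            raise ValueError(f"unknown action: {action!r}")

    # One edge per (sender, channel, receiver) combination.
    edges = set()
    for channel, senders in publishers.items():
        for sender in senders:
            for receiver in subscribers.get(channel, ()):
                edges.add((sender, channel, receiver))
    return sorted(edges)


if __name__ == "__main__":
    for sender, channel, receiver in extract_graph(CONFIG):
        print(f"{sender} --[{channel}]--> {receiver}")
```

Run on a schedule against the live configuration, a function like this would yield a fresh edge list each time, which can then be fed into whatever visualization or validation you prefer.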

In many cases, it’s easy to visualize this model, creating the literal big picture. However, resist the temptation to plot the 3 by 5 meter billboard of boxes and lines, which contains every class in your system. That picture may pass as contemporary art, but it’s not a useful software model. Instead, use a 1000 ft view as described by Erik Doernenburg, a level of abstraction that actually tells you something. On top of that, you can make sure your model passes basic validation rules, such as the absence of circular dependencies in a dependency graph, or no messages being sent to a logical channel no one listens to.
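The two validation rules mentioned above can be sketched in a few lines of Python. This is only an illustration under assumed data shapes: a dependency graph given as a dict of node to dependencies, and publishers and subscribers given as dicts of channel to component sets. The function names `find_cycle` and `find_dead_channels` are hypothetical.

```python
def find_cycle(dependencies):
    """Return one dependency cycle as a list of nodes, or None if acyclic.

    Uses a depth-first search that tracks the nodes on the current path.
    """
    IN_PROGRESS, DONE = 1, 2
    state = {}
    path = []

    def visit(node):
        state[node] = IN_PROGRESS
        path.append(node)
        for nxt in dependencies.get(node, ()):
            if state.get(nxt) == IN_PROGRESS:
                # nxt is already on the current path: we closed a loop.
                return path[path.index(nxt):] + [nxt]
            if nxt not in state:
                cycle = visit(nxt)
                if cycle:
                    return cycle
        path.pop()
        state[node] = DONE
        return None

    for node in list(dependencies):
        if node not in state:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


def find_dead_channels(publishers, subscribers):
    """Return channels that receive messages but have no listener."""
    return sorted(ch for ch in publishers if not subscribers.get(ch))


if __name__ == "__main__":
    deps = {"ui": ["orders"], "orders": ["billing"], "billing": ["orders"]}
    print(find_cycle(deps))
    print(find_dead_channels({"audit.log": {"Billing"}}, {}))
```

Checks like these are cheap enough to run every time the model is extracted, so a broken rule shows up shortly after the change that caused it.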

Letting go of control is a scary thing, even when it comes to system architecture. But supplemented by observation, model extraction, and validation, it is probably the only way to architect for the 21st century.


Welcome to the Real World

Engineers like precision, especially software engineers who live in the realm of ones and zeros. They are used to working with binary decisions: one or zero, true or false, yes or no. Everything is clear and consistent, guaranteed by foreign key constraints, atomic transactions, and checksums.

Unfortunately, the real world is not quite that binary. Customers place orders, just to cancel them a moment later. Checks bounce, letters are lost, payments are delayed, and promises are broken. Data entry errors are bound to happen every so often. Users prefer “shallow” user interfaces that give them access to many functions at once over being boxed into a lengthy, one-dimensional “process,” even though the latter is easier to program and seems more “logical” to many developers. Instead of the call stack controlling the program flow, the user is in charge.

Worse yet, widely distributed systems introduce a whole new set of inconsistencies into the game. Services may be unreachable, may change without prior notice, or may not provide transactional guarantees. When you run applications on thousands of machines, failure is no longer a question of “if”, but of “when”. These systems are loosely coupled, asynchronous, concurrent, and do not adhere to traditional transaction semantics. You should have taken the blue pill!

As the computer scientists’ brave new world is crumbling, what are we to do? As so often, awareness is an important first step towards a solution. Say goodbye to the good old predictable call-stack architecture, where you get to define what happens when and in what order. Instead, be ready to respond to events at any time in any order, regaining your context as needed. Make asynchronous requests concurrently instead of calling methods one by one. Avoid complete chaos by modeling your application using event-driven process chains or state models. Reconcile errors through compensation, retry, or tentative operations.
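To make the idea of a state model with compensation a little more concrete, here is a minimal Python sketch. It is an illustration only: the `Order` class, the event names, and the transition table are hypothetical, and a real system would persist state and parked events rather than keep them in memory. Events that arrive too early are parked and replayed after the next successful transition, and cancelling a paid order records a compensating refund instead of pretending the payment never happened.

```python
from enum import Enum


class State(Enum):
    NEW = "new"
    PAID = "paid"
    SHIPPED = "shipped"
    CANCELLED = "cancelled"


# Allowed transitions: (current state, event) -> next state.
TRANSITIONS = {
    (State.NEW, "payment_received"): State.PAID,
    (State.NEW, "cancelled"): State.CANCELLED,
    (State.PAID, "shipped"): State.SHIPPED,
    (State.PAID, "cancelled"): State.CANCELLED,
}


class Order:
    def __init__(self):
        self.state = State.NEW
        self.pending = []        # events that arrived before we could use them
        self.compensations = []  # corrective actions to run, e.g. refunds

    def handle(self, event):
        """Apply an event if the current state allows it, else park it."""
        target = TRANSITIONS.get((self.state, event))
        if target is None:
            self.pending.append(event)
            return
        if self.state is State.PAID and target is State.CANCELLED:
            # Compensation: money was already taken, so schedule a refund.
            self.compensations.append("refund_payment")
        self.state = target
        # Retry parked events; one of them may be applicable now.
        parked, self.pending = self.pending, []
        for earlier_event in parked:
            self.handle(earlier_event)


if __name__ == "__main__":
    order = Order()
    order.handle("shipped")           # arrives too early, gets parked
    order.handle("payment_received")  # applies, then replays "shipped"
    print(order.state)
```

The point of the explicit table is that the order in which events arrive no longer dictates the outcome; the state model does.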

Sounds scary and more than you bargained for? Luckily, the real world has had to deal with the same issues for a long time: delayed letters, broken promises, messages crossing in transit, payments posted to the wrong account; the examples are countless. Accordingly, people have had to resend letters, write off bad orders, or tell you to ignore the payment reminder in case you already sent a payment. So don’t just blame the real world for your headaches, but also use it as a place to look for solutions. After all, Starbucks does not two-phase commit either [1]. Welcome to the real world.
