On the Joys of Reasoning about Systems above the Code Layer
On weekdays, I read portions of the technical books I have chosen for the month. On the weekends, I synthesize and consolidate what I have read into articles and small projects (when possible) to ensure understanding. The synthesis articles also serve as a form of spaced repetition.
This month, I read Statistics Every Programmer Needs by Gary Sutton and Think Distributed Systems by Dominik Tornow.
This is the full synthesis of Chapter 1 of Think Distributed Systems. The diagrams were drawn by me but adapted from the book, so credits go to Tornow.
Of Leslie Lamport’s numerous contributions to the tools and models used in the field of systems, I consider his emphasis on the importance of reasoning about systems above the code layer to be among the most important. In his famous talk, Thinking Above the Code, he emphasizes the importance of thinking about systems using abstractions and representing those abstractions in writing–i.e., specifications. With distributed systems, as with any other field, it is important to approach topics using models that are both correct and complete—even more so as numerous models can exist to represent one idea.
To be as precise as possible, we will start with some definitions:
- distributed system: a set of collaborating, concurrent components that communicate by sending and receiving messages over a network.
- model: a representation of a target system; a set of facts that constitutes a ground truth.
- mental model: an internal representation of a target system that serves as the basis of comprehension and communication; a set of facts that constitutes a perceived truth.
A mental model is good if it is:
- correct: meaning that every fact of the model is a fact of the system. That is, the model is an accurate representation of the system.
- complete: meaning that every relevant fact of the system is a fact of the model.
Note: ‘relevant’ is relative and application-specific.
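As a quick illustration of these two definitions (the facts below are invented for the example, not from the book), we can express correctness and completeness with sets:

```python
# Facts are modeled as sets of statements. Which facts are "relevant" is
# relative and application-specific; all facts below are invented examples.
system_facts = {"replicated", "leader-based", "written in Go"}
relevant_facts = {"replicated", "leader-based"}   # relevant for our purposes
model_facts = {"replicated", "leader-based"}

correct = model_facts <= system_facts     # every fact of the model is a fact of the system
complete = relevant_facts <= model_facts  # every relevant fact of the system is in the model
print(correct, complete)                  # True True
```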
It is quite important to understand the model used to reason about a system. I think you will find, as I have in reading this book, that reasoning about systems in terms of correctness and completeness is both intuitive and absolutely beautiful.
As mentioned earlier, a distributed system is a collection of concurrent components communicating via a network. The overall (global) state of the system is a combination of the states of its components, but each component only has access to its local state. The network, which could be considered an independent participating component, also has access to its own local state–which includes the messages sent over it.

One accurate way to think of a system is in terms of states and steps–i.e., as a state machine: the behavior of the system is represented as a sequence of states, with each step transitioning the system from one state to the next. Steps are taken either by a component or by the network, and can be external (e.g., sending or receiving a message) or internal (e.g., updating local state).
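Here is a minimal sketch of this state-machine view (the component names and step functions are illustrative, not from the book):

```python
# A behavior is a sequence of states, and each step maps one state to the next.
# A state is modeled as a dict holding each component's local state plus the
# network's in-flight messages. All names here are illustrative.

def send(state, msg):
    """External step: put a message on the network."""
    return {**state, "network": state["network"] + [msg]}

def receive(state, component):
    """External step: a component consumes the oldest in-flight message."""
    msg, *rest = state["network"]
    return {**state, "network": rest, component: state[component] + [msg]}

# One possible behavior of the system: an initial state and two steps.
s0 = {"a": [], "b": [], "network": []}
s1 = send(s0, "hello")       # step taken by component a (external: send)
s2 = receive(s1, "b")        # step taken by component b (external: receive)
behavior = [s0, s1, s2]      # the behavior as a sequence of states
```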

The fitness of a system can be assessed in terms of its:
- correctness
- scalability
- reliability
Correctness
In Proving the Correctness of Multiprocess Programs, Lamport defines correctness in terms of:
- safety: the guarantee that something bad will never happen (e.g., no two processes ever hold the same lock at the same time)
- liveness: the guarantee that something good will eventually happen (e.g., every request eventually receives a response)
As formally proven by Alpern and Schneider in Defining Liveness, every property of a system can be expressed as the intersection of a safety property and a liveness property.
Thus, a system, distributed or otherwise, is correct if every possible behavior [of its subsystems] is correct–i.e. every possible behavior meets the system’s safety and liveness guarantees.
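As a toy illustration (not from the book): if we represent a behavior as a finite sequence of states, a safety property must hold in every state, while a liveness property must hold in some state. (Strictly, liveness is defined over infinite behaviors; checking a finite prefix is a simplification.)

```python
# A behavior is a finite sequence of states (see the state-machine view above).
# Safety: "something bad never happens" -- no state in the behavior is bad.
# Liveness: "something good eventually happens" -- some state is good.

def satisfies_safety(behavior, bad):
    return not any(bad(state) for state in behavior)

def satisfies_liveness(behavior, good):
    return any(good(state) for state in behavior)

# Illustrative system: states record which processes currently hold a lock.
behavior = [{"holders": set()}, {"holders": {"p"}}, {"holders": set()}]

bad = lambda s: len(s["holders"]) > 1   # bad: two processes hold the lock at once
good = lambda s: "p" in s["holders"]    # good: p eventually acquires the lock

print(satisfies_safety(behavior, bad))     # True: never more than one holder
print(satisfies_liveness(behavior, good))  # True: p held the lock at some point
```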
Scalability and Reliability
Properties such as scalability and reliability can be viewed via a different model: responsiveness, which is formally defined as the system’s ability to meet its Service Level Objectives (SLOs). Thus, we define scalability as responsiveness under load and reliability as responsiveness under failure.
Responsiveness
Formally, responsiveness can be defined using four concepts:
- Service Level Indicator (SLI): a quantitative observation of a system’s behavior
- Service Level Objective (SLO): a predicate (a function that yields true or false) on a service level indicator that determines whether the behavior of a system meets an objective
- error rate: the ratio of the number of observations that do not meet their objectives to the total number of observations for a given time period
- error budget: the maximum acceptable error rate (an upper limit) for a given time period
We can then define responsiveness as the guarantee that the error rate of a system will not exceed its error budget for a given time period.
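To make these four concepts concrete, here is a minimal sketch (the latency SLO and the numbers are illustrative assumptions, not from the book):

```python
# SLI: a quantitative observation of behavior -- here, request latency in ms.
latencies_ms = [87, 102, 340, 95, 110, 99, 512, 101]  # illustrative data

# SLO: a predicate on the SLI. Assumed objective: respond within 300 ms.
def meets_slo(latency_ms):
    return latency_ms <= 300

# Error rate: failing observations / total observations for the period.
failures = sum(1 for l in latencies_ms if not meets_slo(l))
error_rate = failures / len(latencies_ms)    # 2 / 8 = 0.25

# Error budget: the maximum acceptable error rate for the period.
ERROR_BUDGET = 0.10

# Responsiveness: the error rate must not exceed the error budget.
responsive = error_rate <= ERROR_BUDGET      # False: 0.25 > 0.10
print(f"error rate {error_rate:.2%}, responsive: {responsive}")
```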
Properties such as correctness, scalability, and reliability are called emergent properties because they arise from the interactions of multiple components, not from any single component. For example, no single server can ensure that a system is always available or can handle unbounded load. Instead, a property like availability emerges as components coordinate and interact, e.g., through redundancy, load balancing, and failover mechanisms.
System of Systems & Global vs Local
Q: How should we approach the idea that a group of concurrent, communicating components exhibits a specific behavior, and that, when viewed from a higher level of abstraction, the group behaves as a single component?
Although a distributed system is composed of sub-systems, it is important to maintain transparency: the illusion that it is a single system. To do so, we have to track the system state. Recall that the global state of a system is the composition of its local states. From a global POV, we can observe the entire system–i.e., we can determine the system state. From a local POV, a component can only view itself–i.e., it can only determine its own state. Therefore, to determine the global system state, a component must enlist the cooperation of the other components, which must send their local states over the network.
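As a minimal sketch of this last point (component names and the gather helper are illustrative assumptions):

```python
# Each component can read only its own local state. To determine the global
# state, a component must ask the others to report their local states via
# the network. All names here are illustrative.

local_states = {"a": {"x": 1}, "b": {"y": 2}, "c": {"z": 3}}

def request_state(component):
    """Stand-in for a network round trip: ask a component for its state."""
    return local_states[component]

def gather_global_state(me, others):
    """Component `me` assembles a global view from reported local states."""
    view = {me: local_states[me]}           # my own state: directly readable
    for other in others:
        view[other] = request_state(other)  # everyone else's: via the network
    return view

print(gather_global_state("a", ["b", "c"]))
# {'a': {'x': 1}, 'b': {'y': 2}, 'c': {'z': 3}}
```

Note that in a real system the reported local states may already be stale by the time they arrive, which is one reason determining a consistent global state is hard.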
Considering the ‘whole’ and the ‘parts’ when dealing with distributed systems is significant because it emphasizes the need to understand both the individual components and their interactions to grasp the overall system behavior. The core challenge of distributed systems, then, is thinking globally while acting locally. That is, the challenge of distributed systems is to design a global algorithm where each component of the system implements a local algorithm. It is difficult to ensure global correctness with only limited knowledge.

Distributed Systems Incorporated
Tornow uses a very intuitive model to illustrate the core ideas of distributed systems: Distributed Systems Incorporated.

Based on this illustration, we see:
- components as rooms
- the network as pneumatic tubes
- the external interface as a mailbox
This model allows us to capture a wide range of concerns such as:
crash semantics: what happens when employees are unable to work? This represents various failure modes in a distributed system:
- short absence: transient failure
- extended absence: intermittent failure
- permanent departure: permanent failure
message delivery semantics: what happens if the mailroom attendant loses, duplicates, or rearranges messages? This represents various message delivery semantics in a distributed system:
- lost messages: messages not delivered
- duplicate messages: messages delivered more than once
- out-of-order messages: messages delivered in a different order than sent
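A minimal sketch of these delivery semantics (the mailroom here is a toy simulation; the function names and probabilities are illustrative):

```python
import random

def mailroom(messages, seed=0):
    """Toy network attendant: may lose, duplicate, or reorder messages."""
    rng = random.Random(seed)
    delivered = []
    for msg in messages:
        if rng.random() < 0.1:        # lost: the message is never delivered
            continue
        delivered.append(msg)
        if rng.random() < 0.1:        # duplicated: delivered more than once
            delivered.append(msg)
    rng.shuffle(delivered)            # out-of-order: delivery order != send order
    return delivered

sent = ["m1", "m2", "m3", "m4"]
print(mailroom(sent))  # which messages are lost or duplicated depends on the seed
```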
Thinking Above The Code
Most texts explain race conditions using the classic ‘counter’ example, in which multiple threads attempt to modify a single shared variable. This example, while easy to grasp and useful for its purposes, can fail to form a mental model that allows for a more comprehensive, intuitive, and accurate understanding of race conditions. This is why I quite appreciated Tornow’s mental model for reasoning about what a race condition is.
Constructing Our Own Mental Model
We define a process as a sequence of atomic actions. Process P is a sequence of atomic actions a, b, c and process Q is a sequence of atomic actions x, y, z. A process executes by performing its first action and then continuing with the remaining actions in order.
The concurrent composition of two processes, (P | Q), is the set of all possible interleavings of the actions from P and Q. We can thus define a race condition as the situation where a subset of (P | Q) is considered incorrect. Most importantly, we are the arbiter of correctness.
This model helps us reason better about race conditions by focusing on the fundamental concept: concurrency leads to multiple possible interleavings of actions from different processes P and Q. A race condition exists when some of these interleavings are incorrect according to our definition of correctness.
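Here is a minimal sketch of this model, a direct translation of the definitions above into Python (the incorrectness predicate at the end is an arbitrary illustration):

```python
def interleavings(p, q):
    """The concurrent composition (P | Q): every merge of p and q that
    preserves the internal order of each sequence."""
    if not p:
        return [q]
    if not q:
        return [p]
    return [[p[0]] + rest for rest in interleavings(p[1:], q)] + \
           [[q[0]] + rest for rest in interleavings(p, q[1:])]

P = ["a", "b", "c"]
Q = ["x", "y", "z"]
behaviors = interleavings(P, Q)
print(len(behaviors))  # C(6, 3) = 20 possible interleavings

# A race condition: some subset of (P | Q) is incorrect -- and we are the
# arbiter of correctness. An (arbitrary) illustrative predicate:
incorrect = [b for b in behaviors if b[0] == "x"]
```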
Applying this Definition of Race Conditions to the Shared Variable Illustration
To apply our mental model to the shared-variable illustration often used for race conditions, we treat each thread or process as a sequence of atomic actions (reads and writes). The concurrent composition (P | Q) is then the set of all possible interleavings of these actions.
For example, suppose two threads each attempt to increment a shared variable. Each thread consists of a read followed by a write. The possible interleavings of these reads and writes lead to different outcomes:
- interleavings that yield the correct final value (e.g., both increments are applied)
- interleavings that are incorrect (e.g., both threads read the same initial value, increment it, and write back the same result, so only one increment is reflected)
We then define correctness by specifying which final states (or sequences of actions) are acceptable. An interleaving is incorrect if it leads to a state that violates our expectations (e.g., the value is incremented once instead of twice).
So, in this model:
- correct interleavings: all increments are reflected; no updates are lost
- incorrect interleavings: some increments are lost due to overlapping reads and writes
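Here is a minimal simulation of this example under the interleaving model (a toy sketch: each thread is reduced to one read action and one write action, and all names are illustrative):

```python
def interleavings(p, q):
    """All order-preserving merges of p and q, as defined above."""
    if not p:
        return [q]
    if not q:
        return [p]
    return [[p[0]] + rest for rest in interleavings(p[1:], q)] + \
           [[q[0]] + rest for rest in interleavings(p, q[1:])]

def run(schedule):
    """Execute one interleaving of two increment threads on a shared counter."""
    shared, local = 0, {}
    for action, thread in schedule:
        if action == "read":
            local[thread] = shared        # read the shared variable
        else:
            shared = local[thread] + 1    # write back the read value plus one
    return shared

P = [("read", "P"), ("write", "P")]   # thread P: read, then write
Q = [("read", "Q"), ("write", "Q")]   # thread Q: read, then write

results = [run(s) for s in interleavings(P, Q)]
correct = [r for r in results if r == 2]    # both increments are reflected
incorrect = [r for r in results if r != 2]  # a lost update
print(len(correct), len(incorrect))         # 2 correct, 4 incorrect
```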
Thus, we can reason about race conditions by focusing on the set of possible interleavings and explicitly defining which are correct or incorrect, rather than just looking at code snippets. Beautiful, isn’t it?
Conclusion
We explored, according to our chosen mental model, a more precise definition of distributed systems and then introduced models for reasoning about the fitness of a system. For example, correctness can be modeled using safety and liveness, while scalability and reliability can be modeled as responsiveness under load and under failure, respectively.
We explored a distributed system as a system of systems and emphasized the importance of thinking globally while acting locally, after which we used Distributed Systems Incorporated as a model for distributed systems and the problems that arise with them.
Finally, we reasoned about race conditions abstracted from code and showed how doing so allows us to develop a comprehensive understanding of what race conditions really are.
In essence, the model with which we reason about a topic matters a great deal, and with the abundance of models for reasoning about distributed systems, it is important to choose models that are correct and complete.
Future articles will discuss various topics in distributed systems using the fundamental models of safety and liveness.