How to Know Your Distributed System is Correct (Before It Exists)

on 2026-05-02

Wouldn't it be neat if you could, before writing a single line of code, exhaustively test your distributed system, network protocol, or state machine?

For the low, low cost of learning a (very) obscure language, you can!

Planning in an Action Oriented Era

Modern companies seem biased to an extreme degree on leaping before you look. It has been my personal experience, and the experience of many of my colleagues, that most tech companies would rather you start by writing a prototype and bouncing around the problem space until you run into the right solution. I call this the Roomba design process.

I wasn't around for the "good old days" of software development, but after hearing mentor after mentor pine for the days when one actually sat down and wrote a plan for a software development project, it made me wonder if perhaps we left a good thing behind in the name of shipping an MVP quicker.

Is a heavily planned approach actually better? Or do all these tech companies have it right at the end of the day? After all, the introduction of generative AI into the software development process has significantly increased the velocity of the Roomba.

I haven't talked much about my AI assisted software development strategy on this blog yet, but an aspect of my methodology that is very relevant to this topic is that applying the same constraints to AI models that I would impose upon a human developer delivers very similar results. The most relevant example of this is that all the test cases need to pass, and any new feature requires new test cases. I'd go so far as to say that deciding what should be tested is perhaps the part of the AI assisted software development process in which the developer should be the most engaged.

These days, when programming with AI, I like to spend a lot more time reviewing and iterating on test cases than I do on the actual application logic itself. To me, the test cases represent a contract regarding how the program should behave. And in an era where code is dynamically generated by a statistical model, this contract is more important than ever. Especially since AI can produce code faster than you can reason about its correctness. These models are just as prone to mistakes as humans (sometimes more often1) and often lack broader long-term contextual knowledge.

But in a distributed system, with complex behavior, it's often difficult to reason about what should actually be in this contract in the first place. And here, keeping the broad context in mind while actually implementing your system is incredibly important.

Meet TLA+

TLA+ was created by Leslie Lamport, a computer scientist perhaps best known for inventing the Paxos consensus algorithm and (less glamorously but no less influentially) developing LaTeX. Judging from his publications I must assume that Lamport spent decades thinking rigorously about concurrency and distributed systems, and TLA+ is the distillation of that: a formal specification language grounded in mathematics and temporal logic that lets you describe exactly how a system behaves over time.

Continuing our earlier "tests are a contract" analogy: imagine TLA+ as the first stage of drafting our contract. It is the part before the lawyers get invited, where you decide what the contract's actual purpose is, what properties it should have, and how it actually behaves. In practice, this means writing a precise mathematical model of your system, which a tool called TLC then uses to exhaustively explore every possible state it can reach, surfacing deadlocks, race conditions, and safety violations that are notoriously hard to hunt down in production. Companies like Amazon and Microsoft have used this approach to find subtle bugs in distributed protocols.

If you want to follow along, the TLA+ Toolbox (which includes TLC) is available at https://lamport.azurewebsites.net/tla/tla.html.

Losing Money, Formally: Bank Transfer Example

For example, let's say we are building a system to track bank accounts.

Here is the actual spec:

---------------------------- MODULE bank_transfer ----------------------------
EXTENDS Naturals, TLC

VARIABLES a_balance, b_balance, in_flight

vars == <<a_balance, b_balance, in_flight>>

TypeInvariant ==
    /\ a_balance \in 0..100
    /\ b_balance \in 0..100
    /\ in_flight \in 0..100

\* Total money is always conserved
MoneyConserved ==
    a_balance + b_balance + in_flight = 100

Init ==
    /\ a_balance = 100
    /\ b_balance = 0
    /\ in_flight = 0

Debit ==
    /\ a_balance >= 10
    /\ a_balance' = a_balance - 10
    /\ in_flight' = in_flight + 10
    /\ UNCHANGED b_balance

Credit ==
    /\ in_flight > 0
    /\ b_balance' = b_balance + in_flight
    /\ in_flight' = 0
    /\ UNCHANGED a_balance

Done ==
    /\ a_balance = 0
    /\ in_flight = 0
    /\ UNCHANGED vars

Next == Debit \/ Credit \/ Done

Spec == Init /\ [][Next]_vars

=============================================================================

And its associated config:

SPECIFICATION Spec

INVARIANTS
    TypeInvariant
    MoneyConserved

Every spec has four primary components:

  • Constants and variables: the named values that parameterize the model (constants) and the state that changes over time (variables).
  • State Predicates: expressions that are true or false about a single state. Init is one, TypeInvariant is one, MoneyConserved is one. They describe what a state looks like without reference to transitions.
  • Actions: expressions that relate a current state to a next state. Everything with a prime ('). Debit, Credit, etc. They describe how the system moves between states.
  • Properties: temporal formulas checked over entire behaviors. Safety invariants, liveness formulas (where A ~> B means "A leads to B"), the Spec itself. Safety properties say "something bad never happens" (MoneyConserved is one). Liveness properties say "something good eventually happens" (like both sides of the protocol eventually finishing their transmission).

To go a bit deeper, let's look at one of our predicates:

\* Total money is always conserved
MoneyConserved ==
    a_balance + b_balance + in_flight = 100

Anyone who knows any amount of programming can probably easily read this. Since our Init predicate specified the starting amount of money as 100, and we only move around money, there should always be a total value of 100 in the system at all times. And because we specify this property as an invariant in our config, it will be ensured that we do not violate this.

Now, for an action:

Debit ==
    /\ a_balance >= 10
    /\ a_balance' = a_balance - 10
    /\ in_flight' = in_flight + 10
    /\ UNCHANGED b_balance

The first thing you will notice is these ASCII wickets (/\ and \/). Not coming from a math background, I wasn't sure what these were supposed to mean, but to keep it simple /\ is a logical AND and \/ is a logical OR. Any variable with a ' refers to its value in the next state, and any variable without refers to its value in the current state. So, this action specifies:

if a_balance >= 10 
set a_balance = a_balance - 10
and
set in_flight = in_flight + 10
leave b_balance unchanged

Technically, the predicates in this action (a_balance >= 10) are not if statements. TLC simply won't consider this action for execution unless they are satisfied.

Now we have the Next action:

Next == Debit \/ Credit \/ Done

This is basically the menu of what actions TLC can take at any given step. Since they're joined with \/ (or), TLC will explore all valid combinations: steps where Debit fires, steps where Credit fires, and steps where Done fires. The Done action is interesting. It's a terminal state that kicks in once all the money has moved over (a_balance = 0 and nothing is in flight). It doesn't change anything (UNCHANGED vars), it just allows the system to idle gracefully rather than getting stuck. Without it, TLC might flag a deadlock when the transfer completes and no other actions are enabled.

And finally:

Spec == Init /\ [][Next]_vars

This is the top-level definition that ties everything together. It says: the system starts in Init AND, in every subsequent step, either a Next action occurs or the variables stay the same. That last part, [][Next]_vars, is where the temporal logic comes in. The [] means "always" (as in, at every point in time), and the _vars subscript allows for "stuttering steps" where nothing changes. Stuttering steps are a deliberate design choice to allow specs to be composed together cleanly, but you can mostly think of it as: at every moment, the system either takes a valid step or idles.

When TLC runs, this is what it checks against. It takes your Spec, generates the full graph of reachable states, and then verifies that your invariants hold in every single one of them.

We can evaluate this spec like so:

tlc bank_transfer.tla

And we will get output like this:

TLC2 Version 2.19 of 08 August 2024 (rev: 5a47802)
Running breadth-first search Model-Checking with fp 89 and seed -6445696154581258138 with 1 worker on 16 cores with 14231MB heap and 64MB offheap memory (Linux 6.12.81 amd64, Oracle Corporation 1.8.0_472 x86_64, MSBDiskFPSet, DiskStateQueue).
Parsing file /home/grayson/code/tla/simple/bank_transfer.tla
Parsing file /tmp/Naturals.tla
Parsing file /tmp/TLC.tla
Parsing file /tmp/Sequences.tla
Parsing file /tmp/FiniteSets.tla
Semantic processing of module Naturals
Semantic processing of module Sequences
Semantic processing of module FiniteSets
Semantic processing of module TLC
Semantic processing of module bank_transfer
Starting... (2026-04-22 20:21:57)
Computing initial states...
Finished computing initial states: 1 distinct state generated at 2026-04-22 20:21:57.
Model checking completed. No error has been found.
  Estimates of the probability that TLC did not check all reachable states
  because two distinct states had the same fingerprint:
  calculated (optimistic):  val = 1.6E-16
112 states generated, 66 distinct states found, 0 states left on queue.
The depth of the complete state graph search is 12.
The average outdegree of the complete state graph is 1 (minimum is 0, the maximum 2 and the 95th percentile is 2).
Finished in 00s at (2026-04-22 20:21:57)

So, what happened?

TLC ran a breadth-first search over every possible state the system can reach, starting from Init. It found 66 distinct states across 112 total steps, verified that MoneyConserved and TypeInvariant held in all of them, and reported no errors. The whole thing finished in under a second.

A few details worth noting from the output:

  • 112 states generated, 66 distinct - TLC explores paths, not just states. The same state can be reached multiple ways (e.g., Debit then Debit then Credit lands in the same place as Debit then Credit then Debit), so the raw count is higher than the number of unique states.
  • depth of 12 - the longest possible sequence of actions from start to finish is 12 steps. Since we start with 100 and debit in increments of 10, this makes sense: 10 debits + some credits + Done.
  • max outdegree of 2 - at any given state, there are at most 2 non-stuttering actions available (Debit and Credit can both be enabled at the same time). When the system has finished, no new transitions are possible, giving an outdegree of 0.

The key line is Model checking completed. No error has been found. That means our invariants held across every reachable state: the money was always conserved, and the balances never left their defined bounds. For a distributed system, that kind of exhaustive guarantee is something no amount of unit testing can give you.

But you might be wondering two things:

  • This system doesn't seem very distributed
  • TLA assumed everything worked perfectly and no failures happened

And both of these are true. The first is true just because I needed a simple example that fit entirely on your screen (patience, we are getting there.). The second is something we can address now, so how do we simulate bad stuff happening? With bad actions:

Crash ==
    /\ in_flight > 0
    /\ in_flight' = 0
    /\ UNCHANGED <<a_balance, b_balance>>

Next == Debit \/ Credit \/ Crash

Now, if we run TLC again with this example:

# tlc bank_transfer.tla 
TLC2 Version 2.19 of 08 August 2024 (rev: 5a47802)
Running breadth-first search Model-Checking with fp 30 and seed -1146706225358764036 with 1 worker on 16 cores with 14231MB heap and 64MB offheap memory (Linux 6.12.81 amd64, Oracle Corporation 1.8.0_472 x86_64, MSBDiskFPSet, DiskStateQueue).
Parsing file /home/grayson/code/tla/simple/bank_transfer.tla
Parsing file /tmp/Naturals.tla
Parsing file /tmp/TLC.tla
Parsing file /tmp/Sequences.tla
Parsing file /tmp/FiniteSets.tla
Semantic processing of module Naturals
Semantic processing of module Sequences
Semantic processing of module FiniteSets
Semantic processing of module TLC
Semantic processing of module bank_transfer
Starting... (2026-04-22 20:32:20)
Computing initial states...
Finished computing initial states: 1 distinct state generated at 2026-04-22 20:32:20.
Error: Invariant MoneyConserved is violated.
Error: The behavior up to this point is:
State 1: <Initial predicate>
/\ a_balance = 100
/\ in_flight = 0
/\ b_balance = 0

State 2: <Debit line 23, col 5 to line 26, col 26 of module bank_transfer>
/\ a_balance = 90
/\ in_flight = 10
/\ b_balance = 0

State 3: <Crash line 40, col 5 to line 42, col 41 of module bank_transfer>
/\ a_balance = 90
/\ in_flight = 0
/\ b_balance = 0

5 states generated, 5 distinct states found, 2 states left on queue.
The depth of the complete state graph search is 3.
The average outdegree of the complete state graph is 1 (minimum is 1, the maximum 1 and the 95th percentile is 1).
Finished in 00s at (2026-04-22 20:32:20)

We get a replayed version of the chain of states that led to our MoneyConserved property being violated. First there was the initial state, then there was a Debit, and then a crash occurred.

And this lets us think, in an abstract yet very real way, about how we would solve this problem in our program. For example, if we wanted exactly-once processing of a transaction, we could implement a transaction log that is synchronously written to before the transaction takes place:

\* log is authoritative: in_flight is ephemeral and may be lost on crash
MoneyConserved ==
    a_balance + b_balance + log = 100

Init ==
    /\ a_balance = 100
    /\ b_balance = 0
    /\ in_flight = 0
    /\ log = 0                \* added: no pending transfer at start

Debit ==
    /\ log = 0                \* added: block new transfer while one is pending
    /\ a_balance >= 10
    /\ a_balance' = a_balance - 10
    /\ in_flight' = in_flight + 10
    /\ log' = log + 10        \* added: record intent durably before moving money
    /\ UNCHANGED b_balance

Credit ==
    /\ in_flight > 0
    /\ b_balance' = b_balance + in_flight
    /\ in_flight' = 0
    /\ log' = 0               \* added: commit point — clear log after credit lands
    /\ UNCHANGED a_balance

Done ==
    /\ a_balance = 0
    /\ in_flight = 0
    /\ UNCHANGED vars

Crash ==
    /\ in_flight > 0
    /\ in_flight' = 0
    /\ UNCHANGED <<a_balance, b_balance, log>>  \* added: log persists across crash

\* added: re-populate in_flight from log so Credit can retry after a crash
Recover ==
    /\ log > 0
    /\ in_flight = 0
    /\ in_flight' = log
    /\ UNCHANGED <<a_balance, b_balance, log>>

Next == Debit \/ Credit \/ Crash \/ Recover \/ Done

It is worth reiterating, since this is what makes TLA such an effective tool: TLA+ is not randomly deciding when to crash the system. TLC will exercise the crash action in every single state where it is enabled. Whenever in_flight > 0, TLC will explore both the branch where Credit fires and the branch where Crash fires. There is no probability involved, no random sampling, no hope that your chaos monkey hit the right moment. If a crash can happen at a given point in the execution, TLC guarantees it will happen in the model. The spec above verified correctly because it survived every one of those branches.

Network Protocol Example

Network protocols tend to have pretty nasty bugs related to state deadlocks, and TLA+ is great at finding these. Let's work through an example.

This one comes from a potential future project. This example is modeled off of a program that uses LoRa radios to communicate. LoRa is inherently half-duplex, since any node can only be transmitting or listening at one time. So if we want to build a bidirectional protocol, we need to consider this. Fortunately, LoRa does support a limited form of carrier sense2, so we do have a "magic" method of determining if the other side is talking and we don't have to rely exclusively on timing as a result. But what we are effectively building is a Time Division Multiplexing protocol.

So, let's think about what kind of properties are important to this protocol. Let's imagine that each side has frames that they both want to transmit. First, we don't want the two sides to transmit at the same time, that would be wasteful. We can call that Safety Property NoSimultaneousTransmit. Next, we want a liveness property that specifies that each side can send all of its frames eventually, which we can call BothEventuallyDone. And finally, we want a safety invariant ensuring neither side monopolizes the channel, which we can call BothGetATurn: if A finishes sending all its frames, B must have transmitted at least once, and vice versa.

---------------------------- MODULE DuplexTDM ----------------------------
EXTENDS Naturals, TLC

(**************************************************************************)
(* Bidirectional half-duplex radio protocol.                              *)
(*                                                                        *)
(* Two nodes share a single frequency. Either node may transmit, but     *)
(* only one at a time. The transmitting node advertises a window size     *)
(* before sending. The receiver waits slightly longer than the expected   *)
(* transmission duration, then ACKs received frames and NACKs gaps.      *)
(*                                                                        *)
(* Both nodes have data to send. They must take turns fairly.            *)
(*                                                                        *)
(* Safety properties:                                                     *)
(*   - Both nodes never transmit simultaneously                           *)
(*   - ACK/NACK never sent while other side is transmitting               *)
(*   - Delivered sequence numbers are monotonically increasing            *)
(*                                                                        *)
(* Liveness properties:                                                   *)
(*   - Neither side is starved -- if a node has data, it eventually gets  *)
(*     a turn                                                             *)
(*   - A NACK eventually triggers retransmission                          *)
(*   - Data is eventually delivered despite loss                          *)
(**************************************************************************)

CONSTANTS
    MAX_WINDOW,       \* Maximum frames a node may transmit in one turn
    MAX_SEQ,          \* Maximum sequence number (wraps, but kept simple here)
    MAX_SENT          \* Maximum frames each side can send

(**************************************************************************)
(* Node states                                                            *)
(*                                                                        *)
(* Both nodes are symmetric -- each can be in any of these states.       *)
(*                                                                        *)
(* "listening"    -- monitoring the channel; will transmit when carrier  *)
(*                   goes idle (carrier busy = receiving frames)         *)
(* "transmitting" -- actively sending frames, carrier is busy            *)
(* "sending_ack"  -- carrier just went idle, firing ACK or NACK         *)
(**************************************************************************)

NodeStates == {"listening", "transmitting"}
NodeTxType == {"data", "ack", "nack"}
VARIABLES
    \* Per-node state
    a_state,          \* current state of node A
    a_tx_type,        \* what A is currently sending: "data" | "ack" | "nack"
    b_state,          \* current state of node B
    b_tx_type,        \* what B is currently sending: "data" | "ack" | "nack"

    \* Carrier sense (shared physical observation, not ownership)
    carrier_sense,    \* "busy" | "idle" -- is the channel currently in use

    \* Transmission tracking
    a_window,         \* how many frames A advertised it will send this turn
    b_window,         \* how many frames B advertised it will send this turn
    a_sent,           \* how many frames A has sent this turn (resets each window)
    b_sent,           \* how many frames B has sent this turn (resets each window)
    a_total_sent,     \* cumulative frames A has ever sent (never resets)
    b_total_sent,     \* cumulative frames B has ever sent (never resets)

    \* Delivery tracking
    a_delivered,      \* highest seq number A has delivered to its application
    b_delivered,      \* highest seq number B has delivered to its application

    \* Peer-acknowledged counts (A's/B's local knowledge of remote delivery)
    \* updated when an ACK is received; A is done when a_peer_acked = MAX_SENT
    a_peer_acked,     \* how many of A's frames B has confirmed via ACK
    b_peer_acked,     \* how many of B's frames A has confirmed via ACK

    \* Channel-request flags (piggybacked on ACK frames)
    \* set in an ACK when the sender still has data, cleared when it starts transmitting
    \* if b_wants_channel is set, A must yield; if a_wants_channel is set, B must yield
    a_wants_channel,
    b_wants_channel,

    \* Transmission history flags (set once, never cleared)
    \* used for fairness: both sides must get at least one window
    a_ever_transmitted,
    b_ever_transmitted,

    \* ACK-due flags (local knowledge only -- no peer state required)
    \* set when the peer finishes a window (you owe them an ACK)
    \* cleared when you send your ACK (DeliverAndAck/Nack)
    \* prevents starting a new window before discharging your ACK obligation
    a_ack_due,        \* A owes B an ACK (set by BFinishTransmit)
    b_ack_due,        \* B owes A an ACK (set by AFinishTransmit)

    \* Pending retransmit flags
    \* set when a NACK is received, cleared when retransmit completes
    a_nack_pending,
    b_nack_pending

vars == <<a_state, a_tx_type, b_state, b_tx_type, carrier_sense,
          a_window, b_window, a_sent, b_sent, a_total_sent, b_total_sent,
          a_delivered, b_delivered,
          a_peer_acked, b_peer_acked,
          a_wants_channel, b_wants_channel,
          a_ever_transmitted, b_ever_transmitted,
          a_ack_due, b_ack_due,
          a_nack_pending, b_nack_pending>>

(**************************************************************************)
(* TypeInvariant                                                          *)
(*                                                                        *)
(* Every variable must stay within its legal range at all times.         *)
(* TLC will check this at every reachable state.                         *)
(**************************************************************************)
TypeInvariant ==
    /\ a_state        \in NodeStates
    /\ a_tx_type      \in NodeTxType
    /\ b_state        \in NodeStates
    /\ b_tx_type      \in NodeTxType
    /\ carrier_sense  \in {"busy", "idle"}
    /\ a_window       \in 0..MAX_WINDOW
    /\ b_window       \in 0..MAX_WINDOW
    /\ a_sent         \in 0..MAX_WINDOW
    /\ b_sent         \in 0..MAX_WINDOW
    /\ a_total_sent   \in 0..MAX_SENT
    /\ b_total_sent   \in 0..MAX_SENT
    /\ a_delivered    \in 0..MAX_SEQ
    /\ b_delivered    \in 0..MAX_SEQ
    /\ a_peer_acked   \in 0..MAX_SENT
    /\ b_peer_acked   \in 0..MAX_SENT
    /\ a_wants_channel     \in BOOLEAN
    /\ b_wants_channel     \in BOOLEAN
    /\ a_ever_transmitted  \in BOOLEAN
    /\ b_ever_transmitted  \in BOOLEAN
    /\ a_ack_due           \in BOOLEAN
    /\ b_ack_due       \in BOOLEAN
    /\ a_nack_pending  \in BOOLEAN
    /\ b_nack_pending  \in BOOLEAN

(**************************************************************************)
(* Your turn -- write Init, actions, Next, Spec, and properties below    *)
(**************************************************************************)

Init ==
    /\ a_state = "listening"
    /\ a_tx_type = "data"
    /\ b_state = "listening"
    /\ b_tx_type = "data"
    /\ carrier_sense = "idle"
    /\ a_window = 0
    /\ b_window = 0
    /\ a_sent = 0
    /\ b_sent = 0
    /\ a_total_sent = 0
    /\ b_total_sent = 0
    /\ a_delivered = 0
    /\ b_delivered = 0
    /\ a_peer_acked = 0
    /\ b_peer_acked = 0
    /\ a_wants_channel = FALSE
    /\ b_wants_channel = FALSE
    /\ a_ever_transmitted = FALSE
    /\ b_ever_transmitted = FALSE
    /\ a_ack_due = FALSE
    /\ b_ack_due = FALSE
    /\ a_nack_pending = FALSE
    /\ b_nack_pending = FALSE

AStartTransmit ==
    /\ a_state = "listening"
    /\ carrier_sense = "idle"
    /\ a_peer_acked = a_total_sent            \* B has ACKed A's previous window
    /\ ~a_ack_due                             \* A has discharged its own ACK obligation to B
    /\ ~b_wants_channel                       \* yield if B called dibs in its last ACK
    /\ \E w \in 1..MAX_WINDOW :               \* TLC explores all window sizes
        /\ a_total_sent + w <= MAX_SENT       \* don't promise frames you can't send
        /\ a_window' = w
    /\ a_state'        = "transmitting"
    /\ a_tx_type'      = "data"
    /\ a_sent'         = 0
    /\ carrier_sense'  = "busy"
    /\ b_state'        = "listening"
    /\ a_wants_channel'    = FALSE             \* A is using the channel, clear its own flag
    /\ a_ever_transmitted' = TRUE             \* record that A has transmitted at least once
    /\ UNCHANGED <<b_tx_type, b_window, b_sent, b_total_sent, a_total_sent,
                   a_delivered, b_delivered, a_peer_acked, b_peer_acked,
                   b_wants_channel, b_ever_transmitted,
                   a_ack_due, b_ack_due, a_nack_pending, b_nack_pending>>

ASendFrame ==
    /\ a_state = "transmitting"
    /\ a_sent < a_window
    /\ a_total_sent < MAX_SENT
    /\ a_sent'       = a_sent + 1
    /\ a_total_sent' = a_total_sent + 1
    /\ UNCHANGED <<a_state, a_tx_type, b_state, b_tx_type, carrier_sense,
                   a_window, b_window, b_sent, b_total_sent,
                   a_delivered, b_delivered, a_peer_acked, b_peer_acked,
                   a_wants_channel, b_wants_channel,
                   a_ever_transmitted, b_ever_transmitted,
                   a_ack_due, b_ack_due, a_nack_pending, b_nack_pending>>

AFinishTransmit ==
    /\ a_state = "transmitting"
    /\ a_sent = a_window
    /\ a_state'   = "listening"
    /\ carrier_sense' = "idle"
    /\ b_ack_due' = TRUE                   \* B now owes A an ACK
    /\ UNCHANGED <<a_tx_type, b_state, b_tx_type,
                   a_window, b_window, a_sent, b_sent, a_total_sent, b_total_sent,
                   a_delivered, b_delivered, a_peer_acked, b_peer_acked,
                   a_wants_channel, b_wants_channel,
                   a_ever_transmitted, b_ever_transmitted,
                   a_ack_due, a_nack_pending, b_nack_pending>>


BStartTransmit ==
    /\ b_state = "listening"
    /\ carrier_sense = "idle"
    /\ b_peer_acked = b_total_sent            \* A has ACKed B's previous window
    /\ ~b_ack_due                             \* B has discharged its own ACK obligation to A
    /\ ~a_wants_channel                       \* yield if A called dibs in its last ACK
    /\ \E w \in 1..MAX_WINDOW :               \* TLC explores all window sizes
        /\ b_total_sent + w <= MAX_SENT       \* don't promise frames you can't send
        /\ b_window' = w
    /\ b_state'        = "transmitting"
    /\ b_tx_type'      = "data"
    /\ b_sent'         = 0
    /\ carrier_sense'  = "busy"
    /\ a_state'        = "listening"
    /\ b_wants_channel'    = FALSE             \* B is using the channel, clear its own flag
    /\ b_ever_transmitted' = TRUE             \* record that B has transmitted at least once
    /\ UNCHANGED <<a_tx_type, a_window, a_sent, a_total_sent, b_total_sent,
                   a_delivered, b_delivered, a_peer_acked, b_peer_acked,
                   a_wants_channel, a_ever_transmitted,
                   a_ack_due, b_ack_due, a_nack_pending, b_nack_pending>>

BSendFrame ==
    /\ b_state = "transmitting"
    /\ b_sent < b_window
    /\ b_total_sent < MAX_SENT
    /\ b_sent'       = b_sent + 1
    /\ b_total_sent' = b_total_sent + 1
    /\ UNCHANGED <<a_state, a_tx_type, b_state, b_tx_type, carrier_sense,
                   a_window, b_window, a_sent, a_total_sent,
                   a_delivered, b_delivered, a_peer_acked, b_peer_acked,
                   a_wants_channel, b_wants_channel,
                   a_ever_transmitted, b_ever_transmitted,
                   a_ack_due, b_ack_due, a_nack_pending, b_nack_pending>>

BFinishTransmit ==
    /\ b_state = "transmitting"
    /\ b_sent = b_window
    /\ b_state'       = "listening"
    /\ carrier_sense' = "idle"
    /\ a_ack_due'     = TRUE                   \* A now owes B an ACK
    /\ UNCHANGED <<a_state, a_tx_type, b_tx_type,
                   a_window, b_window, a_sent, b_sent, a_total_sent, b_total_sent,
                   a_delivered, b_delivered, a_peer_acked, b_peer_acked,
                   a_wants_channel, b_wants_channel,
                   a_ever_transmitted, b_ever_transmitted,
                   b_ack_due, a_nack_pending, b_nack_pending>>

\* In the perfect-connection model, frames are delivered instantly and
\* completely when the transmitter finishes.  ACK actions run after
\* AFinishTransmit / BFinishTransmit while the carrier is idle.

\* B received all of A's frames cleanly -- sends a cumulative ACK.
\* Piggybacks b_wants_channel if B still has frames to send.
BDeliverAndAck ==
    /\ carrier_sense = "idle"
    /\ b_ack_due = TRUE              \* B owes A an ACK (set when A finished its window)
    /\ b_delivered < a_total_sent
    /\ b_delivered'     = a_total_sent
    /\ a_peer_acked'    = a_total_sent
    /\ b_wants_channel' = (b_total_sent < MAX_SENT)
    /\ b_ack_due'       = FALSE
    /\ UNCHANGED <<a_state, a_tx_type, b_state, b_tx_type, carrier_sense,
                   a_window, b_window, a_sent, b_sent, a_total_sent, b_total_sent,
                   a_delivered, b_peer_acked,
                   a_wants_channel, a_ever_transmitted, b_ever_transmitted, a_ack_due,
                   a_nack_pending, b_nack_pending>>

\* B received only a prefix of A's frames due to loss -- sends a NACK.
\* \E n \in ... means TLC explores every possible loss scenario:
\*   n = b_delivered         => all frames in this window were lost
\*   n = a_total_sent - 1    => only the last frame was lost
\* A learns via a_peer_acked how far delivery got, and sets a_nack_pending
\* so it knows to retransmit from that point.
BDeliverAndNack ==
    /\ carrier_sense = "idle"
    /\ b_ack_due = TRUE              \* B owes A an ACK (set when A finished its window)
    /\ b_delivered < a_total_sent
    /\ \E n \in b_delivered..(a_total_sent - 1) :  \* some frames lost; n is last good frame
        /\ b_delivered'  = n
        /\ a_peer_acked' = n
    /\ a_nack_pending'  = TRUE
    /\ b_wants_channel' = (b_total_sent < MAX_SENT)
    /\ b_ack_due'       = FALSE
    /\ UNCHANGED <<a_state, a_tx_type, b_state, b_tx_type, carrier_sense,
                   a_window, b_window, a_sent, b_sent, a_total_sent, b_total_sent,
                   a_delivered, b_peer_acked,
                   a_wants_channel, a_ever_transmitted, b_ever_transmitted, a_ack_due,
                   b_nack_pending>>

\* A received all of B's frames cleanly -- sends a cumulative ACK.
\* Piggybacks a_wants_channel if A still has frames to send.
ADeliverAndAck ==
    /\ carrier_sense = "idle"
    /\ a_ack_due = TRUE              \* A owes B an ACK (set when B finished its window)
    /\ a_delivered < b_total_sent
    /\ a_delivered'     = b_total_sent
    /\ b_peer_acked'    = b_total_sent
    /\ a_wants_channel' = (a_total_sent < MAX_SENT)
    /\ a_ack_due'       = FALSE
    /\ UNCHANGED <<a_state, a_tx_type, b_state, b_tx_type, carrier_sense,
                   a_window, b_window, a_sent, b_sent, a_total_sent, b_total_sent,
                   b_delivered, a_peer_acked,
                   b_wants_channel, a_ever_transmitted, b_ever_transmitted, b_ack_due,
                   a_nack_pending, b_nack_pending>>

\* A received only a prefix of B's frames due to loss -- sends a NACK.
ADeliverAndNack ==
    /\ carrier_sense = "idle"
    /\ a_ack_due = TRUE              \* A owes B an ACK (set when B finished its window)
    /\ a_delivered < b_total_sent
    /\ \E n \in a_delivered..(b_total_sent - 1) :  \* some frames lost; n is last good frame
        /\ a_delivered'  = n
        /\ b_peer_acked' = n
    /\ b_nack_pending'  = TRUE
    /\ a_wants_channel' = (a_total_sent < MAX_SENT)
    /\ a_ack_due'       = FALSE
    /\ UNCHANGED <<a_state, a_tx_type, b_state, b_tx_type, carrier_sense,
                   a_window, b_window, a_sent, b_sent, a_total_sent, b_total_sent,
                   b_delivered, a_peer_acked,
                   b_wants_channel, a_ever_transmitted, b_ever_transmitted, b_ack_due,
                   a_nack_pending>>

\* A got a NACK: roll back a_total_sent to the last confirmed frame so that
\* the next AStartTransmit re-sends from that point.  Clears the pending flag.
\* Retransmit is only possible when the carrier is idle (A waits its turn).
ARetransmit ==
    /\ a_nack_pending = TRUE
    /\ carrier_sense = "idle"
    /\ a_state = "listening"
    /\ a_total_sent' = a_peer_acked  \* rewind to last ACK'd frame
    /\ a_nack_pending' = FALSE
    /\ UNCHANGED <<a_state, a_tx_type, b_state, b_tx_type, carrier_sense,
                   a_window, b_window, a_sent, b_sent, b_total_sent,
                   a_delivered, b_delivered, a_peer_acked, b_peer_acked,
                   a_wants_channel, b_wants_channel,
                   a_ever_transmitted, b_ever_transmitted,
                   a_ack_due, b_ack_due, b_nack_pending>>

BRetransmit ==
    /\ b_nack_pending = TRUE
    /\ carrier_sense = "idle"
    /\ b_state = "listening"
    /\ b_total_sent' = b_peer_acked
    /\ b_nack_pending' = FALSE
    /\ UNCHANGED <<a_state, a_tx_type, b_state, b_tx_type, carrier_sense,
                   a_window, b_window, a_sent, b_sent, a_total_sent,
                   a_delivered, b_delivered, a_peer_acked, b_peer_acked,
                   a_wants_channel, b_wants_channel,
                   a_ever_transmitted, b_ever_transmitted,
                   a_ack_due, b_ack_due, a_nack_pending>>

ADone == a_peer_acked = MAX_SENT
BDone == b_peer_acked = MAX_SENT

Done == ADone /\ BDone /\ UNCHANGED vars

Next ==
    \/ AStartTransmit
    \/ ASendFrame
    \/ AFinishTransmit
    \/ BStartTransmit
    \/ BSendFrame
    \/ BFinishTransmit
    \/ BDeliverAndAck
    \/ BDeliverAndNack
    \/ ADeliverAndAck
    \/ ADeliverAndNack
    \/ ARetransmit
    \/ BRetransmit
    \/ Done

\* WF_vars(Next): every continuously-enabled action eventually fires (no deadlock)
\* SF_vars(BDeliverAndAck/ADeliverAndAck): even with repeated loss, a clean
\* delivery must eventually happen -- models a lossy-but-not-permanently-broken
\* channel rather than one that can drop every frame forever.
Spec == Init /\ [][Next]_vars /\ WF_vars(Next)
           /\ SF_vars(BDeliverAndAck)
           /\ SF_vars(ADeliverAndAck)

(**************************************************************************)
(* Safety properties                                                      *)
(**************************************************************************)

NoSimultaneousTransmit ==
    ~(a_state = "transmitting" /\ b_state = "transmitting")

(**************************************************************************)
(* Liveness properties                                                    *)
(**************************************************************************)

BothEventuallyDone ==
    <>(ADone /\ BDone)

BothGetATurn ==
    /\ (ADone => b_ever_transmitted)
    /\ (BDone => a_ever_transmitted)

=============================================================================

The first thing you might notice is that this spec is... long compared to the previous one.

This is a more complex example though. It largely avoids "magic" variables that allow each side to have an immediate and infallible view of the state of the other. After all, this is a network protocol, we can't just directly access the memory of the other side (at least not reliably, and not without delay). Unfortunately, with TLA+, this is a thing you have to enforce.

To give a specific example of this, one might have a variable called "channel_owner" that has a value in {"side_a", "side_b", "idle"}. And this would certainly make our Action definitions easier to write. But that would be cheating. Instead we have internal states, and only access side_b state in cases where a message would have legitimately passed from one side of the network to the other. But, since we have carrier sense, we do have one shared state called carrier_sense, which is either "busy" or "idle". And I think you can see how this simulates how carrier sense would actually be used by the real protocol. It's a physical observation, not shared state, so it isn't cheating that both sides get access to it.

For the most part, this is just a longer more complex example with more of what we've already seen, but there is some syntax we haven't seen before, so let's dig into some of this. For example:

AStartTransmit ==
    /\ a_state = "listening"
    /\ carrier_sense = "idle"
    /\ \E w \in 1..MAX_WINDOW :               \* TLC explores all window sizes
        /\ a_total_sent + w <= MAX_SENT       \* don't promise frames you can't send
        /\ a_window' = w
    /\ a_state'      = "transmitting"
    /\ a_tx_type'    = "data"
    /\ a_sent'       = 0
    /\ carrier_sense' = "busy"
    /\ b_state'      = "listening"
    /\ UNCHANGED <<b_tx_type, b_window, b_sent, b_total_sent, a_total_sent,
                   a_delivered, b_delivered, a_peer_acked, b_peer_acked,
                   a_nack_pending, b_nack_pending>>

Specifically \E w \in 1..MAX_WINDOW, which is an existential quantifier. This means that TLC will explore ALL window sizes in every case where this action fires! Every single window size that can legally occur (based on configuration and type invariants) will occur.

And, in this case, we've embedded some badness here in the form of NACK messages instead of ACK messages:

\* B received only a prefix of A's frames due to loss -- sends a NACK.
\* \E n \in ... means TLC explores every possible loss scenario:
\*   n = b_delivered         => all frames in this window were lost
\*   n = a_total_sent - 1    => only the last frame was lost
\* A learns via a_peer_acked how far delivery got, and sets a_nack_pending
\* so it knows to retransmit from that point.
BDeliverAndNack ==
    /\ carrier_sense = "idle"
    /\ b_ack_due = TRUE              \* B owes A an ACK (set when A finished its window)
    /\ b_delivered < a_total_sent
    /\ \E n \in b_delivered..(a_total_sent - 1) :  \* some frames lost; n is last good frame
        /\ b_delivered'  = n
        /\ a_peer_acked' = n
    /\ a_nack_pending'  = TRUE
    /\ b_wants_channel' = (b_total_sent < MAX_SENT)
    /\ b_ack_due'       = FALSE
    /\ UNCHANGED <<a_state, a_tx_type, b_state, b_tx_type, carrier_sense,
                   a_window, b_window, a_sent, b_sent, a_total_sent, b_total_sent,
                   a_delivered, b_peer_acked,
                   a_wants_channel, a_ever_transmitted, b_ever_transmitted, a_ack_due,
                   b_nack_pending>>

And once again, our friend the existential quantifier shows up, because this action explores EVERY possible delivery cutoff point.

Additionally, our Done action gets a little more complex, since it's really a meta-state made up of two other states:

Done == ADone /\ BDone /\ UNCHANGED vars

And we have this:

Spec == Init /\ [][Next]_vars /\ WF_vars(Next)
           /\ SF_vars(BDeliverAndAck)
           /\ SF_vars(ADeliverAndAck)

Remember how TLC explores every possible combination of states exhaustively? Consider a behavior where the system simply refuses to send frames indefinitely. We need to be fair (enough) to our system to give it a chance to execute despite all the chaos we are throwing at it, so we have a concept of fairness:

WF_vars(Next) is a weak fairness formula: if any action in Next is continuously enabled, it must eventually fire. That prevents TLC from exploring behaviors where the system just sits idle forever despite being able to make progress.

SF_vars(BDeliverAndAck) and SF_vars(ADeliverAndAck) are strong fairness formulas: even if these actions are only intermittently enabled (i.e. repeatedly becoming available but not staying enabled continuously), TLC must still find behaviors where they eventually fire. This is a stronger guarantee than weak fairness provides, and it matters here because an ACK action can be temporarily blocked (the carrier might be busy, the other side might be transmitting) before becoming available again. Without strong fairness, TLC is allowed to explore behaviors where the ACK never gets sent simply because it kept getting interrupted at inconvenient moments. Strong fairness rules that out: if BDeliverAndAck is enabled often enough, it will eventually happen.

In practical terms: this is how you tell TLC that your liveness properties (BothEventuallyDone) must hold in a real execution, not just in a perfectly cooperative one.

This is about as complex a spec as I've worked with so far, and while it isn't as complex as some examples, the state space is absolutely massive. For example, with this configuration:

SPECIFICATION Spec

CONSTANTS
    MAX_WINDOW = 10
    MAX_SEQ    = 20
    MAX_SENT   = 20

INVARIANTS
    TypeInvariant
    NoSimultaneousTransmit
    BothGetATurn

PROPERTIES
    BothEventuallyDone

TLC explored 10,745,891 states, 4,131,391 of which were distinct, with a maximum depth of 59 actions. And in all 10 million of them, our liveness and safety properties held.

If we can faithfully reproduce this spec (there may yet be magical variables we can't actually implement in a real protocol), we can be assured that our system should behave correctly in all legal circumstances.

Ten million valid states is a lot of confidence to walk away with. So what do you actually do with it?

The Art And Science of Using Specs

The spec is a precise description of your system that you can translate directly into implementation decisions.

Preconditions become runtime guards. Every action in a TLA+ spec has a precondition (the lines without primes that must hold before the action can fire). In Debit, that was a_balance >= 10. In AStartTransmit, it was a_state = "listening" and carrier_sense = "idle". These map directly to assertions or guard clauses in your implementation. If your code does a debit without checking the balance first, you already know from the spec that this is wrong; the model only allowed that action when the guard held.

Invariants become unit tests. MoneyConserved and NoSimultaneousTransmit are properties TLC checked in every state. You can write the exact same checks as assertions in your test suite. The difference is that TLC proved them over every reachable state; your unit tests will only check the cases you thought to write. But having the invariant spelled out precisely in the spec means you know exactly what the test should assert.

Error traces become regression tests. When TLC finds a violation, it prints the exact sequence of states that led to it, like the Debit-then-Crash sequence that broke MoneyConserved in our bank example. That trace is a regression test handed to you for free. Encode it: set up the initial state, execute those actions in that order, and assert the invariant holds not only in your spec, but in the real codebase.

The state machine is your integration test scaffold. The Next action defines every legal transition in the system. For the radio protocol, that meant AStartTransmit, ASendFrame, BDeliverAndAck, BDeliverAndNack, ARetransmit, BRetransmit, and so on. A good integration test suite should exercise every one of these transitions, and the spec tells you exactly what preconditions need to be true to get there and what the system should look like afterward.

Not every bit of logic will end up in your spec; scope is a deliberate choice. In the radio protocol, we abstracted away byte-level framing, timing jitter, and the exact contents of frames. The spec becomes a record of what you decided not to formally verify, which is exactly where traditional unit tests should focus.

In my experience, the skill in using AI tools effectively is communicating as specifically as possible how you want the AI to go about generating your code. Perhaps this is why they are so efficient when it comes to translating between one language to another. And a TLA spec is both incredibly precise and well scoped. It is a great way to tell a model: here are the details that are crucial to the safe behavior of this codebase (and nothing more).

That said, there are a few honest caveats. TLC can only check what you told it to check: if your invariants are incomplete or your model doesn't faithfully represent the real system, you get false confidence. A contract that doesn't say the right things is as useless as no contract at all. State space explosion is also a real concern; large constants can make TLC runs take hours or become infeasible entirely, which is part of why scoping your spec carefully matters. And finally, specs go stale. For simpler systems, once the implementation is built and validated, the spec may have served its purpose. But for more complex ones, keeping it around is a useful reminder of the logic problems you already solved, and a handy thing to point at when someone else breaks those contractual obligations (even and especially when that someone is you).

Summary

TLA+ tends to get overlooked because you can't directly compile a spec into running code. There is no "generate implementation" button. You write the spec, run TLC, fix the violations, and then you still have to go write the actual system. To someone under pressure to ship, that can feel like doing the work twice.

But I think that framing misses what the spec actually gives you. By the time TLC reports no errors across however many millions of states, you have already solved the hard problem: you know what your system is supposed to do, under what conditions each action is legal, what invariants must hold at all times, and which failure modes your design actually survives. The implementation becomes a translation exercise rather than a discovery process. And the spec leaves behind artifacts you can use: preconditions that become guard clauses, invariants that become assertions, error traces that become regression tests.

The Roomba design process works fine for a lot of problems. But for a distributed system or a stateful protocol, the state space is large enough that you genuinely cannot test your way to confidence without first understanding the shape of the problem. TLA+ is a tool for developing that understanding. And perhaps a TLA spec is also a great way to share that understanding with the AI model that is going to be writing your code anyway.

Going back to the questions I raised at the start: should we all be doing this? Do the old greybeards have a point? Did we leave a useful thing behind? Is this even still relevant? I can only answer for me: When I'm thinking about my next project, the likelihood of my reaching in my tool bag for TLA+ is directly proportional to how terrified I am of troubleshooting surprise deadlocks or safety bugs in the production code.




2

Channel Activity Detection. It's not full carrier sense like WiFi's CSMA/CA, it basically tells you if a LoRa pre-amble was detected recently and can be used in a similar way. It should work most of the time, but you might occasionally step on a transmission if you get very unlucky with timing. Semtech CAD paper