# Achievability of the lower bound time limit for distributed distributed failover transactions

Foreword

Recently I read another article from the series: "we are better than a two-phase commit". Here I will not analyze the contents of this article (although I'm thinking about giving a detailed analysis). The task of my opus is to offer the most effective version of the distributed commit in terms of time delays. Of course, such a commit is given a high price. However, the goal is to assess and show that the two-phase commit is not a brake, as many believe.

It should also be noted that there will be no full-scale experiments and fake comparisons. Algorithms and theoretical analysis will be given simply. If desired, you can independently implement and test in practice. Of course, it would be much better that this was described in the current article, but it all depends on free time and motivation. In my opinion, it's more important to describe algorithms than to bring graphics, because graphics on algorithms can draw almost everyone, the opposite is not true.
consensus in the asynchronous system is not possible . Therefore, it is important to understand what is the minimum possible time in the most favorable situations without violating the correctness of the algorithm, which is obliged to maintain its invariants also in the event of violation of these conditions at any stage. Two of these lemmas give an exhaustive answer, which suggests that the minimum possible time of an agreement is attainable.

This theorem can be generalized, proving that it is impossible to reach a consensus faster than 1 RTT, throwing out the condition of having a leader. However, this is beyond the scope of this article. The idea of ​​the proof is to consider the dissemination of knowledge about other participants in the system and the availability of a corresponding message: for 1 hop, you can only send out the data, but do not know whether they have reached and what the state was with the recipient.

So, for fault tolerance, take a consensus with 1 RTT and add to our two-phase commit:

1st hop
. The client sends a request to the leader of the coordinator.

2 nd and 3 rd hop
. The coordinator leader coordinates the beginning of the transaction.

4th hop
. The Transaction Coordinator sends a request to the leaders of the participants for blocking - Phase 1.

5th and 6th hop
. Participants successfully take blocking with the preservation of knowledge in their consensus groups.

7 th hop
. Leaders of participants send the answer that they are ready for carrying out of transactions.

8th and 9th hop
. The coordinator's leader will coordinate information about all participants in the system.

10 th hop
. The leader of the coordinator sends out a message to all the participants' leaders about the application of the operations - Phase 2.

11th and 12th hop
. Leaders agree on the commit and apply the changes.

13 th hop
. Participants report on the success of the coordinator's application to the leader.

14 th hop
. The coordinator answers the client.

Total 7 RTT. Not bad. Fault tolerance is "only" 4 RTT. They arise due to the fact that the coordinator and participants on 2 times consistently come each to their own consensus, which is what this time is spent on.

In the above diagram, you can see some non-optimality. Let's fix them.

Optimizing the commit

The first obvious optimization is sending a response to the client immediately after collecting the responses of successful blockages from the participants. Because these answers are fault-tolerant, then the participants will never forget about them, which means that the transaction will sooner or later be fulfilled even if the nodes are dropped, the leader is re-elected, etc. There, however, there is one slippery moment.

It is that in fact the coordinator makes the final decision on whether to commit the final transaction or not. Those. even if all participants said OK, but some participant blunted because of, for example, a change of leader, then the coordinator has the full right to roll back the transaction. And if so, then you can remove only 10-13 hops, but not 8th and 9th. But even this is not bad, since we have a decrease by 2 RTT, i.e. 5 RTT instead of 7.

At the same time, 10-13 hopes do not disappear anywhere, just the client does not need to wait for them. The coordinator and participants will finish their business in parallel with the client. And the client will receive his confirmation a little earlier. This knowledge will necessarily be in the system, just a little later. Here we use the magic of asynchrony, consensus and the inability to prove to the external participant that we have slightly cheated and cut the corner. Those. if the client suddenly wants to immediately read the data that we just completed and go directly to a participant, it will stumble on the lock (if it was not removed by that time by the 2nd phase), and this request will hang until it is removed . However, within the framework of our theoretical research this fact is absolutely not important, because we prepare ideal conditions for ourselves. And in the case of nonideal, as already mentioned above, we will wait for several eternities (since consensus will require eternity, but we need to hold them several, and consistently).

The next move is a bit more complicated and elegant.

Let's consider the very beginning of the transaction. There the client sends a request to the coordinator and then he initiates a two-phase commit and sends these requests to the other participants. There is at once an idea to fulfill such requests simultaneously, i.e. send the request to both the coordinator and the participants in parallel. On this way we are waiting for a treacherous trap.

The matter is that the client is not a fault-tolerant entity, i.e. it can fall. Imagine that he sent a request to the participants, they took a lock and waited, and the request to the coordinator for some reason did not reach and the client fell. Thus, there is no one to start a two-phase commit and there is no one to roll it back in case of conflicts /problems and so on. Participants will permanently block records and no one will help them. Therefore, such optimization is incorrect. Participants have the right to commit only after the decision of the coordinator, who is responsible for the transaction and rolls it back if necessary.

To go further, you need to take a completely different look at the problem. And for this we begin, oddly enough, with consensus.

Consensus optimization

It would seem that you can optimize here? After all, we with Raft achieve the minimum possible execution time - 1 RTT. In fact, you can faster - for 0 RTT.

To this end, we recall that in addition to the consensus itself, another 1 RTT is required to send a request from the client to the leader and receive a response. Those. for a remote consensus group, 2 RTTs are required in this case, which we see in the two-phase commit on 2 examples: sending and committing to the coordinator, sending and committing to the participants. A total of 4 RTTs at once, and another 1 RTT - to the second phase commit on the coordinator.

It is clear that a leader-based consensus for a remote client can not be faster than 2 RTTs. In fact, at first we need to deliver a message to the leader, and then the leader must already forward the participants of the group and get an answer from them. Without options.

The only option is to get rid of the weak link - the leader. Indeed, not only must all records pass through it, so in case of its fall the group becomes inaccessible for a fairly long time. The leader of consensus is the weakest link, and the restoration of the leader is the most fragile and nontrivial part of the consensus. So you just need to get rid of it.

Definition
. Broadcast of the message is the sending of the same message to all the participants of the group.

To do this, let's take
? widely known in narrow circles. consensus without master
. The main idea is to ensure that the players achieve the same status on the participants. To do this, it is sufficient to make 2 broadships, i.e. just 1 RTT. The first ticket to the participants of the system can make the client himself. Response broads from participants can reach the client. If the client sees the same state (and he sees this in the case, for example, of the absence of competitive requests), then he will be able to understand, on the analysis of the content of the response broads, that his request will be pre-paid sooner or later. In fact, using such an algorithm, all participants in the consensus, including the client, simultaneously realize that the request was zakommichen. And this will happen after 2 broads, i.e. 1 RTT. Because the client still has to spend 1 RTT on sending the message to the group and getting the answer, then we have a paradoxical conclusion that the consensus was effectively over 0 RTT.

Analogy

To go further, we will use the powerful analysis tool - analogy. Let's return to Raft algorithm. What is happening in it? It consists of two phases:

1st phase
: the leader sends the request to the participants and is waiting for a response.

2nd phase
: After the answer, the leader takes an agreed decision individually and sends it to the participants of the system.

Does not it look like anything? That's right, this is a two-phase commit, only with some ogby the people:

In the algorithm Raft does not need to wait for a response from all participants. In a two-phase commit for a successful transaction, you must wait for a successful response from all participants.

In the Raft algorithm, the participant can not say no. More theoretically, he can do so (for example, the place is over), but this neoK will be analogous to the lack of response. In a two-phase commit, everything is stricter: if at least one of the participants said no-words, then the entire transaction should be paired and rolled back. This is the very essence of biphasicity: first we ask the consent of all, and only after the general unanimous approval we roll changes. Consensus in this sense is more democratic, because requires majority approval.

At the same time, they have in common that there is a dedicated decision driver (leader or coordinator), and there are two phases - preliminary, and final.

Accordingly, all we need is to refuse the coordinator in the two-phase commit, i.e. do exactly the same thing that we did for consensus, giving up the leader.

Let's forget about fault tolerance for a while and see how the commit looks in this case.

Self-coordination

Definition
.
Two-phase commit without coordinator
consists of 2 phases:

All participants send their decision to all other participants: OK or not.

Each participant after receiving the OK from all commit changes or rolls back them if at least one responds to a non-OK.

After that, for reliability, each participant can send out to everyone else information that a commit has occurred and you can remove the locks, but this is not necessary.

Why did the coordinator suddenly become unnecessary? The fact is that the coordinator followed the transactional process, including whether the nodes are alive. Those. In case of problems with participants, the coordinator rolled back the transaction. The problem was only in the coordinator itself, because he could not look after himself. Therefore, often a two-phase commit is called a blocking .

Definition
.
Self-coordinating transactions
- Transactions that do not require a dedicated coordinator.

However, by adding fault tolerance, the role of the coordinator becomes unnecessary. Every participant who is a consensus group can stand up for himself. Thus, we come to self-coordinating transactions without the need for a dedicated coordinator. An important difference from the usual two-phase commit with the coordinator is that the coordinator can at any time decide to roll back the transaction, even if all the participants gave a positive response. In self-coordinated transactions such nondeterministic behavior is unacceptable, because Each participant makes a decision based on the responses of other participants and this decision should be the same.

Theorem
. Self-coordinating transactions produce strict consistency (linearizability + serializability).

The proof of
. Actually, the proof is based on the simple fact that the two-phase commit also provides such a guarantee. Indeed, in a scheme without a coordinator, each participant is himself a coordinator; there is a two-phase commit as if it is the most important. This means that it preserves all the invariants of the two-phase commit. This is easy to verify, if we recall that each participant sends out answers to the rest of the participants. Those. everyone receives OK responses from all the others, acting as a coordinator for committing a transaction commit.

Let us describe the minimum number of hops with a favorable combination of circumstances:

1st hop
. The client sends a message to all participants in the transaction.

2nd hop
. All participants send a reply to the client and to each other.

After the 2nd hop, the client has all the necessary information to make a decision about the committee. This requires only 1 RTT for the commit.

Fault tolerance and availability

An attentive reader may ask: what to do in case of a client's downfall? After all, if the participants in the system can be made fault-tolerant, then to the client we can not make such demands, i.e. he can fall at any moment. It is clear that after the client sends requests to all participants of the system, the distributed commit can be completed without the participation of the client. And what if the customer managed to send only to some of them and fell safely?

In this case, we oblige the client to do the following: the client must forward to each participant information about all the other participants in our transaction. Thus, each participant knows all the other participants and sends them his result. In this case, any participant, if he did not receive a request from the client, can choose one of the following behaviors:

Immediately reply that he does not accept the transaction, i.e. sends non-OK. In this case, the locks are rolled back. The participant at the same time, as always, sends out his response to the other participants.

If the request from the other participant contains all the necessary information for executing the transaction commit for this participant, then it is possible to make a decision about the successful blocking of the corresponding records (1st phase) and send OK. To this end, the client must send to each participant of the transaction information about all other participants and all the necessary data to make the distributed commit.

In any case, we get that all participants either get OK, or in the absence of the necessary information, someone reports non-OK and the transaction rolls back. Those. in the event of a client's downfall, each participant is able either to complete the initiated or to correctly roll back the client's actions.

It remains to make the system participants fault-tolerant. To do this, we put them in the consensus of the group without a dedicated leader. Those. Each participant will not represent a separate node, but a set of nodes in the consensus group.

The commit algorithm will look like this:

The client sends its request to each node belonging to the transaction group of the transaction.

Each node sends a reply to all other nodes and the client about the speculative execution of the first phase of the commit as if it were being executed at the current step of the consensus. In reality, we do not know whether this will actually happen or not, because If there are competitive requests from other clients, the consensus can reorder the current unapplied actions.

The client receives all requests from all nodes of all participants. If all nodes in the speculative implementation responded OK and the consensus step was the same for each node from the group consensus, it means that the speculative implementation of the first phase will actually happen and you can make a decision about the commit.

In fact, the condition for obtaining a response from all nodes of each group is redundant. However, a more detailed consideration of the weakening of this requirement is beyond the scope of this article.

Conclusions

Total we get 2 hops or 1 RTT. Given that the communication between the client and the server can not be removed, the effective processing time of the commit on the server side is zero, i.e. as if the server instantly processed a distributed, high-availability, fault-tolerant transaction and sent a response to the client.

Thus, we have an important theoretical and applied fact: the lower bound of the execution time of the distributed fault-tolerant commit is achievable.

Literature

Michael J. Fischer, Nancy A. Lynch, and Michael S. Paterson, 198?

Impossibility of distributed consensus with one faulty process

G. Demchenko, 201?
Masterless Consensus Algorithm

Jim Gray, Leslie Lamport, 200?
Consensus on Transaction Commit
+ 0 -