Consistency and ACID guarantees in distributed storage systems
Distributed systems are used when horizontal scaling is needed: when a vertically scaled system can no longer deliver the required performance at a reasonable cost.
Much like the transition from a single-threaded paradigm to a multi-threaded one, migrating to a distributed system requires a certain immersion: an understanding of how it works internally and what you need to pay attention to.
One of the first problems facing anyone who wants to migrate a project to a distributed system, or to start a new project on one, is which product to choose.
As a company with long experience building systems of this kind, we help our clients make such decisions about distributed storage in a balanced way. We also run a series of webinars for a wider audience, devoted to the basic principles explained in plain language; independent of any specific product preferences, they help map out the relevant characteristics to simplify the choice.
This article is based on our materials on consistency and ACID guarantees in distributed systems.
When the CAP theorem speaks of consistency, it implies a rather strict definition, including linearizability of writes and reads, and it specifies consistency only for writes of individual values. ( Martin Kleppmann )
The CAP theorem says that if we want to be resilient to network problems, we generally have to choose what to sacrifice: consistency or availability. There is also an extended version of this theorem, PACELC ( Wikipedia ), which additionally says that even in the absence of network problems, we must choose between response latency and consistency.
And although, at first glance, to someone coming from the world of classical DBMSs the choice seems obvious, and consistency seems the most important thing we have, this is not always the case, as vividly illustrated by the explosive growth of a number of NoSQL DBMSs that made a different choice and, despite it, have gained a huge user base. Apache Cassandra with its famous eventual consistency is a good example.
All because it is precisely that: a choice, which implies sacrificing something, and we are not always ready to make that sacrifice.
Often the problem of consistency in distributed systems is solved simply by abandoning this consistency.
But it is important to understand when giving up consistency is acceptable, and when it is a business-critical requirement.
For example, if I design a component responsible for storing user sessions, consistency is most likely not that important to me, and data loss is not critical as long as it happens only in problem cases, very rarely. The worst that can happen is that the user has to log in again, and for many businesses this has almost no effect on their financial performance.
If I run analytics on a stream of sensor data, in many cases it is not at all critical to lose part of the data and work with a reduced sample over a short period, especially if "eventually" I will see the data.
But if I build a banking system, the consistency of monetary transactions is critical for my business. If I charge a client a late fee on a loan because I simply did not see a payment that was made on time, even though it was in the system, that is very, very bad. The same goes if a client can withdraw all the money from my credit card several times because I had network problems at the moment of the transaction and information about the withdrawals did not reach part of my cluster.
If you make an expensive purchase in an online store, you do not want your order to be forgotten despite the web page reporting success.
But if you opt for consistency, you sacrifice availability. And often this is expected; most likely, you have run into it personally more than once.
It is better for the online store's shopping cart to say "try again later, the distributed DBMS is unavailable" than to report success and then forget the order. It is better to have a transaction declined because the bank's services are unavailable than to get a report of success followed by litigation with the bank because it forgot that you made a loan payment.
Finally, looking at the extended PACELC theorem, we see that even during normal system operation, by choosing consistency we may sacrifice low latency and accept a potentially lower ceiling on performance.
So, answering the question "why is this necessary?": it is necessary if it is critical for your task to have up-to-date, complete data, and the alternative would bring you losses greater than the temporary unavailability of the service during an incident, or than its lower performance.
How to provide it?
Accordingly, the first decision you need to make is where you stand in the CAP theorem: whether you want consistency or availability in the event of an incident.
Next, you need to understand at what level you want to make changes. Perhaps ordinary atomic writes affecting a single object are enough for you, as MongoDB long offered (it now extends this with support for full-fledged transactions). Recall that the CAP theorem says nothing about the consistency of write operations involving multiple objects: a system may well be CP (i.e., prefer consistency over availability) and still provide only atomic single-object writes.
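To make the distinction concrete, here is a deliberately simplified, single-process sketch. The `AtomicStore` class is hypothetical (not the API of any real product); it mimics a store that guarantees atomicity only per key, which is enough for a lone counter or document but not for an operation spanning two keys:

```python
import threading

class AtomicStore:
    """Toy key-value store offering only single-key atomic operations,
    similar in spirit to what many NoSQL stores guarantee by default."""
    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def compare_and_set(self, key, expected, new):
        # Atomic on a single key: read-compare-write happens as one step.
        with self._lock:
            if self._data.get(key) == expected:
                self._data[key] = new
                return True
            return False

store = AtomicStore()
store.compare_and_set("balance:alice", None, 100)

# Transferring money touches TWO keys. Each compare_and_set call is
# atomic on its own, but the pair is not: a crash between the two calls
# leaves the data inconsistent. Closing exactly that gap is what
# multi-object ACID transactions are for.
ok = store.compare_and_set("balance:alice", 100, 70)
```

A system like this can still be CP in the CAP sense; it simply never promises anything about two keys at once.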
If this is not enough for you, we begin to approach the concept of full-fledged distributed ACID transactions.
Note that even when moving to the brave new world of distributed ACID transactions, we often have to sacrifice something. For example, a number of distributed storage systems offer distributed transactions, but only within a single partition. Or a system may not support the "I" part at the level you need, lacking isolation entirely or offering an insufficient set of isolation levels.
These restrictions were usually made for a reason: to simplify the implementation, to increase performance, or for something else. They are sufficient in a large number of cases, so you should not treat them as drawbacks in themselves.
You need to understand whether these restrictions are a problem for your particular scenario. If not, you have a wider choice and can give more weight to, for example, performance figures or the system's disaster recovery capabilities. Finally, do not forget that in a number of systems these parameters can be tuned, to the point that the same system can be CP or AP depending on configuration.
If a product leans toward CP, it usually takes one of three approaches. The first is a quorum approach to reading data. The second is dedicated nodes that are the primary owners of records: all data changes pass through them, and during network problems, if these master nodes cannot answer, the data is considered unobtainable. The third is arbitration, where an external highly available component (for example, a ZooKeeper cluster) decides which cluster segment is primary, holds the current version of the data, and can validly serve requests.
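The quorum approach rests on a simple overlap argument: with N replicas, requiring W acknowledgements per write and R responses per read such that R + W > N guarantees that every read quorum intersects the latest write quorum. A minimal in-memory sketch (the `write`/`read` functions are illustrative, not any product's API; real systems must handle replica failures, timeouts, and asynchronous repair):

```python
# Quorum read/write sketch: N replicas, W write acks, R read responses.
N, W, R = 3, 2, 2
assert R + W > N  # the overlap condition that makes reads see the latest write

replicas = [{} for _ in range(N)]  # each replica maps key -> (version, value)

def write(key, value, version):
    acks = 0
    for rep in replicas:
        rep[key] = (version, value)  # in real life some replicas may fail
        acks += 1
        if acks >= W:
            break  # remaining replicas receive the value asynchronously
    return acks >= W

def read(key):
    # Collect R responses; the overlap with the write quorum guarantees
    # at least one of them carries the latest version, which wins.
    responses = [rep[key] for rep in replicas if key in rep][:R]
    return max(responses)[1] if responses else None

write("x", "v1", 1)
write("x", "v2", 2)
```

With W = N you get the master-owner behavior described above as a degenerate case; lowering W and R trades consistency for availability.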
Finally, if we are interested not just in CP but in support for full-fledged distributed ACID transactions, then there is often either a single source of truth, for example centralized disk storage, with the nodes acting, in fact, only as caches over it that can be invalidated at commit time, or a multi-phase commit protocol is applied.
The first, shared-disk approach simplifies the implementation and gives low latency on distributed transactions, but trades this for very limited scalability on write-heavy workloads.
The second approach gives much more freedom to scale and, in turn, is divided into two-phase ( Wikipedia ) and three-phase ( Wikipedia ) commit protocols.
Consider the example of a two-phase commit, which is used, for example, by Apache Ignite.
The commit procedure is divided into 2 phases: prepare and commit.
During the prepare phase, a prepare-for-commit message is sent, and each participant takes locks if necessary, performs all operations up to but not including the actual commit, and sends the prepare to its replicas if the product assumes this. If at least one participant refuses for some reason or is unavailable, the data has not actually changed and there is no commit: the participants roll back their changes, release the locks and return to their original state.
During the commit phase, the actual commit message is sent to the cluster nodes. If some nodes are unavailable or answer with an error, by that time the data has already been written to their redo log (since prepare completed successfully), and the commit can in any case be completed, at least in a pending state.
Finally, if the coordinator fails during the prepare stage, the commit is canceled; if it fails during the commit stage, a new coordinator can be elected, and since all nodes have completed prepare, it can verify this and ensure that the commit stage is carried out.
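The control flow above can be sketched in a few lines. This is a single-process toy, not Apache Ignite's implementation: the `Participant` and `two_phase_commit` names are hypothetical, and a real protocol must also persist the log to disk, handle timeouts, and run coordinator recovery.

```python
class Participant:
    def __init__(self, name):
        self.name = name
        self.redo_log = []  # holds prepared work until commit applies it
        self.data = {}

    def prepare(self, tx):
        # Take locks, validate, persist intent; a participant may refuse.
        self.redo_log.append(tx)
        return True

    def commit(self):
        # Apply the prepared changes from the redo log.
        for key, value in self.redo_log.pop():
            self.data[key] = value

    def rollback(self):
        self.redo_log.clear()

def two_phase_commit(tx, participants):
    # Phase 1: prepare. Any refusal or failure aborts the transaction.
    if not all(p.prepare(tx) for p in participants):
        for p in participants:
            p.rollback()
        return False
    # Phase 2: commit. After a successful prepare, each participant can
    # finish from its redo log even if it was briefly unavailable.
    for p in participants:
        p.commit()
    return True

nodes = [Participant("a"), Participant("b")]
ok = two_phase_commit([("x", 1), ("y", 2)], nodes)
```

Note that no participant's `data` changes until every participant has passed prepare; that is exactly the guarantee the prose describes.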
Different products have their own implementation details and optimizations. For example, some products can, in certain cases, reduce a 2-phase commit to a 1-phase one, gaining significantly in performance.
The key conclusion: distributed data storage is a fairly mature market, and its products can provide high data consistency.
At the same time, products in this category sit at different points on the consistency scale, from fully AP products without any transactionality to CP products that additionally offer full ACID transactions. Some products can be configured in either direction.
When choosing, you need to take into account the needs of your case and understand well what sacrifices and compromises you are willing to make, because nothing comes for free: choosing one thing, you will most likely give up something else.
When evaluating products from this angle, pay attention to the following:
where they stand in the CAP theorem;
whether they support distributed ACID transactions;
what restrictions they impose on distributed transactions (for example, only within a single partition, etc.);
the convenience and efficiency of using distributed transactions, and their integration with other components of the product.