Databases and Kubernetes (review and video report)

On November 8, in the main hall of HighLoad++ 2018, in the "DevOps and operations" section, the talk "Databases and Kubernetes" was presented. It covers high availability of databases and approaches to fault tolerance both before Kubernetes and with it, as well as practical options for deploying a DBMS in Kubernetes clusters and the solutions that exist for this (including Stolon for PostgreSQL).

Below you will find the video of the talk (about an hour long, much more informative than the article) and the main summary in text form. Let's go!

Theory

This report appeared as an answer to one of the most popular questions that we have been tirelessly asked in different places over the past years: in comments on Habr or YouTube, in social networks, etc. It sounds simple: "Is it possible to run a database in Kubernetes?" We usually answered it with "in general, yes, but...", and the explanations for that "in general" and "but" were clearly lacking, yet they could not fit into a short message.

However, to begin with, I will generalize the question from "a database" to stateful workloads as a whole: a DBMS is only a particular case of stateful solutions (a fuller list is given in the talk).

1. The philosophy of high availability in Kubernetes
Everyone knows the "pets vs cattle" analogy and understands that if Kubernetes is a story from the world of cattle, then a classic DBMS is very much a pet.

And what did the architecture of such a "pet" look like in its "traditional" version? The classic MySQL example is replication across two bare-metal servers with redundant power, disks, network and everything else (including an engineer and various support tools) — everything that helps us be sure that the MySQL process will not die, and that if any component critical for it fails, fault tolerance will be preserved.
Kubernetes, in turn, offers the following building blocks relevant to high availability (a sketch of combining them follows right after this list):

- Controllers. There are many of them, but the two main ones are Deployment (for stateless applications) and StatefulSet (for stateful applications). They contain all the logic of the actions taken when a node fails (i.e. a pod becomes unreachable).
- PodAntiAffinity — the ability to tell certain pods not to land on the same node.
- PodDisruptionBudgets — a limit on the number of pod replicas that may be switched off simultaneously during planned work.
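For illustration, here is a minimal sketch (mine, not from the talk; the names, image and sizes are made up) of how these primitives can be combined for a hypothetical database: a StatefulSet that spreads its pods across nodes with podAntiAffinity, plus a PodDisruptionBudget that protects them during planned maintenance. A matching headless Service named demo-db is assumed to exist.

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: demo-db
spec:
  serviceName: demo-db            # headless Service with this name is assumed
  replicas: 3
  selector:
    matchLabels:
      app: demo-db
  template:
    metadata:
      labels:
        app: demo-db
    spec:
      affinity:
        podAntiAffinity:          # never put two demo-db pods on the same node
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: demo-db
            topologyKey: kubernetes.io/hostname
      containers:
      - name: db
        image: postgres:10        # illustrative image
        volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 10Gi
---
apiVersion: policy/v1beta1        # policy/v1 in newer clusters
kind: PodDisruptionBudget
metadata:
  name: demo-db
spec:
  maxUnavailable: 1               # at most one pod may be evicted during planned work
  selector:
    matchLabels:
      app: demo-db
```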

2. Guarantees of consistency in Kubernetes

How does the familiar single-master failover scheme work? There are two servers (master and standby); the application constantly talks to one of them through a load balancer. What happens in case of a network problem?

The classic split-brain: the application starts writing to both DBMS instances, each of which considers itself the master. To avoid this, keepalived was replaced with corosync — now with three of its instances, so that a quorum can be reached when voting for the master. However, even in this case there are problems: if the failed DBMS instance tries in every possible way to "commit suicide" (remove the IP address, switch the database to read-only), the rest of the cluster does not know what happened to the master. It may well be that the node is in fact still alive and requests still reach it, which means we cannot switch masters yet.

To resolve this situation there is a node isolation mechanism that protects the whole cluster from incorrect operation; this process is called fencing. Its practical essence is that we try, by some external means, to "kill" the machine that has fallen off. The approaches vary: from powering the machine off via IPMI and blocking its port on the switch to calling the cloud provider's API, etc. Only after this operation can the master be switched. This is how the at-most-once guarantee is achieved, which gives us consistency.

How do the Kubernetes controllers behave in a similar situation?

- Deployment: "I was told there should be 3 pods, and now there are only 2 of them — I will create a new one."
- StatefulSet: "The pod is gone? I will wait: either this node comes back, or I am told to kill it," i.e. containers are not re-created on their own (without operator action). This is how the same at-most-once guarantee is achieved.

However, in the latter case fencing is still required: a mechanism is needed that confirms that the node is definitely gone. First, it is very hard to make it automatic (many implementations are required), and second, even worse, it usually kills nodes slowly (reaching IPMI can take seconds, tens of seconds, or even minutes). Few people will wait a minute for the database to switch to a new master. But there is another approach that does not require a fencing mechanism…

I will begin its description outside of Kubernetes. It uses a special load balancer through which the backends talk to the DBMS. Its specific feature is consistency, i.e. protection from network failures and split-brain: it can drop all connections to the current master, wait for synchronization (of the replica) on another node and only then switch to it. I did not find an established term for this approach and call it Consistent Switchover.

The main question with it is how to make it universal, supporting both cloud providers and private installations. For this, proxy servers are added next to the applications. Each of them receives requests from its application (and sends them to the DBMS), and together they form a quorum. As soon as part of the cluster fails, the proxies that have lost quorum immediately drop their connections to the DBMS.

3. Data storage and Kubernetes

The main mechanism is a network disk, Network Block Device (aka SAN), in various implementations for the desired cloud option or bare metal. However, putting a loaded database (for example, MySQL that needs 50,000 IOPS) into the cloud (AWS EBS) will not work because of the latency.

In Kubernetes, for such cases, it is possible to attach a local hard disk — Local Storage. If a failure occurs (the disk stops being available to the pod), we are forced to repair that machine — by analogy with the classic scheme when a single reliable server fails.

Both options (Network Block Device and Local Storage) fall into the ReadWriteOnce category: the storage cannot be mounted in two places (pods). For such scaling you need to create a new disk and attach it to a new pod (there is a built-in K8s mechanism for that), and then fill it with the necessary data (which we have to do ourselves).

If we need the ReadWriteMany mode, implementations of the Network File System (or NAS) are available: for a public cloud these are AzureFile and AWSElasticFileSystem, and for on-premises installations — CephFS and GlusterFS for lovers of distributed systems, as well as NFS.
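As a rough illustration of the Local Storage variant (my own sketch, not from the talk; the class name, node name and paths are made up): a StorageClass with no provisioner and delayed binding, a PersistentVolume pinned to one node, and a ReadWriteOnce claim that a database pod can use.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-ssd
provisioner: kubernetes.io/no-provisioner   # local volumes are created manually or by a provisioner daemon
volumeBindingMode: WaitForFirstConsumer     # bind only once the consuming pod is scheduled
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: db-local-pv-0
spec:
  capacity:
    storage: 100Gi
  accessModes: ["ReadWriteOnce"]            # can only be mounted by pods on one node
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-ssd
  local:
    path: /mnt/disks/ssd0                   # illustrative path on the node
  nodeAffinity:                             # a local PV is pinned to its node
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values: ["node-1"]
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: db-data
spec:
  storageClassName: local-ssd
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 100Gi
```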

Practice

1. Standalone

This option is about the case when nothing prevents running the DBMS as a separate server with local storage. This is not about high availability, although it may be, to some extent (i.e. sufficiently for the given application), implemented at the hardware level. There are many cases for such use. First of all, these are various staging and dev environments, but not only: secondary services also fall here, for which being down for 15 minutes is not critical. In Kubernetes this is implemented as a StatefulSet with a single pod.

2. Replicated pair with manual switching

A StatefulSet is used again, but the general scheme is as follows: if the main node fails, a replica (standby) remains, to which we can switch traffic (a sketch of the switching mechanism is shown below). At the same time — even before switching traffic — it is important not to forget not only to remove requests to the DBMS from the mysql service, but also to go to the DBMS manually and make sure that all connections are finished (kill them), and also to go to the second DBMS node and reconfigure replication in the opposite direction.

If you currently use the classic variant with two servers (master + standby) without automatic failover, this solution is its equivalent in Kubernetes. It is suitable for MySQL, PostgreSQL, Redis and other products.
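A minimal sketch of the idea from variant 2 (my own assumption of how it is commonly wired, not taken from the talk; the names are illustrative): the application always talks to an ordinary Service, and which DBMS pod receives the traffic is controlled by a label that the engineer moves by hand during the switchover.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: mysql                  # the name the application connects to
spec:
  selector:
    app: mysql
    role: master               # traffic goes to whichever pod carries this label
  ports:
  - port: 3306
    targetPort: 3306
```

During a manual switchover you first remove the `role: master` label from the old pod (so the Service stops sending requests there), then kill the remaining connections in the DBMS, reconfigure replication in the opposite direction, and only then put the label on the second pod — exactly the sequence described above.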

3. Scaling the read load

Strictly speaking, this case is not stateful, because we are talking only about reads. Here the main DBMS server stays outside the scheme under consideration, while inside Kubernetes a "farm of slave servers" is created that are available only for reading. The general mechanism is using init containers to fill each new pod of this farm with DBMS data (using a hot dump, or a regular one with additional actions, etc. — it depends on the DBMS used). To make sure that each instance does not lag too far behind the master, you can use liveness probes.
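A sketch of these mechanics under my own assumptions (the tools image, the script names and the lag threshold are invented for illustration): an init container pulls a fresh copy from the master before the replica starts, and a liveness probe restarts the pod if replication lags too far behind.

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mysql-readonly
spec:
  serviceName: mysql-readonly
  replicas: 5                                              # the "farm of slave servers"
  selector:
    matchLabels:
      app: mysql-readonly
  template:
    metadata:
      labels:
        app: mysql-readonly
    spec:
      initContainers:
      - name: clone-from-master
        image: registry.example.com/mysql-tools:latest     # hypothetical image with a clone script
        command: ["/scripts/clone-from-master.sh"]         # hypothetical: takes a hot dump from the master
        volumeMounts:
        - name: data
          mountPath: /var/lib/mysql
      containers:
      - name: mysql
        image: mysql:5.7
        volumeMounts:
        - name: data
          mountPath: /var/lib/mysql
        livenessProbe:
          exec:
            command: ["/scripts/check-replication-lag.sh", "30"]  # hypothetical: fail if lag > 30s
          initialDelaySeconds: 60
          periodSeconds: 30
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 50Gi
```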

4. Smart client

If you make a StatefulSet of three memcached instances, a special service is available in Kubernetes that will not balance requests but will give each pod its own domain. A client will be able to work with them if it can do sharding and replication itself.

You don't have to go far for an example: session storage in PHP (in memcached) works this way out of the box. On each session request, queries are sent to all servers simultaneously, after which the most recent answer is selected (and writes are handled similarly).
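The "special service" mentioned here is a headless Service. A minimal sketch of it together with the memcached StatefulSet (names and image tag are illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: memcached
spec:
  clusterIP: None              # headless: no balancing, each pod gets its own DNS name
  selector:
    app: memcached
  ports:
  - port: 11211
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: memcached
spec:
  serviceName: memcached
  replicas: 3
  selector:
    matchLabels:
      app: memcached
  template:
    metadata:
      labels:
        app: memcached
    spec:
      containers:
      - name: memcached
        image: memcached:1.5
        ports:
        - containerPort: 11211
```

A smart client then addresses the pods individually by stable names such as memcached-0.memcached, memcached-1.memcached and so on, doing the sharding and replication itself (as PHP does for sessions when given the full list of servers).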

5. Cloud native solutions

There are many solutions that are designed for node failures from the start, i.e. they can do failover and node recovery themselves and provide consistency guarantees. This is far from a complete list; a few popular examples are shown on a slide in the talk.

All of them are simply put into a StatefulSet, after which the nodes find each other and form a cluster. The products themselves differ in how they implement three things:

- How do nodes learn about each other? For this there are methods such as the Kubernetes API, DNS records, static configuration, specialized (seed) nodes, third-party service discovery… (a small illustration of the Kubernetes API variant follows after this list).
- How does a client connect? Through a load balancer that distributes requests across the nodes, or the client has to know about all the nodes and decides itself how to proceed.
- How is horizontal scaling done? Not at all, fully, or with difficulty/limitations.

Regardless of how these questions are answered, all such products work well with Kubernetes, because they were originally created as "cattle".
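One practical detail for the first question (discovery through the Kubernetes API): the pods need permission to ask the API about their peers. A typical way to grant it — my generic sketch, not tied to any specific product; names and namespace are illustrative — is a ServiceAccount with a Role allowing read access to pods and endpoints, referenced from the StatefulSet's pod spec via serviceAccountName.

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: db-cluster
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: db-cluster-peer-discovery
rules:
- apiGroups: [""]
  resources: ["pods", "endpoints"]
  verbs: ["get", "list", "watch"]       # enough to find the other members of the cluster
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: db-cluster-peer-discovery
subjects:
- kind: ServiceAccount
  name: db-cluster
  namespace: default                    # adjust to the namespace you deploy into
roleRef:
  kind: Role
  name: db-cluster-peer-discovery
  apiGroup: rbac.authorization.k8s.io
```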

6. Stolon for PostgreSQL

Stolon actually allows turning PostgreSQL, a DBMS created as a pet, into cattle. How is this achieved?

- One part of the infrastructure is etcd (other options are available) — a cluster of its instances fits into a StatefulSet.
- Another part of the infrastructure is a StatefulSet with PostgreSQL instances. Besides the DBMS itself, a component called keeper is placed next to each installation; it configures the DBMS.
- Another component is sentinel: it is deployed as a Deployment and watches the cluster configuration. It is the one that decides who will be master and who standby, and writes this information to etcd. The keeper, in turn, reads data from etcd and performs the actions corresponding to the current status on its PostgreSQL instance.
- Another component, deployed as a Deployment and standing in front of the PostgreSQL instances, is proxy — an implementation of the already mentioned Consistent Switchover pattern. These components are connected to etcd, and if this connection is lost, the proxy immediately kills outgoing connections, because from that moment it does not know the role of its server (is it master or standby now?).
- Finally, in front of the proxy instances there is an ordinary LoadBalancer from Kubernetes.
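For orientation, a heavily condensed sketch of how these components map onto Kubernetes objects. It follows the layout of the example manifests in the Stolon repository, but the image tag, flags, names and sizes here are my illustrative assumptions: the etcd StatefulSet itself is not shown (an etcd Service reachable at http://etcd:2379 is assumed), and the keeper additionally needs PostgreSQL credential and listen-address flags from the Stolon documentation, omitted here for brevity.

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: stolon-keeper            # PostgreSQL instances, each with a keeper next to it
spec:
  serviceName: stolon-keeper
  replicas: 2
  selector:
    matchLabels: {component: stolon-keeper}
  template:
    metadata:
      labels: {component: stolon-keeper}
    spec:
      containers:
      - name: keeper
        image: sorintlab/stolon:master-pg10        # illustrative tag
        command: ["stolon-keeper"]
        args: ["--cluster-name=kube-stolon", "--store-backend=etcdv3",
               "--store-endpoints=http://etcd:2379", "--data-dir=/stolon-data"]
        # plus PostgreSQL credential / listen-address flags from the Stolon docs
        volumeMounts:
        - {name: data, mountPath: /stolon-data}
  volumeClaimTemplates:
  - metadata: {name: data}
    spec:
      accessModes: ["ReadWriteOnce"]
      resources: {requests: {storage: 10Gi}}
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: stolon-sentinel          # watches the cluster state and elects the master
spec:
  replicas: 3                    # an odd number, so the sentinels can reach a quorum
  selector:
    matchLabels: {component: stolon-sentinel}
  template:
    metadata:
      labels: {component: stolon-sentinel}
    spec:
      containers:
      - name: sentinel
        image: sorintlab/stolon:master-pg10
        command: ["stolon-sentinel"]
        args: ["--cluster-name=kube-stolon", "--store-backend=etcdv3",
               "--store-endpoints=http://etcd:2379"]
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: stolon-proxy             # the Consistent Switchover implementation
spec:
  replicas: 2
  selector:
    matchLabels: {component: stolon-proxy}
  template:
    metadata:
      labels: {component: stolon-proxy}
    spec:
      containers:
      - name: proxy
        image: sorintlab/stolon:master-pg10
        command: ["stolon-proxy"]
        args: ["--cluster-name=kube-stolon", "--store-backend=etcdv3",
               "--store-endpoints=http://etcd:2379", "--listen-address=0.0.0.0"]
        ports:
        - containerPort: 5432
---
apiVersion: v1
kind: Service                    # the ordinary LoadBalancer in front of the proxies
metadata:
  name: stolon-proxy
spec:
  type: LoadBalancer
  selector: {component: stolon-proxy}
  ports:
  - {port: 5432, targetPort: 5432}
```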
Conclusions

So, is it possible to run a database in Kubernetes? Yes, of course it is possible — in some cases. And if it is expedient, then it is done as shown above (see the Stolon scheme)…

Everyone knows that technology develops in waves. Initially, any new device can be very difficult to use, but over time everything changes: the technology becomes accessible. Where are we heading? Inside, everything will remain the same, but we will no longer need to know how it works. Operators are being actively developed for Kubernetes. So far there are not that many of them and they are not that good, but there is movement in this direction.
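To give a feel for what such operators look like from the user's side, here is a purely hypothetical custom resource: you describe the desired cluster declaratively, and the operator reconciles it. The kind, API group and fields below are invented for illustration and do not refer to any specific existing operator.

```yaml
apiVersion: example.com/v1alpha1   # hypothetical group/version
kind: PostgresCluster              # hypothetical kind handled by a database operator
metadata:
  name: billing-db
spec:
  version: "10"
  replicas: 3                      # the operator creates and maintains the pods, storage and failover
  storage:
    size: 50Gi
    storageClassName: local-ssd
  backups:
    schedule: "0 3 * * *"          # the operator would also take care of backups
```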
Videos and slides

Video from the talk (about an hour):