MySQL high availability at GitHub

GitHub uses MySQL as its main data store for everything that is not related to git, so MySQL availability is key to GitHub's normal operation. The site itself, the GitHub API, the authentication system and many other features require database access. We run several MySQL clusters to serve different services and tasks. They are set up according to the classic scheme, with a single writable master node and its replicas. The replicas (the remaining nodes of the cluster) asynchronously replay the master's changes and serve read traffic.
The availability of the masters is critical. Without a master, a cluster cannot accept writes, which means the necessary changes cannot be persisted: committing transactions, recording issues, creating new users, repositories, reviews and much more would simply be impossible.
To accept writes we clearly need an available writable node, the cluster master. But it is just as important to be able to identify, or discover, that node.
If the current master fails, it is important that a replacement server appears promptly and that all services are quickly notified of the change. Total downtime is made up of the time it takes to detect the failure, perform the failover and announce the new master.
Until recently, our HA setup relied on:

- orchestrator for detection and failover;
- VIP and DNS for master discovery.

Under that scheme, clients discovered the writer node by its name, for example mysql-writer-1.github.net. The name resolved to the virtual IP address (VIP) of the master host.
Thus, in the normal case, clients simply resolved the name and connected to the resulting IP address, where the master was already waiting for them.
Consider a replication topology that spans three different data centers.
If the master fails, one of the replicas must be promoted to take its place.
orchestrator detects the failure, promotes a new master, and then reassigns the name/VIP. Clients do not actually know the identity of the master; all they know is the name, and that name must now point to the new node. However, consider the following.
VIP addresses are cooperative: the database servers themselves acquire and own them. To acquire or release a VIP, a server must send an ARP request, and the server currently holding the VIP must release it before the new master can take over the address. This approach has some undesirable consequences:
- In the normal flow, the failover system first contacts the failed master and asks it to release the VIP, and then contacts the new master and asks it to acquire it. But what if the old master is unreachable, or refuses the request to release the VIP? Given that the server is in a failed state, it is unlikely to respond to the request in time, or at all.
  - As a result, we can end up with two hosts claiming the same VIP, and different clients may connect to either of them, depending on the shortest network path.
  - Correct behaviour here depends on two independent servers cooperating, and such a setup is unreliable.
- Even if the old master does cooperate, we waste precious time: the switch to the new master does not happen while we are talking to the old one.
- Even when the VIP is reassigned, existing client connections to the old server are not guaranteed to be broken, and we again risk ending up with two de facto masters.
- In parts of our environment, VIPs are tied to physical location: they are assigned to a particular switch or router. Therefore, we can only reassign a VIP to a server located in the same environment as the original master. In particular, in some cases we cannot assign the VIP to a server in another data center and have to make DNS changes instead.
- DNS changes take longer to propagate. Clients cache resolved names for a preconfigured period of time. A failover that spans data centers implies longer downtime, since it takes more time for all clients to learn about the new master.

These limitations alone were enough to push us towards a new solution, but we also had to take the following into account:

- The masters independently emitted heartbeat packets via the pt-heartbeat service, used to measure replication lag and for throttling. The service had to be started on the newly promoted master and, where possible, stopped on the old one.
- Likewise, the masters independently handled Pseudo-GTID injection. It had to be started on the new master and, ideally, stopped on the old one.
- The new master was made writable. The old master, where possible, had to be marked read_only.

These extra steps added to total downtime and introduced their own failure points and problems.

The solution worked, and GitHub successfully handled MySQL failovers behind the scenes, but we wanted to improve our HA approach in the following ways:

- become independent of specific data centers;
- tolerate data center failures;
- abandon unreliable cooperative workflows;
- reduce total downtime;
- perform lossless failovers wherever possible.

GitHub HA Solution: orchestrator, Consul, GLB

Our new strategy, along with the accompanying improvements, either eliminates the problems described above or mitigates their consequences. The current HA setup consists of the following components:

- orchestrator for failure detection and failover. We use an orchestrator/raft scheme spanning multiple data centers;
- Consul by HashiCorp for service discovery;
- GLB/HAProxy as a proxy layer between clients and the writer nodes. The GLB Director source code is open;
- anycast for network routing.

The new scheme allowed us to drop the VIP and DNS changes entirely. With the new components in place, we can decouple the concerns and simplify the task. We were also able to rely on proven, stable solutions. A detailed analysis of the new setup follows.

Normal flow

Normally, applications connect to the write nodes through GLB/HAProxy.

Applications never learn the master's identity. As before, they use only a name. For example, the master of cluster1 is addressed as mysql-writer-1.github.net. In our current setup, however, this name resolves to an anycast IP address.

With anycast, the name resolves to the same IP address everywhere, but traffic is routed differently depending on the client's location. In particular, several instances of GLB, our highly available load balancer, are deployed in each of our data centers. Traffic to mysql-writer-1.github.net always goes to the GLB cluster of the local data center, so all clients are served by local proxies.

We run GLB on top of HAProxy. Our HAProxy setup provides writer pools: one per MySQL cluster, where each pool contains exactly one backend server (the current master of that cluster). All GLB/HAProxy instances in all data centers have the same pools, and they all point to the same backends. So if an application wants to write to mysql-writer-1.github.net, it does not matter which GLB server it connects to: it will always be routed to the actual master of cluster1.

For applications, discovery ends at GLB, and no re-discovery is ever needed; GLB is responsible for routing traffic to the right place.

Where does GLB get the list of backend servers from, and how do we propagate changes to GLB?
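For illustration, here is a minimal client-side sketch of what this looks like from the application's point of view: the only thing the application knows is the writer name. The client library, port, credentials, schema and table are hypothetical and are not taken from the article.

```python
# Sketch: writing through GLB/HAProxy. The application only knows the writer
# name; which GLB instance answers, and which backend is the current master,
# is not its concern.
import pymysql  # hypothetical choice of MySQL client library

conn = pymysql.connect(
    host="mysql-writer-1.github.net",  # resolves to an anycast IP
    port=3306,                         # hypothetical port
    user="app_user",                   # hypothetical credentials
    password="secret",
    database="app_db",                 # hypothetical schema
    autocommit=True,
)
with conn.cursor() as cur:
    # The write is proxied to the current master of cluster1, wherever it is.
    cur.execute("INSERT INTO events (payload) VALUES (%s)", ("hello",))
conn.close()
```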

Discovery through Consul

Consul is widely known as a service discovery solution, and it can also take on DNS duties. In our case, however, we use it as a highly available key-value (KV) store.

In Consul's KV store we record the identities of the cluster masters. For each cluster there is a set of KV entries pointing to the corresponding master's data: its fqdn, port, ipv4 and ipv6 addresses.

Each GLB/HAProxy node runs consul-template, a service that watches for changes in Consul data (in our case, changes to the master entries). consul-template generates a configuration file and can reload HAProxy when the settings change.

Thanks to this, a change of master identity in Consul is visible to every GLB/HAProxy instance. Based on this information the instance is reconfigured, the new master is listed as the single entity in the cluster's backend pool, and the instance is reloaded for the changes to take effect.

We deployed Consul in each data center, and each deployment is highly available. However, these deployments are independent of each other; they do not replicate or exchange any data.

Where does Consul get information about changes, and how is it propagated between data centers?

orchestrator/raft

We use an orchestrator/raft setup: the orchestrator nodes communicate with each other through raft consensus. We have one or two orchestrator nodes in each data center.

orchestrator is responsible for detecting failures, performing MySQL failovers and propagating the changed master data to Consul. Failover is handled by the single orchestrator/raft leader, but the news that the cluster now has a new master is propagated to all orchestrator nodes through the raft mechanism.

When the orchestrator nodes receive the news about the master change, each of them talks to its local Consul deployment and initiates a KV write. A data center with more than one orchestrator instance will get several (identical) writes to Consul.
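As a rough sketch of the kind of record involved, here is what writing and reading a master entry through Consul's HTTP KV API could look like. The key layout (mysql/master/cluster1), the host values and the use of the Python requests library are assumptions for illustration; the article only states that each entry carries the master's fqdn, port, ipv4 and ipv6.

```python
# Sketch: publishing and reading a cluster master identity in Consul's KV store.
# Key layout and values are hypothetical; only the set of fields follows the text.
import json
import requests

CONSUL = "http://127.0.0.1:8500"   # local Consul agent (assumed address)
KEY = "mysql/master/cluster1"      # hypothetical key layout

master = {
    "fqdn": "mysql-1a.example.net",  # hypothetical host
    "port": 3306,
    "ipv4": "10.0.0.11",
    "ipv6": "fd00::11",
}

# What an orchestrator node would do after a master change: write the entry.
requests.put(f"{CONSUL}/v1/kv/{KEY}", data=json.dumps(master)).raise_for_status()

# What a consumer (for example, the proxy layer) would do: read the current entry.
resp = requests.get(f"{CONSUL}/v1/kv/{KEY}?raw")
resp.raise_for_status()
print(json.loads(resp.text))  # -> {'fqdn': 'mysql-1a.example.net', ...}
```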

A generalized view of the whole flow

When a master fails:

- the orchestrator nodes detect the failure;
- the orchestrator/raft leader initiates recovery, and a new master is promoted;
- orchestrator/raft propagates the master change to all nodes of the raft cluster;
- each orchestrator/raft instance receives the change notification and writes the new master's identity to its local Consul KV store;
- each GLB/HAProxy instance runs consul-template, which notices the change in the Consul KV store, reconfigures and reloads HAProxy (see the sketch after this list);
- client traffic is redirected to the new master.

Each component has a clearly defined responsibility, and the whole design is decoupled and simple. orchestrator does not talk to the load balancers. Consul does not need to know where the information came from. The proxies only work with Consul. Clients only work with the proxies.

Moreover:

- there are no DNS changes to make and propagate;
- no TTLs are involved;
- the flow does not wait for responses from the failed master; in general, it is ignored.
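To make the consul-template step more concrete, here is a simplified watcher loop in the same spirit: it blocks on the KV entry, rewrites the single backend of the writer pool, and triggers a proxy reload. This is only a sketch of the idea; in reality consul-template and HAProxy handle this, and the key name, file path and reload command below are assumptions.

```python
# Sketch: a consul-template-like loop that tracks the master entry and
# reloads the proxy when it changes. Key, paths and reload command are hypothetical.
import base64
import json
import subprocess
import requests

CONSUL = "http://127.0.0.1:8500"
KEY = "mysql/master/cluster1"   # hypothetical key layout
index = None                    # last seen Consul modify index

while True:
    # Blocking query: returns when the key changes or after the wait period.
    params = {"wait": "30s"}
    if index is not None:
        params["index"] = index
    resp = requests.get(f"{CONSUL}/v1/kv/{KEY}", params=params, timeout=60)
    resp.raise_for_status()
    new_index = resp.headers.get("X-Consul-Index")
    if new_index == index:
        continue                # timed out without changes
    index = new_index

    entry = resp.json()[0]
    master = json.loads(base64.b64decode(entry["Value"]))

    # Rewrite the writer pool: a single backend pointing at the current master.
    backend = f"server master {master['ipv4']}:{master['port']}\n"
    with open("/tmp/cluster1_writer.cfg", "w") as f:   # hypothetical path
        f.write(backend)

    # Placeholder for the reload; a real setup would reload HAProxy here.
    subprocess.run(["echo", "reload haproxy"], check=True)
```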

Additional information

To make the flow more robust, we also use the following techniques:

- The HAProxy parameter hard-stop-after is set to a very low value. When HAProxy is reloaded with a new backend server in the writer pool, it automatically terminates all existing connections to the old master.
  - Setting hard-stop-after means we do not have to wait for any action from the clients, and it minimizes the adverse effects of a possible situation with two masters in the cluster. It is important to understand that there is no magic here: in any case, some time passes before the old connections are closed. But there is a point in time after which we can stop expecting unpleasant surprises.
- We do not require Consul to be continuously available. In fact, we only need it to be available at failover time. If Consul is down, GLB keeps working with the last known values and takes no drastic action.
- GLB is configured to validate the identity of the newly promoted master. As with our context-aware MySQL pools, a check confirms that the backend really is writable. If we accidentally delete the master's identity from Consul, there is no problem: the empty entry is ignored. If we mistakenly write the name of another, non-master server to Consul, that is also fine: GLB refuses to update and keeps working with the last valid state.

In the following sections we look more closely at the problems and at our high-availability goals.

Failure detection with orchestrator/raft

orchestrator uses a holistic approach to failure detection, which makes it highly reliable. We do not see false positives and premature failovers are not performed, so unnecessary downtime is avoided.

The orchestrator/raft scheme also copes with complete network isolation of a data center ("data center fencing"). Network isolation of a data center can be confusing: the servers inside the data center can still talk to each other. How do we tell who is really isolated, the servers inside that data center or all the remaining data centers?

In the orchestrator/raft scheme, the raft leader node performs the failover. The leader is the node that receives the support of the majority of the group (a quorum). We deployed our orchestrator nodes in such a way that no single data center can provide a majority, while any n-1 data centers can.

In the event of complete network isolation, the orchestrator nodes of the isolated data center are cut off from their peers in the other data centers. As a result, the orchestrator nodes in the isolated data center cannot become the leader of the raft cluster. If such a node was the leader, it loses that status. A new leader is elected from the nodes of the other data centers; that leader has the support of all the remaining data centers, which can still communicate with each other.

Thus the orchestrator leader is always outside of a network-isolated data center. If the master was located in the isolated data center, orchestrator initiates a failover and replaces it with a server in one of the available data centers. We mitigate the effects of data center isolation by delegating the decision to the quorum of the available data centers.

Faster advertisement

Total downtime can be reduced further by speeding up the announcement of the master change. How do we achieve this?

When orchestrator starts a failover, it considers the set of servers that could be promoted. Taking into account the replication rules, hints and constraints, it is able to make an informed decision about the best course of action.

It can also recognize that a server available for promotion is the ideal candidate, by the following signs:

- there is nothing preventing the server's promotion (and, possibly, the user has recommended this server);
- the server can be expected to take all the other servers as its replicas.

In this case orchestrator first makes the server writable and immediately advertises its promotion (in our case, it writes the entry to the Consul KV store). At the same time, orchestrator asynchronously starts repairing the replication tree, which usually takes a few seconds.

It is likely that by the time our GLB servers are fully reloaded, the replication tree will already be intact, though this is not strictly required. The point is that the server is ready to accept writes!
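A minimal sketch of that ordering, under heavy assumptions: the hosts and credentials are hypothetical, publish_to_consul stands in for the KV write shown earlier, and the replication-tree repair is represented only by a stub. It is meant to show the sequencing (writable first, advertise immediately, repair in the background), not orchestrator's actual implementation.

```python
# Sketch of the "fast advertisement" ordering during a promotion.
# Hypothetical hosts/credentials; repair_replication_tree is a stub.
import threading
import pymysql


def set_writable(host):
    # Make the promoted server accept writes (requires sufficient privileges).
    conn = pymysql.connect(host=host, user="admin", password="secret")  # hypothetical
    with conn.cursor() as cur:
        cur.execute("SET GLOBAL read_only = 0")
    conn.close()


def publish_to_consul(host):
    # Stand-in for the Consul KV write from the earlier sketch.
    print(f"advertised {host} as the new master")


def repair_replication_tree(host):
    # Stub: repointing the replicas usually takes a few seconds and
    # does not need to block the advertisement.
    print(f"repairing replication under {host}")


candidate = "mysql-1b.example.net"       # hypothetical promoted server
set_writable(candidate)                  # 1. writable first
publish_to_consul(candidate)             # 2. advertise immediately
threading.Thread(target=repair_replication_tree,
                 args=(candidate,), daemon=True).start()  # 3. repair asynchronously
```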
Semi-synchronous replication

With semi-synchronous replication, the MySQL master does not acknowledge a transaction commit until the changes are known to have been shipped to one or more replicas. This gives us a way to handle failures without data loss: any change applied on the master has either already been applied on one of its replicas or is waiting to be applied there.

Such consistency comes at a price, because it can reduce availability. If no replica acknowledges receipt of the changes, the master blocks and writes stall. Fortunately, a timeout can be configured after which the master falls back to asynchronous replication and writes resume.

We chose a rather low timeout value: 500 ms. This is more than enough to ship changes from the master to replicas in the local data center, and even to remote data centers. With this timeout we get ideal semi-synchronous behaviour (no fallback to asynchronous replication) and a very short blocking period when no acknowledgement arrives.

We enable semi-synchronous replication on replicas in the local data center, and in the event of a master failure we expect (though do not require) a lossless failover. A lossless failover across a complete data center failure is too expensive, so we do not expect one.

While experimenting with the semi-synchronous timeout, we also found that it lets us influence the choice of the ideal candidate when a master fails. By enabling semi-synchronous mode on the desired servers and marking them as candidates, we can reduce total downtime, because we influence the outcome of the failover. Our experiments show that in most cases we end up with the ideal candidate and can therefore advertise the master change faster.
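As a sketch of the kind of settings involved, assuming the classic rpl_semi_sync_* plugin variables (newer MySQL releases use different names) and hypothetical hosts and credentials:

```python
# Sketch: enabling semi-synchronous replication with a 500 ms acknowledgement
# timeout on a master, and semi-sync on a local-DC replica. Hypothetical hosts;
# assumes the semisync plugins are already installed on both servers.
import pymysql


def run(host, statements):
    conn = pymysql.connect(host=host, user="admin", password="secret")  # hypothetical
    with conn.cursor() as cur:
        for stmt in statements:
            cur.execute(stmt)
    conn.close()


# On the master: enable semi-sync and fall back to async after 500 ms
# without an acknowledgement (the timeout value mentioned in the article).
run("mysql-1a.example.net", [
    "SET GLOBAL rpl_semi_sync_master_enabled = 1",
    "SET GLOBAL rpl_semi_sync_master_timeout = 500",
])

# On a replica in the same data center: acknowledge the master's writes.
run("mysql-1b.example.net", [
    "SET GLOBAL rpl_semi_sync_slave_enabled = 1",
])
```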
Heartbeat injection

Instead of managing the start and stop of the pt-heartbeat service on promoted and demoted masters, we decided to run it everywhere and always. This required some patches so that pt-heartbeat could cope with servers that frequently flip their read_only value or become completely unreachable.

In our current setup, pt-heartbeat services run both on the masters and on their replicas. On masters they generate heartbeat events. On replicas they recognize that the server is read-only and regularly re-check its status. As soon as a server is promoted to master, the pt-heartbeat service on it recognizes it as writable and starts injecting heartbeat events.

Delegating tasks to orchestrator

We also delegated the following tasks to orchestrator:

- Pseudo-GTID injection;
- marking the new master as writable and clearing its replication state;
- marking the old master as read_only, where possible.

This simplifies the tasks around the new master. A node that has just been promoted is obviously expected to be alive and reachable, otherwise we would not have promoted it. It therefore makes sense to let orchestrator apply such changes directly to the newly promoted master.
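A rough sketch of the last two delegated tasks, with hypothetical hosts and credentials; the statements for clearing replication state assume pre-8.0 syntax (STOP SLAVE / RESET SLAVE ALL), and Pseudo-GTID injection is left out.

```python
# Sketch: marking the new master writable (and clearing its replication state)
# and demoting the old master. Hypothetical hosts and credentials.
import pymysql


def execute(host, statements):
    conn = pymysql.connect(host=host, user="admin", password="secret")  # hypothetical
    with conn.cursor() as cur:
        for stmt in statements:
            cur.execute(stmt)
    conn.close()


# Newly promoted master: make it writable and clear its replication state.
execute("mysql-1b.example.net", [
    "STOP SLAVE",
    "RESET SLAVE ALL",
    "SET GLOBAL read_only = 0",
])

# Old master, if it is still reachable: make it read-only.
try:
    execute("mysql-1a.example.net", ["SET GLOBAL read_only = 1"])
except Exception:
    # The old master may well be down; in that case this step is simply skipped.
    pass
```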

Limitations and drawbacks

The proxy layer means that applications cannot learn the identity of the master, but the master cannot identify the applications either: the master only sees connections coming from the proxy layer, and we lose information about the real origin of those connections.

As with any distributed system, we still have unhandled scenarios.

Note that while the data center hosting the master is isolated, applications in that data center can still write to that master. This may lead to inconsistent state once the network is restored. We try to mitigate this two-master situation by implementing STONITH from within the isolated data center itself. As mentioned earlier, some time passes before the old master is shut down, so a short period with two masters cannot be avoided entirely. The operational cost of preventing such situations completely would be very high.

There are other scenarios as well: a Consul outage at failover time, partial data center isolation, and so on. We understand that with distributed systems of this kind it is impossible to close every hole, so we focus on the most important cases.

Results

Our orchestrator/GLB/Consul setup gives us:

- reliable failure detection;
- failovers that are independent of any specific data center;
- lossless failovers in most cases;
- support for network isolation of a data center;
- mitigation of situations with two masters (work in this area continues);
- no reliance on cooperative workflows;
- total downtime of 10-13 seconds in most cases;
  - in rare situations the total downtime reaches 20 seconds, and in the most extreme cases 25 seconds.

Conclusion
The orchestration/proxy/service-discovery scheme uses well-known and trusted components in a decoupled architecture, which simplifies deployment, operation and monitoring. In addition, each component can be scaled independently. We keep looking for ways to improve and continuously test our system.