Using Consul to scale stateful-services
On September 2? we held our first non-standard MTP for developers of highly loaded systems. It was very cool, a lot of positive feedback on the reports and therefore decided not only to lay out them, but to decipher for Habr. Today we publish a speech by Ivan Bubnov, DevOps from BIT.GAMES. He told about the implementation of the Diskul-service Consul in the already working high-loaded project for the possibility of rapid scaling and failover of stateful-services. And also about the organization of a flexible namespace for backend applications and pitfalls. Now the word to Ivan.
I administer the production infrastructure in the studio BIT.GAMES and tell the story of the introduction of the consul from Hashicorp to our project "Guild of Heroes" - fantasy RPG with asynchronous pvp for mobile devices. We are released on Google Play, App Store, Samsung, Amazon. DAU about ???? online from 10 to 13 thousand. The game is done on Unity, so the client is writing to C # and we use our own BHL scripting language for game logic. The server part is written to Golang (switched to it with PHP). Next - the schematic architecture of our project.
In fact, there are a lot more services, only the fundamentals of gaming logic.
So, what we have. From Stateless-services it is:
nginx, which we use in the role of Frontend and Load Balancers and we distribute our clients on our backends by weight;
gamed - backends, compiled applications from Go. This is the central axis of our architecture, they perform the lion's share of the work and communicate with all the other backend services.
From the Stateful-services, the main ones are:
Redis, which we use to cache "hot" information (we also use it to organize an in-game chat and store notifications for our players);
Percona Server for Mysql - a store of persistent information (probably the biggest and slowest in any architecture). We use the MySQL fork and now we'll talk more about it.
In the design process, we (like everyone else) hoped that the project would be successful and envisaged a mechanism for shading. It consists of two MAINDB database entities and the shards themselves.
MAINDB is a kind of table of contents - it stores information about exactly what shard contains data about the progress of the player. Thus, the complete chain of information retrieval looks like this: the client accesses the frontend, which in turn redistributes it by the weighting factor to one of the backends, the backend goes to MAINDB, localizes the player's shard, and then fetches the data directly from the shard itself.
But when we designed it, it was not a big project, so we decided to make shards shards only nominally. They were all on the same physical server and rather it is database partitioning within a single server.
For redundancy, we used the classic master slave replication. This was not a very good solution (I will tell you a little later why), but the main disadvantage of this architecture was that all our backends knew about other backend-services exclusively by IP-addresses. And in the case of another ridiculous accident in the data center of the type " sorry, our engineer touched the cable on your server while servicing the other one and we were very much understanding why your server does not contact "We needed considerable gestures from us. First, it is a reassembly and pre-salvation of backends from the IP of the backup server for the place of the one that failed. Secondly, after the incident, it is necessary to restore our master from backup'a from the reserve, because it was in an inconsistent state and bring it into a consistent state using the same replication. After that, we again recompiled the backends and reloaded again. All this, of course, caused downtime.
The time has come when our technician (for which he thanked him a great deal) said: "Guys, enough to endure, we need to change something, let's look for exits." First of all, we wanted to achieve a simple, understandable, and most importantly - an easily managed process of scaling and migration from place to place of our databases if necessary. In addition, we wanted to achieve high availability by automating failover.
The central axis of our research was the Consul from Hashicorp. Firstly, we were advised, and secondly, we were very attracted by its simplicity, friendliness and an excellent stack of technology in one box: discovery-service with healthcheckes, key-value storage and the most important thing we wanted to use is DNS, which would resolve us the addresses from the domain service.consul.
Consul also provides an excellent Web UI and REST API for managing all of this.
As for high availability - for auto-failover we chose two utilities:
MHA for MySQL
In the case of MHA for MySQL, we poured agents onto nodes with databases, and they monitored their state. At the wizard's fayle, there was a certain timeout, after which a stop-check was made to maintain consistency and our backup wizard from the appeared master in a non-consistent state did not pick up the data. And we wrote a web hook to these agents, which registered there a new IP backup master in the Consul itself, after which it got to issue DNS.
With Redis-sentinel everything is even easier. Since he himself performs the lion's share of the work, all we had to do was take into account in the healthcheck that Redis-sentinel would be held exclusively at the master node.
At first everything worked perfectly, like a clock. We did not have any problems on the test bench. But it was worthwhile to move into the natural environment of data transfer of the loaded data center, to recall some OOM-kills (this is out of memory, in which the process is being killed by the kernel of the system) and restore the service or more sophisticated things that affect the availability of the service - as we immediately received a serious risk of false positives or even a lack of guaranteed triggering (if in an attempt to escape from false positives to twist some checks).
First of all, everything depends on the difficulty of writing the right healthcheck'ov. It seems that the task is rather trivial - check that the service is running on your server and the ping-pong port. But, as further practice showed, writing a healthcheck when implementing the Consul is an extremely complex and time-consuming process. Because very many factors that affect the availability of your service in the data center can not be foreseen - they are detected only after a certain time.
In addition, the data center is not a static structure, into which you are filled up and it works as intended. But we unfortunately (or fortunately) only learned about this later, but for the time being we were inspired and full of confidence that we will implement everything on production.
As for the scaling, I will say briefly: we tried to find a ready-made bicycle, but all of them are designed for certain architectures. And, as in the case of Jetpants, we could not meet the conditions that it imposed on the architecture of a persistent storage of information.
Therefore, we thought about our own scripted binding and postponed this issue. We decided to act consistently and start with the introduction of the Consul.
Consul is a decentralized, distributed cluster that operates on the basis of the gossip protocol and Raft's consensus algorithm.
We have an independent balance of five servers (five - to avoid the situation with the split-brain). On each node, we distribute Consul-Agent in agent mode and we distribute all healthcheckes (that is, there was not such that on some server we fill one healthcheck'y, and on certain servers - others). Healthcheck'i were written so that they passed only where there is a service.
We also used one more utility so that we did not have to learn our backend to resolve addresses from some particular domain on a non-standard port. We used Dnsmasq - it provides an opportunity to transparently resolve the addresses we need (which in the real world, so to speak, do not exist, but exist only within the cluster) on cluster nodes. Prepared an automatic script for the filling on Ansible, filled it all in production, podigali namespace, made sure that everything is complete. And, crossing their fingers, perezalili our backends, which were no longer addressed to ip-addresses, but by these names from the domain server.consul.
Everything started from the first time, our joy was no limit. But it was too early to rejoice, because within an hour we noticed that on all the nodes where our backends are located, the load average indicator increased from 0.7 to 1.? which is a rather fatty indicator.
I went to the server to see what was happening and it became obvious that the CPU was eating Consul. Here we started to understand, we began to shamanize with strace (a utility for unix-systems, which allows us to track which syscall performs the process), reset Dnsmasq statistics to understand what is happening on this node and it turned out that we missed a very important point. When we planned to integrate, we missed caching DNS records and it turned out that our backend for every one of its movements was pulled by Dnsmasq, which in turn turned to Consul and it all resulted in nehily 940 DNS queries per second.
The output seemed obvious - just twist ttl and everything will get better. But we could not be fanatical here, because we wanted to implement this structure to get a dynamic, easy-to-manage and fast-changing namespace (so we could not deliver, for example, 20 minutes). We twisted ttl to the maximum optimal values for us, we managed to reduce the query rate per second to 54? but this did not affect the CPU consumption.
Then we decided to get out in a cunning way, with the help of custom hosts-file.
It's good that we had everything for this: an excellent template-system from Consul, which, based on the state of the cluster and the template script, generates a file of any kind, any config is everything you want. In addition, Dnsmasq has an addn-hosts configuration parameter that allows you to use a non-system hosts file as the same additional hosts file.
What we did, again prepared the script in Ansible, filled in the production and it began to look something like this:
There was an additional element and a static file on the disk, which is quickly regenerated. Now the chain looked quite simple: gamed refers to Dnsmasq, and that in turn (instead of pulling the Consul-agent, which will ask the servers where we have this or that node), just looked at the file. This solved the problem with CPU consul consumption.
Now everything began to look, as we planned - absolutely transparent for our production, practically not consuming resources.
We were pretty tormented that day and went home with great apprehension. They were not afraid for nothing, because at night I was awakened by an alert from monitoring and notified that we had a fairly large (albeit short-term) burst of mistakes.
When I was analyzing the logs in the morning, I saw that all the errors of the same kind of unknown host. It was not clear why Dnsmasq can not decry this or that service from the file - it feels like it does not exist at all. To try to understand what is happening, I added a custom metric to the file regeneration - now I knew exactly when it would regenerate. In addition, in the Consul template itself there is an excellent backup option, i.e. you can see the previous state of the regenerated file.
During the day, the incident was repeated several times and it became clear that at some point in time (although it was sporadic, unsystematic in nature), we regenerated the hosts file without certain services. It turned out that in a particular data center (I will not do anti-advertising) rather unstable networks - because of network flopping, we absolutely unpredictably stopped passing healthcheck'i, or even the nodes were falling out of the cluster. It looked like this:
The node dropped out of the cluster, the Consul agent was immediately notified of this and Consul template immediately regenerated the hosts file without the required service. This was generally unacceptable, because the problem is ridiculous: if the service is unavailable for a few seconds, the timeouts and retrays are set up (they did not connect once, but the second time it happened). We, the new structure in the sales, provoked a situation when the service simply disappears from view and there was no opportunity to connect to it.
We started to think about what to do with this and twist the timeout parameter in the Consul, after which it is identified, after how much the node falls out. Quite a small indicator, we managed to solve this problem, the nodes stopped falling out, but with healthcheck'ami it did not help.
We began to think of picking different parameters for healthcheck'ov, trying to at least somehow understand when and how it happens. But because everything happened sporadically and unpredictably, we could not do it.
Then we went to the Consul template and decided to make a timeout for it, after which it reacts to the change in the state of the cluster. Again here one could not be fanatical, because we could come to a situationand when the result is no better than the classic DNS, when we were aiming for a completely different one.
And then our technical director came to the rescue again and said: "Guys, let's try to give up all this interactivity, we all have no time for research, we need to solve this issue. Let's use simple and understandable things. " So we came to the concept of using the key-value repository as the source for generating the hosts file.
What it looks like: we discard all dynamic healthcheckes, rewrite our template script so that it generates the file based on the data written in the key-value repository. In the key-value repository, we describe our entire infrastructure in the form of the key name (this is the name of the required service) and the key value (this is the name of the node in the cluster). Those. if the node is present in the cluster, we very easily get its IP address and write it to the hosts file.
We all tested it, filled it up in production and it became a silver bullet in a concrete situation. We pretty much got tormented the whole day back to our homes, but we returned already rested, inspired, because the more these problems did not recur and for the whole year in podakshen not repeated. From what I personally conclude that this was the right decision (specifically for us).
So. We finally achieved what we wanted and organized a dynamic namespace for our backends. Further we went in the direction of ensuring high availability.
But the fact is that pretty much frightened the integration of Consul and because of the problems that we faced, we thought and decided that implementing auto-failovers is not such a good solution, because we again risk false positives or failures. This process is opaque and uncontrollable.
Therefore, we went on a simpler (or more complex) way: we decided to leave the failover on conscience from the duty administrator, but gave him in the hands of another additional tool. We replaced the master-slave replication with the Master Wizard replication in the Read only mode. This removes a huge amount of headaches in the failover process - when you drop the wizard, all you need to do is change the value in the k /v storage using a Web UI or command in the API and before that remove the Read only mode on backup wizard.
After the incident is exhausted, the master is connected and automatically comes to an agreed state without any unnecessary actions at all. On this version, we stopped and use it as before - for us it is most convenient, and most importantly as simple as possible, understandable and controlled.
Web-based interface Consul
On the right is a k /v-storage and our services are visible, which we use in the work of gamed; value is the name of the node.
As for the scaling, we started to implement it when the shaddles became already crowded on one server, the bases grew, became slow, the number of players increased, we swapped and before us the task was to divide all shards into their own separate servers.
How it looked: using the XtraBackup utility we restored our backup on a new pair of servers, after which the new master was hung up with a slave to the old one. It came in a consistent state, we changed the key value in the k /v storage from the name of the node of the old wizard to the name of the node of the new wizard. Then (when we thought everything went right and all the gamed with their selections, updates, and inserts went to the new master), it only remained to kill the replication and make the coveted drop database on production, as we all like to do with unnecessary databases.
In this way, we received shards that had departed. The whole process of moving took from 40 minutes to an hour and did not cause any downtime, it was completely transparent to our backends and was itself completely transparent to the players (except that once they moved, it became easier and more pleasant for them to play).
As for failover processes, here the switching time is from 20 to 40 seconds plus the response time of the system administrator on duty. That's about the way it looks now.
What I wanted to say in conclusion - unfortunately, our hopes for absolute, comprehensive automation broke down on the harsh reality of the data transfer environment in the loaded data center and random factors that we could not foresee.
Secondly, this once again taught us that it's better to have a simple and tested tit in your sysadmin's hands than a new-fangled self-reacting, self-scraping crane somewhere behind the clouds, that you do not even understand if it is falling apart, or really began to scale.
The introduction of any infrastructure, automation in your production should not cause an unnecessary headache in the staff that serves it; it should not significantly increase the cost of maintaining the production infrastructure - the solution should be simple, understandable, transparent for your customers, convenient and controlled.
Questions from the hall
How do you write k /v with servers - script or you just patch it?
K /v-storage is on our Consul-servers and either we delete something from there, or we refill with the help of http-requests RESTful API or Web UI.
I wanted to convey this report that now for some reason the ideology of automation of everything in the world with fierce fanaticism is haunted without understanding that this sometimes complicates life, and does not simplify.
Why are you balancing between shards through databases, why is not the same Redis?
This is a historically developed decision and I forgot to say that now the concept has slightly changed.
Firstly, we rendered the configs and now we do not need to rewrite the backend. Secondly, we have introduced a parameter into the backends themselves, which are the main chord - they distribute the players. Those. if we have a new essence of the player, he makes a record in MAINDB and then he already saves the data for the shard he chose. And he now chooses it by weight coefficients. And it is very fast and easy to manage and at the moment there is no need to take and use some other technology, because it all works now quickly.
But if we run into a problem, it might be better to use a quick inmemory key-value repository or something else.
What is your base?
We use fork MySQL - Percona server.
And you did not try to unite it in a cluster and at the expense of it to balance? If you had Maria, which is the same MHA for MySQL, he has a Galera.
We were in service with Galera. There was another data center for the "Guild of Heroes" project for Asia and there we used Galera and it often gives a very unpleasant failure, it needs to be periodically raised by hands. Having such experience of using this technology specifically, we do not really want to use it yet.
Again, it is worth noting that the introduction of any technology is not just because you want to do better, but because you have a need for it, you need to get out of some existing situation, or you have already predicted that the situation will soon happen and you choose a specific technology to implement it.
It may be interesting
Corvus Health provides medical training services as well as recruiting high quality health workers for you or placing our own best team in your facility. Check Out: Health Workforce Recruitment
I.T HATCH offers a wide range of IT services including remote access setup, small business servers, data storage solutions, IT strategy services, and more. Check Out: IT strategy services