Optimal shard placement in a petabyte-scale Elasticsearch cluster: linear programming

At the heart of Meltwater's and Fairhair.ai's information retrieval systems lies a collection of Elasticsearch clusters containing billions of articles from news media and social media.


Index shards in our clusters vary greatly in access pattern, workload, and size, which raises some very interesting problems.


In this article we describe how we applied linear programming (linear optimization) to distribute the search and indexing workload as evenly as possible across all nodes in the clusters. This reduces the chance that any single node becomes a bottleneck in the system. As a result, we increased search speed and saved on infrastructure.

Our indexes are organized by time, similar to an ELK stack.


This index architecture gives a number of advantages. For example, you can perform efficient bulk indexing and delete entire indexes when data ages out. It also means that the workload for a given index varies greatly over time.
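For illustration, here is a minimal sketch of time-based index management; the index naming scheme and retention helper are invented for this example, not Meltwater's actual code:

```python
from datetime import date, timedelta

def index_name(day):
    """Name an index after its day, e.g. 'articles-2018.11.13' (naming is hypothetical)."""
    return f"articles-{day:%Y.%m.%d}"

def expired_indexes(today, existing, retention_days):
    """Indexes older than the retention window can be dropped whole, which is
    far cheaper than deleting individual documents."""
    cutoff = index_name(today - timedelta(days=retention_days))
    # Zero-padded date names sort chronologically, so string comparison works.
    return [name for name in existing if name < cutoff]

today = date(2018, 11, 13)
names = [index_name(today - timedelta(days=d)) for d in range(5)]
print(expired_indexes(today, names, retention_days=3))  # ['articles-2018.11.09']
```

Deleting a whole index is a single metadata operation, which is why this pattern is so common for time-series data.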


The newest indexes receive exponentially more queries than the old ones.


Fig. 1. Access pattern for time-based indexes. The vertical axis shows the number of queries executed; the horizontal axis shows the age of the index. Weekly, monthly, and annual plateaus are clearly visible, followed by a long tail of lower workload on older indexes


The patterns in Fig. 1 are quite predictable, as our clients are most interested in the latest information and regularly compare the current month with the previous one and/or this year with the last. The problem is that Elasticsearch knows nothing about this pattern and performs no automatic optimization for the observed workload!


Elasticsearch's built-in shard placement algorithm takes into account only two factors:


The number of shards on each node. The algorithm tries to evenly balance the number of shards per node throughout the cluster.

Free disk space. Elasticsearch considers the available disk space on a node before deciding whether to allocate new shards to that node or move shards away from it. Once 80% of the disk is used, placing new shards on the node is prohibited; at 90% the system actively moves shards off the node.
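That decision logic can be sketched as follows; the thresholds are the ones described above, while the function itself is ours, not an Elasticsearch API:

```python
def allocation_decision(disk_used_fraction):
    """Mimic the disk-watermark behavior described above: above 80% usage a node
    receives no new shards; above 90% shards are actively moved off it."""
    if disk_used_fraction >= 0.90:
        return "relocate-shards-away"
    if disk_used_fraction >= 0.80:
        return "no-new-shards"
    return "allocation-allowed"

for used in (0.50, 0.85, 0.95):
    print(f"{used:.0%}: {allocation_decision(used)}")
```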


The fundamental assumption of the algorithm is that every shard in the cluster receives roughly the same workload and that all shards are the same size. In our case this is very far from the truth.


The default shard distribution therefore quickly leads to hot spots in the cluster, which appear and disappear at random as the workload changes over time.


A hot spot is essentially a node operating near the limit of one or more system resources, such as CPU, disk I/O, or network bandwidth. When this happens, the node first queues requests for a while, which increases response times. But if the overload lasts long enough, requests are eventually rejected and users get errors.


Another common consequence of overload is excessive JVM garbage-collection pressure from queries and indexing operations, leading to the dreaded "JVM garbage collector hell". In that situation the JVM either cannot reclaim memory fast enough and crashes with an out-of-memory error, or gets stuck in an endless garbage-collection loop, freezing and no longer responding to requests or to cluster pings.


The problem worsened when we refactored our architecture to run on AWS. Previously we were saved by the fact that we ran up to four Elasticsearch nodes per powerful server (24 cores) in our own data center. This masked the effect of the asymmetric shard distribution: the load was largely smoothed out by the relatively large number of cores per machine.


After the refactoring we placed only one node on each less powerful machine (8 cores), and the first tests immediately revealed major problems with hot spots.


Elasticsearch assigns shards at random, and with more than 500 nodes in a cluster the likelihood of too many hot shards landing on a single node increased greatly; such nodes quickly became overloaded.
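A quick simulation illustrates the effect; the node count comes from the text, while the number of hot shards is an invented illustrative figure:

```python
import random
from collections import Counter

def max_hot_shards_on_one_node(n_nodes, n_hot_shards, seed=42):
    """Place hot shards on nodes uniformly at random, as Elasticsearch
    effectively does, and report how many land on the busiest node."""
    rng = random.Random(seed)
    per_node = Counter(rng.randrange(n_nodes) for _ in range(n_hot_shards))
    return max(per_node.values())

average = 1000 / 500  # 2 hot shards per node if perfectly balanced
worst = max_hot_shards_on_one_node(n_nodes=500, n_hot_shards=1000)
print(f"average per node: {average:.0f}, busiest node: {worst}")
```

With random placement, the busiest node predictably carries several times the average hot-shard count, which is exactly the hot-spot pattern we saw.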


For users this would mean a serious degradation of service, since overloaded nodes respond slowly and sometimes reject requests outright or crash. Taking such a system to production would mean frequent, seemingly random UI slowdowns and random timeouts.


At the same time, many nodes hold shards with no particular load and are effectively idle. This leads to inefficient use of our cluster resources.


Both problems could be avoided if Elasticsearch distributed shards more intelligently, since the average system-resource usage across all nodes sits at a healthy 40%.


### Continuous cluster change

While operating more than 500 nodes we observed one more thing: the constant change of node state. Shards move back and forth between nodes continuously under the influence of the following factors:


New indexes are created, and old ones are discarded.

Disk watermarks are triggered by indexing and other changes to shards.

Elasticsearch randomly decides that a node has too few or too many shards compared to the cluster average.

Hardware failures and OS-level crashes lead to new AWS instances being launched and joining the cluster. With 500 nodes this happens several times a week on average.

New nodes are added almost every week due to normal data growth.


Given all this, we concluded that a comprehensive solution to all these problems requires a continuous, dynamic re-optimization algorithm.


### Solution: Shardonnay

After a long study of the available options, we decided that we wanted to:


Build our own solution. We did not find good articles, code, or other existing ideas that would work well at our scale and for our tasks.

Run the rebalancing process outside Elasticsearch, using the cluster reroute API, rather than trying to create a plugin. We wanted a fast feedback loop, and deploying a plugin on a cluster of this size can take weeks.

Use linear programming to calculate optimal shard moves at any point in time.

Run the optimization continuously so that the cluster state gradually converges to the optimum.

Do not move too many shards at the same time.


We noticed an interesting thing: if you move too many shards at once, it is very easy to trigger a *cascading shard-movement storm*. Once such a storm starts, it can go on for hours, with shards moving back and forth uncontrollably and triggering disk-space watermarks in various places. That, in turn, causes new shard movements, and so on.


To understand what is happening, it is important to know that when an actively indexed shard is moved, it actually starts using much more space on the disk it is moving from. This is due to the way Elasticsearch retains transaction logs. We have seen cases where a shard doubled in size during a move. This means that a node that initiated shard movements because of high disk usage will for some time use even more disk space, until enough shards have moved to other nodes.
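A back-of-the-envelope sketch of that effect; the 2x growth factor reflects the worst cases we saw, and actual growth depends on transaction-log size:

```python
def peak_disk_gb(shard_sizes_gb, moving_indices, growth_factor=2.0):
    """Estimate peak disk usage on the source node while shards relocate.
    An actively indexed shard can temporarily grow (up to ~2x in our
    experience) because its transaction log is retained until the move
    completes."""
    steady = sum(shard_sizes_gb)
    extra = sum(shard_sizes_gb[i] * (growth_factor - 1.0) for i in moving_indices)
    return steady + extra

sizes = [100.0, 80.0, 60.0]  # GB per shard, illustrative numbers
print(peak_disk_gb(sizes, moving_indices=[0]))  # moving the 100 GB shard: 340.0
```

So a node already at its disk watermark can be pushed further over it by the very move meant to relieve it, which is what fuels the storm.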


To solve this problem we developed a service called *Shardonnay*, in honor of the famous Chardonnay grape variety.


### Linear optimization

Linear optimization (or linear programming, LP) is a method of achieving the best result, such as maximum profit or lowest cost, in a mathematical model whose requirements are represented by linear relationships.


The method is based on a system of linear variables, a set of constraints that must be satisfied, and an objective function that defines what a successful solution looks like. The goal of linear optimization is to find the values of the variables that minimize the objective function while respecting the constraints.
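In standard form, a linear program looks like this:

```latex
\begin{aligned}
\text{minimize}   \quad & c^{T} x \\
\text{subject to} \quad & A x \le b, \\
                        & x \ge 0,
\end{aligned}
```

where $x$ is the vector of decision variables, $c$ defines the objective, and $A$ and $b$ encode the linear constraints. In our case the variables are additionally restricted to $\{0, 1\}$, making the model a binary (integer) program.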


### Shard distribution as a linear optimization problem

Shardonnay runs continuously; on each iteration it performs the following algorithm:


Using the Elasticsearch API, retrieves information about existing shards, indexes, and nodes in the cluster, as well as their current placement.

Models the cluster state as a set of binary LP variables. Each (node, index, shard, replica) combination gets its own variable. The LP model also contains a number of carefully designed heuristics, constraints, and the objective function; see below.

Sends the LP model to a linear solver, which produces an optimal solution given the constraints and the objective function. The solution is a new assignment of shards to nodes.

Interprets the LP solution and converts it into a sequence of shard moves.

Instructs Elasticsearch to perform the moves via the cluster reroute API.

Waits for the cluster to finish moving the shards.

Returns to step 1.
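Step 2 of the loop above can be sketched as follows; the identifiers and the tiny example cluster are invented for illustration and are not Meltwater's actual model:

```python
from itertools import product

def build_lp_variables(nodes, shard_copies):
    """One binary variable per (node, shard copy) combination: variable k set
    to 1 means that copy is placed on that node, 0 means it is not."""
    return {placement: k for k, placement in enumerate(product(nodes, shard_copies))}

nodes = ["node-1", "node-2"]
shard_copies = [
    ("articles-2018.11", 0, "primary"),  # (index, shard number, copy)
    ("articles-2018.11", 0, "replica"),
]
variables = build_lp_variables(nodes, shard_copies)
print(len(variables))  # 2 nodes x 2 shard copies = 4 binary variables
```

The full model then adds constraints over these variables, e.g. each shard copy sums to exactly one node; this enumeration is also where the variable count explodes at scale, as discussed below.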


The main thing is to design the right constraints and objective function; the rest is done by the LP solver and Elasticsearch.


Not surprisingly, this turned out to be a very difficult problem for a cluster of this size and complexity!


### Constraints

We base some constraints in the model on rules dictated by Elasticsearch itself, for example: always respect the disk watermarks, and never place a replica on the same node as another copy of the same shard.


Others were added based on experience gained over years of working with large clusters. Here are some examples of our own constraints:


Do not move today's indexes, as they are the hottest and receive almost constant read and write load.

Prefer moving smaller shards, because Elasticsearch handles them faster.

Create and place future shards a few days before they become active, start being indexed, and come under heavy load.
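Heuristics like the first two can also act as simple pre-filters on candidate moves before the LP is even built; a sketch, with field names and sizes invented:

```python
from datetime import date

def move_candidates(shards, today):
    """Apply two of the heuristics above: pin today's index (it is the
    hottest), and order the rest so that smaller shards, which Elasticsearch
    moves faster, come first."""
    movable = [s for s in shards if s["day"] != today]
    return sorted(movable, key=lambda s: s["size_gb"])

today = date(2018, 11, 13)
shards = [
    {"day": today,              "size_gb": 50},  # pinned: today's index
    {"day": date(2018, 11, 12), "size_gb": 40},
    {"day": date(2018, 11, 1),  "size_gb": 10},
]
print([s["size_gb"] for s in move_candidates(shards, today)])  # [10, 40]
```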


### Cost function

Our cost function weighs a number of different factors together. For example, we want to:


minimize the variance of indexing and search workload to reduce the number of hot spots;

keep disk-usage variance to a minimum for stable system operation;

minimize the number of shard moves so that chain-reaction storms, as described above, do not start.
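Such an objective can be sketched as a weighted sum; the weights here are invented, and the real function balances more factors:

```python
from statistics import pvariance

def cost(search_load_per_node, disk_used_per_node, n_moves,
         w_load=1.0, w_disk=1.0, w_moves=0.1):
    """Weighted sum of the three factors above: load variance, disk-usage
    variance, and the number of shard moves in the plan."""
    return (w_load * pvariance(search_load_per_node)
            + w_disk * pvariance(disk_used_per_node)
            + w_moves * n_moves)

balanced = cost([10, 10, 10], [0.5, 0.5, 0.5], n_moves=2)
unbalanced = cost([25, 3, 2], [0.9, 0.3, 0.3], n_moves=0)
print(balanced < unbalanced)  # a balanced plan wins even though it needs moves
```

Note that a true LP objective must be linear, so in the actual model variance-like terms have to be expressed through linear surrogates (e.g. minimizing the maximum deviation from the mean); the weighted-sum structure is the point here.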


### Reducing the number of LP variables

At our scale, the size of these LP models becomes a problem. We quickly realized that a problem with more than 60 million variables cannot be solved in reasonable time. Therefore we applied many optimization and modeling tricks to drastically reduce the number of variables, among them biased sampling, heuristics, divide and conquer, and iterative relaxation and optimization.


Fig. 2. The heat map shows the unbalanced load on the Elasticsearch cluster, visible as the large variance in resource usage in the left part of the graph. Thanks to continuous optimization, the situation gradually stabilizes


Fig. 3. The heat map shows CPU usage on all nodes in the cluster before and after enabling the hotness feature in Shardonnay. A significant change in CPU usage is visible while the workload stays constant


Fig. 4. The heat map shows disk read throughput during the same period as in Fig. 3. Read operations are also more evenly distributed across the cluster


### Results

As a result, our LP solver finds good solutions in minutes, even for our huge cluster. The system thus iteratively improves the cluster state toward the optimum.


And best of all, the variance of workload and disk usage converges as expected, and this near-optimal state has been maintained through the many intentional and unexpected changes in cluster state since then!


We now maintain a healthy workload distribution in our Elasticsearch clusters, all thanks to linear optimization and our service, which we lovingly call Shardonnay.



Author: weber
Published: 13-11-2018, 18:18
Category: Algorithms / High performance / Mathematics