Health checks and graceful degradation in distributed systems

As always, thanks to Fred Hebert and Sargun Dhillon for reading a draft of this article and offering some invaluable advice.
In her Velocity talk, Tamar Bercovici of Box stressed the importance of health checks during automated database failover. In particular, she noted that monitoring the latency of end-to-end queries is a much better way to determine database health than simple echo testing (pinging).
When a database is deemed unhealthy, the failover automation switches traffic to another node, thereby eliminating the problem. We had to build in some safeguards to prevent constant flip-flopping between nodes and to handle other strange edge cases, so that the automation never runs out of control, but none of that is particularly hard. The real trick in making this work well is knowing when to fail the database over in the first place, i.e. being able to assess the health of the database correctly. Many of the parameters we are used to watching, such as CPU load, lock timeouts and error rates, are secondary signals. None of them actually tells you whether the database can handle client traffic, so basing the failover decision on them can produce both false positives and false negatives. Our health checker instead runs simple queries against the database nodes and uses data about completed and failed queries to assess database health far more accurately.
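As a rough illustration of the idea (not Box's actual implementation), here is a minimal Go sketch: instead of pinging a node, the checker runs a trivial end-to-end query with a deadline and aggregates successes and failures over a window of recent checks. The query text, timeout and threshold are all invented for the example.

```go
package health

import (
	"context"
	"database/sql"
	"time"
)

// queryHealthCheck runs a trivial end-to-end query with a deadline instead of
// a bare TCP ping, so the result reflects the node's ability to serve real
// client traffic. The query text and the timeout are illustrative.
func queryHealthCheck(ctx context.Context, db *sql.DB) bool {
	ctx, cancel := context.WithTimeout(ctx, 500*time.Millisecond)
	defer cancel()

	var one int
	return db.QueryRowContext(ctx, "SELECT 1").Scan(&one) == nil
}

// healthy aggregates the results of recent checks so that a single slow query
// does not trigger a failover. The 80% threshold is made up for the sketch.
func healthy(recent []bool) bool {
	ok := 0
	for _, r := range recent {
		if r {
			ok++
		}
	}
	return len(recent) > 0 && float64(ok)/float64(len(recent)) >= 0.8
}
```

The point is that the health signal comes from the same path that real client queries take, rather than from a secondary metric.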
I discussed this with a friend, and he suggested that health checks should be as simple as possible, and that real traffic is the best criterion for assessing the health of a process.
Matt Ranney wrote a phenomenal article about unbounded concurrency and the need for backpressure in Node.js. The article is worth reading in its entirety, but the main takeaway (for me, at least) was the need for a feedback loop between a process and whatever sits in front of it (usually a load balancer, but sometimes another service).
The trick is that once resources are exhausted, something has to give somewhere. Demand keeps growing, but capacity cannot magically increase. To limit incoming work, it helps first of all to set some rate limit at the edge, keyed by IP address, user, session or, ideally, by some attribute that is meaningful to the application. Many load balancers can rate-limit traffic in more sophisticated ways than simply throttling an inbound Node.js server, but they usually do not notice a problem until the process is already in trouble.
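As a sketch of what edge rate limiting keyed on something application-meaningful can look like, here is a per-key token bucket in Go; the choice of key, the rate and the burst are assumptions for the example, not anything prescribed above.

```go
package edge

import (
	"sync"

	"golang.org/x/time/rate"
)

// keyedLimiter keeps one token bucket per key (IP address, user ID, session,
// or whatever is meaningful to the application).
type keyedLimiter struct {
	mu       sync.Mutex
	limiters map[string]*rate.Limiter
	rps      rate.Limit
	burst    int
}

func newKeyedLimiter(rps rate.Limit, burst int) *keyedLimiter {
	return &keyedLimiter{limiters: map[string]*rate.Limiter{}, rps: rps, burst: burst}
}

// Allow reports whether the request identified by key may proceed right now.
// A real implementation would also evict buckets for idle keys.
func (k *keyedLimiter) Allow(key string) bool {
	k.mu.Lock()
	l, ok := k.limiters[key]
	if !ok {
		l = rate.NewLimiter(k.rps, k.burst)
		k.limiters[key] = l
	}
	k.mu.Unlock()
	return l.Allow()
}
```

A static limit like this still says nothing about whether the process behind it is actually coping, which is exactly the gap the next section is about.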
Rate limiting and circuit breaking based on static thresholds and limits can turn out to be unreliable and brittle in terms of both correctness and scalability. Some load balancers (HAProxy in particular) expose a wealth of statistics about the length of internal queues per server and per backend. In addition, HAProxy supports agent checks (agent-check), an auxiliary check independent of the regular health check, which lets the process give the proxy much more accurate and dynamic feedback about its health. From the documentation:
An agent health check is performed by making a TCP connection to the port set by the agent-port parameter and reading an ASCII string. The string is made up of a series of words delimited by spaces, tabs or commas in any order, optionally terminated by \r and/or \n, and including the following elements:

- An ASCII representation of a positive integer percentage, e.g. 75%. Values in this format set the weight proportional to the initial weight of the server configured when HAProxy starts. Note that a zero weight is reported on the stats page as DRAIN, since it has the same effect on the server (it is removed from the LB farm).

- The string maxconn: followed by an integer (no space between). Values in this format set the server's maxconn parameter. The advertised maximum number of connections must be multiplied by the number of load balancers and the different backends using this health check to obtain the total number of connections the server may receive. Example: maxconn:30

- The word ready. This turns the server's administrative state to the READY mode, canceling any DRAIN or MAINT state.

- The word drain. This turns the server's administrative state to the DRAIN mode, after which it will not accept any new connections other than those accepted via persistence.

- The word maint. This turns the server's administrative state to the MAINT ("maintenance") mode, after which it will not accept any new connections at all and health checks are stopped.

- The words down, failed or stopped, optionally followed by a descriptive string after a hash mark (#). All of these mark the server's operational state as DOWN, but since the word itself is shown on the stats page, the difference lets the administrator know whether the situation was expected: the service may have been stopped intentionally, it may be up but failing some validation tests, or it may be considered dead (no process, no response from the port).

- The word up marks the server's operational state as UP if the health checks also confirm that the service is reachable.

Parameters not advertised by the agent are left unchanged. For example, an agent might be designed only to monitor CPU usage and report only a relative weight value, never touching the operational state. Similarly, an agent program could be designed as an end-user interface with three switches, allowing an administrator to change only the administrative state.

However, it is important to consider that only the agent can revert its own actions, so if the server is put into DRAIN mode or DOWN state by the agent, the agent must also perform the equivalent actions to bring the service back into operation.

Failure to connect to the agent is not treated as an error, because connectivity is tested by the regular health check enabled with the check parameter. Be warned, though, that it is not a good idea to stop the agent after it reports "down", since only an agent reporting "up" can re-enable the server.
This kind of dynamic communication between a service and the component in front of it is extremely important for building self-adapting infrastructure. A good example is the architecture I worked with at my previous job.
I used to work at imgix, a real-time image processing startup. Using a simple URL API, images are fetched and transformed in real time and then served anywhere in the world via a CDN. Our stack was quite complex (as described previously), but in short our infrastructure included a routing and load balancing tier (working in tandem with a tier that fetched data from sources), a source caching tier, an image rendering tier and a content delivery tier.

At the heart of the load balancing tier was a service called Spillway, which acted as a reverse proxy and request broker. It was a purely internal service; at the edge we ran nginx and HAProxy in front of Spillway, so it was not designed to terminate TLS or perform any of the countless other functions that usually fall to an edge proxy.
Spillway consisted of two components: the frontend (Spillway FE) and the broker. Although both components initially lived in the same binary, at some point we decided to split them into separate binaries deployed together on the same host, mainly because the two components had different performance profiles and the frontend was almost entirely CPU-bound. The frontend's job was to preprocess each request, including a preliminary lookup in the source caching tier to make sure the image was cached in our data center before sending the transformation request on to a worker.
At any point in time we had a fixed pool (a dozen or so, if memory serves) of workers that could connect to a single Spillway broker. The workers were responsible for the actual image transformations (cropping, resizing, PDF processing, GIF rendering and so on). They processed everything from PDF files hundreds of pages long and GIFs with hundreds of frames to simple image files. Another peculiarity of the workers was that, although all the networking was completely asynchronous, the actual transformations on the GPU itself were not. Given that we operated in real time, it was impossible to predict what our traffic would look like at any given moment. Our infrastructure had to adapt itself to the various shapes of incoming traffic without manual operator intervention.
Given the disparate and bursty traffic patterns we frequently encountered, workers had to be able to refuse incoming requests (even while perfectly healthy) if accepting a connection threatened to overload them. Every request to a worker carried a set of metadata about the nature of the request, which let the worker determine whether it was able to service it. Each worker also kept its own statistics about the requests it was currently handling. A worker used those statistics, together with the request metadata and other heuristics such as socket buffer sizes, to decide whether it was in a position to accept the incoming request. If a worker determined that it could not accept the request, it produced a response not unlike an HAProxy agent check, informing the component in front of it (Spillway) about its health.
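A rough sketch of this kind of admission decision follows; the field names and thresholds are invented for illustration, and the real heuristics at imgix were certainly richer.

```go
package worker

import "sync/atomic"

// requestMeta is the metadata that accompanies each request and lets the
// worker estimate the cost of serving it. The fields are illustrative.
type requestMeta struct {
	Kind      string // "image", "pdf", "gif", ...
	Frames    int    // for GIFs
	Pages     int    // for PDFs
	SourceHot bool   // source image already present in the local cache
}

// workerStats is the worker's local view of the requests it is handling;
// the counters are incremented and decremented elsewhere.
type workerStats struct {
	inFlight      int64 // requests currently being processed
	heavyInFlight int64 // expensive requests (large PDFs, long GIFs)
}

// accept decides whether this worker should take the request. A refusal is
// reported back to Spillway much like an HAProxy agent check would be.
func (s *workerStats) accept(m requestMeta) bool {
	heavy := m.Frames > 100 || m.Pages > 50
	switch {
	case atomic.LoadInt64(&s.inFlight) > 32:
		return false // saturated regardless of the request type
	case heavy && atomic.LoadInt64(&s.heavyInFlight) > 2:
		return false // only a handful of expensive jobs at a time
	default:
		return true
	}
}
```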
Spillway tracked the health of every worker in the pool. A request was first offered to three different workers in turn (preference was given to workers that already had the source image in their local cache and were not overloaded), and if all three refused to accept it, the request was queued in memory at the broker. The broker maintained three kinds of queue: a LIFO queue, a FIFO queue and a priority queue. If all three queues were full, the broker simply rejected the request, allowing the client (HAProxy) to retry after a back-off period. Once a request was placed in one of the three queues, any free worker could pull it off and process it. There are certain subtleties in assigning a priority to a request and deciding which of the three queues (LIFO, FIFO or priority) it should go into, but that is a topic for a separate article.
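A condensed sketch of that dispatch logic is below; the retry count of three comes from the description above, while the per-queue capacity, the priority rule and the queue service order are invented or left out.

```go
package broker

import (
	"errors"
	"sync"
)

var errRejected = errors.New("all workers busy and all queues full")

type request struct {
	priority int
	// payload elided
}

type worker struct{}

// tryAccept stands in for the worker-side admission check; in this stub it
// always refuses.
func (w *worker) tryAccept(r *request) bool { return false }

// broker keeps three bounded in-memory queues; the exact service order
// (LIFO, FIFO, priority) is left out of the sketch.
type broker struct {
	mu       sync.Mutex
	lifo     []*request // served newest-first
	fifo     []*request // served oldest-first
	prio     []*request // latency-sensitive requests
	capacity int        // per-queue bound
}

// dispatch offers the request to up to three candidate workers; if all refuse,
// it parks the request in one of the queues, and if that queue is full the
// request is rejected so the caller (HAProxy) can retry after a delay.
func (b *broker) dispatch(r *request, candidates []*worker) error {
	for i, w := range candidates {
		if i == 3 {
			break
		}
		if w.tryAccept(r) {
			return nil
		}
	}

	b.mu.Lock()
	defer b.mu.Unlock()

	var q *[]*request
	switch {
	case r.priority > 0:
		q = &b.prio
	case r.priority < 0:
		q = &b.lifo
	default:
		q = &b.fifo
	}
	if len(*q) >= b.capacity {
		return errRejected
	}
	*q = append(*q, r)
	return nil
}
```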
We did not need to do anything elaborate with this form of dynamic feedback for it to work effectively. We closely monitored the size of the broker's queues (all three of them), and one of the key Prometheus alerts fired when the queue size exceeded a certain threshold (which happened quite rarely).
Image from my presentation on the Prometheus monitoring system at Google NYC in November 201?

The alert is taken from my presentation on the Prometheus monitoring system at OSCON in May 2017.
Earlier this year, Uber published an interesting article shedding light on its approach to implementing a load shedding layer based on quality of service.
By analyzing the outages of the last six months, we found that 28% of them could have been mitigated or prevented through graceful degradation.
The three most common types of failure were due to the following factors:
- Changes in inbound request patterns, including overload and bad actor nodes.
- Resource exhaustion, such as CPU, memory, I/O loop or network resources.
- Dependency failures, including infrastructure, data stores and downstream services.
We implemented an overload detector based on the CoDel algorithm. For each enabled endpoint, a lightweight request buffer is added (implemented with a goroutine and channels) to track the delay between the moment a request is received from the caller and the moment its processing starts in the handler. Each queue tracks the minimum delay within a sliding time interval, triggering an overload condition if the delay exceeds a configured threshold.
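A highly simplified Go sketch of that mechanism (not Uber's code): each request records when it was put on the endpoint's buffered channel, and the handler goroutine reports the queueing delay to a detector that tracks the minimum over a sliding interval. The window and target values are arbitrary.

```go
package overload

import (
	"sync"
	"time"
)

// detector captures the CoDel intuition: if even the *minimum* queueing delay
// over a recent interval exceeds the target, the endpoint is overloaded
// (a short burst raises the maximum, sustained overload raises the minimum).
type detector struct {
	mu        sync.Mutex
	window    time.Duration // sliding interval, e.g. 100ms
	target    time.Duration // acceptable queueing delay, e.g. 5ms
	minDelay  time.Duration
	windowEnd time.Time
}

func newDetector(window, target time.Duration) *detector {
	return &detector{window: window, target: target, minDelay: -1}
}

// observe records the delay between a request's arrival on the endpoint's
// buffered channel and the moment a handler goroutine picked it up, and
// reports whether the endpoint is currently overloaded.
func (d *detector) observe(queueDelay time.Duration) bool {
	d.mu.Lock()
	defer d.mu.Unlock()

	now := time.Now()
	if now.After(d.windowEnd) {
		// Start a new interval and reset the running minimum.
		d.windowEnd = now.Add(d.window)
		d.minDelay = queueDelay
	} else if d.minDelay < 0 || queueDelay < d.minDelay {
		d.minDelay = queueDelay
	}
	return d.minDelay > d.target
}
```

When observe returns true, the endpoint starts shedding lower-priority requests instead of queueing them.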
However, it is important to remember that unless backpressure propagates along the entire call chain, queues will inevitably build up somewhere in the distributed system. Back in 2013 Google published the famous paper "The Tail at Scale", which touched on a number of causes of latency variability in systems with a high degree of fan-out (queueing being an important one) and described several effective techniques (often involving redundant requests) for mitigating that variability.

Concurrency control within a process forms the basis of load shedding, with each component of the system making decisions based on local data. This helps with scalability by removing the need for centralized coordination, although it does not completely eliminate the need for centralized rate limiting.
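A minimal sketch of in-process concurrency control used for load shedding from purely local data; the limit of 128 and the 503 response are assumptions for the example, not a recommendation.

```go
package shed

import "net/http"

// maxInFlight bounds per-process concurrency; beyond it we shed load rather
// than queue, so backpressure reaches the caller immediately.
const maxInFlight = 128

var slots = make(chan struct{}, maxInFlight)

// withLoadShedding wraps a handler: if no slot is free the request is rejected
// with a 503 so the load balancer can retry elsewhere or back off.
func withLoadShedding(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case slots <- struct{}{}:
			defer func() { <-slots }()
			next.ServeHTTP(w, r)
		default:
			http.Error(w, "overloaded, try again later", http.StatusServiceUnavailable)
		}
	})
}
```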
(The many forms of rate limiting and load shedding)
For those interested in learning more about formal performance modeling based on queueing theory, I would recommend the following materials:
- "Applied Performance Theory", Kavya Joshi, QCon London 2018.
- "Queueing Theory in Practice: Performance Modeling for the Working Engineer", Eben Freeman, LISA 2017.
- "Stop Rate Limiting! Capacity Management Done Right", Jon Moore, Strange Loop 2017.
- "Predictive Load Balancing: Unfair but Faster and More Robust", Steve Gury, Strange Loop 2017.
- The chapters on handling overload and addressing cascading failures from the book "Site Reliability Engineering".
Conclusion
Control loops and backpressure are already solved problems in protocols such as TCP/IP (where congestion control algorithms throttle senders based on observed load), IP ECN (an explicit mechanism for signaling congestion or impending congestion) and Ethernet, with mechanisms such as pause frames.
Coarse-grained health checks may be good enough for orchestration systems, but not for ensuring quality of service and preventing cascading failures in distributed systems. Load balancers need application-level visibility in order to apply backpressure to clients successfully and accurately. Graceful degradation of a service is impossible if its current level of health cannot be determined accurately at any point in time. Without timely and sufficient backpressure, services can quickly descend into a quagmire of failures.