Applications that have reached self-awareness: automated diagnostics in production

The path to understanding Zen begins with developing applications that can monitor themselves: it makes fixing problems in production easier and cheaper. In this article we will see how modern Windows applications can perform self-monitoring, self-diagnostics, and perhaps even self-healing, all without an external agent and without blindly restarting the application. By harnessing the power of ETW for precise low-level monitoring, Windows performance counters for cheap statistics, and the CLRMD library for analyzing the application's own threads, heap, and lock objects, one can move along the path toward self-awareness. All of this will be demonstrated in a series of demos: automatic CPU profiling, investigation of overloaded threads and their stacks, automatic GC monitoring (including object allocation), automatic heap analysis to find memory leaks, and much more. By the end of the article you will have a set of tools and techniques for self-monitoring in your own applications.
 
The article is based on Dina Goldstein's talk "Self-aware applications: automatic production monitoring" at DotNext 2017 Moscow. The slides are available via the link.
 

For this there is LiveStacks, which translates ETW CPU samples into function names. Thanks to this we can reconstruct call stack trees, find the stacks consuming the most CPU, and display the data as flame graphs or in any other convenient form. Moreover, you can even make recommendations about actions the application can take the moment a problem is detected. I will give a few examples, but I am sure you can come up with specific solutions better suited to your business.
 
Suppose your AuthenticationController suddenly starts consuming too much CPU. Are you under a DDoS attack? That assumption already lets you take some measures; which ones exactly is hard for me to say, because I am not a web expert. You could probably close some communication channels or filter certain IP addresses. Another example: some processing mechanism suddenly becomes overloaded. If it runs on a server, it may be time to scale, and it would be especially nice if that scaling could happen automatically. Finally, if you cannot think of anything clever, you can simply write all the available information to the log and send it to the developers. They will be very grateful.
 

Let's see a demonstration of this approach. I have an ASP.NET server called "Los Gatos". All my demonstrations are recorded on video because I did not want anything to interfere during the talk. The site is visible on the left, and on the right is perfmon, the UI for performance counters, which I set up to track CPU time. We see that when I log in to the site, the processor becomes overloaded for a few seconds, after which the load drops again. In this situation the client will not have time to call you so that you can solve the problem with a profiler. It would be nice to get data about it in real time. Let's see how this can be done.


 

One caveat: I am not sure the classes and inheritance in this code are the best possible design and architecture. It was written specifically for the demonstration, so it is probably not worth copying it as-is into your system. In addition, we will not go into too much detail; I am sure you can read the API documentation yourselves. My task is to show what exactly can be achieved and how.


 



 

 


 

Let's look at the CPUMonitor class, which inherits from EventStacksMonitor, which in turn inherits from Monitor. Monitor contains the most basic functionality: at some frequency (in this case once a second) it samples performance counters and checks for violations of the rules we set. If Monitor detects three violations in a row, it triggers what I called "intensive mode" by calling SampleIntensively. SampleIntensively is implemented differently in the child classes, depending on what they track.
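
To make the structure concrete, here is a minimal sketch of such a base class, written for this article rather than taken from the speaker's repository (it is named MonitorBase here to avoid clashing with System.Threading.Monitor): it samples one performance counter per second and switches to intensive mode after three consecutive violations.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

// Minimal sketch of the Monitor idea (not the speaker's actual code):
// sample a performance counter once a second and, after three violations
// in a row, hand control to the derived class's "intensive mode".
abstract class MonitorBase
{
    protected abstract string CounterCategory { get; }   // e.g. "Processor"
    protected abstract string CounterName { get; }       // "PerformanceCounter" in the talk
    protected abstract string InstanceName { get; }      // "PerformanceInstance" in the talk
    protected abstract float CounterThreshold { get; }

    protected abstract void SampleIntensively();         // the "intensive mode"

    public void Run(CancellationToken token)
    {
        using (var counter = new PerformanceCounter(
            CounterCategory, CounterName, InstanceName, readOnly: true))
        {
            counter.NextValue();                          // the first reading is always 0
            int violations = 0;
            while (!token.IsCancellationRequested)
            {
                Thread.Sleep(TimeSpan.FromSeconds(1));
                violations = counter.NextValue() > CounterThreshold ? violations + 1 : 0;
                if (violations >= 3)                      // three in a row: switch to intensive mode
                {
                    SampleIntensively();
                    violations = 0;
                }
            }
        }
    }
}
```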


 

 

  
 

Let's take a closer look at CPUMonitor. Its fields define the rule whose violation we track: CounterThreshold is 90%, PerformanceCounter is "% Processor Time", PerformanceInstance is "_Total". When monitoring the CPU we will also need call stacks; those are obtained by the EventStacksMonitor class.
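
Under the same assumptions, a CPUMonitor built on the MonitorBase sketch above boils down to supplying those values. The threshold, counter and instance names come from the article; the counter category is my assumption.

```csharp
// Sketch of CPUMonitor on top of the MonitorBase sketch above; in the talk it
// derives from EventStacksMonitor, which adds ETW call-stack collection.
class CPUMonitor : MonitorBase
{
    protected override string CounterCategory  => "Processor";        // category name is an assumption
    protected override string CounterName      => "% Processor Time";
    protected override string InstanceName     => "_Total";
    protected override float  CounterThreshold => 90f;                // 90% CPU

    protected override void SampleIntensively()
    {
        // start an ETW CPU-sampling session here; see the LiveSession sketch below
    }
}
```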


 

 


 

 


 

Here we see the implementation of "intensive mode" in the OnIntensiveSamplingStart() function. In it we get the current process and create a LiveSession. I will say more about LiveSession later; it serves as a wrapper over ETW that lets us obtain call stacks. We start the session, wait for some time (five seconds, I believe), stop it, and once the data has been collected we turn to its Stacks.TopStacks property. From it we get the stacks that consumed the most CPU during the session and write them to the log. If you are curious, the Resolve() method translates function addresses in memory into function names, so that developers can later determine the source of the problem.
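
As a sketch only, here is a possible body for the SampleIntensively placeholder in the CPUMonitor sketch above, assuming LiveSession exposes the Start/Stop, Stacks.TopStacks and Resolve() surface described in the talk; the real class lives in the LiveStacks repository and its exact signatures may differ.

```csharp
// Sketch of CPU "intensive mode"; LiveSession is a stand-in for the wrapper
// from the LiveStacks repository, so treat the constructor and member names
// as assumptions.
protected override void SampleIntensively()
{
    int pid = System.Diagnostics.Process.GetCurrentProcess().Id;
    var session = new LiveSession(pid);              // hypothetical constructor

    session.Start();                                 // begin ETW CPU sampling
    System.Threading.Thread.Sleep(TimeSpan.FromSeconds(5));
    session.Stop();

    foreach (var stack in session.Stacks.TopStacks)  // stacks that consumed the most CPU
        Console.WriteLine(stack.Resolve());          // resolve addresses into function names
}
```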


 

Let's return to the LiveSession class. I will not go into it in detail; it is copied from the LiveStacks repository on GitHub. It is just a wrapper over ETW that configures which events to collect in order to obtain CPU sampling information. The general principle of its operation has just been shown. Now let's look at the log produced by our attempt to log in to the application.


 

 


 

 


 

We see that at first everything looks good and CPU utilization is low. After some time three violations are detected and CPU sampling mode is switched on. After 5 or 10 seconds of work enough data has been collected, and the three call stacks that consumed the most processor time are written to the log. Unfortunately, some of the addresses could not be resolved; perhaps the corresponding symbols were missing, Microsoft sometimes does not have them. The result is somewhat disappointing (I will return to this issue later): apparently, the first call stack is the monitoring code itself. However, it got here because of the CPU spike that had already occurred and the intensive mode it triggered; under normal conditions it does not consume that much. Let's move on to the next call stack. Here we see many calls related to ASP.NET. Since I am not an ASP.NET expert I cannot say for certain what you could do here, but in my opinion this is useful information. Finally, the third call stack. Before us is the Login() method of our application, which should not surprise us, because we saw that the problem arose right after logging in. It should be remembered, however, that when the developers read the log they may not know what actions were performed during the CPU spike. If we look at the call stack more closely, we see a function that computes a hash.


 

 


 

Now we can go back to the code. We find the function by its name and see that it computes a hash every time the email address ends with "gmail.com". These computations are what load the processor. So we detected a CPU usage problem in real time, and as a result of profiling we found the function in the code that caused it. In my opinion, this is a very good result.
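
For illustration only, the offending code might have looked something like this hypothetical reconstruction; the names and the hash loop are invented, only the "gmail.com" condition comes from the demo.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

static class LoginDemo
{
    // Hypothetical reconstruction: an expensive hash loop runs only for
    // addresses ending in "gmail.com", which is what showed up in the hot stack.
    public static byte[] ComputePasswordHash(string email, string password)
    {
        byte[] data = Encoding.UTF8.GetBytes(email + password);
        int rounds = email.EndsWith("gmail.com", StringComparison.OrdinalIgnoreCase)
                     ? 1000000    // pathologically many rounds -> CPU spike
                     : 1;
        using (var sha = SHA256.Create())
            for (int i = 0; i < rounds; i++)
                data = sha.ComputeHash(data);
        return data;
    }
}
```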


 

The following example is taken from real life.


 

 


 

On the screen is a screenshot showing Visual Studio and Process Explorer open, with the devenv process highlighted in Process Explorer. You can check all of this on your own computers. The devenv process has a child process, perfwatson2, and "perf" in the name clearly points to performance. Further, we see that perfwatson2 creates an ETW session with four providers. I was not able to determine exactly what they record, but it is known that this process profiles Visual Studio. The Visual Studio developers decided they needed telemetry about what happens inside the program while it is being used, and they used the same method I showed you in the previous example. I was not able to get to the file where this telemetry is written, some strange errors occurred, but that is not the point. Monitoring Visual Studio itself with ETW is part of its functionality (although, oddly, Microsoft describes PerfWatson without making it part of the public Visual Studio documentation). As you can see, I did not invent this method; it is already entering the practice of some companies. The following examples I will talk about are also related to real practice.


 

Let's talk about monitoring garbage collection, or rather about the problems that can affect its efficiency. This is about situations with a lot of memory allocations. Clearly, GC pauses while processing requests are bad.


 

 


 

But, again, do not forget about client applications. If users experience freezes and UI slowdowns because of long garbage collection pauses, they obviously will not like it. I want to emphasize once again that the things we are discussing here concern all types of applications, not just server-side ones.


 

The principle here is the same as in the previous example, but this time we deal with memory rather than with the CPU. Using performance counters we will monitor the allocation rate, and as soon as we notice a jump we can begin more intensive profiling. In this case that means getting data about allocations: which types are being allocated and where it happens. In addition, you can find out how long garbage collection takes, get information about generations, how much memory was freed and how much remains. You can even attach ClrMD and see what is in the heap in general (we will do that later): segments, generations, object types, object counts. All of this can help us understand what is happening in the application and where the problem is localized.
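
The lightweight part can again be a plain performance counter read. A minimal sketch, assuming the ".NET CLR Memory" category and its "Allocated Bytes/sec" counter are registered on the machine:

```csharp
using System;
using System.Diagnostics;
using System.Threading;

// Minimal sketch: read the managed allocation rate for the current process.
static class AllocationRateProbe
{
    public static float AllocatedBytesPerSecond()
    {
        // the instance name may carry a "#N" suffix if several copies of the process run
        string instance = Process.GetCurrentProcess().ProcessName;
        using (var counter = new PerformanceCounter(
            ".NET CLR Memory", "Allocated Bytes/sec", instance, readOnly: true))
        {
            counter.NextValue();                    // the first reading is always 0
            Thread.Sleep(TimeSpan.FromSeconds(1));  // rate counters need an interval between reads
            return counter.NextValue();
        }
    }
}
```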


 

[video demonstration]


 

So, before us is an application in which we track the allocation rate. When we go to the "About" page, we see a sudden spike in allocations. This does not necessarily mean we will have a problem with garbage collection, but it becomes very likely.


 

 


 

 


 

 


 

To learn more, let's look at our code, namely the AllocMonitor class. It also inherits from EventStacksMonitor because, as with monitoring excessive CPU usage, we are interested in the call stacks where the allocations happen, and for that we need ETW. As last time, the fields define the rules we will watch through performance counters; for example, CounterThreshold is 10,000 (obviously, in your application you would set it based on your own needs). The ETW event we will wait for, and from which we will obtain the call stacks, is clr:gc:gc/allocationtick. This event gives us information about the type of the object being created and about where the allocation takes place. The OnEventOccurred() function catches this event. Its data arrives as a GCAllocationTickTraceData structure, which has a property telling how much memory was allocated and information about the type of the created object. For now this is exactly the data we need, although there is much more. During intensive sampling this data is stored in a dictionary, so we know how many objects of each type were allocated during that period. In addition, we receive the call stacks in which these allocations occurred.
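
Here is a sketch of the same idea with the TraceEvent library, which exposes GCAllocationTickTraceData. This is not the speaker's code; the session name and the sampling window are arbitrary, and starting an ETW session requires administrator rights.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using Microsoft.Diagnostics.Tracing;
using Microsoft.Diagnostics.Tracing.Parsers;
using Microsoft.Diagnostics.Tracing.Parsers.Clr;
using Microsoft.Diagnostics.Tracing.Session;

// Sketch: aggregate allocated bytes per type from the CLR GC/AllocationTick event.
static class AllocationTickSampler
{
    public static Dictionary<string, long> Sample(TimeSpan window)
    {
        var bytesByType = new Dictionary<string, long>();
        using (var session = new TraceEventSession("AllocTickSampleSession"))
        {
            session.EnableProvider(ClrTraceEventParser.ProviderName,
                                   TraceEventLevel.Verbose,
                                   (ulong)ClrTraceEventParser.Keywords.GC);

            var clr = new ClrTraceEventParser(session.Source);
            clr.GCAllocationTick += (GCAllocationTickTraceData data) =>
            {
                // AllocationAmount is the number of bytes allocated since the previous tick
                bytesByType.TryGetValue(data.TypeName, out long total);
                bytesByType[data.TypeName] = total + data.AllocationAmount;
            };

            // stop the session after the sampling window so Process() returns
            using (new Timer(_ => session.Stop(), null, window, Timeout.InfiniteTimeSpan))
                session.Source.Process();           // blocks while events are delivered
        }
        return bytesByType;
    }
}
```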


 

 


 

After that, we will collect statistics about what is happening in the heap for some time. We will do this using the ClrMD library. There is a lot of code here, but it is very simple. The main API we need is AttachToProcess. We obtain the CLR runtime, and we will need its runtime.Heap property, which lets us traverse the heap. It also exposes the total heap size and the sizes of the generations. Most importantly, with this property we can enumerate the entire heap and get the type, name and size of every object. All this data goes into a dictionary, and we then know that there are so many strings in the heap, so many arrays, and so on.
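
A sketch of that heap walk with the ClrMD 1.x API (the version current at the time of the talk; the API changed in ClrMD 2.x):

```csharp
using System.Collections.Generic;
using Microsoft.Diagnostics.Runtime;   // the ClrMD NuGet package

// Sketch: attach to a process and count objects and bytes per type.
static class HeapStats
{
    public static Dictionary<string, (long Count, ulong Bytes)> Collect(int pid)
    {
        var stats = new Dictionary<string, (long Count, ulong Bytes)>();
        using (DataTarget target = DataTarget.AttachToProcess(pid, 5000, AttachFlag.Passive))
        {
            ClrRuntime runtime = target.ClrVersions[0].CreateRuntime();
            ClrHeap heap = runtime.Heap;
            if (!heap.CanWalkHeap)
                return stats;                           // a GC is in progress, try again later

            foreach (ulong obj in heap.EnumerateObjectAddresses())
            {
                ClrType type = heap.GetObjectType(obj);
                if (type == null) continue;             // unreadable or corrupt object
                stats.TryGetValue(type.Name, out var entry);
                stats[type.Name] = (entry.Count + 1, entry.Bytes + type.GetSize(obj));
            }
        }
        return stats;
    }
}
```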


 

Let's look at the log. At first everything is in order, but then the allocations begin and three violations occur. As you remember, we collected the call stacks in which the allocations took place. Here before us are the stacks in which the largest amount of allocation occurred. This time we immediately find the result we need: a stack with the About() method. Again, this is not surprising, since we know the allocations began when we opened the "About" page. Looking in more detail, it turns out that the allocation happens when strings are concatenated. To clarify the situation, let's return to the code.


 

 


 

We see that the About() method parses a large amount of XML. Some functions call .append, hence the large number of strings, intermediate objects, concatenations, records, and so on. All of this causes the allocation problem. To confirm the source of the problem we can look at what other information is in the log, because heap statistics were written there too. Quite expectedly, we find a large number of string objects there. This confirms our initial diagnosis of the problem.


 

In the following example we will investigate a problem that causes everyone a great deal of trouble: memory leaks. Here there is no point in collecting call stacks. We will follow the same general principle of tracking memory usage. As one of the symptoms you can watch for jumps in memory usage, but by themselves they do not give confidence that a leak has occurred. In addition, you can specify that the application should not use more than, say, 100 megabytes of memory; if that threshold is exceeded, the situation requires our attention even if it is not a leak. Here you can apply ClrMD on the same principle as in the previous example and find out what exactly is taking an unusually large amount of memory. If a leak is possible, we as developers can automate the application's reaction: take a few snapshots of the heap and try to compare them. That way we find out which objects were allocated, which memory was freed, which was not, and why. All of this can be done with ClrMD, and then, if we are willing to spend the extra effort, we can correlate this information with what we get from ETW.


 

[video demonstration]


 

Let's look at the demonstration. This time I am monitoring bytes in all heaps, that is, our managed memory. I register on the site, and while I am entering the data the memory state does not change. As soon as I click the registration button, there is a jump in memory usage. Even if we are not afraid of a leak, we should find out what exactly caused the jump.


 

 


 

 


 

For this we turn to the LeakMonitor class. Since this time we do not need call stacks (there is no ETW event that could be used to analyze this memory situation), it does not inherit from EventStacksMonitor but directly from Monitor. As in the previous examples, the fields specify a threshold whose triple violation triggers intensive mode, and the parameter to monitor. Let's see what happens when excessive memory usage is detected. I will try to show what can be automated at development time. We take three snapshots of the heap. Note that before each snapshot the garbage collector is invoked, because I want the statistics to be as clean as possible; later I will compare exactly what was freed and what was not. After garbage collection I attach to the process, collect the heap statistics, wait a few seconds, then repeat the cycle, as shown in the sketch below. After the loop ends I compare the snapshots and write the results to a file.
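
The snapshot loop can be sketched like this, reusing the HeapStats.Collect helper sketched earlier; again, this is an illustration rather than the speaker's exact code.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading;

// Sketch of the intensive mode of a leak monitor: three heap snapshots with a
// forced GC before each one, then a diff of the first and the last snapshot.
static class LeakSampler
{
    public static void Sample()
    {
        var snapshots = new List<Dictionary<string, (long Count, ulong Bytes)>>();
        int pid = Process.GetCurrentProcess().Id;

        for (int i = 0; i < 3; i++)
        {
            GC.Collect();                          // clean the heap so the diff shows survivors only
            GC.WaitForPendingFinalizers();
            GC.Collect();

            snapshots.Add(HeapStats.Collect(pid)); // the ClrMD walk sketched earlier
            Thread.Sleep(TimeSpan.FromSeconds(5)); // give the suspected leak time to grow
        }

        WriteSnapshotDiff(snapshots[0], snapshots[snapshots.Count - 1]);  // see the diff sketch below
    }
}
```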


 

Let's look at the heap snapshot. It is quite primitive; it has no statistics on individual objects. That would obviously take considerably more time, which you do not always want to spend. Only the number of objects of each type and the total size of all objects of each type are collected. This is done exactly as in the previous example: the heap is traversed, the size, name and type of each object are extracted and added to two dictionaries, size by type and count by type. The snapshots are then compared not with ClrMD but with a simple function that compares the data in the dictionaries of different snapshots. Finally, all of this is written to the log.
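
The comparison itself is plain dictionary arithmetic, for example:

```csharp
// Continuation of the LeakSampler sketch: compare two snapshots and log the
// types whose total size kept growing between them.
static void WriteSnapshotDiff(Dictionary<string, (long Count, ulong Bytes)> before,
                              Dictionary<string, (long Count, ulong Bytes)> after)
{
    foreach (var pair in after)
    {
        before.TryGetValue(pair.Key, out var old);
        long countDelta = pair.Value.Count - old.Count;
        long bytesDelta = (long)pair.Value.Bytes - (long)old.Bytes;
        if (bytesDelta > 0)
            Console.WriteLine($"{pair.Key}: +{countDelta} objects, +{bytesDelta} bytes");
    }
}
```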


 

At first there is nothing unusual in the log, but further down we see that the program detects a violation of the rules we set. Thanks to the snapshots we see that the number of byte arrays has increased significantly; obviously there is a problem here. The difficulty is that byte arrays can arise anywhere. In the best case you will be able to find a type specific to your application in which these arrays are created. In this example we know that the problem occurred during registration, so we can find the relevant code.


 

 


 

The registration function checks the password and at the same time adds data to some static list, and creating that data allocates a fairly large byte array. Obviously the example is not very realistic; nevertheless, I think that in your practice there would be some data structure that could be found in the same way. But even if it is only about byte arrays, you probably know where exactly in your code those arrays are allocated. Or you can correlate this data with the allocation data from ETW and find out that a significant number of them are allocated in this particular branch of the code.
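
A hypothetical reconstruction of that pattern (the names are invented) shows why the byte arrays never go away:

```csharp
using System.Collections.Generic;

// Hypothetical reconstruction of the leak: a static list roots every array forever.
static class RegistrationAudit
{
    static readonly List<byte[]> History = new List<byte[]>();

    public static void OnRegistered(string email)
    {
        // each registration parks a large buffer in a list that is never trimmed
        History.Add(new byte[10 * 1024 * 1024]);
    }
}
```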


 

For dessert today we have an example of detecting deadlocks. Here there is a difference from the other examples we have considered, because there is no simple numeric indicator by which one could determine that a deadlock has occurred; you can only identify some symptoms. One of them may be a situation where CPU usage suddenly becomes unusually low and the application does nothing. Or, if requests do not return within the timeout, it may mean that the threads that must respond to those requests are stuck. Or it may be the opposite situation: your thread pool opens more and more threads because the others are deadlocked and the requests still have to be answered. You will have to determine the list of symptoms yourself based on the needs of your business.


 

If there is a suspicion that a deadlock has occurred, to detect it we will attach ClrMD and take a look at the call stacks. I am not sure there is much that can be done to resolve a deadlock in this case, but at least the problem will be recorded in the log that is sent to the developers.


 

[video demonstration]


 

Let's go straight to the example. I try to go to the contacts page, and it turns out that it hangs: the CPU load is almost zero, the page does not load. Perhaps there is a deadlock. Let's look at the code, the DeadlockMonitor class. In this example I did not implement deadlock detection itself, since there can be many solutions and we have seen enough examples of them. I just want to show how easy it is to access the call stacks using ClrMD and check whether there is a cycle in the wait graph.


 

 


 

 


 

 


 

 


 

Every few seconds we run a fairly simple recursive method, FindDeadlocks(). It goes over all the threads we obtained from ClrMD, and for each thread it traverses all the objects that thread is waiting on; they can be found through the thread.BlockingObjects property. Each of these objects has an Owners property that points to another thread, the owner of that object. Together this lets us build a graph of what is happening in the application and check whether the graph contains a cycle. If it does, we can get the call stack in which it occurs and write it to the log.
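
A sketch of such a check with ClrMD 1.x (ClrThread.BlockingObjects and BlockingObject.Owners were removed in later ClrMD versions, so treat this as version-specific):

```csharp
using System;
using System.Collections.Generic;
using Microsoft.Diagnostics.Runtime;

// Sketch: walk the wait-for graph (thread -> objects it waits on -> their owners)
// and report a deadlock if the walk comes back to a thread already on the path.
static class DeadlockDetector
{
    public static void FindDeadlocks(ClrRuntime runtime)
    {
        foreach (ClrThread thread in runtime.Threads)
        {
            if (HasCycle(thread, new HashSet<int>()))
            {
                Console.WriteLine($"Deadlock involving managed thread {thread.ManagedThreadId}:");
                foreach (ClrStackFrame frame in thread.EnumerateStackTrace())
                    Console.WriteLine("    " + frame.DisplayString);
            }
        }
    }

    static bool HasCycle(ClrThread thread, HashSet<int> path)
    {
        if (thread == null || thread.BlockingObjects == null)
            return false;
        if (!path.Add(thread.ManagedThreadId))
            return true;                               // this thread is already on the current path
        foreach (BlockingObject blocker in thread.BlockingObjects)
        {
            if (blocker.Owners == null) continue;
            foreach (ClrThread owner in blocker.Owners)
                if (HasCycle(owner, path))
                    return true;
        }
        path.Remove(thread.ManagedThreadId);
        return false;
    }
}
```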


 

 


 

Turning to the log, we see that a deadlock was indeed detected in HomeController.ProcessDetails().
 
 
 
If we return to the code and find this function, we will see two locks and a call to the function itself: a classic deadlock. So the function was found, which, in my opinion, is very good.
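
For reference, here is a minimal reconstruction of the classic shape of the problem (the code is invented; only the idea of two locks comes from the demo): two locks acquired in opposite order on two threads.

```csharp
using System.Threading;

// Classic two-lock deadlock: each thread holds one lock and waits for the other's.
static class DeadlockDemo
{
    static readonly object LockA = new object();
    static readonly object LockB = new object();

    public static void Run()
    {
        var t1 = new Thread(() => { lock (LockA) { Thread.Sleep(100); lock (LockB) { } } });
        var t2 = new Thread(() => { lock (LockB) { Thread.Sleep(100); lock (LockA) { } } });
        t1.Start(); t2.Start();
        t1.Join(); t2.Join();   // never returns: the sleeps make the bad interleaving all but certain
    }
}
```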
 
In the remaining time I would like to say a few general things. Obviously, there are many more applications of the proposed approach than we have time for, but I will mention some. You can monitor heap fragmentation: the CLR will not compact the large object heap unless you specifically tell it to, because that takes a long time. If you monitor fragmentation, you have the information to decide whether you want to spend that time or not, as the sketch below shows. Another possibility I wanted to mention is using ETW to analyze memory leaks at the system level. It is possible to obtain stack traces for these allocations as well: ETW provides events for the unmanaged Win32 heap used by native applications written in C/C++.
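
A sketch of one possible automated reaction to heavy large-object-heap fragmentation, using the documented GCSettings API:

```csharp
using System;
using System.Runtime;

// Ask the CLR to compact the large object heap on the next full collection;
// it does not do this by default because compaction is expensive.
static class LohCompaction
{
    public static void CompactOnce()
    {
        GCSettings.LargeObjectHeapCompactionMode = GCLargeObjectHeapCompactionMode.CompactOnce;
        GC.Collect();   // the next blocking gen-2 GC compacts the LOH and resets the flag
    }
}
```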
 
As a small digression I want to note that ClrMD can work not only with running processes: it can be attached to a crash dump and automate dump analysis (viewing call stacks, searching for the faulty component). In that case it would be optimal to aggregate all failures of one type, and perhaps even automate the creation of error reports for the developers.
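
A short sketch of the same analysis against a crash dump instead of a live process (ClrMD 1.x API; the dump path is just an example):

```csharp
using System;
using Microsoft.Diagnostics.Runtime;

// Sketch: open a dump and print the managed stack of every thread.
static class DumpTriage
{
    public static void PrintStacks(string dumpPath)
    {
        using (DataTarget target = DataTarget.LoadCrashDump(dumpPath))
        {
            ClrRuntime runtime = target.ClrVersions[0].CreateRuntime();
            foreach (ClrThread thread in runtime.Threads)
            {
                Console.WriteLine($"Thread {thread.OSThreadId:x}:");
                foreach (ClrStackFrame frame in thread.EnumerateStackTrace())
                    Console.WriteLine("    " + frame.DisplayString);
            }
        }
    }
}
```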
 
Finally, I want to note that the described approach, besides its obvious advantages, has drawbacks. The advantage is that you get visibility inside your application without having to grow the team when you scale the application. One of the drawbacks is the increase in application complexity: more code means more potential problems. The risk can be reduced by performing all the operations I demonstrated in a separate process. In the examples above everything was done in one process, but there are no major obstacles to doing the monitoring in a separate one: the data from ETW and performance counters is collected system-wide anyway, and ClrMD can attach to any process.
 
Another drawback is the increased resource consumption. I already drew attention to this in the first example, where the biggest call stack came from ETW itself. The same problem seems to affect some Visual Studio users: if you type "perfwatson2" into a search engine, you will see many questions about how to disable this process.
 
Finally, the extra code takes up developers' time. Whether it is worth spending is a decision for you or your managers.
 
With this, the article comes to an end. We saw why diagnostics in production are important regardless of the type of application being created. We saw how they can be implemented using a hierarchical monitoring system: lightweight monitoring to detect a problem, and more resource-intensive diagnostics afterwards. Finally, four examples demonstrated the very good results the described approach can give: detecting real problems in the code related to CPU load, memory, and deadlocks. All the demos are available online.
 
A minute of advertising. As you probably know, we organize conferences. The nearest .NET conference is DotNext 2018 Piter, which will be held on 22-23 April 2018 in St. Petersburg. You can see what the talks are like in our archive on YouTube. At the conference you will be able to talk live with the speakers and the best .NET experts in dedicated discussion areas after each talk. In short, come, we are waiting for you.
