Uncalled function slows down the program 5 times

Slow Windows, Part 3: Process Termination
The author works on optimizing Chrome's performance at Google (translator's note).
In the summer of 201? I struggled with a Windows performance issue. Process termination was slow and serialized, and it blocked the system input queue, which caused repeated mouse cursor hangs while building Chrome. The main cause was that, at process termination, Windows spent a lot of time looking for GDI objects while holding the system-global user32 critical section. I wrote about this in the article "24-core processor, but I can not move the cursor".
Microsoft fixed the bug and I went back to my other work, but then it looked as if the bug had come back. There were complaints that the LLVM tests were running slowly, with frequent input hangs.
In fact, the bug had not returned. The cause was a change in our code.
In the original issue, some security changes made the GDI cleanup functions that run at process termination slow. While they ran, they held the same lock that is used for input events. When a large number of processes terminate at the same time, each one makes several calls to a slow function that holds this critical lock, which ultimately blocks user input and hangs the cursor.
Microsoft's fix was to stop calling these functions for processes without GDI objects. I don't know the details, but I think the fix was something like this:
    + if (IsGUIProcess())
    +     NtGdiCloseProcess();
    - NtGdiCloseProcess();
That is, just skip the GDI cleanup if the process is not a GUI/GDI process.
Since compilers and the other processes that we rapidly create and destroy did not use GDI objects, this patch was enough to fix the UI hangs.

The 2018 problem

It turned out that it is very easy for a process to end up with some standard GDI objects. If your process loads gdi32.dll, it automatically gets GDI objects (DCs, surfaces, regions, brushes, fonts, and so on), whether it needs them or not (note that these standard GDI objects do not show up in Task Manager's GDI object count for the process).
But this should not be a problem. I mean, why would a compiler load gdi32.dll? Well, it turned out that if you load user32.dll, shell32.dll, ole32.dll, or many other DLLs, you automatically get gdi32.dll as well (with the above-mentioned standard GDI objects). And it is very easy to load one of these libraries by accident.
The LLVM tests called CommandLineToArgvW (shell32.dll) at the start of each process, and sometimes called SHGetKnownFolderPath (also shell32.dll). These calls were enough to pull in gdi32.dll and create those scary standard GDI objects. Because the LLVM test suite spawns a great many processes, it ends up serialized on process termination, causing huge delays and input freezes, much worse than those of 2017.
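To check whether a dependency has quietly pulled gdi32.dll into a process, one option is to ask the loader directly. The following is a minimal sketch, not code from the investigation: the helper name isModuleLoaded is hypothetical, it relies on the Win32 call GetModuleHandleW, and on non-Windows builds it simply returns false so that the snippet still compiles.

```cpp
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: reports whether the named DLL is currently mapped
// into this process. GetModuleHandleW does not load the module and does not
// change its reference count; it only looks it up.
bool isModuleLoaded(const wchar_t* moduleName) {
#ifdef _WIN32
    return GetModuleHandleW(moduleName) != nullptr;
#else
    (void)moduleName;  // assumption: no Windows loader here, nothing to query
    return false;
#endif
}
```

Calling isModuleLoaded(L"gdi32.dll") near the end of main in a command-line tool shows whether that process would carry the standard GDI objects into its termination.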
But this time we already understood the underlying lock problem, so we knew right away what to do.
First we got rid of the call to CommandLineToArgvW by parsing the command line manually. After that, the LLVM test suite rarely called any function from any of the problematic DLLs. But we knew in advance that this would not affect performance. The reason was that even the remaining conditional call was enough to always pull in shell32.dll, which in turn pulled in gdi32.dll and created the standard GDI objects.
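Manual parsing of this kind can be sketched as follows. This is an illustration only, not the LLVM patch: the function name splitCommandLine is hypothetical, it works on narrow strings for brevity (a real Windows command line is UTF-16), and it implements only the commonly documented MSVC rules for whitespace, double quotes, and backslashes before a quote. It ignores the special handling of the program name and of doubled quotes inside a quoted string.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: splits a command line into arguments without
// touching shell32.dll.
std::vector<std::string> splitCommandLine(const std::string& cmd) {
    std::vector<std::string> args;
    std::string current;
    bool inQuotes = false;
    bool haveArg = false;  // true once an argument has started, even an empty ""
    std::size_t i = 0;
    while (i < cmd.size()) {
        const char c = cmd[i];
        if (c == '\\') {
            // Count the whole run of backslashes.
            std::size_t run = 0;
            while (i < cmd.size() && cmd[i] == '\\') { ++run; ++i; }
            if (i < cmd.size() && cmd[i] == '"') {
                current.append(run / 2, '\\');  // each pair yields one backslash
                if (run % 2 == 1) {
                    current.push_back('"');     // odd run: the quote is literal
                } else {
                    inQuotes = !inQuotes;       // even run: the quote is a delimiter
                }
                ++i;                            // consume the quote
            } else {
                current.append(run, '\\');      // not before a quote: all literal
            }
            haveArg = true;
        } else if (c == '"') {
            inQuotes = !inQuotes;
            haveArg = true;
            ++i;
        } else if ((c == ' ' || c == '\t') && !inQuotes) {
            if (haveArg) {                      // whitespace ends the argument
                args.push_back(current);
                current.clear();
                haveArg = false;
            }
            ++i;
        } else {
            current.push_back(c);
            haveArg = true;
            ++i;
        }
    }
    if (haveArg) args.push_back(current);
    return args;
}
```

For example, passing the line clang.exe -c "my file.cpp" -o out.obj yields five arguments, the third being my file.cpp without its quotes.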
The second fix was to delay load shell32.dll. Delay loading means that the library is loaded on demand, when one of its functions is first called, rather than when the process starts. This meant that shell32.dll and gdi32.dll would be loaded rarely, instead of always.
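With the MSVC linker, delay loading is normally enabled with the /DELAYLOAD:shell32.dll option together with delayimp.lib, and no source changes are needed. The sketch below shows the same on-demand idea done by hand with LoadLibraryW and GetProcAddress. It is an illustration under assumptions, not the LLVM patch: the names ArgvFn and lazyCommandLineToArgvW are hypothetical, and the non-Windows branch exists only so that the snippet compiles elsewhere.

```cpp
#ifdef _WIN32
#include <windows.h>
// Pointer type matching the documented signature of CommandLineToArgvW.
using ArgvFn = LPWSTR* (WINAPI*)(LPCWSTR, int*);
#else
// Placeholder type (assumption) so the sketch compiles on other platforms.
using ArgvFn = wchar_t** (*)(const wchar_t*, int*);
#endif

// Hypothetical helper: returns the address of CommandLineToArgvW, loading
// shell32.dll on the first call only. Returns nullptr if the DLL or the
// export is unavailable. A process that never calls this helper never
// loads shell32.dll through it.
ArgvFn lazyCommandLineToArgvW() {
    // The function-local static is initialized once, on first use.
    static const ArgvFn fn = []() -> ArgvFn {
#ifdef _WIN32
        HMODULE shell32 = LoadLibraryW(L"shell32.dll");
        if (shell32 == nullptr) return nullptr;
        return reinterpret_cast<ArgvFn>(
            GetProcAddress(shell32, "CommandLineToArgvW"));
#else
        return nullptr;  // no shell32.dll outside Windows
#endif
    }();
    return fn;
}
```

This only helps if nothing else in the executable imports shell32.dll statically. A single ordinary import is enough for the loader to map the DLL at process start.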
After this, the LLVM test suite started running five times faster: in one minute instead of five. There were also no more mouse hangs on developers' machines, so people could work normally while the tests ran. This is a crazy speedup for such a modest change, and the author of the patches was so grateful for my investigation that he nominated me for a corporate bonus.
Sometimes the smallest changes have the biggest consequences. You just need to know where to type “zero”.

The execution path not taken
It is worth repeating that we paid attention to code that was not being executed, and that this was the key change. If you have a command-line tool that does not reference gdi32.dll, then adding code with a conditional function call will make process termination many times slower if that function pulls in gdi32.dll. In the example below, CommandLineToArgvW is never called, but its mere presence in the code (without delay loading) is enough to hurt performance:
    #include <windows.h>
    #include <shellapi.h>  // declares CommandLineToArgvW
    int main(int argc, char* argv[]) {
        if (argc < 0)  // never true, so the call below is never executed
            CommandLineToArgvW(nullptr, nullptr);  // shell32.dll, pulls in gdi32.dll
    }
So yes, deleting a function call, even if the code is never executed, may be sufficient to significantly improve performance in some cases.
Reproducing the pathology

When I investigated the original bug, I wrote a program (ProcessCreateTests) that created 1000 processes and then killed them all in parallel. This reproduced the hang, and when Microsoft fixed the bug I used the test program to verify the fix: see the video. After the bug's reincarnation I modified my program, adding a -user32 option that makes each of the thousand test processes load user32.dll. As expected, the time to terminate all the test processes rises dramatically with this option, and mouse cursor hangs are easy to detect. Process creation time also increases with -user32, but there are no cursor hangs during process creation. You can use this program to see how bad the problem can be. Here are some typical results from my quad-core, eight-thread laptop after a week of uptime. The -user32 option increases the time for everything, but the increase in UserCrit lock blocking during process termination is especially dramatic:
> ProcessCreatetests.exe
Process creation took ??? s (??? ms per process).
Lock blocked for ??? s total, maximum was ??? s.
Process destruction took ??? s (??? ms per process).
Lock blocked for ??? s total, maximum was ??? s.
> ProcessCreatetests.exe -user32
Testing with 1000 descendant processes with user32.dll loaded.
Process creation took ??? s (??? ms per process).
Lock blocked for ??? s total, maximum was ??? s.
Process destruction took ??? s (??? ms per process).
Lock blocked for ??? s total, maximum was ??? s.

Digging deeper, just out of interest

I thought of some ETW techniques that could be used to study the problem in more detail, and I had already started writing them up. But I ran into behavior so inexplicable that I decided to give it a separate article. Suffice it to say that in this case Windows behaves even more strangely.
Other articles in the series:
Slow Windows, part 0: arbitrary slowdown of VirtualAlloc
Slow Windows, part 1: file access
Slow Windows, part 2: process creation
Slow Windows, part 3: process termination (this article)


The first report on the UI hangs: "24-core processor, but I can not move the cursor"
The follow-up article, which led to an understanding of the problem: “What *is* Windows doing while holding this lock”
An article about another UI hang, caused by the interaction between Gmail, v8's ASLR, CFG memory allocation policies, and slow WMI scanning: “24-core CPU, but I can't type an email”
A compiler loading gdi32.dll seems strange, but it is even stranger for a compiler to load mshtml.dll, which VC++ used to do in some cases
Sometimes weeks of research lead to small but critical changes, as discussed in the article “Know where to type zero”
Video demonstrating the use of ProcessCreateTests and ETW to verify the bug fix
The first LLVM fix: manual parsing of the command line
The second LLVM fix: delay loading shell32.dll