PHP performance: we plan, we profile, we optimize

 
Hi, Habr! Two years ago, we wrote about how we switched to PHP 7.0 and saved a million dollars. On our load profile, the new version turned out to be twice as efficient in terms of CPU usage: the load that had previously taken ~600 servers could be served by ~300 after the migration. As a result, we had spare capacity for two years.
 
 
But Badoo is growing. The number of active users keeps increasing. We keep improving and extending our functionality so that users spend more and more time in the application. This, in turn, shows in the number of requests, which has grown 2-2.5 times in two years.
 
 
We found ourselves in a situation where the twofold performance gain was cancelled out by a more than twofold growth in requests, and we again began to approach the limits of our cluster. Useful optimizations (JIT, preload) are expected in the PHP core again, but they are planned only for PHP 7.?, and that version will be released no earlier than a year from now. So the upgrade trick cannot be repeated this time: we need to optimize the application code itself.
 
 
Below, I describe how we approach such tasks and which tools we use, and give examples of optimizations, ideas, and approaches that we use and that have helped us.
 
perf

perf is a profiling tool built into the Linux kernel. It is a sampling profiler that runs as a separate process, so it does not directly add overhead to the program being profiled. The overhead it adds indirectly is spread evenly and therefore does not distort the measurements.
 
 
For all its advantages, perf can work only with compiled code and with JIT; it cannot work with code running inside a virtual machine. So it cannot profile the PHP code itself, but it shows very clearly how PHP works internally, including various PHP extensions, and how many resources that takes.
 
 
For example, perf helped us find several bottlenecks, including one in compression, which I discuss below.
 
 
Example:
 
perf record --call-graph dwarf,65528 -F 99 -p $(pgrep php-cgi | paste -sd "," -) -- sleep 20
 
perf report
 
 
(If the process and perf run under different users, perf must be started with sudo.)
 
 
Sample output of perf report for PHP-FPM
 
 
XHProf and XHProf aggregator
 
XHProf is a PHP extension that places timers around all function and method calls, and also includes tools for visualizing the results. Unlike perf, it lets you work in terms of PHP code (but what happens inside extensions is not visible).
 
 
Its disadvantages are twofold:
 
- all measurements are collected within a single request, so they give no information about the picture as a whole;
- the overhead, though not as large as with Xdebug, for example, is still there, and in some cases the results are strongly distorted (the more often a function is called and the simpler it is, the stronger the distortion).
 
Here is an example illustrating the last point:
 
function child1() {
    return 1;
}
function child2() {
    return 2;
}
function parent1() {
    child1();
    child2();
    return;
}
for ($i = 0; $i < 1000000; $i++) {
    parent1();
}
 
XHProf output for the demo script: parent1 is orders of magnitude larger than the sum of child1 and child2
 
 
You can see that parent1() took ~500 times longer than child1() + child2(), although in reality these numbers should be roughly equal, just as main() and parent1() are.
 
 
The second drawback is hard to fight, but to deal with the first one we made an add-on for XHProf that aggregates the profiles of different requests and visualizes the aggregated data.
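
The aggregation idea can be sketched roughly as follows. This is not the add-on itself, only a minimal illustration: it assumes profiles in XHProf's raw format (an array keyed by "parent==>child" with `ct` for call count and `wt` for wall time in microseconds), and the function name `aggregateProfiles` is hypothetical.

```php
<?php
/**
 * Sum call counts and wall time per call edge across many request profiles.
 * Each profile is assumed to look like the raw XHProf result:
 *   ['main()==>parent1' => ['ct' => 1, 'wt' => 120], ...]
 */
function aggregateProfiles(array $profiles): array
{
    $total = [];
    foreach ($profiles as $profile) {
        foreach ($profile as $edge => $metrics) {
            if (!isset($total[$edge])) {
                $total[$edge] = ['ct' => 0, 'wt' => 0, 'requests' => 0];
            }
            $total[$edge]['ct'] += $metrics['ct'] ?? 0;
            $total[$edge]['wt'] += $metrics['wt'] ?? 0;
            $total[$edge]['requests']++; // in how many requests this edge appeared
        }
    }
    // Most expensive edges (by total wall time) first.
    uasort($total, function (array $a, array $b): int {
        return $b['wt'] <=> $a['wt'];
    });
    return $total;
}

// Two made-up request profiles.
$profiles = [
    ['main()==>parent1' => ['ct' => 1, 'wt' => 120], 'parent1==>child1' => ['ct' => 1, 'wt' => 10]],
    ['main()==>parent1' => ['ct' => 2, 'wt' => 300]],
];
$aggregated = aggregateProfiles($profiles);
```

Keeping the number of requests per edge makes it possible to tell a function that is slow everywhere from one that is slow only in a few request types.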
 
 
In addition to XHProf, there are many other less well-known profilers working along similar lines. They have similar advantages and disadvantages.
 
 
Pinba
 
Pinba lets you monitor performance per script (action) and per timer placed in the code in advance. All per-script measurements work out of the box; no additional action is required. For each script and timer, getrusage is called, so we know exactly how much CPU time was spent on a particular section of code (unlike sampling profilers, where that time may include the network, disk, and so on). Pinba is great for storing historical data and for getting both the overall picture and the picture for specific types of requests.
 
 
Total rusage of all scripts, obtained from Pinba
 
 
Its disadvantages are that timers profiling specific parts of the code, rather than the whole script, must be placed in the code in advance, and that there is overhead, which (as in XHProf) can distort the data.
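
Placing a timer in advance might look like the sketch below. It uses `pinba_timer_start()` and `pinba_timer_stop()` from the Pinba PHP extension when the extension is loaded and simply runs the code otherwise; the helper name `withTimer` and the tag names are made up for this example.

```php
<?php
/**
 * Run $fn inside a Pinba timer when the pinba extension is loaded;
 * otherwise just run it. Returns whatever $fn returns.
 */
function withTimer(array $tags, callable $fn)
{
    $timer = function_exists('pinba_timer_start') ? pinba_timer_start($tags) : null;
    try {
        return $fn();
    } finally {
        // Stop the timer even if $fn throws, so the measurement is not lost.
        if ($timer !== null) {
            pinba_timer_stop($timer);
        }
    }
}

// Hypothetical tags; str_repeat stands in for the code being measured.
$value = withTimer(['group' => 'cache', 'operation' => 'get'], function () {
    return strlen(str_repeat('x', 1000));
});
```

A wrapper like this keeps timer placement in one spot, which makes it easier to see how many timers a request creates and therefore how much overhead they add.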
 
 
phpspy
 
phpspy is a relatively new project (the first commit on GitHub was six months ago) that looks promising, so we are following it closely.
 
 
From the user's point of view, phpspy is similar to perf: a parallel process is started that periodically copies portions of the PHP process's memory, parses them, and extracts stack traces and other data. This is done in a rather specific way. To minimize overhead, phpspy does not stop the PHP process and copies the memory while it is running. As a result, the profiler can capture an inconsistent state, and stack traces can be broken. But phpspy can detect this and discards such data.
 
 
In the future, this tool will make it possible to collect both data on the overall picture and profiles of specific types of requests.
 
 
Comparison table
 
To structure the differences between the tools, let's make a summary table:
 
Comparison of the main features of profilers
 
Flame Graphs
 
 
Optimizations and approaches
 
With these tools, we constantly monitor performance and resource usage. When resources are used wastefully or we approach a threshold (for the CPU, we empirically chose 55% to leave headroom for growth), optimization is, as I wrote above, one of the ways to solve the problem.
 
 
It is good when the optimization has already been done by someone else, as was the case with PHP 7.?, which turned out to be much faster than the previous versions. In general, we try to use modern technologies and tools, including timely upgrades to the latest PHP versions. According to public benchmarks, PHP 7.2 is 5–12% faster than PHP 7.1. But this upgrade, alas, gave us considerably less.
 
 
Over the years we have implemented a huge number of optimizations. Unfortunately, most of them are tightly coupled to our business logic. I will talk about those that may be relevant not only to us, or whose ideas and approaches can be used outside our code.
 
 
Compression: zlib => zstd
 
We use compression for large memcached keys. This lets us spend three to four times less memory on storage at the cost of additional CPU for compression and decompression. We used zlib for this (our extension for working with memcached differs from those that ship with PHP, but the official ones use zlib as well).
 
 
In perf on production, it looked something like this:
 
+ ???% ???% php-cgi libz.so.??? [.] inflate
 
+ ???% ???% php-cgi libz.so.??? [.] deflate
 
 
7-8% of the time was spent on compression and decompression.
 
 
We decided to test different compression levels and algorithms. It turned out that on our data zstd is almost ten times faster while producing output ~1.1 times larger. A fairly simple change of algorithm saved us ~7.5% of CPU (which, let me remind you, at our scale is equivalent to ~45 servers).
 
 
It is important to understand that the relative performance of different compression algorithms can vary greatly depending on the input data. There are various published comparisons, but the most accurate estimate comes only from real examples.
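
A comparison on your own data can be sketched like this. It is a rough harness, not the benchmark we ran: the function name `benchCodec` and the sample payloads are made up, `gzcompress()`/`gzuncompress()` come from the zlib extension, and `zstd_compress()`/`zstd_uncompress()` are assumed to come from the PECL zstd extension (each codec is skipped if its extension is not loaded).

```php
<?php
/**
 * Round-trip every payload through one codec and report the
 * compression ratio and the total time spent.
 */
function benchCodec(callable $compress, callable $decompress, array $payloads, int $rounds = 20): array
{
    $rawBytes = 0;
    $packedBytes = 0;
    $start = microtime(true);
    for ($round = 0; $round < $rounds; $round++) {
        foreach ($payloads as $payload) {
            $packed = $compress($payload);
            if ($decompress($packed) !== $payload) {
                throw new RuntimeException('Round trip changed the payload');
            }
            if ($round === 0) { // count sizes once, time all rounds
                $rawBytes += strlen($payload);
                $packedBytes += strlen($packed);
            }
        }
    }
    return [
        'ratio' => $packedBytes > 0 ? $rawBytes / $packedBytes : 0.0,
        'ms'    => (microtime(true) - $start) * 1000,
    ];
}

// Placeholder payloads: replace them with values sampled from your own cache.
$payloads = [str_repeat('{"id":1,"name":"test"}', 500), serialize(range(1, 5000))];

$codecs = [];
if (function_exists('gzcompress')) {
    $codecs['zlib'] = ['gzcompress', 'gzuncompress'];
}
if (function_exists('zstd_compress')) {
    $codecs['zstd'] = ['zstd_compress', 'zstd_uncompress'];
}

$results = [];
foreach ($codecs as $name => [$compress, $decompress]) {
    $results[$name] = benchCodec($compress, $decompress, $payloads);
    printf("%s: ratio %.2f, %.1f ms\n", $name, $results[$name]['ratio'], $results[$name]['ms']);
}
```

The numbers from synthetic payloads like these say little; the point of the harness is to feed it real cached values and compare both the ratio and the time.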
 
 
IS_ARRAY_IMMUTABLE as storage for rarely modified data
 
Real-life tasks involve data that is needed often, rarely changes, and has a limited size. We have a lot of such data; a good example is the configuration of split tests. We check whether the user meets the conditions of a particular test and, depending on that, show experimental or regular functionality (this happens on almost every request). In other projects, such data might be configs and various reference lists: countries, cities, languages, categories, brands, etc.
 
Since such data is requested often, fetching it can create a noticeable additional load on both the application itself and the service that stores the data. The latter problem can be solved, for example, with APCu, which uses the memory of the same machine as PHP-FPM for storage. But even in this case:
 
- there are serialization and deserialization costs;
- the data must somehow be invalidated when it changes;
- there is some overhead compared to simply accessing a variable in PHP.
 
In PHP 7.?, the IS_ARRAY_IMMUTABLE optimization appeared. If you declare an array whose elements are all known at compile time, it is processed and placed in OPCache memory once, and PHP-FPM workers refer to that shared memory without spending their own until something attempts to modify the array. It follows that including such an array takes constant time regardless of its size (usually ~1 microsecond).
 
 
For comparison, here is the time to get an array of 10,000 elements via include and via apcu_fetch:
 
 
$t0 = microtime(true);
$a = include 'test-incl-1.php';
$t1 = microtime(true);
printf("include (%d): %d microsec\n", count($a), ($t1 - $t0) * 1e6);
$t0 = microtime(true);
$a = apcu_fetch('a');
$t1 = microtime(true);
printf("apcu_fetch (%d): %d microsec\n", count($a), ($t1 - $t0) * 1e6);
// include (10000): 1 microsec
// apcu_fetch (10000): 792 microsec
 
It is very easy to check whether this optimization has been applied by looking at the generated opcodes:
 
 
$ cat immutable.php
<?php
return [
    'key1' => 'val1',
    'key2' => 'val2',
    'key3' => 'val3',
];
$ cat mutable.php
<?php
return [
    'key1' => SomeClass::CONST_1,
    'key2' => 'val2',
    'key3' => 'val3',
];
$ php -d opcache.enable=1 -d opcache.enable_cli=1 -d opcache.opt_debug_level=0x20000 immutable.php
$_main: ; (lines=?, args=?, vars=?, tmps=0)
    ; (after optimizer)
    ; /home/ubuntu/immutable.php:1-8
L0 (4):     RETURN array(...)
$ php -d opcache.enable_cli=1 -d opcache.opt_debug_level=0x20000 mutable.php
$_main: ; (lines=?, args=?, vars=?, tmps=2)
    ; (after optimizer)
    ; /home/ubuntu/mutable.php:1-8
L0 (4):     T1 = FETCH_CLASS_CONSTANT string("SomeClass") string("CONST_1")
L1 (4):     T0 = INIT_ARRAY 3 T1 string("key1")
L2 (5):     T0 = ADD_ARRAY_ELEMENT string("val2") string("key2")
L3 (6):     T0 = ADD_ARRAY_ELEMENT string("val3") string("key3")
L4 (6):     RETURN T0
 
In the first case, the only opcode in the file is the return of the ready-made array. In the second case, the array is built element by element every time the file is executed.
 
 
In this way, you can generate structures in a form that needs no further conversion at runtime. For example, instead of splitting class names on the “_” and “\” characters on every autoload, you can pre-generate a “Class => Path” map. The conversion function is then reduced to a single hash table lookup. Composer performs this optimization if you enable the optimize-autoloader option.
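
A map-based autoloader can be sketched as follows. This is only an illustration of the idea, not Composer's implementation: the `Demo\Greeter` class and the temporary directory are made up so that the example runs on its own, and the map is built at runtime here only because the path is temporary. In a real project the map would be generated at build time and returned from its own file as a literal array.

```php
<?php
// Create a throwaway class file so the example is self-contained.
$dir = sys_get_temp_dir() . '/classmap_demo_' . uniqid();
mkdir($dir);
file_put_contents(
    $dir . '/Greeter.php',
    "<?php\nnamespace Demo;\nclass Greeter { public function hi(): string { return 'hi'; } }\n"
);

// "Class => Path" map (hypothetical content).
$classMap = [
    'Demo\\Greeter' => $dir . '/Greeter.php',
];

spl_autoload_register(function (string $class) use ($classMap): void {
    // One hash lookup instead of parsing the class name on every call.
    if (isset($classMap[$class])) {
        require $classMap[$class];
    }
});

// Using the class for the first time triggers the autoloader.
$greeting = (new Demo\Greeter())->hi();
```

Classes that are missing from the map are simply ignored by this loader, so another autoloader registered after it can still handle them.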
 
 
You do not need to do anything special to invalidate such data: PHP recompiles the file when it changes, just as it does on a regular code deployment. The one drawback to remember is that if the file is very large, the first request after a change triggers recompilation, which can take considerable time.
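
Generating such a file can be sketched as below. The helper name `writeStaticArray` and the sample data are hypothetical; the sketch assumes the data contains only scalars and nested arrays, since `var_export()` emits a literal array only in that case. The temporary file plus `rename()` is one way to avoid readers seeing a half-written file.

```php
<?php
/**
 * Write $data as a PHP file that returns a literal array, replacing the
 * old version via rename() so readers never see a partially written file.
 */
function writeStaticArray(string $path, array $data): void
{
    $code = "<?php\nreturn " . var_export($data, true) . ";\n";
    $tmp = $path . '.' . uniqid('tmp', true);
    if (file_put_contents($tmp, $code) === false || !rename($tmp, $path)) {
        throw new RuntimeException("Cannot write $path");
    }
}

// Made-up reference data written to a temporary location.
$path = sys_get_temp_dir() . '/countries_' . uniqid() . '.php';
writeStaticArray($path, ['DE' => 'Germany', 'FR' => 'France']);

// Consumers just include the file; with OPCache enabled the array
// can be served from shared memory.
$countries = include $path;
```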
 
 
Performance of include/require
 
Unlike the static array example, including files with class and function declarations is not that fast. Even with OPCache, the PHP engine must copy them into the process memory and recursively load dependencies, which may take hundreds of microseconds or even milliseconds per file.
 
 
If you create a new empty project on Symfony 4.1 and put get_included_files() on the first line of an action, you will see that 310 files have already been included. In a real project, this number can reach thousands per request. The following things are worth paying attention to.
 
 
Lack of function autoloading
 
 
There is a Function Autoloading RFC, but it has seen no progress for several years. So if a Composer dependency defines functions outside a class and these functions must be available to the user, this is done by unconditionally including the file with these functions on every autoloader initialization.
 
 
For example, by removing from composer.json one dependency that declared many functions and was easily replaced by a hundred lines of code, we won a couple of percent of CPU.
 
 
The autoloader is called more often than it may seem
 
 
To demonstrate the idea, let's create a file with a class:
 
 
<?php
class A extends B implements C
{
    use D;
    const AC1 = E::E1;
    const AC2 = F::F1;
    private static $as3 = G::G1;
    private static $as4 = H::H1;
    private $a5 = I::I1;
    private $a6 = J::J1;
    public function __construct(K $k = null) {}
    public static function asf1(L $l = null): ?LR { return null; }
    public static function asf2(M $m = null): ?MR { return null; }
    public function af3(N $n = null): ?NR { return null; }
    public function af4(P $p = null): ?PR { return null; }
}
 
Register an autoloader:
 
 
spl_autoload_register(function ($name) {
    echo "Including $name\n";
    include "$name.php";
});
 
And try several ways of using this class:
 
 
include 'A.php'
Including B
Including D
Including C
A::AC1
Including A
Including B
Including D
Including C
Including E
new A()
Including A
Including B
Including D
Including C
Including E
Including F
Including G
Including H
Including I
Including J
 
You may notice that when we merely load the class in some way, without creating an instance, its parent, interfaces, and traits are loaded. This happens recursively for all files that are loaded as they are resolved.
 
 
When an instance is created, all constants and fields are resolved as well, which loads all the files needed for that, and this in turn causes recursive loading of the traits, parents, and interfaces of the newly loaded classes.
 
 
 
Loading of related classes on instantiation and in other cases
 
 
There is no universal solution to this problem; you just need to keep it in mind and watch the links between classes: one line can pull in hundreds of files.
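
One way to see how much a single line pulls in is to compare get_included_files() before and after it. The helper name `filesLoadedBy` is hypothetical, and the temporary file below only stands in for a class with dependencies so that the sketch runs on its own.

```php
<?php
/**
 * Return the files that were loaded while $fn ran.
 */
function filesLoadedBy(callable $fn): array
{
    $before = get_included_files();
    $fn();
    return array_values(array_diff(get_included_files(), $before));
}

// Stand-in for real code: a temporary file that is included inside the callback.
$file = sys_get_temp_dir() . '/included_demo_' . uniqid() . '.php';
file_put_contents($file, "<?php\nreturn 1;\n");

$loaded = filesLoadedBy(function () use ($file) {
    include $file;
});
printf("%d file(s) loaded\n", count($loaded));
```

In a real project the callback would contain something like a single `new SomeService()` call, and the returned list would show every file the autoloader had to include for it.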
 
 
OPCache settings
 
 
If you deploy atomically by switching a symlink, the method proposed by Rasmus Lerdorf, the creator of PHP, then to solve the problem of the symlink sticking to the old version you have to enable opcache.revalidate_path, as recommended, for example, in this article about OPCache translated by Mail.Ru Group.
 
 
The problem is that this option significantly increases the time it takes to include each file (by one and a half to two times on average). In total, this can consume a significant amount of resources (in our case, turning the option off gave a 7–9% gain).
 
 
To disable it, you need to do two things:
 
- make the web server resolve symlinks;
- stop including files inside PHP scripts via paths that contain symlinks, or force-resolve them with readlink() or realpath().
 
If all files are loaded by the Composer autoloader, the second item is satisfied automatically once the first is done: Composer uses the __DIR__ constant, which will be resolved correctly.
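
Force-resolving a path before including it can be sketched like this. The helper name `includeResolved` and the releases/current layout are made up for the example, and the sketch assumes a filesystem where `symlink()` is permitted.

```php
<?php
/**
 * Resolve symlinks before including, so the path handed to the engine
 * already points at the real release directory.
 */
function includeResolved(string $path)
{
    $real = realpath($path);
    if ($real === false) {
        throw new RuntimeException("Cannot resolve $path");
    }
    return include $real;
}

// Hypothetical deploy layout: releases/v1/config.php plus a "current" symlink.
$base = sys_get_temp_dir() . '/deploy_demo_' . uniqid();
mkdir($base . '/releases/v1', 0777, true);
file_put_contents($base . '/releases/v1/config.php', "<?php\nreturn __DIR__;\n");
symlink($base . '/releases/v1', $base . '/current');

// The included file reports the directory it was loaded from.
$dirSeenByFile = includeResolved($base . '/current/config.php');
```

The included file sees the real release directory in `__DIR__`, so any further includes it builds from that constant no longer pass through the symlink.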
 
 
OPCache has a few more options that can give a performance boost in exchange for flexibility. You can read more about them in the article I mentioned above.
 
 
Even with all these optimizations, include will not be free. To address this, PHP 7.4 plans to add preload.
 
 
APCu and locks
 
Although we are not talking about databases and services here, various kinds of locks can also occur in code and increase script execution time.
 
 
As requests grew, we noticed a sharp slowdown in responses at peak times. When we investigated, it turned out that although APCu is the fastest way to get data (compared to Memcache, Redis, and other external storage), it can also be slow when the same keys are overwritten frequently.
 
 
Requests per second and execution time: spikes at the peaks on October 16 and 1?
 
 
When APCu is used as a cache, this problem is not so pressing, because caching usually means rare writes and frequent reads. But some tasks and algorithms (for example, Circuit Breaker (implementation in PHP)) also involve frequent writes, which cause locks.
 
 
There is no universal solution to this problem, but in the case of Circuit Breaker it can be solved, for example, by moving it into a separate service deployed on the machines that run PHP.
 
 
Batch processing
 
Even leaving include aside, a significant part of request execution time usually goes to initialization: the framework (for example, building the DI container and initializing all its dependencies, routing, running all listeners), starting the session, loading the user, and so on.
 
 
If your backend is an internal API for something, then some client requests can surely be bundled together and sent as a single request. In that case, initialization is performed once for several requests.
 
 
If this is not possible on the clients, try to find requests that can be processed asynchronously. They can be accepted by a very simple script that initializes nothing and just puts them into a queue. The queue can then be processed in batches.
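
The batching idea can be sketched as follows. This is a simplified illustration, not our API: the function name `handleBatch`, the request shape (`method` and `params`), and the `sum` handler are all made up.

```php
<?php
/**
 * Process several logical requests inside one HTTP request:
 * the expensive bootstrap runs once, then each item is dispatched.
 */
function handleBatch(array $items, callable $bootstrap, array $handlers): array
{
    $context = $bootstrap(); // framework, session, user: paid once per batch
    $responses = [];
    foreach ($items as $i => $item) {
        $handler = $handlers[$item['method']] ?? null;
        $responses[$i] = $handler === null
            ? ['error' => 'unknown method']
            : ['result' => $handler($context, $item['params'] ?? [])];
    }
    return $responses;
}

$bootstraps = 0; // counts how many times initialization actually ran
$responses = handleBatch(
    [
        ['method' => 'sum', 'params' => [1, 2]],
        ['method' => 'sum', 'params' => [3, 4]],
        ['method' => 'nope'],
    ],
    function () use (&$bootstraps) {
        $bootstraps++;
        return ['user' => 'demo'];
    },
    ['sum' => function (array $context, array $params) {
        return array_sum($params);
    }]
);
```

Each item gets its own entry in the response, so one failing sub-request does not hide the results of the others.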
 
 
Reasonable utilization of resources
 
At Badoo we have different clusters tailored to different needs. Besides the PHP-FPM cluster, where hundreds of servers are CPU-bound while their disks sit idle, we have one specific database cluster of a couple of hundred machines that is the exact opposite: huge disks, heavy IO load, and idle CPUs.
 
 
The obvious solution was to run PHP-FPM on the second cluster as well. In effect, we got a couple of hundred additional machines for the PHP cluster for free.
 
 
Load can be split not only by type (CPU, IO) but also by time. For example, during working hours company employees may build reports, run tests, compile, or do something else on a large number of servers, while the peak usage of the application falls outside working hours. In that case, you can use the resources of the idle cluster when the other one is under particularly heavy load. And report building can perhaps be shifted in time arbitrarily.
 
 
Conclusion
 
This is how we solve this kind of problem. As a result, even with constant growth in traffic and activity, we have managed not to add new hardware to the PHP clusters for several years.
 
 
Short summary:
 
- at small scale, hardware is usually cheaper than optimization;
- do not optimize unnecessarily;
- if you do need to optimize, measure first: most likely, the problem is not in the code;
- interpret the measurements correctly: things are not always linear and obvious (hyper-threading, spikes, non-linear activity);
- do not rely on guesswork: profile and interpret the results correctly;
- changing compression or OPCache settings, or upgrading the PHP version, is usually easier than optimizing code;
- but measure here too: other people's solutions may not suit you (for example, PHP 7.2 did not give us as much as the benchmarks promise);
- look at the problem more broadly: optimizing the clients or using resources more wisely may help.
 
What tools and utilities do you use?
 
 
Thanks for your attention!