From ??? TFlops HPL to ASC'18 Application Innovation

Hello, Habr! We continue our series of articles about the participation of the St. Petersburg State University team (we call ourselves EnterTildeDot) in the world's largest student supercomputer competitions.
In this article we will look at the road to ASC'18 through the eyes of one team member, paying special attention to the calling card of these competitions and of modern supercomputers in general: Linpack. And we will look at the secret of achieving both a record and an anti-record in computing-system performance.
We have already written about the competition in previous articles, including a long post about this year's contest. Nevertheless, for completeness, we will repeat some information about the competition as a whole here.
The Asian Supercomputer Challenge (ASC) is one of the three major team competitions in high-performance computing, attracting more and more student teams from all over the world every year. Like other similar contests, ASC consists of a qualifying round and a final, with the following format:
Focus: HPC solutions;
Team: 5 students + a coach;
Qualifying stage: a written proposal describing the team's solutions to the posed tasks, on the basis of which the list of 20 finalists is determined.
Final stage: an on-site competition for the 20 teams lasting about 5 competitive days, including the full assembly and configuration of a computing cluster, problem solving, and a presentation. The cluster is assembled under a 3 kW power limit, either from hardware provided by the organizers or from the team's own. The cluster has no Internet connection. The tasks partially coincide with those of the qualifying stage, but there is also an unknown task: the Mystery Application.
Now, in order, with digressions for background. Unlike the rest of the team members, who had already reached an ASC final before, I joined the contest only this year. I joined the team in September, and the qualifying-stage tasks are sent out only in January, so I had enough time to study the basic concepts of the contest and to work on the only task known in advance: HPL & HPCG. The task appears in one form or another almost every year, but it is not always known beforehand on what hardware it must be carried out (sometimes the organizers provide remote access to their own resources).
HPL (High-Performance Linpack) is a benchmark of computing-system performance whose results form the current list of the world's best supercomputers. The essence of the test is solving dense systems of linear algebraic equations. The appearance of this benchmark introduced a metric that made it possible to rank supercomputers, while at the same time doing the HPC community something of a "disservice". If you look at the list of the best supercomputers, you can see that the secret of Linpack was cracked quite quickly: take as many graphics accelerators as you can, and you will be in the top. There are exceptions, of course, but the top places are mostly occupied by supercomputers with graphics accelerators.

So what is the "disservice"? The point is that, apart from measuring performance, Linpack is not used anywhere else and has nothing to do with real computational workloads. As a result, the supercomputer race moved toward squeezing the most out of Linpack rather than out of real workloads, much like cramming standard exam problems instead of mastering the school curriculum. The HPL developers also created another package, HPCG, which likewise produces a supercomputer ranking. It is generally accepted that this benchmark is much closer to real-world workloads than HPL, and in a way the significant divergence of supercomputer positions between the two lists reflects the real state of affairs. The latest rankings (June 2018), however, were a pleasant exception: the first positions of the two lists finally coincided.
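For reference, the number HPL prints is not a hardware measurement but the nominal operation count of the LU factorization divided by wall time. A minimal sketch of that arithmetic (the matrix size and run time below are made up for illustration):

```python
def hpl_gflops(n: int, seconds: float) -> float:
    """HPL reports performance as the nominal LU flop count for an
    n x n dense system, 2/3 * n^3 + 2 * n^2, divided by wall time."""
    flops = 2.0 / 3.0 * n**3 + 2.0 * n**2
    return flops / seconds / 1e9  # convert to GFlops

# e.g. a 100_000 x 100_000 system solved in 1000 s (invented numbers):
print(hpl_gflops(100_000, 1000.0))  # roughly 667 GFlops
```

This is also why a "bigger matrix" tends to look better: the cubic flop count grows faster than the solve time on a well-tuned system.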
And now about the real HPL
Back to the more practical side of the story and the competition. Linpack is open source and available for download from the official site; however, there is hardly a supercomputer in the world top whose performance was measured with that stock version of the benchmark. Accelerator manufacturers produce their own versions of HPL, optimized for specific devices, which yields a significant performance gain. Of course, custom versions of HPL must meet certain criteria and successfully pass special tests. Every vendor has its own HPL version for each accelerator, but, unlike the original benchmark, there is no open source to speak of here. Nvidia releases HPL versions optimized for each of its cards, delivered not as source code but as binaries. Moreover, there are only two ways to get access to them:
You have a supercomputer with Nvidia cards capable of entering the top: Nvidia will find you themselves. Alas, you will probably not get the binaries, nor the opportunity to participate in tuning the HPL parameters. One way or another, you will get an adequate performance figure obtained with an optimized benchmark.
You are a participant in one of the three student supercomputer competitions. We will come back to this part.
So what is there left to do, especially if smart folks at large companies have already optimized the benchmark for your hardware?
In the case of the qualifying stage, you describe the possible actions to increase the system's performance. There is no point in chasing absolute performance numbers here, since some teams have access to a large, top-class cluster of 226 nodes with modern accelerators, while others only have university computer lab No. 22?, which we call a cluster.

In the final stage, it already makes sense to compare absolute performance values. Not that everyone is on an equal footing here, but at least there is a limit on the system's maximum allowed power.
The benchmark result depends mainly on two components: the cluster configuration and the settings of the benchmark itself. The choice of compilers and of libraries for matrix and vector computations also matters, but here everything is rather boring: everyone uses the Intel compiler + MKL. And with binaries there is nothing to choose at all, since they come prebuilt. The output of HPL is a number indicating how many floating-point operations per second the given system performs. The basic unit of measure is FLOPS (FLoating-point Operations Per Second) with the corresponding prefixes. In the final stage of the competition we are almost always talking about terascale systems.
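For context on those units: a system's theoretical peak is a simple product of core count, clock frequency, and FLOPs completed per cycle. A sketch with hypothetical chip parameters (the 20-core, 2.4 GHz, AVX-512 figures are illustrative, not our actual hardware):

```python
def peak_gflops(cores: int, ghz: float, flops_per_cycle: int) -> float:
    """Theoretical peak of one chip = cores x clock x FLOPs per cycle;
    multiply by the number of chips to get the system-wide figure."""
    return cores * ghz * flops_per_cycle

# Hypothetical CPU: 20 cores at 2.4 GHz, AVX-512 with two FMA units
# (2 units x 8 doubles x 2 ops per FMA = 32 FLOPs per cycle):
cpu_socket = peak_gflops(20, 2.4, 32)
print(cpu_socket)  # 1536.0 GFlops per socket
```

The measured HPL number (Rmax) is always some fraction of this peak (Rpeak).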
Optimizing the results
Tuning the benchmark parameters comes down to a meaningful selection of the input data for the task Linpack solves (the HPL.dat file). The greatest impact comes from the dimensions of the problem: the size of the matrix, the size of the blocks into which the matrix is divided, the ratio in which the blocks are distributed, and so on. In total there are several tens of parameters, with thousands of possible values. Brute force is not the best choice, especially given that on relatively small systems a single run takes from a couple of minutes to a couple of hours, depending on configuration (on a GPU the test runs much faster).
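For the dominant HPL.dat parameter, the matrix size N, a widely used rule of thumb is to let the double-precision matrix fill roughly 80% of total RAM and round N down to a multiple of the block size NB. A sketch under those assumptions (the memory size and NB value are illustrative):

```python
import math

def hpl_problem_size(total_mem_bytes: int, nb: int,
                     mem_fraction: float = 0.8) -> int:
    """Pick the HPL matrix size N: an N x N matrix of 8-byte doubles
    should fill about mem_fraction of RAM, with N a multiple of NB."""
    n_raw = math.isqrt(int(mem_fraction * total_mem_bytes / 8))
    return (n_raw // nb) * nb

# e.g. 4 nodes with 64 GiB each and NB = 192 (invented configuration):
n = hpl_problem_size(4 * 64 * 2**30, nb=192)
print(n)
```

In practice this only gives a starting point; the final N, NB and the P x Q process grid still have to be swept empirically.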
I had enough time to study the patterns that help optimize the benchmark results, both those already described in other sources and new ones. I ran the tests a huge number of times, started a lot of Google spreadsheets, and tried to get access to systems with configurations I had not tested before in order to run the benchmark on them. As a result, even before the start of the qualifying stage, a number of systems had been tested, both CPU and GPU, including even the completely unsuitable Nvidia Quadro P5000. At the beginning of the qualifying stage we got access to several machines with P100 and P600? cards, which helped us a lot in our preparation. The configuration of that system was in many ways similar to the one we planned to assemble in the final, and we finally got access to low-level settings, including changing frequencies.
As for the configuration, the greatest influence comes from the presence and number of accelerators. When testing a system with GPUs, the best option is to delegate the main computational part of the task to the GPU component. The CPU component will still be loaded with auxiliary work, but it will contribute nothing to the measured performance. Yet the peak performance of the CPUs must still be counted in the peak performance of the system as a whole, which can be extremely disadvantageous for the ratio of measured (maximum) performance to peak (theoretical) performance. When you run HPL on GPUs, a system with 2 GPU accelerators and two processors will at least not be inferior to a system with 2 GPUs and 20 CPUs.
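The effect on the Rmax/Rpeak ratio can be sketched with made-up numbers (the per-device peaks below are illustrative, not vendor figures):

```python
def efficiency(rmax_tflops: float, rpeak_tflops: float) -> float:
    """HPL efficiency: measured Rmax as a fraction of theoretical Rpeak."""
    return rmax_tflops / rpeak_tflops

gpu_peak = 4 * 7.0   # hypothetical: 4 cards at ~7 TFlops FP64 each
cpu_peak = 2 * 1.5   # hypothetical: two CPUs at ~1.5 TFlops each
rmax = 22.0          # measured result, produced almost entirely on the GPUs

print(efficiency(rmax, gpu_peak))             # ratio counting GPUs only
print(efficiency(rmax, gpu_peak + cpu_peak))  # ratio once idle CPU peak counts
```

The second ratio is strictly lower: the CPUs inflate Rpeak without adding to Rmax, which is exactly the disadvantage described above.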
Having described the proposed HPL optimizations, I finished my part of the qualifying-stage proposal, and after we advanced to the final, a new stage began: the search for sponsors. On the one hand, we needed a sponsor to cover the team's flight to China; on the other, a sponsor who would kindly agree to provide the team with graphics accelerators. With the former we were lucky in the end: part of the money was allocated by the university, and the company Devexperts helped to fully cover the tickets. With the sponsors from whom we planned to borrow cards, we were less fortunate, so once again we flew to the final with the basic cluster configuration and no chance of competing in HPL. Well, never mind, we thought: we'll squeeze the maximum out of what we are given.
Final ASC'18
And here we are in China, in Nanchang, a small city by Chinese standards, at the final. Two days to assemble the cluster, and then the tasks.

This year all teams were given 4 Nvidia V100 cards; we got no advantage over the other teams, but it gave us the opportunity to run HPL not just on CPUs. Initially everyone is given 1? nodes, but the superfluous ones (remember the 3 kW limit) must be returned before the main competitive tasks begin. There is a trick here: lowering the CPU and GPU frequencies reduces their performance, but you can choose frequency values that yield more performance per unit of energy consumed. By lowering frequencies, you can fit even more accelerators into the power budget, which ultimately improves performance. Alas, this trick would have been far more useful had we come to the competition with a suitcase of accelerators like the other participants. Still, we could afford to keep the maximum number of CPUs, and since not all contest tasks require a GPU, we suspected this might play into our hands.
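The trade-off can be sketched numerically (all operating points below are invented for illustration; real frequency and power curves have to be measured on the actual hardware):

```python
def cards_in_budget(watts_per_card: float, budget_watts: float) -> int:
    """How many accelerators fit under the power limit."""
    return int(budget_watts // watts_per_card)

def total_tflops(tflops_per_card: float, watts_per_card: float,
                 budget_watts: float = 3000.0) -> float:
    """Total throughput achievable under the power budget."""
    return tflops_per_card * cards_in_budget(watts_per_card, budget_watts)

# Hypothetical operating points for one accelerator:
stock  = total_tflops(7.0, 300.0)   # full clocks: fewer cards fit
capped = total_tflops(6.0, 220.0)   # lowered clocks: each card slower,
                                    # but more cards fit in 3 kW
print(stock, capped)
```

Each down-clocked card is slower, yet the cluster as a whole comes out ahead, which is exactly why a suitcase of extra accelerators would have helped.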
So the most common cluster configuration in the final of the competition is a minimum of nodes and a maximum of cards.

The final Linpack and a little about records
The competition tasks were tied to specific competitive days, and HPL was the first of them, right after the cluster was assembled. The deadline for submitting HPL results was lunch on the third competitive day; moreover, access to the remaining tasks of that day opened right after Linpack was handed in. Nevertheless, teams start running Linpack from the very first days: first, to make sure the cluster is assembled correctly; second, tuning Linpack is not a quick affair, and since it requires no additional input, why not. We assembled our cluster quite quickly and started running Linpack. For our configuration we got quite adequate values, about 20 TFlops, and everything would have been fine, but the output ended with an error line. I had previously seen such errors only when I deliberately specified an invalid size for the blocks into which the task matrix is divided. Then a very unpleasant surprise awaited us. I said earlier that we had been given 4 V100 cards; well, we did not get the HPL binaries for them, and nobody could help with that. Several months have passed, and it is still a mystery to me what happened to our Linpack at that final. We changed compiler versions and other libraries hoping to get rid of the error, and repeatedly checked whether we had seated the accelerators correctly (we were doing this for the first time), but we never managed to fix it.

The night before Linpack submission we once again carefully studied the grading criteria. For Linpack, the score consisted of two components: a value depending on the result of the team that wins Linpack, and a coefficient for successful completion of the assignment. It turned out that this coefficient is so large that submitting an adequate Linpack value accompanied by an incomprehensible error is simply unprofitable compared with submitting any value at all, but without the error. Having thought everything over, and considering that finding the cause of the error had already consumed a lot of time and that receiving the data for the subsequent tasks depended entirely on when Linpack was handed in, we decided to concede this task tactically. And so an absolute "record" in the history of supercomputer competitions among valid results was set: our Linpack produced ??? TFlops. Of course, by optimizing the benchmark for the available CPUs we would have obtained a somewhat higher value, but it would hardly have been reflected in the points, while taking far more time; remember that on CPUs Linpack runs much longer. The best result was shown by National Tsing Hua University: ??? TFlops. A day or two later, Jack Dongarra (the creator of Linpack), who sits on the competition's organizing committee, casually asked us how our Linpack had gone. Apparently he had not yet seen the results board: his "WHAAAT" was worth every hour we had spent on HPL.

Mystery Application
Having submitted the benchmark, according to plan I joined the part of the team that was to deal with the Mystery Application. Nobody knew in advance what the task would be, so we prepared for the worst: beforehand we installed onto the cluster, from a flash drive, everything that could conceivably come in handy. As a rule, the main difficulty with tasks in this section is just getting them to build. This time everything turned out differently: the application built almost on the first try, without any problems. The problems began when, on most of the provided datasets, we got an addressing error, even though it was a Fortran application. Judging by the results board, we were not the only team this task caused problems for.
Secret weapon: CPU
Well, the last assignment I took part in was scheduled for the next competitive day. Unlike the Mystery Application, we had already seen the package we had to work with: cfl3d. When we found out it was a NASA product, everyone was for some reason happy, assuming that everything would be fine with both the build and the optimization. When we tried the package at home, there were indeed no problems with the build, but the usage examples were very interesting. Most of the examples depended on installing additional tools, and it happened more than once that, trying to google one of those tools (call it tool XX), we found an article from 1995 saying that tool XX was now obsolete and we should use YY instead. The product site was from the same era: the documentation often referred the user to pages of a frames-based site from which you could not get past the main page. The relevance of the examples left much to be desired.
Put very simply, the essence of the task was a clever partitioning of a multilevel grid while preserving a given level of accuracy; the main metric, of course, was time. Somehow, on that day we were as relaxed as possible and just did what was needed. The task was CPU-bound, and CPUs were exactly what we had plenty of. The problem's input files had a very specific form and were often large, up to hundreds of lines. A member of our team wrote a script that automated generating the input file, which sped up the process probably hundreds of times. In the end all the datasets were successfully completed and optimized; there was even time to try recompiling the package with some interesting options, though that gained us no further speedup. We did this task better than anyone else, receiving the special Application Innovation prize, as well as 11th place overall (out of 20 teams in the final, out of 300+ among all contestants).
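The generator script itself is not public; as a rough illustration only, here is the kind of templating it might do. The block format and parameter names below are invented, not the real cfl3d input syntax:

```python
def generate_deck(n_blocks: int, dims: tuple[int, int, int]) -> str:
    """Emit one repetitive parameter line per grid block instead of
    typing hundreds of near-identical lines by hand (format invented)."""
    lines = [f"block {b}  idim={dims[0]} jdim={dims[1]} kdim={dims[2]}"
             for b in range(1, n_blocks + 1)]
    return "\n".join(lines)

# Four identical grid blocks of a hypothetical 65 x 65 x 33 size:
deck = generate_deck(4, (65, 65, 33))
print(deck)
```

The point is not the format but the workflow: once the deck is templated, re-partitioning the grid for a new run takes seconds instead of an error-prone manual edit.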

The table with the configurations of the computing systems, as well as the main photo, are taken from the site.