The secrets of “impossible” computing on the GPU
Our experience of using a computing cluster of 480 AMD RX 480 GPUs to solve a math problem. As the problem, we took proving a theorem from the article by Professor A. Chudnov, “Cyclic decompositions of sets, separating digraphs and cyclic classes of games with guaranteed win”. The task is to find the minimum number of participants in one coalition in Nim-type coalition games that guarantees a win for one of the parties.
From a mathematical point of view, this is a search for a reference cyclic sequence. If the sequence is represented as a list of zeros and ones, the reference test can be implemented with bitwise logical operations. From a programming point of view, such a sequence is a long register, for example 256 bits wide. The most reliable way to solve the problem is to enumerate all the options, except those that are impossible for obvious reasons.
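To illustrate the style of such bitwise checks, here is a minimal C++ sketch on a toy scale: the sequence lives in the low n bits of a 64-bit word, and a property of all its cyclic shifts is tested with shifts, ORs and masks. The predicate `all_shifts_distinct` is a hypothetical stand-in, not the actual reference test from Chudnov's article.

```cpp
#include <cstdint>

// Rotate the low n bits of x left by k positions (cyclic shift);
// bits above position n-1 are masked off.
uint64_t rotl_n(uint64_t x, unsigned k, unsigned n) {
    uint64_t mask = (n == 64) ? ~0ULL : ((1ULL << n) - 1);
    k %= n;
    if (k == 0) return x & mask;
    return ((x << k) | (x >> (n - k))) & mask;
}

// Toy stand-in for a "reference test": the sequence is aperiodic,
// i.e. no nontrivial cyclic shift reproduces it.
bool all_shifts_distinct(uint64_t seq, unsigned n) {
    for (unsigned i = 1; i < n; ++i)
        if (rotl_n(seq, i, n) == seq) return false;
    return true;
}
```

For example, `all_shifts_distinct(0b00011101, 8)` holds, while the periodic pattern `0b01010101` fails the check after a shift by two. The real 256-bit test works the same way, only spread across several machine words.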
The problem is motivated by questions of efficient signal processing (detection, synchronization, coordinate measurement, coding, etc.).
The difficulty lies in the huge number of options to search. For example, if we are looking for a solution for n = 25, that is 25 bits, and for n = 100 it is already 100 bits. The number of all possible combinations for n = 25 is 2^25 = 33,554,432, and for n = 100 it is already 2^100 = 1,267,650,600,228,229,401,496,703,205,376 combinations. The growth in complexity is simply enormous!
Such a task parallelizes well, which makes it ideal for our GPU cluster.
Programmers vs. Mathematicians
Initially, the mathematicians solved this problem in Visual Basic in Excel; that way they managed to obtain the first solutions, but the low performance of scripting languages did not let them get much further. The solution up to n = 80 took a month and a half. We bow our heads to these patient people.
The first step was to implement the algorithm in C and run it on the CPU. In the process it turned out that a lot can be optimized when working with bit sequences.
Next, we narrowed the search area and eliminated duplication. A good result also came from analyzing the assembly code generated by the compiler and tuning the code to the compiler's peculiarities. All this made it possible to significantly speed up the calculations.
The next stage of optimization was profiling. Measuring the execution time of different sections of the code showed that in some branches of the algorithm the memory load grew sharply, and it also revealed excessive branching in the program. Because of this “small” flaw, almost a third of the CPU's power went unused.
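A typical fix for this kind of flaw is to replace a data-dependent branch with arithmetic on a mask. The sketch below is a hypothetical illustration of the technique (the functions and the tested bit are invented for the example), not a fragment of the actual solver:

```cpp
#include <cstdint>

// Branchy version: a mispredicted conditional stalls the pipeline,
// and on a GPU it also causes divergence within a wavefront.
uint32_t add_if_set_branchy(uint32_t acc, uint32_t word, uint32_t bit) {
    if (word & (1u << bit)) acc += 1;
    return acc;
}

// Branchless version: extract the tested bit as a 0/1 value and add it
// unconditionally, so every iteration executes the same instructions.
uint32_t add_if_set_branchless(uint32_t acc, uint32_t word, uint32_t bit) {
    return acc + ((word >> bit) & 1u);
}
```

Both functions compute the same result; the second simply never asks the branch predictor for help.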
A very important aspect of solving such problems is the accuracy of the code. Nobody knows the correct answers to this problem, so there are no test vectors. There is only the first part of the range of solutions, found by the mathematicians. The reliability of new solutions can be guaranteed only by the accuracy with which the code is written.
Then came the stage of preparing the program to run on the GPU, and the code was reworked to run in several threads. The control program now dispatches tasks between threads. In the multithreaded version, the computation speed increased 5 times! This was achieved by running 4 threads simultaneously and merging functions.
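The dispatching scheme can be sketched in a few lines of C++: the search range is split into equal chunks, each worker thread counts candidates in its own sub-range, and the partial results are summed at the end. `is_candidate` is a hypothetical stand-in for the real sequence test; per-thread partial counters avoid any locking in the hot loop.

```cpp
#include <algorithm>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical predicate standing in for the real sequence test.
bool is_candidate(uint64_t code) { return code % 1000003 == 0; }

// Split [begin, end) evenly across nthreads workers; each worker writes
// only to its own slot of `partial`, so the loop needs no synchronization.
uint64_t count_candidates(uint64_t begin, uint64_t end, unsigned nthreads) {
    std::vector<std::thread> pool;
    std::vector<uint64_t> partial(nthreads, 0);
    uint64_t chunk = (end - begin + nthreads - 1) / nthreads;
    for (unsigned t = 0; t < nthreads; ++t) {
        pool.emplace_back([=, &partial] {
            uint64_t lo = begin + t * chunk;
            uint64_t hi = std::min(end, lo + chunk);
            for (uint64_t c = lo; c < hi; ++c)
                if (is_candidate(c)) ++partial[t];
        });
    }
    for (auto& th : pool) th.join();
    uint64_t total = 0;
    for (uint64_t p : partial) total += p;
    return total;
}
```

The same partitioning idea carries over directly to the GPU, where "threads" become kernel instances and the chunk index becomes the instance id.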
At this stage the solution produced correct results up to n = 80 in 10 minutes, whereas in Excel those calculations took six weeks! A small victory!
GPU and OpenCL
We decided to use OpenCL version 1.2 to ensure maximum compatibility across platforms. Primary debugging was done on an Intel CPU, then on an Intel GPU, and only then did we switch to AMD GPUs.
The OpenCL 1.2 standard supports 64-bit integer variables. A 128-bit width is nominally supported by AMD, but is compiled into a pair of 64-bit numbers. For compatibility and performance, we decided to represent a 256-bit number as a group of 32-bit numbers, on which the GPU's internal ALUs execute bitwise logical operations as fast as possible.
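A minimal sketch of this layout, assuming little-endian word order (the type name `U256` and both helpers are invented for illustration; the article does not publish the real code):

```cpp
#include <array>
#include <cstdint>

// A 256-bit register kept as eight 32-bit words, so every operation maps
// onto a plain 32-bit ALU instruction with no 64/128-bit emulation.
using U256 = std::array<uint32_t, 8>;  // word 0 holds the least significant bits

// Bitwise AND is fully word-parallel: no carries cross word boundaries.
U256 and256(const U256& a, const U256& b) {
    U256 r;
    for (int i = 0; i < 8; ++i) r[i] = a[i] & b[i];
    return r;
}

// Shift left by one bit, carrying the top bit of each word into the next.
U256 shl1(const U256& a) {
    U256 r;
    uint32_t carry = 0;
    for (int i = 0; i < 8; ++i) {
        r[i] = (a[i] << 1) | carry;
        carry = a[i] >> 31;
    }
    return r;
}
```

Only the shift needs inter-word carries; the purely bitwise operations (AND, OR, XOR) that dominate the reference test cost exactly one instruction per word.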
An OpenCL program contains a kernel, a function that is the entry point of the program. Data for processing is loaded from the CPU into the video card's RAM and passed to the kernel as buffers: pointers to the input and output data arrays. Why arrays? We are doing high-performance computing and need many tasks executed at the same time. The kernel runs on the device in multiple instances. Each instance knows its own identifier and takes exactly its own piece of input data from the common buffer. A case where the simplest solution is the most effective.

OpenCL is not just a language but a comprehensive framework in which every detail of scientific and gaming computation is carefully thought out. That makes the developer's life easier. For example, you can launch many threads, and the task manager will place them on the device itself. Tasks that cannot be executed immediately are put in a waiting queue and launched as compute units become free. Each kernel instance has its own slot in the output buffer, where it writes the answer when its work is done.
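The execution model can be emulated on the CPU in a few lines. In this sketch the loop index plays the role of `get_global_id(0)` in OpenCL C; on the GPU the loop body runs as thousands of parallel kernel instances. `check_block` is a hypothetical stand-in for the per-task sequence test:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical per-task work: on the GPU this would be the kernel body.
uint32_t check_block(uint64_t start_code) {
    return static_cast<uint32_t>(start_code & 1);  // dummy result
}

// CPU emulation of the OpenCL dispatch: every "instance" learns its id,
// reads its own slot of the input buffer, and writes its answer to the
// same slot of the output buffer - no instance touches another's data.
void run_kernel_emulated(const std::vector<uint64_t>& input,
                         std::vector<uint32_t>& output) {
    for (std::size_t id = 0; id < input.size(); ++id)
        output[id] = check_block(input[id]);
}
```

Because each instance owns exactly one input and one output slot, the scheme needs no synchronization at all, which is precisely why it maps so well onto the GPU.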
The main task of the OpenCL dispatcher is to keep multiple kernel instances executing in parallel. Decades of accumulated scientific and practical experience are applied here: while some cores are loading data into registers, others are working with memory or performing calculations, so the GPU's compute units are always fully loaded.
The OpenCL compiler does a good job of optimizing, but the developer can easily influence performance too. Optimization for the GPU goes in two directions: speeding up code execution and enabling its parallelization. How well the compiler parallelizes the code depends on several things: the number of scratch registers (which reside in the slowest GPU memory, global), the size of the compiled code (it needs to fit into the 32 KB cache), and the number of vector and scalar registers used.
ComBox A-480 GPU or one million cores
This is the most interesting part of the project, when we moved from Excel to a computing cluster of 480 AMD RX 480 video cards. Large, fast, efficient, and fully ready to tackle the task and obtain results the world has never seen.
It should be noted that at every stage of improving and optimizing the code, we restarted the search from the very beginning and compared the answers of the new version with the previous ones. This made it possible to be sure that the optimizations and improvements did not introduce errors into the solutions. Keep in mind that there are no correct answers at the back of the textbook here, and no one in the world knows them.
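The cross-version check described above amounts to a simple regression test: the new build must reproduce the previously verified range bit for bit before it is trusted on unexplored values of n. A minimal sketch (the function name and result encoding are assumptions for illustration):

```cpp
#include <cstdint>
#include <vector>

// Every new build re-solves the already-known range; its answers must
// match the previous version's exactly before the new version is used
// to search beyond the verified range.
bool matches_reference(const std::vector<uint64_t>& new_results,
                       const std::vector<uint64_t>& reference) {
    if (new_results.size() < reference.size()) return false;
    for (std::size_t i = 0; i < reference.size(); ++i)
        if (new_results[i] != reference[i]) return false;
    return true;
}
```

A single mismatched entry is enough to reject a build, which is the only safety net available when no independent test vectors exist.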
The launch on the cluster confirmed our assumptions about the speed: the search for sequences for n > 100 took about an hour. It was remarkable to watch the ComBox A-480 cluster find new solutions in minutes, while on the CPU it took many hours.
In just two hours of cluster time, we obtained all the solutions up to n = 127. Verification showed that the answers are reliable and consistent with the theorems of Professor A. Chudnov set forth in the article.
The evolution of speed
If you look at the performance gains over the course of solving the problem, the results were approximately as follows:
a month and a half to n = 80 in Excel;
an hour to n = 80 on Core i5 with an optimized C++ program;
10 minutes to n = 80 on Core i5 using multithreading;
10 minutes to n = 100 on a single AMD RX 480 GPU;
120 minutes to n = 127 on ComBox A-480.
Perspectives and the Future
Many tasks at the intersection of science and practice are awaiting their solution to make our lives better. The market for renting computing power is only taking shape, and the need for parallel computing continues to grow.
Possible applications of parallel computing:
tasks of automatic control of vehicles and drones;
calculations of aerodynamic and hydrodynamic characteristics;
speech and visual image recognition;
neural network training;
tasks of astronomy and astronautics;
statistical and correlation data analysis;
folding of protein compounds;
early diagnosis of diseases with the use of AI.
A separate direction is cloud computing on GPUs. Giants such as Amazon, IBM and Google lease out their GPU computing power. Today it is safe to say that the future of high-performance parallel computing belongs to GPU clusters.