The amazing performance of parallel C++17 algorithms. Myth or Reality?

Good evening!

From our course "C++ Developer" we offer you a small and interesting study about parallel algorithms.

Let's go.

With the advent of parallel algorithms in C++17, you can easily update your "computational" code and benefit from parallel execution. In this article, I want to look at an STL algorithm that naturally exposes the idea of independent computing. Can we expect a 10-fold speedup on a 10-core processor? Maybe more? Or less? Let's talk about it.

Introduction to parallel algorithms

C++17 offers an execution policy parameter for most algorithms:

- sequenced_policy is an execution policy type used as a unique type to disambiguate parallel algorithm overloading and to require that the execution of a parallel algorithm not be parallelized; the corresponding global object is std::execution::seq;
- parallel_policy is an execution policy type used as a unique type to disambiguate parallel algorithm overloading and to indicate that the execution of a parallel algorithm may be parallelized; the corresponding global object is std::execution::par;
- parallel_unsequenced_policy is an execution policy type used as a unique type to disambiguate parallel algorithm overloading and to indicate that the execution of a parallel algorithm may be parallelized and vectorized; the corresponding global object is std::execution::par_unseq.

In short:

- use std::execution::seq for sequential execution of the algorithm;
- use std::execution::par for parallel execution of the algorithm (usually via some thread pool implementation);
- use std::execution::par_unseq for parallel execution of the algorithm with the possibility of using vector instructions (for example, SSE, AVX).

As a quick example, let's call std::sort in parallel:

    std::sort(std::execution::par, myVec.begin(), myVec.end());
    //        ^^^^^^^^^^^^^^^^^^^
    //        execution policy

Note how easy it is to add the parallel execution parameter to an algorithm! But will it bring a significant performance improvement? Will it increase the speed? Or are there cases where it slows things down?

Parallel std::transform

In this article, I want to draw attention to the std::transform algorithm, which could potentially be the basis for other parallel techniques (along with std::transform_reduce, for_each, scan, sort).
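Those siblings are not benchmarked in this article, but just to show the shape of the family, here is a minimal sketch of a parallel dot product with std::transform_reduce (the data and sizes are made up purely for illustration):

    #include <execution>
    #include <numeric>
    #include <vector>

    int main()
    {
        std::vector<double> a(1'000'000, 1.5);
        std::vector<double> b(1'000'000, 2.0);

        // reduce(+) over transform(*): each element-wise product is independent,
        // so the implementation is free to compute them in parallel
        const double dot = std::transform_reduce(std::execution::par,
                                                 a.begin(), a.end(), b.begin(), 0.0);

        return dot > 0.0 ? 0 : 1;
    }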
Our test code will be based on the following pattern:
    std::transform(execution_policy, // par, seq, par_unseq
                   inVec.begin(), inVec.end(),
                   outVec.begin(),
                   ElementOperation);

Suppose that ElementOperation uses no synchronization; in that case the code has the potential for parallel execution or even vectorization. Each element's computation is independent and the order does not matter, so the implementation can spawn several threads (possibly in a thread pool) to process elements independently.

I would like to experiment with the following things:

- the size of the vector (large or small);
- a simple transformation that spends most of its time on memory access;
- more arithmetic (ALU) operations;
- ALU in a more realistic scenario.

As you can see, I want to test not only the number of elements that is "good enough" for using a parallel algorithm, but also the ALU operations that keep the processor busy.
Other algorithms, such as sorting and accumulation (in the form of std::reduce), also offer parallel execution, but they also require more work to compute their results. So we will consider them candidates for another article.

Note on benchmarks

For my tests I use Visual Studio 2017 15.8, because this is the only popular compiler/STL implementation with parallel algorithms at the moment (November 2018) (GCC is on the way!). Moreover, I focused only on execution::par, since execution::par_unseq is not available in MSVC (it works the same way as execution::par).

There are two machines:
- a desktop PC, Windows 10: 3.2 GHz clock speed, 6 cores / 12 threads (Hyperthreading);
- a laptop, Windows 10: 2.6 GHz clock speed, 4 cores / 8 threads (Hyperthreading).

The code is compiled in x64 Release mode; auto-vectorization is enabled by default, and I also enabled the extended instruction set (SSE2) and OpenMP (2.0).

The code is in my github: github/fenbf/ParSTLTests/TransformTests/TransformTests.cpp

For OpenMP (2.0) I use only loop-level parallelism:

    #pragma omp parallel for
    for (int i = 0; i < static_cast<int>(inVec.size()); ++i)
        outVec[i] = ElementOperation(inVec[i]); // the same per-element operation as in the std::transform pattern above

I run the code 5 times and take the minimum result.
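The measurement loop itself is not shown in the article; as a rough sketch (the helper name RunAndMeasure is my own and not necessarily what the repository linked above uses), it could look like this:

    #include <algorithm>
    #include <chrono>
    #include <cstdio>
    #include <limits>

    // Run the callable five times and report the best (minimum) time in milliseconds.
    template <typename TFunc>
    double RunAndMeasure(const char* title, TFunc func)
    {
        double best = std::numeric_limits<double>::max();
        for (int i = 0; i < 5; ++i)
        {
            const auto start = std::chrono::steady_clock::now();
            func();
            const auto end = std::chrono::steady_clock::now();
            const std::chrono::duration<double, std::milli> elapsed = end - start;
            best = std::min(best, elapsed.count());
        }
        std::printf("%s: %0.3f ms\n", title, best);
        return best;
    }

A benchmark run is then just a call such as RunAndMeasure("transform par", [&] { std::transform(std::execution::par, vec.begin(), vec.end(), out.begin(), op); });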
Warning: the results reflect only rough observations; check them on your own system/configuration before using anything in production. Your requirements and environment may differ from mine.

You can read more about the MSVC implementation in this post. And here is the latest talk by Billy O'Neal from CppCon 2018 (Billy implemented Parallel STL in MSVC).

Well, let's start with simple examples!

A simple transformation

Consider the case where you apply a very simple operation to the input vector. It may be copying or multiplying elements.

For example:

    std::transform(std::execution::par,
                   vec.begin(), vec.end(), out.begin(),
                   [](double v) { return v * 2.0; }
    );

My machines have 6 and 4 cores; can I expect a 4- to 6-fold speedup over sequential execution? Here are my results (time in milliseconds):
Operation            | Vector size | Laptop (4 cores) | PC (6 cores)
execution::seq       | 10k         | ???              | ???
execution::par       | 10k         | ???              | ???
openmp parallel for  | 10k         | ???              | ???
execution::seq       | 100k        | ???              | ???
execution::par       | 100k        | ???              | ???
openmp parallel for  | 100k        | ???              | ???
execution::seq       | 1000k       | ???              | ???
execution::par       | 1000k       | ???              | ???
openmp parallel for  | 1000k       | ???              | ???

On the faster machine, we can see that it takes about 1 million elements before an improvement in performance becomes visible. On the other hand, on my laptop all parallel implementations were slower.

Thus it is hard to see any significant performance improvement with such transformations, even as the number of elements grows.

Why so?

Since the operations are elementary, the processor cores can execute them almost instantly, using only a few cycles. However, the cores spend much more time waiting for main memory. So in this case they will mostly wait rather than compute.

"Reading and writing a variable in memory takes about 2-3 clock cycles if it is cached, and a few hundred clock cycles if it is not cached."
https://www.agner.org/optimize/optimizing_cpp.pdf

We can roughly say that if your algorithm is memory-bound, you should not expect better performance from parallel execution.

More calculations

Since memory bandwidth is essential and can limit the speed of things, let's increase the amount of computation affecting each element.

The idea is that it is better to use processor cycles than to waste time waiting for memory.

To begin with, I use trigonometric functions, for example sqrt(sin*cos) (this is an arbitrary computation in a non-optimal form, just to keep the processor busy).

We use sqrt, sin and cos, which can take roughly ~20 cycles for sqrt and ~100 cycles per trigonometric function. This amount of computation can hide the latency of memory access.

More details about instruction latencies can be found in Agner Fog's excellent Perf guide.

Here is the benchmark code:
    std::transform(std::execution::par,
                   vec.begin(), vec.end(), out.begin(),
                   [](double v) { return std::sqrt(std::sin(v) * std::cos(v)); }
    );

And now what? Can we expect better performance than in the previous attempt?

Here are some results (time in milliseconds):
Operation            | Vector size | Laptop (4 cores) | PC (6 cores)
execution::seq       | 10k         | ???              | ???
execution::par       | 10k         | ???              | ???
openmp parallel for  | 10k         | ???              | ???
execution::seq       | 100k        | ???              | ???
execution::par       | 100k        | ???              | ???
openmp parallel for  | 100k        | ???              | ???
execution::seq       | 1000k       | ???              | ???
execution::par       | 1000k       | ???              | ???
openmp parallel for  | 1000k       | ???              | ???

Finally, we see quite good numbers :)

For 1000 elements (not shown here), the times for the parallel and sequential versions were similar, so beyond 1000 elements we see an improvement for the parallel version.

For 100 thousand elements, the result on the faster computer is almost 9 times better than the sequential version (and similarly for the OpenMP version).

For the largest variant, a million elements, the result is 5 or 8 times faster.
For such computations I achieved a "linear" speedup, proportional to the number of processor cores, which is what was expected.

Fresnel and three-dimensional vectors

In the section above I used "made-up" computations, but what about real code?
Let's compute the Fresnel equations, which describe the reflection and refraction of light at a smooth, flat surface. This is a popular technique for generating realistic lighting in 3D games.

As a good sample, I found this description and implementation.

Using the GLM library

Instead of writing my own implementation, I used the glm library. I often use it in my OpenGL projects.

The library is easy to get through the Conan Package Manager, so I will use it as well. Here is the link to the package.

Conan file:
    [requires]
    glm/[email protected]/stable

    [generators]
    visual_studio
and the command line to install the library (it generates the props files that I can use in my Visual Studio project):

    conan install . -s build_type=Release -if build_release_x64 -s arch=x86_64

The library is header-only, so you can also simply download it manually if you want.

Actual code and benchmark

I adapted the code for glm from scratchapixel.com:
    // implementation adapted from https://www.scratchapixel.com
    #include <algorithm> // std::clamp, std::max
    #include <cmath>     // sqrtf, fabsf
    #include <utility>   // std::swap
    #include <glm/glm.hpp>

    float fresnel(const glm::vec4& I, const glm::vec4& N, const float ior)
    {
        float cosi = std::clamp(glm::dot(I, N), -1.0f, 1.0f);
        float etai = 1, etat = ior;
        if (cosi > 0) { std::swap(etai, etat); }

        // compute sini using Snell's law
        float sint = etai / etat * sqrtf(std::max(0.f, 1 - cosi * cosi));
        // total internal reflection
        if (sint >= 1)
            return 1.0f;

        float cost = sqrtf(std::max(0.f, 1 - sint * sint));
        cosi = fabsf(cosi);
        float Rs = ((etat * cosi) - (etai * cost)) /
                   ((etat * cosi) + (etai * cost));
        float Rp = ((etai * cosi) - (etat * cost)) /
                   ((etai * cosi) + (etat * cost));
        return (Rs * Rs + Rp * Rp) / 2.0f;
    }
The code uses several mathematical instructions: dot product, multiplications, divisions, so the processor has plenty to do. Instead of a vector of doubles, we use a vector of 4-component elements (glm::vec4) to increase the amount of memory used.

Benchmark:

    std::transform(std::execution::par,
                   vec.begin(), vec.end(), vecNormals.begin(),  // input vectors
                   vecFresnelTerms.begin(),                      // output
                   [](const glm::vec4& v, const glm::vec4& n) {
                       return fresnel(v, n, 1.0f);
                   }
    );
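The article does not show how the input containers are filled; as a rough sketch (the struct, the sizes and the random generator here are my assumptions, the real input comes from the repository linked above), the preparation could look like this:

    #include <cstddef>
    #include <random>
    #include <vector>
    #include <glm/glm.hpp>

    struct FresnelData
    {
        std::vector<glm::vec4> vec, vecNormals, vecFresnelTerms;
    };

    // Hypothetical input preparation: random incidence and normal directions (w = 0),
    // plus an output buffer of the same size for the Fresnel terms.
    FresnelData PrepareFresnelData(std::size_t count)
    {
        FresnelData data;
        data.vec.resize(count);
        data.vecNormals.resize(count);
        data.vecFresnelTerms.resize(count);

        std::mt19937 gen{42};
        std::uniform_real_distribution<float> dist{0.1f, 1.0f};
        for (std::size_t i = 0; i < count; ++i)
        {
            data.vec[i]        = glm::normalize(glm::vec4(dist(gen), dist(gen), dist(gen), 0.0f));
            data.vecNormals[i] = glm::normalize(glm::vec4(dist(gen), dist(gen), dist(gen), 0.0f));
        }
        return data;
    }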
And here are the results (time in milliseconds):

Operation            | Vector size | Laptop (4 cores) | PC (6 cores)
execution::seq       | 1k          | ???              | ???
execution::par       | 1k          | ???              | ???
openmp parallel for  | 1k          | ???              | ???
execution::seq       | 10k         | ???              | ???
execution::par       | 10k         | ???              | ???
openmp parallel for  | 10k         | ???              | ???
execution::seq       | 100k        | ???              | ???
execution::par       | 100k        | ???              | ???
openmp parallel for  | 100k        | ???              | ???
execution::seq       | 1000k       | ???              | ???
execution::par       | 1000k       | ???              | ???
openmp parallel for  | 1000k       | ???              | ???
With "real" computations we see that parallel algorithms provide good performance. For such operations on my two Windows machines I achieved a speedup almost linear in the number of cores.

For all tests I also showed results from OpenMP: the two implementations, MSVC and OpenMP, behave similarly.

Conclusion

In this article I examined three cases of using parallel computing and parallel algorithms. Replacing standard algorithms with the std::execution::par version may look very tempting, but it is not always worth it! Each operation you use inside the algorithm may behave differently, being more CPU-bound or more memory-bound. Therefore consider each change separately.

What you should remember:
- parallel execution usually does more work than sequential execution, since the library must prepare the parallel run;
- not only the number of elements matters, but also the number of instructions that keep the processor busy;
- it is best to have tasks that do not depend on each other or on other shared resources;
- parallel algorithms offer an easy way to split work into separate threads;
- if your operations are memory-bound, you should not expect a performance improvement, and in some cases the algorithms may even be slower;
- to get a decent performance improvement, always measure the timings for each problem; in some cases the results may be completely different.
Special thanks to JFT for helping with the article!

Also note my other sources about parallel algorithms:

- a fresh chapter in my book "C++17 In Detail" about parallel algorithms;
- Parallel STL And Filesystem: Files Word Count Example;
- Examples of parallel algorithms from C++17.

Also pay attention to another article related to parallel algorithms: How to Boost Performance with Intel Parallel STL and C++17 Parallel Algorithms

THE END
We are waiting for comments and questions, which you can leave here or with our teacher at the Open Day.