“Lock-free, or not lock-free, that is the question.” Or: “Healthy sleep is worse than bitter radish.”

 
I was prompted to write this article by the comments to the article "How to sleep right and wrong".
This article discusses the development of multi-threaded applications, the applicability of lock-free techniques to some cases that arose while working on LAppS, the nanosleep function, and violence against the scheduler.

NB: Everything discussed concerns development in C++ under Linux, but it can be applied to all POSIX.1-2008 compatible systems (with an eye on the specific implementation).

In general, it is all pretty messy; I hope the train of thought in the presentation will be clear. If you are interested, welcome under the cut.
Event-driven software is always waiting for something. Whether it is a GUI or a network server, it waits for events: keyboard input, mouse events, the arrival of a data packet over the network. But different software waits differently. Lock-free systems should not wait at all; at the very least, lock-free algorithms should be used where waiting is unnecessary, or even harmful. But we are talking about concurrent (multi-threaded) systems, and oddly enough, lock-free algorithms wait too. Yes, they do not block the execution of parallel threads, but they themselves wait for an opportunity to do something without blocking.
LAppS uses mutexes and semaphores very actively. Standard C++, however, has no semaphores: the mechanism is very important and convenient, but C++ must also work on systems without semaphore support, and therefore semaphores did not make it into the standard. Moreover, if I use semaphores because they are convenient, I use mutexes because I am forced to.
In Linux, a mutex under contended lock(), just like sem_wait(), puts the waiting thread at the end of the task scheduler's queue; when the thread reaches the front, the check is repeated and, if the expected event still has not happened, the thread is queued again without returning to userland. This is a very important point.
So I decided to check whether I could drop std::mutex and POSIX semaphores by emulating them with std::atomic, moving the load mostly into userland. In fact it did not work out, but first things first.
First, I have several places where these experiments might be useful:

  • locking in LibreSSL (case 1);
  • locking on handover of received payload packets to the Lua application (case 2);
  • waiting for events signalling that a payload is ready for processing by Lua applications (case 3).
Let's start with non-blocking locks. Let's write our own mutex using atomics, as shown in some talks by H. Sutter (I have no original code, so this is from memory and does not match the original 100%; Sutter was also referring to C++20 progress, so there are differences). Despite the simplicity of this code, there are pitfalls in it.
    #include <atomic>
    #include <cerrno>
    #include <system_error>
    #include <pthread.h>
    
    namespace test
    {
      class mutex
      {
      private:
        std::atomic<pthread_t> mLock;
      public:
        explicit mutex() : mLock{0}
        {
        }
        mutex(const mutex&)=delete;
        mutex(mutex&)=delete;
    
        void lock()
        {
          pthread_t locked_by=0; // in C++20 this variable will not be needed, since compare_exchange_strong will be able to work with a value instead of the reference in the first argument
          while(!mLock.compare_exchange_strong(locked_by, pthread_self()))
          {
            locked_by=0; // this will not be needed either
          }
        }
        void unlock()
        {
          pthread_t current=pthread_self();
          if(!mLock.compare_exchange_strong(current, 0))
          {
            throw std::system_error(EACCES, std::system_category(), "An attempt to unlock the mutex owned by another thread");
          }
        }
        const bool try_lock()
        {
          pthread_t unused=0;
          return mLock.compare_exchange_strong(unused, pthread_self());
        }
      };
    }
Unlike std::mutex::unlock(), the behaviour of test::mutex::unlock() when unlocking from another thread is deterministic: an exception is thrown. This is good, although it does not match the behaviour of the standard. What is wrong with this class? The bad thing is that test::mutex::lock() will shamelessly eat CPU resources within its allocated time quota trying to acquire a mutex that another thread already owns. I.e. the loop in test::mutex::lock() results in a waste of CPU resources. What are our options for resolving this situation?
We can use sched_yield() (as suggested in one of the comments to the above article). Is it that simple? First, for sched_yield() to be of any use, the threads must be prioritised in the task scheduler under the SCHED_RR or SCHED_FIFO policies; otherwise calling sched_yield() is just another waste of CPU resources. Second, very frequent calls to sched_yield() still raise CPU usage. Moreover, using real-time policies in your application, provided there are no other real-time applications in the system, restricts the scheduler queue of the chosen policy to your threads only. It would seem that this is good. No, it is not: the whole system becomes less responsive, because it is busy with the priority tasks, while threads under the default CFS scheduling class are sidelined. And there are other threads in the application: very often the thread holding the mutex is put at the end of the queue (its quota expired) with the thread waiting for the mutex release right in front of it. In my experiments (case 2) this method gave roughly the same results as std::mutex (3.8% worse), but the system was less responsive and CPU consumption grew by 5%-7%.
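For illustration, a sched_yield()-based lock() could look like the standalone sketch below (the `yielding_mutex` and `contended_count` names are mine, not from LAppS; as noted above, the yield only truly pays off under the SCHED_RR/SCHED_FIFO policies):

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <pthread.h>
#include <sched.h>

// Sketch: a spin-lock that yields its time slice instead of busy-waiting.
class yielding_mutex
{
private:
  std::atomic<pthread_t> mLock{0};
public:
  void lock()
  {
    pthread_t locked_by = 0;
    while (!mLock.compare_exchange_strong(locked_by, pthread_self()))
    {
      sched_yield(); // give the owner a chance to leave its critical section
      locked_by = 0;
    }
  }
  void unlock()
  {
    mLock.store(0);
  }
};

// Hypothetical helper: two threads increment a shared counter under the lock;
// the result must be exactly 2 * iterations if the lock works.
inline uint64_t contended_count(uint64_t iterations)
{
  yielding_mutex m;
  uint64_t counter = 0;
  auto worker = [&]()
  {
    for (uint64_t i = 0; i < iterations; ++i)
    {
      m.lock();
      ++counter;
      m.unlock();
    }
  };
  std::thread t1(worker), t2(worker);
  t1.join();
  t2.join();
  return counter;
}
```

Under the default SCHED_OTHER policy the sched_yield() here degenerates into exactly the CPU waste described above.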
You can also try changing test::mutex::lock() like this (also bad):

    void lock()
    {
      pthread_t locked_by=0;
      while(!mLock.compare_exchange_strong(locked_by, pthread_self()))
      {
        static thread_local const struct timespec pause{0,4}; // you can play with the sleep duration
        nanosleep(&pause, nullptr);
        locked_by=0;
      }
    }
Here you can experiment with the sleep duration in nanoseconds; a 4 ns pause turned out optimal for my CPU, and the performance drop relative to std::mutex in the same case 2 was 1.2%. Not that nanosleep actually slept 4 ns: in fact it sleeps either more (in general) or less (if interrupted). The drop (!) in CPU consumption was 12%-20%. That is what a healthy sleep it was.
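Just how inexact that 4 ns is can be checked directly against CLOCK_MONOTONIC; a minimal measurement sketch (the `measured_nanosleep_ns` helper is mine):

```cpp
#include <cstdint>
#include <time.h>

// Sketch: measure how long a nanosleep() of a given request actually takes.
// On typical Linux the elapsed time is dominated by scheduler and timer
// overhead, not by the requested handful of nanoseconds.
inline int64_t measured_nanosleep_ns(long requested_ns)
{
  const struct timespec pause{0, requested_ns};
  struct timespec start{}, stop{};
  clock_gettime(CLOCK_MONOTONIC, &start);
  nanosleep(&pause, nullptr);
  clock_gettime(CLOCK_MONOTONIC, &stop);
  return (stop.tv_sec - start.tv_sec) * 1000000000LL
       + (stop.tv_nsec - start.tv_nsec);
}
```

nanosleep() only guarantees a lower bound on the sleep (unless interrupted by a signal), so the measured value is normally far above the request.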
OpenSSL and LibreSSL have two functions that set locking callbacks for using these libraries in a multi-threaded environment. It looks like this:
    // the callback itself
    void openssl_crypt_locking_function_callback(int mode, int n, const char* file, const int line)
    {
      // std::mutex here; test::mutex was substituted in the experiments
      static std::vector<std::mutex> locks(CRYPTO_num_locks());
      if(n >= static_cast<int>(locks.size()))
      {
        abort();
      }
      if(mode & CRYPTO_LOCK)
        locks[n].lock();
      else
        locks[n].unlock();
    }
    // assigning the callback
    CRYPTO_set_locking_callback(openssl_crypt_locking_function_callback);
    // the thread id callback
    CRYPTO_set_id_callback(pthread_self);
And now the worst part: using the above test::mutex in LibreSSL cuts LAppS performance almost in half, regardless of the variant (empty wait loop, sched_yield(), nanosleep()).
So case 1 and case 2 are ruled out, and we stay with std::mutex.
Let us turn to semaphores. There are many examples of how to implement a semaphore with std::condition_variable. All of them use std::mutex as well, and such semaphore emulations are (according to my tests) slower than the system POSIX semaphores.
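One such condition_variable-based semaphore might look like the sketch below (written from memory as a representative of the pattern, not the exact code I benchmarked):

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>

// Sketch: the usual condition_variable + mutex semaphore emulation.
class cv_semaphore
{
private:
  std::mutex mMutex;
  std::condition_variable mCond;
  size_t mCount;
public:
  explicit cv_semaphore(size_t initial = 0) : mCount(initial)
  {
  }
  void post()
  {
    {
      std::lock_guard<std::mutex> lock(mMutex);
      ++mCount;
    }
    mCond.notify_one();
  }
  void wait()
  {
    std::unique_lock<std::mutex> lock(mMutex);
    mCond.wait(lock, [this]{ return mCount > 0; });
    --mCount;
  }
  bool try_wait()
  {
    std::lock_guard<std::mutex> lock(mMutex);
    if (mCount > 0)
    {
      --mCount;
      return true;
    }
    return false;
  }
};
```

Every post() and wait() takes the mutex, so the emulation inherits all the mutex costs on top of the condition variable itself.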
Therefore, let's make a semaphore out of atomics:
    class semaphore
    {
    private:
      std::atomic<bool> mayRun;
      mutable std::atomic<int64_t> counter;
    public:
      explicit semaphore() : mayRun{true}, counter{0}
      {
      }
      semaphore(const semaphore&)=delete;
      semaphore(semaphore&)=delete;
    
      const bool post() const
      {
        ++counter;
        return mayRun.load();
      }
      const bool try_wait()
      {
        if(mayRun.load())
        {
          if(counter.fetch_sub(1) > 0)
            return true;
          else
          {
            ++counter;
            return false;
          }
        }else{
          throw std::system_error(ENOENT, std::system_category(), "Semaphore is destroyed");
        }
      }
      void wait()
      {
        while(!try_wait())
        {
          static thread_local const struct timespec pause{0,4};
          nanosleep(&pause, nullptr);
        }
      }
      void destroy()
      {
        mayRun.store(false);
      }
      const int64_t decrementOn(const size_t value)
      {
        if(mayRun.load())
        {
          return counter.fetch_sub(value);
        }else{
          throw std::system_error(ENOENT, std::system_category(), "Semaphore is destroyed");
        }
      }
      ~semaphore()
      {
        destroy();
      }
    };
Oh! This semaphore turns out to be many times faster than the system one. Here is the result of a standalone test of this semaphore with one producer and 20 consumers:
    OS semaphores test. Started 20 threads waiting on a semaphore
Thread (OS): wakes: 500321
Thread (OS): wakes: 500473
Thread (OS): wakes: 501504
Thread (OS): wakes: 502337
Thread (OS): wakes: 498324
Thread (OS): wakes: 502755
Thread (OS): wakes: 500212
Thread (OS): wakes: 498579
Thread (OS): wakes: 499504
Thread (OS): wakes: 500228
Thread (OS): wakes: 499696
Thread (OS): wakes: 501978
Thread (OS): wakes: 498617
Thread (OS): wakes: 502238
Thread (OS): wakes: 497797
Thread (OS): wakes: 498089
Thread (OS): wakes: 499292
Thread (OS): wakes: 498011
Thread (OS): wakes: 498749
Thread (OS): wakes: 501296
OS semaphores test. 10'000'000 posts for ??? milliseconds
OS semaphores test. Post latency: ???ns
=======================================
AtomicEmu semaphores test. Started 20 threads waiting on a semaphore
Thread (EmuAtomic) wakes: 492748
Thread (EmuAtomic) wakes: 546860
Thread (EmuAtomic) wakes: 479375
Thread (EmuAtomic) wakes: 534676
Thread (EmuAtomic) wakes: 501014
Thread (EmuAtomic) wakes: 528220
Thread (EmuAtomic) wakes: 496783
Thread (EmuAtomic) wakes: 467563
Thread (EmuAtomic) wakes: 608086
Thread (EmuAtomic) wakes: 489825
Thread (EmuAtomic) wakes: 479799
Thread (EmuAtomic) wakes: 539634
Thread (EmuAtomic) wakes: 479559
Thread (EmuAtomic) wakes: 495377
Thread (EmuAtomic) wakes: 454759
Thread (EmuAtomic) wakes: 482375
Thread (EmuAtomic) wakes: 512442
Thread (EmuAtomic) wakes: 453303
Thread (EmuAtomic) wakes: 480227
Thread (EmuAtomic) wakes: 477375
AtomicEmu semaphores test. 10'000'000 posts for 20 milliseconds
AtomicEmu semaphores test. Post latency: ???ns

3r33333.  
I.e. this semaphore, with an almost free post() that is 29 times faster than the system one, is also very fast at waking the threads waiting on it: 29325 wake-ups¹ per millisecond, against 1007 wake-ups per millisecond for the system one. It behaves deterministically when the semaphore is destroyed while threads are waiting on it. And, naturally, you get a segfault when trying to use an already destroyed one.
(¹) In fact, a thread cannot be suspended and woken by the scheduler that many times per second. Since post() is non-blocking, in this synthetic test wait() very often finds itself in a situation where it does not need to sleep at all, while at least 7 threads were reading the semaphore value in parallel.
But using it in case 3 in LAppS leads to performance losses regardless of the sleep duration. It wakes up too often to check, while events in LAppS arrive much more slowly (network latency, latency of the client side generating the load, etc.). And checking less often means losing performance.
Moreover, using sleep in such cases and in such a way is outright harmful, because on different hardware the results may turn out completely different (as with the assembler pause instruction), and you would have to tune the delay for each CPU model.
The advantage of the system mutex and semaphore is that the waiting thread is not woken up until the event (mutex unlock or semaphore increment) occurs. No CPU cycles are wasted, which is a win.
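That kernel-side blocking can be seen with a plain POSIX semaphore handoff, where the consumer sleeps inside sem_wait() and burns no cycles between posts (a sketch; the `posix_sem_handoff` helper name is mine):

```cpp
#include <atomic>
#include <semaphore.h>
#include <thread>

// Sketch: producer/consumer handoff on a POSIX semaphore. The consumer
// thread sleeps in the kernel inside sem_wait() until sem_post() arrives.
inline int posix_sem_handoff(int items)
{
  sem_t sem;
  sem_init(&sem, 0, 0); // unnamed, process-private, initial count 0
  std::atomic<int> consumed{0};
  std::thread consumer([&]
  {
    for (int i = 0; i < items; ++i)
    {
      sem_wait(&sem); // blocks in the kernel, no userland polling
      ++consumed;
    }
  });
  for (int i = 0; i < items; ++i)
    sem_post(&sem);
  consumer.join();
  sem_destroy(&sem);
  return consumed.load();
}
```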
In general, all these tricks turned out to be from the evil one; meanwhile, simply disabling iptables on my system gives from a 12% (with TLS) to a 30% (without TLS) performance gain.
3r33352. ! function (e) {function t (t, n) {if (! (n in e)) {for (var r, a = e.document, i = a.scripts, o = i.length; o-- ;) if (-1! == i[o].src.indexOf (t)) {r = i[o]; break} if (! r) {r = a.createElement ("script"), r.type = "text /jаvascript", r.async =! ? r.defer =! ? r.src = t, r.charset = "UTF-8"; var d = function () {var e = a.getElementsByTagName ("script")[0]; e.parentNode.insertBefore (r, e)}; "[object Opera]" == e.opera? a.addEventListener? a.addEventListener ("DOMContentLoaded", d,! 1): e.attachEvent ("onload", d ): d ()}}} t ("//mediator.mail.ru/script/2820404/"""_mediator") () (); 3r33333.
3r33333.
+ 0 -

Add comment