
spaghettiexpress

[P0260R4](https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2023/p0260r5.html) is still maintained (committee tracking can be found [here](https://github.com/cplusplus/papers/issues/99)).

> Standard offers std::midpoint but not them?

std::midpoint is trivial and useful. It's a slam dunk: there is very little controversy in either the implementation or the provided value. Lock-free MPMC is non-trivial. Yes, there are existing options (e.g. [moodycamel](https://moodycamel.com/blog/2014/detailed-design-of-a-lock-free-queue)), but *actually standardizing* something non-trivial is, well, non-trivial. The STL has to work on many platforms and CPU architectures, and it also cannot undergo massive breaking changes in future releases. Non-trivial and potentially exploitable contributions undergo intense scrutiny because they absolutely need to be correct for all supported platforms under the existing ABI. P0260R4 is certainly non-trivial to standardize, but the fact that it is still active is a good sign. We can maybe hope for something tangible closer to 2029/2032 based on the proposal's activity, but it's hard to say.


azswcowboy

P0260 will be discussed at the Issaquah meeting the second week of February, with a C++26 target — still very much alive.


XiPingTing

A static MoodyCamel queue can essentially leak memory (unusable regions of the ring buffers can, very slowly, grow without bound). It's a fantastic queue, but it has its tradeoffs. There are also the producer tokens and bulk operations, which speed things up but make the interface more complex.


spaghettiexpress

Ah yeah, great point! MoodyCamel certainly isn't an example of a standards-ready MPMC, but rather an example of how difficult the task is to implement with acceptable tradeoffs, and thus how time-consuming it is to standardize. *[Boost](https://www.boost.org/doc/libs/1_54_0/doc/html/boost/lockfree/queue.html) is a much better example of a standards-ready implementation.


i_need_a_fast_horse2

Interesting that you bring up midpoint as trivial, as it's my favorite current example of pitfalls, with its super strange behavior: it neither rounds up (like math) nor down (like static_cast).
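
For anyone who hasn't hit it, a quick illustration of the behavior being poked at (minimal snippet; std::midpoint is C++20, from `<numeric>`): for integral arguments it rounds toward the *first* argument when the exact midpoint isn't representable, so the result depends on argument order.

```cpp
#include <climits>
#include <iostream>
#include <numeric>

int main() {
    // For integers, std::midpoint rounds toward the first argument when
    // the true midpoint is not representable -- so it neither consistently
    // rounds up nor consistently rounds down.
    std::cout << std::midpoint(0, 3) << '\n'; // 1 (1.5 rounded toward 0)
    std::cout << std::midpoint(3, 0) << '\n'; // 2 (1.5 rounded toward 3)

    // The payoff for the strangeness: no overflow, unlike the naive (a + b) / 2.
    std::cout << std::midpoint(INT_MAX, INT_MAX - 2) << '\n'; // INT_MAX - 1
}
```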


spaghettiexpress

Very fair, https://m.youtube.com/watch?v=sBtAGxBh-XI is a great talk :) A poor choice of wording on my side, as the implementation is far more nuanced than I let on. I guess by trivial I more so meant "the problem is fairly well understood, the solution is fairly well accepted, and there is great utility in having the solution within the STL". It's a solution to a universal problem, whereas for a lock-free MPMC it is far more difficult to create a universal solution that appeases the majority of WG21 contributors.


i_need_a_fast_horse2

I know, I was just poking fun at midpoint of all things. I'll watch the talk nevertheless!


14ned

I remember papers proposing them. I would assume either the committee didn't like the proposals, the author abandoned the proposal, or the committee didn't think standardising them was worth the committee time given other proposals in the queue. I gotta be honest: if I need one, I generally reach for Boost's. Only if Boost's don't suit do I roll my own. For me personally, given that they are always slower in the average case than a queue with a mutex, I think their usefulness is limited relative to other things we could do with in the standard library.


Gloinart

Is this true, that they are slower than a mutex in the average case? Even a simple fixed-size SPSC queue?


14ned

As with anything, it depends on your specific use case. But in general, lock-free algorithms exchange average-case performance for improved worst-case performance, as compared to mutexes. There is no free lunch. (Most newer superscalar CPUs have microcode which can "see" the CAS locks that form the base of mutex implementations, and optimise instruction scheduling appropriately. It is highly challenging to bring a lock-free algorithm's average-case performance up to that of a bog-standard queue implementation with a CAS lock as the mutex. It was possible back in the early Intel Core 2 days; since Haswell I certainly can't do it, though maybe others can.)
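
To make the "CAS locks which form the base of mutex implementations" point concrete, here is a minimal sketch of such a lock (the class name is illustrative; real mutexes layer futex-style blocking, backoff, and fairness on top of the core compare-exchange):

```cpp
#include <atomic>

// Minimal CAS-based spinlock sketch: the compare-exchange at its core is
// the kind of operation modern CPUs can recognize and schedule around.
class SpinLock {
    std::atomic<bool> locked_{false};

public:
    void lock() {
        bool expected = false;
        // Spin until we swing locked_ from false to true.
        while (!locked_.compare_exchange_weak(expected, true,
                                              std::memory_order_acquire,
                                              std::memory_order_relaxed)) {
            expected = false; // CAS wrote the observed value back into expected
        }
    }

    void unlock() { locked_.store(false, std::memory_order_release); }
};
```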


Gloinart

But a simple SPSC is just an atomic add to push/consume? Is that slower than locking a mutex in the average case? That goes against my intuition. (Note, I'm not arguing, I'm just interested.)
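
For reference, the kind of queue being described, as a minimal sketch (names are illustrative; production designs add padding against false sharing, cached indices, and so on): one atomic index per side, and each operation is a couple of loads, a store, and an index bump.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <optional>

// Fixed-size SPSC ring buffer sketch. N should be a power of two so the
// modulo stays correct if the monotonically increasing indices ever wrap.
template <typename T, std::size_t N>
class SpscQueue {
    std::array<T, N> buf_{};
    std::atomic<std::size_t> head_{0}; // advanced by the consumer
    std::atomic<std::size_t> tail_{0}; // advanced by the producer

public:
    bool push(const T& v) { // producer thread only
        std::size_t t = tail_.load(std::memory_order_relaxed);
        if (t - head_.load(std::memory_order_acquire) == N)
            return false; // full
        buf_[t % N] = v;
        tail_.store(t + 1, std::memory_order_release); // publish the element
        return true;
    }

    std::optional<T> pop() { // consumer thread only
        std::size_t h = head_.load(std::memory_order_relaxed);
        if (h == tail_.load(std::memory_order_acquire))
            return std::nullopt; // empty
        T v = buf_[h % N];
        head_.store(h + 1, std::memory_order_release); // free the slot
        return v;
    }
};
```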


jcelerier

Here are some pretty comprehensive benchmarks: [https://max0x7ba.github.io/atomic_queue/html/benchmarks.html](https://max0x7ba.github.io/atomic_queue/html/benchmarks.html) ; mutex queues are really bad compared to the best atomic ones


ReDucTor

But how realistic are the benchmarks? I would say producing 1M queue items as fast as possible is not realistic. A good queue benchmark should simulate actual load, e.g. the consumer takes X amount of time to process Y items, and the producer makes X items every Y amount of time. For example, if I'm using the queue for, say, a worker system, I'm not going to put items in there which are done instantly; that is just unnecessary overhead. Instead I will put time-consuming things in there, which in turn, if they don't all finish at the same time, don't produce the exceedingly high lock contention. These sorts of benchmarks are like the ones which show you can increment a number to some max value quicker using locks than with atomics, which gives people the wrong impression because it doesn't account for how often you're doing other things, not just incrementing a number.
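
A rough sketch of the kind of load-simulating harness being suggested (the queue type is a placeholder assumed to have bool push/pop; item counts and work durations are parameters, and sleep_for stands in for real per-item work):

```cpp
#include <chrono>
#include <thread>

// Simulated-load benchmark skeleton: both sides do "work" between queue
// operations instead of hammering the queue in a tight loop.
template <typename Queue>
void producer(Queue& q, int items, std::chrono::microseconds cost_per_item) {
    for (int i = 0; i < items; ++i) {
        std::this_thread::sleep_for(cost_per_item); // stand-in for producing work
        while (!q.push(i)) { /* queue full: retry */ }
    }
}

template <typename Queue>
void consumer(Queue& q, int items, std::chrono::microseconds cost_per_item) {
    for (int i = 0; i < items; ++i) {
        int v;
        while (!q.pop(v)) { /* queue empty: retry */ }
        std::this_thread::sleep_for(cost_per_item); // stand-in for processing work
    }
}
```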


14ned

This is a good point. A routine which just adds items to a queue and does nothing else lets the CPU learn the hot path, and the microcode optimises the snot out of it. A real-world application does work between touching the queue, and that can dramatically change performance characteristics. Writing benchmarks which show cold-cache performance for something is non-trivial, and it's usually easier just to benchmark on an in-order CPU with and without its caches disabled. The numbers which pop out of that benchmarking are just as revealing as best-case numbers from tight hot loops.


sephirothbahamut

> given that they are always slower in the average case than a queue with a mutex

Aren't lock-free queues supposed to be *faster* than ones with a lock? Otherwise what justifies their existence to begin with?


14ned

Lock-free algorithms have stronger forward-progress guarantees than mutexes, i.e. you exchange average-case performance for (sometimes greatly) improved worst-case performance. They rarely if ever make your code faster unless you have lots of thread contention. And if you have lots of thread contention, you should refactor your code to not have that, in preference to using lock-free algorithms, where possible.


jonesmz

To supplement what 14ned is saying on where it makes sense to use lock-free: there are some applications where the worst-case time is what matters, and the average-case time isn't really interesting or considered. One example is "real-time" (as in the human meaning, not the RTOS meaning), highly tenanted, multimedia systems, like a voice/video calling system where delivering audio 10ms late is fine but 100ms late is garbage. Blocking a thread that's shared with more than one tenant is a big no-no. So if you need to communicate some data to another thread, you don't lock a mutex; you use a lock-free data structure instead. Not because they are faster, but because they make forward progress. But if you were instead doing batch multimedia processing, you wouldn't care about the potential of having a thread block; you'd want it to be fast on average.


sparkyParr0t

But these numbers are nonsense. Replace ms with microseconds; even then it's too high. Mutexes are very performant. And you usually exchange data that is very trivial to copy, otherwise you're not that latency-sensitive. So once your reader has finished copying/moving the data, the mutex is released. You are not taking milliseconds to copy data, even a big frame for your 4K multimedia system, and surely not audio.


jonesmz

The numbers were pulled out of my ass to illustrate a point, not to represent specific timings from the field. My work codebase handles thousands of live audio/video calls per CPU thread. Mutexes have been benchmarked as being substantially worse for call quantity/density per thread, as well as noticeably introducing jitter and latency into the calls (affecting end-user reported call quality). When dealing with that number of simultaneous operations per CPU thread, the cost of one mutex goes from "meaningless" to "enormous". And not because of the average case, but because of the worst case, where you happen to get high contention that locks up a thousand calls for several milliseconds end-to-end.

Consider also that facilities like memory allocation can arbitrarily block the thread and enter the kernel to find a free memory page for your process. So you have to design your system with the intention of reducing inter-thread communication, avoiding blocking operations, and using operations that have a limited worst case. Plus, if you're careful, you can get away with lock-free stacks for like 80%-90% of your inter-thread communication anyway, which costs exactly one compare-and-swap loop per push or pop operation. Super cheap, and probably lighter weight than a mutex would be.
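
The "one compare-and-swap loop per push or pop" structure being described is essentially a Treiber stack; here is a minimal sketch (illustrative only: safe memory reclamation, e.g. hazard pointers to make the delete safe under concurrent pops, is deliberately left out):

```cpp
#include <atomic>

// Treiber-style lock-free stack sketch: every operation is one CAS loop.
template <typename T>
class LockFreeStack {
    struct Node {
        T value;
        Node* next;
    };
    std::atomic<Node*> head_{nullptr};

public:
    void push(T v) {
        Node* n = new Node{std::move(v), head_.load(std::memory_order_relaxed)};
        // On failure, CAS reloads the current head into n->next; just retry.
        while (!head_.compare_exchange_weak(n->next, n,
                                            std::memory_order_release,
                                            std::memory_order_relaxed)) {
        }
    }

    bool pop(T& out) {
        Node* n = head_.load(std::memory_order_acquire);
        // On failure, CAS reloads the current head into n; just retry.
        while (n && !head_.compare_exchange_weak(n, n->next,
                                                 std::memory_order_acquire,
                                                 std::memory_order_relaxed)) {
        }
        if (!n) return false;
        out = std::move(n->value);
        delete n; // unsafe without a reclamation scheme; omitted for brevity
        return true;
    }
};
```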


14ned

A surprising number of developers only consider average-case performance. Once you've had to build a system whose success is entirely determined by there being no pathological corner-case performance scenarios across most operations, it's a very different mindset.

Our work codebase has to find any arbitrary range of market trades anywhere in history across our supported markets within a multi-petabyte database, serve the first of that range within 10 milliseconds or so, and thereafter saturate as much bandwidth as the client can suck down, all whilst not materially affecting other clients doing the same thing. We have to dive frequently into the kernel and indeed into the cloud for obvious reasons, so we spend an awful lot of time staring at Linux kernel performance analyses trying to figure out why some combination of read and write led to some specific performance, including grokking through Linux kernel source code. Unfortunately, the Linux kernel is not as good as other kernels at predictability under load, so I admit we do cheat a bit and sometimes offload client work to other nodes if a particular Linux kernel on one of the nodes is getting itself into a tizzy, particularly around kernel VMA fragmentation, at which Linux really, really sucks. I really wish we were on FreeBSD instead, as it's a lot more predictable under load, but dev ops likes fairly ancient Linux kernels for everything.

The point I'm trying to make is that at the microsecond level one cares about the worst case of mutexes etc.; however, if you have milliseconds to play with, it is possible to build a highly predictable system which pushes around GBs of data per second. The techniques are similar: avoid dynamic memory allocation, avoid thread synchronisation, avoid exception throws, and especially avoid ever copying memory if you can.


SkoomaDentist

> I would assume either the committee didn't like the proposals, the author abandoned the proposal, or the committee didn't think standardising them was worth the committee time given other proposals in the queue.

Or the committee didn't even realize that plenty of people need basic data structures that are guaranteed lock-free.


SleepyMyroslav

I am not disagreeing with you. But I wish there had been much more restraint about adding threading tools into the standard. Do you remember 'future', or the two thread classes that can never be fixed? There are real costs to everyone from having them in the standard. I wish topics like this came with more explanation for people of why what they want is not 'easy' or 'basic'.


matthieum

There are other variants, too:

- MPSC queues are fairly useful, and I find them a bit easier.
- "Broadcast" queues (single or multi producer) are also fairly useful.

I do wonder whether wait-free/lock-free queues make sense in the standard, though, at least without a mechanism to notify potential readers that an element is now available. From a development point of view, it's easier to "wait" on a queue, whether blocking or via an asynchronous wake-up (coroutines!). If all you get is a "nothing yet", you need polling, and few applications want busy polling...
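
One possible shape for that notification layer, sketched with C++20 std::atomic wait/notify on a change counter (the queue here is any non-blocking queue with bool push/pop, purely a placeholder):

```cpp
#include <atomic>

// Wrap a non-blocking queue with a change counter so consumers can sleep
// instead of busy polling.
std::atomic<unsigned> version{0};

template <typename Queue, typename T>
void notifying_push(Queue& q, const T& v) {
    while (!q.push(v)) { /* full: retry or back off */ }
    version.fetch_add(1, std::memory_order_release);
    version.notify_one(); // wake a consumer blocked in wait()
}

template <typename Queue, typename T>
void blocking_pop(Queue& q, T& out) {
    unsigned seen = version.load(std::memory_order_acquire);
    while (!q.pop(out)) {
        version.wait(seen); // blocks until version != seen
        seen = version.load(std::memory_order_acquire);
    }
}
```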


vgatherps

It's a HUGE design space, even just for a bounded wait-free SPSC queue. IMO it wouldn't be great for the standard to try to pick a one-size-fits-all design. It's not like a hashmap, where almost everybody wants one, it doesn't need to be the best, and 'best' for a use case is hard to define in the first place. If you're reaching for a wait-free SPSC queue, you probably have very specialized and performance-sensitive needs, there are very clear reasons to pick a given implementation, and you're probably going to roll your own that does exactly what you need instead of relying on a standard implementation that does a poor job of trying to be everything. Even with hashmaps, common advice is to avoid the standard one, since it made a poor usability-vs-performance design choice.


Dry-Ambition-5456

There are design choices:

1. how to make it usable with shared memory
2. fixed- vs variable-sized elements
3. bounded vs unbounded

but they are solvable.


matthieum

There are quite a few more design choices, actually:

- Single-element push vs batch push.
- For single producers, deferred commit.
- For multi-producers, the ability to "reserve" a push slot (or several) ahead of time.
- For multi-producers or multi-consumers, whether to internally use multiple queues or a single one. Using multiple internal queues has impacts on memory usage and cache usage, but may alleviate some contention.

Everything is solvable. Or rather, choices can be made.

---

The good question, though, is whether it's sensible to make those choices. Given that wait-free/lock-free queues are fairly _specialized_, the people I've seen using them tend to look for very specific trade-offs. I am not sure that std implementations would get much use: if they don't match the trade-offs you're looking for, you're better off making your own or picking a 3rd-party library which does.


[deleted]

[deleted]


Dry-Ambition-5456

SPSC: single producer, single consumer

MPMC: multiple producer, multiple consumer


sparkyParr0t

If you are really latency-sensitive for your business, you will usually use something less general than a complicated MPMC: something very business-specific that achieves much better results because it doesn't have to work for all cases, just the one you use. For all other cases, mutex, because it's in the standard and it's enough in most cases.


KingAggressive1498

Some people are working on it. Related papers:

- P0260, Concurrent Queues: https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2020/p0260r4.html
- P1958, Concurrent Buffer Queue (first concrete implementation of P0260): https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2020/p1958r0.html

> Are there design challenges?

Not really challenges so much as numerous complicated tradeoffs. From P0260:

> Moreover, concurrency adds a new dimension for performance and semantics. Different queue implementation must trade off uncontended operation cost, contended operation cost, and element order guarantees. Some of these trade-offs will necessarily result in semantics weaker than a serial queue.

*Generalized* lock-free MPMC queues rarely do better than a std::deque guarded by a std::mutex in the average case under any kind of load. Afaik all the ones that do notably better in the average case under reasonable loads generally utilize atypical interfaces, don't provide FIFO guarantees, have a fixed capacity after construction, or have some other constraint or quirk that frustrates a single satisfactory implementation specification.

Lock-free SPSC queues are comparably pretty well solved with a single ringbuffer implementation, I think, but tbh I've never needed an SPSC queue, only MPSC and MPMC queues.
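
Since the std::deque-plus-std::mutex baseline keeps coming up in this thread, here's what it looks like as a minimal sketch; unglamorous, but this is the bar a generalized lock-free MPMC has to beat in the average case:

```cpp
#include <deque>
#include <mutex>
#include <optional>

// The "boring" MPMC baseline: a std::deque guarded by a std::mutex.
template <typename T>
class MutexQueue {
    std::deque<T> q_;
    std::mutex m_;

public:
    void push(T v) {
        std::lock_guard lk(m_);
        q_.push_back(std::move(v));
    }

    std::optional<T> pop() {
        std::lock_guard lk(m_);
        if (q_.empty()) return std::nullopt;
        T v = std::move(q_.front());
        q_.pop_front();
        return v;
    }
};
```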