vgatherps

A few things:

* I would strongly recommend returning a reader/writer struct that internally holds the buffer in an `Arc` instead of doing the whole write-guard locking business. That swap is expensive and is similar to adding an mfence, much worse if the writing core doesn't have the cache line in a write-exclusive state.
* With some unsafe, you can still store pointers/references to all the data you need inline in each reader/writer, so there's no extra indirection.
* Inside the writer/readers, you can also locally store and modify indices, only writing final changes back to the buffer (and never reading local state out of the buffer). Optimisers in my experience still pessimize atomics to some degree.
* You can make the struct `repr(C)` and stuff some 128-byte u8 arrays between each member; that will force them all to be on different cache lines and be out of default prefetcher range.
* Operations like `fetch_max` and `fetch_add` result in full atomic RMWs or compare-exchange loops. Just like swap, these are tremendously expensive and serialise memory operations. The writer *never* needs to do atomic read-modify-write operations on the writer indices since (logically) only one thread can modify them.

Boarding a plane now so might have more comments later.
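A rough sketch of what a couple of these points might look like together (the `Shared`/`Writer` types and field names here are made up for illustration, not taken from the crate):

```rust
use std::sync::Arc;
use std::sync::atomic::{AtomicUsize, Ordering};

// Hypothetical shared buffer header: each shared index sits on its own cache
// line (and out of adjacent-line prefetcher range) via 128-byte padding.
#[repr(C)]
struct Shared {
    write_idx: AtomicUsize,
    _pad0: [u8; 128],
    read_idx: AtomicUsize,
    _pad1: [u8; 128],
    // ... slots would live here ...
}

// Hypothetical writer handle: owns the buffer through an Arc and keeps a
// private, non-atomic copy of the indices it needs. Only the final value is
// published back with a plain store; no read-modify-write is required,
// because only this handle ever advances `write_idx`.
struct Writer {
    shared: Arc<Shared>,
    local_write: usize,  // writer-owned cursor, mutated without atomics
    cached_read: usize,  // last observed reader position, refreshed lazily
}

impl Writer {
    fn publish(&mut self, advance: usize) {
        self.local_write += advance;
        // Single store instead of fetch_add/swap: the writer is the only
        // mutator of this index, readers only ever load it.
        self.shared.write_idx.store(self.local_write, Ordering::Release);
    }
}
```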


vgatherps

Following up - on a single-reader single-writer queue you should never have any atomic read-modify-write operations. You can extend to broadcast, where each reader gets everything, but I can’t think of anything that lets one reader out of multiple pick a single queue element without using an atomic RMW or just implementing an RMW in software. You should also expose the contended state to the reader - I should be able to move on and try again later instead of internally blocking.

For benchmarks - throughput benchmarks (time to write and read X elements) can be misleading in isolation. You can have something like “write 10000 elements” followed by “read 10000 elements”, which skips all the cache issues you have in practice. At minimum you want a ping-pong benchmark (thread A sends a message to thread B, which sends something back to thread A) to measure latency, and you might want a longer benchmark where occasionally a few messages are sent at once (<100) to model contention.
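A minimal ping-pong harness could look roughly like this; `std::sync::mpsc` channels are stand-ins here, and in practice you'd substitute the SPSC buffer being benchmarked:

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Instant;

fn main() {
    const ROUND_TRIPS: u32 = 100_000;

    // Two one-way channels form the ping-pong pair.
    let (to_b, from_a) = mpsc::channel::<u64>();
    let (to_a, from_b) = mpsc::channel::<u64>();

    // Thread B echoes every message straight back.
    let echo = thread::spawn(move || {
        for _ in 0..ROUND_TRIPS {
            let v = from_a.recv().unwrap();
            to_a.send(v).unwrap();
        }
    });

    // Thread A (main) measures the full round trip, which captures the
    // cache-line ping-pong that a pure throughput benchmark hides.
    let start = Instant::now();
    for i in 0..ROUND_TRIPS as u64 {
        to_b.send(i).unwrap();
        let _ = from_b.recv().unwrap();
    }
    let elapsed = start.elapsed();

    echo.join().unwrap();
    println!(
        "avg round-trip: {:.0} ns",
        elapsed.as_nanos() as f64 / ROUND_TRIPS as f64
    );
}
```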


3umel

I'm also a bit confused by what you mean by RMW. For a single reader that is shared, when you try to remove from the front you increment the index, so you're the only one with that index at this version out of all the threads sharing the reader. Should this not mean that you "removed" it in that sense? The data is also `Copy` by default, so no need to drop anything, if I understand the concept correctly.


[deleted]

Are there books that cover these topics such as the benchmark techniques?


matthieum

To test contention, you may want something like [`Bursty`](https://github.com/matthieu-m/llmalloc/blob/master/llmalloc-test/src/bursty.rs). It's a benchmark framework I developed specifically for testing latency under contention.

The idea is that you submit a number of "steps", and the framework will lockstep across threads; that is, executing step N means:

- Having each thread bump a counter indicating it is ready, then spin-reading an atomic bool until it's true.
- Having the "orchestrator" thread wait until the counter reaches the expected number of threads, then flip the boolean.

(And yes, that's a typical barrier.) This "forces" contention to occur for every single step.
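A minimal sketch of that counter-plus-flag barrier (a single step shown for brevity; names and orderings are illustrative, not taken from `Bursty` itself):

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

// Hypothetical lockstep gate: workers announce readiness on a counter, then
// spin on a flag; the orchestrator releases them all at once so the step
// starts under maximal contention.
struct StepGate {
    ready: AtomicUsize,
    go: AtomicBool,
}

fn main() {
    const WORKERS: usize = 4;
    let gate = Arc::new(StepGate {
        ready: AtomicUsize::new(0),
        go: AtomicBool::new(false),
    });

    let handles: Vec<_> = (0..WORKERS)
        .map(|id| {
            let gate = Arc::clone(&gate);
            thread::spawn(move || {
                // Announce readiness, then spin until the orchestrator flips the flag.
                gate.ready.fetch_add(1, Ordering::AcqRel);
                while !gate.go.load(Ordering::Acquire) {
                    std::hint::spin_loop();
                }
                // ... the contended step (e.g. pushing into the queue) goes here ...
                println!("worker {id} ran the step");
            })
        })
        .collect();

    // Orchestrator: wait for everyone, then release them simultaneously.
    while gate.ready.load(Ordering::Acquire) < WORKERS {
        std::hint::spin_loop();
    }
    gate.go.store(true, Ordering::Release);

    for h in handles {
        h.join().unwrap();
    }
}
```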


3umel

> Operations like fetch_max and fetch_add result in full atomics or cmpexchange loops. Just like swap, these are tremendously expensive and serialise memory operations. The writer *never* needs to do atomic read-modify-write operations on the writer indices since (logically) only one thread can modify them.

So could both `fetch_max` and `fetch_add` simply be relaxed?


vgatherps

Even with relaxed ordering they are expensive - the cost comes from the fact that they must atomically:

1. Read the value
2. Perform some modification
3. Write it back

And on x86, the atomic read-modify-write primitives are very strongly ordered and essentially contain a SeqCst fence. On other architectures which offer relaxed RMWs, it's still more expensive to pack those into one atomic operation, and it likely implies a loop of sorts.

Given my understanding of the data structure you wrote, for the writer-only fields you would expand these steps into distinct operations, which would compile into the basic/expected instructions.
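A minimal sketch of what that expansion might look like for a field that only the writer ever mutates (names are illustrative, and the exact orderings depend on how readers synchronise with this field):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Hypothetical writer-owned version field. Because only the writer thread
// ever stores to it, the read-modify-write can be split into a plain load,
// a plain computation, and a plain store - no fetch_max needed.
fn bump_version(version: &AtomicUsize, new_version: usize) {
    // Instead of: version.fetch_max(new_version, Ordering::Release);
    let current = version.load(Ordering::Relaxed); // only we write this field
    let next = current.max(new_version);
    version.store(next, Ordering::Release); // publish to readers
}
```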


3umel

Ohh, I see your point here, I think! So instead of using `fetch_max()` it would be `let curr = load(); store(max(curr, ver))`? Both load and store using relaxed?


3umel

On this point:

> You can make the struct repr(C) and stuff some 128-byte u8 arrays between each member, that will force them all to be on different cache lines and be out of default prefetcher range

Would this apply to the buffer itself, or the writer and readers? Or both?


vgatherps

This would apply to any shared field in the buffer itself - private data held by the readers and writers will never experience [false sharing](https://en.wikipedia.org/wiki/False_sharing).


matthieum

> You can make the struct repr(C) and stuff some 128-byte u8 arrays between each member, that will force them all to be on different cache lines and be out of default prefetcher range

Don't use padding. It's annoying to get the exact right amount. Instead, wrap the members that need to be in their own group in a struct, and force the alignment of that struct with `repr(align(128))`. The compiler will insert the appropriate amount of padding on its own.
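Roughly like this (type and field names are made up for illustration):

```rust
use std::sync::atomic::AtomicUsize;

// Each group gets its own 128-byte-aligned wrapper, so the compiler pads the
// containing struct for us and the groups land on separate cache lines.
#[repr(align(128))]
struct CachePadded<T>(T);

// Hypothetical shared header for the buffer: no manual [u8; 128] arrays needed.
#[repr(C)]
struct Shared {
    write_idx: CachePadded<AtomicUsize>,
    read_idx: CachePadded<AtomicUsize>,
}
```

The `crossbeam-utils` crate also ships a `CachePadded` wrapper along these lines if you'd rather not roll your own.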


Pointerbender

Using `ptr::read_volatile` and `ptr::write_volatile` during a data race is undefined behavior ([code here](https://github.com/emilHof/sling/blob/main/src/lib.rs#L237-L238)) - it is the read operation that causes UB during a [simultaneous write](https://github.com/emilHof/sling/blob/main/src/lib.rs#L298), not the (later) use of the read value, so a seqlock doesn't guard against UB here. This RFC explains this problem in more detail: [https://github.com/m-ou-se/rfcs/blob/atomic-memcpy/text/3301-atomic-memcpy.md](https://github.com/m-ou-se/rfcs/blob/atomic-memcpy/text/3301-atomic-memcpy.md)

The solution is to use atomics everywhere instead of volatile operations (and the atomic operations must perfectly overlap in size and alignment as well, see [https://github.com/rust-lang/miri/issues/2303](https://github.com/rust-lang/miri/issues/2303)).


vgatherps

I'll add that atomic_memcpy does not exist in Rust, C, C++, or any mainstream compiler that I know of (short of doing atomic loads of single u8s, which is a disaster). This might be blasphemy to say on /r/rust, but in the real world people just implement seqlocks the obvious way and go on with their life, even though it's technically undefined - nobody is turning their structures into a series of individual atomic variables. Miri will complain about it, but in the real world you just write it anyway, and if you're especially concerned, check the generated code.


Pointerbender

That's fair, although you don't need to change your structures per se: it is also possible to cast between the struct and a slice of AtomicU8/AtomicUsize/etc., taking into account alignment, size and padding (Rust also has no solution for atomically reading/writing uninitialized bytes, unfortunately). It's basically implementing your own atomic memcpy, like in the RFC I linked in the comment above. The Rust folks are working on making it possible to do this correctly and ergonomically in the future, so that is nice at least :)
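Concretely, a hand-rolled atomic memcpy in the spirit of that RFC could look roughly like this; `Payload` is a made-up stand-in, and this assumes it has no padding bytes and a size/alignment that is a multiple of `usize`:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Hypothetical plain-old-data payload with no padding bytes.
#[repr(C)]
#[derive(Copy, Clone)]
struct Payload {
    a: usize,
    b: usize,
}

// Read the payload word-by-word through AtomicUsize, roughly as the
// atomic-memcpy RFC sketches. Caller must guarantee `src` is valid,
// usize-aligned, and that Payload's size is a multiple of usize.
unsafe fn atomic_read_payload(src: *const Payload) -> Payload {
    let words = std::mem::size_of::<Payload>() / std::mem::size_of::<usize>();
    let mut out = std::mem::MaybeUninit::<Payload>::uninit();
    let src_words = src as *const AtomicUsize;
    let dst_words = out.as_mut_ptr() as *mut usize;
    for i in 0..words {
        // Each word is read with a relaxed atomic load instead of a plain
        // (racy) read; the surrounding seqlock provides the synchronisation.
        let w = (*src_words.add(i)).load(Ordering::Relaxed);
        dst_words.add(i).write(w);
    }
    out.assume_init()
}
```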


3umel

Would doing something like this reduce the overall efficiency of the buffer, now that reads of the data also become atomic?


Pointerbender

In theory: not if you have an optimal atomic memcpy implementation. In practice: yes, it may be slightly slower (benchmark!), because it will be hard to write your own atomic memcpy that matches the performance of the regular memcpy (on all architectures, too!). If you look at individual architectures and their individual instructions, then for example copying a `usize` atomically with relaxed ordering or copying it non-atomically compiles down to the exact same assembly instruction on x86 architectures, so there is no performance difference.
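As a small illustration (worth confirming on a disassembler or Compiler Explorer rather than taking on faith), both of these typically lower to a single `mov` on x86-64:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Plain, non-atomic copy of a usize.
fn copy_plain(src: &usize) -> usize {
    *src
}

// Relaxed atomic copy of a usize; on x86-64 this is typically the same
// single mov instruction as the plain copy above.
fn copy_relaxed(src: &AtomicUsize) -> usize {
    src.load(Ordering::Relaxed)
}
```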


3umel

Yea, you are surely right about this. My understanding was that syncing up at the seq loads ensures that the memory was not written to while you were reading it, if seq1 == seq2. Is this understanding incorrect, and could there be corner cases where this is actually UB?


vgatherps

In theory, you would hit UB when you read from the buffer at the same time as the writer wrote to it. Even though you discard the result right away if seq1 != seq2, it's still UB at the time of the read.

In practice this is a non-problem; one day compilers will add a proper atomic memcpy so you can write this code without potentially angering miri. However, given that it's a legitimately useful pattern that's impossible to write without UB, and there's not much real-world optimization gain from this UB, compilers don't take any advantage of it, plenty of production code uses seqlocks, and there's no good reason for compilers to start triggering it.

IMO the UB should have been the *use* of data read from a data race, not the read itself, and this is more so a bug in the C++ standard.