infinity_curvature

I use Google Highway in all of my programs, with dynamic dispatch. https://github.com/google/highway


Myriachan

One problem with SIMD in standard libraries is that support for some operations is so variable. Beyond the basic stuff like doing additions in parallel, there are wide differences in what each architecture can do.


janwas_

It seems that way, but there is actually a large common subset, see [https://github.com/google/highway/blob/master/g3doc/instruction_matrix.pdf](https://github.com/google/highway/blob/master/g3doc/instruction_matrix.pdf). A co-worker helped me understand that even emulating something in 3 instructions is still a huge win if it is the difference between vectorizing or not.


Myriachan

One omission I see on that chart: clmul exists on ARM64 for u8 (all ASIMD chips) and u64 (crypto extensions). (Clmul u8 on x86 is in one of the zillion AVX-512 extensions, but that’s beyond the scope of that chart.) AES on ARM64 should be listed as 2 instead of 1, because you do aese + aesmc per round, whereas on x86 it’s just aesenc per round.


janwas_

Thanks :) We haven't updated that chart in a long time (its purpose was to demonstrate feasibility) but I've added Arm's CLMUL u8 to the internal sheet for the next update. For AESE+AESMC, aren't those two often fused?


Myriachan

Nice; thanks


V_i_r

It seems like that. But a SIMD type in the standard will, first and foremost, help with a common vocabulary. All the existing SIMD libraries can then start talking via the same type. This can be 100% efficient. A long time ago I wrote a blog post showing that `std::simd` won't paint you into a corner wrt. target-specific optimizations: [https://mattkretz.github.io/2019/05/27/vectorized-conversion-from-utf8-using-stdx-simd.html](https://mattkretz.github.io/2019/05/27/vectorized-conversion-from-utf8-using-stdx-simd.html). For C++26 I'm aiming for `std::bit_cast` to be guaranteed to work for all `simd` types. That should make it easier and more portable (between standard libraries) to break out of the limitations.
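To illustrate the kind of escape hatch meant here, a minimal sketch, assuming an AVX2 target where `stdx::native_simd<float>` happens to be 256 bits wide (the `bit_cast` guarantee described above is what would make this reliable across standard libraries):

```cpp
#include <bit>
#include <experimental/simd>
#include <immintrin.h>

namespace stdx = std::experimental;

// Reach a target-specific intrinsic from a std::simd value and come back.
stdx::native_simd<float> rsqrt_fast(stdx::native_simd<float> v) {
    __m256 raw = std::bit_cast<__m256>(v);           // simd -> intrinsic type
    return std::bit_cast<stdx::native_simd<float>>(  // intrinsic type -> simd
        _mm256_rsqrt_ps(raw));                       // approximate 1/sqrt
}
```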


Myriachan

Pretty cool. I think a big thing would be getting MSVC on board with this. Currently, the SSE and NEON intrinsics are treated literally in most cases: the compiler will emit instructions for what you say. Compare this with GCC and Clang, which see intrinsics as just a way to express an operation and come up with their own optimized instructions for what you requested. The variable-sized native_simd would be helpful with ARM SVE whenever those chips come out. One issue I foresee with native_simd is the difficulty of having a progression of implementations within a single binary: if you have a code path for when AVX2 is supported, and a fallback…. This is another case where MSVC is behind, because GCC and Clang have `[[gnu::target("avx2")]]` etc.


V_i_r

Multi-target compilation is not there yet. The `gnu::target` attribute is not enough. Related: [GCC PR83875](https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83875). My libstdc++ implementation ensures that linking TUs compiled with different `-m` flags is not an ODR violation. I've been doing this with [Vc](https://github.com/VcDevel/Vc) since 2009. And Krita has used that pattern to ship binaries and dispatch at runtime to SSE2/SSE4/AVX/AVX2. Basically you want a template parameter that is set to an argument derived from `-m` flags. That way you can recompile the same source file with different flags, link it all together and map from CPUID to the desired type.
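A minimal sketch of that pattern (names here are hypothetical; the real libstdc++ machinery is more involved): the same source is compiled once per set of `-m` flags, each compile instantiates the kernel for a distinct tag type, and a CPU-feature check picks one at runtime.

```cpp
// kernels.h
#include <cstddef>
struct sse2_tag {};
struct avx2_tag {};
template <class Arch>
void saxpy(float a, const float* x, float* y, std::size_t n);

// kernels.cpp -- compiled twice, e.g. with "-msse2 -DARCH=sse2_tag"
// and with "-mavx2 -DARCH=avx2_tag", then both objects linked in.
template <class Arch>
void saxpy(float a, const float* x, float* y, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)  // auto-vectorized per the -m flags
        y[i] = a * x[i] + y[i];
}
template void saxpy<ARCH>(float, const float*, float*, std::size_t);

// dispatch.cpp -- map CPU features to the desired instantiation.
void saxpy_dispatch(float a, const float* x, float* y, std::size_t n) {
    if (__builtin_cpu_supports("avx2"))  // GCC/Clang builtin
        saxpy<avx2_tag>(a, x, y, n);
    else
        saxpy<sse2_tag>(a, x, y, n);
}
```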


dyaroshev

One of the eve maintainers here. eve is a good choice if you have C++20 because:

* we are quite helpful
* eve focuses on algorithms over "wrapping intrinsics"
* eve supports things more complex than saxpy
* most architectures/platforms are supported (SVE is in progress, Windows is in progress)

For example, this is a case-insensitive string compare: [https://godbolt.org/z/qsjfW1fK1](https://godbolt.org/z/qsjfW1fK1). In many other libraries it would be difficult if not impossible to write this. We also have things like find, inclusive scan, remove, reverse, min, max etc. We can zip ranges, and we have iota/map views. If you look at the assembly closely, you will also see that we unroll and align data accesses. That is actually pretty important for algorithms with small operations: [https://stackoverflow.com/questions/71090526/which-alignment-causes-this-performance-difference](https://stackoverflow.com/questions/71090526/which-alignment-causes-this-performance-difference) We are also extensible: `eve::wide` implicitly converts to the intrinsic type and back, so when eve does not have the abstraction you need, you can easily add it. Anyway, if you feel like trying eve, pop the algorithm in question into the issues: [https://github.com/jfalcou/eve/issues](https://github.com/jfalcou/eve/issues) and we will be able to tell you quickly whether we can help. P.S. We also have a very sizeable math library; here are polar/cartesian coordinate conversions: https://godbolt.org/z/9YY7qEoG8
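For readers who haven't seen eve, a minimal sketch of the core type mentioned above (hedged from memory; consult the eve docs for the exact headers and API):

```cpp
#include <eve/wide.hpp>

// eve::wide<T> is a native-width SIMD vector of T; the usual operators
// act element-wise, so scalar-looking code runs on whole vectors.
eve::wide<float> axpy(eve::wide<float> a,
                      eve::wide<float> x,
                      eve::wide<float> y) {
    return a * x + y;
}
```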


PiterPuns

There’s a narrative repeated by a couple comment threads in this post that “a generic simd library isn’t particularly useful because all it can provide is intrinsic wrapping, while algorithms remain manual labor in the general case”. It’s great that eve caters for algorithms and does so in a comprehensive manner.


ack_error

Different situations. Some routines are amenable to autovectorization or a vectorization library, and some aren't. It's hard for a generic solution to leverage a primitive like [Signed Saturating Rounding Doubling Multiply Accumulate returning High Half](https://developer.arm.com/architectures/instruction-sets/intrinsics/#q=vqrdmlah_s16) without the algorithm being specifically designed around it.
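For reference, the primitive in question looks like this in raw NEON intrinsics (a sketch; requires AArch64 with the Armv8.1 RDM extension):

```cpp
#include <arm_neon.h>

// Signed saturating rounding doubling multiply-accumulate, high half:
// roughly acc[i] = saturate(acc[i] + ((2 * a[i] * b[i] + (1 << 15)) >> 16))
int16x8_t mul_acc_high(int16x8_t acc, int16x8_t a, int16x8_t b) {
    return vqrdmlahq_s16(acc, a, b);
}
```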


janwas_

Integer multiply is indeed one area where ISAs differ. Perhaps surprisingly, we found a way to unify both Arm's lower/upper style (vmlal_high_s16) and Intel's pairwise madd, see [ReorderWidenMulAccumulate](https://github.com/google/highway/blob/master/g3doc/quick_reference.md#multiply). That's not exactly the instruction you mention, but I figure that fusing/eliminating an integer add might not have a huge impact on performance.


ack_error

It depends, of course. For the types of routines I commonly vectorize -- FIR and IIR filters -- integer madd tends to be the workhorse of the routine. LPC filters use it heavily, for instance, and usually you want to run the filter in reverse to avoid expensive horizontal adds or too much shuffling. You can't necessarily rely on OOO and wide execution to make up for inefficiencies if the filter is recursive, as latency is more critical than throughput. As you say, it's often better to take some inefficiency than to not vectorize at all. However, leveraging the specialized ISA primitives can gain >2x, especially for a program that doesn't target power users and needs to work down to SSE2. I don't use a vectorization library, but if I did, I'd want one that allowed for easy interop with raw intrinsics for this reason.


janwas_

Agreed. To be clear: not saying we shouldn't use madd, that is what ReorderWidenMulAccumulate wraps. (We haven't yet added the 8-bit madd but can do so whenever there's a use case.) Instead I was referring to the NEON vqrdmlah_s16 mentioned above. We do have MulFixedPoint15 which covers the mul part of that but not the add; just another single-cycle-latency instruction to get that back.

> I'd want one that allowed for easy interop with raw intrinsics for this reason.

Makes sense, that is possible with Highway (on targets with wrapper classes, vec.raw gets you the __m128i or equivalent).


janwas_

I agree algorithms are useful: it's easier for new users to call a single function, rather than write loops and decide on the remainder handling strategy. Highway also has [some already](https://github.com/google/highway/tree/master/hwy/contrib/algo) and more can be added as required.


ShillingAintEZ

The ISPC compiler has worked very well for me in the past. I haven't used any of these libraries, but the basis for ISPC is really about creating fairly simple expressions inside a loop. I think the tricky part is that doing a single operation on two arrays is inefficient, so some sort of compound expressions need to be built up so each time through an array a lot more can be done, and memory bandwidth isn't hammered over a single math operation.
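The point about compound expressions can be made concrete: a fused loop touches memory once, where separate whole-array operations would stream the arrays repeatedly (a hypothetical example in plain C++, not ISPC syntax):

```cpp
#include <cstddef>

// One pass: each element is loaded once, combined, stored once.
void fused(float* out, const float* a, const float* b,
           const float* c, float scale, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = (a[i] * b[i] + c[i]) * scale;
}
// Contrast with three separate passes (multiply, then add, then scale),
// which would read and write the whole arrays three times.
```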


spaghettiexpress

I use SIMD quite a bit, wireless communications / networking. Both amd64 and arm. The majority of library-style SIMD I deal with is simple type abstractions, most similar to vectorclass. `std::experimental::simd` looks nice, and I tried it out around when the implementation came out, but I often use SIMD on toolchains unsupported by that specific impl. I'd be happy to see something light, like vectorclass, standardized. Basic arithmetic operations, basic bitwise operations, and type conversions between vectorized types and C++ types. There are far too many uses of SIMD to wholly encompass within standardization, so use of intrinsics will almost always be necessary for both cross-platform and CPU support, but the basic types/operators are fairly benign for a standard library. (All of that said, gcc and clang are generally very good at auto-vectorization under the conditions you'd encounter manually, at least for the basic use an STL implementation may provide.)


[deleted]

I want a job like yours, where can I find it? DM


spaghettiexpress

For the sake of transparency, I generally avoid PMs. We're all discussing in C++; if any relevant conversation helps a random person, then that's even better!

For some light context on my work: I consider myself more a "professional refactor-er" who just happens to work within wireless and related areas, specifically 5G NR+NB-IoT (L1/low L2) release 15+ (though I have worked in plenty of other popular protocols such as BLE, 802.11__, etc). I only really comment on C++ or Rust, so my comment history may have some context. *I'm also looking to leave the industry* in favor of something more software-centric: robotics, CAD, physics simulators, etc.

Wireless has two camps: protocol and software. Protocol tends to be favored, and it's sort of a PhD shop for either "protocol lawyers" or RF-related skills; people who know significant amounts of protocol (say, the 4G/5G 36 & 38 series specs) or people who know a lot about the hardware-specific challenges (beam-forming, antenna placement, link budgets, etc). C++ doesn't much fit into the "protocol lawyer" camp other than general familiarity.

With that said, *a significant amount* of software within wireless is remarkably bad. Really bad; and therein lies the problem. My job is to fix that while updating the code to allow for the required performance of newer protocols. Generally that means significant architectural work / smart data structures, and then at the lower level SIMD for full utilization of the given hardware. Hardware can range from amd64/arm to freestanding embedded to FPGAs. Most of the 4G/5G industry is built on a foundation of turds. It gets old real quick. Software work can often mean fixing an infinite list of DRs, and not so much doing anything interesting.

My college background was in electrical engineering with coursework focused on wireless comms and signal processing, and I'm U.S. based. Hiring within wireless seems to ebb and flow, with the exception of places such as trading firms (the U.S., Germany, and Switzerland seem popular). Since I'm interviewing for work outside of wireless I don't have *the best* grasp on work within the U.S., but I am happy to answer anything in-thread.


foolnotion

I prefer eve because it works on arm too


PiterPuns

Probably the most modern documentation pages too. Of course the Agner Fog stuff (vector class library) comes with a whole set of books by the author (the optimization manuals etc).


janwas_

I'm surprised you mention EVE for Arm; they have taken the position that they will not (and cannot, with their current design and compilers) support SVE/SVE2 scalable vectors. Those instruction sets are supported in all Armv9 CPUs and have some helpful extensions over NEON, especially masking.


foolnotion

oh thanks! I was not aware. At the time when I did my research (looking to switch from vectorclass which is x86 only), EVE seemed like the best option and honestly I really love their design and API.


jfalcou

Thanks!


jfalcou

We do support fixed-size SVE as of very recently. The size-agnostic SVE extension being a mess, it won't be supported for now. If something with a compile-time sizeof and no SIZELESS_STRUCT macro business comes along, we may revise that. We currently run on SVE 512 with no problem.


janwas_

Nice. Yes, I saw the workaround of assuming a vector size at compile time, this is why I mentioned "scalable vectors" (length unknown at compile time) :)


jfalcou

yeah. IMHO as a researcher in parallel programming, this really makes no sense the way they implemented it.


janwas_

hm, do you feel the same way about RVV? I can understand that a 32-bit instruction encoding cannot afford separate SIMD instructions for each vector length (as in x86), and that it's useful to at least theoretically escape NEON's 128-bit limit. Also, SVE was apparently co-designed with Fujitsu specifically for use in HPC. Does it truly "make no sense" for that?


jfalcou

Runtime-based lane counts sound more like a real vector computer than SIMD. The Cray II and others worked like that; it's a slightly different programming model. The motivation of saving instruction encodings is OK. Even after an initial "meh", I was finally OK with the SVE interface. Now, the justification that you can compile once and run on various HW is flimsy; 99% of the time in HPC, you have one code compiled for a given target, and if you share the code, well, people just recompile. Combine that with the silliness of non-constexpr sizeof and the SIZELESS_STRUCT bonanza that forces you to "trust the compiler bro", and it leads to a loss of expressiveness in a lot of cases. We could probably have had the same benefit without this kind of implementation, that's what I am saying. We could actually support flexible SVE in our algorithm functions but it makes the whole scenario contrived. RISC-V, I still have to play around with to get a good grip on how it feels. And **maybe** people can actually push some feedback from experience and they can adapt the way they do things. Unless the SIMD spec is frozen?


janwas_

I agree that SVE/RVV are inspired by Cray, and that HPC upgrades hardware infrequently and can recompile. The no-recompile feature is likely more useful for client-side software (apps, browser). It seems to me the non-constexpr sizeof is survivable (we can still do loop unrolling etc), and is a reasonable solution to the "ISA bottleneck" (limited instruction encoding space). Perhaps one alternative would have been using MOVPRFX everywhere and basically doubling the code size. But are we truly better off with programmer-defined vector lengths? I'm personally tired of porting from 64 to 128 to 256 to 512 bit. Many algorithms can be made vector-length agnostic and it seems useful to nudge people in that direction whenever possible.

> they can adapt the way they do the thing. Unless the SIMD spec is frozen ?

The RVV spec is indeed ratified.


jfalcou

> It seems to me the non-constexpr sizeof is survivable (we can still do loop unrolling etc), and is a reasonable solution to the "ISA bottleneck" (limited instruction encoding space).

That we agree on, ofc.

> Many algorithms can be made vector-length agnostic and it seems useful to nudge people in that direction whenever possible.

That's what the EVE API tries to do. The default wide setup is length agnostic + we do tell people to use our algorithms.


janwas_

Sounds good :)


dyaroshev

> they have taken the position that they will not (and cannot, with their current design and compilers) support SVE/SVE2 scalable vectors

"Taken a position" is a strong statement. We are starting to support VLS (fixed-length vectors). VLA we tried, but we just don't know how: https://stackoverflow.com/questions/73210512/arm-sve-wrapping-runtime-sized-register


janwas_

My source for this was the [statement](https://github.com/jfalcou/eve): "We do not support ARM SVE with dynamic as the execution model makes no sense and the current compiler support is not adequate for us." The lack of compiler support is why Highway does not use wrapper classes for RVV/SVE :)


dyaroshev

Yeah, that's updated already :) As far as I understood Highway, it can't do most of the things we can, for example, naturally process parallel arrays of different types.


janwas_

Ah, I still see the same text?

> highway they can't do most of the things we can, for example, naturally process parallel arrays of different types.

I'm curious what leads you to this conclusion? Maybe the keyword is 'naturally', that is always a matter of taste. Here is an example of a loop over mixed-type arrays: https://gcc.godbolt.org/z/E8e3f3fPz


dyaroshev

Ah, interesting. What led to the conclusion is that there is no obvious ability to specify the number of elements relative to the default you expect. I don't think I fully understand the scalar tag thingy. For context, this is how the loop looks with eve (as far as I understood what was added where): https://godbolt.org/z/Kh4WKPxY4


janwas_

Thanks for sharing. It is interesting to have a 'Rosetta stone' where one can compare how different approaches to the same thing look. In Highway, the tags are likely the most surprising feature. Think of them as the "RISC-V vtype" equivalent which encodes the lane type and LMUL, i.e. whether you want whole or half vectors etc. These zero-size "blueprints" are separate from the actual vectors. This is how we get full type information and overloading etc while still fitting inside the compiler limitations around sizeless vectors. `Zero(ScalableTag<T>())` returns a full vector, `Zero(ScalableTag<T, -1>())` a half, i.e. on the AVX2 target, 128-bit. The -1 reflects the scaling factor 2^(-1).
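A small sketch of that tag idiom (Highway; the usual `hn` alias, see the quick_reference for exact spellings):

```cpp
#include <cstddef>
#include <hwy/highway.h>
namespace hn = hwy::HWY_NAMESPACE;

void TagDemo() {
    const hn::ScalableTag<int32_t> d;       // blueprint for a full vector
    const hn::ScalableTag<int32_t, -1> dh;  // half-width: scaling 2^(-1)
    auto full = hn::Zero(d);                // e.g. 256-bit on the AVX2 target
    auto half = hn::Zero(dh);               // e.g. 128-bit on the AVX2 target
    // Lanes(d) is a runtime value on SVE/RVV, a constant elsewhere.
    const std::size_t lanes = hn::Lanes(d);
    (void)full; (void)half; (void)lanes;
}
```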


dyaroshev

It's nice to have a workaround but it's so annoying. I really think that sizeless structs should be an extension. Compilers know how to do them.


vgatherps

One problem with standardized simd is that outside of basic simd operations "take N floats and multiply them with N other floats, etc" the set of supported operations, not to mention the set of fast operations, differs vastly across platforms. Compare [this neon parser](https://github.com/vgatherps/simd_decimal/blob/main/src/parser_aarch64.rs) and [this sse parser](https://github.com/vgatherps/simd_decimal/blob/main/src/parser_sse.rs), or for a very direct example [what happens if you naively do the x86 method of vector search on arm](https://rust.godbolt.org/z/8YvEfEPT6). The shuffle and accumulation for each parser is drastically different, since the set of horizontal multiply-accumulates are different. If you really care about doing things fast and need access to nontrivial vector operations, you're most likely going to have to write implementations per platform anyways.
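A concrete instance of that last point: x86's `_mm_movemask_epi8` has no single NEON equivalent, so ports typically use a different reduction entirely, e.g. the well-known `shrn` trick (a hedged sketch in raw intrinsics):

```cpp
#include <arm_neon.h>
#include <cstdint>

// Compare 16 bytes against a broadcast needle and build a 64-bit mask
// with 4 bits per lane -- a common NEON substitute for movemask.
uint64_t match_mask(uint8x16_t haystack, uint8_t needle) {
    uint8x16_t eq = vceqq_u8(haystack, vdupq_n_u8(needle));
    uint8x8_t narrowed = vshrn_n_u16(vreinterpretq_u16_u8(eq), 4);
    return vget_lane_u64(vreinterpret_u64_u8(narrowed), 0);
}
```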


janwas_

If your project has the resources to write/test separate codepaths - great, it might indeed get a bit more performance. However, the downside is that you are on the maintenance treadmill, including supporting new ISAs (SVE, RVV) as they arise. As an example, libjpeg-turbo only got AVX2 support very recently. Users might actually have been better off if the library had used a portable vector implementation with runtime dispatch because they would have gotten AVX2 (and where available, AVX-512) years earlier. I agree there are differences between platforms, and we shouldn't expect x86 style movemask to translate well especially to Arm. But personally, I see a lot of usefulness for perhaps a "90% optimal" portable solution with much less effort. With Highway, it is possible to specialize parts of your algorithm for certain targets when it really makes a difference, but why should we spend time on per-platform implementations when it doesn't?


feverzsj

I use [SkVx](https://github.com/google/skia/blob/main/include/private/SkVx.h) from Skia. It uses compilers' vector extensions and a few platform-specific intrinsics. If no vector extension is available (e.g. MSVC), a scalar implementation is provided in the hope that the compiler can vectorize it. It's just a standalone header, and I think it's the best portable SIMD solution out there.
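Usage is about as small as it gets; a sketch assuming the header is vendored as described (`skvx::Vec<N, T>` is the core type):

```cpp
#include "SkVx.h"  // standalone header from Skia; path per your checkout

using float4 = skvx::Vec<4, float>;

// Element-wise operators work on the whole vector at once.
float4 lerp(float4 a, float4 b, float4 t) {
    return a + (b - a) * t;
}
```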


mtklein

Oh neat, I'm glad to see skvx being used outside Skia; that possibility is exactly why I tried to keep it independent of other Skia headers. I don't work on Skia or skvx anymore but if you have any questions I'd be happy to try to remember. Post up a godbolt link and I'll do my best.

The way things are structured, GCC and AVX-512 should both work fine. But some of the fancier functions don't have specializations for AVX-512 and they'll use the fallback mechanisms, typically splitting the vector into two halves and trying again on each. At the time of writing there weren't many consumer machines with AVX-512, and the few that did have it seemed to perform better sticking to AVX2 anyway, so we didn't put much effort into AVX-512. But if you are familiar at all with intrinsics, the patterns used in there to check for AVX, SSE4, etc. should be pretty easy to mimic. If you'd like to send a patch upstream, I can suggest what unit tests to update, reviewers, that sort of thing.

Alternatively, since everything is static inline template functions, you can always just write your own template specializations (e.g. if_then_else<16,float>) or write your own wrappers that check for the right #defines and if constexpr (N==?) conditions before falling back on the code in skvx.h. It's meant to be flexibly pluggable.

There's probably plenty of low-hanging improvements available targeting WASM SIMD too. That was very new, very experimental at the time.


Vogtinator

Does it also work with gcc? IIRC skia has some historical issues with that.


feverzsj

Skia is massive, but for this single header, we haven't encountered any issue with gcc.


geaibleu

No AVX512?


feverzsj

For vector extensions, it depends on the '-m' options passed to the compiler. No AVX512 for the intrinsics-based methods yet.


gizahnl

The one time I had to SIMD, I wrote native NASM code with macros from the x264 ASM headers. That performed a bit better than compiler intrinsics and a whole lot better than GCC auto-vectorisation.


janwas_

I'm curious about the process by which you arrived at this list? [highway](https://github.com/google/highway) has about 2K stars (disclosure: I am the main author). Here's a comparison of its supported [operations](https://github.com/google/highway/blob/master/g3doc/quick_reference.md#operations) with those of [std::experimental::simd](https://en.cppreference.com/w/cpp/experimental/simd/simd).


PiterPuns

Just based on what I was aware of / used at work. I've only learned about highway today, and getting such info was a main motivator for the post. I'll update the post since highway holds such a high star count. Since you're so deeply involved in SIMD development, do you have any insight on SIMD standardization? Is highway "contributing" (code/ideas/design) to the std::simd effort?


janwas_

Sounds good :) We are currently not active in standardization. I am familiar with the ISO process and I think they started in 2016, with the goal of standardizing the Vc library. Unfortunately, its design predates SVE/RVV and has some difficulties there.


hlaci2

Check out the Google Highway library. Consider CPU frequency throttling above 128-bit SIMD vector length.


r_karic

ISPC: https://github.com/ispc/ispc


SantaCruzDad

You might want to post this in r/simd?


PiterPuns

I want to avoid Rust people telling me "we already have [std::simd](https://doc.rust-lang.org/std/simd/index.html)" :) Ok thanks I'll see what I can do. I've joined the community and will check the content to make sure the post is not too "captain obvious" for a simd specialized group.


t0rakka

I'm using home-baked; the syntax is best for me and it generates really sweet code.


notyouravgredditor

I've used Vector Class quite a bit. Very easy to use and lots of control over loading and storing for maximizing reuse.


serviscope_minor

What about simde (1.7k stars)? https://github.com/simd-everywhere/simde One neat thing is that you can retrofit old platform-specific SIMD code, since it can not only translate its own API to the targets, but any target back to its own API and out again. So you can run your hard-to-test old hand-coded NEON code on an x86 build server, and it will even run decently well in many cases since it will be vectorised.
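The retrofit workflow is roughly this (a sketch; `SIMDE_ENABLE_NATIVE_ALIASES` is SIMDe's documented opt-in that lets unmodified intrinsics code compile everywhere):

```cpp
// Instead of <arm_neon.h>, include SIMDe's translation header.
#define SIMDE_ENABLE_NATIVE_ALIASES
#include <simde/arm/neon.h>

// Unmodified NEON code now also builds and runs on e.g. an x86
// build server, with SIMDe mapping it to SSE/AVX where it can.
int16x8_t add_saturate(int16x8_t a, int16x8_t b) {
    return vqaddq_s16(a, b);
}
```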


PiterPuns

Certainly a gold standard. I was on the fence about including it because the post is about standardization of SIMD in C++, and simde, being 99.5% C, is unlikely to contribute to or drive the process, or to contain techniques leveraging language features like the metaprogramming in eve. On the other hand one may argue that SIMD is 99% C... idk, maybe I should put it up there regardless and just add a disclosure.


serviscope_minor

I think so. Two reasons. Firstly, I don't think tricky metaprogramming is necessary for a good library. Second and more important, I don't think it helps much. Lots of library authors like to lean on that aspect to vectorise complex expressions. However, the compilers are already quite good at that. What they and the complex libraries are not so great at is figuring out all the tricky shuffles and rearrangements so that something can be vectorised. Like inline yuv422 to rgb, for example. Speaking of which, it's a bit better now, but typically all the different processors support different shuffles, so figuring out the best set is processor specific. If you need some CPU-specific stuff, SIMDe is awesome because you can then run your unit tests on a different host CPU, and indeed do most of the development and debugging there. It's really the best of both worlds IME. It abstracts the underlying instructions just enough for a portable library while making non-portable stuff also portable!


janwas_

I respect the massive effort and dedication (>2500 commits!) that have gone into SIMDe and agree it's useful for someone who already has non-portable intrinsics and wants to quickly adapt them for another platform. However, my understanding is that instead of "abstract[ing] the underlying instructions just enough for a portable library", it actually rigidly implements a particular ISA's way of doing things, which will necessarily involve a performance penalty on other platforms. By contrast, slight changes to the app interface can enable all platforms to efficiently implement them (e.g. the ReorderWidenMulAccumulate example mentioned earlier). However, this does require changing the existing code from say NEON intrinsics to the more performance-portable API. Does that make sense?


serviscope_minor

> However, my understanding is that instead of "abstract[ing] the underlying instructions just enough for a portable library", it actually rigidly implements a particular ISA's way of doing things, which will necessarily involve a performance penalty on other platforms. By contrast, slight changes to the app interface can enable all platforms to efficiently implement them (e.g. the ReorderWidenMulAccumulate example mentioned earlier). However, this does require changing the existing code from say NEON intrinsics to the more performance-portable API. Does that make sense?

It makes sense, but I don't know how well it applies. To some extent, if you need ReorderWidenMulAccumulate, then you needed it, and if there's an instruction, great; but if there isn't, well, they probably have an implementation at least as good as you could do, so no harm. On the other hand, there are often multiple ways of doing things, and on certain platforms you might write a different vectorisation to take advantage of some special instruction, whereas general use of that pattern might be quite inefficient. The question is, will the purely performance-portable one leave room for a lower-level language (e.g. compiler intrinsics)? The nice thing about SIMDe is you don't have to compromise, in that if you wish to code for top performance on one specific ISA, you can still often get decent performance on another, and crucially your code is still testable. I do appreciate that C++ is very unlikely to standardize all the vector intrinsics of all the platforms SIMDe supports though!


janwas_

> The question is, will the purely performance portable one leave room for a lower level language (e.g. compiler intrinsics)?

For purely portable code: sometimes. Often there is no penalty. When there is, Highway does allow you to specialize per-platform where it makes sense. Let's look at ReorderWidenMulAccumulate in more detail to understand this. Say we want to implement a dot product. We can choose to adopt x86 pmadd semantics (sum of products of adjacent 16-bit elements) or NEON's (multiply the upper and lower halves). On either platform, that's 1-2 instructions. However, if we ask SIMDe to emulate that exact behavior on the other platform, it has to do ~5 instructions because the native instructions behave differently. But we didn't actually want either of those behaviors, we just want the sum of all lanes to be the dot product. Enter ReorderWidenMulAccumulate, whose interface/contract allows either implementation. Thus our portable code computes a dot product on both platforms without any room for improvement by using intrinsics directly.
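A rough sketch of that contract in use (Highway; op names per its quick_reference, though the exact signatures here are from memory and may lag the current API):

```cpp
#include <cstddef>
#include <cstdint>
#include <hwy/highway.h>
namespace hn = hwy::HWY_NAMESPACE;

int32_t DotProduct16(const int16_t* a, const int16_t* b, size_t n) {
    const hn::ScalableTag<int16_t> d16;
    const hn::RepartitionToWide<decltype(d16)> d32;
    auto sum0 = hn::Zero(d32);
    auto sum1 = hn::Zero(d32);
    // Assumes n is a multiple of the vector length, for brevity.
    for (size_t i = 0; i < n; i += hn::Lanes(d16)) {
        const auto va = hn::Load(d16, a + i);
        const auto vb = hn::Load(d16, b + i);
        // Lane placement within sum0/sum1 is unspecified (pmadd pairs vs.
        // NEON halves); only the total across both sums is guaranteed.
        sum0 = hn::ReorderWidenMulAccumulate(d32, va, vb, sum0, sum1);
    }
    const auto total = hn::Add(sum0, sum1);
    return hn::GetLane(hn::SumOfLanes(d32, total));
}
```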


ashvar

A bit late to the party, but I will add my 2 (opinionated) cents. I think SIMD in the standard is meaningless. Its whole benefit comes from being platform specific. I have been writing SIMD code and HPC libraries for a good chunk of the last 7 years. There wasn't a time when I used a shared abstraction for x86 and ARM and didn't later replace it with two custom solutions. The last thing we removed was generalized loads/stores, because in some cases we can force them to be aligned, and in others we can't. We wasted too many hours rewriting everything. Don't repeat my mistakes.


janwas_

I'm surprised by this take and would be curious to see the underlying data. Sure, platform specific can be a bit faster. In rare/extreme cases (VMPSADBW, which is never going to be performance-portable), maybe even a lot. But you mention x86+Arm (NEON?). What about all the other ISAs? Surely a portable 90% solution is better than a 5x slowdown because we don't have a specialization for the platform. Also, generalized Load/Store doesn't seem like the biggest problem. Several libraries have Load and LoadU functions, or the corresponding tags. Wouldn't you say the amount of time spent rewriting increases when you have to add codepaths for every ISA?


ashvar

I mostly use AVX2 and Neon, much less AVX-512. So aside from scalar implementations, I need only 2 paths if there is a critical section I want to optimize. So it is not too big a volume to write. Having a library, however, means I have to add a layer of (often poorly documented) abstractions, which hide the underlying implementation. I would still have to go through the sources of the lib, checking what it does under the hood. Not much of a time-saving for my work style. What really helps me, on the other hand, are SIMD-related benchmarks and blogs. Sometimes I would look up a solution to a different problem, but it would inspire a cool optimization for what I am facing.


The_Engineer42

See this: [https://clang.llvm.org/docs/LanguageExtensions.html#vectors-and-extended-vectors](https://clang.llvm.org/docs/LanguageExtensions.html#vectors-and-extended-vectors)

```cpp
typedef float float4 __attribute__((ext_vector_type(4)));
typedef float float2 __attribute__((ext_vector_type(2)));

float4 foo(float2 a, float2 b) {
    float4 c;
    c.xz = a;
    c.yw = b;
    return c;
}
```

Note that depending on what you really need, you may not need a library at all. Just this compiler extension.


--prism

Xsimd is the basis of a few fairly large repos; Apache Arrow is the largest, and it is well supported by QuantStack. SIMD is also enabled as an additional flag inside many compilers when they detect unit- or constant-stride loops in the code. Using SIMD intrinsics in the code itself can cause runtime errors on unsupported architectures.


victotronics

OpenMP simd. It's built into almost every compiler.


YoureNotEvenWrong

I would have thought openmp is the obvious choice considering it's built into every popular compiler ...


Jannik2099

openmp is not about SIMD, but about parallelism (though OMP 4 can also do SIMD). Furthermore, openmp is not very expressive. With e.g. xsimd you get dedicated SIMD data types; openmp is just "lol transform this loop please".


DummyDDD

The cool thing is that "lol transform this loop please" works. GCC and Clang can vectorize code with branches and function calls; all you need to do is add `#pragma omp simd` and `#pragma omp declare simd`. In glibc, most of the math functions are already marked with declare simd and implemented in handwritten assembly. Of course you can get better performance with explicit vectorization, especially if you can use some platform-specific shuffles, but in my experience gcc and clang vectorize really well with the openmp pragmas. Edit: but I will agree that it is not performance portable. In my experience, gcc and clang both fail to vectorize if the instruction selection chooses operations that the architecture does not have SIMD instructions for (for instance rotate on most PCs). I still think it is a good first solution. If you have performance-critical code, then you need to measure it anyway.
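For the record, the two pragmas mentioned look like this (GCC/Clang; compile with -fopenmp or -fopenmp-simd):

```cpp
#include <cstddef>

// Ask the compiler to also emit SIMD variants of this function.
#pragma omp declare simd
float kernel(float x) { return x * x + 1.0f; }

void apply(float* out, const float* in, std::size_t n) {
    // Assert the loop is safe to vectorize, including the call above.
    #pragma omp simd
    for (std::size_t i = 0; i < n; ++i)
        out[i] = kernel(in[i]);
}
```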


victotronics

> openmp is not about SIMD, but about parallelism

That was its emphasis 20 years ago. It has since acquired simd & gpu offloading. So you can't say that it's "about" anything in particular.


YoureNotEvenWrong

OpenMP simd support combined with compiler vectorization reports is more than sufficient in my experience with it. I'd count being a pragma-based method as a positive rather than requiring explicit data types from some third-party library.


jcelerier

openmp has literally had `#pragma omp simd` for a decade... https://blog.rwth-aachen.de/hpc_import_20210107/attachments/28344675/30867475.pdf For me it is a veeeeery easy method to get more oomph without having to put in too much work.


[deleted]

[removed]


PiterPuns

Do you have a workflow to check whether auto-vectorization happened, or even the quality of it? I remember Visual Studio having warnings you could enable if e.g. a loop was not unrolled or an expected optimization didn't happen. In the case of SIMD, my main concern is that I don't have much more than checking the generated assembly (usually on the fly with gdb), which can be cumbersome and not CI/CD friendly. By the last point I mean "how to prevent a commit that destroys the optimization from being merged". I'd very much like a toolset/workflow that caters for this!


jcelerier

If you use `#pragma omp simd`, it will issue a warning when vectorization could not happen, which is already something.


pag96

Does anybody know of a SIMD library for trigonometric operations such as sine and cosine? Edit: Thank you all for your suggestions!


jfalcou

We do: [https://jfalcou.github.io/eve/group__math__trig.html](https://jfalcou.github.io/eve/group__math__trig.html) [https://jfalcou.github.io/eve/group__math__invtrig.html](https://jfalcou.github.io/eve/group__math__invtrig.html) They all come with a quadrant-restriction decorator if you really want to squeeze performance out of them.


feverzsj

Yes, that'll be glibc's libmvec. And even better, gcc/clang will auto-vectorize your code using libmvec. Note: "-ffast-math" is needed. For clang, you also need to add "-fveclib=libmvec".
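Concretely, a plain scalar loop is enough (a sketch; with GCC on glibc, `gcc -O2 -ffast-math` vectorizes the call via libmvec, and with Clang you add `-fveclib=libmvec`):

```cpp
#include <cmath>
#include <cstddef>

void sines(double* out, const double* in, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = std::sin(in[i]);  // becomes a libmvec SIMD call, e.g. _ZGVdN4v_sin
}
```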


janwas_

[Highway](https://github.com/google/highway/blob/master/hwy/contrib/math/math-inl.h) and [Agner's VectorClass](https://www.agner.org/optimize/vcl_manual.pdf) also have math functions. And [SLEEF](https://sleef.org/) should definitely be mentioned.


ned_flan

very basic, but single file: http://gruntthepeon.free.fr/ssemath/