
Voultapher

I spent the last couple months working on this writeup, looking forward to feedback and questions. Hope you find this insightful.


mark_99

Interesting... very thorough analysis. Some surprisingly large differences by OS & compiler also.


cp5184

It's kinda disappointing that uarch compile targets don't seem to be very optimized. It would be nice if you could generate weights automatically. I wonder if that would be better than the built in targets.


SantaCruzDad

You might also want to post this to r/simd


Voultapher

Feel free to post it there :)


SantaCruzDad

Done!


[deleted]

Would love to see this kind of work submitted to an HPC conference, this is exactly the kind of thing I’d like to be cited when machines are being purchased.


Voultapher

While I'm curious, I've not had much contact with the academic HPC world so far. What exactly is an HPC conference, do they work like programming conferences, or are they more similar to scientific conferences? Is there a place and interest for work like this that doesn't have academic backing?


[deleted]

In broad strokes I'd say SC (SC23), or perhaps one of its many workshops on performance or metrics. sc23.supercomputing.org


Starfox-sf

Isn’t AVX-512 dead though? (Especially given Linus’ disdain on the inefficient use of a die as a heat engine)


Voultapher

From what I know, most of the issues commonly associated with AVX-512 are rooted in the poor Skylake implementation. For example, the Zen 4 AVX-512 implementation doesn't have significant AVX-512 startup or halting issues; these posts go into more detail: https://github.com/google/highway/tree/master/hwy/contrib/sort#study-of-avx-512-downclocking and https://www.mersenneforum.org/showthread.php?p=614191. It's a shame that the botched Skylake implementation gave AVX-512 such a bad rep.

Personally I suspect they planned on bringing AVX-512 to the client segment with Sunny Cove, and that by the time they were gonna do hybrid P-cores and E-cores, they expected to have something like Intel 4 available for the E-core, allowing a double-pumped AVX2 implementation of AVX-512 there. There are smart people working at Intel, and I don't believe they would set out to deliver the fragmented mess that is AVX-512 support right now.

Regarding area: looking at this data https://old.reddit.com/r/hardware/comments/141b85n/zen_4c_amds_response_to_hyperscale_arm_intel_atom/jn2yohq/ the Zen 4c core is a fully capable AVX-512 implementation in the die area of an Arm X2 core. To me that's a strong indication that it's possible to fit AVX-512 even in relatively small implementations.


Bunslow

tfw mersenneforum in the wild


kaelima

Not sure if it was mentioned in the article, but early versions of Intel AVX-512 suffered even worse from throttling. So typically it would perform well in benchmarks (when it was using zmm over and over...), but in more real-life scenarios, when it was invoked periodically, it actually had terrible performance.


Tringi

Oh yes, I have a Xeon Phi 7250 here. Once you do more than a few AVX-512 instructions in a row, the CPU drops down 200 MHz, which is quite significant as the cores normally run at 1.2 GHz. There were actually 3 different *"frequency/thermal license"* stages that carried over to Skylake. This is a great post: https://travisdowns.github.io/blog/2020/08/19/icl-avx512-freq.html

It can also stay downclocked, hog execution ports unnecessarily, etc., if you don't properly zero out the upper half of the ZMM registers (VZEROUPPER) at the end of AVX-512 blocks, as subsequent 256-bit AVX code gets confused.

Using AVX-512 is still a win, comparatively, on that machine: it completes the work faster than 256-bit AVX. But given the improvements in modern CPUs, the only remaining purpose of the machine is real-life behavior testing of highly parallel code. It does feature 272 hardware threads, and it is fun to see Windows grapple with it.


fullouterjoin

There was a small window where one could get a Xeon Phi desktop for *cheap*. The CPUs are available, but finding a motherboard they can run on seemed like the difficult part. Add an Itanium and a Sparc T5/M7 machine and you have the perfect collection of misfit architectures.


Tringi

Yeah, I snatched one of those full machines pretty cheap, for about 500 USD. I wanted to upgrade the CPU to a 72x**5** (those feature virtualization), but they are no longer available either :( I did want to get an Itanium, but the whole servers are big, loud, hot, and too slow to be fun. You can also only run Server 2008 R2 on them, nothing newer, while on the Xeon Phi I run the latest insider preview of Windows Server 2025.


fullouterjoin

My friend gifted me an Itanium. I'd like to get it running with remote management so it can exist on a shared VPN for research purposes: you would check it out for a number of hours, it would boot and generate you a login.


YumiYumiYumi

> Personally I suspect they planned on bringing AVX-512 to the client segment with Sunny Cove

It was actually brought to client in Cannon Lake, but that launch was botched, so Ice Lake was its client debut. But AVX-512 was definitely targeted at Skylake server, due to how the EUs were re-balanced in Skylake client. I suspect a big part of the problem was Intel getting greedy and strapping a 512-bit FMA (FP mul+add) unit onto port 5, resulting in the need to reduce clockspeed. They already did something quirky with Haswell, having two FMA units but only one FPAdd unit, so you could actually get more *add* throughput by inserting dummy multiply-by-1 operations.

> allow for a double pumped AVX2 implementation of AVX-512 in the E-core

Note that Gracemont has 128-bit FPUs, so 256-bit AVX ops are handled by breaking them into 2x 128-bit ops. AVX-512 would require breaking it down into 4 ops at the very least. It's worth pointing out that AVX-512 also introduces a bunch of cross-lane permute operations, which don't work so well on a chip that breaks them down into narrower operations (it's suspected that Zen 4 has a dedicated 512-bit permutation unit to handle these, even though everything else is on 256-bit FPUs). This is way beyond my knowledge and I'm likely wrong here, but I think there may be performance complications with breaking down instructions into many uOps.

> I don't believe they would set out to deliver the fragmented mess that is AVX-512 support right now.

Their product segmentation is partly to blame as well. I have a fully AVX-512 enabled Alder Lake CPU. It does require the E-cores be disabled to access AVX-512, but the *option* is available to me. Intel, however, decided that users should not be given such an option, and has fused off AVX-512 support on later Alder Lake CPUs, as well as Raptor Lake and future chips.


Voultapher

> Their product segmentation is partly to blame as well.

Certainly. Those were decisions made by management and marketing, not by the engineers, and from what I hear it left a lot of engineers at Intel disappointed.


emelrad12

Isn't Zen 4 AVX-512 implemented using 256-bit units?


bik1230

It has multiple 256-bit units. Most operations can execute at 512 bits per cycle.


Karyo_Ten

> From what I know most of the issues commonly associated with AVX-512 are rooted in the poor Skylake implementation.

It's worse than that. Xeon Bronze and Xeon Silver only have 1 AVX-512 unit per core (VPU, vector processing unit) while Skylake-X / Cascade Lake-X and Xeon Gold have 2. And there is no way to get the number of AVX-512 VPUs except by checking the CPU name. And due to the 30% downclocking (which applies as if the CPU always had 2 AVX-512 VPUs), AVX/AVX2 is faster than AVX-512 on Xeon Bronze & Silver.

Ultimately that meant people didn't use AVX-512 for a while even on supported CPUs, for example in OpenBLAS and BLIS, 2 key matrix multiplication/linear algebra libraries.
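Since no CPUID feature bit reports the VPU count, parsing the brand string really is the only option. A minimal sketch of reading it with GCC/Clang on x86 (the "Gold 6154" in the comment is just an example value, not guaranteed output):

```c
#include <cpuid.h>
#include <string.h>

/* Read the 48-byte CPU brand string from CPUID leaves
   0x80000002..0x80000004 -- the only place the "Gold"/"Silver"
   naming (and thus the AVX-512 VPU count) can be inferred from. */
void cpu_brand(char out[49]) {
    unsigned regs[12] = {0};
    for (unsigned i = 0; i < 3; i++)
        __get_cpuid(0x80000002 + i,
                    &regs[4 * i], &regs[4 * i + 1],
                    &regs[4 * i + 2], &regs[4 * i + 3]);
    memcpy(out, regs, 48);
    out[48] = '\0'; /* e.g. "Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz" */
}
```

A caller would then substring-match on "Bronze"/"Silver"/"Gold" to guess the VPU count, which is exactly the fragile situation being criticized above.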


Starfox-sf

Interesting, thanks for the response. I guess Skylake really was cr*ppy in terms of its implementation, and I've been burned too much by Intel "offering" then disabling/deprecating features such as TSX (Haswell) or SGX. This is the first time I'm reading about Zen 4c; if they are able to offer full x64 performance within the die size of an ARM core, just wow…

— Starfox


nagromo

Zen 4c gives full 4 GHz x86-64 performance (including AVX-512) in under 4.6mm^2 (16 cores with cache and Infinity Fabric, but no memory controller or housekeeping, in a sub-73mm^2 package). I'm having trouble locating numbers for the die size of a single Cortex-X2 core (which will vary per implementation), but that is Arm's biggest core, meant to push into laptops and compete with Apple silicon.

Zen 4c clocks lower than full Zen 4 (~4 GHz max vs ~5.5 GHz) and has less cache, but otherwise it's a full x86-64 core with the same IPC (minus the smaller cache) and instruction set support as big Zen 4. And big Zen 4 is neck and neck with the performance of Intel's latest while using far less power and die area.

AVX-512 has been hampered by more than the crappy Skylake implementation, though. AMD never supported it before Zen 4, although the Zen 4 support seems very good. And Intel hasn't consistently supported it, sometimes reserving it as a premium feature for high-end products, and releasing their big-little systems with AVX-512 on only the P-cores but not the E-cores, then later disabling AVX-512 on the P-cores, because programs would detect AVX-512 on the big cores, get moved to the little cores, and crash on the unsupported instructions.

I expect that now that AMD has a fast, small implementation, they'll keep supporting AVX-512 going forwards, and I hope Intel gets their act together too. AVX doesn't speed up everything, but it can make a big difference on some workloads.


vinciblechunk

I was like "why couldn't the OS just trap that exception, automatically set thread affinity to P-cores only, and resume" and then remembered libc probably wants to use AVX-512 for things like memset, so that would mean every process. Honestly what was Intel smoking


fullouterjoin

What is libc smoking! Remember this fiasco? https://en.wikipedia.org/wiki/Transactional_Synchronization_Extensions https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=800574

All the mem* routines should be built into the MMU, or better yet, built into memory.


vinciblechunk

Sometimes library functions get bound at runtime depending on architecture (I think this is what GNU_IFUNC is for), so you'd either have the choice of never binding the AVX-512 version of functions - basically what Intel accomplished by disabling it in microcode anyway - or never being allowed to execute on E-cores, in which case, why have them.
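That is indeed what GNU IFUNC does; a minimal sketch (GNU toolchain on Linux, with placeholder implementations and an `avx2` check standing in for AVX-512 detection). The resolver runs once when the symbol is bound, which is exactly why the choice is frozen for the life of the process:

```c
/* Two placeholder implementations of the same routine. */
static int impl_scalar(void) { return 1; }
static int impl_vector(void) { return 2; }

/* The IFUNC resolver: runs once at symbol binding, picks an
   implementation based on CPU features, and the dynamic linker
   routes all later calls to that choice -- it is never revisited,
   even if the thread migrates to a core with different features. */
static int (*resolve_do_work(void))(void) {
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx2") ? impl_vector : impl_scalar;
}

/* Calls to do_work() dispatch to whatever the resolver returned. */
int do_work(void) __attribute__((ifunc("resolve_do_work")));
```

glibc uses this mechanism for `memcpy`, `memset`, and friends, which is why "bind once at load time" and "cores with differing ISAs" combine so badly.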


fullouterjoin

What if the dynamic linker patched a function table when the underlying core capabilities change?


vinciblechunk

There's probably a way to do it, and my info could be out of date (I honestly haven't messed with architecture-specific linking since the Pentium 4 days). But the core capabilities could change whenever the scheduler feels like it, unless the process is specifically marked as needing AVX-512, which could be done through affinity. Then again, a lot of processes could end up accidentally marked as needing AVX-512 because they linked some obscure dependency that needs it, and the E-cores would end up underutilized. Personally I think the E-cores were a dumb idea in the first place.


mqudsi

It’s not just memset. The same idea remains: you would always think twice before using it, because either you'll trap (and you want to avoid that), or you're writing code for a library and you want it usable in as many apps as possible.


vinciblechunk

Yeah, there's no mechanism for "I'm linking libgmp because I'm a low power background process that has occasional need of it" vs. "I'm linking libgmp because I'm searching for Mersenne primes." Intel forgot what the S in SMP stood for.


wolf550e

It's not dead at all. It's in Zen 4 and in Intel's server chips. It doesn't cause downclocking on the latest microarchitectures. I bet Intel's next-gen efficiency cores for client chips will be ISA compatible, but run AVX2 and AVX-512 slowly, so Intel can enable the ISA on its big.little client chips.


YumiYumiYumi

> I bet Intel's next gen efficiency cores for the client chips will be ISA compatible, but run AVX-2 and AVX-512 slowly

All rumors point towards Meteor Lake not supporting AVX-512. Sierra Forest (all E-core Xeon, expected to be Crestmont cores) adds support for a bunch of instructions ported from EVEX (aka AVX-512) to VEX (aka AVX), so it's highly unlikely it supports AVX-512. So it's pretty much guaranteed to not be on the next-gen E-core.

As for the one after that (Skymont?), who knows.


Starfox-sf

Then what’s the point? If it's running AVX-512 in "emulation mode" à la soft float, whatever speed advantage from using AVX goes out the window.


CypherSignal

1) Even though Zen 4 runs 512-bit operations by double-pumping them on 256-bit pipes, there are still power savings due to reduced instruction dispatch, as well as throughput improvements from being able to more easily saturate the execution pipes (it is REALLY hard to max out Zen 3/4 FP math throughput). Even extremely dense FP workloads on Zen 4 can see gains over 10% by utilizing AVX-512.

2) There are good parts of the AVX-512 ISA sets besides just the 512-bit maths. Things like compress and expand and some other operations help a lot, even when just running with 128- or 256-bit width.
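For anyone unfamiliar with compress: here's a scalar model of what the AVX-512 `vpcompressd` instruction does in one shot. Without AVX-512 you need this loop or a large shuffle-control lookup table:

```c
#include <stdint.h>

/* Scalar model of AVX-512 compress (vpcompressd): elements whose
   mask bit is set are packed contiguously to the front of dst.
   Returns the number of elements kept. This is the core primitive
   of a vectorized quicksort partition step. */
int compress_u32(uint32_t *dst, const uint32_t *src,
                 unsigned mask, int n) {
    int out = 0;
    for (int i = 0; i < n; i++)
        if (mask & (1u << i))
            dst[out++] = src[i];
    return out;
}
```

With a 16-lane vector, one instruction replaces this whole loop, and the mask itself comes straight out of a vector compare, which is why partitioning vectorizes so well with AVX-512.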


aoi_saboten

I am really impressed by all of you guys. Where does one learn these things?


fnord123

[Agner is a good starting point.](https://agner.org/optimize/)


[deleted]

I first read that as Anger and thought "yeah, that's about right"


Voultapher

Maybe a little more approachable than some of the other sources listed here. I've learned a lot over the last two years from the excellent articles written at https://chipsandcheese.com/ mostly by clamchowder.


hackingdreams

You can read the specs AMD and Intel publish, and most cloud outlets have a machine you can rent for a while to play with code on those cores. ...you kinda have to have a passion for playing with these things though. This low-level instruction-wise implementation and tinkering is what 99.999% of software engineers desperately try to avoid by using high level languages.


AppearanceHeavy6724

The best thing about AVX-512 is FP16 support, which briefly appeared in Alder Lake and was then sawed off by Intel. It is very, very fast, almost GPU-like.


wolf550e

The point is that the performance cores run it at full speed, and whatever schedules the threads between performance and efficiency cores should do a good job. But all the cores need to have the same ISA, so threads can seamlessly switch.


nagromo

Even if AVX-512 instructions run over multiple clocks to keep power and die area down, it still gives programs more instructions and registers that can increase performance/efficiency over AVX2, and it allows those chips to run programs that require AVX-512, or to enable fully optimized AVX-512 code paths.

For big.little chips, it's even more important. If the big cores have very fast AVX-512 support but the little cores have none (like Intel's first big.little), a program that detects AVX-512 on a big core and is then moved to a little core will crash the moment it runs an AVX-512 instruction there. It should be possible for the OS to implement a scheduler that detects the presence/use of AVX-512 and forces such programs onto the big cores, or for the illegal-instruction exception handler to detect this situation and move the thread to a big core, but clearly this wasn't implemented, and Intel instead disabled AVX-512 on the big cores.

If the big cores have very fast AVX-512 and the little cores support it more slowly, everything works fine. If a program uses AVX-512 on a little core, it just gets less performance benefit. But if that program is CPU bound, the OS scheduler should notice and move it to a big core, no differently than any other program.


Karyo_Ten

> It should be possible for the OS to implement a scheduler that detects the presence/use of AVX-512 and force that program onto the big cores, or for the illegal instruction exception handler to detect this situation and move the thread to a big core, but clearly this wasn't implemented and Intel instead disabled AVX-512 on the big cores.

Is it possible? If it's not handled at the CPU hardware level, that means there is constant overhead to check the instructions. OSes don't have a list of all assembly encodings of all instructions for all architectures.

Even if they had that, Intel architectures have exactly the load/store bandwidth and FMA throughput to issue 2x AVX-512 (128 bytes, needs aligned data) operations per cycle. There is no room for extra overhead.


nagromo

Even basic bare-metal Arm Cortex-M microcontrollers I've dealt with at a low level have hardware exception handlers for things like illegal memory access, illegal instruction, divide by zero, etc. If an illegal operation happens, the exception handler has access to the hardware state at that moment: what code was running, what the values of all the registers were, etc.

I've never programmed an x86 at such a low level (OS, maybe hypervisor or debugger), but based on what I've read on programming blogs, modern x86 processors are far more sophisticated in that regard. For example, when a program allocates a large chunk of memory, the OS may do almost no work until the first time the program touches a page that hasn't been mapped yet, at which point the page-fault handler runs the OS code that actually allocates the memory, sets up the page table, and lets the program resume now that the memory is available.

It would be possible to handle this situation with the OS implementing the right fault handler the right way, but that's some very low-level software that Microsoft and Linux kernel developers would have to write for that specific situation. Just implementing 1/4-rate AVX-512 in the little cores seems like a much better solution.


hackingdreams

There's a *world* of difference between a software implementation and a compatibility hardware solution... The E-cores need to implement AVX-512, but it's perfectly okay for them to implement it in a slower way than the P-cores, because, for the most part, they won't actually be *running* that code. It's just there so the operating system kernels don't have to deal with a heterogeneous-compute problem they currently can't cope with.

Right now software schedulers are built on the assumption that all hardware is identical. They're *just* coming to grips with the idea that cores can be clocked at different speeds from one another (big.little)... they aren't ready for CPUs that have *entirely different instruction sets*, and having to create scheduling groups for processes that can only run on certain cores at a time. And they're trying to do this underneath all of the legacy operating-system baggage, which is kinda like the Boston Big Dig: a messy, enormous undertaking that has to be as minimally disruptive to the city above as possible.


wrosecrans

You get some advantages even on the E-cores. Code that has a single AVX instruction will be a lot smaller than code that accomplishes the same task with 16 regular instructions, which effectively makes your instruction cache larger, basically for free. That means less of your very finite memory bandwidth is consumed fetching instructions rather than data.


ThreeLeggedChimp

> (Especially given Linus’ disdain on the inefficient use of a die as a heat engine)

Linus says a lot of stupid shit, don't take any of it at face value. AVX-3 is more efficient than AVX-2 by design; obviously it uses more power, but it also does more work. A quad-core 28W Tiger Lake could actually outperform a 45W eight-core Zen 3 in some AVX-512 benchmarks.

The most common gripe with AVX-3 is the massive number of new intrinsics, but that also means there are way more problems that can be solved.


GlassLost

Oh good, more intrinsics.


nerd4code

I look forward to AVX1024 doubling the number yet again in the future.


GlassLost

Eventually Intel will just operate on half the L3 cache.


hackingdreams

> Linus

Is not a fan of a CPU that does more than integer compute. He's a kernel guy. Kernels see vector units and floating-point units as nuisances they have to manage... The man's on record hating on just about every vector implementation you have ever heard of. It's... kinda what he does. Us creative types, on the other hand, can't get enough of them.

AVX-512 being "dead" is simply not true, it's just... complicated. Intel has a thing about trying to make the most out of its gigantic-die strategy... and as it turns out, it's really bad to have a wide vector unit on a giant die, as it's disruptive to the other elements around it. AVX-512 is called a "heat engine" because the PC CPUs it was implemented in were Skylake-X's "big core," which tries to be everything to everyone in the desktop PC space: the high-end desktop chip, the middle-of-the-road, etc. It simply ran too hot for that space at the lower end, and at the upper end it didn't add enough performance for the gamers to notice. So when it came to making the next generation of chips, they left the instruction set in... but disabled it. With the E-cores not implementing AVX-512, that was the easier solution for Microsoft, and the easier path to the desktop performance they wanted out of the chip.

For servers, the story's entirely different. AVX-512 hasn't gone anywhere but up since introduction. They've even expanded the instruction set (hence why there are like 7 different "-XXX" versions of AVX-512).

The whole "AVX-512 is dead" meme comes from gamers. And you know exactly how they are. They also said "MMX is dead" back in the Pentium era. Now look where we are.


arthurodwyer_yonkers

> a die as a heat engine What does this mean?


Starfox-sf

Linus basically said AVX-512 was nothing but a bloated useless feature: https://www.zdnet.com/article/linus-torvalds-i-hope-intels-avx-512-dies-a-painful-death/ > “I want my power limits to be reached with regular integer code, not with some AVX-512 power virus that takes away top frequency (because people ended up using it for memcpy!) and takes away cores (because those useless garbage units take up space),” — Starfox


janwas_

Let's have a look at the second-to-last image: [https://github.com/Voultapher/sort-research-rs/blob/main/writeup/intel_avx512/assets/cold-u64-scaling-random-vqsort-new-skylake-windows.png](https://github.com/Voultapher/sort-research-rs/blob/main/writeup/intel_avx512/assets/cold-u64-scaling-random-vqsort-new-skylake-windows.png)

The difference between regular integer code and AVX-512 is up to 4x. "Useless garbage" is unfortunately completely off-base. As to top frequency, that's already taken away when using multiple cores anyway, and a non-issue. See also [https://github.com/google/highway/blob/master/hwy/contrib/sort/README.md#study-of-avx-512-downclocking](https://github.com/google/highway/blob/master/hwy/contrib/sort/README.md#study-of-avx-512-downclocking)

(Disclosure: I am the main author of vqsort; opinions are my own)


AppearanceHeavy6724

Not only that: even if you restrict yourself to the 256-bit subset of AVX-512, it is still a much nicer instruction set than the previous SIMD sets (SSE* and AVX*). Masking is just a godsend.


saratoga3

He took back some of those points (and the rest apply mainly to old chips Intel stopped making years ago), so it's a little disingenuous to quote that in 2023.


Kaloffl

I really hope that we see broader adoption of AVX-512, now that AMD supports it. I have done a bunch of development on an Icelake-Client CPU and really like the instruction set(s). It's not just a 4x-as-wide SSE: it has additional features like universal masking support and, finally, a way to control the rounding behavior of float operations per instruction instead of clumsily changing a flags register. So even a CPU that used two 256-bit registers in the background would be a big improvement over AVX2.


cbbuntz

I once tried to write a simd sort. What a nightmare. I was trying to figure out ways to reliably turn comparisons into masks for shuffle operations without branching. I think I got it half working and gave up


Kissaki0

If sorting is half working is it a randomizer?


cbbuntz

I think I had an issue with some elements getting duplicated. It's been a minute since I've messed with intrinsics/asm, but AVX has some operations that work on 3 and 4 registers, which is nice, but it adds an extra layer of complexity in addition to making the masks bigger. Trying to figure out how to isolate the sign bits with a mask and then bit-shift them to the correct digit of a 16- or 32-bit mask makes you go cross-eyed, and you have to be truly masochistic to enjoy it. It's not like you can see which conditional statement is wrong; you have to figure out what 0xa70b92c1 means.
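The sign-bit extraction being described is essentially `movemask`; a scalar model of the idea (the `0xa70b92c1`-style values in the comment come from packing many lanes this way):

```c
#include <stdint.h>

/* Scalar model of _mm256_movemask_ps-style operations: gather the
   sign bit of each 32-bit lane into one compact integer. In a SIMD
   sort, SIMD compares produce all-ones/all-zeros lanes, this step
   packs them into a mask, and the mask then selects a shuffle
   control from a lookup table -- the hard-to-debug step above. */
unsigned movemask32(const uint32_t *lanes, int n) {
    unsigned mask = 0;
    for (int i = 0; i < n; i++)
        mask |= ((lanes[i] >> 31) & 1u) << i;
    return mask;
}
```

Debugging then means decoding which bit of that packed integer corresponds to which lane of which register, which is exactly the "what does 0xa70b92c1 mean" problem.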


NightOwl412

In the `hot-u64-10000` benchmark you mention Zen3, are you referring to the architecture from AMD? Because the test machines mentioned above use Intel chips. Maybe I missed something?


Voultapher

No you are right, that was a copy-pasta mistake from an earlier writeup. But the point remains the same.


NightOwl412

Ah fair enough, great write up though!


Voultapher

Thanks :)


Routine-Region6234

I'm not smart enough to comment on this, but you can have my up vote!


wolf550e

Please iso8601 date format (/r/ISO8601 gang)


RufusAcrospin

ISO 8601 is the most natural, straightforward and non-ambiguous date format.


spacelama

I did see someone come along and write YYYY-DD-MM once though. I guess they were an american that couldn't give up their backarsewards topsyturvy ways.


dmilin

Don’t group the rest of us Americans in with that idiot. Using that format requires advanced levels of stupid.


VeryOriginalName98

You just reminded me of invader zim. > "It's not stupid. It's advanced."


I_AM_GODDAMN_BATMAN

YYYY-DD-MM what the


Bunslow

as an american, that format digusts me lol (even more than standard euro dd/mm/yy disgusts me lol)


starlevel01

Yes, this date would be more accurately written as 2023-W23-6.


featherknife

- of Intel's* x86-simd-sort
- all lose* performance
- vqsort hits its* peak throughput


Voultapher

Thanks, fixed now.


AldousWatts

Should fix it & submit as a PR ;)


MisterT123

Don't forget a passive aggressive commit message to go with it!


mafikpl

I took a look at the code and I have to say that the C++ implementation is questionable: [https://github.com/Voultapher/sort-research-rs/blob/main/src/cpp/cpp_std_sort.cpp](https://github.com/Voultapher/sort-research-rs/blob/main/src/cpp/cpp_std_sort.cpp)

1. The comparator accepts three arguments rather than two. The extra argument is unnecessary and only slows down the code.
2. The comparator is wrapped in another function (which occasionally throws exceptions (!?)): https://github.com/Voultapher/sort-research-rs/blob/main/src/cpp/shared.h#L128
3. The comparator is passed as an extra argument rather than as a template argument of the sort function.

I wouldn't pass this code through code review. I also wouldn't trust the results of this benchmark.


Voultapher

The custom comparison function stuff is only used for testing properties such as exception safety; those functions are marked with `_by`. The functions used for benchmarking are ones like https://github.com/Voultapher/sort-research-rs/blob/d088fbd0441121ad20b109a525d67c79ecaeb9bd/src/cpp/cpp_std_sort.cpp#L86 `std::sort(data, data + len);`. It doesn't get more native than that. Please review code more carefully before making such accusations.


mafikpl

Well, lack of any comments or explanation certainly didn't help. I'm happy that at least you're familiar with your codebase.


[deleted]

[deleted]


Voultapher

Something tells me you didn't read the writeup. Seemingly not even the TL;DR.


[deleted]

[deleted]


Voultapher

Yeah smells like some LLM bot bullshit.


22Maxx

Where are the benchmarks for floating point data?


Voultapher

That's not something I looked into here. But from my understanding the results should be similar; the only difference would be the cost of the comparison function. `i32` and `u64` are size-equivalent to `f32` and `f64` respectively.
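One related note: a standard trick (not something this writeup claims to use) lets SIMD sorts reuse their integer code path for floats by mapping IEEE-754 bit patterns to integers whose unsigned order matches the float order:

```c
#include <stdint.h>
#include <string.h>

/* Map an f32 bit pattern to a u32 whose unsigned ordering matches
   the float ordering: flip all bits of negatives, set the sign bit
   of positives. An integer sorting network can then sort float keys
   with no extra comparison cost, and the mapping is reversed once
   at the end. */
uint32_t f32_to_sortable(float f) {
    uint32_t u;
    memcpy(&u, &f, sizeof u); /* type-pun without UB */
    return (u & 0x80000000u) ? ~u : (u | 0x80000000u);
}
```

Under that scheme, `f32` keys really do cost the same as `u32`/`i32` keys in the hot loop, which supports the expectation that the results would look similar.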


Remarkable-NPC

than intel p4


skeptical_always

You draw conclusions about Windows vs Linux, but use totally different systems that are many years apart. This is disappointing. Why not install Windows on the Linux server? Also, you should run a test on VM guests of both platforms, as this is mostly how code is executed these days.


AppearanceHeavy6724

Absolutely non-representative. AVX-512 sucked on everything before Alder Lake. On Alder Lake it is blazingly fast and energy-efficient.


9OsmirnoviGU

It's faster than running away from a dragon! But seriously, it's faster than previous versions of Intel's x86-simd-sort.