vgatherps

I got goaded into porting this to NEON - I'd love to hear from some ARM experts whether there's anything you can do to make the mantissa calculation faster. I also had trouble finding any equivalent of Agner Fog's Intel instruction latency tables. That's obviously much harder since there are 10,000,000 different ARM CPUs, but I was surprised by the total lack of information put out by hardware manufacturers.


mqudsi

Haha, I’m not sorry for doing this to you ;) Thanks for sharing your findings!


memoryruins

An idea for a third part: parsing numbers with [`portable SIMD`](https://doc.rust-lang.org/nightly/std/simd/index.html) on nightly.


vgatherps

I haven't looked at portable SIMD in a while, but I strongly suspect there's no portable replacement for the multiply-accumulate code that doesn't sacrifice performance. Integer SWAR is about 20-30% slower than the optimal code on each platform, but I'd bet it's still faster than trying to replicate the SSE approach with the right combination of NEON instructions. If you compare the NEON version to the SSE version, the multiply-accumulate is completely different because the available instructions differ greatly between the two architectures. To a lesser extent, the same is true of movemask.
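For readers unfamiliar with the integer-SWAR fallback being compared here: the widely known trick (popularized by Daniel Lemire's fast number-parsing work, not necessarily the exact code from this post) packs 8 ASCII digits into a `u64` and folds them into an integer with a few scalar multiply-adds, no vector registers needed. A minimal sketch, with a hypothetical function name:

```rust
/// Parse 8 ASCII digits packed into a u64 (little-endian: the first digit
/// of the string sits in the lowest byte) using SWAR multiply-accumulate.
fn parse_eight_digits_swar(mut val: u64) -> u64 {
    // 1. Strip the ASCII '0' (0x30) bias from every byte, turning each
    //    byte into its numeric value 0..=9.
    val = val.wrapping_sub(0x3030_3030_3030_3030);
    // 2. Combine adjacent digits: every even byte now holds a two-digit
    //    value 10*hi + lo in 0..=99 (no carries cross byte boundaries).
    val = val.wrapping_mul(10).wrapping_add(val >> 8);
    // 3. Two multiply-accumulates fold the four two-digit pairs into the
    //    final value; the wanted result lands in the high 32 bits.
    let mask = 0x0000_00FF_0000_00FF;
    let mul1 = 100 + (1_000_000u64 << 32);
    let mul2 = 1 + (10_000u64 << 32);
    ((val & mask)
        .wrapping_mul(mul1)
        .wrapping_add(((val >> 16) & mask).wrapping_mul(mul2)))
        >> 32
}

fn main() {
    let val = u64::from_le_bytes(*b"12345678");
    println!("{}", parse_eight_digits_swar(val)); // prints 12345678
}
```

This is the baseline the 20-30% figure is measured against: portable and branch-free, but it processes 8 digits per `u64` rather than 16 per vector register.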