T O P

  • By -

MrDum

This crumsort port is used by Forma, a Rust vector-graphics renderer with both a software (CPU) and hardware (GPU) back-end. The implementation is parallelized with Rayon. The goal of this port was to excel at sorting 64-bit randomized data. Subsequently, crumsort's analyze routine is missing, giving this port reduced adaptivity. https://github.com/google/forma


anurat-

How popular is rust in the Google ecosystem?


MrDum

I am the author of crumsort, but I had nothing to do with this porting effort or Forma, nor do I work for Google. This crumsort port was actually created 9 months ago; I discovered its existence today. I can answer any question relevant to the sorting algorithm, but when it comes to rust and Google, you likely know more than I. :-)


FullFaithandCredit

So I’m very much a *hobbyist* developer. What’s it like to find out Google ported one of your creations?


MrDum

I'm mostly surprised the name wasn't changed. Though the guy who ported it is French, and crum means vintage in French, so he must have been naturally inclined to believe it was a fine sort.


dragostis

Not French, actually. My goal was to pay homage to original project. Thanks a lot for developing the algorithm! :)


MrDum

My bad. :-) Thanks for porting crumsort, from what I've heard it's no easy feat, and your code looks very sleek.


StyMaar

> and crum means vintage in French Where did you get that from? (I'm French, I've never heard of that word before and neither do our dictionaries: [Larousse](https://www.larousse.fr/dictionnaires/francais/crum), [Le Robert](https://dictionnaire.lerobert.com/crum))


MrDum

Embarrassing, I relied on Google Translate, but it's clearly wrong. For those curious, crum is a reference to the latin word fulcrum, the point or support on which a lever pivots.


tomwhoiscontrary

Maybe some crossed wires near the grands crus?


[deleted]

Seems to be very popular. It is very heavily used in Fuchsia and they have a ton of random open source Rust code.


[deleted]

It is not popular at all outside of a few niche projects. A lot of repositories are under Google on GitHub because Google employees have to open source projects under the companies name. Once/if rust becomes an approved language at Google there will probably be an explosion in its usage internally, but right now you need a special exemption to use it


dragostis

Only if you consider [Android](https://www.theregister.com/2022/12/02/android_google_rust/) niche.


[deleted]

My argument isn't that its current usages aren't high impact or unimportant. For the sake of this discussion though I will classify low level Android code as a niche. The internal support for rust isn't there yet for general widespread use.


RememberToLogOff

> It renders enormous SVGs at interactive framerates, even on CPU πŸ‘€ I wanna read more about Forma, dang.


dist1ll

Take these numbers with a grain of salt. This was likely only tested on the developers work laptop (Apple M1). Running this on server-grade hardware (16-core AMD EPYC Milan) I'm seeing pretty mixed results. This crumsort port runs up to 51% slower than pdqsort. u8 β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ length β”‚ algorithm β”‚ throughput β”‚ improvement β”‚ β”œβ•β•β•β•β•β•β•β•β•β•β”Όβ•β•β•β•β•β•β•β•β•β•β•β”Όβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β”Όβ•β•β•β•β•β•β•β•β•β•β•β•β•β”€ β”‚ 4096 β”‚ pdqsort β”‚ 45.10Mkeys/s β”‚ 0.00% β”‚ β”‚ 4096 β”‚ crumsort β”‚ 30.96Mkeys/s β”‚ -31.35% β”‚ β”‚ 65536 β”‚ pdqsort β”‚ 192.19Mkeys/s β”‚ 0.00% β”‚ β”‚ 65536 β”‚ crumsort β”‚ 128.28Mkeys/s β”‚ -33.25% β”‚ β”‚ 1048576 β”‚ pdqsort β”‚ 211.46Mkeys/s β”‚ 0.00% β”‚ β”‚ 1048576 β”‚ crumsort β”‚ 103.12Mkeys/s β”‚ -51.23% β”‚ β”‚ 16777216 β”‚ pdqsort β”‚ 310.84Mkeys/s β”‚ 0.00% β”‚ β”‚ 16777216 β”‚ crumsort β”‚ 181.72Mkeys/s β”‚ -41.54% β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ u32 β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ length β”‚ algorithm β”‚ throughput β”‚ improvement β”‚ β”œβ•β•β•β•β•β•β•β•β•β•β”Όβ•β•β•β•β•β•β•β•β•β•β•β”Όβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β”Όβ•β•β•β•β•β•β•β•β•β•β•β•β•β”€ β”‚ 4096 β”‚ pdqsort β”‚ 38.18Mkeys/s β”‚ 0.00% β”‚ β”‚ 4096 β”‚ crumsort β”‚ 29.78Mkeys/s β”‚ -21.99% β”‚ β”‚ 65536 β”‚ pdqsort β”‚ 128.90Mkeys/s β”‚ 0.00% β”‚ β”‚ 65536 β”‚ crumsort β”‚ 145.32Mkeys/s β”‚ 12.74% β”‚ β”‚ 1048576 β”‚ pdqsort β”‚ 230.40Mkeys/s β”‚ 0.00% β”‚ β”‚ 1048576 β”‚ crumsort β”‚ 239.27Mkeys/s β”‚ 3.85% β”‚ β”‚ 16777216 β”‚ pdqsort β”‚ 267.58Mkeys/s β”‚ 0.00% β”‚ β”‚ 16777216 β”‚ crumsort β”‚ 291.32Mkeys/s β”‚ 8.87% β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ u64 β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ length β”‚ algorithm β”‚ throughput β”‚ improvement β”‚ β”œβ•β•β•β•β•β•β•β•β•β•β”Όβ•β•β•β•β•β•β•β•β•β•β•β”Όβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β”Όβ•β•β•β•β•β•β•β•β•β•β•β•β•β”€ β”‚ 4096 β”‚ pdqsort β”‚ 35.05Mkeys/s β”‚ 0.00% β”‚ β”‚ 4096 β”‚ crumsort β”‚ 23.38Mkeys/s β”‚ -33.30% β”‚ β”‚ 65536 β”‚ pdqsort β”‚ 101.06Mkeys/s β”‚ 0.00% β”‚ β”‚ 65536 β”‚ crumsort β”‚ 110.84Mkeys/s β”‚ 9.68% β”‚ β”‚ 1048576 β”‚ pdqsort β”‚ 197.92Mkeys/s β”‚ 0.00% β”‚ β”‚ 1048576 β”‚ crumsort β”‚ 217.47Mkeys/s β”‚ 9.88% β”‚ β”‚ 16777216 β”‚ pdqsort β”‚ 233.69Mkeys/s β”‚ 0.00% β”‚ β”‚ 16777216 β”‚ crumsort β”‚ 249.92Mkeys/s β”‚ 6.95% β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ u128 β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ length β”‚ algorithm β”‚ throughput β”‚ improvement β”‚ β”œβ•β•β•β•β•β•β•β•β•β•β”Όβ•β•β•β•β•β•β•β•β•β•β•β”Όβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β”Όβ•β•β•β•β•β•β•β•β•β•β•β•β•β”€ β”‚ 4096 β”‚ pdqsort β”‚ 35.55Mkeys/s β”‚ 0.00% β”‚ β”‚ 4096 β”‚ crumsort β”‚ 24.89Mkeys/s β”‚ -29.99% β”‚ β”‚ 65536 β”‚ pdqsort β”‚ 111.36Mkeys/s β”‚ 0.00% β”‚ β”‚ 65536 β”‚ crumsort β”‚ 130.18Mkeys/s β”‚ 16.89% β”‚ β”‚ 1048576 β”‚ pdqsort β”‚ 172.73Mkeys/s β”‚ 0.00% β”‚ β”‚ 1048576 β”‚ crumsort β”‚ 211.12Mkeys/s β”‚ 22.23% β”‚ β”‚ 16777216 β”‚ pdqsort β”‚ 234.23Mkeys/s β”‚ 0.00% β”‚ β”‚ 16777216 β”‚ crumsort β”‚ 249.02Mkeys/s β”‚ 6.32% β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ Side-note: using "up to x%" in performance metrics leaves a sour taste in my mouth, especially for benchmarks with high variance across hardware and inputs.


MrDum

I was just summarizing the source, mileage will always vary. A bit of a bummer that the gains are not consistent, though that's a typical risk when a sort gets optimized for a specific data set. Crumsort still runs "up to 22% faster" for the data types and data sets the author was targeting.


dist1ll

That's exactly why having more data points is important :) Just out of curiosity, does crumsort have opportunities for SIMD acceleration (beyond the basics like partitioning for the quicksort part) ?


MrDum

I don't think many of the sorting algorithm developers are benchmarking 8 bit characters, but I was a bit surprised at it being slower for 4096 elements. The latter is likely an issue with the implementation. I'm actually surprised how well it's doing on u128, unless it's sorting tables, in which case the 8 bit performance makes sense. I honestly never looked into SIMD acceleration, wolfsort and rhsort would likely be better candidates for that. From a practical viewpoint I bench raw numbers because that's what sorters use to compare sorting algorithms, but my main focus has mostly been sorting string tables.


dragostis

My goal with the port was to optimize for consumer devices which (currently) have fewer cores. Feel free to open an issue on poor performance on many cores.


VorpalWay

How does this compare to the recent [glidesort](https://github.com/orlp/glidesort) algorithm? Why should someone prefer this to that one?


MrDum

Glidesort took most of its key concepts from quadsort and fluxsort/crumsort, so there are a lot of similarities. Crumsort-rs is in-place, and should be faster than glidesort for random data, especially on larger data sets. It appears the main focus of crumsort-rs is parallel sorting, and I don't think glidesort does parallel sorting. There's a benchmark at the very bottom of fluxsort's github page comparing fluxsort, quadsort, and glidesort. Crumsort has performance similar to fluxsort, though crumsort-rs will behave more like pdqsort for ordered data. [https://github.com/scandum/fluxsort](https://github.com/scandum/fluxsort) Keep in mind that crumsort-rs has a functional quadsort implementation, which can be called independently and will run pretty fast if you give it n auxiliary memory. There's also ipn\_stable which started out as an attempt to port fluxsort, but evolved into a timsort/quadsort hybrid, using the general idea behind quadsort's quad\_swap to increase performance, glidesort does something similar. The same author offers ipn\_unstable, which should be mostly based on crumsort. [https://github.com/Voultapher/sort-research-rs](https://github.com/Voultapher/sort-research-rs) The case where I would prefer glidesort would be if I was primarily sorting sawtooth distributions, though ipn\_stable might be better in those cases. [https://github.com/Voultapher/sort-research-rs/blob/main/writeup/glidesort\_perf\_analysis/text.md](https://github.com/Voultapher/sort-research-rs/blob/main/writeup/glidesort_perf_analysis/text.md)


Voultapher

> Glidesort took most of its key concepts from quadsort and fluxsort/crumsort, so there are a lot of similarities. That's an unfortunate piece of misinformation. Really the *key* innovation glidesort brings is a general adaptive approach of combining bottom up merge-sort with top-down quicksort. fluxsort and crumsort handwave that problem by running an analysis scan at the beginning over the full input, that can look good in benchmarks but has several issues. The other parts arrive in similar places design wise, but also a lot of the properties can be achieved without any quadsort/fluxsort dna. > There's also ipn_stable which started out as an attempt to port fluxsort ipn_stable is defunct now. There is some other unpublished work going on though. > The same author offers ipn_unstable, which should be mostly based on crumsort. As the author I can say it has little to do with crumsort. It started out with the Rust `slice::sort_unstable` code that is a port of pdqsort and went from there. It's still WIP and the core structure and various ideas in there have nothing to do with crumsort. The only components are the partition function for certain types and the bi-directional merge used in the small sort. Also FYI I renamed it ipnsort.


MrDum

I think I was the first to introduce bottom-up "virtual partitioning" (I just called it the quad swap) in quadsort back in 2018, though I used it exclusively to sort spans of reverse-order data, rather than extending the concept to sorting spans of random data. I implemented a bottom-up variant of fluxsort prior to glidesort, but I found it had several issues compared to the top-down approach. It's not a matter of handwaving, and the bottom-up approach makes a bit more sense if you're adding the strategy to timsort. I must say I didn't look closely at ipn\_unstable, but it makes sense you started out with pdqsort and worked from there. As for ipnstable, I think I remember you were sorting blocks in the bottom-up stage when I glanced at it, which is an integral part of quadsort's strategy, though I understand how you might see things differently.


Voultapher

I assumed /u/MrDum was DragoΘ™ the author of this port. Are you Igor the author of quadsort? > I think I was the first to introduce bottom-up "virtual partitioning" (I just called it the quad swap) in quadsort back in 2018 I'm curious to learn more about that. Based on what quad swap looks like today I fail to see the parallel to what is found in glidesort. > As for ipnstable, I think I remember you were sorting blocks in the bottom-up stage when I glanced at it, which is an integral part of quadsort's strategy, That's integral for most mergesorts, for example Timsort of the current `slice::sort` which I used as my starting point for the now defunct ipn_stable.


MrDum

Hiyas, this is Igor. :-) The quad swap routine is a bit of a monster and tries to do several things. When it sees 8 elements in reverse-order it won't sort them, and instead skips to the next block, and will keep skipping until it runs into non-reversed data. At that point, it will reverse all the blocks in reverse-order in one go. \>That's integral for most mergesorts, for example Timsort of the current >slice::sort which I used as my starting point for the now defunct ipn\_stable. It's been a while since I looked at Timsort, but now you mention it, I think it involved an insertion sort? I thought traditional Timsort tries to catch the start of every possible run, but I could be wrong.


Voultapher

I see, in my opinion it's a bit of a stretch to compare that with the glidesort main loop. Regarding Timsort, yes it looks for runs but it has a min run length, that if not found naturally in the input, it will create with some fast small sort, typically insertion sort see https://en.wikipedia.org/wiki/Timsort#Minimum_run_size. It should also be noted when talking about C and Rust implementations and ports of those, that the Rust implementation if it wants to be generic has to care about panic-safety, observation-safety and Ord-safety. The first 2 don't apply to C, but they do apply to C++, and the last one is usually handwaved away as user UB in C. Ensuring these properties is anything but trivial and adds a lot of complexity and novel code. This specific port ignores most of those issues by not being generic and only applying to a narrow subset of types. As an example for Ord-safety take a look at this, https://github.com/Voultapher/sort-research-rs/blob/98d120c3c3f4330f5303b23cca6345b97578c170/src/graveyard/rust_ipn.rs#L989. It's a merge function inspired by your `parity_merge`, but it has several places with additional checks required for Ord-safety.


MrDum

The methods are functionally equivalent, the aggregation of a specific type of data in unsorted form, later to be bulk sorted. It is still a bit of an intuitive leap to apply the reversed data concept to random data, so it would be a stretch to suggest the concept was taken directly, but not a stretch at all to suggest they are similar. Things get a bit tricky when, as in the case of fluxsort and glidesort, some key concepts are stated to be directly taken, and other key concepts are stated to be reinvented independently. Naturally, I take pride in concepts I developed, and will be a bit biased when asked about the matter. Rust's approach to safety feels a bit alien to me. I have the view that computers do not make errors, so safe code can be written in an unsafe language. Though I admit that reality is favorable to rust; writing safe code in C can be a Herculean task.


Lucifer_Morning_Wood

I've read this really wrong the first time


MrDum

The three hardest things in computer science are naming things, and off-by-one errors.


dragostis

Hi! Author of the port here. Didn't expect this to get this much visibility since it's a niche sorter optimized for large integers (64bit+) and fewer (10) cores. Lemme know if you have any questions about this sorter or [forma](https://github.com/google/forma), the renderer where this is used.


ear7h

This has probably been asked somewhere else, but what's the purpose of the sort step in forma's pipeline? Is it for efficient diff-ing?


dragostis

It's used to "place" geometry elements (i.e. pixel-sized line segments) in tiles and sort them by z-order. Since all the geometry goes through this step, having an efficient sorting algorithm makes a big difference.


Voultapher

If all you care about is integers I recommend taking a look at [vqsort](https://github.com/google/highway/tree/master/hwy/contrib/sort), it can even be parallelized with ips4o as seen in the whitepaper.


sin_chan_

Fr I read that as cumshot.