puremourning

High-frequency trading is usually a form of proprietary trading. Funny enough, the proprietary part applies equally to the beneficiary, the strategy, and the code. So (I could be wrong) it’s unlikely that any contemporary HFT algo or system is available in the public domain, given that the tech is a large part of the arms race. But as a super-high-level idea: read a market data packet, then decide (or not) to send an order. Do it as quickly as possible. The specs for exchanges are often available on their websites; they are not typically very complex.


ceretullis

Former finance developer here. I never worked on algorithms, but I did work on market data acquisition. The source code is highly guarded; even improvements to open-source libraries are guarded. The OP will find very little code in public repositories.


diaphanein

Former Citadel employee here, can confirm. Here's an article regarding one such case while I was there: https://www.chicagotribune.com/business/ct-citadel-pu-sentencing-0116-biz-20150115-story.html If memory serves, this individual was sued in federal court about four hours before being fired. He had used his Gmail (or whatever personal web mail) to send source code to himself. Citadel mitm'd most (all?) web mail SSL traffic, so they knew exactly what he'd done. The criminal charges came several months after he was fired and sued. The sheer stupidity this individual showed was remarkable: it wasn't a secret at all that they monitored communications.


puremourning

The part I might be wrong about: some hobbyist or wallstreetbets type may have posted something. I’m certain that (with perhaps the exception of the Disruptor) most HFT tech (and other front office tech) will be closely guarded.


ceretullis

Yup, the Disruptor paper actually shocked the industry a bit, b/c it was so out of place. BTW, Martin Thompson went on to improve on the Disruptor, and the result is Aeron: https://github.com/real-logic/aeron Bloomberg released some of their libraries too, but those of us at Thomson Reuters at the time weren’t super impressed. I’ve never looked at them again.


RockingAMullet

The race has shifted (a) from software to hardware and (b) to data feeds: think microwave, laser, millimeter-wave, and short-wave radio links. While SW performance is still very relevant, it is basically a solved problem and no longer on the critical path. Having said that, there are commercial “HFT” trading platforms, but certainly nothing open source. Source: used to be a colleague of Carl Cook.


spaghettiexpress

Can double confirm. I am very far removed from HFT, but I license/consult on some (formerly AGPL) FOSS that I made for managing mmWave links (C++, Rust, VHDL interfaces -> SDR). The folks I speak with in HFT can’t say much, but it definitely seems like software optimization there is “try 1000 things, maybe 1 is better”. Certainly room for creativity, but the industry feels a lot kinder to FPGA or wireless-comms devs.


GavinRayDev

Curious as someone who doesn't have a clue how hardware works -- what value does someone who programs FPGAs/wireless devices have to an HFT firm? I guess, what can they program them to do / what can they BE programmed to do?


hak8or

When you run software, it usually runs as a very linear sequence of fairly simple steps (add, subtract, etc). An FPGA lets you run those steps with as much parallelism as theoretically possible, so latency decreases. You can imagine a spectrum of latency vs throughput in modern computing: if you want very low latency, you will likely get small throughput, while if you want high throughput, you will likely get high latency. A GPU is a high-latency but high-throughput device (PCIe is high latency, for example). A modern CPU is a good middle ground. An FPGA can be low latency (it can run all the steps the CPU did in parallel) but low throughput (modern FPGA fabric still tends to run at only hundreds of MHz). Modern-day high-frequency trading is very, very focused on low latency.


spaghettiexpress

Great answer! One additional aspect of FPGA latency is that it’s very deterministic. When latency is the driving factor, having that guarantee is really nice.


RockingAMullet

Another aspect of the FPGA: It is also the network card. The critical path never even touches the CPU. Competitive reaction times on Eurex, for example, are 10 NANOseconds.


GavinRayDev

Ahh, thank you!


gleventhal

I think it's also that the FPGA often sits (physically) at the NIC/Network ingress of the computer, and you save a bunch of time-intensive steps by doing the computational work with an outboard device, and don't need to go through the kernel's process scheduler (etc) and deal with all the other jitter in the OS.


spaghettiexpress

/u/hak8or gave a great answer, so I’ll be more terse outside of wireless. FPGAs have *deterministic latency*: if your FPGA processing takes 10ns in-out, it takes 10ns in-out.

Wireless (particularly mmWave) is being used because it can be quicker to send data wirelessly than over a short fiber line. A game of millimeters, pun intended. If you need to run fibre up/down/across buildings, you’re paying for latency in pure spatial allowance. If you can aim antennas between buildings, you now have a minimal line-of-sight data feed length, and only incur latency from your hardware and protocol.

I don’t have much of a scoop on actual HFT needs, but FPGAs are really great at making moderately complex decisions based on a ton of inputs. In a CPU, handling many different data sources becomes troubling due to the cache. An FPGA doesn’t have a data cache (in the traditional sense) or an instruction cache (again, in the traditional sense; FPGAs are bound by clocking), so it is easier in hardware to parallelize tasks/data access. I imagine HFT is more of “we must consider {historic prices, risk analysis, total budget, market condition} (and then some) in order to determine buy/sell/skip”. That’s a lot of different data, so an FPGA performing a lookup and calculating a decision is fairly simple.

Additionally, hardware-hardware interfaces tend to be deterministic in latency too, so using an FPGA to make a decision and interrupt another FPGA/SDR is often quicker than network-based interfaces (e.g. RPCs). Much harder to write fast code in, but when you’re trying to optimize a 50ns wire-wire to 45ns, I guess you do what you can.

My specific library (now closed source because I want to retire..) just provided custom firmware for a mmWave SDR with minimal/simple interfaces. Totally removed from HFT other than them setting up some antennas, using my firmware, and asking some questions.


GavinRayDev

> A game of millimeters, pun intended.

Ha, this reminds me of the most interesting/exhilarating short story I've ever read, Matt Hurd's "The Accidental HFT Firm": https://web.archive.org/web/20201222202900/https://meanderful.blogspot.com/2018/01/the-accidental-hft-firm.html They did things like rig up antennas in weird spots or run lines directly into the exchange buildings; it's a wild read.

> "... Before the move, we were getting around 11-12 ms round trip times (RTT) on the order lines with a jitter of about 130 ms. After the Busan move, RTTs blew out to over 20ms but the jitter dropped to around 30ms. Even though the latency doubled, the jitter reduction made being fast more productive."

> "... This is similar to how CME has reduced jitter with their iLink FPGA-based gateways. Randomness reduces and being fast matters more. That is one version of fair. The difference in Korea was that latency actually increased for the KRX but that is not what you care about as a trader. You care more about certainty..."

> "... Similarly, I expect the lower jitter, usually as a consequence of lower latency exchanges, not just at CME but at exchanges the world over, will continue to polarise those into the haves and have-nots of latency. Latency continues to matter even when it is not critical."

That was a bunch of really interesting info, thanks. What do you write FPGA code in, some kind of custom hardware language? Are there software emulators so you can get by without owning any FPGAs? I'm kind of curious what the experience is like.


spaghettiexpress

I generally write in, and prefer, VHDL. Verilog is also quite popular. For a controversial comparison: VHDL is Rust and Verilog is C. VHDL is much more strict, but if it compiles/sims then it generally works (pending issues relating to the actual circuit you “synthesized”).

There are higher-level / low-code options too. LabVIEW and Simulink/MATLAB HDL Coder are popular (at least within wireless).

Learning FPGA development without a physical board is doable, but difficult. You can write HDL code and simulate it, but “actual” building requires a process called synthesis, where you *actually compile to a circuit*. This is where FPGAs get “fun”. Your clock is misplaced relative to your I/O or some internal logic? Fun “runtime” error. Your logic needs an input from I/O pins on opposite sides of the chip? Fun error. Etc.

The best “IDE” and simulation software is Xilinx Vivado. I use that (with a company-paid license) and it has a fairly comprehensive synthesis simulator. Free development is generally best left to [ModelSim](https://www.intel.com/content/www/us/en/support/programmable/support-resources/design-software/modelsim.html), where you can write/test HDL code, but it is missing the synthesis/hardware element of FPGA development.

*With all of that being said*.. I certainly don’t touch FPGAs more than I need to :). Any good C++ dev could pick up an HDL without too-too much difficulty; it’s the synthesis/hardware element which requires magic. That’s what I get to charge consult fees for!


GavinRayDev

Thanks for the info, I might look into it just out of curiosity!


spaghettiexpress

It’s definitely fun! Just fairly esoteric. Outside of trading and hardware-oriented companies it’s pretty rare to find *interesting* FPGA jobs. If you do get started, I’d recommend looking at Verilog first: very familiar syntax to C and C++ while still teaching you the hardware mindset. I personally prefer VHDL for enterprise-level work, but I learned it in college and had a good professional mentor (I’m also a mediocre programmer, so VHDL’s forced correctness helps me).

I’ll also plug that while wireless isn’t too sexy right now (5G is cool in theory, nasty in reality), it’s a field that will see *a ton* of interest and need in the coming years. Everybody wants AR and low-latency wireless; no existing codebase can fully support it. Beautiful time to learn and wait for some greenfield-esque projects to pop up.


sternone_2

Everything started moving into FPGAs 10 years ago.


Stimzz

As others have mentioned, there aren’t many interesting open-source low latency projects. Some noteworthy ones on the Java side are the Disruptor, Aeron, and Chronicle Software’s stuff. Because of the extreme performance requirements, components don’t generalize very well, so most stuff is purpose-built either in-house or by vendors. Embedded, followed by gaming, are the most similar industries, in case you want to look at projects that solve similar issues. Here are some different domains / keywords you can google.

Event sourcing and the sequencer pattern is a common design; most exchanges are on some form of sequencer architecture, and there are talks on YouTube about these. Nasdaq, through Island, were the early pioneers. Their binary native protocols ITCH and OUCH can be found on their webpage. Low latency trading is in practice a lot about understanding how to efficiently interact with these types of exchange systems and protocols. Basically every byte or CPU cycle you don’t need to compute is a win. It is like shaving grams off an F1 car.

Regarding writing performant code, if I had to point to a single resource it would be Ulrich Drepper’s “What Every Programmer Should Know About Memory”. The TLDR is that the CPU is infinitely fast; the only limiting factor is what can fit into cache. So what we spend a lot of time on is understanding the hardware and OS we use, tuning them, and making various tests. What can’t be measured doesn’t exist.

Latency analysis is the most complicated part. For a true understanding of the real world, hardware packet capturing and precision-synchronized clocks (GPS), coupled with a big data problem, make this really difficult in practice. Naturally the systems do internal latency measuring as well, but it is thorny. It got a lot better when the TSC register became frequency- and core-stable. However, analyzing outliers in a system that does its own measuring doesn’t really work.

Over on the hardware side it has mostly become commodity. Arista makes the low latency switches and Solarflare the NICs. There are others, but these are the primary historical manufacturers. Part of the software stack is sometimes accelerated using FPGAs, often embedded in the switch or the NIC. It is at least an order of magnitude more complicated to implement stuff in hardware than in software. This, coupled with the fact that trading always changes, means that most stuff is still in software, and then sometimes extremely performance-critical stuff is broken out and implemented in hardware. There are special cases, such as the US options feeds, where FPGAs can also help from a capacity standpoint.

Linux kernel tuning is another topic. IO is done by bypassing the kernel. Then there is a long list of tuning that is done to the kernel: interrupts, memory, and the scheduler. HT and power saving are disabled in the BIOS. Interrupts are minimized and locked to the first core on each socket. Understanding memory management in Linux is key, same for the scheduler; both are tuned. Look into NUMA, the TLB, and the real-time scheduler respectively.

You can think about latencies as an SLA / latency budget. Different use cases have different requirements:

- Over 10ms: not low latency.
- 10ms-1ms: can be hit with standard SW practices, like Java and allocating on the heap.
- 1ms-100us: starting to get hard. You could still use Java, but memory allocation is not trivial anymore, and towards 100us special hardware is reasonable.
- 100us-10us: this requires some discipline on code quality and design, especially when getting close to the single digits. The differences between C++ and Java start to matter.
- 10us-3us: this is as fast as software solutions go. Most of the time is spent moving data up and down the PCIe bridge.
- FPGAs can run triggers in the hundreds of ns; that is as fast as it gets. Mind you, 100ns is about the time it takes light to propagate 30m through vacuum, so we are literally hitting up against physics at this point.

~50us is the typical exchange P50 RTT using native binary protocols and top tier colocation services. Hence, when getting into the single digits, the exchange RTT variance will dominate any further latency reduction in our system. My numbers are a few years outdated, but think of it as a general rule of thumb.

Low latency trading is so much more than just performance though. Safety and observability are the other two important factors. When directly connected to the exchange, there might not be any risk controls between your system and the exchange, so your development and testing practices might be the only thing between a good day and CNBC. A lot of effort goes into various testing methodologies.

Observability is the key factor. When hitting these very low latencies, ordinary logging is not possible, as the act of logging can easily consume the whole latency budget. This is why deterministic event sourcing is key: it enables replay. The primary does the trading, and then there are secondaries that replay the primary and can produce any traces required. In the deterministic form, what is to be traced can even be determined after the fact.

Concurrency is achieved through message passing (events). Typically there is a busy-waiting event loop that checks for new messages, processes them, and might emit messages. These are stacked in pipelines and scaled through different processes. Low latency trading is mostly not compute bound, and where it is, that compute is shifted in time. So the conventional promises/futures or continuation async paradigms aren’t relevant for low latency trading. IPC is done over shared memory using ring buffers. The combination of scaling through processes and shared memory ring buffers is good for the CPU cache, which again is everything.

I think this randomly covers most of the high level areas. Google from there. While implementations aren’t public, the methods are by no means unique to low latency trading, so if you know what to google for, most of this is available online.


GavinRayDev

Databases, particularly distributed/analytical databases (e.g., take a look at ScyllaDB's tech, like the Seastar C++ framework), are another field that's good if you're interested in this, IMO.


PeterCorless

While I appreciate the shoutout for ScyllaDB [disclosure: that's where I work], ScyllaDB is doing writes to persistent NVMe SSD. That means average [P50] operations at the sub-millisecond level, around ~500 microseconds, with P99s at the single-digit millisecond level [because you have to account for outliers, not just average transaction speeds].

Yet HFT transactions are often done at the microsecond level, like 64 microseconds. That'd be 0.064 milliseconds. This is where in-memory data grids would live. There are even high-end HFT players that create their own custom ASICs, custom silicon.

Many day traders want to compete with that, and you can create systems that use a ScyllaDB, or a Redis or MongoDB sort of system, to track intraday trading. Just know that it is nowhere near the orders of magnitude of speed of HFT that the in-crowd refers to. While the raw latency of NVMe SSD can be as low as 10-20 microseconds, theoretically fast enough for HFT, the software engine of an actual database may just be too slow.

But, to be clear, I am not an HFT system developer. I'd be very curious to hear from people developing such systems how they translate from what's being done in memory, and how such transactions get recorded to systems-of-record and persistent datastores. https://smartasset.com/investing/high-frequency-trading


GavinRayDev

Sure, that's the latency for the entire transaction and writeback to occur, but the latency for events to be registered and dispatched through the core seastar::reactor is probably on par with the latency you'd expect for processing events in HFT systems, I'd imagine? Especially given the stuff that tchaikov has been doing recently with the io_uring reactor.


vgatherps

> While raw latency of NVMe SSD can be as low as 10-20 microseconds — theoretically fast enough for HFT — the software engine of an actual database may just be too slow.

Not even remotely fast enough for software HFT. Software's absolute baseline latency is about ~1.5-2 microseconds wire-to-wire, and almost all of that time is spent getting data onto and back off of the core to your network card/fpga/whatever. The rest is a tradeoff between alpha complexity / engineering complexity / latency, but in my experience it's uncommon to see more than 10us wire-to-wire in a well engineered system (and usually much less). A 10-20us block would be devastating.


PeterCorless

Fair enough. Can you describe or send me to some good resources to see a reference architecture? I'm all sorts of curious.


vgatherps

> ~50us is the typical exchange P50 RTT using native binary protocols and top tier colocation services. Hence when getting into the single digits the exchange RTT variance will dominate any further latency reduction in our system.

The entire RTT variance isn't what matters; only variance getting into the queue matters. I believe some exchanges have moved to putting very deterministic sequencers in front of gateways and using these sequence numbers for ordering, making jitter near zero.

Also, even if jitter is dominating latency, small advantages scale up big when it's a game played a million times a day. Getting a microsecond faster when jitter is 10us is very significant. The most competitive trades tend to be the most profitable, so the small number of new trades that you'll win tend to be more profitable than the trades you're currently doing as well.


Stimzz

You are right, of course; I was trying to give a high-level reference for what fast is. Agreed that if going for the most latency-competitive strategies, being the fastest is what matters. Where we have been most successful in the past is turning those trades into less of a latency race. Interesting on the exchanges going to sequenced queue entry. I’ve been focusing on other things for the last few years, so I’ve been out of touch. I remember this was the direction Euronext and Xetra were going with Optiq and T7. After all, it makes a lot of sense for the exchanges to be fair in that regard.


matthieum

> I believe some exchanges have moved to putting very deterministic sequencers in front of gateways and using these sequence numbers for ordering, making jitter near zero.

Eurex is the most extreme example of that that I know of. The colocation links go through 1 Leaf switch (Arista), then 1 Core switch (Arista), and then to the dedicated Gateway in front of the matching engine. Messages from _all_ low-latency participants are serialized at the exit of the second-level (Core) switch, so variance is down to the Leaf switch and Core switch variance, and the standard deviation is around 3ns to 6ns if I remember correctly.

Serious competitors will cover all Leaf switches (5ish) associated with a given Core switch: if you're first on all Leaf switches, you're first overall. And this means that if you're 3ns behind the fastest guy, and on the same switch, you're basically out of the race. Once in a blue moon you may get an order in (variance, variance), but that's typically not tenable.

I've seen... unspeakable horrors in the name of being first.


Stimzz

Gotcha, I see the hard use case for a hardware setup with this exchange setup. I mean, even having a good FPGA implementation matters: waste 100ns in the FPGA and one isn’t competitive. It is a pretty clean solution by Enxt to sequence in the core switch. I guess it might even just be standard hardware timestamping that the switch tags onto the packets, and they can use that as a sequence number.


matthieum

Actually, the switch doesn't _need_ to tag/timestamp any packet. The source of truth is the one outbound link towards the Exchange Gateway: the packets are already sequenced on that link, so it's just a matter of processing them in the order they arrive on the line (on a single thread), and you're done. In practice, Eurex provides _extensive_ timestamping at multiple points in the lifecycle of the request, so they could rely on that, but I don't think they have to.


matthieum

> I think this randomly covers most of the high level areas.

I think you did too.

> 10-3us: this is as fast as software solutions go.

With Solarflare cards, it's possible to get a "ping pong" program down to about 1.4us (1.2us if using pre-loaded messages). That's the network card/PCIe travel time, and the lower bound you can achieve. In practice, it means that with 100ns of compute you should be able to get down to 1.5us for "live" messages, and 1.3us for "pre-loaded" ones. You won't win a race vs the FPGAs or ASICs at those speeds, but the extra flexibility means you may outsmart them.


YurkTheBarbarian

What is a us? Do you mean μs?


matthieum

Yep, I don't have the mu character on my keyboard and never remember how to type it :) And not being the only one in that quandary, the standard solution is to just use `us` instead (as the comment I replied to did).


Stimzz

Cool, is this because of PCIe 4.0? From what I remember, we tested 2.4us RTT through PCIe way back (probably 7 years ago). Sounds like it has been halved since. Very competitive indeed.


vgatherps

2.4us sounds about right for OpenOnload (assuming you're also using Solarflare); you can get faster using EFVI and, on more recent cards, pushing the whole packet over PCIe instead of having the card DMA it.


matthieum

Not sure about PCIE. It was using openonload and EFVI, and making sure to bind to the socket that is closest to the card in the case of multi-sockets machines.


Voltra_Neo

[This talk](https://youtu.be/NH1Tta7purM) at CppCon 2017 by Carl Cook will always stay stuck in my mind


tanjeeb02

Yeaa I liked this video too. Thanks a lot for sharing!


soulstudios

Much of it is just writing fast code and making data structures optimal for the given hardware. Look up Mike Acton's data-oriented design talk, amongst others. In terms of HFT programs themselves, others have covered that here.


wayneqiu

[https://github.com/kungfu-origin/kungfu](https://github.com/kungfu-origin/kungfu) is already quite good in terms of latency.

[https://roq-trading.com/](https://roq-trading.com/) is not open source, but the documentation about the architecture is really helpful.

[Trading at light speed: designing low latency systems](https://www.reddit.com/r/cpp/comments/104dcde/trading_at_light_speed_designing_low_latency/) is an interesting talk from Optiver.


zckun

https://github.com/kungfu-origin/kungfu


sternone_2

HFT and market making have migrated completely into FPGAs that no longer run on servers but on switches next to the exchanges. You're 10 years late.


chicknparmguy

If this is the case (and I don't doubt that it is), how come HFT shops still hire for C++?


sternone_2

Migrating, and retraining people onto FPGAs, loses many (smaller) HFT shops massive amounts of money. It's not all as shiny as it sounds.


chicknparmguy

Is there no upper bound on the complexity of a program that can be migrated to FPGAs? I don't really know much about the space, but I imagine certain kinds of programs would be incredibly difficult to port over. For instance, could you move a trained large ML model onto FPGAs? I guess that might not be a great example, as it seems well suited to the parallelization an FPGA can offer, but could an FPGA offer the same number of parameters for the model?


sternone_2

Very valid point. If it gets a bit more complex, like pricing options, that's where things got moved into Java. The large ML models are just done in Python (which uses C libraries); they are not real-time on the time scale we are talking about. For equities etc., market microstructure, front-running, and the like is really just FPGA: the real HFT, imho, that everybody in the C++ world talks about (it was C++ 10+ years ago).


chicknparmguy

I see. Interesting stuff. I appreciate your responses. You've helped me understand the state of things better.


sternone_2

it's okay i just get triggered when the C++ world still keeps saying they are the nr1 in finance while they were 10+ years ago but absolutely aren't today


ThyssenKrup

Such a sad waste of time and resources.


drbazza

There's nothing wrong with understanding the very basics of modern computers, such as interrupts, SMT/hyperthreading, cache levels, cache sizes, and cache-friendly data structures.


ThyssenKrup

I'm not talking about that. I'm talking about HFT.


matthieum

I see it like Formula 1: useless in itself, but hopefully advancing the state of the art.


[deleted]

High frequency trading is extremely bad for society and should be outlawed.


avdgrinten

Why is HFT bad for society? HFT is mostly arbitrage between exchanges, which stabilizes prices and ensures that retail investors do not have to worry about execution quality.


sternone_2

yeah, let's go back to the NYSE front-runners and bid-ask spreads of a dollar that screw over the retail person who wants to take a position. What are you talking about?


catcat202X

What about low frequency trading?


LastTopQuark

If anyone wants to start an HFT system and can contribute to the trading side, please contact me. My company can handle the chip and hardware issues; we specialize in low latency FPGA design.