This talk was delivered at a STAC event in 2023, and mirrors a talk done by my colleagues. Martin Thompson gave the first version of the talk, and the content is directly derived from his work. It has been lightly edited for clarity. I've posted this here as I have been asked about it a few times.
The answer to the question is, it depends. It depends on what you're trying to achieve. If, currently, you're pushing for sub-microsecond optimizations or doing everything in FPGAs and ASICs, then clearly, it's not suitable.
At Adaptive, we sometimes go out and help companies make their systems run faster, and more often than not, the performance issues are within the application code.
So how do you know whether your applications are ready or not? First, you'll need to baseline what the cloud platform is actually capable of. Then take what you know about your system's real performance profile, and you can form an approximate view - maybe you can make it work on the cloud, maybe you can't.
But, let’s put that aside for a moment and look at the last fifteen to twenty years of what's been going on in this space, in both on-prem low latency and cloud environments. At Adaptive, we’ve been working in both of these environments side by side.
Back in the early days, we started talking about latency in averages, and we soon realized that was a huge mistake. The information held in the latency distribution is what matters when understanding predictable latency. And so we — as an industry — went off for what, 15 years pushing for lower and more predictable latency to where we are today.
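To make that concrete, here's a minimal illustrative sketch using the HdrHistogram library with a synthetic workload (the numbers are made up for the example): a mean can look perfectly healthy while the upper percentiles tell a very different story.

```java
import org.HdrHistogram.Histogram;
import java.util.concurrent.ThreadLocalRandom;

public class LatencyDistributionExample
{
    public static void main(final String[] args)
    {
        // Track values up to 1 hour (in nanoseconds) with 3 significant digits.
        final Histogram histogram = new Histogram(3_600_000_000_000L, 3);

        // Synthetic workload: mostly ~20us operations, with an occasional 5ms stall.
        for (int i = 0; i < 1_000_000; i++)
        {
            final long latencyNs = (i % 1000 == 0)
                ? 5_000_000L
                : 20_000L + ThreadLocalRandom.current().nextLong(5_000L);
            histogram.recordValue(latencyNs);
        }

        // The mean hides the stalls; the upper percentiles expose them.
        System.out.printf("mean   = %.1f us%n", histogram.getMean() / 1000.0);
        System.out.printf("p50    = %.1f us%n", histogram.getValueAtPercentile(50.0) / 1000.0);
        System.out.printf("p99    = %.1f us%n", histogram.getValueAtPercentile(99.0) / 1000.0);
        System.out.printf("p99.99 = %.1f us%n", histogram.getValueAtPercentile(99.99) / 1000.0);
    }
}
```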
The cloud providers have done similar and learned similar lessons, but they were optimizing for different outcomes. It was nearly 20 years ago that Amazon put up their first cloud offering with AWS - S3, EC2, and various other things - with some of that work starting off in South Africa. Amazon designed their cloud offering around a flagship client, Toys R Us, but the needs of a retailer are very different from those of a trading firm. So AWS was designed first for microservices: 24x7 uptime, effort spent slicing things up into small services so you can scale an organization, and downtime constrained so that only part of the system is affected when something fails.
But the really interesting thing about this is that almost all communication happens through TCP, and that's pretty much how the cloud world is. If you look at TCP versus UDP, they're pretty different in how they behave.
Let’s move to where we are with low latency trading today, but also where the cloud providers globally are. We're going to explore that a little bit, and I'm going to break things down to tease out where the thresholds are.
So, what are the primary resources we use? First of all, we'll start with the servers themselves. It used to be the games industry that pushed server and CPU development forward. Over time, that has changed; the games industry is not such a significant influence anymore. Development is now driven by cloud providers and mobile device manufacturers, and as a result, CPUs are driven mainly by power reduction rather than higher performance or lower latency. But the really interesting part is that cloud providers are now the biggest purchasers of server CPUs. So they get all the good stuff first.
In the end, one of the places you can go to get the best CPUs is the cloud. It's possible to get 4.5 gigahertz CPUs running in the cloud now. And if you are on an existing instance type and want to migrate to a newer one, it's usually a very simple task.
It’s slow and painful on-prem trying to do a CPU or server upgrade. So you've got these pretty amazing servers that you can run on and get to quite easily. We had one particular client who wanted to move to some recently released server chips that, based on some benchmarking, we knew would fix their performance issue. On the cloud, they were able to move to the new architecture in one morning and benefit from the improvements straight away. Try that on-prem. Not easy. On top of very high clock speeds, these servers are becoming bigger, with large numbers of sockets and cores.
This has driven a curious change in how software gets architected. We're seeing systems being constructed out of multiple services running on the same machine. And if those services communicate over the network stack, well, that can be quite slow, even staying on the box. As an example, let's say we're just running Linux, which is very well optimized for the loopback interface.
Linux will take an optimized path between the two ends of a socket on the same host - that's the space loopback is designed for. So that works quite nicely, but it's still nowhere near as fast as possible, because the memory system on these machines has large shared caches that exchange data between CPU cores and their private caches. Those private per-core caches are kept coherent by what is effectively a message-passing system. If you build on top of that, then you've got a very high-speed network. What do we mean by very high speed? Within the same socket, core to core, we're talking sub-hundred nanoseconds. Go across sockets, and it's maybe another hundred nanoseconds. These are just rough ballparks. So if you're using the right sort of algorithms and communicating between your processes through shared memory, with the right lock-free data structures, then you can communicate extremely fast between services.
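Here is a minimal sketch of that idea: a single-producer, single-consumer ring buffer where two threads (ideally pinned to cores on the same socket) exchange values purely through cache-coherent memory, with no syscalls on the hot path. This is illustrative only; production implementations add padding against false sharing, batching, and proper wait strategies.

```java
import java.util.concurrent.atomic.AtomicLong;

// Single-producer / single-consumer lock-free ring buffer.
// Correctness relies on exactly one producer thread and one consumer thread,
// and on the capacity being a power of two.
public final class SpscRingBuffer
{
    private final long[] buffer;
    private final int mask;
    private final AtomicLong head = new AtomicLong(); // next slot to read
    private final AtomicLong tail = new AtomicLong(); // next slot to write

    public SpscRingBuffer(final int capacityPowerOfTwo)
    {
        buffer = new long[capacityPowerOfTwo];
        mask = capacityPowerOfTwo - 1;
    }

    public boolean offer(final long value)
    {
        final long t = tail.get();
        if (t - head.get() >= buffer.length)
        {
            return false; // full - apply back pressure
        }
        buffer[(int)(t & mask)] = value;
        tail.lazySet(t + 1); // ordered store: publish the slot to the consumer
        return true;
    }

    public long poll()
    {
        final long h = head.get();
        if (h >= tail.get())
        {
            return Long.MIN_VALUE; // empty (sentinel value for this sketch)
        }
        final long value = buffer[(int)(h & mask)];
        head.lazySet(h + 1); // ordered store: free the slot for the producer
        return value;
    }
}
```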
Cloud or not, it's kind of similar, but it all comes down to who you're sharing the physical host with - in other words, your noisy neighbors. Get the whole machine to yourself, or at least a whole socket to yourself, and you can get pretty decent, predictable performance.
Stepping back, in this world we're often trying to tune the most out of our systems. That's a bit trickier in the cloud. In effect, you're back to doing proper science. You can't just call up HP or Dell or whoever, say you want it configured for the lowest latency, and have them work with you. The cloud providers may help a bit, but generally it's nothing like the past. So you have to have a model of what you think your system is going to achieve. You run the benchmarks frequently, you test, and you validate the profile. It's just the scientific method.
So that’s servers.
Next, networks - what’s happening in that space? Most financial trading has been driven by UDP, whereas TCP has driven the cloud world.
Now, TCP isn't a very good choice for low and predictable latency because we get things like slow start and congestion control. I’ll touch on how flow control works in some of these scenarios. TCP was designed for networks that are congested and potentially lossy. Back in the 1980s, Van Jacobson, Sally Floyd, and others did work on the TCP implementation which gave us congestion control and things like slow start, and particularly slow start after idle. The way TCP works is that it clamps the congestion window right down to begin with, then grows it as traffic goes into the system, and keeps growing it to give you really good throughput. But if you're quiet for a while, it doesn't know whether the network has changed, so it clamps the window back down again.
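If you are stuck on TCP, two of the first things people look at are Nagle's algorithm on the socket and the slow-start-after-idle behaviour at the kernel level. A minimal sketch of both checks follows; the hostname is a placeholder and the sysctl path is Linux-specific.

```java
import java.io.IOException;
import java.net.Socket;
import java.nio.file.Files;
import java.nio.file.Path;

public class TcpLatencySettings
{
    public static void main(final String[] args) throws IOException
    {
        // Disable Nagle's algorithm so small writes are not held back for coalescing.
        try (Socket socket = new Socket("example.com", 443)) // placeholder endpoint
        {
            socket.setTcpNoDelay(true);
            System.out.println("TCP_NODELAY set: " + socket.getTcpNoDelay());
        }

        // Slow start after idle is a kernel-wide setting on Linux; when it is 1,
        // the congestion window collapses after a quiet period and the next burst
        // pays the slow-start penalty again.
        final Path sysctl = Path.of("/proc/sys/net/ipv4/tcp_slow_start_after_idle");
        if (Files.exists(sysctl))
        {
            System.out.println("tcp_slow_start_after_idle = " + Files.readString(sysctl).trim());
            // Operators often set this to 0 for latency-sensitive links:
            //   sysctl -w net.ipv4.tcp_slow_start_after_idle=0
        }
    }
}
```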
So you need to think about how you deal with that. So people end up — if they're working in financial environments — having to go back to UDP. But on the cloud, you have to use UDP with an awareness that you run on a shared network.
So now you’re back to needing flow control and congestion control. And this is one of the things people find very painful in moving to an environment where the network is not dedicated to you. You have to share it; it's designed to be shared. So you have to become a good citizen, and expect that everyone else is a good citizen, which means doing flow and congestion control - and that requires a deep understanding of both.
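A real congestion controller is well beyond a slide, but the "good citizen" idea can be illustrated with something as simple as a token bucket pacing UDP sends. This is a hedged sketch of the concept, not a substitute for proper congestion control.

```java
// Token-bucket pacer: callers may only send when tokens (bytes) are available,
// which caps both the burst size and the sustained rate onto the shared network.
public final class TokenBucketPacer
{
    private final long capacityBytes;
    private final long refillBytesPerSecond;
    private double availableBytes;
    private long lastRefillNanos;

    public TokenBucketPacer(final long capacityBytes, final long refillBytesPerSecond)
    {
        this.capacityBytes = capacityBytes;
        this.refillBytesPerSecond = refillBytesPerSecond;
        this.availableBytes = capacityBytes;
        this.lastRefillNanos = System.nanoTime();
    }

    /** Returns true if a datagram of the given size may be sent now. */
    public synchronized boolean tryAcquire(final int datagramBytes)
    {
        final long now = System.nanoTime();
        final double elapsedSeconds = (now - lastRefillNanos) / 1_000_000_000.0;
        availableBytes = Math.min(capacityBytes, availableBytes + elapsedSeconds * refillBytesPerSecond);
        lastRefillNanos = now;

        if (availableBytes >= datagramBytes)
        {
            availableBytes -= datagramBytes;
            return true;
        }
        return false; // caller should back off or queue rather than burst
    }
}
```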
And there’s been a wealth of work in this area. UDP is getting better on the general Internet because of a protocol called QUIC. Not everyone knows that if you’re using Facebook or any of the Google products, you're not using TCP from your browser back to their services - you're using a UDP-based protocol called QUIC.
QUIC gives you lower, more predictable latency. So we need to factor in that sort of thing. But there's an elephant in the room here in finance, and that's multicast. How does multicast work in this world? Well, some cloud providers say they do support multicast.
Ever heard of Transit Gateway?
Does anybody know how Transit Gateway works under the hood? What path does the traffic actually take? It goes right back to the backbone, possibly out of your region and back again - so you can have two machines in the same availability zone communicating via Transit Gateway whose traffic leaves the region entirely.
That does not give you low or predictable latency. Now, some providers are talking about offering genuine multicast at some stage. We may or may not get that. What we often have to do now is simulate it, using overlay networks or other techniques.
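One of the simpler simulation techniques is a unicast fan-out: the publisher sends the same datagram once per subscriber endpoint it knows about. A minimal sketch follows; the endpoints are placeholders, and a real version would layer the flow control, congestion control and loss handling discussed next on top.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.DatagramChannel;
import java.nio.charset.StandardCharsets;
import java.util.List;

public class UnicastFanOut
{
    public static void main(final String[] args) throws IOException
    {
        // Placeholder subscriber endpoints - in practice these come from discovery or config.
        final List<InetSocketAddress> subscribers = List.of(
            new InetSocketAddress("10.0.1.11", 40123),
            new InetSocketAddress("10.0.1.12", 40123),
            new InetSocketAddress("10.0.1.13", 40123));

        final ByteBuffer payload =
            ByteBuffer.wrap("tick: EURUSD 1.0842".getBytes(StandardCharsets.UTF_8));

        try (DatagramChannel channel = DatagramChannel.open())
        {
            // "Multicast" by sending the same payload to each subscriber in turn.
            for (final InetSocketAddress subscriber : subscribers)
            {
                payload.rewind();
                channel.send(payload, subscriber);
            }
        }
    }
}
```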
The important thing is, if you do anything with UDP, you have to be aware that you're on that shared network, and you have to be a good citizen to get good latency. Otherwise, you'll be constantly throttled and get terrible latency. And if you're going to simulate multicast, you have to apply the same discipline: add flow control and congestion control, and deal with loss as needed. You can't just run the algorithms you could get away with on a dedicated, massively over-provisioned switched network.
Finally, storage. Often our communication needs to be stored and replayed later. How are you storing in the cloud? Typically, one option is instance storage - storing on the local disks of that server. But the problem with that is that if you're migrated to another server, or you're restarted, it's gone. That's a different world from what we're used to in our on-prem data centers, where if you restart your machine, you've still got your data stored locally on disk.
So if you have to write to disk and you're using instance storage, you may as well consider pure in-memory storage. If you can't fit things in memory, you end up using network storage elsewhere, and that's only going to increase the latency. And we've also got a throughput problem if we look at what generally happens in the cloud: system calls have a much higher cost. If you've got high system call overhead, you're going to limit your throughput and constrain your latency unless you start doing some sensible things.
So the first thing you can do is start batching, because if you batch, you can reduce your average latency and get greater throughput. You need to be building from the ground up with this way of thinking. In the on-prem world, if you want to go further, you can look at things like kernel bypass.
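Here's a hedged sketch of the batching idea: instead of one syscall per message, drain whatever has queued up and push it out in a single gathering write. The queue and framing are placeholders; the point is amortizing the per-call cost.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.GatheringByteChannel;
import java.util.ArrayDeque;
import java.util.Queue;

public final class BatchingSender
{
    private static final int MAX_BATCH = 64;

    private final GatheringByteChannel channel;
    private final Queue<ByteBuffer> pending = new ArrayDeque<>();

    public BatchingSender(final GatheringByteChannel channel)
    {
        this.channel = channel;
    }

    public void enqueue(final ByteBuffer message)
    {
        pending.add(message);
    }

    /** Drain up to MAX_BATCH queued messages with a single gathering write. */
    public long flush() throws IOException
    {
        final int count = Math.min(pending.size(), MAX_BATCH);
        if (count == 0)
        {
            return 0;
        }

        final ByteBuffer[] batch = new ByteBuffer[count];
        for (int i = 0; i < count; i++)
        {
            batch[i] = pending.poll();
        }

        // One syscall for the whole batch instead of one per message.
        // Note: a real sender must handle partial writes; this sketch assumes
        // the channel accepts the full batch.
        return channel.write(batch, 0, count);
    }
}
```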
Now, kernel bypass also exists with the cloud providers. For example, on the networking side, DPDK is available on AWS, Ali Cloud, and Google Cloud. We're starting to see better solutions out there. You may also look at some of the newer Linux features, such as io_uring, but the problem is that you need the latest kernels, and they are not as well optimized for UDP as they are for TCP. So you've got this kind of interesting world.
Well, what's possible in there?
Well, for nearly the last 10 years, Adaptive has been working on a messaging system called Aeron Transport that we have been able to deploy both in the cloud and on-prem. It builds upon lessons going back 30 years or so. Aeron Transport is a messaging system with flow control and congestion control built in, but with certain parts made optional so it runs well in all sorts of environments. It batches at every level through the infrastructure to ensure good throughput. And it can plug in kernel bypass via our premium offerings where more performance is required.
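To make that concrete, here is a minimal sketch of what using Aeron's Java API looks like for two services on the same box talking over its shared-memory IPC channel. The stream id and message are arbitrary, and in production the media driver typically runs as a separate, pinned process rather than embedded.

```java
import io.aeron.Aeron;
import io.aeron.Publication;
import io.aeron.Subscription;
import io.aeron.driver.MediaDriver;
import org.agrona.concurrent.UnsafeBuffer;
import java.nio.charset.StandardCharsets;

public class AeronIpcExample
{
    public static void main(final String[] args)
    {
        // Embedded media driver for the demo; normally it is a dedicated process.
        try (MediaDriver driver = MediaDriver.launchEmbedded();
             Aeron aeron = Aeron.connect(new Aeron.Context()
                 .aeronDirectoryName(driver.aeronDirectoryName()));
             Publication pub = aeron.addPublication("aeron:ipc", 1001);
             Subscription sub = aeron.addSubscription("aeron:ipc", 1001))
        {
            final UnsafeBuffer buffer = new UnsafeBuffer(new byte[256]);
            final byte[] msg = "order-123".getBytes(StandardCharsets.UTF_8);
            buffer.putBytes(0, msg);

            // offer() is non-blocking: back pressure shows up as a negative result.
            while (pub.offer(buffer, 0, msg.length) < 0)
            {
                Thread.yield();
            }

            // Poll until the message arrives.
            final int[] received = {0};
            while (received[0] == 0)
            {
                sub.poll((buf, offset, length, header) ->
                {
                    System.out.println("got: " + buf.getStringWithoutLengthUtf8(offset, length));
                    received[0]++;
                }, 10);
            }
        }
    }
}
```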
But let's look at that interesting case of storage: the need to store something and replay it later. If you're going to go to the network file systems, then when you fsync to them, you can forget about getting decent values for both throughput and latency.
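You can see this for yourself with a small probe that appends a record and forces it to disk, timing each force call - run it against an instance-storage volume and then against a network file system mount and compare. The file path here is just a placeholder.

```java
import org.HdrHistogram.Histogram;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class FsyncLatencyProbe
{
    public static void main(final String[] args) throws IOException
    {
        // Point this at the storage you want to measure (placeholder default path).
        final Path path = Path.of(args.length > 0 ? args[0] : "/tmp/fsync-probe.dat");
        final Histogram histogram = new Histogram(60_000_000_000L, 3);
        final ByteBuffer record = ByteBuffer.allocateDirect(512);

        try (FileChannel channel = FileChannel.open(
            path, StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND))
        {
            for (int i = 0; i < 10_000; i++)
            {
                record.clear();
                record.putLong(0, System.nanoTime());
                channel.write(record);

                final long start = System.nanoTime();
                channel.force(true); // fsync: the durability point for this record
                histogram.recordValue(System.nanoTime() - start);
            }
        }

        System.out.printf("fsync p50 = %d us, p99 = %d us, max = %d us%n",
            histogram.getValueAtPercentile(50.0) / 1000,
            histogram.getValueAtPercentile(99.0) / 1000,
            histogram.getMaxValue() / 1000);
    }
}
```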
You need to be using instance-style storage to get both throughput and latency. But then you need to forget primary/secondary architectures for high availability, as you’re going to lose far too many servers. Because in the cloud, it really is cattle versus pets.
If you think your server is your pet and you want to take care of it because you don't want it to die, then you’re in trouble in the cloud. It's natural to lose servers with minimal or no notice, so you’ve got to be prepared for it. You must treat server loss as a normal event and design applications with this in mind. The good side to that is when you start thinking that way, you can then take the next step to move to seamless 24x7 operations.
So you move from things like primary/secondary architectures to consensus-based algorithms, where the data is stored on multiple nodes and a quorum - the majority - of those nodes has it safely stored. You can then afford to lose a minority of those nodes. Put in place something like Raft, Viewstamped Replication, or Virtual Synchrony.
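The core idea is small enough to sketch: a log position only counts as committed once a majority of nodes have acknowledged it. This is just the quorum arithmetic, not a full consensus implementation - leader election, terms and log reconciliation are where the real work lives.

```java
import java.util.Arrays;

// Tracks the highest log position acknowledged by each node in the group and
// derives the commit position: the highest position a majority has persisted.
public final class QuorumCommitTracker
{
    private final long[] ackedPosition; // index = node id

    public QuorumCommitTracker(final int nodeCount)
    {
        this.ackedPosition = new long[nodeCount];
    }

    /** Record that a node has safely persisted the log up to the given position. */
    public void onAck(final int nodeId, final long position)
    {
        ackedPosition[nodeId] = Math.max(ackedPosition[nodeId], position);
    }

    /** Highest position that a majority of nodes has acknowledged. */
    public long commitPosition()
    {
        final long[] sorted = ackedPosition.clone();
        Arrays.sort(sorted);
        // With positions sorted ascending, the entry at index (n - majority)
        // is acknowledged by at least 'majority' nodes.
        final int majority = ackedPosition.length / 2 + 1;
        return sorted[ackedPosition.length - majority];
    }

    public static void main(final String[] args)
    {
        final QuorumCommitTracker tracker = new QuorumCommitTracker(3);
        tracker.onAck(0, 1000); // leader's own log
        tracker.onAck(1, 800);  // follower 1
        // follower 2 has acknowledged nothing yet
        System.out.println(tracker.commitPosition()); // 800: two of three nodes have it
    }
}
```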
Now, Raft, Paxos, Viewstamped Replication, and Virtual Synchrony have been around for a long time, but they've traditionally been slow. If you have a fast implementation, you can do well. To give you an idea, with Aeron Cluster on several different cloud providers, we can get to millions of messages per second reaching consensus, and we can do that in the ballpark of a hundred mics. With dedicated on-prem style solutions, we can do it in sub-20 mics. That's reaching consensus while being fully fault-tolerant and still dealing with server loss. So if your applications are in the sort of ballpark of running at a million events a second and can deal with latency of approximately 100 mics, then it becomes possible.