Since leaving LMAX I have been neglecting my blog a bit. This is not because I have not been doing anything interesting; quite the opposite, really: things have been so busy that the blog has taken a back seat. I’ve been consulting for a number of hedge funds and product companies, most of which are super secretive.
One company I have been spending quite a bit of time with is my-Channels, a messaging provider. They are really cool and have given me their blessing to blog about some of the interesting things I’ve been working on for them.
For context, my-Channels are a messaging provider that specialise in delivering data to every device known to man over dodgy networks such as the Internet or your corporate WAN. They can deliver live financial market data to your desktop, laptop at home, or your iPhone, at the fastest possible rates. Lately, they have made the strategic move to enter the low-latency messaging space for the enterprise, and as part of this they have enlisted my services. They want to go low-latency without giving up the rich functionality their product offers, which is giving me some interesting challenges.
Just how bad is the latency of such a product when it is new to the low-latency space? I did not have high expectations because, to be fair, low latency was never their goal. After some initial tests, I’m thinking these guys are not in bad shape. They beat the crap out of most JMS implementations, and it is going to be fun pushing them to the serious end of the low-latency space.
OK, enough of the basic tests, now it is time to get serious. I worked with them to create appropriate load tests and get the profilers running. No big surprises here: when we piled on the pressure, lock contention came out as the biggest culprit limiting both latency and throughput. Further down the list lots of other interesting things showed up, but let’s follow good discipline and start at the top.
Good discipline from the “Theory of Constraints” states that you always work on the most limiting factor, because once it is removed the list below it can change radically as new pressures are applied. So to address this contention issue we developed a new lock-free Executor to replace the standard Java implementation. Tests showed this new Executor to be ~10X better than what the JDK has to offer. We integrated the new Executor into the code base, and the throughput bottleneck shifted completely: the system can now cope with 16X more throughput, and the latency histogram has become much more compressed. This is a good example of how macro-benchmarking is so much more valuable than micro-benchmarking. Not a bad start, we were all thinking.
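The actual Executor we built is not public, so the following is only a minimal sketch of the general idea under my own assumptions: a single worker thread drains a lock-free multi-producer queue (here the JDK’s ConcurrentLinkedQueue), so threads submitting tasks never block on a lock.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Executor;
import java.util.concurrent.locks.LockSupport;

// Minimal sketch only: a single worker thread drains a lock-free queue,
// so producers never take a lock when submitting tasks.
public final class LockFreeExecutor implements Executor {
    private final Queue<Runnable> tasks = new ConcurrentLinkedQueue<>();
    private volatile boolean running = true;
    private final Thread worker = new Thread(this::processTasks, "lock-free-executor");

    public LockFreeExecutor() {
        worker.setDaemon(true);
        worker.start();
    }

    @Override
    public void execute(Runnable task) {
        tasks.offer(task);                  // lock-free, CAS-based enqueue
    }

    private void processTasks() {
        while (running || !tasks.isEmpty()) {
            Runnable task = tasks.poll();
            if (task != null) {
                task.run();
            } else {
                LockSupport.parkNanos(1L);  // brief back-off when idle
            }
        }
    }

    public void shutdown() {
        running = false;
        LockSupport.unpark(worker);
    }
}
```

The real thing was of course measured and tuned far beyond this; the point of the sketch is simply that task submission becomes a lock-free enqueue rather than a fight over a monitor.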
Enter Azul Stage Left
We tested on all the major JVMs and the most predictable latency was achieved with Azul Zing. Zing had by far the best latency profile with virtually no long tail. For many of the tests it also had the greatest throughput.
After the Executor lock-contention issue had been resolved, the next big bottleneck when load testing on a single machine was the use of TCP between processes over the loopback adapter. We discussed developing a new, non-network-based transport for Nirvana. For this we decided to apply a number of the techniques I teach on my lock-free concurrency course. This resulted in a new IPC transport based on shared memory via memory-mapped files in Java. We did inter-server testing using 10GigE networks, and had fun using the new Solarflare network adapters with OpenOnload, but for this article I’ll stick with the Java story. I think Paul is still sore from me stuffing his little Draytek ADSL router with huge amounts of multicast traffic when the poor thing was connected to our 10GigE test LAN. Sorry Paul!
Developing the IPC transport unearthed a number of challenges with various JVM implementations of MappedByteBuffer. After some very useful chats with Cliff Click and Doug Lea we came up with a solution that worked across all JVMs. This solution has a mean latency of ~100ns on the best JVMs and can do ~12-22 million messages per second throughput for 60-byte messages depending on the JVM. This was the first time we had found a test whereby Azul was not close to being the fastest. I isolated a test case and sent it to them on a Friday. On Sunday evening I got an email from Gil Tene saying he had identified the issue, and by Tuesday Cliff Click had a fix that we tried the next week. When we tested the new Azul JVM, we saw over 40 million messages per second at latencies just over 100ns for our new IPC transport. I had been teasing Azul that this must be possible in Java because I’d created similar algorithms in C and assembler that show what the x86_64 platform is capable of.
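The transport itself is proprietary, so the following is only a minimal sketch of the underlying technique under my own assumptions: a single-producer, single-consumer ring of fixed-size slots living in a memory-mapped file that both processes map. A real transport needs proper memory barriers, cache-line padding and careful wrap handling, none of which is shown here.

```java
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Minimal sketch of shared-memory IPC via a memory-mapped file: a single-producer /
// single-consumer ring of fixed-size slots. Not production code: real transports
// need memory fences, cache-line padding and robust wrap/recovery handling.
public final class MmapIpcSketch {
    private static final int HEADER = 16;       // producer sequence (8 bytes) + consumer sequence (8 bytes)
    private static final int SLOT_SIZE = 64;    // fixed-size slots: 4-byte length + up to 60 bytes of payload
    private static final int SLOTS = 1024;

    private final MappedByteBuffer buffer;

    public MmapIpcSketch(String path) throws Exception {
        try (RandomAccessFile file = new RandomAccessFile(path, "rw");
             FileChannel channel = file.getChannel()) {
            buffer = channel.map(FileChannel.MapMode.READ_WRITE, 0,
                                 HEADER + (long) SLOTS * SLOT_SIZE);
        }
    }

    // Producer side: claim the next slot, write the payload, then publish the sequence.
    public boolean offer(byte[] payload) {
        if (payload.length > SLOT_SIZE - 4) {
            throw new IllegalArgumentException("payload too large for slot");
        }
        long produced = buffer.getLong(0);
        long consumed = buffer.getLong(8);
        if (produced - consumed >= SLOTS) {
            return false;                       // ring is full
        }
        int slot = HEADER + (int) (produced % SLOTS) * SLOT_SIZE;
        buffer.putInt(slot, payload.length);
        for (int i = 0; i < payload.length; i++) {
            buffer.put(slot + 4 + i, payload[i]);
        }
        buffer.putLong(0, produced + 1);        // publish (a store fence is needed in practice)
        return true;
    }

    // Consumer side: poll for the next published slot.
    public byte[] poll() {
        long produced = buffer.getLong(0);
        long consumed = buffer.getLong(8);
        if (consumed >= produced) {
            return null;                        // nothing new yet
        }
        int slot = HEADER + (int) (consumed % SLOTS) * SLOT_SIZE;
        byte[] payload = new byte[buffer.getInt(slot)];
        for (int i = 0; i < payload.length; i++) {
            payload[i] = buffer.get(slot + 4 + i);
        }
        buffer.putLong(8, consumed + 1);        // mark the slot as consumed
        return payload;
    }
}
```

The key design point is that the producer only ever writes the producer sequence and the consumer only ever writes the consumer sequence, so neither side contends on shared mutable state beyond the published counters.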
I’m starting to ramble, but we had great fun removing latency through many parts of the stack. When I get more time I will blog about some of the other findings. This is still very much a work in progress, with improvements landing daily on an amazing scale. The guys at my-Channels are very conservative and do not want to publish actual figures until they have version 7.0 of Nirvana ready for GA, and have done more comprehensive testing. For now they are happy with me being open about the following:
- Throughput increased 32X due to the implementation of lock-free techniques and optimising the call stack for message handling to remove any shared dependencies.
- Average latency decreased 20X from applying the same techniques and we have identified many more possible improvements.
- We know the raw transport for IPC is now ~100ns, and the worst case pause due to GC is 80µs with Azul Zing. As to the latency for the double hop between a producer and consumer over IPC, via their broker, I’ll leave that to your imagination as somewhere between those figures until the guys are willing to make an official announcement. As you can guess, it is much, much less than 80µs.
For me the big surprise was GC pauses taking only 80µs in the worst case. I have seen OS scheduling alone result in more jitter than that. I discussed this at length with Gil Tene from Azul, and even he was surprised; he expects some worst-case scenarios with their JVM to be 1-2ms for a well-behaved application. We then explored the my-Channels setup, and it turns out we had done almost everything right to get the best out of a JVM, which is worth sharing:
- Do not use locks in the main transaction flow because they cause context switches, and therefore latency and unpredictable jitter.
- Never have more threads that need to run than you have cores available.
- Set affinity of threads to cores, or at least sockets, to avoid cache pollution by avoiding migration. This is particularly important on a server-class machine with multiple sockets because of the NUMA effect.
- Ensure uncontended access to any resource by respecting the Single Writer Principle, so that the likes of biased locking can be your friend.
- Keep call stacks reasonably small. Still more work to do here. If you are crazy enough to use Spring, then check out your call stacks to see what I mean! The garbage collector has to walk them finding reachable objects.
- Do not use finalizers.
- Keep garbage generation to modest levels. This applies to most JVMs but is likely not an issue for Zing.
- Ensure no disk IO on the main flow.
- Do a proper warm-up before beginning to measure, so the JIT has done its work on the hot paths (see the sketch after this list).
- Do all the appropriate OS tuning for low-latency systems, which is way beyond the scope of this blog. For example, turn off C-states power management in the BIOS, and watch out for RHEL 6 as it turns them back on without telling you!
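To illustrate the warm-up point, here is a hypothetical, minimal harness; the handler and iteration counts are placeholders of my own, the only point being that the hot path is exercised enough for the JIT to compile it before any numbers are recorded.

```java
// Minimal warm-up harness sketch: the handler and iteration counts are
// illustrative placeholders, not the real benchmark.
public final class WarmupHarness {
    private static final int WARMUP_ITERATIONS = 1_000_000;
    private static final int MEASURED_ITERATIONS = 1_000_000;
    private static long sink;                               // consumed result, defeats dead-code elimination

    public static void main(String[] args) {
        for (int i = 0; i < WARMUP_ITERATIONS; i++) {       // warm-up pass: results discarded
            sink += handleMessage(i);
        }

        long start = System.nanoTime();
        for (int i = 0; i < MEASURED_ITERATIONS; i++) {     // measured pass, after the JIT has had its chance
            sink += handleMessage(i);
        }
        long elapsed = System.nanoTime() - start;

        System.out.printf("mean cost per call: %.1f ns (sink=%d)%n",
                          (double) elapsed / MEASURED_ITERATIONS, sink);
    }

    private static long handleMessage(int i) {
        return i * 31L;                                     // stand-in for the real message-handling work
    }
}
```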
It should be noted that we ran this on some state-of-the-art Intel CPUs with very large L3 caches. It is possible to get 20-30MB L3 caches on a single socket these days. It is very likely that our entire application was running out of L3 cache, with the exception of the message flow, which is very predictable.
Gil has added a cautionary note that while these results are very impressive we had a team focused on this issue with the appropriate skills to get the best out of the application. It is not the usual case for every client to apply this level of focus.
What I’ve taken from this experience is the amazing things that can be achieved by truly agile companies, staffed by talented individuals, who are empowered to make things happen. I love agile development but it has become a religion to some people who are more interested in following the “true” process than doing what is truly needed. Both my-Channels and Azul have shown during this engagement what is possible in making s*#t happen. It has been an absolute blast working with individuals who can assimilate information and ideas so fast, then turn them into working software. For this I will embarrass Matt Buckton at my-Channels, and Gil Tene & Cliff Click at Azul who never failed in rising to a challenge. So few organisations could have made so much progress over such a short time period. If you think Java cannot cut it in the high performance space, then deal with one of these two companies, and you will be thinking again. I bet a few months ago Matt never thought he’d be sitting in Singapore airport writing his first multi-producer lock-free queue when travelling home, and really enjoying it.