Thursday 22 March 2012

Fun with my-Channels Nirvana and Azul Zing

Since leaving LMAX I have been neglecting my blog a bit.  This is not because I have not been doing anything interesting.  Quite the opposite really, things have been so busy the blog has taken a back seat.  I’ve been consulting for a number of hedge funds and product companies, most of which are super secretive.

One company I have been spending quite a bit of time with is my-Channels, a messaging provider.  They are really cool and have given me their blessing to blog about some of the interesting things I’ve been working on for them.

For context, my-Channels are a messaging provider that specialise in delivering data to every device known to man over dodgy networks such as the Internet or your corporate WAN.  They can deliver live financial market data to your desktop, laptop at home, or your iPhone, at the fastest possible rates.  Lately, they have made the strategic move to enter the low-latency messaging space for the enterprise, and as part of this they have enlisted my services.  They want to go low-latency without giving up the rich functionality their product offers, which is giving me some interesting challenges.

Just how bad is the latency of such a product when new to the low-latency space?  I did not have high expectations because to be fair this was never their goal.  After some initial tests, I’m thinking these guys are not in bad shape.  They beat the crap out of most JMS implementations and it is going to be fun pushing them to the serious end of the low-latency space. 

OK enough of the basic tests, now it is time to get serious.  I worked with them to create appropriate load tests and get the profilers running.  No big surprises here: when we piled on the pressure, lock contention came out as the biggest culprit limiting both latency and throughput.  As we went down the list, lots of other interesting things showed up, but let’s follow good discipline and start at the top.

Good discipline for “Theory of Constraints” states that you always work on the most limiting factor, because when it is removed the list below it can change radically as new pressures are applied.  So to address this contention issue we developed a new lock-free Executor to replace the standard Java implementation.  Tests showed this new executor is ~10X better than what the JDK has to offer.  We integrated the new Executor into the code base, and the throughput bottleneck shifted completely.  The system can now cope with 16X more throughput, and the latency histogram has become much more compressed.  This is a good example of how macro-benchmarking is so much more valuable than micro-benchmarking.  Not a bad start, we are all thinking.
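
For flavour, a minimal sketch of the idea (illustrative only, not the actual my-Channels executor, which has many more features): replace the lock-guarded work queue inside a standard executor with a non-blocking queue that worker threads poll.

    import java.util.Queue;
    import java.util.concurrent.ConcurrentLinkedQueue;
    import java.util.concurrent.Executor;

    // Sketch only: tasks flow through the JDK's non-blocking
    // ConcurrentLinkedQueue so producers never block on a monitor.
    public final class LockFreeExecutorSketch implements Executor {
        private final Queue<Runnable> tasks = new ConcurrentLinkedQueue<Runnable>();
        private volatile boolean running = true;

        public LockFreeExecutorSketch(final int numWorkers) {
            for (int i = 0; i < numWorkers; i++) {
                Thread worker = new Thread(new Runnable() {
                    public void run() {
                        while (running) {
                            Runnable task = tasks.poll();
                            if (task != null) {
                                task.run();
                            } else {
                                Thread.yield(); // back off when idle
                            }
                        }
                    }
                });
                worker.setDaemon(true);
                worker.start();
            }
        }

        public void execute(final Runnable task) {
            tasks.offer(task); // lock-free enqueue, no contended monitor
        }

        public void shutdown() {
            running = false;
        }
    }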

Enter Azul Stage Left

We tested on all the major JVMs and the most predictable latency was achieved with Azul Zing.  Zing had by far the best latency profile with virtually no long tail.  For many of the tests it also had the greatest throughput.

After the lock contention on the Executor issue had been resolved, the next big bottleneck when load testing on the same machine was TCP between processes over the loopback adapter.  We discussed developing a new non-network transport for Nirvana.  For this we decided to apply a number of the techniques I teach on my lock-free concurrency course.  This resulted in a new IPC transport based on shared memory via memory-mapped files in Java.  We did inter-server testing using 10GigE networks, and had fun using the new Solarflare network adapters with OpenOnload, but for this article I’ll stick with the Java story.  I think Paul is still sore from me stuffing his little Draytek ADSL router with huge amounts of multicast traffic when the poor thing was connected to our 10GigE test LAN.  Sorry Paul!

Developing the IPC transport unearthed a number of challenges with various JVM implementations of MappedByteBuffer.  After some very useful chats with Cliff Click and Doug Lea we came up with a solution that worked across all JVMs.   This solution has a mean latency of ~100ns on the best JVMs and can do ~12-22 million messages per second throughput for 60-byte messages depending on the JVM.  This was the first time we had found a test whereby Azul was not close to being the fastest.   I isolated a test case and sent it to them on a Friday.  On Sunday evening I got an email from Gil Tene saying he had identified the issue and by Tuesday Cliff Click had a fix that we tried the next week.  When we tested the new Azul JVM, we saw over 40 million messages per second at latencies just over 100ns for our new IPC transport.  I had been teasing Azul that this must be possible in Java because I’d created similar algorithms in C and assembler that show what the x86_64 platform is capable of.
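
The transport itself is not public, but the bones of the technique can be sketched (illustrative only; this glosses over the cross-JVM memory-model subtleties just mentioned, and the offsets and names are made up): both processes map the same file, the producer writes the message and then publishes a sequence counter, and the consumer spins until the sequence advances.

    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    // Sketch of IPC over a memory-mapped file. Real code must pad to
    // cache lines and use ordered writes to be correct across JVMs.
    public final class SharedMemorySketch {
        private static final int SEQUENCE_OFFSET = 0;  // 8-byte publish counter
        private static final int MESSAGE_OFFSET = 64;  // payload on its own cache line

        public static MappedByteBuffer map(final String path, final int size) throws Exception {
            RandomAccessFile file = new RandomAccessFile(path, "rw");
            return file.getChannel().map(FileChannel.MapMode.READ_WRITE, 0, size);
        }

        // Producer: write the data first, then publish the sequence.
        public static void send(final MappedByteBuffer buffer, final long sequence, final long value) {
            buffer.putLong(MESSAGE_OFFSET, value);
            buffer.putLong(SEQUENCE_OFFSET, sequence);
        }

        // Consumer: busy-spin until the expected sequence is visible.
        public static long receive(final MappedByteBuffer buffer, final long expectedSequence) {
            while (buffer.getLong(SEQUENCE_OFFSET) < expectedSequence) {
                // spin
            }
            return buffer.getLong(MESSAGE_OFFSET);
        }
    }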

I’m starting to ramble but we had great fun removing latency through many parts of the stack.  When I get more time I will blog about some of the other findings.  It is still very much a work in progress, with advances happening daily on an amazing scale.  The guys at my-Channels are very conservative and do not want to publish actual figures until they have version 7.0 of Nirvana ready for GA, and have done more comprehensive testing.  For now they are happy with me being open about the following:
  • Throughput increased 32X due to the implementation of lock-free techniques and optimising the call stack for message handling to remove any shared dependencies.
  • Average latency decreased 20X from applying the same techniques and we have identified many more possible improvements.
  • We know the raw transport for IPC is now ~100ns and the worst case pause due to GC is 80µs with Azul Zing.  As to the latency for the double hop between a producer and consumer over IPC, via their broker, I’ll leave that to your imagination as somewhere between those figures until the guys are willing to make an official announcement.  As you can guess it is much much less than 80µs.
For me the big surprise was GC pauses only taking 80µs in the worst case.  I have seen OS scheduling alone result in more jitter.  I discussed this at length with Gil Tene from Azul, and even he was surprised.  He expects some worst case scenarios with their JVM to be 1-2ms for a well behaved application.  We then explored the my-Channels setup, and it turns out we have done everything almost perfectly to get the best out of a JVM, which is worth sharing.
  1. Do not use locks in the main transaction flow because they cause context switches, and therefore latency and unpredictable jitter.
  2. Never have more threads that need to run than you have cores available.
  3. Set affinity of threads to cores, or at least sockets, to avoid cache pollution by avoiding migration.  This is particularly important when on a server class machine having multiple sockets because of the NUMA effect.
  4. Ensure uncontested access to any resource, respecting the Single Writer Principle, so that the likes of biased locking can be your friend (see the sketch after this list).
  5. Keep call stacks reasonably small.  Still more work to do here.  If you are crazy enough to use Spring, then check out your call stacks to see what I mean!  The garbage collector has to walk them finding reachable objects.
  6. Do not use finalizers.
  7. Keep garbage generation to modest levels.  This applies to most JVMs but is likely not an issue for Zing.
  8. Ensure no disk IO on the main flow.
  9. Do a proper warm-up before beginning to measure.
  10. Do all the appropriate OS tunings for low-latency systems that are way beyond this blog.  For example turn off C-States power management in the BIOS and watch out for RHEL 6 as it turns it back on without telling you!
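
To illustrate point 4, a minimal sketch of the Single Writer Principle: when only one thread ever mutates a value, it can publish with an ordered store and no contended atomic instruction or lock is needed.

    import java.util.concurrent.atomic.AtomicLong;

    // Sketch: a counter owned by exactly one writer thread. lazySet()
    // gives an ordered store without a LOCK-prefixed instruction.
    public final class SingleWriterCounter {
        private final AtomicLong value = new AtomicLong();

        // Must only ever be called from the single writer thread.
        public void increment() {
            value.lazySet(value.get() + 1);
        }

        // Safe to call from any reader thread.
        public long get() {
            return value.get();
        }
    }
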
It should be noted that we ran this on some state-of-the-art Intel CPUs with very large L3 caches.  It is possible to get 20-30MB L3 caches on a single socket these days.  It is very likely that our entire application was running out of L3 cache, with the exception of the message flow, which is very predictable.

Gil has added a cautionary note that while these results are very impressive we had a team focused on this issue with the appropriate skills to get the best out of the application.  It is not the usual case for every client to apply this level of focus.

What I’ve taken from this experience is the amazing things that can be achieved by truly agile companies, staffed by talented individuals, who are empowered to make things happen.  I love agile development but it has become a religion to some people who are more interested in following the “true” process than doing what is truly needed.  Both my-Channels and Azul have shown during this engagement what is possible in making s*#t happen.  It has been an absolute blast working with individuals who can assimilate information and ideas so fast, then turn them into working software.  For this I will embarrass Matt Buckton at my-Channels, and Gil Tene & Cliff Click at Azul who never failed in rising to a challenge.  So few organisations could have made so much progress over such a short time period.  If you think Java cannot cut it in the high performance space, then deal with one of these two companies, and you will be thinking again.  I bet a few months ago Matt never thought he’d be sitting in Singapore airport writing his first multi-producer lock-free queue when travelling home, and really enjoying it.

57 comments:

  1. No wonder you've been silent on the disruptor mailing list :-)

    By any chance, is the multi-producer lock-free queue based on the disruptor-based fair queue you implemented months ago?

    1. No they are all new. Recently, I put together a lock-free training course and created a bunch of new stuff from scratch. I'm giving the Java version for the first time in Belfast next month. The multi-producer side of the Disruptor has issues with slow/stalling producers and Mike has been working to address it. I've come up with a new algorithm that does not require the second CAS in the solution he added to the Disruptor. I've moved on from LMAX, and the purpose of the Disruptor is to promote them to help with recruitment.

  2. Impressive. I always wonder though: when dealing with such tiny timings and high throughput, isn't any profiler (even VTune or Solaris Studio) basically worthless due to the jitter it might introduce? I would assume reading the PCM registers would already skew results.

    1. Running a tool like VTune, or most profilers, will have an impact on performance even if it is just cache pollution. However they are very useful for finding where the time is spent in an algorithm.

      I tend to do the majority of my measurements without a profiler attached. I measure individual latencies, which can still cost a few tens of nanoseconds per sample, and build histograms. I also measure averages over long runs.

      MSRs are useful for giving feedback on well decoupled code fragments by reading at the beginning and end of a run. This does not require VTune. You can do this programmatically with some assembler, or scripted via "perf stat" or "rdmsr". On Linux you can read /dev/cpu/*/msr without needing assembler.

  3. Martin, I'm curious if you've looked at http://kaazing.com for your low-latency stack? Thoughts?

    1. I've not looked at Kaazing. For efficient delivery of data over HTTP you need a good COMET or Web Sockets implementation. In addition to my-Channels Nirvana, I would recommend looking at Resin from http://www.caucho.com/ who have a very good high-performance container.

  4. Nice work. Always a delight to hear about your work, Cliff Click and the guys at Azul.

    Are you planning to open src the shared mem IPC?

    Does writing to shared mem/NIO buffers guarantee visibility across threads - always? I had read otherwise (http://milek.blogspot.com/2010/12/linux-osync-and-write-barriers.html and http://stackoverflow.com/questions/7061910/when-does-an-o-sync-write-become-visible-in-the-pagecache-mmapd-file) while doing some experiments (http://javaforu.blogspot.com/2011/09/offloading-data-from-jvm-heap-little.html).

    1. Unfortunately, the Java memory model has not been sufficiently specified for shared memory. This was the main topic of my conversations with Doug Lea and Cliff Click referred to in the article. We came up with a solution that worked across the major JVMs but it is probably a bit low-level to describe in a blog comment.

  5. "we developed a new lock-free Executor to replace the standard Java implementation."

    Are you planning to open src the new Executor?

    1. The one for Nirvana had additional features over the standard Executor. I have been packaging another implementation so Doug Lea can try it with Fork-Join. I'll hopefully get some time over the next few weeks to do this (I'll be flying a lot!). The standard Executor and ExecutorService do not have ideal APIs for performance so my code needs a little migration. Watch this space!

    2. Can I ask you to share ideas or spike code before the library is published?

      With the help of Dmitry Vyukov, Michael Barker and Viktor Klang I managed to increase the throughput of Scalaz actors by ~3x, but the average latency for the case where actors ping each other did not decrease by as much:
      https://github.com/plokhotnyuk/actors/blob/master/out1.txt#L269

      While it can be ~4x lower for a home-brewed actor based on a dedicated thread:
      https://github.com/plokhotnyuk/actors/blob/6d00a98e5a1be03334aa5cd7bc5f801007d5edb9/out1.txt#L373

    3. I was having a beer with Michael Barker this evening as this arrived. We found the discussion on the concurrency interest list today interesting.

      I'm actively working on packaging the code for some new queues and executors as a library. The technique Mike pointed out was part of that. Give me a few weeks and I'll stick something out. It is possible to get much better average latency.

      As part of a larger discussion the whole mailbox actor problem can be better solved than with standard executors. One does not need an executor per mailbox but that is a larger discussion.

    4. Did open-sourcing the queue/executor here ever go anywhere? I checked concurrency-interest and your github repo without success.

    5. Martin, is this executor available now? I found the JDK's executors sometimes hiccup, which imposes tens of milliseconds of latency.

  6. Recently, Peter Lawrey (http://vanillajava.blogspot.fr/) wrote a similar Open Source implementation of a shared memory IPC based on MappedByteBuffer (https://github.com/peter-lawrey/Java-Chronicle, main class https://github.com/peter-lawrey/Java-Chronicle/blob/master/src/main/java/vanilla/java/chronicle/impl/IndexedChronicle.java).

    I don't know if his implementation is JVM-independent or at least works across the major JVMs (question asked here https://groups.google.com/forum/?fromgroups#!topic/java-chronicle/kwpQCiUfxXo).

  7. My thoughts after writing a similar library.

    3. I have a thread affinity library to help you control the layout of your threads. This can improve throughput and latency, and minimise jitter.

    7. I keep garbage generation to trivial levels, e.g. far less than one object per order.

    8. I got similar results for latency and throughput while writing to a memory-mapped file, so my conclusion is that you want to avoid blocking IO (or any system calls). Memory-mapped files have the advantage of being written in the background, but also of not being lost if the process dies.

    On Thierry's question, I have only tested OpenJDK/Oracle JDK, however I suspect portability issues are unlikely to be JVM-specific, but platform-specific, i.e. it doesn't use the JDK much at all.

    1. Otherwise, I agree with everything you have said. ;)

    2. Thanks for the feedback Peter.

      On portability I did find issues in the implementation of MappedByteBuffer for Azul Zing, JRockit Real-time, and both IBM JDKs. Azul quickly made their implementation consistent with Hotspot. Most of the issues I found related to the endianness of integers and whether they got written in a single MOV asm instruction vs. a number of MOVs with byte ops. There are workarounds that involve the use of Unsafe.

  8. Please vote for this improvement:
    http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7023898
    It should speed up a lot of concurrent algorithms & structures in the JVM.

    1. I think this would be better addressed with a new method Unsafe.fetchAndAdd() that is supported as an intrinsic and retrofitted to Atomic classes. Internally it can do the most suitable action on a given platform. This way it can also be applied to the Atomic arrays.

    2. In bug 7023898 it is stated that just the methods of the Atomic classes should be altered, and it seems that the new implementation would call some new Unsafe methods, as in the following code.

      For AtomicInteger:

      public final int getAndIncrement() {
          return unsafe.fetchAndAdd(this, valueOffset, 1);
      }

      Or for AtomicReference:

      public final V getAndSet(V newValue) {
          return unsafe.fetchAndSet(this, valueOffset, newValue);
      }

      P.S. It looks like a discussion of this proposal:
      https://blogs.oracle.com/dave/entry/atomic_fetch_and_add_vs

    3. I blogged about this subject last year.

      http://mechanical-sympathy.blogspot.co.uk/2011/09/adventures-with-atomiclong.html

  9. Hey Martin, shoot Cliff Click an email, cliffc@acm.org, I'd love to catch up with you.

    Cliff

  10. Hi Martin,
    How do you confirm your thread pool is 10X faster than the version in the JDK? Could you give some words describing your micro-benchmark on those two?

    Thanks in advance,
    Min

    1. I created a test whereby multiple threads generate a range of numbers and submit them as tasks at the maximum possible rate. These tasks were then picked up by multiple worker threads and summed, so the processing cost is very low and thus stresses the lock contention on the executor internals.

      These tests were run with 1-4 producing threads and 1-4 worker threads in the executor. Against a single threaded executor I got a 20X improvement, dropping to 10-12X with 4 producers and 4 consumers running on an 8 core system. From about 3 threads on the producer or consumer side the delta in performance stabilised at around an order of magnitude. The even bigger win is the difference in latency, which is much much lower and more consistent.

      In summary, under high-contention we can do at least an order of magnitude better, and under low-contention, which is more the common case, it can be significantly better than that.

      When I get sufficient time I plan to open-source some of the techniques behind this that I also teach on my lock-free algorithms course.
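
      The shape of such a test, as a minimal sketch (illustrative only, not the actual benchmark code):

      import java.util.concurrent.ExecutorService;
      import java.util.concurrent.Executors;
      import java.util.concurrent.TimeUnit;
      import java.util.concurrent.atomic.AtomicLong;

      // Sketch: trivially cheap tasks submitted flat out so the executor's
      // internal synchronisation dominates the measured cost.
      public final class ExecutorContentionTest {
          public static void main(final String[] args) throws Exception {
              final int producers = 4;
              final long tasksPerProducer = 1000000L;
              final ExecutorService executor = Executors.newFixedThreadPool(4);
              final AtomicLong sum = new AtomicLong();

              final long start = System.nanoTime();
              final Thread[] threads = new Thread[producers];
              for (int i = 0; i < producers; i++) {
                  threads[i] = new Thread(new Runnable() {
                      public void run() {
                          for (long n = 0; n < tasksPerProducer; n++) {
                              final long value = n;
                              executor.execute(new Runnable() {
                                  public void run() { sum.addAndGet(value); } // trivial work
                              });
                          }
                      }
                  });
                  threads[i].start();
              }
              for (final Thread t : threads) { t.join(); }
              executor.shutdown();
              executor.awaitTermination(1, TimeUnit.MINUTES);

              final long duration = System.nanoTime() - start;
              System.out.println("ops/sec = " + ((producers * tasksPerProducer * 1000000000L) / duration));
          }
      }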

    2. Thanks, Martin. Really looking forward to your executor getting open-sourced.

      Min

  11. Thanks for the great knowledge sharing.
    Just curious: when you wrote the similar algorithms in C and assembler, how much faster were they and how much less memory did they use?

    1. Memory footprint was the same because we are off-heap using memory-mapped files. Performance is 2-3X better with C/ASM because of low-level features like PAUSE and structure copying.

    2. Any possibility of seeing this shared?

      Thanks, love your blog.

  12. I'm curious about how your lock-free executor would compare to a ForkJoinPool. It's a bit apples to oranges, but still relevant.

    1. If I get time to package it up and open-source it soon we can compare :-) Life is proving so busy that I've not had time. The APIs are not quite the same and I need to rationalise.

    2. Thanks a lot, Martin!!!

      Waiting for your library with impatience!

      Will you share your version of the atomic objects that uses LOCK XADD and LOCK XCHG operations instead of a CAS loop?

    3. The JNI costs are not worth it from Java for XADD and XCHG. We need intrinsic support from the JVMs. For now these are only useful to native programs :-(

    4. Is it possible to fork and hack OpenJDK to prove that the game is worth the candle?

    5. I would like to cast one more vote for a request to see that multiple-consumer lock-free implementation published :)

  13. For the MappedByteBuffer IPC, did you by any chance have to use FileLock (maybe for specific regions of the buffer dedicated to reads and writes) in order to address the contention between producers and consumers?

    Curious to know the approach taken to address slow-consumer/fast-producer kinds of scenarios if locks are not used.

    Muthu

    1. No use of FileLock. I spent time discussing the approach with Doug Lea and Cliff Click to make sure we got the memory model semantics correct for visibility. Much longer conversation required on this one. Maybe I'll cover it in a later blog post.

  14. Thanks for the quick response

    Just started on a small prototype with FileLock... will hold that for now.

    Muthu

  15. How do you capture values at this sub-second level? Do you do that? We started using graphite/statsd to store and measure transactions and latency. Since we are only modeling, we want this data so that we can use statistical tools like 'R'.

    Mohan

    1. On the latest Linux kernels and JVMs it is possible to capture timestamps at ~20ns cost using System.nanoTime(), when binding threads to cores to avoid cache line misses. When dealing with a large number of samples it is best to use a Histogram class like the one I wrote for the Disruptor.

      http://code.google.com/p/disruptor/source/browse/trunk/code/src/main/com/lmax/disruptor/collections/Histogram.java

      When capturing a lot of samples it is useful to dump them into Excel for analysis.

      For timing across threads one needs to be careful of inter-socket NUMA effects and the cache misses resulting from updating the cache line containing the latest time stamp counter across cores.
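
      A minimal sketch of the bucketing idea (the Disruptor Histogram linked above is more complete):

      // Sketch: bucket each System.nanoTime() delta by power-of-two range
      // so millions of samples can be recorded with near-zero overhead.
      // Assumes non-negative samples; the extra bucket guards the sign bit.
      public final class PowerOfTwoHistogram {
          private final long[] buckets = new long[65];

          public void record(final long nanos) {
              buckets[64 - Long.numberOfLeadingZeros(nanos)]++;
          }

          public void print() {
              for (int i = 0; i < buckets.length; i++) {
                  if (buckets[i] != 0) {
                      System.out.println("< 2^" + i + "ns : " + buckets[i]);
                  }
              }
          }

          public static void main(final String[] args) {
              final PowerOfTwoHistogram histogram = new PowerOfTwoHistogram();
              for (int i = 0; i < 1000000; i++) {
                  final long start = System.nanoTime();
                  // operation under test goes here
                  histogram.record(System.nanoTime() - start);
              }
              histogram.print();
          }
      }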

  16. Re Azul, it claims to be really good for x86 but is it really any good for 64-bit Linux?

    1. Do you think there is an issue with 64-bit Linux?

  17. Can the executor in this article be accessed right now?

    1. This executor has been extended from the standard API and behavior set. A much faster standard executor will be appearing before the end of this year via some other work I am doing.

  18. Thanks. Love to see your blog and any update. In your experience, when using off-heap memory/memory-mapped files, is the memory copy into/out of the JVM worth the improvement in GC? I am implementing an in-memory cache (planned to run on a >16GB RAM box) by wrapping ConcurrentHashMap, but the current GC really hurts our latency. I want to move it out of the heap, but then I need to serialize and do memory copies. Any suggestions?

    1. Why do you have to copy? Can the off-heap model not be the primary model for the data?

  19. It is our primary model, but I need to do some processing of the content in the cache before I forward it to the network. The content can be thought of as a simple template; I need to replace the placeholders according to the request's information.

    1. This sounds more like a design constraint in your system rather than anything to do with whether the model is on or off heap. I would need to better understand your requirements before recommending a suitable solution.

    2. Thanks. Our requirement is that we have some web resources like HTML and CSS, and we would like to change the links (aka URL rewriting) in the content according to the user's request (for example, where it came from) to direct subsequent requests for those links to our geographically distributed storage clusters.

    3. Have you considered splitting up the templates into ByteBuffers and using the scattering/gathering APIs to fill in the dynamic bits? This way you reduce the copies and system calls.
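
      As a minimal sketch of the idea (the fragment contents are illustrative):

      import java.nio.ByteBuffer;
      import java.nio.channels.GatheringByteChannel;

      // Sketch: static template fragments plus a per-request fragment are
      // written with one gathering call, avoiding copies into one buffer.
      public final class GatheringWriteSketch {
          public static void writeRewritten(final GatheringByteChannel channel, final byte[] url) throws Exception {
              final ByteBuffer head = ByteBuffer.wrap("<a href=\"".getBytes("US-ASCII")); // static
              final ByteBuffer link = ByteBuffer.wrap(url);                                // dynamic
              final ByteBuffer tail = ByteBuffer.wrap("\">...</a>".getBytes("US-ASCII"));  // static
              final ByteBuffer[] parts = { head, link, tail };

              while (tail.hasRemaining()) {
                  channel.write(parts); // one call, no intermediate copy
              }
          }
      }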

    4. What I am considering is marking the positions which need to change in the ByteBuffer. For example: I have a ByteBuffer where positions 100, 200, 300 need another URL inserted, so I stop at each position when writing and write the URL. I am not sure if this is any different from the JDK's gathering? I found the JDK's gathering source code has more stuff than just "write".

      Another problem: is there a way to convert Java native memory into a DirectByteBuffer? I would like to use Unsafe to allocate memory and manage it myself (it might be stupid). I found the JDK uses a native path to handle IO by checking whether the ByteBuffer is direct or not. How could I get around that?

      Thanks very much for your time.

    5. Create a direct ByteBuffer, set the ordering to native, then use reflection to get the address, from there you can use Unsafe to work with this buffer.
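
      A minimal sketch of that recipe (it depends on JDK internals, so treat it as JVM-specific):

      import java.lang.reflect.Field;
      import java.nio.Buffer;
      import java.nio.ByteBuffer;
      import java.nio.ByteOrder;
      import sun.misc.Unsafe;

      // Sketch: recover the native address of a direct ByteBuffer so
      // Unsafe can read and write it via the private Buffer.address field.
      public final class DirectBufferAddress {
          public static void main(final String[] args) throws Exception {
              final ByteBuffer buffer = ByteBuffer.allocateDirect(4096).order(ByteOrder.nativeOrder());

              final Field addressField = Buffer.class.getDeclaredField("address");
              addressField.setAccessible(true);
              final long address = addressField.getLong(buffer);

              final Field theUnsafe = Unsafe.class.getDeclaredField("theUnsafe");
              theUnsafe.setAccessible(true);
              final Unsafe unsafe = (Unsafe)theUnsafe.get(null);

              unsafe.putLong(address, 42L);           // write via the raw address
              System.out.println(buffer.getLong(0));  // prints 42
          }
      }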

  20. Martin, thanks for all the knowledge sharing, I am a big fan of yours. Question: when you said you used the MappedByteBuffer for the IPC, how do you notify the consumer that more data has been added to the memory? For my application (the consumer is a C library), I had to develop a JNI layer on the producer and use a semaphore to notify the consumer. How do you do this in Java to communicate between two independent Java processes?

    1. You need to use lock-free algorithm techniques.
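
      As a minimal sketch (illustrative layout, not the actual transport): the producer bumps a sequence counter after writing each message, and the consumer spins on it rather than blocking on an OS semaphore.

      import java.nio.MappedByteBuffer;

      // Sketch: lock-free signalling over shared memory. The spin replaces
      // the semaphore; no JNI or system call is needed on the fast path.
      public final class SpinWaitSketch {
          private static final int SEQUENCE_OFFSET = 0; // illustrative layout

          public static void awaitSequence(final MappedByteBuffer buffer, final long expected) {
              while (buffer.getLong(SEQUENCE_OFFSET) < expected) {
                  Thread.yield(); // or pure busy-spin for the lowest latency
              }
          }
      }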

  21. Hello Martin,
    Thanks for the post.

    When you say that Nirvana can be used as a message delivery platform for the iPhone (I presume also other mobile platforms such as Android and Windows 8), what exactly does this mean in the context of push notifications on the iPhone, Android, and Windows 8 platforms? As far as I know/have read, for push notifications we have to use the respective cloud service from Apple, Google, or Microsoft. Does Nirvana seamlessly integrate with the vendors' push notification servers, i.e. customers do not need to install all 3 servers for 3 platforms but just one from Nirvana?

    Or is it not meant for push notifications? If not, in which scenario would someone use Nirvana on a mobile platform?

    1. I worked on scaling the server side of their messaging system. Best to contact them directly if you want more details on the client capabilities.

      http://www.my-channels.com/contact/
