Thursday 22 March 2012

Fun with my-Channels Nirvana and Azul Zing

Since leaving LMAX I have been neglecting my blog a bit.  This is not because I have not been doing anything interesting.  Quite the opposite really, things have been so busy the blog has taken a back seat.  I’ve been consulting for a number of hedge funds and product companies, most of which are super secretive.

One company I have been spending quite a bit of time with is my-Channels, a messaging provider.  They are really cool and have given me their blessing to blog about some of the interesting things I’ve been working on for them.

For context, my-Channels are a messaging provider that specialise in delivering data to every device known to man over dodgy networks such as the Internet or your corporate WAN.  They can deliver live financial market data to your desktop, laptop at home, or your iPhone, at the fastest possible rates.  Lately, they have made the strategic move to enter the low-latency messaging space for the enterprise, and as part of this they have enlisted my services.  They want to go low-latency without giving up the rich functionality their product offers, which is giving me some interesting challenges.

Just how bad is the latency of such a product when new to the low-latency space?  I did not have high expectations because to be fair this was never their goal.  After some initial tests, I’m thinking these guys are not in bad shape.  They beat the crap out of most JMS implementations and it is going to be fun pushing them to the serious end of the low-latency space. 

OK enough of the basic tests, now it is time to get serious.  I worked with them to create appropriate load tests and get the profilers running.  No big surprises here: when we piled on the pressure, lock contention came out as the biggest culprit limiting both latency and throughput.  As we went down the list, lots of other interesting things showed up, but let’s follow good discipline and start at the top.

Good discipline for “Theory of Constraints” states that you always work on the most limiting factor, because when it is removed the list below it can change radically as new pressures are applied.  So to address this contention issue we developed a new lock-free Executor to replace the standard Java implementation.  Tests showed this new executor is ~10X better than what the JDK has to offer.  We integrated the new Executor into the code base, and the throughput bottleneck shifted completely.  The system can now cope with 16X more throughput, and the latency histogram has become much more compressed.  This is a good example of how macro-benchmarking is so much more valuable than micro-benchmarking.  Not a bad start, we are all thinking.
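
For flavour, a minimal sketch of the idea (illustrative only, not the actual my-Channels executor, which has many more features): replace the lock-guarded work queue inside a standard executor with a non-blocking queue that worker threads poll.

    import java.util.Queue;
    import java.util.concurrent.ConcurrentLinkedQueue;
    import java.util.concurrent.Executor;

    // Sketch only: tasks flow through the JDK's non-blocking
    // ConcurrentLinkedQueue so producers never block on a monitor.
    public final class LockFreeExecutorSketch implements Executor {
        private final Queue<Runnable> tasks = new ConcurrentLinkedQueue<Runnable>();
        private volatile boolean running = true;

        public LockFreeExecutorSketch(final int numWorkers) {
            for (int i = 0; i < numWorkers; i++) {
                Thread worker = new Thread(new Runnable() {
                    public void run() {
                        while (running) {
                            Runnable task = tasks.poll();
                            if (task != null) {
                                task.run();
                            } else {
                                Thread.yield(); // back off when idle
                            }
                        }
                    }
                });
                worker.setDaemon(true);
                worker.start();
            }
        }

        public void execute(final Runnable task) {
            tasks.offer(task); // lock-free enqueue, no contended monitor
        }

        public void shutdown() {
            running = false;
        }
    }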

Enter Azul Stage Left

We tested on all the major JVMs and the most predictable latency was achieved with Azul Zing.  Zing had by far the best latency profile with virtually no long tail.  For many of the tests it also had the greatest throughput.

After the lock contention on the Executor issue had been resolved, the next big bottleneck when load testing on the same machine was TCP between processes over the loopback adapter.  We discussed developing a new non-network transport for Nirvana.  For this we decided to apply a number of the techniques I teach on my lock-free concurrency course.  This resulted in a new IPC transport based on shared memory via memory-mapped files in Java.  We did inter-server testing using 10GigE networks, and had fun using the new Solarflare network adapters with OpenOnload, but for this article I’ll stick with the Java story.  I think Paul is still sore from me stuffing his little Draytek ADSL router with huge amounts of multicast traffic when the poor thing was connected to our 10GigE test LAN.  Sorry Paul!

Developing the IPC transport unearthed a number of challenges with various JVM implementations of MappedByteBuffer.  After some very useful chats with Cliff Click and Doug Lea we came up with a solution that worked across all JVMs.   This solution has a mean latency of ~100ns on the best JVMs and can do ~12-22 million messages per second throughput for 60-byte messages depending on the JVM.  This was the first time we had found a test whereby Azul was not close to being the fastest.   I isolated a test case and sent it to them on a Friday.  On Sunday evening I got an email from Gil Tene saying he had identified the issue and by Tuesday Cliff Click had a fix that we tried the next week.  When we tested the new Azul JVM, we saw over 40 million messages per second at latencies just over 100ns for our new IPC transport.  I had been teasing Azul that this must be possible in Java because I’d created similar algorithms in C and assembler that show what the x86_64 platform is capable of.
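
The transport itself is not public, but the bones of the technique can be sketched (illustrative only; this glosses over the cross-JVM memory-model subtleties just mentioned, and the offsets and names are made up): both processes map the same file, the producer writes the message and then publishes a sequence counter, and the consumer spins until the sequence advances.

    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    // Sketch of IPC over a memory-mapped file. Real code must pad to
    // cache lines and use ordered writes to be correct across JVMs.
    public final class SharedMemorySketch {
        private static final int SEQUENCE_OFFSET = 0;  // 8-byte publish counter
        private static final int MESSAGE_OFFSET = 64;  // payload on its own cache line

        public static MappedByteBuffer map(final String path, final int size) throws Exception {
            RandomAccessFile file = new RandomAccessFile(path, "rw");
            return file.getChannel().map(FileChannel.MapMode.READ_WRITE, 0, size);
        }

        // Producer: write the data first, then publish the sequence.
        public static void send(final MappedByteBuffer buffer, final long sequence, final long value) {
            buffer.putLong(MESSAGE_OFFSET, value);
            buffer.putLong(SEQUENCE_OFFSET, sequence);
        }

        // Consumer: busy-spin until the expected sequence is visible.
        public static long receive(final MappedByteBuffer buffer, final long expectedSequence) {
            while (buffer.getLong(SEQUENCE_OFFSET) < expectedSequence) {
                // spin
            }
            return buffer.getLong(MESSAGE_OFFSET);
        }
    }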

I’m starting to ramble but we had great fun removing latency through many parts of the stack.  When I get more time I will blog about some of the other findings.  It is still very much a work in progress, with advances happening daily on an amazing scale.  The guys at my-Channels are very conservative and do not want to publish actual figures until they have version 7.0 of Nirvana ready for GA, and have done more comprehensive testing.  For now they are happy with me being open about the following:
  • Throughput increased 32X due to the implementation of lock-free techniques and optimising the call stack for message handling to remove any shared dependencies.
  • Average latency decreased 20X from applying the same techniques and we have identified many more possible improvements.
  • We know the raw transport for IPC is now ~100ns and the worst case pause due to GC is 80µs with Azul Zing.  As to the latency for the double hop between a producer and consumer over IPC, via their broker, I’ll leave that to your imagination as somewhere between those figures until the guys are willing to make an official announcement.  As you can guess it is much much less than 80µs.
For me the big surprise was GC pauses only taking 80µs in the worst case.  I have seen OS scheduling alone result in more jitter.  I discussed this at length with Gil Tene from Azul, and even he was surprised.  He expects some worst case scenarios with their JVM to be 1-2ms for a well behaved application.  We then explored the my-Channels setup, and it turns out we have done everything almost perfectly to get the best out of a JVM, which is worth sharing.
  1. Do not use locks in the main transaction flow because they cause context switches, and therefore latency and unpredictable jitter.
  2. Never have more threads that need to run than you have cores available.
  3. Set affinity of threads to cores, or at least sockets, to avoid cache pollution by avoiding migration.  This is particularly important when on a server class machine having multiple sockets because of the NUMA effect.
  4. Ensure uncontested access to any resource, respecting the Single Writer Principle, so that the likes of biased locking can be your friend (see the sketch after this list).
  5. Keep call stacks reasonably small.  Still more work to do here.  If you are crazy enough to use Spring, then check out your call stacks to see what I mean!  The garbage collector has to walk them finding reachable objects.
  6. Do not use finalizers.
  7. Keep garbage generation to modest levels.  This applies to most JVMs but is likely not an issue for Zing.
  8. Ensure no disk IO on the main flow.
  9. Do a proper warm-up before beginning to measure.
  10. Do all the appropriate OS tunings for low-latency systems that are way beyond this blog.  For example turn off C-States power management in the BIOS and watch out for RHEL 6 as it turns it back on without telling you!
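
To illustrate point 4, a minimal sketch of the Single Writer Principle: when only one thread ever mutates a value, it can publish with an ordered store and no contended atomic instruction or lock is needed.

    import java.util.concurrent.atomic.AtomicLong;

    // Sketch: a counter owned by exactly one writer thread. lazySet()
    // gives an ordered store without a LOCK-prefixed instruction.
    public final class SingleWriterCounter {
        private final AtomicLong value = new AtomicLong();

        // Must only ever be called from the single writer thread.
        public void increment() {
            value.lazySet(value.get() + 1);
        }

        // Safe to call from any reader thread.
        public long get() {
            return value.get();
        }
    }
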
It should be noted that we ran this on some state-of-the-art Intel CPUs with very large L3 caches.  It is possible to get 20-30MB L3 caches on a single socket these days.  It is very likely that our entire application was running out of L3 cache, with the exception of the message flow, which is very predictable.

Gil has added a cautionary note that while these results are very impressive we had a team focused on this issue with the appropriate skills to get the best out of the application.  It is not the usual case for every client to apply this level of focus.

What I’ve taken from this experience is the amazing things that can be achieved by truly agile companies, staffed by talented individuals, who are empowered to make things happen.  I love agile development but it has become a religion to some people who are more interested in following the “true” process than doing what is truly needed.  Both my-Channels and Azul have shown during this engagement what is possible in making s*#t happen.  It has been an absolute blast working with individuals who can assimilate information and ideas so fast, then turn them into working software.  For this I will embarrass Matt Buckton at my-Channels, and Gil Tene & Cliff Click at Azul who never failed in rising to a challenge.  So few organisations could have made so much progress over such a short time period.  If you think Java cannot cut it in the high performance space, then deal with one of these two companies, and you will be thinking again.  I bet a few months ago Matt never thought he’d be sitting in Singapore airport writing his first multi-producer lock-free queue when travelling home, and really enjoying it.

57 comments:

  1. No wonder you've been silent on the disruptor mailing list :-)

    By any chance, is the multi-producer lock-free queue based on the disruptor-based fair queue you implemented months ago?

    1. No they are all new. Recently, I put together a lock-free training course and created a bunch of new stuff from scratch. I'm giving the Java version for the first time in Belfast next month. The multi-producer side of the Disruptor has issues with slow/stalling producers and Mike has been working to address it. I've come up with a new algorithm that does not require the second CAS in the solution he added to the Disruptor. I've moved on from LMAX, and the purpose of the Disruptor is to promote them to help with recruitment.

  2. Impressive. I always wonder though: when dealing with such tiny timings and high throughput, isn't any profiler (even VTune or Solaris Studio) basically worthless due to the jitter it might introduce? I would assume reading the PCM registers would already skew results.

    1. Running a tool like VTune, or most profilers, will have an impact on performance even if it is just cache pollution. However they are very useful for finding where the time is spent in an algorithm.

      I tend to do the majority of my measurements without a profiler attached. I measure individual latencies, which can still cost a few tens of nanoseconds per sample, and build histograms. I also measure averages over long runs.

      MSRs are useful for giving feedback on well decoupled code fragments by reading at the beginning and end of a run. This does not require VTune. You can do this programmatically with some assembler, or scripted via "perf stat" or "rdmsr". On Linux you can read /dev/cpu/*/msr without needing assembler.

  3. Martin, I'm curious if you've looked at http://kaazing.com for your low-latency stack? Thoughts?

    1. I've not looked at Kaazing. For efficient delivery of data over HTTP you need a good COMET or Web Sockets implementation. In addition to my-Channels Nirvana, I would recommend looking at Resin from http://www.caucho.com/ who have a very good high-performance container.

  4. Nice work. Always a delight to hear about your work, Cliff Click and the guys at Azul.

    Are you planning to open src the shared mem IPC?

    Does writing to shared mem/NIO buffers guarantee visibility across threads - always? I had read otherwise (http://milek.blogspot.com/2010/12/linux-osync-and-write-barriers.html and http://stackoverflow.com/questions/7061910/when-does-an-o-sync-write-become-visible-in-the-pagecache-mmapd-file) while doing some experiments (http://javaforu.blogspot.com/2011/09/offloading-data-from-jvm-heap-little.html).

    1. Unfortunately, the Java memory model has not been sufficiently specified for shared memory. This was the main topic of my conversations with Doug Lea and Cliff Click referred to in the article. We came up with a solution that worked across the major JVMs but it is probably a bit low-level to describe in a blog comment.

  5. "we developed a new lock-free Executor to replace the standard Java implementation."

    Are you planning to open src the new Executor?

    1. The one for Nirvana had additional features over the standard Executor. I have been packaging another implementation so Doug Lea can try it with Fork-Join. I'll hopefully get some time over the next few weeks to do this (I'll be flying a lot!). The standard Executor and ExecutorService do not have ideal APIs for performance so my code needs a little migration. Watch this space!

    2. Can I ask you to share ideas or spike code before the library is published?

      With the help of Dmitry Vyukov, Michael Barker and Viktor Klang I managed to increase the throughput of Scalaz actors by ~3x, but the average latency for the case where actors ping each other did not decrease by as much:
      https://github.com/plokhotnyuk/actors/blob/master/out1.txt#L269

      While it can be ~4x lower for a home-brewed actor based on a dedicated thread:
      https://github.com/plokhotnyuk/actors/blob/6d00a98e5a1be03334aa5cd7bc5f801007d5edb9/out1.txt#L373

    3. I was having a beer with Michael Barker this evening as this arrived. We found the discussion on the concurrency interest list today interesting.

      I'm actively working on packaging the code for some new queues and executors as a library. The technique Mike pointed out was part of that. Give me a few weeks and I'll stick something out. It is possible to get much better average latency.

      As part of a larger discussion the whole mailbox actor problem can be better solved than with standard executors. One does not need an executor per mailbox but that is a larger discussion.

    4. Did open-sourcing the queue/executor here ever go anywhere? I checked concurrency-interest and your github repo without success.

    5. Martin, is this executor available now? I found the JDK's executors sometimes hiccup, which imposes tens of milliseconds of latency.

  6. Recently, Peter Lawrey (http://vanillajava.blogspot.fr/) wrote a similar Open Source implementation of a shared memory IPC based on MappedByteBuffer (https://github.com/peter-lawrey/Java-Chronicle, main class https://github.com/peter-lawrey/Java-Chronicle/blob/master/src/main/java/vanilla/java/chronicle/impl/IndexedChronicle.java).

    I don't know if his implementation is JVM-independent or at least works across the major JVMs (question asked here https://groups.google.com/forum/?fromgroups#!topic/java-chronicle/kwpQCiUfxXo).

  7. My thoughts after writing a similar library.

    3. I have a thread affinity library to help you control the layout of your threads. This can improve throughput and latency, and minimise jitter.

    7. I keep garbage generation to trivial levels, e.g. far less than one object per order.

    8. I got similar results for latency and throughput while writing to a memory-mapped file, so my conclusion is that you want to avoid blocking IO (or any system calls). Memory-mapped files have the advantage of being written in the background, but also of not being lost if the process dies.

    On Thierry's question, I have only tested OpenJDK/Oracle JDK, however I suspect portability issues are unlikely to be JVM-specific, but platform-specific, i.e. it doesn't use the JDK much at all.

    1. Otherwise, I agree with everything you have said. ;)

    2. Thanks for the feedback Peter.

      On portability I did find issues in the implementation of MappedByteBuffer for Azul Zing, JRockit Real-time, and both IBM JDKs. Azul quickly made their implementation consistent with Hotspot. Most of the issues I found related to the endianness of integers and whether they got written in a single MOV asm instruction vs. a number of MOVs with byte ops. There are workarounds that involve the use of Unsafe.

  8. Please vote for this improvement:
    http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7023898
    It should speed up a lot of concurrent algorithms & structures in the JVM.

    1. I think this would be better addressed with a new method Unsafe.fetchAndAdd() that is supported as an intrinsic and retrofitted to Atomic classes. Internally it can do the most suitable action on a given platform. This way it can also be applied to the Atomic arrays.

    2. In bug 7023898 it is stated that just the methods of the Atomic classes should be altered, and it seems that the new implementation would call some new Unsafe methods, as in the following code.

      For AtomicInteger:

      public final int getAndIncrement() {
          return unsafe.fetchAndAdd(this, valueOffset, 1);
      }

      Or for AtomicReference:

      public final V getAndSet(V newValue) {
          return unsafe.fetchAndSet(this, valueOffset, newValue);
      }

      P.S. It looks like a discussion of this proposal:
      https://blogs.oracle.com/dave/entry/atomic_fetch_and_add_vs

    3. I blogged about this subject last year.

      http://mechanical-sympathy.blogspot.co.uk/2011/09/adventures-with-atomiclong.html

  9. Hey Martin, shoot Cliff Click an email, cliffc@acm.org, I'd love to catch up with you.

    Cliff

  10. Hi Martin,
    How do you confirm your thread pool is 10X faster than the version in the JDK? Could you give some words describing your micro-benchmark on those two?

    Thanks in advance,
    Min

    1. I created a test whereby multiple threads generate a range of numbers and submit them as tasks at the maximum possible rate. These tasks were then picked up by multiple worker threads and summed, so the processing cost is very low and thus stresses the lock contention on the executor internals.

      These tests were run with 1-4 producing threads and 1-4 worker threads in the executor. Against a single threaded executor I got a 20X improvement, dropping to 10-12X with 4 producers and 4 consumers running on an 8 core system. From about 3 threads on the producer or consumer side the delta in performance stabilised at around an order of magnitude. The even bigger win is the difference in latency, which is much much lower and more consistent.

      In summary, under high-contention we can do at least an order of magnitude better, and under low-contention, which is more the common case, it can be significantly better than that.

      When I get sufficient time I plan to open-source some of the techniques behind this that I also teach on my lock-free algorithms course.
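
      The shape of such a test, as a minimal sketch (illustrative only, not the actual benchmark code):

      import java.util.concurrent.ExecutorService;
      import java.util.concurrent.Executors;
      import java.util.concurrent.TimeUnit;
      import java.util.concurrent.atomic.AtomicLong;

      // Sketch: trivially cheap tasks submitted flat out so the executor's
      // internal synchronisation dominates the measured cost.
      public final class ExecutorContentionTest {
          public static void main(final String[] args) throws Exception {
              final int producers = 4;
              final long tasksPerProducer = 1000000L;
              final ExecutorService executor = Executors.newFixedThreadPool(4);
              final AtomicLong sum = new AtomicLong();

              final long start = System.nanoTime();
              final Thread[] threads = new Thread[producers];
              for (int i = 0; i < producers; i++) {
                  threads[i] = new Thread(new Runnable() {
                      public void run() {
                          for (long n = 0; n < tasksPerProducer; n++) {
                              final long value = n;
                              executor.execute(new Runnable() {
                                  public void run() { sum.addAndGet(value); } // trivial work
                              });
                          }
                      }
                  });
                  threads[i].start();
              }
              for (final Thread t : threads) { t.join(); }
              executor.shutdown();
              executor.awaitTermination(1, TimeUnit.MINUTES);

              final long duration = System.nanoTime() - start;
              System.out.println("ops/sec = " + ((producers * tasksPerProducer * 1000000000L) / duration));
          }
      }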

    2. Thanks, Martin. Really looking forward to your executor getting open-sourced.

      Min

  11. Thanks for the great knowledge sharing.
    Just curious: when you wrote the similar algorithms in C and assembler, how much faster were they and how much less memory did they use?

    1. Memory footprint was the same because we are off-heap using memory-mapped files. Performance is 2-3X better with C/ASM because of low-level features like PAUSE and structure copying.

    2. Any possibility of seeing this shared?

      Thanks, love your blog.

  12. I'm curious about how your lock-free executor would compare to a ForkJoinPool. It's a bit apples to oranges, but still relevant.

    1. If I get time to package it up and open-source it soon we can compare :-) Life is proving so busy that I've not had time. The APIs are not quite the same and I need to rationalise.

    2. Thanks a lot, Martin!!!

      Waiting for your library with impatience!

      Will you share your version of the atomic objects that uses LOCK XADD and LOCK XCHG operations instead of a CAS loop?

    3. The JNI costs are not worth it from Java for XADD and XCHG. We need intrinsic support from the JVMs. For now these are only useful to native programs :-(

    4. Is it possible to fork and hack OpenJDK to prove that the game is worth the candle?

    5. I would like to cast one more vote for a request to see that multiple-consumer lock-free implementation published :)

  13. For the MappedByteBuffer IPC, did you by any chance have to use FileLock (maybe for specific regions of the buffer dedicated to reads and writes) in order to address the contention between producers and consumers?

    Curious to know the approach taken to address slow-consumer/fast-producer kinds of scenarios if locks are not used.

    Muthu

    1. No use of FileLock. I spent time discussing the approach with Doug Lea and Cliff Click to make sure we got the memory model semantics correct for visibility. Much longer conversation required on this one. Maybe I'll cover it in a later blog post.

  14. Thanks for the quick response

    Just started on a small prototype with FileLock... will hold that for now.

    Muthu

  15. How do you capture values at this sub-second level? Do you do that? We started using graphite/statsd to store and measure transactions and latency. Since we are only modeling, we want this data so that we can use statistical tools like 'R'.

    Mohan

    1. On the latest Linux kernels and JVMs it is possible to capture timestamps at ~20ns cost using System.nanoTime(), when binding threads to cores to avoid cache line misses. When dealing with a large number of samples it is best to use a Histogram class like the one I wrote for the Disruptor.

      http://code.google.com/p/disruptor/source/browse/trunk/code/src/main/com/lmax/disruptor/collections/Histogram.java

      When capturing a lot of samples it is useful to dump them into Excel for analysis.

      For timing across threads one needs to be careful of inter-socket NUMA effects and the cache misses resulting from updating the cache line containing the latest time stamp counter across cores.
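
      A minimal sketch of the bucketing idea (the Disruptor Histogram linked above is more complete):

      // Sketch: bucket each System.nanoTime() delta by power-of-two range
      // so millions of samples can be recorded with near-zero overhead.
      // Assumes non-negative samples; the extra bucket guards the sign bit.
      public final class PowerOfTwoHistogram {
          private final long[] buckets = new long[65];

          public void record(final long nanos) {
              buckets[64 - Long.numberOfLeadingZeros(nanos)]++;
          }

          public void print() {
              for (int i = 0; i < buckets.length; i++) {
                  if (buckets[i] != 0) {
                      System.out.println("< 2^" + i + "ns : " + buckets[i]);
                  }
              }
          }

          public static void main(final String[] args) {
              final PowerOfTwoHistogram histogram = new PowerOfTwoHistogram();
              for (int i = 0; i < 1000000; i++) {
                  final long start = System.nanoTime();
                  // operation under test goes here
                  histogram.record(System.nanoTime() - start);
              }
              histogram.print();
          }
      }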

  16. Re Azul, it claims to be really good for x86 but is it really any good for 64-bit Linux?

    1. Do you think there is an issue with 64-bit Linux?

  17. Can the executor in this article be accessed right now?

    1. This executor has been extended from the standard API and behavior set. A much faster standard executor will be appearing before the end of this year via some other work I am doing.

  18. Thanks. Love to see your blog and any update. In your experience, when using off-heap memory/memory-mapped files, is the memory copy into/out of the JVM worth the improvement in GC? I am implementing an in-memory cache (planned to run on a >16GB RAM box) by wrapping ConcurrentHashMap, but the current GC really hurts our latency. I want to move it out of the heap, but then I need to serialize and do memory copies. Any suggestions?

    1. Why do you have to copy? Can the off-heap model not be the primary model for the data?

  19. It is our primary model, but I need to do some processing of the content in the cache before I forward it to the network. The content can be thought of as a simple template; I need to replace the placeholders according to the request's information.

    1. This sounds more like a design constraint in your system rather than anything to do with whether the model is on or off heap. I would need to better understand your requirements before recommending a suitable solution.

    2. Thanks. Our requirement is that we have some web resources like HTML and CSS, and we would like to change the links (aka URL rewriting) in the content according to the user's request (for example, where it came from) to direct subsequent requests for those links to our geographically distributed storage clusters.

    3. Have you considered splitting up the templates into ByteBuffers and using the scattering/gathering APIs to fill in the dynamic bits? This way you reduce the copies and system calls.
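
      As a minimal sketch of the idea (the fragment contents are illustrative):

      import java.nio.ByteBuffer;
      import java.nio.channels.GatheringByteChannel;

      // Sketch: static template fragments plus a per-request fragment are
      // written with one gathering call, avoiding copies into one buffer.
      public final class GatheringWriteSketch {
          public static void writeRewritten(final GatheringByteChannel channel, final byte[] url) throws Exception {
              final ByteBuffer head = ByteBuffer.wrap("<a href=\"".getBytes("US-ASCII")); // static
              final ByteBuffer link = ByteBuffer.wrap(url);                                // dynamic
              final ByteBuffer tail = ByteBuffer.wrap("\">...</a>".getBytes("US-ASCII"));  // static
              final ByteBuffer[] parts = { head, link, tail };

              while (tail.hasRemaining()) {
                  channel.write(parts); // one call, no intermediate copy
              }
          }
      }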

    4. What I am considering is marking the positions which need to change in the ByteBuffer. For example: I have a ByteBuffer where positions 100, 200, 300 need another URL inserted, so I stop at each position when writing and write the URL. I am not sure if this is any different from the JDK's gathering? I found the JDK's gathering source code has more stuff than just "write".

      Another problem: is there a way to convert Java native memory into a DirectByteBuffer? I would like to use Unsafe to allocate memory and manage it myself (it might be stupid). I found the JDK uses a native path to handle IO by checking whether the ByteBuffer is direct or not. How could I get around that?

      Thanks very much for your time.

    5. Create a direct ByteBuffer, set the ordering to native, then use reflection to get the address, from there you can use Unsafe to work with this buffer.
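
      A minimal sketch of that recipe (it depends on JDK internals, so treat it as JVM-specific):

      import java.lang.reflect.Field;
      import java.nio.Buffer;
      import java.nio.ByteBuffer;
      import java.nio.ByteOrder;
      import sun.misc.Unsafe;

      // Sketch: recover the native address of a direct ByteBuffer so
      // Unsafe can read and write it via the private Buffer.address field.
      public final class DirectBufferAddress {
          public static void main(final String[] args) throws Exception {
              final ByteBuffer buffer = ByteBuffer.allocateDirect(4096).order(ByteOrder.nativeOrder());

              final Field addressField = Buffer.class.getDeclaredField("address");
              addressField.setAccessible(true);
              final long address = addressField.getLong(buffer);

              final Field theUnsafe = Unsafe.class.getDeclaredField("theUnsafe");
              theUnsafe.setAccessible(true);
              final Unsafe unsafe = (Unsafe)theUnsafe.get(null);

              unsafe.putLong(address, 42L);           // write via the raw address
              System.out.println(buffer.getLong(0));  // prints 42
          }
      }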

  20. Martin, thanks for all the knowledge sharing, I am a big fan of yours. Question: when you said you used the MappedByteBuffer for the IPC, how do you notify the consumer that more data has been added to the memory? For my application (the consumer is a C library), I had to develop a JNI layer on the producer and use a semaphore to notify the consumer. How do you do this in Java to communicate between two independent Java processes?

    1. You need to use lock-free algorithm techniques.
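
      As a minimal sketch (illustrative layout, not the actual transport): the producer bumps a sequence counter after writing each message, and the consumer spins on it rather than blocking on an OS semaphore.

      import java.nio.MappedByteBuffer;

      // Sketch: lock-free signalling over shared memory. The spin replaces
      // the semaphore; no JNI or system call is needed on the fast path.
      public final class SpinWaitSketch {
          private static final int SEQUENCE_OFFSET = 0; // illustrative layout

          public static void awaitSequence(final MappedByteBuffer buffer, final long expected) {
              while (buffer.getLong(SEQUENCE_OFFSET) < expected) {
                  Thread.yield(); // or pure busy-spin for the lowest latency
              }
          }
      }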

  21. Hello Martin,
    Thanks for the post.

    When you say that Nirvana can be used as a message delivery platform for the iPhone (I presume also other mobile platforms such as Android and Windows 8), what exactly does this mean in the context of push notifications on the iPhone, Android, and Windows 8 platforms? As far as I know/have read, for push notifications we have to use the respective cloud service from Apple, Google, or Microsoft. Does Nirvana seamlessly integrate with the vendors' push notification servers, i.e. customers do not need to install all 3 servers for 3 platforms but just one from Nirvana?

    Or is it not meant for push notifications? If not, in which scenario would someone use Nirvana on a mobile platform?

    1. I worked on scaling the server side of their messaging system. Best to contact them directly if you want more details on the client capabilities.

      http://www.my-channels.com/contact/
