Tuesday 9 August 2011

Inter Thread Latency

Message rates between threads are fundamentally determined by the latency of memory exchange between CPU cores.   The minimum unit of transfer will be a cache line exchanged via shared caches or socket interconnects.  In a previous article I explained Memory Barriers and why they are important to concurrent programming between threads.  These are the instructions that cause a CPU to make memory visible to other cores in an ordered and timely manner.

Lately I’ve been asked a lot about how much faster the Disruptor would be if C++ was used instead of Java.  For sure C++ would give more control for memory alignment and potential access to underlying CPU instructions such as memory barriers and lock instructions.  In this article I’ll directly compare C++ and Java to measure the cost of signalling a change between threads.

For the test we'll use two counters each updated by their own thread.  A simple ping-pong algorithm will be used to signal from one to the other and back again.  The exchange will be repeated millions of times to measure the average latency between cores.  This measurement will give us the latency of exchanging a cache line between cores in a serial manner.

For Java we’ll use volatile counters which the JVM will kindly insert a lock instruction for the update giving us an effective memory barrier.
public final class InterThreadLatency
    implements Runnable
{
    public static final long ITERATIONS = 500L * 1000L * 1000L;

    public static volatile long s1;
    public static volatile long s2;

    public static void main(final String[] args)
    {
        Thread t = new Thread(new InterThreadLatency());
        t.setDaemon(true);
        t.start();

        long start = System.nanoTime();

        long value = s1;
        while (s1 < ITERATIONS)
        {
            while (s2 != value)
            {
                // busy spin
            }
            value = ++s1;
        }

        long duration = System.nanoTime() - start;

        System.out.println("duration = " + duration);
        System.out.println("ns per op = " + duration / (ITERATIONS * 2));
        System.out.println("op/sec = " +  
            (ITERATIONS * 2L * 1000L * 1000L * 1000L) / duration);
        System.out.println("s1 = " + s1 + ", s2 = " + s2);
    }

    public void run()
    {
        long value = s2;
        while (true)
        {
            while (value == s1)
            {
                // busy spin
            }
            value = ++s2;
        }
    }
}

For C++ we’ll use the GNU Atomic Builtins which give us a similar lock instruction insertion to that which the JVM uses.
#include <time.h>
#include <pthread.h>
#include <stdio.h>

typedef unsigned long long uint64;
const uint64 ITERATIONS = 500LL * 1000LL * 1000LL;

volatile uint64 s1 = 0;
volatile uint64 s2 = 0;

void* run(void*)
{
    register uint64 value = s2;
    while (true)
    {
        while (value == s1)
        {
            // busy spin
        }
        value = __sync_add_and_fetch(&s2, 1);
    }
}

int main (int argc, char *argv[])
{
    pthread_t threads[1];
    pthread_create(&threads[0], NULL, run, NULL);

    timespec ts_start;
    timespec ts_finish;
    clock_gettime(CLOCK_MONOTONIC, &ts_start);

    register uint64 value = s1;
    while (s1 < ITERATIONS)
    {
        while (s2 != value)
        {
            // busy spin
        }
        value = __sync_add_and_fetch(&s1, 1);
    }

    clock_gettime(CLOCK_MONOTONIC, &ts_finish);

    uint64 start = (ts_start.tv_sec * 1000000000LL) + ts_start.tv_nsec;
    uint64 finish = (ts_finish.tv_sec * 1000000000LL) + ts_finish.tv_nsec;
    uint64 duration = finish - start;

    printf("duration = %lld\n", duration);
    printf("ns per op = %lld\n", (duration / (ITERATIONS * 2)));
    printf("op/sec = %lld\n",  
        ((ITERATIONS * 2L * 1000L * 1000L * 1000L) / duration));
    printf("s1 = %lld, s2 = %lld\n", s1, s2);

    return 0;
}
Results

$ taskset -c 2,4 /opt/jdk1.7.0/bin/java InterThreadLatency
duration = 50790271150
ns per op = 50
op/sec = 19,688,810
s1 = 500000000, s2 = 500000000

$ g++ -O3 -lpthread -lrt -o itl itl.cpp
$ taskset -c 2,4 ./itl
duration = 45087955393
ns per op = 45
op/sec = 22,178,872
s1 = 500000000, s2 = 500000000

The C++ version is slightly faster on my Intel Sandybridge laptop.  So what does this tell us?  Well, that the latency between 2 cores on a 2.2 GHz machine is ~45ns and that you can exchange 22m messages per second in a serial fashion.  On an Intel CPU this is fundamentally the cost of the lock instruction enforcing total order and forcing the store buffer and write combining buffers to drain, followed by the resulting cache coherency traffic between the cores.   Note that each core has a 96GB/s port onto the L3 cache ring bus, yet 22m * 64-bytes is only 1.4 GB/s.  This is because we have measured latency and not throughput.  We could easily fit some nice fat messages between those memory barriers as part of the exchange if the data has been written before the lock instruction was executed.

So what does this all mean for the Disruptor?  Basically, the latency of the Disruptor is about as low as we can get from Java.  It would be possible to get a ~10% latency improvement by moving to C++.  I’d expect a similar improvement in throughput for C++.  The main win with C++ would be the control, and therefore, the predictability that comes with it if used correctly.  The JVM gives us nice safety features like garbage collection in complex applications but we pay a little for that with the extra instructions it inserts that can be seen if you get Hotspot to dump the assembler instructions it is generating.

How does the Disruptor achieve more than 25m messages per second I hear you say???   Well that is one of the neat parts of its design.  The “waitFor” semantics on the SequenceBarrier enables a very efficient form of batching, which allows the BatchEventProcessor to process a series of events that occurred since it last checked in with the RingBuffer, all without incurring a memory barrier.  For real world applications this batching effect is really significant.  For micro benchmarks it only makes the results more random,  especially when there is little work done other than accepting the message.

Conclusion

So when processing events in series, the measurements tell us that the current generation of processors can do between 20-30 million exchanges per second at a latency less than 50ns.  The Disruptor design allows us to get greater throughput without explicit batching on the publisher side.  In addition the Disruptor has an explicit batching API on the publisher side that can give over 100 million messages per second.

36 comments:

  1. Absolutely brilliant article, its fascinating to be following the latest state-of-the-art advances in this field.

    ReplyDelete
  2. you do limit the process to two processors but shouldn't you also pin each thread to one processor?

    ReplyDelete
  3. Awesome. I'm not sure people think about this enough when designing multi-threaded code. I'm going to try this code out on our performance testing hosts to see what a slightly older host can do.

    ReplyDelete
  4. ipc,

    Yes it would be better to set the thread affinity for each thread. I've not for this simple test because the same cache line will ping pong and the instruction cache is so small in each case. For a more complex example you are exactly right and the threads should be pinned to cores.

    Martin...

    ReplyDelete
  5. Hi,

    I ported the Java disruptor to C++, and it's now available as open source (see http://www.2robots.com/2011/08/13/a-c-disruptor/). I haven't done extensive performance analysis though because (1) I don't have a 4-core box to benchmark on, and (2) I haven't ported the Java performance test framework to C++.

    I'd welcome someone else doing the performance testing with the C++ port. And, it's available now if people want to play with it in their C++ programs!

    ReplyDelete
  6. Hi,

    I ran these tests on our machine (an intel x5677, two quad-core processors). When pinning the threads to cores on the same socket, the difference between the java and the c++ version was relatively small, 34 vs 31 ns of latency. But when pinning the threads to cores on different sockets I got a huge difference - 227ns for the Java version and 80ns with the C++ one. I've run the tests several times, and pretty much same results every time. Any idea why there is such a big difference between java and C++ when communicating between different processors?

    Johan

    ReplyDelete
  7. Hi Johan,

    I've just ran the tests on an Intel X5650 which is the same architecture with a slightly slower clock speed. When measuring the inter-socket call I get a range of 87-239ns for Java, and 84-140ns for C++. I got the different figures by doing multiple runs and using different cores on each socket.

    The reason I used different cores from each socket is because I deliberately did not turn off IRQ load-balancing. I suspect what is going on here is the threads gets context switched and its cache gets polluted by whatever serviced the interrupt. This will be more expensive for Java to get its runtime hot in cache again.

    Also when going inter-socket the "Global Queue" in the un-core part is required to sequence the L2 cache misses and the QPI requests. Memory regions are owned by particular sockets for main memory. It may be the case that the Java runtime is being reloaded across the QPI link via the same global queues thus slowing it down.

    I'll post more about this at a later date but we need to pin the individual threads, which requires some native calls from Java, and ensure the OS threads and all interrupts go to different cores to get predictable results. When designing for soft real-time applications we need to carefully plan for where threads and memory will be located.

    Martin...

    ReplyDelete
  8. I saw some weird result: If I change the atomic inc into a regular ++:
    //value = __sync_add_and_fetch(&s2, 1);
    value = ++s2;
    //value = __sync_add_and_fetch(&s1, 1);
    value = ++s1;
    I actually saw lower ping pong latency. Any idea on how to how to explain this result?

    Here is my machine information:

    Model Name: iMac
    Model Identifier: iMac10,1
    Processor Name: Intel Core 2 Duo
    Processor Speed: 3.06 GHz
    Number Of Processors: 1
    Total Number Of Cores: 2
    L2 Cache: 3 MB
    Memory: 4 GB

    ReplyDelete
  9. Your code will be faster because you are not using a memory barrier the GNU Atomic Builtins give you. Please see my previous post for why they are important.

    http://mechanical-sympathy.blogspot.com/2011/07/memory-barriersfences.html

    Without a memory barrier you cannot guarantee what order you will see the results in, or if your program will just spin forever. For this particular example it will work on x86/x64 but it is highly likely not to work on other architectures.

    ReplyDelete
  10. Please read section 8.2 of the following document if implementing concurrent programs on Intel CPUs and trying to avoid the use of memory barriers. Be warned this is a very difficult topic to get right even for the best people in the world :-)

    http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-vol-3a-part-1-manual.html

    You additionally need to ensure your compiler does as you instruct with software memory barriers and use of registers.

    Much safer to stick with atomic primitives unless you have significant experience in this area.

    ReplyDelete
  11. Martin,

    Interesting that you get as low as 87ns for Java. I constantly get ~225ns for Java and 80ns for C++, regardless of which inter-socket pair I pin the threads to. I have all interrupts going to CPU0, and I also set CPU affinity for init (and its childs) to CPU0 so no other process should disturb the latency test, but that did not help.

    The possible explanations you mention seem reasonable based on your measurements, but my results don't seem to support them?

    ReplyDelete
  12. johanh (http://johanh.myopenid.com/) has left a new comment on your post "Inter Thread Latency":

    Martin,

    Interesting that you get as low as 87ns for Java. I constantly get ~225ns for Java and 80ns for C++, regardless of which inter-socket pair I pin the threads to. I have all interrupts going to CPU0, and I also set CPU affinity for init (and its childs) to CPU0 so no other process should disturb the latency test, but that did not help.

    The possible explanations you mention seem reasonable based on your measurements, but my results don't seem to support them?

    ===================================

    For some reason this email has got to me but not turned up on the blog.

    Sorry but I cannot explain the results. I can get just over 80ns from Java on this machine. The Intel specs say the QPI link will add ~20ns plus I expect a little time in the uncore, so over 80 or even a bit less seems perfectly reasonable.

    ReplyDelete
  13. Martin,

    May I ask what JVM (and version) you are running? I just tried running the InterThreadLatency test on JRockit, and I get much lower values - ~90ns, which is much more consistent when the results of the C++ version. Unfortunately my tests of the Disruptor are significantly slower on JRockit though...

    Thanks

    ReplyDelete
  14. I've been running the tests on Oracle 64-bit Hotspot Server JVM 1.6.0_25.

    You should try the new version 2.0 Disruptor which I've just released. It has significantly increased throughput.

    What are you seeing for JRockit results?

    ReplyDelete
  15. Ok, thanks. I am running a simple test with one producer-one consumer, measuring average latency. On Hotspot 1.6.0_25 and 1.7.0 I get an average of ~100ns and 99% < 128ns intra-socket, on JRockit I get an average of almost 200ns.

    (For some reason, I have a much larger difference between the Disruptor's latency and the results of InterThreadLatency (34ns on my machine) than you seem to get).

    Thanks

    ReplyDelete
  16. On a pretty aged laptop I managed to shave a few more nanoseconds with a pause op in the busy wait loop.

    134ns Java, 166ns default g++, 133ns -march=native g++, 130ns with _mm_pause.

    ReplyDelete
  17. Johan,

    I'm finding that as I make the Disruptor faster it also becomes more dependent on processor affinity. For me the results are significantly better when affinity is set, both for latency and throughput. Since standard Java has no means to do this I'm going to release a Executor that can set thread affinity for a future release of the Disruptor. This will obviously have to come with a small native library. taskset on the Java process is insufficient.

    The tiny programs used for the InterThreadLatency are not as sensitive to cache pollution. Remember a single cache miss can cost a few hundred CPU cycles that is usually ~70+ ns when going out to main memory even on the fastest machines. Can be well over 100ns for older memory systems. You could also have a bit of false sharing going on. Do you have access to a tool like Intel VTune?

    Martin...

    ReplyDelete
  18. Steven,

    This is a good spot. I did not do it because Java does not have the option. I wish more options where available to Java such as _asm{ pause } for putting in a busy spin loop. The problem without it is the processor does some speculative execution that is wrong and not discovered until the instruction is retired by which stage it can have polluted the store buffers. For those interested in what it does.

    http://software.intel.com/sites/products/documentation/studio/composer/en-us/2009/compiler_c/intref_cls/common/intref_sse2_pause.htm

    I also wish "lock add", and "lock xadd" were available to Java as much better alternatives to CAS as cmpxchg for dealing with sequence claiming on x64/x68. Sadly they are not :-(

    Martin...

    ReplyDelete
  19. Sun developers love CAS, the Solaris intrinsics for atomic addition, et al, on x64 are all implemented using CAS. I think the BSD's went the same route.

    ReplyDelete
  20. Johan,

    There has been some "false sharing" introduced to the Disruptor on the latest JVMs. I made a new release yesterday (2.0.2) that addresses this false sharing. You should notice throughput and latency are much improved again.

    Martin...

    ReplyDelete
  21. Thanks. I tried the new version (2.0.2), and got a got slight latency improvement on hotspot (from ~100 to ~95ns), and a larger improvement on jrockit (from ~200 to ~140ns). Using your new method for padding at some other places in another application improved performance a lot though, so I was quite happy for that :)

    I wrote some JNI code to pin the individual threads to specific cores, but that actually didn't have any noticeable effect on my machine.

    I have never used Intel VTune, but I guess I can use oprofile to measure cache misses. Or does VTune offer any additional fucntionality that would be useful here?

    ~95ns is much much lower than I had with my earlier queue-based design (also with spinning consumer threads), so I'm already very happy but it annoys me a bit that I don't get as low figures as you do...

    Thanks,
    Johan

    ReplyDelete
  22. Johan,

    I measured my latency on a Sandybridge processor and I think you are using Westmere? I'll set up a test on Westmere to see if there is much of a difference. I expect a bit of a difference because Sandybridge has a "L0" uop cache and lower L3 cache latency.

    I've also just released 2.5RC of the Disruptor which can drop latency a little more plus offer a richer feature set.

    Martin...

    ReplyDelete
  23. Johan,

    See section 3.3.2 of this VTune reference for details on memory issues.

    http://software.intel.com/file/8685

    ReplyDelete
  24. Thanks for the post!

    Something strange happened when I set the CPU affinity to a single core:

    $ time taskset -c 0 java -XX:+TieredCompilation -cp bin InterThreadLatency
    duration = 411120019
    ns per op = 10278000
    op/sec = 97
    s1 = 20, s2 = 19

    The program almost stalled, with less than a hundred ops per second.

    Adding another core restores performance:

    $ time taskset -c 0,1 java -XX:+TieredCompilation -cp bin InterThreadLatency
    duration = 6177088
    ns per op = 154427
    op/sec = 6475
    s1 = 20, s2 = 20

    The iterations are very few since it would take forever otherwise.

    Any idea why this happens? I figured that since both threads where forced to run on the same core the latency would drop, but the complete opposite happened.

    Rikard

    ReplyDelete
    Replies
    1. Having thought about for a few more minutes I realize that each operation will involve a context switch in order for the other thread to get som CPU time. The 97 ops/s is simply the context switch frequency a assume...

      Delete
    2. You are pretty much measuring the length for a time slice on your scheduler.

      For an interesting result try the same with 2 threads, each bound to a different hyperthread on the same core. Should see latency come down to ~10ns. Use http://linux.die.net/man/1/hwloc-ls to determine CPU layout.

      Delete
    3. Different cores:

      $ time taskset -c 0,1 java -XX:+TieredCompilation -cp bin InterThreadLatency
      duration = 13632.577075 ms
      ns per op = 68
      op/sec = 14670740
      s1 = 100000000, s2 = 100000000

      Same core with hyperthreading:

      $ time taskset -c 0,4 java -XX:+TieredCompilation -cp bin InterThreadLatency
      duration = 5258.237784 ms
      ns per op = 26
      op/sec = 38035556
      s1 = 100000000, s2 = 100000000

      Not quite ~10ns on my machine, but a lot lower. Interesting.

      I guess that the reason is that you don't have to pay the penalty for going all the way to L3 cache in order to fetch the updated value.

      I looked up the cache latencies of Sandy Bridge here: http://www.7-cpu.com/cpu/SandyBridge.html

      The 26ns I get when on the same core is approximately twice the L2 cache latency. Does it correspond to one write and one read perhaps?

      Delete
    4. Without being able to see your actual machine I'm not totally sure. The cost is probably the LOCK ADD assembly instruction inserted for the memory barrier. Try replacing the longs with AtomicLong and using the lazySet() method for the write.

      Delete
    5. Tested it. The different core case dropped to 66 ns per op and the same core test dropped to 22 ns per op.

      CPU is a Sandy Bridge i7 2760QM 2.4Ghz with turbo boost disabled.

      Why did you expect 10 ns?

      If you really want to dig deeper I can make a gist of the assembly.

      Delete
  25. brilliant. I got below results when running the java code on my pc

    duration = 45039344402
    ns per op = 45
    op/sec = 22202809
    s1 = 500000000, s2 = 500000000

    linux centos 6.3|32gram|asus maximus iv extreme z| Intel 2700K|SSD 128G

    ReplyDelete
  26. Hi Martin,

    This is really interesting.

    Just one question, witht the sample code, two CORE will be fully utlized while one thread is waiting for the result to return from another thread, kind of waste of energy and unfriendly to environment. In real application, messages or events arive at pulses. Currently it cost about 2 to 5 microseconds to wake up a thread in our solutions. Is there any way to reduce CPU usage during waiting for message and wake up fast when message arrives?

    Thanks

    ReplyDelete
    Replies
    1. When busying spinning a core is tied up until work is available. There are techniques to address such as using the Intel PAUSE instruction and backoff strategies that involve doing other work. I cover this sort of issue on my training course.

      However if you wish to operate in a low-latency environment then busy spinning is often the only option. I've seen the cost to wake up a thread often be up ~16us on some versions of Linux and even into the milliseconds when virtualised. On top of the wakeup cost you must also consider the cache pollution and thread migration that results.

      Delete
  27. I decided to try this against the 1.8 developer preview (1.8.0-ea-b117)

    I am running these on an i7 Sandy Bridge dual-core laptop clocked at 2.7Ghz running MacOS. For the C benchmark I get similar results to others:

    duration = 22200214976
    ns per op = 22
    op/sec = 45044608
    s1 = 500000000, s2 = 500000000

    However, for the Java benchmark there is a massive difference between java 1.7 and java 1.8. For Java 1.7 I get the following:

    duration = 41235846000
    ns per op = 41
    op/sec = 24250745
    s1 = 500000000, s2 = 500000000

    which is in line with what everyone else is getting.

    However in the developer preview I get this:

    duration = 19182637000
    ns per op = 19
    op/sec = 52130476
    s1 = 500000000, s2 = 500000000

    Which is actually faster than C. Does anyone know the reason?

    ReplyDelete
    Replies
    1. It is most likely that if you seeing figures below 35ns that the 2 threads are running as hyperthreads on the same core. This can be illustrated with task pinning on Linux or Windows. Unfortunately you have no such control on OSX.

      Delete