Mechanical Sympathy: Inter Thread Latency

Tuesday, 9 August 2011

Inter Thread Latency

Message rates between threads are fundamentally determined by the latency of memory exchange between CPU cores. The minimum unit of transfer will be a cache line exchanged via shared caches or socket interconnects. In a previous article I explained Memory Barriers and why they are important to concurrent programming between threads. These are the instructions that cause a CPU to make memory visible to other cores in an ordered and timely manner.

Lately I’ve been asked a lot about how much faster the Disruptor would be if C++ was used instead of Java. For sure C++ would give more control for memory alignment and potential access to underlying CPU instructions such as memory barriers and lock instructions. In this article I’ll directly compare C++ and Java to measure the cost of signalling a change between threads.

For the test we'll use two counters each updated by their own thread. A simple ping-pong algorithm will be used to signal from one to the other and back again. The exchange will be repeated millions of times to measure the average latency between cores. This measurement will give us the latency of exchanging a cache line between cores in a serial manner.

For Java we’ll use volatile counters which the JVM will kindly insert a lock instruction for the update giving us an effective memory barrier.

public final class InterThreadLatency
    implements Runnable
{
    public static final long ITERATIONS = 500L * 1000L * 1000L;

    public static volatile long s1;
    public static volatile long s2;

    public static void main(final String[] args)
    {
        Thread t = new Thread(new InterThreadLatency());
        t.setDaemon(true);
        t.start();

        long start = System.nanoTime();

        long value = s1;
        while (s1 < ITERATIONS)
        {
            while (s2 != value)
            {
                // busy spin
            }
            value = ++s1;
        }

        long duration = System.nanoTime() - start;

        System.out.println("duration = " + duration);
        System.out.println("ns per op = " + duration / (ITERATIONS * 2));
        System.out.println("op/sec = " +  
            (ITERATIONS * 2L * 1000L * 1000L * 1000L) / duration);
        System.out.println("s1 = " + s1 + ", s2 = " + s2);
    }

    public void run()
    {
        long value = s2;
        while (true)
        {
            while (value == s1)
            {
                // busy spin
            }
            value = ++s2;
        }
    }
}

For C++ we’ll use the GNU Atomic Builtins which give us a similar lock instruction insertion to that which the JVM uses.

#include <time.h>
#include <pthread.h>
#include <stdio.h>

typedef unsigned long long uint64;
const uint64 ITERATIONS = 500LL * 1000LL * 1000LL;

volatile uint64 s1 = 0;
volatile uint64 s2 = 0;

void* run(void*)
{
    register uint64 value = s2;
    while (true)
    {
        while (value == s1)
        {
            // busy spin
        }
        value = __sync_add_and_fetch(&s2, 1);
    }
}

int main (int argc, char *argv[])
{
    pthread_t threads[1];
    pthread_create(&threads[0], NULL, run, NULL);

    timespec ts_start;
    timespec ts_finish;
    clock_gettime(CLOCK_MONOTONIC, &ts_start);

    register uint64 value = s1;
    while (s1 < ITERATIONS)
    {
        while (s2 != value)
        {
            // busy spin
        }
        value = __sync_add_and_fetch(&s1, 1);
    }

    clock_gettime(CLOCK_MONOTONIC, &ts_finish);

    uint64 start = (ts_start.tv_sec * 1000000000LL) + ts_start.tv_nsec;
    uint64 finish = (ts_finish.tv_sec * 1000000000LL) + ts_finish.tv_nsec;
    uint64 duration = finish - start;

    printf("duration = %lld\n", duration);
    printf("ns per op = %lld\n", (duration / (ITERATIONS * 2)));
    printf("op/sec = %lld\n",  
        ((ITERATIONS * 2L * 1000L * 1000L * 1000L) / duration));
    printf("s1 = %lld, s2 = %lld\n", s1, s2);

    return 0;
}

Results

$ taskset -c 2,4 /opt/jdk1.7.0/bin/java InterThreadLatency
duration = 50790271150
ns per op = 50
op/sec = 19,688,810
s1 = 500000000, s2 = 500000000

$ g++ -O3 -lpthread -lrt -o itl itl.cpp
$ taskset -c 2,4 ./itl
duration = 45087955393
ns per op = 45
op/sec = 22,178,872
s1 = 500000000, s2 = 500000000

The C++ version is slightly faster on my Intel Sandybridge laptop. So what does this tell us? Well, that the latency between 2 cores on a 2.2 GHz machine is ~45ns and that you can exchange 22m messages per second in a serial fashion. On an Intel CPU this is fundamentally the cost of the lock instruction enforcing total order and forcing the store buffer and write combining buffers to drain, followed by the resulting cache coherency traffic between the cores. Note that each core has a 96GB/s port onto the L3 cache ring bus, yet 22m * 64-bytes is only 1.4 GB/s. This is because we have measured latency and not throughput. We could easily fit some nice fat messages between those memory barriers as part of the exchange if the data has been written before the lock instruction was executed.

So what does this all mean for the Disruptor? Basically, the latency of the Disruptor is about as low as we can get from Java. It would be possible to get a ~10% latency improvement by moving to C++. I’d expect a similar improvement in throughput for C++. The main win with C++ would be the control, and therefore, the predictability that comes with it if used correctly. The JVM gives us nice safety features like garbage collection in complex applications but we pay a little for that with the extra instructions it inserts that can be seen if you get Hotspot to dump the assembler instructions it is generating.

How does the Disruptor achieve more than 25m messages per second I hear you say??? Well that is one of the neat parts of its design. The “waitFor” semantics on the SequenceBarrier enables a very efficient form of batching, which allows the BatchEventProcessor to process a series of events that occurred since it last checked in with the RingBuffer, all without incurring a memory barrier. For real world applications this batching effect is really significant. For micro benchmarks it only makes the results more random, especially when there is little work done other than accepting the message.

Conclusion

So when processing events in series, the measurements tell us that the current generation of processors can do between 20-30 million exchanges per second at a latency less than 50ns. The Disruptor design allows us to get greater throughput without explicit batching on the publisher side. In addition the Disruptor has an explicit batching API on the publisher side that can give over 100 million messages per second.

36 comments:

Shane Tolmie10 August 2011 at 10:38
Absolutely brilliant article, its fascinating to be following the latest state-of-the-art advances in this field.
ReplyDelete
Replies
ipc10 August 2011 at 14:16
you do limit the process to two processors but shouldn't you also pin each thread to one processor?
ReplyDelete
Replies
Geoff Flarity10 August 2011 at 15:05
Awesome. I'm not sure people think about this enough when designing multi-threaded code. I'm going to try this code out on our performance testing hosts to see what a slightly older host can do.
ReplyDelete
Replies
Martin Thompson10 August 2011 at 22:17
ipc,

Yes it would be better to set the thread affinity for each thread. I've not for this simple test because the same cache line will ping pong and the instruction cache is so small in each case. For a more complex example you are exactly right and the threads should be pinned to cores.

Martin...
ReplyDelete
Replies
RobotTwo14 August 2011 at 03:13
Hi,

I ported the Java disruptor to C++, and it's now available as open source (see http://www.2robots.com/2011/08/13/a-c-disruptor/). I haven't done extensive performance analysis though because (1) I don't have a 4-core box to benchmark on, and (2) I haven't ported the Java performance test framework to C++.

I'd welcome someone else doing the performance testing with the C++ port. And, it's available now if people want to play with it in their C++ programs!
ReplyDelete
Replies
Anonymous15 August 2011 at 15:36
Hi,

I ran these tests on our machine (an intel x5677, two quad-core processors). When pinning the threads to cores on the same socket, the difference between the java and the c++ version was relatively small, 34 vs 31 ns of latency. But when pinning the threads to cores on different sockets I got a huge difference - 227ns for the Java version and 80ns with the C++ one. I've run the tests several times, and pretty much same results every time. Any idea why there is such a big difference between java and C++ when communicating between different processors?

Johan
ReplyDelete
Replies
Martin Thompson15 August 2011 at 18:00
Hi Johan,

I've just ran the tests on an Intel X5650 which is the same architecture with a slightly slower clock speed. When measuring the inter-socket call I get a range of 87-239ns for Java, and 84-140ns for C++. I got the different figures by doing multiple runs and using different cores on each socket.

The reason I used different cores from each socket is because I deliberately did not turn off IRQ load-balancing. I suspect what is going on here is the threads gets context switched and its cache gets polluted by whatever serviced the interrupt. This will be more expensive for Java to get its runtime hot in cache again.

Also when going inter-socket the "Global Queue" in the un-core part is required to sequence the L2 cache misses and the QPI requests. Memory regions are owned by particular sockets for main memory. It may be the case that the Java runtime is being reloaded across the QPI link via the same global queues thus slowing it down.

I'll post more about this at a later date but we need to pin the individual threads, which requires some native calls from Java, and ensure the OS threads and all interrupts go to different cores to get predictable results. When designing for soft real-time applications we need to carefully plan for where threads and memory will be located.

Martin...
ReplyDelete
Replies
Xin Wang16 August 2011 at 02:34
I saw some weird result: If I change the atomic inc into a regular ++:
//value = __sync_add_and_fetch(&s2, 1);
value = ++s2;
//value = __sync_add_and_fetch(&s1, 1);
value = ++s1;
I actually saw lower ping pong latency. Any idea on how to how to explain this result?

Here is my machine information:

Model Name: iMac
Model Identifier: iMac10,1
Processor Name: Intel Core 2 Duo
Processor Speed: 3.06 GHz
Number Of Processors: 1
Total Number Of Cores: 2
L2 Cache: 3 MB
Memory: 4 GB
ReplyDelete
Replies
Martin Thompson16 August 2011 at 07:19
Your code will be faster because you are not using a memory barrier the GNU Atomic Builtins give you. Please see my previous post for why they are important.

http://mechanical-sympathy.blogspot.com/2011/07/memory-barriersfences.html

Without a memory barrier you cannot guarantee what order you will see the results in, or if your program will just spin forever. For this particular example it will work on x86/x64 but it is highly likely not to work on other architectures.
ReplyDelete
Replies
Martin Thompson16 August 2011 at 07:34
Please read section 8.2 of the following document if implementing concurrent programs on Intel CPUs and trying to avoid the use of memory barriers. Be warned this is a very difficult topic to get right even for the best people in the world :-)

http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-vol-3a-part-1-manual.html

You additionally need to ensure your compiler does as you instruct with software memory barriers and use of registers.

Much safer to stick with atomic primitives unless you have significant experience in this area.
ReplyDelete
Replies
Anonymous18 August 2011 at 13:01
Martin,

Interesting that you get as low as 87ns for Java. I constantly get ~225ns for Java and 80ns for C++, regardless of which inter-socket pair I pin the threads to. I have all interrupts going to CPU0, and I also set CPU affinity for init (and its childs) to CPU0 so no other process should disturb the latency test, but that did not help.

The possible explanations you mention seem reasonable based on your measurements, but my results don't seem to support them?
ReplyDelete
Replies
Martin Thompson18 August 2011 at 16:30
johanh (http://johanh.myopenid.com/) has left a new comment on your post "Inter Thread Latency":

Martin,

Interesting that you get as low as 87ns for Java. I constantly get ~225ns for Java and 80ns for C++, regardless of which inter-socket pair I pin the threads to. I have all interrupts going to CPU0, and I also set CPU affinity for init (and its childs) to CPU0 so no other process should disturb the latency test, but that did not help.

The possible explanations you mention seem reasonable based on your measurements, but my results don't seem to support them?

===================================

For some reason this email has got to me but not turned up on the blog.

Sorry but I cannot explain the results. I can get just over 80ns from Java on this machine. The Intel specs say the QPI link will add ~20ns plus I expect a little time in the uncore, so over 80 or even a bit less seems perfectly reasonable.
ReplyDelete
Replies
Anonymous26 August 2011 at 13:18
Martin,

May I ask what JVM (and version) you are running? I just tried running the InterThreadLatency test on JRockit, and I get much lower values - ~90ns, which is much more consistent when the results of the C++ version. Unfortunately my tests of the Disruptor are significantly slower on JRockit though...

Thanks
ReplyDelete
Replies
Martin Thompson26 August 2011 at 13:22
I've been running the tests on Oracle 64-bit Hotspot Server JVM 1.6.0_25.

You should try the new version 2.0 Disruptor which I've just released. It has significantly increased throughput.

What are you seeing for JRockit results?
ReplyDelete
Replies
Anonymous26 August 2011 at 16:01
Ok, thanks. I am running a simple test with one producer-one consumer, measuring average latency. On Hotspot 1.6.0_25 and 1.7.0 I get an average of ~100ns and 99% < 128ns intra-socket, on JRockit I get an average of almost 200ns.

(For some reason, I have a much larger difference between the Disruptor's latency and the results of InterThreadLatency (34ns on my machine) than you seem to get).

Thanks
ReplyDelete
Replies
Steven McCoy26 August 2011 at 16:30
On a pretty aged laptop I managed to shave a few more nanoseconds with a pause op in the busy wait loop.

134ns Java, 166ns default g++, 133ns -march=native g++, 130ns with _mm_pause.
ReplyDelete
Replies
Martin Thompson26 August 2011 at 18:56
Johan,

I'm finding that as I make the Disruptor faster it also becomes more dependent on processor affinity. For me the results are significantly better when affinity is set, both for latency and throughput. Since standard Java has no means to do this I'm going to release a Executor that can set thread affinity for a future release of the Disruptor. This will obviously have to come with a small native library. taskset on the Java process is insufficient.

The tiny programs used for the InterThreadLatency are not as sensitive to cache pollution. Remember a single cache miss can cost a few hundred CPU cycles that is usually ~70+ ns when going out to main memory even on the fastest machines. Can be well over 100ns for older memory systems. You could also have a bit of false sharing going on. Do you have access to a tool like Intel VTune?

Martin...
ReplyDelete
Replies
Martin Thompson26 August 2011 at 19:04
Steven,

This is a good spot. I did not do it because Java does not have the option. I wish more options where available to Java such as _asm{ pause } for putting in a busy spin loop. The problem without it is the processor does some speculative execution that is wrong and not discovered until the instruction is retired by which stage it can have polluted the store buffers. For those interested in what it does.

http://software.intel.com/sites/products/documentation/studio/composer/en-us/2009/compiler_c/intref_cls/common/intref_sse2_pause.htm

I also wish "lock add", and "lock xadd" were available to Java as much better alternatives to CAS as cmpxchg for dealing with sequence claiming on x64/x68. Sadly they are not :-(

Martin...
ReplyDelete
Replies
Steven McCoy27 August 2011 at 09:03
Sun developers love CAS, the Solaris intrinsics for atomic addition, et al, on x64 are all implemented using CAS. I think the BSD's went the same route.
ReplyDelete
Replies
Martin Thompson1 September 2011 at 09:17
Johan,

There has been some "false sharing" introduced to the Disruptor on the latest JVMs. I made a new release yesterday (2.0.2) that addresses this false sharing. You should notice throughput and latency are much improved again.

Martin...
ReplyDelete
Replies
Anonymous4 September 2011 at 21:07
Thanks. I tried the new version (2.0.2), and got a got slight latency improvement on hotspot (from ~100 to ~95ns), and a larger improvement on jrockit (from ~200 to ~140ns). Using your new method for padding at some other places in another application improved performance a lot though, so I was quite happy for that :)

I wrote some JNI code to pin the individual threads to specific cores, but that actually didn't have any noticeable effect on my machine.

I have never used Intel VTune, but I guess I can use oprofile to measure cache misses. Or does VTune offer any additional fucntionality that would be useful here?

~95ns is much much lower than I had with my earlier queue-based design (also with spinning consumer threads), so I'm already very happy but it annoys me a bit that I don't get as low figures as you do...

Thanks,
Johan
ReplyDelete
Replies
Martin Thompson9 September 2011 at 09:20
Johan,

I measured my latency on a Sandybridge processor and I think you are using Westmere? I'll set up a test on Westmere to see if there is much of a difference. I expect a bit of a difference because Sandybridge has a "L0" uop cache and lower L3 cache latency.

I've also just released 2.5RC of the Disruptor which can drop latency a little more plus offer a richer feature set.

Martin...
ReplyDelete
Replies
Martin Thompson10 September 2011 at 09:04
Johan,

See section 3.3.2 of this VTune reference for details on memory issues.

http://software.intel.com/file/8685
ReplyDelete
Replies
GeO419 September 2012 at 22:46
Thanks for the post!

Something strange happened when I set the CPU affinity to a single core:

$ time taskset -c 0 java -XX:+TieredCompilation -cp bin InterThreadLatency
duration = 411120019
ns per op = 10278000
op/sec = 97
s1 = 20, s2 = 19

The program almost stalled, with less than a hundred ops per second.

Adding another core restores performance:

$ time taskset -c 0,1 java -XX:+TieredCompilation -cp bin InterThreadLatency
duration = 6177088
ns per op = 154427
op/sec = 6475
s1 = 20, s2 = 20

The iterations are very few since it would take forever otherwise.

Any idea why this happens? I figured that since both threads where forced to run on the same core the latency would drop, but the complete opposite happened.

Rikard
ReplyDelete
Replies
Mahamadou DIARRA17 October 2012 at 17:40
brilliant. I got below results when running the java code on my pc

duration = 45039344402
ns per op = 45
op/sec = 22202809
s1 = 500000000, s2 = 500000000

linux centos 6.3|32gram|asus maximus iv extreme z| Intel 2700K|SSD 128G
ReplyDelete
Replies
XiongZ17 May 2013 at 07:59
Hi Martin,

This is really interesting.

Just one question, witht the sample code, two CORE will be fully utlized while one thread is waiting for the result to return from another thread, kind of waste of energy and unfriendly to environment. In real application, messages or events arive at pulses. Currently it cost about 2 to 5 microseconds to wake up a thread in our solutions. Is there any way to reduce CPU usage during waiting for message and wake up fast when message arrives?

Thanks
ReplyDelete
Replies
Diogo Ferreira30 November 2013 at 13:14
I decided to try this against the 1.8 developer preview (1.8.0-ea-b117)

I am running these on an i7 Sandy Bridge dual-core laptop clocked at 2.7Ghz running MacOS. For the C benchmark I get similar results to others:

duration = 22200214976
ns per op = 22
op/sec = 45044608
s1 = 500000000, s2 = 500000000

However, for the Java benchmark there is a massive difference between java 1.7 and java 1.8. For Java 1.7 I get the following:

duration = 41235846000
ns per op = 41
op/sec = 24250745
s1 = 500000000, s2 = 500000000

which is in line with what everyone else is getting.

However in the developer preview I get this:

duration = 19182637000
ns per op = 19
op/sec = 52130476
s1 = 500000000, s2 = 500000000

Which is actually faster than C. Does anyone know the reason?
ReplyDelete
Replies

Add comment