Lately I’ve been asked a lot about how much faster the Disruptor would be if C++ were used instead of Java. C++ would certainly give more control over memory alignment and access to underlying CPU instructions such as memory barriers and lock instructions. In this article I’ll directly compare C++ and Java to measure the cost of signalling a change between threads.
For the test we'll use two counters each updated by their own thread. A simple ping-pong algorithm will be used to signal from one to the other and back again. The exchange will be repeated millions of times to measure the average latency between cores. This measurement will give us the latency of exchanging a cache line between cores in a serial manner.
For Java we’ll use volatile counters, for which the JVM will kindly insert a lock instruction on each update, giving us an effective memory barrier.
public final class InterThreadLatency
    implements Runnable
{
    public static final long ITERATIONS = 500L * 1000L * 1000L;

    public static volatile long s1;
    public static volatile long s2;

    public static void main(final String[] args)
    {
        Thread t = new Thread(new InterThreadLatency());
        t.setDaemon(true);
        t.start();

        long start = System.nanoTime();

        long value = s1;
        while (s1 < ITERATIONS)
        {
            while (s2 != value)
            {
                // busy spin
            }
            value = ++s1;
        }

        long duration = System.nanoTime() - start;

        System.out.println("duration = " + duration);
        System.out.println("ns per op = " + duration / (ITERATIONS * 2));
        System.out.println("op/sec = " +
            (ITERATIONS * 2L * 1000L * 1000L * 1000L) / duration);
        System.out.println("s1 = " + s1 + ", s2 = " + s2);
    }

    public void run()
    {
        long value = s2;
        while (true)
        {
            while (value == s1)
            {
                // busy spin
            }
            value = ++s2;
        }
    }
}
For C++ we’ll use the GNU Atomic Builtins, which insert a similar lock instruction to the one the JVM uses.
#include <time.h>
#include <pthread.h>
#include <stdio.h>

typedef unsigned long long uint64;
const uint64 ITERATIONS = 500LL * 1000LL * 1000LL;

volatile uint64 s1 = 0;
volatile uint64 s2 = 0;

void* run(void*)
{
    register uint64 value = s2;
    while (true)
    {
        while (value == s1)
        {
            // busy spin
        }
        value = __sync_add_and_fetch(&s2, 1);
    }
}

int main (int argc, char *argv[])
{
    pthread_t threads[1];
    pthread_create(&threads[0], NULL, run, NULL);

    timespec ts_start;
    timespec ts_finish;
    clock_gettime(CLOCK_MONOTONIC, &ts_start);

    register uint64 value = s1;
    while (s1 < ITERATIONS)
    {
        while (s2 != value)
        {
            // busy spin
        }
        value = __sync_add_and_fetch(&s1, 1);
    }

    clock_gettime(CLOCK_MONOTONIC, &ts_finish);

    uint64 start = (ts_start.tv_sec * 1000000000LL) + ts_start.tv_nsec;
    uint64 finish = (ts_finish.tv_sec * 1000000000LL) + ts_finish.tv_nsec;
    uint64 duration = finish - start;

    printf("duration = %lld\n", duration);
    printf("ns per op = %lld\n", (duration / (ITERATIONS * 2)));
    printf("op/sec = %lld\n",
        ((ITERATIONS * 2L * 1000L * 1000L * 1000L) / duration));
    printf("s1 = %lld, s2 = %lld\n", s1, s2);

    return 0;
}

Results
$ taskset -c 2,4 /opt/jdk1.7.0/bin/java InterThreadLatency
duration = 50790271150
ns per op = 50
op/sec = 19,688,810
s1 = 500000000, s2 = 500000000
$ g++ -O3 -lpthread -lrt -o itl itl.cpp
$ taskset -c 2,4 ./itl
duration = 45087955393
ns per op = 45
op/sec = 22,178,872
s1 = 500000000, s2 = 500000000
The C++ version is slightly faster on my Intel Sandybridge laptop. So what does this tell us? Well, that the latency between 2 cores on a 2.2 GHz machine is ~45ns and that you can exchange 22m messages per second in a serial fashion. On an Intel CPU this is fundamentally the cost of the lock instruction enforcing total order and forcing the store buffer and write combining buffers to drain, followed by the resulting cache coherency traffic between the cores. Note that each core has a 96GB/s port onto the L3 cache ring bus, yet 22m * 64-bytes is only 1.4 GB/s. This is because we have measured latency and not throughput. We could easily fit some nice fat messages between those memory barriers as part of the exchange if the data has been written before the lock instruction was executed.
So what does this all mean for the Disruptor? Basically, the latency of the Disruptor is about as low as we can get from Java. It would be possible to get a ~10% latency improvement by moving to C++. I’d expect a similar improvement in throughput for C++. The main win with C++ would be the control, and therefore, the predictability that comes with it if used correctly. The JVM gives us nice safety features like garbage collection in complex applications but we pay a little for that with the extra instructions it inserts that can be seen if you get Hotspot to dump the assembler instructions it is generating.
How does the Disruptor achieve more than 25m messages per second I hear you say??? Well that is one of the neat parts of its design. The “waitFor” semantics on the SequenceBarrier enables a very efficient form of batching, which allows the BatchEventProcessor to process a series of events that occurred since it last checked in with the RingBuffer, all without incurring a memory barrier. For real world applications this batching effect is really significant. For micro benchmarks it only makes the results more random, especially when there is little work done other than accepting the message.
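The batching effect described above can be sketched in a few lines. The names below are made up for illustration and are not the real Disruptor API: the idea is simply that the consumer pays for one volatile read to discover the highest published sequence, then drains every event up to it with plain reads.

```java
import java.util.concurrent.atomic.AtomicLong;

// Minimal sketch of the batching idea (hypothetical names, not the real
// Disruptor API): the consumer reads the published sequence once, then
// drains every event up to it without touching the barrier again.
public final class BatchingSketch
{
    static final AtomicLong published = new AtomicLong(-1); // last slot written
    static final long[] buffer = new long[1024];            // stand-in ring buffer

    // Process all events from nextToRead up to the published sequence,
    // paying for a single volatile read per batch rather than per event.
    static long drain(final long nextToRead)
    {
        long available = published.get(); // one volatile read (the "barrier")
        long sum = 0;
        for (long seq = nextToRead; seq <= available; seq++)
        {
            sum += buffer[(int)(seq & (buffer.length - 1))]; // plain reads
        }
        return sum;
    }

    public static void main(final String[] args)
    {
        for (int i = 0; i < 8; i++)
        {
            buffer[i] = i + 1;
        }
        published.set(7);             // publish 8 events at once
        System.out.println(drain(0)); // drains the whole batch in one pass
    }
}
```

The point of the sketch is the cost model: one memory barrier amortised over the whole batch, rather than one per message.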
Conclusion
So when processing events in series, the measurements tell us that the current generation of processors can do between 20 and 30 million exchanges per second at a latency of less than 50ns. The Disruptor design allows us to get greater throughput without explicit batching on the publisher side. In addition the Disruptor has an explicit batching API on the publisher side that can give over 100 million messages per second.
Absolutely brilliant article, its fascinating to be following the latest state-of-the-art advances in this field.
You do limit the process to two processors, but shouldn't you also pin each thread to one processor?
Awesome. I'm not sure people think about this enough when designing multi-threaded code. I'm going to try this code out on our performance testing hosts to see what a slightly older host can do.
ReplyDeleteipc,
Yes, it would be better to set the thread affinity for each thread. I've not done so for this simple test because the same cache line will ping pong and the instruction cache footprint is so small in each case. For a more complex example you are exactly right and the threads should be pinned to cores.
Martin...
Hi,
I ported the Java disruptor to C++, and it's now available as open source (see http://www.2robots.com/2011/08/13/a-c-disruptor/). I haven't done extensive performance analysis though because (1) I don't have a 4-core box to benchmark on, and (2) I haven't ported the Java performance test framework to C++.
I'd welcome someone else doing the performance testing with the C++ port. And, it's available now if people want to play with it in their C++ programs!
Hi,
I ran these tests on our machine (an Intel X5677, two quad-core processors). When pinning the threads to cores on the same socket, the difference between the Java and the C++ version was relatively small, 34 vs 31 ns of latency. But when pinning the threads to cores on different sockets I got a huge difference - 227ns for the Java version and 80ns with the C++ one. I've run the tests several times, and pretty much same results every time. Any idea why there is such a big difference between Java and C++ when communicating between different processors?
Johan
Hi Johan,
I've just run the tests on an Intel X5650 which is the same architecture with a slightly slower clock speed. When measuring the inter-socket call I get a range of 87-239ns for Java, and 84-140ns for C++. I got the different figures by doing multiple runs and using different cores on each socket.
The reason I used different cores from each socket is because I deliberately did not turn off IRQ load-balancing. I suspect what is going on here is that the threads get context switched and their cache gets polluted by whatever serviced the interrupt. This will be more expensive for Java as it has to get its runtime hot in cache again.
Also when going inter-socket the "Global Queue" in the un-core part is required to sequence the L2 cache misses and the QPI requests. Memory regions are owned by particular sockets for main memory. It may be the case that the Java runtime is being reloaded across the QPI link via the same global queues, thus slowing it down.
I'll post more about this at a later date but we need to pin the individual threads, which requires some native calls from Java, and ensure the OS threads and all interrupts go to different cores to get predictable results. When designing for soft real-time applications we need to carefully plan for where threads and memory will be located.
Martin...
I saw a weird result: if I change the atomic inc into a regular ++:
//value = __sync_add_and_fetch(&s2, 1);
value = ++s2;
//value = __sync_add_and_fetch(&s1, 1);
value = ++s1;
I actually saw lower ping pong latency. Any idea how to explain this result?
Here is my machine information:
Model Name: iMac
Model Identifier: iMac10,1
Processor Name: Intel Core 2 Duo
Processor Speed: 3.06 GHz
Number Of Processors: 1
Total Number Of Cores: 2
L2 Cache: 3 MB
Memory: 4 GB
Your code will be faster because you are not using the memory barrier the GNU Atomic Builtins give you. Please see my previous post for why they are important.
http://mechanical-sympathy.blogspot.com/2011/07/memory-barriersfences.html
Without a memory barrier you cannot guarantee what order you will see the results in, or if your program will just spin forever. For this particular example it will work on x86/x64 but it is highly likely not to work on other architectures.
Please read section 8.2 of the following document if implementing concurrent programs on Intel CPUs and trying to avoid the use of memory barriers. Be warned this is a very difficult topic to get right even for the best people in the world :-)
http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-vol-3a-part-1-manual.html
You additionally need to ensure your compiler does as you instruct with software memory barriers and use of registers.
Much safer to stick with atomic primitives unless you have significant experience in this area.
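Following that advice, a Java sketch of the same ping-pong exchange using AtomicLong rather than raw volatile fields might look like the following. incrementAndGet() emits the same lock-prefixed read-modify-write the article relies on, with the ordering made explicit by the API. Note the Thread.yield() calls are added here only so the sketch also makes progress on a single-core box; a real latency test would pure busy-spin on dedicated cores.

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch of the ping-pong exchange with AtomicLong counters: the atomic
// primitive makes the lock-prefixed increment and its ordering explicit.
public final class AtomicPingPong
{
    static final long ITERATIONS = 100_000L; // kept small for a quick run
    static final AtomicLong s1 = new AtomicLong();
    static final AtomicLong s2 = new AtomicLong();

    public static void main(final String[] args) throws InterruptedException
    {
        Thread t = new Thread(() ->
        {
            long value = s2.get();
            while (s2.get() < ITERATIONS)
            {
                while (value == s1.get())
                {
                    Thread.yield(); // busy spin (yield for single-core safety)
                }
                value = s2.incrementAndGet();
            }
        });
        t.start();

        long start = System.nanoTime();
        long value = s1.get();
        while (s1.get() < ITERATIONS)
        {
            while (s2.get() != value)
            {
                Thread.yield(); // busy spin (yield for single-core safety)
            }
            value = s1.incrementAndGet();
        }
        long duration = System.nanoTime() - start;
        t.join();

        System.out.println("ns per op = " + duration / (ITERATIONS * 2));
        System.out.println("s1 = " + s1.get() + ", s2 = " + s2.get());
    }
}
```

The structure is identical to the listing in the article; only the mechanism for the barrier-carrying increment has changed.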
Martin,
Interesting that you get as low as 87ns for Java. I constantly get ~225ns for Java and 80ns for C++, regardless of which inter-socket pair I pin the threads to. I have all interrupts going to CPU0, and I also set CPU affinity for init (and its children) to CPU0 so no other process should disturb the latency test, but that did not help.
The possible explanations you mention seem reasonable based on your measurements, but my results don't seem to support them?
Sorry but I cannot explain the results. I can get just over 80ns from Java on this machine. The Intel specs say the QPI link will add ~20ns, plus I expect a little time in the uncore, so 80ns or even a bit less seems perfectly reasonable.
Martin,
May I ask what JVM (and version) you are running? I just tried running the InterThreadLatency test on JRockit, and I get much lower values - ~90ns, which is much more consistent with the results of the C++ version. Unfortunately my tests of the Disruptor are significantly slower on JRockit though...
Thanks
I've been running the tests on Oracle 64-bit Hotspot Server JVM 1.6.0_25.
You should try the new version 2.0 Disruptor which I've just released. It has significantly increased throughput.
What are you seeing for JRockit results?
Ok, thanks. I am running a simple test with one producer-one consumer, measuring average latency. On Hotspot 1.6.0_25 and 1.7.0 I get an average of ~100ns and 99% < 128ns intra-socket, on JRockit I get an average of almost 200ns.
(For some reason, I have a much larger difference between the Disruptor's latency and the results of InterThreadLatency (34ns on my machine) than you seem to get).
Thanks
On a pretty aged laptop I managed to shave a few more nanoseconds with a pause op in the busy wait loop.
134ns Java, 166ns default g++, 133ns -march=native g++, 130ns with _mm_pause.
Johan,
I'm finding that as I make the Disruptor faster it also becomes more dependent on processor affinity. For me the results are significantly better when affinity is set, both for latency and throughput. Since standard Java has no means to do this I'm going to release an Executor that can set thread affinity for a future release of the Disruptor. This will obviously have to come with a small native library. taskset on the Java process is insufficient.
The tiny programs used for the InterThreadLatency test are not as sensitive to cache pollution. Remember a single cache miss can cost a few hundred CPU cycles, which is usually ~70+ns when going out to main memory even on the fastest machines. It can be well over 100ns for older memory systems. You could also have a bit of false sharing going on. Do you have access to a tool like Intel VTune?
Martin...
Steven,
This is a good spot. I did not do it because Java does not have the option. I wish more options were available to Java, such as _asm{ pause }, for putting in a busy spin loop. The problem without it is the processor does some speculative execution that is wrong and not discovered until the instruction is retired, by which stage it can have polluted the store buffers. For those interested in what it does:
http://software.intel.com/sites/products/documentation/studio/composer/en-us/2009/compiler_c/intref_cls/common/intref_sse2_pause.htm
I also wish "lock add" and "lock xadd" were available to Java as much better alternatives to CAS (cmpxchg) for dealing with sequence claiming on x86/x64. Sadly they are not :-(
Martin...
Sun developers love CAS; the Solaris intrinsics for atomic addition, et al., on x64 are all implemented using CAS. I think the BSDs went the same route.
Johan,
There has been some "false sharing" introduced to the Disruptor on the latest JVMs. I made a new release yesterday (2.0.2) that addresses this false sharing. You should notice throughput and latency are much improved again.
Martin...
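The padding trick discussed here can be sketched as follows. The class and field names are invented for this sketch, and the actual Disruptor scheme differs; note also that a smart JIT may elide unused fields, which is why later approaches pad via inheritance or annotations.

```java
// Illustrative sketch of cache-line padding to avoid false sharing: the hot
// counter is followed by enough long fields that it does not share a 64-byte
// cache line with whatever field the JVM lays out next. (Hypothetical class,
// not the Disruptor's own padding; the JIT may elide unused fields, which is
// one reason real implementations pad via inheritance instead.)
public final class PaddedCounter
{
    public volatile long value = 0;
    // 7 longs of padding: value + padding fills a 64-byte cache line.
    public long p1, p2, p3, p4, p5, p6, p7;

    public long increment()
    {
        return ++value; // single-writer increment of the hot field
    }
}
```

Two such counters updated by different threads then miss in each other's caches far less often, because each sits on its own line.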
Thanks. I tried the new version (2.0.2), and got a slight latency improvement on Hotspot (from ~100 to ~95ns), and a larger improvement on JRockit (from ~200 to ~140ns). Using your new method for padding at some other places in another application improved performance a lot though, so I was quite happy about that :)
I wrote some JNI code to pin the individual threads to specific cores, but that actually didn't have any noticeable effect on my machine.
I have never used Intel VTune, but I guess I can use oprofile to measure cache misses. Or does VTune offer any additional functionality that would be useful here?
~95ns is much much lower than I had with my earlier queue-based design (also with spinning consumer threads), so I'm already very happy, but it annoys me a bit that I don't get figures as low as yours...
Thanks,
Johan
Johan,
I measured my latency on a Sandybridge processor and I think you are using Westmere? I'll set up a test on Westmere to see if there is much of a difference. I expect a bit of a difference because Sandybridge has an "L0" uop cache and lower L3 cache latency.
I've also just released 2.5RC of the Disruptor which can drop latency a little more plus offer a richer feature set.
Martin...
Johan,
See section 3.3.2 of this VTune reference for details on memory issues.
http://software.intel.com/file/8685
Thanks for the post!
Something strange happened when I set the CPU affinity to a single core:
$ time taskset -c 0 java -XX:+TieredCompilation -cp bin InterThreadLatency
duration = 411120019
ns per op = 10278000
op/sec = 97
s1 = 20, s2 = 19
The program almost stalled, with less than a hundred ops per second.
Adding another core restores performance:
$ time taskset -c 0,1 java -XX:+TieredCompilation -cp bin InterThreadLatency
duration = 6177088
ns per op = 154427
op/sec = 6475
s1 = 20, s2 = 20
The iterations are very few since it would take forever otherwise.
Any idea why this happens? I figured that since both threads were forced to run on the same core the latency would drop, but the complete opposite happened.
Rikard
Having thought about it for a few more minutes I realize that each operation will involve a context switch in order for the other thread to get some CPU time. The 97 ops/s is simply the context switch frequency, I assume...
You are pretty much measuring the length of a time slice on your scheduler.
For an interesting result try the same with 2 threads, each bound to a different hyperthread on the same core. Should see latency come down to ~10ns. Use http://linux.die.net/man/1/hwloc-ls to determine CPU layout.
Different cores:
$ time taskset -c 0,1 java -XX:+TieredCompilation -cp bin InterThreadLatency
duration = 13632.577075 ms
ns per op = 68
op/sec = 14670740
s1 = 100000000, s2 = 100000000
Same core with hyperthreading:
$ time taskset -c 0,4 java -XX:+TieredCompilation -cp bin InterThreadLatency
duration = 5258.237784 ms
ns per op = 26
op/sec = 38035556
s1 = 100000000, s2 = 100000000
Not quite ~10ns on my machine, but a lot lower. Interesting.
I guess that the reason is that you don't have to pay the penalty for going all the way to L3 cache in order to fetch the updated value.
I looked up the cache latencies of Sandy Bridge here: http://www.7-cpu.com/cpu/SandyBridge.html
The 26ns I get when on the same core is approximately twice the L2 cache latency. Does it correspond to one write and one read perhaps?
Without being able to see your actual machine I'm not totally sure. The cost is probably the LOCK ADD assembly instruction inserted for the memory barrier. Try replacing the longs with AtomicLong and using the lazySet() method for the write.
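The lazySet() suggestion, sketched for a single-writer counter (an illustrative class, not code from the Disruptor): the ordered store replaces the lock-prefixed volatile write, so the writer no longer pays for a full fence, while the value still becomes visible to a spinning reader.

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch of the lazySet() idea: an ordered store (a plain MOV on x86)
// replaces the LOCK-prefixed volatile write, so the writer avoids draining
// its store buffer. Visibility to a spinning reader is retained; only the
// global ordering guarantee of the volatile write is relaxed.
public final class LazySetCounter
{
    private final AtomicLong sequence = new AtomicLong(0);

    // Writer side: plain read of our own counter, ordered store of the new
    // value. Safe only because there is a single writer.
    public long publish()
    {
        long next = sequence.get() + 1;
        sequence.lazySet(next); // store-store barrier only, no LOCK ADD
        return next;
    }

    // Reader side: a volatile read, exactly as before.
    public long read()
    {
        return sequence.get();
    }
}
```

The single-writer constraint is what makes the plain read plus ordered write correct here; with multiple writers you are back to the atomic read-modify-write operations.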
Tested it. The different core case dropped to 66 ns per op and the same core test dropped to 22 ns per op.
CPU is a Sandy Bridge i7 2760QM 2.4Ghz with turbo boost disabled.
Why did you expect 10 ns?
If you really want to dig deeper I can make a gist of the assembly.
Because I've seen it that low.
Ok, thanks!
Brilliant. I got the below results when running the Java code on my PC:
duration = 45039344402
ns per op = 45
op/sec = 22202809
s1 = 500000000, s2 = 500000000
Linux CentOS 6.3 | 32GB RAM | ASUS Maximus IV Extreme-Z | Intel 2700K | SSD 128G
Hi Martin,
This is really interesting.
Just one question: with the sample code, two cores will be fully utilized while one thread is waiting for the result to return from the other thread, which is kind of a waste of energy and unfriendly to the environment. In real applications, messages or events arrive in pulses. Currently it costs about 2 to 5 microseconds to wake up a thread in our solutions. Is there any way to reduce CPU usage while waiting for a message and still wake up fast when the message arrives?
Thanks
When busy spinning, a core is tied up until work is available. There are techniques to address this, such as using the Intel PAUSE instruction and backoff strategies that involve doing other work. I cover this sort of issue on my training course.
However if you wish to operate in a low-latency environment then busy spinning is often the only option. I've seen the cost to wake up a thread often be ~16us on some versions of Linux, and even into the milliseconds when virtualised. On top of the wakeup cost you must also consider the cache pollution and thread migration that results.
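One possible shape for such a backoff strategy, as a sketch with illustrative thresholds (these are not tuned values from the Disruptor): spin briefly for the low-latency case, then yield the timeslice, then park, trading wake-up latency for CPU as the wait grows long.

```java
import java.util.concurrent.locks.LockSupport;
import java.util.function.BooleanSupplier;

// Sketch of a backoff wait strategy: busy spin first for the cheapest
// possible wake-up, then yield, then park for short intervals so a long
// wait stops burning a whole core. Thresholds are illustrative only.
public final class BackoffWait
{
    public static void waitUntil(final BooleanSupplier condition)
    {
        int spins = 0;
        while (!condition.getAsBoolean())
        {
            if (spins < 1000)
            {
                spins++;                       // busy spin: lowest latency
            }
            else if (spins < 1100)
            {
                spins++;
                Thread.yield();                // give up the timeslice
            }
            else
            {
                LockSupport.parkNanos(1_000L); // sleep ~1us: cheap on CPU
            }
        }
    }

    public static void main(final String[] args)
    {
        final long deadline = System.nanoTime() + 5_000_000L; // ~5ms away
        waitUntil(() -> System.nanoTime() >= deadline);
        System.out.println("condition met");
    }
}
```

The trade-off is exactly the one described above: the further down the ladder the waiter falls, the cheaper it is on CPU and the more it pays in wake-up latency, cache pollution and possible thread migration.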
I decided to try this against the 1.8 developer preview (1.8.0-ea-b117)
I am running these on an i7 Sandy Bridge dual-core laptop clocked at 2.7Ghz running MacOS. For the C benchmark I get similar results to others:
duration = 22200214976
ns per op = 22
op/sec = 45044608
s1 = 500000000, s2 = 500000000
However, for the Java benchmark there is a massive difference between java 1.7 and java 1.8. For Java 1.7 I get the following:
duration = 41235846000
ns per op = 41
op/sec = 24250745
s1 = 500000000, s2 = 500000000
which is in line with what everyone else is getting.
However in the developer preview I get this:
duration = 19182637000
ns per op = 19
op/sec = 52130476
s1 = 500000000, s2 = 500000000
Which is actually faster than C. Does anyone know the reason?
It is most likely that if you are seeing figures below 35ns, the 2 threads are running as hyperthreads on the same core. This can be illustrated with task pinning on Linux or Windows. Unfortunately you have no such control on OSX.