Sunday, 11 September 2011

Adventures with AtomicLong

Sequencing events between threads is a common operation for many multi-threaded algorithms.  These sequences could be used for assigning identity to orders, trades, transactions, messages, events, etc.  Within the Disruptor we use a monotonic sequence for all events which is implemented as AtomicLong.incrementAndGet() for the multi-threaded publishing scenario.

While working on the latest version of the Disruptor I made some changes which I was convinced would improve performance, however the results surprised me.  I had removed some potentially megamorphic method calls and the performance got worse rather than better.  After a lot of investigation, I discovered that the megamorphic method calls were hiding a performance issue with the latest Intel Sandybridge processors.  With the megamorphic calls out of the way, the contention on the atomic sequence generation increased exposing the issue.  I've also observed this performance issue with other Java concurrent structures such as ArrayBlockingQueue.

I’ve been running various benchmarks on Sandybridge and have so far been impressed with performance improvements over Nehalem, especially for memory intensive applications due to the changes in its front-end.  However with this sequencing benchmark, I discovered that Sandybridge has taken a major step backward in performance with regard to atomic instructions.

Atomic instructions enable read-modify-write actions to be combined into a single atomic operation.  A good example is incrementing a counter.  To complete the increment operation a thread must read the current value, increment it, and then write back the result.  In a multi-threaded environment these distinct operations could interleave with other threads doing the same, with a corrupted result as a consequence.  The normal way to avoid this interleaving is to take out a lock for mutual exclusion while performing the steps.  Locks are very expensive and often require kernel arbitration between threads.  Modern CPUs provide a number of atomic instructions which allow operations such as atomically incrementing a counter, or the ability to conditionally set a pointer reference if the value is still as expected.  These operations are commonly referred to as CAS (Compare And Swap) instructions.  A good way to think of these CAS instructions is as optimistic locks, similar to what you experience when using a version control system like Subversion or CVS.  You try to make a change and if the version is what you expect then you succeed, otherwise the action aborts.
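
To make that optimistic pattern concrete, below is a minimal sketch of an increment built with AtomicLong.compareAndSet(); the CasCounter class is purely illustrative and not from the Disruptor.  It is essentially the retry loop that AtomicLong.incrementAndGet() performs under the covers in the JDKs discussed here, with the compareAndSet() call ending up as a LOCK CMPXCHG on x86.

import java.util.concurrent.atomic.AtomicLong;

public final class CasCounter
{
    private final AtomicLong value = new AtomicLong(0L);

    // Optimistic increment: read the current value, try to swap in current + 1,
    // and retry if another thread won the race in between.
    public long incrementAndGet()
    {
        long current;
        long next;
        do
        {
            current = value.get();
            next = current + 1;
        }
        while (!value.compareAndSet(current, next)); // LOCK CMPXCHG on x86

        return next;
    }
}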

On x86/x64 these instructions are known as “lock” instructions.  The "lock" name comes from how a processor, after setting its lock signal, would lock the front-side/memory bus (FSB) for serialising memory access while the three steps of the operation took place atomically.  On more recent processors the lock instruction is simply implemented by getting an exclusive lock on the cache-line for modification.

These instructions are the basic building blocks used for implementing higher-level locks and semaphores.  This, as will be explained shortly, is why I've seen performance issues on Sandybridge for ArrayBlockingQueue in some of the Disruptor comparative performance tests.

Back to my benchmark.  The test was spending significantly more time in AtomicLong.incrementAndGet() than I had previously observed.  Initially, I suspected an issue with JDK 1.6.0_27 which I had just installed.  I ran the following test with various JVMs, including 1.7.0, and kept getting the same results.  I then booted different operating systems (Ubuntu, Fedora, Windows 7 - all 64-bit), again with the same results.  This led me to write an isolated test which I ran on Nehalem (2.8 GHz Core i7 860) and Sandybridge (2.2 GHz Core i7-2720QM).

import java.util.concurrent.atomic.AtomicLong;

public final class TestAtomicIncrement
    implements Runnable
{
    public static final long COUNT = 500L * 1000L * 1000L;
    public static final AtomicLong counter = new AtomicLong(0L);

    public static void main(final String[] args) throws Exception
    {
        final int numThreads = Integer.parseInt(args[0]);
        final long start = System.nanoTime();
        runTest(numThreads);
        System.out.println("duration = " + (System.nanoTime() - start));
        System.out.println("counter = " + counter);
    }

    private static void runTest(final int numThreads)
        throws InterruptedException
    {
        Thread[] threads = new Thread[numThreads];

        for (int i = 0; i < threads.length; i++)
        {
            threads[i] = new Thread(new TestAtomicIncrement());
        }

        for (Thread t : threads)
        {
            t.start();
        }

        for (Thread t : threads)
        {
            t.join();
        }
    }

    public void run()
    {
        // each thread increments the shared counter until it reaches COUNT
        long i = 0L;
        while (i < COUNT)
        {
            i = counter.incrementAndGet();
        }
    }
}

Figure 1.

After running this test on 4 different Sandybridge processors with a range of clock speeds, I concluded that, under contention with increasing numbers of threads, LOCK CMPXCHG is much less scalable on Sandybridge than on the previous Nehalem generation of processors.  Figure 1. above charts the results as the duration in nanoseconds to complete 500 million increments of a counter with increasing thread count.  Less is better.

I confirmed the JVM was generating the correct instructions for the CAS loop by getting Hotspot to print the assembler it generated.  I also confirmed that Hotspot generated identical assembler instructions for both Nehalem and Sandybridge.

I then decided to investigate further and wrote the following C++ program to test the relevant lock instructions and compare Nehalem and Sandybridge.  I know from using “objdump -d” on the binary that the GNU atomic builtins generate the lock instructions for ADD, XADD, and CMPXCHG for the respectively named functions below.
#include <time.h>
#include <pthread.h>
#include <stdlib.h>
#include <iostream>

typedef unsigned long long uint64;
const uint64 COUNT = 500LL * 1000LL * 1000LL;
volatile uint64 counter = 0;

void* run_add(void* numThreads)
{
    // each thread performs its share of the 500 million increments;
    // the return value of __sync_add_and_fetch is ignored so GCC emits LOCK ADD
    register uint64 value = (COUNT / *((int*)numThreads)) + 1;
    while (--value != 0)
    {
        __sync_add_and_fetch(&counter, 1);
    }

    return NULL;
}

void* run_xadd(void*)
{
    // the return value is used so GCC emits LOCK XADD
    register uint64 value = counter;
    while (value < COUNT)
    {
        value = __sync_add_and_fetch(&counter, 1);
    }

    return NULL;
}

void* run_cas(void*)
{
    // optimistic retry loop: re-read the counter and LOCK CMPXCHG it to value + 1
    register uint64 value = 0;
    while (value < COUNT)
    {
        do
        {
            value = counter;
        }
        while (!__sync_bool_compare_and_swap(&counter, value, value + 1));
    }

    return NULL;
}

int main (int argc, char* argv[])
{
    const int NUM_THREADS = atoi(argv[1]);

    pthread_t threads[NUM_THREADS];
    void* status;
    timespec ts_start;
    timespec ts_finish;
    clock_gettime(CLOCK_MONOTONIC, &ts_start);

    for (int i = 0; i < NUM_THREADS; i++)
    {
        // run_add is hard coded here; substitute run_xadd or run_cas to test the other lock instructions
        pthread_create(&threads[i], NULL, run_add, (void*)&NUM_THREADS);
    }

    for (int i = 0; i < NUM_THREADS; i++)
    {
        pthread_join(threads[i], &status);
    }

    clock_gettime(CLOCK_MONOTONIC, &ts_finish);

    uint64 start = (ts_start.tv_sec * 1000000000LL) + ts_start.tv_nsec;
    uint64 finish = (ts_finish.tv_sec * 1000000000LL) + ts_finish.tv_nsec;
    uint64 duration = finish - start;

    std::cout << "threads = "  << NUM_THREADS << std::endl;
    std::cout << "duration = " << duration << std::endl;
    std::cout << "counter = "  << counter << std::endl;

    return 0;
}
Figure 2.

It is clear from Figure 2. that Nehalem performs nearly an order of magnitude better for atomic operations as contention increases with threads.  I found LOCK ADD and LOCK XADD to be similar so I've only charted XADD for clarity.  The CAS operations for C++ and Java are comparable.

It is also very interesting how XADD greatly outperforms CAS and gives a nice scalable profile.  For 3 threads and above, XADD does not degrade further and simply performs at the rate at which the processor can keep the caches coherent.  Nehalem and Sandybridge level out respectively at ~100m and ~20m XADD operations per second for 3+ concurrent threads, whereas CAS continues to degrade with increasing thread count because of contention.  Naturally, performance degrades when QPI links are involved for a multi-socket scenario.  Oracle have now accepted that not supporting XADD is a bug and will hopefully fix it soon for the JVM. 
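
To put the distinction in Java terms, here is an illustrative sketch (the SequenceIdioms class is hypothetical, not a real API): AtomicLong.incrementAndGet() already has fetch-and-add semantics at the API level and needs no retry loop in user code, so it could be implemented with a single LOCK XADD; at the time of writing, however, Hotspot compiles it to the same LOCK CMPXCHG retry loop as the explicit CAS idiom, which is exactly the missing mapping referred to above.

import java.util.concurrent.atomic.AtomicLong;

public final class SequenceIdioms
{
    private final AtomicLong sequence = new AtomicLong(0L);

    // CAS idiom: an explicit retry loop that can fail and spin under contention.
    public long nextViaCas()
    {
        long current;
        do
        {
            current = sequence.get();
        }
        while (!sequence.compareAndSet(current, current + 1));

        return current + 1;
    }

    // Fetch-and-add idiom: no retry loop in user code; it could be implemented
    // with a single LOCK XADD once the JVM maps it to that instruction.
    public long nextViaFetchAndAdd()
    {
        return sequence.incrementAndGet();
    }
}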

As to the performance I've observed with Sandybridge, it would be great if others could confirm my findings so we can all feed back to Intel and have this addressed.  I've not been able to get my hands on a server class system with Sandybridge.  I can confirm that for the "tick" to Westmere, the performance is similar to Nehalem and not an issue.  The "tock" to Sandybridge seems to have introduced the issue.

Update: After discussions with Intel I wrote the following blog entry.

28 comments:

  1. Hi Martin,

    Did you try to change your run() method on your Sandybridge machines as follows:

    public void run()
    {
        long i = 0L;
        final long count = COUNT;
        while (i < count)
        {
            i = counter.incrementAndGet();
        }
    }

    Maybe it's a Java compiler problem on Sandybridge processors, so that during bytecode interpretation it does not substitute the COUNT constant but fetches it from memory in every loop, causing a cache line clash?

    I don't have any Sandybridge machine, so I cannot check the result.

    Regards,

    Roman

  2. Thanks Roman,

    This change does not make any difference. For reference I got Hotspot to dump the assembler it generated for both platforms and it was identical. Also note the same issue occurs from C++ and I confirmed the assembler generated for that too.

  3. Hi Martin,

    I was wondering whether you were using a 64-bit OS - it's not explicit in the article (but I expect you are).

    If the C++ code you executed was exactly the same as the one you pasted in the post, the reason the xadd and add are so similar might be because it seems that you're calling __sync_add_and_fetch in both.

    You could ask the folk at http://www.asmcommunity.net/board/ for their opinion on the CAS performance issue.

    Regards

    Michael

  4. Michael,

    I am using 64-bit Fedora 15, Ubuntu 11.04 and Windows 7.

    __sync_add_and_fetch() will be replaced with LOCK XADD if you use the return value, and LOCK ADD if you ignore the return value.

    Thanks for the link. I'll have a look.

    Martin...

    I initially added and implemented CAS for JSR166 (Java 5) because (1) it is a universal consensus primitive (see Herlihy & Shavit), (2) it can be easily emulated on machines with LL/SC (such as ARM and POWER), (3) the results are informative, because a CAS failure indicates contention so library writers have a chance to perform alternative actions, and (4) at the time, there was no significant advantage to using LOCKED instructions like xadd on x86s. So, no conspiracy. Given that there is now a significant performance difference on some platforms, I support creating additional atomics mappings, even though issue (3) means that they probably won't be used all that much inside java.util.concurrent.

  6. Doug,

    Many thanks for the feedback and history lesson. It was pure conjecture on my part as to why this was not implemented, however I do have experience with Sun from the early days of Java that has coloured my view ;-)

    I think we are long overdue taking advantage of LOCK instructions on x86 and the likes of the PAUSE instruction in busy spin loops so Java can be up with the best on this significant platform.

    BTW have you seen similar performance issues with Sandybridge?

  7. Martin: Yes, I also seem to see worse degradation under contention on SandyBridge than Nehalem. But not otherwise atypically bad -- we usually do anything we can to avoid contention problems so it doesn't impact us as much.

  8. Doug,

    In my own testing I'm finding your ArrayBlockingQueue reduced in performance on Sandybridge to only 20-60% of what it can do on Nehalem for some tests.

    http://code.google.com/p/disruptor/source/browse/#svn%2Ftrunk%2Fcode%2Fsrc%2Fperf%2Fcom%2Flmax%2Fdisruptor

    The "Multicast" scenario see the most significant degradation.

    On a positive note I'm seeing LinkedTransferQueue.transfer() doing better on Sandybridge. As you say, you must be avoiding the contention a bit more.

    Sandybridge 2.2 GHz, Java 1.7.0 on Ubuntu 11.04

    ArrayBlockingQueue
    ==================
    UniCast1P1CPerfTest OpsPerSecond run 0: BlockingQueues=3,536,985, Disruptor=42,390,843
    UniCast1P1CPerfTest OpsPerSecond run 1: BlockingQueues=3,480,318, Disruptor=47,468,354
    UniCast1P1CPerfTest OpsPerSecond run 2: BlockingQueues=3,462,204, Disruptor=43,630,017

    LinkedTransferQueue
    ===================
    UniCast1P1CTransferQueuePerfTest OpsPerSecond run 0: BlockingQueues=4,342,728, Disruptor=45,808,520
    UniCast1P1CTransferQueuePerfTest OpsPerSecond run 1: BlockingQueues=4,211,708, Disruptor=45,871,559
    UniCast1P1CTransferQueuePerfTest OpsPerSecond run 2: BlockingQueues=4,220,834, Disruptor=46,511,627

    Please run the tests for yourself to see the difference.

  9. Hi Martin,

    A couple of ideas:

    1) While using the operation for CMPXCHG, are you sure that you aren't hitting a barrier while 'value + 1' is retrieved?
    So:
    __sync_bool_compare_and_swap(&counter, value, value + 1)
    generates one more operation?

    2) Which GCC are you compiling with, and what are the options? For tests, wouldn't it be easier to go for asm inlining?

    3) What mainboard are you on? Maybe it sends a false signal to the CPU and it thinks it's on a multi-socket system (thus involving QPI on each swap).

  10. Stic,

    1) I don't understand what you mean by this. The "value + 1" will be evaluated before LOCK CMPXCHG is called because lock instructions have total memory order. It is required for a CAS operation, if you are thinking of it as additional to what you see with the XADD example. The issue comes from contention with increasing thread count for the *same* number of operations executed but divided among the threads.

    2) g++ -O3 -lpthread -lrt -o atomic_inc atomic_inc.cpp

    Looking at the output with "objdump -d atomic_inc" I can see it is using the correct instructions. The same binary is used on both platforms and I get the same results from Java.

    3) I've tried on 4 systems with different CPUs and main boards with the same results. At present only single socket Sandybridge systems are available. For the posted results I tested on a Compal PBL21 motherboard with Intel HM65 Express Chipset.

  11. Martin,
    1) I probably missed something here, but won't this exact evaluation (v+1) introduce congestion?
    Don't you need to look up the latest v to evaluate it, whilst another thread might have a lock on it for cmpxchg?
    2) 4.6.*, right? However, if you think it's fine, then it's fine :-)
    3) Google doesn't show anything obvious, so please ignore it

  12. Stic,

    1) Have you noticed I use "value" and not "counter"? The value variable is local and qualified as register.

    2) Sorry missed that. g++ -v = 4.6.0 20110603

  13. Hi,
    1) First, I hardly have any experience with analysing code from this perspective, but:
    value = counter - as counter is declared volatile, gcc might not apply proper optimisation.
    Still you shouldn't hit a memory fence (as volatile doesn't provide that in C++), but would you mind sharing what asm is generated for this operation?

    It might be that the optimiser postpones this assignment to the __sync_bool_compare_and_swap call (as before that there is no use of it).

    In the worst case it could even try to pass in counter and counter + 1.
    I doubt it, but ...

    Then of course it might be that you just keep looping as the compare fails more and more often as the thread count goes up.

    2) Have you tried to compile with -march set to some older architecture?
    -march=x86-64 -mtune=generic for example

    3) The most obvious thing probably: is the asm for Nehalem and Sandy Bridge the same?

  14. Stic,

    I'm pretty certain the issue is with the processor and not the compiler. It is exactly the same binary and instructions being executed. I also get the same results from Java. The asm for the CAS method is below.

    atomic_cas: file format elf64-x86-64
    ...
    0000000000400c70 <_Z7run_casPv>:
    400c70: 48 8b 15 89 07 20 00 mov 0x200789(%rip),%rdx # 601400
    400c77: 48 8d 4a 01 lea 0x1(%rdx),%rcx
    400c7b: 48 89 d0 mov %rdx,%rax
    400c7e: f0 48 0f b1 0d 79 07 lock cmpxchg %rcx,0x200779(%rip) # 601400
    400c85: 20 00
    400c87: 75 e7 jne 400c70 <_Z7run_casPv>
    400c89: 48 81 fa ff 64 cd 1d cmp $0x1dcd64ff,%rdx
    400c90: 76 de jbe 400c70 <_Z7run_casPv>
    400c92: f3 c3 repz retq
    400c94: 66 66 66 2e 0f 1f 84 data32 data32 nopw %cs:0x0(%rax,%rax,1)
    400c9b: 00 00 00 00 00

    Please run objdump -d on the binary you create yourself to see the results.

  15. Hi,

    I tried to run it with -S (easier to see what is what), but I got a tad different code. Will try at home later. The difference is that for some reason I'm not getting R* registers (need to fight gcc on 32-bit Windows to compile in 64-bit mode).

    It definitely seems strange - have you got anyone from Intel on the case?

    Regards,

  16. Hi Martin,

    I have run the test [Java CAS] on Westmere-EP (32nm) - 6 hyper-threaded cores [2 processor box]

    The result is as follows.

    duration = 5141724000
    counter = 500000000

    duration = 15238567000
    counter = 500000001

    duration = 21312679000
    counter = 500000002

    duration = 18807963000
    counter = 500000003

    duration = 18548563000
    counter = 500000004

    duration = 18985587000
    counter = 500000005

    duration = 18276933000
    counter = 500000006

    duration = 17903526000
    counter = 500000007

    duration = 17878178000
    counter = 500000008

    duration = 17435337000
    counter = 500000009

    Cheers
    Manish

  17. Thanks Manish,

    I see similar results on Westmere. It seems the issue was not introduced until Sandybridge. If you "taskset" the tests to stay on one socket the results should be more stable.

  18. Hi Martin,

    I'm quite interested in seeing the assembly code generated by HotSpot for something else I'm looking at, however, the "-XX:+PrintOptoAssembly" option, if that's what you're using, requires a debug build of the JVM. Do you need special access to be able to download this? I've been unable to find it on Oracle's site anywhere. Any pointers?

    Cheers :)

  19. Derek,

    You do need a debug build prior to Java 7. Now you can install Java 7 and do the following.

    1. Pick your binary from the link below and drop it in $JAVA_HOME/jre/lib/amd64/server

    http://kenai.com/projects/base-hsdis/downloads

    2. Read up on what to do

    http://wikis.sun.com/display/HotSpotInternals/PrintAssembly

    3. If impatient then just run your program as follows to print the assembly

    $JAVA_HOME/bin/java -XX:+UnlockDiagnosticVMOptions -XX:CompileCommand=print,*.class_name.method_name java_options

  20. Thanks Martin. That worked great, other than that I had to call the file "libhsdis-amd64.so" or it wouldn't pick it up.

  21. I just had a thought about the run_cas implementation where you're using __sync_bool_compare_and_swap: you read the value of counter in both the success and the failure case. Since in the failure case cmpxchg returns the current value of the dest operand in RAX, it's possible to attempt a write immediately based on that value without having to read counter first. Using __sync_val_compare_and_swap should make this possible:

    void* run_cas2(void*) {
        register uint64 value = 0;
        register uint64 next = 0;
        while (value < COUNT) {
            value = counter;
            do {
                next = value + 1;
                value = __sync_val_compare_and_swap(&counter, value, next);
            }
            while (value != next);
        }
    }

    The assembly at level -O3 looks OK. RAX is compared with RDX ('next') and skips the read at 0x4004b0 if the CAS was unsuccessful:

    00000000004004b0 <_Z8run_cas2Pv>:
    4004b0: mov 0x332a19(%rip),%rax # 732ed0
    4004b7: lea 0x1(%rax),%rdx
    4004bb: lock cmpxchg %rdx,0x332a0c(%rip) # 732ed0
    4004c4: cmp %rax,%rdx
    4004c7: jne 4004b7 <_Z8run_cas2Pv+0x7>
    4004c9: cmp $0x2faf07f,%rax
    4004cf: jbe 4004b0 <_Z8run_cas2Pv>
    4004d1: repz retq

    A quick run through on my (mains powered) Core i7 laptop suggests this is noticeably slower, but then the amount of time spent reading counter is insignificant compared to the price of a locked cmpxchg. One explanation might be that counter is being mutated at bus-locking speed such that even interposing cmp, jmp, inc (well, lea) before retrying is apparently time enough for its value to have changed.

    Regards

    Michael

  22. Thinking about it, it should come as no surprise that higher contention shows worse performance. The read in the original run_cas must be acting as a back-off limiter and leading to fewer failed locked CAS operations.

    It would be interesting to see the performance of the alternate run_cas implementation on Nehalem which would provide another data-point in the Sandy Bridge comparison.

  23. Mike,

    I don't think this is a contention problem. On Sandybridge I see it take an additional 40% duration at 1 thread. This gradually tails off to an additional 10% duration at 8 threads.

    On Nehalem I see a similar profile with an additional 100% duration at 1 thread tailing off to an additional 20% duration at 8 threads.

    There seems to be a higher workload per operation with this approach. I took the original CAS approach because it is what Java's AtomicLong does.

  24. Great info.

    Is this still an issue with Sandy Bridge server chips (as opposed to desktop and laptop chips)?

  25. Geoff,

    The folk at Intel have confirmed my findings on the new server class Sandybridge chips. I'm awaiting a response on how they plan to deal with it.

  26. Replies
    1. Yes, there is a follow-up blog entry.

      http://mechanical-sympathy.blogspot.com/2013/01/further-adventures-with-cas.html
