Mechanical Sympathy: Atomic

Sequencing events between threads is a common operation for many multi-threaded algorithms. These sequences could be used for assigning identity to orders, trades, transactions, messages, events, etc. Within the Disruptor we use a monotonic sequence for all events which is implemented as AtomicLong incrementAndGet for the multi-threaded publishing scenario.

While working on the latest version of the Disruptor I made some changes which I was convinced would improve performance, however the results surprised me. I had removed some potentially megamorphic method calls and the performance got worse rather than better. After a lot of investigation, I discovered that the megamorphic method calls were hiding a performance issue with the latest Intel Sandybridge processors. With the megamorphic calls out of the way, the contention on the atomic sequence generation increased exposing the issue. I've also observed this performance issue with other Java concurrent structures such as ArrayBlockingQueue.

I’ve been running various benchmarks on Sandybridge and have so far been impressed with performance improvements over Nehalem, especially for memory intensive applications due to the changes in its front-end. However with this sequencing benchmark, I discovered that Sandybridge has taken a major step backward in performance with regard to atomic instructions.

Atomic instructions enable read-modify-write actions to be combined into an atomic operation. A good example is incrementing a counter. To complete the increment operation a thread must read the current value, increment it, and then write back the results. In a multi-threaded environment these distinct operations could interleave with other threads doing the same with corrupt results as a consequence. The normal way to avoid this interleaving is to take out a lock for mutual exclusion while performing the steps. Locks are very expensive and often require kernel arbitration between threads. Modern CPUs provide a number of atomic instructions which allow operations such as atomically incrementing a counter, or the ability to conditional set a pointer reference if the value is still as expected. These operations are commonly referred to as CAS (Compare And Swap) instructions. A good way to think of these CAS instructions is like optimistic locks, similar to what you experience when using a version control system like Subversion or CVS. You try to make a change and if the version is what you expect then you succeed, otherwise the action aborts.

On x86/x64 these instructions are known as “lock” instructions. The "lock" name comes from how a processor, after setting its lock signal, would lock the front-side/memory bus (FSB) for serialising memory access while the three steps of the operation took place atomically. On more recent processors the lock instruction is simply implemented by getting an exclusive lock on the cache-line for modification.

These instructions are the basic building blocks used for implementing higher-level locks and semaphores. This is, as will be explained shorty, why I've seen performing issues on Sandybridge for ArrayBlockingQueue in some of the Disruptor comparative performance tests.

Back to my benchmark. The test was spending significantly more time in AtomicLong.incrementAndGet() than I had previously observed. Initially, I suspected an issue with JDK 1.6.0_27 which I had just installed. I ran the following test with various JVMs, including 1.7.0, and kept getting the same results. I then booted different operating systems (Ubuntu, Fedora, Windows 7 - all 64-bit), again the same results. This lead me to write an isolated test which I ran on Nehalem (2.8 GHz Core i7 860) and Sandybridge (2.2Ghz Core i7-2720QM).

import java.util.concurrent.atomic.AtomicLong;

public final class TestAtomicIncrement
    implements Runnable
{
    public static final long COUNT = 500L * 1000L * 1000L;
    public static final AtomicLong counter = new AtomicLong(0L);

    public static void main(final String[] args) throws Exception
    {
        final int numThreads = Integer.parseInt(args[0]);
        final long start = System.nanoTime();
        runTest(numThreads);
        System.out.println("duration = " + (System.nanoTime() - start));
        System.out.println("counter = " + counter);
    }

    private static void runTest(final int numThreads)
        throws InterruptedException
    {
        Thread[] threads = new Thread[numThreads];

        for (int i = 0; i < threads.length; i++)
        {
            threads[i] = new Thread(new TestAtomicIncrement());
        }

        for (Thread t : threads)
        {
            t.start();
        }

        for (Thread t : threads)
        {
            t.join();
        }
    }

    public void run()
    {
        long i = 0L;
        while (i < COUNT)
        {
            i = counter.incrementAndGet();
        }
    }
}

Figure 1.

After running this test on 4 different Sandybridge processors with a range of clock speeds, I concluded that using LOCK CMPXCHG, under contention with increasing numbers of threads, is much less scalable than the previous Nehalem generation of processors. Figure 1. above charts the results in nanoseconds duration to complete 500 million increments of a counter with increasing thread count. Less is better.

I confirmed the JVM was generating the correct instructions for the CAS loop by getting Hotspot to print the assembler it generated. I also confirmed that Hotspot generated identical assembler instructions for both Nehalem and Sandybridge.

I then decided to investigate further and write the following C++ program to test the relevant lock instructions to compare Nehalem and Sandybridge. I know from using “objdump -d” on the binary that the GNU Atomic Builtins generate the lock instructions for ADD, XADD, and CMPXCHG, for the respectively named functions below.

#include <time.h>
#include <pthread.h>
#include <stdlib.h>
#include <iostream>

typedef unsigned long long uint64;
const uint64 COUNT = 500LL * 1000LL * 1000LL;
volatile uint64 counter = 0;

void* run_add(void* numThreads)
{
    register uint64 value = (COUNT / *((int*)numThreads)) + 1;
    while (--value != 0)
    {
        __sync_add_and_fetch(&counter, 1);
    }
}

void* run_xadd(void*)
{
    register uint64 value = counter;
    while (value < COUNT)
    {
        value = __sync_add_and_fetch(&counter, 1);
    }
}

void* run_cas(void*)
{
    register uint64 value = 0;
    while (value < COUNT)
    {
        do
        {
            value = counter;
        }
        while (!__sync_bool_compare_and_swap(&counter, value, value + 1));
    }
}

int main (int argc, char* argv[])
{
    const int NUM_THREADS = atoi(argv[1]);

    pthread_t threads[NUM_THREADS];
    void* status;
    timespec ts_start;
    timespec ts_finish;
    clock_gettime(CLOCK_MONOTONIC, &ts_start);

    for (int i = 0; i < NUM_THREADS; i++)
    {
        pthread_create(&threads[i], NULL, run_add, (void*)&NUM_THREADS);
    }

    for (int i = 0; i < NUM_THREADS; i++)
    {
        pthread_join(threads[i], &status);
    }

    clock_gettime(CLOCK_MONOTONIC, &ts_finish);

    uint64 start = (ts_start.tv_sec * 1000000000LL) + ts_start.tv_nsec;
    uint64 finish = (ts_finish.tv_sec * 1000000000LL) + ts_finish.tv_nsec;
    uint64 duration = finish - start;

    std::cout << "threads = "  << NUM_THREADS << std::endl;
    std::cout << "duration = " << duration << std::endl;
    std::cout << "counter = "  << counter << std::endl;

    return 0;
}

Figure 2.

It is clear from Figure 2. that Nehalem performs nearly an order of magnitude better for atomic operations as contention increases with threads. I found LOCK ADD and LOCK XADD to be similar so I've only charted XADD for clarity. The CAS operations for C++ and Java are comparable.

It is also very interesting how XADD greatly outperforms CAS and gives a nice scalable profile. For 3 threads and above, XADD does not degrade further and simply performs at the rate at which the processor can keep the caches coherent. Nehalem and Sandybridge level out respectively at ~100m and ~20m XADD operations per second for 3+ concurrent threads, whereas CAS continues to degrade with increasing thread count because of contention. Naturally, performance degrades when QPI links are involved for a multi-socket scenario. Oracle have now accepted that not supporting XADD is a bug and will hopefully fix it soon for the JVM.

As to the performance I’ve observed with Sandybridge, it would be great if others could confirm my findings so we can all feedback to Intel and have this addressed. I've not been able to get my hands on a server class system with Sandybridge. I can confirm that for the "tick" to Westmere, the performance is similar to Nehalem and not an issue. The "tock" to Sandybridge seems to introduce the issue.

Update: After discussions with Intel I wrote the following blog entry.

Mechanical Sympathy

Sunday, 11 September 2011

Adventures with AtomicLong