Mechanical Sympathy: Locks

Showing posts with label Locks. Show all posts

Monday, 26 August 2013

Lock-Based vs Lock-Free Concurrent Algorithms

Last week I attended a review session of the new JSR166 StampedLock run by Heinz Kabutz at the excellent JCrete unconference. StampedLock is an attempt to address the contention issues that arise in a system when multiple readers concurrently access shared state. StampedLock is designed to perform better than ReentrantReadWriteLock by taking an optimistic read approach.

While attending the session a couple of things occurred to me. Firstly, I thought it was about time I reviewed the current status of Java lock implementations. Secondly, that although StampedLock looks like a good addition to the JDK, it seems to miss the fact that lock-free algorithms are often a better solution to the multiple reader case.

Test Case

To compare implementations I needed an API test case that would not favour a particular approach. For example, the API should be garbage free and allow the methods to be atomic. A simple test case is to design a spaceship that can be moved around a 2-dimensional space with the coordinates of its position available to be read atomically. At least 2 fields need to be read, or written, per transaction to make the concurrency interesting.

/**
 * Interface to a concurrent representation of a ship that can move around
 * a 2 dimensional space with updates and reads performed concurrently.
 */
public interface Spaceship
{
    /**
     * Read the position of the spaceship into the array of coordinates provided.
     *
     * @param coordinates into which the x and y coordinates should be read.
     * @return the number of attempts made to read the current state.
     */
    int readPosition(final int[] coordinates);

    /**
     * Move the position of the spaceship by a delta to the x and y coordinates.
     *
     * @param xDelta delta by which the spaceship should be moved in the x-axis.
     * @param yDelta delta by which the spaceship should be moved in the y-axis.
     * @return the number of attempts made to write the new coordinates.
     */
    int move(final int xDelta, final int yDelta);
}

The above API would be cleaner by factoring out an immutable Position object but I want to keep it garbage free and create the need to update multiple internal fields with minimal indirection. This API could easily be extended for a 3-dimensional space and require the implementations to be atomic.

Multiple implementations are built for each spaceship and exercised by a test harness. All the code and results for this blog can be found here.

The test harness will run each of the implementations in turn by using a megamorphic dispatch pattern to try and prevent inlining, lock-coarsening, and loop unrolling when accessing the concurrent methods.

Each implementation is subjected to 4 distinct threading scenarios that result in different contention profiles:

1 reader - 1 writer
2 readers - 1 writer
3 readers - 1 writer
2 readers - 2 writers

All tests are run with 64-bit Java 1.7.0_25, Linux 3.6.30, and a quad core 2.2GHz Ivy Bridge i7-3632QM. Throughput is measured over 5 second periods for each implementation with the tests repeated 5 times to ensure sufficient warm up. The results below are throughputs averaged per second over 5 runs. To approximate a typical Java deployment, no thread affinity or core isolation has been employed which would have reduced variance significantly.

Note: Other CPUs and operating systems can produce very different results.

Results

Figure 1.

Figure 2.

Figure 3.

Figure 4.

The raw data for the above charts can be found here.

Analysis

The real surprise for me from the results is the performance of ReentrantReadWriteLock. I cannot see a use for this implementation beyond a case whereby there is a huge balance of reads and very little writes. My main takeaways are:

StampedLock is a major improvement over existing lock implementations especially with increasing numbers of reader threads.
StampedLock has a complex API. It is very easy to mistakenly call the wrong method for locking actions.
Synchronised is a good general purpose lock implementation when contention is from only 2 threads.
ReentrantLock is a good general purpose lock implementation when thread counts grow as previously discovered.
Choosing to use ReentrantReadWriteLock should be based on careful and appropriate measurement. As with all major decisions, measure and make decisions based on data.
Lock-free implementations can offer significant throughput advantages over lock-based algorithms.

Conclusion

It is nice seeing the influence of lock-free techniques appearing in lock-based algorithms. The optimistic strategy employed on read is effectively a lock-free algorithm at the times when a writer is not updating.

In my experience of teaching and developing lock-free algorithms, not only do they provide significant throughput advantages as evidenced here, they also provide much lower and less variance in latency.

Tuesday, 22 November 2011

Biased Locking, OSR, and Benchmarking Fun

After my last post on Java Lock Implementations, I got a lot of good feedback about my results and micro-benchmark design approach. As a result I now understand JVM warmup, On Stack Replacement (OSR) and Biased Locking somewhat better than before. Special thanks to Dave Dice from Oracle, and Cliff Click & Gil Tene from Azul, for their very useful feedback.

In the last post I concluded, based on my experiments, that biased locking was no longer necessary on modern CPUs. While this conclusion is understandable given the data gathered in the experiment, it was not valid because the experiment did not take account of some JVM warm up behaviour that I was unaware of.

In this post I will re-run the experiment taking into account the feedback and present some new results. I shall also expand on the changes I've made to the test and why it is important to consider the JVM warm-up behaviour when writing micro-benchmarks, or even very lean Java applications with quick start up time.

On Stack Replacement (OSR)

Java virtual machines will compile code to achieve greater performance based on runtime profiling. Some VMs run an interpreter for the majority of code and replace hot areas with compiled code following the 80/20 rule. Other VMs compile all code simply at first then replace the simple code with more optimised code based on profiling. Oracle Hotspot and Azul are examples of the first type and Oracle JRockit is an example of the second.

Oracle Hotspot will count invocations of a method return plus branch backs for loops in that method, and if this exceeds 10K in server mode the method will be compiled. The compiled code on normal JIT'ing can be used when the method is next called. However if a loop is still iterating it may make sense to replace the method before the loop completes, especially if it has many iterations to go. OSR is the means by which a method gets replaced with a compiled version part way through iterating a loop.

I was under the impression that normal JIT'ing and OSR would result in similar code. Cliff Click pointed out that it is much harder for a runtime to optimise a loop part way through, and especially difficult if nested. For example, bounds checking within the loop may not be possible to eliminate. Cliff will blog in more detail on this shortly.

What this means is that you are likely to get better optimised code by doing a small number of shorter warm ups than a single large one. You can see in the code below how I do 10 shorter runs in a loop before the main large run compared to the last article where I did a single large warm-up run.

Biased Locking

Dave Dice pointed out that Hotspot does not enable objects for biased locking in the first few seconds (4s at present) of JVM startup. This is because some benchmarks, and NetBeans, have a lot of thread contention on start up and the revocation cost is significant.

All objects by default are created with biased locking enabled in Oracle Hotspot after the first few seconds of start-up delay, and can be configured with -XX:BiasedLockingStartupDelay=0.

This point, combined with knowing more about OSR, is important for micro-benchmarks. It is also important to be aware of these points if you have a lean Java application that starts in a few seconds.

The Code

import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;
import java.util.concurrent.CyclicBarrier;

import static java.lang.System.out;

public final class TestLocks implements Runnable
{
    public enum LockType {JVM, JUC}
    public static LockType lockType;

    public static final long WARMUP_ITERATIONS = 100L * 1000L;
    public static final long ITERATIONS = 500L * 1000L * 1000L;
    public static long counter = 0L;

    public static final Object jvmLock = new Object();
    public static final Lock jucLock = new ReentrantLock();
    private static int numThreads;

    private final long iterationLimit;
    private final CyclicBarrier barrier;

    public TestLocks(final CyclicBarrier barrier, final long iterationLimit)
    {
        this.barrier = barrier;
        this.iterationLimit = iterationLimit;
    }

    public static void main(final String[] args) throws Exception
    {
        lockType = LockType.valueOf(args[0]);
        numThreads = Integer.parseInt(args[1]);

        for (int i = 0; i < 10; i++)
        {
            runTest(numThreads, WARMUP_ITERATIONS);
            counter = 0L;
        }

        final long start = System.nanoTime();
        runTest(numThreads, ITERATIONS);
        final long duration = System.nanoTime() - start;

        out.printf("%d threads, duration %,d (ns)\n", numThreads, duration);
        out.printf("%,d ns/op\n", duration / ITERATIONS);
        out.printf("%,d ops/s\n", (ITERATIONS * 1000000000L) / duration);
        out.println("counter = " + counter);
    }

    private static void runTest(final int numThreads, final long iterationLimit)
        throws Exception
    {
        CyclicBarrier barrier = new CyclicBarrier(numThreads);
        Thread[] threads = new Thread[numThreads];

        for (int i = 0; i < threads.length; i++)
        {
            threads[i] = new Thread(new TestLocks(barrier, iterationLimit));
        }

        for (Thread t : threads)
        {
            t.start();
        }

        for (Thread t : threads)
        {
            t.join();
        }
    }

    public void run()
    {
        try
        {
            barrier.await();
        }
        catch (Exception e)
        {
            // don't care
        }

        switch (lockType)
        {
            case JVM: jvmLockInc(); break;
            case JUC: jucLockInc(); break;
        }
    }

    private void jvmLockInc()
    {
        long count = iterationLimit / numThreads;
        while (0 != count--)
        {
            synchronized (jvmLock)
            {
                ++counter;
            }
        }
    }

    private void jucLockInc()
    {
        long count = iterationLimit / numThreads;
        while (0 != count--)
        {
            jucLock.lock();
            try
            {
                ++counter;
            }
            finally
            {
                jucLock.unlock();
            }
        }
    }
}

Script to run tests:

set -x
for i in {1..8}
do
    java -server -XX:-UseBiasedLocking TestLocks JVM $i
done

for i in {1..8}
do
    java -server -XX:+UseBiasedLocking -XX:BiasedLockingStartupDelay=0 TestLocks JVM $i
done

for i in {1..8}
do
    java -server TestLocks JUC $i
done

Results

The tests are carried out with 64-bit Linux (Fedora Core 15) and Oracle JDK 1.6.0_29.

Nehalem 2.8GHz - Ops/Sec

Threads

-UseBiasedLocking

+UseBiasedLocking

ReentrantLock

53,283,461

450,950,969

62,876,566

18,519,295

18,108,615

10,217,186

13,349,605

13,416,198

14,108,622

8,120,172

8,040,773

14,207,310

4,725,114

4,551,766

14,302,683

5,133,706

5,246,548

14,676,616

5,473,652

5,585,666

18,145,525

5,514,056

5,414,171

19,010,725

Sandy Bridge 2.0GHz - Ops/Sec

Threads

-UseBiasedLocking

+UseBiasedLocking

ReentrantLock

34,500,407

396,511,324

43,148,808

20,899,076

19,742,639

6,038,923

9,288,039

11,957,032

24,147,807

5,618,862

5,589,289

9,082,961

5,609,932

5,592,574

9,389,243

5,742,907

5,760,558

12,518,728

6,699,201

6,641,886

13,684,475

6,957,824

6,925,410

14,819,005

Observations

Biased locking has a huge benefit in the un-contended single threaded case.
Biased locking when un-contended, and not revoked, only adds 4-5 cycles of cost. This is the cost when having a cache hit for the lock structures, on top of the code protected in the critical section.
-XX:BiasedLockingStartupDelay=0 needs to be set for lean applications and micro-benchmarks.
Avoiding OSR does not make a material difference to this set of test results. This is likely to be because the loop is so simple or other costs are dominating.
For the current implementations, ReentrantLocks scale better than synchronised locks under contention, except in the case of 2 contending threads.

Conclusion

My tests in the last post are invalid for the testing of an un-contended biased lock, because the lock was not actually biased. If you are designing code following the single writer principle, and therefore having un-contended locks when using 3rd party libraries, then having biased locking enabled is a significant performance boost.

Saturday, 19 November 2011

Java Lock Implementations

We all use 3rd party libraries as a normal part of development. Generally, we have no control over their internals. The libraries provided with the JDK are a typical example. Many of these libraries employ locks to manage contention.

JDK locks come with two implementations. One uses atomic CAS style instructions to manage the claim process. CAS instructions tend to be the most expensive type of CPU instructions and on x86 have memory ordering semantics. Often locks are un-contended which gives rise to a possible optimisation whereby a lock can be biased to the un-contended thread using techniques to avoid the use of atomic instructions. This biasing allows a lock in theory to be quickly reacquired by the same thread. If the lock turns out to be contended by multiple threads the algorithm with revert from being biased and fall back to the standard approach using atomic instructions. Biased locking became the default lock implementation with Java 6.

When respecting the single writer principle, biased locking should be your friend. Lately, when using the sockets API, I decided to measure the lock costs and was surprised by the results. I found that my un-contended thread was incurring a bit more cost than I expected from the lock. I put together the following test to compare the cost of the current lock implementations available in Java 6.

The Test

For the test I shall increment a counter within a lock, and increase the number of contending threads on the lock. This test will be repeated for the 3 major lock implementations available to Java:

Atomic locking on Java language monitors
Biased locking on Java language monitors
ReentrantLock introduced with the java.util.concurrent package in Java 5.

I'll also run the tests on the 3 most recent generations of the Intel CPU. For each CPU I'll execute the tests up to the maximum number of concurrent threads the core count will support.

The tests are carried out with 64-bit Linux (Fedora Core 15) and Oracle JDK 1.6.0_29.

The Code

import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;
import java.util.concurrent.CyclicBarrier;

import static java.lang.System.out;

public final class TestLocks implements Runnable
{
    public enum LockType { JVM, JUC }
    public static LockType lockType;

    public static final long ITERATIONS = 500L * 1000L *1000L;
    public static long counter = 0L;

    public static final Object jvmLock = new Object();
    public static final Lock jucLock = new ReentrantLock();
    private static int numThreads;
    private static CyclicBarrier barrier;

    public static void main(final String[] args) throws Exception
    {
        lockType = LockType.valueOf(args[0]);
        numThreads = Integer.parseInt(args[1]);
        
        runTest(numThreads); // warm up
        counter = 0L;

        final long start = System.nanoTime();
        runTest(numThreads);
        final long duration = System.nanoTime() - start;

        out.printf("%d threads, duration %,d (ns)\n", numThreads, duration);
        out.printf("%,d ns/op\n", duration / ITERATIONS);
        out.printf("%,d ops/s\n", (ITERATIONS * 1000000000L) / duration);
        out.println("counter = " + counter);
    }

    private static void runTest(final int numThreads) throws Exception
    {
        barrier = new CyclicBarrier(numThreads);
        Thread[] threads = new Thread[numThreads];

        for (int i = 0; i < threads.length; i++)
        {
            threads[i] = new Thread(new TestLocks());
        }

        for (Thread t : threads)
        {
            t.start();
        }

        for (Thread t : threads)
        {
            t.join();
        }
    }

    public void run()
    {
        try
        {
            barrier.await();
        }
        catch (Exception e)
        {
            // don't care
        }

        switch (lockType)
        {
            case JVM: jvmLockInc(); break;
            case JUC: jucLockInc(); break;
        }
    }

    private void jvmLockInc()
    {
        long count = ITERATIONS / numThreads;
        while (0 != count--)
        {
            synchronized (jvmLock)
            {
                ++counter;
            }
        }
    }

    private void jucLockInc()
    {
        long count = ITERATIONS / numThreads;
        while (0 != count--)
        {
            jucLock.lock();
            try
            {
                ++counter;
            }
            finally
            {
                jucLock.unlock();
            }
        }
    }
}

Script the tests:

set -x
for i in {1..8}; do java -XX:-UseBiasedLocking TestLocks JVM $i; done
for i in {1..8}; do java -XX:+UseBiasedLocking TestLocks JVM $i; done
for i in {1..8}; do java TestLocks JUC $i; done

The Results

Figure 1.

Figure 2.

Figure 3.

Observations

Biased locking, in the un-contended case, is ~10% more expensive than the atomic locking. It seems that for recent CPU generations the cost of atomic instructions is less than the necessary housekeeping for biased locks. Previous to Nehalem, lock instructions would assert a lock on the memory bus to perform the these atomic operations, each would cost more than 100 cycles. Since Nehalem, atomic instructions can be handled local to a CPU core, and typically cost only 10-20 cycles if they do not need to wait on the store buffer to empty while enforcing memory ordering semantics.
As contention increases, language monitor locks quickly reach a throughput limit regardless of thread count.
ReentrantLock gives the best un-contended performance and scales significantly better with increasing contention compared to language monitors using synchronized.
ReentrantLock has an odd characteristic of reduced performance when 2 threads are contending. This deserves further investigation.
Sandybridge suffers from the increased latency of atomic instructions I detailed in a previous article when contended thread count is low. As contended thread count continues to increase, the cost of the kernel arbitration tends to dominate and Sandybridge shows its strength with increased memory throughput.

Conclusion

Biased locking should no longer be the default lock implementation on modern Intel processors. I recommend you measure your applications and experiement with the -XX:-UseBiasedLocking JVM option to determine if you can benefit from using atomic lock based algorithm for the un-contended case.

When developing your own concurrent libraries I would recommend ReentrantLock rather than using the synchronized keyword due to the significantly better performance on x86, if a lock-free alternative algorithm is not a viable option.

Update 20-Nov-2011

Dave Dice has pointed out that biased locking is not implemented for the locks created in the first few seconds of the JVM startup. I'll re-run my tests this week and post the results. I've had some more quality feedback that suggests my results could be potentially invalid. Micro benchmarks can be tricky but the advice of measuring your own application in the large still stands.

A re-run of the tests can be seen in this follow-on blog taking account of Dave's feedback.

Saturday, 5 November 2011

Locks & Condition Variables - Latency Impact

In a previous article on Inter-Thread Latency I showed how it is possible to signal a state change between 2 threads with less than 50ns of latency. To many developers, writing concurrent code using locks is a scary experience. Writing concurrent code using lock-free algorithms, i.e. algorithms that rely on the use of memory barriers and an intimate understanding of the underlying memory models, can be totally terrifying. To me lock-free / non-blocking algorithms are like playing with explosives or corrosive chemicals, if you do not understand what you are doing, or show the ultimate respect, then very bad things can, and most likely will, happen!

In this article, I'd like to illustrate the impact of using locks and the resulting latency they can impose on your designs. I want to use a very similar algorithm to that used in my previous inter-thread latency article to illustrate the ping-pong effect of handing control back and forth between 2 threads. In this case, rather than using a couple of volatile variables, I will employ a pair of condition variables to signal a state change so control can be passed back and forth.

The Code

import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

import static java.lang.System.out;

public final class LockedSignallingLatency
{
    private static final int ITERATIONS = 10 * 1000 * 1000;

    private static final Lock lock = new ReentrantLock();
    private static final Condition sendCondition = lock.newCondition();
    private static final Condition echoCondition = lock.newCondition();

    private static long sendValue = -1L;
    private static long echoValue = -1L;

    public static void main(final String[] args)
        throws Exception
    {
        final Thread sendThread = new Thread(new SendRunner());
        final Thread echoThread = new Thread(new EchoRunner());

        final long start = System.nanoTime();

        echoThread.start();
        sendThread.start();

        sendThread.join();
        echoThread.join();

        final long duration = System.nanoTime() - start;

        out.printf("duration %,d (ns)\n", duration);
        out.printf("%,d ns/op\n", duration / (ITERATIONS * 2L));
        out.printf("%,d ops/s\n", (ITERATIONS * 2L * 1000000000L) / duration);
    }

    public static final class SendRunner implements Runnable
    {
        public void run()
        {
            for (long i = 0; i < ITERATIONS; i++)
            {
                lock.lock();
                try
                {
                    sendValue = i;
                    sendCondition.signal();
                }
                finally
                {
                    lock.unlock();
                }

                lock.lock();
                try
                {
                    while (echoValue != i)
                    {
                        echoCondition.await();
                    }
                }
                catch (final InterruptedException ex)
                {
                    break;
                }
                finally
                {
                    lock.unlock();
                }

            }
        }
    }

    public static final class EchoRunner implements Runnable
    {
        public void run()
        {
            for (long i = 0; i < ITERATIONS; i++)
            {
                lock.lock();
                try
                {
                    while (sendValue != i)
                    {
                        sendCondition.await();
                    }
                }
                catch (final InterruptedException ex)
                {
                    break;
                }
                finally
                {
                    lock.unlock();
                }

                lock.lock();
                try
                {
                    echoValue = i;
                    echoCondition.signal();
                }
                finally
                {
                    lock.unlock();
                }
            }
        }
    }
}

Test Results

Windows 7 Professional 64-bit - Oracle JDK 1.6.0 - Nehalem 2.8 GHz

$ start /AFFINITY 0x14 /B /WAIT java LockedSignallingLatency
duration 41,649,616,343 (ns)
2,082 ns/op
480,196 ops/s

$ java LockedSignallingLatency
duration 73,789,456,491 (ns)
3,689 ns/op
271,041 ops/s

Linux Fedora Core 15 64-bit - Oracle JDK 1.6.0 - Nehalem 2.8 GHz

$ taskset -c 2,4 java LockedSignallingLatency

duration 40,469,689,559 (ns)

2,023 ns/op

494,197 ops/s

$ java LockedSignallingLatency
duration 169,795,756,230 (ns)
8,489 ns/op
117,788 ops/s

Linux Fedora Core 15 64-bit - Oracle JDK 1.6.0 - Sandybridge 2.0 GHz

$ taskset -c 2,4 java LockedSignallingLatency
duration 47,209,549,484 (ns)
2,360 ns/op
423,643 ops/s

$ java LockedSignallingLatency
duration 336,168,489,093 (ns)
16,808 ns/op
59,493 ops/s

Observations

The above is a typical set of results I've seen in the middle of the range from multiple runs. There are a couple of interesting observations I'd like to expand on.

Firstly, this is 3 orders of magnitude greater latency than what I illustrated in the previous article using just memory barriers to signal between threads. This cost comes about because the kernel needs to get involved to arbitrate between the threads for the lock, and then manage the scheduling for the threads to awaken when the condition is signalled. The one-way latency to signal a change is pretty much the same as what is considered current state of the art for network hops between nodes via a switch. It is possible to get ~1µs latency with InfiniBand and less than 5µs with 10GigE and user-space IP stacks.

Secondly, the impact is clear when letting the OS choose what CPUs the threads get scheduled on rather than pinning them manually. I've observed this same issue across many use cases whereby Linux, in default configuration for its scheduler, will greatly impact the performance of a low-latency system by scheduling threads on different cores resulting in cache pollution. Windows by default seems to make a better job of this.

I recently had an interesting discussion with Cliff Click about using condition variables and their cost. He pointed out a problem he was seeing. If you look at the case where a sleeping thread gets signalled within the lock, it goes to run and then discovers it cannot get the lock because the signalling thread already has the lock, so it gets put back to sleep until the signalling thread releases the lock, thus causing more work than necessary. Modern schedulers would benefit from being more aware of communication mechanisms between threads to have more efficient location and rescheduling logic. As we go more concurrent and parallel our schedulers need to become more aware of IPC mechanisms.

Conclusion

When designing a low-latency system it is crucial to avoid the use of locks and condition variables for the main transaction flows. Non-blocking or lock-free algorithms are key to achieving ultra-low latency but can be very difficult to prove correct. I would not recommend designing lock-free algorithms for business logic but they can be very effectively employed for low-level infrastructure components. The business logic is best run on single threads following the Single Writer Principle from my previous article.