Saturday 5 November 2011

Locks & Condition Variables - Latency Impact

In a previous article on Inter-Thread Latency I showed how it is possible to signal a state change between two threads with less than 50ns of latency.  To many developers, writing concurrent code using locks is a scary experience.  Writing concurrent code using lock-free algorithms, i.e. algorithms that rely on the use of memory barriers and an intimate understanding of the underlying memory models, can be totally terrifying.  To me, lock-free / non-blocking algorithms are like playing with explosives or corrosive chemicals: if you do not understand what you are doing, or fail to show them the utmost respect, then very bad things can, and most likely will, happen!

In this article, I'd like to illustrate the impact of using locks and the latency they can impose on your designs.  I will use an algorithm very similar to the one in my previous inter-thread latency article to show the ping-pong effect of handing control back and forth between two threads.  This time, rather than using a couple of volatile variables, I will employ a pair of condition variables to signal a state change so control can be passed back and forth.
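For contrast, the volatile-variable version from the previous article had roughly the following shape.  This is a sketch from memory rather than the original listing, and the class and field names are illustrative; the condition variable version actually under test follows below.

public final class VolatilePingPong
{
    private static final int ITERATIONS = 10 * 1000 * 1000;

    // Volatile fields provide the visibility guarantees; no lock is taken.
    private static volatile long pingValue = -1L;
    private static volatile long pongValue = -1L;

    public static void main(final String[] args) throws Exception
    {
        final Thread pongThread = new Thread(new Runnable()
        {
            public void run()
            {
                for (long i = 0; i < ITERATIONS; i++)
                {
                    while (pingValue != i)
                    {
                        // busy-spin until the next value is published
                    }
                    pongValue = i; // echo it back with a volatile write
                }
            }
        });

        final Thread pingThread = new Thread(new Runnable()
        {
            public void run()
            {
                for (long i = 0; i < ITERATIONS; i++)
                {
                    pingValue = i; // publish the value with a volatile write
                    while (pongValue != i)
                    {
                        // busy-spin until the echo arrives
                    }
                }
            }
        });

        final long start = System.nanoTime();
        pongThread.start();
        pingThread.start();
        pingThread.join();
        pongThread.join();

        final long duration = System.nanoTime() - start;
        System.out.printf("%,d ns/op%n", duration / (ITERATIONS * 2L));
    }
}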

The Code
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

import static java.lang.System.out;

public final class LockedSignallingLatency
{
    private static final int ITERATIONS = 10 * 1000 * 1000;

    // A single lock guards the shared values and both condition variables.
    private static final Lock lock = new ReentrantLock();
    private static final Condition sendCondition = lock.newCondition();
    private static final Condition echoCondition = lock.newCondition();

    // Shared state, only ever read or written while holding the lock.
    private static long sendValue = -1L;
    private static long echoValue = -1L;

    public static void main(final String[] args)
        throws Exception
    {
        final Thread sendThread = new Thread(new SendRunner());
        final Thread echoThread = new Thread(new EchoRunner());

        final long start = System.nanoTime();

        echoThread.start();
        sendThread.start();

        sendThread.join();
        echoThread.join();

        final long duration = System.nanoTime() - start;

        out.printf("duration %,d (ns)\n", duration);
        out.printf("%,d ns/op\n", duration / (ITERATIONS * 2L));
        out.printf("%,d ops/s\n", (ITERATIONS * 2L * 1000000000L) / duration);
    }

    public static final class SendRunner implements Runnable
    {
        public void run()
        {
            for (long i = 0; i < ITERATIONS; i++)
            {
                lock.lock();
                try
                {
                    // Publish the next value then wake the echo thread.
                    sendValue = i;
                    sendCondition.signal();
                }
                finally
                {
                    lock.unlock();
                }

                lock.lock();
                try
                {
                    // Loop on the predicate to guard against spurious wake-ups.
                    while (echoValue != i)
                    {
                        echoCondition.await();
                    }
                }
                catch (final InterruptedException ex)
                {
                    break;
                }
                finally
                {
                    lock.unlock();
                }
            }
        }
    }

    public static final class EchoRunner implements Runnable
    {
        public void run()
        {
            for (long i = 0; i < ITERATIONS; i++)
            {
                lock.lock();
                try
                {
                    // Wait for the next value to be published by the sender.
                    while (sendValue != i)
                    {
                        sendCondition.await();
                    }
                }
                catch (final InterruptedException ex)
                {
                    break;
                }
                finally
                {
                    lock.unlock();
                }

                lock.lock();
                try
                {
                    // Echo the value back then wake the send thread.
                    echoValue = i;
                    echoCondition.signal();
                }
                finally
                {
                    lock.unlock();
                }
            }
        }
    }
}
Test Results

Windows 7 Professional 64-bit - Oracle JDK 1.6.0 - Nehalem 2.8 GHz

$ start /AFFINITY 0x14 /B /WAIT java LockedSignallingLatency
duration 41,649,616,343 (ns)
2,082 ns/op
480,196 ops/s

$ java LockedSignallingLatency
duration 73,789,456,491 (ns)
3,689 ns/op
271,041 ops/s

Linux Fedora Core 15 64-bit - Oracle JDK 1.6.0 - Nehalem 2.8 GHz

$ taskset -c 2,4 java LockedSignallingLatency
duration 40,469,689,559 (ns)
2,023 ns/op
494,197 ops/s

$ java LockedSignallingLatency
duration 169,795,756,230 (ns)
8,489 ns/op
117,788 ops/s

Linux Fedora Core 15 64-bit - Oracle JDK 1.6.0 - Sandybridge 2.0 GHz

$ taskset -c 2,4 java LockedSignallingLatency
duration 47,209,549,484 (ns)
2,360 ns/op
423,643 ops/s

$ java LockedSignallingLatency
duration 336,168,489,093 (ns)
16,808 ns/op
59,493 ops/s

Observations

The above is a typical set of results, taken from the middle of the range observed over multiple runs.  There are a couple of interesting observations I'd like to expand on.

Firstly, this is roughly two orders of magnitude greater latency than I illustrated in the previous article using just memory barriers to signal between threads: the best pinned run above costs over 2,000ns per operation against less than 50ns previously, and the worst unpinned run is over 300 times slower.  This cost comes about because the kernel needs to get involved to arbitrate between the threads for the lock, and then manage the scheduling for the threads to awaken when the condition is signalled.  The one-way latency to signal a change is now comparable to what is considered state of the art for network hops between nodes via a switch.  It is possible to get ~1µs latency with InfiniBand and less than 5µs with 10GigE and user-space IP stacks.

Secondly, the impact is clear when letting the OS choose which CPUs the threads get scheduled on, rather than pinning them manually (the Windows affinity mask 0x14 and taskset -c 2,4 above both restrict the JVM to CPUs 2 and 4).  I've observed this same issue across many use cases whereby Linux, with its default scheduler configuration, will greatly hurt the performance of a low-latency system by scheduling threads onto different cores, resulting in cache pollution.  Windows by default seems to do a better job here.

I recently had an interesting discussion with Cliff Click about condition variables and their cost.  He pointed out a problem he was seeing: when a sleeping thread is signalled while the signaller still holds the lock, it wakes only to discover that it cannot acquire the lock, because the signalling thread has not yet released it, so it is put back to sleep until the lock is released, causing more work than necessary.  Modern schedulers would benefit from being more aware of the communication mechanisms between threads, so they could make more efficient placement and rescheduling decisions.  As we go more concurrent and parallel, our schedulers need to become more aware of IPC mechanisms.
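To make this concrete, here is the send side of the code above with that timeline spelled out in comments.  The comments are my illustration of the effect, not a quote from the discussion.

                lock.lock();
                try
                {
                    sendValue = i;
                    // The waiting echo thread is woken by this signal, but
                    // await() cannot return until it re-acquires the lock...
                    sendCondition.signal();
                }
                finally
                {
                    // ...which is only released here, so the woken thread
                    // blocks again and pays a second scheduling round trip.
                    lock.unlock();
                }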

Conclusion

When designing a low-latency system it is crucial to avoid the use of locks and condition variables for the main transaction flows.  Non-blocking or lock-free algorithms are key to achieving ultra-low latency, but they can be very difficult to prove correct.  I would not recommend designing lock-free algorithms for business logic, but they can be very effectively employed for low-level infrastructure components.  The business logic is best run on single threads following the Single Writer Principle from my previous article.
