Saturday 30 July 2011

False Sharing

Memory is stored within the cache system in units know as cache lines.  Cache lines are a power of 2 of contiguous bytes which are typically 32-256 in size.  The most common cache line size is 64 bytes.   False sharing is a term which applies when threads unwittingly impact the performance of each other while modifying independent variables sharing the same cache line.  Write contention on cache lines is the single most limiting factor on achieving scalability for parallel threads of execution in an SMP system.  I’ve heard false sharing described as the silent performance killer because it is far from obvious when looking at code.

To achieve linear scalability with number of threads, we must ensure no two threads write to the same variable or cache line.  Two threads writing to the same variable can be tracked down at a code level.   To be able to know if independent variables share the same cache line we need to know the memory layout, or we can get a tool to tell us.  Intel VTune is such a profiling tool.  In this article I’ll explain how memory is laid out for Java objects and how we can pad out our cache lines to avoid false sharing.

Figure 1.

Figure 1. above illustrates the issue of false sharing.  A thread running on core 1 wants to update variable X while a thread on core 2 wants to update variable Y.  Unfortunately these two hot variables reside in the same cache line.  Each thread will race for ownership of the cache line so they can update it.  If core 1 gets ownership then the cache sub-system will need to invalidate the corresponding cache line for core 2.  When Core 2 gets ownership and performs its update, then core 1 will be told to invalidate its copy of the cache line.  This will ping pong back and forth via the L3 cache greatly impacting performance.  The issue would be further exacerbated if competing cores are on different sockets and additionally have to cross the socket interconnect.

Java Memory Layout

For the Hotspot JVM, all objects have a 2-word header.  First is the “mark” word which is made up of 24-bits for the hash code and 8-bits for flags such as the lock state, or it can be swapped for lock objects.  The second is a reference to the class of the object.  Arrays have an additional word for the size of the array.  Every object is aligned to an 8-byte granularity boundary for performance.  Therefore to be efficient when packing, the object fields are re-ordered from declaration order to the following order based on size in bytes:
  1. doubles (8) and longs (8)
  2. ints (4) and floats (4)
  3. shorts (2) and chars (2)
  4. booleans (1) and bytes (1)
  5. references (4/8)
  6. <repeat for sub-class fields>
With this knowledge we can pad a cache line between any fields with 7 longs.  Within the Disruptor we pad cache lines around the RingBuffer cursor and BatchEventProcessor sequences.

To show the performance impact let’s take a few threads each updating their own independent counters.  These counters will be volatile longs so the world can see their progress.
public final class FalseSharing
    implements Runnable
{
    public final static int NUM_THREADS = 4; // change
    public final static long ITERATIONS = 500L * 1000L * 1000L;
    private final int arrayIndex;

    private static VolatileLong[] longs = new VolatileLong[NUM_THREADS];
    static
    {
        for (int i = 0; i < longs.length; i++)
        {
            longs[i] = new VolatileLong();
        }
    }

    public FalseSharing(final int arrayIndex)
    {
        this.arrayIndex = arrayIndex;
    }

    public static void main(final String[] args) throws Exception
    {
        final long start = System.nanoTime();
        runTest();
        System.out.println("duration = " + (System.nanoTime() - start));
    }

    private static void runTest() throws InterruptedException
    {
        Thread[] threads = new Thread[NUM_THREADS];

        for (int i = 0; i < threads.length; i++)
        {
            threads[i] = new Thread(new FalseSharing(i));
        }

        for (Thread t : threads)
        {
            t.start();
        }

        for (Thread t : threads)
        {
            t.join();
        }
    }

    public void run()
    {
        long i = ITERATIONS + 1;
        while (0 != --i)
        {
            longs[arrayIndex].value = i;
        }
    }

    public final static class VolatileLong
    {
        public volatile long value = 0L;
        public long p1, p2, p3, p4, p5, p6; // comment out
    }
}
Results

Running the above code while ramping the number of threads and adding/removing the cache line padding,  I get the results depicted in Figure 2. below.  This is measuring the duration of test runs on my 4-core Nehalem.

Figure 2.

The impact of false sharing can clearly be seen by the increased execution time required to complete the test.  Without the cache line contention we achieve near linear scale up with threads.

This is not a perfect test because we cannot be sure where the VolatileLongs will be laid out in memory.  They are independent objects.  However experience shows that objects allocated at the same time tend to be co-located.

So there you have it.  False sharing can be a silent performance killer.

Note: Please read my further adventures with false sharing in this follow on blog.

Sunday 24 July 2011

Memory Barriers/Fences

In this article I'll discuss the most fundamental technique in concurrent programming known as memory barriers, or fences, that make the memory state within a processor visible to other processors.

CPUs have employed many techniques to try and accommodate the fact that CPU execution unit performance has greatly outpaced main memory performance.  In my “Write Combining” article I touched on just one of these techniques.  The most common technique employed by CPUs to hide memory latency is to pipeline instructions and then spend significant effort, and resource, on trying to re-order these pipelines to minimise stalls related to cache misses.

When a program is executed it does not matter if its instructions are re-ordered provided the same end result is achieved.  For example, within a loop it does not matter when the loop counter is updated if no operation within the loop uses it.  The compiler and CPU are free to re-order the instructions to best utilise the CPU provided it is updated by the time the next iteration is about to commence.  Also over the execution of a loop this variable may be stored in a register and never pushed out to cache or main memory, thus it is never visible to another CPU.

CPU cores contain multiple execution units.  For example, a modern Intel CPU contains 6 execution units which can do a combination of arithmetic, conditional logic, and memory manipulation.  Each execution unit can do some combination of these tasks.  These execution units operate in parallel allowing instructions to be executed in parallel.  This introduces another level of non-determinism to program order if it was observed from another CPU.

Finally, when a cache-miss occurs, a modern CPU can make an assumption on the results of a memory load and continue executing based on this assumption until the load returns the actual data.

Provided “program order” is preserved the CPU, and compiler, are free to do whatever they see fit to improve performance.

Figure 1.

Loads and stores to the caches and main memory are buffered and re-ordered using the load, store, and write-combining buffers.  These buffers are associative queues that allow fast lookup.  This lookup is necessary when a later load needs to read the value of a previous store that has not yet reached the cache.  Figure 1 above depicts a simplified view of a modern multi-core CPU.  It shows how the execution units can use the local registers and buffers to manage memory while it is being transferred back and forth from the cache sub-system.

In a multi-threaded environment techniques need to be employed for making program results visible in a timely manner.  I will not cover cache coherence in this article.  Just assume that once memory has been pushed to the cache then a protocol of messages will occur to ensure all caches are coherent for any shared data.  The techniques for making memory visible from a processor core are known as memory barriers or fences.

Memory barriers provide two properties.  Firstly, they preserve externally visible program order by ensuring all instructions either side of the barrier appear in the correct program order if observed from another CPU and, secondly, they make the memory visible by ensuring the data is propagated to the cache sub-system.

Memory barriers are a complex subject.  They are implemented very differently across CPU architectures.  At one end of the spectrum there is a relatively strong memory model on Intel CPUs that is more simple than say the weak and complex memory model on a DEC Alpha with its partitioned caches in addition to cache layers.  Since x86 CPUs are the most common for multi-threaded programming I’ll try and simplify to this level.

Store Barrier
A store barrier, “sfence” instruction on x86, waits for all store instructions prior to the barrier to be written from the store buffer to the L1 cache for the CPU on which it is issued.  This will make the program state visible to other CPUs so they can act on it if necessary.  A good example of this in action is the following simplified code from the BatchEventProcessor in the Disruptor.  When the sequence is updated other consumers and producers know how far this consumer has progressed and thus can take appropriate action.  All previous updates to memory that happened before the barrier are now visible.
private volatile long sequence = RingBuffer.INITIAL_CURSOR_VALUE;

// from inside the run() method

T event = null;
long nextSequence = sequence.get() + 1L;
while (running)
{
    try
    {
        final long availableSequence = barrier.waitFor(nextSequence);

        while (nextSequence <= availableSequence)
        {
            event = ringBuffer.get(nextSequence);
            boolean endOfBatch = nextSequence == availableSequence;
            eventHandler.onEvent(event, nextSequence, endOfBatch);
            nextSequence++;
        }

        // store barrier inserted here !!!
        sequence.set(nextSequence - 1L); 
    }
    catch (final Exception ex)
    {
        exceptionHandler.handle(ex, nextSequence, event);

        // store barrier inserted here !!!
        sequence.set(nextSequence);

        nextSequence++;
    }
}
Load Barrier
A load barrier, “lfence” instruction on x86, ensures all load instructions after the barrier to happen after the barrier and then wait on the load buffer to drain for the issuing CPU.  This makes program state exposed from other CPUs visible to this CPU before making further progress.  A good example of this is when the BatchEventProcessor sequence referenced above is read by producers, or consumers, in the corresponding barriers of the Disruptor.

Full Barrier
A full barrier, "mfence" instruction on x86, is a composite of both load and store barriers happening on a CPU.

Java Memory Model
In the Java Memory Model a volatile field has a store barrier before the write, and full barrier after the write to it, this is paired with and a load barrier inserted after a read of it.  Qualified final fields of a class have a store barrier inserted after their initialisation to ensure these fields are visible once the constructor completes when a reference to the object is available. A JVM does not have to issue specific instructions, such as sfence, it can use other techniques to achieve the same behaviour provided by the processor architecture and compiler blend.

Atomic Instructions and Software Locks
Atomic instructions, such as the “lock ...” instructions on x86, are effectively a full barrier as they lock the memory sub-system to perform an operation and have guaranteed total order, even across CPUs.  Software locks usually employ memory barriers, or atomic instructions, to achieve visibility and preserve program order.

Performance Impact of Memory Barriers
Memory barriers prevent a CPU from performing a lot of techniques to hide memory latency therefore they have a significant performance cost which must be considered.  To achieve maximum performance it is best to model the problem so the processor can do units of work, then have all the necessary memory barriers occur on the boundaries of these work units.  Taking this approach allows the processor to optimise the units of work without restriction.  There is an advantage to grouping necessary memory barriers in that buffers flushed after the first one will be less costly because no work will be under way to refill them.

Tuesday 19 July 2011

Processor Affinity - Part 1

In a series of articles I’ll aim to show the performance impact of processor affinity in a range of use cases.

Background

A thread of execution will typically run until it has used up its quantum (aka time slice), at which point it joins the back of the run queue waiting to be re-scheduled as soon as a processor core becomes available.  While running the thread will have accumulated a significant amount of state in the processor, including instructions and data in the cache.   If the thread can be re-scheduled to run on the same core as last time it can benefit from all that accumulated state.   A thread may equally not run to the end of its quantum because it has been pre-empted, or blocked on IO or a lock.  After which, when it is ready to run again, the same holds true.

There are numerous techniques available for pinning threads to a particular core.   In this article I’ll illustrate the use of the taskset command on two threads exchanging IP multicast messages via a dummy interface.  I’ve chosen this as the first example because in a low-latency environment multicast is the preferred IP protocol.  For simplicity, I’ve also chosen to not involve the physical network while introducing the concepts.   In the next article I’ll expand on this example and the issues involving a real network.

1. Create the dummy interface

  $ su -
  $ modprobe dummy
  $ ifconfig dummy0 172.16.1.1 netmask 255.255.255.0
  $ ifconfig dummy0 multicast

2. Get the Java files (Sender and Receiver) and compile them

  $ javac *.java

3. Run the tests without CPU pinning

Window 1:
  $ java MultiCastReceiver 230.0.0.1 dummy0

Window 2:
  $ java MultiCastSender 230.0.0.1 dummy0 20000000

4. Run the tests with CPU pinning

Window 1:
  $ taskset -c 2 java MultiCastReceiver 230.0.0.1 dummy0

Window 2:
  $ taskset -c 4 java MultiCastSender 230.0.0.1 dummy0 20000000

Results

The tests output once per second the number of messages they have managed to send and receive.  A typically example run is charted in Figure 1 below.

Figure 1.

The interesting thing I've observed is that the unpinned test will follow a step function of unpredictable performance.  Across many runs I've seen different patterns but all similar in this step function nature.  For the pinned tests I get consistent throughput with no step pattern and always the greatest throughput.

This test is not particularly CPU intensive, nor does it access the physical network device, yet it shows how critical processor affinity is to not just high performance but also predictable performance.  In the next article of this series I'll introduce a network hop and the issues arising from interrupt handling.

LMAX Architecture - by Martin Fowler

Martin Fowler has written a great article about the architecture we developed for our exchange LMAX

The article explains the origins of the Disruptor and what is possible using single threaded business logic.  I really like how he explains the benefits of this approach, much better than I could :-)

Saturday 16 July 2011

Let The Caller Choose

A programming idiom I often see in managed runtime environments, such as Java, is to return a collection or array from a method that is allocated within the method.  Having originally come from a C/C++ language background this always struck me as a bit strange and lazy.  In the C language world this idiom would likely be a source of memory leaks.  I do however understand how convenient it can be in garbage collected environments.  A good example is the split function on a String as below:

    String str = “Some string that I may want to split”;     
    String[] values = str.split(“ );

For this a new array has to be allocated within the method.  If the array is small it can be allocated in the young generation.  Due to the round-robin nature of the young generation it will most likely involve a cache miss.  The OS, in addition, must zero the memory for security before handing it over.  What’s worse is the size of many arrays cannot be determined until the method completes, so temporary structures may have to be allocated before the final structure is allocated and copied into.  Then after all this, the memory will eventually need to be garbage collected.  This feels like the CPU is doing a lot of work to make this idiom work.

What is the alternative?

Another idiom is to let the caller allocate the structure and pass it into the method like in the example below:

    String str = “Some string that I may want to split”;
    Collection<String> values = new ArrayList<String>();   
    str.split(values, “ );

With this approach the caller has more control and flexibility.  Not only can the caller reuse the collection across multiple calls, they can also choose the most appropriate structure for the end needs.  For example, it could be a HashSet if finding distinct words, an ArrayList for cache friendly performance, or a LinkedList if memory is a scarce resource.  The method can return the collection allowing a fluent coding pattern.

In the Java world this idiom becomes critical to performance when large arrays are involved.  An array may be too large to be allocated in the young generation and therefore needs to be allocated in the old generation.  Besides being less efficient and contended for large object allocation, the large array may cause a compaction of the old generation resulting in a multi-second stop-the-world pause.  Now that is a good topic for another blog post...

I can hear folk cry out, “but short lived objects are cheap!”  While this is true when compared to longer living objects, they are relatively cheap but certainly not free.  I’d prefer to use those extra cycles solving the business problem and, in addition, have the flexibility to choose the most appropriate data structure for the task at hand.

Friday 15 July 2011

Write Combining

Modern CPUs employ lots of techniques to counteract the latency cost of going to main memory.  These days CPUs can process hundreds of instructions in the time it takes to read or write data to the DRAM memory banks. 

The major tool used to hide this latency is multiple layers of SRAM cache.  In addition, SMP systems employ message passing protocols to achieve coherence between caches.  Unfortunately CPUs are now so fast that even these caches cannot keep up at times.  So to further hide this latency a number of less well known buffers are used. 

This article explores “write combining buffers” and how we can write code that uses them effectively.

CPU caches are effectively unchained hash maps where each bucket is typically 64-bytes. This is known as a “cache line”.  The cache line is the effective unit of memory transfer.  For example, an address A in main memory would hash to map to a given cache line C.

If a CPU needs to work with an address which hashes to a line that is not already in cache, then the existing line that matches that hash needs to be evicted so the new line can take its place.  For example if we have two addresses which both map via the hashing algorithm to the same cache line, then the old one must make way for the new cache line.

When a CPU executes a store operation it will try to write the data to the L1 cache nearest to the CPU.  If a cache miss occurs at this stage the CPU goes out to the next layer of cache.  At this point on an Intel, and many other, CPUs a technique known as “write combining” comes into play. 

While the request for ownership of the L2 cache line is outstanding the data to be stored is written to one of a number of cache line sized buffers on the processor itself, known as line fill buffers on Intel CPUs.  These on chip buffers allow the CPU to continue processing instructions while the cache sub-system gets ready to receive and process the data.  The biggest advantage comes when the data is not present in any of the other cache layers.

These buffers become very interesting when subsequent writes happen to require the same cache line.  The subsequent writes can be combined into the buffer before it is committed to the L2 cache. These 64-byte buffers maintain a 64-bit field which has the corresponding bit set for each byte that is updated to indicate what data is valid when the buffer is transferred to the outer caches.

Hang on I hear you say.  What happens if the program wants to read some of the data that has been written to a buffer?  Well our hardware friends have thought of that and they will snoop the buffers before they read the caches.

What does all this mean for our programs?

If we can fill these buffers before they are transferred to the outer caches then we will greatly improve the effective use of the transfer bus at every level.  How do we do this?  Well programs spend most of their time in loops doing work. 

There are a limited number of these buffers, and they differ by CPU model.  For example on an Intel CPU you are only guaranteed to get 4 of them at one time.  What this means is that within a loop you should not write to more than 4 distinct memory locations at one time or you will not benefit from the write combining effect.

What does this look like in code?
import static java.lang.System.out;

public final class WriteCombining
{
    private static final int ITERATIONS = Integer.MAX_VALUE;
    private static final int ITEMS = 1 << 24;
    private static final int MASK = ITEMS - 1;

    private static final byte[] arrayA = new byte[ITEMS];
    private static final byte[] arrayB = new byte[ITEMS];
    private static final byte[] arrayC = new byte[ITEMS];
    private static final byte[] arrayD = new byte[ITEMS];
    private static final byte[] arrayE = new byte[ITEMS];
    private static final byte[] arrayF = new byte[ITEMS];

    public static void main(final String[] args)
    {
        for (int i = 1; i <= 3; i++)
        {
            out.println(i + " SingleLoop duration (ns) = " + runCaseOne());
            out.println(i + " SplitLoop  duration (ns) = " + runCaseTwo());
        }

        int result = arrayA[1] + arrayB[2] + arrayC[3] +
                     arrayD[4] + arrayE[5] + arrayF[6];
        out.println("result = " + result);
    }

    public static long runCaseOne()
    {
        long start = System.nanoTime();

        int i = ITERATIONS;
        while (--i != 0)
        {
            int slot = i & MASK;
            byte b = (byte)i;
            arrayA[slot] = b;
            arrayB[slot] = b;
            arrayC[slot] = b;
            arrayD[slot] = b;
            arrayE[slot] = b;
            arrayF[slot] = b;
        }

        return System.nanoTime() - start;
    }

    public static long runCaseTwo()
    {
        long start = System.nanoTime();

        int i = ITERATIONS;
        while (--i != 0)
        {
            int slot = i & MASK;
            byte b = (byte)i;
            arrayA[slot] = b;
            arrayB[slot] = b;
            arrayC[slot] = b;
        }

        i = ITERATIONS;
        while (--i != 0)
        {
            int slot = i & MASK;
            byte b = (byte)i;
            arrayD[slot] = b;
            arrayE[slot] = b;
            arrayF[slot] = b;
        }

        return System.nanoTime() - start;
    }
}
This program on my Windows 7 64-bit Intel Core i7 860 @ 2.8 GHz system produces the following output:

1 SingleLoop duration (ns) = 14019753545
1 SplitLoop  duration (ns) = 8972368661
2 SingleLoop duration (ns) = 14162455066
2 SplitLoop  duration (ns) = 8887610558
3 SingleLoop duration (ns) = 13800914725
3 SplitLoop  duration (ns) = 7271752889


To spell it out, if we write to 6 array locations (memory addresses) inside one loop we see that the program takes significantly longer than if we split the work up, and write first to 3 array locations, then to the other 3 locations sequentially.

By splitting the loop we do much more work yet the program completes in much less time!  Welcome to the magic of “write combining”.  By using our knowledge of the CPU architecture to fill those buffers properly we can use the underlying hardware to accelerate our code by a factor of two.

Don’t forget that with hyper-threading you can have 2 threads in competition for these buffers on the same core.

Why Mechanical Sympathy?

A little while ago I discovered this wonderful quote by Henry Peteroski:

"The most amazing achievement of the computer software industry is its continuing cancellation of the steady and staggering gains made by the computer hardware industry."

This observation has been a great source of employment for me over the last two decades.  In this series of blog posts I will try and share some of the techniques I've learned for getting the most out of our underlying hardware.  The name "Mechanical Sympathy" comes from the great racing driver Jackie Stewart, who was a 3 times world Formula 1 champion.  He believed the best drivers had enough understanding of how a machine worked so they could work in harmony with it.

Why does the software we use today not feel any faster than the DOS based applications we used 20 years ago???  It does not have to be this way.  As a software developer I want to try and produce software which does justice to the wonderful achievements of our hardware friends.