Mechanical Sympathy

Friday, 2 September 2011

Modelling Is Everything

I’m often asked, “What is the best way to learn about building high-performance systems”? There are many perfectly valid answers to this question but there is one thing that stands out for me above everything else, and that is modelling. Modelling what you need to implement is the most important and effective step in the process. I’d go further and say this principle applies to any development and the rest is just typing :-)

Domain Driven Design (DDD) advocates modelling the domain and expressing this model in code as fundamental to the successful delivery and ongoing maintenance of software. I wholeheartedly agree with this. How often do we see code that is an approximation of the problem domain? Code that exhibits behaviour which approximates to what is required via inappropriate abstractions and mappings which just about cope. Those mappings between what is in the code and the real domain are only contained in the developers’ heads and this is just not good enough.

When requiring high-performance, code for parts of the system often have to model what is happening with the CPU, memory, storage sub-systems, or network sub-systems. When we have imperfect abstractions on top of these domains, performance can be very adversely affected. The goal of my “Mechanical Sympathy” blog is to peek at what is under the hood so we can improve our abstractions.

What is a Model?

A model does not need to be the result of a 3-year exercise producing UML. It can be, and often is best as, people communicating via various means including speech, drawings, illustrations, metaphors, analogies, etc, to build a mental model for shared understanding. If an accurate and distilled understanding can be reached then this model can be turned into code with great results.

Infrastructure Domain Models

If developers writing a concurrent framework do not have a good model of how a typical cache sub-system works, i.e. it uses message passing to exchange cache lines, then the framework is unlikely to perform well or be correct. If their code drives the cache sub-system with mechanical sympathy and understanding, it is less likely to have bugs and more likely to perform well.

It is much easier to predict performance from a sound model when coming from an understanding of the infrastructure for the underlying platform and its published abilities. For example, if you know how many packets per second a network sub-system can handle, and the size of its transfer unit, then it is easy to extrapolate expected bandwidth. With this model based understanding we can test our code for expectations with confidence.

I’ve fixed many performance issues whereby a framework treated a storage sub-system as stream-based when it is really a block-based model. If you update part of a file on disk, the block to be updated must be read, the changes applied, and the results written back. Now if you know the system is block based and the boundaries of the blocks, you can write whole blocks back without incurring the read, modify, write back cycle replacing these actions with a single write. This applies even when appending to a file as the last block is likely to have been partially written previously.

Business Domain Models

The same thinking should be applied to the models we construct for the business domain. If a business process is modelled accurately, then the software will not surprise its end users. When we draw up a model it is important to describe the relationships for cardinality and the characteristics by which they will be traversed. This understanding will guide the selection of data structures to those best suited for implementing the relationships. I often see people use a list for a relationship which is mostly searched by key, for this case a map could be more appropriate. Are the entities at the other end of a relationship ordered? A tree or skiplist implementation may be a better option.

Identity

Identity of entities in a model is so important. All models have to be entered in some way, and this normally starts with an entity from which to walk. That entity could be “Customer” by customer ID but could equally be “DiskBlock” by filename and offset in an infrastructure domain. The identity of each entity in the system needs to be clear so the model can be accessed efficiently. If for each interaction with a model we waste precious cycles trying to find our entity as a starting point, then other optimisations can become almost irrelevant. Make identity explicit in your model and, if necessary, index entities by their identity so you can efficiently enter the model for each interaction.

Refine as we learn

It is also important to keep refining a model as we learn. If the model grows as a series of extensions without refining and distilling, then we end up with a spaghetti mess that is very difficult to manage when trying to achieve predictable performance. Never mind how difficult it is to maintain and support. Everyday we learn new things. Reflect this in the model and keep it up to date.

Implement no more, but also no less, than what is needed!

The fastest code is code that does just what is needed and no more. Perform the instructions to complete the task and no more. Really fast code is normally not a weird mess of bit-shifting and complier tricks. It is best to start with something clean and elegant. Then measure to see if you are within performance targets. So often this will be sufficient. Sometimes performance will be a surprise. You then need to apply science to test and measure before jumping to conclusions. A profiler will often tell you where the time is being taken. Once the basic modelling mistakes and assumptions have been corrected, it usually takes just a little mechanical sympathy to reach the performance goal. Unused code is waste. Try not to create it. If you happen to create some, then remove it from your codebase as soon as you notice it.

Conclusion

When cross-functional requirements, such as performance and availability, are critical to success, I’ve found the most important thing is to get the model correct for the domain at all levels. That is, take the principles of DDD and make sure your code is an appropriate reflection of each domain. Be that the domain of business applications, or the domain of interactions with infrastructure, I’ve found modelling is everything.

Saturday, 27 August 2011

Disruptor 2.0 Released

Significantly improved performance and a cleaner API are the key takeaways for the Disruptor 2.0 concurrent programming framework for Java. This release is the result of all the great feedback we have received from the community. Feedback is very welcome and really improves the end product so please keep it coming.

You can find the Disruptor project here, plus we have a wiki with links to detailed blogs describing how things work.

Naming & API

Over the lifetime of the Disruptor naming has been a challenge. The funny thing is that with the 2.0 release we have come almost full circle. Originally we considered the Disruptor as an event processing framework that often got used as a queue replacement. To make it understandable to queue users we adopted the nomenclature of producers and consumers. However the consumers are not true consumers. With this release the consensus is to return to the event processing roots and adopt the following naming changes.

Producer -> Publisher
Events are claimed in strict sequence and published to the RingBuffer.

Entry -> Event
Events represent the currency of data exchange through the dependency graph of EventProcessors.

Consumer -> EventProcessor
Events are processed by EventProcessors. The processing of an event can be read only, but can also involve mutations on which other EventProcessors depend.

ConsumerBarrier -> DependencyBarrier
Complex graphs of dependent EventProcessors can be constructed for the processing of an Event. The DependencyBarriers are assembled to represent the dependency graph. This topic is the real value of the Disruptor and often misunderstood. A fun example can be seen playing FizzBuzz in our performance tests.

The ProducerBarrier was always a one-to-one relationship with the RingBuffer so for ease of use its behaviour has been merged into the RingBuffer. This allows direct publishing into the RingBuffer.

DSL Wizard

The most complex part of using the Disruptor is the setting up of the dependency graph of EventProcessors. To simplify this for the most common cases we have integrated the DisruptorWizard project which provides a DSL as a fluent API for assembling the graph and assigning threads.

Performance

Significant performance tuning effort has gone into this release. This effort has resulted in a ~2-3X improvement in throughput depending on CPU architecture. For most use cases it is now an order of magnitude better than queue based approaches. On Sandybridge processors I've seen over 50 million events processed per second.

Sequence tracking has been completely rewritten to reduce the usage of hardware memory barriers, indirection layers, and megamorphic method calls resulting in a much more data and instruction cache friendly design. New techniques have been employed to prevent false sharing because the previous ones got optimised out by the Oracle Java 7 JVM.

The one area not seeing a significant performance increase is the sequencer pattern. The Disruptor is still much faster than queue based approaches for this pattern but a limitation of Java hits us hard here. Java on x86/x64 is using LOCK CMPXCHG for CAS operations to implement the AtomicLong incrementAndGet() method which, based on my measurements, is ~2-10X slower than using LOCK XADD as contention increases. Hopefully Oracle will see the error of SUNs ways on this and embrace x86/x64 to take advantage of such instructions. Dave Dice at Oracle has blogged on the subject so I live in hope.

Memory Barriers

Of special note for this release is the elimination of hardware memory barriers on x86/x64 for Sequence tracking. The beauty in the Disruptor design is that on CPU architectures that have a memory model [1] whereby:

“loads are not reordered with older loads”, and
“stores are not reordered with older stores”;

it is then possible to take advantage of the semantics provided by AtomicLong to avoid the use of the Java volatile keyword, and thus hardware fences on x86/x64. The one sticky rule for concurrent algorithms, such as Dekker [2] and Peterson [3] locks, on x86/x64 is “loads can be re-ordered with older stores”. This is not an issue given the design of the Disruptor. The issue relates to the snooping of CPU local store buffers for older writes. I’m likely to blog in more detail about why this is the case at a later date. The code should be safe on other CPU architectures if the JVM implementers get the semantics of AtomicLong and Unsafe correct, however your mileage may vary for performance on other architectures compared to x64.

Roadmap

With this latest release it is becoming increasingly obvious how sensitive some CPU architectures are to processor affinity for threads. When an EventProcessor gets rescheduled on a different core, after its time-slice is exhausted or it yields, the resulting cache pollution really hits performance. For those who require more extreme and predictable performance I plan to release an Executor service with the Disruptor to allow the pinning of threads to CPU cores.

I'm also thinking of adding a progressive back off strategy for waiting EventProcessors as a WaitStrategy. This strategy would first busy spin, then yield, then eventually sleep in millisecond periods to conserve CPU resource for those applications that burst for a while then go quiet.

Memory Model: See Section 8.2 of http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-vol-3a-part-1-manual.html
Dekker algorithm: http://en.wikipedia.org/wiki/Dekker%27s_algorithm
Peterson Algorithm: http://en.wikipedia.org/wiki/Peterson%27s_algorithm

Saturday, 20 August 2011

Code Refurbishment

Within our industry we use a huge range of terminology. Unfortunately we don’t all agree on what individual terms actually mean. I so often hear people misuse the term “Refactoring” which has come to make the business in many organisations recoil in fear. The reason for this fear I’ve observed is because of what people often mean when misusing this term.

I feel we are holding back our industry by not being disciplined in our use of terminology. If one chemist said to another chemist “we are about to perform titration”, both would have a good idea what is involved. I believe computing is still a very immature science. As our subject matures hopefully we will become more precise and disciplined in our use of terminology and thus make our communication more accurate and effective.

Refactoring is a very useful technique for improving code quality and clarity. To be precise it is a behaviour preserving change that improves a code base for future maintenance and understanding. A good example would be extracting a method to remove code duplication and applying this method at every site of the duplication, thus removing the duplication. Refactoring was first discussed in the early 1990s and became mainstream after Martin Fowler’s excellent “Refactoring” book in 1999.

Refactoring involves making a number of small internal changes to the code structure. These changes will typically not have any external impact. Well written unit tests that just assert externally observable behaviour will not change when code is refactored. If the external behaviour of code is changing when the structure is being changed then this is not refactoring.

Now, why do our business folk recoil in fear when this simple and useful technique of “refactoring” is mentioned? I believe this is because developers are actually talking about a much more extensive structural redevelopment technique that does not have a common term. These structural changes are often not a complete ground-up rewrite because much of the existing code will be reused. The reason the business folk have come to recoil is that they fear we are about to head off into uncharted waters with no idea of how long things will take and if any value will come out of the exercise.

This example of significant structural change reminds me of when a bar or restaurant gets taken over by new management. The new management often undertake a refurbishment exercise to make the place more appealing and suitable for the customers they are targeting. A lot of the building will be preserved and reused thus greatly reducing the costs of a complete rebuild. In my experience when developers use the term “refactoring” what they really mean is that some module, or bounded context, in a code base is about to undergo significant refurbishment. If we define this term, and agree the goal and value to the business, we may be able to better plan and manage our projects.

These code refurbishment exercises should have clear goals defined at the outset and all change must be tested against these goals. For example, we may have discovered that code is not a true reflection of the business domain after new insights. These insights may have been gleaned over a period of time and the code has grown out of step to become an approximation of what the business requires. While performing Domain Driven Design the penny may drop with the essence of the business model becoming clear. After this clarity of understanding the code may need a major overhaul to align it with this new understanding of the business. Code can also drift from being a distilled model of the business domain if quick hacks are put in place to meet a deadline. Over time these hacks can build on each other until the model no longer describes the business, it just about makes itself useful by side effect. During this exercise our tests are likely to see significant change as we tighten up the specification for our new improved understanding of the business domain.

A code refurbishment is worthwhile to correct the core domain if it's about to undergo significant further development, or if a module is business critical and needs to be occasionally corrected under production pressure to preserve revenue generation.

I’m interested to know if other folk have observed similar developments and if you think refinement of this concept would be valuable?

Saturday, 13 August 2011

False Sharing && Java 7

In my previous post on False Sharing I suggested it can be avoided by padding the cache line with unused long fields. It seems Java 7 got clever and eliminated or re-ordered the unused fields, thus re-introducing false sharing. I've experimented with a number of techniques on different platforms and found the following code to be the most reliable.

import java.util.concurrent.atomic.AtomicLong;

public final class FalseSharing
    implements Runnable
{
    public final static int NUM_THREADS = 4; // change
    public final static long ITERATIONS = 500L * 1000L * 1000L;
    private final int arrayIndex;

    private static PaddedAtomicLong[] longs = new PaddedAtomicLong[NUM_THREADS];
    static
    {
        for (int i = 0; i < longs.length; i++)
        {
            longs[i] = new PaddedAtomicLong();
        }
    }

    public FalseSharing(final int arrayIndex)
    {
        this.arrayIndex = arrayIndex;
    }

    public static void main(final String[] args) throws Exception
    {
        final long start = System.nanoTime();
        runTest();
        System.out.println("duration = " + (System.nanoTime() - start));
    }

    private static void runTest() throws InterruptedException
    {
        Thread[] threads = new Thread[NUM_THREADS];

        for (int i = 0; i < threads.length; i++)
        {
            threads[i] = new Thread(new FalseSharing(i));
        }

        for (Thread t : threads)
        {
            t.start();
        }

        for (Thread t : threads)
        {
            t.join();
        }
    }

    public void run()
    {
        long i = ITERATIONS + 1;
        while (0 != --i)
        {
            longs[arrayIndex].set(i);
        }
    }

    public static long sumPaddingToPreventOptimisation(final int index)
    {
        PaddedAtomicLong v = longs[index];
        return v.p1 + v.p2 + v.p3 + v.p4 + v.p5 + v.p6;
    }

    public static class PaddedAtomicLong extends AtomicLong
    {
        public volatile long p1, p2, p3, p4, p5, p6 = 7L;
    }
}

With this code I get similar performance results to those stated in the previous False Sharing article. The padding in PaddedAtomicLong above can be commented out to see the false sharing effect.

I think we should all lobby the powers that be inside Oracle to have intrinsics added to the language so we can have cache line aligned and padded atomic classes. This and some other low-level changes would help make Java a real concurrent programming language. We keep hearing them say multi-core is coming. I say it is here and Java needs to catch up.

Tuesday, 9 August 2011

Inter Thread Latency

Message rates between threads are fundamentally determined by the latency of memory exchange between CPU cores. The minimum unit of transfer will be a cache line exchanged via shared caches or socket interconnects. In a previous article I explained Memory Barriers and why they are important to concurrent programming between threads. These are the instructions that cause a CPU to make memory visible to other cores in an ordered and timely manner.

Lately I’ve been asked a lot about how much faster the Disruptor would be if C++ was used instead of Java. For sure C++ would give more control for memory alignment and potential access to underlying CPU instructions such as memory barriers and lock instructions. In this article I’ll directly compare C++ and Java to measure the cost of signalling a change between threads.

For the test we'll use two counters each updated by their own thread. A simple ping-pong algorithm will be used to signal from one to the other and back again. The exchange will be repeated millions of times to measure the average latency between cores. This measurement will give us the latency of exchanging a cache line between cores in a serial manner.

For Java we’ll use volatile counters which the JVM will kindly insert a lock instruction for the update giving us an effective memory barrier.

public final class InterThreadLatency
    implements Runnable
{
    public static final long ITERATIONS = 500L * 1000L * 1000L;

    public static volatile long s1;
    public static volatile long s2;

    public static void main(final String[] args)
    {
        Thread t = new Thread(new InterThreadLatency());
        t.setDaemon(true);
        t.start();

        long start = System.nanoTime();

        long value = s1;
        while (s1 < ITERATIONS)
        {
            while (s2 != value)
            {
                // busy spin
            }
            value = ++s1;
        }

        long duration = System.nanoTime() - start;

        System.out.println("duration = " + duration);
        System.out.println("ns per op = " + duration / (ITERATIONS * 2));
        System.out.println("op/sec = " +  
            (ITERATIONS * 2L * 1000L * 1000L * 1000L) / duration);
        System.out.println("s1 = " + s1 + ", s2 = " + s2);
    }

    public void run()
    {
        long value = s2;
        while (true)
        {
            while (value == s1)
            {
                // busy spin
            }
            value = ++s2;
        }
    }
}

For C++ we’ll use the GNU Atomic Builtins which give us a similar lock instruction insertion to that which the JVM uses.

#include <time.h>
#include <pthread.h>
#include <stdio.h>

typedef unsigned long long uint64;
const uint64 ITERATIONS = 500LL * 1000LL * 1000LL;

volatile uint64 s1 = 0;
volatile uint64 s2 = 0;

void* run(void*)
{
    register uint64 value = s2;
    while (true)
    {
        while (value == s1)
        {
            // busy spin
        }
        value = __sync_add_and_fetch(&s2, 1);
    }
}

int main (int argc, char *argv[])
{
    pthread_t threads[1];
    pthread_create(&threads[0], NULL, run, NULL);

    timespec ts_start;
    timespec ts_finish;
    clock_gettime(CLOCK_MONOTONIC, &ts_start);

    register uint64 value = s1;
    while (s1 < ITERATIONS)
    {
        while (s2 != value)
        {
            // busy spin
        }
        value = __sync_add_and_fetch(&s1, 1);
    }

    clock_gettime(CLOCK_MONOTONIC, &ts_finish);

    uint64 start = (ts_start.tv_sec * 1000000000LL) + ts_start.tv_nsec;
    uint64 finish = (ts_finish.tv_sec * 1000000000LL) + ts_finish.tv_nsec;
    uint64 duration = finish - start;

    printf("duration = %lld\n", duration);
    printf("ns per op = %lld\n", (duration / (ITERATIONS * 2)));
    printf("op/sec = %lld\n",  
        ((ITERATIONS * 2L * 1000L * 1000L * 1000L) / duration));
    printf("s1 = %lld, s2 = %lld\n", s1, s2);

    return 0;
}

Results

$ taskset -c 2,4 /opt/jdk1.7.0/bin/java InterThreadLatency
duration = 50790271150
ns per op = 50
op/sec = 19,688,810
s1 = 500000000, s2 = 500000000

$ g++ -O3 -lpthread -lrt -o itl itl.cpp
$ taskset -c 2,4 ./itl
duration = 45087955393
ns per op = 45
op/sec = 22,178,872
s1 = 500000000, s2 = 500000000

The C++ version is slightly faster on my Intel Sandybridge laptop. So what does this tell us? Well, that the latency between 2 cores on a 2.2 GHz machine is ~45ns and that you can exchange 22m messages per second in a serial fashion. On an Intel CPU this is fundamentally the cost of the lock instruction enforcing total order and forcing the store buffer and write combining buffers to drain, followed by the resulting cache coherency traffic between the cores. Note that each core has a 96GB/s port onto the L3 cache ring bus, yet 22m * 64-bytes is only 1.4 GB/s. This is because we have measured latency and not throughput. We could easily fit some nice fat messages between those memory barriers as part of the exchange if the data has been written before the lock instruction was executed.

So what does this all mean for the Disruptor? Basically, the latency of the Disruptor is about as low as we can get from Java. It would be possible to get a ~10% latency improvement by moving to C++. I’d expect a similar improvement in throughput for C++. The main win with C++ would be the control, and therefore, the predictability that comes with it if used correctly. The JVM gives us nice safety features like garbage collection in complex applications but we pay a little for that with the extra instructions it inserts that can be seen if you get Hotspot to dump the assembler instructions it is generating.

How does the Disruptor achieve more than 25m messages per second I hear you say??? Well that is one of the neat parts of its design. The “waitFor” semantics on the SequenceBarrier enables a very efficient form of batching, which allows the BatchEventProcessor to process a series of events that occurred since it last checked in with the RingBuffer, all without incurring a memory barrier. For real world applications this batching effect is really significant. For micro benchmarks it only makes the results more random, especially when there is little work done other than accepting the message.

Conclusion

So when processing events in series, the measurements tell us that the current generation of processors can do between 20-30 million exchanges per second at a latency less than 50ns. The Disruptor design allows us to get greater throughput without explicit batching on the publisher side. In addition the Disruptor has an explicit batching API on the publisher side that can give over 100 million messages per second.

Saturday, 30 July 2011

False Sharing

Memory is stored within the cache system in units know as cache lines. Cache lines are a power of 2 of contiguous bytes which are typically 32-256 in size. The most common cache line size is 64 bytes. False sharing is a term which applies when threads unwittingly impact the performance of each other while modifying independent variables sharing the same cache line. Write contention on cache lines is the single most limiting factor on achieving scalability for parallel threads of execution in an SMP system. I’ve heard false sharing described as the silent performance killer because it is far from obvious when looking at code.

To achieve linear scalability with number of threads, we must ensure no two threads write to the same variable or cache line. Two threads writing to the same variable can be tracked down at a code level. To be able to know if independent variables share the same cache line we need to know the memory layout, or we can get a tool to tell us. Intel VTune is such a profiling tool. In this article I’ll explain how memory is laid out for Java objects and how we can pad out our cache lines to avoid false sharing.

Figure 1.

Figure 1. above illustrates the issue of false sharing. A thread running on core 1 wants to update variable X while a thread on core 2 wants to update variable Y. Unfortunately these two hot variables reside in the same cache line. Each thread will race for ownership of the cache line so they can update it. If core 1 gets ownership then the cache sub-system will need to invalidate the corresponding cache line for core 2. When Core 2 gets ownership and performs its update, then core 1 will be told to invalidate its copy of the cache line. This will ping pong back and forth via the L3 cache greatly impacting performance. The issue would be further exacerbated if competing cores are on different sockets and additionally have to cross the socket interconnect.

Java Memory Layout

For the Hotspot JVM, all objects have a 2-word header. First is the “mark” word which is made up of 24-bits for the hash code and 8-bits for flags such as the lock state, or it can be swapped for lock objects. The second is a reference to the class of the object. Arrays have an additional word for the size of the array. Every object is aligned to an 8-byte granularity boundary for performance. Therefore to be efficient when packing, the object fields are re-ordered from declaration order to the following order based on size in bytes:

doubles (8) and longs (8)
ints (4) and floats (4)
shorts (2) and chars (2)
booleans (1) and bytes (1)
references (4/8)
<repeat for sub-class fields>

With this knowledge we can pad a cache line between any fields with 7 longs. Within the Disruptor we pad cache lines around the RingBuffer cursor and BatchEventProcessor sequences.

To show the performance impact let’s take a few threads each updating their own independent counters. These counters will be volatile longs so the world can see their progress.

public final class FalseSharing
    implements Runnable
{
    public final static int NUM_THREADS = 4; // change
    public final static long ITERATIONS = 500L * 1000L * 1000L;
    private final int arrayIndex;

    private static VolatileLong[] longs = new VolatileLong[NUM_THREADS];
    static
    {
        for (int i = 0; i < longs.length; i++)
        {
            longs[i] = new VolatileLong();
        }
    }

    public FalseSharing(final int arrayIndex)
    {
        this.arrayIndex = arrayIndex;
    }

    public static void main(final String[] args) throws Exception
    {
        final long start = System.nanoTime();
        runTest();
        System.out.println("duration = " + (System.nanoTime() - start));
    }

    private static void runTest() throws InterruptedException
    {
        Thread[] threads = new Thread[NUM_THREADS];

        for (int i = 0; i < threads.length; i++)
        {
            threads[i] = new Thread(new FalseSharing(i));
        }

        for (Thread t : threads)
        {
            t.start();
        }

        for (Thread t : threads)
        {
            t.join();
        }
    }

    public void run()
    {
        long i = ITERATIONS + 1;
        while (0 != --i)
        {
            longs[arrayIndex].value = i;
        }
    }

    public final static class VolatileLong
    {
        public volatile long value = 0L;
        public long p1, p2, p3, p4, p5, p6; // comment out
    }
}

Results

Running the above code while ramping the number of threads and adding/removing the cache line padding, I get the results depicted in Figure 2. below. This is measuring the duration of test runs on my 4-core Nehalem.

Figure 2.

The impact of false sharing can clearly be seen by the increased execution time required to complete the test. Without the cache line contention we achieve near linear scale up with threads.

This is not a perfect test because we cannot be sure where the VolatileLongs will be laid out in memory. They are independent objects. However experience shows that objects allocated at the same time tend to be co-located.

So there you have it. False sharing can be a silent performance killer.

Note: Please read my further adventures with false sharing in this follow on blog.

Sunday, 24 July 2011

Memory Barriers/Fences

In this article I'll discuss the most fundamental technique in concurrent programming known as memory barriers, or fences, that make the memory state within a processor visible to other processors.

CPUs have employed many techniques to try and accommodate the fact that CPU execution unit performance has greatly outpaced main memory performance. In my “Write Combining” article I touched on just one of these techniques. The most common technique employed by CPUs to hide memory latency is to pipeline instructions and then spend significant effort, and resource, on trying to re-order these pipelines to minimise stalls related to cache misses.

When a program is executed it does not matter if its instructions are re-ordered provided the same end result is achieved. For example, within a loop it does not matter when the loop counter is updated if no operation within the loop uses it. The compiler and CPU are free to re-order the instructions to best utilise the CPU provided it is updated by the time the next iteration is about to commence. Also over the execution of a loop this variable may be stored in a register and never pushed out to cache or main memory, thus it is never visible to another CPU.

CPU cores contain multiple execution units. For example, a modern Intel CPU contains 6 execution units which can do a combination of arithmetic, conditional logic, and memory manipulation. Each execution unit can do some combination of these tasks. These execution units operate in parallel allowing instructions to be executed in parallel. This introduces another level of non-determinism to program order if it was observed from another CPU.

Finally, when a cache-miss occurs, a modern CPU can make an assumption on the results of a memory load and continue executing based on this assumption until the load returns the actual data.

Provided “program order” is preserved the CPU, and compiler, are free to do whatever they see fit to improve performance.

Figure 1.

Loads and stores to the caches and main memory are buffered and re-ordered using the load, store, and write-combining buffers. These buffers are associative queues that allow fast lookup. This lookup is necessary when a later load needs to read the value of a previous store that has not yet reached the cache. Figure 1 above depicts a simplified view of a modern multi-core CPU. It shows how the execution units can use the local registers and buffers to manage memory while it is being transferred back and forth from the cache sub-system.

In a multi-threaded environment techniques need to be employed for making program results visible in a timely manner. I will not cover cache coherence in this article. Just assume that once memory has been pushed to the cache then a protocol of messages will occur to ensure all caches are coherent for any shared data. The techniques for making memory visible from a processor core are known as memory barriers or fences.

Memory barriers provide two properties. Firstly, they preserve externally visible program order by ensuring all instructions either side of the barrier appear in the correct program order if observed from another CPU and, secondly, they make the memory visible by ensuring the data is propagated to the cache sub-system.

Memory barriers are a complex subject. They are implemented very differently across CPU architectures. At one end of the spectrum there is a relatively strong memory model on Intel CPUs that is more simple than say the weak and complex memory model on a DEC Alpha with its partitioned caches in addition to cache layers. Since x86 CPUs are the most common for multi-threaded programming I’ll try and simplify to this level.

Store Barrier
A store barrier, “sfence” instruction on x86, waits for all store instructions prior to the barrier to be written from the store buffer to the L1 cache for the CPU on which it is issued. This will make the program state visible to other CPUs so they can act on it if necessary. A good example of this in action is the following simplified code from the BatchEventProcessor in the Disruptor. When the sequence is updated other consumers and producers know how far this consumer has progressed and thus can take appropriate action. All previous updates to memory that happened before the barrier are now visible.

private volatile long sequence = RingBuffer.INITIAL_CURSOR_VALUE;

// from inside the run() method

T event = null;
long nextSequence = sequence.get() + 1L;
while (running)
{
    try
    {
        final long availableSequence = barrier.waitFor(nextSequence);

        while (nextSequence <= availableSequence)
        {
            event = ringBuffer.get(nextSequence);
            boolean endOfBatch = nextSequence == availableSequence;
            eventHandler.onEvent(event, nextSequence, endOfBatch);
            nextSequence++;
        }

        // store barrier inserted here !!!
        sequence.set(nextSequence - 1L); 
    }
    catch (final Exception ex)
    {
        exceptionHandler.handle(ex, nextSequence, event);

        // store barrier inserted here !!!
        sequence.set(nextSequence);

        nextSequence++;
    }
}

Load Barrier
A load barrier, “lfence” instruction on x86, ensures all load instructions after the barrier to happen after the barrier and then wait on the load buffer to drain for the issuing CPU. This makes program state exposed from other CPUs visible to this CPU before making further progress. A good example of this is when the BatchEventProcessor sequence referenced above is read by producers, or consumers, in the corresponding barriers of the Disruptor.

Full Barrier
A full barrier, "mfence" instruction on x86, is a composite of both load and store barriers happening on a CPU.

Java Memory Model
In the Java Memory Model a volatile field has a store barrier before the write, and full barrier after the write to it, this is paired with and a load barrier inserted after a read of it. Qualified final fields of a class have a store barrier inserted after their initialisation to ensure these fields are visible once the constructor completes when a reference to the object is available. A JVM does not have to issue specific instructions, such as sfence, it can use other techniques to achieve the same behaviour provided by the processor architecture and compiler blend.

Atomic Instructions and Software Locks
Atomic instructions, such as the “lock ...” instructions on x86, are effectively a full barrier as they lock the memory sub-system to perform an operation and have guaranteed total order, even across CPUs. Software locks usually employ memory barriers, or atomic instructions, to achieve visibility and preserve program order.

Performance Impact of Memory Barriers
Memory barriers prevent a CPU from performing a lot of techniques to hide memory latency therefore they have a significant performance cost which must be considered. To achieve maximum performance it is best to model the problem so the processor can do units of work, then have all the necessary memory barriers occur on the boundaries of these work units. Taking this approach allows the processor to optimise the units of work without restriction. There is an advantage to grouping necessary memory barriers in that buffers flushed after the first one will be less costly because no work will be under way to refill them.