Sunday, 5 August 2012

Memory Access Patterns Are Important

In high-performance computing it is often said that the cost of a cache-miss is the largest performance penalty for an algorithm.  For many years the increase in speed of our processors has greatly outstripped latency gains to main-memory.  Bandwidth to main-memory has greatly increased via wider, and multi-channel, buses however the latency has not significantly reduced.  To hide this latency our processors employ evermore complex cache sub-systems that have many layers.

The 1994 paper "Hitting the memory wall: implications of the obvious" describes the problem and goes on to argue that caches do not ultimately help because of compulsory cache-misses.  I aim to show that by using access patterns which display consideration for the cache hierarchy, this conclusion is not inevitable.

Let's start putting the problem in context with some examples.  Our hardware tries to hide the main-memory latency via a number of techniques.  Basically three major bets are taken on memory access patterns:
  1. Temporal: Memory accessed recently will likely be required again soon.
  2. Spatial: Adjacent memory is likely to be required soon. 
  3. Striding: Memory access is likely to follow a predictable pattern.
To illustrate these three bets in action let's write some code and measure the results.
  1. Walk through memory in a linear fashion being completely predictable.
  2. Pseudo randomly walk round memory within a restricted area then move on.  This restricted area is what is commonly known as an operating system page of memory.
  3. Pseudo randomly walk around a large area of the heap.
Code

The following code should be run with the -Xmx4g JVM option.
public class TestMemoryAccessPatterns
{
    private static final int LONG_SIZE = 8;
    private static final int PAGE_SIZE = 2 * 1024 * 1024;
    private static final int ONE_GIG = 1024 * 1024 * 1024;
    private static final long TWO_GIG = 2L * ONE_GIG;

    private static final int ARRAY_SIZE = (int)(TWO_GIG / LONG_SIZE);
    private static final int WORDS_PER_PAGE = PAGE_SIZE / LONG_SIZE;

    private static final int ARRAY_MASK = ARRAY_SIZE - 1;
    private static final int PAGE_MASK = WORDS_PER_PAGE - 1;

    private static final int PRIME_INC = 514229;

    private static final long[] memory = new long[ARRAY_SIZE];

    static
    {
        for (int i = 0; i < ARRAY_SIZE; i++)
        {
            memory[i] = 777;
        }
    }

    public enum StrideType
    {
        LINEAR_WALK
        {
            public int next(final int pageOffset, final int wordOffset, final int pos)
            {
                return (pos + 1) & ARRAY_MASK;
            }
        },

        RANDOM_PAGE_WALK
        {
            public int next(final int pageOffset, final int wordOffset, final int pos)
            {
                return pageOffset + ((pos + PRIME_INC) & PAGE_MASK);
            }
        },

        RANDOM_HEAP_WALK
        {
            public int next(final int pageOffset, final int wordOffset, final int pos)
            {
                return (pos + PRIME_INC) & ARRAY_MASK;
            }
        };

        public abstract int next(int pageOffset, int wordOffset, int pos);
    }

    public static void main(final String[] args)
    {
        final StrideType strideType;
        switch (Integer.parseInt(args[0]))
        {
            case 1:
                strideType = StrideType.LINEAR_WALK;
                break;

            case 2:
                strideType = StrideType.RANDOM_PAGE_WALK;
                break;

            case 3:
                strideType = StrideType.RANDOM_HEAP_WALK;
                break;

            default:
                throw new IllegalArgumentException("Unknown StrideType");
        }

        for (int i = 0; i < 5; i++)
        {
            perfTest(i, strideType);
        }
    }

    private static void perfTest(final int runNumber, final StrideType strideType)
    {
        final long start = System.nanoTime();

        int pos = -1;
        long result = 0;
        for (int pageOffset = 0; pageOffset < ARRAY_SIZE; pageOffset += WORDS_PER_PAGE)
        {
            for (int wordOffset = pageOffset, limit = pageOffset + WORDS_PER_PAGE;
                 wordOffset < limit;
                 wordOffset++)
            {
                pos = strideType.next(pageOffset, wordOffset, pos);
                result += memory[pos];
            }
        }

        final long duration = System.nanoTime() - start;
        final double nsOp = duration / (double)ARRAY_SIZE;

        if (208574349312L != result)
        {
            throw new IllegalStateException();
        }

        System.out.format("%d - %.2fns %s\n",
                          Integer.valueOf(runNumber),
                          Double.valueOf(nsOp),
                          strideType);
    }
}
Results
Intel U4100 @ 1.3GHz, 4GB RAM DDR2 800MHz, 
Windows 7 64-bit, Java 1.7.0_05
===========================================
0 - 2.38ns LINEAR_WALK
1 - 2.41ns LINEAR_WALK
2 - 2.35ns LINEAR_WALK
3 - 2.36ns LINEAR_WALK
4 - 2.39ns LINEAR_WALK

0 - 12.45ns RANDOM_PAGE_WALK
1 - 12.27ns RANDOM_PAGE_WALK
2 - 12.17ns RANDOM_PAGE_WALK
3 - 12.22ns RANDOM_PAGE_WALK
4 - 12.18ns RANDOM_PAGE_WALK

0 - 152.86ns RANDOM_HEAP_WALK
1 - 151.80ns RANDOM_HEAP_WALK
2 - 151.72ns RANDOM_HEAP_WALK
3 - 151.91ns RANDOM_HEAP_WALK
4 - 151.36ns RANDOM_HEAP_WALK

Intel i7-860 @ 2.8GHz, 8GB RAM DDR3 1333MHz, 
Windows 7 64-bit, Java 1.7.0_05
=============================================
0 - 1.06ns LINEAR_WALK
1 - 1.05ns LINEAR_WALK
2 - 0.98ns LINEAR_WALK
3 - 1.00ns LINEAR_WALK
4 - 1.00ns LINEAR_WALK

0 - 3.80ns RANDOM_PAGE_WALK
1 - 3.85ns RANDOM_PAGE_WALK
2 - 3.79ns RANDOM_PAGE_WALK
3 - 3.65ns RANDOM_PAGE_WALK
4 - 3.64ns RANDOM_PAGE_WALK

0 - 30.04ns RANDOM_HEAP_WALK
1 - 29.05ns RANDOM_HEAP_WALK
2 - 29.14ns RANDOM_HEAP_WALK
3 - 28.88ns RANDOM_HEAP_WALK
4 - 29.57ns RANDOM_HEAP_WALK

Intel i7-2760QM @ 2.40GHz, 8GB RAM DDR3 1600MHz, 
Linux 3.4.6 kernel 64-bit, Java 1.7.0_05
=================================================
0 - 0.91ns LINEAR_WALK
1 - 0.92ns LINEAR_WALK
2 - 0.88ns LINEAR_WALK
3 - 0.89ns LINEAR_WALK
4 - 0.89ns LINEAR_WALK

0 - 3.29ns RANDOM_PAGE_WALK
1 - 3.35ns RANDOM_PAGE_WALK
2 - 3.33ns RANDOM_PAGE_WALK
3 - 3.31ns RANDOM_PAGE_WALK
4 - 3.30ns RANDOM_PAGE_WALK

0 - 9.58ns RANDOM_HEAP_WALK
1 - 9.20ns RANDOM_HEAP_WALK
2 - 9.44ns RANDOM_HEAP_WALK
3 - 9.46ns RANDOM_HEAP_WALK
4 - 9.47ns RANDOM_HEAP_WALK
Analysis

I ran the code on 3 different CPU architectures illustrating generational steps forward for Intel.  It is clear from the results that each generation has become progressively better at hiding the latency to main-memory based on the 3 bets described above for a relatively small heap.  This is because the size and sophistication of various caches keep improving.  However as memory size increases they become less effective.  For example, if the array is doubled to be 4GB in size, then the average latency increases from ~30ns to ~55ns for the i7-860 doing the random heap walk.

It seems that for the linear walk case, memory latency does not exist.  However as we walk around memory in an evermore random pattern then the latency starts to become very apparent.

The random heap walk produced an interesting result.  This is a our worst case scenario, and given the hardware specifications of these systems, we could be looking at 150ns, 65ns, and 75ns for the above tests respectively based on memory controller and memory module latencies.  For the Nehalem (i7-860) I can further subvert the cache sub-system by using a 4GB array resulting in ~55ns on average per iteration.  The i7-2760QM has larger load buffers, TLB caches, and Linux is running with transparent huge pages which are all working to further hide the latency.  By playing with different prime numbers for the stride, results can vary wildly depending on processor type, e.g. try PRIME_INC = 39916801 for Nehalem.  I'd like to test this on a much larger heap with Sandy Bridge.

The main take away is the more predictable the pattern of access to memory, then the better the cache sub-systems are at hiding main-memory latency.  Let's look at these cache sub-systems in a little detail to try and understand the observed results.

Hardware Components

We have many layers of cache plus the pre-fetchers to consider for how latency gets hidden.  In this section I'll try and cover the major components used to hide latency that our hardware and systems software friends have put in place.  We will investigate these latency hiding components and use the Linux perf and Lightweight Performance Counters utilities to retrieve the performance counters from our CPUs which tell how effective these components are when we execute our programs.  Performance counters are CPU specific and what I've used here are specific to Sandy Bridge.

Data Caches
Processors typically have 2 or 3 layers of data cache.  Each layer as we move out is progressively larger with increasing latency.  The latest Intel processors have 3 layers (L1D, L2, and L3); with sizes 32KB, 256KB, and 4-30MB; and ~1ns, ~4ns, and ~15ns latency respectively for a 3.0GHz CPU.

Data caches are effectively hardware hash tables with a fixed number of slots for each hash value.  These slots are known as "ways".  An 8-way associative cache will have 8 slots to hold values for addresses that hash to the same cache location.  Within these slots the data caches do not store words, they store cache-lines of multiple words.  For an Intel processor these cache-lines are typically 64-bytes, that is 8 words on a 64-bit machine.  This plays to the spatial bet that adjacent memory is likely to be required soon, which is typically the case if we think of arrays or fields of an object.

Data caches are typically evicted in a LRU manner.  Caches work by using a write-back algorithm were stores need only be propagated to main-memory when a modified cache-line is evicted.  This gives rise the the interesting phenomenon that a load can cause a write-back to the outer cache layers and eventually to main-memory.
perf stat -e L1-dcache-loads,L1-dcache-load-misses java -Xmx4g TestMemoryAccessPatterns $

 Performance counter stats for 'java -Xmx4g TestMemoryAccessPatterns 1':
     1,496,626,053 L1-dcache-loads                                            
       274,255,164 L1-dcache-misses
         #   18.32% of all L1-dcache hits

 Performance counter stats for 'java -Xmx4g TestMemoryAccessPatterns 2':
     1,537,057,965 L1-dcache-loads                                            
     1,570,105,933 L1-dcache-misses
         #  102.15% of all L1-dcache hits 

 Performance counter stats for 'java -Xmx4g TestMemoryAccessPatterns 3':
     4,321,888,497 L1-dcache-loads                                           
     1,780,223,433 L1-dcache-misses
         #   41.19% of all L1-dcache hits  

likwid-perfctr -C 2 -g L2CACHE java -Xmx4g TestMemoryAccessPatterns $

java -Xmx4g TestMemoryAccessPatterns 1
+-----------------------+-------------+
|         Event         |   core 2    |
+-----------------------+-------------+
|   INSTR_RETIRED_ANY   | 5.94918e+09 |
| CPU_CLK_UNHALTED_CORE | 5.15969e+09 |
| L2_TRANS_ALL_REQUESTS | 1.07252e+09 |
|     L2_RQSTS_MISS     | 3.25413e+08 |
+-----------------------+-------------+
+-----------------+-----------+
|     Metric      |  core 2   |
+-----------------+-----------+
|   Runtime [s]   |  2.15481  |
|       CPI       | 0.867293  |
| L2 request rate |  0.18028  |
|  L2 miss rate   | 0.0546988 |
|  L2 miss ratio  | 0.303409  |
+-----------------+-----------+
+------------------------+-------------+
|         Event          |   core 2    |
+------------------------+-------------+
| L3_LAT_CACHE_REFERENCE | 1.26545e+08 |
|   L3_LAT_CACHE_MISS    | 2.59059e+07 |
+------------------------+-------------+

java -Xmx4g TestMemoryAccessPatterns 2
+-----------------------+-------------+
|         Event         |   core 2    |
+-----------------------+-------------+
|   INSTR_RETIRED_ANY   | 1.48772e+10 |
| CPU_CLK_UNHALTED_CORE | 1.64712e+10 |
| L2_TRANS_ALL_REQUESTS | 3.41061e+09 |
|     L2_RQSTS_MISS     | 1.5547e+09  |
+-----------------------+-------------+
+-----------------+----------+
|     Metric      |  core 2  |
+-----------------+----------+
|   Runtime [s]   | 6.87876  |
|       CPI       | 1.10714  |
| L2 request rate | 0.22925  |
|  L2 miss rate   | 0.104502 |
|  L2 miss ratio  | 0.455843 |
+-----------------+----------+
+------------------------+-------------+
|         Event          |   core 2    |
+------------------------+-------------+
| L3_LAT_CACHE_REFERENCE | 1.52088e+09 |
|   L3_LAT_CACHE_MISS    | 1.72918e+08 |
+------------------------+-------------+

java -Xmx4g TestMemoryAccessPatterns 3
+-----------------------+-------------+
|         Event         |   core 2    |
+-----------------------+-------------+
|   INSTR_RETIRED_ANY   | 6.49533e+09 |
| CPU_CLK_UNHALTED_CORE | 4.18416e+10 |
| L2_TRANS_ALL_REQUESTS | 4.67488e+09 |
|     L2_RQSTS_MISS     | 1.43442e+09 |
+-----------------------+-------------+
+-----------------+----------+
|     Metric      |  core 2  |
+-----------------+----------+
|   Runtime [s]   |  17.474  |
|       CPI       |  6.4418  |
| L2 request rate | 0.71973  |
|  L2 miss rate   | 0.220838 |
|  L2 miss ratio  | 0.306835 |
+-----------------+----------+
+------------------------+-------------+
|         Event          |   core 2    |
+------------------------+-------------+
| L3_LAT_CACHE_REFERENCE | 1.40079e+09 |
|   L3_LAT_CACHE_MISS    | 1.34832e+09 |
+------------------------+-------------+
Note: The cache-miss rate of the combined L1D, L2 and L3 increases significantly as the pattern of access becomes more random.

Translation Lookaside Buffers (TLBs)
Our programs deal with virtual memory addresses that need to be translated to physical memory addresses.  Virtual memory systems do this by mapping pages.  We need to know the offset for a given page and its size for any memory operation.  Typically page sizes are 4KB and are gradually moving to 2MB and greater.  Linux introduced Transparent Huge Pages in the 2.6.38 kernel giving us 2MB pages.  The translation of virtual memory pages to physical pages is maintained by the page table.  This translation can require multiple accesses to the page table which is a huge performance penalty.  To accelerate this lookup, processors have a small hardware cache at each cache level called the TLB cache.  A miss on the TLB cache can be hugely expensive because the page table may not be in a nearby data cache.  By moving to larger pages, a TLB cache can cover a larger address range for the same number of entries.
perf stat -e dTLB-loads,dTLB-load-misses java -Xmx4g TestMemoryAccessPatterns $
 
 Performance counter stats for 'java -Xmx4g TestMemoryAccessPatterns 1':
     1,496,128,634 dTLB-loads
           310,901 dTLB-misses
              #    0.02% of all dTLB cache hits 

 Performance counter stats for 'java -Xmx4g TestMemoryAccessPatterns 2':
     1,551,585,263 dTLB-loads
           340,230 dTLB-misses
              #    0.02% of all dTLB cache hits

 Performance counter stats for 'java -Xmx4g TestMemoryAccessPatterns 3':
     4,031,344,537 dTLB-loads
     1,345,807,418 dTLB-misses
              #   33.38% of all dTLB cache hits  
Note: We only incur significant TLB misses when randomly walking the whole heap when huge pages are employed.

Hardware Pre-Fetchers
Hardware will try and predict the next memory access our programs will make and speculatively load that memory into fill buffers.  This is done at it simplest level by pre-loading adjacent cache-lines for the spatial bet, or by recognising regular stride based access patterns, typically less than 2KB in stride length.  The tests below we are measuring the number of loads that hit a fill buffer from a hardware pre-fetch.
likwid-perfctr -C 2 -g LOAD_HIT_PRE_HW_PF:PMC0 java -Xmx4g TestMemoryAccessPatterns $

java -Xmx4g TestMemoryAccessPatterns 1
+--------------------+-------------+
|       Event        |   core 2    |
+--------------------+-------------+
| LOAD_HIT_PRE_HW_PF | 1.31613e+09 |
+--------------------+-------------+

java -Xmx4g TestMemoryAccessPatterns 2
+--------------------+--------+
|       Event        | core 2 |
+--------------------+--------+
| LOAD_HIT_PRE_HW_PF | 368930 |
+--------------------+--------+

java -Xmx4g TestMemoryAccessPatterns 3
+--------------------+--------+
|       Event        | core 2 |
+--------------------+--------+
| LOAD_HIT_PRE_HW_PF | 324373 |
+--------------------+--------+
Note: We have a significant success rate for load hits with the pre-fetcher on the linear walk.

Memory Controllers and Row Buffers
Beyond our last level cache (LLC) sits the memory controllers that manage access to the SDRAM banks.  Memory is organised into rows and columns.  To access an address, first the row address must be selected (RAS), then the column address is selected (CAS) within that row to get the word.  The row is typically a page in size and loaded into a row buffer.  Even at this stage the hardware is still helping hide the latency.  A queue of memory access requests is maintained and re-ordered so that multiple words can be fetched from the same row if possible.

Non-Uniform Memory Access (NUMA)
Systems now have memory controllers on the CPU socket.  This move to on-socket memory controllers gave an ~50ns latency reduction over existing front side bus (FSB) and external Northbridge memory controllers.  Systems with multiple sockets employ memory interconnects, QPI from Intel, which are used when one CPU wants to access memory managed by another CPU socket.  The presence of these interconnects gives rise to the non-uniform nature of server memory access.  In a 2-socket system memory may be local or 1 hop away.  On a 8-socket system memory can be up to 3 hops away, were each hop adds 20ns latency in each direction.

What does this mean for algorithms?

The difference between an L1D cache-hit, and a full miss resulting in main-memory access, is 2 orders of magnitude; i.e. <1ns vs. 65-100ns.  If algorithms randomly walk around our ever increasing address spaces, then we are less likely to benefit from the hardware support that hides this latency.

Is there anything we can do about this when designing algorithms and data-structures?  Yes there is a lot we can do.  If we perform chunks of work on data that is co-located, and we stride around memory in a predictable fashion, then our algorithms can be many times faster.  For example rather than using bucket and chain hash tables, like in the JDK, we can employ hash tables using open-addressing with linear-probing.  Rather than using linked-lists or trees with single items in each node, we can store an array of many items in each node.

Research is advancing on algorithmic approaches that work in harmony with cache sub-systems.  One area I find fascinating is Cache Oblivious Algorithms.  The name is a bit misleading but there are some great concepts here for how to improve software performance and better execute in parallel.  This article is a great illustration of the performance benefits that can be gained.

Conclusion

To achieve great performance it is important to have sympathy for the cache sub-systems.  We have seen in this article what can be achieved by accessing memory in patterns which work with, rather than against, these caches.  When designing algorithms and data structures, it is now vitally important to consider cache-misses, probably even more so than counting steps in the algorithm.  This is not what we were taught in algorithm theory when studying computer science.  The last decade has seen some fundamental changes in technology.  For me the two most significant are the rise of multi-core, and now big-memory systems with 64-bit address spaces.

One thing is certain, if we want software to execute faster and scale better, we need to make better use of the many cores in our CPUs, and pay attention to memory access patterns.

Update: 06-August-2012
Trying to design a random walk algorithm for all processors and memory sizes is tricky.  If I use the algorithm below then my Sandy Bridge processor is slower but the Nehalem is faster.  The point is performance will be very unpredictable when you walk around memory in a random fashion.  I've also included the L3 cache counters for more detail in all the tests.
    private static final long LARGE_PRIME_INC = 70368760954879L;

        RANDOM_HEAP_WALK
        {
            public int next(final int pageOffset, final int wordOffset, final int pos)
            {
                return (int)(pos + LARGE_PRIME_INC) & ARRAY_MASK;
            }
        };
Intel i7-2760QM @ 2.40GHz, 8GB RAM DDR3 1600MHz, 
Linux 3.4.6 kernel 64-bit, Java 1.7.0_05
=================================================
0 - 29.06ns RANDOM_HEAP_WALK
1 - 29.47ns RANDOM_HEAP_WALK
2 - 29.48ns RANDOM_HEAP_WALK
3 - 29.43ns RANDOM_HEAP_WALK
4 - 29.42ns RANDOM_HEAP_WALK

 Performance counter stats for 'java -Xmx4g TestMemoryAccessPatterns 3':
     9,444,928,682 dTLB-loads
     4,371,982,327 dTLB-misses
         #   46.29% of all dTLB cache hits 

     9,390,675,639 L1-dcache-loads
     1,471,647,016 L1-dcache-misses
         #   15.67% of all L1-dcache hits  

+-----------------------+-------------+
|         Event         |   core 2    |
+-----------------------+-------------+
|   INSTR_RETIRED_ANY   | 7.71171e+09 |
| CPU_CLK_UNHALTED_CORE | 1.31717e+11 |
| L2_TRANS_ALL_REQUESTS | 8.4912e+09  |
|     L2_RQSTS_MISS     | 2.79635e+09 |
+-----------------------+-------------+
+-----------------+----------+
|     Metric      |  core 2  |
+-----------------+----------+
|   Runtime [s]   | 55.0094  |
|       CPI       | 17.0801  |
| L2 request rate | 1.10108  |
|  L2 miss rate   | 0.362611 |
|  L2 miss ratio  | 0.329324 |
+-----------------+----------+
+--------------------+-------------+
|       Event        |   core 2    |
+--------------------+-------------+
| LOAD_HIT_PRE_HW_PF | 3.59509e+06 |
+--------------------+-------------+
+------------------------+-------------+
|        Event           |   core 2    |
+------------------------+-------------+
| L3_LAT_CACHE_REFERENCE | 1.30318e+09 |
| L3_LAT_CACHE_MISS      | 2.62346e+07 |
+------------------------+-------------+

Thursday, 5 July 2012

Native C/C++ Like Performance For Java Object Serialisation

Do you ever wish you could turn a Java object into a stream of bytes as fast as it can be done in a native language like C++?  If you use standard Java Serialization you could be disappointed with the performance.  Java Serialization was designed for a very different purpose than serialising objects as quickly and compactly as possible.

Why do we need fast and compact serialisation?  Many of our systems are distributed and we need to communicate by passing state between processes efficiently.  This state lives inside our objects.  I've profiled many systems and often a large part of the cost is the serialisation of this state to-and-from byte buffers.  I've seen a significant range of protocols and mechanisms used to achieve this.  At one end of the spectrum are the easy to use but inefficient protocols likes Java Serialisation, XML and JSON.  At the other end of this spectrum are the binary protocols that can be very fast and efficient but they require a deeper understanding and skill.

In this article I will illustrate the performance gains that are possible when using simple binary protocols and introduce a little known technique available in Java to achieve similar performance to what is possible with native languages like C or C++.

The three approaches to be compared are:
  1. Java Serialization: The standard method in Java of having an object implement Serializable.
  2. Binary via ByteBuffer: A simple protocol using the ByteBuffer API to write the fields of an object in binary format.  This is our baseline for what is considered a good binary encoding approach.
  3. Binary via Unsafe: Introduction to Unsafe and its collection of methods that allow direct memory manipulation.  Here I will show how to get similar performance to C/C++.
The Code
import sun.misc.Unsafe;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.lang.reflect.Field;
import java.nio.ByteBuffer;
import java.util.Arrays;

public final class TestSerialisationPerf
{
    public static final int REPETITIONS = 1 * 1000 * 1000;

    private static ObjectToBeSerialised ITEM =
        new ObjectToBeSerialised(
            1010L, true, 777, 99,
            new double[]{0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0},
            new long[]{1, 2, 3, 4, 5, 6, 7, 8, 9, 10});


    public static void main(final String[] arg) throws Exception
    {
        for (final PerformanceTestCase testCase : testCases)
        {
            for (int i = 0; i < 5; i++)
            {
                testCase.performTest();

                System.out.format("%d %s\twrite=%,dns read=%,dns total=%,dns\n",
                                  i,
                                  testCase.getName(),
                                  testCase.getWriteTimeNanos(),
                                  testCase.getReadTimeNanos(),
                                  testCase.getWriteTimeNanos() + 
                                  testCase.getReadTimeNanos());

                if (!ITEM.equals(testCase.getTestOutput()))
                {
                    throw new IllegalStateException("Objects do not match");
                }

                System.gc();
                Thread.sleep(3000);
            }
        }
    }

    private static final PerformanceTestCase[] testCases =
    {
        new PerformanceTestCase("Serialisation", REPETITIONS, ITEM)
        {
            ByteArrayOutputStream baos = new ByteArrayOutputStream();

            public void testWrite(ObjectToBeSerialised item) throws Exception
            {
                for (int i = 0; i < REPETITIONS; i++)
                {
                    baos.reset();

                    ObjectOutputStream oos = new ObjectOutputStream(baos);
                    oos.writeObject(item);
                    oos.close();
                }
            }

            public ObjectToBeSerialised testRead() throws Exception
            {
                ObjectToBeSerialised object = null;
                for (int i = 0; i < REPETITIONS; i++)
                {
                    ByteArrayInputStream bais = 
                        new ByteArrayInputStream(baos.toByteArray());
                    ObjectInputStream ois = new ObjectInputStream(bais);
                    object = (ObjectToBeSerialised)ois.readObject();
                }

                return object;
            }
        },

        new PerformanceTestCase("ByteBuffer", REPETITIONS, ITEM)
        {
            ByteBuffer byteBuffer = ByteBuffer.allocate(1024);

            public void testWrite(ObjectToBeSerialised item) throws Exception
            {
                for (int i = 0; i < REPETITIONS; i++)
                {
                    byteBuffer.clear();
                    item.write(byteBuffer);
                }
            }

            public ObjectToBeSerialised testRead() throws Exception
            {
                ObjectToBeSerialised object = null;
                for (int i = 0; i < REPETITIONS; i++)
                {
                    byteBuffer.flip();
                    object = ObjectToBeSerialised.read(byteBuffer);
                }

                return object;
            }
        },

        new PerformanceTestCase("UnsafeMemory", REPETITIONS, ITEM)
        {
            UnsafeMemory buffer = new UnsafeMemory(new byte[1024]);

            public void testWrite(ObjectToBeSerialised item) throws Exception
            {
                for (int i = 0; i < REPETITIONS; i++)
                {
                    buffer.reset();
                    item.write(buffer);
                }
            }

            public ObjectToBeSerialised testRead() throws Exception
            {
                ObjectToBeSerialised object = null;
                for (int i = 0; i < REPETITIONS; i++)
                {
                    buffer.reset();
                    object = ObjectToBeSerialised.read(buffer);
                }

                return object;
            }
        },
    };
}

abstract class PerformanceTestCase
{
    private final String name;
    private final int repetitions;
    private final ObjectToBeSerialised testInput;
    private ObjectToBeSerialised testOutput;
    private long writeTimeNanos;
    private long readTimeNanos;

    public PerformanceTestCase(final String name, final int repetitions,
                               final ObjectToBeSerialised testInput)
    {
        this.name = name;
        this.repetitions = repetitions;
        this.testInput = testInput;
    }

    public String getName()
    {
        return name;
    }

    public ObjectToBeSerialised getTestOutput()
    {
        return testOutput;
    }

    public long getWriteTimeNanos()
    {
        return writeTimeNanos;
    }

    public long getReadTimeNanos()
    {
        return readTimeNanos;
    }

    public void performTest() throws Exception
    {
        final long startWriteNanos = System.nanoTime();
        testWrite(testInput);
        writeTimeNanos = (System.nanoTime() - startWriteNanos) / repetitions;

        final long startReadNanos = System.nanoTime();
        testOutput = testRead();
        readTimeNanos = (System.nanoTime() - startReadNanos) / repetitions;
    }

    public abstract void testWrite(ObjectToBeSerialised item) throws Exception;
    public abstract ObjectToBeSerialised testRead() throws Exception;
}

class ObjectToBeSerialised implements Serializable
{
    private static final long serialVersionUID = 10275539472837495L;

    private final long sourceId;
    private final boolean special;
    private final int orderCode;
    private final int priority;
    private final double[] prices;
    private final long[] quantities;

    public ObjectToBeSerialised(final long sourceId, final boolean special,
                                final int orderCode, final int priority,
                                final double[] prices, final long[] quantities)
    {
        this.sourceId = sourceId;
        this.special = special;
        this.orderCode = orderCode;
        this.priority = priority;
        this.prices = prices;
        this.quantities = quantities;
    }

    public void write(final ByteBuffer byteBuffer)
    {
        byteBuffer.putLong(sourceId);
        byteBuffer.put((byte)(special ? 1 : 0));
        byteBuffer.putInt(orderCode);
        byteBuffer.putInt(priority);

        byteBuffer.putInt(prices.length);
        for (final double price : prices)
        {
            byteBuffer.putDouble(price);
        }

        byteBuffer.putInt(quantities.length);
        for (final long quantity : quantities)
        {
            byteBuffer.putLong(quantity);
        }
    }

    public static ObjectToBeSerialised read(final ByteBuffer byteBuffer)
    {
        final long sourceId = byteBuffer.getLong();
        final boolean special = 0 != byteBuffer.get();
        final int orderCode = byteBuffer.getInt();
        final int priority = byteBuffer.getInt();

        final int pricesSize = byteBuffer.getInt();
        final double[] prices = new double[pricesSize];
        for (int i = 0; i < pricesSize; i++)
        {
            prices[i] = byteBuffer.getDouble();
        }

        final int quantitiesSize = byteBuffer.getInt();
        final long[] quantities = new long[quantitiesSize];
        for (int i = 0; i < quantitiesSize; i++)
        {
            quantities[i] = byteBuffer.getLong();
        }

        return new ObjectToBeSerialised(sourceId, special, orderCode, 
                                        priority, prices, quantities);
    }

    public void write(final UnsafeMemory buffer)
    {
        buffer.putLong(sourceId);
        buffer.putBoolean(special);
        buffer.putInt(orderCode);
        buffer.putInt(priority);
        buffer.putDoubleArray(prices);
        buffer.putLongArray(quantities);
    }

    public static ObjectToBeSerialised read(final UnsafeMemory buffer)
    {
        final long sourceId = buffer.getLong();
        final boolean special = buffer.getBoolean();
        final int orderCode = buffer.getInt();
        final int priority = buffer.getInt();
        final double[] prices = buffer.getDoubleArray();
        final long[] quantities = buffer.getLongArray();

        return new ObjectToBeSerialised(sourceId, special, orderCode, 
                                        priority, prices, quantities);
    }

    @Override
    public boolean equals(final Object o)
    {
        if (this == o)
        {
            return true;
        }
        if (o == null || getClass() != o.getClass())
        {
            return false;
        }

        final ObjectToBeSerialised that = (ObjectToBeSerialised)o;

        if (orderCode != that.orderCode)
        {
            return false;
        }
        if (priority != that.priority)
        {
            return false;
        }
        if (sourceId != that.sourceId)
        {
            return false;
        }
        if (special != that.special)
        {
            return false;
        }
        if (!Arrays.equals(prices, that.prices))
        {
            return false;
        }
        if (!Arrays.equals(quantities, that.quantities))
        {
            return false;
        }

        return true;
    }
}

class UnsafeMemory
{
    private static final Unsafe unsafe;
    static
    {
        try
        {
            Field field = Unsafe.class.getDeclaredField("theUnsafe");
            field.setAccessible(true);
            unsafe = (Unsafe)field.get(null);
        }
        catch (Exception e)
        {
            throw new RuntimeException(e);
        }
    }

    private static final long byteArrayOffset = unsafe.arrayBaseOffset(byte[].class);
    private static final long longArrayOffset = unsafe.arrayBaseOffset(long[].class);
    private static final long doubleArrayOffset = unsafe.arrayBaseOffset(double[].class);

    private static final int SIZE_OF_BOOLEAN = 1;
    private static final int SIZE_OF_INT = 4;
    private static final int SIZE_OF_LONG = 8;

    private int pos = 0;
    private final byte[] buffer;

    public UnsafeMemory(final byte[] buffer)
    {
        if (null == buffer)
        {
            throw new NullPointerException("buffer cannot be null");
        }

        this.buffer = buffer;
    }

    public void reset()
    {
        this.pos = 0;
    }

    public void putBoolean(final boolean value)
    {
        unsafe.putBoolean(buffer, byteArrayOffset + pos, value);
        pos += SIZE_OF_BOOLEAN;
    }

    public boolean getBoolean()
    {
        boolean value = unsafe.getBoolean(buffer, byteArrayOffset + pos);
        pos += SIZE_OF_BOOLEAN;

        return value;
    }

    public void putInt(final int value)
    {
        unsafe.putInt(buffer, byteArrayOffset + pos, value);
        pos += SIZE_OF_INT;
    }

    public int getInt()
    {
        int value = unsafe.getInt(buffer, byteArrayOffset + pos);
        pos += SIZE_OF_INT;

        return value;
    }

    public void putLong(final long value)
    {
        unsafe.putLong(buffer, byteArrayOffset + pos, value);
        pos += SIZE_OF_LONG;
    }

    public long getLong()
    {
        long value = unsafe.getLong(buffer, byteArrayOffset + pos);
        pos += SIZE_OF_LONG;

        return value;
    }

    public void putLongArray(final long[] values)
    {
        putInt(values.length);

        long bytesToCopy = values.length << 3;
        unsafe.copyMemory(values, longArrayOffset,
                          buffer, byteArrayOffset + pos,
                          bytesToCopy);
        pos += bytesToCopy;
    }

    public long[] getLongArray()
    {
        int arraySize = getInt();
        long[] values = new long[arraySize];

        long bytesToCopy = values.length << 3;
        unsafe.copyMemory(buffer, byteArrayOffset + pos,
                          values, longArrayOffset,
                          bytesToCopy);
        pos += bytesToCopy;

        return values;
    }

    public void putDoubleArray(final double[] values)
    {
        putInt(values.length);

        long bytesToCopy = values.length << 3;
        unsafe.copyMemory(values, doubleArrayOffset,
                          buffer, byteArrayOffset + pos,
                          bytesToCopy);
        pos += bytesToCopy;
    }

    public double[] getDoubleArray()
    {
        int arraySize = getInt();
        double[] values = new double[arraySize];

        long bytesToCopy = values.length << 3;
        unsafe.copyMemory(buffer, byteArrayOffset + pos,
                          values, doubleArrayOffset,
                          bytesToCopy);
        pos += bytesToCopy;

        return values;
    }
}

Results
2.8GHz Nehalem - Java 1.7.0_04
==============================
0 Serialisation  write=2,517ns read=11,570ns total=14,087ns
1 Serialisation  write=2,198ns read=11,122ns total=13,320ns
2 Serialisation  write=2,190ns read=11,011ns total=13,201ns
3 Serialisation  write=2,221ns read=10,972ns total=13,193ns
4 Serialisation  write=2,187ns read=10,817ns total=13,004ns
0 ByteBuffer     write=264ns   read=273ns    total=537ns
1 ByteBuffer     write=248ns   read=243ns    total=491ns
2 ByteBuffer     write=262ns   read=243ns    total=505ns
3 ByteBuffer     write=300ns   read=240ns    total=540ns
4 ByteBuffer     write=247ns   read=243ns    total=490ns
0 UnsafeMemory   write=99ns    read=84ns     total=183ns
1 UnsafeMemory   write=53ns    read=82ns     total=135ns
2 UnsafeMemory   write=63ns    read=66ns     total=129ns
3 UnsafeMemory   write=46ns    read=63ns     total=109ns
4 UnsafeMemory   write=48ns    read=58ns     total=106ns

2.4GHz Sandy Bridge - Java 1.7.0_04
===================================
0 Serialisation  write=1,940ns read=9,006ns total=10,946ns
1 Serialisation  write=1,674ns read=8,567ns total=10,241ns
2 Serialisation  write=1,666ns read=8,680ns total=10,346ns
3 Serialisation  write=1,666ns read=8,623ns total=10,289ns
4 Serialisation  write=1,715ns read=8,586ns total=10,301ns
0 ByteBuffer     write=199ns   read=198ns   total=397ns
1 ByteBuffer     write=176ns   read=178ns   total=354ns
2 ByteBuffer     write=174ns   read=174ns   total=348ns
3 ByteBuffer     write=172ns   read=183ns   total=355ns
4 ByteBuffer     write=174ns   read=180ns   total=354ns
0 UnsafeMemory   write=38ns    read=75ns    total=113ns
1 UnsafeMemory   write=26ns    read=52ns    total=78ns
2 UnsafeMemory   write=26ns    read=51ns    total=77ns
3 UnsafeMemory   write=25ns    read=51ns    total=76ns
4 UnsafeMemory   write=27ns    read=50ns    total=77ns

Analysis

To write and read back a single relatively small object on my fast 2.4 GHz Sandy Bridge laptop can take ~10,000ns using Java Serialization, whereas when using Unsafe this can come down to well less than 100ns even accounting for the test code itself.  To put this in context, when using Java Serialization the costs are on par with a network hop!  Now that would be very costly if your transport is a fast IPC mechanism on the same system.

There are numerous reasons why Java Serialisation is so costly.  For example it writes out the fully qualified class and field names for each object plus version information.  Also ObjectOutputStream keeps a collection of all written objects so they can be conflated when close() is called.   Java Serialisation requires 340 bytes for this example object, yet we only require 185 bytes for the binary versions.  Details for the Java Serialization format can be found here.  If I had not used arrays for the majority of data, then the serialised object would have been significantly larger with Java Serialization because of the field names.  In my experience text based protocols like XML and JSON can be even less efficient than Java Serialization.  Also be aware that Java Serialization is the standard mechanism employed for RMI.

The real issue is the number of instructions to be executed.  The Unsafe method wins by a significant margin because in Hotspot, and many other JVMs, the optimiser treats these operations as intrinsics and replaces the call with assembly instructions to perform the memory manipulation.  For primitive types this results in a single x86 MOV instruction which can often happen in a single cycle.  The details can be seen by having Hotspot output the optimised code as I described in a previous article.

Now it has to be said that "with great power comes great responsibility" and if you use Unsafe it is effectively the same as programming in C, and with that can come memory access violations when you get offsets wrong.

Adding Some Context

"What about the likes of Google Protocol Buffers?", I hear you cry out.  These are very useful libraries and can often offer better performance and more flexibility than Java Serialisation.  However they are not remotely close to the performance of using Unsafe like I have shown here.  Protocol Buffers solve a different problem and provide nice self-describing messages which work well across languages.  Please test with different protocols and serialisation techniques to compare results.

Also the astute among you will be asking, "What about Endianness (byte-ordering) of the integers written?"  With Unsafe the bytes are written in native order.  This is great for IPC and between systems of the same type.  When systems use differing formats then conversion will be necessary.

How do we deal with multiple versions of a class or determining what class an object belongs to?  I want to keep this article focused but let's say a simple integer to indicate the implementation class is all that is required for a header.  This integer can be used to look up the appropriately implementation for the de-serialisation operation.

An argument I often hear against binary protocols, and for text protocols, is what about being human readable and debugging?  There is an easy solution to this.  Develop a tool for reading the binary format!

Conclusion

In conclusion it is possible to achieve the same native C/C++ like levels of performance in Java for serialising an object to-and-from a byte stream by effectively using the same techniques.  The UnsafeMemory class, for which I've provided a skeleton implementation, could easily be expanded to encapsulate this behaviour and thus protect oneself from many of the potential issues when dealing with such a sharp tool.

Now for the burning question.  Would it not be so much better if Java offered an alternative Marshallable interface to Serializable by offering natively what I've effectively done with Unsafe???

Saturday, 19 May 2012

Applying Back Pressure When Overloaded

How should a system respond when under sustained load?  Should it keep accepting requests until its response times follow the deadly hockey stick, followed by a crash?  All too often this is what happens unless a system is designed to cope with the case of more requests arriving than it is capable of processing.  If we are seeing a sustained arrival rate of requests, greater than our system is capable of processing, then something has to give.  Having the entire system degrade is not the ideal service we want to give our customers.  A better approach would be to process transactions at our systems maximum possible throughput rate, while maintaining a good response time, and rejecting requests above this arrival rate.

Let’s consider a small art gallery as an metaphor.  In this gallery the typical viewer spends on average 20 minutes browsing, and the gallery can hold a maximum of 30 viewers.  If more than 30 viewers occupy the gallery at the same time then customers become unhappy because they cannot have a clear view of the paintings.  If this happens they are unlikely to purchase or return.  To keep our viewers happy it is better to recommend that some viewers visit the café a few doors down and come back when the gallery is less busy.  This way the viewers in the gallery get to see all the paintings without other viewers in the way, and in the meantime those we cannot accommodate enjoy a coffee.  If we apply Little’s Law we cannot have customers arriving at more than 90 per hour, otherwise the maximum capacity is exceeded.  If between 9:00-10:00 they are arriving at 100 per hour, then I’m sure the café down the road will appreciate the extra 10 customers.

Within our systems the available capacity is generally a function of the size of our thread pools and time to process individual transactions.  These thread pools are usually fronted by queues to handle bursts of traffic above our maximum arrival rate.  If the queues are unbounded, and we have a sustained arrival rate above the maximum capacity, then the queues will grow unchecked.  As the queues grow they increasingly add latency beyond acceptable response times, and eventually they will consume all memory causing our systems to fail.  Would it not be better to send the overflow of requests to the café while still serving everyone else at the maximum possible rate?  We can do this by designing our systems to apply “Back Pressure”.

Figure 1.

Separation of concerns encourages good systems design at all levels.  I like to layer a design so that the gateways to third parties are separated from the main transaction services.  This can be achieved by having gateways responsible for protocol translation and border security only.  A typical gateway could be a web container running Servlets.  Gateways accept customer requests, apply appropriate security, and translate the channel protocols for forwarding to the transaction service hosting the domain model.  The transaction service may use a durable store if transactions need to be preserved.  For example, the state of a chat server domain model may not require preservation, whereas a model for financial transactions must be kept for many years for compliance and business reasons.

Figure 1. above is a simplified view of the typical request flow in many systems.  Pools of threads in a gateway accept user requests and forward them to a transaction service.  Let’s assume we have asynchronous transaction services fronted by an input and output queues, or similar FIFO structures.  If we want the system to meet a response time quality-of-service (QoS) guarantee, then we need to consider the three following variables:
  1. The time taken for individual transactions on a thread
  2. The number of threads in a pool that can execute transactions in parallel
  3. The length of the input queue to set the maximum acceptable latency
    max latency = (transaction time / number of threads) * queue length
    queue length = max latency / (transaction time / number of threads)

By allowing the queue to be unbounded the latency will continue to increase.  So if we want to set a maximum response time then we need to limit the queue length.

By bounding the input queue we block the thread receiving network packets which will apply back pressure up stream.  If the network protocol is TCP, similar back pressure is applied via the filling of network buffers, on the sender.  This process can repeat all the way back via the gateway to the customer.  For each service we need to configure the queues so that they do their part in achieving the required quality-of-service for the end-to-end customer experience.

One of the biggest wins I often find is to improve the time taken to process individual transaction latency.  This helps in the best and worst case scenarios.

Worst Case Scenario

Let’s say the queue is unbounded and the system is under sustained heavy load.  Things can begin to go wrong very quickly in subtle ways before memory is exhausted.  What do you think will happen when the queue is larger than the processor cache?  The consumer threads will be suffering cache misses just at the time when they are struggling to keep up, thus compounding the problem.  This can cause a system to get into trouble very quickly and eventually crash.  Under Linux this is particularly nasty because malloc, or one of its friends, will succeed because Linux allows “Over Commit” by default, then later at the point of using that memory, the OOM Killer will start shooting processes. When the OS starts shooting processes, you just know things are not going to end well!

What About Synchronous Designs?

You may say that with synchronous designs there are no queues.  Well not such obvious ones.  If you have a thread pool then it will have a lock, or semaphore, wait queues to assign threads.  If you are crazy enough to allocate a new thread on every request, then once you are over the huge cost of thread creation, your thread is in the run queue for a processor to execute.  Also, these queues involve context switches and condition variables which greatly increase the costs.  You just cannot run away from queues, they are everywhere!  Best to embrace them and design for the quality-of-service your system needs to deliver to its customers.  If we must have queues, then design for them, and maybe choose some nice lock-free ones with great performance.

When we need to support synchronous protocols like REST then use back pressure, signalled by our full incoming queue at the gateway, to send a meaningful “server busy” message such as the HTTP 503 status code.  The customer can then interpret this as time for a coffee and cake at the café down the road.

Subtleties To Watch Out For...

You need to consider the whole end-to-end service.  What if a client is very slow at consuming data from your system?  It could tie up a thread in the gateway taking it out of action.  Now you have less threads working the queue so the response time will be increasing.  Queues and threads need to be monitored, and appropriate action needs to be taken when thresholds are crossed.  For example, when a queue is 70% full, maybe an alert should be raised so an investigation can take place?  Also, transaction times need to be sampled to ensure they are in the expected range.

Summary

If we do not consider how our systems will behave when under heavy load then they will most likely seriously degrade at best, and at worst crash.  When they crash this way, we get to find out if there are any really evil data corruption bugs lurking in those dark places.  Applying back pressure is one effective technique for coping with sustained high-load, such that maximum throughput can be delivered without degrading system performance for the already accepted requests and transactions.

Sunday, 29 April 2012

Invoke Interface Optimisations

I'm often asked about the performance differences between Java, C, and C++, and which is better.  As with most things in life there is no black and white answer.  A lot is often discussed about how managed runtime based languages offer less performance than their statically compiled compatriots.  There are however a few tricks available to managed runtimes that can provide optimisation opportunities not available to statically optimised languages.

One such optimisation available to the runtime is to dynamically inline a method at the call site.  Many would say inlining is *the* major optimisation of dynamic languages.  This is an approach whereby the function/method call overhead can be avoided and further optimisations enabled.  Inlining can easily be done at compile, or run, time for static or private methods of a class because they cannot be overridden.  It can also be done by Hotspot at run time which is way more interesting.  In bytecode the runtime will see invokestatic and invokespecial opcodes for static and private methods respectively.  Methods that involve late binding, such as interface implementations and method overriding, appear as the invokeinterface and invokevirtual opcodes respectively.

At compile time it is not possible to determine how many implementations there will be for an interface, or how many classes will override a base method.  The compiler can have some awareness but just how do you deal with dynamically loaded classes via Class.forName("x").newInstance()?  The Hotspot runtime is very smart.  It can track all classes as they are loaded and apply appropriate optimisations to give the best possible performance for our code.  One such approach is dynamic inlining at the call site which we will explore.

Code

public interface Operation
{
    int map(int value);
}

public class IncOperation implements Operation
{
    public int map(final int value)
    {
        return value + 1;
    }
}

public class DecOperation implements Operation
{
    public int map(final int value)
    {
        return value - 1;
    }
}

public class StepIncOperation implements Operation
{
    public int map(final int value)
    {
        return value + 7;
    }
}

public class StepDecOperation implements Operation
{
    public int map(final int value)
    {
        return value - 3;
    }
}

public final class OperationPerfTest
{
    private static final int ITERATIONS = 50 * 1000 * 1000;

    public static void main(final String[] args)
        throws Exception
    {
        final Operation[] operations = new Operation[4];
        int index = 0;
        operations[index++] = new StepIncOperation();
        operations[index++] = new StepDecOperation();
        operations[index++] = new IncOperation();
        operations[index++] = new DecOperation();

        int value = 777;
        for (int i = 0; i < 3; i++)
        {
            System.out.println("*** Run each method in turn: loop " + i);

            for (final Operation operation : operations)
            {
                System.out.println(operation.getClass().getName()); 
                value = runTests(operation, value);
            }
        } 

        System.out.println("value = " + value);
    } 

    private static int runTests(final Operation operation, int value)
    {
        for (int i = 0; i < 10; i++)
        {
            final long start = System.nanoTime();

            value += opRun(operation, value);

            final long duration = System.nanoTime() - start;
            final long opsPerSec = 
                (ITERATIONS * 1000L * 1000L * 1000L) / duration;
            System.out.printf("    %,d ops/sec\n", opsPerSec);
        }

        return value;
    } 

    private static int opRun(final Operation operation, int value)
    {
        for (int i = 0; i < ITERATIONS; i++)
        {
            value += operation.map(value);
        } 

        return value;
    } 
}

Results


The following results are for running on a Linux 3.3.2 kernel with Oracle 1.7.0_02 server JVM on a Intel Sandy Bridge 2.4Ghz processor.

*** Run each method in turn: loop 0
StepIncOperation
    2,256,816,714 ops/sec
    2,245,800,936 ops/sec
    3,161,643,847 ops/sec
    3,100,375,269 ops/sec
    3,144,364,173 ops/sec
    3,091,009,138 ops/sec
    3,089,241,641 ops/sec
    3,153,922,056 ops/sec
    3,147,331,497 ops/sec
    3,076,211,099 ops/sec
StepDecOperation
    623,131,120 ops/sec
    659,686,236 ops/sec
    1,029,231,089 ops/sec
    1,021,060,933 ops/sec
    999,287,607 ops/sec
    1,015,432,172 ops/sec
    1,023,581,307 ops/sec
    1,019,266,750 ops/sec
    1,022,726,580 ops/sec
    1,004,237,016 ops/sec
IncOperation
    301,419,319 ops/sec
    304,712,250 ops/sec
    307,269,912 ops/sec
    308,519,923 ops/sec
    307,372,436 ops/sec
    306,230,247 ops/sec
    307,964,022 ops/sec
    306,243,292 ops/sec
    308,689,942 ops/sec
    365,152,716 ops/sec
DecOperation
    236,804,700 ops/sec
    237,912,786 ops/sec
    238,672,489 ops/sec
    278,745,901 ops/sec
    278,169,934 ops/sec
    277,979,158 ops/sec
    276,620,509 ops/sec
    278,349,766 ops/sec
    276,159,225 ops/sec
    278,578,373 ops/sec
*** Run each method in turn: loop 1
StepIncOperation
    276,054,944 ops/sec
    276,683,805 ops/sec
    276,551,970 ops/sec
    279,861,144 ops/sec
    275,543,192 ops/sec
    278,451,092 ops/sec
    275,399,262 ops/sec
    277,340,411 ops/sec
    274,529,616 ops/sec
    277,091,930 ops/sec
StepDecOperation
    279,729,066 ops/sec
    279,812,269 ops/sec
    276,478,587 ops/sec
    277,660,649 ops/sec
    276,844,441 ops/sec
    278,684,313 ops/sec
    277,791,665 ops/sec
    277,617,484 ops/sec
    278,575,241 ops/sec
    278,228,274 ops/sec
IncOperation
    277,724,770 ops/sec
    278,234,042 ops/sec
    276,798,434 ops/sec
    277,926,962 ops/sec
    277,786,824 ops/sec
    278,739,590 ops/sec
    275,286,293 ops/sec
    279,062,831 ops/sec
    276,672,019 ops/sec
    277,248,956 ops/sec
DecOperation
    277,303,150 ops/sec
    277,746,139 ops/sec
    276,245,511 ops/sec
    278,559,202 ops/sec
    274,683,406 ops/sec
    279,280,730 ops/sec
    276,174,620 ops/sec
    276,374,159 ops/sec
    275,943,446 ops/sec
    277,765,688 ops/sec
*** Run each method in turn: loop 2
StepIncOperation
    278,405,907 ops/sec
    278,713,953 ops/sec
    276,841,096 ops/sec
    277,891,660 ops/sec
    275,716,314 ops/sec
    277,474,242 ops/sec
    277,715,270 ops/sec
    277,857,014 ops/sec
    275,956,486 ops/sec
    277,675,378 ops/sec
StepDecOperation
    277,273,039 ops/sec
    278,101,972 ops/sec
    275,694,572 ops/sec
    276,312,449 ops/sec
    275,964,418 ops/sec
    278,423,621 ops/sec
    276,498,569 ops/sec
    276,593,475 ops/sec
    276,238,451 ops/sec
    277,057,568 ops/sec
IncOperation
    275,700,451 ops/sec
    277,463,507 ops/sec
    275,886,477 ops/sec
    277,546,096 ops/sec
    275,019,816 ops/sec
    278,242,287 ops/sec
    277,317,964 ops/sec
    277,252,014 ops/sec
    276,893,038 ops/sec
    277,601,325 ops/sec
DecOperation
    275,580,894 ops/sec
    280,146,646 ops/sec
    276,901,134 ops/sec
    276,672,567 ops/sec
    276,879,422 ops/sec
    278,674,196 ops/sec
    275,606,174 ops/sec
    278,132,534 ops/sec
    275,858,358 ops/sec
    279,444,112 ops/sec

What is going on here?


On the first iteration over the list of operations we see the performance degrade from ~3bn operations per second down to ~275m operations per second.  This happens in a step function with each new implementation loaded.  On the second, and subsequent, iteration over the array of operations, performance stabilised at ~275m operations per second.  What we are seeing here is how Hotspot can optimise when we have a limited number of implementations for an interface, and how it has to fall back to late bound method calls when many implementations are possible from a given call site.

If we run the JVM with -XX:+PrintCompilation we can see Hotspot choosing to compile the methods then de-optimise existing optimisations as new implementations get loaded.

     52    1             java.lang.String::hashCode (67 bytes)
     54    2             StepIncOperation::map (5 bytes)
     55    1 %           OperationPerfTest::opRun @ 2 (26 bytes)
     76    3             OperationPerfTest::opRun (26 bytes)
    223    3             OperationPerfTest::opRun (26 bytes)   made not entrant
    223    1 %           OperationPerfTest::opRun @ -2 (26 bytes)   made not entrant
    224    2 %           OperationPerfTest::opRun @ 2 (26 bytes)
    224    4             StepDecOperation::map (4 bytes)
    306    5             OperationPerfTest::opRun (26 bytes)
    772    2 %           OperationPerfTest::opRun @ -2 (26 bytes)   made not entrant
    772    3 %           OperationPerfTest::opRun @ 2 (26 bytes)
    773    6             IncOperation::map (4 bytes)
    930    5             OperationPerfTest::opRun (26 bytes)   made not entrant
   1995    7             OperationPerfTest::opRun (26 bytes)
   2293    8             DecOperation::map (4 bytes)
  11339    9             java.lang.String::indexOf (87 bytes)
  15017   10             java.lang.String::charAt (33 bytes)

The output above shows the decisions made by Hotspot as it compiles code.  When the third column contains the symbol "%" it is performing OSR (On Stack Replacement) of the method.  This is followed 4 times by the method being "made not entrant" as it is de-optimised when Hotspot discovers new implementations.  3 times the method is made not entrant for the newly discovered classes and once for removing the OSR version to be replaced by a non-OSR normal JIT'ed version when the final implementation is settled on.  Even greater detail can be seen by replacing -XX:+PrintCompilation with -XX:+UnlockDiagnosticVMOptions -XX:+LogCompilation.

For the monomorphic single implementation case, Hotspot can simply inline the method and place a trap in the code to fire if future implementations are loaded.  This gives performance very similar to no function call overhead. For the second bimorphic implementation, Hotspot can inline both methods and select the implementation based on a branch condition.  Beyond this things get tricky and jump tables are required to  resolve the method at runtime, thus making the code polymorphic or megamorphic.  The generated assembly code can be viewed with -XX:+UnlockDiagnosticVMOptions -XX:CompileCommand=print,OperationPerfTest.doRun JVM options for Java 7. The output shows the steps in compilation whereby not only is the method inlining deoptimised, Hotspot also no longer does loop unrolling for this method.

Conclusions


We can see that if an interface method has only one or two implementations then Hotspot can dynamically inline the method avoiding the function call overhead.  This would only be possible with profile guided optimisation for a language like C or C++.  We can also see that method calls are relatively cheap on a modern JVM, in the order of 12 cycles, even when we cannot avoid them.  It should be noted that the cost of method calls goes up by a few cycles for each additional argument passed.

In addition, I have observed that when a class implements multiple interfaces, with multiple methods, performance can degrade significantly because the method dispatch involves a linear search of method list to find the right implementation for dispatch.  Overridden methods from a base class do not involve this linear search but still require the jump table dispatch.  All the more reason to keep classes and interfaces simple.

Thursday, 22 March 2012

Fun with my-Channels Nirvana and Azul Zing

Since leaving LMAX I have been neglecting my blog a bit.  This is not because I have not been doing anything interesting.  Quite the opposite really, things have been so busy the blog has taken a back seat.  I’ve been consulting for a number of hedge funds and product companies, most of which are super secretive.

One company I have been spending quite a bit of time with is my-Channels, a messaging provider.  They are really cool and have given me their blessing to blog about some of the interesting things I’ve been working on for them.

For context, my-Channels are a messaging provider that specialise in delivering data to every device known to man over dodgy networks such as the Internet or your corporate WAN.  They can deliver live financial market data to your desktop, laptop at home, or your iPhone, at the fastest possible rates.  Lately, they have made the strategic move to enter the low-latency messaging space for the enterprise, and as part of this they have enlisted my services.  They want to go low-latency without giving up the rich functionality their product offers which is giving me some interesting challenges.

Just how bad is the latency of such a product when new to the low-latency space?  I did not have high expectations because to be fair this was never their goal.  After some initial tests, I’m thinking these guys are not in bad shape.  They beat the crap out of most JMS implementations and it is going to be fun pushing them to the serious end of the low-latency space. 

OK enough of the basic tests, now it is time to get serious.  I worked with them to create appropriate load tests and get the profilers running.  No big surprises here, when we piled on the pressure, lock-contention came out as the biggest culprit limiting both latency and throughput.  As we go down the list, lots of other interesting things showed up but let’s follow good discipline and start at the top of the list.

Good discipline for “Theory of Constraints” states that you always work on the most limiting factor because when it is removed the list below it can change radically as new pressures are applied.  So to address this contention issue we developed a new lock-free Executor to replace the standard Java implementation.  Tests showed this new executor is ~10X better than what the JDK has to offer.  We integrated the new Executor into the code base and now the throughput bottleneck has been massively changed.  The system can now cope with 16X more throughput, and the latency histogram has become much more compressed.  This is a good example of how macro-benchmarking is so much more valuable than micro-benchmarking.  Not a bad start we are all thinking.

Enter Azul Stage Left

We tested on all the major JVMs and the most predictable latency was achieved with Azul Zing.  Zing had by far the best latency profile with virtually no long tail.  For many of the tests it also had the greatest throughput.

After the lock contention on the Executor issue had been resolved, the next big bottleneck when load testing on the same machine was being limited by using TCP between processes over the loopback adapter.  We discussed developing a new transport that was not network based for Nirvana.  For this we decided to apply a number of the techniques I teach on my lock-free concurrency course.  This resulted in a new IPC transport based on shared memory via memory-mapped files in Java.  We did inter-server testing using 10GigE networks, and had a fun using the new Solarflare network adapters with OpenOnload, but for this article I’ll stick with the Java story.  I think Paul is still sore from me stuffing his little Draytek ADSL router with huge amounts of multicast traffic when the poor thing was connected to our 10GigE test LAN.  Sorry Paul!

Developing the IPC transport unearthed a number of challenges with various JVM implementations of MappedByteBuffer.  After some very useful chats with Cliff Click and Doug Lea we came up with a solution that worked across all JVMs.   This solution has a mean latency of ~100ns on the best JVMs and can do ~12-22 million messages per second throughput for 60-byte messages depending on the JVM.  This was the first time we had found a test whereby Azul was not close to being the fastest.   I isolated a test case and sent it to them on a Friday.  On Sunday evening I got an email from Gil Tene saying he had identified the issue and by Tuesday Cliff Click had a fix that we tried the next week.  When we tested the new Azul JVM, we seen over 40 million messages per second at latencies just over 100ns for our new IPC transport.  I had been teasing Azul that this must be possible in Java because I’d created similar algorithms in C and assembler that show what the x86_64 platform is capable of.

I’m starting to ramble but we had great fun removing latency through many parts of the stack.  When I get more time I will blog about some of the other findings.  The current position is still a work in progress with daily progress on an amazing scale.  The guys at my-Channels are very conservative and do not want to publish actual figures until they have version 7.0 of Nirvana ready for GA, and have done more comprehensive testing.  For now they are happy with me being open about the following:
  • Throughput increased 32X due to the implementation of lock-free techniques and optimising the call stack for message handling to remove any shared dependencies.
  • Average latency decreased 20X from applying the same techniques and we have identified many more possible improvements.
  • We know the raw transport for IPC is now ~100ns and the worst case pause due to GC is 80µs with Azul Zing.  As to the latency for the double hop between a producer and consumer over IPC, via their broker, I’ll leave to your imagination as somewhere between those figures until the guys are willing to make an official announcement.  As you can guess it is much much less than 80µs.
For me the big surprise was GC pauses only taking 80µs in the worst case.  OS scheduling alone I have seen result in more jitter.  I discussed this at length with Gil Tene from Azul, and even he was surprised.  He expects some worst case scenarios with their JVM to be 1-2ms for a well behaved application.  We then explored the my-Channels setup, and it turns out we have done everything almost perfectly to get the best out of a JVM which is worth sharing.
  1. Do not use locks in the main transaction flow because they cause context switches, and therefore latency and unpredictable jitter.
  2. Never have more threads that need to run than you have cores available.
  3. Set affinity of threads to cores, or at least sockets, to avoid cache pollution by avoiding migration.  This is particularly important when on a server class machine having multiple sockets because of the NUMA effect.
  4. Ensure uncontested access to any resource respecting the Single Writer Principle so that the likes of biased locking can be your friend.
  5. Keep call stacks reasonably small.  Still more work to do here.  If you are crazy enough to use Spring, then check out your call stacks to see what I mean!  The garbage collector has to walk them finding reachable objects.
  6. Do not use finalizers.
  7. Keep garbage generation to modest levels.  This applies to most JVMs but is likely not an issue for Zing.
  8. Ensure no disk IO on the main flow.
  9. Do a proper warm-up before beginning to measure.
  10. Do all the appropriate OS tunings for low-latency systems that are way beyond this blog.  For example turn off C-States power management in the BIOS and watch out for RHEL 6 as it turns it back on without telling you!
It should be noted that we ran this on some state of the art Intel CPUs with very large L3 caches.  It is possible to get 20-30MB L3 caches on a single socket these days.  It is very likely that our entire application was running out of L3 cache with the exception of the message flow which is very predictable.

Gil has added a cautionary note that while these results are very impressive we had a team focused on this issue with the appropriate skills to get the best out of the application.  It is not the usual case for every client to apply this level of focus.

What I’ve taken from this experience is the amazing things that can be achieved by truly agile companies, staffed by talented individuals, who are empowered to make things happen.  I love agile development but it has become a religion to some people who are more interested in following the “true” process than doing what is truly needed.  Both my-Channels and Azul have shown during this engagement what is possible in making s*#t happen.  It has been an absolute blast working with individuals who can assimilate information and ideas so fast, then turn them into working software.  For this I will embarrass Matt Buckton at my-Channels, and Gil Tene & Cliff Click at Azul who never failed in rising to a challenge.  So few organisations could have made so much progress over such a short time period.  If you think Java cannot cut it in the high performance space, then deal with one of these two companies, and you will be thinking again.  I bet a few months ago Matt never thought he’d be sitting in Singapore airport writing his first multi-producer lock-free queue when travelling home, and really enjoying it.