Sunday 5 August 2012

Memory Access Patterns Are Important

In high-performance computing it is often said that the cost of a cache-miss is the largest performance penalty for an algorithm.  For many years the increase in speed of our processors has greatly outstripped latency gains to main-memory.  Bandwidth to main-memory has greatly increased via wider, and multi-channel, buses however the latency has not significantly reduced.  To hide this latency our processors employ evermore complex cache sub-systems that have many layers.

The 1994 paper "Hitting the memory wall: implications of the obvious" describes the problem and goes on to argue that caches do not ultimately help because of compulsory cache-misses.  I aim to show that by using access patterns which display consideration for the cache hierarchy, this conclusion is not inevitable.

Let's start putting the problem in context with some examples.  Our hardware tries to hide the main-memory latency via a number of techniques.  Basically three major bets are taken on memory access patterns:
  1. Temporal: Memory accessed recently will likely be required again soon.
  2. Spatial: Adjacent memory is likely to be required soon. 
  3. Striding: Memory access is likely to follow a predictable pattern.
To illustrate these three bets in action let's write some code and measure the results.
  1. Walk through memory in a linear fashion being completely predictable.
  2. Pseudo randomly walk round memory within a restricted area then move on.  This restricted area is what is commonly known as an operating system page of memory.
  3. Pseudo randomly walk around a large area of the heap.
Code

The following code should be run with the -Xmx4g JVM option.
public class TestMemoryAccessPatterns
{
    private static final int LONG_SIZE = 8;
    private static final int PAGE_SIZE = 2 * 1024 * 1024;
    private static final int ONE_GIG = 1024 * 1024 * 1024;
    private static final long TWO_GIG = 2L * ONE_GIG;

    private static final int ARRAY_SIZE = (int)(TWO_GIG / LONG_SIZE);
    private static final int WORDS_PER_PAGE = PAGE_SIZE / LONG_SIZE;

    private static final int ARRAY_MASK = ARRAY_SIZE - 1;
    private static final int PAGE_MASK = WORDS_PER_PAGE - 1;

    private static final int PRIME_INC = 514229;

    private static final long[] memory = new long[ARRAY_SIZE];

    static
    {
        for (int i = 0; i < ARRAY_SIZE; i++)
        {
            memory[i] = 777;
        }
    }

    public enum StrideType
    {
        LINEAR_WALK
        {
            public int next(final int pageOffset, final int wordOffset, final int pos)
            {
                return (pos + 1) & ARRAY_MASK;
            }
        },

        RANDOM_PAGE_WALK
        {
            public int next(final int pageOffset, final int wordOffset, final int pos)
            {
                return pageOffset + ((pos + PRIME_INC) & PAGE_MASK);
            }
        },

        RANDOM_HEAP_WALK
        {
            public int next(final int pageOffset, final int wordOffset, final int pos)
            {
                return (pos + PRIME_INC) & ARRAY_MASK;
            }
        };

        public abstract int next(int pageOffset, int wordOffset, int pos);
    }

    public static void main(final String[] args)
    {
        final StrideType strideType;
        switch (Integer.parseInt(args[0]))
        {
            case 1:
                strideType = StrideType.LINEAR_WALK;
                break;

            case 2:
                strideType = StrideType.RANDOM_PAGE_WALK;
                break;

            case 3:
                strideType = StrideType.RANDOM_HEAP_WALK;
                break;

            default:
                throw new IllegalArgumentException("Unknown StrideType");
        }

        for (int i = 0; i < 5; i++)
        {
            perfTest(i, strideType);
        }
    }

    private static void perfTest(final int runNumber, final StrideType strideType)
    {
        final long start = System.nanoTime();

        int pos = -1;
        long result = 0;
        for (int pageOffset = 0; pageOffset < ARRAY_SIZE; pageOffset += WORDS_PER_PAGE)
        {
            for (int wordOffset = pageOffset, limit = pageOffset + WORDS_PER_PAGE;
                 wordOffset < limit;
                 wordOffset++)
            {
                pos = strideType.next(pageOffset, wordOffset, pos);
                result += memory[pos];
            }
        }

        final long duration = System.nanoTime() - start;
        final double nsOp = duration / (double)ARRAY_SIZE;

        if (208574349312L != result)
        {
            throw new IllegalStateException();
        }

        System.out.format("%d - %.2fns %s\n",
                          Integer.valueOf(runNumber),
                          Double.valueOf(nsOp),
                          strideType);
    }
}
Results
Intel U4100 @ 1.3GHz, 4GB RAM DDR2 800MHz, 
Windows 7 64-bit, Java 1.7.0_05
===========================================
0 - 2.38ns LINEAR_WALK
1 - 2.41ns LINEAR_WALK
2 - 2.35ns LINEAR_WALK
3 - 2.36ns LINEAR_WALK
4 - 2.39ns LINEAR_WALK

0 - 12.45ns RANDOM_PAGE_WALK
1 - 12.27ns RANDOM_PAGE_WALK
2 - 12.17ns RANDOM_PAGE_WALK
3 - 12.22ns RANDOM_PAGE_WALK
4 - 12.18ns RANDOM_PAGE_WALK

0 - 152.86ns RANDOM_HEAP_WALK
1 - 151.80ns RANDOM_HEAP_WALK
2 - 151.72ns RANDOM_HEAP_WALK
3 - 151.91ns RANDOM_HEAP_WALK
4 - 151.36ns RANDOM_HEAP_WALK

Intel i7-860 @ 2.8GHz, 8GB RAM DDR3 1333MHz, 
Windows 7 64-bit, Java 1.7.0_05
=============================================
0 - 1.06ns LINEAR_WALK
1 - 1.05ns LINEAR_WALK
2 - 0.98ns LINEAR_WALK
3 - 1.00ns LINEAR_WALK
4 - 1.00ns LINEAR_WALK

0 - 3.80ns RANDOM_PAGE_WALK
1 - 3.85ns RANDOM_PAGE_WALK
2 - 3.79ns RANDOM_PAGE_WALK
3 - 3.65ns RANDOM_PAGE_WALK
4 - 3.64ns RANDOM_PAGE_WALK

0 - 30.04ns RANDOM_HEAP_WALK
1 - 29.05ns RANDOM_HEAP_WALK
2 - 29.14ns RANDOM_HEAP_WALK
3 - 28.88ns RANDOM_HEAP_WALK
4 - 29.57ns RANDOM_HEAP_WALK

Intel i7-2760QM @ 2.40GHz, 8GB RAM DDR3 1600MHz, 
Linux 3.4.6 kernel 64-bit, Java 1.7.0_05
=================================================
0 - 0.91ns LINEAR_WALK
1 - 0.92ns LINEAR_WALK
2 - 0.88ns LINEAR_WALK
3 - 0.89ns LINEAR_WALK
4 - 0.89ns LINEAR_WALK

0 - 3.29ns RANDOM_PAGE_WALK
1 - 3.35ns RANDOM_PAGE_WALK
2 - 3.33ns RANDOM_PAGE_WALK
3 - 3.31ns RANDOM_PAGE_WALK
4 - 3.30ns RANDOM_PAGE_WALK

0 - 9.58ns RANDOM_HEAP_WALK
1 - 9.20ns RANDOM_HEAP_WALK
2 - 9.44ns RANDOM_HEAP_WALK
3 - 9.46ns RANDOM_HEAP_WALK
4 - 9.47ns RANDOM_HEAP_WALK
Analysis

I ran the code on 3 different CPU architectures illustrating generational steps forward for Intel.  It is clear from the results that each generation has become progressively better at hiding the latency to main-memory based on the 3 bets described above for a relatively small heap.  This is because the size and sophistication of various caches keep improving.  However as memory size increases they become less effective.  For example, if the array is doubled to be 4GB in size, then the average latency increases from ~30ns to ~55ns for the i7-860 doing the random heap walk.

It seems that for the linear walk case, memory latency does not exist.  However as we walk around memory in an evermore random pattern then the latency starts to become very apparent.

The random heap walk produced an interesting result.  This is a our worst case scenario, and given the hardware specifications of these systems, we could be looking at 150ns, 65ns, and 75ns for the above tests respectively based on memory controller and memory module latencies.  For the Nehalem (i7-860) I can further subvert the cache sub-system by using a 4GB array resulting in ~55ns on average per iteration.  The i7-2760QM has larger load buffers, TLB caches, and Linux is running with transparent huge pages which are all working to further hide the latency.  By playing with different prime numbers for the stride, results can vary wildly depending on processor type, e.g. try PRIME_INC = 39916801 for Nehalem.  I'd like to test this on a much larger heap with Sandy Bridge.

The main take away is the more predictable the pattern of access to memory, then the better the cache sub-systems are at hiding main-memory latency.  Let's look at these cache sub-systems in a little detail to try and understand the observed results.

Hardware Components

We have many layers of cache plus the pre-fetchers to consider for how latency gets hidden.  In this section I'll try and cover the major components used to hide latency that our hardware and systems software friends have put in place.  We will investigate these latency hiding components and use the Linux perf and Lightweight Performance Counters utilities to retrieve the performance counters from our CPUs which tell how effective these components are when we execute our programs.  Performance counters are CPU specific and what I've used here are specific to Sandy Bridge.

Data Caches
Processors typically have 2 or 3 layers of data cache.  Each layer as we move out is progressively larger with increasing latency.  The latest Intel processors have 3 layers (L1D, L2, and L3); with sizes 32KB, 256KB, and 4-30MB; and ~1ns, ~4ns, and ~15ns latency respectively for a 3.0GHz CPU.

Data caches are effectively hardware hash tables with a fixed number of slots for each hash value.  These slots are known as "ways".  An 8-way associative cache will have 8 slots to hold values for addresses that hash to the same cache location.  Within these slots the data caches do not store words, they store cache-lines of multiple words.  For an Intel processor these cache-lines are typically 64-bytes, that is 8 words on a 64-bit machine.  This plays to the spatial bet that adjacent memory is likely to be required soon, which is typically the case if we think of arrays or fields of an object.

Data caches are typically evicted in a LRU manner.  Caches work by using a write-back algorithm were stores need only be propagated to main-memory when a modified cache-line is evicted.  This gives rise the the interesting phenomenon that a load can cause a write-back to the outer cache layers and eventually to main-memory.
perf stat -e L1-dcache-loads,L1-dcache-load-misses java -Xmx4g TestMemoryAccessPatterns $

 Performance counter stats for 'java -Xmx4g TestMemoryAccessPatterns 1':
     1,496,626,053 L1-dcache-loads                                            
       274,255,164 L1-dcache-misses
         #   18.32% of all L1-dcache hits

 Performance counter stats for 'java -Xmx4g TestMemoryAccessPatterns 2':
     1,537,057,965 L1-dcache-loads                                            
     1,570,105,933 L1-dcache-misses
         #  102.15% of all L1-dcache hits 

 Performance counter stats for 'java -Xmx4g TestMemoryAccessPatterns 3':
     4,321,888,497 L1-dcache-loads                                           
     1,780,223,433 L1-dcache-misses
         #   41.19% of all L1-dcache hits  

likwid-perfctr -C 2 -g L2CACHE java -Xmx4g TestMemoryAccessPatterns $

java -Xmx4g TestMemoryAccessPatterns 1
+-----------------------+-------------+
|         Event         |   core 2    |
+-----------------------+-------------+
|   INSTR_RETIRED_ANY   | 5.94918e+09 |
| CPU_CLK_UNHALTED_CORE | 5.15969e+09 |
| L2_TRANS_ALL_REQUESTS | 1.07252e+09 |
|     L2_RQSTS_MISS     | 3.25413e+08 |
+-----------------------+-------------+
+-----------------+-----------+
|     Metric      |  core 2   |
+-----------------+-----------+
|   Runtime [s]   |  2.15481  |
|       CPI       | 0.867293  |
| L2 request rate |  0.18028  |
|  L2 miss rate   | 0.0546988 |
|  L2 miss ratio  | 0.303409  |
+-----------------+-----------+
+------------------------+-------------+
|         Event          |   core 2    |
+------------------------+-------------+
| L3_LAT_CACHE_REFERENCE | 1.26545e+08 |
|   L3_LAT_CACHE_MISS    | 2.59059e+07 |
+------------------------+-------------+

java -Xmx4g TestMemoryAccessPatterns 2
+-----------------------+-------------+
|         Event         |   core 2    |
+-----------------------+-------------+
|   INSTR_RETIRED_ANY   | 1.48772e+10 |
| CPU_CLK_UNHALTED_CORE | 1.64712e+10 |
| L2_TRANS_ALL_REQUESTS | 3.41061e+09 |
|     L2_RQSTS_MISS     | 1.5547e+09  |
+-----------------------+-------------+
+-----------------+----------+
|     Metric      |  core 2  |
+-----------------+----------+
|   Runtime [s]   | 6.87876  |
|       CPI       | 1.10714  |
| L2 request rate | 0.22925  |
|  L2 miss rate   | 0.104502 |
|  L2 miss ratio  | 0.455843 |
+-----------------+----------+
+------------------------+-------------+
|         Event          |   core 2    |
+------------------------+-------------+
| L3_LAT_CACHE_REFERENCE | 1.52088e+09 |
|   L3_LAT_CACHE_MISS    | 1.72918e+08 |
+------------------------+-------------+

java -Xmx4g TestMemoryAccessPatterns 3
+-----------------------+-------------+
|         Event         |   core 2    |
+-----------------------+-------------+
|   INSTR_RETIRED_ANY   | 6.49533e+09 |
| CPU_CLK_UNHALTED_CORE | 4.18416e+10 |
| L2_TRANS_ALL_REQUESTS | 4.67488e+09 |
|     L2_RQSTS_MISS     | 1.43442e+09 |
+-----------------------+-------------+
+-----------------+----------+
|     Metric      |  core 2  |
+-----------------+----------+
|   Runtime [s]   |  17.474  |
|       CPI       |  6.4418  |
| L2 request rate | 0.71973  |
|  L2 miss rate   | 0.220838 |
|  L2 miss ratio  | 0.306835 |
+-----------------+----------+
+------------------------+-------------+
|         Event          |   core 2    |
+------------------------+-------------+
| L3_LAT_CACHE_REFERENCE | 1.40079e+09 |
|   L3_LAT_CACHE_MISS    | 1.34832e+09 |
+------------------------+-------------+
Note: The cache-miss rate of the combined L1D, L2 and L3 increases significantly as the pattern of access becomes more random.

Translation Lookaside Buffers (TLBs)
Our programs deal with virtual memory addresses that need to be translated to physical memory addresses.  Virtual memory systems do this by mapping pages.  We need to know the offset for a given page and its size for any memory operation.  Typically page sizes are 4KB and are gradually moving to 2MB and greater.  Linux introduced Transparent Huge Pages in the 2.6.38 kernel giving us 2MB pages.  The translation of virtual memory pages to physical pages is maintained by the page table.  This translation can require multiple accesses to the page table which is a huge performance penalty.  To accelerate this lookup, processors have a small hardware cache at each cache level called the TLB cache.  A miss on the TLB cache can be hugely expensive because the page table may not be in a nearby data cache.  By moving to larger pages, a TLB cache can cover a larger address range for the same number of entries.
perf stat -e dTLB-loads,dTLB-load-misses java -Xmx4g TestMemoryAccessPatterns $
 
 Performance counter stats for 'java -Xmx4g TestMemoryAccessPatterns 1':
     1,496,128,634 dTLB-loads
           310,901 dTLB-misses
              #    0.02% of all dTLB cache hits 

 Performance counter stats for 'java -Xmx4g TestMemoryAccessPatterns 2':
     1,551,585,263 dTLB-loads
           340,230 dTLB-misses
              #    0.02% of all dTLB cache hits

 Performance counter stats for 'java -Xmx4g TestMemoryAccessPatterns 3':
     4,031,344,537 dTLB-loads
     1,345,807,418 dTLB-misses
              #   33.38% of all dTLB cache hits  
Note: We only incur significant TLB misses when randomly walking the whole heap when huge pages are employed.

Hardware Pre-Fetchers
Hardware will try and predict the next memory access our programs will make and speculatively load that memory into fill buffers.  This is done at it simplest level by pre-loading adjacent cache-lines for the spatial bet, or by recognising regular stride based access patterns, typically less than 2KB in stride length.  The tests below we are measuring the number of loads that hit a fill buffer from a hardware pre-fetch.
likwid-perfctr -C 2 -g LOAD_HIT_PRE_HW_PF:PMC0 java -Xmx4g TestMemoryAccessPatterns $

java -Xmx4g TestMemoryAccessPatterns 1
+--------------------+-------------+
|       Event        |   core 2    |
+--------------------+-------------+
| LOAD_HIT_PRE_HW_PF | 1.31613e+09 |
+--------------------+-------------+

java -Xmx4g TestMemoryAccessPatterns 2
+--------------------+--------+
|       Event        | core 2 |
+--------------------+--------+
| LOAD_HIT_PRE_HW_PF | 368930 |
+--------------------+--------+

java -Xmx4g TestMemoryAccessPatterns 3
+--------------------+--------+
|       Event        | core 2 |
+--------------------+--------+
| LOAD_HIT_PRE_HW_PF | 324373 |
+--------------------+--------+
Note: We have a significant success rate for load hits with the pre-fetcher on the linear walk.

Memory Controllers and Row Buffers
Beyond our last level cache (LLC) sits the memory controllers that manage access to the SDRAM banks.  Memory is organised into rows and columns.  To access an address, first the row address must be selected (RAS), then the column address is selected (CAS) within that row to get the word.  The row is typically a page in size and loaded into a row buffer.  Even at this stage the hardware is still helping hide the latency.  A queue of memory access requests is maintained and re-ordered so that multiple words can be fetched from the same row if possible.

Non-Uniform Memory Access (NUMA)
Systems now have memory controllers on the CPU socket.  This move to on-socket memory controllers gave an ~50ns latency reduction over existing front side bus (FSB) and external Northbridge memory controllers.  Systems with multiple sockets employ memory interconnects, QPI from Intel, which are used when one CPU wants to access memory managed by another CPU socket.  The presence of these interconnects gives rise to the non-uniform nature of server memory access.  In a 2-socket system memory may be local or 1 hop away.  On a 8-socket system memory can be up to 3 hops away, were each hop adds 20ns latency in each direction.

What does this mean for algorithms?

The difference between an L1D cache-hit, and a full miss resulting in main-memory access, is 2 orders of magnitude; i.e. <1ns vs. 65-100ns.  If algorithms randomly walk around our ever increasing address spaces, then we are less likely to benefit from the hardware support that hides this latency.

Is there anything we can do about this when designing algorithms and data-structures?  Yes there is a lot we can do.  If we perform chunks of work on data that is co-located, and we stride around memory in a predictable fashion, then our algorithms can be many times faster.  For example rather than using bucket and chain hash tables, like in the JDK, we can employ hash tables using open-addressing with linear-probing.  Rather than using linked-lists or trees with single items in each node, we can store an array of many items in each node.

Research is advancing on algorithmic approaches that work in harmony with cache sub-systems.  One area I find fascinating is Cache Oblivious Algorithms.  The name is a bit misleading but there are some great concepts here for how to improve software performance and better execute in parallel.  This article is a great illustration of the performance benefits that can be gained.

Conclusion

To achieve great performance it is important to have sympathy for the cache sub-systems.  We have seen in this article what can be achieved by accessing memory in patterns which work with, rather than against, these caches.  When designing algorithms and data structures, it is now vitally important to consider cache-misses, probably even more so than counting steps in the algorithm.  This is not what we were taught in algorithm theory when studying computer science.  The last decade has seen some fundamental changes in technology.  For me the two most significant are the rise of multi-core, and now big-memory systems with 64-bit address spaces.

One thing is certain, if we want software to execute faster and scale better, we need to make better use of the many cores in our CPUs, and pay attention to memory access patterns.

Update: 06-August-2012
Trying to design a random walk algorithm for all processors and memory sizes is tricky.  If I use the algorithm below then my Sandy Bridge processor is slower but the Nehalem is faster.  The point is performance will be very unpredictable when you walk around memory in a random fashion.  I've also included the L3 cache counters for more detail in all the tests.
    private static final long LARGE_PRIME_INC = 70368760954879L;

        RANDOM_HEAP_WALK
        {
            public int next(final int pageOffset, final int wordOffset, final int pos)
            {
                return (int)(pos + LARGE_PRIME_INC) & ARRAY_MASK;
            }
        };
Intel i7-2760QM @ 2.40GHz, 8GB RAM DDR3 1600MHz, 
Linux 3.4.6 kernel 64-bit, Java 1.7.0_05
=================================================
0 - 29.06ns RANDOM_HEAP_WALK
1 - 29.47ns RANDOM_HEAP_WALK
2 - 29.48ns RANDOM_HEAP_WALK
3 - 29.43ns RANDOM_HEAP_WALK
4 - 29.42ns RANDOM_HEAP_WALK

 Performance counter stats for 'java -Xmx4g TestMemoryAccessPatterns 3':
     9,444,928,682 dTLB-loads
     4,371,982,327 dTLB-misses
         #   46.29% of all dTLB cache hits 

     9,390,675,639 L1-dcache-loads
     1,471,647,016 L1-dcache-misses
         #   15.67% of all L1-dcache hits  

+-----------------------+-------------+
|         Event         |   core 2    |
+-----------------------+-------------+
|   INSTR_RETIRED_ANY   | 7.71171e+09 |
| CPU_CLK_UNHALTED_CORE | 1.31717e+11 |
| L2_TRANS_ALL_REQUESTS | 8.4912e+09  |
|     L2_RQSTS_MISS     | 2.79635e+09 |
+-----------------------+-------------+
+-----------------+----------+
|     Metric      |  core 2  |
+-----------------+----------+
|   Runtime [s]   | 55.0094  |
|       CPI       | 17.0801  |
| L2 request rate | 1.10108  |
|  L2 miss rate   | 0.362611 |
|  L2 miss ratio  | 0.329324 |
+-----------------+----------+
+--------------------+-------------+
|       Event        |   core 2    |
+--------------------+-------------+
| LOAD_HIT_PRE_HW_PF | 3.59509e+06 |
+--------------------+-------------+
+------------------------+-------------+
|        Event           |   core 2    |
+------------------------+-------------+
| L3_LAT_CACHE_REFERENCE | 1.30318e+09 |
| L3_LAT_CACHE_MISS      | 2.62346e+07 |
+------------------------+-------------+