In high-performance computing it is often said that the cost of a cache-miss is the largest performance penalty for an algorithm. For many years the increase in speed of our processors has greatly outstripped latency gains to main-memory. Bandwidth to main-memory has greatly increased via wider and multi-channel buses; however, the latency has not significantly reduced. To hide this latency our processors employ ever more complex cache sub-systems that have many layers.
The 1994 paper "Hitting the memory wall: implications of the obvious" describes the problem and goes on to argue that caches do not ultimately help because of compulsory cache-misses. I aim to show that by using access patterns which display consideration for the cache hierarchy, this conclusion is not inevitable.
Let's start putting the problem in context with some examples. Our hardware tries to hide the main-memory latency via a number of techniques. Basically three major bets are taken on memory access patterns:
- Temporal: Memory accessed recently will likely be required again soon.
- Spatial: Adjacent memory is likely to be required soon.
- Striding: Memory access is likely to follow a predictable pattern.
To illustrate these three bets in action let's write some code and measure the results.
- Walk through memory in a linear fashion being completely predictable.
- Pseudo randomly walk round memory within a restricted area then move on. This restricted area is what is commonly known as an operating system page of memory.
- Pseudo randomly walk around a large area of the heap.
Code
The following code should be run with the -Xmx4g JVM option.
public class TestMemoryAccessPatterns
{
    private static final int LONG_SIZE = 8;
    private static final int PAGE_SIZE = 2 * 1024 * 1024;
    private static final int ONE_GIG = 1024 * 1024 * 1024;
    private static final long TWO_GIG = 2L * ONE_GIG;
    private static final int ARRAY_SIZE = (int)(TWO_GIG / LONG_SIZE);
    private static final int WORDS_PER_PAGE = PAGE_SIZE / LONG_SIZE;
    private static final int ARRAY_MASK = ARRAY_SIZE - 1;
    private static final int PAGE_MASK = WORDS_PER_PAGE - 1;
    private static final int PRIME_INC = 514229;
    private static final long[] memory = new long[ARRAY_SIZE];

    static
    {
        for (int i = 0; i < ARRAY_SIZE; i++)
        {
            memory[i] = 777;
        }
    }

    public enum StrideType
    {
        LINEAR_WALK
        {
            public int next(final int pageOffset, final int wordOffset, final int pos)
            {
                return (pos + 1) & ARRAY_MASK;
            }
        },

        RANDOM_PAGE_WALK
        {
            public int next(final int pageOffset, final int wordOffset, final int pos)
            {
                return pageOffset + ((pos + PRIME_INC) & PAGE_MASK);
            }
        },

        RANDOM_HEAP_WALK
        {
            public int next(final int pageOffset, final int wordOffset, final int pos)
            {
                return (pos + PRIME_INC) & ARRAY_MASK;
            }
        };

        public abstract int next(int pageOffset, int wordOffset, int pos);
    }

    public static void main(final String[] args)
    {
        final StrideType strideType;
        switch (Integer.parseInt(args[0]))
        {
            case 1:
                strideType = StrideType.LINEAR_WALK;
                break;

            case 2:
                strideType = StrideType.RANDOM_PAGE_WALK;
                break;

            case 3:
                strideType = StrideType.RANDOM_HEAP_WALK;
                break;

            default:
                throw new IllegalArgumentException("Unknown StrideType");
        }

        for (int i = 0; i < 5; i++)
        {
            perfTest(i, strideType);
        }
    }

    private static void perfTest(final int runNumber, final StrideType strideType)
    {
        final long start = System.nanoTime();

        int pos = -1;
        long result = 0;
        for (int pageOffset = 0; pageOffset < ARRAY_SIZE; pageOffset += WORDS_PER_PAGE)
        {
            for (int wordOffset = pageOffset, limit = pageOffset + WORDS_PER_PAGE;
                 wordOffset < limit;
                 wordOffset++)
            {
                pos = strideType.next(pageOffset, wordOffset, pos);
                result += memory[pos];
            }
        }

        final long duration = System.nanoTime() - start;
        final double nsOp = duration / (double)ARRAY_SIZE;

        if (208574349312L != result)
        {
            throw new IllegalStateException();
        }

        System.out.format("%d - %.2fns %s\n",
                          Integer.valueOf(runNumber),
                          Double.valueOf(nsOp),
                          strideType);
    }
}
Results
Intel U4100 @ 1.3GHz, 4GB RAM DDR2 800MHz, Windows 7 64-bit, Java 1.7.0_05
===========================================
0 - 2.38ns LINEAR_WALK
1 - 2.41ns LINEAR_WALK
2 - 2.35ns LINEAR_WALK
3 - 2.36ns LINEAR_WALK
4 - 2.39ns LINEAR_WALK
0 - 12.45ns RANDOM_PAGE_WALK
1 - 12.27ns RANDOM_PAGE_WALK
2 - 12.17ns RANDOM_PAGE_WALK
3 - 12.22ns RANDOM_PAGE_WALK
4 - 12.18ns RANDOM_PAGE_WALK
0 - 152.86ns RANDOM_HEAP_WALK
1 - 151.80ns RANDOM_HEAP_WALK
2 - 151.72ns RANDOM_HEAP_WALK
3 - 151.91ns RANDOM_HEAP_WALK
4 - 151.36ns RANDOM_HEAP_WALK

Intel i7-860 @ 2.8GHz, 8GB RAM DDR3 1333MHz, Windows 7 64-bit, Java 1.7.0_05
=============================================
0 - 1.06ns LINEAR_WALK
1 - 1.05ns LINEAR_WALK
2 - 0.98ns LINEAR_WALK
3 - 1.00ns LINEAR_WALK
4 - 1.00ns LINEAR_WALK
0 - 3.80ns RANDOM_PAGE_WALK
1 - 3.85ns RANDOM_PAGE_WALK
2 - 3.79ns RANDOM_PAGE_WALK
3 - 3.65ns RANDOM_PAGE_WALK
4 - 3.64ns RANDOM_PAGE_WALK
0 - 30.04ns RANDOM_HEAP_WALK
1 - 29.05ns RANDOM_HEAP_WALK
2 - 29.14ns RANDOM_HEAP_WALK
3 - 28.88ns RANDOM_HEAP_WALK
4 - 29.57ns RANDOM_HEAP_WALK

Intel i7-2760QM @ 2.40GHz, 8GB RAM DDR3 1600MHz, Linux 3.4.6 kernel 64-bit, Java 1.7.0_05
=================================================
0 - 0.91ns LINEAR_WALK
1 - 0.92ns LINEAR_WALK
2 - 0.88ns LINEAR_WALK
3 - 0.89ns LINEAR_WALK
4 - 0.89ns LINEAR_WALK
0 - 3.29ns RANDOM_PAGE_WALK
1 - 3.35ns RANDOM_PAGE_WALK
2 - 3.33ns RANDOM_PAGE_WALK
3 - 3.31ns RANDOM_PAGE_WALK
4 - 3.30ns RANDOM_PAGE_WALK
0 - 9.58ns RANDOM_HEAP_WALK
1 - 9.20ns RANDOM_HEAP_WALK
2 - 9.44ns RANDOM_HEAP_WALK
3 - 9.46ns RANDOM_HEAP_WALK
4 - 9.47ns RANDOM_HEAP_WALK
Analysis
I ran the code on 3 different CPU architectures, illustrating generational steps forward for Intel. It is clear from the results that each generation has become progressively better at hiding main-memory latency, based on the 3 bets described above, for a relatively small heap. This is because the size and sophistication of the various caches keep improving. However, as the amount of memory accessed increases they become less effective. For example, if the array is doubled to be 4GB in size, then the average latency increases from ~30ns to ~55ns for the i7-860 doing the random heap walk.
It seems that for the linear walk case, memory latency does not exist. However, as we walk around memory in an ever more random pattern, the latency starts to become very apparent.
The random heap walk produced an interesting result. This is our worst-case scenario, and given the hardware specifications of these systems, we could be looking at 150ns, 65ns, and 75ns for the above tests respectively, based on memory controller and memory module latencies. For the Nehalem (i7-860) I can further subvert the cache sub-system by using a 4GB array, resulting in ~55ns on average per iteration. The i7-2760QM has larger load buffers and TLB caches, and Linux is running with transparent huge pages, all of which work to further hide the latency. By playing with different prime numbers for the stride, results can vary wildly depending on processor type, e.g. try PRIME_INC = 39916801 for Nehalem. I'd like to test this on a much larger heap with Sandy Bridge.
The main takeaway is that the more predictable the pattern of access to memory, the better the cache sub-systems are at hiding main-memory latency. Let's look at these cache sub-systems in a little detail to try and understand the observed results.
Hardware Components
We have many layers of cache, plus the pre-fetchers, to consider for how latency gets hidden. In this section I'll try and cover the major components used to hide latency that our hardware and systems software friends have put in place. We will investigate these latency-hiding components and use the Linux perf and Lightweight Performance Counters (likwid) utilities to retrieve the performance counters from our CPUs, which tell us how effective these components are when we execute our programs. Performance counters are CPU specific, and what I've used here is specific to Sandy Bridge.
Data Caches
Processors typically have 2 or 3 layers of data cache. Each layer as we move out is progressively larger with increasing latency. The latest Intel processors have 3 layers (L1D, L2, and L3); with sizes 32KB, 256KB, and 4-30MB; and ~1ns, ~4ns, and ~15ns latency respectively for a 3.0GHz CPU.
Data caches are effectively hardware hash tables with a fixed number of slots for each hash value. These slots are known as "ways". An 8-way associative cache will have 8 slots to hold values for addresses that hash to the same cache location. Within these slots the data caches do not store words; they store cache-lines of multiple words. For an Intel processor these cache-lines are typically 64 bytes, that is 8 words on a 64-bit machine. This plays to the spatial bet that adjacent memory is likely to be required soon, which is typically the case if we think of arrays or fields of an object.
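To make the set and way arithmetic concrete, here is a minimal sketch that maps an address to a cache set for a 32KB, 8-way, 64-byte-line L1D as described above. Treat it as illustrative arithmetic only: real hardware may index differently, and the class and method names are my own.

// A minimal sketch of how a set-associative L1D might map an address to a set.
// Assumes a 32KB, 8-way cache with 64-byte lines (the figures from the text above);
// real hardware indexing can differ, so this is illustrative arithmetic only.
public final class CacheSetIndex
{
    private static final int CACHE_SIZE = 32 * 1024;                // 32KB L1D
    private static final int LINE_SIZE = 64;                        // bytes per cache-line
    private static final int WAYS = 8;                              // slots per set
    private static final int SETS = CACHE_SIZE / (LINE_SIZE * WAYS); // 64 sets

    public static int setIndexFor(final long address)
    {
        final long lineNumber = address / LINE_SIZE;  // which cache-line the address falls in
        return (int)(lineNumber % SETS);              // which set that line hashes to
    }

    public static void main(final String[] args)
    {
        // Addresses 64 sets * 64 bytes = 4KB apart land in the same set, so more
        // than 8 "hot" lines at that spacing start evicting each other even though
        // the cache as a whole is far from full.
        System.out.println(setIndexFor(0x1000));
        System.out.println(setIndexFor(0x1000 + (SETS * LINE_SIZE)));
    }
}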
Data caches are typically evicted in an LRU manner. Caches work by using a write-back algorithm where stores need only be propagated to main-memory when a modified cache-line is evicted. This gives rise to the interesting phenomenon that a load can cause a write-back to the outer cache layers and eventually to main-memory.
perf stat -e L1-dcache-loads,L1-dcache-load-misses java -Xmx4g TestMemoryAccessPatterns $

Performance counter stats for 'java -Xmx4g TestMemoryAccessPatterns 1':
  1,496,626,053 L1-dcache-loads
    274,255,164 L1-dcache-misses   #  18.32% of all L1-dcache hits

Performance counter stats for 'java -Xmx4g TestMemoryAccessPatterns 2':
  1,537,057,965 L1-dcache-loads
  1,570,105,933 L1-dcache-misses   # 102.15% of all L1-dcache hits

Performance counter stats for 'java -Xmx4g TestMemoryAccessPatterns 3':
  4,321,888,497 L1-dcache-loads
  1,780,223,433 L1-dcache-misses   #  41.19% of all L1-dcache hits

likwid-perfctr -C 2 -g L2CACHE java -Xmx4g TestMemoryAccessPatterns $

java -Xmx4g TestMemoryAccessPatterns 1
+-----------------------+-------------+
|         Event         |   core 2    |
+-----------------------+-------------+
| INSTR_RETIRED_ANY     | 5.94918e+09 |
| CPU_CLK_UNHALTED_CORE | 5.15969e+09 |
| L2_TRANS_ALL_REQUESTS | 1.07252e+09 |
| L2_RQSTS_MISS         | 3.25413e+08 |
+-----------------------+-------------+
+-----------------+-----------+
|     Metric      |  core 2   |
+-----------------+-----------+
| Runtime [s]     | 2.15481   |
| CPI             | 0.867293  |
| L2 request rate | 0.18028   |
| L2 miss rate    | 0.0546988 |
| L2 miss ratio   | 0.303409  |
+-----------------+-----------+
+------------------------+-------------+
|         Event          |   core 2    |
+------------------------+-------------+
| L3_LAT_CACHE_REFERENCE | 1.26545e+08 |
| L3_LAT_CACHE_MISS      | 2.59059e+07 |
+------------------------+-------------+

java -Xmx4g TestMemoryAccessPatterns 2
+-----------------------+-------------+
|         Event         |   core 2    |
+-----------------------+-------------+
| INSTR_RETIRED_ANY     | 1.48772e+10 |
| CPU_CLK_UNHALTED_CORE | 1.64712e+10 |
| L2_TRANS_ALL_REQUESTS | 3.41061e+09 |
| L2_RQSTS_MISS         | 1.5547e+09  |
+-----------------------+-------------+
+-----------------+----------+
|     Metric      |  core 2  |
+-----------------+----------+
| Runtime [s]     | 6.87876  |
| CPI             | 1.10714  |
| L2 request rate | 0.22925  |
| L2 miss rate    | 0.104502 |
| L2 miss ratio   | 0.455843 |
+-----------------+----------+
+------------------------+-------------+
|         Event          |   core 2    |
+------------------------+-------------+
| L3_LAT_CACHE_REFERENCE | 1.52088e+09 |
| L3_LAT_CACHE_MISS      | 1.72918e+08 |
+------------------------+-------------+

java -Xmx4g TestMemoryAccessPatterns 3
+-----------------------+-------------+
|         Event         |   core 2    |
+-----------------------+-------------+
| INSTR_RETIRED_ANY     | 6.49533e+09 |
| CPU_CLK_UNHALTED_CORE | 4.18416e+10 |
| L2_TRANS_ALL_REQUESTS | 4.67488e+09 |
| L2_RQSTS_MISS         | 1.43442e+09 |
+-----------------------+-------------+
+-----------------+----------+
|     Metric      |  core 2  |
+-----------------+----------+
| Runtime [s]     | 17.474   |
| CPI             | 6.4418   |
| L2 request rate | 0.71973  |
| L2 miss rate    | 0.220838 |
| L2 miss ratio   | 0.306835 |
+-----------------+----------+
+------------------------+-------------+
|         Event          |   core 2    |
+------------------------+-------------+
| L3_LAT_CACHE_REFERENCE | 1.40079e+09 |
| L3_LAT_CACHE_MISS      | 1.34832e+09 |
+------------------------+-------------+

Note: The cache-miss rate of the combined L1D, L2 and L3 increases significantly as the pattern of access becomes more random.
Translation Lookaside Buffers (TLBs)
Our programs deal with virtual memory addresses that need to be translated to physical memory addresses. Virtual memory systems do this by mapping pages. We need to know the offset for a given page and its size for any memory operation. Typically page sizes are 4KB and are gradually moving to 2MB and greater. Linux introduced Transparent Huge Pages in the 2.6.38 kernel giving us 2MB pages. The translation of virtual memory pages to physical pages is maintained by the page table. This translation can require multiple accesses to the page table which is a huge performance penalty. To accelerate this lookup, processors have a small hardware cache at each cache level called the TLB cache. A miss on the TLB cache can be hugely expensive because the page table may not be in a nearby data cache. By moving to larger pages, a TLB cache can cover a larger address range for the same number of entries.
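As a back-of-the-envelope illustration of why larger pages help, the sketch below computes TLB "reach" for an assumed 64-entry TLB. The entry count is an illustrative assumption, not a claim about any particular processor.

// A back-of-the-envelope sketch of TLB "reach": how much address space a TLB
// can cover before misses start. The 64-entry figure is an assumption for
// illustration only.
public final class TlbReach
{
    public static void main(final String[] args)
    {
        final int tlbEntries = 64;               // assumed number of TLB entries
        final long smallPage = 4L * 1024;        // 4KB page
        final long hugePage = 2L * 1024 * 1024;  // 2MB transparent huge page

        System.out.printf("4KB pages cover %d KB%n", (tlbEntries * smallPage) / 1024);
        System.out.printf("2MB pages cover %d MB%n", (tlbEntries * hugePage) / (1024 * 1024));
    }
}

With those assumed numbers, 4KB pages cover only 256KB of address space before TLB misses begin, whereas 2MB pages cover 128MB, which is why the random page walk stays cheap while the random heap walk does not.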
perf stat -e dTLB-loads,dTLB-load-misses java -Xmx4g TestMemoryAccessPatterns $

Performance counter stats for 'java -Xmx4g TestMemoryAccessPatterns 1':
  1,496,128,634 dTLB-loads
        310,901 dTLB-misses   #  0.02% of all dTLB cache hits

Performance counter stats for 'java -Xmx4g TestMemoryAccessPatterns 2':
  1,551,585,263 dTLB-loads
        340,230 dTLB-misses   #  0.02% of all dTLB cache hits

Performance counter stats for 'java -Xmx4g TestMemoryAccessPatterns 3':
  4,031,344,537 dTLB-loads
  1,345,807,418 dTLB-misses   # 33.38% of all dTLB cache hits

Note: We only incur significant TLB misses when randomly walking the whole heap when huge pages are employed.
Hardware Pre-Fetchers
Hardware will try and predict the next memory access our programs will make and speculatively load that memory into fill buffers. This is done at its simplest level by pre-loading adjacent cache-lines for the spatial bet, or by recognising regular stride-based access patterns, typically less than 2KB in stride length. In the tests below we are measuring the number of loads that hit a fill buffer from a hardware pre-fetch.
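As a rough illustration of the stride bet, the sketch below walks the same array with a short constant stride, well inside the ~2KB window mentioned above, and then with a much larger stride that defeats the stride detector. It is a crude micro-benchmark of my own with no warm-up handling, so absolute numbers will vary wildly by processor and JVM; only the relative difference is of interest. Run it with a heap large enough for the array, e.g. -Xmx1g.

// A small sketch contrasting a short constant stride, which a stride pre-fetcher
// can typically follow, with a much larger stride that defeats it. Illustrative
// only: no warm-up, and timings will differ greatly between machines.
public final class StridePrefetchDemo
{
    private static final long[] data = new long[64 * 1024 * 1024]; // 512MB of longs

    public static void main(final String[] args)
    {
        timeWalk(16);    // 16 longs = 128 bytes per step, well under 2KB
        timeWalk(1024);  // 1024 longs = 8KB per step, beyond the stride window
    }

    private static void timeWalk(final int stride)
    {
        long sum = 0;
        final long start = System.nanoTime();
        for (int i = 0; i < data.length; i += stride)
        {
            sum += data[i];
        }
        final long duration = System.nanoTime() - start;
        final int elementsVisited = data.length / stride;

        System.out.printf("stride=%d longs: %.2fns per element visited (sum=%d)%n",
                          stride, (double)duration / elementsVisited, sum);
    }
}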
likwid-perfctr -C 2 -g LOAD_HIT_PRE_HW_PF:PMC0 java -Xmx4g TestMemoryAccessPatterns $

java -Xmx4g TestMemoryAccessPatterns 1
+--------------------+-------------+
|       Event        |   core 2    |
+--------------------+-------------+
| LOAD_HIT_PRE_HW_PF | 1.31613e+09 |
+--------------------+-------------+

java -Xmx4g TestMemoryAccessPatterns 2
+--------------------+--------+
|       Event        | core 2 |
+--------------------+--------+
| LOAD_HIT_PRE_HW_PF | 368930 |
+--------------------+--------+

java -Xmx4g TestMemoryAccessPatterns 3
+--------------------+--------+
|       Event        | core 2 |
+--------------------+--------+
| LOAD_HIT_PRE_HW_PF | 324373 |
+--------------------+--------+

Note: We have a significant success rate for load hits with the pre-fetcher on the linear walk.
Memory Controllers and Row Buffers
Beyond our last level cache (LLC) sit the memory controllers that manage access to the SDRAM banks. Memory is organised into rows and columns. To access an address, first the row address must be selected (RAS), then the column address is selected (CAS) within that row to get the word. The row is typically a page in size and is loaded into a row buffer. Even at this stage the hardware is still helping hide the latency. A queue of memory access requests is maintained and re-ordered so that multiple words can be fetched from the same row if possible.
Non-Uniform Memory Access (NUMA)
Systems now have memory controllers on the CPU socket. This move to on-socket memory controllers gave an ~50ns latency reduction over the previous front side bus (FSB) and external Northbridge memory controllers. Systems with multiple sockets employ memory interconnects, such as QPI from Intel, which are used when one CPU wants to access memory managed by another CPU socket. The presence of these interconnects gives rise to the non-uniform nature of server memory access. In a 2-socket system memory may be local or 1 hop away. On an 8-socket system memory can be up to 3 hops away, where each hop adds ~20ns latency in each direction.
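On a multi-socket box, one way to observe the local versus remote difference is to pin the random heap walk to one node and bind its memory either locally or to the other node. This is a sketch only: it assumes a 2-socket Linux system with numactl installed, and node numbering is machine specific.

numactl --cpunodebind=0 --membind=0 java -Xmx4g TestMemoryAccessPatterns 3
numactl --cpunodebind=0 --membind=1 java -Xmx4g TestMemoryAccessPatterns 3

The second run forces every access that misses the caches to cross the interconnect, so the gap between the two results approximates the hop penalty described above.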
What does this mean for algorithms?
The difference between an L1D cache-hit, and a full miss resulting in main-memory access, is 2 orders of magnitude; i.e. <1ns vs. 65-100ns. If algorithms randomly walk around our ever increasing address spaces, then we are less likely to benefit from the hardware support that hides this latency.
Is there anything we can do about this when designing algorithms and data-structures? Yes, there is a lot we can do. If we perform chunks of work on data that is co-located, and we stride around memory in a predictable fashion, then our algorithms can be many times faster. For example, rather than using bucket-and-chain hash tables, like those in the JDK, we can employ hash tables using open-addressing with linear-probing (a minimal sketch follows below). Rather than using linked-lists or trees with single items in each node, we can store an array of many items in each node.
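To make the open-addressing point concrete, below is a minimal sketch of a long-to-long map using open-addressing with linear-probing over a single flat array. It is illustrative only and of my own naming: no resize or remove, zero is reserved as the empty key, the capacity must be a power of two, and the caller must keep the map from filling completely. It is not the JDK's or any particular library's implementation.

// A minimal sketch of open-addressing with linear-probing for long keys and values,
// backed by a single flat array so a probe walks adjacent memory. Illustrative only:
// no resize, no remove, 0 reserved as the "empty" key, capacity must be a power of 2,
// and the caller must not let it fill completely.
public final class LongLongOpenHashMap
{
    private static final long EMPTY = 0L;

    private final long[] entries;   // even index = key, odd index = value
    private final int mask;

    public LongLongOpenHashMap(final int capacityPowerOf2)
    {
        entries = new long[capacityPowerOf2 * 2];
        mask = capacityPowerOf2 - 1;
    }

    public void put(final long key, final long value)
    {
        int index = hash(key);
        while (entries[index * 2] != EMPTY && entries[index * 2] != key)
        {
            index = (index + 1) & mask;   // linear probe: step to the adjacent slot
        }

        entries[index * 2] = key;
        entries[(index * 2) + 1] = value;
    }

    public long get(final long key, final long missingValue)
    {
        int index = hash(key);
        while (entries[index * 2] != EMPTY)
        {
            if (entries[index * 2] == key)
            {
                return entries[(index * 2) + 1];
            }
            index = (index + 1) & mask;
        }

        return missingValue;
    }

    private int hash(final long key)
    {
        final long h = key * 0x9E3779B97F4A7C15L;   // a simple mix; the constant choice is illustrative
        return (int)(h ^ (h >>> 32)) & mask;
    }
}

Because keys and values sit interleaved in one array, a probe that misses its first slot usually finds the next few candidates on the same or the adjacent cache-line, which is exactly the predictable, co-located access argued for above.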
Research is advancing on algorithmic approaches that work in harmony with cache sub-systems. One area I find fascinating is Cache Oblivious Algorithms. The name is a bit misleading but there are some great concepts here for how to improve software performance and better execute in parallel. This article is a great illustration of the performance benefits that can be gained.
Conclusion
To achieve great performance it is important to have sympathy for the cache sub-systems. We have seen in this article what can be achieved by accessing memory in patterns which work with, rather than against, these caches. When designing algorithms and data structures, it is now vitally important to consider cache-misses, probably even more so than counting steps in the algorithm. This is not what we were taught in algorithm theory when studying computer science. The last decade has seen some fundamental changes in technology. For me the two most significant are the rise of multi-core, and now big-memory systems with 64-bit address spaces.
One thing is certain, if we want software to execute faster and scale better, we need to make better use of the many cores in our CPUs, and pay attention to memory access patterns.
Update: 06-August-2012
Trying to design a random walk algorithm that works for all processors and memory sizes is tricky. If I use the algorithm below then my Sandy Bridge processor is slower but the Nehalem is faster. The point is that performance will be very unpredictable when you walk around memory in a random fashion. I've also included the L3 cache counters for more detail in all the tests.
private static final long LARGE_PRIME_INC = 70368760954879L;

RANDOM_HEAP_WALK
{
    public int next(final int pageOffset, final int wordOffset, final int pos)
    {
        return (int)(pos + LARGE_PRIME_INC) & ARRAY_MASK;
    }
};
Intel i7-2760QM @ 2.40GHz, 8GB RAM DDR3 1600MHz, Linux 3.4.6 kernel 64-bit, Java 1.7.0_05
=================================================
0 - 29.06ns RANDOM_HEAP_WALK
1 - 29.47ns RANDOM_HEAP_WALK
2 - 29.48ns RANDOM_HEAP_WALK
3 - 29.43ns RANDOM_HEAP_WALK
4 - 29.42ns RANDOM_HEAP_WALK

Performance counter stats for 'java -Xmx4g TestMemoryAccessPatterns 3':
  9,444,928,682 dTLB-loads
  4,371,982,327 dTLB-misses        # 46.29% of all dTLB cache hits
  9,390,675,639 L1-dcache-loads
  1,471,647,016 L1-dcache-misses   # 15.67% of all L1-dcache hits

+-----------------------+-------------+
|         Event         |   core 2    |
+-----------------------+-------------+
| INSTR_RETIRED_ANY     | 7.71171e+09 |
| CPU_CLK_UNHALTED_CORE | 1.31717e+11 |
| L2_TRANS_ALL_REQUESTS | 8.4912e+09  |
| L2_RQSTS_MISS         | 2.79635e+09 |
+-----------------------+-------------+
+-----------------+----------+
|     Metric      |  core 2  |
+-----------------+----------+
| Runtime [s]     | 55.0094  |
| CPI             | 17.0801  |
| L2 request rate | 1.10108  |
| L2 miss rate    | 0.362611 |
| L2 miss ratio   | 0.329324 |
+-----------------+----------+
+--------------------+-------------+
|       Event        |   core 2    |
+--------------------+-------------+
| LOAD_HIT_PRE_HW_PF | 3.59509e+06 |
+--------------------+-------------+
+------------------------+-------------+
|         Event          |   core 2    |
+------------------------+-------------+
| L3_LAT_CACHE_REFERENCE | 1.30318e+09 |
| L3_LAT_CACHE_MISS      | 2.62346e+07 |
+------------------------+-------------+
Love your articles. Keep 'em coming. We're working on systems in Scala in which these types of considerations are important.
Quick question, have you compared matrix-type multiplications which are often done via 2 different arrays? That is to say, I'm very curious as to what sort of performance penalties are incurred as a result of having to switch between two very large arrays to do something simple like addition. A single walk around a single array is very often not the norm in a typical application.
You have hit directly on the topic that cache oblivious algorithms try to address. Check out the article I link in the last paragraph of the "What does this mean for algorithms?" section.
The simple answer is to chunk up the problem so you are dealing with pages, or smaller units, of each array at a time.
Scala's Vector is a bit-mapped hash trie with 32 elements per node, making it a more cache-friendly structure – at least in theory; I have only seen anecdotal evidence of its actual performance characteristics.
Tries have the potential to be more cache efficient for reads. Unfortunately many implementations are quite slow for updates.
The important point is to measure and not assume. Potential does not equal actual if the implementation is not well executed.
I have an Ivy Bridge Xeon E3-1230V2 running Linux. If you can boil your info gathering into a script, I'd be more than happy to run it.
Thank you for the nice article. I learned a great deal about cache-friendly programming and usage of tools like perf/likwid from your post.
Many thanks for producing these great articles; they are the best of breed. I always know that I'm in for a good reading session when I see your blog among the unread feeds in my reader.
Perfect timing, Martin. Just saw your article now and it fits in very nicely with a book review I am publishing for the Well-Grounded Java Developer by Evans and Verburg. I've put a link to your article in the P.S., hopefully that will spread your fame :-)
http://www.javaspecialists.eu/archive/Issue204.html
Glad I'm not the only one working in August :-)
Heinz
You may find it interesting: here is an article in which the authors examine how the choice of probing strategy in an open-addressing hash table interacts with caches. Briefly, they found that with a pretty good hash function and a near-uniform key distribution, linear probing really rules them all. But with a not-so-ideal hash function and a more realistic key distribution, linear probing leads to significant key clustering, which brings performance to its knees. Standard double hashing is much more stable against such non-ideality and, while it gives less effective cache usage, generally gives better overall performance.
By the way, thank you for your articles, Martin. Always read them with pleasure.
Your point is very well made. However many key ranges do hash well for distribution. For example, many keys are monotonic sequences representing IDs for customers, accounts, orders, trades, deals, etc. When hashed these work very well in my experience with linear probing.
If your key is a string then it may not have the same quality of distribution after a single hash function. Strings generally do not make great keys anyway :-) The entropy typically only comes at the end for URLs, paths, and other hierarchical descriptors that strings are often used for.
Hi Martin..as you probably know, hw prefetchers are a lot more complicated today.
ReplyDeletefrom http://software.intel.com/sites/products/collateral/hpc/vtune/performance_analysis_guide.pdf
it's talking about intel i7, you can see there are multiple prefetchers, at different levels of the cache hierarchy. Think of these as speculative execution..i.e. sometimes they guess wrong and get in your way! So you're trying to write code that "coexists" with hw prefetchers, that were designed analyzing last year's code :)...And every generation of CPU has different algorithms for hw prefetch.
Trying to code to the iron reminds me of the hoops GPU programmers jump thru..they have to understand the GPU load/store/cache/memory pipes in detail. It's a big deal when they change.
I think that's a core problem with coding to the iron for general-purpose CPUs..the hw changes a lot. (think of all the variance with L3 caches, shared caches, non-shared caches, etc.)
from the link:
The Intel® Core™ i7 processor has a 4 component hardware prefetcher very similar to that of the Core™ processors. Two components associated with the L2 CACHE and two components associated with the L1 data cache. The 2 components of L2 CACHE
hardware prefetcher are similar to those in the Pentium™ 4 and Core™ processors. There is a “streaming” component that looks for multiple accesses in a local address window as a trigger and an “adjacency” component that causes 2 lines to be fetched instead of one with each triggering of the “streaming” component. The L1 data cache prefetcher is similar to the L1 data cache prefetcher familiar from the Core™ processors. It has another “streaming” component (which was usually disabled in the bios’ for the Core™ processors) and a “stride” or “IP” component that detected constant stride accesses at individual instruction pointers. The Intel® Core™ i7 processor has various improvements in the details of the hardware pattern identifications used in the prefetchers.
The hardware prefetchers can get it wrong if they incorrectly predict a stride, or if they load adjacent cache lines that are not used. Having adjacent prefetching turned on in the BIOS could (one should always measure) be bad if you have a very random pattern of access all over the heap.
I'm trying to understand your point. Are you suggesting that writing code that honors locality and has a predictable pattern of access is somehow wrong?
No. Just that the hardware AI is more than you say. Given a choice with an equal number of ld/st, grouping for sequential is obviously best. But using less data and fewer ld/st counts for a lot too.
I agree that if it is possible to process less data then that can have benefits. Lean data models are a whole topic in their own right.
Can you point out where the hardware AI for prefetches is more advanced than adjacent line and stride prediction?
Martin, I am having a very good time following your blog.
I have been playing around with some data structures and recently started writing up a small matching engine; it seemed like a fun place to start. I am using a pair of binary heaps of buy/sells and can process around 1MM orders per second.
https://github.com/fmstephe/matching_engine
But my naive heap implementation has terrible memory access patterns. I have been looking at various tree layouts and would appreciate it if you could write up a post on some data structures you have found effective in practice. Thanks
A reasonable matching engine should be able to do over 20 million orders per second on modern hardware without anything special. I might blog about that but then again I have to have some means of making a living via consultancy :-)
This is one problem that requires some custom data structures, especially when you need cancellation to be fast.
I respect your, and LMAX's, need to have some trade secrets. I'll just have to keep reading papers and testing.
(20 million, I'm going to have to find a new approach I think). Right now the approach I am taking is outlined here
http://www.quantcup.org/home/howtohft_howtobuildafastlimitorderbook
Another quick question is about your concurrency course. Are you planning to give another 3 day course in Belfast this year?
I'm happy to give the course anywhere I get sufficient interest. I've dropped Tara an email to see if he would like to run it again.
BTW that algorithm is not O(log N) for cancels because at any given price point you have to search by order id on what is effectively a linked-list to remove it. Also prices tend to trend in a direction on a given day and there is no discussion on how the tree stays balanced.
It should be noted that for performance Big O notation is important but equally important is, "what is the effective range for N in any algorithm?"
I have never written an order matching engine, but over 20 million orders per second sounds very fast. Are you talking about 20 million on a single core, i.e. an average process time of <50ns?
I’m just curious, what is actually included in this time? I assume it doesn’t include decoding of incoming orders/cancellations, so it’s the time it takes to match the order against opposite orders and/or insert it into the order book, and generate some kind of output to the owners of the orders and to some market data feed (but not encode this output into whatever format is used)?
The logic is to accept, reject, execute or cancel an order and generate the necessary execution reports and market data for the business logic. This all runs on a single thread. It is necessary to use other threads for handling the incoming and outgoing network traffic. You can do a lot in 150-200 cycles on a modern processor :-) If you cache-miss much then it is much slower. If the book is very deep and random cancellations come in then cache misses are going to happen.
What kind of networking hardware will let you receive and send 20 million packets/messages a second?
DeleteWhat can you reasonably expect from a gig-E or 10-gig E NIC?
This is just an orderbook being driven directly in memory. It does not include a network interface. There is a massive range in what is possible by NIC type, manufacturer, and IP stack.
I thought I could escape the rebalancing cost by using a binary heap. But to reach 20MM I don't think a heap will cut it. I am doing some hunting around the literature for heaps that behave well w.r.t. cache. Will have some fun implementing some of them.
I expected that I would have to mix an RB-tree into the order struct (alongside the linked list) to support fast cancellation. That's on my TODO list.
I happened upon Radix trees while reading on my commute. Maybe that will give me the boost I need :)
Francis, it may be too early for you to face off the binary heap... if you use object (Order list) identifiers as 4 bytes you can put 8 levels (maximum exposed by CME CBOT is 5) in 10*8 = 80 bytes. You'd have a typical remove/insert operation with max 6 memory writes to the adjacent block of memory.
All the rest above 8 levels can go to an RB-tree/trie or whatever efficient structure. Would be nice to hear Martin's take on the LOB impl mentioned above.
Do you use likwid to identify false positives in cache-line sharing? How do you identify thread ids, cache lines, variables etc. at such granularity? I think the Sun Studio analyzer was not intended for x86.
Mohan
Finding false-sharing is not easy in my experience. It is often a case of experimentation based on experience. I use a variety of tools and have not found one that is a perfect fit. If you set affinity for your threads to specific CPUs, you can then get the counters for the specific CPUs. Why do you think Solaris Studio Performance Analyser is not for x86? Works for me on x86.
If Solaris Studio can show false sharing I will try again.
Unfortunately in Java we don't really know on which memory page our objects sit. Do you know whether at least the Java GCs take page location into account when doing their work?
The only case I know of is where the GC keeps a String object and its associated character array co-located. Java so needs arrays of structures and arrays within structures. We might get them for Java 9. I think it is the biggest performance issue with Java that needs addressing.
Objects tend to start off allocated together in the young generation and even get promoted together. The really nasty issue comes when the old generation gets compacted and who knows what hole they will get moved into. I'm planning a later blog to talk about how big data can be better addressed with memory mapped files and using the flyweight pattern.
Thank you for the very informative article. After reading Ulrich Drepper's "What every programmer should know about memory", your article feels to me like a practical summary of the importance of the topic, and a very good usage of the perf tool by the way!
I took the time to try out your method (using C++, not Java) and played a bit with the perf tool.
I still have one question regarding how you obtained the values you are printing for the LINEAR_WALK. For the Intel i7-2760QM you got 0.88-0.92 ns using nanoTime (why it is not a constant multiple of a CPU tick is another secondary question I have, 'cause I'm not a Java developer).
I used rdtsc in C++ and cannot get less than 6 ticks on my E7-4830 CPU (system tuned for maximum performance). I probably have an issue and I'm not good enough with assembler to analyze the code, but I added all the optimizations I could to g++ in order to achieve better times.
I can reach 3 ticks/array read if I do not accumulate on result, just overwrite it (result = memory[pos]). Maybe the solution is to get better with assembler and dig more, but I'm still not clear why your times seem to be around 2 ticks.
Using rdtsc/rdtscp/cpuid to make sure out-of-order execution does not come into play brings no change; I still get 6 ticks/array read (not even sure out-of-order execution is a problem here).
Obviously I used taskset to pin the process to one core and used cpusets to reduce the impact of the system to zero (I do not see any cpu-migrations in perf stat anyway), but I cannot get to your small numbers.
With my current frequency of 2127.980 MHz the 6 ticks I got are a lot more than what you got on your i7. Also they seem to me to be too much for a simple array walk, but I cannot identify what can be wrong.
Although this is not the point of the article, do you have any idea what else I can try to get close to your values?
With Turbo Boost 2.0 my CPU is running at ~3.4GHz for the test according to perf stat.
If I was you I'd get an ASM dump of both the Java and C++ and compare them.
Hello.
How to calculate for Ivy Bridge?
http://www.7-cpu.com/cpu/IvyBridge.html
How are the CPU cache references/misses reports generated? Can someone help me with that? Thanks.
The post shows you how to do this with the "perf stat" and "likwid" tools. Google for these to find out more.