Mechanical Sympathy: Big Data

Showing posts with label Big Data. Show all posts

Monday, 5 May 2014

Simple Binary Encoding

Financial systems communicate by sending and receiving vast numbers of messages in many different formats. When people use terms like "vast" I normally think, "really..how many?" So lets quantify "vast" for the finance industry. Market data feeds from financial exchanges typically can be emitting tens or hundreds of thousands of message per second, and aggregate feeds like OPRA can peak at over 10 million messages per second with volumes growing year-on-year. This presentation gives a good overview.

In this crazy world we still see significant use of ASCII encoded presentations, such as FIX tag value, and some more slightly sane binary encoded presentations like FAST. Some markets even commit the sin of sending out market data as XML! Well I cannot complain too much as they have at times provided me a good income writing ultra fast XML parsers.

Last year the CME, who are a member the FIX community, commissioned Todd Montgomery, of 29West LBM fame, and myself to build the reference implementation of the new FIX Simple Binary Encoding (SBE) standard. SBE is a codec aimed at addressing the efficiency issues in low-latency trading, with a specific focus on market data. The CME, working within the FIX community, have done a great job of coming up with an encoding presentation that can be so efficient. Maybe a suitable atonement for the sins of past FIX tag value implementations. Todd and I worked on the Java and C++ implementation, and later we were helped on the .Net side by the amazing Olivier Deheurles at Adaptive. Working on a cool technical problem with such a team is a dream job.

SBE Overview

SBE is an OSI layer 6 presentation for encoding/decoding messages in binary format to support low-latency applications. Of the many applications I profile with performance issues, message encoding/decoding is often the most significant cost. I've seen many applications that spend significantly more CPU time parsing and transforming XML and JSON than executing business logic. SBE is designed to make this part of a system the most efficient it can be. SBE follows a number of design principles to achieve this goal. By adhering to these design principles sometimes means features available in other codecs will not being offered. For example, many codecs allow strings to be encoded at any field position in a message; SBE only allows variable length fields, such as strings, as fields grouped at the end of a message.

The SBE reference implementation consists of a compiler that takes a message schema as input and then generates language specific stubs. The stubs are used to directly encode and decode messages from buffers. The SBE tool can also generate a binary representation of the schema that can be used for the on-the-fly decoding of messages in a dynamic environment, such as for a log viewer or network sniffer.

The design principles drive the implementation of a codec that ensures messages are streamed through memory without backtracking, copying, or unnecessary allocation. Memory access patterns should not be underestimated in the design of a high-performance application. Low-latency systems in any language especially need to consider all allocation to avoid the resulting issues in reclamation. This applies for both managed runtime and native languages. SBE is totally allocation free in all three language implementations.

The end result of applying these design principles is a codec that has ~16-25 times greater throughput than Google Protocol Buffers (GPB) with very low and predictable latency. This has been observed in micro-benchmarks and real-world application use. A typical market data message can be encoded, or decoded, in ~25ns compared to ~1000ns for the same message with GPB on the same hardware. XML and FIX tag value messages are orders of magnitude slower again.

The sweet spot for SBE is as a codec for structured data that is mostly fixed size fields which are numbers, bitsets, enums, and arrays. While it does work for strings and blobs, many my find some of the restrictions a usability issue. These users would be better off with another codec more suited to string encoding.

Message Structure

A message must be capable of being read or written sequentially to preserve the streaming access design principle, i.e. with no need to backtrack. Some codecs insert location pointers for variable length fields, such as string types, that have to be indirected for access. This indirection comes at a cost of extra instructions plus losing the support of the hardware prefetchers. SBE's design allows for pure sequential access and copy-free native access semantics.

Figure 1

SBE messages have a common header that identifies the type and version of the message body to follow. The header is followed by the root fields of the message which are all fixed length with static offsets. The root fields are very similar to a struct in C. If the message is more complex then one or more repeating groups similar to the root block can follow. Repeating groups can nest other repeating group structures. Finally, variable length strings and blobs come at the end of the message. Fields may also be optional. The XML schema describing the SBE presentation can be found here.

SbeTool and the Compiler

To use SBE it is first necessary to define a schema for your messages. SBE provides a language independent type system supporting integers, floating point numbers, characters, arrays, constants, enums, bitsets, composites, grouped structures that repeat, and variable length strings and blobs.

A message schema can be input into the SbeTool and compiled to produce stubs in a range of languages, or to generate binary metadata suitable for decoding messages on-the-fly.

    java [-Doption=value] -jar sbe.jar <message-declarations-file.xml>

SbeTool and the compiler are written in Java. The tool can currently output stubs in Java, C++, and C#.

Programming with Stubs

A full example of messages defined in a schema with supporting code can be found here. The generated stubs follow a flyweight pattern with instances reused to avoid allocation. The stubs wrap a buffer at an offset and then read it sequentially and natively.

    // Write the message header first
    MESSAGE_HEADER.wrap(directBuffer, bufferOffset, messageTemplateVersion)
                  .blockLength(CAR.sbeBlockLength())
                  .templateId(CAR.sbeTemplateId())
                  .schemaId(CAR.sbeSchemaId())
                  .version(CAR.sbeSchemaVersion());

    // Then write the body of the message
    car.wrapForEncode(directBuffer, bufferOffset)
       .serialNumber(1234)
       .modelYear(2013)
       .available(BooleanType.TRUE)
       .code(Model.A)
       .putVehicleCode(VEHICLE_CODE, srcOffset);

Messages can be written via the generated stubs in a fluent manner. Each field appears as a generated pair of methods to encode and decode.

    // Read the header and lookup the appropriate template to decode
    MESSAGE_HEADER.wrap(directBuffer, bufferOffset, messageTemplateVersion);

    final int templateId = MESSAGE_HEADER.templateId();
    final int actingBlockLength = MESSAGE_HEADER.blockLength();
    final int schemaId = MESSAGE_HEADER.schemaId();
    final int actingVersion = MESSAGE_HEADER.version();

    // Once the template is located then the fields can be decoded.
    car.wrapForDecode(directBuffer, bufferOffset, actingBlockLength, actingVersion);

    final StringBuilder sb = new StringBuilder();
    sb.append("\ncar.templateId=").append(car.sbeTemplateId());
    sb.append("\ncar.schemaId=").append(schemaId);
    sb.append("\ncar.schemaVersion=").append(car.sbeSchemaVersion());
    sb.append("\ncar.serialNumber=").append(car.serialNumber());
    sb.append("\ncar.modelYear=").append(car.modelYear());
    sb.append("\ncar.available=").append(car.available());
    sb.append("\ncar.code=").append(car.code());

The generated code in all languages gives performance similar to casting a C struct over the memory.

On-The-Fly Decoding

The compiler produces an intermediate representation (IR) for the input XML message schema. This IR can be serialised in the SBE binary format to be used for later on-the-fly decoding of messages that have been stored. It is also useful for tools, such as a network sniffer, that will not have been compiled with the stubs. A full example of the IR being used can be found here.

Direct Buffers

SBE, via Agrona, provides an abstraction to Java, with the MutableDirectBuffer class, to work with buffers that are byte[], heap or direct ByteBuffer buffers, and off heap memory addresses returned from Unsafe.allocateMemory(long) or JNI. In low-latency applications, messages are often encoded/decoded in memory mapped files via MappedByteBuffer and thus can be be transferred to a network channel by the kernel thus avoiding user space copies.

C++ and C# have built-in support for direct memory access and do not require such an abstraction as the Java version does. A DirectBuffer abstraction was added for C# to support Endianess and encapsulate the unsafe pointer access.

Message Extension and Versioning

SBE schemas carry a version number that allows for message extension. A message can be extended by adding fields at the end of a block. Fields cannot be removed or reordered for backwards compatibility.

Extension fields must be optional otherwise a newer template reading an older message would not work. Templates carry metadata for min, max, null, timeunit, character encoding, etc., these are accessible via static (class level) methods on the stubs.

Byte Ordering and Alignment

The message schema allows for precise alignment of fields by specifying offsets. Fields are by default encoded in Little Endian form unless otherwise specified in a schema. For maximum performance native encoding with fields on word aligned boundaries should be used. The penalty for accessing non-aligned fields on some processors can be very significant. For alignment one must consider the framing protocol and buffer locations in memory.

Message Protocols

I often see people complain that a codec cannot support a particular presentation in a single message. However this is often possible to address with a protocol of messages. Protocols are a great way to split an interaction into its component parts, these parts are then often composable for many interactions between systems. For example, the IR implementation of schema metadata is more complex than can be supported by the structure of a single message. We encode IR by first sending a template message providing an overview, followed by a stream of messages, each encoding the tokens from the compiler IR. This allows for the design of a very fast OTF decoder which can be implemented as a threaded interpreter with much less branching than the typical switch based state machines.

Protocol design is an area that most developers don't seem to get an opportunity to learn. I feel this is a great loss. The fact that so many developers will call an "encoding" such as ASCII a "protocol" is very telling. The value of protocols is so obvious when one gets to work with a programmer like Todd who has spent his life successfully designing protocols.

Stub Performance

The stubs provide a significant performance advantage over the dynamic OTF decoding. For accessing primitive fields we believe the performance is reaching the limits of what is possible from a general purpose tool. The generated assembly code is very similar to what a compiler will generate for accessing a C struct, even from Java!

Regarding the general performance of the stubs, we have observed that C++ has a very marginal advantage over the Java which we believe is due to runtime inserted Safepoint checks. The C# version lags a little further behind due to its runtime not being as aggressive with inlining methods as the Java runtime. Stubs for all three languages are capable of encoding or decoding typical financial messages in tens of nanoseconds. This effectively makes the encoding and decoding of messages almost free for most applications relative to the rest of the application logic.

Feedback

This is the first version of SBE and we would welcome feedback. The reference implementation is constrained by the FIX community specification. It is possible to influence the specification but please don't expect pull requests to be accepted that significantly go against the specification. Support for Javascript, Python, Erlang, and other languages has been discussed and would be very welcome.

Update: 08-May-2014

Thanks to feedback from Kenton Varda, the creator of GPB, we were able to improve the benchmarks to get the best performance out of GPB. Below are the results for the changes to the Java benchmarks.

The C++ GPB examples on optimisation show approximately a doubling of throughput compared to initial results. It should be noted that you often have to do the opposite in Java with GPB compared to C++ to get performance improvements, such as allocate objects rather than reuse them.

Before GPB Optimisation:

Mode Thr    Cnt  Sec         Mean   Mean error    Units
     [exec] u.c.r.protobuf.CarBenchmark.testDecode           thrpt   1     30    1      462.817        6.474   ops/ms
     [exec] u.c.r.protobuf.CarBenchmark.testEncode           thrpt   1     30    1      326.018        2.972   ops/ms
     [exec] u.c.r.protobuf.MarketDataBenchmark.testDecode    thrpt   1     30    1     1148.050       17.194   ops/ms
     [exec] u.c.r.protobuf.MarketDataBenchmark.testEncode    thrpt   1     30    1     1242.252       12.248   ops/ms

     [exec] u.c.r.sbe.CarBenchmark.testDecode                thrpt   1     30    1    10436.476      102.114   ops/ms
     [exec] u.c.r.sbe.CarBenchmark.testEncode                thrpt   1     30    1    11657.190       65.168   ops/ms
     [exec] u.c.r.sbe.MarketDataBenchmark.testDecode         thrpt   1     30    1    34078.646      261.775   ops/ms
     [exec] u.c.r.sbe.MarketDataBenchmark.testEncode         thrpt   1     30    1    29193.600      443.638   ops/ms

After GPB Optimisation:

Mode Thr    Cnt  Sec         Mean   Mean error    Units
     [exec] u.c.r.protobuf.CarBenchmark.testDecode           thrpt   1     30    1      619.467        4.429   ops/ms
     [exec] u.c.r.protobuf.CarBenchmark.testEncode           thrpt   1     30    1      433.711       10.364   ops/ms
     [exec] u.c.r.protobuf.MarketDataBenchmark.testDecode    thrpt   1     30    1     2088.998       60.619   ops/ms
     [exec] u.c.r.protobuf.MarketDataBenchmark.testEncode    thrpt   1     30    1     1316.123       19.816   ops/ms

Throughput msg/ms - Before GPB Optimisation

Test

Protocol Buffers

SBE

Ratio

Car Encode

462.817

10436.476

22.52

Car Decode

326.018

11657.190

35.76

Market Data Encode

1148.050

34078.646

29.68

Market Data Decode

1242.252

29193.600

23.50

Throughput msg/ms - After GPB Optimisation

Test

Protocol Buffers

SBE

Ratio

Car Encode

619.467

10436.476

16.85

Car Decode

433.711

11657.190

26.88

Market Data Encode

2088.998

34078.646

16.31

Market Data Decode

1316.123

29193.600

22.18

Wednesday, 17 October 2012

Compact Off-Heap Structures/Tuples In Java

In my last post I detailed the implications of the access patterns your code takes to main memory. Since then I've had a lot of questions about what can be done in Java to enable more predictable memory layout. There are patterns that can be applied using array backed structures which I will discuss in another post. This post will explore how to simulate a feature sorely missing in Java - arrays of structures similar to what C has to offer.

Structures are very useful, both on the stack and the heap. To my knowledge it is not possible to simulate this feature on the Java stack. Not being able to do this on the stack is such as shame because it greatly limits the performance of some parallel algorithms, however that is a rant for another day.

In Java, all user defined types have to exist on the heap. The Java heap is managed by the garbage collector in the general case, however there is more to the wider heap in a Java process. With the introduction of direct ByteBuffer, memory can be allocated which is not tracked by the garbage collector because it can be available to native code for tasks like avoiding the copying of data to and from the kernel for IO. So one method of managing structures is to fake them within a ByteBuffer as a reasonable approach. This can allow compact data representations, but has performance and size limitations. For example, it is not possible to have a ByteBuffer greater than 2GB, and all access is bounds checked which impacts performance. An alternative exists using Unsafe that is both faster and and not size constrained like ByteBuffer.

The approach I'm about to detail is not traditional Java. If your problem space is dealing with big data, or extreme performance, then there are benefits to be had. If your data sets are small, and performance is not an issue, then run away now to avoid getting sucked into the dark arts of native memory management.

The benefits of the approach I'm about to detail are:

Significantly improved performance
More compact data representation
Ability to work with very large data sets while avoiding nasty GC pauses[1]

With all choices there are consequences. By taking the approach detailed below you take responsibility for some of the memory managment yourself. Getting it wrong can lead to memory leaks, or worse, you can crash the JVM! Proceed with caution...

Suitable Example - Trade Data

A common challenge faced in finance applications is capturing and working with very large volumes of order and trade data. For the example I will create a large table of in-memory trade data that can have analysis queries run against it. This table will be built using 2 contrasting approaches. Firstly, I'll take the traditional Java approach of creating a large array and reference individual Trade objects. Secondly, I keep the usage code identical but replace the large array and Trade objects with an off-heap array of structures that can be manipulated via a Flyweight pattern.

If for the traditional Java approach I used some other data structure, such as a Map or Tree, then the memory footprint would be even greater and the performance lower.

Traditional Java Approach

public class TestJavaMemoryLayout
{
    private static final int NUM_RECORDS = 50 * 1000 * 1000;

    private static JavaMemoryTrade[] trades;

    public static void main(final String[] args)
    {
        for (int i = 0; i < 5; i++)
        {
            System.gc();
            perfRun(i);
        }
    }

    private static void perfRun(final int runNum)
    {
        long start = System.currentTimeMillis();

        init();

        System.out.format("Memory %,d total, %,d free\n",
                          Runtime.getRuntime().totalMemory(),
                          Runtime.getRuntime().freeMemory());

        long buyCost = 0;
        long sellCost = 0;

        for (int i = 0; i < NUM_RECORDS; i++)
        {
            final JavaMemoryTrade trade = get(i);

            if (trade.getSide() == 'B')
            {
                buyCost += (trade.getPrice() * trade.getQuantity());
            }
            else
            {
                sellCost += (trade.getPrice() * trade.getQuantity());
            }
        }

        long duration = System.currentTimeMillis() - start;
        System.out.println(runNum + " - duration " + duration + "ms");
        System.out.println("buyCost = " + buyCost + " sellCost = " + sellCost);
    }

    private static JavaMemoryTrade get(final int index)
    {
        return trades[index];
    }

    public static void init()
    {
        trades = new JavaMemoryTrade[NUM_RECORDS];

        final byte[] londonStockExchange = {'X', 'L', 'O', 'N'};
        final int venueCode = pack(londonStockExchange);

        final byte[] billiton = {'B', 'H', 'P'};
        final int instrumentCode = pack( billiton);

        for (int i = 0; i < NUM_RECORDS; i++)
        {
            JavaMemoryTrade trade = new JavaMemoryTrade();
            trades[i] = trade;

            trade.setTradeId(i);
            trade.setClientId(1);
            trade.setVenueCode(venueCode);
            trade.setInstrumentCode(instrumentCode);

            trade.setPrice(i);
            trade.setQuantity(i);

            trade.setSide((i & 1) == 0 ? 'B' : 'S');
        }
    }

    private static int pack(final byte[] value)
    {
        int result = 0;
        switch (value.length)
        {
            case 4:
                result = (value[3]);
            case 3:
                result |= ((int)value[2] << 8);
            case 2:
                result |= ((int)value[1] << 16);
            case 1:
                result |= ((int)value[0] << 24);
                break;

            default:
                throw new IllegalArgumentException("Invalid array size");
        }

        return result;
    }

    private static class JavaMemoryTrade
    {
        private long tradeId;
        private long clientId;
        private int venueCode;
        private int instrumentCode;
        private long price;
        private long quantity;
        private char side;

        public long getTradeId()
        {
            return tradeId;
        }

        public void setTradeId(final long tradeId)
        {
            this.tradeId = tradeId;
        }

        public long getClientId()
        {
            return clientId;
        }

        public void setClientId(final long clientId)
        {
            this.clientId = clientId;
        }

        public int getVenueCode()
        {
            return venueCode;
        }

        public void setVenueCode(final int venueCode)
        {
            this.venueCode = venueCode;
        }

        public int getInstrumentCode()
        {
            return instrumentCode;
        }

        public void setInstrumentCode(final int instrumentCode)
        {
            this.instrumentCode = instrumentCode;
        }

        public long getPrice()
        {
            return price;
        }

        public void setPrice(final long price)
        {
            this.price = price;
        }

        public long getQuantity()
        {
            return quantity;
        }

        public void setQuantity(final long quantity)
        {
            this.quantity = quantity;
        }

        public char getSide()
        {
            return side;
        }

        public void setSide(final char side)
        {
            this.side = side;
        }
    }
}

Compact Off-Heap Structures

import sun.misc.Unsafe;

import java.lang.reflect.Field;

public class TestDirectMemoryLayout
{
    private static final Unsafe unsafe;
    static
    {
        try
        {
            Field field = Unsafe.class.getDeclaredField("theUnsafe");
            field.setAccessible(true);
            unsafe = (Unsafe)field.get(null);
        }
        catch (Exception e)
        {
            throw new RuntimeException(e);
        }
    }

    private static final int NUM_RECORDS = 50 * 1000 * 1000;

    private static long address;
    private static final DirectMemoryTrade flyweight = new DirectMemoryTrade();

    public static void main(final String[] args)
    {
        for (int i = 0; i < 5; i++)
        {
            System.gc();
            perfRun(i);
        }
    }

    private static void perfRun(final int runNum)
    {
        long start = System.currentTimeMillis();

        init();

        System.out.format("Memory %,d total, %,d free\n",
                          Runtime.getRuntime().totalMemory(),
                          Runtime.getRuntime().freeMemory());

        long buyCost = 0;
        long sellCost = 0;

        for (int i = 0; i < NUM_RECORDS; i++)
        {
            final DirectMemoryTrade trade = get(i);

            if (trade.getSide() == 'B')
            {
                buyCost += (trade.getPrice() * trade.getQuantity());
            }
            else
            {
                sellCost += (trade.getPrice() * trade.getQuantity());
            }
        }

        long duration = System.currentTimeMillis() - start;
        System.out.println(runNum + " - duration " + duration + "ms");
        System.out.println("buyCost = " + buyCost + " sellCost = " + sellCost);

        destroy();
    }

    private static DirectMemoryTrade get(final int index)
    {
        final long offset = address + (index * DirectMemoryTrade.getObjectSize());
        flyweight.setObjectOffset(offset);
        return flyweight;
    }

    public static void init()
    {
        final long requiredHeap = NUM_RECORDS * DirectMemoryTrade.getObjectSize();
        address = unsafe.allocateMemory(requiredHeap);

        final byte[] londonStockExchange = {'X', 'L', 'O', 'N'};
        final int venueCode = pack(londonStockExchange);

        final byte[] billiton = {'B', 'H', 'P'};
        final int instrumentCode = pack( billiton);

        for (int i = 0; i < NUM_RECORDS; i++)
        {
            DirectMemoryTrade trade = get(i);

            trade.setTradeId(i);
            trade.setClientId(1);
            trade.setVenueCode(venueCode);
            trade.setInstrumentCode(instrumentCode);

            trade.setPrice(i);
            trade.setQuantity(i);

            trade.setSide((i & 1) == 0 ? 'B' : 'S');
        }
    }

    private static void destroy()
    {
        unsafe.freeMemory(address);
    }

    private static int pack(final byte[] value)
    {
        int result = 0;
        switch (value.length)
        {
            case 4:
                result |= (value[3]);
            case 3:
                result |= ((int)value[2] << 8);
            case 2:
                result |= ((int)value[1] << 16);
            case 1:
                result |= ((int)value[0] << 24);
                break;

            default:
                throw new IllegalArgumentException("Invalid array size");
        }

        return result;
    }

    private static class DirectMemoryTrade
    {
        private static long offset = 0;

        private static final long tradeIdOffset = offset += 0;
        private static final long clientIdOffset = offset += 8;
        private static final long venueCodeOffset = offset += 8;
        private static final long instrumentCodeOffset = offset += 4;
        private static final long priceOffset = offset += 4;
        private static final long quantityOffset = offset += 8;
        private static final long sideOffset = offset += 8;

        private static final long objectSize = offset += 2;

        private long objectOffset;

        public static long getObjectSize()
        {
            return objectSize;
        }

        void setObjectOffset(final long objectOffset)
        {
            this.objectOffset = objectOffset;
        }

        public long getTradeId()
        {
            return unsafe.getLong(objectOffset + tradeIdOffset);
        }

        public void setTradeId(final long tradeId)
        {
            unsafe.putLong(objectOffset + tradeIdOffset, tradeId);
        }

        public long getClientId()
        {
            return unsafe.getLong(objectOffset + clientIdOffset);
        }

        public void setClientId(final long clientId)
        {
            unsafe.putLong(objectOffset + clientIdOffset, clientId);
        }

        public int getVenueCode()
        {
            return unsafe.getInt(objectOffset + venueCodeOffset);
        }

        public void setVenueCode(final int venueCode)
        {
            unsafe.putInt(objectOffset + venueCodeOffset, venueCode);
        }

        public int getInstrumentCode()
        {
            return unsafe.getInt(objectOffset + instrumentCodeOffset);
        }

        public void setInstrumentCode(final int instrumentCode)
        {
            unsafe.putInt(objectOffset + instrumentCodeOffset, instrumentCode);
        }

        public long getPrice()
        {
            return unsafe.getLong(objectOffset + priceOffset);
        }

        public void setPrice(final long price)
        {
            unsafe.putLong(objectOffset + priceOffset, price);
        }

        public long getQuantity()
        {
            return unsafe.getLong(objectOffset + quantityOffset);
        }

        public void setQuantity(final long quantity)
        {
            unsafe.putLong(objectOffset + quantityOffset, quantity);
        }

        public char getSide()
        {
            return unsafe.getChar(objectOffset + sideOffset);
        }

        public void setSide(final char side)
        {
            unsafe.putChar(objectOffset + sideOffset, side);
        }
    }
}

Results

Intel i7-860 @ 2.8GHz, 8GB RAM DDR3 1333MHz, 
Windows 7 64-bit, Java 1.7.0_07
=============================================
java -server -Xms4g -Xmx4g TestJavaMemoryLayout
Memory 4,116,054,016 total, 1,108,901,104 free
0 - duration 19334ms
Memory 4,116,054,016 total, 1,109,964,752 free
1 - duration 14295ms
Memory 4,116,054,016 total, 1,108,455,504 free
2 - duration 14272ms
Memory 3,817,799,680 total, 815,308,600 free
3 - duration 28358ms
Memory 3,817,799,680 total, 810,552,816 free
4 - duration 32487ms

java -server TestDirectMemoryLayout
Memory 128,647,168 total, 126,391,384 free
0 - duration 983ms
Memory 128,647,168 total, 126,992,160 free
1 - duration 958ms
Memory 128,647,168 total, 127,663,408 free
2 - duration 873ms
Memory 128,647,168 total, 127,663,408 free
3 - duration 886ms
Memory 128,647,168 total, 127,663,408 free
4 - duration 884ms

Intel i7-2760QM @ 2.40GHz, 8GB RAM DDR3 1600MHz, 
Linux 3.4.11 kernel 64-bit, Java 1.7.0_07
=================================================
java -server -Xms4g -Xmx4g TestJavaMemoryLayout
Memory 4,116,054,016 total, 1,108,912,960 free
0 - duration 12262ms
Memory 4,116,054,016 total, 1,109,962,832 free
1 - duration 9822ms
Memory 4,116,054,016 total, 1,108,458,720 free
2 - duration 10239ms
Memory 3,817,799,680 total, 815,307,640 free
3 - duration 21558ms
Memory 3,817,799,680 total, 810,551,856 free
4 - duration 23074ms

java -server TestDirectMemoryLayout 
Memory 123,994,112 total, 121,818,528 free
0 - duration 634ms
Memory 123,994,112 total, 122,455,944 free
1 - duration 619ms
Memory 123,994,112 total, 123,103,320 free
2 - duration 546ms
Memory 123,994,112 total, 123,103,320 free
3 - duration 547ms
Memory 123,994,112 total, 123,103,320 free
4 - duration 534ms

Analysis

Let's compare the results to the 3 benefits promised above.

1. Significantly improved performance

The evidence here is pretty clear cut. Using the off-heap structures approach is more than an order of magnitude faster. At the most extreme, look at the 5th run on a Sandy Bridge processor, we have 43.2 times difference in duration to complete the task. It is also a nice illustration of how well Sandy Bridge does with predictable access patterns to data. Not only is the performance significantly better it is also more consistent. As the heap becomes fragmented, and thus access patterns become more random, the performance degrades as can be seen in the later runs with standard Java approach.

2. More compact data representation

For our off-heap representation each object requires 42-bytes. To store 50 million of these, as in the example, we require 2,100,000,000 bytes. The memory required by the JVM heap is:

memory required = total memory - free memory - base JVM needs

2,883,248,712 = 3,817,799,680 - 810,551,856 - 123,999,112

This implies the JVM needs ~40% more memory to represent the same data. The reason for this overhead is the array of references to the Java objects plus the object headers. In a previous post I discussed object layout in Java.

When working with very large data sets this overhead can become a significant limiting factor.

3. Ability to work with very large data sets while avoiding nasty GC pauses

The sample code above forces a GC cycle before each run and can improve the consistency of the results in some cases. Feel free to remove the call to System.gc() and observe the implications for yourself. If you run the tests adding the following command line arguments then the garbage collector will output in painful detail what happened.

-XX:+PrintGC -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution -XX:+PrintHeapAtGC -XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCApplicationStoppedTime -XX:+PrintSafepointStatistics

From analysing the output I can see the application underwent a total of 29 GC cycles. Pause times are listed below by extracting the lines from the output indicating when the application threads are stopped.

With System.gc() before each run
================================
Total time for which application threads were stopped: 0.0085280 seconds
Total time for which application threads were stopped: 0.7280530 seconds
Total time for which application threads were stopped: 8.1703460 seconds
Total time for which application threads were stopped: 5.6112210 seconds
Total time for which application threads were stopped: 1.2531370 seconds
Total time for which application threads were stopped: 7.6392250 seconds
Total time for which application threads were stopped: 5.7847050 seconds
Total time for which application threads were stopped: 1.3070470 seconds
Total time for which application threads were stopped: 8.2520880 seconds
Total time for which application threads were stopped: 6.0949910 seconds
Total time for which application threads were stopped: 1.3988480 seconds
Total time for which application threads were stopped: 8.1793240 seconds
Total time for which application threads were stopped: 6.4138720 seconds
Total time for which application threads were stopped: 4.4991670 seconds
Total time for which application threads were stopped: 4.5612290 seconds
Total time for which application threads were stopped: 0.3598490 seconds
Total time for which application threads were stopped: 0.7111000 seconds
Total time for which application threads were stopped: 1.4426750 seconds
Total time for which application threads were stopped: 1.5931500 seconds
Total time for which application threads were stopped: 10.9484920 seconds
Total time for which application threads were stopped: 7.0707230 seconds

Without System.gc() before each run
===================================
Test run times
0 - duration 12120ms
1 - duration 9439ms
2 - duration 9844ms
3 - duration 20933ms
4 - duration 23041ms

Total time for which application threads were stopped: 0.0170860 seconds
Total time for which application threads were stopped: 0.7915350 seconds
Total time for which application threads were stopped: 10.7153320 seconds
Total time for which application threads were stopped: 5.6234650 seconds
Total time for which application threads were stopped: 1.2689950 seconds
Total time for which application threads were stopped: 7.6238170 seconds
Total time for which application threads were stopped: 6.0114540 seconds
Total time for which application threads were stopped: 1.2990070 seconds
Total time for which application threads were stopped: 7.9918480 seconds
Total time for which application threads were stopped: 5.9997920 seconds
Total time for which application threads were stopped: 1.3430040 seconds
Total time for which application threads were stopped: 8.0759940 seconds
Total time for which application threads were stopped: 6.3980610 seconds
Total time for which application threads were stopped: 4.5572100 seconds
Total time for which application threads were stopped: 4.6193830 seconds
Total time for which application threads were stopped: 0.3877930 seconds
Total time for which application threads were stopped: 0.7429270 seconds
Total time for which application threads were stopped: 1.5248070 seconds
Total time for which application threads were stopped: 1.5312130 seconds
Total time for which application threads were stopped: 10.9120250 seconds
Total time for which application threads were stopped: 7.3528590 seconds

It can been seen from the output that a significant proportion of the time is spent in the garbage collector. When your threads are stopped your application is not responsive. These tests have been done with default GC settings. It is possible to tune the GC for better results but this can be a highly skilled and significant effort. The only JVM I know that copes well by not imposing long pause times, even under high-throughput conditions, is the Azul concurrent compacting collector.

When profiling this application, I can see that the majority of the time is spent allocating the objects and promoting them to the old generation because they do not fit in the young generation. The initialisation costs can be removed from the timing but that is not realistic. If the traditional Java approach is taken the state needs to be built up before the query can take place. The end user of an application has to wait for the state to be built up and the query executed.

This test is really quite trivial. Imagine working with similar data sets but at the 100 GB scale.

Note: When the garbage collector compacts a region, then objects that were next to each other can be moved far apart. This can result in TLB and other cache misses.

Side Note On Serialization

A huge benefit of using off-heap structures in this manner is how they can be very easily serialised to network, or storage, by a simple memory copy as I have shown in the previous post. This way we can completely bypass intermediate buffer and object allocation.

Conclusion

If you are willing to do some C style programming for large datasets it is possible to control the memory layout in Java by going off-heap. If you do, the benefits in performance, compactness, and avoiding GC issues are significant. However this is an approach that should not be used for all applications. Its benefits are only noticable for very large datasets, or the extremes of performance in throughput and/or latency.

I hope the Java community can collectively realise the importance of supporting structures both on the heap and the stack. John Rose has done some excellent work in this area defining how tuples could be added to the JVM. His talk on Arrays 2.0 from the JVM Language Summit this year is really worth a watch. John discusses options for arrays of structures, and structures of arrays, in his talk. If the tuples, as proposed by John, were available then the test described here could have comparable performance and be a more pleasant programming style. The whole array of structures could be allocated in a single action thus bypassing the copy of individual objects across generations, and it would be stored in a compact contiguous fashion. This would remove the significant GC issues for this class of problem.

Lately, I was comparing standard data structures between Java and .Net. In some cases I observed a 6-10X performance advantage to .Net for things like maps and dictionaries when .Net used native structure support. Let's get this into Java as soon as possible!

It is also pretty obvious from the results that if we are to use Java for real-time analysis on big data, then our standard garbage collectors need to significantly improve and support true concurrent operations.

[1] - To my knowledge the only JVM that deals well with very large heaps is Azul Zing

Sunday, 5 August 2012

Memory Access Patterns Are Important

In high-performance computing it is often said that the cost of a cache-miss is the largest performance penalty for an algorithm. For many years the increase in speed of our processors has greatly outstripped latency gains to main-memory. Bandwidth to main-memory has greatly increased via wider, and multi-channel, buses however the latency has not significantly reduced. To hide this latency our processors employ evermore complex cache sub-systems that have many layers.

The 1994 paper "Hitting the memory wall: implications of the obvious" describes the problem and goes on to argue that caches do not ultimately help because of compulsory cache-misses. I aim to show that by using access patterns which display consideration for the cache hierarchy, this conclusion is not inevitable.

Let's start putting the problem in context with some examples. Our hardware tries to hide the main-memory latency via a number of techniques. Basically three major bets are taken on memory access patterns:

Temporal: Memory accessed recently will likely be required again soon.
Spatial: Adjacent memory is likely to be required soon.
Striding: Memory access is likely to follow a predictable pattern.

To illustrate these three bets in action let's write some code and measure the results.

Walk through memory in a linear fashion being completely predictable.
Pseudo randomly walk round memory within a restricted area then move on. This restricted area is what is commonly known as an operating system page of memory.
Pseudo randomly walk around a large area of the heap.

Code

The following code should be run with the -Xmx4g JVM option.

public class TestMemoryAccessPatterns
{
    private static final int LONG_SIZE = 8;
    private static final int PAGE_SIZE = 2 * 1024 * 1024;
    private static final int ONE_GIG = 1024 * 1024 * 1024;
    private static final long TWO_GIG = 2L * ONE_GIG;

    private static final int ARRAY_SIZE = (int)(TWO_GIG / LONG_SIZE);
    private static final int WORDS_PER_PAGE = PAGE_SIZE / LONG_SIZE;

    private static final int ARRAY_MASK = ARRAY_SIZE - 1;
    private static final int PAGE_MASK = WORDS_PER_PAGE - 1;

    private static final int PRIME_INC = 514229;

    private static final long[] memory = new long[ARRAY_SIZE];

    static
    {
        for (int i = 0; i < ARRAY_SIZE; i++)
        {
            memory[i] = 777;
        }
    }

    public enum StrideType
    {
        LINEAR_WALK
        {
            public int next(final int pageOffset, final int wordOffset, final int pos)
            {
                return (pos + 1) & ARRAY_MASK;
            }
        },

        RANDOM_PAGE_WALK
        {
            public int next(final int pageOffset, final int wordOffset, final int pos)
            {
                return pageOffset + ((pos + PRIME_INC) & PAGE_MASK);
            }
        },

        RANDOM_HEAP_WALK
        {
            public int next(final int pageOffset, final int wordOffset, final int pos)
            {
                return (pos + PRIME_INC) & ARRAY_MASK;
            }
        };

        public abstract int next(int pageOffset, int wordOffset, int pos);
    }

    public static void main(final String[] args)
    {
        final StrideType strideType;
        switch (Integer.parseInt(args[0]))
        {
            case 1:
                strideType = StrideType.LINEAR_WALK;
                break;

            case 2:
                strideType = StrideType.RANDOM_PAGE_WALK;
                break;

            case 3:
                strideType = StrideType.RANDOM_HEAP_WALK;
                break;

            default:
                throw new IllegalArgumentException("Unknown StrideType");
        }

        for (int i = 0; i < 5; i++)
        {
            perfTest(i, strideType);
        }
    }

    private static void perfTest(final int runNumber, final StrideType strideType)
    {
        final long start = System.nanoTime();

        int pos = -1;
        long result = 0;
        for (int pageOffset = 0; pageOffset < ARRAY_SIZE; pageOffset += WORDS_PER_PAGE)
        {
            for (int wordOffset = pageOffset, limit = pageOffset + WORDS_PER_PAGE;
                 wordOffset < limit;
                 wordOffset++)
            {
                pos = strideType.next(pageOffset, wordOffset, pos);
                result += memory[pos];
            }
        }

        final long duration = System.nanoTime() - start;
        final double nsOp = duration / (double)ARRAY_SIZE;

        if (208574349312L != result)
        {
            throw new IllegalStateException();
        }

        System.out.format("%d - %.2fns %s\n",
                          Integer.valueOf(runNumber),
                          Double.valueOf(nsOp),
                          strideType);
    }
}

Results

Intel U4100 @ 1.3GHz, 4GB RAM DDR2 800MHz, 
Windows 7 64-bit, Java 1.7.0_05
===========================================
0 - 2.38ns LINEAR_WALK
1 - 2.41ns LINEAR_WALK
2 - 2.35ns LINEAR_WALK
3 - 2.36ns LINEAR_WALK
4 - 2.39ns LINEAR_WALK

0 - 12.45ns RANDOM_PAGE_WALK
1 - 12.27ns RANDOM_PAGE_WALK
2 - 12.17ns RANDOM_PAGE_WALK
3 - 12.22ns RANDOM_PAGE_WALK
4 - 12.18ns RANDOM_PAGE_WALK

0 - 152.86ns RANDOM_HEAP_WALK
1 - 151.80ns RANDOM_HEAP_WALK
2 - 151.72ns RANDOM_HEAP_WALK
3 - 151.91ns RANDOM_HEAP_WALK
4 - 151.36ns RANDOM_HEAP_WALK

Intel i7-860 @ 2.8GHz, 8GB RAM DDR3 1333MHz, 
Windows 7 64-bit, Java 1.7.0_05
=============================================
0 - 1.06ns LINEAR_WALK
1 - 1.05ns LINEAR_WALK
2 - 0.98ns LINEAR_WALK
3 - 1.00ns LINEAR_WALK
4 - 1.00ns LINEAR_WALK

0 - 3.80ns RANDOM_PAGE_WALK
1 - 3.85ns RANDOM_PAGE_WALK
2 - 3.79ns RANDOM_PAGE_WALK
3 - 3.65ns RANDOM_PAGE_WALK
4 - 3.64ns RANDOM_PAGE_WALK

0 - 30.04ns RANDOM_HEAP_WALK
1 - 29.05ns RANDOM_HEAP_WALK
2 - 29.14ns RANDOM_HEAP_WALK
3 - 28.88ns RANDOM_HEAP_WALK
4 - 29.57ns RANDOM_HEAP_WALK

Intel i7-2760QM @ 2.40GHz, 8GB RAM DDR3 1600MHz, 
Linux 3.4.6 kernel 64-bit, Java 1.7.0_05
=================================================
0 - 0.91ns LINEAR_WALK
1 - 0.92ns LINEAR_WALK
2 - 0.88ns LINEAR_WALK
3 - 0.89ns LINEAR_WALK
4 - 0.89ns LINEAR_WALK

0 - 3.29ns RANDOM_PAGE_WALK
1 - 3.35ns RANDOM_PAGE_WALK
2 - 3.33ns RANDOM_PAGE_WALK
3 - 3.31ns RANDOM_PAGE_WALK
4 - 3.30ns RANDOM_PAGE_WALK

0 - 9.58ns RANDOM_HEAP_WALK
1 - 9.20ns RANDOM_HEAP_WALK
2 - 9.44ns RANDOM_HEAP_WALK
3 - 9.46ns RANDOM_HEAP_WALK
4 - 9.47ns RANDOM_HEAP_WALK

Analysis

I ran the code on 3 different CPU architectures illustrating generational steps forward for Intel. It is clear from the results that each generation has become progressively better at hiding the latency to main-memory based on the 3 bets described above for a relatively small heap. This is because the size and sophistication of various caches keep improving. However as memory size increases they become less effective. For example, if the array is doubled to be 4GB in size, then the average latency increases from ~30ns to ~55ns for the i7-860 doing the random heap walk.

It seems that for the linear walk case, memory latency does not exist. However as we walk around memory in an evermore random pattern then the latency starts to become very apparent.

The random heap walk produced an interesting result. This is a our worst case scenario, and given the hardware specifications of these systems, we could be looking at 150ns, 65ns, and 75ns for the above tests respectively based on memory controller and memory module latencies. For the Nehalem (i7-860) I can further subvert the cache sub-system by using a 4GB array resulting in ~55ns on average per iteration. The i7-2760QM has larger load buffers, TLB caches, and Linux is running with transparent huge pages which are all working to further hide the latency. By playing with different prime numbers for the stride, results can vary wildly depending on processor type, e.g. try PRIME_INC = 39916801 for Nehalem. I'd like to test this on a much larger heap with Sandy Bridge.

The main take away is the more predictable the pattern of access to memory, then the better the cache sub-systems are at hiding main-memory latency. Let's look at these cache sub-systems in a little detail to try and understand the observed results.

Hardware Components

We have many layers of cache plus the pre-fetchers to consider for how latency gets hidden. In this section I'll try and cover the major components used to hide latency that our hardware and systems software friends have put in place. We will investigate these latency hiding components and use the Linux perf and Lightweight Performance Counters utilities to retrieve the performance counters from our CPUs which tell how effective these components are when we execute our programs. Performance counters are CPU specific and what I've used here are specific to Sandy Bridge.

Data Caches
Processors typically have 2 or 3 layers of data cache. Each layer as we move out is progressively larger with increasing latency. The latest Intel processors have 3 layers (L1D, L2, and L3); with sizes 32KB, 256KB, and 4-30MB; and ~1ns, ~4ns, and ~15ns latency respectively for a 3.0GHz CPU.

Data caches are effectively hardware hash tables with a fixed number of slots for each hash value. These slots are known as "ways". An 8-way associative cache will have 8 slots to hold values for addresses that hash to the same cache location. Within these slots the data caches do not store words, they store cache-lines of multiple words. For an Intel processor these cache-lines are typically 64-bytes, that is 8 words on a 64-bit machine. This plays to the spatial bet that adjacent memory is likely to be required soon, which is typically the case if we think of arrays or fields of an object.

Data caches are typically evicted in a LRU manner. Caches work by using a write-back algorithm were stores need only be propagated to main-memory when a modified cache-line is evicted. This gives rise the the interesting phenomenon that a load can cause a write-back to the outer cache layers and eventually to main-memory.

perf stat -e L1-dcache-loads,L1-dcache-load-misses java -Xmx4g TestMemoryAccessPatterns $

 Performance counter stats for 'java -Xmx4g TestMemoryAccessPatterns 1':
     1,496,626,053 L1-dcache-loads                                            
       274,255,164 L1-dcache-misses
         #   18.32% of all L1-dcache hits

 Performance counter stats for 'java -Xmx4g TestMemoryAccessPatterns 2':
     1,537,057,965 L1-dcache-loads                                            
     1,570,105,933 L1-dcache-misses
         #  102.15% of all L1-dcache hits 

 Performance counter stats for 'java -Xmx4g TestMemoryAccessPatterns 3':
     4,321,888,497 L1-dcache-loads                                           
     1,780,223,433 L1-dcache-misses
         #   41.19% of all L1-dcache hits  

likwid-perfctr -C 2 -g L2CACHE java -Xmx4g TestMemoryAccessPatterns $

java -Xmx4g TestMemoryAccessPatterns 1
+-----------------------+-------------+
|         Event         |   core 2    |
+-----------------------+-------------+
|   INSTR_RETIRED_ANY   | 5.94918e+09 |
| CPU_CLK_UNHALTED_CORE | 5.15969e+09 |
| L2_TRANS_ALL_REQUESTS | 1.07252e+09 |
|     L2_RQSTS_MISS     | 3.25413e+08 |
+-----------------------+-------------+
+-----------------+-----------+
|     Metric      |  core 2   |
+-----------------+-----------+
|   Runtime [s]   |  2.15481  |
|       CPI       | 0.867293  |
| L2 request rate |  0.18028  |
|  L2 miss rate   | 0.0546988 |
|  L2 miss ratio  | 0.303409  |
+-----------------+-----------+
+------------------------+-------------+
|         Event          |   core 2    |
+------------------------+-------------+
| L3_LAT_CACHE_REFERENCE | 1.26545e+08 |
|   L3_LAT_CACHE_MISS    | 2.59059e+07 |
+------------------------+-------------+

java -Xmx4g TestMemoryAccessPatterns 2
+-----------------------+-------------+
|         Event         |   core 2    |
+-----------------------+-------------+
|   INSTR_RETIRED_ANY   | 1.48772e+10 |
| CPU_CLK_UNHALTED_CORE | 1.64712e+10 |
| L2_TRANS_ALL_REQUESTS | 3.41061e+09 |
|     L2_RQSTS_MISS     | 1.5547e+09  |
+-----------------------+-------------+
+-----------------+----------+
|     Metric      |  core 2  |
+-----------------+----------+
|   Runtime [s]   | 6.87876  |
|       CPI       | 1.10714  |
| L2 request rate | 0.22925  |
|  L2 miss rate   | 0.104502 |
|  L2 miss ratio  | 0.455843 |
+-----------------+----------+
+------------------------+-------------+
|         Event          |   core 2    |
+------------------------+-------------+
| L3_LAT_CACHE_REFERENCE | 1.52088e+09 |
|   L3_LAT_CACHE_MISS    | 1.72918e+08 |
+------------------------+-------------+

java -Xmx4g TestMemoryAccessPatterns 3
+-----------------------+-------------+
|         Event         |   core 2    |
+-----------------------+-------------+
|   INSTR_RETIRED_ANY   | 6.49533e+09 |
| CPU_CLK_UNHALTED_CORE | 4.18416e+10 |
| L2_TRANS_ALL_REQUESTS | 4.67488e+09 |
|     L2_RQSTS_MISS     | 1.43442e+09 |
+-----------------------+-------------+
+-----------------+----------+
|     Metric      |  core 2  |
+-----------------+----------+
|   Runtime [s]   |  17.474  |
|       CPI       |  6.4418  |
| L2 request rate | 0.71973  |
|  L2 miss rate   | 0.220838 |
|  L2 miss ratio  | 0.306835 |
+-----------------+----------+
+------------------------+-------------+
|         Event          |   core 2    |
+------------------------+-------------+
| L3_LAT_CACHE_REFERENCE | 1.40079e+09 |
|   L3_LAT_CACHE_MISS    | 1.34832e+09 |
+------------------------+-------------+

Note: The cache-miss rate of the combined L1D, L2 and L3 increases significantly as the pattern of access becomes more random.

Translation Lookaside Buffers (TLBs)
Our programs deal with virtual memory addresses that need to be translated to physical memory addresses. Virtual memory systems do this by mapping pages. We need to know the offset for a given page and its size for any memory operation. Typically page sizes are 4KB and are gradually moving to 2MB and greater. Linux introduced Transparent Huge Pages in the 2.6.38 kernel giving us 2MB pages. The translation of virtual memory pages to physical pages is maintained by the page table. This translation can require multiple accesses to the page table which is a huge performance penalty. To accelerate this lookup, processors have a small hardware cache at each cache level called the TLB cache. A miss on the TLB cache can be hugely expensive because the page table may not be in a nearby data cache. By moving to larger pages, a TLB cache can cover a larger address range for the same number of entries.

perf stat -e dTLB-loads,dTLB-load-misses java -Xmx4g TestMemoryAccessPatterns $
 
 Performance counter stats for 'java -Xmx4g TestMemoryAccessPatterns 1':
     1,496,128,634 dTLB-loads
           310,901 dTLB-misses
              #    0.02% of all dTLB cache hits 

 Performance counter stats for 'java -Xmx4g TestMemoryAccessPatterns 2':
     1,551,585,263 dTLB-loads
           340,230 dTLB-misses
              #    0.02% of all dTLB cache hits

 Performance counter stats for 'java -Xmx4g TestMemoryAccessPatterns 3':
     4,031,344,537 dTLB-loads
     1,345,807,418 dTLB-misses
              #   33.38% of all dTLB cache hits

Note: We only incur significant TLB misses when randomly walking the whole heap when huge pages are employed.

Hardware Pre-Fetchers
Hardware will try and predict the next memory access our programs will make and speculatively load that memory into fill buffers. This is done at it simplest level by pre-loading adjacent cache-lines for the spatial bet, or by recognising regular stride based access patterns, typically less than 2KB in stride length. The tests below we are measuring the number of loads that hit a fill buffer from a hardware pre-fetch.

likwid-perfctr -C 2 -g LOAD_HIT_PRE_HW_PF:PMC0 java -Xmx4g TestMemoryAccessPatterns $

java -Xmx4g TestMemoryAccessPatterns 1
+--------------------+-------------+
|       Event        |   core 2    |
+--------------------+-------------+
| LOAD_HIT_PRE_HW_PF | 1.31613e+09 |
+--------------------+-------------+

java -Xmx4g TestMemoryAccessPatterns 2
+--------------------+--------+
|       Event        | core 2 |
+--------------------+--------+
| LOAD_HIT_PRE_HW_PF | 368930 |
+--------------------+--------+

java -Xmx4g TestMemoryAccessPatterns 3
+--------------------+--------+
|       Event        | core 2 |
+--------------------+--------+
| LOAD_HIT_PRE_HW_PF | 324373 |
+--------------------+--------+

Note: We have a significant success rate for load hits with the pre-fetcher on the linear walk.

Memory Controllers and Row Buffers
Beyond our last level cache (LLC) sits the memory controllers that manage access to the SDRAM banks. Memory is organised into rows and columns. To access an address, first the row address must be selected (RAS), then the column address is selected (CAS) within that row to get the word. The row is typically a page in size and loaded into a row buffer. Even at this stage the hardware is still helping hide the latency. A queue of memory access requests is maintained and re-ordered so that multiple words can be fetched from the same row if possible.

Non-Uniform Memory Access (NUMA)
Systems now have memory controllers on the CPU socket. This move to on-socket memory controllers gave an ~50ns latency reduction over existing front side bus (FSB) and external Northbridge memory controllers. Systems with multiple sockets employ memory interconnects, QPI from Intel, which are used when one CPU wants to access memory managed by another CPU socket. The presence of these interconnects gives rise to the non-uniform nature of server memory access. In a 2-socket system memory may be local or 1 hop away. On a 8-socket system memory can be up to 3 hops away, were each hop adds 20ns latency in each direction.

What does this mean for algorithms?

The difference between an L1D cache-hit, and a full miss resulting in main-memory access, is 2 orders of magnitude; i.e. <1ns vs. 65-100ns. If algorithms randomly walk around our ever increasing address spaces, then we are less likely to benefit from the hardware support that hides this latency.

Is there anything we can do about this when designing algorithms and data-structures? Yes there is a lot we can do. If we perform chunks of work on data that is co-located, and we stride around memory in a predictable fashion, then our algorithms can be many times faster. For example rather than using bucket and chain hash tables, like in the JDK, we can employ hash tables using open-addressing with linear-probing. Rather than using linked-lists or trees with single items in each node, we can store an array of many items in each node.

Research is advancing on algorithmic approaches that work in harmony with cache sub-systems. One area I find fascinating is Cache Oblivious Algorithms. The name is a bit misleading but there are some great concepts here for how to improve software performance and better execute in parallel. This article is a great illustration of the performance benefits that can be gained.

Conclusion

To achieve great performance it is important to have sympathy for the cache sub-systems. We have seen in this article what can be achieved by accessing memory in patterns which work with, rather than against, these caches. When designing algorithms and data structures, it is now vitally important to consider cache-misses, probably even more so than counting steps in the algorithm. This is not what we were taught in algorithm theory when studying computer science. The last decade has seen some fundamental changes in technology. For me the two most significant are the rise of multi-core, and now big-memory systems with 64-bit address spaces.

One thing is certain, if we want software to execute faster and scale better, we need to make better use of the many cores in our CPUs, and pay attention to memory access patterns.

Update: 06-August-2012
Trying to design a random walk algorithm for all processors and memory sizes is tricky. If I use the algorithm below then my Sandy Bridge processor is slower but the Nehalem is faster. The point is performance will be very unpredictable when you walk around memory in a random fashion. I've also included the L3 cache counters for more detail in all the tests.

    private static final long LARGE_PRIME_INC = 70368760954879L;

        RANDOM_HEAP_WALK
        {
            public int next(final int pageOffset, final int wordOffset, final int pos)
            {
                return (int)(pos + LARGE_PRIME_INC) & ARRAY_MASK;
            }
        };

Intel i7-2760QM @ 2.40GHz, 8GB RAM DDR3 1600MHz, 
Linux 3.4.6 kernel 64-bit, Java 1.7.0_05
=================================================
0 - 29.06ns RANDOM_HEAP_WALK
1 - 29.47ns RANDOM_HEAP_WALK
2 - 29.48ns RANDOM_HEAP_WALK
3 - 29.43ns RANDOM_HEAP_WALK
4 - 29.42ns RANDOM_HEAP_WALK

 Performance counter stats for 'java -Xmx4g TestMemoryAccessPatterns 3':
     9,444,928,682 dTLB-loads
     4,371,982,327 dTLB-misses
         #   46.29% of all dTLB cache hits 

     9,390,675,639 L1-dcache-loads
     1,471,647,016 L1-dcache-misses
         #   15.67% of all L1-dcache hits  

+-----------------------+-------------+
|         Event         |   core 2    |
+-----------------------+-------------+
|   INSTR_RETIRED_ANY   | 7.71171e+09 |
| CPU_CLK_UNHALTED_CORE | 1.31717e+11 |
| L2_TRANS_ALL_REQUESTS | 8.4912e+09  |
|     L2_RQSTS_MISS     | 2.79635e+09 |
+-----------------------+-------------+
+-----------------+----------+
|     Metric      |  core 2  |
+-----------------+----------+
|   Runtime [s]   | 55.0094  |
|       CPI       | 17.0801  |
| L2 request rate | 1.10108  |
|  L2 miss rate   | 0.362611 |
|  L2 miss ratio  | 0.329324 |
+-----------------+----------+
+--------------------+-------------+
|       Event        |   core 2    |
+--------------------+-------------+
| LOAD_HIT_PRE_HW_PF | 3.59509e+06 |
+--------------------+-------------+
+------------------------+-------------+
|        Event           |   core 2    |
+------------------------+-------------+
| L3_LAT_CACHE_REFERENCE | 1.30318e+09 |
| L3_LAT_CACHE_MISS      | 2.62346e+07 |
+------------------------+-------------+