Wednesday 17 October 2012

Compact Off-Heap Structures/Tuples In Java

In my last post I detailed the performance implications of how your code accesses main memory.  Since then I've had a lot of questions about what can be done in Java to enable a more predictable memory layout.  There are patterns that can be applied using array-backed structures which I will discuss in another post.  This post will explore how to simulate a feature sorely missing in Java - arrays of structures similar to what C has to offer.

Structures are very useful, both on the stack and the heap.  To my knowledge it is not possible to simulate this feature on the Java stack.  Not being able to do this on the stack is such a shame because it greatly limits the performance of some parallel algorithms, however that is a rant for another day.

In Java, all user defined types have to exist on the heap.  The Java heap is managed by the garbage collector in the general case, however there is more to the wider heap in a Java process.  With the introduction of direct ByteBuffer, memory can be allocated that is not tracked by the garbage collector, so it can be made available to native code for tasks like avoiding the copying of data to and from the kernel for IO.  One method of managing structures is therefore to fake them within a ByteBuffer.  This can allow compact data representations, but has performance and size limitations.  For example, it is not possible to have a ByteBuffer greater than 2GB, and all access is bounds checked, which impacts performance.  An alternative exists using Unsafe that is both faster and not size constrained like ByteBuffer.
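
To make the ByteBuffer option concrete, below is a minimal sketch of faking a structure within a direct ByteBuffer using absolute indexed access.  The record layout here is hypothetical and chosen purely for illustration; it is not from the examples later in this post.

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class ByteBufferStructExample
{
    // Hypothetical 16-byte record: a long id followed by a long price
    private static final int ID_OFFSET = 0;
    private static final int PRICE_OFFSET = 8;
    private static final int RECORD_SIZE = 16;

    public static void main(final String[] args)
    {
        final int numRecords = 1000;
        final ByteBuffer buffer = ByteBuffer.allocateDirect(numRecords * RECORD_SIZE)
                                            .order(ByteOrder.nativeOrder());

        // Absolute puts and gets address the record in place; no object per record
        final int recordBase = 42 * RECORD_SIZE;
        buffer.putLong(recordBase + ID_OFFSET, 42L);
        buffer.putLong(recordBase + PRICE_OFFSET, 10000L);

        System.out.println("price = " + buffer.getLong(recordBase + PRICE_OFFSET));
    }
}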

The approach I'm about to detail is not traditional Java.  If your problem space is dealing with big data, or extreme performance, then there are benefits to be had.  If your data sets are small, and performance is not an issue, then run away now to avoid getting sucked into the dark arts of native memory management.

The benefits of the approach I'm about to detail are:
  1. Significantly improved performance 
  2. More compact data representation
  3. Ability to work with very large data sets while avoiding nasty GC pauses[1]
With all choices there are consequences.  By taking the approach detailed below you take responsibility for some of the memory management yourself.  Getting it wrong can lead to memory leaks, or worse, you can crash the JVM!  Proceed with caution...
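
One way to reduce the risk of leaks is to give each native allocation a single owner that is responsible for freeing it.  Below is a minimal sketch of that idea, assuming Unsafe is obtained via reflection as shown later in this post; the NativeBlock class is hypothetical and not part of the benchmark code.

import sun.misc.Unsafe;

public class NativeBlock implements AutoCloseable
{
    private final Unsafe unsafe;
    private long address;

    public NativeBlock(final Unsafe unsafe, final long sizeInBytes)
    {
        this.unsafe = unsafe;
        this.address = unsafe.allocateMemory(sizeInBytes);
    }

    public long address()
    {
        if (0 == address)
        {
            throw new IllegalStateException("block already freed");
        }

        return address;
    }

    public void close()
    {
        if (0 != address)
        {
            unsafe.freeMemory(address);
            address = 0; // guard against double free
        }
    }
}

With Java 7 try-with-resources, try (NativeBlock block = new NativeBlock(unsafe, size)) { ... } then guarantees freeMemory() is called exactly once, even if the work inside the block throws.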

Suitable Example - Trade Data

A common challenge faced in finance applications is capturing and working with very large volumes of order and trade data.  For the example I will create a large table of in-memory trade data that can have analysis queries run against it.  This table will be built using 2 contrasting approaches.  Firstly, I'll take the traditional Java approach of creating a large array referencing individual Trade objects.  Secondly, I'll keep the usage code identical but replace the large array and Trade objects with an off-heap array of structures that can be manipulated via the Flyweight pattern.

If I used some other data structure for the traditional Java approach, such as a Map or Tree, then the memory footprint would be even greater and the performance lower.

Traditional Java Approach
public class TestJavaMemoryLayout
{
    private static final int NUM_RECORDS = 50 * 1000 * 1000;

    private static JavaMemoryTrade[] trades;

    public static void main(final String[] args)
    {
        for (int i = 0; i < 5; i++)
        {
            System.gc();
            perfRun(i);
        }
    }

    private static void perfRun(final int runNum)
    {
        long start = System.currentTimeMillis();

        init();

        System.out.format("Memory %,d total, %,d free\n",
                          Runtime.getRuntime().totalMemory(),
                          Runtime.getRuntime().freeMemory());

        long buyCost = 0;
        long sellCost = 0;

        for (int i = 0; i < NUM_RECORDS; i++)
        {
            final JavaMemoryTrade trade = get(i);

            if (trade.getSide() == 'B')
            {
                buyCost += (trade.getPrice() * trade.getQuantity());
            }
            else
            {
                sellCost += (trade.getPrice() * trade.getQuantity());
            }
        }

        long duration = System.currentTimeMillis() - start;
        System.out.println(runNum + " - duration " + duration + "ms");
        System.out.println("buyCost = " + buyCost + " sellCost = " + sellCost);
    }

    private static JavaMemoryTrade get(final int index)
    {
        return trades[index];
    }

    public static void init()
    {
        trades = new JavaMemoryTrade[NUM_RECORDS];

        final byte[] londonStockExchange = {'X', 'L', 'O', 'N'};
        final int venueCode = pack(londonStockExchange);

        final byte[] billiton = {'B', 'H', 'P'};
        final int instrumentCode = pack( billiton);

        for (int i = 0; i < NUM_RECORDS; i++)
        {
            JavaMemoryTrade trade = new JavaMemoryTrade();
            trades[i] = trade;

            trade.setTradeId(i);
            trade.setClientId(1);
            trade.setVenueCode(venueCode);
            trade.setInstrumentCode(instrumentCode);

            trade.setPrice(i);
            trade.setQuantity(i);

            trade.setSide((i & 1) == 0 ? 'B' : 'S');
        }
    }

    private static int pack(final byte[] value)
    {
        int result = 0;
        switch (value.length)
        {
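            // Intentional fall-through: packs up to 4 ASCII bytes, big-endian, into one int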
            case 4:
                result |= (value[3]);
            case 3:
                result |= ((int)value[2] << 8);
            case 2:
                result |= ((int)value[1] << 16);
            case 1:
                result |= ((int)value[0] << 24);
                break;

            default:
                throw new IllegalArgumentException("Invalid array size");
        }

        return result;
    }

    private static class JavaMemoryTrade
    {
        private long tradeId;
        private long clientId;
        private int venueCode;
        private int instrumentCode;
        private long price;
        private long quantity;
        private char side;

        public long getTradeId()
        {
            return tradeId;
        }

        public void setTradeId(final long tradeId)
        {
            this.tradeId = tradeId;
        }

        public long getClientId()
        {
            return clientId;
        }

        public void setClientId(final long clientId)
        {
            this.clientId = clientId;
        }

        public int getVenueCode()
        {
            return venueCode;
        }

        public void setVenueCode(final int venueCode)
        {
            this.venueCode = venueCode;
        }

        public int getInstrumentCode()
        {
            return instrumentCode;
        }

        public void setInstrumentCode(final int instrumentCode)
        {
            this.instrumentCode = instrumentCode;
        }

        public long getPrice()
        {
            return price;
        }

        public void setPrice(final long price)
        {
            this.price = price;
        }

        public long getQuantity()
        {
            return quantity;
        }

        public void setQuantity(final long quantity)
        {
            this.quantity = quantity;
        }

        public char getSide()
        {
            return side;
        }

        public void setSide(final char side)
        {
            this.side = side;
        }
    }
}
Compact Off-Heap Structures
import sun.misc.Unsafe;

import java.lang.reflect.Field;

public class TestDirectMemoryLayout
{
    private static final Unsafe unsafe;
    static
    {
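        // Unsafe.getUnsafe() rejects callers not loaded by the bootstrap classloader, so obtain the singleton via reflection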
        try
        {
            Field field = Unsafe.class.getDeclaredField("theUnsafe");
            field.setAccessible(true);
            unsafe = (Unsafe)field.get(null);
        }
        catch (Exception e)
        {
            throw new RuntimeException(e);
        }
    }

    private static final int NUM_RECORDS = 50 * 1000 * 1000;

    private static long address;
    private static final DirectMemoryTrade flyweight = new DirectMemoryTrade();

    public static void main(final String[] args)
    {
        for (int i = 0; i < 5; i++)
        {
            System.gc();
            perfRun(i);
        }
    }

    private static void perfRun(final int runNum)
    {
        long start = System.currentTimeMillis();

        init();

        System.out.format("Memory %,d total, %,d free\n",
                          Runtime.getRuntime().totalMemory(),
                          Runtime.getRuntime().freeMemory());

        long buyCost = 0;
        long sellCost = 0;

        for (int i = 0; i < NUM_RECORDS; i++)
        {
            final DirectMemoryTrade trade = get(i);

            if (trade.getSide() == 'B')
            {
                buyCost += (trade.getPrice() * trade.getQuantity());
            }
            else
            {
                sellCost += (trade.getPrice() * trade.getQuantity());
            }
        }

        long duration = System.currentTimeMillis() - start;
        System.out.println(runNum + " - duration " + duration + "ms");
        System.out.println("buyCost = " + buyCost + " sellCost = " + sellCost);

        destroy();
    }

    private static DirectMemoryTrade get(final int index)
    {
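        // Reposition the shared flyweight over the record at this index; no per-record allocation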
        final long offset = address + (index * DirectMemoryTrade.getObjectSize());
        flyweight.setObjectOffset(offset);
        return flyweight;
    }

    public static void init()
    {
        final long requiredHeap = NUM_RECORDS * DirectMemoryTrade.getObjectSize();
        address = unsafe.allocateMemory(requiredHeap);

        final byte[] londonStockExchange = {'X', 'L', 'O', 'N'};
        final int venueCode = pack(londonStockExchange);

        final byte[] billiton = {'B', 'H', 'P'};
        final int instrumentCode = pack( billiton);

        for (int i = 0; i < NUM_RECORDS; i++)
        {
            DirectMemoryTrade trade = get(i);

            trade.setTradeId(i);
            trade.setClientId(1);
            trade.setVenueCode(venueCode);
            trade.setInstrumentCode(instrumentCode);

            trade.setPrice(i);
            trade.setQuantity(i);

            trade.setSide((i & 1) == 0 ? 'B' : 'S');
        }
    }

    private static void destroy()
    {
        unsafe.freeMemory(address);
    }

    private static int pack(final byte[] value)
    {
        int result = 0;
        switch (value.length)
        {
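            // Intentional fall-through: packs up to 4 ASCII bytes, big-endian, into one int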
            case 4:
                result |= (value[3]);
            case 3:
                result |= ((int)value[2] << 8);
            case 2:
                result |= ((int)value[1] << 16);
            case 1:
                result |= ((int)value[0] << 24);
                break;

            default:
                throw new IllegalArgumentException("Invalid array size");
        }

        return result;
    }

    private static class DirectMemoryTrade
    {
        private static long offset = 0;

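        // Each field offset is the running sum of the sizes of the preceding fields, giving a 42 byte record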
        private static final long tradeIdOffset = offset += 0;
        private static final long clientIdOffset = offset += 8;
        private static final long venueCodeOffset = offset += 8;
        private static final long instrumentCodeOffset = offset += 4;
        private static final long priceOffset = offset += 4;
        private static final long quantityOffset = offset += 8;
        private static final long sideOffset = offset += 8;

        private static final long objectSize = offset += 2;

        private long objectOffset;

        public static long getObjectSize()
        {
            return objectSize;
        }

        void setObjectOffset(final long objectOffset)
        {
            this.objectOffset = objectOffset;
        }

        public long getTradeId()
        {
            return unsafe.getLong(objectOffset + tradeIdOffset);
        }

        public void setTradeId(final long tradeId)
        {
            unsafe.putLong(objectOffset + tradeIdOffset, tradeId);
        }

        public long getClientId()
        {
            return unsafe.getLong(objectOffset + clientIdOffset);
        }

        public void setClientId(final long clientId)
        {
            unsafe.putLong(objectOffset + clientIdOffset, clientId);
        }

        public int getVenueCode()
        {
            return unsafe.getInt(objectOffset + venueCodeOffset);
        }

        public void setVenueCode(final int venueCode)
        {
            unsafe.putInt(objectOffset + venueCodeOffset, venueCode);
        }

        public int getInstrumentCode()
        {
            return unsafe.getInt(objectOffset + instrumentCodeOffset);
        }

        public void setInstrumentCode(final int instrumentCode)
        {
            unsafe.putInt(objectOffset + instrumentCodeOffset, instrumentCode);
        }

        public long getPrice()
        {
            return unsafe.getLong(objectOffset + priceOffset);
        }

        public void setPrice(final long price)
        {
            unsafe.putLong(objectOffset + priceOffset, price);
        }

        public long getQuantity()
        {
            return unsafe.getLong(objectOffset + quantityOffset);
        }

        public void setQuantity(final long quantity)
        {
            unsafe.putLong(objectOffset + quantityOffset, quantity);
        }

        public char getSide()
        {
            return unsafe.getChar(objectOffset + sideOffset);
        }

        public void setSide(final char side)
        {
            unsafe.putChar(objectOffset + sideOffset, side);
        }
    }
}
Results
Intel i7-860 @ 2.8GHz, 8GB RAM DDR3 1333MHz, 
Windows 7 64-bit, Java 1.7.0_07
=============================================
java -server -Xms4g -Xmx4g TestJavaMemoryLayout
Memory 4,116,054,016 total, 1,108,901,104 free
0 - duration 19334ms
Memory 4,116,054,016 total, 1,109,964,752 free
1 - duration 14295ms
Memory 4,116,054,016 total, 1,108,455,504 free
2 - duration 14272ms
Memory 3,817,799,680 total, 815,308,600 free
3 - duration 28358ms
Memory 3,817,799,680 total, 810,552,816 free
4 - duration 32487ms

java -server TestDirectMemoryLayout
Memory 128,647,168 total, 126,391,384 free
0 - duration 983ms
Memory 128,647,168 total, 126,992,160 free
1 - duration 958ms
Memory 128,647,168 total, 127,663,408 free
2 - duration 873ms
Memory 128,647,168 total, 127,663,408 free
3 - duration 886ms
Memory 128,647,168 total, 127,663,408 free
4 - duration 884ms

Intel i7-2760QM @ 2.40GHz, 8GB RAM DDR3 1600MHz, 
Linux 3.4.11 kernel 64-bit, Java 1.7.0_07
=================================================
java -server -Xms4g -Xmx4g TestJavaMemoryLayout
Memory 4,116,054,016 total, 1,108,912,960 free
0 - duration 12262ms
Memory 4,116,054,016 total, 1,109,962,832 free
1 - duration 9822ms
Memory 4,116,054,016 total, 1,108,458,720 free
2 - duration 10239ms
Memory 3,817,799,680 total, 815,307,640 free
3 - duration 21558ms
Memory 3,817,799,680 total, 810,551,856 free
4 - duration 23074ms

java -server TestDirectMemoryLayout 
Memory 123,994,112 total, 121,818,528 free
0 - duration 634ms
Memory 123,994,112 total, 122,455,944 free
1 - duration 619ms
Memory 123,994,112 total, 123,103,320 free
2 - duration 546ms
Memory 123,994,112 total, 123,103,320 free
3 - duration 547ms
Memory 123,994,112 total, 123,103,320 free
4 - duration 534ms
Analysis

Let's compare the results to the 3 benefits promised above.

1.  Significantly improved performance

The evidence here is pretty clear cut.  Using the off-heap structures approach is more than an order of magnitude faster.  At the most extreme, look at the 5th run on the Sandy Bridge processor, where we see a 43.2 times difference in duration to complete the task.  It is also a nice illustration of how well Sandy Bridge does with predictable access patterns to data.  Not only is the performance significantly better, it is also more consistent.  As the heap becomes fragmented, and thus access patterns become more random, the performance degrades, as can be seen in the later runs with the standard Java approach.

2.  More compact data representation

For our off-heap representation each record requires 42 bytes.  To store 50 million of these, as in the example, we require 2,100,000,000 bytes.  The memory required by the JVM heap is:

   memory required = total memory - free memory - base JVM needs 

     2,883,253,712 = 3,817,799,680 - 810,551,856 - 123,994,112

This implies the JVM needs ~40% more memory to represent the same data.  The reason for this overhead is the array of references to the Java objects plus the object headers.  In a previous post I discussed object layout in Java.
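
As a rough sanity check on that figure, assuming a 64-bit HotSpot JVM with compressed oops, a 12-byte object header, and 8-byte object alignment (these sizes are implementation details and can vary):

   per object = 12 (header) + 42 (fields) + padding = 56 bytes
   per record = 56 + 4 (compressed reference in the array) = 60 bytes
   50,000,000 records * 60 bytes = ~3,000,000,000 bytes

This is in the same ballpark as the ~2.9GB measured above; HotSpot can pack some fields into the header padding gap, so the measured figure comes out a little lower.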

When working with very large data sets this overhead can become a significant limiting factor.

3.  Ability to work with very large data sets while avoiding nasty GC pauses

The sample code above forces a GC cycle before each run, which can improve the consistency of the results in some cases.  Feel free to remove the call to System.gc() and observe the implications for yourself.  If you run the tests with the following command line arguments then the garbage collector will output in painful detail what happened.

-XX:+PrintGC -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution -XX:+PrintHeapAtGC -XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCApplicationStoppedTime -XX:+PrintSafepointStatistics

From analysing the output I can see the application underwent a total of 29 GC cycles.  The pause times below are extracted from the lines of output indicating when the application threads were stopped.

With System.gc() before each run
================================
Total time for which application threads were stopped: 0.0085280 seconds
Total time for which application threads were stopped: 0.7280530 seconds
Total time for which application threads were stopped: 8.1703460 seconds
Total time for which application threads were stopped: 5.6112210 seconds
Total time for which application threads were stopped: 1.2531370 seconds
Total time for which application threads were stopped: 7.6392250 seconds
Total time for which application threads were stopped: 5.7847050 seconds
Total time for which application threads were stopped: 1.3070470 seconds
Total time for which application threads were stopped: 8.2520880 seconds
Total time for which application threads were stopped: 6.0949910 seconds
Total time for which application threads were stopped: 1.3988480 seconds
Total time for which application threads were stopped: 8.1793240 seconds
Total time for which application threads were stopped: 6.4138720 seconds
Total time for which application threads were stopped: 4.4991670 seconds
Total time for which application threads were stopped: 4.5612290 seconds
Total time for which application threads were stopped: 0.3598490 seconds
Total time for which application threads were stopped: 0.7111000 seconds
Total time for which application threads were stopped: 1.4426750 seconds
Total time for which application threads were stopped: 1.5931500 seconds
Total time for which application threads were stopped: 10.9484920 seconds
Total time for which application threads were stopped: 7.0707230 seconds

Without System.gc() before each run
===================================
Test run times
0 - duration 12120ms
1 - duration 9439ms
2 - duration 9844ms
3 - duration 20933ms
4 - duration 23041ms

Total time for which application threads were stopped: 0.0170860 seconds
Total time for which application threads were stopped: 0.7915350 seconds
Total time for which application threads were stopped: 10.7153320 seconds
Total time for which application threads were stopped: 5.6234650 seconds
Total time for which application threads were stopped: 1.2689950 seconds
Total time for which application threads were stopped: 7.6238170 seconds
Total time for which application threads were stopped: 6.0114540 seconds
Total time for which application threads were stopped: 1.2990070 seconds
Total time for which application threads were stopped: 7.9918480 seconds
Total time for which application threads were stopped: 5.9997920 seconds
Total time for which application threads were stopped: 1.3430040 seconds
Total time for which application threads were stopped: 8.0759940 seconds
Total time for which application threads were stopped: 6.3980610 seconds
Total time for which application threads were stopped: 4.5572100 seconds
Total time for which application threads were stopped: 4.6193830 seconds
Total time for which application threads were stopped: 0.3877930 seconds
Total time for which application threads were stopped: 0.7429270 seconds
Total time for which application threads were stopped: 1.5248070 seconds
Total time for which application threads were stopped: 1.5312130 seconds
Total time for which application threads were stopped: 10.9120250 seconds
Total time for which application threads were stopped: 7.3528590 seconds

It can be seen from the output that a significant proportion of the time is spent in the garbage collector.  When your threads are stopped your application is not responsive.  These tests have been done with default GC settings.  It is possible to tune the GC for better results, but this can be a highly skilled and significant effort.  The only JVM I know of that copes well, by not imposing long pause times even under high-throughput conditions, is the Azul concurrent compacting collector.

When profiling this application, I can see that the majority of the time is spent allocating the objects and promoting them to the old generation because they do not fit in the young generation.  The initialisation costs could be removed from the timing, but that would not be realistic.  If the traditional Java approach is taken, the state needs to be built up before the query can take place, so the end user of an application has to wait for the state to be built up and the query executed.

This test is really quite trivial.  Imagine working with similar data sets but at the 100 GB scale.

Note: when the garbage collector compacts a region, objects that were next to each other can be moved far apart.  This can result in TLB and other cache misses.

Side Note On Serialization

A huge benefit of using off-heap structures in this manner is how easily they can be serialised to network, or storage, by a simple memory copy, as I have shown in the previous post.  This way we can completely bypass intermediate buffer and object allocation.
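
As an illustration, a bulk snapshot of a range of records might look like the sketch below.  It assumes the unsafe and address fields from TestDirectMemoryLayout above; the snapshot method itself is hypothetical, and the raw layout (field offsets, byte order) must of course match wherever the bytes are later read.

    // Copy recordCount records, starting at fromRecord, into a byte[] with a
    // single bulk copy - no per-field marshalling and no intermediate objects
    public static byte[] snapshot(final int fromRecord, final int recordCount)
    {
        final long recordSize = DirectMemoryTrade.getObjectSize();
        final byte[] dst = new byte[(int)(recordCount * recordSize)];

        unsafe.copyMemory(null, address + (fromRecord * recordSize),
                          dst, unsafe.arrayBaseOffset(byte[].class),
                          dst.length);

        return dst;
    }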

Conclusion

If you are willing to do some C-style programming for large datasets, it is possible to control the memory layout in Java by going off-heap.  If you do, the benefits in performance, compactness, and avoiding GC issues are significant.  However, this is an approach that should not be used for all applications.  Its benefits are only noticeable for very large datasets, or at the extremes of performance in throughput and/or latency.

I hope the Java community can collectively realise the importance of supporting structures both on the heap and the stack.  John Rose has done some excellent work in this area, defining how tuples could be added to the JVM.  His talk on Arrays 2.0 from this year's JVM Language Summit is really worth a watch.  John discusses options for arrays of structures, and structures of arrays, in his talk.  If tuples, as proposed by John, were available then the test described here could have comparable performance and a more pleasant programming style.  The whole array of structures could be allocated in a single action, thus bypassing the copy of individual objects across generations, and it would be stored in a compact contiguous fashion.  This would remove the significant GC issues for this class of problem.

Lately I have been comparing standard data structures between Java and .Net.  In some cases I observed a 6-10X performance advantage to .Net for things like maps and dictionaries when .Net used native structure support.  Let's get this into Java as soon as possible!

It is also pretty obvious from the results that if we are to use Java for real-time analysis on big data, then our standard garbage collectors need to significantly improve and support true concurrent operations.

[1] - To my knowledge the only JVM that deals well with very large heaps is Azul Zing

80 comments:

  1. Very interesting post. One question, though: have you tried implement same flyweight on the top of plain java byte[] instead of direct memory?

    Replies
    1. With a plain byte[] I'd expect it to be slower on access because of the bounds checking and the byte-to-primitive conversions that need to constantly happen. A native order ByteBuffer is a better approach, but still slower than using Unsafe. Being limited to 2GB for a normal byte[] is not exactly what I would call big data :-) You would need to first select one of many 2GB arrays into which you address, thus adding another level of indirection when dealing with a large heap.

    2. This comment has been removed by the author.

  2. Also, I can see a dramatic change in results if I:
    1) move init() out of the measured section.
    2) do .gc() and .sleep() several times before measuring to force java objects to be moved into the old gen.

    Replies
    1. Yes, I pointed out that GC is one of the biggest costs when you run a profiler on this type of problem. Real world apps have to allocate the objects, and for big data applications they will not all fit in the young generation, thus they must be promoted.

      To be fair both approaches have to allocate the memory and initialise it to be like-for-like. A bunch of forced GCs and sleeps is not real world behaviour.

      The world is very interesting when you measure and profile :-) Thanks for the feedback.

    2. Yes, real world apps allocate memory. But they do it the same way for pojo and direct backed storage. So, if you pretend to allocate once and in big chunks -- you can get the benefit from direct memory, but you also get the benefit of pre-allocating the array and entries in pojo-style. Or, if you pretend your app allocates objects incrementally -- you'll have much trouble with unsafe.allocate on many small chunks; I would bet it is much slower than the usual java thread-local-buffer based allocation.

      By the way, about java-array-backed storage: I've just done benchmarks with a long[] backed "trade array", and it is very close to the direct-buffer one. Yes, the world is a very interesting place when you conjecture and verify :)

    3. ...actually, long[] based storage is even faster than the Unsafe based one -- on my macbook air, at least. 70M records, 6GB heap, jdk 1.6:

      long[] 236+/- 95 ms
      direct 287+/- 113 ms
      pojo 380+/- 160 ms

    4. For the long[] approach do you use a slot for each variable or do you pack when smaller? For example, 2 ints pack into a long. Can you post a link to code showing the comparison? I'm curious to know what approach is taken and whether it will work for the full range of data types.

    5. Here are the benchmarks (my own, based on your idea).

      I've packed two ints into 1 long, yes. For the last char I use a full long cell. I'm also curious about serialization/deserialization of the different primitive types: your example was quite simple since it was long-dominated.

    6. Your "long[]" approach works well if your data is mostly longs. If you have a lot of smaller primitives then the packing costs will start to dominate. Unsafe works well when you have the full range of data types and also has the advantage of supporting volatile and CAS operations. With large data sets you want to pack them to get the most out of memory.

      The big advantage with any of the large array/buffer approaches is they are directly allocated in the old gen and avoid the generational copy cost.

    7. Well, generally I agree. But there are also some tricks to use: you are not bound to longs, you could choose the primitive by your data's dominant field type. I think it is not a very complex task to make some fitness-function which weighs the cost of packing/unpacking against the cost of storing/loading data in smaller chunks than the native hardware word, and so gives you the best storage type for a given fields-and-types set.

      On the other side, I would rely on JIT support. I have not examined the assembly yet, but couldn't the JIT optimize pack-and-store/load-and-unpack by using wider/narrower load/store instructions? It seems like a not very complex heuristic.

      About the old gen -- well, do you actually see copying? If you allocate a 16GB long[] it does not fit into the young gen at all -- it will be allocated in the old gen from the beginning, afaik

    8. As I said in my previous comment, the large array *is* allocated directly in the old gen, thus avoiding the copy. With the POJO approach all those individual objects need to be copied a number of times in the promotion via the survivor spaces.

    9. Yes, I've missed that point.

      By the way, my point is that you can get most of C-like structure performance and compactness with plain java arrays (although it will be an unusual java style), without black magic. This is much safer, and it may be even faster, or at least very close to direct memory. It seems like direct memory gives you real benefit only in a few corner cases.

    10. You are right that things can be very close with standard arrays. It is safer and easier to debug. I'd only recommend the direct approach if you really really need to :-) In fact I often recommend standard arrays with flyweights for column-oriented storage. Column-oriented arrays make excellent indices for searching.

      The cases where direct memory has real benefits are:

      1. Your memory needs are greater than 2-8GB depending on data types.
      2. You want to directly serialise by copying to input or output buffers without conversion.
      3. You are performing concurrent operations on fields and need volatile and/or Compare and Swap semantics. This is very common in uber large memory applications.
      4. You use a wide range of data types, especially Strings as byte or char arrays.
      5. The address you use is to a memory-mapped file that you are using for storage, therefore gaining transparent persistence.

      Thanks for the interaction and feedback! Hopefully those reading are learning from the comments.

    11. "If you have a lot of smaller primitives then the packing costs will start to dominate", shouldn't JIT be good for packing/unpacking, they are simple multiple and/or/plus/shift operations, in addition, Intel core has several unit to compute in pipeline, shouldn't pack/unpack cost be very small? if not, maybe JIT compiler can be improved on this regard.

    12. Pack and unpack operations will be relatively cheap compared to a cache miss. Have you considered that lots of other work needs to be done as part of the computation, and that it is better not to burn all the execution units unnecessarily when we have alternatives?

      I don't understand the JIT reference. The JIT cannot turn these into packed structures because it would not be using the primitive array as specified. It also cannot make work magically go away. Am I missing something in this point?

    13. http://hg.openjdk.java.net/jdk7/jdk7/hotspot/file/tip/src/cpu/x86/vm/templateTable_x86_64.cpp
      shows that if unpack/pack use OR/AND/SHIFT/ADD it will be cheap. However, to access a byte at a certain offset of the array - the "[index]" - I hope it is not using this (the baload opcode); would it involve the multiply by HeapWordSize in base_offset_in_bytes? upsetting.

      678 void TemplateTable::baload() {
      679 transition(itos, itos);
      680 __ pop_ptr(rdx);
      681 // eax: index
      682 // rdx: array
      683 index_check(rdx, rax); // kills rbx
      684 __ load_signed_byte(rax,
      685 Address(rdx, rax,
      686 Address::times_1,
      687 arrayOopDesc::base_offset_in_bytes(T_BYTE)));
      688 }

      http://hg.openjdk.java.net/jdk6/jdk6/hotspot/diff/ba764ed4b6f2/src/share/vm/oops/arrayOop.hpp
      + // Returns the offset of the first element.
      + static int base_offset_in_bytes(BasicType type) {
      + return header_size(type) * HeapWordSize;
      + }

    14. We are expanding on the scope of the post, but that is OK. On a real project I would byte-align the beginning of each structure so addressing can be done with shifts. This is also necessary to ensure the words for concurrent access are aligned, just as Java object fields are.

      I'm working with some JVM implementers on a new intrinsic that treats a class like the following as if it is a real array of structures.

      https://github.com/mjpt777/examples/blob/master/src/java/uk/co/real_logic/intrinsics/StructuredArray.java

    15. Agreed, variable length data such as a char array can be 0 padded to an array length of a power of 2; base 10 is for humans, not memory (need sufficient cache size? it will be hard to decide 64 -> 128, ouch).
      "return partitions[partitionIndex][partitionOffset];"
      Still, there has to be magic behind the getter in the above line, right? I guess this JVM implementation is not oracle hotspot :)
      Would these intrinsics also be efficient in terms of wiping the region clean and copying to other instances?

    16. When using 64-bit addressing the "magic" behind the getter is not very difficult as an intrinsic. The issue with the above is the 32-bit index addressing limitation on Java arrays. This class specifies the behaviour, and not the implementation, for the intrinsic.

      This can be made very efficient for copy and reset operations with contiguous memory layout.

      I know of a number of JVM implementers who are looking at memory layout with structures, arrays, and object co-location. It would be great to see this work become generally available.

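      A minimal sketch of the long[]-backed flyweight discussed in this thread, with the two int codes packed into one long slot. This is illustrative only, and not the commenter's actual benchmark code:

      public class LongArrayTradeFlyweight
      {
          private static final int LONGS_PER_RECORD = 2;

          private final long[] store;
          private int base;

          public LongArrayTradeFlyweight(final int numRecords)
          {
              store = new long[numRecords * LONGS_PER_RECORD];
          }

          public LongArrayTradeFlyweight index(final int recordIndex)
          {
              base = recordIndex * LONGS_PER_RECORD; // reposition the flyweight
              return this;
          }

          public void setPrice(final long price)
          {
              store[base] = price;
          }

          public long getPrice()
          {
              return store[base];
          }

          public void setCodes(final int venueCode, final int instrumentCode)
          {
              // pack two 32-bit ints into one 64-bit slot
              store[base + 1] = ((long)venueCode << 32) | (instrumentCode & 0xFFFFFFFFL);
          }

          public int getVenueCode()
          {
              return (int)(store[base + 1] >>> 32);
          }

          public int getInstrumentCode()
          {
              return (int)store[base + 1];
          }
      }
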
  3. The major difference between the two implementations is the time it takes to init the array. Doing the actual trades is comparable, with direct being slightly faster.

    But if you pool your memory array by creating the JavaMemoryTrade objects before-hand, the memory example is actually faster than direct.

    public static void main(final String[] args) {
        trades = new JavaMemoryTrade[NUM_RECORDS];

        for (int i = 0; i < NUM_RECORDS; i++) {
            trades[i] = new JavaMemoryTrade();
        }

        for (int i = 0; i < 5; i++) {
            // System.gc();
            perfRun(i);
        }
    }

    Replies
    1. Are you reusing the same memory for the direct example to be a fair comparison? When I do, the direct is still faster by a reasonable amount.

  4. I feel the same way. The JVM works best if you only use the heap for scratch space and very long lived immutable state. Don't let anything make it out of the young generation and, if you can, avoid copying anything to the survivor spaces as well.

    I wish someone would implement a red-black tree or hash table based on this design.

  5. Ok, I allocated the direct memory in the main and it's faster.

    I had to drop records to 5,000,000 due to ram limits.

    I get mem duration ~105ms, direct ~92ms.

    Replies
    1. You should test with a larger memory machine. It gets more interesting then. With increased volume the TLB and other caches get put under greater stress. At 10X memory compared to you, I'm seeing ~70% delta on just the iteration.

      The real killer comes in with real world applications where the heap is fragmented and those objects are all over the place. The direct memory will always be together. I'd need to write a lot more code to show that in action. No other objects are allocated in this test to mix up the heap.

  6. Any idea why you saw generally a 30% improvement on Linux vs Windows? Is this just Sandy Bridge vs Nehalem? Or does the OS contribute significantly?

    Replies
    1. All memory management is happening inside the JVM, the min and max sizes are set at startup. In my experience it is the processor rather than the OS making the difference here. Sandy Bridge has an improved cache and memory ordering buffer compared to Nehalem, and it will get a lot better again in Haswell.

  7. This comment has been removed by the author.

  8. Very interesting! Have you looked at the HugeCollections library (http://code.google.com/p/vanilla-java/wiki/HugeCollections)? What do you think of it? It seems to provide similar functionality. The main difference is that it is column-oriented, as opposed to your example, which is row-oriented.

    Replies
    1. I've not tried HugeCollections personally. Column-oriented is great for some classes of problem and row-oriented is great for others. I use whichever is most appropriate for the problem I need to solve. Neither is perfect for all scenarios.

    2. It would be interesting to see if there is any big difference between the two approaches. Is anyone willing to try it out?

  9. Hello,

    This is quite wonderful code. I tested some things out.

    1. Took your code.
    2. Changed the 5 reps to NUM_REPS, where NUM_REPS == 500.
    3. Removed the System.gc() call in the main loop, only to see how this might run in a non-testing environment.
    4. NUM_RECORDS = 3 * 1000 * 1000

    DirectMemory has less variation (i.e. coefficient of variation)

    See http://screencast.com/t/mGBmoqOyETcG.

    a) The top panel is JavaMemoryLayout, the bottom panel is DirectMemoryLayout.
    b) Dropped first 10 observations for both
    c) the y-axis is log(runtime, 10)
    d) these are 490 observations plotted in order of occurrence

    Very very impressive.
    Though could you explain the cycles in the DirectMemoryLayout?

    Replies
    1. This comment has been removed by the author.

    2. I updated to a 1000 points and added a red loess smoother
      See http://screencast.com/t/EK6xuriF

      So my last question,
      - notice for TestJava, the red trend is straight, though for TestDirect (bottom panel), it appears to be increasing.
      Could the OS be affecting this?

      Thanks
      Saptarshi

    3. Is the y-axis time to complete the run? If it is, the issue might be that the OS memory is being fragmented as the process constantly grows and shrinks in the direct case. The plain Java version has a constant memory requirement.

      Try taking the memory allocation out of the loop in the direct case, which is more like a real application, and see what it does.

  10. Why would there be any heap fragmentation? You seem to be allocating the entire array and its contents at once and then deleting it for the next test.

    Is it because the young gen is not big enough for this test?

    Replies
    1. Yes, most data will be promoted to the old generation.

  11. Is this not similar to how BigMemory of Terracotta works, though I think they use a memory mapped file (through MappedByteBuffer)?

    Replies
    1. I believe many "Big Memory" solutions and KV stores use similar techniques with ByteBuffers that are either heap based or memory-mapped files. If structures/tuples are added to the language, like John Rose has described, then none of these workarounds would be necessary.

    2. One thing that strikes me right now is that we could use Unsafe memory to alleviate some full GC pain.
      Assume that we have something like a Disruptor that we need to provide with byte buckets (byte arrays of say 1500 bytes). If we allocate them from the heap, they will reside in the old gen but will never be released, just increasing the live data set of the old gen. OTOH, if we allocate those from Unsafe memory, they will not clutter the old gen, potentially lowering full GC times. I know I definitely could use something like that right now as the full GCs don't release all that much memory.

    3. Going off-heap is very common in low-latency applications to avoid the GC overhead. This is especially important if you need to be fast right out the gate.

      Since byte buffers don't contain object references there is no card marking and thus limited GC interaction once promoted. The majority of the cost will come in allocation and promotion to the old generation.

    4. Is it a good idea to use this technique for the RingBuffer? How would you handle messages larger than 1500 bytes?

    5. Yes this works very effectively. Large messages can be handled by chaining blocks or using variable sized records.

    6. I have multiple publishers... Is it still possible to use chaining blocks or variable sized records in that case? I cannot figure out where to put the pointer to the next block. If by block you mean RingBuffer Event, at translate time we haven't claimed the next seqNum yet, but if you mean DirectMemoryObj, the get() function only works reliably for multiple producers if the blocks are fixed length. Am I missing something?

    7. I think I got it now. I was relying on the DSL API too much. I could just use the RingBuffer directly and have a translator with a different API. Something like this:

      // assume we worked out that the message fits into 2 events
      long seq1 = ringBuffer.next();
      long seq2 = ringBuffer.next();
      try {
          translator.translateChunk(ringBuffer.get(seq1), seq1, seq2);
          translator.translateChunk(ringBuffer.get(seq2), seq2, -1L);
      } finally {
          ringBuffer.publish(seq1);
          ringBuffer.publish(seq2);
      }


    8. I would not go with the Disruptor for this type of solution. Best to create a new structure that does exactly what you want.

  12. The most dramatic speed and memory usage improvement comes when you compare standard Java data structures with Unsafe-based ones: up to 8-10 times less memory usage, and zero GC on 50-100GB heaps (off-heaps, of course).

    Replies
    1. There is one question on the Unsafe.allocateMemory/freeMemory implementation. Is it the platform malloc/free or an interface to a JVM malloc/free? The platform one (default Linux glibc, for example) is prone to significant memory fragmentation and is not suitable for long running server applications. Many native Linux (and I suppose Windows) applications use alternative memory allocators (custom, tcmalloc, jemalloc etc).

    2. The implementation of Unsafe.allocateMemory will likely be JVM specific and therefore you cannot rely on an underlying implementation. Best to allocate large contiguous chunks and manage it yourself. I would not use this for object level allocations.

  13. How about granting huge memory to the young generation of the GC with the -XX:NewSize JVM option (e.g. -XX:NewSize=4g, -Xms6g, -Xmx6g, -XX:+UseConcMarkSweepGC, ...)?

    My result (without System.gc()):

    0 - duration 1229ms
    1 - duration 1559ms
    2 - duration 1699ms
    3 - duration 2549ms
    4 - duration 3159ms


    Total time for which application threads were stopped: 0.8967600 seconds
    Total time for which application threads were stopped: 1.0796190 seconds
    Total time for which application threads were stopped: 1.6817910 seconds
    Total time for which application threads were stopped: 0.2799290 seconds
    Total time for which application threads were stopped: 2.5433730 seconds

    Replies
    1. That is one way to avoid the generational copy issues :-) GC can be played with for a microbenchmark. The real challenge is how to do this well, without lots of tricks, as part of a real application. The only true answer is to add support for arrays of structs to Java, which cannot come soon enough.

  14. "To my knowledge it is not possible to simulate this feature on the Java stack." Being pedantic for a moment...

    You can create pseudo-structures on the stack using primitives.

    It's just that it's not terribly useful in the Java world. Kind of like trying to model functional programming in Java -- you can do it, but it's clunky and painful.

    Here's an ugly example of using the stack for structures in Java. Instead of defining a "car" structure, we use primitive parameters to simulate passing a car structure on the stack from one procedure to the next.

    This example is somewhat modeled after creating a temporary structure on the stack in C, doing some work with it, then throwing it away.

    public class CarTest
    {
        public static void main ( String [] args )
        {
            CarCounter counter = new CarCounter ( null );
            CarInitializer initializer = new CarInitializer ( counter );
            initializer.invoke ( "Ford Taurus", 0 );
            initializer.invoke ( "Honda Civiv", 0 );
            initializer.invoke ( "Toyota Corolla", 0 );
            System.out.println ( "Total number of cars: " + counter.numCars () );
            System.out.println ( "Total number of wheels: " + counter.numWheels () );
        }
    }


    public class CarInitializer
        extends CarProcedure
    {
        public CarInitializer ( CarProcedure nested_procedure )
        {
            super ( nested_procedure );
        }

        public void invoke ( String model, int wheels )
        {
            wheels = 4;
            CarProcedure nested_procedure = this.nestedProcedure ();
            if ( nested_procedure != null )
            {
                nested_procedure.invoke ( model, wheels );
            }
        }
    }


    public class CarCounter
        extends CarProcedure
    {
        private int numCars = 0;
        private int numWheels = 0;

        public CarCounter ( CarProcedure nested_procedure )
        {
            super ( nested_procedure );
        }

        public void invoke ( String model, int wheels )
        {
            numCars ++;
            numWheels += wheels;

            CarProcedure nested_procedure = this.nestedProcedure ();
            if ( nested_procedure != null )
            {
                nested_procedure.invoke ( model, wheels );
            }
        }

        public int numCars ()
        {
            return numCars;
        }

        public int numWheels ()
        {
            return numWheels;
        }
    }


    public abstract class CarProcedure
    {
        private final CarProcedure nestedProcedure;

        public CarProcedure ( CarProcedure nested_procedure )
        {
            this.nestedProcedure = nested_procedure;
        }

        public CarProcedure nestedProcedure ()
        {
            return this.nestedProcedure;
        }

        public abstract void invoke ( String model, int wheels );
    }



    There are other similar ways of tying your data to the stack, but ultimately none of them are pleasant or useful.

    Cheers,

    Johann Tienhaara

    Replies
    1. I'm either missing your point or you are missing mine :-)

      By "on the stack" I mean the actual stack that will be generated in native code after the optimiser has run. No heap objects at all so no GC interaction, and the structs/objects are *in* the stack frame for memory locality. Your pseudo objects above are on the heap away from the stack frame and need to be GC'ed. Real stack based data is key to the scalability of many true parallel algorithms to avoid contention.

      Escape analysis promised much here and delivered little.

    2. Cool, I'm still allowed to reply after nearly 4 years!

      Sorry I'm a little late noticing your reply. :)

      I should have specified: the "structure" I referred to is precisely what's on the stack in the method calls:

      initializer.invoke ( "Ford Taurus", 0 );
      initializer.invoke ( "Honda Civiv", 0 );
      initializer.invoke ( "Toyota Corolla", 0 );

      Three instances of a structure that looks something like:

      struct {
      String model;
      int wheels;
      }

      Where the String model is initialized at the outset, but the int wheels is initialized inside a method call, and both values are lost once the method call is complete.

      There should be no heap space allocated for a Ford Taurus with 4 wheels, or a Honda "Civiv" with 4 wheels, and so on. To the best of my knowledge, every Java compiler puts constants right on the stack. Certainly the class files I've looked at in the past do not allocate heap space for primitive or String constants.

      I believe the heap space you referred to is all the single-instance classes which *do something with* the structure. You could have 18 billion structure instances defined, none of them on the heap, and the various CarInitializer etc classes would all still only be instantiated exactly once each.

      Anyway this was certainly a bit of pedantry, as I pointed out in my original post. :) Anyone who tries to use "parameter passing down the stack" as stack-based structures is likely to 1) run into the limitations in expressiveness of this approach and 2) be shot by a co-worker.

      If you still see a flaw in the concept, I'd certainly be keen to hear about it (even after all these years!).

      Cheers Martin,

      Johann

    3. I'm struggling to understand what you mean by this. Maybe not enough coffee yet this morning :) Please post a full project on GitHub so the idea can be better understood and evaluated.

  15. Hi Martin,

    firstly I would like to thank you for your blog and the excellent posts you put there. I wanted to use the Unsafe approach in my project, but I encountered one issue which I can't resolve. I'm trying to create a byte array, which I then copy to off-heap memory, and get a pointer to it. Everything works fine with 32bit java, but with 64bit java I'm getting a JVM crash :(. Do you have any explanation as to why the simple test program below is crashing on 64bit, but not on 32bit? Thanks a lot for your help!

    Peter

    public class UnsafePointerTest {

        public static class Pointer {
            public static final String fieldName = "object";

            Object object;

            public Object getObject() {
                return object;
            }
        }

        private static final Unsafe unsafe;
        private static final long byteArrayOffset;

        static {
            try {
                Field field = Unsafe.class.getDeclaredField("theUnsafe");
                field.setAccessible(true);
                unsafe = (Unsafe)field.get(null);
                byteArrayOffset = unsafe.arrayBaseOffset(byte[].class);
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }

        public static void main(String[] args) throws NoSuchFieldException, SecurityException {
            int arraySize = 100;
            long address = unsafe.allocateMemory(arraySize + byteArrayOffset);
            byte[] array = new byte[arraySize];
            copyToOffHeap(array, address);
            Pointer pointer = getPointer(address);
            System.out.println(((byte[])pointer.getObject()).length); // crash on 64bit!!
        }

        static Pointer getPointer(long address) throws NoSuchFieldException, SecurityException {
            Pointer pointer = new Pointer();
            long pointerOffset = unsafe.objectFieldOffset(Pointer.class.getDeclaredField(Pointer.fieldName));
            unsafe.putLong(pointer, pointerOffset, address);
            return pointer;
        }

        static void copyToOffHeap(byte[] array, long address) {
            long size = array.length + byteArrayOffset;
            unsafe.copyMemory(array, 0, null, address, size);
        }
    }

    Replies
    1. You cannot safely treat off-heap memory like real objects. You must use Unsafe or direct buffers to read and write the bytes as primitives, and not as real objects.

      I suspect this is working on 32-bit because the address is real. On 64-bit the JVM uses compressed object references that are treated as special offsets, based on byte alignment, rather than real addresses. This *may* work on 32GB+ heaps but is still wrong.

      Basically your "Pointer" class is evil ;-)

    2. Hi Martin,

      thank you very much for your reply. If I disable compressed oops with -XX:-UseCompressedOops option everything works fine on 64bit too. Do you still think that I should not use the "Pointer" class at all?

      The reason why I use it is the following scenario:

      The application is receiving messages from the middleware. The middleware supports writing the messages directly into a byte array at a given offset - something like int recv(byte[] bytearray, int offset). So with the "Pointer" class I can pass the off-heap memory as a byte array to the recv method and the middleware will write directly to off-heap memory - instead of writing to a temporary managed byte array and then copying this array to off-heap memory through Unsafe.

      What do you think about this approach? I think the only question is what happens if the "Pointer" instance is garbage collected - but this should not be an issue if I prevent it.

      Thanks a lot.

      P.

    3. I still think this is evil :-/ The garbage collector will try to mark this pointer as reachable when it clearly is not. You have just deferred the crash to some random point in the future. Going off-heap is a "handle with care" situation. Please please please do not do things to deceive the runtime. It will only end in tears.

      Why don't you allow the middleware to write off-heap then use a flyweight to access the data? Or simply use a direct ByteBuffer?

    4. I agree that letting the GC mark the pointer as reachable is wrong, but if I write the code so that I always keep a reference to the "Pointer" instance until the application is terminated then it should not be a big issue. But I have to agree that the whole "Pointer" approach is evil :).

      Btw I can't allow the middleware to write to off-heap memory directly because it is not internally developed middleware - we are using zeromq (http://www.zeromq.org/).

    5. The issue with the GC marking a pointer as reachable is that many VMs do this by updating a state table at some given offset from the current page containing the object. It may also try to move the fake object during compaction. Off-heap memory must be treated as raw bytes and cannot safely be treated as real Java objects.

  16. Have you, or anyone else, written up anything about how LMAX deals with message reliability? I am hunting around for something about this. Perhaps another trade secret? :)

    Replies
    1. I adapted a number of reliable messaging techniques that go back many years and are common in older TP systems. I can offer consultancy in this area.

  17. We are organising training spending for next year at my work. I would like to attend your concurrency course if you are running it. I had a look at the instil site for the course and it is listed as TBC, so I am registering my interest here to encourage you to run it :)

  18. Having recently been advised by yourself (many thanks) to beware of cache line alignment issues, I wondered how you would reason about alignment in the above case (assuming a 64-byte cache line):
    1) the trade structure size is 42
    2) unsafe.allocateMemory will return an address that "will never be zero, and will be
    aligned for all value types." --> will be 16-byte aligned, from what I can find.
    This means your address can start in one of 4 locations on a cache line (0, 16, 32, 48), resulting in a variety of conditions in which one of the fields is split across 2 cache lines, e.g. if the address is cache aligned then the second record has its 4th field split between line one and line 2.
    I understand this can result in loss of atomicity of updates to the field, meaning a half formed field could be visible on write. I assume this can be corrected by padding the structure such that this cannot happen (to a size which is a multiple of 16).
    Am I correct in my reasoning so far? If you agree that the issue exists, how would it manifest (perf issue? correctness issue?)
    Thanks again,
    Nitsan
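
    A quick sketch to check the arithmetic above, assuming a cache-aligned base address and the 42-byte trade layout (the class is illustrative, not from the article):

    public final class CacheLineSplitCheck
    {
        private static final int CACHE_LINE = 64;
        private static final int RECORD_SIZE = 42;

        // {offset within record, size in bytes}: 2 longs, 2 ints, 2 longs, 1 char
        private static final int[][] FIELDS =
            {{0, 8}, {8, 8}, {16, 4}, {20, 4}, {24, 8}, {32, 8}, {40, 2}};

        public static void main(final String[] args)
        {
            for (int record = 0; record < 32; record++)
            {
                final long base = (long)record * RECORD_SIZE;
                for (final int[] field : FIELDS)
                {
                    final long first = base + field[0];
                    final long last = first + field[1] - 1;
                    if (first / CACHE_LINE != last / CACHE_LINE)
                    {
                        System.out.println("record " + record + ", field at offset " + field[0] + " spans two cache lines");
                    }
                }
            }
        }
    }

    For a cache-aligned base this reports, among others, record 1's field at offset 20: it occupies bytes 62 to 65, straddling the first 64-byte boundary.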

    Replies
    1. This is the reason why Java will align objects on 8-byte boundaries and then carefully organise the fields like I described in a previous blog on false sharing.

      1) You could get issues if the word you are using for coordination in a lock-free concurrent algorithm is split across cache lines, thus making loads and stores not atomic.

      2) You can also get performance issues reading fields that are split across cache lines, or even not aligned on word boundaries depending on processor.

      I did not want to make this article more complex, but when building systems I allocate my off-heap objects on word boundaries and do similar for their fields. Maybe I'll do a follow up with more detail on this.
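
      One minimal way to do that padding (the align helper is illustrative, not from the article): round the record size up to the next word multiple so that, given a word-aligned base address, naturally ordered fields never straddle a cache line.

      public final class WordAlign
      {
          // Round size up to the next multiple of alignment (a power of two)
          public static long align(final long size, final long alignment)
          {
              return (size + (alignment - 1)) & ~(alignment - 1);
          }

          public static void main(final String[] args)
          {
              System.out.println(align(42, 8)); // 48: the 42-byte trade record padded to a word multiple
          }
      }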

    2. 1) Would you not get similar issues of correctness on regular reads/writes?
      2) Would you not see performance issues for writes?
      I agree with the sentiment of keeping it simple, but as your blog is widely read, and highly regarded, I think many people would not be aware of the issue and would pretty much copy your example as-is. But perhaps I'm just projecting my ignorance on others :).

    3. Regular loads and stores will work as expected but may suffer performance issues when a word spans two cache lines and has to be assembled. Store costs are more likely to be hidden by the store buffer and write-combining buffers than load costs.

      OK you've tempted me to write a more focused blog on this subject :-)

    4. Cache alignment is only a performance issue. Never a correctness issue. Cache lines do not add or remove atomicity. Specifically, non-atomic updates within a single cache line can still be seen non-atomically by other threads.

      The Java spec only guarantees atomic treatment (as in the bits in the field are read and written all-or-nothing) for boolean, byte, char, short, int, and float field types. The larger types (long and double) *may* be read and written using two separate operations (although most JVMs will still perform these as atomic, all-or-nothing operations).

      The more complex atomic operations (e.g. AtomicLong.addAndGet()) that both read and write a memory field within a single atomic operation are guaranteed to provide atomicity regardless of memory layout.

      In practice, JVM implementations typically force fields to not cross cache line boundaries by simply aligning all fields to their field size. This is a common necessity since most architectures do not support unaligned types in memory, and do not support unaligned atomic memory operations.

      BTW, all x86 variants DO support both unaligned data types in memory, as well as LOCK operations on such types. This means that on x86, a LOCK operation that spans the boundary between two cache lines will still be atomic (this little bit of historical compatibility probably has modern x86 implementors cursing each time).
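
      To make the spec point concrete, a minimal illustration (a hypothetical class, not from the article): marking a long or double field volatile is the language-level way to get guaranteed all-or-nothing reads and writes.

      public final class Totals
      {
          long plainTotal;          // a plain long may legally be read/written as two 32-bit halves (JLS 17.7)
          volatile long safeTotal;  // volatile longs and doubles are guaranteed atomic loads and stores
      }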

    5. Ended up digging into the topic a bit (with Gil's help), here's the result: http://bit.ly/X5VrjF
      Feedback is welcome.
      Thanks,
      Nitsan

  19. Correction to the above: the allocation is only guaranteed to be 8-byte aligned; it may be a multiple of that.

  20. Would love a follow-up article addressing the design of the packing algorithm to prevent perf/correctness issues from non-ideal alignment. I am especially interested in other threads reading the results.

  21. Can anyone tell me why the offset is incremented by 4 instead of 8 like all the other long fields?

    private static final long instrumentCodeOffset = offset += 4;
    private static final long priceOffset = offset += 4;

    Replies
    1. Look at the JavaMemoryTrade class as a comparison and see if you can work it out? :-)
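
      For anyone who has compared the two classes and is still stuck, a sketch of the offset idiom (reconstructed for illustration, not copied verbatim from the article): each increment is the size of the field declared on the previous line, and venueCode and instrumentCode are 4-byte ints.

      public final class OffsetIdiom
      {
          // Each "offset +=" adds the size of the previous field, so each
          // constant ends up holding the starting offset of its own field.
          private static long offset = 0;
          private static final long tradeIdOffset = offset += 0;        // long at 0
          private static final long clientIdOffset = offset += 8;       // long at 8
          private static final long venueCodeOffset = offset += 8;      // int at 16
          private static final long instrumentCodeOffset = offset += 4; // int at 20: the previous field was a 4-byte int
          private static final long priceOffset = offset += 4;          // long at 24

          public static void main(final String[] args)
          {
              System.out.println(instrumentCodeOffset); // prints 20
          }
      }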

  22. Hi,

    I have significant concerns about these microbenchmarks and the use of sequential integers in the test data. In the past, I've seen test results skewed very significantly when that is done. I just did a test to verify that these tests also suffer from the same flaw.

    In each init() method I added a seeded random (to ensure the same values are being compared):

    java.util.Random r = new Random(NUM_RECORDS);

    and for each use of "i" in constructing the trade record, I replaced "i" with r.nextInt(NUM_RECORDS). Doing this had little impact on the traditional Java test (probably because GC predominates): the initial runs had been ~10 seconds on my box, and after the change they stayed at ~10 seconds, then rose to ~20+ seconds for the final two runs. However, the DirectMemoryTest had been around 920ms on my box, but after making the change the times jumped to 4.5 seconds. I think that 5-fold difference is highly significant and should be more carefully guarded against. The direct memory approach is still clearly faster, but the advantage was significantly eroded. I'd suggest that pseudo-random numbers are far more representative of many (most?) real world applications (certainly any trading system) than sequentially increasing numbers.

    Replies
    1. Your use of random values here will likely defeat the hardware prefetcher. For bulk operations you want the support of the hardware prefetcher to hide memory latency. I do this very deliberately. I've blogged about this in more detail.

      http://mechanical-sympathy.blogspot.co.uk/2012/08/memory-access-patterns-are-important.html

    2. Yes, I read that post. First, I think this isn't about hardware prefetching since we're not fetching trades based on the contents of previous trades. We're still iterating over memory (the trades themselves) sequentially. Note I am not randomly accessing the trades, simply setting the values on the trades to random numbers. The only use of random (in my modification) was in the init() method. e.g.:

      public static void init()
      {
          final long requiredHeap = NUM_RECORDS * DirectMemoryTrade.getObjectSize();
          java.util.Random r = new java.util.Random(NUM_RECORDS);
          address = unsafe.allocateMemory(requiredHeap);

          final byte[] londonStockExchange = {'X', 'L', 'O', 'N'};
          final int venueCode = pack(londonStockExchange);

          final byte[] billiton = {'B', 'H', 'P'};
          final int instrumentCode = pack(billiton);

          for (int i = 0; i < NUM_RECORDS; i++)
          {
              DirectMemoryTrade trade = get(i);

              trade.setTradeId(r.nextInt(NUM_RECORDS));
              trade.setClientId(1);
              trade.setVenueCode(venueCode);
              trade.setInstrumentCode(instrumentCode);

              trade.setPrice(r.nextInt(NUM_RECORDS));
              trade.setQuantity(r.nextInt(NUM_RECORDS));

              trade.setSide((r.nextInt(NUM_RECORDS) & 1) == 0 ? 'B' : 'S');
          }
      }


      The rest of the test is done as before.

      Micro-benchmarks become somewhat misleading if they allow optimizations to take place which are unrealistic. In this case, I would suspect HotSpot is the culprit, rather than prefetching.

    3. Sorry I misunderstood your code. It is hard reading code pasted into a comment :-)

      I took your example and profiled it. 79.2% of the total time is spent in Random.nextInt(). So basically you have increased the time just by generating random numbers, which is an expensive operation.

      As you said, "Micro-benchmarks become somewhat misleading...".
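
      One way to keep the random data while taking the generator out of the measurement is to pre-generate the values before the timed region. A sketch (the helper name is hypothetical):

      // Generate the values once, up front, so the timed loops measure
      // memory layout rather than the cost of Random.nextInt()
      static int[] preGenerate(final int numRecords)
      {
          final java.util.Random r = new java.util.Random(numRecords);
          final int[] values = new int[numRecords];
          for (int i = 0; i < numRecords; i++)
          {
              values[i] = r.nextInt(numRecords);
          }
          return values;
      }
      // then in init(): trade.setTradeId(values[i]); trade.setPrice(values[i]); etc.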

  23. If you're using Java 7 (yes, I know it's end of life), I found I had to enable tiered compilation using the -XX:+TieredCompilation switch to ensure the JIT compiler heavily optimised the setters and getters. Without this I was finding direct memory access using the flyweight pattern often 2-3x slower on reads than using a primitive array allocated in the old generation.

    Tiered compilation is enabled by default on Java 8.
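
    For example (the class name here is assumed):

    java -XX:+TieredCompilation TestDirectMemoryLayout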
