Wednesday 17 October 2012

Compact Off-Heap Structures/Tuples In Java

In my last post I detailed the performance implications of how your code accesses main memory.  Since then I've had a lot of questions about what can be done in Java to enable a more predictable memory layout.  There are patterns that can be applied using array-backed structures which I will discuss in another post.  This post will explore how to simulate a feature sorely missing in Java - arrays of structures similar to what C has to offer.

Structures are very useful, both on the stack and the heap.  To my knowledge it is not possible to simulate this feature on the Java stack.  Not being able to do this on the stack is such a shame because it greatly limits the performance of some parallel algorithms, however that is a rant for another day.

In Java, all user defined types have to exist on the heap.  The Java heap is managed by the garbage collector in the general case, however there is more to the wider heap in a Java process.  With the introduction of direct ByteBuffer, memory can be allocated that is not tracked by the garbage collector, so it can be made available to native code for tasks like avoiding the copying of data to and from the kernel for IO.  One method of managing structures is therefore to fake them within a ByteBuffer.  This can allow compact data representations, but has performance and size limitations.  For example, it is not possible to have a ByteBuffer greater than 2GB, and all access is bounds checked, which impacts performance.  An alternative exists using Unsafe that is both faster and not size constrained like ByteBuffer.
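
To make the ByteBuffer option concrete, below is a minimal sketch of faking a structure within a direct ByteBuffer using absolute indexed access.  The record layout here is hypothetical and chosen purely for illustration; it is not from the examples later in this post.

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class ByteBufferStructExample
{
    // Hypothetical 16-byte record: a long id followed by a long price
    private static final int ID_OFFSET = 0;
    private static final int PRICE_OFFSET = 8;
    private static final int RECORD_SIZE = 16;

    public static void main(final String[] args)
    {
        final int numRecords = 1000;
        final ByteBuffer buffer = ByteBuffer.allocateDirect(numRecords * RECORD_SIZE)
                                            .order(ByteOrder.nativeOrder());

        // Absolute puts and gets address the record in place; no object per record
        final int recordBase = 42 * RECORD_SIZE;
        buffer.putLong(recordBase + ID_OFFSET, 42L);
        buffer.putLong(recordBase + PRICE_OFFSET, 10000L);

        System.out.println("price = " + buffer.getLong(recordBase + PRICE_OFFSET));
    }
}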

The approach I'm about to detail is not traditional Java.  If your problem space is dealing with big data, or extreme performance, then there are benefits to be had.  If your data sets are small, and performance is not an issue, then run away now to avoid getting sucked into the dark arts of native memory management.

The benefits of the approach I'm about to detail are:
  1. Significantly improved performance 
  2. More compact data representation
  3. Ability to work with very large data sets while avoiding nasty GC pauses[1]
With all choices there are consequences.  By taking the approach detailed below you take responsibility for some of the memory management yourself.  Getting it wrong can lead to memory leaks, or worse, you can crash the JVM!  Proceed with caution...
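
One way to reduce the risk of leaks is to give each native allocation a single owner that is responsible for freeing it.  Below is a minimal sketch of that idea, assuming Unsafe is obtained via reflection as shown later in this post; the NativeBlock class is hypothetical and not part of the benchmark code.

import sun.misc.Unsafe;

public class NativeBlock implements AutoCloseable
{
    private final Unsafe unsafe;
    private long address;

    public NativeBlock(final Unsafe unsafe, final long sizeInBytes)
    {
        this.unsafe = unsafe;
        this.address = unsafe.allocateMemory(sizeInBytes);
    }

    public long address()
    {
        if (0 == address)
        {
            throw new IllegalStateException("block already freed");
        }

        return address;
    }

    public void close()
    {
        if (0 != address)
        {
            unsafe.freeMemory(address);
            address = 0; // guard against double free
        }
    }
}

With Java 7 try-with-resources, try (NativeBlock block = new NativeBlock(unsafe, size)) { ... } then guarantees freeMemory() is called exactly once, even if the work inside the block throws.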

Suitable Example - Trade Data

A common challenge faced in finance applications is capturing and working with very large volumes of order and trade data.  For the example I will create a large table of in-memory trade data that can have analysis queries run against it.  This table will be built using 2 contrasting approaches.  Firstly, I'll take the traditional Java approach of creating a large array referencing individual Trade objects.  Secondly, I'll keep the usage code identical but replace the large array and Trade objects with an off-heap array of structures that can be manipulated via the Flyweight pattern.

If I used some other data structure for the traditional Java approach, such as a Map or Tree, then the memory footprint would be even greater and the performance lower.

Traditional Java Approach
public class TestJavaMemoryLayout
{
    private static final int NUM_RECORDS = 50 * 1000 * 1000;

    private static JavaMemoryTrade[] trades;

    public static void main(final String[] args)
    {
        for (int i = 0; i < 5; i++)
        {
            System.gc();
            perfRun(i);
        }
    }

    private static void perfRun(final int runNum)
    {
        long start = System.currentTimeMillis();

        init();

        System.out.format("Memory %,d total, %,d free\n",
                          Runtime.getRuntime().totalMemory(),
                          Runtime.getRuntime().freeMemory());

        long buyCost = 0;
        long sellCost = 0;

        for (int i = 0; i < NUM_RECORDS; i++)
        {
            final JavaMemoryTrade trade = get(i);

            if (trade.getSide() == 'B')
            {
                buyCost += (trade.getPrice() * trade.getQuantity());
            }
            else
            {
                sellCost += (trade.getPrice() * trade.getQuantity());
            }
        }

        long duration = System.currentTimeMillis() - start;
        System.out.println(runNum + " - duration " + duration + "ms");
        System.out.println("buyCost = " + buyCost + " sellCost = " + sellCost);
    }

    private static JavaMemoryTrade get(final int index)
    {
        return trades[index];
    }

    public static void init()
    {
        trades = new JavaMemoryTrade[NUM_RECORDS];

        final byte[] londonStockExchange = {'X', 'L', 'O', 'N'};
        final int venueCode = pack(londonStockExchange);

        final byte[] billiton = {'B', 'H', 'P'};
        final int instrumentCode = pack( billiton);

        for (int i = 0; i < NUM_RECORDS; i++)
        {
            JavaMemoryTrade trade = new JavaMemoryTrade();
            trades[i] = trade;

            trade.setTradeId(i);
            trade.setClientId(1);
            trade.setVenueCode(venueCode);
            trade.setInstrumentCode(instrumentCode);

            trade.setPrice(i);
            trade.setQuantity(i);

            trade.setSide((i & 1) == 0 ? 'B' : 'S');
        }
    }

    private static int pack(final byte[] value)
    {
        int result = 0;
        switch (value.length)
        {
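            // Intentional fall-through: packs up to 4 ASCII bytes, big-endian, into one int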
            case 4:
                result |= (value[3]);
            case 3:
                result |= ((int)value[2] << 8);
            case 2:
                result |= ((int)value[1] << 16);
            case 1:
                result |= ((int)value[0] << 24);
                break;

            default:
                throw new IllegalArgumentException("Invalid array size");
        }

        return result;
    }

    private static class JavaMemoryTrade
    {
        private long tradeId;
        private long clientId;
        private int venueCode;
        private int instrumentCode;
        private long price;
        private long quantity;
        private char side;

        public long getTradeId()
        {
            return tradeId;
        }

        public void setTradeId(final long tradeId)
        {
            this.tradeId = tradeId;
        }

        public long getClientId()
        {
            return clientId;
        }

        public void setClientId(final long clientId)
        {
            this.clientId = clientId;
        }

        public int getVenueCode()
        {
            return venueCode;
        }

        public void setVenueCode(final int venueCode)
        {
            this.venueCode = venueCode;
        }

        public int getInstrumentCode()
        {
            return instrumentCode;
        }

        public void setInstrumentCode(final int instrumentCode)
        {
            this.instrumentCode = instrumentCode;
        }

        public long getPrice()
        {
            return price;
        }

        public void setPrice(final long price)
        {
            this.price = price;
        }

        public long getQuantity()
        {
            return quantity;
        }

        public void setQuantity(final long quantity)
        {
            this.quantity = quantity;
        }

        public char getSide()
        {
            return side;
        }

        public void setSide(final char side)
        {
            this.side = side;
        }
    }
}
Compact Off-Heap Structures
import sun.misc.Unsafe;

import java.lang.reflect.Field;

public class TestDirectMemoryLayout
{
    private static final Unsafe unsafe;
    static
    {
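        // Unsafe.getUnsafe() rejects callers not loaded by the bootstrap classloader, so obtain the singleton via reflection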
        try
        {
            Field field = Unsafe.class.getDeclaredField("theUnsafe");
            field.setAccessible(true);
            unsafe = (Unsafe)field.get(null);
        }
        catch (Exception e)
        {
            throw new RuntimeException(e);
        }
    }

    private static final int NUM_RECORDS = 50 * 1000 * 1000;

    private static long address;
    private static final DirectMemoryTrade flyweight = new DirectMemoryTrade();

    public static void main(final String[] args)
    {
        for (int i = 0; i < 5; i++)
        {
            System.gc();
            perfRun(i);
        }
    }

    private static void perfRun(final int runNum)
    {
        long start = System.currentTimeMillis();

        init();

        System.out.format("Memory %,d total, %,d free\n",
                          Runtime.getRuntime().totalMemory(),
                          Runtime.getRuntime().freeMemory());

        long buyCost = 0;
        long sellCost = 0;

        for (int i = 0; i < NUM_RECORDS; i++)
        {
            final DirectMemoryTrade trade = get(i);

            if (trade.getSide() == 'B')
            {
                buyCost += (trade.getPrice() * trade.getQuantity());
            }
            else
            {
                sellCost += (trade.getPrice() * trade.getQuantity());
            }
        }

        long duration = System.currentTimeMillis() - start;
        System.out.println(runNum + " - duration " + duration + "ms");
        System.out.println("buyCost = " + buyCost + " sellCost = " + sellCost);

        destroy();
    }

    private static DirectMemoryTrade get(final int index)
    {
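        // Reposition the shared flyweight over the record at this index; no per-record allocation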
        final long offset = address + (index * DirectMemoryTrade.getObjectSize());
        flyweight.setObjectOffset(offset);
        return flyweight;
    }

    public static void init()
    {
        final long requiredHeap = NUM_RECORDS * DirectMemoryTrade.getObjectSize();
        address = unsafe.allocateMemory(requiredHeap);

        final byte[] londonStockExchange = {'X', 'L', 'O', 'N'};
        final int venueCode = pack(londonStockExchange);

        final byte[] billiton = {'B', 'H', 'P'};
        final int instrumentCode = pack( billiton);

        for (int i = 0; i < NUM_RECORDS; i++)
        {
            DirectMemoryTrade trade = get(i);

            trade.setTradeId(i);
            trade.setClientId(1);
            trade.setVenueCode(venueCode);
            trade.setInstrumentCode(instrumentCode);

            trade.setPrice(i);
            trade.setQuantity(i);

            trade.setSide((i & 1) == 0 ? 'B' : 'S');
        }
    }

    private static void destroy()
    {
        unsafe.freeMemory(address);
    }

    private static int pack(final byte[] value)
    {
        int result = 0;
        switch (value.length)
        {
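            // Intentional fall-through: packs up to 4 ASCII bytes, big-endian, into one int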
            case 4:
                result |= (value[3]);
            case 3:
                result |= ((int)value[2] << 8);
            case 2:
                result |= ((int)value[1] << 16);
            case 1:
                result |= ((int)value[0] << 24);
                break;

            default:
                throw new IllegalArgumentException("Invalid array size");
        }

        return result;
    }

    private static class DirectMemoryTrade
    {
        private static long offset = 0;

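        // Each field offset is the running sum of the sizes of the preceding fields, giving a 42 byte record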
        private static final long tradeIdOffset = offset += 0;
        private static final long clientIdOffset = offset += 8;
        private static final long venueCodeOffset = offset += 8;
        private static final long instrumentCodeOffset = offset += 4;
        private static final long priceOffset = offset += 4;
        private static final long quantityOffset = offset += 8;
        private static final long sideOffset = offset += 8;

        private static final long objectSize = offset += 2;

        private long objectOffset;

        public static long getObjectSize()
        {
            return objectSize;
        }

        void setObjectOffset(final long objectOffset)
        {
            this.objectOffset = objectOffset;
        }

        public long getTradeId()
        {
            return unsafe.getLong(objectOffset + tradeIdOffset);
        }

        public void setTradeId(final long tradeId)
        {
            unsafe.putLong(objectOffset + tradeIdOffset, tradeId);
        }

        public long getClientId()
        {
            return unsafe.getLong(objectOffset + clientIdOffset);
        }

        public void setClientId(final long clientId)
        {
            unsafe.putLong(objectOffset + clientIdOffset, clientId);
        }

        public int getVenueCode()
        {
            return unsafe.getInt(objectOffset + venueCodeOffset);
        }

        public void setVenueCode(final int venueCode)
        {
            unsafe.putInt(objectOffset + venueCodeOffset, venueCode);
        }

        public int getInstrumentCode()
        {
            return unsafe.getInt(objectOffset + instrumentCodeOffset);
        }

        public void setInstrumentCode(final int instrumentCode)
        {
            unsafe.putInt(objectOffset + instrumentCodeOffset, instrumentCode);
        }

        public long getPrice()
        {
            return unsafe.getLong(objectOffset + priceOffset);
        }

        public void setPrice(final long price)
        {
            unsafe.putLong(objectOffset + priceOffset, price);
        }

        public long getQuantity()
        {
            return unsafe.getLong(objectOffset + quantityOffset);
        }

        public void setQuantity(final long quantity)
        {
            unsafe.putLong(objectOffset + quantityOffset, quantity);
        }

        public char getSide()
        {
            return unsafe.getChar(objectOffset + sideOffset);
        }

        public void setSide(final char side)
        {
            unsafe.putChar(objectOffset + sideOffset, side);
        }
    }
}
Results
Intel i7-860 @ 2.8GHz, 8GB RAM DDR3 1333MHz, 
Windows 7 64-bit, Java 1.7.0_07
=============================================
java -server -Xms4g -Xmx4g TestJavaMemoryLayout
Memory 4,116,054,016 total, 1,108,901,104 free
0 - duration 19334ms
Memory 4,116,054,016 total, 1,109,964,752 free
1 - duration 14295ms
Memory 4,116,054,016 total, 1,108,455,504 free
2 - duration 14272ms
Memory 3,817,799,680 total, 815,308,600 free
3 - duration 28358ms
Memory 3,817,799,680 total, 810,552,816 free
4 - duration 32487ms

java -server TestDirectMemoryLayout
Memory 128,647,168 total, 126,391,384 free
0 - duration 983ms
Memory 128,647,168 total, 126,992,160 free
1 - duration 958ms
Memory 128,647,168 total, 127,663,408 free
2 - duration 873ms
Memory 128,647,168 total, 127,663,408 free
3 - duration 886ms
Memory 128,647,168 total, 127,663,408 free
4 - duration 884ms

Intel i7-2760QM @ 2.40GHz, 8GB RAM DDR3 1600MHz, 
Linux 3.4.11 kernel 64-bit, Java 1.7.0_07
=================================================
java -server -Xms4g -Xmx4g TestJavaMemoryLayout
Memory 4,116,054,016 total, 1,108,912,960 free
0 - duration 12262ms
Memory 4,116,054,016 total, 1,109,962,832 free
1 - duration 9822ms
Memory 4,116,054,016 total, 1,108,458,720 free
2 - duration 10239ms
Memory 3,817,799,680 total, 815,307,640 free
3 - duration 21558ms
Memory 3,817,799,680 total, 810,551,856 free
4 - duration 23074ms

java -server TestDirectMemoryLayout 
Memory 123,994,112 total, 121,818,528 free
0 - duration 634ms
Memory 123,994,112 total, 122,455,944 free
1 - duration 619ms
Memory 123,994,112 total, 123,103,320 free
2 - duration 546ms
Memory 123,994,112 total, 123,103,320 free
3 - duration 547ms
Memory 123,994,112 total, 123,103,320 free
4 - duration 534ms
Analysis

Let's compare the results to the 3 benefits promised above.

1.  Significantly improved performance

The evidence here is pretty clear cut.  Using the off-heap structures approach is more than an order of magnitude faster.  At the most extreme, look at the 5th run on the Sandy Bridge processor, where we see a 43.2 times difference in duration to complete the task.  It is also a nice illustration of how well Sandy Bridge does with predictable access patterns to data.  Not only is the performance significantly better, it is also more consistent.  As the heap becomes fragmented, and thus access patterns become more random, the performance degrades, as can be seen in the later runs with the standard Java approach.

2.  More compact data representation

For our off-heap representation each record requires 42 bytes.  To store 50 million of these, as in the example, we require 2,100,000,000 bytes.  The memory required by the JVM heap is:

   memory required = total memory - free memory - base JVM needs 

     2,883,253,712 = 3,817,799,680 - 810,551,856 - 123,994,112

This implies the JVM needs ~40% more memory to represent the same data.  The reason for this overhead is the array of references to the Java objects plus the object headers.  In a previous post I discussed object layout in Java.
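
As a rough sanity check on that figure, assuming a 64-bit HotSpot JVM with compressed oops, a 12-byte object header, and 8-byte object alignment (these sizes are implementation details and can vary):

   per object = 12 (header) + 42 (fields) + padding = 56 bytes
   per record = 56 + 4 (compressed reference in the array) = 60 bytes
   50,000,000 records * 60 bytes = ~3,000,000,000 bytes

This is in the same ballpark as the ~2.9GB measured above; HotSpot can pack some fields into the header padding gap, so the measured figure comes out a little lower.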

When working with very large data sets this overhead can become a significant limiting factor.

3.  Ability to work with very large data sets while avoiding nasty GC pauses

The sample code above forces a GC cycle before each run, which can improve the consistency of the results in some cases.  Feel free to remove the call to System.gc() and observe the implications for yourself.  If you run the tests with the following command line arguments then the garbage collector will output in painful detail what happened.

-XX:+PrintGC -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution -XX:+PrintHeapAtGC -XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCApplicationStoppedTime -XX:+PrintSafepointStatistics

From analysing the output I can see the application underwent a total of 29 GC cycles.  The pause times below are extracted from the lines of output indicating when the application threads were stopped.

With System.gc() before each run
================================
Total time for which application threads were stopped: 0.0085280 seconds
Total time for which application threads were stopped: 0.7280530 seconds
Total time for which application threads were stopped: 8.1703460 seconds
Total time for which application threads were stopped: 5.6112210 seconds
Total time for which application threads were stopped: 1.2531370 seconds
Total time for which application threads were stopped: 7.6392250 seconds
Total time for which application threads were stopped: 5.7847050 seconds
Total time for which application threads were stopped: 1.3070470 seconds
Total time for which application threads were stopped: 8.2520880 seconds
Total time for which application threads were stopped: 6.0949910 seconds
Total time for which application threads were stopped: 1.3988480 seconds
Total time for which application threads were stopped: 8.1793240 seconds
Total time for which application threads were stopped: 6.4138720 seconds
Total time for which application threads were stopped: 4.4991670 seconds
Total time for which application threads were stopped: 4.5612290 seconds
Total time for which application threads were stopped: 0.3598490 seconds
Total time for which application threads were stopped: 0.7111000 seconds
Total time for which application threads were stopped: 1.4426750 seconds
Total time for which application threads were stopped: 1.5931500 seconds
Total time for which application threads were stopped: 10.9484920 seconds
Total time for which application threads were stopped: 7.0707230 seconds

Without System.gc() before each run
===================================
Test run times
0 - duration 12120ms
1 - duration 9439ms
2 - duration 9844ms
3 - duration 20933ms
4 - duration 23041ms

Total time for which application threads were stopped: 0.0170860 seconds
Total time for which application threads were stopped: 0.7915350 seconds
Total time for which application threads were stopped: 10.7153320 seconds
Total time for which application threads were stopped: 5.6234650 seconds
Total time for which application threads were stopped: 1.2689950 seconds
Total time for which application threads were stopped: 7.6238170 seconds
Total time for which application threads were stopped: 6.0114540 seconds
Total time for which application threads were stopped: 1.2990070 seconds
Total time for which application threads were stopped: 7.9918480 seconds
Total time for which application threads were stopped: 5.9997920 seconds
Total time for which application threads were stopped: 1.3430040 seconds
Total time for which application threads were stopped: 8.0759940 seconds
Total time for which application threads were stopped: 6.3980610 seconds
Total time for which application threads were stopped: 4.5572100 seconds
Total time for which application threads were stopped: 4.6193830 seconds
Total time for which application threads were stopped: 0.3877930 seconds
Total time for which application threads were stopped: 0.7429270 seconds
Total time for which application threads were stopped: 1.5248070 seconds
Total time for which application threads were stopped: 1.5312130 seconds
Total time for which application threads were stopped: 10.9120250 seconds
Total time for which application threads were stopped: 7.3528590 seconds

It can be seen from the output that a significant proportion of the time is spent in the garbage collector.  When your threads are stopped your application is not responsive.  These tests have been done with default GC settings.  It is possible to tune the GC for better results, but this can be a highly skilled and significant effort.  The only JVM I know of that copes well, by not imposing long pause times even under high-throughput conditions, is the Azul concurrent compacting collector.

When profiling this application, I can see that the majority of the time is spent allocating the objects and promoting them to the old generation because they do not fit in the young generation.  The initialisation costs could be removed from the timing, but that would not be realistic.  If the traditional Java approach is taken, the state needs to be built up before the query can take place, so the end user of an application has to wait for the state to be built up and the query executed.

This test is really quite trivial.  Imagine working with similar data sets but at the 100 GB scale.

Note: when the garbage collector compacts a region, objects that were next to each other can be moved far apart.  This can result in TLB and other cache misses.

Side Note On Serialization

A huge benefit of using off-heap structures in this manner is how easily they can be serialised to network, or storage, by a simple memory copy, as I have shown in the previous post.  This way we can completely bypass intermediate buffer and object allocation.
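
As an illustration, a bulk snapshot of a range of records might look like the sketch below.  It assumes the unsafe and address fields from TestDirectMemoryLayout above; the snapshot method itself is hypothetical, and the raw layout (field offsets, byte order) must of course match wherever the bytes are later read.

    // Copy recordCount records, starting at fromRecord, into a byte[] with a
    // single bulk copy - no per-field marshalling and no intermediate objects
    public static byte[] snapshot(final int fromRecord, final int recordCount)
    {
        final long recordSize = DirectMemoryTrade.getObjectSize();
        final byte[] dst = new byte[(int)(recordCount * recordSize)];

        unsafe.copyMemory(null, address + (fromRecord * recordSize),
                          dst, unsafe.arrayBaseOffset(byte[].class),
                          dst.length);

        return dst;
    }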

Conclusion

If you are willing to do some C-style programming for large datasets, it is possible to control the memory layout in Java by going off-heap.  If you do, the benefits in performance, compactness, and avoiding GC issues are significant.  However, this is an approach that should not be used for all applications.  Its benefits are only noticeable for very large datasets, or at the extremes of performance in throughput and/or latency.

I hope the Java community can collectively realise the importance of supporting structures both on the heap and the stack.  John Rose has done some excellent work in this area, defining how tuples could be added to the JVM.  His talk on Arrays 2.0 from this year's JVM Language Summit is really worth a watch.  John discusses options for arrays of structures, and structures of arrays, in his talk.  If tuples, as proposed by John, were available then the test described here could have comparable performance and a more pleasant programming style.  The whole array of structures could be allocated in a single action, thus bypassing the copy of individual objects across generations, and it would be stored in a compact contiguous fashion.  This would remove the significant GC issues for this class of problem.

Lately I have been comparing standard data structures between Java and .Net.  In some cases I observed a 6-10X performance advantage to .Net for things like maps and dictionaries when .Net used native structure support.  Let's get this into Java as soon as possible!

It is also pretty obvious from the results that if we are to use Java for real-time analysis on big data, then our standard garbage collectors need to significantly improve and support true concurrent operations.

[1] - To my knowledge the only JVM that deals well with very large heaps is Azul Zing

80 comments:

  1. Very interesting post. One question, though: have you tried implement same flyweight on the top of plain java byte[] instead of direct memory?

    Replies
    1. With a plain byte[] I'd expect it to be slower on access because of the bounds checking and the byte-to-primitive conversions that need to constantly happen. A native order ByteBuffer is a better approach, but still slower than using Unsafe. Being limited to 2GB for a normal byte[] is not exactly what I would call big data :-) You would need to first select one of many 2GB arrays into which you address, thus adding another level of indirection when dealing with a large heap.

    2. This comment has been removed by the author.

  2. Also, I can see a dramatic change in results if I:
    1) move init() out of the measured section.
    2) do .gc() and .sleep() several times before measuring to force java objects to be moved into the old gen.

    Replies
    1. Yes, I pointed out that GC is one of the biggest costs when you run a profiler on this type of problem. Real world apps have to allocate the objects, and for big data applications they will not all fit in the young generation, thus they must be promoted.

      To be fair both approaches have to allocate the memory and initialise it to be like-for-like. A bunch of forced GCs and sleeps is not real world behaviour.

      The world is very interesting when you measure and profile :-) Thanks for the feedback.

    2. Yes, real world apps allocate memory. But they do it the same way for pojo and direct backed storage. So, if you pretend to allocate once and in big chunks -- you can get the benefit from direct memory, but you also get the benefit of pre-allocating the array and entries in pojo-style. Or, if you pretend your app allocates objects incrementally -- you'll have much trouble with unsafe.allocate on many small chunks; I would bet it is much slower than the usual java thread-local-buffer based allocation.

      By the way, about java-array-backed storage: I've just done benchmarks with a long[] backed "trade array", and it is very close to the direct-buffer one. Yes, the world is a very interesting place when you conjecture and verify :)

    3. ...actually, long[] based storage is even faster than the Unsafe based one -- on my macbook air, at least. 70M records, 6GB heap, jdk 1.6:

      long[] 236+/- 95 ms
      direct 287+/- 113 ms
      pojo 380+/- 160 ms

    4. For the long[] approach do you use a slot for each variable or do you pack when smaller? For example, 2 ints pack into a long. Can you post a link to code showing the comparison? I'm curious to know what approach is taken and whether it will work for the full range of data types.

    5. Here are the benchmarks (my own, based on your idea).

      I've packed two ints into 1 long, yes. For the last char I use a full long cell. I'm also curious about serialization/deserialization of the different primitive types: your example was quite simple since it was long-dominated.

    6. Your "long[]" approach works well if your data is mostly longs. If you have a lot of smaller primitives then the packing costs will start to dominate. Unsafe works well when you have the full range of data types and also has the advantage of supporting volatile and CAS operations. With large data sets you want to pack them to get the most out of memory.

      The big advantage with any of the large array/buffer approaches is they are directly allocated in the old gen and avoid the generational copy cost.

    7. Well, generally I agree. But there are also some tricks to use: you are not bound to longs, you could choose the primitive by your data's dominant field type. I think it is not a very complex task to make some fitness-function which weighs the cost of packing/unpacking against the cost of storing/loading data in smaller chunks than the native hardware word, and so gives you the best storage type for a given fields-and-types set.

      On the other side, I would rely on JIT support. I have not examined the assembly yet, but couldn't the JIT optimize pack-and-store/load-and-unpack by using wider/narrower load/store instructions? It seems like a not very complex heuristic.

      About the old gen -- well, do you actually see copying? If you allocate a 16GB long[] it does not fit into the young gen at all -- it will be allocated in the old gen from the beginning, afaik

    8. As I said in my previous comment, the large array *is* allocated directly in the old gen, thus avoiding the copy. With the POJO approach all those individual objects need to be copied a number of times in the promotion via the survivor spaces.

    9. Yes, I've missed that point.

      By the way, my point is that you can get most of C-like structure performance and compactness with plain java arrays (although it will be an unusual java style), without black magic. This is much safer, and it may be even faster, or at least very close to direct memory. It seems like direct memory gives you real benefit only in a few corner cases.

    10. You are right that things can be very close with standard arrays. It is safer and easier to debug. I'd only recommend the direct approach if you really really need to :-) In fact I often recommend standard arrays with flyweights for column-oriented storage. Column-oriented arrays make excellent indices for searching.

      The cases where direct memory has real benefits are:

      1. Your memory needs are greater than 2-8GB depending on data types.
      2. You want to directly serialise by copying to input or output buffers without conversion.
      3. You are performing concurrent operations on fields and need volatile and/or Compare and Swap semantics. This is very common in uber large memory applications.
      4. You use a wide range of data types, especially Strings as byte or char arrays.
      5. The address you use is to a memory-mapped file that you are using for storage, therefore gaining transparent persistence.

      Thanks for the interaction and feedback! Hopefully those reading are learning from the comments.

    11. "If you have a lot of smaller primitives then the packing costs will start to dominate", shouldn't JIT be good for packing/unpacking, they are simple multiple and/or/plus/shift operations, in addition, Intel core has several unit to compute in pipeline, shouldn't pack/unpack cost be very small? if not, maybe JIT compiler can be improved on this regard.

    12. Pack and unpack operations will be relatively cheap compared to a cache miss. Have you considered that lots of other work needs to be done as part of the computation, and that it is better not to burn all the execution units unnecessarily when we have alternatives?

      I don't understand the JIT reference. The JIT cannot turn these into packed structures because it would not be using the primitive array as specified. It also cannot make work magically go away. Am I missing something in this point?

    13. http://hg.openjdk.java.net/jdk7/jdk7/hotspot/file/tip/src/cpu/x86/vm/templateTable_x86_64.cpp
      shows that if unpack/pack use OR/AND/SHIFT/ADD it will be cheap. However, to access a byte at a certain offset of the array - the "[index]" - I hope it is not using this (the baload opcode); would it involve the multiply by HeapWordSize in base_offset_in_bytes? upsetting.

      678 void TemplateTable::baload() {
      679 transition(itos, itos);
      680 __ pop_ptr(rdx);
      681 // eax: index
      682 // rdx: array
      683 index_check(rdx, rax); // kills rbx
      684 __ load_signed_byte(rax,
      685 Address(rdx, rax,
      686 Address::times_1,
      687 arrayOopDesc::base_offset_in_bytes(T_BYTE)));
      688 }

      http://hg.openjdk.java.net/jdk6/jdk6/hotspot/diff/ba764ed4b6f2/src/share/vm/oops/arrayOop.hpp
      + // Returns the offset of the first element.
      + static int base_offset_in_bytes(BasicType type) {
      + return header_size(type) * HeapWordSize;
      + }

    14. We are expanding on the scope of the post, but that is OK. On a real project I would byte-align the beginning of each structure so addressing can be done with shifts. This is also necessary to ensure the words for concurrent access are aligned, just as Java object fields are.

      I'm working with some JVM implementers on a new intrinsic that treats a class like the following as if it is a real array of structures.

      https://github.com/mjpt777/examples/blob/master/src/java/uk/co/real_logic/intrinsics/StructuredArray.java

    15. Agreed, variable length data such as a char array can be 0 padded to an array length of a power of 2; base 10 is for humans, not memory (need sufficient cache size? it will be hard to decide 64 -> 128, ouch).
      "return partitions[partitionIndex][partitionOffset];"
      Still, there has to be magic behind the getter in the above line, right? I guess this JVM implementation is not oracle hotspot :)
      Would these intrinsics also be efficient in terms of wiping the region clean and copying to other instances?

    16. When using 64-bit addressing the "magic" behind the getter is not very difficult as an intrinsic. The issue with the above is the 32-bit index addressing limitation on Java arrays. This class specifies the behaviour, and not the implementation, for the intrinsic.

      This can be made very efficient for copy and reset operations with contiguous memory layout.

      I know of a number of JVM implementers who are looking at memory layout with structures, arrays, and object co-location. It would be great to see this work become generally available.

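      A minimal sketch of the long[]-backed flyweight discussed in this thread, with the two int codes packed into one long slot. This is illustrative only, and not the commenter's actual benchmark code:

      public class LongArrayTradeFlyweight
      {
          private static final int LONGS_PER_RECORD = 2;

          private final long[] store;
          private int base;

          public LongArrayTradeFlyweight(final int numRecords)
          {
              store = new long[numRecords * LONGS_PER_RECORD];
          }

          public LongArrayTradeFlyweight index(final int recordIndex)
          {
              base = recordIndex * LONGS_PER_RECORD; // reposition the flyweight
              return this;
          }

          public void setPrice(final long price)
          {
              store[base] = price;
          }

          public long getPrice()
          {
              return store[base];
          }

          public void setCodes(final int venueCode, final int instrumentCode)
          {
              // pack two 32-bit ints into one 64-bit slot
              store[base + 1] = ((long)venueCode << 32) | (instrumentCode & 0xFFFFFFFFL);
          }

          public int getVenueCode()
          {
              return (int)(store[base + 1] >>> 32);
          }

          public int getInstrumentCode()
          {
              return (int)store[base + 1];
          }
      }
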
  3. The major difference between the two implementations is the time it takes to init the array. Doing the actual trades is comparable, with direct being slightly faster.

    But if you pool your memory array by creating the JavaMemoryTrade objects before-hand, the memory example is actually faster than direct.

    public static void main(final String[] args) {
        trades = new JavaMemoryTrade[NUM_RECORDS];

        for (int i = 0; i < NUM_RECORDS; i++) {
            trades[i] = new JavaMemoryTrade();
        }

        for (int i = 0; i < 5; i++) {
            // System.gc();
            perfRun(i);
        }
    }

    Replies
    1. Are you reusing the same memory for the direct example to be a fair comparison? When I do, the direct is still faster by a reasonable amount.

  4. I feel the same way. The JVM works best if you only use the heap for scratch space and very long lived immutable state. Don't let anything make it out of the young generation and, if you can, avoid copying anything to the survivor spaces as well.

    I wish someone would implement a red-black tree or hash table based on this design.

  5. Ok, I allocated the direct memory in the main and it's faster.

    I had to drop records to 5,000,000 due to ram limits.

    I get mem duration ~105ms, direct ~92ms.

    Replies
    1. You should test with a larger memory machine. It gets more interesting then. With increased volume the TLB and other caches get put under greater stress. At 10X memory compared to you, I'm seeing ~70% delta on just the iteration.

      The real killer comes in with real world applications where the heap is fragmented and those objects are all over the place. The direct memory will always be together. I'd need to write a lot more code to show that in action. No other objects are allocated in this test to mix up the heap.

  6. Any idea why you saw generally a 30% improvement on Linux vs Windows? Is this just Sandy Bridge vs Nehalem? Or does the OS contribute significantly?

    Replies
    1. All memory management is happening inside the JVM, the min and max sizes are set at startup. In my experience it is the processor rather than the OS making the difference here. Sandy Bridge has an improved cache and memory ordering buffer compared to Nehalem, and it will get a lot better again in Haswell.

  7. This comment has been removed by the author.

  8. Very interesting! Have you looked at the HugeCollections library (http://code.google.com/p/vanilla-java/wiki/HugeCollections)? What do you think of it? It seems to provide similar functionality. The main difference is that it is column-oriented, as opposed to your example, which is row-oriented.

    Replies
    1. I've not tried HugeCollections personally. Column-oriented is great for some classes of problem and row-oriented is great for others. I use whichever is most appropriate for the problem I need to solve. Neither is perfect for all scenarios.

    2. It would be interesting to see if there is any big difference between the two approaches. Is anyone willing to try it out?

  9. Hello,

    This is quite wonderful code. I tested some things out.

    1. Took your code.
    2. Changed the 5 reps to NUM_REPS, where NUM_REPS == 500.
    3. Removed the System.gc() call in the main loop, only to see how this might run in a non-testing environment.
    4. NUM_RECORDS = 3 * 1000 * 1000

    DirectMemory has less variation (i.e. coefficient of variation)

    See http://screencast.com/t/mGBmoqOyETcG.

    a) The top panel is JavaMemoryLayout, the bottom panel is DirectMemoryLayout.
    b) Dropped first 10 observations for both
    c) the y-axis is log(runtime, 10)
    d) these are 490 observations plotted in order of occurrence

    Very very impressive.
    Though could you explain the cycles in the DirectMemoryLayout?

    Replies
    1. This comment has been removed by the author.

    2. I updated to a 1000 points and added a red loess smoother
      See http://screencast.com/t/EK6xuriF

      So my last question,
      - notice for TestJava, the red trend is straight, though for TestDirect (bottom panel), it appears to be increasing.
      Could the OS be affecting this?

      Thanks
      Saptarshi

    3. Is the y-axis time to complete the run? If it is, the issue might be that the OS memory is being fragmented as the process constantly grows and shrinks in the direct case. The plain Java version has a constant memory requirement.

      Try taking the memory allocation out of the loop in the direct case, which is more like a real application, and see what it does.

  10. Why would there be any heap fragmentation? You seem to be allocating the entire array and its contents at once and then deleting it for the next test.

    Is it because the young gen is not big enough for this test?

    Replies
    1. Yes, most data will be promoted to the old generation.

  11. Is this not similar to how BigMemory of Terracotta works, though I think they use a memory mapped file (through MappedByteBuffer)?

    Replies
    1. I believe many "Big Memory" solutions and KV stores use similar techniques with ByteBuffers that are either heap based or memory-mapped files. If structures/tuples are added to the language, like John Rose has described, then none of these workarounds would be necessary.

    2. One thing that strikes me right now is that we could use Unsafe memory to alleviate some full GC pain.
      Assume that we have something like a Disruptor that we need to provide with byte buckets (byte arrays of say 1500 bytes). If we allocate them from the heap, they will reside in the old gen but will never be released, just increasing the live data set of the old gen. OTOH, if we allocate those from Unsafe memory, they will not clutter the old gen, potentially lowering full GC times. I know I definitely could use something like that right now as the full GCs don't release all that much memory.

    3. Going off-heap is very common in low-latency applications to avoid the GC overhead. This is especially important if you need to be fast right out the gate.

      Since byte buffers don't contain object references there is no card marking and thus limited GC interaction once promoted. The majority of the cost will come in allocation and promotion to the old generation.

    4. Is it a good idea to use this technique for the RingBuffer? How would you handle messages larger than 1500 bytes?

    5. Yes this works very effectively. Large messages can be handled by chaining blocks or using variable sized records.

    6. I have multiple publishers... Is it still possible to use chaining blocks or variable sized records in that case? I cannot figure out where to put the pointer to the next block. If by block you mean RingBuffer Event, at translate time we haven't claimed the next seqNum yet, but if you mean DirectMemoryObj, the get() function only works reliably for multiple producers if the blocks are fixed length. Am I missing something?

    7. I think I got it now. I was relying on the DSL API too much. I could just use the RingBuffer directly and have a translator with a different API. Something like this:

      // assume we worked out that the message fits into 2 events
      long seq1 = ringBuffer.next();
      long seq2 = ringBuffer.next();
      try {
          translator.translateChunk(ringBuffer.get(seq1), seq1, seq2);
          translator.translateChunk(ringBuffer.get(seq2), seq2, -1L);
      } finally {
          ringBuffer.publish(seq1);
          ringBuffer.publish(seq2);
      }


    8. I would not go with the Disruptor for this type of solution. Best to create a new structure that does exactly what you want.

  12. The most dramatic speed and memory usage improvement comes when you compare standard Java data structures with Unsafe-based ones: up to 8-10 times less memory usage, and zero GC on 50-100GB heaps (off-heaps, of course).

    Replies
    1. There is one question on the Unsafe.allocateMemory/freeMemory implementation. Is it the platform malloc/free or an interface to a JVM malloc/free? The platform one (default Linux glibc, for example) is prone to significant memory fragmentation and is not suitable for long running server applications. Many native Linux (and I suppose Windows) applications use alternative memory allocators (custom, tcmalloc, jemalloc etc).

    2. The implementation of Unsafe.allocateMemory will likely be JVM specific and therefore you cannot rely on an underlying implementation. Best to allocate large contiguous chunks and manage it yourself. I would not use this for object level allocations.

  13. How about granting huge memory to the young generation of the GC with the -XX:NewSize JVM option (e.g. -XX:NewSize=4g, -Xms6g, -Xmx6g, -XX:+UseConcMarkSweepGC, ...)?

    My result (without System.gc()):

    0 - duration 1229ms
    1 - duration 1559ms
    2 - duration 1699ms
    3 - duration 2549ms
    4 - duration 3159ms


    Total time for which application threads were stopped: 0.8967600 seconds
    Total time for which application threads were stopped: 1.0796190 seconds
    Total time for which application threads were stopped: 1.6817910 seconds
    Total time for which application threads were stopped: 0.2799290 seconds
    Total time for which application threads were stopped: 2.5433730 seconds

    Replies
    1. That is one way to avoid the generational copy issues :-) GC can be played with for a microbenchmark. The real challenge is how to do this well, without lots of tricks, as part of a real application. The only true answer is to add support for arrays of structs to Java, which cannot come soon enough.

  14. "To my knowledge it is not possible to simulate this feature on the Java stack." Being pedantic for a moment...

    You can create pseudo-structures on the stack using primitives.

    It's just that it's not terribly useful in the Java world. Kind of like trying to model functional programming in Java -- you can do it, but it's clunky and painful.

    Here's an ugly example of using the stack for structures in Java. Instead of defining a "car" structure, we use primitive parameters to simulate passing a car structure on the stack from one procedure to the next.

    This example is somewhat modeled after creating a temporary structure on the stack in C, doing some work with it, then throwing it away.

    public class CarTest
    {
        public static void main ( String [] args )
        {
            CarCounter counter = new CarCounter ( null );
            CarInitializer initializer = new CarInitializer ( counter );
            initializer.invoke ( "Ford Taurus", 0 );
            initializer.invoke ( "Honda Civiv", 0 );
            initializer.invoke ( "Toyota Corolla", 0 );
            System.out.println ( "Total number of cars: " + counter.numCars () );
            System.out.println ( "Total number of wheels: " + counter.numWheels () );
        }
    }


    public class CarInitializer
        extends CarProcedure
    {
        public CarInitializer ( CarProcedure nested_procedure )
        {
            super ( nested_procedure );
        }

        public void invoke ( String model, int wheels )
        {
            wheels = 4;
            CarProcedure nested_procedure = this.nestedProcedure ();
            if ( nested_procedure != null )
            {
                nested_procedure.invoke ( model, wheels );
            }
        }
    }


    public class CarCounter
        extends CarProcedure
    {
        private int numCars = 0;
        private int numWheels = 0;

        public CarCounter ( CarProcedure nested_procedure )
        {
            super ( nested_procedure );
        }

        public void invoke ( String model, int wheels )
        {
            numCars ++;
            numWheels += wheels;

            CarProcedure nested_procedure = this.nestedProcedure ();
            if ( nested_procedure != null )
            {
                nested_procedure.invoke ( model, wheels );
            }
        }

        public int numCars ()
        {
            return numCars;
        }

        public int numWheels ()
        {
            return numWheels;
        }
    }


    public abstract class CarProcedure
    {
        private final CarProcedure nestedProcedure;

        public CarProcedure ( CarProcedure nested_procedure )
        {
            this.nestedProcedure = nested_procedure;
        }

        public CarProcedure nestedProcedure ()
        {
            return this.nestedProcedure;
        }

        public abstract void invoke ( String model, int wheels );
    }



    There are other similar ways of tying your data to the stack, but ultimately none of them are pleasant or useful.

    Cheers,

    Johann Tienhaara

    Replies
    1. I'm either missing your point or you are missing mine :-)

      By "on the stack" I mean the actual stack that will be generated in native code after the optimiser has run. No heap objects at all so no GC interaction, and the structs/objects are *in* the stack frame for memory locality. Your pseudo objects above are on the heap away from the stack frame and need to be GC'ed. Real stack based data is key to the scalability of many true parallel algorithms to avoid contention.

      Escape analysis promised much here and delivered little.

    2. Cool, I'm still allowed to reply after nearly 4 years!

      Sorry I'm a little late noticing your reply. :)

      I should have specified: the "structure" I referred to is precisely what's on the stack in the method calls:

      initializer.invoke ( "Ford Taurus", 0 );
      initializer.invoke ( "Honda Civiv", 0 );
      initializer.invoke ( "Toyota Corolla", 0 );

      Three instances of a structure that looks something like:

      struct {
      String model;
      int wheels;
      }

      Where the String model is initialized at the outset, but the int wheels is initialized inside a method call, and both values are lost once the method call is complete.

      There should be no heap space allocated for a Ford Taurus with 4 wheels, or a Honda "Civiv" with 4 wheels, and so on. To the best of my knowledge, every Java compiler puts constants right on the stack. Certainly the class files I've looked at in the past do not allocate heap space for primitive or String constants.

      I believe the heap space you referred to is all the single-instance classes which *do something with* the structure. You could have 18 billion structure instances defined, none of them on the heap, and the various CarInitializer etc classes would all still only be instantiated exactly once each.

      Anyway this was certainly a bit of pedantry, as I pointed out in my original post. :) Anyone who tries to use "parameter passing down the stack" as stack-based structures is likely to 1) run into the limitations in expressiveness of this approach and 2) be shot by a co-worker.

      If you still see a flaw in the concept, I'd certainly be keen to hear about it (even after all these years!).

      Cheers Martin,

      Johann

    3. I'm struggling to understand what you mean by this. Maybe not enough coffee yet this morning :) Please post a full project on GitHub so the idea can be better understood and evaluated.

  15. Hi Martin,

    firstly I would like to thank you for your blog and the excellent posts you put there. I wanted to use the Unsafe approach in my project, but I encountered one issue which I can't resolve. I'm trying to create a byte array, which I then copy to off-heap memory, and get a pointer to it. Everything works fine with 32bit java, but with 64bit java I'm getting a JVM crash :(. Do you have any explanation as to why the simple test program below is crashing on 64bit, but not on 32bit? Thanks a lot for your help!

    Peter

    public class UnsafePointerTest {

        public static class Pointer {
            public static final String fieldName = "object";

            Object object;

            public Object getObject() {
                return object;
            }
        }

        private static final Unsafe unsafe;
        private static final long byteArrayOffset;

        static {
            try {
                Field field = Unsafe.class.getDeclaredField("theUnsafe");
                field.setAccessible(true);
                unsafe = (Unsafe)field.get(null);
                byteArrayOffset = unsafe.arrayBaseOffset(byte[].class);
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }

        public static void main(String[] args) throws NoSuchFieldException, SecurityException {
            int arraySize = 100;
            long address = unsafe.allocateMemory(arraySize + byteArrayOffset);
            byte[] array = new byte[arraySize];
            copyToOffHeap(array, address);
            Pointer pointer = getPointer(address);
            System.out.println(((byte[])pointer.getObject()).length); // crash on 64bit!!
        }

        static Pointer getPointer(long address) throws NoSuchFieldException, SecurityException {
            Pointer pointer = new Pointer();
            long pointerOffset = unsafe.objectFieldOffset(Pointer.class.getDeclaredField(Pointer.fieldName));
            unsafe.putLong(pointer, pointerOffset, address);
            return pointer;
        }

        static void copyToOffHeap(byte[] array, long address) {
            long size = array.length + byteArrayOffset;
            unsafe.copyMemory(array, 0, null, address, size);
        }
    }

    Replies
    1. You cannot safely treat off-heap memory like real objects. You must use Unsafe or direct buffers to read and write the bytes as primitives, and not as real objects.

      I suspect this is working on 32-bit because the address is real. On 64-bit the JVM uses compressed object references that are treated as special offsets, based on byte alignment, rather than real addresses. This *may* work on 32GB+ heaps but is still wrong.

      Basically your "Pointer" class is evil ;-)

    2. Hi Martin,

      thank you very much for your reply. If I disable compressed oops with -XX:-UseCompressedOops option everything works fine on 64bit too. Do you still think that I should not use the "Pointer" class at all?

      The reason why I use it is the following scenario:

      The application is receiving messages from the middleware. The middleware supports writing the messages directly into a byte array at a given offset - something like int recv(byte[] bytearray, int offset). So with the "Pointer" class I can pass the off-heap memory as a byte array to the recv method and the middleware will write directly to off-heap memory - instead of writing to a temporary managed byte array and then copying this array to off-heap memory through Unsafe.

      What do you think about this approach? I think the only question is what happens if the "Pointer" instance is garbage collected - but this should not be an issue if I prevent it.

      Thanks a lot.

      P.

    3. I still think this is evil :-/ The garbage collector will try to mark this pointer as reachable when it clearly is not. You have just deferred the crash to some random point in the future. Going off-heap is a "handle with care" situation. Please please please do not do things to deceive the runtime. It will only end in tears.

      Why don't you allow the middleware to write off-heap then use a flyweight to access the data? Or simply use a direct ByteBuffer?

    4. I agree that letting the GC mark the pointer as reachable is wrong, but if I write the code so that I always keep a reference to the "Pointer" instance until the application is terminated then it should not be a big issue. But I have to agree that the whole "Pointer" approach is evil :).

      Btw I can't allow the middleware to write to off-heap memory directly because it is not internally developed middleware - we are using zeromq (http://www.zeromq.org/).

    5. The issue with the GC marking a pointer as reachable is that many VMs do this by updating a state table at some given offset from the current page containing the object. It may also try to move the fake object during compaction. Off-heap memory must be treated as raw bytes and cannot safely be treated as real Java objects.

  16. Have you, or anyone else, written up anything about how LMAX deals with message reliability? I am hunting around for something about this. Perhaps another trade secret? :)

    Replies
    1. I adapted a number of reliable messaging techniques that go back many years and are common in older TP systems. I can offer consultancy in this area.

  17. We are organising training spending for next year at my work. I would like to attend your concurrency course if you are running it. I had a look at the instil site for the course and it is listed as TBC, so I am registering my interest here to encourage you to run it :)

  18. Having recently been advised by yourself (many thanks) to beware of cache line alignment issues, I wondered how you would reason about alignment in the above case (assuming a 64-byte cache line):
    1) the trade structure size is 42
    2) unsafe.allocateMemory will return an address that "will never be zero, and will be
    aligned for all value types." --> will be 16-byte aligned, from what I can find.
    This means your address can start in one of 4 locations on a cache line (0, 16, 32, 48), resulting in a variety of conditions in which one of the fields is split across 2 cache lines, e.g. if the address is cache aligned then the second record has its 4th field split between line one and line 2.
    I understand this can result in loss of atomicity of updates to the field, meaning a half formed field could be visible on write. I assume this can be corrected by padding the structure such that this cannot happen (to a size which is a multiple of 16).
    Am I correct in my reasoning so far? If you agree that the issue exists, how would it manifest (perf issue? correctness issue?)
    Thanks again,
    Nitsan
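
    A quick sketch to check the arithmetic above, assuming a cache-aligned base address and the 42-byte trade layout (the class is illustrative, not from the article):

    public final class CacheLineSplitCheck
    {
        private static final int CACHE_LINE = 64;
        private static final int RECORD_SIZE = 42;

        // {offset within record, size in bytes}: 2 longs, 2 ints, 2 longs, 1 char
        private static final int[][] FIELDS =
            {{0, 8}, {8, 8}, {16, 4}, {20, 4}, {24, 8}, {32, 8}, {40, 2}};

        public static void main(final String[] args)
        {
            for (int record = 0; record < 32; record++)
            {
                final long base = (long)record * RECORD_SIZE;
                for (final int[] field : FIELDS)
                {
                    final long first = base + field[0];
                    final long last = first + field[1] - 1;
                    if (first / CACHE_LINE != last / CACHE_LINE)
                    {
                        System.out.println("record " + record + ", field at offset " + field[0] + " spans two cache lines");
                    }
                }
            }
        }
    }

    For a cache-aligned base this reports, among others, record 1's field at offset 20: it occupies bytes 62 to 65, straddling the first 64-byte boundary.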

    Replies
    1. This is the reason why Java will align objects on 8-byte boundaries and then carefully organise the fields like I described in a previous blog on false sharing.

      1) You could get issues if the word you are using for coordination in a lock-free concurrent algorithm is split across cache lines, thus making loads and stores not atomic.

      2) You can also get performance issues reading fields that are split across cache lines, or even not aligned on word boundaries depending on processor.

      I did not want to make this article more complex, but when building systems I allocate my off-heap objects on word boundaries and do similar for their fields. Maybe I'll do a follow up with more detail on this.
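
      One minimal way to do that padding (the align helper is illustrative, not from the article): round the record size up to the next word multiple so that, given a word-aligned base address, naturally ordered fields never straddle a cache line.

      public final class WordAlign
      {
          // Round size up to the next multiple of alignment (a power of two)
          public static long align(final long size, final long alignment)
          {
              return (size + (alignment - 1)) & ~(alignment - 1);
          }

          public static void main(final String[] args)
          {
              System.out.println(align(42, 8)); // 48: the 42-byte trade record padded to a word multiple
          }
      }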

    2. 1) Would you not get similar issues of correctness on regular reads/writes?
      2) Would you not see performance issues for writes?
      I agree with the sentiment of keeping it simple, but as your blog is widely read, and highly regarded, I think many people would not be aware of the issue and would pretty much copy your example as-is. But perhaps I'm just projecting my ignorance on others :).

    3. Regular loads and stores will work as expected but may suffer performance issues when a word spans two cache lines and has to be assembled. Store costs are more likely to be hidden by the store buffer and write-combining buffers than load costs.

      OK you've tempted me to write a more focused blog on this subject :-)

    4. Cache alignment is only a performance issue. Never a correctness issue. Cache lines do not add or remove atomicity. Specifically, non-atomic updates within a single cache line can still be seen non-atomically by other threads.

      The Java spec only guarantees atomic treatment (as in the bits in the field are read and written all-or-nothing) for boolean, byte, char, short, int, and float field types. The larger types (long and double) *may* be read and written using two separate operations (although most JVMs will still perform these as atomic, all-or-nothing operations).

      The more complex atomic operations (e.g. AtomicLong.addAndGet()) that both read and write a memory field within a single atomic operation are guaranteed to provide atomicity regardless of memory layout.

      In practice, JVM implementations typically force fields to not cross cache line boundaries by simply aligning all fields to their field size. This is a common necessity since most architectures do not support unaligned types in memory, and do not support unaligned atomic memory operations.

      BTW, all x86 variants DO support both unaligned data types in memory, as well as LOCK operations on such types. This means that on x86, a LOCK operation that spans the boundary between two cache lines will still be atomic (this little bit of historical compatibility probably has modern x86 implementors cursing each time).
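
      To make the spec point concrete, a minimal illustration (a hypothetical class, not from the article): marking a long or double field volatile is the language-level way to get guaranteed all-or-nothing reads and writes.

      public final class Totals
      {
          long plainTotal;          // a plain long may legally be read/written as two 32-bit halves (JLS 17.7)
          volatile long safeTotal;  // volatile longs and doubles are guaranteed atomic loads and stores
      }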

    5. Ended up digging into the topic a bit (with Gil's help), here's the result: http://bit.ly/X5VrjF
      Feedback is welcome.
      Thanks,
      Nitsan

  19. Correction to the above: the allocation is only guaranteed to be 8-byte aligned; it may be a multiple of that.

  20. Would love a follow-up article addressing the design of the packing algorithm to prevent perf/correctness issues from non-ideal alignment. I am especially interested in other threads reading the results.

  21. Can anyone tell me why the offset is incremented by 4 instead of 8 like all the other long fields?

    private static final long instrumentCodeOffset = offset += 4;
    private static final long priceOffset = offset += 4;

    Replies
    1. Look at the JavaMemoryTrade class as a comparison and see if you can work it out? :-)
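
      For anyone who has compared the two classes and is still stuck, a sketch of the offset idiom (reconstructed for illustration, not copied verbatim from the article): each increment is the size of the field declared on the previous line, and venueCode and instrumentCode are 4-byte ints.

      public final class OffsetIdiom
      {
          // Each "offset +=" adds the size of the previous field, so each
          // constant ends up holding the starting offset of its own field.
          private static long offset = 0;
          private static final long tradeIdOffset = offset += 0;        // long at 0
          private static final long clientIdOffset = offset += 8;       // long at 8
          private static final long venueCodeOffset = offset += 8;      // int at 16
          private static final long instrumentCodeOffset = offset += 4; // int at 20: the previous field was a 4-byte int
          private static final long priceOffset = offset += 4;          // long at 24

          public static void main(final String[] args)
          {
              System.out.println(instrumentCodeOffset); // prints 20
          }
      }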

  22. Hi,

    I have significant concerns about these microbenchmarks and the use of sequential integers in the test data. In the past, I've seen test results skewed very significantly when that is done. I just did a test to verify that these tests also suffer from the same flaw.

    In each init() method I added a seeded random (to ensure the same values are being compared):

    java.util.Random r = new Random(NUM_RECORDS);

    and for each use of "i" in constructing the trade record, I replaced "i" with r.nextInt(NUM_RECORDS). Doing this had little impact on the traditional Java test (probably because GC predominates): the initial runs had been ~10 seconds on my box, and after the change they stayed at ~10 seconds, then rose to ~20+ seconds for the final two runs. However, the DirectMemoryTest had been around 920ms on my box, but after making the change the times jumped to 4.5 seconds. I think that 5-fold difference is highly significant and should be more carefully guarded against. The direct memory approach is still clearly faster, but the advantage was significantly eroded. I'd suggest that pseudo-random numbers are far more representative of many (most?) real world applications (certainly any trading system) than sequentially increasing numbers.

    Replies
    1. Your use of random values here will likely defeat the hardware prefetcher. For bulk operations you want the support of the hardware prefetcher to hide memory latency. I do this very deliberately. I've blogged about this in more detail.

      http://mechanical-sympathy.blogspot.co.uk/2012/08/memory-access-patterns-are-important.html

    2. Yes, I read that post. First, I think this isn't about hardware prefetching since we're not fetching trades based on the contents of previous trades. We're still iterating over memory (the trades themselves) sequentially. Note I am not randomly accessing the trades, simply setting the values on the trades to random numbers. The only use of random (in my modification) was in the init() method. e.g.:

      public static void init()
      {
          final long requiredHeap = NUM_RECORDS * DirectMemoryTrade.getObjectSize();
          java.util.Random r = new java.util.Random(NUM_RECORDS);
          address = unsafe.allocateMemory(requiredHeap);

          final byte[] londonStockExchange = {'X', 'L', 'O', 'N'};
          final int venueCode = pack(londonStockExchange);

          final byte[] billiton = {'B', 'H', 'P'};
          final int instrumentCode = pack(billiton);

          for (int i = 0; i < NUM_RECORDS; i++)
          {
              DirectMemoryTrade trade = get(i);

              trade.setTradeId(r.nextInt(NUM_RECORDS));
              trade.setClientId(1);
              trade.setVenueCode(venueCode);
              trade.setInstrumentCode(instrumentCode);

              trade.setPrice(r.nextInt(NUM_RECORDS));
              trade.setQuantity(r.nextInt(NUM_RECORDS));

              trade.setSide((r.nextInt(NUM_RECORDS) & 1) == 0 ? 'B' : 'S');
          }
      }


      The rest of the test is done as before.

      Micro-benchmarks become somewhat misleading if they allow optimizations to take place which are unrealistic. In this case, I would suspect HotSpot is the culprit, rather than prefetching.

    3. Sorry I misunderstood your code. It is hard reading code pasted into a comment :-)

      I took your example and profiled it. 79.2% of the total time is spent in Random.nextInt(). So basically you have increased the time just by generating random numbers, which is an expensive operation.

      As you said, "Micro-benchmarks become somewhat misleading...".
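
      One way to keep the random data while taking the generator out of the measurement is to pre-generate the values before the timed region. A sketch (the helper name is hypothetical):

      // Generate the values once, up front, so the timed loops measure
      // memory layout rather than the cost of Random.nextInt()
      static int[] preGenerate(final int numRecords)
      {
          final java.util.Random r = new java.util.Random(numRecords);
          final int[] values = new int[numRecords];
          for (int i = 0; i < numRecords; i++)
          {
              values[i] = r.nextInt(numRecords);
          }
          return values;
      }
      // then in init(): trade.setTradeId(values[i]); trade.setPrice(values[i]); etc.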

  23. If you're using Java 7 (yes, I know it's end of life), I found I had to enable tiered compilation using the -XX:+TieredCompilation switch to ensure the JIT compiler heavily optimised the setters and getters. Without this I was finding direct memory access using the flyweight pattern often 2-3x slower on reads than using a primitive array allocated in the old generation.

    Tiered compilation is enabled by default on Java 8.
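
    For example (the class name here is assumed):

    java -XX:+TieredCompilation TestDirectMemoryLayout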
