Comments on Mechanical Sympathy: Compact Off-Heap Structures/Tuples In Java
Martin Thompson

---
Martin Thompson (2016-08-13 09:32):

I'm struggling to understand what you mean by this. Maybe not enough coffee yet this morning :) Please post a full project on GitHub so the idea can be better understood and evaluated.

---
Johann (2016-08-13 03:04):

Cool, I'm still allowed to reply after nearly 4 years!

Sorry I'm a little late noticing your reply. :)

I should have specified: the "structure" I referred to is precisely what's on the stack in the method calls:

    initializer.invoke("Ford Taurus", 0);
    initializer.invoke("Honda Civiv", 0);
    initializer.invoke("Toyota Corolla", 0);

Three instances of a structure that looks something like:

    struct {
        String model;
        int wheels;
    }

where the String model is initialized at the outset, the int wheels is initialized inside a method call, and both values are lost once the method call is complete.

There should be no heap space allocated for a Ford Taurus with 4 wheels, or a Honda "Civiv" with 4 wheels, and so on. To the best of my knowledge, every Java compiler puts constants right on the stack. Certainly the class files I've looked at in the past do not allocate heap space for primitive or String constants.

I believe the heap space you referred to is all the single-instance classes which *do something with* the structure. You could have 18 billion structure instances defined, none of them on the heap, and the various CarInitializer etc. classes would all still be instantiated exactly once each.

Anyway, this was certainly a bit of pedantry, as I pointed out in my original post. :) Anyone who tries to use "parameter passing down the stack" as stack-based structures is likely to 1) run into the limitations in expressiveness of this approach and 2) be shot by a co-worker.

If you still see a flaw in the concept, I'd certainly be keen to hear about it (even after all these years!).

Cheers Martin,

Johann

---
Anonymous (2016-07-13 17:25):

If you're using Java 7 (yes, I know it's end of life), I found I had to enable tiered compilation using the "-XX" switch to ensure the JIT compiler heavily optimised the setters and getters. Without this, I found direct memory access using the flyweight pattern was often 2-3x slower on reads than using a primitive array allocated in the old generation.

Tiered compilation is enabled by default on Java 8.
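For anyone reproducing this, the flag in question looks like the following sketch; `Benchmark` is a placeholder for whatever main class runs the test, not a class from the post:

```shell
# Java 7: tiered compilation must be requested explicitly (it is the
# default from Java 8 onwards). "Benchmark" is a placeholder class name.
java -XX:+TieredCompilation Benchmark
```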
---
Martin Thompson (2014-04-05 16:35):

Sorry, I misunderstood your code. It is hard reading code pasted into a comment :-)

I took your example and profiled it. 79.2% of the total time is spent in Random.nextInt(). So basically you have increased the time just by generating random numbers, which is an expensive operation.

As you said, "Micro-benchmarks become somewhat misleading...".

---
Anonymous (2014-04-05 15:26):

Yes, I read that post. First, I think this isn't about hardware prefetching, since we're not fetching trades based on the contents of previous trades. We're still iterating over memory (the trades themselves) sequentially. Note I am not randomly accessing the trades, simply setting the values on the trades to random numbers. The only use of random (in my modification) was in the init() method, e.g.:

    public static void init() {
        final long requiredHeap = NUM_RECORDS * DirectMemoryTrade.getObjectSize();
        java.util.Random r = new Random(NUM_RECORDS);
        address = unsafe.allocateMemory(requiredHeap);

        final byte[] londonStockExchange = {'X', 'L', 'O', 'N'};
        final int venueCode = pack(londonStockExchange);

        final byte[] billiton = {'B', 'H', 'P'};
        final int instrumentCode = pack(billiton);

        for (int i = 0; i < NUM_RECORDS; i++) {
            DirectMemoryTrade trade = get(i);

            trade.setTradeId(r.nextInt(NUM_RECORDS));
            trade.setClientId(1);
            trade.setVenueCode(venueCode);
            trade.setInstrumentCode(instrumentCode);

            trade.setPrice(r.nextInt(NUM_RECORDS));
            trade.setQuantity(r.nextInt(NUM_RECORDS));

            trade.setSide((r.nextInt(NUM_RECORDS) & 1) == 0 ? 'B' : 'S');
        }
    }

The rest of the test is done as before.

Micro-benchmarks become somewhat misleading if they allow optimizations to take place which are unrealistic. In this case, I would suspect HotSpot is the culprit, rather than prefetching.

---
Martin Thompson (2014-04-05 07:58):

Your use of random here will likely defeat the hardware prefetcher. For bulk operations you want the support of the hardware prefetcher to hide memory latency. I did this very deliberately. I've blogged about this in more detail:

http://mechanical-sympathy.blogspot.co.uk/2012/08/memory-access-patterns-are-important.html
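The access-pattern point can be sketched outside the benchmark. In this illustrative example (not code from the original post) the data and the total work are identical and only the visit order changes, which is exactly what the hardware prefetcher keys off:

```java
import java.util.Random;

public class AccessPatterns {
    public static long sequentialSum(long[] data) {
        long sum = 0;
        for (int i = 0; i < data.length; i++) {
            sum += data[i];        // predictable stride: prefetcher-friendly
        }
        return sum;
    }

    public static long randomSum(long[] data, int[] order) {
        long sum = 0;
        for (int i = 0; i < order.length; i++) {
            sum += data[order[i]]; // unpredictable stride: defeats the prefetcher
        }
        return sum;
    }

    public static void main(String[] args) {
        long[] data = new long[1 << 20];
        int[] order = new int[data.length];
        Random r = new Random(42);
        for (int i = 0; i < data.length; i++) {
            data[i] = i;
            order[i] = i;
        }
        // Fisher-Yates shuffle of the visit order (not of the data)
        for (int i = order.length - 1; i > 0; i--) {
            int j = r.nextInt(i + 1);
            int tmp = order[i]; order[i] = order[j]; order[j] = tmp;
        }
        // Same elements visited, same sum; only the access pattern differs.
        System.out.println(sequentialSum(data) == randomSum(data, order));
    }
}
```

Timing the two loops separately (with proper warm-up) is what shows the latency gap; the equality check just demonstrates the work is the same.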
---
Anonymous (2014-04-05 06:45):

Hi,

I have significant concerns about these microbenchmarks and the use of sequential integers in the test data. In the past, I've seen very significant skewing of test results when that is done. I just did a test to verify that these tests also suffer from the same flaw.

In each init() method I added a seeded random (to ensure the same values are being compared):

    java.util.Random r = new Random(NUM_RECORDS);

and for each use of "i" in constructing the trade record, I replaced "i" with r.nextInt(NUM_RECORDS). Doing this had little impact on the traditional Java test (probably because GC predominates). The initial test runs had been ~10 seconds on my box; that stayed ~10 seconds, and then went to ~20+ seconds for the final two runs. However, the DirectMemoryTest had been around 920ms on my box, and after making the change the times jumped to 4.5 seconds. I think that 5-fold difference is highly significant and should be more carefully guarded against. The direct memory approach is still clearly faster, but the advantage was pretty significantly eroded. I'd suggest that pseudo-random numbers are far more representative of many (most?) real-world applications (certainly any trading system) than sequentially increasing numbers.

---
Martin Thompson (2014-01-09 23:07):

Look at the JavaMemoryTrade class as a comparison and see if you can work it out? :-)
---
ymo (2014-01-09 21:57):

Can anyone tell me why the offset is incremented by 4, instead of 8 like all the other long fields?

    private static final long instrumentCodeOffset = offset += 4;
    private static final long priceOffset = offset += 4;

---
Anonymous (2013-08-28 10:04):

Would love a follow-up article addressing the design of the packing algorithm to prevent perf/correctness issues from non-ideal alignment. I am especially interested in other threads reading the results.

---
Ray (2013-07-04 20:26):

This comment has been removed by the author.

---
Nitsan (2013-02-12 21:56):

Ended up digging into the topic a bit (with Gil's help), here's the result: http://bit.ly/X5VrjF
Feedback is welcome.
Thanks,
Nitsan
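On ymo's offset question above: the value assigned is the field's own offset, and the amount added is the size of the previously declared field, so the two int fields (venueCode and instrumentCode) advance the cursor by 4 bytes rather than 8. A sketch reconstructed from the setters in the benchmark code quoted earlier; the exact field order and the char-sized side field are assumptions, chosen so the total matches the 42-byte structure size mentioned elsewhere in this thread:

```java
public class TradeLayout {
    private static long offset = 0;

    // The value assigned is the field's own offset; the amount added is the
    // size of the field declared on the PREVIOUS line. venueCode is an int,
    // so instrumentCode starts only 4 bytes after it, hence "+= 4".
    private static final long tradeIdOffset        = offset += 0; // long at 0
    private static final long clientIdOffset       = offset += 8; // long at 8
    private static final long venueCodeOffset      = offset += 8; // int  at 16
    private static final long instrumentCodeOffset = offset += 4; // int  at 20
    private static final long priceOffset          = offset += 4; // long at 24
    private static final long quantityOffset       = offset += 8; // long at 32
    private static final long sideOffset           = offset += 8; // char at 40
    private static final long objectSize           = offset + 2;  // 42 bytes

    public static void main(String[] args) {
        System.out.println(instrumentCodeOffset + " " + priceOffset + " " + objectSize);
    }
}
```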
---
Martin Thompson (2013-01-29 14:33):

When using 64-bit addressing, the "magic" behind the getter is not very difficult as an intrinsic. The issue with the above is the 32-bit index addressing limitation on Java arrays. This class specifies the behaviour, and not the implementation, for the intrinsic.

This can be made very efficient for copy and reset operations with a contiguous memory layout.

I know of a number of JVM implementers who are looking at memory layout with structures, arrays, and object co-location. It would be great to see this work become generally available.

---
Anonymous (2013-01-29 13:31):

Agreed, variable-length data such as a char array can be 0-padded to an array length that is a power of 2; base 10 is for humans, not memory (need sufficient cache size? it will be hard to decide 64 -> 128, ouch).

    return partitions[partitionIndex][partitionOffset];

Still, there has to be magic behind the getter in the above line, right? I guess this JVM implementation is not Oracle HotSpot :)
Would these intrinsics also be efficient in terms of wiping the region clean and copying to other instances?
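The getter behind `partitions[partitionIndex][partitionOffset]` need not involve much magic if the partition length is a power of two: the index math reduces to a shift and a mask. A minimal sketch of the idea, an assumed simplification rather than StructuredArray's actual implementation:

```java
public class PartitionedLongArray {
    // Power-of-two partition length so index math is a shift and a mask,
    // not a divide. Small here for illustration; real sizes would be larger.
    private static final int PARTITION_SHIFT  = 16;               // 65536 elements
    private static final int PARTITION_LENGTH = 1 << PARTITION_SHIFT;
    private static final int PARTITION_MASK   = PARTITION_LENGTH - 1;

    private final long[][] partitions;

    public PartitionedLongArray(long length) {
        // Round the partition count up so "length" elements always fit.
        int nPartitions = (int) ((length + PARTITION_LENGTH - 1) >>> PARTITION_SHIFT);
        partitions = new long[nPartitions][];
        for (int i = 0; i < nPartitions; i++) {
            partitions[i] = new long[PARTITION_LENGTH];
        }
    }

    public long get(long index) {
        return partitions[(int) (index >>> PARTITION_SHIFT)]
                         [(int) (index & PARTITION_MASK)];
    }

    public void set(long index, long value) {
        partitions[(int) (index >>> PARTITION_SHIFT)]
                  [(int) (index & PARTITION_MASK)] = value;
    }

    public static void main(String[] args) {
        PartitionedLongArray a = new PartitionedLongArray(3L << 16); // 3 partitions
        a.set((1L << 16) + 7, 42L); // lands in partition 1, offset 7
        System.out.println(a.get((1L << 16) + 7));
    }
}
```

This is why a long index beyond Integer.MAX_VALUE is addressable even though each underlying Java array is limited to 32-bit indices.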
---
Martin Thompson (2013-01-29 09:14):

We are expanding on the scope of the post, but that is OK. On a real project I would byte-align the beginning of each structure so addressing can be done with shifts. This is also necessary to ensure the words for concurrent access are aligned, just as Java object fields are.

I'm working with some JVM implementers on a new intrinsic that treats a class like the following as if it were a real array of structures:

https://github.com/mjpt777/examples/blob/master/src/java/uk/co/real_logic/intrinsics/StructuredArray.java

---
Anonymous (2013-01-29 02:52):

http://hg.openjdk.java.net/jdk7/jdk7/hotspot/file/tip/src/cpu/x86/vm/templateTable_x86_64.cpp
shows that if unpack/pack use OR/AND/SHIFT/ADD they will be cheap. However, for accessing a byte at a certain offset of the array (the "[index]"), I hope it is not using this (the baload opcode); would it involve the multiply by HeapWordSize in base_offset_in_bytes? Upsetting.

    void TemplateTable::baload() {
      transition(itos, itos);
      __ pop_ptr(rdx);
      // eax: index
      // rdx: array
      index_check(rdx, rax); // kills rbx
      __ load_signed_byte(rax,
                          Address(rdx, rax,
                                  Address::times_1,
                                  arrayOopDesc::base_offset_in_bytes(T_BYTE)));
    }

http://hg.openjdk.java.net/jdk6/jdk6/hotspot/diff/ba764ed4b6f2/src/share/vm/oops/arrayOop.hpp

    // Returns the offset of the first element.
    static int base_offset_in_bytes(BasicType type) {
      return header_size(type) * HeapWordSize;
    }
---
Martin Thompson (2013-01-28 20:37):

Pack and unpack operations will be relatively cheap compared to a cache miss. Have you considered that lots of other work needs to be done as part of the computation, and that it is better not to use all the execution units unnecessarily when we have alternatives?

I don't understand the JIT reference. The JIT cannot turn these into packed structures because it would then not be using the primitive array as specified. It also cannot make work magically go away. Am I missing something in this point?

---
Anonymous (2013-01-27 20:36):

"If you have a lot of smaller primitives then the packing costs will start to dominate": shouldn't the JIT be good at packing/unpacking? They are simple multiply/and/or/plus/shift operations. In addition, an Intel core has several execution units to compute in the pipeline, so shouldn't the pack/unpack cost be very small? If not, maybe the JIT compiler can be improved in this regard.
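For concreteness, the pack/unpack being discussed is just shifts and ORs over a few bytes, as with the venue and instrument codes in the benchmark. A sketch; the byte order is an assumption and the post's actual pack() may differ:

```java
public class Codes {
    // Pack up to 4 ASCII bytes into an int, first byte in the high bits.
    public static int pack(byte[] value) {
        int result = 0;
        for (byte b : value) {
            result = (result << 8) | (b & 0xFF);
        }
        return result;
    }

    // Reverse the packing: peel bytes off the low end, last byte first.
    public static byte[] unpack(int value, int length) {
        byte[] result = new byte[length];
        for (int i = length - 1; i >= 0; i--) {
            result[i] = (byte) (value & 0xFF);
            value >>>= 8;
        }
        return result;
    }

    public static void main(String[] args) {
        int venueCode = pack(new byte[]{'X', 'L', 'O', 'N'});
        System.out.println(new String(unpack(venueCode, 4)));
    }
}
```

A handful of shifts and ORs like this costs a few cycles; a cache miss costs a few hundred, which is the trade-off Martin is pointing at.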
---
Gil Tene (2012-12-22 19:48):

Cache alignment is only a performance issue, never a correctness issue. Cache lines do not add or remove atomicity. Specifically, non-atomic updates within a single cache line can still be seen non-atomically by other threads.

The Java spec only guarantees atomic treatment (as in the bits in the field are read and written all-or-nothing) for boolean, byte, char, short, int, and float field types. The larger types (long and double) *may* be read and written using two separate operations (although most JVMs will still perform these as atomic, all-or-nothing operations).

The more complex atomic operations (e.g. AtomicLong.addAndGet()) that both read and write a memory field within a single atomic operation are guaranteed to provide atomicity regardless of memory layout.

In practice, JVM implementations typically force fields to not cross cache line boundaries by simply aligning all fields to their field size. This is a common necessity since most architectures do not support unaligned types in memory, and do not support unaligned atomic memory operations.

BTW, all x86 variants DO support both unaligned data types in memory, as well as LOCK operations on such types. This means that on x86, a LOCK operation that spans the boundary between two cache lines will still be atomic (this little bit of historical compatibility probably has modern x86 implementors cursing each time).

---
Martin Thompson (2012-12-14 12:22):

Regular loads and stores will work as expected but may suffer performance issues as words get assembled that span cache lines. Store costs are more likely to be hidden by the store buffer and write-combining buffers than load costs.

OK, you've tempted me to write a more focused blog on this subject :-)
---
Nitsan (2012-12-14 10:42):

1) Would you not get similar issues of correctness on regular reads/writes?
2) Would you not see performance issues for writes?

I agree with the sentiment of keeping it simple, but as your blog is widely read and highly regarded, I think many people would not be aware of the issue and would copy your example pretty much as is. But perhaps I'm just projecting my ignorance on others :).

---
Martin Thompson (2012-12-14 06:15):

This is the reason why Java will align objects on 8-byte boundaries and then carefully organise the fields, as I described in a previous blog on false sharing.

1) You could get issues if the word you are using for coordination in a lock-free concurrent algorithm is split across cache lines, thus making loads and stores not atomic.

2) You can also get performance issues reading fields that are split across cache lines, or even not aligned on word boundaries, depending on the processor.

I did not want to make this article more complex, but when building systems I allocate my off-heap objects on word boundaries and do similar for their fields. Maybe I'll do a follow-up with more detail on this.

---
Nitsan (2012-12-13 23:13):

Correction to the above: alignment is guaranteed to be 8-byte aligned, and may be a multiple of that.
---
Nitsan (2012-12-13 22:00):

Having recently been advised by yourself (many thanks) to beware of cache line alignment issues, I wondered how you would reason about alignment in the above case (assuming a 64-byte cache line):

1) The trade structure size is 42.
2) unsafe.allocateMemory will return an address that "will never be zero, and will be aligned for all value types", i.e. 16-byte aligned, from what I can find.

This means your address can start in one of 4 locations on a cache line (0, 16, 32, 48), resulting in a variety of conditions in which one of the fields is split across 2 cache lines. For example, if the address is cache aligned, then the second record has its 4th field split between line 1 and line 2.

I understand this can result in loss of atomicity of updates to the field, meaning a half-formed field could be visible on write. I assume this can be corrected by padding the structure so that this cannot happen (to a size which is a multiple of 16).

Am I correct in my reasoning so far? If you agree that the issue exists, how would it manifest (perf issue? correctness issue?)?

Thanks again,
Nitsan
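Nitsan's reasoning can be checked with a little arithmetic. This sketch assumes a cache-line-aligned base, 64-byte lines, and a field layout reconstructed from the offsets quoted earlier in the thread (instrumentCode as the 4-byte field at record offset 20): the second record's 4th field does straddle a line at the raw 42-byte size, and padding the record to 48 bytes (a multiple of 16, as Nitsan suggests) cures it:

```java
public class TradeAlignment {
    static final int CACHE_LINE = 64;

    // True if a field of fieldSize bytes at fieldOffset within record i
    // crosses a cache-line boundary, assuming the base address is line-aligned.
    static boolean straddles(int recordSize, int i, int fieldOffset, int fieldSize) {
        long start = (long) i * recordSize + fieldOffset;
        return (start % CACHE_LINE) + fieldSize > CACHE_LINE;
    }

    public static void main(String[] args) {
        // 4th field (instrumentCode, an int at offset 20) of the second record:
        // at the raw size of 42 it starts at byte 62 and spills into line 2;
        // padded to 48, it starts at byte 68, i.e. offset 4 within a line.
        System.out.println(straddles(42, 1, 20, 4) + " " + straddles(48, 1, 20, 4));
    }
}
```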