Saturday 16 July 2011

Let The Caller Choose

A programming idiom I often see in managed runtime environments, such as Java, is to return a collection or array from a method, with the returned structure allocated within that method.  Having originally come from a C/C++ background, I always found this a bit strange and lazy; in the C language world this idiom would likely be a source of memory leaks.  I do, however, understand how convenient it can be in garbage collected environments.  A good example is the split method on String, as below:

    String str = "Some string that I may want to split";
    String[] values = str.split(" ");

For this a new array has to be allocated within the method.  If the array is small it can be allocated in the young generation, but due to the round-robin nature of young generation allocation this will most likely involve a cache miss.  The OS, in addition, must zero the memory for security before handing it over.  What’s worse, the size of many arrays cannot be determined until the method completes, so temporary structures may have to be allocated before the final structure is allocated and copied into.  Then, after all this, the memory will eventually need to be garbage collected.  That is a lot of CPU work to support this idiom.
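
To make that concrete, here is a rough sketch of what a split-style method has to do internally when the token count is unknown up front (illustrative only, not the actual JDK implementation):

    import java.util.ArrayList;
    import java.util.List;

    public final class SplitSketch
    {
        public static String[] splitOnSpace(final String input)
        {
            // Temporary structure, because the token count is unknown.
            final List<String> tmp = new ArrayList<String>();
            int start = 0;
            for (int i = 0; i < input.length(); i++)
            {
                if (' ' == input.charAt(i))
                {
                    tmp.add(input.substring(start, i));
                    start = i + 1;
                }
            }
            tmp.add(input.substring(start));

            // Only now can the final array be sized, allocated,
            // and copied into.
            return tmp.toArray(new String[tmp.size()]);
        }
    }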

What is the alternative?

Another idiom is to let the caller allocate the structure and pass it into the method like in the example below:

    String str = "Some string that I may want to split";
    Collection<String> values = new ArrayList<String>();
    str.split(values, " ");

With this approach the caller has more control and flexibility.  Not only can the caller reuse the collection across multiple calls, they can also choose the most appropriate structure for the end needs.  For example, it could be a HashSet if finding distinct words, an ArrayList for cache friendly performance, or a LinkedList if memory is a scarce resource.  The method can return the collection allowing a fluent coding pattern.
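
String has no such overload in the JDK, so here is a minimal sketch of what a caller-chooses split utility might look like (the class name Strings and the method shape are assumptions for illustration):

    import java.util.Collection;

    public final class Strings
    {
        // Append the tokens of input, delimited by delimiter, into the
        // caller-supplied collection and return it for fluent use.
        public static <C extends Collection<String>> C split(
            final String input, final C dst, final char delimiter)
        {
            int start = 0;
            for (int i = 0; i < input.length(); i++)
            {
                if (delimiter == input.charAt(i))
                {
                    dst.add(input.substring(start, i));
                    start = i + 1;
                }
            }
            dst.add(input.substring(start)); // trailing token

            return dst;
        }
    }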

In the Java world this idiom becomes critical to performance when large arrays are involved.  An array may be too large to be allocated in the young generation and therefore needs to be allocated in the old generation.  Besides old generation allocation being less efficient and more contended, the large array may trigger a compaction of the old generation, resulting in a multi-second stop-the-world pause.  Now that is a good topic for another blog post...
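
As a rough illustration, an allocation like the following is typically too large for the young generation (the exact threshold varies by collector and heap configuration, so the size here is an assumption):

    // Likely to be allocated directly in the old generation.
    byte[] buffer = new byte[64 * 1024 * 1024]; // 64 MB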

I can hear folk cry out, “but short-lived objects are cheap!”  They are indeed cheap relative to longer-lived objects, but certainly not free.  I’d prefer to spend those extra cycles solving the business problem and, in addition, have the flexibility to choose the most appropriate data structure for the task at hand.

12 comments:

  1. The performance model changes more frequently than the APIs. Most of the APIs I use today were designed for the hardware and virtual machines of a decade ago.

  2. Interesting. I guess it's the old tradeoff of what's human readable and what has the best performance.

  3. I really like your articles and learn a lot from them, but imho it seems like you are trading safety & scalability for some cycles on a single core here.

    1) Allocating new objects within the method may often be combined with immutable objects & a more functional style. While possibly consuming more memory, functional code / immutable data is said to scale better to multiple cores.

    2) If I remember defensive copying right, I would have to copy internally anyway, because during the method's work I might produce an illegal state of the data that must not be visible from outside. Second, an attacker could concurrently access my structure if I do everything in place, so I would have to apply a synchronization strategy here, which needs to be documented, may be misunderstood, may cause hot locks/contention, and does not compose very well when this code is part of some component ...

  4. Hi Joachim,

    Thanks for the feedback. I like having the ideas tested so they can be refined. To answer your points:

    1. I often hear that "functional code" is the way to scale, but I must admit those who I hear saying it are not offering measurement and science to back it up. Over-allocation of objects has a huge cost and results in a problem known as the "Application Memory Wall". It has been documented going back to at least 1994.

    http://www.cs.virginia.edu/papers/Hitting_Memory_Wall-wulf94.pdf

    http://dl.acm.org/citation.cfm?id=1640116&dl=ACM&coll=DL&CFID=37104408&CFTOKEN=43194801

    http://research.microsoft.com/en-us/um/people/simonpj/papers/parallel/local-gc.pdf

    2. This collection should be thread local. How could an attacker get a reference to it on another thread? Defensive copying is about taking a copy so that when the method returns you are not worried about mutation. The mutation in this case is explicit and scoped to the method call. The collection is the caller's responsibility and is handed over for mutation only for the duration of the call.
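
    As a sketch of that reuse (building on the split utility from the post):

        Collection<String> values = new ArrayList<String>();
        String[] lines = { "first line of text", "second line of text" };
        for (String line : lines)
        {
            values.clear();                    // reuse the same backing storage
            Strings.split(line, values, ' ');
            System.out.println(values);        // stand-in for real processing
        }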

    Martin...

  5. [I split this comment in two parts, because it won't get posted as one]

    Hi Martin,

    Coming from the Java side and remembering C++ only from my university days, I'm certainly biased and cannot argue on the same level, but ... ;)

    1) The memory wall is an issue, but I think for a large range of domains a broader use of immutable data & fresh object creation offers more safety to programmers and allows for heavy optimizations (compile & runtime) under the hood, so that the memory wall is not the bottleneck. There certainly exist other areas (video codecs etc.) where this is not true.

    Details:
    A) Josh Bloch (creator of the Java collections framework) advocates immutable objects in 'Effective Java'. He's now Google's Java guru.

    B) At least commercial JVMs can allocate and GC incredibly fast: see Cliff Click's http://www.azulsystems.com/blog/wp-content/uploads/2011/03/2011_WhatDoesJVMDo.pdf
    and http://www.infoq.com/articles/azul_gc_in_detail. It's about hundreds of GB/sec for > 500 cores and heaps > 500 GB without a stop-the-world GC.

    C) Scala, an object-functional language on the JVM, performs on par with Java, gains ground in 'real life', and is respected by the Java community.

    D) Your cited Microsoft paper refers to older GC models and, from what I understand from a quick read, goes in the same direction with 'local heaps': avoid globally consistent mutation by keeping changes local as long as possible, because the ‘memory wall’ is an issue with global state.

  6. E) Allocating memory and mutating that fresh object within a method seems to me a local issue in the first instance. Even if it's expensive, it takes only one core (and its caches). Because the object has not yet escaped the method, there is no need for it to be part of the global state/heap, and so the other cores should not be bothered. If manipulation is expensive (e.g. lots of r/w access, like your String.split example on longer strings), manipulating a globally known object in place seems more expensive than doing the same for one core only plus some additional local copying. For 'helper objects' that are not returned, escape analysis should optimize this even further.

    F) This point is a bit blurry and I can't express it better in English, but:
    I have the impression that, especially on managed runtimes nowadays, the program code more and more describes only semantics, not instructions that really happen in that order. I know that at a low level the CPU does a lot of out-of-order and speculative execution, but I mean this at a higher level. With immutable objects you can just make better assumptions and optimize more aggressively. So all the object creation that one sees in program code might not happen in reality. Escape analysis, object reuse, inlining etc. are all done by the JVM at runtime, and e.g. I heard that Scala’s compiler does a great deal of object-creation optimization at compile time.

    2) You are right for your example, and ‘defensive copying’ is the wrong term in its stricter definition. But how should the implementer of String.split(...) know what is passed in? It's fine if it's a locally scoped object, but with in-place mutation you rely on that assumption.
    A utility method like that could easily be called with a shared object, and my arguments were for this case. So for general-purpose APIs your pattern might be problematic (or should be named: splitWithNonSharedTarget(target, stringToSplit)). But to be honest, there is sometimes a similar pattern in the JDK (Collection.toArray(T[] a)).
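
    For reference, that JDK pattern in use (a minimal sketch):

        List<String> list = new ArrayList<String>();
        list.add("a");
        list.add("b");
        // The caller supplies the destination array; if it is large enough
        // it is used directly, otherwise a new array of the same runtime
        // type is allocated.
        String[] array = list.toArray(new String[list.size()]);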

  7. Interesting arguments :-)

    1)

    In my experience with C++ since 1992 and Java since 1995, it has been easy to make a healthy living reminding folk that the "common wisdom" does not result in good performance. If we followed common wisdom, no financial exchange would perform and the Disruptor would not exist.

    Can you give examples of actual code showing significant immutable object creation in video codecs etc.? I've studied and been involved with games, encoding techniques, large data transformations, and encryption techniques - to name a few high-performance areas - and never seen large-scale memory allocation happen without very careful consideration when performance was critical.

    The Azul reference is an interesting point. Are you aware that the Azul Vega processor has a special instruction to allocate a cache line pre-zeroed, without the need to fetch it from main memory on a cache miss and then zero it? This instruction addresses this very issue, which can take 20% of the time in a typical Java application. No other mainstream processor that I'm aware of supports it. They are also making significant changes to the Linux memory subsystem to cope with remapping and allocation. I've enjoyed a number of very detailed discussions with Gil and Cliff on this subject.

    I'm not saying immutable objects are bad. I'm actually a big fan when used in the right situation. However I cannot buy the argument that they are the general solution to performance because in my experience this is not the case.

    The "local" memory issue is complex and subtle. As soon as these new objects are referenced by an existing part of the model, or a collection, the global heap needs a reference back to it. These references need remapped when retained which happens a lot in functional languages. Have you profiled the memory costs when performing "path copy" on an immutable functional model that requires Tries for the data structure? What if the object is put in a cache? There are many subtle things going on here.

    I feel I may fail to convince you but that is OK. My own experience is limited and I'm here to learn too. My goal is to raise awareness so that folk will measure for their own problem domain and select the appropriate solution based on measurement when performance is a requirement. Performance is only one of many requirements and often personal taste and preference can be much stronger requirements.

  8. 2)

    There are two points I'd like to make on this. First up, we have discussed thread-local vs. global allocation. With arrays it is very easy to exceed the local allocation size limit, at which point they become global allocations by default. This is another silent performance killer. I've seen it happen many times with byte buffers.

    Second, the code for the existing pattern is more complex, which has two impacts. First, complexity makes things difficult for the humans developing it to understand; secondly, HotSpot has a default size limit of 35 bytecodes on a method before it stops inlining it, i.e. -XX:MaxInlineSize=35. Bigger, more complex code is more difficult to optimise.

    Let's take a real example. Let's say we want to write a function that counts the number of distinct words in a file, and compare the two approaches. How can any compiler or CPU make the first method more efficient than the second? Take an assembly dump of similar approaches and see how many CPU instructions are executed with each. The results are staggering.

    http://code.google.com/p/mechanical-sympathy/source/browse/trunk/let-the-caller-choose/LetTheCallerChoose.java
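
    For a flavour of the difference, here is a minimal sketch of the two approaches (the linked source is the real comparison and may differ):

        import java.util.HashSet;
        import java.util.Set;

        public final class DistinctWords
        {
            // Approach 1: String.split allocates a fresh String[] (plus a
            // temporary list internally) on every call, then each element
            // is copied into a newly allocated set.
            public static int countAllocating(final String line)
            {
                final Set<String> words = new HashSet<String>();
                for (final String word : line.split(" "))
                {
                    words.add(word);
                }
                return words.size();
            }

            // Approach 2: the caller supplies the set, which can be reused
            // across calls; assumes the caller-chooses split utility
            // sketched earlier in the post.
            public static int countCallerChooses(final String line, final Set<String> words)
            {
                words.clear();
                return Strings.split(line, words, ' ').size();
            }
        }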

  9. 2) There might be several issues with local allocation failing silently on current JVMs; you definitely know better than I do. Taking your word-count example and a naive implementation, I suggest that in a multicore environment, creating a new map and saving the words and their counts while scanning the input could (theoretically) be more efficient, because the map is not yet known to the ‘world’ and the sequential updates do not have to be propagated to all other cores/caches. Only publishing the returned result needs a global update.

    Another thought on word count that strikes me, and makes me understand your arguments as biased towards single-CPU cases, is map-reduce frameworks like Hadoop. Word count is kind of a ‘hello world’ case there (http://wiki.apache.org/hadoop/WordCount). The code is definitely longer, slower, and not as easy to inline per CPU, but the functional style and decomposition of the problem help it scale to new dimensions of big data. I do not want to open up another big discussion on that here, but you asked above for examples of functional code performing better.

  10. Joachim,

    In the examples I posted all the collections are local and only referenced from the thread stack. The point I'm making is that, if you trace the code, one array and two ArrayLists are created that do not contribute to the final result other than wasting cycles. There is no multi-threading in this example.

    Map-reduce and other similar parallel techniques are really useful when the work is O(n) and n is very large, like indexing the whole web ;-) I'm as much a fan of parallel programming as of concurrent programming. I just think some intelligence needs to be applied when choosing one over the other. Are you aware that Google are moving away from map-reduce to "Percolator" so they can reduce the latency with which indices are updated?

    http://www.theregister.co.uk/2010/09/24/google_percolator/

  11. Seems like the first part of my comments got lost, maybe the blog software has a blather filter :)
    To make it short and final: On the video codec you got me wrong (or I said it wrong). For the rest I think I got the message.
    Comparing your article to my experiences, I just had the feeling that in hasty, feature- & timeline-driven, quick & dirty code, where following the data flow is hard, where methods are too long and multiple features are mixed together in code and data, there is more danger in using other components/APIs in ‘Let the caller choose’ style.
    Thanks for your replies and the many useful links and details (Azul!). Congrats to Disruptor 2.0!

  12. This comment has been removed by the author.
