<b><span style="font-size: large;">Simple Binary Encoding</span></b><br />
<div dir="ltr" style="text-align: left;" trbidi="on">
Financial systems communicate by sending and receiving vast numbers of messages in many different formats. When people use terms like "vast" I normally think, "Really? How many?" So let's quantify "vast" for the finance industry. Market data feeds from financial exchanges can typically emit tens or hundreds of thousands of messages per second, and aggregate feeds like <a href="http://www.opradata.com/">OPRA</a> can peak at over 10 million messages per second with volumes growing year-on-year. This presentation gives a good <a href="https://fif.com/docs/2013_6_fifmd_capacity_stats.pdf">overview</a>.<br />
<br />
In this crazy world we still see significant use of ASCII-encoded presentations, such as <a href="http://en.wikipedia.org/wiki/Financial_Information_eXchange">FIX</a> tag value, and some slightly more sane binary-encoded presentations like <a href="http://en.wikipedia.org/wiki/FAST_protocol">FAST</a>. Some markets even commit the sin of sending out market data as XML! Well, I cannot complain too much, as they have at times provided me with a good income writing ultra-fast XML parsers.<br />
<br />
Last year the CME, who are a member of the FIX <a href="http://www.fixtradingcommunity.org/">community</a>, commissioned <a href="https://twitter.com/toddlmontgomery">Todd Montgomery</a>, of 29West LBM fame, and myself to build the reference implementation of the new FIX <a href="http://real-logic.github.io/simple-binary-encoding/">Simple Binary Encoding</a> (SBE) standard. SBE is a codec aimed at addressing the efficiency issues in low-latency trading, with a specific focus on market data. The CME, working within the FIX community, have done a great job of coming up with an encoding presentation that can be this efficient. Maybe a suitable atonement for the sins of past FIX tag value implementations. Todd and I worked on the Java and C++ implementations, and later we were helped on the .Net side by the amazing <a href="https://twitter.com/olivierdeheurle">Olivier Deheurles</a> at <a href="http://weareadaptive.com/">Adaptive</a>. Working on a cool technical problem with such a team is a dream job.<br />
<br />
<span style="font-size: large;"><b>SBE Overview</b></span><br />
<br />
SBE is an <a href="http://en.wikipedia.org/wiki/OSI_model">OSI</a> layer 6 presentation for encoding/decoding messages in binary format to support low-latency applications. Of the many applications I profile with performance issues, message encoding/decoding is often the most significant cost. I've seen many applications that spend significantly more CPU time parsing and transforming XML and JSON than executing business logic. SBE is designed to make this part of a system as efficient as it can be. SBE follows a number of <a href="https://github.com/real-logic/simple-binary-encoding/wiki/Design-Principles">design principles</a> to achieve this goal. Adhering to these design principles sometimes means that features available in other codecs are not offered. For example, many codecs allow strings to be encoded at any field position in a message; SBE only allows variable-length fields, such as strings, to be grouped at the end of a message.<br />
<br />
The SBE reference implementation consists of a compiler that takes a message schema as input and then generates language specific stubs. The stubs are used to directly encode and decode messages from buffers. The SBE tool can also generate a binary representation of the schema that can be used for the on-the-fly decoding of messages in a dynamic environment, such as for a log viewer or network sniffer.<br />
<br />
The design principles drive the implementation of a codec that ensures messages are streamed through memory without backtracking, copying, or unnecessary allocation. <a href="http://mechanical-sympathy.blogspot.co.uk/2012/08/memory-access-patterns-are-important.html">Memory access patterns</a> should not be underestimated in the design of a high-performance application. Low-latency systems in any language especially need to consider all allocation, to avoid the resulting issues in reclamation. This applies to both managed runtimes and native languages. SBE is totally allocation free in all three language implementations.<br />
<br />
The end result of applying these design principles is a codec that has ~16-25 times greater throughput than Google Protocol Buffers (GPB) with very low and predictable latency. This has been observed in <a href="https://github.com/real-logic/simple-binary-encoding/tree/master/sbe-benchmarks/src/main">micro-benchmarks</a> and real-world application use. A typical market data message can be encoded, or decoded, in ~25ns compared to ~1000ns for the same message with GPB on the same hardware. XML and FIX tag value messages are orders of magnitude slower again.<br />
<br />
The sweet spot for SBE is as a codec for structured data that is mostly fixed-size fields such as numbers, bitsets, enums, and arrays. While it does work for strings and blobs, many may find some of the restrictions a usability issue. These users would be better off with another codec more suited to string encoding.<br />
<br />
<b><span style="font-size: large;">Message Structure</span></b><br />
<br />
A message must be capable of being read or written sequentially to preserve the streaming access design principle, i.e. with no need to backtrack. Some codecs insert location pointers for variable length fields, such as string types, that have to be indirected for access. This indirection comes at a cost of extra instructions plus losing the support of the hardware prefetchers. SBE's design allows for pure sequential access and copy-free native access semantics.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhxenVqouSMVH92Kjq5os7wb9U1onbidJZXYt3uc4cBWm7tVhSx9B79llj2w9n0NjNA9IHwiplVNZ3gticsM7A-kzmx-drvMqFvrwk-NHg1WkgsGlmoT-BD2KEMTwS2Y2mrVN066itYdqA/s1600/SBE-msg-format.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhxenVqouSMVH92Kjq5os7wb9U1onbidJZXYt3uc4cBWm7tVhSx9B79llj2w9n0NjNA9IHwiplVNZ3gticsM7A-kzmx-drvMqFvrwk-NHg1WkgsGlmoT-BD2KEMTwS2Y2mrVN066itYdqA/s1600/SBE-msg-format.png" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Figure 1</td></tr>
</tbody></table>
SBE messages have a common header that identifies the type and version of the message body to follow. The header is followed by the root fields of the message which are all fixed length with static offsets. The root fields are very similar to a struct in C. If the message is more complex then one or more repeating groups similar to the root block can follow. Repeating groups can nest other repeating group structures. Finally, variable length strings and blobs come at the end of the message. Fields may also be optional. The XML schema describing the SBE presentation can be found <a href="https://github.com/real-logic/simple-binary-encoding/blob/master/sbe-tool/src/main/resources/fpl/SimpleBinary1-0.xsd">here</a>.<br />
<br />
<span style="font-size: large;"><b>SbeTool and the Compiler</b></span><br />
<br />
To use SBE it is first necessary to define a schema for your messages. SBE provides a language independent type system supporting integers, floating point numbers, characters, arrays, constants, enums, bitsets, composites, grouped structures that repeat, and variable length strings and blobs.<br />
<br />
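To give a flavour of what a schema looks like, below is a minimal sketch of a message declaration. It is loosely modelled on the Car example schema linked later in this article; the namespace declaration is omitted for brevity, and the names, ids, and types here are illustrative rather than copied from a real schema.<br />
<pre><sbe:messageSchema package="example" id="1" version="0" byteOrder="littleEndian">
    <types>
        <type name="ModelYear" primitiveType="uint16"/>
        <enum name="BooleanType" encodingType="uint8">
            <validValue name="FALSE">0</validValue>
            <validValue name="TRUE">1</validValue>
        </enum>
    </types>

    <sbe:message name="Car" id="1" description="A simple car message">
        <field name="serialNumber" id="1" type="uint64"/>
        <field name="modelYear" id="2" type="ModelYear"/>
        <field name="available" id="3" type="BooleanType"/>
    </sbe:message>
</sbe:messageSchema>
</pre>
<br />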
A message schema can be input into the <a href="https://github.com/real-logic/simple-binary-encoding/wiki/Sbe-Tool-Guide">SbeTool</a> and compiled to produce stubs in a range of languages, or to generate binary metadata suitable for decoding messages on-the-fly.<br />
<br />
<pre> java [-Doption=value] -jar sbe.jar <message-declarations-file.xml>
</pre>
<br />
SbeTool and the compiler are written in Java. The tool can currently output stubs in Java, C++, and C#.<br />
<br />
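As a concrete example of the options, the tool is configured via Java system properties. The sketch below is from memory; the property names <span style="font-family: "courier new" , "courier" , monospace;">sbe.target.language</span> and <span style="font-family: "courier new" , "courier" , monospace;">sbe.output.dir</span> should be checked against the SbeTool guide linked above for the definitive list.<br />
<pre>java -Dsbe.target.language=Cpp -Dsbe.output.dir=generated -jar sbe.jar example-schema.xml
</pre>
<br />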
<span style="font-size: large;"><b>Programming with Stubs</b></span><br />
<br />
A full example of messages defined in a <a href="https://github.com/real-logic/simple-binary-encoding/blob/master/sbe-samples/src/main/resources/example-schema.xml">schema</a> with supporting code can be found <a href="https://github.com/real-logic/simple-binary-encoding/blob/master/sbe-samples/src/main/java/uk/co/real_logic/sbe/examples/ExampleUsingGeneratedStub.java">here</a>. The generated stubs follow a flyweight pattern with instances reused to avoid allocation. The stubs wrap a buffer at an offset and then read it sequentially and natively.<br />
<pre>// Write the message header first
MESSAGE_HEADER.wrap(directBuffer, bufferOffset, messageTemplateVersion)
              .blockLength(CAR.sbeBlockLength())
              .templateId(CAR.sbeTemplateId())
              .schemaId(CAR.sbeSchemaId())
              .version(CAR.sbeSchemaVersion());

// Then write the body of the message
car.wrapForEncode(directBuffer, bufferOffset)
   .serialNumber(1234)
   .modelYear(2013)
   .available(BooleanType.TRUE)
   .code(Model.A)
   .putVehicleCode(VEHICLE_CODE, srcOffset);
</pre>
Messages can be written via the generated stubs in a fluent manner. Each field appears as a generated pair of methods to encode and decode.<br />
<pre>// Read the header and lookup the appropriate template to decode
MESSAGE_HEADER.wrap(directBuffer, bufferOffset, messageTemplateVersion);

final int templateId = MESSAGE_HEADER.templateId();
final int actingBlockLength = MESSAGE_HEADER.blockLength();
final int schemaId = MESSAGE_HEADER.schemaId();
final int actingVersion = MESSAGE_HEADER.version();

// Once the template is located then the fields can be decoded.
car.wrapForDecode(directBuffer, bufferOffset, actingBlockLength, actingVersion);

final StringBuilder sb = new StringBuilder();
sb.append("\ncar.templateId=").append(car.sbeTemplateId());
sb.append("\ncar.schemaId=").append(schemaId);
sb.append("\ncar.schemaVersion=").append(car.sbeSchemaVersion());
sb.append("\ncar.serialNumber=").append(car.serialNumber());
sb.append("\ncar.modelYear=").append(car.modelYear());
sb.append("\ncar.available=").append(car.available());
sb.append("\ncar.code=").append(car.code());
</pre>
<br />
The generated code in all languages gives performance similar to casting a C struct over the memory.
<br />
<br />
<span style="font-size: large;"><b>On-The-Fly Decoding</b></span><br />
<br />
The compiler produces an intermediate representation (IR) for the input XML message schema. This IR can be serialised in the SBE binary format to be used for later on-the-fly decoding of messages that have been stored. It is also useful for tools, such as a network sniffer, that will not have been compiled with the stubs. A full example of the IR being used can be found <a href="https://github.com/real-logic/simple-binary-encoding/blob/master/sbe-samples/src/main/java/uk/co/real_logic/sbe/examples/OtfExample.java">here</a>.<br />
<br />
<span style="font-size: large;"><b>Direct Buffers</b></span><br />
<br />
SBE, via Agrona, provides an abstraction to Java, with the <span style="font-family: "courier new" , "courier" , monospace;"><a href="https://github.com/real-logic/Agrona/blob/master/src/main/java/org/agrona/MutableDirectBuffer.java">MutableDirectBuffer</a></span> class, to work with buffers that are byte[] arrays, heap or direct <span style="font-family: "courier new" , "courier" , monospace;"><a href="http://docs.oracle.com/javase/7/docs/api/java/nio/ByteBuffer.html">ByteBuffer</a></span> buffers, and off-heap memory addresses returned from <span style="font-family: "courier new" , "courier" , monospace;"><a href="http://www.docjar.com/docs/api/sun/misc/Unsafe.html#allocateMemory(long)">Unsafe.allocateMemory(long)</a></span> or JNI. In low-latency applications, messages are often encoded/decoded in memory-mapped files via <span style="font-family: "courier new" , "courier" , monospace;"><a href="http://docs.oracle.com/javase/7/docs/api/java/nio/MappedByteBuffer.html">MappedByteBuffer</a></span> and thus can be <a href="http://docs.oracle.com/javase/7/docs/api/java/nio/channels/FileChannel.html#transferTo(long, long, java.nio.channels.WritableByteChannel)">transferred</a> to a network channel by the kernel, avoiding user-space copies.<br />
<br />
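As a brief sketch of how this looks in practice, Agrona's <span style="font-family: "courier new" , "courier" , monospace;">UnsafeBuffer</span> is the usual MutableDirectBuffer implementation; here it wraps a plain byte array and a direct ByteBuffer. The buffer sizes are arbitrary and for illustration only.<br />
<pre>import java.nio.ByteBuffer;

import org.agrona.MutableDirectBuffer;
import org.agrona.concurrent.UnsafeBuffer;

public class BufferExamples
{
    public static void main(final String[] args)
    {
        // Wrap an on-heap byte array; no copying takes place.
        final MutableDirectBuffer onHeap = new UnsafeBuffer(new byte[4096]);

        // Wrap an off-heap direct ByteBuffer for zero-copy IO.
        final MutableDirectBuffer offHeap = new UnsafeBuffer(ByteBuffer.allocateDirect(4096));

        // The generated stubs then encode/decode at an offset within such a buffer.
        onHeap.putLong(0, 1234L);
        System.out.println(offHeap.capacity() + " " + onHeap.getLong(0));
    }
}
</pre>
<br />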
C++ and C# have built-in support for direct memory access and do not require such an abstraction as the Java version does. A DirectBuffer abstraction was added for C# to support endianness and encapsulate the unsafe pointer access.<br />
<br />
<span style="font-size: large;"><b>Message Extension and Versioning</b></span><br />
<br />
SBE schemas carry a version number that allows for message extension. A message can be extended by adding fields at the end of a block. To preserve backwards compatibility, fields cannot be removed or reordered.<br />
<br />
Extension fields must be optional, otherwise a newer template reading an older message would not work. Templates carry metadata for min, max, null, time unit, character encoding, etc., which are accessible via static (class-level) methods on the stubs.<br />
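<br />
As an illustration, an extension field added in schema version 1 might be declared along the following lines. This is a hypothetical fragment; the field name and ids are made up, but the presence and sinceVersion attributes are part of the schema language.<br />
<pre><field name="discountPercent" id="20" type="uint8" presence="optional" sinceVersion="1"/>
</pre>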
<br />
<span style="font-size: large;"><b>Byte Ordering and Alignment</b></span><br />
<br />
The message schema allows for precise alignment of fields by specifying offsets. Fields are encoded in Little <a href="http://www.ietf.org/rfc/ien/ien137.txt">Endian</a> form by default, unless otherwise specified in a schema. For maximum performance, native encoding with fields on word-aligned boundaries should be used. The penalty for accessing non-aligned fields on some processors can be very significant. For alignment one must consider the framing protocol and the location of buffers in memory.<br />
<br />
<span style="font-size: large;"><b>Message Protocols</b></span><br />
<br />
I often see people complain that a codec cannot support a particular presentation in a single message. However, this can often be addressed with a protocol composed of multiple messages. Protocols are a great way to split an interaction into its component parts; these parts are then often composable across many interactions between systems. For example, the IR implementation of schema metadata is more complex than can be supported by the structure of a single message. We encode IR by first sending a template message providing an overview, followed by a stream of messages, each encoding the tokens from the compiler IR. This allows for the design of a very fast OTF decoder, which can be implemented as a threaded interpreter with much less branching than the typical switch-based state machines.<br />
<br />
Protocol design is an area that most developers don't seem to get an opportunity to learn. I feel this is a great loss. The fact that so many developers will call an "encoding" such as ASCII a "protocol" is very telling. The value of protocols is so obvious when one gets to work with a programmer like Todd who has spent his life successfully designing protocols.<br />
<br />
<span style="font-size: large;"><b>Stub Performance</b></span><br />
<br />
The stubs provide a significant performance advantage over the dynamic OTF decoding. For accessing primitive fields we believe the performance is reaching the limits of what is possible from a general purpose tool. The generated assembly code is very similar to what a compiler will generate for accessing a C struct, even from Java!<br />
<br />
Regarding the general performance of the stubs, we have observed that C++ has a very marginal advantage over Java, which we believe is due to runtime-inserted safepoint checks. The C# version lags a little further behind due to its runtime not being as aggressive at inlining methods as the Java runtime. Stubs for all three languages are capable of encoding or decoding typical financial messages in tens of nanoseconds. This effectively makes the encoding and decoding of messages almost free for most applications relative to the rest of the application logic.<br />
<br />
<span style="font-size: large;"><b>Feedback</b></span><br />
<br />
This is the first version of SBE and we would welcome <a href="https://github.com/real-logic/simple-binary-encoding/issues?state=open">feedback</a>. The reference implementation is constrained by the FIX community specification. It is possible to influence the specification but please don't expect pull requests to be accepted that significantly go against the <a href="https://github.com/FIXTradingCommunity/fix-simple-binary-encoding">specification</a>. Support for Javascript, Python, Erlang, and other languages has been discussed and would be very welcome.<br />
<br />
<br />
<b><span style="font-size: large;">Update: 08-May-2014</span></b><br />
<br />
Thanks to feedback from Kenton Varda, the creator of GPB, we were able to improve the benchmarks to get the best performance out of GPB. Below are the results for the changes to the Java benchmarks.<br />
<br />
The C++ GPB benchmarks show approximately a doubling of throughput after optimisation compared to the initial results. It should be noted that with GPB you often have to do the opposite in Java compared to C++ to get performance improvements, such as allocating objects rather than reusing them.<br />
<br />
<b>Before GPB Optimisation:</b>
<br />
<pre>Mode Thr Cnt Sec Mean Mean error Units
[exec] u.c.r.protobuf.CarBenchmark.testDecode thrpt 1 30 1 462.817 6.474 ops/ms
[exec] u.c.r.protobuf.CarBenchmark.testEncode thrpt 1 30 1 326.018 2.972 ops/ms
[exec] u.c.r.protobuf.MarketDataBenchmark.testDecode thrpt 1 30 1 1148.050 17.194 ops/ms
[exec] u.c.r.protobuf.MarketDataBenchmark.testEncode thrpt 1 30 1 1242.252 12.248 ops/ms
[exec] u.c.r.sbe.CarBenchmark.testDecode thrpt 1 30 1 10436.476 102.114 ops/ms
[exec] u.c.r.sbe.CarBenchmark.testEncode thrpt 1 30 1 11657.190 65.168 ops/ms
[exec] u.c.r.sbe.MarketDataBenchmark.testDecode thrpt 1 30 1 34078.646 261.775 ops/ms
[exec] u.c.r.sbe.MarketDataBenchmark.testEncode thrpt 1 30 1 29193.600 443.638 ops/ms
</pre>
<b>After GPB Optimisation:</b>
<br />
<pre>Mode Thr Cnt Sec Mean Mean error Units
[exec] u.c.r.protobuf.CarBenchmark.testDecode thrpt 1 30 1 619.467 4.429 ops/ms
[exec] u.c.r.protobuf.CarBenchmark.testEncode thrpt 1 30 1 433.711 10.364 ops/ms
[exec] u.c.r.protobuf.MarketDataBenchmark.testDecode thrpt 1 30 1 2088.998 60.619 ops/ms
[exec] u.c.r.protobuf.MarketDataBenchmark.testEncode thrpt 1 30 1 1316.123 19.816 ops/ms
</pre>
<br />
<br />
<div align="center">
<table border="1" cellpadding="5"><tbody>
<tr style="background-color: cyan;"><th colspan="4">Throughput msg/ms - Before GPB Optimisation</th></tr>
<tr><th style="background-color: #cfe2f3;">Test</th><th style="background-color: #cfe2f3;">Protocol Buffers</th><th style="background-color: #cfe2f3;">SBE</th><th style="background-color: #cfe2f3;">Ratio</th></tr>
<tr><td align="right">Car Encode</td><td align="right">462.817</td><td align="right"><table border="0" cellpadding="0" cellspacing="0" style="width: 126px;"><colgroup><col width="126"></col></colgroup><tbody>
<tr height="20"><td align="right" class="xl63" height="20" style="height: 15pt; width: 95pt;" width="126"><span style="color: black;">10436.476</span></td></tr>
</tbody></table>
</td><td align="right"><table border="0" cellpadding="0" cellspacing="0" style="width: 97px;"><colgroup><col width="97"></col></colgroup><tbody>
<tr height="20"><td align="right" class="xl65" height="20" style="height: 15pt; width: 73pt;" width="97">22.52</td></tr>
</tbody></table>
</td></tr>
<tr><td align="right">Car Decode</td><td align="right">326.018</td><td align="right"><table border="0" cellpadding="0" cellspacing="0" style="width: 126px;"><colgroup><col width="126"></col></colgroup><tbody>
<tr height="20"><td align="right" class="xl65" height="20" style="height: 15pt; width: 95pt;" width="126">11657.190</td></tr>
</tbody></table>
</td><td align="right"><table border="0" cellpadding="0" cellspacing="0" style="width: 97px;"><colgroup><col width="97"></col></colgroup><tbody>
<tr height="20"><td align="right" class="xl65" height="20" style="height: 15pt; width: 73pt;" width="97">35.76</td></tr>
</tbody></table>
</td></tr>
<tr><td align="right">Market Data Encode</td><td align="right">1148.050</td><td align="right"><table border="0" cellpadding="0" cellspacing="0" style="width: 126px;"><colgroup><col width="126"></col></colgroup><tbody>
<tr height="20"><td align="right" class="xl65" height="20" style="height: 15pt; width: 95pt;" width="126">34078.646</td></tr>
</tbody></table>
</td><td align="right"><table border="0" cellpadding="0" cellspacing="0" style="width: 97px;"><colgroup><col width="97"></col></colgroup><tbody>
<tr height="20"><td align="right" class="xl65" height="20" style="height: 15pt; width: 73pt;" width="97">29.68</td></tr>
</tbody></table>
</td></tr>
<tr><td align="right">Market Data Decode</td><td align="right">1242.252</td><td align="right"><table border="0" cellpadding="0" cellspacing="0" style="width: 126px;"><colgroup><col width="126"></col></colgroup><tbody>
<tr height="20"><td align="right" class="xl65" height="20" style="height: 15pt; width: 95pt;" width="126">29193.600</td></tr>
</tbody></table>
</td><td align="right"><table border="0" cellpadding="0" cellspacing="0" style="width: 97px;"><colgroup><col width="97"></col></colgroup><tbody>
<tr height="20"><td align="right" class="xl65" height="20" style="height: 15pt; width: 73pt;" width="97">23.50</td></tr>
</tbody></table>
</td></tr>
</tbody></table>
</div>
<br />
<div align="center">
<table border="1" cellpadding="5"><tbody>
<tr style="background-color: cyan;"><th colspan="4">Throughput msg/ms - After GPB Optimisation</th></tr>
<tr><th style="background-color: #cfe2f3;">Test</th><th style="background-color: #cfe2f3;">Protocol Buffers</th><th style="background-color: #cfe2f3;">SBE</th><th style="background-color: #cfe2f3;">Ratio</th></tr>
<tr><td align="right">Car Encode</td><td align="right">619.467</td><td align="right"><table border="0" cellpadding="0" cellspacing="0" style="width: 126px;"><colgroup><col width="126"></col></colgroup><tbody>
<tr height="20"><td align="right" class="xl63" height="20" style="height: 15pt; width: 95pt;" width="126"><span style="color: black;">10436.476</span></td></tr>
</tbody></table>
</td><td align="right"><table border="0" cellpadding="0" cellspacing="0" style="width: 97px;"><colgroup><col width="97"></col></colgroup><tbody>
<tr height="20"><td align="right" class="xl65" height="20" style="height: 15pt; width: 73pt;" width="97">16.85</td></tr>
</tbody></table>
</td></tr>
<tr><td align="right">Car Decode</td><td align="right">433.711</td><td align="right"><table border="0" cellpadding="0" cellspacing="0" style="width: 126px;"><colgroup><col width="126"></col></colgroup><tbody>
<tr height="20"><td align="right" class="xl65" height="20" style="height: 15pt; width: 95pt;" width="126">11657.190</td></tr>
</tbody></table>
</td><td align="right"><table border="0" cellpadding="0" cellspacing="0" style="width: 97px;"><colgroup><col width="97"></col></colgroup><tbody>
<tr height="20"><td align="right" class="xl65" height="20" style="height: 15pt; width: 73pt;" width="97">26.88</td></tr>
</tbody></table>
</td></tr>
<tr><td align="right">Market Data Encode</td><td align="right">2088.998</td><td align="right"><table border="0" cellpadding="0" cellspacing="0" style="width: 126px;"><colgroup><col width="126"></col></colgroup><tbody>
<tr height="20"><td align="right" class="xl65" height="20" style="height: 15pt; width: 95pt;" width="126">34078.646</td></tr>
</tbody></table>
</td><td align="right"><table border="0" cellpadding="0" cellspacing="0" style="width: 97px;"><colgroup><col width="97"></col></colgroup><tbody>
<tr height="20"><td align="right" class="xl65" height="20" style="height: 15pt; width: 73pt;" width="97">16.31</td></tr>
</tbody></table>
</td></tr>
<tr><td align="right">Market Data Decode</td><td align="right">1316.123</td><td align="right"><table border="0" cellpadding="0" cellspacing="0" style="width: 126px;"><colgroup><col width="126"></col></colgroup><tbody>
<tr height="20"><td align="right" class="xl65" height="20" style="height: 15pt; width: 95pt;" width="126">29193.600</td></tr>
</tbody></table>
</td><td align="right"><table border="0" cellpadding="0" cellspacing="0" style="width: 97px;"><colgroup><col width="97"></col></colgroup><tbody>
<tr height="20"><td align="right" class="xl65" height="20" style="height: 15pt; width: 73pt;" width="97">22.18</td></tr>
</tbody></table>
</td></tr>
</tbody></table>
</div>
</div>

<b><span style="font-size: large;">Lock-Based vs Lock-Free Concurrent Algorithms</span></b><br />
<div dir="ltr" style="text-align: left;" trbidi="on">
Last week I attended a review session of the new <a href="http://g.oswego.edu/dl/concurrency-interest/">JSR166</a> <a href="http://gee.cs.oswego.edu/dl/jsr166/dist/docs/java/util/concurrent/locks/StampedLock.html">StampedLock</a> run by <a href="http://www.javaspecialists.eu/">Heinz Kabutz</a> at the excellent <a href="http://www.jcrete.org/">JCrete</a> unconference. StampedLock is an attempt to address the contention issues that arise in a system when multiple readers concurrently access shared state. StampedLock is designed to perform better than <a href="http://docs.oracle.com/javase/7/docs/api/java/util/concurrent/locks/ReentrantReadWriteLock.html">ReentrantReadWriteLock</a> by taking an optimistic read approach.<br />
<br />
While attending the session a couple of things occurred to me. Firstly, I thought it was about time I reviewed the current status of Java lock implementations. Secondly, although StampedLock looks like a good addition to the JDK, it seems to miss the fact that lock-free algorithms are often a better solution to the multiple reader case.<br />
<br />
<span style="font-family: "arial" , "helvetica" , sans-serif; font-size: large;"><b>Test Case</b></span><br />
<br />
To compare implementations I needed an API test case that would not favour a particular approach. For example, the API should be garbage free and allow the methods to be atomic. A simple test case is to design a spaceship that can be moved around a 2-dimensional space with the coordinates of its position available to be read atomically. At least 2 fields need to be read, or written, per transaction to make the concurrency interesting.<br />
<pre>/**
 * Interface to a concurrent representation of a ship that can move around
 * a 2 dimensional space with updates and reads performed concurrently.
 */
public interface Spaceship
{
    /**
     * Read the position of the spaceship into the array of coordinates provided.
     *
     * @param coordinates into which the x and y coordinates should be read.
     * @return the number of attempts made to read the current state.
     */
    int readPosition(final int[] coordinates);

    /**
     * Move the position of the spaceship by a delta to the x and y coordinates.
     *
     * @param xDelta delta by which the spaceship should be moved in the x-axis.
     * @param yDelta delta by which the spaceship should be moved in the y-axis.
     * @return the number of attempts made to write the new coordinates.
     */
    int move(final int xDelta, final int yDelta);
}
</pre>
The above API would be cleaner if an immutable Position object were factored out, but I want to keep it garbage free and create the need to update multiple internal fields with minimal indirection. This API could easily be extended for a 3-dimensional space and require the implementations to be atomic.<br />
<br />
Multiple implementations are built for each spaceship and exercised by a test harness. All the code and results for this blog can be found <a href="https://github.com/mjpt777/rw-concurrency">here</a>.<br />
<br />
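For contrast with the lock-based variants, here is a minimal sketch of a lock-free implementation in the spirit of those in the repository: both coordinates are packed into a single AtomicLong, and updates retry with compare-and-set until they succeed. This is an illustration rather than the exact code measured.<br />
<pre>import java.util.concurrent.atomic.AtomicLong;

/**
 * Hypothetical lock-free Spaceship: both coordinates are packed into one
 * 64-bit word so that reads and writes of the pair are atomic.
 */
public class LockFreeSpaceship implements Spaceship
{
    private final AtomicLong position = new AtomicLong(0L);

    public int readPosition(final int[] coordinates)
    {
        final long current = position.get();
        coordinates[0] = (int)(current >>> 32); // x in the high 32 bits
        coordinates[1] = (int)current;          // y in the low 32 bits

        return 1; // a packed read always succeeds on the first attempt
    }

    public int move(final int xDelta, final int yDelta)
    {
        int attempts = 0;
        long current;
        long update;

        do
        {
            ++attempts;
            current = position.get();
            final int x = (int)(current >>> 32) + xDelta;
            final int y = (int)current + yDelta;
            update = ((long)x << 32) | ((long)y & 0xFFFFFFFFL);
        }
        while (!position.compareAndSet(current, update));

        return attempts;
    }
}
</pre>
<br />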
The <a href="https://github.com/mjpt777/rw-concurrency/blob/master/src/PerfTest.java">test harness</a> will run each of the implementations in turn by using a <a href="http://mechanical-sympathy.blogspot.co.uk/2012/04/invoke-interface-optimisations.html">megamorphic dispatch</a> pattern to try and prevent inlining, lock-coarsening, and loop unrolling when accessing the concurrent methods.<br />
<br />
Each implementation is subjected to 4 distinct threading scenarios that result in different contention profiles:<br />
<ul style="text-align: left;">
<li>1 reader - 1 writer</li>
<li>2 readers - 1 writer</li>
<li>3 readers - 1 writer</li>
<li>2 readers - 2 writers</li>
</ul>
All tests are run with 64-bit Java 1.7.0_25, Linux 3.6.30, and a quad-core 2.2GHz Ivy Bridge i7-3632QM. Throughput is measured over 5-second periods for each implementation, with the tests repeated 5 times to ensure sufficient warm up. The results below are throughputs averaged per second over 5 runs. To approximate a typical Java deployment, no thread affinity or core isolation has been employed, though doing so would have reduced variance significantly.<br />
<br />
<b>Note:</b> Other CPUs and operating systems can produce very different results.<br />
<br />
<b><span style="font-family: "arial" , "helvetica" , sans-serif; font-size: large;">Results</span></b><br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgbYCB5dGeAADODQ73ElIkfi1ecNVo8KQncGgQvcPRKqTO1tLCGDrWUjFooCo3YH1_3ndX06ovaFUGZkwZxyulP3xvrFVE6LTs1T3mTAuF85Mo_Azedrs48Wybi7FinTpzScOgEDc4T6uU/s1600/1-reader-1-writer.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgbYCB5dGeAADODQ73ElIkfi1ecNVo8KQncGgQvcPRKqTO1tLCGDrWUjFooCo3YH1_3ndX06ovaFUGZkwZxyulP3xvrFVE6LTs1T3mTAuF85Mo_Azedrs48Wybi7FinTpzScOgEDc4T6uU/s1600/1-reader-1-writer.png" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Figure 1.</td></tr>
</tbody></table>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiGsWP1XTcX4nnEbMCHspgP7ud_SePFg2474Mx83_VTIYrpCFjFZB6pfNNRPqtZuSD_gobCkkBBVP8Vv72s26pESXS7HBKrhs0GDsPF0fKof6lLVAwKqqpsssgDO7Aj0eGEdx-8eMmzhLQ/s1600/2-readers-1-writer.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiGsWP1XTcX4nnEbMCHspgP7ud_SePFg2474Mx83_VTIYrpCFjFZB6pfNNRPqtZuSD_gobCkkBBVP8Vv72s26pESXS7HBKrhs0GDsPF0fKof6lLVAwKqqpsssgDO7Aj0eGEdx-8eMmzhLQ/s1600/2-readers-1-writer.png" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Figure 2.</td></tr>
</tbody></table>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhvcHrMCztdacsnR8d2WC1emZk3mcP9-HCqvhKBLpbei4V3utdqTKF1-HkTEWU8O4FNdunNZNdYLzlKNPGHLJWsVz5cebOFtVyFjgb-Bq6FkDCGLbmtNQE5FTe-EZjUDjtgJCiTidP9RiQ/s1600/3-readers-1-writer.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhvcHrMCztdacsnR8d2WC1emZk3mcP9-HCqvhKBLpbei4V3utdqTKF1-HkTEWU8O4FNdunNZNdYLzlKNPGHLJWsVz5cebOFtVyFjgb-Bq6FkDCGLbmtNQE5FTe-EZjUDjtgJCiTidP9RiQ/s1600/3-readers-1-writer.png" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Figure 3.</td></tr>
</tbody></table>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhXPAqdLjMxGsT00uRfKMHLtrmkIiRdr6SQ69aRGKM0nStTgWZvdhjhOAI6EZ2Bs1GR1SwExWW-CGPEYpTf7Qm0nRbROh08kC7SH3Sw9Dg18mtmHnqdqiPMDQPRmj16pntBVlR5cALRolc/s1600/2-readers-2-writers.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhXPAqdLjMxGsT00uRfKMHLtrmkIiRdr6SQ69aRGKM0nStTgWZvdhjhOAI6EZ2Bs1GR1SwExWW-CGPEYpTf7Qm0nRbROh08kC7SH3Sw9Dg18mtmHnqdqiPMDQPRmj16pntBVlR5cALRolc/s1600/2-readers-2-writers.png" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Figure 4.</td></tr>
</tbody></table>
<br />
The raw data for the above charts can be found <a href="https://github.com/mjpt777/rw-concurrency/tree/master/data">here</a>.<br />
<br />
<b><span style="font-family: "arial" , "helvetica" , sans-serif; font-size: large;">Analysis</span></b><br />
<br />
The real surprise for me from the results is the performance of ReentrantReadWriteLock. I cannot see a use for this implementation beyond a case where reads hugely outnumber writes. My main takeaways are:<br />
<ol style="text-align: left;">
<li>StampedLock is a major improvement over existing lock implementations especially with increasing numbers of reader threads.</li>
<li>StampedLock has a complex API. It is very easy to mistakenly call the wrong method for locking actions.</li>
<li>Synchronised is a good general purpose lock implementation when contention is from only 2 threads.</li>
<li>ReentrantLock is a good general purpose lock implementation when thread counts grow as <a href="http://mechanical-sympathy.blogspot.co.uk/2011/11/java-lock-implementations.html">previously discovered</a>.</li>
<li>Choosing to use ReentrantReadWriteLock should be based on careful and appropriate measurement. As with all major decisions, measure and make decisions based on data.</li>
<li>Lock-free implementations can offer significant throughput advantages over lock-based algorithms.</li>
</ol>
<b><span style="font-family: "arial" , "helvetica" , sans-serif; font-size: large;">Conclusion</span></b><br />
<br />
It is nice seeing the influence of lock-free techniques appearing in lock-based algorithms. The optimistic strategy employed on read is effectively a lock-free algorithm at the times when a writer is not updating.<br />
<br />
In my experience of <a href="http://www.real-logic.co.uk/training.html">teaching</a> and developing lock-free algorithms, not only do they provide significant throughput advantages as evidenced here, they also provide much lower latency with less variance.</div>
<b><span style="font-size: large;">Java Garbage Collection Distilled</span></b><br />
<div dir="ltr" style="text-align: left;" trbidi="on">
Serial, Parallel, Concurrent, CMS, G1, Young Gen, New Gen, Old Gen, Perm Gen, Eden, Tenured, Survivor Spaces, Safepoints, and the hundreds of JVM startup flags. Does this all baffle you when trying to tune the garbage collector to get the required throughput and latency from your Java application? If it does then do not worry, you are not alone. Documentation describing garbage collection feels like man pages for an aircraft. Every knob and dial is detailed and explained, but nowhere can you find a guide on how to fly. This article will attempt to explain the tradeoffs when choosing and tuning garbage collection algorithms for a particular workload.<br />
<br />
The focus will be on Oracle Hotspot JVM and OpenJDK collectors as those are the ones in most common usage. Towards the end other commercial JVMs will be discussed to illustrate alternatives.<br />
<br />
<b><span style="font-family: Arial, Helvetica, sans-serif; font-size: large;">The Tradeoffs</span></b><br />
<br />
Wise folk keep telling us, <i>“You do not get something for nothing”</i>. When we get something we usually have to give up something in return. When it comes to garbage collection we play with 3 major variables that set targets for the collectors:<br />
<ol style="text-align: left;">
<li><b>Throughput:</b> The amount of work done by an application as a ratio of time spent in GC. Target throughput with <span style="font-family: Courier New, Courier, monospace;">‑XX:GCTimeRatio=99</span>; 99 is the default, equating to 1% GC time.</li>
<li><b>Latency:</b> The time taken by systems in responding to events which is impacted by pauses introduced by garbage collection. Target latency for GC pauses with <span style="font-family: Courier New, Courier, monospace;">‑XX:MaxGCPauseMillis=<n></span>.</li>
<li><b>Memory:</b> The amount of memory our systems use to store state, which is often copied and moved around when being managed. The set of active objects retained by the application at any point in time is known as the Live Set. Maximum heap size <span style="font-family: Courier New, Courier, monospace;">–Xmx<n></span> is a tuning parameter for setting the heap size available to an application.</li>
</ol>
<b>Note:</b> Often Hotspot cannot achieve these targets and will silently continue without warning, having missed its target by a great margin.<br />
<br />
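To make this concrete, a hypothetical command line setting all three targets might look like the following; the values are purely illustrative and not recommendations. Setting the minimum and maximum heap sizes to the same value also avoids heap-resizing FullGCs, which are discussed later.<br />
<pre>java -Xms4g -Xmx4g -XX:GCTimeRatio=99 -XX:MaxGCPauseMillis=100 -jar MyApp.jar
</pre>
<br />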
Latency is a distribution across events. It may be acceptable to have an increased average latency to reduce the worst-case latency, or make it less frequent. We should not interpret the term “real-time” to mean the lowest possible latency; rather real-time refers to having deterministic latency regardless of throughput.<br />
<br />
For some application workloads, throughput is the most important target. An example would be a long running batch-processing job; it does not matter if a batch job is occasionally paused for a few seconds while garbage collection takes place, as long as the overall job can be completed sooner.<br />
<br />
For virtually all other workloads, from human facing interactive applications to financial trading systems, if a system goes unresponsive for anything more than a few seconds or even milliseconds in some cases, it can spell disaster. In financial trading it is often worthwhile to trade off some throughput in return for consistent latency. We may also have applications that are limited by the amount of physical memory available and have to maintain a footprint, in which case we have to give up performance on both latency and throughput fronts.<br />
<br />
Tradeoffs often play out as follows:<br />
<ul>
<li>To a large extent the cost of garbage collection, as an amortized cost, can be reduced by providing the garbage collection algorithms with more memory.</li>
<li>The observed worst-case latency-inducing pauses due to garbage collecting can be reduced by containing the live set and keeping the heap size small.</li>
<li>The frequency with which pauses occur can be reduced by managing the heap and generation sizes, and by controlling the application’s object allocation rate.</li>
<li>The frequency of large pauses can be reduced by concurrently running the GC with the application, sometimes at the expense of throughput.</li>
</ul>
<br />
<b><span style="font-family: Arial, Helvetica, sans-serif; font-size: large;">Object Lifetimes</span></b><br />
<br />
Garbage collection algorithms are often optimised with the expectation that most objects live for a very short period of time, while relatively few live for very long. In most applications, objects that live for a significant period of time tend to constitute a very small percentage of objects allocated over time. In garbage collection theory this observed behavior is often known as “<i>infant mortality</i>” or the “<i>weak generational hypothesis</i>”. For example, loop Iterators are mostly short lived whereas static Strings are effectively immortal.<br />
<br />
Experimentation has shown that generational garbage collectors can usually support an order-of-magnitude greater throughput than non-generational collectors do, and thus are almost ubiquitously used in server JVMs. By separating the generations of objects, we know that a region of newly allocated objects is likely to be very sparse for live objects. Therefore a collector that scavenges for the few live objects in this new region and copies them to another region for older objects can be very efficient. Hotspot garbage collectors record the age of an object in terms of the number of GC cycles survived.<br />
<br />
<b>Note:</b> If your application consistently generates a lot of objects that live for a fairly long time then expect your application to be spending a significant portion of its time garbage collecting, and expect to be spending a significant portion of your time tuning the Hotspot garbage collectors. This is due to the reduced GC efficiency that results when the generational “filter” is less effective, and the cost of collecting the longer-living generations more frequently. Older generations are less sparse, and as a result the efficiency of older-generation collection algorithms tends to be much lower. Generational garbage collectors tend to operate in two distinct collection cycles: Minor collections, when short-lived objects are collected, and the less frequent Major collections, when the older regions are collected.<br />
<br />
<b><span style="font-family: Arial, Helvetica, sans-serif; font-size: large;">Stop-The-World Events</span></b><br />
<br />
The pauses that applications suffer during garbage collection are due to what are known as stop-the-world events. For garbage collectors to operate it is necessary, for practical engineering reasons, to periodically stop the running application so that memory can be managed. Depending on the algorithms, different collectors will stop-the-world at specific points of execution for varying durations of time. To bring an application to a total stop it is necessary to pause all the running threads. Garbage collectors do this by signaling the threads to stop when they come to a “<i>safepoint</i>”, which is a point during program execution at which all GC roots are known and all heap object contents are consistent. Depending on what a thread is doing it may take some time to reach a safepoint. Safepoint checks are normally performed on method returns and loop back edges, but can be optimised away in some places, making safepoints dynamically rarer. For example, if a thread is copying a large array, cloning a large object, or executing a monotonic counted loop with a finite bound, it may be many milliseconds before a safepoint is reached. Time To Safepoint (TTSP) is an important consideration in low-latency applications. This time can be surfaced by enabling the <span style="font-family: Courier New, Courier, monospace;">‑XX:+PrintGCApplicationStoppedTime</span> flag in addition to the other GC flags.<br />
<br />
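For example, on Java 7 a combination of flags along the following lines will surface both GC detail and total stopped time, which includes TTSP; the application jar is, of course, a placeholder.<br />
<pre>java -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCApplicationStoppedTime -jar MyApp.jar
</pre>
<br />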
<b>Note:</b> For applications with a large number of running threads, when a stop-the-world event occurs a system will undergo significant scheduling pressure as the threads resume when released. Therefore algorithms with less reliance on stop-the-world events can potentially be more efficient.<br />
<br />
<b><span style="font-family: Arial, Helvetica, sans-serif; font-size: large;">Heap Organisation in Hotspot</span></b><br />
<br />
To understand how the different collectors operate it is best to explore how the Java heap is organised to support generational collectors.<br />
<br />
<i>Eden</i> is the region where most objects are initially allocated. The <i>survivor</i> spaces are a temporary store for objects that have survived a collection of the Eden space. Survivor space usage will be described when minor collections are discussed. Collectively Eden and the survivor spaces are known as the <i>“young”</i> or <i>“new”</i> generation.<br />
<br />
Objects that live long enough are eventually promoted to the <i>tenured</i> space.<br />
<br />
The <i>perm</i> generation is where the runtime stores objects it “knows” to be effectively immortal, such as Classes and static Strings. Unfortunately the common use of class loading on an ongoing basis in many applications makes the motivating assumption behind the perm generation wrong, i.e. that classes are immortal. In Java 7 interned Strings were moved from <i>permgen</i> to tenured, and from Java 8 the perm generation is no more and will not be discussed in this article. Most other commercial collectors do not use a separate perm space and tend to treat all long living objects as tenured.<br />
<br />
<b>Note:</b> The Virtual spaces allow the collectors to adjust the size of regions to meet throughput and latency targets. Collectors keep statistics for each collection phase and adjust the region sizes accordingly in an attempt to reach the targets.<br />
<br />
<b><span style="font-family: Arial, Helvetica, sans-serif; font-size: large;">Object Allocation</span></b><br />
<br />
To avoid contention each thread is assigned a Thread Local Allocation Buffer (TLAB) from which it allocates objects. Using TLABs allows object allocation to scale with the number of threads by avoiding contention on a single memory resource. Object allocation via a TLAB is a very cheap operation; it simply bumps a pointer by the object size, which takes roughly 10 instructions on most platforms. Heap memory allocation for Java is even cheaper than using malloc from the C runtime.<br />
<br />
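The pointer bump can be pictured with the following conceptual sketch. This is not real JVM code, just an illustration of why TLAB allocation is so cheap.<br />
<pre>// Conceptual sketch of TLAB allocation; not actual JVM code.
class Tlab
{
    private long top;       // next free address in this thread's TLAB
    private final long end; // address one past the end of the TLAB

    Tlab(final long start, final long end)
    {
        this.top = start;
        this.end = end;
    }

    long allocate(final long size)
    {
        final long result = top;
        final long newTop = result + size;
        if (newTop > end)
        {
            return -1; // slow path: request a new TLAB from Eden
        }

        top = newTop; // the "bump": allocation is just an add and a compare
        return result;
    }
}
</pre>
<br />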
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh9IjAEKsiP_C4tkg9DZ8VNHz8FqSEgleM6oruxJiJJVnDgmsQRAX0og5Qzy9BWEKzA2GZ6LRT70H-mZlpmWSuuP0Mg_gkqAngQUKmtgXHojXmubpSVp9Fqk7WtpgmdgzcwGF_bi2FDZ5k/s1600/Generations.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh9IjAEKsiP_C4tkg9DZ8VNHz8FqSEgleM6oruxJiJJVnDgmsQRAX0og5Qzy9BWEKzA2GZ6LRT70H-mZlpmWSuuP0Mg_gkqAngQUKmtgXHojXmubpSVp9Fqk7WtpgmdgzcwGF_bi2FDZ5k/s1600/Generations.png" /></a></div>
<br />
<b>Note:</b> Whereas individual object allocation is very cheap, the rate at which <i>minor</i> collections must occur is directly proportional to the rate of object allocation.<br />
<br />
When a TLAB is exhausted a thread simply requests a new one from the Eden space. When Eden has been filled a minor collection commences.<br />
<br />
Large objects (<span style="font-family: Courier New, Courier, monospace;">-XX:PretenureSizeThreshold=<n></span>) may fail to be accommodated in the young generation and thus have to be allocated in the old generation, e.g. a large array. If the threshold is set below TLAB size then objects that fit in the TLAB will not be created in the old generation. The new G1 collector handles large objects differently and will be discussed later in its own section.<br />
<br />
<b><span style="font-family: Arial, Helvetica, sans-serif; font-size: large;">Minor Collections</span></b><br />
<br />
A minor collection is triggered when Eden becomes full. This is done by copying all the live objects in the new generation to either a survivor space or the tenured space as appropriate. Copying to the tenured space is known as promotion or tenuring. Promotion occurs for objects that are sufficiently old (<span style="font-family: Courier New, Courier, monospace;">‑XX:MaxTenuringThreshold=<n></span>), or when the survivor space overflows.<br />
<br />
<i>Live</i> objects are objects that are reachable by the application; any other objects cannot be reached and can therefore be considered dead. In a minor collection, the copying of live objects is performed by first following what are known as <i>GC Roots</i>, and iteratively copying anything reachable to the survivor space. GC Roots normally include references from application and JVM-internal static fields, and from thread stack-frames, all of which effectively point to the application’s reachable object graphs.<br />
<br />
In generational collection, the GC Roots for the new generation’s reachable object graph also include any references from the old generation to the new generation. These references must also be processed to make sure all reachable objects in the new generation survive the minor collection. Identifying these cross-generational references is achieved by use of a “<i>card table</i>”. The Hotspot card table is an array of bytes in which each byte is used to track the potential existence of cross-generational references in a corresponding 512 byte region of the old generation. As references are stored to the heap, “store barrier” code will mark cards to indicate that a potential reference from the old generation to the new generation may exist in the associated 512 byte heap region. At collection time, the card table is used to scan for such cross-generational references, which effectively represent additional GC Roots into the new generation. Therefore a significant fixed cost of minor collections is directly proportional to the size of the old generation.<br />
<br />
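Conceptually, the store barrier and card marking look something like the following sketch. This is illustrative pseudo-Java, not Hotspot source; the shift by 9 bits reflects the 512-byte card size, and the heap size is a made-up constant.<br />
<pre>// Illustrative pseudo-Java for card marking; not Hotspot source.
class CardTable
{
    static final int CARD_SHIFT = 9; // 2^9 = 512 bytes per card
    static final byte DIRTY = 0;
    static final int HEAP_SIZE = 1 << 30; // hypothetical 1GB old generation

    final byte[] cards = new byte[HEAP_SIZE >>> CARD_SHIFT];

    // Executed by the "store barrier" after a reference is stored to the heap.
    void markCard(final long fieldAddress)
    {
        cards[(int)(fieldAddress >>> CARD_SHIFT)] = DIRTY;
    }
}
</pre>
<br />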
There are two survivor spaces in the Hotspot new generation, which alternate in their “<i>to-space</i>” and “<i>from-space</i>” roles. At the beginning of a minor collection, the to-space survivor space is always empty, and acts as a target copy area for the minor collection. The previous minor collection’s target survivor space is part of the from-space, which also includes Eden, where live objects that need to be copied may be found.<br />
<br />
The cost of a minor GC collection is usually dominated by the cost of copying objects to the survivor and tenured spaces. Objects that do not survive a minor collection are effectively free to be dealt with. The work done during a minor collection is directly proportional to the number of live objects found, and not to the size of the new generation. The total time spent doing minor collections can almost be halved each time the Eden size is doubled. Memory can therefore be traded for throughput. A doubling of Eden size will result in an increase in collection time per collection cycle, but this increase is relatively small if both the number of objects being promoted and the size of the old generation are constant.<br />
<br />
<b>Note:</b> In Hotspot minor collections are stop-the-world events. This is rapidly becoming a major issue as our heaps get larger with more live objects. We are already starting to see the need for concurrent collection of the young generation to reach pause-time targets.<br />
<br />
<b><span style="font-family: Arial, Helvetica, sans-serif; font-size: large;">Major Collections</span></b><br />
<br />
Major collections collect the <i>old</i> generation so that objects can be promoted from the <i>young</i> generation. In most applications, the vast majority of program state ends up in the old generation. The greatest variety of GC algorithms exists for the old generation. Some will compact the whole space when it fills, whereas others will collect <i>concurrently</i> with the application in an effort to prevent it from filling up.<br />
<br />
The old generation collector will try to predict when it needs to collect to avoid a promotion failure from the young generation. The collectors track a fill threshold for the old generation and begin collection when this threshold is passed. If this threshold is not sufficient to meet promotion requirements then a “<i>FullGC</i>” is triggered. A FullGC involves promoting all live objects from the young generations followed by a collection and compaction of the old generation. Promotion failure is a very expensive operation as state and promoted objects from this cycle must be unwound so the FullGC event can occur.<br />
<br />
<b>Note:</b> To avoid promotion failure you will need to tune the padding that the old generation allows to accommodate promotions (<span style="font-family: Courier New, Courier, monospace;">‑XX:PromotedPadding=<n></span>).<br />
<br />
<b>Note:</b> When the heap needs to grow, a FullGC is triggered. These heap-resizing FullGCs can be avoided by setting <span style="font-family: Courier New, Courier, monospace;">–Xms</span> and <span style="font-family: Courier New, Courier, monospace;">–Xmx</span> to the same value.<br />
<br />
Other than a FullGC, a compaction of the old generation is likely to be the largest stop-the-world pause an application will experience. The time for this compaction tends to grow linearly with the number of live objects in the tenured space.<br />
<br />
The rate at which the tenured space fills up can sometimes be reduced by increasing the size of the survivor spaces and the age of objects before promotion to the tenured generation. However, increasing the survivor space sizes and object age before promotion (<span style="font-family: Courier New, Courier, monospace;">–XX:MaxTenuringThreshold=<n></span>) can also increase the cost and pause times of the minor collections, due to the increased copying between survivor spaces on each minor collection.<br />
<br />
<b><span style="font-family: Arial, Helvetica, sans-serif; font-size: large;">Serial Collector</span></b><br />
<br />
The Serial collector (<span style="font-family: Courier New, Courier, monospace;">-XX:+UseSerialGC</span>) is the simplest collector and is a good option for single processor systems. It also has the smallest footprint of any collector. It uses a single thread for both minor and major collections. Objects are allocated in the tenured space using a simple bump the pointer algorithm. Major collections are triggered when the tenured space is full.<br />
<br />
<b><span style="font-family: Arial, Helvetica, sans-serif; font-size: large;">Parallel Collector</span></b><br />
<br />
The Parallel collector comes in two forms. The <i>Parallel collector</i> (<span style="font-family: Courier New, Courier, monospace;">‑XX:+UseParallelGC</span>) uses multiple threads to perform minor collections of the young generation, and a single thread for major collections on the old generation. The <i>Parallel Old collector</i> (<span style="font-family: Courier New, Courier, monospace;">‑XX:+UseParallelOldGC</span>), the default since Java 7u4, uses multiple threads for minor collections and multiple threads for major collections. Objects are allocated in the tenured space using a simple bump-the-pointer algorithm. Major collections are triggered when the tenured space is full.<br />
<br />
On multiprocessor systems the Parallel Old collector will give the greatest throughput of any collector. It has no impact on a running application until a collection occurs, and will then collect in parallel, using multiple threads and the most efficient algorithm. This makes the Parallel Old collector very suitable for batch applications.<br />
<br />
The cost of collecting the old generations is affected by the number of objects to retain to a greater extent than by the size of the heap. Therefore the efficiency of the Parallel Old collector can be increased to achieve greater throughput by providing more memory and accepting larger, but fewer, collection pauses.<br />
<br />
Expect the fastest minor collections with this collector because the promotion to tenured space is a simple bump the pointer and copy operation.<br />
<br />
For server applications the Parallel Old collector should be the first port-of-call. However if the major collection pauses are more than your application can tolerate then you need to consider employing a concurrent collector that collects the tenured objects concurrently while the application is running.<br />
<br />
<b>Note:</b> Expect pauses in the order of one to five seconds per GB of live data on modern hardware while the old generation is compacted.<br />
<br />
<b>Note:</b> The parallel collector can sometimes gain performance benefits from <span style="font-family: Courier New, Courier, monospace;">-XX:+UseNUMA</span> on multi-socket CPU server applications by allocating Eden memory for threads local to the CPU socket. It is a shame this feature is not available to the other collectors.<br />
<br />
<b><span style="font-family: Arial, Helvetica, sans-serif; font-size: large;">Concurrent Mark Sweep (CMS) Collector</span></b><br />
<br />
The CMS (<span style="font-family: Courier New, Courier, monospace;">-XX:+UseConcMarkSweepGC</span>) collector runs in the Old generation collecting tenured objects that are no longer reachable during a major collection. It runs concurrently with the application with the goal of keeping sufficient free space in the old generation so that a promotion failure from the young generation does not occur.<br />
<br />
Promotion failure will trigger a FullGC. CMS follows a multistep process:<br />
<ol>
<li><i>Initial Mark</i> (stop-the-world): Find GC Roots.</li>
<li><i>Concurrent Mark</i>: Mark all reachable objects from the GC Roots.</li>
<li><i>Concurrent Pre-clean</i>: Re-mark to account for object references that have been updated, and objects that have been promoted, during the concurrent mark phase.</li>
<li><i>Re-mark</i> (stop-the-world): Capture object references that have been updated since the Pre-clean stage.</li>
<li><i>Concurrent Sweep</i>: Update the free-lists by reclaiming memory occupied by dead objects.</li>
<li><i>Concurrent Reset</i>: Reset data structures for next run.</li>
</ol>
As tenured objects become unreachable, the space is reclaimed by CMS and put on free-lists. When promotion occurs, the free-lists must be searched for a suitably sized hole for the promoted object. This increases the cost of promotion, and thus the cost of the minor collections, compared to the Parallel collector.<br />
<br />
<b>Note</b>: CMS is not a compacting collector, which over time can result in old generation fragmentation. Object promotion can fail because a large object may not fit in the available holes in the old generation. When this happens a “<i>promotion failed</i>” message is logged and a FullGC is triggered to compact the live tenured objects. For such compaction-driven FullGCs, expect pauses to be worse than major collections using the Parallel Old collector because CMS uses only a single thread for compaction.<br />
<br />
CMS is mostly concurrent with the application, which has a number of implications. First, CPU time is taken by the collector, thus reducing the CPU available to the application. The amount of time required by CMS grows linearly with the amount of object promotion to the tenured space. Second, for some phases of the concurrent GC cycle, all application threads have to be brought to a safepoint for marking GC Roots and performing a parallel re-mark to check for mutation.<br />
<br />
<b>Note</b>: If an application sees significant mutation of tenured objects then the re-mark phase can be significant; at the extremes it may take longer than a full compaction with the Parallel Old collector.<br />
<br />
CMS makes FullGC a less frequent event at the expense of reduced throughput, more expensive minor collections, and greater footprint. The reduction in throughput can be anywhere from 10% to 40% compared to the Parallel collector, depending on promotion rate. CMS also requires a 20% greater footprint to accommodate additional data structures and the “floating garbage” that is missed during the concurrent marking and carried over to the next cycle.<br />
<br />
High promotion rates and resulting fragmentation can sometimes be reduced by increasing the size of both the young and old generation spaces.<br />
<br />
<b>Note</b>: CMS can suffer “<i>concurrent mode failures</i>”, which can be seen in the logs, when it fails to collect at a sufficient rate to keep up with promotion. This can happen when the collection commences too late, which can sometimes be addressed by tuning, but it can also occur when the collection rate cannot keep up with the high promotion rate or the high object mutation rate of some applications. If the promotion or mutation rate is too high then your application might require changes to reduce the promotion pressure. Adding more memory to such a system can sometimes make the situation worse, as CMS would then have more memory to scan.<br />
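<br />
For example, the concurrent cycle can be started earlier than the default heuristics would choose with the flags below. The occupancy fraction shown is illustrative only and should be found by experiment for your own workload:<br />
<pre>java -XX:+UseConcMarkSweepGC \
     -XX:CMSInitiatingOccupancyFraction=70 \
     -XX:+UseCMSInitiatingOccupancyOnly \
     MyApplication</pre>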
<br />
<b><span style="font-family: Arial, Helvetica, sans-serif; font-size: large;">Garbage First (G1) Collector</span></b><br />
<br />
G1 (<span style="font-family: Courier New, Courier, monospace;">-XX:+UseG1GC</span>) is a new collector introduced in Java 6 and now officially supported as of Java 7u4. It is a partially concurrent collecting algorithm that also tries to compact the tenured space in smaller incremental stop-the-world pauses, to minimize the FullGC events that plague CMS because of fragmentation. G1 is a generational collector that organizes the heap differently from the other collectors by dividing it into a large number (~2000) of fixed-size regions of variable purpose, rather than contiguous regions for the same purpose.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi6TrWF9G2VgbvrlwoZqA0EJiZQfgsGBibmcllPSWQ0HYKfyuweOGiHWg5f13gnaNe9iqGlHcC56s67mD3DFXAxM7uKXfeZvm365zyCmVAaUJ6euU2qbQy91SsfPTjTpJW2vOZVeiG_15w/s1600/G1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi6TrWF9G2VgbvrlwoZqA0EJiZQfgsGBibmcllPSWQ0HYKfyuweOGiHWg5f13gnaNe9iqGlHcC56s67mD3DFXAxM7uKXfeZvm365zyCmVAaUJ6euU2qbQy91SsfPTjTpJW2vOZVeiG_15w/s1600/G1.png" /></a></div>
<br />
G1 takes the approach of concurrently marking regions to track references between regions, and focusing collection on the regions that will yield the most free space. These regions are then collected in stop-the-world pause increments by <i>evacuating</i> the live objects to an empty region, thus compacting in the process. The regions to be collected in a cycle are known as the <i>Collection Set</i>.<br />
<br />
Objects larger than 50% of a region are allocated in humongous regions, which are a multiple of region size. Allocation and collection of humongous objects can be very costly under G1, and to date has had little or no optimisation effort applied.<br />
<br />
The challenge with any compacting collector is not the moving of objects but the updating of references to those objects. If an object is referenced from many regions then updating those references can take significantly longer than moving the object. G1 tracks which objects in a region have references from other regions via the “<i>Remembered Sets</i>”. Remembered Sets are collections of cards that have been marked for mutation. If the Remembered Sets become large then G1 can significantly slow down. When evacuating objects from one region to another, the length of the associated stop-the-world event tends to be proportional to the number of regions with references that need to be scanned and potentially patched.<br />
<br />
Maintaining the Remembered Sets increases the cost of minor collections, resulting in minor collection pauses greater than those seen with the Parallel Old or CMS collectors.<br />
<br />
G1 is driven by a latency target (<span style="font-family: Courier New, Courier, monospace;">-XX:MaxGCPauseMillis=<n></span>, default 200ms). The target will influence the amount of work done on each cycle on a best-efforts only basis. Setting targets in the tens of milliseconds is mostly futile; as of this writing, targeting tens of milliseconds has not been a focus of G1.<br />
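<br />
For example, the pause target and region size can be set as follows. Both values are illustrative and should be validated against measured pause times for your own application:<br />
<pre>java -XX:+UseG1GC \
     -XX:MaxGCPauseMillis=100 \
     -XX:G1HeapRegionSize=4m \
     MyApplication</pre>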
<br />
G1 is a good general-purpose collector for larger heaps that have a tendency to become fragmented when an application can tolerate pauses in the 0.5-1.0 second range for incremental compactions. G1 tends to reduce the frequency of the worst-case pauses seen by CMS because of fragmentation at the cost of extended minor collections and incremental compactions of the old generation. Most pauses end up being constrained to regional rather than full heap compactions.<br />
<br />
Like CMS, G1 can also fail to keep up with promotion rates, and will fall back to a stop-the-world FullGC. Just like CMS has “<i>concurrent mode failure</i>”, G1 can suffer an evacuation failure, seen in the logs as “<i>to-space overflow</i>”. This occurs when there are no free regions into which objects can be evacuated, which is similar to a promotion failure. If this occurs, try using a larger heap and more marking threads, but in some cases application changes may be necessary to reduce allocation rates.<br />
<br />
A challenging problem for G1 is dealing with popular objects and regions. Incremental stop-the-world compaction works well when regions have live objects that are not heavily referenced from other regions. If an object or region is popular then the Remembered Set will be large, and G1 will try to avoid collecting those objects. Eventually it can have no choice, which results in very frequent mid-length pauses as the heap gets compacted.<br />
<br />
<b><span style="font-family: Arial, Helvetica, sans-serif; font-size: large;">Alternative Concurrent Collectors</span></b><br />
<br />
CMS and G1 are often called mostly concurrent collectors. When you look at the total work performed it is clear that the young generation, promotion, and even much of the old generation work is not concurrent at all. CMS is mostly concurrent for the old generation; G1 is much more of a stop-the-world incremental collector. Both CMS and G1 have significant and regularly occurring stop-the-world events, and worst-case scenarios that often make them unsuitable for strict low-latency applications, such as financial trading or reactive user interfaces.<br />
<br />
Alternative collectors are available such as Oracle JRockit Real Time, IBM Websphere Real Time, and Azul Zing. The JRockit and Websphere collectors have latency advantages in most cases over CMS and G1 but often see throughput limitations and still suffer significant stop-the-world events. Zing is the only Java collector known to this author that can be truly concurrent for collection and compaction while maintaining a high throughput rate for all generations. Zing does have some sub-millisecond stop-the-world events but these are for phase shifts in the collection cycle that are not related to live object set size.<br />
<br />
JRockit RT can achieve typical pause times in the tens of milliseconds for high allocation rates at contained heap sizes but occasionally has to fail back to full compaction pauses. Websphere RT can achieve single-digit millisecond pause times via constrained allocation rates and live set sizes. Zing can achieve sub-millisecond pauses with high allocation rates by being concurrent for all phases, including during minor collections. Zing is able to maintain this consistent behavior regardless of heap size, allowing the user to apply large heap sizes as needed for keeping up with application throughput or object model state needs, without fear of increased pause times.<br />
<br />
For all the concurrent collectors targeting latency, you have to give up some throughput and gain footprint. Depending on the efficiency of the concurrent collector you may give up only a little throughput, but you are always adding significant footprint. If truly concurrent, with few stop-the-world events, then more CPU cores are needed to enable the concurrent operation and maintain throughput.<br />
<br />
<b>Note:</b> All the concurrent collectors tend to function more efficiently when sufficient space is allocated. As a starting-point rule of thumb, you should budget a heap of at least two to three times the size of the live set for efficient operation. However, the space required for maintaining concurrent operation grows with application throughput, and the associated allocation and promotion rates. So for higher throughput applications a higher heap-size to live-set ratio may be warranted. Given the huge memory spaces available to today’s systems, footprint is seldom an issue on the server side.<br />
<br />
<b><span style="font-family: Arial, Helvetica, sans-serif; font-size: large;">Garbage Collection Monitoring & Tuning</span></b><br />
<br />
To understand how your application and garbage collector are behaving, start your JVM with at least the following settings:<br />
<pre>-verbose:gc
-Xloggc:<filename>
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintTenuringDistribution
-XX:+PrintGCApplicationConcurrentTime
-XX:+PrintGCApplicationStoppedTime</pre>
<br />
Then load the logs into a tool like <a href="https://github.com/chewiebug/GCViewer">GCViewer</a> for analysis.<br />
<br />
To see the dynamic nature of GC, launch JVisualVM and install the Visual GC plugin. This will enable you to see the GC in action for your application as below.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEisC7NL_3tfPgoR0MnSjDqwma3NG70bJT48kI_Ly10vr_EEkNU-pxLwNT2UQU_ehpOvigukknXCtU-bCgdhNHEPAi0fwFgXDv0ovhiQ8i48c-t8lknES_Xhyphenhyphenr46r-DSE3YBOoNYWfD3dh0/s1600/Figure3.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEisC7NL_3tfPgoR0MnSjDqwma3NG70bJT48kI_Ly10vr_EEkNU-pxLwNT2UQU_ehpOvigukknXCtU-bCgdhNHEPAi0fwFgXDv0ovhiQ8i48c-t8lknES_Xhyphenhyphenr46r-DSE3YBOoNYWfD3dh0/s1600/Figure3.JPG" /></a></div>
<br />
To get an understanding of your applications’ GC needs, you need representative load tests that can be executed repeatedly. As you get to grips with how each of the collectors works, run your load tests with different configurations as experiments until you reach your throughput and latency targets. It is important to measure latency from the end-user perspective. This can be achieved by capturing the response time of every test request in a histogram, e.g. <a href="https://github.com/giltene/HdrHistogram">HdrHistogram</a> or <a href="https://github.com/LMAX-Exchange/disruptor/blob/master/src/main/java/com/lmax/disruptor/collections/Histogram.java">Disruptor Histogram</a>. If you have latency spikes that are outside your acceptable range, then try to correlate these with the GC logs to determine if GC is the issue. It is possible other issues may be causing latency spikes. Another useful tool to consider is <a href="http://www.jhiccup.com/">jHiccup</a> which can be used to track pauses within the JVM and across a system as a whole. Measure your idle systems for a few hours with jHiccup and you will often be very surprised.<br />
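<br />
As a minimal sketch of the histogram approach, using HdrHistogram’s published API (the class, loop, value range, and the <span style="font-family: Courier New, Courier, monospace;">doTestRequest()</span> stand-in below are mine, purely illustrative):<br />
<pre>import org.HdrHistogram.Histogram;

public class LatencyCapture
{
    // Track values from 1ns up to 1 hour with 3 significant digits.
    private static final Histogram HISTOGRAM =
        new Histogram(3600L * 1000 * 1000 * 1000, 3);

    public static void main(final String[] args)
    {
        for (int i = 0; i < 1000; i++)
        {
            final long start = System.nanoTime();
            doTestRequest(); // stand-in for a real request against the system under test
            HISTOGRAM.recordValue(System.nanoTime() - start);
        }

        // Print the percentile distribution scaled to microseconds.
        HISTOGRAM.outputPercentileDistribution(System.out, 1000.0);
    }

    private static void doTestRequest()
    {
        Thread.yield(); // hypothetical work
    }
}</pre>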
<br />
If latency spikes are due to GC then invest in tuning CMS or G1 to see if your latency targets can be met. Sometimes this may not be possible because of high allocation and promotion rates combined with low-latency requirements. GC tuning can become a highly skilled exercise that often requires application changes to reduce object allocation rates or object lifetimes. If this is the case then a commercial trade-off may be required between the time and resources spent on GC tuning and application changes, versus purchasing one of the commercial concurrent compacting JVMs such as JRockit Real Time or Azul Zing.</div>
Martin Thompsonhttp://www.blogger.com/profile/15893849163924476586noreply@blogger.com22London, UK51.511213899999987 -0.1198243999999704151.195100899999986 -0.7652713999999704 51.827326899999989 0.52562260000002958tag:blogger.com,1999:blog-5560209661389175529.post-84431908785726665642013-06-27T20:24:00.002+01:002022-08-17T11:33:40.938+01:00Printing Generated Assembly Code From The Hotspot JIT Compiler<div dir="ltr" style="text-align: left;" trbidi="on">
Sometimes when profiling a Java application it is necessary to understand the assembly code generated by the Hotspot JIT compiler. This can be useful in determining what optimisation decisions have been made and how our code changes can affect the generated assembly code. It is also useful at times to know what instructions are emitted when debugging a concurrent algorithm, to ensure visibility rules have been applied as expected. I have found quite a few bugs in various JVMs this way. <br />
<br />
This blog illustrates how to install a <a href="https://kenai.com/projects/base-hsdis/downloads">Disassembler Plugin</a> and provides command line options for targeting a particular method.<br />
<br />
<span style="font-family: Arial, Helvetica, sans-serif; font-size: large;"><b>Installation</b></span><br />
<br />
Previously it was necessary to obtain a debug build for printing the assembly code generated by the Hotspot JIT for the Oracle/SUN JVM. Since Java 7, it has been possible to print the generated assembly code if a <a href="https://kenai.com/projects/base-hsdis/downloads">Disassembler Plugin</a> is installed in a standard Oracle Hotspot JVM. To install the plugin for 64-bit Linux follow the steps below:<br />
<ol style="text-align: left;">
<li>Download the appropriate binary, or build from source, from <a href="https://kenai.com/projects/base-hsdis/downloads">https://kenai.com/projects/base-hsdis/downloads</a></li>
<li>On Linux rename <span style="font-family: Courier New, Courier, monospace;">linux-hsdis-amd64.so</span> to <span style="font-family: Courier New, Courier, monospace;">libhsdis-amd64.so</span></li>
<li>Copy the shared library to <span style="font-family: Courier New, Courier, monospace;">$JAVA_HOME/jre/lib/amd64/server</span></li>
</ol>
You now have the plugin installed!
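<br />
<br />
On a 64-bit Linux box the steps might look as follows; the paths assume a standard Oracle JDK 7 layout and are illustrative:<br />
<pre>$ mv linux-hsdis-amd64.so libhsdis-amd64.so
$ cp libhsdis-amd64.so $JAVA_HOME/jre/lib/amd64/server/</pre>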
<br />
<br />
<span style="font-family: Arial, Helvetica, sans-serif; font-size: large;"><b>Test Program</b></span><br />
<br />
To test the plugin we need some code that is both interesting to a programmer and executes sufficiently hot to be optimised by the JIT. Some details of when the JIT will optimise can be found <a href="http://mechanical-sympathy.blogspot.co.uk/2011/11/biased-locking-osr-and-benchmarking-fun.html">here</a>. The code below can be used to measure the average latency between two threads by reading and writing <span style="font-family: Courier New, Courier, monospace;">volatile</span> fields. These <span style="font-family: Courier New, Courier, monospace;">volatile</span> fields are interesting because they require associated hardware <a href="http://mechanical-sympathy.blogspot.co.uk/2011/07/memory-barriersfences.html">fences</a> to honour the <a href="http://g.oswego.edu/dl/jmm/cookbook.html">Java Memory Model</a>.<br />
<pre>import static java.lang.System.out;
public class InterThreadLatency
{
private static final int REPETITIONS = 100 * 1000 * 1000;
private static volatile int ping = -1;
private static volatile int pong = -1;
public static void main(final String[] args)
throws Exception
{
for (int i = 0; i < 5; i++)
{
final long duration = runTest();
out.printf("%d - %dns avg latency - ping=%d pong=%d\n",
i,
duration / (REPETITIONS * 2),
ping,
pong);
}
}
private static long runTest() throws InterruptedException
{
final Thread pongThread = new Thread(new PongRunner());
final Thread pingThread = new Thread(new PingRunner());
pongThread.start();
pingThread.start();
final long start = System.nanoTime();
pongThread.join();
return System.nanoTime() - start;
}
public static class PingRunner implements Runnable
{
public void run()
{
for (int i = 0; i < REPETITIONS; i++)
{
ping = i;
while (i != pong)
{
// busy spin
}
}
}
}
public static class PongRunner implements Runnable
{
public void run()
{
for (int i = 0; i < REPETITIONS; i++)
{
while (i != ping)
{
// busy spin
}
pong = i;
}
}
}
}
</pre>
<b><span style="font-family: Arial, Helvetica, sans-serif; font-size: large;">Printing Assembly Code</span></b><br />
<br />
It is possible to print all generated assembly code with the following command.<br />
<br />
<span style="font-family: Courier New, Courier, monospace;">java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly InterThreadLatency</span><br />
<br />
However, this can put you in the situation of not being able to see the forest for the trees. It is generally much more useful to target a particular method. For this test, the <span style="font-family: Courier New, Courier, monospace;">run()</span> method will be optimised and generated twice by Hotspot: once for the OSR (on-stack replacement) version, and then again for the standard JIT version. The standard JIT version follows.<br />
<br />
<span style="font-family: Courier New, Courier, monospace;">java -XX:+UnlockDiagnosticVMOptions '-XX:CompileCommand=print,*PongRunner.run' InterThreadLatency</span><br />
<br />
<span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;">Compiled method (c2) 10531 5 InterThreadLatency$PongRunner::run (30 bytes)</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> total in heap [0x00007fed81060850,0x00007fed81060b30] = 736</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> relocation [0x00007fed81060970,0x00007fed81060980] = 16</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> main code [0x00007fed81060980,0x00007fed81060a00] = 128</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> stub code [0x00007fed81060a00,0x00007fed81060a18] = 24</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> oops [0x00007fed81060a18,0x00007fed81060a30] = 24</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> scopes data [0x00007fed81060a30,0x00007fed81060a78] = 72</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> scopes pcs [0x00007fed81060a78,0x00007fed81060b28] = 176</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> dependencies [0x00007fed81060b28,0x00007fed81060b30] = 8</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">Decoding compiled method 0x00007fed81060850:</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">Code:</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[Entry Point]</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[Constants]</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> # {method} 'run' '()V' in 'InterThreadLatency$PongRunner'</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> # [sp+0x20] (sp of caller)</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> 0x00007fed81060980: mov 0x8(%rsi),%r10d</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> 0x00007fed81060984: shl $0x3,%r10</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> 0x00007fed81060988: cmp %r10,%rax</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> 0x00007fed8106098b: jne 0x00007fed81037a60 ; {runtime_call}</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> 0x00007fed81060991: xchg %ax,%ax</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> 0x00007fed81060994: nopl 0x0(%rax,%rax,1)</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> 0x00007fed8106099c: xchg %ax,%ax</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[Verified Entry Point]</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> 0x00007fed810609a0: sub $0x18,%rsp</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> 0x00007fed810609a7: mov %rbp,0x10(%rsp) ;*synchronization entry</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> ; - InterThreadLatency$PongRunner::run@-1 (line 58)</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> 0x00007fed810609ac: xor %r11d,%r11d</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> 0x00007fed810609af: mov $0x7ad0fcbf0,%r10 ; {oop(a 'java/lang/Class' = 'InterThreadLatency')}</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> 0x00007fed810609b9: jmp 0x00007fed810609d0</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> 0x00007fed810609bb: nopl 0x0(%rax,%rax,1) ; OopMap{r10=Oop off=64}</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> ;*goto</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> ; - InterThreadLatency$PongRunner::run@15 (line 60)</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> 0x00007fed810609c0: test %eax,0xaa1663a(%rip) # 0x00007fed8ba77000</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> ;*goto</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> ; - InterThreadLatency$PongRunner::run@15 (line 60)</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> ; {poll}</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> 0x00007fed810609c6: nopw 0x0(%rax,%rax,1) ;*iload_1</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> ; - InterThreadLatency$PongRunner::run@8 (line 60)</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> 0x00007fed810609d0: mov 0x74(%r10),%r9d ;*getstatic ping</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> ; - InterThreadLatency::access$000@0 (line 3)</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> ; - InterThreadLatency$PongRunner::run@9 (line 60)</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> 0x00007fed810609d4: cmp %r9d,%r11d</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> 0x00007fed810609d7: jne 0x00007fed810609c0</span><br />
<span style="color: #cc0000; font-family: Courier New, Courier, monospace; font-size: x-small;"><b> 0x00007fed810609d9: mov %r11d,0x78(%r10)</b></span><br />
<span style="color: #cc0000; font-family: Courier New, Courier, monospace; font-size: x-small;"><b> 0x00007fed810609dd: lock addl $0x0,(%rsp) ;*putstatic pong</b></span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> ; - InterThreadLatency::access$102@2 (line 3)</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> ; - InterThreadLatency$PongRunner::run@19 (line 65)</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> 0x00007fed810609e2: inc %r11d ;*iinc</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> ; - InterThreadLatency$PongRunner::run@23 (line 58)</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> 0x00007fed810609e5: cmp $0x5f5e100,%r11d</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> 0x00007fed810609ec: jl 0x00007fed810609d0 ;*if_icmpeq</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> ; - InterThreadLatency$PongRunner::run@12 (line 60)</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> 0x00007fed810609ee: add $0x10,%rsp</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> 0x00007fed810609f2: pop %rbp</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> 0x00007fed810609f3: test %eax,0xaa16607(%rip) # 0x00007fed8ba77000</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> ; {poll_return}</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> 0x00007fed810609f9: retq ;*iload_1</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> ; - InterThreadLatency$PongRunner::run@8 (line 60)</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> 0x00007fed810609fa: hlt </span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> 0x00007fed810609fb: hlt </span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> 0x00007fed810609fc: hlt </span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> 0x00007fed810609fd: hlt </span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> 0x00007fed810609fe: hlt </span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> 0x00007fed810609ff: hlt </span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[Exception Handler]</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[Stub Code]</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> 0x00007fed81060a00: jmpq 0x00007fed8105eaa0 ; {no_reloc}</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[Deopt Handler Code]</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> 0x00007fed81060a05: callq 0x00007fed81060a0a</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> 0x00007fed81060a0a: subq $0x5,(%rsp)</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> 0x00007fed81060a0f: jmpq 0x00007fed81038c00 ; {runtime_call}</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> 0x00007fed81060a14: hlt </span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> 0x00007fed81060a15: hlt </span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> 0x00007fed81060a16: hlt </span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> 0x00007fed81060a17: hlt </span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">OopMapSet contains 1 OopMaps</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"><br /></span>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">#0 </span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">OopMap{r10=Oop off=64}</span><br />
<div>
<br /></div>
<b><span style="font-family: Arial, Helvetica, sans-serif; font-size: large;">An Interesting Observation</span></b><br />
<br />
The red highlighted lines of assembly code above are very interesting. When a <span style="font-family: Courier New, Courier, monospace;">volatile</span> field is written, under the Java Memory Model the write must be <a href="http://en.wikipedia.org/wiki/Sequential_consistency">sequentially consistent</a>, i.e. not appear to be reordered due to optimisations normally applied, such as staging the write to the <a href="http://mechanical-sympathy.blogspot.co.uk/2013/02/cpu-cache-flushing-fallacy.html">store buffer</a>. This can be achieved by inserting the appropriate memory barriers. In the case above, Hotspot has chosen to enforce the ordering by issuing a MOV instruction (register to memory address, i.e. the write) followed by a LOCK ADD instruction (adding zero to the memory at the stack pointer as a fencing idiom) that has ordering semantics. This could be less than ideal on an x86 processor. The same action could have been performed more efficiently and correctly with a single LOCK XCHG instruction for the write. This makes me wonder if there are some significant compromises in the JVM to make it portable across many architectures, rather than be the best it can be on x86.</div>
Martin Thompsonhttp://www.blogger.com/profile/15893849163924476586noreply@blogger.com5London, UK51.511213899999987 -0.1198243999999704151.195100899999986 -0.7652713999999704 51.827326899999989 0.52562260000002958tag:blogger.com,1999:blog-5560209661389175529.post-29779577895263650952013-02-14T12:22:00.000+00:002013-04-12T11:17:13.836+01:00CPU Cache Flushing Fallacy<div dir="ltr" style="text-align: left;" trbidi="on">
Even from highly experienced technologists I often hear talk about how certain operations cause a CPU cache to "flush". This seems to be illustrating a very common fallacy about how CPU caches work, and how the cache sub-system interacts with the execution cores. In this article I will attempt to explain the function CPU caches fulfil, and how the cores, which execute our programs of instructions, interact with them. For a concrete example I will dive into one of the latest Intel x86 server CPUs. Other CPUs use similar techniques to achieve the same ends.<br />
<br />
Most modern systems that execute our programs are shared-memory multi-processor systems in design. A shared-memory system has a single memory resource that is accessed by 2 or more independent CPU cores. Latency to main memory is highly variable from 10s to 100s of nanoseconds. Within 100ns it is possible for a 3.0GHz CPU to process up to 1200 instructions. Each Sandy Bridge core is capable of retiring up to 4 instructions-per-cycle (IPC) in parallel. CPUs employ cache sub-systems to hide this latency and allow them to exercise their huge capacity to process instructions. Some of these caches are small, very fast, and local to each core; others are slower, larger, and shared across cores. Together with registers and main-memory, these caches make up our non-persistent memory hierarchy.<br />
<br />
Next time you are developing an important algorithm, try pondering that a cache-miss is a lost opportunity to have executed ~500 CPU instructions! This is for a single-socket system, on a multi-socket system you can effectively double the lost opportunity as memory requests cross socket interconnects.<br />
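<br />
To make that lost opportunity tangible, below is a toy experiment of my own, not from this article: it chases pointers through a random cyclic permutation so that every load depends on the previous one and the hardware prefetcher cannot hide the misses, which should put the cost of each dependent load close to memory latency:<br />
<pre>import java.util.Random;

public class CacheMissCost
{
    public static void main(final String[] args)
    {
        final int size = 1 << 24; // 16M ints (~64MB), far larger than L3
        final int[] next = new int[size];
        for (int i = 0; i < size; i++)
        {
            next[i] = i;
        }

        // Sattolo's algorithm: shuffle into a single cycle visiting every slot
        final Random random = new Random(42);
        for (int i = size - 1; i > 0; i--)
        {
            final int j = random.nextInt(i);
            final int tmp = next[i];
            next[i] = next[j];
            next[j] = tmp;
        }

        int index = 0;
        final long start = System.nanoTime();
        for (int i = 0; i < size; i++)
        {
            index = next[index]; // dependent load, likely a cache-miss every time
        }
        final long duration = System.nanoTime() - start;

        System.out.printf("%.1f ns per dependent load (index=%d)%n",
            (double)duration / size, index);
    }
}</pre>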
<br />
<span style="font-family: Arial, Helvetica, sans-serif; font-size: large;"><b>Memory Hierarchy</b></span><br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjzOHcrRgbOtzCf_4F5-0EqM21hCGy-rvSvPQVg4QmGhtq9TgG-tX04mtFIMMB38l2o9leCI1p09NMcY0Sbjvgplozc_zKTRzaq7nt_Fzkx4TqtJy31WBPf5u1QCy9XjDyeZ2xlBR2Sw28/s1600/MemoryHeirarchy.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="441" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjzOHcrRgbOtzCf_4F5-0EqM21hCGy-rvSvPQVg4QmGhtq9TgG-tX04mtFIMMB38l2o9leCI1p09NMcY0Sbjvgplozc_zKTRzaq7nt_Fzkx4TqtJy31WBPf5u1QCy9XjDyeZ2xlBR2Sw28/s640/MemoryHeirarchy.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Figure 1.</td></tr>
</tbody></table>
<br />
For the circa 2012 Sandy Bridge E class servers our memory hierarchy can be decomposed as follows:<br />
<ol style="text-align: left;">
<li><b>Registers</b>: Within each core are separate register files containing 160 entries for integers and 144 floating point numbers. These registers are accessible within a single cycle and constitute the fastest memory available to our execution cores. Compilers will allocate our local variables and function arguments to these registers. Compilers allocate to a subset of registers known as the <a href="http://en.wikipedia.org/wiki/X86#x86_registers">architectural registers</a>, then the hardware expands on these as it runs instructions in parallel and out-of-order. Compilers are aware of the out-of-order and parallel execution abilities of a given processor, and order instruction streams and register allocation to take advantage of them. When <a href="http://en.wikipedia.org/wiki/Hyper-threading">hyperthreading</a> is enabled these registers are shared between the co-located hyperthreads.</li>
<li><b>Memory Ordering Buffers (MOB)</b>: The MOB is comprised of a 64-entry load buffer and a 36-entry store buffer. These buffers are used to track in-flight operations while waiting on the cache sub-system as instructions get executed out-of-order. The store buffer is a fully associative queue that can be searched for existing store operations that have been queued while waiting on the L1 cache. These buffers enable our fast processors to run without blocking while data is transferred to and from the cache sub-system. Because instructions execute out-of-order, and loads and stores can also complete out-of-order from the cache sub-system, the MOB is used to disambiguate load and store ordering for compliance with the published <a href="http://en.wikipedia.org/wiki/Memory_model_(programming)">memory model</a>, allowing an ordered view of the world to be re-constructed.</li>
<li><b>Level 1 Cache</b>: The L1 is a core-local cache split into separate 32K data and 32K instruction caches. Access time is 3 cycles and can be hidden as instructions are <a href="http://en.wikipedia.org/wiki/Pipelining">pipelined</a> by the core for data already in the L1 cache.</li>
<li><b>Level 2 Cache</b>: The L2 cache is a core-local cache designed to buffer access between the L1 and the shared L3 cache. The L2 cache is 256K in size and acts as an effective queue of memory accesses between the L1 and L3. L2 contains both data and instructions. L2 access latency is 12 cycles. </li>
<li><b>Level 3 Cache</b>: The L3 cache is shared across all cores within a socket. The L3 is split into 2MB segments each connected to a ring-bus network on the socket. Each core is also connected to this ring-bus. Addresses are hashed to segments for greater throughput. Latency can be up to 38 cycles depending on cache size. Cache size can be up to 20MB depending on the number of segments, with each additional hop around the ring taking an additional cycle. The L3 cache is inclusive of all data in the L1 and L2 for each core on the same socket. This inclusiveness, at the cost of space, allows the L3 cache to intercept requests thus removing the burden from private core-local L1 & L2 caches.</li>
<li><b>Main Memory</b>: DRAM channels are connected to each socket with an average latency of ~65ns for socket local access on a full cache-miss. This is however extremely variable, being much less for subsequent accesses to columns in the same row buffer, through to significantly more when queuing effects and memory refresh cycles conflict. 4 memory channels are aggregated together on each socket for throughput, and to hide latency via <a href="http://en.wikipedia.org/wiki/Pipelining">pipelining</a> on the independent memory channels.</li>
<li><b>NUMA</b>: In a multi-socket server we have <a href="http://en.wikipedia.org/wiki/Non-Uniform_Memory_Access">non-uniform memory access</a>. It is non-uniform because the required memory may be on a remote socket, incurring an additional 40ns hop across the <a href="http://en.wikipedia.org/wiki/QPI">QPI</a> bus. Sandy Bridge is a major step forward for 2-socket systems over Westmere and Nehalem. With Sandy Bridge the QPI limit has been raised from 6.4GT/s to 8.0GT/s, and two lanes can be aggregated, thus eliminating the bottleneck of the previous systems. For Nehalem and Westmere the QPI link is only capable of ~40% of the bandwidth that could be delivered by the memory controller for an individual socket. This limitation made accessing remote memory a choke point. In addition, the QPI link can now forward pre-fetch requests, which previous generations could not.</li>
</ol>
<div>
<b><span style="font-family: Arial, Helvetica, sans-serif; font-size: large;">Associativity Levels</span></b></div>
<div>
<br /></div>
<div>
Caches are effectively hardware-based hash tables. The hash function is usually a simple masking of some low-order bits for cache indexing. Hash tables need some means to handle a collision for the same slot. The associativity level is the number of slots, known as ways, which can be used to hold a hashed version of an address. Having more levels of associativity is a trade-off between storing more data vs. power requirements and the time to search each of the ways.</div>
<div>
<br /></div>
<div>
For Sandy Bridge the L1D and L2 are 8-way associative, the L3 is 12-way associative.</div>
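<div>
<br /></div>
<div>
As a toy illustration of how the indexing falls out of these sizes (the class below is mine, and real hardware may hash the address bits differently):</div>
<pre>public class CacheSetIndex
{
    public static void main(final String[] args)
    {
        final int lineSize = 64;                        // bytes per cache-line
        final int ways = 8;                             // associativity level
        final int cacheSize = 32 * 1024;                // 32K L1 data cache
        final int sets = cacheSize / (lineSize * ways); // = 64 sets

        final long address = 0x00007fed81060980L;       // arbitrary example address
        final long setIndex = (address >>> 6) % sets;   // drop the 6 offset bits, then index

        System.out.println("sets=" + sets + " setIndex=" + setIndex);
    }
}</pre>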
<div>
<br /></div>
<span style="font-family: Arial, Helvetica, sans-serif; font-size: large;"><b>Cache Coherence</b></span><br />
<br />
With some caches being local to cores, we need a means of keeping them coherent so all cores can have a consistent view of memory. The cache sub-system is considered the "source of truth" for mainstream systems. If memory is fetched from the cache it is never stale; the cache is the master copy when data exists in both the cache and main-memory. This style of memory management is known as <a href="http://www.webopedia.com/TERM/W/write_back_cache.html">write-back</a> whereby data in the cache is only written back to main-memory when the cache-line is evicted because a new line is taking its place. An x86 cache works on blocks of data that are 64-bytes in size, known as a <a href="http://en.wikipedia.org/wiki/CPU_cache#Cache_Entries">cache-line</a>. Other processors can use a different size for the cache-line. A larger cache-line size reduces effective latency at the expense of increased bandwidth requirements.<br />
<br />
To keep the caches coherent the cache controller tracks the state of each cache-line as being in one of a finite number of states. The protocol Intel employs for this is <a href="http://www.realworldtech.com/common-system-interface/5/">MESIF</a>; AMD employs a variant known as <a href="http://en.wikipedia.org/wiki/MOESI_protocol">MOESI</a>. Under the MESIF protocol each cache-line can be in one of the 5 following states:<br />
<ol style="text-align: left;">
<li><b>Modified</b>: Indicates the cache-line is dirty and must be written back to memory at a later stage. When written back to main-memory the state transitions to Exclusive.</li>
<li><b>Exclusive</b>: Indicates the cache-line is held exclusively and that it matches main-memory. When written to, the state then transitions to Modified. To achieve this state a Read-For-Ownership (RFO) message is sent which involves a read plus an invalidate broadcast to all other copies.</li>
<li><b>Shared</b>: Indicates a clean copy of a cache-line that matches main-memory.</li>
<li><b>Invalid</b>: Indicates an unused cache-line.</li>
<li><b>Forward</b>: Indicates a specialised version of the shared state i.e. this is the designated cache which should respond to other caches in a NUMA system.</li>
</ol>
To transition from one state to another, a series of messages are sent between the caches to effect state changes. Previous to <a href="http://en.wikipedia.org/wiki/Nehalem_(microarchitecture)">Nehalem</a> for Intel, and <a href="http://en.wikipedia.org/wiki/Opteron">Opteron</a> for AMD, this cache coherence traffic between sockets had to share the memory bus which greatly limited scalability. These days the memory controller traffic is on a separate bus. The Intel QPI, and AMD <a href="http://en.wikipedia.org/wiki/HyperTransport">HyperTransport</a>, buses are used for cache coherence between sockets.<br />
<br />
The cache controller exists as a module within each L3 cache segment that is connected to the on-socket ring-bus network. Each core, L3 cache segment, QPI controller, memory controller, and integrated graphics sub-system are connected to this ring-bus. The ring is made up of 4 independent lanes for: <i>request</i>, <i>snoop</i>, <i>acknowledge</i>, and 32-bytes <i>data </i>per cycle. The L3 cache is inclusive in that any cache-line held in the L1 or L2 caches is also held in the L3. This provides for rapid identification of the core containing a modified line when snooping for changes. The cache controller for the L3 segment keeps track of which core could have a modified version of a cache-line it owns.<br />
<br />
If a core wants to read some memory, and it does not have it in a Shared, Exclusive, or Modified state; then it must make a read on the ring bus. It will then either be read from main-memory if not in the cache sub-systems, or read from L3 if clean, or snooped from another core if Modified. In any case the read will never return a stale copy from the cache sub-system, it is guaranteed to be coherent.<br />
<br />
<b><span style="font-family: Arial, Helvetica, sans-serif; font-size: large;">Concurrent Programming</span></b><br />
<br />
If our caches are always coherent then why do we worry about visibility when writing concurrent programs? This is because within our cores, in their quest for ever greater performance, data modifications can appear out-of-order to other threads. There are 2 major reasons for this.<br />
<br />
Firstly, our compilers can generate programs that store variables in registers for relatively long periods of time for performance reasons, e.g. variables used repeatedly within a loop. If we need these variables to be visible across cores then the updates must not be register allocated. This is achieved in C by qualifying a variable as "<span style="font-family: Courier New, Courier, monospace;">volatile</span>". Beware that C/C++ <span style="font-family: Courier New, Courier, monospace;">volatile</span> is inadequate for telling the compiler not to reorder other instructions. For this you need memory fences/barriers.<br />
<br />
The second major issue with ordering we have to be aware of is a thread could write a variable and then, if it reads it shortly after, could see the value in its store buffer which may be older than the latest value in the cache sub-system. This is never an issue for algorithms following the <a href="http://mechanical-sympathy.blogspot.co.uk/2011/09/single-writer-principle.html">Single Writer Principle</a>. The store buffer also allows a load instruction to get ahead of an older store and is thus an issue for the likes of the <a href="http://en.wikipedia.org/wiki/Dekker's_algorithm">Dekker</a> and <a href="http://en.wikipedia.org/wiki/Peterson%27s_algorithm">Peterson</a> lock algorithms. To overcome these issues, the thread must not let a sequential consistent load get ahead of the sequentially consistent store of the value in the local store buffer. This can be achieved by issuing a fence instruction. The write of a <span style="font-family: Courier New, Courier, monospace;">volatile</span> variable in Java, in addition to never being register allocated, is accompanied by a full fence instruction. This fence instruction on x86 has a significant performance impact by preventing progress on the issuing thread until the store buffer is drained. Fences on other processors can have more efficient implementations that simply put a marker in the store buffer for the search boundary, e.g. the Azul Vega does this.<br />
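<br />
The classic store buffer litmus test makes this concrete. In the toy sketch below (class and field names are mine, purely illustrative), each thread's load can get ahead of the other thread's store while that store still sits in a store buffer, so both threads reading 0 is a legal outcome; making the fields <span style="font-family: Courier New, Courier, monospace;">volatile</span>, and thereby adding the fence, forbids it:<br />
<pre>public class StoreLoadReordering
{
    static int x, y, r1, r2; // deliberately not volatile

    public static void main(final String[] args) throws InterruptedException
    {
        // may need many iterations, and may not reproduce on every run
        for (int i = 0; i < 100000; i++)
        {
            x = 0; y = 0;
            final Thread t1 = new Thread(new Runnable()
            {
                public void run() { x = 1; r1 = y; }
            });
            final Thread t2 = new Thread(new Runnable()
            {
                public void run() { y = 1; r2 = x; }
            });
            t1.start(); t2.start();
            t1.join(); t2.join();

            if (r1 == 0 && r2 == 0) // both loads got ahead of the other's store
            {
                System.out.println("reordering observed on iteration " + i);
                return;
            }
        }
        System.out.println("no reordering observed in this run");
    }
}</pre>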
<br />
If you want to ensure memory ordering across Java threads when following the Single Writer Principle, and avoid the store fence, it is possible by using the <span style="font-family: Courier New, Courier, monospace;">j.u.c.Atomic(Int|Long|Reference).lazySet()</span> method, as opposed to setting a <span style="font-family: Courier New, Courier, monospace;">volatile</span> variable.<br />
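<br />
As a minimal sketch of this technique (the class below is mine, purely illustrative): a single-writer counter can publish with <span style="font-family: Courier New, Courier, monospace;">lazySet()</span>, which performs an ordered store without the full fence, and draining of the store buffer, that a <span style="font-family: Courier New, Courier, monospace;">volatile</span> write would incur:<br />
<pre>import java.util.concurrent.atomic.AtomicLong;

public class SingleWriterCounter
{
    private final AtomicLong counter = new AtomicLong();

    // Must only ever be called from the single writer thread.
    public void increment()
    {
        counter.lazySet(counter.get() + 1); // ordered store, no store-load fence
    }

    // May be called from any reader thread.
    public long get()
    {
        return counter.get();
    }
}</pre>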
<br />
<b><span style="font-family: Arial, Helvetica, sans-serif; font-size: large;">The Fallacy</span></b><br />
<br />
Returning to the fallacy of "flushing the cache" as part of a concurrent algorithm: I think we can safely say that we never "flush" the CPU cache within our user space programs. I believe the source of this fallacy is the need to flush, mark, or drain to a point, the store buffer for some classes of concurrent algorithms so the latest value can be observed on a subsequent load operation. For this we require a memory ordering fence and not a cache flush.<br />
<br />
Another possible source of this fallacy is that L1 caches, or the <a href="http://en.wikipedia.org/wiki/Translation_lookaside_buffer">TLB</a>, may need to be flushed based on address indexing policy on a context switch. ARM, previous to ARMv6, did not use address space tags on TLB entries thus requiring the whole L1 cache to be flushed on a context switch. Many processors require the L1 instruction cache to be flushed for similar reasons, in many cases this is simply because instruction caches are not required to be kept coherent. The bottom line is, context switching is expensive and a bit off topic, so in addition to the cache pollution of the L2, a context switch can also cause the TLB and/or L1 caches to require a flush. Intel x86 processors require only a TLB flush on context switch.</div>
Martin Thompsonhttp://www.blogger.com/profile/15893849163924476586noreply@blogger.com29London, UK51.5073346 -0.1276831000000129351.1912286 -0.77313010000001292 51.8234406 0.51776389999998707tag:blogger.com,1999:blog-5560209661389175529.post-26476361333597527262013-01-25T17:59:00.002+00:002022-08-17T11:34:24.170+01:00Further Adventures With CAS Instructions And Micro Benchmarking <div dir="ltr" style="text-align: left;" trbidi="on">
In a previous <a href="http://mechanical-sympathy.blogspot.co.uk/2011/09/adventures-with-atomiclong.html">article</a> I reported what appeared to be a performance issue with CAS/LOCK instructions on the <a href="http://en.wikipedia.org/wiki/Sandy_Bridge">Sandy Bridge</a> microarchitecture compared to the previous <a href="http://en.wikipedia.org/wiki/Nehalem_(microarchitecture)">Nehalem</a> microarchitecture. Since then I've worked with the good people of Intel to understand what was going on and I'm now pleased to be able to shine some light on the previous results.<br />
<br />
I observed a small drop in throughput with the uncontended single-thread case, and an order-of-magnitude decrease in throughput once multiple threads contend when performing updates. This testing spawned out of observations testing Java Queue implementations and the Disruptor for the multi-producer case. I was initially puzzled by these findings because almost every other performance test I applied to Sandy Bridge indicated a major step forward for this microarchitecture.<br />
<br />
After digging deeper into this issue it has come to light that my tests have once again fallen foul of the difficulties in micro-benchmarking. My test is not a good means of testing throughput; it is actually testing fairness in a roundabout manner. Let's revisit the code and work through what is going on.<br />
<br />
<b><span style="font-family: Arial, Helvetica, sans-serif; font-size: large;">Test Code</span></b><br />
<pre>#include <time.h>
#include <pthread.h>
#include <stdlib.h>
#include <iostream>
typedef unsigned long long uint64;
const uint64 COUNT = 500 * 1000 * 1000;
volatile uint64 counter = 0;
void* run_add(void* numThreads)
{
register uint64 value = (COUNT / *((int*)numThreads)) + 1;
while (--value != 0)
{
__sync_add_and_fetch(&counter, 1);
}
}
void* run_xadd(void*)
{
register uint64 value = counter;
while (value < COUNT)
{
value = __sync_add_and_fetch(&counter, 1);
}
}
void* run_cas(void*)
{
register uint64 value = 0;
while (value < COUNT)
{
do
{
value = counter;
}
while (!__sync_bool_compare_and_swap(&counter, value, value + 1));
}
}
void* run_cas2(void*)
{
register uint64 value = 0;
register uint64 next = 0;
while (value < COUNT)
{
value = counter;
do
{
next = value + 1;
value = __sync_val_compare_and_swap(&counter, value, next);
}
while (value != next);
}
}
int main (int argc, char *argv[])
{
const int NUM_THREADS = atoi(argv[1]);
const int TESTCASE = atoi(argv[2]);
pthread_t threads[NUM_THREADS];
void* status;
timespec ts_start;
timespec ts_finish;
clock_gettime(CLOCK_MONOTONIC, &ts_start);
for (int i = 0; i < NUM_THREADS; i++)
{
switch (TESTCASE)
{
case 1:
std::cout << "LOCK ADD" << std::endl;
pthread_create(&threads[i], NULL, run_add, (void*)&NUM_THREADS);
break;
case 2:
std::cout << "LOCK XADD" << std::endl;
pthread_create(&threads[i], NULL, run_xadd, (void*)&NUM_THREADS);
break;
case 3:
std::cout << "LOCK CMPXCHG BOOL" << std::endl;
pthread_create(&threads[i], NULL, run_cas, (void*)&NUM_THREADS);
break;
case 4:
std::cout << "LOCK CMPXCHG VAL" << std::endl;
pthread_create(&threads[i], NULL, run_cas2, (void*)&NUM_THREADS);
break;
default:
exit(1);
}
}
for (int i = 0; i < NUM_THREADS; i++)
{
pthread_join(threads[i], &status);
}
clock_gettime(CLOCK_MONOTONIC, &ts_finish);
uint64 start = (ts_start.tv_sec * 1000000000) + ts_start.tv_nsec;
uint64 finish = (ts_finish.tv_sec * 1000000000) + ts_finish.tv_nsec;
uint64 duration = finish - start;
std::cout << "threads = " << NUM_THREADS << std::endl;
std::cout << "duration = " << duration << std::endl;
std::cout << "ns per op = " << (duration / (COUNT * 2)) << std::endl;
std::cout << "op/sec = " << ((COUNT * 2 * 1000 * 1000 * 1000) / duration) << std::endl;
std::cout << "counter = " << counter << std::endl;
return 0;
}</pre>
The code above makes it possible to test the major CAS based techniques on x86. For full clarity an <span style="font-family: Courier New, Courier, monospace;"><a href="http://linux.die.net/man/1/objdump">objdump -d</a> </span>of the binary reveals the compiler generated assembly instructions for the above methods. The "<span style="font-family: Courier New, Courier, monospace;"><b>lock</b></span>" instruction in each section is where the atomic update is happening.
<br />
<pre>0000000000400dc0 <_z8run_cas2pv>:
400dc0: 48 8b 05 d9 07 20 00 mov 0x2007d9(%rip),%rax
400dc7: 66 0f 1f 84 00 00 00 nopw 0x0(%rax,%rax,1)
400dce: 00 00
400dd0: 48 8d 50 01 lea 0x1(%rax),%rdx
400dd4: f0 48 0f b1 15 c3 07 lock cmpxchg %rdx,0x2007c3(%rip)
400ddb: 20 00
400ddd: 48 39 c2 cmp %rax,%rdx
400de0: 75 ee jne 400dd0 <_z8run_cas2pv>
400de2: 48 3d ff 64 cd 1d cmp $0x1dcd64ff,%rax
400de8: 76 d6 jbe 400dc0 <_z8run_cas2pv>
400dea: f3 c3 repz retq
400dec: 0f 1f 40 00 nopl 0x0(%rax)
0000000000400df0 <_z7run_caspv>:
400df0: 48 8b 15 a9 07 20 00 mov 0x2007a9(%rip),%rdx
400df7: 48 8d 4a 01 lea 0x1(%rdx),%rcx
400dfb: 48 89 d0 mov %rdx,%rax
400dfe: f0 48 0f b1 0d 99 07 lock cmpxchg %rcx,0x200799(%rip)
400e05: 20 00
400e07: 75 e7 jne 400df0 <_z7run_caspv>
400e09: 48 81 fa ff 64 cd 1d cmp $0x1dcd64ff,%rdx
400e10: 76 de jbe 400df0 <_z7run_caspv>
400e12: f3 c3 repz retq
400e14: 66 66 66 2e 0f 1f 84 data32 data32 nopw %cs:0x0(%rax,%rax,1)
400e1b: 00 00 00 00 00
0000000000400e20 <_z8run_xaddpv>:
400e20: 48 8b 05 79 07 20 00 mov 0x200779(%rip),%rax
400e27: 48 3d ff 64 cd 1d cmp $0x1dcd64ff,%rax
400e2d: 77 1b ja 400e4a <_z8run_xaddpv>
400e2f: 90 nop
400e30: b8 01 00 00 00 mov $0x1,%eax
400e35: f0 48 0f c1 05 62 07 lock xadd %rax,0x200762(%rip)
400e3c: 20 00
400e3e: 48 83 c0 01 add $0x1,%rax
400e42: 48 3d ff 64 cd 1d cmp $0x1dcd64ff,%rax
400e48: 76 e6 jbe 400e30 <_z8run_xaddp>
400e4a: f3 c3 repz retq
400e4c: 0f 1f 40 00 nopl 0x0(%rax)
0000000000400e50 <_z7run_addpv>:
400e50: 48 63 0f movslq (%rdi),%rcx
400e53: 31 d2 xor %edx,%edx
400e55: b8 00 65 cd 1d mov $0x1dcd6500,%eax
400e5a: 48 f7 f1 div %rcx
400e5d: 48 85 c0 test %rax,%rax
400e60: 74 15 je 400e77 <_z7run_addpv>
400e62: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
400e68: f0 48 83 05 2f 07 20 lock addq $0x1,0x20072f(%rip)
400e6f: 00 01
400e71: 48 83 e8 01 sub $0x1,%rax
400e75: 75 f1 jne 400e68 <_z7run_addpv>
400e77: f3 c3 repz retq
400e79: 90 nop
400e7a: 90 nop
400e7b: 90 nop
400e7c: 90 nop
400e7d: 90 nop
400e7e: 90 nop
400e7f: 90 nop
</pre>
To purely isolate the performance of the CAS operation the test should be run using the <span style="font-family: Courier New, Courier, monospace;"><a href="http://en.wikipedia.org/wiki/Fetch-and-add#x86_implementation">lock xadd</a></span> option for an atomic increment in hardware. This instruction lets us avoid the spin-retry loop of a pure software CAS that can dirty the experiment.<br />
<br />
I repeated the experiment from the previous article and got very similar results. Previously, I thought I'd observed a throughput drop even in the uncontended single-threaded case, so I focused on this to confirm. To do this I had to find two processors whose clock speeds would be comparable once <a href="http://en.wikipedia.org/wiki/Intel_Turbo_Boost">Turbo Boost</a> had kicked in. I found them in a 2.8GHz Nehalem and a 2.4GHz Sandy Bridge. For the single-threaded case they are both operating at ~3.4GHz.<br />
<pre>Nehalem 2.8GHz
==============
$ perf stat ./atomic_inc 1 2
LOCK XADD
threads = 1
duration = 3090445546
ns per op = 3
op/sec = 323577938
Performance counter stats for './atomic_inc 1 2':
3085.466216 task-clock # 0.997 CPUs utilized
331 context-switches # 0.107 K/sec
4 CPU-migrations # 0.001 K/sec
360 page-faults # 0.117 K/sec
10,527,264,923 cycles # 3.412 GHz
9,394,575,677 stalled-cycles-frontend # 89.24% frontend cycles idle
7,423,070,202 stalled-cycles-backend # 70.51% backend cycles idle
2,517,668,293 instructions # 0.24 insns per cycle
# 3.73 stalled cycles per insn
503,526,119 branches # 163.193 M/sec
110,695 branch-misses # 0.02% of all branches
3.093402966 seconds time elapsed
Sandy Bridge 2.4GHz
===================
$ perf stat ./atomic_inc 1 2
LOCK XADD
threads = 1
duration = 3394221940
ns per op = 3
op/sec = 294618330
Performance counter stats for './atomic_inc 1 2':
3390.404400 task-clock # 0.998 CPUs utilized
357 context-switches # 0.105 K/sec
1 CPU-migrations # 0.000 K/sec
358 page-faults # 0.106 K/sec
11,522,932,068 cycles # 3.399 GHz
9,542,667,311 stalled-cycles-frontend # 82.81% frontend cycles idle
6,721,330,874 stalled-cycles-backend # 58.33% backend cycles idle
2,518,638,461 instructions # 0.22 insns per cycle
# 3.79 stalled cycles per insn
502,490,710 branches # 148.210 M/sec
36,955 branch-misses # 0.01% of all branches
3.398206155 seconds time elapsed
</pre>
<br />
<b><span style="font-family: Arial, Helvetica, sans-serif; font-size: large;">Analysis</span></b><br />
<br />
So repeating the tests with comparable clock speeds confirmed the previous results. The single-threaded case shows a ~10% drop in throughput, and the multi-threaded contended case an order-of-magnitude difference. <br />
<br />
Now the big question: what is going on, and why has throughput dropped? The single-threaded case suggests nothing major has happened to the number of cycles required to execute the instruction when uncontended. The small differences could be attributed to system noise, or to the changes Sandy Bridge made to the core, such as the introduction of the additional load address generation unit.<br />
<br />
For the multi-threaded case we found an interesting surprise when Intel monitored what the instructions are doing. We found that each thread on Nehalem was able to perform more updates in a batch before losing the exclusive state on the cacheline containing the counter. This is because the inter-core latency has improved with Sandy Bridge, so other threads are able to claim the cacheline containing the counter more quickly to do their own updates. What we are actually measuring with this micro-benchmark is how long a core can hold a cacheline before it is released to another core. Sandy Bridge is exhibiting greater fairness, which is what you'd want in a real-world application.<br />
<br />
This micro-benchmark is not representative of a real-world application, in which a core would normally be doing a lot of other work between counter updates. At the point when the counter needs to be updated, the reduced inter-core latency would then be a benefit. <br />
<br />
In all my macro application benchmarks Sandy Bridge has proved to have better performance than Nehalem at comparable clock speeds.<br />
<br />
<b><span style="font-family: Arial, Helvetica, sans-serif; font-size: large;">Conclusion</span></b><br />
<br />
What did I learn from this? Once again, that writing micro-benchmarks is notoriously difficult. It is so hard to know what you are measuring and what effects can come into play. To illustrate how difficult such a flaw is to recognise: of all those who have read this blog, no one identified the issue and fed it back to me.<br />
<br />
It also shows that what at first blush can be considered a performance bug is actually the opposite. It is possible to have a second-order effect whereby a performance improvement makes a specific workload run more slowly. </div>
Martin Thompsonhttp://www.blogger.com/profile/15893849163924476586noreply@blogger.com7London, UK51.5073346 -0.1276831000000129351.1912231 -0.77313010000001292 51.8234461 0.51776389999998707tag:blogger.com,1999:blog-5560209661389175529.post-50408230044468437062012-12-19T20:24:00.000+00:002012-12-19T20:31:22.184+00:00Mechanical Sympathy Discussion Group<div dir="ltr" style="text-align: left;" trbidi="on">
Lately a number of people have suggested I start a discussion group on the subject of mechanical sympathy, so I've taken the plunge and done it! The group can be a place to discuss topics related to writing software which works in harmony with the underlying hardware to gain great performance.<br />
<div>
<br /></div>
<div>
<a href="https://groups.google.com/forum/?fromgroups#!forum/mechanical-sympathy">https://groups.google.com/forum/?fromgroups#!forum/mechanical-sympathy</a></div>
<div>
<br /></div>
</div>
Martin Thompsonhttp://www.blogger.com/profile/15893849163924476586noreply@blogger.com3London, UK51.5073346 -0.1276831000000129351.1912111 -0.77313010000001292 51.8234581 0.51776389999998707tag:blogger.com,1999:blog-5560209661389175529.post-28228242664645565892012-10-17T13:36:00.002+01:002022-08-17T11:35:17.970+01:00Compact Off-Heap Structures/Tuples In Java<div dir="ltr" style="text-align: left;" trbidi="on">
In my last <a href="http://mechanical-sympathy.blogspot.co.uk/2012/08/memory-access-patterns-are-important.html">post</a> I detailed the implications of the access patterns your code takes to main memory. Since then I've had a lot of questions about what can be done in Java to enable more predictable memory layout. There are patterns that can be applied using array backed structures which I will discuss in another post. This post will explore how to simulate a feature sorely missing in Java - arrays of <a href="http://en.wikipedia.org/wiki/Struct_(C_programming_language)">structures</a> similar to what C has to offer.<br />
<br />
Structures are very useful, both on the stack and the heap. To my knowledge it is not possible to simulate this feature on the Java stack. Not being able to do this on the stack is such a shame because it greatly limits the performance of some parallel algorithms; however, that is a rant for another day. <br />
<br />
In Java, all user-defined types have to exist on the heap. The Java heap is managed by the garbage collector in the general case; however, there is more to the wider heap in a Java process. With the introduction of direct <a href="http://docs.oracle.com/javase/6/docs/api/java/nio/ByteBuffer.html"><span style="font-family: Courier New, Courier, monospace;">ByteBuffer</span></a>, memory can be allocated which is not tracked by the garbage collector, so that it can be made available to native code for tasks like avoiding the copying of data to and from the kernel for IO. One reasonable method of managing structures is therefore to fake them within a ByteBuffer. This can allow compact data representations, but it has performance and size limitations. For example, it is not possible to have a ByteBuffer greater than 2GB, and all access is bounds checked, which impacts performance. An alternative exists using <span style="font-family: Courier New, Courier, monospace;"><a href="http://www.docjar.com/docs/api/sun/misc/Unsafe.html">Unsafe</a></span> that is both faster and not size constrained like <a href="http://docs.oracle.com/javase/6/docs/api/java/nio/ByteBuffer.html"><span style="font-family: Courier New, Courier, monospace;">ByteBuffer</span></a>.<br />
<br />
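To make the ByteBuffer alternative concrete, here is a minimal sketch of a flyweight over a direct ByteBuffer. The record layout and field names are illustrative only, and not the code used in the tests below; it simply shows the style of faking an array of structures inside a buffer, subject to the 2GB and bounds-checking limitations just mentioned.<br />
<pre>import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Illustrative flyweight over a direct ByteBuffer: each record is a
// fixed-size slice of the buffer addressed by index, not a heap object.
public class ByteBufferRecord
{
    private static final int PRICE_OFFSET = 0;    // long
    private static final int QUANTITY_OFFSET = 8; // long
    private static final int RECORD_SIZE = 16;

    private final ByteBuffer buffer;
    private int recordOffset;

    public ByteBufferRecord(final int numRecords)
    {
        buffer = ByteBuffer.allocateDirect(numRecords * RECORD_SIZE)
                           .order(ByteOrder.nativeOrder());
    }

    // Point the flyweight at a record; no object is allocated per record.
    public ByteBufferRecord moveTo(final int index)
    {
        recordOffset = index * RECORD_SIZE;
        return this;
    }

    public long getPrice()
    {
        return buffer.getLong(recordOffset + PRICE_OFFSET);
    }

    public void setPrice(final long price)
    {
        buffer.putLong(recordOffset + PRICE_OFFSET, price);
    }

    public long getQuantity()
    {
        return buffer.getLong(recordOffset + QUANTITY_OFFSET);
    }

    public void setQuantity(final long quantity)
    {
        buffer.putLong(recordOffset + QUANTITY_OFFSET, quantity);
    }
}</pre>
<br />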
The approach I'm about to detail is not traditional Java. If your problem space is dealing with big data, or extreme performance, then there are benefits to be had. If your data sets are small, and performance is not an issue, then run away now to avoid getting sucked into the dark arts of native memory management.<br />
<br />
The benefits of the approach I'm about to detail are:<br />
<ol style="text-align: left;">
<li>Significantly improved performance </li>
<li>More compact data representation</li>
<li>Ability to work with very large data sets while avoiding nasty GC pauses[1]</li>
</ol>
With all choices there are consequences. By taking the approach detailed below you take responsibility for some of the memory management yourself. Getting it wrong can lead to memory leaks or, worse, crashing the JVM! Proceed with caution...<br />
<br />
<span style="font-family: Arial, Helvetica, sans-serif; font-size: large;"><b>Suitable Example - <i>Trade Data</i></b></span><br />
<br />
A common challenge faced in finance applications is capturing and working with very large volumes of order and trade data. For the example I will create a large table of in-memory trade data that can have analysis queries run against it. This table will be built using 2 contrasting approaches. Firstly, I'll take the traditional Java approach of creating a large array referencing individual Trade objects. Secondly, I'll keep the usage code identical but replace the large array and Trade objects with an off-heap array of structures that can be manipulated via a <a href="http://www.oodesign.com/flyweight-pattern.html">Flyweight</a> pattern. <br />
<br />
If for the traditional Java approach I used some other data structure, such as a Map or Tree, then the memory footprint would be even greater and the performance lower.<br />
<br />
<b><span style="font-family: Arial, Helvetica, sans-serif; font-size: large;">Traditional Java Approach</span></b>
<br />
<pre>public class TestJavaMemoryLayout
{
private static final int NUM_RECORDS = 50 * 1000 * 1000;
private static JavaMemoryTrade[] trades;
public static void main(final String[] args)
{
for (int i = 0; i < 5; i++)
{
System.gc();
perfRun(i);
}
}
private static void perfRun(final int runNum)
{
long start = System.currentTimeMillis();
init();
System.out.format("Memory %,d total, %,d free\n",
Runtime.getRuntime().totalMemory(),
Runtime.getRuntime().freeMemory());
long buyCost = 0;
long sellCost = 0;
for (int i = 0; i < NUM_RECORDS; i++)
{
final JavaMemoryTrade trade = get(i);
if (trade.getSide() == 'B')
{
buyCost += (trade.getPrice() * trade.getQuantity());
}
else
{
sellCost += (trade.getPrice() * trade.getQuantity());
}
}
long duration = System.currentTimeMillis() - start;
System.out.println(runNum + " - duration " + duration + "ms");
System.out.println("buyCost = " + buyCost + " sellCost = " + sellCost);
}
private static JavaMemoryTrade get(final int index)
{
return trades[index];
}
public static void init()
{
trades = new JavaMemoryTrade[NUM_RECORDS];
final byte[] londonStockExchange = {'X', 'L', 'O', 'N'};
final int venueCode = pack(londonStockExchange);
final byte[] billiton = {'B', 'H', 'P'};
final int instrumentCode = pack( billiton);
for (int i = 0; i < NUM_RECORDS; i++)
{
JavaMemoryTrade trade = new JavaMemoryTrade();
trades[i] = trade;
trade.setTradeId(i);
trade.setClientId(1);
trade.setVenueCode(venueCode);
trade.setInstrumentCode(instrumentCode);
trade.setPrice(i);
trade.setQuantity(i);
trade.setSide((i & 1) == 0 ? 'B' : 'S');
}
}
private static int pack(final byte[] value)
{
int result = 0;
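// Pack up to 4 bytes big-endian into an int; the switch cases
// intentionally fall through so shorter codes are left-aligned.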
switch (value.length)
{
case 4:
result = (value[3]);
case 3:
result |= ((int)value[2] << 8);
case 2:
result |= ((int)value[1] << 16);
case 1:
result |= ((int)value[0] << 24);
break;
default:
throw new IllegalArgumentException("Invalid array size");
}
return result;
}
private static class JavaMemoryTrade
{
private long tradeId;
private long clientId;
private int venueCode;
private int instrumentCode;
private long price;
private long quantity;
private char side;
public long getTradeId()
{
return tradeId;
}
public void setTradeId(final long tradeId)
{
this.tradeId = tradeId;
}
public long getClientId()
{
return clientId;
}
public void setClientId(final long clientId)
{
this.clientId = clientId;
}
public int getVenueCode()
{
return venueCode;
}
public void setVenueCode(final int venueCode)
{
this.venueCode = venueCode;
}
public int getInstrumentCode()
{
return instrumentCode;
}
public void setInstrumentCode(final int instrumentCode)
{
this.instrumentCode = instrumentCode;
}
public long getPrice()
{
return price;
}
public void setPrice(final long price)
{
this.price = price;
}
public long getQuantity()
{
return quantity;
}
public void setQuantity(final long quantity)
{
this.quantity = quantity;
}
public char getSide()
{
return side;
}
public void setSide(final char side)
{
this.side = side;
}
}
}</pre>
<b><span style="font-family: Arial, Helvetica, sans-serif; font-size: large;">Compact Off-Heap Structures</span></b>
<br />
<pre>import sun.misc.Unsafe;
import java.lang.reflect.Field;
public class TestDirectMemoryLayout
{
private static final Unsafe unsafe;
static
{
try
{
Field field = Unsafe.class.getDeclaredField("theUnsafe");
field.setAccessible(true);
unsafe = (Unsafe)field.get(null);
}
catch (Exception e)
{
throw new RuntimeException(e);
}
}
private static final int NUM_RECORDS = 50 * 1000 * 1000;
private static long address;
private static final DirectMemoryTrade flyweight = new DirectMemoryTrade();
public static void main(final String[] args)
{
for (int i = 0; i < 5; i++)
{
System.gc();
perfRun(i);
}
}
private static void perfRun(final int runNum)
{
long start = System.currentTimeMillis();
init();
System.out.format("Memory %,d total, %,d free\n",
Runtime.getRuntime().totalMemory(),
Runtime.getRuntime().freeMemory());
long buyCost = 0;
long sellCost = 0;
for (int i = 0; i < NUM_RECORDS; i++)
{
final DirectMemoryTrade trade = get(i);
if (trade.getSide() == 'B')
{
buyCost += (trade.getPrice() * trade.getQuantity());
}
else
{
sellCost += (trade.getPrice() * trade.getQuantity());
}
}
long duration = System.currentTimeMillis() - start;
System.out.println(runNum + " - duration " + duration + "ms");
System.out.println("buyCost = " + buyCost + " sellCost = " + sellCost);
destroy();
}
private static DirectMemoryTrade get(final int index)
{
final long offset = address + (index * DirectMemoryTrade.getObjectSize());
flyweight.setObjectOffset(offset);
return flyweight;
}
public static void init()
{
final long requiredHeap = NUM_RECORDS * DirectMemoryTrade.getObjectSize();
address = unsafe.allocateMemory(requiredHeap);
final byte[] londonStockExchange = {'X', 'L', 'O', 'N'};
final int venueCode = pack(londonStockExchange);
final byte[] billiton = {'B', 'H', 'P'};
final int instrumentCode = pack( billiton);
for (int i = 0; i < NUM_RECORDS; i++)
{
DirectMemoryTrade trade = get(i);
trade.setTradeId(i);
trade.setClientId(1);
trade.setVenueCode(venueCode);
trade.setInstrumentCode(instrumentCode);
trade.setPrice(i);
trade.setQuantity(i);
trade.setSide((i & 1) == 0 ? 'B' : 'S');
}
}
private static void destroy()
{
unsafe.freeMemory(address);
}
private static int pack(final byte[] value)
{
int result = 0;
switch (value.length)
{
case 4:
result |= (value[3]);
case 3:
result |= ((int)value[2] << 8);
case 2:
result |= ((int)value[1] << 16);
case 1:
result |= ((int)value[0] << 24);
break;
default:
throw new IllegalArgumentException("Invalid array size");
}
return result;
}
private static class DirectMemoryTrade
{
private static long offset = 0;
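// Running byte offsets within the off-heap record: each "+=" advances
// by the size of the previous field, giving tradeId@0, clientId@8,
// venueCode@16, instrumentCode@20, price@24, quantity@32, side@40,
// and a total record size of 42 bytes.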
private static final long tradeIdOffset = offset += 0;
private static final long clientIdOffset = offset += 8;
private static final long venueCodeOffset = offset += 8;
private static final long instrumentCodeOffset = offset += 4;
private static final long priceOffset = offset += 4;
private static final long quantityOffset = offset += 8;
private static final long sideOffset = offset += 8;
private static final long objectSize = offset += 2;
private long objectOffset;
public static long getObjectSize()
{
return objectSize;
}
void setObjectOffset(final long objectOffset)
{
this.objectOffset = objectOffset;
}
public long getTradeId()
{
return unsafe.getLong(objectOffset + tradeIdOffset);
}
public void setTradeId(final long tradeId)
{
unsafe.putLong(objectOffset + tradeIdOffset, tradeId);
}
public long getClientId()
{
return unsafe.getLong(objectOffset + clientIdOffset);
}
public void setClientId(final long clientId)
{
unsafe.putLong(objectOffset + clientIdOffset, clientId);
}
public int getVenueCode()
{
return unsafe.getInt(objectOffset + venueCodeOffset);
}
public void setVenueCode(final int venueCode)
{
unsafe.putInt(objectOffset + venueCodeOffset, venueCode);
}
public int getInstrumentCode()
{
return unsafe.getInt(objectOffset + instrumentCodeOffset);
}
public void setInstrumentCode(final int instrumentCode)
{
unsafe.putInt(objectOffset + instrumentCodeOffset, instrumentCode);
}
public long getPrice()
{
return unsafe.getLong(objectOffset + priceOffset);
}
public void setPrice(final long price)
{
unsafe.putLong(objectOffset + priceOffset, price);
}
public long getQuantity()
{
return unsafe.getLong(objectOffset + quantityOffset);
}
public void setQuantity(final long quantity)
{
unsafe.putLong(objectOffset + quantityOffset, quantity);
}
public char getSide()
{
return unsafe.getChar(objectOffset + sideOffset);
}
public void setSide(final char side)
{
unsafe.putChar(objectOffset + sideOffset, side);
}
}
}
</pre>
<b><span style="font-family: Arial, Helvetica, sans-serif; font-size: large;">Results</span></b>
<br />
<pre>Intel i7-860 @ 2.8GHz, 8GB RAM DDR3 1333MHz,
Windows 7 64-bit, Java 1.7.0_07
=============================================
java -server -Xms4g -Xmx4g TestJavaMemoryLayout
Memory 4,116,054,016 total, 1,108,901,104 free
0 - duration 19334ms
Memory 4,116,054,016 total, 1,109,964,752 free
1 - duration 14295ms
Memory 4,116,054,016 total, 1,108,455,504 free
2 - duration 14272ms
Memory 3,817,799,680 total, 815,308,600 free
3 - duration 28358ms
Memory 3,817,799,680 total, 810,552,816 free
4 - duration 32487ms
java -server TestDirectMemoryLayout
Memory 128,647,168 total, 126,391,384 free
0 - duration 983ms
Memory 128,647,168 total, 126,992,160 free
1 - duration 958ms
Memory 128,647,168 total, 127,663,408 free
2 - duration 873ms
Memory 128,647,168 total, 127,663,408 free
3 - duration 886ms
Memory 128,647,168 total, 127,663,408 free
4 - duration 884ms
Intel i7-2760QM @ 2.40GHz, 8GB RAM DDR3 1600MHz,
Linux 3.4.11 kernel 64-bit, Java 1.7.0_07
=================================================
java -server -Xms4g -Xmx4g TestJavaMemoryLayout
Memory 4,116,054,016 total, 1,108,912,960 free
0 - duration 12262ms
Memory 4,116,054,016 total, 1,109,962,832 free
1 - duration 9822ms
Memory 4,116,054,016 total, 1,108,458,720 free
2 - duration 10239ms
Memory 3,817,799,680 total, 815,307,640 free
3 - duration 21558ms
Memory 3,817,799,680 total, 810,551,856 free
4 - duration 23074ms
java -server TestDirectMemoryLayout
Memory 123,994,112 total, 121,818,528 free
0 - duration 634ms
Memory 123,994,112 total, 122,455,944 free
1 - duration 619ms
Memory 123,994,112 total, 123,103,320 free
2 - duration 546ms
Memory 123,994,112 total, 123,103,320 free
3 - duration 547ms
Memory 123,994,112 total, 123,103,320 free
4 - duration 534ms
</pre>
<b><span style="font-family: Arial, Helvetica, sans-serif; font-size: large;">Analysis</span></b><br />
<br />
Let's compare the results to the 3 benefits promised above.<br />
<br />
<b>1. Significantly improved performance</b><br />
<br />
The evidence here is pretty clear cut. Using the off-heap structures approach is more than an order of magnitude faster. At the most extreme, looking at the 5th run on the Sandy Bridge processor, we see a <b>43.2</b> <b>times</b> difference in the duration to complete the task. It is also a nice illustration of how well Sandy Bridge does with predictable access patterns to data. Not only is the performance significantly better, it is also more consistent. As the heap becomes fragmented, and thus access patterns become more random, the performance degrades, as can be seen in the later runs with the standard Java approach.<br />
<br />
<b>2. More compact data representation</b><br />
<br />
For our off-heap representation each object requires 42 bytes. To store 50 million of these, as in the example, we require 2,100,000,000 bytes. The memory required by the JVM heap is:<br />
<br />
<span style="font-family: Courier New, Courier, monospace;"> memory required = total memory - free memory - base JVM needs </span><br />
<br />
<span style="font-family: 'Courier New', Courier, monospace;"> 2,883,248,712 = </span><span style="font-family: Courier New, Courier, monospace;">3,817,799,680 - 810,551,856</span><span style="font-family: 'Courier New', Courier, monospace;"> </span><span style="font-family: 'Courier New', Courier, monospace;">- 123,999,112</span><br />
<br />
This implies the JVM needs ~40% more memory to represent the same data. The reason for this overhead is the array of references to the Java objects plus the object headers. In a previous <a href="http://mechanical-sympathy.blogspot.co.uk/2011/07/false-sharing.html">post</a> I discussed object layout in Java.<br />
<br />
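As a rough back-of-envelope check on where that overhead comes from (assuming compressed oops at this heap size, a 12-byte object header, and 8-byte object alignment; exact numbers vary by JVM version and flags):<br />
<pre>  object fields                     = 42 bytes
+ object header (compressed oops)   = 12 bytes
= 54 bytes, aligned to 8            = 56 bytes per object
+ reference in the array            =  4 bytes (compressed oop)
= 60 bytes per record x 50,000,000  = ~3.0GB</pre>
which is in the same ballpark as the ~2.9GB measured above.<br />
<br />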
When working with very large data sets this overhead can become a significant limiting factor.<br />
<br />
<b>3. Ability to work with very large data sets while avoiding nasty GC pauses</b><br />
<br />
The sample code above forces a GC cycle before each run, which can improve the consistency of the results in some cases. Feel free to remove the call to <span style="font-family: Courier New, Courier, monospace;">System.gc()</span> and observe the implications for yourself. If you run the tests adding the following command line arguments then the garbage collector will output in painful detail what happened.<br />
<br />
<span style="font-family: Courier New, Courier, monospace;"><span style="background-color: #fafafa; color: #444444; font-size: 14px; line-height: 18px;">-XX:+PrintGC -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution -XX:+PrintHeapAtGC -XX:+</span><wbr style="color: #444444; font-size: 14px; line-height: 18px;"></wbr><span style="background-color: #fafafa; color: #444444; font-size: 14px; line-height: 18px;">PrintGCApplicationConcurrentTi</span><wbr style="color: #444444; font-size: 14px; line-height: 18px;"></wbr><span style="background-color: #fafafa; color: #444444; font-size: 14px; line-height: 18px;">me -XX:+</span><wbr style="color: #444444; font-size: 14px; line-height: 18px;"></wbr><span style="background-color: #fafafa; color: #444444; font-size: 14px; line-height: 18px;">PrintGCApplicationStoppedTime -XX:+PrintSafepointStatistics</span></span><br />
<span style="background-color: #fafafa; color: #444444; font-family: Arial, 'Liberation Sans', 'DejaVu Sans', sans-serif; font-size: 14px; line-height: 18px;"><br /></span>
From analysing the output I can see that the application underwent a total of 29 GC cycles. The pause times below were extracted from the lines of output indicating when the application threads were stopped.
<br />
<pre>With System.gc() before each run
================================
Total time for which application threads were stopped: 0.0085280 seconds
Total time for which application threads were stopped: 0.7280530 seconds
Total time for which application threads were stopped: 8.1703460 seconds
Total time for which application threads were stopped: 5.6112210 seconds
Total time for which application threads were stopped: 1.2531370 seconds
Total time for which application threads were stopped: 7.6392250 seconds
Total time for which application threads were stopped: 5.7847050 seconds
Total time for which application threads were stopped: 1.3070470 seconds
Total time for which application threads were stopped: 8.2520880 seconds
Total time for which application threads were stopped: 6.0949910 seconds
Total time for which application threads were stopped: 1.3988480 seconds
Total time for which application threads were stopped: 8.1793240 seconds
Total time for which application threads were stopped: 6.4138720 seconds
Total time for which application threads were stopped: 4.4991670 seconds
Total time for which application threads were stopped: 4.5612290 seconds
Total time for which application threads were stopped: 0.3598490 seconds
Total time for which application threads were stopped: 0.7111000 seconds
Total time for which application threads were stopped: 1.4426750 seconds
Total time for which application threads were stopped: 1.5931500 seconds
Total time for which application threads were stopped: 10.9484920 seconds
Total time for which application threads were stopped: 7.0707230 seconds
Without System.gc() before each run
===================================
Test run times
0 - duration 12120ms
1 - duration 9439ms
2 - duration 9844ms
3 - duration 20933ms
4 - duration 23041ms
Total time for which application threads were stopped: 0.0170860 seconds
Total time for which application threads were stopped: 0.7915350 seconds
Total time for which application threads were stopped: 10.7153320 seconds
Total time for which application threads were stopped: 5.6234650 seconds
Total time for which application threads were stopped: 1.2689950 seconds
Total time for which application threads were stopped: 7.6238170 seconds
Total time for which application threads were stopped: 6.0114540 seconds
Total time for which application threads were stopped: 1.2990070 seconds
Total time for which application threads were stopped: 7.9918480 seconds
Total time for which application threads were stopped: 5.9997920 seconds
Total time for which application threads were stopped: 1.3430040 seconds
Total time for which application threads were stopped: 8.0759940 seconds
Total time for which application threads were stopped: 6.3980610 seconds
Total time for which application threads were stopped: 4.5572100 seconds
Total time for which application threads were stopped: 4.6193830 seconds
Total time for which application threads were stopped: 0.3877930 seconds
Total time for which application threads were stopped: 0.7429270 seconds
Total time for which application threads were stopped: 1.5248070 seconds
Total time for which application threads were stopped: 1.5312130 seconds
Total time for which application threads were stopped: 10.9120250 seconds
Total time for which application threads were stopped: 7.3528590 seconds
</pre>
It can be seen from the output that a significant proportion of the time is spent in the garbage collector. When your threads are stopped your application is not responsive. These tests have been done with default GC settings. It is possible to tune the GC for better results, but this can be a highly skilled and significant effort. The only JVM I know of that copes well, by not imposing long pause times even under high-throughput conditions, is the Azul concurrent compacting collector.<br />
<br />
When profiling this application, I can see that the majority of the time is spent allocating the objects and promoting them to the old generation because they do not fit in the young generation. The initialisation costs can be removed from the timing but that is not realistic. If the traditional Java approach is taken the state needs to be built up before the query can take place. The end user of an application has to wait for the state to be built up and the query executed.<br />
<br />
This test is really quite trivial. Imagine working with similar data sets but at the 100 GB scale.<br />
<br />
<b>Note:</b> When the garbage collector compacts a region, then objects that were next to each other can be moved far apart. This can result in TLB and other cache misses.<br />
<br />
<b><span style="font-family: Arial, Helvetica, sans-serif; font-size: large;">Side Note On Serialization</span></b><br />
<br />
A huge benefit of using off-heap structures in this manner is how they can be very easily serialised to network, or storage, by a simple memory copy as I have shown in the previous <a href="http://mechanical-sympathy.blogspot.co.uk/2012/07/native-cc-like-performance-for-java.html">post</a>. This way we can completely bypass intermediate buffer and object allocation.<br />
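As a hedged sketch of that memory-copy idea, the methods below could be added to the <span style="font-family: Courier New, Courier, monospace;">TestDirectMemoryLayout</span> class above; they assume its <span style="font-family: Courier New, Courier, monospace;">unsafe</span> handle, <span style="font-family: Courier New, Courier, monospace;">address</span> base, and 42-byte record size. Serialising a record becomes a single copy rather than a field-by-field encode.<br />
<pre>// Sketch only: relies on the 'unsafe', 'address' and DirectMemoryTrade
// definitions above. Copies one whole record into a byte[] ready for
// the wire, and back out again, with no per-field work or allocation.
private static final long BYTE_ARRAY_OFFSET =
    unsafe.arrayBaseOffset(byte[].class);

public static void writeRecord(final int index, final byte[] dest)
{
    final long recordAddress =
        address + (index * DirectMemoryTrade.getObjectSize());
    unsafe.copyMemory(null, recordAddress,
                      dest, BYTE_ARRAY_OFFSET,
                      DirectMemoryTrade.getObjectSize());
}

public static void readRecord(final int index, final byte[] src)
{
    final long recordAddress =
        address + (index * DirectMemoryTrade.getObjectSize());
    unsafe.copyMemory(src, BYTE_ARRAY_OFFSET,
                      null, recordAddress,
                      DirectMemoryTrade.getObjectSize());
}</pre>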
<br />
<b><span style="font-family: Arial, Helvetica, sans-serif; font-size: large;">Conclusion</span></b><br />
<br />
If you are willing to do some C-style programming for large datasets it is possible to control the memory layout in Java by going off-heap. If you do, the benefits in performance, compactness, and avoiding GC issues are significant. However, this is an approach that should <b>not</b> be used for all applications. Its benefits are only noticeable for very large datasets, or at the extremes of performance in throughput and/or latency. <br />
<br />
I hope the Java community can collectively realise the importance of supporting structures both on the heap and the stack. John Rose has done some excellent <a href="https://blogs.oracle.com/jrose/entry/tuples_in_the_vm">work</a> in this area defining how tuples could be added to the JVM. His talk on <a href="http://medianetwork.oracle.com/video/player/1785452137001">Arrays 2.0</a> from the JVM Language Summit this year is really worth a watch. John discusses options for arrays of structures, and structures of arrays, in his talk. If the tuples, as proposed by John, were available then the test described here could have comparable performance and be a more pleasant programming style. The whole array of structures could be allocated in a single action thus bypassing the copy of individual objects across generations, and it would be stored in a compact contiguous fashion. This would remove the significant GC issues for this class of problem.<br />
<br />
Lately, I was comparing standard data structures between Java and .Net. In some cases I observed a 6-10X performance advantage to .Net for things like maps and dictionaries when .Net used native structure support. Let's get this into Java as soon as possible!<br />
<br />
It is also pretty obvious from the results that if we are to use Java for real-time analysis on big data, then our standard garbage collectors need to significantly improve and support true concurrent operations.
<br />
<br />
[1] - To my knowledge the only JVM that deals well with very large heaps is <a href="http://www.azulsystems.com/products/zing/whatisit">Azul Zing</a>
</div>
Martin Thompsonhttp://www.blogger.com/profile/15893849163924476586noreply@blogger.com80London, UK51.5073346 -0.127683151.3492066 -0.4435401 51.6654626 0.1881739tag:blogger.com,1999:blog-5560209661389175529.post-57580081949660146202012-08-05T21:12:00.000+01:002012-11-11T17:38:29.209+00:00Memory Access Patterns Are Important<div dir="ltr" style="text-align: left;" trbidi="on">
In high-performance computing it is often said that the cost of a cache-miss is the largest performance penalty for an algorithm. For many years the increase in speed of our processors has greatly outstripped latency gains to main-memory. Bandwidth to main-memory has greatly increased via wider, multi-channel buses; however, the latency has not significantly reduced. To hide this latency our processors employ ever more complex cache sub-systems that have many layers.
<br />
<br />
The 1994 paper "<a href="http://dl.acm.org/citation.cfm?id=216588">Hitting the memory wall: implications of the obvious</a>" describes the problem and goes on to argue that caches do not ultimately help because of compulsory cache-misses. I aim to show that by using access patterns which display consideration for the cache hierarchy, this conclusion is not inevitable.<br />
<br />
Let's start putting the problem in context with some examples. Our hardware tries to hide the main-memory latency via a number of techniques. Basically three major bets are taken on memory access patterns:<br />
<ol style="text-align: left;">
<li><b>Temporal</b>: Memory accessed recently will likely be required again soon.</li>
<li><b>Spatial</b>: Adjacent memory is likely to be required soon. </li>
<li><b>Striding</b>: Memory access is likely to follow a predictable pattern.</li>
</ol>
<div>
To illustrate these three bets in action let's write some code and measure the results.</div>
<div>
<ol style="text-align: left;">
<li>Walk through memory in a linear fashion being completely predictable.</li>
<li>Pseudo randomly walk round memory within a restricted area then move on. This restricted area is what is commonly known as an operating system <a href="http://en.wikipedia.org/wiki/Page_(computer_memory)">page</a> of memory.</li>
<li>Pseudo randomly walk around a large area of the heap.</li>
</ol>
<div>
<span style="font-family: Arial, Helvetica, sans-serif; font-size: large;"><b>Code</b></span></div>
<br />
The following code should be run with the <span style="font-family: Courier New, Courier, monospace;"><b>-Xmx4g</b></span> JVM option.
</div>
<pre class="brush: java; toolbar: false">public class TestMemoryAccessPatterns
{
private static final int LONG_SIZE = 8;
private static final int PAGE_SIZE = 2 * 1024 * 1024;
private static final int ONE_GIG = 1024 * 1024 * 1024;
private static final long TWO_GIG = 2L * ONE_GIG;
private static final int ARRAY_SIZE = (int)(TWO_GIG / LONG_SIZE);
private static final int WORDS_PER_PAGE = PAGE_SIZE / LONG_SIZE;
private static final int ARRAY_MASK = ARRAY_SIZE - 1;
private static final int PAGE_MASK = WORDS_PER_PAGE - 1;
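// Stride by a large prime then mask: the increment is odd and the sizes
// are powers of two, so each pass visits every slot exactly once while
// looking random to the pre-fetchers.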
private static final int PRIME_INC = 514229;
private static final long[] memory = new long[ARRAY_SIZE];
static
{
for (int i = 0; i < ARRAY_SIZE; i++)
{
memory[i] = 777;
}
}
public enum StrideType
{
LINEAR_WALK
{
public int next(final int pageOffset, final int wordOffset, final int pos)
{
return (pos + 1) & ARRAY_MASK;
}
},
RANDOM_PAGE_WALK
{
public int next(final int pageOffset, final int wordOffset, final int pos)
{
return pageOffset + ((pos + PRIME_INC) & PAGE_MASK);
}
},
RANDOM_HEAP_WALK
{
public int next(final int pageOffset, final int wordOffset, final int pos)
{
return (pos + PRIME_INC) & ARRAY_MASK;
}
};
public abstract int next(int pageOffset, int wordOffset, int pos);
}
public static void main(final String[] args)
{
final StrideType strideType;
switch (Integer.parseInt(args[0]))
{
case 1:
strideType = StrideType.LINEAR_WALK;
break;
case 2:
strideType = StrideType.RANDOM_PAGE_WALK;
break;
case 3:
strideType = StrideType.RANDOM_HEAP_WALK;
break;
default:
throw new IllegalArgumentException("Unknown StrideType");
}
for (int i = 0; i < 5; i++)
{
perfTest(i, strideType);
}
}
private static void perfTest(final int runNumber, final StrideType strideType)
{
final long start = System.nanoTime();
int pos = -1;
long result = 0;
for (int pageOffset = 0; pageOffset < ARRAY_SIZE; pageOffset += WORDS_PER_PAGE)
{
for (int wordOffset = pageOffset, limit = pageOffset + WORDS_PER_PAGE;
wordOffset < limit;
wordOffset++)
{
pos = strideType.next(pageOffset, wordOffset, pos);
result += memory[pos];
}
}
final long duration = System.nanoTime() - start;
final double nsOp = duration / (double)ARRAY_SIZE;
if (208574349312L != result)
{
throw new IllegalStateException();
}
System.out.format("%d - %.2fns %s\n",
Integer.valueOf(runNumber),
Double.valueOf(nsOp),
strideType);
}
}
</pre>
<div>
<span style="font-family: Arial, Helvetica, sans-serif; font-size: large;"><b>Results</b></span></div>
<pre>Intel U4100 @ 1.3GHz, 4GB RAM DDR2 800MHz,
Windows 7 64-bit, Java 1.7.0_05
===========================================
0 - 2.38ns LINEAR_WALK
1 - 2.41ns LINEAR_WALK
2 - 2.35ns LINEAR_WALK
3 - 2.36ns LINEAR_WALK
4 - 2.39ns LINEAR_WALK
0 - 12.45ns RANDOM_PAGE_WALK
1 - 12.27ns RANDOM_PAGE_WALK
2 - 12.17ns RANDOM_PAGE_WALK
3 - 12.22ns RANDOM_PAGE_WALK
4 - 12.18ns RANDOM_PAGE_WALK
0 - 152.86ns RANDOM_HEAP_WALK
1 - 151.80ns RANDOM_HEAP_WALK
2 - 151.72ns RANDOM_HEAP_WALK
3 - 151.91ns RANDOM_HEAP_WALK
4 - 151.36ns RANDOM_HEAP_WALK
Intel i7-860 @ 2.8GHz, 8GB RAM DDR3 1333MHz,
Windows 7 64-bit, Java 1.7.0_05
=============================================
0 - 1.06ns LINEAR_WALK
1 - 1.05ns LINEAR_WALK
2 - 0.98ns LINEAR_WALK
3 - 1.00ns LINEAR_WALK
4 - 1.00ns LINEAR_WALK
0 - 3.80ns RANDOM_PAGE_WALK
1 - 3.85ns RANDOM_PAGE_WALK
2 - 3.79ns RANDOM_PAGE_WALK
3 - 3.65ns RANDOM_PAGE_WALK
4 - 3.64ns RANDOM_PAGE_WALK
0 - 30.04ns RANDOM_HEAP_WALK
1 - 29.05ns RANDOM_HEAP_WALK
2 - 29.14ns RANDOM_HEAP_WALK
3 - 28.88ns RANDOM_HEAP_WALK
4 - 29.57ns RANDOM_HEAP_WALK
Intel i7-2760QM @ 2.40GHz, 8GB RAM DDR3 1600MHz,
Linux 3.4.6 kernel 64-bit, Java 1.7.0_05
=================================================
0 - 0.91ns LINEAR_WALK
1 - 0.92ns LINEAR_WALK
2 - 0.88ns LINEAR_WALK
3 - 0.89ns LINEAR_WALK
4 - 0.89ns LINEAR_WALK
0 - 3.29ns RANDOM_PAGE_WALK
1 - 3.35ns RANDOM_PAGE_WALK
2 - 3.33ns RANDOM_PAGE_WALK
3 - 3.31ns RANDOM_PAGE_WALK
4 - 3.30ns RANDOM_PAGE_WALK
0 - 9.58ns RANDOM_HEAP_WALK
1 - 9.20ns RANDOM_HEAP_WALK
2 - 9.44ns RANDOM_HEAP_WALK
3 - 9.46ns RANDOM_HEAP_WALK
4 - 9.47ns RANDOM_HEAP_WALK
</pre>
<div>
<span style="font-family: Arial, Helvetica, sans-serif; font-size: large;"><b>Analysis</b></span></div>
<div>
<br />
I ran the code on 3 different CPU architectures illustrating generational steps forward for Intel. It is clear from the results that, for a relatively small heap, each generation has become progressively better at hiding main-memory latency based on the 3 bets described above. This is because the size and sophistication of the various caches keep improving. However, as memory size increases they become less effective. For example, if the array is doubled to 4GB in size, then the average latency increases from ~30ns to ~55ns for the i7-860 doing the random heap walk.<br />
<br />
It seems that for the linear walk case, memory latency does not exist. However, as we walk around memory in an ever more random pattern, the latency starts to become very apparent.</div>
<div>
<br />
The random heap walk produced an interesting result. This is our worst-case scenario, and given the hardware specifications of these systems, we could be looking at 150ns, 65ns, and 75ns for the above tests respectively, based on memory controller and memory module latencies. For the Nehalem (i7-860) I can further subvert the cache sub-system by using a 4GB array, resulting in ~55ns on average per iteration. The i7-2760QM has larger load buffers and TLB caches, and Linux is running with transparent huge pages, which are all working to further hide the latency. By playing with different prime numbers for the stride, results can vary wildly depending on processor type, e.g. try <span style="font-family: Courier New, Courier, monospace;">PRIME_INC = 39916801</span> for Nehalem. I'd like to test this on a much larger heap with Sandy Bridge.<br />
<br />
The main take away is the more predictable the pattern of access to memory, then the better the cache sub-systems are at hiding main-memory latency. Let's look at these cache sub-systems in a little detail to try and understand the observed results.<br />
<br />
<span style="font-family: Arial, Helvetica, sans-serif; font-size: large;"><b>Hardware Components</b></span><br />
<br />
We have many layers of cache plus the pre-fetchers to consider for how latency gets hidden. In this section I'll try and cover the major components used to hide latency that our hardware and systems software friends have put in place. We will investigate these latency hiding components and use the Linux <a href="https://perf.wiki.kernel.org/index.php/Tutorial">perf</a> and <a href="http://code.google.com/p/likwid/">Lightweight Performance Counters</a> utilities to retrieve the performance counters from our CPUs which tell how effective these components are when we execute our programs. Performance counters are CPU specific and what I've used here are specific to Sandy Bridge.<br />
<br />
<b>Data Caches</b><br />
Processors typically have 2 or 3 layers of data cache. Each layer as we move out is progressively larger with increasing latency. The latest Intel processors have 3 layers (L1D, L2, and L3); with sizes 32KB, 256KB, and 4-30MB; and ~1ns, ~4ns, and ~15ns latency respectively for a 3.0GHz CPU.<br />
<br />
Data caches are effectively hardware hash tables with a fixed number of slots for each hash value. These slots are known as "ways". An 8-way associative cache will have 8 slots to hold values for addresses that hash to the same cache location. Within these slots the data caches do not store words, they store cache-lines of multiple words. For an Intel processor these cache-lines are typically 64-bytes, that is 8 words on a 64-bit machine. This plays to the spatial bet that adjacent memory is likely to be required soon, which is typically the case if we think of arrays or fields of an object.<br />
<br />
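To make the set-associative mapping concrete, the following sketch shows how an address maps to a cache set under the sizes above. The 64-byte line size is typical for Intel; the set count shown is derived from an assumed 32KB 8-way L1D and is illustrative only.<br />
<pre>public class CacheSetMapping
{
    private static final int CACHE_LINE_SIZE = 64; // bytes per cache-line
    private static final int NUM_SETS = 64;        // 32KB L1D / 64B line / 8 ways

    // Addresses whose line numbers differ by a multiple of NUM_SETS
    // compete for the same 8 ways within one set.
    public static int setFor(final long address)
    {
        final long lineNumber = address / CACHE_LINE_SIZE;
        return (int)(lineNumber % NUM_SETS);
    }

    public static void main(final String[] args)
    {
        System.out.println(setFor(0x1040L)); // line 65 maps to set 1
    }
}</pre>
<br />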
Data caches are typically evicted in an LRU manner. Caches work by using a write-back algorithm, where stores need only be propagated to main-memory when a modified cache-line is evicted. This gives rise to the interesting phenomenon that a load can cause a write-back to the outer cache layers, and eventually to main-memory.<br />
<pre>perf stat -e L1-dcache-loads,L1-dcache-load-misses java -Xmx4g TestMemoryAccessPatterns $
Performance counter stats for 'java -Xmx4g TestMemoryAccessPatterns 1':
1,496,626,053 L1-dcache-loads
274,255,164 L1-dcache-misses
# 18.32% of all L1-dcache hits
Performance counter stats for 'java -Xmx4g TestMemoryAccessPatterns 2':
1,537,057,965 L1-dcache-loads
1,570,105,933 L1-dcache-misses
# 102.15% of all L1-dcache hits
Performance counter stats for 'java -Xmx4g TestMemoryAccessPatterns 3':
4,321,888,497 L1-dcache-loads
1,780,223,433 L1-dcache-misses
# 41.19% of all L1-dcache hits
likwid-perfctr -C 2 -g L2CACHE java -Xmx4g TestMemoryAccessPatterns $
java -Xmx4g TestMemoryAccessPatterns 1
+-----------------------+-------------+
| Event | core 2 |
+-----------------------+-------------+
| INSTR_RETIRED_ANY | 5.94918e+09 |
| CPU_CLK_UNHALTED_CORE | 5.15969e+09 |
| L2_TRANS_ALL_REQUESTS | 1.07252e+09 |
| L2_RQSTS_MISS | 3.25413e+08 |
+-----------------------+-------------+
+-----------------+-----------+
| Metric | core 2 |
+-----------------+-----------+
| Runtime [s] | 2.15481 |
| CPI | 0.867293 |
| L2 request rate | 0.18028 |
| L2 miss rate | 0.0546988 |
| L2 miss ratio | 0.303409 |
+-----------------+-----------+
+------------------------+-------------+
| Event | core 2 |
+------------------------+-------------+
| L3_LAT_CACHE_REFERENCE | 1.26545e+08 |
| L3_LAT_CACHE_MISS | 2.59059e+07 |
+------------------------+-------------+
java -Xmx4g TestMemoryAccessPatterns 2
+-----------------------+-------------+
| Event | core 2 |
+-----------------------+-------------+
| INSTR_RETIRED_ANY | 1.48772e+10 |
| CPU_CLK_UNHALTED_CORE | 1.64712e+10 |
| L2_TRANS_ALL_REQUESTS | 3.41061e+09 |
| L2_RQSTS_MISS | 1.5547e+09 |
+-----------------------+-------------+
+-----------------+----------+
| Metric | core 2 |
+-----------------+----------+
| Runtime [s] | 6.87876 |
| CPI | 1.10714 |
| L2 request rate | 0.22925 |
| L2 miss rate | 0.104502 |
| L2 miss ratio | 0.455843 |
+-----------------+----------+
+------------------------+-------------+
| Event | core 2 |
+------------------------+-------------+
| L3_LAT_CACHE_REFERENCE | 1.52088e+09 |
| L3_LAT_CACHE_MISS | 1.72918e+08 |
+------------------------+-------------+
java -Xmx4g TestMemoryAccessPatterns 3
+-----------------------+-------------+
| Event | core 2 |
+-----------------------+-------------+
| INSTR_RETIRED_ANY | 6.49533e+09 |
| CPU_CLK_UNHALTED_CORE | 4.18416e+10 |
| L2_TRANS_ALL_REQUESTS | 4.67488e+09 |
| L2_RQSTS_MISS | 1.43442e+09 |
+-----------------------+-------------+
+-----------------+----------+
| Metric | core 2 |
+-----------------+----------+
| Runtime [s] | 17.474 |
| CPI | 6.4418 |
| L2 request rate | 0.71973 |
| L2 miss rate | 0.220838 |
| L2 miss ratio | 0.306835 |
+-----------------+----------+
+------------------------+-------------+
| Event | core 2 |
+------------------------+-------------+
| L3_LAT_CACHE_REFERENCE | 1.40079e+09 |
| L3_LAT_CACHE_MISS | 1.34832e+09 |
+------------------------+-------------+
</pre>
<b>Note</b>: The cache-miss rate of the combined L1D, L2 and L3 increases significantly as the pattern of access becomes more random.
<br />
<br />
<b>Translation Lookaside Buffers (TLBs)</b><br />
Our programs deal with virtual memory addresses that need to be translated to physical memory addresses. Virtual memory systems do this by mapping pages. We need to know the offset for a given page and its size for any memory operation. Typically page sizes are 4KB and are gradually moving to 2MB and greater. Linux introduced <a href="http://lwn.net/Articles/423584/">Transparent Huge Pages</a> in the 2.6.38 kernel giving us 2MB pages. The translation of virtual memory pages to physical pages is maintained by the <a href="http://en.wikipedia.org/wiki/Page_table">page table</a>. This translation can require multiple accesses to the page table which is a huge performance penalty. To accelerate this lookup, processors have a small hardware cache at each cache level called the TLB cache. A miss on the TLB cache can be hugely expensive because the page table may not be in a nearby data cache. By moving to larger pages, a TLB cache can cover a larger address range for the same number of entries.<br />
<pre>perf stat -e dTLB-loads,dTLB-load-misses java -Xmx4g TestMemoryAccessPatterns $
Performance counter stats for 'java -Xmx4g TestMemoryAccessPatterns 1':
1,496,128,634 dTLB-loads
310,901 dTLB-misses
# 0.02% of all dTLB cache hits
Performance counter stats for 'java -Xmx4g TestMemoryAccessPatterns 2':
1,551,585,263 dTLB-loads
340,230 dTLB-misses
# 0.02% of all dTLB cache hits
Performance counter stats for 'java -Xmx4g TestMemoryAccessPatterns 3':
4,031,344,537 dTLB-loads
1,345,807,418 dTLB-misses
# 33.38% of all dTLB cache hits
</pre>
<b>Note</b>: Even with huge pages employed, we only incur significant TLB misses when randomly walking the whole heap.
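<br />
The arithmetic behind that note is straightforward. Assuming, for illustration, a dTLB with 64 entries (actual entry counts vary by microarchitecture and cache level):<br />
<pre>TLB reach = number of entries x page size

 64 entries x 4KB pages = 256KB reach
 64 entries x 2MB pages = 128MB reach</pre>
With 4KB pages even the page-local walk would stress the TLB, whereas 2MB pages comfortably cover it; but the 2GB array spans 1,024 huge pages, so the whole-heap random walk still misses.<br />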
<br />
<br />
<b>Hardware Pre-Fetchers</b><br />
Hardware will try to predict the next memory access our programs will make and speculatively load that memory into fill buffers. At its simplest level this is done by pre-loading adjacent cache-lines for the spatial bet, or by recognising regular stride-based access patterns, typically less than 2KB in stride length. In the tests below we are measuring the number of loads that hit a fill buffer from a hardware pre-fetch.<br />
<pre>likwid-perfctr -C 2 -g LOAD_HIT_PRE_HW_PF:PMC0 java -Xmx4g TestMemoryAccessPatterns $
java -Xmx4g TestMemoryAccessPatterns 1
+--------------------+-------------+
| Event | core 2 |
+--------------------+-------------+
| LOAD_HIT_PRE_HW_PF | 1.31613e+09 |
+--------------------+-------------+
java -Xmx4g TestMemoryAccessPatterns 2
+--------------------+--------+
| Event | core 2 |
+--------------------+--------+
| LOAD_HIT_PRE_HW_PF | 368930 |
+--------------------+--------+
java -Xmx4g TestMemoryAccessPatterns 3
+--------------------+--------+
| Event | core 2 |
+--------------------+--------+
| LOAD_HIT_PRE_HW_PF | 324373 |
+--------------------+--------+
</pre>
<b>Note</b>: We have a significant success rate for load hits with the pre-fetcher on the linear walk.
<br />
<br />
<b>Memory Controllers and Row Buffers</b><br />
Beyond our last level cache (LLC) sit the memory controllers that manage access to the SDRAM banks. Memory is organised into rows and columns. To access an address, first the row address must be selected (RAS), then the column address is selected (CAS) within that row to get the word. The row is typically a page in size and is loaded into a row buffer. Even at this stage the hardware is still helping hide the latency. A queue of memory access requests is maintained and re-ordered so that multiple words can be fetched from the same row if possible. <br />
<br />
<b>Non-Uniform Memory Access (NUMA)</b><br />
Systems now have memory controllers on the CPU socket. This move to on-socket memory controllers gave an ~50ns latency reduction over the previous designs with a front side bus (FSB) and external <a href="http://en.wikipedia.org/wiki/Northbridge_(computing)">Northbridge</a> memory controllers. Systems with multiple sockets employ memory interconnects, <a href="http://www.intel.com/content/www/us/en/io/quickpath-technology/quick-path-interconnect-introduction-paper.html">QPI</a> from Intel, which are used when one CPU wants to access memory managed by another CPU socket. The presence of these interconnects gives rise to the non-uniform nature of server memory access. In a 2-socket system memory may be local or 1 hop away. On an 8-socket system memory can be up to 3 hops away, where each hop adds 20ns latency in each direction.<br />
<br />
<span style="font-family: Arial, Helvetica, sans-serif; font-size: large;"><b>What does this mean for algorithms?</b></span><br />
<br />
The difference between an L1D cache-hit, and a full miss resulting in main-memory access, is 2 orders of magnitude; i.e. <1ns vs. 65-100ns. If algorithms randomly walk around our ever increasing address spaces, then we are less likely to benefit from the hardware support that hides this latency.<br />
<br />
Is there anything we can do about this when designing algorithms and data-structures? Yes, there is a lot we can do. If we perform chunks of work on data that is co-located, and we stride around memory in a predictable fashion, then our algorithms can be many times faster. For example, rather than using bucket and chain <a href="http://en.wikipedia.org/wiki/Hash_table">hash tables</a>, like in the JDK, we can employ hash tables using open-addressing with linear-probing, as sketched below. Rather than using linked-lists or trees with single items in each node, we can store an array of many items in each node.<br />
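<br />
As a minimal sketch of the open-addressing idea (illustrative only: identity hash, power-of-2 capacity, no resizing or deletion), note how a probe walks adjacent array slots and so plays to the spatial and striding bets above:<br />
<pre>// Minimal open-addressing hash map with linear probing: keys and values
// live in flat long[] arrays, so probing walks adjacent memory.
public class LinearProbeMap
{
    private static final long EMPTY = Long.MIN_VALUE; // sentinel, not a valid key

    private final long[] keys;
    private final long[] values;
    private final int mask;

    public LinearProbeMap(final int capacityPowerOfTwo)
    {
        keys = new long[capacityPowerOfTwo];
        values = new long[capacityPowerOfTwo];
        mask = capacityPowerOfTwo - 1;
        java.util.Arrays.fill(keys, EMPTY);
    }

    public void put(final long key, final long value)
    {
        int i = (int)(key & mask);
        while (keys[i] != EMPTY && keys[i] != key)
        {
            i = (i + 1) & mask; // probe the adjacent slot, likely same cache-line
        }
        keys[i] = key;
        values[i] = value;
    }

    public long get(final long key)
    {
        int i = (int)(key & mask);
        while (keys[i] != EMPTY)
        {
            if (keys[i] == key)
            {
                return values[i];
            }
            i = (i + 1) & mask;
        }
        return EMPTY; // not found
    }
}</pre>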
<br />
Research is advancing on algorithmic approaches that work in harmony with cache sub-systems. One area I find fascinating is <a href="http://en.wikipedia.org/wiki/Cache-oblivious_algorithm">Cache Oblivious Algorithms</a>. The name is a bit misleading but there are some great concepts here for how to improve software performance and better execute in parallel. This <a href="http://www.1024cores.net/home/parallel-computing/cache-oblivious-algorithms">article</a> is a great illustration of the performance benefits that can be gained.<br />
<br />
<b><span style="font-family: Arial, Helvetica, sans-serif; font-size: large;">Conclusion</span></b><br />
<br />
To achieve great performance it is important to have sympathy for the cache sub-systems. We have seen in this article what can be achieved by accessing memory in patterns which work with, rather than against, these caches. When designing algorithms and data structures, it is now vitally important to consider cache-misses, probably even more so than counting steps in the algorithm. This is not what we were taught in algorithm theory when studying computer science. The last decade has seen some fundamental changes in technology. For me the two most significant are the rise of multi-core, and now big-memory systems with 64-bit address spaces.<br />
<br />
One thing is certain, if we want software to execute faster and scale better, we need to make better use of the many cores in our CPUs, and pay attention to memory access patterns.
<br />
<br />
<b>Update: 06-August-2012</b><br />
Trying to design a random walk algorithm for all processors and memory sizes is tricky. If I use the algorithm below then my Sandy Bridge processor is slower but the Nehalem is faster. The point is performance will be <b>very</b> unpredictable when you walk around memory in a random fashion. I've also included the L3 cache counters for more detail in all the tests.<br />
<pre class="brush: java; toolbar: false"> private static final long LARGE_PRIME_INC = 70368760954879L;
RANDOM_HEAP_WALK
{
public int next(final int pageOffset, final int wordOffset, final int pos)
{
return (int)(pos + LARGE_PRIME_INC) & ARRAY_MASK;
}
};
</pre>
<pre>Intel i7-2760QM @ 2.40GHz, 8GB RAM DDR3 1600MHz,
Linux 3.4.6 kernel 64-bit, Java 1.7.0_05
=================================================
0 - 29.06ns RANDOM_HEAP_WALK
1 - 29.47ns RANDOM_HEAP_WALK
2 - 29.48ns RANDOM_HEAP_WALK
3 - 29.43ns RANDOM_HEAP_WALK
4 - 29.42ns RANDOM_HEAP_WALK
Performance counter stats for 'java -Xmx4g TestMemoryAccessPatterns 3':
9,444,928,682 dTLB-loads
4,371,982,327 dTLB-misses
# 46.29% of all dTLB cache hits
9,390,675,639 L1-dcache-loads
1,471,647,016 L1-dcache-misses
# 15.67% of all L1-dcache hits
+-----------------------+-------------+
| Event | core 2 |
+-----------------------+-------------+
| INSTR_RETIRED_ANY | 7.71171e+09 |
| CPU_CLK_UNHALTED_CORE | 1.31717e+11 |
| L2_TRANS_ALL_REQUESTS | 8.4912e+09 |
| L2_RQSTS_MISS | 2.79635e+09 |
+-----------------------+-------------+
+-----------------+----------+
| Metric | core 2 |
+-----------------+----------+
| Runtime [s] | 55.0094 |
| CPI | 17.0801 |
| L2 request rate | 1.10108 |
| L2 miss rate | 0.362611 |
| L2 miss ratio | 0.329324 |
+-----------------+----------+
+--------------------+-------------+
| Event | core 2 |
+--------------------+-------------+
| LOAD_HIT_PRE_HW_PF | 3.59509e+06 |
+--------------------+-------------+
+------------------------+-------------+
| Event | core 2 |
+------------------------+-------------+
| L3_LAT_CACHE_REFERENCE | 1.30318e+09 |
| L3_LAT_CACHE_MISS | 2.62346e+07 |
+------------------------+-------------+
</pre>
</div>
</div>
Martin Thompsonhttp://www.blogger.com/profile/15893849163924476586noreply@blogger.com36London, UK51.5073346 -0.127683151.3492066 -0.4435401 51.6654626 0.1881739tag:blogger.com,1999:blog-5560209661389175529.post-2933678847830738962012-07-05T18:51:00.001+01:002022-08-17T11:35:58.858+01:00Native C/C++ Like Performance For Java Object Serialisation<div dir="ltr" style="text-align: left;" trbidi="on">
Do you ever wish you could turn a Java object into a stream of bytes as fast as it can be done in a native language like C++? If you use standard Java Serialization you could be disappointed with the performance. Java Serialization was designed for a very different purpose than serialising objects as quickly and compactly as possible.<br />
<br />
Why do we need fast and compact serialisation? Many of our systems are distributed and we need to communicate by passing state between processes efficiently. This state lives inside our objects. I've profiled many systems and often a large part of the cost is the serialisation of this state to-and-from byte buffers. I've seen a significant range of protocols and mechanisms used to achieve this. At one end of the spectrum are the easy-to-use but inefficient protocols like Java <a href="http://java.sun.com/developer/technicalArticles/Programming/serialization/">Serialisation</a>, <a href="http://en.wikipedia.org/wiki/XML">XML</a> and <a href="http://en.wikipedia.org/wiki/JSON">JSON</a>. At the other end of the spectrum are the binary protocols that can be very fast and efficient, but require deeper understanding and skill.<br />
<br />
In this article I will illustrate the performance gains that are possible when using simple binary protocols and introduce a little known technique available in Java to achieve similar performance to what is possible with native languages like C or C++.<br />
<br />
The three approaches to be compared are:
<br />
<ol style="text-align: left;">
<li><b>Java Serialization</b>: The standard method in Java of having an object implement <span style="background-color: white;"><a href="http://docs.oracle.com/javase/6/docs/api/java/io/Serializable.html"><span style="font-family: 'Courier New', Courier, monospace;">Serializable</span></a>.</span></li>
<li><b>Binary via ByteBuffer</b>: A simple protocol using the <a href="http://docs.oracle.com/javase/6/docs/api/java/nio/ByteBuffer.html"><span style="font-family: 'Courier New', Courier, monospace;">ByteBuffer</span></a> API to write the fields of an object in binary format. This is our baseline for what is considered a good binary encoding approach.</li>
<li><b>Binary via Unsafe</b>: Introduction to <a href="http://www.docjar.com/docs/api/sun/misc/Unsafe.html"><span style="font-family: 'Courier New', Courier, monospace;">Unsafe</span></a> and its collection of methods that allow direct memory manipulation. Here I will show how to get similar performance to C/C++.</li>
</ol>
<span style="font-family: Arial, Helvetica, sans-serif; font-size: large;"><b>The Code</b></span><br />
<pre>
import sun.misc.Unsafe;

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.lang.reflect.Field;
import java.nio.ByteBuffer;
import java.util.Arrays;

public final class TestSerialisationPerf
{
    public static final int REPETITIONS = 1 * 1000 * 1000;

    private static ObjectToBeSerialised ITEM =
        new ObjectToBeSerialised(
            1010L, true, 777, 99,
            new double[]{0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0},
            new long[]{1, 2, 3, 4, 5, 6, 7, 8, 9, 10});

    public static void main(final String[] arg) throws Exception
    {
        for (final PerformanceTestCase testCase : testCases)
        {
            for (int i = 0; i < 5; i++)
            {
                testCase.performTest();

                System.out.format("%d %s\twrite=%,dns read=%,dns total=%,dns\n",
                                  i,
                                  testCase.getName(),
                                  testCase.getWriteTimeNanos(),
                                  testCase.getReadTimeNanos(),
                                  testCase.getWriteTimeNanos() +
                                  testCase.getReadTimeNanos());

                if (!ITEM.equals(testCase.getTestOutput()))
                {
                    throw new IllegalStateException("Objects do not match");
                }

                System.gc();
                Thread.sleep(3000);
            }
        }
    }

    private static final PerformanceTestCase[] testCases =
    {
        new PerformanceTestCase("Serialisation", REPETITIONS, ITEM)
        {
            ByteArrayOutputStream baos = new ByteArrayOutputStream();

            public void testWrite(ObjectToBeSerialised item) throws Exception
            {
                for (int i = 0; i < REPETITIONS; i++)
                {
                    baos.reset();

                    ObjectOutputStream oos = new ObjectOutputStream(baos);
                    oos.writeObject(item);
                    oos.close();
                }
            }

            public ObjectToBeSerialised testRead() throws Exception
            {
                ObjectToBeSerialised object = null;
                for (int i = 0; i < REPETITIONS; i++)
                {
                    ByteArrayInputStream bais =
                        new ByteArrayInputStream(baos.toByteArray());
                    ObjectInputStream ois = new ObjectInputStream(bais);
                    object = (ObjectToBeSerialised)ois.readObject();
                }

                return object;
            }
        },

        new PerformanceTestCase("ByteBuffer", REPETITIONS, ITEM)
        {
            ByteBuffer byteBuffer = ByteBuffer.allocate(1024);

            public void testWrite(ObjectToBeSerialised item) throws Exception
            {
                for (int i = 0; i < REPETITIONS; i++)
                {
                    byteBuffer.clear();
                    item.write(byteBuffer);
                }
            }

            public ObjectToBeSerialised testRead() throws Exception
            {
                ObjectToBeSerialised object = null;
                for (int i = 0; i < REPETITIONS; i++)
                {
                    byteBuffer.flip();
                    object = ObjectToBeSerialised.read(byteBuffer);
                }

                return object;
            }
        },

        new PerformanceTestCase("UnsafeMemory", REPETITIONS, ITEM)
        {
            UnsafeMemory buffer = new UnsafeMemory(new byte[1024]);

            public void testWrite(ObjectToBeSerialised item) throws Exception
            {
                for (int i = 0; i < REPETITIONS; i++)
                {
                    buffer.reset();
                    item.write(buffer);
                }
            }

            public ObjectToBeSerialised testRead() throws Exception
            {
                ObjectToBeSerialised object = null;
                for (int i = 0; i < REPETITIONS; i++)
                {
                    buffer.reset();
                    object = ObjectToBeSerialised.read(buffer);
                }

                return object;
            }
        },
    };
}

abstract class PerformanceTestCase
{
    private final String name;
    private final int repetitions;
    private final ObjectToBeSerialised testInput;
    private ObjectToBeSerialised testOutput;
    private long writeTimeNanos;
    private long readTimeNanos;

    public PerformanceTestCase(final String name, final int repetitions,
                               final ObjectToBeSerialised testInput)
    {
        this.name = name;
        this.repetitions = repetitions;
        this.testInput = testInput;
    }

    public String getName()
    {
        return name;
    }

    public ObjectToBeSerialised getTestOutput()
    {
        return testOutput;
    }

    public long getWriteTimeNanos()
    {
        return writeTimeNanos;
    }

    public long getReadTimeNanos()
    {
        return readTimeNanos;
    }

    public void performTest() throws Exception
    {
        final long startWriteNanos = System.nanoTime();
        testWrite(testInput);
        writeTimeNanos = (System.nanoTime() - startWriteNanos) / repetitions;

        final long startReadNanos = System.nanoTime();
        testOutput = testRead();
        readTimeNanos = (System.nanoTime() - startReadNanos) / repetitions;
    }

    public abstract void testWrite(ObjectToBeSerialised item) throws Exception;
    public abstract ObjectToBeSerialised testRead() throws Exception;
}

class ObjectToBeSerialised implements Serializable
{
    private static final long serialVersionUID = 10275539472837495L;

    private final long sourceId;
    private final boolean special;
    private final int orderCode;
    private final int priority;
    private final double[] prices;
    private final long[] quantities;

    public ObjectToBeSerialised(final long sourceId, final boolean special,
                                final int orderCode, final int priority,
                                final double[] prices, final long[] quantities)
    {
        this.sourceId = sourceId;
        this.special = special;
        this.orderCode = orderCode;
        this.priority = priority;
        this.prices = prices;
        this.quantities = quantities;
    }

    public void write(final ByteBuffer byteBuffer)
    {
        byteBuffer.putLong(sourceId);
        byteBuffer.put((byte)(special ? 1 : 0));
        byteBuffer.putInt(orderCode);
        byteBuffer.putInt(priority);

        byteBuffer.putInt(prices.length);
        for (final double price : prices)
        {
            byteBuffer.putDouble(price);
        }

        byteBuffer.putInt(quantities.length);
        for (final long quantity : quantities)
        {
            byteBuffer.putLong(quantity);
        }
    }

    public static ObjectToBeSerialised read(final ByteBuffer byteBuffer)
    {
        final long sourceId = byteBuffer.getLong();
        final boolean special = 0 != byteBuffer.get();
        final int orderCode = byteBuffer.getInt();
        final int priority = byteBuffer.getInt();

        final int pricesSize = byteBuffer.getInt();
        final double[] prices = new double[pricesSize];
        for (int i = 0; i < pricesSize; i++)
        {
            prices[i] = byteBuffer.getDouble();
        }

        final int quantitiesSize = byteBuffer.getInt();
        final long[] quantities = new long[quantitiesSize];
        for (int i = 0; i < quantitiesSize; i++)
        {
            quantities[i] = byteBuffer.getLong();
        }

        return new ObjectToBeSerialised(sourceId, special, orderCode,
                                        priority, prices, quantities);
    }

    public void write(final UnsafeMemory buffer)
    {
        buffer.putLong(sourceId);
        buffer.putBoolean(special);
        buffer.putInt(orderCode);
        buffer.putInt(priority);
        buffer.putDoubleArray(prices);
        buffer.putLongArray(quantities);
    }

    public static ObjectToBeSerialised read(final UnsafeMemory buffer)
    {
        final long sourceId = buffer.getLong();
        final boolean special = buffer.getBoolean();
        final int orderCode = buffer.getInt();
        final int priority = buffer.getInt();
        final double[] prices = buffer.getDoubleArray();
        final long[] quantities = buffer.getLongArray();

        return new ObjectToBeSerialised(sourceId, special, orderCode,
                                        priority, prices, quantities);
    }

    public boolean equals(final Object o)
    {
        if (this == o)
        {
            return true;
        }
        if (o == null || getClass() != o.getClass())
        {
            return false;
        }

        final ObjectToBeSerialised that = (ObjectToBeSerialised)o;

        if (orderCode != that.orderCode)
        {
            return false;
        }
        if (priority != that.priority)
        {
            return false;
        }
        if (sourceId != that.sourceId)
        {
            return false;
        }
        if (special != that.special)
        {
            return false;
        }
        if (!Arrays.equals(prices, that.prices))
        {
            return false;
        }
        if (!Arrays.equals(quantities, that.quantities))
        {
            return false;
        }

        return true;
    }
}

class UnsafeMemory
{
    private static final Unsafe unsafe;
    static
    {
        try
        {
            Field field = Unsafe.class.getDeclaredField("theUnsafe");
            field.setAccessible(true);
            unsafe = (Unsafe)field.get(null);
        }
        catch (Exception e)
        {
            throw new RuntimeException(e);
        }
    }

    private static final long byteArrayOffset = unsafe.arrayBaseOffset(byte[].class);
    private static final long longArrayOffset = unsafe.arrayBaseOffset(long[].class);
    private static final long doubleArrayOffset = unsafe.arrayBaseOffset(double[].class);

    private static final int SIZE_OF_BOOLEAN = 1;
    private static final int SIZE_OF_INT = 4;
    private static final int SIZE_OF_LONG = 8;

    private int pos = 0;
    private final byte[] buffer;

    public UnsafeMemory(final byte[] buffer)
    {
        if (null == buffer)
        {
            throw new NullPointerException("buffer cannot be null");
        }

        this.buffer = buffer;
    }

    public void reset()
    {
        this.pos = 0;
    }

    public void putBoolean(final boolean value)
    {
        unsafe.putBoolean(buffer, byteArrayOffset + pos, value);
        pos += SIZE_OF_BOOLEAN;
    }

    public boolean getBoolean()
    {
        boolean value = unsafe.getBoolean(buffer, byteArrayOffset + pos);
        pos += SIZE_OF_BOOLEAN;

        return value;
    }

    public void putInt(final int value)
    {
        unsafe.putInt(buffer, byteArrayOffset + pos, value);
        pos += SIZE_OF_INT;
    }

    public int getInt()
    {
        int value = unsafe.getInt(buffer, byteArrayOffset + pos);
        pos += SIZE_OF_INT;

        return value;
    }

    public void putLong(final long value)
    {
        unsafe.putLong(buffer, byteArrayOffset + pos, value);
        pos += SIZE_OF_LONG;
    }

    public long getLong()
    {
        long value = unsafe.getLong(buffer, byteArrayOffset + pos);
        pos += SIZE_OF_LONG;

        return value;
    }

    public void putLongArray(final long[] values)
    {
        putInt(values.length);

        long bytesToCopy = values.length << 3;
        unsafe.copyMemory(values, longArrayOffset,
                          buffer, byteArrayOffset + pos,
                          bytesToCopy);
        pos += bytesToCopy;
    }

    public long[] getLongArray()
    {
        int arraySize = getInt();
        long[] values = new long[arraySize];

        long bytesToCopy = values.length << 3;
        unsafe.copyMemory(buffer, byteArrayOffset + pos,
                          values, longArrayOffset,
                          bytesToCopy);
        pos += bytesToCopy;

        return values;
    }

    public void putDoubleArray(final double[] values)
    {
        putInt(values.length);

        long bytesToCopy = values.length << 3;
        unsafe.copyMemory(values, doubleArrayOffset,
                          buffer, byteArrayOffset + pos,
                          bytesToCopy);
        pos += bytesToCopy;
    }

    public double[] getDoubleArray()
    {
        int arraySize = getInt();
        double[] values = new double[arraySize];

        long bytesToCopy = values.length << 3;
        unsafe.copyMemory(buffer, byteArrayOffset + pos,
                          values, doubleArrayOffset,
                          bytesToCopy);
        pos += bytesToCopy;

        return values;
    }
}</pre>
<br />
<span style="font-family: Arial, Helvetica, sans-serif; font-size: large;"><b>Results</b></span>
<br />
<pre>2.8GHz Nehalem - Java 1.7.0_04
==============================
0 Serialisation write=2,517ns read=11,570ns total=14,087ns
1 Serialisation write=2,198ns read=11,122ns total=13,320ns
2 Serialisation write=2,190ns read=11,011ns total=13,201ns
3 Serialisation write=2,221ns read=10,972ns total=13,193ns
4 Serialisation write=2,187ns read=10,817ns total=13,004ns
0 ByteBuffer write=264ns read=273ns total=537ns
1 ByteBuffer write=248ns read=243ns total=491ns
2 ByteBuffer write=262ns read=243ns total=505ns
3 ByteBuffer write=300ns read=240ns total=540ns
4 ByteBuffer write=247ns read=243ns total=490ns
0 UnsafeMemory write=99ns read=84ns total=183ns
1 UnsafeMemory write=53ns read=82ns total=135ns
2 UnsafeMemory write=63ns read=66ns total=129ns
3 UnsafeMemory write=46ns read=63ns total=109ns
4 UnsafeMemory write=48ns read=58ns total=106ns
2.4GHz Sandy Bridge - Java 1.7.0_04
===================================
0 Serialisation write=1,940ns read=9,006ns total=10,946ns
1 Serialisation write=1,674ns read=8,567ns total=10,241ns
2 Serialisation write=1,666ns read=8,680ns total=10,346ns
3 Serialisation write=1,666ns read=8,623ns total=10,289ns
4 Serialisation write=1,715ns read=8,586ns total=10,301ns
0 ByteBuffer write=199ns read=198ns total=397ns
1 ByteBuffer write=176ns read=178ns total=354ns
2 ByteBuffer write=174ns read=174ns total=348ns
3 ByteBuffer write=172ns read=183ns total=355ns
4 ByteBuffer write=174ns read=180ns total=354ns
0 UnsafeMemory write=38ns read=75ns total=113ns
1 UnsafeMemory write=26ns read=52ns total=78ns
2 UnsafeMemory write=26ns read=51ns total=77ns
3 UnsafeMemory write=25ns read=51ns total=76ns
4 UnsafeMemory write=27ns read=50ns total=77ns
</pre>
<br />
<span style="font-family: Arial, Helvetica, sans-serif; font-size: large;"><b>Analysis</b></span><br />
<br />
To write and read back a single, relatively small, object on my fast 2.4GHz Sandy Bridge laptop takes ~10,000ns using Java Serialization, whereas with Unsafe this comes down to well under 100ns, even accounting for the test code itself. To put this in context, when using Java Serialization the costs are on par with a network hop! That would be very costly if your transport is a fast <a href="http://en.wikipedia.org/wiki/Inter-process_communication">IPC</a> mechanism on the same system.<br />
<br />
There are numerous reasons why Java Serialisation is so costly. For example, it writes out the fully qualified class and field names for each object, plus version information. Also, <a href="http://docs.oracle.com/javase/6/docs/api/java/io/ObjectOutputStream.html"><span style="font-family: 'Courier New', Courier, monospace;">ObjectOutputStream</span></a> keeps a reference to every object written, until <span style="font-family: 'Courier New', Courier, monospace;">close()</span> is called, so that repeated references to the same object can be conflated.
Java Serialisation requires 340 bytes for this example object, yet we only require 185 bytes for the binary versions. Details for the Java Serialization format can be found <a href="http://docs.oracle.com/javase/6/docs/platform/serialization/spec/protocol.html">here</a>. If I had not used arrays for the majority of data, then the serialised object would have been significantly larger with Java Serialization because of the field names. In my experience text based protocols like XML and JSON can be even less efficient than Java Serialization. Also be aware that Java Serialization is the standard mechanism employed for <a href="http://www.oracle.com/technetwork/java/javase/tech/index-jsp-136424.html">RMI</a>.<br />
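<br />
If you want to verify the encoded sizes for yourself, a check along the following lines will print them. This is just a sketch that reuses the <span style="font-family: 'Courier New', Courier, monospace;">ObjectToBeSerialised</span> class and <span style="font-family: 'Courier New', Courier, monospace;">ITEM</span> instance from the listing above:<br />
<pre>
    // Compare the encoded sizes of the two approaches (throws IOException)
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    ObjectOutputStream oos = new ObjectOutputStream(baos);
    oos.writeObject(ITEM);
    oos.close();
    System.out.println("Java Serialization: " + baos.size() + " bytes");

    ByteBuffer byteBuffer = ByteBuffer.allocate(1024);
    ITEM.write(byteBuffer);
    System.out.println("Binary encoding:    " + byteBuffer.position() + " bytes");
</pre>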
<br />
The real issue is the number of instructions to be executed. The Unsafe method wins by a significant margin because in Hotspot, and many other JVMs, the optimiser treats these operations as intrinsics and replaces the call with assembly instructions to perform the memory manipulation. For primitive types this results in a single x86 <a href="http://en.wikipedia.org/wiki/MOV_(x86_instruction)"><span style="font-family: 'Courier New', Courier, monospace;">MOV</span></a> instruction which can often happen in a single cycle. The details can be seen by having Hotspot output the optimised code as I described in a previous <a href="http://mechanical-sympathy.blogspot.co.uk/2012/04/invoke-interface-optimisations.html">article</a>.<br />
<br />
Now it has to be said that "<i><b>with great power comes great responsibility</b></i>" and if you use <span style="font-family: 'Courier New', Courier, monospace;">Unsafe</span> it is effectively the same as programming in C, and with that can come memory access violations when you get offsets wrong.<br />
<br />
<span style="font-family: Arial, Helvetica, sans-serif; font-size: large;"><b>Adding Some Context</b></span><br />
<br />
"What about the likes of <a href="https://developers.google.com/protocol-buffers/">Google Protocol Buffers</a>?", I hear you cry out. These are very useful libraries and can often offer better performance and more flexibility than Java Serialisation. However they are not remotely close to the performance of using <span style="font-family: 'Courier New', Courier, monospace;">Unsafe</span> like I have shown here. Protocol Buffers solve a different problem and provide nice self-describing messages which work well across languages. Please test with different protocols and serialisation techniques to compare results.<br />
<br />
Also the astute among you will be asking, "What about <a href="http://en.wikipedia.org/wiki/Endianness">Endianness</a> (byte-ordering) of the integers written?" With <span style="font-family: 'Courier New', Courier, monospace;">Unsafe</span> the bytes are written in native order. This is great for IPC and between systems of the same type. When systems use differing formats then conversion will be necessary.<br />
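<br />
If a portable byte order is required then the technique still applies. For example, <span style="font-family: 'Courier New', Courier, monospace;">UnsafeMemory</span> could be extended with order-aware methods along the following lines. This is an illustrative sketch, not part of the listing above, assuming a big-endian wire format:<br />
<pre>
    // Hypothetical addition to UnsafeMemory (requires java.nio.ByteOrder):
    // write an int in big-endian wire order regardless of the native order
    public void putIntBigEndian(final int value)
    {
        final int ordered = (ByteOrder.nativeOrder() == ByteOrder.BIG_ENDIAN) ?
            value : Integer.reverseBytes(value);

        unsafe.putInt(buffer, byteArrayOffset + pos, ordered);
        pos += SIZE_OF_INT;
    }
</pre>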
<br />
How do we deal with multiple versions of a class, or determine what class an object belongs to? I want to keep this article focused, but let's say a simple integer header indicating the implementation class is all that is required. This integer can be used to look up the appropriate implementation for the de-serialisation operation.<br />
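<br />
A sketch of that dispatch might look as follows; the class id values and method name are hypothetical:<br />
<pre>
    // The writer puts an int class id before the object fields;
    // the reader dispatches on that id
    public static Object readMessage(final UnsafeMemory buffer)
    {
        final int classId = buffer.getInt();
        switch (classId)
        {
            case 1:
                return ObjectToBeSerialised.read(buffer);
            default:
                throw new IllegalStateException("Unknown class id: " + classId);
        }
    }
</pre>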
<br />
An argument I often hear against binary protocols, and for text protocols, is that the messages need to be human readable for debugging. There is an easy solution to this: develop a tool for reading the binary format!<br />
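<br />
Even a trivial utility goes a long way here. The following sketch is enough to eyeball a binary encoded message as hex:<br />
<pre>
    // Minimal hex dump for inspecting an encoded message, 16 bytes per line
    public static void hexDump(final byte[] buffer, final int length)
    {
        for (int i = 0; i < length; i++)
        {
            System.out.printf("%02X%s", buffer[i], ((i & 15) == 15) ? "\n" : " ");
        }
        System.out.println();
    }
</pre>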
<br />
<span style="font-family: Arial, Helvetica, sans-serif; font-size: large;"><b>Conclusion</b></span><br />
<br />
It is possible to achieve native C/C++-like levels of performance in Java for serialising an object to-and-from a byte stream by effectively using the same techniques. The <span style="font-family: 'Courier New', Courier, monospace;">UnsafeMemory</span> class, for which I've provided a skeleton implementation, could easily be expanded to encapsulate this behaviour and thus protect oneself from many of the potential issues when dealing with such a sharp tool.<br />
<br />
Now for the burning question: would it not be so much better if Java offered an alternative <span style="font-family: 'Courier New', Courier, monospace;">Marshallable</span> interface to
<a href="http://docs.oracle.com/javase/6/docs/api/java/io/Serializable.html"><span style="font-family: 'Courier New', Courier, monospace;">Serializable</span></a>, natively offering what I've effectively done here with <span style="font-family: 'Courier New', Courier, monospace;">Unsafe</span>?</div>tag:blogger.com,1999:blog-5560209661389175529.post-45769558378316645442012-05-19T22:17:00.001+01:002014-01-08T14:05:44.308+00:00Applying Back Pressure When Overloaded<div dir="ltr" style="text-align: left;" trbidi="on">
How should a system respond when under sustained load? Should it keep accepting requests until its response times follow the deadly hockey stick, followed by a crash? All too often this is what happens unless a system is designed to cope with the case of more requests arriving than it is capable of processing. If we are seeing a sustained arrival rate of requests greater than our system is capable of processing, then something has to give. Having the entire system degrade is not the ideal service we want to give our customers. A better approach would be to process transactions at our system's maximum possible throughput rate, while maintaining a good response time, and to reject requests above this arrival rate.<br />
<br />
Let’s consider a small art gallery as a metaphor. In this gallery the typical viewer spends on average 20 minutes browsing, and the gallery can hold a maximum of 30 viewers. If more than 30 viewers occupy the gallery at the same time then customers become unhappy because they cannot have a clear view of the paintings. If this happens they are unlikely to purchase or return. To keep our viewers happy it is better to recommend that some viewers visit the café a few doors down and come back when the gallery is less busy. This way the viewers in the gallery get to see all the paintings without other viewers in the way, and in the meantime those we cannot accommodate enjoy a coffee. If we apply <a href="http://en.wikipedia.org/wiki/Little%27s_law">Little’s Law</a> we cannot have customers arriving at more than 90 per hour, otherwise the maximum capacity is exceeded. If between 9:00-10:00 they are arriving at 100 per hour, then I’m sure the café down the road will appreciate the extra 10 customers.<br />
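<br />
(To see where the 90 comes from, apply Little’s Law directly: occupancy = arrival rate × average time in system, so 30 viewers = arrival rate × ⅓ hour, giving a maximum sustainable arrival rate of 90 viewers per hour.)<br />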
<br />
Within our systems the available capacity is generally a function of the size of our thread pools and time to process individual transactions. These thread pools are usually fronted by queues to handle bursts of traffic above our maximum arrival rate. If the queues are unbounded, and we have a sustained arrival rate above the maximum capacity, then the queues will grow unchecked. As the queues grow they increasingly add latency beyond acceptable response times, and eventually they will consume all memory causing our systems to fail. Would it not be better to send the overflow of requests to the café while still serving everyone else at the maximum possible rate? We can do this by designing our systems to apply “Back Pressure”.
<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj1YONiUjsgSY74tF2WuO34H9uv5xLbVGB1OLHD54cVbwwJWEbk4eOHj_R1vLO_mIXub4Tm6-LDf7P3ISJ6pavkHsOrKzMOvB4zieqZl9Dfzyjc1uFy8Cs-fSmIA47FGWaT1-AxO4kG3qc/s1600/back-pressure.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj1YONiUjsgSY74tF2WuO34H9uv5xLbVGB1OLHD54cVbwwJWEbk4eOHj_R1vLO_mIXub4Tm6-LDf7P3ISJ6pavkHsOrKzMOvB4zieqZl9Dfzyjc1uFy8Cs-fSmIA47FGWaT1-AxO4kG3qc/s1600/back-pressure.png" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Figure 1.</td></tr>
</tbody></table>
<br />
<a href="http://en.wikipedia.org/wiki/Separation_of_concerns">Separation of concerns</a> encourages good systems design at all levels. I like to layer a design so that the gateways to third parties are separated from the main transaction services. This can be achieved by having gateways responsible for protocol translation and border security only. A typical gateway could be a web container running <a href="http://jcp.org/en/jsr/detail?id=315">Servlets</a>. Gateways accept customer requests, apply appropriate security, and translate the channel protocols for forwarding to the transaction service hosting the <a href="http://martinfowler.com/eaaCatalog/domainModel.html">domain model</a>. The transaction service may use a durable store if transactions need to be preserved. For example, the state of a chat server domain model may not require preservation, whereas a model for financial transactions must be kept for many years for compliance and business reasons.<br />
<br />
Figure 1. above is a simplified view of the typical request flow in many systems. Pools of threads in a gateway accept user requests and forward them to a transaction service. Let’s assume we have asynchronous transaction services fronted by input and output queues, or similar <a href="http://en.wikipedia.org/wiki/FIFO">FIFO</a> structures. If we want the system to meet a response time quality-of-service (QoS) guarantee, then we need to consider the following three variables:<br />
<ol style="text-align: left;">
<li>The time taken for individual transactions on a thread</li>
<li>The number of threads in a pool that can execute transactions in parallel</li>
<li>The length of the input queue to set the maximum acceptable latency </li>
</ol>
<div style="font-family: "Courier New",Courier,monospace;">
<b> max latency = (transaction time / number of threads) * queue length</b></div>
<div style="font-family: "Courier New",Courier,monospace;">
<b> queue length = max latency / (transaction time / number of threads) </b></div>
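<br />
For example, if a transaction takes 5ms on a thread and the pool has 10 threads executing in parallel, then the pool completes a transaction every 0.5ms on average; bounding the input queue at 200 entries therefore caps the queueing latency at 200 * 0.5ms = 100ms.<br />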
<br />
By allowing the queue to be unbounded the latency will continue to increase. So if we want to set a maximum response time then we need to limit the queue length.<br />
<br />
By bounding the input queue we block the thread receiving network packets, which applies back pressure upstream. If the network protocol is TCP, similar back pressure is applied to the sender via the filling of network buffers. This process can repeat all the way back via the gateway to the customer. For each service we need to configure the queues so that they do their part in achieving the required quality-of-service for the end-to-end customer experience.<br />
<br />
One of the biggest wins I often find is to reduce the time taken to process individual transactions. This helps in both the best and worst case scenarios.<br />
<br />
<b><span style="font-size: large;">Worst Case Scenario</span></b><br />
<br />
Let’s say the queue is unbounded and the system is under sustained heavy load. Things can begin to go wrong very quickly in subtle ways before memory is exhausted. What do you think will happen when the queue grows larger than the processor cache? The consumer threads will be suffering cache misses just at the time when they are struggling to keep up, thus compounding the problem. This can cause a system to get into trouble very quickly and eventually crash. Under Linux this is particularly nasty: <a href="http://en.wikipedia.org/wiki/C_dynamic_memory_allocation">malloc</a>, or one of its friends, will succeed because Linux allows “<a href="http://www.win.tue.nl/~aeb/linux/lk/lk-9.html#ss9.6">Over Commit</a>” by default, then later, at the point of actually using that memory, the <a href="http://lwn.net/Articles/317814/">OOM Killer</a> will start shooting processes. When the OS starts shooting processes, you just know things are not going to end well!<br />
<br />
<b><span style="font-size: large;">What About Synchronous Designs?</span></b><br />
<br />
You may say that with synchronous designs there are no queues. Well, not such obvious ones. If you have a thread pool then it will have a lock, or semaphore, with wait queues for assigning threads. If you are crazy enough to allocate a new thread on every request, then once you are over the huge cost of thread creation, your thread sits in the run queue waiting for a processor to execute it. Also, these queues involve context switches and condition variables which greatly increase the <a href="http://mechanical-sympathy.blogspot.co.uk/2011/11/locks-condition-variables-latency.html">costs</a>. You just cannot run away from queues; they are everywhere! Best to embrace them and design for the quality-of-service your system needs to deliver to its customers. If we must have queues, then design for them, and maybe choose some nice lock-free ones with great performance.<br />
<br />
When we need to support synchronous protocols like REST, we can use back pressure, signalled by our full incoming queue at the gateway, to send a meaningful “server busy” message such as the HTTP 503 status code. The customer can then interpret this as time for a coffee and cake at the café down the road.<br />
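<br />
As a minimal sketch of this idea, assuming a simple gateway design (the class and names below are illustrative, not from any particular framework), a bounded queue and a non-blocking <span style="font-family: 'Courier New', Courier, monospace;">offer()</span> are all that is needed to know when to signal busy:<br />
<pre>
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public final class BoundedGateway
{
    private static final int QUEUE_LENGTH = 200; // derived from the latency formula above

    private final BlockingQueue<Runnable> inputQueue =
        new ArrayBlockingQueue<Runnable>(QUEUE_LENGTH);

    // Called by gateway threads: returns false when the queue is full, at
    // which point the caller should respond with HTTP 503 "server busy"
    public boolean tryAccept(final Runnable transaction)
    {
        return inputQueue.offer(transaction);
    }

    // Called by the transaction service worker threads to drain the queue
    // at the system's maximum throughput
    public Runnable nextTransaction() throws InterruptedException
    {
        return inputQueue.take();
    }
}
</pre>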
<br />
<b><span style="font-size: large;">Subtleties To Watch Out For...</span></b><br />
<br />
You need to consider the whole end-to-end service. What if a client is very slow at consuming data from your system? It could tie up a thread in the gateway, taking it out of action. Now you have fewer threads working the queue, so response times will increase. Queues and threads need to be monitored, and appropriate action needs to be taken when thresholds are crossed. For example, when a queue is 70% full, maybe an alert should be raised so an investigation can take place. Also, transaction times need to be sampled to ensure they are in the expected range.<br />
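<br />
Continuing the illustrative sketch from above, a monitoring thread might periodically sample the queue depth against such a threshold (again, purely a sketch):<br />
<pre>
    // A hypothetical addition to the BoundedGateway sketch above: sample
    // the queue depth and alert when it crosses 70% of capacity
    public void checkQueueDepth()
    {
        final int depth = inputQueue.size();
        if (depth * 100 > QUEUE_LENGTH * 70)
        {
            System.err.printf("WARN: input queue at %d/%d entries%n",
                              depth, QUEUE_LENGTH);
        }
    }
</pre>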
<br />
<b><span style="font-size: large;">Summary</span></b><br />
<br />
If we do not consider how our systems will behave when under heavy load then they will most likely seriously degrade at best, and at worst crash. When they crash this way, we get to find out if there are any really evil data corruption bugs lurking in those dark places. Applying back pressure is one effective technique for coping with sustained high-load, such that maximum throughput can be delivered without degrading system performance for the already accepted requests and transactions.</div>
tag:blogger.com,1999:blog-5560209661389175529.post-76193826889330762952012-04-29T11:22:00.001+01:002022-08-17T11:36:28.822+01:00Invoke Interface Optimisations<div dir="ltr" style="text-align: left;" trbidi="on">
I'm often asked about the performance differences between Java, C, and C++, and which is better. As with most things in life there is no black and white answer. Much is made of how languages based on a managed runtime offer less performance than their statically compiled counterparts. There are however a few tricks available to managed runtimes that can provide optimisation opportunities not available to statically optimised languages.<br />
<br />
One such optimisation available to the runtime is to dynamically inline a method at the call site. Many would say inlining is *the* major optimisation of dynamic languages. This is an approach whereby the function/method call overhead can be avoided and further optimisations enabled. Inlining can easily be done at compile, or run, time for <span style="font-family: 'Courier New',Courier,monospace; font-size: small;">static</span> or <span style="font-family: 'Courier New',Courier,monospace; font-size: small;">private</span> methods of a class because they cannot be overridden. It can also be done by Hotspot at run time which is way more interesting. In bytecode the runtime will see <span style="font-family: 'Courier New',Courier,monospace; font-size: small;">invokestatic</span><span style="font-size: small;"> </span>and <span style="font-family: 'Courier New',Courier,monospace; font-size: small;">invokespecial</span><span style="font-size: small;"> </span>opcodes for <span style="font-family: 'Courier New',Courier,monospace; font-size: small;">static</span> and <span style="font-family: 'Courier New',Courier,monospace; font-size: small;">private</span> methods respectively. Methods that involve late binding, such as interface implementations and method overriding, appear as the <span style="font-family: 'Courier New',Courier,monospace; font-size: small;">invokeinterface</span><span style="font-size: small;"> </span>and <span style="font-family: 'Courier New',Courier,monospace; font-size: small;">invokevirtual</span> opcodes respectively.<br />
<br />
At compile time it is not possible to determine how many implementations there will be for an interface, or how many classes will override a base method. The compiler can have some awareness but just how do you deal with dynamically loaded classes via <span style="font-family: 'Courier New',Courier,monospace; font-size: small;">Class.forName("x").newInstance()</span>? The Hotspot runtime is very smart. It can track all classes as they are loaded and apply appropriate optimisations to give the best possible performance for our code. One such approach is dynamic inlining at the call site which we will explore.<br />
<br />
<h3 style="text-align: left;">
Code</h3>
<pre>
public interface Operation
{
    int map(int value);
}

public class IncOperation implements Operation
{
    public int map(final int value)
    {
        return value + 1;
    }
}

public class DecOperation implements Operation
{
    public int map(final int value)
    {
        return value - 1;
    }
}

public class StepIncOperation implements Operation
{
    public int map(final int value)
    {
        return value + 7;
    }
}

public class StepDecOperation implements Operation
{
    public int map(final int value)
    {
        return value - 3;
    }
}

public final class OperationPerfTest
{
    private static final int ITERATIONS = 50 * 1000 * 1000;

    public static void main(final String[] args)
        throws Exception
    {
        final Operation[] operations = new Operation[4];
        int index = 0;
        operations[index++] = new StepIncOperation();
        operations[index++] = new StepDecOperation();
        operations[index++] = new IncOperation();
        operations[index++] = new DecOperation();

        int value = 777;
        for (int i = 0; i < 3; i++)
        {
            System.out.println("*** Run each method in turn: loop " + i);
            for (final Operation operation : operations)
            {
                System.out.println(operation.getClass().getName());
                value = runTests(operation, value);
            }
        }

        System.out.println("value = " + value);
    }

    private static int runTests(final Operation operation, int value)
    {
        for (int i = 0; i < 10; i++)
        {
            final long start = System.nanoTime();

            value += opRun(operation, value);

            final long duration = System.nanoTime() - start;
            final long opsPerSec =
                (ITERATIONS * 1000L * 1000L * 1000L) / duration;
            System.out.printf(" %,d ops/sec\n", opsPerSec);
        }

        return value;
    }

    private static int opRun(final Operation operation, int value)
    {
        for (int i = 0; i < ITERATIONS; i++)
        {
            value += operation.map(value);
        }

        return value;
    }
}
</pre>
<h3 style="text-align: left;">
Results</h3>
<br />
The following results are from running on a Linux 3.3.2 kernel with the Oracle 1.7.0_02 server JVM on an Intel Sandy Bridge 2.4GHz processor.<br />
<br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;">*** Run each method in turn: loop 0</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;">StepIncOperation</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 2,256,816,714 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 2,245,800,936 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 3,161,643,847 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 3,100,375,269 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 3,144,364,173 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 3,091,009,138 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 3,089,241,641 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 3,153,922,056 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 3,147,331,497 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 3,076,211,099 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;">StepDecOperation</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 623,131,120 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 659,686,236 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 1,029,231,089 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 1,021,060,933 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 999,287,607 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 1,015,432,172 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 1,023,581,307 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 1,019,266,750 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 1,022,726,580 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 1,004,237,016 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;">IncOperation</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 301,419,319 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 304,712,250 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 307,269,912 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 308,519,923 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 307,372,436 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 306,230,247 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 307,964,022 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 306,243,292 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 308,689,942 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 365,152,716 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;">DecOperation</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 236,804,700 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 237,912,786 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 238,672,489 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 278,745,901 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 278,169,934 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 277,979,158 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 276,620,509 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 278,349,766 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 276,159,225 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 278,578,373 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;">*** Run each method in turn: loop 1</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;">StepIncOperation</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 276,054,944 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 276,683,805 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 276,551,970 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 279,861,144 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 275,543,192 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 278,451,092 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 275,399,262 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 277,340,411 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 274,529,616 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 277,091,930 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;">StepDecOperation</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 279,729,066 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 279,812,269 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 276,478,587 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 277,660,649 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 276,844,441 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 278,684,313 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 277,791,665 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 277,617,484 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 278,575,241 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 278,228,274 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;">IncOperation</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 277,724,770 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 278,234,042 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 276,798,434 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 277,926,962 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 277,786,824 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 278,739,590 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 275,286,293 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 279,062,831 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 276,672,019 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 277,248,956 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;">DecOperation</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 277,303,150 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 277,746,139 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 276,245,511 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 278,559,202 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 274,683,406 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 279,280,730 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 276,174,620 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 276,374,159 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 275,943,446 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 277,765,688 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;">*** Run each method in turn: loop 2</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;">StepIncOperation</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 278,405,907 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 278,713,953 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 276,841,096 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 277,891,660 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 275,716,314 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 277,474,242 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 277,715,270 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 277,857,014 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 275,956,486 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 277,675,378 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;">StepDecOperation</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 277,273,039 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 278,101,972 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 275,694,572 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 276,312,449 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 275,964,418 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 278,423,621 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 276,498,569 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 276,593,475 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 276,238,451 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 277,057,568 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;">IncOperation</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 275,700,451 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 277,463,507 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 275,886,477 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 277,546,096 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 275,019,816 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 278,242,287 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 277,317,964 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 277,252,014 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 276,893,038 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 277,601,325 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;">DecOperation</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 275,580,894 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 280,146,646 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 276,901,134 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 276,672,567 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 276,879,422 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 278,674,196 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 275,606,174 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 278,132,534 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 275,858,358 ops/sec</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 279,444,112 ops/sec</span><br />
<br />
<h3 style="text-align: left;">
What is going on here?</h3>
<br />
On the first iteration over the list of operations we see the performance degrade from ~3bn operations per second down to ~275m operations per second. This happens in a step function with each new implementation loaded. On the second, and subsequent, iteration over the array of operations, performance stabilised at ~275m operations per second. What we are seeing here is how Hotspot can optimise when we have a limited number of implementations for an interface, and how it has to fall back to late bound method calls when many implementations are possible from a given call site.<br />
<br />
If we run the JVM with <span style="font-family: 'Courier New',Courier,monospace; font-size: small;">-XX:+PrintCompilation</span> we can see Hotspot choosing to compile the methods then de-optimise existing optimisations as new implementations get loaded.<br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"></span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 52 1 java.lang.String::hashCode (67 bytes)</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 54 2 StepIncOperation::map (5 bytes)</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 55 1 % OperationPerfTest::opRun @ 2 (26 bytes)</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 76 3 OperationPerfTest::opRun (26 bytes)</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 223 3 OperationPerfTest::opRun (26 bytes) made not entrant</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 223 1 % OperationPerfTest::opRun @ -2 (26 bytes) made not entrant</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 224 2 % OperationPerfTest::opRun @ 2 (26 bytes)</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 224 4 StepDecOperation::map (4 bytes)</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 306 5 OperationPerfTest::opRun (26 bytes)</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 772 2 % OperationPerfTest::opRun @ -2 (26 bytes) made not entrant</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 772 3 % OperationPerfTest::opRun @ 2 (26 bytes)</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 773 6 IncOperation::map (4 bytes)</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 930 5 OperationPerfTest::opRun (26 bytes) made not entrant</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 1995 7 OperationPerfTest::opRun (26 bytes)</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 2293 8 DecOperation::map (4 bytes)</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 11339 9 java.lang.String::indexOf (87 bytes)</span><br />
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"> 15017 10 java.lang.String::charAt (33 bytes)</span><br />
<div>
<span style="font-family: 'Courier New',Courier,monospace; font-size: x-small;"><br /></span></div>
<div>
The output above shows the decisions made by Hotspot as it compiles code. When the third column contains the symbol "%" it is performing <a href="http://mechanical-sympathy.blogspot.co.uk/2011/11/biased-locking-osr-and-benchmarking-fun.html">OSR</a> (On Stack Replacement) of the method. This is followed by the method being "made not entrant" 4 times as it is de-optimised when Hotspot discovers new implementations: 3 times for the newly discovered classes, and once to remove the OSR version and replace it with a normal, non-OSR, JIT'ed version when the final implementation is settled on. Even greater detail can be seen by replacing <span style="font-family: 'Courier New',Courier,monospace; font-size: small;">-XX:+PrintCompilation</span> with <span style="font-family: 'Courier New',Courier,monospace; font-size: small;">-XX:+UnlockDiagnosticVMOptions -XX:+LogCompilation</span>.</div>
<br />
For the monomorphic single-implementation case, Hotspot can simply inline the method and place a trap in the code to fire if future implementations are loaded. This gives performance very similar to having no function call overhead at all. For the bimorphic case of two implementations, Hotspot can inline both methods and select the appropriate one based on a branch condition. Beyond this, things get tricky and jump tables are required to resolve the method at runtime, making the call site polymorphic or megamorphic. The generated assembly code can be viewed with the <span style="font-family: 'Courier New',Courier,monospace; font-size: small;">-XX:+UnlockDiagnosticVMOptions -XX:CompileCommand=print,OperationPerfTest.opRun</span> JVM options on Java 7. The output shows the steps in compilation whereby not only is the method inlining deoptimised, Hotspot also no longer does loop unrolling for this method.<br />
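<br />
Conceptually, the bimorphic case compiles down to something like the following Java pseudo-equivalent of the guarded inlining Hotspot performs (illustrative only; the real work happens in the generated machine code):<br />
<pre>
    // What guarded inlining amounts to when exactly two implementations
    // of Operation have been loaded
    private static int mapBimorphic(final Operation operation, final int value)
    {
        final Class<?> clazz = operation.getClass();

        if (clazz == StepIncOperation.class)
        {
            return value + 7;            // inlined StepIncOperation.map()
        }
        else if (clazz == StepDecOperation.class)
        {
            return value - 3;            // inlined StepDecOperation.map()
        }
        else
        {
            // uncommon trap: deoptimise and fall back to full dispatch
            return operation.map(value);
        }
    }
</pre>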
<br />
<h3 style="text-align: left;">
Conclusions</h3>
<br />
We can see that if an interface method has only one or two implementations then Hotspot can dynamically inline the method avoiding the function call overhead. This would only be possible with <a href="http://en.wikipedia.org/wiki/Profile-guided_optimization">profile guided optimisation</a> for a language like C or C++. We can also see that method calls are relatively cheap on a modern JVM, in the order of 12 cycles, even when we cannot avoid them. It should be noted that the cost of method calls goes up by a few cycles for each additional argument passed.<br />
<br />
In addition, I have observed that when a class implements multiple interfaces, with multiple methods, performance can degrade significantly because method dispatch involves a linear search of the method list to find the right implementation. Overridden methods from a base class do not involve this linear search but still require the jump-table dispatch. All the more reason to keep classes and interfaces simple.</div>tag:blogger.com,1999:blog-5560209661389175529.post-15577218354202166592012-03-22T17:55:00.001+00:002012-04-16T13:30:35.411+01:00Fun with my-Channels Nirvana and Azul Zing<div dir="ltr" style="text-align: left;" trbidi="on">
Since leaving LMAX I have been neglecting my blog a bit. This is not because I have not been doing anything interesting. Quite the opposite really, things have been so busy the blog has taken a back seat. I’ve been consulting for a number of hedge funds and product companies, most of which are super secretive.<br />
<br />
One company I have been spending quite a bit of time with is <a href="http://www.my-channels.com/">my-Channels</a>, a messaging provider. They are really cool and have given me their blessing to blog about some of the interesting things I’ve been working on for them.<br />
<br />
For context, my-Channels are a messaging provider that specialise in delivering data to every device known to man over dodgy networks such as the Internet or your corporate WAN. They can deliver live financial market data to your desktop, laptop at home, or your iPhone, at the fastest possible rates. Lately, they have made the strategic move to enter the low-latency messaging space for the enterprise, and as part of this they have enlisted my services. They want to go low-latency without giving up the rich functionality their product offers which is giving me some interesting challenges.<br />
<br />
Just how bad is the latency of such a product when new to the low-latency space? I did not have high expectations because to be fair this was never their goal. After some initial tests, I’m thinking these guys are not in bad shape. They beat the crap out of most JMS implementations and it is going to be fun pushing them to the serious end of the low-latency space. <br />
<br />
OK, enough of the basic tests; now it is time to get serious. I worked with them to create appropriate load tests and get the profilers running. No big surprises here: when we piled on the pressure, lock contention came out as the biggest culprit limiting both latency and throughput. As we went down the list, lots of other interesting things showed up, but let’s follow good discipline and start at the top of the list.<br />
<br />
Good discipline for “<a href="http://en.wikipedia.org/wiki/Theory_of_constraints">Theory of Constraints</a>” states that you always work on the most limiting factor, because when it is removed the list below it can change radically as new pressures are applied. So to address this contention issue we developed a new lock-free <a href="http://docs.oracle.com/javase/1.5.0/docs/api/java/util/concurrent/Executor.html" style="font-family: 'Courier New',Courier,monospace;">Executor</a> to replace the standard Java implementation. Tests showed this new executor is ~10X better than what the JDK has to offer. We integrated the new Executor into the code base, and the throughput bottleneck moved completely. The system can now cope with 16X more throughput, and the latency histogram has become much more compressed. This is a good example of how macro-benchmarking is so much more valuable than micro-benchmarking. Not a bad start, we are all thinking.<br />
<br />
<b>Enter Azul Stage Left</b><br />
<br />
We tested on all the major JVMs and the most predictable latency was achieved with <a href="http://www.azulsystems.com/products/zing/whatisit">Azul Zing</a>. Zing had by far the best latency profile with virtually no long tail. For many of the tests it also had the greatest throughput.<br />
<br />
After the lock contention issue on the Executor had been resolved, the next big bottleneck when load testing on the same machine was the use of TCP between processes over the loopback adapter. We discussed developing a new Nirvana transport that was not network based. For this we decided to apply a number of the techniques I teach on my lock-free concurrency course. This resulted in a new <a href="http://en.wikipedia.org/wiki/Inter-process_communication">IPC</a> transport based on shared memory via memory-mapped files in Java. We did inter-server testing using 10GigE networks, and had fun using the new <a href="http://www.solarflare.com/">Solarflare</a> network adapters with <a href="http://www.openonload.org/">OpenOnload</a>, but for this article I’ll stick with the Java story. I think Paul is still sore from me stuffing his little Draytek ADSL router with huge amounts of multicast traffic when the poor thing was connected to our 10GigE test LAN. Sorry Paul!<br />
<br />
Developing the IPC transport unearthed a number of challenges with various JVM implementations of <a href="http://docs.oracle.com/javase/6/docs/api/java/nio/MappedByteBuffer.html" style="font-family: "Courier New",Courier,monospace;">MappedByteBuffer</a>. After some very useful chats with Cliff Click and Doug Lea we came up with a solution that worked across all JVMs. This solution has a mean latency of ~100ns on the best JVMs and can sustain ~12-22 million messages per second for 60-byte messages depending on the JVM. This was the first time we had found a test whereby Azul was not close to being the fastest. I isolated a test case and sent it to them on a Friday. On Sunday evening I got an email from Gil Tene saying he had identified the issue, and by Tuesday Cliff Click had a fix that we tried the next week. When we tested the new Azul JVM, we saw over 40 million messages per second at latencies just over 100ns for our new IPC transport. I had been teasing Azul that this must be possible in Java because I’d created similar algorithms in C and assembler that show what the x86_64 platform is capable of.<br />
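<br />
To show the mechanics under discussion, and assuming nothing about the actual Nirvana transport, here is a toy single-writer/single-reader exchange over a memory-mapped file. The file name and message format are made up, and the plain puts and gets carry no memory-ordering guarantees under the Java memory model, which is exactly the class of subtlety the real implementation had to solve:<br />
<pre>
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Toy IPC sketch: two processes map the same file; the writer stores a
// payload followed by a sequence number, and the reader spins on the
// sequence. Lossy, and not correctly ordered by JMM rules; illustration only.
// Usage: java MappedIpcSketch write|read
public final class MappedIpcSketch
{
    private static final int SEQUENCE_OFFSET = 0;
    private static final int PAYLOAD_OFFSET = 8;

    public static void main(final String[] args) throws Exception
    {
        final boolean isWriter = "write".equals(args[0]);
        final FileChannel channel =
            new RandomAccessFile("ipc.dat", "rw").getChannel();
        final MappedByteBuffer buffer =
            channel.map(FileChannel.MapMode.READ_WRITE, 0, 4096);

        if (isWriter)
        {
            for (long sequence = 1; sequence <= 10; sequence++)
            {
                buffer.putLong(PAYLOAD_OFFSET, sequence * 100); // payload first
                buffer.putLong(SEQUENCE_OFFSET, sequence);      // then publish
            }
        }
        else
        {
            long lastSeen = 0;
            while (lastSeen < 10)
            {
                final long sequence = buffer.getLong(SEQUENCE_OFFSET);
                if (sequence != lastSeen) // busy spin until publication
                {
                    lastSeen = sequence;
                    System.out.println("msg " + sequence + " payload " +
                        buffer.getLong(PAYLOAD_OFFSET));
                }
            }
        }

        channel.close();
    }
}
</pre>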
<br />
I’m starting to ramble but we had great fun removing latency through many parts of the stack. When I get more time I will blog about some of the other findings. It is still a work in progress, with advances being made daily on an amazing scale. The guys at my-Channels are very conservative and do not want to publish actual figures until they have version 7.0 of Nirvana ready for GA, and have done more comprehensive testing. For now they are happy with me being open about the following:<br />
<ul style="text-align: left;">
<li>Throughput increased 32X due to applying lock-free techniques and optimising the call stack for message handling to remove any shared dependencies.</li>
<li>Average latency decreased 20X from applying the same techniques, and we have identified many more possible improvements.</li>
<li>We know the raw transport for IPC is now ~100ns and the worst case pause due to GC is 80µs with Azul Zing. As to the latency for the double hop between a producer and consumer over IPC, via their broker, I’ll leave it to your imagination as somewhere between those figures until the guys are willing to make an official announcement. As you can guess it is much, much less than 80µs.</li>
</ul>
<div style="text-align: left;">
For me the big surprise was GC pauses only taking 80µs in the worst case. I have seen OS scheduling alone result in more jitter. I discussed this at length with Gil Tene from Azul, and even he was surprised. He expects some worst case scenarios with their JVM to be 1-2ms for a well behaved application. We then explored the my-Channels setup, and it turns out we had done almost everything right to get the best out of a JVM, which is worth sharing.</div>
<ol style="text-align: left;">
<li>Do not use locks in the main transaction flow because they cause context switches, and therefore latency and unpredictable jitter.</li>
<li>Never have more threads that need to run than you have cores available.</li>
<li>Set affinity of threads to cores, or at least sockets, to avoid cache pollution by avoiding migration. This is particularly important when on a server class machine having multiple sockets because of the <a href="http://en.wikipedia.org/wiki/Non-Uniform_Memory_Access">NUMA</a> effect.</li>
<li>Ensure uncontested access to any resource respecting the <a href="http://mechanical-sympathy.blogspot.co.uk/2011/09/single-writer-principle.html">Single Writer Principle</a> so that the likes of <a href="http://mechanical-sympathy.blogspot.co.uk/2011/11/biased-locking-osr-and-benchmarking-fun.html">biased locking</a> can be your friend.</li>
<li>Keep call stacks reasonably small. Still more work to do here. If you are crazy enough to use Spring, then check out your call stacks to see what I mean! The garbage collector has to walk them to find reachable objects.</li>
<li>Do not use finalizers.</li>
<li>Keep garbage generation to modest levels. This applies to most JVMs but is likely not an issue for Zing.</li>
<li>Ensure no disk IO on the main flow.</li>
<li>Do a proper warm-up before beginning to measure (see the sketch after this list).</li>
<li>Do all the appropriate OS tunings for low-latency systems that are way beyond this blog. For example turn off <a href="http://en.wikipedia.org/wiki/Advanced_Configuration_and_Power_Interface">C-States</a> power management in the BIOS and watch out for RHEL 6 as it turns it back on without telling you!</li>
</ol>
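As a minimal sketch of the warm-up point above, with iteration counts that are illustrative rather than tuned, the pattern is to exercise the hot path in several short runs before the timed run:<br />
<pre>
// Exercise the hot path enough for the JIT to compile it before measuring.
public final class WarmUpSketch
{
    private static long blackhole; // consumed so the JIT cannot eliminate the work

    public static void main(final String[] args)
    {
        for (int i = 0; i < 10; i++)
        {
            blackhole += runHotPath(20 * 1000); // short warm-up runs
        }

        final long start = System.nanoTime();
        blackhole += runHotPath(100 * 1000 * 1000); // the measured run
        final long duration = System.nanoTime() - start;

        System.out.println("duration ns: " + duration + " (" + blackhole + ")");
    }

    private static long runHotPath(final int iterations)
    {
        long result = 0;
        for (int i = 0; i < iterations; i++)
        {
            result += i; // stand-in for the real message-handling path
        }

        return result;
    }
}
</pre>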
<div style="text-align: left;">
It should be noted that we ran this on some state of the art Intel CPUs with very large L3 caches. It is possible to get 20-30MB L3 caches on a single socket these days. It is very likely that our entire application was running out of L3 cache with the exception of the message flow which is very predictable.<br />
<br />
Gil has added a cautionary note that while these results
are very impressive we had a team focused on this issue with the
appropriate skills to get the best out of the application. It is not
the usual case for every client to apply this level of focus.<br />
<br />
What I’ve taken from this experience is the amazing things that can be achieved by truly agile companies, staffed by talented individuals, who are empowered to make things happen. I love agile development but it has become a religion to some people who are more interested in following the “true” process than doing what is truly needed. Both my-Channels and Azul have shown during this engagement what is possible in making s*#t happen. It has been an absolute blast working with individuals who can assimilate information and ideas so fast, then turn them into working software. For this I will embarrass Matt Buckton at my-Channels, and Gil Tene & Cliff Click at Azul, who never failed to rise to a challenge. So few organisations could have made so much progress over such a short time period. If you think Java cannot cut it in the high performance space, then deal with one of these two companies, and you will think again. I bet a few months ago Matt never thought he’d be sitting in Singapore airport writing his first multi-producer lock-free queue when travelling home, and really enjoying it.</div>
</div>Martin Thompsonhttp://www.blogger.com/profile/15893849163924476586noreply@blogger.com57London, UK51.5081289 -0.12800551.350006900000004 -0.443862 51.6662509 0.187852tag:blogger.com,1999:blog-5560209661389175529.post-42086964840856693902011-12-26T19:47:00.002+00:002022-08-17T11:36:58.097+01:00Java Sequential IO Performance<div dir="ltr" style="text-align: left;" trbidi="on">
<div dir="ltr" style="text-align: left;" trbidi="on">
Many applications record a series of events to file-based storage for later use. This can be anything from logging and auditing, through to keeping a transaction redo log in an <a href="http://martinfowler.com/eaaDev/EventSourcing.html">event sourced</a> design or its close relative <a href="http://martinfowler.com/bliki/CQRS.html">CQRS</a>. <br />
<br />
Java has a number of means by which a file can be sequentially written to, or read back again. This article explores some of these mechanisms to understand their performance characteristics. For the scope of this article I will be using pre-allocated files because I want to focus on performance. Constantly extending a file imposes a significant performance overhead and adds jitter to an application, resulting in highly variable latency. "Why does a pre-allocated file give better performance?", I hear you ask. Well, on disk a file is made up of a series of blocks/pages containing the data. Firstly, it is important that these blocks are contiguous to provide fast sequential access. Secondly, meta-data must be allocated to describe this file on disk and saved within the file-system. A typical large file will have a number of "indirect" blocks, which form part of this meta-data, allocated to describe the chain of data-blocks containing the file contents. I'll leave it as an exercise for the reader, or maybe a later article, to explore the performance impact of not preallocating the data files. If you have used a database you may have noticed that it preallocates the files it will require.<br />
<br />
<span style="font-size: large;"><b>The Test</b></span><br />
<br />
I want to experiment with two file sizes: one that is sufficiently large to test sequential access but can easily fit in the file-system cache, and another that is much larger so that the cache subsystem is forced to retire pages before new ones can be loaded. For these two cases I'll use 400MB and 8GB respectively. I'll also loop over the files a number of times to show the pre- and post-warm-up characteristics.<br />
<br />
I'll test 4 means of writing and reading back files sequentially:<br />
<ol style="text-align: left;">
<li><a href="http://docs.oracle.com/javase/6/docs/api/java/io/RandomAccessFile.html"><span style="font-family: 'Courier New', Courier, monospace;">RandomAccessFile</span></a> using a vanilla <span style="font-family: 'Courier New', Courier, monospace;">byte[]</span> of page size.</li>
<li>Buffered <a href="http://docs.oracle.com/javase/6/docs/api/java/io/FileInputStream.html"><span style="font-family: 'Courier New', Courier, monospace;">FileInputStream</span></a> and <a href="http://docs.oracle.com/javase/6/docs/api/java/io/FileOutputStream.html"><span style="font-family: 'Courier New', Courier, monospace;">FileOutputStream</span></a>.</li>
<li>NIO <a href="http://docs.oracle.com/javase/6/docs/api/java/nio/channels/FileChannel.html"><span style="font-family: 'Courier New', Courier, monospace;">FileChannel</span></a> with <a href="http://docs.oracle.com/javase/6/docs/api/java/nio/ByteBuffer.html"><span style="font-family: 'Courier New', Courier, monospace;">ByteBuffer</span></a> of page size.</li>
<li>Memory mapping a file using NIO and direct <a href="http://docs.oracle.com/javase/6/docs/api/java/nio/MappedByteBuffer.html"><span style="font-family: 'Courier New', Courier, monospace;">MappedByteBuffer</span></a>.</li>
</ol>
The tests are run on a 2.0GHz Sandy Bridge CPU with 8GB RAM, an Intel 320 SSD on Fedora Core 15 64-bit Linux with an ext4 file system, and Oracle JDK 1.6.0_30.<br />
<br />
<b><span style="font-size: large;">The Code </span></b><br />
<pre>
import java.io.*;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import static java.lang.Integer.MAX_VALUE;
import static java.lang.System.out;
import static java.nio.channels.FileChannel.MapMode.READ_ONLY;
import static java.nio.channels.FileChannel.MapMode.READ_WRITE;
public final class TestSequentialIoPerf
{
public static final int PAGE_SIZE = 1024 * 4;
public static final long FILE_SIZE = PAGE_SIZE * 2000L * 1000L;
public static final String FILE_NAME = "test.dat";
public static final byte[] BLANK_PAGE = new byte[PAGE_SIZE];
public static void main(final String[] arg) throws Exception
{
preallocateTestFile(FILE_NAME);
for (final PerfTestCase testCase : testCases)
{
for (int i = 0; i < 5; i++)
{
System.gc();
long writeDurationMs = testCase.test(PerfTestCase.Type.WRITE,
FILE_NAME);
System.gc();
long readDurationMs = testCase.test(PerfTestCase.Type.READ,
FILE_NAME);
long bytesReadPerSec = (FILE_SIZE * 1000L) / readDurationMs;
long bytesWrittenPerSec = (FILE_SIZE * 1000L) / writeDurationMs;
out.format("%s\twrite=%,d\tread=%,d bytes/sec\n",
testCase.getName(),
bytesWrittenPerSec, bytesReadPerSec);
}
}
deleteFile(FILE_NAME);
}
private static void preallocateTestFile(final String fileName)
throws Exception
{
RandomAccessFile file = new RandomAccessFile(fileName, "rw");
for (long i = 0; i < FILE_SIZE; i += PAGE_SIZE)
{
file.write(BLANK_PAGE, 0, PAGE_SIZE);
}
file.close();
}
private static void deleteFile(final String testFileName) throws Exception
{
File file = new File(testFileName);
if (!file.delete())
{
out.println("Failed to delete test file=" + testFileName);
out.println("Windows does not allow mapped files to be deleted.");
}
}
public abstract static class PerfTestCase
{
public enum Type { READ, WRITE }
private final String name;
private int checkSum;
public PerfTestCase(final String name)
{
this.name = name;
}
public String getName()
{
return name;
}
public long test(final Type type, final String fileName)
{
long start = System.currentTimeMillis();
try
{
switch (type)
{
case WRITE:
{
checkSum = testWrite(fileName);
break;
}
case READ:
{
final int checkSum = testRead(fileName);
if (checkSum != this.checkSum)
{
final String msg = getName() +
" expected=" + this.checkSum +
" got=" + checkSum;
throw new IllegalStateException(msg);
}
break;
}
}
}
catch (Exception ex)
{
ex.printStackTrace();
}
return System.currentTimeMillis() - start;
}
public abstract int testWrite(final String fileName) throws Exception;
public abstract int testRead(final String fileName) throws Exception;
}
private static PerfTestCase[] testCases =
{
new PerfTestCase("RandomAccessFile")
{
public int testWrite(final String fileName) throws Exception
{
RandomAccessFile file = new RandomAccessFile(fileName, "rw");
final byte[] buffer = new byte[PAGE_SIZE];
int pos = 0;
int checkSum = 0;
for (long i = 0; i < FILE_SIZE; i++)
{
byte b = (byte)i;
checkSum += b;
buffer[pos++] = b;
if (PAGE_SIZE == pos)
{
file.write(buffer, 0, PAGE_SIZE);
pos = 0;
}
}
file.close();
return checkSum;
}
public int testRead(final String fileName) throws Exception
{
RandomAccessFile file = new RandomAccessFile(fileName, "r");
final byte[] buffer = new byte[PAGE_SIZE];
int checkSum = 0;
int bytesRead;
while (-1 != (bytesRead = file.read(buffer)))
{
for (int i = 0; i < bytesRead; i++)
{
checkSum += buffer[i];
}
}
file.close();
return checkSum;
}
},
new PerfTestCase("BufferedStreamFile")
{
public int testWrite(final String fileName) throws Exception
{
int checkSum = 0;
OutputStream out =
new BufferedOutputStream(new FileOutputStream(fileName));
for (long i = 0; i < FILE_SIZE; i++)
{
byte b = (byte)i;
checkSum += b;
out.write(b);
}
out.close();
return checkSum;
}
public int testRead(final String fileName) throws Exception
{
int checkSum = 0;
InputStream in =
new BufferedInputStream(new FileInputStream(fileName));
int b;
while (-1 != (b = in.read()))
{
checkSum += (byte)b;
}
in.close();
return checkSum;
}
},
new PerfTestCase("BufferedChannelFile")
{
public int testWrite(final String fileName) throws Exception
{
FileChannel channel =
new RandomAccessFile(fileName, "rw").getChannel();
ByteBuffer buffer = ByteBuffer.allocate(PAGE_SIZE);
int checkSum = 0;
for (long i = 0; i < FILE_SIZE; i++)
{
byte b = (byte)i;
checkSum += b;
buffer.put(b);
if (!buffer.hasRemaining())
{
buffer.flip();
channel.write(buffer);
buffer.clear();
}
}
channel.close();
return checkSum;
}
public int testRead(final String fileName) throws Exception
{
FileChannel channel =
new RandomAccessFile(fileName, "rw").getChannel();
ByteBuffer buffer = ByteBuffer.allocate(PAGE_SIZE);
int checkSum = 0;
while (-1 != (channel.read(buffer)))
{
buffer.flip();
while (buffer.hasRemaining())
{
checkSum += buffer.get();
}
buffer.clear();
}
channel.close(); // close the channel so the read test does not leak a file handle
return checkSum;
}
},
new PerfTestCase("MemoryMappedFile")
{
public int testWrite(final String fileName) throws Exception
{
FileChannel channel =
new RandomAccessFile(fileName, "rw").getChannel();
MappedByteBuffer buffer =
channel.map(READ_WRITE, 0,
Math.min(channel.size(), MAX_VALUE));
int checkSum = 0;
for (long i = 0; i < FILE_SIZE; i++)
{
if (!buffer.hasRemaining())
{
buffer =
channel.map(READ_WRITE, i,
Math.min(channel.size() - i , MAX_VALUE));
}
byte b = (byte)i;
checkSum += b;
buffer.put(b);
}
channel.close();
return checkSum;
}
public int testRead(final String fileName) throws Exception
{
FileChannel channel =
new RandomAccessFile(fileName, "rw").getChannel();
MappedByteBuffer buffer =
channel.map(READ_ONLY, 0,
Math.min(channel.size(), MAX_VALUE));
int checkSum = 0;
for (long i = 0; i < FILE_SIZE; i++)
{
if (!buffer.hasRemaining())
{
buffer =
channel.map(READ_ONLY, i,
Math.min(channel.size() - i , MAX_VALUE));
}
checkSum += buffer.get();
}
channel.close();
return checkSum;
}
},
};
}
</pre>
<span style="font-size: large;"><b>Results</b></span><br />
<pre>
400MB file
==========
RandomAccessFile    write=379,610,750   read=1,452,482,269 bytes/sec
RandomAccessFile    write=294,041,636   read=1,494,890,510 bytes/sec
RandomAccessFile    write=250,980,392   read=1,422,222,222 bytes/sec
RandomAccessFile    write=250,366,748   read=1,388,474,576 bytes/sec
RandomAccessFile    write=260,394,151   read=1,422,222,222 bytes/sec

BufferedStreamFile  write=98,178,331    read=286,433,566 bytes/sec
BufferedStreamFile  write=100,244,738   read=288,857,545 bytes/sec
BufferedStreamFile  write=82,948,562    read=154,100,827 bytes/sec
BufferedStreamFile  write=108,503,311   read=153,869,271 bytes/sec
BufferedStreamFile  write=113,055,478   read=152,608,047 bytes/sec

BufferedChannelFile write=228,443,948   read=356,173,913 bytes/sec
BufferedChannelFile write=265,629,053   read=374,063,926 bytes/sec
BufferedChannelFile write=223,825,136   read=1,539,849,624 bytes/sec
BufferedChannelFile write=232,992,036   read=1,539,849,624 bytes/sec
BufferedChannelFile write=212,779,220   read=1,534,082,397 bytes/sec

MemoryMappedFile    write=300,955,180   read=305,899,925 bytes/sec
MemoryMappedFile    write=313,149,847   read=310,538,286 bytes/sec
MemoryMappedFile    write=326,374,501   read=303,857,566 bytes/sec
MemoryMappedFile    write=327,680,000   read=304,535,315 bytes/sec
MemoryMappedFile    write=326,895,450   read=303,632,320 bytes/sec

8GB file
========
RandomAccessFile    write=167,402,321   read=251,922,012 bytes/sec
RandomAccessFile    write=193,934,802   read=257,052,307 bytes/sec
RandomAccessFile    write=192,948,159   read=248,460,768 bytes/sec
RandomAccessFile    write=191,814,180   read=245,225,408 bytes/sec
RandomAccessFile    write=190,635,762   read=275,315,073 bytes/sec

BufferedStreamFile  write=154,823,102   read=248,355,313 bytes/sec
BufferedStreamFile  write=152,083,913   read=253,418,301 bytes/sec
BufferedStreamFile  write=133,099,369   read=146,056,197 bytes/sec
BufferedStreamFile  write=131,065,708   read=146,217,827 bytes/sec
BufferedStreamFile  write=132,694,052   read=148,116,004 bytes/sec

BufferedChannelFile write=186,703,740   read=215,075,218 bytes/sec
BufferedChannelFile write=190,591,410   read=211,030,680 bytes/sec
BufferedChannelFile write=187,220,038   read=223,087,606 bytes/sec
BufferedChannelFile write=191,585,397   read=221,297,747 bytes/sec
BufferedChannelFile write=192,653,214   read=211,789,038 bytes/sec

MemoryMappedFile    write=123,023,322   read=231,530,156 bytes/sec
MemoryMappedFile    write=121,961,023   read=230,403,600 bytes/sec
MemoryMappedFile    write=123,317,778   read=229,899,250 bytes/sec
MemoryMappedFile    write=121,472,738   read=231,739,745 bytes/sec
MemoryMappedFile    write=120,362,615   read=231,190,382 bytes/sec
</pre>
</div>
<br />
<span style="font-size: large;"><b>Analysis</b></span><br />
<br />
For years I was a big fan of using <a href="http://docs.oracle.com/javase/6/docs/api/java/io/RandomAccessFile.html"><span style="font-family: 'Courier New', Courier, monospace;">RandomAccessFile</span></a> directly because of the control it gives and the predictable execution. I never found using buffered streams to be useful from a performance perspective and this still seems to be the case.<br />
<br />
In more recent testing I've found that using NIO <a href="http://docs.oracle.com/javase/6/docs/api/java/nio/channels/FileChannel.html"><span style="font-family: 'Courier New', Courier, monospace;">FileChannel</span></a> and <a href="http://docs.oracle.com/javase/6/docs/api/java/nio/ByteBuffer.html"><span style="font-family: 'Courier New', Courier, monospace;">ByteBuffer</span></a> does much better. With Java 7 the flexibility of this programming approach has been improved for random access with <a href="http://openjdk.java.net/projects/nio/javadoc/java/nio/channels/SeekableByteChannel.html"><span style="font-family: 'Courier New', Courier, monospace;">SeekableByteChannel</span></a>.<br />
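<br />
As a small sketch of that Java 7 API (the file name is illustrative and this is not part of the test above), a file can be opened as a SeekableByteChannel, positioned, and then read sequentially from that point:<br />
<pre>
import java.nio.ByteBuffer;
import java.nio.channels.SeekableByteChannel;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public final class SeekableReadSketch
{
    public static void main(final String[] args) throws Exception
    {
        final SeekableByteChannel channel =
            Files.newByteChannel(Paths.get("test.dat"), StandardOpenOption.READ);
        final ByteBuffer buffer = ByteBuffer.allocate(4096);

        channel.position(1024 * 1024); // seek to an arbitrary offset
        channel.read(buffer);          // then read sequentially from there

        buffer.flip();
        System.out.println("read " + buffer.remaining() + " bytes");
        channel.close();
    }
}
</pre>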
<br />
It seems that for reading, RandomAccessFile and NIO do very well, with memory-mapped files winning for writes in some cases.<br />
<br />
I've seen these results vary greatly depending on platform. File system, OS, storage devices, and available memory all have a significant impact. In a few cases I've seen memory-mapped files perform significantly better than the others but this needs to be tested on your platform because <i>your mileage may vary...</i><br />
<br />
A special note should be made for the use of large memory-mapped files when pushing for maximum throughput. I've often found the OS can become unresponsive due to the pressure put on the virtual-memory sub-system.<br />
<br />
<span style="font-size: large;"><b>Conclusion</b></span><br />
<br />
There is a significant difference in performance between the different means of doing sequential file IO from Java. Not all methods are even remotely equal. For most IO I've found the use of ByteBuffers and Channels to be the best optimised parts of the IO libraries. If buffered streams are your IO library of choice, then it is worth branching out and getting familiar with the implementations of <a href="http://docs.oracle.com/javase/6/docs/api/java/nio/channels/Channel.html"><span style="font-family: 'Courier New', Courier, monospace;">Channel</span></a> and <span style="font-family: 'Courier New', Courier, monospace;"><a href="http://docs.oracle.com/javase/6/docs/api/java/nio/Buffer.html">Buffer</a> </span><span style="font-family: inherit;">or even falling back and using the good old </span><span style="font-family: 'Courier New', Courier, monospace;"><a href="http://docs.oracle.com/javase/6/docs/api/java/io/RandomAccessFile.html">RandomAccessFile</a></span>.</div>Martin Thompsonhttp://www.blogger.com/profile/15893849163924476586noreply@blogger.com30London, UK51.508129 -0.1280050000000301251.3644275 -0.37787450000003009 51.651830499999996 0.12186449999996987tag:blogger.com,1999:blog-5560209661389175529.post-26943233151095471432011-11-22T16:36:00.004+00:002022-08-17T11:37:20.184+01:00Biased Locking, OSR, and Benchmarking Fun<div dir="ltr" style="text-align: left;" trbidi="on">
After my last post on <a href="http://mechanical-sympathy.blogspot.com/2011/11/java-lock-implementations.html">Java Lock Implementations</a>, I got a lot of good feedback about my results and micro-benchmark design approach. As a result I now understand JVM warmup, On Stack Replacement (OSR) and Biased Locking somewhat better than before. Special thanks to <a href="http://blogs.oracle.com/dave/">Dave Dice</a> from Oracle, and <a href="http://www.azulsystems.com/blog/cliff">Cliff Click</a> & Gil Tene from Azul, for their very useful feedback.<br />
<br />
In the last post I concluded, based on my experiments, that biased locking was no longer necessary on modern CPUs. While this conclusion is understandable given the data gathered in the experiment, it was not valid because the experiment did not take into account some JVM warm-up behaviour that I was unaware of.<br />
<br />
In this post I will re-run the experiment taking into account the feedback and present some new results. I shall also expand on the changes I've made to the test and why it is important to consider JVM warm-up behaviour when writing micro-benchmarks, or even very lean Java applications with quick start-up times.<br />
<br />
<b><span style="font-size: large;">On Stack Replacement (OSR)</span></b><br />
<br />
Java virtual machines will compile code to achieve greater performance based on runtime profiling. Some VMs run an interpreter for the majority of code and replace hot areas with compiled code, following the 80/20 rule. Other VMs compile all code with a simple compiler at first, then replace the simple code with more optimised code based on profiling. Oracle Hotspot and Azul are examples of the first type and Oracle JRockit is an example of the second.<br />
<br />
Oracle Hotspot will count method invocations plus branch backs for loops within a method, and if this exceeds 10K in server mode the method will be compiled. With normal JIT'ing, the compiled code can be used when the method is next called. However, if a loop is still iterating it may make sense to replace the method before the loop completes, especially if it has many iterations to go. OSR is the means by which a method gets replaced with a compiled version part way through iterating a loop.<br />
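<br />
To see OSR happening, a method invoked only once whose loop iterates far beyond the compile threshold can only be optimised part way through that loop. Run a sketch like the following with -XX:+PrintCompilation and Hotspot marks OSR compilations with a '%' in the output:<br />
<pre>
// main() is invoked once, but its loop branches back far more than 10K
// times, so the method can only be replaced with compiled code via OSR.
public final class OsrExample
{
    public static void main(final String[] args)
    {
        long sum = 0;
        for (long i = 0; i < 1000L * 1000L * 1000L; i++)
        {
            sum += i;
        }

        System.out.println("sum = " + sum);
    }
}
</pre>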
<br />
I was under the impression that normal JIT'ing and OSR would result in similar code. Cliff Click pointed out that it is much harder for a runtime to optimise a loop part way through, and especially difficult if nested. For example, bounds checking within the loop may not be possible to eliminate. Cliff will <a href="http://www.azulsystems.com/blog/cliff/2011-11-22-what-the-heck-is-osr-and-why-is-it-bad-or-good">blog</a> in more detail on this shortly.<br />
<br />
What this means is that you are likely to get better optimised code by doing a small number of shorter warm ups than a single large one. You can see in the code below how I do 10 shorter runs in a loop before the main large run compared to the last article where I did a single large warm-up run.<br />
<br />
<span style="font-size: large;"><b>Biased Locking</b></span><br />
<br />
Dave Dice pointed out that Hotspot does not enable objects for biased locking in the first few seconds (4s at present) of JVM startup. This is because some benchmarks, and NetBeans, have a lot of thread contention on start up and the revocation cost is significant.<br />
<br />
In Oracle Hotspot, all objects are by default created with biased locking enabled after the first few seconds of start-up; this delay can be removed with <span style="font-family: "courier new" , "courier" , monospace;">-XX:BiasedLockingStartupDelay=0</span>.<br />
<br />
This point, combined with knowing more about OSR, is important for micro-benchmarks. It is also important to be aware of these points if you have a lean Java application that starts in a few seconds.<br />
<br />
<span style="font-size: large;"><b>The Code</b></span>
<br />
<pre>import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;
import java.util.concurrent.CyclicBarrier;
import static java.lang.System.out;
public final class TestLocks implements Runnable
{
public enum LockType {JVM, JUC}
public static LockType lockType;
public static final long WARMUP_ITERATIONS = 100L * 1000L;
public static final long ITERATIONS = 500L * 1000L * 1000L;
public static long counter = 0L;
public static final Object jvmLock = new Object();
public static final Lock jucLock = new ReentrantLock();
private static int numThreads;
private final long iterationLimit;
private final CyclicBarrier barrier;
public TestLocks(final CyclicBarrier barrier, final long iterationLimit)
{
this.barrier = barrier;
this.iterationLimit = iterationLimit;
}
public static void main(final String[] args) throws Exception
{
lockType = LockType.valueOf(args[0]);
numThreads = Integer.parseInt(args[1]);
for (int i = 0; i < 10; i++)
{
runTest(numThreads, WARMUP_ITERATIONS);
counter = 0L;
}
final long start = System.nanoTime();
runTest(numThreads, ITERATIONS);
final long duration = System.nanoTime() - start;
out.printf("%d threads, duration %,d (ns)\n", numThreads, duration);
out.printf("%,d ns/op\n", duration / ITERATIONS);
out.printf("%,d ops/s\n", (ITERATIONS * 1000000000L) / duration);
out.println("counter = " + counter);
}
private static void runTest(final int numThreads, final long iterationLimit)
throws Exception
{
CyclicBarrier barrier = new CyclicBarrier(numThreads);
Thread[] threads = new Thread[numThreads];
for (int i = 0; i < threads.length; i++)
{
threads[i] = new Thread(new TestLocks(barrier, iterationLimit));
}
for (Thread t : threads)
{
t.start();
}
for (Thread t : threads)
{
t.join();
}
}
public void run()
{
try
{
barrier.await();
}
catch (Exception e)
{
// don't care
}
switch (lockType)
{
case JVM: jvmLockInc(); break;
case JUC: jucLockInc(); break;
}
}
private void jvmLockInc()
{
long count = iterationLimit / numThreads;
while (0 != count--)
{
synchronized (jvmLock)
{
++counter;
}
}
}
private void jucLockInc()
{
long count = iterationLimit / numThreads;
while (0 != count--)
{
jucLock.lock();
try
{
++counter;
}
finally
{
jucLock.unlock();
}
}
}
}
</pre>
<br />
<b>Script to run tests:</b><br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">set -x</span><span style="font-family: "courier new" , "courier" , monospace;"></span><br />
<span style="font-family: "courier new" , "courier" , monospace;">for i in {1..8}</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">do </span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> java -server -XX:-UseBiasedLocking TestLocks JVM $i</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">done</span><br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">for i in {1..8}</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">do </span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> java -server -XX:+UseBiasedLocking -XX:BiasedLockingStartupDelay=0 TestLocks JVM $i</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">done</span><br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">for i in {1..8}</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">do </span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> java -server TestLocks JUC $i</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">done </span><br />
<br />
<span style="font-size: large;"><b>Results</b></span><br />
<br />
The tests are carried out with 64-bit Linux (Fedora Core 15) and Oracle JDK 1.6.0_29. <br />
<br />
<div align="center">
<table border="1" cellpadding="5"><tbody>
<tr style="background-color: cyan;"><th colspan="4">Nehalem 2.8GHz - Ops/Sec</th></tr>
<tr><th style="background-color: #cfe2f3;">Threads</th><th style="background-color: #cfe2f3;">-UseBiasedLocking</th><th style="background-color: #cfe2f3;">+UseBiasedLocking</th><th style="background-color: #cfe2f3;">ReentrantLock</th></tr>
<tr><td align="right">1</td><td align="right">53,283,461</td><td align="right"><b>450,950,969</b></td><td align="right">62,876,566</td></tr>
<tr><td align="right">2</td><td align="right">18,519,295</td><td align="right">18,108,615</td><td align="right">10,217,186</td></tr>
<tr><td align="right">3</td><td align="right">13,349,605</td><td align="right">13,416,198</td><td align="right">14,108,622</td></tr>
<tr><td align="right">4</td><td align="right">8,120,172</td><td align="right">8,040,773</td><td align="right">14,207,310</td></tr>
<tr><td align="right">5</td><td align="right">4,725,114</td><td align="right">4,551,766</td><td align="right">14,302,683</td></tr>
<tr><td align="right">6</td><td align="right">5,133,706</td><td align="right">5,246,548</td><td align="right">14,676,616</td></tr>
<tr><td align="right">7</td><td align="right">5,473,652</td><td align="right">5,585,666</td><td align="right">18,145,525</td></tr>
<tr><td align="right">8</td><td align="right">5,514,056</td><td align="right">5,414,171</td><td align="right">19,010,725</td></tr>
</tbody></table>
</div>
<br />
<br />
<div align="center">
<table border="1" cellpadding="5"><tbody>
<tr style="background-color: cyan;"><th colspan="4">Sandy Bridge 2.0GHz - Ops/Sec</th></tr>
<tr><th style="background-color: #cfe2f3;">Threads</th><th style="background-color: #cfe2f3;">-UseBiasedLocking</th><th style="background-color: #cfe2f3;">+UseBiasedLocking</th><th style="background-color: #cfe2f3;">ReentrantLock</th></tr>
<tr><td align="right">1</td><td align="right">34,500,407</td><td align="right"><b>396,511,324</b></td><td align="right">43,148,808</td></tr>
<tr><td align="right">2</td><td align="right">20,899,076</td><td align="right">19,742,639</td><td align="right">6,038,923</td></tr>
<tr><td align="right">3</td><td align="right">9,288,039</td><td align="right">11,957,032</td><td align="right">24,147,807</td></tr>
<tr><td align="right">4</td><td align="right">5,618,862</td><td align="right">5,589,289</td><td align="right">9,082,961</td></tr>
<tr><td align="right">5</td><td align="right">5,609,932</td><td align="right">5,592,574</td><td align="right">9,389,243</td></tr>
<tr><td align="right">6</td><td align="right">5,742,907</td><td align="right">5,760,558</td><td align="right">12,518,728</td></tr>
<tr><td align="right">7</td><td align="right">6,699,201</td><td align="right">6,641,886</td><td align="right">13,684,475</td></tr>
<tr><td align="right">8</td><td align="right">6,957,824</td><td align="right">6,925,410</td><td align="right">14,819,005</td></tr>
</tbody></table>
</div>
<br />
<span style="font-size: large;"><b>Observations</b></span><br />
<ol>
<li>Biased locking has a huge benefit in the un-contended single threaded case.</li>
<li>Biased locking, when un-contended and not revoked, only adds 4-5 cycles of cost. This is the cost, given a cache hit on the lock structures, on top of the code protected by the critical section.</li>
<li><span style="font-family: "courier new" , "courier" , monospace;">-XX:BiasedLockingStartupDelay=0</span> needs to be set for lean applications and micro-benchmarks.</li>
<li>Avoiding OSR does not make a material difference to this set of test results. This is likely to be because the loop is so simple or other costs are dominating.</li>
<li>For the current implementations, ReentrantLocks scale better than synchronised locks under contention, except in the case of 2 contending threads.</li>
</ol>
<span style="font-size: large;"><b>Conclusion</b></span><br />
<br />
My tests in the last <a href="http://mechanical-sympathy.blogspot.com/2011/11/java-lock-implementations.html">post</a> are invalid for the testing of an un-contended biased lock, because the lock was not actually biased. If you are designing code following the <a href="http://mechanical-sympathy.blogspot.com/2011/09/single-writer-principle.html">single writer principle</a>, and therefore having un-contended locks when using 3rd party libraries, then having biased locking enabled is a significant performance boost.</div>
Martin Thompsonhttp://www.blogger.com/profile/15893849163924476586noreply@blogger.com12tag:blogger.com,1999:blog-5560209661389175529.post-23094722098460122282011-11-19T02:57:00.009+00:002022-08-17T11:37:39.912+01:00Java Lock Implementations<div dir="ltr" style="text-align: left;" trbidi="on">
We all use 3rd party libraries as a normal part of development. Generally, we have no control over their internals. The libraries provided with the JDK are a typical example. Many of these libraries employ locks to manage contention.<br />
<br />
JDK locks come with two implementations. One uses atomic CAS style instructions to manage the claim process. CAS instructions tend to be the most expensive type of CPU instructions and on x86 have <a href="http://mechanical-sympathy.blogspot.com/2011/07/memory-barriersfences.html">memory ordering</a> semantics. Often locks are un-contended, which gives rise to a possible optimisation whereby a lock can be <a href="http://home.comcast.net/%7Epjbishop/Dave/QRL-OpLocks-BiasedLocking.pdf">biased</a> to the un-contended thread, using techniques that avoid the use of atomic instructions. This biasing in theory allows a lock to be quickly reacquired by the same thread. If the lock turns out to be contended by multiple threads, the algorithm will revert from being biased and fall back to the standard approach using atomic instructions. Biased locking became the <a href="http://java.sun.com/performance/reference/whitepapers/6_performance.html#2.1.1">default lock implementation</a> with Java 6.<br />
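<br />
To make the CAS-based claim process concrete, here is a minimal spin-lock sketch. This is not how Hotspot implements monitors, which involve inflation, parking, and more; it simply shows the atomic claim and release that biased locking tries to avoid on the fast path:<br />
<pre>
import java.util.concurrent.atomic.AtomicBoolean;

public final class SpinLock
{
    private final AtomicBoolean locked = new AtomicBoolean(false);

    public void lock()
    {
        while (!locked.compareAndSet(false, true))
        {
            // busy spin: each attempt is a LOCK CMPXCHG on x86
            // with full memory-ordering semantics
        }
    }

    public void unlock()
    {
        locked.set(false); // volatile store releases the lock
    }
}
</pre>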
<br />
When respecting the <a href="http://mechanical-sympathy.blogspot.com/2011/09/single-writer-principle.html">single writer principle,</a> biased locking should be your friend. Lately, when using the sockets API, I decided to measure the lock costs and was surprised by the results. I found that my un-contended thread was incurring a bit more cost than I expected from the lock. I put together the following test to compare the cost of the current lock implementations available in Java 6.<br />
<br />
<span style="font-size: large;"><b>The Test</b></span><br />
<br />
For the test I shall increment a counter within a lock, and increase the number of contending threads on the lock. This test will be repeated for the 3 major lock implementations available to Java:<br />
<ol>
<li>Atomic locking on Java language monitors</li>
<li>Biased locking on Java language monitors</li>
<li><a href="http://download.oracle.com/javase/1,5,0/docs/api/java/util/concurrent/locks/ReentrantLock.html"><span style="font-family: "Courier New",Courier,monospace;">ReentrantLock</span></a> introduced with the java.util.concurrent package in Java 5.</li>
</ol>
I'll also run the tests on the 3 most recent generations of the Intel CPU. For each CPU I'll execute the tests up to the maximum number of concurrent threads the core count will support.<br />
<br />
The tests are carried out with 64-bit Linux (Fedora Core 15) and Oracle JDK 1.6.0_29. <br />
<br />
<span style="font-size: large;"><b>The Code </b></span><br />
<pre>
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

import static java.lang.System.out;

public final class TestLocks implements Runnable
{
    public enum LockType { JVM, JUC }
    public static LockType lockType;

    public static final long ITERATIONS = 500L * 1000L * 1000L;
    public static long counter = 0L;

    public static final Object jvmLock = new Object();
    public static final Lock jucLock = new ReentrantLock();

    private static int numThreads;
    private static CyclicBarrier barrier;

    public static void main(final String[] args) throws Exception
    {
        lockType = LockType.valueOf(args[0]);
        numThreads = Integer.parseInt(args[1]);

        runTest(numThreads); // warm up
        counter = 0L;

        final long start = System.nanoTime();
        runTest(numThreads);
        final long duration = System.nanoTime() - start;

        out.printf("%d threads, duration %,d (ns)\n", numThreads, duration);
        out.printf("%,d ns/op\n", duration / ITERATIONS);
        out.printf("%,d ops/s\n", (ITERATIONS * 1000000000L) / duration);
        out.println("counter = " + counter);
    }

    private static void runTest(final int numThreads) throws Exception
    {
        barrier = new CyclicBarrier(numThreads);

        Thread[] threads = new Thread[numThreads];
        for (int i = 0; i < threads.length; i++)
        {
            threads[i] = new Thread(new TestLocks());
        }

        for (Thread t : threads)
        {
            t.start();
        }

        for (Thread t : threads)
        {
            t.join();
        }
    }

    public void run()
    {
        try
        {
            barrier.await(); // ensure all threads begin the run together
        }
        catch (Exception e)
        {
            // don't care
        }

        switch (lockType)
        {
            case JVM: jvmLockInc(); break;
            case JUC: jucLockInc(); break;
        }
    }

    private void jvmLockInc()
    {
        long count = ITERATIONS / numThreads;
        while (0 != count--)
        {
            synchronized (jvmLock)
            {
                ++counter;
            }
        }
    }

    private void jucLockInc()
    {
        long count = ITERATIONS / numThreads;
        while (0 != count--)
        {
            jucLock.lock();
            try
            {
                ++counter;
            }
            finally
            {
                jucLock.unlock();
            }
        }
    }
}
</pre>
Script to run the tests:<br />
<br />
<span style="font-family: "Courier New",Courier,monospace;">set -x</span><br />
<span style="font-family: "Courier New",Courier,monospace;">for i in {1..8}; do java -XX:-UseBiasedLocking TestLocks JVM $i; done</span><br />
<span style="font-family: "Courier New",Courier,monospace;">for i in {1..8}; do java -XX:+UseBiasedLocking TestLocks JVM $i; done</span><br />
<span style="font-family: "Courier New",Courier,monospace;">for i in {1..8}; do java TestLocks JUC $i; done</span><br />
<br />
<span style="font-size: large;"><b>The Results</b></span><br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgFlXpoecL55i_ddXSDSLAxXEC_1v8C8E-hZrjd0lmBXKuFlOYug9rEolwstar121tcRGVus2pRUeeCYR0ma6InM1l71KThvtUQZpCTN4NqQh89o73ElA38_Q_STA86rZ2ruBX-u88Fo3E/s1600/nehalem.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgFlXpoecL55i_ddXSDSLAxXEC_1v8C8E-hZrjd0lmBXKuFlOYug9rEolwstar121tcRGVus2pRUeeCYR0ma6InM1l71KThvtUQZpCTN4NqQh89o73ElA38_Q_STA86rZ2ruBX-u88Fo3E/s1600/nehalem.png" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Figure 1.</td></tr>
</tbody></table>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhDI_7B6NhHI6twY25XqBufnUS95a9yr9pT-PFWe47GRaTwXrMF6satRVdHxOGZ1-mvZAWj-k554mEru7J-l8gNFVnLyOFFVkqBQ3_wVdeTRF0fxW5fbNysLFFiWVGKzw35YR_NAIvMxbA/s1600/westmere.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhDI_7B6NhHI6twY25XqBufnUS95a9yr9pT-PFWe47GRaTwXrMF6satRVdHxOGZ1-mvZAWj-k554mEru7J-l8gNFVnLyOFFVkqBQ3_wVdeTRF0fxW5fbNysLFFiWVGKzw35YR_NAIvMxbA/s1600/westmere.png" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Figure 2.</td></tr>
</tbody></table>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhASWEbXVbV2QAkqQNCDuX5pG3hW0U2EmKYIMIARrvI0Dmm9ptFly5RAXuvK8hHop4UsoINlRfshR0lwP_-Ic94VVTPwa7N5dG3wdcUHG7rXbSUhWy9GOnsURhhOd42y5vYtwD_BuXi3i8/s1600/sandybridge.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhASWEbXVbV2QAkqQNCDuX5pG3hW0U2EmKYIMIARrvI0Dmm9ptFly5RAXuvK8hHop4UsoINlRfshR0lwP_-Ic94VVTPwa7N5dG3wdcUHG7rXbSUhWy9GOnsURhhOd42y5vYtwD_BuXi3i8/s1600/sandybridge.png" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Figure 3.</td></tr>
</tbody></table>
<br />
<span style="font-size: large;"><b>Observations</b></span><br />
<ol>
<li>Biased locking, in the un-contended case, is ~10% more expensive than the atomic locking. It seems that for recent CPU generations the cost of atomic instructions is less than the necessary housekeeping for biased locks. Prior to Nehalem, lock instructions would assert a lock on the memory bus to perform these atomic operations, with each costing more than 100 cycles. Since Nehalem, atomic instructions can be handled local to a CPU core, and typically cost only 10-20 cycles if they do not need to wait on the store buffer to empty while enforcing memory ordering semantics.</li>
<li>As contention increases, language monitor locks quickly reach a throughput limit regardless of thread count.</li>
<li>ReentrantLock gives the best un-contended performance and scales significantly better with increasing contention compared to language monitors using synchronized.</li>
<li>ReentrantLock has an odd characteristic of reduced performance when 2 threads are contending. This deserves further investigation.</li>
<li>When the contended thread count is low, Sandybridge suffers from the <a href="http://mechanical-sympathy.blogspot.com/2011/09/adventures-with-atomiclong.html">increased latency</a> of atomic instructions I detailed in a previous article. As contended thread count continues to increase, the cost of the kernel arbitration tends to dominate and Sandybridge shows its strength with increased memory throughput.</li>
</ol>
<span style="font-size: large;"><b>Conclusion</b></span><br />
<br />
Biased locking should no longer be the default lock implementation on modern Intel processors. I recommend you measure your applications and experiment with the <span style="font-family: "Courier New",Courier,monospace;">-XX:-UseBiasedLocking</span> JVM option to determine if you can benefit from using an atomic lock based algorithm for the un-contended case. <br />
<br />
When developing your own concurrent libraries I would recommend <a href="http://download.oracle.com/javase/1.5.0/docs/api/java/util/concurrent/locks/ReentrantLock.html"><span style="font-family: "Courier New",Courier,monospace;">ReentrantLock</span></a> rather than the synchronized keyword, due to the significantly better performance on x86, if a lock-free alternative algorithm is not a viable option.<br />
<br />
<b><i>Update 20-Nov-2011</i></b><br />
<br />
<a href="http://blogs.oracle.com/dave/">Dave Dice</a> has pointed out that biased locking is not implemented for the locks created in the first few seconds of the JVM startup. I'll re-run my tests this week and post the results. I've had some more quality feedback that suggests my results could be potentially invalid. Micro benchmarks can be tricky but the advice of measuring your own application in the large still stands.<br />
<br />
A re-run of the tests can be seen in <a href="http://mechanical-sympathy.blogspot.com/2011/11/biased-locking-osr-and-benchmarking-fun.html">this</a> follow-on blog taking account of Dave's feedback.</div>Martin Thompsonhttp://www.blogger.com/profile/15893849163924476586noreply@blogger.com5London, UK51.5001524 -0.1262361999999939151.322796399999994 -0.39052969999999393 51.6775084 0.1380573000000061tag:blogger.com,1999:blog-5560209661389175529.post-59978063773768055032011-11-05T13:52:00.006+00:002012-07-05T19:44:29.407+01:00Locks & Condition Variables - Latency ImpactIn a previous article on <a href="http://mechanical-sympathy.blogspot.com/2011/08/inter-thread-latency.html">Inter-Thread Latency</a> I showed how it is possible to signal a state change between 2 threads with less than 50ns of latency. To many developers, writing concurrent code using locks is a scary experience. Writing concurrent code using lock-free algorithms, i.e. algorithms that rely on the use of memory barriers and an intimate understanding of the underlying memory models, can be totally terrifying. To me lock-free / <a href="http://en.wikipedia.org/wiki/Non-blocking_algorithm">non-blocking</a> algorithms are like playing with explosives or corrosive chemicals: if you do not understand what you are doing, or show the ultimate respect, then very bad things can, and most likely will, happen!<br />
<br />
In this article, I'd like to illustrate the impact of using locks and the resulting latency they can impose on your designs. I want to use a very similar algorithm to that used in my previous inter-thread latency article to illustrate the ping-pong effect of handing control back and forth between 2 threads. In this case, rather than using a couple of volatile variables, I will employ a pair of condition variables to signal a state change so control can be passed back and forth.<br />
<br />
<span style="font-size: large;"><b>The Code</b></span>
<pre class="brush: java; toolbar: false">
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

import static java.lang.System.out;

public final class LockedSignallingLatency
{
    private static final int ITERATIONS = 10 * 1000 * 1000;

    private static final Lock lock = new ReentrantLock();
    private static final Condition sendCondition = lock.newCondition();
    private static final Condition echoCondition = lock.newCondition();

    private static long sendValue = -1L;
    private static long echoValue = -1L;

    public static void main(final String[] args)
        throws Exception
    {
        final Thread sendThread = new Thread(new SendRunner());
        final Thread echoThread = new Thread(new EchoRunner());

        final long start = System.nanoTime();

        echoThread.start();
        sendThread.start();

        sendThread.join();
        echoThread.join();

        final long duration = System.nanoTime() - start;

        out.printf("duration %,d (ns)\n", duration);
        out.printf("%,d ns/op\n", duration / (ITERATIONS * 2L));
        out.printf("%,d ops/s\n", (ITERATIONS * 2L * 1000000000L) / duration);
    }

    public static final class SendRunner implements Runnable
    {
        public void run()
        {
            for (long i = 0; i < ITERATIONS; i++)
            {
                lock.lock();
                try
                {
                    sendValue = i;
                    sendCondition.signal();
                }
                finally
                {
                    lock.unlock();
                }

                lock.lock();
                try
                {
                    while (echoValue != i)
                    {
                        echoCondition.await();
                    }
                }
                catch (final InterruptedException ex)
                {
                    break;
                }
                finally
                {
                    lock.unlock();
                }
            }
        }
    }

    public static final class EchoRunner implements Runnable
    {
        public void run()
        {
            for (long i = 0; i < ITERATIONS; i++)
            {
                lock.lock();
                try
                {
                    while (sendValue != i)
                    {
                        sendCondition.await();
                    }
                }
                catch (final InterruptedException ex)
                {
                    break;
                }
                finally
                {
                    lock.unlock();
                }

                lock.lock();
                try
                {
                    echoValue = i;
                    echoCondition.signal();
                }
                finally
                {
                    lock.unlock();
                }
            }
        }
    }
}
</pre>
<span style="font-size: large;"><b>Test Results</b></span><br />
<br />
<b>Windows 7 Professional 64-bit - Oracle JDK 1.6.0 - Nehalem 2.8 GHz</b><br />
<br />
<span style="font-family: "Courier New",Courier,monospace;">$ start /AFFINITY 0x14 /B /WAIT java LockedSignallingLatency</span><br />
<span style="font-family: "Courier New",Courier,monospace;">duration 41,649,616,343 (ns)</span><br />
<span style="font-family: "Courier New",Courier,monospace;">2,082 ns/op</span><br />
<span style="font-family: "Courier New",Courier,monospace;">480,196 ops/s</span><br />
<br />
<span style="font-family: "Courier New",Courier,monospace;">$ java LockedSignallingLatency</span><br />
<span style="font-family: "Courier New",Courier,monospace;">duration 73,789,456,491 (ns)</span><br />
<span style="font-family: "Courier New",Courier,monospace;">3,689 ns/op</span><br />
<span style="font-family: "Courier New",Courier,monospace;">271,041 ops/s</span><br />
<br />
<b>Linux Fedora Core 15 64-bit - Oracle JDK 1.6.0 - Nehalem 2.8 GHz</b><br />
<br />
<div style="font-family: "Courier New",Courier,monospace;">$ taskset -c 2,4 java LockedSignallingLatency<br />
duration 40,469,689,559 (ns)<br />
2,023 ns/op<br />
494,197 ops/s</div><div style="font-family: "Courier New",Courier,monospace;"><br />
</div><span style="font-family: "Courier New",Courier,monospace;">$ java LockedSignallingLatency</span><br />
<span style="font-family: "Courier New",Courier,monospace;">duration 169,795,756,230 (ns)</span><br />
<span style="font-family: "Courier New",Courier,monospace;">8,489 ns/op</span><br />
<span style="font-family: "Courier New",Courier,monospace;">117,788 ops/s</span><br />
<br />
<b>Linux Fedora Core 15 64-bit - Oracle JDK 1.6.0 - Sandybridge 2.0 GHz</b><br />
<br />
<span style="font-family: "Courier New",Courier,monospace;">$ taskset -c 2,4 java LockedSignallingLatency</span><br />
<span style="font-family: "Courier New",Courier,monospace;">duration 47,209,549,484 (ns)</span><br />
<span style="font-family: "Courier New",Courier,monospace;">2,360 ns/op</span><br />
<span style="font-family: "Courier New",Courier,monospace;">423,643 ops/s</span><br />
<br />
<span style="font-family: "Courier New",Courier,monospace;">$ java LockedSignallingLatency</span><br />
<span style="font-family: "Courier New",Courier,monospace;">duration 336,168,489,093 (ns)</span><br />
<span style="font-family: "Courier New",Courier,monospace;">16,808 ns/op</span><br />
<span style="font-family: "Courier New",Courier,monospace;">59,493 ops/s</span><br />
<br />
<span style="font-size: large;"><b>Observations</b></span><br />
<br />
The above is a typical set of results I've seen in the middle of the range from multiple runs. There are a couple of interesting observations I'd like to expand on.<br />
<br />
Firstly, this is orders of magnitude greater latency than what I illustrated in the previous article using just memory barriers to signal between threads. This cost comes about because the kernel needs to get involved to arbitrate between the threads for the lock, and then manage the scheduling for the threads to awaken when the condition is signalled. The one-way latency to signal a change is pretty much the same as what is considered current state of the art for network hops between nodes via a switch. It is possible to get ~1<span style="font-family: "Calibri","sans-serif"; font-size: 11pt; line-height: 115%;">µ</span>s latency with <a href="http://en.wikipedia.org/wiki/InfiniBand">InfiniBand</a> and less than 5<span style="font-family: "Calibri","sans-serif"; font-size: 11pt; line-height: 115%;">µ</span>s with <a href="http://www.solarflare.com/09-14-11-Solarflare-Arista-Complete-Ultra-Low-Latency-Testing">10GigE and user-space IP stacks</a>.<br />
<br />
Secondly, the impact is clear when letting the OS choose what CPUs the threads get scheduled on rather than pinning them manually. I've observed this same issue across many use cases whereby Linux, in default configuration for its scheduler, will greatly impact the performance of a low-latency system by scheduling threads on different cores resulting in cache pollution. Windows by default seems to do a better job of this.<br />
<br />
I recently had an interesting discussion with <a href="http://www.azulsystems.com/blog/cliff">Cliff Click</a> about using condition variables and their cost. He pointed out a problem he was seeing. If you look at the case where a sleeping thread gets signalled within the lock, it wakes to run and then discovers it cannot get the lock because the signalling thread still holds it, so it gets put back to sleep until the signalling thread releases the lock, thus causing more work than necessary. Modern schedulers would benefit from being more aware of the communication mechanisms between threads so they can make more efficient placement and rescheduling decisions. As we go more concurrent and parallel our schedulers need to become more aware of <a href="http://en.wikipedia.org/wiki/Inter-process_communication">IPC</a> mechanisms.<br />
<br />
<span style="font-size: large;"><b>Conclusion</b></span><br />
<br />
When designing a low-latency system it is crucial to avoid the use of locks and condition variables for the main transaction flows. Non-blocking or lock-free algorithms are key to achieving ultra-low latency but can be very difficult to prove correct. I would not recommend designing lock-free algorithms for business logic but they can be very effectively employed for low-level infrastructure components. The business logic is best run on single threads following the <a href="http://mechanical-sympathy.blogspot.com/2011/09/single-writer-principle.html"><i>Single Writer Principle</i></a> from my previous article.Martin Thompsonhttp://www.blogger.com/profile/15893849163924476586noreply@blogger.com0London, UK51.5001524 -0.1262361999999939151.322796399999994 -0.39052969999999393 51.6775084 0.1380573000000061tag:blogger.com,1999:blog-5560209661389175529.post-26329242196604611652011-10-19T17:44:00.004+01:002022-08-17T11:38:07.585+01:00Smart BatchingHow often have we all heard that “batching” will increase latency? As someone with a passion for low-latency systems this surprises me. In my experience when batching is done correctly, not only does it increase throughput, it can also reduce average latency and keep it consistent.<br />
<br />
Well then, how can batching magically reduce latency? It comes down to what algorithm and data structures are employed. In a distributed environment we often have to batch up messages/events into network packets to achieve greater throughput. We also employ similar techniques in buffering writes to storage to reduce the number of <a href="http://en.wikipedia.org/wiki/IOPS">IOPS</a>. That storage could be a block device backed file-system or a relational database. Most IO devices can only handle a modest number of IO operations per second, so it is best to fill those operations efficiently. Many approaches to batching involve waiting for a timeout to occur and this will by its very nature increase latency, as the sketch below illustrates. The batch can also get filled before the timeout occurs making the latency even more unpredictable.<br />
<br />
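A rough sketch of such a timeout-driven batcher, assuming a made-up 1ms batch window and a placeholder send() method, shows the problem: every batch waits out the window, so the window is added directly to the latency of the first message gathered.<br />
<pre>
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

public final class TimeoutBatcher<T> implements Runnable
{
    private final BlockingQueue<T> queue;
    private final List<T> batch = new ArrayList<T>();

    public TimeoutBatcher(final BlockingQueue<T> queue)
    {
        this.queue = queue;
    }

    public void run()
    {
        try
        {
            while (!Thread.currentThread().isInterrupted())
            {
                // gather messages until the batch window expires
                final long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(1);
                long remaining;
                while ((remaining = deadline - System.nanoTime()) > 0)
                {
                    final T msg = queue.poll(remaining, TimeUnit.NANOSECONDS);
                    if (null != msg)
                    {
                        batch.add(msg);
                    }
                }

                if (!batch.isEmpty())
                {
                    send(batch); // the first message has waited out the full window
                    batch.clear();
                }
            }
        }
        catch (final InterruptedException ex)
        {
            // exit on interrupt
        }
    }

    private void send(final List<T> batch)
    {
        // write the batch to the underlying device
    }
}
</pre>
<br />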
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgbAz63c_G_srlTqfUG8aFNslfkqIXxyUMTOPozuYz0aVSliQ5Xj2eFH_VRqIIxUwA_IYpGgHbWV6hNue2-c1Jo3p4tsLCvH0uZnos1sz00Zpo-MRX3FM75gx78zpOIgacnRf_uDtxtqHk/s1600/batching.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgbAz63c_G_srlTqfUG8aFNslfkqIXxyUMTOPozuYz0aVSliQ5Xj2eFH_VRqIIxUwA_IYpGgHbWV6hNue2-c1Jo3p4tsLCvH0uZnos1sz00Zpo-MRX3FM75gx78zpOIgacnRf_uDtxtqHk/s1600/batching.png" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Figure 1.</td></tr>
</tbody></table><br />
<div class="separator" style="clear: both; text-align: center;"></div>Figure 1. above depicts decoupling the access to an IO device, and therefore the contention for access to it, by introducing a queue like structure to stage the messages/events to be sent and a thread doing the batching for writing to the device.<br />
<br />
<span style="font-size: large;"><b>The Algorithm</b></span><br />
<br />
An approach to batching uses the following algorithm in Java pseudo code:<br />
<pre>
import java.nio.ByteBuffer;
import java.util.Queue;

// Message and NetworkFacade are placeholder types for this discussion
public final class NetworkBatcher
    implements Runnable
{
    private final NetworkFacade network;
    private final Queue<Message> queue;
    private final ByteBuffer buffer;

    public NetworkBatcher(final NetworkFacade network,
                          final int maxPacketSize,
                          final Queue<Message> queue)
    {
        this.network = network;
        buffer = ByteBuffer.allocate(maxPacketSize);
        this.queue = queue;
    }

    public void run()
    {
        while (!Thread.currentThread().isInterrupted())
        {
            while (null == queue.peek())
            {
                employWaitStrategy(); // block, spin, yield, etc.
            }

            Message msg;
            while (null != (msg = queue.poll()))
            {
                if (msg.size() > buffer.remaining())
                {
                    sendBuffer(); // flush when the next message will not fit
                }

                buffer.put(msg.getBytes());
            }

            sendBuffer(); // burst drained, send what has been gathered
        }
    }

    private void sendBuffer()
    {
        buffer.flip();
        network.send(buffer);
        buffer.clear();
    }
}
</pre>
<br />
Basically, wait for data to become available and as soon as it is, send it right away. While sending a previous message or waiting on new messages, a burst of traffic may arrive which can all be sent in a batch, up to the size of the buffer, to the underlying resource. This approach can use <a href="http://download.oracle.com/javase/6/docs/api/java/util/concurrent/ConcurrentLinkedQueue.html">ConcurrentLinkedQueue</a> which provides low latency and avoids locks. However, because it is unbounded, it creates no back pressure to stall producing/publishing threads if they outpace the batcher, so the queue can grow out of control. I’ve often had to wrap <a href="http://download.oracle.com/javase/6/docs/api/java/util/concurrent/ConcurrentLinkedQueue.html">ConcurrentLinkedQueue</a> to track its size and thus create back pressure, as sketched below. This size tracking can add 50% to the processing cost of using this queue in my experience.<br />
<br />
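A minimal sketch of such a wrapper, assuming a yielding wait strategy for the producers; note the bound is only approximate because the size check and the offer are not atomic.<br />
<pre>
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;

public final class BoundedQueue<E>
{
    private final Queue<E> queue = new ConcurrentLinkedQueue<E>();
    private final AtomicInteger size = new AtomicInteger(0);
    private final int capacity;

    public BoundedQueue(final int capacity)
    {
        this.capacity = capacity;
    }

    public void put(final E e)
    {
        while (size.get() >= capacity)
        {
            Thread.yield(); // back pressure: stall the producer
        }

        queue.offer(e);
        size.incrementAndGet(); // this size tracking is the added cost
    }

    public E poll()
    {
        final E e = queue.poll();
        if (null != e)
        {
            size.decrementAndGet();
        }

        return e;
    }

    public E peek()
    {
        return queue.peek();
    }
}
</pre>
<br />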
This algorithm respects the <a href="http://mechanical-sympathy.blogspot.com/2011/09/single-writer-principle.html"><i>single writer principle</i></a> and can often be employed when writing to a network or storage device, and thus avoid lock contention in third party API libraries. By avoiding the contention we avoid the J-Curve latency profile normally associated with contention on resources, due to the queuing effect on locks. With this algorithm, as load increases, latency stays constant until the underlying device is saturated with traffic resulting in a more "bathtub" profile than the J-Curve.<br />
<br />
Let’s take a worked example of handling 10 messages that arrive as a burst of traffic. In most systems traffic comes in bursts and is seldom uniformly spaced out in time. One approach will assume no batching and the threads write to the device API directly as in Figure 1. above. The other will use a lock free data structure to collect the messages plus a single thread consuming messages in a loop as per the algorithm above. For the example let’s assume it takes 100<span style="font-family: "Calibri","sans-serif"; font-size: 11pt; line-height: 115%;">µ</span>s to write a single buffer to the network device as a synchronous operation and have it acknowledged. The buffer will ideally be less than the MTU of the network in size when latency is critical. Many network sub-systems are asynchronous and support <a href="http://en.wikipedia.org/wiki/Pipeline_%28computing%29">pipelining</a> but we will make the above assumption to clarify the example. If the network operation is using a protocol like <a href="http://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol">HTTP</a> under <a href="http://en.wikipedia.org/wiki/Representational_state_transfer">REST</a> or <a href="http://en.wikipedia.org/wiki/Web_service">Web Services</a> then this assumption matches the underlying implementation.<br />
<br />
<div align="center"><table border="1" cellpadding="5"><tbody>
<tr><th></th><th>Best (<span style="font-family: "Calibri","sans-serif"; font-size: 11pt; line-height: 115%;">µ</span>s)</th><th>Average (<span style="font-family: "Calibri","sans-serif"; font-size: 11pt; line-height: 115%;">µ</span>s)</th><th>Worst (<span style="font-family: "Calibri","sans-serif"; font-size: 11pt; line-height: 115%;">µ</span>s)</th><th>Packets Sent</th></tr>
<tr><td>Serial</td><td align="right">100</td><td align="right">500</td><td align="right">1,000</td><td align="right">10</td></tr>
<tr><td>Smart Batching</td><td align="right">100</td><td align="right">150</td><td align="right">200</td><td align="right">1-2</td></tr>
</tbody></table></div><br />
The absolute lowest latency will be achieved if a message is sent from the thread originating the data directly to the resource, if the resource is un-contended. The table above shows what happens when contention occurs and a queuing effect kicks in. With the serial approach 10 individual packets will have to be sent and these typically need to queue on a lock managing access to the resource, therefore they get processed sequentially. The above figures assume the locking strategy works perfectly with no perceivable overhead which is unlikely in a real application.<br />
<br />
For the batching solution it is likely all 10 packets will be picked up in the first batch if the concurrent queue is efficient, thus giving the best case latency scenario. In the worst case only one message is sent in the first batch with the other nine following in the next. Therefore in the worst case scenario one message has a latency of 100<span style="font-family: "Calibri","sans-serif"; font-size: 11pt; line-height: 115%;">µ</span>s and the following 9 have a latency of 200<span style="font-family: "Calibri","sans-serif"; font-size: 11pt; line-height: 115%;">µ</span>s, thus giving a worst case average of 190<span style="font-family: "Calibri","sans-serif"; font-size: 11pt; line-height: 115%;">µ</span>s, which is significantly better than the serial approach.<br />
<br />
This is one good example when the simplest solution is just a bit too simple because of the contention. The batching solution helps achieve consistent low-latency under burst conditions and is best for throughput. It also has a nice effect across the network on the receiving end in that the receiver has to process fewer packets and therefore makes the communication more efficient both ends.<br />
<br />
Most hardware handles data in buffers up to a fixed size for efficiency. For a storage device this will typically be a 4KB block. For networks this will be the MTU and is typically 1500 bytes for Ethernet. When batching, it is best to understand the underlying hardware and write batches down in the ideal buffer size to be optimally efficient. However, keep in mind that some devices need to envelop the data, e.g. with the Ethernet and IP headers for network packets, so the buffer needs to allow for this.<br />
<br />
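For example, with a 1500 byte Ethernet MTU, the 20 byte IPv4 header and 20 byte TCP header leave 1460 bytes for payload, which is why batches of up to 1460 bytes can be sent in a single segment without fragmentation on a standard Ethernet network.<br />
<br />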
There will always be an increased latency from a thread switch and the cost of exchange via the data structure. However there are a number of very good non-blocking structures available using lock-free techniques. For the <a href="http://code.google.com/p/disruptor/">Disruptor</a> this type of exchange can be achieved in as little as 50-100ns thus making the choice of taking the smart batching approach a no brainer for low-latency or high-throughput distributed systems. <br />
<br />
This technique can be employed for many problems and not just IO. The core of the Disruptor uses this technique to help rebalance the system when the publishers burst and outpace the <a href="http://code.google.com/p/disruptor/source/browse/trunk/code/src/main/com/lmax/disruptor/EventProcessor.java" style="font-family: "Courier New",Courier,monospace;">EventProcessor</a>s. The algorithm can be seen inside the <a href="http://code.google.com/p/disruptor/source/browse/trunk/code/src/main/com/lmax/disruptor/BatchEventProcessor.java" style="font-family: "Courier New",Courier,monospace;">BatchEventProcessor</a>.<br />
<br />
<b>Note:</b> For this algorithm to work the queueing structure must handle the contention better than the underlying resource. Many queue implementations are extremely poor at managing contention. Use science and measure before coming to a conclusion.<br />
<br />
<span style="font-size: large;"><b>Batching with the Disruptor</b></span><br />
<br />
The code below shows the same algorithm in action using the Disruptor's <a href="http://code.google.com/p/disruptor/source/browse/trunk/code/src/main/com/lmax/disruptor/EventHandler.java"><span style="font-family: "Courier New",Courier,monospace;">EventHandler</span></a> mechanism. In my experience, this is a very effective technique for handling any IO device efficiently and keeping latency low when dealing with load or burst traffic.<br />
<pre>
import com.lmax.disruptor.EventHandler;

import java.nio.ByteBuffer;

// Message and NetworkFacade are the same placeholder types as above
public final class NetworkBatchHandler
    implements EventHandler<Message>
{
    private final NetworkFacade network;
    private final ByteBuffer buffer;

    public NetworkBatchHandler(final NetworkFacade network,
                               final int maxPacketSize)
    {
        this.network = network;
        buffer = ByteBuffer.allocate(maxPacketSize);
    }

    public void onEvent(Message msg, long sequence, boolean endOfBatch)
        throws Exception
    {
        if (msg.size() > buffer.remaining())
        {
            sendBuffer();
        }

        buffer.put(msg.getBytes());

        if (endOfBatch)
        {
            sendBuffer();
        }
    }

    private void sendBuffer()
    {
        buffer.flip();
        network.send(buffer);
        buffer.clear();
    }
}
</pre>
The <span style="font-family: "Courier New",Courier,monospace;">endOfBatch</span> parameter greatly simplifies the handling of the batch compared to the double loop in the algorithm above.<br />
<br />
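For illustration, wiring this handler into a Disruptor might look like the sketch below. This assumes the later 3.x DSL rather than the version current when the example above was written, and messageFactory and network are placeholder instances.<br />
<pre>
// messageFactory is an assumed EventFactory<Message>; network is the
// NetworkFacade used by the handler; 1024 and 1500 are assumed sizes
Disruptor<Message> disruptor = new Disruptor<Message>(
    messageFactory, 1024, Executors.defaultThreadFactory());

disruptor.handleEventsWith(new NetworkBatchHandler(network, 1500));

RingBuffer<Message> ringBuffer = disruptor.start(); // publishers write via this
</pre>
<br />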
I have simplified the examples to illustrate the algorithm. Clearly error handling and other edge conditions need to be considered.<br />
<br />
<span style="font-size: large;"><b>Separation of IO from Work Processing</b></span><br />
<br />
There is another very good reason to separate the IO from the threads doing the work processing. Handing off the IO to another thread means the worker thread, or threads, can continue processing without blocking in a nice cache friendly manner. I've found this to be critical in achieving high-performance throughput.<br />
<br />
If the underlying IO device or resource becomes briefly saturated then the messages can be queued for the batcher thread allowing the work processing threads to continue. The batching thread then feeds the messages to the IO device in the most efficient way possible allowing the data structure to handle the burst and if full apply the necessary back pressure, thus providing a good separation of concerns in the workflow.<br />
<br />
<span style="font-size: large;"><b>Conclusion</b></span><br />
<br />
So there you have it. Smart Batching can be employed in concert with the appropriate data structures to achieve consistent low-latency and maximum throughput.Martin Thompsonhttp://www.blogger.com/profile/15893849163924476586noreply@blogger.com20London, UK51.5001524 -0.1262361999999939151.322796399999994 -0.39052969999999393 51.6775084 0.1380573000000061tag:blogger.com,1999:blog-5560209661389175529.post-50891628849652617032011-09-22T15:24:00.006+01:002012-07-10T04:16:48.970+01:00Single Writer PrincipleWhen trying to build a highly scalable system the single biggest limitation on scalability is having multiple writers contend for any item of data or resource. Sure, algorithms can be bad, but let’s assume they have a reasonable <a href="http://en.wikipedia.org/wiki/Big_O_notation">Big O notation</a> so we'll focus on the scalability limitations of the systems design. <br />
<br />
I keep seeing people just accept having multiple writers as the norm. There is a lot of research in computer science for managing this contention that boils down to 2 basic approaches. One is to provide mutual exclusion to the contended resource while the mutation takes place; the other is to take an optimistic strategy and swap in the changes if the underlying resource has not changed while you created the new copy. <br />
<br />
<span style="font-size: large;"><b>Mutual Exclusion </b></span><br />
<br />
Mutual exclusion is the means by which only one writer can have access to a protected resource at a time, and is usually implemented with a locking strategy. Locking strategies require an arbitrator, usually the operating system kernel, to get involved when the contention occurs to decide who gains access and in what order. This can be a very expensive process often requiring many more CPU cycles than the actual transaction to be applied to the business logic would use. Those waiting to enter the <a href="http://en.wikipedia.org/wiki/Critical_section">critical section</a>, in advance of performing the mutation, must queue, and this queuing effect (<a href="http://en.wikipedia.org/wiki/Little%27s_law">Little's Law</a>) causes latency to become unpredictable and ultimately restricts throughput.<br />
<br />
<b><span style="font-size: large;">Optimistic Concurrency Control</span></b><br />
<br />
Optimistic strategies involve taking a copy of the data, modifying it, then copying back the changes if data has not mutated in the meantime. If a change has happened in the meantime you repeat the process until successful. This repeating of the process increases with contention and therefore causes a queuing effect just like with mutual exclusion. If you work with a source code control system, such as Subversion or CVS, then you are using this algorithm every day. Optimistic strategies can work with data but do not work so well with resources such as hardware because you cannot take a copy of the hardware! The ability to perform the changes atomically to data is made possible by <a href="http://en.wikipedia.org/wiki/Compare-and-swap">CAS</a> instructions offered by the hardware.<br />
<br />
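The copy, modify, then swap-back cycle can be sketched in Java with an AtomicReference; the immutable Position type here is a made-up example.<br />
<pre>
import java.util.concurrent.atomic.AtomicReference;

public final class OptimisticUpdate
{
    static final class Position
    {
        final long quantity;

        Position(final long quantity)
        {
            this.quantity = quantity;
        }
    }

    private final AtomicReference<Position> position =
        new AtomicReference<Position>(new Position(0L));

    public void addQuantity(final long delta)
    {
        Position current;
        Position updated;
        do
        {
            current = position.get();                         // take a copy
            updated = new Position(current.quantity + delta); // modify the copy
        }
        while (!position.compareAndSet(current, updated));    // swap in if unchanged
        // on failure another writer got in first, so the cycle repeats
    }
}
</pre>
<br />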
Most locking strategies are composed from optimistic strategies for changing the lock state or mutual exclusion primitive.<br />
<br />
<span style="font-size: large;"><b>Managing Contention vs. Doing Real Work</b></span><br />
<br />
CPUs can typically process one or more instructions per cycle. For example, modern Intel CPU cores each have 6 execution units that can be doing a combination of arithmetic, branch logic, word manipulation and memory loads/stores in parallel. If while doing work the CPU core incurs a cache miss, and has to go to main memory, it will stall for hundreds of cycles until the result of that memory request returns. To try and improve things the CPU will make some speculative guesses as to what a memory request will return to continue processing. If a second miss occurs the CPU will no longer speculate and simply wait for the memory request to return because it cannot typically keep the state for speculative execution beyond 2 cache misses. Managing cache misses is the single largest limitation to scaling the performance of our current generation of CPUs.<br />
<br />
Now what does this have to do with managing contention? Well if two or more threads are using locks to provide mutual exclusion, at best they will be going to the L3 cache, or over a socket interconnect, to access the shared state of the lock using CAS operations. These lock/CAS instructions cost 10s of cycles in the best case when un-contended, plus they cause the CPU's out-of-order execution to be suspended and load/store buffers to be flushed. At worst, collisions occur and the kernel will need to get involved and put one or more of the threads to sleep until the lock is released. This rescheduling of the blocked thread will result in cache pollution. The situation can be even worse when the thread is re-scheduled on another core with a cold cache resulting in many cache misses. <br />
<br />
For highly contended data it is very easy to get into a situation whereby the system spends significantly more time managing contention than doing real work. The table below gives an idea of basic costs for managing contention when the program state is very small and easy to reload from the L2/L3 cache, never mind main memory. <br />
<br />
<div align="center"><table border="1" cellpadding="5"><tbody>
<tr><th>Method</th><th>Time (ms)</th></tr>
<tr><td>One Thread</td><td align="right">300</td></tr>
<tr><td>One Thread with Memory Barrier</td><td align="right">4,700</td></tr>
<tr><td>One Thread with CAS</td><td align="right">5,700</td></tr>
<tr><td>Two Threads with CAS</td><td align="right">18,000</td></tr>
<tr><td>One Thread with Lock</td><td align="right">10,000</td></tr>
<tr><td>Two Threads with Lock</td><td align="right">118,000</td></tr>
</tbody></table></div><br />
This table illustrates the costs of incrementing a 64-bit counter 500 million times using a variety of techniques on a 2.4GHz Westmere processor. I can hear people coming back with “but this is a trivial example and real-world applications are not that contended”. This is true, but remember real-world applications have way more state, and what do you think happens to all that state which is warm in cache when the context switch occurs? By measuring the basic cost of contention it is possible to extrapolate the scalability limits of a system which has contention points. As multi-core becomes ever more significant another approach is required. My last <a href="http://mechanical-sympathy.blogspot.com/2011/09/adventures-with-atomiclong.html">post</a> illustrates the micro level effects of CAS operations on modern CPUs, whereby Sandybridge can be worse for CAS and locks.<br />
<br />
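To make the "One Thread with Memory Barrier" row concrete, a sketch of that case follows; the iteration count matches the table, and incrementing through a volatile field is what forces the store ordering being measured. The rest is an illustrative assumption.<br />
<pre>
public final class VolatileCounter
{
    private static volatile long counter = 0L;
    private static final long ITERATIONS = 500L * 1000L * 1000L;

    public static void main(final String[] args)
    {
        final long start = System.nanoTime();

        for (long i = 0; i < ITERATIONS; i++)
        {
            counter++; // the volatile write is the memory barrier being measured
        }

        System.out.println("duration = " + (System.nanoTime() - start));
    }
}
</pre>
<br />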
<span style="font-size: large;"><b>Single Writer Designs</b></span><br />
<br />
Now, what if you could design a system whereby any item of data, or resource, is only mutated by a single writer/thread? It is actually easier than you think in my experience. It is OK if multiple threads, or other execution contexts, read the same data. CPUs can broadcast read only copies of data to other cores via the cache coherency sub-system. This has a cost but it scales very well.<br />
<br />
If you have a system that can honour this single writer principle then each execution context can spend all its time and resources processing the logic for its purpose, and not be wasting cycles and resource on dealing with the contention problem. You can also scale up without limitation until the hardware is saturated. There is also a really nice benefit when working on architectures, such as x86/x64, that have a hardware <a href="http://en.wikipedia.org/wiki/Memory_model_%28computing%29">memory model</a> in which load/store memory operations have preserved order: <a href="http://mechanical-sympathy.blogspot.com/2011/07/memory-barriersfences.html">memory barriers</a> are not required if you adhere strictly to the single writer principle. On x86/x64 "<i>loads can be re-ordered with older stores</i>" according to the memory model, so memory barriers are required when multiple threads mutate the same data across cores. The single writer principle avoids this issue because it never has to deal with writing the latest version of a data item that may have been written by another thread and is currently in the store buffer of another core.<br />
<br />
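A minimal sketch of the principle: with exactly one writing thread a plain volatile write is sufficient, no CAS or lock is required, and any number of threads may safely read the value.<br />
<pre>
public final class SingleWriterCounter
{
    private volatile long value = 0L;

    // must only ever be called from the single writing thread
    public void increment()
    {
        value = value + 1; // no lost updates are possible with one writer
    }

    // safe for any number of reader threads
    public long get()
    {
        return value;
    }
}
</pre>
<br />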
So how can we drive towards single writer designs? I’ve found it is a very natural thing. Consider how humans, or any other autonomous creatures of nature, operate with their model of the world. We all have our own model of the world contained in our own heads, i.e. we have a copy of the world state for our own use. We mutate the state in our heads based on inputs (events/messages) we receive via our senses. As we process these inputs and apply them to our model we may take action that produces outputs, which others can take as their own inputs. None of us reach directly into each other’s heads and mess with the neurons. If we did this it would be a serious breach of encapsulation! Originally, Object Oriented (OO) design was all about message passing, and somehow along the way we bastardised the message passing to be method calls and even allowed direct field manipulation – Yuk! Whose bright idea was it to allow public access to fields of an object? You deserve your own special hell. <br />
<br />
At university I studied <a href="http://en.wikipedia.org/wiki/Transputer">transputers</a> and interesting languages like <a href="http://en.wikipedia.org/wiki/Occam_%28programming_language%29">Occam</a>. I thought very elegant designs appeared by having the nodes collaborate via message passing rather than mutating shared state. I’m sure some of this has inspired the <a href="http://code.google.com/p/disruptor/">Disruptor</a>. My experience with the Disruptor has shown that it is possible to build systems with one or more orders of magnitude better throughput than locking or contended state based approaches. It also gives much more predictable latency that stays constant until the hardware is saturated rather than the traditional J-curve latency profile.<br />
<br />
It is interesting to see the emergence of numerous approaches that lend themselves to single writer solutions such as Node.js, Erlang, Actor patterns, and SEDA to name a few. Unfortunately most use queue based implementations underneath, which breaks the single writer principle, whereas the Disruptor strives to separate the concerns so that the single writer principle can be preserved for the common cases.<br />
<br />
Now I’m not saying locks and optimistic strategies are bad and should not be used. They are excellent for many problems. For example, bootstrapping a concurrent system or making major state stages in configuration or reference data. However if the main flow of transactions act on contended data, and locks or optimistic strategies have to be employed, then the scalability is fundamentally limited. <br />
<br />
<span style="font-size: large;"><b>The Principle at Scale</b></span><br />
<br />
This principle works at all levels of scale. <a href="http://en.wikipedia.org/wiki/Benoit_Mandelbrot">Mandelbrot</a> got this so right. CPU cores are just nodes of execution and the cache system provides message passing for communication. The same patterns apply if the processing node is a server and the communication system is a local network. If a service, in <a href="http://en.wikipedia.org/wiki/Service-oriented_architecture">SOA</a> architecture parlance, is the only service that can write to its data store it can be made to scale and perform much better. Let’s say that underlying data is stored in a database and other services can go directly to that data, without sending a message to the service that owns the data, then the data is contended and requires the database to manage the contention and coherence of that data. This prevents the service from caching copies of the data for faster response to the clients and restricts how the data can be sharded. Encapsulation has just been broken at a more macro level when multiple different services write to the same data store.<br />
<br />
<span style="font-size: large;"><b>Summary</b></span><br />
<br />
If a system is decomposed into components that keep their own relevant state model, without a central shared model, and all communication is achieved via message passing, then you have a system that is naturally free of contention. This type of system obeys the single writer principle if the message passing sub-system is not implemented as queues. If you cannot move straight to a model like this, but are finding scalability issues related to contention, then start by asking the question, “How do I change this code to preserve the <i>Single Writer Principle</i> and thus avoid the contention?”<br />
<br />
The <i>Single Writer Principle</i> is that for any item of data, or resource, that item of data should be owned by a single execution context for all mutations.Martin Thompsonhttp://www.blogger.com/profile/15893849163924476586noreply@blogger.com61London, UK51.5001524 -0.1262361999999939151.322796399999994 -0.39052969999999393 51.6775084 0.1380573000000061tag:blogger.com,1999:blog-5560209661389175529.post-89849076382355542952011-09-11T12:46:00.009+01:002022-08-17T11:38:47.350+01:00Adventures with AtomicLong<div dir="ltr" style="text-align: left;" trbidi="on">
Sequencing events between threads is a common operation for many multi-threaded algorithms. These sequences could be used for assigning identity to orders, trades, transactions, messages, events, etc. Within the <a href="http://code.google.com/p/disruptor/">Disruptor </a>we use a monotonic sequence for all events which is implemented as <span style="font-family: "Courier New",Courier,monospace;">AtomicLong</span> <a href="http://download.oracle.com/javase/6/docs/api/java/util/concurrent/atomic/AtomicLong.html#incrementAndGet%28%29" style="font-family: "Courier New",Courier,monospace;">incrementAndGet</a> for the <a href="http://code.google.com/p/disruptor/source/browse/trunk/code/src/main/com/lmax/disruptor/ClaimStrategy.java">multi-threaded publishing</a> scenario.<br />
<br />
While working on the latest version of the Disruptor I made some changes which I was convinced would improve performance, however the results surprised me. I had removed some potentially megamorphic method calls and the performance got worse rather than better. After a lot of investigation, I discovered that the megamorphic method calls were hiding a performance issue with the latest Intel <a href="http://en.wikipedia.org/wiki/Sandy_Bridge">Sandybridge</a> processors. With the megamorphic calls out of the way, the contention on the atomic sequence generation increased exposing the issue. I've also observed this performance issue with other Java concurrent structures such as <a href="http://download.oracle.com/javase/6/docs/api/java/util/concurrent/ArrayBlockingQueue.html"><span style="font-family: "Courier New",Courier,monospace;">ArrayBlockingQueue</span></a>.<br />
<br />
I’ve been running various benchmarks on Sandybridge and have so far been impressed with performance improvements over <a href="http://en.wikipedia.org/wiki/Nehalem_%28microarchitecture%29">Nehalem</a>, especially for memory intensive applications due to the changes in its front-end. However with this sequencing benchmark, I discovered that Sandybridge has taken a major step backward in performance with regard to atomic instructions.<br />
<br />
<a href="http://en.wikipedia.org/wiki/Atomic_instruction">Atomic instructions</a> enable read-modify-write actions to be combined into an atomic operation. A good example is incrementing a counter. To complete the increment operation a thread must read the current value, increment it, and then write back the results. In a multi-threaded environment these distinct operations could interleave with other threads doing the same with corrupt results as a consequence. The normal way to avoid this interleaving is to take out a lock for mutual exclusion while performing the steps. Locks are very expensive and often require kernel arbitration between threads. Modern CPUs provide a number of atomic instructions which allow operations such as atomically incrementing a counter, or the ability to conditional set a pointer reference if the value is still as expected. These operations are commonly referred to as <a href="http://en.wikipedia.org/wiki/Compare-and-swap">CAS</a> (Compare And Swap) instructions. A good way to think of these CAS instructions is like optimistic locks, similar to what you experience when using a version control system like Subversion or CVS. You try to make a change and if the version is what you expect then you succeed, otherwise the action aborts.<br />
<br />
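These read-modify-write steps can be made explicit by writing the increment as a CAS loop, which is essentially how AtomicLong.incrementAndGet() was implemented in the JVMs of this period.<br />
<pre>
import java.util.concurrent.atomic.AtomicLong;

public final class CasIncrement
{
    public static long incrementAndGet(final AtomicLong counter)
    {
        long current;
        long next;
        do
        {
            current = counter.get(); // read
            next = current + 1;      // modify
        }
        while (!counter.compareAndSet(current, next)); // write back if still as expected

        return next;
    }
}
</pre>
<br />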
On x86/x64 these instructions are known as “lock” instructions. The "lock" name comes from how a processor, after setting its lock signal, would lock the front-side/memory bus (<a href="http://en.wikipedia.org/wiki/Front-side_bus">FSB</a>) for serialising memory access while the three steps of the operation took place atomically. On more recent processors the lock instruction is simply implemented by getting an exclusive lock on the cache-line for modification.<br />
<br />
These instructions are the basic building blocks used for implementing higher-level locks and semaphores. This is, as will be explained shortly, why I've seen performance issues on Sandybridge for <span style="font-family: "Courier New",Courier,monospace;">ArrayBlockingQueue</span> in some of the Disruptor comparative <a href="http://code.google.com/p/disruptor/source/browse/#svn%2Ftrunk%2Fcode%2Fsrc%2Fperf%2Fcom%2Flmax%2Fdisruptor">performance tests</a>.<br />
<br />
Back to my benchmark. The test was spending significantly more time in AtomicLong.incrementAndGet() than I had previously observed. Initially, I suspected an issue with JDK 1.6.0_27 which I had just installed. I ran the following test with various JVMs, including 1.7.0, and kept getting the same results. I then booted different operating systems (Ubuntu, Fedora, Windows 7 - all 64-bit), again the same results. This led me to write an isolated test which I ran on Nehalem (2.8 GHz Core i7 860) and Sandybridge (2.2GHz Core i7-2720QM).<br />
<br />
<pre>import java.util.concurrent.atomic.AtomicLong;

public final class TestAtomicIncrement
    implements Runnable
{
    public static final long COUNT = 500L * 1000L * 1000L;
    public static final AtomicLong counter = new AtomicLong(0L);

    public static void main(final String[] args) throws Exception
    {
        final int numThreads = Integer.parseInt(args[0]);
        final long start = System.nanoTime();
        runTest(numThreads);
        System.out.println("duration = " + (System.nanoTime() - start));
        System.out.println("counter = " + counter);
    }

    private static void runTest(final int numThreads)
        throws InterruptedException
    {
        Thread[] threads = new Thread[numThreads];
        for (int i = 0; i < threads.length; i++)
        {
            threads[i] = new Thread(new TestAtomicIncrement());
        }

        for (Thread t : threads)
        {
            t.start();
        }

        for (Thread t : threads)
        {
            t.join();
        }
    }

    public void run()
    {
        long i = 0L;
        while (i < COUNT)
        {
            i = counter.incrementAndGet();
        }
    }
}
</pre>
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhcQXA365uiNPWYXCQLq5iXHEqUEc-esyKt8VTTLz9s7YghFvWfFtNwjhfoly4u69HEorLdQWmR7xuZQ2EdOfdApcNVW-id2SzOjZT_4_SaTzb_aJVjIkAqg6cO7zGG4sJkS_bYQGaQUKY/s1600/JavaCAS.png" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhcQXA365uiNPWYXCQLq5iXHEqUEc-esyKt8VTTLz9s7YghFvWfFtNwjhfoly4u69HEorLdQWmR7xuZQ2EdOfdApcNVW-id2SzOjZT_4_SaTzb_aJVjIkAqg6cO7zGG4sJkS_bYQGaQUKY/s1600/JavaCAS.png" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Figure 1.</td></tr>
</tbody></table>
<br />
After running this test on 4 different Sandybridge processors with a range of clock speeds, I concluded that using LOCK CMPXCHG, under contention with increasing numbers of threads, is much less scalable than the previous Nehalem generation of processors. Figure 1. above charts the results in nanoseconds duration to complete 500 million increments of a counter with increasing thread count. Less is better.<br />
<br />
I confirmed the JVM was generating the correct instructions for the CAS loop by getting Hotspot to print the assembler it generated. I also confirmed that Hotspot generated identical assembler instructions for both Nehalem and Sandybridge.<br />
<br />
I then decided to investigate further and write the following C++ program to test the relevant lock instructions to compare Nehalem and Sandybridge. I know from using “<span style="font-family: "Courier New",Courier,monospace;">objdump -d</span>” on the binary that the <a href="http://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Atomic-Builtins.html">GNU Atomic Builtins</a> generate the lock instructions for ADD, XADD, and CMPXCHG, for the respectively named functions below. <br />
<pre>#include <time.h>
#include <pthread.h>
#include <stdlib.h>
#include <iostream>

typedef unsigned long long uint64;
const uint64 COUNT = 500LL * 1000LL * 1000LL;
volatile uint64 counter = 0;

void* run_add(void* numThreads)
{
    register uint64 value = (COUNT / *((int*)numThreads)) + 1;
    while (--value != 0)
    {
        __sync_add_and_fetch(&counter, 1);
    }

    return NULL;
}

void* run_xadd(void*)
{
    register uint64 value = counter;
    while (value < COUNT)
    {
        value = __sync_add_and_fetch(&counter, 1);
    }

    return NULL;
}

void* run_cas(void*)
{
    register uint64 value = 0;
    while (value < COUNT)
    {
        do
        {
            value = counter;
        }
        while (!__sync_bool_compare_and_swap(&counter, value, value + 1));
    }

    return NULL;
}

int main (int argc, char* argv[])
{
    const int NUM_THREADS = atoi(argv[1]);

    pthread_t threads[NUM_THREADS];
    void* status;
    timespec ts_start;
    timespec ts_finish;
    clock_gettime(CLOCK_MONOTONIC, &ts_start);

    for (int i = 0; i < NUM_THREADS; i++)
    {
        // swap run_add for run_xadd or run_cas to test the other instructions
        pthread_create(&threads[i], NULL, run_add, (void*)&NUM_THREADS);
    }

    for (int i = 0; i < NUM_THREADS; i++)
    {
        pthread_join(threads[i], &status);
    }

    clock_gettime(CLOCK_MONOTONIC, &ts_finish);

    uint64 start = (ts_start.tv_sec * 1000000000LL) + ts_start.tv_nsec;
    uint64 finish = (ts_finish.tv_sec * 1000000000LL) + ts_finish.tv_nsec;
    uint64 duration = finish - start;

    std::cout << "threads = " << NUM_THREADS << std::endl;
    std::cout << "duration = " << duration << std::endl;
    std::cout << "counter = " << counter << std::endl;

    return 0;
}
</pre>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEixnFqFZpl7cCziuEC-lM53XBVolvheLuShjHvnCcS5mPMVhyphenhyphenWWj-fNdDNrN_dTa4zC1NmzOneAucfsvCR11Wzrza7J8d3P3n89YDfEnsxOXnNXnvowUOik4TuM8AiF3UaUhFttIOy7sXM/s1600/atomics.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEixnFqFZpl7cCziuEC-lM53XBVolvheLuShjHvnCcS5mPMVhyphenhyphenWWj-fNdDNrN_dTa4zC1NmzOneAucfsvCR11Wzrza7J8d3P3n89YDfEnsxOXnNXnvowUOik4TuM8AiF3UaUhFttIOy7sXM/s1600/atomics.png" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Figure 2.</td></tr>
</tbody></table>
<br />
It is clear from Figure 2. that Nehalem performs nearly an order of magnitude better for atomic operations as contention increases with threads. I found LOCK ADD and LOCK XADD to be similar so I've only charted XADD for clarity. The CAS operations for C++ and Java are comparable.<br />
<br />
It is also very interesting how XADD greatly outperforms CAS and gives a nice scalable profile. For 3 threads and above, XADD does not degrade further and simply performs at the rate at which the processor can keep the caches coherent. Nehalem and Sandybridge level out respectively at ~100m and ~20m XADD operations per second for 3+ concurrent threads, whereas CAS continues to degrade with increasing thread count because of contention. Naturally, performance degrades when <a href="http://en.wikipedia.org/wiki/Intel_QuickPath_Interconnect">QPI</a> links are involved for a multi-socket scenario. Oracle have now accepted that not supporting XADD is a <a href="http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7023898">bug</a> and will hopefully fix it soon for the JVM. <br />
<br />
As to the performance I’ve observed with Sandybridge, it would be great if others could confirm my findings so we can all feed back to Intel and have this addressed. I've not been able to get my hands on a server class system with Sandybridge. I can confirm that for the "tick" to Westmere, the performance is similar to Nehalem and not an issue. The "tock" to Sandybridge seems to introduce the issue.<br />
<br />
<b>Update:</b> After discussions with Intel I wrote the following <a href="http://mechanical-sympathy.blogspot.com/2013/01/further-adventures-with-cas.html">blog entry</a>.</div>
tag:blogger.com,1999:blog-5560209661389175529.post-23107215689577268962011-09-02T19:23:00.004+01:002014-09-13T22:01:06.391+01:00Modelling Is Everything<div dir="ltr" style="text-align: left;" trbidi="on">
I’m often asked, “What is the best way to learn about building high-performance systems?” There are many perfectly valid answers to this question, but one thing stands out for me above everything else, and that is modelling. Modelling what you need to implement is the most important and effective step in the process. I’d go further and say this principle applies to any development, and the rest is just typing :-)<br />
<br />
<a href="http://en.wikipedia.org/wiki/Domain-driven_design">Domain Driven Design</a> (DDD) advocates modelling the domain and expressing this model in code as fundamental to the successful delivery and ongoing maintenance of software. I wholeheartedly agree with this. How often do we see code that is an approximation of the problem domain? Code that exhibits behaviour which approximates to what is required via inappropriate abstractions and mappings which just about cope. Those mappings between what is in the code and the real domain are only contained in the developers’ heads and this is just not good enough.<br />
<br />
When high performance is required, the code for parts of the system often has to model what is happening with the CPU, memory, storage sub-systems, or network sub-systems. When we have imperfect abstractions on top of these domains, performance can be very adversely affected. The goal of my “<a href="http://mechanical-sympathy.blogspot.com/">Mechanical Sympathy</a>” blog is to peek at what is under the hood so we can improve our abstractions.<br />
<br />
<span style="font-size: large;"><b>What is a Model?</b></span><br />
<br />
A model does not need to be the result of a 3-year exercise producing UML. It can be, and often is at its best, people communicating via various means, including speech, drawings, illustrations, metaphors, and analogies, to build a mental model for shared understanding. If an accurate and distilled understanding can be reached, then this model can be turned into code with great results.<br />
<br />
<span style="font-size: large;"><b>Infrastructure Domain Models</b></span><br />
<br />
If developers writing a concurrent framework do not have a good model of how a typical cache sub-system works, i.e. it uses message passing to exchange cache lines, then the framework is unlikely to perform well or be correct. If their code drives the cache sub-system with mechanical sympathy and understanding, it is less likely to have bugs and more likely to perform well.<br />
<br />
It is much easier to predict performance from a sound model when coming from an understanding of the infrastructure for the underlying platform and its published abilities. For example, if you know how many packets per second a network sub-system can handle, and the size of its transfer unit, then it is easy to extrapolate expected bandwidth: a sub-system handling 800,000 packets per second with a 1,500 byte transfer unit can move at most ~1.2 GB/s. With this model based understanding we can test our code against expectations with confidence.<br />
<br />
I’ve fixed many performance issues whereby a framework treated a storage sub-system as stream-based when it is really a block-based model. If you update part of a file on disk, the block to be updated must be read, the changes applied, and the results written back. Now if you know the system is block based and the boundaries of the blocks, you can write whole blocks back without incurring the read, modify, write back cycle replacing these actions with a single write. This applies even when appending to a file as the last block is likely to have been partially written previously.<br />
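<br />
A minimal sketch of the whole-block write in Java follows; the 4KB block size and the class itself are assumptions for illustration only:<br />
<pre>
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

public final class BlockWriter
{
    // Assumed block size for the underlying storage sub-system
    private static final int BLOCK_SIZE = 4096;

    // Writing a whole block at a block-aligned position completes with a
    // single write; there is no need to read the old block, apply the
    // change, and write the result back.
    public static void writeBlock(final FileChannel channel,
                                  final long blockIndex,
                                  final ByteBuffer block)
        throws IOException
    {
        block.clear(); // position = 0, limit = capacity = BLOCK_SIZE
        channel.write(block, blockIndex * BLOCK_SIZE);
    }
}
</pre>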
<br />
<span style="font-size: large;"><b>Business Domain Models</b></span><br />
<br />
The same thinking should be applied to the models we construct for the business domain. If a business process is modelled accurately, then the software will not surprise its end users. When we draw up a model it is important to describe the relationships for cardinality and the characteristics by which they will be traversed. This understanding will guide the selection of data structures to those best suited for implementing the relationships. I often see people use a list for a relationship which is mostly searched by key, for this case a map could be more appropriate. Are the entities at the other end of a relationship ordered? A tree or skiplist implementation may be a better option.<br />
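<br />
As a hypothetical illustration of matching the structure to the traversal, with the Customer types invented for the example:<br />
<pre>
import java.util.HashMap;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

public final class CustomerRelationships
{
    // Relationship mostly searched by key: a map gives direct access
    private final Map<String, Customer> customersById =
        new HashMap<String, Customer>();

    // Relationship traversed in key order: a sorted structure fits better
    private final NavigableMap<String, Customer> customersInIdOrder =
        new TreeMap<String, Customer>();

    public Customer findById(final String customerId)
    {
        return customersById.get(customerId); // direct lookup, no list scan
    }

    public static final class Customer
    {
    }
}
</pre>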
<br />
<span style="font-size: large;"><b>Identity</b></span><br />
<br />
Identity of entities in a model is so important. All models have to be entered in some way, and this normally starts with an entity from which to walk. That entity could be “Customer” by customer ID but could equally be “DiskBlock” by filename and offset in an infrastructure domain. The identity of each entity in the system needs to be clear so the model can be accessed efficiently. If for each interaction with a model we waste precious cycles trying to find our entity as a starting point, then other optimisations can become almost irrelevant. Make identity explicit in your model and, if necessary, index entities by their identity so you can efficiently enter the model for each interaction.<br />
<br />
<span style="font-size: large;"><b>Refine as we learn</b></span><br />
<br />
It is also important to keep refining a model as we learn. If the model grows as a series of extensions, without refining and distilling, then we end up with a spaghetti mess that is very difficult to manage when trying to achieve predictable performance, never mind how difficult it is to maintain and support. Every day we learn new things. Reflect this in the model and keep it up to date.<br />
<br />
<span style="font-size: large;"><b>Implement no more, but also no less, than what is needed!</b></span><br />
<br />
The fastest code is code that does just what is needed and no more: it performs the instructions to complete the task, and nothing else. Really fast code is normally not a weird mess of bit-shifting and compiler tricks. It is best to start with something clean and elegant, then measure to see if you are within performance targets. So often this will be sufficient. Sometimes performance will be a surprise. You then need to apply science to test and measure before jumping to conclusions. A profiler will often tell you where the time is being taken. Once the basic modelling mistakes and assumptions have been corrected, it usually takes just a little <a href="http://mechanical-sympathy.blogspot.com/2011/07/why-mechanical-sympathy.html">mechanical sympathy</a> to reach the performance goal. Unused code is waste. Try not to create it. If you happen to create some, then remove it from your codebase as soon as you notice it.<br />
<br />
<span style="font-size: large;"><b>Conclusion</b></span><br />
<br />
When cross-functional requirements, such as performance and availability, are critical to success, I’ve found the most important thing is to get the model correct for the domain at all levels. That is, take the principles of DDD and make sure your code is an appropriate reflection of each domain. Be that the domain of business applications, or the domain of interactions with infrastructure, I’ve found modelling is everything.</div>
tag:blogger.com,1999:blog-5560209661389175529.post-67821736804295545852011-08-27T09:49:00.021+01:002011-09-16T15:10:01.396+01:00Disruptor 2.0 Released<br />
Significantly improved performance and a cleaner API are the key takeaways for the <a href="http://code.google.com/p/disruptor/">Disruptor 2.0</a> concurrent programming framework for Java. This release is the result of all the great feedback we have received from the community. Feedback is very welcome and really improves the end product so please keep it coming.<br />
<br />
You can find the Disruptor project <a href="http://code.google.com/p/disruptor/">here</a>, plus we have a <a href="http://code.google.com/p/disruptor/w/list">wiki</a> with links to detailed blogs describing how things work.<br />
<br />
<span style="font-size: large;"><b>Naming & API</b></span><br />
<br />
Over the lifetime of the Disruptor, naming has been a challenge. The funny thing is that with the 2.0 release we have come almost full circle. Originally we considered the <a href="http://code.google.com/p/disruptor/">Disruptor</a> as an event processing framework that often got used as a queue replacement. To make it understandable to queue users we adopted the nomenclature of producers and consumers. However, the consumers are not true consumers. With this release the consensus is to return to the event processing roots and adopt the following naming changes.<br />
<br />
<b>Producer -> Publisher</b><br />
Events are claimed in strict sequence and published to the <a href="http://code.google.com/p/disruptor/source/browse/trunk/code/src/main/com/lmax/disruptor/RingBuffer.java"><span style="font-family: 'Courier New',Courier,monospace;">RingBuffer</span></a>.<br />
<br />
<b>Entry -> Event</b><br />
Events represent the currency of data exchange through the dependency graph of <a href="http://code.google.com/p/disruptor/source/browse/trunk/code/src/main/com/lmax/disruptor/EventProcessor.java"><span style="font-family: 'Courier New',Courier,monospace;">EventProcessor</span></a>s.<br />
<br />
<b>Consumer -> EventProcessor</b><br />
Events are processed by <a href="http://code.google.com/p/disruptor/source/browse/trunk/code/src/main/com/lmax/disruptor/EventProcessor.java"><span style="font-family: 'Courier New',Courier,monospace;">EventProcessor</span></a>s. The processing of an event can be read only, but can also involve mutations on which other <a href="http://code.google.com/p/disruptor/source/browse/trunk/code/src/main/com/lmax/disruptor/EventProcessor.java"><span style="font-family: 'Courier New',Courier,monospace;">EventProcessor</span></a>s depend.<br />
<br />
<b>ConsumerBarrier -> DependencyBarrier</b><br />
Complex graphs of dependent <a href="http://code.google.com/p/disruptor/source/browse/trunk/code/src/main/com/lmax/disruptor/EventProcessor.java"><span style="font-family: 'Courier New',Courier,monospace;">EventProcessor</span></a>s can be constructed for the processing of an Event. The <a href="http://code.google.com/p/disruptor/source/browse/trunk/code/src/main/com/lmax/disruptor/DependencyBarrier.java"><span style="font-family: 'Courier New',Courier,monospace;">DependencyBarrier</span></a>s are assembled to represent the dependency graph. This topic is where the real value of the Disruptor lies, and it is often misunderstood. A fun example can be seen playing <a href="http://code.google.com/p/disruptor/source/browse/trunk/code/src/perf/com/lmax/disruptor/OnePublisherToThreeProcessorDiamondThroughputTest.java">FizzBuzz</a> in our <a href="http://code.google.com/p/disruptor/source/browse/#svn%2Ftrunk%2Fcode%2Fsrc%2Fperf%2Fcom%2Flmax%2Fdisruptor">performance tests</a>.<br />
<br />
The <span style="font-family: 'Courier New',Courier,monospace;">ProducerBarrier</span> was always a one-to-one relationship with the <a href="http://code.google.com/p/disruptor/source/browse/trunk/code/src/main/com/lmax/disruptor/RingBuffer.java"><span style="font-family: 'Courier New',Courier,monospace;">RingBuffer</span></a> so for ease of use its behaviour has been merged into the <a href="http://code.google.com/p/disruptor/source/browse/trunk/code/src/main/com/lmax/disruptor/RingBuffer.java"><span style="font-family: 'Courier New',Courier,monospace;">RingBuffer</span></a>. This allows direct publishing into the <a href="http://code.google.com/p/disruptor/source/browse/trunk/code/src/main/com/lmax/disruptor/RingBuffer.java"><span style="font-family: 'Courier New',Courier,monospace;">RingBuffer</span></a>.<br />
<br />
<span style="font-size: large;"><b>DSL Wizard</b></span><br />
<br />
The most complex part of using the Disruptor is the setting up of the dependency graph of <a href="http://code.google.com/p/disruptor/source/browse/trunk/code/src/main/com/lmax/disruptor/EventProcessor.java"><span style="font-family: 'Courier New',Courier,monospace;">EventProcessor</span></a>s. To simplify this for the most common cases we have integrated the <a href="http://www.symphonious.net/2011/07/11/lmax-disruptor-high-performance-low-latency-and-simple-too/">DisruptorWizard</a> project which provides a <a href="http://en.wikipedia.org/wiki/Domain-specific_language">DSL</a> as a fluent API for assembling the graph and assigning threads.<br />
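<br />
For example, a graph where two processors handle each event in parallel and a third depends on them both reads roughly as follows. This is an illustrative sketch of the fluent style; the handler names are invented and the exact wizard API details may differ:<br />
<pre>
// handler1 and handler2 process each event in parallel;
// handler3 only sees an event once both have finished with it
disruptorWizard.handleEventsWith(handler1, handler2)
               .then(handler3);
</pre>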
<br />
<span style="font-size: large;"><b>Performance</b></span><br />
<br />
Significant performance tuning effort has gone into this release. This effort has resulted in a ~2-3X improvement in throughput depending on CPU architecture. For most use cases it is now an order of magnitude better than queue based approaches. On <a href="http://en.wikipedia.org/wiki/Sandy_Bridge">Sandybridge</a> processors I've seen over 50 million events processed per second.<br />
<br />
Sequence tracking has been completely rewritten to reduce the usage of hardware <a href="http://mechanical-sympathy.blogspot.com/2011/07/memory-barriersfences.html">memory barriers</a>, indirection layers, and megamorphic method calls resulting in a much more data and instruction cache friendly design. New techniques have been employed to prevent <a href="http://mechanical-sympathy.blogspot.com/2011/08/false-sharing-java-7.html">false sharing</a> because the previous ones got optimised out by the Oracle Java 7 JVM.<br />
<br />
The one area not seeing a significant performance increase is the <a href="http://code.google.com/p/disruptor/source/browse/trunk/code/src/perf/com/lmax/disruptor/ThreePublisherToOneProcessorSequencedThroughputTest.java">sequencer</a> pattern. The Disruptor is still much faster than queue based approaches for this pattern, but a limitation of Java hits us hard here. Java on x86/x64 is using LOCK CMPXCHG for CAS operations to implement the <a href="http://download.oracle.com/javase/6/docs/api/java/util/concurrent/atomic/AtomicLong.html"><span style="font-family: 'Courier New',Courier,monospace;">AtomicLong</span></a> <a href="http://download.oracle.com/javase/6/docs/api/java/util/concurrent/atomic/AtomicLong.html#incrementAndGet%28%29" style="font-family: "Courier New",Courier,monospace;">incrementAndGet()</a> method, which, based on my measurements, is ~2-10X slower than using LOCK XADD as contention increases. Hopefully Oracle will see the error of Sun's ways on this and embrace x86/x64 to take advantage of such instructions. Dave Dice at Oracle has <a href="http://blogs.oracle.com/dave/entry/atomic_fetch_and_add_vs">blogged</a> on the subject so I live in hope.<br />
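<br />
To see where that cost comes from, the CAS retry loop at the heart of <span style="font-family: 'Courier New',Courier,monospace;">incrementAndGet()</span> is essentially the following, paraphrased from the JDK of the time:<br />
<pre>
public final long incrementAndGet()
{
    for (;;)
    {
        final long current = get();
        final long next = current + 1;

        // LOCK CMPXCHG can fail under contention and must retry,
        // whereas LOCK XADD completes in a single instruction
        if (compareAndSet(current, next))
        {
            return next;
        }
    }
}
</pre>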
<br />
<span style="font-size: large;"><b>Memory Barriers</b></span><br />
<br />
Of special note for this release is the elimination of hardware memory barriers on x86/x64 for <a href="http://code.google.com/p/disruptor/source/browse/trunk/code/src/main/com/lmax/disruptor/Sequence.java">Sequence</a> tracking. The beauty in the Disruptor design is that on CPU architectures that have a memory model <span style="font-size: xx-small;">[1]</span> whereby:<br />
<br />
<ul><li>“<i>loads are not reordered with older loads</i>”, and</li>
<li>“<i>stores are not reordered with older stores</i>”;</li>
</ul><br />
it is then possible to take advantage of the semantics provided by <a href="http://download.oracle.com/javase/6/docs/api/java/util/concurrent/atomic/AtomicLong.html" style="font-family: "Courier New",Courier,monospace;">AtomicLong</a> to avoid the use of the Java <span style="font-family: 'Courier New',Courier,monospace;">volatile</span> keyword, and thus hardware fences on x86/x64. The one sticky rule for concurrent algorithms, such as Dekker <span style="font-size: x-small;">[2]</span> and Peterson <span style="font-size: x-small;">[3]</span> locks, on x86/x64 is “<i>loads can be re-ordered with older stores</i>”. This is not an issue given the design of the Disruptor. The issue relates to the snooping of CPU local store buffers for older writes. I’m likely to blog in more detail about why this is the case at a later date. The code should be safe on other CPU architectures if the JVM implementers get the semantics of <a href="http://download.oracle.com/javase/6/docs/api/java/util/concurrent/atomic/AtomicLong.html"><span style="font-family: 'Courier New',Courier,monospace;">AtomicLong</span></a> and <a href="http://www.java2s.com/Open-Source/Java-Document/Apache-Harmony-Java-SE/com-sun-package/sun/misc/Unsafe.java.java-doc.htm"><span style="font-family: 'Courier New',Courier,monospace;">Unsafe</span></a> correct, however your mileage may vary for performance on other architectures compared to x64.<br />
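<br />
One way to realise this in code is via the ordered-write support on <span style="font-family: 'Courier New',Courier,monospace;">AtomicLong</span>. The class below is a sketch of the shape of the trick, not the actual Disruptor source:<br />
<pre>
import java.util.concurrent.atomic.AtomicLong;

public final class OrderedCounter
{
    private final AtomicLong value = new AtomicLong(-1L);

    public long get()
    {
        return value.get(); // volatile load: no hardware fence needed on x86/x64
    }

    public void set(final long sequence)
    {
        // Ordered store: guarantees stores are not reordered with older
        // stores, without the LOCK-prefixed instruction a volatile store
        // would incur on x86/x64
        value.lazySet(sequence);
    }
}
</pre>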
<br />
<span style="font-size: large;"><b>Roadmap</b></span><br />
<br />
With this latest release it is becoming increasingly obvious how sensitive some CPU architectures are to processor affinity for threads. When an <a href="http://code.google.com/p/disruptor/source/browse/trunk/code/src/main/com/lmax/disruptor/EventProcessor.java"><span style="font-family: 'Courier New',Courier,monospace;">EventProcessor</span></a> gets rescheduled on a different core, after its time-slice is exhausted or it yields, the resulting cache pollution really hits performance. For those who require more extreme and predictable performance I plan to release an <a href="http://download.oracle.com/javase/6/docs/api/java/util/concurrent/Executor.html" style="font-family: "Courier New",Courier,monospace;">Executor</a> service with the Disruptor to allow the pinning of threads to CPU cores.<br />
<br />
I'm also thinking of adding a progressive back off strategy for waiting <a href="http://code.google.com/p/disruptor/source/browse/trunk/code/src/main/com/lmax/disruptor/EventProcessor.java"><span style="font-family: 'Courier New',Courier,monospace;">EventProcessor</span></a>s as a <a href="http://code.google.com/p/disruptor/source/browse/trunk/code/src/main/com/lmax/disruptor/WaitStrategy.java"><span style="font-family: 'Courier New',Courier,monospace;">WaitStrategy</span></a>. This strategy would first busy spin, then yield, then eventually sleep in millisecond periods to conserve CPU resource for those applications that burst for a while then go quiet.<br />
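<br />
A rough sketch of such a strategy follows; the thresholds and the class are invented for illustration:<br />
<pre>
import java.util.concurrent.locks.LockSupport;

public final class ProgressiveBackOff
{
    private static final int SPIN_TRIES = 100;
    private static final int YIELD_TRIES = 200;

    // Busy spin first, then yield, then sleep in millisecond periods
    // to conserve CPU resource for applications that burst then go quiet.
    public static int backOff(final int retries)
    {
        if (retries < SPIN_TRIES)
        {
            // busy spin
        }
        else if (retries < YIELD_TRIES)
        {
            Thread.yield();
        }
        else
        {
            LockSupport.parkNanos(1000L * 1000L); // ~1ms
        }

        return retries + 1;
    }
}
</pre>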
<br />
<ol><li>Memory Model: See Section 8.2 of http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-vol-3a-part-1-manual.html</li>
<li>Dekker algorithm: http://en.wikipedia.org/wiki/Dekker%27s_algorithm</li>
<li>Peterson Algorithm: http://en.wikipedia.org/wiki/Peterson%27s_algorithm</li>
</ol>tag:blogger.com,1999:blog-5560209661389175529.post-64537975785396492652011-08-20T17:02:00.006+01:002011-08-20T18:41:08.247+01:00Code Refurbishment<br />
Within our industry we use a huge range of terminology. Unfortunately, we don’t all agree on what individual terms actually mean. I so often hear people misuse the term “<a href="http://en.wikipedia.org/wiki/Code_refactoring">Refactoring</a>”, which has come to make the business in many organisations recoil in fear. The reason for this fear, I’ve observed, is what people often mean when misusing this term.<br />
<br />
I feel we are holding back our industry by not being disciplined in our use of terminology. If one chemist said to another chemist “we are about to perform <a href="http://en.wikipedia.org/wiki/Titration">titration</a>”, both would have a good idea what is involved. I believe computing is still a very immature science. As our subject matures hopefully we will become more precise and disciplined in our use of terminology and thus make our communication more accurate and effective.<br />
<br />
Refactoring is a very useful technique for improving code quality and clarity. To be precise it is a behaviour preserving change that improves a code base for future maintenance and understanding. A good example would be extracting a method to remove code duplication and applying this method at every site of the duplication, thus removing the duplication. Refactoring was first discussed in the early 1990s and became mainstream after Martin Fowler’s excellent “<a href="http://www.amazon.co.uk/Refactoring-Improving-Design-Existing-Technology/dp/0201485672/">Refactoring</a>” book in 1999.<br />
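<br />
For instance, a classic extract-method refactoring in Java; the example is invented, and the externally observable behaviour is identical before and after:<br />
<pre>
// Before: the 10% discount calculation is duplicated at each site
final double invoiceTotal = invoiceSubTotal - (invoiceSubTotal * 0.1d);
final double quoteTotal = quoteSubTotal - (quoteSubTotal * 0.1d);

// After: the duplication is extracted into one well named method
final double invoiceTotal = applyStandardDiscount(invoiceSubTotal);
final double quoteTotal = applyStandardDiscount(quoteSubTotal);

private static double applyStandardDiscount(final double amount)
{
    return amount - (amount * 0.1d);
}
</pre>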
<br />
Refactoring involves making a number of small internal changes to the code structure. These changes will typically not have any external impact. Well written unit tests that just assert externally observable behaviour will not change when code is refactored. If the external behaviour of code is changing when the structure is being changed then this is not refactoring.<br />
<br />
Now, why do our business folk recoil in fear when this simple and useful technique of “refactoring” is mentioned? I believe this is because developers are actually talking about a much more extensive structural redevelopment technique that does not have a common term. These structural changes are often not a complete ground-up rewrite because much of the existing code will be reused. The reason the business folk have come to recoil is that they fear we are about to head off into uncharted waters with no idea of how long things will take and if any value will come out of the exercise. <br />
<br />
This example of significant structural change reminds me of when a bar or restaurant gets taken over by new management. The new management often undertake a refurbishment exercise to make the place more appealing and suitable for the customers they are targeting. A lot of the building will be preserved and reused thus greatly reducing the costs of a complete rebuild. In my experience when developers use the term “refactoring” what they really mean is that some module, or <a href="http://domaindrivendesign.org/node/91">bounded context</a>, in a code base is about to undergo significant refurbishment. If we define this term, and agree the goal and value to the business, we may be able to better plan and manage our projects.<br />
<br />
These code refurbishment exercises should have clear goals defined at the outset and all change must be tested against these goals. For example, we may have discovered that code is not a true reflection of the business domain after new insights. These insights may have been gleaned over a period of time and the code has grown out of step to become an approximation of what the business requires. While performing <a href="http://domaindrivendesign.org/resources/what_is_ddd">Domain Driven Design</a> the penny may drop with the essence of the business model becoming clear. After this clarity of understanding the code may need a major overhaul to align it with this new understanding of the business. Code can also drift from being a distilled model of the business domain if quick hacks are put in place to meet a deadline. Over time these hacks can build on each other until the model no longer describes the business, it just about makes itself useful by side effect. During this exercise our tests are likely to see significant change as we tighten up the specification for our new improved understanding of the business domain.<br />
<br />
A code refurbishment is worthwhile to correct the core domain if it's about to undergo significant further development, or if a module is business critical and needs to be occasionally corrected under production pressure to preserve revenue generation.<br />
<br />
I’m interested to know if other folk have observed similar developments and if you think refinement of this concept would be valuable?
tag:blogger.com,1999:blog-5560209661389175529.post-56836415133598002152011-08-13T10:07:00.008+01:002022-08-17T11:39:23.131+01:00False Sharing && Java 7<br />
In my previous post on <a href="http://mechanical-sympathy.blogspot.com/2011/07/false-sharing.html">False Sharing</a> I suggested it can be avoided by padding the cache line with unused <span style="font-family: "Courier New",Courier,monospace;">long</span> fields. It seems Java 7 got clever and eliminated or re-ordered the unused fields, thus re-introducing false sharing. I've experimented with a number of techniques on different platforms and found the following code to be the most reliable.<br />
<pre>
import java.util.concurrent.atomic.AtomicLong;

public final class FalseSharing
    implements Runnable
{
    public final static int NUM_THREADS = 4; // change to match available cores
    public final static long ITERATIONS = 500L * 1000L * 1000L;

    private final int arrayIndex;

    private static PaddedAtomicLong[] longs = new PaddedAtomicLong[NUM_THREADS];
    static
    {
        for (int i = 0; i < longs.length; i++)
        {
            longs[i] = new PaddedAtomicLong();
        }
    }

    public FalseSharing(final int arrayIndex)
    {
        this.arrayIndex = arrayIndex;
    }

    public static void main(final String[] args) throws Exception
    {
        final long start = System.nanoTime();
        runTest();
        System.out.println("duration = " + (System.nanoTime() - start));
    }

    private static void runTest() throws InterruptedException
    {
        Thread[] threads = new Thread[NUM_THREADS];

        for (int i = 0; i < threads.length; i++)
        {
            threads[i] = new Thread(new FalseSharing(i));
        }

        for (Thread t : threads)
        {
            t.start();
        }

        for (Thread t : threads)
        {
            t.join();
        }
    }

    public void run()
    {
        long i = ITERATIONS + 1;
        while (0 != --i)
        {
            longs[arrayIndex].set(i);
        }
    }

    public static long sumPaddingToPreventOptimisation(final int index)
    {
        PaddedAtomicLong v = longs[index];
        return v.p1 + v.p2 + v.p3 + v.p4 + v.p5 + v.p6;
    }

    // AtomicLong holds the hot value; the public longs pad out the rest of
    // the cache line so that adjacent array elements do not share a line
    public static class PaddedAtomicLong extends AtomicLong
    {
        public volatile long p1, p2, p3, p4, p5, p6 = 7L;
    }
}
</pre>
<br />
With this code I get similar performance results to those stated in the previous <a href="http://mechanical-sympathy.blogspot.com/2011/07/false-sharing.html">False Sharing</a> article. The padding in <span style="font-family: "Courier New",Courier,monospace;">PaddedAtomicLong</span> above can be commented out to see the false sharing effect.<br />
<br />
I think we should all lobby the powers that be inside Oracle to have intrinsics added to the language so we can have cache line aligned and padded atomic classes. This and some other low-level changes would help make Java a real concurrent programming language. We keep hearing them say multi-core is coming. I say it is here and Java needs to catch up.
tag:blogger.com,1999:blog-5560209661389175529.post-61999203312812405622011-08-09T20:37:00.017+01:002022-08-17T11:39:47.591+01:00Inter Thread Latency<br />
Message rates between threads are fundamentally determined by the latency of memory exchange between CPU cores. The minimum unit of transfer will be a cache line exchanged via shared caches or socket interconnects. In a previous article I explained <a href="http://mechanical-sympathy.blogspot.com/2011/07/memory-barriersfences.html">Memory Barriers</a> and why they are important to concurrent programming between threads. These are the instructions that cause a CPU to make memory visible to other cores in an ordered and timely manner.<br />
<br />
Lately I’ve been asked a lot about how much faster the <a href="http://code.google.com/p/disruptor/">Disruptor</a> would be if C++ was used instead of Java. For sure C++ would give more control for memory alignment and potential access to underlying CPU instructions such as memory barriers and lock instructions. In this article I’ll directly compare C++ and Java to measure the cost of signalling a change between threads.<br />
<br />
For the test we'll use two counters each updated by their own thread. A simple ping-pong algorithm will be used to signal from one to the other and back again. The exchange will be repeated millions of times to measure the average latency between cores. This measurement will give us the latency of exchanging a cache line between cores in a serial manner.<br />
<br />
For Java we’ll use volatile counters, for which the JVM will kindly insert a lock instruction on update, giving us an effective memory barrier.<br />
<pre>
public final class InterThreadLatency
    implements Runnable
{
    public static final long ITERATIONS = 500L * 1000L * 1000L;

    public static volatile long s1;
    public static volatile long s2;

    public static void main(final String[] args)
    {
        Thread t = new Thread(new InterThreadLatency());
        t.setDaemon(true);
        t.start();

        long start = System.nanoTime();

        long value = s1;
        while (s1 < ITERATIONS)
        {
            while (s2 != value)
            {
                // busy spin until the other thread echoes the value
            }
            value = ++s1;
        }

        long duration = System.nanoTime() - start;

        System.out.println("duration = " + duration);
        System.out.println("ns per op = " + duration / (ITERATIONS * 2));
        System.out.println("op/sec = " +
            (ITERATIONS * 2L * 1000L * 1000L * 1000L) / duration);
        System.out.println("s1 = " + s1 + ", s2 = " + s2);
    }

    public void run()
    {
        long value = s2;
        while (true)
        {
            while (value == s1)
            {
                // busy spin until the main thread advances s1
            }
            value = ++s2;
        }
    }
}
</pre>
<br />
For C++ we’ll use the <a href="http://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Atomic-Builtins.html">GNU Atomic Builtins</a> which give us a similar lock instruction insertion to that which the JVM uses.<br />
<pre>
#include <time.h>
#include <pthread.h>
#include <stdio.h>

typedef unsigned long long uint64;

const uint64 ITERATIONS = 500LL * 1000LL * 1000LL;

volatile uint64 s1 = 0;
volatile uint64 s2 = 0;

void* run(void*)
{
    register uint64 value = s2;
    while (true)
    {
        while (value == s1)
        {
            // busy spin until the main thread advances s1
        }
        value = __sync_add_and_fetch(&s2, 1);
    }

    return 0; // never reached; the thread dies when main returns
}

int main (int argc, char *argv[])
{
    pthread_t threads[1];
    pthread_create(&threads[0], NULL, run, NULL);

    timespec ts_start;
    timespec ts_finish;
    clock_gettime(CLOCK_MONOTONIC, &ts_start);

    register uint64 value = s1;
    while (s1 < ITERATIONS)
    {
        while (s2 != value)
        {
            // busy spin until the other thread echoes the value
        }
        value = __sync_add_and_fetch(&s1, 1);
    }

    clock_gettime(CLOCK_MONOTONIC, &ts_finish);

    uint64 start = (ts_start.tv_sec * 1000000000LL) + ts_start.tv_nsec;
    uint64 finish = (ts_finish.tv_sec * 1000000000LL) + ts_finish.tv_nsec;
    uint64 duration = finish - start;

    printf("duration = %llu\n", duration);
    printf("ns per op = %llu\n", (duration / (ITERATIONS * 2)));
    printf("op/sec = %llu\n",
        ((ITERATIONS * 2L * 1000L * 1000L * 1000L) / duration));
    printf("s1 = %llu, s2 = %llu\n", s1, s2);

    return 0;
}
</pre>
<span style="font-size: large;"><b>Results</b></span><br />
<br />
<span style="font-family: "Courier New",Courier,monospace;">$ taskset -c 2,4 /opt/jdk1.7.0/bin/java InterThreadLatency</span><br />
<span style="font-family: "Courier New",Courier,monospace;">duration = 50790271150</span><br />
<span style="font-family: "Courier New",Courier,monospace;">ns per op = 50</span><br />
<span style="font-family: "Courier New",Courier,monospace;">op/sec = 19,688,810</span><br />
<span style="font-family: "Courier New",Courier,monospace;">s1 = 500000000, s2 = 500000000</span><br />
<br />
<span style="font-family: "Courier New",Courier,monospace;">$ g++ -O3 -lpthread -lrt -o itl itl.cpp</span><br />
<span style="font-family: "Courier New",Courier,monospace;">$ taskset -c 2,4 ./itl</span><br />
<span style="font-family: "Courier New",Courier,monospace;">duration = 45087955393</span><br />
<span style="font-family: "Courier New",Courier,monospace;">ns per op = 45</span><br />
<span style="font-family: "Courier New",Courier,monospace;">op/sec = 22,178,872</span><br />
<span style="font-family: "Courier New",Courier,monospace;">s1 = 500000000, s2 = 500000000</span><br />
<br />
The C++ version is slightly faster on my Intel Sandybridge laptop. So what does this tell us? Well, that the latency between 2 cores on a 2.2 GHz machine is ~45ns and that you can exchange 22m messages per second in a serial fashion. On an Intel CPU this is fundamentally the cost of the lock instruction enforcing total order and forcing the store buffer and <a href="http://mechanical-sympathy.blogspot.com/2011/07/write-combining.html">write combining buffers</a> to drain, followed by the resulting cache coherency traffic between the cores. Note that each core has a 96GB/s port onto the L3 cache ring bus, yet 22m * 64-bytes is only 1.4 GB/s. This is because we have measured latency and not throughput. We could easily fit some nice fat messages between those memory barriers as part of the exchange if the data has been written before the lock instruction was executed.<br />
<br />
So what does this all mean for the Disruptor? Basically, the latency of the Disruptor is about as low as we can get from Java. It would be possible to get a ~10% latency improvement by moving to C++. I’d expect a similar improvement in throughput for C++. The main win with C++ would be the control, and therefore, the predictability that comes with it if used correctly. The JVM gives us nice safety features like garbage collection in complex applications but we pay a little for that with the extra instructions it inserts that can be seen if you get Hotspot to dump the assembler instructions it is generating.<br />
<br />
How does the Disruptor achieve more than 25m messages per second I hear you say??? Well that is one of the neat parts of its design. The “<span style="font-family: "Courier New",Courier,monospace;">waitFor</span>” semantics on the <a href="http://code.google.com/p/disruptor/source/browse/trunk/code/src/main/com/lmax/disruptor/SequenceBarrier.java"><span style="font-family: "Courier New",Courier,monospace;">SequenceBarrier</span></a> enables a very efficient form of batching, which allows the <a href="http://code.google.com/p/disruptor/source/browse/trunk/code/src/main/com/lmax/disruptor/BatchEventProcessor.java"><span style="font-family: "Courier New",Courier,monospace;">BatchEventProcessor</span></a> to process a series of events that occurred since it last checked in with the <a href="http://code.google.com/p/disruptor/source/browse/trunk/code/src/main/com/lmax/disruptor/RingBuffer.java"><span style="font-family: "Courier New",Courier,monospace;">RingBuffer</span></a>, all without incurring a memory barrier. For real world applications this batching effect is really significant. For micro benchmarks it only makes the results more random, especially when there is little work done other than accepting the message.<br />
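<br />
The shape of that batching loop is roughly the following. This is a sketch of the idea only, not the actual <span style="font-family: "Courier New",Courier,monospace;">BatchEventProcessor</span> source, and the types here stand in for their Disruptor counterparts:<br />
<pre>
public final class BatchingLoopSketch
{
    interface Barrier { long waitFor(long sequence) throws InterruptedException; }
    interface Handler { void onEvent(long sequence, boolean endOfBatch); }

    public static void processLoop(final Barrier barrier, final Handler handler)
        throws InterruptedException
    {
        long nextSequence = 0L;
        while (true)
        {
            // One waitFor() may return a sequence well ahead of this
            // processor; every event up to it can then be handled
            // without incurring another memory barrier.
            final long availableSequence = barrier.waitFor(nextSequence);

            while (nextSequence <= availableSequence)
            {
                handler.onEvent(nextSequence, nextSequence == availableSequence);
                nextSequence++;
            }
        }
    }
}
</pre>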
<br />
<span style="font-size: large;"><b>Conclusion</b></span><br />
<br />
So when processing events in series, the measurements tell us that the current generation of processors can do between 20-30 million exchanges per second at a latency less than 50ns. The Disruptor design allows us to get greater throughput without explicit batching on the publisher side. In addition the Disruptor has an explicit <a href="http://code.google.com/p/disruptor/source/browse/trunk/code/src/perf/com/lmax/disruptor/OnePublisherToOneProcessorUniCastBatchThroughputTest.java">batching API</a> on the publisher side that can give over <a href="http://code.google.com/p/disruptor/source/browse/trunk/code/src/perf/com/lmax/disruptor/OnePublisherToOneProcessorUniCastBatchThroughputTest.java">100 million</a> messages per second.