Showing posts with label Performance. Show all posts
Showing posts with label Performance. Show all posts

Monday, 5 May 2014

Simple Binary Encoding

Financial systems communicate by sending and receiving vast numbers of messages in many different formats. When people use terms like "vast" I normally think, "really..how many?" So lets quantify "vast" for the finance industry. Market data feeds from financial exchanges typically can be emitting tens or hundreds of thousands of message per second, and aggregate feeds like OPRA can peak at over 10 million messages per second with volumes growing year-on-year. This presentation gives a good overview.

In this crazy world we still see significant use of ASCII encoded presentations, such as FIX tag value, and some more slightly sane binary encoded presentations like FAST. Some markets even commit the sin of sending out market data as XML! Well I cannot complain too much as they have at times provided me a good income writing ultra fast XML parsers.

Last year the CME, who are a member the FIX community, commissioned Todd Montgomery, of 29West LBM fame, and myself to build the reference implementation of the new FIX Simple Binary Encoding (SBE) standard. SBE is a codec aimed at addressing the efficiency issues in low-latency trading, with a specific focus on market data. The CME, working within the FIX community, have done a great job of coming up with an encoding presentation that can be so efficient. Maybe a suitable atonement for the sins of past FIX tag value implementations. Todd and I worked on the Java and C++ implementation, and later we were helped on the .Net side by the amazing Olivier Deheurles at Adaptive. Working on a cool technical problem with such a team is a dream job.

SBE Overview

SBE is an OSI layer 6 presentation for encoding/decoding messages in binary format to support low-latency applications. Of the many applications I profile with performance issues, message encoding/decoding is often the most significant cost. I've seen many applications that spend significantly more CPU time parsing and transforming XML and JSON than executing business logic. SBE is designed to make this part of a system the most efficient it can be. SBE follows a number of design principles to achieve this goal. By adhering to these design principles sometimes means features available in other codecs will not being offered. For example, many codecs allow strings to be encoded at any field position in a message; SBE only allows variable length fields, such as strings, as fields grouped at the end of a message.

The SBE reference implementation consists of a compiler that takes a message schema as input and then generates language specific stubs. The stubs are used to directly encode and decode messages from buffers. The SBE tool can also generate a binary representation of the schema that can be used for the on-the-fly decoding of messages in a dynamic environment, such as for a log viewer or network sniffer.

The design principles drive the implementation of a codec that ensures messages are streamed through memory without backtracking, copying, or unnecessary allocation. Memory access patterns should not be underestimated in the design of a high-performance application. Low-latency systems in any language especially need to consider all allocation to avoid the resulting issues in reclamation. This applies for both managed runtime and native languages. SBE is totally allocation free in all three language implementations.

The end result of applying these design principles is a codec that has ~16-25 times greater throughput than Google Protocol Buffers (GPB) with very low and predictable latency. This has been observed in micro-benchmarks and real-world application use. A typical market data message can be encoded, or decoded, in ~25ns compared to ~1000ns for the same message with GPB on the same hardware. XML and FIX tag value messages are orders of magnitude slower again.

The sweet spot for SBE is as a codec for structured data that is mostly fixed size fields which are numbers, bitsets, enums, and arrays. While it does work for strings and blobs, many my find some of the restrictions a usability issue. These users would be better off with another codec more suited to string encoding.

Message Structure

A message must be capable of being read or written sequentially to preserve the streaming access design principle, i.e. with no need to backtrack. Some codecs insert location pointers for variable length fields, such as string types, that have to be indirected for access. This indirection comes at a cost of extra instructions plus losing the support of the hardware prefetchers. SBE's design allows for pure sequential access and copy-free native access semantics.

Figure 1
SBE messages have a common header that identifies the type and version of the message body to follow. The header is followed by the root fields of the message which are all fixed length with static offsets. The root fields are very similar to a struct in C. If the message is more complex then one or more repeating groups similar to the root block can follow. Repeating groups can nest other repeating group structures. Finally, variable length strings and blobs come at the end of the message. Fields may also be optional. The XML schema describing the SBE presentation can be found here.

SbeTool and the Compiler

To use SBE it is first necessary to define a schema for your messages. SBE provides a language independent type system supporting integers, floating point numbers, characters, arrays, constants, enums, bitsets, composites, grouped structures that repeat, and variable length strings and blobs.

A message schema can be input into the SbeTool and compiled to produce stubs in a range of languages, or to generate binary metadata suitable for decoding messages on-the-fly.

    java [-Doption=value] -jar sbe.jar <message-declarations-file.xml>

SbeTool and the compiler are written in Java. The tool can currently output stubs in Java, C++, and C#.

Programming with Stubs

A full example of messages defined in a schema with supporting code can be found here. The generated stubs follow a flyweight pattern with instances reused to avoid allocation. The stubs wrap a buffer at an offset and then read it sequentially and natively.
    // Write the message header first
    MESSAGE_HEADER.wrap(directBuffer, bufferOffset, messageTemplateVersion)
                  .blockLength(CAR.sbeBlockLength())
                  .templateId(CAR.sbeTemplateId())
                  .schemaId(CAR.sbeSchemaId())
                  .version(CAR.sbeSchemaVersion());

    // Then write the body of the message
    car.wrapForEncode(directBuffer, bufferOffset)
       .serialNumber(1234)
       .modelYear(2013)
       .available(BooleanType.TRUE)
       .code(Model.A)
       .putVehicleCode(VEHICLE_CODE, srcOffset);
Messages can be written via the generated stubs in a fluent manner. Each field appears as a generated pair of methods to encode and decode.
    // Read the header and lookup the appropriate template to decode
    MESSAGE_HEADER.wrap(directBuffer, bufferOffset, messageTemplateVersion);

    final int templateId = MESSAGE_HEADER.templateId();
    final int actingBlockLength = MESSAGE_HEADER.blockLength();
    final int schemaId = MESSAGE_HEADER.schemaId();
    final int actingVersion = MESSAGE_HEADER.version();

    // Once the template is located then the fields can be decoded.
    car.wrapForDecode(directBuffer, bufferOffset, actingBlockLength, actingVersion);

    final StringBuilder sb = new StringBuilder();
    sb.append("\ncar.templateId=").append(car.sbeTemplateId());
    sb.append("\ncar.schemaId=").append(schemaId);
    sb.append("\ncar.schemaVersion=").append(car.sbeSchemaVersion());
    sb.append("\ncar.serialNumber=").append(car.serialNumber());
    sb.append("\ncar.modelYear=").append(car.modelYear());
    sb.append("\ncar.available=").append(car.available());
    sb.append("\ncar.code=").append(car.code());

The generated code in all languages gives performance similar to casting a C struct over the memory.

On-The-Fly Decoding

The compiler produces an intermediate representation (IR) for the input XML message schema. This IR can be serialised in the SBE binary format to be used for later on-the-fly decoding of messages that have been stored. It is also useful for tools, such as a network sniffer, that will not have been compiled with the stubs. A full example of the IR being used can be found here.

Direct Buffers

SBE, via Agrona, provides an abstraction to Java, with the MutableDirectBuffer class, to work with buffers that are byte[], heap or direct ByteBuffer buffers, and off heap memory addresses returned from Unsafe.allocateMemory(long) or JNI. In low-latency applications, messages are often encoded/decoded in memory mapped files via MappedByteBuffer and thus can be be transferred to a network channel by the kernel thus avoiding user space copies.

C++ and C# have built-in support for direct memory access and do not require such an abstraction as the Java version does. A DirectBuffer abstraction was added for C# to support Endianess and encapsulate the unsafe pointer access.

Message Extension and Versioning

SBE schemas carry a version number that allows for message extension. A message can be extended by adding fields at the end of a block. Fields cannot be removed or reordered for backwards compatibility.

Extension fields must be optional otherwise a newer template reading an older message would not work. Templates carry metadata for min, max, null, timeunit, character encoding, etc., these are accessible via static (class level) methods on the stubs.

Byte Ordering and Alignment

The message schema allows for precise alignment of fields by specifying offsets. Fields are by default encoded in Little Endian form unless otherwise specified in a schema. For maximum performance native encoding with fields on word aligned boundaries should be used. The penalty for accessing non-aligned fields on some processors can be very significant. For alignment one must consider the framing protocol and buffer locations in memory.

Message Protocols

I often see people complain that a codec cannot support a particular presentation in a single message. However this is often possible to address with a protocol of messages. Protocols are a great way to split an interaction into its component parts, these parts are then often composable for many interactions between systems. For example, the IR implementation of schema metadata is more complex than can be supported by the structure of a single message. We encode IR by first sending a template message providing an overview, followed by a stream of messages, each encoding the tokens from the compiler IR. This allows for the design of a very fast OTF decoder which can be implemented as a threaded interpreter with much less branching than the typical switch based state machines.

Protocol design is an area that most developers don't seem to get an opportunity to learn. I feel this is a great loss. The fact that so many developers will call an "encoding" such as ASCII a "protocol" is very telling. The value of protocols is so obvious when one gets to work with a programmer like Todd who has spent his life successfully designing protocols.

Stub Performance

The stubs provide a significant performance advantage over the dynamic OTF decoding. For accessing primitive fields we believe the performance is reaching the limits of what is possible from a general purpose tool. The generated assembly code is very similar to what a compiler will generate for accessing a C struct, even from Java!

Regarding the general performance of the stubs, we have observed that C++ has a very marginal advantage over the Java which we believe is due to runtime inserted Safepoint checks. The C# version lags a little further behind due to its runtime not being as aggressive with inlining methods as the Java runtime. Stubs for all three languages are capable of encoding or decoding typical financial messages in tens of nanoseconds. This effectively makes the encoding and decoding of messages almost free for most applications relative to the rest of the application logic.

Feedback

This is the first version of SBE and we would welcome feedback. The reference implementation is constrained by the FIX community specification. It is possible to influence the specification but please don't expect pull requests to be accepted that significantly go against the specification. Support for Javascript, Python, Erlang, and other languages has been discussed and would be very welcome.


Update: 08-May-2014

Thanks to feedback from Kenton Varda, the creator of GPB, we were able to improve the benchmarks to get the best performance out of GPB. Below are the results for the changes to the Java benchmarks.

The C++ GPB examples on optimisation show approximately a doubling of throughput compared to initial results. It should be noted that you often have to do the opposite in Java with GPB compared to C++ to get performance improvements, such as allocate objects rather than reuse them.

Before GPB Optimisation:
Mode Thr    Cnt  Sec         Mean   Mean error    Units
     [exec] u.c.r.protobuf.CarBenchmark.testDecode           thrpt   1     30    1      462.817        6.474   ops/ms
     [exec] u.c.r.protobuf.CarBenchmark.testEncode           thrpt   1     30    1      326.018        2.972   ops/ms
     [exec] u.c.r.protobuf.MarketDataBenchmark.testDecode    thrpt   1     30    1     1148.050       17.194   ops/ms
     [exec] u.c.r.protobuf.MarketDataBenchmark.testEncode    thrpt   1     30    1     1242.252       12.248   ops/ms

     [exec] u.c.r.sbe.CarBenchmark.testDecode                thrpt   1     30    1    10436.476      102.114   ops/ms
     [exec] u.c.r.sbe.CarBenchmark.testEncode                thrpt   1     30    1    11657.190       65.168   ops/ms
     [exec] u.c.r.sbe.MarketDataBenchmark.testDecode         thrpt   1     30    1    34078.646      261.775   ops/ms
     [exec] u.c.r.sbe.MarketDataBenchmark.testEncode         thrpt   1     30    1    29193.600      443.638   ops/ms
After GPB Optimisation:
Mode Thr    Cnt  Sec         Mean   Mean error    Units
     [exec] u.c.r.protobuf.CarBenchmark.testDecode           thrpt   1     30    1      619.467        4.429   ops/ms
     [exec] u.c.r.protobuf.CarBenchmark.testEncode           thrpt   1     30    1      433.711       10.364   ops/ms
     [exec] u.c.r.protobuf.MarketDataBenchmark.testDecode    thrpt   1     30    1     2088.998       60.619   ops/ms
     [exec] u.c.r.protobuf.MarketDataBenchmark.testEncode    thrpt   1     30    1     1316.123       19.816   ops/ms


Throughput msg/ms - Before GPB Optimisation
TestProtocol BuffersSBERatio
Car Encode462.817
10436.476
22.52
Car Decode326.018
11657.190
35.76
Market Data Encode1148.050
34078.646
29.68
Market Data Decode1242.252
29193.600
23.50

Throughput msg/ms - After GPB Optimisation
TestProtocol BuffersSBERatio
Car Encode619.467
10436.476
16.85
Car Decode433.711
11657.190
26.88
Market Data Encode2088.998
34078.646
16.31
Market Data Decode1316.123
29193.600
22.18

Monday, 26 August 2013

Lock-Based vs Lock-Free Concurrent Algorithms

Last week I attended a review session of the new JSR166 StampedLock run by Heinz Kabutz at the excellent JCrete unconference. StampedLock is an attempt to address the contention issues that arise in a system when multiple readers concurrently access shared state. StampedLock is designed to perform better than ReentrantReadWriteLock by taking an optimistic read approach.

While attending the session a couple of things occurred to me. Firstly, I thought it was about time I reviewed the current status of Java lock implementations. Secondly, that although StampedLock looks like a good addition to the JDK, it seems to miss the fact that lock-free algorithms are often a better solution to the multiple reader case.

Test Case

To compare implementations I needed an API test case that would not favour a particular approach. For example, the API should be garbage free and allow the methods to be atomic. A simple test case is to design a spaceship that can be moved around a 2-dimensional space with the coordinates of its position available to be read atomically. At least 2 fields need to be read, or written, per transaction to make the concurrency interesting.
/**
 * Interface to a concurrent representation of a ship that can move around
 * a 2 dimensional space with updates and reads performed concurrently.
 */
public interface Spaceship
{
    /**
     * Read the position of the spaceship into the array of coordinates provided.
     *
     * @param coordinates into which the x and y coordinates should be read.
     * @return the number of attempts made to read the current state.
     */
    int readPosition(final int[] coordinates);

    /**
     * Move the position of the spaceship by a delta to the x and y coordinates.
     *
     * @param xDelta delta by which the spaceship should be moved in the x-axis.
     * @param yDelta delta by which the spaceship should be moved in the y-axis.
     * @return the number of attempts made to write the new coordinates.
     */
    int move(final int xDelta, final int yDelta);
}
The above API would be cleaner by factoring out an immutable Position object but I want to keep it garbage free and create the need to update multiple internal fields with minimal indirection. This API could easily be extended for a 3-dimensional space and require the implementations to be atomic.

Multiple implementations are built for each spaceship and exercised by a test harness. All the code and results for this blog can be found here.

The test harness will run each of the implementations in turn by using a megamorphic dispatch pattern to try and prevent inlining, lock-coarsening, and loop unrolling when accessing the concurrent methods.

Each implementation is subjected to 4 distinct threading scenarios that result in different contention profiles:
  • 1 reader - 1 writer
  • 2 readers - 1 writer
  • 3 readers - 1 writer
  • 2 readers - 2 writers
All tests are run with 64-bit Java 1.7.0_25, Linux 3.6.30, and a quad core 2.2GHz Ivy Bridge i7-3632QM. Throughput is measured over 5 second periods for each implementation with the tests repeated 5 times to ensure sufficient warm up. The results below are throughputs averaged per second over 5 runs. To approximate a typical Java deployment, no thread affinity or core isolation has been employed which would have reduced variance significantly.

Note: Other CPUs and operating systems can produce very different results.

Results

Figure 1.
Figure 2.
Figure 3.
Figure 4.

The raw data for the above charts can be found here.

Analysis

The real surprise for me from the results is the performance of ReentrantReadWriteLock.  I cannot see a use for this implementation beyond a case whereby there is a huge balance of reads and very little writes. My main takeaways are:
  1. StampedLock is a major improvement over existing lock implementations especially with increasing numbers of reader threads.
  2. StampedLock has a complex API. It is very easy to mistakenly call the wrong method for locking actions.
  3. Synchronised is a good general purpose lock implementation when contention is from only 2 threads.
  4. ReentrantLock is a good general purpose lock implementation when thread counts grow as previously discovered.
  5. Choosing to use ReentrantReadWriteLock should be based on careful and appropriate measurement. As with all major decisions, measure and make decisions based on data.
  6. Lock-free implementations can offer significant throughput advantages over lock-based algorithms.
Conclusion

It is nice seeing the influence of lock-free techniques appearing in lock-based algorithms. The optimistic strategy employed on read is effectively a lock-free algorithm at the times when a writer is not updating.

In my experience of teaching and developing lock-free algorithms, not only do they provide significant throughput advantages as evidenced here, they also provide much lower and less variance in latency.

Tuesday, 16 July 2013

Java Garbage Collection Distilled

Serial, Parallel, Concurrent, CMS, G1, Young Gen, New Gen, Old Gen, Perm Gen, Eden, Tenured, Survivor Spaces, Safepoints, and the hundreds of JVM startup flags. Does this all baffle you when trying to tune the garbage collector while trying to get the required throughput and latency from your Java application? If it does then do not worry, you are not alone. Documentation describing garbage collection feels like man pages for an aircraft. Every knob and dial is detailed and explained but nowhere can you find a guide on how to fly. This article will attempt to explain the tradeoffs when choosing and tuning garbage collection algorithms for a particular workload.

The focus will be on Oracle Hotspot JVM and OpenJDK collectors as those are the ones in most common usage. Towards the end other commercial JVMs will be discussed to illustrate alternatives.

The Tradeoffs

Wise folk keep telling us, “You do not get something for nothing”. When we get something we usually have to give up something in return. When it comes to garbage collection we play with 3 major variables that set targets for the collectors:
  1. Throughput: The amount of work done by an application as a ratio of time spent in GC. Target throughput with ‑XX:GCTimeRatio=99 ; 99 is the default equating to 1% GC time.
  2. Latency: The time taken by systems in responding to events which is impacted by pauses introduced by garbage collection. Target latency for GC pauses with ‑XX:MaxGCPauseMillis=<n>.
  3. Memory: The amount of memory our systems use to store state, which is often copied and moved around when being managed. The set of active objects retained by the application at any point in time is known as the Live Set. Maximum heap size –Xmx<n> is a tuning parameter for setting the heap size available to an application.
Note: Often Hotspot cannot achieve these targets and will silently continue without warning, having missed its target by a great margin.

Latency is a distribution across events. It may be acceptable to have an increased average latency to reduce the worst-case latency, or make it less frequent. We should not interpret the term “real-time” to mean the lowest possible latency; rather real-time refers to having deterministic latency regardless of throughput.

For some application workloads, throughput is the most important target. An example would be a long running batch-processing job; it does not matter if a batch job is occasionally paused for a few seconds while garbage collection takes place, as long as the overall job can be completed sooner.

For virtually all other workloads, from human facing interactive applications to financial trading systems, if a system goes unresponsive for anything more than a few seconds or even milliseconds in some cases, it can spell disaster. In financial trading it is often worthwhile to trade off some throughput in return for consistent latency. We may also have applications that are limited by the amount of physical memory available and have to maintain a footprint, in which case we have to give up performance on both latency and throughput fronts.

Tradeoffs often play out as follows:
  • To a large extent the cost of garbage collection, as an amortized cost, can be reduced by providing the garbage collection algorithms with more memory.
  • The observed worst-case latency-inducing pauses due to garbage collecting can be reduced by containing the live set and keeping the heap size small.
  • The frequency with which pauses occur can be reduced by managing the heap and generation sizes, and by controlling the application’s object allocation rate.
  • The frequency of large pauses can be reduced by concurrently running the GC with the application, sometimes at the expense of throughput.

Object Lifetimes

Garbage collection algorithms are often optimised with the expectation that most objects live for a very short period of time, while relatively few live for very long. In most applications, objects that live for a significant period of time tend to constitute a very small percentage of objects allocated over time. In garbage collection theory this observed behavior is often known as “infant mortality” or the “weak generational hypothesis”. For example, loop Iterators are mostly short lived whereas static Strings are effectively immortal.

Experimentation has shown that generational garbage collectors can usually support an order-of-magnitude greater throughput than non-generational collectors do, and thus are almost ubiquitously used in server JVMs. By separating the generations of objects, we know that a region of newly allocated objects is likely to be very sparse for live objects. Therefore a collector that scavenges for the few live objects in this new region and copies them to another region for older objects can be very efficient. Hotspot garbage collectors record the age of an object in terms of the number of GC cycles survived.

Note: If your application consistently generates a lot of objects that live for a fairly long time then expect your application to be spending a significant portion of its time garbage collecting, and expect to be spending a significant portion of your time tuning the Hotspot garbage collectors. This is due to the reduced GC efficiency that happens when the generational “filter” is less effective, and resulting cost of collecting the longer living generations more frequently. Older generations are less sparse, and as a result the efficiency of older generation collection algorithms tends to be much lower. Generational garbage collectors tend to operate in two distinct collection cycles: Minor collections, when short-lived objects are collected, and the less frequent Major collections, when the older regions are collected.

Stop-The-World Events

The pauses that applications suffer during garbage collection are due to what are known as stop-the-world events. For garbage collectors to operate it is necessary, for practical engineering reasons, to periodically stop the running application so that memory can be managed. Depending on the algorithms, different collectors will stop-the-world at specific points of execution for varying durations of time. To bring an application to a total stop it is necessary to pause all the running threads. Garbage collectors do this by signaling the threads to stop when they come to a “safepoint”, which is a point during program execution at which all GC roots are known and all heap object contents are consistent. Depending on what a thread is doing it may take some time to reach a safepoint. Safepoint checks are normally performed on method returns and loop back edges, but can be optimized away in some places making them more dynamically rare. For example, if a thread is copying a large array, cloning a large object, or executing a monotonic counted loop with a finite bound, it may be many milliseconds before a safepoint is reached. Time To Safepoint (TTSP) is an important consideration in low-latency applications. This time can be surfaced by enabling the ‑XX:+PrintGCApplicationStoppedTime flag in addition to the other GC flags.

Note: For applications with a large number of running threads, when a stop-the-world event occurs a system will undergo significant scheduling pressure as the threads resume when released. Therefore algorithms with less reliance on stop-the-world events can potentially be more efficient.

Heap Organisation in Hotspot

To understand how the different collectors operate it is best to explore how the Java heap is organised to support generational collectors.

Eden is the region where most objects are initially allocated. The survivor spaces are a temporary store for objects that have survived a collection of the Eden space. Survivor space usage will be described when minor collections are discussed. Collectively Eden and the survivor spaces are known as the “young” or “new” generation.

Objects that live long enough are eventually promoted to the tenured space.

The perm generation is where the runtime stores objects it “knows” to be effectively immortal, such as Classes and static Strings. Unfortunately the common use of class loading on an ongoing basis in many applications makes the motivating assumption behind the perm generation wrong, i.e. that classes are immortal. In Java 7 interned Strings were moved from permgen to tenured, and from Java 8 the perm generation is no more and will not be discussed in this article. Most other commercial collectors do not use a separate perm space and tend to treat all long living objects as tenured.

Note: The Virtual spaces allow the collectors to adjust the size of regions to meet throughput and latency targets. Collectors keep statistics for each collection phase and adjust the region sizes accordingly in an attempt to reach the targets.

Object Allocation

To avoid contention each thread is assigned a Thread Local Allocation Buffer (TLAB) from which it allocates objects. Using TLABs allows object allocation to scale with number of threads by avoiding contention on a single memory resource. Object allocation via a TLAB is a very cheap operation; it simply bumps a pointer for the object size which takes roughly 10 instructions on most platforms. Heap memory allocation for Java is even cheaper than using malloc from the C runtime.


Note: Whereas individual object allocation is very cheap, the rate at which minor collections must occur is directly proportional to the rate of object allocation.

When a TLAB is exhausted a thread simply requests a new one from the Eden space. When Eden has been filled a minor collection commences.

Large objects (-XX:PretenureSizeThreshold=<n>) may fail to be accommodated in the young generation and thus have to be allocated in the old generation, e.g. a large array. If the threshold is set below TLAB size then objects that fit in the TLAB will not be created in the old generation. The new G1 collector handles large objects differently and will be discussed later in its own section.

Minor Collections

A minor collection is triggered when Eden becomes full. This is done by copying all the live objects in the new generation to either a survivor space or the tenured space as appropriate. Copying to the tenured space is known as promotion or tenuring. Promotion occurs for objects that are sufficiently old (– XX:MaxTenuringThreshold=<n>), or when the survivor space overflows.

Live objects are objects that are reachable by the application; any other objects cannot be reached and can therefore be considered dead. In a minor collection, the copying of live objects is performed by first following what are known as GC Roots, and iteratively copying anything reachable to the survivor space. GC Roots normally include references from application and JVM-internal static fields, and from thread stack-frames, all of which effectively point to the application’s reachable object graphs.

In generational collection, the GC Roots for the new generation’s reachable object graph also include any references from the old generation to the new generation. These references must also be processed to make sure all reachable objects in the new generation survive the minor collection. Identifying these cross-generational references is achieved by use of a “card table”. The Hotspot card table is an array of bytes in which each byte is used to track the potential existence of cross-generational references in a corresponding 512 byte region of the old generation. As references are stored to the heap, “store barrier” code will mark cards to indicate that a potential reference from the old generation to the new generation may exist in the associated 512 byte heap region. At collection time, the card table is used to scan for such cross-generational references, which effectively represent additional GC Roots into the new generation. Therefore a significant fixed cost of minor collections is directly proportional to the size of the old generation.

There are two survivor spaces in the Hotspot new generation, which alternate in their “to-space” and “from-space” roles. At the beginning of a minor collection, the to-space survivor space is always empty, and acts as a target copy area for the minor collection. The previous minor collection’s target survivor space is part of the from-space, which also includes Eden, where live objects that need to be copied may be found.

The cost of a minor GC collection is usually dominated by the cost of copying objects to the survivor and tenured spaces. Objects that do not survive a minor collection are effectively free to be dealt with. The work done during a minor collection is directly proportional to the number of live objects found, and not to the size of the new generation. The total time spent doing minor collections can be almost be halved each time the Eden size is doubled. Memory can therefore be traded for throughput. A doubling of Eden size will result in an increase in collection time per-collection cycle, but this is relatively small if both the number of objects being promoted and size of the old generation is constant.

Note: In Hotspot minor collections are stop-the-world events. This is rapidly becoming a major issue as our heaps get larger with more live objects. We are already starting to see the need for concurrent collection of the young generation to reach pause-time targets.

Major Collections

Major collections collect the old generation so that objects can be promoted from the young generation. In most applications, the vast majority of program state ends up in the old generation. The greatest variety of GC algorithms exists for the old generation. Some will compact the whole space when it fills, whereas others will collect concurrently with the application in an effort to prevent it from filling up.

The old generation collector will try to predict when it needs to collect to avoid a promotion failure from the young generation. The collectors track a fill threshold for the old generation and begin collection when this threshold is passed. If this threshold is not sufficient to meet promotion requirements then a “FullGC” is triggered. A FullGC involves promoting all live objects from the young generations followed by a collection and compaction of the old generation. Promotion failure is a very expensive operation as state and promoted objects from this cycle must be unwound so the FullGC event can occur.

Note: To avoid promotion failure you will need to tune the padding that the old generation allows to accommodate promotions (‑XX:PromotedPadding=<n>).

Note: When the Heap needs to grow a FullGC is triggered. These heap-resizing FullGCs can be avoided by setting –Xms and –Xmx to the same value.

Other than a FullGC, a compaction of the old generation is likely to be the largest stop-the-world pause an application will experience. The time for this compaction tends to grow linearly with the number of live objects in the tenured space.

The rate at which the tenured space fills up can sometimes be reduced by increasing the size of the survivor spaces and the age of objects before being promoted to the tenured generation. However, increasing the size of the survivor spaces and object age in Minor collections (–XX:MaxTenuringThreshold=<n>) before promotion can also increase the cost and pause times in the minor collections due to the increased copy cost between survivor spaces on minor collections.

Serial Collector

The Serial collector (-XX:+UseSerialGC) is the simplest collector and is a good option for single processor systems. It also has the smallest footprint of any collector. It uses a single thread for both minor and major collections. Objects are allocated in the tenured space using a simple bump the pointer algorithm. Major collections are triggered when the tenured space is full.

Parallel Collector

The Parallel collector comes in two forms. The Parallel collector (‑XX:+UseParallelGC) which uses multiple threads to perform minor collections of the young generation and a single thread for major collections on the old generation. The Parallel Old collector (‑XX:+UseParallelOldGC) , the default since Java 7u4, uses multiple threads for minor collections and multiple threads for major collections. Objects are allocated in the tenured space using a simple bump the pointer algorithm. Major collections are triggered when the tenured space is full.

On multiprocessor systems the Parallel Old collector will give the greatest throughput of any collector. It has no impact on a running application until a collection occurs, and then will collect in parallel using multiple threads using the most efficient algorithm. This makes the Parallel Old collector very suitable for batch applications.

The cost of collecting the old generations is affected by the number of objects to retain to a greater extent than by the size of the heap. Therefore the efficiency of the Parallel Old collector can be increased to achieve greater throughput by providing more memory and accepting larger, but fewer, collection pauses.

Expect the fastest minor collections with this collector because the promotion to tenured space is a simple bump the pointer and copy operation.

For server applications the Parallel Old collector should be the first port-of-call. However if the major collection pauses are more than your application can tolerate then you need to consider employing a concurrent collector that collects the tenured objects concurrently while the application is running.

Note: Expect pauses in the order of one to five seconds per GB of live data on modern hardware while the old generation is compacted.

Note: The parallel collector can sometimes gain performance benefits from -XX:+UseNUMA on multi-socket CPU server applications by allocating Eden memory for threads local to the CPU socket. It is a shame this feature is not available to the other collectors.

Concurrent Mark Sweep (CMS) Collector

The CMS (-XX:+UseConcMarkSweepGC) collector runs in the Old generation collecting tenured objects that are no longer reachable during a major collection. It runs concurrently with the application with the goal of keeping sufficient free space in the old generation so that a promotion failure from the young generation does not occur.

Promotion failure will trigger a FullGC. CMS follows a multistep process:
  1. Initial Mark : Find GC Roots.
  2. Concurrent Mark: Mark all reachable objects from the GC Roots.
  3. Concurrent Pre-clean: Check for object references that have been updated and objects that have been promoted during the concurrent mark phase by remarking.
  4. Re-mark : Capture object references that have been updated since the Pre-clean stage.
  5. Concurrent Sweep: Update the free-lists by reclaiming memory occupied by dead objects.
  6. Concurrent Reset: Reset data structures for next run.
As tenured objects become unreachable, the space is reclaimed by CMS and put on free-lists. When promotion occurs, the free-lists must be searched for a suitable sized hole for the promoted object. This increases the cost of promotion and thus increases the cost of the Minor collections compared to the Parallel Collector.

Note: CMS is not a compacting collector, which over time can result in old generation fragmentation. Object promotion can fail because a large object may not fit in the available holes in the old generation. When this happens a “promotion failed” message is logged and a FullGC is triggered to compact the live tenured objects. For such compaction-driven FullGCs, expect pauses to be worse than major collections using the Parallel Old collector because CMS uses only a single thread for compaction.

CMS is mostly concurrent with the application, which has a number of implications. First, CPU time is taken by the collector, thus reducing the CPU available to the application. The amount of time required by CMS grows linearly with the amount of object promotion to the tenured space. Second, for some phases of the concurrent GC cycle, all application threads have to be brought to a safepoint for marking GC Roots and performing a parallel re-mark to check for mutation.

Note: If an application sees significant mutation of tenured objects then the re-mark phase can be significant, at the extremes it may take longer than a full compaction with the Parallel Old collector.

CMS makes FullGC a less frequent event at the expense of reduced throughput, more expensive minor collections, and greater footprint. The reduction in throughput can be anything from 10%-40% compared to the Parallel collector, depending on promotion rate. CMS also requires a 20% greater footprint to accommodate additional data structures and “floating garbage” that can be missed during the concurrent marking that gets carried over to the next cycle.

High promotion rates and resulting fragmentation can sometimes be reduced by increasing the size of both the young and old generation spaces.

Note: CMS can suffer “concurrent mode failures”, which can be seen in the logs, when it fails to collect at a sufficient rate to keep up with promotion. This can be caused when the collection commences too late, which can sometimes be addressed by tuning. But it can also occur when the collection rate cannot keep up with the high promotion rate or with the high object mutation rate of some applications. If the promotion rate, or mutation rate, of the application is too high then your application might require some changes to reduce the promotion pressure. Adding more memory to such a system can sometimes make the situation worse, as CMS would then have more memory to scan.

Garbage First (G1) Collector

G1 (-XX:+UseG1GC) is a new collector introduced in Java 6 and now officially supported as of Java 7u4. It is a partially concurrent collecting algorithm that also tries to compact the tenured space in smaller incremental stop-the-world pauses to try and minimize the FullGC events that plague CMS because of fragmentation. G1 is a generational collector that organizes the heap differently from the other collectors by dividing it into a large number (~2000) of fixed size regions of variable purpose, rather than contiguous regions for the same purpose.


G1 takes the approach of concurrently marking regions to track references between regions, and to focus collection on the regions with the most free space. These regions are then collected in stop-the-world pause increments by evacuating the live objects to an empty region, thus compacting in the process.  The regions to be collected in a cycle are known as the Collection Set.

Objects larger than 50% of a region are allocated in humongous regions, which are a multiple of region size. Allocation and collection of humongous objects can be very costly under G1, and to date has had little or no optimisation effort applied.

The challenge with any compacting collector is not the moving of objects but the updating of references to those objects. If an object is referenced from many regions then updating those references can take significantly longer than moving the object. G1 tracks which objects in a region have references from other regions via the “Remembered Sets”. Remember Sets are collections of cards that have been marked for mutation. If the Remembered Sets become large then G1 can significantly slow down. When evacuating objects from one region to another, the length of the associated stop-the-world event tends to be proportional to the number of regions with references that need to be scanned and potentially patched.

Maintaining the Remembered Sets increases the cost of minor collections resulting in pauses greater than those seen with Parallel Old or CMS collectors for Minor collections.

G1 is target driven on latency –XX:MaxGCPauseMillis=<n>, default value = 200ms. The target will influence the amount of work done on each cycle on a best-efforts only basis. Setting targets in tens of milliseconds is mostly futile, and as of this writing targeting tens of milliseconds has not been a focus of G1.

G1 is a good general-purpose collector for larger heaps that have a tendency to become fragmented when an application can tolerate pauses in the 0.5-1.0 second range for incremental compactions. G1 tends to reduce the frequency of the worst-case pauses seen by CMS because of fragmentation at the cost of extended minor collections and incremental compactions of the old generation. Most pauses end up being constrained to regional rather than full heap compactions.

Like CMS, G1 can also fail to keep up with promotion rates, and will fall back to a stop-the-world FullGC. Just like CMS has “concurrent mode failure”, G1 can suffer an evacuation failure, seen in the logs as “to-space overflow”. This occurs when there are no free regions into which objects can be evacuated, which is similar to a promotion failure. If this occurs, try using a larger heap and more marking threads, but in some cases application changes may be necessary to reduce allocation rates.

A challenging problem for G1 is dealing with popular objects and regions. Incremental stop-the-world compaction works well when regions have live objects that are not heavily referenced from other regions. If an object or region is popular then the Remembered Set will be large, and G1 will try to avoid collecting those objects. Eventually it can have no choice, which results in very frequent mid-length pauses as the heap gets compacted.

Alternative Concurrent Collectors

CMS and G1 are often called mostly concurrent collectors. When you look at the total work performed it is clear that the young generation, promotion and even much of the old generation work is not concurrent at all. CMS is mostly concurrent for the old generation; G1 is much more of a stop-the-world incremental collector. Both CMS and G1 have significant and regularly occurring stop-the-world events, and worst-case scenarios that often make them unsuitable for strict low-latency applications, such a financial trading or reactive user interfaces.

Alternative collectors are available such as Oracle JRockit Real Time, IBM Websphere Real Time, and Azul Zing. The JRockit and Websphere collectors have latency advantages in most cases over CMS and G1 but often see throughput limitations and still suffer significant stop-the-world events. Zing is the only Java collector know to this author that can be truly concurrent for collection and compaction while maintaining a high-throughput rate for all generations. Zing does have some sub-millisecond stop-the-world events but these are for phase shifts in the collection cycle that are not related to live object set size.

JRockit RT can achieve typical pause times in the tens of milliseconds for high allocation rates at contained heap sizes but occasionally has to fail back to full compaction pauses. Websphere RT can achieve single-digit millisecond pause times via constrained allocation rates and live set sizes. Zing can achieve sub-millisecond pauses with high allocation rates by being concurrent for all phases, including during minor collections. Zing is able to maintain this consistent behavior regardless of heap size, allowing the user to apply large heap sizes as needed for keeping up with application throughput or object model state needs, without fear of increased pause times.

For all the concurrent collectors targeting latency you have to give up some throughput and gain footprint. Depending on the efficiency of the concurrent collector you may give up a little throughput but you are always adding significant footprint. If truly concurrent, with few stop-the-world events, then more CPU cores are needed to enable the concurrent operation and maintain throughput.

Note: All the concurrent collectors tend to function more efficiently when sufficient space is allocated. As a starting point rule of thumb, you should budget a heap of at least two to three times the size of the live set for efficient operation. However, space requirements for maintaining concurrent operation grows with application throughput, and the associated allocation and promotion rates. So for higher throughput applications a higher heap-size to live set ratio may be warranted. Given the huge memory spaces available to today’s systems footprint is seldom an issue on the server side.

Garbage Collection Monitoring & Tuning

To understand how your application and garbage collector are behaving, start your JVM with at least the following settings:
-verbose:gc
-Xloggc:
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintTenuringDistribution
-XX:+PrintGCApplicationConcurrentTime 
-XX:+PrintGCApplicationStoppedTime

Then load the logs into a tool like Chewiebug for analysis.

To see the dynamic nature of GC, launch JVisualVM and install the Visual GC plugin. This will enable you to see the GC in action for your application as below.


To get an understanding of your applcations’ GC needs, you need representative load tests that can be executed repeatedly. As you get to grips with how each of the collectors work then run your load tests with different configurations as experiments until you reach your throughput and latency targets. It is important to measure latency from the end user perspective. This can be achieved by capturing the response time of every test request in a histogram, e.g. HdrHistogram or Disruptor Histogram. If you have latency spikes that are outside your acceptable range, then try and correlate these with the GC logs to determine if GC is the issue. It is possible other issues may be causing latency spikes. Another useful tool to consider is jHiccup which can be used to track pauses within the JVM and across a system as a whole. Measure your idle systems for a few hours with jHiccup and you will often be very surprised.

If latency spikes are due to GC then invest in tuning CMS or G1 to see if your latency targets can be meet. Sometimes this may not be possible because of high allocation and promotion rates combined with low-latency requirements. GC tuning can become a highly skilled exercise that often requires application changes to reduce object allocation rates or object lifetimes. If this is the case then a commercial trade-off between time and resource spent on GC tuning and application changes, verses, purchasing one of the commercial concurrent compacting JVMs such as JRockit Real Time or Azul Zing may be required.

Thursday, 14 February 2013

CPU Cache Flushing Fallacy

Even from highly experienced technologists I often hear talk about how certain operations cause a CPU cache to "flush".  This seems to be illustrating a very common fallacy about how CPU caches work, and how the cache sub-system interacts with the execution cores.  In this article I will attempt to explain the function CPU caches fulfil, and how the cores, which execute our programs of instructions, interact with them.  For a concrete example I will dive into one of the latest Intel x86 server CPUs.  Other CPUs use similar techniques to achieve the same ends.

Most modern systems that execute our programs are shared-memory multi-processor systems in design.  A shared-memory system has a single memory resource that is accessed by 2 or more independent CPU cores.  Latency to main memory is highly variable from 10s to 100s of nanoseconds.  Within 100ns it is possible for a 3.0GHz CPU to process up to 1200 instructions.  Each Sandy Bridge core is capable of retiring up to 4 instructions-per-cycle (IPC) in parallel.  CPUs employ cache sub-systems to hide this latency and allow them to exercise their huge capacity to process instructions.  Some of these caches are small, very fast, and local to each core; others are slower, larger, and shared across cores.  Together with registers and main-memory, these caches make up our non-persistent memory hierarchy.

Next time you are developing an important algorithm, try pondering that a cache-miss is a lost opportunity to have executed ~500 CPU instructions!  This is for a single-socket system, on a multi-socket system you can effectively double the lost opportunity as memory requests cross socket interconnects.

Memory Hierarchy

Figure 1.

For the circa 2012 Sandy Bridge E class servers our memory hierarchy can be decomposed as follows:
  1. Registers: Within each core are separate register files containing 160 entries for integers and 144 floating point numbers.  These registers are accessible within a single cycle and constitute the fastest memory available to our execution cores.  Compilers will allocate our local variables and function arguments to these registers.  Compilers allocate to subset of registers know as the architectural registers, then the hardware expands on these as it runs instructions in parallel and out-of-order.  Compilers are aware of out-of-order and parallel execution ability for given processor, and order instruction streams and register allocation to take advantage of this.  When hyperthreading is enabled these registers are shared between the co-located hyperthreads.
  2. Memory Ordering Buffers (MOB): The MOB is comprised of a 64-entry load and 36-entry store buffer.  These buffers are used to track in-flight operations while waiting on the cache sub-system as instructions get executed out-of-order.  The store buffer is a fully associative queue that can be searched for existing store operations, which have been queued when waiting on the L1 cache.  These buffers enable our fast processors to run without blocking while data is transferred to and from the cache sub-system.  When the processor issues reads and writes they can can come back out-of-order.  The MOB is used to disambiguate the load and store ordering for compliance to the published memory model.  Instructions are executed out-of-order in addition to our loads and stores that can come back out-of-order from the cache sub-system.  These buffers enable an ordered view of the world to be re-constructed for what is expected according to the memory model.
  3. Level 1 Cache: The L1 is a core-local cache split into separate 32K data and 32K instruction caches.  Access time is 3 cycles and can be hidden as instructions are pipelined by the core for data already in the L1 cache.
  4. Level 2 Cache: The L2 cache is a core-local cache designed to buffer access between the L1 and the shared L3 cache.  The L2 cache is 256K in size and acts as an effective queue of memory accesses between the L1 and L3.  L2 contains both data and instructions.  L2 access latency is 12 cycles.  
  5. Level 3 Cache: The L3 cache is shared across all cores within a socket.  The L3 is split into 2MB segments each connected to a ring-bus network on the socket.  Each core is also connected to this ring-bus.  Addresses are hashed to segments for greater throughput.  Latency can be up to 38 cycles depending on cache size.  Cache size can be up to 20MB depending on the number of segments, with each additional hop around the ring taking an additional cycle.  The L3 cache is inclusive of all data in the L1 and L2 for each core on the same socket.  This inclusiveness, at the cost of space, allows the L3 cache to intercept requests thus removing the burden from private core-local L1 & L2 caches.
  6. Main Memory: DRAM channels are connected to each socket with an average latency of ~65ns for socket local access on a full cache-miss.  This is however extremely variable, being much less for subsequent accesses to columns in the same row buffer, through to significantly more when queuing effects and memory refresh cycles conflict.  4 memory channels are aggregated together on each socket for throughput, and to hide latency via  pipelining on the independent memory channels.
  7. NUMA: In a multi-socket server we have non-uniform memory access.  It is non-uniform because the required memory maybe on a remote socket having an additional 40ns hop across the QPI bus.  Sandy Bridge is a major step forward for 2-socket systems over Westmere and Nehalem.  With Sandy Bridge the QPI limit has been raised from 6.4GT/s to 8.0GT/s, and two lanes can be aggregated thus eliminating the bottleneck of the previous systems.  For Nehalem and Westmere the QPI link is only capable of ~40% the bandwidth that could be delivered by the memory controller for an individual socket.  This limitation made accessing remote memory a choke point.  In addition, the QPI link can now forward pre-fetch requests which previous generations could not.
Associativity Levels

Caches are effectively hardware based hash tables.  The hash function is usually a simple masking of some low-order bits for cache indexing.  Hash tables need some means to handle a collision for the same slot.  The associativity level is the number of slots, also known as ways or sets, which can be used to hold a hashed version of an address.  Having more levels of associativity is a trade off between storing more data vs. power requirements and time to search each of the ways.

For Sandy Bridge the L1D and L2 are 8-way associative, the L3 is 12-way associative.

Cache Coherence

With some caches being local to cores, we need a means of keeping them coherent so all cores can have a consistent view of memory.  The cache sub-system is considered the "source of truth" for mainstream systems.  If memory is fetched from the cache it is never stale; the cache is the master copy when data exists in both the cache and main-memory.  This style of memory management is known as write-back whereby data in the cache is only written back to main-memory when the cache-line is evicted because a new line is taking its place.  An x86 cache works on blocks of data that are 64-bytes in size, known as a cache-line.  Other processors can use a different size for the cache-line.  A larger cache-line size reduces effective latency at the expense of increased bandwidth requirements.

To keep the caches coherent the cache controller tracks the state of each cache-line as being in one of a finite number of states.  The protocol Intel employs for this is MESIF, AMD employs a variant know as MOESI.  Under the MESIF protocol each cache-line can be in 1 of the 5 following states:
  1. Modified: Indicates the cache-line is dirty and must be written back to memory at a later stage.  When written back to main-memory the state transitions to Exclusive.
  2. Exclusive: Indicates the cache-line is held exclusively and that it matches main-memory.  When written to, the state then transitions to Modified.  To achieve this state a Read-For-Ownership (RFO) message is sent which involves a read plus an invalidate broadcast to all other copies.
  3. Shared: Indicates a clean copy of a cache-line that matches main-memory.
  4. Invalid: Indicates an unused cache-line.
  5. Forward: Indicates a specialised version of the shared state i.e. this is the designated cache which should respond to other caches in a NUMA system.
To transition from one state to another, a series of messages are sent between the caches to effect state changes.  Previous to Nehalem for Intel, and Opteron for AMD, this cache coherence traffic between sockets had to share the memory bus which greatly limited scalability.  These days the memory controller traffic is on a separate bus.  The Intel QPI, and AMD HyperTransport, buses are used for cache coherence between sockets.

The cache controller exists as a module within each L3 cache segment that is connected to the on-socket ring-bus network.  Each core, L3 cache segment, QPI controller, memory controller, and integrated graphics sub-system are connected to this ring-bus.  The ring is made up of 4 independent lanes for: request, snoop, acknowledge, and 32-bytes data per cycle.  The L3 cache is inclusive in that any cache-line held in the L1 or L2 caches is also held in the L3.  This provides for rapid identification of the core containing a modified line when snooping for changes.  The cache controller for the L3 segment keeps track of which core could have a modified version of a cache-line it owns.

If a core wants to read some memory, and it does not have it in a Shared, Exclusive, or Modified state; then it must make a read on the ring bus.  It will then either be read from main-memory if not in the cache sub-systems, or read from L3 if clean, or snooped from another core if Modified.  In any case the read will never return a stale copy from the cache sub-system, it is guaranteed to be coherent.

Concurrent Programming

If our caches are always coherent then why do we worry about visibility when writing concurrent programs?  This is because within our cores, in their quest for ever greater performance, data modifications can appear out-of-order to other threads.  There are 2 major reasons for this.

Firstly, our compilers can generate programs that store variables in registers for relatively long periods of time for performance reasons, e.g. variables used repeatedly within a loop.  If we need these variables to be visible across cores then the updates must not be register allocated.  This is achieved in C by qualifying a variable as "volatile".  Beware that C/C++ volatile is inadequate for telling the compiler not to reorder other instructions.  For this you need memory fences/barriers.

The second major issue with ordering we have to be aware of is a thread could write a variable and then, if it reads it shortly after, could see the value in its store buffer which may be older than the latest value in the cache sub-system.  This is never an issue for algorithms following the Single Writer Principle.  The store buffer also allows a load instruction to get ahead of an older store and is thus an issue for the likes of the Dekker and Peterson lock algorithms.  To overcome these issues, the thread must not let a sequential consistent load get ahead of the sequentially consistent store of the value in the local store buffer.  This can be achieved by issuing a fence instruction.  The write of a volatile variable in Java, in addition to never being register allocated, is accompanied by a full fence instruction.  This fence instruction on x86 has a significant performance impact by preventing progress on the issuing thread until the store buffer is drained.  Fences on other processors can have more efficient implementations that simply put a marker in the store buffer for the search boundary, e.g. the Azul Vega does this.

If you want to ensure memory ordering across Java threads when following the Single Writer Principle, and avoid the store fence, it is possible by using the j.u.c.Atomic(Int|Long|Reference).lazySet() method, as opposed to setting a volatile variable.

The Fallacy

Returning to the fallacy of "flushing the cache" as part of a concurrent algorithm.  I think we can safely say that we never "flush" the CPU cache within our user space programs.  I believe the source of this fallacy is the need to flush, mark or drain to a point, the store buffer for some classes of concurrent algorithms so the latest value can be observed on a subsequent load operation.  For this we require a memory ordering fence and not a cache flush.

Another possible source of this fallacy is that L1 caches, or the TLB, may need to be flushed based on address indexing policy on a context switch.  ARM, previous to ARMv6, did not use address space tags on TLB entries thus requiring the whole L1 cache to be flushed on a context switch.  Many processors require the L1 instruction cache to be flushed for similar reasons, in many cases this is simply because instruction caches are not required to be kept coherent. The bottom line is, context switching is expensive and a bit off topic, so in addition to the cache pollution of the L2, a context switch can also cause the TLB and/or L1 caches to require a flush.  Intel x86 processors require only a TLB flush on context switch.

Friday, 25 January 2013

Further Adventures With CAS Instructions And Micro Benchmarking

In a previous article I reported what appeared to be a performance issue with CAS/LOCK instructions on the Sandy Bridge microarchitecture compared to the previous Nehalem microarchitecture.  Since then I've worked with the good people of Intel to understand what was going on and I'm now pleased to be able to shine some light on the previous results.

I observed a small drop in throughput with the uncontended single-thread case, and an order-of-magnitude decrease in throughput once multiple threads contend when performing updates.  This testing spawned out of observations testing Java Queue implementations and the Disruptor for the multi-producer case.  I was initially puzzled by these findings because almost every other performance test I applied to Sandy Bridge indicated a major step forward for this microarchitecture.

After digging deeper into this issue it has come to light that my tests have once again fallen fowl of the difficulties in micro-benchmarking.  My test is not a good means of testing throughput and it is actually testing fairness in a roundabout manner.  Let's revisit the code and work through what is going on.

Test Code
#include <time.h>
#include <pthread.h>
#include <stdlib.h>
#include <iostream>

typedef unsigned long long uint64;

const uint64 COUNT = 500 * 1000 * 1000;

volatile uint64 counter = 0;

void* run_add(void* numThreads)
{
    register uint64 value = (COUNT / *((int*)numThreads)) + 1;

    while (--value != 0)
    {
        __sync_add_and_fetch(&counter, 1);
    }
}

void* run_xadd(void*)
{
    register uint64 value = counter;

    while (value < COUNT)
    {
        value = __sync_add_and_fetch(&counter, 1);
    }
}

void* run_cas(void*)
{
    register uint64 value = 0;

    while (value < COUNT)
    {
        do
        {
            value = counter;
        }
        while (!__sync_bool_compare_and_swap(&counter, value, value + 1));
    }
}

void* run_cas2(void*)
{
    register uint64 value = 0;
    register uint64 next = 0;

    while (value < COUNT)
    {
        value = counter;
        do
        {
            next = value + 1;
            value = __sync_val_compare_and_swap(&counter, value, next);
        }
        while (value != next);
    }
}

int main (int argc, char *argv[])
{
    const int NUM_THREADS = atoi(argv[1]);
    const int TESTCASE = atoi(argv[2]);

    pthread_t threads[NUM_THREADS];
    void* status;

    timespec ts_start;
    timespec ts_finish;
    clock_gettime(CLOCK_MONOTONIC, &ts_start);


    for (int i = 0; i < NUM_THREADS; i++)
    {
        switch (TESTCASE)
        {
            case 1:
                std::cout << "LOCK ADD" << std::endl;
                pthread_create(&threads[i], NULL, run_add, (void*)&NUM_THREADS);
                break;

            case 2:
                std::cout << "LOCK XADD" << std::endl;
                pthread_create(&threads[i], NULL, run_xadd, (void*)&NUM_THREADS);
                break;

            case 3:
                std::cout << "LOCK CMPXCHG BOOL" << std::endl;
                pthread_create(&threads[i], NULL, run_cas, (void*)&NUM_THREADS);
                break;

            case 4:
                std::cout << "LOCK CMPXCHG VAL" << std::endl;
                pthread_create(&threads[i], NULL, run_cas2, (void*)&NUM_THREADS);
                break;

            default:
                exit(1);
        }
    }

    for (int i = 0; i < NUM_THREADS; i++)
    {
        pthread_join(threads[i], &status);
    }

    clock_gettime(CLOCK_MONOTONIC, &ts_finish);

    uint64 start = (ts_start.tv_sec * 1000000000) + ts_start.tv_nsec;
    uint64 finish = (ts_finish.tv_sec * 1000000000) + ts_finish.tv_nsec;
    uint64 duration = finish - start;

    std::cout << "threads = " << NUM_THREADS << std::endl;
    std::cout << "duration = " <<  duration << std::endl;
    std::cout << "ns per op = " << (duration / (COUNT * 2)) << std::endl;
    std::cout << "op/sec = " << ((COUNT * 2 * 1000 * 1000 * 1000) / duration) << std::endl;
    std::cout << "counter = " << counter << std::endl;

    return 0;
}
The code above makes it possible to test the major CAS based techniques on x86. For full clarity an objdump -d of the binary reveals the compiler generated assembly instructions for the above methods. The "lock" instruction in each section is where the atomic update is happening.
0000000000400dc0 <_z8run_cas2pv>:
  400dc0: 48 8b 05 d9 07 20 00  mov    0x2007d9(%rip),%rax      
  400dc7: 66 0f 1f 84 00 00 00  nopw   0x0(%rax,%rax,1)
  400dce: 00 00 
  400dd0: 48 8d 50 01           lea    0x1(%rax),%rdx
  400dd4: f0 48 0f b1 15 c3 07  lock cmpxchg %rdx,0x2007c3(%rip)
  400ddb: 20 00 
  400ddd: 48 39 c2              cmp    %rax,%rdx
  400de0: 75 ee                 jne    400dd0 <_z8run_cas2pv>
  400de2: 48 3d ff 64 cd 1d     cmp    $0x1dcd64ff,%rax
  400de8: 76 d6                 jbe    400dc0 <_z8run_cas2pv>
  400dea: f3 c3                 repz retq 
  400dec: 0f 1f 40 00           nopl   0x0(%rax)

0000000000400df0 <_z7run_caspv>:
  400df0: 48 8b 15 a9 07 20 00  mov    0x2007a9(%rip),%rdx     
  400df7: 48 8d 4a 01           lea    0x1(%rdx),%rcx
  400dfb: 48 89 d0              mov    %rdx,%rax
  400dfe: f0 48 0f b1 0d 99 07  lock cmpxchg %rcx,0x200799(%rip)  
  400e05: 20 00 
  400e07: 75 e7                 jne    400df0 <_z7run_caspv>
  400e09: 48 81 fa ff 64 cd 1d  cmp    $0x1dcd64ff,%rdx
  400e10: 76 de                 jbe    400df0 <_z7run_caspv>
  400e12: f3 c3                 repz retq 
  400e14: 66 66 66 2e 0f 1f 84  data32 data32 nopw %cs:0x0(%rax,%rax,1)
  400e1b: 00 00 00 00 00 

0000000000400e20 <_z8run_xaddpv>:
  400e20: 48 8b 05 79 07 20 00  mov    0x200779(%rip),%rax    
  400e27: 48 3d ff 64 cd 1d     cmp    $0x1dcd64ff,%rax
  400e2d: 77 1b                 ja     400e4a <_z8run_xaddpv>
  400e2f: 90                    nop
  400e30: b8 01 00 00 00        mov    $0x1,%eax
  400e35: f0 48 0f c1 05 62 07  lock xadd %rax,0x200762(%rip) 
  400e3c: 20 00 
  400e3e: 48 83 c0 01           add    $0x1,%rax
  400e42: 48 3d ff 64 cd 1d     cmp    $0x1dcd64ff,%rax
  400e48: 76 e6                 jbe    400e30 <_z8run_xaddp>
  400e4a: f3 c3                 repz retq 
  400e4c: 0f 1f 40 00           nopl   0x0(%rax)

0000000000400e50 <_z7run_addpv>:
  400e50: 48 63 0f              movslq (%rdi),%rcx
  400e53: 31 d2                 xor    %edx,%edx
  400e55: b8 00 65 cd 1d        mov    $0x1dcd6500,%eax
  400e5a: 48 f7 f1              div    %rcx
  400e5d: 48 85 c0              test   %rax,%rax
  400e60: 74 15                 je     400e77 <_z7run_addpv>
  400e62: 66 0f 1f 44 00 00     nopw   0x0(%rax,%rax,1)
  400e68: f0 48 83 05 2f 07 20  lock addq $0x1,0x20072f(%rip)    
  400e6f: 00 01 
  400e71: 48 83 e8 01           sub    $0x1,%rax
  400e75: 75 f1                 jne    400e68 <_z7run_addpv>
  400e77: f3 c3                 repz retq 
  400e79: 90                    nop
  400e7a: 90                    nop
  400e7b: 90                    nop
  400e7c: 90                    nop
  400e7d: 90                    nop
  400e7e: 90                    nop
  400e7f: 90                    nop
To purely isolate the performance of the CAS operation the test should be run using the lock xadd option for an atomic increment in hardware.  This instruction lets us avoid the spin-retry loop of a pure software CAS that can dirty the experiment.

I repeated the experiment from the previous article and got very similar results.  Previously, I thought I'd observed a throughput drop even in the uncontended single-threaded case.  So I focused in on this to confirm.  To do this I had to find two processors that once Turbo Boost had kicked in then the clock speeds would be comparable.  I found this by using a 2.8GHz Nehalem and 2.4GHz Sandy Bridge.  For the single-threaded case they are both operating at ~3.4GHz.
Nehalem 2.8GHz
==============
$ perf stat ./atomic_inc 1 2
LOCK XADD
threads = 1
duration = 3090445546
ns per op = 3
op/sec = 323577938

 Performance counter stats for './atomic_inc 1 2':

       3085.466216 task-clock                #    0.997 CPUs utilized          
               331 context-switches          #    0.107 K/sec                  
                 4 CPU-migrations            #    0.001 K/sec                  
               360 page-faults               #    0.117 K/sec                  
    10,527,264,923 cycles                    #    3.412 GHz                 
     9,394,575,677 stalled-cycles-frontend   #   89.24% frontend cycles idle
     7,423,070,202 stalled-cycles-backend    #   70.51% backend  cycles idle 
     2,517,668,293 instructions              #    0.24  insns per cycle        
                                             #    3.73  stalled cycles per insn
       503,526,119 branches                  #  163.193 M/sec                  
           110,695 branch-misses             #    0.02% of all branches       

       3.093402966 seconds time elapsed

Sandy Bridge 2.4GHz
===================
$ perf stat ./atomic_inc 1 2
LOCK XADD
threads = 1
duration = 3394221940
ns per op = 3
op/sec = 294618330

 Performance counter stats for './atomic_inc 1 2':

       3390.404400 task-clock                #    0.998 CPUs utilized          
               357 context-switches          #    0.105 K/sec                  
                 1 CPU-migrations            #    0.000 K/sec                  
               358 page-faults               #    0.106 K/sec                  
    11,522,932,068 cycles                    #    3.399 GHz                 
     9,542,667,311 stalled-cycles-frontend   #   82.81% frontend cycles idle  
     6,721,330,874 stalled-cycles-backend    #   58.33% backend  cycles idle  
     2,518,638,461 instructions              #    0.22  insns per cycle        
                                             #    3.79  stalled cycles per insn
       502,490,710 branches                  #  148.210 M/sec                  
            36,955 branch-misses             #    0.01% of all branches        

       3.398206155 seconds time elapsed

Analysis

So repeating the tests with comparable clock speeds confirmed the previous results.  The single-threaded case shows a ~10% drop in throughput and the multi-threaded contended case displays an order-of-magnitude difference in throughput.

Now the big question is what is going on and why has throughput dropped?  Well the single-threaded case suggests nothing major has happened to number of cycles required to execute the instruction when uncontended.  The small differences could be attributed to system noise or the changes in the CPU front-end for Sandy Bridge with introduction of the additional load address generation unit.

For the multi-threaded case we found an interesting surprise when Intel monitored what the instructions are doing.  We found that each thread on Nehalem was able to perform more updates in a batch before loosing the exclusive state on the cacheline containing the counter.  This is because the inter-core latency has improved with Sandy Bridge so other threads are able to faster claim the cacheline containing the counter to do their own updates.  What we are actually measuring with this micro-benchmark is how long a core can hold a cacheline before it is released to another core.  Sandy Bridge is exhibiting greater fairness which is what you'd want in a real world application.

This micro-benchmark is very unrealistic for a real world application.  Normally between performing counter updates a core would be doing a lot of other work.  At the point when the counter needs to be updated the reduced latency inter-core would then be a benefit.

In all my macro application benchmarks Sandy Bridge has proved to have better performance than Nehalem at comparable clock speeds.

Conclusion

What did I learn from this?  Well once again that writing micro-benchmarks is notoriously difficult.  It is so hard to know what you are measuring and what effects can come into play.  To illustrate how difficult it is to recognise such a flaw, for all those who have read this blog, no one has identified the issue and fed this back to me.

It also shows that what on first blush can be considered a performance bug is actually the opposite.  This shows how it is possible to have a second order effect when a performance improvement can make a specific work case run more slowly.