Saturday, 19 May 2012

Applying Back Pressure When Overloaded

How should a system respond when under sustained load?  Should it keep accepting requests until its response times follow the deadly hockey stick, followed by a crash?  All too often this is what happens unless a system is designed to cope with the case of more requests arriving than it is capable of processing.  If we are seeing a sustained arrival rate of requests, greater than our system is capable of processing, then something has to give.  Having the entire system degrade is not the ideal service we want to give our customers.  A better approach would be to process transactions at our systems maximum possible throughput rate, while maintaining a good response time, and rejecting requests above this arrival rate.

Let’s consider a small art gallery as an metaphor.  In this gallery the typical viewer spends on average 20 minutes browsing, and the gallery can hold a maximum of 30 viewers.  If more than 30 viewers occupy the gallery at the same time then customers become unhappy because they cannot have a clear view of the paintings.  If this happens they are unlikely to purchase or return.  To keep our viewers happy it is better to recommend that some viewers visit the café a few doors down and come back when the gallery is less busy.  This way the viewers in the gallery get to see all the paintings without other viewers in the way, and in the meantime those we cannot accommodate enjoy a coffee.  If we apply Little’s Law we cannot have customers arriving at more than 90 per hour, otherwise the maximum capacity is exceeded.  If between 9:00-10:00 they are arriving at 100 per hour, then I’m sure the café down the road will appreciate the extra 10 customers.

Within our systems the available capacity is generally a function of the size of our thread pools and time to process individual transactions.  These thread pools are usually fronted by queues to handle bursts of traffic above our maximum arrival rate.  If the queues are unbounded, and we have a sustained arrival rate above the maximum capacity, then the queues will grow unchecked.  As the queues grow they increasingly add latency beyond acceptable response times, and eventually they will consume all memory causing our systems to fail.  Would it not be better to send the overflow of requests to the café while still serving everyone else at the maximum possible rate?  We can do this by designing our systems to apply “Back Pressure”.

Figure 1.

Separation of concerns encourages good systems design at all levels.  I like to layer a design so that the gateways to third parties are separated from the main transaction services.  This can be achieved by having gateways responsible for protocol translation and border security only.  A typical gateway could be a web container running Servlets.  Gateways accept customer requests, apply appropriate security, and translate the channel protocols for forwarding to the transaction service hosting the domain model.  The transaction service may use a durable store if transactions need to be preserved.  For example, the state of a chat server domain model may not require preservation, whereas a model for financial transactions must be kept for many years for compliance and business reasons.

Figure 1. above is a simplified view of the typical request flow in many systems.  Pools of threads in a gateway accept user requests and forward them to a transaction service.  Let’s assume we have asynchronous transaction services fronted by an input and output queues, or similar FIFO structures.  If we want the system to meet a response time quality-of-service (QoS) guarantee, then we need to consider the three following variables:
  1. The time taken for individual transactions on a thread
  2. The number of threads in a pool that can execute transactions in parallel
  3. The length of the input queue to set the maximum acceptable latency
    max latency = (transaction time / number of threads) * queue length
    queue length = max latency / (transaction time / number of threads)

By allowing the queue to be unbounded the latency will continue to increase.  So if we want to set a maximum response time then we need to limit the queue length.

By bounding the input queue we block the thread receiving network packets which will apply back pressure up stream.  If the network protocol is TCP, similar back pressure is applied via the filling of network buffers, on the sender.  This process can repeat all the way back via the gateway to the customer.  For each service we need to configure the queues so that they do their part in achieving the required quality-of-service for the end-to-end customer experience.

One of the biggest wins I often find is to improve the time taken to process individual transaction latency.  This helps in the best and worst case scenarios.

Worst Case Scenario

Let’s say the queue is unbounded and the system is under sustained heavy load.  Things can begin to go wrong very quickly in subtle ways before memory is exhausted.  What do you think will happen when the queue is larger than the processor cache?  The consumer threads will be suffering cache misses just at the time when they are struggling to keep up, thus compounding the problem.  This can cause a system to get into trouble very quickly and eventually crash.  Under Linux this is particularly nasty because malloc, or one of its friends, will succeed because Linux allows “Over Commit” by default, then later at the point of using that memory, the OOM Killer will start shooting processes. When the OS starts shooting processes, you just know things are not going to end well!

What About Synchronous Designs?

You may say that with synchronous designs there are no queues.  Well not such obvious ones.  If you have a thread pool then it will have a lock, or semaphore, wait queues to assign threads.  If you are crazy enough to allocate a new thread on every request, then once you are over the huge cost of thread creation, your thread is in the run queue for a processor to execute.  Also, these queues involve context switches and condition variables which greatly increase the costs.  You just cannot run away from queues, they are everywhere!  Best to embrace them and design for the quality-of-service your system needs to deliver to its customers.  If we must have queues, then design for them, and maybe choose some nice lock-free ones with great performance.

When we need to support synchronous protocols like REST then use back pressure, signalled by our full incoming queue at the gateway, to send a meaningful “server busy” message such as the HTTP 503 status code.  The customer can then interpret this as time for a coffee and cake at the café down the road.

Subtleties To Watch Out For...

You need to consider the whole end-to-end service.  What if a client is very slow at consuming data from your system?  It could tie up a thread in the gateway taking it out of action.  Now you have less threads working the queue so the response time will be increasing.  Queues and threads need to be monitored, and appropriate action needs to be taken when thresholds are crossed.  For example, when a queue is 70% full, maybe an alert should be raised so an investigation can take place?  Also, transaction times need to be sampled to ensure they are in the expected range.

Summary

If we do not consider how our systems will behave when under heavy load then they will most likely seriously degrade at best, and at worst crash.  When they crash this way, we get to find out if there are any really evil data corruption bugs lurking in those dark places.  Applying back pressure is one effective technique for coping with sustained high-load, such that maximum throughput can be delivered without degrading system performance for the already accepted requests and transactions.

Sunday, 29 April 2012

Invoke Interface Optimisations

I'm often asked about the performance differences between Java, C, and C++, and which is better.  As with most things in life there is no black and white answer.  A lot is often discussed about how managed runtime based languages offer less performance than their statically compiled compatriots.  There are however a few tricks available to managed runtimes that can provide optimisation opportunities not available to statically optimised languages.

One such optimisation available to the runtime is to dynamically inline a method at the call site.  Many would say inlining is *the* major optimisation of dynamic languages.  This is an approach whereby the function/method call overhead can be avoided and further optimisations enabled.  Inlining can easily be done at compile, or run, time for static or private methods of a class because they cannot be overridden.  It can also be done by Hotspot at run time which is way more interesting.  In bytecode the runtime will see invokestatic and invokespecial opcodes for static and private methods respectively.  Methods that involve late binding, such as interface implementations and method overriding, appear as the invokeinterface and invokevirtual opcodes respectively.

At compile time it is not possible to determine how many implementations there will be for an interface, or how many classes will override a base method.  The compiler can have some awareness but just how do you deal with dynamically loaded classes via Class.forName("x").newInstance()?  The Hotspot runtime is very smart.  It can track all classes as they are loaded and apply appropriate optimisations to give the best possible performance for our code.  One such approach is dynamic inlining at the call site which we will explore.

Code

public interface Operation
{
    int map(int value);
}

public class IncOperation implements Operation
{
    public int map(final int value)
    {
        return value + 1;
    }
}

public class DecOperation implements Operation
{
    public int map(final int value)
    {
        return value - 1;
    }
}

public class StepIncOperation implements Operation
{
    public int map(final int value)
    {
        return value + 7;
    }
}

public class StepDecOperation implements Operation
{
    public int map(final int value)
    {
        return value - 3;
    }
}

public final class OperationPerfTest
{
    private static final int ITERATIONS = 50 * 1000 * 1000;

    public static void main(final String[] args)
        throws Exception
    {
        final Operation[] operations = new Operation[4];
        int index = 0;
        operations[index++] = new StepIncOperation();
        operations[index++] = new StepDecOperation();
        operations[index++] = new IncOperation();
        operations[index++] = new DecOperation();

        int value = 777;
        for (int i = 0; i < 3; i++)
        {
            System.out.println("*** Run each method in turn: loop " + i);

            for (final Operation operation : operations)
            {
                System.out.println(operation.getClass().getName()); 
                value = runTests(operation, value);
            }
        } 

        System.out.println("value = " + value);
    } 

    private static int runTests(final Operation operation, int value)
    {
        for (int i = 0; i < 10; i++)
        {
            final long start = System.nanoTime();

            value += opRun(operation, value);

            final long duration = System.nanoTime() - start;
            final long opsPerSec = 
                (ITERATIONS * 1000L * 1000L * 1000L) / duration;
            System.out.printf("    %,d ops/sec\n", opsPerSec);
        }

        return value;
    } 

    private static int opRun(final Operation operation, int value)
    {
        for (int i = 0; i < ITERATIONS; i++)
        {
            value += operation.map(value);
        } 

        return value;
    } 
}

Results


The following results are for running on a Linux 3.3.2 kernel with Oracle 1.7.0_02 server JVM on a Intel Sandy Bridge 2.4Ghz processor.

*** Run each method in turn: loop 0
StepIncOperation
    2,256,816,714 ops/sec
    2,245,800,936 ops/sec
    3,161,643,847 ops/sec
    3,100,375,269 ops/sec
    3,144,364,173 ops/sec
    3,091,009,138 ops/sec
    3,089,241,641 ops/sec
    3,153,922,056 ops/sec
    3,147,331,497 ops/sec
    3,076,211,099 ops/sec
StepDecOperation
    623,131,120 ops/sec
    659,686,236 ops/sec
    1,029,231,089 ops/sec
    1,021,060,933 ops/sec
    999,287,607 ops/sec
    1,015,432,172 ops/sec
    1,023,581,307 ops/sec
    1,019,266,750 ops/sec
    1,022,726,580 ops/sec
    1,004,237,016 ops/sec
IncOperation
    301,419,319 ops/sec
    304,712,250 ops/sec
    307,269,912 ops/sec
    308,519,923 ops/sec
    307,372,436 ops/sec
    306,230,247 ops/sec
    307,964,022 ops/sec
    306,243,292 ops/sec
    308,689,942 ops/sec
    365,152,716 ops/sec
DecOperation
    236,804,700 ops/sec
    237,912,786 ops/sec
    238,672,489 ops/sec
    278,745,901 ops/sec
    278,169,934 ops/sec
    277,979,158 ops/sec
    276,620,509 ops/sec
    278,349,766 ops/sec
    276,159,225 ops/sec
    278,578,373 ops/sec
*** Run each method in turn: loop 1
StepIncOperation
    276,054,944 ops/sec
    276,683,805 ops/sec
    276,551,970 ops/sec
    279,861,144 ops/sec
    275,543,192 ops/sec
    278,451,092 ops/sec
    275,399,262 ops/sec
    277,340,411 ops/sec
    274,529,616 ops/sec
    277,091,930 ops/sec
StepDecOperation
    279,729,066 ops/sec
    279,812,269 ops/sec
    276,478,587 ops/sec
    277,660,649 ops/sec
    276,844,441 ops/sec
    278,684,313 ops/sec
    277,791,665 ops/sec
    277,617,484 ops/sec
    278,575,241 ops/sec
    278,228,274 ops/sec
IncOperation
    277,724,770 ops/sec
    278,234,042 ops/sec
    276,798,434 ops/sec
    277,926,962 ops/sec
    277,786,824 ops/sec
    278,739,590 ops/sec
    275,286,293 ops/sec
    279,062,831 ops/sec
    276,672,019 ops/sec
    277,248,956 ops/sec
DecOperation
    277,303,150 ops/sec
    277,746,139 ops/sec
    276,245,511 ops/sec
    278,559,202 ops/sec
    274,683,406 ops/sec
    279,280,730 ops/sec
    276,174,620 ops/sec
    276,374,159 ops/sec
    275,943,446 ops/sec
    277,765,688 ops/sec
*** Run each method in turn: loop 2
StepIncOperation
    278,405,907 ops/sec
    278,713,953 ops/sec
    276,841,096 ops/sec
    277,891,660 ops/sec
    275,716,314 ops/sec
    277,474,242 ops/sec
    277,715,270 ops/sec
    277,857,014 ops/sec
    275,956,486 ops/sec
    277,675,378 ops/sec
StepDecOperation
    277,273,039 ops/sec
    278,101,972 ops/sec
    275,694,572 ops/sec
    276,312,449 ops/sec
    275,964,418 ops/sec
    278,423,621 ops/sec
    276,498,569 ops/sec
    276,593,475 ops/sec
    276,238,451 ops/sec
    277,057,568 ops/sec
IncOperation
    275,700,451 ops/sec
    277,463,507 ops/sec
    275,886,477 ops/sec
    277,546,096 ops/sec
    275,019,816 ops/sec
    278,242,287 ops/sec
    277,317,964 ops/sec
    277,252,014 ops/sec
    276,893,038 ops/sec
    277,601,325 ops/sec
DecOperation
    275,580,894 ops/sec
    280,146,646 ops/sec
    276,901,134 ops/sec
    276,672,567 ops/sec
    276,879,422 ops/sec
    278,674,196 ops/sec
    275,606,174 ops/sec
    278,132,534 ops/sec
    275,858,358 ops/sec
    279,444,112 ops/sec

What is going on here?


On the first iteration over the list of operations we see the performance degrade from ~3bn operations per second down to ~275m operations per second.  This happens in a step function with each new implementation loaded.  On the second, and subsequent, iteration over the array of operations, performance stabilised at ~275m operations per second.  What we are seeing here is how Hotspot can optimise when we have a limited number of implementations for an interface, and how it has to fall back to late bound method calls when many implementations are possible from a given call site.

If we run the JVM with -XX:+PrintCompilation we can see Hotspot choosing to compile the methods then de-optimise existing optimisations as new implementations get loaded.

     52    1             java.lang.String::hashCode (67 bytes)
     54    2             StepIncOperation::map (5 bytes)
     55    1 %           OperationPerfTest::opRun @ 2 (26 bytes)
     76    3             OperationPerfTest::opRun (26 bytes)
    223    3             OperationPerfTest::opRun (26 bytes)   made not entrant
    223    1 %           OperationPerfTest::opRun @ -2 (26 bytes)   made not entrant
    224    2 %           OperationPerfTest::opRun @ 2 (26 bytes)
    224    4             StepDecOperation::map (4 bytes)
    306    5             OperationPerfTest::opRun (26 bytes)
    772    2 %           OperationPerfTest::opRun @ -2 (26 bytes)   made not entrant
    772    3 %           OperationPerfTest::opRun @ 2 (26 bytes)
    773    6             IncOperation::map (4 bytes)
    930    5             OperationPerfTest::opRun (26 bytes)   made not entrant
   1995    7             OperationPerfTest::opRun (26 bytes)
   2293    8             DecOperation::map (4 bytes)
  11339    9             java.lang.String::indexOf (87 bytes)
  15017   10             java.lang.String::charAt (33 bytes)

The output above shows the decisions made by Hotspot as it compiles code.  When the third column contains the symbol "%" it is performing OSR (On Stack Replacement) of the method.  This is followed 4 times by the method being "made not entrant" as it is de-optimised when Hotspot discovers new implementations.  3 times the method is made not entrant for the newly discovered classes and once for removing the OSR version to be replaced by a non-OSR normal JIT'ed version when the final implementation is settled on.  Even greater detail can be seen by replacing -XX:+PrintCompilation with -XX:+UnlockDiagnosticVMOptions -XX:+LogCompilation.

For the monomorphic single implementation case, Hotspot can simply inline the method and place a trap in the code to fire if future implementations are loaded.  This gives performance very similar to no function call overhead. For the second bimorphic implementation, Hotspot can inline both methods and select the implementation based on a branch condition.  Beyond this things get tricky and jump tables are required to  resolve the method at runtime, thus making the code polymorphic or megamorphic.  The generated assembly code can be viewed with -XX:+UnlockDiagnosticVMOptions -XX:CompileCommand=print,OperationPerfTest.doRun JVM options for Java 7. The output shows the steps in compilation whereby not only is the method inlining deoptimised, Hotspot also no longer does loop unrolling for this method.

Conclusions


We can see that if an interface method has only one or two implementations then Hotspot can dynamically inline the method avoiding the function call overhead.  This would only be possible with profile guided optimisation for a language like C or C++.  We can also see that method calls are relatively cheap on a modern JVM, in the order of 12 cycles, even when we cannot avoid them.  It should be noted that the cost of method calls goes up by a few cycles for each additional argument passed.

In addition, I have observed that when a class implements multiple interfaces, with multiple methods, performance can degrade significantly because the method dispatch involves a linear search of method list to find the right implementation for dispatch.  Overridden methods from a base class do not involve this linear search but still require the jump table dispatch.  All the more reason to keep classes and interfaces simple.

Thursday, 22 March 2012

Fun with my-Channels Nirvana and Azul Zing

Since leaving LMAX I have been neglecting my blog a bit.  This is not because I have not been doing anything interesting.  Quite the opposite really, things have been so busy the blog has taken a back seat.  I’ve been consulting for a number of hedge funds and product companies, most of which are super secretive.

One company I have been spending quite a bit of time with is my-Channels, a messaging provider.  They are really cool and have given me their blessing to blog about some of the interesting things I’ve been working on for them.

For context, my-Channels are a messaging provider that specialise in delivering data to every device known to man over dodgy networks such as the Internet or your corporate WAN.  They can deliver live financial market data to your desktop, laptop at home, or your iPhone, at the fastest possible rates.  Lately, they have made the strategic move to enter the low-latency messaging space for the enterprise, and as part of this they have enlisted my services.  They want to go low-latency without giving up the rich functionality their product offers which is giving me some interesting challenges.

Just how bad is the latency of such a product when new to the low-latency space?  I did not have high expectations because to be fair this was never their goal.  After some initial tests, I’m thinking these guys are not in bad shape.  They beat the crap out of most JMS implementations and it is going to be fun pushing them to the serious end of the low-latency space. 

OK enough of the basic tests, now it is time to get serious.  I worked with them to create appropriate load tests and get the profilers running.  No big surprises here, when we piled on the pressure, lock-contention came out as the biggest culprit limiting both latency and throughput.  As we go down the list, lots of other interesting things showed up but let’s follow good discipline and start at the top of the list.

Good discipline for “Theory of Constraints” states that you always work on the most limiting factor because when it is removed the list below it can change radically as new pressures are applied.  So to address this contention issue we developed a new lock-free Executor to replace the standard Java implementation.  Tests showed this new executor is ~10X better than what the JDK has to offer.  We integrated the new Executor into the code base and now the throughput bottleneck has been massively changed.  The system can now cope with 16X more throughput, and the latency histogram has become much more compressed.  This is a good example of how macro-benchmarking is so much more valuable than micro-benchmarking.  Not a bad start we are all thinking.

Enter Azul Stage Left

We tested on all the major JVMs and the most predictable latency was achieved with Azul Zing.  Zing had by far the best latency profile with virtually no long tail.  For many of the tests it also had the greatest throughput.

After the lock contention on the Executor issue had been resolved, the next big bottleneck when load testing on the same machine was being limited by using TCP between processes over the loopback adapter.  We discussed developing a new transport that was not network based for Nirvana.  For this we decided to apply a number of the techniques I teach on my lock-free concurrency course.  This resulted in a new IPC transport based on shared memory via memory-mapped files in Java.  We did inter-server testing using 10GigE networks, and had a fun using the new Solarflare network adapters with OpenOnload, but for this article I’ll stick with the Java story.  I think Paul is still sore from me stuffing his little Draytek ADSL router with huge amounts of multicast traffic when the poor thing was connected to our 10GigE test LAN.  Sorry Paul!

Developing the IPC transport unearthed a number of challenges with various JVM implementations of MappedByteBuffer.  After some very useful chats with Cliff Click and Doug Lea we came up with a solution that worked across all JVMs.   This solution has a mean latency of ~100ns on the best JVMs and can do ~12-22 million messages per second throughput for 60-byte messages depending on the JVM.  This was the first time we had found a test whereby Azul was not close to being the fastest.   I isolated a test case and sent it to them on a Friday.  On Sunday evening I got an email from Gil Tene saying he had identified the issue and by Tuesday Cliff Click had a fix that we tried the next week.  When we tested the new Azul JVM, we seen over 40 million messages per second at latencies just over 100ns for our new IPC transport.  I had been teasing Azul that this must be possible in Java because I’d created similar algorithms in C and assembler that show what the x86_64 platform is capable of.

I’m starting to ramble but we had great fun removing latency through many parts of the stack.  When I get more time I will blog about some of the other findings.  The current position is still a work in progress with daily progress on an amazing scale.  The guys at my-Channels are very conservative and do not want to publish actual figures until they have version 7.0 of Nirvana ready for GA, and have done more comprehensive testing.  For now they are happy with me being open about the following:
  • Throughput increased 32X due to the implementation of lock-free techniques and optimising the call stack for message handling to remove any shared dependencies.
  • Average latency decreased 20X from applying the same techniques and we have identified many more possible improvements.
  • We know the raw transport for IPC is now ~100ns and the worst case pause due to GC is 80µs with Azul Zing.  As to the latency for the double hop between a producer and consumer over IPC, via their broker, I’ll leave to your imagination as somewhere between those figures until the guys are willing to make an official announcement.  As you can guess it is much much less than 80µs.
For me the big surprise was GC pauses only taking 80µs in the worst case.  OS scheduling alone I have seen result in more jitter.  I discussed this at length with Gil Tene from Azul, and even he was surprised.  He expects some worst case scenarios with their JVM to be 1-2ms for a well behaved application.  We then explored the my-Channels setup, and it turns out we have done everything almost perfectly to get the best out of a JVM which is worth sharing.
  1. Do not use locks in the main transaction flow because they cause context switches, and therefore latency and unpredictable jitter.
  2. Never have more threads that need to run than you have cores available.
  3. Set affinity of threads to cores, or at least sockets, to avoid cache pollution by avoiding migration.  This is particularly important when on a server class machine having multiple sockets because of the NUMA effect.
  4. Ensure uncontested access to any resource respecting the Single Writer Principle so that the likes of biased locking can be your friend.
  5. Keep call stacks reasonably small.  Still more work to do here.  If you are crazy enough to use Spring, then check out your call stacks to see what I mean!  The garbage collector has to walk them finding reachable objects.
  6. Do not use finalizers.
  7. Keep garbage generation to modest levels.  This applies to most JVMs but is likely not an issue for Zing.
  8. Ensure no disk IO on the main flow.
  9. Do a proper warm-up before beginning to measure.
  10. Do all the appropriate OS tunings for low-latency systems that are way beyond this blog.  For example turn off C-States power management in the BIOS and watch out for RHEL 6 as it turns it back on without telling you!
It should be noted that we ran this on some state of the art Intel CPUs with very large L3 caches.  It is possible to get 20-30MB L3 caches on a single socket these days.  It is very likely that our entire application was running out of L3 cache with the exception of the message flow which is very predictable.

Gil has added a cautionary note that while these results are very impressive we had a team focused on this issue with the appropriate skills to get the best out of the application.  It is not the usual case for every client to apply this level of focus.

What I’ve taken from this experience is the amazing things that can be achieved by truly agile companies, staffed by talented individuals, who are empowered to make things happen.  I love agile development but it has become a religion to some people who are more interested in following the “true” process than doing what is truly needed.  Both my-Channels and Azul have shown during this engagement what is possible in making s*#t happen.  It has been an absolute blast working with individuals who can assimilate information and ideas so fast, then turn them into working software.  For this I will embarrass Matt Buckton at my-Channels, and Gil Tene & Cliff Click at Azul who never failed in rising to a challenge.  So few organisations could have made so much progress over such a short time period.  If you think Java cannot cut it in the high performance space, then deal with one of these two companies, and you will be thinking again.  I bet a few months ago Matt never thought he’d be sitting in Singapore airport writing his first multi-producer lock-free queue when travelling home, and really enjoying it.

Monday, 26 December 2011

Java Sequential IO Performance

Many applications record a series of events to file-based storage for later use.  This can be anything from logging and auditing, through to keeping a transaction redo log in an event sourced design or its close relative CQRS

Java has a number of means by which a file can be sequentially written to, or read back again.  This article explores some of these mechanisms to understand their performance characteristics.  For the scope of this article I will be using pre-allocated files because I want to focus on performance.  Constantly extending a file imposes a significant performance overhead and adds jitter to an application resulting in highly variable latency.  "Why is a pre-allocated file better performance?", I hear you ask.  Well, on disk a file is made up from a series of blocks/pages containing the data.  Firstly, it is important that these blocks are contiguous to provide fast sequential access.   Secondly, meta-data must be allocated to describe this file on disk and saved within the file-system.  A typical large file will have a number of "indirect" blocks allocated to describe the chain of data-blocks containing the file contents that make up part of this meta-data.   I'll leave it as an exercise for the reader, or maybe a later article, to explore the performance impact of not preallocating the data files.  If you have used a database you may have noticed that it preallocates the files it will require.

The Test

I want to experiment with 2 file sizes.  One that is sufficiently large to test sequential access, but can easily fit in the file-system cache, and another that is much larger so that the cache subsystem is forced to retire pages so that new ones can be loaded.  For these two cases I'll use 400MB and 8GB respectively.  I'll also loop over the files a number of times to show the pre and post warm-up characteristics.

I'll test 4 means of writing and reading back files sequentially:
  1. RandomAccessFile using a vanilla byte[] of page size.
  2. Buffered FileInputStream and FileOutputStream.
  3. NIO FileChannel with ByteBuffer of page size.
  4. Memory mapping a file using NIO and direct MappedByteBuffer.
The tests are run on a 2.0Ghz Sandy Bridge CPU with 8GB RAM, an Intel 320 SSD on Fedora Core 15 64-bit Linux with an ext4 file system, and Oracle JDK 1.6.0_30.

The Code
import java.io.*;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

import static java.lang.Integer.MAX_VALUE;
import static java.lang.System.out;
import static java.nio.channels.FileChannel.MapMode.READ_ONLY;
import static java.nio.channels.FileChannel.MapMode.READ_WRITE;

public final class TestSequentialIoPerf
{
    public static final int PAGE_SIZE = 1024 * 4;
    public static final long FILE_SIZE = PAGE_SIZE * 2000L * 1000L;
    public static final String FILE_NAME = "test.dat";
    public static final byte[] BLANK_PAGE = new byte[PAGE_SIZE];

    public static void main(final String[] arg) throws Exception
    {
        preallocateTestFile(FILE_NAME);

        for (final PerfTestCase testCase : testCases)
        {
            for (int i = 0; i < 5; i++)
            {
                System.gc();
                long writeDurationMs = testCase.test(PerfTestCase.Type.WRITE,
                                                     FILE_NAME);

                System.gc();
                long readDurationMs = testCase.test(PerfTestCase.Type.READ,
                                                    FILE_NAME);

                long bytesReadPerSec = (FILE_SIZE * 1000L) / readDurationMs;
                long bytesWrittenPerSec = (FILE_SIZE * 1000L) / writeDurationMs;

                out.format("%s\twrite=%,d\tread=%,d bytes/sec\n",
                           testCase.getName(),
                           bytesWrittenPerSec, bytesReadPerSec);
            }
        }

        deleteFile(FILE_NAME);
    }

    private static void preallocateTestFile(final String fileName)
        throws Exception
    {
        RandomAccessFile file = new RandomAccessFile(fileName, "rw");

        for (long i = 0; i < FILE_SIZE; i += PAGE_SIZE)
        {
            file.write(BLANK_PAGE, 0, PAGE_SIZE);
        }

        file.close();
    }

    private static void deleteFile(final String testFileName) throws Exception
    {
        File file = new File(testFileName);
        if (!file.delete())
        {
            out.println("Failed to delete test file=" + testFileName);
            out.println("Windows does not allow mapped files to be deleted.");
        }
    }

    public abstract static class PerfTestCase
    {
        public enum Type { READ, WRITE }

        private final String name;
        private int checkSum;

        public PerfTestCase(final String name)
        {
            this.name = name;
        }

        public String getName()
        {
            return name;
        }

        public long test(final Type type, final String fileName)
        {
            long start = System.currentTimeMillis();

            try
            {
                switch (type)
                {
                    case WRITE:
                    {
                        checkSum = testWrite(fileName);
                        break;
                    }

                    case READ:
                    {
                        final int checkSum = testRead(fileName);
                        if (checkSum != this.checkSum)
                        {
                            final String msg = getName() +
                                " expected=" + this.checkSum +
                                " got=" + checkSum;
                            throw new IllegalStateException(msg);
                        }
                        break;
                    }
                }
            }
            catch (Exception ex)
            {
                ex.printStackTrace();
            }

            return System.currentTimeMillis() - start;
        }

        public abstract int testWrite(final String fileName) throws Exception;
        public abstract int testRead(final String fileName) throws Exception;
    }

    private static PerfTestCase[] testCases =
    {
        new PerfTestCase("RandomAccessFile")
        {
            public int testWrite(final String fileName) throws Exception
            {
                RandomAccessFile file = new RandomAccessFile(fileName, "rw");
                final byte[] buffer = new byte[PAGE_SIZE];
                int pos = 0;
                int checkSum = 0;

                for (long i = 0; i < FILE_SIZE; i++)
                {
                    byte b = (byte)i;
                    checkSum += b;

                    buffer[pos++] = b;
                    if (PAGE_SIZE == pos)
                    {
                        file.write(buffer, 0, PAGE_SIZE);
                        pos = 0;
                    }
                }

                file.close();

                return checkSum;
            }

            public int testRead(final String fileName) throws Exception
            {
                RandomAccessFile file = new RandomAccessFile(fileName, "r");
                final byte[] buffer = new byte[PAGE_SIZE];
                int checkSum = 0;
                int bytesRead;

                while (-1 != (bytesRead = file.read(buffer)))
                {
                    for (int i = 0; i < bytesRead; i++)
                    {
                        checkSum += buffer[i];
                    }
                }

                file.close();

                return checkSum;
            }
        },

        new PerfTestCase("BufferedStreamFile")
        {
            public int testWrite(final String fileName) throws Exception
            {
                int checkSum = 0;
                OutputStream out = 
                    new BufferedOutputStream(new FileOutputStream(fileName));

                for (long i = 0; i < FILE_SIZE; i++)
                {
                    byte b = (byte)i;
                    checkSum += b;
                    out.write(b);
                }

                out.close();

                return checkSum;
            }

            public int testRead(final String fileName) throws Exception
            {
                int checkSum = 0;
                InputStream in = 
                    new BufferedInputStream(new FileInputStream(fileName));

                int b;
                while (-1 != (b = in.read()))
                {
                    checkSum += (byte)b;
                }

                in.close();

                return checkSum;
            }
        },


        new PerfTestCase("BufferedChannelFile")
        {
            public int testWrite(final String fileName) throws Exception
            {
                FileChannel channel = 
                    new RandomAccessFile(fileName, "rw").getChannel();
                ByteBuffer buffer = ByteBuffer.allocate(PAGE_SIZE);
                int checkSum = 0;

                for (long i = 0; i < FILE_SIZE; i++)
                {
                    byte b = (byte)i;
                    checkSum += b;
                    buffer.put(b);

                    if (!buffer.hasRemaining())
                    {
                        buffer.flip();
                        channel.write(buffer);
                        buffer.clear();
                    }
                }

                channel.close();

                return checkSum;
            }

            public int testRead(final String fileName) throws Exception
            {
                FileChannel channel = 
                    new RandomAccessFile(fileName, "rw").getChannel();
                ByteBuffer buffer = ByteBuffer.allocate(PAGE_SIZE);
                int checkSum = 0;

                while (-1 != (channel.read(buffer)))
                {
                    buffer.flip();

                    while (buffer.hasRemaining())
                    {
                        checkSum += buffer.get();
                    }

                    buffer.clear();
                }

                return checkSum;
            }
        },

        new PerfTestCase("MemoryMappedFile")
        {
            public int testWrite(final String fileName) throws Exception
            {
                FileChannel channel = 
                    new RandomAccessFile(fileName, "rw").getChannel();
                MappedByteBuffer buffer = 
                    channel.map(READ_WRITE, 0,
                                Math.min(channel.size(), MAX_VALUE));
                int checkSum = 0;

                for (long i = 0; i < FILE_SIZE; i++)
                {
                    if (!buffer.hasRemaining())
                    {
                        buffer = 
                            channel.map(READ_WRITE, i,
                                        Math.min(channel.size() - i , MAX_VALUE));
                    }

                    byte b = (byte)i;
                    checkSum += b;
                    buffer.put(b);
                }

                channel.close();

                return checkSum;
            }

            public int testRead(final String fileName) throws Exception
            {
                FileChannel channel = 
                    new RandomAccessFile(fileName, "rw").getChannel();
                MappedByteBuffer buffer = 
                    channel.map(READ_ONLY, 0,
                                Math.min(channel.size(), MAX_VALUE));
                int checkSum = 0;

                for (long i = 0; i < FILE_SIZE; i++)
                {
                    if (!buffer.hasRemaining())
                    {
                        buffer = 
                            channel.map(READ_WRITE, i,
                                        Math.min(channel.size() - i , MAX_VALUE));
                    }

                    checkSum += buffer.get();
                }

                channel.close();

                return checkSum;
            }
        },
    };
}
Results

400MB file
===========
RandomAccessFile    write=379,610,750   read=1,452,482,269 bytes/sec
RandomAccessFile    write=294,041,636   read=1,494,890,510 bytes/sec
RandomAccessFile    write=250,980,392   read=1,422,222,222 bytes/sec
RandomAccessFile    write=250,366,748   read=1,388,474,576 bytes/sec
RandomAccessFile    write=260,394,151   read=1,422,222,222 bytes/sec

BufferedStreamFile  write=98,178,331    read=286,433,566 bytes/sec
BufferedStreamFile  write=100,244,738   read=288,857,545 bytes/sec
BufferedStreamFile  write=82,948,562    read=154,100,827 bytes/sec
BufferedStreamFile  write=108,503,311   read=153,869,271 bytes/sec
BufferedStreamFile  write=113,055,478   read=152,608,047 bytes/sec

BufferedChannelFile write=228,443,948   read=356,173,913 bytes/sec
BufferedChannelFile write=265,629,053   read=374,063,926 bytes/sec
BufferedChannelFile write=223,825,136   read=1,539,849,624 bytes/sec
BufferedChannelFile write=232,992,036   read=1,539,849,624 bytes/sec
BufferedChannelFile write=212,779,220   read=1,534,082,397 bytes/sec

MemoryMappedFile    write=300,955,180   read=305,899,925 bytes/sec
MemoryMappedFile    write=313,149,847   read=310,538,286 bytes/sec
MemoryMappedFile    write=326,374,501   read=303,857,566 bytes/sec
MemoryMappedFile    write=327,680,000   read=304,535,315 bytes/sec
MemoryMappedFile    write=326,895,450   read=303,632,320 bytes/sec

8GB File
============
RandomAccessFile    write=167,402,321   read=251,922,012 bytes/sec
RandomAccessFile    write=193,934,802   read=257,052,307 bytes/sec
RandomAccessFile    write=192,948,159   read=248,460,768 bytes/sec
RandomAccessFile    write=191,814,180   read=245,225,408 bytes/sec
RandomAccessFile    write=190,635,762   read=275,315,073 bytes/sec

BufferedStreamFile  write=154,823,102   read=248,355,313 bytes/sec
BufferedStreamFile  write=152,083,913   read=253,418,301 bytes/sec
BufferedStreamFile  write=133,099,369   read=146,056,197 bytes/sec
BufferedStreamFile  write=131,065,708   read=146,217,827 bytes/sec
BufferedStreamFile  write=132,694,052   read=148,116,004 bytes/sec

BufferedChannelFile write=186,703,740   read=215,075,218 bytes/sec
BufferedChannelFile write=190,591,410   read=211,030,680 bytes/sec
BufferedChannelFile write=187,220,038   read=223,087,606 bytes/sec
BufferedChannelFile write=191,585,397   read=221,297,747 bytes/sec
BufferedChannelFile write=192,653,214   read=211,789,038 bytes/sec

MemoryMappedFile    write=123,023,322   read=231,530,156 bytes/sec
MemoryMappedFile    write=121,961,023   read=230,403,600 bytes/sec
MemoryMappedFile    write=123,317,778   read=229,899,250 bytes/sec
MemoryMappedFile    write=121,472,738   read=231,739,745 bytes/sec
MemoryMappedFile    write=120,362,615   read=231,190,382 bytes/sec

Analysis

For years I was a big fan of using RandomAccessFile directly because of the control it gives and the predictable execution.  I never found using buffered streams to be useful from a performance perspective and this still seems to be the case.

In more recent testing I've found that using NIO FileChannel and ByteBuffer are doing much better. With Java 7 the flexibility of this programming approach has been improved for random access with SeekableByteChannel.

It seems that for reading RandomAccessFile and NIO do very well with Memory Mapped files winning for writes in some cases.

I've seen these results vary greatly depending on platform.  File system, OS, storage devices, and available memory all have a significant impact.  In a few cases I've seen memory-mapped files perform significantly better than the others but this needs to be tested on your platform because your mileage may vary...

A special note should be made for the use of memory-mapped large files when pushing for maximum throughput.  I've often found the OS can become unresponsive due the the pressure put on the virtual memory sub-system.

Conclusion

There is a significant difference in performance for the different means of doing sequential file IO from Java.  Not all methods are even remotely equal.  For most IO I've found the use of ByteBuffers and Channels to be the best optimised parts of the IO libraries.  If buffered streams are your IO libraries of choice, then it is worth branching out and and getting familiar with the implementations of Channel and Buffer or even falling back and using the good old RandomAccessFile.

Tuesday, 22 November 2011

Biased Locking, OSR, and Benchmarking Fun

After my last post on Java Lock Implementations, I got a lot of good feedback about my results and micro-benchmark design approach.  As a result I now understand JVM warmup, On Stack Replacement (OSR) and Biased Locking somewhat better than before.  Special thanks to Dave Dice from Oracle, and Cliff Click & Gil Tene from Azul, for their very useful feedback.

In the last post I concluded, based on my experiments, that biased locking was no longer necessary on modern CPUs.  While this conclusion is understandable given the data gathered in the experiment, it was not valid because the experiment did not take account of some JVM warm up behaviour that I was unaware of.

In this post I will re-run the experiment taking into account the feedback and present some new results.  I shall also expand on the changes I've made to the test and why it is important to consider the JVM warm-up behaviour when writing micro-benchmarks, or even very lean Java applications with quick start up time.

On Stack Replacement (OSR)

Java virtual machines will compile code to achieve greater performance based on runtime profiling.  Some VMs run an interpreter for the majority of code and replace hot areas with compiled code following the 80/20 rule.  Other VMs compile all code simply at first then replace the simple code with more optimised code based on profiling.  Oracle Hotspot and Azul are examples of the first type and Oracle JRockit is an example of the second.

Oracle Hotspot will count invocations of a method return plus branch backs for loops in that method, and if this exceeds 10K in server mode the method will be compiled.  The compiled code on normal JIT'ing can be used when the method is next called.  However if a loop is still iterating it may make sense to replace the method before the loop completes, especially if it has many iterations to go.  OSR is the means by which a method gets replaced with a compiled version part way through iterating a loop.

I was under the impression that normal JIT'ing and OSR would result in similar code.  Cliff Click pointed out that it is much harder for a runtime to optimise a loop part way through, and especially difficult if nested.  For example, bounds checking within the loop may not be possible to eliminate. Cliff will blog in more detail on this shortly.

What this means is that you are likely to get better optimised code by doing a small number of shorter warm ups than a single large one.  You can see in the code below how I do 10 shorter runs in a loop before the main large run compared to the last article where I did a single large warm-up run.

Biased Locking

Dave Dice pointed out that Hotspot does not enable objects for biased locking in the first few seconds (4s at present) of JVM startup. This is because some benchmarks, and NetBeans, have a lot of thread contention on start up and the revocation cost is significant.

All objects by default are created with biased locking enabled in Oracle Hotspot after the first few seconds of start-up delay, and can be configured with -XX:BiasedLockingStartupDelay=0.

This point, combined with knowing more about OSR, is important for micro-benchmarks.  It is also important to be aware of these points if you have a lean Java application that starts in a few seconds.

The Code
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;
import java.util.concurrent.CyclicBarrier;

import static java.lang.System.out;

public final class TestLocks implements Runnable
{
    public enum LockType {JVM, JUC}
    public static LockType lockType;

    public static final long WARMUP_ITERATIONS = 100L * 1000L;
    public static final long ITERATIONS = 500L * 1000L * 1000L;
    public static long counter = 0L;

    public static final Object jvmLock = new Object();
    public static final Lock jucLock = new ReentrantLock();
    private static int numThreads;

    private final long iterationLimit;
    private final CyclicBarrier barrier;

    public TestLocks(final CyclicBarrier barrier, final long iterationLimit)
    {
        this.barrier = barrier;
        this.iterationLimit = iterationLimit;
    }

    public static void main(final String[] args) throws Exception
    {
        lockType = LockType.valueOf(args[0]);
        numThreads = Integer.parseInt(args[1]);

        for (int i = 0; i < 10; i++)
        {
            runTest(numThreads, WARMUP_ITERATIONS);
            counter = 0L;
        }

        final long start = System.nanoTime();
        runTest(numThreads, ITERATIONS);
        final long duration = System.nanoTime() - start;

        out.printf("%d threads, duration %,d (ns)\n", numThreads, duration);
        out.printf("%,d ns/op\n", duration / ITERATIONS);
        out.printf("%,d ops/s\n", (ITERATIONS * 1000000000L) / duration);
        out.println("counter = " + counter);
    }

    private static void runTest(final int numThreads, final long iterationLimit)
        throws Exception
    {
        CyclicBarrier barrier = new CyclicBarrier(numThreads);
        Thread[] threads = new Thread[numThreads];

        for (int i = 0; i < threads.length; i++)
        {
            threads[i] = new Thread(new TestLocks(barrier, iterationLimit));
        }

        for (Thread t : threads)
        {
            t.start();
        }

        for (Thread t : threads)
        {
            t.join();
        }
    }

    public void run()
    {
        try
        {
            barrier.await();
        }
        catch (Exception e)
        {
            // don't care
        }

        switch (lockType)
        {
            case JVM: jvmLockInc(); break;
            case JUC: jucLockInc(); break;
        }
    }

    private void jvmLockInc()
    {
        long count = iterationLimit / numThreads;
        while (0 != count--)
        {
            synchronized (jvmLock)
            {
                ++counter;
            }
        }
    }

    private void jucLockInc()
    {
        long count = iterationLimit / numThreads;
        while (0 != count--)
        {
            jucLock.lock();
            try
            {
                ++counter;
            }
            finally
            {
                jucLock.unlock();
            }
        }
    }
}

Script to run tests:

set -x
for i in {1..8}
do 
    java -server -XX:-UseBiasedLocking TestLocks JVM $i
done

for i in {1..8}
do 
    java -server -XX:+UseBiasedLocking -XX:BiasedLockingStartupDelay=0 TestLocks JVM $i
done

for i in {1..8}
do 
    java -server TestLocks JUC $i
done

Results

The tests are carried out with 64-bit Linux (Fedora Core 15) and Oracle JDK 1.6.0_29.

Nehalem 2.8GHz - Ops/Sec
Threads-UseBiasedLocking+UseBiasedLockingReentrantLock
153,283,461
450,950,969
62,876,566
218,519,295
18,108,615
10,217,186
313,349,605
13,416,198
14,108,622
48,120,172
8,040,773
14,207,310
54,725,114
4,551,766
14,302,683
65,133,706
5,246,548
14,676,616
75,473,652
5,585,666
18,145,525
85,514,056
5,414,171
19,010,725


Sandy Bridge 2.0GHz - Ops/Sec
Threads-UseBiasedLocking+UseBiasedLockingReentrantLock
1
34,500,407
396,511,324
43,148,808
2
20,899,076
19,742,639
6,038,923
3
9,288,039
11,957,032
24,147,807
4
5,618,862
5,589,289
9,082,961
5
5,609,932
5,592,574
9,389,243
6
5,742,907
5,760,558
12,518,728
7
6,699,201
6,641,886
13,684,475
8
6,957,824
6,925,410
14,819,005

Observations
  1. Biased locking has a huge benefit in the un-contended single threaded case.
  2. Biased locking when un-contended, and not revoked, only adds 4-5 cycles of cost.  This is the cost when having a cache hit for the lock structures, on top of the code protected in the critical section.
  3. -XX:BiasedLockingStartupDelay=0 needs to be set for lean applications and micro-benchmarks.
  4. Avoiding OSR does not make a material difference to this set of test results.  This is likely to be because the loop is so simple or other costs are dominating.
  5. For the current implementations, ReentrantLocks scale better than synchronised locks under contention, except in the case of 2 contending threads.
Conclusion

My tests in the last post are invalid for the testing of an un-contended biased lock, because the lock was not actually biased.  If you are designing code following the single writer principle, and therefore having un-contended locks when using 3rd party libraries, then having biased locking enabled is a significant performance boost.