Sunday, 29 April 2012

Invoke Interface Optimisations

I'm often asked about the performance differences between Java, C, and C++, and which is better.  As with most things in life there is no black and white answer.  A lot is often discussed about how managed runtime based languages offer less performance than their statically compiled compatriots.  There are however a few tricks available to managed runtimes that can provide optimisation opportunities not available to statically optimised languages.

One such optimisation available to the runtime is to dynamically inline a method at the call site.  Many would say inlining is *the* major optimisation of dynamic languages.  This is an approach whereby the function/method call overhead can be avoided and further optimisations enabled.  Inlining can easily be done at compile, or run, time for static or private methods of a class because they cannot be overridden.  It can also be done by Hotspot at run time which is way more interesting.  In bytecode the runtime will see invokestatic and invokespecial opcodes for static and private methods respectively.  Methods that involve late binding, such as interface implementations and method overriding, appear as the invokeinterface and invokevirtual opcodes respectively.

At compile time it is not possible to determine how many implementations there will be for an interface, or how many classes will override a base method.  The compiler can have some awareness but just how do you deal with dynamically loaded classes via Class.forName("x").newInstance()?  The Hotspot runtime is very smart.  It can track all classes as they are loaded and apply appropriate optimisations to give the best possible performance for our code.  One such approach is dynamic inlining at the call site which we will explore.

Code

public interface Operation
{
    int map(int value);
}

public class IncOperation implements Operation
{
    public int map(final int value)
    {
        return value + 1;
    }
}

public class DecOperation implements Operation
{
    public int map(final int value)
    {
        return value - 1;
    }
}

public class StepIncOperation implements Operation
{
    public int map(final int value)
    {
        return value + 7;
    }
}

public class StepDecOperation implements Operation
{
    public int map(final int value)
    {
        return value - 3;
    }
}

public final class OperationPerfTest
{
    private static final int ITERATIONS = 50 * 1000 * 1000;

    public static void main(final String[] args)
        throws Exception
    {
        final Operation[] operations = new Operation[4];
        int index = 0;
        operations[index++] = new StepIncOperation();
        operations[index++] = new StepDecOperation();
        operations[index++] = new IncOperation();
        operations[index++] = new DecOperation();

        int value = 777;
        for (int i = 0; i < 3; i++)
        {
            System.out.println("*** Run each method in turn: loop " + i);

            for (final Operation operation : operations)
            {
                System.out.println(operation.getClass().getName()); 
                value = runTests(operation, value);
            }
        } 

        System.out.println("value = " + value);
    } 

    private static int runTests(final Operation operation, int value)
    {
        for (int i = 0; i < 10; i++)
        {
            final long start = System.nanoTime();

            value += opRun(operation, value);

            final long duration = System.nanoTime() - start;
            final long opsPerSec = 
                (ITERATIONS * 1000L * 1000L * 1000L) / duration;
            System.out.printf("    %,d ops/sec\n", opsPerSec);
        }

        return value;
    } 

    private static int opRun(final Operation operation, int value)
    {
        for (int i = 0; i < ITERATIONS; i++)
        {
            value += operation.map(value);
        } 

        return value;
    } 
}

Results


The following results are for running on a Linux 3.3.2 kernel with Oracle 1.7.0_02 server JVM on a Intel Sandy Bridge 2.4Ghz processor.

*** Run each method in turn: loop 0
StepIncOperation
    2,256,816,714 ops/sec
    2,245,800,936 ops/sec
    3,161,643,847 ops/sec
    3,100,375,269 ops/sec
    3,144,364,173 ops/sec
    3,091,009,138 ops/sec
    3,089,241,641 ops/sec
    3,153,922,056 ops/sec
    3,147,331,497 ops/sec
    3,076,211,099 ops/sec
StepDecOperation
    623,131,120 ops/sec
    659,686,236 ops/sec
    1,029,231,089 ops/sec
    1,021,060,933 ops/sec
    999,287,607 ops/sec
    1,015,432,172 ops/sec
    1,023,581,307 ops/sec
    1,019,266,750 ops/sec
    1,022,726,580 ops/sec
    1,004,237,016 ops/sec
IncOperation
    301,419,319 ops/sec
    304,712,250 ops/sec
    307,269,912 ops/sec
    308,519,923 ops/sec
    307,372,436 ops/sec
    306,230,247 ops/sec
    307,964,022 ops/sec
    306,243,292 ops/sec
    308,689,942 ops/sec
    365,152,716 ops/sec
DecOperation
    236,804,700 ops/sec
    237,912,786 ops/sec
    238,672,489 ops/sec
    278,745,901 ops/sec
    278,169,934 ops/sec
    277,979,158 ops/sec
    276,620,509 ops/sec
    278,349,766 ops/sec
    276,159,225 ops/sec
    278,578,373 ops/sec
*** Run each method in turn: loop 1
StepIncOperation
    276,054,944 ops/sec
    276,683,805 ops/sec
    276,551,970 ops/sec
    279,861,144 ops/sec
    275,543,192 ops/sec
    278,451,092 ops/sec
    275,399,262 ops/sec
    277,340,411 ops/sec
    274,529,616 ops/sec
    277,091,930 ops/sec
StepDecOperation
    279,729,066 ops/sec
    279,812,269 ops/sec
    276,478,587 ops/sec
    277,660,649 ops/sec
    276,844,441 ops/sec
    278,684,313 ops/sec
    277,791,665 ops/sec
    277,617,484 ops/sec
    278,575,241 ops/sec
    278,228,274 ops/sec
IncOperation
    277,724,770 ops/sec
    278,234,042 ops/sec
    276,798,434 ops/sec
    277,926,962 ops/sec
    277,786,824 ops/sec
    278,739,590 ops/sec
    275,286,293 ops/sec
    279,062,831 ops/sec
    276,672,019 ops/sec
    277,248,956 ops/sec
DecOperation
    277,303,150 ops/sec
    277,746,139 ops/sec
    276,245,511 ops/sec
    278,559,202 ops/sec
    274,683,406 ops/sec
    279,280,730 ops/sec
    276,174,620 ops/sec
    276,374,159 ops/sec
    275,943,446 ops/sec
    277,765,688 ops/sec
*** Run each method in turn: loop 2
StepIncOperation
    278,405,907 ops/sec
    278,713,953 ops/sec
    276,841,096 ops/sec
    277,891,660 ops/sec
    275,716,314 ops/sec
    277,474,242 ops/sec
    277,715,270 ops/sec
    277,857,014 ops/sec
    275,956,486 ops/sec
    277,675,378 ops/sec
StepDecOperation
    277,273,039 ops/sec
    278,101,972 ops/sec
    275,694,572 ops/sec
    276,312,449 ops/sec
    275,964,418 ops/sec
    278,423,621 ops/sec
    276,498,569 ops/sec
    276,593,475 ops/sec
    276,238,451 ops/sec
    277,057,568 ops/sec
IncOperation
    275,700,451 ops/sec
    277,463,507 ops/sec
    275,886,477 ops/sec
    277,546,096 ops/sec
    275,019,816 ops/sec
    278,242,287 ops/sec
    277,317,964 ops/sec
    277,252,014 ops/sec
    276,893,038 ops/sec
    277,601,325 ops/sec
DecOperation
    275,580,894 ops/sec
    280,146,646 ops/sec
    276,901,134 ops/sec
    276,672,567 ops/sec
    276,879,422 ops/sec
    278,674,196 ops/sec
    275,606,174 ops/sec
    278,132,534 ops/sec
    275,858,358 ops/sec
    279,444,112 ops/sec

What is going on here?


On the first iteration over the list of operations we see the performance degrade from ~3bn operations per second down to ~275m operations per second.  This happens in a step function with each new implementation loaded.  On the second, and subsequent, iteration over the array of operations, performance stabilised at ~275m operations per second.  What we are seeing here is how Hotspot can optimise when we have a limited number of implementations for an interface, and how it has to fall back to late bound method calls when many implementations are possible from a given call site.

If we run the JVM with -XX:+PrintCompilation we can see Hotspot choosing to compile the methods then de-optimise existing optimisations as new implementations get loaded.

     52    1             java.lang.String::hashCode (67 bytes)
     54    2             StepIncOperation::map (5 bytes)
     55    1 %           OperationPerfTest::opRun @ 2 (26 bytes)
     76    3             OperationPerfTest::opRun (26 bytes)
    223    3             OperationPerfTest::opRun (26 bytes)   made not entrant
    223    1 %           OperationPerfTest::opRun @ -2 (26 bytes)   made not entrant
    224    2 %           OperationPerfTest::opRun @ 2 (26 bytes)
    224    4             StepDecOperation::map (4 bytes)
    306    5             OperationPerfTest::opRun (26 bytes)
    772    2 %           OperationPerfTest::opRun @ -2 (26 bytes)   made not entrant
    772    3 %           OperationPerfTest::opRun @ 2 (26 bytes)
    773    6             IncOperation::map (4 bytes)
    930    5             OperationPerfTest::opRun (26 bytes)   made not entrant
   1995    7             OperationPerfTest::opRun (26 bytes)
   2293    8             DecOperation::map (4 bytes)
  11339    9             java.lang.String::indexOf (87 bytes)
  15017   10             java.lang.String::charAt (33 bytes)

The output above shows the decisions made by Hotspot as it compiles code.  When the third column contains the symbol "%" it is performing OSR (On Stack Replacement) of the method.  This is followed 4 times by the method being "made not entrant" as it is de-optimised when Hotspot discovers new implementations.  3 times the method is made not entrant for the newly discovered classes and once for removing the OSR version to be replaced by a non-OSR normal JIT'ed version when the final implementation is settled on.  Even greater detail can be seen by replacing -XX:+PrintCompilation with -XX:+UnlockDiagnosticVMOptions -XX:+LogCompilation.

For the monomorphic single implementation case, Hotspot can simply inline the method and place a trap in the code to fire if future implementations are loaded.  This gives performance very similar to no function call overhead. For the second bimorphic implementation, Hotspot can inline both methods and select the implementation based on a branch condition.  Beyond this things get tricky and jump tables are required to  resolve the method at runtime, thus making the code polymorphic or megamorphic.  The generated assembly code can be viewed with -XX:+UnlockDiagnosticVMOptions -XX:CompileCommand=print,OperationPerfTest.doRun JVM options for Java 7. The output shows the steps in compilation whereby not only is the method inlining deoptimised, Hotspot also no longer does loop unrolling for this method.

Conclusions


We can see that if an interface method has only one or two implementations then Hotspot can dynamically inline the method avoiding the function call overhead.  This would only be possible with profile guided optimisation for a language like C or C++.  We can also see that method calls are relatively cheap on a modern JVM, in the order of 12 cycles, even when we cannot avoid them.  It should be noted that the cost of method calls goes up by a few cycles for each additional argument passed.

In addition, I have observed that when a class implements multiple interfaces, with multiple methods, performance can degrade significantly because the method dispatch involves a linear search of method list to find the right implementation for dispatch.  Overridden methods from a base class do not involve this linear search but still require the jump table dispatch.  All the more reason to keep classes and interfaces simple.

7 comments:

  1. Thank you for this interesting post (and for the others that made me bookmark your blog as soon as I stumbled upon it) !
    However, wrt to C++, I have to stress that idiomatic C++ is more often than not using static polymorphism with templates rather than dynamic polymorphism with virtual member functions.
    Case in point: all the functions (e.g. in ) take their function object argument by value (not by reference) to prevent dynamic polymorphism and ensure that the compiler has full knowledge of the actual function to be called (in fact, inlined ).

    Cheers,

    B.

    ReplyDelete
    Replies
    1. Your point is interesting. Thanks. There are several different types of polymorphism. The three major ones being: overloading, overriding, and generics. In my example I refer to overriding, and here you point to generics/templates. Both have their place, but typically solve different needs. In my experience C++ developers tend to use all three, not just generics. Interestingly, overuse of C++ templates can cause code bloat resulting L1 instruction cache thrashing. Still all mechanical sympathy at the end of the day :-) No one size fits all. Right tool for the job...

      Delete
    2. Nice post.
      [a] For the megamorphic case on 1.8.0_66, I noticed if I run the loop around 8000+ times +PrintInlining shows a new optimization kick in. It __appears__ that opRun is getting inlined and type specialized for the input operation type. Probably need to check with JVM folks if they have added some special smarts for super-hot metamorphic cal sites

      [b] I often hear i$ thrashing being touted for Templates. In my experience that is really not the case. i$ contains the near working set of the code being executed. Unless you are executing 10 different types at the same time, with templates you will pick up the right version of the types for the current code. Yes the binary will be bigger but try get past the initial mindset of Templates causing i$ thrashing. Its not that bad :-) I work on 3D Graphics device drivers and Firmware (think super small i$) and we use it safely. C++ Templates are messy (unlike D) but thats a different story.

      Delete
  2. This blog is a pure treasure trove! Keep it coming!

    I was aware of the mono-, bi- and megamorphic method call cases, but I didn't think the difference between the monomorphic and the bimorphic case was so large.

    The difference should only really be a single branch, right?

    ReplyDelete
    Replies
    1. Yes, a branch for the in-lined case. However once bimorphic other optimisations such as loop unrolling may not be applied.

      Delete
  3. Why didnt you include a warmup phase in the test?

    ReplyDelete
    Replies
    1. The test runs multiple times thus you get a warm up if you take results after it becomes stable. These days it would be better done with JMH.

      Delete