After my last post on Java Lock Implementations, I got a lot of good feedback about my results and micro-benchmark design approach. As a result I now understand JVM warmup, On Stack Replacement (OSR) and Biased Locking somewhat better than before. Special thanks to Dave Dice from Oracle, and Cliff Click & Gil Tene from Azul, for their very useful feedback.
In the last post I concluded, based on my experiments, that biased locking was no longer necessary on modern CPUs. While that conclusion is understandable given the data gathered, it was not valid because the experiment did not take into account some JVM warm-up behaviour I was unaware of.
In this post I will re-run the experiment, taking the feedback into account, and present some new results. I shall also expand on the changes I've made to the test, and on why it is important to consider JVM warm-up behaviour when writing micro-benchmarks, or even very lean Java applications with quick start-up times.
On Stack Replacement (OSR)
Java virtual machines compile code at runtime, based on profiling, to achieve greater performance. Some VMs run an interpreter for the majority of code and replace hot areas with compiled code, following the 80/20 rule. Other VMs compile all code with a simple compiler at first, then replace that simple code with more optimised code based on profiling. Oracle Hotspot and Azul are examples of the first type; Oracle JRockit is an example of the second.
Oracle Hotspot counts method invocations plus loop branch-backs within a method, and if this count exceeds 10K in server mode the method is compiled. With normal JIT'ing, the compiled code is used the next time the method is called. However, if a loop is still iterating it may make sense to replace the method before the loop completes, especially if it has many iterations to go. OSR is the means by which a running method is replaced with a compiled version part way through iterating a loop.
I was under the impression that normal JIT'ing and OSR would result in similar code. Cliff Click pointed out that it is much harder for a runtime to optimise a loop part way through its execution, and especially difficult if the loop is nested. For example, it may not be possible to eliminate bounds checking within the loop. Cliff will blog about this in more detail shortly.
What this means is that you are likely to get better optimised code from a number of short warm-up runs than from a single large one. You can see in the code below how I do 10 short runs in a loop before the main timed run, compared to the last article where I did a single large warm-up run.
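The effect is easy to experiment with in isolation. Below is a minimal sketch of the warm-up pattern; the class, method, and iteration counts are illustrative assumptions of mine, not taken from the benchmark itself. Running it with -XX:+PrintCompilation should show OSR compilations flagged with a '%' in Hotspot's output:

public final class WarmUpSketch
{
    // Illustrative unit of work for this sketch only.
    private static long spin(final long iterations)
    {
        long sum = 0L;
        for (long i = 0L; i < iterations; i++)
        {
            sum += i;
        }
        return sum;
    }

    public static void main(final String[] args)
    {
        // Several short warm-up calls: the method returns and is re-entered,
        // so the timed run below starts from a normally JIT'ed version at the
        // method entry point, rather than spending the whole measurement in
        // an OSR-compiled frame.
        for (int i = 0; i < 10; i++)
        {
            spin(100L * 1000L);
        }

        final long start = System.nanoTime();
        final long result = spin(500L * 1000L * 1000L);
        final long duration = System.nanoTime() - start;
        System.out.printf("duration %,d (ns), result = %d%n", duration, result);
    }
}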
Biased Locking
Dave Dice pointed out that Hotspot does not enable objects for biased locking in the first few seconds (4s at present) after JVM start-up. This is because some benchmarks, and NetBeans, have a lot of thread contention on start-up, and the revocation cost is significant.
After this start-up delay, all objects in Oracle Hotspot are by default created with biased locking enabled. The delay can be removed by setting -XX:BiasedLockingStartupDelay=0.
This point, combined with knowing more about OSR, is important for micro-benchmarks. It is equally important to be aware of if you have a lean Java application that does its work within a few seconds of starting.
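To make the start-up delay concrete, here is a minimal sketch assuming the 4s default described above; the class name and iteration count are my own, for illustration only. A lock object allocated during the delay window is not eligible for biasing, so the loop below measures a non-biased lock unless the delay is removed on the command line:

public final class BiasedStartupSketch
{
    // Allocated during class initialisation, i.e. within the start-up delay
    // window, so by default this object is never biased.
    private static final Object LOCK = new Object();
    private static long counter = 0L;

    public static void main(final String[] args)
    {
        final long iterations = 100L * 1000L * 1000L;

        final long start = System.nanoTime();
        for (long i = 0L; i < iterations; i++)
        {
            synchronized (LOCK)
            {
                ++counter;
            }
        }
        final long duration = System.nanoTime() - start;

        System.out.printf("%,d (ns), %,d ns/op, counter = %d%n",
            duration, duration / iterations, counter);
    }
}

Comparing a run with default settings against one started with -XX:+UseBiasedLocking -XX:BiasedLockingStartupDelay=0 should show the difference an actually-biased lock makes.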
The Code
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

import static java.lang.System.out;

public final class TestLocks implements Runnable
{
    public enum LockType { JVM, JUC }

    public static LockType lockType;
    public static final long WARMUP_ITERATIONS = 100L * 1000L;
    public static final long ITERATIONS = 500L * 1000L * 1000L;
    public static long counter = 0L;

    public static final Object jvmLock = new Object();
    public static final Lock jucLock = new ReentrantLock();
    private static int numThreads;

    private final long iterationLimit;
    private final CyclicBarrier barrier;

    public TestLocks(final CyclicBarrier barrier, final long iterationLimit)
    {
        this.barrier = barrier;
        this.iterationLimit = iterationLimit;
    }

    public static void main(final String[] args) throws Exception
    {
        lockType = LockType.valueOf(args[0]);
        numThreads = Integer.parseInt(args[1]);

        // 10 short warm-up runs so the hot methods get JIT'ed from their
        // entry points rather than via OSR part way through a long loop.
        for (int i = 0; i < 10; i++)
        {
            runTest(numThreads, WARMUP_ITERATIONS);
            counter = 0L;
        }

        final long start = System.nanoTime();
        runTest(numThreads, ITERATIONS);
        final long duration = System.nanoTime() - start;

        out.printf("%d threads, duration %,d (ns)\n", numThreads, duration);
        out.printf("%,d ns/op\n", duration / ITERATIONS);
        out.printf("%,d ops/s\n", (ITERATIONS * 1000000000L) / duration);
        out.println("counter = " + counter);
    }

    private static void runTest(final int numThreads, final long iterationLimit)
        throws Exception
    {
        CyclicBarrier barrier = new CyclicBarrier(numThreads);
        Thread[] threads = new Thread[numThreads];

        for (int i = 0; i < threads.length; i++)
        {
            threads[i] = new Thread(new TestLocks(barrier, iterationLimit));
        }

        for (Thread t : threads)
        {
            t.start();
        }

        for (Thread t : threads)
        {
            t.join();
        }
    }

    public void run()
    {
        try
        {
            // Wait so all threads begin contending at the same time.
            barrier.await();
        }
        catch (Exception e)
        {
            // don't care
        }

        switch (lockType)
        {
            case JVM: jvmLockInc(); break;
            case JUC: jucLockInc(); break;
        }
    }

    private void jvmLockInc()
    {
        long count = iterationLimit / numThreads;
        while (0 != count--)
        {
            synchronized (jvmLock)
            {
                ++counter;
            }
        }
    }

    private void jucLockInc()
    {
        long count = iterationLimit / numThreads;
        while (0 != count--)
        {
            jucLock.lock();
            try
            {
                ++counter;
            }
            finally
            {
                jucLock.unlock();
            }
        }
    }
}
Script to run tests:
set -x
for i in {1..8}
do
java -server -XX:-UseBiasedLocking TestLocks JVM $i
done
for i in {1..8}
do
java -server -XX:+UseBiasedLocking -XX:BiasedLockingStartupDelay=0 TestLocks JVM $i
done
for i in {1..8}
do
java -server TestLocks JUC $i
done
Results
The tests were carried out on 64-bit Linux (Fedora Core 15) with Oracle JDK 1.6.0_29.
Nehalem 2.8GHz - Ops/Sec

| Threads | -UseBiasedLocking | +UseBiasedLocking | ReentrantLock |
|---|---|---|---|
| 1 | 53,283,461 |  |  |
| 2 | 18,519,295 |  |  |
| 3 | 13,349,605 |  |  |
| 4 | 8,120,172 |  |  |
| 5 | 4,725,114 |  |  |
| 6 | 5,133,706 |  |  |
| 7 | 5,473,652 |  |  |
| 8 | 5,514,056 |  |  |
Sandy Bridge 2.0GHz - Ops/Sec

| Threads | -UseBiasedLocking | +UseBiasedLocking | ReentrantLock |
|---|---|---|---|
| 1 |  |  |  |
| 2 |  |  |  |
| 3 |  |  |  |
| 4 |  |  |  |
| 5 |  |  |  |
| 6 |  |  |  |
| 7 |  |  |  |
| 8 |  |  |  |
Observations
- Biased locking has a huge benefit in the un-contended single threaded case.
- Biased locking, when un-contended and never revoked, adds only 4-5 cycles of cost on top of the code protected by the critical section, assuming a cache hit on the lock structures.
- -XX:BiasedLockingStartupDelay=0 needs to be set for lean applications and micro-benchmarks.
- Avoiding OSR does not make a material difference to this set of test results. This is likely to be because the loop is so simple or other costs are dominating.
- For the current implementations, ReentrantLocks scale better than synchronised locks under contention, except in the case of 2 contending threads.
My tests in the last post were invalid for measuring an un-contended biased lock, because the lock was never actually biased. If you are designing code following the single writer principle, and therefore have un-contended locks when using 3rd-party libraries, then having biased locking enabled is a significant performance boost.