Saturday, 13 August 2011

False Sharing && Java 7

In my previous post on False Sharing I suggested it can be avoided by padding the cache line with unused long fields.  It seems Java 7 got clever and eliminated or re-ordered the unused fields, thus re-introducing false sharing.  I've experimented with a number of techniques on different platforms and found the following code to be the most reliable.
import java.util.concurrent.atomic.AtomicLong;

public final class FalseSharing
    implements Runnable
{
    public final static int NUM_THREADS = 4; // change to match the number of cores
    public final static long ITERATIONS = 500L * 1000L * 1000L;
    private final int arrayIndex;

    private static PaddedAtomicLong[] longs = new PaddedAtomicLong[NUM_THREADS];
    static
    {
        for (int i = 0; i < longs.length; i++)
        {
            longs[i] = new PaddedAtomicLong();
        }
    }

    public FalseSharing(final int arrayIndex)
    {
        this.arrayIndex = arrayIndex;
    }

    public static void main(final String[] args) throws Exception
    {
        final long start = System.nanoTime();
        runTest();
        System.out.println("duration = " + (System.nanoTime() - start));
    }

    private static void runTest() throws InterruptedException
    {
        Thread[] threads = new Thread[NUM_THREADS];

        for (int i = 0; i < threads.length; i++)
        {
            threads[i] = new Thread(new FalseSharing(i));
        }

        for (Thread t : threads)
        {
            t.start();
        }

        for (Thread t : threads)
        {
            t.join();
        }
    }

    @Override
    public void run()
    {
        // each thread spins writing only to its own padded counter
        long i = ITERATIONS + 1;
        while (0 != --i)
        {
            longs[arrayIndex].set(i);
        }
    }

    // public accessor so a smart JIT cannot prove the padding fields are
    // unused and optimise them away (see Gil's comment below)
    public static long sumPaddingToPreventOptimisation(final int index)
    {
        PaddedAtomicLong v = longs[index];
        return v.p1 + v.p2 + v.p3 + v.p4 + v.p5 + v.p6;
    }

    public static class PaddedAtomicLong extends AtomicLong
    {
        // 6 longs of padding after AtomicLong's value field keep another
        // instance's hot value out of the same 64-byte cache line
        public volatile long p1, p2, p3, p4, p5, p6 = 7L;
    }
}

With this code I get similar performance results to those stated in the previous False Sharing article.  The padding in PaddedAtomicLong above can be commented out to see the false sharing effect.
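
For comparison, a minimal unpadded variant (my sketch; the body of sumPaddingToPreventOptimisation() needs commenting out as well):

public static class PaddedAtomicLong extends AtomicLong
{
    // padding fields removed: neighbouring instances can now share a cache line
}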

I think we should all lobby the powers that be inside Oracle to have intrinsics added to the language so we can have cache line aligned and padded atomic classes. This and some other low-level changes would help make Java a real concurrent programming language. We keep hearing them say multi-core is coming. I say it is here and Java needs to catch up.

41 comments:

  1. In your previous article you created an array of VolatileLong and here an array of AtomicLongArray.

    What is the guarantee that the AtomicLongArray or VolatileLong instances will be allocated next to each other?

    Well, yes if you create them in a loop, then it is very likely that they will be co-located. But there is no guarantee that these 4 instances will be next to each other on the heap.

    If they do get spread across the old gen heap, where there is no compaction or re-location (until a major GC), padding might not even be necessary since they are already spread out. Correct?

    It would be nice if you made it clear to your readers that we really have no control over where the JVM places instances on the heap.

  2. Ashwin,

    You are totally right in that we have no guarantee how Java objects will be placed on the heap. This is the source of the problem with false sharing. If you have counters or pointers that are used across threads it is vitally important to ensure they are in different cache lines or the application will not scale up with cores. The point of the padding is to protect the counters/pointers to ensure they never get placed in the same cache line.

    Does this make sense?

    Martin...

  3. Yup. Makes sense. Couldn't you have created a large AtomicLongArray and then let each thread update position 8, 16, 24, 32 and so on?

    Instead of 4 AtomicLongArrays and then updating position 0 in each of those 4 arrays?

    Thanks for taking the time to write such things.

  4. If I knew that up front, it would be a possibility. Often when designing large systems we don't know such things, or we have created a library used by another application.

    It is hard to create a small enough example with sufficient context. The above example illustrates how bad things can be when it happens. If you pad such structures then you need not worry about where they get located in memory. Only the highly contended concurrent structures need such padding.

  5. After a good discussion with Jeff Hain we have discovered the JVM cannot be trusted when it comes to re-ordering code. What happens before the volatile seems to be safe but what happens after is not reliable. We have a better solution by replacing the VolatileLong with the following and using the normal methods of AtomicLong.

    static class PaddedAtomicLong extends AtomicLong
    {
        public volatile long p1, p2, p3, p4, p5, p6, p7 = 7L;
    }

    Oh how I wish the Java powers that be would get serious about this stuff and add some intrinsics that are cache line aligned and padded. This has been the biggest source of performance bugs in the Disruptor.

  6. I've updated this blog based on all the feedback.

  7. Martin,

    Agree that it would be great if there was a way to declare a field such that it is guaranteed to occupy its own cache line, and let the JVM deal with placing the right padding in the object layout. What you do here with artificial padding is the next best thing, but as you know, what actually happens to object layout becomes JVM-implementation specific.

    Being paranoid, I'd add some more stuff to your padding technique to make it harder to get rid of the padding fields through optimization. A hypothetically smart enough JVM would still be able to prove away the need to actually implement the padding fields p1...p7 in your code above: the PaddedAtomicLong class is only visible from the final (and therefore non-extendable) FalseSharing class. The compiler can therefore "know" that it is looking at all the code that will ever see the padding fields, and can prove that no behavior actually depends on p1...p7. The JVM could then implement the above without taking any space for the padding fields...

    You can probably defeat this by making the PaddedAtomicLong class visible outside of FalseSharing, either directly or by adding an accessor method on FalseSharing that would depend on all the p1...p7 values and could theoretically be called by someone outside.

  8. Updated example based on Gil's feedback.

  9. Just use an array (or a direct ByteBuffer, but you pay a full page for it) and place the element in the middle. Java can't reorder that. This is what I do if I want dedicated cache lines.

  10. Stanimir,

    I've often used an array of 15 longs for example and put the value at index 7 to achieve the same result. However, this does not work if the value requires volatile semantics. For that you need to use AtomicLongArray or similar. Based on my measurements the indirection and bounds checking costs for this can be significant in an algorithm.
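
    For illustration, a minimal sketch of that technique with volatile semantics (my example, not the Disruptor code):

    import java.util.concurrent.atomic.AtomicLongArray;

    public final class PaddedVolatileLong
    {
        private static final int LENGTH = 15;     // 15 longs = 120 bytes
        private static final int VALUE_INDEX = 7; // 7 longs = 56 bytes either side of the value

        private final AtomicLongArray paddedValue = new AtomicLongArray(LENGTH);

        public long get()
        {
            return paddedValue.get(VALUE_INDEX); // volatile read
        }

        public void set(final long value)
        {
            paddedValue.set(VALUE_INDEX, value); // volatile write
        }
    }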

    I know a bunch of folk are proposing an @Contended annotation for Java to mark fields so they can be cache line aligned and padded. I hope it comes soon. :-)

    Replies
    1. Hi Martin,

      I saw that the current version of Sequence in Disruptor now uses unsafe.compareAndSwapLong(..) to update the 7th index.

      Why is the length of the long array 15 and not another size? Is it because an array of 15 longs exactly fills the Level 2 cache?

      Cheers

      John

    2. Using index 7 (the 8th element) ensures that there will be 56 bytes of padding either side of the value. 56 bytes plus the 8 for the value ensures nothing else can share the same 64-byte cache line regardless of the array's starting location in memory.

    3. Hi Martin,
      I was trying your benchmark code with the Java 8 @Contended annotation. The performance was similar to the version without padding... Did you manage to try it?

  11. Martin,
    Yes, I meant AtomicLongArray. If you do not wish to pay for the indirection, Unsafe is (always) an option. No bounds checking with Unsafe either.
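
    For illustration, a sketch of the off-heap idea (my example, assuming sun.misc.Unsafe can be obtained via reflection): allocate two cache lines' worth of memory and keep the value in the middle so nothing else can share its 64-byte line.

    import java.lang.reflect.Field;
    import sun.misc.Unsafe;

    public final class OffHeapPaddedLong
    {
        private static final Unsafe UNSAFE;

        static
        {
            try
            {
                final Field field = Unsafe.class.getDeclaredField("theUnsafe");
                field.setAccessible(true);
                UNSAFE = (Unsafe)field.get(null);
            }
            catch (final Exception e)
            {
                throw new RuntimeException(e);
            }
        }

        // 128 bytes means the 8-byte value at base + 64 always has at least
        // 56 bytes of this block's own memory on either side of it
        private final long baseAddress = UNSAFE.allocateMemory(128);
        private final long valueAddress = baseAddress + 64;

        public void setVolatile(final long value)
        {
            UNSAFE.putLongVolatile(null, valueAddress, value);
        }

        public long getVolatile()
        {
            return UNSAFE.getLongVolatile(null, valueAddress);
        }

        public void free()
        {
            UNSAFE.freeMemory(baseAddress);
        }
    }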

  12. Are there any simple hardware manuals for multi-core processors that show key components like caches? I'm looking for an illustration to understand how caches and cores interact to create false sharing.

  13. Mohan,

    See section 3.4 of the following document:

    http://img.delivery.net/cm50content/intel/ProductLibrary/100412_Parallel_Programming_02.pdf

  14. FWIW - I have a CentOS 5.8 VM running on Hyper-V with Oracle Java 1.6.0_22, and this version of the false sharing test with the padded atomic is about 4 times faster than VolatileLong. A typical run for me with the padded atomic long:

    duration = 7019998000

  15. Your blog is wonderful reading. Thanks. I have 2 questions regarding memory padding:
    1. The long type takes 8 bytes and an object header takes 2 words = 16 bytes, but with an implementation such as
    public final static class VolatileLong // 16-byte header
    {
        public volatile long value = 0L;    // 8 bytes
        public long p1, p2, p3, p4, p5, p6; // 6 * 8 = 48 bytes
    }
    the total seems to be 16 + 8 + 48 = 72 bytes > the 64-byte cache line.
    2. How did you find this problem? Were you looking at assembly code to see that the cache line is constantly being synced?

    Replies
    1. I do not want the value in the same cache line as the mark word, which can be modified by threads taking out locks or by the garbage collector ageing the object.

      Even in 64-bit mode compressed pointers are enabled by default; this includes the class pointer in the object header.

      https://wikis.oracle.com/display/HotSpotInternals/CompressedOops

      I found the problem years ago when investigating the reasons why one can get variations in performance runs.

    2. OK, so you found the problem during a performance test run. How do you narrow down the issue? Do you dig into the assembly?

    3. Assembly does not show the problem. You need to look for a lot of unexplained L2 cache misses as a hint that something is going on.
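
      On Linux, one way to find that hint is the hardware counters, e.g. (a sketch; available event names vary by kernel and CPU):

      perf stat -e cache-misses,cache-references java FalseSharing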

  16. See also @Contended proposal: http://mail.openjdk.java.net/pipermail/hotspot-dev/2012-November/007309.html

    Great blog, thumbs up!

    Replies
    1. I've not tried @Contended as it has not made it into a mainstream JVM yet.

  17. Hi Martin,

    Wonderful blog.

    Can you explain how you came up with the number of long member variables to use in PaddedAtomicLong?

    E.g.
    public static class PaddedAtomicLong extends AtomicLong
    {
        public volatile long p1, p2, p3, p4, p5, p6 = 7L; // 8 * 6 = 48 bytes
    }

    How do we ensure this will fit exactly inside the cache line (which is 64 bytes, I presume)? Does this mean the PaddedAtomicLong object itself will take up 16 bytes (8 bytes for the object, but what about the remaining 8 bytes)?

    Thanks Martin.

    Replies
    1. Add another 8 bytes for the value itself and a minimum of 8 bytes for the object header.

  18. Hi Martin,

    how will an object of type VolatileLong be stored in this case:

    public final static class VolatileLong // 16-byte header
    {
        public volatile long value = 0L;    // 8 bytes
        public long p1, p2, p3, p4, p5, p6; // 6 * 8 = 48 bytes
    }

    The total seems to be 16 + 8 + 48 = 72 bytes > the 64-byte cache line. As far as I know the header will be 2 words since this is a 64-bit system, so will an object of type VolatileLong be stored across 2 cache lines? If two objects of type VolatileLong are stored one after another in memory, as in the example above, will the second object be stored from byte 9 of the second cache line, since the first object ends at byte 8 of the second cache line, or will another 7 bytes of padding be added so that it starts at byte 16?

    Replies
    1. It does not matter if too much padding is used. It is important that we use enough. To discover object layout try the following tool:

      http://openjdk.java.net/projects/code-tools/jol/

      I've seen object headers range from 4 to 12 bytes depending on the JVM.
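
      A minimal sketch of using JOL for this (my example, assuming the org.openjdk.jol:jol-core artifact is on the classpath):

      import org.openjdk.jol.info.ClassLayout;

      public class LayoutCheck
      {
          public static void main(final String[] args)
          {
              // prints header size, field offsets and total instance size
              System.out.println(ClassLayout.parseClass(FalseSharing.PaddedAtomicLong.class).toPrintable());
          }
      }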

  19. Fantastic article! Martin, small question: does Intel, for example, have a tool for measuring cache hits and misses that you could recommend?

    Replies
    1. Yes, Intel has VTune. There are some other free tools that I discuss in this blog post:

      http://mechanical-sympathy.blogspot.co.uk/2012/08/memory-access-patterns-are-important.html

  20. Hi Martin,
    Nice article. Two questions:
    1. For normal systems that don't have very demanding performance requirements (most enterprise apps I have worked on perform acceptably without any of these techniques), is the above complexity to avoid cache line contention justified?
    2. Having each programmer add this kind of padding by hand can be error-prone and makes the code look complex.

    Replies
    1. These are the sorts of issues which concern programmers working on high-performance applications or those providing libraries to support concurrent/parallel code. These are not the issues which should concern the typical programmer.

  21. I modified your previous version of the code to annotate the value with @Contended and the results are as bad as the false sharing case. Did you see better performance with @Contended?
    public final static class VolatileLong
    {
        @Contended
        public volatile long v;
    }

    Replies
    1. Try -XX:-RestrictContended as a JVM flag.
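
      A minimal sketch for Java 8, where the annotation lives in sun.misc and is ignored outside the JDK unless that flag is set (my example):

      import sun.misc.Contended;

      public final class ContendedLong
      {
          @Contended
          public volatile long value; // padded by the JVM when run with -XX:-RestrictContended
      }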

    2. Yes, the flag made it performant. Interesting that Oracle would want to restrict it by default.

  22. Hi Martin,

    Been reading about false sharing, cache coherence protocols, etc. and am doing some microbenchmarking in Java to see how false sharing affects performance. Basically, I have an array of longs of size num_threads * longs_per_cacheline (8 * 8 in my case). Then each thread i increments its respective long in the array (a[0], a[8], a[16], ...). This should prevent false sharing since there is only one counter per cache line. However, when I change the offset so that the counters reside next to each other in the array (a[0], a[1], a[2], ...), I see little to no performance degradation. Running perf should reveal an increase in cache misses in the non-padded version due to an increase in cache coherence traffic, but I see no indication of that. What is going on??

    public class SharedCounter2 {
        private static final int NUM_THREADS = Runtime.getRuntime().availableProcessors();
        private static final int OFFSET = 8;
        private static long[] counters = new long[NUM_THREADS * OFFSET];
        private static Thread[] threads = new Thread[NUM_THREADS];

        public static void main(String[] args) {
            TimerUtil.time(() -> {
                for (int i = 0; i < NUM_THREADS; i++) {
                    final int j = i;
                    threads[i] = new Thread(() -> {
                        for (int k = 0; k < 2100000000; k++) {
                            counters[j * OFFSET]++;
                        }
                    });
                }

                for (int i = 0; i < NUM_THREADS; i++) {
                    threads[i].start();
                }

                long sum = 0;
                for (int i = 0; i < NUM_THREADS; i++) {
                    try {
                        threads[i].join();
                        sum += counters[i * OFFSET];
                        counters[i * OFFSET] = 0;
                    } catch (InterruptedException e) {
                        e.printStackTrace();
                    }
                }

                System.out.println("Sum: " + sum);
            }, 10);
        }
    }

    P.S. A similar version of the above code using a shared AtomicLong such that all threads simply call getAndIncrement() performs horribly as expected.

    Replies
    1. Try changing your code to use an AtomicLongArray, then try the various options on that.
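
      For illustration, a sketch of that change (my example, with a smaller iteration count): an AtomicLongArray makes every increment a memory-ordered write that cannot be hoisted out of the loop, so comparing OFFSET = 8 with OFFSET = 1 should expose the false sharing cost.

      import java.util.concurrent.atomic.AtomicLongArray;

      public final class SharedCounter3
      {
          private static final int NUM_THREADS = Runtime.getRuntime().availableProcessors();
          private static final int OFFSET = 8; // set to 1 to re-introduce false sharing
          private static final AtomicLongArray counters = new AtomicLongArray(NUM_THREADS * OFFSET);

          public static void main(final String[] args) throws InterruptedException
          {
              final Thread[] threads = new Thread[NUM_THREADS];
              for (int i = 0; i < NUM_THREADS; i++)
              {
                  final int j = i;
                  threads[i] = new Thread(() ->
                  {
                      for (int k = 0; k < 100_000_000; k++)
                      {
                          counters.incrementAndGet(j * OFFSET); // ordered write every iteration
                      }
                  });
              }

              final long start = System.nanoTime();
              for (final Thread t : threads)
              {
                  t.start();
              }
              for (final Thread t : threads)
              {
                  t.join();
              }
              System.out.println("duration = " + (System.nanoTime() - start));
          }
      }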

    2. Yes, changing it to an AtomicLongArray does cause the performance to degrade when there is false sharing.

      I suspect that since AtomicLongArray is doing volatile accesses, the memory barriers are flushing the store buffer, thus causing invalidations when the counters reside in the same cache line. On the other hand, since the code above does not perform volatile accesses (the counters aren't shared, except for when the main thread accumulates the sum), the store buffers aren't being flushed as often?

    3. It is all about blocking progress: store buffers, write combining buffers, out-of-order execution, etc. Memory-ordered writes highlight the issues. Fundamentally, whether you can make progress without the L1 cache line being in the exclusive state when using ordered write operations is what is worth digging into.

  23. Is there a possibility of writing a Java agent that will pad objects if they are annotated? That way client code will stay clean.
    Thoughts?

  24. Do contended variables work well with Unsafe.putLong/Unsafe.putOrderedLong? I've written a very simple class but it doesn't seem to be working as I expect it to:

    public class ContendedAtomicCounter {

        private static final Unsafe UNSAFE = UnsafeAccess.UNSAFE;
        private static final long VALUE_OFFSET;

        @Contended
        private volatile long value;

        static {
            try {
                // note: the offset must be looked up on this class, not Field.class
                VALUE_OFFSET = UNSAFE.objectFieldOffset(ContendedAtomicCounter.class.getDeclaredField("value"));
            } catch (NoSuchFieldException e) {
                throw new RuntimeException(e);
            }
        }

        public ContendedAtomicCounter(long value) {
            set(value);
        }

        public void set(final long value) {
            UNSAFE.putLong(this, VALUE_OFFSET, value);
        }

        public void setVolatile(long value) {
            UNSAFE.putLongVolatile(this, VALUE_OFFSET, value);
        }

        public long get() {
            return value;
        }

        public boolean compareAndSet(long expected, long updated) {
            return UNSAFE.compareAndSwapLong(this, VALUE_OFFSET, expected, updated);
        }

        public long incrementAndGet() {
            return getAndIncrement() + 1L;
        }

        public long getAndIncrement() {
            return UNSAFE.getAndAddLong(this, VALUE_OFFSET, 1L);
        }
    }

  25. Hi Martin,
    I'm confused by my results, which I got on a 64-bit JVM (JDK 8) and an Intel Core i7-4790 3.60GHz.
    JOL output says the object header takes 12 bytes, followed by a 4-byte alignment/padding gap:

     OFFSET  SIZE  TYPE  DESCRIPTION                VALUE
          0    12        (object header)            N/A
         12     4        (alignment/padding gap)
         16     8  long  AtomicLong.value           N/A
         24     8  long  PaddedAtomicLong.p1        N/A
         32     8  long  PaddedAtomicLong.p2        N/A
         40     8  long  PaddedAtomicLong.p3        N/A
         48     8  long  PaddedAtomicLong.p4        N/A
         56     8  long  PaddedAtomicLong.p5        N/A
    Instance size: 64 bytes
    Space losses: 4 bytes internal + 0 bytes external = 4 bytes total

    So given that layout, I should be correct in using only 5 padding fields in my PaddedAtomicLong.
    Surprisingly, the test duration varies from run to run:
    sometimes it is about 5 seconds, and sometimes it runs for 18 seconds.

    I've even tried adding some warm-up phases before the actual test, but that didn't help either.

    Why is that happening? Any ideas?
