Comments on Mechanical Sympathy: False Sharing && Java 7
Martin Thompson (http://www.blogger.com/profile/15893849163924476586)
Updated 2021-11-26T19:34:10.855+00:00
Showing comments 1-25 of 41

Posted 2019-08-20T14:29:28.702+01:00

Hi Martin
I'm confused by the results I got on my 64-bit JVM (JDK 8, Intel Core i7-4790, 3.60 GHz).
JOL output says that the object header takes 12 bytes, and that the alignment/padding gap takes 4 bytes:

 OFFSET  SIZE  TYPE  DESCRIPTION                VALUE
      0    12        (object header)            N/A
     12     4        (alignment/padding gap)
     16     8  long  AtomicLong.value           N/A
     24     8  long  PaddedAtomicLong.p1        N/A
     32     8  long  PaddedAtomicLong.p2        N/A
     40     8  long  PaddedAtomicLong.p3        N/A
     48     8  long  PaddedAtomicLong.p4        N/A
     56     8  long  PaddedAtomicLong.p5        N/A
Instance size: 64 bytes
Space losses: 4 bytes internal + 0 bytes external = 4 bytes total

So given that, I'm correct to use only 5 padding fields in my PaddedAtomicLong.
Surprisingly, the test durations vary from run to run.
Sometimes a run takes about 5 seconds, and sometimes about 18 seconds.

I've even tried adding some warmup phases before the actual test, but that didn't help either.

Why is that happening? Any ideas?

bartek s (https://www.blogger.com/profile/12436232656782924438)

Posted 2019-03-23T15:10:03.137+00:00

Do contended variables work well with Unsafe.putLong/Unsafe.putOrderedLong?
I've written a very simple class, and it doesn't seem to be working as I expect:

    public class ContendedAtomicCounter {

        private static final Unsafe UNSAFE = UnsafeAccess.UNSAFE;
        private static final long VALUE_OFFSET;

        @Contended
        private volatile long value;

        static {
            try {
                VALUE_OFFSET = UNSAFE.objectFieldOffset(ContendedAtomicCounter.class.getDeclaredField("value"));
            } catch (NoSuchFieldException e) {
                throw new RuntimeException(e);
            }
        }

        public ContendedAtomicCounter(long value) {
            set(value);
        }

        // Plain store, no ordering guarantees.
        public void set(final long value) {
            UNSAFE.putLong(this, VALUE_OFFSET, value);
        }

        // Store with volatile semantics.
        public void setVolatile(long value) {
            UNSAFE.putLongVolatile(this, VALUE_OFFSET, value);
        }

        public long get() {
            return value;
        }

        public boolean compareAndSet(long expected, long updated) {
            return UNSAFE.compareAndSwapLong(this, VALUE_OFFSET, expected, updated);
        }

        public long incrementAndGet() {
            return getAndIncrement() + 1L;
        }

        public long getAndIncrement() {
            return UNSAFE.getAndAddLong(this, VALUE_OFFSET, 1L);
        }

    }

Anonymous (https://www.blogger.com/profile/16038430247672441985)

Posted 2017-04-25T04:01:54.451+01:00

Is it possible to write a Java agent that pads objects when they are annotated? That way client code stays clean.
Thoughts?

Anonymous (https://www.blogger.com/profile/06997056040124374920)

Posted 2017-03-15T17:26:14.832+00:00

It is all about blocking progress.
Store buffers, write combining buffers, out of order execution, etc. Memory-ordered writes highlight the issues. Fundamentally, whether you can make progress without the L1 cache line being in the exclusive state when using the ordered write operations is worth digging into.

Martin Thompson (https://www.blogger.com/profile/15893849163924476586)

Posted 2017-03-15T12:54:20.610+00:00

Yes, changing it to an AtomicLongArray does cause the performance to degrade when there is false sharing.

I suspect that since AtomicLongArray is doing volatile accesses, the memory barriers are flushing the store buffer, thus causing invalidations if the counters reside in the same cacheline. On the other hand, since the code above does not perform volatile accesses (the counters are not shared, except when the main thread accumulates the sum), perhaps the store buffers aren't being flushed as often?

Anonymous (https://www.blogger.com/profile/01496855006123176775)

Posted 2017-03-15T12:05:51.359+00:00

Try changing your code to use an AtomicLongArray, then try the various options on that.

Martin Thompson (https://www.blogger.com/profile/15893849163924476586)
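
A minimal sketch of the AtomicLongArray variant suggested above, written for illustration and not taken from the thread. The class name `AtomicCounterStride`, the `run` helper, the thread cap and the iteration count are all assumptions. With a stride of 1 the counters sit in adjacent slots; with a stride of 8 they are 64 bytes apart, so no two counters can share a 64-byte cache line. There is no warmup phase here, so timings are only indicative and will vary by machine and JVM.

```java
import java.util.concurrent.atomic.AtomicLongArray;

// Hypothetical benchmark sketch: one counter per thread, spaced `stride` longs apart.
class AtomicCounterStride {

    // Runs `numThreads` workers, each incrementing its own slot `iterations` times,
    // and returns the sum of all counters.
    static long run(int numThreads, int stride, int iterations) {
        AtomicLongArray counters = new AtomicLongArray(numThreads * stride);
        Thread[] threads = new Thread[numThreads];
        for (int i = 0; i < numThreads; i++) {
            final int slot = i * stride;
            threads[i] = new Thread(() -> {
                for (int k = 0; k < iterations; k++) {
                    counters.incrementAndGet(slot); // atomic read-modify-write each time
                }
            });
        }
        long start = System.nanoTime();
        for (Thread t : threads) {
            t.start();
        }
        try {
            for (Thread t : threads) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        long sum = 0;
        for (int i = 0; i < numThreads; i++) {
            sum += counters.get(i * stride);
        }
        System.out.println("stride=" + stride + " sum=" + sum + " elapsedMs=" + elapsedMs);
        return sum;
    }

    public static void main(String[] args) {
        int threads = Math.max(2, Math.min(4, Runtime.getRuntime().availableProcessors()));
        int iterations = 5_000_000; // assumed value, kept small so the sketch finishes quickly
        run(threads, 1, iterations); // adjacent slots: counters may share cache lines
        run(threads, 8, iterations); // 8 longs apart: counters are 64 bytes apart
    }
}
```

Comparing the two `elapsedMs` figures over several runs gives a rough idea of the cost of false sharing on a given machine; the sums should be identical in both cases.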

Posted 2017-03-15T12:01:52.301+00:00

Hi Martin,

I've been reading about false sharing, cache coherence protocols, etc., and am doing some microbenchmarking in Java to see how false sharing affects performance. Basically, I have an array of longs of size num_threads * longs_per_cacheline (8 * 8 in my case). Each thread i then increments its own long in the array (a[0], a[8], a[16], ...). This should prevent false sharing since there is only one counter per cacheline. However, when I change the offset so that the counters sit next to each other in the array (a[0], a[1], a[2], ...), I see little to no performance degradation. Running perf should reveal an increase in cache misses in the non-padded version due to increased cache coherence traffic, but I see no indication of that. What is going on?

    public class SharedCounter2 {
        private static final int NUM_THREADS = Runtime.getRuntime().availableProcessors();
        private static final int OFFSET = 8; // distance between counters, in longs
        private static long[] counters = new long[NUM_THREADS * OFFSET];
        private static Thread[] threads = new Thread[NUM_THREADS];

        public static void main(String[] args) {
            TimerUtil.time(() -> {
                for (int i = 0; i < NUM_THREADS; i++) {
                    final int j = i;
                    threads[i] = new Thread(() -> {
                        for (int k = 0; k < 2100000000; k++) {
                            counters[j * OFFSET]++; // plain, non-volatile increment
                        }
                    });
                }

                for (int i = 0; i < NUM_THREADS; i++) {
                    threads[i].start();
                }

                long sum = 0;
                for (int i = 0; i < NUM_THREADS; i++) {
                    try {
                        threads[i].join();
                        sum += counters[i * OFFSET];
                        counters[i * OFFSET] = 0;
                    } catch (InterruptedException e) {
                        e.printStackTrace();
                    }
                }

                System.out.println("Sum: " + sum);
            }, 10);
        }
    }

P.S.
A similar version of the above code, using a shared AtomicLong on which all threads simply call getAndIncrement(), performs horribly, as expected.

Anonymous (https://www.blogger.com/profile/01496855006123176775)

Posted 2017-02-05T19:07:41.822+00:00

Yes, the flag made it performant. Interesting that Oracle would want to restrict it by default.

Anonymous (https://www.blogger.com/profile/08797031830133225722)

Posted 2017-02-05T18:35:25.847+00:00

Try -XX:-RestrictContended as a JVM flag.

Martin Thompson (https://www.blogger.com/profile/15893849163924476586)

Posted 2017-02-05T18:28:36.061+00:00

I modified your previous version of the code to annotate the value with @Contended, and the results are as bad as with false sharing. Did you see better performance with @Contended?

    public final static class VolatileLong {
        public @Contended volatile long v;
    }

Anonymous (https://www.blogger.com/profile/08797031830133225722)

Posted 2016-09-05T18:38:35.779+01:00

These are the sorts of issues which concern programmers working on high-performance applications or those providing libraries to support concurrent/parallel code.
These are not the issues which should concern the typical programmer.

Martin Thompson (https://www.blogger.com/profile/15893849163924476586)

Posted 2016-09-05T18:35:22.350+01:00

Hi Martin,
Nice article. Two questions:
1. For ordinary systems without very demanding performance requirements (most enterprise apps I have worked on perform acceptably without any of this), is the added complexity to avoid cache line misses justified?
2. Having each programmer add this kind of padding by hand can be error-prone and makes the code look complex.

Vishal Chougule (https://www.blogger.com/profile/14170801245384345125)

Posted 2014-03-03T14:50:33.646+00:00

Yes, Intel has VTune. There are some other free tools that I discuss in this blog post:

http://mechanical-sympathy.blogspot.co.uk/2012/08/memory-access-patterns-are-important.html

Martin Thompson (https://www.blogger.com/profile/15893849163924476586)

Posted 2014-03-03T14:23:25.391+00:00

Fantastic article! Martin, a small question: does Intel, for example, have a tool for measuring cache hits and misses that you could recommend?

eugene (https://www.blogger.com/profile/17610978920670653103)

Posted 2014-02-06T10:23:37.906+00:00

It does not matter if too much padding is used. It is important that we use enough.
To discover object layout, try the following tool:

http://openjdk.java.net/projects/code-tools/jol/

I've seen object headers range from 4 to 12 bytes depending on the JVM.

Martin Thompson (https://www.blogger.com/profile/15893849163924476586)

Posted 2014-02-04T20:51:02.784+00:00

Hi Martin,

How will an object of type VolatileLong be stored in this case:

    public final static class VolatileLong // 16 bytes
    {
        public volatile long value = 0L; // 8 bytes
        public long p1, p2, p3, p4, p5, p6; // 6 * 8 = 48 bytes
    }

That seems to total 16 + 8 + 48 = 72 bytes > 64-byte cache line. As far as I know, the header will be 2 words since this is a 64-bit system. Will an object of type VolatileLong be stored across 2 cache lines? If two different objects of type VolatileLong can be stored one after another in memory, as in the example below, will the second object start at byte 9 of the second cache line, since the first object ends at byte 8 of the second cache line, or are another 7 bytes added as padding so that it starts at byte 16?

Anonymous (https://www.blogger.com/profile/14979787102337446172)

Posted 2014-01-18T09:27:06.664+00:00

Add another 8 bytes for the value itself and a minimum of 8 bytes for the object header.

Martin Thompson (https://www.blogger.com/profile/15893849163924476586)
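
The size arithmetic in the two comments above can be checked with a small sketch. It is an illustration under stated assumptions, not a description of any particular JVM: `PaddedLayout`, `alignUp8` and `instanceSize` are hypothetical names, and the model assumes that objects and long fields are 8-byte aligned and that the object holds only long fields. Because two distinct instances cannot overlap, their value fields are at least one instance size apart, so an estimated size of 64 bytes or more means the two values cannot land in the same 64-byte cache line.

```java
// Hypothetical layout estimate; real layouts should be confirmed with a tool such as JOL.
class PaddedLayout {

    // Rounds n up to the next multiple of 8 (assumed object and long alignment).
    static int alignUp8(int n) {
        return (n + 7) & ~7;
    }

    // Estimated instance size for an object whose only fields are `longFields` longs.
    static int instanceSize(int headerBytes, int longFields) {
        int firstFieldOffset = alignUp8(headerBytes); // a 12-byte header leaves a 4-byte gap
        return alignUp8(firstFieldOffset + 8 * longFields);
    }

    public static void main(String[] args) {
        int[] headers = {8, 12, 16};
        for (int header : headers) {
            // value plus p1..p6 gives 7 long fields
            System.out.println("header=" + header + " bytes -> instance="
                    + instanceSize(header, 7) + " bytes");
        }
    }
}
```

Under this model an 8-byte header with seven longs gives 8 + 8 + 48 = 64 bytes, a 16-byte header gives the 72 bytes computed in the question, and a 12-byte header with six longs (value plus five padding fields) gives the 64 bytes shown in the JOL listing elsewhere in this thread.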

Posted 2014-01-17T23:41:47.174+00:00

Hi Martin,

Wonderful blog.

Can you explain how you decide how many long member variables to use in PaddedAtomicLong?

E.g.

    public static class PaddedAtomicLong extends AtomicLong
    {
        public volatile long p1, p2, p3, p4, p5, p6 = 7L; // 8 * 6 = 48 bytes
    }

How do we ensure this fits exactly inside the cache line (which is 64 bits I presume)? Does this mean that the PaddedAtomicLong object itself takes up 16 bytes (8 bytes for the object, but what about the remaining 8 bytes)?

Thanks Martin.

Fuyuko (https://www.blogger.com/profile/06751052946921608256)

Posted 2013-11-06T18:07:03.242+00:00

I've not tried @Contended as it has not made it into a mainstream JVM yet.

Martin Thompson (https://www.blogger.com/profile/15893849163924476586)

Posted 2013-11-06T14:33:46.870+00:00

Hi Martin,
I was trying your benchmark code with the Java 8 @Contended annotation. The performance was similar to the version without padding. Did you manage to try it?

lifey (https://www.blogger.com/profile/15267616340430020475)

Posted 2012-11-23T22:16:36.590+00:00

See also the @Contended proposal: http://mail.openjdk.java.net/pipermail/hotspot-dev/2012-November/007309.html

Great blog, thumbs up!

Anonymous (https://www.blogger.com/profile/05611416514327880926)

Posted 2012-06-27T16:39:16.274+01:00

Using index 7 (the 8th element) ensures that there will be 56 bytes of padding on either side of the value. 56 bytes plus the 8 for the value ensures nothing else can share the same 64-byte cache line, regardless of the array's starting location in memory.

Martin Thompson (https://www.blogger.com/profile/15893849163924476586)
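
The index-7 argument above can be checked by brute force. The sketch below is illustrative only: `SequencePaddingCheck` and its methods are hypothetical names, and it assumes a 64-byte cache line and array data that starts on an 8-byte boundary. For each possible starting offset within a cache line, it tests whether the line holding the chosen element lies entirely inside the array's own data.

```java
// Hypothetical check of the "56 bytes either side" reasoning for a long[15].
class SequencePaddingCheck {
    static final int CACHE_LINE = 64; // bytes, as assumed in the thread
    static final int LONG_BYTES = 8;

    // True if the cache line containing element `index` is covered entirely by the array,
    // where element 0 starts `baseOffset` bytes into a cache line.
    static boolean lineIsPrivate(int baseOffset, int length, int index) {
        int elementAddr = baseOffset + index * LONG_BYTES;
        int lineStart = (elementAddr / CACHE_LINE) * CACHE_LINE;
        int lineEnd = lineStart + CACHE_LINE;            // exclusive
        int arrayEnd = baseOffset + length * LONG_BYTES; // exclusive
        return lineStart >= baseOffset && lineEnd <= arrayEnd;
    }

    // Checks every 8-byte-aligned starting offset within one cache line.
    static boolean privateForAllAlignments(int length, int index) {
        for (int base = 0; base < CACHE_LINE; base += LONG_BYTES) {
            if (!lineIsPrivate(base, length, index)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("long[15], index 7: " + privateForAllAlignments(15, 7));
        System.out.println("long[8],  index 0: " + privateForAllAlignments(8, 0));
    }
}
```

Under these assumptions the first check holds for every alignment, while a plain 8-element array with the value at index 0 only works when the array happens to start exactly on a cache line boundary.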

Posted 2012-06-27T16:08:19.093+01:00

Hi Martin,

I saw that the current version of Sequence in Disruptor now uses unsafe.compareAndSwapLong(..) to update the 7th index.

Why is the length of the long array 15 and not another size? Is it because an array of 15 longs exactly fills the Level 2 cache?

Cheers

John

John (https://www.blogger.com/profile/09264384154420386736)

Posted 2012-05-13T14:47:09.469+01:00

Assembly does not show the problem. You need to look for a lot of unexplained L2 cache misses as a hint that something is going on.

Martin Thompson (https://www.blogger.com/profile/15893849163924476586)

Posted 2012-05-13T05:51:48.236+01:00

OK, so you find the problem during a perf test run. How do you narrow down the issue? Are you digging into the assembly?

ying (https://www.blogger.com/profile/13841165158731972156)