Comments on Mechanical Sympathy: Memory Barriers/Fences (Martin Thompson)

Prabath (2018-08-12 06:43):
Hi Martin,
I'm still puzzled by one thing. In a multicore CPU, how do memory barriers work? For example, as you've stated, SFENCE makes sure that all store instructions prior to the barrier are written to the L1 cache. But my understanding is that cache coherency is not instantaneous, and it may take a couple of cycles for a write to become globally visible. Am I right about this? If I am, then how does using SFENCE ensure the happens-before contract of a Java volatile?

Martin Thompson (2017-11-30 22:23):
The L3 cache can snoop into the private L1/L2 caches to fetch a copy of a modified cache line. You can read up on how MESI+F is implemented for Intel caches.
Petr Bouda (2017-11-30 21:03):
Hi Martin,

"A store barrier, “sfence” instruction on x86, waits for all store instructions prior to the barrier to be written from the store buffer to the L1 cache for the CPU on which it is issued. This will make the program state visible to other CPUs so they can act on it if necessary."

I thought the L1 cache was CPU-specific; how can other CPUs act on data that is in the L1 cache of one CPU? I thought that (according to the picture) only the L3 cache is shared between CPUs.

Thanks and regards :),
Petr
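The visibility guarantee under discussion can be exercised from Java: once a reader observes a volatile flag set by a writer, it must also observe every store the writer made before setting it. A minimal sketch, with illustrative class and field names:

```java
// Sketch: safe publication via a volatile flag (illustrative names).
public class Publication {
    static int data;               // plain field
    static volatile boolean flag;  // the volatile store/load is the hand-off point

    static int runOnce() {
        data = 0;
        flag = false;
        Thread writer = new Thread(() -> {
            data = 42;   // ordinary store...
            flag = true; // ...may not be reordered below this volatile store
        });
        writer.start();
        while (!flag) {            // volatile load; spin until the flag is seen
            Thread.onSpinWait();
        }
        int seen = data;           // guaranteed to be 42 once flag was observed true
        try {
            writer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return seen;
    }

    public static void main(String[] args) {
        for (int i = 0; i < 1000; i++) {
            if (runOnce() != 42) throw new AssertionError("stale read of data");
        }
        System.out.println("ok");
    }
}
```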
Elias Balasis (2017-09-21 22:25):
This is good stuff. In other words, the less memory barriers are used, the faster the program.

I bet Agrona, JCTools, Disruptor, Aeron and any derivatives were built with such principles in mind.

Thanks for publishing deep and comprehensive knowledge on such low-level, otherwise unknown, issues.

Gaurav Agarwal (2016-08-20 15:42):
Cool! Thanks.

Martin Thompson (2016-08-20 09:03):
Yes, you are correct. This old post needed fixing. I have :-)

Gaurav Agarwal (2016-08-20 04:19):
Hi Martin, I am probably misunderstanding the following comment on the blog:

"In the Java Memory Model a volatile field has a store barrier inserted after a write to it and a load barrier inserted before a read of it."

Shouldn't a store barrier be inserted before a write instruction, and a load barrier after a read instruction?

In the original formulation, if a store barrier is inserted after the write, it does not prevent the instructions before that barrier from reordering among themselves, and a similar argument applies to load instructions after the load barrier.
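For reference, Doug Lea's JSR-133 Cookbook places the barriers as follows: a StoreStore before a volatile store, the expensive StoreLoad after it, and LoadLoad/LoadStore after a volatile load. A sketch with the conceptual barriers marked as comments (the class and values are illustrative; the code itself is a trivial round trip):

```java
// Sketch: conceptual barrier placement per the JSR-133 cookbook.
public class BarrierPlacement {
    static int plain;
    static volatile int shared;

    static int demo() {
        plain = 1;
        // [StoreStore]  -- the plain store above may not sink below the volatile store
        shared = plain + 6;    // volatile store (writes 7)
        // [StoreLoad]   -- the expensive one: the store must complete before later loads
        int r = shared;        // volatile load
        // [LoadLoad] [LoadStore] -- later loads/stores may not hoist above the volatile load
        return r;
    }

    public static void main(String[] args) {
        System.out.println(demo());   // prints 7
    }
}
```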
Martin Thompson (2016-06-24 12:46):
Memory barriers are about providing sequential consistency [1], not about which version of a value is seen. The memory barrier prevents the reordering of the load of "b" with the store of "a" on thread A, and vice versa on thread B. The full Dekker's algorithm also requires a "turn" [2].

[1] https://en.wikipedia.org/wiki/Sequential_consistency
[2] https://en.wikipedia.org/wiki/Dekker%27s_algorithm
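In Java, declaring both flags volatile makes these accesses sequentially consistent, so the "both threads win" outcome is excluded (though it is possible that neither wins). A runnable sketch of that guarantee, with hypothetical names:

```java
import java.util.concurrent.CyclicBarrier;

// Sketch: the start of Dekker's algorithm with volatile flags.
// Sequential consistency forbids both threads reading 0.
public class Dekker {
    static volatile int a, b;

    static boolean bothWin() {
        a = 0;
        b = 0;
        final boolean[] wins = new boolean[2];
        final CyclicBarrier start = new CyclicBarrier(2); // release both threads together
        Thread ta = new Thread(() -> {
            try { start.await(); } catch (Exception e) { throw new RuntimeException(e); }
            a = 1;                       // volatile store...
            if (b == 0) wins[0] = true;  // ...may not be reordered with this volatile load
        });
        Thread tb = new Thread(() -> {
            try { start.await(); } catch (Exception e) { throw new RuntimeException(e); }
            b = 1;
            if (a == 0) wins[1] = true;
        });
        ta.start();
        tb.start();
        try {
            ta.join();
            tb.join();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        return wins[0] && wins[1];
    }

    public static void main(String[] args) {
        for (int i = 0; i < 500; i++) {
            if (bothWin()) throw new AssertionError("both threads won");
        }
        System.out.println("ok");
    }
}
```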
Petrakeas (2016-06-24 01:11):
Hello Martin,

I think I understand the acquire/release semantics of memory barriers and the fact that they can create a happens-before relationship. However, I am having trouble understanding how visibility kicks in.

For example, if we take the beginning of Dekker's algorithm and add a memory barrier, is it guaranteed that only a single thread will win?

int a = 0;
int b = 0;

thread A: a = 1; MemoryBarrier(); if (b == 0) Console.WriteLine("A wins");
thread B: b = 1; MemoryBarrier(); if (a == 0) Console.WriteLine("B wins");

From my understanding, the memory barrier does not guarantee that the second thread will be able to read the "fresh" value that the first thread wrote. However, *if* it sees it, it will also see all the previous stores.

Martin Thompson (2016-03-06 13:36):
The LOCK prefix is implicit on some instructions. When added explicitly it makes no difference, but it does clarify the ASM output for those that read it.
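On the Java side, the compare-and-swap under discussion surfaces as AtomicLong.compareAndSet, which HotSpot compiles to LOCK CMPXCHG on x86; the LOCK prefix makes the read-modify-write atomic and acts as a full fence. A minimal sketch:

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch: compareAndSet is the JVM-level view of LOCK CMPXCHG.
public class CasDemo {
    static long demo() {
        AtomicLong counter = new AtomicLong(0);
        boolean first  = counter.compareAndSet(0, 1); // succeeds: value was 0
        boolean second = counter.compareAndSet(0, 2); // fails: value is now 1
        return (first && !second) ? counter.get() : -1;
    }

    public static void main(String[] args) {
        System.out.println(demo());   // prints 1
    }
}
```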
FK (2016-01-05 04:11):
Hi Martin,

Excellent article; thank you for sharing. I have one question, though.

I was looking at Intel's IA manual, and it mentions that the CMPXCHG8B (compare-and-exchange) instruction is implicitly locked:

"CMPXCHG8B Instruction. The compare-and-exchange 8 bytes (64 bits) instruction is supported (implicitly locked and atomic)."

If it is implicitly locked, then why would Java translate atomic instructions to a LOCK prefix, as mentioned in your article? Please let me know if I am missing something or if my understanding is not correct.

Thanks

Martin Thompson (2015-07-25 14:02):
There is a lot to learn. To help, I offer courses on this subject.

http://real-logic.co.uk/training.html
forked_franz (2015-07-25 12:08):
Hi Martin,
I've read a lot of articles (on blogs and in papers), plus Doug Lea's cookbook (and the programmer's-view edition edited by Gil Tene), and I'm really confused by the rationale behind the JMM (because I don't get it!). I feel I need to start from the basics to understand these concepts and use them correctly while programming. Can you suggest a good approach or sequence of readings to master the JMM concepts (from the high-level view of volatile, atomics, etc. down to the low level of memory barriers)? I hope it isn't necessary to simply memorise all the "rules", and that there is a logical framework that can be applied to deduce the expected behaviour of (at least) the compiler!

Regards,
Francesco

Nadeem (2014-08-15 18:23):
By "hint to JIT", I mean the CPU does not do anything with the volatile keyword. It's the JIT compiler (or interpreter) that needs to generate instructions, fences, etc.

"As an interesting side note, the Java programming language takes a different approach. The Java memory model has a slightly stronger definition of 'volatile' that doesn't permit store-load reordering, so a Java compiler on the x86 will typically emit a locked instruction after a volatile write."
http://msdn.microsoft.com/en-us/magazine/jj883956.aspx

Martin Thompson (2014-08-15 18:02):
My understanding of the .NET Memory Model is that it is stronger than the Java Memory Model, particularly for field access: writes to a field have StoreStore ordering.

http://msdn.microsoft.com/en-us/magazine/jj863136.aspx

BTW, volatile is not a "hint" to the JIT. The interpreter, and code generated by the JIT compiler, must produce very specific behaviour for this synchronising action. In general, memory barriers matter far more to compiler optimisation than to the hardware.

Nadeem (2014-08-15 16:41):
Yes, I think it's the JIT, not the compiler, that generates ordering instructions and fences. Same thing in .NET. But the fields are not volatile in this case; that's why the barriers are needed. (I also think the Java memory model is stricter than .NET's.)

For example, a volatile on Intel x86 is only a hint for the JIT not to optimise the variable, since Intel x86 has a strong memory model (with the exception of write-read reordering).

Anyway, the store buffer is still an issue even on x86.
Martin Thompson (2014-08-15 09:32):
If you write code in a language that has a memory model, then the compiler will generate the appropriate ordering instructions for the processor it runs on.

For example, a Java load of a volatile field on x86 is just a simple MOV instruction, but on other processors, such as POWER and ARM, the compiler has to generate additional fences. You don't change the code you write in Java.

Nadeem (2014-08-14 21:18):
"The Wikipedia article is generic. It is not necessary to implement invalidate queues for MESI, and x86 does not."
This code may not run on x86. What if it runs on a CPU that does implement invalidate queues? It's safe to have a barrier then.

Martin Thompson (2014-08-13 08:26):
The Wikipedia article is generic. It is not necessary to implement invalidate queues for MESI, and x86 does not. If you look at the x86 assembly instructions generated for a volatile load in Java, you will see it is just a simple MOV instruction. For normal write-back memory, to achieve sequential consistency, x86 only needs a fence to prevent younger loads passing older stores due to the store buffer.
We have far more need for soft fences to prevent compiler reorderings.

Nadeem (2014-08-09 03:38):
I've read in different places that barriers are not only for ordering but also for flushing. See http://en.wikipedia.org/wiki/MESI_protocol
"A store barrier will flush the store buffer, ensuring all writes have been applied to that CPU's cache. A read barrier will flush the invalidation queue, thus ensuring that all writes by other CPUs become visible to the flushing CPU."

Martin Thompson (2014-08-08 22:44):
You cannot make any assumptions regarding the context in which B() will be called if you designed it as a library.

I do not believe Barrier 3 is about flushing cache queues. Barriers/fences are for ordering, not for flushing queues/buffers. Barrier 3 ensures the load of _complete is not reordered back in the stream from its intended position.

Nadeem (2014-08-08 18:46):
I understand that Barrier 3 is needed for a loop, but there is no loop in B(). Assuming there is no loop, is Barrier 3 necessary?

My guess, and I could be wrong, is that cache coherency is not atomic, and there is a delay because there is a queue for cache delivery between CPUs.
When the store buffer is flushed out at Barrier 2, _complete will be present only in that CPU's cache; it is not immediately visible to the other CPU running B(). So Barrier 3 will flush the cache queue. Is this possible?

Martin Thompson (2014-08-08 16:59):
Barrier 2 is required for sequential consistency, which can be achieved by waiting for the store buffer to drain. If B() were called in a loop, then Barrier 3 would prevent the read of _complete being hoisted outside the loop, which would be possible after inlining.

Nadeem (2014-08-06 22:09):
Actually, I read that barriers have to be paired in order to work correctly, and that a barrier in one CPU does not affect the other CPUs. That is what I don't understand. For example, the original code that confuses me is the following C# code, which uses full barriers (taken from O'Reilly's "C# in a Nutshell"):

class Foo
{
    int _answer;
    bool _complete;
    void A()
    {
        _answer = 123;
        Thread.MemoryBarrier(); // Barrier 1
        _complete = true;
        Thread.MemoryBarrier(); // Barrier 2
    }
    void B()
    {
        Thread.MemoryBarrier(); // Barrier 3
        if (_complete)
        {
            Thread.MemoryBarrier(); // Barrier 4
            Console.WriteLine (_answer);
        }
    }
}

The author says: "Barriers 1 and 4 prevent this example from writing “0”.
Barriers 2 and 3 provide a freshness guarantee: they ensure that if B ran after A, reading _complete would evaluate to true."

These are full barriers. I understand that Barriers 1 and 4 are needed for ordering: Barrier 1 makes sure _answer is written before _complete, while Barrier 4 ensures that the read of _complete happens before the read of _answer.

Now, I don't understand why both Barrier 2 and Barrier 3 are needed. Isn't one of them enough? Say Barrier 2 flushes the store buffer; then Barrier 3 is redundant.
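For comparison, a Java transliteration of the C# example above (hypothetical class and method names): marking _complete volatile supplies roughly the same orderings, with the volatile store covering Barriers 1 and 2 and the volatile load covering Barriers 3 and 4.

```java
// Sketch: the C# Foo example rendered with a Java volatile (illustrative names).
public class VolatileFoo {
    int _answer;
    volatile boolean _complete;

    void a() {                  // corresponds to A()
        _answer = 123;
        _complete = true;       // volatile store: covers Barriers 1 and 2
    }

    Integer b() {               // corresponds to B()
        if (_complete) {        // volatile load: covers Barriers 3 and 4
            return _answer;     // guaranteed to be 123 if _complete was seen true
        }
        return null;
    }

    public static void main(String[] args) {
        VolatileFoo f = new VolatileFoo();
        f.a();
        System.out.println(f.b());   // prints 123
    }
}
```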