Comments on Mechanical Sympathy: Write Combining (blog by Martin Thompson)

jumar (2019-05-13):

Just ran this on my machine and got the following results when running the loops 10 times:

1 SingleLoop duration (ns) = 7074676009
1 SplitLoop duration (ns) = 4179656857
2 SingleLoop duration (ns) = 6974691458
2 SplitLoop duration (ns) = 4243834696
3 SingleLoop duration (ns) = 5057173801
3 SplitLoop duration (ns) = 4281710759
4 SingleLoop duration (ns) = 5053223285
4 SplitLoop duration (ns) = 3952401242
5 SingleLoop duration (ns) = 4739710461
5 SplitLoop duration (ns) = 4188487184
6 SingleLoop duration (ns) = 4761019124
6 SplitLoop duration (ns) = 4219472213
7 SingleLoop duration (ns) = 5078802967
7 SplitLoop duration (ns) = 3970636511
8 SingleLoop duration (ns) = 4778556539
8 SplitLoop duration (ns) = 4002392222
9 SingleLoop duration (ns) = 4764734738
9 SplitLoop duration (ns) = 3940427992
10 SingleLoop duration (ns) = 4931735291
10 SplitLoop duration (ns) = 3963758487

I'm wondering why the difference is much smaller starting from the 3rd loop.

Peter Cordes' comment is interesting and perhaps explains this somehow, but I'm not sure I fully understand it O:-).

PS: Running this on a recent MacBook Pro with an Intel Core i7 2.6 GHz (6 cores), 32 GB RAM, and JDK 12.

Martin Thompson (2019-02-26):

This is
an old blog and you are correct to point out that what is happening is not that well explained. I am referring to the Line Fill Buffers, which can be used for write combining on Intel CPUs; AMD keeps them separate. I keep meaning to revisit my blog but struggle to find the time.

Anonymous (2019-02-26):

I'm pretty sure this explanation for the observed performance isn't right.

"Write combining" to write-back memory happens either inside the store buffer (for back-to-back stores to the same line), or by L1d itself *being* the buffer: a line stays hot in Modified state while multiple stores commit to it, so it only needs to be written back once.

The performance effect you're seeing (and which Intel's optimization manual recommends avoiding by splitting loops with more than 4 output streams) is more likely from conflict misses when lines are evicted from L1d while there are still pending writes to them. How can this happen when L1d is 8-way associative? Because L1d replacement is only pseudo-LRU. True LRU takes significantly more bits per set to track the LRU state of 8 ways, so my understanding is that pseudo-LRU is common.

---

In any case, you seem to be talking about the LFBs (Line Fill Buffers). Nehalem has 10 of them, same as later CPUs.

LFBs (instead of L1d lines) are used for write-combining of NT stores, or I think stores to WC memory. That's only because they have to bypass cache.

That's maybe where the limit of 4 comes in, although I think Nehalem can use all 10 of its LFBs as WC buffers like SnB-family CPUs can.
Still, they're also needed for incoming lines and regular write-back to L2, so unless NT stores are *all* you're doing, it's definitely best to do all the stores for a single line back-to-back in the right order.

But your microbench doesn't do anything but normal stores.

So the mechanism you're proposing as the cause for this effect just doesn't make sense.

See also the discussion on Stack Overflow: https://stackoverflow.com/questions/53435632/are-write-combining-buffers-used-for-normal-writes-to-wb-memory-regions-on-intel#comment96160262_53435632 The actual question is asking about this, and the answer is a similar microbenchmark. The conclusions are questionable, but the performance counter results are maybe interesting. (Still, neither BeeOnRope nor I are convinced that it's actually demonstrating use of LFBs for write-combining of normal stores.)

And in any case, that's about combining in an LFB while waiting for a cache line to arrive. You're talking about somehow combining something before/during write-back from L1 or L2 to L3. That just makes no sense; there's nothing to combine with - it's already a full-line write-back.

Anonymous (2018-03-31):
Here is a slightly generalized C-based version:
https://github.com/artpol84/poc/tree/master/benchmarks/write_combine

Anonymous (2018-03-31):

Here is a slightly generalized C version with some results:
https://github.com/artpol84/poc/tree/master/benchmarks/write_combine

Martin Thompson (2016-03-06):

Intel processors now have 10 LF/WC buffers, so this is not such an issue any longer.

Alexey Pakseykin (2015-11-17):

Hmm...
I wrote a C++ version which demonstrates a ~2.5x improvement for the split loop:

1 SingleLoop duration (ns) = 12139922244
1 SplitLoop duration (ns) = 4732561921
2 SingleLoop duration (ns) = 12129320126
2 SplitLoop duration (ns) = 4777225938
3 SingleLoop duration (ns) = 12126297712
3 SplitLoop duration (ns) = 4716507099
result = 21

However, Java performs well in both cases (note only a ~10% performance hit for Java compared to the best case in C++):

1 SingleLoop duration (ns) = 5311133217
1 SplitLoop duration (ns) = 5054977738
2 SingleLoop duration (ns) = 5090976210
2 SplitLoop duration (ns) = 5276584630
3 SingleLoop duration (ns) = 5219806807
3 SplitLoop duration (ns) = 5931649956
result = 21

Does the modern JIT smartly compile away the difference?

Platform: Fedora 23 x86_64, Intel Core i5-4460, 8GB, openjdk-1.8.0.65
C++ sources: https://gist.github.com/uvsmtid/52caa3f2cfab287b2b80

Alexey Pakseykin (2015-11-17):

Hmm... I compared the C++ version with Java (and see a 10% C++ improvement in the best/3-arrays version).

However, both cases in Java are almost the same - no performance hit. Does the JIT optimize away the difference?

Platform: Fedora 23 x86_64, Intel® Core™ i5-4460, openjdk-1.8.0.65-3.b17, gcc-c++-5.1.1
C++ source: https://gist.github.com/uvsmtid/52caa3f2cfab287b2b80

Kin Cheung (2015-09-14):

Thank you so much for the information.
This blog is very interesting, in a geeky way.

I have read a post somewhere saying that the write combining buffer is no longer worth application programmers looking into, and also that it has been renamed/replaced with the fill buffer to reflect its change of function.

Also, I have tested it on my i7-2640M laptop and I don't see any performance gain either - rather a performance degradation in the split-loop test cases.

Martin Thompson (2014-06-03):

The latest Intel CPUs now have 10 write combining buffers, so the effect is much less pronounced. Other processors, such as AMD's, can have fewer.

Pranas (2014-06-03):
It is impressive to see your results!
I can see a ~10% improvement on an i7-2630QM, Win7 x64. As you mentioned, threads may compete with each other, but there is still some benefit in applying the technique even on a threaded Intel CPU.

Martin Thompson (2014-01-19):

The content of the WC buffer does not wait to fill before writing. It is written to the cache line as soon as it is available. A WC buffer is 64 bytes, i.e. the size of a cache line.

You have only 4 WC buffers per core. Therefore you can only write to 4 distinct locations that reside in different cache lines, if those cache lines are not in the L1/L2 caches.

Anonymous (2014-01-13):

Is it something related to the fact that your processor has only 4 buffers, so you can write from at most four distinct memory zones to your WC buffers?

Anonymous (2014-01-13):
Hi Martin,

Great post; a lot of useful information can be found on this blog.

I have a question about the write combining process. All information that has to be written to memory in the case of a cache miss (L1 or L2) will be grouped and written only when the WC buffer fills up with data; this gives us an important latency improvement compared with writing each change to a cache line individually. What is not clear to me in the previous code: when you write only three array elements inside the while loop, the WC buffer will be written to memory once it fills with information, right? (After each loop iteration we write 3 bytes to the WC buffer, and from what I know the size of this buffer is ~32 bytes, which means the buffer will only be filled after ~ten iterations, if my logic is correct.) If yes, what is the difference when you loop over all six array elements in the same loop? Am I misunderstanding something?

Martin Thompson (2013-12-30):

There are some issues with this test on more recent processors. I plan to redo this blog and bring it up to date.

Carlos Curotto (2013-12-30):
Hi Martin, thanks a lot for this blog.

I am running this program on my MacBook Pro, which has the following specs:

  Processor Name: Intel Core i7
  Processor Speed: 2.9 GHz
  Number of Processors: 1
  Total Number of Cores: 2
  L2 Cache (per Core): 256 KB
  L3 Cache: 4 MB
  Memory: 8 GB

These are the results I am getting:

1 SingleLoop duration (ns) = 5051671000
1 SplitLoop duration (ns) = 6574749000
2 SingleLoop duration (ns) = 4806397000
2 SplitLoop duration (ns) = 5931679000
3 SingleLoop duration (ns) = 4786564000
3 SplitLoop duration (ns) = 5521178000
result = 21

I am seeing that the single loop is actually faster than the split loop - how can this be possible?

Thanks a lot,
Carlos.

Prabath (2013-09-04):
Hi Martin,
Thanks for this really useful article. Could you please clarify two things for me: Is there a reason for assigning the arrays in reverse order? And I don't get how separating the assignment into two loops helps use the WC store buffers efficiently - I mean, how does the splitting help in flushing the buffers?

Martin Thompson (2013-03-19):

I cannot speak for how the __iowrite64_copy() function is implemented. Have you looked at the generated assembler? Just Google for the MOVNTDQ instruction :-)

smirnon (2013-03-19):

Yes, access is aligned. Sorry, but could you please elaborate more on this MOVNTDQ instruction? The __iowrite64_copy routine in the Linux kernel is supposed to help with the combining, if my guess at what you are referring to is correct?

Martin Thompson (2013-03-19):

I believe you need to use MOVNTDQ to have streaming writes to WC memory and get them combined. Also, are you ensuring aligned access?

smirnon (2013-03-19):

Thanks for your reply.
Yes, so I copy 64 bytes to a particular memory location on this WC-mapped BAR/area. Ideally these should have gone out on the PCIe bus as just one big 64-byte PCIe TLP/packet, but I see 8 packets of 64 bits each going out, which indicates combining hasn't kicked in.

Martin Thompson (2013-03-19):

My experience of kernel drivers is very dated now. It sounds like you want to set the memory type to WC. This blog refers to how the WC buffers are used with write-back memory. I'd need to understand the issue you are seeing much better before I could give advice.

How do you know the writes to the same cache line are not being combined?

smirnon (2013-03-18):
Martin,
Great post!! I want to know how this works in the case of kernel drivers, however - would you happen to know that? That is, I have a BAR region on my adapter that I can map in WC mode using ioremap_wc() in the Linux kernel, and then use a routine like __iowrite64_copy() to copy the data onto this mapped area. However, this does not always do the combining for me!

Is there a possibility that if the system is idle, it just sends out 64-bit/8-byte writes as it receives them, instead of combining them into one big 64-byte write?

Martin Thompson (2012-07-04):

Sorry for the slow response - this got lost in my inbox. Write combines happen for a combined cache miss on L1 and L2. If you follow the code above, the arrays are sufficiently large that they do not fit in the combined L1 & L2 caches. L2 is not inclusive or exclusive with L1 on Nehalem onwards. Think of L2 as a staging area between L1 and L3 to reduce each core beating on the L3. L1 and L2 for each core are inclusive in the L3.

If an existing line in the L1 and L2 combination needs to be evicted to L3, then it may only need to be written to L3 and is thus not always a write-back to main memory.

From Java you have no control over the type of memory, such as write-back, write-through, or write-combining. For Java everything is write-back and all goes via the cache. Write combining as discussed here is different from enabling the write-combining memory type; that requires ASM.

Anonymous (2012-06-07):
Hi Martin,
I found this blog very, very interesting. I have a question: do write-combines happen only when there is a cache miss at L1, and otherwise write-backs happen? If yes, to utilise the optimisation of write-combines, how do we make sure/guarantee there is a cache miss? And any ideas about using write-throughs? Please correct me if I'm completely wrong.

Thanks
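For readers without the original post to hand, the benchmark the commenters above are running (one loop storing to six array streams versus two loops of three streams each) can be sketched roughly as follows. This is a reconstruction for illustration only, not the blog's original source: the class name, array sizes, and iteration count are assumptions, and the constants are kept small here so it runs quickly - reproducing the durations posted above needs much larger values.

```java
// Sketch of the single-loop vs split-loop write-combining benchmark
// discussed in this thread. Sizes, iteration count, and names are
// illustrative assumptions, not the blog's original constants.
public final class WriteCombiningSketch
{
    static final int ITEMS = 1 << 20;       // assumed working-set size per array
    static final int MASK = ITEMS - 1;
    static final int ITERATIONS = 1 << 22;  // assumed; scale up to see the effect clearly

    static final byte[] arrayA = new byte[ITEMS];
    static final byte[] arrayB = new byte[ITEMS];
    static final byte[] arrayC = new byte[ITEMS];
    static final byte[] arrayD = new byte[ITEMS];
    static final byte[] arrayE = new byte[ITEMS];
    static final byte[] arrayF = new byte[ITEMS];

    public static void main(final String[] args)
    {
        for (int run = 1; run <= 3; run++)
        {
            System.out.println(run + " SingleLoop duration (ns) = " + runSingleLoop());
            System.out.println(run + " SplitLoop  duration (ns) = " + runSplitLoop());
        }
    }

    // One loop storing to 6 distinct cache-line streams per iteration: on cores
    // with only 4 write-combining/fill buffers, the streams compete for buffers.
    static long runSingleLoop()
    {
        final long start = System.nanoTime();
        int i = ITERATIONS;
        while (--i != 0)
        {
            final byte b = (byte)i;
            final int slot = i & MASK;
            arrayA[slot] = b;
            arrayB[slot] = b;
            arrayC[slot] = b;
            arrayD[slot] = b;
            arrayE[slot] = b;
            arrayF[slot] = b;
        }
        return System.nanoTime() - start;
    }

    // The same stores split into two passes of 3 streams each, so the number of
    // concurrent write streams stays within the per-core buffer budget.
    static long runSplitLoop()
    {
        final long start = System.nanoTime();
        int i = ITERATIONS;
        while (--i != 0)
        {
            final byte b = (byte)i;
            final int slot = i & MASK;
            arrayA[slot] = b;
            arrayB[slot] = b;
            arrayC[slot] = b;
        }
        i = ITERATIONS;
        while (--i != 0)
        {
            final byte b = (byte)i;
            final int slot = i & MASK;
            arrayD[slot] = b;
            arrayE[slot] = b;
            arrayF[slot] = b;
        }
        return System.nanoTime() - start;
    }
}
```

Note that both variants write exactly the same data to the same locations; only the grouping of the store streams differs, so any timing gap between them reflects the memory subsystem (buffer competition, eviction behaviour) rather than the amount of work done - which is precisely what the thread is arguing about.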