Mechanical Sympathy: Write Combining

Friday, 15 July 2011

Write Combining

Modern CPUs employ lots of techniques to counteract the latency cost of going to main memory. These days CPUs can process hundreds of instructions in the time it takes to read or write data to the DRAM memory banks.

The major tool used to hide this latency is multiple layers of SRAM cache. In addition, SMP systems employ message passing protocols to achieve coherence between caches. Unfortunately CPUs are now so fast that even these caches cannot keep up at times. So to further hide this latency a number of less well known buffers are used.

This article explores “write combining buffers” and how we can write code that uses them effectively.

CPU caches are effectively unchained hash maps where each bucket is typically 64-bytes. This is known as a “cache line”. The cache line is the effective unit of memory transfer. For example, an address A in main memory would hash to map to a given cache line C.

If a CPU needs to work with an address which hashes to a line that is not already in cache, then the existing line that matches that hash needs to be evicted so the new line can take its place. For example if we have two addresses which both map via the hashing algorithm to the same cache line, then the old one must make way for the new cache line.

When a CPU executes a store operation it will try to write the data to the L1 cache nearest to the CPU. If a cache miss occurs at this stage the CPU goes out to the next layer of cache. At this point on an Intel, and many other, CPUs a technique known as “write combining” comes into play.

While the request for ownership of the L2 cache line is outstanding the data to be stored is written to one of a number of cache line sized buffers on the processor itself, known as line fill buffers on Intel CPUs. These on chip buffers allow the CPU to continue processing instructions while the cache sub-system gets ready to receive and process the data. The biggest advantage comes when the data is not present in any of the other cache layers.

These buffers become very interesting when subsequent writes happen to require the same cache line. The subsequent writes can be combined into the buffer before it is committed to the L2 cache. These 64-byte buffers maintain a 64-bit field which has the corresponding bit set for each byte that is updated to indicate what data is valid when the buffer is transferred to the outer caches.

Hang on I hear you say. What happens if the program wants to read some of the data that has been written to a buffer? Well our hardware friends have thought of that and they will snoop the buffers before they read the caches.

What does all this mean for our programs?

If we can fill these buffers before they are transferred to the outer caches then we will greatly improve the effective use of the transfer bus at every level. How do we do this? Well programs spend most of their time in loops doing work.

There are a limited number of these buffers, and they differ by CPU model. For example on an Intel CPU you are only guaranteed to get 4 of them at one time. What this means is that within a loop you should not write to more than 4 distinct memory locations at one time or you will not benefit from the write combining effect.

What does this look like in code?

import static java.lang.System.out;

public final class WriteCombining
{
    private static final int ITERATIONS = Integer.MAX_VALUE;
    private static final int ITEMS = 1 << 24;
    private static final int MASK = ITEMS - 1;

    private static final byte[] arrayA = new byte[ITEMS];
    private static final byte[] arrayB = new byte[ITEMS];
    private static final byte[] arrayC = new byte[ITEMS];
    private static final byte[] arrayD = new byte[ITEMS];
    private static final byte[] arrayE = new byte[ITEMS];
    private static final byte[] arrayF = new byte[ITEMS];

    public static void main(final String[] args)
    {
        for (int i = 1; i <= 3; i++)
        {
            out.println(i + " SingleLoop duration (ns) = " + runCaseOne());
            out.println(i + " SplitLoop  duration (ns) = " + runCaseTwo());
        }

        int result = arrayA[1] + arrayB[2] + arrayC[3] +
                     arrayD[4] + arrayE[5] + arrayF[6];
        out.println("result = " + result);
    }

    public static long runCaseOne()
    {
        long start = System.nanoTime();

        int i = ITERATIONS;
        while (--i != 0)
        {
            int slot = i & MASK;
            byte b = (byte)i;
            arrayA[slot] = b;
            arrayB[slot] = b;
            arrayC[slot] = b;
            arrayD[slot] = b;
            arrayE[slot] = b;
            arrayF[slot] = b;
        }

        return System.nanoTime() - start;
    }

    public static long runCaseTwo()
    {
        long start = System.nanoTime();

        int i = ITERATIONS;
        while (--i != 0)
        {
            int slot = i & MASK;
            byte b = (byte)i;
            arrayA[slot] = b;
            arrayB[slot] = b;
            arrayC[slot] = b;
        }

        i = ITERATIONS;
        while (--i != 0)
        {
            int slot = i & MASK;
            byte b = (byte)i;
            arrayD[slot] = b;
            arrayE[slot] = b;
            arrayF[slot] = b;
        }

        return System.nanoTime() - start;
    }
}

This program on my Windows 7 64-bit Intel Core i7 860 @ 2.8 GHz system produces the following output:

1 SingleLoop duration (ns) = 14019753545
1 SplitLoop duration (ns) = 8972368661
2 SingleLoop duration (ns) = 14162455066
2 SplitLoop duration (ns) = 8887610558
3 SingleLoop duration (ns) = 13800914725
3 SplitLoop duration (ns) = 7271752889

To spell it out, if we write to 6 array locations (memory addresses) inside one loop we see that the program takes significantly longer than if we split the work up, and write first to 3 array locations, then to the other 3 locations sequentially.

By splitting the loop we do much more work yet the program completes in much less time! Welcome to the magic of “write combining”. By using our knowledge of the CPU architecture to fill those buffers properly we can use the underlying hardware to accelerate our code by a factor of two.

Don’t forget that with hyper-threading you can have 2 threads in competition for these buffers on the same core.

49 comments:

Olivier Deheurles15 July 2011 at 17:02
Interesting, I never heard about this mechanism..

I imagine this must be a source of memory reordering?
I assume a fence in the loop must kill performance pretty badly (ie. if one of field is declared volatile for instance).

Writing to memory is slow so let's write locally and publish later... it reminds me the optimisation we were talking about on the disrupt's batch consumer: we keep incrementing consumer's sequence locally and publish to the producer only when we think it's required.
ReplyDelete
Replies
Martin Thompson15 July 2011 at 17:08
One of my next posts is going to be on why memory barriers are important to memory ordering and their impact on performance :-) This is something I considered a lot when writing the Disruptor (http://code.google.com/p/disruptor/)
ReplyDelete
Replies
Derek Lewis16 July 2011 at 00:17
Wow, that's definitely a new one to me. I didn't believe the results until I ran the test myself. ;) I've heard of write combining, but didn't realize what it was, and how to exploit it. Thanks Martin!
ReplyDelete
Replies
billywhizz16 July 2011 at 18:42
very interesting. i took the liberty of porting this to c and i see a 3x improvement between 8 writes versus 2 sets of 4 writes on my setup (fedora12/64bit/Intel Core 2 Quad 8200) - https://gist.github.com/1086581
ReplyDelete
Replies
Ashwin Jayaprakash16 July 2011 at 23:45
Very refreshing to see someone thinking/doing this in a high level lang like Java.
ReplyDelete
Replies
Martin Thompson17 July 2011 at 10:09
What factor improvement do you see in the 6/3 case in C? Be interesting to compare the array bounds checking cost of Java over C for this. Removed the last version because I hit return too soon :-)
ReplyDelete
Replies
billywhizz17 July 2011 at 15:29
i see a 3x improvement doing 3 v 6. goes up to 7x for 4 v 8. am using -O3 compiler flag with gcc. i'm no c expert so may be doing something wrong. latest code is here if you want to test: https://gist.github.com/1086581
ReplyDelete
Replies
billywhizz17 July 2011 at 15:36
on another note, have been thinking about doing an implementation of the disruptor in c/c++. i really like the idea of having all the business logic on a single thread with a large heap using as much physical RAM as possible. just doing that alone takes a lot of the pain out of coding in c/c++ and would make the system pretty easy to test and maintain, even in a low level language. would be interested if you have any thoughts on that... keep up the great work btw!
ReplyDelete
Replies
billywhizz17 July 2011 at 16:25
on the same machine, the java code above runs about 5-6% slower than the c code and i see a 2x performance improvment in the split loop (3 v 6)...
ReplyDelete
Replies
Martin Thompson17 July 2011 at 16:42
Doing a Disruptor in C/C++ is something I've been considering for a while. In the multi-producer scenario on x86 we could take advantage of the "lock xadd" instruction rather than a CAS. I like to think we have proven Java can give great performance. With C/C++/asm we can go further especially with being cache friendly and avoiding megamorphic method calls. The Disruptor is now so fast there is little real application benefit in making it faster. My choice would come down to what language the rest of the application is written in and to focus there. If I don't do one soon I'd be happy to help another port of the Disruptor.
ReplyDelete
Replies
billywhizz17 July 2011 at 17:44
i've done some tests here and using lock addq comes out 20% faster than lock xadd. this is the asm that gcc __sync_fetch_and_add generates:

0000000000400940 :
400940: 48 83 ec 08 sub $0x8,%rsp
400944: 8b 07 mov (%rdi),%eax
400946: c1 e0 14 shl $0x14,%eax
400949: 48 98 cltq
40094b: 48 85 c0 test %rax,%rax
40094e: 74 0f je 40095f
400950: f0 48 83 05 5f 0a 20 lock addq $0x1,0x200a5f(%rip) # 6013b8
400957: 00 01
400959: 48 83 e8 01 sub $0x1,%rax
40095d: 75 f1 jne 400950
40095f: 31 ff xor %edi,%edi
400961: e8 fa fc ff ff callq 400660
400966: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
40096d: 00 00 00
ReplyDelete
Replies
Martin Thompson17 July 2011 at 19:05
Another interesting alternative to CAS :-) I'm going to have to brush up my x86-64.
ReplyDelete
Replies
Martin Thompson17 July 2011 at 19:38
How does the addq code deal with threads racing to do the update? The lock makes it atomic but I don't see how it handles two threads loading the same value before the add then doing the add.
ReplyDelete
Replies
Martin Thompson18 July 2011 at 22:13
Is the asm above the calling of __sync_add_and_fetch() in a loop? I get a very different dump when I try it?
ReplyDelete
Replies
billywhizz19 July 2011 at 04:14
yes it's in a loop. this is the code i took the objdump from: https://gist.github.com/1091224. i've tested with 4 threads contending on the same counter and haven't seen any issues on my setup. will the fact it's in a loop make a difference to the locking?
ReplyDelete
Replies
Martin Thompson19 July 2011 at 07:43
OK that makes more sense now. The ASM above was for more than just the __sync_fetch_and_add(). I want to do some tests this morning to confirm a few things. I believe the code is correct but not sure if it is suitable for the multi-producer sequence claim in the Disruptor. I'll post my findings later.
ReplyDelete
Replies
billywhizz19 July 2011 at 16:20
cool. it looks to me like the __sync_fetch_and_add() just gets turned into a lock/addq instruction in the assembly.
ReplyDelete
Replies
Martin Thompson22 July 2011 at 11:13
On further investigation I've discovered that GCC uses the lock addq instruction when *only* a single thread accesses the variable and lock xadd when 2 or more threads access the variable like below.

40068e: f0 48 0f c1 15 a9 09 lock xadd %rdx,0x2009a9(%rip) # 601040

This is GCC being clever and optimising in the single threaded case.
ReplyDelete
Replies
Martin Thompson25 July 2011 at 08:42
It is even more simple than multiple threads. If the returned value is assigned to a variable then lock xadd is used.

int main (int argc, char *argv[])

{

unsigned long value = 0;

value = __sync_add_and_fetch(&value, 1);

printf("main value = %ld\n", value);

}

00000000004004f4 :
4004f4: 55 push %rbp
4004f5: 48 89 e5 mov %rsp,%rbp
4004f8: 48 83 ec 20 sub $0x20,%rsp
4004fc: 89 7d ec mov %edi,-0x14(%rbp)
4004ff: 48 89 75 e0 mov %rsi,-0x20(%rbp)
400503: 48 c7 45 f8 00 00 00 movq $0x0,-0x8(%rbp)
40050a: 00
40050b: 48 8d 55 f8 lea -0x8(%rbp),%rdx
40050f: b9 01 00 00 00 mov $0x1,%ecx
400514: 48 89 c8 mov %rcx,%rax
400517: f0 48 0f c1 02 lock xadd %rax,(%rdx)
40051c: 48 01 c8 add %rcx,%rax
40051f: 48 89 45 f8 mov %rax,-0x8(%rbp)
400523: 48 8b 55 f8 mov -0x8(%rbp),%rdx
400527: b8 2c 06 40 00 mov $0x40062c,%eax
40052c: 48 89 d6 mov %rdx,%rsi
40052f: 48 89 c7 mov %rax,%rdi
400532: b8 00 00 00 00 mov $0x0,%eax
400537: e8 b4 fe ff ff callq 4003f0
40053c: c9 leaveq
40053d: c3 retq
40053e: 90 nop
40053f: 90 nop
ReplyDelete
Replies
John Carrino10 September 2011 at 06:32
I found this blog and think it's brilliant. One thing I came across that might solve your problem of avoiding a write barrier in the add case was AtomicLong.weakCompareAndSet.

Sadly, this just does compareAndSet under the covers which is very unfortunate. I think that if they added a weakAddAndGet then it could just use 'lock xadd' and be blazing fast.
ReplyDelete
Replies
Anonymous14 September 2011 at 10:04
This is of course very interesting to know, but how many real life applications have such big loops?
Splitting a small loop will degrade performance. Also it makes a program less readable.
ReplyDelete
Replies
Martin Thompson14 September 2011 at 10:56
Tadzys,

I totally agree that you should not split loops or change code from the ideal model unless you absolutely need to for non-functional reasons. My post on modelling makes this point.

http://mechanical-sympathy.blogspot.com/2011/09/modelling-is-everything.html

What I'm hopefully achieving is an increased awareness of what is possible if people need to fix performance issues that cannot be addressed by model changes.
ReplyDelete
Replies
shrini100019 April 2012 at 14:41
Great post, but this applies only to 'real' hardware, right? I got different results when I ran them on a VM.
ReplyDelete
Replies
Anonymous7 June 2012 at 06:19
Hi Martin,
I found this blog very very interesting,I have a question, write-combines happen only when there is a cache miss at L1 or else write-backs happen? If yes, to utilize the optimization of write-combines how do we make sure/guarantee there is a cache miss? And any ideas about using write-throughs ? please correct me if I'm completely wrong.

thanks
ReplyDelete
Replies
smirnon19 March 2013 at 10:56
Yes access is aligned. Sorry, but could you pls elaborate more on this MOVNTDQ instruction? The __iowrite64_copy routine in the Linux kernel is s'pposed to help in the combining ,if my guess on what you are referring to is correct?
ReplyDelete
Replies
Prabath4 September 2013 at 04:19
Hi Martin,
Thanks for this really useful article. Could you please clarify two things for me,
Is there a reason for assigning arrays in a reverse order and I don't get how separating the assigning part into two loops helps on using wc store buffers efficiently. I mean how does the splitting helps in flushing the buffers ?

ReplyDelete
Replies
Carlos Curotto30 December 2013 at 18:24
Hi Martin, thanks a lot for this blog:

I am running this program in my MacBook Pro which has the following specs:

Processor Name: Intel Core i7
Processor Speed: 2.9 GHz
Number of Processors: 1
Total Number of Cores: 2
L2 Cache (per Core): 256 KB
L3 Cache: 4 MB
Memory: 8 GB

These are the results i am getting:

1 SingleLoop duration (ns) = 5051671000
1 SplitLoop duration (ns) = 6574749000
2 SingleLoop duration (ns) = 4806397000
2 SplitLoop duration (ns) = 5931679000
3 SingleLoop duration (ns) = 4786564000
3 SplitLoop duration (ns) = 5521178000
result = 21

I am seeing that the single loop is actually faster than the split loop, how can this be possible?

Thanks a lot,
Carlos.
ReplyDelete
Replies
Unknown13 January 2014 at 13:31
Hi Martin,

great post; a lot of useful information can be find on this blog.

I have a question related to write combining process. all information that have to be written in memory in case of a cache miss(L1 or L2) will be grouped and write only when WC buffer is fill up with data, this bring us an important improvement of latency, beside to write each change in a cache line. What is not clear to me in the previous code when you write only three array elements inside of while loop, when WC buffer will be fill with information it will be write to memory, right(after each loop we write to WC buffer 3 bytes and from what I know the size of this buffer is ~ 32 bytes, this mean only after ~ ten loops will be filled buffer with data, if my logic is correct)? if yes what is the difference when you loop against all six array elements in the same loop ?
I misunderstand something ?
ReplyDelete
Replies
Pranas3 June 2014 at 10:08
It is impressive to see your!
I can see ~10% improvement i7-2630QM Win7 x64. As you mentioned thread may compete with each other, but anyway there is some benefit in applying the technique even on threaded Intel CPU.
ReplyDelete
Replies
Kin Cheung14 September 2015 at 09:53
thank you so much for the information. this blog is very interesting in a geeky way.

i have read a post somewhere saying that write combining buffer is not worth for application programmers looking into anymore and also it has been renamed/replaced with fill buffer to reflect its change of function.

also I have tested it on my i7 2640M laptop, I don't see any performance gain either, rather a performance degradation in split loop case tests.
ReplyDelete
Replies
Alexey Pakseykin17 November 2015 at 16:00
Hmm... I compared C++ version with Java (and see 10% C++ improvement in the best/3-arrays version).

However both cases in Java are almost the same - no performance hit.
Does JIT optimize away the difference?

Platform: Fedor 23 x86_64, Intel® Core™ i5-4460, openjdk-1.8.0.65-3.b17, Java, gcc-c++-5.1.1
C++ source: https://gist.github.com/uvsmtid/52caa3f2cfab287b2b80
ReplyDelete
Replies
Alexey Pakseykin17 November 2015 at 16:23
Hmm... I wrote a C++ version which demonstrates ~2.5x improvement for split loop:
1 SingleLoop duration (ns) = 12139922244
1 SplitLoop duration (ns) = 4732561921
2 SingleLoop duration (ns) = 12129320126
2 SplitLoop duration (ns) = 4777225938
3 SingleLoop duration (ns) = 12126297712
3 SplitLoop duration (ns) = 4716507099
result = 21

However, Java performs well in both cases (note only ~10% performance hit by Java compared to the best case in C++):
1 SingleLoop duration (ns) = 5311133217
1 SplitLoop duration (ns) = 5054977738
2 SingleLoop duration (ns) = 5090976210
2 SplitLoop duration (ns) = 5276584630
3 SingleLoop duration (ns) = 5219806807
3 SplitLoop duration (ns) = 5931649956
result = 21

Does modern JIT smartly compile away the difference?

Platform: Fedora 23 x86_64, Intel Core i5-4460, 8GB, openjdk-1.8.0.65
C++ sources: https://gist.github.com/uvsmtid/52caa3f2cfab287b2b80
ReplyDelete
Replies
Unknown31 March 2018 at 22:25
Here is slightly generalized C version with some results:
https://github.com/artpol84/poc/tree/master/benchmarks/write_combine
ReplyDelete
Replies
Unknown31 March 2018 at 22:25
Here is slightly generalized C-based version:
https://github.com/artpol84/poc/tree/master/benchmarks/write_combine
ReplyDelete
Replies
Unknown26 February 2019 at 06:49
I'm pretty sure this explanation for the observed performance isn't right.

"write combining" to write-back memory happens either inside the store buffer (for back-to-back stores to the same line), or by L1d itself *being* the buffer: a line stays hot in Modified state while multiple stores commit to it, so it only needs to be written back once.

The performance effect you're seeing (and which Intel's optimization manual recommends avoiding by splitting loops with more than 4 output streams) is more likely from conflict misses when lines are evicted from L1d while there are still pending writes to them. How can this happen? L1d is 8-way associative.

But L1d replacement is only pseudo-LRU. True LRU takes significantly more bits per set to track to LRU state of 8 ways, so my understanding is that pseudo-LRU is common.

---

In any case, you seem to be talking about the LFBs (Line Fill Buffers). Nehalem has 10 of them, same as later CPUs.

LFBs (instead of L1d lines) are used for write-combining of NT stores, or I think stores to WC memory. That's only because they have to bypass cache.

That's where the limit of 4 maybe comes in, although I Nehalem can use all 10 of its LFBs as WC buffers like SnB-family CPUs can. Still, they're also needed for incoming lines and regular write-back to L2, so unless NT stores are *all* you're doing, it's definitely best to do all the stores for a single line back-to-back in the right order.

But your microbench doesn't do anything but normal stores.

So the mechanism you're proposing as the cause for this effect just doesn't make sense.

See also discussion on Stack Overflow: https://stackoverflow.com/questions/53435632/are-write-combining-buffers-used-for-normal-writes-to-wb-memory-regions-on-intel#comment96160262_53435632 The actual question is asking about this, and the answer is a similar microbenchmark. The conclusions are questionable, but the performance counter results are maybe interesting. (Still, neither BeeOnRope nor I are convinced that it's actually demonstrating use of LFBs for write-combining of normal stores.)

And in any case, that's about combining in an LFB while waiting for a cache line to arrive. You're talking about somehow combining something before/during write-back from L1 or L2 to L3. That just makes no sense; there's nothing to combine with, it's already a full line write-back.
ReplyDelete
Replies
jumar13 May 2019 at 08:28
Just ran this on my machine and got following results when running the loops 10 times:
1 SingleLoop duration (ns) = 7074676009
1 SplitLoop duration (ns) = 4179656857
2 SingleLoop duration (ns) = 6974691458
2 SplitLoop duration (ns) = 4243834696
3 SingleLoop duration (ns) = 5057173801
3 SplitLoop duration (ns) = 4281710759
4 SingleLoop duration (ns) = 5053223285
4 SplitLoop duration (ns) = 3952401242
5 SingleLoop duration (ns) = 4739710461
5 SplitLoop duration (ns) = 4188487184
6 SingleLoop duration (ns) = 4761019124
6 SplitLoop duration (ns) = 4219472213
7 SingleLoop duration (ns) = 5078802967
7 SplitLoop duration (ns) = 3970636511
8 SingleLoop duration (ns) = 4778556539
8 SplitLoop duration (ns) = 4002392222
9 SingleLoop duration (ns) = 4764734738
9 SplitLoop duration (ns) = 3940427992
10 SingleLoop duration (ns) = 4931735291
10 SplitLoop duration (ns) = 3963758487

I'm wondering why is the difference much smaller starting from the 3rd loop.

Peter Cordes' comment is interesting and perhaps explains this somehow but I'm not sure I fully understand it O:-).

PS: Running this on recent Mac Book pro with Intel core i7 2.6 Ghz 6 cores, 32 GB ram and JDK 12
ReplyDelete
Replies

Add comment