Friday, 15 July 2011

Write Combining

Modern CPUs employ lots of techniques to counteract the latency cost of going to main memory.  These days CPUs can process hundreds of instructions in the time it takes to read or write data to the DRAM memory banks. 

The major tool used to hide this latency is multiple layers of SRAM cache.  In addition, SMP systems employ message passing protocols to achieve coherence between caches.  Unfortunately CPUs are now so fast that even these caches cannot keep up at times.  So to further hide this latency a number of less well known buffers are used. 

This article explores “write combining buffers” and how we can write code that uses them effectively.

CPU caches are effectively unchained hash maps where each bucket is typically 64 bytes.  This is known as a “cache line”.  The cache line is the effective unit of memory transfer.  For example, an address A in main memory hashes to a given cache line C.

If a CPU needs to work with an address which hashes to a line that is not already in cache, then the existing line that matches that hash needs to be evicted so the new line can take its place.  For example if we have two addresses which both map via the hashing algorithm to the same cache line, then the old one must make way for the new cache line.
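
To make that mapping concrete, here is a minimal sketch (illustration only: 64 bytes is a typical line size, the set count is picked purely for the example, and real hardware derives the index directly from the address bits rather than calling a method):

public final class CacheLineMapping
{
    private static final int LINE_SIZE = 64;        // typical cache line size in bytes
    private static final int NUMBER_OF_SETS = 512;  // assumed set count, purely for illustration

    // Start address of the cache line containing this address
    public static long lineAddress(final long address)
    {
        return address & ~(long)(LINE_SIZE - 1);
    }

    // The "bucket" (cache set) the line maps to: line number modulo the number of sets
    public static int setIndex(final long address)
    {
        return (int)((address / LINE_SIZE) % NUMBER_OF_SETS);
    }
}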

When a CPU executes a store operation it will try to write the data to the L1 cache nearest to the CPU.  If a cache miss occurs at this stage the CPU goes out to the next layer of cache.  At this point, on Intel and many other CPUs, a technique known as “write combining” comes into play.

While the request for ownership of the L2 cache line is outstanding, the data to be stored is written to one of a number of cache-line-sized buffers on the processor itself, known as line fill buffers on Intel CPUs.  These on-chip buffers allow the CPU to continue processing instructions while the cache sub-system gets ready to receive and process the data.  The biggest advantage comes when the data is not present in any of the other cache layers.

These buffers become very interesting when subsequent writes happen to require the same cache line.  The subsequent writes can be combined into the buffer before it is committed to the L2 cache.  These 64-byte buffers maintain a 64-bit field with the corresponding bit set for each byte that has been updated, indicating which data is valid when the buffer is transferred to the outer caches.
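
To picture what one of these buffers holds, here is a toy model (the class and field names are mine for illustration; the real buffers are pure hardware with no software interface):

public final class WriteCombiningBufferModel
{
    private static final long NO_LINE = -1L;

    private final byte[] data = new byte[64]; // one cache line of pending data
    private long validMask = 0L;              // bit n set => byte n holds valid data
    private long lineAddress = NO_LINE;       // the cache line this buffer is combining for

    public boolean tryWrite(final long address, final byte value)
    {
        final long line = address & ~63L;
        if (lineAddress != NO_LINE && lineAddress != line)
        {
            return false; // a different cache line: another buffer would be needed
        }

        lineAddress = line;
        final int offset = (int)(address & 63L);
        data[offset] = value;
        validMask |= 1L << offset; // record which byte is now valid

        return true;
    }

    public boolean isFullLine()
    {
        return validMask == -1L; // all 64 bits set: the whole line can be transferred
    }
}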

Hang on, I hear you say.  What happens if the program wants to read some of the data that has been written to a buffer?  Well, our hardware friends have thought of that and they will snoop the buffers before they read the caches.

What does all this mean for our programs?

If we can fill these buffers before they are transferred to the outer caches then we will greatly improve the effective use of the transfer bus at every level.  How do we do this?  Well programs spend most of their time in loops doing work. 

There are a limited number of these buffers, and the number differs by CPU model.  For example, on an Intel CPU you are only guaranteed to get 4 of them at one time.  What this means is that within a loop you should not write to more than 4 distinct memory locations at one time, or you will not benefit from the write combining effect.

What does this look like in code?
import static java.lang.System.out;

public final class WriteCombining
{
    private static final int ITERATIONS = Integer.MAX_VALUE;
    private static final int ITEMS = 1 << 24;
    private static final int MASK = ITEMS - 1;

    private static final byte[] arrayA = new byte[ITEMS];
    private static final byte[] arrayB = new byte[ITEMS];
    private static final byte[] arrayC = new byte[ITEMS];
    private static final byte[] arrayD = new byte[ITEMS];
    private static final byte[] arrayE = new byte[ITEMS];
    private static final byte[] arrayF = new byte[ITEMS];

    public static void main(final String[] args)
    {
        for (int i = 1; i <= 3; i++)
        {
            out.println(i + " SingleLoop duration (ns) = " + runCaseOne());
            out.println(i + " SplitLoop  duration (ns) = " + runCaseTwo());
        }

        int result = arrayA[1] + arrayB[2] + arrayC[3] +
                     arrayD[4] + arrayE[5] + arrayF[6];
        out.println("result = " + result);
    }

    public static long runCaseOne()
    {
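        // Case one: every iteration writes to 6 distinct arrays, i.e. 6 different
        // cache lines - more concurrent write streams than the 4 write combining
        // buffers can hold at once.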
        long start = System.nanoTime();

        int i = ITERATIONS;
        while (--i != 0)
        {
            int slot = i & MASK;
            byte b = (byte)i;
            arrayA[slot] = b;
            arrayB[slot] = b;
            arrayC[slot] = b;
            arrayD[slot] = b;
            arrayE[slot] = b;
            arrayF[slot] = b;
        }

        return System.nanoTime() - start;
    }

    public static long runCaseTwo()
    {
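        // Case two: the same writes split into two passes of 3 arrays each, so
        // each pass stays within the 4 available write combining buffers.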
        long start = System.nanoTime();

        int i = ITERATIONS;
        while (--i != 0)
        {
            int slot = i & MASK;
            byte b = (byte)i;
            arrayA[slot] = b;
            arrayB[slot] = b;
            arrayC[slot] = b;
        }

        i = ITERATIONS;
        while (--i != 0)
        {
            int slot = i & MASK;
            byte b = (byte)i;
            arrayD[slot] = b;
            arrayE[slot] = b;
            arrayF[slot] = b;
        }

        return System.nanoTime() - start;
    }
}
This program on my Windows 7 64-bit Intel Core i7 860 @ 2.8 GHz system produces the following output:

1 SingleLoop duration (ns) = 14019753545
1 SplitLoop  duration (ns) = 8972368661
2 SingleLoop duration (ns) = 14162455066
2 SplitLoop  duration (ns) = 8887610558
3 SingleLoop duration (ns) = 13800914725
3 SplitLoop  duration (ns) = 7271752889


To spell it out, if we write to 6 array locations (memory addresses) inside one loop we see that the program takes significantly longer than if we split the work up, and write first to 3 array locations, then to the other 3 locations sequentially.

By splitting the loop we do more work yet the program completes in much less time!  Welcome to the magic of “write combining”.  By using our knowledge of the CPU architecture to fill those buffers properly we can use the underlying hardware to accelerate our code by a factor of two.

Don’t forget that with hyper-threading you can have 2 threads in competition for these buffers on the same core.

49 comments:

  1. Interesting, I had never heard about this mechanism.

    I imagine this must be a source of memory reordering?
    I assume a fence in the loop must kill performance pretty badly (i.e. if one of the fields is declared volatile, for instance).

    Writing to memory is slow so let's write locally and publish later... it reminds me of the optimisation we were talking about for the Disruptor's batch consumer: we keep incrementing the consumer's sequence locally and publish to the producer only when we think it's required.

  2. One of my next posts is going to be on why memory barriers are important to memory ordering and their impact on performance :-) This is something I considered a lot when writing the Disruptor (http://code.google.com/p/disruptor/)

  3. Wow, that's definitely a new one to me. I didn't believe the results until I ran the test myself. ;) I've heard of write combining, but didn't realize what it was, and how to exploit it. Thanks Martin!

  4. very interesting. i took the liberty of porting this to c and i see a 3x improvement between 8 writes versus 2 sets of 4 writes on my setup (fedora12/64bit/Intel Core 2 Quad 8200) - https://gist.github.com/1086581

  5. Very refreshing to see someone thinking/doing this in a high level lang like Java.

  6. What factor improvement do you see in the 6/3 case in C? It would be interesting to compare the array bounds checking cost of Java over C for this. I removed the last version because I hit return too soon :-)

  7. i see a 3x improvement doing 3 v 6. goes up to 7x for 4 v 8. am using -O3 compiler flag with gcc. i'm no c expert so may be doing something wrong. latest code is here if you want to test: https://gist.github.com/1086581

  8. on another note, have been thinking about doing an implementation of the disruptor in c/c++. i really like the idea of having all the business logic on a single thread with a large heap using as much physical RAM as possible. just doing that alone takes a lot of the pain out of coding in c/c++ and would make the system pretty easy to test and maintain, even in a low level language. would be interested if you have any thoughts on that... keep up the great work btw!

  9. on the same machine, the java code above runs about 5-6% slower than the c code and i see a 2x performance improvement in the split loop (3 v 6)...

  10. Doing a Disruptor in C/C++ is something I've been considering for a while. In the multi-producer scenario on x86 we could take advantage of the "lock xadd" instruction rather than a CAS. I like to think we have proven Java can give great performance. With C/C++/asm we can go further especially with being cache friendly and avoiding megamorphic method calls. The Disruptor is now so fast there is little real application benefit in making it faster. My choice would come down to what language the rest of the application is written in and to focus there. If I don't do one soon I'd be happy to help another port of the Disruptor.
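
    For anyone more at home in Java than x86 assembler, here is a rough sketch of the two claim styles being discussed (the class and method names are made up for illustration; whether incrementAndGet() actually compiles down to a lock xadd depends on the JVM version and CPU):

    import java.util.concurrent.atomic.AtomicLong;

    public final class ClaimStrategies
    {
        private static final AtomicLong CURSOR = new AtomicLong();

        // CAS-based claim: a retry loop, where every failed attempt costs another round trip
        public static long claimWithCas()
        {
            long current;
            do
            {
                current = CURSOR.get();
            }
            while (!CURSOR.compareAndSet(current, current + 1));

            return current + 1;
        }

        // Fetch-and-add style claim: a single atomic add with no retry loop,
        // which is what "lock xadd" provides at the instruction level
        public static long claimWithAdd()
        {
            return CURSOR.incrementAndGet();
        }
    }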

  11. i've done some tests here and using lock addq comes out 20% faster than lock xadd. this is the asm that gcc __sync_fetch_and_add generates:

    0000000000400940 :
    400940: 48 83 ec 08 sub $0x8,%rsp
    400944: 8b 07 mov (%rdi),%eax
    400946: c1 e0 14 shl $0x14,%eax
    400949: 48 98 cltq
    40094b: 48 85 c0 test %rax,%rax
    40094e: 74 0f je 40095f
    400950: f0 48 83 05 5f 0a 20 lock addq $0x1,0x200a5f(%rip) # 6013b8
    400957: 00 01
    400959: 48 83 e8 01 sub $0x1,%rax
    40095d: 75 f1 jne 400950
    40095f: 31 ff xor %edi,%edi
    400961: e8 fa fc ff ff callq 400660
    400966: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
    40096d: 00 00 00

  12. Another interesting alternative to CAS :-) I'm going to have to brush up my x86-64.

  13. How does the addq code deal with threads racing to do the update? The lock makes it atomic but I don't see how it handles two threads loading the same value before the add then doing the add.

  14. Is the asm above the calling of __sync_add_and_fetch() in a loop? I get a very different dump when I try it?

  15. yes it's in a loop. this is the code i took the objdump from: https://gist.github.com/1091224. i've tested with 4 threads contending on the same counter and haven't seen any issues on my setup. will the fact it's in a loop make a difference to the locking?

  16. OK that makes more sense now. The ASM above was for more than just the __sync_fetch_and_add(). I want to do some tests this morning to confirm a few things. I believe the code is correct but not sure if it is suitable for the multi-producer sequence claim in the Disruptor. I'll post my findings later.

  17. cool. it looks to me like the __sync_fetch_and_add() just gets turned into a lock/addq instruction in the assembly.

  18. On further investigation I've discovered that GCC uses the lock addq instruction when *only* a single thread accesses the variable and lock xadd when 2 or more threads access the variable like below.

    40068e: f0 48 0f c1 15 a9 09 lock xadd %rdx,0x2009a9(%rip) # 601040

    This is GCC being clever and optimising in the single threaded case.

  19. It is even simpler than the multiple-threads explanation. If the returned value is assigned to a variable then lock xadd is used.

    #include <stdio.h>

    int main (int argc, char *argv[])
    {
        unsigned long value = 0;

        value = __sync_add_and_fetch(&value, 1);

        printf("main value = %ld\n", value);
    }

    00000000004004f4 :
    4004f4: 55 push %rbp
    4004f5: 48 89 e5 mov %rsp,%rbp
    4004f8: 48 83 ec 20 sub $0x20,%rsp
    4004fc: 89 7d ec mov %edi,-0x14(%rbp)
    4004ff: 48 89 75 e0 mov %rsi,-0x20(%rbp)
    400503: 48 c7 45 f8 00 00 00 movq $0x0,-0x8(%rbp)
    40050a: 00
    40050b: 48 8d 55 f8 lea -0x8(%rbp),%rdx
    40050f: b9 01 00 00 00 mov $0x1,%ecx
    400514: 48 89 c8 mov %rcx,%rax
    400517: f0 48 0f c1 02 lock xadd %rax,(%rdx)
    40051c: 48 01 c8 add %rcx,%rax
    40051f: 48 89 45 f8 mov %rax,-0x8(%rbp)
    400523: 48 8b 55 f8 mov -0x8(%rbp),%rdx
    400527: b8 2c 06 40 00 mov $0x40062c,%eax
    40052c: 48 89 d6 mov %rdx,%rsi
    40052f: 48 89 c7 mov %rax,%rdi
    400532: b8 00 00 00 00 mov $0x0,%eax
    400537: e8 b4 fe ff ff callq 4003f0
    40053c: c9 leaveq
    40053d: c3 retq
    40053e: 90 nop
    40053f: 90 nop

  20. I found this blog and think it's brilliant. One thing I came across that might solve your problem of avoiding a write barrier in the add case was AtomicLong.weakCompareAndSet.

    Sadly, this just does compareAndSet under the covers which is very unfortunate. I think that if they added a weakAddAndGet then it could just use 'lock xadd' and be blazing fast.

  21. This is of course very interesting to know, but how many real life applications have such big loops?
    Splitting a small loop will degrade performance. Also it makes a program less readable.

  22. Tadzys,

    I totally agree that you should not split loops or change code from the ideal model unless you absolutely need to for non-functional reasons. My post on modelling makes this point.

    http://mechanical-sympathy.blogspot.com/2011/09/modelling-is-everything.html

    What I'm hopefully achieving is an increased awareness of what is possible if people need to fix performance issues that cannot be addressed by model changes.

  23. Great post, but this applies only to 'real' hardware, right? I got different results when I ran them on a VM.

    Replies
    1. You have 4 write-combining buffers available per processor core. When running on a VM you can have issues with who else is sharing your processor core. This can also be an issue when running natively, but it is less likely to happen.

  24. Hi Martin,
    I found this blog very, very interesting. I have a question: do write-combines happen only when there is a cache miss at L1, and otherwise write-backs happen? If yes, to utilize the optimization of write-combines how do we make sure/guarantee there is a cache miss? And any ideas about using write-throughs? Please correct me if I'm completely wrong.

    thanks

    Replies
    1. Sorry for the slow response; this got lost in my inbox. Write combines happen for a combined cache miss on L1 and L2. If you follow the code above, the arrays are sufficiently large that they do not fit in the combined L1 & L2 caches. L2 is not inclusive or exclusive with L1 on Nehalem onwards. Think of L2 as a staging area between L1 and L3 to reduce each core beating on the L3. The L1 and L2 for each core are inclusive in the L3.

      If an existing line in the L1 and L2 combination needs to be evicted to L3 then it may only need to be written to L3, and is thus not always a write-back to main memory.

      From Java you have no control over the type of memory such as write-back, write-through, or write-combining, etc. For Java everything is write-back and all goes via the cache. Write combining as discussed here is different from enabling the write-combining memory type, which requires ASM.

    2. Martin,
      Great post!! I want to know how this works in the case of kernel drivers, however; would you happen to know that? That is, I have a BAR region on my adapter that I can map in WC mode using ioremap_wc() in the Linux kernel, and then use a routine like __iowrite64_copy() to copy the data onto this mapped area. However this does not always do the combining for me!

      Is there a possibility that if the system is idle, it just sends out 64-bit/8-byte writes as it receives them instead of combining them into 1 big 64-byte write?

    3. My experience of kernel drivers is very dated now. It sounds like you want to set the memory type to WC. This blog refers to how the WC buffers are used with write-back memory. I'd need to understand the issue you are seeing much better before I could give advice.

      How do you know the writes to the same cache line are not being combined?

    4. Thanks for your reply. Yes, so I copy 64 bytes to a particular memory location on this WC mapped BAR/area. Ideally these should have gone out on the PCIe bus as just one big 64-byte PCIe TLP/packet, but I see 8 packets of 64 bits each going out, which indicates combining hasn't kicked in.

    5. I believe you need to use MOVNTDQ to have streaming writes to WC memory and get them combined. Also are you ensuring aligned access?

  25. Yes, access is aligned. Sorry, but could you please elaborate more on this MOVNTDQ instruction? The __iowrite64_copy routine in the Linux kernel is supposed to help with the combining, if my guess on what you are referring to is correct?

    Replies
    1. I cannot speak for how the __iowrite64_copy() function is implemented. Have you looked at the generated assembler? Just Google for the MOVNTDQ instruction :-)

  26. Hi Martin,
    Thanks for this really useful article. Could you please clarify two things for me?
    Is there a reason for assigning the arrays in reverse order? And I don't get how separating the assignment into two loops helps with using the WC store buffers efficiently. I mean, how does the splitting help in flushing the buffers?

  27. Hi Martin, thanks a lot for this blog:

    I am running this program in my MacBook Pro which has the following specs:

    Processor Name: Intel Core i7
    Processor Speed: 2.9 GHz
    Number of Processors: 1
    Total Number of Cores: 2
    L2 Cache (per Core): 256 KB
    L3 Cache: 4 MB
    Memory: 8 GB

    These are the results i am getting:

    1 SingleLoop duration (ns) = 5051671000
    1 SplitLoop duration (ns) = 6574749000
    2 SingleLoop duration (ns) = 4806397000
    2 SplitLoop duration (ns) = 5931679000
    3 SingleLoop duration (ns) = 4786564000
    3 SplitLoop duration (ns) = 5521178000
    result = 21

    I am seeing that the single loop is actually faster than the split loop, how can this be possible?

    Thanks a lot,
    Carlos.

    Replies
    1. There are some issues with this test on more recent processors. I plan to redo this blog and bring it up to date.

  28. Hi Martin,

    great post; a lot of useful information can be found on this blog.

    I have a question related to the write combining process. All information that has to be written to memory in the case of a cache miss (L1 or L2) will be grouped and written only when the WC buffer is filled up with data; this brings an important latency improvement compared with writing each change to the cache line individually. What is not clear to me in the previous code: when you write only three array elements inside the while loop, the WC buffer will be written to memory once it is filled with data, right (after each loop we write 3 bytes to the WC buffer, and from what I know the size of this buffer is ~32 bytes, which means the buffer will only be filled with data after ~ten loops, if my logic is correct)? If yes, what is the difference when you write all six array elements in the same loop?
    Am I misunderstanding something?

    Replies
    1. Is it something related to the fact that you have only 4 buffers per processor core, and you can write from at most four distinct memory zones to your WC buffers?

    2. The content of the WC buffer does not wait to fill before writing. It is written to the cacheline as soon as it is available. A WC buffer is 64 bytes, i.e. the size of a cache line.

      You have only 4 WC buffers per core. Therefore you can only write to 4 distinct locations that reside in different cachelines, if those cache lines are not in the L1/L2 caches.

  29. It is impressive to see!
    I can see a ~10% improvement on an i7-2630QM, Win7 x64. As you mentioned, threads may compete with each other, but anyway there is some benefit in applying the technique even on a hyper-threaded Intel CPU.

    Replies
    1. The latest Intel CPUs now have 10 write combining buffers so the effect is much less pronounced. Other processors, such as AMD's, can have fewer.

  30. Thank you so much for the information. This blog is very interesting in a geeky way.

    I have read a post somewhere saying that the write combining buffer is not worth looking into for application programmers any more, and also that it has been renamed/replaced by the fill buffer to reflect its change of function.

    Also, I have tested it on my i7 2640M laptop; I don't see any performance gain either, rather a performance degradation in the split loop tests.

    Replies
    1. Intel processors now have 10 LF/WC buffers so this is not such an issue any longer.

  31. Hmm... I compared a C++ version with Java (and see a 10% C++ improvement in the best/3-arrays version).

    However both cases in Java are almost the same - no performance hit.
    Does JIT optimize away the difference?

    Platform: Fedora 23 x86_64, Intel® Core™ i5-4460, openjdk-1.8.0.65-3.b17, Java, gcc-c++-5.1.1
    C++ source: https://gist.github.com/uvsmtid/52caa3f2cfab287b2b80

  32. Hmm... I wrote a C++ version which demonstrates ~2.5x improvement for split loop:
    1 SingleLoop duration (ns) = 12139922244
    1 SplitLoop duration (ns) = 4732561921
    2 SingleLoop duration (ns) = 12129320126
    2 SplitLoop duration (ns) = 4777225938
    3 SingleLoop duration (ns) = 12126297712
    3 SplitLoop duration (ns) = 4716507099
    result = 21

    However, Java performs well in both cases (note only ~10% performance hit by Java compared to the best case in C++):
    1 SingleLoop duration (ns) = 5311133217
    1 SplitLoop duration (ns) = 5054977738
    2 SingleLoop duration (ns) = 5090976210
    2 SplitLoop duration (ns) = 5276584630
    3 SingleLoop duration (ns) = 5219806807
    3 SplitLoop duration (ns) = 5931649956
    result = 21


    Does modern JIT smartly compile away the difference?


    Platform: Fedora 23 x86_64, Intel Core i5-4460, 8GB, openjdk-1.8.0.65
    C++ sources: https://gist.github.com/uvsmtid/52caa3f2cfab287b2b80

  33. Here is slightly generalized C version with some results:
    https://github.com/artpol84/poc/tree/master/benchmarks/write_combine

  34. Here is slightly generalized C-based version:
    https://github.com/artpol84/poc/tree/master/benchmarks/write_combine

  35. I'm pretty sure this explanation for the observed performance isn't right.

    "write combining" to write-back memory happens either inside the store buffer (for back-to-back stores to the same line), or by L1d itself *being* the buffer: a line stays hot in Modified state while multiple stores commit to it, so it only needs to be written back once.

    The performance effect you're seeing (and which Intel's optimization manual recommends avoiding by splitting loops with more than 4 output streams) is more likely from conflict misses when lines are evicted from L1d while there are still pending writes to them. How can this happen when L1d is 8-way associative?

    But L1d replacement is only pseudo-LRU. True LRU takes significantly more bits per set to track the LRU state of 8 ways, so my understanding is that pseudo-LRU is common.

    ---

    In any case, you seem to be talking about the LFBs (Line Fill Buffers). Nehalem has 10 of them, same as later CPUs.

    LFBs (instead of L1d lines) are used for write-combining of NT stores, or I think stores to WC memory. That's only because they have to bypass cache.

    That's where the limit of 4 maybe comes in, although I think Nehalem can use all 10 of its LFBs as WC buffers like SnB-family CPUs can. Still, they're also needed for incoming lines and regular write-back to L2, so unless NT stores are *all* you're doing, it's definitely best to do all the stores for a single line back-to-back in the right order.

    But your microbench doesn't do anything but normal stores.

    So the mechanism you're proposing as the cause for this effect just doesn't make sense.

    See also discussion on Stack Overflow: https://stackoverflow.com/questions/53435632/are-write-combining-buffers-used-for-normal-writes-to-wb-memory-regions-on-intel#comment96160262_53435632 The actual question is asking about this, and the answer is a similar microbenchmark. The conclusions are questionable, but the performance counter results are maybe interesting. (Still, neither BeeOnRope nor I are convinced that it's actually demonstrating use of LFBs for write-combining of normal stores.)

    And in any case, that's about combining in an LFB while waiting for a cache line to arrive. You're talking about somehow combining something before/during write-back from L1 or L2 to L3. That just makes no sense; there's nothing to combine with, it's already a full line write-back.

    Replies
    1. This is an old blog and you are correct to point out that what is happening is not that well explained. I am referring to the Line Fill Buffers, which can be used for write combining on Intel CPUs; AMD keeps them separate. I keep meaning to revisit my blog but struggle to find the time.

  36. Just ran this on my machine and got the following results when running the loops 10 times:
    1 SingleLoop duration (ns) = 7074676009
    1 SplitLoop duration (ns) = 4179656857
    2 SingleLoop duration (ns) = 6974691458
    2 SplitLoop duration (ns) = 4243834696
    3 SingleLoop duration (ns) = 5057173801
    3 SplitLoop duration (ns) = 4281710759
    4 SingleLoop duration (ns) = 5053223285
    4 SplitLoop duration (ns) = 3952401242
    5 SingleLoop duration (ns) = 4739710461
    5 SplitLoop duration (ns) = 4188487184
    6 SingleLoop duration (ns) = 4761019124
    6 SplitLoop duration (ns) = 4219472213
    7 SingleLoop duration (ns) = 5078802967
    7 SplitLoop duration (ns) = 3970636511
    8 SingleLoop duration (ns) = 4778556539
    8 SplitLoop duration (ns) = 4002392222
    9 SingleLoop duration (ns) = 4764734738
    9 SplitLoop duration (ns) = 3940427992
    10 SingleLoop duration (ns) = 4931735291
    10 SplitLoop duration (ns) = 3963758487

    I'm wondering why the difference is much smaller starting from the 3rd loop.

    Peter Cordes' comment is interesting and perhaps explains this somehow but I'm not sure I fully understand it O:-).

    PS: Running this on a recent MacBook Pro with an Intel Core i7 @ 2.6 GHz, 6 cores, 32 GB RAM and JDK 12.
