Mechanical Sympathy: Processor Affinity

When trying to build a highly scalable system the single biggest limitation on scalability is having multiple writers contend for any item of data or resource. Sure, algorithms can be bad, but let’s assume they have a reasonable Big O notation so we'll focus on the scalability limitations of the systems design.

I keep seeing people just accept having multiple writers as the norm. There is a lot of research in computer science for managing this contention that boils down to 2 basic approaches. One is to provide mutual exclusion to the contended resource while the mutation takes place; the other is to take an optimistic strategy and swap in the changes if the underlying resource has not changed while you created the new copy.

Mutual Exclusion

Mutual exclusion is the means by which only one writer can have access to a protected resource at a time, and is usually implemented with a locking strategy. Locking strategies require an arbitrator, usually the operating system kernel, to get involved when the contention occurs to decide who gains access and in what order. This can be a very expensive process often requiring many more CPU cycles than the actual transaction to be applied to the business logic would use. Those waiting to enter the critical section, in advance of performing the mutation must queue, and this queuing effect (Little's Law) causes latency to become unpredictable and ultimately restricts throughput.

Optimistic Concurrency Control

Optimistic strategies involve taking a copy of the data, modifying it, then copying back the changes if data has not mutated in the meantime. If a change has happened in the meantime you repeat the process until successful. This repeating of the process increases with contention and therefore causes a queuing effect just like with mutual exclusion. If you work with a source code control system, such as Subversion or CVS, then you are using this algorithm every day. Optimistic strategies can work with data but do not work so well with resources such as hardware because you cannot take a copy of the hardware! The ability to perform the changes atomically to data is made possible by CAS instructions offered by the hardware.

Most locking strategies are composed from optimistic strategies for changing the lock state or mutual exclusion primitive.

Managing Contention vs. Doing Real Work

CPUs can typically process one or more instructions per cycle. For example, modern Intel CPU cores each have 6 execution units that can be doing a combination of arithmetic, branch logic, word manipulation and memory loads/stores in parallel. If while doing work the CPU core incurs a cache miss, and has to go to main memory, it will stall for hundreds of cycles until the result of that memory request returns. To try and improve things the CPU will make some speculative guesses as to what a memory request will return to continue processing. If a second miss occurs the CPU will no longer speculate and simply wait for the memory request to return because it cannot typically keep the state for speculative execution beyond 2 cache misses. Managing cache misses is the single largest limitation to scaling the performance of our current generation of CPUs.

Now what does this have to do with managing contention? Well if two or more threads are using locks to provide mutual exclusion, at best they will be going to the L3 cache, or over a socket interconnect, to access share state of the lock using CAS operations. These lock/CAS instructions cost 10s of cycles in the best case when un-contended, plus they cause out-of-order execution for the CPU to be suspended and load/store buffers to be flushed. At worst, collisions occur and the kernel will need to get involved and put one or more of the threads to sleep until the lock is released. This rescheduling of the blocked thread will result in cache pollution. The situation can be even worse when the thread is re-scheduled on another core with a cold cache resulting in many cache misses.

For highly contended data it is very easy to get into a situation whereby the system spends significantly more time managing contention than doing real work. The table below gives an idea of basic costs for managing contention when the program state is very small and easy to reload from the L2/L3 cache, never mind main memory.

Method	Time (ms)
One Thread	300
One Thread with Memory Barrier	4,700
One Thread with CAS	5,700
Two Threads with CAS	18,000
One Thread with Lock	10,000
Two Threads with Lock	118,000

This table illustrates the costs of incrementing a 64-bit counter 500 million times using a variety of techniques on a 2.4Ghz Westmere processor. I can hear people coming back with “but this is a trivial example and real-world applications are not that contended”. This is true but remember real-world applications have way more state, and what do you think happens to all that state which is warm in cache when the context switch occurs??? By measuring the basic cost of contention it is possible to extrapolate the scalability limits of a system which has contention points. As multi-core becomes ever more significant another approach is required. My last post illustrates the micro level effects of CAS operations on modern CPUs, whereby Sandybridge can be worse for CAS and locks.

Single Writer Designs

Now, what if you could design a system whereby any item of data, or resource, is only mutated by a single writer/thread? It is actually easier than you think in my experience. It is OK if multiple threads, or other execution contexts, read the same data. CPUs can broadcast read only copies of data to other cores via the cache coherency sub-system. This has a cost but it scales very well.

If you have a system that can honour this single writer principle then each execution context can spend all its time and resources processing the logic for its purpose, and not be wasting cycles and resource on dealing with the contention problem. You can also scale up without limitation until the hardware is saturated. There is also a really nice benefit in that when working on architectures, such as x86/x64, where at a hardware level they have a memory model, whereby load/store memory operations have preserved order, thus memory barriers are not required if you adhere strictly to the single writer principle. On x86/x64 "loads can be re-ordered with older stores" according to the memory model so memory barriers are required when multiple threads mutate the same data across cores. The single writer principle avoids this issue because it never has to deal with writing the latest version of a data item that may have been written by another thread and currently in the store buffer of another core.

So how can we drive towards single writer designs? I’ve found it is a very natural thing. Consider how humans, or any other autonomous creatures of nature, operate with their model of the world. We all have our own model of the world contained in our own heads, i.e. We have a copy of the world state for our own use. We mutate the state in our heads based on inputs (events/messages) we receive via our senses. As we process these inputs and apply them to our model we may take action that produces outputs, which others can take as their own inputs. None of us reach directly into each other’s heads and mess with the neurons. If we did this it would be a serious breach of encapsulation! Originally, Object Oriented (OO) design was all about message passing, and somehow along the way we bastardised the message passing to be method calls and even allowed direct field manipulation – Yuk! Who's bright idea was it to allow public access to fields of an object? You deserve your own special hell.

At university I studied transputers and interesting languages like Occam. I thought very elegant designs appeared by having the nodes collaborate via message passing rather than mutating shared state. I’m sure some of this has inspired the Disruptor. My experience with the Disruptor has shown that is it possible to build systems with one or more orders of magnitude better throughput than locking or contended state based approaches. It also gives much more predictable latency that stays constant until the hardware is saturated rather than the traditional J-curve latency profile.

It is interesting to see the emergence of numerous approaches that lend themselves to single writer solutions such as Node.js, Erlang, Actor patterns, and SEDA to name a few. Unfortunately most use queue based implementations underneath, which breaks the single writer principle, whereas the Disruptor strives to separate the concerns so that the single writer principle can be preserved for the common cases.

Now I’m not saying locks and optimistic strategies are bad and should not be used. They are excellent for many problems. For example, bootstrapping a concurrent system or making major state stages in configuration or reference data. However if the main flow of transactions act on contended data, and locks or optimistic strategies have to be employed, then the scalability is fundamentally limited.

The Principle at Scale

This principle works at all levels of scale. Mandelbrot got this so right. CPU cores are just nodes of execution and the cache system provides message passing for communication. The same patterns apply if the processing node is a server and the communication system is a local network. If a service, in SOA architecture parlance, is the only service that can write to its data store it can be made to scale and perform much better. Let’s say that underlying data is stored in a database and other services can go directly to that data, without sending a message to the service that owns the data, then the data is contended and requires the database to manage the contention and coherence of that data. This prevents the service from caching copies of the data for faster response to the clients and restricts how the data can be sharded. Encapsulation has just been broken at a more macro level when multiple different services write to the same data store.

Summary

If a system is decomposed into components that keep their own relevant state model, without a central shared model, and all communication is achieved via message passing then you have a system without contention naturally. This type of system obeys the single writer principle if the messaging passing sub-system is not implemented as queues. If you cannot move straight to a model like this, but are finding scalability issues related to contention, then start by asking the question, “How do I change this code to preserve the Single Writer Principle and thus avoid the contention?”

The Single Writer Principle is that for any item of data, or resource, that item of data should be owned by a single execution context for all mutations.

In a series of articles I’ll aim to show the performance impact of processor affinity in a range of use cases.

Background

A thread of execution will typically run until it has used up its quantum (aka time slice), at which point it joins the back of the run queue waiting to be re-scheduled as soon as a processor core becomes available. While running the thread will have accumulated a significant amount of state in the processor, including instructions and data in the cache. If the thread can be re-scheduled to run on the same core as last time it can benefit from all that accumulated state. A thread may equally not run to the end of its quantum because it has been pre-empted, or blocked on IO or a lock. After which, when it is ready to run again, the same holds true.

There are numerous techniques available for pinning threads to a particular core. In this article I’ll illustrate the use of the taskset command on two threads exchanging IP multicast messages via a dummy interface. I’ve chosen this as the first example because in a low-latency environment multicast is the preferred IP protocol. For simplicity, I’ve also chosen to not involve the physical network while introducing the concepts. In the next article I’ll expand on this example and the issues involving a real network.

1. Create the dummy interface

$ su -
$ modprobe dummy
$ ifconfig dummy0 172.16.1.1 netmask 255.255.255.0
$ ifconfig dummy0 multicast

2. Get the Java files (Sender and Receiver) and compile them

$ javac *.java

3. Run the tests without CPU pinning

Window 1:
$ java MultiCastReceiver 230.0.0.1 dummy0

Window 2:
$ java MultiCastSender 230.0.0.1 dummy0 20000000

4. Run the tests with CPU pinning

Window 1:
$ taskset -c 2 java MultiCastReceiver 230.0.0.1 dummy0

Window 2:
$ taskset -c 4 java MultiCastSender 230.0.0.1 dummy0 20000000

Results

The tests output once per second the number of messages they have managed to send and receive. A typically example run is charted in Figure 1 below.

Figure 1.

The interesting thing I've observed is that the unpinned test will follow a step function of unpredictable performance. Across many runs I've seen different patterns but all similar in this step function nature. For the pinned tests I get consistent throughput with no step pattern and always the greatest throughput.

This test is not particularly CPU intensive, nor does it access the physical network device, yet it shows how critical processor affinity is to not just high performance but also predictable performance. In the next article of this series I'll introduce a network hop and the issues arising from interrupt handling.

Mechanical Sympathy

Thursday, 22 September 2011

Single Writer Principle

Tuesday, 19 July 2011

Processor Affinity - Part 1