Mechanical Sympathy: Single Writer Principle

Thursday, 22 September 2011

Single Writer Principle

When trying to build a highly scalable system, the single biggest limitation on scalability is having multiple writers contend for any item of data or resource. Sure, algorithms can be bad, but let’s assume they have reasonable Big O complexity so that we can focus on the scalability limitations of the system's design.

I keep seeing people just accept having multiple writers as the norm. There is a lot of research in computer science on managing this contention, and it boils down to two basic approaches. One is to provide mutual exclusion to the contended resource while the mutation takes place; the other is to take an optimistic strategy and swap in the changes if the underlying resource has not changed while you created the new copy.

Mutual Exclusion

Mutual exclusion is the means by which only one writer can have access to a protected resource at a time, and is usually implemented with a locking strategy. Locking strategies require an arbitrator, usually the operating system kernel, to get involved when contention occurs to decide who gains access and in what order. This can be a very expensive process, often requiring many more CPU cycles than the actual transaction to be applied to the business logic would use. Those waiting to enter the critical section, in advance of performing the mutation, must queue, and this queuing effect (Little's Law) causes latency to become unpredictable and ultimately restricts throughput.
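As a minimal sketch of mutual exclusion, assume a hypothetical LockedLedger class (not from any library) whose two fields must always change together. The lock makes every writer pass through the critical section one at a time, which is exactly where the queuing happens under contention.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical example class: two fields that must be mutated together.
final class LockedLedger {
    private final ReentrantLock lock = new ReentrantLock();
    private long balance;
    private long entries;

    // Any thread may call this; contending writers queue on the lock.
    void post(long amount) {
        lock.lock();
        try {
            // Critical section: the actual mutation is tiny compared
            // with the cost of arbitrating a contended lock.
            balance += amount;
            entries++;
        } finally {
            lock.unlock();
        }
    }

    // Readers take the same lock so they never see a half-applied update.
    long[] snapshot() {
        lock.lock();
        try {
            return new long[] {balance, entries};
        } finally {
            lock.unlock();
        }
    }
}
```

The try/finally shape matters: the lock must be released even if the mutation throws, otherwise every queued writer waits forever.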

Optimistic Concurrency Control

Optimistic strategies involve taking a copy of the data, modifying it, then copying back the changes if the data has not mutated in the meantime. If a change has happened in the meantime you repeat the process until successful. This repetition increases with contention and therefore causes a queuing effect just like with mutual exclusion. If you work with a source code control system, such as Subversion or CVS, then you are using this algorithm every day. Optimistic strategies can work with data but do not work so well with resources such as hardware because you cannot take a copy of the hardware! The ability to apply the changes to data atomically is made possible by the compare-and-swap (CAS) instructions offered by the hardware.
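The copy, modify, swap-if-unchanged cycle described above can be sketched in Java with AtomicLong.compareAndSet. The OptimisticCounter class and its retry count are illustrative assumptions, added only to make the cost of contention visible.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical example class showing an optimistic (CAS retry) update.
final class OptimisticCounter {
    private final AtomicLong value = new AtomicLong();

    // Returns how many times this call had to retry because another
    // writer changed the value between our read and our swap.
    int addOptimistically(long delta) {
        int retries = 0;
        while (true) {
            long current = value.get();        // take a copy
            long updated = current + delta;    // modify the copy
            if (value.compareAndSet(current, updated)) {
                return retries;                // swap succeeded: nothing changed meanwhile
            }
            retries++;                         // another writer won; repeat the process
        }
    }

    long get() {
        return value.get();
    }
}
```

With a single writer the loop never retries; with several writers the retry count grows with contention, which is the queuing effect in another form.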

Most locking strategies are composed from optimistic strategies for changing the lock state or mutual exclusion primitive.

Managing Contention vs. Doing Real Work

CPUs can typically process one or more instructions per cycle. For example, modern Intel CPU cores each have 6 execution units that can be doing a combination of arithmetic, branch logic, word manipulation and memory loads/stores in parallel. If while doing work the CPU core incurs a cache miss, and has to go to main memory, it will stall for hundreds of cycles until the result of that memory request returns. To try and improve things the CPU will make some speculative guesses as to what a memory request will return to continue processing. If a second miss occurs the CPU will no longer speculate and simply wait for the memory request to return because it cannot typically keep the state for speculative execution beyond 2 cache misses. Managing cache misses is the single largest limitation to scaling the performance of our current generation of CPUs.

Now what does this have to do with managing contention? Well, if two or more threads are using locks to provide mutual exclusion, at best they will be going to the L3 cache, or over a socket interconnect, to access the shared state of the lock using CAS operations. These lock/CAS instructions cost tens of cycles in the best case when un-contended, plus they cause out-of-order execution for the CPU to be suspended and load/store buffers to be flushed. At worst, collisions occur and the kernel will need to get involved and put one or more of the threads to sleep until the lock is released. This rescheduling of the blocked thread will result in cache pollution. The situation can be even worse when the thread is re-scheduled on another core with a cold cache, resulting in many cache misses.

For highly contended data it is very easy to get into a situation whereby the system spends significantly more time managing contention than doing real work. The table below gives an idea of basic costs for managing contention when the program state is very small and easy to reload from the L2/L3 cache, never mind main memory.

Method	Time (ms)
One Thread	300
One Thread with Memory Barrier	4,700
One Thread with CAS	5,700
Two Threads with CAS	18,000
One Thread with Lock	10,000
Two Threads with Lock	118,000

This table illustrates the costs of incrementing a 64-bit counter 500 million times using a variety of techniques on a 2.4GHz Westmere processor. I can hear people coming back with “but this is a trivial example and real-world applications are not that contended”. This is true, but remember that real-world applications have way more state, and what do you think happens to all that state which is warm in cache when the context switch occurs? By measuring the basic cost of contention it is possible to extrapolate the scalability limits of a system which has contention points. As multi-core becomes ever more significant another approach is required. My last post illustrates the micro level effects of CAS operations on modern CPUs, whereby Sandy Bridge can be worse for CAS and locks.
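A rough harness for reproducing this kind of measurement might look like the sketch below. It is not the original benchmark: the ContentionCosts class, its method names, the scaled-down iteration count and the choice to split the work between two threads are all assumptions, and it makes no attempt at JIT warm-up or thread pinning, so treat any numbers it prints as indicative only.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical harness: times counter increments under each technique.
final class ContentionCosts {
    static long plain;                                   // no synchronisation at all
    static volatile long fenced;                         // volatile write acts as a memory barrier
    static final AtomicLong atomic = new AtomicLong();   // updated via CAS
    static long guarded;                                 // protected by `lock`
    static final Lock lock = new ReentrantLock();

    // The JIT may simplify this loop heavily, so it is only a baseline.
    static void plainLoop(long n) { for (long i = 0; i < n; i++) plain++; }

    // Only valid with one writer: volatile++ is not atomic.
    static void fencedLoop(long n) { for (long i = 0; i < n; i++) fenced++; }

    static void casLoop(long n) {
        for (long i = 0; i < n; i++) {
            long v;
            do { v = atomic.get(); } while (!atomic.compareAndSet(v, v + 1));
        }
    }

    static void lockLoop(long n) {
        for (long i = 0; i < n; i++) {
            lock.lock();
            try { guarded++; } finally { lock.unlock(); }
        }
    }

    // Runs `work` on `threads` threads and returns the elapsed milliseconds.
    static long timeMillis(int threads, Runnable work) {
        Thread[] ts = new Thread[threads];
        long start = System.nanoTime();
        for (int i = 0; i < threads; i++) { ts[i] = new Thread(work); ts[i].start(); }
        for (Thread t : ts) {
            try { t.join(); } catch (InterruptedException e) { throw new IllegalStateException(e); }
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        final long n = 50_000_000L; // scaled down; the table above used 500 million
        System.out.println("One thread (ms):            " + timeMillis(1, () -> plainLoop(n)));
        System.out.println("One thread, barrier (ms):   " + timeMillis(1, () -> fencedLoop(n)));
        System.out.println("One thread with CAS (ms):   " + timeMillis(1, () -> casLoop(n)));
        System.out.println("Two threads with CAS (ms):  " + timeMillis(2, () -> casLoop(n / 2)));
        System.out.println("One thread with lock (ms):  " + timeMillis(1, () -> lockLoop(n)));
        System.out.println("Two threads with lock (ms): " + timeMillis(2, () -> lockLoop(n / 2)));
    }
}
```

The absolute figures will differ from the table on any other machine or JVM; the thing to look for is the ratio between the uncontended and contended rows.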

Single Writer Designs

Now, what if you could design a system whereby any item of data, or resource, is only mutated by a single writer/thread? In my experience it is actually easier than you think. It is OK if multiple threads, or other execution contexts, read the same data. CPUs can broadcast read-only copies of data to other cores via the cache coherency sub-system. This has a cost but it scales very well.

If you have a system that can honour this single writer principle then each execution context can spend all its time and resources processing the logic for its purpose, and not waste cycles and resources dealing with the contention problem. You can also scale up without limitation until the hardware is saturated. There is also a really nice benefit on architectures such as x86/x64, whose hardware memory model preserves the order of load/store memory operations: memory barriers are not required if you adhere strictly to the single writer principle. On x86/x64 "loads can be re-ordered with older stores" according to the memory model, so memory barriers are required when multiple threads mutate the same data across cores. The single writer principle avoids this issue because a writer never has to deal with writing the latest version of a data item that may have been written by another thread and may currently be in the store buffer of another core.
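In Java the idea can be sketched with AtomicLong.lazySet, which performs an ordered store rather than a full volatile write and typically compiles to a plain store on x86/x64. The SingleWriterCounter class is a hypothetical illustration; its safety depends entirely on the convention that only one thread ever calls increment().

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical example: one owning thread mutates, any thread may read.
final class SingleWriterCounter {
    private final AtomicLong value = new AtomicLong();

    // Must only ever be called from the single owning thread.
    void increment() {
        // A plain read-modify-write is safe here because no other thread
        // writes, so the value we read cannot change underneath us.
        // lazySet publishes the new value with an ordered store and no CAS.
        value.lazySet(value.get() + 1);
    }

    // Safe from any thread; readers may briefly see a slightly stale
    // value but never a torn or out-of-order one.
    long get() {
        return value.get();
    }
}
```

Nothing in the type system enforces the single writer rule here. If a second thread calls increment(), updates will be silently lost, which is why ownership has to be a design-level decision.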

So how can we drive towards single writer designs? I’ve found it is a very natural thing. Consider how humans, or any other autonomous creatures of nature, operate with their model of the world. We all have our own model of the world contained in our own heads, i.e. we have a copy of the world state for our own use. We mutate the state in our heads based on inputs (events/messages) we receive via our senses. As we process these inputs and apply them to our model we may take action that produces outputs, which others can take as their own inputs. None of us reach directly into each other’s heads and mess with the neurons. If we did this it would be a serious breach of encapsulation! Originally, Object Oriented (OO) design was all about message passing, and somehow along the way we bastardised the message passing to be method calls and even allowed direct field manipulation. Yuk! Whose bright idea was it to allow public access to fields of an object? You deserve your own special hell.

At university I studied transputers and interesting languages like Occam. I thought very elegant designs appeared by having the nodes collaborate via message passing rather than mutating shared state. I’m sure some of this has inspired the Disruptor. My experience with the Disruptor has shown that it is possible to build systems with one or more orders of magnitude better throughput than locking or contended state based approaches. It also gives much more predictable latency that stays constant until the hardware is saturated, rather than the traditional J-curve latency profile.

It is interesting to see the emergence of numerous approaches that lend themselves to single writer solutions, such as Node.js, Erlang, Actor patterns, and SEDA to name a few. Unfortunately most use queue-based implementations underneath, which breaks the single writer principle, whereas the Disruptor strives to separate the concerns so that the single writer principle can be preserved for the common cases.
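To make the separation of concerns concrete, here is a minimal sketch of a ring buffer restricted to one producer thread and one consumer thread, so that every field has exactly one writer. This is not the Disruptor's implementation or API; the SpscRing class and its method names are assumptions chosen for illustration.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical single-producer/single-consumer ring buffer.
final class SpscRing<E> {
    private final Object[] slots;
    private final int mask;
    private final AtomicLong tail = new AtomicLong(); // written only by the producer
    private final AtomicLong head = new AtomicLong(); // written only by the consumer

    SpscRing(int capacity) {
        if (capacity <= 0 || Integer.bitCount(capacity) != 1) {
            throw new IllegalArgumentException("capacity must be a power of two");
        }
        slots = new Object[capacity];
        mask = capacity - 1;
    }

    // Producer thread only. Returns false when the buffer is full.
    boolean offer(E e) {
        long t = tail.get();
        if (t - head.get() == slots.length) {
            return false;
        }
        slots[(int) (t & mask)] = e;
        tail.lazySet(t + 1); // ordered store: the slot write becomes visible first
        return true;
    }

    // Consumer thread only. Returns null when the buffer is empty.
    @SuppressWarnings("unchecked")
    E poll() {
        long h = head.get();
        if (h == tail.get()) {
            return null;
        }
        int i = (int) (h & mask);
        E e = (E) slots[i];
        slots[i] = null;     // let the element be garbage collected
        head.lazySet(h + 1); // hand the slot back to the producer
        return e;
    }
}
```

Each side reads the other's sequence but writes only its own, so no lock or CAS is needed. A general-purpose queue cannot make that assumption, which is why implementations that accept multiple producers or consumers fall back on locks or CAS.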

Now I’m not saying locks and optimistic strategies are bad and should not be used. They are excellent for many problems, for example bootstrapping a concurrent system or making major state changes in configuration or reference data. However, if the main flow of transactions acts on contended data, and locks or optimistic strategies have to be employed, then the scalability is fundamentally limited.

The Principle at Scale

This principle works at all levels of scale. Mandelbrot got this so right. CPU cores are just nodes of execution and the cache system provides message passing for communication. The same patterns apply if the processing node is a server and the communication system is a local network. If a service, in SOA parlance, is the only service that can write to its data store, it can be made to scale and perform much better. Suppose instead that the underlying data is stored in a database and other services can go directly to that data without sending a message to the service that owns it. The data is then contended, and the database has to manage the contention and coherence of that data. This prevents the service from caching copies of the data for faster response to the clients and restricts how the data can be sharded. Encapsulation has just been broken at a more macro level when multiple different services write to the same data store.
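At the service level the same ownership rule can be sketched as below. AccountService is a hypothetical example, not taken from any framework: one thread owns the data, and every other party sends it a request instead of touching the data directly. Note that the single-thread executor used here hands work over through an internal queue that its callers contend on, so this illustrates the ownership of the data and not a contention-free transport.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical service: the only code that may read or write `balances`.
final class AccountService implements AutoCloseable {
    // A plain HashMap is enough because only the owner thread touches it.
    private final Map<String, Long> balances = new HashMap<>();
    private final ExecutorService owner = Executors.newSingleThreadExecutor();

    // Callers send a message; the mutation runs on the owner thread.
    CompletableFuture<Long> credit(String account, long amount) {
        return CompletableFuture.supplyAsync(
                () -> balances.merge(account, amount, Long::sum), owner);
    }

    // Reads also go through the owner, so they see a consistent state.
    CompletableFuture<Long> balance(String account) {
        return CompletableFuture.supplyAsync(
                () -> balances.getOrDefault(account, 0L), owner);
    }

    @Override
    public void close() {
        owner.shutdown();
    }
}
```

Because the service is the sole writer, it is free to cache, batch or shard its data however it likes without coordinating with anyone else.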

Summary

If a system is decomposed into components that keep their own relevant state model, without a central shared model, and all communication is achieved via message passing, then you naturally have a system without contention. This type of system obeys the single writer principle if the message passing sub-system is not implemented as queues. If you cannot move straight to a model like this, but are finding scalability issues related to contention, then start by asking the question, “How do I change this code to preserve the Single Writer Principle and thus avoid the contention?”

The Single Writer Principle is that for any item of data, or resource, that item of data should be owned by a single execution context for all mutations.

61 comments:

Manish, 28 September 2011 at 03:57
Thanks Martin for the excellent post(s). We are working in the FX algo domain, and you can imagine how important latency is for us.

I have learnt a lot of things from your blog.

Thanks
Manish

Edward Yavno, 11 October 2011 at 04:08
Thanks for sharing and writing this up.

However, I don't see why message passing implemented as a queue violates single writer principle.

Less predictable - yes. Incompatible - I don't see why?

Martin Thompson, 12 October 2011 at 11:51
Edward,

About 30-35mins into the following presentation I touch on this subject for queues.

http://www.infoq.com/presentations/LMAX

The summary is that the head, tail, and possibly size, have to be concurrently modified. The act of having to add something to a queue is a write operation, as is the act of removing something. Therefore you have multiple writers thus breaking the principle. This is why you need locks or CAS operations for the implementation. Just look at the source of ArrayBlockingQueue or ConcurrentLinkedQueue.

Edward Yavno, 14 October 2011 at 15:23
Hi Martin,

I'm not disputing additional overhead when using queues.

I'm just pointing out that since you're defining the "Single Writer Principle" in this post (which I like), you should probably elaborate on why it should not use queues. Is it even part of the "Single Writer Principle" or just a recommended design? You're alluding to it, but there's no clear explanation.

I actually think the Single Writer Principle has its place even with queues, especially in a distributed system where it eliminates the contention on shared state that arises when multiple writers update it. The (relatively) minor overhead of queuing may be an acceptable trade-off in place of, say, distributed locking on a distributed cache.

- Ed Y.

Martin Thompson, 15 October 2011 at 15:24
Edward,

Sorry if I was unclear here. As you point out, a design that uses queues does not break the single writer principle. What I'm trying to say is that the queue implementations themselves break the single writer principle internally. If, for example, you use the Disruptor instead of queues then you can have a full design that avoids having multiple writers to any resource :-)

Anonymous, 28 October 2011 at 14:40
Can you share your experience with writing performance tests?

Thank you.

Martin Thompson, 28 October 2011 at 16:14
Siryc,

Is there something specific you'd like to know about performance testing or just my general approach?

J Chris A, 10 November 2011 at 09:04
Erlang in general is a very good example of this at the memory level. Also, the CouchDB append-only B-tree is a classic example at the data level.

Monster, 10 November 2011 at 12:04
Just saw your entry in HackerNews.

While I get your points and agree with what you say (which is no news to me), I have no idea *how* you can have an actor/thread receive messages from multiple other ones without a queue. If you have no shared state, you can scale infinitely; no problem here. If you do have a shared state, and a single writer to it, then you need a funnel architecture where N front-ends (because you don't want a single-point-of-failure front-end, and you can't call it "scalable" anyway if you can have only one front-end) send changes/events to this one writer.

I don't remember ever hearing about a pattern that allows multiple senders to one receiver *without* a queue.

Your post feels like "I know the magic solution to this very important problem and it's actually sooo obvious to me that I'm not going to tell you what it is". In other words, it's not a very useful post for someone who is already aware of the problem itself, and actually needs a solution.

So, could you please actually *explain* how you solve the "don't use queues" point, so we don't have to read your source code to grasp the pattern?

Martin Thompson, 10 November 2011 at 13:03
Monster,

The Disruptor is an alternative to queues. It can replace a whole graph of dependencies that could be represented by queues. For some background, rather than reading the code, you can check out the following links:

Technical Paper: http://code.google.com/p/disruptor/downloads/list

Blogs: http://code.google.com/p/disruptor/wiki/BlogsAndArticles

Martin Fowler overview of the Disruptor in context: http://martinfowler.com/articles/lmax.html

Video with Q&A: http://www.infoq.com/presentations/LMAX

To answer your "multiple front end" question. At least three approaches can be taken:

1. Put the front ends on separate machines. This can be good for protocol translation and border security anyway. Then forward requests to a HA cluster of machines with the single thread for the state mutation. This needs to be asynchronous to scale.

2. If the threads are on the same machine, configure the Disruptor with the MultiThreadedClaimStrategy, which minimises contention and can be an order of magnitude faster than queue-based alternatives.

3. If the threads are on the same machine with massive contention, use one Disruptor instance for each producer/publisher thread and then have a multiplexer thread combine their traffic and publish it on to the single business logic thread via another Disruptor instance. This solution can be extended and federated.

Martin...

Monster, 10 November 2011 at 13:47
Thank you! I wasn't expecting such a quick response! I went and found the article of Martin Fowler myself after reading your post.

In short, what I missed from your post is that a Disruptor is a "fixed-size ring-buffer where each entry field can be written by a single thread" (or at least that is what I understood). Just one more little sentence would have made things much clearer, since most people don't know what a Disruptor is.

Anonymous, 21 November 2011 at 13:18
Martin,

Just general approach.

Thank you

Frazer Clement, 12 March 2012 at 15:13
Martin,

Interesting ideas and write-up.

As a black-box, the disruptor offers :
- One producer-to-many consumers queued message passing
- Producer-to-consumer synchronisation
- Consumer-to-consumer synchronisation

Internally, performance relies on :
- One writer for any location at any time
- Pre-allocation of slots

Would you agree that the disruptor implements a queue in the sense that there is some buffering of work items between producer and consumer(s)? In that sense a disruptor between a single producer and single consumer looks very much like a queue from the outside.

Further, this pattern seems similar to some queue designs where producers write to a queue tail pointer and a consumer writes to a queue head pointer. The extension here is to allow multiple consumers without contending head writes by each consumer maintaining their own head pointers. The interesting part is where the producer can only move its head pointer forward (making space for the tail) if all other head pointers have already moved forward. Perhaps it's non-intuitive that having the producer continuously performing a global-min on the head pointers of the consumers is more efficient than maintaining a contended tail pointer. Do you have any numbers to quantify the cost to the producer of maintaining the 'safe' queue head with N consumers?

I definitely agree about the benefits of avoiding locks, batching requests and keeping cache-warm threads busy. I think one of the main problems with understanding these benefits is the lack of numbers.

Frazer

Unknown, 25 October 2013 at 22:19
I don't get how the Disruptor is not just an optimized queue/FIFO.
And your "non-blocking busy spin" looks just like a spinlock to me.

J.J, 31 March 2014 at 15:55
Hi Martin

About the statement

"
On x86/x64 "loads can be re-ordered with older stores" according to the memory model so memory barriers are required when multiple threads mutate the same data across cores. The single writer principle avoids this issue because it never has to deal with writing the latest version of a data item that may have been written by another thread and currently in the store buffer of another core.
"

Even with a single writer, since "loads can be re-ordered with older stores", could a reader still see out-of-date state?

Qarlo, 17 April 2014 at 22:18
I appreciate this good post. I have a question though. I don't understand the SOA example. If a service is the owner of the data and all modification of the data should be done through the service, isn't that just passing the problem to the service? Don't we still have the same problem? A DB would do the same: the first request is served first in a "queue" model and then any cached result is invalidated (the DB can cache the result of a query). A service would have to implement something like that. We would be applying the Single Writer Principle because the DB is the only process that can touch the actual data on disk.
I know that having one service as the owner of the data is an encapsulation principle with certain benefits, but I don't see the "concurrency" benefit in it, only the maintainability and flexibility benefits.

Unknown, 13 December 2014 at 14:36
How far can the definition of a "single writer principle" be taken before you consider the principle is broken?

Would the use of a ConcurrentHashMap in a scenario where you have a single writer, always the same thread, but multiple readers be sufficient for compliance?

The map uses locks internally on writes, but is that enough to break the principle? Here, if it's always the same thread writing, I would think the JVM would do a good job of deflating those locks, but again, it's not clear from the article if you consider controlled CAS operations out of bounds.

Would the CAS operations used by the map on read be bad enough to again break the principle?

Gary, 16 April 2016 at 04:11
If I have a HashMap that is only updated by a single thread, but read by other threads, do I need to synchronize it?

GriNDeR, 6 May 2016 at 09:08
Hi Martin,

Could you say how this works in a multi-node environment?
For example, if I have two nodes A and B, must I create one writer (singleton) on one of the nodes? But this is a potential bottleneck, and network communication between nodes is slow too.

Ivan Mushketyk, 23 September 2016 at 10:36
Hi Martin,

Thank you for the great blog post and amazing blog.

One question about what you've written:
"It is interesting to see the emergence of numerous approaches that lend themselves to single writer solutions such as Node.js, Erlang, Actor patterns, and SEDA to name a few. Unfortunately most use queue based implementations underneath, which breaks the single writer principle".

Do you suggest that there is a better way to implement such systems? If we have an actor/object that can receive messages from two other entities how can we implement such a system without violating the Single Writer principle? Do we need to have two different input queues, one for each writer? Do we need to avoid queues all together? What if there are more than 2 senders?

pokerbot1016 January 2017 at 01:58
If I have a single writer to a simple variable type such as a 32-bit integer on a 32-bit system, do I need to have any synchronization if there are other threads reading it?

Unknown, 13 February 2019 at 20:35
My first question is about logic. OK, a single thread writes data and others can read freely without locks, but what if the write operation is not atomic and/or spans multiple locations? Won't other threads see inconsistent state? What if they always require the latest state: won't they have to wait until the current writes are flushed by the writer? Writes initiated by a thread won't be visible to others until the write operation has completed.
My second question is about implementation. I'm sure you didn't replace the queue implementation without using CAS/spinning-based approaches, which basically can't make your implementation equal/better than having separate spinlocks/MCS spinlocks around the tail and head of the queue (in the single-producer or single-consumer case there will be a lock on only one end).

amit, 8 March 2019 at 00:47
Thank you Martin. For sharing this knowledge. This is one of the best blogs for understanding micro architecture for applications.