Thursday, 22 September 2011

Single Writer Principle

When trying to build a highly scalable system, the single biggest limitation on scalability is having multiple writers contend for any item of data or resource.  Sure, algorithms can be bad, but let's assume they have reasonable Big O complexity, so we'll focus on the scalability limitations of the system's design.

I keep seeing people just accept having multiple writers as the norm.  There is a lot of research in computer science into managing this contention, and it boils down to two basic approaches.  One is to provide mutual exclusion to the contended resource while the mutation takes place; the other is to take an optimistic strategy and swap in the changes only if the underlying resource has not changed while you were creating the new copy.

Mutual Exclusion

Mutual exclusion is the means by which only one writer can have access to a protected resource at a time, and is usually implemented with a locking strategy.  Locking strategies require an arbitrator, usually the operating system kernel, to get involved when contention occurs to decide who gains access and in what order.  This can be a very expensive process, often requiring many more CPU cycles than the actual transaction being applied to the business logic would use.  Those waiting to enter the critical section, before they can perform the mutation, must queue, and this queuing effect (Little's Law) causes latency to become unpredictable and ultimately restricts throughput.
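
To make the locking cost concrete, a minimal sketch of a mutually excluded counter in Java might look like the following; the class and method names are illustrative only and are not from the original article.

    import java.util.concurrent.locks.ReentrantLock;

    // Minimal sketch: a counter guarded by mutual exclusion.  Every writer
    // must acquire the lock, so contended threads queue behind it and the
    // kernel may have to arbitrate when they collide.
    public class LockedCounter
    {
        private final ReentrantLock lock = new ReentrantLock();
        private long value;

        public void increment()
        {
            lock.lock();    // contended threads queue here
            try
            {
                value++;    // the critical section is tiny compared to the cost of the locking
            }
            finally
            {
                lock.unlock();
            }
        }

        public long get()
        {
            lock.lock();
            try
            {
                return value;
            }
            finally
            {
                lock.unlock();
            }
        }
    }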

Optimistic Concurrency Control

Optimistic strategies involve taking a copy of the data, modifying it, then copying back the changes only if the data has not mutated in the meantime.  If a change has happened in the meantime, you repeat the process until successful.  This repeating of the process increases with contention and therefore causes a queuing effect just like mutual exclusion does.  If you work with a source code control system, such as Subversion or CVS, then you are using this algorithm every day.  Optimistic strategies can work with data but do not work so well with resources such as hardware, because you cannot take a copy of the hardware!  The ability to perform the changes atomically to data is made possible by the CAS (compare-and-swap) instructions offered by the hardware.
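
A minimal sketch of this optimistic loop in Java, using AtomicLong as the CAS primitive (the class name is illustrative and not from the article):

    import java.util.concurrent.atomic.AtomicLong;

    // Minimal sketch of the optimistic approach: read the current value,
    // compute a new one from it, then swap it in only if nothing changed in
    // the meantime.  On failure the whole step is retried, which is where
    // the queuing effect appears under contention.
    public class OptimisticCounter
    {
        private final AtomicLong value = new AtomicLong();

        public long increment()
        {
            long current;
            long next;
            do
            {
                current = value.get();   // take a "copy" of the data
                next = current + 1;      // apply the change to the copy
            }
            while (!value.compareAndSet(current, next)); // retry if another writer got in first

            return next;
        }
    }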

Most locking strategies are themselves composed from optimistic strategies for changing the state of the lock or mutual exclusion primitive.

Managing Contention vs. Doing Real Work

CPUs can typically process one or more instructions per cycle.  For example, modern Intel CPU cores each have 6 execution units that can be doing a combination of arithmetic, branch logic, word manipulation and memory loads/stores in parallel.  If, while doing work, the CPU core incurs a cache miss and has to go to main memory, it will stall for hundreds of cycles until the result of that memory request returns.  To try and improve things the CPU will make some speculative guesses as to what the memory request will return so it can continue processing.  If a second miss occurs the CPU will no longer speculate and will simply wait for the memory request to return, because it typically cannot keep the state for speculative execution beyond 2 cache misses.  Managing cache misses is the single largest limitation on scaling the performance of our current generation of CPUs.

Now what does this have to do with managing contention?  Well, if two or more threads are using locks to provide mutual exclusion, at best they will be going to the L3 cache, or over a socket interconnect, to access the shared state of the lock using CAS operations.  These lock/CAS instructions cost tens of cycles in the best case, when un-contended, plus they cause the CPU's out-of-order execution to be suspended and its load/store buffers to be flushed.  At worst, collisions occur and the kernel will need to get involved and put one or more of the threads to sleep until the lock is released.  This rescheduling of the blocked thread results in cache pollution.  The situation can be even worse when the thread is re-scheduled on another core with a cold cache, resulting in many cache misses.

For highly contended data it is very easy to get into a situation whereby the system spends significantly more time managing contention than doing real work.  The table below gives an idea of basic costs for managing contention when the program state is very small and easy to reload from the L2/L3 cache, never mind main memory. 

Method                            Time (ms)
One Thread                              300
One Thread with Memory Barrier        4,700
One Thread with CAS                   5,700
Two Threads with CAS                 18,000
One Thread with Lock                 10,000
Two Threads with Lock               118,000

This table illustrates the costs of incrementing a 64-bit counter 500 million times using a variety of techniques on a 2.4 GHz Westmere processor.  I can hear people coming back with, "but this is a trivial example and real-world applications are not that contended".  This is true, but remember that real-world applications have way more state, and what do you think happens to all that state, which is warm in cache, when the context switch occurs?  By measuring the basic cost of contention it is possible to extrapolate the scalability limits of a system which has contention points.  As multi-core becomes ever more significant another approach is required.  My last post illustrates the micro level effects of CAS operations on modern CPUs, whereby Sandy Bridge can be worse for CAS and locks.
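
The original test harness is not reproduced here, but a rough sketch of the kind of single-threaded loops being compared might look like the following; the class and field names are illustrative, and a volatile field is assumed for the "memory barrier" case.

    import java.util.concurrent.atomic.AtomicLong;

    // Illustrative sketch only: the shape of the loops behind the table above.
    public class CounterCosts
    {
        private static final long ITERATIONS = 500L * 1000L * 1000L;

        private static long plainCounter;
        private static volatile long volatileCounter;          // assumed for the "memory barrier" case
        private static final AtomicLong casCounter = new AtomicLong();
        private static long lockedCounter;
        private static final Object lock = new Object();

        public static void main(final String[] args)
        {
            for (long i = 0; i < ITERATIONS; i++)
            {
                plainCounter++;                                 // "one thread"
            }

            for (long i = 0; i < ITERATIONS; i++)
            {
                volatileCounter++;                              // "one thread with memory barrier"
            }

            for (long i = 0; i < ITERATIONS; i++)
            {
                casCounter.incrementAndGet();                   // "one thread with CAS"
            }

            for (long i = 0; i < ITERATIONS; i++)
            {
                synchronized (lock)                             // "one thread with lock"
                {
                    lockedCounter++;
                }
            }
        }
    }

The two-thread cases would be the same CAS and lock loops run concurrently against the same counter, which is where the costs in the table really explode.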

Single Writer Designs

Now, what if you could design a system whereby any item of data, or resource, is only mutated by a single writer/thread?  In my experience it is actually easier to achieve than you might think.  It is OK if multiple threads, or other execution contexts, read the same data.  CPUs can broadcast read-only copies of data to other cores via the cache coherency sub-system.  This has a cost but it scales very well.

If you have a system that can honour this single writer principle, then each execution context can spend all its time and resources processing the logic for its purpose, and not be wasting cycles and resources dealing with the contention problem.  You can also scale up without limitation until the hardware is saturated.  There is a further really nice benefit: on architectures such as x86/x64, which at a hardware level have a memory model in which load/store memory operations have preserved order, memory barriers are not required if you adhere strictly to the single writer principle.  On x86/x64, "loads can be re-ordered with older stores" according to the memory model, so memory barriers are required when multiple threads mutate the same data across cores.  The single writer principle avoids this issue because it never has to deal with writing the latest version of a data item that may have been written by another thread and may currently be sitting in the store buffer of another core.
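
As a rough illustration, and not code from the article, a single-writer counter in Java can publish with an ordered store rather than a CAS or a lock; the class name and the use of lazySet here are my own assumptions for the sketch.

    import java.util.concurrent.atomic.AtomicLong;

    // Minimal single-writer sketch: only ONE thread ever calls increment(),
    // so no CAS or lock is needed on the write path.  lazySet() is a cheap
    // ordered store used for safe publication in Java (see the comments
    // below on visibility); any number of threads may call get().
    public class SingleWriterSequence
    {
        private final AtomicLong publishedValue = new AtomicLong();
        private long value;  // touched only by the single writer thread

        public void increment()
        {
            publishedValue.lazySet(++value);  // ordered store, no contention to manage
        }

        public long get()
        {
            return publishedValue.get();
        }
    }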

So how can we drive towards single writer designs?  I've found it is a very natural thing.  Consider how humans, or any other autonomous creatures of nature, operate with their model of the world.  We all have our own model of the world contained in our own heads, i.e. we have a copy of the world state for our own use.  We mutate the state in our heads based on inputs (events/messages) we receive via our senses.  As we process these inputs and apply them to our model, we may take action that produces outputs, which others can take as their own inputs.  None of us reaches directly into another's head and messes with the neurons.  If we did, it would be a serious breach of encapsulation!  Originally, Object Oriented (OO) design was all about message passing, and somewhere along the way we bastardised the message passing into method calls and even allowed direct field manipulation – yuk!  Whose bright idea was it to allow public access to the fields of an object?  You deserve your own special hell.

At university I studied transputers and interesting languages like Occam.  I thought very elegant designs emerged from having the nodes collaborate via message passing rather than mutating shared state.  I'm sure some of this has inspired the Disruptor.  My experience with the Disruptor has shown that it is possible to build systems with one or more orders of magnitude better throughput than locking or contended state based approaches.  It also gives much more predictable latency that stays constant until the hardware is saturated, rather than the traditional J-curve latency profile.

It is interesting to see the emergence of numerous approaches that lend themselves to single writer solutions, such as Node.js, Erlang, Actor patterns, and SEDA to name a few.  Unfortunately most use queue based implementations underneath, which break the single writer principle, whereas the Disruptor strives to separate the concerns so that the single writer principle can be preserved for the common cases.
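
For illustration only, and certainly not the Disruptor, the shape of a single-writer design can be as simple as funnelling every mutation to one owning thread; the class and method names below are hypothetical, and note that the queue internals still contain multiple writers, which is exactly the concern raised above.

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    // Sketch: every mutation of the model arrives as a message and is applied
    // by ONE owning thread, so the state itself has a single writer even
    // though many threads may send to it.
    public class AccountOwner implements Runnable
    {
        private final BlockingQueue<Long> inbox = new ArrayBlockingQueue<>(1024);
        private long balance;  // touched only by the owning thread in run()

        public void credit(final long amount) throws InterruptedException
        {
            inbox.put(amount);  // any thread may send a message
        }

        @Override
        public void run()
        {
            try
            {
                while (true)
                {
                    balance += inbox.take();  // the single writer applies all mutations
                }
            }
            catch (final InterruptedException ignored)
            {
                Thread.currentThread().interrupt();
            }
        }
    }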

Now I'm not saying locks and optimistic strategies are bad and should not be used.  They are excellent for many problems.  For example, bootstrapping a concurrent system, or making major state changes in configuration or reference data.  However, if the main flow of transactions acts on contended data, and locks or optimistic strategies have to be employed, then scalability is fundamentally limited.

The Principle at Scale

This principle works at all levels of scale.  Mandelbrot got this so right.  CPU cores are just nodes of execution, and the cache system provides message passing for communication.  The same patterns apply if the processing node is a server and the communication system is a local network.  If a service, in SOA parlance, is the only service that can write to its data store, it can be made to scale and perform much better.  If, instead, the underlying data is stored in a database and other services can go directly to that data, without sending a message to the service that owns it, then the data is contended and the database is required to manage the contention and coherence of that data.  This prevents the service from caching copies of the data for faster response to its clients, and restricts how the data can be sharded.  Encapsulation has just been broken at a more macro level when multiple different services write to the same data store.

Summary

If a system is decomposed into components that keep their own relevant state model, without a central shared model, and all communication is achieved via message passing, then you have a system without contention, naturally.  This type of system obeys the single writer principle if the message passing sub-system is not implemented as queues.  If you cannot move straight to a model like this, but are finding scalability issues related to contention, then start by asking the question, "How do I change this code to preserve the Single Writer Principle and thus avoid the contention?"

The Single Writer Principle is that for any item of data, or resource, that item should be owned by a single execution context for all mutations.

61 comments:

  1. Thanks Martin for the excellent post(s). We are working in the FX algo domain, so you can imagine how important latency is for us.

    I have learnt a lot of things from your blog.

    Thanks
    Manish

  2. Thanks for sharing and writing this up.

    However, I don't see why message passing implemented as a queue violates the single writer principle.

    Less predictable - yes. Incompatible - I don't see why?

  3. Edward,

    About 30-35mins into the following presentation I touch on this subject for queues.

    http://www.infoq.com/presentations/LMAX

    The summary is that the head, tail, and possibly size, have to be concurrently modified. The act of having to add something to a queue is a write operation, as is the act of removing something. Therefore you have multiple writers thus breaking the principle. This is why you need locks or CAS operations for the implementation. Just look at the source of ArrayBlockingQueue or ConcurrentLinkedQueue.

  4. Hi Martin,

    I'm not disputing additional overhead when using queues.

    I'm just pointing out that since you're defining the "Single Writer Principle" in this post (which I like), you should probably elaborate on why it should not use queues. Is it even part of the "Single Writer Principle" or just a recommended design? You're alluding to it, but there's no clear explanation.

    I actually think the Single Writer Principle has its place even with queues, especially in a distributed system where it eliminates contention on shared state when there are multiple writers updating it. The (relatively) minor overhead of queuing may be an acceptable trade-off in place of, say, distributed locking on a distributed cache.

    - Ed Y.

  5. Edward,

    Sorry if I was unclear here. Using queues does not break the single writer principle as you point out. What I'm trying to say is queues themselves break the single writer principle. If, for example, you use the Disruptor instead of queues then you can have a full design that avoids having multiple writers to any resource :-)

  6. Can you share your experience with writing performance tests?

    Thank you.

  7. Siryc,

    Is there something specific you'd like to know about performance testing or just my general approach?

  8. Erlang in general is a very good example of this at the memory level. Also the CouchDB append-only B-tree is a classic example at the data level.

  9. Just saw your entry in HackerNews.

    While I get your points and agree with what you say (which is no news to me), I have no idea *how* you can have an actor/thread receive messages from multiple other ones without a queue. If you have no shared state, you can scale infinitely; no problem here. If you do have a shared state, and a single writer to it, then you need a funnel architecture where N front-ends (because you don't want a single-point-of-failure front-end, and you can't call it "scalable" anyway if you can have only one front-end) send changes/events to this one writer.

    I don't remember ever hearing about a pattern that allows multiple senders to one receiver *without* a queue.

    Your post feels like "I know the magic solution to this very important problem and it's actually sooo obvious to me that I'm not going to tell you what it is". In other words, it's not a very useful post for someone who is already aware of the problem itself, and actually needs a solution.

    So, could you please actually *explain* how you solve the "don't use queues" point, so we don't have to read your source code to grasp the pattern?

  10. Monster,

    The Disruptor is an alternative to queues. It can replace a whole graph of dependencies that could be represented by queues. For some background, rather than reading the code, you can check out the following links:

    Technical Paper: http://code.google.com/p/disruptor/downloads/list

    Blogs: http://code.google.com/p/disruptor/wiki/BlogsAndArticles

    Martin Fowler overview of the Disruptor in context: http://martinfowler.com/articles/lmax.html

    Video with Q&A: http://www.infoq.com/presentations/LMAX

    To answer your "multiple front end" question. At least three approaches can be taken:

    1. Put the front ends on separate machines. This can be good for protocol translation and border security anyway. Then forward requests to a HA cluster of machines with the single thread for the state mutation. This needs to be asynchronous to scale.

    2. If the threads are on the same machine. Configure the Disruptor with the MultiThreadedClaimStrategy which minimises contention and can be an order of magnitude faster than queue based alternatives.

    3. If the threads are on the same machine with massive contention. Use one Disruptor instance for each producer/publisher thread and then have a multiplexer thread combining their traffic and publishing it on to the single business logic thread via another Disruptor instance. This solution can be extended and federated.

    Martin...

    Replies
    1. This is good stuff guys, well done and well said,

      However, one question crossed my mind while trying to apply the "multiplexer thread" technique above.

      Is the multiplexing cost equivalent to the loss of single writer advantage?
      If yes, then perhaps a multiple writer scenario would work equally fast.

      Any thoughts?

    2. Best thing to do is measure for your use case. I've found a multiplexer can often give better performance and be easier to reason about. It comes down to levels of contention and the number of producers. Other times it is not.

    3. Thanks Martin,

      It is trial and error finally, as usual :-)

      I intend to measure this anyway.

      However it sounds like the larger the contention (number of producers), the better the chances of a multiplexer performing better.

      Is this a valid statement?

    4. Thanks Martin,

      Trying to get to the bottom of this and share to all,

      Assuming that multiplexing means collecting all input from multiple producers into a collection and forwarding this collection as a single producer/writer to a final collection for consumption by the final consumers,

      here are some early calculations
      (using JDK blocking queues, disruptor variant coming soon)

      |==========|===========|===========|========|==================|================|
      | EVENTS   | CONSUMERS | PRODUCERS | MILLIS | TYPE             | IMPLEMENTATION |
      |==========|===========|===========|========|==================|================|
      | 10000000 |         1 |        10 |   2087 | MULTIPLE_WRITERS | JDK            |
      | 10000000 |         1 |         1 |   2305 | MULTIPLE_WRITERS | JDK            |
      | 10000000 |        10 |       100 |   2770 | MULTIPLE_WRITERS | JDK            |
      | 10000000 |        10 |         1 |   2922 | MULTIPLE_WRITERS | JDK            |
      | 10000000 |       100 |        10 |   2946 | MULTIPLE_WRITERS | JDK            |
      | 10000000 |        10 |        10 |   3002 | MULTIPLE_WRITERS | JDK            |
      | 10000000 |       100 |       100 |   3061 | MULTIPLE_WRITERS | JDK            |
      | 10000000 |       100 |        10 |   4305 | SINGLE_WRITER    | JDK            |
      | 10000000 |       100 |       100 |   4535 | SINGLE_WRITER    | JDK            |
      | 10000000 |        10 |       100 |   4651 | SINGLE_WRITER    | JDK            |
      | 10000000 |        10 |        10 |   4737 | SINGLE_WRITER    | JDK            |
      | 10000000 |         1 |        10 |   5238 | SINGLE_WRITER    | JDK            |
      | 10000000 |        10 |         1 |   5807 | SINGLE_WRITER    | JDK            |
      | 10000000 |         1 |         1 |   6040 | SINGLE_WRITER    | JDK            |
      |==========|===========|===========|========|==================|================|

      So far, it looks like multiple writers perform better.

    5. Try the queues from JCTools or Agrona that are single producer with the ability to do batch draining.

      https://github.com/real-logic/Agrona
      https://github.com/JCTools/JCTools

      Running your benchmarks with JMH and publishing the tests for review can help with getting support from others.

    6. Can "Disruptors" be used as well?
      for comparison to "JCTools" queues and JDK blocking queues

    7. Yes the Disruptor is another example.

    8. Any direction for "wait strategy" with JCTools?

    9. Best to ask at the JCTools project.

    10. Here is a JMH benchmark using
      JDK (BlockingQueues)
      and
      JCTools (Bounded Arrays with Yielding IdleStrategy)

      https://drive.google.com/open?id=0B3tIYUvecvkAZlRDb21NYTBZbEE

      It looks to me that, as expected, as the number of producers grows JCTools gets the advantage over JDK's BlockingQueue

      "Disruptor" could be another impl to measure

      Any thoughts?

      I am trying to pull out the benchmark code and share it somewhere for convenience

    11. Did you test on a machine capable of truly running 40+ threads concurrently? If the number of available cores is significantly less then you are measuring the scheduler for a significant part of the test and not the concurrent interactions on the FIFO data structures.

    12. Unfortunately, I don't have such a machine (8 cores only).
      I am indeed measuring the scheduler significantly I am afraid.

      Even so, the measurement displays the dramatic difference between queue implementations which has been one of my original goals.

      The added value here is the measurement of the "multiplexer" (SingleWriter impl)

      Still, without a good machine I cannot prove much I am afraid.

      So, I intend to isolate and share my code.

      I will get back soon.

      Thanks again for helping, this is good stuff :-)

    15. It has been some time since we last talked about this, but here are some perhaps more mature and better looking results:

      https://docs.google.com/document/d/1X7RobP2tn7OyGZ8VrXzxa5R5fhT38QnO2mx1CrKWXrE/edit?usp=sharing

      I believe this aligns with the expectation of multiple-consumer contention and the single writer (producer) multiplexing benefit.

      The difference between the JDK (BlockingQueue) and JCTools is more than obvious, even though I do not have a powerful enough machine to measure accurately.

      Benchmarking results from JMH are coming soon, although they are not expected to change the difference ratio noticeably.

      A Disruptor version is coming soon as well.

      I intend to publish the project on GitHub soon.

      I hope this gives some food for thought :-)

  11. Thank you! I wasn't expecting such a quick response! I went and found the article of Martin Fowler myself after reading your post.

    In short, what I missed from your post is that a Disruptor is a "fixed-size ring-buffer where each entry field can be written by a single thread" (or at least that is what I understood). Just one more little sentence would have made things much clearer, since most people don't know what a Disruptor is.

  12. Martin,

    Just general approach.

    Thank you

  13. Martin,

    Interesting ideas and write-up.

    As a black-box, the disruptor offers :
    - One producer-to-many consumers queued message passing
    - Producer-to-consumer synchronisation
    - Consumer-to-consumer synchronisation

    Internally, performance relies on :
    - One writer for any location at any time
    - Pre-allocation of slots

    Would you agree that the disruptor implements a queue in the sense that there is some buffering of work items between producer and consumer(s)? In that sense a disruptor between a single producer and single consumer looks very much like a queue from the outside.

    Further, this pattern seems similar to some queue designs where producers write to a queue tail pointer and a consumer writes to a queue head pointer. The extension here is to allow multiple consumers without contending head writes, by each consumer maintaining its own head pointer. The interesting part is where the producer can only move its head pointer forward (making space for the tail) if all other head pointers have already moved forward. Perhaps it's non-intuitive that having the producer continuously perform a global-min over the head pointers of the consumers is more efficient than maintaining a contended tail pointer - do you have any numbers to quantify the cost to the producer of maintaining the 'safe' queue head with N consumers?

    I definitely agree about the benefits of avoiding locks, batching requests and keeping cache-warm threads busy. I think one of the main problems with understanding these benefits is the lack of numbers.

    Frazer

    Replies
    1. Frazer,

      I agree that in some ways you could say it is a better queue. Many network cards have similar internal implementations, so this is not new. Hardware tends to be better designed than software in my experience. One of the other benefits is the organisation of dependent consumers into a graph, which can be useful. This dependency graph has nothing to do with the Single Writer Principle.

      It should also be noted that the concerns are well separated in the Disruptor thus allowing multiple or single producers, and the best performance possible given each choice. This is independent of how you organise the graph of consumers.

      As to lack of numbers. The Disruptor project comes with performance tests which illustrate how it performs vs traditional queue approaches. Check out the source and run them for yourself.

  14. I don't get how the Disruptor is not just an optimized queue/FIFO.
    And your "non-blocking busy spin" looks just just like a spinlock to me.

    Replies
    1. I'm not sure if you are asking a question or making a statement.

      The Disruptor does offer FIFO semantics but it also allows the configuration of a dependency graph that works like static actors. Please check out the performance tests for examples.

      A spin lock provides mutual exclusion by spinning while waiting to enter the critical section. The Disruptor has an optional multi-threaded claim strategy for a slot that uses a CAS, but not for a critical section. Slots, once claimed, can be used in parallel without being in a mutually exclusive critical section. This is similar to many lock-free algorithms that do not use locks (spinning or not).

  15. Hi Martin

    About the statement

    "
    On x86/x64 "loads can be re-ordered with older stores" according to the memory model so memory barriers are required when multiple threads mutate the same data across cores. The single writer principle avoids this issue because it never has to deal with writing the latest version of a data item that may have been written by another thread and currently in the store buffer of another core.
    "

    Even using a single writer, because "loads can be re-ordered with older stores", could a reader see out-of-date state?

    Replies
    1. Other threads do not see stale data because this rule only applies to the thread issuing the loads and stores. If it read what it just stored then it sees what is in its store buffer. It has to be stores to other locations.

      The rule of the memory model is only an issue for algorithms like the Dekker or Peterson locks. This is where intent is expressed with a store then immediately followed by a load to read another intent. For this scenario it can appear that the subsequent load is re-ordered with an older store to a different location, thus breaking sequential consistency. For such operations a CAS approach is a better technique on modern processors.

    2. Thread A: store
      Thread A: load

      It is fine, because the load could get the data from store buffer.

      Thread A: store
      Thread B: load

      Thread B might not get A's store, correct?

    3. In the second case Thread B might not see A's store due to timing issues. This would be the same regardless of store buffer implementation.

      The more interesting, but seldom required, version is:

      Thread A store $address_one
      [fence]
      Thread A load $address_two

      Thread B store $address_two
      [fence]
      Thread B load $address_one

      For each thread it could appear that the load got reordered with the older store, due to the stores working their way out of the store buffer. The fences are required if you desire sequential consistency in this case.

    4. So if we have a single writer, we need neither 1. a lock nor 2. a fence?

      If we have only one thread doing the write

      Thread B store $address_one

      Thread A load $address_one

      though it is possible that A might see the write with some "delay", it is fine for a "queue". Am I right?

    5. There is no additional delay to when the value stored by B becomes visible, whether a fence is used or not. The store buffer will drain as fast as it can. A soft fence, e.g. lazySet/putOrdered in Java, is required for the compiler to ensure StoreStore ordering.

  16. I appreciate this good post. I have a question though. I don't understand the SOA example. If a service is the owner of the data and all modification of the data should be done through the service, isn't it just passing the problem to the service? Don't we still have the same problem? A DB would do the same: the first request is served first in a "queue" model, and then any cached result is invalidated (the DB can cache the result of a query). A service would have to implement something like that. We would be applying the Single Writer Principle because the DB is the only process that can touch the actual data on disk.
    I know that having one service as the owner of the data is an encapsulation principle with certain benefits, but I don't see the "concurrency" benefit in it, only the maintainability and flexibility benefits.

    Replies
    1. There are many concurrency benefits to fronting a database with a service beyond caching. The service can deal with higher level aggregate operations, rather than row level operations, in its transactions. If the service is asynchronous then these aggregates can be smart batched into the database without competing transactions.

      The database is just the raw data. The service provides the higher level semantics, just as a component provides higher level semantics over raw memory access to the bytes. All concurrency has a granularity. We want to queue up updates to a single mutator and not have a free-for-all from any code wanting to mutate the raw data.

  17. How far can the definition of a "single writer principle" be taken before you consider the principle is broken?

    Would the use of a ConcurrentHashMap in a scenario where you have a single writer, always the same thread, but multiple readers be sufficient for compliance?

    The map uses locks internally on writes, but is that enough to break the principle? Here, if it's always the same thread writing, I would think the JVM would do a good job of deflating those locks, but again, it's not clear from the article whether you consider controlled CAS operations out of bounds.

    Would the CAS operations used by the map on read be bad enough to again break the principle?

    Replies
    1. If you have only a single writer on ConcurrentHashMap then you have no contention. The readers are non-blocking the last time I checked and therefore do not contend with the writer for the lock.

      If readers and a single writer share a lock then you break the single writer principle, because the readers and writer all contend on the lock, so effectively they are all writers.

    2. Thank you for taking the time to answer. Explaining the principle in terms of who is blocking makes the article so much clearer for me.

  18. If I have a HashMap that is only updated by a single thread, but read by other threads, do I need to synchronize it?

    Replies
    1. Yes. You need to have safe publication of writes. You could use a ConcurrentHashMap which does not require synchronized locks for the readers.

  19. Hi Martin,

    Could you say how this works in a multi-node environment?
    For example, if I have two nodes A & B, must I create one writer (singleton) on one of the nodes? But this is a potential bottleneck, and network communication between nodes is slow too.

    Replies
    1. The point is you need to design to avoid contention. In a multi-node environment, if you were to distribute and allow multiple copies of the data, then you'd need something like Paxos, which is complex and has greater overhead.

    2. So, the single writer pattern is not suitable for a multi-node environment, is that true?
      Thanks for your answer.

    3. It totally is. I think you are misunderstanding. RAFT is an example of how to employ a single writer in a distributed environment requiring resilience.

  20. Hi Martin,

    Thank you for the great blog post and amazing blog.

    One question about what you've written:
    "It is interesting to see the emergence of numerous approaches that lend themselves to single writer solutions such as Node.js, Erlang, Actor patterns, and SEDA to name a few. Unfortunately most use queue based implementations underneath, which breaks the single writer principle".

    Do you suggest that there is a better way to implement such systems? If we have an actor/object that can receive messages from two other entities how can we implement such a system without violating the Single Writer principle? Do we need to have two different input queues, one for each writer? Do we need to avoid queues all together? What if there are more than 2 senders?

    Replies
    1. The queues are a good way to manage the multiple producing parties. For an MPSC queue the single writer principle cannot apply, however it is at least isolated there. The most advanced handling of publication from multiple contending producers I've seen is in the Aeron messaging system. I talk about it here: https://2016.javazone.no/program/a-quest-for-predictable-latency-with-java-concurrency

      When you have a fixed number of input sources then single producer to single consumer queues are a good option like this one: https://github.com/real-logic/Agrona/blob/master/src/main/java/org/agrona/concurrent/OneToOneConcurrentArrayQueue.java

  21. If I have a single writer to a simple variable type such as a 32-bit integer on a 32-bit system, do I need to have any synchronization if there are other threads reading it?

    Replies
    1. Yes, you need synchronization or some other mechanism that ensures safe publication: https://shipilev.net/blog/2014/safe-public-construction/#_safe_publication

      The problem is that if one thread writes data to a variable, the Java Memory Model does not guarantee that other threads will see the written value unless you use a safe publication mechanism. If you don't use synchronization, the JVM may assume that a variable is only used by a single thread and may "optimize" write operations in a way that no other threads will ever see any writes.

    2. Is there a way to prevent the JVM optimization without the cost of the memory barriers or volatiles or locks etc? For example, C++ has volatile as well but it simply means to not use compiler optimizations.

    3. There are a few mechanisms for safe publication that you can use, and they vary in their cost and power. Using "volatile" is definitely an option in this case. If a value is written only once you can use the "final" modifier, which also guarantees safe publication. If you have multiple writers you can use AtomicInteger (https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/atomic/AtomicInteger.html), which gives volatility and atomicity.

  22. First question is about logic: OK, a single thread writes data and others can read liberally without locks, but what if this write operation is not atomic and/or involves multiple locations? Won't other threads see inconsistent state? What if they always require the latest state, won't they wait until the current writes are flushed by the writer? Writes initiated from a thread won't be visible to others until the write operation has completed.
    Second, about implementation: I'm sure you didn't replace the queue implementation without using CAS/spinning based approaches, which basically can't make your implementation equal to or better than having separate spinlocks/MCS spinlocks around the tail and head of the queue (in the case of a single producer or single consumer there will be a lock only on one end).

    Replies
    1. On your first question, this can be done without locks using a number of techniques, such as model changes via a single reference, or guarding with before and after counters. For the second, if a queue supports either multiple producers or multiple consumers then it is not single writer, and then you bring in the complexity.

    2. I thought you delegate all write operations to a single thread, without being interested in immediate changes of state in the process, and you communicate with that thread without locks.
      After investigating the Disruptor pattern I see that data comes in from outside, one thread is responsible for writing it into a fixed-size ring buffer, and the other threads don't consume anything, they just increment their own local counters. This is very specific to your particular needs and I can't see how you invented something that was complex before.

    3. In short, you decided to use a single thread to fetch results from the network, which is scalable up to a point anyway, and decided to consume completions from multiple threads which are doing independent jobs, so each consumes its queue separately. That is basically single producer/single consumer, which doesn't require synchronization if you use a linked list or a bounded ring buffer, and doesn't block if the queue is empty; thus zero synchronization, but polling.

  23. Thank you Martin for sharing this knowledge. This is one of the best blogs for understanding micro-architecture for applications.
