Monday 26 August 2013

Lock-Based vs Lock-Free Concurrent Algorithms

Last week I attended a review session of the new JSR166 StampedLock run by Heinz Kabutz at the excellent JCrete unconference. StampedLock is an attempt to address the contention issues that arise in a system when multiple readers concurrently access shared state. StampedLock is designed to perform better than ReentrantReadWriteLock by taking an optimistic read approach.

While attending the session a couple of things occurred to me. Firstly, I thought it was about time I reviewed the current status of Java lock implementations. Secondly, that although StampedLock looks like a good addition to the JDK, it seems to miss the fact that lock-free algorithms are often a better solution to the multiple reader case.

Test Case

To compare implementations I needed an API test case that would not favour a particular approach. For example, the API should be garbage free and allow the methods to be atomic. A simple test case is to design a spaceship that can be moved around a 2-dimensional space with the coordinates of its position available to be read atomically. At least 2 fields need to be read, or written, per transaction to make the concurrency interesting.
/**
 * Interface to a concurrent representation of a ship that can move around
 * a 2 dimensional space with updates and reads performed concurrently.
 */
public interface Spaceship
{
    /**
     * Read the position of the spaceship into the array of coordinates provided.
     *
     * @param coordinates into which the x and y coordinates should be read.
     * @return the number of attempts made to read the current state.
     */
    int readPosition(final int[] coordinates);

    /**
     * Move the position of the spaceship by a delta to the x and y coordinates.
     *
     * @param xDelta delta by which the spaceship should be moved in the x-axis.
     * @param yDelta delta by which the spaceship should be moved in the y-axis.
     * @return the number of attempts made to write the new coordinates.
     */
    int move(final int xDelta, final int yDelta);
}
The above API would be cleaner with an immutable Position object factored out, but I want to keep it garbage free and create the need to update multiple internal fields with minimal indirection. This API could easily be extended to a 3-dimensional space while still requiring the implementations to be atomic.

Multiple implementations are built for each spaceship and exercised by a test harness. All the code and results for this blog can be found here.

The test harness will run each of the implementations in turn by using a megamorphic dispatch pattern to try and prevent inlining, lock-coarsening, and loop unrolling when accessing the concurrent methods.
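
A minimal sketch of that dispatch pattern follows. The implementation class names here are illustrative placeholders (one per lock type), and the real harness runs readers and writers on separate threads rather than in a single loop:

public final class HarnessSketch
{
    public static void main(final String[] args)
    {
        final Spaceship[] spaceships =
        {
            new SynchronizedSpaceship(),            // hypothetical names: one
            new ReentrantLockSpaceship(),           // implementation per lock type
            new ReentrantReadWriteLockSpaceship(),
            new StampedLockSpaceship(),
            new LockFreeSpaceship(),
        };

        for (final Spaceship spaceship : spaceships)
        {
            exercise(spaceship); // the same call sites see every concrete type,
        }                        // so the JIT cannot inline just one of them
    }

    private static void exercise(final Spaceship spaceship)
    {
        final int[] coordinates = new int[2];
        for (int i = 0; i < 1_000_000; i++)
        {
            spaceship.move(1, 1);                // megamorphic invokeinterface
            spaceship.readPosition(coordinates); // megamorphic invokeinterface
        }
    }
}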

Each implementation is subjected to 4 distinct threading scenarios that result in different contention profiles:
  • 1 reader - 1 writer
  • 2 readers - 1 writer
  • 3 readers - 1 writer
  • 2 readers - 2 writers
All tests are run with 64-bit Java 1.7.0_25, Linux 3.6.30, and a quad core 2.2GHz Ivy Bridge i7-3632QM. Throughput is measured over 5-second periods for each implementation, with the tests repeated 5 times to ensure sufficient warm-up. The results below are throughputs averaged per second over 5 runs. To approximate a typical Java deployment, no thread affinity or core isolation has been employed, which would have reduced variance significantly.

Note: Other CPUs and operating systems can produce very different results.

Results

Figures 1-4: throughput charts, one for each of the four threading scenarios above.

The raw data for the above charts can be found here.

Analysis

The real surprise for me from the results is the performance of ReentrantReadWriteLock. I cannot see a use for this implementation beyond a case where reads hugely outnumber writes. My main takeaways are:
  1. StampedLock is a major improvement over existing lock implementations, especially with increasing numbers of reader threads.
  2. StampedLock has a complex API. It is very easy to mistakenly call the wrong method for locking actions (see the sketch after this list).
  3. Synchronised is a good general purpose lock implementation when contention is from only 2 threads.
  4. ReentrantLock is a good general purpose lock implementation when thread counts grow as previously discovered.
  5. Choosing ReentrantReadWriteLock should be based on careful and appropriate measurement; as with all major decisions, measure and decide based on data.
  6. Lock-free implementations can offer significant throughput advantages over lock-based algorithms.
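
To illustrate points 1 and 2, the optimistic read pattern looks roughly like the following minimal sketch (not the benchmark implementation itself; this uses the java.util.concurrent.locks.StampedLock API as it later shipped in Java 8, while the post benchmarked the JSR166 preview of the same class). Note how the stamp must be threaded through and validated; mixing up the unlock methods, or forgetting validate(), compiles just fine:

import java.util.concurrent.locks.StampedLock;

public class StampedSpaceshipSketch
{
    private final StampedLock lock = new StampedLock();
    private int x;
    private int y;

    public void readPosition(final int[] coordinates)
    {
        long stamp = lock.tryOptimisticRead(); // no lock taken, just a stamp
        coordinates[0] = x;
        coordinates[1] = y;

        if (!lock.validate(stamp)) // a writer intervened: fall back to a real read lock
        {
            stamp = lock.readLock();
            try
            {
                coordinates[0] = x;
                coordinates[1] = y;
            }
            finally
            {
                lock.unlockRead(stamp);
            }
        }
    }

    public void move(final int xDelta, final int yDelta)
    {
        final long stamp = lock.writeLock();
        try
        {
            x += xDelta;
            y += yDelta;
        }
        finally
        {
            lock.unlockWrite(stamp);
        }
    }
}
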
Conclusion

It is nice seeing the influence of lock-free techniques appearing in lock-based algorithms. The optimistic strategy employed on read is effectively lock-free at the times when a writer is not updating.

In my experience of teaching and developing lock-free algorithms, not only do they provide significant throughput advantages as evidenced here, they also provide much lower latency with less variance.

50 comments:

  1. This comment has been removed by the author.

    Replies
    1. I do not believe it is a bug. The code is similar to the documentation.

      http://docs.oracle.com/javase/7/docs/api/java/util/concurrent/locks/ReentrantReadWriteLock.html

  2. Hi Martin
    thanks for sharing this excellent benchmark. After taking a peek at the code, I have a couple of questions for you: if I am not mistaken, the lock free implementation relies on an (internal) immutable position class. Don't you think it introduces some (favorable) bias? Likewise, it should introduce some garbage due to new allocation for each movement.
    What do you think? Where am I wrong?

    Replies
    1. I don't understand what you mean by "favorable bias". The internal class encapsulates the package of state that is read or written atomically. I think it is a clean abstraction. This is a standard technique for lock-free algorithms.

      The update does create some garbage. Locks also generate garbage when contended. I do not see your point.

      Externally all Spaceships honour the same API and they are subject to the same test case. The results are the results regardless of implementation. This is the value of OO for information hiding and encapsulation.
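
      For readers following the thread, a minimal sketch of the technique under discussion - package the multi-field state in an immutable class and swap it atomically with a CAS retry loop (this follows the shape of the blog's LockFreeSpaceship, not a verbatim copy of it):

      import java.util.concurrent.atomic.AtomicReference;

      public class LockFreeSpaceshipSketch implements Spaceship
      {
          private static final class Position
          {
              private final int x;
              private final int y;

              private Position(final int x, final int y)
              {
                  this.x = x;
                  this.y = y;
              }
          }

          private final AtomicReference<Position> position =
              new AtomicReference<Position>(new Position(0, 0));

          public int readPosition(final int[] coordinates)
          {
              final Position p = position.get(); // one volatile read, always a consistent pair
              coordinates[0] = p.x;
              coordinates[1] = p.y;
              return 1;
          }

          public int move(final int xDelta, final int yDelta)
          {
              int tries = 0;
              Position current;
              Position proposed;
              do
              {
                  ++tries;
                  current = position.get();
                  proposed = new Position(current.x + xDelta, current.y + yDelta);
              }
              while (!position.compareAndSet(current, proposed)); // retry if a writer interfered

              return tries;
          }
      }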

    2. Hi Martin,

      The results are quite interesting. Thanks for sharing.

      If I understood Cyrille correctly, he is probably talking about GC issues due to the large number of extra objects created here. With the 2-writer scenario you are doing 10M writes/sec, so it should be creating 10M objects/sec. The GC should probably cope fine with short-lived objects (young GC). But I tried something similar for one of my implementations: even though my lock-free implementations are doing well, GC wasn't able to keep up in the long term. I haven't spent enough time to nail down the problem yet. I could be wrong. Let me know if you have any thoughts on it.

      - Bhaskar

    3. Yes the writers do create garbage. The lock based implementations can also create garbage internally. No garbage is free and it is best to keep it restricted in an application. In real world applications, this level of garbage tends to be so low it is not noticed unless you want a totally GC free application. Just use a web server, network IO, or a JDBC connector and they will dwarf this in the GC profile.

      If you want a totally garbage free approach other techniques are available but I wanted to keep the example relatively simple.

    4. Thanks Martin. You mentioned there are garbage free approaches. Would you please point me to them?

      I was under the impression that any lock-free implementation would be doing cmpxchg on a reference, which means it is supposed to create a new object for each update.

      Thanks,
      Bhaskar

    5. I teach some of these on my course. They can involve word packing or managing a cycling memory structure. These techniques add complexity but some people need them for the extra performance and avoidance of garbage.
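
      As a rough illustration of the cycling idea only - a sketch, assuming a single writer and that a reader always completes within RING_SIZE - 1 subsequent writes (otherwise a slot could be overwritten mid-read and the sequence would need re-validation):

      import java.util.concurrent.atomic.AtomicLong;

      public class CyclingSpaceshipSketch
      {
          private static final int RING_SIZE = 1024; // power of two for cheap masking
          private static final int MASK = RING_SIZE - 1;

          private final int[] xs = new int[RING_SIZE];
          private final int[] ys = new int[RING_SIZE];
          private final AtomicLong sequence = new AtomicLong(0);

          // Single writer: fill the next slot, then publish it with an ordered store.
          public void move(final int xDelta, final int yDelta)
          {
              final long current = sequence.get();
              final int currentIndex = (int)current & MASK;
              final int nextIndex = (int)(current + 1) & MASK;

              xs[nextIndex] = xs[currentIndex] + xDelta;
              ys[nextIndex] = ys[currentIndex] + yDelta;

              sequence.lazySet(current + 1); // release: makes the new slot visible to readers
          }

          // Readers never allocate: load the sequence, then the slot it points at.
          public void readPosition(final int[] coordinates)
          {
              final int index = (int)sequence.get() & MASK; // acquire
              coordinates[0] = xs[index];
              coordinates[1] = ys[index];
          }
      }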

    6. IMO Cyrille and Bhaskar have a point. In real world applications you might be moving thousands of ships in one go with hundreds of concurrent threads, and GC *may* become a real problem. It is good Cyrille brought this up, because when people copy/paste your approach to production systems they will at least understand the risks.

    7. It is a good point of discussion. When working on in-memory data structures you are never context switching the thread, so it is counterproductive to run more threads than you have cores. On the typical server these days that will be 8-16 cores minus some for the OS and other background activities.

      If you have a garbage issue, which you determined via profiling, then a garbage free approach can be valid. For this there are many other techniques. I just wanted to show one possible technique that can be applied for lock-free algorithms as a comparison to the lock based approach. I spend a lot of my life writing totally garbage free systems. :-) If you use ReentrantLock it creates garbage on contention.

      It is great to have the debate and people need to understand the consequences of their own choices. We cannot protect those who do not think from every eventuality.

  3. It would be interesting to have the number of retries per operation.

    Replies
    1. That is available in the provided link to the raw data.

      https://github.com/mjpt777/rw-concurrency/blob/master/data/results-i7-3632QM-Linux-3.6.30-Java-1.7.0_25-x64.txt

  4. Hi Martin,

    LockFreeSpaceship beats all your other spaceships, but LockAndGcFreeSpaceship is even faster:

    import java.util.concurrent.atomic.AtomicLong;

    public class LockAndGcFreeSpaceship implements Spaceship
    {
        private final AtomicLong position = new AtomicLong();

        @Override
        public int move(int xDelta, int yDelta)
        {
            int tries = 0;
            long pos, newPos;
            do
            {
                ++tries;
                pos = position.get();
                int x = (int)(pos & 0xFFFFFFFFL) + xDelta;
                int y = (int)(pos >>> 32) + yDelta;
                newPos = (((long)x) & 0xFFFFFFFFL) | (((long)y) << 32);
            }
            while (!position.compareAndSet(pos, newPos));

            return tries;
        }

        @Override
        public int readPosition(int[] coordinates)
        {
            long pos = position.get();
            coordinates[0] = (int)(pos & 0xFFFFFFFFL);
            coordinates[1] = (int)(pos >>> 32);
            return 1;
        }
    }

    PeLe

    Replies
    1. Yes this will work for the above example. I've done tricks like this when performance is key. I chose the example technique as it is generic and it would be easy to add a 3rd dimension, which would not be possible with your technique.

      It could be even faster by avoiding the indirection to the AtomicLong and just using Unsafe :-)

  5. Hi Martin.

    The test requirement is so simple I wonder if the lock-free implementation could be done with Unsafe. You would have a set of already available X and Y positions and basically on each move you would (try to) increment the sequence within that array. Kinda like the Disruptor (wink wink). The main benefit of this approach would be to show the *effects* of false sharing from multiple threads with an Unsafe implementation as opposed to the AtomicReference one. The introduction of garbage would also be minimal.

    Would this be a good scenario?

    Regards

    Replies
    1. Yes it can be done with Unsafe which will be better for real world apps. However in the write case it can lower throughput because of the tight loop in this microbenchmark.

    2. Sorry misread your comment. The above response is in relation to replacing the AtomicReference with Unsafe directly for a Position ref in the spaceship.

      There should be no false sharing in this example from the fields because they are updated together. However it may be worth padding them from other objects that may be located nearby and thus cause false sharing.
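
      A sketch of what such padding might look like pre-Java 8 (no @Contended yet), assuming ~64-byte cache lines; note the JVM is free to reorder fields and may eliminate ones it considers unused, so this is best-effort:

      public class PaddedPositionHolder
      {
          long p1, p2, p3, p4, p5, p6, p7; // padding before the hot field
          volatile Object position;        // the hot reference being protected
          long q1, q2, q3, q4, q5, q6, q7; // padding after the hot field
      }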

  6. I've recently discovered your blog and I just want to say thank you for sharing your deep insight on concurrency. While reading this post I remembered that you wrote about a lock-free Executor you implemented a while ago, but you never actually explained how it was done. Wouldn't it be a great example of a lock-free structure?

    Replies
    1. This is one of the data structures I teach on my lock-free algorithms course.

      http://www.real-logic.co.uk/training.html

  7. I changed input parameters for your tests and results are not quite the same. My thinking is that 1 or 2 threads are hardly providing enough contention for the hardware you have.

    I was running your tests with 200 readers and 200 writers. I also changed TEST_DURATION_MS to 50 seconds.
    What I noticed is (however these results may only be specific to my old laptop):
    1. The Lock Free test is running ~15-20% longer than the other tests and this may contribute to the "exceptional" performance of the Lock Free implementation.
    I saw that the Lock Free test is actually running for 60-65 seconds instead of the allotted 50.
    2. Factoring in #1, I can say that with 200 readers/200 writers the read performance of the StampedLock is on par with the Lock Free implementation, if not better.

    I am curious if you would see similar results if you change these input parameters and maybe even plot new results on the new set of charts.

    Thanks
    -Nick

    Replies
    1. Can you expand on what you mean by "change input parameters"? Is this number of readers and writers?

      On any system you can only run the number of threads that matches the number of cores. So you'd need a pretty amazing box to run 400 threads concurrently. In this case you would be measuring OS scheduling and not the concurrency.

      Also note this is a micro benchmark and that in a system with 400 threads you'd be doing other things on those threads and the cost of a lock becomes significant in the instruction stream by comparison.

    2. Yes, I meant changing the number of readers and writers and the amount of time the tests run. My point #2 is that the tests you are running may be skewed in favor of the lock free implementation just because it runs longer (unless this turns out to be specific to my environment only).

      With a practical perspective in mind you would want to measure the impact of different locking/lock-free implementations applied to your production system, and it seems that under load your test results do not stand. Otherwise what's the value of the microbenchmark if it leads to a wrong conclusion?

    3. The ultimate answer to all is "measure for your own environment with realistic data". :-)

      Micro benchmarks do have their advantages when you are reasonably confident you are measuring what you expect. For example, running 400 threads on a standard laptop will realistically only be running 2 threads concurrently at any one time. Therefore the benchmark is dominated by the cost of thread scheduling and not managing concurrent access. Think how unrealistic that is for isolating the contention issues in real applications. Microbenchmarks can easily become "dirty experiments" whereby the meaningful information gets lost in the noise.

      I like to use microbenchmarks to get an idea of some extreme cases then also run component and system level benchmarks for better coverage.

    4. I guess you and I have different goals in benchmarking.

  8. Great article, it's nice to see the code for all of the synchronization mechanisms in one place. It would be nice if you added javadoc to your Spaceship interface. It took me a while to figure out that the int being returned was the number of attempts.

  9. Hi!

    Thanks for the blog post and all the others you've written over time. They are very informative.

    Unfortunately you just tripped my "why the hell do we call these algorithms lock free?" wire. Yes, I realize that part of the answer is "because it is the convention", but I still need to rant, sorry.

    There is no such thing as a "free lunch" and these algorithms are not "lock free". It's just that we moved the locks to a different level. What I mean:

    - Classic locks (synch / explicit locks): they call the kernel which either obtains a lock or puts the thread in a wait state. The funny thing is that the kernel level locks actually use the atomic CAS operations provided by the CPU to implement locks :-)

    - "Lock free" / optimistic concurrency: use the same CAS operations without entering the kernel.

    Going even further down to the level of the CPU hardware, CAS operations do generate locks (that's what the LOCK prefix in LOCK CMPXCHG stands for :-)). In the first multicore x86 implementations there was an explicit LOCK signal which was raised; now we have more sophisticated mechanisms through the MESI protocol, but there are still locks in the sense of "some electrical signal which is asserted by only one party at a time and prevents others from doing an action".

    Trying to sum up: "optimistic concurrency" is a better name than "lock free". There are always locks when we need to establish a causal chain of events (A happened before B). Synchronized uses the thread scheduler (which in turn uses CAS) to wait in an efficient manner (as in: don't burn CPU). Using just CAS might give you lower latency (unless you have 200 threads on 4 cores :-)).

    As always, it depends :-).

    Replies
    1. Naming is always a fun challenge. The term is not mine to be clear :-) It is an industry convention as you point out. There are also wait-free algorithms as a sub-set of lock-free that add fun to the mix.

      You are right in that many algorithms take optimistic strategies. However some are lock-free concurrent algorithms that do not take an optimistic approach.

      Many lock-free algorithms do take an optimistic strategy and I often use the analogy of source code control systems. We all have mostly moved away from pessimistic locking strategies because of how they impede progress.

      I feel you are right that it is an area needing more exploration and maybe better naming can result, as well as other discoveries.

      BTW the Disruptor can support a causal chain of events without requiring locks, check out various examples in the performance tests. :-)

    2. I think you are partly arguing semantics here, but there is a simple and concrete test you can apply to determine if something is lock free as the term is commonly used:

      If some thread participating in the algorithm (e.g., a thread accessing a concurrent structure) is suspended indefinitely, will any or all other threads participating possibly block indefinitely?

      If yes, you could consider that your algorithm has a "lock" (even if you didn't explicitly use one off the shelf). Algorithms that pass this test could be considered lock free, and also generally (hopefully?) have the performance characteristics of lock free algorithms (but you can certainly come up with exceptions).

      What the hardware does, or how Intel chooses to name their assembly mnemonics isn't really part of this software definition, and in practice the operations don't violate the definitions above in any case (any instruction will complete in some bounded amount of time - there isn't really exclusive locking going on, except perhaps in more or less unobservable ways).

  10. Hi Martin,
    Thanks for posting such a nice topic. I have one question: is it possible to apply a lock-free algorithm to operations that are inherently sequential, e.g. generating a sequence number in Java? If yes, can you give hints on how it can be done?

    Replies
    1. AtomicLong.getAndIncrement() will generate you a sequence number that can be used in a concurrent algorithm.
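
      For what it is worth, in this era of the JDK getAndIncrement() is itself a lock-free CAS retry loop; a hand-rolled equivalent looks roughly like this sketch:

      import java.util.concurrent.atomic.AtomicLong;

      public class SequenceGenerator
      {
          private final AtomicLong counter = new AtomicLong(0);

          public long next()
          {
              long current;
              long next;
              do
              {
                  current = counter.get();
                  next = current + 1;
              }
              while (!counter.compareAndSet(current, next)); // retry on contention

              return next;
          }
      }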

  11. Martin - thanks for another great and in-depth article, as usual.

    For what it's worth, RRWL has a poor showing here because the benchmark is pretty much a worst case scenario for that lock, even under heavy read loads, because the critical section in your benchmark is so short (nanoseconds).

    Internally, both the RRWL lock() and unlock() operations are themselves getting a lock (a Sync object), manipulating the internal state of the lock (e.g., incrementing the number of readers, or setting the lock to "writer in progress"), then unlocking the internal state. So in fact a lock()/unlock() pair for RRWL involves four atomic operations and two (hidden) exclusive critical sections, while a normal synchronized block or ReentrantLock involves two and one respectively - with the difference that the exclusive critical section extends from the lock() to the unlock(), while in the RRWL case the exclusive sections are very short and only exist inside the lock() and unlock() implementations.

    When the section of code protected by the lock is reasonably long (say 1 us or longer), the benefit of allowing multiple readers into the protected code is great and will favor RRWL. However, when the protected section takes negligible time, as in the benchmark, the internal locks obtained by the RRWL will dominate the execution time and greatly increase contention, making the RRWL perform very poorly. In effect, it will generally behave at least twice as poorly as a simple lock (the writer preference built in to the lock adds another twist and may result in additional context switching or lock convoys).

    I think if the experiment was repeated with a longer critical section (e.g., if the protected data was a 10k element array), then RRWL would prove more useful.

    It is possible to write a "lock free" RW lock, where "lock free" means that internally the lock() and unlock() methods don't take any locks while manipulating state - they just CAS some shared state. This means that with readers only flowing in and out of the lock, there are no exclusive sections at all, and the only cost is two atomic operations per reader (one in, one out), plus the occasional retry. Implementing re-entrancy is possible and doesn't change the core performance, but requires some more book keeping.

    Final bit of trivia: the way RRWL works, as described above, is responsible for how they show up (and fail to show) in jstack and other tools or APIs that report held/waiting locks. These tools won't show that any thread is inside a RRWL (holds the lock) as either a reader or a writer, because there is no JVM-understood lock held at that point (which are object monitors or things that extend AQS or things like that). The tools will show threads that are *waiting* to enter the lock (and also including, if the timing is exactly right, threads that aren't waiting and won't wait, but are in the middle of the lock() or unlock() methods) because at that point they are blocking on the AQS Sync object internal to the lock. So for RRWL you'll always see zero or more threads waiting for the lock, but even if some are waiting, you'll almost never see any thread owning the Sync object, unless you catch them at the exact moment they are passing through the critical section of the lock() or unlock() methods - which might not even be possible for jstack if there is no safepoint in that tiny critical section.

    ReentrantLock has no such issue since its Sync object (derived from AQS) is held for the duration of the lock, so it shows up properly in jstack.
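
    A minimal sketch of such a read-write lock, where entering and leaving is a single CAS on one word of state (spin-only, non-reentrant, no fairness or queueing - purely illustrative; and note that by the test in an earlier comment it is not strictly lock free, since a suspended reader still blocks writers):

    import java.util.concurrent.atomic.AtomicInteger;

    public class CasReadWriteLockSketch
    {
        // -1 = writer holds the lock, 0 = free, > 0 = count of active readers
        private final AtomicInteger state = new AtomicInteger(0);

        public void readLock()
        {
            int current;
            do
            {
                current = state.get();
            }
            while (current < 0 || !state.compareAndSet(current, current + 1));
        }   // one atomic op on the fast path; readers flow in concurrently

        public void readUnlock()
        {
            state.decrementAndGet(); // one atomic op out, no exclusive section
        }

        public void writeLock()
        {
            while (!state.compareAndSet(0, -1))
            {
                // spin until no readers or writer remain
            }
        }

        public void writeUnlock()
        {
            state.set(0);
        }
    }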

    Replies
    1. I think it's a good point about the critical section being too small for the lock overhead involved with RRWL and ReentrantLock. Good catch!

  12. Thank you for sharing this well written article.

    While reading, I was just wondering how the Memory Channel Storage (MCS) architecture [1] will affect concurrency programming in a couple of years.

    If you have a near constant storage latency of ~5 to 10ms, how would you program your limited CPU cache then?

    MCS is still in its infancy, but considering how it already simplifies system design there is a high chance that it might take off within 2 or 3 years. One generation further, MCS over DDR4 may reach 2 or 3 GB/s read/write speeds, which eventually minimizes the storage performance penalty.

    Getting the most out of such a "flat" memory/storage hierarchy could require different thinking about concurrency. Assuming a very low cache miss rate and fast fetching from MCS storage, what would you pay the most attention to?

    Thread contention?

    What would you say?

    Thank you
    [1]
    http://electronicdesign.com/memory/memory-channel-storage-puts-ssd-next-cpu
    http://www.diablo-technologies.com/files/AMPSMCSInfoSheet-HQ%20ReExport.pdf
    http://www.storagesearch.com/ibm-jim-jan2014.html

    Replies
    1. I don't think your comment is related to this blog but it is an interesting question.

      I don't see how this storage is related to contention. Most contention occurs in the cache subsystem. Why do you think this is flat memory? The cache subsystem would be fronting it.

      Storage latency for non-volatile memory would be more like 5-10us and not ms.

      The big benefit I see for MCS is avoiding system calls, plus it is a nice low-energy medium for storage with a high read-to-write ratio.

  13. I just ran the test code, which shows different results from yours.
    Environment: Intel(R) Core(TM) i5-2400 CPU @ 3.10GHz - Linux 3.8.0 - Java 1.7.0_21 x64

    3 Readers 3 Writers    Read           Write
    StampedLock            327,258,291    15,279,037
    LockFree                47,777,941    12,028,039

    Replies
    1. Why run 6 threads on a processor only capable of running 4 threads? You are testing scheduling effects and not contention.

    2. I reran the test case; the read performance is close, but on write performance StampedLock is better than LockFree.

      2 Reader 2 Writer Read Write
      StampedLock 42,181,836 25,371,810
      LockFree 44,516,359 10,738,735

    3. Interesting results. Do you get much variance across runs? I assume the figures you are posting are the read and write totals?

    4. The figures are Reads/sec and Writes/sec; below are the total results:

      Run - 0
      StampedLockSpaceship reads=215,881,561:[107161095, 108720466] moves=166,261,245:[93211515, 73049730] readAttempts=[107328273, 108887073] moveAttempts=[93211515, 73049730] observedMoves=[167711, 167489]
      LockFreeSpaceship reads=185,502,786:[85307289, 100195497] moves=52,613,165:[24484000, 28129165] readAttempts=[85307289, 100195497] moveAttempts=[40374087, 42532786] observedMoves=[45249302, 45769796]
      Run - 1
      StampedLockSpaceship reads=270,011,565:[134469529, 135542036] moves=117,449,987:[54500150, 62949837] readAttempts=[134680475, 135752427] moveAttempts=[54500150, 62949837] observedMoves=[211037, 210487]
      LockFreeSpaceship reads=180,331,197:[75204646, 105126551] moves=52,410,648:[35206356, 17204292] readAttempts=[75204646, 105126551] moveAttempts=[46054047, 37089113] observedMoves=[42474165, 44015382]
      Run - 2
      StampedLockSpaceship reads=210,581,725:[104538029, 106043696] moves=114,806,389:[58274420, 56531969] readAttempts=[104753400, 106263724] moveAttempts=[58274420, 56531969] observedMoves=[215483, 220138]
      LockFreeSpaceship reads=225,236,980:[111453183, 113783797] moves=52,637,522:[30403385, 22234137] readAttempts=[111453183, 113783797] moveAttempts=[43617230, 38599054] observedMoves=[46078175, 44275854]
      Run - 3
      StampedLockSpaceship reads=134,218,004:[67653741, 66564263] moves=118,828,323:[56238740, 62589583] readAttempts=[67862395, 66773772] moveAttempts=[56238740, 62589583] observedMoves=[208750, 209609]
      LockFreeSpaceship reads=212,648,989:[107300759, 105348230] moves=51,970,787:[21615285, 30355502] readAttempts=[107300759, 105348230] moveAttempts=[39257122, 44540591] observedMoves=[43958203, 45108953]
      Run - 4
      StampedLockSpaceship reads=223,853,034:[114072244, 109780790] moves=116,949,295:[62102466, 54846829] readAttempts=[114328416, 110036350] moveAttempts=[62102466, 54846829] observedMoves=[256342, 255768]
      LockFreeSpaceship reads=309,189,018:[154445048, 154743970] moves=58,836,245:[28468033, 30368212] readAttempts=[154445048, 154743970] moveAttempts=[44936374, 46655735] observedMoves=[52178988, 52606562]

    5. Have you changed some of the parameters? I don't see how your overall figures are so much higher given the clock speed difference.

    6. I did not change any parameters; the test duration is still 5000ms. Do you mean the CPU clock speed? The only difference is that my CPU frequency scaling mode is set to performance; I am not sure if it impacted the test result.

    7. I've tried 3 different processor architectures and cannot get close to those figures you have for StampedLock, especially on the move side. This is a good illustration of how you need to measure and then choose the most suitable implementation for your target platform.

  14. Really interesting talk at QCon!
    But I found your statement about static helpers a bit controversial. Don't you think that behaviour should be as close as possible to the data it operates on? Static helpers make that harder to achieve; they seem to me a bit procedural.
    Sure, it all depends on the use case. I think there are situations when it is very appropriate to use static function helpers - like matchers and assertions in unit tests.

    Replies
    1. Thanks for the feedback.

      There is no one size fits all. Sometimes bring data to behaviour, sometimes combine data and behaviour, sometimes bring behaviour to the data.

      The point I was making is sometimes one approach is more suitable than another and it is best to have design approach options in your experience kitbag. As you point out, matchers can be good examples of pure functions.

      Feel free to discuss this further on the mechanical sympathy discussion group.

  15. In the LockFree implementation, a new object is created for each write, but the efficiency measurement doesn't count the GC time. So, kinda unfair.

  16. Hi Martin,

    Checking the code on GitHub, I saw you had an Unsafe spaceship implementation that you later removed...
    I was wondering what the performance results for it were and why it got removed.

    Cheers,
    Nikolay

    Replies
    1. The results were very similar to the lock free spaceship version. I tried to see if Unsafe gives any significant advantage in this case and it did not.

  17. Another version, AtomicBufferSpaceship, which uses the Agrona library (UnsafeBuffer) to implement Spaceship. The performance figures are very similar to LockAndGcFreeSpaceship.

    import java.nio.ByteBuffer;

    import uk.co.real_logic.agrona.concurrent.AtomicBuffer;
    import uk.co.real_logic.agrona.concurrent.UnsafeBuffer;

    public class AtomicBufferSpaceship implements Spaceship
    {
        private final AtomicBuffer buffer = new UnsafeBuffer(ByteBuffer.allocateDirect(8));

        @Override
        public int readPosition(int[] coordinates)
        {
            long pos = buffer.getLong(0);
            coordinates[0] = (int)(pos & 0xFFFFFFFFL);
            coordinates[1] = (int)(pos >>> 32);
            return 1;
        }

        @Override
        public int move(int xDelta, int yDelta)
        {
            int tries = 0;
            long pos, newPos;
            do
            {
                ++tries;
                pos = buffer.getLong(0);
                int x = (int)(pos & 0xFFFFFFFFL) + xDelta;
                int y = (int)(pos >>> 32) + yDelta;
                newPos = (((long)x) & 0xFFFFFFFFL) | (((long)y) << 32);
            }
            while (!buffer.compareAndSetLong(0, pos, newPos));

            return tries;
        }
    }

  18. Could you clarify what you mean by "word packing" and "cycling memory structures"?

    Does word packing refer to the technique used in the garbage and lock free implementation in PeLe's comment, which packs two integers into one AtomicLong? Are there approaches to make this work for more than two ints and without using Unsafe?

    With cycling memory structures, do you mean object pooling? What I mean by that is manually managing the life cycle of the position objects: instead of creating a new object, get it from a pool of objects, and instead of making it eligible for GC, return it to the pool. This in itself may be implemented with a data structure requiring locks, which would defeat the whole purpose of avoiding locks. However, an object pool can also be implemented on top of a lock free data structure such as a ring buffer.
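
    For illustration, a pool along those lines might look like the following sketch, using the JDK's lock-free ConcurrentLinkedQueue as the free list (Position here is a hypothetical mutable holder; a preallocated ring buffer would avoid the queue's own node allocation, and recycling is only safe once no reader can still be holding a reference to a released object):

    import java.util.concurrent.ConcurrentLinkedQueue;

    public class PositionPool
    {
        public static class Position
        {
            public int x;
            public int y;
        }

        private final ConcurrentLinkedQueue<Position> freeList =
            new ConcurrentLinkedQueue<Position>();

        public PositionPool(final int size)
        {
            for (int i = 0; i < size; i++)
            {
                freeList.offer(new Position());
            }
        }

        public Position acquire()
        {
            final Position p = freeList.poll();
            return null != p ? p : new Position(); // fall back to allocating when drained
        }

        public void release(final Position p)
        {
            freeList.offer(p); // only call once no reader still holds p
        }
    }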
