Mechanical Sympathy: Simple Binary Encoding

Monday, 5 May 2014

Simple Binary Encoding

Financial systems communicate by sending and receiving vast numbers of messages in many different formats. When people use terms like "vast" I normally think, "really... how many?" So let's quantify "vast" for the finance industry. Market data feeds from financial exchanges can typically emit tens or hundreds of thousands of messages per second, and aggregate feeds like OPRA can peak at over 10 million messages per second, with volumes growing year-on-year. This presentation gives a good overview.

In this crazy world we still see significant use of ASCII encoded presentations, such as FIX tag value, and some slightly more sane binary encoded presentations like FAST. Some markets even commit the sin of sending out market data as XML! Well, I cannot complain too much as they have at times provided me a good income writing ultra fast XML parsers.

Last year the CME, who are a member of the FIX community, commissioned Todd Montgomery, of 29West LBM fame, and myself to build the reference implementation of the new FIX Simple Binary Encoding (SBE) standard. SBE is a codec aimed at addressing the efficiency issues in low-latency trading, with a specific focus on market data. The CME, working within the FIX community, have done a great job of coming up with an encoding presentation that can be so efficient. Maybe a suitable atonement for the sins of past FIX tag value implementations. Todd and I worked on the Java and C++ implementations, and later we were helped on the .NET side by the amazing Olivier Deheurles at Adaptive. Working on a cool technical problem with such a team is a dream job.

SBE Overview

SBE is an OSI layer 6 presentation for encoding/decoding messages in binary format to support low-latency applications. Of the many applications I profile with performance issues, message encoding/decoding is often the most significant cost. I've seen many applications that spend significantly more CPU time parsing and transforming XML and JSON than executing business logic. SBE is designed to make this part of a system the most efficient it can be. SBE follows a number of design principles to achieve this goal. Adhering to these design principles sometimes means that features available in other codecs will not be offered. For example, many codecs allow strings to be encoded at any field position in a message; SBE only allows variable length fields, such as strings, as fields grouped at the end of a message.

The SBE reference implementation consists of a compiler that takes a message schema as input and then generates language specific stubs. The stubs are used to directly encode and decode messages from buffers. The SBE tool can also generate a binary representation of the schema that can be used for the on-the-fly decoding of messages in a dynamic environment, such as for a log viewer or network sniffer.

The design principles drive the implementation of a codec that ensures messages are streamed through memory without backtracking, copying, or unnecessary allocation. Memory access patterns should not be underestimated in the design of a high-performance application. Low-latency systems in any language especially need to consider all allocation to avoid the resulting issues in reclamation. This applies for both managed runtime and native languages. SBE is totally allocation free in all three language implementations.

The end result of applying these design principles is a codec that has ~16-25 times greater throughput than Google Protocol Buffers (GPB) with very low and predictable latency. This has been observed in micro-benchmarks and real-world application use. A typical market data message can be encoded, or decoded, in ~25ns compared to ~1000ns for the same message with GPB on the same hardware. XML and FIX tag value messages are orders of magnitude slower again.

The sweet spot for SBE is as a codec for structured data that is mostly fixed size fields which are numbers, bitsets, enums, and arrays. While it does work for strings and blobs, many may find some of the restrictions a usability issue. These users would be better off with another codec more suited to string encoding.

Message Structure

A message must be capable of being read or written sequentially to preserve the streaming access design principle, i.e. with no need to backtrack. Some codecs insert location pointers for variable length fields, such as string types, that have to be indirected for access. This indirection comes at a cost of extra instructions plus losing the support of the hardware prefetchers. SBE's design allows for pure sequential access and copy-free native access semantics.

Figure 1

SBE messages have a common header that identifies the type and version of the message body to follow. The header is followed by the root fields of the message which are all fixed length with static offsets. The root fields are very similar to a struct in C. If the message is more complex then one or more repeating groups similar to the root block can follow. Repeating groups can nest other repeating group structures. Finally, variable length strings and blobs come at the end of the message. Fields may also be optional. The XML schema describing the SBE presentation can be found here.
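To make the layout concrete, here is a minimal sketch that writes and reads a message of that shape by hand using only java.nio.ByteBuffer. It is not the generated stub code: the class name, the field names, the template and schema ids, and the 3 byte group header are all assumptions made for illustration. The point it tries to show is that the decoder only ever moves its offset forwards.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical hand-rolled layout for illustration; not the generated SBE stubs.
final class SequentialLayoutSketch
{
    static final int HEADER_LENGTH = 8;       // blockLength, templateId, schemaId, version: 4 x uint16
    static final int ROOT_BLOCK_LENGTH = 10;  // serialNumber (int64) + modelYear (uint16)
    static final int GROUP_HEADER_LENGTH = 3; // entry length (uint16) + entry count (uint8), assumed
    static final int SPEED_ENTRY_LENGTH = 4;  // one int32 per repeating group entry

    /** Writes header, root block, one repeating group, then one variable length field. */
    static int encode(
        final ByteBuffer buffer, final long serialNumber, final int modelYear,
        final int[] speeds, final String make)
    {
        buffer.order(ByteOrder.LITTLE_ENDIAN);
        int offset = 0;

        buffer.putShort(offset, (short)ROOT_BLOCK_LENGTH);
        buffer.putShort(offset + 2, (short)1); // templateId (made up)
        buffer.putShort(offset + 4, (short)1); // schemaId (made up)
        buffer.putShort(offset + 6, (short)0); // version
        offset += HEADER_LENGTH;

        buffer.putLong(offset, serialNumber);          // static offset 0 in the root block
        buffer.putShort(offset + 8, (short)modelYear); // static offset 8 in the root block
        offset += ROOT_BLOCK_LENGTH;

        buffer.putShort(offset, (short)SPEED_ENTRY_LENGTH);
        buffer.put(offset + 2, (byte)speeds.length);
        offset += GROUP_HEADER_LENGTH;
        for (final int speed : speeds)
        {
            buffer.putInt(offset, speed);
            offset += SPEED_ENTRY_LENGTH;
        }

        final byte[] makeBytes = make.getBytes(StandardCharsets.US_ASCII);
        buffer.put(offset++, (byte)makeBytes.length);  // uint8 length prefix
        for (final byte b : makeBytes)
        {
            buffer.put(offset++, b);
        }

        return offset; // total encoded length in bytes
    }

    /** Reads the same message front to back, only ever moving the offset forwards. */
    static String decode(final ByteBuffer buffer)
    {
        buffer.order(ByteOrder.LITTLE_ENDIAN);
        int offset = 0;

        final int blockLength = buffer.getShort(offset) & 0xFFFF;
        final int templateId = buffer.getShort(offset + 2) & 0xFFFF;
        offset += HEADER_LENGTH;

        final long serialNumber = buffer.getLong(offset);
        final int modelYear = buffer.getShort(offset + 8) & 0xFFFF;
        offset += blockLength; // skip by the length carried in the header

        final int entryLength = buffer.getShort(offset) & 0xFFFF;
        final int entryCount = buffer.get(offset + 2) & 0xFF;
        offset += GROUP_HEADER_LENGTH;
        final StringBuilder speeds = new StringBuilder();
        for (int i = 0; i < entryCount; i++)
        {
            speeds.append(i == 0 ? "" : ",").append(buffer.getInt(offset));
            offset += entryLength;
        }

        final int makeLength = buffer.get(offset++) & 0xFF;
        final byte[] makeBytes = new byte[makeLength];
        for (int i = 0; i < makeLength; i++)
        {
            makeBytes[i] = buffer.get(offset++);
        }

        return "templateId=" + templateId + " serialNumber=" + serialNumber +
            " modelYear=" + modelYear + " speeds=" + speeds +
            " make=" + new String(makeBytes, StandardCharsets.US_ASCII);
    }
}
```

Note how the variable length field is the only part whose position cannot be computed until everything before it has been walked, which is why this sketch, like the layout described above, puts it last.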

SbeTool and the Compiler

To use SBE it is first necessary to define a schema for your messages. SBE provides a language independent type system supporting integers, floating point numbers, characters, arrays, constants, enums, bitsets, composites, grouped structures that repeat, and variable length strings and blobs.

A message schema can be input into the SbeTool and compiled to produce stubs in a range of languages, or to generate binary metadata suitable for decoding messages on-the-fly.

    java [-Doption=value] -jar sbe.jar <message-declarations-file.xml>

SbeTool and the compiler are written in Java. The tool can currently output stubs in Java, C++, and C#.

Programming with Stubs

A full example of messages defined in a schema with supporting code can be found here. The generated stubs follow a flyweight pattern with instances reused to avoid allocation. The stubs wrap a buffer at an offset and then read it sequentially and natively.

    // Write the message header first
    MESSAGE_HEADER.wrap(directBuffer, bufferOffset, messageTemplateVersion)
                  .blockLength(CAR.sbeBlockLength())
                  .templateId(CAR.sbeTemplateId())
                  .schemaId(CAR.sbeSchemaId())
                  .version(CAR.sbeSchemaVersion());

    // Then write the body of the message; bufferOffset is assumed to have
    // been advanced past the header before this call
    car.wrapForEncode(directBuffer, bufferOffset)
       .serialNumber(1234)
       .modelYear(2013)
       .available(BooleanType.TRUE)
       .code(Model.A)
       .putVehicleCode(VEHICLE_CODE, srcOffset);

Messages can be written via the generated stubs in a fluent manner. Each field appears as a generated pair of methods to encode and decode.

    // Read the header and lookup the appropriate template to decode
    MESSAGE_HEADER.wrap(directBuffer, bufferOffset, messageTemplateVersion);

    final int templateId = MESSAGE_HEADER.templateId();
    final int actingBlockLength = MESSAGE_HEADER.blockLength();
    final int schemaId = MESSAGE_HEADER.schemaId();
    final int actingVersion = MESSAGE_HEADER.version();

    // Once the template is located then the fields can be decoded; bufferOffset
    // is assumed to have been advanced past the header before this call
    car.wrapForDecode(directBuffer, bufferOffset, actingBlockLength, actingVersion);

    final StringBuilder sb = new StringBuilder();
    sb.append("\ncar.templateId=").append(car.sbeTemplateId());
    sb.append("\ncar.schemaId=").append(schemaId);
    sb.append("\ncar.schemaVersion=").append(car.sbeSchemaVersion());
    sb.append("\ncar.serialNumber=").append(car.serialNumber());
    sb.append("\ncar.modelYear=").append(car.modelYear());
    sb.append("\ncar.available=").append(car.available());
    sb.append("\ncar.code=").append(car.code());

The generated code in all languages gives performance similar to casting a C struct over the memory.
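For readers who have not met the flyweight pattern, the following is a rough hand-written sketch of the idea using java.nio.ByteBuffer. The class, offsets, and field names are hypothetical and much simpler than what SbeTool generates; the sketch only tries to show one reusable object being moved over a buffer, with each accessor reading or writing at a fixed offset.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical flyweight written by hand; the real stubs are generated by SbeTool.
final class CarFlyweight
{
    static final int BLOCK_LENGTH = 10;                // int64 + uint16
    private static final int SERIAL_NUMBER_OFFSET = 0; // int64
    private static final int MODEL_YEAR_OFFSET = 8;    // uint16

    private ByteBuffer buffer;
    private int offset;

    /** Points this instance at a new location; no objects are allocated. */
    CarFlyweight wrap(final ByteBuffer buffer, final int offset)
    {
        // Note: order() also changes the byte order seen by the caller's buffer.
        this.buffer = buffer.order(ByteOrder.LITTLE_ENDIAN);
        this.offset = offset;
        return this;
    }

    long serialNumber()
    {
        return buffer.getLong(offset + SERIAL_NUMBER_OFFSET);
    }

    CarFlyweight serialNumber(final long value)
    {
        buffer.putLong(offset + SERIAL_NUMBER_OFFSET, value);
        return this;
    }

    int modelYear()
    {
        return buffer.getShort(offset + MODEL_YEAR_OFFSET) & 0xFFFF;
    }

    CarFlyweight modelYear(final int value)
    {
        buffer.putShort(offset + MODEL_YEAR_OFFSET, (short)value);
        return this;
    }

    public static void main(final String[] args)
    {
        final ByteBuffer buffer = ByteBuffer.allocate(64);
        final CarFlyweight car = new CarFlyweight();

        // The same instance is reused for two consecutive messages in the buffer.
        car.wrap(buffer, 0).serialNumber(1234L).modelYear(2013);
        car.wrap(buffer, BLOCK_LENGTH).serialNumber(5678L).modelYear(2014);

        car.wrap(buffer, 0);
        System.out.println(car.serialNumber() + " " + car.modelYear());
    }
}
```

Because the accessors are tiny and operate at constant offsets from a base, a JIT compiler has a good chance of inlining them down to plain loads and stores, which is the struct-like behaviour described above.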

On-The-Fly Decoding

The compiler produces an intermediate representation (IR) for the input XML message schema. This IR can be serialised in the SBE binary format to be used for later on-the-fly decoding of messages that have been stored. It is also useful for tools, such as a network sniffer, that will not have been compiled with the stubs. A full example of the IR being used can be found here.
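As a rough illustration of the idea, the sketch below decodes a block using a list of tokens supplied at runtime rather than compiled stubs. The Token and PrimitiveType classes here are invented for the example and are far simpler than the IR in the reference implementation; they only assume that each token carries a field name, an offset, and a primitive type.

```java
import java.nio.ByteBuffer;
import java.util.List;

// Illustrative only: Token and PrimitiveType are invented for this sketch and
// are not the IR classes from the SBE reference implementation.
final class OtfDecodeSketch
{
    enum PrimitiveType
    {
        UINT16
        {
            long read(final ByteBuffer buffer, final int index)
            {
                return buffer.getShort(index) & 0xFFFF;
            }
        },
        INT32
        {
            long read(final ByteBuffer buffer, final int index)
            {
                return buffer.getInt(index);
            }
        },
        INT64
        {
            long read(final ByteBuffer buffer, final int index)
            {
                return buffer.getLong(index);
            }
        };

        abstract long read(ByteBuffer buffer, int index);
    }

    static final class Token
    {
        final String name;
        final int offset;
        final PrimitiveType type;

        Token(final String name, final int offset, final PrimitiveType type)
        {
            this.name = name;
            this.offset = offset;
            this.type = type;
        }
    }

    /** Walks the tokens in order and renders each field found in the block. */
    static String decode(final List<Token> tokens, final ByteBuffer buffer, final int blockOffset)
    {
        final StringBuilder sb = new StringBuilder();
        for (final Token token : tokens)
        {
            final long value = token.type.read(buffer, blockOffset + token.offset);
            sb.append(token.name).append('=').append(value).append('\n');
        }

        return sb.toString();
    }
}
```

A tool such as a log viewer could load a token list like this from a file at startup and then print messages for schemas it was never compiled against. The buffer's byte order is left to the caller in this sketch.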

Direct Buffers

SBE, via Agrona, provides an abstraction to Java, with the MutableDirectBuffer class, to work with buffers that are byte[], heap or direct ByteBuffer buffers, and off heap memory addresses returned from Unsafe.allocateMemory(long) or JNI. In low-latency applications, messages are often encoded/decoded in memory mapped files via MappedByteBuffer and can thus be transferred to a network channel by the kernel, avoiding user space copies.
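The value of such an abstraction is that the codec does not need to care where the bytes live. The sketch below shows the same idea with plain java.nio.ByteBuffer standing in for the Agrona class, which is an assumption made so the example has no dependencies; the method and field names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Plain ByteBuffer is used here as a stand-in for a buffer abstraction.
final class BufferKinds
{
    static final int QUOTE_LENGTH = 12; // price (int64) + size (int32)

    /** Writes a quote at the given offset; works for heap and direct buffers alike. */
    static int writeQuote(final ByteBuffer buffer, final int offset, final long price, final int size)
    {
        buffer.order(ByteOrder.LITTLE_ENDIAN);
        buffer.putLong(offset, price);
        buffer.putInt(offset + 8, size);
        return QUOTE_LENGTH;
    }

    static long readPrice(final ByteBuffer buffer, final int offset)
    {
        return buffer.order(ByteOrder.LITTLE_ENDIAN).getLong(offset);
    }

    public static void main(final String[] args)
    {
        final byte[] backing = new byte[32];
        final ByteBuffer heap = ByteBuffer.wrap(backing);        // on-heap, backed by byte[]
        final ByteBuffer direct = ByteBuffer.allocateDirect(32); // off-heap

        writeQuote(heap, 0, 1234L, 100);
        writeQuote(direct, 0, 1234L, 100);

        System.out.println(readPrice(heap, 0) + " " + readPrice(direct, 0));
    }
}
```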

C++ and C# have built-in support for direct memory access and do not require such an abstraction as the Java version does. A DirectBuffer abstraction was added for C# to support endianness and encapsulate the unsafe pointer access.

Message Extension and Versioning

SBE schemas carry a version number that allows for message extension. A message can be extended by adding fields at the end of a block. Fields cannot be removed or reordered for backwards compatibility.

Extension fields must be optional, otherwise a newer template reading an older message would not work. Templates carry metadata for min, max, null, timeunit, character encoding, etc. These are accessible via static (class level) methods on the stubs.
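A small sketch may help show why the extension fields need to be optional. The decoder below is hand-written and hypothetical: the discount field, its since-version, and the null sentinel of 65535 are assumptions for illustration. It uses the acting version and acting block length read from the header, as in the decoding example earlier, to decide whether a newer field is present in the message being read.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical decoder written by hand; field names and null value are assumptions.
final class VersionedCarDecoder
{
    static final int MODEL_YEAR_OFFSET = 0;       // uint16, present since version 0
    static final int DISCOUNT_OFFSET = 2;         // uint16, appended to the block in version 1
    static final int DISCOUNT_SINCE_VERSION = 1;
    static final int DISCOUNT_NULL_VALUE = 65535; // assumed null sentinel for the optional field

    private ByteBuffer buffer;
    private int offset;
    private int actingBlockLength;
    private int actingVersion;

    VersionedCarDecoder wrapForDecode(
        final ByteBuffer buffer, final int offset, final int actingBlockLength, final int actingVersion)
    {
        this.buffer = buffer.order(ByteOrder.LITTLE_ENDIAN);
        this.offset = offset;
        this.actingBlockLength = actingBlockLength;
        this.actingVersion = actingVersion;
        return this;
    }

    int modelYear()
    {
        return buffer.getShort(offset + MODEL_YEAR_OFFSET) & 0xFFFF;
    }

    /** Returns the null value when the message was written by an older schema version. */
    int discount()
    {
        if (actingVersion < DISCOUNT_SINCE_VERSION)
        {
            return DISCOUNT_NULL_VALUE;
        }

        return buffer.getShort(offset + DISCOUNT_OFFSET) & 0xFFFF;
    }

    /** Where the data after the root block starts, based on the writer's block length. */
    int endOfBlock()
    {
        return offset + actingBlockLength;
    }
}
```

Using the writer's block length rather than a compiled-in constant to find the end of the block is also what would let an older decoder step over fields it does not know about.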

Byte Ordering and Alignment

The message schema allows for precise alignment of fields by specifying offsets. Fields are by default encoded in Little Endian form unless otherwise specified in a schema. For maximum performance native encoding with fields on word aligned boundaries should be used. The penalty for accessing non-aligned fields on some processors can be very significant. For alignment one must consider the framing protocol and buffer locations in memory.
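The two concerns in this section can be shown with a few lines of plain Java. This is only a sketch with hypothetical helper names: it prints how the same integer is laid out under each byte order, and provides helpers for checking and rounding up offsets, assuming field sizes and alignments that are powers of two.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

// Hypothetical helpers for illustration; alignment values must be powers of two.
final class EndianAndAlignment
{
    /** Encodes a 32-bit value using the given byte order. */
    static byte[] encodeInt(final int value, final ByteOrder order)
    {
        return ByteBuffer.allocate(4).order(order).putInt(0, value).array();
    }

    /** True if the offset is a multiple of the field size. */
    static boolean isAligned(final int offset, final int fieldSize)
    {
        return (offset & (fieldSize - 1)) == 0;
    }

    /** Rounds the offset up to the next multiple of the alignment. */
    static int align(final int offset, final int alignment)
    {
        return (offset + (alignment - 1)) & -alignment;
    }

    public static void main(final String[] args)
    {
        System.out.println(Arrays.toString(encodeInt(0x01020304, ByteOrder.LITTLE_ENDIAN)));
        System.out.println(Arrays.toString(encodeInt(0x01020304, ByteOrder.BIG_ENDIAN)));
        System.out.println("int64 at offset 10 aligned? " + isAligned(10, 8));
        System.out.println("next 8 byte boundary after 10: " + align(10, 8));
    }
}
```

When laying out a schema, the offsets that matter are the final addresses in memory, so the length of any framing header in front of the message has to be included in the calculation.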

Message Protocols

I often see people complain that a codec cannot support a particular presentation in a single message. However, this can often be addressed with a protocol of messages. Protocols are a great way to split an interaction into its component parts; these parts are then often composable for many interactions between systems. For example, the IR implementation of schema metadata is more complex than can be supported by the structure of a single message. We encode IR by first sending a template message providing an overview, followed by a stream of messages, each encoding the tokens from the compiler IR. This allows for the design of a very fast OTF decoder which can be implemented as a threaded interpreter with much less branching than the typical switch based state machines.

Protocol design is an area that most developers don't seem to get an opportunity to learn. I feel this is a great loss. The fact that so many developers will call an "encoding" such as ASCII a "protocol" is very telling. The value of protocols is so obvious when one gets to work with a programmer like Todd who has spent his life successfully designing protocols.

Stub Performance

The stubs provide a significant performance advantage over the dynamic OTF decoding. For accessing primitive fields we believe the performance is reaching the limits of what is possible from a general purpose tool. The generated assembly code is very similar to what a compiler will generate for accessing a C struct, even from Java!

Regarding the general performance of the stubs, we have observed that C++ has a very marginal advantage over Java, which we believe is due to runtime inserted safepoint checks. The C# version lags a little further behind due to its runtime not being as aggressive with inlining methods as the Java runtime. Stubs for all three languages are capable of encoding or decoding typical financial messages in tens of nanoseconds. This effectively makes the encoding and decoding of messages almost free for most applications relative to the rest of the application logic.

Feedback

This is the first version of SBE and we would welcome feedback. The reference implementation is constrained by the FIX community specification. It is possible to influence the specification, but please don't expect pull requests to be accepted that significantly go against the specification. Support for JavaScript, Python, Erlang, and other languages has been discussed and would be very welcome.

Update: 08-May-2014

Thanks to feedback from Kenton Varda, the creator of GPB, we were able to improve the benchmarks to get the best performance out of GPB. Below are the results for the changes to the Java benchmarks.

The C++ GPB examples on optimisation show approximately a doubling of throughput compared to initial results. It should be noted that you often have to do the opposite in Java with GPB compared to C++ to get performance improvements, such as allocate objects rather than reuse them.

Before GPB Optimisation:

     [exec] Benchmark                                         Mode Thr    Cnt  Sec         Mean   Mean error    Units
     [exec] u.c.r.protobuf.CarBenchmark.testDecode           thrpt   1     30    1      462.817        6.474   ops/ms
     [exec] u.c.r.protobuf.CarBenchmark.testEncode           thrpt   1     30    1      326.018        2.972   ops/ms
     [exec] u.c.r.protobuf.MarketDataBenchmark.testDecode    thrpt   1     30    1     1148.050       17.194   ops/ms
     [exec] u.c.r.protobuf.MarketDataBenchmark.testEncode    thrpt   1     30    1     1242.252       12.248   ops/ms

     [exec] u.c.r.sbe.CarBenchmark.testDecode                thrpt   1     30    1    10436.476      102.114   ops/ms
     [exec] u.c.r.sbe.CarBenchmark.testEncode                thrpt   1     30    1    11657.190       65.168   ops/ms
     [exec] u.c.r.sbe.MarketDataBenchmark.testDecode         thrpt   1     30    1    34078.646      261.775   ops/ms
     [exec] u.c.r.sbe.MarketDataBenchmark.testEncode         thrpt   1     30    1    29193.600      443.638   ops/ms

After GPB Optimisation:

     [exec] Benchmark                                         Mode Thr    Cnt  Sec         Mean   Mean error    Units
     [exec] u.c.r.protobuf.CarBenchmark.testDecode           thrpt   1     30    1      619.467        4.429   ops/ms
     [exec] u.c.r.protobuf.CarBenchmark.testEncode           thrpt   1     30    1      433.711       10.364   ops/ms
     [exec] u.c.r.protobuf.MarketDataBenchmark.testDecode    thrpt   1     30    1     2088.998       60.619   ops/ms
     [exec] u.c.r.protobuf.MarketDataBenchmark.testEncode    thrpt   1     30    1     1316.123       19.816   ops/ms

Throughput msg/ms - Before GPB Optimisation

    Test                  Protocol Buffers          SBE    Ratio
    Car Decode                     462.817    10436.476    22.55
    Car Encode                     326.018    11657.190    35.76
    Market Data Decode            1148.050    34078.646    29.68
    Market Data Encode            1242.252    29193.600    23.50

Throughput msg/ms - After GPB Optimisation

    Test                  Protocol Buffers          SBE    Ratio
    Car Decode                     619.467    10436.476    16.85
    Car Encode                     433.711    11657.190    26.88
    Market Data Decode            2088.998    34078.646    16.31
    Market Data Encode            1316.123    29193.600    22.18

43 comments:

Unknown, 5 May 2014 at 21:35
Martin, thank you for the article. Could you talk a bit more about this "We encode IR by first sending a template message providing an overview, followed by a stream of messages, each encoding the tokens from the compiler IR. This allows for the design of a very fast OTF decoder which can be implemented as a threaded interrupter with much less branching than the typical switch based state machines." Especially interested in the "threaded interrupter vs Switch based state machine" bit.
Unknown, 6 May 2014 at 02:06
Might this be "threaded interpreter", which is an alternative to a switch-based interpreter? I always liked Forth, and apparently it can be more CPU-cache friendly (see stuff at http://www.complang.tuwien.ac.at/projects/interpreters.html)
Martin Thompson, 7 May 2014 at 10:11
Rather than encode the IR tokens as a finger tree we encode them as a stream. This stream can then be fed into a parser that, even with a Java implementation, can be implemented without using a single big switch statement. Too much unpredictable branching can really hurt CPU throughput. Branching is OK provided it is mostly predictable based on past statistics. By using recursion in Java it is also possible to make the OTF decoder allocation free. Recursion in this case is safe because we only need to recurse into nested repeating groups.

Debug the following example to see the IR being used and the parser in action.

https://github.com/real-logic/simple-binary-encoding/blob/master/examples/java/uk/co/real_logic/sbe/examples/OtfExample.java
Unknown, 8 May 2014 at 18:57
Thanks Martin. I looked at the project a bit more in detail and had a brief thread on the Cap'n Proto boards too. So even though you do mention it I think a fair comparison would highlight the difference in features especially compared to something like Cap'n Proto where a lot of the same principles are used. Two things especially stick out:
i) No bounds checking in the CPP code as far as I can tell. This means that you probably only support trusted sources. Seems like you could perform heartbleed like attacks if you accept messages from the internet. Maybe the responsibility for these checks lies somewhere else?
ii) The sequential access requirement is a killer for some projects. You mention this very clearly and I understand that this is the norm for trading data, but some applications just can't live with this constraint. For example imagine I want to represent my objects using SBE in a replicated object database. One of the replicas gets a query for only a particular field (and this is unpredictable), I need to iterate every field just to satisfy that query. Further the CPP bindings at least don't prevent you from shooting yourself in the foot. You could easily call car.available() before car.modelYear() and it won't complain.
Martin Thompson, 8 May 2014 at 19:22
Cap'n Proto is a good project. There are many others. We just picked GPB as a comparison because of how commonly it is used to show people a difference. I could have chosen ASN.1 but not so many people know that.

i) The bound checking reaction is fascinating in how people so misunderstand heartbleed and the like. Any codec could be used to window over a buffer from the network. However any externally sourced input should be validated, this is the crux of the problem.

I think a check similar to the Java and C# side should be added to help prevent people being silly, but this is not a security issue. If people need this protection for security then I'd not trust them with any other part of a secure app. If you take your thinking to its conclusion then char* is not allowed in C/C++.

ii) The sequential access is actually more flexible than I outlined. It is best to be totally sequential, but SBE can allow arbitrary access to any field within a given block. Think of C structures and how they can move over memory. Each block has a C structure over it. If arbitrary access is required to fields across blocks then maybe you should be considering another codec and accept the costs that implementing those features requires.
Unknown, 8 May 2014 at 21:57
Re (i) If I send someone a buffer saying it can be cast to struct foo {int length; char* data} and they blindly believe the length part and then send the data back to me when requested later - it is a problem. Of course it's their fault and they should have validated the data. In SBE's case since a separate part of the program (networking code) is allocating the buffer (char*) and knows its length, it needs to be able to tell the decoding code (which is generated) to not exceed the bounds when returning data from the getters. The decoding logic needs to know the size of the buffer and ensure that it doesn't reach for something out of bounds. To your point about char*, it is a pain isn't it? C/C++ allow a lot of things including returning pointers to stack allocated data, doesn't mean it's a good idea.

Re (ii) I see, good to know. So you just prefer sequential access because of locality within a block, but don't require it. That seems pretty workable.
Monster, 20 May 2014 at 14:37
Hi. I just read your blog on SBE. I would like to know how this compares to those: https://github.com/eishay/jvm-serializers Since ProtoBuf isn't quite the "fastest contender out there", when it comes to Java serialisation.
ac3, 10 June 2014 at 09:14
Any example of decoding a byte array? I'm listening to the instrument feed from CME. I've tried the following, but templateId is always 0?

ByteBuffer encodedMsgBuffer = ByteBuffer.wrap(data, 0, data.length);
encodedMsgBuffer.order(ByteOrder.BIG_ENDIAN);
DirectBuffer buffer = new DirectBuffer(encodedMsgBuffer);

Any help will be much appreciated...
Unknown, 25 August 2014 at 11:13
Is SBE compatible with java 8?
Anonymous, 30 April 2015 at 09:09
Martin, Is it possible to access fields by offset without reading all fields in the message? It seems like the functionality is missing from the first look to a generated Java code. Thanks for considering a new feature (if it is missing)!
Steve Morin, 17 September 2015 at 07:42
Martin, Would you be open to a patch to offer the inclusion of being able to have default values for fields? The reason to add them would be to introduce the concept of forward compatibility like with avro. Any thoughts on supporting an official set of resolution rules like AVRO?
References:
- http://martin.kleppmann.com/2012/12/05/schema-evolution-in-avro-protocol-buffers-thrift.html
- http://docs.confluent.io/1.0.1/avro.html#serialization-and-evolution
- http://avro.apache.org/docs/1.7.7/spec.html#Schema+Resolution
Unknown, 20 October 2015 at 13:12
Hi Martin, do you remember the details of the machine the benchmarks were run on + were they using all cores or just one?
Andrew Marlow, 8 March 2016 at 11:50
How does sbe compare with asn1 ber and der?
Unknown, 11 March 2016 at 02:15
I have built the SBE project successfully, but I cannot find any C++ code.
There are a lot of Java classes output in
\simple-binary-encoding-master\sbe-benchmarks\build\generated\uk\co\real_logic\sbe\benchmarks\fix.
The Java classes import uk.co.real_logic.agrona.concurrent.UnsafeBuffer, so they are Java style.

If my project is a C++ project, how can I build the SBE project to output C++ classes so I can use them in my C++ project?
cheburashka132618 May 2016 at 19:29
Well written, very interesting, too bad it does not reflect the javamare the generator code is, as well as the horror of the code it produces for C++ (in terms of readability, performance, maintainability, bloat). My manager unfortunately is sold on this piece of shit, so I am stuck supporting it. Simple example: the so called token builder, whotf wrote that, it is impossible to troubleshoot. Don't believe me - try to find (in under 5 minutes) the existing bug in the code where individual values for an enum lose their description attribute values while being read from an XML config.
DNT12 July 2017 at 07:56
1. Do I have to maintain separate message schema for Little Endian servers and Big Endian servers?

2. When I use SBE with Aeron, is it always required to run on a Little Endian server as it said on design assumptions? (https://github.com/real-logic/aeron/wiki/Protocol-Specification#design-assumptions)
faa, 5 January 2019 at 05:52
I don't really understand the point of designing SBE when ITCH must be much faster and simpler to decode?
SD, 7 August 2019 at 17:50
How does SBE compare to FAST?