Why BSON is "The" Intercomponent Message Format

Let's set the tone early: This rant is about mainstream use cases. It is not about the ultimate high performance encoding/decoding. It is not about the slickest integration with objects and code generation. This is simply about taking data from point A to point B with high fidelity and the fewest possible technical issues. BSON offers several very important features that make it The Best format for intercomponent data transfer. We will draw comparisons to some other formats but not XML because frankly that ship sailed years ago.

Type fidelity that includes datetime and decimal
Dates and penny-precise numbers (i.e. 19.9 stays 19.9, not 19.89999999) are ubiquitous in messages. A format that does not treat them as natively as it treats strings forces every to- and from-message codec to agree on a convention for them, and that standardization is a lot of work. For example, consider a sales invoice:

    class Invoice {
        private Money amt; // see separate rant on Money
        private Date invoiceDate;
    }

With BSON, serializing the data is trivial:

    class Invoice {
        // amt and invoiceDate are the fields declared above; Document is org.bson.Document
        public byte[] toBSON() {
            Document doc = new Document();
            doc.put("amt", amt.getMagnitude()); // gets a BigDecimal
            doc.put("ccode", amt.getCurrency()); // gets a String
            doc.put("invoiceDate", invoiceDate); // simply take the java.util.Date
            return BsonUtils.toBytes(doc); // a convenience func over 3 calls of low level BSON functions
        }
    }

Now, let's assume these bytes are sent through a message bus and that a Java listener and a Python listener pick them up.

Java:
        public void fromBSON(byte[] material) { // could be an InputStream too...
            Document doc = BsonUtils.fromBytes(material);
            // doc.get("amt") is a BigDecimal
            // doc.get("ccode") is a String
            // doc.get("invoiceDate") is a java.util.Date
        }

Python:
        def fromBSON(material):
            doc = bson_utils.fromBytes(material)
            # doc["amt"] is a decimal.Decimal
            # doc["ccode"] is a str (unicode)
            # doc["invoiceDate"] is a datetime.datetime

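No other work, standards, or conventions are required. Dates and BigDecimal go in and they come out the other side. (The stock drivers hand the decimal back wrapped as a Decimal128; the convenience helpers above are assumed to unwrap it to BigDecimal or decimal.Decimal.)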

JSON does not support dates natively; you have to "trust" that both sides agree on ISO 8601 strings. Numeric support is very primitive: JSON has a single number type with no declared precision, most parsers map it to either an int or a binary floating point value, and the textual nature of JSON means a floating point value written as "1" (as opposed to "1.0") gets picked up as an int.
AVRO calls dates "logical types" and requires you to first convert java.util.Date (and datetime.datetime) to an int or long. Conversion to and from decimal via the underlying type of bytes is fraught with even more peril, because the precision and scale live in the schema rather than in the value.
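
To make the JSON points above concrete, here is a small sketch using only the Python standard library. The invoice values are made up for illustration. It shows the stock encoder rejecting both types, the stringify-by-convention workaround losing the type on the way back, and the "1" versus "1.0" behavior.

```python
import json
from datetime import datetime, timezone
from decimal import Decimal

# Illustrative invoice values (not from any real system).
invoice = {
    "amt": Decimal("19.90"),
    "invoiceDate": datetime(2024, 1, 15, tzinfo=timezone.utc),
}

def try_dump(value):
    """Report whether the stock JSON encoder accepts the value."""
    try:
        json.dumps(value)
        return "ok"
    except TypeError:
        return "TypeError"

# Neither Decimal nor datetime is serializable without a custom convention.
dump_results = {name: try_dump(value) for name, value in invoice.items()}

# The usual workaround: turn both into strings and hope every consumer
# knows which strings are really dates and which are really decimals.
wire = json.dumps({
    "amt": str(invoice["amt"]),
    "invoiceDate": invoice["invoiceDate"].isoformat(),
})
decoded = json.loads(wire)
decoded_types = {name: type(value).__name__ for name, value in decoded.items()}

# The same quantity parses to different types depending on how it was written.
number_types = [type(json.loads(text)).__name__ for text in ("1", "1.0")]

print(dump_results)
print(decoded_types)
print(number_types)
```

The consumer gets two plain strings back, so the knowledge that one is a date and the other a decimal has to live somewhere outside the message.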

Schema is external to the data carrier
Carrying a schema around with each record -- especially a schema that can indicate whether a field is required or optional -- means that different uses of a common shape require different schemas to handle them properly. The context and nuances of schema validation reach far beyond the data carrier, and the two should not overlap. Indeed, different schema specifications (e.g. the AVRO spec vs. json-schema vs. XSD) are each geared toward expressing the shape richly and/or easily, and you may want to bring two or more of them to bear on a common underlying data carrier. BSON is blissfully free of schema specs, name management, subschema inheritance, etc. That problem is properly left up to the consumer.
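
As a rough illustration of consumer-side validation, the sketch below applies two different rule sets to the same schema-free document. The consumer names, field names, and the `missing_fields` helper are all hypothetical; a real consumer would likely use a full schema language rather than a set of required keys.

```python
# The carrier: a plain document with no schema attached.
draft_invoice = {"invoiceDate": "2024-01-15"}

# Hypothetical per-consumer rules. Each consumer decides what "required" means.
BILLING_REQUIRED = {"amt", "ccode", "invoiceDate"}
ARCHIVE_REQUIRED = {"invoiceDate"}

def missing_fields(doc, required):
    """Return the required field names that the document lacks, sorted."""
    return sorted(required - set(doc))

billing_gaps = missing_fields(draft_invoice, BILLING_REQUIRED)
archive_gaps = missing_fields(draft_invoice, ARCHIVE_REQUIRED)

print(billing_gaps)  # billing cannot process the draft yet
print(archive_gaps)  # archiving can accept it as-is
```

The same bytes are acceptable to one consumer and incomplete to another, which is hard to express if a single required/optional schema travels with the record.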

No IDL
Related to the schema issue, BSON imposes no spec or constraints on how one represents, validates, or manages a data shape. IDLs impair dynamic data design and create intercomponent coupling, something that dramatically hurt CORBA in discrete-lifecycle distributed systems and, to a lesser degree, affects Google Protocol Buffers.

If you make it, they can read it
BSON has a fixed type ensemble, unlike CBOR (RFC 7049, since obsoleted by RFC 8949), which has the concept of an underlying type plus a semantic tag on top (date, time, bigdecimal, etc.). As extensible and flexible as this sounds, it also means that you are exposed to CBOR producers creating content that consumers cannot interpret. To be fair, this does not invalidate the concept outright, and deserializing unknown types will not break the machinery: the deserializer will simply tell you it cannot map the type to a handler. But this leaves you with (typically) a byte array of unknown content, and from a distributed system perspective it is just one more thing to worry about. Once one consumer must start reinterpreting the byte[] transmissions of others, the count grows to two, then five, then eight, then ...
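
The unknown-tag situation described above can be sketched with a deliberately tiny, hand-rolled CBOR reader. This is not any library's API: `read_head`, `decode`, `TAG_HANDLERS`, and `UnknownTag` are names invented here, only unsigned integers, byte strings, and tags are handled, and tag number 1234 is just an arbitrary stand-in for a tag this consumer has never heard of.

```python
from collections import namedtuple
from datetime import datetime, timezone

# Hypothetical container for a tag this consumer has no handler for.
UnknownTag = namedtuple("UnknownTag", ["tag", "payload"])

# This consumer only understands tag 1 (epoch-based date/time).
TAG_HANDLERS = {
    1: lambda seconds: datetime.fromtimestamp(seconds, tz=timezone.utc),
}

def read_head(buf, pos):
    """Return (major_type, argument, next_pos) for the item starting at pos."""
    initial = buf[pos]
    major, info = initial >> 5, initial & 0x1F
    pos += 1
    if info < 24:                      # argument is stored in the initial byte
        return major, info, pos
    if info > 27:
        raise ValueError("reserved or indefinite-length head, not handled here")
    size = 1 << (info - 24)            # 1, 2, 4 or 8 following bytes
    return major, int.from_bytes(buf[pos:pos + size], "big"), pos + size

def decode(buf, pos=0):
    """Decode one item; return (value, next_pos)."""
    major, arg, pos = read_head(buf, pos)
    if major == 0:                     # unsigned integer
        return arg, pos
    if major == 2:                     # byte string of length arg
        return bytes(buf[pos:pos + arg]), pos + arg
    if major == 6:                     # tag number arg wrapping one item
        payload, pos = decode(buf, pos)
        handler = TAG_HANDLERS.get(arg)
        value = handler(payload) if handler else UnknownTag(arg, payload)
        return value, pos
    raise ValueError(f"major type {major} is outside this sketch")

# Tag 1 wrapping the integer 1363896240: the consumer gets a real datetime.
known, _ = decode(bytes.fromhex("c11a514b67b0"))

# Tag 1234 wrapping three bytes: decoding succeeds, but the meaning is lost.
unknown, _ = decode(bytes.fromhex("d904d243010203"))

print(known)
print(unknown)
```

Nothing breaks in the second case, which is the fair point made above, but the consumer is left holding bytes whose meaning only the producer knows.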

Field order upon creation is preserved through roundtripping
This has hugely valuable applications in crypto, digest/hashing (e.g. SHA-2), and blockchain data representation. You can go from a Document in Java to bytes to a dict in Python, then out again to a C++ program, and the field order of the BSON is preserved, so the re-encoded bytes produce an identical SHA-2 digest.
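
Why field order matters for digests can be shown with a minimal, hand-rolled encoder that follows the BSON layout for string elements only (type byte 0x02, NUL-terminated name, little-endian length-prefixed value). The `encode_string_doc` and `digest` helpers are written for this sketch and are not part of any BSON library; real documents would go through a driver.

```python
import hashlib
import struct

def encode_string_doc(fields):
    """Encode a dict of str -> str as a BSON document (string elements only)."""
    body = b""
    for name, value in fields.items():      # dicts iterate in insertion order
        raw = value.encode("utf-8")
        body += (
            b"\x02"                              # element type: UTF-8 string
            + name.encode("utf-8") + b"\x00"     # field name as a cstring
            + struct.pack("<i", len(raw) + 1)    # value length incl. trailing NUL
            + raw + b"\x00"
        )
    # The total length counts the 4-byte prefix and the final terminator.
    return struct.pack("<i", 4 + len(body) + 1) + body + b"\x00"

def digest(fields):
    """SHA-256 over the encoded bytes."""
    return hashlib.sha256(encode_string_doc(fields)).hexdigest()

original = {"ccode": "USD", "memo": "net 30"}
same_order = {"ccode": "USD", "memo": "net 30"}
reordered = {"memo": "net 30", "ccode": "USD"}

print(digest(original) == digest(same_order))  # True
print(digest(original) == digest(reordered))   # False
```

A carrier that reorders fields on any hop therefore changes the digest even though the logical content is the same, which is why order preservation through roundtripping is worth having.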

Java and C# BSON implement Map
The reference implementations of BsonDocument and Document implement the Map interface. This means hundreds of tools and utils like generic diff, pretty printers, etc. can operate on BsonDocument and Document. AVRO GenericRecord does not. There is no single CBOR reference implementation, and the most popular ones (which seem to be github.com/peteroupc/CBOR and the Jackson backend framework) each handle this differently.

BSON has no dependencies
The reference implementation of BSON in Java has no dependencies. Zero. Not even org.slf4j. Incorporating BSON into a build has a zero percent chance of causing deep-stack library version conflicts. Compare this to AVRO, which depends on the Jackson JSON databind and annotations libraries, which are themselves dependencies of dozens if not hundreds of open source jars.

MongoDB speaks BSON natively
BSON is a fine data carrier in its own right, but the fact that it is also the native data carrier for MongoDB makes capturing and querying arbitrary data simple.