BGN: A Better Chess Game Action Information Architecture

27-Jun-2021, 28-Dec-2020 Like this? Dislike this? Let me know

Note: This is a rant-under-construction. Some parts may change significantly as more thought and work are put into the project.

Let's be honest: Portable Game Notation had a great run but the time has come for a better way to capture chess action.
Here we will outline Better Game Notation or BGN. BGN has these design goals:

  1. Explicit piece from-to data to completely eliminate ambiguity and (more importantly) permit analysis of moves without having to run a chess engine against the whole game to figure out what is moving. In other words, you can look at move 20 and know instantly what happened without replaying from move 1.
  2. Explicit piece capture and game events
  3. Flexible to permit extensible annotations for commentary (blunders, etc.)
  4. Provision for additional game action data, e.g. clock time at move
In describing the structure and values within, we will use JSON as a rendering example, but it is important to understand that BGN is an information architecture, not a rendering / storage specification. BGN structures can be easily implemented in all popular languages (Java, Python, etc.) and easily externalized in JSON or XML. They can also be easily read from and written to MongoDB, and to relational databases that support XML or JSON column types, although the queryability of such a representation may be limited. It is possible, but much harder, to convert BGN into a purely relational form, especially when considering alternate lines within alternate lines.
A formal specification of types within the BGN architecture is forthcoming but assume at least scalar string, int64, double, and date, and maps (objects) and arrays thereof. In this rant we'll explore:
  1. Basic BGN design details
  2. Practical Reasons For Exploring This At All

Moves

Moves are the heart of the thing. We will start there and back up into the more pedestrian data elements. Moves are an array of structures.
The BGN moves info is designed to facilitate direct queryability for analysis as opposed to the most compact notation. Each move captures from-to squares and the pieces involved, plus additional optional commentary. This allows direct assessment of the data without having to run a "chess engine" to figure out what pieces moved or captured what.
White always starts; therefore, all even moves (0,2,4,6) are white. All odd moves (1,3,5,7) are black.
In PGN, a basic opening would be notated as:

1. e4 e5   2. Nf3 Nf6
  
In BGN the same basic opening would have this in the moves array. In our documentation here, we show the array offset first for a little more context, but it is not part of the actual spec.
  
 0  { "p":"P", "f": "e2", "t": "e4" }
 1  { "p":"P", "f": "e7", "t": "e5" }
 2  { "p":"N", "f": "g1", "t": "f3" }
 3  { "p":"N", "f": "g8", "t": "f6" }
  
The piece being moved is explicitly identified. We call the "piece-from-to" construct a pft and the field names are made short on purpose. In addition, although pft is a strong recurring concept and could be modeled as an array of three elements, we deliberately use field names to avoid the confusion of nested arrays, e.g. array[4][2] = 'F4'. It is more workable like this: array[4]['to'] = 'F4'.
There is a lot more to the pft which we will see shortly but again, it is important to know that pft is not parsed. There is no whitespace, and there is no explicit numbering of the moves, e.g. the "2." in 2. Nf3 Nf6. All data has real field names and a set of valid values including optional values. Here is a sample JSON implementation of moves:
  
	moves = [
	  { "p":"P", "f": "e2", "t": "e4" },
	  { "p":"P", "f": "e7", "t": "e5" },
	  { "p":"N", "f": "g1", "t": "f3" },
	  { "p":"N", "f": "g8", "t": "f6" }
	]
And to prove the point, here it is in XML (although XML is highly NOT recommended):
  
	<moves>
	  <move><p>P</p><f>e2</f><t>e4</t></move>
	  <move><p>P</p><f>e7</f><t>e5</t></move>
	  <move><p>N</p><f>g1</f><t>f3</t></move>
	  <move><p>N</p><f>g8</f><t>f6</t></move>	  
	</moves>
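
Because white owns the even offsets and black the odd ones, the side to move and the PGN-style move number fall straight out of the array offset. The following is only a minimal sketch in Python; the helper names side_to_move and pgn_move_number are made up for illustration and are not part of BGN:

```python
import json

# The four-move opening from above, externalized as JSON.
MOVES_JSON = """
[
  {"p": "P", "f": "e2", "t": "e4"},
  {"p": "P", "f": "e7", "t": "e5"},
  {"p": "N", "f": "g1", "t": "f3"},
  {"p": "N", "f": "g8", "t": "f6"}
]
"""

def side_to_move(offset: int) -> str:
    """Even offsets are white, odd offsets are black."""
    return "white" if offset % 2 == 0 else "black"

def pgn_move_number(offset: int) -> int:
    """PGN counts a white/black pair as one numbered move, starting at 1."""
    return offset // 2 + 1

moves = json.loads(MOVES_JSON)
for offset, move in enumerate(moves):
    print(offset, pgn_move_number(offset), side_to_move(offset),
          move["p"], move["f"], move["t"])
```

Nothing here needs a parser or a board: the offset alone tells us who moved, and the pft tells us what moved.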

Castling is a king-touch move. In PGN:

n. O-O
  
In BGN:
n  { "p":"K", "f": "e1", "t": "g1", "castle":"K" }
  
King vs. queen castling is disambiguated with the value of the castle field. The rook moving from h1 to f1 is implicit. This is the only non-explicit move of a piece in BGN. But when the rook moves later (maybe!) we will see the move from f1 (where it landed during the castle) to the new landing square.
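
If a consumer wants the implicit rook move spelled out, it can be derived from the king's move alone. This is a sketch under the assumption of standard chess (no Chess960), where the king's starting rank identifies the side; the function name implied_rook_move is hypothetical:

```python
from typing import Optional

# Rook from/to squares implied by a castling king move, keyed by
# (castle value, rank of the king's starting square). Standard chess only.
_ROOK_SQUARES = {
    ("K", "1"): ("h1", "f1"),
    ("Q", "1"): ("a1", "d1"),
    ("K", "8"): ("h8", "f8"),
    ("Q", "8"): ("a8", "d8"),
}

def implied_rook_move(move: dict) -> Optional[dict]:
    """Return the implicit rook pft for a castling move, or None otherwise."""
    side = move.get("castle")
    if side is None:
        return None
    rank = move["f"][1]  # "1" for white, "8" for black
    rook_from, rook_to = _ROOK_SQUARES[(side, rank)]
    return {"p": "R", "f": rook_from, "t": rook_to}

print(implied_rook_move({"p": "K", "f": "e1", "t": "g1", "castle": "K"}))
```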

BGN supports additional annotations. In PGN we might see:

13. Qh7 Nxd4+
  
In BGN this would be:
24  { "p":"Q", "f": "h4", "t": "h7" }
25  { "p":"N", "f": "b5", "t": "d4", "x":"B", "c":2 }
  
The "x" field explicitly identifies the piece captured and the "c" field identifies we have placed the opponent in check for the SECOND time.

Promotions are fairly straightforward. In PGN we might see

13. f8=Q Bf7
14. Qf4 ...
  
In BGN this would be:
24  { "p":"P", "f": "f7", "t": "f8", "promote":"Q" }
25  { "p":"B", "f": "d5", "t": "f7" }
26  { "p":"Q", "f": "f8", "t": "f4" }
  
Note that at offset 24 the pawn arriving on f8 was promoted to a queen, and it was subsequently moved as a queen at offset 26.
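
Going the other way, from a BGN move back to a PGN-style token, needs no board state for the fields seen so far. The sketch below is illustrative only: the function name to_pgn_token is made up, it writes promotions with the standard PGN "=" suffix, it cannot tell check from checkmate, and it deliberately skips PGN disambiguation (e.g. Nbd7), which would require knowing where the other pieces stand:

```python
def to_pgn_token(move: dict) -> str:
    """Render one BGN move as a PGN-style token (no disambiguation)."""
    if "castle" in move:
        token = "O-O" if move["castle"] == "K" else "O-O-O"
    else:
        piece = "" if move["p"] == "P" else move["p"]
        capture = ""
        if "x" in move:
            # A capturing pawn is written with its file of origin, e.g. "fxe3".
            capture = (move["f"][0] if move["p"] == "P" else "") + "x"
        token = piece + capture + move["t"]
        if "promote" in move:
            token += "=" + move["promote"]
    if "c" in move:
        token += "+"
    return token

print(to_pgn_token({"p": "N", "f": "b5", "t": "d4", "x": "B", "c": 2}))
print(to_pgn_token({"p": "P", "f": "f7", "t": "f8", "promote": "Q"}))
```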

Subjective Annotations

Beyond the basic game play, it is possible to add additional information to each move. This is managed through the a field. All annotations must be attributed to a source id which is identified in the main BGN structure. The id needs to be consistent within the game, but certainly there is value in having a common ID for the same individual over multiple games. The management of these IDs is not a core requirement of BGN so we will park id management for the moment.
As an example, consider this PGN:

13. Qh7? Bf3??
14. Nf3! Rc2!!
  
Someone has subjectively questioned white's queen move and called the bishop move a blunder. That someone is probably the annotator named (maybe) in the PGN headers. In the next exchange, apparently there is brilliance. In BGN this is represented through the nag field, which uses the standard NAG codes for move quality: 1 ("!", good move), 2 ("?", poor move), 3 ("!!", very good move), and 4 ("??", blunder):
24  { "p":"Q", "f": "h4", "t": "h7", "a": [{"id":"AA2","nag":2}] }
25  { "p":"B", "f": "e4", "t": "f3", "a": [{"id":"AA2","nag":4}] }
26  { "p":"N", "f": "f5", "t": "f3", "a": [{"id":"AA2","nag":1}] }
27  { "p":"R", "f": "c1", "t": "c2", "a": [{"id":"AA2","nag":3}] }
  
This allows multiple authors to opine subjectively on moves. For example, suppose it were legitimate in PGN to write the following, meaning "AA2 thinks it is questionable but AA7 believes it is fine":
13. Qh7 (AA2 ?, AA7 -)
  
Then in BGN we would have (NAG 0 is the null annotation, i.e. nothing remarkable):
24  { "p":"Q", "f": "h4", "t": "h7", "a": [{"id":"AA2","nag":2},{"id":"AA7","nag":0}] }
  
Annotations optionally can have dates. Annotations without dates are assumed to be relevant to the timeframe of the game event itself. This means annotators can come in later and add or revise opinions. For example, suppose author AA7 later on decided that the move was a blunder. We could update the move as follows:
      24  { "p":"Q", "f": "h4", "t": "h7", "a": [
      {"id":"AA2","nag":2},
      {"id":"AA7","nag":0},
      {"id":"AA7","date":"2022-03-04", "nag":4, "comment":"yeah..."}
      ] }
  
The permissioning of such an update, much like the physical persistence itself, is out of scope for this BGN data design doc, but there are at least 2 very practical, very fast ways this could be implemented in either MongoDB or a JSON/XML column RDBMS.
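
Since one source may opine more than once on the same move, a reader of the data will often want only each source's latest opinion. Here is a rough sketch, assuming that every opinion is carried in the nag field and that dates are ISO-8601 strings (which sort correctly as plain text); the function name latest_nag_by_source is hypothetical:

```python
def latest_nag_by_source(move: dict) -> dict:
    """Map each source id to its most recent NAG on this move.

    Undated annotations are taken to date from the game itself, so they
    sort before any dated annotation.
    """
    latest = {}
    for note in move.get("a", []):
        if "nag" not in note:
            continue  # e.g. an annotation that carries only an alternate line
        stamp = note.get("date", "")  # "" sorts before every real date
        previous = latest.get(note["id"])
        if previous is None or stamp >= previous[0]:
            latest[note["id"]] = (stamp, note["nag"])
    return {source: nag for source, (_, nag) in latest.items()}

move = {"p": "Q", "f": "h4", "t": "h7", "a": [
    {"id": "AA2", "nag": 2},
    {"id": "AA7", "nag": 0},
    {"id": "AA7", "date": "2022-03-04", "nag": 4, "comment": "yeah..."},
]}
print(latest_nag_by_source(move))
```

The earlier opinions are never overwritten in the data; the history stays in the a array and only the reader decides which one counts.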

PGN "V2" Commands

PGN was not designed with richly structured data in mind, nor extensibility. As a result, additional information on moves called commands (detailed at https://www.enpassant.dk/chess/palview/enhancedpgn.htm) is tucked away inside the comment field, e.g.:
      14. Rd8 { [%clk "0:02:30"] [%eval #-3] } Bf2 { [%clk "0:02:20"] [%eval #-2] } 
      17. fxe3 Kd7 ( Kd6 { [%clk "0:02:30"] [%eval #-3] the better move...?} )  18. ...
  
In BGN, commands are captured as properties within a namespace. Properties are considered to be objective and do not require association with an id:
      24  { "p":"Q", "f": "h4", "t": "h7", "properties": {
              "namespace1": { "key1": val1, "key2": val2 },
              "namespace2": { "key1": val1, "key2": val2 }
            }
          }
  
For example, lichess.org now adds %clk and %eval commands into game output for clock time and Stockfish analysis, respectively, and these could be modeled this way:
      24  { "p":"Q", "f": "h4", "t": "h7", "properties": {
              "lichess": { "eval": "#-3", "clk": "0:02:03" }
            }
          }
  
The information architecture within a namespace is completely at the discretion of the owner of the namespace. Because BGN is inherently neutral to the data types, the values of keys in the namespace are not restricted to simple scalar strings:
      24  { "p":"Q", "f": "h4", "t": "h7", "properties": {
              "myNamespace": { "eval": [-3, 0.001, "foo"], "rushFactor": [11.023,0] }
            }
          }
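
To give a feel for the conversion, here is a small sketch that lifts %-commands out of a PGN comment into a namespaced properties map. The regular expression and the function name commands_to_properties are assumptions made for illustration; values are kept as strings, and any free text in the comment is simply ignored:

```python
import re

# Matches commands such as [%clk 0:02:30] or [%eval #-3]; quotes around
# the value, as in [%clk "0:02:30"], are tolerated and stripped.
_COMMAND = re.compile(r'\[%(\w+)\s+"?([^"\]]+?)"?\s*\]')

def commands_to_properties(comment: str, namespace: str) -> dict:
    """Lift %-commands out of a PGN comment into a BGN properties map."""
    found = dict(_COMMAND.findall(comment))
    return {namespace: found} if found else {}

comment = '[%clk "0:02:30"] [%eval #-3] the better move...?'
print(commands_to_properties(comment, "lichess"))
```

A namespace owner would likely go further and turn "0:02:30" into a number of seconds, which is exactly the kind of typing decision BGN leaves to that owner.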
  

Alternate Lines

The a field can carry complete alternate move structures. Suppose author AA3 thought that this would be better:
13. Qh7 (AA3 alt1 would be Nxc5, followed almost certainly by Bxc5)
  
In BGN, this is explicitly described using the same set of structures as in the mainline game:
24  { "p":"Q", "f": "h4", "t": "h7",
      "a": [ {"id":"AA3",
              "alt": {"name":"Knight press",
                      "comment": "bla bla bla",
                      "moves": [
                        {"p":"N", "f":"a4", "t":"c5", "x":"P"},
                        {"p":"B", "f":"f6", "t":"c5", "x":"N"}
                      ]
              }
        } ]
}
  
The hidden gem here is that the moves array in the alt structure is the same as the mainline -- which means that alternate lines themselves can have alternate lines within! Any feature that is added to the move info architecture is automatically available recursively in the alternate lines.
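
The recursion is just as simple on the consuming side. As a sketch (the generator name walk_lines is made up, and it assumes alternate lines live under the alt key of an annotation as shown above), one function can visit the mainline and every nested alternate line:

```python
from typing import Iterator, Tuple

def walk_lines(moves: list, depth: int = 0) -> Iterator[Tuple[int, int, dict]]:
    """Yield (depth, offset, move) for a line and every nested alternate line."""
    for offset, move in enumerate(moves):
        yield depth, offset, move
        for note in move.get("a", []):
            alt = note.get("alt")
            if alt is not None:
                # An alternate line carries its own moves array, so recurse.
                yield from walk_lines(alt["moves"], depth + 1)

mainline = [
    {"p": "Q", "f": "h4", "t": "h7", "a": [
        {"id": "AA3", "alt": {"name": "Knight press", "moves": [
            {"p": "N", "f": "a4", "t": "c5", "x": "P"},
            {"p": "B", "f": "f6", "t": "c5", "x": "N"},
        ]}},
    ]},
]
for depth, offset, move in walk_lines(mainline):
    print("  " * depth, offset, move["p"], move["f"], move["t"])
```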

Game Structure

As promised earlier, we will back up to the overall game structure that holds moves. BGN carries two namespaced sections of data to hold both PGN compatible information and more expressive information. An example serves us well:

{
	"pgn" : {
		"event" : "F/S Return Match",
		"site" : "Somewhere in Serbia",
		"date" : "12.19.04",
		"round" : "29",
		"white" : "Roosevelt, Ted",
		"black" : "Harding, Warren",
		"result" : "1/2-1/2"
	},
	"bgn" : {
		"eventDate": a real datetime object,
		"white" : {
			"last" : "Roosevelt",
			"first" : "Theodore",
			"nick" : "Teddy"
		},
		"black" : {
			"last" : "Harding",
			"first" : "Warren"
		},
		"sources": [
			{"id": "AA2", "info": { "name": {"last":"Hoover"}, "rating":1000}}
		]
	},
	"moves" : [ (as above) ]
}
  
The pgn section carries simple key-value pairs where the value is decidedly a human-readable string and, in the case of result, highly jargonistic ("1/2-1/2" means a draw).

The bgn section contains more highly structured data that is beyond the expressibility of PGN.
The moves array in BGN is convertible to PGN without additional information as it overspecifies data compared to PGN.

What is The Real Point?

  1. BGN is a modern, database and software-friendly data design
    BGN is both explicit and expressive without syntactical shortcuts like ?? for blunders, and it is very digestible by rich shape databases like MongoDB. Consider a data set of 100,000,000 games about which we wish to ask: in how many games played between 1960 and 1980 did a castle occur in the first 5 moves? 10 moves? 15? In MongoDB we could query the hypothetical chessdata collection thusly:
    db.chessdata.aggregate([
    // Step 1: Filter for only the dates we want, which should cut down a LOT of the material:
       {$match: {$and: [ {"bgn.eventDate":{$gte:new ISODate("1960-01-01")}} ,
    		      {"bgn.eventDate":{$lt:new ISODate("1980-01-01")}}
    		    ] }}
    
    // Step 2:  Use the $reduce function to "walk" the moves array and sniff out at what point,
    // if ever, the castle occurs.  We only need to check up to the first 5 (or 10 or 15) moves
    // OR the max length of moves array, whichever is shorter:
        ,{$project: {X: {$reduce: {
    	input: {$range:[0, {$min:[ {$size:"$moves"},5 ]} ]},
    	initialValue: [],
    	in: {$let: {
    
              // $$this is the sequential int generated from $range in the input
              vars: { ee: {$arrayElemAt:["$moves","$$this"] } },
    
              // The following translates to:  "if this move carries a castle field ("K" or "Q"),
              // then append to the ever-growing $$value array a new array of one containing the
              // offset where it was found, else append a ZERO length array -- essentially a noop":
              in: {$concatArrays: [ "$$value",
    				{$cond: [ {$in:["$$ee.castle",["K","Q"]]} , ["$$this"] , [] ]}
    			      ]}
    	}}
         }}
       }}		
    
    // Step 3:  The $reduce function can leave us with an empty -- but non-null! -- array, so
    // lastly filter those out:
    ,{$match: {$expr: {$ne:[0,{$size:"$X"}]} }}
    ]);
    
  2. BGN stored in MongoDB or Hadoop makes terabyte sized analytics a possibility
    With appropriate indexing, such a query might only take seconds or a minute as opposed to, say, hours struggling with running Python programs using chess.pgn over and over. The PGN archive at https://database.lichess.org is adding a nearly 20GB file of bzip2 compressed PGN representing approx. 100 million new games per month. Beyond MongoDB, BGN rendered as JSON can be spread out in a Hadoop cluster spanning many terabytes with dozens of machines deployed to solve very large scale data analysis problems.
  3. BGN -- especially the moves data design -- is highly extensible
    One of the biggest drawbacks to PGN is the difficulty -- never mind lack of standardization -- in adding fields, simple or complex, to each move. It is trivial in BGN to do so because it is a modern structured data design requiring no parsing. For example, we can add a field et for elapsed time in seconds from start of game.
        26  { "p":"Q", "f": "h4", "t": "h7", "x":"N", "et":1234 }
        27  { "p":"B", "f": "e4", "t": "f3", "et": 1256}
    
    It now becomes easy to measure the pace of the game by comparing move[n].et to move[n+1].et. This can even be bucketed, e.g. % of moves executed in 1-10 seconds, % executed in 10 to 120 seconds, % in more than 120 seconds, etc.
  4. BGN is easily externalizable into highly digestible JSON
    After spending days struggling with 22GB PGN files, BGN externalized as CR-delimited JSON offers some interesting advantages:
    1. Finding anything with grep is as fast as, well, grep, and will yield the complete game; no other lines (rows) necessary.
    2. Splitting a CR-delimited file is as easy as using split, again because a game is on one line/row.
    3. You can use jq -- the de facto standard for command line hacking of JSON -- to filter, transform, and otherwise hack the JSON. Or any other JSON hacking tool you like.
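
The one-game-per-line layout also means the castling question from point 1 can be answered with a few lines of ordinary code and no chess engine. The following is a minimal sketch, not a tuned implementation: the function names are hypothetical, the sample games are made up, any move carrying a castle field counts as a castle, and the date filter is left out because the on-disk rendering of eventDate has not been specified here:

```python
import io
import json
from typing import Iterable, Iterator, Optional

def first_castle_offset(moves: list, limit: int) -> Optional[int]:
    """Offset of the first castling move within the first `limit` entries."""
    for offset, move in enumerate(moves[:limit]):
        if "castle" in move:
            return offset
    return None

def games_with_early_castle(lines: Iterable[str], limit: int) -> Iterator[dict]:
    """Yield each game (one JSON document per line) that castles early."""
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines
        game = json.loads(line)
        if first_castle_offset(game.get("moves", []), limit) is not None:
            yield game

# Two tiny made-up games standing in for a large file opened with open(...).
sample = io.StringIO(
    '{"moves": [{"p":"P","f":"e2","t":"e4"}, {"p":"K","f":"e8","t":"g8","castle":"K"}]}\n'
    '{"moves": [{"p":"P","f":"d2","t":"d4"}]}\n'
)
print(sum(1 for _ in games_with_early_castle(sample, limit=5)))
```

Because each line is a complete game, the same function works unchanged on any chunk produced by split, which makes it easy to fan the work out across processes or machines.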


      Site copyright © 2013-2021 Buzz Moschetti. All rights reserved