PQL: The Prefix Query Language

2-Sep-2013

What is PQL?

PQL stands for (P)refix (Q)uery (L)anguage and is pronounced "pea-quill." It is a specification for how a set of functions and their associated arguments can be applied to a shape of data to yield a boolean match or no match. It is not a database unto itself, nor the "house band" for a particular database engine. It is simply ... software.

What makes PQL different from SQL and XPath and some other popular query languages is that it is based on a physical data structure, not a stringified grammar which is then parsed into some internal form. In other words, there is no standard string equivalent to the full SQL statement:


    String s = "select * from FOO where A = 'NY' and B < 12";

PQL has no intrinsic parser per-se; you build a PQL statement by assembling Maps of Maps in a way very similar to MongoDB but with no syntax shorcuts. The PQL equivalent to the SQL above is:

    Map m1 = new HashMap();
    m1.put("A", "NY");
    Map m2 = new HashMap();
    m2.put("eq", m1);

    Map m3 = new HashMap();
    m3.put("B", 12); // autobox to Integer
    Map m4 = new HashMap();
    m4.put("lt", m3);

    List l2 = new ArrayList();
    l2.add(m2);
    l2.add(m4);

    Map m5 = new HashMap();
    m5.put("and", l2);

We show the construction above implemented in Java but the concept of Map of Maps is largely language-neutral; most popular languages are capable of constructing such a structure.

Each Map contains exactly one function name and one argument (i.e. a single key- value pair), although that argument can be of type scalar, Map, or List. The topmost map above, m5, can then be consumed by utils such as MapFilter, SQLRewrite, MBI/MBD, and other tools. PQL is mostly about physical structure and a very small set of basic operators. Externalized in JSON form (a convenient way to show it), the statement above would be:

    { "and": [
      { "eq": { "A": "NY" }},
      { "lt": { "B": 12 }}
      ]
    }

Again, to be clear, although PQL is in fact nicely stringified in JSON, the language specification is not dependent on JSON. Rather, the specification is about standard function names and the physical structure of their arguments (scalar, List, or Map).

Why PQL?

The following factors motivated the design and implementation of PQL:

Systems increasingly use caches of data and maintain this data as rich shapes (maps of maps of lists of maps, etc.) instead of the traditional RDBMS database-centric approach that vends ResultSets and flat rectangles. A framework that allows seamless, consistent, and efficient rich shape filtering across a persistor and items in-memory is very desirable because it eliminates the nagging problem of needing to build a "paradigm bridge":
[SQL / flat rectangles] ==> [some other means of filtering / rich shapes]
SQL and XPath are heavily human readable/typeable oriented grammars, including infix notation, parentheses, whitespace, etc. This is good for experimentation at the command line. It is also good for "smallish" queries. When predicates get large, however, the ability to quickly absorb the logical operators and parentheses and nested conditions is lost, and it becomes no more difficult to "think like a computer" when examining a prefix notation statement. Perhaps more importantly, the readability of SQL and XPath actually confounds the programmatic construction of queries, from a GUI or backend logic both, especially with respect to to-string conversion of non-string types. Software does not need whitespace, syntactic sugar, or English nouns and verbs. And going in the other direction (parsing/deconstruction of SQL or XPath) is positively laborious. PQL, however, is about low-level structure of the query and enforcing prefix notation for all functions and is very easily constructed -- and it is just as easily deconstructed. The map-of-map paradigm means it can be externalized in a number of familiar forms including JSON or XML which means it can be rehydrated easily into the original map-of-maps object hierarchy.
The low-level structural approach permits modularity in query construction. Query fragments can be pre-constructed and dynamically assembled to perform the desired query. Common and perhaps complex query logic can be set up in one place and leveraged elsewhere in a runtime. SQL and XPath have a harder time with this because they are non-modular. They are designed around a full statement, not fragments. Even if you do fragment a query, you typically run into the issue of variable substitution and so you would likely have an implementation that held the data pre-formatting (i.e. before creating the final string A = 'B' and C = 'D' ). And that implementation starts to look a lot like ... PQL!
Most SQL is not pure-portable across database engines. In contrast, most PQL queries can be implemented not only on all SQL based RDBMS but other database engine types like MongoDB and Cassandra and Aerospike. PQL is also easily converted to Ehcache Search API.
PQL has a straightforward means to add user-defined functions, largely because the spec itself is mostly about about functions and arguments.
Cross-language and multi-language leverage across Java, python, ruby, etc. is increasingly useful and being able to define rich structure (and a query expression) in the paradigm of the "host" language is very valuable. The PQL map-of-map structure is natively "transportable" because essentially every language knows how to make a Map, a List, or one of the core scalar types. High performance utils written in Java to consume Maps of data and PQL queries (as Maps) are immediately and natively leveragable in JRuby and JPython; just build the hashes in the familiar ruby or python way and pass them to the Java layer.

In general, most queries -- even complex ones -- can be expressed in PQL. The simplicity of analyzing a PQL query combined with flexibility of mating PQL to a backend persistor means that even persistor-specific features and query capabilities can be exploited in the rewrite layer while protecting the application from the details of that implementation. For example, it is straightforward to programmatically walk a PQL AND clause and check if two date or numeric fields are a "less-than-greater-than" pair, implying BETWEEN. The rewrite would substitute a single BETWEEN expression instead of the two separate "lt" and "gt" expressions.

PQL "Compliance"

Since PQL is merely a structure and a dictionary of well-known function strings, PQL compliance means that tools consuming it must conform to these rules:

PQL function Maps contain a single key-value entry. The key is the function name and the value is the argument to that function. Functions requiring multiple scalars to perform their task can be passed a Map and that Map (the argument Map) can contain more than a single entry. Functions that operate on more than one unit can be passed a List. Single-entry Maps are used rather than Pair because the Map interface is much more widely used so familiarity and breadth of support and implementation trumps minimal interface design here.
A util digesting PQL must be capable of supporting the following 12 Basic Functions:
- Basic operators: lt, lte, ne, eq, gte, gt
  For all of these, the argument is a Map containg:
  - A field name as the Map key, which is a dotpath (e.g. header.request.createDate) to a scalar value.
  - A target value scalar String, Double, Long, Integer, Date , or BigDecimal as the Map value. The first five of these are somewhat obvious and enjoy automatic conversion across languages and excellent native to- and from-String handling. BigDecimal is also required because too many situations call for lossless, precise manipulation of dollars and cents and double won't do. Because PQL uses the Map interface exclusively, there is no explicit restriction on putting other types into a PQL map structure, but if a util does not know what to do with that type, then there is little that can be done. From a compliance standpoint, it is simply an assertion that any software consuming PQL must recognize and appropriately deal with the 6 types above.
    As a special case, the target value may be a Map. Currently the only use for the Map is for the indirection function:
```
    SQL:  select * from FOO where field1 = 'NY'
    PQL:  { "eq": { "field1": "NY" }}

    SQL:  select * from FOO where field1 = field2
    PQL:  { "eq": { "field1": {"ind": "field2"}}}
```
- Inlist (optimized equals): in
```
    SQL:  select * from FOO where field1 in ('NY','NJ')
    PQL:  { "in": { "field1": [ "NY", "NJ" }}
```
- Logicals: and, or, not
  The backbone of complex expressions, and and or take a List of of subexpressions to evaluate and can be nested arbitrarily deep. not takes a single Map as an argument.
- Existence: null
  Due to the vagaries of some Map implementations permitting entries with null values, PQL avoids ambiguous terms like "exists." Instead, the null operator clearly checks for whether a field -- or any parent component of the field if it is a dotpath -- is null, regardless of how the underlying Map might be carrying the value (if at all). Use logical not to determine if a dotpath component is not null, e.g.
```
    PQL:  { "not": { "null": "a.b" }}
```
- Type: type
  An equals operator for target field type instead of value. To aid in the to- and from-string serialization, the following simple strings identify the types: STRING, INT, LONG, DATE, DOUBLE, BIGDECIMAL, MAP, LIST. Implementations may provide convenient macros or constants for these strings.
```
    PQL:  { "type": { "birthday": "DATE" }}
```

PQL challenges

It always comes down to Dates and BigDecimal and round tripping. PQL by itself has zero issues with Dates and BigDecimal, but PQL as part of a larger system design with externalization requires something else to manage and/or define the Date and BigDecimal types. For example, PQL queries and the Maps they operate on are trivially externalized and parsed via JSON if they contain only strings and integers. Most JSON parsers have switch selectable modes to parse floating point numbers as BigDecimal instead of double so that's only slightly less trivial. In other words, JSON out of the box can be used as the externalization solution!
But dates are a pain because once externalized as a String representation (say, in ISO 8601), an arbitrary consumer does not know upon rehydration if it is a real date, a date deliberately made to look like a String, or soemthing else. Sure, you can sample-parse it to see if a successful conversion can be performed, but that might not be the intention of the producer. SQL and other languages are based around a string grammar and thus the standards for to- and from-string conversion are baked in. That said, there is still enough "variety" in the SQL standard such that exteralizing something like DATE('2013-08-01') from Postgres will not work in Oracle (although TO_DATE(data, format) does in fact work in both).

    SQL:  select * from FOO where createDate < date('2013-08-01')

    PQL:  
    Map m1 = new HashMap();
    {
        // An acceptable built-in string-to-date parser...
        Calendar c2 = javax.xml.bind.DatatypeConverter.parseDateTime("2013-08-01");
        m1.put("createDate", c2.getTime());
    }					       
    Map m2 = new HashMap();
    m2.put("lt", m1);

    // Up to here, PQL and all tools still working fine

    // But here, a trivial externalizer might produce this:
    { "lt": { "createDate": "2013-08-01" }}
    // ...and now that looks like a String, not a date...

In short, it comes down to defining a standard for externalization. One possibility is to leverage the special Map function framework (that supports ind) and add a function called class

    { "lt": { "createDate": { "class": { "type": "java.util.Date", "str": "2013-08-01" }}}}

Again, this representation is NOT used internally in PQL. Post rehydration of the fragment just above, the function Map would look like this:

    Map.Entry me1 = extractOneME(m); // assume m is the "lt" map
    me1.getKey();  // "lt"
    Map m2 = me1.getValue(); // the Map arg
    Map.Entry me2 = extractOneME(m2);
    me2.getKey();  // "createDate"

    // ah HA!  This is not the "class" thing above.  The rehydrator has 
    // recognized the special function "class" and created a real Date object
    // in place of the type info and string equivalent:
    Object o = me2.getValue(); // o.getClass() is Date, value is 20130801

Aggregates

Standard PQL does not have a specification for aggregate functions because it is oriented around boolean matching of single candidate shapes, not lists. Aggregate functions have different semantics and although it is tempting to create functions like agg:

    { "agg": {
        "filter": { # regular PQL here },
        "expr": { # commands to do something BESIDES filtering }
    }

these need to be set up in an engine that does something intelligent with the output especially since agg functions typically generate new shapes of data. The MongoDB aggregation framework and pipelining is step in that direction.

Extending PQL

The strict but simple nature of PQL means that adding functions is easy. We do need, however, a convention for namespacing the functions. The obvious choice would be using a colon just like XML:

    { "eq": { "field1": "NY" }}  # eq is in the default Basic namespace
    { "ns1:myFunc": arg }  # myFunc is in the ns1 namespace

The filter would provide the predictable handler interface to declare new sets of functions:

    MapFilter e = new TheMapFilterImpl();
    class MyFunctions implements MapFilterFunctionSet {
        // register function "myFunc"
    }
    e.registerFunctions("ns1", new MyFunctions(args));

At the point we tackle that, we should also craft a means to identify namespaces of variables that can be addressed. Yes, ultimately we are rewriting LISP, but we would stop well short of that.

It's not SQL

PQL is not a large, multi-decade mature, general purpose query language. It is a simple framework to perform modest operations on a Map of data. Although user defined functions can be declared, don't equate the Map-of-Map design paradigm with SQL. SQL and its variants are much richer and more complex.

Dates

Date is always a tricky type and PQL avoids some of the trickiness by not complicating it with its own rules. Date timezone, daylight savings time, etc., are completely up to the producer and consumer to negotiate. PQL does not get into the middle of the discussion; it simply carries the Date type around. A Date is a single instant in time and intrinsically has no timezone or daylight savings time (DST) information. In other words, operations like (Date)d1.compareTo((Date)d2) work correctly independently of timezone and DST. It is only when Date is externalized or parsed from a String that the multitimezone interpretation of that instant in time becomes an issue. The rubber hits the road when a PQL-compliant util actually performs String operations on the date. This is where things like default locale and DST settings start to muddy the waters. A pragmatic recommendation is to NOT use TZ and DST offsets in string representation of dates and instead specifically define the scope/purpose of the target date item i.e. businessTradeDate will carry the datetime of activity as observed on the clock in that business day context -- which is not necessary the clock that the "user" sees nor the UTC time. Other information can be used later to take the point-in-time datetime and represent it in different String-ified forms.

Does PQL "Work?"

Yes! I have crafted these compact utils that use PQL:

MapFilter. Given a candidate data Map and a PQL map, return a boolean match or no match. One might call MapFilter the reference implementation for processing PQL.
MapListFilter. Given a candidate List of Maps and a PQL map, return a boolean array indicating which items in the list match or not.
SQLRewrite. An essential component for MBI/MBD, a hybridized relational/ JSON persistence layer. MBI/MBD sits in between the persistor and the data access layer and allows a regular RDMS like Oracle or MySQL or Postgres to efficiently capture and perform queries on rich data. In summary, fields -- arbitrarily deep into the rich data structure -- that are indexed in MBI/MBD are copied from the shape into "helper" columns in the table which can be used as part of a SQL expression. MBI/MBD manages consistency between the JSON BLOB/CLOB and the helper columns so copy, indexing, update, etc. are all invisible to the DAL.

PQLtoMongo. Probably obvious that this would exist. It creates the shortcuts necessary to drive a MongoDB query, e.g.

    PQL:  { "eq": {"fld1": "val1" }}
    MDB:  { "fld1": "val1"}

    PQL:  { "lt": {"fld1": "val1" }}
    MDB:  { "fld1": {"$lt": "val1"}}

Is PQL "Fast?"

Given MapFilter as a reference implementation, here are some results collected on a 2013 MacBookPro (Intel Core i7, 2.7GHz, 4 cores, 256KB L2 per core, 6MB L3 cache, 16GB total physical memory) using Java 1.7.0_40 with no special tuning or -X startup options.

    // Data shape looks like this:
    // {
    //    "c1": "val",
    //    "c2": "val2"
    // }

    // PQL query:
    { "and": [ { "eq": { "c1": "val1" }}, { "eq": { "c2": "val2" }} ] }

    MapFilter.eval(data, query) achieves between 4.2 and 4.4 million
    evals per second.  Note: With a single "eq" operation (no "and"), you
    get a whopping 15.2 million per second.


    // Data shape looks like this:
    // {
    //   "a": {
    //     "b": {
    //       "c": {
    //         "d1": "val1",
    //         "d2": "val2"
    //       }
    //     }
    //   }
    // }

    // PQL query:
    { "and": [ { "eq": { "a.b.c.d1": "val1" }}, { "eq": { "a.b.c.d2": "val2" }} ] }

    MapFilter.eval(data, query) achieves between 2.6 and 2.7 million
    evals per second.

In the context of AND (a common logical), basic operator tests can be performed at a rate of about 7 million per second. A big query -- say, an "and" list containing 20 basic operators going only one level deep into a rich data structure -- runs at about 450000 evals per second. In the reference implementation, performance is governed largely by the number of basic operations. A performance optimization for non-nested dotpaths keeps things fast for those queries. Performance degrades only a 2 or 3% per level as the depth of the dotpath increases. "Pushing a frame" for and, or, and not has little effect. In practice, there will be very few examples of deeply nested data combined with deeply nested logical operators.

The in operator works particularly well even with very large lists.

    // Put one million items in the list:
    { "in": { "name": [ "N1", "N2", ... "N1000000" ] }}

    // Even with a trivial implementation (i.e. no internal optimizations
    // like putting lists greater than 1000 in a hashmap) combined with a 
    // "worst case" input data map like this 
    // {
    //   "name": "N1000000"  
    // }

    // it still only takes about 50ms, about 20,000,000/sec for String.equals().

Because PQL queries are essentially "modular" Map units, large inlists can be constructed ahead of time and managed as a group. Later, as appropriate, they can be added as ingredients into a new query. You have the additional flexibility of managing the raw List of String separately and "plugging it into" the value for "name" later on.

In general, it can be said that it is fast enough.

PQL Examples

Here are some examples (in JSONified representation of the qexpr) in the context of a traditional select statement from SQL. Only the first two examples have Java snippets but the approach to construction is the same for all examples:

    // The equiv of hello world:
    //
    // fld1 = 'value1'

    The Java to construct this Map-of-Maps might look like this:
      Map q = new HashMap();
      Map q1 = new HashMap();
      q1.put("fld1", "value");
      q.put("eq", q1);

    and the string representation (easier to visualize it this way) would be:
    {
      "eq": { "fld1": "value1" }
    }


    // More than a single boolean op requires one or more logical functions.
    // Here we see that the and function expects a list of expressions
    // to eval, not a single map or scalar value.   But the overall
    // function-argument design remains consistent.
    //
    // fld1 = 'value1' and fld2 = 'value2'
    {
      "and": [
        { "eq": {"fld1": "value1"}},
        { "eq": {"fld2": "value2"}}
      ]
    }
    Map qexpr = new HashMap();
    List q2 = new ArrayList();
    {
      Map q = new HashMap();
      Map m7 = new HashMap();
      m7.put("fld1", "val1");
      q.put("eq", m7);
      q2.add(q);
    }
    {
      Map q = new HashMap();
      Map m7 = new HashMap();
      m7.put("fld2", "val2");
      q.put("eq", m7);
      q2.add(q);
    }
    qexpr.put("and", q2);
    


    // "Roughly equiv" to:  
    //    select where trade.header.sys = 'ABC'.
    // The qexpr map semantics permit rich nested structure.  Maps
    // of maps in the document Map being queried can be "walked" using
    // the familiar dot operator.
    // In conventional SQL this would be some sort of join of foreign
    // keys, or perhaps the data itself would be "flattened" to a column
    // like :  select where trade_header_sys = "ABC"
    {
      "eq": { "trade.header.sys": "value1" }
    }


    // SQL is not oriented toward dealing with non-scalars in a structure
    // and that's OK; we're not trying to make this an apples-to-apples
    // SQL v. PQL comparison.  But suppose we have a structure like this:
    // {
    //    "authUsers": [ "buzz", "dave" ]
    // }
    // where there can be a variable number of authUsers.  Suppose we want
    // to find any record where size(authUsers) > 1.  We exploit the number
    // convention of the dotpath syntax to see which records have a NULL entry
    // at offset 1 (meaning the list has only 1 element), and then NOT the
    // result to return those records that have more than one element.  This
    // is assuming well-behaved lists without embedded nulls and such.
    {
        "not": { "null": "authUsers.1" }
    }    


    // fld1 = 'value1' and fld2 <= 20
    {
      "and": [
        { "eq": {"fld1": "value1"}},
        { "lte": {"fld2": 20}}
      ]
    }

    // fld1 in ('value1','value2', 'value3')
    {
      "in": {"fld1": [ "value", "value2", "value" ] }
    }


    // fld4 not in (10, 20, 30)
    {
      "not": { "in": {"fld4": [ 10, 20, 30 ] }}
    }


    // Typically, the AND and OR functions will be the entry in the
    // topmost map because almost any nontrivial query will require them.
    // Here, OR is the main construct:
    // 
    // fld1 in ('value1','value2') or fld2 != 'bar'
    {
      "or": [ 
        { "in": {"fld1": [ "value", "value2" ] } },
        { "ne": {"fld2": "bar"}}
       ]
    }


    // Getting fancy but the qexpr syntax is very simple so although
    // it would be messier than SQL to type, it is very easy for
    // program logic to construct.  
    // We show datetime strings here but note the conversion to a date
    // type.  java.util.Date type is one of the recognized scalar types.
    //
    //    fld1 in ('value1','value2') 
    //    or fld2 != 'bar'
    //    or (fld3 != 0 and fld3 = TO_DATE('2013-08-27T00:00:00')
    //
    {
      "or": [ 
        { "in": {"fld1": [ "value", "value2" ] } },
        { "ne": {"fld2": "bar"}},
        { "and": [
          { "ne": {"fld3": 0}},
          { "eq": {"fld3": str2date("2013-08-27T00:00:00")}}
          ]
        }
       ]
    }

    // Careful with those parens:
    // fld1 = 'A' and (fld2 = 'B' or fld3 = 'C')
    {
      "and": [ 
        { "eq": {"fld1": "A"},
        { "or": [ { "eq": {"fld2": "B"}}, { "eq": {"fld3": "C"}} ] }
      ]
    }

    // Again, be careful with those parens!   
    // (fld1 = 'A' and fld2 = 'B') or fld3 = 'C'
    {
      "or": [ 
        { "and": [ { "eq": {"fld1": "A"}}, { "eq": {"fld2": "B"}} ] }
        { "eq": {"fld3": "C"}}
      ]
    }

    // All basic scalar booleans can make use of the special ind function.
    // If a literal value is not provided for comparison, then a Map
    // containing the ind (indirect) function and a target field name in
    // docuemnt can be specified instead:
    //
    // fld1 = fld2
    {
      "eq": { "fld1": {"ind": "fld2"} }
    }

Like this? Dislike this? Let me know