Software & Databases: You're All Missing The Point

4-Jul-2013

SQL. NoSQL. Java. Rails. Oracle. MongoDB. C++. perl/python/ruby/scala/... Etc., etc., etc.
Do you really think the choice of query syntax, ACID compliance, and language is the reason why software designs succeed or fail? It's not.
Software engineering, and ultimately system flexibility and performance, suffer because we routinely and inappropriately allow databases and database APIs to nudge our designs into mixing together three things that need to be kept very separate: data, key, and domain.

Let's start with objects, not the database. If you start a design in terms of tables (or key/value pairs or document collections; this isn't a SQL v. noSQL issue) then you're already going down the wrong path.
Let's make a simple class:

    class MyClass {
        private String name;
        private int age;

        public MyClass() { /* things... */ }
        public MyClass(MyClass a) { /* the copy constructor! */ }

        // usual barrage of get/set, etc...
    }
Now, before we do anything about databases and such, we should be looking at this class and asking questions about its contract. If we wanted stricter adherence to the philosophy of design-by-contract, then our simple class would have a subtle difference in its initial documentation: the internal state disappears from the published interface and only the constructors and methods remain:
    class MyClass {
        // various internal state

        public MyClass() { /* things... */ }
        public MyClass(MyClass a) { /* the copy constructor! */ }

        public String getName() { /* ... */ }
        public void setName(String a) { /* ... */ }
    }
Very often we too closely couple the internal state variables to what we think we're going to set up as columns in an RDBMS, and that's wrong ... but I am getting ahead of myself. All of this has to be considered before we think about the database. And as we flesh out MyClass a bit, we do so without concerning ourselves with a key or a clustered index. There is no persistor. There is no O-R framework. There is just this object.
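To make that concrete, here is a rough sketch of what a slightly fleshed-out MyClass might look like; the extra accessors, the validation, and the isAdult() behavior are illustrative assumptions, and the point is simply that nothing in it knows about keys, tables, or persistors:

    class MyClass {
        private String name;
        private int age;

        public MyClass() { }

        public MyClass(MyClass a) {   // copy constructor: data only
            this.name = a.name;
            this.age = a.age;
        }

        public String getName() { return name; }
        public void setName(String a) {
            if (a == null || a.length() == 0) {
                throw new IllegalArgumentException("name must be non-empty");
            }
            this.name = a;
        }

        public int getAge() { return age; }
        public void setAge(int a) { this.age = a; }

        // behavior belongs here too, not in the persistence layer:
        public boolean isAdult() { return age >= 18; }
    }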
So having defined a tidy MyClass, we can create instances of it, put it in Lists, etc.
    MyClass o = new MyClass();
    o.setName("Bob");
    // perform operations on o

    // Perhaps if modeled as immutable, no setters; data at construction only:
    MyClass o2 = new MyClass("Bob", 26);
    // perform operations on o2

    //  Put them in containers:
    List<MyClass> list = new ArrayList<MyClass>();
    Set<MyClass> set = new HashSet<MyClass>();
    for(int i = 0; i < 8; i++) {
        list.add(new MyClass());
        set.add(new MyClass());
    }
All this is great -- but rather academic. The real-life question is: once we create an object, how do we find it again (or otherwise get at it) outside the local scope of the method that constructed it? Almost always the answer is to construct a key that is associated with the object and place both in a domain for lookup. A HashMap provides a convenient implementation of a domain:
    Map<String,MyClass> m = new HashMap<String,MyClass>();
    MyClass o = new MyClass();
    m.put("K1", o);
At this point, object o can be found in domain m via key K1. Three separate components (key, domain, object) operating in harmony. So far, so good. The really important points of the exercise are:
  1. Key "K1", from an information architecture perspective, has nothing to do with object o. K1 is a key in domain m to find object o; that's all. Under no circumstances can object o rely on the fact that string K1 is being used to index it in domain m.
  2. Key "K1" is unique only in the domain of the HashMap.

This is a fairly straightforward implementation, and just about all programmers are familiar with HashMaps and key/value entries.
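As a quick illustration of point 2, the same key string can index completely different objects in two different domains; the people and archived maps below are hypothetical names, and nothing about "K1" is global:

    MyClass o = new MyClass();
    MyClass o2 = new MyClass();

    Map<String,MyClass> people = new HashMap<String,MyClass>();
    Map<String,MyClass> archived = new HashMap<String,MyClass>();

    people.put("K1", o);     // "K1" locates o, but only within the people domain
    archived.put("K1", o2);  // same string, different domain, different object

    // Neither o nor o2 knows anything about "K1"; only the maps do.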

The Problem

Now let's see what typically happens when we switch the implementation of domain from a HashMap to a database. We will use RDBMS semantics here but the issue is not isolated to RDBMS nor is it related to SQL:
    Statement px = conn.createStatement();
    ResultSet resultSet = px.executeQuery("select id, name, age from people");

    while(resultSet.next()) {
        MyClass o = mapResultSetToMyClass(resultSet);
        // etc...
        o.getName(); // OK
        o.getKey(); // ?!?   BAD!  What is key doing in MyClass?
    }
    resultSet.close();
    px.close();
The subtle issue here is that the resultset is carrying not only important data for the object (name, age) but also the key (id). All too often, a naive implementation of mapResultSetToMyClass() will "influence" the definition of MyClass by trying to shove the key into it, i.e.:
    MyClass mapResultSetToMyClass(ResultSet r) throws SQLException {
        MyClass mc = new MyClass();
        String key = r.getString(1); // 1 is position of id in result
        mc.setKey(key);
        String name = r.getString(2); // 2 is position of name in result
        mc.setName(name);
        //...
        return mc;
    }
MyClass is no longer pure! Instead of the key being associated with the object and managed by a domain, the key is now IN the object. How did this happen? Sloppiness. Because the ResultSet record has the key to the record right there with the data, we take the shortcut and make the key a peer to the rest of the data. It isn't. Why is this bad? Leaving aside the academics around key/data separation, for starters:
    MyClass o = mapResultSetToMyClass(resultSet);

    MyClass o2 = new MyClass(o);
    o2.getKey();  // ...so now BOTH o and o2 have the same key?
Because the domain key is now intrinsic to the object, copy construction -- and essentially any other write operation on the object -- puts the object into a state that is inconsistent with the persistor. It's not a divide-by-zero exception, no, but it greatly confuses the design because the presence of the key in the object implies that the object represents exactly what is in the persistor. Within a few revs of the software, the binding between key and data in the object becomes inseparable as utils and other functions come to rely on the presence of the key to craft operations against the persistor(s). The object loses its identity as a heap-bound thing and becomes a flashier version of ResultSet, which in practice means it becomes little more than a flashier HashMap. Do you really want to go through all the trouble of managing and documenting an object when it is no more than a HashMap, where the behaviors and types are driven not from the topside interface but rather from the bottomside data supplier?
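To see that inconsistency concretely, consider a hypothetical save() utility that trusts the embedded key (getKey() and getAge() are the accessors the depurified MyClass would have grown); two distinct in-memory objects now both claim to be row "K1", and whichever is written last silently wins:

    // Hypothetical utility that trusts whatever key is embedded in the object
    void save(Connection conn, MyClass mc) throws SQLException {
        PreparedStatement ps =
            conn.prepareStatement("update people set name = ?, age = ? where id = ?");
        ps.setString(1, mc.getName());
        ps.setInt(2, mc.getAge());
        ps.setString(3, mc.getKey()); // the embedded key decides which row is touched
        ps.executeUpdate();
        ps.close();
    }

    MyClass o = mapResultSetToMyClass(resultSet);  // key "K1" rides along inside
    MyClass o2 = new MyClass(o);                   // the copy carries "K1" too

    o2.setName("Robert");
    save(conn, o2);  // quietly overwrites the row that o still thinks it represents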

Depurifying the object by embedding a key also means you cannot create and manipulate the object independently of the domain. Technically, you can, but then what does this mean:

    MyClass o2 = new MyClass();  
    o2.getKey();  // I guess the key has to be ... null?
    // and later...
    o2.setKey(someThing);  // ?  Huh?  

The Solution

The path to goodness is actually pretty simple but requires diligence in design and code factoring. When reading and writing a database, one has to keep the objects and keys separate but associated -- just like any other domain:
    Map<String,MyClass> m = new HashMap<String,MyClass>();

    while(resultSet.next()) {
        String key = resultSet.getString(1); // same as above
        MyClass o = mapNonKeyPartsOfResultSetToMyClass(resultSet);

        m.put(key, o);
    }
Well, that seems barely more complicated than the HashMap example! So why don't we always do it? Because Hibernate and other O-R frameworks, all-too-helpful IDEs with code generators, and an industry plague of convenience over good engineering drive developers to point tools directly at the database and let the code be built up from there. When this happens, inevitably, key and data are bundled together as peers in the to/from-DB code.

Sometimes, though, we can't use a HashMap, which means we also lose the component that is acting as the collection for the key space. Let's say we need to use a List of data instead. No problem:

    List<MyPair<String,MyClass>> list = new ArrayList<MyPair<String,MyClass>>();

    while(resultSet.next()) {
        String key = resultSet.getString(1); // same as above
        MyClass o = mapNonKeyPartsOfResultSetToMyClass(resultSet);

        // NO SHORTCUTS!  Use a Pair class to associate key and data and
        // store that in the list!
        MyPair<String,MyClass> p = new MyPair<String,MyClass>(key, o);

        list.add(p);
    }
Seems like a lot of extra work, right? It isn't. Why do we want to do this? How about some pro-forma, pre-save analytics:
    // Get everything from the persistor:
    while(resultSet.next()) {
        String key = resultSet.getString(1); // same as above
        MyClass o = mapNonKeyPartsOfResultSetToMyClass(resultSet);

        MyPair<String,MyClass> p = new MyPair<String,MyClass>(key, o);

        list.add(p);
    }

    // Now add in things from the GUI:
    while(grab things from GUI as a Map or JSON or XML) {
        MyClass o3 = convert things to a MyClass();
        MyPair<String,MyClass> p = new MyPair<String,MyClass>(null, o3); // ah HA!
        list.add(p);
    }
    analyze_everything(list);
The analyze_everything function is interested only in performing operations on MyClass. It iterates over the list, plucking out the MyPair objects and then each MyClass. It leaves the key alone. Later, if we want to take action on a particular MyClass and it has a persistor key, we can use that key to update the persistor. If the key is null, we can ask the persistor to vend a new key (or we can invent one), associate it with the MyClass, and write the whole thing to the persistor.
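A sketch of that flow, assuming a hypothetical persistor with update() and insertAndVendKey() methods, and assuming MyPair exposes getKey()/getValue() accessors:

    void analyze_everything(List<MyPair<String,MyClass>> list) {
        for (MyPair<String,MyClass> p : list) {
            MyClass mc = p.getValue();  // only the data matters here; the key is ignored
            // ... inspect, aggregate, validate mc ...
        }
    }

    // Later, when we decide to persist:
    for (MyPair<String,MyClass> p : list) {
        String key = p.getKey();
        MyClass mc = p.getValue();
        if (key != null) {
            persistor.update(key, mc);                      // existing data: reuse its key
        } else {
            String newKey = persistor.insertAndVendKey(mc); // new data: vend a key
            // associate newKey with mc in whatever domain we keep around
        }
    }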

There is a nice Pair class available at commons.apache.org.
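For example, org.apache.commons.lang3.tuple.Pair from commons-lang3 can stand in for a home-grown MyPair; a sketch, assuming commons-lang3 is on the classpath:

    import org.apache.commons.lang3.tuple.Pair;

    List<Pair<String,MyClass>> list = new ArrayList<Pair<String,MyClass>>();

    while(resultSet.next()) {
        String key = resultSet.getString(1);
        MyClass o = mapNonKeyPartsOfResultSetToMyClass(resultSet);
        list.add(Pair.of(key, o));   // immutable key/data association
    }

    for (Pair<String,MyClass> p : list) {
        String key = p.getLeft();    // the key; possibly null for brand-new data
        MyClass o = p.getRight();    // the pure object
    }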

Ultimately, this is just another example of the importance of factoring and separation of concerns. Objects need to be defined independently of the persistence and keying frameworks that manage them.


