Thursday, June 11, 2009

A DiffSet generic collection

Working on a GWT project that requires fairly frequent RPC calls to the server to exchange data I was confronted with the following problem:


  • the data set itself was likely to be of fairly large size -- not massive (we are not talking tens of MBs) but not smaller either: to give a rough estimate, assume anywhere up to 500-600 kB;

  • in addition, the time taken to 'process' the full dataset (both on the server, but, more importantly on the client) was likely to be such that a user would be faced with frequent, annoying interruptions;

  • however, the actual ('incremental') changes to the data are very likely to be minimal: roughly, on a data set that may contain up to several '000s entries, probably less than 10 (and usually only 2-3) would be changed between round-trips.
I was therefore rather surprised to find that nobody seemed to have already implemented a fairly simple solution to the problem: a DiffSet<T> that essentially 'captures' the incremental changes between two Set<T> as a sequence of Commands (which I called DiffCommand).

If you are not familiar with the Command pattern, I suggest you have a good read of the Head First Design Patterns book.

I have thus created a few classes, that implement exactly this behaviour, allowing client code to 'Compute a diff' on two arbitrary Collections, one being the original dataset, the other the modified one.

'Executing the diff' on the original will re-generate the new, modified one.

Doing this on the same pair of collections would be so close to pointless to border insanity (one definition of which is expecting a different outcome from the same behaviour) -- the true value is when you have a Client / Server setup and a 'wire protocol' (such as GWT RPC) over which you need to efficiently transfer large datasets:

  1. the Client requires the data set to get started -- the Server computes it, stores it somewhere (let's call this checkPoint) and sends it over the network to the Client:

    Set<DataRecord> checkPoint = server.getData();


  2. the client stores its own local copy of checkPoint, and creates a new Collection (let's call it localCopy) that following a number of user interactions (or whatever), will be modified (records will be added, deleted or edited):

    Set<DataRecord> localCopy = new HashSet<DataRecord>(localCopy);


  3. at some point (for example, after an auto-save timer expires) the client will compute the DiffSet between localCopy and checkPoint, and send the difference to the server:

    DiffSet<DataRecord> diff = new DiffSet<DataRecord>();
    diff.computeDiff(localCopy, checkPoint);
    server.update(diff, callback);


  4. on the server side, we will have stored somewhere safe our checkPoint, which we will now update with the diff'ed data:

    public boolean update(DiffSet<DataRecord> diff) {
    Set<DataRecord> original = retrieveCheckPoint();
    diff.executeDiff(original);
    boolean success = saveCheckPoint(original);
    return success;
    }

  5. finally, in the client's callback (see GWT's RPC mechanism and callbacks) update checkPoint in case of success, or alert the user if things went pear-shape:

    AsyncCallback callback = new AsyncCallback() {
    public void onFailure(Throwable t) {
    // display 'user error'
    GWT.log(t.getClass().getName()+" was thrown because of " +
    t.getLocalizedMessage(), t);
    view.alertUser("Server Error", "Failed to save data on server ");
    }

    public void onSuccess(Boolean result) {
    if (result)
    controller.updateCheckPoint();
    }
    };


There is a lot more info in the javadoc and you can see some typical usage in the associated unit tests: the code is hosted on bitbucket, and can be retrieved using Mercurial:
$ hg clone http://bitbucket.org/marco/diffset/

3 comments:

  1. This is because you do not have operator overloading in Java.

    Look how clean is the implementation in Python:

    >>> x = set(['a', 'b', 'c', 'd'])
    >>> y = set(['a', 'c', 'e', 'f'])
    >>> x - y
    set(['b', 'd'])

    ReplyDelete
  2. As it happens, the "lack" of operators overload is a feature of Java (which I happen to agree with): very much like multiple inheritance, not having it means that the compiler is much simpler to write, the bytecode (probably) cleaner and (most importantly) keeps the code easy to read, understand, maintain and upgrade.

    Don't get me wrong, I use a lot of C++ on internal Google stuff and I love it, but sometimes it's just too much :-)

    As for Python, I don't know enough to comment on your snippet, I just doubt that it would so 'neatly' extend beyond a simple "set of chars" (perhaps it does) but I challenge you to write 'real life' code that handles complex record sets, and then go back to it after a few months and figure out what it was that it was meant to do?

    Dynamic languages (very much like Javascript) are awesome and allow you to do stuff that would be either cumbersome or even impossible with statically typed languages (such as Java and, to a lesser extent, C++) but they do suffer from a 'shelf-life' issue: maintaining and extending them may be difficult over extended periods of time.

    A very interesting alternative would be Scala, a statically typed language, with very exciting features that make it as appealing as a dynamic language.

    ReplyDelete
  3. > I just doubt that it would so
    > 'neatly' extend beyond a simple
    > "set of chars" (perhaps it does)
    > but I challenge you to write
    > 'real life' code that handles
    > complex record sets

    The mechanism for object hashing is exactly the same in place for Java.

    You can have an object as complex as you need it to be and then a "sensible" hash method.

    In the case of the string there is a default "hash" method exactly as it happen for the Java String object.

    As in Java you can create a new object with many fields and a custom hash method.

    The mechanism is exactly the same, the only thing that does change is how many redundant characters you have to type to obtain the same behaviour.

    I generally do agree that "huge" projects tends to be more stable if coded in a static typing language, but my point is that if you are modular you do not need to be "huge".

    Just to give you an example, take a look at the thousands of important Python web applications that are daily "maintained and extended" in a more proficient way of the previous programming paradigms:
    * http://www.djangosites.org/
    * http://www.djangoproject.com/

    ReplyDelete