Thursday, June 11, 2009

A DiffSet generic collection

Working on a GWT project that requires fairly frequent RPC calls to the server to exchange data I was confronted with the following problem:


  • the data set itself was likely to be of fairly large size -- not massive (we are not talking tens of MBs) but not smaller either: to give a rough estimate, assume anywhere up to 500-600 kB;

  • in addition, the time taken to 'process' the full dataset (both on the server, but, more importantly on the client) was likely to be such that a user would be faced with frequent, annoying interruptions;

  • however, the actual ('incremental') changes to the data are very likely to be minimal: roughly, on a data set that may contain up to several '000s entries, probably less than 10 (and usually only 2-3) would be changed between round-trips.
I was therefore rather surprised to find that nobody seemed to have already implemented a fairly simple solution to the problem: a DiffSet<T> that essentially 'captures' the incremental changes between two Set<T> as a sequence of Commands (which I called DiffCommand).

If you are not familiar with the Command pattern, I suggest you have a good read of the Head First Design Patterns book.

I have thus created a few classes, that implement exactly this behaviour, allowing client code to 'Compute a diff' on two arbitrary Collections, one being the original dataset, the other the modified one.

'Executing the diff' on the original will re-generate the new, modified one.

Doing this on the same pair of collections would be so close to pointless to border insanity (one definition of which is expecting a different outcome from the same behaviour) -- the true value is when you have a Client / Server setup and a 'wire protocol' (such as GWT RPC) over which you need to efficiently transfer large datasets:

  1. the Client requires the data set to get started -- the Server computes it, stores it somewhere (let's call this checkPoint) and sends it over the network to the Client:

    Set<DataRecord> checkPoint = server.getData();


  2. the client stores its own local copy of checkPoint, and creates a new Collection (let's call it localCopy) that following a number of user interactions (or whatever), will be modified (records will be added, deleted or edited):

    Set<DataRecord> localCopy = new HashSet<DataRecord>(localCopy);


  3. at some point (for example, after an auto-save timer expires) the client will compute the DiffSet between localCopy and checkPoint, and send the difference to the server:

    DiffSet<DataRecord> diff = new DiffSet<DataRecord>();
    diff.computeDiff(localCopy, checkPoint);
    server.update(diff, callback);


  4. on the server side, we will have stored somewhere safe our checkPoint, which we will now update with the diff'ed data:

    public boolean update(DiffSet<DataRecord> diff) {
    Set<DataRecord> original = retrieveCheckPoint();
    diff.executeDiff(original);
    boolean success = saveCheckPoint(original);
    return success;
    }

  5. finally, in the client's callback (see GWT's RPC mechanism and callbacks) update checkPoint in case of success, or alert the user if things went pear-shape:

    AsyncCallback callback = new AsyncCallback() {
    public void onFailure(Throwable t) {
    // display 'user error'
    GWT.log(t.getClass().getName()+" was thrown because of " +
    t.getLocalizedMessage(), t);
    view.alertUser("Server Error", "Failed to save data on server ");
    }

    public void onSuccess(Boolean result) {
    if (result)
    controller.updateCheckPoint();
    }
    };


There is a lot more info in the javadoc and you can see some typical usage in the associated unit tests: the code is hosted on bitbucket, and can be retrieved using Mercurial:
$ hg clone http://bitbucket.org/marco/diffset/