Panlibus Blog

A cautionary tale of distributed dependency

I couldn’t help noticing this on the NCIP mailing list last night:

We have just become aware of a problem that seems to be impacting SirsiDynix Unicorn, Polaris, TLC, Endeavor and Relais sites using NCIP. The problem is that the version 1.0 response messages from these various systems are unable to find the dtd at the NISO website and the applications fail.

Translated from the Buzzwordeese:

The modules in several Integrated Library Systems (ILS) that read messages (inter-library loan, inter-library borrowing, etc.) from other ILS installations are all failing. This is because the web site that holds the document defining the protocol for these messages stopped serving that document.

So the systems supported by several library system vendors, and by implication many more than several libraries, were prevented from carrying out a key part of their business because a single document disappeared from a single web site.

A classic ‘all eggs in one basket’ situation. The obvious solution, you would assume, would be to hold many copies of the document so that the systems can always get at one. Unfortunately it is not that simple. That centrally located document is the single source of truth for the agreed standard, so it should be the only one referenced by the checking algorithms in the message-reading software.

As these established standards evolve very slowly, the solution is probably to get the individual library systems to hold a copy locally and only check the master for changes occasionally. That way the systems could continue running on their local copy until the master version was available again.
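As a rough sketch of that idea (not how any of the vendors mentioned actually do it), the logic might look something like the Python below. The function names, the cache path and the one-day refresh interval are all my own assumptions for illustration.

```python
import time
import urllib.request
from pathlib import Path


def fetch_url(url, timeout=5.0):
    """Download the master copy. Raises OSError (e.g. URLError) on failure."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def load_dtd(url, cache_path, max_age_seconds=86400, fetch=fetch_url):
    """Return the DTD bytes, preferring a recent local copy.

    The master at `url` is only consulted when the local copy is older
    than `max_age_seconds` (an assumed one-day default). If the master
    cannot be reached, the stale local copy is used instead of failing.
    """
    cache = Path(cache_path)
    if cache.exists():
        age = time.time() - cache.stat().st_mtime
        if age < max_age_seconds:
            return cache.read_bytes()  # local copy is recent enough

    try:
        data = fetch(url)
    except OSError:
        if cache.exists():
            return cache.read_bytes()  # master is down: carry on with the old copy
        raise  # no local copy at all, so there is nothing to fall back on

    cache.write_bytes(data)  # refresh the local copy from the master
    return data
```

The master remains the single source of truth, since the local copy is only ever written from it, but an outage at the master no longer stops message processing.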

I assume that the affected vendors are engaged in a rapid analysis as to how this situation can be prevented from being repeated. But what lessons can be taken away for the increasing number of similar situations in a Web 2.0 world?

The first lesson is for most applications using XML messaging. Architect your code so that your application does not depend on a single centralised copy of the XML schema always being available.
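One way this could be approached, sketched here with Python's standard xml.sax interfaces, is an entity resolver that answers requests for a known schema URL from a copy shipped with the application. The catalog contents, the example.org URL and the class name are hypothetical, chosen only for illustration.

```python
import io
from xml.sax.handler import EntityResolver
from xml.sax.xmlreader import InputSource

# Hypothetical catalog: remote system identifier -> DTD text bundled with the app.
LOCAL_CATALOG = {
    "http://example.org/schemas/messages_v1.dtd": b"<!ELEMENT Message (#PCDATA)>",
}


class LocalCatalogResolver(EntityResolver):
    """Serve known external entities from local copies instead of the network."""

    def __init__(self, catalog):
        self.catalog = catalog

    def resolveEntity(self, publicId, systemId):
        local = self.catalog.get(systemId)
        if local is None:
            # Not in the catalog: hand the identifier back unchanged so the
            # parser resolves it in its usual way.
            return systemId
        source = InputSource(systemId)
        source.setByteStream(io.BytesIO(local))
        return source
```

A resolver like this can be attached to a SAX parser with setEntityResolver. Whether a given parser actually consults it for the document type declaration depends on that parser and its settings, so treat this as a starting point rather than a drop-in fix.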

The second lesson is to ensure that key data is replicated in a reliable, robust way. Take the example of the internationally distributed networks of data that Google and Amazon use to spread the risk of failure and overload.

Finally, if a change to a configuration could affect many systems, ensure that the change can be easily made once in a logically central place. This one seems to contradict my previous thoughts, except that I use the phrase ‘logically central’: in application terms it is held in one place, but physically it could be held in many places. Store it in Amazon S3, for instance, and it could reliably be anywhere on the planet.

These principles are reflected in the core Directory components of the Talis Platform. The Directory provides a central place to store and serve information about library and other collections, their locations, and the protocols used to access those collections. If information about a collection changes (be it a correction to the geo-location for a library building so that it now appears in the right place on a map, or the fact that the search interface now runs at a different Internet address), that change can be made by anyone who is aware of the correct information, via the Directory User Interface. The change is then immediately reflected in all the systems that use it.

Obviously, as with all of the Platform components, the Directory has been architected so that it can be scaled and widely distributed across many locations.
