This post covers chapter 2 of Understanding the Semantic Web: Bibliographic Data and Metadata, Karen Coyle’s report on the potential of the Semantic Web for libraries.
In the first chapter, Karen took us through a detailed analysis of the development of library metadata, culminating in an argument in favour of Semantic Web principles.
The shortcomings of the MARC record
In this second chapter, Karen focuses on the era of machine-readable data, kicking off with our old and trusted friend, the MARC record. She delivers a devastating and detailed critique of the MARC format, highlighting the way many data elements are duplicated within the same record in slightly different formats. This is exemplified by the disconnect between bibliographic and name authority data (with no automatic update when the authority record changes), and by the differing formats of indexed and descriptive fields within the bibliographic record itself.
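To make the authority-update problem concrete, here is a minimal sketch in Python. It is not MARC and not taken from the report: the dictionaries, the `ex:n001` identifier and the `heading_for` helper are all hypothetical, chosen only to contrast a record that copies a heading with one that links to it.

```python
# Hypothetical authority file keyed by a made-up identifier (not a real MARC structure).
authority_file = {"ex:n001": "Melville, Herman"}

# Copy-style record: the heading text is copied into the bibliographic record.
record_with_copy = {"title": "Moby Dick", "author_heading": authority_file["ex:n001"]}

# Link-style record: only the identifier is stored; the heading is looked up when needed.
record_with_link = {"title": "Moby Dick", "author_id": "ex:n001"}


def heading_for(record):
    """Return the author heading, following the link if the record has one."""
    if "author_id" in record:
        return authority_file[record["author_id"]]
    return record["author_heading"]


# The authority record is later revised.
authority_file["ex:n001"] = "Melville, Herman, 1819-1891"

copied = heading_for(record_with_copy)  # still the old text
linked = heading_for(record_with_link)  # picks up the revision
print(copied)
print(linked)
```

The copied heading stays frozen at whatever it was when the record was created, whereas the linked record reflects the change without being touched.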
The relational database era
Coyle then turns her attention to the relational database, where the problem for library data is of a different order, and I found Coyle’s analysis and interpretation here to be particularly impressive:
Database technology was designed for a different kind of data, less textual and more compact, with fewer data elements, and less of a range of content in those elements. Database technology is designed to retrieve. For example, it can retrieve all of the invoices that contain a particular product code. Database management systems work best in environments with a lot of repetition of data values.
In contrast, as Coyle goes on to point out, bibliographic data contains relatively few repeated values; most titles, for example, are unique. Databases are also relatively poor at alphabetical sorting of long strings of text. Another interesting point is that the MARC format is specific to the library domain, which rules out the use of standard business software.
These format and technology shortcomings are exacerbated by problems intrinsic to words themselves:
They can be ambiguous (e.g. Pluto the Disney character versus the orbiting body). They can be incomplete informationally, since many concepts require more than one word (e.g. solar energy, ancient Rome). They are language-based, so a search on computer does not bring up documents with the term ordinateur. And of course keyword searching falls prey to differences in spelling (fiber versus fibre) and errors in spelling or typography (history or histroy).
Taking advantage of new technical possibilities
Having laid out the problems, Karen presents a call to action, to bring library data “into the twenty-first century for machine processing and to improve service to our human end users by being able to offer more functionality in our systems.” She picks up the argument that she introduced in chapter 1, namely that we need to join our bibliographic data to the web, where a near-universal set of information resides, with many semantic relationships to the bibliographic sphere, and of course, where our users are located, for that very reason.
Semantic Web – flavour of the month
Coyle describes the Semantic Web as “the flavor of the month” in technology terms, differentiating between the web of documents, i.e. what we already have, and the web of data, which is fundamentally what the Semantic Web will be.
“The Semantic Web as introduced by Tim Berners-Lee is a linked web of information encoded in documents throughout the web. Achievement of this vision is still over the visible horizon. In practice, however, there is a growing community of people and organizations who have metadata available to them that they have structured using Semantic Web rules. These disparate sets of data can be combined into a base of actionable data. These sets of data are being referred to as ‘linked data’, and the Linked Data Cloud is an open and informal representation of compatible data available over the Internet.”
She goes on to point out that at this juncture, many of the early participants are institutions with already existing scientific data sets. Linked data, as my colleague Richard Wallis frequently points out, is a pragmatic implementation of the semantic web, the latter being the vision that we are moving towards.
In this way, library data will link into broader sources of information, and Coyle returns to the example of Moby Dick by Herman Melville to illustrate that a bibliographic record has powerful relationships with the external world – Herman Melville the author, New England, whaling and so on can all potentially be invoked from Moby Dick the bibliographic work.
The mechanics of the Semantic Web
Karen proceeds to guide the reader through the rudiments of the Semantic Web, beginning with RDF, or Resource Description Framework, the data model for the Semantic Web.
It defines a set of rules for the formal semantics of metadata that is meant for the elements and structure of metadata that will be able to operate on the Semantic Web. Very simply put, in the Semantic Web all data consists of things and relationships between them, with the smallest unit being a statement of the form a thing → with relationship to → another thing
The combination of this model’s near-universal applicability and its readability by machines opens up the potential to create and follow previously unimagined paths in the pursuit of new knowledge at web scale.
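The thing → relationship → thing statement can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses plain tuples rather than a real RDF library, and every `ex:` name, along with the `objects_of` and `follow_path` helpers, is a hypothetical placeholder rather than anything defined in the report.

```python
# Each statement is a (subject, relationship, object) triple.
# All "ex:" names are hypothetical placeholders, not real identifiers.
triples = [
    ("ex:MobyDick", "ex:hasAuthor", "ex:HermanMelville"),
    ("ex:MobyDick", "ex:hasSubject", "ex:Whaling"),
    ("ex:MobyDick", "ex:hasSetting", "ex:NewEngland"),
    ("ex:HermanMelville", "ex:bornIn", "ex:NewYorkCity"),
]


def objects_of(subject, relationship):
    """Return every object linked to `subject` by `relationship`."""
    return [o for s, r, o in triples if s == subject and r == relationship]


def follow_path(start, relationships):
    """Follow a chain of relationships from `start`, one hop per relationship."""
    current = [start]
    for relationship in relationships:
        current = [o for node in current for o in objects_of(node, relationship)]
    return current


authors = objects_of("ex:MobyDick", "ex:hasAuthor")
birthplaces = follow_path("ex:MobyDick", ["ex:hasAuthor", "ex:bornIn"])
print(authors)
print(birthplaces)
```

Following a path is just repeated lookup: from the work to its author, then from the author to a place. Because every statement has the same shape, new statements from other sources can be appended to the list without changing the code that reads them.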
The model is underpinned by identifiers, and this is what gives the Semantic Web its precision, going back to the textual ambiguity problem explored earlier.
The primary rule for the Semantic Web is that identifiers need to be in the form of a Uniform Resource Identifier, which is a particular form of identifier. We don’t need to go into the structure of URIs because it turns out that the common Uniform Resource Locator, URL, is in URI format, and is the preferred identifier to use on the Semantic Web.
We see, then, that the URI provides not only precision but also continuity with the web as we know it today. Coyle does explore the problematic ambiguity of the URI / URL. A URI can point to a definitive description of something, and hence play the part of an identifier; as a URL, it can simply denote a location. They are, of course, identical in format. However, she is clear on the advantage of the location dimension, namely that it enables information to be returned based on the identifier.
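A small sketch may help show both roles of the URI at once. It assumes two made-up URIs on the reserved example.org domain and an in-memory dictionary standing in for the web, so nothing is actually fetched; `find_by_label` and `describe` are hypothetical helpers. The only real library call is `urlparse` from Python’s standard library.

```python
from urllib.parse import urlparse

# Hypothetical URIs on the reserved example.org domain.
# Two different things share the same label, "Pluto".
things = {
    "http://example.org/id/pluto-dwarf-planet": {"label": "Pluto", "type": "orbiting body"},
    "http://example.org/id/pluto-disney-dog": {"label": "Pluto", "type": "Disney character"},
}


def find_by_label(label):
    """Word-based lookup: returns every URI whose label matches."""
    return sorted(uri for uri, data in things.items() if data["label"] == label)


def describe(uri):
    """Stand-in for retrieving the description published at the URI's location."""
    return things[uri]


matches = find_by_label("Pluto")  # ambiguous: two results
planet = describe("http://example.org/id/pluto-dwarf-planet")  # precise: one thing

# The same string also parses as an ordinary web address.
parsed = urlparse("http://example.org/id/pluto-dwarf-planet")
print(len(matches), planet["type"], parsed.netloc, parsed.path)
```

The label alone cannot tell the two Plutos apart, while each URI names exactly one of them and, being a URL in form, also says where a description could be retrieved.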
Coyle backs away from a utopian vision of a single system of identifiers for everything on the Web. Inevitably, she says, different communities will assign identifiers of their own, some overlapping with those of another community. And that is indeed what is coming to pass. It is ironic that just as humanity arrives at a point where a single, absolutist, universal set of definitions becomes technically possible, we retreat to subjectivity and relativism. On the other hand, it is precisely this pragmatic approach which might be the killer asset of the Semantic Web in terms of adoption.
In a similar vein, Coyle addresses the area of controlled vocabulary, explaining that the Semantic Web facilitates both controlled and uncontrolled data.
Karen Coyle uses colour as an example here, and to great effect. She compares the following two sets of examples:
Librarians will readily perceive the first set as being a controlled list of values, and anyone who has worked with software will be aware of the limitations of hard-coded values, as per the second set. Karen also illustrates how the first set offers multi-language flexibility whilst maintaining semantic coherence.
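The general contrast between a controlled list and hard-coded values can be sketched as follows. This is my own illustration, not Coyle’s example: the `ex:colour/...` identifiers, the labels, the sample items and the two helper functions are all hypothetical.

```python
# Hypothetical controlled list: one identifier per colour, with labels in several languages.
colour_labels = {
    "ex:colour/red": {"en": "red", "fr": "rouge", "de": "rot"},
    "ex:colour/blue": {"en": "blue", "fr": "bleu", "de": "blau"},
}

# Items record the identifier, not the display string.
items = [
    {"name": "cover A", "colour": "ex:colour/red"},
    {"name": "cover B", "colour": "ex:colour/blue"},
]


def display_colour(item, language):
    """Look up the label for an item's colour in the requested language."""
    return colour_labels[item["colour"]][language]


def items_with_colour(colour_id):
    """Find items by colour identifier, independent of any language."""
    return [item["name"] for item in items if item["colour"] == colour_id]


# Hard-coded alternative: the same colour entered as different strings no longer matches.
hard_coded = [
    {"name": "cover A", "colour": "red"},
    {"name": "cover C", "colour": "rouge"},
]
hard_coded_hits = [item["name"] for item in hard_coded if item["colour"] == "red"]

print(display_colour(items[0], "fr"))
print(items_with_colour("ex:colour/red"))
print(hard_coded_hits)
```

With identifiers, the display language becomes a presentation choice and the meaning stays fixed; with hard-coded strings, “red” and “rouge” are simply two unrelated values.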
The benefits of the Semantic Web
Coyle itemises a number of general benefits of the Semantic Web. For me the key benefits are its global and extensible qualities, the latter meaning that new data and data types can be added at any time, and data can be endlessly recombined to create new information. But comparing the original web to its semantic younger sibling, the ability to ascribe meaningful relationships between entities is crucial, with its potential to “transform the Web from what it is today to a richer, more meaningful information environment.”
Karen illustrates this with the work of the Library of Congress in creating an online version of the Library of Congress Subject Headings in linked data format.
There is a separate identifier for each entry in the subject authority file, about 350,000 total. Because the identifier is also a URL, the Library has placed information about the subject heading at that location and can display it in formats for human readers or for programs.
In the final posting on this report, I will talk about what might be the ramifications for libraries of the Semantic Web as explored in this report.