Panlibus Blog

Archive for the 'Open Data' Category

What interests 250+ librarians at 8:30 on a Sunday morning

Linked Data, that’s what! 

I must admit I was a little skeptical of the timing when I accepted the invitation to provide the keynote for a Linked Data session – on the last day of IFLA 2010 – at 8:30 in the morning – in August – on a Sunday.  Who was going to want to get up at that time, on the day they were probably going to leave beautiful Gothenburg, to hear me witter on about the Semantic Web and the obvious benefits of Linked Data for libraries? A few minutes before the start, I was beginning to think my skepticism was well founded, viewing the acres of empty seats laid out in their menacing ranks in front of me. But then, almost as if from nowhere, the room rapidly filled, and by the time I took the stage we had something approaching a full house.  As you can see from my iPhone snap below, we ended up with a significant group (I lost count at about 250) of interested librarians.

250+ Librarians in Gothenburg

So was it worth them turning up at such an unsociable time?  I obviously can’t speak for my session, but I believe it was well worth it.  We had a series of talks which ranged from the in-depth technical/ontological end of the spectrum to a rousing plea to open up your data now – and not to hamper it with too much licensing.

First up after my session was Gordon Dunsire from the University of Strathclyde, who gave us some in-depth reasoning as to why we need complex, detailed ontologies based upon standards like RDA, FRBR, and FRAD to describe library resources in RDF for the Semantic Web.  When it comes to representing the full detail that catalogers provide, and want to provide, for resource description, I agree with him.  I also believe that we need to temper that detailed view by including more generic ontologies alongside. People from outside of the library world, dipping into library data [with more ways to describe a title than there are flavors of ice cream], will back off and not link to it unless they can find a nice friendly dc:title or foaf:name that they understand.
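To make that concrete, here is a minimal sketch (Python and rdflib) of the kind of dual description I have in mind. The detailed property namespace below is a placeholder, not a real RDA element set – the point is simply that a generic dc:title sits alongside the richer description, so outsiders have something familiar to grab hold of.

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DC

# Placeholder namespace standing in for a detailed, RDA-level vocabulary.
LIB = Namespace("http://example.org/lib-ontology/")

g = Graph()
book = URIRef("http://example.org/books/1")  # hypothetical resource URI

# The rich, cataloguer-level description...
g.add((book, LIB.titleProperOfResource, Literal("Hamlet: Prince of Denmark")))
# ...and the friendly generic property alongside it.
g.add((book, DC.title, Literal("Hamlet")))

# An outside consumer needs to know nothing about the detailed vocabulary:
for title in g.objects(book, DC.title):
    print(title)
```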

Some of the other speakers that I caught included Patrick Danowski’s entertaining presentation entitled “Step 1: Blow up the silo!”. He took us through the possible licenses to use for sharing data, only to conclude that the best approach was totally open public domain.  He then went on to recommend CC0 and/or PDDL as the best way to indicate that your data is open for anyone to do anything with.

Jan Hanneman from the German National Library delivered an interesting description [pdf] of the way they have been publishing their authority data as Linked Data, and the challenges they met on the way.  These included legal and licensing issues around what they could publish and under what terms.  Scalability of their service will be another key issue once they move beyond authority data.

All in all it was an excellent Sunday morning in Gothenburg.  I presume the organizers of IFLA 2011 will take note of the interest and build a larger, more conveniently timed slot into the programme for Linked Data.

Note: My presentation slides can be viewed on Slideshare and downloaded in PDF form.

Will Linked Data mean an early end for Marc & RDA?

For the uninitiated, NGC4LIB is a library-focused mailing list which has a reputation for often engaging in massive discussions and disagreements around the minutiae of future cataloguing and library-focused metadata practices.  They have recently been involved in one of these great debates, stimulated by the comments of Sir Tim Berners-Lee in a recent interview.  As is often the case on this list, the debate wandered well off topic into the realms of FRBR and its alternatives before being brought back on topic by Jim Weinheimer, who started the conversation in the first place.

A statement in Jim’s contribution caught my eye:

Implementing linked data, although it would be great, is years and years away from any kind of practical implementation

Implementing linked data is already well underway with many groups across the globe.  For instance, there are a couple that we at Talis are closely involved with.  Following on from Sir Tim’s interview comments, the British Government is currently running a closed (soon to be opened) beta of data.gov.uk.  Through this site they are not only opening up data in many forms such as CSV, like their American cousins at data.gov, but are also starting to encode it in RDF and publish it via the Talis Platform, which provides a SPARQL (the query language of the Linked Data web) endpoint.  This approach not only lets anyone download the raw data, but also enables them to query it for whatever they have in mind. If you want a sneak preview of how such data is queried, take a look at some of these examples.  In a similar vein, metadata from BBC programmes and music is being harvested into Talis Platform stores.  Again these are open to anyone to innovate with – check out these screencasts to see some of the early possibilities.
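For a flavour of what querying such an endpoint looks like, here is a hedged sketch using Python over the standard SPARQL protocol. The endpoint URL and vocabulary are placeholders, not the real data.gov.uk service addresses – substitute the real ones from the examples linked above.

```python
import requests

# Placeholder endpoint and vocabulary, for illustration only.
ENDPOINT = "http://example.org/sparql"
QUERY = """
SELECT ?school ?name WHERE {
  ?school a <http://example.org/vocab/School> ;
          <http://example.org/vocab/name> ?name .
}
LIMIT 10
"""

# The standard SPARQL protocol: the query goes in a 'query' parameter,
# and JSON results are requested via the Accept header.
resp = requests.get(ENDPOINT, params={"query": QUERY},
                    headers={"Accept": "application/sparql-results+json"})
resp.raise_for_status()

for row in resp.json()["results"]["bindings"]:
    print(row["name"]["value"])
```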

Ah, but that is not bibliographic data, I hear someone cry – it’ll never catch on in libraries.  I get the impression from some comments on the NGC4LIB list that it will not be possible for ‘our’ data to participate in this Linked Data web until ‘we’ have predicted all possible uses for it, analysed them, and developed a metadata standard to cope with every eventuality.  There are already a few examples of the library world engaging with RDF and Linked Data, one obvious one being the Library of Congress with LCSH, another the National Library of Sweden.  Neither of these examples encodes the kind of detail you would expect in a Marc record; they are using ontologies to describe associated concepts such as subjects.

There has been some ontology development towards this larger goal with Bibo (the Bibliographic Ontology Specification).  Although not there yet, Bibo is good enough to be used in live applications wishing to encode bibliographic data.  One such example is Talis Aspire.  Underpinned by the same Platform as the UK Government and BBC Linked Data services, it uses the Bibo ontology to describe resources in an academic context.
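As an illustration of the sort of description Bibo supports, here is a minimal sketch (Python and rdflib) of a single resource. The resource URI and ISBN are placeholders; the Bibo and Dublin Core terms are from the published vocabularies. This is the general pattern, not how Aspire itself is implemented.

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS, RDF

BIBO = Namespace("http://purl.org/ontology/bibo/")

g = Graph()
g.bind("bibo", BIBO)
g.bind("dcterms", DCTERMS)

item = URIRef("http://example.org/resources/42")  # hypothetical resource URI
g.add((item, RDF.type, BIBO.Book))
g.add((item, DCTERMS.title, Literal("Linked Data for Libraries")))
g.add((item, BIBO.isbn13, Literal("9780000000000")))  # placeholder ISBN

print(g.serialize(format="turtle"))
```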

Alongside data.gov.uk there is a Google Group conversation taking place. The refreshing part of this conversation is that it is between the producers of the data sets, those developing the way it should be encoded into RDF, and those who want to consume it.  Several times you will see a difference of opinion between those who want to describe the data to its fullest and those who wish to extract the most value from it: “I agree that is a cleaner way of encoding, but can you imagine how complex the query will be to extract what I want!”.  This approach is not unusual in the Linked Data world, where producers and consumers get together, pragmatically evolving a way forward.  Dataincubator.org is an open place where such pragmatic development and evolution is taking place.  Check out examples of a subset of Open Library data (note this is an example of data, not a user interface).

Another bibliographically focused experiment can be found at semanticlibrary.org. From some of the example links on the home page, you can see that building in this way enables very different ways of exploring metadata: people, subjects, publishers, works, editions, series – all equally valid starting points to explore from.

Doth the bell toll for Marc and RDA?
Not for a long old time – ontologies like Bibo, and the results of work at Dataincubator.org and semanticlibrary.org, may well lead to more open, useful, and most importantly linked, access to data previously limited to library search interfaces.  That data has to come from somewhere though, and the massive global network of libraries encoding their data using Marc, and maybe soon RDA, is ideally placed to continue producing rich bibliographic metadata.  Metadata to be fed into the Linked Data web in the most appropriate form for that purpose.  There will continue to be a place for current cataloguing practices and processes for a significant period – supporting and enabling the bibliographic part of the Linked Data web, not being replaced by it.

No doubt the NGC4LIB conversation on this topic will continue. Regardless of how it progresses, there is a current need and desire for bibliographic data in the linked data web.  The people behind that desire, and the innovation to satisfy it, may well have come up with a solution satisfactory for them whilst we are still talking.

JISC Grasp the Marc Record Re-use Legality Nettle

The JISC Information Environment Team have just announced a study to explore the legal and ownership implications of making catalogue records available to others when this involves copying them or transferring them into different formats.

The JISC has just commissioned a study to explore some of these issues as they apply to UK university libraries and to provide practical guidance to library managers who may be interested in making their catalogue records available in new ways. Outcomes are expected by the end of 2009.

The specific objectives of the study are to:
• Establish the provenance of records in the catalogues of a small but representative sample of UK university libraries and in the national Copac and SUNCAT catalogues;
• Identify any rights or licences applying to the records and assess how these apply to re-use in the Web environment. This work should include clarifying the legal status of MARC records and copies of MARC records, and the legal implications of translating records between different formats such as MARC and MODS XML;
• Provide practical guidance to UK university libraries about the legal issues to be considered in making catalogue records available for re-use in Web applications such as social networking sites – drawing on the findings from the sample;
• Make recommendations to the JISC and the UK higher education community about any initiatives which could usefully be undertaken to facilitate the re-use of catalogue records in Web applications in a way which respects legal rights and business interests.

The core nugget of this is clarifying the legal status of MARC records and copies of MARC records.  Without establishing that, anything else would be building castles on sand.

One of the many things that was never fully clarified in the OCLC record re-use saga earlier in the year was the legal status of a Marc record – can it, or parts of it, be considered a creative work and therefore be subject to copyright and a concept of ownership?

I wish whoever is undertaking the JISC study (the announcement does not indicate any study group members) well as they set foot into this minefield of assumption, traditional practice, legal interpretation, and commercial interest and bias.  Let’s hope they do a thorough job and carry enough weight from legal, library, and publishing backgrounds to deliver advice and opinion that will clarify these particularly murky waters well beyond the UK university sector.

Library of Congress launch Linked Data Subject Headings

Back in December I was very critical of the Library of Congress for forcing the take-down of the Linked Data service at lcsh.info.  LoC employee, and Talking with Talis interviewee, Ed Summers had created a powerful and useful demonstration of how applying Linked Data principles to a LoC dataset such as the Library of Congress Subject Headings could deliver an open asset that adds value to other systems.  Very rapidly after its initial release, another Talking with Talis interviewee, Martin Malmsten from the Royal Library of Sweden, made use of the links to the LCSH data.  Ed was asked to take the service down, ahead of the LoC releasing their own equivalent in the future.

I still wonder at the LoC approach to this, but that is all water under the bridge now, as they have launched their service, under the snappy title of “Authorities & Vocabularies”, at http://id.loc.gov/authorities/.

The Library of Congress Authorities and Vocabularies service enables both humans and machines to programmatically access authority data at the Library of Congress via URIs.

The first release under this banner is the aforementioned Library of Congress Subject Headings.

As well as delivering access to the information via a Linked Data service, they also provide a search interface, and a ‘visualization’ via which you can see the relationships, both broader and narrower, between the terms held in the data.

To quote Jonathan Rochkind “id.loc.gov is AWESOME”:

Not only is it the first (so far as I know) online free search and browse of LCSH (with in fact a BETTER interface than the proprietary for-pay online alternative I’m aware of).

But it also gives you access to the data itself via BOTH a bulk download AND some limited machine-readable APIs. (RSS feeds for a simple keyword query; easy lookup of metadata about a known-item LCSH term, when you know the authority number; I don’t think there’s a SPARQL endpoint? Yet?).

On the surface, to those not yet bought into the potential of Linked Data, and especially Linked Open Data, this may seem like an interesting but not necessarily massive leap forward.  I believe that what underpins the fairly simple functional user interface they provide will gradually become core to bibliographic data becoming a first-class citizen in the web of data.

Overnight this URI ‘http://id.loc.gov/authorities/sh85042531’ has become the globally available, machine- and human-readable, reliable source for the description of the subject heading ‘Elephants’, containing links to its related terms (in a way that both machines and humans can navigate).  This means that system developers and integrators can rely upon that link to represent a concept, not necessarily the way they want to [locally] describe it.  This should facilitate the ability for disparate systems and services to simply share concepts, and therefore understanding – one of the basic principles behind the Semantic Web.
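As a sketch of what that machine-readable reliability enables, here is how a program might dereference the Elephants URI and walk its relationships, assuming the service’s content negotiation serves RDF when asked (the data is published using the SKOS vocabulary). Python with requests and rdflib:

```python
import requests
from rdflib import Graph, Namespace, URIRef

SKOS = Namespace("http://www.w3.org/2004/02/skos/core#")
uri = URIRef("http://id.loc.gov/authorities/sh85042531")  # 'Elephants'

# Ask the URI for RDF rather than HTML via content negotiation.
resp = requests.get(str(uri), headers={"Accept": "application/rdf+xml"})
resp.raise_for_status()

g = Graph()
g.parse(data=resp.text, format="xml")

for label in g.objects(uri, SKOS.prefLabel):
    print("Label:", label)
for broader in g.objects(uri, SKOS.broader):
    print("Broader term:", broader)  # navigable by machines as well as humans
```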

This move by the LoC has two aspects to it that should make it a success.  The first one is technical.  Adopting the approach, standards, and conventions promoted by the Linked Data community ensures a ready-made developer community to use and spread the word about it.  The second is openness.  Anyone and everyone can take advantage of this valuable asset without first having to ask “is it OK to use this stuff?”.  Many in the bibliographic community, who seem to spend far too much time on licensing and logins, should watch and learn from this.

A bit of a bumpy ride to get here, but nevertheless a great initiative from the LoC that should be welcomed – one that I hope they and many others will build upon in many ways.  Bring on the innovation that this will encourage.

Image from the Library of Congress Flickr photostream.

OCLC Take aim at the library automation market from the Cloud

Over the last few years OCLC, the US-based not-for-profit cataloguing cooperative, has been acquiring many for-profit organisations from the world of library automation, such as PICA, Fretwell-Downing Informatics, and Sisis Information Systems.

About fifteen months ago, Andrew Pace joined OCLC from North Carolina State University Libraries and was given the title of Executive Director, Networked Library Services.  After joining OCLC, Andrew, who had a reputation for promoting change in the library technology sphere, almost disappeared from the radar.

Putting these two things together, it was clear that the folks from Dublin were up to something beyond just owning a few non-US ILS vendors.

From a recent post on Andrew’s Hectic Pace blog, and press releases from OCLC themselves, we now know what that something was.  It is actually a few separate things, but the overall approach is to deliver the functionality traditionally provided by the ILS vendors (Innovative, SirsiDynix, Polaris, Ex Libris, etc., etc.) as services from OCLC’s data centres.  This moves the OCLC reach beyond cataloguing into the realms of acquisitions, license management, and even circulation.

The idea of breaking up the monolithic ILS (or LMS as UK libraries refer to it) is not a new one – as followers of Panlibus will know. Equally, delivering functionality as Software-as-a-Service (SaaS) has been native to the Talis Platform since its inception.  It is this that underpins the already established SaaS applications Talis Prism, Talis Aspire, and Talis Engage.

Both OCLC, with WorldCat Local, and Talis, with Prism, have been delivering public discovery interfaces (OPACs) as SaaS applications for a while now, and ‡biblios.net have recently launched their social cataloguing as a service [check out the podcast with Josh Ferraro], but this is the first significant announcement of circulation as a service that I am aware of.

The move to Cloud Computing, with its obvious benefits of economies of scale and the removal of the need for libraries to be machine minders and data centre operators, is a reflection of a much wider computing industry trend.  The increasing customer base of Salesforce.com, and the number of organisations letting Google take care of their email, and even their whole office operation (such as the Guardian), are testament to this trend.  So the sales pitch from OCLC, and others including ourselves here at Talis, about the total cost of ownership benefits of a Cloud Computing approach is supported and validated industry-wide.

So, as a long-time predictor of computing transforming from a set of locally managed and hosted applications to services delivered as utilities from the cloud, mirroring the same transformation for electricity generation and supply a century ago, I welcome this initiative by OCLC.  That’s not to say that I don’t have reservations. I do.

The rhetoric emanating from OCLC in these announcements is reminiscent of the language of the traditional ILS vendors, who are probably very concerned by this new and different encroachment on to their market place.  There is an assumption that if you get your OPAC from WorldCat (and as a FirstSearch subscriber, with this on-the-surface ‘free’ offer, you are probably thinking that way), you will get circulation and cataloguing and all the rest from a single supplier – OCLC.

The question that comes to mind, as with all ILS systems, is: will you be able to mix and match different modules (or in this case services) from different suppliers, so that libraries have the choice of what is best for them?  Will OCLC open up the protocols (or, to be technical for a moment, the hopefully RESTful APIs) to access these application/service modules so that they can be used not only with other OCLC services but with services/applications from Open Source and other commercial vendors?  Will they take note of, or even adopt, the recommendations that will come from the OLE group [discussed in last month’s Library 2.0 Gang], which should lead towards such choice?

Some have also expressed concern that a library going down the OCLC cloud services route will be exposing itself to the risk of ceding to OCLC control of how all its data is used and shared, not just the bibliographic data that has been at the centre of the recent storm about record re-use policies.  Against that background, one can but wonder what OCLC’s reaction would be to a library’s request to openly share circulation statistics from the use of its OCLC-hosted circulation service.

This announcement brings to the surface many thoughts, issues, concerns, technological benefits, and questions that will no doubt rattle around the library podcasting and blogosphere for many months to come.  I also expect that in the board rooms of the well-known commercial [buy our ILS and a machine to run it on] providers, there will be many searching questions asked about how to deal with the 500lb [not-for-profit] gorilla that has just moved from the corner of the room to start dining from their [for-profit] table.

This will be really interesting to watch…..

The composite image was created using pictures published on Flickr by webhamser and Crystl.

UKSG09 Uncertain vision in sunny Torquay

Glorious sunshine greeted the opening of the first day of UKSG 2009 in Torquay yesterday.  The stroll along the seafront from the conference hotel (Grand in name and all facilities, except Internet access – £1/minute for dialup indeed!) was in delightfully sharp contrast to the often depressing plane and taxi rides to downtown conference centres.

The seaside theme was continued with the bright conference bags – someone had obviously got hold of a job lot of old deckchair canvas.  700-plus academic librarians, publishers, and supplier representatives settled down in the auditorium of the Riviera Centre to hear about the future of their world.

The first keynote speakers were very different in topic and delivery, but all three left you with the impression of change coming over the next few years, the shape of which they were not totally sure.

First up was Knewco Inc’s Jan Velterop, whose pitch was a somewhat meandering treatise on the wonders and benefits of storing metadata in triples – something he kept saying he would explain later.  The Twitter #uksg09 channel was screaming “when is he going to tell us about triples” and “what’s a triple” whilst he was talking.  He eventually got there, but I’m not sure how many of the audience understood the massive benefits of storing and linking data in triples that we at Talis are fully aware of.  Coincidentally, for those who did get his message, I was posting about the launch of the Talis Connected Commons for open, free storage of data – in triples, in the Talis Platform.

Next up was Sir Timothy O’Shea from the University of Edinburgh, who talked about the many virtual things they are doing up in Scotland.  You can take your virtual sheep from your virtual farm to the virtual vet, and even on to a virtual post mortem.  His picture of the way information technology is playing its part in changing life at the university, apart from being a great sales pitch for it, left him predicting that this was only the early stage of a massive revolution.  As to where it was going to lead us in a few years, he was less clear.

Joseph Janes, of the University of Washington Information School, was one of those great speakers who dispense with any visual aids or prompts, and he delivered a very entertaining 30 minutes comparing our entry into this new world of technology-enhanced information access with his experience as an American wandering around a British seaside town.  His message: expect the next few years to feel very similar on the surface, as we will recognise most of the components, but to actually be very different when you analyse them.  As an American he recognises cars, buses, adverts, and food, but in Britain they travel on the wrong side of the road, are different shapes, and are products he doesn’t recognise.  As we travel into an uncertain but exciting future, don’t be fooled by recognising a technology – watch how it is being used.

A great start to the day, which included a good break-out session from Huddersfield’s Dave Pattern. He ended his review of OPACs, and predictions about the development of OPAC 2.0 and beyond, with a heads-up about my session today, which caused me to spend a couple of hours in the hotel bar, the only place with Wifi, tweaking my slides.  It would be much easier to follow Mr Janes’ example and deliver my message off the cuff without slides – not this time perhaps ;-)

Looking forward to another good day – even if the sun seems to have deserted us.

Free hosting for Open Data

Over on our sister blog Nodalities, my colleague Leigh Dodds has announced the launch of the Talis Connected Commons.

True to our desire to see a truly open web of data, under the terms of the Connected Commons scheme Talis is offering free access to the [Talis] Platform for the purposes of hosting public domain data. And the offer isn’t just limited to free hosting: the data access services, including access to a public SPARQL endpoint, are also freely available.

The terms of the offer are as follows: if you own, or are creating, a public domain dataset then you can store that data in the Platform as RDF, for free. We’re setting an initial cap of 50 million triples on each dataset, but that should be plenty of space in which to collect some really interesting data.

So have you got, or want to create, up to 50 million triples you would like to put in the public domain, along with up to 10Gb of content?  Yes?  Well, get yourself over to the Connected Commons page and check whether you qualify.  There is also a FAQ to give you more detail.
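For the curious, a hedged sketch of what contributing data might look like over HTTP. The store URL pattern, the /meta endpoint, and the credential handling here are my assumptions about the Platform API, not confirmed details from the announcement – check the Connected Commons documentation before relying on any of it.

```python
import requests

# Assumed URL pattern for a Platform store's metabox; 'your-store' is a
# placeholder, and the authentication scheme is illustrative only.
STORE_META = "http://api.talis.com/stores/your-store/meta"

with open("dataset.rdf", "rb") as f:  # your public-domain data as RDF/XML
    rdfxml = f.read()

resp = requests.post(STORE_META, data=rdfxml,
                     headers={"Content-Type": "application/rdf+xml"},
                     auth=("username", "password"))  # placeholder credentials
print(resp.status_code)
```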

The Connected Commons is for all sorts of data, but I’m positive that the library world provides a rich source of such open data sets – get in there guys and get your data openly linked and out there.

 

Code4lib final day in Providence – looking forward to Asheville

As always, a slightly shorter day for the last day of the conference, but no less stimulating.  Talis CTO Ian Davis provided the keynote for the day, entitled “If you love something… set it free”.

He provided a broad view of how the linking capability of the web has changed the way things are connected and, with participation, has caused network effects to result.  But that is still at the level of linking documents together.  The Semantic Web fundamentally changes how information, machines, and people are connected.  Information semantics have been around for a while, but it is this coupling with the web that makes the difference.  He conjectured that data outlasts code, meaning that Open Data is more important than Open Source; that there is more structured data than unstructured, therefore people who understand structure are important; and that most of the value in data is unexpected or unintended, so we should engineer for serendipity.

He gave a couple of warnings: be very clear about how you licence your data, so that people know what they can and can’t do with it, and take care over how you control the use of the personal parts of data.  He made it clear that we have barely begun on the road, but the goal is not to build a web of data – it is to enrich lives through access to information.  Making the world a better place.

Edward M. Corrado of Binghamton University gave us an overview of the Ex Libris Open Platform strategy.  This was the topic of a previous Talking with Talis podcast with Ex Libris CSO Oren Beit-Arie.  Edward set the scene as to why APIs are important for getting data out of a library system.  He then explained the internal (formalised design, documentation, implementation, and publishing of APIs) and external (published documentation, hosted community code, tools, and opportunities for face-to-face meetings with customers) initiatives from Ex Libris.  The fact that you needed to log in to an open area raised, as it has before, some comments on the background IRC channel.

The final two full presentations of the day demonstrated two very different results of applying linked data to services. Adam Soroka, of the University of Virginia, showed how geospatial data could be linked to bibliographic data, with fascinating results, whereas Chris Beer and Courtney Michael, from WGBH Media Library and Archives, showed some innovative, simple techniques for representing relationships between people and data.

The day was drawn to a close with a set of 5-minute lightning talks, a feature of all three days.  These lightning talks are one of the gems of the Code4lib conference – a rapid dip into what people are doing or thinking about.  They are unstructured, and folks put their name on a list to talk about whatever they want.  The vast majority of them are fascinating to watch.

During the conference the voting for Code4lib 2010 was completed so we now know that it will all take place again next year in Asheville, NC.  From the above picture, I can’t wait.


Dave Pattern challenges libraries to open their goldmine of data

The simple title of Dave’s recent blog post ‘Free book usage data from the University of Huddersfield’ hides the significance of what he is announcing.

I’m very proud to announce that Library Services at the University of Huddersfield has just done something that would have perhaps been unthinkable a few years ago: we’ve just released a major portion of our book circulation and recommendation data under an Open Data Commons/CC0 licence. In total, there’s data for over 80,000 titles derived from a pool of just under 3 million circulation transactions spanning a 13 year period.

13 years’ worth of library circulation data opened up for anyone to use – he is right about it being unthinkable a few years ago.  I suggest that for many it is probably still unthinkable now – and to them I would ask: why not?

In isolation the University of Huddersfield’s data may only be of limited use, but if others did the same, the potential for trend analysis, and the ability to offer recommendations and who-borrowed-this-borrowed-that services, could be significant.
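To make the point about recommendations concrete, here is a toy sketch of the who-borrowed-this-borrowed-that idea in Python. The transaction format is entirely hypothetical – Dave’s released data is aggregated differently – but it shows how little code stands between raw circulation data and a recommendation.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical input: (borrower_id, item_id) circulation transactions.
transactions = [
    ("u1", "book-a"), ("u1", "book-b"),
    ("u2", "book-a"), ("u2", "book-b"), ("u2", "book-c"),
    ("u3", "book-b"), ("u3", "book-c"),
]

items_by_borrower = defaultdict(set)
for borrower, item in transactions:
    items_by_borrower[borrower].add(item)

# Count how often each pair of items shares a borrower.
co_counts = defaultdict(int)
for items in items_by_borrower.values():
    for a, b in combinations(sorted(items), 2):
        co_counts[(a, b)] += 1

# Recommend the items most often co-borrowed with "book-a".
target = "book-a"
recs = sorted(((count, a if b == target else b)
               for (a, b), count in co_counts.items() if target in (a, b)),
              reverse=True)
print([item for _, item in recs])  # ['book-b', 'book-c']
```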

If you have 14 minutes to spend, I would recommend viewing Dave’s slidecast from the recent TILE project meeting, where he announced this, so you can see how he uses this data to add value to the Huddersfield University search experience.

Patrick Murray-John picked up on Dave’s announcement and within a couple of days had produced an RDF-based view of this data – I recommend you download the Tabulator Firefox plug-in to help you navigate it.

Patrick was alerted to Dave’s announcement by Tony Hirst who amplified Dave’s challenge “DON’T YOU DARE NOT DO THIS…”

As Dave puts it, your library is sitting on a goldmine of useful data that should be mined (and refined by sharing with that of other libraries).  A hat tip to Dave for doing this, and another one for using a sensible open licence to do it with.

Picture published by ToOliver2 on Flickr