Things we have learned building an Institutional Repository

Publications Repository     2014•02•13     conor

As mentioned in last month’s post, we recently “soft-launched” our new Institutional Repository (collections.unu.edu). We are officially launching the site this week, so it feels like a good time to share some of the lessons we have learned on this project.

When we tell most people about the repository, they generally imagine a database where we just store PDFs. If only it were that simple. Over the past two years the team has been working hard to better understand this area and to develop a tool that can be used both inside and outside the UNU. We have learned a lot in that time, and here is a small selection of the highlights.

Metadata, Metadata, Metadata

One thing we have in common with the NSA is our hunger for metadata. We want not only the title of an article, but also the publisher’s name, the date it was published, the keywords associated with it and any identifying numbers attached to it. The more metadata we have, the better we can represent the document in the repository. Obviously the point of the repository is to store digital objects, but a document is of no use if we are unable to categorize it.

We spent a lot of time reviewing what metadata the UNU already holds on its publications and comparing this against the metadata we wanted. Generally these lists matched up, but in some cases we needed to add new elements (e.g. internal project ID numbers). We worked closely with the Library to ensure we were following best practices, and we also looked at how other repositories managed their metadata and which elements they were storing. What we have come away with is a robust set of metadata elements that allows us to accurately model our documents, enabling future researchers to categorize and find our output.
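
As a rough illustration, here is the shape of record we aim to capture for each document. This is only a sketch: the field names loosely follow Dublin Core, and the internal project ID element is a hypothetical example rather than our actual schema.

```python
# Illustrative sketch of a repository metadata record. Field names loosely
# follow Dublin Core; "unu_project_id" stands in for the internal
# identifiers we added and is hypothetical.
record = {
    "title": "Example Working Paper on Water Security",
    "creator": ["Doe, Jane"],
    "publisher": "United Nations University Press",
    "date_issued": "2013-11-01",
    "subject": ["water", "governance"],           # keywords
    "identifier": {"isbn": "978-92-808-0000-0",   # identifying numbers
                   "unu_project_id": "P-1234"},   # hypothetical internal ID
}

def is_complete(rec, required=("title", "creator", "publisher", "date_issued")):
    """Check that the core elements needed to categorize a document exist."""
    return all(rec.get(field) for field in required)

print(is_complete(record))  # True
```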

Taxonomy/Vocabulary/Nomenclature

There are many ways to say the same thing, and not all of them are wrong. In the same vein, we discovered that many people within the UNU were publishing similar types of documents categorized under different names. At the outset of the project we ran a survey across all our institutes, looking for the different types of documents we were publishing. This resulted in a list of more than 30 document types, ranging from the obvious Book and Book Chapter all the way down to fine distinctions such as Policy Brief vs. Policy Report.

Once again we turned to the Library to help us rationalize this list down to a core set of generic types. Eventually we arrived at a set of 8 document types (Book, Book Chapter, Report, Conference Proceeding, Article, Conference Publication, Thesis and Serial) that we feel covers the range of publications the UNU produces. Each type allows for multiple sub-types, making the taxonomy flexible enough to support additional document types in the future.
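
To make that concrete, a minimal sketch of such a two-level taxonomy might look like the following. The eight top-level types are ours, but the sub-type lists shown here are purely illustrative, not our full taxonomy.

```python
# Sketch of a two-level document taxonomy: eight core types, each of which
# can carry sub-types. The sub-types below are illustrative examples only.
DOCUMENT_TYPES = {
    "Book": [],
    "Book Chapter": [],
    "Report": ["Policy Brief", "Policy Report"],    # example sub-types
    "Conference Proceeding": [],
    "Article": ["Journal Article"],                 # example sub-type
    "Conference Publication": ["Paper", "Poster"],  # example sub-types
    "Thesis": ["Masters", "Doctoral"],              # example sub-types
    "Serial": [],
}

def normalize(doc_type, sub_type=None):
    """Map a submitted type/sub-type pair onto the core taxonomy."""
    if doc_type not in DOCUMENT_TYPES:
        raise ValueError(f"Unknown document type: {doc_type}")
    if sub_type and sub_type not in DOCUMENT_TYPES[doc_type]:
        raise ValueError(f"Unknown sub-type for {doc_type}: {sub_type}")
    return (doc_type, sub_type)
```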

Look around you

Before we began designing our document types, we first had to work out what an institutional repository meant to us. One of the original inspirations for the project was the ETH e-collection; seeing how easy it was to find and browse their publications, we immediately understood that the UNU needed a similar tool. Before building our own, however, we spent a good amount of time looking into other institutional repositories and investigating the platforms that drove them.

This led us to DSpace and Fedora Commons. We evaluated both and settled (as had ETH) on Fedora Commons. Both platforms are mature and stable, but the flexibility and availability of multiple APIs sold us on Fedora Commons. Coupling Fedora with Fez provides us with a stable and scalable platform. Spending time at the outset reviewing the available tools against the features we needed saved us a lot of time down the road.
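
For a flavour of those APIs, here is a minimal sketch of a search against the Fedora 3.x REST interface of the era. The host is hypothetical and parameter names may differ between Fedora versions, so treat this as illustrative rather than definitive.

```python
# Illustrative only: search a Fedora 3.x repository over its REST API.
# The host below is hypothetical; parameters may vary by version.
import requests  # third-party: pip install requests

FEDORA = "http://repository.example.org:8080/fedora"  # hypothetical host

def find_objects(terms, max_results=10):
    """Run a findObjects search and return the raw XML result list."""
    resp = requests.get(
        f"{FEDORA}/objects",
        params={
            "terms": terms,            # e.g. "water*"
            "pid": "true",             # include object PIDs in the results
            "maxResults": max_results,
            "resultFormat": "xml",
        },
    )
    resp.raise_for_status()
    return resp.text

# Example usage (against a real host):
# print(find_objects("water*"))
```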

License to drive

Compiling the metadata, acquiring the document and ingesting it into the repository is one thing, but it is all pointless if we are not legally allowed to do it. Since the UNU and our researchers publish with a number of different publishers, it was important that any content we ingest into the system be vetted to ensure that we were displaying the correct copyright information and that we were legally allowed to publish the content.

One invaluable tool we discovered is SHERPA/RoMEO, a service providing a definitive list of publishers’ copyright agreements and retained author rights. Using it, we can look up the journal an article was published in and ascertain whether we are able to publish the article ourselves. SHERPA/RoMEO also provides an API for querying its database, which has been integrated into Fez, allowing us to display the open access status of an article automatically.
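
As a rough sketch of how such a lookup works, the snippet below queries the public RoMEO XML API as it existed at the time and pulls out the journal’s RoMEO “colour”. The endpoint and element names are assumptions based on the API of that era, and the journal title is just an example.

```python
# Illustrative only: look up a journal's archiving policy in SHERPA/RoMEO.
# Endpoint and element names reflect the public XML API of the time and
# may have changed since.
import requests
import xml.etree.ElementTree as ET

ROMEO_API = "http://www.sherpa.ac.uk/romeo/api29.php"

def romeo_colour(journal_title):
    """Return the RoMEO colour (green/blue/yellow/white) for a journal."""
    resp = requests.get(ROMEO_API, params={"jtitle": journal_title})
    resp.raise_for_status()
    root = ET.fromstring(resp.content)
    colour = root.find(".//romeocolour")
    return colour.text if colour is not None else "unknown"

# Example usage:
# print(romeo_colour("Water International"))  # journal title is an example
```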

Don’t forget the human element

A lot of time was spent developing automated processes for converting lists of documents, ingesting content and generating digital objects, and these tools greatly simplified our lives in the run-up to launching the site. But not all problems can be solved by technology; the old rule of Garbage In, Garbage Out still stands. During the first round of imports in our development environment we were working from patchy lists of documents that had been pulled together from various sources, and very little of this data had been validated or reviewed. The result was a largely redundant set of documents being ingested into the system.

We quickly learned two things from this: 1) data quality is key, and 2) for the repository to be relevant, we must have a document for each entry. In hindsight both of these seem painfully obvious (especially point 2). We solved both relatively simply, setting the criteria for inclusion such that either a digital version of the document or a link to a location where it can be acquired must be included in the record (due to licensing we cannot always store the document itself in the repository). If neither is available, the document does not go into the repository; a sketch of this rule follows below.
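
Expressed as code, the rule is a simple either/or check. The field names here are illustrative, not those of our actual records.

```python
# Sketch of the inclusion rule: a record qualifies only if it carries
# either the digital document itself or a link to where it can be
# acquired. Field names are illustrative.
def can_ingest(record):
    """Apply the repository's inclusion criteria to a candidate record."""
    has_file = bool(record.get("attached_file"))
    has_link = bool(record.get("external_url"))
    return has_file or has_link

candidates = [
    {"title": "Open access report", "attached_file": "report.pdf"},
    {"title": "Publisher-hosted article", "external_url": "http://example.org/article"},
    {"title": "Metadata only, no document"},  # rejected
]
accepted = [r for r in candidates if can_ingest(r)]
print(len(accepted))  # 2
```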

tl;dr

With the launch of the repository this week, our work is only beginning. We have many more documents to ingest into the system and more features to add over the coming months. Many thanks to the team who have worked on it, particularly Shivani Rana for her development magic over the last few months. Thanks as well to the Fez developers for their assistance and pointers while we were figuring out how to use it.

Please browse the repository and let us know your thoughts. We look forward to hearing your feedback.

- Conor