Live musical events play a vital role in community life across the globe, yet their very ‘liveness’ means they often leave only faint traces on the historical record, even in modern times. While musicologists have used some types of concert ephemera to capture the nature and identity of musical events, by their very nature these resources can be confusingly inconsistent, tantalisingly incomplete, and often scattered between different archives and collections.

The InterMusE project is a two-year research endeavour, funded by the AHRC’s UK-US New Directions for Digital Scholarship in Cultural Institutions programme, that seeks to better capture and represent these historical events by leveraging natural-language processing, optical character recognition (OCR), and other forms of artificial intelligence. To illustrate the potential of the approach, we work with digitised resources sourced from:

Borthwick Institute for Archives (University of York),
Krannert Center for the Performing Arts (University of Illinois at Urbana-Champaign),
The British Library, and
The Royal College of Music.

Material is also sourced from three former chapters of the British Music Society (est. 1918).

Prototype Digital Library

Greenstone3 is an open-source digital-library system with a versatile service-based software architecture, managed through an extension mechanism. Taking the Huddersfield Music Society Programmes as the set of digitised content processed, this online resource demonstrates how Greenstone3 can be used to meet the aspirations of the InterMusE project.

When content is added to the digital library, it is automatically processed using the Google Vision API. Any text extracted is added to the digital library's full-text index and is also stored as Linked Open Data using the Simple Annotation Server. We make the OCR'd text available as Open Annotations, accessible through a Mirador3 image viewer embedded in the digital library. Through the Mirador3 viewer, annotations can be edited (to correct OCR errors, for example), and completely new annotations can be added (unrelated to the OCR'd text, if so desired). Because the Simple Annotation Server uses Apache Jena Fuseki as its internal triplestore, all the OCR'd content, along with all the other metadata amassed in the digital library, can also be accessed via a SPARQL endpoint. More details are available through the InterMusE project website.
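As a rough illustration of how an external client might read the stored annotations back out, the Python sketch below builds a SPARQL query and the matching GET request URL. The endpoint URL is a placeholder, and the assumption that annotations are typed as oa:Annotation from the Open Annotation vocabulary is ours rather than something this page confirms. Substitute the real endpoint and vocabulary before use.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: replace with the SPARQL endpoint the digital library exposes.
ENDPOINT = "https://example.org/fuseki/annotations/sparql"


def build_annotation_query(limit: int = 10) -> str:
    """Return a SPARQL query that lists annotation resources.

    Assumption: annotations are typed as oa:Annotation.
    """
    return (
        "PREFIX oa: <http://www.w3.org/ns/oa#>\n"
        "SELECT ?annotation WHERE {\n"
        "  ?annotation a oa:Annotation .\n"
        "}\n"
        f"LIMIT {int(limit)}"
    )


def build_request_url(endpoint: str, query: str) -> str:
    """Encode the query as a SPARQL 1.1 Protocol GET request URL."""
    return endpoint + "?" + urlencode({"query": query})


if __name__ == "__main__":
    # Print the URL only; sending it is left to the HTTP client of your choice.
    print(build_request_url(ENDPOINT, build_annotation_query(5)))
```

The sketch stops at building the URL so that it runs without network access. The result can be passed to any HTTP client, requesting a results format such as JSON through the Accept header.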

In addition to the automatically generated OCR'd content, an Excel spreadsheet has been painstakingly assembled from the programmes by the Huddersfield Music Society (HMS) archivist, recording who the performers were and which musical works they performed at which concert. We fold this into the digital-library collection, both as information to display and as metadata that enriches how users can locate content of interest to them in the collection.

Designed for Different Types of User

Use the browsing and searching features the digital library provides to locate content of interest. Register as a user to become an annotator/editor of the content. External developers interested in further enriching the forms of access to this content can reach a machine-readable version through the following SPARQL endpoint.

This prototype collection contains documents focusing on a sample of programmes from the Huddersfield Music Society.

Implementation Details

To form this prototype InterMusE digital library, we have taken the base digital-library system and added in Greenstone's extensions for:

structured-image to automatically perform OCR on programme pages using Google Vision's API;
iiif-servlet to allow images in the digital library to be available at a range of resolutions via the IIIF Image API; and
apache-jena so that content—such as annotations added to programme pages—can be accessed as Linked Data.
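To make the "range of resolutions" concrete, the sketch below assembles request URLs following the IIIF Image API pattern {base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}. The base URL and identifier are invented for illustration, and which version of the Image API the iiif-servlet extension implements is not stated here: version 2.0 uses "full" rather than "max" for an unscaled image, so adjust the default size accordingly.

```python
from urllib.parse import quote


def iiif_image_url(base: str, identifier: str, region: str = "full",
                   size: str = "max", rotation: str = "0",
                   quality: str = "default", fmt: str = "jpg") -> str:
    """Build a IIIF Image API request URL.

    The identifier is percent-encoded so that characters such as "/" do not
    get mistaken for path separators.
    """
    ident = quote(identifier, safe="")
    return f"{base.rstrip('/')}/{ident}/{region}/{size}/{rotation}/{quality}.{fmt}"


if __name__ == "__main__":
    base = "https://example.org/iiif"   # hypothetical server
    page = "hms-programme-0001"         # hypothetical image identifier
    # The same page at three widths; "400," means "400 pixels wide, keep aspect ratio".
    for width in (200, 400, 1200):
        print(iiif_image_url(base, page, size=f"{width},"))
```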

A key strength of the Greenstone3 software architecture is its ability to be customised, which is aligned with its three phases for forming a digital-library collection: importing, building, and runtime presentation. The first two phases typically go hand-in-hand, and form the ingest process by which content selected for the digital-library collection is turned into a browseable and searchable online resource.

Importing centres around a pipeline of document-processing plugins, written in Perl, that turn a wide array of document and metadata formats into a canonical format known as GreenstoneXML. Using one folder per document, this format represents everything that constitutes the processed document: the text and metadata of the document, along with any supporting files. The internal format allows for hierarchical structure, such as that expressed through headings in Word, PDF, and HTML documents. Metadata can be attached to any level of the hierarchy. Examples of associated files include: automatically generated web-friendly resources, such as screen-sized and thumbnail-sized images in the case of photos; embedded resources in the case of HTML; and the original file itself, so it can be downloaded.
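The shape of such a processed document can be pictured with a small in-memory model. The Python sketch below is an illustration only: the class and field names are ours and do not reflect the actual GreenstoneXML schema or Greenstone's Perl data structures. It shows a hierarchy of sections, metadata attachable at any level, and a list of associated files.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Tuple


@dataclass
class Section:
    """One node in the document hierarchy (names are illustrative)."""
    title: str
    text: str = ""
    metadata: Dict[str, List[str]] = field(default_factory=dict)
    children: List["Section"] = field(default_factory=list)

    def add_metadata(self, name: str, value: str) -> None:
        # A metadata name may carry several values, so store a list per name.
        self.metadata.setdefault(name, []).append(value)

    def walk(self, path: str = "1") -> Iterator[Tuple[str, "Section"]]:
        """Yield (dotted position, section) pairs in document order."""
        yield path, self
        for i, child in enumerate(self.children, start=1):
            yield from child.walk(f"{path}.{i}")


@dataclass
class Document:
    root: Section
    associated_files: List[str] = field(default_factory=list)


if __name__ == "__main__":
    root = Section("Concert programme")
    root.add_metadata("Title", "Concert programme")
    root.children.append(Section("Page 1", text="String Quartet"))
    doc = Document(root, associated_files=["page1.jpg", "thumb1.jpg"])
    for position, section in doc.root.walk():
        print(position, section.title, section.metadata)
```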

In terms of customisation, plugins support a myriad of settings for fine-tuning how the processing is undertaken. New plugins can also be introduced at any time, with the digital-library system automatically detecting their presence.

The building step takes the standardised XML form and processes it to form the backend indexes and database structures needed to deliver the forms of searching (such as full-text search and search by title) and browsing (such as a hierarchical subject classification) specified in the collection's configuration file. Effectively, the building phase turns the serialised GreenstoneXML form back into in-memory data structures representing a document's hierarchical structure of text and metadata, along with how supporting files relate to it. Following the directives specified in the collection's configuration file, it is then a simple matter to transmit this text, metadata, and associated files as needed to the digital library's indexing/database/backing-store.

Beyond the customisations that can be specified in a collection-configuration file for the building phase, Greenstone supports orthogonal indexers. Like the document-processing plugins used in importing, orthogonal indexers are modules written in Perl, and their inclusion is automatically detected by the Greenstone3 installation. Orthogonal indexers get presented with the same in-memory stream of "reconstructed" documents, allowing them to undertake additional processing if required (such as computing audio features), which can then be transmitted to a specialist indexing/database/backing-store (such as a content-based music-recommender system), or otherwise added to the existing indexing/database/backing-store.
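The idea of several indexers consuming the same stream of reconstructed documents can be sketched as follows. This is a conceptual Python analogy, not Greenstone code: real orthogonal indexers are Perl modules, and the class names, the process method, and the dictionary-based document shape used here are all hypothetical.

```python
from typing import Dict, Iterable, List, Set


class FullTextIndexer:
    """Stand-in for the core indexer: maps each word to the documents containing it."""

    def __init__(self) -> None:
        self.index: Dict[str, Set[str]] = {}

    def process(self, doc: Dict[str, str]) -> None:
        for word in doc["text"].lower().split():
            self.index.setdefault(word, set()).add(doc["id"])


class WordCountIndexer:
    """Stand-in for an orthogonal indexer doing extra, independent processing."""

    def __init__(self) -> None:
        self.counts: Dict[str, int] = {}

    def process(self, doc: Dict[str, str]) -> None:
        self.counts[doc["id"]] = len(doc["text"].split())


def build(documents: Iterable[Dict[str, str]], indexers: List[object]) -> None:
    """Present every document to every registered indexer, one document at a time."""
    for doc in documents:
        for indexer in indexers:
            indexer.process(doc)


if __name__ == "__main__":
    docs = [
        {"id": "p1", "text": "String Quartet in G"},
        {"id": "p2", "text": "Piano Quartet"},
    ]
    core, extra = FullTextIndexer(), WordCountIndexer()
    build(docs, [core, extra])
    print(sorted(core.index["quartet"]), extra.counts)
```

The point of the design is that the extra indexer needs no knowledge of the core one: both see identical documents, and each writes to its own store.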

The third phase of the Greenstone3 digital-library architecture governs how functionality is accessed and how data is extracted from the digital library and presented to the user. The Greenstone3 runtime is a service-based architecture, written in Java, consisting of a network of connected modules. Modules are self-describing and advertise the services they offer. Communication between modules is by XML messages, with the service handling the final layer of communication responsible for presentation. Here, XSL Transforms (XSLTs) are used to convert the underlying XML content into the web page displayed by the digital library, blending in CSS and JavaScript files that control appearance and functionality.

The XSLT files are grouped together in one place, forming the interface for the digital library. An inheritance mechanism is deployed throughout this part of the design. A collection can override individual XSLT template rules, as required to tweak presentation details. A collection can also provide an entire replacement XSLT file, if so desired. For more substantial changes a new interface is typically developed.
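The override behaviour can be pictured as a layered lookup in which a collection's rules are consulted before the interface's. The Python sketch below is only an analogy: Greenstone3 resolves real XSLT template rules, whereas here each "rule" is a plain function keyed by a made-up template name.

```python
from collections import ChainMap
from typing import Callable, Dict

Rule = Callable[[str], str]

# Rules supplied by the interface, acting as shared defaults (names are hypothetical).
interface_rules: Dict[str, Rule] = {
    "document-title": lambda text: f"<h1>{text}</h1>",
    "document-body": lambda text: f"<div class='doc'>{text}</div>",
}

# A collection overrides only the rule it wants to change.
collection_rules: Dict[str, Rule] = {
    "document-body": lambda text: f"<div class='mirador-viewer'>{text}</div>",
}

# ChainMap searches the first mapping, then falls back to the second.
effective_rules = ChainMap(collection_rules, interface_rules)


def render(template: str, text: str) -> str:
    """Apply whichever rule wins the lookup for the given template name."""
    return effective_rules[template](text)


if __name__ == "__main__":
    print(render("document-title", "Programme"))  # inherited from the interface
    print(render("document-body", "page"))        # overridden by the collection
```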

In terms of crafting the features and functionality to form this prototype InterMusE digital library, we make use of all three areas for customisation. Mirador3 is built with a Node.js web stack, and so to switch the digital library's document display to use this viewer, replacement XSL template rules were introduced to load in the necessary CSS and JavaScript files and call the viewer's initialisation function. Mirador draws its image content from an IIIF-compliant server. This was achieved by using the above-mentioned iiif-servlet extension to Greenstone3. Mirador, however, cannot natively handle Google Vision JSON, but does support the Open Annotation JSON format. We therefore extended StructuredImagePlugin to include a function that performs a cross-walk from the former JSON format to the latter.
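A cross-walk of this kind might look roughly like the Python sketch below. It is not the StructuredImagePlugin code, which is written in Perl and may map the fields differently. The sketch assumes the Vision response carries a textAnnotations list whose first entry is the whole-page text and whose remaining entries are individual words with a boundingPoly. The choice of the oa:commenting motivation and the canvas URI are likewise our assumptions.

```python
from typing import Any, Dict, List, Tuple


def vertices_to_xywh(vertices: List[Dict[str, int]]) -> Tuple[int, int, int, int]:
    """Convert a polygon's corner points to an axis-aligned x, y, width, height box."""
    # Vision omits a coordinate from the JSON when its value is 0, hence the default.
    xs = [v.get("x", 0) for v in vertices]
    ys = [v.get("y", 0) for v in vertices]
    x, y = min(xs), min(ys)
    return x, y, max(xs) - x, max(ys) - y


def vision_to_open_annotations(vision: Dict[str, Any],
                               canvas_uri: str) -> List[Dict[str, Any]]:
    """Turn word-level Vision text annotations into Open Annotation dictionaries."""
    annotations = []
    # Skip entry 0: it repeats the full text of the page rather than a single word.
    for entry in vision.get("textAnnotations", [])[1:]:
        x, y, w, h = vertices_to_xywh(entry["boundingPoly"]["vertices"])
        annotations.append({
            "@context": "http://iiif.io/api/presentation/2/context.json",
            "@type": "oa:Annotation",
            "motivation": "oa:commenting",  # assumption, see lead-in
            "resource": {
                "@type": "dctypes:Text",
                "format": "text/plain",
                "chars": entry["description"],
            },
            # The region of the canvas the text was read from.
            "on": f"{canvas_uri}#xywh={x},{y},{w},{h}",
        })
    return annotations


if __name__ == "__main__":
    sample = {"textAnnotations": [
        {"description": "Quartet", "boundingPoly": {"vertices": []}},
        {"description": "Quartet", "boundingPoly": {"vertices": [
            {"x": 10, "y": 20}, {"x": 40, "y": 20}, {"x": 40, "y": 35}, {"x": 10, "y": 35}]}},
    ]}
    print(vision_to_open_annotations(sample, "https://example.org/canvas/1"))
```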

The sequence undertaken to support the editing of annotations is informally, but best, described as "a plumbing exercise". Mirador3 requires the addition of the mirador-annotations plugin to allow editing. The plugin was in turn configured to use a Simple Annotation Server (SAS) endpoint to store the annotations. SAS supports a variety of storage backends. We set this to be Apache Jena and directed SAS to use the instance we had installed as a Greenstone3 extension. To get the Open Annotations produced by StructuredImagePlugin into the Jena store, we added a new orthogonal indexer. The net result is that, upon a fresh rebuild of the digital-library collection, a user accessing one of the digitised programmes can now edit the OCR'd text, or else lay new annotations over the page. An XSLT-based if-statement completes the plumbing exercise, checking settings provided by the digital library to ensure the editing version of the Mirador viewer is only activated if the user is logged in and has an editing role assigned for the collection they are accessing.
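The final check amounts to a two-part condition, sketched here in Python rather than XSLT. The User class and the group names are hypothetical stand-ins for whatever settings the digital library actually passes to the template.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class User:
    name: str
    groups: Set[str] = field(default_factory=set)


def viewer_mode(user: Optional[User], collection: str) -> str:
    """Return "edit" only for a logged-in user with an editing role for the collection."""
    if user is None:  # nobody is logged in
        return "read-only"
    # Hypothetical group names granting editing rights over this collection.
    editor_groups = {"all-collections-editor", f"{collection}-collection-editor"}
    return "edit" if user.groups & editor_groups else "read-only"


if __name__ == "__main__":
    editor = User("alice", {"hms-collection-editor"})
    print(viewer_mode(editor, "hms"), viewer_mode(None, "hms"))
```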

Available Services

Cross collection search

Search over multiple collections

Administration Page

Allows you to manage users

Depositor

Add more documents to an existing collection

Greenstone Librarian Interface (GLI)

Allows you to configure and build collections

About Greenstone

About the Greenstone software