CM3010 Topic 10: Applications
Main Info
Title: Applications
Teachers: David Lewis
Semester Taken: April 2022
Parent Module:
Description
Databases in applications, focusing on three alternative databases of medieval manuscripts and how to combine them.
Key Reading
Other Reading
Lecture Summaries
10.1 Databases of Medieval Manuscripts
We’ll look at three different approaches to modelling and implementing a database of medieval manuscripts.
The framing question is: how do we combine the datasets? We have three different views of the same objects.
The databases being merged are:
Medieval Manuscripts in Oxford Libraries, the Bodleian database. Primary purpose is a catalogue; users are library users; the units of interest are books and works. Scale is > 10k manuscripts.
Bibale, a French database by the IRHT research institute, which specialises in medieval texts and the transmission of texts in manuscripts. Its origins are in card catalogues. Primary function is as a study resource. Users might be manuscript scholars using it as a primary source, who may never consult the manuscript itself. Scale is > 17k books and > 13k people, and it's a sample rather than a complete catalogue.
The Schoenberg Database of Manuscripts, housed at the University of Pennsylvania. Founded by Larry Schoenberg, a collector with a passion for manuscripts. Its focus is observations of manuscripts, with a particular interest in sales catalogues. Users might be provenance scholars or manuscript dealers. Scale is > 250k observations.
If we combine these, how do we handle updates? Or errors? What would a UI look like?
Schoenberg
The starting point of this db was an Excel spreadsheet maintained by Schoenberg, later migrated to Access, then MySQL.
Motivations were those of a collector interested in the market. It involves some community edits (suggestion-based).
It's a fully relational approach: MySQL + Ruby on Rails.
There are legacy data issues: newer records have fine-grained fields that older records lack. Editorial and technical teams support the project; resourcing came from a bequest.
Bibale
Quite a contrasting purpose, design, and history compared with Schoenberg. The starting points were card catalogues, with many researchers contributing. Modelling this relationally must have been daunting.
The relational database was built around generic structures: an Object has a type and a name, and there are around nine object types.
Elements have a type and a value and are attached to objects. Rather than the conventional column per attribute, there is a row per attribute.
They did this because defining the attributes as columns would make the table very sparse: there are a lot of element types, and only a few are observed on each object.
This is an unusual use of a relational database: you can arbitrarily extend the schema without redesigning the db.
In relational design you typically make the relations explicit, so the database can keep things reliable, but here they break that convention. This gives more flexibility at the cost of riskier code.
The result looks like a standard catalogue page. Scholars can extend the schema; they have power over the data. But the model now lives in the documentation, not the database schema: enforcement of rules is informal, and different editors have different practices.
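A minimal sketch of this row-per-attribute pattern (often called entity-attribute-value), using SQLite; the table and column names are invented for illustration, not Bibale's actual schema:

```python
import sqlite3

# Hypothetical tables loosely modelled on Bibale's generic structures:
# an `objects` table (type + name) and an `elements` table with one
# row per attribute instead of one column per attribute.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE objects (
    id   INTEGER PRIMARY KEY,
    type TEXT NOT NULL,   -- e.g. 'manuscript', 'person', 'place'
    name TEXT NOT NULL
);
CREATE TABLE elements (
    object_id INTEGER REFERENCES objects(id),
    type      TEXT NOT NULL,  -- attribute name, chosen by editors
    value     TEXT NOT NULL
);
""")

conn.execute("INSERT INTO objects VALUES (1, 'manuscript', 'MS Example 1')")
conn.executemany("INSERT INTO elements VALUES (?, ?, ?)", [
    (1, "date", "c. 1400"),
    (1, "material", "parchment"),
    # A new attribute type needs no schema change at all:
    (1, "binding", "19th-century leather"),
])

rows = conn.execute(
    "SELECT type, value FROM elements WHERE object_id = 1 ORDER BY type"
).fetchall()
```

The trade-off is visible here: the database can no longer enforce which element types are valid or what their values should look like; that rule moves into documentation and editorial practice.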
Oxford
The Oxford database is the closest to an industrial organisation of the three: the libraries employ software engineers, maintain their software over time, and serve a high number of users.
The starting points were books: they give information about content, dates of manuscripts, source information, and citations of literature, with lots of prose description.
The requirement was to replace the physical catalogue with a searchable, faceted one, and for domain experts to be able to make updates.
The designers did not want to be locked into a software dependency: they weren't going to get much ongoing funding, so they didn't want to depend on specialised software.
They decided to use TEI. The text from the catalogue can be preserved, marked up with machine-readable annotations.
TEI supports the rich, loose structure of manuscripts. But not all the meaning is explicit in the markup; some remains in the text, so it is not fully machine-accessible.
Searchable, faceted cataloguing is a challenge when starting from TEI markup.
The data is too complex for a simple pipeline, and third-party indexing is needed anyway. The process therefore simplifies the XML before turning it into the web page: the complexity is stripped out for publication, then the web page and search index are generated from the simpler data.
TEI -> XQuery -> Simplified XML
From the simplified XML, two different XQuery pipelines generate the HTML for the web page and the Lucene index.
How do the experts update the data? It's a GitHub repo: curators make their edits in the text and then open a pull request.
XQuery is doing the work of the transform.
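A toy sketch of the simplification step. The TEI-like fragment and the output fields are invented, and the real Bodleian pipeline uses XQuery, not Python; this just shows the idea of stripping rich markup down to the fields an index needs:

```python
import xml.etree.ElementTree as ET

# Invented TEI-like fragment: prose description mixed with structure.
tei = ET.fromstring("""
<msDesc>
  <msIdentifier><idno>MS. Auct. F. 2. 13</idno></msIdentifier>
  <history>
    <origin>Written in <origPlace>England</origPlace>,
      <origDate notBefore="1075" notAfter="1125">late 11th century</origDate>.
    </origin>
  </history>
</msDesc>""")

# Simplification: discard the prose, keep only indexable fields.
simple = ET.Element("record")
ET.SubElement(simple, "shelfmark").text = tei.findtext("msIdentifier/idno")
date = tei.find("history/origin/origDate")
ET.SubElement(simple, "date_from").text = date.get("notBefore")
ET.SubElement(simple, "date_to").text = date.get("notAfter")
ET.SubElement(simple, "place").text = tei.findtext("history/origin/origPlace")

simplified_xml = ET.tostring(simple, encoding="unicode")
```

Note what is lost: "Written in ... late 11th century" carried meaning in prose, and only the parts that were already machine-readable (the attributes and named elements) survive into the simplified record.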
So how do we combine these datasets?
Do we merge them into an existing db, or create a new one? What if we took Bibale and the Bodleian data and added them to Schoenberg? Or should we create a union db? Or use federated queries?
What do we do with updates or corrections? How do we ensure they propagate?
What happens when the project ends?
How do we reconcile the models and data?
Here’s how they approached it:
On merging the data: they decided that no single db schema was rich enough to merge into. Each has fine-grained information in some areas and coarse information in others. There may be editorial policy conflicts as well, and changes to a schema could be problematic.
Two other options - federate queries to the separate dbs, or create a new union db.
Federated queries are attractive: each dataset stays the single source of its data.
But they'd require a federation-capable query engine, and there wasn't one; the Oxford dataset has no db server at all.
And where would the alignment information live in a federated model? It ends up being a new db anyway.
So they decided to make a union db.
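A minimal sketch of the union idea: each source exports its records, and an alignment table (here a hand-written dict; in reality the product of automatic matching plus manual reconciliation) maps source-local IDs to shared union IDs. All names, IDs, and fields are invented:

```python
# Invented per-source exports: the same manuscript under three local IDs.
schoenberg = {"SDBM_1234": {"title": "Book of Hours", "sale_year": 1997}}
bibale = {"bibale:567": {"title": "Livre d'heures", "owner": "Jean, duc de Berry"}}
bodleian = {"MS_Douce_144": {"title": "Book of Hours", "folios": 290}}

# Alignment: source-local ID -> shared union ID.
alignment = {
    "SDBM_1234": "mmm:ms001",
    "bibale:567": "mmm:ms001",
    "MS_Douce_144": "mmm:ms001",
}

def build_union(*sources):
    union = {}
    for source in sources:
        for local_id, record in source.items():
            union_id = alignment.get(local_id, local_id)
            entry = union.setdefault(union_id, {"sources": []})
            entry["sources"].append(local_id)
            # Keep every source's statements rather than overwriting,
            # since the sources are different views of the same object.
            for key, value in record.items():
                entry.setdefault(key, []).append(value)
    return union

union_db = build_union(schoenberg, bibale, bodleian)
```

Keeping all three statements side by side (rather than picking a "winner") matches the project's decision below: corrections belong in the sources, and the union db just aggregates.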
But now what about updates/corrections?
Updates: there is no ambition to replace the source dbs, and the union db will have a limited shelf-life. Any enrichments or corrections should go back to the sources.
What about when the project ends? There is no resourcing for ongoing maintenance or server provision.
The focus is on the data and the model, which means static data can be published; tools are maintained on a 'best effort' basis.
How do we reconcile the data models? Not all concepts can be fully merged; the databases have different conceptualisations. Work with domain experts to express the data honestly.
Simplifications are stored in the data as 'indexes' for retrieval. In practice almost no data was thrown out; nearly all of it was preserved, with summaries added for indexing.
Queries also become more complex to deal with the different conceptualisations.
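A toy sketch of that idea: the full record is kept intact, and flat summary fields are derived from it purely as a retrieval index. The record shape and field names are invented:

```python
# Invented full record: rich, nested data preserved as-is.
records = [
    {"id": "mmm:ms001",
     "production": {"place": "Paris", "date": {"from": 1405, "to": 1410}},
     "provenance": [{"owner": "Jean, duc de Berry"},
                    {"owner": "Bodleian Library"}]},
]

def summarise(record):
    """Flatten a record into short strings used only for retrieval."""
    prod = record["production"]
    return {
        "id": record["id"],
        "place": prod["place"],
        "date": f'{prod["date"]["from"]}-{prod["date"]["to"]}',
        "owners": "; ".join(p["owner"] for p in record["provenance"]),
    }

index = [summarise(r) for r in records]
```

Searches run against the flat summaries, but a hit leads back to the full record, so the simplification costs nothing in preserved data.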
MMM Modelling
Modelling was a critical point of the project.
This took time: weekly meetings ran for the life of the project, with technical specialists and subject specialists from each db.
A linked data lab at Aalto University in Finland advised on the linked data side.
They needed to know how far they'd extend the model and which vocabularies they'd use, and to check how the data actually is, not just how it is documented.
There were examples of the schema and the data differing slightly. The model needed verifying against research questions, comparing queries before and after export.
The lecture introduces the core ontology used as the basis for the project and gives an extended run-through of the knotty problems in data modelling, especially around ownership of a resource.
Pipelines
Each institution writes an export script, which is automated and periodic.
That generates linked open data. The data is combined and entities are connected to linked authorities, with an optional manual reconciliation step whose results persist across runs.
The MMM data is then published into a triple store.
Discusses the process of reconciling names and locations using different authority files.
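A toy sketch of name reconciliation against an authority file, using simple normalisation so spelling variants collide. Real pipelines match against published authorities (with a manual pass for ambiguous cases); the names and the matching rule here are invented:

```python
import unicodedata

# Invented authority file: canonical ID -> known name variants.
authority = {
    "person:jean_de_berry": {"Jean de Berry", "Jean, duc de Berry"},
    "place:paris": {"Paris", "Lutetia"},
}

def normalise(name):
    # Strip accents, punctuation, and case so variants compare equal.
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    cleaned = "".join(c for c in stripped.lower() if c.isalnum() or c.isspace())
    return tuple(cleaned.split())

# Lookup from normalised variant -> authority ID.
lookup = {}
for auth_id, variants in authority.items():
    for variant in variants:
        lookup[normalise(variant)] = auth_id

def reconcile(name):
    """Return the authority ID for a name, or None if unmatched."""
    return lookup.get(normalise(name))

m1 = reconcile("Jean de Berry")
m2 = reconcile("PARIS")
m3 = reconcile("Unknown Scribe")
```

Unmatched names (like `m3` here) are exactly the cases the optional manual reconciliation step exists for.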
User Interfaces
There weren't resources for a full UI, and a question that connects multiple datasets is often nuanced.
A basic UI is available.
The main point was to look at merging the datasets, not to build the UI.
At some point users complained that the data was no good, and we realised they were taking the UI's limitations as limitations of the data. So we taught them SPARQL.
Module Summary
We asked at the start: what shape is your data? Shape is not neutral; once you've decided it, it will shape a lot of functionality going forward.
Is the data linky? How fixed is the schema? What shape will stakeholders expect?
Can you decide the data shape just based on the data? Probably not, it depends on your users and application purpose.
How do I want to use the data? What approaches am I comfortable with? What do I want to build? Do I want others to use the data? If so how?
Who do I want to use the data? Do I want to control that? How will I maintain it?
XML is specialised compared with SQL or MongoDB; likewise RDF.
Key questions: why does my data exist, who is it for, and how do I best empower the best use of it?