CM3010 Topic 07: Semantic Databases
Main Info
Title: Semantic Databases
Teachers: David Lewis
Semester Taken: April 2022
Parent Module: cm3010: Databases and Advanced Data Techniques
Description
XML in theory and practice.
Assigned Reading
Liu et al, A Decade of XML Data Management, IEEE Conference on Data Engineering, 2009
Crawford and Lewis, Music Encoding Initiative, Journal of the American Musicological Society, 69.
Lab Summaries
7.206 - Work with XSLT and the NY Phil concert data, using XSLT to generate HTML tables with different views of the data.
Lecture Summaries
Intro lecture frames this topic and the next as focusing on how to more effectively link the semantics of information across different databases and applications. So far we’ve been treating datasets as independent, and with a semantics we define purely for a single database or app. But much of that information may share its meaning across different databases. Think of the date of birth of a particular actor, for example. We’ll look at approaches to that shared semantics.
7.1 What are Semantics?
7.101 Semantic databases: what does a table tell us?
Let’s imagine we want to share our relational database. We could share a CSV, but that’s just the data, not the relation: it doesn’t include the column specification.
We could do the full relation by dumping the database. That produces a text file with a set of commands that would be needed to reconstruct the database.
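A minimal sketch of the difference between the two exports, using Python's built-in sqlite3 module. The actor table and its row are hypothetical, made up only for illustration; the lecture did not specify a database engine.

```python
import csv
import io
import sqlite3

# Hypothetical table and row, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE actor (name TEXT NOT NULL, birth_year INTEGER)")
conn.execute("INSERT INTO actor VALUES ('Ada Example', 1970)")
conn.commit()

# CSV export: the values, plus at best a header row of column names.
buffer = io.StringIO()
writer = csv.writer(buffer)
cursor = conn.execute("SELECT name, birth_year FROM actor")
writer.writerow([column[0] for column in cursor.description])
writer.writerows(cursor)
csv_text = buffer.getvalue()

# Dump: the SQL statements needed to rebuild the table, types included.
dump_text = "\n".join(conn.iterdump())

print(csv_text)
print(dump_text)
```

The dump carries the CREATE TABLE statement with its column types and constraints; the CSV carries only the values and column names.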
Stepping back we can see different layers of meaning.
There are data types - how the computer should store and interpret the information (string, float, integer).
There are data domains - it’s not just a number, but a year. It’s a latitude and longitude.
Then there are data semantics - this is not just a person, it’s an actor in a film.
The more meaning we can specify, the easier we can spot when someone else has specified the same information.
We can share definitions with data specifications. We can share syntax for how to validate new data.
Introduces deductive data stores, which reason over information in a Prolog-like style using first-order logic. There’s a large amount of work in the semantic web communities about using linked data to build towards the deductive reasoning abilities of logic programming.
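A rough sketch of what deductive reasoning over stored facts means, written in Python rather than Prolog. The facts, the predicate names and the worked_with rule are all hypothetical, chosen only to show one rule deriving new information from existing facts.

```python
# Facts as (predicate, subject, object) triples; all names are hypothetical.
facts = {
    ("acted_in", "alice", "film_a"),
    ("directed", "bob", "film_a"),
    ("acted_in", "carol", "film_b"),
}


def worked_with(facts):
    """Rule: worked_with(A, D) holds if A acted in a film that D directed."""
    derived = set()
    for pred1, actor, film1 in facts:
        for pred2, director, film2 in facts:
            if pred1 == "acted_in" and pred2 == "directed" and film1 == film2:
                derived.add(("worked_with", actor, director))
    return derived


inferred = worked_with(facts)
print(inferred)
```

Nothing stored says that alice worked with bob; the rule derives it. A real deductive store would apply many such rules repeatedly until no new facts appear.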
7.103
Often the definition of the language semantics is done in the language itself. Ideally you would want a document to be self-specifying. Otherwise formal specifications outside the document can be used, and often this is more practical.
Shows some examples of self-describing documents, including an xml doc.
7.105: XML
Introduces document markup. We mean marking up text. We add commentary to the text in the form of tags. Now we can enrich the text as much as we like, without changing the text.
An XML document is a tree: it always has exactly one root node.
Contrasts XML well-formedness (strict nesting, single root) with validity (conformance to a schema). We can tie the document to a schema’s vocabulary with the xmlns (namespace) attribute on the root node. There’s also the processing-instruction syntax <?xml-model href="<schema>" schematypens="<schema_type>"?>
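A small sketch of the well-formedness half of that contrast, using Python's standard xml.etree.ElementTree parser. The sample strings are made up. Note that this parser checks well-formedness only; validating against a schema needs a separate tool.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


nested = "<program><season>1842-43</season></program>"      # properly nested
crossed = "<program><season>1842-43</program></season>"     # tags overlap
two_roots = "<program/><program/>"                           # more than one root

for sample in (nested, crossed, two_roots):
    print(is_well_formed(sample), sample)
```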
7.2 Using XML
XML gives a looser structure than a table: it’s a tree. It is harder to index, as we don’t know as much about the specific structure.
But we can run rich searches (XQuery) and use third-party indexing technology (like Lucene).
It’s parallelisable typically as each doc can be treated independently. They can be on the end of a url, and shared directly.
Expectations are different from a relational database - which is a guaranteed system.
In SQL there’s a lot of complex data modelling, but then the query language is quite simple.
In XML the query and transformation languages are more complex.
We might transform the source xml into other docs, csv, html, Word documents…
We use XSLT and XQuery for these. They were developed in parallel.
XSLT files describe transformations to an xml document. An XSLT processor takes the doc and the stylesheet and processes the transformation, producing the output.
Walks through an xslt example using some concert data.
XSL transformations work through templates. The templates describe a match, and then return a value.
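To make the match-then-produce idea concrete, here is a Python emulation of template matching. It is not XSLT and does not use an XSLT processor: the template functions, the TEMPLATES table and the sample document are all hypothetical, and unmatched elements are simply dropped, which is simpler than XSLT's built-in rules.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the shape of the concert data.
SOURCE = """
<programs>
  <program>
    <orchestra>New York Philharmonic</orchestra>
    <season>1842-43</season>
  </program>
</programs>
"""


def programs_template(node, apply):
    rows = "".join(apply(child) for child in node)
    return f"<table>{rows}</table>"


def program_template(node, apply):
    cells = "".join(apply(child) for child in node)
    return f"<tr>{cells}</tr>"


def leaf_template(node, apply):
    return f"<td>{node.text}</td>"


# Each entry pairs an element name to match with the function that
# produces output for it, loosely like a template with a match pattern.
TEMPLATES = {
    "programs": programs_template,
    "program": program_template,
    "orchestra": leaf_template,
    "season": leaf_template,
}


def apply_templates(node):
    template = TEMPLATES.get(node.tag)
    if template is None:
        return ""  # simplification: unmatched elements produce nothing
    return template(node, apply_templates)


html = apply_templates(ET.fromstring(SOURCE))
print(html)
```

The processor walks the tree, finds the template matching each node, and that template decides what to emit and which children to process next.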
XQuery is a very different approach, intended to be SQL like. XML databases like eXist build XQuery into their structure.
You can embed the xquery into a doc like this:
<body>
<table>
{
let $doc := doc("my_file.xml")
for $concert in $doc//concertInfo
where xf:month-from-dateTime($concert/Date)=12
order by
xf:day-from-dateTime($concert/Date)
ascending
return
<tr>
<td>{$concert/../season/text()}</td>
<td>{$concert/../orchestra/text()}</td>
<td>{ format-dateTime($concert/Date, "[D01]-[M01]-[Y0001]")}</td>
</tr>
}
</table>
</body>
This looks almost like a template language, with the power of XPath.
In XQuery we talk about FLWOR expressions - the acronym names the clauses of a query:
F for
L let
W where
O order by
R return
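The FLWOR clauses map onto ordinary loop, filter, sort and build steps. A sketch of the query above in Python, using only the standard library. The sample document and its dates are made up for illustration, and ElementTree is used in place of an XQuery engine.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Hypothetical sample data in the shape the query expects.
DOC = """
<programs>
  <program>
    <season>1842-43</season>
    <concertInfo><Date>1842-12-07T20:00:00</Date></concertInfo>
    <concertInfo><Date>1843-02-18T20:00:00</Date></concertInfo>
  </program>
  <program>
    <season>1843-44</season>
    <concertInfo><Date>1843-12-02T20:00:00</Date></concertInfo>
  </program>
</programs>
"""

root = ET.fromstring(DOC)                                    # let
rows = []
for program in root.iter("program"):                         # for
    for concert in program.findall("concertInfo"):
        when = datetime.fromisoformat(concert.findtext("Date"))
        if when.month == 12:                                 # where
            rows.append((when, program.findtext("season")))
rows.sort(key=lambda row: row[0].day)                        # order by
result = [                                                   # return
    f"<tr><td>{season}</td><td>{when:%d-%m-%Y}</td></tr>"
    for when, season in rows
]
print("\n".join(result))
```

Only the December concerts survive the filter, and they come out ordered by day of the month rather than by year.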
7.3 XML Schema
There are three main schema languages for xml:
DTD (Document Type Definition) - the oldest, originated in SGML.
XML Schema (XSD)
Relax NG
Also Schematron - for describing more nuanced rules.
A schema will define the following:
What elements can be used?
What can those elements contain?
What data can go inside the elements?
Is there a required order, and how many times can each element occur?
What must they contain?
What attributes are used?
Which elements do attributes associate with?
Which structures are equivalent?
Which structures are mutually exclusive? (eg a recording can’t be analogue and digital)
What do we use them for? We can validate encodings, to debug code or enforce integrity.
A schema aware editor will use the information to help authors.
Schemas can also be used for document automation - generating eg class definitions.
They also support machine reasoning.
Document Type Definitions
In any of the schema languages, the XML file should declare the schema it follows and where to find it. In the case of a DTD this looks like:
<?xml version="1.0"?>
<!DOCTYPE programs SYSTEM
"http://example.com/nyp.dtd">
The SYSTEM declaration is a findable reference to the DTD, usually a URI. If I follow the URI I should find the schema.
The DTD itself starts with declaring the root node:
<?xml version="1.0"?>
<!DOCTYPE programs [
<!ELEMENT programs
(program+)>
]>
We can then have element type declarations as follows:
<!ELEMENT program (orchestra, season, concertInfo+)>
<!ELEMENT concertInfo (Venue, Date, Time)>
We can define an element that will have text data as follows:
<!ELEMENT orchestra (#PCDATA)>
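One way to read these element declarations is as regular expressions over the sequence of child element names. The sketch below hand-translates the content models above into Python regexes and checks a document against them. The translation and the check function are assumptions of this sketch, not how a DTD validator is actually implemented, and elements declared as #PCDATA are left unchecked.

```python
import re
import xml.etree.ElementTree as ET

# Content models from the DTD, rewritten by hand as regexes over
# space-terminated child names ("+" keeps its one-or-more meaning).
CONTENT_MODELS = {
    "programs": r"(program )+",
    "program": r"orchestra season (concertInfo )+",
    "concertInfo": r"Venue Date Time ",
}


def check(element):
    """Return the tags of elements whose children break their content model."""
    errors = []
    model = CONTENT_MODELS.get(element.tag)
    if model is not None:
        children = "".join(child.tag + " " for child in element)
        if not re.fullmatch(model, children):
            errors.append(element.tag)
    for child in element:
        errors.extend(check(child))
    return errors


CONCERT = "<concertInfo><Venue>V</Venue><Date>D</Date><Time>T</Time></concertInfo>"
good = ET.fromstring(
    "<programs><program><orchestra>O</orchestra><season>S</season>"
    + CONCERT + "</program></programs>"
)
bad = ET.fromstring(  # season and orchestra swapped
    "<programs><program><season>S</season><orchestra>O</orchestra>"
    + CONCERT + "</program></programs>"
)

print(check(good))
print(check(bad))
```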
The issue with DTDs is the limits of their expressive power. People wanted to express more complex structures and validate more complex constraints than DTDs allow.
XML Schema Language (XSD)
Here’s how we link an XML doc to an XSD: we declare namespaces on the root node as follows:
<programs
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:nyp="nyp.xsd"
>
The first attribute is a reference to the schema language itself. It’s not privileged like DTDs, which have their own !DOCTYPE declaration; other schema languages have to declare themselves like this.
So we’ve declared two namespaces, one for the schema language and one for our own schema. Now we can prefix elements, as in <nyp:concertInfo>, and can borrow semantics across schemas.
The schema doc itself looks like:
<xs:element name="orchestra"
type="xs:string"/>
<xs:element name="Date"
type="xs:dateTime"/>
XSD has more data types than DTDs, so we can, for example, specify date-times here.
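A sketch of what a typed element buys us. A DTD's #PCDATA accepts any text, whereas a date-time type rejects text that is not a date-time. The function below only approximates the xs:dateTime check with Python's strptime: it ignores time zones and fractional seconds, and strptime is more lenient about zero padding than the real type. The function name is hypothetical.

```python
from datetime import datetime


def looks_like_xs_datetime(text):
    """Approximate check for the basic YYYY-MM-DDThh:mm:ss form."""
    try:
        datetime.strptime(text, "%Y-%m-%dT%H:%M:%S")
        return True
    except ValueError:
        return False


print(looks_like_xs_datetime("1842-12-07T20:00:00"))  # a plausible date-time
print(looks_like_xs_datetime("7 December 1842"))      # free text
print(looks_like_xs_datetime("1842-13-07T20:00:00"))  # month 13 does not exist
```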
We can define complexType elements for those that contain complex substructures, for example a specific sequence of child nodes like this:
<xs:element name="concertInfo">
<xs:complexType>
<xs:sequence>
<xs:element ref="Location" />
<xs:element ref="Venue" />
<xs:element ref="Date" />
<xs:element ref="Time" />
</xs:sequence>
</xs:complexType>
</xs:element>
Instead of a specific sequence (which might change) we might use <xs:all>, which means all the children must be present, in any order. There is also <xs:any>, a wildcard that permits elements the schema does not name. We can specify things like minOccurs="1" maxOccurs="unbounded" to say one or more of that node must be present.
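The difference between a sequence, an all group and an occurrence constraint can be sketched as three small checks over the list of child element names. This is an illustration in Python, not an XSD validator; the function names are hypothetical, and EXPECTED mirrors the concertInfo children above.

```python
import xml.etree.ElementTree as ET

EXPECTED = ["Location", "Venue", "Date", "Time"]


def child_tags(xml_text):
    return [child.tag for child in ET.fromstring(xml_text)]


def valid_as_sequence(tags):
    # xs:sequence: the children must appear in exactly the declared order.
    return tags == EXPECTED


def valid_as_all(tags):
    # xs:all: each declared child appears once, in any order.
    return sorted(tags) == sorted(EXPECTED)


def occurs_within(tags, name, min_occurs=1, max_occurs=None):
    # minOccurs / maxOccurs; max_occurs=None stands in for "unbounded".
    count = tags.count(name)
    return count >= min_occurs and (max_occurs is None or count <= max_occurs)


in_order = child_tags(
    "<concertInfo><Location/><Venue/><Date/><Time/></concertInfo>"
)
shuffled = child_tags(
    "<concertInfo><Date/><Time/><Location/><Venue/></concertInfo>"
)

print(valid_as_sequence(in_order), valid_as_all(in_order))
print(valid_as_sequence(shuffled), valid_as_all(shuffled))
print(occurs_within(shuffled, "Date"), occurs_within(shuffled, "Program"))
```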
Relax NG
(NG stands for Next Generation)
Developed in parallel with xsd with a slightly different philosophy. The schema builds out the tree that reflects the conforming document tree:
<element name="program">
<element name="orchestra">
<text/>
</element>
<element name="season">
<data type="NMTOKEN"/>
</element>
<oneOrMore>
<element name="concertInfo">
<element name="Venue">
<text />
</element>
<element name="Time">
<data type="NMTOKEN"/>
</element>
</element>
</oneOrMore>
</element>
There is an alternative compact syntax too, which looks more like JSON.
Schematron
Schematron is different from the other schema languages, which define the grammar of the document. Schematron is a pattern-based checking mechanism.
You define a rule using an xpath pattern, and then assertion mechanisms that can check the validity of the document semantics.
Here’s a basic example:
<schema
xmlns="http://purl.oclc.org/dsdl/schematron">
<pattern>
<title>Basic Checks</title>
<rule context="//concertInfo">
<assert
test="Date &lt; current-date()">
A concert cannot happen in the future
</assert>
</rule>
</pattern>
</schema>
These are essentially integrity checks - combined with other schemas, they let you add business-logic constraints to the more formal grammars.
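The rule-plus-assertion pattern can be emulated in a few lines of Python: select the context nodes, run a test on each, and report a message for every failure. This is a sketch of the idea only, not a Schematron processor. The sample document is made up, and the current time is passed in as a parameter so the check is repeatable.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Hypothetical sample: one past concert and one far-future concert.
SAMPLE = """
<programs>
  <program>
    <concertInfo><Date>1842-12-07T20:00:00</Date></concertInfo>
    <concertInfo><Date>2999-01-01T20:00:00</Date></concertInfo>
  </program>
</programs>
"""


def check_concert_dates(xml_text, now):
    """Return one message per concertInfo whose Date is not before `now`."""
    root = ET.fromstring(xml_text)
    failures = []
    for concert in root.iter("concertInfo"):                 # rule context
        when = datetime.fromisoformat(concert.findtext("Date"))
        if not when < now:                                   # assert test
            failures.append(
                f"A concert cannot happen in the future: {when.isoformat()}"
            )
    return failures


failures = check_concert_dates(SAMPLE, now=datetime(2022, 4, 1))
print(failures)
```

A grammar-based schema would accept both concerts, since both dates are well-typed; only a rule like this catches the one that makes no sense.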
Finally mentions ODD from the TEI project, which is a meta-language for expressing schema fragments and documentation.
XML and Music
Introduces the history of trying to represent the visual language of music notation in digital encoding standards.
Many attempts were tied to individual applications.
Once XML came along, two dominant standards emerged. MusicXML is used for moving files between different typesetting programs.
MEI, or the Music Encoding Initiative, is the other, used heavily by musicologists.
Walks through an example of MEI
Concludes the topic by noting that XML is widely used in document contexts, industry databases and academic research.
Schemas are still monumental though: we can share schemas, but we rarely share smaller units of meaning. We’re sharing documents and document semantics.
For granular sharing we need the semantic web, which we cover next.