CM3010 Topic 08: Linked Data
Main Info
Title: Linked Data
Teachers: David Lewis
Semester Taken: April 2022
Parent Module: cm3010: Databases and Advanced Data Techniques
Description
RDF and SPARQL
Key Reading
- Information Management: A Proposal, Tim Berners-Lee, 1989.
Other Reading
- Nurmikko-Fuller et al., Building Prototypes Aggregating Musicological Datasets on the Semantic Web
- Bechhofer et al., Why Linked Data is Not Enough for Scientists
- Page et al., REST and Linked Data: A Match Made for Domain-Driven Development
Lecture Summaries
8.0 Linked Data and the Semantic Web
8.001 Intro to Linked Data
We’ve been looking at distributing data across machines. We’re going to look at a particular proposal, written by Tim Berners-Lee at CERN in 1989, to tackle the organization’s problems of losing information and of internal communication. The proposal is called Information Management: A Proposal.
The requirements for the information system were that it afford remote access, be heterogeneous, not be centralized, and give access to existing data. Linking would be important, to connect different datasets. Links would need to be owned by the people who made them - they could be private. The links must be live.
The connections would allow data analysis, because the information would be connected together.
The proposal describes a “web” of nodes with links (like references) between them, as much more useful than a fixed hierarchical system.
Originally the nodes could be anything - people, groups, hardware, software. But what took hold was documents.
In this topic we’ll look more abstractly at the concept of graph based data models - there are nodes and edges between them, no fixed hierarchy. How do we manage that?
8.002 Open, Linked, and Data
We’ve introduced the idea of a semantic web - semantics that can be shared over a web. But what about the concept of linked open data?
Openness genuinely means cost-free, plus a lack of barriers: the data should be findable, usable without barriers, and reusable.
One concept you’ll see is FAIR - findable, accessible, interoperable, reusable.
Linked - comes down to web links. Links are one-way and controlled by the person who makes the link. This means the integrity of a link can’t be maintained - which was seen as primitive when the web was proposed.
No central registry of links.
Links will typically be URLs. They are guaranteed unique; domain names are registered with an organization, so there is clear responsibility for maintenance; and there is an unlimited supply of URLs.
Unique ID is independent of the server or implementation.
8.1 Reading Linked Data
8.101 RDF
Web technologies for documents are brokered through the Document Object Model. Whatever you do in JS ultimately is rendered to the user through the DOM.
Linked Data also has a model: the Resource Description Framework, a framework for describing information that is simple yet allows us to build complex structures from simpler ones.
The RDF model is as simple as Subject -> Predicate -> Object.
That’s it. Here’s an example: Deimos -> Orbits -> Mars.
Note that Mars can be a subject too: Mars -> Orbits -> The Sun.
We build this up from simple triples to a complex graph.
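The triple model can be sketched directly in code: a graph is just a set of three-element tuples, and querying is wildcard matching. A minimal Python sketch (the entity names are illustrative, not real URIs):

```python
# A graph is a set of (subject, predicate, object) triples.
graph = {
    ("Deimos", "orbits", "Mars"),
    ("Mars", "orbits", "TheSun"),
    ("Deimos", "isA", "Moon"),
}

def match(graph, s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return {
        (ts, tp, to)
        for (ts, tp, to) in graph
        if (s is None or ts == s)
        and (p is None or tp == p)
        and (o is None or to == o)
    }

# Everything that orbits Mars:
print(match(graph, p="orbits", o="Mars"))  # the single Deimos triple
```

Real triplestores do essentially this, plus indexes over each position so patterns don’t require a full scan.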
We can simplify our diagrams a bit by including the ‘type’ link in the node itself (like Deimos, a Moon) rather than Deimos -> Is A -> Moon.
How do we take this model and turn it into linked data?
One challenge is maintaining keys - they should be reliable and consistent.
Finding data on the web is challenging.
Sharing meaning is also challenging, as is sharing entities.
URLs give us a reliable, unique key. E.g. orbiting is described in the Wikidata schema as https://wikidata.org/prop/direct/P397. If we use that, we can leverage any other system that relies on the same schema.
Likewise entities. So we can join up datasets based on shared entities.
The rules are simple - the subject and predicate must be URIs. The object can be a URI, but doesn’t have to be. For example if I want to say Mars -> is called -> Mars then the object ‘Mars’ needs to be a string. It can also be a number, date etc. At some point we need actual data.
If two different datasets are using the same concept, we can use the ‘same as’ predicate to connect them.
Dereferencing
A URL is guaranteed unique, so it can be used as a key. It doesn’t need to be linked to a web page, or provide any response to an HTTP request, to be usable. I can make up a URL, say http://london.ac.uk/databases/1234, to refer to the current lecture. If I’m consistent, it works as a key - it’s unique.
That only gives us some of the benefits. The best version of linked data is a URL that returns data when we request it. URLs don’t need to resolve - it’s fine to build a database that just uses them as keys - but it’s far better if requesting the URL returns data about the thing it denotes.
Ideally that would also make use of HTTP’s content negotiation protocol, where the requester can specify the type of data they want. So if I’m a browser I ask for an HTML document; if I’m working in a programmatic environment I could ask for RDF data.
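Content negotiation can be sketched with Python’s standard library: the same URL is requested, but the Accept header asks for Turtle rather than HTML. The Wikidata URL is just an example entity, and the request is built but deliberately not sent, to avoid a live network call:

```python
import urllib.request

# Same URL a browser would visit, but we ask for RDF (Turtle) instead
# of HTML by setting the Accept header.
url = "https://www.wikidata.org/entity/Q111"
req = urllib.request.Request(url, headers={"Accept": "text/turtle"})

# urllib.request.urlopen(req) would now fetch Turtle rather than HTML;
# we stop at building the request so the sketch needs no network.
print(req.get_header("Accept"))  # text/turtle
```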
Serializations
So what should come back? We’ve talked about RDF as an abstract model, but how should we serialize it?
There are multiple serializations available.
You can have a simple one, or one that’s easy to read and write, one that’s easy to mix into documents, easy to manipulate with software. There’s also a way that’s painful in every way.
The simple one is N-Triples (closely related to the richer Notation3, or N3).
Here you just list the three elements, with angle brackets around each URI, terminating the triple with a full stop. E.g.:
<https://www.wikidata.org/entity/Q111> <https://www.wikidata.org/prop/direct/P397> <https://www.wikidata.org/entity/Q525> .
This is quick to get up and running with, very simple.
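As a rough illustration of how simple the format is, here is a minimal Python parser for the URI-only case. (Real N-Triples also allows literal objects and blank nodes, which this sketch ignores.)

```python
import re

# Matches three <...> terms followed by a terminating full stop.
TRIPLE = re.compile(r"<([^>]*)>\s*<([^>]*)>\s*<([^>]*)>\s*\.")

def parse_ntriple(line):
    """Parse one URI-only N-Triples line into a (s, p, o) tuple."""
    m = TRIPLE.match(line.strip())
    return m.groups() if m else None

line = ("<https://www.wikidata.org/entity/Q111> "
        "<https://www.wikidata.org/prop/direct/P397> "
        "<https://www.wikidata.org/entity/Q525> .")
s, p, o = parse_ntriple(line)
print(p)  # https://www.wikidata.org/prop/direct/P397
```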
One of the easiest to manipulate and read is Turtle (TTL).
This is a special form of N-Triples with a few additions. There are prefixes to shorten the URLs themselves. For example we can say:
@prefix wd: <http://www.wikidata.org/entity/> .
@prefix wdt: <http://www.wikidata.org/prop/direct/> .
wd:Q111 wdt:P397 wd:Q525 .
The . character still ends a triple, but there is now additional syntax. Ending a triple with a ; character means the subject carries over to the next triple, which is great for chaining descriptions of a single subject.
We also don’t need a URL for ‘type’; we can just use the keyword a, e.g. <https://example.org/spiderman> a <https://example.org/superhero> .
We often use rdfs:label to show the human readable label of an entity.
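The prefix mechanism amounts to plain string expansion: a prefixed name like wd:Q111 is rebuilt into a full URI by looking up the prefix. A small Python sketch using the Wikidata namespaces:

```python
# Prefix table as a Turtle parser might hold it after reading @prefix lines.
prefixes = {
    "wd": "http://www.wikidata.org/entity/",
    "wdt": "http://www.wikidata.org/prop/direct/",
}

def expand(term):
    """Expand a prefixed name like wd:Q111 into a full URI."""
    prefix, _, local = term.partition(":")
    # Unknown prefixes are returned unchanged.
    return prefixes.get(prefix, prefix + ":") + local

print(expand("wd:Q111"))  # http://www.wikidata.org/entity/Q111
```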
There is also RDFa, which allows you to embed RDF statements in HTML. The lecturer tends to recommend against it: let human-readable data come in human-readable form and machine data in machine form, and use content negotiation instead.
For loading in JavaScript there is JSON-LD - not exactly a serialization of RDF, but a JSON-based graph data format.
A final one is RDF/XML, one of the earliest serializations. It’s horrible to write and read: it forces a graph structure into a tree structure.
8.103 Thinking in Graphs
Graph data modelling can be tricky. Let’s say we start with an important entity, like a ‘film’, and begin by connecting it to an actor. This might be true and important, but the film also has a character, with a name, and that character is played by the actor. Do we need the direct link from the actor to the film, or can we infer it from the fact that they played a character in the film?
I certainly need the role connected to the film, however. It gets even more complicated if, say, the same character has two different actors playing them at different ages. Graph models get really complex really quickly.
There is no hierarchy or order in graphs. Every node-to-node link is one triple, so serialization is at least straightforward.
8.2 Web Ontologies
8.201 Ontologies
You can’t share information without sharing semantics, and linked data shares semantics through ontologies.
How do we do that? Just like XML shares its semantics via schemas defined in XML itself, so linked data shares its semantics via linked data.
The main mechanism is RDFS - RDF Schema - simple structures for talking about the data structure.
RDF Schema is a really simple schema.
We can expand the things we can say about an entity by pulling in other ontologies. For example we can say that a person is in the broader class of agents, so can do things in the world. Also that they are a geographical entity, so can be present in a space.
Anything declared a rdfs:Property can be used as a predicate. A predicate can have a rdfs:domain (what can be a subject) and a rdfs:range (what can be an object). So we can define that foaf:knows must have a domain and range of people - i.e. that it’s a narrower sense of ‘knows’ than the full extent of the term.
We can also use foaf:name to give a human-readable version. We mustn’t trust that a URL is self-describing as to its semantics; we should attach a label and state what the triples really show.
We can have rdfs:subPropertyOf - like sub-classing for entities - so, for example, foaf:name is a sub-property of rdfs:label, the broader concept of giving something a human-readable label.
You can augment the core RDFS schema; most likely you will use OWL, the Web Ontology Language, adding to the core schema similarly to how we can define our own schemas in XML using XSDs.
We can use OWL to specify logical concepts like owl:disjointWith or owl:inverseOf, and build inference engines with it.
We can also specify property types as either owl:ObjectProperty (predicates connecting entities) or owl:DatatypeProperty (predicates connecting entities to literal values - numbers, strings, etc.).
There is also the critical owl:equivalentProperty, which defines equivalences across ontologies, and owl:sameAs, which defines an absolute identity between entities (stronger than an equivalence class).
We can also specify restrictions that define triples that must exist if an entity is to be considered a member of a class. This enables us to do things like this:
mo:Arranger a owl:Class ;
    rdfs:subClassOf foaf:Agent ;
    owl:equivalentClass [
        a owl:Restriction ;
        owl:onProperty event:isAgentIn ;
        owl:someValuesFrom mo:Arrangement
    ] .
Here we say that any agent who is an agent in an arrangement is, equivalently, an arranger.
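As a toy illustration of the inference such a restriction licenses, here is a Python sketch: any agent that isAgentIn something of type mo:Arrangement gets an inferred mo:Arranger type triple. The entity names (ex:Alice, ex:arr1) are invented for the example:

```python
# Invented data: Alice took part in something that is an Arrangement.
graph = {
    ("ex:Alice", "event:isAgentIn", "ex:arr1"),
    ("ex:arr1", "a", "mo:Arrangement"),
}

def infer_arrangers(graph):
    """Materialize mo:Arranger triples licensed by the restriction."""
    inferred = set()
    for s, p, o in graph:
        if p == "event:isAgentIn" and (o, "a", "mo:Arrangement") in graph:
            inferred.add((s, "a", "mo:Arranger"))
    return inferred

print(infer_arrangers(graph))  # Alice is inferred to be an Arranger
```

Writing such inferences back into the graph as new triples is exactly the materialization strategy triplestores use to make later searches quicker.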
We have to beware the open world assumption. A SQL database makes the closed world assumption: the truth is defined by the relation, and when we query it, the truth is solely what is in the relation.
The linked data world explicitly does not have that assumption - we can never know if there is a triple out there that talks about our entity. We have to assume there are other statements out there.
8.205 Designing an Ontology
Designing an ontology should be no different from designing your relational database, but you often have to take extra care. If your model is good and exposed to others, people might start using it, in which case you have created a standard.
If you use others’ ontologies, people could start using the data assuming you’ve used the other ontology correctly. So you have to take into account other people’s use.
Web ontologies are also more explicit than other types of database schemas. Tips:
- Use existing ontologies where possible; combine efforts with others.
- Test with real data and workshop specific scenarios.
- Don’t get lost in rabbit holes or add detail you don’t need; prioritize getting something done over capturing everything.
- Don’t be wrong: be over-general rather than inaccurate.
- Good ontologies take time - days of workshopping and iteration - and multiple viewpoints are vital.
- Draw a lot, even if you’re not a visual thinker, and be as explicit as possible.
https://webprotege.stanford.edu is a useful web tool; there is also a free desktop version (Protégé).
You’ll likely review and extend existing ontologies rather than create new ones from scratch most of the time.
8.3 Graph Databases
8.301 Triplestores and SPARQL
The semantic web is hard to search efficiently.
It’s decentralised, there’s no registry of information.
It has open world assumptions.
We can keep a cache of triples and make a database of that. Then we can index it and query it. We can materialize any inferences as new triples in the graph, after which searching becomes quicker.
Graph databases don’t need to be based on RDF. But one class of graph db, the triplestore, is based on RDF.
Search works through pattern-based matching. The language is SPARQL, the SPARQL Protocol and RDF Query Language.
The structure of a query is as follows. Start with any prefixes, in the form PREFIX ex: <http://example.org/>. Then for retrieval we start with SELECT ?friend, where question-marked terms are named variables bound in the subsequent query. There is no FROM equivalent, since there are no table equivalents: we query the whole graph. The WHERE clause introduces the pattern that we’ll be searching for. For example:
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX ex: <http://example.org/>
SELECT ?fName
WHERE {
ex:David foaf:knows ?friend .
?friend foaf:name ?fName .
} LIMIT 20
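By hand, that WHERE clause amounts to matching the first pattern to bind ?friend, then joining on ?friend in the second pattern to get ?fName. A Python sketch with made-up data:

```python
# Invented triples: David knows two people; a third has a name but is
# not known by David, so the join filters them out.
triples = {
    ("ex:David", "foaf:knows", "ex:p1"),
    ("ex:David", "foaf:knows", "ex:p2"),
    ("ex:p1", "foaf:name", "Ana"),
    ("ex:p2", "foaf:name", "Bo"),
    ("ex:p3", "foaf:name", "Cy"),
}

# Pattern 1: ex:David foaf:knows ?friend .
friends = {o for s, p, o in triples if s == "ex:David" and p == "foaf:knows"}
# Pattern 2: ?friend foaf:name ?fName .  (joined on ?friend)
names = sorted(o for s, p, o in triples if p == "foaf:name" and s in friends)
print(names)  # ['Ana', 'Bo']
```

A real engine plans the join order and uses indexes, but the semantics are this set intersection over shared variables.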
There are additions to the syntax. We can use the + character on a predicate to extend its reach: ex:David foaf:knows+ ?friend . matches anyone reachable through a chain of the knows relationship starting from David. We can use SELECT DISTINCT, as in SQL, to get unique results. We can also FILTER within the body, and LIMIT afterwards.
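What a property path like foaf:knows+ computes is the transitive closure of the knows relation from the starting node. A Python sketch over invented data:

```python
from collections import deque

# Invented knows-chain: David -> Ana -> Bo -> Cy.
edges = {
    ("ex:David", "ex:Ana"),
    ("ex:Ana", "ex:Bo"),
    ("ex:Bo", "ex:Cy"),
}

def knows_plus(start, edges):
    """Everyone reachable via one or more knows hops from start."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for s, o in edges:
            if s == node and o not in seen:
                seen.add(o)
                queue.append(o)
    return seen

print(sorted(knows_plus("ex:David", edges)))  # ['ex:Ana', 'ex:Bo', 'ex:Cy']
```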
The architecture of triplestores is interesting. It’s possible to do something similar to a SQL server, which listens on a port and communicates using its own protocol; such ports are usually behind a firewall and only accessible internally.
A SPARQL endpoint is a server listening on an HTTP connection. It communicates using a protocol based on HTTP. This is often visible to the world. It can be used programmatically. It can be hand-queried. They’ll often have a web form to try it out.
There’s also software called YASGUI (with a web version) which can send a query to any endpoint.
It can be daunting to get started with a new graph dataset. The ontologies won’t tell you much about the shape and feel of the data in the same way that a relational ERD will, or an XML schema will.
One way of getting started is just get some statements, like this:
SELECT *
WHERE {
?subject ?predicate ?object
} LIMIT 20
Another good starting query is to look at the types in the graph:
SELECT DISTINCT ?type
WHERE {
?subject a ?type
} LIMIT 20
Why don’t we see more SPARQL endpoints? It’s a very trusting mechanism - you can write expensive queries very easily - and you need to maintain a live triplestore.
The maintenance burden of an endpoint is high, and even when an org invests in one it can often be taken down later.
8.4 Linked Data Outside Databases
SPARQL is great for treating the linked data world as though it’s a single database.
We can also create federated queries to address multiple endpoints.
But only a minority of linked data is in triplestores, and only a handful of SPARQL endpoints allow federated queries.
The strategy for working with the reality of linked data is ‘follow your nose’. Dereferencing is the critical technique: we have a URI - what happens if we request it? We can specify the data type we want back, and if we ask for RDF we might get it.
We get the RDF back; it tells us relations with other entities; we can then dereference those URIs and repeat, building up a graph of knowledge locally from the links.
We can then put this into a local triple store for SPARQL queries if we wish.
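The follow-your-nose loop can be sketched in Python. Here fetch_triples is a stand-in for a real dereferencing HTTP request (with an RDF Accept header), returning invented data so the sketch runs without a network:

```python
def fetch_triples(uri):
    """Stand-in for dereferencing a URI and parsing the RDF returned.
    The data below is invented for illustration."""
    fake_web = {
        "ex:Deimos": {("ex:Deimos", "ex:orbits", "ex:Mars")},
        "ex:Mars": {("ex:Mars", "ex:orbits", "ex:Sun")},
        "ex:Sun": set(),
    }
    return fake_web.get(uri, set())

def crawl(start, limit=10):
    """Dereference URIs breadth-first, accumulating a local graph."""
    graph, seen, frontier = set(), set(), [start]
    while frontier and len(seen) < limit:
        uri = frontier.pop()
        if uri in seen:
            continue
        seen.add(uri)
        for triple in fetch_triples(uri):
            graph.add(triple)
            # Follow both the subject and object of each new triple.
            frontier.extend(t for t in (triple[0], triple[2]) if t not in seen)
    return graph

print(len(crawl("ex:Deimos")))  # 2
```

The limit parameter matters in practice: without it, following every link can fan out across the whole web of data.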
8.404 Schema.org
It’s easy to conclude that linked data is a research-oriented technology that has no place in the business world.
Schema.org is evidence to the contrary.
Web crawlers have long mined structured information from web pages. Schema.org provided a way to declare structured relations and this is used explicitly in search engines. It relies on linked data standards.