Alex's Notes

DuCharme: Chapter 02: The Semantic Web, RDF, and Linked Data

URLs, URIs, Namespaces

Uniform Resource Locators (URLs) were one of the initial trilogy of specifications from Tim Berners-Lee (TBL), the others being HTML and HTTP. They are a compact way for the client to specify the resource it wants.

Ownership of a domain gives you authority to control the file structure and resource names. This led people to use URLs as names for things that weren’t web addresses, eg the FOAF vocabulary uses URLs like http://xmlns.com/foaf/0.1/Person to refer to a concept, with no actual web page at the end of it.

This confused people, so an attempt was made to create a new syntax, Uniform Resource Names or URNs, that would look like urn:xmlns.com/foaf/0.1/Person instead. A meta concept to encompass both URNs and URLs was introduced, the Uniform Resource Identifier (URI). Note that the U is sometimes expanded as Universal (the earliest URI spec did this), but the current specs use Uniform for URL, URN, and URI alike.

URNs didn’t catch on at all, so now people often use URI and URL interchangeably.

URIs are restricted to a limited set of ASCII characters, so they can’t directly contain text in scripts such as Chinese or Cyrillic. There is yet another spec, Internationalized Resource Identifiers or IRIs, that extends URIs to support more characters. IRI is the most inclusive term, though again the terms are often used interchangeably.

These identifiers happened to solve another problem: name clashes between XML vocabularies. URIs were selected as the way to identify XML namespaces, so you’ll often see XML docs like this:

<xml xmlns:dc="http://purl.org/dc/elements/1.1/">
<book>
  <dc:title>My Great Book</dc:title>
</book>
</xml>

Prefixes don’t identify the namespace; the URIs do. The prefixes are just shorthand.
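To convince myself of this, here is a small sketch using Python's standard xml.etree.ElementTree. The root element name catalog and the second prefix dublin are made up for illustration. Both documents bind a different prefix to the same namespace URI, and the parser reports the same qualified element name for each.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"

# Two documents that differ only in the prefix bound to the same namespace URI.
# "catalog" and the "dublin" prefix are invented for this illustration.
DOC_DC = (
    f'<catalog xmlns:dc="{DC}">'
    "<book><dc:title>My Great Book</dc:title></book>"
    "</catalog>"
)
DOC_OTHER = (
    f'<catalog xmlns:dublin="{DC}">'
    "<book><dublin:title>My Great Book</dublin:title></book>"
    "</catalog>"
)


def qualified_tags(doc):
    """Return the namespace-qualified tag of every element, in document order."""
    return [element.tag for element in ET.fromstring(doc).iter()]


# ElementTree writes qualified names as "{namespace-uri}local-name",
# so the prefix chosen in the source text disappears after parsing.
print(qualified_tags(DOC_DC))
print(qualified_tags(DOC_DC) == qualified_tags(DOC_OTHER))
```

The prefix never survives parsing; only the namespace URI and the local name do.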

In RDF, subjects and predicates must belong to specific namespaces. RDF follows the same prefix convention as XML, but you can also use the full URI instead if you want (in many syntaxes, just enclose it in angle brackets). Just about anywhere in RDF and SPARQL where you can use a URI you can use a prefixed name instead, so long as the prefix has been declared. The part of the name after the colon is the local name; the whole thing, prefix included, is the prefixed name.

Namespace | TTL prefix declaration | Prefixed name | Local name
<http://purl.org/dc/elements/1.1/> | @prefix dc: <http://purl...> | dc:title | title
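The relationship between the four columns can be sketched in a few lines of Python. The expand function and the PREFIXES dictionary are my own names, not part of any RDF library, and the sketch ignores the finer points of the real Turtle grammar.

```python
# Declared prefixes, as a Turtle "@prefix" line would set them up.
PREFIXES = {"dc": "http://purl.org/dc/elements/1.1/"}


def expand(name, prefixes):
    """Turn a prefixed name (dc:title) or a bracketed URI (<...>) into a full URI."""
    if name.startswith("<") and name.endswith(">"):
        return name[1:-1]  # already a full URI, just strip the angle brackets
    prefix, colon, local_name = name.partition(":")
    if not colon or prefix not in prefixes:
        raise ValueError(f"undeclared prefix in {name!r}")
    return prefixes[prefix] + local_name


print(expand("dc:title", PREFIXES))
print(expand("<http://purl.org/dc/elements/1.1/title>", PREFIXES))
```

Both calls name the same resource, which is the whole point of a prefixed name.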

RDF

To recap, the Resource Description Framework (RDF):

  • Is a data model in which the basic unit of information is a triple.

  • A triple consists of a subject, predicate, and object - a resource identifier, an attribute or property name, and an attribute or property value.

  • To remove ambiguity, the subject and predicate of a triple must be URIs (though we can use prefixed names in place of full URIs).
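As a rough mental model, the points above can be sketched in plain Python: a triple is a three-field record and a graph is an unordered set of them. The Triple and match names are my own, and objects are kept as plain strings here, whereas a real RDF library distinguishes URIs from literals.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str    # resource identifier
    predicate: str  # property name
    object: str     # property value (URI or literal, simplified to str here)


DC = "http://purl.org/dc/elements/1.1/"
TBL = "http://www.w3.org/People/Berners-Lee/card#i"

# A graph is just a set of triples, so duplicates and ordering carry no meaning.
graph = {
    Triple("urn:isbn:006251587X", DC + "title", "Weaving the Web"),
    Triple("urn:isbn:006251587X", DC + "creator", TBL),
}


def match(graph, s=None, p=None, o=None):
    """Return the triples that agree with every field that is not None."""
    return {
        t
        for t in graph
        if (s is None or t.subject == s)
        and (p is None or t.predicate == p)
        and (o is None or t.object == o)
    }


print(match(graph, p=DC + "creator"))
```

Leaving a field as None works a bit like a variable in a SPARQL triple pattern.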

Serialization

The most common serialization format for RDF is Turtle, but RDF has a number of different ones.

N-Triples

The simplest is N-Triples, a subset of N3. In N-Triples URIs are written in angle brackets, strings in quote marks. Each triple has its own line, with a period at the end, eg:

<urn:isbn:011421421> <http://purl.org/dc/elements/1.1/creator> <http://www.w3.org/People/Berners-Lee/card#i> .

The order of a set of triples does not matter. N-Triples is popular for teaching and can be quick to parse, but is verbose.
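Because the format is so regular, writing it out is almost trivial. Here is a minimal sketch with helper names of my own choosing. It only handles URIs and plain strings with basic escaping, so it is not a complete N-Triples writer.

```python
def nt_uri(uri):
    """URIs go in angle brackets."""
    return f"<{uri}>"


def nt_string(text):
    """Plain strings go in double quotes, with backslash, quote and newline escaped."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"'


def nt_line(subject, predicate, obj):
    """One triple per line, terminated by a period."""
    return f"{subject} {predicate} {obj} ."


DC = "http://purl.org/dc/elements/1.1/"
lines = [
    nt_line(
        nt_uri("urn:isbn:011421421"),
        nt_uri(DC + "creator"),
        nt_uri("http://www.w3.org/People/Berners-Lee/card#i"),
    ),
    nt_line(nt_uri("urn:isbn:011421421"), nt_uri(DC + "title"), nt_string("A Title")),
]

# Order does not matter: comparing the lines as sets treats both orderings as equal.
print(set(lines) == set(reversed(lines)))
print("\n".join(lines))
```

The title triple and its "A Title" value are invented to show a string object; the creator triple is the one from the example above.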

RDF/XML and RDFa

The oldest serialization is RDF/XML, which was part of the original RDF spec. At the time the book was published, it was still the only standardized serialization format. Here is an example:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
	 xmlns:dc="http://purl.org/dc/elements/1.1/"
	 xmlns:v="http://www.w3.org/2006/vcard/">

  <rdf:Description rdf:about="urn:isbn:006251587X">
    <dc:title>Weaving the Web</dc:title>
    <dc:creator rdf:resource="http://www.w3.org/People/Berners-Lee/card#i"/>
  </rdf:Description>

  <rdf:Description rdf:about="http://www.w3.org/People/Berners-Lee/card#i">
    <v:title>Director</v:title>
  </rdf:Description>

</rdf:RDF>

Some notes on the spec:

  • The element containing all the triples must be an RDF element from the namespace: http://www.w3.org/1999/02/22-rdf-syntax-ns#

  • The subject of each triple is named in the rdf:about attribute in the rdf:Description element.

  • You can have a separate rdf:Description tag for each triple, or you can group them, as here, to have multiple children, and so multiple triples attached to one subject.

  • Note the objects of the triples may be plain strings (eg v:title) or resources themselves (see dc:creator).
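To see how those notes map onto triples, here is a deliberately naive extractor built on Python's standard xml.etree.ElementTree. It is a sketch that only understands the flat rdf:Description / rdf:about / rdf:resource pattern used above, not the full RDF/XML grammar, and the extract_triples name and the ("uri" | "literal", value) object encoding are my own.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

RDF_XML = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:v="http://www.w3.org/2006/vcard/">
  <rdf:Description rdf:about="urn:isbn:006251587X">
    <dc:title>Weaving the Web</dc:title>
    <dc:creator rdf:resource="http://www.w3.org/People/Berners-Lee/card#i"/>
  </rdf:Description>
  <rdf:Description rdf:about="http://www.w3.org/People/Berners-Lee/card#i">
    <v:title>Director</v:title>
  </rdf:Description>
</rdf:RDF>"""


def extract_triples(rdf_xml):
    """Yield (subject, predicate, (kind, value)) for each child of rdf:Description."""
    root = ET.fromstring(rdf_xml)
    triples = []
    for description in root.findall(f"{{{RDF_NS}}}Description"):
        subject = description.get(f"{{{RDF_NS}}}about")
        for child in description:
            # ElementTree tags look like "{namespace}local"; glue them into one URI.
            predicate = child.tag[1:].replace("}", "", 1)
            resource = child.get(f"{{{RDF_NS}}}resource")
            if resource is not None:
                triples.append((subject, predicate, ("uri", resource)))
            else:
                triples.append((subject, predicate, ("literal", child.text or "")))
    return triples


for triple in extract_triples(RDF_XML):
    print(triple)
```

Two rdf:Description elements with three children in total give three triples, one per child element.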

RDF/XML never became popular because it is a pain to process.

An alternative is RDFa, where RDF triples are embedded in other HTML or XML documents as attributes. Tools can then pull the triples out of the HTML on a web page, for example, and let you run SPARQL queries over them.

N3 and Turtle

N3 was a personal project by TBL to be an easier way to write triples than RDF/XML. Here’s an example:


@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix v:  <http://www.w3.org/2006/vcard/> .

<http://www.w3.org/People/Berners-Lee/card#i>
  v:title "Director" .

<urn:isbn:006251587X>
  dc:creator <http://www.w3.org/People/Berners-Lee/card#i> ;
  dc:title "Weaving the Web" .

Compared with N-Triples, extra whitespace and prefixes are allowed. Note the semicolon, which indicates that another predicate and object are coming for the same subject. You can also use a comma to indicate that the next object belongs to the preceding subject and predicate.
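One way to see what the semicolon and comma abbreviate is to go the other direction and generate them from flat triples. This is a sketch with a made-up to_turtle helper. It assumes the terms are already valid Turtle text (prefixed names, bracketed URIs, quoted strings) and that the @prefix lines are written separately.

```python
from collections import defaultdict


def to_turtle(triples):
    """Group triples by subject, then predicate, and emit ';' and ',' shorthand."""
    grouped = defaultdict(lambda: defaultdict(list))
    for subject, predicate, obj in triples:
        grouped[subject][predicate].append(obj)

    blocks = []
    for subject, predicates in grouped.items():
        # A comma separates objects that share the same subject and predicate.
        lines = [f"  {p} {', '.join(objs)}" for p, objs in predicates.items()]
        # A semicolon separates predicate/object groups that share a subject.
        blocks.append(subject + "\n" + " ;\n".join(lines) + " .")
    return "\n\n".join(blocks)


TRIPLES = [
    ("<urn:isbn:006251587X>", "dc:creator", "<http://www.w3.org/People/Berners-Lee/card#i>"),
    ("<urn:isbn:006251587X>", "dc:title", '"Weaving the Web"'),
]

print(to_turtle(TRIPLES))
```

Fed the two book triples, this prints the same urn:isbn block as the example above.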

N3 also has extra features, like specifying inference rules or referencing a whole graph of triples as a resource.

Turtle is N3 without the inference and other bells and whistles. Turtle is the most popular format and is being standardized.

RDF Databases and Data Typing

If you’re storing a lot of triples then a flat file isn’t going to be great, so several database management systems are available. You can store RDF in relational databases, but optimized triple stores are a better bet.

Objects can be more than just strings or other resources: you can assign datatypes, and people often use the XML Schema datatypes. Turtle processors will infer the type if you leave the quote marks off literals and write true or 4 or 3.15, for example. You can also explicitly declare the type (for example, for date types). Here’s an example with all types explicit:

@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix d:   <http://learningsparql.com/ns/data#> .
@prefix dm:  <http://learningsparql.com/ns/demo#> .

d:item342 dm:shipped     "2011-02-14"^^<http://www.w3.org/2001/XMLSchema#date> .
d:item342 dm:quantity    "4"^^xsd:integer .
d:item342 dm:invoiced    "false"^^xsd:boolean .
d:item342 dm:costPerItem "3.50"^^xsd:decimal .
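What a typed literal buys you is that software can turn the quoted text into a real value. Here is a minimal sketch of that mapping in Python. The CONVERTERS table and to_python function are my own invention, only the four datatypes from the example are covered, and details like time zones on dates are ignored.

```python
from datetime import date
from decimal import Decimal

XSD = "http://www.w3.org/2001/XMLSchema#"

# Map a datatype URI to a function that parses the literal's lexical form.
CONVERTERS = {
    XSD + "date": date.fromisoformat,
    XSD + "integer": int,
    XSD + "boolean": lambda text: text in ("true", "1"),
    XSD + "decimal": Decimal,
}


def to_python(lexical_form, datatype):
    """Convert a typed literal to a Python value; unknown types stay as strings."""
    return CONVERTERS.get(datatype, str)(lexical_form)


print(to_python("2011-02-14", XSD + "date"))
print(to_python("4", XSD + "integer") + 1)
print(to_python("false", XSD + "boolean"))
print(to_python("3.50", XSD + "decimal") * 4)
```

Decimal is used rather than float so that a cost like 3.50 keeps its exact value.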

Labelling

Most RDF subjects are just unique identifiers, and they often don’t have an interpretable human meaning. So the rdfs:label predicate is very important: it’s best practice to assign labels to resources so that human readers can more easily see what they represent. It’s also common to use multiple labels with language tags, like this:

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

<http://dbpedia.org/resource/Switzerland> rdfs:label "Switzerland"@en,
  "Suiza"@es, "Sveitsi"@fi, "Suisse"@fr .

The @en, @es, etc suffixes are language tags. The SKOS standard offers skos:prefLabel for preferred labels and skos:altLabel for alternate labels, and more. There is also an rdfs:comment property that can be used for longer descriptions.
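A typical use of the language tags is picking the right label for a user. A small sketch, assuming the labels have already been read into (text, language) pairs; the pick_label helper is hypothetical and does exact tag matching only, with no handling of regional subtags.

```python
# The four labels from the Switzerland example, as (text, language tag) pairs.
LABELS = [
    ("Switzerland", "en"),
    ("Suiza", "es"),
    ("Sveitsi", "fi"),
    ("Suisse", "fr"),
]


def pick_label(labels, preferred_languages, fallback="en"):
    """Return the first label matching the user's language preferences, in order."""
    by_language = {language.lower(): text for text, language in labels}
    for language in list(preferred_languages) + [fallback]:
        if language.lower() in by_language:
            return by_language[language.lower()]
    return None  # no label in any acceptable language


print(pick_label(LABELS, ["de", "fr"]))
print(pick_label(LABELS, ["de"]))
```

In SPARQL the same choice is usually made with a filter on the label's language.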

Blank nodes

Blank nodes can be used to group together otherwise flat nodes (think elements of an address). They can be denoted in Turtle by a prefix of underscore, for example _:b1, which can then be used as the subject of a set of triples. Square brackets are sometimes used instead of the underscore prefix.
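Here is a sketch of the address idea in Python. The ab:address property, the field names, the person identifier and the address values are all hypothetical, and the triples are kept as tuples of Turtle-style strings. The blank node label only has to be unique within one file, so a simple counter is enough.

```python
import itertools


def address_triples(person, address, ids):
    """Hang the address fields off a fresh blank node linked to the person."""
    node = f"_:b{next(ids)}"  # blank node label, meaningful only inside this data set
    triples = [(person, "ab:address", node)]
    for field, value in address.items():
        triples.append((node, f"ab:{field}", f'"{value}"'))
    return triples


ids = itertools.count(1)
triples = address_triples(
    "ab:i0432",
    {"streetAddress": "32 Main St.", "city": "Springfield"},
    ids,
)

for subject, predicate, obj in triples:
    print(subject, predicate, obj, ".")
```

The blank node has no URI of its own; it exists only to tie the street and city to the same address.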

Vocabularies: RDF Schema

Vocabularies are usually stored using RDF Schema and OWL standards.

RDF Schema is a vocabulary description language - a way to describe a vocabulary. For example, here is a description of the dc:creator predicate:


@prefix dc:   <http://purl.org/dc/elements/1.1/> .
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

dc:creator
    rdf:type rdf:Property ;
    # a rdf:Property ;
    rdfs:comment "An entity primarily responsible for making the resource."@en-US ;
    rdfs:label "Creator"@en-US .

Note the rdf:type property, which means that the subject is an instance of the object class. Turtle, N3, and SPARQL offer the keyword a as shorthand for rdf:type (shown in the commented-out line above), which makes this easier to read. We can define new classes too:

@prefix ab:   <http://learningsparql.com/ns/addressbook#> .
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

ab:Musician
    rdf:type rdfs:Class ;
    rdfs:label "Musician" ;
    rdfs:comment "Someone who plays a musical instrument" .

ab:MusicalInstrument
    a rdfs:Class ;
    rdfs:label "musical instrument" .


ab:playsInstrument
    rdf:type rdf:Property ;
    rdfs:comment "Identifies the instrument that someone plays" ;
    rdfs:label "plays instrument" ;
    rdfs:domain ab:Musician ;
    rdfs:range ab:MusicalInstrument .

Note that with the rdfs:domain property I can say that the subject of any ?s ab:playsInstrument ?o triple belongs to the class ab:Musician, while the rdfs:range property here means that the object belongs to the class ab:MusicalInstrument. Those class memberships will be inferred by RDFS-aware query processors. Note that RDFS-aware means OWL in practice.
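The inference itself is simple enough to sketch by hand. This is a toy version in Python, not how a real reasoner is implemented; the infer_types function and the d:i8301 and d:trumpet resources are made up for illustration.

```python
# Schema facts from the vocabulary above: property -> class.
DOMAINS = {"ab:playsInstrument": "ab:Musician"}
RANGES = {"ab:playsInstrument": "ab:MusicalInstrument"}

# Instance data (hypothetical resources).
DATA = {("d:i8301", "ab:playsInstrument", "d:trumpet")}


def infer_types(triples, domains, ranges):
    """Derive rdf:type triples from rdfs:domain and rdfs:range declarations."""
    inferred = set()
    for subject, predicate, obj in triples:
        if predicate in domains:
            # rdfs:domain: whatever appears as the subject is in the domain class.
            inferred.add((subject, "rdf:type", domains[predicate]))
        if predicate in ranges:
            # rdfs:range: whatever appears as the object is in the range class.
            inferred.add((obj, "rdf:type", ranges[predicate]))
    return inferred


for triple in sorted(infer_types(DATA, DOMAINS, RANGES)):
    print(triple)
```

Domain and range add new facts rather than validate existing ones: nothing stops the data from saying someone plays a vacuum cleaner, it would just be inferred to be a musical instrument.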

Ontologies: OWL

OWL is the W3C’s Web Ontology Language. It builds on RDFS to let you define ontologies, formal definitions of vocabularies that allow you to define complex structures as well as new relationships between your vocabulary terms and members of the classes you define.

An ontology defined with OWL is just another collection of triples. We can specify that properties are symmetric, or that one is the inverse of another, for example, to allow reasoning engines to do more with our data.
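As a toy illustration of what a reasoning engine does with those two declarations, here is a hand-rolled sketch in Python. The property names (ab:spouse, ab:patient, ab:doctor) and the resources are hypothetical, and a real OWL reasoner covers far more than these two rules.

```python
def apply_rules(triples, symmetric, inverse_pairs):
    """Add the triples implied by symmetric and inverse property declarations."""
    # An inverse declaration works in both directions.
    inverse_of = {}
    for p, q in inverse_pairs:
        inverse_of[p] = q
        inverse_of[q] = p

    inferred = set(triples)
    # One pass is enough here: applying either rule to a derived triple
    # only gives back a triple we already have.
    for subject, predicate, obj in triples:
        if predicate in symmetric:
            inferred.add((obj, predicate, subject))
        if predicate in inverse_of:
            inferred.add((obj, inverse_of[predicate], subject))
    return inferred


DATA = {
    ("d:i0432", "ab:spouse", "d:i9771"),
    ("d:i8301", "ab:patient", "d:i0432"),
}

result = apply_rules(
    DATA,
    symmetric={"ab:spouse"},
    inverse_pairs=[("ab:patient", "ab:doctor")],
)

for triple in sorted(result - DATA):
    print(triple)
```

The two printed triples were never stated in the data, but a query that asks who someone's doctor is can now find an answer.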
