Jurafsky Manning Lecture Summaries: Topic 09 Relation Extraction
9.1 What is Relation Extraction?
Relation extraction is the second core sub-task in Information Extraction, along with Named Entity Recognition.
Consider a complex sentence: “International Business Machines Corporation (IBM or the company) was incorporated in the State of New York on June 16, 1911, as the Computing-Tabulating-Recording Co. (C-T-R)…”
Ultimately we want to extract a fairly complex relation here, something along the lines of:
Company-Founding:
  Company: IBM
  Location: New York
  Date: June 16, 1911
  Original-Name: Computing-Tabulating-Recording Co.
But to manage the complexity we focus on the simpler task of extracting relation triples, such as Founding-year(IBM, 1911), Founding-location(IBM, New York), and so on.
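Triples like these are simple enough to represent directly. A minimal sketch in Python (the relation and entity names come from the example above; the helper function is just for illustration):

```python
# A relation triple as a plain (relation, subject, object) tuple.
triples = [
    ("Founding-year", "IBM", "1911"),
    ("Founding-location", "IBM", "New York"),
]

def triples_about(entity, triples):
    """Return all triples whose subject is the given entity."""
    return [t for t in triples if t[1] == entity]

print(triples_about("IBM", triples))
```

The richer Company-Founding record is then just the collection of triples that share one subject.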
This is useful for many applications, any time we need structured knowledge, which is far easier for downstream applications to work with than unstructured text. We might be constructing a new knowledge base, or augmenting an existing one.
It also supports QA: we can model the relations in the question and search our knowledge base for corresponding relations.
But what relations should we extract? One set is from the ‘Automated Content Extraction’ task (ACE), which defined 17 relations from 6 classes.
For example, family relations among people, physical relations between physical things, and part-whole relations among geographic or corporate entities.
But there are of course different sets of interesting relations for different tasks. For example, the Unified Medical Language System (UMLS) defines relations of interest to the biomedical domain.
Wikipedia also maintains a set of relations for populating its infoboxes. RDF is an important description language for Wikipedia-driven applications.
Ontological relations are usually very important, for example the hypernym (IS-A) relation, which holds between classes, or Instance-of, which holds between individuals and classes.
How do we build these relation extractors? There are a number of methods:
Hand-written patterns
Supervised ML
Semi-supervised and unsupervised approaches
Bootstrapping (using seeds)
Distant supervision
Unsupervised learning from the web
The remaining lectures will explore the three approaches.
9.2 Using Patterns to Build Relations
The simplest way to extract relations is to use hand-built patterns. For example consider a sentence like “Agar is a substance prepared from a mixture of red algae, such as Gelidium, for laboratory or industrial use”.
From this you can tell what Gelidium means, even without having seen it before. How? The ‘algae, such as’ formulation is the give-away.
This intuition was formalized by Hearst (1992), who identified a series of patterns that can be used to identify this IS-A relation: “X or other Y”, “Y, including X”, “Y, especially X”, and so on.
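Two of these Hearst-style patterns can be sketched as toy regexes. This is a drastic simplification (single words stand in for full noun phrases, which real patterns would match via NP chunks):

```python
import re

# Toy Hearst-style patterns for the IS-A relation; each maps a regex
# match to an (instance, class) pair. Simplified: \w+ stands in for an NP.
HEARST_PATTERNS = [
    # "Y, such as X"  ->  IS-A(X, Y)
    (re.compile(r"(\w+), such as (\w+)"), lambda m: (m.group(2), m.group(1))),
    # "X or other Y"  ->  IS-A(X, Y)
    (re.compile(r"(\w+) or other (\w+)"), lambda m: (m.group(1), m.group(2))),
]

def extract_isa(sentence):
    pairs = []
    for pattern, build in HEARST_PATTERNS:
        for m in pattern.finditer(sentence):
            pairs.append(build(m))
    return pairs

print(extract_isa("a mixture of red algae, such as Gelidium"))
# -> [('Gelidium', 'algae')]
```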
The intuition can be extended to richer relations. Often such relations hold between specific classes of entities. So we start with NER, and that helps with Relation Extraction.
Named Entities aren’t enough on their own, since many different relations can hold between instances of the same two classes. But we can combine them with the pattern intuition to extract richer relations.
For example we might discern the pattern PERSON, POSITION of ORG to extract who holds which position at which organization.
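A toy sketch of such a typed pattern, assuming sentences arrive pre-tagged by an NER step. The inline TYPE[text] tag format, the example sentence, and the relation name are all illustrative, not from the lecture:

```python
import re

# "PERSON, POSITION of ORG" over NER-tagged text, written as a regex.
# Tagged input format (TYPE[text]) is a made-up convention for this sketch.
pattern = re.compile(r"PERSON\[([^]]+)\], (\w+) of ORG\[([^]]+)\]")

tagged = "PERSON[Sam Palmisano], CEO of ORG[IBM], said ..."
m = pattern.search(tagged)
if m:
    person, position, org = m.groups()
    print(("Holds-Position", person, position, org))
# -> ('Holds-Position', 'Sam Palmisano', 'CEO', 'IBM')
```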
An advantage of this approach is that hand-built patterns tend to be high-precision, and can be tailored to specific domains.
But they are often low-recall. It’s a lot of work to write patterns for even one relation; imagine doing it for all possible patterns for all possible relations. We’d like better accuracy too.
9.3 Supervised Relation Extraction
Supervised ML is an important way to do relation extraction. The algorithm works as follows:
We choose a set of relations we wish to extract
We choose a set of entities we want to extract relations between
We find and label data:
We choose a representative corpus
We label the named entities
We hand-label relations between the entities
We break into train, dev, and test sets
We train a classifier on the train set and test on the dev/test sets
We often build two classifiers: one to make a boolean decision on whether two entities in a sentence are related at all; if yes, a second to classify the relation.
Why? Classification is faster (most entity pairs are unrelated), and we can use feature sets specific to each sub-problem.
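The two-stage design can be sketched as follows. The stand-in rules are made up for illustration; in practice both stages would be trained classifiers (e.g. logistic regression) over the features discussed next:

```python
# Stage 1: cheap boolean filter -- "are these two entities related at all?"
def is_related(features):
    # stub rule: nearby mentions are more likely to be related
    return features.get("words_between", 99) < 5

# Stage 2: relation-type classifier, only invoked for pairs passing stage 1
def relation_type(features):
    return "Employment" if "CEO" in features.get("trigger_words", []) else "Other"

def classify(features):
    if not is_related(features):
        return None  # most pairs are filtered out here, cheaply
    return relation_type(features)

print(classify({"words_between": 2, "trigger_words": ["CEO"]}))  # -> Employment
print(classify({"words_between": 20}))                           # -> None
```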
The lecture discusses feature engineering for relation extraction:
We might use the headwords of the two entities, and their combination (e.g. ‘Airlines’ for ‘American Airlines’).
We might use the bag of words and bigrams for the two entities.
We might pick words or bigrams in particular positions left and right of the entities (e.g. +/- 1 of the different mentions).
We might pick the bag of words or bigrams between the two entities.
Named Entity type is an important feature; we might also concatenate the two types to form a third feature.
The entity level might also be used (e.g. is it a name, a nominal, or a pronoun).
Parse features, like the syntactic chunk sequence between the two mentions, or a dependency path.
Trigger word features are useful: e.g. kinship terms (parent, wife, husband) for family relations, which can be drawn from resources such as WordNet.
Gazetteers provide useful word lists from the domain, like lists of country names and their sub-entities, lists of common person names, etc. The term has come to mean any long list of proper nouns that can help.
So we end up with a whole load of features, combine them, and then try out standard classifiers to see if they work.
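Several of the feature types above can be sketched in one extraction function. The example sentence is the classic "American Airlines, a unit of AMR" case; the span format and feature names are illustrative:

```python
# Sketch of assembling word, position, and NER-type features for one
# entity pair. e1/e2 are (start, end, ner_type) token spans, e1 before e2.
def extract_features(tokens, e1, e2):
    s1, t1, ty1 = e1
    s2, t2, ty2 = e2
    return {
        "head1": tokens[t1 - 1],             # headword of entity 1
        "head2": tokens[t2 - 1],             # headword of entity 2
        "bow_between": set(tokens[t1:s2]),   # bag of words between mentions
        "ner_concat": ty1 + "-" + ty2,       # concatenated entity types
        "word_before": tokens[s1 - 1] if s1 > 0 else "<S>",
        "word_after": tokens[t2] if t2 < len(tokens) else "</S>",
    }

tokens = "American Airlines , a unit of AMR , immediately matched the move".split()
feats = extract_features(tokens, (0, 2, "ORG"), (6, 7, "ORG"))
print(feats["head1"], feats["ner_concat"])  # -> Airlines ORG-ORG
```

A real system would add the parse and trigger-word features as further dict entries, then hand the whole dict to a standard classifier.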
We evaluate with the usual classification metrics: precision, recall and F1.
This lets us get high accuracies if we have enough data and the test set is similar enough to training.
Downsides are the expense of labeling and feature engineering, and the brittleness: models don’t generalise well to different genres.
9.4 Semi-supervised and Unsupervised Relation Extraction
Some of the most popular algorithms at the time were semi-supervised or unsupervised approaches.
What happens if you don’t have a training set? Or just a few seed tuples or high-precision patterns. Can you use these as seeds to do anything useful?
Bootstrapping is the process of using a seed to learn to populate a relation. The general approach is:
We gather a set of seed pairs that have a relation R. We iterate:
Find sentences with the pairs
Look at the context between or around the pair and generalize the context to create patterns
Use the patterns to grep for more pairs
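The loop above can be sketched over a tiny in-memory "corpus" (a real system would search the web). The seeds, sentences, and the crude whole-sentence patterns are all illustrative:

```python
import re

def bootstrap(seeds, corpus, iterations=2):
    pairs = set(seeds)
    for _ in range(iterations):
        patterns = set()
        # steps 1-2: find sentences containing a known pair and generalize
        # the context by replacing the entities with capture groups
        for x, y in list(pairs):
            for sent in corpus:
                if x in sent and y in sent:
                    patterns.add(re.escape(sent)
                                 .replace(re.escape(x), "(?P<X>.+)")
                                 .replace(re.escape(y), "(?P<Y>.+)"))
        # step 3: use each pattern to grep the corpus for new pairs
        for pat in patterns:
            for sent in corpus:
                m = re.fullmatch(pat, sent)
                if m:
                    pairs.add((m.group("X"), m.group("Y")))
    return pairs

corpus = [
    "The Comedy of Errors, by William Shakespeare, is a play.",
    "Hamlet, by William Shakespeare, is a play.",
]
pairs = bootstrap({("William Shakespeare", "The Comedy of Errors")}, corpus)
print(("William Shakespeare", "Hamlet") in pairs)  # -> True
```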
This was applied by Sergey Brin in 1998 to extract patterns and relations from the web for book-author pairs.
You start with a seed like “William Shakespeare” and “The Comedy of Errors”. You trawl the web to find sentences with the entities.
You group the observations by what’s in between the entities. Then you look for the longest common sequence before and after the entities, and that becomes the pattern.
Now you iterate to find new seeds that match the pattern.
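The pattern-building step can be sketched directly: group occurrences by the shared middle text, then take the longest common prefix and suffix of the surrounding context. The example contexts are made up:

```python
import os

def build_pattern(occurrences):
    """Each occurrence is a (prefix, middle, suffix) context triple,
    where all occurrences share the same middle text."""
    prefixes = [p for p, _, _ in occurrences]
    suffixes = [s for _, _, s in occurrences]
    common_prefix = os.path.commonprefix(prefixes)
    # longest common suffix = reversed common prefix of the reversed strings
    common_suffix = os.path.commonprefix([s[::-1] for s in suffixes])[::-1]
    middle = occurrences[0][1]
    return common_prefix + "{TITLE}" + middle + "{AUTHOR}" + common_suffix

occurrences = [
    ("Books: ", ", by ", ", in stock."),
    ("Books: classics like ", ", by ", ", in stock."),
]
print(build_pattern(occurrences))
# -> Books: {TITLE}, by {AUTHOR}, in stock.
```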
This was improved by the Snowball algorithm, which adds the requirement that X and Y have to be named entities. It also added a confidence score.
It was extended further with the distant supervision approach. Here instead of taking a small number of seeds (e.g. 5), we start with a large database to get a huge number of seed examples. Then we create lots and lots of features from these examples, and instead of iterating we just use all those features to build a supervised classifier.
Like supervised classification this uses a classifier with lots of features, supervised by detailed hand-created knowledge, and we don’t require the iteratively expanding patterns of the bootstrapping approach.
But in common with unsupervised classification, we use very large amounts of unlabeled data, and we’re not sensitive to genre issues in the training corpus.
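The core labeling trick is simple: any sentence mentioning a pair that the knowledge base lists for relation R becomes a (noisy) positive training example for R. The KB entries and sentences below are made up for illustration:

```python
# Hypothetical knowledge base: (entity1, entity2) -> relation name
kb = {
    ("IBM", "New York"): "Founding-location",
    ("Google", "Mountain View"): "Headquarters",
}

sentences = [
    "IBM was incorporated in the State of New York in 1911.",
    "Google moved into its Mountain View campus in 2004.",
    "He flew from New York to Mountain View.",
]

# Every sentence where both entities of a KB pair co-occur becomes a
# noisy positive example; features would then be extracted from each.
training = []
for (e1, e2), rel in kb.items():
    for sent in sentences:
        if e1 in sent and e2 in sent:
            training.append((sent, e1, e2, rel))

print(len(training))  # -> 2
```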
We can also go all the way and use unsupervised approaches, with no training data and no list of relations. The TextRunner algorithm of Banko et al. works as follows:
We first use parsed data to train a “trustworthy tuple” classifier - using expensive parse features on a small dataset
Then we walk through a large corpus (like the web), extracting all relations between NPs and keeping those the classifier deems trustworthy.
Then an Assessor ranks relations based on text redundancy.
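A toy open-extraction pass in this spirit: pull (arg, relation, arg) tuples with a naive capitalized-word heuristic, then rank by how often each tuple recurs (the assessor's redundancy idea). Everything here is a drastic simplification of the real system:

```python
import re
from collections import Counter

# Naive tuple pattern: CapitalizedWord + past-tense-ish verb + CapitalizedWord.
# Real TextRunner uses a trained classifier over parse-like features instead.
triple_re = re.compile(r"([A-Z]\w+) (\w+ed|\w+s) ([A-Z]\w+)")

corpus = [
    "Einstein visited Princeton.",
    "In 1921 Einstein visited Princeton.",
    "Tesla admired Edison.",
]

triples = Counter()
for sent in corpus:
    for m in triple_re.finditer(sent):
        triples[m.groups()] += 1   # redundancy count = crude confidence

print(triples.most_common(1))
# -> [(('Einstein', 'visited', 'Princeton'), 2)]
```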
How do we evaluate these, since we’re extracting new relations and there is no gold standard set of correct relation instances? We can’t compute precision or recall directly.
We can compute approximate precision only. We draw a random sample of relations from the output and check precision manually.
We can get precision at different levels of recall (eg take the top 1000 relations, or top 10000), but we can’t evaluate recall itself.
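The approximate evaluation can be sketched in a few lines. The human judgments below are made up for illustration; recall cannot be computed because the full set of true relations is unknown:

```python
# judged[i] = a (hypothetical) human verdict on the i-th ranked extracted
# relation: True = correct, False = wrong.
judged = [True, True, True, False, True, True, False, True, False, False]

def precision_at(k):
    """Precision over the top-k ranked extractions."""
    return sum(judged[:k]) / k

print(precision_at(5), precision_at(10))  # -> 0.8 0.6
```

At scale, only a random sample of the top-k output would be judged, giving an estimate of precision at that cutoff rather than an exact value.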