Alex's Notes

cm3060 Topic 07: Information Extraction

Main Info

Description

Covers the application of NLP to extract structured information from unstructured text. Includes POS tagging, relation extraction, and named entity recognition (NER).

Assigned Reading

Supplementary Reading

Introduction to Information Extraction, NER, and Sequence Classification

See also Jurafsky Manning Lecture Summaries: Topic 08 Information Extraction

Introduction to Relation Extraction

See also Jurafsky Manning Lecture Summaries: Topic 09 Relation Extraction

Part of Speech Labeling

See also Jurafsky Manning Lecture Summaries: Topic 11: POS Tagging

UoL Lecture Summaries

Introduction to Information Extraction

Information extraction is the task of making the implicit structure of information in natural language explicit.

Typically factual information.

You can also think of it from a data-model point of view: you are transforming strings into relations.

Discusses some use cases - from scientific literature mining to extracting dates from an email.

Divides information extraction into sub-tasks:

Named Entity Recognition

He includes in NER not just labelling the tokens, but also resolving them to a canonical database representation.

Relation Extraction

How do the entities we’ve extracted relate to each other?

Event Extraction

Identify specific events or episodes to reconstruct a timeline

Event coreference

Coreference resolution on events

Temporal event expression extraction

Trying to determine when the events actually occurred.

Reviews the standard sets of relations from ACE to UMLS, all from Jurafsky Manning Lecture Summaries: Topic 08 Information Extraction

Introduction to Relation Extraction

Four main categories:

1. Handwritten patterns

Uses the same example as Jurafsky Manning Lecture Summaries: Topic 09 Relation Extraction

High precision, low recall; not scalable and domain specific.
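A handwritten pattern can be as simple as a regular expression. A minimal sketch of a Hearst-style “X such as Y” pattern (the function name and example text are illustrative, not the lecture’s exact example; real systems match over POS/NP patterns rather than raw words):

```python
import re

# A minimal Hearst-style pattern: "<hypernym> such as <hyponym>".
# Plain single-word groups keep the sketch simple.
HEARST = re.compile(r"(\w+) such as (\w+)")

def extract_is_a(text):
    """Return (hyponym, 'is-a', hypernym) relation tuples."""
    return [(hypo, "is-a", hyper) for hyper, hypo in HEARST.findall(text)]

print(extract_is_a("She admired composers such as Bach, and cities such as Vienna."))
# [('Bach', 'is-a', 'composers'), ('Vienna', 'is-a', 'cities')]
```

This shows both properties from the lecture at once: the matches it does make are precise, but anything phrased differently (“Bach and other composers”) is missed, hence the low recall.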

2. Supervised Learning

Follow the standard ML workflow. Includes the idea of using a binary classifier first to see if any relation exists.

Can use neural approaches, e.g. a pre-trained encoder like BERT which is fine-tuned.

High accuracy, domain specific, but expensive and brittle.

3. Semi-supervised learning

Covers bootstrapping - start with seed relations, find some example sentences, generalize the context and use them to find new seeds.

Also covers distant supervision: create lots of noisy training examples from a large database of seed tuples, then use them to train a supervised classifier.

Distant supervision avoids bootstrapping’s semantic-drift problem, but needs an existing DB and has relatively low precision.
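The bootstrapping loop can be sketched in a few lines of pure Python (the corpus, seed pair, and helper names are invented for illustration): seed pairs yield context patterns, and the patterns then harvest new candidate pairs — which is exactly where semantic drift can creep in once a bad pattern slips through.

```python
# Toy bootstrapping sketch (corpus and seed pair invented for illustration):
# generalize the context around seed pairs into patterns, then apply the
# patterns to harvest new candidate pairs.
corpus = [
    "Paris is the capital of France",
    "Berlin is the capital of Germany",
    "Tokyo is the capital of Japan",
]
seeds = {("Paris", "France")}

def contexts(sentences, pairs):
    """Take the text between each known (x, y) pair as a candidate pattern."""
    patterns = set()
    for s in sentences:
        for x, y in pairs:
            if x in s and y in s:
                patterns.add(s[s.index(x) + len(x):s.index(y)])
    return patterns

def apply_patterns(sentences, patterns):
    """Match each pattern; the words either side become a new (x, y) pair."""
    found = set()
    for s in sentences:
        for p in patterns:
            if p in s:
                left, right = s.split(p, 1)
                if left.split() and right.split():
                    found.add((left.split()[-1], right.split()[0]))
    return found

patterns = contexts(corpus, seeds)
print(apply_patterns(corpus, patterns))
# finds ('Berlin', 'Germany') and ('Tokyo', 'Japan') alongside the seed pair
```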

4. Unsupervised or Open Relation Extraction

No labelled data.

We run a POS tagger and entity chunker over each sentence s.

We find the longest phrase that starts with a verb (w)

Find nearest NPs to the left (x) and right (y)

Assign a confidence to relation r = (x,w,y).

Relations are stored if they occur with at least N arguments.

This is flexible, but the output needs to be mapped to some canonical form.
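The steps above can be sketched in miniature over a pre-tagged toy sentence (no tagger model needed). For simplicity the relation phrase w is reduced to a single verb and there is no confidence scoring; real systems such as ReVerb use longer verb-led phrases:

```python
# Toy open-IE sketch over a pre-tagged sentence (no tagger model needed).
# The relation phrase w is reduced to a single verb; real systems use
# longer verb-led phrases plus a confidence score for r = (x, w, y).
tagged = [("Faust", "NNP"), ("deeply", "RB"),
          ("admires", "VBZ"), ("Mephistopheles", "NNP")]

def open_ie(sent):
    """Return (x, w, y): nearest noun left of the verb, the verb, nearest noun right."""
    verbs = [i for i, (_, tag) in enumerate(sent) if tag.startswith("VB")]
    if not verbs:
        return None
    i = verbs[0]
    left = [w for w, tag in sent[:i] if tag.startswith("NN")]
    right = [w for w, tag in sent[i + 1:] if tag.startswith("NN")]
    if not (left and right):
        return None
    return (left[-1], sent[i][0], right[0])

print(open_ie(tagged))  # ('Faust', 'admires', 'Mephistopheles')
```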

Shows a standard model for an IE pipeline.

POS Tagging and IE with NLTK

Switches to practical demos of POS tagging with NLTK: you tokenize the text, then apply a POS tagger to the token list.

Next video looks at IE with NLTK, walking through a basic pipeline.

Chunking

Looks at chunking, the identification of meaningful multi-word chunks (such as noun phrases); essentially a shallow parse.

Walks through constructing a simple grammar from POS tags, and builds up more complex grammars.

Chinking is the reverse: we define patterns for material we want to exclude from a chunk.
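A chunk grammar with a chink rule, in the spirit of the NLTK book’s classic example — chunk everything, then chink verbs and prepositions back out. It runs on a hand-tagged sentence, so no tagger model download is needed:

```python
import nltk

# NP chunk grammar with a chink rule: first chunk every tag sequence,
# then chink verbs and prepositions back out of the chunks.
grammar = r"""
NP:
  {<.*>+}        # chunk everything
  }<VB.*|IN>+{   # chink sequences of verbs and prepositions
"""
parser = nltk.RegexpParser(grammar)
tagged = [("the", "DT"), ("little", "JJ"), ("dog", "NN"),
          ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
tree = parser.parse(tagged)
chunks = [" ".join(word for word, tag in sub.leaves())
          for sub in tree.subtrees() if sub.label() == "NP"]
print(chunks)  # ['the little dog', 'the cat']
```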

Introduces BIO tags - beginning, inside, and outside tags that are commonly used in chunking and NER.
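The BIO scheme can be illustrated in a few lines of pure Python (the `bio_tags` helper and its `(start, end)` span input are hypothetical, just to show the labelling):

```python
# BIO labelling sketch: B- marks the beginning of a chunk, I- a token
# inside one, and O a token outside any chunk. The bio_tags helper and
# its (start, end) span input are hypothetical.
def bio_tags(tokens, spans, label="NP"):
    tags = ["O"] * len(tokens)
    for start, end in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return list(zip(tokens, tags))

tokens = ["the", "little", "dog", "barked", "at", "the", "cat"]
print(bio_tags(tokens, [(0, 3), (5, 7)]))
# [('the', 'B-NP'), ('little', 'I-NP'), ('dog', 'I-NP'), ('barked', 'O'),
#  ('at', 'O'), ('the', 'B-NP'), ('cat', 'I-NP')]
```

The B- tag is what lets two adjacent chunks stay distinct, which a plain inside/outside scheme could not represent.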

Next video switches from NLTK to spaCy and shows the noun-chunk API there, accessed through doc.noun_chunks

Final video looks at the NER API in spaCy. Includes LAW as a category of entities, which is ‘Named documents made into laws’. Entities can be accessed through the doc.ents iterator once the text is parsed.