NLTK Book Chapter 02: Accessing Text Corpora and Lexical Resources
Metadata
Title: Accessing Text Corpora and Lexical Resources
Number: 2
Core Ideas
Practical work in Natural Language Processing typically uses large bodies of linguistic data, or corpora. The goal of this chapter is to answer the following questions:
- What are some useful text corpora and lexical resources, and how can we access them with Python?
- Which Python constructs are most helpful for this work?
- How do we avoid repeating ourselves when writing Python code?
The chapter also includes an introduction to functions and modules in Python, which isn’t covered in these notes.
Summary
The chapter walks through several corpora available in NLTK. The corpora lecture in cm3060 Topic 02: Basic Processing follows this closely, using the same examples, so those notes aren’t repeated here. For a full list of available corpora, see the nltk website.
NLTK Corpus Structure
Corpus objects have the following basic methods; see help(nltk.corpus.reader) for more info:
Example | Description |
---|---|
fileids() | the files of the corpus |
fileids([categories]) | the files of the corpus corresponding to these categories |
categories() | the categories of the corpus |
categories([fileids]) | the categories of the corpus corresponding to these files |
raw() | the raw content of the corpus |
raw(fileids=[f1,f2,f3]) | the raw content of the specified files |
raw(categories=[c1,c2]) | the raw content of the specified categories |
words() | the words of the whole corpus |
words(fileids=[f1,f2,f3]) | the words of the specified fileids |
words(categories=[c1,c2]) | the words of the specified categories |
sents() | the sentences of the whole corpus |
sents(fileids=[f1,f2,f3]) | the sentences of the specified fileids |
sents(categories=[c1,c2]) | the sentences of the specified categories |
abspath(fileid) | the location of the given file on disk |
encoding(fileid) | the encoding of the file (if known) |
open(fileid) | open a stream for reading the given corpus file |
root | the path to the root of the locally installed corpus |
readme() | the contents of the README file of the corpus |
Loading your own Corpus
You can create your own corpus from plain text files, which enables use of the methods above. To do this, instantiate the PlaintextCorpusReader class from nltk.corpus with two positional arguments: the directory containing the files as the first argument, and either a list of file ids or a regex pattern matching them as the second:
```python
from nltk.corpus import PlaintextCorpusReader

corpus_root = '/my/amazing/corpus'
my_corpus = PlaintextCorpusReader(corpus_root, '.*')  # will match every file in the directory
```
For more on creating corpus readers to read your own files, see the nltk docs.
For the XML file corpus reader, see the source code.
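A minimal end-to-end sketch, using a throwaway directory so the example is self-contained (the directory and file name here are invented for illustration):

```python
import os
import tempfile
from nltk.corpus import PlaintextCorpusReader

# Build a throwaway corpus directory containing one plain-text file
with tempfile.TemporaryDirectory() as corpus_root:
    with open(os.path.join(corpus_root, 'doc1.txt'), 'w') as f:
        f.write('Hello corpus. This is a tiny test file.')
    my_corpus = PlaintextCorpusReader(corpus_root, r'.*\.txt')
    fids = my_corpus.fileids()                     # ['doc1.txt']
    first = list(my_corpus.words('doc1.txt'))[:3]  # ['Hello', 'corpus', '.']

print(fids, first)
```

The reader tokenizes words with its default tokenizer, so no extra data download is needed for fileids() or words(); sents() additionally requires a sentence tokenizer.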
Statistical Analysis of Corpora
The chapter then runs through some basic statistics that can be computed on corpora using NLTK’s conditional frequency distribution methods:
Example | Description |
---|---|
cfdist = ConditionalFreqDist(pairs) | create a conditional frequency distribution from a list of pairs |
cfdist.conditions() | the conditions |
cfdist[condition] | the frequency distribution for this condition |
cfdist[condition][sample] | frequency for the given sample for this condition |
cfdist.tabulate() | tabulate the conditional frequency distribution |
cfdist.tabulate(samples, conditions) | tabulation limited to the specified samples and conditions |
cfdist.plot() | graphical plot of the conditional frequency distribution |
cfdist.plot(samples, conditions) | graphical plot limited to the specified samples and conditions |
cfdist1 < cfdist2 | test if samples in cfdist1 occur less frequently than in cfdist2 |
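A small self-contained sketch of these methods, using hand-made (condition, sample) pairs rather than a real corpus:

```python
from nltk.probability import ConditionalFreqDist

# Each pair is (condition, sample); here, (genre, word)
pairs = [('news', 'the'), ('news', 'the'), ('news', 'a'),
         ('romance', 'the'), ('romance', 'love')]
cfd = ConditionalFreqDist(pairs)

print(sorted(cfd.conditions()))   # ['news', 'romance']
print(cfd['news']['the'])         # 2
cfd.tabulate(samples=['the', 'a', 'love'])
```

Each condition indexes an ordinary FreqDist, so everything that works on a frequency distribution works on cfd[condition].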
Lexical Resources
The chapter concludes by introducing different types of lexical resources available:
A lexicon, or lexical resource, is a collection of words and/or phrases along with associated information such as part of speech and sense definitions. Lexical resources are secondary to texts, and are usually created and enriched with the help of texts.
A lexical entry consists of a headword (also known as a lemma) along with additional information such as the part of speech and the sense definition. Two distinct words having the same spelling are called homonyms.
Wordlists
The simplest lexical resources are just lists of words, for example lists of common words or stop words. We can easily identify unusual words in a corpus by checking against the nltk words resource (the /usr/share/dict/words file from Unix), or measure the proportion of ‘signal’ to ‘noise’ by checking against the stopword list.
```python
import nltk

def unusual_words(text):
    text_vocab = set(w.lower() for w in text if w.isalpha())
    english_vocab = set(w.lower() for w in nltk.corpus.words.words())
    unusual = text_vocab - english_vocab
    return sorted(unusual)

def content_fraction(text):
    stopwords = nltk.corpus.stopwords.words('english')
    content = [w for w in text if w.lower() not in stopwords]
    return len(content) / len(text)
```
NLTK also provides a list of first names by gender:
```python
names = nltk.corpus.names
names.fileids()  # ['female.txt', 'male.txt']
male_names = names.words('male.txt')
female_names = names.words('female.txt')
```
Tabular Lexicons
A more sophisticated lexical resource includes a word plus some properties of that word in each row. E.g., NLTK includes the CMU Pronouncing Dictionary for US English at nltk.corpus.cmudict. See the chapter for some examples.
NLTK also includes a comparative wordlist of common words across languages, the Swadesh wordlist, with around 200 words in total.
```python
from nltk.corpus import swadesh

fr2en = swadesh.entries(['fr', 'en'])
translate = dict(fr2en)
translate['chien']  # 'dog'
```
More complex lexicons: Toolbox and Wordnet
Perhaps the single most popular tool used by linguists for managing data is Toolbox, previously known as Shoebox since it replaces the field linguist’s traditional shoebox full of file cards. Toolbox is freely downloadable from http://www.sil.org/computing/toolbox/ and available in the nltk.corpus module.
A Toolbox file consists of a collection of entries, where each entry is made up of one or more fields. Most fields are optional or repeatable, which means that this kind of lexical resource cannot be treated as a table or spreadsheet. The loose structure means XML is a useful processing approach.
WordNet is also available in the nltk.corpus module. It is covered elsewhere in the module, so it isn’t noted here; see the chapter for some examples of use.