cm3060 Topic 02: Basic Processing
Main Info
Title: Basic Text Processing
Teachers: Tony Russell-Rose
Semester Taken: October 2021
Parent Module: cm3060 Natural Language Processing
Description
The topic concludes the introduction to pre-processing documents and working with corpora.
Week Three: Text Processing Fundamentals; Representing Text; Sentence Segmentation; Word Tokenization; Text Normalization
Week Four: Regular Expressions; Stop Words; Text Corpora
Key Reading
Jurafsky Martin Chapter 02: Regular Expressions, Text Normalization, Edit Distance
NLTK Book Chapter 02: Accessing Text Corpora and Lexical Resources
Additional Reading
Provost and Fawcett, Data Science for Business, ch. 10, section on representing text (pp 249 - 260). This covers bag of words representations, TF/IDF, N-Gram models, and briefly introduces entity extraction and topic modelling.
Manning et al, Introduction to Information Retrieval, ch. 2 sections on text processing (pp 21 - 33). This covers tokenization, stop words, and text normalization: stemming, lemmatization. Noted in: Manning et al: Chapter 02 - The term vocabulary and postings lists
Not assigned until the next topic, but Unix for Poets also includes some text processing guidance using Unix tools.
Video lecture summaries
Topic Introduction
The intro video reviews sources of ‘unstructured’ text data and then focuses on the challenges of processing text data:
Polysemy: One word maps to many concepts (eg bat)
Synonymy: One concept maps to many words (eg smart/bright/clever)
Word order: Sequence matters (venetian blind vs blind venetian)
Language is generative: many ways to express the same thought (paraphrase, metaphor, idioms etc)
Human purposes matter, as does context: language is not used simply to assert truths; meaning depends on context and purpose
Language is changing: (eg I want to buy a mobile)
Ill-formed input: misspelling, confusion etc, a lot of text data is noisy
Co-ordination and negation: challenging for eg topic categorization (“This is not a talk about genetic algorithms”) might confuse a naive categorizer or search index
Multilinguality: eg Claudia Schiffer is on the cover of Elle (an English sentence that includes a German name and French title)
Sarcasm, irony, slang etc (“That was a wicked lecture”)
Text Processing Fundamentals - Overview
Two-part lecture. Starts with some 'humorous' examples of ambiguity, and then stresses that dealing with ambiguity is a central concern of NLP. To determine the structure of information, we have to resolve ambiguity. Ambiguity exists at different levels, giving rise to basic text processing techniques:
Lexical analysis (tokenisation). Trivial in some cases ‘The cat sat on the mat’, but what about ‘I can’t tokenise this’ - how do we tokenise “can’t”?
Stop word removal: No definitive list, and what about "To be or not to be", "The Who" etc… We might want to remove words with low information value, but we have to be careful.
Stemming: crudely ‘strip the ending off’ (fishing, fished, fish, fisher -> fish). Stemming can reduce index size, but can index things that aren’t words.
Lemmatization: Linguistically principled analysis, sensitive to syntactic category. (were -> be + PAST), but what about ratify (rat-ify, infest with rats?!)
Morphology: prefixes and suffixes, English morphology is fairly straightforward compared with eg German. To process a word we may need to break down the internals of a word to examine its component pieces.
Syntax: Part of speech tagging, what role is the word playing in a sentence? Sometimes it’s ambiguous ‘Time flies like an arrow’ vs ‘Fruit flies like a banana’, ‘Eats shoots and leaves’.
Parsing: grammar, “I saw the man on the hill with a telescope” - who has the telescope? Parsing cannot resolve the ambiguity.
Sentence boundary detection: Punctuation denotes the end of the sentence, but can get murky, usually exceptions to every rule.
Text Processing: Practical
The next videos demonstrate some of these core techniques using nltk.
Segmentation
It starts with sentence segmentation using the punkt sentence tokenizer.
import nltk
nltk.download('punkt') # the sentence tokenizer
text = """my long text with some sentences in"""
sentences = nltk.sent_tokenize(text)
sentences[0] # supports basic container ops
len(sentences)
print('\n-----\n'.join(sentences)) # show how they've split them
It then covers word tokenization, showing how nltk’s default tokenizer deals with, eg “aren’t” ([‘are’, “n’t”]) and “it’s” ([‘it’, “‘s”]), urls and emails (splits into several tokens), and acronyms like ‘I.B.M.’ (one token).
import nltk
nltk.download('punkt') # the sentence tokenizer
text = """my long text with some sentences in"""
sentences = nltk.sent_tokenize(text)
word_tokens = [nltk.word_tokenize(s) for s in sentences]
Normalization
The lectures turn to text normalization, ie stemming and lemmatization.
Distinguishes between:
inflectional morphology/variation, variations within a common grammatical category (like a verb). EG runs, run, running -> run (all variations of the verb)
derivational morphology, variations across grammatical categories (eg democratic, democracy)
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
nltk.download('wordnet') # provides the lexical database the lemmatizer uses
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
text = """my long text with some sentences in"""
word_tokens = nltk.word_tokenize(text)
stems = [stemmer.stem(word) for word in word_tokens]
lemmas = [lemmatizer.lemmatize(word) for word in word_tokens] # lemmatizer can take a part of speech optionally
Stop Words
After a detour into regular expressions, it comes back to look at stop words. Stop words are those words that occur so frequently in the language that they are relatively common across all different document types and genres. As a result, they don't tend to offer much in the way of differentiation.
They add noise to ML tasks, so we tend to filter them. No universal list though, choice depends on application’s purpose.
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('punkt') # needed for word_tokenize
print(stopwords.words('english'))
stop_words = set(stopwords.words('english'))
text = """my lovely text"""
words = nltk.word_tokenize(text)
filtered = [w for w in words if w not in stop_words]
Corpora
Walks through several corpora types: isolated (Gutenberg); categorized (Brown); overlapping (Reuters, documents can have multiple categories); temporal (Inaugural addresses)
import nltk
from nltk.corpus import gutenberg # literary corpus
from nltk.corpus import nps_chat # chat corpus
from nltk.corpus import brown # categorized corpus, early milestone
from nltk.corpus import reuters # original reuters corpus
from nltk.corpus import inaugural # temporal corpus, can look at changes over time
nltk.download('reuters') # each corpus must be downloaded before first use (likewise 'gutenberg', 'brown', 'inaugural', 'nps_chat')
gutenberg.fileids()
# extract words or sentences
mac = gutenberg.words('shakespeare-macbeth.txt')
mac_sentences = gutenberg.sents('shakespeare-macbeth.txt')
brown.categories()
brown.words(categories='news')
Regular Expressions
The videos introduce regex practically, looking at different subtopics:
1. Disjunctions
Introduces basic substitution in Python using re.sub(pattern, replacement, text), and using disjunction to match on [Tt] (upper case/lower case), ranges like [0-9], and disjoint sequences like the|The|of.
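A minimal sketch of these disjunction patterns (the example string is my own):

```python
import re

text = "The cat sat on the mat in the year 2021."

# [Tt] matches either case of the first letter
print(re.findall(r'[Tt]he', text))  # ['The', 'the', 'the']

# [0-9] matches any single character in the digit range
print(re.findall(r'[0-9]', text))   # ['2', '0', '2', '1']

# the|The matches either alternative, so re.sub replaces both
print(re.sub(r'the|The', 'a', text))  # a cat sat on a mat in a year 2021.
```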
2. Negations
Introduces the ^ operator to negate within a range, like [^0-9], used to pull digits out of a text by substituting their negation with an empty string.
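For example (my own example string), substituting the negated range with an empty string leaves only the digits:

```python
import re

text = "Call me on 0207 946 0958 after 5pm."

# [^0-9] matches any character that is NOT a digit;
# replacing every such character with '' keeps only the digits
digits = re.sub(r'[^0-9]', '', text)
print(digits)  # 020794609585
```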
3. Optionality
Introduces several operators:
the . operator to match any character,
? to express 0 or 1 counts of the prior character or expression.
* the Kleene star to denote 0 or more, and the notion of greedy matches (shows *? for non-greedy matching).
+ to denote 1 or more.
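A quick sketch of each of these operators (example strings are my own):

```python
import re

# ? makes the preceding character optional: matches 'color' and 'colour'
print(re.findall(r'colou?r', 'color colour'))  # ['color', 'colour']

# . matches any single character
print(re.findall(r'b.t', 'bat bet bit'))       # ['bat', 'bet', 'bit']

# * is greedy by default: it grabs as much as it can
print(re.findall(r'<.*>', '<a> <b>'))          # ['<a> <b>']

# *? is non-greedy: it stops at the first possible match
print(re.findall(r'<.*?>', '<a> <b>'))         # ['<a>', '<b>']

# + means 1 or more of the preceding character
print(re.findall(r'lo+l', 'lol lool ll'))      # ['lol', 'lool']
```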
4. Aliases
Introduces the shortcuts/aliases: \w, \d etc. See the notes for Jurafsky Martin Chapter 02: Regular Expressions, Text Normalization, Edit Distance where these are set out in a table.
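For instance (my own example string):

```python
import re

text = "Order #42 shipped to J. Smith."

# \d is shorthand for [0-9]; \w for [a-zA-Z0-9_]
print(re.findall(r'\d+', text))  # ['42']
print(re.findall(r'\w+', text))  # ['Order', '42', 'shipped', 'to', 'J', 'Smith']
```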
5. Anchors
Introduces the anchors ^, $ for start and end of string; again, see the Jurafsky Martin reading note for more on these. Shows how to pass the flag flags=re.MULTILINE as an argument to re.sub to match at the start of every line, not just the start of the string.
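A small sketch of the MULTILINE flag's effect (my own example string):

```python
import re

text = "one line\ntwo line\nred line"

# without MULTILINE, ^ only anchors at the start of the whole string
print(re.sub(r'^\w+', 'X', text))
# X line
# two line
# red line

# with MULTILINE, ^ anchors at the start of every line
print(re.sub(r'^\w+', 'X', text, flags=re.MULTILINE))
# X line
# X line
# X line
```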