cm3060 Topic 01: Introduction
Main Info
Title: Introduction to NLP
Teachers: Tony Russell-Rose
Semester Taken: October 2021
Parent Module: cm3060 Natural Language Processing
Description
Week One: Introduction to the history, paradigms, and applications of NLP.
Week Two: Refresher on Python and Jupyter notebooks. Intro to evaluation metrics used in NLP.
Key Reading
Additional Reading
Perkins, Python 3 Text Processing with NLTK 3 Cookbook, ch. 7 on precision and recall (this reading is not summarised here: it largely repeats the lecture content and shows how the metrics are implemented in nltk.metrics).
Video lecture summaries
1. Intro to NLP
Reviews disciplinary perspectives on NLP: Computer/data science; information science; computational linguistics; cognitive science.
Stresses that NLP is not just a subset of Machine Learning; the other disciplinary perspectives will enrich your understanding and abilities.
Categorizes two paradigms:
symbolic approaches, rule-based, hand coded, knowledge intensive
statistical approaches, distributional and neural, supervised or unsupervised, data intensive.
Note that this lumps together earlier statistical approaches with more recent neural/transformer approaches. Not everyone divides the field this way; you can split these out to give three main paradigms instead. His grouping is reflected in the module's lack of deep treatment of neural approaches.
2. NLP Applications
Briefly summarizes some main ‘problems’ that have been tackled under NLP:
Text categorization: Media monitoring (classifying incoming news feeds); search engines classifying user intent; spam detection.
Machine translation: Fully and semi-automated.
Text summarisation: single document vs. multi-document; search results; research tools; word processors.
Dialog systems: Chatbots; smart speakers; smartphone assistants; call handling systems (travel, banking, booking).
Sentiment analysis: identify and extract opinion or subjective information. Sub-tasks: identify polarity (+ve, -ve, neutral); identify emotional states; separate subjective from objective information (fact from opinion); feature/aspect-based evaluation (eg “the service was good but the food was terrible”).
Text mining: Infer new knowledge from unstructured text resources. A <-> B and B<->C, can we infer A <-> C? Applications in life sciences, counter terrorism.
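The A <-> B, B <-> C inference pattern can be sketched as a two-hop walk over a co-occurrence graph. A minimal sketch, with toy data loosely echoing Swanson's well-known fish oil / Raynaud's syndrome discovery (the entity names and links here are illustrative, not from the lecture):

```python
# Toy literature-based discovery: if A co-occurs with B, and B with C,
# propose A <-> C as a candidate (previously unseen) association.
known_links = {("fish_oil", "blood_viscosity"),
               ("blood_viscosity", "raynauds")}

# Build an adjacency map, treating links as undirected.
neighbours = {}
for a, b in known_links:
    neighbours.setdefault(a, set()).add(b)
    neighbours.setdefault(b, set()).add(a)

# Candidate A <-> C links: entities two hops apart with no direct link.
candidates = set()
for a, bs in neighbours.items():
    for b in bs:
        for c in neighbours[b]:
            if c != a and c not in bs:
                candidates.add(tuple(sorted((a, c))))

print(candidates)  # {('fish_oil', 'raynauds')}
```

Real systems score candidates by co-occurrence statistics rather than proposing every two-hop pair, but the transitive structure is the same.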
Question Answering: going beyond document retrieval, provide specific answers to specific questions.
And mentions there are many more: text generation, speech recognition and synthesis.
3. History of NLP
NLP is not a recent phenomenon; it has had a cyclic history of fashions. Modern approaches can be complex and opaque, so it is helpful to see how they build on earlier, simpler ideas and techniques, which can still be combined to produce useful applications.
Outlines several phases:
Phase One: Foundational Insights: 1940s and 50s
Foundational paradigms: the automaton and information theoretic models.
Cites Turing’s work on computability and Chomsky’s Three Models (1956) for automata theory and its relation to language grammar.
Cites Shannon’s development of mathematical information theory, and the concept of entropy to measure the information content of a language using probabilistic techniques.
Phase Two: Two camps: 1960s and 70s
Speech and language processing splits into symbolic and statistical camps.
Symbolic: Chomsky and others working on parsing algorithms; AI work on reasoning and logic, NLU systems based on pattern matching, keyword search, heuristics for reasoning.
Statistical: Cites Mosteller and Wallace (1964) on Bayesian methods and authorship attribution in the Federalist Papers.
Gives examples of early applications:
Eliza, the ‘first chatbot’, Weizenbaum 1966, based on simple pattern matching.
SHRDLU, Winograd 1972, focused more on NLU. Based on an imaginary ‘block world’, tries to resolve referential ambiguity.
First text corpora: the Brown corpus, a million-word collection drawn from 500 written texts, assembled in 1963-64.
Phase Three: Empiricism: 1980s and 90s
Return of finite state models and probabilistic and data-driven approaches.
Finite-state models: developed by Kaplan and Kay for morphology (1981) and by Church for syntax (1980).
Probabilistic/data-driven: from speech recognition to POS tagging, parsing and semantics.
Model evaluation - field developed quantitative metrics and comparison with prior work. Competitive evaluation exercises, and growth of the web furthered the field.
Phase Four: The Rise of Machine Learning: 2000s
Large amounts of data became available, including annotated collections. (eg Penn Treebank, Marcus et al 1993).
Traditional NLP problems like parsing and semantics became problems for supervised learning.
Unsupervised statistical approaches began to receive renewed attention as costs of producing annotated corpora became prohibitive:
Translation (Brown et al 1990; Och and Ney 2003), and topic modelling (Blei et al 2003) demonstrated effective applications with unannotated data.
Phase Five: Ascendance of Deep Learning: 2010s to present
Deep learning methods have become pervasive in NLP and AI more generally due to advances in GPUs for computation, reduction in cost of memory, availability of software platforms and frameworks.
Classic ML methods require analysts to select features based on domain knowledge, DL introduced automated feature engineering.
Collobert et al (2011) applied convolutional neural networks (CNNs) to POS tagging, chunking, NE tags and language modelling.
But CNNs are unable to handle long-distance contextual information.
Recurrent neural networks (RNNs) process items as a sequence with a memory of previous inputs. Applicable to many tasks such as:
word level: NER, language modelling
sentence level: sentiment analysis, selecting responses to messages
language generation: machine translation, image captioning
RNNs are supplemented with long short-term memory (LSTM) or gated recurrent units (GRUs) to improve training performance.
4. Development Environment
Introduces the practical side of the course and the tools used: Python; Jupyter notebooks and Jupyter lab; and the libraries:
nltk: simple, consistent, extensible, modular. Origins in computational linguistics, popular for teaching, but not optimized for runtime performance, not ‘state of the art’. Also look at TextBlob, built on nltk.
scikit-learn: general ML library
gensim: used for word embeddings, topic modelling
spaCy: picks up where NLTK leaves off, good for industrial use; and textacy, built on spaCy.
TensorFlow and PyTorch, DL libraries
RASA for chatbot development.
5. Evaluation Metrics
NLP tasks can be categorized by problem type (categorization, regression, sequence labelling), but how do we evaluate the resulting models?
There are various metrics:
Accuracy - \(\frac{\text{number of correct predictions}}{\text{total number of predictions}}\)
Seems OK, but what if we have an unbalanced dataset? Let’s say we have a binary classification where 80% of the cases are one category. We could just pick that every time and end up with 80% accuracy.
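To make the imbalance problem concrete, here is a minimal sketch (with made-up labels) showing how a “classifier” that always predicts the majority class reaches 80% accuracy while never detecting a single minority-class case:

```python
# Toy gold labels: 80% of cases belong to the majority class "neg".
labels = ["neg"] * 80 + ["pos"] * 20

# A degenerate classifier that ignores its input entirely
# and always predicts the majority class.
predictions = ["neg"] * len(labels)

correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)

print(accuracy)  # 0.8 -- looks decent, yet no "pos" case is ever found
```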
We can look at it more carefully with the help of a confusion matrix
 | Actual +ve | Actual -ve |
---|---|---|
Predicted +ve | True Positive | False Positive |
Predicted -ve | False Negative | True Negative |
This then gives us the following metrics:
Metric | Value |
---|---|
Accuracy | \(\frac{(TP + TN)}{(TP + TN + FP + FN)}\) |
Recall | \(\frac{TP}{(TP+FN)}\) |
Precision | \(\frac{TP}{(TP+FP)}\) |
Informally we can think of it this way:
recall - what proportion of the actual positive cases did we detect?
precision - of the cases we predicted positive, what proportion were correct?
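These definitions can be checked with a small hand-rolled sketch (the confusion-matrix counts below are made up for illustration):

```python
# Hypothetical confusion-matrix counts for a binary classifier.
tp, fp, fn, tn = 40, 10, 20, 30

# The three metrics from the table above.
accuracy  = (tp + tn) / (tp + tn + fp + fn)   # (40 + 30) / 100 = 0.7
recall    = tp / (tp + fn)                    # 40 / 60, about 0.67
precision = tp / (tp + fp)                    # 40 / 50 = 0.8

print(accuracy, recall, precision)
```

In practice libraries such as scikit-learn and nltk.metrics provide these, but hand-computing them once from the four counts is a useful sanity check.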