cm3060 Topic 04: Lexical Semantics
Main Info
Title: Lexical Semantics
Teachers: Tony Russell-Rose
Semester Taken: October 2021
Parent Module: cm3060 Natural Language Processing
Description
Week Seven introduces lexical resources like WordNet, and lexical semantics more broadly.
Week Eight introduces static embeddings.
Key Reading
Additional Assigned Reading
Lecture Summaries
Week Seven
Lexical Semantics
Suggests there are three broad categories of NLP task: classification/categorization (eg spam detection); sequence tasks (eg text generation); and meaning tasks.
Meaning tasks are useful for information retrieval (IR), question answering (QA) and topic modelling. Where do we find word meaning? We could look in a dictionary.
Introduces WordNet, entries can correspond to abstract concepts. Nodes are synsets. Polyhierarchical structure.
Using WordNet we can programmatically identify hyponyms and hypernyms, and measure semantic similarity.
WordNet
Next lectures are a practical demo of WordNet using nltk.
from nltk.corpus import wordnet as wn
spell_syns = wn.synsets('spell') # returns a list of synsets
spell_syns[0].definition() # the synset objects themselves carry their definition
spell_syns[0].examples()
spell_syns[0].lemma_names()
spell_syns[0].lemmas()
motorcar = wn.synset('car.n.01')
motorcar.hyponyms() # list of child hyponym synsets
motorcar.hypernyms() # list of parent hypernym synsets (polyhierarchical, so possibly several)
motorcar.hypernym_paths() # full paths up to highest category, list of lists of synsets
motorcar.root_hypernyms() # roots of the paths
orca = wn.synset('orca.n.01')
tortoise = wn.synset('tortoise.n.01')
plankton = wn.synset('plankton.n.01')
orca.lowest_common_hypernyms(tortoise) # returns lowest common parent
plankton.path_similarity(orca) # returns value between 0 and 1.
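Under the hood, path_similarity is \(1 / (1 + \text{length of the shortest path between the two synsets in the hypernym hierarchy})\). A minimal sketch of that idea on a hypothetical mini-taxonomy (the node names and edges below are illustrative, not WordNet's actual graph):

```python
from collections import deque

# Hypothetical mini-taxonomy as child -> parent edges (not real WordNet data)
parents = {
    "orca": "whale", "whale": "aquatic_mammal", "aquatic_mammal": "mammal",
    "mammal": "vertebrate", "tortoise": "reptile", "reptile": "vertebrate",
    "vertebrate": "animal", "plankton": "organism", "animal": "organism",
}

def path_similarity(a, b):
    """1 / (1 + shortest path length between a and b via hypernym links)."""
    # build an undirected adjacency map from the child -> parent edges
    adj = {}
    for child, parent in parents.items():
        adj.setdefault(child, set()).add(parent)
        adj.setdefault(parent, set()).add(child)
    # breadth-first search from a until we reach b
    dist, queue = {a: 0}, deque([a])
    while queue:
        node = queue.popleft()
        if node == b:
            return 1 / (1 + dist[node])
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return 0.0

path_similarity("orca", "tortoise")  # shorter path through "vertebrate" -> higher score
path_similarity("orca", "plankton")  # must climb all the way to "organism" -> lower score
```

As in the lecture example, the orca ends up more similar to the tortoise (both vertebrates) than to plankton.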
Week Eight
Word Embeddings
Starts by looking at the shortcomings of approaches like WordNet: the human effort required to develop and maintain them is huge.
Then looks at the early version of the distributional hypothesis:
If A and B have almost identical environments we say that they are synonyms. (Zellig Harris, 1954)
Looks at the basic co-occurrence matrices: term-document and term-term. Examples are taken from Jurafsky Martin Chapter 06: Vector Semantics and Word Embeddings.
Shows a term-document matrix, \(|V| \times |D|\), similar documents will have similar column vectors. Similar words will occur in similar documents and have similar row vectors.
Shows a term-term matrix, \(|V| \times |V|\), again taken from Jurafsky and Martin.
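A term-term matrix just counts how often each pair of words co-occurs within a context window. A minimal sketch in pure Python (the toy corpus and window size below are illustrative):

```python
from collections import Counter

def cooccurrence(tokens, window=1):
    """Symmetric term-term co-occurrence counts within a +/- window."""
    counts = {w: Counter() for w in set(tokens)}
    for i, w in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[w][tokens[j]] += 1
    return counts

tokens = "i like deep learning i like nlp i enjoy flying".split()
m = cooccurrence(tokens, window=1)
m["like"]["i"]  # "i" appears twice within one word of "like" -> 2
```

Each row of the result is a word's sparse co-occurrence vector over \(|V|\); similar words end up with similar rows, which is exactly the property the dense embeddings later try to preserve.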
Shortcomings of traditional matrices: the vectors are sparse (long, with most entries zero) and don't reflect underlying linguistic structure.
Then reviews word embeddings: we represent words with low-dimensional vectors, trying to capture similarity between terms that might not be captured in sparse vectors (eg food/meal). 50-300 dimensions rather than \(|V|\), with most values non-zero.
Benefits of word embeddings: classifiers learn fewer weights, which helps generalization and avoids overfitting; they capture synonymy and bridge the gap to neural approaches.
Introduces Word2Vec, a static embedding approach which captures one meaning per word.
We predict rather than count. Is word x likely to co-occur with word y? We train a classifier then keep the weights as the embedding. Running text is the training data.
Basic algorithm (skip-gram with negative sampling):
Treat neighboring context words as positive samples.
Treat other random words in V as negative samples.
Train a Logistic Regression classifier to distinguish the classes.
Use the learned weights as the embeddings.
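The four steps above can be sketched in pure Python on a toy corpus (corpus, hyperparameters and the hand-rolled logistic-regression update are all illustrative; real implementations use optimised word2vec code):

```python
import math, random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Toy corpus: "cat" and "dog" share contexts; "car" does not.
corpus = "the cat sat the dog sat the car drove".split()
vocab = sorted(set(corpus))
dim, k, lr, window = 8, 2, 0.05, 1

random.seed(0)
# target (W) and context (C) vectors, randomly initialised
W = {w: [random.uniform(-0.5, 0.5) for _ in range(dim)] for w in vocab}
C = {w: [random.uniform(-0.5, 0.5) for _ in range(dim)] for w in vocab}

for epoch in range(100):
    for i, w in enumerate(corpus):
        for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
            if i == j:
                continue
            pairs = [(corpus[j], 1.0)]  # neighbouring word -> positive sample
            # k random words -> negative samples (a real implementation
            # would exclude true context words; fine for a sketch)
            pairs += [(random.choice(vocab), 0.0) for _ in range(k)]
            for c, label in pairs:
                # logistic-regression gradient step on the dot product
                g = (sigmoid(dot(W[w], C[c])) - label) * lr
                for d in range(dim):
                    W[w][d], C[c][d] = W[w][d] - g * C[c][d], C[c][d] - g * W[w][d]

# keep the learned weights as the embedding (here, the sum W_i + C_i)
emb = {w: [a + b for a, b in zip(W[w], C[w])] for w in vocab}
```

The loop is exactly the classifier-as-pretext trick: we never use the classifier itself, only the weight vectors it learns.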
Training process:
Iterate through the training data
Generate positive samples
Generate k negative samples
Objective, maximize the similarity of \((w,c_{pos})\), and minimize similarity of \((w,c_{neg})\) pairs.
Start with random vectors, and use Stochastic Gradient Descent to maximize the dot product of word with actual context words, and minimize the dot product of word with negative non-context words.
We output two matrices, a target matrix W and context matrix C, then the embedding is often the sum: \(W_i + C_i\).
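In Jurafsky and Martin's notation (Chapter 06), that objective for one positive pair and its \(k\) negatives is the negative log-likelihood \(L_{CE} = -\log\left[\sigma(c_{pos}\cdot w)\prod_{i=1}^{k}\sigma(-c_{neg_i}\cdot w)\right] = -\left[\log\sigma(c_{pos}\cdot w) + \sum_{i=1}^{k}\log\sigma(-c_{neg_i}\cdot w)\right]\). Minimising it raises the dot product with actual context words and lowers it with the sampled negatives, as described above.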
Briefly introduces fastText, which uses subword models, and GloVe, which uses global corpus statistics, combining count-based models with word2vec's linear structures.
Word Embeddings in Practice
Uses the gensim library to create word embeddings.
import gensim
from nltk.corpus import brown
#train the model
model = gensim.models.Word2Vec(brown.sents())
#save a copy
model.save('brown.embedding')
#then we can load it later
brown_model = gensim.models.Word2Vec.load('brown.embedding')
#look inside the model
#how many words?
len(brown_model.wv.key_to_index) # gensim >= 4.0; older versions use brown_model.wv.vocab
#how many dimensions?
len(brown_model.wv['university'])
#calculate similarity
brown_model.wv.similarity('university', 'school')
brown_model.wv.most_similar('university', topn=5)
Then looks at a pre-trained embedding:
import nltk
from nltk.data import find
nltk.download('word2vec_sample')
word2vec_sample = str(find('models/word2vec_sample/pruned.word2vec.txt'))
news_model = gensim.models.KeyedVectors.load_word2vec_format(word2vec_sample, binary=False)
Vector Algebra
Next lecture turns to word vector algebra with static embeddings, A is to B as C is to…
news_model.most_similar(positive=['woman', 'king'], negative=['man'], topn=5) #queen
news_model.most_similar(positive=['Paris', 'Germany'], negative=['Berlin'], topn=5) #France
news_model.most_similar(positive=['has', 'be'], negative=['have'], topn=5) #is
news_model.most_similar(positive=['longest', 'short'], negative=['long'], topn=5) #shortest
news_model.most_similar(positive=['him', 'she'], negative=['he'], topn=5) #her
news_model.most_similar(positive=['light', 'long'], negative=['dark'], topn=5) #short
news_model.doesnt_match(['breakfast', 'cereal', 'lunch', 'dinner']) # cereal
Lab Summaries
Lab has you experiment with different embeddings and WordNet to gauge similarity.