Alex's Notes

Jurafsky Martin Chapter 06: Vector Semantics and Word Embeddings

Metadata

Core Ideas

Words that occur in similar contexts tend to have similar meanings. This link between similarity in how words are distributed and similarity in what they mean is called the distributional hypothesis.

The distributional hypothesis was first formulated in the 1950s by linguists who noted that synonyms tended to occur in the same environments, with the amount of meaning difference between two words “corresponding roughly to the amount of difference in their environments” (Harris, 1954).

In NLP, the instantiation of this linguistic hypothesis is known as vector semantics, where we learn representations of the meaning of words, called embeddings, directly from their distributions in texts. Such representations are used in every NLP application that makes use of meaning.
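As a concrete illustration: in the simplest vector-semantic model, each word is a vector of co-occurrence counts, and similarity of meaning is measured as the cosine of the angle between vectors. The counts below are the cherry/digital/information example from the chapter; the helper function is a minimal sketch, not a library implementation.

```python
import math

# Term-context counts from the chapter's example: each word is
# represented by how often it co-occurs with the context words
# "pie", "data", and "computer".
vectors = {
    "cherry":      [442, 8, 2],
    "digital":     [5, 1683, 1670],
    "information": [5, 3982, 3325],
}

def cosine(v, w):
    """Cosine similarity: dot product divided by the product of lengths."""
    dot = sum(a * b for a, b in zip(v, w))
    norm_v = math.sqrt(sum(a * a for a in v))
    norm_w = math.sqrt(sum(a * a for a in w))
    return dot / (norm_v * norm_w)

print(cosine(vectors["digital"], vectors["information"]))  # high: similar contexts
print(cosine(vectors["cherry"], vectors["digital"]))       # low: different contexts
```

Words that occur with similar context words ("digital" and "information", both near "data" and "computer") get cosines near 1; distributionally dissimilar words get cosines near 0.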

This chapter starts with static embeddings, which underlie the more powerful dynamic embeddings (also called contextualized embeddings) like BERT, introduced in ch. 11.

These representations are an example of representation learning: learning useful representations of the input text in a self-supervised way, rather than through hand-crafted feature engineering. Representation learning is a very important area for NLP.

Lexical Semantics

The chapter starts by introducing core ideas in Lexical Semantics. This is covered in more detail in ch. 18 which looks at WordNet etc.

Vector Semantics

The chapter then moves on to introduce Vector Semantics, and the main models for Weighting Term Vectors: tf-idf and PPMI.
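Raw counts overweight frequent but uninformative words, so the chapter's weighting schemes rescale them. A minimal sketch of tf-idf weighting (log-scaled term frequency times inverse document frequency), on a tiny made-up corpus:

```python
import math

# Tiny made-up corpus of "documents" for illustration.
docs = [
    "sweet sorrow sweet love".split(),
    "love is blind".split(),
    "parting is such sweet sorrow".split(),
]
N = len(docs)

def tf_idf(term, doc):
    # Log-weighted term frequency: 1 + log10(count), or 0 if absent.
    count = doc.count(term)
    tf = 1 + math.log10(count) if count > 0 else 0.0
    # Inverse document frequency: log10(N / df), where df is the
    # number of documents containing the term.
    df = sum(1 for d in docs if term in d)
    idf = math.log10(N / df) if df else 0.0
    return tf * idf

print(tf_idf("blind", docs[1]))  # rare term: relatively high weight
print(tf_idf("is", docs[1]))     # appears in 2 of 3 docs: low weight
```

A term occurring in every document gets idf = log10(1) = 0, so ubiquitous words are zeroed out no matter how frequent they are locally.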

Static Embeddings

The chapter considers Static Embeddings, most notably word2vec.
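The intuition behind word2vec's skip-gram with negative sampling can be sketched in a few lines: treat the probability that c is a real context word for target w as the sigmoid of their dot product, and nudge both vectors by gradient descent on observed (label 1) and randomly sampled negative (label 0) pairs. This is a toy sketch with illustrative dimensions and learning rate, not the actual word2vec implementation.

```python
import math
import random

random.seed(0)
DIM = 50  # embedding dimension (illustrative; word2vec often uses 100-300)

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def random_vec():
    return [random.uniform(-0.5, 0.5) for _ in range(DIM)]

def p_positive(w_vec, c_vec):
    """Skip-gram classifier: probability that c is a real context of w."""
    return sigmoid(sum(a * b for a, b in zip(w_vec, c_vec)))

def sgd_step(w_vec, c_vec, label, lr=0.1):
    """One SGD update on a (target, context, label) example.

    label is 1 for an observed pair, 0 for a negative sample.
    """
    g = p_positive(w_vec, c_vec) - label  # gradient of the log loss
    for i in range(DIM):
        w_i = w_vec[i]
        w_vec[i] -= lr * g * c_vec[i]
        c_vec[i] -= lr * g * w_i

# Toy usage: repeated positive updates push the pair's score up.
w, c = random_vec(), random_vec()
before = p_positive(w, c)
for _ in range(20):
    sgd_step(w, c, 1)
print(before, "->", p_positive(w, c))
```

After training on a real corpus, the learned target-word vectors are kept as the static embeddings; the context vectors are typically discarded or added in.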

Evaluating Vector Models

The chapter concludes by looking at metrics for evaluating vector models. As elsewhere in NLP, the most important evaluation is extrinsic: using the model on a downstream task and assessing whether it improves performance.

For intrinsic evaluation, most commonly we test a model’s performance on similarity, computing the correlation between an algorithm’s word similarity score and those assigned by people.
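The correlation usually reported is Spearman's rank correlation. A minimal sketch (assuming no tied scores, in which case it is just Pearson correlation computed on ranks); the human ratings and model scores below are made-up numbers:

```python
def ranks(xs):
    """Rank each value from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    for rank, i in enumerate(order, start=1):
        r[i] = float(rank)
    return r

def spearman(xs, ys):
    """Spearman correlation = Pearson correlation of the ranks (no ties)."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mean = (n + 1) / 2  # mean rank
    num = sum((a - mean) * (b - mean) for a, b in zip(rx, ry))
    den = sum((a - mean) ** 2 for a in rx)  # equals the ry term when untied
    return num / den

human = [9.8, 8.5, 6.3, 1.3]     # illustrative human ratings (0-10 scale)
model = [0.91, 0.70, 0.72, 0.1]  # illustrative model cosine similarities
print(spearman(human, model))
```

Rank correlation is preferred here because human ratings and cosine similarities live on different scales; only the orderings need to agree.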

For context-free similarity, several datasets are available:

  • WordSim-353 is a commonly used set of ratings from 0 to 10 for 353 noun pairs.

  • SimLex-999 is more difficult, quantifying similarity rather than relatedness, including abstract and concrete adjective, noun, and verb pairs.

  • TOEFL is a set of 80 questions, each consisting of a target word and four choices for the correct synonym.

For datasets which include context there are:

  • Stanford Contextual Word Similarity (SCWS), giving human judgments on 2,003 word pairs in a sentential context.

  • Word in Context (WiC) gives target words in two sentential contexts, used in either the same or different senses.