Vector Semantics (Jurafsky Martin)
As discussed in Jurafsky Martin Chapter 06: Vector Semantics and Word Embeddings
Introduction
Vector semantics is the standard way to represent word meaning in NLP.
The roots of the model lie in the 1950s. Osgood had the idea of using a point in three-dimensional space to represent a word's connotation (as discussed in Lexical Semantics (Jurafsky Martin)). Around the same time, linguists proposed defining the meaning of a word by its distribution in language use, meaning its neighbouring words or grammatical environments. Combining the two ideas gets us to vector semantics.
The idea of vector semantics is to represent a word as a point in a multidimensional semantic space that is derived… from the distributions of word neighbours. Vectors for representing words are called embeddings… The word “embedding” derives from its mathematical sense as a mapping from one space or structure to another, although the meaning has shifted.
The model of word similarity in vector semantics gives tremendous power to NLP applications. Previous models relied on the same words appearing in both the training set and the test set; with word embeddings, a classifier can generalize as long as it sees words with similar meanings. Moreover, vector semantic models can be learned automatically from text without supervision.
The chapter introduces the two most commonly used vector models:
tf-idf model: An important baseline. The meaning of a word is defined by a simple function of the counts of nearby words. This produces a very long sparse vector.
word2vec model: A family of models that construct shorter, dense vectors that have useful semantic properties.
Note that the term embedding is sometimes used more strictly, for dense models like word2vec only, and not for sparse models.
Document Vectors
Vector models of meaning are generally based on a co-occurrence matrix, a way of representing how often words co-occur.
In a term-document matrix each row represents a word in the vocabulary, each column a document from some corpus. For example:
|        | As You Like It | Twelfth Night | Julius Caesar |
| ------ | -------------- | ------------- | ------------- |
| battle | 1              | 0             | 7             |
| good   | 114            | 80            | 62            |
| fool   | 36             | 58            | 1             |
| wit    | 20             | 15            | 2             |
This type of matrix was first defined for the vector space model of information retrieval. Each document is a column vector, and a vector space is a collection of vectors characterized by their dimension. Here the vectors have dimension 4; in general they have dimensionality \(|V|\), the vocabulary size, so we can think of the vector for a document as a point in \(|V|\)-dimensional space. In information retrieval these locations let us find similar documents: documents that are similar will contain similar words, and so their column vectors will be close together. In a real scenario these matrices can be enormous: there are \(D\) columns, one for every document in the collection, and the vocabulary can run to tens of thousands of words. The IR task is to find the document \(d\) that best matches a query \(q\). We represent \(q\) as a vector too and compare similarities, typically using tf-idf weighting and the cosine similarity metric.
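As a small sketch of this representation, we can store the counts from the table above in a matrix whose columns are the document vectors (the variable names and layout here are illustrative, not from the chapter):

```python
# Sketch: a tiny term-document matrix built from the Shakespeare counts above.
# Rows are vocabulary words, columns are documents, so each document is a
# column vector of dimension |V| (here |V| = 4).
import numpy as np

vocab = ["battle", "good", "fool", "wit"]
docs = ["As You Like It", "Twelfth Night", "Julius Caesar"]

# counts[i, j] = frequency of vocab[i] in docs[j]
counts = np.array([
    [1,   0,  7],
    [114, 80, 62],
    [36,  58, 1],
    [20,  15, 2],
])

# The column vector representing "As You Like It":
as_you_like_it = counts[:, 0]
print(as_you_like_it)
```

Slicing a column gives the document-as-point-in-\(|V|\)-dimensional-space view described above; slicing a row (`counts[0, :]`) gives the word-as-vector view discussed in the next section.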
Words as vectors
So a document can be represented as a column vector, but a word can likewise be represented as a row vector (so the word ‘battle’ above would be [1, 0, 7]). On the same principle as document similarity, similar words will have similar vectors, as they occur in similar documents. So a term-document matrix lets us represent the meaning of words by the documents they tend to occur in.
An alternative to the term-document matrix is the term-term matrix, also called the word-word matrix or term-context matrix, in which columns are labeled by words rather than documents, so that we have a \(|V| \times |V|\) matrix. Here each cell records the number of times the row (target) word co-occurs with the column (context) word in some context in the training corpus.
The context might be the whole document, but more commonly it is a smaller context, a window around the word (for example, +/- four words around the row word). A simplified word-word matrix might look something like this:
|             | aardvark | … | computer | data | result | pie | sugar |
| ----------- | -------- | - | -------- | ---- | ------ | --- | ----- |
| cherry      | 0        | … | 2        | 8    | 9      | 442 | 25    |
| strawberry  | 0        | … | 0        | 0    | 1      | 60  | 19    |
| digital     | 0        | … | 1670     | 1683 | 85     | 5   | 4     |
| information | 0        | … | 3325     | 3982 | 378    | 5   | 13    |
Note that cherry and strawberry are more similar to each other (both co-occur often with pie and sugar) than either is to digital or information.
Generally \(|V|\) is between 10,000 and 50,000 words; keeping words beyond the most frequent 50,000 is not usually helpful. Most of the entries will be zero, so these are sparse vector representations, and there are efficient algorithms for storing and computing with sparse vectors.
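The windowed counting described above can be sketched in a few lines. This is a minimal illustration over a toy token list, not the chapter's implementation; the function name and corpus are assumptions:

```python
# Sketch: count word-word co-occurrences in a symmetric +/-`window` token
# window, producing the sparse counts behind a term-term matrix.
from collections import defaultdict

def cooccurrence_counts(tokens, window=4):
    """Return counts[target][context]: how often `context` appears within
    `window` tokens of `target` (the target itself is excluded)."""
    counts = defaultdict(lambda: defaultdict(int))
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[target][tokens[j]] += 1
    return counts

tokens = "the pie with cherry filling and the pie with sugar".split()
counts = cooccurrence_counts(tokens, window=4)
print(dict(counts["pie"]))
```

Only the non-zero cells are stored in the nested dictionaries, which is exactly the sparsity the paragraph above points out.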
Measuring Similarity: Cosine
To measure similarity between two target words v and w we need a metric that takes two vectors of the same dimensionality and gives a measure of their similarity. By far the most common metric is the cosine of the angle between the vectors.
The cosine is based on the dot product of the two vectors:
\[v \cdot w = \sum^N_{i=1} v_iw_i\]
The dot product will be high when the two vectors have large values in the same dimensions. Orthogonal vectors will have a dot product of 0, reflecting their dissimilarity. But the raw dot product favours long vectors: it is higher when a vector is longer, and in our case it will be higher for more frequent words with high counts. We want a score that measures similarity regardless of frequency, so we can modify the dot product by dividing by the lengths of the two vectors, where the length of a vector is:
\[|v| = \sqrt{\sum^N_{i=1} v_i^2}\]
This gives us a normalized dot product which turns out to be the same as the cosine of the angle between the two vectors:
\[ a \cdot b = |a||b| \cos \theta\]
\[ \frac{a \cdot b}{|a| |b|} = \cos \theta\]
So we can compute the cosine similarity between two word vectors as follows:
\[\text{cosine}(v,w) = \frac{v \cdot w}{|v| |w|} = \frac{\sum^N_{i=1} v_i w_i}{\sqrt{\sum^N_{i=1} v_i^2} \sqrt{\sum^N_{i=1} w_i^2}}\]
We might choose to pre-normalize each vector by dividing it by its length to create a unit vector. Then we can just take the dot product, as this will be the same as the cosine.
The cosine values range from -1 for vectors pointing in opposite directions, through 0 for orthogonal vectors, to 1 for vectors pointing in the same direction. Since raw frequency values are non-negative, the cosine similarity of word count vectors ranges over [0, 1].
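The formula above can be checked directly on rows of the word-word matrix. A minimal sketch, using the counts from the table (restricted to the computer, data, result, pie, sugar dimensions; the helper function is ours):

```python
# Sketch: cosine similarity between word count vectors from the
# word-word matrix above.
import numpy as np

def cosine(v, w):
    """Normalized dot product of v and w."""
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

# dimensions: computer, data, result, pie, sugar
cherry      = np.array([2, 8, 9, 442, 25])
digital     = np.array([1670, 1683, 85, 5, 4])
information = np.array([3325, 3982, 378, 5, 13])

print(f"cosine(cherry, information)  = {cosine(cherry, information):.3f}")
print(f"cosine(digital, information) = {cosine(digital, information):.3f}")
```

As expected, digital comes out far more similar to information than cherry does, because the counts are non-negative both cosines fall in [0, 1].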
Weighting Terms
We don’t want to use raw term frequencies, however, as they don’t discriminate well enough. JM present a long section on the two main models for weighting terms to give better results: tf-idf and positive pointwise mutual information (PPMI).
This is captured in the Weighting Term Vectors note.