Alex's Notes

Weighting Term Vectors (Jurafsky Martin)

As presented in Jurafsky & Martin, Chapter 6: Vector Semantics and Word Embeddings

Raw frequency is not the best measure of word association: it is very skewed and not discriminative. We’re not going to get good discrimination from words like the, it, or they, which occur frequently with all sorts of words. A word like good is simply a frequent word and has high counts in many contexts. This gives us a bit of a paradox: words occurring nearby frequently are more important than words that occur nearby only once or twice, but if they occur too frequently, like the, they are unimportant. How do we balance these constraints?

There are two common approaches, depending on whether the dimensions are documents (tf-idf) or words (PPMI).

TF-IDF Weighting

TF-IDF is the dominant way to weight co-occurrence matrices in information retrieval, and it also plays a role in many other areas of NLP. It’s a great baseline: the simple thing to try first.

It is defined as the product of two terms:

Term frequency, the frequency of word t in document d. Usually this is squashed from the raw frequency by taking the log base 10 of the frequency. We can’t take the log of 0, so we add 1 to the count:

\[\textrm{tf}_{t,d} = \log_{10}(\textrm{count}(t,d)+1)\]

On this scale, a word occurring 0 times has a tf of 0, 10 times a tf of 1.04, 100 times a tf of 2.004, 1000 times a tf of about 3.0004, etc.
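A quick sketch of this squashing in Python (the helper name tf is my own, not from the chapter):

```python
import math

def tf(count):
    """Log-squashed term frequency: log10(count + 1)."""
    return math.log10(count + 1)

# The log compresses large raw counts into a small range:
for c in (0, 10, 100, 1000):
    print(c, tf(c))
```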

Document frequency: the document frequency of a word t is the number of documents it occurs in (not the overall count of t across all documents, which is the collection frequency). Terms that occur in only a few documents are more useful for discriminating among those documents. We can emphasize discriminative words with the inverse document frequency, or idf, term weight. This is defined as \(\frac{N}{\textrm{df}_t}\), where N is the number of documents in the collection and \(\textrm{df}_t\) is the number of documents in which t occurs.

So the ratio ranges from 1, for a term that occurs in every document, to N, for a term that occurs in exactly one document. Again we usually squash the value with a log, because of the large number of documents we typically work with, giving us:

\[\textrm{idf}_t = \log_{10} \left( \frac{N}{\textrm{df}_t} \right)\]

The tf-idf weighted value for word t in document d combines these two terms:

\[w_{t,d} = \textrm{tf}_{t,d} \times \textrm{idf}_t\]
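A minimal sketch of the combined weight (the function name and arguments are my own):

```python
import math

def tf_idf(count, df, n_docs):
    """w_{t,d} = log10(count + 1) * log10(N / df_t)."""
    tf = math.log10(count + 1)
    idf = math.log10(n_docs / df)
    return tf * idf

# A word that appears in every document gets idf = log10(N/N) = 0,
# so its weight is 0 no matter how frequent it is:
print(tf_idf(100, 4, 4))   # -> 0.0
```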

An example TF-IDF term-document matrix might look like this:

```
          As You Like It   Twelfth Night   Julius Caesar   Henry V
battle    0.074            0               0.22            0.28
good      0                0               0               0
fool      0.019            0.021           0.0036          0.0083
wit       0.049            0.044           0.018           0.022
```

Note that the word good, which is ubiquitous in Shakespeare’s plays, is eliminated by its idf of 0, and fool, which is almost ubiquitous, is also squashed down in importance compared to more discriminative terms like battle.

The tf-idf model is often used for document-level tasks like deciding whether two documents are similar. We represent a document by taking the tf-idf vectors of all the words in that document and then computing the centroid of those vectors.

The centroid is the multidimensional version of the mean. The centroid of a set of vectors is a single vector that has the minimum sum of squared distances to each of the vectors in the set. So given k word vectors \(w_1, w_2, \dots , w_k\) the centroid document vector d is:

\[ d = \frac{w_1 + w_2 + \dots + w_k}{k}\]

Given two documents we compute their document vectors \(d_1\) and \(d_2\) and then estimate their similarity by computing \(\cos (d_1,d_2)\). This is used widely in information retrieval, plagiarism detection, recommendation systems, and digital humanities tasks.
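A toy sketch of the centroid-plus-cosine pipeline, using made-up 2-dimensional vectors rather than real tf-idf weights (all names are mine):

```python
import math

def centroid(vectors):
    """Mean of a list of equal-length word vectors."""
    k = len(vectors)
    return [sum(dims) / k for dims in zip(*vectors)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Two short "documents" as sets of toy word vectors:
d1 = centroid([[1.0, 0.0], [0.0, 1.0]])   # -> [0.5, 0.5]
d2 = centroid([[2.0, 2.0], [4.0, 4.0]])   # -> [3.0, 3.0]
# Same direction, so the cosine is ~1 despite different magnitudes:
print(cosine(d1, d2))
```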

Pointwise Mutual Information (PMI) Weighting

An alternative weighting function, used for term-term matrices, is Positive Pointwise Mutual Information (PPMI). It is based on the intuition that the best way to weigh the association between two words is to ask how much more the two words co-occur in the corpus than we would have expected a priori by chance.

The concept of Pointwise Mutual Information is one of the most important in NLP. It is a measure of how often two events x and y occur, compared with what we would expect if they were independent:

\[I(x,y) = \log_2 \frac{P(x,y)}{P(x)P(y)}\]

The pointwise mutual information between a target word w and a context word c is thus:

\[\textrm{PMI}(w,c) = \log_2 \frac{P(w,c)}{P(w)P(c)}\]

The numerator tells us how often we observed the two words together (estimating the probability by maximum likelihood). The denominator tells us how often we would expect the two words to co-occur if they occurred independently. The ratio thus estimates how much more the two words co-occur than chance would predict, which makes it a very useful tool for finding associated words.

PMI values range from \(-\infty\) to \(+\infty\). But negative scores are unreliable unless the corpora are enormous, so we tend to replace all negative scores with 0, giving Positive Pointwise Mutual Information (PPMI).

So if we start with a co-occurrence matrix F with W rows (words) and C columns (contexts), where \(f_{ij}\) gives the number of times word \(w_i\) occurs in context \(c_j\), we can turn it into a PPMI matrix, where \(\textrm{ppmi}_{ij}\) gives the PPMI value of word \(w_i\) in context \(c_j\), as follows:

\[p_{ij} = \frac{f_{ij}}{\sum^W_{i=1}\sum^C_{j=1}f_{ij}}\]

\[p_{i*} = \frac{\sum^C_{j=1} f_{ij}}{\sum^W_{i=1}\sum^C_{j=1}f_{ij}}\]

\[p_{*j} = \frac{\sum^W_{i=1} f_{ij}}{\sum^W_{i=1}\sum^C_{j=1}f_{ij}}\]

\[\textrm{PPMI}_{ij} = \textrm{max} \left( \log_2 \frac{p_{ij}}{p_{i*}p_{*j}},0 \right)\]

Where \(p_{i*}\) is the probability of the target word, \(p_{*j}\) the probability of the context word, and \(p_{ij}\) the joint probability of the target word and the context word occurring together.
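The equations above can be sketched directly over a list-of-lists count matrix (a minimal implementation of my own, not the book’s code):

```python
import math

def ppmi_matrix(F):
    """Turn a count matrix F (rows = words, cols = contexts) into PPMI weights."""
    total = sum(sum(row) for row in F)
    p_w = [sum(row) / total for row in F]         # p_{i*}, row marginals
    p_c = [sum(col) / total for col in zip(*F)]   # p_{*j}, column marginals
    ppmi = []
    for i, row in enumerate(F):
        ppmi.append([
            # Clamp negative PMI to 0; zero counts get 0 directly,
            # since log2(0) is undefined.
            max(math.log2((f / total) / (p_w[i] * p_c[j])), 0) if f > 0 else 0.0
            for j, f in enumerate(row)
        ])
    return ppmi
```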

For example, with the following co-occurrence matrix:

```
                 aardvark   computer   data   result   pie   sugar   count (w)
cherry           0          2          8      9        442   25      486
strawberry       0          0          0      1        60    19      80
digital          0          1670       1683   85       5     4       3447
information      0          3325       3982   378      5     13      7703
count (context)  0          4997       5673   473      512   61      11716
```

we can compute p(w=information) = 7703/11716 = .6575; p(c=data) = 5673/11716 = .4842; p(w=information, c=data) = 3982/11716 = .3399.

So ppmi(information, data) = log2(.3399 / (.6575 × .4842)) = .0944.
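Checking that arithmetic with the counts from the example above:

```python
import math

total = 11716
p_w = 7703 / total    # p(w=information)
p_c = 5673 / total    # p(c=data)
p_wc = 3982 / total   # p(w=information, c=data)

# PMI is the log ratio of observed to expected co-occurrence:
pmi = math.log2(p_wc / (p_w * p_c))
print(round(pmi, 4))  # -> 0.0944
```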

The issue with raw PPMI like this is that it is biased toward infrequent events: very rare words tend to get very high PMI scores. We can reduce this bias with different techniques. Laplace smoothing adds a constant k (often between 0.1 and 3) to all counts, which discounts the non-zero counts.

Alternatively, we can slightly change the computation of \(P(c)\), raising the probability of the context word to the power of \(\alpha\):

\[P_\alpha(c) = \frac{\textrm{count}(c)^\alpha}{\sum_c \textrm{count}(c)^\alpha}\]

It has been found that \(\alpha=0.75\) works well. This increases the probability assigned to rare contexts, lowering their PMI.
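A sketch of the smoothed context distribution (the helper name is mine):

```python
def smoothed_context_probs(context_counts, alpha=0.75):
    """P_alpha(c): raise each context count to the power alpha, then renormalize."""
    powered = [c ** alpha for c in context_counts]
    z = sum(powered)
    return [p / z for p in powered]

# With a common context (count 100) and a rare one (count 1), the rare
# context's smoothed probability rises above its MLE estimate of 1/101,
# which in turn lowers its PMI:
probs = smoothed_context_probs([100, 1])
print(probs)
```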

The PPMI model can be used to compute word similarity for tasks like finding word paraphrases, tracking changes in word meaning, and automatically discovering word meanings in different corpora. We can find the ten most similar words to any target word w by computing the cosine between w and each of the other V − 1 words, sorting, and taking the top 10.
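That nearest-neighbor lookup over the rows of a PPMI matrix can be sketched as follows (all names here are mine, and the matrix is a plain list of row vectors):

```python
import math

def most_similar(target, matrix, vocab, n=10):
    """Rank words by cosine similarity of their PPMI rows to `target`'s row."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb) if na and nb else 0.0

    t = matrix[vocab.index(target)]
    scores = [(w, cosine(t, matrix[i]))
              for i, w in enumerate(vocab) if w != target]
    # Sort descending by similarity and keep the top n:
    return sorted(scores, key=lambda x: x[1], reverse=True)[:n]
```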