Manning et al: Chapter 02 - The term vocabulary and postings lists
Metadata
Title: The term vocabulary and postings lists
Number: 2
Referenced in cm3060 Topic 02: Basic Processing
Coverage
The chapter starts with a short intro to extracting character sequences from files and indexing granularity. It then examines tokenization and normalization in the context of building an inverted index. Finally it turns to data structures that allow faster querying. These notes just cover the tokenization and normalization sections, which is all that is in scope for the module.
Tokenization
Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation.
It helps to keep a few concepts distinct:
A token is an instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit for processing.
A type is the class of all tokens containing the same character sequence.
A term is a (perhaps normalized) type that is included in an IR system’s dictionary. The set of index terms could be entirely distinct from the tokens in the document (eg a taxonomy) but in practice they are strongly related.
The main issue in tokenization is deciding what the correct tokens to use are. In the simplest cases you can chop on whitespace and throw away punctuation. But what about terms like “O’Neill” and “can’t”? What is the desired tokenization? A simple strategy is to split on all nonalphanumeric characters, but that would leave the tokens can and t, which doesn’t seem right. Whatever strategy is adopted, it’s important to tokenize search queries and the document text the same way or you’ll get false negatives.
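As a minimal sketch of the problem above, the crude "split on all nonalphanumeric characters" strategy can be written in a few lines (the function name and rule are my own, purely for illustration):

```python
import re

def naive_tokenize(text):
    """Split on any run of non-alphanumeric characters (a deliberately crude rule)."""
    return [tok for tok in re.split(r"[^A-Za-z0-9]+", text) if tok]

print(naive_tokenize("O'Neill can't"))  # ['O', 'Neill', 'can', 't']
```

The output shows exactly the problem: “can’t” falls apart into can and t.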
For most languages and domains in them there are special character sequences that we want to recognize as terms, like C++ or C# in programming, email addresses, web URLs, IP addresses, package tracking numbers, etc. It is possible to omit these, but doing so has a high cost for information retrieval and other purposes.
Handling hyphenation automatically can be challenging: hyphens are used to split vowels (co-operative), to join nouns as names (Hewlett-Packard), or as a copy-editing technique to show grouping (hold-him-back-maneuver). These cases are often handled by heuristic rules.
Splitting on white space can also cause issues, eg with names (San Francisco), foreign phrases (au fait), compounds which are sometimes written as one, sometimes two words (white space), phone numbers (212 2211 21), dates (Mar 21 1921). This can be seriously annoying for users (eg searching for ‘York University’ shows lots of results for ‘New York University’).
Some Boolean search engines try to train users to use hyphens wherever possible and then generalize to cover the three cases, eg searching for “lower-case” would search for lower-case OR "lower case" OR lowercase, but this relies on users…
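The three-way generalization described above can be sketched as a tiny query-rewriting step (a hypothetical helper of my own, not anything from the book):

```python
def expand_hyphenated(term):
    """Expand a hyphenated query term into the three variants: hyphenated,
    space-separated, and joined. The result is treated as a disjunction (OR)."""
    if "-" not in term:
        return [term]
    spaced = term.replace("-", " ")
    joined = term.replace("-", "")
    return [term, spaced, joined]

print(expand_hyphenated("lower-case"))  # ['lower-case', 'lower case', 'lowercase']
```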
Stop Words
The general strategy for determining a stop list is to sort the terms in a corpus by collection frequency and take the most frequent ones, usually hand-filtered for their semantic content relative to the domain of the documents being indexed. The terms on the resulting stop list are then discarded during indexing (or other analysis).
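The collection-frequency strategy just described can be sketched like this (the hand-filtering step is omitted; function and variable names are my own):

```python
from collections import Counter

def build_stop_list(tokenized_docs, k):
    """Take the k most frequent terms by collection frequency as a candidate
    stop list. In practice this list would then be hand-filtered."""
    freq = Counter(t for doc in tokenized_docs for t in doc)
    return [term for term, _ in freq.most_common(k)]

docs = [["the", "cat", "sat", "on", "the", "mat"],
        ["the", "dog", "ate", "the", "cat"]]
print(build_stop_list(docs, 1))  # ['the']
```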
Using a stop list significantly reduces the number of postings a system has to store, and much of the time not indexing stop words does little harm. However, phrase searches can suffer: queries like “President of the United States” or “flights to London” will do badly if the stop words are removed. Song titles are a common problem too.
As such the general trend in IR systems over time has been from standard use of large stop lists (200-300 terms), to small stop lists (7-12 terms), to no stop list at all (web search engines do not use stop lists generally). Some of the design of modern IR systems has focused precisely on how we can exploit language statistics to cope with common words in better ways. These techniques are covered later in the book: compression to reduce storage costs; standard term weighting to reduce the impact of very common words; impact-sorted indexes that can terminate scanning a postings list early when weights get small, saving processing time.
So for most modern IR systems, the additional cost of including stop words is not that high - either in terms of index size or in terms of query processing time.
Normalization (equivalence classes for terms)
There are many cases where two sequences of characters are not quite the same, but you want them to be treated equivalently (for example in matching search queries). For instance, you likely want “USA” to be treated the same as “U.S.A.”
Token normalization is the process of canonicalizing tokens so that matches occur despite superficial differences in the character sequences of the tokens. The most standard way to normalize is to implicitly create equivalence classes, which are normally named after one member of the set.
These classes are often created implicitly by rules that remove characters like hyphens, rather than being fully calculated in advance; the classes then emerge as a result of applying the rules. In practice it is only easy to write rules of this kind that remove characters.
An alternative is to maintain relations between unnormalized tokens. This can be extended to synonyms (like car, automobile). There are two ways to do this. The usual way is to index unnormalized tokens and maintain a query expansion list of multiple vocabulary entries to consider for a particular query term; the query term is then treated as a disjunction over the list. Alternatively, the expansion is done at index construction time, so that ‘automobile’ would be indexed under ‘car’ as well. The obvious trade-off is query-time performance vs index storage cost.
The advantage of query expansion is that it can be asymmetrical. For example “Windows” may just match on “Windows”, whereas “windows” may match on “Windows, window, windows”.
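The asymmetry can be sketched with a small expansion table (the table contents mirror the Windows example above; the structure is my own illustration):

```python
# Hypothetical expansion table: query term -> index terms to OR together.
# Note the asymmetry: the capitalized query expands narrowly, the
# lower-case query expands broadly.
EXPANSIONS = {
    "Windows": ["Windows"],
    "windows": ["Windows", "windows", "window"],
}

def expand_query_term(term):
    """Return the disjunction of index terms to search for a query term."""
    return EXPANSIONS.get(term, [term])

print(expand_query_term("windows"))  # ['Windows', 'windows', 'window']
print(expand_query_term("Windows"))  # ['Windows']
```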
Too much normalization can lead to bad results. Mindlessly removing full stops (eg to normalize U.S.A. to USA) could cause issues with eg “C.A.T.” now matching “cat”. The following are common cases of normalization:
Accents and diacritics. In English accents and diacritics have marginal status and can typically be removed. Even in other languages where they matter more, users may be lazy or in a hurry and omit them, so we often want to normalize by removing them.
Capitalization/case-folding. Case-folding is a common strategy that reduces all characters to lower case. Often this is very helpful, avoiding issues with tokens at the start of sentences, or with web searchers who type everything in lower case. But many proper names are also common nouns, distinguished only by case (eg General Motors, Bush, Black, CAT). Lowercasing everything often remains the most practical solution (because users…) but there are alternatives, eg truecasing based on ML sequence models, or heuristics like leaving mid-sentence capitals intact.
Spelling differences, eg color vs. colour.
Dates. We may want to standardize different date formats (but also note US/UK differences there…).
Multilingualism. Corpora may be multilingual, which adds to the fun: what about when the same entity is spelled differently in different languages, for example?
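The mid-sentence-capitals heuristic mentioned under case-folding above can be sketched as follows (a rough sketch under the assumption that sentence boundaries are already marked by punctuation tokens):

```python
def heuristic_case_fold(tokens):
    """Lower-case only sentence-initial tokens; leave mid-sentence capitals
    intact, on the rough assumption that they mark proper names."""
    out = []
    sentence_start = True
    for tok in tokens:
        out.append(tok.lower() if sentence_start else tok)
        sentence_start = tok in {".", "!", "?"}
    return out

print(heuristic_case_fold(["The", "CEO", "of", "General", "Motors", "."]))
# ['the', 'CEO', 'of', 'General', 'Motors', '.']
```

Note how “The” at the start of the sentence is folded, while “General Motors” survives intact.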
Stemming and Lemmatization
For grammatical reasons, documents are going to use different forms of a word (like organize, organizes, organizing). There are also families of derivationally related words with similar meanings (democratic, democracy, democratization). In many contexts, like IR, these should be treated as equivalent.
The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form… Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes. Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma. (p. 31)
Stemming most commonly collapses derivationally related words, while lemmatization commonly only collapses the different inflectional forms of a lemma.
The chapter introduces the Porter stemmer, explaining it consists of five phases of word reductions applied sequentially. Within each phase there are various conventions to select rules (eg select the rule that applies to the longest suffix). Many of the rules use the concept of the measure of a word, checking the number of syllables to see if a word is long enough to regard the matching portion as a genuine suffix, not the stem.
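The flavor of a single Porter phase, and of the measure, can be sketched like this. This is not the real Porter stemmer, just an illustration of its two key ideas: within a phase, apply the rule matching the longest suffix; and use the measure m (roughly, the number of vowel-consonant sequences) to decide whether a word is long enough for a match to be a genuine suffix:

```python
import re

def measure(stem):
    """Porter's measure m: the number of vowel-consonant sequences in the stem.
    Simplified here to treat [aeiou] as vowels; real Porter handles 'y' contextually."""
    return len(re.findall(r"[aeiou]+[^aeiou]+", stem))

def step_1a(word):
    """A sketch of one phase: rules are ordered so the longest matching suffix wins."""
    for suffix, replacement in [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

print(step_1a("caresses"))  # 'caress'
print(step_1a("ponies"))    # 'poni'
print(measure("tree"))      # 0 -- too short to treat its ending as a suffix
```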
Stemming increases recall while harming precision. (p. 32)
Generally, doing either stemming or lemmatization produces at best modest benefits for IR: it helps a lot for some queries but harms others. Moving to a lemmatizer often does not help significantly, because the issues are often pragmatic ones of word use rather than formal issues of morphology. Consider “operating system” versus a sentence containing “operate a system”… The situation is somewhat different in languages other than English, which have richer morphology.