Alex's Notes

Word Order Models (Chollet)

As presented in Chollet: Chapter 11: Deep Learning for Text

Representing individual words is relatively uncontroversial in ML. They are categorical features, values from a predefined set. They can be encoded as category vectors, or dimensions in a feature space.

More difficult is how to encode the way words are woven into sentences, ie word order.

Unlike steps in a time series, words don’t have a canonical order across languages, or even within one language. Order is clearly important, but its relationship to meaning isn’t straightforward.

How to represent word order, then, is the pivotal question from which different kinds of NLP architectures spring. The simplest approach is to discard it entirely and treat text as an unordered set of words - the bag of words models.

Alternatively you could decide that word order strictly matters, and words should be processed exactly in the order they appear. This then produces Recurrent Neural Networks.

Finally there is a hybrid approach: the Transformer architecture. Technically transformers are order-agnostic, but they inject word position information into the representations they process. This means they can look at the whole sequence at once (unlike RNNs), while retaining order information. Both RNNs and Transformers are sequence models, contrasted with bag-of-words models.

Historically, sequence models really took off after 2015, but both bag-of-words and sequence models remain relevant, and deciding between them is nuanced. Here’s a summary:

| Model | Word order awareness | Context awareness (cross-word interactions) |
| --- | --- | --- |
| Bag-of-unigrams | No | No |
| Bag-of-bigrams | Very limited | No |
| RNN | Yes | No |
| Self-attention | No | Yes |
| Transformer | Yes | Yes |

Bag-of-Words Approaches

The simplest approach is to discard word order and treat a document as a multi-set or “bag” of tokens. This means you can represent an entire text with a single vector, with each entry a presence indicator of some kind for a given word in the vocabulary.

You can use binary encoding (multi-hot) to encode a document, with a 1 indicating the presence of the corresponding token and a 0 otherwise.
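A minimal sketch of multi-hot encoding in plain Python (the vocabulary and tokens are made up for illustration; the book does this with Keras):

```python
# Toy vocabulary mapping each word to a fixed index
vocab = ["the", "cat", "sat", "on", "mat", "dog"]
index = {word: i for i, word in enumerate(vocab)}

def multi_hot(tokens, index):
    """Return a binary vector with 1 where a token occurs, 0 elsewhere."""
    vec = [0] * len(index)
    for tok in tokens:
        if tok in index:          # out-of-vocabulary tokens are simply dropped
            vec[index[tok]] = 1
    return vec

encoded = multi_hot(["the", "cat", "sat", "on", "the", "mat"], index)
# [1, 1, 1, 1, 1, 0] -- the repeated "the" still yields a single 1:
# both counts and order are discarded
```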

The simplest form of this approach just looks at unigrams (single tokens), but this is very reductive: even atomic concepts can be expressed via multiple words (“United States” is very different from “united” and “states”).

So you typically inject some word-order sensitivity back into bag-of-words models by using N-grams. This is really easy to do with Keras’ TextVectorization layer: just pass ngrams=N as a parameter on construction.
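To make the idea concrete, here is a sketch of contiguous n-gram extraction in plain Python (note that Keras’ TextVectorization with an integer ngrams=N keeps all orders up to N, i.e. unigrams as well as bigrams for N=2):

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

bigrams = ngrams(["united", "states", "of", "america"], 2)
# ['united states', 'states of', 'of america']
```

Treating “united states” as a single vocabulary entry is what lets a bag-of-bigrams model distinguish it from the separate words “united” and “states”.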

This notably improves performance when switching to bigrams on the example problem in the book.

Beyond binary encoding, we may want to look at term frequency. TextVectorization layers take an output_mode argument we can use to specify the encoding, eg "multi_hot", "count", or "tf_idf".

count produces raw term frequencies, but these are dominated by ubiquitous words like ‘the’, ‘a’, etc. We need to normalize.

Why don’t we just normalize as we do elsewhere, subtracting the mean and dividing by the standard deviation? Because that would wreck our sparsity. In text, most feature vectors consist mainly of zeros. This is great, as it drastically improves computation speed and reduces overfitting. Normalizing the usual way would destroy this property, so we use TF-IDF instead.
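A minimal sketch of the TF-IDF idea in plain Python. The exact formula varies by implementation (Keras’ TextVectorization uses its own smoothed variant); this uses a common smoothed form, and the toy corpus is invented for illustration:

```python
import math

def tf_idf(term, doc, corpus):
    """Raw term frequency, scaled down by how common the term is
    across the corpus (docs are lists of tokens)."""
    tf = doc.count(term)
    df = sum(1 for d in corpus if term in d)       # document frequency
    idf = math.log((1 + len(corpus)) / (1 + df))   # smoothed idf
    return tf * idf

corpus = [
    ["the", "movie", "was", "great"],
    ["the", "film", "was", "dull"],
    ["the", "great", "film"],
]
# "the" appears in every document, so its idf -- and hence its weight --
# drops to zero, while "great" keeps a positive weight
```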

TF-IDF doesn’t always help, but it’s always worth trying: you’ll often see a percentage-point increase in accuracy with it.

Sequence Models

We see from the introduction of bigrams that order matters. But remember that the history of deep learning has been a move away from manual feature engineering, toward letting the model find the features it needs in raw data.

This is what sequence models are all about. To implement a sequence model, you start by representing your input samples as sequences of integer indices. Then you map each integer to a vector to obtain vector sequences. Finally you feed that sequence of vectors into a stack of layers that could cross-correlate features from adjacent vectors. Options include 1D convnets, RNNs, or Transformers.

1D convnets were never that popular, though Chollet claims they can produce performance as good as RNNs at less computational cost.

p. 328 walks through a first example with one-hot vectors. Each integer is mapped to a one-hot vector corresponding to its word. Each sample is capped at 600 words, and the vocabulary is 20k, so the input sample is size (600, 20000). That’s 12 million floats for a single document.
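The arithmetic behind that figure:

```python
max_length = 600       # tokens per document (truncated or padded)
vocab_size = 20_000    # one-hot dimensionality = vocabulary size
floats_per_sample = max_length * vocab_size
# 12,000,000 floats for a single document
```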

This is really slow, and doesn’t perform well on the sample task.

Word Embeddings

The problem with such one-hot vectors is that they bake in a fundamental assumption about the structure of the feature space - that the different tokens are independent from each other. One-hot vectors are all orthogonal to each other.

But words form a structured space. Words like “movie” and “film” might be virtually interchangeable in many sentences. So their representation vectors should not be orthogonal, they should be almost the same vector.

In other words:

the geometric relationship between two word vectors should reflect the semantic relationship between these words… you would expect the geometric distance (such as the cosine distance or L2 distance) between any two word vectors to relate to the “semantic distance” between the associated words. Words that mean different things should lie far away from each other, whereas related words should be closer.

Word embeddings are vector representations of words that achieve exactly this: they map human language to a structured geometric space. Similar words get embedded in close locations, and specific directions in the embedding space are meaningful.
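A small sketch of the geometric idea using cosine similarity. The 3-dimensional vectors here are made up to illustrate the point, not actual learned embeddings:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Hypothetical embeddings: "movie" and "film" point roughly the same way,
# an unrelated word points elsewhere.
movie     = [0.9, 0.1, 0.0]
film      = [0.8, 0.2, 0.0]
unrelated = [0.0, 0.1, 0.9]

cosine_similarity(movie, film)       # close to 1: near-synonyms
cosine_similarity(movie, unrelated)  # close to 0: nearly orthogonal
```

Contrast this with one-hot vectors, where every pair of distinct words has cosine similarity exactly 0.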

Vectors produced by one-hot encoding are binary, sparse, and very high-dimensional (often 20k or so, ie the number of words in the vocab). By contrast word embedding vectors are dense, low-dimensional floating point vectors, typically 256, 512, or 1024 dimensions.

There are two ways to obtain word embeddings. You could learn them jointly with the main task you care about. Here you’d start with random word vectors and learn the word embeddings in the same way you learn weights in a network.

Alternatively you load into your model pretrained word embeddings, ie embeddings learned using a different ML task to the one you’re solving.

Is there some Platonic ideal word-embedding space that would perfectly map human language and could be used for all tasks? Possibly, but we’re nowhere near that and it seems doubtful. There is no such thing as one human language, there are many languages that are not isomorphic, language reflects context and culture.

What makes a good word embedding space depends on your task. Good word embeddings for movie-review sentiment analysis are likely not that helpful for legal document classification. So it’s reasonable to expect to learn new embedding spaces for a new task.

Keras makes this easy, we learn the weights of an Embedding layer. For more detail see the docs. It takes at least two parameters on instantiation, the number of possible tokens, and the dimensionality of the output embeddings.

The Embedding layer is best understood as a dictionary that maps integer indices (representing tokens) to dense vectors. It takes integers as an input, looks up the integers in an internal dictionary, and returns the associated vectors.

It’s effectively a dictionary lookup: Word index -> Embedding layer -> Corresponding word vector

The layer takes as input a rank-2 tensor of integers of shape (batch_size, sequence_length), where each row is one sequence of integer indices, and returns a rank-3 floating-point tensor of shape (batch_size, sequence_length, embedding_dimensionality).
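The shape transformation can be sketched in plain Python as a table lookup. This mimics what an untrained Embedding layer does (random vectors at initialization); the sizes are arbitrary:

```python
import random

vocab_size, embed_dim = 10, 4
# One random dense vector per token index, as at initialization time
table = [[random.random() for _ in range(embed_dim)] for _ in range(vocab_size)]

def embed(batch, table):
    """Replace each integer index with its vector:
    (batch_size, seq_len) -> (batch_size, seq_len, embed_dim)."""
    return [[table[idx] for idx in seq] for seq in batch]

batch = [[3, 1, 4], [1, 5, 9]]   # shape (2, 3)
out = embed(batch, table)        # shape (2, 3, 4)
```

The real layer differs in one key way: the table entries are trainable weights, updated by backpropagation.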

When you instantiate an Embedding layer its weights are random, like any other layer, and it adjusts them during training via backpropagation, gradually structuring the space into something that the downstream model can exploit.

Once trained, the embedding space will show a lot of structure.

Masking

When using word embeddings it’s crucial to understand masking. Let’s say you’ve padded a document of 200 tokens to the maximum size of 600. That means your RNN, if you use one, will be processing hundreds of padding tokens before it’s done. It’s easy for the information it built up to be lost in the process.

This is where masking comes in. It’s not active by default in Embedding layers, but can be activated by mask_zero=True on creation.

The mask itself is a tensor of ones and zeros of shape (batch_size, sequence_length), where the entry mask[i, t] indicates whether time-step t of sample i should be attended to (1) or skipped (0).

You almost never need to manage masks by hand, Keras passes on the mask to every layer that can process it. The mask will be used by RNN layers to skip masked steps. If the model returns a sequence, the mask will be used by the loss function to skip masked steps in the output.
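A sketch of the mask an Embedding layer with mask_zero=True would derive, in plain Python (padded sequences invented for illustration; index 0 is the padding token):

```python
def compute_mask(batch, pad_id=0):
    """mask[i][t] is True where sample i's step t is a real token,
    False where it is padding."""
    return [[tok != pad_id for tok in seq] for seq in batch]

padded = [
    [7, 2, 9, 0, 0],   # real length 3, padded to 5
    [4, 0, 0, 0, 0],   # real length 1, padded to 5
]
mask = compute_mask(padded)
# [[True, True, True, False, False],
#  [True, False, False, False, False]]
```

Note this is why mask_zero=True reserves index 0: no real token may use it.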

Pretrained Embeddings

If you don’t have much training data available, you might not be able to use your data alone to learn an appropriate embedding.

In that case you can try loading embedding vectors from a precomputed embedding space that captures generic aspects of language structure. The rationale is similar to using pretrained convnets in image processing.

In these cases word embeddings are generally computed using word-occurrence statistics. Word2Vec and GloVe are well-known examples.

You can download these and load them into an Embedding layer. You then freeze the weights to prevent them being updated during training. See p. 335 for a walkthrough of doing this with GloVe.

Pretrained embeddings aren’t always helpful. If you have enough data it can be better to learn a specialized embedding space from scratch. But they can help if you have smaller datasets.