Alex's Notes

Murphy: Chapter 01: Introduction

Metadata

Core Approach

Starts from the Mitchell definition of ML:

A computer program is said to learn from experience E with respect to some class of tasks T, and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E. (Mitchell, 97)

In this book, we will cover the most common types of ML, but from a probabilistic perspective. Roughly speaking, this means that we treat all unknown quantities (e.g., predictions about the future value of some quantity of interest, such as tomorrow’s temperature, or the parameters of some model) as random variables, that are endowed with probability distributions which describe a weighted set of possible values the variable may have.

The book takes that approach for two reasons. 1) It is optimal to decision making under uncertainty. 2) Probabilistic modelling is key to most areas of science, so provides a unifying framework:

Almost all of machine learning can be viewed in probabilistic terms, making probabilistic thinking fundamental. It is, of course, not the only view. But it is through this view that we can connect what we do in machine learning to every other computational science, whether that be in stochastic optimisation, control theory, operations research, econometrics, information theory, statistical physics or bio-statistics. For this reason alone, mastery of probabilistic thinking is essential. (Shakir Mohamed, DeepMind researcher, cited in Murphy).

Supervised Learning

SL is the most common form of ML, where the task T is to learn a mapping f from \(x \in \chi\) to output \(y \in \gamma\).

Inputs x are also called features, covariates, or predictors. Often a fixed-dimensional vector of numbers, so \(\chi = \mathbb{R}^D\) where D is the dimensionality of the vector.

Output y is known as the label, target, or response.

Experience E is a set of N input-output pairs, known as the training set. N is called the sample size. Formally: \(E = \{ (x_n,y_n)\}^N_{n=1}\)

The performance measure P depends on the type of output we are predicting.

Classification

In classification problems, the output space is a set of C unordered and mutuall exclusive labels known as classes: \(\gamma = \{1,2,\dots,C\}\). Predicting the class label given an input is also called pattern recognition. Binary classification is when there are just two classes (often denoted with \(y \in \{0,1\}\) or \(y \in \{-1,1\}\).

Iris classification is used to illustrate the task. We might approach this from the image, or from engineered features. When we have a small dataset of features, we typically store them in a \(N \times D\) matrix, where each row is an example, each column a feature. This is a design matrix.

When inputs are of variable length (like word sequences) they are usually stored not in tabular form. But such data is often coverted to a fixed-sized feature representation (called featurization), implicitly creating a design matrix for processing.

Before diving into the problem, we usually start with exploratory analysis, looking for obvious patterns that might help select a method, or obvious data problems. For tabular data we might use a pair plot. For higher-dimensional data, we might perform dimensionality reduction first, and then visualize.

We might see some clear decision rules, by detecting a decision boundary, by recursively dividing the space following such boundaries, we end up with a decision tree. We can represent the tree by storing, for each node, the feature index it uses and the threshold value. These paramaters are denoted by \(\theta\).

We can measure the performance on this task in terms of the misclassification rate. First we define the binary indicator function \(\mathbb{I}(e)\) which returns 1 iff the condition e is true:

\[\mathbb{I}(e) = \begin{cases} 1\;if\;e\;is\;true \\ 0 \;if\;e\;is\;false \end{cases} \]

Then the misclassification rate is:

\[\mathcal{L}(\theta) \triangleq \frac{1}{N} \sum^N_{n=1} \mathbb{I}(y_n \neq f(x_n;\theta))\]

If all errors are equal, otherwise we need an asymmetric loss function. In that case the empirical risk becomes the average loss of the predictor on the training set:

\[\mathcal{L}(\theta) \triangleq \frac{1}{N} \sum^N_{n=1} \mathcal{l}(y_n,f(x_n;\theta))\]

One way of defining training or model fitting is finding a setting of the parameters that minimizes the empirical risk. This is empirical risk minimization. But we want to minimize the expected loss on future data, ie to generalize, rather than just fit the training set.

The book then delves into probability, looking at issues like model uncertainty, and maximum likelihood estimation. As the module approach isn’t probabilistic, these topics aren’t covered here and will be added later.

Regression

If we want to predict a real valued quantity \(y \in \mathbb{R}\) instead of a class label this is regression. It is similar to classification, except we need a different loss function. The most common choice is quadratic loss or \(\mathcal{l}_2\) loss:

\(\mathcal{l}_2(y,\hat{y}) = (y - \hat{y})^2\)

This penalizes large residuals (misses) more than small ones.

The empirical risk when using quadratic loss is equal to the mean squared error or MSE:

\[MSE(\theta) = \frac{1}{N}\sum^N_{n=1}(y_n - f(x_n;\theta))^2\]

Perhaps the simplest regression model is linear regression of the form:

\[f(x;\theta)=b+wx\]

Where w is the slope, b is the offset, and \(\theta = (w,b)\) are the model’s parameters. We can adjust theta until we find the least squares solution. This can be extended for multiple features in multiple linear regression as follows:

\[f(x;\theta)=b+ w_1x_1 + w_2x_2 + \dots + w_Dx_D = b+ w^Tx\]

Linear regression may not fit the data, we might be able to improve by using a polynomial regression of degree D. This is a simple example of feature engineering, which has the form:

\[f(x;w) = w^T\theta(x)\]

Where \(\theta(x)\) is a feature vector derived from the input: \(\theta(x) = [1,x,x^2,\dots,x^D]\).

We can create more powerful models by learning to do such feature extraction automatically, decomposing the feature extractor into a stack of L nested functions. This is the key idea behind deep neural networks.

Unsupervised Learning

Although supervised learning is very useful, it is arguably just “glorified curve fitting” (Pearl, 18). It is maybe more interesting to try to ‘make sense’ of the data without any labels. This is called unsupervised learning.

From a probabilistic perspective, we can view the task of unsupervised learning as fitting an unconditional model of the form \(p(x)\), which can generate new data x, whereas supervised learning involves fitting a conditional model, \(p(y|x)\), which specifies (a distribution over) outputs given inputs.

Unsupervised learning avoids the costs of labelling, and the need to partition the world into often arbitrary categories. Humans will often disagree on such categorization issues, which means they are often not well defined. It is therefore unreasonable to expect machines to learn such mappings.

When we’re learning to see, nobody’s telling us what the right answers are - we just look. Every so often, your mother says “that’s a dog”, but that’s very little information. You’d be lucky if you got a few bits of information - even one bit per second - that way. The brain’s visual system has \(10^{14}\) neural connections. And you only live for \(10^9\) seconds. So it’s no use learning one bit per second. You need more like \(10^5\) bits per second. And there’s only one place you can get that much information: from the input itself. (Hinton, 1996)

A simple example of unsupervised learning is finding clusters in data, the goal being to partition the data into regions which contain similar points.

When dealing with high dimensional data, it is often useful to reduce the dimensionality by projecting it to a lower dimensional subspace which capture’s the “essence” of the data.

A recently popular approach to unsupervised learning is known as self-supervised learning, in which we create proxy supervised tasks from unlabeled data, for example masking out words in a sentence and trying to predict them given their context. The hope is that the resulting predictor \(\hat{x}_1 = f(x_2;\theta)\), where \(x_2\) is the observed input and \(\hat{x}\) is the predicted output, will learn useful features from the data that can then be used downstream in supervised tasks. This avoids having to learn the ‘true latent factors’ behind the observed data.

It is hard to evaluate the quality of an unsupervised learning method. A common method of doing so is to measure the probability assigned by teh model to unseen examples, treating the problem as one of density estimation. The idea is that the model should not be surprised by actual data samples, ie it should assign them high probability.

An alternative metric is to use the learned unsupervised representations in a downstream supervised task. If the unsupervised method has discovered something useful, it should be possible to use these patterns to perform supervised learning with less data than working with original features. Here we are trying to improve sample efficiency by learning a good representation.

Reinforcement Learning

Reinforcement learning is a third type of ML.

In this class of problems, the system or agent has to learn how to interact with its environment. This can be encoded by means of a policy: \(a = \pi(x)\), which specifies which action to take in response to each possible input x (derived from the environment state).

The main difference from supervised learning is that the agent is not told which action to take, instead it receives an occasional reward signal in response to its actions. It’s like learning with a critic, rather than learning with a teacher.

RL has grown in popularity as it has broad applicability (the reward signal can be any metric), but it’s really hard to make it work. A key difficulty is the reward signal may be rare, and it may be unclear to the agent which of its actions were responsible for getting the reward. There are ways to compensate for this, like expert demonstrations.

Data

ML fits models to data with algorithms. Books focus on the modelling and algorithms, but the nature and quality of the data makes a massive contribution to the success of any learned model. Some datasets are introduced, image data sets like ImageNet and MNIST (the handwritten digits one).

Text Datasets and Problems

For text classification a common dataset is the IMDB movie review dataset which has 25k labeled examples for training, and 25k for testing, with a binary sentiment analysis label (+ve or -ve).

For machine translation we have the Canadian parliament dataset (Eng-Fr) and the European Union Europarl dataset (and Eurlex…) The WMT dataset is a subset of the EU dataset consistsing of Eng-Ger pairs and is widely used.

More generally, translation is just one instance of the broader class of sequence to sequence mapping, called a seq2seq model, a form of high-dimensional classification. This way of framing the problem is very general and includes document summarization and question answering. A dataset used for this is SQuAD which consists of Text, Questions, and Answers to be used for sentence pair tagging.

Language modelling is a grand term for creating unconditional generative models of text sequences, just from input sentences. It is therefore a form of unsupervised learning. If the model generates output in response to an input, we can regard it as a conditional generative model.

Preprocessing Categorical Data

The chapter includes a summary of preprocessing steps to convert it into the vector form assumed by many models (ie vectors in \(\mathbb{R}^D\).

Categorical features need to be converted to a numerical scale so that computing weighted combinations of the inputs makes sense. The standard way is to use a one-hot encoding or dummy encoding. If there are, say 3 variables (red, green, blue) the corresopnding vectors would be: \(one-hot(red) = [1,0,0]\), \(one-hot(green) = [0,1,0]\), \(one-hot(blue) = [0,0,1]\).

But we may want to capture the interaction effects between features. To do this we can compute explicit feature crosses, the use of such crosses converts the original dataset into a wide format with more columns.

Preprocessing Text

Working with text data has its own challenges. Documents can have variable length, so they are not fixed-length feature vectors, as assumed by many models. Words are categorical variables with many possible values (equal to the vocabulary size), so one-hot encodings will be very high dimensional, with no natural notion of similarity. We may encounter new words at test time that haven’t been seen in training.

A simple approach to dealing with variable length documents is to treat them as a bag of words, ignoring word order. To convert this to a vector from a fixed input space, we first map each word to a token from a vocabulary. To reduce the token counts we might do some basic processing like stop word removal, and/or stemming.

Let $xnt be the token at location t in the n‘th document. If there are D unique tokens in the vocabulary, then we represent the n‘th document as a D-dimensional vector, \(\tilde{x}_n\) where \(\tilde{x}_{nv}\) is the number of times that word v occurs in document n:

\[\tilde{x}_{nv} = \sum^T_{t=1} \mathbb{I}(x_{nt} = v)\]

Where T is the length of document D. Now we can interpret documents as vectors in \(\mathbb{R}^D\). This is the vector space model of text. We tend to represent input data in \(D \times N\) term frequency matrices where \(TF_{ij}\) is the frequency of term i in document j.

To reduce the impact of words that feature many times across all documents we often use TF-IDF instead of raw term frequency. Here IDF is inverse document frequency defined as:

\[IDF_i \triangleq log \frac{N}{1+ DF_i}\]

where \(DF_i\) is the number of documents that include term i. and TF-IDF is defined as:

\[TFIDF_{ij} = log(TF{ij} + 1) \times IDF_i\]

Each row is often normalized as well. This is used a lot in ML tasks involving text.

TF-IDF still doesn’t deal with the issue that semantically similar words may be further apart in vector space than semantically dissimilar ones. This invalidates the assumptions made in logistic regression models, for example. The standard way to solve this is to use word embeddings where we map each sparse one-hot vector \(x_{nt} \in \{0,1\}^V\) to a lower-dimensional dense vector, \(e_nt \in \mathbb{R}^K\) using \(e_{nt} = Ex_{nt}\) where E is learned such that semanticall similar words are placed nearby. There are many ways to learn such embeddings.

If we have an embedding matrix we can represent a variable-length document as a bag of word embeddings, which can be converted to a fixed length vector by summing (or averaging the embeddings):

\[\bar{e}_n = \sum^T_{t=1} e_{nt} = E\tilde{x}_n\]

Where \(\tilde{x}_n\) is the bag of words representation. This can be used in a logistic regression classifier. We often use a pre-trained word embedding matrix E.

If we encounter new words during testing, which is inevitable, we might replace them with a special symbol for unknown word, or else look at subword units or wordpieces using byte-pair encoding which reates new symbols to represent common substrings.