Alex's Notes

Mitchell ML Chapter 01

Metadata

Summary

Introduces the definition and core ideas of machine learning, particularly function approximation, mainly through an example of a checkers learning program.

Definition of Machine Learning

We can think of machine learning broadly as any computer program that improves its performance at some task through experience:

Definition: A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

Some examples of well-defined ML problems:

  • Learning Checkers. Task T: playing checkers. Performance measure P: % of games won. Experience E: playing practice games against itself.

  • Handwriting Recognition. Task T: recognizing and classifying handwritten words in images. Performance measure P: % of words correctly identified. Experience E: a database of handwritten words with given classifications.

  • Learning to Drive. Task T: driving on public highways using vision sensors. Performance measure P: average distance driven before an error (as judged by a human overseer). Experience E: a sequence of images and steering commands recorded while observing a human driver.

Designing a System

Uses the example of designing a system to play checkers. We face many design decisions.

Does the training experience provide direct or indirect feedback? The system might learn from direct training examples, consisting of individual board states and the correct moves in each state. Or indirect information consisting of the move sequences and outcomes of games played. In the latter case the learner faces the problem of credit assignment, or determining the degree to which an individual move deserves credit or blame for the final outcome.

Is the learner in control of the training examples? Does the learner rely on the trainer to provide examples, or can the learner propose states and ask for the correct label. Or else does the learner have total control, like when it plays against itself. In the latter the learner might experiment with brand new board states, or refine its skill with minor variations from prior games.

How well does the experience represent the distribution of examples over which the final system performance P must be measured? Learning is most reliable when the training examples follow the distribution of the test examples. Mastery of one distribution of examples will not necessarily lead to strong performance over another distribution.

Choosing a Target Function

We need to decide what type of knowledge will be learned and how it will be used by the performance program. Let’s say that we have a program that, given a board state, can generate the set of legal moves in checkers. We need to be able to choose the best move.

This is representative of a large class of tasks for which the legal moves that define a large search space are known in advance, but the best search strategy is not known. Many optimization problems fall into this class.

The most obvious choice for the type of knowledge to be learned is a program, or function, that chooses the best move from any given board state. We can call this function ChooseMove and write ChooseMove: B -> M to indicate that it maps the set B of legal board states to the set M of legal moves:

Throughout our discussion of machine learning we will find it useful to reduce the problem of improving performance P at task T to the problem of learning some particular target function… The choice of the target function will therefore be a key design choice.

It turns out that this function is hard to learn directly. Easier to learn is a function that assigns a real value to any board state, a value that increases with ‘better’ states. If the agent has this function, it can generate all legal successor board states from the current one and pick the one with the highest value. Call this new function V.

We might be tempted to define V recursively: give a board state a value of 100 if the game is won from it, 0 if drawn, and -100 if lost. Otherwise, give the state the value of the best final board state reachable from b when the game is played optimally. This is a nonoperational definition, as it is not effectively computable. We need an operational definition of the target function.
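A minimal sketch of why the recursive definition is nonoperational, substituting a trivial toy game for checkers (states are integers; from state n the legal moves lead to n + 1 and n + 2; any state of 5 or more counts as a won terminal state; draws and losses are omitted). All of these rules are invented for illustration. Even in this toy, evaluating one state forces the recursion to expand the entire game tree below it; for checkers that tree is astronomically large.

```python
# Toy illustration of the recursive (nonoperational) definition of V.
# States are integers; moves from n go to n + 1 and n + 2; any state
# >= 5 is a terminal win worth 100. This is a made-up stand-in for
# checkers, just to show the shape of the recursion.

def V(b):
    if b >= 5:                    # terminal state: game won
        return 100
    # otherwise: the value of the best state reachable under optimal play
    return max(V(succ) for succ in (b + 1, b + 2))

value_of_start = V(0)   # every line of play in this toy ends in a win
```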

Thus we have reduced the learning task in this case to the problem of discovering an operational description of the target function V. It may be very difficult in general to learn such an operational form of V perfectly. In fact we often expect learning algorithms to acquire only some approximation to the target function, and for this reason the process of learning the target function is often called function approximation. (p. 8)

The book distinguishes between \(V\), the ideal target function, and \(\hat{V}\), the approximation that is actually learned by the system.

Representing the Target Function

How do we represent this function \(\hat{V}\) in such a way that the program can learn it? There are again many choices: a giant lookup table, a collection of rules that match against features of the board state, a polynomial over predefined board features, or a neural network.

The choice of representation involves a crucial tradeoff between expressiveness (to approximate V as closely as possible) and the amount of training data required: the more expressive the representation, the more training data the program needs to choose reliably among the alternative hypotheses it can represent.

In the checkers example, we can keep it simple by choosing a linear combination of a few predefined board features: \(x_1\) and \(x_2\), the number of black and red pieces on the board; \(x_3\) and \(x_4\), the number of black and red kings; \(x_5\) and \(x_6\), the number of black pieces threatened by red and vice versa.

This gives us a linear function:

\[\hat{V}(b) = w_0 + w_1x_1 + w_2x_2 + w_3x_3 + w_4x_4 + w_5x_5 + w_6x_6\]

where \(w_0 \dots w_6\) are numerical coefficients, or weights, to be chosen by the learning algorithm.
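As a sketch, the linear evaluator can be written out directly. Feature extraction from an actual board is omitted: the feature vector and the weights below are invented for illustration.

```python
# Sketch of the linear evaluation function V_hat.
# `features` is the vector (x1..x6) already extracted from a board;
# `w` is the weight vector (w0..w6). Both are illustrative inputs.

def v_hat(features, w):
    """Linear evaluation: w0 + w1*x1 + ... + w6*x6."""
    return w[0] + sum(wi * xi for wi, xi in zip(w[1:], features))

# The winning final state from the text, scored with made-up weights:
features = [3, 0, 1, 0, 0, 0]    # x1..x6
w = [0, 1, -1, 2, -2, -1, 1]     # hypothetical weights w0..w6
value = v_hat(features, w)       # 1*3 + 2*1 = 5
```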

Choosing a Function Approximation Algorithm

To learn the target function \(\hat{V}\) we require a set of training examples, each describing a board state b and a training value \(V_{train}(b)\). E.g., a final state where black has won and red has no pieces left would be \(((x_1 = 3, x_2 = 0, x_3 = 1, x_4 = 0, x_5 = 0, x_6 = 0), +100)\). We need to derive such examples, and then adjust the weights of the function to fit them.

Estimating the Values

It is easy to assign a value to the final states of a game, but what about intermediate states? There is a simple approach that has been surprisingly successful: we set the training value \(V_{train}(b)\) to the current approximation's value for the successor state of b, i.e.:

\[V_{train}(b) \gets \hat{V}(Successor(b))\]

This may seem circular, but it works in practice because \(\hat{V}\) tends to be more accurate for states closer to the end of the game, so value estimates propagate backward from the outcome.
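A minimal sketch of this rule under the linear representation. The helper v_hat and all inputs below are illustrative assumptions, not Mitchell's code: each non-final board in a game trace is assigned the current approximation's value for its successor, and the final board gets its true outcome value.

```python
# Sketch: estimating training values from one game trace.
# Each element of `trace` is a feature vector (x1..x6); the trace is
# ordered from the opening board to the final board.

def v_hat(features, w):
    """Linear evaluator: w0 + w1*x1 + ... + w6*x6 (illustrative)."""
    return w[0] + sum(wi * xi for wi, xi in zip(w[1:], features))

def estimate_training_values(trace, w, final_value):
    """V_train(b) <- v_hat(Successor(b)); the last state gets its true value."""
    examples = []
    for i, b in enumerate(trace):
        if i + 1 < len(trace):
            examples.append((b, v_hat(trace[i + 1], w)))
        else:
            examples.append((b, final_value))   # e.g. +100 for a win
    return examples

# A two-state trace ending in a win for black, with untrained weights:
trace = [[4, 2, 0, 0, 1, 0], [3, 0, 1, 0, 0, 0]]
examples = estimate_training_values(trace, w=[0.0] * 7, final_value=100)
```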

Assigning Weights

Now we just have to choose the weights to best fit the examples. First we define best fit. For example, we could minimize the squared error E between the training values and the values predicted by the hypothesis \(\hat{V}\).

\[E = \sum_{(b,\, V_{train}(b)) \in \text{training examples}} \left(V_{train}(b) - \hat{V}(b)\right)^2\]

There are several algorithms that minimize E defined in this way. One is the Least Mean Squares (LMS) training rule. Here, for every training example, we use the current weights to calculate \(\hat{V}(b)\); then we update each weight \(w_i\) as:

\[w_i \gets w_i + \eta(V_{train}(b) - \hat{V}(b))x_i\]

So when the error is positive (i.e. the estimate is too low), each weight is increased in proportion to the value of its feature, raising the estimate and reducing the error. If \(x_i\) is 0, its weight is not altered.
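One LMS step can be sketched as follows; the learning rate eta and the example values are invented for illustration. Repeated updates on a training example drive the squared error toward zero:

```python
# Sketch of the LMS update for the linear checkers evaluator.
# One step: w_i <- w_i + eta * (V_train(b) - V_hat(b)) * x_i,
# where x_0 = 1 accounts for the bias weight w_0.

def lms_update(w, features, v_train, eta=0.1):
    x = [1.0] + list(features)                       # x0 = 1 for w0
    v_hat = sum(wi * xi for wi, xi in zip(w, x))     # current estimate
    error = v_train - v_hat
    return [wi + eta * error * xi for wi, xi in zip(w, x)]

# Repeated updates fit one training example (a won final state, +100):
w = [0.0] * 7
for _ in range(50):
    w = lms_update(w, [3, 0, 1, 0, 0, 0], 100.0)
residual = 100.0 - (w[0] + 3 * w[1] + w[3])          # error after training
```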

The Final Design

The final design can be described in terms of four modules that represent the main components of the system; these components are common across many ML systems:

  • The Performance System is the module that must solve the given performance task (eg playing checkers) by using the learned target function(s). It takes an instance of a new problem as input and produces a trace of its solution as output.

  • The Critic takes as input the history or trace of the game and produces a set of training examples of the target function.

  • The Generalizer takes as input the training examples and produces an output hypothesis that estimates the target function.

  • The Experiment Generator takes as input the current hypothesis (the currently learned function) and outputs a new problem for the Performance System to explore. Its role is to pick new practice problems that will maximize the learning rate of the overall system. In a game-playing example, it might always propose the same initial state (the standard starting board), or it might create board positions designed to explore particular regions of the state space.
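The data flow between the four modules can be sketched as a single training loop. All module bodies in the usage example below are placeholder stubs invented for illustration; only the inputs and outputs follow the text.

```python
# Sketch of how the four modules interact. Each module is passed in as
# a function; the loop only wires their inputs and outputs together.

def train(performance_system, critic, generalizer, experiment_generator,
          hypothesis, n_games):
    for _ in range(n_games):
        problem = experiment_generator(hypothesis)        # new initial board
        trace = performance_system(problem, hypothesis)   # game history
        examples = critic(trace)                          # training examples
        hypothesis = generalizer(examples, hypothesis)    # updated hypothesis
    return hypothesis

# Trivial stubs, just to exercise the data flow (not a real learner):
h = train(
    performance_system=lambda prob, h: [prob, prob + 1],  # "trace" of 2 states
    critic=lambda trace: [(b, b * 10) for b in trace],    # labeled examples
    generalizer=lambda ex, h: h + len(ex),                # "updates" hypothesis
    experiment_generator=lambda h: 0,                     # same start state
    hypothesis=0,
    n_games=3,
)
```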

Representations and the Hypothesis Space

In the checkers example the agent had to search through a very large space of possible hypotheses to determine one that best fits the observed data and any prior knowledge held by the learner. A central part of ML is studying the algorithms that search a hypothesis space defined by some underlying representation (eg linear functions, logical descriptions, decision trees, neural networks). Each hypothesis representation is appropriate for learning different kinds of target functions, and each has an underlying structure that the algorithm can use to organize its search.

Throughout this book we will return to this perspective of learning as a search problem in order to characterize learning methods by their search strategies and by the underlying structure of the search spaces they explore. (p.15)

This perspective is also important for analyzing the relationship between the size of the hypothesis space, the number of training examples available, and the confidence we can have that a hypothesis fitting the training data will generalize to unseen examples.