Heuristic Functions: Evaluation and Generation (Russell & Norvig)
Introduction
These notes cover Russell & Norvig, Chapter 3: Solving Problems by Searching, section 3.6, focusing on how heuristic functions can be evaluated and generated. The concept is introduced more fully in Informed (Heuristic) Search Strategies.
The heuristic function \(h(n)\) is part of an Informed Search Strategy where:
\(h(n) = \text{estimated cost of the cheapest path from the state at node } n \text{ to a goal state}\)
The chapter starts with a couple of examples of heuristic functions for the 15-puzzle (ie the block-sliding puzzle where the goal is to get 15 numbered blocks to line up sequentially on a 4x4 grid by sliding blocks into the empty space).
\(h_1\) = the number of misplaced tiles. This is an admissible heuristic because any tile that is out of place will require at least one move to get it to the right place.
\(h_2\) = the sum of the distances of the tiles from their goal position, ie the sum of their vertical and horizontal distance. This is also admissible.
Neither heuristic overestimates the true cost.
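As a concrete illustration, here is a minimal sketch of both heuristics in Python. The tuple-based state representation and function names are my own, not from the book:

```python
# Sketch of h1 (misplaced tiles) and h2 (Manhattan distance) for the
# 15-puzzle. A state is a tuple of 16 entries in row-major order,
# with 0 standing for the blank; the goal puts tiles 1..15 in order.
GOAL = tuple(range(1, 16)) + (0,)

def h1(state):
    """Number of tiles (excluding the blank) not in their goal position."""
    return sum(1 for tile, goal in zip(state, GOAL)
               if tile != 0 and tile != goal)

def h2(state):
    """Sum of horizontal + vertical distances of tiles to their goal positions."""
    total = 0
    for index, tile in enumerate(state):
        if tile == 0:
            continue
        goal_index = tile - 1            # tile t belongs at index t - 1
        total += (abs(index // 4 - goal_index // 4)
                  + abs(index % 4 - goal_index % 4))
    return total
```

Note that \(h_2(n) \ge h_1(n)\) for every state, since each misplaced tile contributes at least 1 to the Manhattan sum.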
Evaluating Heuristics
Effective branching factor
One way to characterize the quality of a heuristic is the effective branching factor, b*:
If the total number of nodes generated by A* for a particular problem is N and the solution depth is d, then b* is the branching factor that a uniform tree of depth d would have to have in order to contain N + 1 nodes.
Therefore:
\(N+1 = 1 + b^* + (b^*)^2 + \;…\; + (b^*)^d\)
For example, if A* finds a solution at depth 5 using 52 nodes, the effective branching factor is 1.92. Although this can vary across instances, the branching factor is usually quite constant across a domain. Experimental measurements of b* on a small set of nontrivial problems can give a good guide to a heuristic’s overall usefulness. A well-designed heuristic would have a value of b* close to 1.
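Since the polynomial above has no closed-form inverse for general d, b* has to be found numerically. A minimal sketch using bisection (the function name and approach are my own, not from the book):

```python
def effective_branching_factor(nodes_generated, depth, tol=1e-6):
    """Solve N + 1 = 1 + b* + (b*)^2 + ... + (b*)^d for b* by bisection.

    A sketch; assumes depth >= 1 and nodes_generated >= depth, so the
    root lies between 1 and nodes_generated.
    """
    target = nodes_generated + 1

    def total(b):
        # Number of nodes in a uniform tree of depth `depth`, branching b.
        return sum(b ** i for i in range(depth + 1))

    lo, hi = 1.0, float(nodes_generated)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if total(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```

For the book's example (52 nodes, depth 5) this returns approximately 1.92.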
Effective depth
Alternatively, Korf and Reid (1998) argue that a better way to characterize the effect of A* pruning with a given heuristic h is that it reduces the effective depth by a constant \(k_h\) compared to the true depth. This means that the total search cost is \(O(b^{d-k_h})\), compared to \(O(b^d)\) for an uninformed search.
RN run some experiments testing the effective branching factors of \(h_1\) and \(h_2\) (p. 117). They show that \(h_2\) is always better: for every node, \(h_2(n) \ge h_1(n)\). We say that \(h_2\) dominates \(h_1\). Domination translates directly into efficiency: the dominant heuristic will never expand more nodes than the dominated one.
Generating Heuristics
How might we come up with heuristics in the first place? Could a computer invent or learn one?
Relaxed Problems
In the case of \(h_1\) and \(h_2\), the functions are estimates of the remaining path length, but they are also accurate path lengths for simplified versions of the puzzle - where the rules are changed to make the problem simpler.
A problem with fewer restrictions on the actions is called a relaxed problem. The state-space graph of the relaxed problem is a supergraph of the original state space because the removal of the restrictions creates added edges in the graph. (p. 117)
As we’ve added edges, any solution to the original problem is, by definition, a solution in the relaxed problem, but the relaxed problem may have better solutions via shortcuts. Hence:
The cost of an optimal solution to a relaxed problem is an admissible heuristic for the original problem. Furthermore, because the derived heuristic is an exact cost for the relaxed problem, it must obey the triangle inequality and is therefore consistent (p. 118)
If a problem definition is written formally, it is possible to construct relaxed problems automatically. For example, if we say of the block puzzle that “a block may move from square X to square Y if X is adjacent to Y and Y is blank”, then we can generate relaxed versions by dropping constraints: a block may move from X to Y if X is adjacent to Y (dropping the blank requirement); a block may move from X to Y if Y is blank (dropping adjacency); or a block may move from X to Y with no constraints at all.
It is crucial that the relaxed problems generated by this technique can be solved essentially without search, because the relaxed rules allow the problem to be decomposed into eight independent subproblems. If the relaxed problem is hard to solve, then the values of the corresponding heuristics will be expensive to obtain. (p. 118)
A program called Absolver can generate heuristics automatically from problem definitions. It has successfully done so for Rubik’s Cube and sliding-block puzzles.
If several admissible heuristics are available we can define \(h(n) = \max\{h_1(n), \ldots, h_k(n)\}\), which picks whichever function is most accurate on the node in question.
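This combination is straightforward to express in code; a tiny sketch (the function name is my own):

```python
def max_heuristic(*heuristics):
    """Combine admissible heuristics by taking the pointwise maximum.

    The result is still admissible: each component underestimates the
    true cost, so their maximum does too.
    """
    return lambda n: max(h(n) for h in heuristics)

# Toy usage with two constant component heuristics:
h = max_heuristic(lambda n: 3, lambda n: 5)
# h(n) == 5 for every node n
```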
Pattern Databases
Admissible heuristics can also be derived from the solution cost of a subproblem of a given problem. The idea behind pattern databases is to store exact solution costs for every possible subproblem instance.
The database is constructed by searching backward from the goal and recording the cost of each new pattern encountered. The expense of this search is amortized over subsequent problem instances, so it makes sense if we are going to solve many instances. It is an example of dynamic programming.
We then compute an admissible heuristic \(h_{DB}\) for each state encountered during a search simply by looking up the corresponding subproblem configuration in the database.
For example, in an 8-puzzle we might have pattern databases for the subproblems {1, 2, 3, 4} and {5, 6, 7, 8}, which can be combined by taking the maximum. This is more accurate than Manhattan distance.
Alternatively we might use disjoint pattern databases, where only the moves of the blocks in the subproblem are counted; the rest are ignored. The sum of the two subproblem costs can then be taken as a lower bound on the cost of solving the whole problem.
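To make the idea concrete, here is a sketch of a basic (non-disjoint) pattern database for the 3x3 puzzle, built by breadth-first search backward from the goal in an abstracted state space. The pattern choice {1, 2, 3, 4} and all names are illustrative, not from the book:

```python
from collections import deque

# A state is a tuple of 9 entries in row-major order; 0 is the blank.
GOAL = (1, 2, 3, 4, 5, 6, 7, 8, 0)
PATTERN = {1, 2, 3, 4}

def abstract(state):
    """Keep the pattern tiles and the blank; erase the rest (-1)."""
    return tuple(t if t in PATTERN or t == 0 else -1 for t in state)

def neighbours(state):
    """States reachable by sliding one tile into the blank."""
    i = state.index(0)
    r, c = divmod(i, 3)
    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < 3 and 0 <= nc < 3:
            j = nr * 3 + nc
            s = list(state)
            s[i], s[j] = s[j], s[i]
            yield tuple(s)

def build_pattern_db():
    """Exact cost-to-goal for every abstract pattern configuration,
    found by BFS backward from the goal (each move costs 1)."""
    start = abstract(GOAL)
    db = {start: 0}
    frontier = deque([start])
    while frontier:
        state = frontier.popleft()
        for nxt in neighbours(state):
            if nxt not in db:
                db[nxt] = db[state] + 1
                frontier.append(nxt)
    return db

db = build_pattern_db()

def h_db(state):
    """Admissible heuristic: look up the abstracted state in the database."""
    return db[abstract(state)]
```

The heuristic is admissible because every real solution path maps onto an abstract path of the same length, so the abstract optimal cost can never exceed the real one.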
Landmarks
The search algorithms we have looked at could not power something like Google maps, which can find cost-optimal routes in milliseconds. How can we achieve performance like that? The most important trick is precomputation of some optimal path costs. This precomputation can be done once and then amortized over billions of requests.
With a large graph, we can’t do this for every pair of nodes. Instead we choose a few landmark points from among the vertices. Then for every landmark L and every other vertex v, we compute and store the optimal cost \(C^*(v,L)\). With the stored \(C^*\) tables we can then easily create an efficient heuristic: the minimum, over all landmarks, of the cost of getting from the current node to the landmark plus the cost from the landmark to the goal:
\(h_L(n) = \min_{L \in \text{Landmarks}} C^*(n,L) + C^*(L,goal)\)
This can be exact if the optimal path happens to run through the landmark, but otherwise it may overestimate the cost, so it is inadmissible.
An admissible version can be defined by using a differential heuristic:
\(h_{DH}(n) = \max_{L \in \text{Landmarks}} |C^*(n,L) - C^*(goal, L)|\)
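Both heuristics can be sketched on a small weighted graph, with the C* tables precomputed by Dijkstra's algorithm. The tiny graph, landmark choice, and function names below are illustrative, not from the book (the graph is assumed undirected, so \(C^*(L,v) = C^*(v,L)\)):

```python
import heapq

def dijkstra(graph, source):
    """Optimal cost C*(source, v) for every vertex v."""
    dist = {source: 0}
    pq = [(0, source)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue
        for v, w in graph[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return dist

def make_landmark_heuristics(graph, landmarks, goal):
    # Precomputed C* table per landmark; amortized over many queries.
    tables = {L: dijkstra(graph, L) for L in landmarks}

    def h_l(n):   # may overestimate, so inadmissible in general
        return min(t[n] + t[goal] for t in tables.values())

    def h_dh(n):  # differential heuristic, admissible by the triangle inequality
        return max(abs(t[n] - t[goal]) for t in tables.values())

    return h_l, h_dh

# Toy triangle-shaped road map with one landmark at "c":
graph = {
    "a": [("b", 1), ("c", 4)],
    "b": [("a", 1), ("c", 2)],
    "c": [("a", 4), ("b", 2)],
}
h_l, h_dh = make_landmark_heuristics(graph, landmarks=["c"], goal="c")
# Here the goal is itself the landmark, so h_dh("a") equals the
# exact cost C*(a, c) = 3 (via b).
```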
Some route finding algorithms add shortcuts, artificial edges in the graph that define an optimal multi-action path, eg between major cities.
Learning
Can agents learn to search better? Yes. RN distinguish between the state space discussed up to now, the object-level state space, and the metalevel state space:
Each state in a metalevel state space captures the internal (computational) state of a program that is searching in an ordinary state space
We can think of the search process as consisting of a sequence of ever larger search trees. This can be seen as depicting a path in the metalevel state space where each state on the path is an object-level search tree. For harder problems there may be many missteps, where subtrees are expanded and then abandoned.
A metalevel learning algorithm can learn from these experiences to avoid exploring unpromising subtrees. This is explored in Reinforcement Learning, covered in ch. 23 of the book.
An alternative is to learn heuristics from experience, ie from many solved problem instances. This is machine learning; many of these approaches learn an imperfect approximation to the heuristic function and so risk inadmissibility. This leads to a trade-off between learning time, search run time, and solution cost.
Some machine learning techniques work better when supplied with features of a state that are relevant to predicting the state’s heuristic value, rather than just a ‘raw’ state description. For example, in the block-sliding puzzle, features might include “# of misplaced tiles” or “# of pairs of adjacent blocks that are not adjacent in the goal state”. Let us call such features \(x_i(n)\). It is common to combine these features in a linear combination to predict \(h(n)\):
\(h(n) = c_1x_1(n) + c_2x_2(n) + … + c_kx_k(n)\)
where the constants \(c_i\) are adjusted to give the best fit to the actual costs across the randomly generated configurations. Such heuristics are not necessarily admissible or consistent.
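The fit can be done by ordinary least squares. A minimal sketch for two features using the normal equations; the training rows below are made up for illustration (in practice each row would hold the feature values and true cost-to-go of a solved instance):

```python
# Each row: (x1(n), x2(n), true solution cost). Invented data.
data = [
    (5, 3, 11),
    (2, 1, 4),
    (7, 4, 15),
    (1, 0, 2),
]

# Normal equations for h(n) = c1*x1 + c2*x2: solve A c = b, where
# A[i][j] = sum of x_i * x_j and b[i] = sum of x_i * y over the data.
a11 = sum(x1 * x1 for x1, x2, y in data)
a12 = sum(x1 * x2 for x1, x2, y in data)
a22 = sum(x2 * x2 for x1, x2, y in data)
b1 = sum(x1 * y for x1, x2, y in data)
b2 = sum(x2 * y for x1, x2, y in data)

# 2x2 solve by Cramer's rule.
det = a11 * a22 - a12 * a12
c1 = (b1 * a22 - b2 * a12) / det
c2 = (b2 * a11 - b1 * a12) / det

def h(x1, x2):
    """Learned linear heuristic; not necessarily admissible."""
    return c1 * x1 + c2 * x2
```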