Alpaydin: Chapter 09: Decision Trees
Metadata
Title: Decision Trees
Number: 9
Key Quotes and Definitions
A decision tree is a hierarchical model for supervised learning whereby the local region is identified in a sequence of recursive splits in a smaller number of steps. A decision tree is composed of internal decision nodes and terminal leaves. Each decision node m implements a test function \(f_m(x)\) with discrete outcomes labelling the branches. Given an input, at each node a test is applied and one of the branches is taken depending on the outcome. The process starts at the root and is repeated recursively until a leaf node is hit, at which point the value written in the leaf constitutes the output. (p. 217)
It is a nonparametric model, in that we assume neither the tree structure a priori nor any parametric form for the class densities. Each \(f_m(x)\) defines a discriminant in the d-dimensional input space, dividing it into smaller and smaller regions as we take a path from the root down.
\(f_m(\cdot)\) is a simple function; different DT methods assume different models for it, and so define different shapes for the discriminants and regions. Each leaf has an output label, either a class (classification) or a real value (regression), the leaf node defining a localized region in the input space whose boundaries are determined by the discriminants in the internal nodes on the path from the root.
Decision trees can be fast and explainable.
Univariate Trees
In a univariate tree, the test at each internal node uses only one of the input dimensions. If the input dimension is discrete, taking one of n values, the decision node has an n-way split.
A decision tree's branches are discrete, so a numeric value has to be discretized. If the input dimension is numeric, the test is a comparison against a threshold value: \(f_m(x): x_j > w_{m0}\). The resulting split is binary.
Tree induction is the construction of the tree given a training sample. For any training set there exist many trees that can code it without error. For simplicity, we are interested in finding the smallest among them, where size is measured by the number of nodes and the complexity of decision nodes. Finding the smallest tree is NP-complete, so we use local search procedures based on heuristics that give us reasonable trees in reasonable time. Tree learning algorithms are greedy, looking for the best split at each step.
Classification Trees
In decision trees for classification, the 'goodness' of a split is measured by an impurity measure. A split is pure if, after the split, all the instances choosing each branch belong to the same class.
For any node m, \(N_m\) is the number of training instances reaching that node, of which \(N^i_m\) belong to class \(C_i\). The estimate for the probability of class \(C_i\) is then:
\[\hat{P}(C_i|x,m) \equiv p^i_m = \frac{N^i_m}{N_m}\]
So node m is pure if \(p^i_m\) is 0 or 1 for all i. We can measure impurity with entropy, or with alternative functions that satisfy certain constraints (see p. 221 for details). If node m is not pure, instances should be split so as to minimize impurity after the split.
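As a sketch (the function name and data layout are mine, not from the text), the class-probability estimates and entropy impurity can be computed as:

```python
from collections import Counter
from math import log2

def node_entropy(labels):
    """Entropy impurity of a node, from the class labels of instances reaching it."""
    n = len(labels)
    counts = Counter(labels)  # N_m^i for each class i
    # p_m^i = N_m^i / N_m; entropy is 0 for a pure node, maximal for a uniform mix.
    return sum(-(c / n) * log2(c / n) for c in counts.values())

print(node_entropy(["a", "a", "a", "a"]))  # 0.0 (pure)
print(node_entropy(["a", "a", "b", "b"]))  # 1.0 (maximally impure for two classes)
```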
For discrete attributes, we can compute the impurity of the n-way split that the attribute generates. For numeric attributes, note that the best split is always between adjacent points belonging to different classes, so we need only test those midpoints and measure the resulting impurity. We then split on whichever attribute (and threshold) yields the minimum impurity.
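A minimal sketch of the candidate-threshold observation (the function name and data layout are assumptions of mine):

```python
def candidate_thresholds(values, labels):
    """Midpoints between adjacent sorted values whose class labels differ."""
    pairs = sorted(zip(values, labels))
    out = []
    for (v1, c1), (v2, c2) in zip(pairs, pairs[1:]):
        if c1 != c2 and v1 != v2:
            out.append((v1 + v2) / 2)
    return out

# Feature values 1,2,3,4 with classes a,a,b,b: only the boundary 2.5 need be tested.
print(candidate_thresholds([1, 2, 3, 4], ["a", "a", "b", "b"]))  # [2.5]
```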
This is the basis of the CART algorithm, ID3, and C4.5.
DTs can suffer from overfitting and noise, so we specify a threshold purity level rather than demanding absolute 0 or 1. They can also favour attributes with many values, since splitting on them tends to lower impurity more, which can lead to over-complex trees. We might penalize such attributes or balance the impurity drop against the branching factor.
Pseudocode
GenerateTree(X)
    If NodeEntropy(X) < θI
        Create leaf labelled by majority class in X
        Return
    i <- SplitAttribute(X)
    For each branch of xi
        Find Xi falling in branch
        GenerateTree(Xi)

SplitAttribute(X)
    MinEnt <- MAX
    For all attributes i = 1, ..., d
        If xi is discrete with n values
            Split X into X1, ..., Xn by xi
            e <- SplitEntropy(X1, ..., Xn)
            If e < MinEnt: MinEnt <- e; bestf <- i
        Else /* xi is numeric */
            For all possible splits
                Split X into X1, X2 on xi
                e <- SplitEntropy(X1, X2)
                If e < MinEnt: MinEnt <- e; bestf <- i
    Return bestf
Where NodeEntropy
is the equation:
\[\mathcal{I}_m = - \sum^K_{i=1} p^i_m \log_2 p^i_m\]
And SplitEntropy
is the equation:
\[\mathcal{I}'_m = - \sum^n_{j=1} \frac{N_{mj}}{N_m} \sum^K_{i=1} p^i_{mj} \log_2 p^i_{mj}\]
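As a sketch, the pseudocode above can be made runnable. The tree representation, the data layout (a list of (x, label) pairs), and the θI value are assumptions of mine; this version handles only numeric attributes with binary splits:

```python
from collections import Counter
from math import log2

THETA_I = 0.3  # impurity threshold theta_I (assumed value)

def node_entropy(rows):
    """NodeEntropy: entropy impurity of the instances reaching a node."""
    n = len(rows)
    counts = Counter(label for _, label in rows)
    return sum(-(c / n) * log2(c / n) for c in counts.values())

def split_entropy(partitions):
    """SplitEntropy: entropy after a split, weighted by branch sizes."""
    n = sum(len(p) for p in partitions)
    return sum(len(p) / n * node_entropy(p) for p in partitions if p)

def split_attribute(rows, n_features):
    """Greedy search for the (feature, threshold) pair minimizing split entropy."""
    best = (float("inf"), None, None)
    for i in range(n_features):
        values = sorted({x[i] for x, _ in rows})
        for v1, v2 in zip(values, values[1:]):  # candidate thresholds: midpoints
            t = (v1 + v2) / 2
            left = [r for r in rows if r[0][i] <= t]
            right = [r for r in rows if r[0][i] > t]
            e = split_entropy([left, right])
            if e < best[0]:
                best = (e, i, t)
    return best[1], best[2]

def generate_tree(rows, n_features):
    """GenerateTree: recursive induction; leaves are labelled by majority class."""
    if node_entropy(rows) < THETA_I or len(rows) < 2:
        return ("leaf", Counter(y for _, y in rows).most_common(1)[0][0])
    i, t = split_attribute(rows, n_features)
    if i is None:  # no usable split (e.g. identical inputs)
        return ("leaf", Counter(y for _, y in rows).most_common(1)[0][0])
    left = [r for r in rows if r[0][i] <= t]
    right = [r for r in rows if r[0][i] > t]
    return ("node", i, t,
            generate_tree(left, n_features), generate_tree(right, n_features))

def predict(tree, x):
    while tree[0] == "node":
        _, i, t, left, right = tree
        tree = left if x[i] <= t else right
    return tree[1]

data = [((1.0, 0.0), "neg"), ((2.0, 1.0), "neg"),
        ((6.0, 0.5), "pos"), ((7.0, 1.5), "pos")]
tree = generate_tree(data, n_features=2)
print(predict(tree, (1.5, 0.7)))  # neg
print(predict(tree, (6.5, 1.0)))  # pos
```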
Pruning
Often a node is not split further if the number of training instances reaching it is smaller than a certain percentage of the training set (e.g. 5%). Stopping construction early in this way is called prepruning. Another possibility, better in practice, is postpruning: once the tree is constructed, we look back and try to prune unnecessary subtrees.
The approach is as follows: keep a pruning set aside from the initial training set. For each subtree, replace it with a leaf node labelled with the majority class of the training instances covered by the subtree. If the leaf node does not perform worse than the subtree on the pruning set, we prune the subtree and keep the leaf node, because the additional complexity is not justified. Prepruning is faster, but postpruning tends to produce more accurate trees.
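A postpruning sketch under assumed conventions (nested-tuple trees, (x, label) rows; none of these names come from the text):

```python
from collections import Counter

def errors(tree, rows):
    """Number of misclassifications of a (sub)tree on (x, label) rows."""
    def classify(t, x):
        while t[0] == "node":
            _, i, thr, left, right = t
            t = left if x[i] <= thr else right
        return t[1]
    return sum(classify(tree, x) != y for x, y in rows)

def postprune(tree, train_rows, prune_rows):
    """Bottom-up: replace a subtree by a majority-class leaf unless the pruning set objects."""
    if tree[0] == "leaf":
        return tree
    if not train_rows:
        return tree  # no covered training instances; leave subtree as-is
    _, i, thr, left, right = tree
    tree = ("node", i, thr,
            postprune(left, [r for r in train_rows if r[0][i] <= thr],
                      [r for r in prune_rows if r[0][i] <= thr]),
            postprune(right, [r for r in train_rows if r[0][i] > thr],
                      [r for r in prune_rows if r[0][i] > thr]))
    # Candidate leaf: majority class of the training instances covered by this subtree.
    leaf = ("leaf", Counter(y for _, y in train_rows).most_common(1)[0][0])
    if errors(leaf, prune_rows) <= errors(tree, prune_rows):
        return leaf  # the extra complexity is not justified
    return tree

# An overgrown tree whose lower split merely memorizes one noisy training point (x=9).
tree = ("node", 0, 5.0, ("leaf", "neg"),
        ("node", 0, 8.0, ("leaf", "pos"), ("leaf", "neg")))
train = [((1,), "neg"), ((2,), "neg"), ((6,), "pos"), ((7,), "pos"), ((9,), "neg")]
prune_set = [((1.5,), "neg"), ((6.5,), "pos"), ((9.5,), "pos")]
pruned = postprune(tree, train, prune_set)
print(pruned)  # ('node', 0, 5.0, ('leaf', 'neg'), ('leaf', 'pos'))
```

The noisy subtree below the right branch is collapsed to a single "pos" leaf, while the informative top-level split survives the pruning-set check.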
Rule Learning
A decision tree does its own feature extraction: it selects which features to split on. We might train a DT to see which features it uses, and then use those features in another learning method.
Another big advantage of DTs is interpretability. Each node carries a condition that is easy to understand, and each path from root to leaf can be written as an 'if-then' rule; the set of such rules is called a rule base. Such a rule base allows knowledge extraction. We can calculate the percentage of the training data covered by each rule, revealing the important features of the dataset, and we can prune rules for simplification.
We can also learn the rules directly through rule induction. This works similarly to tree induction, except that the search is depth-first rather than breadth-first. Rules are learned one at a time, each a conjunction of conditions on discrete or numeric attributes.
A rule is said to cover an example if the example satisfies all the conditions of the rule. Once a rule is grown and pruned it is added to the rule base and all examples covered by it are removed from the training set.
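A tiny sketch of coverage and the removal step (the rule and data are hypothetical examples of mine):

```python
# Hypothetical rule: IF x[0] > 2 AND x[1] <= 5 THEN class = "pos"
rule = [lambda x: x[0] > 2, lambda x: x[1] <= 5]

def covers(rule, x):
    """A rule covers an example iff the example satisfies all of its conditions."""
    return all(cond(x) for cond in rule)

examples = [((3, 4), "pos"), ((1, 4), "neg"), ((3, 9), "neg")]
covered = [(x, y) for x, y in examples if covers(rule, x)]
remaining = [(x, y) for x, y in examples if not covers(rule, x)]
print(covered)    # [((3, 4), 'pos')]: these are removed from the training set
print(remaining)  # the next rule is grown on what remains
```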
The rules we learn are propositional rules. An extension is to learn first-order rules instead (ie predicates). The difference is how we learn the predicates and test them. If we learn such predicates then we can use them in eg Prolog. Such learning is called inductive logic programming.
Rule learning is out of scope of the module, but see pp. 231-234 for pseudocode and mathematical formalization.
Multivariate Trees and Summary
An alternative to univariate trees is to consider more than one feature at each node split. These are multivariate trees, which may use features in linear or nonlinear combination. Many studies, however, have shown that univariate trees are quite accurate and interpretable, and that the additional complexity brought by multivariate nodes is hardly justified. Multivariate trees also require discrete features to be converted to numeric ones (e.g. 0/1 indicators of membership of a discrete category).
DTs are used more for classification than regression. They are very popular, learning and responding quickly, and accurate in many domains.
It is generally recommended that a DT be tried first and its accuracy taken as a benchmark before moving to more complex algorithms. Analysing the tree also reveals aspects of the dataset.
DTs are especially popular when used in combination. If we train multiple DTs we get a decision forest. If we train many DTs on a random subset of the training set, and a random subset of the input features, we can often combine their results to significantly increase overall accuracy. This approach is known as the random forest method.
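A toy sketch of the bagging and random-feature-subset ideas (all names and parameters are mine; real random forests grow full trees, whereas this uses depth-1 stumps for brevity):

```python
import random
from collections import Counter

def train_stump(rows, feats):
    """Fit the best single-feature threshold split (a depth-1 tree) on the given rows."""
    best = None
    for i in feats:
        values = sorted({x[i] for x, _ in rows})
        for v1, v2 in zip(values, values[1:]):
            t = (v1 + v2) / 2
            left = [y for x, y in rows if x[i] <= t]
            right = [y for x, y in rows if x[i] > t]
            l_lab = Counter(left).most_common(1)[0][0]
            r_lab = Counter(right).most_common(1)[0][0]
            err = sum(y != l_lab for y in left) + sum(y != r_lab for y in right)
            if best is None or err < best[0]:
                best = (err, i, t, l_lab, r_lab)
    if best is None:  # degenerate bootstrap sample: constant predictor
        maj = Counter(y for _, y in rows).most_common(1)[0][0]
        return (0, float("inf"), maj, maj)
    return best[1:]  # (feature, threshold, left_label, right_label)

def random_forest(rows, n_trees=25, n_sub_features=1, seed=0):
    rng = random.Random(seed)
    d = len(rows[0][0])
    forest = []
    for _ in range(n_trees):
        sample = [rng.choice(rows) for _ in rows]      # bootstrap sample of the training set
        feats = rng.sample(range(d), n_sub_features)   # random subset of the input features
        forest.append(train_stump(sample, feats))
    return forest

def predict(forest, x):
    # Combine the trees' results by majority vote.
    votes = Counter(l if x[i] <= t else r for i, t, l, r in forest)
    return votes.most_common(1)[0][0]

data = [((1.0, 3.0), "neg"), ((2.0, 2.5), "neg"), ((1.5, 3.5), "neg"),
        ((6.0, 0.5), "pos"), ((7.0, 1.0), "pos"), ((6.5, 0.0), "pos")]
forest = random_forest(data)
print(predict(forest, (1.2, 3.2)))
print(predict(forest, (6.8, 0.7)))
```

Each tree sees a different bootstrap sample and feature subset, so the trees make different errors; voting averages those errors out.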
The DT is a discriminant-based method, as opposed to likelihood-based methods that explicitly estimate class densities and the probability of class membership.