Tutorial (12): Exercise 5



Exercise 5: Evaluating using a testing set

The dataset

District   House Type     Income  Previous Customer  Outcome
--------   -------------  ------  -----------------  ---------
Suburban   Detached       High    No                 Nothing
Suburban   Detached       High    Responded          Nothing
Rural      Detached       High    No                 Responded
Urban      Semi-detached  High    No                 Responded
Urban      Semi-detached  Low     No                 Responded
Urban      Semi-detached  Low     Responded          Nothing
Rural      Semi-detached  Low     Responded          Responded
Suburban   Terrace        High    No                 Nothing
Suburban   Semi-detached  Low     No                 Responded
Urban      Terrace        Low     No                 Responded
Suburban   Terrace        Low     Responded          Responded
Rural      Terrace        High    Responded          Responded
Rural      Detached       Low     No                 Responded
Urban      Terrace        High    Responded          Nothing
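
For readers who want to reproduce the entropy and error figures outside the interactive page, the table can be written down directly in code. The following is a minimal Python sketch of the data above (the names COLUMNS and ROWS are illustrative only, not part of the exercise):

```python
# Minimal Python encoding of the table above.
# The last column ('Outcome') is the target attribute.
COLUMNS = ["District", "House Type", "Income", "Previous Customer", "Outcome"]

ROWS = [
    ("Suburban", "Detached",      "High", "No",        "Nothing"),
    ("Suburban", "Detached",      "High", "Responded", "Nothing"),
    ("Rural",    "Detached",      "High", "No",        "Responded"),
    ("Urban",    "Semi-detached", "High", "No",        "Responded"),
    ("Urban",    "Semi-detached", "Low",  "No",        "Responded"),
    ("Urban",    "Semi-detached", "Low",  "Responded", "Nothing"),
    ("Rural",    "Semi-detached", "Low",  "Responded", "Responded"),
    ("Suburban", "Terrace",       "High", "No",        "Nothing"),
    ("Suburban", "Semi-detached", "Low",  "No",        "Responded"),
    ("Urban",    "Terrace",       "Low",  "No",        "Responded"),
    ("Suburban", "Terrace",       "Low",  "Responded", "Responded"),
    ("Rural",    "Terrace",       "High", "Responded", "Responded"),
    ("Rural",    "Detached",      "Low",  "No",        "Responded"),
    ("Urban",    "Terrace",       "High", "Responded", "Nothing"),
]
```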

The Holdout (validation) data

Select rows by clicking to move data instances between the training table above and the holdout table below.

District   House Type     Income  Previous Customer  Outcome
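
The same holdout idea can be sketched in code. The snippet below is a minimal illustration that reuses the ROWS list from the sketch above, with random.sample standing in for hand-picking rows by clicking; the variable names are illustrative only:

```python
import random

# Minimal holdout split, reusing ROWS from the dataset sketch above.
# About a third of the 14 instances (4 here) are held out for testing.
random.seed(0)  # fixed seed so the split is repeatable
holdout = set(random.sample(range(len(ROWS)), 4))

training_set   = [row for i, row in enumerate(ROWS) if i not in holdout]
validation_set = [row for i, row in enumerate(ROWS) if i in holdout]

print(len(training_set), "training rows,", len(validation_set), "validation rows")
```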

The Decision Tree: Interactively build it

  • Click on the root node below and start building the tree.
  • Non-leaf nodes can be "pruned" once they have been chosen (by clicking on the node and selecting "prune node completely").
  • The ratios on the branches indicate how well the chosen attribute at a node splits the remaining data with respect to the target attribute ('Outcome').
  • Click on any node to highlight the rows in the data table that the rule down to that node covers.
  • At each node, the entropy of the data reaching that point in the tree is shown.
  • The information gain (entropy reduction) is shown for each candidate attribute (see the sketch after this list).
  • Reducing the entropy to zero is one way of building a decision tree here:
    when no more nodes can be expanded, the tree has classified all of the training data.
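
To make the entropy and information-gain figures concrete, the sketch below computes them for the full table. It is a minimal illustration that reuses ROWS and COLUMNS from the dataset sketch earlier; the function names are illustrative only:

```python
import math

# Reuses ROWS and COLUMNS from the dataset sketch; 'Outcome' is the last column.

def entropy(rows):
    """Shannon entropy (in bits) of the Outcome column for the given rows."""
    total = len(rows)
    counts = {}
    for row in rows:
        counts[row[-1]] = counts.get(row[-1], 0) + 1
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def information_gain(rows, attr_index):
    """Entropy reduction from splitting the rows on the given attribute."""
    partitions = {}
    for row in rows:
        partitions.setdefault(row[attr_index], []).append(row)
    remainder = sum(len(p) / len(rows) * entropy(p) for p in partitions.values())
    return entropy(rows) - remainder

# With 9 'Responded' and 5 'Nothing' outcomes, the entropy of the full
# table is about 0.940 bits.
print(f"entropy of the full dataset: {entropy(ROWS):.3f}")
for i, name in enumerate(COLUMNS[:-1]):
    print(f"information gain for {name}: {information_gain(ROWS, i):.3f}")
```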

  • Move data from the training set into the testing set (by clicking) and then construct a tree. Ideally the testing set should hold 33% or less of the data (about 3 or 4 instances here).
  • Compare the classification errors on the testing data for complex trees against those for simple trees (a scikit-learn sketch of this comparison follows below).
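
The interactive page builds the tree by hand, but the same complex-versus-simple comparison can be sketched offline. The snippet below is an illustrative assumption (not part of the exercise) that uses scikit-learn's DecisionTreeClassifier, reuses COLUMNS, training_set and validation_set from the earlier sketches, and prints correct/incorrect totals in the spirit of the table at the end of the page:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier

# Reuses COLUMNS, training_set and validation_set from the sketches above.

def to_features(rows):
    """Split rows into a list of feature dicts and a list of Outcome labels."""
    feats = [dict(zip(COLUMNS[:-1], row[:-1])) for row in rows]
    labels = [row[-1] for row in rows]
    return feats, labels

train_feats, train_y = to_features(training_set)
valid_feats, valid_y = to_features(validation_set)

vec = DictVectorizer(sparse=False)       # one-hot encodes the categorical attributes
train_X = vec.fit_transform(train_feats)
valid_X = vec.transform(valid_feats)

def correct_incorrect(model, X, y):
    """Return (correct, incorrect) counts, as in the totals table below."""
    correct = sum(pred == truth for pred, truth in zip(model.predict(X), y))
    return correct, len(y) - correct

# A deliberately simple tree versus one grown until the training data is fully split.
for label, depth in [("simple (max_depth=1)", 1), ("complex (no depth limit)", None)]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(train_X, train_y)
    print(label,
          "| training correct/incorrect:", correct_incorrect(tree, train_X, train_y),
          "| validation correct/incorrect:", correct_incorrect(tree, valid_X, valid_y))
```

With such a small holdout set the exact numbers depend on which rows were held out, but the typical pattern is that the fully grown tree makes fewer (or no) errors on the training set while the simpler tree holds up at least as well on the validation set.
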
[Interactive tree-building area: root node]

Classification Errors (Totals)

                  Correct   Incorrect
Training set
Validation set