1 What is a decision tree?

A decision tree is a tool that builds classification or regression models in the form of a tree structure. The tree is a graph that illustrates the possible outcomes of different decisions based on a variety of parameters. Decision trees break the data down into smaller and smaller subsets, which is why the approach is also referred to as recursive partitioning. They are typically used in machine learning and data mining.

The Algorithm: How decision trees work

• Many classic decision trees are based on an algorithm called ID3, created by J. R. Quinlan
• ID3 employs entropy and information gain to create a decision tree in a top-down process that partitions the data into subsets of increasingly homogeneous data points
• entropy: a measure of how mixed a sample is. If a sample is completely homogeneous the entropy is zero; if a sample with two classes is equally divided the entropy is one.
• information gain: the decrease in entropy after the dataset is split on an attribute/parameter. Decision trees split on the attribute that generates the highest information gain, which results in the most homogeneous subsets. Information gain is calculated for every parameter that is entered into the tree model and, for each decision, the parameter with the highest information gain is selected. Then the process is repeated on each resulting subset.
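The two quantities above can be sketched in a few lines of base R. This is only an illustration of the definitions, not the code that any particular package uses; the function names entropy and info_gain are made up for this example.

```r
# Entropy (in bits) of a vector of class labels.
entropy <- function(labels) {
  p <- table(labels) / length(labels)  # class proportions
  p <- p[p > 0]                        # drop empty classes so log2(0) never occurs
  -sum(p * log2(p))
}

# Information gain from splitting `labels` by a grouping vector `groups`:
# parent entropy minus the size-weighted entropy of the child subsets.
info_gain <- function(labels, groups) {
  parts <- split(labels, groups)
  child <- sapply(parts, function(part) {
    length(part) / length(labels) * entropy(part)
  })
  entropy(labels) - sum(child)
}

entropy(c("a", "a", "a", "a"))                           # homogeneous sample
entropy(c("a", "a", "b", "b"))                           # two classes, evenly divided
info_gain(c("a", "a", "b", "b"), c("x", "x", "y", "y"))  # split that separates the classes
```

The first call illustrates the zero-entropy case, the second the evenly divided two-class case, and the third a split that removes all of the parent's entropy.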

Decision Tree Components

Decision trees are made up of two parts: nodes and leaves.

• Nodes: represent a decision test, examine a single variable and move to another node based on the outcome
• Leaves: represent the outcome of the decision.
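To make the two components concrete, here is a small hand-built tree written as a nested list in base R. The variable names, thresholds, and the helper classify are hypothetical and chosen only for illustration; they are not taken from a fitted model.

```r
# A decision node stores the variable it tests and a threshold;
# a leaf stores only the outcome label.
toy_tree <- list(
  variable = "petal.length", threshold = 2.0,      # hypothetical threshold
  left  = list(label = "setosa"),                   # leaf
  right = list(
    variable = "petal.width", threshold = 1.7,      # hypothetical threshold
    left  = list(label = "versicolor"),             # leaf
    right = list(label = "virginica")               # leaf
  )
)

# Walk from the root to a leaf: at each node test one variable,
# then follow the left branch (<= threshold) or the right branch.
classify <- function(node, obs) {
  if (!is.null(node[["label"]])) {
    return(node[["label"]])
  }
  if (obs[[node[["variable"]]]] <= node[["threshold"]]) {
    classify(node[["left"]], obs)
  } else {
    classify(node[["right"]], obs)
  }
}

classify(toy_tree, list(petal.length = 1.4, petal.width = 0.2))  # "setosa"
```

Each path from the root to a leaf reads as one decision rule, for example "if petal.length <= 2.0 then setosa".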

What can I do with a decision tree?

Decision trees are useful for making a wide range of predictions: for example, whether an email is spam, what a patient's health outcome will be, or which group an individual belongs to, based on the factors that are specified in the decision tree model.

Advantages of decision trees:

• simple to understand and interpret
• help determine the expected outcomes of various scenarios
• help determine best and worst values for different scenarios
• can be combined with other decision techniques
• require a relatively low degree of data preparation
• can accommodate missing data
• low sensitivity to outliers
• low impact of nonlinear relationships between parameters
• can handle both categorical and numeric variables
• can translate the decision tree results into “decision rules”

Disadvantages of decision trees:

• categorical variables with more levels bias the decision tree toward those variables
• if the tree is over-fitted to the data, it can be a poor predictor for new data

2 R Package: 'party'

The package we will use to create decision trees is called 'party'. Safe to say, you’re going to have a good time creating decision trees.

To install the package, use the syntax below. We will also be using the packages plyr and readr for some data set structuring.

2.1 Install ‘party’

install.packages(c("party", "plyr", "readr")) #install party plus the two helper packages loaded below

library(party)
library(plyr)
library(readr)

2.2 The function: ctree()

To create decision trees, we will be using the function ctree() from the package 'party'. To get more information about the ctree() function you can use the syntax below.

?ctree()

A BRIEF OVERVIEW OF ctree()

The function ctree() is used to create conditional inference trees. The main components of this function are formula and data. Other components include subset, weights, controls, xtrafo, ytrafo, and scores.

• arguments
• formula: refers to the decision model we are using to make predictions. Similarly to ANOVA and regression models in R, the formula takes the shape of outcome~factor1+factor2+...factor(n), where the outcome is the variable we are trying to predict and each of the factors is a candidate basis for the decision nodes.

• data: tells the function which dataset to pull the variables listed in the model from.

• subset: is an optional add on which specifies a subset of observations to be used in the fitting process. Should be used if you don’t want to fit the model to the entire dataset.

• weights: is an optional vector that provides weighted values that can be used in the model fitting process. Can only consist of non-negative integers.

• basic syntax
• ctree(formula, data)
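As a minimal sketch of the basic syntax, the call below fits a tree and predicts the class of a few rows. It assumes the party package is installed and, to stay self-contained, uses the copy of iris that ships with R (columns Species, Petal.Length, and so on) rather than the UCI file loaded later; the object names flowers and fit are made up for this example.

```r
library(party)

# Built-in copy of the iris data (an assumption of this sketch;
# its column names differ from the renamed UCI columns used below).
flowers <- datasets::iris

# formula: outcome ~ factor1 + factor2, data: where to find those variables
fit <- ctree(Species ~ Petal.Length + Petal.Width, data = flowers)

print(fit)  # text description of the splits and terminal nodes

# Predict the class for three rows taken from the data itself
some_rows <- flowers[c(1, 51, 101), ]
predicted <- predict(fit, newdata = some_rows)
predicted
```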

3 An Example using ctree()

3.1 The Dataset: IRIS

For the example, we will be using the iris dataset from the UCI Machine Learning Repository.

ABOUT IRIS

The iris dataset contains information about three different types of iris flowers: setosa iris, versicolor iris, and virginica iris. There are five variables included in the dataset: sepal.length, sepal.width, petal.length, petal.width, and class. Each entry in the dataset represents a different iris flower: the length and width of the sepal and petals are listed for each flower along with the type, or class, of the iris. The sepal and petal are two different components of the iris flower, each of which contributes to the overall aesthetic of the flower. Using the syntax below, we will load the dataset iris from the UCI website and rename each of the columns to reflect each of the five variables.

#first read in the dataset from the URL link - more info found in references section
iris <- read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data", col_names = FALSE)

#next, rename columns to represent the correct flower attribute
iris<-rename(iris, c("X1"="sepal.length", "X2"="sepal.width", "X3"="petal.length", "X4"="petal.width", "X5"="class"))

#change iris class into a factor
iris$class<-as.factor(iris$class)

#overall descriptives of the dataset
summary(iris)
##   sepal.length    sepal.width     petal.length    petal.width
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300
##  Mean   :5.843   Mean   :3.054   Mean   :3.759   Mean   :1.199
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
##              class
##  Iris-setosa    :50
##  Iris-versicolor:50
##  Iris-virginica :50
##
##
## 

As you can see from the summary of the dataset, we have 150 total observations: each class of iris has 50 observations. Some basic descriptive statistics for each of the four flower dimensions are also listed.
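If the UCI link is unreachable, a similar data frame can be built from the copy of iris that ships with R. This is a sketch under that assumption: the name iris_local is made up here, and the built-in copy may differ slightly from the UCI file in a few values, so the summary statistics may not match the output above exactly.

```r
# Start from the built-in dataset (columns: Sepal.Length, ..., Species)
iris_local <- datasets::iris

# Rename the columns to match the names used in this tutorial
names(iris_local) <- c("sepal.length", "sepal.width",
                       "petal.length", "petal.width", "class")

# Relabel the classes in the UCI style, e.g. "Iris-setosa"
levels(iris_local$class) <- paste0("Iris-", levels(iris_local$class))

summary(iris_local)
```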

3.2 Decision Trees using 1 variable

First we will create four different decision trees, each using a single variable, to predict which class of iris a given flower belongs to.

3.2.1 TREE 1

Predicting iris class by sepal length

To see how well sepal length predicts which class of iris a flower is, we create the following decision tree.

tree1<-ctree(class~sepal.length, data=iris) #set the model for the tree, predicting class by sepal length , data set being used is iris
plot(tree1) #view the decision tree

Interpreting the decision tree

To understand what the decision tree is saying, start with the root of the tree (the first decision node). At this node the decision variable is sepal.length, and two branches lead from it: if the sepal length is less than or equal to 5.4, the flower drops down into the first group of iris flowers; otherwise it moves on to the second node.

This first group, shown by the first graph on the left side, contains the 52 flowers that have a sepal length less than or equal to 5.4. Of these 52 flowers, approximately 80% fall into the first class of iris, setosa, about 15% fall into the second class, versicolor, and the remainder fall into the virginica class. The y-axis represents the proportion of the flowers in this group that belong to each of the iris flower classes (setosa, versicolor, virginica).

Next, we move to the second node. Again, the decision variable is sepal.length. The two branches here are less than or equal to 6.1 and greater than 6.1. If the flower has a sepal length that is less than or equal to 6.1, it falls into the second group. Looking at the second graph, we can see that the majority of the 43 flowers in this group are of iris class versicolor.

Next, we move to the third node. Here, if the sepal length is less than or equal to 7 the flower falls into the third group, and if it is greater than 7 the flower falls into the fourth group. Looking at the third graph, there are 43 flowers in the third group and the majority of them are of class virginica; however, there is still a substantial number of versicolor iris flowers in this group as well.

The fourth and last group contains the remaining 12 flowers. All of these flowers belong to the third class of iris: virginica.

Overall, the decision tree tells us that setosa iris flowers tend to have the shortest sepals, versicolor iris flowers have mid-length sepals, and virginica iris flowers tend to have the longest sepals.
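The plot can be complemented with a table of predicted versus actual classes, which shows how many flowers a sepal-length tree assigns to the wrong class. The sketch below assumes the party package is installed and uses the built-in copy of iris (with its own column names) so that it runs on its own; the names flowers, tree_sl, and confusion are made up for this example.

```r
library(party)

flowers <- datasets::iris  # built-in copy, an assumption of this sketch

# Same idea as tree 1: predict the class from sepal length only
tree_sl <- ctree(Species ~ Sepal.Length, data = flowers)

# Fitted class for every flower used to build the tree
predicted <- predict(tree_sl)

# Rows: predicted class, columns: actual class
confusion <- table(predicted = as.character(predicted),
                   actual = as.character(flowers$Species))
confusion

# Share of flowers whose predicted class matches the actual class
accuracy <- mean(as.character(predicted) == as.character(flowers$Species))
accuracy
```

Because this accuracy is measured on the same data the tree was built from, it tends to be optimistic about how the tree would perform on new flowers.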

3.2.2 TREE 2

Predicting iris class by sepal width

Now, let’s see how well sepal width does at predicting iris class.

tree2<-ctree(class~sepal.width, data=iris)
plot(tree2)

Looking at the decision tree, you can see that using sepal width creates three groups of flowers compared to sepal length that created four groups of flowers.

Here the results are much more mixed. The main conclusions are that setosa iris tend to have wider sepals, versicolor tend to have narrower sepals, and virginica show more variety in sepal width.

3.2.3 TREE 3

Predicting iris class by petal length

tree3<-ctree(class~petal.length, data=iris)
plot(tree3)

3.2.4 TREE 4

Predicting iris class by petal width

tree4<-ctree(class~petal.width, data=iris)
plot(tree4)

3.3 Decision Trees with 2 variables

3.3.1 TREE 5

Predicting iris class by sepal dimensions

tree5<-ctree(class~sepal.length+sepal.width, data=iris)
plot(tree5)

3.3.2 TREE 6

Predicting iris class by petal dimensions

tree6<-ctree(class~petal.length+petal.width, data=iris)
plot(tree6)

3.4 Decision Trees with all variables

3.4.1 TREE 7

Predicting iris class by sepal and petal dimensions

tree7<-ctree(class~sepal.length + sepal.width + petal.length + petal.width, data=iris)
plot(tree7)

Here we have the decision tree that includes all four variables (sepal length, sepal width, petal length, and petal width) in the prediction model.

Notice that only two factors are used in the decision nodes: petal length and petal width. This tells us that these two factors are the most important for distinguishing which iris class each flower belongs to. Once the petal dimensions are in the model, sepal length and sepal width are not needed to predict the class of a flower.

This decision tree is identical to decision tree #6.
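One way to check a claim like this in code is to compare the fitted classes of the two trees. The sketch below assumes the party package is installed and uses the built-in copy of iris, so the result may not be the same as for the UCI file used above; the object names are made up for this example.

```r
library(party)

flowers <- datasets::iris  # built-in copy, an assumption of this sketch

# Petal dimensions only (like tree 6) versus all four dimensions (like tree 7)
tree_petal <- ctree(Species ~ Petal.Length + Petal.Width, data = flowers)
tree_all   <- ctree(Species ~ Sepal.Length + Sepal.Width +
                      Petal.Length + Petal.Width, data = flowers)

# TRUE if both trees assign every flower to the same class
agree <- all(as.character(predict(tree_petal)) ==
             as.character(predict(tree_all)))
agree
```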

4 How to Avoid Overfitting the Decision Tree

There are two approaches to avoid overfitting a decision tree to your data.

1. pre-pruning: stop growing the tree early, before the training data is perfectly classified

2. post-trimming: also called post-pruning; grow the tree until the training data is perfectly classified, then prune or trim it back

Post-trimming is the most common approach because it is often difficult to estimate when to stop growing the tree. The important thing is to define the criterion that determines the correct final tree size.
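With ctree(), early stopping of the first kind is configured through the controls argument. The sketch below assumes the party package is installed and uses the built-in copy of iris; the settings maxdepth = 1 and mincriterion = 0.99 are arbitrary values picked for illustration, not recommendations.

```r
library(party)

flowers <- datasets::iris  # built-in copy, an assumption of this sketch

# Default settings
tree_default <- ctree(Species ~ ., data = flowers)

# Stricter stopping rules: require stronger evidence before a split
# (mincriterion) and allow only one level of splits (maxdepth)
tree_small <- ctree(Species ~ ., data = flowers,
                    controls = ctree_control(mincriterion = 0.99,
                                             maxdepth = 1))

# Count terminal nodes: where() gives the terminal node id of each flower
n_default <- length(unique(where(tree_default)))
n_small   <- length(unique(where(tree_small)))
c(default = n_default, small = n_small)
```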

1. validation set: use a different dataset, other than the training set, to evaluate the effect of post-trimming nodes from the decision tree. Often the dataset is broken into two datasets, the training set and the validation set. The decision tree is constructed on the training set, then any post-trimming is evaluated on the validation set.

2. statistical testing: create the decision tree using the training set, then apply statistical tests (error estimation or chi-square) to determine whether pruning a node or expanding a node produces an improvement beyond the training set. For more information on these statistical tests, see “Overfitting Data” in the references and resources section.
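The validation-set idea can be sketched as follows. Since ctree() decides when to stop through its control settings, this example uses the held-out data to compare a few candidate settings rather than to trim individual nodes, which is a swapped-in simplification. It assumes the party package is installed and uses the built-in copy of iris; the 70/30 split, the seed, and the candidate mincriterion values are arbitrary choices.

```r
library(party)

set.seed(42)               # arbitrary seed so the split is reproducible
flowers <- datasets::iris  # built-in copy, an assumption of this sketch

# Break the data into a training set (70%) and a validation set (30%)
train_idx <- sample(nrow(flowers), size = round(0.7 * nrow(flowers)))
train <- flowers[train_idx, ]
valid <- flowers[-train_idx, ]

# Fit one tree per candidate setting on the training set and
# measure its accuracy on the validation set
candidates <- c(0.90, 0.95, 0.99)
valid_acc <- sapply(candidates, function(crit) {
  fit <- ctree(Species ~ ., data = train,
               controls = ctree_control(mincriterion = crit))
  pred <- predict(fit, newdata = valid)
  mean(as.character(pred) == as.character(valid$Species))
})

data.frame(mincriterion = candidates, validation_accuracy = valid_acc)
```

The setting with the best validation accuracy is the one whose tree size generalizes best on this particular split; with a dataset this small, results can change noticeably from one random split to another.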