
MLR Notes tmp

7.1. What is the recursive partitioning algorithm?


The terminal nodes of a tree, where no further splits are made, are called leaf nodes or leaves.

Tree-based models can be used for both classification and regression tasks, so you may see them described as classification and regression trees (CART).

rpart algorithm

“rpart offers two approaches: the difference in entropy (called the information gain) and the difference in Gini index (called the Gini gain). The two methods usually give very similar results; but the Gini index (named after the sociologist and statistician Corrado Gini) is slightly faster to compute, so we’ll focus on it.”

“The Gini index is the default method rpart uses to decide how to split the tree”

7.1.1. Using Gini gain to split the tree

“Entropy and the Gini index are two ways of trying to measure the same thing: impurity. Impurity is a measure of how heterogeneous the classes are within a node.”

“By estimating the impurity (with whichever method you choose) that would result from using each predictor variable for the next split, the algorithm can choose the feature that will result in the smallest impurity. Put another way, the algorithm chooses the feature that will result in subsequent nodes that are as homogeneous as possible.”

“If a node contains only a single class (which would make it a leaf), it would be said to be pure.”

“We want to know the Gini gain of this split. The Gini gain is the difference between the Gini index of the parent node and the Gini index of the split. Looking at our example in figure 7.2, the Gini index for any node is calculated as”

\[\text{Gini index} = 1 - \left(p(A)^2 + p(B)^2\right)\]
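A quick worked example (the counts here are made up for illustration; they are not the numbers in figure 7.2): suppose a node contains 12 cases of class A and 4 cases of class B, so p(A) = 12/16 = 0.75 and p(B) = 4/16 = 0.25. Then

\[\text{Gini index} = 1 - \left(0.75^2 + 0.25^2\right) = 1 - 0.625 = 0.375\]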


Generalizing the Gini index to any number of classes
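For a node whose cases fall into K classes with proportions p_1, …, p_K, the two-class formula above generalizes to

\[\text{Gini index} = 1 - \sum_{k=1}^{K} p_k^2\]

Consistent with the definition of Gini gain quoted earlier, the Gini index of a split is the children's Gini indices weighted by the proportion of cases each child node receives, and the gain is the parent's Gini index minus that weighted sum:

\[\text{Gini gain} = \text{Gini}_{\text{parent}} - \sum_{j} \frac{n_j}{n}\,\text{Gini}_{\text{child } j}\]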

7.1.3. Hyperparameters of the rpart algorithm

cp (the complexity parameter): a candidate split must improve the overall fit of the tree by at least a factor of cp, or it is not attempted, so larger values give smaller, more aggressively pruned trees. The other hyperparameters tuned later in this chapter are minsplit (the minimum number of cases a node must contain before a split is attempted), minbucket (the minimum number of cases allowed in a leaf), and maxdepth (the maximum depth of the tree).
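As a minimal sketch (not one of the book's listings), a fixed cp could be supplied directly when defining an rpart learner in mlr; the learner name treeFixedCp and the value 0.05 are arbitrary choices for illustration:

library(mlr)

# Hypothetical learner with a fixed complexity parameter instead of a tuned one
treeFixedCp <- makeLearner("classif.rpart", par.vals = list(cp = 0.05))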

7.2. Building a decision tree model

Listing 7.1. Loading and exploring the zoo dataset

library(mlr)
library(tidyverse)

data(Zoo, package = "mlbench")
zooTib <- as_tibble(Zoo)

zooTib

“mlr won’t let us create a task with logical predictors, so let’s convert them into factors instead”

Listing 7.2. Converting logical variables to factors

zooTib <- mutate_if(zooTib, is.logical, as.factor)
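A quick check that the conversion worked (not part of the original listing), using purrr's map_chr() to print each column's class:

# Each column's class; the logical predictors should now all be factors
map_chr(zooTib, ~ class(.x)[1])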

7.4. Training the decision tree model

Listing 7.3. Creating the task and learner

zooTask <- makeClassifTask(data = zooTib, target = "type")

tree <- makeLearner("classif.rpart")

“The maxcompete hyperparameter controls how many candidate splits can be displayed for each node in the model summary”

“The maxsurrogate hyperparameter is similar to maxcompete but controls how many surrogate splits are shown”

“The usesurrogate hyperparameter controls how the algorithm uses surrogate splits. A value of zero means surrogates will not be used, and cases with missing data will not be classified”
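These surrogate-related hyperparameters aren't tuned in this chapter, but as a hedged sketch they could be fixed directly on the learner; the object name treeSurr and the values below are purely illustrative:

# Hypothetical learner with explicit competing-split and surrogate settings
treeSurr <- makeLearner("classif.rpart",
                        par.vals = list(maxcompete = 4,
                                        maxsurrogate = 5,
                                        usesurrogate = 2))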

Recall from chapter 6 that we can quickly count the number of missing values per column of a data.frame or tibble by running map_dbl(zooTib, ~sum(is.na(.)))

Listing 7.4. Printing available rpart hyperparameters

getParamSet(tree)

Listing 7.5. Defining the hyperparameter space for tuning

treeParamSpace <- makeParamSet(
  makeIntegerParam("minsplit", lower = 5, upper = 20),
  makeIntegerParam("minbucket", lower = 3, upper = 10),
  makeNumericParam("cp", lower = 0.01, upper = 0.1),
  makeIntegerParam("maxdepth", lower = 3, upper = 10))
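The tuning and cross-validation listings below use two objects, randSearch and cvForTuning, whose defining listing (7.6) isn't reproduced in these notes. A minimal sketch of how they could be defined in mlr, assuming a random search over the parameter space and ordinary k-fold CV for the inner loop (the iteration and fold counts here are placeholders, not necessarily the book's values):

# Random search over treeParamSpace, and the inner resampling used for tuning
randSearch <- makeTuneControlRandom(maxit = 200)
cvForTuning <- makeResampleDesc("CV", iters = 5)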

Listing 7.7. Performing hyperparameter tuning

library(parallel)
library(parallelMap)

parallelStartSocket(cpus = detectCores())

tunedTreePars <- tuneParams(tree, task = zooTask,
                            resampling = cvForTuning,
                            par.set = treeParamSpace,
                            control = randSearch)

parallelStop()

tunedTreePars

7.4.1. Training the model with the tuned hyperparameters

Listing 7.8. Training the final tuned model

tunedTree <- setHyperPars(tree, par.vals = tunedTreePars$x)

tunedTreeModel <- train(tunedTree, zooTask)
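As a quick, hedged aside (not one of the book's listings), the wrapped model can be passed to mlr's predict(); here it simply re-predicts the training data, so the result is optimistic:

# Re-predict the training data with the tuned model (optimistic; the
# cross-validation in listing 7.11 gives an honest performance estimate)
predict(tunedTreeModel, newdata = zooTib)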

Listing 7.9. Plotting the decision tree

install.packages("rpart.plot")

library(rpart.plot)

treeModelData <- getLearnerModel(tunedTreeModel)

rpart.plot(treeModelData, roundint = FALSE, box.palette = "BuBn", type = 5)

Listing 7.10. Exploring the model

printcp(treeModelData, digits = 3)

“For a detailed summary of the model, run summary(treeModelData).”

7.5. Cross-validating our decision tree model

Listing 7.11. Cross-validating the model-building process

outer <- makeResampleDesc("CV", iters = 5)

treeWrapper <- makeTuneWrapper("classif.rpart", resampling = cvForTuning,
                               par.set = treeParamSpace,
                               control = randSearch)

parallelStartSocket(cpus = detectCores())

cvWithTuning <- resample(treeWrapper, zooTask, resampling = outer)

parallelStop()

Now let’s look at the cross-validation result and see how our model-building process performed.

Listing 7.12. Extracting the cross-validation result

cvWithTuning
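If only the aggregated performance is needed rather than the full print-out, it can be pulled straight out of the resample result; this is standard mlr usage rather than part of the book's listing:

# Mean misclassification error aggregated across the outer folds
cvWithTuning$aggr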