ML Home

library(mlr)
library(tidyverse)

Linear regression can handle both continuous and categorical predictor variables

The formula: \[y = intercept + slope × x\]

If all the variables are centered to have a mean of zero, then the intercept can be interpreted as the value of y at the mean of x (which is often more useful information).

Centering your variables like this doesn’t affect the slopes because the relationships between variables remain the same. Therefore, predictions made by linear regression models are unaffected by centering and scaling your data.

\[y = \beta_0 + \beta_1 x_1 + \epsilon\]

We use the squared residuals so that we disproportionately penalize cases that are far away from their predicted value.

The slopes in linear regression tell us how the outcome variable changes for a one-unit increase in each predictor while holding all other predictors constant. In other words, the slopes tell us how the outcome changes when we change the predictor variables, one at a time.

\[y = \beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k + \epsilon\]

Recall that in logistic regression, we predict the log odds of a case belonging to a particular class. In linear regression, we simply predict the case’s value of the outcome variable.

There are other algorithms that may learn models that perform better on a particular task but aren’t as interpretable. e.g. Random forest, XGBoost, and SVMs are examples of black-box models.

Homoscedastic simply means the variance of the outcome variable doesn’t increase as the predicted value of the outcome increases.

The opposite of homoscedastic is heteroscedastic

If the residuals are heteroscedastic, it sometimes helps to build a model that predicts some transformation of the outcome variable instead. For example, predicting the log10 of the response variable is a common choice

Generalized Linear Models With Examples in R by Peter K. Dunn and Gordon K. Smyth (Springer, 2018).

9.1.2. What if our predictors are categorical

Dummy variables are new representations of categorical variables that map the categories to 0 and 1.

reference level

9.2. Building a linear regression model

Listing 9.1. Loading and exploring the Ozone dataset

data(Ozone, package = "mlbench")

ozoneTib <- as_tibble(Ozone)

names(ozoneTib) <- c("Month", "Date", "Day", "Ozone", "Press_height",
                     "Wind", "Humid", "Temp_Sand", "Temp_Monte",
                     "Inv_height", "Press_grad", "Inv_temp", "Visib")

ozoneTib
## # A tibble: 366 x 13
##    Month Date  Day   Ozone Press_height  Wind Humid Temp_Sand Temp_Monte
##    <fct> <fct> <fct> <dbl>        <dbl> <dbl> <dbl>     <dbl>      <dbl>
##  1 1     1     4         3         5480     8    20        NA       NA  
##  2 1     2     5         3         5660     6    NA        38       NA  
##  3 1     3     6         3         5710     4    28        40       NA  
##  4 1     4     7         5         5700     3    37        45       NA  
##  5 1     5     1         5         5760     3    51        54       45.3
##  6 1     6     2         6         5720     4    69        35       49.6
##  7 1     7     3         4         5790     6    19        45       46.4
##  8 1     8     4         4         5790     3    25        55       52.7
##  9 1     9     5         6         5700     3    73        41       48.0
## 10 1     10    6         7         5700     3    59        44       NA  
## # … with 356 more rows, and 4 more variables: Inv_height <dbl>,
## #   Press_grad <dbl>, Inv_temp <dbl>, Visib <dbl>

Listing 9.2. Cleaning the data

ozoneClean <- mutate_all(ozoneTib, as.numeric) %>%
  filter(is.na(Ozone) == FALSE)

ozoneClean
## # A tibble: 361 x 13
##    Month  Date   Day Ozone Press_height  Wind Humid Temp_Sand Temp_Monte
##    <dbl> <dbl> <dbl> <dbl>        <dbl> <dbl> <dbl>     <dbl>      <dbl>
##  1     1     1     4     3         5480     8    20        NA       NA  
##  2     1     2     5     3         5660     6    NA        38       NA  
##  3     1     3     6     3         5710     4    28        40       NA  
##  4     1     4     7     5         5700     3    37        45       NA  
##  5     1     5     1     5         5760     3    51        54       45.3
##  6     1     6     2     6         5720     4    69        35       49.6
##  7     1     7     3     4         5790     6    19        45       46.4
##  8     1     8     4     4         5790     3    25        55       52.7
##  9     1     9     5     6         5700     3    73        41       48.0
## 10     1    10     6     7         5700     3    59        44       NA  
## # … with 351 more rows, and 4 more variables: Inv_height <dbl>,
## #   Press_grad <dbl>, Inv_temp <dbl>, Visib <dbl>

Listing 9.3. Plotting the data

we have to use -Ozone to prevent the Ozone variable from being gathered with the others.

ozoneUntidy <- gather(ozoneClean, key = "Variable",
                      value = "Value", -Ozone)

ggplot(ozoneUntidy, aes(Value, Ozone)) +
  facet_wrap(~ Variable, scale = "free_x") +
  geom_point() +
  geom_smooth(method = "loess", formula="y ~ x") +
  geom_smooth(method = "lm", , formula="y ~ x", col = "red") +
  theme_bw()
## Warning: Removed 196 rows containing non-finite values (stat_smooth).

## Warning: Removed 196 rows containing non-finite values (stat_smooth).
## Warning: Removed 196 rows containing missing values (geom_point).

9.2.2. Imputing missing values

When doing any form of missing-value imputation, it’s extremely important to ensure that the data is either missing at random (MAR) or missing completely at random (MCAR), and not missing not at random (MNAR). If data is MCAR, it means the likelihood of a missing value is not related to any variable in the dataset. If data is MAR, it means the likelihood of a missing value is related only to the value of the other variables in the dataset.

There is a more powerful imputation technique called multiple imputation. The premise of multiple imputation is that you create many new datasets, replacing missing data with sensible values in each one

Listing 9.4. Using rpart to impute missing values

imputeMethod <- imputeLearner("regr.rpart")

ozoneImp <- impute(as.data.frame(ozoneClean),
                   classes = list(numeric = imputeMethod))

Listing 9.5. Defining our task and learner

ozoneTask <- makeRegrTask(data = ozoneImp$data, target = "Ozone")

lin <- makeLearner("regr.lm")

9.2.3. Automating feature selection

There are two methods for automating feature selection:

  • Filter methods — Filter methods compare each of the predictors against the outcome variable, and calculate a metric of how much the outcome varies with the predictor. This metric could be a correlation: for example, if both variables are continuous. The predictor variables are ranked in order of this metric (which, in theory, ranks them in order of how much information they can contribute to the model), and we can choose to drop a certain number or proportion of the worst-performing variables from our model. The number or proportion of variables we drop can be tuned as a hyperparameter during model building.
  • Wrapper methods — With wrapper methods, rather than using a single, out-of-model statistic to estimate feature importance, we iteratively train our model with different predictor variables. Eventually, the combination of predictors that gives us the best performing model is chosen. There are different ways of doing this, but one such example is sequential forward selection. In sequential forward selection, we start with no predictors and then add predictors one by one. At each step of the algorithm, the feature that results in the best model performance is chosen. Finally, when the addition of any more predictors doesn’t result in an improvement in performance, feature addition stops, and the final model is trained on the selected predictors.

wrapper methods tend to be computationally expensive. Filter methods, on the other hand, may or may not select the best-performing set of predictors but are much less computationally expensive.

The filter method for feature selection

  • Linear correlation — When both predictor and outcome are continuous
  • ANOVA — When the predictor is categorical and the outcome is continuous
  • Chi-squared— When both the predictor and outcome are continuous
  • Random forest importance — Can be used whether the predictors and outcomes are categorical or continuous (the default)

The default method used by mlr (because it doesn’t depend on whether the variables are categorical or continuous) is to build a random forest to predict the outcome

Listing 9.6. Using a filter method for feature selection

filterVals <- generateFilterValuesData(ozoneTask,
                                       method = "linear.correlation")

filterVals$data
##             name    type             filter      value
##  1:        Month numeric linear.correlation 0.05371420
##  2:         Date numeric linear.correlation 0.08205094
##  3:          Day numeric linear.correlation 0.04151444
##  4: Press_height numeric linear.correlation 0.58752354
##  5:         Wind numeric linear.correlation 0.00468138
##  6:        Humid numeric linear.correlation 0.45148094
##  7:    Temp_Sand numeric linear.correlation 0.76977674
##  8:   Temp_Monte numeric linear.correlation 0.74159034
##  9:   Inv_height numeric linear.correlation 0.57563391
## 10:   Press_grad numeric linear.correlation 0.23331823
## 11:     Inv_temp numeric linear.correlation 0.72712691
## 12:        Visib numeric linear.correlation 0.41471463

Exercise 1

Generate and plot filter values for ozoneTask, but using the default method randomForestSRC_importance (don’t overwrite the filterVals object). Are the variables ranked in the same order of importance between the two methods?

Listing 9.7. Manually selecting which features to drop

#ozoneFiltTask <- filterFeatures(ozoneTask,
#                                fval = filterVals, abs = 6)
#ozoneFiltTask <- filterFeatures(ozoneTask,
#                                fval = filterVals, per = 0.25)
#ozoneFiltTask <- filterFeatures(ozoneTask,
#                                fval = filterVals, threshold = 0.2)

Listing 9.8. Creating a filter wrapper

filterWrapper = makeFilterWrapper(learner = lin,
                                  fw.method = "linear.correlation")

Listing 9.9. Tuning the number of predictors to retain

lmParamSpace <- makeParamSet(
  makeIntegerParam("fw.abs", lower = 1, upper = 12)
)

gridSearch <- makeTuneControlGrid()

kFold <- makeResampleDesc("CV", iters = 10)

tunedFeats <- tuneParams(filterWrapper, task = ozoneTask, resampling = kFold,
                        par.set = lmParamSpace, control = gridSearch)
## [Tune] Started tuning learner regr.lm.filtered for parameter set:
##           Type len Def  Constr Req Tunable Trafo
## fw.abs integer   -   - 1 to 12   -    TRUE     -
## With control class: TuneControlGrid
## Imputation value: Inf
## [Tune-x] 1: fw.abs=1
## [Tune-y] 1: mse.test.mean=25.7766240; time: 0.0 min
## [Tune-x] 2: fw.abs=2
## [Tune-y] 2: mse.test.mean=25.2783515; time: 0.0 min
## [Tune-x] 3: fw.abs=3
## [Tune-y] 3: mse.test.mean=25.0705001; time: 0.0 min
## [Tune-x] 4: fw.abs=5
## [Tune-y] 4: mse.test.mean=22.6975378; time: 0.0 min
## [Tune-x] 5: fw.abs=6
## [Tune-y] 5: mse.test.mean=21.0344477; time: 0.0 min
## [Tune-x] 6: fw.abs=7
## [Tune-y] 6: mse.test.mean=21.1072068; time: 0.0 min
## [Tune-x] 7: fw.abs=8
## [Tune-y] 7: mse.test.mean=21.3092236; time: 0.0 min
## [Tune-x] 8: fw.abs=10
## [Tune-y] 8: mse.test.mean=20.7856666; time: 0.0 min
## [Tune-x] 9: fw.abs=11
## [Tune-y] 9: mse.test.mean=20.9164209; time: 0.0 min
## [Tune-x] 10: fw.abs=12
## [Tune-y] 10: mse.test.mean=21.0425258; time: 0.0 min
## [Tune] Result: fw.abs=10 : mse.test.mean=20.7856666
tunedFeats
## Tune result:
## Op. pars: fw.abs=10
## mse.test.mean=20.7856666

For regression problems, there are three commonly used performance metrics:

  • Mean absolute error (MAE)— Finds the absolute residual between each case and the model, adds them all up, and divides by the number of cases. We can interpret this as the mean absolute distance of the cases from the model.
  • Mean square error (MSE)— Similar to MAE but squares the residuals before finding their mean. This means MSE is more sensitive to outliers than MAE, because the size of the squared residual grows quadratically, the further from the model prediction it is. MSE is the default performance metric for regression learners in mlr. The choice of MSE or MAE depends on how you want to treat outliers in your data: if you want your model to be able to predict such cases, use MSE; otherwise, if you want your model to be less sensitive to outliers, use MAE.
  • Root mean square error (RMSE)— Because MSE squares the residual, its value isn’t on the same scale as the outcome variable. Instead, if we take the square root of the MSE, we get the RMSE. When tuning hyperparameters and comparing models, MSE and RMSE will always select the same models (because RMSE is simply a transformation of MSE), but RMSE has the benefit of being on the same scale as our outcome variable and so is more interpretable.

Excerpt From: Hefin I. Rhys. “Machine Learning with R, the tidyverse, and mlr.” Apple Books.

Exercise 2

Repeat the feature-filtering process in listings 9.8 and 9.9, but use the default fw.method argument (randomForestSRC_importance, or just don’t supply it). Does this select the same number of predictors as when we used linear correlation? Which method was faster?

Listing 9.10. Training the model with filtered features

filteredTask <- filterFeatures(ozoneTask, fval = filterVals,
                               abs = unlist(tunedFeats$x))

filteredModel <- train(lin, filteredTask)

Listing 9.11. Using a wrapper method for feature selection

featSelControl <- makeFeatSelControlSequential(method = "sfbs")

selFeats <- selectFeatures(learner = lin, task = ozoneTask,
                           resampling = kFold, control = featSelControl)
## [FeatSel] Started selecting features for learner 'regr.lm'
## With control class: FeatSelControlSequential
## Imputation value: Inf
## [FeatSel-x] 1: 111111111111 (12 bits)
## [FeatSel-y] 1: mse.test.mean=20.5296654; time: 0.0 min
## [FeatSel-x] 2: 011111111111 (11 bits)
## [FeatSel-y] 2: mse.test.mean=21.1348189; time: 0.0 min
## [FeatSel-x] 2: 101111111111 (11 bits)
## [FeatSel-y] 2: mse.test.mean=20.5070474; time: 0.0 min
## [FeatSel-x] 2: 110111111111 (11 bits)
## [FeatSel-y] 2: mse.test.mean=20.4789559; time: 0.0 min
## [FeatSel-x] 2: 111011111111 (11 bits)
## [FeatSel-y] 2: mse.test.mean=20.7191719; time: 0.0 min
## [FeatSel-x] 2: 111101111111 (11 bits)
## [FeatSel-y] 2: mse.test.mean=20.4503279; time: 0.0 min
## [FeatSel-x] 2: 111110111111 (11 bits)
## [FeatSel-y] 2: mse.test.mean=21.5244861; time: 0.0 min
## [FeatSel-x] 2: 111111011111 (11 bits)
## [FeatSel-y] 2: mse.test.mean=21.5342347; time: 0.0 min
## [FeatSel-x] 2: 111111101111 (11 bits)
## [FeatSel-y] 2: mse.test.mean=21.1358867; time: 0.0 min
## [FeatSel-x] 2: 111111110111 (11 bits)
## [FeatSel-y] 2: mse.test.mean=20.5869697; time: 0.0 min
## [FeatSel-x] 2: 111111111011 (11 bits)
## [FeatSel-y] 2: mse.test.mean=20.4238257; time: 0.0 min
## [FeatSel-x] 2: 111111111101 (11 bits)
## [FeatSel-y] 2: mse.test.mean=20.4394423; time: 0.0 min
## [FeatSel-x] 2: 111111111110 (11 bits)
## [FeatSel-y] 2: mse.test.mean=20.4874680; time: 0.0 min
## [FeatSel-x] 3: 111111111111 (12 bits)
## [FeatSel-y] 3: mse.test.mean=20.5296654; time: 0.0 min
## [FeatSel-x] 4: 011111111011 (10 bits)
## [FeatSel-y] 4: mse.test.mean=21.0873295; time: 0.0 min
## [FeatSel-x] 4: 101111111011 (10 bits)
## [FeatSel-y] 4: mse.test.mean=20.3951520; time: 0.0 min
## [FeatSel-x] 4: 110111111011 (10 bits)
## [FeatSel-y] 4: mse.test.mean=20.3784420; time: 0.0 min
## [FeatSel-x] 4: 111011111011 (10 bits)
## [FeatSel-y] 4: mse.test.mean=20.6320829; time: 0.0 min
## [FeatSel-x] 4: 111101111011 (10 bits)
## [FeatSel-y] 4: mse.test.mean=20.3517750; time: 0.0 min
## [FeatSel-x] 4: 111110111011 (10 bits)
## [FeatSel-y] 4: mse.test.mean=21.9178686; time: 0.0 min
## [FeatSel-x] 4: 111111011011 (10 bits)
## [FeatSel-y] 4: mse.test.mean=21.7552399; time: 0.0 min
## [FeatSel-x] 4: 111111101011 (10 bits)
## [FeatSel-y] 4: mse.test.mean=21.0354037; time: 0.0 min
## [FeatSel-x] 4: 111111110011 (10 bits)
## [FeatSel-y] 4: mse.test.mean=20.4669298; time: 0.0 min
## [FeatSel-x] 4: 111111111001 (10 bits)
## [FeatSel-y] 4: mse.test.mean=20.2846921; time: 0.0 min
## [FeatSel-x] 4: 111111111010 (10 bits)
## [FeatSel-y] 4: mse.test.mean=20.3889436; time: 0.0 min
## [FeatSel-x] 5: 111111111101 (11 bits)
## [FeatSel-y] 5: mse.test.mean=20.4394423; time: 0.0 min
## [FeatSel-x] 5: 111111111011 (11 bits)
## [FeatSel-y] 5: mse.test.mean=20.4238257; time: 0.0 min
## [FeatSel-x] 6: 011111111001 (9 bits)
## [FeatSel-y] 6: mse.test.mean=20.9541819; time: 0.0 min
## [FeatSel-x] 6: 101111111001 (9 bits)
## [FeatSel-y] 6: mse.test.mean=20.2605275; time: 0.0 min
## [FeatSel-x] 6: 110111111001 (9 bits)
## [FeatSel-y] 6: mse.test.mean=20.2431030; time: 0.0 min
## [FeatSel-x] 6: 111011111001 (9 bits)
## [FeatSel-y] 6: mse.test.mean=20.5056083; time: 0.0 min
## [FeatSel-x] 6: 111101111001 (9 bits)
## [FeatSel-y] 6: mse.test.mean=20.2151333; time: 0.0 min
## [FeatSel-x] 6: 111110111001 (9 bits)
## [FeatSel-y] 6: mse.test.mean=21.7520501; time: 0.0 min
## [FeatSel-x] 6: 111111011001 (9 bits)
## [FeatSel-y] 6: mse.test.mean=21.7147693; time: 0.0 min
## [FeatSel-x] 6: 111111101001 (9 bits)
## [FeatSel-y] 6: mse.test.mean=21.2631281; time: 0.0 min
## [FeatSel-x] 6: 111111110001 (9 bits)
## [FeatSel-y] 6: mse.test.mean=20.6554549; time: 0.0 min
## [FeatSel-x] 6: 111111111000 (9 bits)
## [FeatSel-y] 6: mse.test.mean=20.2569687; time: 0.0 min
## [FeatSel-x] 7: 111111111001 (10 bits)
## [FeatSel-y] 7: mse.test.mean=20.2846921; time: 0.0 min
## [FeatSel-x] 7: 111101111101 (10 bits)
## [FeatSel-y] 7: mse.test.mean=20.3638843; time: 0.0 min
## [FeatSel-x] 7: 111101111011 (10 bits)
## [FeatSel-y] 7: mse.test.mean=20.3517750; time: 0.0 min
## [FeatSel-x] 8: 011101111001 (8 bits)
## [FeatSel-y] 8: mse.test.mean=20.8467065; time: 0.0 min
## [FeatSel-x] 8: 101101111001 (8 bits)
## [FeatSel-y] 8: mse.test.mean=20.1791622; time: 0.0 min
## [FeatSel-x] 8: 110101111001 (8 bits)
## [FeatSel-y] 8: mse.test.mean=20.1744230; time: 0.0 min
## [FeatSel-x] 8: 111001111001 (8 bits)
## [FeatSel-y] 8: mse.test.mean=20.3946784; time: 0.0 min
## [FeatSel-x] 8: 111100111001 (8 bits)
## [FeatSel-y] 8: mse.test.mean=21.6446211; time: 0.0 min
## [FeatSel-x] 8: 111101011001 (8 bits)
## [FeatSel-y] 8: mse.test.mean=21.5958684; time: 0.0 min
## [FeatSel-x] 8: 111101101001 (8 bits)
## [FeatSel-y] 8: mse.test.mean=21.1622986; time: 0.0 min
## [FeatSel-x] 8: 111101110001 (8 bits)
## [FeatSel-y] 8: mse.test.mean=20.6966062; time: 0.0 min
## [FeatSel-x] 8: 111101111000 (8 bits)
## [FeatSel-y] 8: mse.test.mean=20.2188527; time: 0.0 min
## [FeatSel-x] 9: 111101111001 (9 bits)
## [FeatSel-y] 9: mse.test.mean=20.2151333; time: 0.0 min
## [FeatSel-x] 9: 110111111001 (9 bits)
## [FeatSel-y] 9: mse.test.mean=20.2431030; time: 0.0 min
## [FeatSel-x] 9: 110101111101 (9 bits)
## [FeatSel-y] 9: mse.test.mean=20.3197904; time: 0.0 min
## [FeatSel-x] 9: 110101111011 (9 bits)
## [FeatSel-y] 9: mse.test.mean=20.3080828; time: 0.0 min
## [FeatSel-x] 10: 010101111001 (7 bits)
## [FeatSel-y] 10: mse.test.mean=20.8117349; time: 0.0 min
## [FeatSel-x] 10: 100101111001 (7 bits)
## [FeatSel-y] 10: mse.test.mean=20.1386733; time: 0.0 min
## [FeatSel-x] 10: 110001111001 (7 bits)
## [FeatSel-y] 10: mse.test.mean=20.3594012; time: 0.0 min
## [FeatSel-x] 10: 110100111001 (7 bits)
## [FeatSel-y] 10: mse.test.mean=21.5603552; time: 0.0 min
## [FeatSel-x] 10: 110101011001 (7 bits)
## [FeatSel-y] 10: mse.test.mean=21.5617806; time: 0.0 min
## [FeatSel-x] 10: 110101101001 (7 bits)
## [FeatSel-y] 10: mse.test.mean=21.1291058; time: 0.0 min
## [FeatSel-x] 10: 110101110001 (7 bits)
## [FeatSel-y] 10: mse.test.mean=20.6516162; time: 0.0 min
## [FeatSel-x] 10: 110101111000 (7 bits)
## [FeatSel-y] 10: mse.test.mean=20.1776587; time: 0.0 min
## [FeatSel-x] 11: 110101111001 (8 bits)
## [FeatSel-y] 11: mse.test.mean=20.1744230; time: 0.0 min
## [FeatSel-x] 11: 101101111001 (8 bits)
## [FeatSel-y] 11: mse.test.mean=20.1791622; time: 0.0 min
## [FeatSel-x] 11: 100111111001 (8 bits)
## [FeatSel-y] 11: mse.test.mean=20.2192502; time: 0.0 min
## [FeatSel-x] 11: 100101111101 (8 bits)
## [FeatSel-y] 11: mse.test.mean=20.2902511; time: 0.0 min
## [FeatSel-x] 11: 100101111011 (8 bits)
## [FeatSel-y] 11: mse.test.mean=20.2676762; time: 0.0 min
## [FeatSel-x] 12: 000101111001 (6 bits)
## [FeatSel-y] 12: mse.test.mean=20.7756227; time: 0.0 min
## [FeatSel-x] 12: 100001111001 (6 bits)
## [FeatSel-y] 12: mse.test.mean=20.3149545; time: 0.0 min
## [FeatSel-x] 12: 100100111001 (6 bits)
## [FeatSel-y] 12: mse.test.mean=21.5122089; time: 0.0 min
## [FeatSel-x] 12: 100101011001 (6 bits)
## [FeatSel-y] 12: mse.test.mean=21.5417695; time: 0.0 min
## [FeatSel-x] 12: 100101101001 (6 bits)
## [FeatSel-y] 12: mse.test.mean=21.0892112; time: 0.0 min
## [FeatSel-x] 12: 100101110001 (6 bits)
## [FeatSel-y] 12: mse.test.mean=20.6120161; time: 0.0 min
## [FeatSel-x] 12: 100101111000 (6 bits)
## [FeatSel-y] 12: mse.test.mean=20.1376675; time: 0.0 min
## [FeatSel-x] 13: 110101111000 (7 bits)
## [FeatSel-y] 13: mse.test.mean=20.1776587; time: 0.0 min
## [FeatSel-x] 13: 101101111000 (7 bits)
## [FeatSel-y] 13: mse.test.mean=20.1779629; time: 0.0 min
## [FeatSel-x] 13: 100111111000 (7 bits)
## [FeatSel-y] 13: mse.test.mean=20.1902970; time: 0.0 min
## [FeatSel-x] 13: 100101111100 (7 bits)
## [FeatSel-y] 13: mse.test.mean=20.2790974; time: 0.0 min
## [FeatSel-x] 13: 100101111010 (7 bits)
## [FeatSel-y] 13: mse.test.mean=20.2558664; time: 0.0 min
## [FeatSel-x] 13: 100101111001 (7 bits)
## [FeatSel-y] 13: mse.test.mean=20.1386733; time: 0.0 min
## [FeatSel-x] 14: 000101111000 (5 bits)
## [FeatSel-y] 14: mse.test.mean=20.7358638; time: 0.0 min
## [FeatSel-x] 14: 100001111000 (5 bits)
## [FeatSel-y] 14: mse.test.mean=20.2874228; time: 0.0 min
## [FeatSel-x] 14: 100100111000 (5 bits)
## [FeatSel-y] 14: mse.test.mean=22.0801073; time: 0.0 min
## [FeatSel-x] 14: 100101011000 (5 bits)
## [FeatSel-y] 14: mse.test.mean=21.4436465; time: 0.0 min
## [FeatSel-x] 14: 100101101000 (5 bits)
## [FeatSel-y] 14: mse.test.mean=21.1474516; time: 0.0 min
## [FeatSel-x] 14: 100101110000 (5 bits)
## [FeatSel-y] 14: mse.test.mean=20.7367145; time: 0.0 min
## [FeatSel-x] 15: 110101111000 (7 bits)
## [FeatSel-y] 15: mse.test.mean=20.1776587; time: 0.0 min
## [FeatSel-x] 15: 101101111000 (7 bits)
## [FeatSel-y] 15: mse.test.mean=20.1779629; time: 0.0 min
## [FeatSel-x] 15: 100111111000 (7 bits)
## [FeatSel-y] 15: mse.test.mean=20.1902970; time: 0.0 min
## [FeatSel-x] 15: 100101111100 (7 bits)
## [FeatSel-y] 15: mse.test.mean=20.2790974; time: 0.0 min
## [FeatSel-x] 15: 100101111010 (7 bits)
## [FeatSel-y] 15: mse.test.mean=20.2558664; time: 0.0 min
## [FeatSel-x] 15: 100101111001 (7 bits)
## [FeatSel-y] 15: mse.test.mean=20.1386733; time: 0.0 min
## [FeatSel] Result: Month,Humid,Temp_Sand,Temp_... (5 bits)
selFeats
## FeatSel result:
## Features (5): Month, Humid, Temp_Sand, Temp_Monte, Inv_height
## mse.test.mean=20.2874228

Listing 9.12. Using a wrapper method for feature selection

ozoneSelFeat <- ozoneImp$data[, c("Ozone", selFeats$x)]

ozoneSelFeatTask <- makeRegrTask(data = ozoneSelFeat, target = "Ozone")

wrapperModel <- train(lin, ozoneSelFeatTask)

9.2.4. Including imputation and feature selection in cross-validation

Listing 9.13. Combining imputation and feature selection wrappers

imputeMethod <- imputeLearner("regr.rpart")

imputeWrapper <- makeImputeWrapper(lin,
                                   classes = list(numeric = imputeMethod))

featSelWrapper <- makeFeatSelWrapper(learner = imputeWrapper,
                                     resampling = kFold,
                                     control = featSelControl)

Listing 9.14. Cross-validating the model-building process

library(parallel)
library(parallelMap)

ozoneTaskWithNAs <- makeRegrTask(data = ozoneClean, target = "Ozone")
## Warning in makeTask(type = type, data = data, weights = weights, blocking =
## blocking, : Provided data is not a pure data.frame but from class tbl_df, hence
## it will be converted.
kFold3 <- makeResampleDesc("CV", iters = 3)

parallelStartSocket(cpus = detectCores())
## Starting parallelization in mode=socket with cpus=8.
lmCV <- resample(featSelWrapper, ozoneTaskWithNAs, resampling = kFold3)
## Exporting objects to slaves for mode socket: .mlr.slave.options
## Resampling: cross-validation
## Measures:             mse
## Mapping in parallel: mode = socket; level = mlr.resample; cpus = 8; elements = 3.
## 
## Aggregated Result: mse.test.mean=20.8957566
## 
parallelStop()
## Stopped parallelization. All cleaned up.
lmCV
## Resample Result
## Task: ozoneClean
## Learner: regr.lm.imputed.featsel
## Aggr perf: mse.test.mean=20.8957566
## Runtime: 58.0066

The wrapper method for feature selection

9.2.5. Interpreting the model

Listing 9.15. Interpreting the model

wrapperModelData <- getLearnerModel(wrapperModel)

summary(wrapperModelData)
## 
## Call:
## stats::lm(formula = f, data = d)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.7217  -2.8128  -0.2797   2.7788  13.8328 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.516e+01  1.950e+00  -7.775 8.24e-14 ***
## Month       -2.895e-01  7.354e-02  -3.937 9.94e-05 ***
## Humid        8.891e-02  1.310e-02   6.787 4.83e-11 ***
## Temp_Sand    2.044e-01  4.252e-02   4.807 2.26e-06 ***
## Temp_Monte   2.140e-01  5.555e-02   3.853 0.000139 ***
## Inv_height  -5.521e-04  1.762e-04  -3.134 0.001871 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.482 on 355 degrees of freedom
## Multiple R-squared:  0.6838, Adjusted R-squared:  0.6794 
## F-statistic: 153.6 on 5 and 355 DF,  p-value: < 2.2e-16

Listing 9.16. Creating diagnostic plots of the model

par(mfrow = c(2, 2))
plot(wrapperModelData)

par(mfrow = c(1, 1))

Residuals vs. Fitted plot

shows the predicted ozone level on the x-axis and the residual on the y-axis for each case. We hope that there are no patterns in this plot.

In other words, the amount of error shouldn’t depend on the predicted value.

In this situation, we have a curved relationship. This indicates that we have nonlinear relationships between predictors and ozone, and/or heteroscedasticity.

The Normal Q-Q (quantile-quantile) plot

shows the quantiles of the model residuals plotted against their quantiles if they were drawn from a theoretical normal distribution. If the data deviates considerably from a 1:1 diagonal line, this suggests the residuals are not normally distributed. This doesn’t seem to be a problem for this model: the residuals line up nicely on the diagonal.

Scale-Location plot

helps us identify heteroscedasticity of the residuals. There should be no patterns here, but it looks like the residuals are increasingly varied with larger predicted values, suggesting heteroscedasticity

Residuals vs. Leverage plot

helps us to identify cases that have excessive influence on the model parameters (potential outliers). Cases that fall inside a dotted region of the plot called Cook’s distance may be outliers whose inclusion or exclusion makes a large difference to the model. Because we can’t even see Cook’s distance here (it is beyond the axis limits), we have no worries about outliers.

These diagnostic plots (particularly the Residuals vs. Fitted plot) indicate the presence of nonlinear relationships between the predictor variables and the outcome variable.

Short Description of Diagnostic Plots

Residuals vs. Fitted plot

There are no patterns in this plot. The amount of error shouldn’t depend on the predicted value.

Normal Q-Q (quantile-quantile) plot

the quantiles of the model residuals plotted against their quantiles if they were drawn from a theoretical normal distribution. If the data deviates considerably from a 1:1 diagonal line, this suggests the residuals are not normally distributed.

Scale-Location plot

Indentifies Heteroscedasticity of the residuals. There should be no patterns

Residuals vs. Leverage plot

helps identify cases that have excessive influence on the model parameters (potential outliers)

9.3. Strengths and weaknesses of linear regression

Exercise 3

Instead of using a wrapper method, cross-validate the process of building our model using a filter method. Are the estimated MSE values similar? Which method is faster? Tips:

First, create a filter wrapper using our imputeWrapper as the learner. Define a hyperparameter space to tune “fw.abs” using makeParamSet(). Define a tuning wrapper that takes the filter wrapper as a learner and performs a grid search. Use resample() to perform cross-validation, using the tuning wrapper as the learner.