
library(mlr)
library(tidyverse)

Introduction

Dimension reduction is the process of learning how the information in a set of variables can be condensed into a smaller number of variables, with as little information loss as possible.

The phenomenon whereby the predictive power of a model first increases as the number of predictor variables grows, but then decreases again as we continue to add more predictors, is called the Hughes phenomenon, after the statistician G. Hughes.

The two most fundamental and commonly used discriminant analysis algorithms are:

  • Linear discriminant analysis (LDA)
  • Quadratic discriminant analysis (QDA)

QDA is a generalization of LDA.

LDA

LDA aims to learn a new representation of the data that separates the centroid of each class, while keeping the within-class variance as low as possible. A centroid is simply the point in the feature space that is the mean of all the predictors (a vector of means, one for each dimension). Then LDA finds a line through the origin that, when the data is projected onto it, simultaneously does the following:

  • Maximizes the difference between the class centroids along the line
  • Minimizes the within-class variance along the line

This new axis is called a discriminant function (DF), and it is a linear combination of the original variables. For example, a discriminant function could be described by this equation:

DF = –0.5 × var1 + 1.2 × var2 + 0.85 × var3
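A case's DF score is therefore just a weighted sum of its predictor values. A minimal sketch, using the (hypothetical) coefficients and an invented case from the equation above:

# Illustrative only: coefficients and case values are made up to match the equation
coefs <- c(var1 = -0.5, var2 = 1.2, var3 = 0.85)
case  <- c(var1 =  2.1, var2 = 0.7, var3 = 1.3)
sum(coefs * case)   # this case's position along the discriminant function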

Linear discriminant analysis vs. principal component analysis

5.1.2. What if we have more than two classes?

5.1.3. Learning curves instead of straight lines: QDA

LDA performs well if the data within each class is normally distributed across all the predictor variables, and the classes have similar covariances.

Covariance describes how much one variable changes as another variable changes. LDA assumes that the predictor variables covary with each other by the same amount in every class; in other words, the classes share a common covariance matrix.

This often isn’t the case, and classes have different covariances. In this situation, QDA tends to perform better than LDA because it doesn’t make this assumption (though it still assumes the data is normally distributed). Instead of learning straight lines that separate the classes, QDA learns curved lines. It is also well suited, therefore, to situations in which classes are best separated by a nonlinear decision boundary.
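An informal way to gauge whether the equal-covariance assumption is plausible is to compute the covariance matrix of the predictors separately for each class and compare them. A quick sketch using the built-in iris data (the wine data isn't loaded until the next section); this is an eyeball check, not a formal test:

# Per-class covariance matrices: if these differ markedly, QDA may fit better than LDA
lapply(split(iris[, 1:4], iris$Species), cov)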

5.2. Building your first linear and quadratic discriminant models

5.2.1. Loading and exploring the wine dataset

Listing 5.1. Loading and exploring the wine dataset

#install.packages("HDclassif")

data(wine, package = "HDclassif")

wineTib <- as_tibble(wine)
wineTib
## # A tibble: 178 x 14
##    class    V1    V2    V3    V4    V5    V6    V7    V8    V9   V10   V11   V12
##    <int> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1     1  14.2  1.71  2.43  15.6   127  2.8   3.06 0.28   2.29  5.64  1.04  3.92
##  2     1  13.2  1.78  2.14  11.2   100  2.65  2.76 0.26   1.28  4.38  1.05  3.4 
##  3     1  13.2  2.36  2.67  18.6   101  2.8   3.24 0.3    2.81  5.68  1.03  3.17
##  4     1  14.4  1.95  2.5   16.8   113  3.85  3.49 0.24   2.18  7.8   0.86  3.45
##  5     1  13.2  2.59  2.87  21     118  2.8   2.69 0.39   1.82  4.32  1.04  2.93
##  6     1  14.2  1.76  2.45  15.2   112  3.27  3.39 0.34   1.97  6.75  1.05  2.85
##  7     1  14.4  1.87  2.45  14.6    96  2.5   2.52 0.3    1.98  5.25  1.02  3.58
##  8     1  14.1  2.15  2.61  17.6   121  2.6   2.51 0.31   1.25  5.05  1.06  3.58
##  9     1  14.8  1.64  2.17  14      97  2.8   2.98 0.290  1.98  5.2   1.08  2.85
## 10     1  13.9  1.35  2.27  16      98  2.98  3.15 0.22   1.85  7.22  1.01  3.55
## # … with 168 more rows, and 1 more variable: V13 <int>

Listing 5.2. Cleaning the dataset

names(wineTib) <- c("Class", "Alco", "Malic", "Ash", "Alk", "Mag",
                    "Phe", "Flav", "Non_flav", "Proan", "Col", "Hue",
                    "OD", "Prol")

wineTib$Class <- as.factor(wineTib$Class)

wineTib
## # A tibble: 178 x 14
##    Class  Alco Malic   Ash   Alk   Mag   Phe  Flav Non_flav Proan   Col   Hue
##    <fct> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl>    <dbl> <dbl> <dbl> <dbl>
##  1 1      14.2  1.71  2.43  15.6   127  2.8   3.06    0.28   2.29  5.64  1.04
##  2 1      13.2  1.78  2.14  11.2   100  2.65  2.76    0.26   1.28  4.38  1.05
##  3 1      13.2  2.36  2.67  18.6   101  2.8   3.24    0.3    2.81  5.68  1.03
##  4 1      14.4  1.95  2.5   16.8   113  3.85  3.49    0.24   2.18  7.8   0.86
##  5 1      13.2  2.59  2.87  21     118  2.8   2.69    0.39   1.82  4.32  1.04
##  6 1      14.2  1.76  2.45  15.2   112  3.27  3.39    0.34   1.97  6.75  1.05
##  7 1      14.4  1.87  2.45  14.6    96  2.5   2.52    0.3    1.98  5.25  1.02
##  8 1      14.1  2.15  2.61  17.6   121  2.6   2.51    0.31   1.25  5.05  1.06
##  9 1      14.8  1.64  2.17  14      97  2.8   2.98    0.290  1.98  5.2   1.08
## 10 1      13.9  1.35  2.27  16      98  2.98  3.15    0.22   1.85  7.22  1.01
## # … with 168 more rows, and 2 more variables: OD <dbl>, Prol <int>

5.2.2. Plotting the data

Listing 5.3. Creating an untidy tibble for plotting

wineUntidy <- gather(wineTib, "Variable", "Value", -Class)

ggplot(wineUntidy, aes(Class, Value)) +
  facet_wrap(~ Variable, scales = "free_y") +
  geom_boxplot() +
  theme_bw()
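Note that gather() is superseded in current tidyr. Assuming tidyr >= 1.0.0 is installed, an equivalent call with pivot_longer() would be:

wineUntidy <- pivot_longer(wineTib, -Class,
                           names_to = "Variable", values_to = "Value")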

5.2.3. Training the models

Listing 5.4. Creating the task and learner, and training the model

wineTask <- makeClassifTask(data = wineTib, target = "Class")
## Warning in makeTask(type = type, data = data, weights = weights, blocking =
## blocking, : Provided data is not a pure data.frame but from class tbl_df, hence
## it will be converted.
lda <- makeLearner("classif.lda")

ldaModel <- train(lda, wineTask)
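The warning above is harmless: mlr simply converts the tibble to a data frame internally. If you prefer to avoid it, one option is to convert explicitly before creating the task:

wineTask <- makeClassifTask(data = as.data.frame(wineTib), target = "Class")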

Listing 5.5. Extracting DF values for each case

ldaModelData <- getLearnerModel(ldaModel)

ldaPreds <- predict(ldaModelData)$x

head(ldaPreds)
##         LD1       LD2
## 1 -4.700244 1.9791383
## 2 -4.301958 1.1704129
## 3 -3.420720 1.4291014
## 4 -4.205754 4.0028715
## 5 -1.509982 0.4512239
## 6 -4.518689 3.2131376
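The lda object returned by MASS (which classif.lda wraps) also stores singular values in its svd component; squaring and normalizing them shows roughly how much of the between-class separation each discriminant function captures. A quick sketch:

# Share of between-class separation captured by LD1 and LD2
ldaModelData$svd^2 / sum(ldaModelData$svd^2)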

Listing 5.6. Plotting the DF values against each other

wineTib %>%
  mutate(LD1 = ldaPreds[, 1],
         LD2 = ldaPreds[, 2]) %>%
  ggplot(aes(LD1, LD2, col = Class)) +
  geom_point() +
  stat_ellipse() +
  theme_bw()

Listing 5.7. Creating the QDA learner and training the model

qda <- makeLearner("classif.qda")

qdaModel <- train(qda, wineTask)

Listing 5.8. Cross-validating the LDA and QDA models

Ideally we would run 50 reps; here we use 5 to keep the report smaller.

kFold <- makeResampleDesc(method = "RepCV",
                          folds = 10, 
                          reps = 5, # reduced from 50 to keep the report small
                          stratify = TRUE)

ldaCV <- resample(learner = lda, task = wineTask, resampling = kFold,
                  measures = list(mmce, acc))
## Resampling: repeated cross-validation
## Measures:             mmce      acc
## [Resample] iter 1:    0.0000000 1.0000000
## [Resample] iter 2:    0.0555556 0.9444444
## [Resample] iter 3:    0.0000000 1.0000000
## [Resample] iter 4:    0.0000000 1.0000000
## [Resample] iter 5:    0.0000000 1.0000000
## [Resample] iter 6:    0.0555556 0.9444444
## [Resample] iter 7:    0.0000000 1.0000000
## [Resample] iter 8:    0.0555556 0.9444444
## [Resample] iter 9:    0.0000000 1.0000000
## [Resample] iter 10:   0.0000000 1.0000000
## [Resample] iter 11:   0.0000000 1.0000000
## [Resample] iter 12:   0.0000000 1.0000000
## [Resample] iter 13:   0.0000000 1.0000000
## [Resample] iter 14:   0.0000000 1.0000000
## [Resample] iter 15:   0.0555556 0.9444444
## [Resample] iter 16:   0.0000000 1.0000000
## [Resample] iter 17:   0.0000000 1.0000000
## [Resample] iter 18:   0.0000000 1.0000000
## [Resample] iter 19:   0.0000000 1.0000000
## [Resample] iter 20:   0.0555556 0.9444444
## [Resample] iter 21:   0.0000000 1.0000000
## [Resample] iter 22:   0.0588235 0.9411765
## [Resample] iter 23:   0.1111111 0.8888889
## [Resample] iter 24:   0.0000000 1.0000000
## [Resample] iter 25:   0.0000000 1.0000000
## [Resample] iter 26:   0.0000000 1.0000000
## [Resample] iter 27:   0.0000000 1.0000000
## [Resample] iter 28:   0.0000000 1.0000000
## [Resample] iter 29:   0.0000000 1.0000000
## [Resample] iter 30:   0.0000000 1.0000000
## [Resample] iter 31:   0.0000000 1.0000000
## [Resample] iter 32:   0.0000000 1.0000000
## [Resample] iter 33:   0.0000000 1.0000000
## [Resample] iter 34:   0.0000000 1.0000000
## [Resample] iter 35:   0.0000000 1.0000000
## [Resample] iter 36:   0.0555556 0.9444444
## [Resample] iter 37:   0.0555556 0.9444444
## [Resample] iter 38:   0.0000000 1.0000000
## [Resample] iter 39:   0.0000000 1.0000000
## [Resample] iter 40:   0.0000000 1.0000000
## [Resample] iter 41:   0.0000000 1.0000000
## [Resample] iter 42:   0.0000000 1.0000000
## [Resample] iter 43:   0.1111111 0.8888889
## [Resample] iter 44:   0.0000000 1.0000000
## [Resample] iter 45:   0.0000000 1.0000000
## [Resample] iter 46:   0.0555556 0.9444444
## [Resample] iter 47:   0.0555556 0.9444444
## [Resample] iter 48:   0.0000000 1.0000000
## [Resample] iter 49:   0.0000000 1.0000000
## [Resample] iter 50:   0.0000000 1.0000000
## 
## Aggregated Result: mmce.test.mean=0.0156209,acc.test.mean=0.9843791
## 
qdaCV <- resample(learner = qda, task = wineTask, resampling = kFold,
                  measures = list(mmce, acc))
## Resampling: repeated cross-validation
## Measures:             mmce      acc
## [Resample] iter 1:    0.0000000 1.0000000
## [Resample] iter 2:    0.0555556 0.9444444
## [Resample] iter 3:    0.0000000 1.0000000
## [Resample] iter 4:    0.0588235 0.9411765
## [Resample] iter 5:    0.0000000 1.0000000
## [Resample] iter 6:    0.0000000 1.0000000
## [Resample] iter 7:    0.0000000 1.0000000
## [Resample] iter 8:    0.0000000 1.0000000
## [Resample] iter 9:    0.0000000 1.0000000
## [Resample] iter 10:   0.0000000 1.0000000
## [Resample] iter 11:   0.0555556 0.9444444
## [Resample] iter 12:   0.0000000 1.0000000
## [Resample] iter 13:   0.0000000 1.0000000
## [Resample] iter 14:   0.0000000 1.0000000
## [Resample] iter 15:   0.0000000 1.0000000
## [Resample] iter 16:   0.0000000 1.0000000
## [Resample] iter 17:   0.0000000 1.0000000
## [Resample] iter 18:   0.0000000 1.0000000
## [Resample] iter 19:   0.0000000 1.0000000
## [Resample] iter 20:   0.0000000 1.0000000
## [Resample] iter 21:   0.0000000 1.0000000
## [Resample] iter 22:   0.0000000 1.0000000
## [Resample] iter 23:   0.0555556 0.9444444
## [Resample] iter 24:   0.0000000 1.0000000
## [Resample] iter 25:   0.0000000 1.0000000
## [Resample] iter 26:   0.0000000 1.0000000
## [Resample] iter 27:   0.0000000 1.0000000
## [Resample] iter 28:   0.0000000 1.0000000
## [Resample] iter 29:   0.0000000 1.0000000
## [Resample] iter 30:   0.0000000 1.0000000
## [Resample] iter 31:   0.0000000 1.0000000
## [Resample] iter 32:   0.0555556 0.9444444
## [Resample] iter 33:   0.0000000 1.0000000
## [Resample] iter 34:   0.0000000 1.0000000
## [Resample] iter 35:   0.0000000 1.0000000
## [Resample] iter 36:   0.0000000 1.0000000
## [Resample] iter 37:   0.0555556 0.9444444
## [Resample] iter 38:   0.0000000 1.0000000
## [Resample] iter 39:   0.0000000 1.0000000
## [Resample] iter 40:   0.0000000 1.0000000
## [Resample] iter 41:   0.0000000 1.0000000
## [Resample] iter 42:   0.0000000 1.0000000
## [Resample] iter 43:   0.0000000 1.0000000
## [Resample] iter 44:   0.0000000 1.0000000
## [Resample] iter 45:   0.0555556 0.9444444
## [Resample] iter 46:   0.0000000 1.0000000
## [Resample] iter 47:   0.0000000 1.0000000
## [Resample] iter 48:   0.0588235 0.9411765
## [Resample] iter 49:   0.0555556 0.9444444
## [Resample] iter 50:   0.0000000 1.0000000
## 
## Aggregated Result: mmce.test.mean=0.0101307,acc.test.mean=0.9898693
## 
ldaCV$aggr
## mmce.test.mean  acc.test.mean 
##     0.01562092     0.98437908
qdaCV$aggr
## mmce.test.mean  acc.test.mean 
##     0.01013072     0.98986928

Confusion Matrix LDA

calculateConfusionMatrix(ldaCV$pred, relative = TRUE)
## Relative confusion matrix (normalized by row/column):
##         predicted
## true     1           2           3           -err.-     
##   1      0.997/0.990 0.003/0.003 0.000/0.000 0.003      
##   2      0.008/0.010 0.969/0.991 0.023/0.033 0.031      
##   3      0.000/0.000 0.008/0.006 0.992/0.967 0.008      
##   -err.-       0.010       0.009       0.033 0.02       
## 
## 
## Absolute confusion matrix:
##         predicted
## true       1   2   3 -err.-
##   1      294   1   0      1
##   2        3 344   8     11
##   3        0   2 238      2
##   -err.-   3   3   8     14

Confusion Matrix QDA

calculateConfusionMatrix(qdaCV$pred, relative = TRUE)
## Relative confusion matrix (normalized by row/column):
##         predicted
## true     1           2           3           -err.-     
##   1      0.993/0.983 0.007/0.006 0.000/0.000 0.007      
##   2      0.014/0.017 0.986/0.989 0.000/0.000 0.014      
##   3      0.000/0.000 0.008/0.006 0.992/1.000 0.008      
##   -err.-        0.02        0.01        0.00 0.01       
## 
## 
## Absolute confusion matrix:
##         predicted
## true       1   2   3 -err.-
##   1      293   2   0      2
##   2        5 350   0      5
##   3        0   2 238      2
##   -err.-   5   4   0      9

5.3. Strengths and weaknesses of LDA and QDA

Exercise 1

Interpret the confusion matrices shown in the previous section.

Which model is better at identifying wines from vineyard 3? Does our LDA model misclassify more wines from vineyard 2 as being from vineyard 1 or vineyard 3?

Exercise 2

Extract the discriminant scores from our LDA model, and use only these as the predictors for a kNN model (including tuning k). Experiment with your own cross-validation strategy.
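One possible sketch of a solution, reusing the ldaPreds scores extracted earlier; the parameter grid and resampling choices below are illustrative, not tuned recommendations:

# Build a task that uses only the two discriminant scores as predictors
ldaScores <- tibble(LD1 = ldaPreds[, 1], LD2 = ldaPreds[, 2],
                    Class = wineTib$Class)

knnTask <- makeClassifTask(data = as.data.frame(ldaScores), target = "Class")

# Tune k over a small grid with repeated, stratified cross-validation
knnParamSpace <- makeParamSet(makeDiscreteParam("k", values = 1:10))

gridSearch <- makeTuneControlGrid()

cvForTuning <- makeResampleDesc("RepCV", folds = 10, reps = 5, stratify = TRUE)

tunedK <- tuneParams("classif.knn", task = knnTask,
                     resampling = cvForTuning,
                     par.set = knnParamSpace, control = gridSearch)
tunedK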