
Logistic regression relies on a straight-line equation (p. 88)

Logistic regression can handle continuous and categorical predictor variables.

It builds on linear regression: a linear combination of the predictor variables is used to model the log odds of class membership.

[Figure: linear separator]

The dotted line at y = 0.5 is the classification threshold.

[Figure: the logistic S-curve]

The S-curve gives fewer misclassifications than a straight-line fit.
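Not from the book: a quick way to draw that S-curve yourself, using base R's plogis() (the logistic function):

curve(plogis(x), from = -6, to = 6,
      xlab = "Log odds (z)", ylab = "p")
abline(h = 0.5, lty = "dashed")   # the y = 0.5 classification threshold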

Log odds (p. 91)

“Because probabilities are bounded between 0 and 1, it’s difficult to combine the information from two predictors”

“Odds are a convenient way of representing the likelihood of something occurring. They tell us how much more likely an event is to occur, rather than how likely it is not to occur.”

“whereas probability is bounded between 0 and 1, odds can take any positive value.”

“C3PO got his odds the wrong way around (as many people do). He should have said the odds of successfully navigating an asteroid field are approximately 1 to 3,720!”

C-3PO's original phrasing, “the odds of successfully navigating an asteroid field are approximately 3,720 to 1,” states the odds the wrong way around.

\[odds = \frac{p}{1-p}\]
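To make the C-3PO example concrete (a small sketch, not from the book): if the probability of successfully navigating the asteroid field is 1 in 3,721, the odds of success work out to 1 to 3,720.

p <- 1 / 3721
p / (1 - p)     # odds of success: 1/3720, i.e. 1 to 3,720
(1 - p) / p     # odds against success: 3,720 to 1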

\[log\;odds = ln \left(\frac{p}{1-p} \right) \;\; (4.3)\]

Equation 4.3 converts probabilities into log odds; it is also called the logit function.

When interpreting log odds:

“The more influence a predictor variable has on the log odds, the steeper the slope will be, while variables that have no predictive value will have a slope that is nearly horizontal.”

\[p = \frac{1}{1+e^{-z} }\;\;(4.4)\]

z is the log odds
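A tiny round trip in R (an illustration, not from the book): convert a probability to log odds with equation 4.3, then back with equation 4.4.

p <- 0.8
z <- log(p / (1 - p))   # equation 4.3: log odds (logit); qlogis(p) gives the same value
1 / (1 + exp(-z))       # equation 4.4: back to the probability, 0.8; plogis(z) is equivalent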

\[log\;odds = ln \left(\frac{p}{1-p} \right) = \beta_0 + \beta_1 X\; (4.5)\]

\[log\;odds = ln \left(\frac{p}{1-p} \right) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p \; (4.7)\]

Here p is the number of predictor variables (so the model has p + 1 parameters, including the intercept).

Softmax regression (multinomial logistic regression) handles more than two classes.
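A hedged sketch of what that might look like in mlr, assuming the "classif.multinom" learner (multinomial logistic regression via the nnet package) is available; it isn't needed for the Titanic example below, where Survived has only two classes.

# Assumed learner name; swap in for "classif.logreg" only when the target has more than 2 classes
multiLogReg <- makeLearner("classif.multinom", predict.type = "prob")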

Logistic Regression Example: Titanic

library(mlr)
library(tidyverse)
#install.packages("titanic")

data(titanic_train, package = "titanic")

titanicTib <- as_tibble(titanic_train)

titanicTib
## # A tibble: 891 x 12
##    PassengerId Survived Pclass Name  Sex     Age SibSp Parch Ticket  Fare Cabin
##          <int>    <int>  <int> <chr> <chr> <dbl> <int> <int> <chr>  <dbl> <chr>
##  1           1        0      3 Brau… male     22     1     0 A/5 2…  7.25 ""   
##  2           2        1      1 Cumi… fema…    38     1     0 PC 17… 71.3  "C85"
##  3           3        1      3 Heik… fema…    26     0     0 STON/…  7.92 ""   
##  4           4        1      1 Futr… fema…    35     1     0 113803 53.1  "C12…
##  5           5        0      3 Alle… male     35     0     0 373450  8.05 ""   
##  6           6        0      3 Mora… male     NA     0     0 330877  8.46 ""   
##  7           7        0      1 McCa… male     54     0     0 17463  51.9  "E46"
##  8           8        0      3 Pals… male      2     3     1 349909 21.1  ""   
##  9           9        1      3 John… fema…    27     0     2 347742 11.1  ""   
## 10          10        1      2 Nass… fema…    14     1     0 237736 30.1  ""   
## # … with 881 more rows, and 1 more variable: Embarked <chr>

We’ll perform three tasks:

  1. Convert the Survived, Sex, and Pclass variables into factors.
  2. Create a new variable called FamSize by adding SibSp and Parch together.
  3. Select the variables we believe to be of predictive value for our model.

Feature engineering comes in two flavors:

Feature selection - keeping variables that add predictive value, and removing those that don't.

Feature creation - deriving new variables from existing ones (here, FamSize from SibSp + Parch).

Listing 4.2. Cleaning the Titanic data, ready for modeling

fctrs <- c("Survived", "Sex", "Pclass")

titanicClean <- titanicTib %>%
  mutate_at(.vars = fctrs, .funs = factor) %>%
  mutate(FamSize = SibSp + Parch) %>%
  select(Survived, Pclass, Sex, Age, Fare, FamSize)

4.2.3. Plotting the data

Listing 4.3. Creating an untidy tibble for plotting

titanicUntidy <- gather(titanicClean, key = "Variable", value = "Value",
                        -Survived)
## Warning: attributes are not identical across measure variables;
## they will be dropped
titanicUntidy
## # A tibble: 4,455 x 3
##    Survived Variable Value
##    <fct>    <chr>    <chr>
##  1 0        Pclass   3    
##  2 1        Pclass   1    
##  3 1        Pclass   3    
##  4 1        Pclass   1    
##  5 0        Pclass   3    
##  6 0        Pclass   3    
##  7 0        Pclass   1    
##  8 0        Pclass   3    
##  9 1        Pclass   3    
## 10 1        Pclass   2    
## # … with 4,445 more rows

Listing 4.4. Creating subplots for each continuous variable

Violin charts with facet_wrap()

titanicUntidy %>%
  filter(Variable != "Pclass" & Variable != "Sex") %>%
  ggplot(aes(Survived, as.numeric(Value))) +
  facet_wrap(~ Variable, scales = "free_y") +
  geom_violin(draw_quantiles = c(0.25, 0.5, 0.75)) +
  theme_bw()
## Warning: Removed 177 rows containing non-finite values (stat_ydensity).
## Warning in regularize.values(x, y, ties, missing(ties), na.rm = na.rm):
## collapsing to unique 'x' values

## Warning in regularize.values(x, y, ties, missing(ties), na.rm = na.rm):
## collapsing to unique 'x' values

## Warning in regularize.values(x, y, ties, missing(ties), na.rm = na.rm):
## collapsing to unique 'x' values

The violin geometric object is similar to a box plot but also shows the density of the data along the y-axis.

Exercise 1: Add a geom_point() layer

Redraw the plot in figure 4.10, but add a geom_point() layer, setting the alpha argument to 0.05 and the size argument to 3. Does this make the violin plot make more sense?
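One possible solution sketch (not the book's official answer):

titanicUntidy %>%
  filter(Variable != "Pclass" & Variable != "Sex") %>%
  ggplot(aes(Survived, as.numeric(Value))) +
  facet_wrap(~ Variable, scales = "free_y") +
  geom_violin(draw_quantiles = c(0.25, 0.5, 0.75)) +
  geom_point(alpha = 0.05, size = 3) +   # the new layer: individual passengers
  theme_bw()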

Listing 4.5. Creating subplots for each categorical variable/facet_wrap

facet_wrap() wraps a 1d sequence of panels into 2d. This is generally a better use of screen space than facet_grid() because most displays are roughly rectangular.

titanicUntidy %>%
  filter(Variable == "Pclass" | Variable == "Sex") %>%
  ggplot(aes(Value, fill = Survived)) +
  facet_wrap(~ Variable, scales = "free_x") +
  geom_bar(position = "fill") +
  theme_bw()

Listing 4.6. Creating a task and learner, and training a model

titanicTask <- makeClassifTask(data = titanicClean, target = "Survived")
## Warning in makeTask(type = type, data = data, weights = weights, blocking =
## blocking, : Provided data is not a pure data.frame but from class tbl_df, hence
## it will be converted.
logReg <- makeLearner("classif.logreg", predict.type = "prob")

Exercise 2: geom_bar() to dodge

Redraw the plot in figure 4.11, but change the geom_bar() argument position equal to “dodge”. Do this again, but make the position argument equal to “stack”. Can you see the difference between the three methods?
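A possible solution sketch for the dodged version (swap "dodge" for "stack" or keep the original "fill" to compare the three):

titanicUntidy %>%
  filter(Variable == "Pclass" | Variable == "Sex") %>%
  ggplot(aes(Value, fill = Survived)) +
  facet_wrap(~ Variable, scales = "free_x") +
  geom_bar(position = "dodge") +   # side-by-side counts instead of proportions
  theme_bw()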

4.2.4. Training the model

Impute missing values first

Listing 4.8. Imputing missing values in the Age variable

imp <- impute(titanicClean, cols = list(Age = imputeMean()))

sum(is.na(titanicClean$Age)) # 177
## [1] 177
sum(is.na(imp$data$Age)) # 0
## [1] 0

Listing 4.9. Training a model on imputed data

titanicTask <- makeClassifTask(data = imp$data, target = "Survived")

logRegModel <- train(logReg, titanicTask)

4.3. Cross-validating the logistic regression model

Listing 4.10. Wrapping together the learner and the imputation method

logRegWrapper <- makeImputeWrapper("classif.logreg",
                                   cols = list(Age = imputeMean()))

Now let’s apply stratified, 10-fold cross-validation, repeated 5 times (the book uses 50 repeats; reduced here to keep the output manageable), to our wrapped learner.

Listing 4.11. Cross-validating our model-building process

kFold <- makeResampleDesc(method = "RepCV", # Repeated Cross-Validation
                          folds = 10, # Break data into 10 sections
                          reps = 5, # Change from 50 to 5 to create less output
                          stratify = TRUE)

logRegwithImpute <- resample(logRegWrapper, titanicTask,
                             resampling = kFold,
                             measures = list(acc, fpr, fnr))
## Resampling: repeated cross-validation
## Measures:             acc       fpr       fnr
## [Resample] iter 1:    0.7272727 0.4117647 0.1851852
## [Resample] iter 2:    0.9101124 0.1764706 0.0363636
## [Resample] iter 3:    0.8651685 0.1176471 0.1454545
## [Resample] iter 4:    0.7666667 0.4285714 0.1090909
## [Resample] iter 5:    0.8426966 0.1764706 0.1454545
## [Resample] iter 6:    0.7528090 0.2647059 0.2363636
## [Resample] iter 7:    0.8426966 0.2352941 0.1090909
## [Resample] iter 8:    0.7191011 0.4411765 0.1818182
## [Resample] iter 9:    0.7555556 0.3714286 0.1636364
## [Resample] iter 10:   0.7865169 0.3529412 0.1272727
## [Resample] iter 11:   0.7640449 0.2647059 0.2181818
## [Resample] iter 12:   0.8111111 0.2571429 0.1454545
## [Resample] iter 13:   0.7865169 0.3235294 0.1454545
## [Resample] iter 14:   0.8068182 0.2941176 0.1296296
## [Resample] iter 15:   0.8089888 0.2352941 0.1636364
## [Resample] iter 16:   0.7640449 0.3235294 0.1818182
## [Resample] iter 17:   0.7865169 0.2647059 0.1818182
## [Resample] iter 18:   0.8111111 0.3142857 0.1090909
## [Resample] iter 19:   0.7977528 0.2941176 0.1454545
## [Resample] iter 20:   0.7752809 0.4411765 0.0909091
## [Resample] iter 21:   0.7777778 0.3142857 0.1636364
## [Resample] iter 22:   0.8651685 0.3235294 0.0181818
## [Resample] iter 23:   0.8202247 0.2352941 0.1454545
## [Resample] iter 24:   0.7303371 0.3823529 0.2000000
## [Resample] iter 25:   0.8202247 0.2352941 0.1454545
## [Resample] iter 26:   0.8426966 0.3235294 0.0545455
## [Resample] iter 27:   0.8089888 0.2857143 0.1296296
## [Resample] iter 28:   0.7528090 0.2647059 0.2363636
## [Resample] iter 29:   0.7640449 0.2941176 0.2000000
## [Resample] iter 30:   0.7865169 0.2647059 0.1818182
## [Resample] iter 31:   0.7777778 0.4857143 0.0545455
## [Resample] iter 32:   0.7303371 0.3823529 0.2000000
## [Resample] iter 33:   0.7752809 0.2941176 0.1818182
## [Resample] iter 34:   0.8314607 0.2647059 0.1090909
## [Resample] iter 35:   0.7528090 0.2647059 0.2363636
## [Resample] iter 36:   0.8426966 0.2647059 0.0909091
## [Resample] iter 37:   0.7865169 0.3529412 0.1272727
## [Resample] iter 38:   0.8295455 0.2058824 0.1481481
## [Resample] iter 39:   0.7528090 0.4117647 0.1454545
## [Resample] iter 40:   0.8555556 0.1142857 0.1636364
## [Resample] iter 41:   0.8089888 0.2352941 0.1636364
## [Resample] iter 42:   0.7977528 0.2352941 0.1818182
## [Resample] iter 43:   0.7977528 0.3529412 0.1090909
## [Resample] iter 44:   0.7865169 0.3529412 0.1272727
## [Resample] iter 45:   0.7752809 0.2941176 0.1818182
## [Resample] iter 46:   0.8426966 0.2000000 0.1296296
## [Resample] iter 47:   0.7415730 0.4411765 0.1454545
## [Resample] iter 48:   0.8426966 0.2647059 0.0909091
## [Resample] iter 49:   0.7865169 0.2352941 0.2000000
## [Resample] iter 50:   0.7777778 0.3428571 0.1454545
## 
## Aggregated Result: acc.test.mean=0.7948383,fpr.test.mean=0.2981681,fnr.test.mean=0.1471717
## 
logRegwithImpute
## Resample Result
## Task: imp$data
## Learner: classif.logreg.imputed
## Aggr perf: acc.test.mean=0.7948383,fpr.test.mean=0.2981681,fnr.test.mean=0.1471717
## Runtime: 0.906848

https://mlr.mlr-org.com/articles/tutorial/measures.html

4.4. Interpreting the model: The odds ratio

Listing 4.12. Extracting model parameters/coefficients

logRegModelData <- getLearnerModel(logRegModel)

coef(logRegModelData)
##  (Intercept)      Pclass2      Pclass3      Sexmale          Age         Fare 
##  3.809661697 -1.000344806 -2.132428850 -2.775928255 -0.038822458  0.003218432 
##      FamSize 
## -0.243029114
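To connect these parameters back to equation 4.4, here is a minimal sketch (not from the book) computing the predicted survival probability for a hypothetical passenger: a 22-year-old male in third class, paying a fare of 7.25, with a family size of 1. Pclass3 and Sexmale are the dummy coefficients for that passenger's levels.

coefs <- coef(logRegModelData)
z <- coefs["(Intercept)"] + coefs["Pclass3"] + coefs["Sexmale"] +
  coefs["Age"] * 22 + coefs["Fare"] * 7.25 + coefs["FamSize"] * 1
unname(1 / (1 + exp(-z)))   # plogis(z) gives the same; roughly 0.1, a low predicted chance of survival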

An odds ratio is, well, a ratio of odds. For example, if the odds of surviving the Titanic if you’re female are about 7 to 10, and the odds of surviving if you’re male are 2 to 10, then the odds ratio for surviving if you’re female is 3.5. In other words, if you were female, you would have been 3.5 times more likely to survive than if you were male.

(7/10)/(2/10)
## [1] 3.5
7/2
## [1] 3.5

4.4.1. Converting model parameters into odds ratios

How do we get from log odds to odds ratios? By taking their exponent (\(e^{log\,odds}\)). We can also calculate 95% confidence intervals using the confint() function, to help us decide how strong the evidence is that each variable has predictive value.

Listing 4.13. Converting model parameters into odds ratios

exp(cbind(Odds_Ratio = coef(logRegModelData), confint(logRegModelData)))
## Waiting for profiling to be done...
##              Odds_Ratio       2.5 %       97.5 %
## (Intercept) 45.13516691 19.14718874 109.72483921
## Pclass2      0.36775262  0.20650392   0.65220841
## Pclass3      0.11854901  0.06700311   0.20885220
## Sexmale      0.06229163  0.04182164   0.09116657
## Age          0.96192148  0.94700049   0.97652950
## Fare         1.00322362  0.99872001   1.00863263
## FamSize      0.78424868  0.68315465   0.89110044

Most of these odds ratios are less than 1. An odds ratio less than 1 means an event is less likely to occur. It’s usually easier to interpret these if you divide 1 by them. For example, the odds ratio for surviving if you were male is 0.06, and 1 divided by 0.06 = 16.7. This means that, holding all other variables constant, men were 16.7 times less likely to survive than women.
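The arithmetic in R:

1 / 0.06           # ~16.7, using the rounded odds ratio for Sexmale
1 / 0.06229163     # ~16 with the unrounded value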

Listing 4.14. Using our model to make predictions on new data

data(titanic_test, package = "titanic")

titanicNew <- as_tibble(titanic_test)

titanicNewClean <- titanicNew %>%
  mutate_at(.vars = c("Sex", "Pclass"), .funs = factor) %>%
  mutate(FamSize = SibSp + Parch) %>%
  select(Pclass, Sex, Age, Fare, FamSize)

predict(logRegModel, newdata = titanicNewClean)
## Warning in predict.WrappedModel(logRegModel, newdata = titanicNewClean):
## Provided data for prediction is not a pure data.frame but from class tbl_df,
## hence it will be converted.
## Prediction: 418 observations
## predict.type: prob
## threshold: 0=0.50,1=0.50
## time: 0.00
##      prob.0     prob.1 response
## 1 0.9178036 0.08219636        0
## 2 0.5909570 0.40904305        0
## 3 0.9123303 0.08766974        0
## 4 0.8927383 0.10726167        0
## 5 0.4069407 0.59305933        1
## 6 0.8337609 0.16623907        0
## ... (#rows: 418, #cols: 3)
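If you want to work with the predictions directly, a short sketch: as.data.frame() on an mlr Prediction object returns the prob.* and response columns as a plain data frame.

pred <- predict(logRegModel, newdata = titanicNewClean)
head(as.data.frame(pred))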