The goal of this post is to introduce logistic regression in R by working through the Kaggle Titanic survival competition.
Don’t expect a great score; there’s a lot more to learn, but this post will take you from zero to a first submission.
library(tidyverse) # readr, dplyr, ggplot2, and friends
library(GGally)
train <- read_csv("../input/train.csv")
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## PassengerId = col_double(),
## Survived = col_double(),
## Pclass = col_double(),
## Name = col_character(),
## Sex = col_character(),
## Age = col_double(),
## SibSp = col_double(),
## Parch = col_double(),
## Ticket = col_character(),
## Fare = col_double(),
## Cabin = col_character(),
## Embarked = col_character()
## )
test <- read_csv("../input/test.csv")
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## PassengerId = col_double(),
## Pclass = col_double(),
## Name = col_character(),
## Sex = col_character(),
## Age = col_double(),
## SibSp = col_double(),
## Parch = col_double(),
## Ticket = col_character(),
## Fare = col_double(),
## Cabin = col_character(),
## Embarked = col_character()
## )
train <- select(train, -c("Name", "Ticket", "Cabin", "Embarked", "Fare")) # Drop columns we won't use in the first model
test <- select(test, -c("Name", "Ticket", "Cabin", "Embarked", "Fare"))
Categorical variables are represented as factors in R.
In our data, Sex takes only two values (‘male’ and ‘female’), so we treat it as a category rather than a number.
Survived is only 1 or 0, so it is also categorical.
Ignore the details for now. We’re only learning enough to get us to the first submission.
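If factors are new to you, here is a tiny standalone example (toy data, unrelated to the Titanic set) of how R encodes one:

```r
# A toy factor: factor() defaults to alphabetical level order
x <- factor(c("male", "female", "male"))
levels(x)     # "female" "male"
as.integer(x) # 2 1 2 -- a factor is integer codes plus a level table
```

Note that `forcats::as_factor()`, used below, instead orders levels by first appearance in the data.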
train$Survived <- as_factor(train$Survived) # Only provided in training data
table(train$Survived)
##
## 0 1
## 549 342
table(train$Sex)
##
## female male
## 314 577
train$Sex <- as_factor(train$Sex)
test$Sex <- as_factor(test$Sex)
contrasts(train$Sex)
## female
## male 0
## female 1
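The contrasts matrix shows that male is the reference level (coded 0), because `as_factor()` orders levels by first appearance. If you ever want a different reference level, `relevel()` is one way to change it; a toy sketch, not something this model needs:

```r
# Male is the reference level here because it is listed first
sex <- factor(c("male", "female"), levels = c("male", "female"))
contrasts(sex)                          # row of zeros for "male"
contrasts(relevel(sex, ref = "female")) # now "female" is the reference
```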
table(train$Sex)
##
## male female
## 577 314
Using the skim() function from the skimr package, we can see that the only missing data is Age: 177 ages are missing in the training data.
library(skimr)
skim(train)
Name | train |
---|---|
Number of rows | 891 |
Number of columns | 7 |
Column type frequency: | |
factor | 2 |
numeric | 5 |
Group variables | None |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
Survived | 0 | 1 | FALSE | 2 | 0: 549, 1: 342 |
Sex | 0 | 1 | FALSE | 2 | mal: 577, fem: 314 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
PassengerId | 0 | 1.0 | 446.00 | 257.35 | 1.00 | 223.50 | 446 | 668.5 | 891 | ▇▇▇▇▇ |
Pclass | 0 | 1.0 | 2.31 | 0.84 | 1.00 | 2.00 | 3 | 3.0 | 3 | ▃▁▃▁▇ |
Age | 177 | 0.8 | 29.70 | 14.53 | 0.42 | 20.12 | 28 | 38.0 | 80 | ▂▇▅▂▁ |
SibSp | 0 | 1.0 | 0.52 | 1.10 | 0.00 | 0.00 | 0 | 1.0 | 8 | ▇▁▁▁▁ |
Parch | 0 | 1.0 | 0.38 | 0.81 | 0.00 | 0.00 | 0 | 0.0 | 6 | ▇▁▁▁▁ |
In the test data, 86 ages are missing. We will replace those with the average as well.
skim(test)
Name | test |
---|---|
Number of rows | 418 |
Number of columns | 6 |
Column type frequency: | |
factor | 1 |
numeric | 5 |
Group variables | None |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
Sex | 0 | 1 | FALSE | 2 | mal: 266, fem: 152 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
PassengerId | 0 | 1.00 | 1100.50 | 120.81 | 892.00 | 996.25 | 1100.5 | 1204.75 | 1309 | ▇▇▇▇▇ |
Pclass | 0 | 1.00 | 2.27 | 0.84 | 1.00 | 1.00 | 3.0 | 3.00 | 3 | ▃▁▃▁▇ |
Age | 86 | 0.79 | 30.27 | 14.18 | 0.17 | 21.00 | 27.0 | 39.00 | 76 | ▂▇▃▂▁ |
SibSp | 0 | 1.00 | 0.45 | 0.90 | 0.00 | 0.00 | 0.0 | 1.00 | 8 | ▇▁▁▁▁ |
Parch | 0 | 1.00 | 0.39 | 0.98 | 0.00 | 0.00 | 0.0 | 0.00 | 9 | ▇▁▁▁▁ |
We combine the train and test data so the average age is computed over all passengers, not just the training set.
df <- bind_rows(train, test)
avg_age <- mean(df$Age, na.rm=TRUE)
avg_age
## [1] 29.88114
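Note that bind_rows() stacks the rows and fills any column absent from one data frame with NA; since test has no Survived column, the test rows of df get Survived = NA, which is fine because we only use df to compute the average age. A toy illustration (hypothetical data; dplyr is already attached via the tidyverse):

```r
library(dplyr) # already attached above via library(tidyverse)

a <- tibble(x = 1:2, y = c("p", "q"))
b <- tibble(x = 3)
bind_rows(a, b) # the row that came from b has y = NA
```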
sum(is.na(train$Age))
## [1] 177
train[is.na(train$Age), "Age"] <- avg_age # Impute missing ages with the mean
test[is.na(test$Age), "Age"] <- avg_age
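Indexing with is.na() works, but the same imputation reads more naturally in tidyverse style using tidyr's replace_na() inside mutate(). A self-contained sketch on toy data (assuming the tidyverse is attached, as above):

```r
# Toy data: impute the missing Age with the mean of the observed ages
toy <- tibble(Age = c(22, NA, 30))
toy <- toy %>% mutate(Age = replace_na(Age, mean(Age, na.rm = TRUE)))
toy$Age # 22 26 30
```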
We should now have zero missing ages.
sum(is.na(train$Age))
## [1] 0
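To double-check a whole data frame at once, counting NAs per column is a one-liner in base R (shown here on toy data; run it on train and test to confirm):

```r
# Count missing values in every column; all counts should be 0 after imputation
toy <- data.frame(Age = c(22, 26, 30), SibSp = c(0, 1, 0))
colSums(is.na(toy))
```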
We won’t dig into it now, but a correlation pairs plot from GGally can be useful for investigating whether variables are related.
ggpairs(train)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
(ggpairs() ignores a binwidth argument passed to it directly, so we leave it out; ggplot2 prints the stat_bin() message once for each histogram panel.)