Introduction

The goal is to introduce logistic regression in R by solving the Kaggle Titanic Survivor competition.

Don’t expect a great score. There’s a lot more to learn but this blog will take you from zero to submission.

library(tidyverse) # A lot of magic in here
library(GGally)

Read the training data

train <- read_csv("../input/train.csv")
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   PassengerId = col_double(),
##   Survived = col_double(),
##   Pclass = col_double(),
##   Name = col_character(),
##   Sex = col_character(),
##   Age = col_double(),
##   SibSp = col_double(),
##   Parch = col_double(),
##   Ticket = col_character(),
##   Fare = col_double(),
##   Cabin = col_character(),
##   Embarked = col_character()
## )

Read test data

test <- read_csv("../input/test.csv")
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   PassengerId = col_double(),
##   Pclass = col_double(),
##   Name = col_character(),
##   Sex = col_character(),
##   Age = col_double(),
##   SibSp = col_double(),
##   Parch = col_double(),
##   Ticket = col_character(),
##   Fare = col_double(),
##   Cabin = col_character(),
##   Embarked = col_character()
## )

Remove a few columns to simplify

train <- select(train, -c("Name", "Ticket", "Cabin", "Embarked", "Fare")) # Tidyverse
test <- select(test, -c("Name", "Ticket", "Cabin", "Embarked", "Fare")) # Tidyverse

Convert Survived and Sex to Factors

Categorical variables are converted to factors in R.

In our data, sex is only one of 2 values (‘Male’, ‘Female’), so we treat it differently. They are categories.

Survived is only 1 or 0, so we also treat it differently. It’s also a category variable.

Ignore the details for now. We’re only learning enough to get us to the first submission.

train$Survived <- as_factor(train$Survived) # Only provided in training data
#contrasts(train$Survived)
table(train$Survived)
## 
##   0   1 
## 549 342

Convert Sex to Factor

table(train$Sex)
## 
## female   male 
##    314    577
train$Sex <- as_factor(train$Sex)
test$Sex <- as_factor(test$Sex)

contrasts(train$Sex)
##        female
## male        0
## female      1
table(train$Survived)
## 
##   0   1 
## 549 342
table(train$Sex)
## 
##   male female 
##    577    314

Handle Missing Data

Skim Train

Using the skim() function, we can see that the only missing data is Age, where 177 ages are missing in the training data.

library(skimr)
skim(train)
Data summary
Name train
Number of rows 891
Number of columns 7
_______________________
Column type frequency:
factor 2
numeric 5
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Survived 0 1 FALSE 2 0: 549, 1: 342
Sex 0 1 FALSE 2 mal: 577, fem: 314

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
PassengerId 0 1.0 446.00 257.35 1.00 223.50 446 668.5 891 ▇▇▇▇▇
Pclass 0 1.0 2.31 0.84 1.00 2.00 3 3.0 3 ▃▁▃▁▇
Age 177 0.8 29.70 14.53 0.42 20.12 28 38.0 80 ▂▇▅▂▁
SibSp 0 1.0 0.52 1.10 0.00 0.00 0 1.0 8 ▇▁▁▁▁
Parch 0 1.0 0.38 0.81 0.00 0.00 0 0.0 6 ▇▁▁▁▁

Skim Test

In the test data, we are missing 86 ages. We will also replace those with the average.

skim(test)
Data summary
Name test
Number of rows 418
Number of columns 6
_______________________
Column type frequency:
factor 1
numeric 5
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Sex 0 1 FALSE 2 mal: 266, fem: 152

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
PassengerId 0 1.00 1100.50 120.81 892.00 996.25 1100.5 1204.75 1309 ▇▇▇▇▇
Pclass 0 1.00 2.27 0.84 1.00 1.00 3.0 3.00 3 ▃▁▃▁▇
Age 86 0.79 30.27 14.18 0.17 21.00 27.0 39.00 76 ▂▇▃▂▁
SibSp 0 1.00 0.45 0.90 0.00 0.00 0.0 1.00 8 ▇▁▁▁▁
Parch 0 1.00 0.39 0.98 0.00 0.00 0.0 0.00 9 ▇▁▁▁▁

Combine train and test data

df = bind_rows(train, test)

Get Average Age

avg_age <-  mean(df$Age, na.rm=TRUE)
avg_age
## [1] 29.88114
sum(is.na(train$Age))
## [1] 177

Replace Missing Ages with the Mean

#train[is.na(train$Age)] <-  avg_age
#train %>% replace_na(avg_age)
#replace_na(train, list(Age=avg_age))
train[is.na(train$Age), "Age"] <- avg_age
test[is.na(test$Age), "Age"] <- avg_age

Verify Age was set

We should now get zero rows with na.

sum(is.na(train$Age))
## [1] 0

Correlation Plot

We won’t look at this now, but a correlation pairs plot can be useful when investigating if data is related.

ggpairs(train, binwidth=30)
## Warning in warn_if_args_exist(list(...)): Extra arguments: 'binwidth' are being
## ignored. If these are meant to be aesthetics, submit them using the 'mapping'
## variable within ggpairs with ggplot2::aes or ggplot2::aes_string.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.