The goal of this post is to introduce logistic regression in R by working through the Kaggle Titanic survival competition.
Don’t expect a great score; there’s a lot more to learn, but this post will take you from zero to a first submission.
library(tidyverse) # readr, dplyr, ggplot2, and friends
library(GGally)
train <- read_csv("../input/train.csv")
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## PassengerId = col_double(),
## Survived = col_double(),
## Pclass = col_double(),
## Name = col_character(),
## Sex = col_character(),
## Age = col_double(),
## SibSp = col_double(),
## Parch = col_double(),
## Ticket = col_character(),
## Fare = col_double(),
## Cabin = col_character(),
## Embarked = col_character()
## )
test <- read_csv("../input/test.csv")
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## PassengerId = col_double(),
## Pclass = col_double(),
## Name = col_character(),
## Sex = col_character(),
## Age = col_double(),
## SibSp = col_double(),
## Parch = col_double(),
## Ticket = col_character(),
## Fare = col_double(),
## Cabin = col_character(),
## Embarked = col_character()
## )
train <- select(train, -c("Name", "Ticket", "Cabin", "Embarked", "Fare")) # Drop columns we won't use in the first model
test <- select(test, -c("Name", "Ticket", "Cabin", "Embarked", "Fare"))
Categorical variables are represented as factors in R.
In our data, Sex takes only two values (‘male’ and ‘female’), so we treat it as a category rather than a number.
Survived is only 1 or 0, so it is also categorical.
Ignore the details for now. We’re only learning enough to get us to the first submission.
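If factors are new to you, here is a tiny standalone example (toy data, unrelated to the Titanic set) of how R encodes one:

```r
# A toy factor: factor() defaults to alphabetical level order
x <- factor(c("male", "female", "male"))
levels(x)     # "female" "male"
as.integer(x) # 2 1 2 -- a factor is integer codes plus a level table
```

Note that `forcats::as_factor()`, used below, instead orders levels by first appearance in the data.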
train$Survived <- as_factor(train$Survived) # Only provided in training data
table(train$Survived)
##
## 0 1
## 549 342
table(train$Sex)
##
## female male
## 314 577
train$Sex <- as_factor(train$Sex)
test$Sex <- as_factor(test$Sex)
contrasts(train$Sex)
## female
## male 0
## female 1
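The contrasts matrix shows that male is the reference level (coded 0), because `as_factor()` orders levels by first appearance. If you ever want a different reference level, `relevel()` is one way to change it; a toy sketch, not something this model needs:

```r
# Male is the reference level here because it is listed first
sex <- factor(c("male", "female"), levels = c("male", "female"))
contrasts(sex)                          # row of zeros for "male"
contrasts(relevel(sex, ref = "female")) # now "female" is the reference
```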
table(train$Sex)
##
## male female
## 577 314
Using the skim() function from the skimr package, we can see that the only missing data is Age: 177 ages are missing in the training data.
library(skimr)
skim(train)
Name | train |
---|---|
Number of rows | 891 |
Number of columns | 7 |
Column type frequency: | |
factor | 2 |
numeric | 5 |
Group variables | None |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
Survived | 0 | 1 | FALSE | 2 | 0: 549, 1: 342 |
Sex | 0 | 1 | FALSE | 2 | mal: 577, fem: 314 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
PassengerId | 0 | 1.0 | 446.00 | 257.35 | 1.00 | 223.50 | 446 | 668.5 | 891 | ▇▇▇▇▇ |
Pclass | 0 | 1.0 | 2.31 | 0.84 | 1.00 | 2.00 | 3 | 3.0 | 3 | ▃▁▃▁▇ |
Age | 177 | 0.8 | 29.70 | 14.53 | 0.42 | 20.12 | 28 | 38.0 | 80 | ▂▇▅▂▁ |
SibSp | 0 | 1.0 | 0.52 | 1.10 | 0.00 | 0.00 | 0 | 1.0 | 8 | ▇▁▁▁▁ |
Parch | 0 | 1.0 | 0.38 | 0.81 | 0.00 | 0.00 | 0 | 0.0 | 6 | ▇▁▁▁▁ |
In the test data, 86 ages are missing. We will replace those with the average as well.
skim(test)
Name | test |
---|---|
Number of rows | 418 |
Number of columns | 6 |
Column type frequency: | |
factor | 1 |
numeric | 5 |
Group variables | None |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
Sex | 0 | 1 | FALSE | 2 | mal: 266, fem: 152 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
PassengerId | 0 | 1.00 | 1100.50 | 120.81 | 892.00 | 996.25 | 1100.5 | 1204.75 | 1309 | ▇▇▇▇▇ |
Pclass | 0 | 1.00 | 2.27 | 0.84 | 1.00 | 1.00 | 3.0 | 3.00 | 3 | ▃▁▃▁▇ |
Age | 86 | 0.79 | 30.27 | 14.18 | 0.17 | 21.00 | 27.0 | 39.00 | 76 | ▂▇▃▂▁ |
SibSp | 0 | 1.00 | 0.45 | 0.90 | 0.00 | 0.00 | 0.0 | 1.00 | 8 | ▇▁▁▁▁ |
Parch | 0 | 1.00 | 0.39 | 0.98 | 0.00 | 0.00 | 0.0 | 0.00 | 9 | ▇▁▁▁▁ |
We combine the train and test data so the average age is computed over all passengers, not just the training set.
df <- bind_rows(train, test)
avg_age <- mean(df$Age, na.rm=TRUE)
avg_age
## [1] 29.88114
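Note that bind_rows() stacks the rows and fills any column absent from one data frame with NA; since test has no Survived column, the test rows of df get Survived = NA, which is fine because we only use df to compute the average age. A toy illustration (hypothetical data; dplyr is already attached via the tidyverse):

```r
library(dplyr) # already attached above via library(tidyverse)

a <- tibble(x = 1:2, y = c("p", "q"))
b <- tibble(x = 3)
bind_rows(a, b) # the row that came from b has y = NA
```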
sum(is.na(train$Age))
## [1] 177
train[is.na(train$Age), "Age"] <- avg_age # Impute missing ages with the mean
test[is.na(test$Age), "Age"] <- avg_age
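Indexing with is.na() works, but the same imputation reads more naturally in tidyverse style using tidyr's replace_na() inside mutate(). A self-contained sketch on toy data (assuming the tidyverse is attached, as above):

```r
# Toy data: impute the missing Age with the mean of the observed ages
toy <- tibble(Age = c(22, NA, 30))
toy <- toy %>% mutate(Age = replace_na(Age, mean(Age, na.rm = TRUE)))
toy$Age # 22 26 30
```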
We should now have zero missing ages.
sum(is.na(train$Age))
## [1] 0
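To double-check a whole data frame at once, counting NAs per column is a one-liner in base R (shown here on toy data; run it on train and test to confirm):

```r
# Count missing values in every column; all counts should be 0 after imputation
toy <- data.frame(Age = c(22, 26, 30), SibSp = c(0, 1, 0))
colSums(is.na(toy))
```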
We won’t dig into it now, but a correlation pairs plot from GGally can be useful for investigating whether variables are related.
ggpairs(train)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
(ggpairs() ignores a binwidth argument passed to it directly, so we leave it out; ggplot2 prints the stat_bin() message once for each histogram panel.)