Introduction

The goal is to introduce linear regression in R by solving the Kaggle Ames Housing competition.

Don’t expect a great score. There’s a lot more to learn, but this post will take you from zero to submission.

Simple linear regression uses only one predictor (also called a variable or feature) to make a prediction. In our case, the feature is the above-ground living area in square feet: GrLivArea. We chose this feature by reading this document.
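
As a minimal sketch of what "simple" means here: the model is a straight line, SalePrice ≈ intercept + slope × GrLivArea, and with a single predictor the least squares estimates have a closed form. The example uses synthetic data (the `area` and `price` vectors and the numbers that generate them are made up for illustration, not taken from the Ames data) and compares the closed-form estimates with what `lm()` returns.

```r
set.seed(42)  # Reproducible synthetic data, not the Ames data

# Hypothetical living areas (square feet) and noisy prices
area  <- runif(200, min = 500, max = 3000)
price <- 20000 + 100 * area + rnorm(200, sd = 30000)

# Closed-form least squares estimates for one predictor
slope     <- cov(area, price) / var(area)
intercept <- mean(price) - slope * mean(area)

# The same estimates from lm()
fit <- lm(price ~ area)

print(c(intercept = intercept, slope = slope))
print(coef(fit))
```

The two printed pairs should agree up to floating point error, which is all `lm()` is doing in the one-predictor case.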

The R code can be found on GitHub.

Load a couple of libraries

library(tidyverse) # A lot of magic in here
library(GGally)

Read the training data

train <- read_csv("../input/train.csv")
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   .default = col_character(),
##   Id = col_double(),
##   MSSubClass = col_double(),
##   LotFrontage = col_double(),
##   LotArea = col_double(),
##   OverallQual = col_double(),
##   OverallCond = col_double(),
##   YearBuilt = col_double(),
##   YearRemodAdd = col_double(),
##   MasVnrArea = col_double(),
##   BsmtFinSF1 = col_double(),
##   BsmtFinSF2 = col_double(),
##   BsmtUnfSF = col_double(),
##   TotalBsmtSF = col_double(),
##   `1stFlrSF` = col_double(),
##   `2ndFlrSF` = col_double(),
##   LowQualFinSF = col_double(),
##   GrLivArea = col_double(),
##   BsmtFullBath = col_double(),
##   BsmtHalfBath = col_double(),
##   FullBath = col_double()
##   # ... with 18 more columns
## )
## ℹ Use `spec()` for the full column specifications.

Read test data

test <- read_csv("../input/test.csv")
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   .default = col_character(),
##   Id = col_double(),
##   MSSubClass = col_double(),
##   LotFrontage = col_double(),
##   LotArea = col_double(),
##   OverallQual = col_double(),
##   OverallCond = col_double(),
##   YearBuilt = col_double(),
##   YearRemodAdd = col_double(),
##   MasVnrArea = col_double(),
##   BsmtFinSF1 = col_double(),
##   BsmtFinSF2 = col_double(),
##   BsmtUnfSF = col_double(),
##   TotalBsmtSF = col_double(),
##   `1stFlrSF` = col_double(),
##   `2ndFlrSF` = col_double(),
##   LowQualFinSF = col_double(),
##   GrLivArea = col_double(),
##   BsmtFullBath = col_double(),
##   BsmtHalfBath = col_double(),
##   FullBath = col_double()
##   # ... with 17 more columns
## )
## ℹ Use `spec()` for the full column specifications.
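
Both calls print the column types that read_csv() guessed. If you would rather state the types yourself, which also silences that message, one option is the `col_types` argument. The sketch below is only an illustration: it reads a tiny temporary file with made-up rows, and the `Street` column merely stands in for the character columns.

```r
library(readr)

# Write a tiny stand-in file; the rows are made up for illustration
toy_path <- tempfile(fileext = ".csv")
writeLines(c("Id,GrLivArea,Street",
             "1,1500,Pave",
             "2,1200,Pave"), toy_path)

# Declare the types explicitly instead of letting read_csv() guess
toy <- read_csv(
  toy_path,
  col_types = cols(
    .default  = col_character(),
    Id        = col_double(),
    GrLivArea = col_double()
  )
)

print(toy)
```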

Keep only a subset of the data

train <- select(train, c("Id", "GrLivArea", "LotArea", "TotalBsmtSF", "YearBuilt", "SalePrice")) # Tidyverse
test <- select(test, c("Id", "GrLivArea", "LotArea", "TotalBsmtSF", "YearBuilt")) # Tidyverse

Correlation Plot

ggpairs(train)

In this chart, the upper panels report the pairwise correlations, and we can verify that SalePrice and GrLivArea are clearly positively correlated.
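
For a model with a single predictor, the squared Pearson correlation equals the R-squared of the fit. The model summary further down reports an R-squared of 0.5021, which corresponds to a correlation of about 0.71 between GrLivArea and SalePrice. A small sketch of that relationship on synthetic stand-in data (the numbers generating the data are assumptions, not the Ames values):

```r
set.seed(1)  # Synthetic stand-in data, not the Ames data

GrLivArea <- runif(300, min = 500, max = 3000)
SalePrice <- 20000 + 100 * GrLivArea + rnorm(300, sd = 50000)

r   <- cor(GrLivArea, SalePrice)   # Pearson correlation
fit <- lm(SalePrice ~ GrLivArea)
r2  <- summary(fit)$r.squared      # R-squared of the one-predictor fit

print(c(correlation = r, squared = r^2, r_squared = r2))
```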

Fit the Linear Model

lm.fit <- lm(SalePrice ~ GrLivArea, data = train)

Model Summary

summary(lm.fit)
## 
## Call:
## lm(formula = SalePrice ~ GrLivArea, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -462999  -29800   -1124   21957  339832 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 18569.026   4480.755   4.144 3.61e-05 ***
## GrLivArea     107.130      2.794  38.348  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 56070 on 1458 degrees of freedom
## Multiple R-squared:  0.5021, Adjusted R-squared:  0.5018 
## F-statistic:  1471 on 1 and 1458 DF,  p-value: < 2.2e-16

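The coefficient for GrLivArea is 107.130, so each additional square foot of living area raises the predicted sale price by about $107. Its p-value is below 2e-16, which means the coefficient is statistically significant. The R-squared of 0.5021 says that GrLivArea alone explains about half of the variance in SalePrice.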
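
To make the coefficients concrete, here is a worked prediction using the two estimates from the summary output. The 1,500 square foot house and the `predict_price` helper are hypothetical, chosen only for illustration; the result is 18569.026 + 107.130 × 1500, or about 179,264.

```r
# Coefficients copied from the summary output above
intercept <- 18569.026
slope     <- 107.130

# Predicted price for a given above-ground living area (square feet)
# (predict_price is a hypothetical helper, not part of the original code)
predict_price <- function(area) intercept + slope * area

# A hypothetical 1,500 square foot house
print(predict_price(1500))

# Each extra square foot adds the slope to the prediction
print(predict_price(1501) - predict_price(1500))
```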

Plot the Regression Line

plot(SalePrice ~ GrLivArea, data = train) # Plot points
abline(lm.fit) # Add least squares regression line
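
Since the tidyverse is already loaded, the same picture can also be drawn with ggplot2. This is only a sketch of an alternative, not the plot used in this post; it runs on a synthetic stand-in data frame (`toy_train` and its numbers are assumptions), and `geom_smooth(method = "lm")` fits and draws the least squares line itself.

```r
library(ggplot2)

set.seed(7)  # Synthetic stand-in for the training data
toy_train <- data.frame(GrLivArea = runif(200, min = 500, max = 3000))
toy_train$SalePrice <- 20000 + 100 * toy_train$GrLivArea +
  rnorm(200, sd = 40000)

p <- ggplot(toy_train, aes(x = GrLivArea, y = SalePrice)) +
  geom_point(alpha = 0.4) +                                 # Scatter plot
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)  # Least squares line

print(p)
```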

Predict SalePrice in Test Data

predSalePrice <- predict(lm.fit, newdata = test)
test$SalePrice <- predSalePrice
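
Before writing the submission, it can be worth checking the predictions. A submission needs one prediction per test row, and `predict()` returns NA wherever the predictor is missing. The sketch below shows such a check on synthetic stand-in data; `toy_train` and `toy_test` are made up, and the missing value is planted on purpose.

```r
set.seed(3)  # Synthetic stand-in data, not the Ames data
toy_train <- data.frame(GrLivArea = runif(100, min = 500, max = 3000))
toy_train$SalePrice <- 20000 + 100 * toy_train$GrLivArea +
  rnorm(100, sd = 40000)

# Toy test set with one missing predictor value
toy_test <- data.frame(Id = 1:3, GrLivArea = c(900, 2000, NA))

fit  <- lm(SalePrice ~ GrLivArea, data = toy_train)
pred <- predict(fit, newdata = toy_test)

# One prediction per test row; a missing predictor gives a missing prediction
print(length(pred) == nrow(toy_test))
print(sum(is.na(pred)))
print(sum(pred <= 0, na.rm = TRUE))  # A straight line can go non-positive
```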

Generate Kaggle Submission File

test %>% 
  select(Id, SalePrice) %>%
  write.csv("intro_lr_submission.csv", quote = FALSE, row.names = FALSE)
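
A quick way to sanity-check the file is to read it back and confirm it has exactly the Id and SalePrice columns and no missing prices. This sketch writes toy predictions to a temporary file instead of the real submission; the Ids and prices are made up.

```r
# Toy predictions standing in for the real test tibble
toy_test <- data.frame(Id = 1461:1463,
                       SalePrice = c(120000.5, 250000, 180000))

out_path <- tempfile(fileext = ".csv")
write.csv(toy_test[, c("Id", "SalePrice")], out_path,
          quote = FALSE, row.names = FALSE)

# Read the file back and check its shape
check <- read.csv(out_path)
print(names(check))
print(nrow(check) == nrow(toy_test))
print(anyNA(check$SalePrice))
```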

The Kaggle score: 0.29117 (lower is better)
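
As far as the competition's evaluation rules describe it, this score is the root mean squared error between the logarithm of the predicted price and the logarithm of the observed price. A minimal sketch of that metric follows; the `rmsle` name and the example prices are hypothetical. On your own machine you could apply it to a held-out slice of the training data to estimate the score before submitting.

```r
# Root mean squared error between log prices (assumed scoring metric)
rmsle <- function(actual, predicted) {
  sqrt(mean((log(predicted) - log(actual))^2))
}

# Hypothetical actual and predicted prices
actual    <- c(200000, 150000, 300000)
predicted <- c(210000, 120000, 330000)

print(rmsle(actual, predicted))
print(rmsle(actual, actual))  # Perfect predictions score 0
```

Because the errors are taken on the log scale, being off by 10% costs about the same for a cheap house as for an expensive one.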