Question

p122

This question involves the use of multiple linear regression on the Auto data set.

  1. Produce a scatterplot matrix which includes all of the variables in the data set.

  2. Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.

  3. Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output. For instance:

     i. Is there a relationship between the predictors and the response?
     ii. Which predictors appear to have a statistically significant relationship to the response?
     iii. What does the coefficient for the year variable suggest?

  4. Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?

  5. Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?

  6. Try a few different transformations of the variables, such as \(\log(X)\), \(\sqrt{X}\), \(X^2\). Comment on your findings.

library(ISLR)
library(tidyverse)
library(GGally)
library(car) # scatterplotMatrix

9a Scatterplot Matrix

Produce a scatterplot matrix which includes all of the variables in the data set.

Basic Scatterplot Matrix

pairs(Auto)

Enhanced Pairs Plot

auto <- as_tibble(Auto)
auto <- select(auto, -name)
colnames(auto)
## [1] "mpg"          "cylinders"    "displacement" "horsepower"   "weight"      
## [6] "acceleration" "year"         "origin"

Rename a few columns

names(auto)[names(auto) == "displacement"] <- "displ"
names(auto)[names(auto) == "horsepower"] <- "hp"
names(auto)[names(auto) == "acceleration"] <- "accel"
ggpairs(auto)

car Package scatterplotMatrix

scatterplotMatrix(auto, smooth = FALSE, main="Scatter Plot Matrix")

9b Correlations Matrix

Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.

options(digits=2)
cor(auto[,!colnames(auto) %in% c("name")]) # Skip name column
##             mpg cylinders displ    hp weight accel  year origin
## mpg        1.00     -0.78 -0.81 -0.78  -0.83  0.42  0.58   0.57
## cylinders -0.78      1.00  0.95  0.84   0.90 -0.50 -0.35  -0.57
## displ     -0.81      0.95  1.00  0.90   0.93 -0.54 -0.37  -0.61
## hp        -0.78      0.84  0.90  1.00   0.86 -0.69 -0.42  -0.46
## weight    -0.83      0.90  0.93  0.86   1.00 -0.42 -0.31  -0.59
## accel      0.42     -0.50 -0.54 -0.69  -0.42  1.00  0.29   0.21
## year       0.58     -0.35 -0.37 -0.42  -0.31  0.29  1.00   0.18
## origin     0.57     -0.57 -0.61 -0.46  -0.59  0.21  0.18   1.00

Enhanced Correlation Plot

ggcorr(auto)
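
As a rough numeric companion to the correlation plot, the sketch below ranks the predictors by their correlation with mpg. It works directly on `Auto` from the ISLR package; the names `num_vars`, `cors`, and `mpg_cor` are illustrative and not part of the original analysis.

```r
library(ISLR)  # provides the Auto data set

# Drop the qualitative name column before computing correlations
num_vars <- Auto[, names(Auto) != "name"]
cors <- cor(num_vars)

# Correlations of every other variable with mpg, most negative first
mpg_cor <- sort(cors["mpg", colnames(cors) != "mpg"])
print(round(mpg_cor, 2))
```

The strongly negative entries (weight, displacement, horsepower, cylinders) are also highly correlated with each other, which is worth keeping in mind when reading the individual t-tests in the regression below.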

9c Multiple Linear Regression: mpg ~ .

Running a MLR on all predictors except for name

Model: mpg ~ . -name

auto.mlr = lm(mpg ~ . -name, data=Auto)

Model Summary

summary(auto.mlr)
## 
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.590 -2.157 -0.117  1.869 13.060 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.72e+01   4.64e+00   -3.71  0.00024 ***
## cylinders    -4.93e-01   3.23e-01   -1.53  0.12780    
## displacement  1.99e-02   7.51e-03    2.65  0.00844 ** 
## horsepower   -1.70e-02   1.38e-02   -1.23  0.21963    
## weight       -6.47e-03   6.52e-04   -9.93  < 2e-16 ***
## acceleration  8.06e-02   9.88e-02    0.82  0.41548    
## year          7.51e-01   5.10e-02   14.73  < 2e-16 ***
## origin        1.43e+00   2.78e-01    5.13  4.7e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.3 on 384 degrees of freedom
## Multiple R-squared:  0.821,  Adjusted R-squared:  0.818 
## F-statistic:  252 on 7 and 384 DF,  p-value: <2e-16

i. Is there a relationship between the predictors and the response?

Yes. The overall F-statistic is 252 on 7 and 384 degrees of freedom with a p-value below 2e-16, so we reject the null hypothesis that all coefficients are zero: at least one predictor is related to mpg.
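
The overall test can also be pulled out of the fitted model programmatically. The sketch below refits the model from 9c so that it runs on its own; `fit`, `fstat`, and `p_overall` are illustrative names.

```r
library(ISLR)  # provides the Auto data set

# Refit the full model from 9c so this block is self-contained
fit <- lm(mpg ~ . - name, data = Auto)

# summary()$fstatistic is a named vector: value, numdf, dendf
fstat <- summary(fit)$fstatistic

# Upper-tail probability of the F distribution gives the overall p-value
p_overall <- pf(fstat["value"], fstat["numdf"], fstat["dendf"],
                lower.tail = FALSE)

print(fstat)
print(unname(p_overall))
```

This is the test of the null hypothesis that every slope coefficient is zero, which is a different question from whether any single t-test in the coefficient table is significant.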

ii. Which predictors appear to have a statistically significant relationship to the response?

displacement, weight, year, and origin have statistically significant relationships with mpg at the 0.05 level. cylinders, horsepower, and acceleration do not (p-values of 0.13, 0.22, and 0.42).

iii. What does the coefficient for the year variable suggest?

The coefficient of year is about 0.75, which suggests that, holding the other predictors fixed, mpg increases by about 0.75 per model year, or roughly 3 mpg every 4 years.
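
A small sketch of the arithmetic behind that statement, with a confidence interval for the year effect. The model is refit here so the block stands alone; `year_coef`, `year_ci`, and `four_year_gain` are illustrative names.

```r
library(ISLR)  # provides the Auto data set

# Refit the full model from 9c
fit <- lm(mpg ~ . - name, data = Auto)

# Estimated change in mpg per model year, other predictors held fixed
year_coef <- coef(fit)["year"]

# 95% confidence interval for that per-year change
year_ci <- confint(fit, "year", level = 0.95)

# Implied change over four model years
four_year_gain <- 4 * year_coef

print(year_coef)
print(year_ci)
print(four_year_gain)
```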

9d Diagnostic Plots

(9d) Use the plot() function to produce diagnostic plots of the linear regression fit.


Diagnostic Plots

par(mfrow=c(2,2))
plot(auto.mlr)

Better Plots

#qplot(auto.mlr)

Non-Linearity: The Residuals vs. Fitted plot shows a U-shaped pattern in the residuals, which suggests that the relationship between the predictors and mpg is not purely linear.

Non-constant Variance: The residual plot also shows that the variance is not constant. The spread of the residuals widens into a funnel shape at larger fitted values, which indicates heteroscedasticity (non-constant variance).

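Outliers: The Scale-Location plot shows the square root of the absolute standardized residuals, so a value of \(\sqrt{3} \approx 1.73\) on that plot corresponds to a standardized residual of magnitude 3. The largest residual is 13.06 against a residual standard error of 3.3, roughly 4 standard errors, so at least one observation has a standardized residual above 3 and is a possible outlier.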

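High Leverage Points: Based on the Residuals vs. Leverage plot, no observation appears to have high enough leverage to dominate the fit. As a benchmark, the average leverage is \((p+1)/n = 8/392 \approx 0.02\), so points far to the right of that value are worth checking.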
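
The visual reading of the plots can be cross-checked numerically. The sketch below lists observations with large standardized residuals or leverage. The cutoffs (|standardized residual| > 3 and leverage above 3 times the average) are common rules of thumb assumed here, not thresholds taken from the original analysis, and the variable names are illustrative.

```r
library(ISLR)  # provides the Auto data set

# Refit the full model from 9c
fit <- lm(mpg ~ . - name, data = Auto)

std_res <- rstandard(fit)   # standardized residuals
lev     <- hatvalues(fit)   # leverage (diagonal of the hat matrix)

# Average leverage is (p + 1) / n
avg_lev <- length(coef(fit)) / nrow(Auto)

# Assumed rule-of-thumb cutoffs; adjust to taste
outliers <- which(abs(std_res) > 3)
high_lev <- which(lev > 3 * avg_lev)

print(std_res[outliers])
print(lev[high_lev])
```

Any observation that shows up in both lists would deserve the closest look, since it combines an unusual response with unusual predictor values.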

9e Interaction Effects

Use the * and : symbols to fit linear regression models with interaction effects.

Do any interactions appear to be statistically significant?

names(Auto)
## [1] "mpg"          "cylinders"    "displacement" "horsepower"   "weight"      
## [6] "acceleration" "year"         "origin"       "name"
interact.fit = lm(mpg ~ . -name + horsepower*displacement, data=Auto)
origin.hp = lm(mpg ~ . -name + horsepower*origin, data=Auto)
summary(origin.hp)
## 
## Call:
## lm(formula = mpg ~ . - name + horsepower * origin, data = Auto)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.277 -1.875 -0.225  1.570 12.080 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -2.20e+01   4.40e+00   -5.00  8.9e-07 ***
## cylinders         -5.28e-01   3.03e-01   -1.74    0.082 .  
## displacement      -1.49e-03   7.61e-03   -0.20    0.845    
## horsepower         8.17e-02   1.86e-02    4.40  1.4e-05 ***
## weight            -4.71e-03   6.55e-04   -7.19  3.5e-12 ***
## acceleration      -1.12e-01   9.62e-02   -1.17    0.243    
## year               7.33e-01   4.78e-02   15.33  < 2e-16 ***
## origin             7.70e+00   8.86e-01    8.69  < 2e-16 ***
## horsepower:origin -7.95e-02   1.07e-02   -7.40  8.4e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.1 on 383 degrees of freedom
## Multiple R-squared:  0.844,  Adjusted R-squared:  0.841 
## F-statistic:  259 on 8 and 383 DF,  p-value: <2e-16

Statistically Significant Interaction Terms:

inter.fit = lm(mpg ~ . -name + horsepower:origin + horsepower:displacement,
               data=Auto)
summary(inter.fit)
## 
## Call:
## lm(formula = mpg ~ . - name + horsepower:origin + horsepower:displacement, 
##     data = Auto)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.722 -1.525 -0.097  1.355 12.842 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             -4.71e+00   4.69e+00   -1.00    0.316    
## cylinders                5.14e-01   3.14e-01    1.64    0.102    
## displacement            -6.97e-02   1.14e-02   -6.10  2.6e-09 ***
## horsepower              -1.54e-01   3.55e-02   -4.34  1.8e-05 ***
## weight                  -3.08e-03   6.48e-04   -4.76  2.7e-06 ***
## acceleration            -2.28e-01   9.10e-02   -2.50    0.013 *  
## year                     7.35e-01   4.46e-02   16.48  < 2e-16 ***
## origin                   2.28e+00   1.09e+00    2.09    0.037 *  
## horsepower:origin       -1.92e-02   1.28e-02   -1.50    0.134    
## displacement:horsepower  4.67e-04   6.13e-05    7.61  2.1e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.9 on 382 degrees of freedom
## Multiple R-squared:  0.864,  Adjusted R-squared:  0.861 
## F-statistic:  271 on 9 and 382 DF,  p-value: <2e-16

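Fitted on its own, horsepower:origin is highly significant (p = 8.4e-13). Once horsepower:displacement is added, horsepower:origin is no longer significant (p = 0.134) while displacement:horsepower is (p = 2.1e-13), so adding more interactions can reduce the significance of terms that were significant before.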
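
One way to check whether each added interaction improves the fit is a sequence of F-tests on nested models. The sketch below is an illustrative addition, not part of the original solution; the model names `base_fit`, `one_int`, and `two_int` are hypothetical.

```r
library(ISLR)  # provides the Auto data set

# Nested sequence: no interaction, one interaction, two interactions
base_fit <- lm(mpg ~ . - name, data = Auto)
one_int  <- lm(mpg ~ . - name + horsepower:origin, data = Auto)
two_int  <- lm(mpg ~ . - name + horsepower:origin + horsepower:displacement,
               data = Auto)

# Each row tests the model against the one on the previous row
cmp <- anova(base_fit, one_int, two_int)
print(cmp)
```

Note that `. + horsepower:origin` fits the same model as `. + horsepower*origin` here, because both main effects are already included through the `.` term.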

9f Transformations

Try a few different transformations of the variables, such as \(\log(X)\), \(\sqrt{X}\), \(X^2\).

Transform log(acceleration)

summary(lm(mpg ~ . -name + log(acceleration), data=Auto))
## 
## Call:
## lm(formula = mpg ~ . - name + log(acceleration), data = Auto)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.793 -2.005 -0.128  1.930 13.108 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        4.55e+01   1.48e+01    3.08   0.0022 ** 
## cylinders         -2.80e-01   3.19e-01   -0.88   0.3817    
## displacement       8.04e-03   7.81e-03    1.03   0.3034    
## horsepower        -3.43e-02   1.40e-02   -2.45   0.0147 *  
## weight            -5.34e-03   6.85e-04   -7.79  6.1e-14 ***
## acceleration       2.17e+00   4.78e-01    4.53  7.8e-06 ***
## year               7.56e-01   4.98e-02   15.19  < 2e-16 ***
## origin             1.33e+00   2.72e-01    4.88  1.6e-06 ***
## log(acceleration) -3.51e+01   7.89e+00   -4.46  1.1e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.2 on 383 degrees of freedom
## Multiple R-squared:  0.83,   Adjusted R-squared:  0.827 
## F-statistic:  234 on 8 and 383 DF,  p-value: <2e-16

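Both acceleration and log(acceleration) are highly significant in this model (p = 7.8e-06 and 1.1e-05), whereas acceleration was not significant in the original fit (p = 0.42). \(R^2\) rises slightly, from 0.821 to 0.83.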

Transform log(horsepower)

summary(lm(mpg ~ . -name + log(horsepower), data=Auto))
## 
## Call:
## lm(formula = mpg ~ . - name + log(horsepower), data = Auto)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.578 -1.662 -0.121  1.491 12.023 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      8.67e+01   1.11e+01    7.84  4.5e-14 ***
## cylinders       -5.53e-02   2.91e-01   -0.19  0.84923    
## displacement    -4.61e-03   7.11e-03   -0.65  0.51729    
## horsepower       1.76e-01   2.27e-02    7.77  7.0e-14 ***
## weight          -3.37e-03   6.56e-04   -5.13  4.6e-07 ***
## acceleration    -3.28e-01   9.67e-02   -3.39  0.00078 ***
## year             7.42e-01   4.53e-02   16.37  < 2e-16 ***
## origin           8.98e-01   2.53e-01    3.55  0.00043 ***
## log(horsepower) -2.69e+01   2.65e+00  -10.13  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3 on 383 degrees of freedom
## Multiple R-squared:  0.859,  Adjusted R-squared:  0.856 
## F-statistic:  292 on 8 and 383 DF,  p-value: <2e-16

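log(horsepower) is highly significant (p < 2e-16) and has a larger t-statistic in absolute value than horsepower (10.13 vs. 7.77). \(R^2\) rises from 0.821 to 0.859.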

Transform \(horsepower^2\)

summary(lm(mpg ~ . -name + I(horsepower^2), data=Auto))
## 
## Call:
## lm(formula = mpg ~ . - name + I(horsepower^2), data = Auto)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.550 -1.731 -0.224  1.588 11.995 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      1.323656   4.624770    0.29  0.77487    
## cylinders        0.348906   0.304831    1.14  0.25309    
## displacement    -0.007565   0.007373   -1.03  0.30555    
## horsepower      -0.319463   0.034345   -9.30  < 2e-16 ***
## weight          -0.003271   0.000679   -4.82  2.1e-06 ***
## acceleration    -0.330598   0.099185   -3.33  0.00094 ***
## year             0.735341   0.045992   15.99  < 2e-16 ***
## origin           1.014413   0.254555    3.99  8.1e-05 ***
## I(horsepower^2)  0.001006   0.000106    9.45  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3 on 383 degrees of freedom
## Multiple R-squared:  0.855,  Adjusted R-squared:  0.852 
## F-statistic:  283 on 8 and 383 DF,  p-value: <2e-16

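The squared term I(horsepower^2) is highly significant (p < 2e-16), and horsepower itself, which was not significant in the original fit (p = 0.22), becomes highly significant. \(R^2\) rises from 0.821 to 0.855.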

Transform \(weight^2\)

summary(lm(mpg~ . -name + I(weight^2), data=Auto))
## 
## Call:
## lm(formula = mpg ~ . - name + I(weight^2), data = Auto)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.471 -1.670 -0.149  1.638 12.543 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.48e+00   4.61e+00    0.32   0.7487    
## cylinders    -2.84e-01   2.92e-01   -0.97   0.3308    
## displacement  1.37e-02   6.79e-03    2.02   0.0442 *  
## horsepower   -2.43e-02   1.24e-02   -1.96   0.0508 .  
## weight       -2.05e-02   1.58e-03  -12.97   <2e-16 ***
## acceleration  6.57e-02   8.90e-02    0.74   0.4606    
## year          8.00e-01   4.62e-02   17.33   <2e-16 ***
## origin        7.42e-01   2.60e-01    2.85   0.0046 ** 
## I(weight^2)   2.24e-06   2.34e-07    9.56   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3 on 383 degrees of freedom
## Multiple R-squared:  0.856,  Adjusted R-squared:  0.853 
## F-statistic:  284 on 8 and 383 DF,  p-value: <2e-16

weight remains highly significant and the squared term I(weight^2) is also highly significant (p < 2e-16). \(R^2\) rises from 0.821 to 0.856.

Transform \(cylinders^2\)

lm.fit = lm(mpg ~ . -name + I(cylinders^2), data=Auto)

Diagnostic Plots

par(mfrow = c(2,2))
plot(lm.fit)

Model Summary

summary(lm.fit)
## 
## Call:
## lm(formula = mpg ~ . - name + I(cylinders^2), data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.426  -2.028  -0.161   1.717  12.876 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -2.017889   5.729087   -0.35   0.7249    
## cylinders      -5.817956   1.264357   -4.60  5.7e-06 ***
## displacement    0.019789   0.007346    2.69   0.0074 ** 
## horsepower     -0.031265   0.013872   -2.25   0.0248 *  
## weight         -0.006291   0.000639   -9.85  < 2e-16 ***
## acceleration    0.104852   0.096778    1.08   0.2793    
## year            0.745314   0.049840   14.95  < 2e-16 ***
## origin          1.227920   0.275660    4.45  1.1e-05 ***
## I(cylinders^2)  0.464469   0.106791    4.35  1.8e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.2 on 383 degrees of freedom
## Multiple R-squared:  0.83,   Adjusted R-squared:  0.826 
## F-statistic:  234 on 8 and 383 DF,  p-value: <2e-16

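Adding the squared cylinders term makes cylinders (p = 5.7e-06) and horsepower (p = 0.025) significant, neither of which was significant in the original fit. The squared term itself is also significant (p = 1.8e-05).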
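
To compare the transformations side by side, the sketch below refits each model from this section and collects the adjusted \(R^2\) values. The list `formulas` and its labels are illustrative names; the formulas themselves are the ones used above.

```r
library(ISLR)  # provides the Auto data set

# One formula per model fitted in 9f, plus the untransformed baseline
formulas <- list(
  base      = mpg ~ . - name,
  log_accel = mpg ~ . - name + log(acceleration),
  log_hp    = mpg ~ . - name + log(horsepower),
  hp_sq     = mpg ~ . - name + I(horsepower^2),
  weight_sq = mpg ~ . - name + I(weight^2),
  cyl_sq    = mpg ~ . - name + I(cylinders^2)
)

# Fit each model and pull out its adjusted R-squared
adj_r2 <- sapply(formulas, function(f) {
  summary(lm(f, data = Auto))$adj.r.squared
})

print(round(sort(adj_r2, decreasing = TRUE), 3))
```

Adjusted \(R^2\) is used rather than \(R^2\) because each transformed model has one more term than the baseline. It is only a rough guide, and the diagnostic plots should still be checked for each candidate.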
