Question

In this exercise you will create some simulated data and fit simple linear regression models to it. Make sure to use set.seed(1) prior to starting part (a) to ensure consistent results.

13a Generate 100 random normal points

Setting the seed to keep random values consistent

set.seed(1)

# Creating a vector of 100 values drawn from the standard normal distribution, N(0, 1)
x = rnorm(100)

13b Epsilon \(\epsilon \sim \mathcal{N}(0,0.25)\)

Creating a vector of 100 values drawn from a normal distribution with mean 0 and standard deviation 0.25

eps = rnorm(100, 0, 0.25)
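
A note on notation: in ISLR, \(\mathcal{N}(0, 0.25)\) denotes a normal distribution with variance 0.25, which corresponds to sd = sqrt(0.25) = 0.5 in rnorm(). The code above (and all of the output below) instead treats 0.25 as the standard deviation. A draw under the variance reading would look like this (eps.alt is a name introduced here purely for illustration):

eps.alt = rnorm(100, 0, sqrt(0.25)) # sd = 0.5 when 0.25 is read as the variance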

13c Creating the model

Generating the response according to \(Y = -1 + 0.5X + \epsilon\)

Y = -1 + 0.5*x + eps
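
The exercise also asks for the length of Y and the values of \(\beta_0\) and \(\beta_1\): Y has 100 elements, one per value of x, and the true coefficients are \(\beta_0 = -1\) and \(\beta_1 = 0.5\). A quick check of the length:

length(Y) # 100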

13d Plot

plot(x, Y)

The scatterplot shows a clear positive linear relationship between x and Y, with moderate scatter introduced by the error term.
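
As a quick numeric check (not part of the original output), the sample correlation between x and Y should be strongly positive, consistent with the scatterplot:

# Correlation between predictor and response
cor(x, Y)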

13e

Model Summary

lm.fit = lm(Y~x)
summary(lm.fit)
## 
## Call:
## lm(formula = Y ~ x)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.46921 -0.15344 -0.03487  0.13485  0.58654 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.00942    0.02425  -41.63   <2e-16 ***
## x            0.49973    0.02693   18.56   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2407 on 98 degrees of freedom
## Multiple R-squared:  0.7784, Adjusted R-squared:  0.7762 
## F-statistic: 344.3 on 1 and 98 DF,  p-value: < 2.2e-16

The estimated coefficients, \(\hat{\beta}_0 = -1.009\) and \(\hat{\beta}_1 = 0.4997\), are very close to the true values \(\beta_0 = -1\) and \(\beta_1 = 0.5\), differing by about 0.01 or less.
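
A compact way to see this (added here, not in the original) is to subtract the true coefficients from the estimates:

# Difference between the estimated and the true coefficients
coef(lm.fit) - c(-1, 0.5)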

13f Plot

plot(x, Y)
abline(lm.fit, col='red')
abline(-1, 0.5, col="blue")
legend("topleft", c("Predicted", "Actual"), col=c("red", "blue"), lwd=2)

13g Fit Model

Fitting a model introducing a quadratic term

lm.poly.fit = lm(Y~x+I(x^2))
summary(lm.poly.fit)
## 
## Call:
## lm(formula = Y ~ x + I(x^2))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4913 -0.1563 -0.0322  0.1451  0.5675 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.98582    0.02941 -33.516   <2e-16 ***
## x            0.50429    0.02700  18.680   <2e-16 ***
## I(x^2)      -0.02973    0.02119  -1.403    0.164    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2395 on 97 degrees of freedom
## Multiple R-squared:  0.7828, Adjusted R-squared:  0.7784 
## F-statistic: 174.8 on 2 and 97 DF,  p-value: < 2.2e-16

Adding the quadratic term does not improve the model much: adjusted \(R^2\) rises only from 0.7762 to 0.7784 (about 0.002), and the quadratic term is not statistically significant (t = -1.403, p = 0.164).
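
A more formal comparison (not shown in the original) is a nested-model ANOVA; with a single added term, its F-test is equivalent to the t-test on the quadratic coefficient:

# Compare the linear fit against the quadratic fit
anova(lm.fit, lm.poly.fit)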

13h Generating Data

Generating data for x randomly from a standard normal distribution

x.less = rnorm(100)

Generating random error terms from a normal distribution with a small standard deviation (0.05)

eps.less = rnorm(100, 0, 0.05) # epsilon

Creating the Y value given x and eps

Y.less = -1 + 0.5*x.less + eps.less

Fitting a least squares regression

lm.less.fit = lm(Y.less ~ x.less)

Model Summary

summary(lm.less.fit)
## 
## Call:
## lm(formula = Y.less ~ x.less)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.137090 -0.028070 -0.000874  0.033987  0.092421 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.997578   0.004955  -201.3   <2e-16 ***
## x.less       0.505311   0.004813   105.0   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.04953 on 98 degrees of freedom
## Multiple R-squared:  0.9912, Adjusted R-squared:  0.9911 
## F-statistic: 1.102e+04 on 1 and 98 DF,  p-value: < 2.2e-16

Plotting the x and y values

plot(x.less, Y.less)

# Plotting the least square regression line
abline(lm.less.fit, col='red')

# Plotting the population regression line
abline(-1, 0.5, col="blue")

# Creating a legend for plot
legend("topleft", c("Predicted", "Actual"), col=c("red", "blue"), lwd=2)

When the standard deviation of the error term is small, the observations fall close to the population regression line, so the least squares fit is very precise: \(R^2\) rises to 0.991 and the coefficient standard errors shrink to about 0.005.

13i

Generating data for x randomly from a standard normal distribution

x.more = rnorm(100)

Generating random error terms from a normal distribution with a large standard deviation (2)

eps.more = rnorm(100, 0, 2)

Creating the Y value given x and eps

Y.more = -1 + 0.5*x.more + eps.more

# Fitting a least squares regression
lm.more.fit = lm(Y.more~x.more)

Model Summary

summary(lm.more.fit)
## 
## Call:
## lm(formula = Y.more ~ x.more)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.0203 -1.2110  0.0413  1.4097  4.1796 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.0949     0.1935  -5.658 1.52e-07 ***
## x.more        0.3501     0.1662   2.106   0.0377 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.934 on 98 degrees of freedom
## Multiple R-squared:  0.04332,    Adjusted R-squared:  0.03355 
## F-statistic: 4.437 on 1 and 98 DF,  p-value: 0.03772

Plotting the x and y values

plot(x.more, Y.more)

# Plotting the least square regression line
abline(lm.more.fit, col='red')

# Plotting the population regression line
abline(-1, 0.5, col="blue")

# Creating a legend for plot
legend("topleft", c("Predicted", "Actual"), col=c("red", "blue"), lwd=2)

When the standard deviation of the error term is large, the observations scatter widely around the population regression line, making the relationship harder to detect: \(R^2\) falls to 0.043 and the slope estimate (0.350) is noticeably further from the true value of 0.5, though it remains statistically significant (p = 0.038).
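
To put the three fits side by side (a summary added here, not in the original), we can extract the residual standard error and \(R^2\) from each model:

# Residual standard error and R-squared for each of the three fits
fits = list(moderate = lm.fit, low.noise = lm.less.fit, high.noise = lm.more.fit)
sapply(fits, function(m) c(sigma = summary(m)$sigma, r.squared = summary(m)$r.squared))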

13j Confidence Intervals

Getting the confidence intervals for each linear model

confint(lm.fit)
##                  2.5 %     97.5 %
## (Intercept) -1.0575402 -0.9613061
## x            0.4462897  0.5531801
confint(lm.less.fit)
##                  2.5 %     97.5 %
## (Intercept) -1.0074105 -0.9877445
## x.less       0.4957597  0.5148621
confint(lm.more.fit)
##                   2.5 %     97.5 %
## (Intercept) -1.47895440 -0.7108553
## x.more       0.02027816  0.6799263

The larger the standard deviation of the error term, the wider the confidence intervals for the coefficients: the slope interval for the noisy model, (0.020, 0.680), is far wider than that for the low-noise model, (0.496, 0.515). All three intervals still contain the true values \(\beta_0 = -1\) and \(\beta_1 = 0.5\).
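
One way to quantify this (added here for illustration) is the width of each 95% interval; wider intervals mean less precise estimates:

# Interval width (upper bound minus lower bound) for each coefficient
apply(confint(lm.fit), 1, diff)
apply(confint(lm.less.fit), 1, diff)
apply(confint(lm.more.fit), 1, diff)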