We have seen that as the number of features used in a model increases, the training error will necessarily decrease, but the test error may not. We will now explore this in a simulated data set.

(a) Generate a data set with p = 20 features, n = 1,000 observations, and an associated quantitative response vector generated according to the model Y = Xβ + ε, where β has some elements that are exactly equal to zero.
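One way to set up the simulation in (a) is sketched below in Python with NumPy. The seed, the standard normal distributions for X, β, and ε, and the decision to zero out 8 of the 20 coefficients are all illustrative assumptions; the exercise only requires that some elements of β be exactly zero.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, chosen only for reproducibility

n, p = 1000, 20
X = rng.normal(size=(n, p))  # assumed: standard normal features

# True coefficient vector. The values and the number of zeros (8) are
# assumptions made for illustration, not part of the exercise statement.
beta = rng.normal(size=p)
zero_idx = rng.choice(p, size=8, replace=False)
beta[zero_idx] = 0.0

eps = rng.normal(size=n)  # assumed: standard normal noise
y = X @ beta + eps        # response generated as Y = X beta + eps
```

Changing the number of zeroed coefficients or the noise scale is the natural knob to turn if the test set MSE ends up minimized at an extreme model size.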

  1. Split your data set into a training set containing 100 observations and a test set containing 900 observations.

  2. Perform best subset selection on the training set, and plot the training set MSE associated with the best model of each size.

  3. Plot the test set MSE associated with the best model of each size.

  4. For which model size does the test set MSE take on its minimum value? Comment on your results. If it takes on its minimum value for a model containing only an intercept or a model containing all of the features, then play around with the way that you are generating the data in (a) until you come up with a scenario in which the test set MSE is minimized for an intermediate model size.

  5. How does the model at which the test set MSE is minimized compare to the true model used to generate the data? Comment on the coefficient values.

  6. Create a plot displaying $\sqrt{\sum_{j=1}^{p} (\beta_j - \hat{\beta}_j^r)^2}$ for a range of values of $r$, where $\hat{\beta}_j^r$ is the $j$th coefficient estimate for the best model containing $r$ coefficients. Comment on what you observe. How does this compare to the test MSE plot from step 3?
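The steps above can be sketched in Python with NumPy as an exhaustive search over feature subsets. This is a minimal sketch under stated assumptions, not a reference solution: the helper names `fit_ols` and `best_subset` are hypothetical, the coefficient values are made up, and the demo uses p = 8 rather than p = 20 so that the search covers 2^8 = 256 models instead of roughly a million. Coefficients of features left out of a model are treated as zero when computing the coefficient error.

```python
from itertools import combinations

import numpy as np


def fit_ols(X, y, cols):
    """Least-squares fit with an intercept on the chosen columns."""
    A = np.column_stack([np.ones(len(y)), X[:, cols]])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[0], coef[1:]  # intercept, slopes


def best_subset(X_tr, y_tr, X_te, y_te, beta_true):
    """For each size r, keep the subset with the lowest training RSS."""
    p = X_tr.shape[1]
    results = []
    for r in range(p + 1):
        best = None
        for combo in combinations(range(p), r):
            cols = list(combo)
            b0, b = fit_ols(X_tr, y_tr, cols)
            rss = np.sum((y_tr - b0 - X_tr[:, cols] @ b) ** 2)
            if best is None or rss < best[0]:
                best = (rss, cols, b0, b)
        rss, cols, b0, b = best
        beta_hat = np.zeros(p)  # excluded features get coefficient 0
        beta_hat[cols] = b
        results.append({
            "size": r,
            "features": cols,
            "train_mse": rss / len(y_tr),
            "test_mse": np.mean((y_te - b0 - X_te @ beta_hat) ** 2),
            "coef_error": np.sqrt(np.sum((beta_true - beta_hat) ** 2)),
        })
    return results


# Small demo: p = 8 and these coefficient values are illustrative assumptions.
rng = np.random.default_rng(1)
n, p = 1000, 8
X = rng.normal(size=(n, p))
beta = np.array([3.0, -2.0, 1.5, 0.0, 0.0, 1.0, 0.0, 0.0])
y = X @ beta + rng.normal(size=n)

# 100 training observations, 900 test observations.
X_tr, y_tr, X_te, y_te = X[:100], y[:100], X[100:], y[100:]

results = best_subset(X_tr, y_tr, X_te, y_te, beta)
for row in results:
    print(row["size"], row["features"],
          round(row["train_mse"], 3),
          round(row["test_mse"], 3),
          round(row["coef_error"], 3))
```

Plotting `train_mse`, `test_mse`, and `coef_error` against `size` gives the three curves the exercise asks for. The training MSE can only go down as the size grows, while the other two need not.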