ISLR Home

p401

We perform PCA on the USArrests data set, which is part of the base R package. The rows of the data set contain the 50 states, in alphabetical order.

states=row.names(USArrests)
#states
names(USArrests)
## [1] "Murder"   "Assault"  "UrbanPop" "Rape"

Apply the mean to each column

apply(USArrests, 2, mean)
##   Murder  Assault UrbanPop     Rape 
##    7.788  170.760   65.540   21.232

Apply the variance to each column

apply(USArrests, 2, var)
##     Murder    Assault   UrbanPop       Rape 
##   18.97047 6945.16571  209.51878   87.72916

Perform principal components analysis

Use the prcomp() function.

By default, the prcomp() function centers the variables to have mean zero. By using the option scale=TRUE, we scale the variables to have standard deviation one.

pr.out=prcomp(USArrests, scale=TRUE)
summary(pr.out)
## Importance of components:
##                           PC1    PC2     PC3     PC4
## Standard deviation     1.5749 0.9949 0.59713 0.41645
## Proportion of Variance 0.6201 0.2474 0.08914 0.04336
## Cumulative Proportion  0.6201 0.8675 0.95664 1.00000
pr.out$center
##   Murder  Assault UrbanPop     Rape 
##    7.788  170.760   65.540   21.232
pr.out$scale
##    Murder   Assault  UrbanPop      Rape 
##  4.355510 83.337661 14.474763  9.366385
names(pr.out)
## [1] "sdev"     "rotation" "center"   "scale"    "x"

The rotation matrix provides the principal component loadings; each column of pr.out$rotation contains the corresponding principal component loading vector.

pr.out$rotation
##                 PC1        PC2        PC3         PC4
## Murder   -0.5358995  0.4181809 -0.3412327  0.64922780
## Assault  -0.5831836  0.1879856 -0.2681484 -0.74340748
## UrbanPop -0.2781909 -0.8728062 -0.3780158  0.13387773
## Rape     -0.5434321 -0.1673186  0.8177779  0.08902432

There are four distinct principal components. In general min(n - 1, p) informative principal components in a data set with n observations and p variables.

dim(pr.out$x)
## [1] 50  4

Plot first two principal components

biplot(pr.out, scale=0)

The scale=0 argument to biplot() ensures that the arrows are scaled to represent the loadings; other values for scale give slightly different biplots with different interpretations.

Reproduce Figure 10.1

pr.out$rotation=-pr.out$rotation
pr.out$x=-pr.out$x
biplot(pr.out, scale=0)

The prcomp() function also outputs the standard deviation of each principal component

pr.out$sdev
## [1] 1.5748783 0.9948694 0.5971291 0.4164494

The variance explained by each principal component is obtained by squaring these:

pr.var <- pr.out$sdev^2
pr.var
## [1] 2.4802416 0.9897652 0.3565632 0.1734301

To compute the proportion of variance explained by each principal component, we simply divide the variance explained by each principal component by the total variance explained by all four principal components:

pve=pr.var/sum(pr.var)
pve
## [1] 0.62006039 0.24744129 0.08914080 0.04335752

We see that the first principal component explains 62.0 % of the variance in the data, the next principal component explains 24.7 % of the variance, and so forth.

Plot PVE

We can plot the PVE explained by each component, as well as the cumulative PVE, as follows:

par(mfrow=c(1,2))
plot(pve, xlab="Principal Component", 
     ylab="Proportion of Variance Explained ",
     ylim=c(0,1),type='b')
plot(cumsum(pve), xlab="Principal Component ",
     ylab="Cumulative Proportion of Variance Explained ",
     ylim=c(0,1), type='b')

cumsum() used above is the cumulative sum.

a=c(1,2,8,-3)
cumsum (a)
## [1]  1  3 11  8