ISLR Home

Question

Consider the USArrests data. We will now perform hierarchical clustering on the states.

  1. Using hierarchical clustering with complete linkage and Euclidean distance, cluster the states.

  2. Cut the dendrogram at a height that results in three distinct clusters. Which states belong to which clusters?

  3. Hierarchically cluster the states using complete linkage and Euclidean distance, after scaling the variables to have standard deviation one.

  4. What effect does scaling the variables have on the hierarchical clustering obtained? In your opinion, should the variables be scaled before the inter-observation dissimilarities are computed? Provide a justification for your answer.


library(ISLR)
library(ggdendro) # Better dendrograms
library(ggplot2)

9a Fitting hierarchical clustering

hc.complete = hclust(
  dist(USArrests),
  method="complete"
)

Summary

summary(hc.complete)
##             Length Class  Mode     
## merge       98     -none- numeric  
## height      49     -none- numeric  
## order       50     -none- numeric  
## labels      50     -none- character
## method       1     -none- character
## call         3     -none- call     
## dist.method  1     -none- character
hc.complete
## 
## Call:
## hclust(d = dist(USArrests), method = "complete")
## 
## Cluster method   : complete 
## Distance         : euclidean 
## Number of objects: 50

Dendrogram

plot(hc.complete)

Better Dendrogram

ggdendrogram(hc.complete)

9b

Cutting the tree to have only three branches (clusters)

hc.cut.complete = cutree(hc.complete, 3)

Number of observations in each cluster

table(hc.cut.complete)
## hc.cut.complete
##  1  2  3 
## 16 14 20
hc.cut.complete
##        Alabama         Alaska        Arizona       Arkansas     California 
##              1              1              1              2              1 
##       Colorado    Connecticut       Delaware        Florida        Georgia 
##              2              3              1              1              2 
##         Hawaii          Idaho       Illinois        Indiana           Iowa 
##              3              3              1              3              3 
##         Kansas       Kentucky      Louisiana          Maine       Maryland 
##              3              3              1              3              1 
##  Massachusetts       Michigan      Minnesota    Mississippi       Missouri 
##              2              1              3              1              2 
##        Montana       Nebraska         Nevada  New Hampshire     New Jersey 
##              3              3              1              3              2 
##     New Mexico       New York North Carolina   North Dakota           Ohio 
##              1              1              1              3              3 
##       Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina 
##              2              2              3              2              1 
##   South Dakota      Tennessee          Texas           Utah        Vermont 
##              3              2              2              3              3 
##       Virginia     Washington  West Virginia      Wisconsin        Wyoming 
##              2              2              3              3              2
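To answer part (b) directly, the assignment vector can be turned into per-cluster lists of state names with base R's split() (a small convenience step on top of the cutree output above):

```r
# Group state names by their cluster label (1, 2, or 3)
clusters = split(names(hc.cut.complete), hc.cut.complete)

# Cluster sizes, then the members of cluster 1
sapply(clusters, length)
clusters[["1"]]
```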

Plotting the dendrogram with an abline at h = 150, a height that cuts the tree into three clusters

par(mfrow=c(1,1))
plot(hc.complete)
abline(h=150, col="red")
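The height h = 150 is not arbitrary: a horizontal cut yields exactly k clusters whenever it falls strictly between the k-th and (k-1)-th largest merge heights stored in the fitted object, so the choice can be checked numerically (a quick sanity check against the unscaled fit above):

```r
# Merge heights in decreasing order; a cut anywhere between h[3] and h[2]
# separates the tree into exactly 3 clusters
h = sort(hc.complete$height, decreasing = TRUE)
h[2:3]

# Confirm that h = 150 lies in that interval
length(unique(cutree(hc.complete, h = 150)))  # 3
```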

Better Dendrogram

ggdendrogram(hc.complete) + geom_hline(yintercept = 150, colour = "red")

9c

Scaling the features

USArrests.scaled = scale(USArrests)

# Fitting the scaled features
hc.complete.scaled = hclust(
  dist(USArrests.scaled),
  method="complete"
)
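The reason scaling changes the solution is visible in the raw standard deviations: Assault is measured on a much larger scale than the other three variables, so it dominates the unscaled Euclidean distances (a quick check with base R):

```r
# Standard deviation of each raw variable; Assault's is roughly an order
# of magnitude larger than Murder's or Rape's
apply(USArrests, 2, sd)

# After scale(), every column has standard deviation one
apply(USArrests.scaled, 2, sd)
```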

Dendrogram

plot(hc.complete.scaled)

Better Dendrogram

ggdendrogram(hc.complete.scaled)

ggdendrogram(hc.complete.scaled, rotate = TRUE, theme_dendro = FALSE)

9d Cutting to 3 branches

Cutting the tree to have only three branches (clusters)

hc.scaled.cut.complete = cutree(hc.complete.scaled, 3)
table(hc.scaled.cut.complete)
## hc.scaled.cut.complete
##  1  2  3 
##  8 11 31
table(hc.scaled.cut.complete, hc.cut.complete)
##                       hc.cut.complete
## hc.scaled.cut.complete  1  2  3
##                      1  6  2  0
##                      2  9  2  0
##                      3  1 10 20
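The cross-tabulation shows the two solutions disagree substantially (cluster labels are arbitrary between fits, so only the pattern of the table is meaningful, not the label numbers themselves). The scaled memberships can be listed the same way as in part (b):

```r
# States in each cluster of the scaled solution
clusters.scaled = split(names(hc.scaled.cut.complete), hc.scaled.cut.complete)
sapply(clusters.scaled, length)
```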

COMMENTS: Scaling the features changes the results: the cluster sizes shift from (16, 14, 20) to (8, 11, 31), and the vertical axis of the dendrogram shrinks because every variable now has standard deviation one. The variables should be scaled before the inter-observation dissimilarities are computed. They are measured in different units (Murder, Assault, and Rape are rates per 100,000 residents, while UrbanPop is a percentage), and without scaling, Assault, which has by far the largest variance, dominates the Euclidean distances regardless of whether it is the most informative variable.