Consider the USArrests data. We will now perform hierarchical clustering on the states.
(a) Using hierarchical clustering with complete linkage and Euclidean distance, cluster the states.
(b) Cut the dendrogram at a height that results in three distinct clusters. Which states belong to which clusters?
(c) Hierarchically cluster the states using complete linkage and Euclidean distance, after scaling the variables to have standard deviation one.
(d) What effect does scaling the variables have on the hierarchical clustering obtained? In your opinion, should the variables be scaled before the inter-observation dissimilarities are computed? Provide a justification for your answer.
library(ISLR) # USArrests itself ships with base R's datasets package
library(ggdendro) # Better dendrograms
library(ggplot2)
# Complete-linkage hierarchical clustering on Euclidean distances between states
hc.complete = hclust(
  dist(USArrests),
  method = "complete"
)
summary(hc.complete)
##             Length Class  Mode
## merge       98     -none- numeric
## height      49     -none- numeric
## order       50     -none- numeric
## labels      50     -none- character
## method       1     -none- character
## call         3     -none- call
## dist.method  1     -none- character
hc.complete
##
## Call:
## hclust(d = dist(USArrests), method = "complete")
##
## Cluster method   : complete
## Distance         : euclidean
## Number of objects: 50
Dendrogram
plot(hc.complete)
Better Dendrogram
ggdendrogram(hc.complete)
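The question about three clusters can be answered by cutting this tree with cutree(). A minimal sketch, assuming the hc.complete object fitted above (clusters.complete is my own name):

# Cut the complete-linkage tree into k = 3 clusters
clusters.complete = cutree(hc.complete, k = 3)
clusters.complete

# Count how many states land in each of the three clusters
table(clusters.complete)

The named vector returned by cutree() lists, state by state, which of the three clusters it belongs to.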
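For the scaled version of the clustering, scale() standardizes each column to standard deviation one before the dissimilarities are computed. A sketch under the same complete-linkage, Euclidean-distance setup (hc.scaled is my own name); checking the raw column variances first shows why scaling matters, since Assault is recorded on a much larger numeric scale than the other variables:

# The variables sit on very different scales, so Assault dominates unscaled Euclidean distances
apply(USArrests, 2, var)

# Standardize each column to standard deviation one, then recluster
hc.scaled = hclust(
  dist(scale(USArrests)),
  method = "complete"
)
ggdendrogram(hc.scaled)

Comparing this dendrogram with the unscaled one above shows the effect scaling has on the groupings, and cutree(hc.scaled, k = 3) gives the corresponding three-cluster assignments.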