Chapter 7 Unsupervised Learning
Clustering is an unsupervised learning technique: such algorithms group the data into multiple clusters based on similarity, without using any outcome labels. Within-cluster variation is minimized by optimizing the within-cluster sum of squared Euclidean distances (ref).
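More precisely, for \(k\) clusters \(C_1, \dots, C_k\) with cluster means \(\mu_1, \dots, \mu_k\), the quantity being minimized is the total within-cluster sum of squares:
\[
\text{WSS} = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2 .
\]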
7.1 K-means
K-means is a very popular clustering algorithm that partitions the data into \(k\) groups.
Algorithm (a minimal R sketch of these steps follows the list):
- Determine a number of clusters \(k\) (e.g., 3).
- Randomly select \(k\) subjects from the data. Use these points as the starting centers (cluster means).
- Assign each remaining point to the cluster whose center is closest by Euclidean distance.
- Compute a new mean value for each cluster.
- Based on these new means, reassign the data points to clusters.
- The process continues until the data points no longer change cluster membership.
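The function below is a minimal sketch of those steps written directly in R, assuming a numeric matrix or data frame X and a chosen k; in practice the built-in kmeans() function (used in the examples below) should be preferred.
my_kmeans <- function(X, k, max_iter = 100) {
  X <- as.matrix(X)
  # steps 1-2: pick k random rows of X as the starting centers
  centers <- X[sample(nrow(X), k), , drop = FALSE]
  cluster <- rep(0, nrow(X))
  for (iter in 1:max_iter) {
    # step 3: assign each point to the nearest center (Euclidean distance)
    d <- as.matrix(dist(rbind(centers, X)))[-(1:k), 1:k]
    new_cluster <- apply(d, 1, which.min)
    # steps 5-6: stop once no point changes cluster membership
    if (all(new_cluster == cluster)) break
    cluster <- new_cluster
    # step 4: recompute each cluster mean (assumes no cluster becomes empty)
    centers <- apply(X, 2, function(v) tapply(v, cluster, mean))
  }
  list(cluster = cluster, centers = centers)
}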
7.2 Read previously saved data
ObsData <- readRDS(file = "data/rhcAnalytic.RDS")
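A quick check of what was loaded can be helpful before clustering (assuming ObsData is a data frame, as in the earlier chapters):
dim(ObsData)    # number of rows (subjects) and columns (variables)
names(ObsData)  # variable names available for clustering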
7.2.1 Example 1
<- ObsData[c("Heart.rate", "edu")]
datax0 <- kmeans(datax0, centers = 2, nstart = 10)
kres0 $centers kres0
## Heart.rate edu
## 1 54.55138 11.44494
## 2 134.96277 11.75466
plot(datax0, col = kres0$cluster, main = kres0$tot.withinss)
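To make the scatterplot easier to read, the estimated cluster centers can also be overlaid on it (a small optional addition, not part of the original example):
plot(datax0, col = kres0$cluster, main = kres0$tot.withinss)
points(kres0$centers, col = 1:2, pch = 8, cex = 3)  # mark the two cluster centers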
7.2.2 Example 2
<- ObsData[c("blood.pressure", "Heart.rate",
datax0 "Respiratory.rate")]
<- kmeans(datax0, centers = 2, nstart = 10)
kres0 $centers kres0
## blood.pressure Heart.rate Respiratory.rate
## 1 80.10812 135.08956 29.85267
## 2 73.71684 54.95789 22.76723
plot(datax0, col = kres0$cluster, main = kres0$tot.withinss)
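The kmeans object also reports how the total variability is split between and within the clusters, which connects directly to the objective described at the start of the chapter (shown here as an optional check):
kres0$totss         # total sum of squares
kres0$tot.withinss  # within-cluster sum of squares (what k-means minimizes)
kres0$betweenss     # between-cluster sum of squares; totss = tot.withinss + betweenss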
7.2.3 Example with many variables
<- ObsData[c("edu", "blood.pressure", "Heart.rate",
datax "Respiratory.rate" , "Temperature",
"PH", "Weight", "Length.of.Stay")]
<- kmeans(datax, centers = 3)
kres #kres
head(kres$cluster)
## [1] 1 1 1 3 1 2
kres$size
## [1] 2793 1688 1254
kres$centers
## edu blood.pressure Heart.rate Respiratory.rate Temperature PH
## 1 11.85833 54.26924 136.37451 29.76119 37.85078 7.385249
## 2 11.54214 128.33886 126.12026 29.36611 37.68129 7.401027
## 3 11.46134 65.47249 53.24242 22.65973 37.01597 7.378482
## Weight Length.of.Stay
## 1 68.63384 23.42356
## 2 66.68351 20.68128
## 3 67.57291 18.58931
aggregate(datax, by = list(cluster = kres$cluster), mean)
## cluster edu blood.pressure Heart.rate Respiratory.rate Temperature
## 1 1 11.85833 54.26924 136.37451 29.76119 37.85078
## 2 2 11.54214 128.33886 126.12026 29.36611 37.68129
## 3 3 11.46134 65.47249 53.24242 22.65973 37.01597
## PH Weight Length.of.Stay
## 1 7.385249 68.63384 23.42356
## 2 7.401027 66.68351 20.68128
## 3 7.378482 67.57291 18.58931
aggregate(datax, by = list(cluster = kres$cluster), sd)
## cluster edu blood.pressure Heart.rate Respiratory.rate Temperature
## 1 1 3.162485 11.93763 23.13140 13.67791 1.781692
## 2 2 3.091605 18.58070 27.68369 14.08169 1.610746
## 3 3 3.160538 31.89150 23.63993 13.60831 1.832389
## PH Weight Length.of.Stay
## 1 0.1082140 27.99506 29.01143
## 2 0.1009567 32.15078 23.37223
## 3 0.1226041 26.87075 20.82024
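With more than two variables the clusters are hard to plot directly; one option (a sketch, assuming the factoextra package is installed) is fviz_cluster(), which displays the clusters on the first two principal components:
require(factoextra)
fviz_cluster(kres, data = datax)  # clusters shown on the first two principal components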
7.3 Optimal number of clusters
require(factoextra)
fviz_nbclust(datax, kmeans, method = "wss") +
  geom_vline(xintercept = 3, linetype = 3)
Here the vertical line is chosen based on the elbow method (ref): we look for the point where adding another cluster no longer produces a substantial drop in the total within-cluster sum of squares.
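An alternative criterion supported by the same function is the average silhouette width (again a sketch using factoextra, not part of the original analysis):
fviz_nbclust(datax, kmeans, method = "silhouette")  # choose k that maximizes average silhouette width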
7.4 Discussion
- We need to supply a number \(k\), but we can test different values of \(k\) to identify the optimal one.
- Clustering can be influenced by outliers, so median-based clustering is possible (a sketch follows this list).
- Mere ordering of the data can influence clustering, hence we should try different initial means (e.g., nstart should be greater than 1).
- Group characteristics include (to the extent possible) similarity within a cluster and dissimilarity between clusters.
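As one example of a more outlier-robust approach, k-medoids (PAM, partitioning around medoids) from the cluster package could be tried; this is a sketch under the assumption that the cluster package is installed, not part of the original analysis:
require(cluster)
pres <- pam(datax, k = 3)   # uses actual subjects (medoids) as cluster centers
pres$medoids                # the representative subject of each cluster
table(pres$clustering)      # cluster sizes
# For large datasets, clara() in the same package is a faster, sampling-based alternative.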