Unsupervised learning

In this chapter, we discuss unsupervised learning.

In the initial code chunk, we load a library that provides publishing-related functionality used throughout the chapter.
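
The package itself is not shown in this rendering; as a purely hypothetical example, a setup chunk along these lines would serve the stated purpose:

# assumption: a package for publication-ready output, e.g., Publish;
# the actual setup chunk is not shown in this rendering
require(Publish)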

Clustering

Clustering is a family of unsupervised learning methods that group observations into multiple clusters based on similarity.

Desirable group characteristics include (to the extent possible)

  • low inter-cluster similarity: observations from different clusters should be dissimilar
  • high intra-cluster similarity: observations from the same cluster should be similar

Within-cluster variation is thus reduced by minimizing the within-cluster sum of squared Euclidean distances (Wikipedia 2023a).
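
In symbols, if the observations are partitioned into clusters \(S_1, \dots, S_k\) with cluster means \(\mu_1, \dots, \mu_k\), the quantity being minimized is

\[
\sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2 .
\]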

K-means

K-means is a very popular clustering algorithm that partitions the data into \(k\) groups.

Algorithm (a from-scratch sketch in R follows this list):

  • Choose the number of clusters \(k\) (e.g., 3).
  • Randomly select \(k\) observations from the data; use these points as starting centers (cluster means).
  • Assign each remaining point to the cluster with the nearest center, as measured by Euclidean distance.
  • Compute a new mean value for each cluster.
  • Based on these new means, reassign the data points to clusters.
  • Repeat until the data points no longer change cluster membership.
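
To make these steps concrete, here is a minimal from-scratch sketch of the algorithm in base R. This is illustrative only: the function name simple_kmeans is ours, and edge cases such as empty clusters are not handled. The rest of the chapter uses the built-in kmeans() function.

simple_kmeans <- function(X, k, max_iter = 100) {
  X <- as.matrix(X)
  # randomly pick k observations as the starting centers
  centers <- X[sample(nrow(X), k), , drop = FALSE]
  cluster <- rep(0L, nrow(X))
  for (iter in seq_len(max_iter)) {
    # squared Euclidean distance from every point to every center
    d <- sapply(seq_len(k), function(j) colSums((t(X) - centers[j, ])^2))
    new_cluster <- max.col(-d)              # nearest center per observation
    if (all(new_cluster == cluster)) break  # memberships stable: stop
    cluster <- new_cluster
    # recompute each cluster mean (empty clusters not handled in this sketch)
    centers <- do.call(rbind, lapply(seq_len(k), function(j)
      colMeans(X[cluster == j, , drop = FALSE])))
  }
  list(cluster = cluster, centers = centers)
}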

Read previously saved data

We read the previously saved analytic dataset (rhcAnalytic.RDS) from the specified file path.

# read the dataset saved earlier with saveRDS()
ObsData <- readRDS(file = "Data/machinelearning/rhcAnalytic.RDS")

In the next few code chunks, we apply k-means clustering to several subsets of the data, visualize the results, and display the cluster centers. The first example uses two variables and the second uses three. The third example selects a larger subset of variables; the subsequent code chunk then applies k-means clustering to that subset, displays the results, and aggregates the data by cluster to show the mean and standard deviation of each variable within each cluster.

Example 1

# k-means with 2 clusters on two variables, using 10 random starts
datax0 <- ObsData[c("Heart.rate", "edu")]
kres0 <- kmeans(datax0, centers = 2, nstart = 10)
kres0$centers
#>   Heart.rate      edu
#> 1   54.55138 11.44494
#> 2  134.96277 11.75466
# color points by cluster; the title shows the total within-cluster sum of squares
plot(datax0, col = kres0$cluster, main = kres0$tot.withinss)

Example 2

datax0 <- ObsData[c("blood.pressure", "Heart.rate", "Respiratory.rate")]
kres0 <- kmeans(datax0, centers = 2, nstart = 10)
kres0$centers
#>   blood.pressure Heart.rate Respiratory.rate
#> 1       80.10812  135.08956         29.85267
#> 2       73.71684   54.95789         22.76723
plot(datax0, col = kres0$cluster, main = kres0$tot.withinss)

Example with many variables

datax <- ObsData[c("edu", "blood.pressure", "Heart.rate", 
                   "Respiratory.rate" , "Temperature",
                   "PH", "Weight", "Length.of.Stay")]
# note: nstart defaults to 1 here; see the Discussion section below
kres <- kmeans(datax, centers = 3)
# kres  # uncomment to print the full kmeans output
head(kres$cluster)
#> [1] 1 1 1 3 1 2
kres$size
#> [1] 2795 1688 1252
kres$centers
#>        edu blood.pressure Heart.rate Respiratory.rate Temperature       PH
#> 1 11.85665       54.28086  136.34597         29.75277    37.85056 7.385267
#> 2 11.54214      128.33886  126.12026         29.36611    37.68129 7.401027
#> 3 11.46447       65.46446   53.17332         22.66717    37.01512 7.378432
#>     Weight Length.of.Stay
#> 1 68.67307       23.41789
#> 2 66.68351       20.68128
#> 3 67.48365       18.59425
# cluster-wise mean and standard deviation of each variable
aggregate(datax, by = list(cluster = kres$cluster), mean)
aggregate(datax, by = list(cluster = kres$cluster), sd)

Optimal number of clusters

Next, we explore how to determine the optimal number of clusters by plotting the total within-cluster sum of squares for a range of values of \(k\), marking a chosen value of \(k\) with a vertical line.

require(factoextra)
fviz_nbclust(datax, kmeans, method = "wss") +
  geom_vline(xintercept = 3, linetype = 3)

Here the position of the vertical line is chosen based on the elbow method (Wikipedia 2023b).
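
The same curve can be computed by hand: run kmeans() over a range of values of \(k\), record tot.withinss for each, and look for the bend where further increases in \(k\) yield little improvement. A base-R sketch, assuming the datax object from above (the seed value is arbitrary):

set.seed(111)
wss <- sapply(1:10, function(k)
  kmeans(datax, centers = k, nstart = 10)$tot.withinss)
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")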

Discussion

  • We need to supply the number of clusters \(k\), but we can test different values of \(k\) to identify an optimal one.
  • Clustering can be influenced by outliers, so median-based clustering is a possible alternative (see the sketch after this list).
  • The choice of initial centers can influence the final clustering, so we should try several random starts (e.g., nstart should be greater than 1).
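
One robust alternative in this spirit is PAM (partitioning around medoids) from the cluster package, in which each center is an actual observation rather than a mean, making the solution less sensitive to extreme values. A minimal sketch, assuming the datax object from above (the object name pres is ours):

require(cluster)
pres <- pam(datax, k = 3)   # k-medoids: each center is an actual observation
pres$medoids                # the representative observation for each cluster
table(pres$clustering)      # cluster sizes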

Video content (optional)

Tip

For those who prefer a video walkthrough, feel free to watch the video below, which describes an earlier version of the content above.

References

Wikipedia. 2023a. “Cross-Validation (Statistics).” https://en.wikipedia.org/wiki/Cross-validation_(statistics).
———. 2023b. “Elbow Method (Clustering).” https://en.wikipedia.org/wiki/Elbow_method_(clustering).