4 Cluster Analysis

This chapter focuses on cluster analysis, with the aim of identifying patterns in the data. The adopted approach is structured into multiple stages, each designed to evaluate, through different methods, the feasibility and robustness of the clustering.

4.1 Feasibility of the cluster analysis

This first part investigates the feasibility of clustering by means of several exploratory techniques. Specifically, scatter plots of the variables, colored according to the binary variable Private, are compared with scatter plots generated from random values, in order to assess the presence of non-random patterns. The analysis is then extended to the first two principal components obtained via PCA. To further explore the dissimilarity structure among observations, the VAT (Visual Assessment of Tendency) algorithm is applied, and the Hopkins statistic is computed to measure the clustering tendency of the dataset.
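
As an illustration, a minimal R sketch of this tendency assessment, using the factoextra package and assuming the scaled numeric variables are stored in a data frame named college_num (a hypothetical name; the report may have used different tools):

library(factoextra)   # get_clust_tendency(), fviz_dist()

set.seed(123)
# Hopkins statistic: values far from 0.5 suggest a non-random structure
# (whether "clustered" corresponds to values near 0 or near 1 depends on
# the convention of the implementation being used).
tendency <- get_clust_tendency(college_num, n = nrow(college_num) - 1)
tendency$hopkins_stat

# VAT-style ordered dissimilarity image
fviz_dist(dist(college_num), show_labels = FALSE)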

Both the VAT image and the Hopkins statistic (0.1389425, far from the value of 0.5 expected under spatial randomness) indicate the presence of clusters in the dataset.

4.2 Hard clustering

This part is devoted to the application of hard clustering algorithms.

4.2.1 Hierarchical clustering

Before implementing hierarchical clustering, it is useful to compute the correlation between the cophenetic distances and the original distance matrix for each combination of distance and linkage method.
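
For instance, under the same assumption that the scaled data are stored in college_num, these correlations can be obtained with a sketch such as:

# Cophenetic correlation for every distance / linkage combination
dists    <- c("euclidean", "manhattan")
linkages <- c("single", "complete", "average", "ward.D2")

coph_cor <- sapply(dists, function(d) {
  dmat <- dist(college_num, method = d)
  sapply(linkages, function(l) {
    hc <- hclust(dmat, method = l)
    cor(dmat, cophenetic(hc))   # agreement between tree and original distances
  })
})
round(coph_cor, 4)   # rows: linkage methods, columns: distances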

Distance / linkage   Cophenetic correlation
euclidean_single     0.7038
euclidean_complete   0.6547
euclidean_average    0.7987
euclidean_ward.D2    0.4471
manhattan_single     0.6272
manhattan_complete   0.5267
manhattan_average    0.7605
manhattan_ward.D2    0.4633

The combinations with the highest cophenetic correlations are euclidean_average and manhattan_average. For both, a dendrogram is constructed, the optimal number of clusters (k) is identified, and the tree is pruned at this level.
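
For the Euclidean/average combination, this procedure might be sketched as follows (same naming assumptions as above; the NbClust settings are illustrative):

library(NbClust)    # indices for choosing the number of clusters
library(cluster)    # silhouette()

d_eucl <- dist(college_num, method = "euclidean")
hc_avg <- hclust(d_eucl, method = "average")
plot(hc_avg, labels = FALSE, main = "Euclidean distance, average linkage")

# Number of clusters suggested by several validity indices
nb <- NbClust(college_num, distance = "euclidean", method = "average",
              min.nc = 2, max.nc = 10)

# Cut the dendrogram at the chosen k and inspect the silhouette widths
cl <- cutree(hc_avg, k = 2)
plot(silhouette(cl, d_eucl))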

Hierarchical clustering with Euclidean distance and average linkage

NbClust function
## [1] "Frey index : No clustering structure in this data set"

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
## 

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 8 proposed 2 as the best number of clusters 
## * 4 proposed 3 as the best number of clusters 
## * 1 proposed 4 as the best number of clusters 
## * 7 proposed 5 as the best number of clusters 
## * 1 proposed 8 as the best number of clusters 
## * 1 proposed 9 as the best number of clusters 
## * 1 proposed 10 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  2 
##  
##  
## *******************************************************************
Two clusters

##   cluster size ave.sil.width
## 1       1  776          0.68
## 2       2    1          0.00

Five clusters

##   cluster size ave.sil.width
## 1       1  766          0.48
## 2       2    8          0.51
## 3       3    1          0.00
## 4       4    1          0.00
## 5       5    1          0.00

Based on these results, hierarchical clustering with Euclidean distance and average linkage does not yield meaningful partitions for either \(k = 2\) or \(k = 5\): in both cases almost all observations fall into a single large cluster, while the remaining clusters are singletons or near-singletons.

Hierarchical clustering with Manhattan distance and average linkage

NbClust function

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
## 

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 11 proposed 2 as the best number of clusters 
## * 1 proposed 3 as the best number of clusters 
## * 1 proposed 5 as the best number of clusters 
## * 2 proposed 6 as the best number of clusters 
## * 1 proposed 7 as the best number of clusters 
## * 6 proposed 8 as the best number of clusters 
## * 2 proposed 10 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  2 
##  
##  
## *******************************************************************
Two clusters

##   cluster size ave.sil.width
## 1       1  776          0.53
## 2       2    1          0.00

Eight clusters

##   cluster size ave.sil.width
## 1       1  604          0.28
## 2       2   90          0.27
## 3       3   70          0.19
## 4       4    8          0.30
## 5       5    2          0.49
## 6       6    1          0.00
## 7       7    1          0.00
## 8       8    1          0.00

Based on these results, hierarchical clustering with Manhattan distance and average linkage does not yield meaningful partitions either: for \(k = 2\) a single observation is isolated from the rest, and for \(k = 8\) several clusters are singletons while the larger clusters have low average silhouette widths.

4.2.2 Partitional clustering

Now, let us apply partitioning clustering algorithms, such as k-means and k-medoids. We also apply k-means++, which selects the initial centroids so as to reduce the probability of choosing starting points that are too close to each other.
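
A minimal sketch of these algorithms in R, again assuming the scaled data are in college_num (the ClusterR call is one possible way to obtain a k-means++ initialization; the report may have used a different implementation):

library(cluster)    # pam() for k-medoids
library(ClusterR)   # KMeans_rcpp() with k-means++ initialization

set.seed(123)

# Classic k-means
km <- kmeans(college_num, centers = 3, nstart = 25)

# k-means++-style initialization (one possible implementation)
kmpp <- KMeans_rcpp(college_num, clusters = 2, num_init = 5,
                    initializer = "kmeans++")

# k-medoids (PAM) with Euclidean and Manhattan distances
pam_eucl <- pam(college_num, k = 3, metric = "euclidean")
pam_manh <- pam(college_num, k = 3, metric = "manhattan")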

We observe that, for \(k = 2\), k-means++ successfully identifies the separation between private and public colleges, unlike classic k-means. For \(k = 3\), however, the two algorithms behave essentially in the same way.

K-means

NbClust function

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
## 

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 4 proposed 2 as the best number of clusters 
## * 15 proposed 3 as the best number of clusters 
## * 1 proposed 7 as the best number of clusters 
## * 1 proposed 8 as the best number of clusters 
## * 1 proposed 9 as the best number of clusters 
## * 1 proposed 10 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  3 
##  
##  
## *******************************************************************
Three clusters

##   cluster size ave.sil.width
## 1       1   93          0.18
## 2       2  246          0.23
## 3       3  438          0.26

Classic k-means with \(k = 3\) performs better than hierarchical clustering, but it is still difficult to clearly distinguish three separate clusters. It should also be noted that the total variance explained by the first two principal components is relatively low.

The silhouette plot shows that the elements within each cluster do not achieve a high silhouette score.

K-means++

Two clusters

##   cluster size ave.sil.width
## 1       1  676          0.36
## 2       2  101          0.20

##                  true_labels
## pred_labelsKmeans  No Yes
##               No   88  13
##               Yes 124 552

K-means++, despite some misclassifications, is able to separate private from public colleges, and its average silhouette width is better than that of the other algorithms.

K-medoids
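
The per-cluster silhouette summaries reported below can be obtained, for example, by reusing the pam objects sketched above (fviz_silhouette is one convenient option; the report may have produced them differently):

library(factoextra)   # fviz_silhouette()

fviz_silhouette(pam_eucl)   # Euclidean distance
fviz_silhouette(pam_manh)   # Manhattan distance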

Euclidean distance

##   cluster size ave.sil.width
## 1       1  388          0.27
## 2       2  237          0.23
## 3       3  152          0.10

Manhattan distance

##   cluster size ave.sil.width
## 1       1  367          0.32
## 2       2  236          0.22
## 3       3  174          0.11

The k-medoids algorithm is more robust to outliers than k-means, and using the Manhattan distance instead of the Euclidean distance yields slightly better results in terms of the average silhouette width.

4.2.3 Density-based clustering

Density-based clustering offers an alternative approach to partitioning methods, as it does not require the number of clusters to be specified in advance. The underlying idea is that clusters correspond to high-density regions in the data space, separated by low-density areas. In this section, two main algorithms are applied: DBSCAN (Density-Based Spatial Clustering of Applications with Noise), which identifies groups of points with sufficient density and distinguishes noise points, and OPTICS (Ordering Points To Identify the Clustering Structure), which extends DBSCAN by enabling the detection of clusters with varying densities.

DBSCAN

Setting parameters for DBSCAN (from the documentation of the dbscan function in the dbscan package)

The parameters minPts and eps define the minimum density required in the area around core points which form the backbone of clusters. minPts is the number of points required in the neighborhood around the point defined by the parameter eps (i.e., the radius around the point). Both parameters depend on each other and changing one typically requires changing the other one as well. The parameters also depend on the size of the data set with larger datasets requiring a larger minPts or a smaller eps.

  • minPts: The original DBSCAN paper (Ester et al, 1996) suggests to start by setting \(minPts \geq d + 1\), the data dimensionality plus one or higher with a minimum of 3. Larger values are preferable since increasing the parameter suppresses more noise in the data by requiring more points to form clusters. Sander et al (1998) uses in the examples two times the data dimensionality. Note that setting \(minPts \leq 2\) is equivalent to hierarchical clustering with the single link metric and the dendrogram cut at height eps.

  • eps: A suitable neighborhood size parameter eps given a fixed value for minPts can be found visually by inspecting the kNNdistplot() of the data using \(k = minPts - 1\) (minPts includes the point itself, while the k-nearest neighbors distance does not). The k-nearest neighbor distance plot sorts all data points by their k-nearest neighbor distance. A sudden increase of the kNN distance (a knee) indicates that the points to the right are most likely outliers. Choose eps for DBSCAN where the knee is.

Here, the elbow method is somewhat difficult to apply because the kNN-distance plot shows no clear knee, and the resulting DBSCAN clustering is not satisfactory.
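
A sketch of this parameter-selection procedure with the dbscan package follows; the minPts heuristic and the eps value are illustrative assumptions, not the values used in the report:

library(dbscan)

# Heuristic: minPts about twice the data dimensionality
minPts <- 2 * ncol(college_num)

# kNN distance plot: look for a knee to choose eps
kNNdistplot(college_num, k = minPts - 1)
abline(h = 1.5, lty = 2)   # illustrative eps candidate

db <- dbscan(college_num, eps = 1.5, minPts = minPts)
table(db$cluster)          # cluster 0 collects the noise points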

OPTICS

This implementation of OPTICS implements the original algorithm as described by Ankerst et al (1999). OPTICS is an ordering algorithm with methods to extract a clustering from the ordering.

OPTICS linearly orders the data points such that points which are spatially closest become neighbors in the ordering. The closest analog to this ordering is a dendrogram in single-link hierarchical clustering.

Extracting a clustering (from the documentation of the optics function in the dbscan package)

Several methods to extract a clustering from the order returned by OPTICS are implemented:

  • extractDBSCAN() extracts a clustering from an OPTICS ordering that is similar to what DBSCAN would produce with an eps set to eps_cl (see Ankerst et al, 1999). The only difference to a DBSCAN clustering is that OPTICS is not able to assign some border points and reports them instead as noise.

  • extractXi() extracts clusters hierarchically, as specified in Ankerst et al (1999), based on the steepness of the reachability plot. One interpretation of the xi parameter is that it classifies clusters by the change in relative cluster density. The algorithm used was originally contributed by the ELKI framework and is explained in Schubert et al (2018), but contains a set of fixes.

extractDBSCAN() with eps_cl=1.61

extractXi() with xi=0.01
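
The OPTICS results above can be reproduced along these lines (eps_cl and xi as reported; minPts is an assumption):

library(dbscan)

opt <- optics(college_num, minPts = 2 * ncol(college_num))
plot(opt)   # reachability plot

# DBSCAN-like extraction at a fixed reachability threshold
res_db <- extractDBSCAN(opt, eps_cl = 1.61)

# Hierarchical (xi) extraction based on the steepness of the reachability plot
res_xi <- extractXi(opt, xi = 0.01)

table(res_db$cluster)
table(res_xi$cluster)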

Finally, both DBSCAN and OPTICS yield unsatisfactory results.

4.2.4 Cluster validation

4.2.4.1 External clustering validation

External validation can be applied only to the result of k-means++ (\(k=2\)), since it is the only clustering algorithm that recovers the distinction between private and public colleges.
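
Both indices can be computed, for instance, with the fpc package; in the sketch below, kmpp_clusters denotes the k-means++ partition and private the vector of true labels (both hypothetical names):

library(fpc)   # cluster.stats()

ext <- cluster.stats(d = dist(college_num),
                     clustering     = kmpp_clusters,
                     alt.clustering = as.integer(private),
                     compareonly    = TRUE)

ext$corrected.rand   # Corrected (adjusted) Rand index
ext$vi               # Meila's Variation of Information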

Metric Value
Corrected Rand Index 0.3447
Meila’s VI Index 0.7293

Meila’s VI index is not very high (its range is \([0, \infty)\), where 0 indicates identical partitions), but the Corrected Rand Index is rather low, indicating only partial agreement between the clustering and the true labels.

4.2.4.2 Stability measures
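
The stability measures reported below (APN, AD, ADM, FOM) can be obtained, for example, with the clValid package; the sketch covers the hierarchical, k-means and k-medoids solutions under the usual naming assumptions (k-means++ would require a separate, custom computation):

library(clValid)

stab <- clValid(as.matrix(college_num), nClust = 2:3,
                clMethods  = c("hierarchical", "kmeans", "pam"),
                validation = "stability",
                metric = "euclidean", method = "average",
                maxitems = nrow(college_num))
summary(stab)   # APN, AD, ADM and FOM for each method and number of clusters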

Method APN AD ADM FOM
HC - Euclidean - Avg - k=2 0.0003 5.3248 0.0075 0.9988
HC - Euclidean - Avg - k=5 0.0090 5.1870 0.1195 0.9812
HC - Manhattan - Avg - k=2 0.0100 17.0750 0.1067 0.9968
HC - Manhattan - Avg - k=8 0.0762 14.6278 0.7625 0.8614
KMeans - k=3 0.0446 4.3166 0.1900 0.8142
KMeans++ - k=2 0.2790 5.1583 1.4171 0.9792
KMedoids - Euclidean - k=3 0.1047 4.3950 0.4561 0.8357
KMedoids - Manhattan - k=3 0.1116 13.6128 0.4865 0.8384
## Best APN: "HC - Euclidean - Avg - k=2"
## Best AD: "KMeans - k=3"
## Best ADM: "HC - Euclidean - Avg - k=2"
## Best FOM: "KMeans - k=3"

From these stability measures, we can see that the most stable clustering solutions are HC with Euclidean distance and average linkage (\(k=2\)) and k-means (\(k=3\)). However, the former yields only two clusters, one of which contains a single element. Therefore, k-means with \(k=3\) is preferable.

Finally, we should choose between k-means with \(k=3\) and k-means++ with \(k=2\).

4.3 Soft Clustering

This section is devoted to the application of soft clustering approaches, such as fuzzy clustering and model-based clustering.

4.3.1 Fuzzy Clustering

Here we compare the results of FKMEN with those of FKMedN. Other algorithms, such as GK-FKM and GKB-FK, were also tested, but they required too much computation time to provide results.

Fuzzy k-Means with Entropy regularization and Noise cluster

To select the optimal value of \(k\), we rely on fuzziness measures such as:

  • Partition Coefficient (PC):
##    PC k=2    PC k=3    PC k=4    PC k=5    PC k=6 
## 0.7949485 0.8783218 0.8910608 0.8914083 0.8325891

Best value at 5

  • Modified Partition Coefficient (MPC)
##   MPC k=2   MPC k=3   MPC k=4   MPC k=5   MPC k=6 
## 0.5898970 0.8174826 0.8547477 0.8642603 0.7991070

Best value at 5

  • Partition Entropy (PE)
##     PE k=2     PE k=3     PE k=4     PE k=5     PE k=6 
## 0.07165039 0.08514369 0.10977413 0.13535170 0.23454366

Best value at 2

and on compactness/separation measures such as:

  • Xie and Beni (XB) index
##    XB k=2    XB k=3    XB k=4    XB k=5    XB k=6 
## 0.4991835 0.5150927 0.7636770 0.8387996 1.3164018

Best value at 2

  • Fuzzy Silhouette (FS) index
## SIL.F k=2 SIL.F k=3 SIL.F k=4 SIL.F k=5 SIL.F k=6 
## 0.3518378 0.3735748 0.3156925 0.2980328 0.2237030

Best value at 3

According to these criteria, the chosen number of clusters is \(k=2\): the Partition Entropy (PE) and Xie-Beni (XB) indices attain their best values at \(k=2\), and the Fuzzy Silhouette (FS) index reaches its second highest value there.

Therefore, we apply FKMEN with \(k=2\).
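
The FKMEN solution and the indices above can be computed with the fclust package; a minimal sketch, assuming the data are in college_num (settings other than k are illustrative):

library(fclust)

# Fuzzy k-means with entropy regularization and a noise cluster
fkmen <- FKM.ent.noise(X = college_num, k = 2, RS = 5, stand = 1)

table(fkmen$clus[, 1])      # crisp assignment of the units to the clusters
fkmen$H                     # cluster prototypes (centroids)

# Fuzziness and compactness/separation indices
PC(fkmen$U)                 # partition coefficient
MPC(fkmen$U)                # modified partition coefficient
PE(fkmen$U)                 # partition entropy
SIL.F(fkmen$Xca, fkmen$U)   # fuzzy silhouette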

The resulting clusters have the following characteristics:

## Clus 1 Clus 2 
##    334    443

The centroids are given by:

Variable        Clus 1      Clus 2
Apps          2169.855    1738.471
Accept        1515.178    1257.868
Enroll         488.797     534.067
Top10perc       35.892      17.595
Top25perc       66.747      44.344
F.Undergrad   1964.473    2552.633
P.Undergrad    354.323     708.204
Outstate     13651.779    8057.978
Room.Board    4983.103    3820.954
Books          533.769     520.817
Personal      1055.049    1386.961
PhD             81.451      63.816
Terminal        88.025      71.244
S.F.Ratio       12.403      15.257
perc.alumni     31.056      17.531
Expend       11398.329    7094.512
Grad.Rate       76.174      57.498

The confusion matrix is:

##      
##         1   2
##   No   32 180
##   Yes 299 266

The corresponding Rand index is 0.0490135.

Fuzzy k-Medoids with Noise cluster (FKMedN algorithm)

According to the Fuzzy Silhouette (FS) index:

##  SIL.F k=2  SIL.F k=3  SIL.F k=4  SIL.F k=5  SIL.F k=6 
## 0.31672170 0.25337145 0.22778222 0.18049682 0.04819403

The best value is 2.

Therefore, we apply FKMedN with \(k=2\).

The resulting clusters have the following characteristics:

## Clus 1 Clus 2 
##    378    399

Hence, the medoids representing the two clusters are:

Variable      Manchester College   Roanoke College
Private       Yes                  Yes
Apps          1004                 2227
Accept        802                  1790
Enroll        239                  437
Top10perc     23                   27
Top25perc     63                   54
F.Undergrad   909                  1460
P.Undergrad   51                   239
Outstate      10440                13425
Room.Board    3850                 4425
Books         525                  450
Personal      1450                 1200
PhD           63                   85
Terminal      72                   89
S.F.Ratio     11.8                 13.0
perc.alumni   20                   26
Expend        7940                 9405
Grad.Rate     64                   72

The confusion matrix is:

##      
##         1   2
##   No  105 107
##   Yes 273 292

The corresponding Rand index is \(-6.9482 \times 10^{-4}\), i.e., essentially zero.

4.3.2 Model Based Clustering

One disadvantage of traditional clustering methods, such as hierarchical and partitioning clustering algorithms, is that they are largely heuristic and not based on formal models.

An alternative is model-based clustering, which relies on formal probabilistic models, typically finite mixtures of Gaussian distributions.
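
A minimal sketch of this approach with the mclust package, under the same naming assumptions as before:

library(mclust)

set.seed(123)
# Fit Gaussian finite mixture models, selecting the covariance
# structure and the number of components G by BIC
mod <- Mclust(college_num, G = 1:9)

summary(mod)              # selected model (e.g. VVE) and clustering table
plot(mod, what = "BIC")
table(mod$classification)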

## Best BIC values:
##              VVE,9       VVE,8       VVE,7
## BIC      -18547.17 -18671.4418 -18925.0659
## BIC diff      0.00   -124.2717   -377.8958

Using the Mclust function, the best model selected is VVE, which corresponds to clusters with variable volume and shape but equal orientation, with \(k=9\).

## ---------------------------------------------------- 
## Gaussian finite mixture model fitted by EM algorithm 
## ---------------------------------------------------- 
## 
## Mclust VVE (ellipsoidal, equal orientation) model with 9 components: 
## 
##  log-likelihood   n  df       BIC       ICL
##       -7776.111 777 450 -18547.17 -18669.83
## 
## Clustering table:
##   1   2   3   4   5   6   7   8   9 
##  52 101  90 125 112  94  46  54 103

However, from a graphical point of view, the result is very difficult to interpret.