Abstract

This study analyzes the College dataset using univariate analysis, principal component analysis (PCA), and clustering. The univariate analysis showed that not all of the fitted models were appropriate for the individual variables. PCA indicated that the dataset is not well suited to dimensionality reduction, since the first two components together explain less than 80% of the variance; nevertheless, Kaiser’s rule retained three interpretable principal components: Academic Prestige and Student Spending, Size and Enrollment Volume, and Personal and Book Expenses. Finally, the clustering analysis showed that the k-means++ algorithm separated the institutions most effectively, identifying two groups that correspond to private and public colleges, while standard k-means with three clusters uncovered additional patterns related to institutional size, selectivity, and student costs.
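
For readers unfamiliar with the retention criterion mentioned above, the following is a minimal sketch of Kaiser's rule: keep the principal components whose eigenvalue of the correlation matrix exceeds 1. It is not the study's own code. The function name `kaiser_components` and the synthetic data are assumptions made purely for illustration; the College dataset is not used here, and the numbers produced say nothing about the study's results.

```python
import numpy as np

def kaiser_components(X):
    """Eigenvalues of the correlation matrix (descending), their
    explained-variance ratios, and the count retained by Kaiser's rule."""
    X = np.asarray(X, dtype=float)
    # PCA on standardized variables is an eigendecomposition of the
    # correlation matrix.
    corr = np.corrcoef(X, rowvar=False)
    eigvals = np.linalg.eigvalsh(corr)[::-1]  # eigvalsh returns ascending order
    ratios = eigvals / eigvals.sum()
    # Kaiser's rule: retain components with eigenvalue greater than 1.
    n_keep = int(np.sum(eigvals > 1.0))
    return eigvals, ratios, n_keep

# Synthetic stand-in data (hypothetical): six variables driven by two
# independent latent factors plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
X = np.hstack([
    latent[:, [0]] + 0.3 * rng.normal(size=(200, 3)),
    latent[:, [1]] + 0.3 * rng.normal(size=(200, 3)),
])

eigvals, ratios, n_keep = kaiser_components(X)
print("components retained:", n_keep)
print("variance explained by first two:", ratios[:2].sum())
```

The cumulative sum of `ratios` gives the share of variance explained by the leading components, which is the quantity compared against the 80% threshold in the text.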