1 Introduction
Understanding the landscape of higher education institutions in the United States is essential for students, policymakers, and educational organizations alike. Institutions differ widely in tuition costs, acceptance rates, graduation rates, and student demographics. Uncovering the latent structure behind these variations can help identify common profiles among colleges and provide insights into the broader educational system.
In this project, we apply unsupervised learning techniques to explore and analyze patterns in a dataset of U.S. colleges. Specifically, we use Principal Component Analysis (PCA) to reduce dimensionality and apply clustering algorithms to group similar institutions based on multiple quantitative features.
The dataset used is the College dataset from the ISLP package, the Python companion to the book An Introduction to Statistical Learning with Applications in Python. It contains observations on 777 U.S. colleges, with variables covering aspects such as costs, enrollment, graduation rates, and student composition.
Variable Descriptions
The dataset contains one categorical variable (Private) and 17 numerical variables:
- Private: whether the college is private or public.
- Apps: number of applications received.
- Accept: number of accepted students.
- Enroll: number of students enrolled.
- Top10perc: percentage of students from the top 10% of their high school class.
- Top25perc: percentage from the top 25%.
- F.Undergrad: number of full-time undergraduates.
- P.Undergrad: number of part-time undergraduates.
- Outstate: out-of-state tuition.
- Room.Board: estimated room and board cost.
- Books: estimated cost of books.
- Personal: estimated personal spending.
- PhD: percentage of faculty with a PhD.
- Terminal: percentage with terminal degrees.
- S.F.Ratio: student-faculty ratio.
- perc.alumni: percentage of alumni who donate.
- Expend: instructional expenditure per student.
- Grad.Rate: graduation rate.
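Note that the dataset records admissions as raw counts (Apps, Accept, Enroll) rather than as rates. As a minimal sketch, rate-style features such as an acceptance rate could be derived as follows. The function name `admission_ratios`, the output keys, and the example counts are hypothetical and are not part of the dataset or the ISLP package.

```python
def admission_ratios(apps: int, accept: int, enroll: int) -> dict[str, float]:
    """Derive rate-style features from the raw counts Apps, Accept, and Enroll."""
    if apps <= 0 or accept <= 0:
        raise ValueError("apps and accept must be positive")
    return {
        "accept_rate": accept / apps,   # share of applicants who were accepted
        "yield_rate": enroll / accept,  # share of accepted students who enrolled
    }


# Made-up counts for a hypothetical college (not taken from the dataset).
example = admission_ratios(apps=2000, accept=1500, enroll=600)
print(example)
```

Ratios like these put large and small institutions on a comparable footing, which raw counts do not.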
Analysis Objectives
The primary goal of this project is to uncover underlying patterns in the structure of U.S. colleges using unsupervised learning. We aim to:
- Conduct univariate analysis to understand the distribution of individual features.
- Apply Principal Component Analysis (PCA) to reduce the dimensionality of the dataset and identify the main axes of variation.
- Perform clustering analysis to segment colleges into interpretable groups based on their institutional characteristics.
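The last two objectives can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the analysis carried out in this project: it runs on synthetic data rather than the College dataset, it standardizes features before PCA (an assumption, motivated by the variables being on very different scales), and it uses k-means with two clusters on the first two principal components, although the clustering algorithm, the number of clusters, and the input to clustering are not fixed at this point in the text. The helper names `standardize`, `pca`, and `kmeans` are hypothetical.

```python
import numpy as np


def standardize(X):
    """Center each column and scale it to unit sample variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)


def pca(X, n_components):
    """PCA via SVD of the centered data.

    Returns the component scores and the explained-variance ratios
    of the retained components.
    """
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]
    explained = S**2 / np.sum(S**2)
    return scores, explained[:n_components]


def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; initial centers are k distinct random rows of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every point to every center, shape (n_points, k).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each center; keep the old one if a cluster is empty.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers


# Synthetic stand-in for three numeric features on very different scales
# (values are made up and are NOT drawn from the College dataset).
rng = np.random.default_rng(42)
group_a = rng.normal(loc=[5000, 20, 60], scale=[500, 2, 5], size=(50, 3))
group_b = rng.normal(loc=[15000, 10, 85], scale=[500, 2, 5], size=(50, 3))
X = np.vstack([group_a, group_b])

Z = standardize(X)                           # put features on a common scale
scores, explained = pca(Z, n_components=2)   # main axes of variation
labels, centers = kmeans(scores, k=2)        # segment into two groups

print("explained variance ratios:", explained)
print("cluster sizes:", np.bincount(labels))
```

In practice, scikit-learn's `PCA` and `KMeans` classes would be the more usual choice; the NumPy version is used here only to keep the sketch free of extra dependencies and to make each step explicit.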
This exploratory approach helps reveal how colleges can be grouped beyond traditional classifications, offering new perspectives on similarities and differences in the U.S. higher education system.