BSTA504 Midterm

Install required packages

#install.packages("openxlsx")
#install.packages("devtools")
#install.packages("wesanderson")
#install.packages("RColorBrewer")
#install.packages("tidytext")

Load required packages

library(tidytuesdayR)
library(janitor)

## 
## Attaching package: 'janitor'

## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

library(tidyverse)

## ── Attaching packages
## ───────────────────────────────────────
## tidyverse 1.3.2 ──

## ✔ ggplot2 3.4.0      ✔ purrr   1.0.1 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.5.0 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(ggplot2)
library(ggthemes)
library(readxl) # to read the excel files
library(openxlsx)
library(stringr)
library(wesanderson)
library(RColorBrewer)
library(gridExtra)

## 
## Attaching package: 'gridExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine

library(tidytext)

Define Your Research Question (10 points)

Define your research question below. What about the data interests you? What is a specific question you want to find out about the data?

Given your question, what is your expectation about the data?

Source of the data

I have taken the College tuition, diversity, and pay from TidyTuesday (week of 10th March 2020).The data comes from different sources but the original source is the US Department of Education.

The easily available data is from tuitiontracker.org but since the dataset is in a very wide format, the data was converted to long format and only selected tables are available in the tidytuesday dataset.

What interests me about the data?

I am very interested in understanding how the type of college chosen in the undergraduate degree influences and shapes one’s career prospects and their quality of education.

In most cases, the cost of attending public college is significantly cheaper than that of private colleges. Students who have their degrees from private colleges especially the elite universities and ivy-league colleges spend way more than the students who attend public colleges.The highly ranked colleges have very low acceptance rate making the process of getting into top colleges incredibly difficult.

Although, attending a highly ranked private college can be prestigious and may have better infrastructure and career prospects, it is important to not choose a college purely based on the ranking but instead evaluate all the aspects before choosing the college that would be well aligned with the one’s goal and interests.

I am also interested in understanding if gender, race and ethnicity has an influence in college selection.

Being an immigrant and having attended graduate college in a private university in the US which had a very high tuition and having worked in the industry for a couple years, my thought process of choosing a college has drifted away from emphasizing on the ranking to considering all the aspects holistically, the most important criteria`s to evaluate is

->the curriculum and faculty:ensure that curriculum offered is diverse and in-depth that will enhance career opportunities

->to be able to choose a program that is closely aligned with the interests

->networking opportunities and a strong alumni network

->return of investment

-> safety and living expenses

-> open to different cultural/ racial/ ethnic and socio-economic backgrounds

Define the research question

Research Question 1

Is the Earning potential of student and the quality of education influenced by the type of college they attend?Does this relationship change across the different states in the USA?

Typically the private schools have better infrastructure, facility and funding opportunities, that I would like to study if attending private/public college affect the earning potential at entry level and at much later in the careers.

I would also like to study how a private/public college degree would contribute to aquiring STEM skills and to the betterment of society.

Research Question 2

Does factors like gender, tuition-cost, race, income level affect the selection of type of college(public vs private)?

There has always been systemic bias in accessibility to education. Women, students from lower socio-economic backgrounds and minor racial/ethnic groups typically do not have the financial and moral support than those who come from affluent background and from more priveleged racial groups.

I am also interested in performing some sort of exploratory analysis such as

Number of private vs public colleges in the data set considered across the different states in the United States of America
Analyze the change in tuition rates over the last 20 years for different type of colleges and for different type of programs in the United States

What is a specific question you want to find out about the data

Basically my intent is to compare if the students attending private and public college have different career prospects and quality of education and if so, how does the difference look.

Also, I am interested in understanding if gender or race/ethnicity influence in the decision of attending a particular type of school and if location play a role too in college selection.

Given your question, what is your expectation about the data?

My expectation of the data is the choice of college is definitely influenced by the socioeconomic status. Most likely, students from lower socioeconomic status would attend colleges that have a reasonable tuition fees and also provides scholarship that will aid them in paying off their tuitions. Students from minor racial/ ethnic groups or students who are first generational learners would probably be inclined to attend colleges that would be have a strong diversity and inclusion policy, lower tuition fees or provide need based scholarships. I would expect % women enrollment to be lesser than % men enrollment in most parts of the USA.

And I think the earning potential of the student may vary at the beginning of their careers i.e. students who attend prestigious private colleges may have a better start in their careers compared to students who attend public colleges but this difference may weaken as they advance in their careers.

The quality of education(making world a better place and stem %) maybe influenced partly by the college one attends but the most important aspect is one’s personal interest and critical thinking skills.

Loading the Data (10 points)

Load the data below and use dplyr::glimpse() or skimr::skim() on the data. You should upload the data file into the data directory.

Load data from TidyTuesday

# Load the tidytuesday data

devtools::install_github("thebioengineer/tidytuesdayR")

## Skipping install of 'tidytuesdayR' from a github remote, the SHA1 (51dfba6f) has not changed since last install.
##   Use `force = TRUE` to force installation

tuesdata <- tidytuesdayR::tt_load(2020, week = 11)

## --- Compiling #TidyTuesday Information for 2020-03-10 ----

## --- There are 5 files available ---

## --- Starting Download ---

## 
##  Downloading file 1 of 5: `diversity_school.csv`
##  Downloading file 2 of 5: `historical_tuition.csv`
##  Downloading file 3 of 5: `salary_potential.csv`
##  Downloading file 4 of 5: `tuition_cost.csv`
##  Downloading file 5 of 5: `tuition_income.csv`

## --- Download complete ---

Upload the file into the data directory

The data could be uploaded either in 2 ways, directly access from “tuesdata” into the environment or by writing the data into the folder.

# write the cost data into /data folder
write_excel_csv(tuesdata$tuition_cost,
                file=here::here("Data/cost.csv"))

# write the tuition income data into /data folder
write_excel_csv(tuesdata$tuition_income,
 file=here::here("Data/income.csv"))

# write the salary data into /data folder
write_excel_csv(tuesdata$salary_potential,
 file=here::here("Data/salary.csv"))

# write the historical tuition data into /data folder
write_excel_csv(tuesdata$historical_tuition,
 file=here::here("Data/historical_tuition.csv"))

# write the diversity data into /data folder
write_excel_csv(tuesdata$diversity_school,
     file=here::here("Data/diversity.csv"))

#alternative way to load data

cost<-tuesdata$tuition_cost
income<-tuesdata$tuition_income
salary<-tuesdata$salary_potential
hist_tuition<-tuesdata$historical_tuition
diversity<-tuesdata$diversity_school

Read the data files

This section would be not be required if we directly loaded the data into the environment.

# read the cost data set from /data folder
cost<-read_csv(here::here("Data/cost.csv"),
na="NA",skip=0)

## Rows: 2973 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): name, state, state_code, type, degree_length
## dbl (5): room_and_board, in_state_tuition, in_state_total, out_of_state_tuit...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# read the income data set from /data folder
income<-read_csv(here::here("Data/income.csv"),
na="NA",skip=0)

## Rows: 209012 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): name, state, campus, income_lvl
## dbl (3): total_price, year, net_cost
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# read the salary data set from /data folder
salary<-read_csv(here::here("Data/salary.csv"),
na="NA",skip=0)

## Rows: 935 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): name, state_name
## dbl (5): rank, early_career_pay, mid_career_pay, make_world_better_percent, ...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# read the historical tuition data set from /data folder
hist_tuition<-read_csv(here::here("Data/historical_tuition.csv"), na="NA",skip=0)

## Rows: 270 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): type, year, tuition_type
## dbl (1): tuition_cost
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# read the diversity data set from /data folder
diversity<-read_csv(here::here("Data/diversity.csv"),na="NA", skip=0)

## Rows: 50655 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): name, state, category
## dbl (2): total_enrollment, enrollment
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Notes on the data in each dataset

If there are any quirks that you have to deal with NA coded as something else, or it is multiple tables, please make some notes here about what you need to do before you start transforming the data in the next section.

Make sure your data types are correct!

Tuition income data

In the following code chunk,we read the tuition income data and use the glimpse and the skim command to get a sense of the data, total number of observations, number of observations and variables,the data type of each variable and the skim also provides mini histogram plots of the continuous variables which gives us an idea about the underlying distribution. Missingness in each variable is also captured by the skim command.

# tuition income data

# verify using glimpse and skim

glimpse(income) # 209,012 observations and 7 variables

## Rows: 209,012
## Columns: 7
## $ name        <chr> "Piedmont International University", "Piedmont Internation…
## $ state       <chr> "NC", "NC", "NC", "NC", "NC", "NC", "NC", "NC", "NC", "NC"…
## $ total_price <dbl> 20174, 20174, 20174, 20174, 20514, 20514, 20514, 20514, 20…
## $ year        <dbl> 2016, 2016, 2016, 2016, 2017, 2017, 2017, 2017, 2017, 2018…
## $ campus      <chr> "On Campus", "On Campus", "On Campus", "On Campus", "On Ca…
## $ net_cost    <dbl> 11475.00, 11451.00, 16229.00, 15592.00, 11668.39, 11643.99…
## $ income_lvl  <chr> "0 to 30,000", "30,001 to 48,000", "48_001 to 75,000", "75…

                # 4 character variables and 3 numeric variables

skimr::skim(income) # 209,012 observations and 7 variables

Data summary
Name	income
Number of rows	209012
Number of columns	7
_______________________
Column type frequency:
character	4
numeric	3
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
name	1	6	75	3664
state	1	2	2	51
campus	1	9	10	2
income_lvl	1	11	17	5

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
total_price	1	30102.19	13824.70	4906	19186.00	26286.0	38831	114083.00	▇▆▂▁▁
year	1	2014.38	2.46	2010	2012.00	2014.0	2017	2018.00	▅▇▃▇▇
net_cost	1	16784.92	8887.16	-15101	10148.56	16207.1	22350	95674.84	▁▇▁▁▁

                    # 4 character variables and 3 numeric variables

Using the glimpse command, we were able to know that this dataset had 209,012 observations and 7 variables-out of which 4 are character variables and 3 are numeric variables. The variables name,state, campus and income_level are character variables and the remaining variables: total_price, year and net_cost are in the numeric format.

The variables present in the data set are:

name->college name
state->state name
total_price->total price in USD
year->year
campus->on/off campus
net_cost->net cost-average actually paif after scholarship/award
income_lvl->income bracket

I also wanted to plot the histograms separately to get understand the underlying distribution, assess the center/spread/skewness of the distribution.

# additional checks on the distribution of numeric variables

# total price
plot1<-ggplot(data=income,
       aes(x=total_price))+
  
  geom_histogram(fill="lightblue")+
  
  geom_vline(xintercept=mean(income$total_price,na.rm=TRUE),                      col = "orange",
             lwd = 2)+
  
  annotate("text", x=35000, y=15000, label="Mean of total price",angle=90)+
  
geom_vline(xintercept=median(income$total_price,na.rm=TRUE),                         col = "red",
                lwd = 2)+
  
  annotate("text", x=22000, y=15000, label="Median of total price",angle=90)+
  
  
  
  labs(x="Total price in USD",
       y="Count",
       title="Histogram of Total price(in USD)")

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.

# mean exceeds median-> heavily right skewed!!

# net_cost

plot2<-ggplot(data=income,
       aes(x=net_cost))+
  
  geom_histogram(fill="light blue")+
  
  geom_vline(xintercept=mean(income$net_cost,
                             na.rm=TRUE),                       col = "orange",
             lwd = 2)+
  
  annotate("text", x=20000, y=15000, label="Mean of net cost",angle=90)+
  
geom_vline(xintercept=median(income$net_cost,
                             na.rm=TRUE),          
                             col = "red",
                             lwd = 2)+
  
  annotate("text", x=12000, y=15000, label="Median of net cost",angle=90)+
  
  labs(x="Net cost in USD",
       y="Count",
       title="Histogram of Net cost(in USD)") 
# reasonably normal
# mean and median almost coincides!

grid.arrange(plot1,plot2,nrow=2)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# check for missing values

sum(is.na(income)) # no missing values

## [1] 0

Some comments on the data

It can be noted that the variable state is in character format and needs to be converted to factors especially useful while making exploratory plots.

Also,the variable year is in the numeric format which will have to be converted to a factor format in order to study how the net_cost and total_price varies over the years.

Although the variable campus is in the categorical format, we are not very concerned about transforming it to the factor format as in the next section, I will be excluding the off-campus type observation since I am only interested in the on-campus observations.

Also,from the tiny histogram plots provided from the output of skim command, the variable total_price data seems to be heavily right skewed. I was particularly interested in the distribution of total_price, the mean exceeds the median and the data is heavily right skewed, typically log transformation is applied to right-skewed data to address non-normality but in this case since the primary interest is to visualize the data and not to perform apply modelling techniques, we will use the data as it is.

The distribution of net cost variable looks reasonably normal, the mean and median almost coincide.

No missing values observed in the tuition income data.

Tuition cost data

In the following code chunk, we use the glimpse and study commands to get an idea about the tuition cost data.

# tuition cost data

# verify using glimpse and skim

glimpse(cost)   # 2973 observations and 10 variables

## Rows: 2,973
## Columns: 10
## $ name                 <chr> "Aaniiih Nakoda College", "Abilene Christian Univ…
## $ state                <chr> "Montana", "Texas", "Georgia", "Minnesota", "Cali…
## $ state_code           <chr> "MT", "TX", "GA", "MN", "CA", "CO", "NY", "NY", "…
## $ type                 <chr> "Public", "Private", "Public", "For Profit", "For…
## $ degree_length        <chr> "2 Year", "4 Year", "2 Year", "2 Year", "4 Year",…
## $ room_and_board       <dbl> NA, 10350, 8474, NA, 16648, 8782, 16030, 11660, 1…
## $ in_state_tuition     <dbl> 2380, 34850, 4128, 17661, 27810, 9440, 38660, 537…
## $ in_state_total       <dbl> 2380, 45200, 12602, 17661, 44458, 18222, 54690, 1…
## $ out_of_state_tuition <dbl> 2380, 34850, 12550, 17661, 27810, 20456, 38660, 9…
## $ out_of_state_total   <dbl> 2380, 45200, 21024, 17661, 44458, 29238, 54690, 2…

                # 5 character variables and 5 numeric variables

skimr::skim(cost)  # 2973 observations and 10 variables

Data summary
Name	cost
Number of rows	2973
Number of columns	10
_______________________
Column type frequency:
character	5
numeric	5
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
name	0	1.00	8	67	2938
state	52	0.98	4	14	50
state_code	0	1.00	2	2	55
type	0	1.00	5	10	4
degree_length	0	1.00	5	6	3

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
room_and_board	1094	0.63	10095.28	3288.55	30	7935	10000	12424.5	21300	▁▅▇▃▁
in_state_tuition	0	1.00	16491.29	14773.84	480	4890	10099	27124.0	59985	▇▂▂▁▁
in_state_total	0	1.00	22871.73	18948.39	962	5802	17669	35960.0	75003	▇▅▂▂▁
out_of_state_tuition	0	1.00	20532.73	13255.65	480	9552	17486	29208.0	59985	▇▆▅▂▁
out_of_state_total	0	1.00	26913.16	17719.73	1376	11196	23214	39054.0	75003	▇▅▅▂▁

                   # 5 character variables and 5 numeric variables

Based on the output of glimpse and skim command,2973 observations and 10 variables-out of which 5 are character variables and 5 are numeric variables. The variables name,state,state_code,type and degree_length are character variables and the remaining variables: room_and_board, in_state_tuition,in_state_total,out_of_state_tuition, out_of_state_total are in the numeric format.

The variables present in the data set are:

name->college name
state->state name
state_code->state abbreviation
type->public,private, for profit
degree_length->4 year or 2 year degree
room_and_board-> Room and board in USD
in_state_tuition->Tuition for in-state residents in USD
in_state_total->Total cost for in-state residents in USD (sum of room & board + in state tuition) 9)out_of_state_tuition->Tuition for out-of-state residents in USD 10)out_of_state_total->Total cost for out-of-state residents in USD (sum of room & board + out of state tuition)

The skim command not only provides the total # of observations, total # of variables and their data type’s but also provides a peek of the distribution of the continuous variables by their small histogram plots. I also tried plotting histograms using ggplots.

Numeric variables

# histogram of in-state tuition expenses
ggplot(data=cost,
       aes(x=in_state_tuition))+
  geom_histogram(fill="lightblue")+
  geom_vline(xintercept=mean(cost$in_state_tuition
                             ,na.rm=TRUE),        
             col = "orange",
             lwd = 2)+
annotate("text", x=18000, y=200, label="Mean",angle=90)+
geom_vline(xintercept=median(cost$in_state_tuition,
                             na.rm=TRUE),                                col = "red",
                      lwd = 2)+
annotate("text", x=8000, y=200, label="Median",angle=90)+
  labs(x="In state tuition expenses in USD",
       y="Count",
       title="Histogram of instate tuition expenses(in USD)")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# histogram of out-state tuition expenses
ggplot(data=cost,
       aes(x=out_of_state_tuition))+
  geom_histogram(fill="lightblue")+
  geom_vline(xintercept=mean(cost$out_of_state_tuition,na.rm=TRUE),        
             col = "orange",
             lwd = 2)+
annotate("text", x=23000, y=100, label="Mean",angle=90)+
geom_vline(xintercept=median(cost$out_of_state_tuition,na.rm=TRUE),                                
           col = "red",
           lwd = 2)+
annotate("text", x=15000, y=100, label="Median",angle=90)+
  labs(x="Out of state tuition expenses in USD",
       y="Count",
       title="Histogram of out of state tuition expenses(in USD)")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# histogram of in-state total expenses
ggplot(data=cost,
       aes(x=in_state_total))+
  geom_histogram(fill="lightblue")+
  geom_vline(xintercept=mean(cost$in_state_total
                             ,na.rm=TRUE),        
             col = "orange",
             lwd = 2)+
annotate("text", x=24000, y=150, label="Mean",angle=90)+
geom_vline(xintercept=median(cost$in_state_total,
                             na.rm=TRUE),                                col = "red",
                      lwd = 2)+
annotate("text", x=16000, y=150, label="Median",angle=90)+
  labs(x="In state total expenses in USD",
       y="Count",
       title="Histogram of instate total expenses(in USD)")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# histogram of out-state total expenses

ggplot(data=cost,
       aes(x=out_of_state_total))+
  geom_histogram(fill="lightblue")+
  geom_vline(xintercept=mean(cost$out_of_state_total,na.rm=TRUE),        
             col = "orange",
             lwd = 2)+
annotate("text", x=29000, y=100, label="Mean",angle=90)+
geom_vline(xintercept=median(cost$out_of_state_total,na.rm=TRUE),                                
           col = "red",
           lwd = 2)+
annotate("text", x=20000, y=100, label="Median",angle=90)+
  labs(x="Out of state total expenses in USD",
       y="Count",
       title="Histogram of out of state total expenses(in USD)")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# histogram of room and board expenses
ggplot(data=cost,
       aes(x=room_and_board))+
  geom_histogram(fill="lightblue")+
  geom_vline(xintercept=mean(cost$room_and_board
                             ,na.rm=TRUE),        
             col = "orange",
             lwd = 2)+
annotate("text", x=11000, y=50, label="Mean",angle=90)+
geom_vline(xintercept=median(cost$room_and_board,
                             na.rm=TRUE),                                col = "red",
                      lwd = 2)+
  annotate("text", x=9000, y=50, label="Median",angle=90)+
  labs(x="Room and Board expenses in USD",
       y="Count",
       title="Histogram of Room and board expenses(in USD)")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 1094 rows containing non-finite values (`stat_bin()`).

From the histogram plots,the distribution of the variable room_and_board is reasonably normal, where mean and median coincides.

The distribution of the variables: in_state_tuition,in_state_total, out_of_state_tuition and out_of_state_total is heavily right skewed with the mean exceeding the median values.

Character variables

There are 52 observations with missing values for the state variable and 1094 missing values for room and board, we assess the NA values for the state variable in the data transformation section.

We retain the missing values for room and board, since our question of interest is not related to room and board and those missing observations in room and board have complete observations for the other variables that I am interested in exploring further.

# filter out the state with NA's to understand why they are missing!

cost_na<-cost%>%filter(is.na(state))

# looks like the state names are missing however the state codes are present

# get the different state codes and counts of the states with missing names!

tabyl(cost_na$state_code)

##  cost_na$state_code  n    percent
##                  AS  1 0.01923077
##                  DC  8 0.15384615
##                  GU  1 0.01923077
##                  PR 41 0.78846154
##                  VI  1 0.01923077

# based on the state code's fill the name of the state using mutate command

cost<-cost%>%

mutate(state.upd=case_when(!is.na(state)~state, 
                                     is.na(state)&state_code=="AS"~"American Samoa",
                                     is.na(state)&state_code=="DC"~"District of Columbia",
                                     
is.na(state)&state_code=="GU"~"Guam",

is.na(state)&state_code=="PR"~"Peurto Rico",

is.na(state)&state_code=="VI"~"Virgin Islands",

TRUE~"NA"))

#verify
tabyl(cost$state.upd) # no NA's

##        cost$state.upd   n      percent
##               Alabama  54 0.0181634712
##                Alaska   6 0.0020181635
##        American Samoa   1 0.0003363606
##               Arizona  34 0.0114362597
##              Arkansas  46 0.0154725866
##            California 254 0.0854355869
##              Colorado  38 0.0127817020
##           Connecticut  36 0.0121089808
##              Delaware   9 0.0030272452
##  District of Columbia   8 0.0026908846
##               Florida  88 0.0295997309
##               Georgia  79 0.0265724857
##                  Guam   1 0.0003363606
##                Hawaii  14 0.0047090481
##                 Idaho  13 0.0043726875
##              Illinois 125 0.0420450723
##               Indiana  62 0.0208543559
##                  Iowa  52 0.0174907501
##                Kansas  52 0.0174907501
##              Kentucky  44 0.0147998655
##             Louisiana  34 0.0114362597
##                 Maine  27 0.0090817356
##              Maryland  45 0.0151362260
##         Massachusetts  93 0.0312815338
##              Michigan  78 0.0262361251
##             Minnesota  71 0.0238816011
##           Mississippi  32 0.0107635385
##              Missouri  73 0.0245543222
##               Montana  22 0.0073999327
##              Nebraska  33 0.0110998991
##                Nevada  10 0.0033636058
##         New Hampshire  21 0.0070635721
##            New Jersey  54 0.0181634712
##            New Mexico  24 0.0080726539
##              New York 221 0.0743356879
##        North Carolina 117 0.0393541877
##          North Dakota  18 0.0060544904
##                  Ohio 127 0.0427177935
##              Oklahoma  40 0.0134544231
##                Oregon  40 0.0134544231
##          Pennsylvania 160 0.0538176926
##           Peurto Rico  41 0.0137907837
##          Rhode Island  11 0.0036999664
##        South Carolina  57 0.0191725530
##          South Dakota  18 0.0060544904
##             Tennessee  62 0.0208543559
##                 Texas 150 0.0504540868
##                  Utah  14 0.0047090481
##               Vermont  19 0.0063908510
##        Virgin Islands   1 0.0003363606
##              Virginia  79 0.0265724857
##            Washington  60 0.0201816347
##         West Virginia  30 0.0100908174
##             Wisconsin  67 0.0225361588
##               Wyoming   8 0.0026908846

# use the state.upd column going forward but rename it to state

cost<-cost%>%
  select(-state)%>%
   mutate(state=state.upd)%>%
    select(-state.upd)

# rearrange the columns, state before state_code
cost<-cost%>%relocate("state",.after="name")

It was observed those 52 missing values of state was having complete observations for the state_code with which I was able to address the missing values in state.

# check the different levels in the variable type, degree length and state
unique(cost$type) # 4 types: public,private, not for profit and others

## [1] "Public"     "Private"    "For Profit" "Other"

unique(cost$degree_length) # 2 year, 4 year and others

## [1] "2 Year" "4 Year" "Other"

unique(cost$state) #55 unique tates

##  [1] "Montana"              "Texas"                "Georgia"             
##  [4] "Minnesota"            "California"           "Colorado"            
##  [7] "New York"             "Michigan"             "Virginia"            
## [10] "Florida"              "South Carolina"       "Alabama"             
## [13] "North Carolina"       "Alaska"               "Connecticut"         
## [16] "Pennsylvania"         "Mississippi"          "West Virginia"       
## [19] "Maryland"             "Ohio"                 "Iowa"                
## [22] "Kansas"               "Wisconsin"            "Illinois"            
## [25] "Tennessee"            "Arizona"              "Massachusetts"       
## [28] "American Samoa"       "District of Columbia" "Peurto Rico"         
## [31] "Indiana"              "Washington"           "Arkansas"            
## [34] "Kentucky"             "New Jersey"           "South Dakota"        
## [37] "Missouri"             "Maine"                "Louisiana"           
## [40] "Nebraska"             "Vermont"              "North Dakota"        
## [43] "Oregon"               "Idaho"                "Utah"                
## [46] "Hawaii"               "Rhode Island"         "Oklahoma"            
## [49] "Nevada"               "Wyoming"              "New Mexico"          
## [52] "New Hampshire"        "Delaware"             "Guam"                
## [55] "Virgin Islands"

# can also tabyl command to obtain the unique levels and their corresponding counts
tabyl(cost$type)

##   cost$type    n      percent
##  For Profit  107 0.0359905819
##       Other    1 0.0003363606
##     Private 1281 0.4308779011
##      Public 1584 0.5327951564

tabyl(cost$degree_length)

##  cost$degree_length    n      percent
##              2 Year 1120 0.3767238480
##              4 Year 1852 0.6229397915
##               Other    1 0.0003363606

tabyl(cost$state)

##            cost$state   n      percent
##               Alabama  54 0.0181634712
##                Alaska   6 0.0020181635
##        American Samoa   1 0.0003363606
##               Arizona  34 0.0114362597
##              Arkansas  46 0.0154725866
##            California 254 0.0854355869
##              Colorado  38 0.0127817020
##           Connecticut  36 0.0121089808
##              Delaware   9 0.0030272452
##  District of Columbia   8 0.0026908846
##               Florida  88 0.0295997309
##               Georgia  79 0.0265724857
##                  Guam   1 0.0003363606
##                Hawaii  14 0.0047090481
##                 Idaho  13 0.0043726875
##              Illinois 125 0.0420450723
##               Indiana  62 0.0208543559
##                  Iowa  52 0.0174907501
##                Kansas  52 0.0174907501
##              Kentucky  44 0.0147998655
##             Louisiana  34 0.0114362597
##                 Maine  27 0.0090817356
##              Maryland  45 0.0151362260
##         Massachusetts  93 0.0312815338
##              Michigan  78 0.0262361251
##             Minnesota  71 0.0238816011
##           Mississippi  32 0.0107635385
##              Missouri  73 0.0245543222
##               Montana  22 0.0073999327
##              Nebraska  33 0.0110998991
##                Nevada  10 0.0033636058
##         New Hampshire  21 0.0070635721
##            New Jersey  54 0.0181634712
##            New Mexico  24 0.0080726539
##              New York 221 0.0743356879
##        North Carolina 117 0.0393541877
##          North Dakota  18 0.0060544904
##                  Ohio 127 0.0427177935
##              Oklahoma  40 0.0134544231
##                Oregon  40 0.0134544231
##          Pennsylvania 160 0.0538176926
##           Peurto Rico  41 0.0137907837
##          Rhode Island  11 0.0036999664
##        South Carolina  57 0.0191725530
##          South Dakota  18 0.0060544904
##             Tennessee  62 0.0208543559
##                 Texas 150 0.0504540868
##                  Utah  14 0.0047090481
##               Vermont  19 0.0063908510
##        Virgin Islands   1 0.0003363606
##              Virginia  79 0.0265724857
##            Washington  60 0.0201816347
##         West Virginia  30 0.0100908174
##             Wisconsin  67 0.0225361588
##               Wyoming   8 0.0026908846

I will be converting the character variables type, state, state_code to factor format in the next section. I am not too concerned about the converting the degree_length variable to factor format as I will be only retaining 4 year degree_length observations(in the next section) which is more pertianing to my research question of interest.

Salary potential data

# salary potential data

# verify using glimpse and skim

glimpse(salary)   # 935 observations and 7 variables

## Rows: 935
## Columns: 7
## $ rank                      <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1…
## $ name                      <chr> "Auburn University", "University of Alabama …
## $ state_name                <chr> "Alabama", "Alabama", "Alabama", "Alabama", …
## $ early_career_pay          <dbl> 54400, 57500, 52300, 54500, 48400, 46600, 49…
## $ mid_career_pay            <dbl> 104500, 103900, 97400, 93500, 90500, 89100, …
## $ make_world_better_percent <dbl> 51, 59, 50, 61, 52, 53, 48, 57, 56, 58, 60, …
## $ stem_percent              <dbl> 31, 45, 15, 30, 3, 12, 27, 17, 17, 20, 8, 7,…

                  # 2 character variables and 5 numeric variables

skimr::skim(salary) # 935 observations and 7 variables

Data summary
Name	salary
Number of rows	935
Number of columns	7
_______________________
Column type frequency:
character	2
numeric	5
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
name	0	1	6	64	0	934	0
state_name	0	1	4	14	0	50	0

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
rank	0	1.00	11.48	7.07	1	5	11	17	25	▇▇▆▆▅
early_career_pay	0	1.00	51359.68	8365.51	32500	46000	50000	55500	91200	▂▇▂▁▁
mid_career_pay	0	1.00	92805.78	15856.48	60100	81650	89900	100650	158200	▃▇▃▁▁
make_world_better_percent	33	0.96	53.88	9.19	33	48	52	58	94	▂▇▃▁▁
stem_percent	0	1.00	16.96	15.40	0	7	13	23	100	▇▃▁▁▁

                    # 2 character variables and 5 numeric variables

From the glimpse and skim command, we were able to observed that the salary potential data set had 935 observations and 7 variables out of which name and state_name variables are character variables and the rest of the variables:early_career_pay,mid_career_pay,make_world_better_percent and stem_percent. rank are numeric.

I also used ggplots to get a better idea about the distribution of the continuous variables.

# histogram of early career pay
ggplot(data=salary,
       aes(x=early_career_pay))+
  geom_histogram(fill="lightblue")+
  geom_vline(xintercept=mean(salary$early_career_pay
                             ,na.rm=TRUE),        
             col = "orange",
             lwd = 2)+
annotate("text", x=53000, y=30, label="Mean",angle=90)+
geom_vline(xintercept=median(salary$early_career_pay,
                             na.rm=TRUE),                                col = "red",
                lwd = 2)+
annotate("text", x=48000, y=30,label="Median",angle=90)+
  labs(x="Early career pay in USD",
       y="Count",
       title="Histogram of early career pay(in USD)")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# histogram of mid career pay
ggplot(data=salary,
       aes(x=mid_career_pay))+
  geom_histogram(fill="lightblue")+
  geom_vline(xintercept=mean(salary$mid_career_pay
                             ,na.rm=TRUE),        
             col = "orange",
             lwd = 2)+
annotate("text", x=95000, y=30, label="Mean",angle=90)+
geom_vline(xintercept=median(salary$mid_career_pay,
                             na.rm=TRUE),                                col = "red",
                lwd = 2)+
annotate("text", x=85000, y=30,label="Median",angle=90)+
  labs(x="Mid career pay in USD",
       y="Count",
       title="Histogram of mid career pay(in USD)")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# histogram of make world a better percent
ggplot(data=salary,
       aes(x=make_world_better_percent))+
  geom_histogram(fill="lightblue")+
  geom_vline(xintercept=mean(salary$make_world_better_percent
                             ,na.rm=TRUE),        
             col = "orange",
             lwd = 2)+
annotate("text", x=55, y=25, label="Mean",angle=90)+
geom_vline(xintercept=median(salary$make_world_better_percent,
                             na.rm=TRUE),                                col = "red",
                lwd = 2)+
annotate("text", x=50, y=25,label="Median",angle=90)+
  labs(x="Make world better in percent",
       title="Histogram of make world better")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 33 rows containing non-finite values (`stat_bin()`).

# histogram of stem
ggplot(data=salary,
       aes(x=stem_percent))+
  geom_histogram(fill="lightblue")+
  geom_vline(xintercept=mean(salary$stem_percent
                             ,na.rm=TRUE),        
             col = "orange",
             lwd = 2)+
annotate("text", x=20, y=50, label="Mean",angle=90)+
geom_vline(xintercept=median(salary$stem_percent,
                             na.rm=TRUE),                                col = "red",
                lwd = 2)+
annotate("text", x=10, y=50,label="Median",angle=90)+
  labs(x="Stem percent",
       title="Histogram of stem (in percent)")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# not-normal, right skew is observed!
shapiro.test(salary$early_career_pay)

## 
##  Shapiro-Wilk normality test
## 
## data:  salary$early_career_pay
## W = 0.94447, p-value < 2.2e-16

shapiro.test(salary$mid_career_pay)

## 
##  Shapiro-Wilk normality test
## 
## data:  salary$mid_career_pay
## W = 0.94412, p-value < 2.2e-16

shapiro.test(salary$make_world_better_percent)

## 
##  Shapiro-Wilk normality test
## 
## data:  salary$make_world_better_percent
## W = 0.93619, p-value < 2.2e-16

shapiro.test(salary$stem_percent)

## 
##  Shapiro-Wilk normality test
## 
## data:  salary$stem_percent
## W = 0.81815, p-value < 2.2e-16

From the histogram plots, the right skew is more prominently seen in the stem_percent, mid_career_pay, early_career_pay and in make_world_better_percent.

The variable state_name is in character format,I will convert it to factor format in the next section.

Historical tuition data

# hist tuition data

# verify using glimpse and skim

glimpse(hist_tuition)   # 270 observations and 4 variables

## Rows: 270
## Columns: 4
## $ type         <chr> "All Institutions", "All Institutions", "All Institutions…
## $ year         <chr> "1985-86", "1985-86", "1985-86", "1985-86", "1985-86", "1…
## $ tuition_type <chr> "All Constant", "4 Year Constant", "2 Year Constant", "Al…
## $ tuition_cost <dbl> 10893, 12274, 7508, 4885, 5504, 3367, 13822, 16224, 7421,…

                        # 3 character variables and 1 numeric variables

skimr::skim(hist_tuition) # 270 observations and 4 variables

Data summary
Name	hist_tuition
Number of rows	270
Number of columns	4
_______________________
Column type frequency:
character	3
numeric	1
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
type	1	6	16	3
year	1	7	7	19
tuition_type	1	11	15	6

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
tuition_cost	0	1	17597.37	9201.45	2981	9796.5	16983	23574	41468	▇▆▇▂▂

                          # 3 character variables and 1 numeric variables

In the historical tuition dataset, there are 270 observations with 4 variables out of which 1 is a numeric variable and the remaining three are character variables

The histogram plot of the continuous variable is plotted below:

# histogram of tuition cost
ggplot(data=hist_tuition,
       aes(x=tuition_cost))+
  geom_histogram(fill="lightblue")+
  geom_vline(xintercept=mean(hist_tuition$tuition_cost
                             ,na.rm=TRUE),        
             col = "orange",
             lwd = 2)+
annotate("text", x=19000, y=10, label="Mean",angle=90)+
geom_vline(xintercept=median(hist_tuition$tuition_cost,
                             na.rm=TRUE),                                col = "red",
                lwd = 2)+
annotate("text", x=16000, y=10,label="Median",angle=90)+
  labs(x="Tuition cost(in USD)",
       y="count",
       title="Histogram of tuition cost (in USD)")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# just to verify
shapiro.test(hist_tuition$tuition_cost) #p<0.05, not normal

## 
##  Shapiro-Wilk normality test
## 
## data:  hist_tuition$tuition_cost
## W = 0.94332, p-value = 1.071e-08

The variable tuition_cost seems is bi-modal and right skewed, this is verified by using shapiro test for normality. Since the primary focus of the project is to visualize the data and not to use modelling techniques, I will be using the data as it is without any transformation.

The variables type, tuition type and year are in character format, I will be converting them into factor format in the next section. Also, I will be including only the observations that have “4 year current” and “4 year constant” tuition_type since I am am interested in understanding how the public/private institutions have different tuition cost for the 4 year degree programs.

# for the character variables
tabyl(hist_tuition$tuition_type) # 6 types

##  hist_tuition$tuition_type  n   percent
##            2 Year Constant 45 0.1666667
##             2 Year Current 45 0.1666667
##            4 Year Constant 45 0.1666667
##             4 Year Current 45 0.1666667
##               All Constant 45 0.1666667
##                All Current 45 0.1666667

tabyl(hist_tuition$type) # 3 types: private,public and all institutions

##  hist_tuition$type   n   percent
##   All Institutions 114 0.4222222
##            Private  78 0.2888889
##             Public  78 0.2888889

tabyl(hist_tuition$year) # different years starting from 1985 to 1986, 1995-96, 2000 to 2017(in a year increments)

##  hist_tuition$year  n    percent
##            1985-86 18 0.06666667
##            1995-96 18 0.06666667
##            2000-01 18 0.06666667
##            2001-02 18 0.06666667
##            2002-03 18 0.06666667
##            2003-04 18 0.06666667
##            2004-05 18 0.06666667
##            2005-06 18 0.06666667
##            2006-07  6 0.02222222
##            2007-08  6 0.02222222
##            2008-09  6 0.02222222
##            2009-10 12 0.04444444
##            2010-11  6 0.02222222
##            2011-12  6 0.02222222
##            2012-13 12 0.04444444
##            2013-14 18 0.06666667
##            2014-15 18 0.06666667
##            2015-16 18 0.06666667
##            2016-17 18 0.06666667

There are 3 type of colleges: public, private and all institutions; historical tuition is collected for the years 1985-86, 1995-1996 and for the years 2000 to 2017 at each year increments. There are 6 levels in the variable tuition_type: 2 year current, 2 year constant, 4 year current , 4 year constant, all current and all constant.

College Diversity data

# hist tuition data

# verify using glimpse and skim

glimpse(diversity)   # 50,655 observations and 5 variables

## Rows: 50,655
## Columns: 5
## $ name             <chr> "University of Phoenix-Arizona", "University of Phoen…
## $ total_enrollment <dbl> 195059, 195059, 195059, 195059, 195059, 195059, 19505…
## $ state            <chr> "Arizona", "Arizona", "Arizona", "Arizona", "Arizona"…
## $ category         <chr> "Women", "American Indian / Alaska Native", "Asian", …
## $ enrollment       <dbl> 134722, 876, 1959, 31455, 13984, 1019, 58209, 19039, …

                     # 3 character variables and 2 numeric variables

skimr::skim(diversity) # 50,655 observations and 5 variables

Data summary
Name	diversity
Number of rows	50655
Number of columns	5
_______________________
Column type frequency:
character	3
numeric	2
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
name	341	0.99	5	84	4574
state	341	0.99	4	14	50
category	0	1.00	5	34	11

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
total_enrollment	0	1	4386.69	8217.53	1	391	1391	4504	195059	▇▁▁▁▁
enrollment	0	1	774.19	2545.01	0	6	59	379	134722	▇▁▁▁▁

                     # 3 character variables and 2 numeric variables

Notes on the data

It can be noted that the variables category and state are character variables, which I will convert to factor format in the next section . It is observed that there are 341 missing values in the variables name and state which would be addressed in the data transformation section.

# for the categorical variables
diversity_cat_table<-tabyl(diversity$category) 
# gender, race and ethnicity variables are all lumped into one column, numbers of only women are available, numbers of men need to be calculated by subtracting from total enrollment, and the long format data needs to be converted to wide format data with one column each for each gender and each race/ethnic groups!

diversity_cat_table

##                  diversity$category    n    percent
##     American Indian / Alaska Native 4605 0.09090909
##                               Asian 4605 0.09090909
##                               Black 4605 0.09090909
##                            Hispanic 4605 0.09090909
##  Native Hawaiian / Pacific Islander 4605 0.09090909
##                Non-Resident Foreign 4605 0.09090909
##                      Total Minority 4605 0.09090909
##                   Two Or More Races 4605 0.09090909
##                             Unknown 4605 0.09090909
##                               White 4605 0.09090909
##                               Women 4605 0.09090909

# check the number of total observations
sum(diversity_cat_table$n) # 50655 observation, tallies with the number of rows in the dataset and no NA observations!

## [1] 50655

Although the histogram plots of the enrollment and total_enrollment imply that the data is right-skewed, since the enrollment and total_enrollment are lumped for the different gender, racial and ethnic groups, the histogram plots wouldn’t be very meaningful in this case.

Transforming the data (15 points)

If the data needs to be transformed in any way (values recoded, pivoted, etc), do it here. Examples include transforming a continuous variable into a categorical using case_when(), etc.

Bonus points (5 points) for datasets that require merging of tables, but only if you reason through whether you should use left_join, inner_join, or right_join on these tables. No credit will be provided if you don’t.

Show your transformed table here. Use tools such as glimpse(), skim() or head() to illustrate your point.

Are the values what you expected for the variables? Why or Why not?

Cleaning the tuition income data

1) Transforming the income levels,year and state to factor format

From the earlier section, it was noted that the income levels, year and state variables are all in the character format, it was required to be transformed to factor format in order to use them in visualization and creating summary tables.

# check the number and names of the different levels in the categorical variable

income_tab<-tabyl(income$income_lvl)# 5 categories

# verify if there are any missing values
sum(income_tab$n) #209012

## [1] 209012

nrow(income) #209012, no missing values

## [1] 209012

# convert income levels,year and state to factor format 
income<-income%>%
  mutate(income_level=case_when(income_lvl=="0 to 30,000"~1,
income_lvl=="30,001 to 48,000"~2,
income_lvl=="48,001 to 75,000"~3,
income_lvl=="75,001 to 110,000"~4,
TRUE~5), 
income_level=factor(income_level,
levels=c(1,2,3,4,5),
labels(c("Less than $30,000", 
         "Between $30,001 and $48,000",
         "Between $48,001 and $75,000",
         "Between $75,001 to $110,000",
         "Over $110,000"))),
year=as.factor(year),
state=as.factor(state))

# drop income_lvl
income<-income%>%
        select(-income_lvl)

#verify 
skimr::skim(income) # 3 factor variables,income_level,year and state is created

Data summary
Name	income
Number of rows	209012
Number of columns	7
_______________________
Column type frequency:
character	2
factor	3
numeric	2
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
name	0	1	6	75	0	3664	0
campus	0	1	9	10	0	2	0

Variable type: factor

skim_variable	complete_rate	ordered	n_unique	top_counts
state	1	FALSE	51	NY: 16152, CA: 14104, PA: 13436, TX: 11133
year	1	FALSE	9	201: 27845, 201: 26390, 201: 24574, 201: 24547
income_level	1	FALSE	4	5: 80256, 1: 44969, 2: 43384, 4: 40403

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
total_price	0	1	30102.19	13824.70	4906	19186.00	26286.0	38831	114083.00	▇▆▂▁▁
net_cost	0	1	16784.92	8887.16	-15101	10148.56	16207.1	22350	95674.84	▁▇▁▁▁

2) Filter out only the on-campus type observations

Since, I am interested only in the on-campus type programs to answer my research question of interest, I excluded the off campus type observations from the dataset.

nrow(income) #209012 observations

## [1] 209012

income_on<-income%>%
         filter(campus=="On Campus")

skimr::skim(income_on) #75391 observations

Data summary
Name	income_on
Number of rows	75391
Number of columns	7
_______________________
Column type frequency:
character	2
factor	3
numeric	2
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
name	0	1	6	75	0	2043	0
campus	0	1	9	9	0	1	0

Variable type: factor

skim_variable	complete_rate	ordered	n_unique	top_counts
state	1	FALSE	51	NY: 6661, PA: 5133, CA: 4314, TX: 4156
year	1	FALSE	8	201: 9990, 201: 9800, 201: 9415, 201: 9375
income_level	1	FALSE	4	5: 29665, 1: 15483, 2: 15323, 4: 14920

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
total_price	0	1	33244.63	15145.87	4906	20714	30466.00	44278.00	96636.00	▇▇▅▁▁
net_cost	0	1	18193.89	8559.27	-15101	12181	17577.91	23128.32	95674.84	▁▇▁▁▁

The transformed income data set has 75391 observations(limited only to on-campus type observations ) and 7 variables and the variables state, year, income_lvl are converted to factor format.

Cleaning the tuition cost data

1) Assess the NA observations

It was noted that there were 52 missing observations for state variable and in the following code chunk, I have assessed the reason for missingness and manupulated data in order to address the missigness.

# filter out the state with NA's to understand why they are missing!

cost_na<-cost%>%filter(is.na(state))

# looks like the state names are missing however the state codes are present

# get the different state codes and counts of the states with missing names!

tabyl(cost_na$state_code)

## [1] cost_na$state_code n                  percent           
## <0 rows> (or 0-length row.names)

# based on the state code's fill the name of the state using mutate command

cost<-cost%>%

mutate(state.upd=case_when(!is.na(state)~state, 
                                     is.na(state)&state_code=="AS"~"American Samoa",
                                     is.na(state)&state_code=="DC"~"District of Columbia",
                                     
is.na(state)&state_code=="GU"~"Guam",

is.na(state)&state_code=="PR"~"Peurto Rico",

is.na(state)&state_code=="VI"~"Virgin Islands",

TRUE~"NA"))

#verify
tabyl(cost$state.upd) # no NA's

##        cost$state.upd   n      percent
##               Alabama  54 0.0181634712
##                Alaska   6 0.0020181635
##        American Samoa   1 0.0003363606
##               Arizona  34 0.0114362597
##              Arkansas  46 0.0154725866
##            California 254 0.0854355869
##              Colorado  38 0.0127817020
##           Connecticut  36 0.0121089808
##              Delaware   9 0.0030272452
##  District of Columbia   8 0.0026908846
##               Florida  88 0.0295997309
##               Georgia  79 0.0265724857
##                  Guam   1 0.0003363606
##                Hawaii  14 0.0047090481
##                 Idaho  13 0.0043726875
##              Illinois 125 0.0420450723
##               Indiana  62 0.0208543559
##                  Iowa  52 0.0174907501
##                Kansas  52 0.0174907501
##              Kentucky  44 0.0147998655
##             Louisiana  34 0.0114362597
##                 Maine  27 0.0090817356
##              Maryland  45 0.0151362260
##         Massachusetts  93 0.0312815338
##              Michigan  78 0.0262361251
##             Minnesota  71 0.0238816011
##           Mississippi  32 0.0107635385
##              Missouri  73 0.0245543222
##               Montana  22 0.0073999327
##              Nebraska  33 0.0110998991
##                Nevada  10 0.0033636058
##         New Hampshire  21 0.0070635721
##            New Jersey  54 0.0181634712
##            New Mexico  24 0.0080726539
##              New York 221 0.0743356879
##        North Carolina 117 0.0393541877
##          North Dakota  18 0.0060544904
##                  Ohio 127 0.0427177935
##              Oklahoma  40 0.0134544231
##                Oregon  40 0.0134544231
##          Pennsylvania 160 0.0538176926
##           Peurto Rico  41 0.0137907837
##          Rhode Island  11 0.0036999664
##        South Carolina  57 0.0191725530
##          South Dakota  18 0.0060544904
##             Tennessee  62 0.0208543559
##                 Texas 150 0.0504540868
##                  Utah  14 0.0047090481
##               Vermont  19 0.0063908510
##        Virgin Islands   1 0.0003363606
##              Virginia  79 0.0265724857
##            Washington  60 0.0201816347
##         West Virginia  30 0.0100908174
##             Wisconsin  67 0.0225361588
##               Wyoming   8 0.0026908846

# use the state.upd column going forward but rename it to state

cost<-cost%>%
  select(-state)%>%
   mutate(state=state.upd)%>%
    select(-state.upd)

# rearrange the columns, state before state_code
cost<-cost%>%relocate("state",.after="name")

# verify
skimr::skim(cost) # no missing values in state variable

Data summary
Name	cost
Number of rows	2973
Number of columns	10
_______________________
Column type frequency:
character	5
numeric	5
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
name	1	8	67	2938
state	1	4	20	55
state_code	1	2	2	55
type	1	5	10	4
degree_length	1	5	6	3

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
room_and_board	1094	0.63	10095.28	3288.55	30	7935	10000	12424.5	21300	▁▅▇▃▁
in_state_tuition	0	1.00	16491.29	14773.84	480	4890	10099	27124.0	59985	▇▂▂▁▁
in_state_total	0	1.00	22871.73	18948.39	962	5802	17669	35960.0	75003	▇▅▂▂▁
out_of_state_tuition	0	1.00	20532.73	13255.65	480	9552	17486	29208.0	59985	▇▆▅▂▁
out_of_state_total	0	1.00	26913.16	17719.73	1376	11196	23214	39054.0	75003	▇▅▅▂▁

It was observed those 52 missing values of state was having complete observations for the state_code with which I was able to address the missing values in state.

2) Convert state and type to a factor variable

The variablesstate and type are converted to factor formats.

cost<-cost%>%mutate(across(.cols=c(state,type),.fns=as.factor))

skimr::skim(cost) # 2 factor variables are created

Data summary
Name	cost
Number of rows	2973
Number of columns	10
_______________________
Column type frequency:
character	3
factor	2
numeric	5
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
name	1	8	67	2938
state_code	1	2	2	55
degree_length	1	5	6	3

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
state	0	1	FALSE	55	Cal: 254, New: 221, Pen: 160, Tex: 150
type	0	1	FALSE	4	Pub: 1584, Pri: 1281, For: 107, Oth: 1

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
room_and_board	1094	0.63	10095.28	3288.55	30	7935	10000	12424.5	21300	▁▅▇▃▁
in_state_tuition	0	1.00	16491.29	14773.84	480	4890	10099	27124.0	59985	▇▂▂▁▁
in_state_total	0	1.00	22871.73	18948.39	962	5802	17669	35960.0	75003	▇▅▂▂▁
out_of_state_tuition	0	1.00	20532.73	13255.65	480	9552	17486	29208.0	59985	▇▆▅▂▁
out_of_state_total	0	1.00	26913.16	17719.73	1376	11196	23214	39054.0	75003	▇▅▅▂▁

State and type variables are converted to factor formats.

3) Filter out only 4 year degree observations

Since I wanted to limit the scope of the data set, I am primarily interested in 4 years degree observations to answer my research question of interest.

cost_4yr<-cost%>%filter(degree_length=="4 Year")

nrow(cost_4yr) #1852 observations

## [1] 1852

# assess the NA observations in room and board
cost_room_na<-cost_4yr%>%filter(is.na(room_and_board))

nrow(cost_room_na) # 245 missing values for room and board but the in-state and out of state tuition values are still present

## [1] 245

#final cost dataset
cost_final<-cost_4yr

#verify
skimr::skim(cost_final)

Data summary
Name	cost_final
Number of rows	1852
Number of columns	10
_______________________
Column type frequency:
character	3
factor	2
numeric	5
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
name	1	8	66	1829
state_code	1	2	2	53
degree_length	1	6	6	1

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
state	0	1	FALSE	53	New: 163, Cal: 135, Pen: 129, Tex: 82
type	0	1	FALSE	3	Pri: 1209, Pub: 597, For: 46, Oth: 0

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
room_and_board	245	0.87	10624.67	3070.93	950	8617.00	10400	12713.0	21300	▁▅▇▃▁
in_state_tuition	0	1.00	23212.36	14664.22	480	9932.25	19960	33854.5	59985	▇▅▆▃▂
in_state_total	0	1.00	32431.50	17533.60	1430	18199.00	28287	44845.5	75003	▃▇▅▃▂
out_of_state_tuition	0	1.00	26910.22	12630.54	480	17683.75	25392	34863.5	59985	▂▇▇▃▂
out_of_state_total	0	1.00	36129.35	15994.29	1430	24951.00	34888	46670.0	75003	▂▆▇▅▂

The final cost data set has only 1852 observations limited to only observations with 4 yr degree_length programs.

Cleaning the salary potential data

1) Rename the column state_name to state, useful during the left merge operation and convert state to a factor variable

salary<-salary%>%
         rename(state=state_name)%>%
          mutate(state=as.factor(state))

#verify
skimr::skim(salary)

Data summary
Name	salary
Number of rows	935
Number of columns	7
_______________________
Column type frequency:
character	1
factor	1
numeric	5
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
name	0	1	6	64	0	934	0

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
state	0	1	FALSE	50	Ala: 25, Cal: 25, Flo: 25, Geo: 25

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
rank	0	1.00	11.48	7.07	1	5	11	17	25	▇▇▆▆▅
early_career_pay	0	1.00	51359.68	8365.51	32500	46000	50000	55500	91200	▂▇▂▁▁
mid_career_pay	0	1.00	92805.78	15856.48	60100	81650	89900	100650	158200	▃▇▃▁▁
make_world_better_percent	33	0.96	53.88	9.19	33	48	52	58	94	▂▇▃▁▁
stem_percent	0	1.00	16.96	15.40	0	7	13	23	100	▇▃▁▁▁

State_name column was renamed to State and it was converted to a factor format. We retain the NA observations of make_world_better_percent as it is since those observations have complete values for the other earning potential and quality of education variables.

Cleaning the historical tuition data

1) Convert year,type and tuition_type to factor variables

hist_tuition<-hist_tuition%>%
mutate(across(where(is.character),as.factor))

#verify
skimr::skim(hist_tuition)

Data summary
Name	hist_tuition
Number of rows	270
Number of columns	4
_______________________
Column type frequency:
factor	3
numeric	1
________________________
Group variables	None

Variable type: factor

skim_variable	complete_rate	ordered	n_unique	top_counts
type	1	FALSE	3	All: 114, Pri: 78, Pub: 78
year	1	FALSE	19	198: 18, 199: 18, 200: 18, 200: 18
tuition_type	1	FALSE	6	2 Y: 45, 2 Y: 45, 4 Y: 45, 4 Y: 45

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
tuition_cost	0	1	17597.37	9201.45	2981	9796.5	16983	23574	41468	▇▆▇▂▂

It was noted that year, type and tuition_type were all in the character format, they were converted to the factor format which would help in data visualization and summarization.

Cleaning the diversity data

1) Handle the NA observations

It was observed that there were 341 missing values in the variables name and state, the next step would be to explore the reason behind the missingness, we could either remove the NA observations in the state and name or group the NA observations into a separate category.

diversity_na<-diversity%>%filter(is.na(name)|is.name(state))                    

 #341 observations
 # implies that the same observations had missing values for both the name and state variable

# remove the NA observations from the data set since there is no way to merge with the other data sets, both the university name and state values are missing!

total_na<-length(unique(diversity_na$total_enrollment))
total_na # 30 distinct colleges

## [1] 30

It can be observed that the 341 missing observations have missing values for the name and state, by assessing the distinct entries of the total enrollment variable, these missing observations are from 30 different colleges.

If I had to assign them to a category called “NA”, it may not be a great option in this case as we would lump the enrollment from the 30 colleges with missing observations. Thus, I will be removing the NA observations from the diversity data set.

diversity<-diversity%>%filter(!is.na(name)|!is.na(state))
#50314 observations, 341 observations are removed!

skimr::skim(diversity) # 50314 observations and 5 variables, 3 character and 2 numeric variables,no missing values noted

Data summary
Name	diversity
Number of rows	50314
Number of columns	5
_______________________
Column type frequency:
character	3
numeric	2
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
name	1	5	84	4574
state	1	4	14	50
category	1	5	34	11

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
total_enrollment	0	1	4392.17	8227.65	1	395	1394	4504	195059	▇▁▁▁▁
enrollment	0	1	775.26	2548.21	0	6	59	381	134722	▇▁▁▁▁

Thus after excluding the NA observations in the diversity data, there were 50314 observations with 5 variables , 3 of them are character variables and the remaining 2 are numeric variables.

2) Converting long to wide data

In the diversity data, it was observed that the data was in long-format where the gender and race categories were all lumped in one column. My next step in data transformation is to create separate columns for the Women, Men and for each of the race-ethnic groups i.e. we would be converting the long format of data to the wide-format.

There was also no separate row for the Men enrollment in the original diversity data, so we will have to subtract the women enrollment from the total enrollment in order to obtain the men enrollment for each college, this is easier to implement when the data is in long format.

Also, when we are comparing enrollment in different types of college across different states in the USA, we will have to account for the fact that different states have different population/student population as well.

For example, states like Texas, California, Florida, New York would certainly have more number of students enrolling in colleges than states like Rhode Island, New Hampshire,Wyoming.

It wouldn’t be appropriate to just report the number of students from different gender/ different racial/ethnic groups attending public/ private institutions across the different states, instead it would be reasonable to report the proportion of enrollment for each of the gender and racial ethnic groups. This is achieved by creating new columns for proportions for enrollment per college for Male, Female, Asian, Black, White, American Indian/Alaskan native, Native Hawaiian/ Pacific islander, Two or more races,unknown, non-resident foreign,total minority and Hispanic groups. I will also be creating additional column for enrollment of resident students.

# convert the variables state in factor format
diversity<-diversity%>%
  mutate(state=as.factor(state))
skimr::skim(diversity)

Data summary
Name	diversity
Number of rows	50314
Number of columns	5
_______________________
Column type frequency:
character	2
factor	1
numeric	2
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
name	0	1	5	84	0	4574	0
category	0	1	5	34	0	11	0

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
state	0	1	FALSE	50	Cal: 5027, New: 3344, Tex: 2915, Pen: 2739

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
total_enrollment	0	1	4392.17	8227.65	1	395	1394	4504	195059	▇▁▁▁▁
enrollment	0	1	775.26	2548.21	0	6	59	381	134722	▇▁▁▁▁

# convert long to wide data format
diversity_wide<-pivot_wider(diversity,names_from=category,values_from=enrollment)

#check the data type!
skimr::skim(diversity_wide) #4574 observations and 14 variables,12 are numeric and 1 is factor and 1 is character variable

Data summary
Name	diversity_wide
Number of rows	4574
Number of columns	14
_______________________
Column type frequency:
character	1
factor	1
numeric	12
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
name	0	1	5	84	0	4574	0

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
state	0	1	FALSE	50	Cal: 457, New: 304, Tex: 265, Pen: 249

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
total_enrollment	1	4392.17	8228.47	1	395.25	1394.0	4504.00	195059	▇▁▁▁▁
Women	1	2480.10	4708.32	0	220.25	815.0	2705.50	134722	▇▁▁▁▁
American Indian / Alaska Native	1	31.71	111.37	0	1.00	6.0	23.00	2659	▇▁▁▁▁
Asian	1	250.96	830.57	0	5.00	24.0	122.75	11257	▇▁▁▁▁
Black	1	569.42	1390.84	0	37.00	135.0	460.75	31455	▇▁▁▁▁
Hispanic	1	662.05	2028.28	0	19.00	90.0	365.75	44870	▇▁▁▁▁
Native Hawaiian / Pacific Islander	1	11.96	39.95	0	0.00	2.0	8.00	1019	▇▁▁▁▁
White	1	2303.23	4259.66	0	140.00	691.0	2528.00	61498	▇▁▁▁▁
Two Or More Races	1	129.50	392.62	0	3.00	24.0	103.00	19039	▇▁▁▁▁
Unknown	1	234.40	1207.42	0	6.00	43.0	196.00	65163	▇▁▁▁▁
Non-Resident Foreign	1	198.94	704.63	0	0.00	9.0	84.00	11495	▇▁▁▁▁
Total Minority	1	1655.60	3707.10	0	128.25	388.5	1335.00	68332	▇▁▁▁▁

typeof(diversity_wide) # list

## [1] "list"

# add a column for the enrollment for Men using mutate command
diversity_wide<-diversity_wide%>%mutate(Men=(total_enrollment-Women),Resident=((total_enrollment)-(diversity_wide$`Non-Resident Foreign`)))

# verify!
skimr::skim(diversity_wide) #14 columns

Data summary
Name	diversity_wide
Number of rows	4574
Number of columns	16
_______________________
Column type frequency:
character	1
factor	1
numeric	14
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
name	0	1	5	84	0	4574	0

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
state	0	1	FALSE	50	Cal: 457, New: 304, Tex: 265, Pen: 249

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
total_enrollment	1	4392.17	8228.47	1	395.25	1394.0	4504.00	195059	▇▁▁▁▁
Women	1	2480.10	4708.32	0	220.25	815.0	2705.50	134722	▇▁▁▁▁
American Indian / Alaska Native	1	31.71	111.37	0	1.00	6.0	23.00	2659	▇▁▁▁▁
Asian	1	250.96	830.57	0	5.00	24.0	122.75	11257	▇▁▁▁▁
Black	1	569.42	1390.84	0	37.00	135.0	460.75	31455	▇▁▁▁▁
Hispanic	1	662.05	2028.28	0	19.00	90.0	365.75	44870	▇▁▁▁▁
Native Hawaiian / Pacific Islander	1	11.96	39.95	0	0.00	2.0	8.00	1019	▇▁▁▁▁
White	1	2303.23	4259.66	0	140.00	691.0	2528.00	61498	▇▁▁▁▁
Two Or More Races	1	129.50	392.62	0	3.00	24.0	103.00	19039	▇▁▁▁▁
Unknown	1	234.40	1207.42	0	6.00	43.0	196.00	65163	▇▁▁▁▁
Non-Resident Foreign	1	198.94	704.63	0	0.00	9.0	84.00	11495	▇▁▁▁▁
Total Minority	1	1655.60	3707.10	0	128.25	388.5	1335.00	68332	▇▁▁▁▁
Men	1	1912.07	3663.95	0	129.00	567.5	1867.00	60337	▇▁▁▁▁
Resident	1	4193.22	7786.88	1	382.00	1359.0	4356.50	191704	▇▁▁▁▁

# check for further missing values
sum(is.na(diversity_wide))

## [1] 0

Since the data set considered is very large, I have considered a set of inclusion-exclusion criterias to obtain a reasonable sized data set to answer my research question of interest.

I am primarily interested only in the on-campus programs and 4 year degree type programs obtained only from private and public colleges.This implies, I will be excluding observations from:

-> 2 year degree length program -> off-campus programs ->“for profit” and “other” type of colleges

I will filter out the observations that are aligned with my inclusion criteria`s after I perform the merge operation between tuition_cost, tuition_income,salary and diversity datasets in the subsequent sections. I have stated the reasons as to why I chose this exclusion critera in the next sections.

Since I am interested in studying if the type of college attended would influence the earning potential and quality of education, I will have to merge 2 datasets: tuition cost and salary potential using left join.

The tuition cost data has the information about the name of the college, the state in which the college is located and the type of the college(“public”,“private”,“For Profit”, “Other” ) and the degree length and the salary_potential data consists of two earning potential indicators-early_career_pay and mid_career_pay and two quality of education indicators- make_world_better_percent and stem_percent.

# make sure the variable names are clean before merging
diversity_wide<-clean_names(diversity_wide)
salary<-clean_names(salary)
cost<-clean_names(cost)

Merging Datasets

1) Merge salary potential and tution cost data

# merge using left join
merged1<-merge(x=cost,y=salary,by=c("name","state"),all.x=TRUE)

skimr::skim(merged1)

Data summary
Name	merged1
Number of rows	2973
Number of columns	15
_______________________
Column type frequency:
character	3
factor	2
numeric	10
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
name	1	8	67	2938
state_code	1	2	2	55
degree_length	1	5	6	3

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
state	0	1	FALSE	55	Cal: 254, New: 221, Pen: 160, Tex: 150
type	0	1	FALSE	4	Pub: 1584, Pri: 1281, For: 107, Oth: 1

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
room_and_board	1094	0.63	10095.28	3288.55	30	7935	10000	12424.5	21300	▁▅▇▃▁
in_state_tuition	0	1.00	16491.29	14773.84	480	4890	10099	27124.0	59985	▇▂▂▁▁
in_state_total	0	1.00	22871.73	18948.39	962	5802	17669	35960.0	75003	▇▅▂▂▁
out_of_state_tuition	0	1.00	20532.73	13255.65	480	9552	17486	29208.0	59985	▇▆▅▂▁
out_of_state_total	0	1.00	26913.16	17719.73	1376	11196	23214	39054.0	75003	▇▅▅▂▁
rank	2380	0.20	11.81	7.03	1	6	12	18.0	25	▇▇▇▆▅
early_career_pay	2380	0.20	50892.58	7929.03	32500	45900	49700	54500.0	88800	▂▇▂▁▁
mid_career_pay	2380	0.20	91941.82	15055.51	60100	81500	89600	99700.0	158200	▃▇▃▁▁
make_world_better_percent	2401	0.19	53.83	8.81	33	48	53	59.0	94	▂▇▃▁▁
stem_percent	2380	0.20	16.63	14.78	0	7	14	22.0	97	▇▃▁▁▁

skimr::skim(merged1) # missingness rate is very high in the variables are interested

Data summary
Name	merged1
Number of rows	2973
Number of columns	15
_______________________
Column type frequency:
character	3
factor	2
numeric	10
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
name	1	8	67	2938
state_code	1	2	2	55
degree_length	1	5	6	3

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
state	0	1	FALSE	55	Cal: 254, New: 221, Pen: 160, Tex: 150
type	0	1	FALSE	4	Pub: 1584, Pri: 1281, For: 107, Oth: 1

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
room_and_board	1094	0.63	10095.28	3288.55	30	7935	10000	12424.5	21300	▁▅▇▃▁
in_state_tuition	0	1.00	16491.29	14773.84	480	4890	10099	27124.0	59985	▇▂▂▁▁
in_state_total	0	1.00	22871.73	18948.39	962	5802	17669	35960.0	75003	▇▅▂▂▁
out_of_state_tuition	0	1.00	20532.73	13255.65	480	9552	17486	29208.0	59985	▇▆▅▂▁
out_of_state_total	0	1.00	26913.16	17719.73	1376	11196	23214	39054.0	75003	▇▅▅▂▁
rank	2380	0.20	11.81	7.03	1	6	12	18.0	25	▇▇▇▆▅
early_career_pay	2380	0.20	50892.58	7929.03	32500	45900	49700	54500.0	88800	▂▇▂▁▁
mid_career_pay	2380	0.20	91941.82	15055.51	60100	81500	89600	99700.0	158200	▃▇▃▁▁
make_world_better_percent	2401	0.19	53.83	8.81	33	48	53	59.0	94	▂▇▃▁▁
stem_percent	2380	0.20	16.63	14.78	0	7	14	22.0	97	▇▃▁▁▁

# analyzing the trends in missingness

# check the counts in each category of the following variables

tabyl(merged1$degree_length)

##  merged1$degree_length    n      percent
##                 2 Year 1120 0.3767238480
##                 4 Year 1852 0.6229397915
##                  Other    1 0.0003363606

tabyl(merged1$type)

##  merged1$type    n      percent
##    For Profit  107 0.0359905819
##         Other    1 0.0003363606
##       Private 1281 0.4308779011
##        Public 1584 0.5327951564

# filter only the NA observations in the quality of education and salary potential variables

merged1_na<-merged1%>%filter(is.na(early_career_pay)&is.na(mid_career_pay)&is.na(make_world_better_percent)&is.na(stem_percent)) # 2380 observations with missing NA in all the 4 variables

# check the counts in each category of the following variables

tabyl(merged1_na$degree_length) # almost all the 2 year degree length data has NA observations, for the 4 year degree length, 1261 out of 1852 observations are missing

##  merged1_na$degree_length    n      percent
##                    2 Year 1118 0.4697478992
##                    4 Year 1261 0.5298319328
##                     Other    1 0.0004201681

tabyl(merged1_na$type)# around 907 observations out of 1281 private college observations are missing, 1365 public out of 1584 observations are missing, all the for profit and other college observations are missing!

##  merged1_na$type    n      percent
##       For Profit  107 0.0449579832
##            Other    1 0.0004201681
##          Private  907 0.3810924370
##           Public 1365 0.5735294118

# final merged data set

# thus from the missingness analysis, we can omit the NA observations if they were NA in all the 4 variables, since there was only one complete observation for the 2 year degree data, I have excluded the 2 year degree length observations from my analysis

merged1_final<-merged1%>%filter(!is.na(early_career_pay)|!is.na(mid_career_pay)|!is.na(make_world_better_percent)|!is.na(stem_percent))%>%
filter(degree_length!="2 Year"|type!="For Profit"|type!="Other") #593 observations

# only 593 observations that can  be used to answer our question of interest
tabyl(merged1_final$type)

##  merged1_final$type   n   percent
##          For Profit   0 0.0000000
##               Other   0 0.0000000
##             Private 374 0.6306914
##              Public 219 0.3693086

# verify
skimr::skim(merged1_final)

Data summary
Name	merged1_final
Number of rows	593
Number of columns	15
_______________________
Column type frequency:
character	3
factor	2
numeric	10
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
name	1	11	56	593
state_code	1	2	2	40
degree_length	1	6	6	2

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
state	0	1	FALSE	40	Ken: 24, Ala: 22, Geo: 22, Iow: 22
type	0	1	FALSE	2	Pri: 374, Pub: 219, For: 0, Oth: 0

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
room_and_board	29	0.95	10730.87	2738.66	1698	8892	10353	12702.5	18127	▁▃▇▅▂
in_state_tuition	0	1.00	26355.69	16289.35	4118	10032	25600	40652.0	58230	▇▂▃▃▃
in_state_total	0	1.00	36561.78	18618.05	4258	19392	34280	52192.0	75003	▆▇▆▆▃
out_of_state_tuition	0	1.00	30826.91	12861.58	4118	20800	28898	41010.0	58230	▂▇▇▅▃
out_of_state_total	0	1.00	41033.00	15511.94	6670	29504	38783	52856.0	75003	▂▇▇▆▃
rank	0	1.00	11.81	7.03	1	6	12	18.0	25	▇▇▇▆▅
early_career_pay	0	1.00	50892.58	7929.03	32500	45900	49700	54500.0	88800	▂▇▂▁▁
mid_career_pay	0	1.00	91941.82	15055.51	60100	81500	89600	99700.0	158200	▃▇▃▁▁
make_world_better_percent	21	0.96	53.83	8.81	33	48	53	59.0	94	▂▇▃▁▁
stem_percent	0	1.00	16.63	14.78	0	7	14	22.0	97	▇▃▁▁▁

2) Merge diversity data with tuition cost data

# merge using left join
merged2<-merge(x=cost,y=diversity_wide,by=c("name","state"),all.x=TRUE)

skimr::skim(merged2)

Data summary
Name	merged2
Number of rows	2973
Number of columns	24
_______________________
Column type frequency:
character	3
factor	2
numeric	19
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
name	1	8	67	2938
state_code	1	2	2	55
degree_length	1	5	6	3

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
state	0	1	FALSE	55	Cal: 254, New: 221, Pen: 160, Tex: 150
type	0	1	FALSE	4	Pub: 1584, Pri: 1281, For: 107, Oth: 1

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
room_and_board	1094	0.63	10095.28	3288.55	30	7935.00	10000.0	12424.50	21300	▁▅▇▃▁
in_state_tuition	0	1.00	16491.29	14773.84	480	4890.00	10099.0	27124.00	59985	▇▂▂▁▁
in_state_total	0	1.00	22871.73	18948.39	962	5802.00	17669.0	35960.00	75003	▇▅▂▂▁
out_of_state_tuition	0	1.00	20532.73	13255.65	480	9552.00	17486.0	29208.00	59985	▇▆▅▂▁
out_of_state_total	0	1.00	26913.16	17719.73	1376	11196.00	23214.0	39054.00	75003	▇▅▅▂▁
total_enrollment	819	0.72	6185.05	8270.69	15	1351.50	3132.0	7623.00	81459	▇▁▁▁▁
women	819	0.72	3486.93	4631.49	0	772.75	1881.0	4381.00	48329	▇▁▁▁▁
american_indian_alaska_native	819	0.72	46.25	138.14	0	4.00	14.0	39.00	2659	▇▁▁▁▁
asian	819	0.72	321.60	839.51	0	14.00	55.0	222.00	10381	▇▁▁▁▁
black	819	0.72	774.88	1478.91	0	74.00	243.5	827.50	18914	▇▁▁▁▁
hispanic	819	0.72	933.05	2436.55	0	47.00	175.0	688.75	44870	▇▁▁▁▁
native_hawaiian_pacific_islander	819	0.72	14.56	36.60	0	1.00	4.0	13.00	653	▇▁▁▁▁
white	819	0.72	3347.52	4385.80	0	729.50	1838.5	4119.00	41898	▇▁▁▁▁
two_or_more_races	819	0.72	175.40	290.96	0	18.00	62.0	193.00	2801	▇▁▁▁▁
unknown	819	0.72	299.74	919.59	0	25.00	105.0	295.50	27213	▇▁▁▁▁
non_resident_foreign	819	0.72	272.05	797.11	0	5.00	38.5	160.00	11495	▇▁▁▁▁
total_minority	819	0.72	2265.74	3996.32	0	283.25	766.5	2441.00	56561	▇▁▁▁▁
men	819	0.72	2698.13	3769.05	0	540.25	1283.0	3277.50	33130	▇▁▁▁▁
resident	819	0.72	5913.01	7802.33	11	1298.00	3046.0	7315.75	80423	▇▁▁▁▁

Applying inclusion/exclusion criteria

# create new columns to compute the % gender, % race and ethnicity groups in each of the colleges
merged2<-merged2%>%mutate(across(.cols=women:resident,.fns=~((.x/total_enrollment)*100),.names ="percent_{.col}"))

#check the # of entries in the type and degree_length
tabyl(merged2$type)

##  merged2$type    n      percent
##    For Profit  107 0.0359905819
##         Other    1 0.0003363606
##       Private 1281 0.4308779011
##        Public 1584 0.5327951564

tabyl(merged2$degree_length)

##  merged2$degree_length    n      percent
##                 2 Year 1120 0.3767238480
##                 4 Year 1852 0.6229397915
##                  Other    1 0.0003363606

# retain only 4 year degree length observations
merged2.upd<-merged2%>%
  filter(degree_length=="4 Year"&(type=="Private"|type=="Public")) #1806 observations

#check the # of entries in the type and degree_length
tabyl(merged2.upd$type)

##  merged2.upd$type    n   percent
##        For Profit    0 0.0000000
##             Other    0 0.0000000
##           Private 1209 0.6694352
##            Public  597 0.3305648

tabyl(merged2.upd$degree_length)

##  merged2.upd$degree_length    n percent
##                     4 Year 1806       1

# check for NA's
sum(is.na(merged2.upd$total_enrollment)) #542 observations

## [1] 542

# remove those NA observations
merged2.final<-merged2.upd%>%
  filter(!is.na(total_enrollment))# 1264 observations

# relocate the men column next to women column
merged2.final<-merged2.final%>%relocate("percent_men",.after="percent_women")

#verify!
skimr::skim(merged2.final)

Data summary
Name	merged2.final
Number of rows	1264
Number of columns	37
_______________________
Column type frequency:
character	3
factor	2
numeric	32
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
name	1	11	58	1264
state_code	1	2	2	50
degree_length	1	6	6	1

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
state	0	1	FALSE	50	New: 84, Pen: 82, Cal: 81, Tex: 66
type	0	1	FALSE	2	Pri: 906, Pub: 358, For: 0, Oth: 0

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
room_and_board	100	0.92	10633.34	2998.55	950.00	8634.00	10359.00	12714.50	21300.00	▁▅▇▃▁
in_state_tuition	0	1.00	25218.35	14996.93	480.00	10849.00	24375.00	36180.50	59985.00	▇▆▇▅▂
in_state_total	0	1.00	35010.44	17662.86	1430.00	19454.75	32742.00	48450.00	75003.00	▃▇▆▅▂
out_of_state_tuition	0	1.00	28369.28	12797.39	480.00	18629.50	27124.00	36683.50	59985.00	▂▇▇▃▂
out_of_state_total	0	1.00	38161.37	15809.88	1430.00	26802.00	36709.50	48899.50	75003.00	▂▆▇▅▂
total_enrollment	0	1.00	5807.69	8620.18	20.00	1142.50	2523.00	6304.75	81459.00	▇▁▁▁▁
women	0	1.00	3249.67	4687.62	0.00	652.00	1502.00	3671.25	48329.00	▇▁▁▁▁
american_indian_alaska_native	0	1.00	32.90	105.22	0.00	3.00	9.00	26.00	1754.00	▇▁▁▁▁
asian	0	1.00	301.61	851.85	0.00	13.00	55.00	187.25	10381.00	▇▁▁▁▁
black	0	1.00	658.51	1219.67	0.00	67.00	196.50	669.00	14751.00	▇▁▁▁▁
hispanic	0	1.00	573.88	1633.40	0.00	41.00	139.00	458.00	31211.00	▇▁▁▁▁
native_hawaiian_pacific_islander	0	1.00	11.09	29.67	0.00	0.75	3.00	9.00	429.00	▇▁▁▁▁
white	0	1.00	3374.73	4968.27	0.00	594.50	1517.50	3738.75	41898.00	▇▁▁▁▁
two_or_more_races	0	1.00	165.70	296.36	0.00	17.00	56.00	167.00	2801.00	▇▁▁▁▁
unknown	0	1.00	294.58	1083.90	0.00	25.00	106.00	290.25	27213.00	▇▁▁▁▁
non_resident_foreign	0	1.00	394.69	987.26	0.00	16.00	73.00	269.00	11495.00	▇▁▁▁▁
total_minority	0	1.00	1743.70	3189.62	0.00	230.75	585.00	1803.00	39763.00	▇▁▁▁▁
men	0	1.00	2558.02	4037.99	0.00	463.25	1014.00	2675.50	33130.00	▇▁▁▁▁
resident	0	1.00	5413.00	7933.87	11.00	1100.50	2462.50	5911.25	80423.00	▇▁▁▁▁
percent_women	0	1.00	56.09	15.59	0.00	51.15	57.36	63.16	100.00	▁▁▇▅▁
percent_men	0	1.00	43.91	15.59	0.00	36.84	42.64	48.85	100.00	▁▅▇▁▁
percent_american_indian_alaska_native	0	1.00	1.02	5.98	0.00	0.15	0.31	0.59	100.00	▇▁▁▁▁
percent_asian	0	1.00	3.63	4.97	0.00	0.95	1.94	4.44	60.99	▇▁▁▁▁
percent_black	0	1.00	13.99	20.58	0.00	3.67	6.59	14.11	100.00	▇▁▁▁▁
percent_hispanic	0	1.00	7.87	9.17	0.00	2.75	5.18	9.24	92.98	▇▁▁▁▁
percent_native_hawaiian_pacific_islander	0	1.00	0.21	0.66	0.00	0.01	0.10	0.21	15.57	▇▁▁▁▁
percent_white	0	1.00	60.02	22.94	0.00	48.81	65.47	75.93	100.00	▂▂▃▇▃
percent_two_or_more_races	0	1.00	2.51	2.11	0.00	1.24	2.26	3.26	28.77	▇▁▁▁▁
percent_unknown	0	1.00	5.41	6.52	0.00	1.25	3.38	7.31	62.89	▇▁▁▁▁
percent_non_resident_foreign	0	1.00	5.33	8.07	0.00	1.04	2.83	6.57	96.62	▇▁▁▁▁
percent_total_minority	0	1.00	29.24	21.97	0.00	14.73	23.20	35.22	100.00	▇▇▂▁▁
percent_resident	0	1.00	94.67	8.07	3.38	93.43	97.17	98.96	100.00	▁▁▁▁▇

Reshape the data back to long format to make ggplots

Gender

# convert the wide to a long data
merged2_gender_long<-pivot_longer(data=merged2.final,
                          cols=c("percent_men","percent_women"),
                          names_to="percent_gender",
                          values_to="enrollment")

# remove the string "percent_" in the merged2_long columns
merged2_gender_long <-merged2_gender_long  %>%
  mutate(gender= str_remove_all(percent_gender, "percent_"))

#keep the columns that are required to plot compare gender
gender<-merged2_gender_long%>%
  select(name,state,state_code,type, degree_length,gender, enrollment, total_enrollment)

#verify
skimr::skim(gender)

Data summary
Name	gender
Number of rows	2528
Number of columns	8
_______________________
Column type frequency:
character	4
factor	2
numeric	2
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
name	1	11	58	1264
state_code	1	2	2	50
degree_length	1	6	6	1
gender	1	3	5	2

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
state	0	1	FALSE	50	New: 168, Pen: 164, Cal: 162, Tex: 132
type	0	1	FALSE	2	Pri: 1812, Pub: 716, For: 0, Oth: 0

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
enrollment	0	1	50.00	16.73	0	40.95	50	59.05	100	▁▂▇▂▁
total_enrollment	0	1	5807.69	8618.48	20	1142.50	2523	6304.75	81459	▇▁▁▁▁

Race and Ethnicity

# convert the wide to a long data
merged2_race_long<-pivot_longer(data=merged2.final,
 cols=percent_american_indian_alaska_native:percent_resident,
                          names_to="percent_race",
                          values_to="enrollment")

# remove the string "percent_" in the merged2_long columns
merged2_race_long <-merged2_race_long  %>%
  mutate(race= str_remove_all(percent_race, "percent_"))

#keep the columns that are  required to plot compare gender
race<-merged2_race_long%>%
  select(name,state,state_code,type, degree_length,race, enrollment, total_enrollment)

#verify
skimr::skim(race)

Data summary
Name	race
Number of rows	13904
Number of columns	8
_______________________
Column type frequency:
character	4
factor	2
numeric	2
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
name	1	11	58	1264
state_code	1	2	2	50
degree_length	1	6	6	1
race	1	5	32	11

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
state	0	1	FALSE	50	New: 924, Pen: 902, Cal: 891, Tex: 726
type	0	1	FALSE	2	Pri: 9966, Pub: 3938, For: 0, Oth: 0

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
enrollment	0	1	20.36	31.55	0	0.81	4.07	21.58	100	▇▁▁▁▁
total_enrollment	0	1	5807.69	8617.08	20	1142.50	2523.00	6304.75	81459	▇▁▁▁▁

Are the values what you expected for the variables?Why or Why not?

When I was working with the individual dataset and merging the datasets, I came across a lot of NA observations, it took me a while for me to figure out if they had a pattern to it or they were missing at random. For example the 2 year degree type observations had most of the observations missing for early career pay, mid career pay, stem % and make world better percent % , so then I had to apply a bunch of inclusion/exclusion criterias.

Also, reshaping the diversity data was a bit of challenge since the original data was in long format with all the gender and race variables lumped into one column, I had to convert to wide format in order to use them to create summary tables and then later for making boxplots and bar plots, I needed a long format data for which I had to reshape the data to a long format again.

I think after performing all the data transformation and merging of the datasets, I am happy with the outcome.

Visualizing and Summarizing the Data (15 points)

Use group_by() and summarize() to make a summary of the data here. The summary should be relevant to your research question

What are your findings about the summary? Are they what you expected?

Make at least two plots that help you answer your question on the transformed or summarized data. Use scales and/or labels to make each plot informative.

Summarizing data

Reasearch question 1

To answer out first research question of interest,

->we will to calculate the median early career pay, median mid career pay for the private/public colleges across different states of USA, this would provide an idea about the earning potential of a student at different stages in their career.

-> also to compare the quality of education, we will be comparing the median stem % and median make world better % between private and public colleges across different states in USA.

->apart from calculating the median of the variable of interest, it is also important to understand the spread of the distribution, for which we calculate the minimum and maximum values and a boxplot is plotted in the data visualization section

-> I created a comprehensive table which included the minimum,median and maximum value of early career pay, mid career pay, stem percent and make world better percent across private and public colleges in the different states in U.S.A and also created 4 separate tables one for each variable of interest.

# create a summary table which calculates the min,median and max of the early_career_pay, mid_career_pay, stem %, make world better %

merged1_stats<-merged1_final%>%
  group_by(state_code,type)%>%
  summarise(number_of_colleges=n(),
  across(early_career_pay:stem_percent,~min(.x, na.rm=TRUE),.names = "minimum_{.col}"),
across(early_career_pay:stem_percent,~median(.x, na.rm=TRUE),.names = "median_{.col}"),
  across(early_career_pay:stem_percent,~max(.x,na.rm=TRUE),.names = "maximum_{.col}"))

## `summarise()` has grouped output by 'state_code'. You can override using the
## `.groups` argument.

merged1_stats%>%gt::gt()

type	number_of_colleges	minimum_early_career_pay	minimum_mid_career_pay	minimum_make_world_better_percent	minimum_stem_percent	median_early_career_pay	median_mid_career_pay	median_make_world_better_percent	median_stem_percent	maximum_early_career_pay	maximum_mid_career_pay	maximum_make_world_better_percent	maximum_stem_percent
AK
Private	1	50300	90000	67	6	50300	90000	67.0	6.0	50300	90000	67	6
Public	2	56000	101000	54	10	57550	101400	55.5	15.0	59100	101800	57	20
AL
Private	10	34800	65100	52	0	41500	76450	64.0	12.5	54500	93500	77	30
Public	12	38900	70700	49	4	44750	79800	59.5	10.5	57500	104500	61	45
AR
Private	8	34600	61200	47	3	43200	77350	55.5	17.0	49200	90300	58	33
Public	8	40500	70500	49	2	44950	79800	56.0	7.5	52500	98000	84	22
AZ
Private	2	41200	73500	69	0	41900	75050	69.0	3.5	42600	76600	69	7
Public	2	49600	90000	51	12	52850	95500	53.5	17.5	56100	101000	56	23
CA
Private	13	58500	109400	39	0	64500	120600	49.5	16.0	88800	158200	59	97
Public	1	63000	114700	53	28	63000	114700	53.0	28.0	63000	114700	53	28
CO
Private	5	33600	60100	48	0	53100	94800	53.0	12.0	55700	105100	70	23
Public	9	44400	81400	47	3	48300	87100	53.0	19.0	75600	139600	60	93
CT
Private	12	49700	85900	36	2	55550	100650	52.5	19.0	70300	138300	63	31
Public	6	48500	83600	43	2	50450	87950	49.0	10.0	59100	105200	68	24
DE
Private	3	45100	76600	42	2	45400	77300	50.0	4.0	49800	86900	50	25
Public	2	46300	81900	43	10	52450	93650	52.5	16.0	58600	105400	62	22
FL
Private	10	46100	84400	44	2	49650	89100	53.0	8.0	58300	105400	88	48
Public	8	47300	83900	40	4	49500	89350	51.0	15.5	55800	102800	62	29
GA
Private	11	44400	79400	34	3	46800	84900	54.0	19.0	62000	110800	62	26
Public	11	43300	77900	45	7	47000	84400	51.0	10.0	54400	100700	69	22
HI
Private	3	45400	84000	55	3	51700	93500	56.0	12.0	52200	93900	57	15
Public	2	44000	77200	56	14	47800	84400	57.5	16.0	51600	91600	59	18
IA
Private	19	43900	78000	41	0	48000	86900	49.0	14.0	53400	99900	60	65
Public	3	47100	85300	46	8	54100	99400	48.0	16.0	56100	101300	48	31
ID
Private	2	46200	80600	59	7	49950	87050	63.5	10.5	53700	93500	68	14
Public	4	44900	77600	54	5	50450	89100	57.5	13.0	52300	97700	64	22
IL
Private	16	49400	89800	41	0	54200	96000	46.0	16.5	64600	118500	88	64
Public	4	51200	95000	44	10	54650	99650	47.0	25.5	62600	115300	48	36
IN
Private	17	46900	84500	40	0	50000	89900	53.0	21.0	75000	135800	59	96
Public	2	45600	83700	48	7	46750	84650	51.5	7.5	47900	85600	55	8
KS
Private	11	37100	66100	55	0	44700	79100	61.0	7.0	48700	91800	73	14
Public	7	43500	78000	51	4	47100	84900	54.0	9.0	53000	98600	64	24
KY
Private	16	35300	64100	50	0	40950	75300	54.0	8.0	50000	90900	74	27
Public	8	40700	73100	46	5	45550	81900	53.0	11.0	51700	96400	57	19
LA
Private	7	33900	61500	37	0	46700	86300	60.5	4.0	48400	88600	74	35
Public	9	40100	71000	51	5	48200	88200	57.0	13.0	53100	100700	76	24
MA
Private	20	52600	98200	36	0	62600	114100	47.0	25.0	86300	155200	71	86
Public	1	67900	117500	53	47	67900	117500	53.0	47.0	67900	117500	53	47
MD
Private	13	47900	82600	40	0	51100	94100	52.5	11.0	67400	118400	63	88
Public	5	49600	87000	47	2	50400	89800	51.0	12.0	51200	91200	65	19
ME
Private	9	36300	64600	37	0	46900	84300	51.0	7.0	61300	112000	69	37
Public	8	37100	66200	47	5	42550	74700	59.5	7.5	67000	121800	73	48
MI
Private	12	47400	84700	40	0	50700	91000	50.5	16.5	69900	125000	57	74
Public	10	45900	82200	46	6	50400	90500	48.5	15.0	65000	116300	54	81
MN
Private	6	47400	84700	39	0	50850	93900	49.0	16.5	58800	109900	52	45
Public	3	49700	85900	50	10	50200	86800	51.0	11.0	51600	91600	53	15
MO
Private	11	42500	77200	37	0	46500	81200	56.0	7.0	54200	95300	68	13
Public	6	43400	77700	48	9	46900	84200	51.0	21.0	67300	122600	54	80
MS
Private	6	34800	62400	56	2	41150	73050	60.5	19.0	49300	88700	76	32
Public	8	32500	61900	55	2	41300	74400	63.0	11.5	51100	94100	70	27
MT
Private	2	46000	82900	55	12	47100	87150	57.0	18.5	48200	91400	59	25
Public	2	53700	96200	54	31	57450	104400	59.0	46.0	61200	112600	64	61
NE
Private	7	43800	77700	45	0	48600	84800	57.0	9.0	53500	95800	74	23
Public	4	40600	73400	50	3	42450	75750	60.0	6.0	53400	93000	79	11
NV
Private	1	54000	95500	39	2	54000	95500	39.0	2.0	54000	95500	39	2
OH
Private	14	48300	88500	39	0	53700	99450	46.0	21.0	66600	117800	78	45
Public	2	48300	88600	51	20	50050	90550	51.5	20.5	51800	92500	52	21
OK
Private	8	40700	74100	51	0	46000	83050	58.5	8.0	58700	107200	71	41
Public	11	37300	67400	48	5	44500	77600	55.0	10.0	47900	90800	66	15
OR
Private	11	40500	76400	44	0	50000	89600	57.0	8.0	58500	104900	82	28
Public	8	44000	73900	46	5	51300	93150	53.5	11.5	65400	108300	79	34
PA
Private	19	53400	98400	37	5	59300	112300	43.0	25.0	75900	136100	72	66
TN
Private	14	41200	74900	46	3	45950	82350	59.5	9.0	65400	119100	68	39
Public	6	43700	77300	48	8	46850	79750	56.0	11.0	53100	95100	68	28
TX
Private	11	51800	93500	33	0	55500	98900	50.0	17.0	71000	129500	61	45
Public	7	50700	94800	48	3	56000	101200	57.0	22.0	61300	106500	86	53
UT
Private	1	54500	91500	57	9	54500	91500	57.0	9.0	54500	91500	57	9
Public	5	46400	84400	57	7	52400	93900	60.0	12.0	54800	102600	65	22
VA
Private	10	43000	80600	40	1	52500	92850	52.5	14.0	65900	123200	86	23
Public	12	44900	81800	42	9	52500	94050	52.0	18.5	65900	120600	68	47
VT
Private	8	42000	75200	39	0	49950	89200	51.0	15.5	60400	109800	63	20
Public	4	41500	75100	41	4	47050	85050	49.5	7.0	53200	99400	52	22
WA
Private	13	41100	73600	39	0	53600	97800	52.0	18.0	60600	113700	60	33
Public	6	43500	78400	45	9	51400	90850	50.0	15.0	55900	103700	56	20
WI
Private	12	45600	81900	45	0	48150	87000	53.0	15.5	65700	117600	94	69
Public	10	46000	81300	44	8	50000	89400	49.0	15.5	57300	104400	53	38
WY
Public	1	52400	98800	58	25	52400	98800	58.0	25.0	52400	98800	58	25

Early career pay

# you can also make individual tables for each of the variable of interest

#early_career_pay

early_career_pay_stats<-merged1_final%>%
  group_by(state_code,type)%>%
  summarise(number_of_colleges=n(),
    minimum_early_career_pay=min(early_career_pay,na.rm=TRUE),
    median_early_career_pay=median(early_career_pay,na.rm=TRUE),
    maximum_early_career_pay=max(early_career_pay,na.rm=TRUE))

## `summarise()` has grouped output by 'state_code'. You can override using the
## `.groups` argument.

early_career_pay_stats%>%gt::gt()

type	number_of_colleges	minimum_early_career_pay	median_early_career_pay	maximum_early_career_pay
AK
Private	1	50300	50300	50300
Public	2	56000	57550	59100
AL
Private	10	34800	41500	54500
Public	12	38900	44750	57500
AR
Private	8	34600	43200	49200
Public	8	40500	44950	52500
AZ
Private	2	41200	41900	42600
Public	2	49600	52850	56100
CA
Private	13	58500	64500	88800
Public	1	63000	63000	63000
CO
Private	5	33600	53100	55700
Public	9	44400	48300	75600
CT
Private	12	49700	55550	70300
Public	6	48500	50450	59100
DE
Private	3	45100	45400	49800
Public	2	46300	52450	58600
FL
Private	10	46100	49650	58300
Public	8	47300	49500	55800
GA
Private	11	44400	46800	62000
Public	11	43300	47000	54400
HI
Private	3	45400	51700	52200
Public	2	44000	47800	51600
IA
Private	19	43900	48000	53400
Public	3	47100	54100	56100
ID
Private	2	46200	49950	53700
Public	4	44900	50450	52300
IL
Private	16	49400	54200	64600
Public	4	51200	54650	62600
IN
Private	17	46900	50000	75000
Public	2	45600	46750	47900
KS
Private	11	37100	44700	48700
Public	7	43500	47100	53000
KY
Private	16	35300	40950	50000
Public	8	40700	45550	51700
LA
Private	7	33900	46700	48400
Public	9	40100	48200	53100
MA
Private	20	52600	62600	86300
Public	1	67900	67900	67900
MD
Private	13	47900	51100	67400
Public	5	49600	50400	51200
ME
Private	9	36300	46900	61300
Public	8	37100	42550	67000
MI
Private	12	47400	50700	69900
Public	10	45900	50400	65000
MN
Private	6	47400	50850	58800
Public	3	49700	50200	51600
MO
Private	11	42500	46500	54200
Public	6	43400	46900	67300
MS
Private	6	34800	41150	49300
Public	8	32500	41300	51100
MT
Private	2	46000	47100	48200
Public	2	53700	57450	61200
NE
Private	7	43800	48600	53500
Public	4	40600	42450	53400
NV
Private	1	54000	54000	54000
OH
Private	14	48300	53700	66600
Public	2	48300	50050	51800
OK
Private	8	40700	46000	58700
Public	11	37300	44500	47900
OR
Private	11	40500	50000	58500
Public	8	44000	51300	65400
PA
Private	19	53400	59300	75900
TN
Private	14	41200	45950	65400
Public	6	43700	46850	53100
TX
Private	11	51800	55500	71000
Public	7	50700	56000	61300
UT
Private	1	54500	54500	54500
Public	5	46400	52400	54800
VA
Private	10	43000	52500	65900
Public	12	44900	52500	65900
VT
Private	8	42000	49950	60400
Public	4	41500	47050	53200
WA
Private	13	41100	53600	60600
Public	6	43500	51400	55900
WI
Private	12	45600	48150	65700
Public	10	46000	50000	57300
WY
Public	1	52400	52400	52400

From the above summary table of the minimum,maximum and median early career pay across private/public colleges in the different states of USA, we noted the following:

-> Amongst the private colleges, the highest median early career pay was recorded for California, followed by Massachusetts and then by Pennsylvania. The lowest median early career pay was recorded for the Kansas state and the second lowest for Mississipi

->Amongst the public colleges, the highest median early career pay was noted for Massachusetts and then by California and the lowest was noted for Mississipi and the second lowest for Nebraska.

-> It must also be noted that in this data, there are 20 private colleges and 1 public college in Massachusetts, 13 private and 1 public college in California.

->The highest number of public colleges was in AL and VA(12 in each state)and the lowest number was in CA, MA, WY,NV (with only 1 public college in each state) and the second lowest was noted in AR, AZ, DE, HI(2 public colleges in each state)

-> By assessing the min and max values, if we compare the range of early career pay between Mississipi and California public colleges, it looks the max early career pay in Mississipi exceeds that of California, this may not true since we had only 1 public college in california in our data set and thus doesn’t provide an accurate representation of all the public colleges in CA, especially in a state like CA where there are a lot of public colleges. Mississipi on the other hand, have 10 public colleges, so the data is a little more reliable.

->If we compare MI and MA, the maximum early pay is only little higher for MA public colleges compared to MI public colleges, this is due to same issue as above since MA has data from only 1 public college, so we can’t make any definitive conclusions from this.

->It is interesting to note that Colorado,having 9 public colleges has the highest early career pay $75,600. I think this seems like a reasonable early career pay for Colorado but most likely the recent graduates from public colleges in CO may not make the highest pay compared to CA and MI. A solid comparison can’t be made at this point since we have just one public college in CA and in MI(in the data).

-> The lowest min early career pay amongst the public colleges was recorded for MS $32,500. This was calculated from 8 public colleges in OK and the highest min early career pay was recorded for MA and again we can’t consider this to be a very credible value since this data is obtained from only 1 public college

->The highest number of private colleges was in Massachusetts(20), followed by Iowa(19) and Indiana(17)and the lowest number was in AK,NV,UT(1 each) and the second lowest in AZ, ID, MT(1 each).

-> Amongst the private college,the highest median and max early career pay was noted for California followed by MA and then by Pennsylvania and Kansas, Mississipi recorded the lowest median and max early career pay. Colorado had recorded the lowest minimum early career pay compared to the other states.

Mid career pay

#mid_career_pay

mid_career_pay_stats<-merged1_final%>%
  group_by(state_code,type)%>%
  summarise(number_of_colleges=n(),
    minimum_mid_career_pay=min(mid_career_pay,na.rm=TRUE),
    median_mid_career_pay=median(mid_career_pay,na.rm=TRUE),
    maximum_mid_career_pay=max(mid_career_pay,na.rm=TRUE))

## `summarise()` has grouped output by 'state_code'. You can override using the
## `.groups` argument.

mid_career_pay_stats%>%gt::gt()

type	number_of_colleges	minimum_mid_career_pay	median_mid_career_pay	maximum_mid_career_pay
AK
Private	1	90000	90000	90000
Public	2	101000	101400	101800
AL
Private	10	65100	76450	93500
Public	12	70700	79800	104500
AR
Private	8	61200	77350	90300
Public	8	70500	79800	98000
AZ
Private	2	73500	75050	76600
Public	2	90000	95500	101000
CA
Private	13	109400	120600	158200
Public	1	114700	114700	114700
CO
Private	5	60100	94800	105100
Public	9	81400	87100	139600
CT
Private	12	85900	100650	138300
Public	6	83600	87950	105200
DE
Private	3	76600	77300	86900
Public	2	81900	93650	105400
FL
Private	10	84400	89100	105400
Public	8	83900	89350	102800
GA
Private	11	79400	84900	110800
Public	11	77900	84400	100700
HI
Private	3	84000	93500	93900
Public	2	77200	84400	91600
IA
Private	19	78000	86900	99900
Public	3	85300	99400	101300
ID
Private	2	80600	87050	93500
Public	4	77600	89100	97700
IL
Private	16	89800	96000	118500
Public	4	95000	99650	115300
IN
Private	17	84500	89900	135800
Public	2	83700	84650	85600
KS
Private	11	66100	79100	91800
Public	7	78000	84900	98600
KY
Private	16	64100	75300	90900
Public	8	73100	81900	96400
LA
Private	7	61500	86300	88600
Public	9	71000	88200	100700
MA
Private	20	98200	114100	155200
Public	1	117500	117500	117500
MD
Private	13	82600	94100	118400
Public	5	87000	89800	91200
ME
Private	9	64600	84300	112000
Public	8	66200	74700	121800
MI
Private	12	84700	91000	125000
Public	10	82200	90500	116300
MN
Private	6	84700	93900	109900
Public	3	85900	86800	91600
MO
Private	11	77200	81200	95300
Public	6	77700	84200	122600
MS
Private	6	62400	73050	88700
Public	8	61900	74400	94100
MT
Private	2	82900	87150	91400
Public	2	96200	104400	112600
NE
Private	7	77700	84800	95800
Public	4	73400	75750	93000
NV
Private	1	95500	95500	95500
OH
Private	14	88500	99450	117800
Public	2	88600	90550	92500
OK
Private	8	74100	83050	107200
Public	11	67400	77600	90800
OR
Private	11	76400	89600	104900
Public	8	73900	93150	108300
PA
Private	19	98400	112300	136100
TN
Private	14	74900	82350	119100
Public	6	77300	79750	95100
TX
Private	11	93500	98900	129500
Public	7	94800	101200	106500
UT
Private	1	91500	91500	91500
Public	5	84400	93900	102600
VA
Private	10	80600	92850	123200
Public	12	81800	94050	120600
VT
Private	8	75200	89200	109800
Public	4	75100	85050	99400
WA
Private	13	73600	97800	113700
Public	6	78400	90850	103700
WI
Private	12	81900	87000	117600
Public	10	81300	89400	104400
WY
Public	1	98800	98800	98800

From the above summary table of the minimum,maximum and median mid career pay across private/public colleges in the different states of USA, we noted the following:

-> Amongst the private colleges, the highest median mid career pay was recorded for California, followed by Massachusetts and Pennsylvania. The lowest median mid career pay was recorded for the Mississipi state and the second lowest for Arizona.

->Amongst the public colleges, the highest median mid career pay was noted for Massachusetts and then by California and the lowest was noted for Mississipi and the second lowest for ME.

->By eyeballing,the median mid career pay for the private and public colleges within each state,in some of the states have a slightly higher average mid career pay for public colleges compared to private which seems like an interesting observation however we need to factor in the number of colleges in each of the private/public college in each of the state.

->Sometimes there could be less number colleges in one category which would provide biased mean.It is difficult to compare the mean mid pay from private and public college from the summary table as the trend is not consistent across all the states.

-> It must also be noted that in this data, there are 20 private colleges and 1 public college in Massachusetts, 13 private and 1 public college in California.

->The highest number of public colleges was in AL and VA(12 in each state)and the lowest number was in CA, MA, WY,NV (with only 1 public college in each state) and the second lowest was noted in AR, AZ, DE, HI(2 public colleges in each state)

-> By assessing the min and max values, if we compare the range of mid career pay between Mississipi and California public colleges, it looks the max mid-career pay in Mississipi is less that of California’s, this may be true however this is not an accurate representation of all the public colleges in CA since there was on;y 1 public college in CA, especially in a state like CA where there are a lot of public colleges. Mississipi on the other hand, have 10 public colleges, so the data is a little more reliable.

->If we compare MI and MA, the maximum mid career pay is higher for MA public colleges compared to MI public colleges, however inferences can’t be drawn definitely since MA has data from only 1 public college.

->It is interesting to note that the highest maximum mid career pay amongst the public colleges was observed for Colorado with $139,600 followed by MO($122,600), this was higher than the CA and MA public colleges. One could interpret this in two ways, since CA and MA has a lot of public colleges and based on the location factor and cost of living, one would expect to see the highest mid career salaries made in these two states but we must factor in that both these states had data from only one public college which is definitely doesn’t represent the earning potential of students of students from California public colleges.

-> The lowest min mid-career pay amongst the public colleges was recorded for MS . This was calculated from 8 public colleges in MS and the trend is similar for the min early career pay as well.The highest min mid career pay was recorded for MA and again we can’t consider this to be a very credible value since this data is obtained from only 1 public college

->The highest number of private colleges was in Massachusetts(20), followed by Iowa(19) and Indiana(17)and the lowest number was in AK,NV,UT(1 each) and the second lowest in AZ, ID, MT(1 each).

-> Amongst the private college,the highest median and max mid-career pay was noted for California followed by MA and then by Pennsylvania and Mississipi and Arizona recorded the lowest median mid-career pay. Colorado had recorded the lowest minimum mid career pay compared to the other states. The trend is very similar to early career pay.

Make world better percent

Since right-skew was prominently noticed for make world better %, median was used instead of mean as median is more reliable as a measure of central tendency when the data is skewed or when outliers are present.

# make world better percent

make_world_better_stats<-merged1_final%>%
  group_by(state_code,type)%>%
  summarise(number_of_colleges=n(),
   minimum_make_world_better_percent=min(make_world_better_percent,na.rm=TRUE),
    median_make_world_better_percent=median(make_world_better_percent,na.rm=TRUE),
    maximum_make_world_better_percent=max(make_world_better_percent,na.rm=TRUE))

## `summarise()` has grouped output by 'state_code'. You can override using the
## `.groups` argument.

make_world_better_stats%>%gt::gt()

type	number_of_colleges	minimum_make_world_better_percent	median_make_world_better_percent	maximum_make_world_better_percent
AK
Private	1	67	67.0	67
Public	2	54	55.5	57
AL
Private	10	52	64.0	77
Public	12	49	59.5	61
AR
Private	8	47	55.5	58
Public	8	49	56.0	84
AZ
Private	2	69	69.0	69
Public	2	51	53.5	56
CA
Private	13	39	49.5	59
Public	1	53	53.0	53
CO
Private	5	48	53.0	70
Public	9	47	53.0	60
CT
Private	12	36	52.5	63
Public	6	43	49.0	68
DE
Private	3	42	50.0	50
Public	2	43	52.5	62
FL
Private	10	44	53.0	88
Public	8	40	51.0	62
GA
Private	11	34	54.0	62
Public	11	45	51.0	69
HI
Private	3	55	56.0	57
Public	2	56	57.5	59
IA
Private	19	41	49.0	60
Public	3	46	48.0	48
ID
Private	2	59	63.5	68
Public	4	54	57.5	64
IL
Private	16	41	46.0	88
Public	4	44	47.0	48
IN
Private	17	40	53.0	59
Public	2	48	51.5	55
KS
Private	11	55	61.0	73
Public	7	51	54.0	64
KY
Private	16	50	54.0	74
Public	8	46	53.0	57
LA
Private	7	37	60.5	74
Public	9	51	57.0	76
MA
Private	20	36	47.0	71
Public	1	53	53.0	53
MD
Private	13	40	52.5	63
Public	5	47	51.0	65
ME
Private	9	37	51.0	69
Public	8	47	59.5	73
MI
Private	12	40	50.5	57
Public	10	46	48.5	54
MN
Private	6	39	49.0	52
Public	3	50	51.0	53
MO
Private	11	37	56.0	68
Public	6	48	51.0	54
MS
Private	6	56	60.5	76
Public	8	55	63.0	70
MT
Private	2	55	57.0	59
Public	2	54	59.0	64
NE
Private	7	45	57.0	74
Public	4	50	60.0	79
NV
Private	1	39	39.0	39
OH
Private	14	39	46.0	78
Public	2	51	51.5	52
OK
Private	8	51	58.5	71
Public	11	48	55.0	66
OR
Private	11	44	57.0	82
Public	8	46	53.5	79
PA
Private	19	37	43.0	72
TN
Private	14	46	59.5	68
Public	6	48	56.0	68
TX
Private	11	33	50.0	61
Public	7	48	57.0	86
UT
Private	1	57	57.0	57
Public	5	57	60.0	65
VA
Private	10	40	52.5	86
Public	12	42	52.0	68
VT
Private	8	39	51.0	63
Public	4	41	49.5	52
WA
Private	13	39	52.0	60
Public	6	45	50.0	56
WI
Private	12	45	53.0	94
Public	10	44	49.0	53
WY
Public	1	58	58.0	58

From the above summary table of the minimum,maximum and median make world better % across private/public colleges in the different states of USA, we noted the following:

Private Colleges

-> Amongst the private colleges, the highest median make world better % was recorded for Arizona(69%) and Alaska(67%) and the lowest median make world better % was for Nevada(39%). It was surprising to observe that the median make world better % was pretty low for CA(49.5%), MA(47%) and PA(43%), these three states were having the highest average early and mid career pay.

->By assessing the min make world better % for the private colleges, the highest min make world better % was observed for Arizona(69%) and Alaska (67%) and the lowest % was noted for TX(33%), GA(34%),MA(36%) and CA(39%). The low % in make world better is not expected for states like MA and CA. Note:Only 2 private colleges are in AZ and 1 is in Alaska, so this data may not truly represent all the other private colleges in AZ and AK.

->By assessing the max make world better % for the private colleges, the highest max make world better % was observed for WI(94%) and for FL(88%) and IL(88%) and the smallest max make world better % was noted for NV(39%) and DE(50%).It was also intriguing to note that CA(59%) and MA(71%) had only moderate min make world better % rankings.

Public Colleges

->Amongst the public colleges, the highest median make world better % was observed for Mississipi(63%) and Nebraska(60%),both of these states had the lowest median early and mid career pay, so it was interesting to note this observation. The lowest median make world better % was observed for IA(48%) and IL(47%).

->By assessing the min make world better % for the public colleges, the highest min make world better % was observed for WY(58%) and UT(57%) and the lowest % was noted for FL(40%) and VT(41%) . The MIN make world better % for states like MA and CA(53% each) was almost comparable to the highest min make world better %.

->By assessing the max make world better % for the public colleges, the highest max make world better % was observed for TX(86%)and AR(84%) and the smallest max make world better % was noted for IL(48%) and IA(48%).It was also intriguing to note that CA(53%) and MA(53%) had pretty low max make world better % rankings.

->By eyeballing the median world better % between the public and private colleges across the different states-in some of the states the public colleges seem to have a better median world % compared to the private colleges and vice-versa. The largest diff in make world better % between the public and private college was observed for Arizona(approx 16%, with private college % being higher) and Alaska(approx 12%,with private college % being higher), for the rest of the states the difference was very small.

Stem percent

#stem percent

stem_stats<-merged1_final%>%
  group_by(state_code,type)%>%
  summarise(number_of_colleges=n(),
    minimum_stem_percent=min(stem_percent,na.rm=TRUE),
    median_stem_percent=median(stem_percent,na.rm=TRUE),
    maximum_stem_percent=max(stem_percent,na.rm=TRUE))

## `summarise()` has grouped output by 'state_code'. You can override using the
## `.groups` argument.

stem_stats%>%gt::gt()

type	number_of_colleges	minimum_stem_percent	median_stem_percent	maximum_stem_percent
AK
Private	1	6	6.0	6
Public	2	10	15.0	20
AL
Private	10	0	12.5	30
Public	12	4	10.5	45
AR
Private	8	3	17.0	33
Public	8	2	7.5	22
AZ
Private	2	0	3.5	7
Public	2	12	17.5	23
CA
Private	13	0	16.0	97
Public	1	28	28.0	28
CO
Private	5	0	12.0	23
Public	9	3	19.0	93
CT
Private	12	2	19.0	31
Public	6	2	10.0	24
DE
Private	3	2	4.0	25
Public	2	10	16.0	22
FL
Private	10	2	8.0	48
Public	8	4	15.5	29
GA
Private	11	3	19.0	26
Public	11	7	10.0	22
HI
Private	3	3	12.0	15
Public	2	14	16.0	18
IA
Private	19	0	14.0	65
Public	3	8	16.0	31
ID
Private	2	7	10.5	14
Public	4	5	13.0	22
IL
Private	16	0	16.5	64
Public	4	10	25.5	36
IN
Private	17	0	21.0	96
Public	2	7	7.5	8
KS
Private	11	0	7.0	14
Public	7	4	9.0	24
KY
Private	16	0	8.0	27
Public	8	5	11.0	19
LA
Private	7	0	4.0	35
Public	9	5	13.0	24
MA
Private	20	0	25.0	86
Public	1	47	47.0	47
MD
Private	13	0	11.0	88
Public	5	2	12.0	19
ME
Private	9	0	7.0	37
Public	8	5	7.5	48
MI
Private	12	0	16.5	74
Public	10	6	15.0	81
MN
Private	6	0	16.5	45
Public	3	10	11.0	15
MO
Private	11	0	7.0	13
Public	6	9	21.0	80
MS
Private	6	2	19.0	32
Public	8	2	11.5	27
MT
Private	2	12	18.5	25
Public	2	31	46.0	61
NE
Private	7	0	9.0	23
Public	4	3	6.0	11
NV
Private	1	2	2.0	2
OH
Private	14	0	21.0	45
Public	2	20	20.5	21
OK
Private	8	0	8.0	41
Public	11	5	10.0	15
OR
Private	11	0	8.0	28
Public	8	5	11.5	34
PA
Private	19	5	25.0	66
TN
Private	14	3	9.0	39
Public	6	8	11.0	28
TX
Private	11	0	17.0	45
Public	7	3	22.0	53
UT
Private	1	9	9.0	9
Public	5	7	12.0	22
VA
Private	10	1	14.0	23
Public	12	9	18.5	47
VT
Private	8	0	15.5	20
Public	4	4	7.0	22
WA
Private	13	0	18.0	33
Public	6	9	15.0	20
WI
Private	12	0	15.5	69
Public	10	8	15.5	38
WY
Public	1	25	25.0	25

Since right-skew was prominently noticed for stem %, median was used instead of mean as median is more reliable as a measure of central tendency when the data is skewed or when outliers are present.

Amongst the public colleges, MA(47%) has the highest median stem % followed by MT(46%) and CA(28.00%) and the least mean stem % is noted for NE(6%),VT(7.0%), AR(7.5%).It is interesting to note that CA has a pretty low stem % but this could be due to CA having only one public college in the data and this particular college had a low stem % but that doesn’t represent the public colleges in CA accurately. CO (93%) has the maximum stem % however it’s median stem is 19%. Indiana has the least maximum stem percent(8%).MA(47%) has the highest min stem percent and MD,AR,CT, MS(2% for each state) have the lowest min stem percent.

Amongst the private colleges, PA(25.00%) and MA(25.00%) has the highest median stem % and the least median stem % is noted for LA(4%),AZ(3.5%) and NV(2%). The median stem % of CA private colleges(13 colleges) is 16% only since the minimum stem % is 0 and the maximum stem % is 97.

It is intriguing to note that the private colleges on have lower median stem % compared to the public colleges.

Research Question 2

How Gender, Race and Ethnicity impact the choice of private/ public colleges across the USA

Gender

# create a summary table
gender.stats<-merged2.final%>%
              group_by(state,type)%>%
              summarise(number_of_colleges=n(),
across(c(percent_women,percent_men),~min(.x, na.rm=TRUE),.names = "minimum_{.col}"),
across(c(percent_women,percent_men),~mean(.x, na.rm=TRUE),.names = "average_{.col}"),
across(c(percent_women,percent_men),~max(.x,na.rm=TRUE),.names = "maximum_{.col}"))

## `summarise()` has grouped output by 'state'. You can override using the
## `.groups` argument.

gender.stats%>%gt::gt()

type	number_of_colleges	minimum_percent_women	minimum_percent_men	average_percent_women	average_percent_men	maximum_percent_women	maximum_percent_men
Alabama
Private	15	5.37634409	33.6250000	51.91116	48.08884	66.37500	94.62366
Public	11	49.39024390	29.7819002	61.67802	38.32198	70.21810	50.60976
Alaska
Private	2	54.34782609	33.6787565	60.33453	39.66547	66.32124	45.65217
Arizona
Private	3	44.91180461	33.0188679	55.24636	44.75364	66.98113	55.08820
Public	3	42.95310016	41.1081032	51.22506	48.77494	58.89190	57.04690
Arkansas
Private	10	33.14794216	35.9788360	49.92745	50.07255	64.02116	66.85206
Public	9	54.07879029	34.8788927	58.46318	41.53682	65.12111	45.92121
California
Private	76	0.00000000	0.6072874	54.81189	45.18811	99.39271	100.00000
Public	5	49.63775869	37.0936491	56.05517	43.94483	62.90635	50.36224
Colorado
Private	6	37.40157480	31.2248996	57.21153	42.78847	68.77510	62.59843
Public	6	28.31264676	34.3734440	49.11059	50.88941	65.62656	71.68735
Connecticut
Private	12	32.80423280	10.9551657	58.09519	41.90481	89.04483	67.19577
Public	6	50.13707734	32.6075687	56.47142	43.52858	67.39243	49.86292
Delaware
Private	1	43.19085487	56.8091451	43.19085	56.80915	43.19085	56.80915
Public	2	56.24779541	37.2299295	59.50893	40.49107	62.77007	43.75220
Florida
Private	28	0.00000000	30.2799404	49.15176	50.84824	69.72006	100.00000
Public	10	54.22673325	40.7673861	56.37840	43.62160	59.23261	45.77327
Georgia
Private	24	0.04741584	0.0000000	59.75706	40.24294	100.00000	99.95258
Public	20	30.15708166	30.9974425	59.32227	40.67773	69.00256	69.84292
Hawaii
Private	2	56.66723872	34.6153846	61.02593	38.97407	65.38462	43.33276
Public	1	60.60142712	39.3985729	60.60143	39.39857	60.60143	39.39857
Idaho
Private	3	49.14285714	41.6184971	53.32397	46.67603	58.38150	50.85714
Public	4	46.84669287	39.3587361	53.46854	46.53146	60.64126	53.15331
Illinois
Private	46	35.56596607	9.7560976	61.04623	38.95377	90.24390	64.43403
Public	9	44.90695614	29.8995845	57.03811	42.96189	70.10042	55.09304
Indiana
Private	19	0.00000000	32.7269386	50.26330	49.73670	67.27306	100.00000
Public	4	55.07092468	36.5624317	60.35421	39.64579	63.43757	44.92908
Iowa
Private	25	32.94360385	8.8695652	57.39885	42.60115	91.13043	67.05640
Public	3	43.98141426	41.7756539	51.26467	48.73533	58.22435	56.01859
Kansas
Private	11	40.06069803	34.7436941	53.20483	46.79517	65.25631	59.93930
Public	8	49.32568844	37.2260386	54.62502	45.37498	62.77396	50.67431
Kentucky
Private	17	21.79487179	26.5253137	56.05595	43.94405	73.47469	78.20513
Public	8	51.56532628	39.4195251	56.96244	43.03756	60.58047	48.43467
Louisiana
Private	7	53.26433121	14.6922184	65.51076	34.48924	85.30778	46.73567
Public	11	48.82850780	23.7755587	62.17425	37.82575	76.22444	51.17149
Maine
Private	8	50.02770083	26.5795207	61.89320	38.10680	73.42048	49.97230
Public	8	13.86792453	27.8516295	57.95685	42.04315	72.14837	86.13208
Maryland
Private	9	0.00000000	13.6034732	59.56967	40.43033	86.39653	100.00000
Public	8	52.41806909	27.1943824	60.01231	39.98769	72.80562	47.58193
Massachusetts
Private	48	19.43835015	0.4434590	59.50762	40.49238	99.55654	80.56165
Public	9	12.69205077	30.0238663	57.45505	42.54495	69.97613	87.30795
Michigan
Private	29	0.00000000	8.8410992	50.39129	49.60871	91.15890	100.00000
Public	11	26.29947880	40.3508772	53.03614	46.96386	59.64912	73.70052
Minnesota
Private	13	27.79922780	5.8160237	57.18617	42.81383	94.18398	72.20077
Public	4	56.25752106	37.4596960	58.24378	41.75622	62.54030	43.74248
Mississippi
Private	8	33.33333333	31.5751094	57.70600	42.29400	68.42489	66.66667
Public	8	48.59469659	18.6943620	62.61132	37.38868	81.30564	51.40530
Missouri
Private	26	2.15053763	0.0000000	61.51486	38.48514	100.00000	97.84946
Public	9	22.61574074	32.3437500	54.29603	45.70397	67.65625	77.38426
Montana
Private	4	48.36146971	29.8119964	59.83198	40.16802	70.18800	51.63853
Public	1	32.47002398	67.5299760	32.47002	67.52998	32.47002	67.52998
Nebraska
Private	11	44.85294118	2.2593320	62.60753	37.39247	97.74067	55.14706
Public	4	57.34870317	35.8225108	59.90584	40.09416	64.17749	42.65130
Nevada
Private	1	57.65158807	42.3484119	57.65159	42.34841	57.65159	42.34841
Public	2	64.55026455	23.5443038	70.50298	29.49702	76.45570	35.44974
New Hampshire
Private	8	47.11019371	30.3149606	59.95840	40.04160	69.68504	52.88981
Public	5	51.98764161	25.7582617	58.08928	41.91072	74.24174	48.01236
New Jersey
Private	15	0.00000000	25.7552483	43.07105	56.92895	74.24475	100.00000
Public	9	25.03287620	36.6027532	54.19591	45.80409	63.39725	74.96712
New Mexico
Private	1	65.31219029	34.6878097	65.31219	34.68781	65.31219	34.68781
Public	7	32.15796897	36.0406091	55.86579	44.13421	63.95939	67.84203
New York
Private	83	0.00000000	0.1943257	46.24819	53.75181	99.80567	100.00000
Public	1	84.65792708	15.3420729	84.65793	15.34207	84.65793	15.34207
North Carolina
Private	35	41.98581560	1.5906681	60.34770	39.65230	98.40933	58.01418
Public	15	44.26137868	27.8352490	58.95017	41.04983	72.16475	55.73862
North Dakota
Private	3	36.08695652	34.6030616	51.97229	48.02771	65.39694	63.91304
Public	6	46.01613888	39.3548387	54.96614	45.03386	60.64516	53.98386
Ohio
Private	46	0.00000000	9.5940959	53.61896	46.38104	90.40590	100.00000
Public	9	48.81479008	41.9344902	53.46294	46.53706	58.06551	51.18521
Oklahoma
Private	10	42.50320376	29.1666667	56.49564	43.50436	70.83333	57.49680
Public	10	54.20319752	35.5088496	59.83497	40.16503	64.49115	45.79680
Oregon
Private	12	44.09030544	31.2138728	59.32605	40.67395	68.78613	55.90969
Public	8	46.29107981	34.9528137	55.82764	44.17236	65.04719	53.70892
Pennsylvania
Private	66	0.00000000	0.6622517	60.29091	39.70909	99.33775	100.00000
Public	16	36.06615686	34.8564426	54.77032	45.22968	65.14356	63.93384
Rhode Island
Private	7	32.20396988	31.7269076	52.26290	47.73710	68.27309	67.79603
Public	2	54.77641663	31.0149288	61.88074	38.11926	68.98507	45.22358
South Carolina
Private	18	43.55909695	8.3513319	57.05298	42.94702	91.64867	56.44090
Public	9	19.93318486	29.7920892	56.90451	43.09549	70.20791	80.06682
South Dakota
Private	4	56.78160920	35.8024691	60.24194	39.75806	64.19753	43.21839
Public	8	22.05146533	29.4851794	55.96087	44.03913	70.51482	77.94853
Tennessee
Private	28	29.93630573	8.8034188	57.74361	42.25639	91.19658	70.06369
Public	6	44.55419349	38.9719730	56.32104	43.67896	61.02803	55.44581
Texas
Private	38	11.67883212	32.8674121	51.16689	48.83311	67.13259	88.32117
Public	28	43.26044598	11.6117046	58.63384	41.36616	88.38830	56.73955
Utah
Private	2	44.87600052	39.4268518	52.72457	47.27543	60.57315	55.12400
Public	2	44.51213708	46.5747091	48.96871	51.03129	53.42529	55.48786
Vermont
Private	9	20.28867102	30.1115242	53.37629	46.62371	69.88848	79.71133
Public	3	42.86713287	44.4306161	48.28991	51.71009	55.56938	57.13287
Virginia
Private	22	0.18099548	3.5714286	58.61043	41.38957	96.42857	99.81900
Public	12	11.05882353	31.8485086	54.91656	45.08344	68.15149	88.94118
Washington
Private	11	51.36035536	17.7204659	61.30630	38.69370	82.27953	48.63964
Public	6	51.34211811	44.3841522	53.63577	46.36423	55.61585	48.65788
West Virginia
Private	4	41.72185430	43.5555556	52.28120	47.71880	56.44444	58.27815
Public	9	39.09595559	38.0678183	53.42305	46.57695	61.93218	60.90404
Wisconsin
Private	20	24.05693950	3.1393889	65.38474	34.61526	96.86061	75.94306
Public	2	49.24419842	47.8318292	50.70618	49.29382	52.16817	50.75580
Wyoming
Public	1	51.95787832	48.0421217	51.95788	48.04212	51.95788	48.04212

Amongst the public colleges, the highest mean % women enrollment is noted for NY(84.65%) and NV(70.50%) and the lowest was noted for Montana(32.47%) and Vermont(48.28%). The highest mean % women enrollment for CA and MA public colleges is 56.05% and 57.4%. The % enrollment of men in public colleges would be (100-% women enrollment) and hence the lowest mean % enrollment of men is noted for NY and NV and the highest for Montana and Vermont.

One of the limitation of this data set is only women enrollment was provided and by taking the difference from the total enrollment and women enrollment, men enrollment was calculated. This calculation holds good only for binary gender groups and doesn’t include the other gender groups.

Race and Ethnicity

# create a summary table
race_ethnicity.stats<-merged2.final%>%
              group_by(state,type)%>%
              summarise(number_of_colleges=n(),
across(percent_american_indian_alaska_native:percent_resident,~min(.x, na.rm=TRUE),.names = "minimum_{.col}"),
across(percent_american_indian_alaska_native:percent_resident,~mean(.x, na.rm=TRUE),.names = "average_{.col}"),
across(percent_american_indian_alaska_native:percent_resident,~max(.x,na.rm=TRUE),.names = "maximum_{.col}"))

## `summarise()` has grouped output by 'state'. You can override using the
## `.groups` argument.

race_ethnicity.stats%>%gt::gt()

type	number_of_colleges	minimum_percent_american_indian_alaska_native	minimum_percent_asian	minimum_percent_black	minimum_percent_hispanic	minimum_percent_native_hawaiian_pacific_islander	minimum_percent_white	minimum_percent_two_or_more_races	minimum_percent_unknown	minimum_percent_non_resident_foreign	minimum_percent_total_minority	minimum_percent_resident	average_percent_american_indian_alaska_native	average_percent_asian	average_percent_black	average_percent_hispanic	average_percent_native_hawaiian_pacific_islander	average_percent_white	average_percent_two_or_more_races	average_percent_unknown	average_percent_non_resident_foreign	average_percent_total_minority	average_percent_resident	maximum_percent_american_indian_alaska_native	maximum_percent_asian	maximum_percent_black	maximum_percent_hispanic	maximum_percent_native_hawaiian_pacific_islander	maximum_percent_white	maximum_percent_two_or_more_races	maximum_percent_unknown	maximum_percent_non_resident_foreign	maximum_percent_total_minority	maximum_percent_resident
Alabama
Private	15	0.00000000	0.00000000	7.5268817	0.0000000	0.00000000	0.8960573	0.00000000	0.00000000	0.00000000	14.960470	92.315627	0.5611901	0.8230765	53.2076651	2.177298	0.09911336	35.29526	0.87036975	5.4710553	1.49497673	57.738713	98.50502	2.1505376	5.2320675	98.5507246	5.595930	1.07526882	81.22846	3.27034884	28.8000000	7.6843734	98.550725	100.00000
Public	11	0.14495380	0.15041364	7.2784810	0.7118845	0.00000000	3.7687987	0.00000000	1.06963312	0.54347826	12.615776	94.492899	0.6471454	1.6061984	28.5768747	2.108923	0.08325071	58.64694	1.67328864	3.4049145	3.25246816	34.695681	96.74753	1.5345269	4.9791422	89.6901613	3.497715	0.20879469	80.48395	3.01101722	6.2289895	5.5071010	92.498641	99.45652
Alaska
Private	2	6.52173913	0.00000000	0.0000000	2.1739130	0.00000000	55.4404145	2.17391304	0.00000000	0.00000000	10.869565	100.000000	10.7738229	0.8635579	1.1226252	2.382293	0.34542314	72.28542	4.62754374	7.5993092	0.00000000	20.115266	100.00000	15.0259067	1.7271157	2.2452504	2.590674	0.69084629	89.13043	7.08117444	15.1986183	0.0000000	29.360967	100.00000
Arizona
Private	3	0.94979647	1.41509434	2.8301887	2.4764151	0.23584906	6.5934066	0.00000000	0.00000000	0.00000000	14.150943	96.698113	23.5257010	2.0184101	5.2624602	10.743414	1.76995983	47.53874	3.07078458	4.3819283	1.68859818	46.390730	98.31140	67.0329670	2.4423338	7.4626866	15.468114	4.39560440	75.23585	4.61329715	7.3113208	3.3018868	93.406593	100.00000
Public	3	1.22407425	1.68922577	3.2051976	16.3732114	0.20468998	51.9773450	3.50755167	0.78686158	4.49738314	31.291733	84.193164	1.8475064	4.5607107	3.3692364	19.338836	0.21780097	55.51847	4.04685803	1.4155663	9.68501602	33.380949	90.31498	3.0247248	6.3294913	3.5830684	22.267734	0.22492660	62.37141	4.82584371	2.5357515	15.8068362	36.506771	95.50262
Arkansas
Private	10	0.00000000	0.11123471	3.7894737	0.3527337	0.00000000	0.5291005	0.00000000	0.00000000	0.88987764	11.239478	89.634146	1.0391711	1.2956800	25.1500961	5.387377	0.05549444	59.34981	1.46284304	1.5496239	4.70990631	34.390662	95.29009	1.8292683	5.6701031	94.2157953	14.650767	0.17636684	83.34444	3.74787053	6.4606742	10.3658537	94.994438	99.11012
Public	9	0.11937923	0.59678256	4.3923865	1.5917230	0.00000000	4.6955830	0.07958615	0.00000000	0.44110016	18.655224	95.084630	0.6727289	1.8417061	25.4832340	4.471255	0.13035704	60.82006	3.06935664	0.8577156	2.65358582	35.668637	97.34641	2.7232796	4.5674740	91.3251094	9.677892	0.78962211	77.60373	7.06741091	3.1487889	4.9153701	93.712694	99.55890
California
Private	76	0.00000000	0.00000000	0.0000000	0.0000000	0.00000000	0.0000000	0.00000000	0.00000000	0.00000000	0.000000	3.378378	0.4990811	11.3811370	5.9086256	18.109358	0.62276228	40.82164	3.89647241	6.4061661	12.35476169	40.417436	87.64524	5.0000000	60.9929078	21.8303946	48.186528	3.10663308	96.93252	9.63130173	62.2549020	96.6216216	87.009063	100.00000
Public	5	0.13450310	3.45315262	2.1743811	22.6423746	0.21528525	21.7405924	4.76568948	4.15431174	1.41426046	39.171152	89.056338	0.4288308	15.7942713	3.4476867	25.740584	0.41514383	36.04037	5.76817383	6.5548161	5.81012715	51.594690	94.18987	1.0017678	31.7335616	4.6767351	29.217812	0.73976707	51.87298	6.52314316	8.2334974	10.9436615	63.161434	98.58574
Colorado
Private	6	0.43541364	0.78740157	1.4056225	6.6929134	0.00000000	61.4031277	2.54126846	0.00000000	0.00000000	18.110236	89.313236	0.6500631	2.8170575	4.3132906	9.728886	0.16946646	69.32110	4.73317466	4.4980127	3.76895278	22.411939	96.23105	1.1811024	4.6481321	6.7565232	15.117289	0.39370079	81.88976	8.36961780	8.6120765	10.6867643	28.811903	100.00000
Public	6	0.21804764	0.63307834	0.8177262	6.9942972	0.05031869	54.4711158	2.86377709	1.76115398	0.00000000	16.101979	88.527340	4.4050785	1.5726292	2.6744927	13.681386	0.31768426	63.58903	3.91688512	7.0134684	2.82934492	26.568156	97.17066	23.3975204	3.8745387	6.0558022	25.745086	0.60333480	72.27951	5.98786600	15.1315789	11.4726602	41.281984	100.00000
Connecticut
Private	12	0.00000000	0.45161290	1.0582011	6.3873538	0.00000000	33.4232325	1.05820106	0.00000000	0.70967742	13.214913	75.900597	0.3499549	3.6141543	10.5126543	9.080962	0.09848934	58.38210	1.94630596	6.9648024	9.05057636	25.602520	90.94942	1.7994859	12.8161479	33.2258065	14.000000	0.21190522	75.97122	4.43417639	19.3372320	24.0994028	49.483871	99.29032
Public	6	0.15447798	1.29600829	5.0977733	7.2642327	0.01847575	57.8019699	1.12567204	2.57056452	0.20161290	21.524494	89.827060	0.2255267	3.7201589	10.6672862	10.894874	0.09255368	63.54261	2.16476152	6.3233951	2.36883308	27.765161	97.63117	0.3110420	8.7411929	16.7444272	15.120968	0.16801075	69.33989	2.59126159	8.6374134	10.1729400	33.229653	99.79839
Delaware
Private	1	0.19880716	2.73359841	10.6858847	1.3916501	0.00000000	19.6322068	1.29224652	0.49701789	63.56858847	16.302187	36.431412	0.1988072	2.7335984	10.6858847	1.391650	0.00000000	19.63221	1.29224652	0.4970179	63.56858847	16.302187	36.43141	0.1988072	2.7335984	10.6858847	1.391650	0.00000000	19.63221	1.29224652	0.4970179	63.5685885	16.302187	36.43141
Public	2	0.07495591	1.06891062	5.4320988	5.1398681	0.09259259	12.2583580	2.53968254	1.49470899	4.91244030	19.069665	91.141975	0.1739346	2.7787234	35.8750214	5.790834	0.10315324	41.41798	3.08926360	3.8858580	6.88523250	47.810930	93.11477	0.2729134	4.4885362	66.3179441	6.441799	0.11371390	70.57760	3.63884467	6.2770071	8.8580247	76.552195	95.08756
Florida
Private	28	0.00000000	0.00000000	0.0000000	0.0000000	0.00000000	0.5890052	0.00000000	0.00000000	0.00000000	0.000000	64.899108	0.3882364	1.9424517	20.4182202	12.287918	0.23402674	51.01961	2.02478246	5.1267983	6.55795435	37.295635	93.44205	1.4414414	7.5471698	88.5262117	39.814815	1.85185185	100.00000	5.40540541	19.7552020	35.1008916	94.187298	100.00000
Public	10	0.08466035	1.62371312	2.6378897	8.3478813	0.00000000	11.4553517	2.10038299	0.40040040	1.59607545	24.940048	91.316040	0.2292512	3.9775278	9.8004227	20.966752	0.18384279	55.80071	3.21755099	1.6367859	4.18716044	38.375347	95.81284	0.6110141	7.0785903	18.7411295	62.912719	0.48525041	70.74341	4.87362362	3.2006308	8.6839605	80.151179	98.40392
Georgia
Private	24	0.00000000	0.00000000	2.7280477	0.0000000	0.00000000	0.0000000	0.00000000	0.00000000	0.00000000	9.974425	80.620428	0.2957594	2.9849375	35.7465735	4.155618	0.09349994	43.71465	1.70773236	7.2180669	4.08316281	44.984121	95.91684	0.7262164	15.4445122	100.0000000	10.329068	0.35149385	85.25149	5.95647194	24.6800731	19.3795717	100.000000	100.00000
Public	20	0.06923709	0.10172940	4.2579681	1.1508951	0.00000000	4.6998983	0.56265985	0.03662109	0.34469552	16.361488	78.575447	0.2471032	3.6693637	33.5559831	5.969450	0.11231804	49.35042	2.59409998	1.6700013	2.83126455	46.148318	97.16874	0.4337866	15.5047817	91.4417887	19.715698	0.27541312	81.80744	4.10615924	4.2027194	21.4245532	94.217425	99.65530
Hawaii
Private	2	0.41187575	15.80573194	5.5152395	5.8780842	2.19667067	16.5094340	9.61538462	2.36828557	2.14078374	54.384761	85.807448	0.6413515	21.0741287	5.7179767	9.494731	8.88135420	22.78192	13.27688743	9.9649846	8.16666783	59.086430	91.83333	0.8708273	26.3425254	5.9207139	13.111378	15.56603774	29.05440	16.93839025	17.5616836	14.1925519	63.788099	97.85922
Public	1	0.50968400	20.54026504	1.1722732	11.6462793	9.96432212	22.8338430	28.77166157	0.33129460	4.23037717	72.604485	95.769623	0.5096840	20.5402650	1.1722732	11.646279	9.96432212	22.83384	28.77166157	0.3312946	4.23037717	72.604485	95.76962	0.5096840	20.5402650	1.1722732	11.646279	9.96432212	22.83384	28.77166157	0.3312946	4.2303772	72.604485	95.76962
Idaho
Private	3	0.52447552	0.57142857	1.1428571	5.1428571	0.00000000	63.3741259	1.71428571	3.42857143	0.57142857	10.285714	93.094406	0.9537534	1.5230550	1.5703699	9.407099	0.27908026	75.69288	1.92287096	5.3730200	3.27787530	15.656228	96.72212	1.7142857	2.7972028	1.9230769	15.297203	0.43706294	85.71429	2.09790210	6.6433566	6.9055944	23.076923	99.42857
Public	4	0.59387232	0.90613383	0.7667286	4.9953532	0.18800205	71.7849430	2.31588354	2.67193309	2.60223048	11.640335	89.887557	1.2274179	1.3941204	1.1870040	7.730865	0.25449549	76.66083	2.75868094	3.1647368	5.62185138	14.552583	94.37815	2.3001859	1.9615783	1.6061547	9.668421	0.31943132	83.08550	3.21312596	3.6786060	10.1124432	17.285284	97.39777
Illinois
Private	46	0.00000000	0.00000000	0.0000000	0.0000000	0.00000000	2.5225225	0.00000000	0.00000000	0.00000000	0.000000	51.671309	0.2809328	4.9724991	10.1155932	10.165297	0.19748833	61.80488	2.47353844	5.0380218	4.95174849	28.205349	95.04825	1.1702128	15.3543307	59.4594595	32.447113	0.84772370	100.00000	7.98747063	22.3085461	48.3286908	95.135135	100.00000
Public	9	0.08639787	0.89756535	4.8427116	4.9814877	0.03838035	5.2005373	0.00000000	0.49963619	2.04293629	19.980597	77.933097	0.1691045	6.0218893	20.4775320	12.008570	0.12310473	49.25631	1.89414275	3.3167118	6.73263146	40.694344	93.26737	0.2596953	18.6956988	68.7967761	31.727494	0.25304136	77.45816	2.76066178	9.6334677	22.0669030	78.257532	97.95706
Indiana
Private	19	0.00000000	0.00000000	2.3869347	1.4925373	0.00000000	3.4825871	0.41493776	0.00000000	0.00000000	5.800575	86.287996	0.2926445	2.0916343	12.6468480	4.713684	0.06698609	69.18740	2.45706312	2.6448209	5.89892250	22.268860	94.10108	0.7462687	4.7048198	92.2885572	12.336892	0.32618826	87.77565	6.50112867	7.4788681	13.7120036	96.268657	100.00000
Public	4	0.12587751	1.02777170	3.4331948	2.1999146	0.04373497	66.3202609	1.91157625	1.74939864	0.69975946	9.490488	92.429644	0.2103463	1.1449611	7.8016160	2.803577	0.07384205	79.68787	2.21488553	2.4348506	3.62804966	14.249228	96.37195	0.2730790	1.3578093	16.6805735	3.350278	0.11619463	88.06035	2.83698703	4.1103849	7.5703558	24.205416	99.30024
Iowa
Private	25	0.00000000	0.49592632	1.0272759	0.3478261	0.00000000	3.1250000	0.00000000	0.00000000	0.00000000	5.043478	31.568088	0.3556412	2.8814917	4.6048481	4.795584	0.44292186	72.39513	1.80619894	4.5756429	8.14253662	14.886686	91.85746	1.0273973	28.1250000	16.1920715	12.891344	7.05962988	92.46575	4.26758939	20.9302326	68.4319120	44.791667	100.00000
Public	3	0.16683350	0.90543260	2.5874837	3.1606304	0.05868545	67.1271271	1.76056338	2.25519785	5.24815560	8.936955	87.610944	0.2029154	2.3862456	2.7585718	4.382269	0.07848785	74.27715	1.93805610	4.2445687	9.73173972	11.746545	90.26826	0.2323218	3.5235235	2.8461795	5.772439	0.09002468	83.55969	2.10210210	5.9859860	12.3890557	14.497831	94.75184
Kansas
Private	11	0.30349014	0.32154341	2.5723473	2.8938907	0.00000000	46.0085837	0.96463023	0.45523520	0.00000000	8.360129	95.176849	1.0486529	1.4372847	8.7345840	7.369121	0.31167230	69.91048	2.52445367	6.7549818	1.90877293	21.425768	98.09123	1.8261752	4.8820179	18.0257511	12.535211	0.60698027	79.04584	6.26992561	21.6309013	4.8231511	31.673820	100.00000
Public	8	0.37955261	0.00000000	0.0000000	0.0000000	0.00000000	0.0000000	0.00000000	0.00000000	0.00000000	13.547920	74.235081	13.0673817	1.9318196	3.7788280	5.377596	0.09570302	59.66256	3.02190640	3.6233675	9.44084053	27.273234	90.55916	100.0000000	6.2687563	5.5285095	8.596199	0.14536057	77.71092	4.67778868	16.9294853	25.7649186	100.000000	100.00000
Kentucky
Private	17	0.00000000	0.00000000	0.0000000	0.0000000	0.00000000	62.3072178	0.00000000	0.00000000	0.00000000	0.000000	92.288711	0.2648063	1.0410234	7.9457530	2.382887	0.09232473	76.81884	1.82047481	7.0383214	2.59557407	13.547269	97.40443	0.8522727	3.3165105	18.3903072	5.965909	0.26609899	96.15385	5.42874769	20.1060204	7.7112893	28.130783	100.00000
Public	8	0.15769213	0.37097358	3.7006877	1.4657980	0.05353797	29.0765172	1.28483532	0.27828023	1.26649077	7.111835	92.638529	0.2218325	1.3789247	12.8920323	2.420546	0.10569806	73.88816	2.15688352	3.0859078	3.85001475	19.175918	96.14999	0.3644798	3.4738648	53.1398417	3.770697	0.21108179	90.20992	3.63619498	11.7678100	7.3614705	57.889182	98.73351
Louisiana
Private	7	0.00000000	0.00000000	8.2202650	0.5000000	0.00000000	0.3333333	0.00000000	0.00000000	0.00000000	21.367928	89.605649	0.5354838	4.0767140	34.7178595	4.855675	0.08057357	46.91237	1.93524502	3.1221840	3.76389907	46.201551	96.23610	0.9693053	12.9032258	92.3333333	12.494226	0.17421603	70.27464	3.55411955	7.1593533	10.3943506	93.583333	100.00000
Public	11	0.14265335	0.22202487	12.8374165	0.6181645	0.00000000	2.2202487	0.00000000	0.07284205	0.00000000	17.318486	93.350661	0.7328287	1.7430960	30.9478606	3.422808	0.06242392	54.30882	2.35380103	3.4826453	2.94571532	39.262818	97.05428	1.8595041	7.1041802	90.7637655	9.096816	0.14253898	71.90970	5.22537447	12.1730861	6.6493394	93.472469	100.00000
Maine
Private	8	0.00000000	1.32969656	0.7772021	1.2754705	0.00000000	60.0974553	0.00000000	0.00000000	0.00000000	7.228094	84.715026	0.2935869	3.6471025	3.5751147	5.240684	0.05904883	70.18189	3.07155363	8.5203383	5.41067965	15.887091	94.58932	0.5181347	6.3157895	6.0195987	12.686981	0.23331778	87.75988	6.48199446	29.9352199	15.2849741	30.470914	100.00000
Public	8	0.28301887	0.35149385	0.5660377	0.7547170	0.00000000	63.3006782	0.00000000	2.54716981	0.00000000	2.452830	91.559910	1.3500610	0.8111373	1.8375999	1.622170	0.07849710	77.58514	1.75485137	12.1270052	2.83354204	7.454316	97.16646	2.7160494	1.9458946	3.2098765	2.962963	0.18867925	95.00000	2.46045694	21.0248681	8.4400904	10.740741	100.00000
Maryland
Private	9	0.00000000	0.00000000	0.0000000	0.0000000	0.00000000	47.6137002	0.00000000	0.00000000	0.60157335	0.000000	82.360097	0.2394598	5.2044198	11.5754389	5.121979	0.09840825	65.31147	2.14089430	4.9964023	5.31152498	24.380600	94.68848	0.7959479	13.6580573	29.3151319	7.876655	0.21707670	94.53125	5.12820513	18.7264151	17.6399027	39.819528	99.39843
Public	8	0.03191829	0.15959145	7.9023823	1.9470156	0.00000000	1.4363230	1.91509735	0.51961548	1.51074956	22.486926	89.435046	0.1963727	2.3802636	43.4784274	4.144900	0.09675264	39.51956	3.35984981	2.8240257	3.99984778	53.656566	96.00015	0.3876853	4.7072022	83.4240562	7.088902	0.18685768	73.03893	4.41603719	5.3578640	10.5649537	90.640913	98.48925
Massachusetts
Private	48	0.00000000	1.06382979	0.8571429	2.4316109	0.00000000	13.3484163	0.00000000	0.00000000	0.00000000	8.376609	61.381074	0.2022323	6.1958632	7.3737226	7.665594	0.07901937	54.86901	2.98601752	9.6945052	10.93403489	24.502449	89.06597	0.6787330	23.1597073	35.8123570	22.034495	0.49751244	91.48936	7.69230769	39.8933333	38.6189258	59.267735	100.00000
Public	9	0.10770888	1.10741971	2.2044088	2.8056112	0.00000000	60.5727924	0.66800267	1.80360721	0.39660057	9.485638	95.662027	0.3806267	2.7624481	5.6671475	7.166784	0.06016574	74.60185	2.21910697	5.6799776	1.46189842	18.256279	98.53810	1.2024048	6.3961814	8.3852691	10.531995	0.19093079	88.24315	3.79686758	16.5632458	4.3379735	23.146649	99.60340
Michigan
Private	29	0.00000000	0.00000000	0.0000000	0.0000000	0.00000000	28.1250000	0.00000000	0.00000000	0.00000000	0.000000	78.730252	0.4275021	2.9250017	11.8024390	3.860043	0.30408100	66.64480	2.07622499	6.9982826	4.96162253	21.395291	95.03838	3.1250000	23.0386052	65.6250000	13.399649	5.75022462	100.00000	4.79123888	22.9110512	21.2697484	71.875000	100.00000
Public	11	0.23213249	0.54009140	1.3382167	1.6481195	0.00000000	54.4056857	0.00000000	0.43038176	1.59799155	6.916467	84.603465	1.1481290	2.5091218	8.9676443	3.402197	0.07528080	71.69357	2.31556707	3.0733725	6.81511362	18.417939	93.18489	8.6414624	7.4588440	18.1197268	4.670904	0.12691276	83.19519	3.14271684	6.1654288	15.3965347	31.644789	98.40201
Minnesota
Private	13	0.04823927	0.46674446	0.7001167	0.9334889	0.00000000	50.4587156	0.00000000	0.00000000	0.00000000	3.383897	86.444766	0.6100444	4.2869347	3.2824122	4.028990	0.05300242	73.40600	2.73655982	7.8374794	3.75858135	14.997944	96.24142	3.5460993	8.9450656	8.1081081	6.319883	0.18832392	94.63244	6.37065637	31.8639517	13.5552340	23.869713	100.00000
Public	4	0.10363888	1.02286402	1.6646611	1.8852788	0.02005616	55.5236385	2.02291696	1.56609857	1.94544725	8.190692	96.672041	0.9301768	4.0683518	5.8917111	2.951201	0.05694801	77.77681	2.80483632	3.1548336	2.36512843	16.703225	97.63487	2.6073004	12.0526631	17.9413525	4.751646	0.10771993	86.81567	4.14123279	5.4352186	3.3279595	39.664871	98.05455
Mississippi
Private	8	0.00000000	0.00000000	10.2941176	0.0000000	0.00000000	0.4153686	0.00000000	0.00000000	0.00000000	13.602941	92.174960	0.6126403	1.3950833	43.1429884	1.786073	0.03380897	47.56644	0.45078958	2.3691127	2.64306054	47.421384	97.35694	2.3809524	4.1567696	98.5555556	3.500243	0.17015070	84.92647	1.48274186	8.8964511	7.8250401	99.111111	100.00000
Public	8	0.04500450	0.27480077	14.1225614	0.3297609	0.00000000	2.8029678	0.00000000	0.00000000	0.00000000	21.219393	96.126726	0.2495251	0.9840197	50.9659836	1.488892	0.10082563	41.83147	1.15531855	0.7814013	2.44256433	54.944564	97.55744	0.5164366	1.9064125	93.0475405	3.089508	0.30228085	75.02111	2.13628989	3.6003600	3.8732744	95.603188	100.00000
Missouri
Private	26	0.00000000	0.34465780	0.9913259	0.7877893	0.00000000	42.7336156	0.00000000	0.00000000	0.00000000	5.328377	87.315634	0.5843973	1.9178370	8.6849277	4.605814	0.19764224	72.59577	2.43146465	5.6032551	3.37889531	18.422083	96.62110	2.2304833	7.5268817	37.0743634	16.129032	0.94395280	94.29988	5.94795539	23.5625705	12.6843658	49.013060	100.00000
Public	9	0.07812500	0.15625000	3.3564815	1.0284539	0.00000000	7.6562500	0.10755357	1.85185185	0.70312500	10.835467	82.453704	0.6143217	1.3297326	14.4737154	2.462243	0.09693745	66.59804	1.97585449	4.6847848	7.76436645	20.952804	92.23563	2.9217887	2.6157407	82.6562500	3.830394	0.20569078	80.66581	2.80986432	13.5062411	17.5462963	87.031250	99.29688
Montana
Private	4	1.25000000	0.00000000	0.5820722	1.6298021	0.23282887	21.3038417	0.00000000	0.00000000	0.00000000	10.277778	95.332671	20.3219605	1.1499986	1.6457009	4.146048	0.36368964	63.52879	2.48790826	4.1854713	2.17043496	30.115306	97.82957	76.2514552	2.1486124	3.1777557	8.236347	0.49652433	78.68056	6.65342602	9.4444444	4.6673287	78.696158	100.00000
Public	1	1.67865707	0.95923261	1.1031175	1.6306954	0.00000000	77.5539568	0.04796163	6.37889688	10.64748201	5.419664	89.352518	1.6786571	0.9592326	1.1031175	1.630695	0.00000000	77.55396	0.04796163	6.3788969	10.64748201	5.419664	89.35252	1.6786571	0.9592326	1.1031175	1.630695	0.00000000	77.55396	0.04796163	6.3788969	10.6474820	5.419664	89.35252
Nebraska
Private	11	0.00000000	0.00000000	1.7366136	2.7900147	0.00000000	57.0907987	0.00000000	0.00000000	0.00000000	9.261939	95.227606	0.5116739	2.2486971	6.4570358	6.353040	0.28972707	76.04051	2.02700758	4.6043600	1.46794375	17.887182	98.53206	1.0660981	9.0092278	15.4411765	14.705882	0.73529412	90.73806	6.75477239	16.2060937	4.7723935	32.598039	100.00000
Public	4	0.27056277	0.54755043	1.6504329	3.3008658	0.02705628	75.6676558	1.16341991	1.81277056	0.00000000	11.147186	91.801948	0.6945519	1.8367904	3.8739236	5.259272	0.18187061	77.82262	2.05152180	5.6469099	2.63254251	13.897930	97.36746	1.1869436	4.7348485	7.2829132	6.599424	0.36267722	78.84199	2.73656446	8.4438040	8.1980519	17.326931	100.00000
Nevada
Private	1	2.98363811	1.82868142	3.9461020	3.1761309	0.96246391	69.2974013	0.28873917	15.20692974	2.30991338	13.185756	97.690087	2.9836381	1.8286814	3.9461020	3.176131	0.96246391	69.29740	0.28873917	15.2069297	2.30991338	13.185756	97.69009	2.9836381	1.8286814	3.9461020	3.176131	0.96246391	69.29740	0.28873917	15.2069297	2.3099134	13.185756	97.69009
Public	2	0.42194093	2.05026455	1.8849206	16.1375661	0.82671958	43.5443038	2.54629630	5.05952381	0.00000000	26.620370	99.867725	1.7982721	6.2290704	6.6386628	19.179894	1.24317695	55.86607	3.22814112	5.7505777	0.06613757	38.317218	99.93386	3.1746032	10.4078762	11.3924051	22.222222	1.65963432	68.18783	3.90998594	6.4416315	0.1322751	50.014065	100.00000
New Hampshire
Private	8	0.00000000	0.60975610	1.0752688	1.7582418	0.00000000	25.0635486	0.00000000	6.58607753	0.00000000	5.376344	84.693554	0.5773033	2.7186457	5.3122559	4.107853	0.18477460	62.27110	1.57831958	18.3472421	4.90250116	14.479152	95.09750	1.6195618	12.3054938	17.1321384	7.480315	0.98425197	79.77642	3.89012385	62.8853353	15.3064465	30.422356	100.00000
Public	5	0.11764706	0.86011770	0.9683276	2.1276596	0.00000000	70.1176471	1.25643666	8.67460157	0.04034698	6.609325	96.355097	0.2754972	1.7038658	1.4185501	2.876995	0.01976929	79.01674	1.66209426	11.7142714	1.31221523	7.956772	98.68778	0.4979629	2.3529412	2.0803296	3.647059	0.04526935	84.47261	2.00000000	20.0000000	3.6449031	9.647059	99.95965
New Jersey
Private	15	0.00000000	0.00000000	0.0000000	0.0000000	0.00000000	9.4142259	0.00000000	0.00000000	0.00000000	0.000000	69.044898	0.1996066	4.0991606	13.3710016	10.922956	0.15660179	54.53586	1.29638963	8.0912045	7.32722061	30.045716	92.67278	0.7268322	15.8506429	52.0958084	28.770442	0.56323605	100.00000	4.18410042	38.1794651	30.9551020	80.588822	100.00000
Public	9	0.05635920	3.90323331	5.3140097	8.4066062	0.00000000	27.5596468	1.16608362	1.66861144	0.35005834	24.270712	83.064062	0.2038684	7.0899629	11.8147674	17.752549	0.20353163	50.54686	2.06493237	7.2281380	3.09539482	39.129612	96.90461	0.6280530	17.8564719	19.9606686	32.239430	0.64200977	73.71062	3.18150035	13.3612468	16.9359384	61.750246	99.64994
New Mexico
Private	1	2.08126858	0.99108028	23.9841427	20.7135778	0.39643211	33.7958375	0.00000000	17.74033697	0.29732408	48.166501	99.702676	2.0812686	0.9910803	23.9841427	20.713578	0.39643211	33.79584	0.00000000	17.7403370	0.29732408	48.166501	99.70268	2.0812686	0.9910803	23.9841427	20.713578	0.39643211	33.79584	0.00000000	17.7403370	0.2973241	48.166501	99.70268
Public	7	2.53878702	0.81782290	0.2057613	7.4074074	0.09402915	11.5226337	0.47793084	0.80233406	0.14587892	35.825106	92.900799	14.6967194	1.6622964	2.8202712	38.631326	0.24253118	31.05496	2.17870431	5.5063989	3.20678809	60.231849	96.79321	75.1028807	3.2753915	4.5967287	71.043034	0.50761421	55.99436	3.86579139	21.3663199	7.0992008	85.995624	99.85412
New York
Private	83	0.00000000	0.00000000	0.0000000	0.0000000	0.00000000	0.8598452	0.00000000	0.00000000	0.00000000	0.000000	49.529781	0.2509328	4.0604171	6.5272589	7.937922	0.13055783	66.48749	1.81357899	5.9502243	6.84161446	20.720667	93.15839	1.6798419	14.8435280	59.2019544	82.287188	3.40768278	100.00000	5.70012392	40.6980803	50.4702194	97.162511	100.00000
Public	1	0.16386727	10.07783695	9.6169603	16.1102007	0.27652601	46.3334699	3.40024580	1.61818927	12.40270381	39.645637	87.597296	0.1638673	10.0778370	9.6169603	16.110201	0.27652601	46.33347	3.40024580	1.6181893	12.40270381	39.645637	87.59730	0.1638673	10.0778370	9.6169603	16.110201	0.27652601	46.33347	3.40024580	1.6181893	12.4027038	39.645637	87.59730
North Carolina
Private	35	0.00000000	0.00000000	3.4831461	0.2219756	0.00000000	0.3074558	0.00000000	0.00000000	0.00000000	10.744681	81.716700	0.6779701	1.8996102	28.5820150	3.581534	0.09036037	52.36635	1.93381175	7.6767985	3.19155315	36.765301	96.80845	1.9292605	14.3037336	95.4545455	10.286225	0.51361068	85.49107	4.01129944	30.1282051	18.2832997	98.113208	100.00000
Public	15	0.24409187	0.37493305	3.2249675	1.3390466	0.02601795	11.4739170	0.79674521	0.94394144	0.21424746	11.366914	89.046456	1.5548657	2.5315037	27.5982550	4.897334	0.10334125	54.65034	2.86331034	3.1809657	2.62008187	39.548610	97.37992	14.8189504	8.2718380	76.0114479	7.342683	0.24026085	85.69289	4.59290188	8.1949652	10.9535438	84.142058	99.78575
North Dakota
Private	3	1.54639175	0.00000000	2.7411890	1.7391304	0.17799929	78.6956522	0.00000000	0.00000000	0.86956522	10.323959	93.917526	2.5462036	0.7577387	4.8969958	3.225991	1.03199110	80.31317	1.81461245	2.5382209	2.87507758	14.273532	97.12492	4.3478261	1.1391954	7.8260870	4.948454	2.60869565	81.75258	3.91304348	7.5115700	6.0824742	20.434783	99.13043
Public	6	0.73913338	0.18501388	2.3077955	1.6613549	0.08137248	69.2082111	1.76307046	0.55504163	2.68270120	8.184716	87.947214	1.2560680	0.9218368	3.8843095	3.794283	0.24022673	78.61656	2.60341936	2.4711061	6.21219071	12.700144	93.78781	1.7576318	1.4918288	6.4754857	4.897361	0.46253469	83.88970	2.96022202	4.0889672	12.0527859	16.558742	97.31730
Ohio
Private	46	0.00000000	0.00000000	0.0000000	0.0000000	0.00000000	0.5167959	0.00000000	0.00000000	0.00000000	0.000000	72.000000	0.2670138	1.7891142	10.7333653	3.175560	0.07934518	71.33020	2.28516381	6.3545010	3.98573686	18.329563	96.01426	1.5625000	14.3255037	92.7648579	10.052678	0.47846890	100.00000	10.88888889	42.8888889	28.0000000	94.315245	100.00000
Public	9	0.14717459	0.05711022	5.0859430	0.5886508	0.00000000	1.4848658	0.85665334	0.68875430	0.45688178	8.264657	88.944306	0.2287374	1.8620747	19.7214203	2.594275	0.07086432	64.48507	2.30621284	2.9103394	5.82101015	26.783584	94.17899	0.5886508	3.2424320	95.3169617	4.316249	0.11281321	84.81281	3.45564660	5.3214033	11.0556941	97.087379	99.54312
Oklahoma
Private	10	1.93939394	0.00000000	3.9085861	0.0000000	0.00000000	45.5104551	0.00000000	0.00000000	0.00000000	17.343016	66.666667	5.3816774	2.0479018	12.3772645	5.324622	0.17304398	57.06790	3.95654188	3.0050619	10.66598638	29.261052	89.33401	16.6666667	8.2020202	27.6413690	7.842574	0.37202381	70.23749	8.03436079	9.7960356	33.3333333	42.857143	100.00000
Public	10	1.49073328	0.00000000	2.3325062	1.2489927	0.00000000	13.6583400	0.00000000	0.00000000	0.20145044	21.191136	91.152019	8.7512386	1.4789009	14.2124162	6.925812	0.17863313	54.59074	7.64041271	2.3912953	3.83055596	39.187414	96.16944	21.1070999	3.0285036	79.3311845	16.255778	0.65017157	67.74129	17.59305211	7.7100646	8.8479810	82.070911	99.79855
Oregon
Private	12	0.16895969	1.19521912	0.9633911	3.0222222	0.19267823	57.3076923	0.79908676	1.56028369	0.00000000	14.666667	90.035971	0.9628141	5.1739799	2.1958425	7.599922	0.61496163	67.77156	6.27319696	4.7615602	4.64616427	22.820717	95.35384	2.5531915	13.9285714	5.8130401	10.473458	1.53846154	79.57447	11.94605010	7.9623288	9.9640288	32.967033	100.00000
Public	8	0.45438658	2.08263352	1.2233485	5.2778749	0.24466970	53.8461538	0.26450653	2.19538513	1.00938967	15.630988	86.989542	1.2139270	5.2652446	2.1188789	7.632290	0.75706647	66.50313	3.88804520	6.8906769	5.73074021	20.875452	94.26926	2.2721051	9.9615519	3.1990179	9.618718	2.41362209	76.01971	5.72249533	25.3611018	13.0104582	26.819757	98.99061
Pennsylvania
Private	66	0.00000000	0.00000000	0.0000000	0.0000000	0.00000000	17.7130045	0.00000000	0.00000000	0.00000000	0.000000	39.798206	0.1989786	3.3304171	9.5445353	5.173646	0.08829435	67.25971	2.20629738	6.5500583	5.64806678	20.542169	94.35193	0.8011146	16.4534838	67.3758865	13.359274	0.63357973	100.00000	7.78210117	36.2745098	60.2017937	77.905074	100.00000
Public	16	0.06337136	0.09784736	3.2722746	1.1153712	0.05885815	1.1741683	1.59883721	0.39786149	0.19569472	7.284768	92.930506	0.1592404	1.9026603	12.6461132	4.477350	0.09547283	73.67511	2.32262608	3.0026774	1.71875386	21.603463	98.28125	0.3270349	10.1880752	85.5185910	10.265435	0.16272510	89.19484	3.71819961	8.0070392	7.0694945	93.248532	99.80431
Rhode Island
Private	7	0.12249898	1.12508273	2.0416497	4.8730549	0.06142506	31.7680686	0.08190008	2.97515887	1.89850310	10.462735	70.559412	0.4110147	5.4439026	3.6833485	7.316944	0.10987466	62.76910	1.90956765	8.7936652	9.56258072	18.874652	90.43742	0.8599509	15.7615353	5.9906328	9.737501	0.19854401	77.41010	4.09541444	12.9811630	29.4405880	32.218713	98.10150
Public	2	0.26552411	2.88161092	4.8458150	7.7726148	0.03620783	65.0156232	1.66647379	9.03830575	0.17359102	18.520307	96.632672	0.3179258	3.0520540	6.1203962	10.557989	0.07596759	66.89318	2.02206074	9.1899633	1.77045974	22.146393	98.22954	0.3703275	3.2224971	7.3949774	13.343363	0.11572735	68.77074	2.37764770	9.3416209	3.3673285	25.772480	99.82641
South Carolina
Private	18	0.00000000	0.00000000	1.4157014	0.2136752	0.00000000	0.0000000	0.00000000	0.00000000	0.00000000	10.977034	94.208494	0.2769395	1.0605547	40.1867030	2.587914	0.06761614	46.07725	1.23779295	6.2452569	2.25997292	45.417521	97.74003	1.0718114	3.0825023	98.8543372	5.694981	0.38610039	79.97587	3.19662244	36.1411087	5.7915058	99.615385	100.00000
Public	9	0.12008406	0.66046232	6.1627854	0.5403783	0.03002101	3.7826479	0.03002101	0.33200531	0.15010507	13.034726	92.821522	0.3312356	1.8439523	27.3711694	2.943275	0.09573345	61.30750	2.15979787	1.8345840	2.11275703	34.745164	97.88724	0.9743875	5.8661146	93.9057340	5.707127	0.25055679	80.49057	4.05974338	4.8309179	7.1784783	95.286701	99.84989
South Dakota
Private	4	0.28188865	0.22988506	1.7241379	1.1980268	0.00000000	65.7064472	0.00000000	0.16181230	0.00000000	7.701149	98.390805	1.2928915	0.7700026	4.7755315	4.141070	0.13636877	83.44967	2.04877840	2.4818948	0.90379675	13.164643	99.09620	2.3462783	1.2345679	11.3854595	7.544582	0.40453074	90.34483	3.38266385	8.5048011	1.6091954	24.417010	100.00000
Public	8	0.82047916	0.00000000	0.0000000	0.0000000	0.00000000	3.7140855	0.00000000	0.00000000	0.00000000	7.703200	94.419198	24.4195678	0.9537036	1.6379058	2.286078	0.18111660	64.41244	1.83529354	1.2716429	3.00224907	31.313665	96.99775	95.3749124	2.0019691	4.7259600	3.931380	0.50690786	85.94435	3.43102216	4.0766318	5.5808020	95.935529	100.00000
Tennessee
Private	28	0.00000000	0.00000000	3.4697509	0.0000000	0.00000000	0.0000000	0.00000000	0.00000000	0.00000000	7.918149	90.524988	0.3252366	2.0583604	23.7164958	3.712725	0.09281991	60.51757	1.76878686	5.1718908	2.63611455	31.674425	97.36389	1.7441860	7.6146934	100.0000000	19.023622	0.59842520	87.18861	5.92105263	16.6103984	9.4750118	100.000000	100.00000
Public	6	0.14992504	1.00808685	3.6687539	1.2407223	0.00000000	20.3168273	2.20477996	0.37922215	0.42527940	9.577564	88.984919	0.2546569	1.8373055	24.2343711	3.131386	0.08429978	60.84336	3.09108378	1.2563444	5.26718887	32.633103	94.73281	0.3956087	2.9678522	63.6977955	5.686876	0.18791415	82.90841	5.08357235	2.2154090	11.0150807	69.159189	99.57472
Texas
Private	38	0.00000000	0.00000000	2.2988506	2.9197080	0.00000000	1.5544041	0.00000000	0.00000000	0.00000000	19.744842	76.468811	0.4585051	3.0085943	23.4126611	20.448949	0.17174096	42.95614	1.85604476	2.2780272	5.40934134	49.356496	94.59066	1.2195122	16.6440115	85.4858549	66.091954	1.48148148	72.54062	4.44078947	13.9376997	23.5311886	97.435897	100.00000
Public	28	0.06619010	0.38287504	0.6221869	2.7168110	0.00000000	2.0121790	0.00000000	0.21180831	0.52210233	28.105470	76.393159	0.3932651	6.1144762	15.6597864	27.711786	0.18342767	40.06689	2.41757050	1.6085496	5.84424550	52.480312	94.15575	0.8812950	19.9393808	84.0787757	92.983850	1.62723601	70.44774	7.67691717	6.0636788	23.6068413	94.638602	99.47790
Utah
Private	2	0.36740585	1.94528277	0.5150243	5.5274898	0.49808893	72.4615624	2.87092925	1.17438656	0.12625171	12.314657	96.181603	0.4950085	2.8707407	4.9634148	6.474334	0.54100090	77.57706	3.12323526	1.9828800	1.97232412	18.467735	98.02768	0.6226112	3.7961986	9.4118054	7.421179	0.58391287	82.69256	3.37554127	2.7913734	3.8183965	24.620813	99.87375
Public	2	0.36603221	1.22139169	1.2329506	8.6244645	0.37759112	49.7110272	1.63751252	3.86482627	1.55274717	14.198197	91.102650	0.3829209	3.1333041	1.2526803	8.993592	0.46009494	58.96308	2.38785034	19.2014275	5.22504882	16.610442	94.77495	0.3998096	5.0452166	1.2724100	9.362719	0.54259876	68.21514	3.13818816	34.5380288	8.8973505	19.022688	98.44725
Vermont
Private	9	0.19098549	0.65645514	0.0000000	0.4347826	0.00000000	57.9868709	1.08043217	0.49656226	0.00000000	6.521739	89.006623	0.6739817	2.2384566	3.8700412	5.081177	0.12708871	69.09190	3.01438427	12.2761315	3.62683738	15.005129	96.37316	1.7218543	6.7114094	12.9102845	9.409190	0.37174721	85.59969	5.66248257	26.9565217	10.9933775	25.601751	100.00000
Public	3	0.24891101	0.76923077	1.2912259	1.4915694	0.00000000	80.3438083	1.95804196	1.29701686	0.76923077	7.198444	95.698507	0.6237871	1.6515307	2.0697896	2.998046	0.03368132	84.49991	2.28911520	3.3656885	2.46845153	9.665949	97.53155	0.9090909	3.0180460	3.4265734	4.145924	0.06993007	89.16991	2.57467330	4.7552448	4.3014935	11.309894	99.23077
Virginia
Private	22	0.00000000	0.00000000	4.1134752	0.6172840	0.00000000	0.9259259	0.00000000	0.00000000	0.00000000	14.588859	90.043290	0.4647529	2.3696297	20.4559626	4.547600	0.12292265	62.79519	2.72368219	3.3897708	3.13049068	30.684550	96.86951	1.4351476	7.9628015	98.4567901	13.106655	0.99290780	82.84704	5.42986425	27.2505187	9.9567100	99.074074	100.00000
Public	12	0.14864541	0.45771144	4.1380964	1.8323408	0.00000000	2.9253731	0.00000000	0.05882353	0.00000000	13.650939	91.517782	0.2622088	4.7385796	22.3630494	4.938981	0.21013459	57.47387	3.16721640	4.3530848	2.49287401	35.680170	97.50713	0.3569710	14.8270035	85.2338308	10.139642	0.76470588	81.41176	5.29038986	8.8358209	8.4822181	89.547038	100.00000
Washington
Private	11	0.14154282	3.39110776	0.9345794	3.4941764	0.00000000	47.9994500	0.26497085	1.45081387	0.38924275	18.498368	90.884092	0.6174093	6.9774762	2.8470087	8.162984	0.47652631	66.16276	5.94665967	4.3613593	4.44781491	25.028064	95.55219	1.1658718	13.9007287	6.0521932	12.400636	2.10993892	76.41296	8.77565464	10.1746185	9.1159082	32.710023	99.61076
Public	6	0.43160691	2.15690922	1.4873838	6.5134870	0.18592297	46.9252412	4.48970490	1.03585657	0.45034368	23.552457	85.090657	0.9861372	6.7831796	3.1227578	9.580315	0.43934539	62.69559	6.25798570	5.6243575	4.51033049	27.169721	95.48967	2.5835506	19.4310468	5.3804219	13.171783	1.16111535	74.36919	7.72908367	9.5770828	14.9093426	34.637370	99.54966
West Virginia
Private	4	0.25396825	0.19854401	0.0000000	1.3245033	0.09474183	61.2979631	0.00000000	2.05162144	0.33112583	2.980132	93.699668	0.5367935	1.1090403	5.6047722	1.909421	0.27063041	76.93228	1.35697207	8.7008041	3.57928715	10.787629	96.42071	0.9000474	2.7001421	9.9005211	2.558029	0.66225166	88.07947	2.84579749	15.2534344	6.3003316	17.148271	99.66887
Public	9	0.00000000	0.16648169	4.2227935	0.4854369	0.00000000	48.3356449	0.00000000	0.44901457	0.33296337	7.386785	92.980291	0.3231175	0.9062822	7.3868638	1.935252	0.03845896	78.28907	1.56114182	6.7828578	2.77695728	12.151117	97.22304	0.6434051	1.8680377	12.8745838	4.282316	0.08568980	88.01039	2.97172237	39.8751734	7.0197087	15.316315	99.66704
Wisconsin
Private	20	0.00000000	0.17123288	0.2849003	1.4245014	0.00000000	56.1322729	0.00000000	0.46044370	0.00000000	3.988604	87.829181	0.5601772	2.6697187	6.1876671	5.939914	0.17927190	73.88952	2.26783759	5.0565536	3.24934109	17.804587	96.75066	2.0547945	7.9470199	22.0939386	16.785266	0.66225166	95.15670	3.90469887	21.0526316	12.1708185	41.816660	100.00000
Public	2	0.19810698	2.97160467	1.6606344	2.3419204	0.14903130	69.6015849	1.64998935	0.44023773	2.17917676	9.367681	96.263572	0.2480848	3.1198240	5.0565994	7.334288	0.17356914	77.85487	2.64097531	0.6139873	2.95780213	18.573341	97.04220	0.2980626	3.2680434	8.4525644	12.326656	0.19810698	86.10815	3.63196126	0.7877369	3.7364275	27.779001	97.82082
Wyoming
Public	1	0.56162246	1.33385335	0.9438378	5.8424337	0.25741030	74.1497660	2.37909516	7.80811232	6.72386895	11.318253	93.276131	0.5616225	1.3338534	0.9438378	5.842434	0.25741030	74.14977	2.37909516	7.8081123	6.72386895	11.318253	93.27613	0.5616225	1.3338534	0.9438378	5.842434	0.25741030	74.14977	2.37909516	7.8081123	6.7238690	11.318253	93.27613

I am particularly interested in the the White, Asian, black,Hispanic, american indian alaska native and native Hawaiian pacific islander race and ethnic groups and the following inferences can be made from the summary table:

Public colleges

->By assessing the mean % enrollment for the White, Asian, black,Hispanic, american indian alaska native and native Hawaiian pacific islander race and ethnic groups in the public colleges, it can be observed that the states in the Mid-west and North-Central region have more % white students enrolled in the colleges compared to the other race/ethnic groups. States such as California, New Mexico and Hawaii have the least % white students enrolled in public colleges.

->the highest Hispanic % enrollment was observed in New Mexico, Texas,California, Florida and Arizona which is quite expected and the least % Hispanic enrollment was seen in states like Mississipi , Maine, Montana etc

->the highest Asian % enrollment as expected was observed in Hawaii, California and New York, New Jersey and the least was seen in North Dakota, West Virginia and Maine

-> the highest american indian alaska native % enrollment was observed in South Dakota, New Mexico and Kansas and the least % was in Illinois. New York and Pennsylvania

->the highest native Hawaiian pacific islander % enrollment was noted in Hawaii, Nevada and Oregon and the least % was observed in New Hamphsire and Montana.

Private colleges

->By assessing the mean % enrollment for the White, Asian, black,Hispanic, american indian alaska native and native Hawaiian pacific islander race and ethnic groups in the private colleges, it can be observed that the states in the Mid-west and North-Central region such as South and North Dakota, Utah, West Virginia, Kentucky and ebraska have more % white students enrolled in the colleges compared to the other race/ethnic groups. States such as California, New Mexico,Hawaii,Delaware have the least % white students enrolled in private colleges.

->the highest Hispanic % enrollment was observed in New Mexico, Texas,California, Florida and Arizona which is quite expected and the least % Hispanic enrollment was seen in states like Alabama,West Virginia, Mississipi and Delaware etc

->the highest Asian % enrollment as expected was observed in Hawaii,California,Washington,Massachusetts and the least was seen in Alabama, South Dakota and North Dakota

-> the highest american indian alaska native % enrollment was observed in Arizona,Montana and Alaska and the least % was in New Jersey, Pennsylvania and Delaware.

->the highest native Hawaiian pacific islander % enrollment was noted in Hawaii, Arizona and North Dakota and the least % was observed in Minnesota, Mississipi and Delaware

Is the finding from summary what you expected

I expected a higher early career pay, higher stem % and higher making world better % for the private colleges and similar mid career pay for both the private and public college students. From the data, definitive conclusions couldn’t be made, it seems like both private and public colleges have similar earning potential and quality of education.

I also expected higher % men enrollment compared to % women enrollment, this would apply to both private and public schools.The inference made from the data about % enrollment based on gender was infact reverse of what I had expected. More % female enrollment was noticed in both private and public colleges.

I also expected certain race/ethnic groups choosing colleges based on location but the choice of private/public school would be influenced by economic factors as well. The inferences from the data aligns well with my expectation.

Data Visualization using ggplots

Research Question 1

1) Early career pay

Comparison of median early career pay between the different types of colleges across the different states

Side by side comparison

ggplot(data=merged1_stats,
       aes(x=median_early_career_pay,
           y=reorder(state_code,median_early_career_pay),fill=type))+
    geom_col(position="dodge")+
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))+
  scale_fill_brewer(palette="Pastel2")+
  labs(title=str_wrap("Median Early career pay(in USD) for the private vs public college across the different states",width=60),
       x="Median Early career pay(USD)",
       y="State Name")+
   guides(fill=guide_legend(title="College Type"))

From the above plot, it is evident that Massachusetts had the highest median early career, out of which the public college in MA had a slightly higher median early career pay than the private colleges in MA. This was followed by California having the second highest median early career pay with the private college having slightly higher median early career pay than the public colleges.

Missouri and Alaska seem to be have the lowest median early career pay. In around 10 states, it seems like the median early career pay is slightly higher for the public colleges compared to the private colleges and in 11 states the opposite trend is observed and in the rest of the states both the private and public colleges have similar median early career pay.

Only in 5 states(MA,AK,MT,DE,AZ), the differences seem to be more pronounced with public colleges having higher median career pay.

Note: NV and PA have early career pay data only for the private college and Wyoming has the data only for public college, so a comparison between private and public college couldn’t be made.

Separate panels

#create a manual color palette
pal <- colorRampPalette(c("red", "lightyellow","lightblue","lightgreen","pink","violet","orange","green","blue"))
pal<-pal(52)

merged1_final%>%
  group_by(state_code,type)%>%
   summarise(median_early_career_pay=median(early_career_pay,na.rm=TRUE))%>%
  ggplot()+
  labs(title=str_wrap("Comparison of median early career pay in USD between the private and public colleges for different states in the United States",width=60),
       x="Median Early career pay(in USD)",
       y="State Name")+
  scale_fill_manual(values=pal)+
  geom_bar(aes(x=median_early_career_pay,
 y=reorder_within(state_code,-median_early_career_pay,type),
    fill=state_code),stat="identity",show.legend=TRUE)+
  facet_wrap(~type,scales="free_y")+
  scale_y_reordered()+
 guides(fill=guide_legend(title="State Name"))

## `summarise()` has grouped output by 'state_code'. You can override using the
## `.groups` argument.

The interpretations of this plot are the same as the interpretations for the plot above. In this plot, the median early career pay for private and public colleges are plotted in different panels and thus it would be easier to interpret the changes in median early career pay within each type of college for the different states. The plot is more readable as the bars are arranged in a sequential order.

Assess the min, median and max of the early career pay between the different types of colleges across the different states

From the above bar plots, we were able to only compare the median early career pay i.e. only the center of the data for each type of college and for each state. It would be useful to understand the spread of the data as well for which boxplots can be used. Since we have 52 states and most color palettes have limited color options, I manually created a color palette by clubbing color options from inbuilt color palettes.

# create a manual palette using ColorBrewer Palettes
mycolors<- c(brewer.pal(name="Dark2", n = 8), brewer.pal(name="Paired", n = 6),
brewer.pal(name="Accent", n = 8),
brewer.pal(name="Set3", n = 12),
brewer.pal(name="Set2", n = 8),
brewer.pal(name="Set1", n = 9),
brewer.pal(name="Pastel1",n=1))

## Warning in brewer.pal(name = "Pastel1", n = 1): minimal value for n is 3, returning requested palette with 3 different levels

# boxplot to compare the median career pay for the different types of colleges at different states
ggplot(data=merged1_final,
       aes(x=early_career_pay,
           y=state_code,
           fill=state))+
    geom_boxplot()+
    facet_wrap(~type)+
  scale_fill_manual(values=mycolors)+
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))+
  labs(title=str_wrap("Median career pay(in USD) for the private vs public college across the different states",width=60),
       x="Median career pay(USD)",
       y="State Name")+
guides(fill=guide_legend(title="State Name"))

Amongst the private colleges, the early career pay was the highest in California and the least in Alabama and Mississipi. The interquartile range(difference between Q3 and Q1) seems largest for CA,ME and PA and the smallest for DE, KS, MT, MT and NV.

Amongst the public colleges, the early career pay was the highest in MA, the variability is very low, that is due to just one public college in MA present in the data. The least early career pay was noted for Mississipi. The largest IQR was observed for Oregon, Alaska and Virginia.

2) Mid career pay

Comparison of median mid career pay between the different types of colleges across the different states

Side by side comparison

ggplot(data=merged1_stats,
       aes(x=median_mid_career_pay,
y=reorder(state_code,median_mid_career_pay),
           fill=type))+
    geom_col(position="dodge")+
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))+
  scale_fill_brewer(palette="Pastel2")+
  labs(title=str_wrap("Median Mid career pay(in USD) for the private vs public college across the different states",width=60),
       x="Median Mid career pay(USD)",
       y="State Name")+
guides(fill=guide_legend(title="College Type"))

From the above plot, it is evident that California had the highest median mid career pay, out of which the private colleges in CA had a slightly higher median mid career pay than the public colleges in CA. This was followed by Massachusetts having the second highest median mid career pay with the public college having slightly higher median mid career pay than the private colleges.

Missouri and Alaska seem to be have the lowest median mid career pay. in around 18 states, it seems like the median mid career pay is slightly higher for the public colleges compared to the private colleges and in 14 states the opposite trend is observed and in the rest of the states both the private and public colleges have similar median early career pay.

Only in 5 states(AK,MT,IA,DE,AZ), the differences seem to be more pronounced with public colleges having higher median career pay than private colleges

Note: NV and PA have early career pay data only for the private college and Wyoming has the data only for public college, so a comparison between private and public college couldn’t be made.

Separate panels

merged1_final%>%
  group_by(state_code,type)%>%
   summarise(median_mid_career_pay=median(mid_career_pay,na.rm=TRUE))%>%
  ggplot()+
  labs(title=str_wrap("Comparison of median mid career pay in USD between the private and public colleges for different states in the United States",width=60),
       x="Median mid career pay(in USD)",
       y="State Name",
       col="State Name")+
  scale_fill_manual(values=pal)+
  geom_bar(aes(x=median_mid_career_pay,
 y=reorder_within(state_code,-median_mid_career_pay,type),
    fill=state_code),stat="identity",show.legend=TRUE)+
  facet_wrap(~type,scales="free_y")+
  scale_y_reordered()+
guides(fill=guide_legend(title="State Name"))

## `summarise()` has grouped output by 'state_code'. You can override using the
## `.groups` argument.

The interpretations of this plot are the same as the interpretations for the plot above. In this plot, the median mid career pay for private and public colleges are plotted in different panels and thus it would be easier to interpret the changes in median mid career pay within each type of college for the different states. The plot is more readable as the bars are arranged in a sequential order.

Assess the min, median and max of the mid career pay between the different types of colleges across the different states

# boxplot to compare the mid career pay for the different types of colleges at different states
ggplot(data=merged1_final,
       aes(x=mid_career_pay,
           y=state_code,
           fill=state))+
    geom_boxplot()+
    facet_wrap(~type)+
  scale_fill_manual(values=mycolors)+
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))+
  labs(title=str_wrap("Mid career pay(in USD) for the private vs public college across the different states",width=60),
       x="Mid career pay(USD)",
       y="State Name")+
  guides(fill=guide_legend(title="State Name"))

Amongst the private colleges, the mid career pay was the highest in California and the least in Colorado. The interquartile range(difference between Q3 and Q1) seems largest for CA,ME,PA,VA and VT and the smallest for DE, KS, MT, MT and NV.

Amongst the public colleges, the mid career pay was the highest in Colorado(this is the one of the two outlier points in the plot). The least mid career pay was noted for Mississipi. The largest IQR was observed for Oregon,Virginia and Vermont.

3) Make world a better percent

Comparison of median make world a better percent between the different types of colleges across the different states

Side by side comparison

ggplot(data=merged1_stats,
       aes(x=median_make_world_better_percent,
y=reorder(state_code,-median_make_world_better_percent),
           fill=type))+
    geom_col(position="dodge")+
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))+
  scale_fill_brewer(palette="Pastel2")+
  labs(title=str_wrap("Median Make world better(in percent) for private vs public college across the different states",width=60),
       x="Median make world better(in percent)",
       y="State Name")+
  guides(fill=guide_legend(title="College Type"))

Separate panels

merged1_final%>%
  group_by(state_code,type)%>%
   summarise(median_make_world_better_percent=median(make_world_better_percent,na.rm=TRUE))%>%
  ggplot()+
  labs(title=str_wrap("Comparison of median make world better in percent between the private and public colleges for different states in the United States",width=60),
       x="Median make world better (in percent)",
       y="State Name")+
  scale_fill_manual(values=pal)+
  geom_bar(aes(x=median_make_world_better_percent,
 y=reorder_within(state_code,-median_make_world_better_percent,type),
    fill=state_code),stat="identity",show.legend=TRUE)+
  facet_wrap(~type,scales="free_y")+
  scale_y_reordered()+
    guides(fill=guide_legend(title="State Name"))

## `summarise()` has grouped output by 'state_code'. You can override using the
## `.groups` argument.

From the above plots, the following inferences can be made:

-> Amongst the private colleges,Arizona has the highest median make world better % followed by Alaska and Nevada has the lowest median make world better %. CA and MA have pretty low median make world better % which is not quite expected.

->Comparing the public colleges across USA, Mississipi has the highest median make world better % followed by Nebraska. The least median make world better % is taken by Illionis and Iowa.The median make world better % seem to gone up a bit for CA and MA amongst the public colleges but it must be noted that we had only 1 public college for each in CA and MA and thus making the accuracy of the data questionable.

-> Public and Private colleges do not seem to have a noticeable difference in the making world better %.

Assess the min, median and max of the make_world_better_percent between the different types of colleges across the different states

# boxplot to compare the make world better percent for the different types of colleges at different states
ggplot(data=merged1_final,
       aes(x=make_world_better_percent,
           y=state_code,
           fill=state))+
    geom_boxplot()+
    facet_wrap(~type)+
  scale_fill_manual(values=mycolors)+
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))+
  labs(title=str_wrap("title=Make world better in percent for the private vs public college across the different states",width=60),
       x="Make world better (in percent)",
       y="State Name")+
guides(fill=guide_legend(title="State Name"))

## Warning: Removed 21 rows containing non-finite values (`stat_boxplot()`).

Amongst, the private colleges, Wisconsin has the highest make world better % and Texas has the lowest make world better %. The spread of the data is highest for Lousiana.

Comparing the public colleges across the different states in the US,Texas has the highest make world better % and Vermont and Florida has the lowest make world better %. The spread of the data is highest for Nebraska.

4) Stem percent

Comparison of median stem percent between the different types of colleges across the different states

Side by side comparison

ggplot(data=merged1_stats,
       aes(x=median_stem_percent,
    y=reorder(state_code,median_stem_percent),
           fill=type))+
    geom_col(position="dodge")+
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))+
  scale_fill_brewer(palette="Pastel2")+
  labs(title=str_wrap("Median Stem(in percent) for private vs public college across the different states",width=60),
       x="Median Stem(in percent)",
       y="State Name")+
  guides(fill=guide_legend(title="College Type"))

By assessing the above plot,

Amonsgt the public colleges,the median stem % is the highest for MA and MI and the least for ME. The difference in stem % between private and public colleges is prominent for MA,MI CA,IL,CO, MO,FL, AZ ,AK, LA and DE with stem % being higher for the public colleges compared to private colleges

Amonsgt the private colleges,PA and MA have the highest median stem % and NV has the lowest median stem % . For 6 states(MS,GA,CT,MN,AR, IN), the difference in stem % between the private and public colleges is prominent,

Separate panels

merged1_final%>%
  group_by(state_code,type)%>%
   summarise(median_stem_percent=median(stem_percent,na.rm=TRUE))%>%
  ggplot()+
  labs(title=str_wrap("Comparison of median stem in percent between the private and public colleges for different states in the United States",width=60),
       x="Median stem (in percent)",
       y="State Name")+
  scale_fill_manual(values=pal)+
  geom_bar(aes(x=median_stem_percent,
 y=reorder_within(state_code,-median_stem_percent,type),
    fill=state_code),stat="identity",show.legend=TRUE)+
  facet_wrap(~type,scales="free_y")+
  scale_y_reordered()+
  guides(fill=guide_legend(title="State Name"))

## `summarise()` has grouped output by 'state_code'. You can override using the
## `.groups` argument.

Assess the min, median and max of the make_world_better_percent between the different types of colleges across the different states

# boxplot to compare the stem percent for the different types of colleges at different states
ggplot(data=merged1_final,
       aes(x=stem_percent,
           y=state_code,
           fill=state))+
    geom_boxplot()+
    facet_wrap(~type)+
  scale_fill_manual(values=mycolors)+
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))+
  labs(title=str_wrap("Stem percent for the private vs public college across the different states",width=60),
       x="Stem(in percent)",
       y="State Name")+
   guides(fill=guide_legend(title="State Name"))

Comparing the private colleges in the different states, the highest median stem % was recorded for CA and IN. The lowest stem % was noted for AZ, NE. The variability in the stem % was high for PA and CA.

Comparing the public colleges, highest median stem % was recorded for Colorado and lowest for MD and TX.Variability was high in stem % of MO.

Research Question 2

The second question of interest was to study if gender and race/ethnicity has an impact on selecting private/public colleges, this is studief for the different states in USA.

1) Gender

ggplot(gender,
       aes(x=enrollment,
           y=reorder(state_code,enrollment),
           fill=gender))+
  geom_bar(position="dodge",stat="identity")+
  facet_wrap(~type)+
  labs(title=str_wrap("Comparison of % enrollment by different gender groups across private/public colleges in USA",width=60),
       x="Enrollment in %",
       y="State Name")+
    guides(fill=guide_legend(title="College Type"))

gender%>%
  group_by(state_code,type,gender)%>%
   summarise(average_gender_percent=mean(enrollment,na.rm=TRUE))%>%
  ggplot()+
  labs(title=str_wrap("Comparison of gender in percent between the private and public colleges for different states in the United States",width=60),
       x="Enrollment (in percent)",
       y="State Name")+
  scale_fill_manual(values=pal)+
  geom_bar(aes(x=average_gender_percent,
 y=reorder(state_code,-average_gender_percent),
    fill=state_code),stat="identity",show.legend=TRUE)+
  facet_wrap(~type+gender,scales="free_y")+
  scale_y_reordered()+
  guides(fill=guide_legend(title="State Name"))

## `summarise()` has grouped output by 'state_code', 'type'. You can override
## using the `.groups` argument.

The above two plots provide similar interpretation, they are just different ways to presenting the same data.

Amongst the public colleges, the highest % women enrollment was noted in NY,the least % women enrollment was noted in MT, the highest % men enrollment was seen in MT and the lowest in NY.

Out of the private colleges, the highest % women enrollment was noted in WI,the least % women enrollment was noted in NJ, the highest % men enrollment was seen in NJ and the lowest in WI.

# boxplot to compare the percent of men and women enrolled at  different types of colleges in different states
ggplot(data=gender,
       aes(x=enrollment,
           y=state_code,
           fill=type))+
    geom_boxplot()+
    facet_wrap(~gender+type)+
  scale_fill_manual(values=mycolors)+
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))+
  labs(title=str_wrap("Percent of men and women enrolled in private vs public college across the different states",width=60),
       x="Enrollment by Gender(in percent)",
       y="State Name")+
  guides(fill=guide_legend(title="College Type"))

Out of the private colleges, % enrollment of Men was highest in NJ and lowest in WI. The range of % men enrollment is the largest for New Jersey followed by New York and smallest for New Mexico and Nevada.

Out of the private colleges, % enrollment of women was highest in WI and lowest in New Jersey. The range of % women enrollment is the largest for New Jersey followed by New York and smallest for New Mexico and Nevada.

Comparing the public colleges, % enrollment of Men was highest in MT and lowest in NY. The range of % men enrollment is the largest for SD and smallest for MT.

Out of the public colleges, % enrollment of women was highest in NY and lowest in MT. The variability in the data for both NY and MT is very small and is largest for SD.

2) Race and Ethnicity

Since there are too many race and ethnic groups, I picked those race groups that I am interested to explore further.

race.final<-race%>%
            filter(race=="american_indian_alaska_native"|
                   race=="white"|
                   race=="asian"|
                   race=="black"|
                   race=="hispanic")

Combined Race plot

ggplot(race.final,
       aes(x=enrollment,
           y=state_code,
           fill=race))+
  geom_bar(position="stack",stat="identity")+
  facet_wrap(~type)+
  labs(title=str_wrap("Comparison of different race-ethnic groups enrollment in private and public colleges across the different states in USA",width=60),
       x="% Enrollment",
       y="State Name")+
   guides(fill=guide_legend(title="Race"))

Amongst the private colleges,NY has the highest student enrollment, out of which almost 3/4th of the student population are whites .

% Asian population is most prominent in California. % Black student population is noticeable in states like NC,TX,TN,GA and FL. Hispanic student population more commonly seen in CA and TX.

American Indian Alaskan Native is a really small race/ethnic population group, small numbers are noticed in CA, NY and NC

Out of the public colleges,the enrollment rates are generally lower than the private colleges.The largest white student population is noticed in PA, Blacks are seen in GA. Small Asian population noticed in TX and it is interesting to see that CA has only a really tiny Asian population enrolled in the public colleges. Kansas has a small native american population.Hispanics most commonly seen in TX and then in GA.

Extra Analysis

Comparison of number of private and public colleges in different states in the USA

compare<-cost_final%>%
  filter(type=="Private"|type=="Public")%>%
  group_by(state_code,type)%>%
  summarise(n=n())

## `summarise()` has grouped output by 'state_code'. You can override using the
## `.groups` argument.

ggplot(compare,
       aes(x=n,
           y=state_code,
           fill=type))+
  geom_bar(stat="identity")+
  facet_wrap(~type)+
  labs(title=str_wrap("Comparison of number of private and public collegs",width=60),
       x="Number of colleges",
       y="State name")+
  guides(fill=guide_legend(title="Race"))

In this data, the largest number of the private colleges was present in New York, followed by California and the smallest number of private colleges were located in AK,NV and NM. The largest number of public colleges was located in PA followed by TX and the least number of public colleges were in DC,WY and VI.

Historical Tuition data

ggplot(data=hist_tuition,
       aes(x=as.factor(year),
           y=tuition_cost,
           group=tuition_type))+
  geom_point(aes(color=tuition_type))+
  geom_line(aes(color=tuition_type))+
  facet_grid(.~type)+
  theme_minimal()+
  theme(legend.position = "bottom")+
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))+
  labs(title=str_wrap("Line plot of historical tuition data for the years 1985 to 2017",width=60),
       x="Year",
       y="Tuition cost in USD")+
  guides(fill=guide_legend(title="Tuition type"))

From the above line plot, increasing trend in the tuition cost is observed for all the different tuition types and for the different type of colleges.

Especially for the private colleges, the increasing trend is more prominent. In the private colleges, the “4 yr current” and “all year current” rates have constantly gone up across the years whereas the “2 yr current”, “2 yr constant”,“4 yr constant” and “all year constant” tuition costs slightly went up and then dropped a bit. This trend is expected. Also, the baseline tuition fees i.e. tuition cost at 1985-86 year is highest for “4 year constant” and “all constant” , followed by 2 year constant and then by “all current”,“4 year current” and “2 year current”.

For the public colleges, the baseline tuition fees is same as observed for the private colleges. However, the increasing trend is not very dramatic as observed for the private colleges. The trend is gradual over the years and the rate at which the tuition cost increases for the different tuition types are fairly similar.

Final Summary (10 points)

Summarize your research question and findings below.

I had two research questions:one was to identify how the earning potential and quality of education are different for students attending private and public colleges across the different states in the USA. The second question was to assess if gender/ race/ethnicity has an influence on college selection. For example, would a student belonging to a particular gender/race group pick a public/private college or have preference in attending college in a particular state.

Overall,I was able to draw some inferences from the summary tables and the bar plots/boxplots.

Findings for Q1

To answer the first question of interest, I used two variables i.e. early_career_pay and mid_career_pay to evaluate the earning potential of a student. I used two more variablesmake_world_better_percent and stem_percent to assess if a attending particular type of college at a particular location influences the quality of education one would receive.

For early-career pay:

There are a few states which have higher median early career pay for public colleges compared to private colleges and vice-versa and in only 5 states (MA,AK,MT,DE,AZ), the differences seem to be more pronounced with public colleges having higher median career pay.

Also,data was not available for some of the important states like CA, MA which has a lot of public colleges and hence a true estimate of the median career pay is not available for some of the states and hence this makes it harder to conclude if there is a significant difference between the earning potential of students attending private and public colleges.

From the boxplots, we were able to assess the spread of the data and the variability in data for some of the states likes CA and ME(for private colleges) and VA,VT (for public colleges) was pretty high. Due to the variability being different for different states and presence of outliers, median seem to be a more reliable measure to assess the central tendency of the data.

Overall, with the given data, it is very hard to draw an inference that attending a particular type of college actually influences the earning potential of a student.

The largest difference in the median early career pay amongst the 5 states where the difference in median early career pay was noticeable is $10,000. I would think that just by a difference of $10,000,I would conclude that both the public and private colleges seem to have comparable median early career pay.

Also, while comparing the early career pay, It is important to account for the cost of living which includes house pricing,food/healthcare/transportation costs in each of the state before we make a comparison between the early career pay

I used websites like nerdwallet.com to understand the differences in living cost between cities. For example: an annual income of $100,000 in San Francisco equals to $67,550 in Portland,OR and equals to $81,389 in Boston,MA.

This comparison is only for a single person, it would dramatically change if there are more dependents/children and in that case, child care cost will also play a huge role in assessing the cost of living.

For mid-career pay:

My conclusions for comparing median mid career pay between private and public colleges are similar to that of the early career pay since there were some states where the student from public colleges had higher median mid career pay than the private colleges and vice-versa and only in 5 states(AK,MT,IA,DE,AZ), the differences seem to be more pronounced with public colleges having higher median career pay than private colleges. Infact, the trends between early career pay and mid career pay across the states is similar.

The largest difference in median mid career pay of `$15,000 was noted in Arizona. It is also interesting to see the public colleges having similar and better earning potential in some states.

Thus, with the given constraints of the data,I did not see a convincing difference between the earning potential of private and public colleges.

For Make world a better percent:

Public and private colleges didn’t seem to have an observable difference in make world a better % thereby implying the quality of education is comparable between the private and public colleges across the states.

It was interesting to see that students graduating from colleges/states that were having high earning potential didn’t seem to have high % for making the world better.

For Stem percent:

In some of the states, the private colleges seems to have higher stem % rating that the public colleges and the vice-versa was observed in some states. Prominent difference in stem % rating was observed in 13 states(LA,DE,AK,AZ,FL,HI,MO,CO,VA, IL,CA,MT,MA)with public colleges have higher ratings than the private colleges and and 6 states (VT,AR,MN,CT,GA,MS) with private colleges having higher stem ratings than the public colleges.

Although, this is not very appropriate to make to make inferences from the plot since some of the states had one or two colleges and making inferences from a limited data could potentially result in incorrect conclusions. For example CA, MA had only 1 public college and MT had 2 public and 2 private colleges, so we really can’t take the prominent difference as an accurate reflection of all the colleges in the respective states.

However, by assessing the differences between the private and public college stem ratings, public colleges seem to be performing well in terms of imparting stem knowledge across most states and on par with private colleges.

Findings for Q2

Gender

From the summary table and the data visualizations, it seems like the % women enrollment is slightly higher in both public and private colleges compared to % men enrollment. Public colleges in a couple states like NY and MT had higher % men enrollment than women but the rest of the states pretty much have comparable enrollment % or %women enrollment exceeding %men enrollment.

Race

Race did seem to have an influence in type of college selection which kind of aligns with literature. In some states, I was able to notice similarities in race % enrollment between private and public colleges and in some states, I couldn’t find that connection, this could be due to some states having very less public colleges in the data set and hence a reliable comparison can’t be made.

For example: from the barplots, it was observed that Hispanics enrolled in colleges in areas where they are densely populated like Texas, California, New Mexico etc. Similarly, Asians were noticed prominently in California,Texas and Whites enrollment is prominent in Central North and Mid West colleges. Blacks were more prominently seen in Texas and Tennesse.American Indian and Alaska native were present in small numbers in NC and CA,

Are your findings what you expected? Why or Why not?

I was expecting to see that early career pay being influenced by the type of college one would attend as I expected that the private college students may have better exposure to career fairs, networking opportunities and the brand name would help them to get a really good entry level job whereas students from public colleges would also get new grad jobs but may not have a comparable compensation like the private college grads. But form this data,I didn’t observe this relationship and it seemed like both the private and public colleges have similar early career pay.

For the mid career pay, my expectation aligns with the findings, I wasn’t expecting to see the a remarkable difference in the mid career pay between students from private and public colleges.The rationale being, once a student enters industry, promotions and pay rise would be based on performance and not based on the type of college one attended.

I expected to see the private colleges having better stem % and higher making the world better percent since they would have more funding, infrastructure to support stem activities and in general projects that would help student to have an impact. But I was surprised to see that some of the public colleges having higher stem % and making world better % compared to the private colleges and overall from the data, it seemed like the public and private college students had similar quality of education.

My expectation of % enrollment based on gender was actually the opposite, I was expecting to see more % men enrollment atleast in some of the states but findings from the data was opposite. This could also be because the data as such didn’t have % men enrollment. I subtracted the women enrollment from total enrollment to obtain % men enrollment and the trend observed could be due to this assumption.

My expectation of % enrollment based on race was pretty accurate. Based on literature and online sources, it seemed like certain race/ethnic groups prefer to attend colleges in certain parts of the country however the difference in % enrollment between the private and public colleges was not clear from the plots.

I would expect there shouldn’t be a significant difference in the %enrollment based on race/ethnicity between private and public colleges, economic factors might play a role here as private colleges are mostly much more expensive than public colleges.

Some limitations noted in the dataset

There were some limitations in the data set due to which some of the conclusions may not be very reliable

-> some states had very limited data from public/private college, thus the comparisons made for some of the states will not be accurate

->there were lot of missing data and I had to exclude them in order to analyze, this could potentially cause bias in interpretations

->only women enrollment was provided, I calculated the men enrollment by subtracting from the total enrollment, by doing so, I didn’t account for non-binary gender groups.

->the income levels in the income data was in the categorical format. I was hoping to study the impact of economic factors in college selection, having the income in continuous format would have helped the analysis.

BSTA504 Midterm

Ramya Ramesh

2/11/2023

Install required packages

Load required packages

Define Your Research Question (10 points)

Source of the data

What interests me about the data?

Define the research question

Research Question 1

Research Question 2

What is a specific question you want to find out about the data

Given your question, what is your expectation about the data?

Loading the Data (10 points)

Load data from TidyTuesday

Upload the file into the data directory

Read the data files

Notes on the data in each dataset

Tuition income data

Tuition cost data

Salary potential data

Historical tuition data

College Diversity data

Transforming the data (15 points)

Cleaning the tuition income data

1) Transforming the income levels,year and state to factor format

2) Filter out only the on-campus type observations

Cleaning the tuition cost data

1) Assess the NA observations

2) Convert state and type to a factor variable

3) Filter out only 4 year degree observations

Cleaning the salary potential data

1) Rename the column state_name to state, useful during the left merge operation and convert state to a factor variable

Cleaning the historical tuition data

1) Convert year,type and tuition_type to factor variables

Cleaning the diversity data

1) Handle the NA observations

2) Converting long to wide data

Merging Datasets

1) Merge salary potential and tution cost data

2) Merge diversity data with tuition cost data

Applying inclusion/exclusion criteria

Reshape the data back to long format to make ggplots

Gender

Race and Ethnicity

Are the values what you expected for the variables?Why or Why not?

Visualizing and Summarizing the Data (15 points)

Summarizing data

Reasearch question 1

Early career pay

Mid career pay

Make world better percent

Stem percent

Research Question 2

Gender

Race and Ethnicity

Is the finding from summary what you expected

Data Visualization using ggplots

Research Question 1

1) Early career pay

Comparison of median early career pay between the different types of colleges across the different states

Side by side comparison

Separate panels

Assess the min, median and max of the early career pay between the different types of colleges across the different states

2) Mid career pay

Comparison of median mid career pay between the different types of colleges across the different states

Side by side comparison

Separate panels

Assess the min, median and max of the mid career pay between the different types of colleges across the different states

3) Make world a better percent

Comparison of median make world a better percent between the different types of colleges across the different states

Side by side comparison

Separate panels

Assess the min, median and max of the make_world_better_percent between the different types of colleges across the different states

4) Stem percent

Comparison of median stem percent between the different types of colleges across the different states

Side by side comparison

Separate panels

Assess the min, median and max of the make_world_better_percent between the different types of colleges across the different states

Research Question 2