Install required packages

#install.packages("openxlsx")
#install.packages("devtools")
#install.packages("wesanderson")
#install.packages("RColorBrewer")
#install.packages("tidytext")

Load required packages

library(tidytuesdayR)
library(janitor)
## 
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library(tidyverse)
## ── Attaching packages
## ───────────────────────────────────────
## tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   1.0.1 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.5.0 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(ggplot2)
library(ggthemes)
library(readxl) # to read the excel files
library(openxlsx)
library(stringr)
library(wesanderson)
library(RColorBrewer)
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
library(tidytext)

Define Your Research Question (10 points)

Define your research question below. What about the data interests you? What is a specific question you want to find out about the data?

Given your question, what is your expectation about the data?

Source of the data

I have taken the College tuition, diversity, and pay from TidyTuesday (week of 10th March 2020).The data comes from different sources but the original source is the US Department of Education.

The easily available data is from tuitiontracker.org but since the dataset is in a very wide format, the data was converted to long format and only selected tables are available in the tidytuesday dataset.

What interests me about the data?

I am very interested in understanding how the type of college chosen in the undergraduate degree influences and shapes one’s career prospects and their quality of education.

In most cases, the cost of attending public college is significantly cheaper than that of private colleges. Students who have their degrees from private colleges especially the elite universities and ivy-league colleges spend way more than the students who attend public colleges.The highly ranked colleges have very low acceptance rate making the process of getting into top colleges incredibly difficult.

Although, attending a highly ranked private college can be prestigious and may have better infrastructure and career prospects, it is important to not choose a college purely based on the ranking but instead evaluate all the aspects before choosing the college that would be well aligned with the one’s goal and interests.

I am also interested in understanding if gender, race and ethnicity has an influence in college selection.

Being an immigrant and having attended graduate college in a private university in the US which had a very high tuition and having worked in the industry for a couple years, my thought process of choosing a college has drifted away from emphasizing on the ranking to considering all the aspects holistically, the most important criteria`s to evaluate is

->the curriculum and faculty:ensure that curriculum offered is diverse and in-depth that will enhance career opportunities

->to be able to choose a program that is closely aligned with the interests

->networking opportunities and a strong alumni network

->return of investment

-> safety and living expenses

-> open to different cultural/ racial/ ethnic and socio-economic backgrounds

Define the research question

Research Question 1

Is the Earning potential of student and the quality of education influenced by the type of college they attend?Does this relationship change across the different states in the USA?

Typically the private schools have better infrastructure, facility and funding opportunities, that I would like to study if attending private/public college affect the earning potential at entry level and at much later in the careers.

I would also like to study how a private/public college degree would contribute to aquiring STEM skills and to the betterment of society.

Research Question 2

Does factors like gender, tuition-cost, race, income level affect the selection of type of college(public vs private)?

There has always been systemic bias in accessibility to education. Women, students from lower socio-economic backgrounds and minor racial/ethnic groups typically do not have the financial and moral support than those who come from affluent background and from more priveleged racial groups.

I am also interested in performing some sort of exploratory analysis such as

  1. Number of private vs public colleges in the data set considered across the different states in the United States of America

  2. Analyze the change in tuition rates over the last 20 years for different type of colleges and for different type of programs in the United States

What is a specific question you want to find out about the data

Basically my intent is to compare if the students attending private and public college have different career prospects and quality of education and if so, how does the difference look.

Also, I am interested in understanding if gender or race/ethnicity influence in the decision of attending a particular type of school and if location play a role too in college selection.

Given your question, what is your expectation about the data?

My expectation of the data is the choice of college is definitely influenced by the socioeconomic status. Most likely, students from lower socioeconomic status would attend colleges that have a reasonable tuition fees and also provides scholarship that will aid them in paying off their tuitions. Students from minor racial/ ethnic groups or students who are first generational learners would probably be inclined to attend colleges that would be have a strong diversity and inclusion policy, lower tuition fees or provide need based scholarships. I would expect % women enrollment to be lesser than % men enrollment in most parts of the USA.

And I think the earning potential of the student may vary at the beginning of their careers i.e. students who attend prestigious private colleges may have a better start in their careers compared to students who attend public colleges but this difference may weaken as they advance in their careers.

The quality of education(making world a better place and stem %) maybe influenced partly by the college one attends but the most important aspect is one’s personal interest and critical thinking skills.

Loading the Data (10 points)

Load the data below and use dplyr::glimpse() or skimr::skim() on the data. You should upload the data file into the data directory.

Load data from TidyTuesday

# Load the tidytuesday data

devtools::install_github("thebioengineer/tidytuesdayR")
## Skipping install of 'tidytuesdayR' from a github remote, the SHA1 (51dfba6f) has not changed since last install.
##   Use `force = TRUE` to force installation
tuesdata <- tidytuesdayR::tt_load(2020, week = 11)
## --- Compiling #TidyTuesday Information for 2020-03-10 ----
## --- There are 5 files available ---
## --- Starting Download ---
## 
##  Downloading file 1 of 5: `diversity_school.csv`
##  Downloading file 2 of 5: `historical_tuition.csv`
##  Downloading file 3 of 5: `salary_potential.csv`
##  Downloading file 4 of 5: `tuition_cost.csv`
##  Downloading file 5 of 5: `tuition_income.csv`
## --- Download complete ---

Upload the file into the data directory

The data could be uploaded either in 2 ways, directly access from “tuesdata” into the environment or by writing the data into the folder.

# write the cost data into /data folder
write_excel_csv(tuesdata$tuition_cost,
                file=here::here("Data/cost.csv"))

# write the tuition income data into /data folder
write_excel_csv(tuesdata$tuition_income,
 file=here::here("Data/income.csv"))

# write the salary data into /data folder
write_excel_csv(tuesdata$salary_potential,
 file=here::here("Data/salary.csv"))

# write the historical tuition data into /data folder
write_excel_csv(tuesdata$historical_tuition,
 file=here::here("Data/historical_tuition.csv"))

# write the diversity data into /data folder
write_excel_csv(tuesdata$diversity_school,
     file=here::here("Data/diversity.csv"))
#alternative way to load data

cost<-tuesdata$tuition_cost
income<-tuesdata$tuition_income
salary<-tuesdata$salary_potential
hist_tuition<-tuesdata$historical_tuition
diversity<-tuesdata$diversity_school

Read the data files

This section would be not be required if we directly loaded the data into the environment.

# read the cost data set from /data folder
cost<-read_csv(here::here("Data/cost.csv"),
na="NA",skip=0)
## Rows: 2973 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): name, state, state_code, type, degree_length
## dbl (5): room_and_board, in_state_tuition, in_state_total, out_of_state_tuit...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# read the income data set from /data folder
income<-read_csv(here::here("Data/income.csv"),
na="NA",skip=0)
## Rows: 209012 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): name, state, campus, income_lvl
## dbl (3): total_price, year, net_cost
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# read the salary data set from /data folder
salary<-read_csv(here::here("Data/salary.csv"),
na="NA",skip=0)
## Rows: 935 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): name, state_name
## dbl (5): rank, early_career_pay, mid_career_pay, make_world_better_percent, ...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# read the historical tuition data set from /data folder
hist_tuition<-read_csv(here::here("Data/historical_tuition.csv"), na="NA",skip=0)
## Rows: 270 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): type, year, tuition_type
## dbl (1): tuition_cost
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# read the diversity data set from /data folder
diversity<-read_csv(here::here("Data/diversity.csv"),na="NA", skip=0)
## Rows: 50655 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): name, state, category
## dbl (2): total_enrollment, enrollment
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Notes on the data in each dataset

If there are any quirks that you have to deal with NA coded as something else, or it is multiple tables, please make some notes here about what you need to do before you start transforming the data in the next section.

Make sure your data types are correct!

Tuition income data

In the following code chunk,we read the tuition income data and use the glimpse and the skim command to get a sense of the data, total number of observations, number of observations and variables,the data type of each variable and the skim also provides mini histogram plots of the continuous variables which gives us an idea about the underlying distribution. Missingness in each variable is also captured by the skim command.

# tuition income data

# verify using glimpse and skim

glimpse(income) # 209,012 observations and 7 variables
## Rows: 209,012
## Columns: 7
## $ name        <chr> "Piedmont International University", "Piedmont Internation…
## $ state       <chr> "NC", "NC", "NC", "NC", "NC", "NC", "NC", "NC", "NC", "NC"…
## $ total_price <dbl> 20174, 20174, 20174, 20174, 20514, 20514, 20514, 20514, 20…
## $ year        <dbl> 2016, 2016, 2016, 2016, 2017, 2017, 2017, 2017, 2017, 2018…
## $ campus      <chr> "On Campus", "On Campus", "On Campus", "On Campus", "On Ca…
## $ net_cost    <dbl> 11475.00, 11451.00, 16229.00, 15592.00, 11668.39, 11643.99…
## $ income_lvl  <chr> "0 to 30,000", "30,001 to 48,000", "48_001 to 75,000", "75…
                # 4 character variables and 3 numeric variables
skimr::skim(income) # 209,012 observations and 7 variables
Data summary
Name income
Number of rows 209012
Number of columns 7
_______________________
Column type frequency:
character 4
numeric 3
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
name 0 1 6 75 0 3664 0
state 0 1 2 2 0 51 0
campus 0 1 9 10 0 2 0
income_lvl 0 1 11 17 0 5 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
total_price 0 1 30102.19 13824.70 4906 19186.00 26286.0 38831 114083.00 ▇▆▂▁▁
year 0 1 2014.38 2.46 2010 2012.00 2014.0 2017 2018.00 ▅▇▃▇▇
net_cost 0 1 16784.92 8887.16 -15101 10148.56 16207.1 22350 95674.84 ▁▇▁▁▁
                    # 4 character variables and 3 numeric variables

Using the glimpse command, we were able to know that this dataset had 209,012 observations and 7 variables-out of which 4 are character variables and 3 are numeric variables. The variables name,state, campus and income_level are character variables and the remaining variables: total_price, year and net_cost are in the numeric format.

The variables present in the data set are:

  1. name->college name
  2. state->state name
  3. total_price->total price in USD
  4. year->year
  5. campus->on/off campus
  6. net_cost->net cost-average actually paif after scholarship/award
  7. income_lvl->income bracket

I also wanted to plot the histograms separately to get understand the underlying distribution, assess the center/spread/skewness of the distribution.

# additional checks on the distribution of numeric variables

# total price
plot1<-ggplot(data=income,
       aes(x=total_price))+
  
  geom_histogram(fill="lightblue")+
  
  geom_vline(xintercept=mean(income$total_price,na.rm=TRUE),                      col = "orange",
             lwd = 2)+
  
  annotate("text", x=35000, y=15000, label="Mean of total price",angle=90)+
  
geom_vline(xintercept=median(income$total_price,na.rm=TRUE),                         col = "red",
                lwd = 2)+
  
  annotate("text", x=22000, y=15000, label="Median of total price",angle=90)+
  
  
  
  labs(x="Total price in USD",
       y="Count",
       title="Histogram of Total price(in USD)") 
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
# mean exceeds median-> heavily right skewed!!
# net_cost

plot2<-ggplot(data=income,
       aes(x=net_cost))+
  
  geom_histogram(fill="light blue")+
  
  geom_vline(xintercept=mean(income$net_cost,
                             na.rm=TRUE),                       col = "orange",
             lwd = 2)+
  
  annotate("text", x=20000, y=15000, label="Mean of net cost",angle=90)+
  
geom_vline(xintercept=median(income$net_cost,
                             na.rm=TRUE),          
                             col = "red",
                             lwd = 2)+
  
  annotate("text", x=12000, y=15000, label="Median of net cost",angle=90)+
  
  labs(x="Net cost in USD",
       y="Count",
       title="Histogram of Net cost(in USD)") 
# reasonably normal
# mean and median almost coincides!

grid.arrange(plot1,plot2,nrow=2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# check for missing values

sum(is.na(income)) # no missing values
## [1] 0

Some comments on the data

It can be noted that the variable state is in character format and needs to be converted to factors especially useful while making exploratory plots.

Also,the variable year is in the numeric format which will have to be converted to a factor format in order to study how the net_cost and total_price varies over the years.

Although the variable campus is in the categorical format, we are not very concerned about transforming it to the factor format as in the next section, I will be excluding the off-campus type observation since I am only interested in the on-campus observations.

Also,from the tiny histogram plots provided from the output of skim command, the variable total_price data seems to be heavily right skewed. I was particularly interested in the distribution of total_price, the mean exceeds the median and the data is heavily right skewed, typically log transformation is applied to right-skewed data to address non-normality but in this case since the primary interest is to visualize the data and not to perform apply modelling techniques, we will use the data as it is.

The distribution of net cost variable looks reasonably normal, the mean and median almost coincide.

No missing values observed in the tuition income data.

Tuition cost data

In the following code chunk, we use the glimpse and study commands to get an idea about the tuition cost data.

# tuition cost data

# verify using glimpse and skim

glimpse(cost)   # 2973 observations and 10 variables
## Rows: 2,973
## Columns: 10
## $ name                 <chr> "Aaniiih Nakoda College", "Abilene Christian Univ…
## $ state                <chr> "Montana", "Texas", "Georgia", "Minnesota", "Cali…
## $ state_code           <chr> "MT", "TX", "GA", "MN", "CA", "CO", "NY", "NY", "…
## $ type                 <chr> "Public", "Private", "Public", "For Profit", "For…
## $ degree_length        <chr> "2 Year", "4 Year", "2 Year", "2 Year", "4 Year",…
## $ room_and_board       <dbl> NA, 10350, 8474, NA, 16648, 8782, 16030, 11660, 1…
## $ in_state_tuition     <dbl> 2380, 34850, 4128, 17661, 27810, 9440, 38660, 537…
## $ in_state_total       <dbl> 2380, 45200, 12602, 17661, 44458, 18222, 54690, 1…
## $ out_of_state_tuition <dbl> 2380, 34850, 12550, 17661, 27810, 20456, 38660, 9…
## $ out_of_state_total   <dbl> 2380, 45200, 21024, 17661, 44458, 29238, 54690, 2…
                # 5 character variables and 5 numeric variables
skimr::skim(cost)  # 2973 observations and 10 variables
Data summary
Name cost
Number of rows 2973
Number of columns 10
_______________________
Column type frequency:
character 5
numeric 5
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
name 0 1.00 8 67 0 2938 0
state 52 0.98 4 14 0 50 0
state_code 0 1.00 2 2 0 55 0
type 0 1.00 5 10 0 4 0
degree_length 0 1.00 5 6 0 3 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
room_and_board 1094 0.63 10095.28 3288.55 30 7935 10000 12424.5 21300 ▁▅▇▃▁
in_state_tuition 0 1.00 16491.29 14773.84 480 4890 10099 27124.0 59985 ▇▂▂▁▁
in_state_total 0 1.00 22871.73 18948.39 962 5802 17669 35960.0 75003 ▇▅▂▂▁
out_of_state_tuition 0 1.00 20532.73 13255.65 480 9552 17486 29208.0 59985 ▇▆▅▂▁
out_of_state_total 0 1.00 26913.16 17719.73 1376 11196 23214 39054.0 75003 ▇▅▅▂▁
                   # 5 character variables and 5 numeric variables

Based on the output of glimpse and skim command,2973 observations and 10 variables-out of which 5 are character variables and 5 are numeric variables. The variables name,state,state_code,type and degree_length are character variables and the remaining variables: room_and_board, in_state_tuition,in_state_total,out_of_state_tuition, out_of_state_total are in the numeric format.

The variables present in the data set are:

  1. name->college name
  2. state->state name
  3. state_code->state abbreviation
  4. type->public,private, for profit
  5. degree_length->4 year or 2 year degree
  6. room_and_board-> Room and board in USD
  7. in_state_tuition->Tuition for in-state residents in USD
  8. in_state_total->Total cost for in-state residents in USD (sum of room & board + in state tuition) 9)out_of_state_tuition->Tuition for out-of-state residents in USD 10)out_of_state_total->Total cost for out-of-state residents in USD (sum of room & board + out of state tuition)

The skim command not only provides the total # of observations, total # of variables and their data type’s but also provides a peek of the distribution of the continuous variables by their small histogram plots. I also tried plotting histograms using ggplots.

Numeric variables

# histogram of in-state tuition expenses
ggplot(data=cost,
       aes(x=in_state_tuition))+
  geom_histogram(fill="lightblue")+
  geom_vline(xintercept=mean(cost$in_state_tuition
                             ,na.rm=TRUE),        
             col = "orange",
             lwd = 2)+
annotate("text", x=18000, y=200, label="Mean",angle=90)+
geom_vline(xintercept=median(cost$in_state_tuition,
                             na.rm=TRUE),                                col = "red",
                      lwd = 2)+
annotate("text", x=8000, y=200, label="Median",angle=90)+
  labs(x="In state tuition expenses in USD",
       y="Count",
       title="Histogram of instate tuition expenses(in USD)") 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# histogram of out-state tuition expenses
ggplot(data=cost,
       aes(x=out_of_state_tuition))+
  geom_histogram(fill="lightblue")+
  geom_vline(xintercept=mean(cost$out_of_state_tuition,na.rm=TRUE),        
             col = "orange",
             lwd = 2)+
annotate("text", x=23000, y=100, label="Mean",angle=90)+
geom_vline(xintercept=median(cost$out_of_state_tuition,na.rm=TRUE),                                
           col = "red",
           lwd = 2)+
annotate("text", x=15000, y=100, label="Median",angle=90)+
  labs(x="Out of state tuition expenses in USD",
       y="Count",
       title="Histogram of out of state tuition expenses(in USD)") 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# histogram of in-state total expenses
ggplot(data=cost,
       aes(x=in_state_total))+
  geom_histogram(fill="lightblue")+
  geom_vline(xintercept=mean(cost$in_state_total
                             ,na.rm=TRUE),        
             col = "orange",
             lwd = 2)+
annotate("text", x=24000, y=150, label="Mean",angle=90)+
geom_vline(xintercept=median(cost$in_state_total,
                             na.rm=TRUE),                                col = "red",
                      lwd = 2)+
annotate("text", x=16000, y=150, label="Median",angle=90)+
  labs(x="In state total expenses in USD",
       y="Count",
       title="Histogram of instate total expenses(in USD)") 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# histogram of out-state total expenses

ggplot(data=cost,
       aes(x=out_of_state_total))+
  geom_histogram(fill="lightblue")+
  geom_vline(xintercept=mean(cost$out_of_state_total,na.rm=TRUE),        
             col = "orange",
             lwd = 2)+
annotate("text", x=29000, y=100, label="Mean",angle=90)+
geom_vline(xintercept=median(cost$out_of_state_total,na.rm=TRUE),                                
           col = "red",
           lwd = 2)+
annotate("text", x=20000, y=100, label="Median",angle=90)+
  labs(x="Out of state total expenses in USD",
       y="Count",
       title="Histogram of out of state total expenses(in USD)") 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# histogram of room and board expenses
ggplot(data=cost,
       aes(x=room_and_board))+
  geom_histogram(fill="lightblue")+
  geom_vline(xintercept=mean(cost$room_and_board
                             ,na.rm=TRUE),        
             col = "orange",
             lwd = 2)+
annotate("text", x=11000, y=50, label="Mean",angle=90)+
geom_vline(xintercept=median(cost$room_and_board,
                             na.rm=TRUE),                                col = "red",
                      lwd = 2)+
  annotate("text", x=9000, y=50, label="Median",angle=90)+
  labs(x="Room and Board expenses in USD",
       y="Count",
       title="Histogram of Room and board expenses(in USD)") 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1094 rows containing non-finite values (`stat_bin()`).

From the histogram plots,the distribution of the variable room_and_board is reasonably normal, where mean and median coincides.

The distribution of the variables: in_state_tuition,in_state_total, out_of_state_tuition and out_of_state_total is heavily right skewed with the mean exceeding the median values.

Character variables

There are 52 observations with missing values for the state variable and 1094 missing values for room and board, we assess the NA values for the state variable in the data transformation section.

We retain the missing values for room and board, since our question of interest is not related to room and board and those missing observations in room and board have complete observations for the other variables that I am interested in exploring further.

# filter out the state with NA's to understand why they are missing!

cost_na<-cost%>%filter(is.na(state))

# looks like the state names are missing however the state codes are present

# get the different state codes and counts of the states with missing names!

tabyl(cost_na$state_code)
##  cost_na$state_code  n    percent
##                  AS  1 0.01923077
##                  DC  8 0.15384615
##                  GU  1 0.01923077
##                  PR 41 0.78846154
##                  VI  1 0.01923077
# based on the state code's fill the name of the state using mutate command

cost<-cost%>%

mutate(state.upd=case_when(!is.na(state)~state, 
                                     is.na(state)&state_code=="AS"~"American Samoa",
                                     is.na(state)&state_code=="DC"~"District of Columbia",
                                     
is.na(state)&state_code=="GU"~"Guam",

is.na(state)&state_code=="PR"~"Peurto Rico",

is.na(state)&state_code=="VI"~"Virgin Islands",

TRUE~"NA"))

#verify
tabyl(cost$state.upd) # no NA's
##        cost$state.upd   n      percent
##               Alabama  54 0.0181634712
##                Alaska   6 0.0020181635
##        American Samoa   1 0.0003363606
##               Arizona  34 0.0114362597
##              Arkansas  46 0.0154725866
##            California 254 0.0854355869
##              Colorado  38 0.0127817020
##           Connecticut  36 0.0121089808
##              Delaware   9 0.0030272452
##  District of Columbia   8 0.0026908846
##               Florida  88 0.0295997309
##               Georgia  79 0.0265724857
##                  Guam   1 0.0003363606
##                Hawaii  14 0.0047090481
##                 Idaho  13 0.0043726875
##              Illinois 125 0.0420450723
##               Indiana  62 0.0208543559
##                  Iowa  52 0.0174907501
##                Kansas  52 0.0174907501
##              Kentucky  44 0.0147998655
##             Louisiana  34 0.0114362597
##                 Maine  27 0.0090817356
##              Maryland  45 0.0151362260
##         Massachusetts  93 0.0312815338
##              Michigan  78 0.0262361251
##             Minnesota  71 0.0238816011
##           Mississippi  32 0.0107635385
##              Missouri  73 0.0245543222
##               Montana  22 0.0073999327
##              Nebraska  33 0.0110998991
##                Nevada  10 0.0033636058
##         New Hampshire  21 0.0070635721
##            New Jersey  54 0.0181634712
##            New Mexico  24 0.0080726539
##              New York 221 0.0743356879
##        North Carolina 117 0.0393541877
##          North Dakota  18 0.0060544904
##                  Ohio 127 0.0427177935
##              Oklahoma  40 0.0134544231
##                Oregon  40 0.0134544231
##          Pennsylvania 160 0.0538176926
##           Peurto Rico  41 0.0137907837
##          Rhode Island  11 0.0036999664
##        South Carolina  57 0.0191725530
##          South Dakota  18 0.0060544904
##             Tennessee  62 0.0208543559
##                 Texas 150 0.0504540868
##                  Utah  14 0.0047090481
##               Vermont  19 0.0063908510
##        Virgin Islands   1 0.0003363606
##              Virginia  79 0.0265724857
##            Washington  60 0.0201816347
##         West Virginia  30 0.0100908174
##             Wisconsin  67 0.0225361588
##               Wyoming   8 0.0026908846
# use the state.upd column going forward but rename it to state

cost<-cost%>%
  select(-state)%>%
   mutate(state=state.upd)%>%
    select(-state.upd)

# rearrange the columns, state before state_code
cost<-cost%>%relocate("state",.after="name")

It was observed those 52 missing values of state was having complete observations for the state_code with which I was able to address the missing values in state.

# check the different levels in the variable type, degree length and state
unique(cost$type) # 4 types: public,private, not for profit and others
## [1] "Public"     "Private"    "For Profit" "Other"
unique(cost$degree_length) # 2 year, 4 year and others
## [1] "2 Year" "4 Year" "Other"
unique(cost$state) #55 unique tates
##  [1] "Montana"              "Texas"                "Georgia"             
##  [4] "Minnesota"            "California"           "Colorado"            
##  [7] "New York"             "Michigan"             "Virginia"            
## [10] "Florida"              "South Carolina"       "Alabama"             
## [13] "North Carolina"       "Alaska"               "Connecticut"         
## [16] "Pennsylvania"         "Mississippi"          "West Virginia"       
## [19] "Maryland"             "Ohio"                 "Iowa"                
## [22] "Kansas"               "Wisconsin"            "Illinois"            
## [25] "Tennessee"            "Arizona"              "Massachusetts"       
## [28] "American Samoa"       "District of Columbia" "Peurto Rico"         
## [31] "Indiana"              "Washington"           "Arkansas"            
## [34] "Kentucky"             "New Jersey"           "South Dakota"        
## [37] "Missouri"             "Maine"                "Louisiana"           
## [40] "Nebraska"             "Vermont"              "North Dakota"        
## [43] "Oregon"               "Idaho"                "Utah"                
## [46] "Hawaii"               "Rhode Island"         "Oklahoma"            
## [49] "Nevada"               "Wyoming"              "New Mexico"          
## [52] "New Hampshire"        "Delaware"             "Guam"                
## [55] "Virgin Islands"
# can also tabyl command to obtain the unique levels and their corresponding counts
tabyl(cost$type)
##   cost$type    n      percent
##  For Profit  107 0.0359905819
##       Other    1 0.0003363606
##     Private 1281 0.4308779011
##      Public 1584 0.5327951564
tabyl(cost$degree_length)
##  cost$degree_length    n      percent
##              2 Year 1120 0.3767238480
##              4 Year 1852 0.6229397915
##               Other    1 0.0003363606
tabyl(cost$state)
##            cost$state   n      percent
##               Alabama  54 0.0181634712
##                Alaska   6 0.0020181635
##        American Samoa   1 0.0003363606
##               Arizona  34 0.0114362597
##              Arkansas  46 0.0154725866
##            California 254 0.0854355869
##              Colorado  38 0.0127817020
##           Connecticut  36 0.0121089808
##              Delaware   9 0.0030272452
##  District of Columbia   8 0.0026908846
##               Florida  88 0.0295997309
##               Georgia  79 0.0265724857
##                  Guam   1 0.0003363606
##                Hawaii  14 0.0047090481
##                 Idaho  13 0.0043726875
##              Illinois 125 0.0420450723
##               Indiana  62 0.0208543559
##                  Iowa  52 0.0174907501
##                Kansas  52 0.0174907501
##              Kentucky  44 0.0147998655
##             Louisiana  34 0.0114362597
##                 Maine  27 0.0090817356
##              Maryland  45 0.0151362260
##         Massachusetts  93 0.0312815338
##              Michigan  78 0.0262361251
##             Minnesota  71 0.0238816011
##           Mississippi  32 0.0107635385
##              Missouri  73 0.0245543222
##               Montana  22 0.0073999327
##              Nebraska  33 0.0110998991
##                Nevada  10 0.0033636058
##         New Hampshire  21 0.0070635721
##            New Jersey  54 0.0181634712
##            New Mexico  24 0.0080726539
##              New York 221 0.0743356879
##        North Carolina 117 0.0393541877
##          North Dakota  18 0.0060544904
##                  Ohio 127 0.0427177935
##              Oklahoma  40 0.0134544231
##                Oregon  40 0.0134544231
##          Pennsylvania 160 0.0538176926
##           Peurto Rico  41 0.0137907837
##          Rhode Island  11 0.0036999664
##        South Carolina  57 0.0191725530
##          South Dakota  18 0.0060544904
##             Tennessee  62 0.0208543559
##                 Texas 150 0.0504540868
##                  Utah  14 0.0047090481
##               Vermont  19 0.0063908510
##        Virgin Islands   1 0.0003363606
##              Virginia  79 0.0265724857
##            Washington  60 0.0201816347
##         West Virginia  30 0.0100908174
##             Wisconsin  67 0.0225361588
##               Wyoming   8 0.0026908846

I will be converting the character variables type, state, state_code to factor format in the next section. I am not too concerned about the converting the degree_length variable to factor format as I will be only retaining 4 year degree_length observations(in the next section) which is more pertianing to my research question of interest.

Salary potential data

# salary potential data

# verify using glimpse and skim

glimpse(salary)   # 935 observations and 7 variables
## Rows: 935
## Columns: 7
## $ rank                      <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1…
## $ name                      <chr> "Auburn University", "University of Alabama …
## $ state_name                <chr> "Alabama", "Alabama", "Alabama", "Alabama", …
## $ early_career_pay          <dbl> 54400, 57500, 52300, 54500, 48400, 46600, 49…
## $ mid_career_pay            <dbl> 104500, 103900, 97400, 93500, 90500, 89100, …
## $ make_world_better_percent <dbl> 51, 59, 50, 61, 52, 53, 48, 57, 56, 58, 60, …
## $ stem_percent              <dbl> 31, 45, 15, 30, 3, 12, 27, 17, 17, 20, 8, 7,…
                  # 2 character variables and 5 numeric variables
skimr::skim(salary) # 935 observations and 7 variables
Data summary
Name salary
Number of rows 935
Number of columns 7
_______________________
Column type frequency:
character 2
numeric 5
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
name 0 1 6 64 0 934 0
state_name 0 1 4 14 0 50 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
rank 0 1.00 11.48 7.07 1 5 11 17 25 ▇▇▆▆▅
early_career_pay 0 1.00 51359.68 8365.51 32500 46000 50000 55500 91200 ▂▇▂▁▁
mid_career_pay 0 1.00 92805.78 15856.48 60100 81650 89900 100650 158200 ▃▇▃▁▁
make_world_better_percent 33 0.96 53.88 9.19 33 48 52 58 94 ▂▇▃▁▁
stem_percent 0 1.00 16.96 15.40 0 7 13 23 100 ▇▃▁▁▁
                    # 2 character variables and 5 numeric variables

From the glimpse and skim command, we were able to observed that the salary potential data set had 935 observations and 7 variables out of which name and state_name variables are character variables and the rest of the variables:early_career_pay,mid_career_pay,make_world_better_percent and stem_percent. rank are numeric.

I also used ggplots to get a better idea about the distribution of the continuous variables.

# histogram of early career pay
ggplot(data=salary,
       aes(x=early_career_pay))+
  geom_histogram(fill="lightblue")+
  geom_vline(xintercept=mean(salary$early_career_pay
                             ,na.rm=TRUE),        
             col = "orange",
             lwd = 2)+
annotate("text", x=53000, y=30, label="Mean",angle=90)+
geom_vline(xintercept=median(salary$early_career_pay,
                             na.rm=TRUE),                                col = "red",
                lwd = 2)+
annotate("text", x=48000, y=30,label="Median",angle=90)+
  labs(x="Early career pay in USD",
       y="Count",
       title="Histogram of early career pay(in USD)")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# histogram of mid career pay
ggplot(data=salary,
       aes(x=mid_career_pay))+
  geom_histogram(fill="lightblue")+
  geom_vline(xintercept=mean(salary$mid_career_pay
                             ,na.rm=TRUE),        
             col = "orange",
             lwd = 2)+
annotate("text", x=95000, y=30, label="Mean",angle=90)+
geom_vline(xintercept=median(salary$mid_career_pay,
                             na.rm=TRUE),                                col = "red",
                lwd = 2)+
annotate("text", x=85000, y=30,label="Median",angle=90)+
  labs(x="Mid career pay in USD",
       y="Count",
       title="Histogram of mid career pay(in USD)")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# histogram of make world a better percent
ggplot(data=salary,
       aes(x=make_world_better_percent))+
  geom_histogram(fill="lightblue")+
  geom_vline(xintercept=mean(salary$make_world_better_percent
                             ,na.rm=TRUE),        
             col = "orange",
             lwd = 2)+
annotate("text", x=55, y=25, label="Mean",angle=90)+
geom_vline(xintercept=median(salary$make_world_better_percent,
                             na.rm=TRUE),                                col = "red",
                lwd = 2)+
annotate("text", x=50, y=25,label="Median",angle=90)+
  labs(x="Make world better in percent",
       title="Histogram of make world better")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 33 rows containing non-finite values (`stat_bin()`).

# histogram of stem
ggplot(data=salary,
       aes(x=stem_percent))+
  geom_histogram(fill="lightblue")+
  geom_vline(xintercept=mean(salary$stem_percent
                             ,na.rm=TRUE),        
             col = "orange",
             lwd = 2)+
annotate("text", x=20, y=50, label="Mean",angle=90)+
geom_vline(xintercept=median(salary$stem_percent,
                             na.rm=TRUE),                                col = "red",
                lwd = 2)+
annotate("text", x=10, y=50,label="Median",angle=90)+
  labs(x="Stem percent",
       title="Histogram of stem (in percent)")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# not-normal, right skew is observed!
shapiro.test(salary$early_career_pay)
## 
##  Shapiro-Wilk normality test
## 
## data:  salary$early_career_pay
## W = 0.94447, p-value < 2.2e-16
shapiro.test(salary$mid_career_pay)
## 
##  Shapiro-Wilk normality test
## 
## data:  salary$mid_career_pay
## W = 0.94412, p-value < 2.2e-16
shapiro.test(salary$make_world_better_percent)
## 
##  Shapiro-Wilk normality test
## 
## data:  salary$make_world_better_percent
## W = 0.93619, p-value < 2.2e-16
shapiro.test(salary$stem_percent)
## 
##  Shapiro-Wilk normality test
## 
## data:  salary$stem_percent
## W = 0.81815, p-value < 2.2e-16

From the histogram plots, the right skew is more prominently seen in the stem_percent, mid_career_pay, early_career_pay and in make_world_better_percent.

The variable state_name is in character format,I will convert it to factor format in the next section.

Historical tuition data

# hist tuition data

# verify using glimpse and skim

glimpse(hist_tuition)   # 270 observations and 4 variables
## Rows: 270
## Columns: 4
## $ type         <chr> "All Institutions", "All Institutions", "All Institutions…
## $ year         <chr> "1985-86", "1985-86", "1985-86", "1985-86", "1985-86", "1…
## $ tuition_type <chr> "All Constant", "4 Year Constant", "2 Year Constant", "Al…
## $ tuition_cost <dbl> 10893, 12274, 7508, 4885, 5504, 3367, 13822, 16224, 7421,…
                        # 3 character variables and 1 numeric variables
skimr::skim(hist_tuition) # 270 observations and 4 variables
Data summary
Name hist_tuition
Number of rows 270
Number of columns 4
_______________________
Column type frequency:
character 3
numeric 1
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
type 0 1 6 16 0 3 0
year 0 1 7 7 0 19 0
tuition_type 0 1 11 15 0 6 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
tuition_cost 0 1 17597.37 9201.45 2981 9796.5 16983 23574 41468 ▇▆▇▂▂
                          # 3 character variables and 1 numeric variables

In the historical tuition dataset, there are 270 observations with 4 variables out of which 1 is a numeric variable and the remaining three are character variables

The histogram plot of the continuous variable is plotted below:

# histogram of tuition cost
ggplot(data=hist_tuition,
       aes(x=tuition_cost))+
  geom_histogram(fill="lightblue")+
  geom_vline(xintercept=mean(hist_tuition$tuition_cost
                             ,na.rm=TRUE),        
             col = "orange",
             lwd = 2)+
annotate("text", x=19000, y=10, label="Mean",angle=90)+
geom_vline(xintercept=median(hist_tuition$tuition_cost,
                             na.rm=TRUE),                                col = "red",
                lwd = 2)+
annotate("text", x=16000, y=10,label="Median",angle=90)+
  labs(x="Tuition cost(in USD)",
       y="count",
       title="Histogram of tuition cost (in USD)")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# just to verify
shapiro.test(hist_tuition$tuition_cost) #p<0.05, not normal
## 
##  Shapiro-Wilk normality test
## 
## data:  hist_tuition$tuition_cost
## W = 0.94332, p-value = 1.071e-08

The variable tuition_cost seems is bi-modal and right skewed, this is verified by using shapiro test for normality. Since the primary focus of the project is to visualize the data and not to use modelling techniques, I will be using the data as it is without any transformation.

The variables type, tuition type and year are in character format, I will be converting them into factor format in the next section. Also, I will be including only the observations that have “4 year current” and “4 year constant” tuition_type since I am am interested in understanding how the public/private institutions have different tuition cost for the 4 year degree programs.

# for the character variables
tabyl(hist_tuition$tuition_type) # 6 types
##  hist_tuition$tuition_type  n   percent
##            2 Year Constant 45 0.1666667
##             2 Year Current 45 0.1666667
##            4 Year Constant 45 0.1666667
##             4 Year Current 45 0.1666667
##               All Constant 45 0.1666667
##                All Current 45 0.1666667
tabyl(hist_tuition$type) # 3 types: private,public and all institutions
##  hist_tuition$type   n   percent
##   All Institutions 114 0.4222222
##            Private  78 0.2888889
##             Public  78 0.2888889
tabyl(hist_tuition$year) # different years starting from 1985 to 1986, 1995-96, 2000 to 2017(in a year increments)
##  hist_tuition$year  n    percent
##            1985-86 18 0.06666667
##            1995-96 18 0.06666667
##            2000-01 18 0.06666667
##            2001-02 18 0.06666667
##            2002-03 18 0.06666667
##            2003-04 18 0.06666667
##            2004-05 18 0.06666667
##            2005-06 18 0.06666667
##            2006-07  6 0.02222222
##            2007-08  6 0.02222222
##            2008-09  6 0.02222222
##            2009-10 12 0.04444444
##            2010-11  6 0.02222222
##            2011-12  6 0.02222222
##            2012-13 12 0.04444444
##            2013-14 18 0.06666667
##            2014-15 18 0.06666667
##            2015-16 18 0.06666667
##            2016-17 18 0.06666667

There are 3 type of colleges: public, private and all institutions; historical tuition is collected for the years 1985-86, 1995-1996 and for the years 2000 to 2017 at each year increments. There are 6 levels in the variable tuition_type: 2 year current, 2 year constant, 4 year current , 4 year constant, all current and all constant.

College Diversity data

# hist tuition data

# verify using glimpse and skim

glimpse(diversity)   # 50,655 observations and 5 variables
## Rows: 50,655
## Columns: 5
## $ name             <chr> "University of Phoenix-Arizona", "University of Phoen…
## $ total_enrollment <dbl> 195059, 195059, 195059, 195059, 195059, 195059, 19505…
## $ state            <chr> "Arizona", "Arizona", "Arizona", "Arizona", "Arizona"…
## $ category         <chr> "Women", "American Indian / Alaska Native", "Asian", …
## $ enrollment       <dbl> 134722, 876, 1959, 31455, 13984, 1019, 58209, 19039, …
                     # 3 character variables and 2 numeric variables
skimr::skim(diversity) # 50,655 observations and 5 variables
Data summary
Name diversity
Number of rows 50655
Number of columns 5
_______________________
Column type frequency:
character 3
numeric 2
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
name 341 0.99 5 84 0 4574 0
state 341 0.99 4 14 0 50 0
category 0 1.00 5 34 0 11 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
total_enrollment 0 1 4386.69 8217.53 1 391 1391 4504 195059 ▇▁▁▁▁
enrollment 0 1 774.19 2545.01 0 6 59 379 134722 ▇▁▁▁▁
                     # 3 character variables and 2 numeric variables

Notes on the data

It can be noted that the variables category and state are character variables, which I will convert to factor format in the next section . It is observed that there are 341 missing values in the variables name and state which would be addressed in the data transformation section.

# for the categorical variables
diversity_cat_table<-tabyl(diversity$category) 
# gender, race and ethnicity variables are all lumped into one column, numbers of only women are available, numbers of men need to be calculated by subtracting from total enrollment, and the long format data needs to be converted to wide format data with one column each for each gender and each race/ethnic groups!

diversity_cat_table
##                  diversity$category    n    percent
##     American Indian / Alaska Native 4605 0.09090909
##                               Asian 4605 0.09090909
##                               Black 4605 0.09090909
##                            Hispanic 4605 0.09090909
##  Native Hawaiian / Pacific Islander 4605 0.09090909
##                Non-Resident Foreign 4605 0.09090909
##                      Total Minority 4605 0.09090909
##                   Two Or More Races 4605 0.09090909
##                             Unknown 4605 0.09090909
##                               White 4605 0.09090909
##                               Women 4605 0.09090909
# check the number of total observations
sum(diversity_cat_table$n) # 50655 observation, tallies with the number of rows in the dataset and no NA observations!
## [1] 50655

Although the histogram plots of the enrollment and total_enrollment imply that the data is right-skewed, since the enrollment and total_enrollment are lumped for the different gender, racial and ethnic groups, the histogram plots wouldn’t be very meaningful in this case.

Transforming the data (15 points)

If the data needs to be transformed in any way (values recoded, pivoted, etc), do it here. Examples include transforming a continuous variable into a categorical using case_when(), etc.

Bonus points (5 points) for datasets that require merging of tables, but only if you reason through whether you should use left_join, inner_join, or right_join on these tables. No credit will be provided if you don’t.

Show your transformed table here. Use tools such as glimpse(), skim() or head() to illustrate your point.

Are the values what you expected for the variables? Why or Why not?

Cleaning the tuition income data

1) Transforming the income levels,year and state to factor format

From the earlier section, it was noted that the income levels, year and state variables are all in the character format, it was required to be transformed to factor format in order to use them in visualization and creating summary tables.

# check the number and names of the different levels in the categorical variable

income_tab<-tabyl(income$income_lvl)# 5 categories

# verify if there are any missing values
sum(income_tab$n) #209012
## [1] 209012
nrow(income) #209012, no missing values
## [1] 209012
# convert income levels,year and state to factor format 
income<-income%>%
  mutate(income_level=case_when(income_lvl=="0 to 30,000"~1,
income_lvl=="30,001 to 48,000"~2,
income_lvl=="48,001 to 75,000"~3,
income_lvl=="75,001 to 110,000"~4,
TRUE~5), 
income_level=factor(income_level,
levels=c(1,2,3,4,5),
labels(c("Less than $30,000", 
         "Between $30,001 and $48,000",
         "Between $48,001 and $75,000",
         "Between $75,001 to $110,000",
         "Over $110,000"))),
year=as.factor(year),
state=as.factor(state))

# drop income_lvl
income<-income%>%
        select(-income_lvl)

#verify 
skimr::skim(income) # 3 factor variables,income_level,year and state is created 
Data summary
Name income
Number of rows 209012
Number of columns 7
_______________________
Column type frequency:
character 2
factor 3
numeric 2
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
name 0 1 6 75 0 3664 0
campus 0 1 9 10 0 2 0

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
state 0 1 FALSE 51 NY: 16152, CA: 14104, PA: 13436, TX: 11133
year 0 1 FALSE 9 201: 27845, 201: 26390, 201: 24574, 201: 24547
income_level 0 1 FALSE 4 5: 80256, 1: 44969, 2: 43384, 4: 40403

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
total_price 0 1 30102.19 13824.70 4906 19186.00 26286.0 38831 114083.00 ▇▆▂▁▁
net_cost 0 1 16784.92 8887.16 -15101 10148.56 16207.1 22350 95674.84 ▁▇▁▁▁

2) Filter out only the on-campus type observations

Since, I am interested only in the on-campus type programs to answer my research question of interest, I excluded the off campus type observations from the dataset.

nrow(income) #209012 observations
## [1] 209012
income_on<-income%>%
         filter(campus=="On Campus")

skimr::skim(income_on) #75391 observations
Data summary
Name income_on
Number of rows 75391
Number of columns 7
_______________________
Column type frequency:
character 2
factor 3
numeric 2
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
name 0 1 6 75 0 2043 0
campus 0 1 9 9 0 1 0

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
state 0 1 FALSE 51 NY: 6661, PA: 5133, CA: 4314, TX: 4156
year 0 1 FALSE 8 201: 9990, 201: 9800, 201: 9415, 201: 9375
income_level 0 1 FALSE 4 5: 29665, 1: 15483, 2: 15323, 4: 14920

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
total_price 0 1 33244.63 15145.87 4906 20714 30466.00 44278.00 96636.00 ▇▇▅▁▁
net_cost 0 1 18193.89 8559.27 -15101 12181 17577.91 23128.32 95674.84 ▁▇▁▁▁

The transformed income data set has 75391 observations(limited only to on-campus type observations ) and 7 variables and the variables state, year, income_lvl are converted to factor format.

Cleaning the tuition cost data

1) Assess the NA observations

It was noted that there were 52 missing observations for state variable and in the following code chunk, I have assessed the reason for missingness and manupulated data in order to address the missigness.

# filter out the state with NA's to understand why they are missing!

cost_na<-cost%>%filter(is.na(state))

# looks like the state names are missing however the state codes are present

# get the different state codes and counts of the states with missing names!

tabyl(cost_na$state_code)
## [1] cost_na$state_code n                  percent           
## <0 rows> (or 0-length row.names)
# based on the state code's fill the name of the state using mutate command

cost<-cost%>%

mutate(state.upd=case_when(!is.na(state)~state, 
                                     is.na(state)&state_code=="AS"~"American Samoa",
                                     is.na(state)&state_code=="DC"~"District of Columbia",
                                     
is.na(state)&state_code=="GU"~"Guam",

is.na(state)&state_code=="PR"~"Peurto Rico",

is.na(state)&state_code=="VI"~"Virgin Islands",

TRUE~"NA"))

#verify
tabyl(cost$state.upd) # no NA's
##        cost$state.upd   n      percent
##               Alabama  54 0.0181634712
##                Alaska   6 0.0020181635
##        American Samoa   1 0.0003363606
##               Arizona  34 0.0114362597
##              Arkansas  46 0.0154725866
##            California 254 0.0854355869
##              Colorado  38 0.0127817020
##           Connecticut  36 0.0121089808
##              Delaware   9 0.0030272452
##  District of Columbia   8 0.0026908846
##               Florida  88 0.0295997309
##               Georgia  79 0.0265724857
##                  Guam   1 0.0003363606
##                Hawaii  14 0.0047090481
##                 Idaho  13 0.0043726875
##              Illinois 125 0.0420450723
##               Indiana  62 0.0208543559
##                  Iowa  52 0.0174907501
##                Kansas  52 0.0174907501
##              Kentucky  44 0.0147998655
##             Louisiana  34 0.0114362597
##                 Maine  27 0.0090817356
##              Maryland  45 0.0151362260
##         Massachusetts  93 0.0312815338
##              Michigan  78 0.0262361251
##             Minnesota  71 0.0238816011
##           Mississippi  32 0.0107635385
##              Missouri  73 0.0245543222
##               Montana  22 0.0073999327
##              Nebraska  33 0.0110998991
##                Nevada  10 0.0033636058
##         New Hampshire  21 0.0070635721
##            New Jersey  54 0.0181634712
##            New Mexico  24 0.0080726539
##              New York 221 0.0743356879
##        North Carolina 117 0.0393541877
##          North Dakota  18 0.0060544904
##                  Ohio 127 0.0427177935
##              Oklahoma  40 0.0134544231
##                Oregon  40 0.0134544231
##          Pennsylvania 160 0.0538176926
##           Peurto Rico  41 0.0137907837
##          Rhode Island  11 0.0036999664
##        South Carolina  57 0.0191725530
##          South Dakota  18 0.0060544904
##             Tennessee  62 0.0208543559
##                 Texas 150 0.0504540868
##                  Utah  14 0.0047090481
##               Vermont  19 0.0063908510
##        Virgin Islands   1 0.0003363606
##              Virginia  79 0.0265724857
##            Washington  60 0.0201816347
##         West Virginia  30 0.0100908174
##             Wisconsin  67 0.0225361588
##               Wyoming   8 0.0026908846
# use the state.upd column going forward but rename it to state

cost<-cost%>%
  select(-state)%>%
   mutate(state=state.upd)%>%
    select(-state.upd)

# rearrange the columns, state before state_code
cost<-cost%>%relocate("state",.after="name")

# verify
skimr::skim(cost) # no missing values in state variable
Data summary
Name cost
Number of rows 2973
Number of columns 10
_______________________
Column type frequency:
character 5
numeric 5
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
name 0 1 8 67 0 2938 0
state 0 1 4 20 0 55 0
state_code 0 1 2 2 0 55 0
type 0 1 5 10 0 4 0
degree_length 0 1 5 6 0 3 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
room_and_board 1094 0.63 10095.28 3288.55 30 7935 10000 12424.5 21300 ▁▅▇▃▁
in_state_tuition 0 1.00 16491.29 14773.84 480 4890 10099 27124.0 59985 ▇▂▂▁▁
in_state_total 0 1.00 22871.73 18948.39 962 5802 17669 35960.0 75003 ▇▅▂▂▁
out_of_state_tuition 0 1.00 20532.73 13255.65 480 9552 17486 29208.0 59985 ▇▆▅▂▁
out_of_state_total 0 1.00 26913.16 17719.73 1376 11196 23214 39054.0 75003 ▇▅▅▂▁

It was observed those 52 missing values of state was having complete observations for the state_code with which I was able to address the missing values in state.

2) Convert state and type to a factor variable

The variablesstate and type are converted to factor formats.

cost<-cost%>%mutate(across(.cols=c(state,type),.fns=as.factor))

skimr::skim(cost) # 2 factor variables are created
Data summary
Name cost
Number of rows 2973
Number of columns 10
_______________________
Column type frequency:
character 3
factor 2
numeric 5
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
name 0 1 8 67 0 2938 0
state_code 0 1 2 2 0 55 0
degree_length 0 1 5 6 0 3 0

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
state 0 1 FALSE 55 Cal: 254, New: 221, Pen: 160, Tex: 150
type 0 1 FALSE 4 Pub: 1584, Pri: 1281, For: 107, Oth: 1

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
room_and_board 1094 0.63 10095.28 3288.55 30 7935 10000 12424.5 21300 ▁▅▇▃▁
in_state_tuition 0 1.00 16491.29 14773.84 480 4890 10099 27124.0 59985 ▇▂▂▁▁
in_state_total 0 1.00 22871.73 18948.39 962 5802 17669 35960.0 75003 ▇▅▂▂▁
out_of_state_tuition 0 1.00 20532.73 13255.65 480 9552 17486 29208.0 59985 ▇▆▅▂▁
out_of_state_total 0 1.00 26913.16 17719.73 1376 11196 23214 39054.0 75003 ▇▅▅▂▁

State and type variables are converted to factor formats.

3) Filter out only 4 year degree observations

Since I wanted to limit the scope of the data set, I am primarily interested in 4 years degree observations to answer my research question of interest.

cost_4yr<-cost%>%filter(degree_length=="4 Year")

nrow(cost_4yr) #1852 observations
## [1] 1852
# assess the NA observations in room and board
cost_room_na<-cost_4yr%>%filter(is.na(room_and_board))

nrow(cost_room_na) # 245 missing values for room and board but the in-state and out of state tuition values are still present
## [1] 245
#final cost dataset
cost_final<-cost_4yr

#verify
skimr::skim(cost_final)
Data summary
Name cost_final
Number of rows 1852
Number of columns 10
_______________________
Column type frequency:
character 3
factor 2
numeric 5
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
name 0 1 8 66 0 1829 0
state_code 0 1 2 2 0 53 0
degree_length 0 1 6 6 0 1 0

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
state 0 1 FALSE 53 New: 163, Cal: 135, Pen: 129, Tex: 82
type 0 1 FALSE 3 Pri: 1209, Pub: 597, For: 46, Oth: 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
room_and_board 245 0.87 10624.67 3070.93 950 8617.00 10400 12713.0 21300 ▁▅▇▃▁
in_state_tuition 0 1.00 23212.36 14664.22 480 9932.25 19960 33854.5 59985 ▇▅▆▃▂
in_state_total 0 1.00 32431.50 17533.60 1430 18199.00 28287 44845.5 75003 ▃▇▅▃▂
out_of_state_tuition 0 1.00 26910.22 12630.54 480 17683.75 25392 34863.5 59985 ▂▇▇▃▂
out_of_state_total 0 1.00 36129.35 15994.29 1430 24951.00 34888 46670.0 75003 ▂▆▇▅▂

The final cost data set has only 1852 observations limited to only observations with 4 yr degree_length programs.

Cleaning the salary potential data

1) Rename the column state_name to state, useful during the left merge operation and convert state to a factor variable

salary<-salary%>%
         rename(state=state_name)%>%
          mutate(state=as.factor(state))

#verify
skimr::skim(salary)
Data summary
Name salary
Number of rows 935
Number of columns 7
_______________________
Column type frequency:
character 1
factor 1
numeric 5
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
name 0 1 6 64 0 934 0

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
state 0 1 FALSE 50 Ala: 25, Cal: 25, Flo: 25, Geo: 25

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
rank 0 1.00 11.48 7.07 1 5 11 17 25 ▇▇▆▆▅
early_career_pay 0 1.00 51359.68 8365.51 32500 46000 50000 55500 91200 ▂▇▂▁▁
mid_career_pay 0 1.00 92805.78 15856.48 60100 81650 89900 100650 158200 ▃▇▃▁▁
make_world_better_percent 33 0.96 53.88 9.19 33 48 52 58 94 ▂▇▃▁▁
stem_percent 0 1.00 16.96 15.40 0 7 13 23 100 ▇▃▁▁▁

State_name column was renamed to State and it was converted to a factor format. We retain the NA observations of make_world_better_percent as it is since those observations have complete values for the other earning potential and quality of education variables.

Cleaning the historical tuition data

1) Convert year,type and tuition_type to factor variables

hist_tuition<-hist_tuition%>%
mutate(across(where(is.character),as.factor))

#verify
skimr::skim(hist_tuition)
Data summary
Name hist_tuition
Number of rows 270
Number of columns 4
_______________________
Column type frequency:
factor 3
numeric 1
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
type 0 1 FALSE 3 All: 114, Pri: 78, Pub: 78
year 0 1 FALSE 19 198: 18, 199: 18, 200: 18, 200: 18
tuition_type 0 1 FALSE 6 2 Y: 45, 2 Y: 45, 4 Y: 45, 4 Y: 45

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
tuition_cost 0 1 17597.37 9201.45 2981 9796.5 16983 23574 41468 ▇▆▇▂▂

It was noted that year, type and tuition_type were all in the character format, they were converted to the factor format which would help in data visualization and summarization.

Cleaning the diversity data

1) Handle the NA observations

It was observed that there were 341 missing values in the variables name and state, the next step would be to explore the reason behind the missingness, we could either remove the NA observations in the state and name or group the NA observations into a separate category.

diversity_na<-diversity%>%filter(is.na(name)|is.name(state))                    

 #341 observations
 # implies that the same observations had missing values for both the name and state variable

# remove the NA observations from the data set since there is no way to merge with the other data sets, both the university name and state values are missing!

total_na<-length(unique(diversity_na$total_enrollment))
total_na # 30 distinct colleges
## [1] 30

It can be observed that the 341 missing observations have missing values for the name and state, by assessing the distinct entries of the total enrollment variable, these missing observations are from 30 different colleges.

If I had to assign them to a category called “NA”, it may not be a great option in this case as we would lump the enrollment from the 30 colleges with missing observations. Thus, I will be removing the NA observations from the diversity data set.

diversity<-diversity%>%filter(!is.na(name)|!is.na(state))
#50314 observations, 341 observations are removed!

skimr::skim(diversity) # 50314 observations and 5 variables, 3 character and 2 numeric variables,no missing values noted
Data summary
Name diversity
Number of rows 50314
Number of columns 5
_______________________
Column type frequency:
character 3
numeric 2
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
name 0 1 5 84 0 4574 0
state 0 1 4 14 0 50 0
category 0 1 5 34 0 11 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
total_enrollment 0 1 4392.17 8227.65 1 395 1394 4504 195059 ▇▁▁▁▁
enrollment 0 1 775.26 2548.21 0 6 59 381 134722 ▇▁▁▁▁

Thus after excluding the NA observations in the diversity data, there were 50314 observations with 5 variables , 3 of them are character variables and the remaining 2 are numeric variables.

2) Converting long to wide data

In the diversity data, it was observed that the data was in long-format where the gender and race categories were all lumped in one column. My next step in data transformation is to create separate columns for the Women, Men and for each of the race-ethnic groups i.e. we would be converting the long format of data to the wide-format.

There was also no separate row for the Men enrollment in the original diversity data, so we will have to subtract the women enrollment from the total enrollment in order to obtain the men enrollment for each college, this is easier to implement when the data is in long format.

Also, when we are comparing enrollment in different types of college across different states in the USA, we will have to account for the fact that different states have different population/student population as well.

For example, states like Texas, California, Florida, New York would certainly have more number of students enrolling in colleges than states like Rhode Island, New Hampshire,Wyoming.

It wouldn’t be appropriate to just report the number of students from different gender/ different racial/ethnic groups attending public/ private institutions across the different states, instead it would be reasonable to report the proportion of enrollment for each of the gender and racial ethnic groups. This is achieved by creating new columns for proportions for enrollment per college for Male, Female, Asian, Black, White, American Indian/Alaskan native, Native Hawaiian/ Pacific islander, Two or more races,unknown, non-resident foreign,total minority and Hispanic groups. I will also be creating additional column for enrollment of resident students.

# convert the variables state in factor format
diversity<-diversity%>%
  mutate(state=as.factor(state))
skimr::skim(diversity)
Data summary
Name diversity
Number of rows 50314
Number of columns 5
_______________________
Column type frequency:
character 2
factor 1
numeric 2
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
name 0 1 5 84 0 4574 0
category 0 1 5 34 0 11 0

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
state 0 1 FALSE 50 Cal: 5027, New: 3344, Tex: 2915, Pen: 2739

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
total_enrollment 0 1 4392.17 8227.65 1 395 1394 4504 195059 ▇▁▁▁▁
enrollment 0 1 775.26 2548.21 0 6 59 381 134722 ▇▁▁▁▁
# convert long to wide data format
diversity_wide<-pivot_wider(diversity,names_from=category,values_from=enrollment)

#check the data type!
skimr::skim(diversity_wide) #4574 observations and 14 variables,12 are numeric and 1 is factor and 1 is character variable
Data summary
Name diversity_wide
Number of rows 4574
Number of columns 14
_______________________
Column type frequency:
character 1
factor 1
numeric 12
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
name 0 1 5 84 0 4574 0

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
state 0 1 FALSE 50 Cal: 457, New: 304, Tex: 265, Pen: 249

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
total_enrollment 0 1 4392.17 8228.47 1 395.25 1394.0 4504.00 195059 ▇▁▁▁▁
Women 0 1 2480.10 4708.32 0 220.25 815.0 2705.50 134722 ▇▁▁▁▁
American Indian / Alaska Native 0 1 31.71 111.37 0 1.00 6.0 23.00 2659 ▇▁▁▁▁
Asian 0 1 250.96 830.57 0 5.00 24.0 122.75 11257 ▇▁▁▁▁
Black 0 1 569.42 1390.84 0 37.00 135.0 460.75 31455 ▇▁▁▁▁
Hispanic 0 1 662.05 2028.28 0 19.00 90.0 365.75 44870 ▇▁▁▁▁
Native Hawaiian / Pacific Islander 0 1 11.96 39.95 0 0.00 2.0 8.00 1019 ▇▁▁▁▁
White 0 1 2303.23 4259.66 0 140.00 691.0 2528.00 61498 ▇▁▁▁▁
Two Or More Races 0 1 129.50 392.62 0 3.00 24.0 103.00 19039 ▇▁▁▁▁
Unknown 0 1 234.40 1207.42 0 6.00 43.0 196.00 65163 ▇▁▁▁▁
Non-Resident Foreign 0 1 198.94 704.63 0 0.00 9.0 84.00 11495 ▇▁▁▁▁
Total Minority 0 1 1655.60 3707.10 0 128.25 388.5 1335.00 68332 ▇▁▁▁▁
typeof(diversity_wide) # list 
## [1] "list"
# add a column for the enrollment for Men using mutate command
diversity_wide<-diversity_wide%>%mutate(Men=(total_enrollment-Women),Resident=((total_enrollment)-(diversity_wide$`Non-Resident Foreign`)))

# verify!
skimr::skim(diversity_wide) #14 columns
Data summary
Name diversity_wide
Number of rows 4574
Number of columns 16
_______________________
Column type frequency:
character 1
factor 1
numeric 14
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
name 0 1 5 84 0 4574 0

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
state 0 1 FALSE 50 Cal: 457, New: 304, Tex: 265, Pen: 249

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
total_enrollment 0 1 4392.17 8228.47 1 395.25 1394.0 4504.00 195059 ▇▁▁▁▁
Women 0 1 2480.10 4708.32 0 220.25 815.0 2705.50 134722 ▇▁▁▁▁
American Indian / Alaska Native 0 1 31.71 111.37 0 1.00 6.0 23.00 2659 ▇▁▁▁▁
Asian 0 1 250.96 830.57 0 5.00 24.0 122.75 11257 ▇▁▁▁▁
Black 0 1 569.42 1390.84 0 37.00 135.0 460.75 31455 ▇▁▁▁▁
Hispanic 0 1 662.05 2028.28 0 19.00 90.0 365.75 44870 ▇▁▁▁▁
Native Hawaiian / Pacific Islander 0 1 11.96 39.95 0 0.00 2.0 8.00 1019 ▇▁▁▁▁
White 0 1 2303.23 4259.66 0 140.00 691.0 2528.00 61498 ▇▁▁▁▁
Two Or More Races 0 1 129.50 392.62 0 3.00 24.0 103.00 19039 ▇▁▁▁▁
Unknown 0 1 234.40 1207.42 0 6.00 43.0 196.00 65163 ▇▁▁▁▁
Non-Resident Foreign 0 1 198.94 704.63 0 0.00 9.0 84.00 11495 ▇▁▁▁▁
Total Minority 0 1 1655.60 3707.10 0 128.25 388.5 1335.00 68332 ▇▁▁▁▁
Men 0 1 1912.07 3663.95 0 129.00 567.5 1867.00 60337 ▇▁▁▁▁
Resident 0 1 4193.22 7786.88 1 382.00 1359.0 4356.50 191704 ▇▁▁▁▁
# check for further missing values
sum(is.na(diversity_wide))
## [1] 0

Since the data set considered is very large, I have considered a set of inclusion-exclusion criterias to obtain a reasonable sized data set to answer my research question of interest.

I am primarily interested only in the on-campus programs and 4 year degree type programs obtained only from private and public colleges.This implies, I will be excluding observations from:

-> 2 year degree length program -> off-campus programs ->“for profit” and “other” type of colleges

I will filter out the observations that are aligned with my inclusion criteria`s after I perform the merge operation between tuition_cost, tuition_income,salary and diversity datasets in the subsequent sections. I have stated the reasons as to why I chose this exclusion critera in the next sections.

Since I am interested in studying if the type of college attended would influence the earning potential and quality of education, I will have to merge 2 datasets: tuition cost and salary potential using left join.

The tuition cost data has the information about the name of the college, the state in which the college is located and the type of the college(“public”,“private”,“For Profit”, “Other” ) and the degree length and the salary_potential data consists of two earning potential indicators-early_career_pay and mid_career_pay and two quality of education indicators- make_world_better_percent and stem_percent.

# make sure the variable names are clean before merging
diversity_wide<-clean_names(diversity_wide)
salary<-clean_names(salary)
cost<-clean_names(cost)

Merging Datasets

1) Merge salary potential and tution cost data

# merge using left join
merged1<-merge(x=cost,y=salary,by=c("name","state"),all.x=TRUE)

skimr::skim(merged1) 
Data summary
Name merged1
Number of rows 2973
Number of columns 15
_______________________
Column type frequency:
character 3
factor 2
numeric 10
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
name 0 1 8 67 0 2938 0
state_code 0 1 2 2 0 55 0
degree_length 0 1 5 6 0 3 0

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
state 0 1 FALSE 55 Cal: 254, New: 221, Pen: 160, Tex: 150
type 0 1 FALSE 4 Pub: 1584, Pri: 1281, For: 107, Oth: 1

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
room_and_board 1094 0.63 10095.28 3288.55 30 7935 10000 12424.5 21300 ▁▅▇▃▁
in_state_tuition 0 1.00 16491.29 14773.84 480 4890 10099 27124.0 59985 ▇▂▂▁▁
in_state_total 0 1.00 22871.73 18948.39 962 5802 17669 35960.0 75003 ▇▅▂▂▁
out_of_state_tuition 0 1.00 20532.73 13255.65 480 9552 17486 29208.0 59985 ▇▆▅▂▁
out_of_state_total 0 1.00 26913.16 17719.73 1376 11196 23214 39054.0 75003 ▇▅▅▂▁
rank 2380 0.20 11.81 7.03 1 6 12 18.0 25 ▇▇▇▆▅
early_career_pay 2380 0.20 50892.58 7929.03 32500 45900 49700 54500.0 88800 ▂▇▂▁▁
mid_career_pay 2380 0.20 91941.82 15055.51 60100 81500 89600 99700.0 158200 ▃▇▃▁▁
make_world_better_percent 2401 0.19 53.83 8.81 33 48 53 59.0 94 ▂▇▃▁▁
stem_percent 2380 0.20 16.63 14.78 0 7 14 22.0 97 ▇▃▁▁▁
skimr::skim(merged1) # missingness rate is very high in the variables are interested
Data summary
Name merged1
Number of rows 2973
Number of columns 15
_______________________
Column type frequency:
character 3
factor 2
numeric 10
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
name 0 1 8 67 0 2938 0
state_code 0 1 2 2 0 55 0
degree_length 0 1 5 6 0 3 0

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
state 0 1 FALSE 55 Cal: 254, New: 221, Pen: 160, Tex: 150
type 0 1 FALSE 4 Pub: 1584, Pri: 1281, For: 107, Oth: 1

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
room_and_board 1094 0.63 10095.28 3288.55 30 7935 10000 12424.5 21300 ▁▅▇▃▁
in_state_tuition 0 1.00 16491.29 14773.84 480 4890 10099 27124.0 59985 ▇▂▂▁▁
in_state_total 0 1.00 22871.73 18948.39 962 5802 17669 35960.0 75003 ▇▅▂▂▁
out_of_state_tuition 0 1.00 20532.73 13255.65 480 9552 17486 29208.0 59985 ▇▆▅▂▁
out_of_state_total 0 1.00 26913.16 17719.73 1376 11196 23214 39054.0 75003 ▇▅▅▂▁
rank 2380 0.20 11.81 7.03 1 6 12 18.0 25 ▇▇▇▆▅
early_career_pay 2380 0.20 50892.58 7929.03 32500 45900 49700 54500.0 88800 ▂▇▂▁▁
mid_career_pay 2380 0.20 91941.82 15055.51 60100 81500 89600 99700.0 158200 ▃▇▃▁▁
make_world_better_percent 2401 0.19 53.83 8.81 33 48 53 59.0 94 ▂▇▃▁▁
stem_percent 2380 0.20 16.63 14.78 0 7 14 22.0 97 ▇▃▁▁▁
# analyzing the trends in missingness

# check the counts in each category of the following variables

tabyl(merged1$degree_length)
##  merged1$degree_length    n      percent
##                 2 Year 1120 0.3767238480
##                 4 Year 1852 0.6229397915
##                  Other    1 0.0003363606
tabyl(merged1$type)
##  merged1$type    n      percent
##    For Profit  107 0.0359905819
##         Other    1 0.0003363606
##       Private 1281 0.4308779011
##        Public 1584 0.5327951564
# filter only the NA observations in the quality of education and salary potential variables

merged1_na<-merged1%>%filter(is.na(early_career_pay)&is.na(mid_career_pay)&is.na(make_world_better_percent)&is.na(stem_percent)) # 2380 observations with missing NA in all the 4 variables

# check the counts in each category of the following variables

tabyl(merged1_na$degree_length) # almost all the 2 year degree length data has NA observations, for the 4 year degree length, 1261 out of 1852 observations are missing
##  merged1_na$degree_length    n      percent
##                    2 Year 1118 0.4697478992
##                    4 Year 1261 0.5298319328
##                     Other    1 0.0004201681
tabyl(merged1_na$type)# around 907 observations out of 1281 private college observations are missing, 1365 public out of 1584 observations are missing, all the for profit and other college observations are missing!
##  merged1_na$type    n      percent
##       For Profit  107 0.0449579832
##            Other    1 0.0004201681
##          Private  907 0.3810924370
##           Public 1365 0.5735294118
# final merged data set

# thus from the missingness analysis, we can omit the NA observations if they were NA in all the 4 variables, since there was only one complete observation for the 2 year degree data, I have excluded the 2 year degree length observations from my analysis

merged1_final<-merged1%>%filter(!is.na(early_career_pay)|!is.na(mid_career_pay)|!is.na(make_world_better_percent)|!is.na(stem_percent))%>%
filter(degree_length!="2 Year"|type!="For Profit"|type!="Other") #593 observations

# only 593 observations that can  be used to answer our question of interest
tabyl(merged1_final$type)
##  merged1_final$type   n   percent
##          For Profit   0 0.0000000
##               Other   0 0.0000000
##             Private 374 0.6306914
##              Public 219 0.3693086
# verify
skimr::skim(merged1_final)
Data summary
Name merged1_final
Number of rows 593
Number of columns 15
_______________________
Column type frequency:
character 3
factor 2
numeric 10
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
name 0 1 11 56 0 593 0
state_code 0 1 2 2 0 40 0
degree_length 0 1 6 6 0 2 0

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
state 0 1 FALSE 40 Ken: 24, Ala: 22, Geo: 22, Iow: 22
type 0 1 FALSE 2 Pri: 374, Pub: 219, For: 0, Oth: 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
room_and_board 29 0.95 10730.87 2738.66 1698 8892 10353 12702.5 18127 ▁▃▇▅▂
in_state_tuition 0 1.00 26355.69 16289.35 4118 10032 25600 40652.0 58230 ▇▂▃▃▃
in_state_total 0 1.00 36561.78 18618.05 4258 19392 34280 52192.0 75003 ▆▇▆▆▃
out_of_state_tuition 0 1.00 30826.91 12861.58 4118 20800 28898 41010.0 58230 ▂▇▇▅▃
out_of_state_total 0 1.00 41033.00 15511.94 6670 29504 38783 52856.0 75003 ▂▇▇▆▃
rank 0 1.00 11.81 7.03 1 6 12 18.0 25 ▇▇▇▆▅
early_career_pay 0 1.00 50892.58 7929.03 32500 45900 49700 54500.0 88800 ▂▇▂▁▁
mid_career_pay 0 1.00 91941.82 15055.51 60100 81500 89600 99700.0 158200 ▃▇▃▁▁
make_world_better_percent 21 0.96 53.83 8.81 33 48 53 59.0 94 ▂▇▃▁▁
stem_percent 0 1.00 16.63 14.78 0 7 14 22.0 97 ▇▃▁▁▁

2) Merge diversity data with tuition cost data

# merge using left join
merged2<-merge(x=cost,y=diversity_wide,by=c("name","state"),all.x=TRUE)

skimr::skim(merged2) 
Data summary
Name merged2
Number of rows 2973
Number of columns 24
_______________________
Column type frequency:
character 3
factor 2
numeric 19
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
name 0 1 8 67 0 2938 0
state_code 0 1 2 2 0 55 0
degree_length 0 1 5 6 0 3 0

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
state 0 1 FALSE 55 Cal: 254, New: 221, Pen: 160, Tex: 150
type 0 1 FALSE 4 Pub: 1584, Pri: 1281, For: 107, Oth: 1

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
room_and_board 1094 0.63 10095.28 3288.55 30 7935.00 10000.0 12424.50 21300 ▁▅▇▃▁
in_state_tuition 0 1.00 16491.29 14773.84 480 4890.00 10099.0 27124.00 59985 ▇▂▂▁▁
in_state_total 0 1.00 22871.73 18948.39 962 5802.00 17669.0 35960.00 75003 ▇▅▂▂▁
out_of_state_tuition 0 1.00 20532.73 13255.65 480 9552.00 17486.0 29208.00 59985 ▇▆▅▂▁
out_of_state_total 0 1.00 26913.16 17719.73 1376 11196.00 23214.0 39054.00 75003 ▇▅▅▂▁
total_enrollment 819 0.72 6185.05 8270.69 15 1351.50 3132.0 7623.00 81459 ▇▁▁▁▁
women 819 0.72 3486.93 4631.49 0 772.75 1881.0 4381.00 48329 ▇▁▁▁▁
american_indian_alaska_native 819 0.72 46.25 138.14 0 4.00 14.0 39.00 2659 ▇▁▁▁▁
asian 819 0.72 321.60 839.51 0 14.00 55.0 222.00 10381 ▇▁▁▁▁
black 819 0.72 774.88 1478.91 0 74.00 243.5 827.50 18914 ▇▁▁▁▁
hispanic 819 0.72 933.05 2436.55 0 47.00 175.0 688.75 44870 ▇▁▁▁▁
native_hawaiian_pacific_islander 819 0.72 14.56 36.60 0 1.00 4.0 13.00 653 ▇▁▁▁▁
white 819 0.72 3347.52 4385.80 0 729.50 1838.5 4119.00 41898 ▇▁▁▁▁
two_or_more_races 819 0.72 175.40 290.96 0 18.00 62.0 193.00 2801 ▇▁▁▁▁
unknown 819 0.72 299.74 919.59 0 25.00 105.0 295.50 27213 ▇▁▁▁▁
non_resident_foreign 819 0.72 272.05 797.11 0 5.00 38.5 160.00 11495 ▇▁▁▁▁
total_minority 819 0.72 2265.74 3996.32 0 283.25 766.5 2441.00 56561 ▇▁▁▁▁
men 819 0.72 2698.13 3769.05 0 540.25 1283.0 3277.50 33130 ▇▁▁▁▁
resident 819 0.72 5913.01 7802.33 11 1298.00 3046.0 7315.75 80423 ▇▁▁▁▁

Applying inclusion/exclusion criteria

# create new columns to compute the % gender, % race and ethnicity groups in each of the colleges
merged2<-merged2%>%mutate(across(.cols=women:resident,.fns=~((.x/total_enrollment)*100),.names ="percent_{.col}"))

#check the # of entries in the type and degree_length
tabyl(merged2$type)
##  merged2$type    n      percent
##    For Profit  107 0.0359905819
##         Other    1 0.0003363606
##       Private 1281 0.4308779011
##        Public 1584 0.5327951564
tabyl(merged2$degree_length)
##  merged2$degree_length    n      percent
##                 2 Year 1120 0.3767238480
##                 4 Year 1852 0.6229397915
##                  Other    1 0.0003363606
# retain only 4 year degree length observations
merged2.upd<-merged2%>%
  filter(degree_length=="4 Year"&(type=="Private"|type=="Public")) #1806 observations

#check the # of entries in the type and degree_length
tabyl(merged2.upd$type)
##  merged2.upd$type    n   percent
##        For Profit    0 0.0000000
##             Other    0 0.0000000
##           Private 1209 0.6694352
##            Public  597 0.3305648
tabyl(merged2.upd$degree_length)
##  merged2.upd$degree_length    n percent
##                     4 Year 1806       1
# check for NA's
sum(is.na(merged2.upd$total_enrollment)) #542 observations
## [1] 542
# remove those NA observations
merged2.final<-merged2.upd%>%
  filter(!is.na(total_enrollment))# 1264 observations

# relocate the men column next to women column
merged2.final<-merged2.final%>%relocate("percent_men",.after="percent_women")

#verify!
skimr::skim(merged2.final) 
Data summary
Name merged2.final
Number of rows 1264
Number of columns 37
_______________________
Column type frequency:
character 3
factor 2
numeric 32
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
name 0 1 11 58 0 1264 0
state_code 0 1 2 2 0 50 0
degree_length 0 1 6 6 0 1 0

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
state 0 1 FALSE 50 New: 84, Pen: 82, Cal: 81, Tex: 66
type 0 1 FALSE 2 Pri: 906, Pub: 358, For: 0, Oth: 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
room_and_board 100 0.92 10633.34 2998.55 950.00 8634.00 10359.00 12714.50 21300.00 ▁▅▇▃▁
in_state_tuition 0 1.00 25218.35 14996.93 480.00 10849.00 24375.00 36180.50 59985.00 ▇▆▇▅▂
in_state_total 0 1.00 35010.44 17662.86 1430.00 19454.75 32742.00 48450.00 75003.00 ▃▇▆▅▂
out_of_state_tuition 0 1.00 28369.28 12797.39 480.00 18629.50 27124.00 36683.50 59985.00 ▂▇▇▃▂
out_of_state_total 0 1.00 38161.37 15809.88 1430.00 26802.00 36709.50 48899.50 75003.00 ▂▆▇▅▂
total_enrollment 0 1.00 5807.69 8620.18 20.00 1142.50 2523.00 6304.75 81459.00 ▇▁▁▁▁
women 0 1.00 3249.67 4687.62 0.00 652.00 1502.00 3671.25 48329.00 ▇▁▁▁▁
american_indian_alaska_native 0 1.00 32.90 105.22 0.00 3.00 9.00 26.00 1754.00 ▇▁▁▁▁
asian 0 1.00 301.61 851.85 0.00 13.00 55.00 187.25 10381.00 ▇▁▁▁▁
black 0 1.00 658.51 1219.67 0.00 67.00 196.50 669.00 14751.00 ▇▁▁▁▁
hispanic 0 1.00 573.88 1633.40 0.00 41.00 139.00 458.00 31211.00 ▇▁▁▁▁
native_hawaiian_pacific_islander 0 1.00 11.09 29.67 0.00 0.75 3.00 9.00 429.00 ▇▁▁▁▁
white 0 1.00 3374.73 4968.27 0.00 594.50 1517.50 3738.75 41898.00 ▇▁▁▁▁
two_or_more_races 0 1.00 165.70 296.36 0.00 17.00 56.00 167.00 2801.00 ▇▁▁▁▁
unknown 0 1.00 294.58 1083.90 0.00 25.00 106.00 290.25 27213.00 ▇▁▁▁▁
non_resident_foreign 0 1.00 394.69 987.26 0.00 16.00 73.00 269.00 11495.00 ▇▁▁▁▁
total_minority 0 1.00 1743.70 3189.62 0.00 230.75 585.00 1803.00 39763.00 ▇▁▁▁▁
men 0 1.00 2558.02 4037.99 0.00 463.25 1014.00 2675.50 33130.00 ▇▁▁▁▁
resident 0 1.00 5413.00 7933.87 11.00 1100.50 2462.50 5911.25 80423.00 ▇▁▁▁▁
percent_women 0 1.00 56.09 15.59 0.00 51.15 57.36 63.16 100.00 ▁▁▇▅▁
percent_men 0 1.00 43.91 15.59 0.00 36.84 42.64 48.85 100.00 ▁▅▇▁▁
percent_american_indian_alaska_native 0 1.00 1.02 5.98 0.00 0.15 0.31 0.59 100.00 ▇▁▁▁▁
percent_asian 0 1.00 3.63 4.97 0.00 0.95 1.94 4.44 60.99 ▇▁▁▁▁
percent_black 0 1.00 13.99 20.58 0.00 3.67 6.59 14.11 100.00 ▇▁▁▁▁
percent_hispanic 0 1.00 7.87 9.17 0.00 2.75 5.18 9.24 92.98 ▇▁▁▁▁
percent_native_hawaiian_pacific_islander 0 1.00 0.21 0.66 0.00 0.01 0.10 0.21 15.57 ▇▁▁▁▁
percent_white 0 1.00 60.02 22.94 0.00 48.81 65.47 75.93 100.00 ▂▂▃▇▃
percent_two_or_more_races 0 1.00 2.51 2.11 0.00 1.24 2.26 3.26 28.77 ▇▁▁▁▁
percent_unknown 0 1.00 5.41 6.52 0.00 1.25 3.38 7.31 62.89 ▇▁▁▁▁
percent_non_resident_foreign 0 1.00 5.33 8.07 0.00 1.04 2.83 6.57 96.62 ▇▁▁▁▁
percent_total_minority 0 1.00 29.24 21.97 0.00 14.73 23.20 35.22 100.00 ▇▇▂▁▁
percent_resident 0 1.00 94.67 8.07 3.38 93.43 97.17 98.96 100.00 ▁▁▁▁▇

Reshape the data back to long format to make ggplots

Gender

# convert the wide to a long data
merged2_gender_long<-pivot_longer(data=merged2.final,
                          cols=c("percent_men","percent_women"),
                          names_to="percent_gender",
                          values_to="enrollment")

# remove the string "percent_" in the merged2_long columns
merged2_gender_long <-merged2_gender_long  %>%
  mutate(gender= str_remove_all(percent_gender, "percent_"))

#keep the columns that are required to plot compare gender
gender<-merged2_gender_long%>%
  select(name,state,state_code,type, degree_length,gender, enrollment, total_enrollment)

#verify
skimr::skim(gender)
Data summary
Name gender
Number of rows 2528
Number of columns 8
_______________________
Column type frequency:
character 4
factor 2
numeric 2
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
name 0 1 11 58 0 1264 0
state_code 0 1 2 2 0 50 0
degree_length 0 1 6 6 0 1 0
gender 0 1 3 5 0 2 0

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
state 0 1 FALSE 50 New: 168, Pen: 164, Cal: 162, Tex: 132
type 0 1 FALSE 2 Pri: 1812, Pub: 716, For: 0, Oth: 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
enrollment 0 1 50.00 16.73 0 40.95 50 59.05 100 ▁▂▇▂▁
total_enrollment 0 1 5807.69 8618.48 20 1142.50 2523 6304.75 81459 ▇▁▁▁▁

Race and Ethnicity

# convert the wide to a long data
merged2_race_long<-pivot_longer(data=merged2.final,
 cols=percent_american_indian_alaska_native:percent_resident,
                          names_to="percent_race",
                          values_to="enrollment")

# remove the string "percent_" in the merged2_long columns
merged2_race_long <-merged2_race_long  %>%
  mutate(race= str_remove_all(percent_race, "percent_"))

#keep the columns that are  required to plot compare gender
race<-merged2_race_long%>%
  select(name,state,state_code,type, degree_length,race, enrollment, total_enrollment)

#verify
skimr::skim(race)
Data summary
Name race
Number of rows 13904
Number of columns 8
_______________________
Column type frequency:
character 4
factor 2
numeric 2
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
name 0 1 11 58 0 1264 0
state_code 0 1 2 2 0 50 0
degree_length 0 1 6 6 0 1 0
race 0 1 5 32 0 11 0

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
state 0 1 FALSE 50 New: 924, Pen: 902, Cal: 891, Tex: 726
type 0 1 FALSE 2 Pri: 9966, Pub: 3938, For: 0, Oth: 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
enrollment 0 1 20.36 31.55 0 0.81 4.07 21.58 100 ▇▁▁▁▁
total_enrollment 0 1 5807.69 8617.08 20 1142.50 2523.00 6304.75 81459 ▇▁▁▁▁

Are the values what you expected for the variables?Why or Why not?

When I was working with the individual dataset and merging the datasets, I came across a lot of NA observations, it took me a while for me to figure out if they had a pattern to it or they were missing at random. For example the 2 year degree type observations had most of the observations missing for early career pay, mid career pay, stem % and make world better percent % , so then I had to apply a bunch of inclusion/exclusion criterias.

Also, reshaping the diversity data was a bit of challenge since the original data was in long format with all the gender and race variables lumped into one column, I had to convert to wide format in order to use them to create summary tables and then later for making boxplots and bar plots, I needed a long format data for which I had to reshape the data to a long format again.

I think after performing all the data transformation and merging of the datasets, I am happy with the outcome.

Visualizing and Summarizing the Data (15 points)

Use group_by() and summarize() to make a summary of the data here. The summary should be relevant to your research question

What are your findings about the summary? Are they what you expected?

Make at least two plots that help you answer your question on the transformed or summarized data. Use scales and/or labels to make each plot informative.

Summarizing data

Reasearch question 1

To answer out first research question of interest,

->we will to calculate the median early career pay, median mid career pay for the private/public colleges across different states of USA, this would provide an idea about the earning potential of a student at different stages in their career.

-> also to compare the quality of education, we will be comparing the median stem % and median make world better % between private and public colleges across different states in USA.

->apart from calculating the median of the variable of interest, it is also important to understand the spread of the distribution, for which we calculate the minimum and maximum values and a boxplot is plotted in the data visualization section

-> I created a comprehensive table which included the minimum,median and maximum value of early career pay, mid career pay, stem percent and make world better percent across private and public colleges in the different states in U.S.A and also created 4 separate tables one for each variable of interest.

# create a summary table which calculates the min,median and max of the early_career_pay, mid_career_pay, stem %, make world better %

merged1_stats<-merged1_final%>%
  group_by(state_code,type)%>%
  summarise(number_of_colleges=n(),
  across(early_career_pay:stem_percent,~min(.x, na.rm=TRUE),.names = "minimum_{.col}"),
across(early_career_pay:stem_percent,~median(.x, na.rm=TRUE),.names = "median_{.col}"),
  across(early_career_pay:stem_percent,~max(.x,na.rm=TRUE),.names = "maximum_{.col}"))
## `summarise()` has grouped output by 'state_code'. You can override using the
## `.groups` argument.
merged1_stats%>%gt::gt()
type number_of_colleges minimum_early_career_pay minimum_mid_career_pay minimum_make_world_better_percent minimum_stem_percent median_early_career_pay median_mid_career_pay median_make_world_better_percent median_stem_percent maximum_early_career_pay maximum_mid_career_pay maximum_make_world_better_percent maximum_stem_percent
AK
Private 1 50300 90000 67 6 50300 90000 67.0 6.0 50300 90000 67 6
Public 2 56000 101000 54 10 57550 101400 55.5 15.0 59100 101800 57 20
AL
Private 10 34800 65100 52 0 41500 76450 64.0 12.5 54500 93500 77 30
Public 12 38900 70700 49 4 44750 79800 59.5 10.5 57500 104500 61 45
AR
Private 8 34600 61200 47 3 43200 77350 55.5 17.0 49200 90300 58 33
Public 8 40500 70500 49 2 44950 79800 56.0 7.5 52500 98000 84 22
AZ
Private 2 41200 73500 69 0 41900 75050 69.0 3.5 42600 76600 69 7
Public 2 49600 90000 51 12 52850 95500 53.5 17.5 56100 101000 56 23
CA
Private 13 58500 109400 39 0 64500 120600 49.5 16.0 88800 158200 59 97
Public 1 63000 114700 53 28 63000 114700 53.0 28.0 63000 114700 53 28
CO
Private 5 33600 60100 48 0 53100 94800 53.0 12.0 55700 105100 70 23
Public 9 44400 81400 47 3 48300 87100 53.0 19.0 75600 139600 60 93
CT
Private 12 49700 85900 36 2 55550 100650 52.5 19.0 70300 138300 63 31
Public 6 48500 83600 43 2 50450 87950 49.0 10.0 59100 105200 68 24
DE
Private 3 45100 76600 42 2 45400 77300 50.0 4.0 49800 86900 50 25
Public 2 46300 81900 43 10 52450 93650 52.5 16.0 58600 105400 62 22
FL
Private 10 46100 84400 44 2 49650 89100 53.0 8.0 58300 105400 88 48
Public 8 47300 83900 40 4 49500 89350 51.0 15.5 55800 102800 62 29
GA
Private 11 44400 79400 34 3 46800 84900 54.0 19.0 62000 110800 62 26
Public 11 43300 77900 45 7 47000 84400 51.0 10.0 54400 100700 69 22
HI
Private 3 45400 84000 55 3 51700 93500 56.0 12.0 52200 93900 57 15
Public 2 44000 77200 56 14 47800 84400 57.5 16.0 51600 91600 59 18
IA
Private 19 43900 78000 41 0 48000 86900 49.0 14.0 53400 99900 60 65
Public 3 47100 85300 46 8 54100 99400 48.0 16.0 56100 101300 48 31
ID
Private 2 46200 80600 59 7 49950 87050 63.5 10.5 53700 93500 68 14
Public 4 44900 77600 54 5 50450 89100 57.5 13.0 52300 97700 64 22
IL
Private 16 49400 89800 41 0 54200 96000 46.0 16.5 64600 118500 88 64
Public 4 51200 95000 44 10 54650 99650 47.0 25.5 62600 115300 48 36
IN
Private 17 46900 84500 40 0 50000 89900 53.0 21.0 75000 135800 59 96
Public 2 45600 83700 48 7 46750 84650 51.5 7.5 47900 85600 55 8
KS
Private 11 37100 66100 55 0 44700 79100 61.0 7.0 48700 91800 73 14
Public 7 43500 78000 51 4 47100 84900 54.0 9.0 53000 98600 64 24
KY
Private 16 35300 64100 50 0 40950 75300 54.0 8.0 50000 90900 74 27
Public 8 40700 73100 46 5 45550 81900 53.0 11.0 51700 96400 57 19
LA
Private 7 33900 61500 37 0 46700 86300 60.5 4.0 48400 88600 74 35
Public 9 40100 71000 51 5 48200 88200 57.0 13.0 53100 100700 76 24
MA
Private 20 52600 98200 36 0 62600 114100 47.0 25.0 86300 155200 71 86
Public 1 67900 117500 53 47 67900 117500 53.0 47.0 67900 117500 53 47
MD
Private 13 47900 82600 40 0 51100 94100 52.5 11.0 67400 118400 63 88
Public 5 49600 87000 47 2 50400 89800 51.0 12.0 51200 91200 65 19
ME
Private 9 36300 64600 37 0 46900 84300 51.0 7.0 61300 112000 69 37
Public 8 37100 66200 47 5 42550 74700 59.5 7.5 67000 121800 73 48
MI
Private 12 47400 84700 40 0 50700 91000 50.5 16.5 69900 125000 57 74
Public 10 45900 82200 46 6 50400 90500 48.5 15.0 65000 116300 54 81
MN
Private 6 47400 84700 39 0 50850 93900 49.0 16.5 58800 109900 52 45
Public 3 49700 85900 50 10 50200 86800 51.0 11.0 51600 91600 53 15
MO
Private 11 42500 77200 37 0 46500 81200 56.0 7.0 54200 95300 68 13
Public 6 43400 77700 48 9 46900 84200 51.0 21.0 67300 122600 54 80
MS
Private 6 34800 62400 56 2 41150 73050 60.5 19.0 49300 88700 76 32
Public 8 32500 61900 55 2 41300 74400 63.0 11.5 51100 94100 70 27
MT
Private 2 46000 82900 55 12 47100 87150 57.0 18.5 48200 91400 59 25
Public 2 53700 96200 54 31 57450 104400 59.0 46.0 61200 112600 64 61
NE
Private 7 43800 77700 45 0 48600 84800 57.0 9.0 53500 95800 74 23
Public 4 40600 73400 50 3 42450 75750 60.0 6.0 53400 93000 79 11
NV
Private 1 54000 95500 39 2 54000 95500 39.0 2.0 54000 95500 39 2
OH
Private 14 48300 88500 39 0 53700 99450 46.0 21.0 66600 117800 78 45
Public 2 48300 88600 51 20 50050 90550 51.5 20.5 51800 92500 52 21
OK
Private 8 40700 74100 51 0 46000 83050 58.5 8.0 58700 107200 71 41
Public 11 37300 67400 48 5 44500 77600 55.0 10.0 47900 90800 66 15
OR
Private 11 40500 76400 44 0 50000 89600 57.0 8.0 58500 104900 82 28
Public 8 44000 73900 46 5 51300 93150 53.5 11.5 65400 108300 79 34
PA
Private 19 53400 98400 37 5 59300 112300 43.0 25.0 75900 136100 72 66
TN
Private 14 41200 74900 46 3 45950 82350 59.5 9.0 65400 119100 68 39
Public 6 43700 77300 48 8 46850 79750 56.0 11.0 53100 95100 68 28
TX
Private 11 51800 93500 33 0 55500 98900 50.0 17.0 71000 129500 61 45
Public 7 50700 94800 48 3 56000 101200 57.0 22.0 61300 106500 86 53
UT
Private 1 54500 91500 57 9 54500 91500 57.0 9.0 54500 91500 57 9
Public 5 46400 84400 57 7 52400 93900 60.0 12.0 54800 102600 65 22
VA
Private 10 43000 80600 40 1 52500 92850 52.5 14.0 65900 123200 86 23
Public 12 44900 81800 42 9 52500 94050 52.0 18.5 65900 120600 68 47
VT
Private 8 42000 75200 39 0 49950 89200 51.0 15.5 60400 109800 63 20
Public 4 41500 75100 41 4 47050 85050 49.5 7.0 53200 99400 52 22
WA
Private 13 41100 73600 39 0 53600 97800 52.0 18.0 60600 113700 60 33
Public 6 43500 78400 45 9 51400 90850 50.0 15.0 55900 103700 56 20
WI
Private 12 45600 81900 45 0 48150 87000 53.0 15.5 65700 117600 94 69
Public 10 46000 81300 44 8 50000 89400 49.0 15.5 57300 104400 53 38
WY
Public 1 52400 98800 58 25 52400 98800 58.0 25.0 52400 98800 58 25

Early career pay

# you can also make individual tables for each of the variable of interest

#early_career_pay

early_career_pay_stats<-merged1_final%>%
  group_by(state_code,type)%>%
  summarise(number_of_colleges=n(),
    minimum_early_career_pay=min(early_career_pay,na.rm=TRUE),
    median_early_career_pay=median(early_career_pay,na.rm=TRUE),
    maximum_early_career_pay=max(early_career_pay,na.rm=TRUE))
## `summarise()` has grouped output by 'state_code'. You can override using the
## `.groups` argument.
early_career_pay_stats%>%gt::gt()
type number_of_colleges minimum_early_career_pay median_early_career_pay maximum_early_career_pay
AK
Private 1 50300 50300 50300
Public 2 56000 57550 59100
AL
Private 10 34800 41500 54500
Public 12 38900 44750 57500
AR
Private 8 34600 43200 49200
Public 8 40500 44950 52500
AZ
Private 2 41200 41900 42600
Public 2 49600 52850 56100
CA
Private 13 58500 64500 88800
Public 1 63000 63000 63000
CO
Private 5 33600 53100 55700
Public 9 44400 48300 75600
CT
Private 12 49700 55550 70300
Public 6 48500 50450 59100
DE
Private 3 45100 45400 49800
Public 2 46300 52450 58600
FL
Private 10 46100 49650 58300
Public 8 47300 49500 55800
GA
Private 11 44400 46800 62000
Public 11 43300 47000 54400
HI
Private 3 45400 51700 52200
Public 2 44000 47800 51600
IA
Private 19 43900 48000 53400
Public 3 47100 54100 56100
ID
Private 2 46200 49950 53700
Public 4 44900 50450 52300
IL
Private 16 49400 54200 64600
Public 4 51200 54650 62600
IN
Private 17 46900 50000 75000
Public 2 45600 46750 47900
KS
Private 11 37100 44700 48700
Public 7 43500 47100 53000
KY
Private 16 35300 40950 50000
Public 8 40700 45550 51700
LA
Private 7 33900 46700 48400
Public 9 40100 48200 53100
MA
Private 20 52600 62600 86300
Public 1 67900 67900 67900
MD
Private 13 47900 51100 67400
Public 5 49600 50400 51200
ME
Private 9 36300 46900 61300
Public 8 37100 42550 67000
MI
Private 12 47400 50700 69900
Public 10 45900 50400 65000
MN
Private 6 47400 50850 58800
Public 3 49700 50200 51600
MO
Private 11 42500 46500 54200
Public 6 43400 46900 67300
MS
Private 6 34800 41150 49300
Public 8 32500 41300 51100
MT
Private 2 46000 47100 48200
Public 2 53700 57450 61200
NE
Private 7 43800 48600 53500
Public 4 40600 42450 53400
NV
Private 1 54000 54000 54000
OH
Private 14 48300 53700 66600
Public 2 48300 50050 51800
OK
Private 8 40700 46000 58700
Public 11 37300 44500 47900
OR
Private 11 40500 50000 58500
Public 8 44000 51300 65400
PA
Private 19 53400 59300 75900
TN
Private 14 41200 45950 65400
Public 6 43700 46850 53100
TX
Private 11 51800 55500 71000
Public 7 50700 56000 61300
UT
Private 1 54500 54500 54500
Public 5 46400 52400 54800
VA
Private 10 43000 52500 65900
Public 12 44900 52500 65900
VT
Private 8 42000 49950 60400
Public 4 41500 47050 53200
WA
Private 13 41100 53600 60600
Public 6 43500 51400 55900
WI
Private 12 45600 48150 65700
Public 10 46000 50000 57300
WY
Public 1 52400 52400 52400

From the above summary table of the minimum,maximum and median early career pay across private/public colleges in the different states of USA, we noted the following:

-> Amongst the private colleges, the highest median early career pay was recorded for California, followed by Massachusetts and then by Pennsylvania. The lowest median early career pay was recorded for the Kansas state and the second lowest for Mississipi

->Amongst the public colleges, the highest median early career pay was noted for Massachusetts and then by California and the lowest was noted for Mississipi and the second lowest for Nebraska.

-> It must also be noted that in this data, there are 20 private colleges and 1 public college in Massachusetts, 13 private and 1 public college in California.

->The highest number of public colleges was in AL and VA(12 in each state)and the lowest number was in CA, MA, WY,NV (with only 1 public college in each state) and the second lowest was noted in AR, AZ, DE, HI(2 public colleges in each state)

-> By assessing the min and max values, if we compare the range of early career pay between Mississipi and California public colleges, it looks the max early career pay in Mississipi exceeds that of California, this may not true since we had only 1 public college in california in our data set and thus doesn’t provide an accurate representation of all the public colleges in CA, especially in a state like CA where there are a lot of public colleges. Mississipi on the other hand, have 10 public colleges, so the data is a little more reliable.

->If we compare MI and MA, the maximum early pay is only little higher for MA public colleges compared to MI public colleges, this is due to same issue as above since MA has data from only 1 public college, so we can’t make any definitive conclusions from this.

->It is interesting to note that Colorado,having 9 public colleges has the highest early career pay $75,600. I think this seems like a reasonable early career pay for Colorado but most likely the recent graduates from public colleges in CO may not make the highest pay compared to CA and MI. A solid comparison can’t be made at this point since we have just one public college in CA and in MI(in the data).

-> The lowest min early career pay amongst the public colleges was recorded for MS $32,500. This was calculated from 8 public colleges in OK and the highest min early career pay was recorded for MA and again we can’t consider this to be a very credible value since this data is obtained from only 1 public college

->The highest number of private colleges was in Massachusetts(20), followed by Iowa(19) and Indiana(17)and the lowest number was in AK,NV,UT(1 each) and the second lowest in AZ, ID, MT(1 each).

-> Amongst the private college,the highest median and max early career pay was noted for California followed by MA and then by Pennsylvania and Kansas, Mississipi recorded the lowest median and max early career pay. Colorado had recorded the lowest minimum early career pay compared to the other states.

Mid career pay

#mid_career_pay

mid_career_pay_stats<-merged1_final%>%
  group_by(state_code,type)%>%
  summarise(number_of_colleges=n(),
    minimum_mid_career_pay=min(mid_career_pay,na.rm=TRUE),
    median_mid_career_pay=median(mid_career_pay,na.rm=TRUE),
    maximum_mid_career_pay=max(mid_career_pay,na.rm=TRUE))
## `summarise()` has grouped output by 'state_code'. You can override using the
## `.groups` argument.
mid_career_pay_stats%>%gt::gt()
type number_of_colleges minimum_mid_career_pay median_mid_career_pay maximum_mid_career_pay
AK
Private 1 90000 90000 90000
Public 2 101000 101400 101800
AL
Private 10 65100 76450 93500
Public 12 70700 79800 104500
AR
Private 8 61200 77350 90300
Public 8 70500 79800 98000
AZ
Private 2 73500 75050 76600
Public 2 90000 95500 101000
CA
Private 13 109400 120600 158200
Public 1 114700 114700 114700
CO
Private 5 60100 94800 105100
Public 9 81400 87100 139600
CT
Private 12 85900 100650 138300
Public 6 83600 87950 105200
DE
Private 3 76600 77300 86900
Public 2 81900 93650 105400
FL
Private 10 84400 89100 105400
Public 8 83900 89350 102800
GA
Private 11 79400 84900 110800
Public 11 77900 84400 100700
HI
Private 3 84000 93500 93900
Public 2 77200 84400 91600
IA
Private 19 78000 86900 99900
Public 3 85300 99400 101300
ID
Private 2 80600 87050 93500
Public 4 77600 89100 97700
IL
Private 16 89800 96000 118500
Public 4 95000 99650 115300
IN
Private 17 84500 89900 135800
Public 2 83700 84650 85600
KS
Private 11 66100 79100 91800
Public 7 78000 84900 98600
KY
Private 16 64100 75300 90900
Public 8 73100 81900 96400
LA
Private 7 61500 86300 88600
Public 9 71000 88200 100700
MA
Private 20 98200 114100 155200
Public 1 117500 117500 117500
MD
Private 13 82600 94100 118400
Public 5 87000 89800 91200
ME
Private 9 64600 84300 112000
Public 8 66200 74700 121800
MI
Private 12 84700 91000 125000
Public 10 82200 90500 116300
MN
Private 6 84700 93900 109900
Public 3 85900 86800 91600
MO
Private 11 77200 81200 95300
Public 6 77700 84200 122600
MS
Private 6 62400 73050 88700
Public 8 61900 74400 94100
MT
Private 2 82900 87150 91400
Public 2 96200 104400 112600
NE
Private 7 77700 84800 95800
Public 4 73400 75750 93000
NV
Private 1 95500 95500 95500
OH
Private 14 88500 99450 117800
Public 2 88600 90550 92500
OK
Private 8 74100 83050 107200
Public 11 67400 77600 90800
OR
Private 11 76400 89600 104900
Public 8 73900 93150 108300
PA
Private 19 98400 112300 136100
TN
Private 14 74900 82350 119100
Public 6 77300 79750 95100
TX
Private 11 93500 98900 129500
Public 7 94800 101200 106500
UT
Private 1 91500 91500 91500
Public 5 84400 93900 102600
VA
Private 10 80600 92850 123200
Public 12 81800 94050 120600
VT
Private 8 75200 89200 109800
Public 4 75100 85050 99400
WA
Private 13 73600 97800 113700
Public 6 78400 90850 103700
WI
Private 12 81900 87000 117600
Public 10 81300 89400 104400
WY
Public 1 98800 98800 98800

From the above summary table of the minimum,maximum and median mid career pay across private/public colleges in the different states of USA, we noted the following:

-> Amongst the private colleges, the highest median mid career pay was recorded for California, followed by Massachusetts and Pennsylvania. The lowest median mid career pay was recorded for the Mississipi state and the second lowest for Arizona.

->Amongst the public colleges, the highest median mid career pay was noted for Massachusetts and then by California and the lowest was noted for Mississipi and the second lowest for ME.

->By eyeballing,the median mid career pay for the private and public colleges within each state,in some of the states have a slightly higher average mid career pay for public colleges compared to private which seems like an interesting observation however we need to factor in the number of colleges in each of the private/public college in each of the state.

->Sometimes there could be less number colleges in one category which would provide biased mean.It is difficult to compare the mean mid pay from private and public college from the summary table as the trend is not consistent across all the states.

-> It must also be noted that in this data, there are 20 private colleges and 1 public college in Massachusetts, 13 private and 1 public college in California.

->The highest number of public colleges was in AL and VA(12 in each state)and the lowest number was in CA, MA, WY,NV (with only 1 public college in each state) and the second lowest was noted in AR, AZ, DE, HI(2 public colleges in each state)

-> By assessing the min and max values, if we compare the range of mid career pay between Mississipi and California public colleges, it looks the max mid-career pay in Mississipi is less that of California’s, this may be true however this is not an accurate representation of all the public colleges in CA since there was on;y 1 public college in CA, especially in a state like CA where there are a lot of public colleges. Mississipi on the other hand, have 10 public colleges, so the data is a little more reliable.

->If we compare MI and MA, the maximum mid career pay is higher for MA public colleges compared to MI public colleges, however inferences can’t be drawn definitely since MA has data from only 1 public college.

->It is interesting to note that the highest maximum mid career pay amongst the public colleges was observed for Colorado with $139,600 followed by MO($122,600), this was higher than the CA and MA public colleges. One could interpret this in two ways, since CA and MA has a lot of public colleges and based on the location factor and cost of living, one would expect to see the highest mid career salaries made in these two states but we must factor in that both these states had data from only one public college which is definitely doesn’t represent the earning potential of students of students from California public colleges.

-> The lowest min mid-career pay amongst the public colleges was recorded for MS . This was calculated from 8 public colleges in MS and the trend is similar for the min early career pay as well.The highest min mid career pay was recorded for MA and again we can’t consider this to be a very credible value since this data is obtained from only 1 public college

->The highest number of private colleges was in Massachusetts(20), followed by Iowa(19) and Indiana(17)and the lowest number was in AK,NV,UT(1 each) and the second lowest in AZ, ID, MT(1 each).

-> Amongst the private college,the highest median and max mid-career pay was noted for California followed by MA and then by Pennsylvania and Mississipi and Arizona recorded the lowest median mid-career pay. Colorado had recorded the lowest minimum mid career pay compared to the other states. The trend is very similar to early career pay.

Make world better percent

Since right-skew was prominently noticed for make world better %, median was used instead of mean as median is more reliable as a measure of central tendency when the data is skewed or when outliers are present.

# make world better percent

make_world_better_stats<-merged1_final%>%
  group_by(state_code,type)%>%
  summarise(number_of_colleges=n(),
   minimum_make_world_better_percent=min(make_world_better_percent,na.rm=TRUE),
    median_make_world_better_percent=median(make_world_better_percent,na.rm=TRUE),
    maximum_make_world_better_percent=max(make_world_better_percent,na.rm=TRUE))
## `summarise()` has grouped output by 'state_code'. You can override using the
## `.groups` argument.
make_world_better_stats%>%gt::gt()
type number_of_colleges minimum_make_world_better_percent median_make_world_better_percent maximum_make_world_better_percent
AK
Private 1 67 67.0 67
Public 2 54 55.5 57
AL
Private 10 52 64.0 77
Public 12 49 59.5 61
AR
Private 8 47 55.5 58
Public 8 49 56.0 84
AZ
Private 2 69 69.0 69
Public 2 51 53.5 56
CA
Private 13 39 49.5 59
Public 1 53 53.0 53
CO
Private 5 48 53.0 70
Public 9 47 53.0 60
CT
Private 12 36 52.5 63
Public 6 43 49.0 68
DE
Private 3 42 50.0 50
Public 2 43 52.5 62
FL
Private 10 44 53.0 88
Public 8 40 51.0 62
GA
Private 11 34 54.0 62
Public 11 45 51.0 69
HI
Private 3 55 56.0 57
Public 2 56 57.5 59
IA
Private 19 41 49.0 60
Public 3 46 48.0 48
ID
Private 2 59 63.5 68
Public 4 54 57.5 64
IL
Private 16 41 46.0 88
Public 4 44 47.0 48
IN
Private 17 40 53.0 59
Public 2 48 51.5 55
KS
Private 11 55 61.0 73
Public 7 51 54.0 64
KY
Private 16 50 54.0 74
Public 8 46 53.0 57
LA
Private 7 37 60.5 74
Public 9 51 57.0 76
MA
Private 20 36 47.0 71
Public 1 53 53.0 53
MD
Private 13 40 52.5 63
Public 5 47 51.0 65
ME
Private 9 37 51.0 69
Public 8 47 59.5 73
MI
Private 12 40 50.5 57
Public 10 46 48.5 54
MN
Private 6 39 49.0 52
Public 3 50 51.0 53
MO
Private 11 37 56.0 68
Public 6 48 51.0 54
MS
Private 6 56 60.5 76
Public 8 55 63.0 70
MT
Private 2 55 57.0 59
Public 2 54 59.0 64
NE
Private 7 45 57.0 74
Public 4 50 60.0 79
NV
Private 1 39 39.0 39
OH
Private 14 39 46.0 78
Public 2 51 51.5 52
OK
Private 8 51 58.5 71
Public 11 48 55.0 66
OR
Private 11 44 57.0 82
Public 8 46 53.5 79
PA
Private 19 37 43.0 72
TN
Private 14 46 59.5 68
Public 6 48 56.0 68
TX
Private 11 33 50.0 61
Public 7 48 57.0 86
UT
Private 1 57 57.0 57
Public 5 57 60.0 65
VA
Private 10 40 52.5 86
Public 12 42 52.0 68
VT
Private 8 39 51.0 63
Public 4 41 49.5 52
WA
Private 13 39 52.0 60
Public 6 45 50.0 56
WI
Private 12 45 53.0 94
Public 10 44 49.0 53
WY
Public 1 58 58.0 58

From the above summary table of the minimum,maximum and median make world better % across private/public colleges in the different states of USA, we noted the following:

Private Colleges

-> Amongst the private colleges, the highest median make world better % was recorded for Arizona(69%) and Alaska(67%) and the lowest median make world better % was for Nevada(39%). It was surprising to observe that the median make world better % was pretty low for CA(49.5%), MA(47%) and PA(43%), these three states were having the highest average early and mid career pay.

->By assessing the min make world better % for the private colleges, the highest min make world better % was observed for Arizona(69%) and Alaska (67%) and the lowest % was noted for TX(33%), GA(34%),MA(36%) and CA(39%). The low % in make world better is not expected for states like MA and CA. Note:Only 2 private colleges are in AZ and 1 is in Alaska, so this data may not truly represent all the other private colleges in AZ and AK.

->By assessing the max make world better % for the private colleges, the highest max make world better % was observed for WI(94%) and for FL(88%) and IL(88%) and the smallest max make world better % was noted for NV(39%) and DE(50%).It was also intriguing to note that CA(59%) and MA(71%) had only moderate min make world better % rankings.

Public Colleges

->Amongst the public colleges, the highest median make world better % was observed for Mississipi(63%) and Nebraska(60%),both of these states had the lowest median early and mid career pay, so it was interesting to note this observation. The lowest median make world better % was observed for IA(48%) and IL(47%).

->By assessing the min make world better % for the public colleges, the highest min make world better % was observed for WY(58%) and UT(57%) and the lowest % was noted for FL(40%) and VT(41%) . The MIN make world better % for states like MA and CA(53% each) was almost comparable to the highest min make world better %.

->By assessing the max make world better % for the public colleges, the highest max make world better % was observed for TX(86%)and AR(84%) and the smallest max make world better % was noted for IL(48%) and IA(48%).It was also intriguing to note that CA(53%) and MA(53%) had pretty low max make world better % rankings.

->By eyeballing the median world better % between the public and private colleges across the different states-in some of the states the public colleges seem to have a better median world % compared to the private colleges and vice-versa. The largest diff in make world better % between the public and private college was observed for Arizona(approx 16%, with private college % being higher) and Alaska(approx 12%,with private college % being higher), for the rest of the states the difference was very small.

Stem percent

#stem percent

stem_stats<-merged1_final%>%
  group_by(state_code,type)%>%
  summarise(number_of_colleges=n(),
    minimum_stem_percent=min(stem_percent,na.rm=TRUE),
    median_stem_percent=median(stem_percent,na.rm=TRUE),
    maximum_stem_percent=max(stem_percent,na.rm=TRUE))
## `summarise()` has grouped output by 'state_code'. You can override using the
## `.groups` argument.
stem_stats%>%gt::gt()
type number_of_colleges minimum_stem_percent median_stem_percent maximum_stem_percent
AK
Private 1 6 6.0 6
Public 2 10 15.0 20
AL
Private 10 0 12.5 30
Public 12 4 10.5 45
AR
Private 8 3 17.0 33
Public 8 2 7.5 22
AZ
Private 2 0 3.5 7
Public 2 12 17.5 23
CA
Private 13 0 16.0 97
Public 1 28 28.0 28
CO
Private 5 0 12.0 23
Public 9 3 19.0 93
CT
Private 12 2 19.0 31
Public 6 2 10.0 24
DE
Private 3 2 4.0 25
Public 2 10 16.0 22
FL
Private 10 2 8.0 48
Public 8 4 15.5 29
GA
Private 11 3 19.0 26
Public 11 7 10.0 22
HI
Private 3 3 12.0 15
Public 2 14 16.0 18
IA
Private 19 0 14.0 65
Public 3 8 16.0 31
ID
Private 2 7 10.5 14
Public 4 5 13.0 22
IL
Private 16 0 16.5 64
Public 4 10 25.5 36
IN
Private 17 0 21.0 96
Public 2 7 7.5 8
KS
Private 11 0 7.0 14
Public 7 4 9.0 24
KY
Private 16 0 8.0 27
Public 8 5 11.0 19
LA
Private 7 0 4.0 35
Public 9 5 13.0 24
MA
Private 20 0 25.0 86
Public 1 47 47.0 47
MD
Private 13 0 11.0 88
Public 5 2 12.0 19
ME
Private 9 0 7.0 37
Public 8 5 7.5 48
MI
Private 12 0 16.5 74
Public 10 6 15.0 81
MN
Private 6 0 16.5 45
Public 3 10 11.0 15
MO
Private 11 0 7.0 13
Public 6 9 21.0 80
MS
Private 6 2 19.0 32
Public 8 2 11.5 27
MT
Private 2 12 18.5 25
Public 2 31 46.0 61
NE
Private 7 0 9.0 23
Public 4 3 6.0 11
NV
Private 1 2 2.0 2
OH
Private 14 0 21.0 45
Public 2 20 20.5 21
OK
Private 8 0 8.0 41
Public 11 5 10.0 15
OR
Private 11 0 8.0 28
Public 8 5 11.5 34
PA
Private 19 5 25.0 66
TN
Private 14 3 9.0 39
Public 6 8 11.0 28
TX
Private 11 0 17.0 45
Public 7 3 22.0 53
UT
Private 1 9 9.0 9
Public 5 7 12.0 22
VA
Private 10 1 14.0 23
Public 12 9 18.5 47
VT
Private 8 0 15.5 20
Public 4 4 7.0 22
WA
Private 13 0 18.0 33
Public 6 9 15.0 20
WI
Private 12 0 15.5 69
Public 10 8 15.5 38
WY
Public 1 25 25.0 25

Since right-skew was prominently noticed for stem %, median was used instead of mean as median is more reliable as a measure of central tendency when the data is skewed or when outliers are present.

Amongst the public colleges, MA(47%) has the highest median stem % followed by MT(46%) and CA(28.00%) and the least mean stem % is noted for NE(6%),VT(7.0%), AR(7.5%).It is interesting to note that CA has a pretty low stem % but this could be due to CA having only one public college in the data and this particular college had a low stem % but that doesn’t represent the public colleges in CA accurately. CO (93%) has the maximum stem % however it’s median stem is 19%. Indiana has the least maximum stem percent(8%).MA(47%) has the highest min stem percent and MD,AR,CT, MS(2% for each state) have the lowest min stem percent.

Amongst the private colleges, PA(25.00%) and MA(25.00%) has the highest median stem % and the least median stem % is noted for LA(4%),AZ(3.5%) and NV(2%). The median stem % of CA private colleges(13 colleges) is 16% only since the minimum stem % is 0 and the maximum stem % is 97.

It is intriguing to note that the private colleges on have lower median stem % compared to the public colleges.

Research Question 2

How Gender, Race and Ethnicity impact the choice of private/ public colleges across the USA

Gender

# create a summary table
gender.stats<-merged2.final%>%
              group_by(state,type)%>%
              summarise(number_of_colleges=n(),
across(c(percent_women,percent_men),~min(.x, na.rm=TRUE),.names = "minimum_{.col}"),
across(c(percent_women,percent_men),~mean(.x, na.rm=TRUE),.names = "average_{.col}"),
across(c(percent_women,percent_men),~max(.x,na.rm=TRUE),.names = "maximum_{.col}"))
## `summarise()` has grouped output by 'state'. You can override using the
## `.groups` argument.
gender.stats%>%gt::gt()
type number_of_colleges minimum_percent_women minimum_percent_men average_percent_women average_percent_men maximum_percent_women maximum_percent_men
Alabama
Private 15 5.37634409 33.6250000 51.91116 48.08884 66.37500 94.62366
Public 11 49.39024390 29.7819002 61.67802 38.32198 70.21810 50.60976
Alaska
Private 2 54.34782609 33.6787565 60.33453 39.66547 66.32124 45.65217
Arizona
Private 3 44.91180461 33.0188679 55.24636 44.75364 66.98113 55.08820
Public 3 42.95310016 41.1081032 51.22506 48.77494 58.89190 57.04690
Arkansas
Private 10 33.14794216 35.9788360 49.92745 50.07255 64.02116 66.85206
Public 9 54.07879029 34.8788927 58.46318 41.53682 65.12111 45.92121
California
Private 76 0.00000000 0.6072874 54.81189 45.18811 99.39271 100.00000
Public 5 49.63775869 37.0936491 56.05517 43.94483 62.90635 50.36224
Colorado
Private 6 37.40157480 31.2248996 57.21153 42.78847 68.77510 62.59843
Public 6 28.31264676 34.3734440 49.11059 50.88941 65.62656 71.68735
Connecticut
Private 12 32.80423280 10.9551657 58.09519 41.90481 89.04483 67.19577
Public 6 50.13707734 32.6075687 56.47142 43.52858 67.39243 49.86292
Delaware
Private 1 43.19085487 56.8091451 43.19085 56.80915 43.19085 56.80915
Public 2 56.24779541 37.2299295 59.50893 40.49107 62.77007 43.75220
Florida
Private 28 0.00000000 30.2799404 49.15176 50.84824 69.72006 100.00000
Public 10 54.22673325 40.7673861 56.37840 43.62160 59.23261 45.77327
Georgia
Private 24 0.04741584 0.0000000 59.75706 40.24294 100.00000 99.95258
Public 20 30.15708166 30.9974425 59.32227 40.67773 69.00256 69.84292
Hawaii
Private 2 56.66723872 34.6153846 61.02593 38.97407 65.38462 43.33276
Public 1 60.60142712 39.3985729 60.60143 39.39857 60.60143 39.39857
Idaho
Private 3 49.14285714 41.6184971 53.32397 46.67603 58.38150 50.85714
Public 4 46.84669287 39.3587361 53.46854 46.53146 60.64126 53.15331
Illinois
Private 46 35.56596607 9.7560976 61.04623 38.95377 90.24390 64.43403
Public 9 44.90695614 29.8995845 57.03811 42.96189 70.10042 55.09304
Indiana
Private 19 0.00000000 32.7269386 50.26330 49.73670 67.27306 100.00000
Public 4 55.07092468 36.5624317 60.35421 39.64579 63.43757 44.92908
Iowa
Private 25 32.94360385 8.8695652 57.39885 42.60115 91.13043 67.05640
Public 3 43.98141426 41.7756539 51.26467 48.73533 58.22435 56.01859
Kansas
Private 11 40.06069803 34.7436941 53.20483 46.79517 65.25631 59.93930
Public 8 49.32568844 37.2260386 54.62502 45.37498 62.77396 50.67431
Kentucky
Private 17 21.79487179 26.5253137 56.05595 43.94405 73.47469 78.20513
Public 8 51.56532628 39.4195251 56.96244 43.03756 60.58047 48.43467
Louisiana
Private 7 53.26433121 14.6922184 65.51076 34.48924 85.30778 46.73567
Public 11 48.82850780 23.7755587 62.17425 37.82575 76.22444 51.17149
Maine
Private 8 50.02770083 26.5795207 61.89320 38.10680 73.42048 49.97230
Public 8 13.86792453 27.8516295 57.95685 42.04315 72.14837 86.13208
Maryland
Private 9 0.00000000 13.6034732 59.56967 40.43033 86.39653 100.00000
Public 8 52.41806909 27.1943824 60.01231 39.98769 72.80562 47.58193
Massachusetts
Private 48 19.43835015 0.4434590 59.50762 40.49238 99.55654 80.56165
Public 9 12.69205077 30.0238663 57.45505 42.54495 69.97613 87.30795
Michigan
Private 29 0.00000000 8.8410992 50.39129 49.60871 91.15890 100.00000
Public 11 26.29947880 40.3508772 53.03614 46.96386 59.64912 73.70052
Minnesota
Private 13 27.79922780 5.8160237 57.18617 42.81383 94.18398 72.20077
Public 4 56.25752106 37.4596960 58.24378 41.75622 62.54030 43.74248
Mississippi
Private 8 33.33333333 31.5751094 57.70600 42.29400 68.42489 66.66667
Public 8 48.59469659 18.6943620 62.61132 37.38868 81.30564 51.40530
Missouri
Private 26 2.15053763 0.0000000 61.51486 38.48514 100.00000 97.84946
Public 9 22.61574074 32.3437500 54.29603 45.70397 67.65625 77.38426
Montana
Private 4 48.36146971 29.8119964 59.83198 40.16802 70.18800 51.63853
Public 1 32.47002398 67.5299760 32.47002 67.52998 32.47002 67.52998
Nebraska
Private 11 44.85294118 2.2593320 62.60753 37.39247 97.74067 55.14706
Public 4 57.34870317 35.8225108 59.90584 40.09416 64.17749 42.65130
Nevada
Private 1 57.65158807 42.3484119 57.65159 42.34841 57.65159 42.34841
Public 2 64.55026455 23.5443038 70.50298 29.49702 76.45570 35.44974
New Hampshire
Private 8 47.11019371 30.3149606 59.95840 40.04160 69.68504 52.88981
Public 5 51.98764161 25.7582617 58.08928 41.91072 74.24174 48.01236
New Jersey
Private 15 0.00000000 25.7552483 43.07105 56.92895 74.24475 100.00000
Public 9 25.03287620 36.6027532 54.19591 45.80409 63.39725 74.96712
New Mexico
Private 1 65.31219029 34.6878097 65.31219 34.68781 65.31219 34.68781
Public 7 32.15796897 36.0406091 55.86579 44.13421 63.95939 67.84203
New York
Private 83 0.00000000 0.1943257 46.24819 53.75181 99.80567 100.00000
Public 1 84.65792708 15.3420729 84.65793 15.34207 84.65793 15.34207
North Carolina
Private 35 41.98581560 1.5906681 60.34770 39.65230 98.40933 58.01418
Public 15 44.26137868 27.8352490 58.95017 41.04983 72.16475 55.73862
North Dakota
Private 3 36.08695652 34.6030616 51.97229 48.02771 65.39694 63.91304
Public 6 46.01613888 39.3548387 54.96614 45.03386 60.64516 53.98386
Ohio
Private 46 0.00000000 9.5940959 53.61896 46.38104 90.40590 100.00000
Public 9 48.81479008 41.9344902 53.46294 46.53706 58.06551 51.18521
Oklahoma
Private 10 42.50320376 29.1666667 56.49564 43.50436 70.83333 57.49680
Public 10 54.20319752 35.5088496 59.83497 40.16503 64.49115 45.79680
Oregon
Private 12 44.09030544 31.2138728 59.32605 40.67395 68.78613 55.90969
Public 8 46.29107981 34.9528137 55.82764 44.17236 65.04719 53.70892
Pennsylvania
Private 66 0.00000000 0.6622517 60.29091 39.70909 99.33775 100.00000
Public 16 36.06615686 34.8564426 54.77032 45.22968 65.14356 63.93384
Rhode Island
Private 7 32.20396988 31.7269076 52.26290 47.73710 68.27309 67.79603
Public 2 54.77641663 31.0149288 61.88074 38.11926 68.98507 45.22358
South Carolina
Private 18 43.55909695 8.3513319 57.05298 42.94702 91.64867 56.44090
Public 9 19.93318486 29.7920892 56.90451 43.09549 70.20791 80.06682
South Dakota
Private 4 56.78160920 35.8024691 60.24194 39.75806 64.19753 43.21839
Public 8 22.05146533 29.4851794 55.96087 44.03913 70.51482 77.94853
Tennessee
Private 28 29.93630573 8.8034188 57.74361 42.25639 91.19658 70.06369
Public 6 44.55419349 38.9719730 56.32104 43.67896 61.02803 55.44581
Texas
Private 38 11.67883212 32.8674121 51.16689 48.83311 67.13259 88.32117
Public 28 43.26044598 11.6117046 58.63384 41.36616 88.38830 56.73955
Utah
Private 2 44.87600052 39.4268518 52.72457 47.27543 60.57315 55.12400
Public 2 44.51213708 46.5747091 48.96871 51.03129 53.42529 55.48786
Vermont
Private 9 20.28867102 30.1115242 53.37629 46.62371 69.88848 79.71133
Public 3 42.86713287 44.4306161 48.28991 51.71009 55.56938 57.13287
Virginia
Private 22 0.18099548 3.5714286 58.61043 41.38957 96.42857 99.81900
Public 12 11.05882353 31.8485086 54.91656 45.08344 68.15149 88.94118
Washington
Private 11 51.36035536 17.7204659 61.30630 38.69370 82.27953 48.63964
Public 6 51.34211811 44.3841522 53.63577 46.36423 55.61585 48.65788
West Virginia
Private 4 41.72185430 43.5555556 52.28120 47.71880 56.44444 58.27815
Public 9 39.09595559 38.0678183 53.42305 46.57695 61.93218 60.90404
Wisconsin
Private 20 24.05693950 3.1393889 65.38474 34.61526 96.86061 75.94306
Public 2 49.24419842 47.8318292 50.70618 49.29382 52.16817 50.75580
Wyoming
Public 1 51.95787832 48.0421217 51.95788 48.04212 51.95788 48.04212

Amongst the public colleges, the highest mean % women enrollment is noted for NY(84.65%) and NV(70.50%) and the lowest was noted for Montana(32.47%) and Vermont(48.28%). The highest mean % women enrollment for CA and MA public colleges is 56.05% and 57.4%. The % enrollment of men in public colleges would be (100-% women enrollment) and hence the lowest mean % enrollment of men is noted for NY and NV and the highest for Montana and Vermont.

One of the limitation of this data set is only women enrollment was provided and by taking the difference from the total enrollment and women enrollment, men enrollment was calculated. This calculation holds good only for binary gender groups and doesn’t include the other gender groups.

Race and Ethnicity

# create a summary table
race_ethnicity.stats<-merged2.final%>%
              group_by(state,type)%>%
              summarise(number_of_colleges=n(),
across(percent_american_indian_alaska_native:percent_resident,~min(.x, na.rm=TRUE),.names = "minimum_{.col}"),
across(percent_american_indian_alaska_native:percent_resident,~mean(.x, na.rm=TRUE),.names = "average_{.col}"),
across(percent_american_indian_alaska_native:percent_resident,~max(.x,na.rm=TRUE),.names = "maximum_{.col}"))
## `summarise()` has grouped output by 'state'. You can override using the
## `.groups` argument.
race_ethnicity.stats%>%gt::gt()
type number_of_colleges minimum_percent_american_indian_alaska_native minimum_percent_asian minimum_percent_black minimum_percent_hispanic minimum_percent_native_hawaiian_pacific_islander minimum_percent_white minimum_percent_two_or_more_races minimum_percent_unknown minimum_percent_non_resident_foreign minimum_percent_total_minority minimum_percent_resident average_percent_american_indian_alaska_native average_percent_asian average_percent_black average_percent_hispanic average_percent_native_hawaiian_pacific_islander average_percent_white average_percent_two_or_more_races average_percent_unknown average_percent_non_resident_foreign average_percent_total_minority average_percent_resident maximum_percent_american_indian_alaska_native maximum_percent_asian maximum_percent_black maximum_percent_hispanic maximum_percent_native_hawaiian_pacific_islander maximum_percent_white maximum_percent_two_or_more_races maximum_percent_unknown maximum_percent_non_resident_foreign maximum_percent_total_minority maximum_percent_resident
Alabama
Private 15 0.00000000 0.00000000 7.5268817 0.0000000 0.00000000 0.8960573 0.00000000 0.00000000 0.00000000 14.960470 92.315627 0.5611901 0.8230765 53.2076651 2.177298 0.09911336 35.29526 0.87036975 5.4710553 1.49497673 57.738713 98.50502 2.1505376 5.2320675 98.5507246 5.595930 1.07526882 81.22846 3.27034884 28.8000000 7.6843734 98.550725 100.00000
Public 11 0.14495380 0.15041364 7.2784810 0.7118845 0.00000000 3.7687987 0.00000000 1.06963312 0.54347826 12.615776 94.492899 0.6471454 1.6061984 28.5768747 2.108923 0.08325071 58.64694 1.67328864 3.4049145 3.25246816 34.695681 96.74753 1.5345269 4.9791422 89.6901613 3.497715 0.20879469 80.48395 3.01101722 6.2289895 5.5071010 92.498641 99.45652
Alaska
Private 2 6.52173913 0.00000000 0.0000000 2.1739130 0.00000000 55.4404145 2.17391304 0.00000000 0.00000000 10.869565 100.000000 10.7738229 0.8635579 1.1226252 2.382293 0.34542314 72.28542 4.62754374 7.5993092 0.00000000 20.115266 100.00000 15.0259067 1.7271157 2.2452504 2.590674 0.69084629 89.13043 7.08117444 15.1986183 0.0000000 29.360967 100.00000
Arizona
Private 3 0.94979647 1.41509434 2.8301887 2.4764151 0.23584906 6.5934066 0.00000000 0.00000000 0.00000000 14.150943 96.698113 23.5257010 2.0184101 5.2624602 10.743414 1.76995983 47.53874 3.07078458 4.3819283 1.68859818 46.390730 98.31140 67.0329670 2.4423338 7.4626866 15.468114 4.39560440 75.23585 4.61329715 7.3113208 3.3018868 93.406593 100.00000
Public 3 1.22407425 1.68922577 3.2051976 16.3732114 0.20468998 51.9773450 3.50755167 0.78686158 4.49738314 31.291733 84.193164 1.8475064 4.5607107 3.3692364 19.338836 0.21780097 55.51847 4.04685803 1.4155663 9.68501602 33.380949 90.31498 3.0247248 6.3294913 3.5830684 22.267734 0.22492660 62.37141 4.82584371 2.5357515 15.8068362 36.506771 95.50262
Arkansas
Private 10 0.00000000 0.11123471 3.7894737 0.3527337 0.00000000 0.5291005 0.00000000 0.00000000 0.88987764 11.239478 89.634146 1.0391711 1.2956800 25.1500961 5.387377 0.05549444 59.34981 1.46284304 1.5496239 4.70990631 34.390662 95.29009 1.8292683 5.6701031 94.2157953 14.650767 0.17636684 83.34444 3.74787053 6.4606742 10.3658537 94.994438 99.11012
Public 9 0.11937923 0.59678256 4.3923865 1.5917230 0.00000000 4.6955830 0.07958615 0.00000000 0.44110016 18.655224 95.084630 0.6727289 1.8417061 25.4832340 4.471255 0.13035704 60.82006 3.06935664 0.8577156 2.65358582 35.668637 97.34641 2.7232796 4.5674740 91.3251094 9.677892 0.78962211 77.60373 7.06741091 3.1487889 4.9153701 93.712694 99.55890
California
Private 76 0.00000000 0.00000000 0.0000000 0.0000000 0.00000000 0.0000000 0.00000000 0.00000000 0.00000000 0.000000 3.378378 0.4990811 11.3811370 5.9086256 18.109358 0.62276228 40.82164 3.89647241 6.4061661 12.35476169 40.417436 87.64524 5.0000000 60.9929078 21.8303946 48.186528 3.10663308 96.93252 9.63130173 62.2549020 96.6216216 87.009063 100.00000
Public 5 0.13450310 3.45315262 2.1743811 22.6423746 0.21528525 21.7405924 4.76568948 4.15431174 1.41426046 39.171152 89.056338 0.4288308 15.7942713 3.4476867 25.740584 0.41514383 36.04037 5.76817383 6.5548161 5.81012715 51.594690 94.18987 1.0017678 31.7335616 4.6767351 29.217812 0.73976707 51.87298 6.52314316 8.2334974 10.9436615 63.161434 98.58574
Colorado
Private 6 0.43541364 0.78740157 1.4056225 6.6929134 0.00000000 61.4031277 2.54126846 0.00000000 0.00000000 18.110236 89.313236 0.6500631 2.8170575 4.3132906 9.728886 0.16946646 69.32110 4.73317466 4.4980127 3.76895278 22.411939 96.23105 1.1811024 4.6481321 6.7565232 15.117289 0.39370079 81.88976 8.36961780 8.6120765 10.6867643 28.811903 100.00000
Public 6 0.21804764 0.63307834 0.8177262 6.9942972 0.05031869 54.4711158 2.86377709 1.76115398 0.00000000 16.101979 88.527340 4.4050785 1.5726292 2.6744927 13.681386 0.31768426 63.58903 3.91688512 7.0134684 2.82934492 26.568156 97.17066 23.3975204 3.8745387 6.0558022 25.745086 0.60333480 72.27951 5.98786600 15.1315789 11.4726602 41.281984 100.00000
Connecticut
Private 12 0.00000000 0.45161290 1.0582011 6.3873538 0.00000000 33.4232325 1.05820106 0.00000000 0.70967742 13.214913 75.900597 0.3499549 3.6141543 10.5126543 9.080962 0.09848934 58.38210 1.94630596 6.9648024 9.05057636 25.602520 90.94942 1.7994859 12.8161479 33.2258065 14.000000 0.21190522 75.97122 4.43417639 19.3372320 24.0994028 49.483871 99.29032
Public 6 0.15447798 1.29600829 5.0977733 7.2642327 0.01847575 57.8019699 1.12567204 2.57056452 0.20161290 21.524494 89.827060 0.2255267 3.7201589 10.6672862 10.894874 0.09255368 63.54261 2.16476152 6.3233951 2.36883308 27.765161 97.63117 0.3110420 8.7411929 16.7444272 15.120968 0.16801075 69.33989 2.59126159 8.6374134 10.1729400 33.229653 99.79839
Delaware
Private 1 0.19880716 2.73359841 10.6858847 1.3916501 0.00000000 19.6322068 1.29224652 0.49701789 63.56858847 16.302187 36.431412 0.1988072 2.7335984 10.6858847 1.391650 0.00000000 19.63221 1.29224652 0.4970179 63.56858847 16.302187 36.43141 0.1988072 2.7335984 10.6858847 1.391650 0.00000000 19.63221 1.29224652 0.4970179 63.5685885 16.302187 36.43141
Public 2 0.07495591 1.06891062 5.4320988 5.1398681 0.09259259 12.2583580 2.53968254 1.49470899 4.91244030 19.069665 91.141975 0.1739346 2.7787234 35.8750214 5.790834 0.10315324 41.41798 3.08926360 3.8858580 6.88523250 47.810930 93.11477 0.2729134 4.4885362 66.3179441 6.441799 0.11371390 70.57760 3.63884467 6.2770071 8.8580247 76.552195 95.08756
Florida
Private 28 0.00000000 0.00000000 0.0000000 0.0000000 0.00000000 0.5890052 0.00000000 0.00000000 0.00000000 0.000000 64.899108 0.3882364 1.9424517 20.4182202 12.287918 0.23402674 51.01961 2.02478246 5.1267983 6.55795435 37.295635 93.44205 1.4414414 7.5471698 88.5262117 39.814815 1.85185185 100.00000 5.40540541 19.7552020 35.1008916 94.187298 100.00000
Public 10 0.08466035 1.62371312 2.6378897 8.3478813 0.00000000 11.4553517 2.10038299 0.40040040 1.59607545 24.940048 91.316040 0.2292512 3.9775278 9.8004227 20.966752 0.18384279 55.80071 3.21755099 1.6367859 4.18716044 38.375347 95.81284 0.6110141 7.0785903 18.7411295 62.912719 0.48525041 70.74341 4.87362362 3.2006308 8.6839605 80.151179 98.40392
Georgia
Private 24 0.00000000 0.00000000 2.7280477 0.0000000 0.00000000 0.0000000 0.00000000 0.00000000 0.00000000 9.974425 80.620428 0.2957594 2.9849375 35.7465735 4.155618 0.09349994 43.71465 1.70773236 7.2180669 4.08316281 44.984121 95.91684 0.7262164 15.4445122 100.0000000 10.329068 0.35149385 85.25149 5.95647194 24.6800731 19.3795717 100.000000 100.00000
Public 20 0.06923709 0.10172940 4.2579681 1.1508951 0.00000000 4.6998983 0.56265985 0.03662109 0.34469552 16.361488 78.575447 0.2471032 3.6693637 33.5559831 5.969450 0.11231804 49.35042 2.59409998 1.6700013 2.83126455 46.148318 97.16874 0.4337866 15.5047817 91.4417887 19.715698 0.27541312 81.80744 4.10615924 4.2027194 21.4245532 94.217425 99.65530
Hawaii
Private 2 0.41187575 15.80573194 5.5152395 5.8780842 2.19667067 16.5094340 9.61538462 2.36828557 2.14078374 54.384761 85.807448 0.6413515 21.0741287 5.7179767 9.494731 8.88135420 22.78192 13.27688743 9.9649846 8.16666783 59.086430 91.83333 0.8708273 26.3425254 5.9207139 13.111378 15.56603774 29.05440 16.93839025 17.5616836 14.1925519 63.788099 97.85922
Public 1 0.50968400 20.54026504 1.1722732 11.6462793 9.96432212 22.8338430 28.77166157 0.33129460 4.23037717 72.604485 95.769623 0.5096840 20.5402650 1.1722732 11.646279 9.96432212 22.83384 28.77166157 0.3312946 4.23037717 72.604485 95.76962 0.5096840 20.5402650 1.1722732 11.646279 9.96432212 22.83384 28.77166157 0.3312946 4.2303772 72.604485 95.76962
Idaho
Private 3 0.52447552 0.57142857 1.1428571 5.1428571 0.00000000 63.3741259 1.71428571 3.42857143 0.57142857 10.285714 93.094406 0.9537534 1.5230550 1.5703699 9.407099 0.27908026 75.69288 1.92287096 5.3730200 3.27787530 15.656228 96.72212 1.7142857 2.7972028 1.9230769 15.297203 0.43706294 85.71429 2.09790210 6.6433566 6.9055944 23.076923 99.42857
Public 4 0.59387232 0.90613383 0.7667286 4.9953532 0.18800205 71.7849430 2.31588354 2.67193309 2.60223048 11.640335 89.887557 1.2274179 1.3941204 1.1870040 7.730865 0.25449549 76.66083 2.75868094 3.1647368 5.62185138 14.552583 94.37815 2.3001859 1.9615783 1.6061547 9.668421 0.31943132 83.08550 3.21312596 3.6786060 10.1124432 17.285284 97.39777
Illinois
Private 46 0.00000000 0.00000000 0.0000000 0.0000000 0.00000000 2.5225225 0.00000000 0.00000000 0.00000000 0.000000 51.671309 0.2809328 4.9724991 10.1155932 10.165297 0.19748833 61.80488 2.47353844 5.0380218 4.95174849 28.205349 95.04825 1.1702128 15.3543307 59.4594595 32.447113 0.84772370 100.00000 7.98747063 22.3085461 48.3286908 95.135135 100.00000
Public 9 0.08639787 0.89756535 4.8427116 4.9814877 0.03838035 5.2005373 0.00000000 0.49963619 2.04293629 19.980597 77.933097 0.1691045 6.0218893 20.4775320 12.008570 0.12310473 49.25631 1.89414275 3.3167118 6.73263146 40.694344 93.26737 0.2596953 18.6956988 68.7967761 31.727494 0.25304136 77.45816 2.76066178 9.6334677 22.0669030 78.257532 97.95706
Indiana
Private 19 0.00000000 0.00000000 2.3869347 1.4925373 0.00000000 3.4825871 0.41493776 0.00000000 0.00000000 5.800575 86.287996 0.2926445 2.0916343 12.6468480 4.713684 0.06698609 69.18740 2.45706312 2.6448209 5.89892250 22.268860 94.10108 0.7462687 4.7048198 92.2885572 12.336892 0.32618826 87.77565 6.50112867 7.4788681 13.7120036 96.268657 100.00000
Public 4 0.12587751 1.02777170 3.4331948 2.1999146 0.04373497 66.3202609 1.91157625 1.74939864 0.69975946 9.490488 92.429644 0.2103463 1.1449611 7.8016160 2.803577 0.07384205 79.68787 2.21488553 2.4348506 3.62804966 14.249228 96.37195 0.2730790 1.3578093 16.6805735 3.350278 0.11619463 88.06035 2.83698703 4.1103849 7.5703558 24.205416 99.30024
Iowa
Private 25 0.00000000 0.49592632 1.0272759 0.3478261 0.00000000 3.1250000 0.00000000 0.00000000 0.00000000 5.043478 31.568088 0.3556412 2.8814917 4.6048481 4.795584 0.44292186 72.39513 1.80619894 4.5756429 8.14253662 14.886686 91.85746 1.0273973 28.1250000 16.1920715 12.891344 7.05962988 92.46575 4.26758939 20.9302326 68.4319120 44.791667 100.00000
Public 3 0.16683350 0.90543260 2.5874837 3.1606304 0.05868545 67.1271271 1.76056338 2.25519785 5.24815560 8.936955 87.610944 0.2029154 2.3862456 2.7585718 4.382269 0.07848785 74.27715 1.93805610 4.2445687 9.73173972 11.746545 90.26826 0.2323218 3.5235235 2.8461795 5.772439 0.09002468 83.55969 2.10210210 5.9859860 12.3890557 14.497831 94.75184
Kansas
Private 11 0.30349014 0.32154341 2.5723473 2.8938907 0.00000000 46.0085837 0.96463023 0.45523520 0.00000000 8.360129 95.176849 1.0486529 1.4372847 8.7345840 7.369121 0.31167230 69.91048 2.52445367 6.7549818 1.90877293 21.425768 98.09123 1.8261752 4.8820179 18.0257511 12.535211 0.60698027 79.04584 6.26992561 21.6309013 4.8231511 31.673820 100.00000
Public 8 0.37955261 0.00000000 0.0000000 0.0000000 0.00000000 0.0000000 0.00000000 0.00000000 0.00000000 13.547920 74.235081 13.0673817 1.9318196 3.7788280 5.377596 0.09570302 59.66256 3.02190640 3.6233675 9.44084053 27.273234 90.55916 100.0000000 6.2687563 5.5285095 8.596199 0.14536057 77.71092 4.67778868 16.9294853 25.7649186 100.000000 100.00000
Kentucky
Private 17 0.00000000 0.00000000 0.0000000 0.0000000 0.00000000 62.3072178 0.00000000 0.00000000 0.00000000 0.000000 92.288711 0.2648063 1.0410234 7.9457530 2.382887 0.09232473 76.81884 1.82047481 7.0383214 2.59557407 13.547269 97.40443 0.8522727 3.3165105 18.3903072 5.965909 0.26609899 96.15385 5.42874769 20.1060204 7.7112893 28.130783 100.00000
Public 8 0.15769213 0.37097358 3.7006877 1.4657980 0.05353797 29.0765172 1.28483532 0.27828023 1.26649077 7.111835 92.638529 0.2218325 1.3789247 12.8920323 2.420546 0.10569806 73.88816 2.15688352 3.0859078 3.85001475 19.175918 96.14999 0.3644798 3.4738648 53.1398417 3.770697 0.21108179 90.20992 3.63619498 11.7678100 7.3614705 57.889182 98.73351
Louisiana
Private 7 0.00000000 0.00000000 8.2202650 0.5000000 0.00000000 0.3333333 0.00000000 0.00000000 0.00000000 21.367928 89.605649 0.5354838 4.0767140 34.7178595 4.855675 0.08057357 46.91237 1.93524502 3.1221840 3.76389907 46.201551 96.23610 0.9693053 12.9032258 92.3333333 12.494226 0.17421603 70.27464 3.55411955 7.1593533 10.3943506 93.583333 100.00000
Public 11 0.14265335 0.22202487 12.8374165 0.6181645 0.00000000 2.2202487 0.00000000 0.07284205 0.00000000 17.318486 93.350661 0.7328287 1.7430960 30.9478606 3.422808 0.06242392 54.30882 2.35380103 3.4826453 2.94571532 39.262818 97.05428 1.8595041 7.1041802 90.7637655 9.096816 0.14253898 71.90970 5.22537447 12.1730861 6.6493394 93.472469 100.00000
Maine
Private 8 0.00000000 1.32969656 0.7772021 1.2754705 0.00000000 60.0974553 0.00000000 0.00000000 0.00000000 7.228094 84.715026 0.2935869 3.6471025 3.5751147 5.240684 0.05904883 70.18189 3.07155363 8.5203383 5.41067965 15.887091 94.58932 0.5181347 6.3157895 6.0195987 12.686981 0.23331778 87.75988 6.48199446 29.9352199 15.2849741 30.470914 100.00000
Public 8 0.28301887 0.35149385 0.5660377 0.7547170 0.00000000 63.3006782 0.00000000 2.54716981 0.00000000 2.452830 91.559910 1.3500610 0.8111373 1.8375999 1.622170 0.07849710 77.58514 1.75485137 12.1270052 2.83354204 7.454316 97.16646 2.7160494 1.9458946 3.2098765 2.962963 0.18867925 95.00000 2.46045694 21.0248681 8.4400904 10.740741 100.00000
Maryland
Private 9 0.00000000 0.00000000 0.0000000 0.0000000 0.00000000 47.6137002 0.00000000 0.00000000 0.60157335 0.000000 82.360097 0.2394598 5.2044198 11.5754389 5.121979 0.09840825 65.31147 2.14089430 4.9964023 5.31152498 24.380600 94.68848 0.7959479 13.6580573 29.3151319 7.876655 0.21707670 94.53125 5.12820513 18.7264151 17.6399027 39.819528 99.39843
Public 8 0.03191829 0.15959145 7.9023823 1.9470156 0.00000000 1.4363230 1.91509735 0.51961548 1.51074956 22.486926 89.435046 0.1963727 2.3802636 43.4784274 4.144900 0.09675264 39.51956 3.35984981 2.8240257 3.99984778 53.656566 96.00015 0.3876853 4.7072022 83.4240562 7.088902 0.18685768 73.03893 4.41603719 5.3578640 10.5649537 90.640913 98.48925
Massachusetts
Private 48 0.00000000 1.06382979 0.8571429 2.4316109 0.00000000 13.3484163 0.00000000 0.00000000 0.00000000 8.376609 61.381074 0.2022323 6.1958632 7.3737226 7.665594 0.07901937 54.86901 2.98601752 9.6945052 10.93403489 24.502449 89.06597 0.6787330 23.1597073 35.8123570 22.034495 0.49751244 91.48936 7.69230769 39.8933333 38.6189258 59.267735 100.00000
Public 9 0.10770888 1.10741971 2.2044088 2.8056112 0.00000000 60.5727924 0.66800267 1.80360721 0.39660057 9.485638 95.662027 0.3806267 2.7624481 5.6671475 7.166784 0.06016574 74.60185 2.21910697 5.6799776 1.46189842 18.256279 98.53810 1.2024048 6.3961814 8.3852691 10.531995 0.19093079 88.24315 3.79686758 16.5632458 4.3379735 23.146649 99.60340
Michigan
Private 29 0.00000000 0.00000000 0.0000000 0.0000000 0.00000000 28.1250000 0.00000000 0.00000000 0.00000000 0.000000 78.730252 0.4275021 2.9250017 11.8024390 3.860043 0.30408100 66.64480 2.07622499 6.9982826 4.96162253 21.395291 95.03838 3.1250000 23.0386052 65.6250000 13.399649 5.75022462 100.00000 4.79123888 22.9110512 21.2697484 71.875000 100.00000
Public 11 0.23213249 0.54009140 1.3382167 1.6481195 0.00000000 54.4056857 0.00000000 0.43038176 1.59799155 6.916467 84.603465 1.1481290 2.5091218 8.9676443 3.402197 0.07528080 71.69357 2.31556707 3.0733725 6.81511362 18.417939 93.18489 8.6414624 7.4588440 18.1197268 4.670904 0.12691276 83.19519 3.14271684 6.1654288 15.3965347 31.644789 98.40201
Minnesota
Private 13 0.04823927 0.46674446 0.7001167 0.9334889 0.00000000 50.4587156 0.00000000 0.00000000 0.00000000 3.383897 86.444766 0.6100444 4.2869347 3.2824122 4.028990 0.05300242 73.40600 2.73655982 7.8374794 3.75858135 14.997944 96.24142 3.5460993 8.9450656 8.1081081 6.319883 0.18832392 94.63244 6.37065637 31.8639517 13.5552340 23.869713 100.00000
Public 4 0.10363888 1.02286402 1.6646611 1.8852788 0.02005616 55.5236385 2.02291696 1.56609857 1.94544725 8.190692 96.672041 0.9301768 4.0683518 5.8917111 2.951201 0.05694801 77.77681 2.80483632 3.1548336 2.36512843 16.703225 97.63487 2.6073004 12.0526631 17.9413525 4.751646 0.10771993 86.81567 4.14123279 5.4352186 3.3279595 39.664871 98.05455
Mississippi
Private 8 0.00000000 0.00000000 10.2941176 0.0000000 0.00000000 0.4153686 0.00000000 0.00000000 0.00000000 13.602941 92.174960 0.6126403 1.3950833 43.1429884 1.786073 0.03380897 47.56644 0.45078958 2.3691127 2.64306054 47.421384 97.35694 2.3809524 4.1567696 98.5555556 3.500243 0.17015070 84.92647 1.48274186 8.8964511 7.8250401 99.111111 100.00000
Public 8 0.04500450 0.27480077 14.1225614 0.3297609 0.00000000 2.8029678 0.00000000 0.00000000 0.00000000 21.219393 96.126726 0.2495251 0.9840197 50.9659836 1.488892 0.10082563 41.83147 1.15531855 0.7814013 2.44256433 54.944564 97.55744 0.5164366 1.9064125 93.0475405 3.089508 0.30228085 75.02111 2.13628989 3.6003600 3.8732744 95.603188 100.00000
Missouri
Private 26 0.00000000 0.34465780 0.9913259 0.7877893 0.00000000 42.7336156 0.00000000 0.00000000 0.00000000 5.328377 87.315634 0.5843973 1.9178370 8.6849277 4.605814 0.19764224 72.59577 2.43146465 5.6032551 3.37889531 18.422083 96.62110 2.2304833 7.5268817 37.0743634 16.129032 0.94395280 94.29988 5.94795539 23.5625705 12.6843658 49.013060 100.00000
Public 9 0.07812500 0.15625000 3.3564815 1.0284539 0.00000000 7.6562500 0.10755357 1.85185185 0.70312500 10.835467 82.453704 0.6143217 1.3297326 14.4737154 2.462243 0.09693745 66.59804 1.97585449 4.6847848 7.76436645 20.952804 92.23563 2.9217887 2.6157407 82.6562500 3.830394 0.20569078 80.66581 2.80986432 13.5062411 17.5462963 87.031250 99.29688
Montana
Private 4 1.25000000 0.00000000 0.5820722 1.6298021 0.23282887 21.3038417 0.00000000 0.00000000 0.00000000 10.277778 95.332671 20.3219605 1.1499986 1.6457009 4.146048 0.36368964 63.52879 2.48790826 4.1854713 2.17043496 30.115306 97.82957 76.2514552 2.1486124 3.1777557 8.236347 0.49652433 78.68056 6.65342602 9.4444444 4.6673287 78.696158 100.00000
Public 1 1.67865707 0.95923261 1.1031175 1.6306954 0.00000000 77.5539568 0.04796163 6.37889688 10.64748201 5.419664 89.352518 1.6786571 0.9592326 1.1031175 1.630695 0.00000000 77.55396 0.04796163 6.3788969 10.64748201 5.419664 89.35252 1.6786571 0.9592326 1.1031175 1.630695 0.00000000 77.55396 0.04796163 6.3788969 10.6474820 5.419664 89.35252
Nebraska
Private 11 0.00000000 0.00000000 1.7366136 2.7900147 0.00000000 57.0907987 0.00000000 0.00000000 0.00000000 9.261939 95.227606 0.5116739 2.2486971 6.4570358 6.353040 0.28972707 76.04051 2.02700758 4.6043600 1.46794375 17.887182 98.53206 1.0660981 9.0092278 15.4411765 14.705882 0.73529412 90.73806 6.75477239 16.2060937 4.7723935 32.598039 100.00000
Public 4 0.27056277 0.54755043 1.6504329 3.3008658 0.02705628 75.6676558 1.16341991 1.81277056 0.00000000 11.147186 91.801948 0.6945519 1.8367904 3.8739236 5.259272 0.18187061 77.82262 2.05152180 5.6469099 2.63254251 13.897930 97.36746 1.1869436 4.7348485 7.2829132 6.599424 0.36267722 78.84199 2.73656446 8.4438040 8.1980519 17.326931 100.00000
Nevada
Private 1 2.98363811 1.82868142 3.9461020 3.1761309 0.96246391 69.2974013 0.28873917 15.20692974 2.30991338 13.185756 97.690087 2.9836381 1.8286814 3.9461020 3.176131 0.96246391 69.29740 0.28873917 15.2069297 2.30991338 13.185756 97.69009 2.9836381 1.8286814 3.9461020 3.176131 0.96246391 69.29740 0.28873917 15.2069297 2.3099134 13.185756 97.69009
Public 2 0.42194093 2.05026455 1.8849206 16.1375661 0.82671958 43.5443038 2.54629630 5.05952381 0.00000000 26.620370 99.867725 1.7982721 6.2290704 6.6386628 19.179894 1.24317695 55.86607 3.22814112 5.7505777 0.06613757 38.317218 99.93386 3.1746032 10.4078762 11.3924051 22.222222 1.65963432 68.18783 3.90998594 6.4416315 0.1322751 50.014065 100.00000
New Hampshire
Private 8 0.00000000 0.60975610 1.0752688 1.7582418 0.00000000 25.0635486 0.00000000 6.58607753 0.00000000 5.376344 84.693554 0.5773033 2.7186457 5.3122559 4.107853 0.18477460 62.27110 1.57831958 18.3472421 4.90250116 14.479152 95.09750 1.6195618 12.3054938 17.1321384 7.480315 0.98425197 79.77642 3.89012385 62.8853353 15.3064465 30.422356 100.00000
Public 5 0.11764706 0.86011770 0.9683276 2.1276596 0.00000000 70.1176471 1.25643666 8.67460157 0.04034698 6.609325 96.355097 0.2754972 1.7038658 1.4185501 2.876995 0.01976929 79.01674 1.66209426 11.7142714 1.31221523 7.956772 98.68778 0.4979629 2.3529412 2.0803296 3.647059 0.04526935 84.47261 2.00000000 20.0000000 3.6449031 9.647059 99.95965
New Jersey
Private 15 0.00000000 0.00000000 0.0000000 0.0000000 0.00000000 9.4142259 0.00000000 0.00000000 0.00000000 0.000000 69.044898 0.1996066 4.0991606 13.3710016 10.922956 0.15660179 54.53586 1.29638963 8.0912045 7.32722061 30.045716 92.67278 0.7268322 15.8506429 52.0958084 28.770442 0.56323605 100.00000 4.18410042 38.1794651 30.9551020 80.588822 100.00000
Public 9 0.05635920 3.90323331 5.3140097 8.4066062 0.00000000 27.5596468 1.16608362 1.66861144 0.35005834 24.270712 83.064062 0.2038684 7.0899629 11.8147674 17.752549 0.20353163 50.54686 2.06493237 7.2281380 3.09539482 39.129612 96.90461 0.6280530 17.8564719 19.9606686 32.239430 0.64200977 73.71062 3.18150035 13.3612468 16.9359384 61.750246 99.64994
New Mexico
Private 1 2.08126858 0.99108028 23.9841427 20.7135778 0.39643211 33.7958375 0.00000000 17.74033697 0.29732408 48.166501 99.702676 2.0812686 0.9910803 23.9841427 20.713578 0.39643211 33.79584 0.00000000 17.7403370 0.29732408 48.166501 99.70268 2.0812686 0.9910803 23.9841427 20.713578 0.39643211 33.79584 0.00000000 17.7403370 0.2973241 48.166501 99.70268
Public 7 2.53878702 0.81782290 0.2057613 7.4074074 0.09402915 11.5226337 0.47793084 0.80233406 0.14587892 35.825106 92.900799 14.6967194 1.6622964 2.8202712 38.631326 0.24253118 31.05496 2.17870431 5.5063989 3.20678809 60.231849 96.79321 75.1028807 3.2753915 4.5967287 71.043034 0.50761421 55.99436 3.86579139 21.3663199 7.0992008 85.995624 99.85412
New York
Private 83 0.00000000 0.00000000 0.0000000 0.0000000 0.00000000 0.8598452 0.00000000 0.00000000 0.00000000 0.000000 49.529781 0.2509328 4.0604171 6.5272589 7.937922 0.13055783 66.48749 1.81357899 5.9502243 6.84161446 20.720667 93.15839 1.6798419 14.8435280 59.2019544 82.287188 3.40768278 100.00000 5.70012392 40.6980803 50.4702194 97.162511 100.00000
Public 1 0.16386727 10.07783695 9.6169603 16.1102007 0.27652601 46.3334699 3.40024580 1.61818927 12.40270381 39.645637 87.597296 0.1638673 10.0778370 9.6169603 16.110201 0.27652601 46.33347 3.40024580 1.6181893 12.40270381 39.645637 87.59730 0.1638673 10.0778370 9.6169603 16.110201 0.27652601 46.33347 3.40024580 1.6181893 12.4027038 39.645637 87.59730
North Carolina
Private 35 0.00000000 0.00000000 3.4831461 0.2219756 0.00000000 0.3074558 0.00000000 0.00000000 0.00000000 10.744681 81.716700 0.6779701 1.8996102 28.5820150 3.581534 0.09036037 52.36635 1.93381175 7.6767985 3.19155315 36.765301 96.80845 1.9292605 14.3037336 95.4545455 10.286225 0.51361068 85.49107 4.01129944 30.1282051 18.2832997 98.113208 100.00000
Public 15 0.24409187 0.37493305 3.2249675 1.3390466 0.02601795 11.4739170 0.79674521 0.94394144 0.21424746 11.366914 89.046456 1.5548657 2.5315037 27.5982550 4.897334 0.10334125 54.65034 2.86331034 3.1809657 2.62008187 39.548610 97.37992 14.8189504 8.2718380 76.0114479 7.342683 0.24026085 85.69289 4.59290188 8.1949652 10.9535438 84.142058 99.78575
North Dakota
Private 3 1.54639175 0.00000000 2.7411890 1.7391304 0.17799929 78.6956522 0.00000000 0.00000000 0.86956522 10.323959 93.917526 2.5462036 0.7577387 4.8969958 3.225991 1.03199110 80.31317 1.81461245 2.5382209 2.87507758 14.273532 97.12492 4.3478261 1.1391954 7.8260870 4.948454 2.60869565 81.75258 3.91304348 7.5115700 6.0824742 20.434783 99.13043
Public 6 0.73913338 0.18501388 2.3077955 1.6613549 0.08137248 69.2082111 1.76307046 0.55504163 2.68270120 8.184716 87.947214 1.2560680 0.9218368 3.8843095 3.794283 0.24022673 78.61656 2.60341936 2.4711061 6.21219071 12.700144 93.78781 1.7576318 1.4918288 6.4754857 4.897361 0.46253469 83.88970 2.96022202 4.0889672 12.0527859 16.558742 97.31730
Ohio
Private 46 0.00000000 0.00000000 0.0000000 0.0000000 0.00000000 0.5167959 0.00000000 0.00000000 0.00000000 0.000000 72.000000 0.2670138 1.7891142 10.7333653 3.175560 0.07934518 71.33020 2.28516381 6.3545010 3.98573686 18.329563 96.01426 1.5625000 14.3255037 92.7648579 10.052678 0.47846890 100.00000 10.88888889 42.8888889 28.0000000 94.315245 100.00000
Public 9 0.14717459 0.05711022 5.0859430 0.5886508 0.00000000 1.4848658 0.85665334 0.68875430 0.45688178 8.264657 88.944306 0.2287374 1.8620747 19.7214203 2.594275 0.07086432 64.48507 2.30621284 2.9103394 5.82101015 26.783584 94.17899 0.5886508 3.2424320 95.3169617 4.316249 0.11281321 84.81281 3.45564660 5.3214033 11.0556941 97.087379 99.54312
Oklahoma
Private 10 1.93939394 0.00000000 3.9085861 0.0000000 0.00000000 45.5104551 0.00000000 0.00000000 0.00000000 17.343016 66.666667 5.3816774 2.0479018 12.3772645 5.324622 0.17304398 57.06790 3.95654188 3.0050619 10.66598638 29.261052 89.33401 16.6666667 8.2020202 27.6413690 7.842574 0.37202381 70.23749 8.03436079 9.7960356 33.3333333 42.857143 100.00000
Public 10 1.49073328 0.00000000 2.3325062 1.2489927 0.00000000 13.6583400 0.00000000 0.00000000 0.20145044 21.191136 91.152019 8.7512386 1.4789009 14.2124162 6.925812 0.17863313 54.59074 7.64041271 2.3912953 3.83055596 39.187414 96.16944 21.1070999 3.0285036 79.3311845 16.255778 0.65017157 67.74129 17.59305211 7.7100646 8.8479810 82.070911 99.79855
Oregon
Private 12 0.16895969 1.19521912 0.9633911 3.0222222 0.19267823 57.3076923 0.79908676 1.56028369 0.00000000 14.666667 90.035971 0.9628141 5.1739799 2.1958425 7.599922 0.61496163 67.77156 6.27319696 4.7615602 4.64616427 22.820717 95.35384 2.5531915 13.9285714 5.8130401 10.473458 1.53846154 79.57447 11.94605010 7.9623288 9.9640288 32.967033 100.00000
Public 8 0.45438658 2.08263352 1.2233485 5.2778749 0.24466970 53.8461538 0.26450653 2.19538513 1.00938967 15.630988 86.989542 1.2139270 5.2652446 2.1188789 7.632290 0.75706647 66.50313 3.88804520 6.8906769 5.73074021 20.875452 94.26926 2.2721051 9.9615519 3.1990179 9.618718 2.41362209 76.01971 5.72249533 25.3611018 13.0104582 26.819757 98.99061
Pennsylvania
Private 66 0.00000000 0.00000000 0.0000000 0.0000000 0.00000000 17.7130045 0.00000000 0.00000000 0.00000000 0.000000 39.798206 0.1989786 3.3304171 9.5445353 5.173646 0.08829435 67.25971 2.20629738 6.5500583 5.64806678 20.542169 94.35193 0.8011146 16.4534838 67.3758865 13.359274 0.63357973 100.00000 7.78210117 36.2745098 60.2017937 77.905074 100.00000
Public 16 0.06337136 0.09784736 3.2722746 1.1153712 0.05885815 1.1741683 1.59883721 0.39786149 0.19569472 7.284768 92.930506 0.1592404 1.9026603 12.6461132 4.477350 0.09547283 73.67511 2.32262608 3.0026774 1.71875386 21.603463 98.28125 0.3270349 10.1880752 85.5185910 10.265435 0.16272510 89.19484 3.71819961 8.0070392 7.0694945 93.248532 99.80431
Rhode Island
Private 7 0.12249898 1.12508273 2.0416497 4.8730549 0.06142506 31.7680686 0.08190008 2.97515887 1.89850310 10.462735 70.559412 0.4110147 5.4439026 3.6833485 7.316944 0.10987466 62.76910 1.90956765 8.7936652 9.56258072 18.874652 90.43742 0.8599509 15.7615353 5.9906328 9.737501 0.19854401 77.41010 4.09541444 12.9811630 29.4405880 32.218713 98.10150
Public 2 0.26552411 2.88161092 4.8458150 7.7726148 0.03620783 65.0156232 1.66647379 9.03830575 0.17359102 18.520307 96.632672 0.3179258 3.0520540 6.1203962 10.557989 0.07596759 66.89318 2.02206074 9.1899633 1.77045974 22.146393 98.22954 0.3703275 3.2224971 7.3949774 13.343363 0.11572735 68.77074 2.37764770 9.3416209 3.3673285 25.772480 99.82641
South Carolina
Private 18 0.00000000 0.00000000 1.4157014 0.2136752 0.00000000 0.0000000 0.00000000 0.00000000 0.00000000 10.977034 94.208494 0.2769395 1.0605547 40.1867030 2.587914 0.06761614 46.07725 1.23779295 6.2452569 2.25997292 45.417521 97.74003 1.0718114 3.0825023 98.8543372 5.694981 0.38610039 79.97587 3.19662244 36.1411087 5.7915058 99.615385 100.00000
Public 9 0.12008406 0.66046232 6.1627854 0.5403783 0.03002101 3.7826479 0.03002101 0.33200531 0.15010507 13.034726 92.821522 0.3312356 1.8439523 27.3711694 2.943275 0.09573345 61.30750 2.15979787 1.8345840 2.11275703 34.745164 97.88724 0.9743875 5.8661146 93.9057340 5.707127 0.25055679 80.49057 4.05974338 4.8309179 7.1784783 95.286701 99.84989
South Dakota
Private 4 0.28188865 0.22988506 1.7241379 1.1980268 0.00000000 65.7064472 0.00000000 0.16181230 0.00000000 7.701149 98.390805 1.2928915 0.7700026 4.7755315 4.141070 0.13636877 83.44967 2.04877840 2.4818948 0.90379675 13.164643 99.09620 2.3462783 1.2345679 11.3854595 7.544582 0.40453074 90.34483 3.38266385 8.5048011 1.6091954 24.417010 100.00000
Public 8 0.82047916 0.00000000 0.0000000 0.0000000 0.00000000 3.7140855 0.00000000 0.00000000 0.00000000 7.703200 94.419198 24.4195678 0.9537036 1.6379058 2.286078 0.18111660 64.41244 1.83529354 1.2716429 3.00224907 31.313665 96.99775 95.3749124 2.0019691 4.7259600 3.931380 0.50690786 85.94435 3.43102216 4.0766318 5.5808020 95.935529 100.00000
Tennessee
Private 28 0.00000000 0.00000000 3.4697509 0.0000000 0.00000000 0.0000000 0.00000000 0.00000000 0.00000000 7.918149 90.524988 0.3252366 2.0583604 23.7164958 3.712725 0.09281991 60.51757 1.76878686 5.1718908 2.63611455 31.674425 97.36389 1.7441860 7.6146934 100.0000000 19.023622 0.59842520 87.18861 5.92105263 16.6103984 9.4750118 100.000000 100.00000
Public 6 0.14992504 1.00808685 3.6687539 1.2407223 0.00000000 20.3168273 2.20477996 0.37922215 0.42527940 9.577564 88.984919 0.2546569 1.8373055 24.2343711 3.131386 0.08429978 60.84336 3.09108378 1.2563444 5.26718887 32.633103 94.73281 0.3956087 2.9678522 63.6977955 5.686876 0.18791415 82.90841 5.08357235 2.2154090 11.0150807 69.159189 99.57472
Texas
Private 38 0.00000000 0.00000000 2.2988506 2.9197080 0.00000000 1.5544041 0.00000000 0.00000000 0.00000000 19.744842 76.468811 0.4585051 3.0085943 23.4126611 20.448949 0.17174096 42.95614 1.85604476 2.2780272 5.40934134 49.356496 94.59066 1.2195122 16.6440115 85.4858549 66.091954 1.48148148 72.54062 4.44078947 13.9376997 23.5311886 97.435897 100.00000
Public 28 0.06619010 0.38287504 0.6221869 2.7168110 0.00000000 2.0121790 0.00000000 0.21180831 0.52210233 28.105470 76.393159 0.3932651 6.1144762 15.6597864 27.711786 0.18342767 40.06689 2.41757050 1.6085496 5.84424550 52.480312 94.15575 0.8812950 19.9393808 84.0787757 92.983850 1.62723601 70.44774 7.67691717 6.0636788 23.6068413 94.638602 99.47790
Utah
Private 2 0.36740585 1.94528277 0.5150243 5.5274898 0.49808893 72.4615624 2.87092925 1.17438656 0.12625171 12.314657 96.181603 0.4950085 2.8707407 4.9634148 6.474334 0.54100090 77.57706 3.12323526 1.9828800 1.97232412 18.467735 98.02768 0.6226112 3.7961986 9.4118054 7.421179 0.58391287 82.69256 3.37554127 2.7913734 3.8183965 24.620813 99.87375
Public 2 0.36603221 1.22139169 1.2329506 8.6244645 0.37759112 49.7110272 1.63751252 3.86482627 1.55274717 14.198197 91.102650 0.3829209 3.1333041 1.2526803 8.993592 0.46009494 58.96308 2.38785034 19.2014275 5.22504882 16.610442 94.77495 0.3998096 5.0452166 1.2724100 9.362719 0.54259876 68.21514 3.13818816 34.5380288 8.8973505 19.022688 98.44725
Vermont
Private 9 0.19098549 0.65645514 0.0000000 0.4347826 0.00000000 57.9868709 1.08043217 0.49656226 0.00000000 6.521739 89.006623 0.6739817 2.2384566 3.8700412 5.081177 0.12708871 69.09190 3.01438427 12.2761315 3.62683738 15.005129 96.37316 1.7218543 6.7114094 12.9102845 9.409190 0.37174721 85.59969 5.66248257 26.9565217 10.9933775 25.601751 100.00000
Public 3 0.24891101 0.76923077 1.2912259 1.4915694 0.00000000 80.3438083 1.95804196 1.29701686 0.76923077 7.198444 95.698507 0.6237871 1.6515307 2.0697896 2.998046 0.03368132 84.49991 2.28911520 3.3656885 2.46845153 9.665949 97.53155 0.9090909 3.0180460 3.4265734 4.145924 0.06993007 89.16991 2.57467330 4.7552448 4.3014935 11.309894 99.23077
Virginia
Private 22 0.00000000 0.00000000 4.1134752 0.6172840 0.00000000 0.9259259 0.00000000 0.00000000 0.00000000 14.588859 90.043290 0.4647529 2.3696297 20.4559626 4.547600 0.12292265 62.79519 2.72368219 3.3897708 3.13049068 30.684550 96.86951 1.4351476 7.9628015 98.4567901 13.106655 0.99290780 82.84704 5.42986425 27.2505187 9.9567100 99.074074 100.00000
Public 12 0.14864541 0.45771144 4.1380964 1.8323408 0.00000000 2.9253731 0.00000000 0.05882353 0.00000000 13.650939 91.517782 0.2622088 4.7385796 22.3630494 4.938981 0.21013459 57.47387 3.16721640 4.3530848 2.49287401 35.680170 97.50713 0.3569710 14.8270035 85.2338308 10.139642 0.76470588 81.41176 5.29038986 8.8358209 8.4822181 89.547038 100.00000
Washington
Private 11 0.14154282 3.39110776 0.9345794 3.4941764 0.00000000 47.9994500 0.26497085 1.45081387 0.38924275 18.498368 90.884092 0.6174093 6.9774762 2.8470087 8.162984 0.47652631 66.16276 5.94665967 4.3613593 4.44781491 25.028064 95.55219 1.1658718 13.9007287 6.0521932 12.400636 2.10993892 76.41296 8.77565464 10.1746185 9.1159082 32.710023 99.61076
Public 6 0.43160691 2.15690922 1.4873838 6.5134870 0.18592297 46.9252412 4.48970490 1.03585657 0.45034368 23.552457 85.090657 0.9861372 6.7831796 3.1227578 9.580315 0.43934539 62.69559 6.25798570 5.6243575 4.51033049 27.169721 95.48967 2.5835506 19.4310468 5.3804219 13.171783 1.16111535 74.36919 7.72908367 9.5770828 14.9093426 34.637370 99.54966
West Virginia
Private 4 0.25396825 0.19854401 0.0000000 1.3245033 0.09474183 61.2979631 0.00000000 2.05162144 0.33112583 2.980132 93.699668 0.5367935 1.1090403 5.6047722 1.909421 0.27063041 76.93228 1.35697207 8.7008041 3.57928715 10.787629 96.42071 0.9000474 2.7001421 9.9005211 2.558029 0.66225166 88.07947 2.84579749 15.2534344 6.3003316 17.148271 99.66887
Public 9 0.00000000 0.16648169 4.2227935 0.4854369 0.00000000 48.3356449 0.00000000 0.44901457 0.33296337 7.386785 92.980291 0.3231175 0.9062822 7.3868638 1.935252 0.03845896 78.28907 1.56114182 6.7828578 2.77695728 12.151117 97.22304 0.6434051 1.8680377 12.8745838 4.282316 0.08568980 88.01039 2.97172237 39.8751734 7.0197087 15.316315 99.66704
Wisconsin
Private 20 0.00000000 0.17123288 0.2849003 1.4245014 0.00000000 56.1322729 0.00000000 0.46044370 0.00000000 3.988604 87.829181 0.5601772 2.6697187 6.1876671 5.939914 0.17927190 73.88952 2.26783759 5.0565536 3.24934109 17.804587 96.75066 2.0547945 7.9470199 22.0939386 16.785266 0.66225166 95.15670 3.90469887 21.0526316 12.1708185 41.816660 100.00000
Public 2 0.19810698 2.97160467 1.6606344 2.3419204 0.14903130 69.6015849 1.64998935 0.44023773 2.17917676 9.367681 96.263572 0.2480848 3.1198240 5.0565994 7.334288 0.17356914 77.85487 2.64097531 0.6139873 2.95780213 18.573341 97.04220 0.2980626 3.2680434 8.4525644 12.326656 0.19810698 86.10815 3.63196126 0.7877369 3.7364275 27.779001 97.82082
Wyoming
Public 1 0.56162246 1.33385335 0.9438378 5.8424337 0.25741030 74.1497660 2.37909516 7.80811232 6.72386895 11.318253 93.276131 0.5616225 1.3338534 0.9438378 5.842434 0.25741030 74.14977 2.37909516 7.8081123 6.72386895 11.318253 93.27613 0.5616225 1.3338534 0.9438378 5.842434 0.25741030 74.14977 2.37909516 7.8081123 6.7238690 11.318253 93.27613

I am particularly interested in the the White, Asian, black,Hispanic, american indian alaska native and native Hawaiian pacific islander race and ethnic groups and the following inferences can be made from the summary table:

Public colleges

->By assessing the mean % enrollment for the White, Asian, black,Hispanic, american indian alaska native and native Hawaiian pacific islander race and ethnic groups in the public colleges, it can be observed that the states in the Mid-west and North-Central region have more % white students enrolled in the colleges compared to the other race/ethnic groups. States such as California, New Mexico and Hawaii have the least % white students enrolled in public colleges.

->the highest Hispanic % enrollment was observed in New Mexico, Texas,California, Florida and Arizona which is quite expected and the least % Hispanic enrollment was seen in states like Mississipi , Maine, Montana etc

->the highest Asian % enrollment as expected was observed in Hawaii, California and New York, New Jersey and the least was seen in North Dakota, West Virginia and Maine

-> the highest american indian alaska native % enrollment was observed in South Dakota, New Mexico and Kansas and the least % was in Illinois. New York and Pennsylvania

->the highest native Hawaiian pacific islander % enrollment was noted in Hawaii, Nevada and Oregon and the least % was observed in New Hamphsire and Montana.

Private colleges

->By assessing the mean % enrollment for the White, Asian, black,Hispanic, american indian alaska native and native Hawaiian pacific islander race and ethnic groups in the private colleges, it can be observed that the states in the Mid-west and North-Central region such as South and North Dakota, Utah, West Virginia, Kentucky and ebraska have more % white students enrolled in the colleges compared to the other race/ethnic groups. States such as California, New Mexico,Hawaii,Delaware have the least % white students enrolled in private colleges.

->the highest Hispanic % enrollment was observed in New Mexico, Texas,California, Florida and Arizona which is quite expected and the least % Hispanic enrollment was seen in states like Alabama,West Virginia, Mississipi and Delaware etc

->the highest Asian % enrollment as expected was observed in Hawaii,California,Washington,Massachusetts and the least was seen in Alabama, South Dakota and North Dakota

-> the highest american indian alaska native % enrollment was observed in Arizona,Montana and Alaska and the least % was in New Jersey, Pennsylvania and Delaware.

->the highest native Hawaiian pacific islander % enrollment was noted in Hawaii, Arizona and North Dakota and the least % was observed in Minnesota, Mississipi and Delaware

Is the finding from summary what you expected

I expected a higher early career pay, higher stem % and higher making world better % for the private colleges and similar mid career pay for both the private and public college students. From the data, definitive conclusions couldn’t be made, it seems like both private and public colleges have similar earning potential and quality of education.

I also expected higher % men enrollment compared to % women enrollment, this would apply to both private and public schools.The inference made from the data about % enrollment based on gender was infact reverse of what I had expected. More % female enrollment was noticed in both private and public colleges.

I also expected certain race/ethnic groups choosing colleges based on location but the choice of private/public school would be influenced by economic factors as well. The inferences from the data aligns well with my expectation.

Data Visualization using ggplots

Research Question 1

1) Early career pay

Comparison of median early career pay between the different types of colleges across the different states

Side by side comparison

ggplot(data=merged1_stats,
       aes(x=median_early_career_pay,
           y=reorder(state_code,median_early_career_pay),fill=type))+
    geom_col(position="dodge")+
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))+
  scale_fill_brewer(palette="Pastel2")+
  labs(title=str_wrap("Median Early career pay(in USD) for the private vs public college across the different states",width=60),
       x="Median Early career pay(USD)",
       y="State Name")+
   guides(fill=guide_legend(title="College Type"))

From the above plot, it is evident that Massachusetts had the highest median early career, out of which the public college in MA had a slightly higher median early career pay than the private colleges in MA. This was followed by California having the second highest median early career pay with the private college having slightly higher median early career pay than the public colleges.

Missouri and Alaska seem to be have the lowest median early career pay. In around 10 states, it seems like the median early career pay is slightly higher for the public colleges compared to the private colleges and in 11 states the opposite trend is observed and in the rest of the states both the private and public colleges have similar median early career pay.

Only in 5 states(MA,AK,MT,DE,AZ), the differences seem to be more pronounced with public colleges having higher median career pay.

Note: NV and PA have early career pay data only for the private college and Wyoming has the data only for public college, so a comparison between private and public college couldn’t be made.

Separate panels

#create a manual color palette
pal <- colorRampPalette(c("red", "lightyellow","lightblue","lightgreen","pink","violet","orange","green","blue"))
pal<-pal(52)

merged1_final%>%
  group_by(state_code,type)%>%
   summarise(median_early_career_pay=median(early_career_pay,na.rm=TRUE))%>%
  ggplot()+
  labs(title=str_wrap("Comparison of median early career pay in USD between the private and public colleges for different states in the United States",width=60),
       x="Median Early career pay(in USD)",
       y="State Name")+
  scale_fill_manual(values=pal)+
  geom_bar(aes(x=median_early_career_pay,
 y=reorder_within(state_code,-median_early_career_pay,type),
    fill=state_code),stat="identity",show.legend=TRUE)+
  facet_wrap(~type,scales="free_y")+
  scale_y_reordered()+
 guides(fill=guide_legend(title="State Name"))
## `summarise()` has grouped output by 'state_code'. You can override using the
## `.groups` argument.

The interpretations of this plot are the same as the interpretations for the plot above. In this plot, the median early career pay for private and public colleges are plotted in different panels and thus it would be easier to interpret the changes in median early career pay within each type of college for the different states. The plot is more readable as the bars are arranged in a sequential order.

Assess the min, median and max of the early career pay between the different types of colleges across the different states

From the above bar plots, we were able to only compare the median early career pay i.e. only the center of the data for each type of college and for each state. It would be useful to understand the spread of the data as well for which boxplots can be used. Since we have 52 states and most color palettes have limited color options, I manually created a color palette by clubbing color options from inbuilt color palettes.

# create a manual palette using ColorBrewer Palettes
mycolors<- c(brewer.pal(name="Dark2", n = 8), brewer.pal(name="Paired", n = 6),
brewer.pal(name="Accent", n = 8),
brewer.pal(name="Set3", n = 12),
brewer.pal(name="Set2", n = 8),
brewer.pal(name="Set1", n = 9),
brewer.pal(name="Pastel1",n=1))
## Warning in brewer.pal(name = "Pastel1", n = 1): minimal value for n is 3, returning requested palette with 3 different levels
# boxplot to compare the median career pay for the different types of colleges at different states
ggplot(data=merged1_final,
       aes(x=early_career_pay,
           y=state_code,
           fill=state))+
    geom_boxplot()+
    facet_wrap(~type)+
  scale_fill_manual(values=mycolors)+
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))+
  labs(title=str_wrap("Median career pay(in USD) for the private vs public college across the different states",width=60),
       x="Median career pay(USD)",
       y="State Name")+
guides(fill=guide_legend(title="State Name"))

Amongst the private colleges, the early career pay was the highest in California and the least in Alabama and Mississipi. The interquartile range(difference between Q3 and Q1) seems largest for CA,ME and PA and the smallest for DE, KS, MT, MT and NV.

Amongst the public colleges, the early career pay was the highest in MA, the variability is very low, that is due to just one public college in MA present in the data. The least early career pay was noted for Mississipi. The largest IQR was observed for Oregon, Alaska and Virginia.

2) Mid career pay

Comparison of median mid career pay between the different types of colleges across the different states

Side by side comparison

ggplot(data=merged1_stats,
       aes(x=median_mid_career_pay,
y=reorder(state_code,median_mid_career_pay),
           fill=type))+
    geom_col(position="dodge")+
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))+
  scale_fill_brewer(palette="Pastel2")+
  labs(title=str_wrap("Median Mid career pay(in USD) for the private vs public college across the different states",width=60),
       x="Median Mid career pay(USD)",
       y="State Name")+
guides(fill=guide_legend(title="College Type"))

From the above plot, it is evident that California had the highest median mid career pay, out of which the private colleges in CA had a slightly higher median mid career pay than the public colleges in CA. This was followed by Massachusetts having the second highest median mid career pay with the public college having slightly higher median mid career pay than the private colleges.

Missouri and Alaska seem to be have the lowest median mid career pay. in around 18 states, it seems like the median mid career pay is slightly higher for the public colleges compared to the private colleges and in 14 states the opposite trend is observed and in the rest of the states both the private and public colleges have similar median early career pay.

Only in 5 states(AK,MT,IA,DE,AZ), the differences seem to be more pronounced with public colleges having higher median career pay than private colleges

Note: NV and PA have early career pay data only for the private college and Wyoming has the data only for public college, so a comparison between private and public college couldn’t be made.

Separate panels

merged1_final%>%
  group_by(state_code,type)%>%
   summarise(median_mid_career_pay=median(mid_career_pay,na.rm=TRUE))%>%
  ggplot()+
  labs(title=str_wrap("Comparison of median mid career pay in USD between the private and public colleges for different states in the United States",width=60),
       x="Median mid career pay(in USD)",
       y="State Name",
       col="State Name")+
  scale_fill_manual(values=pal)+
  geom_bar(aes(x=median_mid_career_pay,
 y=reorder_within(state_code,-median_mid_career_pay,type),
    fill=state_code),stat="identity",show.legend=TRUE)+
  facet_wrap(~type,scales="free_y")+
  scale_y_reordered()+
guides(fill=guide_legend(title="State Name"))
## `summarise()` has grouped output by 'state_code'. You can override using the
## `.groups` argument.

The interpretations of this plot are the same as the interpretations for the plot above. In this plot, the median mid career pay for private and public colleges are plotted in different panels and thus it would be easier to interpret the changes in median mid career pay within each type of college for the different states. The plot is more readable as the bars are arranged in a sequential order.

Assess the min, median and max of the mid career pay between the different types of colleges across the different states

# boxplot to compare the mid career pay for the different types of colleges at different states
ggplot(data=merged1_final,
       aes(x=mid_career_pay,
           y=state_code,
           fill=state))+
    geom_boxplot()+
    facet_wrap(~type)+
  scale_fill_manual(values=mycolors)+
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))+
  labs(title=str_wrap("Mid career pay(in USD) for the private vs public college across the different states",width=60),
       x="Mid career pay(USD)",
       y="State Name")+
  guides(fill=guide_legend(title="State Name"))

Amongst the private colleges, the mid career pay was the highest in California and the least in Colorado. The interquartile range(difference between Q3 and Q1) seems largest for CA,ME,PA,VA and VT and the smallest for DE, KS, MT, MT and NV.

Amongst the public colleges, the mid career pay was the highest in Colorado(this is the one of the two outlier points in the plot). The least mid career pay was noted for Mississipi. The largest IQR was observed for Oregon,Virginia and Vermont.

3) Make world a better percent

Comparison of median make world a better percent between the different types of colleges across the different states

Side by side comparison

ggplot(data=merged1_stats,
       aes(x=median_make_world_better_percent,
y=reorder(state_code,-median_make_world_better_percent),
           fill=type))+
    geom_col(position="dodge")+
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))+
  scale_fill_brewer(palette="Pastel2")+
  labs(title=str_wrap("Median Make world better(in percent) for private vs public college across the different states",width=60),
       x="Median make world better(in percent)",
       y="State Name")+
  guides(fill=guide_legend(title="College Type"))

Separate panels

merged1_final%>%
  group_by(state_code,type)%>%
   summarise(median_make_world_better_percent=median(make_world_better_percent,na.rm=TRUE))%>%
  ggplot()+
  labs(title=str_wrap("Comparison of median make world better in percent between the private and public colleges for different states in the United States",width=60),
       x="Median make world better (in percent)",
       y="State Name")+
  scale_fill_manual(values=pal)+
  geom_bar(aes(x=median_make_world_better_percent,
 y=reorder_within(state_code,-median_make_world_better_percent,type),
    fill=state_code),stat="identity",show.legend=TRUE)+
  facet_wrap(~type,scales="free_y")+
  scale_y_reordered()+
    guides(fill=guide_legend(title="State Name"))
## `summarise()` has grouped output by 'state_code'. You can override using the
## `.groups` argument.

From the above plots, the following inferences can be made:

-> Amongst the private colleges,Arizona has the highest median make world better % followed by Alaska and Nevada has the lowest median make world better %. CA and MA have pretty low median make world better % which is not quite expected.

->Comparing the public colleges across USA, Mississipi has the highest median make world better % followed by Nebraska. The least median make world better % is taken by Illionis and Iowa.The median make world better % seem to gone up a bit for CA and MA amongst the public colleges but it must be noted that we had only 1 public college for each in CA and MA and thus making the accuracy of the data questionable.

-> Public and Private colleges do not seem to have a noticeable difference in the making world better %.

Assess the min, median and max of the make_world_better_percent between the different types of colleges across the different states

# boxplot to compare the make world better percent for the different types of colleges at different states
ggplot(data=merged1_final,
       aes(x=make_world_better_percent,
           y=state_code,
           fill=state))+
    geom_boxplot()+
    facet_wrap(~type)+
  scale_fill_manual(values=mycolors)+
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))+
  labs(title=str_wrap("title=Make world better in percent for the private vs public college across the different states",width=60),
       x="Make world better (in percent)",
       y="State Name")+
guides(fill=guide_legend(title="State Name"))
## Warning: Removed 21 rows containing non-finite values (`stat_boxplot()`).

Amongst, the private colleges, Wisconsin has the highest make world better % and Texas has the lowest make world better %. The spread of the data is highest for Lousiana.

Comparing the public colleges across the different states in the US,Texas has the highest make world better % and Vermont and Florida has the lowest make world better %. The spread of the data is highest for Nebraska.

4) Stem percent

Comparison of median stem percent between the different types of colleges across the different states

Side by side comparison

ggplot(data=merged1_stats,
       aes(x=median_stem_percent,
    y=reorder(state_code,median_stem_percent),
           fill=type))+
    geom_col(position="dodge")+
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))+
  scale_fill_brewer(palette="Pastel2")+
  labs(title=str_wrap("Median Stem(in percent) for private vs public college across the different states",width=60),
       x="Median Stem(in percent)",
       y="State Name")+
  guides(fill=guide_legend(title="College Type"))

By assessing the above plot,

Amonsgt the public colleges,the median stem % is the highest for MA and MI and the least for ME. The difference in stem % between private and public colleges is prominent for MA,MI CA,IL,CO, MO,FL, AZ ,AK, LA and DE with stem % being higher for the public colleges compared to private colleges

Amonsgt the private colleges,PA and MA have the highest median stem % and NV has the lowest median stem % . For 6 states(MS,GA,CT,MN,AR, IN), the difference in stem % between the private and public colleges is prominent,

Separate panels

merged1_final%>%
  group_by(state_code,type)%>%
   summarise(median_stem_percent=median(stem_percent,na.rm=TRUE))%>%
  ggplot()+
  labs(title=str_wrap("Comparison of median stem in percent between the private and public colleges for different states in the United States",width=60),
       x="Median stem (in percent)",
       y="State Name")+
  scale_fill_manual(values=pal)+
  geom_bar(aes(x=median_stem_percent,
 y=reorder_within(state_code,-median_stem_percent,type),
    fill=state_code),stat="identity",show.legend=TRUE)+
  facet_wrap(~type,scales="free_y")+
  scale_y_reordered()+
  guides(fill=guide_legend(title="State Name"))
## `summarise()` has grouped output by 'state_code'. You can override using the
## `.groups` argument.

Assess the min, median and max of the make_world_better_percent between the different types of colleges across the different states

# boxplot to compare the stem percent for the different types of colleges at different states
ggplot(data=merged1_final,
       aes(x=stem_percent,
           y=state_code,
           fill=state))+
    geom_boxplot()+
    facet_wrap(~type)+
  scale_fill_manual(values=mycolors)+
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))+
  labs(title=str_wrap("Stem percent for the private vs public college across the different states",width=60),
       x="Stem(in percent)",
       y="State Name")+
   guides(fill=guide_legend(title="State Name"))

Comparing the private colleges in the different states, the highest median stem % was recorded for CA and IN. The lowest stem % was noted for AZ, NE. The variability in the stem % was high for PA and CA.

Comparing the public colleges, highest median stem % was recorded for Colorado and lowest for MD and TX.Variability was high in stem % of MO.

Research Question 2

The second question of interest was to study if gender and race/ethnicity has an impact on selecting private/public colleges, this is studief for the different states in USA.

1) Gender

ggplot(gender,
       aes(x=enrollment,
           y=reorder(state_code,enrollment),
           fill=gender))+
  geom_bar(position="dodge",stat="identity")+
  facet_wrap(~type)+
  labs(title=str_wrap("Comparison of % enrollment by different gender groups across private/public colleges in USA",width=60),
       x="Enrollment in %",
       y="State Name")+
    guides(fill=guide_legend(title="College Type"))

gender%>%
  group_by(state_code,type,gender)%>%
   summarise(average_gender_percent=mean(enrollment,na.rm=TRUE))%>%
  ggplot()+
  labs(title=str_wrap("Comparison of gender in percent between the private and public colleges for different states in the United States",width=60),
       x="Enrollment (in percent)",
       y="State Name")+
  scale_fill_manual(values=pal)+
  geom_bar(aes(x=average_gender_percent,
 y=reorder(state_code,-average_gender_percent),
    fill=state_code),stat="identity",show.legend=TRUE)+
  facet_wrap(~type+gender,scales="free_y")+
  scale_y_reordered()+
  guides(fill=guide_legend(title="State Name"))
## `summarise()` has grouped output by 'state_code', 'type'. You can override
## using the `.groups` argument.

The above two plots provide similar interpretation, they are just different ways to presenting the same data.

Amongst the public colleges, the highest % women enrollment was noted in NY,the least % women enrollment was noted in MT, the highest % men enrollment was seen in MT and the lowest in NY.

Out of the private colleges, the highest % women enrollment was noted in WI,the least % women enrollment was noted in NJ, the highest % men enrollment was seen in NJ and the lowest in WI.

# boxplot to compare the percent of men and women enrolled at  different types of colleges in different states
ggplot(data=gender,
       aes(x=enrollment,
           y=state_code,
           fill=type))+
    geom_boxplot()+
    facet_wrap(~gender+type)+
  scale_fill_manual(values=mycolors)+
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))+
  labs(title=str_wrap("Percent of men and women enrolled in private vs public college across the different states",width=60),
       x="Enrollment by Gender(in percent)",
       y="State Name")+
  guides(fill=guide_legend(title="College Type"))

Out of the private colleges, % enrollment of Men was highest in NJ and lowest in WI. The range of % men enrollment is the largest for New Jersey followed by New York and smallest for New Mexico and Nevada.

Out of the private colleges, % enrollment of women was highest in WI and lowest in New Jersey. The range of % women enrollment is the largest for New Jersey followed by New York and smallest for New Mexico and Nevada.

Comparing the public colleges, % enrollment of Men was highest in MT and lowest in NY. The range of % men enrollment is the largest for SD and smallest for MT.

Out of the public colleges, % enrollment of women was highest in NY and lowest in MT. The variability in the data for both NY and MT is very small and is largest for SD.

2) Race and Ethnicity

Since there are too many race and ethnic groups, I picked those race groups that I am interested to explore further.

race.final<-race%>%
            filter(race=="american_indian_alaska_native"|
                   race=="white"|
                   race=="asian"|
                   race=="black"|
                   race=="hispanic")

Combined Race plot

ggplot(race.final,
       aes(x=enrollment,
           y=state_code,
           fill=race))+
  geom_bar(position="stack",stat="identity")+
  facet_wrap(~type)+
  labs(title=str_wrap("Comparison of different race-ethnic groups enrollment in private and public colleges across the different states in USA",width=60),
       x="% Enrollment",
       y="State Name")+
   guides(fill=guide_legend(title="Race"))

Amongst the private colleges,NY has the highest student enrollment, out of which almost 3/4th of the student population are whites .

% Asian population is most prominent in California. % Black student population is noticeable in states like NC,TX,TN,GA and FL. Hispanic student population more commonly seen in CA and TX.

American Indian Alaskan Native is a really small race/ethnic population group, small numbers are noticed in CA, NY and NC

Out of the public colleges,the enrollment rates are generally lower than the private colleges.The largest white student population is noticed in PA, Blacks are seen in GA. Small Asian population noticed in TX and it is interesting to see that CA has only a really tiny Asian population enrolled in the public colleges. Kansas has a small native american population.Hispanics most commonly seen in TX and then in GA.

Extra Analysis

Comparison of number of private and public colleges in different states in the USA

compare<-cost_final%>%
  filter(type=="Private"|type=="Public")%>%
  group_by(state_code,type)%>%
  summarise(n=n())
## `summarise()` has grouped output by 'state_code'. You can override using the
## `.groups` argument.
ggplot(compare,
       aes(x=n,
           y=state_code,
           fill=type))+
  geom_bar(stat="identity")+
  facet_wrap(~type)+
  labs(title=str_wrap("Comparison of number of private and public collegs",width=60),
       x="Number of colleges",
       y="State name")+
  guides(fill=guide_legend(title="Race"))

In this data, the largest number of the private colleges was present in New York, followed by California and the smallest number of private colleges were located in AK,NV and NM. The largest number of public colleges was located in PA followed by TX and the least number of public colleges were in DC,WY and VI.

Historical Tuition data

ggplot(data=hist_tuition,
       aes(x=as.factor(year),
           y=tuition_cost,
           group=tuition_type))+
  geom_point(aes(color=tuition_type))+
  geom_line(aes(color=tuition_type))+
  facet_grid(.~type)+
  theme_minimal()+
  theme(legend.position = "bottom")+
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))+
  labs(title=str_wrap("Line plot of historical tuition data for the years 1985 to 2017",width=60),
       x="Year",
       y="Tuition cost in USD")+
  guides(fill=guide_legend(title="Tuition type"))

From the above line plot, increasing trend in the tuition cost is observed for all the different tuition types and for the different type of colleges.

Especially for the private colleges, the increasing trend is more prominent. In the private colleges, the “4 yr current” and “all year current” rates have constantly gone up across the years whereas the “2 yr current”, “2 yr constant”,“4 yr constant” and “all year constant” tuition costs slightly went up and then dropped a bit. This trend is expected. Also, the baseline tuition fees i.e. tuition cost at 1985-86 year is highest for “4 year constant” and “all constant” , followed by 2 year constant and then by “all current”,“4 year current” and “2 year current”.

For the public colleges, the baseline tuition fees is same as observed for the private colleges. However, the increasing trend is not very dramatic as observed for the private colleges. The trend is gradual over the years and the rate at which the tuition cost increases for the different tuition types are fairly similar.

Final Summary (10 points)

Summarize your research question and findings below.

I had two research questions:one was to identify how the earning potential and quality of education are different for students attending private and public colleges across the different states in the USA. The second question was to assess if gender/ race/ethnicity has an influence on college selection. For example, would a student belonging to a particular gender/race group pick a public/private college or have preference in attending college in a particular state.

Overall,I was able to draw some inferences from the summary tables and the bar plots/boxplots.

Findings for Q1

To answer the first question of interest, I used two variables i.e. early_career_pay and mid_career_pay to evaluate the earning potential of a student. I used two more variablesmake_world_better_percent and stem_percent to assess if a attending particular type of college at a particular location influences the quality of education one would receive.

For early-career pay:

There are a few states which have higher median early career pay for public colleges compared to private colleges and vice-versa and in only 5 states (MA,AK,MT,DE,AZ), the differences seem to be more pronounced with public colleges having higher median career pay.

Also,data was not available for some of the important states like CA, MA which has a lot of public colleges and hence a true estimate of the median career pay is not available for some of the states and hence this makes it harder to conclude if there is a significant difference between the earning potential of students attending private and public colleges.

From the boxplots, we were able to assess the spread of the data and the variability in data for some of the states likes CA and ME(for private colleges) and VA,VT (for public colleges) was pretty high. Due to the variability being different for different states and presence of outliers, median seem to be a more reliable measure to assess the central tendency of the data.

Overall, with the given data, it is very hard to draw an inference that attending a particular type of college actually influences the earning potential of a student.

The largest difference in the median early career pay amongst the 5 states where the difference in median early career pay was noticeable is $10,000. I would think that just by a difference of $10,000,I would conclude that both the public and private colleges seem to have comparable median early career pay.

Also, while comparing the early career pay, It is important to account for the cost of living which includes house pricing,food/healthcare/transportation costs in each of the state before we make a comparison between the early career pay

I used websites like nerdwallet.com to understand the differences in living cost between cities. For example: an annual income of $100,000 in San Francisco equals to $67,550 in Portland,OR and equals to $81,389 in Boston,MA.

This comparison is only for a single person, it would dramatically change if there are more dependents/children and in that case, child care cost will also play a huge role in assessing the cost of living.

For mid-career pay:

My conclusions for comparing median mid career pay between private and public colleges are similar to that of the early career pay since there were some states where the student from public colleges had higher median mid career pay than the private colleges and vice-versa and only in 5 states(AK,MT,IA,DE,AZ), the differences seem to be more pronounced with public colleges having higher median career pay than private colleges. Infact, the trends between early career pay and mid career pay across the states is similar.

The largest difference in median mid career pay of `$15,000 was noted in Arizona. It is also interesting to see the public colleges having similar and better earning potential in some states.

Thus, with the given constraints of the data,I did not see a convincing difference between the earning potential of private and public colleges.

For Make world a better percent:

Public and private colleges didn’t seem to have an observable difference in make world a better % thereby implying the quality of education is comparable between the private and public colleges across the states.

It was interesting to see that students graduating from colleges/states that were having high earning potential didn’t seem to have high % for making the world better.

For Stem percent:

In some of the states, the private colleges seems to have higher stem % rating that the public colleges and the vice-versa was observed in some states. Prominent difference in stem % rating was observed in 13 states(LA,DE,AK,AZ,FL,HI,MO,CO,VA, IL,CA,MT,MA)with public colleges have higher ratings than the private colleges and and 6 states (VT,AR,MN,CT,GA,MS) with private colleges having higher stem ratings than the public colleges.

Although, this is not very appropriate to make to make inferences from the plot since some of the states had one or two colleges and making inferences from a limited data could potentially result in incorrect conclusions. For example CA, MA had only 1 public college and MT had 2 public and 2 private colleges, so we really can’t take the prominent difference as an accurate reflection of all the colleges in the respective states.

However, by assessing the differences between the private and public college stem ratings, public colleges seem to be performing well in terms of imparting stem knowledge across most states and on par with private colleges.

Findings for Q2

Gender

From the summary table and the data visualizations, it seems like the % women enrollment is slightly higher in both public and private colleges compared to % men enrollment. Public colleges in a couple states like NY and MT had higher % men enrollment than women but the rest of the states pretty much have comparable enrollment % or %women enrollment exceeding %men enrollment.

Race

Race did seem to have an influence in type of college selection which kind of aligns with literature. In some states, I was able to notice similarities in race % enrollment between private and public colleges and in some states, I couldn’t find that connection, this could be due to some states having very less public colleges in the data set and hence a reliable comparison can’t be made.

For example: from the barplots, it was observed that Hispanics enrolled in colleges in areas where they are densely populated like Texas, California, New Mexico etc. Similarly, Asians were noticed prominently in California,Texas and Whites enrollment is prominent in Central North and Mid West colleges. Blacks were more prominently seen in Texas and Tennesse.American Indian and Alaska native were present in small numbers in NC and CA,

Are your findings what you expected? Why or Why not?

I was expecting to see that early career pay being influenced by the type of college one would attend as I expected that the private college students may have better exposure to career fairs, networking opportunities and the brand name would help them to get a really good entry level job whereas students from public colleges would also get new grad jobs but may not have a comparable compensation like the private college grads. But form this data,I didn’t observe this relationship and it seemed like both the private and public colleges have similar early career pay.

For the mid career pay, my expectation aligns with the findings, I wasn’t expecting to see the a remarkable difference in the mid career pay between students from private and public colleges.The rationale being, once a student enters industry, promotions and pay rise would be based on performance and not based on the type of college one attended.

I expected to see the private colleges having better stem % and higher making the world better percent since they would have more funding, infrastructure to support stem activities and in general projects that would help student to have an impact. But I was surprised to see that some of the public colleges having higher stem % and making world better % compared to the private colleges and overall from the data, it seemed like the public and private college students had similar quality of education.

My expectation of % enrollment based on gender was actually the opposite, I was expecting to see more % men enrollment atleast in some of the states but findings from the data was opposite. This could also be because the data as such didn’t have % men enrollment. I subtracted the women enrollment from total enrollment to obtain % men enrollment and the trend observed could be due to this assumption.

My expectation of % enrollment based on race was pretty accurate. Based on literature and online sources, it seemed like certain race/ethnic groups prefer to attend colleges in certain parts of the country however the difference in % enrollment between the private and public colleges was not clear from the plots.

I would expect there shouldn’t be a significant difference in the %enrollment based on race/ethnicity between private and public colleges, economic factors might play a role here as private colleges are mostly much more expensive than public colleges.

Some limitations noted in the dataset

There were some limitations in the data set due to which some of the conclusions may not be very reliable

-> some states had very limited data from public/private college, thus the comparisons made for some of the states will not be accurate

->there were lot of missing data and I had to exclude them in order to analyze, this could potentially cause bias in interpretations

->only women enrollment was provided, I calculated the men enrollment by subtracting from the total enrollment, by doing so, I didn’t account for non-binary gender groups.

->the income levels in the income data was in the categorical format. I was hoping to study the impact of economic factors in college selection, having the income in continuous format would have helped the analysis.