Midterm (Due Sunday 2/19/2023 at 11:55 pm)

Please submit your .Rmd and .html files in Sakai. If you are working together, both people should submit the files.

60 / 60 points total

The goal of the midterm project is to showcase skills that you have learned in class so far. The midterm is open note, but if you use someone else’s code, you must attribute them.

Instructions: Before you get Started

  1. Pick a dataset. Ideally, the dataset should be around 2000 rows, and should have both categorical and numeric covariates. Choose a dataset with more than 4 columns/variables.
  • Potential Sources for data:
  • Note that most of the Tidy Tuesday data are .csv files. There is code to load the files from csv for each of the datasets and a short description of the variables, or you can upload the .csv file into your data folder. This resource is probably the easiest to deatl with.
  • You may use another dataset or your own data, but please make sure it is de-identified and has enough rows/variables.
  1. Define a research question, involving at least one categorical variable. You may schedule a time with Jessica or Brad to discuss your data set and research question, or you may message it to one of us in slack or email. Please do one of the two options pretty early on. We just want to look at the data and make sure that it is appropriate for your question.

  2. You must use each of the following functions at least once:

  • mutate()
  • group_by()
  • summarize()
  • ggplot()

and at least one of the following:

  • case_when()
  • across()
  • *_join() (i.e. left_join())
  • pivot_*() (i.e. pivot_longer())
  • function()
  1. The code chunks below are guides, please add more code chunks to do what you need.

  2. If you do not want your final project posted on the public website, please let Jessica know. We can also keep it anonymous if you’d like to remove your name from the Rmd and html, or use a pseudonym.

You may remove these instructions from your final Rmd if you like

Working Together

If you’d like to work together in pairs, that is encouraged, but you must divide the work equitably and you must note who worked on what. This is probably easiest as notes in the text. Please let Brad or Jessica know that you’ll be working together.

No acknowledgements of contributions = -10 points overall.

Please Note

I will take off points (-5 points for each section) if you don’t add observations and notes in your RMarkdown document. I want you to think and reason through your analysis, even if they are preliminary thoughts.

Define Your Research Question (10 points)

Define your research question below. What about the data interests you? What is a specific question you want to find out about the data?

I will be looking at U.S. collegiate sport data. I am interested in comparing the expenses and revenue of men’s and women’s soccer programs at each university. My research question is: how does men’s collegiate sport spending and revenue from soccer compare to women’s collegiate sport spending and revenue for soccer in the United States? Does this differ by public vs private school status?

Given your question, what is your expectation about the data?

I will have a combination of numerical data and categorical data. I will use the revenue and expenses of each program and the category of school (private vs public) to look into this question.

Loading the Data (10 points)

Load the data below and use dplyr::glimpse() or skimr::skim() on the data. You should upload the data file into the data directory.

tuesdata <- tidytuesdayR::tt_load('2022-03-29')
## --- Compiling #TidyTuesday Information for 2022-03-29 ----
## --- There is 1 file available ---
## --- Starting Download ---
## 
##  Downloading file 1 of 1: `sports.csv`
## --- Download complete ---
tuesdata <- tidytuesdayR::tt_load(2022, week = 13)
## --- Compiling #TidyTuesday Information for 2022-03-29 ----
## --- There is 1 file available ---
## --- Starting Download ---
## 
##  Downloading file 1 of 1: `sports.csv`
## --- Download complete ---
sports <- tuesdata$sports

This data is from github.com, this import code was also provided from github.com

glimpse(sports)
## Rows: 132,327
## Columns: 28
## $ year                 <dbl> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2…
## $ unitid               <dbl> 100654, 100654, 100654, 100654, 100654, 100654, 1…
## $ institution_name     <chr> "Alabama A & M University", "Alabama A & M Univer…
## $ city_txt             <chr> "Normal", "Normal", "Normal", "Normal", "Normal",…
## $ state_cd             <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "…
## $ zip_text             <chr> "35762", "35762", "35762", "35762", "35762", "357…
## $ classification_code  <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1…
## $ classification_name  <chr> "NCAA Division I-FCS", "NCAA Division I-FCS", "NC…
## $ classification_other <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ ef_male_count        <dbl> 1923, 1923, 1923, 1923, 1923, 1923, 1923, 1923, 1…
## $ ef_female_count      <dbl> 2300, 2300, 2300, 2300, 2300, 2300, 2300, 2300, 2…
## $ ef_total_count       <dbl> 4223, 4223, 4223, 4223, 4223, 4223, 4223, 4223, 4…
## $ sector_cd            <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ sector_name          <chr> "Public, 4-year or above", "Public, 4-year or abo…
## $ sportscode           <dbl> 1, 2, 3, 7, 8, 15, 16, 22, 26, 33, 1, 2, 3, 8, 12…
## $ partic_men           <dbl> 31, 19, 61, 99, 9, NA, NA, 7, NA, NA, 32, 13, NA,…
## $ partic_women         <dbl> NA, 16, 46, NA, NA, 21, 25, 10, 16, 9, NA, 20, 68…
## $ partic_coed_men      <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ partic_coed_women    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ sum_partic_men       <dbl> 31, 19, 61, 99, 9, 0, 0, 7, 0, 0, 32, 13, 0, 10, …
## $ sum_partic_women     <dbl> 0, 16, 46, 0, 0, 21, 25, 10, 16, 9, 0, 20, 68, 7,…
## $ rev_men              <dbl> 345592, 1211095, 183333, 2808949, 78270, NA, NA, …
## $ rev_women            <dbl> NA, 748833, 315574, NA, NA, 410717, 298164, 13114…
## $ total_rev_menwomen   <dbl> 345592, 1959928, 498907, 2808949, 78270, 410717, …
## $ exp_men              <dbl> 397818, 817868, 246949, 3059353, 83913, NA, NA, 9…
## $ exp_women            <dbl> NA, 742460, 251184, NA, NA, 432648, 340259, 11388…
## $ total_exp_menwomen   <dbl> 397818, 1560328, 498133, 3059353, 83913, 432648, …
## $ sports               <chr> "Baseball", "Basketball", "All Track Combined", "…

I see here that my dataset has more data than I really need. I am only interested in soccer programs and thier revenue and expenses, as well as the sector_name which identifies if it is a private or public school. I will make a smaller dataset with only the important information to my question.

#create smaller dataset with only relevant variables

sport <- sports %>% select(institution_name, sector_name, rev_men, rev_women, total_rev_menwomen, exp_men, exp_women, total_exp_menwomen, sports)

glimpse(sport)
## Rows: 132,327
## Columns: 9
## $ institution_name   <chr> "Alabama A & M University", "Alabama A & M Universi…
## $ sector_name        <chr> "Public, 4-year or above", "Public, 4-year or above…
## $ rev_men            <dbl> 345592, 1211095, 183333, 2808949, 78270, NA, NA, 78…
## $ rev_women          <dbl> NA, 748833, 315574, NA, NA, 410717, 298164, 131145,…
## $ total_rev_menwomen <dbl> 345592, 1959928, 498907, 2808949, 78270, 410717, 29…
## $ exp_men            <dbl> 397818, 817868, 246949, 3059353, 83913, NA, NA, 996…
## $ exp_women          <dbl> NA, 742460, 251184, NA, NA, 432648, 340259, 113886,…
## $ total_exp_menwomen <dbl> 397818, 1560328, 498133, 3059353, 83913, 432648, 34…
## $ sports             <chr> "Baseball", "Basketball", "All Track Combined", "Fo…
#create data set that is only soccer data

soccer <- sport %>% filter(sports == "Soccer")

glimpse(soccer)
## Rows: 8,647
## Columns: 9
## $ institution_name   <chr> "Alabama A & M University", "University of Alabama …
## $ sector_name        <chr> "Public, 4-year or above", "Public, 4-year or above…
## $ rev_men            <dbl> NA, 1062855, 318878, NA, NA, 475121, NA, 269643, 17…
## $ rev_women          <dbl> 410717, 944819, 371380, 399170, 776307, 534357, 188…
## $ total_rev_menwomen <dbl> 410717, 2007674, 690258, 399170, 776307, 1009478, 1…
## $ exp_men            <dbl> NA, 1052063, 318878, NA, NA, 475121, NA, 269643, 17…
## $ exp_women          <dbl> 432648, 944819, 371380, 399170, 1716681, 534357, 23…
## $ total_exp_menwomen <dbl> 432648, 1996882, 690258, 399170, 1716681, 1009478, …
## $ sports             <chr> "Soccer", "Soccer", "Soccer", "Soccer", "Soccer", "…

Transforming the data (15 points)

For the revenue and expense data, I see that there is NA data. This likley means that there is no soccer program for that particular institution, meaning they spend and make $0. I will re-code all these NA as $0, so that I can compare institutions to see if the women’s or men’s programs make/spend more money, with a non-exisitng program counting as $0 rather than NA.

#recategorizing NA for revenue and expense as 0

soccer <- soccer %>% mutate(rev_men = replace_na(rev_men, 0))
soccer <- soccer %>% mutate(rev_women = replace_na(rev_women, 0))
soccer <- soccer %>% mutate(exp_men = replace_na(exp_men, 0))
soccer <- soccer %>% mutate(exp_women = replace_na(exp_women, 0))

If the data needs to be transformed in any way (values recoded, pivoted, etc), do it here. Examples include transforming a continuous variable into a categorical using case_when(), etc.

I need to make a new variable that turns all the sector_name into a binary category of “public” or “private”. I am doing this to simplify the outcome data, as I do not think the nuance of 2-year vs 4-year and for-profit vs non-profit is going to be important to my research question.

#turning the various sector names into two categories: public or private

soccer <- soccer %>% mutate(
  school_type = case_when(
    sector_name == "Public, 4-year or above" ~ "Public",
    sector_name == "Public, 2-year" ~ "Public",
    sector_name == "Private nonprofit, 4-year or above" ~ "Private",
    sector_name == "Private for-profit, 4-year or above" ~ "Private",
    sector_name == "Private for-profit, 2-year" ~ "Private",
    sector_name == "Private nonprofit, 2-year" ~ "Private",
    TRUE ~ "other"
  )
)

soccer %>% tabyl(sector_name, school_type)
##                          sector_name other Private Public
##           Private for-profit, 2-year     0       8      0
##  Private for-profit, 4-year or above     0      73      0
##            Private nonprofit, 2-year     0      54      0
##   Private nonprofit, 4-year or above     0    4153      0
##                       Public, 2-year     0       0   1838
##              Public, 4-year or above     0       0   2516
##                                 <NA>     5       0      0

I would like to see how the number of institutions where women’s soccer programs bring in more revenue compares to the number of institutions where the men’s program brings in more, as well as how expenses compare. I will make a new variable that divides instutions into three groups: women’s is higher, women’s is lower, and the programs are equal.

#making a new variable the determines in the women's program has higher revenue, less revenue, or the same revenue as the men's program
soccer <- soccer %>% mutate(
  women_rev_higher = case_when(
    rev_women > rev_men ~ "Yes", 
    rev_women < rev_men ~ "No",
    rev_women == rev_men ~ "Same",
    TRUE ~ "Other"
  )
)

soccer %>% tabyl(women_rev_higher)
##  women_rev_higher    n    percent
##                No 4179 0.48328900
##              Same  344 0.03978258
##               Yes 4124 0.47692841
# making a variable that determines if the women's program spends more, the same, or less than men's program
soccer <- soccer %>% mutate(
  women_exp_higher = case_when(
    exp_women > exp_men ~ "Yes", 
    exp_women < exp_men ~ "No",
    exp_women == exp_men ~ "Same",
    TRUE ~ "Other"
  )
)

soccer %>% tabyl(women_exp_higher)
##  women_exp_higher    n   percent
##                No 4129 0.4775066
##              Same  338 0.0390887
##               Yes 4180 0.4834046

Show your transformed table here. Use tools such as glimpse(), skim() or head() to illustrate your point.

glimpse(soccer)
## Rows: 8,647
## Columns: 12
## $ institution_name   <chr> "Alabama A & M University", "University of Alabama …
## $ sector_name        <chr> "Public, 4-year or above", "Public, 4-year or above…
## $ rev_men            <dbl> 0, 1062855, 318878, 0, 0, 475121, 0, 269643, 170929…
## $ rev_women          <dbl> 410717, 944819, 371380, 399170, 776307, 534357, 188…
## $ total_rev_menwomen <dbl> 410717, 2007674, 690258, 399170, 776307, 1009478, 1…
## $ exp_men            <dbl> 0, 1052063, 318878, 0, 0, 475121, 0, 269643, 170929…
## $ exp_women          <dbl> 432648, 944819, 371380, 399170, 1716681, 534357, 23…
## $ total_exp_menwomen <dbl> 432648, 1996882, 690258, 399170, 1716681, 1009478, …
## $ sports             <chr> "Soccer", "Soccer", "Soccer", "Soccer", "Soccer", "…
## $ school_type        <chr> "Public", "Public", "Public", "Public", "Public", "…
## $ women_rev_higher   <chr> "Yes", "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No…
## $ women_exp_higher   <chr> "Yes", "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No…
#there are five observations that are missing school type, I will also remove these from my final data set 

soccer <- soccer %>% filter(school_type == "Public" | school_type == "Private")

glimpse(soccer)
## Rows: 8,642
## Columns: 12
## $ institution_name   <chr> "Alabama A & M University", "University of Alabama …
## $ sector_name        <chr> "Public, 4-year or above", "Public, 4-year or above…
## $ rev_men            <dbl> 0, 1062855, 318878, 0, 0, 475121, 0, 269643, 170929…
## $ rev_women          <dbl> 410717, 944819, 371380, 399170, 776307, 534357, 188…
## $ total_rev_menwomen <dbl> 410717, 2007674, 690258, 399170, 776307, 1009478, 1…
## $ exp_men            <dbl> 0, 1052063, 318878, 0, 0, 475121, 0, 269643, 170929…
## $ exp_women          <dbl> 432648, 944819, 371380, 399170, 1716681, 534357, 23…
## $ total_exp_menwomen <dbl> 432648, 1996882, 690258, 399170, 1716681, 1009478, …
## $ sports             <chr> "Soccer", "Soccer", "Soccer", "Soccer", "Soccer", "…
## $ school_type        <chr> "Public", "Public", "Public", "Public", "Public", "…
## $ women_rev_higher   <chr> "Yes", "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No…
## $ women_exp_higher   <chr> "Yes", "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No…

Are the values what you expected for the variables? Why or Why not?

These values are what I was expecting. I can compared that the sector_name and school_type match, meaning that my transformation worked correctly. I can also see that the variables such as revenue and expenses are numerical and appear to make sense as a collegiate soccer budget. My categorical revenue and expense variables also appear to have transformed correctly.

Visualizing and Summarizing the Data (15 points)

Use group_by() and summarize() to make a summary of the data here. The summary should be relevant to your research question

#summary of average revenue of men and women's soccer programs, by school type
soccer %>% group_by(school_type) %>% summarize(average_rev_men = mean(rev_men, na.rm = TRUE), average_rev_wom = mean(rev_women, na.rm = TRUE))
## # A tibble: 2 × 3
##   school_type average_rev_men average_rev_wom
##   <chr>                 <dbl>           <dbl>
## 1 Private             297056.         318188.
## 2 Public              152874.         259344.
#summary of average expenditures of men and women's soccer programs, by school type

soccer %>% group_by(school_type) %>% summarize(average_exp_men = mean(exp_men, na.rm = TRUE), average_exp_wom = mean(exp_women, na.rm = TRUE))
## # A tibble: 2 × 3
##   school_type average_exp_men average_exp_wom
##   <chr>                 <dbl>           <dbl>
## 1 Private             301028.         325021.
## 2 Public              174028.         331261.
# summary of how many school have higher women's program revenue

soccer %>% group_by(women_rev_higher) %>% summarize(n())
## # A tibble: 3 × 2
##   women_rev_higher `n()`
##   <chr>            <int>
## 1 No                4174
## 2 Same               344
## 3 Yes               4124
# summary of how many school have higher women's program expenses

soccer %>% group_by(women_exp_higher) %>% summarize(n())
## # A tibble: 3 × 2
##   women_exp_higher `n()`
##   <chr>            <int>
## 1 No                4124
## 2 Same               338
## 3 Yes               4180

What are your findings about the summary? Are they what you expected?

I found that overall, women’s soccer had an average revenue of $332,858 at private institutions and $297,702 at public institutions. Comparatively, men’s soccer had an average revenue of $312,890 at private institutions and $220,840 at public institutions. Women’s programs average expenses were $340,007 at private institutions and $380,256 at public institutions, compared to men’s programs spending of $317,073 at private institutions and $251,399 at public institutions. Overall, women’s soccer programs spent more and made more than men’s programs. When we look at how many institutions have higher revenue from women’s programs, it is very similar to the number of institutions that have higher revenue from thier men’s programs (4,124 and 4,174, respectively). There is also a very similar number of institutions that spend more on their women’s soccer programs to the number that spend more on their men’s soccer program (4,180 and 4,128, repectively).

This was not what I expected, as I thought that men’s programs would certainly spend more and likely have higher revenue. However, it is very exciting to see that women’s soccer has comparable funding to men’s.

Make at least two plots that help you answer your question on the transformed or summarized data. Use scales and/or labels to make each plot informative.

#boxplot of revenues

ggplot(soccer) +
  aes(x = school_type, y = rev_men, fill = school_type) +
  geom_boxplot() +
  labs(title = "Men's Soccer Program Revenue", x = "School Type", y = "Revenue (in $USD)")

ggplot(soccer) +
  aes(x = school_type, y = rev_women, fill = school_type) +
  geom_boxplot() +
  labs(title = "Women's Soccer Program Revenue", x = "School Type", y = "Revenue (in $USD)")

I created a box plot of men’s soccer program revenue and women’s program revenue, broken up by private vs public schools. We can see that overall, private institution soccer programs have higher revenue, in both women’s and men’s. We can also see that most programs are fairly comparable in revenue between men’s and women’s programs, with women’s programs having greater high revenue outliers, which likley explains the higher average revenue.

#boxplot of expenses

ggplot(soccer) +
  aes(x = school_type, y = exp_men, fill = school_type) +
  geom_boxplot() +
  labs(title = "Men's Soccer Program Expenses", x = "School Type", y = "Expenses (in $USD)")

ggplot(soccer) +
  aes(x = school_type, y = exp_women, fill = school_type) +
  geom_boxplot() +
  labs(title = "Women's Soccer Program Expenses", x = "School Type", y = "Expenses (in $USD)")

I created a box plot of men’s soccer program expenses and women’s program expenses, broken up by private vs public schools. We can see that overall, private institution soccer programs have higher expensenses, in both women’s and men’s. We can also see that most programs are fairly comparable in expenses between men’s and women’s programs, with women’s programs having greater high expense outliers, which likley explains the higher average expenses.

 ggplot(soccer) +
  aes(x = rev_men, y = exp_men, color = school_type) +
  geom_point() +
  labs(title = "Revenue vs Expenses - Men", x = "Revenue", y = "Expenses") +
  xlim(1, 850000) +  ylim(1, 850000) #limit to take out extreme outliers seen in boxplot, which should make the scatter plot more readable
## Warning: Removed 2079 rows containing missing values (`geom_point()`).

ggplot(soccer) +
  aes(x = rev_women, y = exp_women, color = school_type) +
  geom_point() +
  labs(title = " Revenue vs Expenses - Women", x = "Revenue", y = "Expenses") +
  xlim(1, 850000) +  ylim(1, 850000) #limit to take out extreme outliers seen in boxplot, which should make the scatter plot more readable
## Warning: Removed 1641 rows containing missing values (`geom_point()`).

Scatter plots here can be used to see the linear relationship between revenue and expenses that is apparent in both men’s programs and women’s programs. We see here that there do seem to be more cases of high expense and low revenue among women’s soccer programs when compared to men’s, mostly in the public sector. Overall though, the relationship is very linear, meaning that the more soccer programs spend the more they bring in revenue regardless of whether it is a men’s soccer program or women’s soccer program.

# bar plot of higher revenue

ggplot(soccer) +
  aes(x = women_rev_higher, fill = women_rev_higher) +
  geom_bar() +
  labs(title = "Institutions Soccer Program Revenue", x = "Women's Program Revenue is Higher", y = "Number of Institutions")

# bar plot of higher expenses

ggplot(soccer) +
  aes(x = women_exp_higher, fill = women_exp_higher) +
  geom_bar() +
  labs(title = "Institutions Soccer Program Expenses", x = "Women's Program Expenses are Higher", y = "Number of Institutions")

These bar plots show that overall there are close to same number of institutions that have women’s soccer programs that make more than their men’s soccer programs as there are institutions where the men’s soccer program makes more. Similarly, the number of institutions where the women’s soccer program spends more than the men’s soccer program is very similar to the number of institutions where the men’s program spends more than the women’s program. There are very few institutions where the spending or revenue of the women’s soccer program and the men’s soccer program is equal.

Final Summary (10 points)

Summarize your research question and findings below.

My research question was how do men’s collegiate soccer programs in the U.S. compare to women’s collegiate soccer programs in the U.S. in terms of revenue and expenses, and does this differ by school status (private vs public)? I found that overall womens programs have higher expenses and higher revenue, and this was the case both in public and private institutions. However, is does appear that this higher average revenue and expenses from women’s soccer programs are due to a few outlier programs. When we look at the number of institutions that have higher revenue from women’s soccer programs than their men’s, there are actually fewer institutions where this is the case than there are institutions that have higher revenue from men’s programs and overall they are actually very similar. We see the same situation in expenses, where the actual number of institutions where women’s soccer programs are spending more that men’s is only very slightly higher, showing that there are almost as many institutions that spend more on men’s than women’s programs. In both revenue and expenses, there are very few institutions where the dollar amount is the same.

Are your findings what you expected? Why or Why not?

These findings were not what I expected. I had assumed that in the world of colleigate sports women’s programs would have smaller expense budgets and bring in less revenue. I had also thought that there would be a difference between private and public school spending, but they were actually very similar. It is very encouraging to me to see this, that at least in the context of soccer programs women are getting equal (or more) funding and are bringing in equal (often more) revenue.

All code was adapted from R Programming Winter 2023 course at the OHSU-PSU School of Public Health, taught by Jessica Minnier