Please submit your .Rmd
and .html
files in Sakai. If you are working together, both people should submit the files.
The goal of the midterm project is to showcase skills that you have learned in class so far. The midterm is open note, but if you use someone else’s code, you must attribute them.
.csv
file into your data
folder. This resource is probably the easiest to deatl with.Define a research question, involving at least one categorical variable. You may schedule a time with Jessica or Brad to discuss your data set and research question, or you may message it to one of us in slack or email. Please do one of the two options pretty early on. We just want to look at the data and make sure that it is appropriate for your question.
You must use each of the following functions at least once:
mutate()
group_by()
summarize()
ggplot()
and at least one of the following:
case_when()
across()
*_join()
(i.e. left_join()
)pivot_*()
(i.e. pivot_longer()
)function()
The code chunks below are guides, please add more code chunks to do what you need.
If you do not want your final project posted on the public website, please let Jessica know. We can also keep it anonymous if you’d like to remove your name from the Rmd and html, or use a pseudonym.
You may remove these instructions from your final Rmd if you like
If you’d like to work together in pairs, that is encouraged, but you must divide the work equitably and you must note who worked on what. This is probably easiest as notes in the text. Please let Brad or Jessica know that you’ll be working together.
No acknowledgements of contributions = -10 points overall.
I will take off points (-5 points for each section) if you don’t add observations and notes in your RMarkdown document. I want you to think and reason through your analysis, even if they are preliminary thoughts.
Define your research question below. What about the data interests you? What is a specific question you want to find out about the data?
Given your question, what is your expectation about the data?
Load the data below and use
dplyr::glimpse()
orskimr::skim()
on the data. You should upload the data file into thedata
directory.
tuesdata <- tidytuesdayR::tt_load('2022-03-29')
## --- Compiling #TidyTuesday Information for 2022-03-29 ----
## --- There is 1 file available ---
## --- Starting Download ---
##
## Downloading file 1 of 1: `sports.csv`
## --- Download complete ---
tuesdata <- tidytuesdayR::tt_load(2022, week = 13)
## --- Compiling #TidyTuesday Information for 2022-03-29 ----
## --- There is 1 file available ---
## --- Starting Download ---
##
## Downloading file 1 of 1: `sports.csv`
## --- Download complete ---
sports <- tuesdata$sports
This data is from github.com, this import code was also provided from github.com
glimpse(sports)
## Rows: 132,327
## Columns: 28
## $ year <dbl> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2…
## $ unitid <dbl> 100654, 100654, 100654, 100654, 100654, 100654, 1…
## $ institution_name <chr> "Alabama A & M University", "Alabama A & M Univer…
## $ city_txt <chr> "Normal", "Normal", "Normal", "Normal", "Normal",…
## $ state_cd <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "…
## $ zip_text <chr> "35762", "35762", "35762", "35762", "35762", "357…
## $ classification_code <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1…
## $ classification_name <chr> "NCAA Division I-FCS", "NCAA Division I-FCS", "NC…
## $ classification_other <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ ef_male_count <dbl> 1923, 1923, 1923, 1923, 1923, 1923, 1923, 1923, 1…
## $ ef_female_count <dbl> 2300, 2300, 2300, 2300, 2300, 2300, 2300, 2300, 2…
## $ ef_total_count <dbl> 4223, 4223, 4223, 4223, 4223, 4223, 4223, 4223, 4…
## $ sector_cd <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ sector_name <chr> "Public, 4-year or above", "Public, 4-year or abo…
## $ sportscode <dbl> 1, 2, 3, 7, 8, 15, 16, 22, 26, 33, 1, 2, 3, 8, 12…
## $ partic_men <dbl> 31, 19, 61, 99, 9, NA, NA, 7, NA, NA, 32, 13, NA,…
## $ partic_women <dbl> NA, 16, 46, NA, NA, 21, 25, 10, 16, 9, NA, 20, 68…
## $ partic_coed_men <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ partic_coed_women <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ sum_partic_men <dbl> 31, 19, 61, 99, 9, 0, 0, 7, 0, 0, 32, 13, 0, 10, …
## $ sum_partic_women <dbl> 0, 16, 46, 0, 0, 21, 25, 10, 16, 9, 0, 20, 68, 7,…
## $ rev_men <dbl> 345592, 1211095, 183333, 2808949, 78270, NA, NA, …
## $ rev_women <dbl> NA, 748833, 315574, NA, NA, 410717, 298164, 13114…
## $ total_rev_menwomen <dbl> 345592, 1959928, 498907, 2808949, 78270, 410717, …
## $ exp_men <dbl> 397818, 817868, 246949, 3059353, 83913, NA, NA, 9…
## $ exp_women <dbl> NA, 742460, 251184, NA, NA, 432648, 340259, 11388…
## $ total_exp_menwomen <dbl> 397818, 1560328, 498133, 3059353, 83913, 432648, …
## $ sports <chr> "Baseball", "Basketball", "All Track Combined", "…
sector_name
which identifies if it is a private or public school. I will make a smaller dataset with only the important information to my question.#create smaller dataset with only relevant variables
sport <- sports %>% select(institution_name, sector_name, rev_men, rev_women, total_rev_menwomen, exp_men, exp_women, total_exp_menwomen, sports)
glimpse(sport)
## Rows: 132,327
## Columns: 9
## $ institution_name <chr> "Alabama A & M University", "Alabama A & M Universi…
## $ sector_name <chr> "Public, 4-year or above", "Public, 4-year or above…
## $ rev_men <dbl> 345592, 1211095, 183333, 2808949, 78270, NA, NA, 78…
## $ rev_women <dbl> NA, 748833, 315574, NA, NA, 410717, 298164, 131145,…
## $ total_rev_menwomen <dbl> 345592, 1959928, 498907, 2808949, 78270, 410717, 29…
## $ exp_men <dbl> 397818, 817868, 246949, 3059353, 83913, NA, NA, 996…
## $ exp_women <dbl> NA, 742460, 251184, NA, NA, 432648, 340259, 113886,…
## $ total_exp_menwomen <dbl> 397818, 1560328, 498133, 3059353, 83913, 432648, 34…
## $ sports <chr> "Baseball", "Basketball", "All Track Combined", "Fo…
#create data set that is only soccer data
soccer <- sport %>% filter(sports == "Soccer")
glimpse(soccer)
## Rows: 8,647
## Columns: 9
## $ institution_name <chr> "Alabama A & M University", "University of Alabama …
## $ sector_name <chr> "Public, 4-year or above", "Public, 4-year or above…
## $ rev_men <dbl> NA, 1062855, 318878, NA, NA, 475121, NA, 269643, 17…
## $ rev_women <dbl> 410717, 944819, 371380, 399170, 776307, 534357, 188…
## $ total_rev_menwomen <dbl> 410717, 2007674, 690258, 399170, 776307, 1009478, 1…
## $ exp_men <dbl> NA, 1052063, 318878, NA, NA, 475121, NA, 269643, 17…
## $ exp_women <dbl> 432648, 944819, 371380, 399170, 1716681, 534357, 23…
## $ total_exp_menwomen <dbl> 432648, 1996882, 690258, 399170, 1716681, 1009478, …
## $ sports <chr> "Soccer", "Soccer", "Soccer", "Soccer", "Soccer", "…
#recategorizing NA for revenue and expense as 0
soccer <- soccer %>% mutate(rev_men = replace_na(rev_men, 0))
soccer <- soccer %>% mutate(rev_women = replace_na(rev_women, 0))
soccer <- soccer %>% mutate(exp_men = replace_na(exp_men, 0))
soccer <- soccer %>% mutate(exp_women = replace_na(exp_women, 0))
If the data needs to be transformed in any way (values recoded, pivoted, etc), do it here. Examples include transforming a continuous variable into a categorical using
case_when()
, etc.
sector_name
into a binary category of “public” or “private”. I am doing this to simplify the outcome data, as I do not think the nuance of 2-year vs 4-year and for-profit vs non-profit is going to be important to my research question.#turning the various sector names into two categories: public or private
soccer <- soccer %>% mutate(
school_type = case_when(
sector_name == "Public, 4-year or above" ~ "Public",
sector_name == "Public, 2-year" ~ "Public",
sector_name == "Private nonprofit, 4-year or above" ~ "Private",
sector_name == "Private for-profit, 4-year or above" ~ "Private",
sector_name == "Private for-profit, 2-year" ~ "Private",
sector_name == "Private nonprofit, 2-year" ~ "Private",
TRUE ~ "other"
)
)
soccer %>% tabyl(sector_name, school_type)
## sector_name other Private Public
## Private for-profit, 2-year 0 8 0
## Private for-profit, 4-year or above 0 73 0
## Private nonprofit, 2-year 0 54 0
## Private nonprofit, 4-year or above 0 4153 0
## Public, 2-year 0 0 1838
## Public, 4-year or above 0 0 2516
## <NA> 5 0 0
#making a new variable the determines in the women's program has higher revenue, less revenue, or the same revenue as the men's program
soccer <- soccer %>% mutate(
women_rev_higher = case_when(
rev_women > rev_men ~ "Yes",
rev_women < rev_men ~ "No",
rev_women == rev_men ~ "Same",
TRUE ~ "Other"
)
)
soccer %>% tabyl(women_rev_higher)
## women_rev_higher n percent
## No 4179 0.48328900
## Same 344 0.03978258
## Yes 4124 0.47692841
# making a variable that determines if the women's program spends more, the same, or less than men's program
soccer <- soccer %>% mutate(
women_exp_higher = case_when(
exp_women > exp_men ~ "Yes",
exp_women < exp_men ~ "No",
exp_women == exp_men ~ "Same",
TRUE ~ "Other"
)
)
soccer %>% tabyl(women_exp_higher)
## women_exp_higher n percent
## No 4129 0.4775066
## Same 338 0.0390887
## Yes 4180 0.4834046
Show your transformed table here. Use tools such as
glimpse()
,skim()
orhead()
to illustrate your point.
glimpse(soccer)
## Rows: 8,647
## Columns: 12
## $ institution_name <chr> "Alabama A & M University", "University of Alabama …
## $ sector_name <chr> "Public, 4-year or above", "Public, 4-year or above…
## $ rev_men <dbl> 0, 1062855, 318878, 0, 0, 475121, 0, 269643, 170929…
## $ rev_women <dbl> 410717, 944819, 371380, 399170, 776307, 534357, 188…
## $ total_rev_menwomen <dbl> 410717, 2007674, 690258, 399170, 776307, 1009478, 1…
## $ exp_men <dbl> 0, 1052063, 318878, 0, 0, 475121, 0, 269643, 170929…
## $ exp_women <dbl> 432648, 944819, 371380, 399170, 1716681, 534357, 23…
## $ total_exp_menwomen <dbl> 432648, 1996882, 690258, 399170, 1716681, 1009478, …
## $ sports <chr> "Soccer", "Soccer", "Soccer", "Soccer", "Soccer", "…
## $ school_type <chr> "Public", "Public", "Public", "Public", "Public", "…
## $ women_rev_higher <chr> "Yes", "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No…
## $ women_exp_higher <chr> "Yes", "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No…
#there are five observations that are missing school type, I will also remove these from my final data set
soccer <- soccer %>% filter(school_type == "Public" | school_type == "Private")
glimpse(soccer)
## Rows: 8,642
## Columns: 12
## $ institution_name <chr> "Alabama A & M University", "University of Alabama …
## $ sector_name <chr> "Public, 4-year or above", "Public, 4-year or above…
## $ rev_men <dbl> 0, 1062855, 318878, 0, 0, 475121, 0, 269643, 170929…
## $ rev_women <dbl> 410717, 944819, 371380, 399170, 776307, 534357, 188…
## $ total_rev_menwomen <dbl> 410717, 2007674, 690258, 399170, 776307, 1009478, 1…
## $ exp_men <dbl> 0, 1052063, 318878, 0, 0, 475121, 0, 269643, 170929…
## $ exp_women <dbl> 432648, 944819, 371380, 399170, 1716681, 534357, 23…
## $ total_exp_menwomen <dbl> 432648, 1996882, 690258, 399170, 1716681, 1009478, …
## $ sports <chr> "Soccer", "Soccer", "Soccer", "Soccer", "Soccer", "…
## $ school_type <chr> "Public", "Public", "Public", "Public", "Public", "…
## $ women_rev_higher <chr> "Yes", "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No…
## $ women_exp_higher <chr> "Yes", "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No…
Are the values what you expected for the variables? Why or Why not?
sector_name
and school_type
match, meaning that my transformation worked correctly. I can also see that the variables such as revenue and expenses are numerical and appear to make sense as a collegiate soccer budget. My categorical revenue and expense variables also appear to have transformed correctly.Use
group_by()
andsummarize()
to make a summary of the data here. The summary should be relevant to your research question
#summary of average revenue of men and women's soccer programs, by school type
soccer %>% group_by(school_type) %>% summarize(average_rev_men = mean(rev_men, na.rm = TRUE), average_rev_wom = mean(rev_women, na.rm = TRUE))
## # A tibble: 2 × 3
## school_type average_rev_men average_rev_wom
## <chr> <dbl> <dbl>
## 1 Private 297056. 318188.
## 2 Public 152874. 259344.
#summary of average expenditures of men and women's soccer programs, by school type
soccer %>% group_by(school_type) %>% summarize(average_exp_men = mean(exp_men, na.rm = TRUE), average_exp_wom = mean(exp_women, na.rm = TRUE))
## # A tibble: 2 × 3
## school_type average_exp_men average_exp_wom
## <chr> <dbl> <dbl>
## 1 Private 301028. 325021.
## 2 Public 174028. 331261.
# summary of how many school have higher women's program revenue
soccer %>% group_by(women_rev_higher) %>% summarize(n())
## # A tibble: 3 × 2
## women_rev_higher `n()`
## <chr> <int>
## 1 No 4174
## 2 Same 344
## 3 Yes 4124
# summary of how many school have higher women's program expenses
soccer %>% group_by(women_exp_higher) %>% summarize(n())
## # A tibble: 3 × 2
## women_exp_higher `n()`
## <chr> <int>
## 1 No 4124
## 2 Same 338
## 3 Yes 4180
What are your findings about the summary? Are they what you expected?
Make at least two plots that help you answer your question on the transformed or summarized data. Use scales and/or labels to make each plot informative.
#boxplot of revenues
ggplot(soccer) +
aes(x = school_type, y = rev_men, fill = school_type) +
geom_boxplot() +
labs(title = "Men's Soccer Program Revenue", x = "School Type", y = "Revenue (in $USD)")
ggplot(soccer) +
aes(x = school_type, y = rev_women, fill = school_type) +
geom_boxplot() +
labs(title = "Women's Soccer Program Revenue", x = "School Type", y = "Revenue (in $USD)")
#boxplot of expenses
ggplot(soccer) +
aes(x = school_type, y = exp_men, fill = school_type) +
geom_boxplot() +
labs(title = "Men's Soccer Program Expenses", x = "School Type", y = "Expenses (in $USD)")
ggplot(soccer) +
aes(x = school_type, y = exp_women, fill = school_type) +
geom_boxplot() +
labs(title = "Women's Soccer Program Expenses", x = "School Type", y = "Expenses (in $USD)")
ggplot(soccer) +
aes(x = rev_men, y = exp_men, color = school_type) +
geom_point() +
labs(title = "Revenue vs Expenses - Men", x = "Revenue", y = "Expenses") +
xlim(1, 850000) + ylim(1, 850000) #limit to take out extreme outliers seen in boxplot, which should make the scatter plot more readable
## Warning: Removed 2079 rows containing missing values (`geom_point()`).
ggplot(soccer) +
aes(x = rev_women, y = exp_women, color = school_type) +
geom_point() +
labs(title = " Revenue vs Expenses - Women", x = "Revenue", y = "Expenses") +
xlim(1, 850000) + ylim(1, 850000) #limit to take out extreme outliers seen in boxplot, which should make the scatter plot more readable
## Warning: Removed 1641 rows containing missing values (`geom_point()`).
# bar plot of higher revenue
ggplot(soccer) +
aes(x = women_rev_higher, fill = women_rev_higher) +
geom_bar() +
labs(title = "Institutions Soccer Program Revenue", x = "Women's Program Revenue is Higher", y = "Number of Institutions")
# bar plot of higher expenses
ggplot(soccer) +
aes(x = women_exp_higher, fill = women_exp_higher) +
geom_bar() +
labs(title = "Institutions Soccer Program Expenses", x = "Women's Program Expenses are Higher", y = "Number of Institutions")
Summarize your research question and findings below.
Are your findings what you expected? Why or Why not?
All code was adapted from R Programming Winter 2023 course at the OHSU-PSU School of Public Health, taught by Jessica Minnier