Please submit your .Rmd and .html files in Sakai. If you are working together, both people should submit the files.
The goal of the midterm project is to showcase skills that you have learned in class so far. The midterm is open note, but if you use someone else’s code, you must attribute them.
.csv file into your data folder. This resource is probably the easiest to deal with.
Define a research question involving at least one categorical variable. You may schedule a time with Jessica or Brad to discuss your data set and research question, or you may send it to one of us on Slack or by email. Please do one of the two early on. We just want to look at the data and make sure that it is appropriate for your question.
You must use each of the following functions at least once:
mutate()
group_by()
summarize()
ggplot()
and at least one of the following:
case_when()
across()
*_join() (e.g. left_join())
pivot_*() (e.g. pivot_longer())
function()
The code chunks below are guides; please add more code chunks as needed.
If you do not want your final project posted on the public website, please let Jessica know. We can also keep it anonymous if you’d like to remove your name from the Rmd and html, or use a pseudonym.
You may remove these instructions from your final Rmd if you like.
If you’d like to work together in pairs, that is encouraged, but you must divide the work equitably and you must note who worked on what. This is probably easiest as notes in the text. Please let Brad or Jessica know that you’ll be working together.
No acknowledgements of contributions = -10 points overall.
I will take off points (-5 points for each section) if you don’t add observations and notes in your RMarkdown document. I want you to think and reason through your analysis, even if they are preliminary thoughts.
Define your research question below. What about the data interests you? What is a specific question you want to find out about the data?
I’m interested in whether vaccination rates differ across school types (public or private), for both MMR vaccination and overall vaccination, during the 2018-2019 school year. I’m also interested in which counties have the lowest vaccination rates. I chose to limit this data set to Ohio because that state reported school types as well as both MMR and overall vaccination rates.
Given your question, what is your expectation about the data?
I expect public schools to have higher vaccination rates than private schools for both MMR and overall vaccinations. MMR vaccination rates should be higher than overall vaccination rates, as some students may have the MMR vaccine but not all of their state’s required vaccines. I’m not sure which counties will have the lowest vaccination rates.
Load the data below and use dplyr::glimpse() or skimr::skim() on the data. You should upload the data file into the data directory.
I initially imported the data from the tidytuesday website and saved it in the project data folder. This data set is quite large, but I’ll be limiting it to Ohio data only.
# packages used throughout this document
library(tidyverse)
library(skimr)
library(gt)
# measles <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-25/measles.csv')
# write.csv(measles, file = "BSTA 504 Midterm/data/measles.csv", row.names = FALSE)
measles <- readr::read_csv("BSTA 504 Midterm/data/measles.csv")
## Rows: 66113 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): state, year, name, type, city, county
## dbl (8): index, enroll, mmr, overall, xmed, xper, lat, lng
## lgl (2): district, xrel
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dplyr::glimpse(measles)
## Rows: 66,113
## Columns: 16
## $ index <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 11, 12, 13, 14, 15, 15, 16…
## $ state <chr> "Arizona", "Arizona", "Arizona", "Arizona", "Arizona", "Arizo…
## $ year <chr> "2018-19", "2018-19", "2018-19", "2018-19", "2018-19", "2018-…
## $ name <chr> "A J Mitchell Elementary", "Academy Del Sol", "Academy Del So…
## $ type <chr> "Public", "Charter", "Charter", "Charter", "Charter", "Public…
## $ city <chr> "Nogales", "Tucson", "Tucson", "Phoenix", "Phoenix", "Phoenix…
## $ county <chr> "Santa Cruz", "Pima", "Pima", "Maricopa", "Maricopa", "Marico…
## $ district <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ enroll <dbl> 51, 22, 85, 60, 43, 36, 24, 22, 26, 78, 78, 35, 54, 54, 34, 5…
## $ mmr <dbl> 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 1…
## $ overall <dbl> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -…
## $ xrel <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ xmed <dbl> NA, NA, NA, NA, 2.33, NA, NA, NA, NA, NA, NA, 2.86, NA, 7.41,…
## $ xper <dbl> NA, NA, NA, NA, 2.33, NA, 4.17, NA, NA, NA, NA, NA, NA, NA, N…
## $ lat <dbl> 31.34782, 32.22192, 32.13049, 33.48545, 33.49562, 33.43532, 3…
## $ lng <dbl> -110.9380, -110.8961, -111.1170, -112.1306, -112.2247, -112.1…
If there are any quirks that you have to deal with, such as NA coded as something else or data split across multiple tables, please make some notes here about what you need to do before you start transforming the data in the next section.
MMR and overall vaccination rates with missing values are coded as -1, so these rows are removed below.
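An alternative to dropping the rows outright is to recode -1 as NA first, so the missing values stay visible until they are filtered. The sketch below is not what this analysis does; it uses a small made-up data frame (`toy`, with hypothetical school names and rates) to show the idea with `dplyr::na_if()`.

```r
library(dplyr)

# Toy data (hypothetical values) mimicking the -1 missing-value code
toy <- data.frame(
  name    = c("School A", "School B", "School C"),
  mmr     = c(95.5, -1, 88.0),
  overall = c(-1, 90.0, 85.0)
)

# Recode -1 as NA in both rate columns
toy_na <- toy %>%
  mutate(across(c(mmr, overall), ~ na_if(.x, -1)))

# Keep only rows where both rates are present
toy_complete <- toy_na %>%
  filter(!is.na(mmr), !is.na(overall))
```

With the values recoded as NA, functions such as `skim()` report them as missing instead of treating -1 as a real rate.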
Make sure your data types are correct!
The data types are as expected. Most importantly: type and county are character variables, while mmr and overall are doubles.
If the data needs to be transformed in any way (values recoded, pivoted, etc), do it here. Examples include transforming a continuous variable into a categorical using case_when(), etc.
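As a minimal sketch of the case_when() idea, the example below bins a continuous rate into three categories. The data frame `toy`, the column `mmr_level`, and the cut-offs of 95 and 80 are all made up for illustration and are not part of this analysis.

```r
library(dplyr)

# Toy vaccination rates (hypothetical values)
toy <- data.frame(mmr = c(99, 92, 75, 40))

# Conditions are checked in order; the first one that is TRUE wins
toy_cat <- toy %>%
  mutate(mmr_level = case_when(
    mmr >= 95 ~ "high",
    mmr >= 80 ~ "medium",
    TRUE      ~ "low"
  ))
```

Because the conditions are evaluated top to bottom, the second condition only applies to rates below 95, and the final `TRUE` catches everything else.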
The following data transformations were done:
# remove rows where the mmr or overall vaccination rate is -1
measles <- measles[measles$mmr != -1 & measles$overall != -1, ]
# limit data to Ohio
measles_oh <- measles[measles$state == "Ohio", ]
measles_oh <- measles_oh %>%
  # remove unwanted columns
  select(-lat, -lng, -district, -xrel, -xmed, -xper) %>%
  # drop missing values from type and mmr
  drop_na(type, mmr) %>%
  # remove duplicates
  distinct() %>%
  # reassign the type variable to be a factor
  mutate(type = factor(type))
# create a long format measles data set over mmr and overall
measles_long <-
  measles_oh %>%
  pivot_longer(
    cols = c(mmr, overall),
    names_to = "vax",       # column names are moved here
    values_to = "vax_rate"  # data points are moved here
  )
Bonus points (5 points) for datasets that require merging of tables, but only if you reason through whether you should use left_join, inner_join, or right_join on these tables. No credit will be provided if you don’t.
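To illustrate the kind of reasoning involved, here is a small sketch contrasting left_join() with inner_join(). The tables `schools` and `county_info` and the `region` values are hypothetical and are not part of this project’s data.

```r
library(dplyr)

# Hypothetical tables: three schools, but region info for only two counties
schools <- data.frame(
  name   = c("School A", "School B", "School C"),
  county = c("Lucas", "Wood", "Holmes")
)
county_info <- data.frame(
  county = c("Lucas", "Wood"),
  region = c("Region 1", "Region 1")
)

# left_join keeps every school; unmatched counties get NA for region
schools_left <- left_join(schools, county_info, by = "county")

# inner_join keeps only schools whose county appears in both tables
schools_inner <- inner_join(schools, county_info, by = "county")
```

If every school must stay in the analysis, left_join() is the safer choice; inner_join() silently drops the school in the unmatched county.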
Show your transformed table here. Use tools such as glimpse(), skim() or head() to illustrate your point.
#check data transformations
glimpse(measles_oh)
## Rows: 1,986
## Columns: 10
## $ index <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,…
## $ state <chr> "Ohio", "Ohio", "Ohio", "Ohio", "Ohio", "Ohio", "Ohio", "Ohio"…
## $ year <chr> "2018-19", "2018-19", "2018-19", "2018-19", "2018-19", "2018-1…
## $ name <chr> "Academy Of Educational Excellence", "Adrian Elementary", "All…
## $ type <fct> Public, Public, Private, Private, Private, Private, Private, P…
## $ city <chr> "Toledo", "South Euclid", "Cincinnati", "Columbus", "Rossford"…
## $ county <chr> "Lucas", "Cuyahoga", "Hamilton", "Franklin", "Wood", "Lake", "…
## $ enroll <dbl> 22, 62, 52, 38, 23, 11, 15, 64, 80, 201, 117, 28, 15, 96, 73, …
## $ mmr <dbl> 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 10…
## $ overall <dbl> 95.45, 95.16, 100.00, 100.00, 95.65, 100.00, 100.00, 98.44, 98…
skim(measles_oh)
Name | measles_oh |
Number of rows | 1986 |
Number of columns | 10 |
_______________________ | |
Column type frequency: | |
character | 5 |
factor | 1 |
numeric | 4 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
state | 0 | 1 | 4 | 4 | 0 | 1 | 0 |
year | 0 | 1 | 7 | 7 | 0 | 1 | 0 |
name | 0 | 1 | 4 | 60 | 0 | 1781 | 0 |
city | 0 | 1 | 3 | 20 | 0 | 619 | 0 |
county | 0 | 1 | 4 | 10 | 0 | 88 | 0 |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
type | 0 | 1 | FALSE | 2 | Pub: 1582, Pri: 404 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
index | 0 | 1 | 993.50 | 573.46 | 1.00 | 497.25 | 993.50 | 1489.75 | 1987 | ▇▇▇▇▇ |
enroll | 0 | 1 | 66.37 | 48.10 | 11.00 | 37.00 | 58.00 | 83.00 | 743 | ▇▁▁▁▁ |
mmr | 0 | 1 | 90.44 | 12.42 | 14.29 | 88.73 | 94.64 | 97.62 | 100 | ▁▁▁▁▇ |
overall | 0 | 1 | 87.85 | 14.29 | 11.11 | 85.46 | 92.86 | 96.30 | 100 | ▁▁▁▁▇ |
#two school types
table(measles_oh$type)
##
## Private Public
## 404 1582
#vaccination percentages
range(measles_oh$mmr)
## [1] 14.29 100.00
range(measles_oh$overall)
## [1] 11.11 100.00
#check long format data set
glimpse(measles_long)
## Rows: 3,972
## Columns: 10
## $ index <dbl> 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10,…
## $ state <chr> "Ohio", "Ohio", "Ohio", "Ohio", "Ohio", "Ohio", "Ohio", "Ohio…
## $ year <chr> "2018-19", "2018-19", "2018-19", "2018-19", "2018-19", "2018-…
## $ name <chr> "Academy Of Educational Excellence", "Academy Of Educational …
## $ type <fct> Public, Public, Public, Public, Private, Private, Private, Pr…
## $ city <chr> "Toledo", "Toledo", "South Euclid", "South Euclid", "Cincinna…
## $ county <chr> "Lucas", "Lucas", "Cuyahoga", "Cuyahoga", "Hamilton", "Hamilt…
## $ enroll <dbl> 22, 22, 62, 62, 52, 52, 38, 38, 23, 23, 11, 11, 15, 15, 64, 6…
## $ vax <chr> "mmr", "overall", "mmr", "overall", "mmr", "overall", "mmr", …
## $ vax_rate <dbl> 100.00, 95.45, 100.00, 95.16, 100.00, 100.00, 100.00, 100.00,…
Are the values what you expected for the variables? Why or why not?
Yes, the values are what I expected. State and year are character variables with only one unique value and no missing values. Name, city, and county are also character variables and have no missing values. School type is a factor with 2 unique values and no missing values. Index has no missing values. MMR and overall are doubles with no missing values, and both lie between 0 and 100%. The measles_long data set has twice as many rows as measles_oh (3,972 vs. 1,986), which is expected because each school now has one row for mmr and one for overall.
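The row-doubling can be checked on a small example. The sketch below uses a made-up two-school data frame (`toy_wide`, hypothetical values) and the same pivot_longer() arguments as above.

```r
library(dplyr)
library(tidyr)

# Toy wide data (hypothetical values): one row per school
toy_wide <- data.frame(
  name    = c("School A", "School B"),
  mmr     = c(100, 90),
  overall = c(95, 85)
)

# Pivot the two rate columns into a name column and a value column
toy_long <- toy_wide %>%
  pivot_longer(
    cols = c(mmr, overall),
    names_to = "vax",
    values_to = "vax_rate"
  )

# Two pivoted columns means two rows per school
stopifnot(nrow(toy_long) == 2 * nrow(toy_wide))
```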
Use group_by() and summarize() to make a summary of the data here. The summary should be relevant to your research question.
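# summary stats for mmr and overall vaccination by school type
# na.rm is set inside each function; passing it through across()'s ... is deprecated as of dplyr 1.1.0
measles_oh %>%
  group_by(type) %>%
  summarize(across(c(mmr, overall),
                   .fns = list(min  = ~ min(.x, na.rm = TRUE),
                               mean = ~ mean(.x, na.rm = TRUE),
                               max  = ~ max(.x, na.rm = TRUE)))) %>%
  # create table
  gt()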
type | mmr_min | mmr_mean | mmr_max | overall_min | overall_mean | overall_max |
---|---|---|---|---|---|---|
Private | 14.29 | 90.38849 | 100 | 11.11 | 88.24064 | 100 |
Public | 16.67 | 90.45727 | 100 | 12.00 | 87.74507 | 100 |
#number of schools with mmr vaccination rates below 50%
sum(measles_oh$mmr<50)
## [1] 47
# summary stats for mean mmr vaccination per county
measles_oh %>%
  group_by(county) %>%
  summarize(across(c(mmr), list(mean = mean))) %>%
  # sort by lowest to highest vaccination rate
  arrange(mmr_mean) %>%
  gt()
county | mmr_mean |
---|---|
Holmes | 74.60600 |
Morrow | 76.27800 |
Hamilton | 81.95805 |
Erie | 82.19714 |
Ottawa | 82.36750 |
Mahoning | 84.55333 |
Summit | 85.26511 |
Huron | 85.67667 |
Cuyahoga | 86.03346 |
Vinton | 88.25667 |
Stark | 88.46193 |
Shelby | 88.72308 |
Harrison | 89.79667 |
Richland | 90.09864 |
Ross | 90.13375 |
Auglaize | 90.13500 |
Champaign | 90.35800 |
Allen | 90.40952 |
Coshocton | 90.52375 |
Knox | 90.58900 |
Wayne | 90.67389 |
Wyandot | 90.70429 |
Trumbull | 90.71679 |
Williams | 90.72250 |
Fayette | 90.83000 |
Clinton | 90.88200 |
Carroll | 91.02000 |
Union | 91.06300 |
Medina | 91.15308 |
Tuscarawas | 91.26100 |
Warren | 91.46556 |
Montgomery | 91.57000 |
Columbiana | 91.62000 |
Athens | 91.91444 |
Mercer | 92.20143 |
Lucas | 92.32837 |
VanWert | 92.55500 |
Franklin | 92.84257 |
Marion | 92.87909 |
Clark | 92.93538 |
Greene | 93.03650 |
Miami | 93.15937 |
Lorain | 93.20574 |
Fairfield | 93.27955 |
Hocking | 93.31833 |
Seneca | 93.44714 |
Lake | 93.51222 |
Wood | 93.54167 |
Hardin | 93.57000 |
Ashland | 93.60500 |
Logan | 93.70250 |
Licking | 93.75833 |
Fulton | 93.96714 |
Henry | 94.00600 |
Geauga | 94.04429 |
Butler | 94.07500 |
Putnam | 94.08455 |
Guernsey | 94.12000 |
Delaware | 94.22861 |
Noble | 94.27500 |
Defiance | 94.36857 |
Pickaway | 94.47556 |
Crawford | 94.55667 |
Hancock | 94.71429 |
Portage | 95.08600 |
Madison | 95.13500 |
Sandusky | 95.38167 |
Darke | 95.73250 |
Preble | 95.76000 |
Belmont | 95.83333 |
Highland | 95.98333 |
Washington | 96.03000 |
Morgan | 96.05250 |
Clermont | 96.32303 |
Jefferson | 96.40615 |
Paulding | 96.45800 |
Ashtabula | 97.28417 |
Meigs | 97.40000 |
Brown | 97.67750 |
Perry | 97.67889 |
Jackson | 97.81333 |
Pike | 97.83250 |
Muskingum | 97.88538 |
Scioto | 98.00417 |
Adams | 98.10250 |
Gallia | 98.16429 |
Monroe | 98.54250 |
Lawrence | 98.86889 |
What are your findings about the summary? Are they what you expected?
The findings are not entirely what I expected. I was surprised that private and public schools have similar mean vaccination rates. As expected, the mean mmr vaccination rates were higher than overall vaccination rates for both public and private schools.
Make at least two plots that help you answer your question on the transformed or summarized data. Use scales and/or labels to make each plot informative.
# Boxplot of mmr and overall vaccination rates for public vs. private
ggplot(measles_long) +
  aes(x = type,
      y = vax_rate,
      fill = type) +
  labs(
    x = "School Type",
    y = "Vaccination Rate",
    title = "Ohio School Vaccination Rates 2018-2019"
  ) +
  geom_boxplot() +
  facet_wrap(vars(vax))
# Density plot of mmr vaccination by school type
ggplot(data = measles_oh,
       aes(x = mmr,
           fill = type)) +
  geom_density(alpha = 0.4) +
  scale_fill_discrete(name = "School Type") +
  labs(
    x = "MMR Vaccination Rate",
    title = "Ohio School MMR Vaccination Rates 2018-2019"
  ) +
  hrbrthemes::theme_ipsum() +
  theme(legend.position = c(.15, .8))
# the distributions for public and private schools are very similar
Summarize your research question and findings below.
In Ohio schools in 2018-2019, the mean MMR vaccination rates were very similar across public and private schools (90.5% vs. 90.4%). The mean overall vaccination rates were also similar across public and private schools (87.7% vs. 88.2%). While close, MMR vaccination rates were slightly higher than overall vaccination rates. While the average MMR vaccination rate was about 90%, 13 counties had mean MMR vaccination rates below 90%. Holmes, Morrow, and Hamilton counties ranked the lowest in mean MMR vaccination rates. There were 47 schools with MMR vaccination rates under 50%, with the lowest at 14.3% vaccinated.
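A count like the number of counties below 90% can be computed directly instead of being read off the sorted table. The sketch below shows the pattern on a made-up data frame (`toy`, with hypothetical counties X, Y, and Z); applied to measles_oh, the same steps should give the county count reported above.

```r
library(dplyr)

# Toy school-level data (hypothetical counties and rates)
toy <- data.frame(
  county = c("X", "X", "Y", "Y", "Z"),
  mmr    = c(80, 90, 95, 97, 100)
)

# Mean mmr rate per county, sorted from lowest to highest
county_means <- toy %>%
  group_by(county) %>%
  summarize(mmr_mean = mean(mmr)) %>%
  arrange(mmr_mean)

# Number of counties whose mean rate falls below the 90% threshold
n_below <- sum(county_means$mmr_mean < 90)
```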
Are your findings what you expected? Why or why not?
I expected private schools to have lower vaccination rates, because they may not have the same state requirements as public schools, but they actually had similar rates. As expected, MMR vaccination rates were higher than overall vaccination rates, likely because some students have their MMR vaccine but not all of their required vaccines.