Midterm (Due Sunday 2/19/2023 at 11:55 pm)

Please submit your .Rmd and .html files in Sakai. If you are working together, both people should submit the files.

60 / 60 points total

The goal of the midterm project is to showcase skills that you have learned in class so far. The midterm is open note, but if you use someone else’s code, you must attribute them.

Instructions: Before you get Started

Pick a dataset. Ideally, the dataset should be around 2000 rows, and should have both categorical and numeric covariates. Choose a dataset with more than 4 columns/variables.

Potential Sources for data:
Note that most of the Tidy Tuesday data are .csv files. There is code to load the files from csv for each of the datasets and a short description of the variables, or you can upload the .csv file into your data folder. This resource is probably the easiest to deal with.
You may use another dataset or your own data, but please make sure it is de-identified and has enough rows/variables.

Define a research question involving at least one categorical variable. You may schedule a time with Jessica or Brad to discuss your data set and research question, or you may send it to one of us on Slack or by email. Please do one of the two early on. We just want to look at the data and make sure that it is appropriate for your question.
You must use each of the following functions at least once:

mutate()
group_by()
summarize()
ggplot()

and at least one of the following:

case_when()
across()
*_join() (e.g. left_join())
pivot_*() (e.g. pivot_longer())
function()

The code chunks below are guides; please add more code chunks to do what you need.
If you do not want your final project posted on the public website, please let Jessica know. We can also keep it anonymous if you’d like to remove your name from the Rmd and html, or use a pseudonym.

You may remove these instructions from your final Rmd if you like

Working Together

If you’d like to work together in pairs, that is encouraged, but you must divide the work equitably and you must note who worked on what. This is probably easiest as notes in the text. Please let Brad or Jessica know that you’ll be working together.

No acknowledgements of contributions = -10 points overall.

Please Note

I will take off points (-5 points for each section) if you don’t add observations and notes in your RMarkdown document. I want you to think and reason through your analysis, even if they are preliminary thoughts.

Define Your Research Question (10 points)

Define your research question below. What about the data interests you? What is a specific question you want to find out about the data?

I’m interested in whether vaccination rates differ across school types (public or private) for both MMR vaccination and overall vaccination during the 2018-2019 school year. I’m also interested in which counties have the lowest vaccination rates. I limited this data set to Ohio because that state reported school types as well as MMR and overall vaccination rates.

Given your question, what is your expectation about the data?

I expect public schools to have higher vaccination rates than private schools for both MMR and overall vaccinations. MMR vaccination rates should be higher than overall vaccination rates, as some students may have the MMR vaccine but not all of their state’s required vaccines. I’m not sure which counties will have the lowest vaccination rates.

Loading the Data (10 points)

Load the data below and use dplyr::glimpse() or skimr::skim() on the data. You should upload the data file into the data directory.

I initially imported the data from the tidytuesday website and saved it in the project data folder. This data set is quite large, but I’ll be limiting it to Ohio data only.

library(tidyverse)
library(skimr)
library(gt)

#measles <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-25/measles.csv')
#write.csv(measles, file = "BSTA 504 Midterm/data/measles.csv", row.names = FALSE)

measles <- readr::read_csv("BSTA 504 Midterm/data/measles.csv")

## Rows: 66113 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): state, year, name, type, city, county
## dbl (8): index, enroll, mmr, overall, xmed, xper, lat, lng
## lgl (2): district, xrel
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

dplyr::glimpse(measles)

## Rows: 66,113
## Columns: 16
## $ index    <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 11, 12, 13, 14, 15, 15, 16…
## $ state    <chr> "Arizona", "Arizona", "Arizona", "Arizona", "Arizona", "Arizo…
## $ year     <chr> "2018-19", "2018-19", "2018-19", "2018-19", "2018-19", "2018-…
## $ name     <chr> "A J Mitchell Elementary", "Academy Del Sol", "Academy Del So…
## $ type     <chr> "Public", "Charter", "Charter", "Charter", "Charter", "Public…
## $ city     <chr> "Nogales", "Tucson", "Tucson", "Phoenix", "Phoenix", "Phoenix…
## $ county   <chr> "Santa Cruz", "Pima", "Pima", "Maricopa", "Maricopa", "Marico…
## $ district <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ enroll   <dbl> 51, 22, 85, 60, 43, 36, 24, 22, 26, 78, 78, 35, 54, 54, 34, 5…
## $ mmr      <dbl> 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 1…
## $ overall  <dbl> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -…
## $ xrel     <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ xmed     <dbl> NA, NA, NA, NA, 2.33, NA, NA, NA, NA, NA, NA, 2.86, NA, 7.41,…
## $ xper     <dbl> NA, NA, NA, NA, 2.33, NA, 4.17, NA, NA, NA, NA, NA, NA, NA, N…
## $ lat      <dbl> 31.34782, 32.22192, 32.13049, 33.48545, 33.49562, 33.43532, 3…
## $ lng      <dbl> -110.9380, -110.8961, -111.1170, -112.1306, -112.2247, -112.1…

If there are any quirks that you have to deal with, such as NA coded as something else or data spread across multiple tables, please make some notes here about what you need to do before you start transforming the data in the next section.

MMR and overall vaccination rates with missing values are coded as -1, so these rows are removed below.
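As a minimal sketch of an alternative way to handle the -1 placeholder, the values can first be recoded to NA and then filtered. The toy table and its values below are made up for illustration and are not taken from the measles data.

```r
library(dplyr)

# Toy data standing in for the measles table; all values are invented.
toy <- tibble(
  name    = c("School A", "School B", "School C"),
  mmr     = c(95.2, -1, 88.0),
  overall = c(-1, 91.5, 85.0)
)

# Recode the -1 placeholder as NA in both rate columns.
toy_na <- toy %>%
  mutate(across(c(mmr, overall), ~ na_if(.x, -1)))

# Keep only schools that report both rates.
toy_complete <- toy_na %>%
  filter(!is.na(mmr), !is.na(overall))
```

Recoding to NA first keeps the missingness visible in tools such as skim() before any rows are dropped.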

Make sure your data types are correct!

The data types are as expected. Most importantly: type and county are character variables, while mmr and overall are doubles.

Transforming the data (15 points)

If the data needs to be transformed in any way (values recoded, pivoted, etc), do it here. Examples include transforming a continuous variable into a categorical using case_when(), etc.
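For illustration, here is one way a continuous rate could be turned into a categorical variable with case_when(). The toy data and the 95 and 90 cutoffs are assumptions chosen for the example, not values prescribed by the assignment.

```r
library(dplyr)

# Toy data; all values are invented.
toy <- tibble(
  name = c("School A", "School B", "School C", "School D"),
  mmr  = c(99.1, 92.0, 74.5, NA)
)

# Conditions are checked in order, so the first match wins.
# The 95 and 90 cutoffs are illustrative assumptions.
toy_cat <- toy %>%
  mutate(
    mmr_level = case_when(
      is.na(mmr) ~ NA_character_,
      mmr >= 95  ~ "high",
      mmr >= 90  ~ "medium",
      TRUE       ~ "low"
    )
  )
```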

The following data transformations were done:

remove rows with mmr == -1 or overall == -1
limit data to Ohio state
remove unwanted columns
drop missing values (as NA) from school type and mmr, as these are the most important variables for analysis
remove duplicates (some schools were counted twice with the same data)
change the school type variable into a factor
create a separate object with the data in long format to compare vaccination rates for mmr vs. overall

#remove rows with an mmr or overall vaccination rate of -1
measles <- measles[measles$mmr != -1 & measles$overall != -1, ]

#limit data to Ohio state
measles_oh <- measles[measles$state == "Ohio", ]

#remove unwanted columns
measles_oh <- measles_oh %>% select(-lat, -lng, -district, -xrel, -xmed, -xper) %>%
  #drop missing values from type and mmr
  drop_na(type, mmr) %>%
  #remove duplicates
  distinct() %>%
  #reassign the type variable to be a factor
  mutate(type = factor(type))

#create a long format measles data set over mmr and overall
measles_long <- 
  measles_oh %>%
    pivot_longer(
      cols= c(mmr, overall), 
      names_to = "vax", # column names are moved here
      values_to = "vax_rate" # data points are moved here
    )

Bonus points (5 points) for datasets that require merging of tables, but only if you reason through whether you should use left_join, inner_join, or right_join on these tables. No credit will be provided if you don’t.
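A small sketch of how the choice of join changes the result. Both tables, and the region column in particular, are hypothetical and exist only to show the difference. left_join() keeps every school and fills unmatched lookups with NA, while inner_join() keeps only schools that have a match.

```r
library(dplyr)

# Hypothetical school-level table.
schools <- tibble(
  name   = c("School A", "School B", "School C"),
  county = c("Lucas", "Wood", "Holmes"),
  mmr    = c(98.0, 93.5, 74.6)
)

# Hypothetical lookup table that is missing Holmes county.
county_info <- tibble(
  county = c("Lucas", "Wood"),
  region = c("Northwest", "Northwest")
)

# left_join: all 3 schools are kept; region is NA for Holmes.
kept_all <- left_join(schools, county_info, by = "county")

# inner_join: only the 2 schools with a matching county are kept.
matched_only <- inner_join(schools, county_info, by = "county")
```

If dropping unmatched schools would bias the summary, left_join() is usually the safer choice, since the missing lookups stay visible as NA.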

Show your transformed table here. Use tools such as glimpse(), skim() or head() to illustrate your point.

#check data transformations
glimpse(measles_oh)

## Rows: 1,986
## Columns: 10
## $ index   <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,…
## $ state   <chr> "Ohio", "Ohio", "Ohio", "Ohio", "Ohio", "Ohio", "Ohio", "Ohio"…
## $ year    <chr> "2018-19", "2018-19", "2018-19", "2018-19", "2018-19", "2018-1…
## $ name    <chr> "Academy Of Educational Excellence", "Adrian Elementary", "All…
## $ type    <fct> Public, Public, Private, Private, Private, Private, Private, P…
## $ city    <chr> "Toledo", "South Euclid", "Cincinnati", "Columbus", "Rossford"…
## $ county  <chr> "Lucas", "Cuyahoga", "Hamilton", "Franklin", "Wood", "Lake", "…
## $ enroll  <dbl> 22, 62, 52, 38, 23, 11, 15, 64, 80, 201, 117, 28, 15, 96, 73, …
## $ mmr     <dbl> 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 10…
## $ overall <dbl> 95.45, 95.16, 100.00, 100.00, 95.65, 100.00, 100.00, 98.44, 98…

skim(measles_oh)

Data summary
Name	measles_oh
Number of rows	1986
Number of columns	10
_______________________
Column type frequency:
character	5
factor	1
numeric	4
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
state	1	4	4	1
year	1	7	7	1
name	1	4	60	1781
city	1	3	20	619
county	1	4	10	88

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
type	0	1	FALSE	2	Pub: 1582, Pri: 404

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
index	1	993.50	573.46	1.00	497.25	993.50	1489.75	1987	▇▇▇▇▇
enroll	1	66.37	48.10	11.00	37.00	58.00	83.00	743	▇▁▁▁▁
mmr	1	90.44	12.42	14.29	88.73	94.64	97.62	100	▁▁▁▁▇
overall	1	87.85	14.29	11.11	85.46	92.86	96.30	100	▁▁▁▁▇

#two school types
table(measles_oh$type)

## 
## Private  Public 
##     404    1582

#vaccination percentages
range(measles_oh$mmr)

## [1]  14.29 100.00

range(measles_oh$overall)

## [1]  11.11 100.00

#check long format data set
glimpse(measles_long)

## Rows: 3,972
## Columns: 10
## $ index    <dbl> 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10,…
## $ state    <chr> "Ohio", "Ohio", "Ohio", "Ohio", "Ohio", "Ohio", "Ohio", "Ohio…
## $ year     <chr> "2018-19", "2018-19", "2018-19", "2018-19", "2018-19", "2018-…
## $ name     <chr> "Academy Of Educational Excellence", "Academy Of Educational …
## $ type     <fct> Public, Public, Public, Public, Private, Private, Private, Pr…
## $ city     <chr> "Toledo", "Toledo", "South Euclid", "South Euclid", "Cincinna…
## $ county   <chr> "Lucas", "Lucas", "Cuyahoga", "Cuyahoga", "Hamilton", "Hamilt…
## $ enroll   <dbl> 22, 22, 62, 62, 52, 52, 38, 38, 23, 23, 11, 11, 15, 15, 64, 6…
## $ vax      <chr> "mmr", "overall", "mmr", "overall", "mmr", "overall", "mmr", …
## $ vax_rate <dbl> 100.00, 95.45, 100.00, 95.16, 100.00, 100.00, 100.00, 100.00,…

Are the values what you expected for the variables? Why or Why not?

Yes, the values are what I expected. State and year are characters with only one unique value and no missing values. Name, city, and county are also character variables and have no missing values. School type is a factor with 2 unique values and no missing values. Index has no missing values. MMR and overall are doubles with no missing values, and both range between 0% and 100%. The measles_long dataset has twice as many rows as measles_oh (3,972 vs. 1,986), which is expected because each school now has one row for mmr and one for overall.
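The row-count expectation for the long format can also be checked in code rather than by eye. This is a sketch on made-up toy data that mirrors the mmr and overall columns.

```r
library(dplyr)
library(tidyr)

# Toy wide table; all values are invented.
toy_wide <- tibble(
  name    = c("School A", "School B", "School C"),
  mmr     = c(100, 94.6, 88.7),
  overall = c(95.5, 92.9, 85.5)
)

toy_long <- toy_wide %>%
  pivot_longer(
    cols = c(mmr, overall),
    names_to = "vax",
    values_to = "vax_rate"
  )

# Pivoting two columns should exactly double the number of rows.
stopifnot(nrow(toy_long) == 2 * nrow(toy_wide))

# Each vaccine type should appear once per school.
vax_counts <- count(toy_long, vax)
```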

Visualizing and Summarizing the Data (15 points)

Use group_by() and summarize() to make a summary of the data here. The summary should be relevant to your research question

#summary stats for mmr and overall vaccination by school type
measles_oh %>%
  group_by(type) %>%
  summarize(across(c(mmr, overall),
            .fns = list(min = ~ min(.x, na.rm = TRUE),
                        mean = ~ mean(.x, na.rm = TRUE),
                        max = ~ max(.x, na.rm = TRUE)))) %>%
  #create table
  gt()

type	mmr_min	mmr_mean	mmr_max	overall_min	overall_mean	overall_max
Private	14.29	90.38849	100	11.11	88.24064	100
Public	16.67	90.45727	100	12.00	87.74507	100

#number of schools with mmr vaccination rates below 50%
sum(measles_oh$mmr<50)

## [1] 47

#summary stats for mean mmr vaccination per county
measles_oh %>%
  group_by(county) %>%
  summarize(across(c(mmr), list(mean=mean))) %>%
  #sort by lowest to highest vaccination rate
  arrange(mmr_mean) %>%
  gt()

county	mmr_mean
Holmes	74.60600
Morrow	76.27800
Hamilton	81.95805
Erie	82.19714
Ottawa	82.36750
Mahoning	84.55333
Summit	85.26511
Huron	85.67667
Cuyahoga	86.03346
Vinton	88.25667
Stark	88.46193
Shelby	88.72308
Harrison	89.79667
Richland	90.09864
Ross	90.13375
Auglaize	90.13500
Champaign	90.35800
Allen	90.40952
Coshocton	90.52375
Knox	90.58900
Wayne	90.67389
Wyandot	90.70429
Trumbull	90.71679
Williams	90.72250
Fayette	90.83000
Clinton	90.88200
Carroll	91.02000
Union	91.06300
Medina	91.15308
Tuscarawas	91.26100
Warren	91.46556
Montgomery	91.57000
Columbiana	91.62000
Athens	91.91444
Mercer	92.20143
Lucas	92.32837
VanWert	92.55500
Franklin	92.84257
Marion	92.87909
Clark	92.93538
Greene	93.03650
Miami	93.15937
Lorain	93.20574
Fairfield	93.27955
Hocking	93.31833
Seneca	93.44714
Lake	93.51222
Wood	93.54167
Hardin	93.57000
Ashland	93.60500
Logan	93.70250
Licking	93.75833
Fulton	93.96714
Henry	94.00600
Geauga	94.04429
Butler	94.07500
Putnam	94.08455
Guernsey	94.12000
Delaware	94.22861
Noble	94.27500
Defiance	94.36857
Pickaway	94.47556
Crawford	94.55667
Hancock	94.71429
Portage	95.08600
Madison	95.13500
Sandusky	95.38167
Darke	95.73250
Preble	95.76000
Belmont	95.83333
Highland	95.98333
Washington	96.03000
Morgan	96.05250
Clermont	96.32303
Jefferson	96.40615
Paulding	96.45800
Ashtabula	97.28417
Meigs	97.40000
Brown	97.67750
Perry	97.67889
Jackson	97.81333
Pike	97.83250
Muskingum	97.88538
Scioto	98.00417
Adams	98.10250
Gallia	98.16429
Monroe	98.54250
Lawrence	98.86889

What are your findings about the summary? Are they what you expected?

The findings are not entirely what I expected. I was surprised that private and public schools have similar mean vaccination rates. As expected, the mean MMR vaccination rates were higher than the mean overall vaccination rates for both public and private schools.

Make at least two plots that help you answer your question on the transformed or summarized data. Use scales and/or labels to make each plot informative.

#Boxplot of measles and overall vaccination rates for public vs. private
ggplot(measles_long) +
  
  aes(x = type, 
      y = vax_rate, 
      fill = type) +
  
  labs(
    x = "School Type",
    y = "Vaccination Rate",
    title = "Ohio School Vaccination Rates 2018-2019"
    ) +
  
  geom_boxplot() +
  
  facet_wrap(vars(vax))

#Density plot of mmr vaccination by school type
ggplot(data = measles_oh,
       aes(x = mmr,
           fill = type)
       ) + 
  geom_density(alpha = 0.4) +
  scale_fill_discrete(
    name = "School Type"
    ) +
  labs(
    x = "MMR Vaccination Rate",   
    title = "Ohio School MMR Vaccination Rates 2018-2019"
    ) +
  hrbrthemes::theme_ipsum() +
  theme(legend.position=c(.15,.8))

#the distributions for public and private schools are very similar

Final Summary (10 points)

Summarize your research question and findings below.

In Ohio schools in 2018-2019, the mean MMR vaccination rates were very similar across public and private schools (90.5% vs. 90.4%). The mean overall vaccination rates were also similar across public and private schools (87.7% vs. 88.2%). While close, MMR vaccination rates were slightly higher than overall vaccination rates. While the average MMR vaccination rate across schools was about 90%, 13 counties had mean MMR vaccination rates below 90%. Holmes, Morrow, and Hamilton counties had the lowest mean MMR vaccination rates. There were 47 schools with MMR vaccination rates under 50%, with the lowest at 14.3%.
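Counts such as the number of counties below a threshold can be computed directly instead of read off a sorted table. The sketch below uses invented toy values, so its counts do not correspond to the Ohio results.

```r
library(dplyr)

# Toy school-level data; all values are invented.
toy <- tibble(
  county = c("Holmes", "Holmes", "Lucas", "Lucas", "Wood"),
  mmr    = c(70, 80, 92, 94, 89)
)

# Mean MMR rate per county.
county_means <- toy %>%
  group_by(county) %>%
  summarize(mmr_mean = mean(mmr), .groups = "drop")

# Number of counties whose mean MMR rate is below 90.
n_below_90 <- county_means %>%
  filter(mmr_mean < 90) %>%
  nrow()
```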

Are your findings what you expected? Why or Why not?

I expected private schools to have lower vaccination rates, because they may not have the same state requirements as public schools, but they actually had similar rates. As expected, MMR vaccination rates were higher than overall vaccination rates, likely because some students have their MMR vaccine but not all of their required vaccines.

Midterm

Ingrid Jennings

2023-02-19