Midterm (Due Sunday 2/19/2023 at 11:55 pm)

Please submit your .Rmd and .html files in Sakai. If you are working together, both people should submit the files.

60 / 60 points total

The goal of the midterm project is to showcase skills that you have learned in class so far. The midterm is open note, but if you use someone else’s code, you must attribute them.

Instructions: Before you get Started

  1. Pick a dataset. Ideally, the dataset should be around 2000 rows, and should have both categorical and numeric covariates. Choose a dataset with more than 4 columns/variables.
  • Potential Sources for data:
  • Note that most of the Tidy Tuesday data are .csv files. There is code to load the files from csv for each of the datasets and a short description of the variables, or you can upload the .csv file into your data folder. This resource is probably the easiest to deal with.
  • You may use another dataset or your own data, but please make sure it is de-identified and has enough rows/variables.
  2. Define a research question involving at least one categorical variable. You may schedule a time with Jessica or Brad to discuss your dataset and research question, or you may message it to one of us on Slack or by email. Please do one of the two options early on. We just want to look at the data and make sure that it is appropriate for your question.

  3. You must use each of the following functions at least once:

  • mutate()
  • group_by()
  • summarize()
  • ggplot()

and at least one of the following:

  • case_when()
  • across()
  • *_join() (e.g. left_join())
  • pivot_*() (e.g. pivot_longer())
  • function()

  4. The code chunks below are guides; please add more code chunks to do what you need.

  5. If you do not want your final project posted on the public website, please let Jessica know. We can also keep it anonymous if you’d like to remove your name from the Rmd and html, or use a pseudonym.

You may remove these instructions from your final Rmd if you like

Working Together

If you’d like to work together in pairs, that is encouraged, but you must divide the work equitably and you must note who worked on what. This is probably easiest as notes in the text. Please let Brad or Jessica know that you’ll be working together.

No acknowledgements of contributions = -10 points overall.

Please Note

I will take off points (-5 points for each section) if you don’t add observations and notes in your RMarkdown document. I want you to think and reason through your analysis, even if they are preliminary thoughts.

Define Your Research Question (10 points)

Define your research question below. What about the data interests you? What is a specific question you want to find out about the data?

I’m interested in whether vaccination rates differ across school types (public or private), for both MMR vaccination and overall vaccination, during the 2018-2019 school year. I’m also interested in which counties have the lowest vaccination rates. I chose to limit this data set to Ohio because that state reported school types as well as both MMR and overall vaccination rates.

Given your question, what is your expectation about the data?

I expect public schools to have higher vaccination rates than private schools for both MMR and overall vaccinations. MMR vaccination rates should be higher than overall vaccination rates, as some students may have the MMR vaccine but not all of their state’s required vaccines. I’m not sure which counties will have the lowest vaccination rates.

Loading the Data (10 points)

Load the data below and use dplyr::glimpse() or skimr::skim() on the data. You should upload the data file into the data directory.

I initially imported the data from the tidytuesday website and saved it in the project data folder. This data set is quite large, but I’ll be limiting it to Ohio data only.

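#packages used below (assumed; the original setup chunk is not shown)
library(tidyverse); library(skimr); library(gt)

#measles <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-25/measles.csv')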
#write.csv(measles, file = "BSTA 504 Midterm/data/measles.csv", row.names = FALSE)

measles <- readr::read_csv("BSTA 504 Midterm/data/measles.csv")
## Rows: 66113 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): state, year, name, type, city, county
## dbl (8): index, enroll, mmr, overall, xmed, xper, lat, lng
## lgl (2): district, xrel
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dplyr::glimpse(measles)
## Rows: 66,113
## Columns: 16
## $ index    <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 11, 12, 13, 14, 15, 15, 16…
## $ state    <chr> "Arizona", "Arizona", "Arizona", "Arizona", "Arizona", "Arizo…
## $ year     <chr> "2018-19", "2018-19", "2018-19", "2018-19", "2018-19", "2018-…
## $ name     <chr> "A J Mitchell Elementary", "Academy Del Sol", "Academy Del So…
## $ type     <chr> "Public", "Charter", "Charter", "Charter", "Charter", "Public…
## $ city     <chr> "Nogales", "Tucson", "Tucson", "Phoenix", "Phoenix", "Phoenix…
## $ county   <chr> "Santa Cruz", "Pima", "Pima", "Maricopa", "Maricopa", "Marico…
## $ district <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ enroll   <dbl> 51, 22, 85, 60, 43, 36, 24, 22, 26, 78, 78, 35, 54, 54, 34, 5…
## $ mmr      <dbl> 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 1…
## $ overall  <dbl> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -…
## $ xrel     <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ xmed     <dbl> NA, NA, NA, NA, 2.33, NA, NA, NA, NA, NA, NA, 2.86, NA, 7.41,…
## $ xper     <dbl> NA, NA, NA, NA, 2.33, NA, 4.17, NA, NA, NA, NA, NA, NA, NA, N…
## $ lat      <dbl> 31.34782, 32.22192, 32.13049, 33.48545, 33.49562, 33.43532, 3…
## $ lng      <dbl> -110.9380, -110.8961, -111.1170, -112.1306, -112.2247, -112.1…

If there are any quirks that you have to deal with, such as NA coded as something else or data spread across multiple tables, please make some notes here about what you need to do before you start transforming the data in the next section.

MMR and overall vaccination rates with missing values are coded as -1, so these rows are removed below.
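As an alternative to dropping these rows, the -1 sentinel could be recoded to NA so that the missingness stays visible in summaries. The following is a minimal sketch of that alternative on made-up toy data, not the approach used in this report; the object names `toy` and `toy_na` are hypothetical.

```r
library(dplyr)

# Toy data (hypothetical values) mimicking the -1 sentinel in mmr and overall
toy <- tibble::tibble(
  name    = c("School A", "School B", "School C"),
  mmr     = c(100, -1, 92.5),
  overall = c(-1, 88.0, 90.0)
)

# Recode the -1 sentinel to NA in both rate columns
toy_na <- toy %>%
  mutate(across(c(mmr, overall), ~ na_if(.x, -1)))

# Count the missing values per column
colSums(is.na(toy_na))
```

Recoding keeps every school in the table, so a school that reports only one of the two rates would still contribute to the summary of the rate it does report.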

Make sure your data types are correct!

The data types are as expected. Most importantly: type and county are character variables, while mmr and overall are doubles.

Transforming the data (15 points)

If the data needs to be transformed in any way (values recoded, pivoted, etc), do it here. Examples include transforming a continuous variable into a categorical using case_when(), etc.
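As an illustration of the case_when() example mentioned above, here is a small sketch that bins a numeric rate into categories. The toy values and the 95 and 90 cutoffs are assumptions chosen for illustration only; they are not part of the assignment or of this analysis.

```r
library(dplyr)

# Toy data (hypothetical values)
toy <- tibble::tibble(mmr = c(99.1, 93.0, 72.4, NA))

# The cutoffs 95 and 90 are illustrative assumptions
toy_cat <- toy %>%
  mutate(
    mmr_level = case_when(
      mmr >= 95 ~ "high",
      mmr >= 90 ~ "medium",
      mmr <  90 ~ "low"
      # rows with NA in mmr match no condition and stay NA
    )
  )

toy_cat
```

Conditions are checked in order, so each row gets the label of the first condition it satisfies.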

The following data transformations were done:

  • remove rows with mmr == -1 or overall == -1
  • limit data to Ohio state
  • remove unwanted columns
  • drop missing values (as NA) from school type and mmr, as these are the most important variables for analysis
  • remove duplicates (some schools were counted twice with the same data)
  • change the school type variable into a factor
  • create a separate object with the data in long format to compare vaccination rates for mmr vs. overall

#remove rows with mmr or overall vaccination rate coded as -1
measles <- measles[measles$mmr!=-1 & measles$overall!=-1,]

#limit data to Ohio state
measles_oh <- measles[measles$state=="Ohio",]

#remove unwanted columns
measles_oh <- measles_oh %>% select(-lat, -lng, -district, -xrel, -xmed, -xper) %>%
  #drop missing values from type and mmr
  drop_na(type, mmr) %>%
  #remove duplicates
  distinct() %>%
  #reassign the type variable to be a factor
  mutate(type = factor(type))

#create a long format measles data set over mmr and overall
measles_long <- 
  measles_oh %>%
    pivot_longer(
      cols= c(mmr, overall), 
      names_to = "vax", # column names are moved here
      values_to = "vax_rate" # data points are moved here
    )

Bonus points (5 points) for datasets that require merging of tables, but only if you reason through whether you should use left_join, inner_join, or right_join on these tables. No credit will be provided if you don’t.
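This analysis did not require merging tables, but the difference between the join types can be sketched on toy data. Both tables below and all of their values are hypothetical and exist only to show which rows each join keeps.

```r
library(dplyr)

# Hypothetical tables: school rates and a county lookup that lacks one county
schools <- tibble::tibble(
  name   = c("School A", "School B", "School C"),
  county = c("Lucas", "Holmes", "Wood"),
  mmr    = c(100, 74.6, 93.5)
)
county_pop <- tibble::tibble(
  county     = c("Lucas", "Wood"),
  population = c(430000, 130000)  # made-up figures
)

# left_join keeps every school; the unmatched county gets NA for population
kept_all <- left_join(schools, county_pop, by = "county")

# inner_join keeps only schools whose county appears in both tables
matched_only <- inner_join(schools, county_pop, by = "county")

nrow(kept_all)
nrow(matched_only)
```

A left join is usually the safer choice when the left table is the unit of analysis, since no schools are silently dropped when the lookup table is incomplete.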

Show your transformed table here. Use tools such as glimpse(), skim() or head() to illustrate your point.

#check data transformations
glimpse(measles_oh)
## Rows: 1,986
## Columns: 10
## $ index   <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,…
## $ state   <chr> "Ohio", "Ohio", "Ohio", "Ohio", "Ohio", "Ohio", "Ohio", "Ohio"…
## $ year    <chr> "2018-19", "2018-19", "2018-19", "2018-19", "2018-19", "2018-1…
## $ name    <chr> "Academy Of Educational Excellence", "Adrian Elementary", "All…
## $ type    <fct> Public, Public, Private, Private, Private, Private, Private, P…
## $ city    <chr> "Toledo", "South Euclid", "Cincinnati", "Columbus", "Rossford"…
## $ county  <chr> "Lucas", "Cuyahoga", "Hamilton", "Franklin", "Wood", "Lake", "…
## $ enroll  <dbl> 22, 62, 52, 38, 23, 11, 15, 64, 80, 201, 117, 28, 15, 96, 73, …
## $ mmr     <dbl> 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 10…
## $ overall <dbl> 95.45, 95.16, 100.00, 100.00, 95.65, 100.00, 100.00, 98.44, 98…
skim(measles_oh)
Data summary
Name measles_oh
Number of rows 1986
Number of columns 10
_______________________
Column type frequency:
character 5
factor 1
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
state 0 1 4 4 0 1 0
year 0 1 7 7 0 1 0
name 0 1 4 60 0 1781 0
city 0 1 3 20 0 619 0
county 0 1 4 10 0 88 0

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
type 0 1 FALSE 2 Pub: 1582, Pri: 404

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
index 0 1 993.50 573.46 1.00 497.25 993.50 1489.75 1987 ▇▇▇▇▇
enroll 0 1 66.37 48.10 11.00 37.00 58.00 83.00 743 ▇▁▁▁▁
mmr 0 1 90.44 12.42 14.29 88.73 94.64 97.62 100 ▁▁▁▁▇
overall 0 1 87.85 14.29 11.11 85.46 92.86 96.30 100 ▁▁▁▁▇
#two school types
table(measles_oh$type)
## 
## Private  Public 
##     404    1582
#vaccination percentages
range(measles_oh$mmr)
## [1]  14.29 100.00
range(measles_oh$overall)
## [1]  11.11 100.00
#check long format data set
glimpse(measles_long)
## Rows: 3,972
## Columns: 10
## $ index    <dbl> 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10,…
## $ state    <chr> "Ohio", "Ohio", "Ohio", "Ohio", "Ohio", "Ohio", "Ohio", "Ohio…
## $ year     <chr> "2018-19", "2018-19", "2018-19", "2018-19", "2018-19", "2018-…
## $ name     <chr> "Academy Of Educational Excellence", "Academy Of Educational …
## $ type     <fct> Public, Public, Public, Public, Private, Private, Private, Pr…
## $ city     <chr> "Toledo", "Toledo", "South Euclid", "South Euclid", "Cincinna…
## $ county   <chr> "Lucas", "Lucas", "Cuyahoga", "Cuyahoga", "Hamilton", "Hamilt…
## $ enroll   <dbl> 22, 22, 62, 62, 52, 52, 38, 38, 23, 23, 11, 11, 15, 15, 64, 6…
## $ vax      <chr> "mmr", "overall", "mmr", "overall", "mmr", "overall", "mmr", …
## $ vax_rate <dbl> 100.00, 95.45, 100.00, 95.16, 100.00, 100.00, 100.00, 100.00,…

Are the values what you expected for the variables? Why or Why not?

Yes, the values are what I expected. State and year are characters with only one unique value and no missing values. Name, city, and county are also character variables and have no missing values. School type is a factor with 2 unique values and no missing values. Index has no missing values. The mmr and overall columns are doubles with no missing values, and both lie between 0 and 100%. The measles_long dataset has twice as many rows as measles_oh (3,972 vs. 1,986), which is expected because the two rate columns were stacked into one.
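The row-count expectation for the long format can also be checked in code rather than by eye. This is a minimal sketch on toy data; the objects `wide` and `long` are hypothetical stand-ins for measles_oh and measles_long.

```r
library(dplyr)
library(tidyr)

# Toy wide table (hypothetical values) with the two rate columns
wide <- tibble::tibble(
  name    = c("School A", "School B"),
  mmr     = c(100, 95),
  overall = c(95.45, 90)
)

# Stack the two rate columns, as done for measles_long
long <- wide %>%
  pivot_longer(
    cols      = c(mmr, overall),
    names_to  = "vax",
    values_to = "vax_rate"
  )

# Two value columns are stacked, so the long table has twice as many rows
stopifnot(nrow(long) == 2 * nrow(wide))
```

If the check fails, stopifnot() raises an error, which would stop the document from knitting with an unexpected table shape.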

Visualizing and Summarizing the Data (15 points)

Use group_by() and summarize() to make a summary of the data here. The summary should be relevant to your research question

#summary stats for mmr and overall vaccination by school type
measles_oh %>%
  group_by(type) %>%
  summarize(across(c(mmr, overall),
            .fns = list(min = ~ min(.x, na.rm = TRUE),
                        mean = ~ mean(.x, na.rm = TRUE),
                        max = ~ max(.x, na.rm = TRUE)))) %>%
  #create table
  gt()
type mmr_min mmr_mean mmr_max overall_min overall_mean overall_max
Private 14.29 90.38849 100 11.11 88.24064 100
Public 16.67 90.45727 100 12.00 87.74507 100
#number of schools with mmr vaccination rates below 50%
sum(measles_oh$mmr<50)
## [1] 47
#summary stats for mean mmr vaccination per county
measles_oh %>%
  group_by(county) %>%
  summarize(across(c(mmr), list(mean=mean))) %>%
  #sort by lowest to highest vaccination rate
  arrange(mmr_mean) %>%
  gt()
county mmr_mean
Holmes 74.60600
Morrow 76.27800
Hamilton 81.95805
Erie 82.19714
Ottawa 82.36750
Mahoning 84.55333
Summit 85.26511
Huron 85.67667
Cuyahoga 86.03346
Vinton 88.25667
Stark 88.46193
Shelby 88.72308
Harrison 89.79667
Richland 90.09864
Ross 90.13375
Auglaize 90.13500
Champaign 90.35800
Allen 90.40952
Coshocton 90.52375
Knox 90.58900
Wayne 90.67389
Wyandot 90.70429
Trumbull 90.71679
Williams 90.72250
Fayette 90.83000
Clinton 90.88200
Carroll 91.02000
Union 91.06300
Medina 91.15308
Tuscarawas 91.26100
Warren 91.46556
Montgomery 91.57000
Columbiana 91.62000
Athens 91.91444
Mercer 92.20143
Lucas 92.32837
VanWert 92.55500
Franklin 92.84257
Marion 92.87909
Clark 92.93538
Greene 93.03650
Miami 93.15937
Lorain 93.20574
Fairfield 93.27955
Hocking 93.31833
Seneca 93.44714
Lake 93.51222
Wood 93.54167
Hardin 93.57000
Ashland 93.60500
Logan 93.70250
Licking 93.75833
Fulton 93.96714
Henry 94.00600
Geauga 94.04429
Butler 94.07500
Putnam 94.08455
Guernsey 94.12000
Delaware 94.22861
Noble 94.27500
Defiance 94.36857
Pickaway 94.47556
Crawford 94.55667
Hancock 94.71429
Portage 95.08600
Madison 95.13500
Sandusky 95.38167
Darke 95.73250
Preble 95.76000
Belmont 95.83333
Highland 95.98333
Washington 96.03000
Morgan 96.05250
Clermont 96.32303
Jefferson 96.40615
Paulding 96.45800
Ashtabula 97.28417
Meigs 97.40000
Brown 97.67750
Perry 97.67889
Jackson 97.81333
Pike 97.83250
Muskingum 97.88538
Scioto 98.00417
Adams 98.10250
Gallia 98.16429
Monroe 98.54250
Lawrence 98.86889

What are your findings about the summary? Are they what you expected?

The findings are not entirely what I expected. I was surprised that private and public schools have similar mean vaccination rates. As expected, the mean MMR vaccination rates were higher than the mean overall vaccination rates for both public and private schools.

Make at least two plots that help you answer your question on the transformed or summarized data. Use scales and/or labels to make each plot informative.

#Boxplot of measles and overall vaccination rates for public vs. private
ggplot(measles_long) +
  
  aes(x = type, 
      y = vax_rate, 
      fill = type) +
  
  labs(
    x = "School Type",
    y = "Vaccination Rate",
    title = "Ohio School Vaccination Rates 2018-2019"
    ) +
  
  geom_boxplot() +
  
  facet_wrap(vars(vax))

#Density plot of mmr vaccination by school type
ggplot(data = measles_oh,
       aes(x = mmr,
           fill = type)
       ) + 
  geom_density(alpha = 0.4) +
  scale_fill_discrete(
    name = "School Type"
    ) +
  labs(
    x = "MMR Vaccination Rate",   
    title = "Ohio School MMR Vaccination Rates 2018-2019"
    ) +
  hrbrthemes::theme_ipsum() +
  theme(legend.position=c(.15,.8))

#the distributions for public and private schools are very similar

Final Summary (10 points)

Summarize your research question and findings below.

In Ohio schools in 2018-2019, the mean MMR vaccination rates were very similar across public and private schools (90.5% vs. 90.4%). The mean overall vaccination rates were also similar across public and private schools (87.7% vs. 88.2%). While close, MMR vaccination rates were slightly higher than overall vaccination rates. Although the average school-level MMR vaccination rate was about 90%, 13 counties had mean MMR vaccination rates below 90%. Holmes, Morrow, and Hamilton counties had the lowest mean MMR vaccination rates. There were 47 schools with MMR vaccination rates under 50%, ranging as low as 14.3% vaccinated.
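Counts such as the number of counties below 90% can be computed from the county summary instead of being read off the table. The sketch below shows one way to do that on toy data; the values and the object names `toy`, `county_means`, `n_below_90`, and `lowest` are hypothetical.

```r
library(dplyr)

# Toy school-level data (hypothetical values)
toy <- tibble::tibble(
  county = c("A", "A", "B", "B", "C"),
  mmr    = c(80, 90, 95, 97, 60)
)

# Mean MMR rate per county
county_means <- toy %>%
  group_by(county) %>%
  summarize(mmr_mean = mean(mmr), .groups = "drop")

# Number of counties whose mean rate is below 90
n_below_90 <- county_means %>%
  filter(mmr_mean < 90) %>%
  nrow()

# The two counties with the lowest mean rates
lowest <- county_means %>%
  slice_min(mmr_mean, n = 2)

n_below_90
lowest
```

Computing the count in code keeps the written summary in sync with the data if the filtering steps change.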

Are your findings what you expected? Why or Why not?

I expected private schools to have lower vaccination rates, because they may not have the same state requirements as public schools, but they actually had similar rates. As expected, MMR vaccination rates were higher than overall vaccination rates, likely because some students have their MMR vaccine but not all of their required vaccines.