Midterm (Due Sunday 2/19/2023 at 11:55 pm)

Please submit your .Rmd and .html files in Sakai. If you are working together, both people should submit the files.

60 / 60 points total

The goal of the midterm project is to showcase skills that you have learned in class so far. The midterm is open note, but if you use someone else’s code, you must attribute them.

Working Together

I am working solo, but I have had help with coding some of the variables from Anna Booman.

Please Note

I will take off points (-5 points for each section) if you don’t add observations and notes in your RMarkdown document. I want you to think and reason through your analysis, even if they are preliminary thoughts.

Define Your Research Question (10 points)

Define your research question below. What about the data interests you? What is a specific question you want to find out about the data?

The data I am working with comes from health centers that work primarily with low-income people, as well as linked birth records. This data contains longitudinal information about health before, during, and after pregnancy, and is therefore a potentially rich source of information about the conditions that contribute to birth outcomes, both joyful and adverse. One of the unique elements of this data is relatively granular information about insurance type at each visit, allowing us to conduct analyses that probe the complex relationship between health insurance coverage and health. While these relationships may not point at direct influences of insurance type, exploring if associations exist - or not - helps us map the landscape in order to better plan what questions to ask next.

My research question is, does birthweight vary by insurance type?

Given your question, what is your expectation about the data?

I anticipate that there will be variation in birthweight by insurance type, even after adjusting for variables expected to act as confounders such as maternal age, gestational age, maternal BMI, household income, and maternal race or ethnicity.

Loading the Data (10 points)

Load the data below and use dplyr::glimpse() or skimr::skim() on the data. You should upload the data file into the data directory.

data <- read.csv("~/Library/CloudStorage/OneDrive-SharedLibraries-OregonHealth&ScienceUniversity/Janne Boone-Heinonen - PROMISE-Data/Analytic Data/final_data/wt_change_imputed_final_version2.csv")

length(unique(data$pregid)) # 77,691 pregnancies
## [1] 77691
skim(data)
Data summary
Name data
Number of rows 895371
Number of columns 39
_______________________
Column type frequency:
character 19
numeric 20
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
pregid 0 1.00 38 38 0 77691 0
measure_date 0 1.00 10 10 0 4378 0
patid 0 1.00 36 36 0 65245 0
bmi.pre.cat 0 1.00 22 31 0 6 0
current_state_of_residence 0 1.00 0 2 488 48 0
preg_outcome 0 1.00 7 11 0 3 0
preg_start_date 0 1.00 10 10 0 4556 0
d1tri2 0 1.00 10 10 0 4556 0
d1tri3 0 1.00 10 10 0 4556 0
preg_end_date 0 1.00 10 10 0 4542 0
FPL.precat 0 1.00 10 13 0 3 0
payer.precat 0 1.00 10 23 0 5 0
payer.prenatalcat 0 1.00 10 23 0 5 0
payer.endofpreg.dt 4816 0.99 10 10 0 3515 0
payer.endofpregcat 0 1.00 10 23 0 5 0
race_ethnicity_cat 0 1.00 10 20 0 8 0
age_preg_end_cat 0 1.00 6 12 0 7 0
percentile 155757 0.83 6 12 0 8 0
race_eth_simple 0 1.00 10 11 0 6 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
time 0 1.00 175.23 75.09 0.00 111.00 189.00 244.00 315.00 ▂▆▅▇▅
measure_wt 0 1.00 169.59 39.87 70.00 141.60 163.20 190.60 491.00 ▆▇▁▁▁
median_ht 0 1.00 62.53 3.07 48.00 60.50 62.49 64.52 80.00 ▁▃▇▁▁
bmi 0 1.00 30.40 6.31 13.70 25.97 29.64 33.91 75.57 ▃▇▁▁▁
preg_n 0 1.00 1.28 0.56 1.00 1.00 1.00 1.00 8.00 ▇▁▁▁▁
multiple_gestation 0 1.00 0.01 0.09 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
fullterm 38267 0.96 0.94 0.23 0.00 1.00 1.00 1.00 1.00 ▁▁▁▁▇
biologically_plausible 155757 0.83 1.00 0.06 0.00 1.00 1.00 1.00 1.00 ▁▁▁▁▇
gest.days 0 1.00 274.46 12.80 140.00 271.00 276.00 281.00 315.00 ▁▁▁▇▃
FPL.prepreg 308244 0.66 5334.46 646527.19 0.00 29.00 77.00 121.00 91827365.00 ▇▁▁▁▁
FPL.predist 308244 0.66 25.15 65.01 -365.00 7.00 42.00 59.00 97.00 ▁▁▁▂▇
payer.predist 437089 0.51 -80.30 101.10 -365.00 -138.00 -45.00 -1.00 30.00 ▁▁▂▃▇
zscore 155757 0.83 0.00 1.07 -8.21 -0.67 -0.04 0.63 33.45 ▇▆▁▁▁
traj.t1 0 1.00 158.32 40.45 67.48 129.65 150.91 178.85 421.79 ▅▇▂▁▁
traj.t2 0 1.00 158.34 39.47 69.58 130.46 150.93 178.14 439.50 ▆▇▁▁▁
traj.t3 3520 1.00 171.16 38.13 79.66 144.47 164.49 190.47 469.66 ▆▇▁▁▁
traj.t4 177 1.00 183.20 38.85 84.10 156.06 176.84 203.22 496.35 ▆▇▁▁▁
wt.change 0 1.00 11.25 12.06 -63.51 1.22 9.06 19.31 123.37 ▁▇▆▁▁
predicted.wt 0 1.00 169.59 39.78 70.91 141.62 163.33 190.47 491.78 ▆▇▁▁▁
cohort_inclusion_flag 0 1.00 1.00 0.00 0.00 1.00 1.00 1.00 1.00 ▁▁▁▁▇
glimpse(data)
## Rows: 895,371
## Columns: 39
## $ pregid                     <chr> "0002AF21-6A06-4B4D-AFE4-B6BB9AAC2B5E-3", "…
## $ time                       <int> 20, 129, 171, 206, 252, 46, 60, 72, 81, 102…
## $ measure_date               <chr> "2019-09-26", "2020-01-13", "2020-02-24", "…
## $ measure_wt                 <dbl> 129.0, 130.0, 130.8, 131.6, 132.0, 113.0, 1…
## $ median_ht                  <dbl> 58.75, 58.75, 58.75, 58.75, 58.75, 62.00, 6…
## $ bmi                        <dbl> 26.27679, 26.48048, 26.64344, 26.80640, 26.…
## $ patid                      <chr> "0002AF21-6A06-4B4D-AFE4-B6BB9AAC2B5E", "00…
## $ preg_n                     <int> 3, 3, 3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ bmi.pre.cat                <chr> "3. Overweight (25 to <30)", "3. Overweight…
## $ current_state_of_residence <chr> "WA", "WA", "WA", "WA", "WA", "CA", "CA", "…
## $ preg_outcome               <chr> "Live birth", "Live birth", "Live birth", "…
## $ multiple_gestation         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ fullterm                   <int> 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ biologically_plausible     <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ preg_start_date            <chr> "2019-09-06", "2019-09-06", "2019-09-06", "…
## $ d1tri2                     <chr> "2019-12-13", "2019-12-13", "2019-12-13", "…
## $ d1tri3                     <chr> "2020-03-20", "2020-03-20", "2020-03-20", "…
## $ preg_end_date              <chr> "2020-06-11", "2020-06-11", "2020-06-11", "…
## $ gest.days                  <int> 279, 279, 279, 279, 279, 242, 242, 242, 242…
## $ FPL.prepreg                <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ FPL.predist                <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ FPL.precat                 <chr> "3. Missing", "3. Missing", "3. Missing", "…
## $ payer.predist              <int> NA, NA, NA, NA, NA, -99, -99, -99, -99, -99…
## $ payer.precat               <chr> "5. Missing", "5. Missing", "5. Missing", "…
## $ payer.prenatalcat          <chr> "1. Medicaid", "1. Medicaid", "1. Medicaid"…
## $ payer.endofpreg.dt         <chr> "2020-06-05", "2020-06-05", "2020-06-05", "…
## $ payer.endofpregcat         <chr> "1. Medicaid", "1. Medicaid", "1. Medicaid"…
## $ race_ethnicity_cat         <chr> "2. NH Black", "2. NH Black", "2. NH Black"…
## $ age_preg_end_cat           <chr> "7. 40+", "7. 40+", "7. 40+", "7. 40+", "7.…
## $ percentile                 <chr> "5. 50th-90th", "5. 50th-90th", "5. 50th-90…
## $ zscore                     <dbl> 0.4344976, 0.4344976, 0.4344976, 0.4344976,…
## $ race_eth_simple            <chr> "2. NH Black", "2. NH Black", "2. NH Black"…
## $ traj.t1                    <dbl> 133.7341, 133.7341, 133.7341, 133.7341, 133…
## $ traj.t2                    <dbl> 125.6397, 125.6397, 125.6397, 125.6397, 125…
## $ traj.t3                    <dbl> 131.0897, 131.0897, 131.0897, 131.0897, 131…
## $ traj.t4                    <dbl> 137.1570, 137.1570, 137.1570, 137.1570, 137…
## $ wt.change                  <dbl> 3.36034, 4.36034, 5.16034, 5.96034, 6.36034…
## $ predicted.wt               <dbl> 131.3959, 126.2259, 129.0255, 131.8888, 135…
## $ cohort_inclusion_flag      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
# I will also import the raw insurance variables - I will be constructing my insurance type vector from this data

insurance_info <- read.csv("~/Library/CloudStorage/OneDrive-SharedLibraries-OregonHealth&ScienceUniversity/Janne Boone-Heinonen - PROMISE-Data/Analytic Data/main_datasets/update_final/wts_long_3pct_preg_only_new.csv")

glimpse(insurance_info)
## Rows: 900,431
## Columns: 15
## $ patid                    <chr> "0002AF21-6A06-4B4D-AFE4-B6BB9AAC2B5E", "0002…
## $ encounterid              <chr> "1FA1F6B3-400F-443B-9254-F72E3550FA45", "2660…
## $ pregid                   <chr> "0002AF21-6A06-4B4D-AFE4-B6BB9AAC2B5E-3", "00…
## $ vitalid                  <int> 11989308, 14700696, 10085872, 19308392, 89008…
## $ measure_date             <chr> "2020-01-13", "2020-02-24", "2019-09-26", "20…
## $ measure_time             <chr> "09:42", "11:13", "11:28", "12:29", "10:56", …
## $ measure_wt               <dbl> 130.0, 130.8, 129.0, 132.0, 131.6, 131.0, 115…
## $ measure_ht               <dbl> 58.75, NA, 58.75, 58.66, NA, NA, NA, NA, NA, …
## $ median_ht                <dbl> 58.75, 58.75, 58.75, 58.75, 58.75, 62.00, 62.…
## $ SMOKING_DESCRIPTION      <chr> "Unknown if ever smoked", "Unknown if ever sm…
## $ TOBACCO_DESCRIPTION      <chr> "Unknown", "Unknown", "Unknown", "Never", "Un…
## $ TOBACCO_TYPE_DESCRIPTION <chr> "No information", "No information", "No infor…
## $ bmi                      <dbl> 26.48048, 26.64344, 26.27679, 26.88788, 26.80…
## $ payor_type_research_R    <chr> NA, "Medicaid", NA, "Medicaid", "Medicaid", "…
## $ fpl_percentage           <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…

If there are any quirks that you have to deal with NA coded as something else, or it is multiple tables, please make some notes here about what you need to do before you start transforming the data in the next section.

This data is in two tables, so I will need to merge or join them before I can proceed. I need to remove Other Public/Other Private visits - these are non-informative. I also need to exclude visits with NA insurance (this is different from uninsured; it just means that data wasn’t collected at that visit) and make sure the date is formatted correctly.

# I don't need to keep all those variables & there are multiple duplicate columns, so prior to merging I will reduce both data sets to just what I am actually using (or might plausibly want to use). 

data2 <- data[,c("patid", "pregid", "measure_date", "preg_start_date", "preg_end_date", "gest.days", "preg_n", "bmi.pre.cat", "fullterm", "payer.prenatalcat", "race_eth_simple", "race_ethnicity_cat", "FPL.precat", "age_preg_end_cat", "percentile")]
insurance_info <- insurance_info[,c("pregid", "measure_date", "payor_type_research_R")]

## I need to merge this with the main dataset, matched on pregid and measure_date. I will do this using "inner_join" since I need to match payor type to rows based on these two columns, and do not need any rows that do not have a measure date (all should have a measure date though! This data has already been cleaned).
data2 <- data2 %>%
  inner_join(y = insurance_info,
            by = c("pregid" = "pregid", "measure_date" = "measure_date"))

Bonus points (5 points) for datasets that require merging of tables, but only if you reason through whether you should use left_join, inner_join, or right_join on these tables. No credit will be provided if you don’t.

Show your transformed table here. Use tools such as glimpse(), skim() or head() to illustrate your point.

# The first data table
glimpse(data)
## Rows: 895,371
## Columns: 39
## $ pregid                     <chr> "0002AF21-6A06-4B4D-AFE4-B6BB9AAC2B5E-3", "…
## $ time                       <int> 20, 129, 171, 206, 252, 46, 60, 72, 81, 102…
## $ measure_date               <chr> "2019-09-26", "2020-01-13", "2020-02-24", "…
## $ measure_wt                 <dbl> 129.0, 130.0, 130.8, 131.6, 132.0, 113.0, 1…
## $ median_ht                  <dbl> 58.75, 58.75, 58.75, 58.75, 58.75, 62.00, 6…
## $ bmi                        <dbl> 26.27679, 26.48048, 26.64344, 26.80640, 26.…
## $ patid                      <chr> "0002AF21-6A06-4B4D-AFE4-B6BB9AAC2B5E", "00…
## $ preg_n                     <int> 3, 3, 3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ bmi.pre.cat                <chr> "3. Overweight (25 to <30)", "3. Overweight…
## $ current_state_of_residence <chr> "WA", "WA", "WA", "WA", "WA", "CA", "CA", "…
## $ preg_outcome               <chr> "Live birth", "Live birth", "Live birth", "…
## $ multiple_gestation         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ fullterm                   <int> 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ biologically_plausible     <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ preg_start_date            <chr> "2019-09-06", "2019-09-06", "2019-09-06", "…
## $ d1tri2                     <chr> "2019-12-13", "2019-12-13", "2019-12-13", "…
## $ d1tri3                     <chr> "2020-03-20", "2020-03-20", "2020-03-20", "…
## $ preg_end_date              <chr> "2020-06-11", "2020-06-11", "2020-06-11", "…
## $ gest.days                  <int> 279, 279, 279, 279, 279, 242, 242, 242, 242…
## $ FPL.prepreg                <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ FPL.predist                <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ FPL.precat                 <chr> "3. Missing", "3. Missing", "3. Missing", "…
## $ payer.predist              <int> NA, NA, NA, NA, NA, -99, -99, -99, -99, -99…
## $ payer.precat               <chr> "5. Missing", "5. Missing", "5. Missing", "…
## $ payer.prenatalcat          <chr> "1. Medicaid", "1. Medicaid", "1. Medicaid"…
## $ payer.endofpreg.dt         <chr> "2020-06-05", "2020-06-05", "2020-06-05", "…
## $ payer.endofpregcat         <chr> "1. Medicaid", "1. Medicaid", "1. Medicaid"…
## $ race_ethnicity_cat         <chr> "2. NH Black", "2. NH Black", "2. NH Black"…
## $ age_preg_end_cat           <chr> "7. 40+", "7. 40+", "7. 40+", "7. 40+", "7.…
## $ percentile                 <chr> "5. 50th-90th", "5. 50th-90th", "5. 50th-90…
## $ zscore                     <dbl> 0.4344976, 0.4344976, 0.4344976, 0.4344976,…
## $ race_eth_simple            <chr> "2. NH Black", "2. NH Black", "2. NH Black"…
## $ traj.t1                    <dbl> 133.7341, 133.7341, 133.7341, 133.7341, 133…
## $ traj.t2                    <dbl> 125.6397, 125.6397, 125.6397, 125.6397, 125…
## $ traj.t3                    <dbl> 131.0897, 131.0897, 131.0897, 131.0897, 131…
## $ traj.t4                    <dbl> 137.1570, 137.1570, 137.1570, 137.1570, 137…
## $ wt.change                  <dbl> 3.36034, 4.36034, 5.16034, 5.96034, 6.36034…
## $ predicted.wt               <dbl> 131.3959, 126.2259, 129.0255, 131.8888, 135…
## $ cohort_inclusion_flag      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
# The second data table
glimpse(insurance_info)
## Rows: 900,431
## Columns: 3
## $ pregid                <chr> "0002AF21-6A06-4B4D-AFE4-B6BB9AAC2B5E-3", "0002A…
## $ measure_date          <chr> "2020-01-13", "2020-02-24", "2019-09-26", "2020-…
## $ payor_type_research_R <chr> NA, "Medicaid", NA, "Medicaid", "Medicaid", "Med…
# The inner-joined data tables - note that the only new column is payor_type_research_R
glimpse(data2)
## Rows: 895,371
## Columns: 16
## $ patid                 <chr> "0002AF21-6A06-4B4D-AFE4-B6BB9AAC2B5E", "0002AF2…
## $ pregid                <chr> "0002AF21-6A06-4B4D-AFE4-B6BB9AAC2B5E-3", "0002A…
## $ measure_date          <chr> "2019-09-26", "2020-01-13", "2020-02-24", "2020-…
## $ preg_start_date       <chr> "2019-09-06", "2019-09-06", "2019-09-06", "2019-…
## $ preg_end_date         <chr> "2020-06-11", "2020-06-11", "2020-06-11", "2020-…
## $ gest.days             <int> 279, 279, 279, 279, 279, 242, 242, 242, 242, 242…
## $ preg_n                <int> 3, 3, 3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ bmi.pre.cat           <chr> "3. Overweight (25 to <30)", "3. Overweight (25 …
## $ fullterm              <int> 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ payer.prenatalcat     <chr> "1. Medicaid", "1. Medicaid", "1. Medicaid", "1.…
## $ race_eth_simple       <chr> "2. NH Black", "2. NH Black", "2. NH Black", "2.…
## $ race_ethnicity_cat    <chr> "2. NH Black", "2. NH Black", "2. NH Black", "2.…
## $ FPL.precat            <chr> "3. Missing", "3. Missing", "3. Missing", "3. Mi…
## $ age_preg_end_cat      <chr> "7. 40+", "7. 40+", "7. 40+", "7. 40+", "7. 40+"…
## $ percentile            <chr> "5. 50th-90th", "5. 50th-90th", "5. 50th-90th", …
## $ payor_type_research_R <chr> NA, NA, "Medicaid", "Medicaid", "Medicaid", "Uni…

Are the values what you expected for the variables? Why or Why not?

The values are exactly what I expected, fortunately! Now the insurance data has been merged with the main data by patent ID and visit date, and the new column is the one for insurance payor type.

Make sure your data types are correct!

Transforming the data (15 points)

If the data needs to be transformed in any way (values recoded, pivoted, etc), do it here. Examples include transforming a continuous variable into a categorical using case_when(), etc.

# Change to factor and collapse birthweight percentiles
# List the levels
table(data2$percentile, exclude = NULL)
## 
##      1. <3rd   2. 3rd-5th  3. 5th-10th 4. 10th-50th 5. 50th-90th 6. 90th-95th 
##        19873        15289        37752       300040       292960        35282 
## 7. 95th-97th       8. >97         <NA> 
##        14231        24187       155757
data2$percentile <- as.factor(data2$percentile)

data2 <- data2 %>%
  mutate(percentile2 = fct_collapse(percentile,
                               "<10th" = c("1. <3rd", "2. 3rd-5th",  "3. 5th-10th"),
                               "10th-50th" = c("4. 10th-50th"),
                               "50th-90th" = c("5. 50th-90th"),
                               ">90th" = c("6. 90th-95th", "7. 95th-97th", "8. >97")))

levels(data2$percentile2)
## [1] "<10th"     "10th-50th" "50th-90th" ">90th"
# Compare before and after
data2 %>% 
  count(percentile, percentile2) 
##     percentile percentile2      n
## 1      1. <3rd       <10th  19873
## 2   2. 3rd-5th       <10th  15289
## 3  3. 5th-10th       <10th  37752
## 4 4. 10th-50th   10th-50th 300040
## 5 5. 50th-90th   50th-90th 292960
## 6 6. 90th-95th       >90th  35282
## 7 7. 95th-97th       >90th  14231
## 8       8. >97       >90th  24187
## 9         <NA>        <NA> 155757
# Create binomial variables for logistic regression models
data2 <- data2 %>%
  mutate(lobw = fct_collapse(percentile,
                               "Low" = c("1. <3rd", "2. 3rd-5th",  "3. 5th-10th"),
                               "Not low" = c("4. 10th-50th", "5. 50th-90th","6. 90th-95th", "7. 95th-97th", "8. >97")))
                            
levels(data2$lobw)
## [1] "Low"     "Not low"
data2 <- data2 %>%
  mutate(hibw = fct_collapse(percentile,
                               "High" = c("6. 90th-95th", "7. 95th-97th", "8. >97"),
                               "Not high" = c("1. <3rd", "2. 3rd-5th",  "3. 5th-10th", "4. 10th-50th", "5. 50th-90th")))
                            
levels(data2$hibw)
## [1] "Not high" "High"
## recode race/ethnicity
data2$race_ethnicity_cat2 <- case_when(data2$race_ethnicity_cat == "2. NH Black" ~ "2. Black",
                                            data2$race_ethnicity_cat == "7. Hispanic" ~ "3. Hispanic",
                                            data2$race_ethnicity_cat == "1. NH White" ~ "1. White",
                                            data2$race_ethnicity_cat == "3. NH Asian" |
                                            data2$race_ethnicity_cat == "4. NH NH/OPI" ~ "4. AA/PI",
                                            data2$race_ethnicity_cat == "5. NH AI/AN" ~ "5. AI/AN",
                                            data2$race_ethnicity_cat == "6. NH Other/Multiple" |
                                            data2$race_ethnicity_cat == "8. Unknown" ~ "6. Unknown/Multi/Other",
                                            TRUE ~ "NA")
table(data2$race_ethnicity_cat2)
## 
##               1. White               2. Black            3. Hispanic 
##                 198379                  96657                 531429 
##               4. AA/PI               5. AI/AN 6. Unknown/Multi/Other 
##                  37803                   3541                  27562
# Remove Other Public/Other Private visits and NAs
data2 <- data2 %>%
  filter(payor_type_research_R != "Other Private" & 
                     payor_type_research_R != "Other Public") %>%
  filter(!is.na(payor_type_research_R)) %>%
    filter(!is.na(race_ethnicity_cat)) %>%
      filter(!is.na(percentile2)) 

# How many people and pregnancies do we still have?
length(unique(data2$pregid)) # 61,236 pregnancies
## [1] 62630
length(unique(data2$patid)) # 52,287 people
## [1] 53506
# Check to make sure date variables are coded correctly
data2$preg_end_date <- as.Date(data2$preg_end_date, "%Y-%m-%d")
data2$preg_start_date <- as.Date(data2$preg_start_date, "%Y-%m-%d")
data2$measure_date <- as.Date(data2$measure_date, "%Y-%m-%d")

# Order the dataset by pregid and measure_date
data2 <- arrange(data2, pregid, measure_date)

# I started with 77,691 pregnancies in the PROMISE dataset. After excluding visits with "Other Public" or "Other Private" insurance, as well as visits with missing insurance, missing race/eth, and missing birthweight data, we are left with 61,236 pregnancies among 52,287 individuals. 

# I need to create a study population subset with the inclusion criteria "minimum of three visits".
## I received help with the "subset" code from Anna Booman
#### Minimum of three visits #### 
data2 <- data2 %>% 
  group_by(pregid) %>% 
  mutate(visit_count = seq(n()))

visits_three <- data2 %>% 
  group_by(pregid) %>% 
  summarize(max_visit=max(visit_count))
visits_three <- subset(visits_three, visits_three$max_visit >= 3)
nrow(visits_three) # 74,980
## [1] 61236
data2 <- subset(data2, data2$pregid %in% visits_three[["pregid"]])
length(unique(data2$pregid))
## [1] 61236
length(unique(data2$patid))
## [1] 52287
# After subsetting to those with three or more visits, I am left with 61,236 pregnancies among 52,287 individuals.
# I had help from Anna Booman with using "difftime" to sort the dataset. TBH I did this and then realized that I'm not using it here, but I left it in anyway.
## Sort dataset by date of measurement within each pregnancy
data2$weeks_since_start <- difftime(data2$measure_date, data2$preg_start_date, units="weeks")
data2 <- data2[order(data2$pregid, data2$weeks_since_start),]

Visualizing and Summarizing the Data (15 points)

Use group_by() and summarize() to make a summary of the data here. The summary should be relevant to your research question

# Use the table1 package to make the table
label(data2$percentile2) <- "Birthweight category - percentile"
label(data2$age_preg_end_cat) <- "Age, end of pregnancy"
label(data2$race_ethnicity_cat2) <- "Race/Ethnicity"
label(data2$FPL.precat) <- "FPL"
label(data2$fullterm) <- "Full Term"
label(data2$bmi.pre.cat) <- "Maternal BMI category"

units(data2$age_preg_end_cat) <- "years"
units(data2$FPL.precat) <- "%"

## reorder variables to appear in the correct order
data2 <- data2 %>% 
  mutate(payor_type_research_R = factor(payor_type_research_R, levels=c("Private", "Medicaid", "Medicare", "Uninsured")),
         lobw = factor(lobw, levels= c("Not low", "Low")),
         hibw = factor(hibw, levels= c("Not high", "High"))
         )

table1(~ percentile2 + race_ethnicity_cat2 + age_preg_end_cat + data2$bmi.pre.cat + FPL.precat + fullterm| payor_type_research_R, data=data2)
Private
(N=71392)
Medicaid
(N=573215)
Medicare
(N=2500)
Uninsured
(N=62133)
Overall
(N=709240)
Birthweight category - percentile
<10th 5480 (7.7%) 57988 (10.1%) 432 (17.3%) 5602 (9.0%) 69502 (9.8%)
10th-50th 26275 (36.8%) 234606 (40.9%) 1018 (40.7%) 25343 (40.8%) 287242 (40.5%)
50th-90th 31235 (43.8%) 224630 (39.2%) 876 (35.0%) 25057 (40.3%) 281798 (39.7%)
>90th 8402 (11.8%) 55991 (9.8%) 174 (7.0%) 6131 (9.9%) 70698 (10.0%)
Race/Ethnicity
1. White 39897 (55.9%) 95325 (16.6%) 1024 (41.0%) 18858 (30.4%) 155104 (21.9%)
2. Black 5811 (8.1%) 54917 (9.6%) 908 (36.3%) 3158 (5.1%) 64794 (9.1%)
3. Hispanic 18186 (25.5%) 382090 (66.7%) 389 (15.6%) 36609 (58.9%) 437274 (61.7%)
4. AA/PI 3483 (4.9%) 23255 (4.1%) 23 (0.9%) 1520 (2.4%) 28281 (4.0%)
5. AI/AN 361 (0.5%) 2075 (0.4%) 37 (1.5%) 379 (0.6%) 2852 (0.4%)
6. Unknown/Multi/Other 3654 (5.1%) 15553 (2.7%) 119 (4.8%) 1609 (2.6%) 20935 (3.0%)
Age, end of pregnancy (years)
1. <18 618 (0.9%) 13784 (2.4%) 2 (0.1%) 1189 (1.9%) 15593 (2.2%)
2. 18 to <20 1911 (2.7%) 34550 (6.0%) 44 (1.8%) 3251 (5.2%) 39756 (5.6%)
3. 20 to <25 11492 (16.1%) 142096 (24.8%) 375 (15.0%) 14990 (24.1%) 168953 (23.8%)
4. 25 to <30 21397 (30.0%) 164499 (28.7%) 719 (28.8%) 18638 (30.0%) 205253 (28.9%)
5. 30 to <35 22284 (31.2%) 128836 (22.5%) 817 (32.7%) 15109 (24.3%) 167046 (23.6%)
6. 35 to <40 11474 (16.1%) 70487 (12.3%) 418 (16.7%) 7358 (11.8%) 89737 (12.7%)
7. 40+ 2216 (3.1%) 18963 (3.3%) 125 (5.0%) 1598 (2.6%) 22902 (3.2%)
Maternal BMI category
1. Underweight (<18.5) 1215 (1.7%) 11329 (2.0%) 34 (1.4%) 1059 (1.7%) 13637 (1.9%)
2. Normal/Healthy (18.5 to <25) 29182 (40.9%) 180820 (31.5%) 757 (30.3%) 21737 (35.0%) 232496 (32.8%)
3. Overweight (25 to <30) 20392 (28.6%) 184786 (32.2%) 550 (22.0%) 20487 (33.0%) 226215 (31.9%)
4. Obesity Class 1 (30 to <35) 11825 (16.6%) 114886 (20.0%) 512 (20.5%) 11559 (18.6%) 138782 (19.6%)
5. Obesity Class 2 (35 to <40) 5266 (7.4%) 50124 (8.7%) 294 (11.8%) 4778 (7.7%) 60462 (8.5%)
6. Obesity Class 3 (40+) 3512 (4.9%) 31270 (5.5%) 353 (14.1%) 2513 (4.0%) 37648 (5.3%)
FPL (%)
1. <138% FPL 19071 (26.7%) 331141 (57.8%) 1521 (60.8%) 33055 (53.2%) 384788 (54.3%)
2. >=138% FPL 22470 (31.5%) 68645 (12.0%) 275 (11.0%) 5713 (9.2%) 97103 (13.7%)
3. Missing 29851 (41.8%) 173429 (30.3%) 704 (28.2%) 23365 (37.6%) 227349 (32.1%)
Full Term
Mean (SD) 0.960 (0.196) 0.945 (0.228) 0.901 (0.299) 0.947 (0.225) 0.946 (0.225)
Median [Min, Max] 1.00 [0, 1.00] 1.00 [0, 1.00] 1.00 [0, 1.00] 1.00 [0, 1.00] 1.00 [0, 1.00]
# logistic regression model with low birthweight as the outcome of interest

lobwXins <- glm(lobw ~ payor_type_research_R + race_ethnicity_cat2 + age_preg_end_cat + bmi.pre.cat + FPL.precat + gest.days, data = data2, family = "binomial")

summary(lobwXins)
## 
## Call:
## glm(formula = lobw ~ payor_type_research_R + race_ethnicity_cat2 + 
##     age_preg_end_cat + bmi.pre.cat + FPL.precat + gest.days, 
##     family = "binomial", data = data2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6014  -0.4827  -0.4187  -0.3631   2.6375  
## 
## Coefficients:
##                                              Estimate Std. Error z value
## (Intercept)                                 2.7048953  0.0891293  30.348
## payor_type_research_RMedicaid               0.1833027  0.0156647  11.702
## payor_type_research_RMedicare               0.7817984  0.0560327  13.953
## payor_type_research_RUninsured              0.1002604  0.0204241   4.909
## race_ethnicity_cat22. Black                 0.7222436  0.0149901  48.181
## race_ethnicity_cat23. Hispanic              0.2134972  0.0113809  18.759
## race_ethnicity_cat24. AA/PI                 0.4075773  0.0208241  19.572
## race_ethnicity_cat25. AI/AN                 0.0796160  0.0694579   1.146
## race_ethnicity_cat26. Unknown/Multi/Other   0.3081004  0.0246457  12.501
## age_preg_end_cat2. 18 to <20               -0.1688134  0.0268368  -6.290
## age_preg_end_cat3. 20 to <25               -0.2393229  0.0235506 -10.162
## age_preg_end_cat4. 25 to <30               -0.4303397  0.0236455 -18.200
## age_preg_end_cat5. 30 to <35               -0.5548840  0.0242006 -22.929
## age_preg_end_cat6. 35 to <40               -0.5328054  0.0256501 -20.772
## age_preg_end_cat7. 40+                     -0.3962252  0.0319392 -12.406
## bmi.pre.cat2. Normal/Healthy (18.5 to <25) -0.4210866  0.0230072 -18.302
## bmi.pre.cat3. Overweight (25 to <30)       -0.6999908  0.0234868 -29.804
## bmi.pre.cat4. Obesity Class 1 (30 to <35)  -0.9201133  0.0246029 -37.399
## bmi.pre.cat5. Obesity Class 2 (35 to <40)  -1.0739595  0.0276645 -38.821
## bmi.pre.cat6. Obesity Class 3 (40+)        -1.1461740  0.0306353 -37.413
## FPL.precat2. >=138% FPL                    -0.0287478  0.0129861  -2.214
## FPL.precat3. Missing                        0.0544036  0.0090316   6.024
## gest.days                                  -0.0156253  0.0002999 -52.107
##                                            Pr(>|z|)    
## (Intercept)                                 < 2e-16 ***
## payor_type_research_RMedicaid               < 2e-16 ***
## payor_type_research_RMedicare               < 2e-16 ***
## payor_type_research_RUninsured             9.16e-07 ***
## race_ethnicity_cat22. Black                 < 2e-16 ***
## race_ethnicity_cat23. Hispanic              < 2e-16 ***
## race_ethnicity_cat24. AA/PI                 < 2e-16 ***
## race_ethnicity_cat25. AI/AN                  0.2517    
## race_ethnicity_cat26. Unknown/Multi/Other   < 2e-16 ***
## age_preg_end_cat2. 18 to <20               3.17e-10 ***
## age_preg_end_cat3. 20 to <25                < 2e-16 ***
## age_preg_end_cat4. 25 to <30                < 2e-16 ***
## age_preg_end_cat5. 30 to <35                < 2e-16 ***
## age_preg_end_cat6. 35 to <40                < 2e-16 ***
## age_preg_end_cat7. 40+                      < 2e-16 ***
## bmi.pre.cat2. Normal/Healthy (18.5 to <25)  < 2e-16 ***
## bmi.pre.cat3. Overweight (25 to <30)        < 2e-16 ***
## bmi.pre.cat4. Obesity Class 1 (30 to <35)   < 2e-16 ***
## bmi.pre.cat5. Obesity Class 2 (35 to <40)   < 2e-16 ***
## bmi.pre.cat6. Obesity Class 3 (40+)         < 2e-16 ***
## FPL.precat2. >=138% FPL                      0.0268 *  
## FPL.precat3. Missing                       1.70e-09 ***
## gest.days                                   < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 454843  on 709239  degrees of freedom
## Residual deviance: 443130  on 709217  degrees of freedom
## AIC: 443176
## 
## Number of Fisher Scoring iterations: 5
exp(coef(lobwXins))
##                                (Intercept) 
##                                 14.9527513 
##              payor_type_research_RMedicaid 
##                                  1.2011780 
##              payor_type_research_RMedicare 
##                                  2.1853990 
##             payor_type_research_RUninsured 
##                                  1.1054588 
##                race_ethnicity_cat22. Black 
##                                  2.0590477 
##             race_ethnicity_cat23. Hispanic 
##                                  1.2380000 
##                race_ethnicity_cat24. AA/PI 
##                                  1.5031716 
##                race_ethnicity_cat25. AI/AN 
##                                  1.0828712 
##  race_ethnicity_cat26. Unknown/Multi/Other 
##                                  1.3608375 
##               age_preg_end_cat2. 18 to <20 
##                                  0.8446665 
##               age_preg_end_cat3. 20 to <25 
##                                  0.7871607 
##               age_preg_end_cat4. 25 to <30 
##                                  0.6502882 
##               age_preg_end_cat5. 30 to <35 
##                                  0.5741389 
##               age_preg_end_cat6. 35 to <40 
##                                  0.5869560 
##                     age_preg_end_cat7. 40+ 
##                                  0.6728551 
## bmi.pre.cat2. Normal/Healthy (18.5 to <25) 
##                                  0.6563333 
##       bmi.pre.cat3. Overweight (25 to <30) 
##                                  0.4965899 
##  bmi.pre.cat4. Obesity Class 1 (30 to <35) 
##                                  0.3984739 
##  bmi.pre.cat5. Obesity Class 2 (35 to <40) 
##                                  0.3416531 
##        bmi.pre.cat6. Obesity Class 3 (40+) 
##                                  0.3178505 
##                    FPL.precat2. >=138% FPL 
##                                  0.9716615 
##                       FPL.precat3. Missing 
##                                  1.0559107 
##                                  gest.days 
##                                  0.9844961
# logistic regression model with high birthweight as the outcome of interest

hibwXins <- glm(hibw ~ payor_type_research_R + race_ethnicity_cat2 + age_preg_end_cat + bmi.pre.cat + FPL.precat + gest.days, data = data2, family = "binomial")

summary(hibwXins)
## 
## Call:
## glm(formula = hibw ~ payor_type_research_R + race_ethnicity_cat2 + 
##     age_preg_end_cat + bmi.pre.cat + FPL.precat + gest.days, 
##     family = "binomial", data = data2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.8885  -0.4998  -0.4205  -0.3429   3.0309  
## 
## Coefficients:
##                                              Estimate Std. Error z value
## (Intercept)                                -7.2912406  0.1295600 -56.277
## payor_type_research_RMedicaid              -0.0840906  0.0136296  -6.170
## payor_type_research_RMedicare              -0.5977583  0.0804238  -7.433
## payor_type_research_RUninsured             -0.0988187  0.0184725  -5.349
## race_ethnicity_cat22. Black                -0.6678679  0.0177665 -37.591
## race_ethnicity_cat23. Hispanic             -0.2851271  0.0101131 -28.194
## race_ethnicity_cat24. AA/PI                -0.3626098  0.0241238 -15.031
## race_ethnicity_cat25. AI/AN                -0.4262671  0.0660154  -6.457
## race_ethnicity_cat26. Unknown/Multi/Other  -0.2525892  0.0250234 -10.094
## age_preg_end_cat2. 18 to <20               -0.0009903  0.0399847  -0.025
## age_preg_end_cat3. 20 to <25                0.1274526  0.0355126   3.589
## age_preg_end_cat4. 25 to <30                0.3270460  0.0351862   9.295
## age_preg_end_cat5. 30 to <35                0.4749380  0.0352908  13.458
## age_preg_end_cat6. 35 to <40                0.5274528  0.0359630  14.667
## age_preg_end_cat7. 40+                      0.5363214  0.0400932  13.377
## bmi.pre.cat2. Normal/Healthy (18.5 to <25)  0.7582999  0.0518279  14.631
## bmi.pre.cat3. Overweight (25 to <30)        1.1425421  0.0517624  22.073
## bmi.pre.cat4. Obesity Class 1 (30 to <35)   1.5193581  0.0519148  29.266
## bmi.pre.cat5. Obesity Class 2 (35 to <40)   1.7017578  0.0525321  32.395
## bmi.pre.cat6. Obesity Class 3 (40+)         1.9290072  0.0530185  36.384
## FPL.precat2. >=138% FPL                    -0.0094289  0.0122199  -0.772
## FPL.precat3. Missing                       -0.0047238  0.0091450  -0.517
## gest.days                                   0.0140274  0.0004084  34.347
##                                            Pr(>|z|)    
## (Intercept)                                 < 2e-16 ***
## payor_type_research_RMedicaid              6.84e-10 ***
## payor_type_research_RMedicare              1.06e-13 ***
## payor_type_research_RUninsured             8.82e-08 ***
## race_ethnicity_cat22. Black                 < 2e-16 ***
## race_ethnicity_cat23. Hispanic              < 2e-16 ***
## race_ethnicity_cat24. AA/PI                 < 2e-16 ***
## race_ethnicity_cat25. AI/AN                1.07e-10 ***
## race_ethnicity_cat26. Unknown/Multi/Other   < 2e-16 ***
## age_preg_end_cat2. 18 to <20               0.980241    
## age_preg_end_cat3. 20 to <25               0.000332 ***
## age_preg_end_cat4. 25 to <30                < 2e-16 ***
## age_preg_end_cat5. 30 to <35                < 2e-16 ***
## age_preg_end_cat6. 35 to <40                < 2e-16 ***
## age_preg_end_cat7. 40+                      < 2e-16 ***
## bmi.pre.cat2. Normal/Healthy (18.5 to <25)  < 2e-16 ***
## bmi.pre.cat3. Overweight (25 to <30)        < 2e-16 ***
## bmi.pre.cat4. Obesity Class 1 (30 to <35)   < 2e-16 ***
## bmi.pre.cat5. Obesity Class 2 (35 to <40)   < 2e-16 ***
## bmi.pre.cat6. Obesity Class 3 (40+)         < 2e-16 ***
## FPL.precat2. >=138% FPL                    0.440350    
## FPL.precat3. Missing                       0.605474    
## gest.days                                   < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 460130  on 709239  degrees of freedom
## Residual deviance: 445397  on 709217  degrees of freedom
## AIC: 445443
## 
## Number of Fisher Scoring iterations: 6
exp(coef(hibwXins))
##                                (Intercept) 
##                               0.0006814821 
##              payor_type_research_RMedicaid 
##                               0.9193479931 
##              payor_type_research_RMedicare 
##                               0.5500433090 
##             payor_type_research_RUninsured 
##                               0.9059069501 
##                race_ethnicity_cat22. Black 
##                               0.5128007806 
##             race_ethnicity_cat23. Hispanic 
##                               0.7519186617 
##                race_ethnicity_cat24. AA/PI 
##                               0.6958578832 
##                race_ethnicity_cat25. AI/AN 
##                               0.6529418899 
##  race_ethnicity_cat26. Unknown/Multi/Other 
##                               0.7767869519 
##               age_preg_end_cat2. 18 to <20 
##                               0.9990102125 
##               age_preg_end_cat3. 20 to <25 
##                               1.1359310334 
##               age_preg_end_cat4. 25 to <30 
##                               1.3868652128 
##               age_preg_end_cat5. 30 to <35 
##                               1.6079144250 
##               age_preg_end_cat6. 35 to <40 
##                               1.6946103111 
##                     age_preg_end_cat7. 40+ 
##                               1.7097059349 
## bmi.pre.cat2. Normal/Healthy (18.5 to <25) 
##                               2.1346439813 
##       bmi.pre.cat3. Overweight (25 to <30) 
##                               3.1347269122 
##  bmi.pre.cat4. Obesity Class 1 (30 to <35) 
##                               4.5692912299 
##  bmi.pre.cat5. Obesity Class 2 (35 to <40) 
##                               5.4835778698 
##        bmi.pre.cat6. Obesity Class 3 (40+) 
##                               6.8826740630 
##                    FPL.precat2. >=138% FPL 
##                               0.9906154515 
##                       FPL.precat3. Missing 
##                               0.9952873242 
##                                  gest.days 
##                               1.0141262895

What are your findings about the summary? Are they what you expected?

My findings are that insurance type is associated with birthweight even after adjusting for gestational age, maternal age, race & ethnicity, BMI, and household income, Medicaid, Medicare, and no insurance are associated with higher odds of low birthweight, and lower odds of high birthweight. This is not unexpected, as insurance type is itself associated with a myriad of other factors, primarily social, which are not fully captured in study data.

Make at least two plots that help you answer your question on the transformed or summarized data. Use scales and/or labels to make each plot informative.

# I was tragically disappointed by the lack of continuous birthweight data in this dataset, because it is listed in the data dictionary and I was planning to use it to make ridgeline density plots of birthweight by insurance type to visualize the relatve distribution. I remain crushed, because (in one scientist's humble opinion) the options for plots for categorical outcomes are not that fun. Nonetheless, I soldier on.

# A stacked barchart. I have removed the Y-axis labels because they were not informative, and I couldn't fgure out how to create a scale that would make them informative. I wanted to facet wrap them but then could not seem to find how we did that before. 
ggplot(data = data2) +  
  aes(x = payor_type_research_R) +  
  geom_bar() +  
  aes(fill = percentile2) +  
  scale_fill_discrete(  
    name = "Birthweight percentiles"  
    ) +  
  labs(x = "Insurance type",  
       y = "Birthweight percentiles",  
       title = "Barplot") +  
  theme_bw() +  
  theme(axis.text.x=element_text(  
    angle = -30, hjust = 0), axis.ticks.x = ,       
    axis.text.y=element_blank(),
      axis.ticks.y=element_blank()) + 
 
  scale_fill_viridis_d(name = "Birthweight Percentiles") +  
  theme(
    text = element_text(family = "Palatino"))
## Scale for fill is already present.
## Adding another scale for fill, which will replace the existing scale.

# Because the previous bar chart really did little to convey anything and I could not figure out how to facet wrap it to make it useful, I decided to make one to show what percent each percentile of birthweight category is within each insurance type.
# First I need to make row percentages of birthweight percentile category
row_pct <- data2 %>%
  group_by(payor_type_research_R) %>%
  count(percentile2) %>%
  mutate(percent = (n/sum(n)) * 100,
         label = sprintf("%0.0f%%", percent))

ggplot(data2) + 
  aes(x=payor_type_research_R,fill=percentile2) + 
  geom_bar(position="fill") +
  geom_text(data=row_pct, aes(y=n,label=label),position=position_fill(vjust = 0.5), size = 3) + 
  scale_fill_brewer(palette="Set3") + 
  coord_flip() +
  ggtitle("Proportions of Birthweight Percentile Within Insurance Type")

Final Summary (10 points)

Summarize your research question and findings below.

My research question was if variations in birthweight are associated with different insurance types during pregnancy. I found that not only are birthweight distributions different across categories of insurance type, but that after adjustment for gestational age and maternal age, race & ethnicity, BMI, and household income, insurance remained significantly associated with odds of birth weights in both the lowest 10th percentile and highest 10th percentiles. Those on Medicaid about 20% higher odds of birthweights in the 10th percentile, those on Medicare had about 119% higher odds, and those who were uninsured had about 111% higher odds, compared to those on private insurance. At the other end of the birthweight scale, those on Medicaid about 8% lower odds of birthweight in the 90th percentile, those on Medicare had about 45% lower odds, and those who were uninsured had about 9% lower odds, compared to those on private insurance.

Are your findings what you expected? Why or Why not?

The associations themselves are not unexpected, as insurance type is itself associated with a myriad of other factors, primarily social, which are not fully captured in study data. However, because I adjusted for the most obvious potential confounders for which we had data, in particular income, race, and gestational age, I was surprised that these associations remained as strong as they are. The most curious finding, for me, was actually for one of the adjustment variables - I was not expecting to find that %FPL was not significantly associated with birthweights in the 90th percentile even with the other variables included in the model. assuming I coded all my variables correctly, these findings point toward a juicy avenue of future research.