Midterm (Due Sunday 2/19/2023 at 11:55 pm)

Please submit your .Rmd and .html files in Sakai. If you are working together, both people should submit the files.

60 / 60 points total

The goal of the midterm project is to showcase skills that you have learned in class so far. The midterm is open note, but if you use someone else’s code, you must attribute them.

Instructions: Before you get Started

Pick a dataset. Ideally, the dataset should be around 2000 rows, and should have both categorical and numeric covariates. Choose a dataset with more than 4 columns/variables.

Potential Sources for data:
Note that most of the Tidy Tuesday data are .csv files. There is code to load the files from csv for each of the datasets and a short description of the variables, or you can upload the .csv file into your data folder. This resource is probably the easiest to deal with.
You may use another dataset or your own data, but please make sure it is de-identified and has enough rows/variables.

Define a research question, involving at least one categorical variable. You may schedule a time with Jessica or Brad to discuss your data set and research question, or you may message it to one of us in slack or email. Please do one of the two options pretty early on. We just want to look at the data and make sure that it is appropriate for your question.
You must use each of the following functions at least once:

mutate()
group_by()
summarize()
ggplot()

and at least one of the following:

case_when()
across()
*_join() (i.e. left_join())
pivot_*() (i.e. pivot_longer())
function()

The code chunks below are guides, please add more code chunks to do what you need.
If you do not want your final project posted on the public website, please let Jessica know. We can also keep it anonymous if you’d like to remove your name from the Rmd and html, or use a pseudonym.

You may remove these instructions from your final Rmd if you like

Working Together

If you’d like to work together in pairs, that is encouraged, but you must divide the work equitably and you must note who worked on what. This is probably easiest as notes in the text. Please let Brad or Jessica know that you’ll be working together.

No acknowledgements of contributions = -10 points overall.

Please Note

I will take off points (-5 points for each section) if you don’t add observations and notes in your RMarkdown document. I want you to think and reason through your analysis, even if they are preliminary thoughts.

Lifestyle Risk Factors For Heart Disease

This dataset is sourced from MOHAMED ELSAYED on Kaggle, titled “Heart Disease Prediction”.This work is done by Puthyda Keath.

1. Define Your Research Question (10 points)

Define your research question below. What about the data interests you? What is a specific question you want to find out about the data?

I’m interested in looking at what kind of lifestyle factors are related to increasing the risk of heart disease. This data has some pretty good lifestyle-related variables such as sleeping hours, smoking, drinking, physical activity, and BMI that could allow me to explore the risk factors associated with heart disease. This led me to my research question: “What is the association between lifestyle factors (sleep time, smoking, alcohol drinking, physical activity, and BMI) and the risk of developing heart disease?”. I also would like to know which age groups are at risk of developing heart disease.

Given your question, what is your expectation about the data?

I hypothesize that people with unhealthy lifestyles such as less sleep time, smoking, drinking alcohol, less physical activity, and higher BMI will be more likely to have heart disease. According to the CDC, the prevalence of heart disease is higher in people who are 60 and above. So, I hypothesize that there are higher percentages of people who are older than 60 year old with heart disease than those who are younger than 60 year old.

2. Loading the Data (10 points)

Load the data below and use dplyr::glimpse() or skimr::skim() on the data. You should upload the data file into the data directory.

library(readr)
heart_2020_cleaned <- read_csv("heart_2020_cleaned.csv", na= c("NA"))

## Rows: 319795 Columns: 18
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (14): HeartDisease, Smoking, AlcoholDrinking, Stroke, DiffWalking, Sex, ...
## dbl  (4): BMI, PhysicalHealth, MentalHealth, SleepTime
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

#checking out the data using glimpse function
glimpse(heart_2020_cleaned) # 319,795 rows and 18 columns

## Rows: 319,795
## Columns: 18
## $ HeartDisease     <chr> "No", "No", "No", "No", "No", "Yes", "No", "No", "No"…
## $ BMI              <dbl> 16.60, 20.34, 26.58, 24.21, 23.71, 28.87, 21.63, 31.6…
## $ Smoking          <chr> "Yes", "No", "Yes", "No", "No", "Yes", "No", "Yes", "…
## $ AlcoholDrinking  <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No",…
## $ Stroke           <chr> "No", "Yes", "No", "No", "No", "No", "No", "No", "No"…
## $ PhysicalHealth   <dbl> 3, 0, 20, 0, 28, 6, 15, 5, 0, 0, 30, 0, 0, 7, 0, 1, 5…
## $ MentalHealth     <dbl> 30, 0, 30, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 30, 0, 2,…
## $ DiffWalking      <chr> "No", "No", "No", "No", "Yes", "Yes", "No", "Yes", "N…
## $ Sex              <chr> "Female", "Female", "Male", "Female", "Female", "Fema…
## $ AgeCategory      <chr> "55-59", "80 or older", "65-69", "75-79", "40-44", "7…
## $ Race             <chr> "White", "White", "White", "White", "White", "Black",…
## $ Diabetic         <chr> "Yes", "No", "Yes", "No", "No", "No", "No", "Yes", "N…
## $ PhysicalActivity <chr> "Yes", "Yes", "Yes", "No", "Yes", "No", "Yes", "No", …
## $ GenHealth        <chr> "Very good", "Very good", "Fair", "Good", "Very good"…
## $ SleepTime        <dbl> 5, 7, 8, 6, 8, 12, 4, 9, 5, 10, 15, 5, 8, 7, 5, 6, 10…
## $ Asthma           <chr> "Yes", "No", "Yes", "No", "No", "No", "Yes", "Yes", "…
## $ KidneyDisease    <chr> "No", "No", "No", "No", "No", "No", "No", "No", "Yes"…
## $ SkinCancer       <chr> "Yes", "No", "No", "Yes", "No", "No", "Yes", "No", "N…

#checking the data structure using skimr
skim_heart_2020_cleaned <- skimr::skim(heart_2020_cleaned) %>%
    as_tibble() %>%
    print() #No missing value

## # A tibble: 18 × 17
##    skim_type skim_vari…¹ n_mis…² compl…³ chara…⁴ chara…⁵ chara…⁶ chara…⁷ chara…⁸
##    <chr>     <chr>         <int>   <dbl>   <int>   <int>   <int>   <int>   <int>
##  1 character HeartDisea…       0       1       2       3       0       2       0
##  2 character Smoking           0       1       2       3       0       2       0
##  3 character AlcoholDri…       0       1       2       3       0       2       0
##  4 character Stroke            0       1       2       3       0       2       0
##  5 character DiffWalking       0       1       2       3       0       2       0
##  6 character Sex               0       1       4       6       0       2       0
##  7 character AgeCategory       0       1       5      11       0      13       0
##  8 character Race              0       1       5      30       0       6       0
##  9 character Diabetic          0       1       2      23       0       4       0
## 10 character PhysicalAc…       0       1       2       3       0       2       0
## 11 character GenHealth         0       1       4       9       0       5       0
## 12 character Asthma            0       1       2       3       0       2       0
## 13 character KidneyDise…       0       1       2       3       0       2       0
## 14 character SkinCancer        0       1       2       3       0       2       0
## 15 numeric   BMI               0       1      NA      NA      NA      NA      NA
## 16 numeric   PhysicalHe…       0       1      NA      NA      NA      NA      NA
## 17 numeric   MentalHeal…       0       1      NA      NA      NA      NA      NA
## 18 numeric   SleepTime         0       1      NA      NA      NA      NA      NA
## # … with 8 more variables: numeric.mean <dbl>, numeric.sd <dbl>,
## #   numeric.p0 <dbl>, numeric.p25 <dbl>, numeric.p50 <dbl>, numeric.p75 <dbl>,
## #   numeric.p100 <dbl>, numeric.hist <chr>, and abbreviated variable names
## #   ¹skim_variable, ²n_missing, ³complete_rate, ⁴character.min, ⁵character.max,
## #   ⁶character.empty, ⁷character.n_unique, ⁸character.whitespace

#Checking how the data is formatted using gt function
gt(head(heart_2020_cleaned))

HeartDisease	BMI	Smoking	AlcoholDrinking	Stroke	PhysicalHealth	MentalHealth	DiffWalking	Sex	AgeCategory	Race	Diabetic	PhysicalActivity	GenHealth	SleepTime	Asthma	KidneyDisease	SkinCancer
No	16.60	Yes	No	No	3	30	No	Female	55-59	White	Yes	Yes	Very good	5	Yes	No	Yes
No	20.34	No	No	Yes	0	0	No	Female	80 or older	White	No	Yes	Very good	7	No	No	No
No	26.58	Yes	No	No	20	30	No	Male	65-69	White	Yes	Yes	Fair	8	Yes	No	No
No	24.21	No	No	No	0	0	No	Female	75-79	White	No	No	Good	6	No	No	Yes
No	23.71	No	No	No	28	0	Yes	Female	40-44	White	No	Yes	Very good	8	No	No	No
Yes	28.87	Yes	No	No	6	0	Yes	Female	75-79	Black	No	No	Fair	12	No	No	No

#Create a copy of heart_2020_cleaned dataset
heart_2020_new <- heart_2020_cleaned

heart_2020_new

## # A tibble: 319,795 × 18
##    HeartDis…¹   BMI Smoking Alcoh…² Stroke Physi…³ Menta…⁴ DiffW…⁵ Sex   AgeCa…⁶
##    <chr>      <dbl> <chr>   <chr>   <chr>    <dbl>   <dbl> <chr>   <chr> <chr>  
##  1 No          16.6 Yes     No      No           3      30 No      Fema… 55-59  
##  2 No          20.3 No      No      Yes          0       0 No      Fema… 80 or …
##  3 No          26.6 Yes     No      No          20      30 No      Male  65-69  
##  4 No          24.2 No      No      No           0       0 No      Fema… 75-79  
##  5 No          23.7 No      No      No          28       0 Yes     Fema… 40-44  
##  6 Yes         28.9 Yes     No      No           6       0 Yes     Fema… 75-79  
##  7 No          21.6 No      No      No          15       0 No      Fema… 70-74  
##  8 No          31.6 Yes     No      No           5       0 Yes     Fema… 80 or …
##  9 No          26.4 No      No      No           0       0 No      Fema… 80 or …
## 10 No          40.7 No      No      No           0       0 Yes     Male  65-69  
## # … with 319,785 more rows, 8 more variables: Race <chr>, Diabetic <chr>,
## #   PhysicalActivity <chr>, GenHealth <chr>, SleepTime <dbl>, Asthma <chr>,
## #   KidneyDisease <chr>, SkinCancer <chr>, and abbreviated variable names
## #   ¹HeartDisease, ²AlcoholDrinking, ³PhysicalHealth, ⁴MentalHealth,
## #   ⁵DiffWalking, ⁶AgeCategory

skim_heart_2020 <- skimr::skim(heart_2020_new) %>%
    as_tibble() %>%
    print()

## # A tibble: 18 × 17
##    skim_type skim_vari…¹ n_mis…² compl…³ chara…⁴ chara…⁵ chara…⁶ chara…⁷ chara…⁸
##    <chr>     <chr>         <int>   <dbl>   <int>   <int>   <int>   <int>   <int>
##  1 character HeartDisea…       0       1       2       3       0       2       0
##  2 character Smoking           0       1       2       3       0       2       0
##  3 character AlcoholDri…       0       1       2       3       0       2       0
##  4 character Stroke            0       1       2       3       0       2       0
##  5 character DiffWalking       0       1       2       3       0       2       0
##  6 character Sex               0       1       4       6       0       2       0
##  7 character AgeCategory       0       1       5      11       0      13       0
##  8 character Race              0       1       5      30       0       6       0
##  9 character Diabetic          0       1       2      23       0       4       0
## 10 character PhysicalAc…       0       1       2       3       0       2       0
## 11 character GenHealth         0       1       4       9       0       5       0
## 12 character Asthma            0       1       2       3       0       2       0
## 13 character KidneyDise…       0       1       2       3       0       2       0
## 14 character SkinCancer        0       1       2       3       0       2       0
## 15 numeric   BMI               0       1      NA      NA      NA      NA      NA
## 16 numeric   PhysicalHe…       0       1      NA      NA      NA      NA      NA
## 17 numeric   MentalHeal…       0       1      NA      NA      NA      NA      NA
## 18 numeric   SleepTime         0       1      NA      NA      NA      NA      NA
## # … with 8 more variables: numeric.mean <dbl>, numeric.sd <dbl>,
## #   numeric.p0 <dbl>, numeric.p25 <dbl>, numeric.p50 <dbl>, numeric.p75 <dbl>,
## #   numeric.p100 <dbl>, numeric.hist <chr>, and abbreviated variable names
## #   ¹skim_variable, ²n_missing, ³complete_rate, ⁴character.min, ⁵character.max,
## #   ⁶character.empty, ⁷character.n_unique, ⁸character.whitespace

glimpse(heart_2020_new)

## Rows: 319,795
## Columns: 18
## $ HeartDisease     <chr> "No", "No", "No", "No", "No", "Yes", "No", "No", "No"…
## $ BMI              <dbl> 16.60, 20.34, 26.58, 24.21, 23.71, 28.87, 21.63, 31.6…
## $ Smoking          <chr> "Yes", "No", "Yes", "No", "No", "Yes", "No", "Yes", "…
## $ AlcoholDrinking  <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No",…
## $ Stroke           <chr> "No", "Yes", "No", "No", "No", "No", "No", "No", "No"…
## $ PhysicalHealth   <dbl> 3, 0, 20, 0, 28, 6, 15, 5, 0, 0, 30, 0, 0, 7, 0, 1, 5…
## $ MentalHealth     <dbl> 30, 0, 30, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 30, 0, 2,…
## $ DiffWalking      <chr> "No", "No", "No", "No", "Yes", "Yes", "No", "Yes", "N…
## $ Sex              <chr> "Female", "Female", "Male", "Female", "Female", "Fema…
## $ AgeCategory      <chr> "55-59", "80 or older", "65-69", "75-79", "40-44", "7…
## $ Race             <chr> "White", "White", "White", "White", "White", "Black",…
## $ Diabetic         <chr> "Yes", "No", "Yes", "No", "No", "No", "No", "Yes", "N…
## $ PhysicalActivity <chr> "Yes", "Yes", "Yes", "No", "Yes", "No", "Yes", "No", …
## $ GenHealth        <chr> "Very good", "Very good", "Fair", "Good", "Very good"…
## $ SleepTime        <dbl> 5, 7, 8, 6, 8, 12, 4, 9, 5, 10, 15, 5, 8, 7, 5, 6, 10…
## $ Asthma           <chr> "Yes", "No", "Yes", "No", "No", "No", "Yes", "Yes", "…
## $ KidneyDisease    <chr> "No", "No", "No", "No", "No", "No", "No", "No", "Yes"…
## $ SkinCancer       <chr> "Yes", "No", "No", "Yes", "No", "No", "Yes", "No", "N…

gt(head(heart_2020_new))

HeartDisease	BMI	Smoking	AlcoholDrinking	Stroke	PhysicalHealth	MentalHealth	DiffWalking	Sex	AgeCategory	Race	Diabetic	PhysicalActivity	GenHealth	SleepTime	Asthma	KidneyDisease	SkinCancer
No	16.60	Yes	No	No	3	30	No	Female	55-59	White	Yes	Yes	Very good	5	Yes	No	Yes
No	20.34	No	No	Yes	0	0	No	Female	80 or older	White	No	Yes	Very good	7	No	No	No
No	26.58	Yes	No	No	20	30	No	Male	65-69	White	Yes	Yes	Fair	8	Yes	No	No
No	24.21	No	No	No	0	0	No	Female	75-79	White	No	No	Good	6	No	No	Yes
No	23.71	No	No	No	28	0	Yes	Female	40-44	White	No	Yes	Very good	8	No	No	No
Yes	28.87	Yes	No	No	6	0	Yes	Female	75-79	Black	No	No	Fair	12	No	No	No

If there are any quirks that you have to deal with NA coded as something else, or it is multiple tables, please make some notes here about what you need to do before you start transforming the data in the next section.

This dataset is pretty clean. There is no missing data, and all variables are correctly coded as characters and numbers. In this research analysis, I want to focus on only lifestyle factors, so I will remove variables that are not related to lifestyle factors such as Stroke, Diabetic, GenHealth, Asthma, Kidney Disease and Skin Cancer. However, I will still keep demographic factors on top of lifestyle factors.

Make sure your data types are correct!

3. Transforming the data (15 points)

If the data needs to be transformed in any way (values recoded, pivoted, etc), do it here. Examples include transforming a continuous variable into a categorical using case_when(), etc.

#Removing other variables that are not related to lifestyle factors from the dataset to keep the dataset focus on the research question

heart_lifestyle_2020 <- heart_2020_new %>%
  select(-Stroke, -Diabetic, -GenHealth, -Asthma, -KidneyDisease, -SkinCancer)

heart_lifestyle_2020 #heart_lifestyle_2020 dataset has only 12 columns now

## # A tibble: 319,795 × 12
##    HeartDise…¹   BMI Smoking Alcoh…² Physi…³ Menta…⁴ DiffW…⁵ Sex   AgeCa…⁶ Race 
##    <chr>       <dbl> <chr>   <chr>     <dbl>   <dbl> <chr>   <chr> <chr>   <chr>
##  1 No           16.6 Yes     No            3      30 No      Fema… 55-59   White
##  2 No           20.3 No      No            0       0 No      Fema… 80 or … White
##  3 No           26.6 Yes     No           20      30 No      Male  65-69   White
##  4 No           24.2 No      No            0       0 No      Fema… 75-79   White
##  5 No           23.7 No      No           28       0 Yes     Fema… 40-44   White
##  6 Yes          28.9 Yes     No            6       0 Yes     Fema… 75-79   Black
##  7 No           21.6 No      No           15       0 No      Fema… 70-74   White
##  8 No           31.6 Yes     No            5       0 Yes     Fema… 80 or … White
##  9 No           26.4 No      No            0       0 No      Fema… 80 or … White
## 10 No           40.7 No      No            0       0 Yes     Male  65-69   White
## # … with 319,785 more rows, 2 more variables: PhysicalActivity <chr>,
## #   SleepTime <dbl>, and abbreviated variable names ¹HeartDisease,
## #   ²AlcoholDrinking, ³PhysicalHealth, ⁴MentalHealth, ⁵DiffWalking,
## #   ⁶AgeCategory

#Transform BMI variable from continuous variable to categorical variable with 4 levels
heart_lifestyle_2020 <- heart_lifestyle_2020 %>%
  mutate(bmi_cat = case_when(BMI <18.5 ~ "Underweight",
                             (BMI >=18.5 & BMI <25) ~ "Normalweight",
                             (BMI >= 25 & BMI <30) ~ "Overweight",
                             BMI >=30 ~ "Obese")) %>%
   #keep BMI category in the order we wanted
  mutate(bmi_cat = factor(bmi_cat, levels = c("Underweight", "Normalweight", "Overweight", "Obese")))

heart_lifestyle_2020 %>%
glimpse()

## Rows: 319,795
## Columns: 13
## $ HeartDisease     <chr> "No", "No", "No", "No", "No", "Yes", "No", "No", "No"…
## $ BMI              <dbl> 16.60, 20.34, 26.58, 24.21, 23.71, 28.87, 21.63, 31.6…
## $ Smoking          <chr> "Yes", "No", "Yes", "No", "No", "Yes", "No", "Yes", "…
## $ AlcoholDrinking  <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No",…
## $ PhysicalHealth   <dbl> 3, 0, 20, 0, 28, 6, 15, 5, 0, 0, 30, 0, 0, 7, 0, 1, 5…
## $ MentalHealth     <dbl> 30, 0, 30, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 30, 0, 2,…
## $ DiffWalking      <chr> "No", "No", "No", "No", "Yes", "Yes", "No", "Yes", "N…
## $ Sex              <chr> "Female", "Female", "Male", "Female", "Female", "Fema…
## $ AgeCategory      <chr> "55-59", "80 or older", "65-69", "75-79", "40-44", "7…
## $ Race             <chr> "White", "White", "White", "White", "White", "Black",…
## $ PhysicalActivity <chr> "Yes", "Yes", "Yes", "No", "Yes", "No", "Yes", "No", …
## $ SleepTime        <dbl> 5, 7, 8, 6, 8, 12, 4, 9, 5, 10, 15, 5, 8, 7, 5, 6, 10…
## $ bmi_cat          <fct> Underweight, Normalweight, Overweight, Normalweight, …

#Transform SleepTime variable from continuous variable to categorical variable with 3 levels

heart_lifestyle_2020 <- heart_lifestyle_2020 %>%
  mutate(sleeptime_cat = case_when(SleepTime <7 ~ "Less than 7h",
                             (SleepTime >=7 & SleepTime <9) ~ "7-9h",
                             SleepTime >=9 ~ "More than 9h")) %>%
  #keep sleeptime category in the order we wanted 
  mutate(sleeptime_cat = factor(sleeptime_cat, levels = c("Less than 7h", "7-9h", "More than 9h")))

heart_lifestyle_2020 %>%
glimpse()

## Rows: 319,795
## Columns: 14
## $ HeartDisease     <chr> "No", "No", "No", "No", "No", "Yes", "No", "No", "No"…
## $ BMI              <dbl> 16.60, 20.34, 26.58, 24.21, 23.71, 28.87, 21.63, 31.6…
## $ Smoking          <chr> "Yes", "No", "Yes", "No", "No", "Yes", "No", "Yes", "…
## $ AlcoholDrinking  <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No",…
## $ PhysicalHealth   <dbl> 3, 0, 20, 0, 28, 6, 15, 5, 0, 0, 30, 0, 0, 7, 0, 1, 5…
## $ MentalHealth     <dbl> 30, 0, 30, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 30, 0, 2,…
## $ DiffWalking      <chr> "No", "No", "No", "No", "Yes", "Yes", "No", "Yes", "N…
## $ Sex              <chr> "Female", "Female", "Male", "Female", "Female", "Fema…
## $ AgeCategory      <chr> "55-59", "80 or older", "65-69", "75-79", "40-44", "7…
## $ Race             <chr> "White", "White", "White", "White", "White", "Black",…
## $ PhysicalActivity <chr> "Yes", "Yes", "Yes", "No", "Yes", "No", "Yes", "No", …
## $ SleepTime        <dbl> 5, 7, 8, 6, 8, 12, 4, 9, 5, 10, 15, 5, 8, 7, 5, 6, 10…
## $ bmi_cat          <fct> Underweight, Normalweight, Overweight, Normalweight, …
## $ sleeptime_cat    <fct> Less than 7h, 7-9h, 7-9h, Less than 7h, 7-9h, More th…

#Transform Physical Health variable from continuous variable to categorical variable with 3 levels

heart_lifestyle_2020 <- heart_lifestyle_2020 %>%
  mutate(physicalhealth_cat = case_when(PhysicalHealth <10 ~ "< 10",
                             (PhysicalHealth >= 10 & PhysicalHealth <21) ~ "10-20",
                             PhysicalHealth >=21 ~ "20-30"))


heart_lifestyle_2020 %>%
glimpse()

## Rows: 319,795
## Columns: 15
## $ HeartDisease       <chr> "No", "No", "No", "No", "No", "Yes", "No", "No", "N…
## $ BMI                <dbl> 16.60, 20.34, 26.58, 24.21, 23.71, 28.87, 21.63, 31…
## $ Smoking            <chr> "Yes", "No", "Yes", "No", "No", "Yes", "No", "Yes",…
## $ AlcoholDrinking    <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No…
## $ PhysicalHealth     <dbl> 3, 0, 20, 0, 28, 6, 15, 5, 0, 0, 30, 0, 0, 7, 0, 1,…
## $ MentalHealth       <dbl> 30, 0, 30, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 30, 0, …
## $ DiffWalking        <chr> "No", "No", "No", "No", "Yes", "Yes", "No", "Yes", …
## $ Sex                <chr> "Female", "Female", "Male", "Female", "Female", "Fe…
## $ AgeCategory        <chr> "55-59", "80 or older", "65-69", "75-79", "40-44", …
## $ Race               <chr> "White", "White", "White", "White", "White", "Black…
## $ PhysicalActivity   <chr> "Yes", "Yes", "Yes", "No", "Yes", "No", "Yes", "No"…
## $ SleepTime          <dbl> 5, 7, 8, 6, 8, 12, 4, 9, 5, 10, 15, 5, 8, 7, 5, 6, …
## $ bmi_cat            <fct> Underweight, Normalweight, Overweight, Normalweight…
## $ sleeptime_cat      <fct> Less than 7h, 7-9h, 7-9h, Less than 7h, 7-9h, More …
## $ physicalhealth_cat <chr> "< 10", "< 10", "10-20", "< 10", "20-30", "< 10", "…

#Transform Mental Health variable from continuous variable to categorical variable with 3 levels

heart_lifestyle_2020 <- heart_lifestyle_2020 %>%
  mutate(mentalhealth_cat = case_when(MentalHealth <10 ~ "< 10",
                             (MentalHealth >= 10 & MentalHealth <21) ~ "10-20",
                             MentalHealth >=21 ~ "20-30"))

heart_lifestyle_2020 %>%
  glimpse()

## Rows: 319,795
## Columns: 16
## $ HeartDisease       <chr> "No", "No", "No", "No", "No", "Yes", "No", "No", "N…
## $ BMI                <dbl> 16.60, 20.34, 26.58, 24.21, 23.71, 28.87, 21.63, 31…
## $ Smoking            <chr> "Yes", "No", "Yes", "No", "No", "Yes", "No", "Yes",…
## $ AlcoholDrinking    <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No…
## $ PhysicalHealth     <dbl> 3, 0, 20, 0, 28, 6, 15, 5, 0, 0, 30, 0, 0, 7, 0, 1,…
## $ MentalHealth       <dbl> 30, 0, 30, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 30, 0, …
## $ DiffWalking        <chr> "No", "No", "No", "No", "Yes", "Yes", "No", "Yes", …
## $ Sex                <chr> "Female", "Female", "Male", "Female", "Female", "Fe…
## $ AgeCategory        <chr> "55-59", "80 or older", "65-69", "75-79", "40-44", …
## $ Race               <chr> "White", "White", "White", "White", "White", "Black…
## $ PhysicalActivity   <chr> "Yes", "Yes", "Yes", "No", "Yes", "No", "Yes", "No"…
## $ SleepTime          <dbl> 5, 7, 8, 6, 8, 12, 4, 9, 5, 10, 15, 5, 8, 7, 5, 6, …
## $ bmi_cat            <fct> Underweight, Normalweight, Overweight, Normalweight…
## $ sleeptime_cat      <fct> Less than 7h, 7-9h, 7-9h, Less than 7h, 7-9h, More …
## $ physicalhealth_cat <chr> "< 10", "< 10", "10-20", "< 10", "20-30", "< 10", "…
## $ mentalhealth_cat   <chr> "20-30", "< 10", "20-30", "< 10", "< 10", "< 10", "…

# Collapse Age Category variable into 4 levels because we are more interested in age above 40 as this age group is more proned into developing heart disease

heart_lifestyle_2020 <- heart_lifestyle_2020 %>%
  mutate(Age = fct_collapse(AgeCategory,
        "below 40" = c("18-24", "25-29", "30-34", "35-39"),
        "40-59" = c("40-44", "45-49", "50-54", "55-59"),
        "60-79" = c("60-64", "65-69", "70-74", "75-79"),
        "80 or older" = "80 or older"))
  
heart_lifestyle_2020 %>%
  glimpse()

## Rows: 319,795
## Columns: 17
## $ HeartDisease       <chr> "No", "No", "No", "No", "No", "Yes", "No", "No", "N…
## $ BMI                <dbl> 16.60, 20.34, 26.58, 24.21, 23.71, 28.87, 21.63, 31…
## $ Smoking            <chr> "Yes", "No", "Yes", "No", "No", "Yes", "No", "Yes",…
## $ AlcoholDrinking    <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No…
## $ PhysicalHealth     <dbl> 3, 0, 20, 0, 28, 6, 15, 5, 0, 0, 30, 0, 0, 7, 0, 1,…
## $ MentalHealth       <dbl> 30, 0, 30, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 30, 0, …
## $ DiffWalking        <chr> "No", "No", "No", "No", "Yes", "Yes", "No", "Yes", …
## $ Sex                <chr> "Female", "Female", "Male", "Female", "Female", "Fe…
## $ AgeCategory        <chr> "55-59", "80 or older", "65-69", "75-79", "40-44", …
## $ Race               <chr> "White", "White", "White", "White", "White", "Black…
## $ PhysicalActivity   <chr> "Yes", "Yes", "Yes", "No", "Yes", "No", "Yes", "No"…
## $ SleepTime          <dbl> 5, 7, 8, 6, 8, 12, 4, 9, 5, 10, 15, 5, 8, 7, 5, 6, …
## $ bmi_cat            <fct> Underweight, Normalweight, Overweight, Normalweight…
## $ sleeptime_cat      <fct> Less than 7h, 7-9h, 7-9h, Less than 7h, 7-9h, More …
## $ physicalhealth_cat <chr> "< 10", "< 10", "10-20", "< 10", "20-30", "< 10", "…
## $ mentalhealth_cat   <chr> "20-30", "< 10", "20-30", "< 10", "< 10", "< 10", "…
## $ Age                <fct> 40-59, 80 or older, 60-79, 60-79, 40-59, 60-79, 60-…

#Filter the dataset to focus on those with heart disease only
heart_lifestyle_2020_yeshd <- heart_lifestyle_2020 %>%
  filter (HeartDisease == "Yes")

heart_lifestyle_2020_yeshd

## # A tibble: 27,373 × 17
##    HeartDise…¹   BMI Smoking Alcoh…² Physi…³ Menta…⁴ DiffW…⁵ Sex   AgeCa…⁶ Race 
##    <chr>       <dbl> <chr>   <chr>     <dbl>   <dbl> <chr>   <chr> <chr>   <chr>
##  1 Yes          28.9 Yes     No            6       0 Yes     Fema… 75-79   Black
##  2 Yes          34.3 Yes     No           30       0 Yes     Male  60-64   White
##  3 Yes          33.0 Yes     No           10       0 Yes     Male  75-79   White
##  4 Yes          25.1 No      No            0       0 Yes     Fema… 80 or … White
##  5 Yes          30.2 Yes     No            6       2 Yes     Fema… 75-79   White
##  6 Yes          20.4 Yes     No            3       0 No      Fema… 70-74   White
##  7 Yes          34.3 No      No            0       0 No      Fema… 55-59   White
##  8 Yes          28.3 Yes     No           30      30 No      Fema… 70-74   White
##  9 Yes          24.7 Yes     No           30      30 Yes     Male  70-74   White
## 10 Yes          33.6 No      No            2       0 Yes     Fema… 70-74   White
## # … with 27,363 more rows, 7 more variables: PhysicalActivity <chr>,
## #   SleepTime <dbl>, bmi_cat <fct>, sleeptime_cat <fct>,
## #   physicalhealth_cat <chr>, mentalhealth_cat <chr>, Age <fct>, and
## #   abbreviated variable names ¹HeartDisease, ²AlcoholDrinking,
## #   ³PhysicalHealth, ⁴MentalHealth, ⁵DiffWalking, ⁶AgeCategory

#Filter the dataset to focus on those without heart disease only
heart_lifestyle_2020_nohd <- heart_lifestyle_2020 %>%
  filter (HeartDisease == "No")

heart_lifestyle_2020_nohd

## # A tibble: 292,422 × 17
##    HeartDise…¹   BMI Smoking Alcoh…² Physi…³ Menta…⁴ DiffW…⁵ Sex   AgeCa…⁶ Race 
##    <chr>       <dbl> <chr>   <chr>     <dbl>   <dbl> <chr>   <chr> <chr>   <chr>
##  1 No           16.6 Yes     No            3      30 No      Fema… 55-59   White
##  2 No           20.3 No      No            0       0 No      Fema… 80 or … White
##  3 No           26.6 Yes     No           20      30 No      Male  65-69   White
##  4 No           24.2 No      No            0       0 No      Fema… 75-79   White
##  5 No           23.7 No      No           28       0 Yes     Fema… 40-44   White
##  6 No           21.6 No      No           15       0 No      Fema… 70-74   White
##  7 No           31.6 Yes     No            5       0 Yes     Fema… 80 or … White
##  8 No           26.4 No      No            0       0 No      Fema… 80 or … White
##  9 No           40.7 No      No            0       0 Yes     Male  65-69   White
## 10 No           28.7 Yes     No            0       0 No      Fema… 55-59   White
## # … with 292,412 more rows, 7 more variables: PhysicalActivity <chr>,
## #   SleepTime <dbl>, bmi_cat <fct>, sleeptime_cat <fct>,
## #   physicalhealth_cat <chr>, mentalhealth_cat <chr>, Age <fct>, and
## #   abbreviated variable names ¹HeartDisease, ²AlcoholDrinking,
## #   ³PhysicalHealth, ⁴MentalHealth, ⁵DiffWalking, ⁶AgeCategory

Bonus points (5 points) for datasets that require merging of tables, but only if you reason through whether you should use left_join, inner_join, or right_join on these tables. No credit will be provided if you don’t.

Not applicable to this dataset because there is no other dataset to merge with.

Show your transformed table here. Use tools such as glimpse(), skim() or head() to illustrate your point.

heart_lifestyle_2020 %>%
glimpse()

## Rows: 319,795
## Columns: 17
## $ HeartDisease       <chr> "No", "No", "No", "No", "No", "Yes", "No", "No", "N…
## $ BMI                <dbl> 16.60, 20.34, 26.58, 24.21, 23.71, 28.87, 21.63, 31…
## $ Smoking            <chr> "Yes", "No", "Yes", "No", "No", "Yes", "No", "Yes",…
## $ AlcoholDrinking    <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No…
## $ PhysicalHealth     <dbl> 3, 0, 20, 0, 28, 6, 15, 5, 0, 0, 30, 0, 0, 7, 0, 1,…
## $ MentalHealth       <dbl> 30, 0, 30, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 30, 0, …
## $ DiffWalking        <chr> "No", "No", "No", "No", "Yes", "Yes", "No", "Yes", …
## $ Sex                <chr> "Female", "Female", "Male", "Female", "Female", "Fe…
## $ AgeCategory        <chr> "55-59", "80 or older", "65-69", "75-79", "40-44", …
## $ Race               <chr> "White", "White", "White", "White", "White", "Black…
## $ PhysicalActivity   <chr> "Yes", "Yes", "Yes", "No", "Yes", "No", "Yes", "No"…
## $ SleepTime          <dbl> 5, 7, 8, 6, 8, 12, 4, 9, 5, 10, 15, 5, 8, 7, 5, 6, …
## $ bmi_cat            <fct> Underweight, Normalweight, Overweight, Normalweight…
## $ sleeptime_cat      <fct> Less than 7h, 7-9h, 7-9h, Less than 7h, 7-9h, More …
## $ physicalhealth_cat <chr> "< 10", "< 10", "10-20", "< 10", "20-30", "< 10", "…
## $ mentalhealth_cat   <chr> "20-30", "< 10", "20-30", "< 10", "< 10", "< 10", "…
## $ Age                <fct> 40-59, 80 or older, 60-79, 60-79, 40-59, 60-79, 60-…

Are the values what you expected for the variables? Why or Why not?

This dataset is a clean dataset, so all the values are what I expected for all the variables. However, there are too many levels within the age category. So, I collapsed some age groups together as shown above.

4. Visualizing and Summarizing the Data (15 points)

Use group_by() and summarize() to make a summary of the data here. The summary should be relevant to your research question

#Get a summary of count and percentage group by BMI category among those with heart disease

heart_lifestyle_2020_yeshd %>% 
  group_by(bmi_cat) %>%
  summarise(count = n()) %>%
  mutate(percent = round((count/sum(count))*100, 1)) %>%
adorn_totals()

##       bmi_cat count percent
##   Underweight   401     1.5
##  Normalweight  6309    23.0
##    Overweight  9991    36.5
##         Obese 10672    39.0
##         Total 27373   100.0

#Get a summary of count and percentage group by BMI category among those without heart disease

heart_lifestyle_2020_nohd %>% 
  group_by(bmi_cat) %>%
  summarise(count = n()) %>%
  mutate(percent = round((count/sum(count))*100, 1)) %>%
adorn_totals()

##       bmi_cat  count percent
##   Underweight   4709     1.6
##  Normalweight  91022    31.1
##    Overweight 104521    35.7
##         Obese  92170    31.5
##         Total 292422    99.9

#Cross tabulate BMI category by Heart Disease status
heart_lifestyle_2020 %>%
  tabyl(HeartDisease, bmi_cat) %>%
  adorn_totals(where = c("row", "col")) %>% adorn_percentages() %>%
  adorn_pct_formatting(digits = 1) %>% adorn_ns(position = "front") %>% adorn_title()

##                   bmi_cat                                            
##  HeartDisease Underweight  Normalweight     Overweight          Obese
##            No 4709 (1.6%) 91022 (31.1%) 104521 (35.7%)  92170 (31.5%)
##           Yes  401 (1.5%)  6309 (23.0%)   9991 (36.5%)  10672 (39.0%)
##         Total 5110 (1.6%) 97331 (30.4%) 114512 (35.8%) 102842 (32.2%)
##                 
##            Total
##  292422 (100.0%)
##   27373 (100.0%)
##  319795 (100.0%)

#Get a summary of count and percentage group by sleeptime category among those with heart disease
heart_lifestyle_2020_yeshd %>% 
  group_by(sleeptime_cat) %>%
  summarise(count = n()) %>%
  mutate(percent = round((count/sum(count))*100, 1)) %>%
adorn_totals()

##  sleeptime_cat count percent
##   Less than 7h  9172    33.5
##           7-9h 14683    53.6
##   More than 9h  3518    12.9
##          Total 27373   100.0

#Get a summary of count and percentage group by sleeptime category among those without heart disease
heart_lifestyle_2020_nohd %>% 
  group_by(sleeptime_cat) %>%
  summarise(count = n()) %>%
  mutate(percent = round((count/sum(count))*100, 1)) %>%
adorn_totals()

##  sleeptime_cat  count percent
##   Less than 7h  87814    30.0
##           7-9h 180670    61.8
##   More than 9h  23938     8.2
##          Total 292422   100.0

#Cross tabulate sleeptime category by Heart Disease status

heart_lifestyle_2020 %>%
  tabyl(HeartDisease, sleeptime_cat) %>%
  adorn_totals(where = c("row", "col")) %>% adorn_percentages() %>%
  adorn_pct_formatting(digits = 1) %>% adorn_ns(position = "front") %>% adorn_title()

##               sleeptime_cat                                             
##  HeartDisease  Less than 7h           7-9h  More than 9h           Total
##            No 87814 (30.0%) 180670 (61.8%) 23938  (8.2%) 292422 (100.0%)
##           Yes  9172 (33.5%)  14683 (53.6%)  3518 (12.9%)  27373 (100.0%)
##         Total 96986 (30.3%) 195353 (61.1%) 27456  (8.6%) 319795 (100.0%)

#Cross tabulate Smoking status by Heart Disease status

heart_lifestyle_2020 %>%
  tabyl(HeartDisease, Smoking) %>%
  adorn_totals(where = c("row", "col")) %>% adorn_percentages() %>%
  adorn_pct_formatting(digits = 1) %>% adorn_ns(position = "front") %>% adorn_title()

##                      Smoking                               
##  HeartDisease             No            Yes           Total
##            No 176551 (60.4%) 115871 (39.6%) 292422 (100.0%)
##           Yes  11336 (41.4%)  16037 (58.6%)  27373 (100.0%)
##         Total 187887 (58.8%) 131908 (41.2%) 319795 (100.0%)

#Cross tabulate Alcohol drinking status by Heart Disease status
heart_lifestyle_2020 %>%
  tabyl(HeartDisease, AlcoholDrinking) %>%
  adorn_totals(where = c("row", "col")) %>% adorn_percentages() %>%
  adorn_pct_formatting(digits = 1) %>% adorn_ns(position = "front") %>% adorn_title()

##               AlcoholDrinking                             
##  HeartDisease              No          Yes           Total
##            No  271786 (92.9%) 20636 (7.1%) 292422 (100.0%)
##           Yes   26232 (95.8%)  1141 (4.2%)  27373 (100.0%)
##         Total  298018 (93.2%) 21777 (6.8%) 319795 (100.0%)

#Cross tabulate physical activity status by Heart Disease status

heart_lifestyle_2020 %>%
  tabyl(HeartDisease, PhysicalActivity) %>%
  adorn_totals(where = c("row", "col")) %>% adorn_percentages() %>%
  adorn_pct_formatting(digits = 1) %>% adorn_ns(position = "front") %>% adorn_title()

##               PhysicalActivity                               
##  HeartDisease               No            Yes           Total
##            No    61954 (21.2%) 230468 (78.8%) 292422 (100.0%)
##           Yes     9884 (36.1%)  17489 (63.9%)  27373 (100.0%)
##         Total    71838 (22.5%) 247957 (77.5%) 319795 (100.0%)

#Select variables that I want to include in my Table 1
summary_data <- heart_lifestyle_2020 %>% select(HeartDisease, Age, Sex, Race, bmi_cat, Smoking, AlcoholDrinking, PhysicalActivity, sleeptime_cat, physicalhealth_cat, mentalhealth_cat)

#Create a summary table for all the wanted variables into Table 1
summary_data %>%
  tbl_summary(by= HeartDisease) %>%
  modify_caption("**Table 1. Characteristic of Particpants by Heart Disease Status**")%>% bold_labels()

**Table 1. Characteristic of Particpants by Heart Disease Status**
Characteristic	No, N = 292,422¹	Yes, N = 27,373¹
Age
below 40	76,537 (26%)	785 (2.9%)
40-59	93,121 (32%)	4,815 (18%)
60-79	104,060 (36%)	16,324 (60%)
80 or older	18,704 (6.4%)	5,449 (20%)
Sex
Female	156,571 (54%)	11,234 (41%)
Male	135,851 (46%)	16,139 (59%)
Race
American Indian/Alaskan Native	4,660 (1.6%)	542 (2.0%)
Asian	7,802 (2.7%)	266 (1.0%)
Black	21,210 (7.3%)	1,729 (6.3%)
Hispanic	26,003 (8.9%)	1,443 (5.3%)
Other	10,042 (3.4%)	886 (3.2%)
White	222,705 (76%)	22,507 (82%)
bmi_cat
Underweight	4,709 (1.6%)	401 (1.5%)
Normalweight	91,022 (31%)	6,309 (23%)
Overweight	104,521 (36%)	9,991 (36%)
Obese	92,170 (32%)	10,672 (39%)
Smoking	115,871 (40%)	16,037 (59%)
AlcoholDrinking	20,636 (7.1%)	1,141 (4.2%)
PhysicalActivity	230,468 (79%)	17,489 (64%)
sleeptime_cat
Less than 7h	87,814 (30%)	9,172 (34%)
7-9h	180,670 (62%)	14,683 (54%)
More than 9h	23,938 (8.2%)	3,518 (13%)
physicalhealth_cat
< 10	260,428 (89%)	19,224 (70%)
10-20	14,874 (5.1%)	2,928 (11%)
20-30	17,120 (5.9%)	5,221 (19%)
mentalhealth_cat
< 10	247,264 (85%)	22,252 (81%)
10-20	26,886 (9.2%)	2,464 (9.0%)
20-30	18,272 (6.2%)	2,657 (9.7%)
¹ n (%)

What are your findings about the summary? Are they what you expected?

From the summary Table 1, it is observed that participants in the age group of 60-79 have the highest risk of having heart disease. Among participants who have heart disease, 60% of them are aged between 60-79. I also observed that there are more men with heart disease compared to women. In terms of physical activity, I observed that among those participants who have heart disease, 64% are active and 36% are inactive. However, 79% are active and 21% are inactive among those who do not have heart disease. For sleep time, for those who have heart disease, 34% are sleep deprived (less than 7hr), 54% are healthy sleep (7-9hr) and 13% are oversleeping (more than 9hrs), whereas for those who do not have heart disease, 30% are sleep deprived(less than 7hr), 62% are healthy sleep (7-9hr) and 8.2% oversleeping (more than 9hr). For smoking, among those who have heart disease, 59% smoke, and 41% do not smoke, while among those who do not have heart disease, 40% smoke and 60% do not smoke. For alcohol drinking, among those who have heart disease, 96% do not drink, and 4% drink, whereas among those who do no have heart disease, 93% do not drink and 7% drink. In term of BMI, among those who have heart disease, 1.5% are underweight, 23% are normal weight and 36.5% are overweight, while among those who do not have heart disease, 1.6% are underweight, 31.1% are normal weight and 35.7% are overweight. I expected most of these, however, I was a little surprised to see such low percentages of those who drink alcohol among both group of with and without heart disease.

Make at least two plots that help you answer your question on the transformed or summarized data. Use scales and/or labels to make each plot informative.

#To answer question 1
smoke_plot <- ggplot(heart_lifestyle_2020) + 
  aes(x = Smoking, fill = HeartDisease) +
  geom_bar() +
  facet_wrap(vars(Sex)) +
 labs(title = "Smoking Status by Heart Disease Status among Female and Male ",
       x = "Smoking Status",
       y = "Frequency", fill = "Heart Disease Status") +
  theme_minimal() +
  scale_fill_viridis_d()

smoke_plot

# To answer question 1
drink_plot <- ggplot(heart_lifestyle_2020) + 
  aes(x = AlcoholDrinking, fill = HeartDisease) +
  geom_bar() +
  facet_wrap(vars(Sex)) +
 labs(title = "Drinking Status by Heart Disease Status among Female and Male ",
       x = "Drinking Status",
       y = "Frequency", fill = "Heart Disease Status") +
  theme_minimal() +
  scale_fill_viridis_d()

drink_plot

#To answer question 1
activity_plot <- ggplot(heart_lifestyle_2020) + 
  aes(x = PhysicalActivity, fill = HeartDisease) +
  geom_bar(position = "stack") +
  facet_wrap(vars(Sex)) +
 labs(title = "Physical Activity by Heart Disease Status among Female and Male ",
       x = "Physical Activity",
       y = "Frequency", fill = "Heart Disease Status") +
  theme_minimal() +
  scale_fill_viridis_d()

activity_plot

#To answer question 1
BMI_plot <- ggplot(heart_lifestyle_2020) + 
  aes(x = BMI, fill = HeartDisease) +
  geom_boxplot() +
  facet_wrap(vars(Sex)) +
 labs(title = "BMI by Heart Disease Status among Female and Male ",
       x = "BMI",
       y = "Frequency", fill = "Heart Disease Status") +
  theme_minimal() +
  scale_fill_viridis_d()

BMI_plot

BMICAT_plot <- ggplot(heart_lifestyle_2020) + 
  aes(x = HeartDisease, fill = bmi_cat) +
  geom_bar(position = "stack") +
  facet_wrap(vars(Sex)) +
 labs(title = "BMI Category by Heart Disease Status among Female and Male ",
       x = "Heart Disease Status",
       y = "Frequency", fill = "BMI Category") +
  theme_minimal() +
  scale_fill_viridis_d()

BMICAT_plot

#To answer question 1

sleep_plot <- ggplot(heart_lifestyle_2020) + 
  aes(x = HeartDisease, fill = sleeptime_cat) +
  geom_bar(position = "stack") +
  facet_wrap(vars(Sex)) +
 labs(title = "SleepTime Category by Heart Disease Status among Female and Male ",
       x = "Heart Disease",
       y = "Frequency", fill = "SleepTime Category") +
  theme_minimal() +
  scale_fill_viridis_d()

sleep_plot

# To answer question 2
age_plot <- ggplot(heart_lifestyle_2020) + 
  aes(x = HeartDisease, fill = Age) +
  geom_bar() +
  facet_wrap(vars(Sex)) +
 labs(title = "Heart Disease by Age among Female and Male ",
       x = "Heart Disease Status",
       y = "Frequency", fill = "Age Category") +
  theme_minimal() +
  scale_fill_viridis_d()

age_plot

Final Summary (10 points)

Summarize your research question and findings below.

The objective of this analysis is to determine which lifestyle factors are related to increasing the risk of heart disease. Overall, I observed that among those who have heart disease, there are higher percentages of inactive, sleep-deprived(less than 7hr), smoke, and overweight compared to those who do not have heart disease. However, there is a lower percentage of those who drink among those who have heart disease compared to those who do not have heart disease. Moreover,the participants in the age group of 60-79 have the highest risk of having heart disease, more in males than females.

Are your findings what you expected? Why or Why not?

These findings are what I expected because we see that among lifestyle behaviors that are considered to be unhealthy such as smoking, lack of sleep and being inactive have high percentages in having heart disease. However, the way the data collected for some variables such as smoking and alcohol drinking could be improved by collecting in more detail on how often participants smoke or drink and also collect the amount of cigarettes and alcohol being consumed.

Acknowledgement

Some of the codes (Summary Table 1 and Count & Percentages by group) in this project are adapted from Laura Jacobsen BSTA504 Midterm 2/13/2022 and BSTA 504 Winter 2023 Class materials.

BSTA 504: R-Programming Midterm

Puthyda Keath

2023-02-19