Please submit your .Rmd
and .html
files in
Sakai. If you are working together, both people should submit the
files.
The goal of the midterm project is to showcase skills that you have learned in class so far. The midterm is open note, but if you use someone else’s code, you must attribute them.
.csv
file into your data
folder. This resource is probably the
easiest to deal with.Define a research question, involving at least one categorical variable. You may schedule a time with Jessica or Brad to discuss your data set and research question, or you may message it to one of us in slack or email. Please do one of the two options pretty early on. We just want to look at the data and make sure that it is appropriate for your question.
You must use each of the following functions at least once:
mutate()
group_by()
summarize()
ggplot()
and at least one of the following:
case_when()
across()
*_join()
(i.e. left_join()
)pivot_*()
(i.e. pivot_longer()
)function()
The code chunks below are guides, please add more code chunks to do what you need.
If you do not want your final project posted on the public website, please let Jessica know. We can also keep it anonymous if you’d like to remove your name from the Rmd and html, or use a pseudonym.
You may remove these instructions from your final Rmd if you like
If you’d like to work together in pairs, that is encouraged, but you must divide the work equitably and you must note who worked on what. This is probably easiest as notes in the text. Please let Brad or Jessica know that you’ll be working together.
No acknowledgements of contributions = -10 points overall.
I will take off points (-5 points for each section) if you don’t add observations and notes in your RMarkdown document. I want you to think and reason through your analysis, even if they are preliminary thoughts.
This dataset is sourced from MOHAMED ELSAYED on Kaggle, titled “Heart Disease Prediction”.This work is done by Puthyda Keath.
Define your research question below. What about the data interests you? What is a specific question you want to find out about the data?
I’m interested in looking at what kind of lifestyle factors are related to increasing the risk of heart disease. This data has some pretty good lifestyle-related variables such as sleeping hours, smoking, drinking, physical activity, and BMI that could allow me to explore the risk factors associated with heart disease. This led me to my research question: “What is the association between lifestyle factors (sleep time, smoking, alcohol drinking, physical activity, and BMI) and the risk of developing heart disease?”. I also would like to know which age groups are at risk of developing heart disease.
Given your question, what is your expectation about the data?
I hypothesize that people with unhealthy lifestyles such as less sleep time, smoking, drinking alcohol, less physical activity, and higher BMI will be more likely to have heart disease. According to the CDC, the prevalence of heart disease is higher in people who are 60 and above. So, I hypothesize that there are higher percentages of people who are older than 60 year old with heart disease than those who are younger than 60 year old.
Load the data below and use
dplyr::glimpse()
orskimr::skim()
on the data. You should upload the data file into thedata
directory.
library(readr)
heart_2020_cleaned <- read_csv("heart_2020_cleaned.csv", na= c("NA"))
## Rows: 319795 Columns: 18
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (14): HeartDisease, Smoking, AlcoholDrinking, Stroke, DiffWalking, Sex, ...
## dbl (4): BMI, PhysicalHealth, MentalHealth, SleepTime
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#checking out the data using glimpse function
glimpse(heart_2020_cleaned) # 319,795 rows and 18 columns
## Rows: 319,795
## Columns: 18
## $ HeartDisease <chr> "No", "No", "No", "No", "No", "Yes", "No", "No", "No"…
## $ BMI <dbl> 16.60, 20.34, 26.58, 24.21, 23.71, 28.87, 21.63, 31.6…
## $ Smoking <chr> "Yes", "No", "Yes", "No", "No", "Yes", "No", "Yes", "…
## $ AlcoholDrinking <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No",…
## $ Stroke <chr> "No", "Yes", "No", "No", "No", "No", "No", "No", "No"…
## $ PhysicalHealth <dbl> 3, 0, 20, 0, 28, 6, 15, 5, 0, 0, 30, 0, 0, 7, 0, 1, 5…
## $ MentalHealth <dbl> 30, 0, 30, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 30, 0, 2,…
## $ DiffWalking <chr> "No", "No", "No", "No", "Yes", "Yes", "No", "Yes", "N…
## $ Sex <chr> "Female", "Female", "Male", "Female", "Female", "Fema…
## $ AgeCategory <chr> "55-59", "80 or older", "65-69", "75-79", "40-44", "7…
## $ Race <chr> "White", "White", "White", "White", "White", "Black",…
## $ Diabetic <chr> "Yes", "No", "Yes", "No", "No", "No", "No", "Yes", "N…
## $ PhysicalActivity <chr> "Yes", "Yes", "Yes", "No", "Yes", "No", "Yes", "No", …
## $ GenHealth <chr> "Very good", "Very good", "Fair", "Good", "Very good"…
## $ SleepTime <dbl> 5, 7, 8, 6, 8, 12, 4, 9, 5, 10, 15, 5, 8, 7, 5, 6, 10…
## $ Asthma <chr> "Yes", "No", "Yes", "No", "No", "No", "Yes", "Yes", "…
## $ KidneyDisease <chr> "No", "No", "No", "No", "No", "No", "No", "No", "Yes"…
## $ SkinCancer <chr> "Yes", "No", "No", "Yes", "No", "No", "Yes", "No", "N…
#checking the data structure using skimr
skim_heart_2020_cleaned <- skimr::skim(heart_2020_cleaned) %>%
as_tibble() %>%
print() #No missing value
## # A tibble: 18 × 17
## skim_type skim_vari…¹ n_mis…² compl…³ chara…⁴ chara…⁵ chara…⁶ chara…⁷ chara…⁸
## <chr> <chr> <int> <dbl> <int> <int> <int> <int> <int>
## 1 character HeartDisea… 0 1 2 3 0 2 0
## 2 character Smoking 0 1 2 3 0 2 0
## 3 character AlcoholDri… 0 1 2 3 0 2 0
## 4 character Stroke 0 1 2 3 0 2 0
## 5 character DiffWalking 0 1 2 3 0 2 0
## 6 character Sex 0 1 4 6 0 2 0
## 7 character AgeCategory 0 1 5 11 0 13 0
## 8 character Race 0 1 5 30 0 6 0
## 9 character Diabetic 0 1 2 23 0 4 0
## 10 character PhysicalAc… 0 1 2 3 0 2 0
## 11 character GenHealth 0 1 4 9 0 5 0
## 12 character Asthma 0 1 2 3 0 2 0
## 13 character KidneyDise… 0 1 2 3 0 2 0
## 14 character SkinCancer 0 1 2 3 0 2 0
## 15 numeric BMI 0 1 NA NA NA NA NA
## 16 numeric PhysicalHe… 0 1 NA NA NA NA NA
## 17 numeric MentalHeal… 0 1 NA NA NA NA NA
## 18 numeric SleepTime 0 1 NA NA NA NA NA
## # … with 8 more variables: numeric.mean <dbl>, numeric.sd <dbl>,
## # numeric.p0 <dbl>, numeric.p25 <dbl>, numeric.p50 <dbl>, numeric.p75 <dbl>,
## # numeric.p100 <dbl>, numeric.hist <chr>, and abbreviated variable names
## # ¹skim_variable, ²n_missing, ³complete_rate, ⁴character.min, ⁵character.max,
## # ⁶character.empty, ⁷character.n_unique, ⁸character.whitespace
#Checking how the data is formatted using gt function
gt(head(heart_2020_cleaned))
HeartDisease | BMI | Smoking | AlcoholDrinking | Stroke | PhysicalHealth | MentalHealth | DiffWalking | Sex | AgeCategory | Race | Diabetic | PhysicalActivity | GenHealth | SleepTime | Asthma | KidneyDisease | SkinCancer |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
No | 16.60 | Yes | No | No | 3 | 30 | No | Female | 55-59 | White | Yes | Yes | Very good | 5 | Yes | No | Yes |
No | 20.34 | No | No | Yes | 0 | 0 | No | Female | 80 or older | White | No | Yes | Very good | 7 | No | No | No |
No | 26.58 | Yes | No | No | 20 | 30 | No | Male | 65-69 | White | Yes | Yes | Fair | 8 | Yes | No | No |
No | 24.21 | No | No | No | 0 | 0 | No | Female | 75-79 | White | No | No | Good | 6 | No | No | Yes |
No | 23.71 | No | No | No | 28 | 0 | Yes | Female | 40-44 | White | No | Yes | Very good | 8 | No | No | No |
Yes | 28.87 | Yes | No | No | 6 | 0 | Yes | Female | 75-79 | Black | No | No | Fair | 12 | No | No | No |
#Create a copy of heart_2020_cleaned dataset
heart_2020_new <- heart_2020_cleaned
heart_2020_new
## # A tibble: 319,795 × 18
## HeartDis…¹ BMI Smoking Alcoh…² Stroke Physi…³ Menta…⁴ DiffW…⁵ Sex AgeCa…⁶
## <chr> <dbl> <chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr>
## 1 No 16.6 Yes No No 3 30 No Fema… 55-59
## 2 No 20.3 No No Yes 0 0 No Fema… 80 or …
## 3 No 26.6 Yes No No 20 30 No Male 65-69
## 4 No 24.2 No No No 0 0 No Fema… 75-79
## 5 No 23.7 No No No 28 0 Yes Fema… 40-44
## 6 Yes 28.9 Yes No No 6 0 Yes Fema… 75-79
## 7 No 21.6 No No No 15 0 No Fema… 70-74
## 8 No 31.6 Yes No No 5 0 Yes Fema… 80 or …
## 9 No 26.4 No No No 0 0 No Fema… 80 or …
## 10 No 40.7 No No No 0 0 Yes Male 65-69
## # … with 319,785 more rows, 8 more variables: Race <chr>, Diabetic <chr>,
## # PhysicalActivity <chr>, GenHealth <chr>, SleepTime <dbl>, Asthma <chr>,
## # KidneyDisease <chr>, SkinCancer <chr>, and abbreviated variable names
## # ¹HeartDisease, ²AlcoholDrinking, ³PhysicalHealth, ⁴MentalHealth,
## # ⁵DiffWalking, ⁶AgeCategory
skim_heart_2020 <- skimr::skim(heart_2020_new) %>%
as_tibble() %>%
print()
## # A tibble: 18 × 17
## skim_type skim_vari…¹ n_mis…² compl…³ chara…⁴ chara…⁵ chara…⁶ chara…⁷ chara…⁸
## <chr> <chr> <int> <dbl> <int> <int> <int> <int> <int>
## 1 character HeartDisea… 0 1 2 3 0 2 0
## 2 character Smoking 0 1 2 3 0 2 0
## 3 character AlcoholDri… 0 1 2 3 0 2 0
## 4 character Stroke 0 1 2 3 0 2 0
## 5 character DiffWalking 0 1 2 3 0 2 0
## 6 character Sex 0 1 4 6 0 2 0
## 7 character AgeCategory 0 1 5 11 0 13 0
## 8 character Race 0 1 5 30 0 6 0
## 9 character Diabetic 0 1 2 23 0 4 0
## 10 character PhysicalAc… 0 1 2 3 0 2 0
## 11 character GenHealth 0 1 4 9 0 5 0
## 12 character Asthma 0 1 2 3 0 2 0
## 13 character KidneyDise… 0 1 2 3 0 2 0
## 14 character SkinCancer 0 1 2 3 0 2 0
## 15 numeric BMI 0 1 NA NA NA NA NA
## 16 numeric PhysicalHe… 0 1 NA NA NA NA NA
## 17 numeric MentalHeal… 0 1 NA NA NA NA NA
## 18 numeric SleepTime 0 1 NA NA NA NA NA
## # … with 8 more variables: numeric.mean <dbl>, numeric.sd <dbl>,
## # numeric.p0 <dbl>, numeric.p25 <dbl>, numeric.p50 <dbl>, numeric.p75 <dbl>,
## # numeric.p100 <dbl>, numeric.hist <chr>, and abbreviated variable names
## # ¹skim_variable, ²n_missing, ³complete_rate, ⁴character.min, ⁵character.max,
## # ⁶character.empty, ⁷character.n_unique, ⁸character.whitespace
glimpse(heart_2020_new)
## Rows: 319,795
## Columns: 18
## $ HeartDisease <chr> "No", "No", "No", "No", "No", "Yes", "No", "No", "No"…
## $ BMI <dbl> 16.60, 20.34, 26.58, 24.21, 23.71, 28.87, 21.63, 31.6…
## $ Smoking <chr> "Yes", "No", "Yes", "No", "No", "Yes", "No", "Yes", "…
## $ AlcoholDrinking <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No",…
## $ Stroke <chr> "No", "Yes", "No", "No", "No", "No", "No", "No", "No"…
## $ PhysicalHealth <dbl> 3, 0, 20, 0, 28, 6, 15, 5, 0, 0, 30, 0, 0, 7, 0, 1, 5…
## $ MentalHealth <dbl> 30, 0, 30, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 30, 0, 2,…
## $ DiffWalking <chr> "No", "No", "No", "No", "Yes", "Yes", "No", "Yes", "N…
## $ Sex <chr> "Female", "Female", "Male", "Female", "Female", "Fema…
## $ AgeCategory <chr> "55-59", "80 or older", "65-69", "75-79", "40-44", "7…
## $ Race <chr> "White", "White", "White", "White", "White", "Black",…
## $ Diabetic <chr> "Yes", "No", "Yes", "No", "No", "No", "No", "Yes", "N…
## $ PhysicalActivity <chr> "Yes", "Yes", "Yes", "No", "Yes", "No", "Yes", "No", …
## $ GenHealth <chr> "Very good", "Very good", "Fair", "Good", "Very good"…
## $ SleepTime <dbl> 5, 7, 8, 6, 8, 12, 4, 9, 5, 10, 15, 5, 8, 7, 5, 6, 10…
## $ Asthma <chr> "Yes", "No", "Yes", "No", "No", "No", "Yes", "Yes", "…
## $ KidneyDisease <chr> "No", "No", "No", "No", "No", "No", "No", "No", "Yes"…
## $ SkinCancer <chr> "Yes", "No", "No", "Yes", "No", "No", "Yes", "No", "N…
gt(head(heart_2020_new))
HeartDisease | BMI | Smoking | AlcoholDrinking | Stroke | PhysicalHealth | MentalHealth | DiffWalking | Sex | AgeCategory | Race | Diabetic | PhysicalActivity | GenHealth | SleepTime | Asthma | KidneyDisease | SkinCancer |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
No | 16.60 | Yes | No | No | 3 | 30 | No | Female | 55-59 | White | Yes | Yes | Very good | 5 | Yes | No | Yes |
No | 20.34 | No | No | Yes | 0 | 0 | No | Female | 80 or older | White | No | Yes | Very good | 7 | No | No | No |
No | 26.58 | Yes | No | No | 20 | 30 | No | Male | 65-69 | White | Yes | Yes | Fair | 8 | Yes | No | No |
No | 24.21 | No | No | No | 0 | 0 | No | Female | 75-79 | White | No | No | Good | 6 | No | No | Yes |
No | 23.71 | No | No | No | 28 | 0 | Yes | Female | 40-44 | White | No | Yes | Very good | 8 | No | No | No |
Yes | 28.87 | Yes | No | No | 6 | 0 | Yes | Female | 75-79 | Black | No | No | Fair | 12 | No | No | No |
If there are any quirks that you have to deal with
NA
coded as something else, or it is multiple tables, please make some notes here about what you need to do before you start transforming the data in the next section.
This dataset is pretty clean. There is no missing data, and all variables are correctly coded as characters and numbers. In this research analysis, I want to focus on only lifestyle factors, so I will remove variables that are not related to lifestyle factors such as Stroke, Diabetic, GenHealth, Asthma, Kidney Disease and Skin Cancer. However, I will still keep demographic factors on top of lifestyle factors.
Make sure your data types are correct!
If the data needs to be transformed in any way (values recoded, pivoted, etc), do it here. Examples include transforming a continuous variable into a categorical using
case_when()
, etc.
#Removing other variables that are not related to lifestyle factors from the dataset to keep the dataset focus on the research question
heart_lifestyle_2020 <- heart_2020_new %>%
select(-Stroke, -Diabetic, -GenHealth, -Asthma, -KidneyDisease, -SkinCancer)
heart_lifestyle_2020 #heart_lifestyle_2020 dataset has only 12 columns now
## # A tibble: 319,795 × 12
## HeartDise…¹ BMI Smoking Alcoh…² Physi…³ Menta…⁴ DiffW…⁵ Sex AgeCa…⁶ Race
## <chr> <dbl> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr> <chr>
## 1 No 16.6 Yes No 3 30 No Fema… 55-59 White
## 2 No 20.3 No No 0 0 No Fema… 80 or … White
## 3 No 26.6 Yes No 20 30 No Male 65-69 White
## 4 No 24.2 No No 0 0 No Fema… 75-79 White
## 5 No 23.7 No No 28 0 Yes Fema… 40-44 White
## 6 Yes 28.9 Yes No 6 0 Yes Fema… 75-79 Black
## 7 No 21.6 No No 15 0 No Fema… 70-74 White
## 8 No 31.6 Yes No 5 0 Yes Fema… 80 or … White
## 9 No 26.4 No No 0 0 No Fema… 80 or … White
## 10 No 40.7 No No 0 0 Yes Male 65-69 White
## # … with 319,785 more rows, 2 more variables: PhysicalActivity <chr>,
## # SleepTime <dbl>, and abbreviated variable names ¹HeartDisease,
## # ²AlcoholDrinking, ³PhysicalHealth, ⁴MentalHealth, ⁵DiffWalking,
## # ⁶AgeCategory
#Transform BMI variable from continuous variable to categorical variable with 4 levels
heart_lifestyle_2020 <- heart_lifestyle_2020 %>%
mutate(bmi_cat = case_when(BMI <18.5 ~ "Underweight",
(BMI >=18.5 & BMI <25) ~ "Normalweight",
(BMI >= 25 & BMI <30) ~ "Overweight",
BMI >=30 ~ "Obese")) %>%
#keep BMI category in the order we wanted
mutate(bmi_cat = factor(bmi_cat, levels = c("Underweight", "Normalweight", "Overweight", "Obese")))
heart_lifestyle_2020 %>%
glimpse()
## Rows: 319,795
## Columns: 13
## $ HeartDisease <chr> "No", "No", "No", "No", "No", "Yes", "No", "No", "No"…
## $ BMI <dbl> 16.60, 20.34, 26.58, 24.21, 23.71, 28.87, 21.63, 31.6…
## $ Smoking <chr> "Yes", "No", "Yes", "No", "No", "Yes", "No", "Yes", "…
## $ AlcoholDrinking <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No",…
## $ PhysicalHealth <dbl> 3, 0, 20, 0, 28, 6, 15, 5, 0, 0, 30, 0, 0, 7, 0, 1, 5…
## $ MentalHealth <dbl> 30, 0, 30, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 30, 0, 2,…
## $ DiffWalking <chr> "No", "No", "No", "No", "Yes", "Yes", "No", "Yes", "N…
## $ Sex <chr> "Female", "Female", "Male", "Female", "Female", "Fema…
## $ AgeCategory <chr> "55-59", "80 or older", "65-69", "75-79", "40-44", "7…
## $ Race <chr> "White", "White", "White", "White", "White", "Black",…
## $ PhysicalActivity <chr> "Yes", "Yes", "Yes", "No", "Yes", "No", "Yes", "No", …
## $ SleepTime <dbl> 5, 7, 8, 6, 8, 12, 4, 9, 5, 10, 15, 5, 8, 7, 5, 6, 10…
## $ bmi_cat <fct> Underweight, Normalweight, Overweight, Normalweight, …
#Transform SleepTime variable from continuous variable to categorical variable with 3 levels
heart_lifestyle_2020 <- heart_lifestyle_2020 %>%
mutate(sleeptime_cat = case_when(SleepTime <7 ~ "Less than 7h",
(SleepTime >=7 & SleepTime <9) ~ "7-9h",
SleepTime >=9 ~ "More than 9h")) %>%
#keep sleeptime category in the order we wanted
mutate(sleeptime_cat = factor(sleeptime_cat, levels = c("Less than 7h", "7-9h", "More than 9h")))
heart_lifestyle_2020 %>%
glimpse()
## Rows: 319,795
## Columns: 14
## $ HeartDisease <chr> "No", "No", "No", "No", "No", "Yes", "No", "No", "No"…
## $ BMI <dbl> 16.60, 20.34, 26.58, 24.21, 23.71, 28.87, 21.63, 31.6…
## $ Smoking <chr> "Yes", "No", "Yes", "No", "No", "Yes", "No", "Yes", "…
## $ AlcoholDrinking <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No",…
## $ PhysicalHealth <dbl> 3, 0, 20, 0, 28, 6, 15, 5, 0, 0, 30, 0, 0, 7, 0, 1, 5…
## $ MentalHealth <dbl> 30, 0, 30, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 30, 0, 2,…
## $ DiffWalking <chr> "No", "No", "No", "No", "Yes", "Yes", "No", "Yes", "N…
## $ Sex <chr> "Female", "Female", "Male", "Female", "Female", "Fema…
## $ AgeCategory <chr> "55-59", "80 or older", "65-69", "75-79", "40-44", "7…
## $ Race <chr> "White", "White", "White", "White", "White", "Black",…
## $ PhysicalActivity <chr> "Yes", "Yes", "Yes", "No", "Yes", "No", "Yes", "No", …
## $ SleepTime <dbl> 5, 7, 8, 6, 8, 12, 4, 9, 5, 10, 15, 5, 8, 7, 5, 6, 10…
## $ bmi_cat <fct> Underweight, Normalweight, Overweight, Normalweight, …
## $ sleeptime_cat <fct> Less than 7h, 7-9h, 7-9h, Less than 7h, 7-9h, More th…
#Transform Physical Health variable from continuous variable to categorical variable with 3 levels
heart_lifestyle_2020 <- heart_lifestyle_2020 %>%
mutate(physicalhealth_cat = case_when(PhysicalHealth <10 ~ "< 10",
(PhysicalHealth >= 10 & PhysicalHealth <21) ~ "10-20",
PhysicalHealth >=21 ~ "20-30"))
heart_lifestyle_2020 %>%
glimpse()
## Rows: 319,795
## Columns: 15
## $ HeartDisease <chr> "No", "No", "No", "No", "No", "Yes", "No", "No", "N…
## $ BMI <dbl> 16.60, 20.34, 26.58, 24.21, 23.71, 28.87, 21.63, 31…
## $ Smoking <chr> "Yes", "No", "Yes", "No", "No", "Yes", "No", "Yes",…
## $ AlcoholDrinking <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No…
## $ PhysicalHealth <dbl> 3, 0, 20, 0, 28, 6, 15, 5, 0, 0, 30, 0, 0, 7, 0, 1,…
## $ MentalHealth <dbl> 30, 0, 30, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 30, 0, …
## $ DiffWalking <chr> "No", "No", "No", "No", "Yes", "Yes", "No", "Yes", …
## $ Sex <chr> "Female", "Female", "Male", "Female", "Female", "Fe…
## $ AgeCategory <chr> "55-59", "80 or older", "65-69", "75-79", "40-44", …
## $ Race <chr> "White", "White", "White", "White", "White", "Black…
## $ PhysicalActivity <chr> "Yes", "Yes", "Yes", "No", "Yes", "No", "Yes", "No"…
## $ SleepTime <dbl> 5, 7, 8, 6, 8, 12, 4, 9, 5, 10, 15, 5, 8, 7, 5, 6, …
## $ bmi_cat <fct> Underweight, Normalweight, Overweight, Normalweight…
## $ sleeptime_cat <fct> Less than 7h, 7-9h, 7-9h, Less than 7h, 7-9h, More …
## $ physicalhealth_cat <chr> "< 10", "< 10", "10-20", "< 10", "20-30", "< 10", "…
#Transform Mental Health variable from continuous variable to categorical variable with 3 levels
heart_lifestyle_2020 <- heart_lifestyle_2020 %>%
mutate(mentalhealth_cat = case_when(MentalHealth <10 ~ "< 10",
(MentalHealth >= 10 & MentalHealth <21) ~ "10-20",
MentalHealth >=21 ~ "20-30"))
heart_lifestyle_2020 %>%
glimpse()
## Rows: 319,795
## Columns: 16
## $ HeartDisease <chr> "No", "No", "No", "No", "No", "Yes", "No", "No", "N…
## $ BMI <dbl> 16.60, 20.34, 26.58, 24.21, 23.71, 28.87, 21.63, 31…
## $ Smoking <chr> "Yes", "No", "Yes", "No", "No", "Yes", "No", "Yes",…
## $ AlcoholDrinking <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No…
## $ PhysicalHealth <dbl> 3, 0, 20, 0, 28, 6, 15, 5, 0, 0, 30, 0, 0, 7, 0, 1,…
## $ MentalHealth <dbl> 30, 0, 30, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 30, 0, …
## $ DiffWalking <chr> "No", "No", "No", "No", "Yes", "Yes", "No", "Yes", …
## $ Sex <chr> "Female", "Female", "Male", "Female", "Female", "Fe…
## $ AgeCategory <chr> "55-59", "80 or older", "65-69", "75-79", "40-44", …
## $ Race <chr> "White", "White", "White", "White", "White", "Black…
## $ PhysicalActivity <chr> "Yes", "Yes", "Yes", "No", "Yes", "No", "Yes", "No"…
## $ SleepTime <dbl> 5, 7, 8, 6, 8, 12, 4, 9, 5, 10, 15, 5, 8, 7, 5, 6, …
## $ bmi_cat <fct> Underweight, Normalweight, Overweight, Normalweight…
## $ sleeptime_cat <fct> Less than 7h, 7-9h, 7-9h, Less than 7h, 7-9h, More …
## $ physicalhealth_cat <chr> "< 10", "< 10", "10-20", "< 10", "20-30", "< 10", "…
## $ mentalhealth_cat <chr> "20-30", "< 10", "20-30", "< 10", "< 10", "< 10", "…
# Collapse Age Category variable into 4 levels because we are more interested in age above 40 as this age group is more proned into developing heart disease
heart_lifestyle_2020 <- heart_lifestyle_2020 %>%
mutate(Age = fct_collapse(AgeCategory,
"below 40" = c("18-24", "25-29", "30-34", "35-39"),
"40-59" = c("40-44", "45-49", "50-54", "55-59"),
"60-79" = c("60-64", "65-69", "70-74", "75-79"),
"80 or older" = "80 or older"))
heart_lifestyle_2020 %>%
glimpse()
## Rows: 319,795
## Columns: 17
## $ HeartDisease <chr> "No", "No", "No", "No", "No", "Yes", "No", "No", "N…
## $ BMI <dbl> 16.60, 20.34, 26.58, 24.21, 23.71, 28.87, 21.63, 31…
## $ Smoking <chr> "Yes", "No", "Yes", "No", "No", "Yes", "No", "Yes",…
## $ AlcoholDrinking <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No…
## $ PhysicalHealth <dbl> 3, 0, 20, 0, 28, 6, 15, 5, 0, 0, 30, 0, 0, 7, 0, 1,…
## $ MentalHealth <dbl> 30, 0, 30, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 30, 0, …
## $ DiffWalking <chr> "No", "No", "No", "No", "Yes", "Yes", "No", "Yes", …
## $ Sex <chr> "Female", "Female", "Male", "Female", "Female", "Fe…
## $ AgeCategory <chr> "55-59", "80 or older", "65-69", "75-79", "40-44", …
## $ Race <chr> "White", "White", "White", "White", "White", "Black…
## $ PhysicalActivity <chr> "Yes", "Yes", "Yes", "No", "Yes", "No", "Yes", "No"…
## $ SleepTime <dbl> 5, 7, 8, 6, 8, 12, 4, 9, 5, 10, 15, 5, 8, 7, 5, 6, …
## $ bmi_cat <fct> Underweight, Normalweight, Overweight, Normalweight…
## $ sleeptime_cat <fct> Less than 7h, 7-9h, 7-9h, Less than 7h, 7-9h, More …
## $ physicalhealth_cat <chr> "< 10", "< 10", "10-20", "< 10", "20-30", "< 10", "…
## $ mentalhealth_cat <chr> "20-30", "< 10", "20-30", "< 10", "< 10", "< 10", "…
## $ Age <fct> 40-59, 80 or older, 60-79, 60-79, 40-59, 60-79, 60-…
#Filter the dataset to focus on those with heart disease only
heart_lifestyle_2020_yeshd <- heart_lifestyle_2020 %>%
filter (HeartDisease == "Yes")
heart_lifestyle_2020_yeshd
## # A tibble: 27,373 × 17
## HeartDise…¹ BMI Smoking Alcoh…² Physi…³ Menta…⁴ DiffW…⁵ Sex AgeCa…⁶ Race
## <chr> <dbl> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr> <chr>
## 1 Yes 28.9 Yes No 6 0 Yes Fema… 75-79 Black
## 2 Yes 34.3 Yes No 30 0 Yes Male 60-64 White
## 3 Yes 33.0 Yes No 10 0 Yes Male 75-79 White
## 4 Yes 25.1 No No 0 0 Yes Fema… 80 or … White
## 5 Yes 30.2 Yes No 6 2 Yes Fema… 75-79 White
## 6 Yes 20.4 Yes No 3 0 No Fema… 70-74 White
## 7 Yes 34.3 No No 0 0 No Fema… 55-59 White
## 8 Yes 28.3 Yes No 30 30 No Fema… 70-74 White
## 9 Yes 24.7 Yes No 30 30 Yes Male 70-74 White
## 10 Yes 33.6 No No 2 0 Yes Fema… 70-74 White
## # … with 27,363 more rows, 7 more variables: PhysicalActivity <chr>,
## # SleepTime <dbl>, bmi_cat <fct>, sleeptime_cat <fct>,
## # physicalhealth_cat <chr>, mentalhealth_cat <chr>, Age <fct>, and
## # abbreviated variable names ¹HeartDisease, ²AlcoholDrinking,
## # ³PhysicalHealth, ⁴MentalHealth, ⁵DiffWalking, ⁶AgeCategory
#Filter the dataset to focus on those without heart disease only
heart_lifestyle_2020_nohd <- heart_lifestyle_2020 %>%
filter (HeartDisease == "No")
heart_lifestyle_2020_nohd
## # A tibble: 292,422 × 17
## HeartDise…¹ BMI Smoking Alcoh…² Physi…³ Menta…⁴ DiffW…⁵ Sex AgeCa…⁶ Race
## <chr> <dbl> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr> <chr>
## 1 No 16.6 Yes No 3 30 No Fema… 55-59 White
## 2 No 20.3 No No 0 0 No Fema… 80 or … White
## 3 No 26.6 Yes No 20 30 No Male 65-69 White
## 4 No 24.2 No No 0 0 No Fema… 75-79 White
## 5 No 23.7 No No 28 0 Yes Fema… 40-44 White
## 6 No 21.6 No No 15 0 No Fema… 70-74 White
## 7 No 31.6 Yes No 5 0 Yes Fema… 80 or … White
## 8 No 26.4 No No 0 0 No Fema… 80 or … White
## 9 No 40.7 No No 0 0 Yes Male 65-69 White
## 10 No 28.7 Yes No 0 0 No Fema… 55-59 White
## # … with 292,412 more rows, 7 more variables: PhysicalActivity <chr>,
## # SleepTime <dbl>, bmi_cat <fct>, sleeptime_cat <fct>,
## # physicalhealth_cat <chr>, mentalhealth_cat <chr>, Age <fct>, and
## # abbreviated variable names ¹HeartDisease, ²AlcoholDrinking,
## # ³PhysicalHealth, ⁴MentalHealth, ⁵DiffWalking, ⁶AgeCategory
Bonus points (5 points) for datasets that require merging of tables, but only if you reason through whether you should use
left_join
,inner_join
, orright_join
on these tables. No credit will be provided if you don’t.
Not applicable to this dataset because there is no other dataset to merge with.
Show your transformed table here. Use tools such as
glimpse()
,skim()
orhead()
to illustrate your point.
heart_lifestyle_2020 %>%
glimpse()
## Rows: 319,795
## Columns: 17
## $ HeartDisease <chr> "No", "No", "No", "No", "No", "Yes", "No", "No", "N…
## $ BMI <dbl> 16.60, 20.34, 26.58, 24.21, 23.71, 28.87, 21.63, 31…
## $ Smoking <chr> "Yes", "No", "Yes", "No", "No", "Yes", "No", "Yes",…
## $ AlcoholDrinking <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No…
## $ PhysicalHealth <dbl> 3, 0, 20, 0, 28, 6, 15, 5, 0, 0, 30, 0, 0, 7, 0, 1,…
## $ MentalHealth <dbl> 30, 0, 30, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 30, 0, …
## $ DiffWalking <chr> "No", "No", "No", "No", "Yes", "Yes", "No", "Yes", …
## $ Sex <chr> "Female", "Female", "Male", "Female", "Female", "Fe…
## $ AgeCategory <chr> "55-59", "80 or older", "65-69", "75-79", "40-44", …
## $ Race <chr> "White", "White", "White", "White", "White", "Black…
## $ PhysicalActivity <chr> "Yes", "Yes", "Yes", "No", "Yes", "No", "Yes", "No"…
## $ SleepTime <dbl> 5, 7, 8, 6, 8, 12, 4, 9, 5, 10, 15, 5, 8, 7, 5, 6, …
## $ bmi_cat <fct> Underweight, Normalweight, Overweight, Normalweight…
## $ sleeptime_cat <fct> Less than 7h, 7-9h, 7-9h, Less than 7h, 7-9h, More …
## $ physicalhealth_cat <chr> "< 10", "< 10", "10-20", "< 10", "20-30", "< 10", "…
## $ mentalhealth_cat <chr> "20-30", "< 10", "20-30", "< 10", "< 10", "< 10", "…
## $ Age <fct> 40-59, 80 or older, 60-79, 60-79, 40-59, 60-79, 60-…
Are the values what you expected for the variables? Why or Why not?
This dataset is a clean dataset, so all the values are what I expected for all the variables. However, there are too many levels within the age category. So, I collapsed some age groups together as shown above.
Use
group_by()
andsummarize()
to make a summary of the data here. The summary should be relevant to your research question
#Get a summary of count and percentage group by BMI category among those with heart disease
heart_lifestyle_2020_yeshd %>%
group_by(bmi_cat) %>%
summarise(count = n()) %>%
mutate(percent = round((count/sum(count))*100, 1)) %>%
adorn_totals()
## bmi_cat count percent
## Underweight 401 1.5
## Normalweight 6309 23.0
## Overweight 9991 36.5
## Obese 10672 39.0
## Total 27373 100.0
#Get a summary of count and percentage group by BMI category among those without heart disease
heart_lifestyle_2020_nohd %>%
group_by(bmi_cat) %>%
summarise(count = n()) %>%
mutate(percent = round((count/sum(count))*100, 1)) %>%
adorn_totals()
## bmi_cat count percent
## Underweight 4709 1.6
## Normalweight 91022 31.1
## Overweight 104521 35.7
## Obese 92170 31.5
## Total 292422 99.9
#Cross tabulate BMI category by Heart Disease status
heart_lifestyle_2020 %>%
tabyl(HeartDisease, bmi_cat) %>%
adorn_totals(where = c("row", "col")) %>% adorn_percentages() %>%
adorn_pct_formatting(digits = 1) %>% adorn_ns(position = "front") %>% adorn_title()
## bmi_cat
## HeartDisease Underweight Normalweight Overweight Obese
## No 4709 (1.6%) 91022 (31.1%) 104521 (35.7%) 92170 (31.5%)
## Yes 401 (1.5%) 6309 (23.0%) 9991 (36.5%) 10672 (39.0%)
## Total 5110 (1.6%) 97331 (30.4%) 114512 (35.8%) 102842 (32.2%)
##
## Total
## 292422 (100.0%)
## 27373 (100.0%)
## 319795 (100.0%)
#Get a summary of count and percentage group by sleeptime category among those with heart disease
heart_lifestyle_2020_yeshd %>%
group_by(sleeptime_cat) %>%
summarise(count = n()) %>%
mutate(percent = round((count/sum(count))*100, 1)) %>%
adorn_totals()
## sleeptime_cat count percent
## Less than 7h 9172 33.5
## 7-9h 14683 53.6
## More than 9h 3518 12.9
## Total 27373 100.0
#Get a summary of count and percentage group by sleeptime category among those without heart disease
heart_lifestyle_2020_nohd %>%
group_by(sleeptime_cat) %>%
summarise(count = n()) %>%
mutate(percent = round((count/sum(count))*100, 1)) %>%
adorn_totals()
## sleeptime_cat count percent
## Less than 7h 87814 30.0
## 7-9h 180670 61.8
## More than 9h 23938 8.2
## Total 292422 100.0
#Cross tabulate sleeptime category by Heart Disease status
heart_lifestyle_2020 %>%
tabyl(HeartDisease, sleeptime_cat) %>%
adorn_totals(where = c("row", "col")) %>% adorn_percentages() %>%
adorn_pct_formatting(digits = 1) %>% adorn_ns(position = "front") %>% adorn_title()
## sleeptime_cat
## HeartDisease Less than 7h 7-9h More than 9h Total
## No 87814 (30.0%) 180670 (61.8%) 23938 (8.2%) 292422 (100.0%)
## Yes 9172 (33.5%) 14683 (53.6%) 3518 (12.9%) 27373 (100.0%)
## Total 96986 (30.3%) 195353 (61.1%) 27456 (8.6%) 319795 (100.0%)
#Cross tabulate Smoking status by Heart Disease status
heart_lifestyle_2020 %>%
tabyl(HeartDisease, Smoking) %>%
adorn_totals(where = c("row", "col")) %>% adorn_percentages() %>%
adorn_pct_formatting(digits = 1) %>% adorn_ns(position = "front") %>% adorn_title()
## Smoking
## HeartDisease No Yes Total
## No 176551 (60.4%) 115871 (39.6%) 292422 (100.0%)
## Yes 11336 (41.4%) 16037 (58.6%) 27373 (100.0%)
## Total 187887 (58.8%) 131908 (41.2%) 319795 (100.0%)
#Cross tabulate Alcohol drinking status by Heart Disease status
heart_lifestyle_2020 %>%
tabyl(HeartDisease, AlcoholDrinking) %>%
adorn_totals(where = c("row", "col")) %>% adorn_percentages() %>%
adorn_pct_formatting(digits = 1) %>% adorn_ns(position = "front") %>% adorn_title()
## AlcoholDrinking
## HeartDisease No Yes Total
## No 271786 (92.9%) 20636 (7.1%) 292422 (100.0%)
## Yes 26232 (95.8%) 1141 (4.2%) 27373 (100.0%)
## Total 298018 (93.2%) 21777 (6.8%) 319795 (100.0%)
#Cross tabulate physical activity status by Heart Disease status
heart_lifestyle_2020 %>%
tabyl(HeartDisease, PhysicalActivity) %>%
adorn_totals(where = c("row", "col")) %>% adorn_percentages() %>%
adorn_pct_formatting(digits = 1) %>% adorn_ns(position = "front") %>% adorn_title()
## PhysicalActivity
## HeartDisease No Yes Total
## No 61954 (21.2%) 230468 (78.8%) 292422 (100.0%)
## Yes 9884 (36.1%) 17489 (63.9%) 27373 (100.0%)
## Total 71838 (22.5%) 247957 (77.5%) 319795 (100.0%)
#Select variables that I want to include in my Table 1
summary_data <- heart_lifestyle_2020 %>% select(HeartDisease, Age, Sex, Race, bmi_cat, Smoking, AlcoholDrinking, PhysicalActivity, sleeptime_cat, physicalhealth_cat, mentalhealth_cat)
#Create a summary table for all the wanted variables into Table 1
summary_data %>%
tbl_summary(by= HeartDisease) %>%
modify_caption("**Table 1. Characteristic of Particpants by Heart Disease Status**")%>% bold_labels()
Characteristic | No, N = 292,4221 | Yes, N = 27,3731 |
---|---|---|
Age | ||
below 40 | 76,537 (26%) | 785 (2.9%) |
40-59 | 93,121 (32%) | 4,815 (18%) |
60-79 | 104,060 (36%) | 16,324 (60%) |
80 or older | 18,704 (6.4%) | 5,449 (20%) |
Sex | ||
Female | 156,571 (54%) | 11,234 (41%) |
Male | 135,851 (46%) | 16,139 (59%) |
Race | ||
American Indian/Alaskan Native | 4,660 (1.6%) | 542 (2.0%) |
Asian | 7,802 (2.7%) | 266 (1.0%) |
Black | 21,210 (7.3%) | 1,729 (6.3%) |
Hispanic | 26,003 (8.9%) | 1,443 (5.3%) |
Other | 10,042 (3.4%) | 886 (3.2%) |
White | 222,705 (76%) | 22,507 (82%) |
bmi_cat | ||
Underweight | 4,709 (1.6%) | 401 (1.5%) |
Normalweight | 91,022 (31%) | 6,309 (23%) |
Overweight | 104,521 (36%) | 9,991 (36%) |
Obese | 92,170 (32%) | 10,672 (39%) |
Smoking | 115,871 (40%) | 16,037 (59%) |
AlcoholDrinking | 20,636 (7.1%) | 1,141 (4.2%) |
PhysicalActivity | 230,468 (79%) | 17,489 (64%) |
sleeptime_cat | ||
Less than 7h | 87,814 (30%) | 9,172 (34%) |
7-9h | 180,670 (62%) | 14,683 (54%) |
More than 9h | 23,938 (8.2%) | 3,518 (13%) |
physicalhealth_cat | ||
< 10 | 260,428 (89%) | 19,224 (70%) |
10-20 | 14,874 (5.1%) | 2,928 (11%) |
20-30 | 17,120 (5.9%) | 5,221 (19%) |
mentalhealth_cat | ||
< 10 | 247,264 (85%) | 22,252 (81%) |
10-20 | 26,886 (9.2%) | 2,464 (9.0%) |
20-30 | 18,272 (6.2%) | 2,657 (9.7%) |
1 n (%) |
What are your findings about the summary? Are they what you expected?
From the summary Table 1, it is observed that participants in the age group of 60-79 have the highest risk of having heart disease. Among participants who have heart disease, 60% of them are aged between 60-79. I also observed that there are more men with heart disease compared to women. In terms of physical activity, I observed that among those participants who have heart disease, 64% are active and 36% are inactive. However, 79% are active and 21% are inactive among those who do not have heart disease. For sleep time, for those who have heart disease, 34% are sleep deprived (less than 7hr), 54% are healthy sleep (7-9hr) and 13% are oversleeping (more than 9hrs), whereas for those who do not have heart disease, 30% are sleep deprived(less than 7hr), 62% are healthy sleep (7-9hr) and 8.2% oversleeping (more than 9hr). For smoking, among those who have heart disease, 59% smoke, and 41% do not smoke, while among those who do not have heart disease, 40% smoke and 60% do not smoke. For alcohol drinking, among those who have heart disease, 96% do not drink, and 4% drink, whereas among those who do no have heart disease, 93% do not drink and 7% drink. In term of BMI, among those who have heart disease, 1.5% are underweight, 23% are normal weight and 36.5% are overweight, while among those who do not have heart disease, 1.6% are underweight, 31.1% are normal weight and 35.7% are overweight. I expected most of these, however, I was a little surprised to see such low percentages of those who drink alcohol among both group of with and without heart disease.
Make at least two plots that help you answer your question on the transformed or summarized data. Use scales and/or labels to make each plot informative.
#To answer question 1
smoke_plot <- ggplot(heart_lifestyle_2020) +
aes(x = Smoking, fill = HeartDisease) +
geom_bar() +
facet_wrap(vars(Sex)) +
labs(title = "Smoking Status by Heart Disease Status among Female and Male ",
x = "Smoking Status",
y = "Frequency", fill = "Heart Disease Status") +
theme_minimal() +
scale_fill_viridis_d()
smoke_plot
# To answer question 1
drink_plot <- ggplot(heart_lifestyle_2020) +
aes(x = AlcoholDrinking, fill = HeartDisease) +
geom_bar() +
facet_wrap(vars(Sex)) +
labs(title = "Drinking Status by Heart Disease Status among Female and Male ",
x = "Drinking Status",
y = "Frequency", fill = "Heart Disease Status") +
theme_minimal() +
scale_fill_viridis_d()
drink_plot
#To answer question 1
activity_plot <- ggplot(heart_lifestyle_2020) +
aes(x = PhysicalActivity, fill = HeartDisease) +
geom_bar(position = "stack") +
facet_wrap(vars(Sex)) +
labs(title = "Physical Activity by Heart Disease Status among Female and Male ",
x = "Physical Activity",
y = "Frequency", fill = "Heart Disease Status") +
theme_minimal() +
scale_fill_viridis_d()
activity_plot
#To answer question 1
BMI_plot <- ggplot(heart_lifestyle_2020) +
aes(x = BMI, fill = HeartDisease) +
geom_boxplot() +
facet_wrap(vars(Sex)) +
labs(title = "BMI by Heart Disease Status among Female and Male ",
x = "BMI",
y = "Frequency", fill = "Heart Disease Status") +
theme_minimal() +
scale_fill_viridis_d()
BMI_plot
BMICAT_plot <- ggplot(heart_lifestyle_2020) +
aes(x = HeartDisease, fill = bmi_cat) +
geom_bar(position = "stack") +
facet_wrap(vars(Sex)) +
labs(title = "BMI Category by Heart Disease Status among Female and Male ",
x = "Heart Disease Status",
y = "Frequency", fill = "BMI Category") +
theme_minimal() +
scale_fill_viridis_d()
BMICAT_plot
#To answer question 1
sleep_plot <- ggplot(heart_lifestyle_2020) +
aes(x = HeartDisease, fill = sleeptime_cat) +
geom_bar(position = "stack") +
facet_wrap(vars(Sex)) +
labs(title = "SleepTime Category by Heart Disease Status among Female and Male ",
x = "Heart Disease",
y = "Frequency", fill = "SleepTime Category") +
theme_minimal() +
scale_fill_viridis_d()
sleep_plot
# To answer question 2
age_plot <- ggplot(heart_lifestyle_2020) +
aes(x = HeartDisease, fill = Age) +
geom_bar() +
facet_wrap(vars(Sex)) +
labs(title = "Heart Disease by Age among Female and Male ",
x = "Heart Disease Status",
y = "Frequency", fill = "Age Category") +
theme_minimal() +
scale_fill_viridis_d()
age_plot
Summarize your research question and findings below.
The objective of this analysis is to determine which lifestyle factors are related to increasing the risk of heart disease. Overall, I observed that among those who have heart disease, there are higher percentages of inactive, sleep-deprived(less than 7hr), smoke, and overweight compared to those who do not have heart disease. However, there is a lower percentage of those who drink among those who have heart disease compared to those who do not have heart disease. Moreover,the participants in the age group of 60-79 have the highest risk of having heart disease, more in males than females.
Are your findings what you expected? Why or Why not?
These findings are what I expected because we see that among lifestyle behaviors that are considered to be unhealthy such as smoking, lack of sleep and being inactive have high percentages in having heart disease. However, the way the data collected for some variables such as smoking and alcohol drinking could be improved by collecting in more detail on how often participants smoke or drink and also collect the amount of cigarettes and alcohol being consumed.
Some of the codes (Summary Table 1 and Count & Percentages by group) in this project are adapted from Laura Jacobsen BSTA504 Midterm 2/13/2022 and BSTA 504 Winter 2023 Class materials.