Please submit your .Rmd and .html files in Sakai. If you are working together, both people should submit the files.
The goal of the midterm project is to showcase skills that you have learned in class so far. The midterm is open note, but if you use someone else’s code, you must attribute them.
.csv file into your data folder. This resource is probably the easiest to deal with.
Define a research question involving at least one categorical variable. You may schedule a time with Jessica or Brad to discuss your data set and research question, or you may message it to one of us in Slack or email. Please do one of the two options pretty early on. We just want to look at the data and make sure that it is appropriate for your question.
You must use each of the following functions at least once:
mutate()
group_by()
summarize()
ggplot()
and at least one of the following:
case_when()
across()
*_join() (e.g. left_join())
pivot_*() (e.g. pivot_longer())
function()
The code chunks below are guides; please add more code chunks to do what you need.
If you do not want your final project posted on the public website, please let Jessica know. We can also keep it anonymous if you’d like to remove your name from the Rmd and html, or use a pseudonym.
You may remove these instructions from your final Rmd if you like
If you’d like to work together in pairs, that is encouraged, but you must divide the work equitably and you must note who worked on what. This is probably easiest as notes in the text. Please let Brad or Jessica know that you’ll be working together.
No acknowledgements of contributions = -10 points overall.
I will take off points (-5 points for each section) if you don’t add observations and notes in your RMarkdown document. I want you to think and reason through your analysis, even if they are preliminary thoughts.
Define your research question below. What about the data interests you? What is a specific question you want to find out about the data?
The chocolate dataset includes several interesting variables, including the rating of chocolate, country of bean origin, company location, year of the review, percent of cocoa in the chocolate, and the most memorable characteristics of a bar of chocolate. Although I would not declare myself an avid chocolate enthusiast, dark chocolate holds a top-tier rank among the other variations in my personal ranking. My research question is: which of the most memorable characteristics occurred the most often and had the overall highest rating? I will secondarily examine the influence of cocoa percentage on chocolate ratings.
Given your question, what is your expectation about the data?
Given my personal preference, my expectation about the data is that 1) the chocolate characteristic “cocoa” is probably the most highly rated and the most frequently appearing among the most memorable characteristics, and 2) chocolate with a higher percentage of cocoa will rate higher than chocolate with lower cocoa content.
Note: The recent news about dark chocolate came out later in 2022, after the most recent reviews in this dataset (2021), so the data are not likely to be affected by it.
Load the data below and use dplyr::glimpse() or skimr::skim() on the data. You should upload the data file into the data directory.
#install package to obtain dataset
#install.packages("tidytuesdayR")
#load libraries
pacman::p_load(
tidyverse,
skimr,
here,
janitor
)
#read in dataset from the web
chocolate <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-01-18/chocolate.csv')
## Rows: 2530 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): company_manufacturer, company_location, country_of_bean_origin, spe...
## dbl (3): ref, review_date, rating
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#use skim to see the overall distribution of the variables and missingness
skim(chocolate)
Name | chocolate |
Number of rows | 2530 |
Number of columns | 10 |
_______________________ | |
Column type frequency: | |
character | 7 |
numeric | 3 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
company_manufacturer | 0 | 1.00 | 2 | 39 | 0 | 580 | 0 |
company_location | 0 | 1.00 | 4 | 21 | 0 | 67 | 0 |
country_of_bean_origin | 0 | 1.00 | 4 | 21 | 0 | 62 | 0 |
specific_bean_origin_or_bar_name | 0 | 1.00 | 3 | 51 | 0 | 1605 | 0 |
cocoa_percent | 0 | 1.00 | 3 | 6 | 0 | 46 | 0 |
ingredients | 87 | 0.97 | 4 | 14 | 0 | 21 | 0 |
most_memorable_characteristics | 0 | 1.00 | 3 | 37 | 0 | 2487 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
ref | 0 | 1 | 1429.80 | 757.65 | 5 | 802 | 1454.00 | 2079.0 | 2712 | ▆▇▇▇▇ |
review_date | 0 | 1 | 2014.37 | 3.97 | 2006 | 2012 | 2015.00 | 2018.0 | 2021 | ▃▅▇▆▅ |
rating | 0 | 1 | 3.20 | 0.45 | 1 | 3 | 3.25 | 3.5 | 4 | ▁▁▅▇▇ |
#use glimpse to double check the variable type, rows, and columns
glimpse(chocolate)
## Rows: 2,530
## Columns: 10
## $ ref <dbl> 2454, 2458, 2454, 2542, 2546, 2546, 2…
## $ company_manufacturer <chr> "5150", "5150", "5150", "5150", "5150…
## $ company_location <chr> "U.S.A.", "U.S.A.", "U.S.A.", "U.S.A.…
## $ review_date <dbl> 2019, 2019, 2019, 2021, 2021, 2021, 2…
## $ country_of_bean_origin <chr> "Tanzania", "Dominican Republic", "Ma…
## $ specific_bean_origin_or_bar_name <chr> "Kokoa Kamili, batch 1", "Zorzal, bat…
## $ cocoa_percent <chr> "76%", "76%", "76%", "68%", "72%", "8…
## $ ingredients <chr> "3- B,S,C", "3- B,S,C", "3- B,S,C", "…
## $ most_memorable_characteristics <chr> "rich cocoa, fatty, bready", "cocoa, …
## $ rating <dbl> 3.25, 3.50, 3.75, 3.00, 3.00, 3.25, 3…
If there are any quirks that you have to deal with (NA coded as something else, or the data is in multiple tables), please make some notes here about what you need to do before you start transforming the data in the next section.
Make sure your data types are correct!
Overall, there is no substantial missing data, but note that the variable “ingredients” contains missing values. There are seven character type variables and three numeric variables. The variable cocoa_percent showed up as a character variable and needed to be transformed for later categorization. The most_memorable_characteristics variable contains multiple observations per cell and needed to be separated into individual columns to increase readability.
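As a minimal sketch of the cocoa_percent conversion noted above, and assuming every value looks like "76%", readr::parse_number() can drop the percent sign and convert to numeric in one step. The toy data frame below is made up for illustration; it is not the chocolate dataset.

```r
library(readr)

# Hypothetical toy values in the same "NN%" format as cocoa_percent
toy <- data.frame(
  cocoa_percent = c("76%", "68%", "60.5%", "100%"),
  stringsAsFactors = FALSE
)

# parse_number() extracts the number and ignores the trailing "%"
toy$cocoa_percent <- parse_number(toy$cocoa_percent)

# The column is now numeric and can be used in comparisons such as <= 65
str(toy)
```

This would avoid creating an intermediate helper column for the percent sign, at the cost of being less explicit about what was removed.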
If the data needs to be transformed in any way (values recoded, pivoted, etc), do it here. Examples include transforming a continuous variable into a categorical using case_when(), etc.
#separate columns for each individual characteristic
chocolate_separated <- chocolate %>%
separate(
col = most_memorable_characteristics,
into = c("characteristics_1", "characteristics_2", "characteristics_3", "characteristics_4"),
sep = ",")
## Warning: Expected 4 pieces. Additional pieces discarded in 2 rows [5, 323].
## Warning: Expected 4 pieces. Missing pieces filled with `NA` in 2247 rows [1, 2,
## 3, 4, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 20, 21, 22, 23, ...].
#check to make sure the separation works
glimpse(chocolate_separated)
## Rows: 2,530
## Columns: 13
## $ ref <dbl> 2454, 2458, 2454, 2542, 2546, 2546, 2…
## $ company_manufacturer <chr> "5150", "5150", "5150", "5150", "5150…
## $ company_location <chr> "U.S.A.", "U.S.A.", "U.S.A.", "U.S.A.…
## $ review_date <dbl> 2019, 2019, 2019, 2021, 2021, 2021, 2…
## $ country_of_bean_origin <chr> "Tanzania", "Dominican Republic", "Ma…
## $ specific_bean_origin_or_bar_name <chr> "Kokoa Kamili, batch 1", "Zorzal, bat…
## $ cocoa_percent <chr> "76%", "76%", "76%", "68%", "72%", "8…
## $ ingredients <chr> "3- B,S,C", "3- B,S,C", "3- B,S,C", "…
## $ characteristics_1 <chr> "rich cocoa", "cocoa", "cocoa", "chew…
## $ characteristics_2 <chr> " fatty", " vegetal", " blackberry", …
## $ characteristics_3 <chr> " bready", " savory", " full body", "…
## $ characteristics_4 <chr> NA, NA, NA, NA, " nutty", NA, NA, NA,…
## $ rating <dbl> 3.25, 3.50, 3.75, 3.00, 3.00, 3.25, 3…
#each characteristic shows up in a single column
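One quirk visible in the glimpse() output above is that the second to fourth characteristic columns keep a leading space (for example " fatty"), because the split is on the comma only. A possible variation, sketched here on made-up toy data, is to split on a comma followed by optional whitespace and to fill missing pieces with NA explicitly. The toy data frame and the choice of three output columns are assumptions for illustration only.

```r
library(tidyr)

# Hypothetical toy rows in the same "a, b, c" format as most_memorable_characteristics
toy <- data.frame(
  ref = c(1, 2),
  most_memorable_characteristics = c("rich cocoa, fatty, bready", "cocoa, vegetal"),
  stringsAsFactors = FALSE
)

toy_separated <- separate(
  toy,
  col = most_memorable_characteristics,
  into = c("characteristics_1", "characteristics_2", "characteristics_3"),
  sep = ",\\s*",   # regex: a comma plus any whitespace that follows it
  fill = "right"   # rows with fewer pieces get NA on the right, without a warning
)

toy_separated
```

With trimmed values, "cocoa" in the first column and "cocoa" in a later column are the same string, which matters if counts are ever combined across columns.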
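#split the % sign off cocoa_percent into a helper column named "symbol"; note that remove_empty() only drops all-NA columns, so the empty-string "symbol" column is kept (see the glimpse output below)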
chocolate_separated <- chocolate_separated %>%
separate(
col = cocoa_percent,
into = c("cocoa_percent", "symbol"),
sep = "%") %>%
remove_empty(which = "cols")
glimpse(chocolate_separated) #% symbol is removed from cocoa_percent, but this variable is still read as a character variable. Thus, cocoa_percent needs to be converted to numeric type
## Rows: 2,530
## Columns: 14
## $ ref <dbl> 2454, 2458, 2454, 2542, 2546, 2546, 2…
## $ company_manufacturer <chr> "5150", "5150", "5150", "5150", "5150…
## $ company_location <chr> "U.S.A.", "U.S.A.", "U.S.A.", "U.S.A.…
## $ review_date <dbl> 2019, 2019, 2019, 2021, 2021, 2021, 2…
## $ country_of_bean_origin <chr> "Tanzania", "Dominican Republic", "Ma…
## $ specific_bean_origin_or_bar_name <chr> "Kokoa Kamili, batch 1", "Zorzal, bat…
## $ cocoa_percent <chr> "76", "76", "76", "68", "72", "80", "…
## $ symbol <chr> "", "", "", "", "", "", "", "", "", "…
## $ ingredients <chr> "3- B,S,C", "3- B,S,C", "3- B,S,C", "…
## $ characteristics_1 <chr> "rich cocoa", "cocoa", "cocoa", "chew…
## $ characteristics_2 <chr> " fatty", " vegetal", " blackberry", …
## $ characteristics_3 <chr> " bready", " savory", " full body", "…
## $ characteristics_4 <chr> NA, NA, NA, NA, " nutty", NA, NA, NA,…
## $ rating <dbl> 3.25, 3.50, 3.75, 3.00, 3.00, 3.25, 3…
#convert character variable type to numeric
chocolate_converted <- chocolate_separated %>%
mutate(cocoa_percent = as.numeric(cocoa_percent))
#check if conversion worked
skim(chocolate_converted)
Name | chocolate_converted |
Number of rows | 2530 |
Number of columns | 14 |
_______________________ | |
Column type frequency: | |
character | 10 |
numeric | 4 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
company_manufacturer | 0 | 1.00 | 2 | 39 | 0 | 580 | 0 |
company_location | 0 | 1.00 | 4 | 21 | 0 | 67 | 0 |
country_of_bean_origin | 0 | 1.00 | 4 | 21 | 0 | 62 | 0 |
specific_bean_origin_or_bar_name | 0 | 1.00 | 3 | 51 | 0 | 1605 | 0 |
symbol | 0 | 1.00 | 0 | 0 | 2530 | 1 | 0 |
ingredients | 87 | 0.97 | 4 | 14 | 0 | 21 | 0 |
characteristics_1 | 0 | 1.00 | 3 | 30 | 0 | 536 | 0 |
characteristics_2 | 95 | 0.96 | 0 | 26 | 1 | 580 | 0 |
characteristics_3 | 715 | 0.72 | 0 | 19 | 2 | 399 | 0 |
characteristics_4 | 2247 | 0.11 | 4 | 15 | 0 | 110 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
ref | 0 | 1 | 1429.80 | 757.65 | 5 | 802 | 1454.00 | 2079.0 | 2712 | ▆▇▇▇▇ |
review_date | 0 | 1 | 2014.37 | 3.97 | 2006 | 2012 | 2015.00 | 2018.0 | 2021 | ▃▅▇▆▅ |
cocoa_percent | 0 | 1 | 71.64 | 5.62 | 42 | 70 | 70.00 | 74.0 | 100 | ▁▁▇▁▁ |
rating | 0 | 1 | 3.20 | 0.45 | 1 | 3 | 3.25 | 3.5 | 4 | ▁▁▅▇▇ |
#make cocoa_percent categorical
chocolate_final <- chocolate_converted %>%
mutate(cocoa_cat = case_when(
cocoa_percent <= 65 ~ "<=65%",
(cocoa_percent > 65) & (cocoa_percent <= 70) ~ "66-70%",
(cocoa_percent > 70) & (cocoa_percent <= 100) ~ "71-100%",
))
#look at the final analytic dataset
skim(chocolate_final)
Name | chocolate_final |
Number of rows | 2530 |
Number of columns | 15 |
_______________________ | |
Column type frequency: | |
character | 11 |
numeric | 4 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
company_manufacturer | 0 | 1.00 | 2 | 39 | 0 | 580 | 0 |
company_location | 0 | 1.00 | 4 | 21 | 0 | 67 | 0 |
country_of_bean_origin | 0 | 1.00 | 4 | 21 | 0 | 62 | 0 |
specific_bean_origin_or_bar_name | 0 | 1.00 | 3 | 51 | 0 | 1605 | 0 |
symbol | 0 | 1.00 | 0 | 0 | 2530 | 1 | 0 |
ingredients | 87 | 0.97 | 4 | 14 | 0 | 21 | 0 |
characteristics_1 | 0 | 1.00 | 3 | 30 | 0 | 536 | 0 |
characteristics_2 | 95 | 0.96 | 0 | 26 | 1 | 580 | 0 |
characteristics_3 | 715 | 0.72 | 0 | 19 | 2 | 399 | 0 |
characteristics_4 | 2247 | 0.11 | 4 | 15 | 0 | 110 | 0 |
cocoa_cat | 0 | 1.00 | 5 | 7 | 0 | 3 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
ref | 0 | 1 | 1429.80 | 757.65 | 5 | 802 | 1454.00 | 2079.0 | 2712 | ▆▇▇▇▇ |
review_date | 0 | 1 | 2014.37 | 3.97 | 2006 | 2012 | 2015.00 | 2018.0 | 2021 | ▃▅▇▆▅ |
cocoa_percent | 0 | 1 | 71.64 | 5.62 | 42 | 70 | 70.00 | 74.0 | 100 | ▁▁▇▁▁ |
rating | 0 | 1 | 3.20 | 0.45 | 1 | 3 | 3.25 | 3.5 | 4 | ▁▁▅▇▇ |
#factorize cocoa_cat to see the counts/proportion for each category
chocolate_final %>%
mutate(cocoa_cat = factor(cocoa_cat, levels = c("<=65%", "66-70%", "71-100%"))) %>%
tabyl(cocoa_cat ) #table shows cocoa contents categories are in ascending level
## cocoa_cat n percent
## <=65% 239 0.0944664
## 66-70% 1193 0.4715415
## 71-100% 1098 0.4339921
#look at cocoa_percent
chocolate_final %>%
tabyl(cocoa_percent) #70% cocoa content occurred the most frequently
## cocoa_percent n percent
## 42.0 1 0.0003952569
## 46.0 1 0.0003952569
## 50.0 1 0.0003952569
## 53.0 1 0.0003952569
## 55.0 16 0.0063241107
## 56.0 2 0.0007905138
## 57.0 1 0.0003952569
## 58.0 8 0.0031620553
## 60.0 46 0.0181818182
## 60.5 1 0.0003952569
## 61.0 7 0.0027667984
## 62.0 16 0.0063241107
## 63.0 14 0.0055335968
## 64.0 34 0.0134387352
## 65.0 90 0.0355731225
## 66.0 28 0.0110671937
## 67.0 34 0.0134387352
## 68.0 72 0.0284584980
## 69.0 13 0.0051383399
## 70.0 1046 0.4134387352
## 71.0 43 0.0169960474
## 71.5 2 0.0007905138
## 72.0 295 0.1166007905
## 72.5 4 0.0015810277
## 73.0 66 0.0260869565
## 73.5 2 0.0007905138
## 74.0 67 0.0264822134
## 75.0 310 0.1225296443
## 76.0 35 0.0138339921
## 77.0 42 0.0166007905
## 78.0 21 0.0083003953
## 79.0 2 0.0007905138
## 80.0 89 0.0351778656
## 81.0 6 0.0023715415
## 82.0 18 0.0071146245
## 83.0 5 0.0019762846
## 84.0 4 0.0015810277
## 85.0 40 0.0158102767
## 86.0 1 0.0003952569
## 87.0 1 0.0003952569
## 88.0 8 0.0031620553
## 89.0 2 0.0007905138
## 90.0 9 0.0035573123
## 91.0 3 0.0011857708
## 99.0 2 0.0007905138
## 100.0 21 0.0083003953
*chocolate_final is the analytic dataset to be used for visualizing and summarizing the data. In this dataset, the most_memorable_characteristics variable was separated into characteristics_1, characteristics_2, characteristics_3, and characteristics_4, with one column for each characteristic. Since the variable cocoa_percent was still a character after separating the percent sign from the numerical values, cocoa_percent was converted to a numeric type, and the skim function confirmed that the conversion was successful. The cocoa percentage was then grouped into 3 categories. Because the categories were not evenly sized, the category with 65% cocoa or less had substantially fewer observations than the other two. The cutoff points were somewhat arbitrary, and the rationale behind the categorization was rather subjective. Since milk chocolate and other chocolates with less cocoa content often contain 65% or less cocoa, I used that as one of the cutoff points. While exploring the distribution of cocoa percentage, I found that 70% had a disproportionately higher count than the other values, so I used 70% as another cutoff point. I noticed that the third category (71%-100%) covers a wide range of cocoa percentages, so there may be more variation in its ratings.
*Note: Filtered data were made into several sub-datasets for visualizing and summarizing the data
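The same three cocoa categories could also be built with base R's cut(), which returns an ordered set of factor levels directly. This is only a sketch on a made-up vector; the break points mirror the cutoffs described above (65 and 70), and the labels are copied from the case_when() version.

```r
# Hypothetical cocoa percentages, including values right at the cutoffs
cocoa_percent <- c(42, 65, 65.5, 70, 70.5, 100)

cocoa_cat <- cut(
  cocoa_percent,
  breaks = c(-Inf, 65, 70, 100),            # intervals (-Inf, 65], (65, 70], (70, 100]
  labels = c("<=65%", "66-70%", "71-100%"),
  right = TRUE                               # upper bound of each interval is included
)

# Levels come out in ascending order without a separate factor() call
table(cocoa_cat)
```

Because the result is already a factor with levels in ascending order, tables and plots would keep that order without relying on alphabetical sorting.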
Bonus points (5 points) for datasets that require merging of tables, but only if you reason through whether you should use left_join, inner_join, or right_join on these tables. No credit will be provided if you don’t.
No merging of tables was required
Show your transformed table here. Use tools such as glimpse(), skim() or head() to illustrate your point.
Are the values what you expected for the variables? Why or Why not?
n/a
Use group_by() and summarize() to make a summary of the data here. The summary should be relevant to your research question.
#summarized by cocoa percentage categories above
chocolate_final %>%
group_by(cocoa_cat) %>%
summarize(
mean = mean(rating, na.rm = TRUE),
sd = sd(rating, na.rm = TRUE),
min = min(rating, na.rm = TRUE),
p25 = quantile(rating, probs = 0.25, na.rm = TRUE),
p50 = median(rating, na.rm = TRUE),
p75 = quantile(rating, probs = 0.75, na.rm = TRUE),
max = max(rating, na.rm = TRUE)
)
## # A tibble: 3 × 8
## cocoa_cat mean sd min p25 p50 p75 max
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 <=65% 3.11 0.436 1.5 2.75 3 3.5 4
## 2 66-70% 3.27 0.419 1 3 3.25 3.5 4
## 3 71-100% 3.13 0.462 1 2.75 3.25 3.5 4
#look at cocoa_percent again
chocolate_final %>%
tabyl(cocoa_percent )
## cocoa_percent n percent
## 42.0 1 0.0003952569
## 46.0 1 0.0003952569
## 50.0 1 0.0003952569
## 53.0 1 0.0003952569
## 55.0 16 0.0063241107
## 56.0 2 0.0007905138
## 57.0 1 0.0003952569
## 58.0 8 0.0031620553
## 60.0 46 0.0181818182
## 60.5 1 0.0003952569
## 61.0 7 0.0027667984
## 62.0 16 0.0063241107
## 63.0 14 0.0055335968
## 64.0 34 0.0134387352
## 65.0 90 0.0355731225
## 66.0 28 0.0110671937
## 67.0 34 0.0134387352
## 68.0 72 0.0284584980
## 69.0 13 0.0051383399
## 70.0 1046 0.4134387352
## 71.0 43 0.0169960474
## 71.5 2 0.0007905138
## 72.0 295 0.1166007905
## 72.5 4 0.0015810277
## 73.0 66 0.0260869565
## 73.5 2 0.0007905138
## 74.0 67 0.0264822134
## 75.0 310 0.1225296443
## 76.0 35 0.0138339921
## 77.0 42 0.0166007905
## 78.0 21 0.0083003953
## 79.0 2 0.0007905138
## 80.0 89 0.0351778656
## 81.0 6 0.0023715415
## 82.0 18 0.0071146245
## 83.0 5 0.0019762846
## 84.0 4 0.0015810277
## 85.0 40 0.0158102767
## 86.0 1 0.0003952569
## 87.0 1 0.0003952569
## 88.0 8 0.0031620553
## 89.0 2 0.0007905138
## 90.0 9 0.0035573123
## 91.0 3 0.0011857708
## 99.0 2 0.0007905138
## 100.0 21 0.0083003953
#look at proportion and counts for memorable characteristics and observe the top frequent characteristics (top 5%)- column 1
chocolate_final %>%
tabyl(characteristics_1) %>%
arrange(desc(n)) %>%
top_frac(.05) #n for top 5 = 83
## Selecting by percent
## characteristics_1 n percent
## creamy 163 0.064426877
## sandy 142 0.056126482
## intense 86 0.033992095
## sweet 84 0.033201581
## nutty 83 0.032806324
## fatty 78 0.030830040
## sticky 66 0.026086957
## dry 58 0.022924901
## spicy 56 0.022134387
## gritty 54 0.021343874
## oily 51 0.020158103
## roasty 51 0.020158103
## floral 49 0.019367589
## earthy 42 0.016600791
## cocoa 37 0.014624506
## molasses 36 0.014229249
## complex 30 0.011857708
## dried fruit 24 0.009486166
## rich cocoa 24 0.009486166
## smooth 24 0.009486166
## grassy 23 0.009090909
## vanilla 23 0.009090909
## coarse 20 0.007905138
## smokey 20 0.007905138
## fruity 19 0.007509881
## spice 17 0.006719368
## tart 17 0.006719368
## woody 17 0.006719368
#look at proportion and counts for memorable characteristics and observe the top frequent characteristics (top 5%)- column 2
chocolate_final %>%
tabyl(characteristics_2) %>%
arrange(desc(n)) %>%
top_frac(.05) #n for top 5 = 68
## Selecting by valid_percent
## characteristics_2 n percent valid_percent
## sweet 120 0.047430830 0.049281314
## nutty 94 0.037154150 0.038603696
## cocoa 83 0.032806324 0.034086242
## earthy 81 0.032015810 0.033264887
## roasty 68 0.026877470 0.027926078
## floral 62 0.024505929 0.025462012
## fatty 54 0.021343874 0.022176591
## sour 46 0.018181818 0.018891170
## spicy 46 0.018181818 0.018891170
## woody 45 0.017786561 0.018480493
## vanilla 40 0.015810277 0.016427105
## fruit 39 0.015415020 0.016016427
## tart 34 0.013438735 0.013963039
## intense 32 0.012648221 0.013141684
## rich 31 0.012252964 0.012731006
## molasses 28 0.011067194 0.011498973
## caramel 26 0.010276680 0.010677618
## coffee 26 0.010276680 0.010677618
## dried fruit 26 0.010276680 0.010677618
## grassy 22 0.008695652 0.009034908
## bitter 20 0.007905138 0.008213552
## honey 20 0.007905138 0.008213552
## fruity 19 0.007509881 0.007802875
## banana 18 0.007114625 0.007392197
## cherry 18 0.007114625 0.007392197
## smokey 18 0.007114625 0.007392197
## sandy 17 0.006719368 0.006981520
## tobacco 17 0.006719368 0.006981520
## melon 15 0.005928854 0.006160164
## red berry 15 0.005928854 0.006160164
## rich cocoa 15 0.005928854 0.006160164
#look at proportion and counts for memorable characteristics and observe the top frequent characteristics (top 5%)- column 3
chocolate_final %>%
tabyl(characteristics_3) %>%
arrange(desc(n)) %>%
top_frac(.05) #n for top 5 = 56
## Selecting by valid_percent
## characteristics_3 n percent valid_percent
## cocoa 111 0.043873518 0.06115702
## roasty 75 0.029644269 0.04132231
## nutty 72 0.028458498 0.03966942
## sour 67 0.026482213 0.03691460
## earthy 56 0.022134387 0.03085399
## sweet 54 0.021343874 0.02975207
## coffee 38 0.015019763 0.02093664
## spicy 34 0.013438735 0.01873278
## bitter 31 0.012252964 0.01707989
## floral 29 0.011462451 0.01597796
## spice 28 0.011067194 0.01542700
## acidic 24 0.009486166 0.01322314
## fruit 24 0.009486166 0.01322314
## fatty 23 0.009090909 0.01267218
## caramel 22 0.008695652 0.01212121
## woody 22 0.008695652 0.01212121
## brownie 21 0.008300395 0.01157025
## molasses 21 0.008300395 0.01157025
## dried fruit 20 0.007905138 0.01101928
## vanilla 19 0.007509881 0.01046832
#look at proportion and counts for memorable characteristics and observe the top frequent characteristics (top 5%)- column 4
chocolate_final %>%
tabyl(characteristics_4) %>%
arrange(desc(n)) %>%
top_frac(.05) #n for top 5 = 11
## Selecting by valid_percent
## characteristics_4 n percent valid_percent
## roasty 18 0.007114625 0.06360424
## cocoa 16 0.006324111 0.05653710
## sour 13 0.005138340 0.04593640
## nutty 11 0.004347826 0.03886926
## off 11 0.004347826 0.03886926
## rich 11 0.004347826 0.03886926
#create subset data for filtered top 5 characteristics for each column
chocolate_subset1 <- chocolate_final %>% mutate(char1_sub = fct_lump_min(characteristics_1, min=83))
chocolate_subset1 %>% tabyl(char1_sub) %>% arrange(desc(n)) #no missing values
## char1_sub n percent
## Other 1972 0.77944664
## creamy 163 0.06442688
## sandy 142 0.05612648
## intense 86 0.03399209
## sweet 84 0.03320158
## nutty 83 0.03280632
chocolate_subset2 <- chocolate_final %>% mutate(char2_sub = fct_lump_min(characteristics_2, min=68))
chocolate_subset2 %>% tabyl(char2_sub) %>% arrange(desc(n)) #note that missing values are included
## char2_sub n percent valid_percent
## Other 1989 0.78616601 0.81683778
## sweet 120 0.04743083 0.04928131
## <NA> 95 0.03754941 NA
## nutty 94 0.03715415 0.03860370
## cocoa 83 0.03280632 0.03408624
## earthy 81 0.03201581 0.03326489
## roasty 68 0.02687747 0.02792608
chocolate_subset3 <- chocolate_final %>% mutate(char3_sub = fct_lump_min(characteristics_3, min=56))
chocolate_subset3 %>% tabyl(char3_sub) %>% arrange(desc(n)) #note that missing values are included
## char3_sub n percent valid_percent
## Other 1434 0.56679842 0.79008264
## <NA> 715 0.28260870 NA
## cocoa 111 0.04387352 0.06115702
## roasty 75 0.02964427 0.04132231
## nutty 72 0.02845850 0.03966942
## sour 67 0.02648221 0.03691460
## earthy 56 0.02213439 0.03085399
chocolate_subset4 <- chocolate_final %>% mutate(char4_sub = fct_lump_min(characteristics_4, min=11))
chocolate_subset4 %>% tabyl(char4_sub) %>% arrange(desc(n)) #note that missing values are included
## char4_sub n percent valid_percent
## <NA> 2247 0.888142292 NA
## Other 203 0.080237154 0.71731449
## roasty 18 0.007114625 0.06360424
## cocoa 16 0.006324111 0.05653710
## sour 13 0.005138340 0.04593640
## nutty 11 0.004347826 0.03886926
## off 11 0.004347826 0.03886926
## rich 11 0.004347826 0.03886926
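The subsets above hard-code the minimum count for each column (83, 68, 56, 11) after reading it off the tables. As a sketch of an alternative, forcats::fct_lump_n() keeps the n most frequent levels and lumps the rest into "Other" without a hand-picked threshold. The toy vector below is made up; with the default tie handling, levels tied at the cutoff are all kept, which would be consistent with the six levels seen for column 4.

```r
library(forcats)

# Hypothetical characteristic values: creamy x5, sandy x4, sweet x3, oily x1, tart x1
characteristic <- c(rep("creamy", 5), rep("sandy", 4), rep("sweet", 3), "oily", "tart")

# Keep the 2 most frequent levels; everything else becomes "Other"
lumped <- fct_lump_n(characteristic, n = 2)

# Counts per level after lumping
lumped_counts <- table(lumped)
lumped_counts
```

Applied to the real data, this could replace each fct_lump_min() call with fct_lump_n(characteristics_1, n = 5) and so on, under the assumption that "top 5 by count" is the intended rule.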
#summarized by characteristic column 1
chocolate_subset1 %>%
group_by(char1_sub) %>%
filter(char1_sub!="Other") %>%
summarize(
mean = mean(rating, na.rm = TRUE),
sd = sd(rating, na.rm = TRUE),
min = min(rating, na.rm = TRUE),
p25 = quantile(rating, probs = 0.25, na.rm = TRUE),
p50 = median(rating, na.rm = TRUE),
p75 = quantile(rating, probs = 0.75, na.rm = TRUE),
max = max(rating, na.rm = TRUE)
)
## # A tibble: 5 × 8
## char1_sub mean sd min p25 p50 p75 max
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 creamy 3.48 0.442 1.5 3.25 3.5 3.75 4
## 2 intense 3.21 0.429 2 3 3.25 3.5 4
## 3 nutty 3.26 0.369 2.5 3 3.25 3.5 4
## 4 sandy 3.09 0.368 2 2.75 3 3.5 3.75
## 5 sweet 3.08 0.385 2 2.75 3 3.25 4
#summarized by characteristic column 2
chocolate_subset2 %>%
group_by(char2_sub) %>%
filter(char2_sub!="Other") %>%
summarize(
mean = mean(rating, na.rm = TRUE),
sd = sd(rating, na.rm = TRUE),
min = min(rating, na.rm = TRUE),
p25 = quantile(rating, probs = 0.25, na.rm = TRUE),
p50 = median(rating, na.rm = TRUE),
p75 = quantile(rating, probs = 0.75, na.rm = TRUE),
max = max(rating, na.rm = TRUE)
)
## # A tibble: 5 × 8
## char2_sub mean sd min p25 p50 p75 max
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 " cocoa" 3.36 0.462 1 3.12 3.5 3.75 4
## 2 " earthy" 3.01 0.364 2 2.75 3 3.25 3.75
## 3 " nutty" 3.32 0.388 2.5 3 3.25 3.5 4
## 4 " roasty" 3.19 0.314 2.5 3 3.25 3.5 3.75
## 5 " sweet" 3.04 0.345 2 2.75 3 3.25 4
#summarized by characteristic column 3
chocolate_subset3 %>%
group_by(char3_sub) %>%
filter(char3_sub!="Other") %>%
summarize(
mean = mean(rating, na.rm = TRUE),
sd = sd(rating, na.rm = TRUE),
min = min(rating, na.rm = TRUE),
p25 = quantile(rating, probs = 0.25, na.rm = TRUE),
p50 = median(rating, na.rm = TRUE),
p75 = quantile(rating, probs = 0.75, na.rm = TRUE),
max = max(rating, na.rm = TRUE)
)
## # A tibble: 5 × 8
## char3_sub mean sd min p25 p50 p75 max
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 " cocoa" 3.39 0.373 2 3 3.5 3.75 4
## 2 " earthy" 3.03 0.389 2 2.75 3 3.25 4
## 3 " nutty" 3.30 0.419 2.25 3 3.38 3.5 4
## 4 " roasty" 3.23 0.366 2.25 3 3.25 3.5 4
## 5 " sour" 3.03 0.333 2.5 2.75 3 3.25 3.75
#summarized by characteristic column 4
chocolate_subset4 %>%
group_by(char4_sub) %>%
filter(char4_sub!="Other") %>%
summarize(
mean = mean(rating, na.rm = TRUE),
sd = sd(rating, na.rm = TRUE),
min = min(rating, na.rm = TRUE),
p25 = quantile(rating, probs = 0.25, na.rm = TRUE),
p50 = median(rating, na.rm = TRUE),
p75 = quantile(rating, probs = 0.75, na.rm = TRUE),
max = max(rating, na.rm = TRUE)
)
## # A tibble: 6 × 8
## char4_sub mean sd min p25 p50 p75 max
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 " cocoa" 3.59 0.287 3 3.5 3.5 3.75 4
## 2 " nutty" 3.27 0.236 3 3 3.25 3.5 3.5
## 3 " off" 2.80 0.218 2.5 2.62 2.75 3 3
## 4 " rich" 3.41 0.375 2.75 3.12 3.5 3.62 4
## 5 " roasty" 3.26 0.449 2.5 3 3.25 3.75 4
## 6 " sour" 3 0.408 2.5 2.75 3 3.5 3.5
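The four summaries above repeat the same summarize() call with a different grouping column each time. A small helper function could remove that repetition. The sketch below uses a hypothetical helper name, summarize_rating(), and a made-up toy data frame; it assumes the data has a numeric column called rating, as chocolate_final does.

```r
library(dplyr)

# Hypothetical helper: summary statistics of `rating` by any grouping column
summarize_rating <- function(data, group_var) {
  data %>%
    group_by({{ group_var }}) %>%   # {{ }} passes the unquoted column name through
    summarize(
      n    = n(),
      mean = mean(rating, na.rm = TRUE),
      sd   = sd(rating, na.rm = TRUE),
      p50  = median(rating, na.rm = TRUE),
      .groups = "drop"
    )
}

# Made-up toy data to show the call
toy <- data.frame(
  grp = c("a", "a", "b", "b"),
  rating = c(3, 3.5, 2.5, 3)
)

toy_summary <- summarize_rating(toy, grp)
toy_summary
```

Including n in the output also makes small groups, such as those in the fourth characteristic column, easier to spot next to their means.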
What are your findings about the summary? Are they what you expected?
My findings indicate that the summary statistics are relatively similar across the categories of cocoa percentage. I had expected chocolates with higher and lower cocoa content to occur about equally often, but to my surprise the majority of the rated chocolates had 70% cocoa or more. The median rating for chocolate with 65% cocoa or less was 3.00, which is lower than the median of 3.25 for the other 2 categories with higher cocoa percentages. However, the means and standard deviations of the ratings do not look very different among the 3 categories. Although the sample size for each cocoa content category seems “sufficient,” and the standard deviations are similar across categories, the category with the lowest cocoa content had substantially fewer observations.
After tabulating the 4 columns of the most memorable characteristics, I found some of the memorable characteristics interesting. The top 5 characteristics in the first column were creamy, sandy, intense, sweet, and nutty. The top 5 characteristics in the second column were sweet, nutty, cocoa, earthy, and roasty. The top 5 characteristics in the third column were cocoa, roasty, nutty, sour, and earthy. The top characteristics in the fourth column were roasty, cocoa, and sour, followed by nutty, off, and rich, which tied at 11 mentions each. I find the memorable characteristics of “sour” and “off” to be particularly surprising, as I would not normally expect a piece of chocolate to taste or smell sour, and “off” sounds a little subjective to me. Nonetheless, the other characteristics, such as cocoa, roasty, sweet, intense, and nutty, sound reasonable for describing chocolate.
*Note: The counts in column 4 are substantially lower than in the other 3 columns.
Make at least two plots that help you answer your question on the transformed or summarized data. Use scales and/or labels to make each plot informative.
#Boxplots for cocoa content and chocolate ratings (categorical)
cocoa_plot <- ggplot(chocolate_final) +
aes(y = rating,
x = cocoa_cat,
fill = cocoa_cat) +
labs(title = "Chocolate Ratings and Cocoa Content",
x = "Cocoa Content",
y = "Chocolate Rating",
fill = "Cocoa Category") +
scale_fill_viridis_d(option = "A") +
geom_boxplot(alpha = 0.5) +
coord_flip() +
theme_minimal()
cocoa_plot
#Scatterplot for cocoa content and chocolate ratings (continuous)
cocoa_plot2 <- ggplot(chocolate_final) +
aes(y = rating,
x = cocoa_percent) +
labs(title = "Chocolate Ratings and Cocoa Content",
x = "Cocoa Content",
y = "Chocolate Rating") +
geom_point() +
theme_minimal()
cocoa_plot2
#Boxplots for memorable characteristics and chocolate ratings (column 1)
char_plot1 <- ggplot(chocolate_subset1 %>% filter(char1_sub!="Other")) +
aes(y = rating,
x = char1_sub) +
labs(title = "Chocolate Ratings and Memorable Characteristics 1",
x = "Characteristics",
y = "Chocolate Rating") +
geom_boxplot(alpha = 0.5) +
coord_flip() +
theme_minimal()
char_plot1
#Boxplots for memorable characteristics and chocolate ratings (column 2)
char_plot2 <- ggplot(chocolate_subset2 %>% filter(char2_sub!="Other")) +
aes(y = rating,
x = char2_sub) +
labs(title = "Chocolate Ratings and Memorable Characteristics 2",
x = "Characteristics",
y = "Chocolate Rating") +
geom_boxplot(alpha = 0.5) +
coord_flip() +
theme_minimal()
char_plot2
#Boxplots for memorable characteristics and chocolate ratings (column 3)
char_plot3 <- ggplot(chocolate_subset3 %>% filter(char3_sub!="Other")) +
aes(y = rating,
x = char3_sub) +
labs(title = "Chocolate Ratings and Memorable Characteristics 3",
x = "Characteristics",
y = "Chocolate Rating") +
geom_boxplot(alpha = 0.5) +
coord_flip() +
theme_minimal()
char_plot3
#Boxplots for memorable characteristics and chocolate ratings (column 4)
char_plot4 <- ggplot(chocolate_subset4 %>% filter(char4_sub!="Other")) +
aes(y = rating,
x = char4_sub) +
labs(title = "Chocolate Ratings and Memorable Characteristics 4",
x = "Characteristics",
y = "Chocolate Rating") +
geom_boxplot(alpha = 0.5) +
coord_flip() +
theme_minimal()
char_plot4
Summary for ggplots
The top 5 memorable characteristics of the chocolate for the first column do not suggest differences in ratings, as the median for each characteristic was within the interquartile ranges of the other categories. Similar findings were observed for the third column of memorable characteristics. In the second column, the ratings may vary between the earthy and cocoa characteristics, as the medians for both earthy and cocoa lie outside of the interquartile ranges of the other categories. Other characteristics, such as sweet, roasty, and nutty, were likely to exhibit no difference in chocolate rating. For the fourth column, the sample sizes were small and led to overall wider interquartile ranges among characteristics. The boxplots suggested differences in ratings between off and the other characteristics (sour, roasty, rich, nutty, and cocoa). The sour characteristic may also indicate differences in rating compared to the rich or cocoa characteristics. However, I probably would not make conclusions or inferences based on the results from the fourth column, since the small sample size may not be representative of the dataset. Overall, the ratings of the chocolate do not appear to be strongly affected by the memorable characteristics, but the ratings may be different between certain characteristics, such as earthy and cocoa.
Cocoa content was divided into 3 categories: 65% or less, 66-70%, and 71-100%. The distribution for 65% or less seems to be right-skewed, with potential outliers at the lower ratings. The distribution for 66-70% appears approximately normal, with potential outliers at the lower ratings. The distribution for 71-100% seems left-skewed, with potential outliers at the lower ratings. Despite the difference in distributions, the medians of the boxplots do not seem to lie outside of the interquartile ranges of the other categories, suggesting the three groups may not be different. Therefore, the rating of the chocolate is not likely to be affected by the cocoa content present.
Summarize your research question and findings below.
My first research question was to examine which memorable characteristics occurred the most and had the highest ratings. The top 5 characteristics were extracted from each characteristic column, for a total of 20 extractions. If a characteristic appeared in two or more columns, the counts were summed to obtain its total occurrence. In descending order, nutty had 260 mentions, cocoa had 210 mentions, sweet had 204 mentions, creamy had 163 mentions, roasty had 161 mentions, sandy had 142 mentions, earthy had 137 mentions, intense had 86 mentions, sour had 80 mentions, and off had 11 mentions. Hence, the top 3 memorable characteristics were nutty, cocoa, and sweet. From the boxplots, overall, the ratings of the chocolate did not appear to be strongly affected by the memorable characteristics. However, the ratings may be different between certain characteristics, such as earthy and cocoa.
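The totals above were obtained by adding up per-column counts by hand. A sketch of how the same kind of pooled count could be computed in code is shown below, using pivot_longer() to stack the characteristic columns and str_trim() to remove the leading spaces left by the split. The toy data frame is made up for illustration. Note one assumed difference from the hand-summed figures: pooling this way counts every mention in every column, not only mentions that fall in a column's top 5, so totals on the real data could come out higher.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Hypothetical toy rows shaped like chocolate_final (leading spaces kept on purpose)
toy <- data.frame(
  rating = c(3.25, 3.50, 3.75),
  characteristics_1 = c("rich cocoa", "cocoa", "nutty"),
  characteristics_2 = c(" nutty", " vegetal", " cocoa"),
  characteristics_3 = c(" bready", NA, " nutty"),
  stringsAsFactors = FALSE
)

pooled_counts <- toy %>%
  pivot_longer(
    cols = starts_with("characteristics_"),
    names_to = "position",
    values_to = "characteristic",
    values_drop_na = TRUE            # drop the NA fillers from shorter descriptions
  ) %>%
  mutate(characteristic = str_trim(characteristic)) %>%  # " nutty" -> "nutty"
  count(characteristic, sort = TRUE)

pooled_counts
```

Grouping the long table by characteristic and summarizing rating would give one pooled rating summary per characteristic instead of four per-column summaries.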
My second research question was to investigate the influence of cocoa percentage on chocolate ratings. The boxplots did not indicate a clear difference in rating among the three categories of cocoa content (65% or less, 66-70%, and 71-100%). Hence, the evidence from the dataset suggests that chocolate ratings are not substantially affected by the cocoa content present in the chocolate.
Are your findings what you expected? Why or Why not?
The findings for the first question partially aligned with my expectation: cocoa was one of the most mentioned characteristics, although nutty was mentioned more often. In addition, most of the top memorable characteristics, such as nutty, cocoa, sweet, creamy, roasty, and intense, sounded like adjectives that I would associate with a piece of chocolate. However, I do find the “off” characteristic unexpected because I would not typically describe a bar of chocolate as “off,” and “off” may be interpreted differently by each person as well. Overall, the findings were close to what I expected.
Initially, I had a biased expectation that higher ratings would be related to higher cocoa content in chocolate. However, that was not aligned with the findings, which showed no evidence of differences in chocolate ratings with increasing cocoa content. In hindsight, this finding is reasonable, since not everyone favors chocolates with high cocoa content, and some may prefer milk chocolate.