Midterm (Due Sunday 2/19/2023 at 11:55 pm)

Please submit your .Rmd and .html files in Sakai. If you are working together, both people should submit the files.

60 / 60 points total

The goal of the midterm project is to showcase skills that you have learned in class so far. The midterm is open note, but if you use someone else’s code, you must attribute them.

Instructions: Before you get Started

  1. Pick a dataset. Ideally, the dataset should be around 2000 rows, and should have both categorical and numeric covariates. Choose a dataset with more than 4 columns/variables.
  • Potential Sources for data:
  • Note that most of the Tidy Tuesday data are .csv files. There is code to load the files from csv for each of the datasets and a short description of the variables, or you can upload the .csv file into your data folder. This resource is probably the easiest to deal with.
  • You may use another dataset or your own data, but please make sure it is de-identified and has enough rows/variables.
  1. Define a research question involving at least one categorical variable. You may schedule a time with Jessica or Brad to discuss your dataset and research question, or you may message it to one of us on Slack or by email. Please do one of the two fairly early on. We just want to look at the data and make sure that it is appropriate for your question.

  2. You must use each of the following functions at least once:

  • mutate()
  • group_by()
  • summarize()
  • ggplot()

and at least one of the following:

  • case_when()
  • across()
  • *_join() (e.g. left_join())
  • pivot_*() (e.g. pivot_longer())
  • function()
  1. The code chunks below are guides; please add more code chunks to do what you need.

  2. If you do not want your final project posted on the public website, please let Jessica know. We can also keep it anonymous if you’d like to remove your name from the Rmd and html, or use a pseudonym.

You may remove these instructions from your final Rmd if you like

Working Together

If you’d like to work together in pairs, that is encouraged, but you must divide the work equitably and you must note who worked on what. This is probably easiest as notes in the text. Please let Brad or Jessica know that you’ll be working together.

No acknowledgements of contributions = -10 points overall.

Please Note

I will take off points (-5 points for each section) if you don’t add observations and notes in your RMarkdown document. I want you to think and reason through your analysis, even if they are preliminary thoughts.

Define Your Research Question (10 points)

Define your research question below. What about the data interests you? What is a specific question you want to find out about the data?

The chocolate dataset includes several interesting variables: the rating of each chocolate bar, the country of bean origin, the company location, the year of the review, the percentage of cocoa, and the most memorable characteristics of the bar. Although I would not call myself an avid chocolate enthusiast, dark chocolate ranks near the top of my personal list. My research question is: which of the most memorable characteristics occurred most often, and which had the highest overall rating? As a secondary question, I will examine the influence of cocoa percentage on chocolate ratings.

Given your question, what is your expectation about the data?

Given my personal preference, I expect that 1) the characteristic “cocoa” is both the most frequently appearing and the most highly rated of the most memorable characteristics, and 2) chocolate with a higher percentage of cocoa will be rated higher than chocolate with a lower cocoa content.

Note: The recent news about dark chocolate came out later in 2022, after the most recent reviews in this dataset (2021), so these ratings are unlikely to have been affected by it.

Loading the Data (10 points)

Load the data below and use dplyr::glimpse() or skimr::skim() on the data. You should upload the data file into the data directory.

#install package to obtain dataset
#install.packages("tidytuesdayR")

#load libraries
pacman::p_load(
  tidyverse,  
  skimr,
  here,       
  janitor       
  )

#read in dataset from the web
chocolate <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-01-18/chocolate.csv')
## Rows: 2530 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): company_manufacturer, company_location, country_of_bean_origin, spe...
## dbl (3): ref, review_date, rating
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#use skim to see the overall distribution of the variables and missingness
skim(chocolate) 
Data summary
Name chocolate
Number of rows 2530
Number of columns 10
_______________________
Column type frequency:
character 7
numeric 3
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
company_manufacturer 0 1.00 2 39 0 580 0
company_location 0 1.00 4 21 0 67 0
country_of_bean_origin 0 1.00 4 21 0 62 0
specific_bean_origin_or_bar_name 0 1.00 3 51 0 1605 0
cocoa_percent 0 1.00 3 6 0 46 0
ingredients 87 0.97 4 14 0 21 0
most_memorable_characteristics 0 1.00 3 37 0 2487 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
ref 0 1 1429.80 757.65 5 802 1454.00 2079.0 2712 ▆▇▇▇▇
review_date 0 1 2014.37 3.97 2006 2012 2015.00 2018.0 2021 ▃▅▇▆▅
rating 0 1 3.20 0.45 1 3 3.25 3.5 4 ▁▁▅▇▇
#use glimpse to double check the variable type, rows, and columns
glimpse(chocolate)
## Rows: 2,530
## Columns: 10
## $ ref                              <dbl> 2454, 2458, 2454, 2542, 2546, 2546, 2…
## $ company_manufacturer             <chr> "5150", "5150", "5150", "5150", "5150…
## $ company_location                 <chr> "U.S.A.", "U.S.A.", "U.S.A.", "U.S.A.…
## $ review_date                      <dbl> 2019, 2019, 2019, 2021, 2021, 2021, 2…
## $ country_of_bean_origin           <chr> "Tanzania", "Dominican Republic", "Ma…
## $ specific_bean_origin_or_bar_name <chr> "Kokoa Kamili, batch 1", "Zorzal, bat…
## $ cocoa_percent                    <chr> "76%", "76%", "76%", "68%", "72%", "8…
## $ ingredients                      <chr> "3- B,S,C", "3- B,S,C", "3- B,S,C", "…
## $ most_memorable_characteristics   <chr> "rich cocoa, fatty, bready", "cocoa, …
## $ rating                           <dbl> 3.25, 3.50, 3.75, 3.00, 3.00, 3.25, 3…

If there are any quirks that you have to deal with (for example, NA coded as something else, or data spread across multiple tables), please make some notes here about what you need to do before you start transforming the data in the next section.

Make sure your data types are correct!

Overall, there is no substantial missing data, although the variable “ingredients” has 87 missing values (about 3% of rows). There are seven character variables and three numeric variables. The variable cocoa_percent was read in as a character variable (for example “76%”) and needs to be converted to numeric before it can be categorized. The most_memorable_characteristics variable holds several comma-separated descriptors per cell and needs to be separated into individual columns to make it easier to read and tabulate.
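A minimal sketch of these two clean-up steps is shown here, using a small made-up tibble (`toy`, hypothetical) rather than the real data. It assumes `readr::parse_number()` for stripping the percent sign and a comma-plus-whitespace separator for splitting. This is one possible approach and not necessarily the code used in the next section.

```r
library(dplyr)
library(tidyr)
library(readr)

# Hypothetical toy rows that mimic the two quirky columns in `chocolate`
toy <- tibble(
  cocoa_percent = c("76%", "68%", "72.5%"),
  most_memorable_characteristics = c("rich cocoa, fatty, bready",
                                     "cocoa, vegetal",
                                     "creamy")
)

toy_clean <- toy %>%
  # parse_number() drops the "%" and returns a double in one step
  mutate(cocoa_percent = parse_number(cocoa_percent)) %>%
  # splitting on a comma plus optional spaces avoids values such as " fatty"
  separate(
    col = most_memorable_characteristics,
    into = c("characteristics_1", "characteristics_2", "characteristics_3"),
    sep = ",\\s*",
    fill = "right"   # pad shorter lists with NA on the right, without a warning
  )

toy_clean
```

Splitting on the comma together with any following spaces means that later tabulations treat “cocoa” and “ cocoa” as the same descriptor.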

Transforming the data (15 points)

If the data needs to be transformed in any way (values recoded, pivoted, etc), do it here. Examples include transforming a continuous variable into a categorical using case_when(), etc.

#separate columns for each individual characteristic
chocolate_separated <- chocolate %>%
  separate(
    col = most_memorable_characteristics,
    into = c("characteristics_1", "characteristics_2", "characteristics_3", "characteristics_4"),
    sep = ",")
## Warning: Expected 4 pieces. Additional pieces discarded in 2 rows [5, 323].
## Warning: Expected 4 pieces. Missing pieces filled with `NA` in 2247 rows [1, 2,
## 3, 4, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 20, 21, 22, 23, ...].
#check to make sure the separation works 
glimpse(chocolate_separated) 
## Rows: 2,530
## Columns: 13
## $ ref                              <dbl> 2454, 2458, 2454, 2542, 2546, 2546, 2…
## $ company_manufacturer             <chr> "5150", "5150", "5150", "5150", "5150…
## $ company_location                 <chr> "U.S.A.", "U.S.A.", "U.S.A.", "U.S.A.…
## $ review_date                      <dbl> 2019, 2019, 2019, 2021, 2021, 2021, 2…
## $ country_of_bean_origin           <chr> "Tanzania", "Dominican Republic", "Ma…
## $ specific_bean_origin_or_bar_name <chr> "Kokoa Kamili, batch 1", "Zorzal, bat…
## $ cocoa_percent                    <chr> "76%", "76%", "76%", "68%", "72%", "8…
## $ ingredients                      <chr> "3- B,S,C", "3- B,S,C", "3- B,S,C", "…
## $ characteristics_1                <chr> "rich cocoa", "cocoa", "cocoa", "chew…
## $ characteristics_2                <chr> " fatty", " vegetal", " blackberry", …
## $ characteristics_3                <chr> " bready", " savory", " full body", "…
## $ characteristics_4                <chr> NA, NA, NA, NA, " nutty", NA, NA, NA,…
## $ rating                           <dbl> 3.25, 3.50, 3.75, 3.00, 3.00, 3.25, 3…
#each characteristic shows up in a single column

#split the percent sign off cocoa_percent into a helper column named "symbol"
#note: remove_empty() only drops columns that are entirely NA; "symbol" holds empty strings, so it is kept
chocolate_separated <- chocolate_separated %>%
  separate(
    col = cocoa_percent,
    into = c("cocoa_percent", "symbol"),
    sep = "%") %>%
  remove_empty(which = "cols")

glimpse(chocolate_separated) #the % symbol is removed from cocoa_percent, but the variable is still a character, so it needs to be converted to numeric
## Rows: 2,530
## Columns: 14
## $ ref                              <dbl> 2454, 2458, 2454, 2542, 2546, 2546, 2…
## $ company_manufacturer             <chr> "5150", "5150", "5150", "5150", "5150…
## $ company_location                 <chr> "U.S.A.", "U.S.A.", "U.S.A.", "U.S.A.…
## $ review_date                      <dbl> 2019, 2019, 2019, 2021, 2021, 2021, 2…
## $ country_of_bean_origin           <chr> "Tanzania", "Dominican Republic", "Ma…
## $ specific_bean_origin_or_bar_name <chr> "Kokoa Kamili, batch 1", "Zorzal, bat…
## $ cocoa_percent                    <chr> "76", "76", "76", "68", "72", "80", "…
## $ symbol                           <chr> "", "", "", "", "", "", "", "", "", "…
## $ ingredients                      <chr> "3- B,S,C", "3- B,S,C", "3- B,S,C", "…
## $ characteristics_1                <chr> "rich cocoa", "cocoa", "cocoa", "chew…
## $ characteristics_2                <chr> " fatty", " vegetal", " blackberry", …
## $ characteristics_3                <chr> " bready", " savory", " full body", "…
## $ characteristics_4                <chr> NA, NA, NA, NA, " nutty", NA, NA, NA,…
## $ rating                           <dbl> 3.25, 3.50, 3.75, 3.00, 3.00, 3.25, 3…
#convert character variable type to numeric
chocolate_converted <- chocolate_separated %>%
  mutate(cocoa_percent = as.numeric(cocoa_percent))

#check if conversion worked
skim(chocolate_converted)
Data summary
Name chocolate_converted
Number of rows 2530
Number of columns 14
_______________________
Column type frequency:
character 10
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
company_manufacturer 0 1.00 2 39 0 580 0
company_location 0 1.00 4 21 0 67 0
country_of_bean_origin 0 1.00 4 21 0 62 0
specific_bean_origin_or_bar_name 0 1.00 3 51 0 1605 0
symbol 0 1.00 0 0 2530 1 0
ingredients 87 0.97 4 14 0 21 0
characteristics_1 0 1.00 3 30 0 536 0
characteristics_2 95 0.96 0 26 1 580 0
characteristics_3 715 0.72 0 19 2 399 0
characteristics_4 2247 0.11 4 15 0 110 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
ref 0 1 1429.80 757.65 5 802 1454.00 2079.0 2712 ▆▇▇▇▇
review_date 0 1 2014.37 3.97 2006 2012 2015.00 2018.0 2021 ▃▅▇▆▅
cocoa_percent 0 1 71.64 5.62 42 70 70.00 74.0 100 ▁▁▇▁▁
rating 0 1 3.20 0.45 1 3 3.25 3.5 4 ▁▁▅▇▇
#make cocoa_percent categorical
chocolate_final <- chocolate_converted %>% 
  mutate(cocoa_cat = case_when(
   cocoa_percent <= 65 ~ "<=65%",
   (cocoa_percent > 65) & (cocoa_percent <= 70) ~ "66-70%",
   (cocoa_percent > 70) & (cocoa_percent <= 100) ~ "71-100%",
   )) 

#look at the final analytic dataset 
skim(chocolate_final)
Data summary
Name chocolate_final
Number of rows 2530
Number of columns 15
_______________________
Column type frequency:
character 11
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
company_manufacturer 0 1.00 2 39 0 580 0
company_location 0 1.00 4 21 0 67 0
country_of_bean_origin 0 1.00 4 21 0 62 0
specific_bean_origin_or_bar_name 0 1.00 3 51 0 1605 0
symbol 0 1.00 0 0 2530 1 0
ingredients 87 0.97 4 14 0 21 0
characteristics_1 0 1.00 3 30 0 536 0
characteristics_2 95 0.96 0 26 1 580 0
characteristics_3 715 0.72 0 19 2 399 0
characteristics_4 2247 0.11 4 15 0 110 0
cocoa_cat 0 1.00 5 7 0 3 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
ref 0 1 1429.80 757.65 5 802 1454.00 2079.0 2712 ▆▇▇▇▇
review_date 0 1 2014.37 3.97 2006 2012 2015.00 2018.0 2021 ▃▅▇▆▅
cocoa_percent 0 1 71.64 5.62 42 70 70.00 74.0 100 ▁▁▇▁▁
rating 0 1 3.20 0.45 1 3 3.25 3.5 4 ▁▁▅▇▇
#convert cocoa_cat to a factor (for this table only) to see the counts/proportions for each category
chocolate_final %>%
  mutate(cocoa_cat = factor(cocoa_cat, levels = c("<=65%", "66-70%", "71-100%"))) %>%
  tabyl(cocoa_cat) #table shows the cocoa content categories in ascending order
##  cocoa_cat    n   percent
##      <=65%  239 0.0944664
##     66-70% 1193 0.4715415
##    71-100% 1098 0.4339921
#look at cocoa_percent
chocolate_final %>%
  tabyl(cocoa_percent) #70% cocoa content occurred the most frequently
##  cocoa_percent    n      percent
##           42.0    1 0.0003952569
##           46.0    1 0.0003952569
##           50.0    1 0.0003952569
##           53.0    1 0.0003952569
##           55.0   16 0.0063241107
##           56.0    2 0.0007905138
##           57.0    1 0.0003952569
##           58.0    8 0.0031620553
##           60.0   46 0.0181818182
##           60.5    1 0.0003952569
##           61.0    7 0.0027667984
##           62.0   16 0.0063241107
##           63.0   14 0.0055335968
##           64.0   34 0.0134387352
##           65.0   90 0.0355731225
##           66.0   28 0.0110671937
##           67.0   34 0.0134387352
##           68.0   72 0.0284584980
##           69.0   13 0.0051383399
##           70.0 1046 0.4134387352
##           71.0   43 0.0169960474
##           71.5    2 0.0007905138
##           72.0  295 0.1166007905
##           72.5    4 0.0015810277
##           73.0   66 0.0260869565
##           73.5    2 0.0007905138
##           74.0   67 0.0264822134
##           75.0  310 0.1225296443
##           76.0   35 0.0138339921
##           77.0   42 0.0166007905
##           78.0   21 0.0083003953
##           79.0    2 0.0007905138
##           80.0   89 0.0351778656
##           81.0    6 0.0023715415
##           82.0   18 0.0071146245
##           83.0    5 0.0019762846
##           84.0    4 0.0015810277
##           85.0   40 0.0158102767
##           86.0    1 0.0003952569
##           87.0    1 0.0003952569
##           88.0    8 0.0031620553
##           89.0    2 0.0007905138
##           90.0    9 0.0035573123
##           91.0    3 0.0011857708
##           99.0    2 0.0007905138
##          100.0   21 0.0083003953

*chocolate_final is the analytic dataset used for visualizing and summarizing the data. In this dataset, the most_memorable_characteristics variable was separated into characteristics_1, characteristics_2, characteristics_3, and characteristics_4, with one column per characteristic; as the warning shows, two rows listed more than four characteristics and the extra pieces were discarded. Because cocoa_percent was still a character variable after the percent sign was split off, it was converted to numeric, and skim() confirmed that the conversion worked. Cocoa percentage was then grouped into 3 categories. The categories are uneven: the 65%-or-less category has substantially fewer observations (239) than the other two (1193 and 1098). The cutoff points are somewhat arbitrary and the rationale is subjective. Milk chocolate and other chocolates with less cocoa often contain 65% cocoa or less, so I used 65% as one cutoff. While exploring the distribution, I saw that 70% had a disproportionately high count (1046 of 2530 bars), so I used 70% as the other cutoff. The third category (71-100%) spans a wide range of cocoa percentages, so ratings within it may vary more.
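For comparison, the same three bins can be sketched with base R's cut(), which returns a factor whose levels are already in ascending order. The vector `pct` is made up for illustration, and the cutoffs (65 and 70) are the ones described above.

```r
# Hypothetical cocoa percentages chosen to sit on and around the cutoffs
pct <- c(42, 65, 65.5, 70, 71, 100)

# right = TRUE makes each bin include its upper bound:
# (-Inf, 65], (65, 70], (70, 100]
cocoa_cat <- cut(
  pct,
  breaks = c(-Inf, 65, 70, 100),
  labels = c("<=65%", "66-70%", "71-100%"),
  right = TRUE
)

table(cocoa_cat)
```

Values exactly at 65 or 70 fall into the lower bin, which matches the `<=` conditions in the case_when() call.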

*Note: Filtered data were made into several sub-datasets for visualizing and summarizing the data
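These sub-datasets keep the most frequent characteristics and group the rest together. A minimal sketch of that idea on a made-up vector (`flavors`, hypothetical) is given here. It assumes forcats::fct_lump_n(), which keeps the n most frequent levels, as an alternative to looking up a count threshold by hand.

```r
library(forcats)

# Hypothetical descriptors: "creamy" appears 3 times, "sandy" twice, the rest once
flavors <- c("creamy", "creamy", "creamy", "sandy", "sandy", "sweet", "nutty")

# Keep the 2 most frequent levels and collapse everything else into "Other"
lumped <- fct_lump_n(flavors, n = 2)

table(lumped)
```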

Bonus points (5 points) for datasets that require merging of tables, but only if you reason through whether you should use left_join, inner_join, or right_join on these tables. No credit will be provided if you don’t.

No merging of tables was required

Show your transformed table here. Use tools such as glimpse(), skim() or head() to illustrate your point.

Are the values what you expected for the variables? Why or Why not?

n/a

Visualizing and Summarizing the Data (15 points)

Use group_by() and summarize() to make a summary of the data here. The summary should be relevant to your research question

#summarized by cocoa percentage categories above
chocolate_final  %>% 
  group_by(cocoa_cat) %>%
  summarize(
            mean = mean(rating, na.rm = TRUE),
            sd = sd(rating, na.rm = TRUE),
            min = min(rating, na.rm = TRUE),
            p25 = quantile(rating, probs = 0.25, na.rm = TRUE),
            p50 = median(rating, na.rm = TRUE),
            p75 = quantile(rating, probs = 0.75, na.rm = TRUE),
            max = max(rating, na.rm = TRUE)
            ) 
## # A tibble: 3 × 8
##   cocoa_cat  mean    sd   min   p25   p50   p75   max
##   <chr>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 <=65%      3.11 0.436   1.5  2.75  3      3.5     4
## 2 66-70%     3.27 0.419   1    3     3.25   3.5     4
## 3 71-100%    3.13 0.462   1    2.75  3.25   3.5     4
#look at cocoa_percent again
chocolate_final %>%
  tabyl(cocoa_percent ) 
##  cocoa_percent    n      percent
##           42.0    1 0.0003952569
##           46.0    1 0.0003952569
##           50.0    1 0.0003952569
##           53.0    1 0.0003952569
##           55.0   16 0.0063241107
##           56.0    2 0.0007905138
##           57.0    1 0.0003952569
##           58.0    8 0.0031620553
##           60.0   46 0.0181818182
##           60.5    1 0.0003952569
##           61.0    7 0.0027667984
##           62.0   16 0.0063241107
##           63.0   14 0.0055335968
##           64.0   34 0.0134387352
##           65.0   90 0.0355731225
##           66.0   28 0.0110671937
##           67.0   34 0.0134387352
##           68.0   72 0.0284584980
##           69.0   13 0.0051383399
##           70.0 1046 0.4134387352
##           71.0   43 0.0169960474
##           71.5    2 0.0007905138
##           72.0  295 0.1166007905
##           72.5    4 0.0015810277
##           73.0   66 0.0260869565
##           73.5    2 0.0007905138
##           74.0   67 0.0264822134
##           75.0  310 0.1225296443
##           76.0   35 0.0138339921
##           77.0   42 0.0166007905
##           78.0   21 0.0083003953
##           79.0    2 0.0007905138
##           80.0   89 0.0351778656
##           81.0    6 0.0023715415
##           82.0   18 0.0071146245
##           83.0    5 0.0019762846
##           84.0    4 0.0015810277
##           85.0   40 0.0158102767
##           86.0    1 0.0003952569
##           87.0    1 0.0003952569
##           88.0    8 0.0031620553
##           89.0    2 0.0007905138
##           90.0    9 0.0035573123
##           91.0    3 0.0011857708
##           99.0    2 0.0007905138
##          100.0   21 0.0083003953
#tabulate counts and proportions of the memorable characteristics and keep the most frequent ones (top 5%) - column 1
chocolate_final %>%
  tabyl(characteristics_1) %>%  
  arrange(desc(n)) %>% 
  top_frac(.05) #the 5th most frequent characteristic has n = 83
## Selecting by percent
##  characteristics_1   n     percent
##             creamy 163 0.064426877
##              sandy 142 0.056126482
##            intense  86 0.033992095
##              sweet  84 0.033201581
##              nutty  83 0.032806324
##              fatty  78 0.030830040
##             sticky  66 0.026086957
##                dry  58 0.022924901
##              spicy  56 0.022134387
##             gritty  54 0.021343874
##               oily  51 0.020158103
##             roasty  51 0.020158103
##             floral  49 0.019367589
##             earthy  42 0.016600791
##              cocoa  37 0.014624506
##           molasses  36 0.014229249
##            complex  30 0.011857708
##        dried fruit  24 0.009486166
##         rich cocoa  24 0.009486166
##             smooth  24 0.009486166
##             grassy  23 0.009090909
##            vanilla  23 0.009090909
##             coarse  20 0.007905138
##             smokey  20 0.007905138
##             fruity  19 0.007509881
##              spice  17 0.006719368
##               tart  17 0.006719368
##              woody  17 0.006719368
#tabulate counts and proportions of the memorable characteristics and keep the most frequent ones (top 5%) - column 2
chocolate_final %>%
  tabyl(characteristics_2) %>%  
  arrange(desc(n)) %>% 
  top_frac(.05) #the 5th most frequent characteristic has n = 68
## Selecting by valid_percent
##  characteristics_2   n     percent valid_percent
##              sweet 120 0.047430830   0.049281314
##              nutty  94 0.037154150   0.038603696
##              cocoa  83 0.032806324   0.034086242
##             earthy  81 0.032015810   0.033264887
##             roasty  68 0.026877470   0.027926078
##             floral  62 0.024505929   0.025462012
##              fatty  54 0.021343874   0.022176591
##               sour  46 0.018181818   0.018891170
##              spicy  46 0.018181818   0.018891170
##              woody  45 0.017786561   0.018480493
##            vanilla  40 0.015810277   0.016427105
##              fruit  39 0.015415020   0.016016427
##               tart  34 0.013438735   0.013963039
##            intense  32 0.012648221   0.013141684
##               rich  31 0.012252964   0.012731006
##           molasses  28 0.011067194   0.011498973
##            caramel  26 0.010276680   0.010677618
##             coffee  26 0.010276680   0.010677618
##        dried fruit  26 0.010276680   0.010677618
##             grassy  22 0.008695652   0.009034908
##             bitter  20 0.007905138   0.008213552
##              honey  20 0.007905138   0.008213552
##             fruity  19 0.007509881   0.007802875
##             banana  18 0.007114625   0.007392197
##             cherry  18 0.007114625   0.007392197
##             smokey  18 0.007114625   0.007392197
##              sandy  17 0.006719368   0.006981520
##            tobacco  17 0.006719368   0.006981520
##              melon  15 0.005928854   0.006160164
##          red berry  15 0.005928854   0.006160164
##         rich cocoa  15 0.005928854   0.006160164
#tabulate counts and proportions of the memorable characteristics and keep the most frequent ones (top 5%) - column 3
chocolate_final %>%
  tabyl(characteristics_3) %>%  
  arrange(desc(n)) %>% 
  top_frac(.05) #the 5th most frequent characteristic has n = 56
## Selecting by valid_percent
##  characteristics_3   n     percent valid_percent
##              cocoa 111 0.043873518    0.06115702
##             roasty  75 0.029644269    0.04132231
##              nutty  72 0.028458498    0.03966942
##               sour  67 0.026482213    0.03691460
##             earthy  56 0.022134387    0.03085399
##              sweet  54 0.021343874    0.02975207
##             coffee  38 0.015019763    0.02093664
##              spicy  34 0.013438735    0.01873278
##             bitter  31 0.012252964    0.01707989
##             floral  29 0.011462451    0.01597796
##              spice  28 0.011067194    0.01542700
##             acidic  24 0.009486166    0.01322314
##              fruit  24 0.009486166    0.01322314
##              fatty  23 0.009090909    0.01267218
##            caramel  22 0.008695652    0.01212121
##              woody  22 0.008695652    0.01212121
##            brownie  21 0.008300395    0.01157025
##           molasses  21 0.008300395    0.01157025
##        dried fruit  20 0.007905138    0.01101928
##            vanilla  19 0.007509881    0.01046832
#tabulate counts and proportions of the memorable characteristics and keep the most frequent ones (top 5%) - column 4
chocolate_final %>%
  tabyl(characteristics_4) %>%  
  arrange(desc(n)) %>% 
  top_frac(.05) #the 5th most frequent characteristic has n = 11 (three-way tie)
## Selecting by valid_percent
##  characteristics_4  n     percent valid_percent
##             roasty 18 0.007114625    0.06360424
##              cocoa 16 0.006324111    0.05653710
##               sour 13 0.005138340    0.04593640
##              nutty 11 0.004347826    0.03886926
##                off 11 0.004347826    0.03886926
##               rich 11 0.004347826    0.03886926
#for each column, keep the top 5 characteristics and lump everything below that count into "Other"
chocolate_subset1 <- chocolate_final %>% mutate(char1_sub = fct_lump_min(characteristics_1, min=83))
chocolate_subset1 %>% tabyl(char1_sub) %>% arrange(desc(n)) #no missing values
##  char1_sub    n    percent
##      Other 1972 0.77944664
##     creamy  163 0.06442688
##      sandy  142 0.05612648
##    intense   86 0.03399209
##      sweet   84 0.03320158
##      nutty   83 0.03280632
chocolate_subset2 <- chocolate_final %>% mutate(char2_sub = fct_lump_min(characteristics_2, min=68))
chocolate_subset2 %>% tabyl(char2_sub) %>% arrange(desc(n)) #note that missing values are included
##  char2_sub    n    percent valid_percent
##      Other 1989 0.78616601    0.81683778
##      sweet  120 0.04743083    0.04928131
##       <NA>   95 0.03754941            NA
##      nutty   94 0.03715415    0.03860370
##      cocoa   83 0.03280632    0.03408624
##     earthy   81 0.03201581    0.03326489
##     roasty   68 0.02687747    0.02792608
chocolate_subset3 <- chocolate_final %>% mutate(char3_sub = fct_lump_min(characteristics_3, min=56))
chocolate_subset3 %>% tabyl(char3_sub) %>% arrange(desc(n)) #note that missing values are included
##  char3_sub    n    percent valid_percent
##      Other 1434 0.56679842    0.79008264
##       <NA>  715 0.28260870            NA
##      cocoa  111 0.04387352    0.06115702
##     roasty   75 0.02964427    0.04132231
##      nutty   72 0.02845850    0.03966942
##       sour   67 0.02648221    0.03691460
##     earthy   56 0.02213439    0.03085399
chocolate_subset4 <- chocolate_final %>% mutate(char4_sub = fct_lump_min(characteristics_4, min=11))
chocolate_subset4 %>% tabyl(char4_sub) %>% arrange(desc(n)) #note that missing values are included
##  char4_sub    n     percent valid_percent
##       <NA> 2247 0.888142292            NA
##      Other  203 0.080237154    0.71731449
##     roasty   18 0.007114625    0.06360424
##      cocoa   16 0.006324111    0.05653710
##       sour   13 0.005138340    0.04593640
##      nutty   11 0.004347826    0.03886926
##        off   11 0.004347826    0.03886926
##       rich   11 0.004347826    0.03886926
#summarized by characteristic column 1
chocolate_subset1  %>% 
  group_by(char1_sub) %>%
  filter(char1_sub!="Other") %>%
  summarize(
            mean = mean(rating, na.rm = TRUE),
            sd = sd(rating, na.rm = TRUE),
            min = min(rating, na.rm = TRUE),
            p25 = quantile(rating, probs = 0.25, na.rm = TRUE),
            p50 = median(rating, na.rm = TRUE),
            p75 = quantile(rating, probs = 0.75, na.rm = TRUE),
            max = max(rating, na.rm = TRUE)
            )
## # A tibble: 5 × 8
##   char1_sub  mean    sd   min   p25   p50   p75   max
##   <fct>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 creamy     3.48 0.442   1.5  3.25  3.5   3.75  4   
## 2 intense    3.21 0.429   2    3     3.25  3.5   4   
## 3 nutty      3.26 0.369   2.5  3     3.25  3.5   4   
## 4 sandy      3.09 0.368   2    2.75  3     3.5   3.75
## 5 sweet      3.08 0.385   2    2.75  3     3.25  4
#summarized by characteristic column 2
chocolate_subset2  %>% 
  group_by(char2_sub) %>%
  filter(char2_sub!="Other") %>%
  summarize(
            mean = mean(rating, na.rm = TRUE),
            sd = sd(rating, na.rm = TRUE),
            min = min(rating, na.rm = TRUE),
            p25 = quantile(rating, probs = 0.25, na.rm = TRUE),
            p50 = median(rating, na.rm = TRUE),
            p75 = quantile(rating, probs = 0.75, na.rm = TRUE),
            max = max(rating, na.rm = TRUE)
            )
## # A tibble: 5 × 8
##   char2_sub  mean    sd   min   p25   p50   p75   max
##   <fct>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 " cocoa"   3.36 0.462   1    3.12  3.5   3.75  4   
## 2 " earthy"  3.01 0.364   2    2.75  3     3.25  3.75
## 3 " nutty"   3.32 0.388   2.5  3     3.25  3.5   4   
## 4 " roasty"  3.19 0.314   2.5  3     3.25  3.5   3.75
## 5 " sweet"   3.04 0.345   2    2.75  3     3.25  4
#summarized by characteristic column 3
chocolate_subset3  %>% 
  group_by(char3_sub) %>%
  filter(char3_sub!="Other") %>%
  summarize(
            mean = mean(rating, na.rm = TRUE),
            sd = sd(rating, na.rm = TRUE),
            min = min(rating, na.rm = TRUE),
            p25 = quantile(rating, probs = 0.25, na.rm = TRUE),
            p50 = median(rating, na.rm = TRUE),
            p75 = quantile(rating, probs = 0.75, na.rm = TRUE),
            max = max(rating, na.rm = TRUE)
            )
## # A tibble: 5 × 8
##   char3_sub  mean    sd   min   p25   p50   p75   max
##   <fct>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 " cocoa"   3.39 0.373  2     3     3.5   3.75  4   
## 2 " earthy"  3.03 0.389  2     2.75  3     3.25  4   
## 3 " nutty"   3.30 0.419  2.25  3     3.38  3.5   4   
## 4 " roasty"  3.23 0.366  2.25  3     3.25  3.5   4   
## 5 " sour"    3.03 0.333  2.5   2.75  3     3.25  3.75
#summarized by characteristic column 4
chocolate_subset4  %>% 
  group_by(char4_sub) %>%
  filter(char4_sub!="Other") %>%
  summarize(
            mean = mean(rating, na.rm = TRUE),
            sd = sd(rating, na.rm = TRUE),
            min = min(rating, na.rm = TRUE),
            p25 = quantile(rating, probs = 0.25, na.rm = TRUE),
            p50 = median(rating, na.rm = TRUE),
            p75 = quantile(rating, probs = 0.75, na.rm = TRUE),
            max = max(rating, na.rm = TRUE)
            )
## # A tibble: 6 × 8
##   char4_sub  mean    sd   min   p25   p50   p75   max
##   <fct>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 " cocoa"   3.59 0.287  3     3.5   3.5   3.75   4  
## 2 " nutty"   3.27 0.236  3     3     3.25  3.5    3.5
## 3 " off"     2.80 0.218  2.5   2.62  2.75  3      3  
## 4 " rich"    3.41 0.375  2.75  3.12  3.5   3.62   4  
## 5 " roasty"  3.26 0.449  2.5   3     3.25  3.75   4  
## 6 " sour"    3    0.408  2.5   2.75  3     3.5    3.5

What are your findings about the summary? Are they what you expected?

The summary statistics are relatively similar across the cocoa percentage categories. I had expected chocolates with higher and lower cocoa content to appear about equally often, but to my surprise most of the rated bars contained 70% cocoa or more. The median rating for chocolate with 65% cocoa or less was 3.00, lower than the median of 3.25 in the two higher-percentage categories. However, the means (3.11, 3.27, and 3.13) and standard deviations (0.44, 0.42, and 0.46) do not differ much across the 3 categories. The sample size in each category seems “sufficient,” and the standard deviations are in line with one another, but the lowest cocoa category had substantially fewer observations (239, versus 1193 and 1098).

After tabulating the 4 columns of most memorable characteristics, I found some of the characteristics interesting. The top 5 characteristics in the first column were creamy, sandy, intense, sweet, and nutty. The top 5 characteristics in the second column were sweet, nutty, cocoa, earthy, and roasty. The top 5 characteristics in the third column were cocoa, roasty, nutty, sour, and earthy. The top 5 characteristics in the fourth column were roasty, cocoa, sour, nutty, and off. I find “sour” and “off” particularly surprising: I would not normally expect a piece of chocolate to taste or smell sour, and “off” sounds a little subjective to me. Nonetheless, the other characteristics, such as cocoa, roasty, sweet, intense, and nutty, sound like reasonable descriptions of chocolate.

*Note: The counts in column 4 are substantially lower than in the other 3 columns.
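
The remarks about sample size above are easier to check when the group counts sit next to the other summary statistics. The following is only a minimal sketch on made-up data: `toy_cocoa`, its category labels, and its ratings are hypothetical and are not taken from the chocolate dataset.

```r
library(dplyr)

# Hypothetical toy data; labels and ratings are invented for illustration
toy_cocoa <- tibble(
  cocoa_cat = c("low", "low", "mid", "mid", "mid",
                "high", "high", "high", "high"),
  rating    = c(2.75, 3.00, 3.00, 3.25, 3.50,
                3.00, 3.25, 3.25, 3.50)
)

toy_summary <- toy_cocoa %>%
  group_by(cocoa_cat) %>%
  summarize(
    n    = n(),                          # observations per category
    mean = mean(rating, na.rm = TRUE),
    p50  = median(rating, na.rm = TRUE)
  ) %>%
  mutate(share = n / sum(n))             # proportion of all rows in each category

toy_summary
```

Adding `n = n()` to the existing `summarize()` calls in the same way would show directly how few observations the smallest categories contribute.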

Make at least two plots that help you answer your question on the transformed or summarized data. Use scales and/or labels to make each plot informative.

#Boxplots for cocoa content and chocolate ratings (categorical)
cocoa_plot <- ggplot(chocolate_final) +
  aes(y = rating, 
      x = cocoa_cat,
      fill = cocoa_cat) +
  
  labs(title = "Chocolate Ratings and Cocoa Content",
       x = "Cocoa Content",
       y = "Chocolate Rating",
       fill = "Cocoa Category") +
  
  scale_fill_viridis_d(option = "A") +
  
  geom_boxplot(alpha = 0.5) +
  
  coord_flip()  + 
  
  theme_minimal()

cocoa_plot 

#Scatterplot for cocoa content and chocolate ratings (continuous)
cocoa_plot2 <- ggplot(chocolate_final) +
  aes(y = rating, 
      x = cocoa_percent) +
  
  labs(title = "Chocolate Ratings and Cocoa Content",
       x = "Cocoa Content",
       y = "Chocolate Rating") +
  
  geom_point() +
  
  theme_minimal()

cocoa_plot2 

#Boxplots for memorable characteristics and chocolate ratings (column 1)
char_plot1 <- ggplot(chocolate_subset1 %>% filter(char1_sub!="Other")) +
  aes(y = rating, 
      x = char1_sub) +
  
  labs(title = "Chocolate Ratings and Memorable Characteristics 1",
       x = "Characteristics",
       y = "Chocolate Rating") +
  
  geom_boxplot(alpha = 0.5) +
  
  coord_flip()  + 
  
  theme_minimal()

char_plot1

#Boxplots for memorable characteristics and chocolate ratings (column 2) 
char_plot2 <- ggplot(chocolate_subset2 %>% filter(char2_sub!="Other")) +
  aes(y = rating, 
      x = char2_sub) +
  
  labs(title = "Chocolate Ratings and Memorable Characteristics 2",
       x = "Characteristics",
       y = "Chocolate Rating") +
  
  geom_boxplot(alpha = 0.5) +
  
  coord_flip()  + 
  
  theme_minimal()

char_plot2

#Boxplots for memorable characteristics and chocolate ratings (column 3)
char_plot3 <- ggplot(chocolate_subset3 %>% filter(char3_sub!="Other")) +
  aes(y = rating, 
      x = char3_sub) +
  
  labs(title = "Chocolate Ratings and Memorable Characteristics 3",
       x = "Characteristics",
       y = "Chocolate Rating") +
  
  geom_boxplot(alpha = 0.5) +
  
  coord_flip()  + 
  
  theme_minimal()

char_plot3

#Boxplots for memorable characteristics and chocolate ratings (column 4)
char_plot4 <- ggplot(chocolate_subset4 %>% filter(char4_sub!="Other")) +
  aes(y = rating, 
      x = char4_sub) +
  
  labs(title = "Chocolate Ratings and Memorable Characteristics 4",
       x = "Characteristics",
       y = "Chocolate Rating") +
  
  geom_boxplot(alpha = 0.5) +
  
  coord_flip()  + 
  
  theme_minimal()

char_plot4

Summary for ggplots

For the first column, the top 5 memorable characteristics do not suggest differences in ratings, as the median for each characteristic fell within the interquartile ranges of the other characteristics. The third column showed similar results. In the second column, the ratings may differ for the earthy and cocoa characteristics, as the medians for both lie outside the interquartile ranges of the other characteristics. The remaining characteristics (sweet, roasty, and nutty) showed no apparent difference in chocolate rating. For the fourth column, the sample sizes were small, which led to wider interquartile ranges overall. The boxplots suggested differences in ratings between off and the other characteristics (sour, roasty, rich, nutty, and cocoa). The sour characteristic may also differ in rating from the rich or cocoa characteristics. However, I would not draw conclusions or inferences from the fourth column, since the small sample may not be representative of the dataset. Overall, the chocolate ratings do not appear to be strongly affected by the memorable characteristics, but the ratings may differ between certain characteristics, such as earthy and cocoa.

Cocoa contents were divided into 3 categories: less than 65%, 66-70%, and 71-100%. The distribution for less than 65% seems to be right-skewed, with potential outliers at the lower ratings. The distribution for 66-70% appears approximately normal, with potential outliers at the lower ratings. The distribution for 71-100% seems left-skewed, with potential outliers at the lower ratings. Despite these differences in distribution, the median of each boxplot lies within the interquartile ranges of the other categories, suggesting that the three groups may not differ. Therefore, the rating of the chocolate is not likely to be affected by its cocoa content.
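
The visual rule used in this summary, checking whether one group's median falls inside the interquartile ranges of the other groups, can also be checked numerically. The following is a rough sketch under stated assumptions: `toy_ratings` and its values are hypothetical, and the rule is an informal screening heuristic rather than a statistical test.

```r
library(dplyr)

# Hypothetical toy data; groups and ratings are invented for illustration
toy_ratings <- tibble(
  group  = rep(c("a", "b", "c"), each = 5),
  rating = c(2.50, 2.75, 3.00, 3.25, 3.50,
             2.75, 3.00, 3.25, 3.50, 3.75,
             3.50, 3.75, 3.75, 4.00, 4.00)
)

# Median and quartiles for each group
group_stats <- toy_ratings %>%
  group_by(group) %>%
  summarize(
    p25 = quantile(rating, probs = 0.25, na.rm = TRUE),
    p50 = median(rating, na.rm = TRUE),
    p75 = quantile(rating, probs = 0.75, na.rm = TRUE)
  )

medians <- group_stats %>% select(group, p50)
iqrs    <- group_stats %>% select(other = group, other_p25 = p25, other_p75 = p75)

# Pair every group's median with every other group's IQR
# (merge() with by = NULL returns the Cartesian product)
iqr_check <- merge(medians, iqrs, by = NULL) %>%
  filter(group != other) %>%
  mutate(median_in_other_iqr = p50 >= other_p25 & p50 <= other_p75)

iqr_check
```

Rows where `median_in_other_iqr` is `FALSE` correspond to the pairs that a boxplot reader would flag as possibly different.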

Final Summary (10 points)

Summarize your research question and findings below.

My first research question was to examine which memorable characteristics occurred the most and had the highest ratings. The top 5 characteristics were extracted from each characteristic column, for a total of 20 extractions. If a characteristic appeared in two or more columns, its counts were summed to obtain the total occurrence. In descending order, nutty had 260 mentions, cocoa had 210 mentions, sweet had 204 mentions, creamy had 163 mentions, roasty had 161 mentions, sandy had 142 mentions, earthy had 137 mentions, intense had 86 mentions, sour had 80 mentions, and off had 11 mentions. Hence, the top 3 memorable characteristics were nutty, cocoa, and sweet. Based on the boxplots, the chocolate ratings overall did not appear to be strongly affected by the memorable characteristics. However, the ratings may differ between certain characteristics, such as earthy and cocoa.
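
One way to pool mentions across the four characteristic columns is to reshape them into a single column and count. This is a sketch on made-up data, with the hypothetical names `toy_chars` and `char1` to `char4`. It counts every mention in every column, which differs from the approach described above, where only the counts of each column's top 5 characteristics were summed.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: one row per chocolate, up to four characteristics each
toy_chars <- tibble(
  char1 = c("creamy", "nutty",  "sweet", "nutty"),
  char2 = c("nutty",  "cocoa",  "nutty", "sweet"),
  char3 = c("cocoa",  "roasty", NA,      "cocoa"),
  char4 = c(NA,       "sour",   NA,      NA)
)

total_mentions <- toy_chars %>%
  pivot_longer(
    cols = starts_with("char"),
    names_to = "position",
    values_to = "characteristic",
    values_drop_na = TRUE              # skip chocolates with fewer characteristics
  ) %>%
  count(characteristic, sort = TRUE, name = "mentions")

total_mentions
```

Because `count()` is called with `sort = TRUE`, the result is already in descending order of mentions.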

My second research question was to investigate the influence of cocoa percentage on chocolate ratings. The boxplots did not indicate a difference in rating among the three categories of cocoa content (less than 65%, 66-70%, and 71-100%). Hence, the evidence from the dataset suggests that chocolate ratings are not substantially affected by the cocoa content of the chocolate.

Are your findings what you expected? Why or Why not?

The findings for the first question were mostly aligned with my expectation that cocoa would be one of the most mentioned characteristics. In addition, most of the top memorable characteristics, such as nutty, cocoa, sweet, creamy, roasty, and intense, sounded like adjectives that I would associate with a piece of chocolate. However, I find the “off” characteristic unexpected because I would not typically describe a bar of chocolate as “off,” and “off” may be interpreted differently by each person. Overall, the findings were largely what I expected.

Initially, I had a biased expectation that higher ratings would be related to higher cocoa content in chocolate. However, the findings did not support this, as there was no evidence of differences in chocolate ratings with increasing cocoa content. In hindsight, this result is understandable, since not everyone favors chocolates with high cocoa content, and some may prefer milk chocolate.