Midterm (Due Sunday 2/19/2023 at 11:55 pm)

Please submit your .Rmd and .html files in Sakai. If you are working together, both people should submit the files.

60 / 60 points total

The goal of the midterm project is to showcase skills that you have learned in class so far. The midterm is open note, but if you use someone else’s code, you must attribute them.

Instructions: Before you get Started

  1. Pick a dataset. Ideally, the dataset should be around 2000 rows, and should have both categorical and numeric covariates. Choose a dataset with more than 4 columns/variables.
  • Potential Sources for data:
  • Note that most of the Tidy Tuesday data are .csv files. There is code to load the files from csv for each of the datasets and a short description of the variables, or you can upload the .csv file into your data folder. This resource is probably the easiest to deal with.
  • You may use another dataset or your own data, but please make sure it is de-identified and has enough rows/variables.
  2. Define a research question, involving at least one categorical variable. You may schedule a time with Jessica or Brad to discuss your data set and research question, or you may message it to one of us in slack or email. Please do one of the two options pretty early on. We just want to look at the data and make sure that it is appropriate for your question.

  3. You must use each of the following functions at least once:

  • mutate()
  • group_by()
  • summarize()
  • ggplot()

and at least one of the following:

  • case_when()
  • across()
  • *_join() (e.g. left_join())
  • pivot_*() (e.g. pivot_longer())
  • function()
  4. The code chunks below are guides, please add more code chunks to do what you need.

  5. If you do not want your final project posted on the public website, please let Jessica know. We can also keep it anonymous if you’d like to remove your name from the Rmd and html, or use a pseudonym.

You may remove these instructions from your final Rmd if you like

Working Together

If you’d like to work together in pairs, that is encouraged, but you must divide the work equitably and you must note who worked on what. This is probably easiest as notes in the text. Please let Brad or Jessica know that you’ll be working together.

No acknowledgements of contributions = -10 points overall.

Please Note

I will take off points (-5 points for each section) if you don’t add observations and notes in your RMarkdown document. I want you to think and reason through your analysis, even if they are preliminary thoughts.

Define Your Research Question (10 points)

Define your research question below. What about the data interests you? What is a specific question you want to find out about the data?

This data was collected from 2015 to 2018 and records the price of avocados and how many avocados were sold throughout each year. My first question is which type has been selling more: conventional or organic avocados? My second question is whether the price of avocados (conventional and organic) increased or decreased over the years.

The data covers many cities and regions, so I selected some of them to see in which regions avocados sell more and in which regions they sell less.

I love eating avocado toast in the morning, and there is a huge taste difference between toast made with conventional avocados and toast made with organic ones. I prefer mine with organic avocados. I was also curious which regions sell more avocados and how the price differs between them.

Given your question, what is your expectation about the data?

I expect people to buy more organic avocados than conventional ones, since they taste better and do not go bad as quickly. I also expect the price of avocados to increase from year to year.

I also expect more avocado bags to be sold in California and fewer in the Southeast.

Loading the Data (10 points)

Load the data below and use dplyr::glimpse() or skimr::skim() on the data. You should upload the data file into the data directory.

# Get the Data

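# Load the data using read_csv()

# This study is about avocados
library(tidyverse)  # loads readr, dplyr, ggplot2 and the %>% pipe used below
library(skimr)      # provides skim()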
avocado <- read_csv("~/Downloads/R Programing /sph_r_programming_class_project_folders_2023 (1)/data/avocado.csv")
## New names:
## Rows: 18249 Columns: 14
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (2): type, region dbl (11): ...1, AveragePrice, Total Volume, 4046, 4225, 4770,
## Total Bags, S... date (1): Date
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
dplyr::glimpse(avocado)
## Rows: 18,249
## Columns: 14
## $ ...1           <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
## $ Date           <date> 2015-12-27, 2015-12-20, 2015-12-13, 2015-12-06, 2015-1…
## $ AveragePrice   <dbl> 1.33, 1.35, 0.93, 1.08, 1.28, 1.26, 0.99, 0.98, 1.02, 1…
## $ `Total Volume` <dbl> 64236.62, 54876.98, 118220.22, 78992.15, 51039.60, 5597…
## $ `4046`         <dbl> 1036.74, 674.28, 794.70, 1132.00, 941.48, 1184.27, 1368…
## $ `4225`         <dbl> 54454.85, 44638.81, 109149.67, 71976.41, 43838.39, 4806…
## $ `4770`         <dbl> 48.16, 58.33, 130.50, 72.58, 75.78, 43.61, 93.26, 80.00…
## $ `Total Bags`   <dbl> 8696.87, 9505.56, 8145.35, 5811.16, 6183.95, 6683.91, 8…
## $ `Small Bags`   <dbl> 8603.62, 9408.07, 8042.21, 5677.40, 5986.26, 6556.47, 8…
## $ `Large Bags`   <dbl> 93.25, 97.49, 103.14, 133.76, 197.69, 127.44, 122.05, 5…
## $ `XLarge Bags`  <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0…
## $ type           <chr> "conventional", "conventional", "conventional", "conven…
## $ year           <dbl> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2…
## $ region         <chr> "Albany", "Albany", "Albany", "Albany", "Albany", "Alba…
# Use skimr::skim() to get summary statistics for each variable in the data frame
skim(avocado)
Data summary
Name                     avocado
Number of rows           18249
Number of columns        14
_______________________
Column type frequency:
  character              2
  Date                   1
  numeric                11
________________________
Group variables          None

Variable type: character

skim_variable  n_missing  complete_rate  min  max  empty  n_unique  whitespace
type                   0              1    7   12      0         2           0
region                 0              1    4   19      0        54           0

Variable type: Date

skim_variable  n_missing  complete_rate  min         max         median      n_unique
Date                   0              1  2015-01-04  2018-03-25  2016-08-14       169

Variable type: numeric

skim_variable  n_missing  complete_rate       mean          sd       p0       p25        p50        p75         p100  hist
...1                   0              1      24.23       15.48     0.00     10.00      24.00      38.00        52.00  ▇▆▆▆▆
AveragePrice           0              1       1.41        0.40     0.44      1.10       1.37       1.66         3.25  ▂▇▅▁▁
Total Volume           0              1  850644.01  3453545.36    84.56  10838.58  107376.76  432962.29  62505646.52  ▇▁▁▁▁
4046                   0              1  293008.42  1264989.08     0.00    854.07    8645.30  111020.20  22743616.17  ▇▁▁▁▁
4225                   0              1  295154.57  1204120.40     0.00   3008.78   29061.02  150206.86  20470572.61  ▇▁▁▁▁
4770                   0              1   22839.74   107464.07     0.00      0.00     184.99    6243.42   2546439.11  ▇▁▁▁▁
Total Bags             0              1  239639.20   986242.40     0.00   5088.64   39743.83  110783.37  19373134.37  ▇▁▁▁▁
Small Bags             0              1  182194.69   746178.51     0.00   2849.42   26362.82   83337.67  13384586.80  ▇▁▁▁▁
Large Bags             0              1   54338.09   243965.96     0.00    127.47    2647.71   22029.25   5719096.61  ▇▁▁▁▁
XLarge Bags            0              1    3106.43    17692.89     0.00      0.00       0.00     132.50    551693.65  ▇▁▁▁▁
year                   0              1    2016.15        0.94  2015.00   2015.00    2016.00    2017.00      2018.00  ▇▇▁▇▂

If there are any quirks that you have to deal with (for example, NA coded as something else, or data spread across multiple tables), please make some notes here about what you need to do before you start transforming the data in the next section.

Make sure your data types are correct!
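As a hedged illustration of the clean-up steps this prompt asks about, the sketch below uses a few made-up rows shaped like avocado.csv. The `-999` missing-value code is purely hypothetical (the skim output above reports no missing values), and the new names `raw`, `clean`, and `total_bags` are chosen only for this example.

```r
library(dplyr)

# Made-up rows that mimic the structure of avocado.csv
raw <- tibble(
  `...1`       = c(0, 1, 2),
  AveragePrice = c(1.33, -999, 0.93),   # -999 is a hypothetical missing-value code
  `Total Bags` = c(8696.87, 9505.56, 8145.35),
  type         = c("conventional", "organic", "conventional"),
  year         = c(2015, 2015, 2016)
)

clean <- raw %>%
  select(-`...1`) %>%                      # drop the unnamed row-index column
  rename(total_bags = `Total Bags`) %>%    # avoid back-ticked names later on
  mutate(
    AveragePrice = na_if(AveragePrice, -999),  # recode the sentinel value to NA
    type = factor(type),                       # categorical variable stored as a factor
    year = as.integer(year)                    # whole years stored as integers
  )

glimpse(clean)
```

Running `glimpse()` again after the clean-up is a quick way to confirm that each column has the intended type.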

Transforming the data (15 points)

If the data needs to be transformed in any way (values recoded, pivoted, etc), do it here. Examples include transforming a continuous variable into a categorical using case_when(), etc.
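One possible way to turn a continuous variable into a categorical one with case_when() is sketched below. The five prices are the p0, p25, p50, p75, and p100 values of AveragePrice from the skim output; the cut points and the column name `price_band` are assumptions made for this example, not part of the analysis that follows.

```r
library(dplyr)

# AveragePrice quantiles taken from the skim output above
prices <- tibble(AveragePrice = c(0.44, 1.10, 1.37, 1.66, 3.25))

prices <- prices %>%
  mutate(price_band = case_when(
    AveragePrice < 1.10 ~ "low",      # below the 25th percentile
    AveragePrice < 1.66 ~ "middle",   # between the 25th and 75th percentiles
    TRUE                ~ "high"      # everything else
  ))

prices
```

The conditions are checked in order, so a price only reaches the "middle" test if it was not already labelled "low".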

#Get the Average price and get the sum of total bags to see how the price increased/decreased and how many total bags were sold in each year. 
Average_price <- avocado %>% filter(region=="TotalUS") %>% group_by(year, type) %>% 
  summarise(AveragePrice = mean(AveragePrice),
            Totalbags=sum(`Total Bags`) ) %>% 
  mutate(type = case_when(type=="conventional"~"Conventional", type=="organic"~"Organic"))
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
Average_price
## # A tibble: 8 × 4
## # Groups:   year [4]
##    year type         AveragePrice  Totalbags
##   <dbl> <chr>               <dbl>      <dbl>
## 1  2015 Conventional         1.01 281278933.
## 2  2015 Organic              1.50   9270535.
## 3  2016 Conventional         1.05 519729094.
## 4  2016 Organic              1.48  24418736.
## 5  2017 Conventional         1.22 578073467.
## 6  2017 Organic              1.65  39591069.
## 7  2018 Conventional         1.06 174111781.
## 8  2018 Organic              1.55  12127166.
# use geom bar to see the graph for year vs Average Price
Average_price %>% ggplot(aes(x=year, y=AveragePrice, fill=type))+
  geom_bar(stat = "identity", position = "dodge")

## use geom bar to see the graph for year vs Total bags
Average_price %>% ggplot(aes(x=year, y=Totalbags, fill=type))+
  geom_bar(stat = "identity", position = "dodge")

# Select the main regions
# Find the average price and the sum of total bags for each selected region
# See how the price changed and how many total bags were sold each year in the selected regions
table(avocado$region)
## 
##              Albany             Atlanta BaltimoreWashington               Boise 
##                 338                 338                 338                 338 
##              Boston    BuffaloRochester          California           Charlotte 
##                 338                 338                 338                 338 
##             Chicago    CincinnatiDayton            Columbus       DallasFtWorth 
##                 338                 338                 338                 338 
##              Denver             Detroit         GrandRapids          GreatLakes 
##                 338                 338                 338                 338 
##  HarrisburgScranton HartfordSpringfield             Houston        Indianapolis 
##                 338                 338                 338                 338 
##        Jacksonville            LasVegas          LosAngeles          Louisville 
##                 338                 338                 338                 338 
##   MiamiFtLauderdale            Midsouth           Nashville    NewOrleansMobile 
##                 338                 338                 338                 338 
##             NewYork           Northeast  NorthernNewEngland             Orlando 
##                 338                 338                 338                 338 
##        Philadelphia       PhoenixTucson          Pittsburgh              Plains 
##                 338                 338                 338                 338 
##            Portland   RaleighGreensboro     RichmondNorfolk             Roanoke 
##                 338                 338                 338                 338 
##          Sacramento            SanDiego        SanFrancisco             Seattle 
##                 338                 338                 338                 338 
##       SouthCarolina        SouthCentral           Southeast             Spokane 
##                 338                 338                 338                 338 
##             StLouis            Syracuse               Tampa             TotalUS 
##                 338                 338                 338                 338 
##                West    WestTexNewMexico 
##                 338                 335
selectregions <- c("Midsouth","GreatLakes", "Northeast", "NorthernNewEngland", "Plains", "SouthCentral", "Southeast", "West", "WestTexNewMexico")
Average_price <- avocado %>% filter(region%in%selectregions) %>% group_by(year, type, region) %>% 
  summarise(AveragePrice = mean(AveragePrice),
            Totalbags=sum(`Total Bags`) ) %>% 
  mutate(type = case_when(type=="conventional"~"Conventional", type=="organic"~"Organic"))
## `summarise()` has grouped output by 'year', 'type'. You can override using the
## `.groups` argument.
Average_price
## # A tibble: 72 × 5
## # Groups:   year, type [8]
##     year type         region             AveragePrice Totalbags
##    <dbl> <chr>        <chr>                     <dbl>     <dbl>
##  1  2015 Conventional GreatLakes                1.08  33279209.
##  2  2015 Conventional Midsouth                  1.12  31782551.
##  3  2015 Conventional Northeast                 1.21  47684446.
##  4  2015 Conventional NorthernNewEngland        1.11   3414599.
##  5  2015 Conventional Plains                    1.08  14996064.
##  6  2015 Conventional SouthCentral              0.812 34873350.
##  7  2015 Conventional Southeast                 1.08  32835290.
##  8  2015 Conventional West                      0.94  49459638.
##  9  2015 Conventional WestTexNewMexico          0.772  5399316.
## 10  2015 Organic      GreatLakes                1.57    777590.
## # … with 62 more rows
## use geom bar to see the graph for year vs Total bags for all selected regions 
Average_price %>% ggplot(aes(x=year, y=Totalbags, fill=type))+
  geom_bar(stat = "identity", position = "dodge")+
  facet_wrap(~region)

Bonus points (5 points) for datasets that require merging of tables, but only if you reason through whether you should use left_join, inner_join, or right_join on these tables. No credit will be provided if you don’t.

regionyearsummary <- avocado %>%  group_by(year, type, region) %>% 
  summarise(AveragePriceregion = mean(AveragePrice),
            Totalbagsregion=sum(`Total Bags`) ) 
## `summarise()` has grouped output by 'year', 'type'. You can override using the
## `.groups` argument.
regionyearsummary
## # A tibble: 432 × 5
## # Groups:   year, type [8]
##     year type         region              AveragePriceregion Totalbagsregion
##    <dbl> <chr>        <chr>                            <dbl>           <dbl>
##  1  2015 conventional Albany                           1.17          662366.
##  2  2015 conventional Atlanta                          1.05         2935926.
##  3  2015 conventional BaltimoreWashington              1.17         9311602.
##  4  2015 conventional Boise                            1.05          492546.
##  5  2015 conventional Boston                           1.14         5594482.
##  6  2015 conventional BuffaloRochester                 1.40         3016830.
##  7  2015 conventional California                       1.02        36368386.
##  8  2015 conventional Charlotte                        1.15         2761782.
##  9  2015 conventional Chicago                          1.15         4586222.
## 10  2015 conventional CincinnatiDayton                 0.977        2956864.
## # … with 422 more rows
avocadowsummary <- inner_join(avocado, regionyearsummary)
## Joining, by = c("type", "year", "region")
view(avocadowsummary)

In this study, I used inner_join() to join the avocado table with the region-year summary, so that each row shows both its own price and the average price for its region, year, and type. For example, row 18096 has an average price of $1.53, while the average price of avocados sold in San Diego was about $1.84 (1.836667). That is a difference of about $0.31. It seems very small, but it adds up as the total number of bags sold increases.
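The choice between inner_join() and left_join() can be checked directly. The sketch below uses made-up prices and the same key columns as the chunk above; `sales`, `summary_tbl`, `keys`, and `price_gap` are names invented for this example. Because the summary is computed from the same table it is joined back to, every row has a match, so both joins should keep every row.

```r
library(dplyr)

# Made-up rows; column names follow the chunk above
sales <- tibble(
  region       = c("SanDiego", "SanDiego", "Boise"),
  year         = c(2018, 2018, 2018),
  type         = c("organic", "organic", "organic"),
  AveragePrice = c(1.50, 1.90, 1.60)
)

summary_tbl <- sales %>%
  group_by(year, type, region) %>%
  summarise(AveragePriceregion = mean(AveragePrice), .groups = "drop")

# Naming the keys explicitly avoids the "Joining, by = ..." message
keys <- c("year", "type", "region")
inner <- inner_join(sales, summary_tbl, by = keys)
left  <- left_join(sales, summary_tbl, by = keys)

# Rows of sales with no match in the summary (expected: none)
unmatched <- anti_join(sales, summary_tbl, by = keys)

# Gap between each row's price and its region-year-type average
inner <- inner %>% mutate(price_gap = AveragePriceregion - AveragePrice)
inner
```

If `unmatched` had any rows, inner_join() would silently drop them while left_join() would keep them with NA in the summary column, which is the main thing to reason about when choosing between the two.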

Show your transformed table here. Use tools such as glimpse(), skim() or head() to illustrate your point.

glimpse(Average_price)
## Rows: 72
## Columns: 5
## Groups: year, type [8]
## $ year         <dbl> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 201…
## $ type         <chr> "Conventional", "Conventional", "Conventional", "Conventi…
## $ region       <chr> "GreatLakes", "Midsouth", "Northeast", "NorthernNewEnglan…
## $ AveragePrice <dbl> 1.0776923, 1.1196154, 1.2148077, 1.1130769, 1.0773077, 0.…
## $ Totalbags    <dbl> 33279208.9, 31782550.6, 47684445.5, 3414598.7, 14996063.8…
head(Average_price)
## # A tibble: 6 × 5
## # Groups:   year, type [1]
##    year type         region             AveragePrice Totalbags
##   <dbl> <chr>        <chr>                     <dbl>     <dbl>
## 1  2015 Conventional GreatLakes                1.08  33279209.
## 2  2015 Conventional Midsouth                  1.12  31782551.
## 3  2015 Conventional Northeast                 1.21  47684446.
## 4  2015 Conventional NorthernNewEngland        1.11   3414599.
## 5  2015 Conventional Plains                    1.08  14996064.
## 6  2015 Conventional SouthCentral              0.812 34873350.

Are the values what you expected for the variables? Why or Why not?

Yes, the values look accurate to me. They differ from my hypothesis, but looking at the data, they make more sense than my assumption did.

Visualizing and Summarizing the Data (15 points)

Use group_by() and summarize() to make a summary of the data here. The summary should be relevant to your research question

Average_price <- avocado %>% filter(region%in%selectregions) %>% group_by(year, type, region) %>% 
  summarize(AveragePrice = mean(AveragePrice),
            Totalbags=sum(`Total Bags`))
## `summarise()` has grouped output by 'year', 'type'. You can override using the
## `.groups` argument.
Average_price
## # A tibble: 72 × 5
## # Groups:   year, type [8]
##     year type         region             AveragePrice Totalbags
##    <dbl> <chr>        <chr>                     <dbl>     <dbl>
##  1  2015 conventional GreatLakes                1.08  33279209.
##  2  2015 conventional Midsouth                  1.12  31782551.
##  3  2015 conventional Northeast                 1.21  47684446.
##  4  2015 conventional NorthernNewEngland        1.11   3414599.
##  5  2015 conventional Plains                    1.08  14996064.
##  6  2015 conventional SouthCentral              0.812 34873350.
##  7  2015 conventional Southeast                 1.08  32835290.
##  8  2015 conventional West                      0.94  49459638.
##  9  2015 conventional WestTexNewMexico          0.772  5399316.
## 10  2015 organic      GreatLakes                1.57    777590.
## # … with 62 more rows
view(Average_price)
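The `summarise()` message printed above is only a note that the result is still grouped. A minimal sketch of one way to handle it, using made-up numbers and the hypothetical names `toy` and `by_year_type`, is to set the `.groups` argument explicitly:

```r
library(dplyr)

# Made-up rows with the same columns used in the summary above
toy <- tibble(
  year         = c(2015, 2015, 2016, 2016),
  type         = c("conventional", "organic", "conventional", "organic"),
  region       = "West",
  AveragePrice = c(0.94, 1.60, 1.00, 1.70),
  `Total Bags` = c(100, 10, 120, 15)
)

by_year_type <- toy %>%
  group_by(year, type, region) %>%
  summarise(
    AveragePrice = mean(AveragePrice),
    Totalbags    = sum(`Total Bags`),
    .groups      = "drop"   # return an ungrouped tibble and silence the message
  )

is_grouped_df(by_year_type)
```

Dropping the groups is a design choice: it keeps later `mutate()` or `arrange()` calls from being applied within leftover groups by accident.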

What are your findings about the summary? Are they what you expected?

Looking at the summary, the values are a bit different from what I expected. I predicted that organic avocados would sell more than conventional ones, but the data does not support my hypothesis. This makes sense: most restaurants and food stores buy avocados, and they end up buying the cheaper ones. I also hypothesized that the fewest avocados would be sold in the Southeast, but that was incorrect as well. According to the data, WestTexNewMexico was among the lowest-selling regions (in the 2015 conventional rows shown above, only NorthernNewEngland sold fewer bags), and a lot of avocados were sold in the West region, in areas like California.
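To put a number on how much more conventional avocado was sold, the sketch below computes the organic share of total bags per year. The totals are copied from the TotalUS table printed earlier in this report (rounded as shown there); `us_totals` and `organic_share` are names chosen for this example.

```r
library(dplyr)

# Totals copied from the TotalUS summary table printed earlier
us_totals <- tribble(
  ~year, ~type,          ~Totalbags,
  2015,  "Conventional", 281278933,
  2015,  "Organic",        9270535,
  2016,  "Conventional", 519729094,
  2016,  "Organic",       24418736,
  2017,  "Conventional", 578073467,
  2017,  "Organic",       39591069,
  2018,  "Conventional", 174111781,
  2018,  "Organic",       12127166
)

organic_share <- us_totals %>%
  group_by(year) %>%
  summarise(organic_share = sum(Totalbags[type == "Organic"]) / sum(Totalbags))

organic_share
```

With these totals the organic share stays well under a tenth of all bags in every year, although it grows from 2015 to 2018.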

Make at least two plots that help you answer your question on the transformed or summarized data. Use scales and/or labels to make each plot informative.

Plot 1 (Geom Point)

# Use geom_point to see the price difference
# we can see the avocado type in blue and orange
Average_price %>% ggplot(aes(x=year, y=AveragePrice, fill=type))+
 geom_point(shape = 21, size = 2) + 
  theme(legend.position = "bottom")+
  labs(title ="Average Avocado Price by Type",
       x = "Year",
       y = "Average Price")

# Use geom_point to see total bags sold
# We can see the avocado type in blue and orange
Average_price %>% ggplot(aes(x=year, y=Totalbags, fill=type))+
 geom_point(shape = 21, size = 2) +  
  theme(legend.position = "bottom")+
  labs(title ="Conventional vs Organic Avocado Total Bags Sold (2015-2018)",
       x = "Year",
       y = "Total Bags")

# Use geom_point to see total bags sold in each selected region
# We can see the avocado type in blue and orange
Average_price %>% ggplot(aes(x=year, y=Totalbags, color=year, fill = type))+
 geom_point(shape = 21, size = 2) + facet_wrap(vars(region), scales = "free_x")+
  theme(legend.position = "bottom")+
  labs(title ="Conventional vs Organic Avocado Total Bags in Each Region",
       x = "Year",
       y = "Total Bags")

Plot 2 (Box Plot)

# Use a box plot to see the price difference
# we can see the avocado type in blue and orange
# year is converted to a factor so that each year gets its own pair of boxes
ggplot(data = Average_price) +
  aes(x = factor(year),
      y = AveragePrice,
      fill = type) +
  geom_boxplot() +
  labs(title = "Average Price by Type",
       x = "Year",
       y = "Average Price")

# Use a box plot to see total bags sold
# We can see the avocado type in blue and orange
# year is converted to a factor so that each year gets its own pair of boxes
ggplot(data = Average_price) +
  aes(x = factor(year),
      y = Totalbags,
      fill = type) +
  geom_boxplot() +
  labs(title = "Conventional vs Organic Avocado Total Bags Sold (2015-2018)",
       x = "Year",
       y = "Total Bags")

# Use a box plot to see total bags sold in each selected region
# We can see the avocado type in blue and orange
# year is converted to a factor so that each year gets its own pair of boxes
ggplot(data = Average_price) +
  aes(x = factor(year),
      y = Totalbags,
      fill = type) +
  facet_wrap(~region) +
  geom_boxplot() +
  labs(title = "Conventional vs Organic Avocado Total Bags in Each Region",
       x = "Year",
       y = "Total Bags")
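The prompt for this section asks for scales and labels that make each plot informative. A minimal sketch of one option is shown below, using made-up bag counts and the hypothetical names `toy` and `p`: the year is treated as a category so each year gets its own boxes, and the y axis shows full numbers with commas instead of scientific notation.

```r
library(ggplot2)

# Made-up totals: two regions per year and type
toy <- data.frame(
  year      = rep(c(2015, 2016), each = 4),
  type      = rep(c("Conventional", "Organic"), times = 4),
  Totalbags = c(33e6, 0.8e6, 31e6, 0.7e6, 40e6, 1.5e6, 38e6, 1.2e6)
)

p <- ggplot(toy, aes(x = factor(year), y = Totalbags, fill = type)) +
  geom_boxplot() +
  scale_y_continuous(labels = scales::label_comma()) +  # 33,000,000 rather than 3.3e+07
  labs(title = "Total Bags Sold by Year and Type (made-up data)",
       x = "Year",
       y = "Total Bags",
       fill = "Type")

p
```

Because organic totals are so much smaller than conventional ones, a log scale (`scale_y_log10()`) might be worth trying as well if the organic boxes are hard to read.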

Final Summary (10 points)

Summarize your research question and findings below.

My research questions were whether the average price of avocados (conventional and organic) increased or decreased from 2015 to 2018, which avocado type (conventional or organic) sold more each year, and which regions sold the most avocados.

According to my analysis, prices in 2015 and 2016 were about the same, but in 2017 the price was noticeably higher for both conventional and organic avocados, and then it dropped in 2018 (the 2018 data only runs through March 25, so that year is partial). I looked it up and found an article explaining that avocado prices were high in 2017 because of unsuccessful harvests and rising demand.

https://globaledge.msu.edu/blog/post/53425/2017-sees-a-shocking-increase-in-avocado
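The size of the 2017 jump can be worked out from the yearly averages. The sketch below copies the TotalUS average prices printed earlier in this report (rounded to two decimals as shown there) and computes the year-over-year percent change; `us_prices`, `price_change`, and `pct_change` are names chosen for this example.

```r
library(dplyr)

# Average prices copied from the TotalUS summary table printed earlier
us_prices <- tribble(
  ~year, ~type,          ~AveragePrice,
  2015,  "Conventional", 1.01,
  2016,  "Conventional", 1.05,
  2017,  "Conventional", 1.22,
  2018,  "Conventional", 1.06,
  2015,  "Organic",      1.50,
  2016,  "Organic",      1.48,
  2017,  "Organic",      1.65,
  2018,  "Organic",      1.55
)

price_change <- us_prices %>%
  group_by(type) %>%
  arrange(year, .by_group = TRUE) %>%
  # percent change relative to the previous year within each type
  mutate(pct_change = 100 * (AveragePrice - lag(AveragePrice)) / lag(AveragePrice)) %>%
  ungroup()

price_change
```

The first year of each type has no previous year to compare against, so its `pct_change` is NA.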

Looking at our data, more conventional avocados were sold than organic ones. This makes sense: as I mentioned earlier, most restaurants and food stores buy avocados, and they end up buying the cheaper kind, which is conventional.

Among the selected regions, the West sold the most avocados and WestTexNewMexico was among the lowest-selling regions each year. This makes sense to me: more avocados are consumed in California, Nevada, Washington…, and far fewer in West Texas and New Mexico.
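A regional ranking like this can be read straight from the summary table. As a sketch, the code below copies the 2015 conventional rows printed earlier (rounded as shown there) and sorts them by total bags; `conv_2015` and `ranked` are names chosen for this example, and the same steps would need to be repeated for the other years and for organic avocados.

```r
library(dplyr)

# 2015 conventional rows copied from the Average_price table printed earlier
conv_2015 <- tribble(
  ~region,              ~Totalbags,
  "GreatLakes",         33279209,
  "Midsouth",           31782551,
  "Northeast",          47684446,
  "NorthernNewEngland",  3414599,
  "Plains",             14996064,
  "SouthCentral",       34873350,
  "Southeast",          32835290,
  "West",               49459638,
  "WestTexNewMexico",    5399316
)

# Sort from most to fewest bags sold
ranked <- conv_2015 %>% arrange(desc(Totalbags))

head(ranked, 1)   # region with the most bags
tail(ranked, 2)   # the two regions with the fewest bags
```

For these rows, West is at the top and the two lowest regions are WestTexNewMexico and NorthernNewEngland.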

Are your findings what you expected? Why or Why not?

My findings did not align with my hypothesis. I thought more organic avocados would be sold, but the data makes sense: more conventional avocados are sold because they are cheaper.

I expected the price to increase every year. According to this data, 2017 had a higher price than the other years, and the price dropped in 2018. The 2017 peak was due to unsuccessful harvests that year.

I expected the West to sell the most avocados, and the data confirms that. However, I expected the Southeast to sell the fewest, and according to this data far fewer were sold in the WestTexNewMexico region.