Please submit your .Rmd and .html files in Sakai. If you are working together, both people should submit the files.
The goal of the midterm project is to showcase skills that you have learned in class so far. The midterm is open note, but if you use someone else’s code, you must attribute them.
.csv file into your data folder. This resource is probably the easiest to deal with.
Define a research question involving at least one categorical variable. You may schedule a time with Jessica or Brad to discuss your data set and research question, or you may message it to one of us in Slack or email. Please do one of the two options pretty early on. We just want to look at the data and make sure that it is appropriate for your question.
You must use each of the following functions at least once:
mutate()
group_by()
summarize()
ggplot()
and at least one of the following:
case_when()
across()
*_join() (i.e. left_join())
pivot_*() (i.e. pivot_longer())
function()
The code chunks below are guides; please add more code chunks to do what you need.
If you do not want your final project posted on the public website, please let Jessica know. We can also keep it anonymous if you’d like to remove your name from the Rmd and html, or use a pseudonym.
You may remove these instructions from your final Rmd if you like
Working Together
If you’d like to work together in pairs, that is encouraged, but you must divide the work equitably and you must note who worked on what. This is probably easiest as notes in the text. Please let Brad or Jessica know that you’ll be working together.
No acknowledgements of contributions = -10 points overall.
Please Note
I will take off points (-5 points for each section) if you don’t add observations and notes in your RMarkdown document. I want you to think and reason through your analysis, even if they are preliminary thoughts.
Define your research question below. What about the data interests you? What is a specific question you want to find out about the data?
These data were collected from 2015 to 2018. They record the price of avocados and how many were sold throughout each year. My first question is whether conventional or organic avocados sold more. My second question is whether the price of avocados (conventional and organic) increased or decreased as the years went on.
The data set covers many states and regions, so I selected some of them to see where avocados sell more and where they sell less.
I love eating avocado toast in the morning, and there is a huge taste difference between toast made with conventional avocados and toast made with organic ones. I prefer mine with organic avocados. I was also curious to find out which areas sell more avocados and how the prices differ.
Given your question, what is your expectation about the data?
I expect people to buy more organic avocados than conventional ones, since they taste better and do not go bad as quickly. I also expect the price of avocados to increase over the years.
I also expect more avocado bags to be sold in California and fewer in the Southeast.
Load the data below and use dplyr::glimpse() or skimr::skim() on the data. You should upload the data file into the data directory.
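# Get the data
# Load the data using read_csv()
# This study is about avocados
library(tidyverse) # attaches readr, dplyr, ggplot2, and tibble (for view())
library(skimr) # provides skim()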
avocado <- read_csv("~/Downloads/R Programing /sph_r_programming_class_project_folders_2023 (1)/data/avocado.csv")
## New names:
## Rows: 18249 Columns: 14
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (2): type, region dbl (11): ...1, AveragePrice, Total Volume, 4046, 4225, 4770,
## Total Bags, S... date (1): Date
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
dplyr::glimpse(avocado)
## Rows: 18,249
## Columns: 14
## $ ...1 <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
## $ Date <date> 2015-12-27, 2015-12-20, 2015-12-13, 2015-12-06, 2015-1…
## $ AveragePrice <dbl> 1.33, 1.35, 0.93, 1.08, 1.28, 1.26, 0.99, 0.98, 1.02, 1…
## $ `Total Volume` <dbl> 64236.62, 54876.98, 118220.22, 78992.15, 51039.60, 5597…
## $ `4046` <dbl> 1036.74, 674.28, 794.70, 1132.00, 941.48, 1184.27, 1368…
## $ `4225` <dbl> 54454.85, 44638.81, 109149.67, 71976.41, 43838.39, 4806…
## $ `4770` <dbl> 48.16, 58.33, 130.50, 72.58, 75.78, 43.61, 93.26, 80.00…
## $ `Total Bags` <dbl> 8696.87, 9505.56, 8145.35, 5811.16, 6183.95, 6683.91, 8…
## $ `Small Bags` <dbl> 8603.62, 9408.07, 8042.21, 5677.40, 5986.26, 6556.47, 8…
## $ `Large Bags` <dbl> 93.25, 97.49, 103.14, 133.76, 197.69, 127.44, 122.05, 5…
## $ `XLarge Bags` <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0…
## $ type <chr> "conventional", "conventional", "conventional", "conven…
## $ year <dbl> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2…
## $ region <chr> "Albany", "Albany", "Albany", "Albany", "Albany", "Alba…
# Use skimr::skim() to get summary statistics for each variable in the data frame
skim(avocado)
Name | avocado |
Number of rows | 18249 |
Number of columns | 14 |
_______________________ | |
Column type frequency: | |
character | 2 |
Date | 1 |
numeric | 11 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
type | 0 | 1 | 7 | 12 | 0 | 2 | 0 |
region | 0 | 1 | 4 | 19 | 0 | 54 | 0 |
Variable type: Date
skim_variable | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|
Date | 0 | 1 | 2015-01-04 | 2018-03-25 | 2016-08-14 | 169 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
...1 | 0 | 1 | 24.23 | 15.48 | 0.00 | 10.00 | 24.00 | 38.00 | 52.00 | ▇▆▆▆▆ |
AveragePrice | 0 | 1 | 1.41 | 0.40 | 0.44 | 1.10 | 1.37 | 1.66 | 3.25 | ▂▇▅▁▁ |
Total Volume | 0 | 1 | 850644.01 | 3453545.36 | 84.56 | 10838.58 | 107376.76 | 432962.29 | 62505646.52 | ▇▁▁▁▁ |
4046 | 0 | 1 | 293008.42 | 1264989.08 | 0.00 | 854.07 | 8645.30 | 111020.20 | 22743616.17 | ▇▁▁▁▁ |
4225 | 0 | 1 | 295154.57 | 1204120.40 | 0.00 | 3008.78 | 29061.02 | 150206.86 | 20470572.61 | ▇▁▁▁▁ |
4770 | 0 | 1 | 22839.74 | 107464.07 | 0.00 | 0.00 | 184.99 | 6243.42 | 2546439.11 | ▇▁▁▁▁ |
Total Bags | 0 | 1 | 239639.20 | 986242.40 | 0.00 | 5088.64 | 39743.83 | 110783.37 | 19373134.37 | ▇▁▁▁▁ |
Small Bags | 0 | 1 | 182194.69 | 746178.51 | 0.00 | 2849.42 | 26362.82 | 83337.67 | 13384586.80 | ▇▁▁▁▁ |
Large Bags | 0 | 1 | 54338.09 | 243965.96 | 0.00 | 127.47 | 2647.71 | 22029.25 | 5719096.61 | ▇▁▁▁▁ |
XLarge Bags | 0 | 1 | 3106.43 | 17692.89 | 0.00 | 0.00 | 0.00 | 132.50 | 551693.65 | ▇▁▁▁▁ |
year | 0 | 1 | 2016.15 | 0.94 | 2015.00 | 2015.00 | 2016.00 | 2017.00 | 2018.00 | ▇▇▁▇▂ |
If there are any quirks that you have to deal with (NA coded as something else, or it is multiple tables), please make some notes here about what you need to do before you start transforming the data in the next section.
Make sure your data types are correct!
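One quirk visible in the glimpse() output above is the `...1` column, which read_csv() named automatically because the first column of the file has no header, along with column names such as `Total Volume` that need backticks. The following is a minimal sketch of one way to tidy this up before transforming. The toy tibble stands in for the first rows of avocado.csv, and the names `avocado_raw`, `avocado_clean`, `total_volume`, and `total_bags` are illustrative assumptions, not part of the original analysis.

```r
library(dplyr)

# Toy stand-in for the first rows of avocado.csv (values copied from the glimpse() output)
avocado_raw <- tibble::tibble(
  `...1` = c(0, 1, 2),
  AveragePrice = c(1.33, 1.35, 0.93),
  `Total Volume` = c(64236.62, 54876.98, 118220.22),
  `Total Bags` = c(8696.87, 9505.56, 8145.35),
  type = "conventional",
  year = 2015,
  region = "Albany"
)

avocado_clean <- avocado_raw %>%
  # Drop the unnamed row-index column if it is present
  select(-any_of("...1")) %>%
  # Give the spaced column names syntactic names (hypothetical choices)
  rename(total_volume = `Total Volume`, total_bags = `Total Bags`) %>%
  # Store the categorical variable as a factor and the year as an integer
  mutate(type = factor(type), year = as.integer(year))

glimpse(avocado_clean)
```

If the columns are renamed like this, later chunks would need to use the new names instead of the backticked originals.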
If the data needs to be transformed in any way (values recoded, pivoted, etc), do it here. Examples include transforming a continuous variable into a categorical using case_when(), etc.
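As a small illustration of turning a continuous variable into a categorical one with case_when(), the sketch below bins AveragePrice into three bands. The cut points 1.10 and 1.66 are the p25 and p75 values from the skim() output above. The band labels and the column name `price_band` are assumptions made for the example.

```r
library(dplyr)

# Three example prices, one per band (illustrative values)
prices <- tibble::tibble(AveragePrice = c(0.93, 1.33, 1.70))

prices_banded <- prices %>%
  mutate(
    price_band = case_when(
      AveragePrice < 1.10  ~ "low",     # below the 25th percentile
      AveragePrice <= 1.66 ~ "middle",  # between the 25th and 75th percentiles
      TRUE                 ~ "high"     # above the 75th percentile
    )
  )

prices_banded
```

The conditions are checked in order, so the second branch only sees prices that were not already labeled "low".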
#Get the Average price and get the sum of total bags to see how the price increased/decreased and how many total bags were sold in each year.
Average_price <- avocado %>% filter(region=="TotalUS") %>% group_by(year, type) %>%
summarise(AveragePrice = mean(AveragePrice),
Totalbags=sum(`Total Bags`) ) %>%
mutate(type = case_when(type=="conventional"~"Conventional", type=="organic"~"Organic"))
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
Average_price
## # A tibble: 8 × 4
## # Groups: year [4]
## year type AveragePrice Totalbags
## <dbl> <chr> <dbl> <dbl>
## 1 2015 Conventional 1.01 281278933.
## 2 2015 Organic 1.50 9270535.
## 3 2016 Conventional 1.05 519729094.
## 4 2016 Organic 1.48 24418736.
## 5 2017 Conventional 1.22 578073467.
## 6 2017 Organic 1.65 39591069.
## 7 2018 Conventional 1.06 174111781.
## 8 2018 Organic 1.55 12127166.
# use geom bar to see the graph for year vs Average Price
Average_price %>% ggplot(aes(x=year, y=AveragePrice, fill=type))+
geom_bar(stat = "identity", position = "dodge")
## use geom bar to see the graph for year vs Total bags
Average_price %>% ggplot(aes(x=year, y=Totalbags, fill=type))+
geom_bar(stat = "identity", position = "dodge")
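One caveat when comparing yearly totals: the skim() output shows that the dates run from 2015-01-04 to 2018-03-25, so 2018 covers only about three months. A yearly sum will therefore make 2018 look small even if weekly sales did not drop. The sketch below, on made-up weekly numbers, shows how a per-week average can sit next to the sum. The data values and the names `weekly`, `yearly`, `bags_sum`, and `bags_per_week` are assumptions for illustration only.

```r
library(dplyr)

# Made-up weekly bag counts: four weeks observed in 2017, only two in 2018
weekly <- tibble::tibble(
  year = c(2017, 2017, 2017, 2017, 2018, 2018),
  type = "conventional",
  total_bags = c(100, 120, 110, 130, 125, 135)
)

yearly <- weekly %>%
  group_by(year, type) %>%
  summarise(
    weeks = n(),                      # number of weekly records in the year
    bags_sum = sum(total_bags),       # sensitive to how many weeks were observed
    bags_per_week = mean(total_bags), # comparable across partial years
    .groups = "drop"                  # return an ungrouped tibble, no message
  )

yearly
```

In this toy example the 2018 sum is lower than the 2017 sum while the 2018 weekly average is higher, which is the pattern a partial year can produce.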
# We select main region
#Find the Average price and get the sum of total bags for each selected regions
#See how much the price increased/decreased and how many total bags were sold in each year for this selected regions.
table(avocado$region)
##
## Albany Atlanta BaltimoreWashington Boise
## 338 338 338 338
## Boston BuffaloRochester California Charlotte
## 338 338 338 338
## Chicago CincinnatiDayton Columbus DallasFtWorth
## 338 338 338 338
## Denver Detroit GrandRapids GreatLakes
## 338 338 338 338
## HarrisburgScranton HartfordSpringfield Houston Indianapolis
## 338 338 338 338
## Jacksonville LasVegas LosAngeles Louisville
## 338 338 338 338
## MiamiFtLauderdale Midsouth Nashville NewOrleansMobile
## 338 338 338 338
## NewYork Northeast NorthernNewEngland Orlando
## 338 338 338 338
## Philadelphia PhoenixTucson Pittsburgh Plains
## 338 338 338 338
## Portland RaleighGreensboro RichmondNorfolk Roanoke
## 338 338 338 338
## Sacramento SanDiego SanFrancisco Seattle
## 338 338 338 338
## SouthCarolina SouthCentral Southeast Spokane
## 338 338 338 338
## StLouis Syracuse Tampa TotalUS
## 338 338 338 338
## West WestTexNewMexico
## 338 335
selectregions <- c("Midsouth","GreatLakes", "Northeast", "NorthernNewEngland", "Plains", "SouthCentral", "Southeast", "West", "WestTexNewMexico")
Average_price <- avocado %>% filter(region%in%selectregions) %>% group_by(year, type, region) %>%
summarise(AveragePrice = mean(AveragePrice),
Totalbags=sum(`Total Bags`) ) %>%
mutate(type = case_when(type=="conventional"~"Conventional", type=="organic"~"Organic"))
## `summarise()` has grouped output by 'year', 'type'. You can override using the
## `.groups` argument.
Average_price
## # A tibble: 72 × 5
## # Groups: year, type [8]
## year type region AveragePrice Totalbags
## <dbl> <chr> <chr> <dbl> <dbl>
## 1 2015 Conventional GreatLakes 1.08 33279209.
## 2 2015 Conventional Midsouth 1.12 31782551.
## 3 2015 Conventional Northeast 1.21 47684446.
## 4 2015 Conventional NorthernNewEngland 1.11 3414599.
## 5 2015 Conventional Plains 1.08 14996064.
## 6 2015 Conventional SouthCentral 0.812 34873350.
## 7 2015 Conventional Southeast 1.08 32835290.
## 8 2015 Conventional West 0.94 49459638.
## 9 2015 Conventional WestTexNewMexico 0.772 5399316.
## 10 2015 Organic GreatLakes 1.57 777590.
## # … with 62 more rows
## use geom bar to see the graph for year vs Total bags for all selected regions
Average_price %>% ggplot(aes(x=year, y=Totalbags, fill=type))+
geom_bar(stat = "identity", position = "dodge")+
facet_wrap(~region)
Bonus points (5 points) for datasets that require merging of tables, but only if you reason through whether you should use left_join, inner_join, or right_join on these tables. No credit will be provided if you don’t.
regionyearsummary <- avocado %>% group_by(year, type, region) %>%
summarise(AveragePriceregion = mean(AveragePrice),
Totalbagsregion=sum(`Total Bags`) )
## `summarise()` has grouped output by 'year', 'type'. You can override using the
## `.groups` argument.
regionyearsummary
## # A tibble: 432 × 5
## # Groups: year, type [8]
## year type region AveragePriceregion Totalbagsregion
## <dbl> <chr> <chr> <dbl> <dbl>
## 1 2015 conventional Albany 1.17 662366.
## 2 2015 conventional Atlanta 1.05 2935926.
## 3 2015 conventional BaltimoreWashington 1.17 9311602.
## 4 2015 conventional Boise 1.05 492546.
## 5 2015 conventional Boston 1.14 5594482.
## 6 2015 conventional BuffaloRochester 1.40 3016830.
## 7 2015 conventional California 1.02 36368386.
## 8 2015 conventional Charlotte 1.15 2761782.
## 9 2015 conventional Chicago 1.15 4586222.
## 10 2015 conventional CincinnatiDayton 0.977 2956864.
## # … with 422 more rows
avocadowsummary <- inner_join(avocado, regionyearsummary)
## Joining, by = c("type", "year", "region")
view(avocadowsummary)
In this study, I used inner_join() to join the avocado data with the region-year summary so that each weekly price can be compared with the yearly average for its region. For example, in row 18096 the average price for that week is $1.53, while the yearly average price of avocados sold in San Diego is about $1.84 (1.836667). That is a difference of about $0.31. It seems small, but it adds up as the total number of bags sold increases.
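On the choice of join: because the summary table is computed from the avocado table itself, every weekly row has a matching year, type, and region, so inner_join() and left_join() should return the same rows here. The sketch below checks this on a toy table and computes the price difference directly. The toy values and the names `weekly`, `region_summary`, and `price_diff` are assumptions for illustration. Passing `by` explicitly also avoids relying on the automatic "Joining, by" matching.

```r
library(dplyr)

# Toy weekly prices for two hypothetical regions
weekly <- tibble::tibble(
  region = c("A", "A", "B", "B"),
  year = 2018,
  type = "organic",
  AveragePrice = c(1.5, 1.7, 2.0, 2.2)
)

# Region-year summary derived from the same table
region_summary <- weekly %>%
  group_by(year, type, region) %>%
  summarise(AveragePriceregion = mean(AveragePrice), .groups = "drop")

keys <- c("year", "type", "region")
joined_inner <- inner_join(weekly, region_summary, by = keys)
joined_left <- left_join(weekly, region_summary, by = keys)

# Every key has a match, so both joins keep all four rows
nrow(joined_inner) == nrow(joined_left)

# Difference between the regional yearly average and each weekly price
joined <- joined_inner %>%
  mutate(price_diff = AveragePriceregion - AveragePrice)

joined
```

If the two tables came from different sources, the row counts could differ, and left_join() would keep unmatched weekly rows with NA in the summary columns.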
Show your transformed table here. Use tools such as glimpse(), skim() or head() to illustrate your point.
glimpse(Average_price)
## Rows: 72
## Columns: 5
## Groups: year, type [8]
## $ year <dbl> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 201…
## $ type <chr> "Conventional", "Conventional", "Conventional", "Conventi…
## $ region <chr> "GreatLakes", "Midsouth", "Northeast", "NorthernNewEnglan…
## $ AveragePrice <dbl> 1.0776923, 1.1196154, 1.2148077, 1.1130769, 1.0773077, 0.…
## $ Totalbags <dbl> 33279208.9, 31782550.6, 47684445.5, 3414598.7, 14996063.8…
head(Average_price)
## # A tibble: 6 × 5
## # Groups: year, type [1]
## year type region AveragePrice Totalbags
## <dbl> <chr> <chr> <dbl> <dbl>
## 1 2015 Conventional GreatLakes 1.08 33279209.
## 2 2015 Conventional Midsouth 1.12 31782551.
## 3 2015 Conventional Northeast 1.21 47684446.
## 4 2015 Conventional NorthernNewEngland 1.11 3414599.
## 5 2015 Conventional Plains 1.08 14996064.
## 6 2015 Conventional SouthCentral 0.812 34873350.
Are the values what you expected for the variables? Why or Why not?
Yes, the values look accurate to me. They differ from my hypothesis, but after looking at the data, the results make more sense than my assumption did.
Use group_by() and summarize() to make a summary of the data here. The summary should be relevant to your research question.
Average_price <- avocado %>% filter(region%in%selectregions) %>% group_by(year, type, region) %>%
summarize(AveragePrice = mean(AveragePrice),
Totalbags=sum(`Total Bags`))
## `summarise()` has grouped output by 'year', 'type'. You can override using the
## `.groups` argument.
Average_price
## # A tibble: 72 × 5
## # Groups: year, type [8]
## year type region AveragePrice Totalbags
## <dbl> <chr> <chr> <dbl> <dbl>
## 1 2015 conventional GreatLakes 1.08 33279209.
## 2 2015 conventional Midsouth 1.12 31782551.
## 3 2015 conventional Northeast 1.21 47684446.
## 4 2015 conventional NorthernNewEngland 1.11 3414599.
## 5 2015 conventional Plains 1.08 14996064.
## 6 2015 conventional SouthCentral 0.812 34873350.
## 7 2015 conventional Southeast 1.08 32835290.
## 8 2015 conventional West 0.94 49459638.
## 9 2015 conventional WestTexNewMexico 0.772 5399316.
## 10 2015 organic GreatLakes 1.57 777590.
## # … with 62 more rows
view(Average_price)
What are your findings about the summary? Are they what you expected?
Looking at the summary, the values are a bit different from what I expected. I predicted that organic avocados would sell more than conventional ones, but the data do not support that hypothesis. This makes sense, though: most restaurants and food stores buy avocados, and they will end up buying the cheaper ones. I also hypothesized that fewer avocados would be sold in the Southeast, but that was incorrect as well. According to the data, fewer avocados were sold in the West Texas/New Mexico region (WestTexNewMexico), and a lot of avocados were sold in the West, in places like California.
Make at least two plots that help you answer your question on the transformed or summarized data. Use scales and/or labels to make each plot informative.
# Use geom point to see the price difference
# we can see the avocado type in blue and orange
Average_price %>% ggplot(aes(x=year, y=AveragePrice, fill=type))+
geom_point(shape = 21, size = 2) +
theme(legend.position = "bottom")+
labs(title ="Avocado Average Price in Type",
x = "Year",
y = "Average Price")
# Use geom point to see total bag sold
# We can see the avocado type in blue and orange
Average_price %>% ggplot(aes(x=year, y=Totalbags, fill=type))+
geom_point(shape = 21, size = 2) +
theme(legend.position = "bottom")+
labs(title ="Conventional vs Organic Avocado Total Bags Sold(2015-2018)",
x = "Year ",
y = "Total Bags")
# Use geom point to see total bag sold in each selected region
# We can see the avocado type in blue and orange
Average_price %>% ggplot(aes(x=year, y=Totalbags, color=year, fill = type))+
geom_point(shape = 21, size = 2) + facet_wrap(vars(region), scales = "free_x")+
theme(legend.position = "bottom")+
labs(title ="Conventional vs Organic Avocado Type Total Bags in Each Regions",
x = "Year",
y = "Total Bags")
# Use box plots to see the price difference by type within each year
# The fill color distinguishes the avocado type
# factor(year) gives each year its own boxes; a numeric x pools all years into one box per type
ggplot(data = Average_price) +
aes(x = factor(year),
y = AveragePrice,
fill = type) +
geom_boxplot() +
labs(title = "Average Price by Type",
x = "Year",
y = "Average Price")
# Use box plots to see total bags sold by type within each year
# The fill color distinguishes the avocado type
# factor(year) gives each year its own boxes; a numeric x pools all years into one box per type
ggplot(data = Average_price) +
aes(x = factor(year),
y = Totalbags,
fill = type) +
geom_boxplot() +
labs(title = "Conventional vs Organic Avocado Total Bags Sold (2015-2018)",
x = "Year",
y = "Total Bags")
# Use box plots to see total bags sold in each selected region
# The fill color distinguishes the avocado type
# factor(year) gives each year its own boxes; a numeric x pools all years into one box per type
ggplot(data = Average_price) +
aes(x = factor(year),
y = Totalbags,
fill = type) + facet_wrap(~region)+
geom_boxplot() +
labs(title = "Conventional vs Organic Avocado Total Bags in Each Region",
x = "Year",
y = "Total Bags")
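Because the bag totals run into the tens of millions, the default axis labels can be hard to read. The sketch below shows one possible way to use a scale to format the y axis with thousands separators. The toy values are hypothetical and not taken from avocado.csv, and `bags` and `p` are illustrative names.

```r
library(ggplot2)

# Hypothetical toy totals, not taken from avocado.csv
bags <- data.frame(
  year = c(2015, 2015, 2016, 2016),
  type = c("Conventional", "Organic", "Conventional", "Organic"),
  Totalbags = c(30000000, 800000, 52000000, 2400000)
)

p <- ggplot(bags, aes(x = factor(year), y = Totalbags, fill = type)) +
  geom_col(position = "dodge") +
  # Print 30,000,000 instead of 3e+07 on the y axis
  scale_y_continuous(labels = scales::label_comma()) +
  labs(title = "Total Bags Sold by Type",
       x = "Year",
       y = "Total Bags",
       fill = "Type")
```

Printing `p` draws the chart. The same `scale_y_continuous()` line could be added to any of the Total Bags plots above.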
Summarize your research question and findings below.
My research questions were whether the average price of avocados (conventional and organic) increased or decreased from 2015 to 2018, which avocado type (conventional or organic) sold more in each year, and which regions sold more avocados.
According to my analysis, the prices in 2015 and 2016 were about the same, but in 2017 the price was noticeably higher for both conventional and organic avocados, and then it dropped in 2018. I looked it up and found an article explaining that avocado prices were high in 2017 because of unsuccessful harvests and rising demand.
https://globaledge.msu.edu/blog/post/53425/2017-sees-a-shocking-increase-in-avocado
Looking at our data, more conventional avocados were sold than organic avocados. This makes sense: as I mentioned earlier, most restaurants and food stores buy avocados, and they will end up buying the cheaper kind, which is conventional.
Looking at the selected regions, the West region seems to sell more avocados and the West Texas/New Mexico region (WestTexNewMexico) seems to sell fewer each year. This makes sense: more avocados are consumed in California, Nevada, Washington, and so on, and comparatively few in West Texas/New Mexico.
Are your findings what you expected? Why or Why not?
My findings did not align with my hypothesis. I thought more organic avocados would be sold, but the data make sense: more conventional avocados are sold because they are cheaper.
I expected the price to increase each year. According to these data, 2017 had a higher price than the other years, and the price dropped in 2018. The 2017 peak was due to unsuccessful harvests that year.
I expected more avocados to be sold in the West, and the data confirm that. However, I expected the Southeast region to sell fewer avocados, while according to the data fewer avocados were sold in the West Texas/New Mexico region.