Please submit your .Rmd and .html files in Sakai. If you are working together, both people should submit the files.
The goal of the midterm project is to showcase skills that you have learned in class so far. The midterm is open note, but if you use someone else’s code, you must attribute them.
.csv file into your data folder. This resource is probably the easiest to deal with.
Define a research question involving at least one categorical variable. You may schedule a time with Jessica or Brad to discuss your data set and research question, or you may message it to one of us on Slack or by email. Please do one of the two options early on. We just want to look at the data and make sure that it is appropriate for your question.
You must use each of the following functions at least once:
mutate()
group_by()
summarize()
ggplot()
and at least one of the following:
case_when()
across()
*_join() (e.g. left_join())
pivot_*() (e.g. pivot_longer())
function()
I will take off points (-5 points for each section) if you don’t add observations and notes in your RMarkdown document. I want you to think and reason through your analysis, even if your thoughts are preliminary.
Define your research question below. What about the data interests you? What is a specific question you want to find out about the data?
Has there been an increase in the release of South Korean TV shows on Netflix from 2010 to 2019?
Given your question, what is your expectation about the data?
My expectation is that the release of South Korean TV shows on Netflix increased from 2010 to 2019.
Load the data below and use dplyr::glimpse() or skimr::skim() on the data. You should upload the data file into the data directory.
# From = https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-04-20/readme.md
netflix <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-04-20/netflix_titles.csv')
## Rows: 7787 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (11): show_id, type, title, director, cast, country, date_added, rating,...
## dbl (1): release_year
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dplyr::glimpse(netflix)
## Rows: 7,787
## Columns: 12
## $ show_id <chr> "s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8", "s9", "s1…
## $ type <chr> "TV Show", "Movie", "Movie", "Movie", "Movie", "TV Show",…
## $ title <chr> "3%", "7:19", "23:59", "9", "21", "46", "122", "187", "70…
## $ director <chr> NA, "Jorge Michel Grau", "Gilbert Chan", "Shane Acker", "…
## $ cast <chr> "João Miguel, Bianca Comparato, Michel Gomes, Rodolfo Val…
## $ country <chr> "Brazil", "Mexico", "Singapore", "United States", "United…
## $ date_added <chr> "August 14, 2020", "December 23, 2016", "December 20, 201…
## $ release_year <dbl> 2020, 2016, 2011, 2009, 2008, 2016, 2019, 1997, 2019, 200…
## $ rating <chr> "TV-MA", "TV-MA", "R", "PG-13", "PG-13", "TV-MA", "TV-MA"…
## $ duration <chr> "4 Seasons", "93 min", "78 min", "80 min", "123 min", "1 …
## $ listed_in <chr> "International TV Shows, TV Dramas, TV Sci-Fi & Fantasy",…
## $ description <chr> "In a future where the elite inhabit an island paradise f…
# Setting the CRAN repository. From Ramin Ar at https://stackoverflow.com/questions/33969024/install-packages-fails-in-knitr-document-trying-to-use-cran-without-setting-a
r <- getOption("repos")
r["CRAN"] <- "http://cran.us.r-project.org"
options(repos = r)
# Making sure that tidyverse is installed and that we have access to all the functions we need
install.packages("tidyverse")
## Installing package into 'C:/Users/ahnab/AppData/Local/R/win-library/4.2'
## (as 'lib' is unspecified)
## package 'tidyverse' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\ahnab\AppData\Local\Temp\1\RtmpEvwoNS\downloaded_packages
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.1
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
If there are any quirks that you have to deal with, such as NA coded as something else or data split across multiple tables, please make some notes here about what you need to do before you start transforming the data in the next section.
Make sure your data types are correct!
If the data needs to be transformed in any way (values recoded, pivoted, etc), do it here. Examples include transforming a continuous variable into a categorical using case_when(), etc.
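As an illustration of the case_when() suggestion above, here is a minimal sketch that bins a continuous year into a categorical variable. The toy rows, the release_era name, and the cutoffs are all assumptions made for this example and are not part of the analysis below.

```r
library(dplyr)

# Toy rows standing in for the netflix data; the values are made up for illustration.
toy <- tibble::tibble(
  title = c("a", "b", "c"),
  release_year = c(1997, 2011, 2019)
)

# release_era is a hypothetical variable and the cutoffs are arbitrary.
# case_when() checks the conditions in order and uses the first one that is TRUE.
toy_era <- toy %>%
  mutate(
    release_era = case_when(
      release_year < 2000 ~ "before 2000",
      release_year < 2015 ~ "2000 to 2014",
      TRUE ~ "2015 or later"
    )
  )

toy_era
```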
# Creating a new data frame that keeps only the columns needed for the analysis (drops cast, director, etc.)
netflix_modified <- netflix %>%
  select(type, country, date_added, release_year, listed_in) %>%
  arrange(type, country)
glimpse(netflix_modified)
## Rows: 7,787
## Columns: 5
## $ type <chr> "Movie", "Movie", "Movie", "Movie", "Movie", "Movie", "Mo…
## $ country <chr> "Argentina", "Argentina", "Argentina", "Argentina", "Arge…
## $ date_added <chr> "May 1, 2018", "March 20, 2020", "August 25, 2016", "Febr…
## $ release_year <dbl> 2018, 2020, 2015, 2018, 2016, 2006, 2015, 2017, 1985, 201…
## $ listed_in <chr> "Action & Adventure, Comedies, International Movies", "Do…
# Filtering the data by country and type of media
library(dplyr)
ktv_shows <- netflix_modified %>%
  filter(type == "TV Show", country == "South Korea")
# Remove "International TV Shows", "Korean TV Shows", and similar labels from listed_in
library(janitor)
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
ktv_shows <- ktv_shows %>%
  mutate(
    listed_none = str_remove_all(listed_in, "International TV Shows, |Korean TV Shows, |TV Shows, |TV |Shows|Korean ")
  )
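The filter above keeps only rows whose country value is exactly "South Korea". The country column can hold several countries in one string, so co-productions are left out. The sketch below compares the exact match with a partial match using stringr::str_detect(). The toy rows are hypothetical, and whether co-productions belong in the analysis is a judgment call rather than something the data settles.

```r
library(dplyr)
library(stringr)

# Toy rows; the titles are hypothetical and only the column layout mirrors netflix.
toy <- tibble::tibble(
  title = c("show a", "show b", "show c", "film d"),
  type = c("TV Show", "TV Show", "TV Show", "Movie"),
  country = c("South Korea", "South Korea, United States", "Japan", "South Korea")
)

# Exact match, as in the chunk above: keeps only "show a".
exact <- toy %>%
  filter(type == "TV Show", country == "South Korea")

# Partial match: also keeps co-productions such as "show b".
partial <- toy %>%
  filter(type == "TV Show", str_detect(country, "South Korea"))
```

Rows with a missing country are dropped by both versions, because filter() discards rows where the condition is NA.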
Show your transformed table here. Use tools such as glimpse(), skim() or head() to illustrate your point.
# Rename listed_none to show_genre and drop the columns that are no longer needed
library(stringr)
ktv_shows_new <- ktv_shows %>%
  ungroup() %>%
  select(-type, -country, -listed_in, -release_year) %>%
  rename(
    show_genre = listed_none
  )
# Separate month, date, and year from date_added, remove month/date, and rename year_added
library(tidyr)
ktv_shows_year <- ktv_shows_new %>%
  tidyr::separate(date_added, c("md", "year_added"), sep = ", ") %>%
  select(-md)
# Applying across() to make all strings lowercase, then converting year_added from character to numeric.
# The result is printed but not assigned, so ktv_shows_year itself still stores year_added as character.
ktv_shows_year %>%
  mutate(
    across(.cols = c("year_added", "show_genre"),
           .fns = str_to_lower),
    year_added_final = as.numeric(year_added)
  ) %>%
  select(-year_added)
## # A tibble: 147 × 2
## show_genre year_added_final
## <chr> <dbl>
## 1 "romantic dramas" 2020
## 2 "romantic " 2017
## 3 "romantic " 2017
## 4 "romantic " 2019
## 5 "romantic comedies" 2020
## 6 "crime " 2019
## 7 "stand-up comedy & talk " 2017
## 8 "crime " 2019
## 9 "romantic " 2019
## 10 "dramas" 2017
## # … with 137 more rows
Are the values what you expected for the variables? Why or why not?
year_added is supposed to be numeric, but it is stored as character because it was split out of the date_added string.
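The printed output above also shows genre labels with trailing spaces, such as "romantic ". A minimal sketch of one way to save a cleaned table is shown here, assuming toy rows shaped like ktv_shows_year. The toy values and the toy_clean name are assumptions for illustration.

```r
library(dplyr)
library(stringr)

# Toy rows shaped like ktv_shows_year; the values are illustrative.
toy_year <- tibble::tibble(
  show_genre = c("Romantic Dramas", "Romantic ", "Crime "),
  year_added = c("2020", "2017", "2019")
)

# Lowercase the genre and trim stray whitespace with str_squish(),
# convert the year to numeric, and assign the result so it is kept.
toy_clean <- toy_year %>%
  mutate(
    across(.cols = show_genre, .fns = ~ str_squish(str_to_lower(.x))),
    year_added = as.numeric(year_added)
  )

toy_clean
```

Assigning the result means later steps such as group_by() and ggplot() would see a numeric year instead of a character one.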
Use group_by() and summarize() to make a summary of the data here. The summary should be relevant to your research question.
# Make a summary of the data and create a table
tabl_ktv_year <- ktv_shows_year %>%
  group_by(year_added) %>%
  summarize(count = n())
# Print the table so it appears in the knitted report; View() only opens an interactive viewer
tabl_ktv_year
What are your findings about the summary? Are they what you expected?
The number of South Korean TV shows does not show a steady upward trend: additions rose to 33 shows in 2017 but fell to 12 in 2018. Likewise, Netflix added 5 fewer shows in 2020 than in 2019.
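The rises and falls described here can be made explicit by computing the change from one year to the next. The sketch below assumes the yearly counts quoted in this report (9, 33, 12, and 49 for 2016 to 2019) and derives the 2020 value as 49 - 5. The yearly and yearly_change names are introduced for this example only.

```r
library(dplyr)

# Counts as quoted in the write-up; the 2020 value is derived as 49 - 5.
yearly <- tibble::tibble(
  year_added = c(2016, 2017, 2018, 2019, 2020),
  count = c(9, 33, 12, 49, 44)
)

# lag() shifts the counts down by one row, so count - lag(count)
# is the change compared with the previous year (NA for the first year).
yearly_change <- yearly %>%
  arrange(year_added) %>%
  mutate(change = count - lag(count))

yearly_change
```

The change column comes out as NA, 24, -21, 37, and -5, which matches the pattern of alternating gains and drops described in the findings.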
Make at least two plots that help you answer your question on the transformed or summarized data. Use scales and/or labels to make each plot informative.
# Creating a bar graph that shows the number of shows added per year
library(ggplot2)
ggplot(tabl_ktv_year) +
  aes(x = year_added,
      y = count,
      fill = year_added) +
  geom_col() +
  labs(title = "Number of South Korean TV Shows Added Per Year to Netflix",
       x = "Year Added to Netflix",
       y = "Number of South Korean TV Shows")
# Creating a line graph that shows the number of shows added per year
ggplot(tabl_ktv_year) +
  aes(x = year_added,
      y = count,
      group = 1) +
  geom_line(linetype = "solid", color = "red") +
  geom_point() +
  labs(title = "Number of South Korean TV Shows Added Per Year to Netflix",
       x = "Year Added to Netflix",
       y = "Number of South Korean TV Shows")
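The line plot above needs group = 1 because year_added is stored as character. A sketch of the same plot with a numeric year and an explicit x scale is given below. It assumes the yearly counts quoted in this report, with 2020 derived as 49 - 5, and the yearly and p names are hypothetical.

```r
library(ggplot2)

# Counts as quoted in the write-up; the 2020 value is derived as 49 - 5.
yearly <- data.frame(
  year_added = c(2016, 2017, 2018, 2019, 2020),
  count = c(9, 33, 12, 49, 44)
)

# With a numeric x variable, geom_line() connects the points without group = 1,
# and scale_x_continuous() places one axis break on each year.
p <- ggplot(yearly) +
  aes(x = year_added, y = count) +
  geom_line(color = "red") +
  geom_point() +
  scale_x_continuous(breaks = yearly$year_added) +
  labs(title = "Number of South Korean TV Shows Added Per Year to Netflix",
       x = "Year Added to Netflix",
       y = "Number of South Korean TV Shows")

p
```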
Summarize your research question and findings below.
My research question asked whether the number of South Korean TV shows released on Netflix increased from 2010 to 2019. The data only had information from 2016 onward. The number of shows rose sharply from 9 in 2016 to 33 in 2017, dropped to 12 in 2018, and then increased to 49 in 2019.
Are your findings what you expected? Why or why not?
My findings are not what I expected, but I did notice a general upward trend in the number of South Korean TV shows released on Netflix from 2016 to 2019.