Midterm (Due Sunday 2/19/2023 at 11:55 pm)

Please submit your .Rmd and .html files in Sakai. If you are working together, both people should submit the files.

60 / 60 points total

The goal of the midterm project is to showcase skills that you have learned in class so far. The midterm is open note, but if you use someone else’s code, you must attribute them.

Instructions: Before You Get Started

  1. Pick a dataset. Ideally, the dataset should have around 2000 rows and both categorical and numeric covariates. Choose a dataset with more than 4 columns/variables.
  • Potential Sources for data:
  • Note that most of the Tidy Tuesday data are .csv files. There is code to load the files from csv for each of the datasets and a short description of the variables, or you can upload the .csv file into your data folder. This resource is probably the easiest to deal with.
  • You may use another dataset or your own data, but please make sure it is de-identified and has enough rows/variables.
  2. Define a research question involving at least one categorical variable. You may schedule a time with Jessica or Brad to discuss your dataset and research question, or you may message it to one of us on Slack or by email. Please do one of the two options early on. We just want to look at the data and make sure that it is appropriate for your question.

  3. You must use each of the following functions at least once:

  • mutate()
  • group_by()
  • summarize()
  • ggplot()

and at least one of the following:

  • case_when()
  • across()
  • *_join() (e.g. left_join())
  • pivot_*() (e.g. pivot_longer())
  • function()
  4. The code chunks below are guides; please add more code chunks to do what you need.

Please Note

I will take off points (-5 points for each section) if you don’t add observations and notes in your RMarkdown document. I want you to think and reason through your analysis, even if they are preliminary thoughts.

Define Your Research Question (10 points)

Define your research question below. What about the data interests you? What is a specific question you want to find out about the data?

Has there been an increase in the release of South Korean TV shows on Netflix from 2010 to 2019?

Given your question, what is your expectation about the data?

I expect that the number of South Korean TV shows released on Netflix increased from 2010 to 2019.

Loading the Data (10 points)

Load the data below and use dplyr::glimpse() or skimr::skim() on the data. You should upload the data file into the data directory.

# From = https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-04-20/readme.md
netflix <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-04-20/netflix_titles.csv')
## Rows: 7787 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (11): show_id, type, title, director, cast, country, date_added, rating,...
## dbl  (1): release_year
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dplyr::glimpse(netflix)
## Rows: 7,787
## Columns: 12
## $ show_id      <chr> "s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8", "s9", "s1…
## $ type         <chr> "TV Show", "Movie", "Movie", "Movie", "Movie", "TV Show",…
## $ title        <chr> "3%", "7:19", "23:59", "9", "21", "46", "122", "187", "70…
## $ director     <chr> NA, "Jorge Michel Grau", "Gilbert Chan", "Shane Acker", "…
## $ cast         <chr> "João Miguel, Bianca Comparato, Michel Gomes, Rodolfo Val…
## $ country      <chr> "Brazil", "Mexico", "Singapore", "United States", "United…
## $ date_added   <chr> "August 14, 2020", "December 23, 2016", "December 20, 201…
## $ release_year <dbl> 2020, 2016, 2011, 2009, 2008, 2016, 2019, 1997, 2019, 200…
## $ rating       <chr> "TV-MA", "TV-MA", "R", "PG-13", "PG-13", "TV-MA", "TV-MA"…
## $ duration     <chr> "4 Seasons", "93 min", "78 min", "80 min", "123 min", "1 …
## $ listed_in    <chr> "International TV Shows, TV Dramas, TV Sci-Fi & Fantasy",…
## $ description  <chr> "In a future where the elite inhabit an island paradise f…
# Setting CRAN repository,From Ramin Ar at https://stackoverflow.com/questions/33969024/install-packages-fails-in-knitr-document-trying-to-use-cran-without-setting-a
r = getOption("repos")
r["CRAN"] = "http://cran.us.r-project.org"
options(repos = r)

# Making sure that tidyverse is installed and that we have access to all the functions we need 
install.packages("tidyverse")
## Installing package into 'C:/Users/ahnab/AppData/Local/R/win-library/4.2'
## (as 'lib' is unspecified)
## package 'tidyverse' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\ahnab\AppData\Local\Temp\1\RtmpEvwoNS\downloaded_packages
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   1.0.1 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.5.0 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

If there are any quirks that you have to deal with (for example, NA coded as something else, or data spread across multiple tables), please make some notes here about what you need to do before you start transforming the data in the next section.

Make sure your data types are correct!
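
One type issue is visible in the glimpse() output above: date_added is stored as character. The sketch below shows one possible way to convert it, assuming the lubridate package is installed and that month names are parsed in an English locale. The toy tibble and the names toy and toy_typed are made up for illustration and are not part of the original analysis.

```r
library(dplyr)
library(lubridate)

# Toy rows that mimic the date_added format shown by glimpse() (hypothetical data)
toy <- tibble(date_added = c("August 14, 2020", "December 23, 2016", NA))

toy_typed <- toy %>%
  mutate(
    date_added = mdy(date_added),   # character -> Date ("month day, year" order)
    year_added = year(date_added)   # numeric year taken from the parsed date
  )

glimpse(toy_typed)
```

Parsing the whole date first keeps the month and day available in case a later question needs them, instead of discarding them early.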

Transforming the data (15 points)

If the data needs to be transformed in any way (values recoded, pivoted, etc), do it here. Examples include transforming a continuous variable into a categorical using case_when(), etc.

# Create a new data frame that keeps only the columns needed (drops cast, director, etc.)
netflix_modified <- netflix %>%
  select(type, country, date_added, release_year, listed_in) %>%
  arrange(type, country)
glimpse(netflix_modified)
## Rows: 7,787
## Columns: 5
## $ type         <chr> "Movie", "Movie", "Movie", "Movie", "Movie", "Movie", "Mo…
## $ country      <chr> "Argentina", "Argentina", "Argentina", "Argentina", "Arge…
## $ date_added   <chr> "May 1, 2018", "March 20, 2020", "August 25, 2016", "Febr…
## $ release_year <dbl> 2018, 2020, 2015, 2018, 2016, 2006, 2015, 2017, 1985, 201…
## $ listed_in    <chr> "Action & Adventure, Comedies, International Movies", "Do…
# Filtering the data by country and type of media 
library(dplyr)
ktv_shows <- netflix_modified %>%
  filter(type == "TV Show", country == "South Korea")

# Remove the "International TV Shows" and "Korean TV Shows" labels (and leftover "TV"/"Shows" text) from listed_in
library(janitor)
## 
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
ktv_shows <- ktv_shows %>%
  mutate(
    listed_none = str_remove_all(listed_in, "International TV Shows, |Korean TV Shows, |TV Shows, |TV |Shows|Korean ")
  )
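
The filter above keeps only rows whose country is exactly "South Korea". If the country column lists several countries in one string for co-productions (worth checking with count(country)), those rows are excluded. Whether they should count is a judgment call for the research question. The sketch below contrasts the two filters on a made-up toy tibble; the values and the names toy, exact_match and any_mention are hypothetical.

```r
library(dplyr)
library(stringr)

# Hypothetical rows: one exact match, one co-production, one movie, one missing country
toy <- tibble(
  type    = c("TV Show", "TV Show", "Movie", "TV Show"),
  country = c("South Korea", "South Korea, United States", "South Korea", NA)
)

# Exact equality: only rows where country is exactly "South Korea"
exact_match <- toy %>%
  filter(type == "TV Show", country == "South Korea")

# Pattern match: any row whose country string mentions South Korea
# (rows with a missing country are dropped by filter() in both cases)
any_mention <- toy %>%
  filter(type == "TV Show", str_detect(country, "South Korea"))

nrow(exact_match)
nrow(any_mention)
```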

Show your transformed table here. Use tools such as glimpse(), skim() or head() to illustrate your point.

# Rename listed_none to show_genre and drop the columns that are no longer needed
library(stringr)
ktv_shows_new <- ktv_shows %>%
  select(-type, -country, -listed_in, -release_year) %>%
  rename(show_genre = listed_none)

# Split date_added into month-day ("md") and year, then keep only the year as year_added
library(tidyr)
ktv_shows_year <- ktv_shows_new %>%
  tidyr::separate(date_added, c("md", "year_added"), sep = ", ") %>%
  select(-md)

# Use across() to make the string columns lowercase, then convert year_added to numeric.
# The result is only printed here, not assigned, so ktv_shows_year itself keeps year_added as character.
ktv_shows_year %>%
  mutate(
    across(.cols = c("year_added", "show_genre"),
           .fns = str_to_lower),
    year_added_final = as.numeric(year_added)
  ) %>%
  select(-year_added)
## # A tibble: 147 × 2
##    show_genre                year_added_final
##    <chr>                                <dbl>
##  1 "romantic dramas"                     2020
##  2 "romantic "                           2017
##  3 "romantic "                           2017
##  4 "romantic "                           2019
##  5 "romantic comedies"                   2020
##  6 "crime "                              2019
##  7 "stand-up comedy & talk "             2017
##  8 "crime "                              2019
##  9 "romantic "                           2019
## 10 "dramas"                              2017
## # … with 137 more rows
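
Several show_genre values in the printed tibble end with a space (for example "romantic "), because the removal pattern strips "TV " and "Shows" but leaves the space before them. A minimal sketch of one way to tidy this, using stringr::str_squish() on two made-up input strings; the names raw_genres and cleaned are hypothetical, and the pattern is copied from the mutate() step earlier in this section.

```r
library(stringr)

# Hypothetical listed_in values in the same style as the dataset
raw_genres <- c(
  "International TV Shows, Korean TV Shows, Romantic TV Shows",
  "Crime TV Shows, International TV Shows, Korean TV Shows"
)

# Same removal pattern as in the transformation step
pattern <- "International TV Shows, |Korean TV Shows, |TV Shows, |TV |Shows|Korean "

cleaned <- raw_genres %>%
  str_remove_all(pattern) %>%   # leaves "Romantic " and "Crime " with a trailing space
  str_to_lower() %>%
  str_squish()                  # trims the ends and collapses repeated spaces

cleaned
```

Without this step, "romantic " and "romantic" would be treated as different categories in any later group_by() on genre.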

Are the values what you expected for the variables? Why or Why not?

Not entirely. year_added should be numeric, but it is stored as character after being split out of date_added.

Visualizing and Summarizing the Data (15 points)

Use group_by() and summarize() to make a summary of the data here. The summary should be relevant to your research question

# Count the number of shows added in each year
tabl_ktv_year <- ktv_shows_year %>%
  group_by(year_added) %>%
  summarize(count = n())

# View() opens a viewer but does not add the table to the knitted report, so print it instead
print(tabl_ktv_year)

What are your findings about the summary? Are they what you expected?

The number of South Korean TV shows added per year does not show a steady upward trend: 2017 saw an increase to 33 shows, but in 2018 Netflix added only 12. Likewise, in 2020 Netflix added 5 fewer shows than in 2019.
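
The year-to-year swings can be made explicit with dplyr::lag(). The sketch below re-enters the counts quoted in this report by hand rather than recomputing them from the data; the 2020 value of 44 is inferred from "5 fewer shows than 2019", and the names yearly and change are hypothetical.

```r
library(dplyr)

# Counts per year as stated in this report (2020 inferred as 49 - 5)
yearly <- tibble(
  year_added = 2016:2020,
  count      = c(9, 33, 12, 49, 44)
)

# Difference from the previous year; the first year has no predecessor, so it is NA
yearly <- yearly %>%
  mutate(change = count - lag(count))

yearly
```

A column of signed changes shows at a glance which years went up and which went down.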

Make at least two plots that help you answer your question on the transformed or summarized data. Use scales and/or labels to make each plot informative.

# Bar chart of the number of shows added per year
library(ggplot2)
ggplot(tabl_ktv_year) +
  aes(x = year_added,
      y = count,
      fill = year_added) +
  geom_col() +
  labs(title = "Number of South Korean TV Shows Added Per Year to Netflix",
       x = "Year Added to Netflix",
       y = "Number of South Korean TV Shows")

# Line chart of the number of shows added per year
# group = 1 is needed because year_added is character, so ggplot would otherwise treat each year as its own group
ggplot(tabl_ktv_year) +
  aes(x = year_added,
      y = count,
      group = 1) +
  geom_line(linetype = "solid", color = "red") +
  geom_point() +
  labs(title = "Number of South Korean TV Shows Added Per Year to Netflix",
       x = "Year Added to Netflix",
       y = "Number of South Korean TV Shows")
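
The assignment also asks for scales. A possible variant of the line chart is sketched below, assuming the year is stored as a number so that group = 1 is not needed and the axis breaks can be set explicitly. The counts are re-entered by hand from the values quoted in this report (2020 inferred as 49 - 5), and the names yearly and p are hypothetical.

```r
library(ggplot2)

# Counts per year as stated in this report, with a numeric year column
yearly <- data.frame(
  year_added = 2016:2020,
  count      = c(9, 33, 12, 49, 44)
)

p <- ggplot(yearly, aes(x = year_added, y = count)) +
  geom_line(color = "red") +
  geom_point() +
  scale_x_continuous(breaks = yearly$year_added) +  # one tick per year
  scale_y_continuous(limits = c(0, NA)) +           # start the y axis at zero
  labs(title = "Number of South Korean TV Shows Added Per Year to Netflix",
       x = "Year Added to Netflix",
       y = "Number of South Korean TV Shows")

p
```

Starting the y axis at zero avoids exaggerating the differences between years.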

Final Summary (10 points)

Summarize your research question and findings below.

My research question asked whether the number of South Korean TV shows released on Netflix increased from 2010 to 2019. The data only had entries from 2016 onward, and the number of shows increased sharply from 9 in 2016 to 33 in 2017. The count then dropped to 12 in 2018 before rising to 49 in 2019.

Are your findings what you expected? Why or Why not?

My findings are not exactly what I expected, because the count dropped in 2018 instead of rising every year. Even so, there is a general upward trend in the number of South Korean TV shows released on Netflix from 2016 to 2019.