Midterm (Due Sunday 2/19/2023 at 11:55 pm)

Please submit your .Rmd and .html files in Sakai. If you are working together, both people should submit the files.

60 / 60 points total

The goal of the midterm project is to showcase skills that you have learned in class so far. The midterm is open note, but if you use someone else’s code, you must attribute them.

Instructions: Before you get Started

  1. Pick a dataset. Ideally, the dataset should be around 2000 rows, and should have both categorical and numeric covariates. Choose a dataset with more than 4 columns/variables.
  • Potential Sources for data:
  • Note that most of the Tidy Tuesday data are .csv files. There is code to load the files from csv for each of the datasets and a short description of the variables, or you can upload the .csv file into your data folder. This resource is probably the easiest to deatl with.
  • You may use another dataset or your own data, but please make sure it is de-identified and has enough rows/variables.
  1. Define a research question, involving at least one categorical variable. You may schedule a time with Jessica or Brad to discuss your data set and research question, or you may message it to one of us in slack or email. Please do one of the two options pretty early on. We just want to look at the data and make sure that it is appropriate for your question.

  2. You must use each of the following functions at least once:

  • mutate()
  • group_by()
  • summarize()
  • ggplot()

and at least one of the following:

  • case_when()
  • across()
  • *_join() (i.e. left_join())
  • pivot_*() (i.e. pivot_longer())
  • function()
  1. The code chunks below are guides, please add more code chunks to do what you need.

  2. If you do not want your final project posted on the public website, please let Jessica know. We can also keep it anonymous if you’d like to remove your name from the Rmd and html, or use a pseudonym.

You may remove these instructions from your final Rmd if you like

Working Together

If you’d like to work together in pairs, that is encouraged, but you must divide the work equitably and you must note who worked on what. This is probably easiest as notes in the text. Please let Brad or Jessica know that you’ll be working together.

No acknowledgements of contributions = -10 points overall.

Please Note

I will take off points (-5 points for each section) if you don’t add observations and notes in your RMarkdown document. I want you to think and reason through your analysis, even if they are preliminary thoughts.

Define Your Research Question (10 points)

Define your research question below. What about the data interests you? What is a specific question you want to find out about the data?

  • For my research, I’ve decided to use billboard’s top 100 songs from the tidytuesdayr package. I’m hoping to see which genre of music was most popular in each decade and potentially see the rise and fall of pop music. I’m drawn into this data set because music has always been a big part of my life and I enjoy listening to a variety of music. I’m also interested in seeing how music has change or grown, so I think it’ll be fun to see how music taste has changed over the years.

Given your question, what is your expectation about the data?

  • I’m expecting to see a mixture of genres be popular in the 50’s & 60’s, then a rise in disco music in the 70’s and 80’s. Finally R&B will become popular in the 90’s while rap being popular in the 2000’s. We will then finally see a rise in pop music in 2010 to present. I also believe we will see a gradual rise in pop music throughout the years.

Loading the Data (10 points)

Load the data below and use dplyr::glimpse() or skimr::skim() on the data. You should upload the data file into the data directory.

# Get the Data

# Read in with tidytuesdayR package 
# Install from CRAN via: install.packages("tidytuesdayR")
# This loads the readme and all the datasets for the week of interest

# Either ISO-8601 date or year/week works!


billboard <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-09-14/billboard.csv')
## Rows: 327895 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): url, week_id, song, performer, song_id
## dbl (5): week_position, instance, previous_week_position, peak_position, wee...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
audio <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-09-14/audio_features.csv')
## Rows: 29503 Columns: 22
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): song_id, performer, song, spotify_genre, spotify_track_id, spotify...
## dbl (14): spotify_track_duration_ms, danceability, energy, key, loudness, mo...
## lgl  (1): spotify_track_explicit
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(billboard)
## Rows: 327,895
## Columns: 10
## $ url                    <chr> "http://www.billboard.com/charts/hot-100/1965-0…
## $ week_id                <chr> "7/17/1965", "7/24/1965", "7/31/1965", "8/7/196…
## $ week_position          <dbl> 34, 22, 14, 10, 8, 8, 14, 36, 97, 90, 97, 97, 9…
## $ song                   <chr> "Don't Just Stand There", "Don't Just Stand The…
## $ performer              <chr> "Patty Duke", "Patty Duke", "Patty Duke", "Patt…
## $ song_id                <chr> "Don't Just Stand TherePatty Duke", "Don't Just…
## $ instance               <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ previous_week_position <dbl> 45, 34, 22, 14, 10, 8, 8, 14, NA, 97, 90, 97, 9…
## $ peak_position          <dbl> 34, 22, 14, 10, 8, 8, 8, 8, 97, 90, 90, 90, 90,…
## $ weeks_on_chart         <dbl> 4, 5, 6, 7, 8, 9, 10, 11, 1, 2, 3, 4, 5, 6, 1, …
skim(billboard)
Data summary
Name billboard
Number of rows 327895
Number of columns 10
_______________________
Column type frequency:
character 5
numeric 5
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
url 0 1 50 51 0 3279 0
week_id 0 1 8 10 0 3279 0
song 0 1 1 75 0 24360 0
performer 0 1 1 113 0 10061 0
song_id 0 1 5 123 0 29389 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
week_position 0 1.0 50.50 28.87 1 25.5 50 75 100 ▇▇▇▇▇
instance 0 1.0 1.07 0.33 1 1.0 1 1 10 ▇▁▁▁▁
previous_week_position 31954 0.9 47.60 28.06 1 23.0 47 72 100 ▇▇▇▇▆
peak_position 0 1.0 41.36 29.54 1 14.0 39 66 100 ▇▅▅▅▃
weeks_on_chart 0 1.0 9.15 7.59 1 4.0 7 13 87 ▇▁▁▁▁
glimpse(audio)
## Rows: 29,503
## Columns: 22
## $ song_id                   <chr> "-twistin'-White Silver SandsBill Black's Co…
## $ performer                 <chr> "Bill Black's Combo", "Augie Rios", "Andy Wi…
## $ song                      <chr> "-twistin'-White Silver Sands", "¿Dònde Està…
## $ spotify_genre             <chr> "[]", "['novelty']", "['adult standards', 'b…
## $ spotify_track_id          <chr> NA, NA, "3tvqPPpXyIgKrm4PR9HCf0", "1fHHq3qHU…
## $ spotify_track_preview_url <chr> NA, NA, "https://p.scdn.co/mp3-preview/cef48…
## $ spotify_track_duration_ms <dbl> NA, NA, 166106, 172066, 211066, 208186, 2055…
## $ spotify_track_explicit    <lgl> NA, NA, FALSE, FALSE, FALSE, FALSE, TRUE, FA…
## $ spotify_track_album       <chr> NA, NA, "The Essential Andy Williams", "Comp…
## $ danceability              <dbl> NA, NA, 0.154, 0.588, 0.759, 0.613, NA, 0.64…
## $ energy                    <dbl> NA, NA, 0.185, 0.672, 0.699, 0.764, NA, 0.68…
## $ key                       <dbl> NA, NA, 5, 11, 0, 2, NA, 2, NA, NA, 7, NA, 1…
## $ loudness                  <dbl> NA, NA, -14.063, -17.278, -5.745, -6.509, NA…
## $ mode                      <dbl> NA, NA, 1, 0, 0, 1, NA, 0, NA, NA, 1, NA, 0,…
## $ speechiness               <dbl> NA, NA, 0.0315, 0.0361, 0.0307, 0.1360, NA, …
## $ acousticness              <dbl> NA, NA, 0.91100, 0.00256, 0.20200, 0.05270, …
## $ instrumentalness          <dbl> NA, NA, 2.67e-04, 7.45e-01, 1.31e-04, 0.00e+…
## $ liveness                  <dbl> NA, NA, 0.1120, 0.1450, 0.4430, 0.1970, NA, …
## $ valence                   <dbl> NA, NA, 0.150, 0.801, 0.907, 0.417, NA, 0.95…
## $ tempo                     <dbl> NA, NA, 83.969, 121.962, 92.960, 160.015, NA…
## $ time_signature            <dbl> NA, NA, 4, 4, 4, 4, NA, 4, NA, NA, 4, NA, 4,…
## $ spotify_track_popularity  <dbl> NA, NA, 38, 11, 77, 73, 61, 40, NA, NA, 31, …
skim(audio)
Data summary
Name audio
Number of rows 29503
Number of columns 22
_______________________
Column type frequency:
character 7
logical 1
numeric 14
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
song_id 0 1.00 5 123 0 29386 0
performer 0 1.00 1 113 0 10059 0
song 0 1.00 1 75 0 24359 0
spotify_genre 1600 0.95 2 319 0 3375 0
spotify_track_id 5106 0.83 22 22 0 24177 0
spotify_track_preview_url 15012 0.49 107 107 0 14371 0
spotify_track_album 5112 0.83 1 2688 0 13125 0

Variable type: logical

skim_variable n_missing complete_rate mean count
spotify_track_explicit 5106 0.83 0.12 FAL: 21449, TRU: 2948

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
spotify_track_duration_ms 5106 0.83 220684.26 67746.71 29688.00 175053.00 214850.00 253253.00 3079157.00 ▇▁▁▁▁
danceability 5169 0.82 0.60 0.15 0.00 0.50 0.61 0.71 0.99 ▁▂▆▇▂
energy 5169 0.82 0.62 0.20 0.00 0.48 0.63 0.78 1.00 ▁▃▆▇▅
key 5169 0.82 5.23 3.56 0.00 2.00 5.00 8.00 11.00 ▇▃▃▅▆
loudness 5169 0.82 -8.66 3.60 -28.03 -11.03 -8.20 -5.86 2.29 ▁▁▅▇▁
mode 5169 0.82 0.73 0.45 0.00 0.00 1.00 1.00 1.00 ▃▁▁▁▇
speechiness 5169 0.82 0.07 0.08 0.00 0.03 0.04 0.07 0.95 ▇▁▁▁▁
acousticness 5169 0.82 0.29 0.28 0.00 0.05 0.20 0.51 0.99 ▇▃▂▂▁
instrumentalness 5169 0.82 0.03 0.14 0.00 0.00 0.00 0.00 0.98 ▇▁▁▁▁
liveness 5169 0.82 0.19 0.16 0.01 0.09 0.13 0.25 1.00 ▇▂▁▁▁
valence 5169 0.82 0.60 0.24 0.00 0.42 0.62 0.80 0.99 ▂▅▇▇▇
tempo 5169 0.82 120.28 28.05 0.00 99.06 118.91 136.48 241.01 ▁▃▇▂▁
time_signature 5169 0.82 3.93 0.32 0.00 4.00 4.00 4.00 5.00 ▁▁▁▇▁
spotify_track_popularity 5106 0.83 41.22 22.48 0.00 23.00 43.00 59.00 100.00 ▆▆▇▆▁
audio_songs <- audio %>% select(song, performer, spotify_genre) #created a new dataset and selecting only the ones that I need for my research question
billboard_list <- billboard %>% select(week_id, song, performer) #created a new dataset and selecting only the ones that I need for my research question

If there are any quirks that you have to deal with NA coded as something else, or it is multiple tables, please make some notes here about what you need to do before you start transforming the data in the next section.

  • There was a lot of data that I didn’t really need to approach my analysis. I’ve selected only ones that were important to my research question to help make my data more cleaner.

Make sure your data types are correct!

Transforming the data (15 points)

If the data needs to be transformed in any way (values recoded, pivoted, etc), do it here. Examples include transforming a continuous variable into a categorical using case_when(), etc.

# this keeps the data frame on the left and matches record on the right data frame. I've chose to left_join with billboard list first as I wanted to keep my dates, which was the most important for my research question. 
song_list <- 
  billboard_list %>%
  left_join(audio_songs)
## Joining, by = c("song", "performer")
song_list <-na.omit(song_list[song_list$spotify_genre != "[]",]) #remove songs that doesn't contain any genre 
song_list <- cSplit(song_list, 'spotify_genre', ',', type.convert = FALSE) #split multiple listed genres into separate ones, this is done by dividing anything that has a "," between. Since each genre has a comma between them, this will allow me to separate multiple genres apart. 
song_list <- song_list %>%
  mutate(dates = as.Date(week_id, format = "%m/%d/%Y")) #I wanted to convert the week_id into dates so eventually I can group my song genres by their corresponding dates. This allows me to create a column of dates that will actually be in "dates" rather than character.

song_list_years <- song_list %>%  # now that I have dates, I mutated another new column where I can separate each song genre by their decade, going from the 50's to the present. 
  mutate(
    "song year" = case_when(
    dates < "1960-01-01" ~ "50's",
    dates >= "1960-01-01" & dates < "1970-01-01" ~ "60's",
    dates >= "1970-01-01" & dates < "1980-01-01" ~ "70's",
    dates >= "1980-01-01" & dates < "1990-01-01" ~ "80's",
    dates >= "1990-01-01" & dates < "2000-01-01" ~ "90's",
    dates >= "2000-01-01" & dates < "2010-01-01" ~ "00's",
    dates >= "2010-01-01" & dates < "2020-01-01" ~ "2010's",
    dates >= "2020-01-01" ~ "2020's"
    )
  ) 

song_list_years$`song year` <- factor(song_list_years$`song year`, levels = c("50's", "60's", "70's", "80's", "90's", "00's", "2010's", "2020's")) # After creating my decades, I've decided to factor each decade so I'm able to organize my data easier. 
song_list_years <- song_list_years[!duplicated(song_list_years$song),] #I wanted to remove any duplicate song because my goal is to just see which individual songs were played mainly in each decade. I didn't want to have duplicate because some songs were listed multiple times which I felt wouldn't have been good indicators if the same song were always played.

song_list_years <- song_list_years %>%  #since there is an abundance of genres, I've converted some of these genres into bigger names for simplicity
  mutate(
    "genre" = case_when(
      grepl("blues", spotify_genre_01) ~ "Blues",
      grepl("jazz|bebop|post-bop|boogie-woogie|bossa nova", spotify_genre_01) ~ "Jazz",
      grepl("electro|edm|house|techno|freestyle|disco|dub|big room", spotify_genre_01) ~ "Electronic",
      grepl("rock|metal|grunge|invasion|punk|mellow gold|alt z|emo|permanent wave", spotify_genre_01) ~ "Rock",
      grepl("country|honky|cowboy western|bluegrass|red dirt|nashville", spotify_genre_01) ~ "Country",
      grepl("rhythm and blues|rb|doowop|funk|soul|gospel|r&b|quiet storm|motown", spotify_genre_01) ~ "R&B",
      grepl("hip hop|rap|trap|bounce|crunk|New jack swing|miami bass", spotify_genre_01) ~ "Hip Hop",
      grepl("pop|standard", spotify_genre_01) ~ "Pop",
      grepl("classical|orchestra|instrument", spotify_genre_01) ~ "Classical",
      TRUE ~ "Others"
    )
)

Genre_Year <- song_list_years %>% select(`song year`,genre) #after tweaking my dataset, I've decided to keep the the main things I needed to proceed with my analysis and these are the ones I mainly needed. 

Bonus points (5 points) for datasets that require merging of tables, but only if you reason through whether you should use left_join, inner_join, or right_join on these tables. No credit will be provided if you don’t.

  • Within my data set, I did end up using a left_join. I could’ve used right_join as well since I’ve narrowed my data set small enough and it would’ve done the same thing, i just had to make sure my data frame’s were in opposite position. Left_joins just retains everything from the left data frame and then only match records from the other. I didn’t want to use inner_join as this only spits out matching records, so I would’ve lost my week_id and genre columns. I’ve ended up choosing to left join with my billboard_list data frame as I wanted to ensure I kept my dates.

Show your transformed table here. Use tools such as glimpse(), skim() or head() to illustrate your point.

glimpse(Genre_Year)
## Rows: 21,214
## Columns: 2
## $ `song year` <fct> 60's, 90's, 70's, 70's, 80's, 80's, 00's, 90's, 70's, 80's…
## $ genre       <chr> "Pop", "R&B", "Pop", "Pop", "Rock", "Rock", "Pop", "Countr…
head(Genre_Year)
##    song year genre
## 1:      60's   Pop
## 2:      90's   R&B
## 3:      70's   Pop
## 4:      70's   Pop
## 5:      80's  Rock
## 6:      80's  Rock

Are the values what you expected for the variables? Why or Why not?

  • After transforming the data, this wasn’t something I was expecting. I managed to obtain most of the information I needed but was still hoping to grab a count of all genres played for each decade. I removed most of the information in our data as I felt they were irrelevant to our research question. I’ve also had to adjust the genre as well, since there were a variety of them, I’ve only selected the first one listed as they’ll be considered the main genre for that song. I’ve also changed most of the genres into the bigger categories in music to simplify things since music genre is always evolving, an example would be categorizing rap with hip hop.

Visualizing and Summarizing the Data (15 points)

Use group_by() and summarize() to make a summary of the data here. The summary should be relevant to your research question

Genre_Year %>% group_by(`song year`, genre) %>% summarise(n=n())
## `summarise()` has grouped output by 'song year'. You can override using the
## `.groups` argument.
## # A tibble: 72 × 3
## # Groups:   song year [8]
##    `song year` genre         n
##    <fct>       <chr>     <int>
##  1 50's        Blues        34
##  2 50's        Classical     6
##  3 50's        Country      49
##  4 50's        Hip Hop       7
##  5 50's        Jazz          4
##  6 50's        Others       75
##  7 50's        Pop         348
##  8 50's        R&B           7
##  9 50's        Rock         26
## 10 60's        Blues       179
## # … with 62 more rows
Genre_Year %>% tabyl(genre, `song year`) %>% adorn_totals() %>% gt()
genre 50's 60's 70's 80's 90's 00's 2010's 2020's
Blues 34 179 109 27 4 6 2 0
Classical 6 42 17 11 2 0 0 0
Country 49 178 165 75 191 579 604 85
Electronic 0 32 236 269 172 9 72 5
Hip Hop 7 38 16 33 392 493 1166 222
Jazz 4 71 26 27 10 0 0 0
Others 75 469 272 197 407 269 358 40
Pop 348 2610 1300 757 540 873 1068 115
R&B 7 570 676 163 175 109 154 17
Rock 26 360 987 1472 572 377 172 14
Total 556 4549 3804 3031 2465 2715 3596 498
Genre_Year_Count <- Genre_Year %>% group_by(`song year`, genre) %>% tally()

What are your findings about the summary? Are they what you expected?

  • After grouping our data, I was able to do a count to help me distinguish how many genres were played during each decade. After doing this, this is what I’ve expected for my data, however, the values weren’t what I expected though. i was surprised to see the popularity in rock music. I was also shocked to see that R&B and hip hop were lower than I’ve expected in the 90’s and 2000’s but shocked to see hip hop have a significant increase in 2010. I knew pop music was big in our country but didn’t expect to drop in the early 2000’s as well. However, we do see that pop music has a slow upcoming growth as we continue through the years but it did take a significant dip after the 60’s.

Make at least two plots that help you answer your question on the transformed or summarized data. Use scales and/or labels to make each plot informative.

ggplot(Genre_Year_Count, aes(x = `song year`,
                             y = n,
                             fill = genre)) +
geom_bar(stat = "identity") +

  labs(title = "Popular genre of music in years",
       x = "Years (in decades)",
       y = "Number of songs",
       fill = "Music Genre")

ggplot(Genre_Year, aes(x = `song year`)) +
geom_bar(position = "dodge", fill = "Blue") +
  facet_wrap(~genre, scales = "free", ncol = 5) +
  scale_y_continuous(limit=c(0,3000)) + 
  theme(axis.text.x=element_text(angle = -90, hjust = 0)) + 
  
      labs(title = "Individual genre popularity by years",
       x = "Years (in decades)",
       y = "Number of songs")

ggplot(Genre_Year_Count, aes(x = `song year`,
                       y = n,
                       group = 1
                       )) +
  geom_line(color = "red") +
  facet_wrap(~genre, scales = "free", ncol = 5) +
  scale_y_continuous(limit=c(0,2800)) + 
  theme(axis.text.x=element_text(angle = -90, hjust = 0)) +
  
  labs(title = "Individual genre popularity by years",
       x = "Years (in decades)",
       y = "Number of songs")

ggplot(Genre_Year_Count) +
  aes(x=`song year`, y=genre, fill=(n)) +
  geom_tile() +
  scale_fill_gradient(low = "purple", high = "red") + 
  
    labs(title = "Individual genre popularity by years",
       x = "Years (in decades)",
       y = "Popularity of genre",
       fill = "Song count")

Final Summary (10 points)

Summarize your research question and findings below.

  • My research question was to find out what was the top genre of each decade according to Billboard’s Top 100. After my analysis, we are able to visualize and see which genre was most popular in each decade. With our table and graphs, we are also able to see the rise and fall of pop music. We see that rock music was becoming popular in the 70’s and reaching it’s peak in the 80’s. This would make sense as those years had really big rock groups, such as Aerosmith, Queens, and Metallica to name a few. We can also see pop music being really popular in the 60’s and start to significantly drop afterwards due to the popularity of rock music. Pop music eventually climbs back up but unfortunately, it seems like hip hop music is dominating our more recent years.

Are your findings what you expected? Why or Why not?

  • I’m quite surprised with my findings and it’s not what i was expecting. I was mainly expecting pop music to remain consistent and gradually grow over time but it turns out it was hip hop that actually had that trend. I was quite surprised that pop music dipped very low and didn’t expect rock music to be superior in the 80’s, I would’ve thought disco, which would’ve been categorized as electronic, to be popular during that time. Another interesting thing was seeing pop music wasn’t the popular genre for our recent years and it was actually hip hop. Pop did come quite close but seems like hip hop was been dominating the more recent years. Pop music seems to be the most popular genre in our timeline as they have extraordinary numbers. Overall, this data was taken from Billboards Top 100 which mean there could be some bias. This is because billboard ranking is determined by sales, radio play and streaming which can be heavily influenced by many factors.